Feedwhip suffered from stability issues the past few weeks, but I think I’ve got them straightened out. Enjoy.
Holy moly, I just pushed out an update to Feedwhip.
It’s certainly been a while. Feedwhip has been pretty much running itself for the past six months while I focus my energies on other exciting projects (like Picnik and my two-year-old daughter). But I had some time and inspiration today so I cranked up the ol’ text editor and fixed a few bugs.
I also installed some tracking code so I can see just how many RSS pages I’m serving on a day-to-day basis. My guess is that it’s at least an order of magnitude more than regular HTML pages. It’s a shame that AdSense doesn’t publicly support RSS feeds.
You can now specify one of four different kinds of emails for your Feedwhip notifications. They are:
- Just an alert. The subject tells you the name of the feed that changed, and the email body contains just a link to the Feedwhip page with all the changes. This email format is perfect for sending as a text message if you’ve got an email-to-SMS gateway for your mobile phone.
- Just titles. Only the titles for the changes are sent, with links to the full changes back on Feedwhip.
- Short descriptions. Both titles and a truncated version of the change are sent. This is perfect if you typically scan through a list of changes looking for something specific.
- Full descriptions. Titles and the complete text of the change are included in the email.
The default is “short descriptions”, but it’s easy to change to whatever you like. First of all, there’s a default delivery type for each of your email addresses. Just click on the “account settings” link to set it up.
It’s also possible to handle one feed differently from all the others. Click on the “delivery” link when you’re looking at a feed’s page, and you’ll be able to specify delivery options just for that one feed.
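The fallback order described above (a per-feed setting wins over the email address default, which wins over the site-wide “short descriptions” default) can be sketched in a few lines of PHP. This is only an illustration: the function name, the constant names, and the four string keys are made up for the example and are not Feedwhip’s actual code.

```php
<?php
// Hypothetical names throughout; the post does not show Feedwhip's real code.
const DELIVERY_TYPES = ['alert', 'titles', 'short', 'full'];
const SITE_DEFAULT = 'short'; // "short descriptions" is the stated default

/**
 * Pick the delivery type for one notification.
 * Order of precedence: per-feed override, then the address default, then the site default.
 */
function resolveDeliveryType(?string $feedOverride, ?string $addressDefault): string
{
    foreach ([$feedOverride, $addressDefault] as $candidate) {
        // Ignore unset values and anything that is not one of the four known types.
        if ($candidate !== null && in_array($candidate, DELIVERY_TYPES, true)) {
            return $candidate;
        }
    }
    return SITE_DEFAULT;
}
```

For example, a feed set to 'alert' on an address that defaults to 'full' would resolve to 'alert', while a feed with no override would get 'full'.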
As always, I’d love to hear what you think of this change.
Walkers were randomly dying on some specific feeds. After wandering around with things for a bit, I noticed that the HTML for those feeds had some really, really stupid nested tags — hundreds and hundreds of nested <font> tags, for example. Anyway, it turns out that PHP isn’t very good at recursion and when a website hit somewhere around 500 layers of nested HTML tags, PHP threw in the towel on behalf of my html-simplification routines.
It’s becoming clearer and clearer: PHP is simply not a suitable choice for creating stable back-end processes.
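One common way around deep recursion is to walk the markup with an explicit stack instead of letting function calls nest. The sketch below is not the html-simplification routine from this post (that code isn’t shown); it is a hypothetical helper that only measures nesting depth, and it takes shortcuts such as ignoring comments and treating unclosed tags loosely.

```php
<?php
// Hypothetical helper: finds the deepest tag nesting in an HTML fragment
// using an explicit stack, so depth is limited by memory rather than call depth.
function maxTagDepth(string $html): int
{
    // Capture: optional leading slash, tag name, optional trailing slash.
    preg_match_all('~<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>~', $html, $tokens, PREG_SET_ORDER);

    $stack = []; // names of the tags that are currently open
    $max = 0;
    foreach ($tokens as $token) {
        [, $closing, $name, $selfClosing] = $token;
        $name = strtolower($name);

        if ($selfClosing === '/') {
            continue; // <br/> opens and closes in one go
        }
        if ($closing === '/') {
            // Pop back to the matching open tag; stray closing tags are ignored.
            if (in_array($name, $stack, true)) {
                while (array_pop($stack) !== $name) {
                    // discard inner tags that were never closed
                }
            }
            continue;
        }
        $stack[] = $name;
        $max = max($max, count($stack));
    }
    return $max;
}
```

A page with 600 nested font tags is handled in a single flat loop here, which is the point of the technique.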
Feedwhip has finally been kicked out of its parents’ basement and has a nice new home right on the beach. The raw server from ServerBeach was nicely pre-configured with just what I needed, and setting up Feedwhip code was only slightly painful. The server has now been running happily for an hour or so and so far, so good. Mail might still be a little screwy, but we’ll see…
Moving the complete database over the ADSL line would have taken forever, so I didn’t copy any feed histories to the new server. Any website changes which occurred between around 11am and 4pm pacific time today will not have been noticed. Every feed now has a fresh snapshot, though, and you should start getting notifications and RSS updates as they happen.
Hopefully everyone will notice a big improvement in the performance of the website. If you do (or if you don’t) I’d love to hear about it.
I’m happy to announce that I’ll be joining the incredibly talented team over at Picnik on Monday. Picnik makes an outstanding online photo editor that towers over its competition and is an amazing example of just how good web applications can be. Walt Mossberg agrees.
FWIW, Picnik hit just about every workplace criterion I was looking for (see my blog post about jobs).
I am really, really excited to be joining Picnik’s team. Working from home for the past three years has been wonderful, but lately I feel like I’ve started to stagnate. It’s time to reboot my career.
For all you Feedwhip fans out there, don’t worry: it isn’t going anywhere. Not away, and, unfortunately, for the time being, not forward. I’ll be spending the next few days setting up a hosted server, and then I’ll be spending minimal time on Feedwhip for at least the next few months while I get settled into my new job. Things will probably be quiet on this blog, too. Every now and then, though, I might come across a few spare hours to throw in some of your most-requested features.
In the meantime, I’ll see you over at Picnik!
I’ve (finally) signed Feedwhip up for a dedicated hosted server. We’re moving out of the basement! The increased bandwidth and beefier server will both improve the performance of the web front end and increase the capacity of the website-checking back end. Unfortunately, this server comes with a hefty monthly price and the advertising on Feedwhip doesn’t even come close to paying the server costs. But don’t worry, I won’t be pulling the plug any time soon. Anyone want to donate some cash to a good cause?
Today I pushed out an update that lets you see what changes you’re NOT being shown. The idea is that this will help you to tweak your filter settings to get them just the way you like.
And just like that, I took out the automatic tagging code. It was kind of a burden on the server, but mostly I had gotten enough information out of it. I just wanted to seed the tagging system with some (more or less) useful data — now that the system is populated, user-generated tags can start to take over.
There’s not a lot of immediate benefit to users who tag their feeds, other than making it easier to direct your explorations of other feeds. I’m considering adding some more tag-centric features like being able to group your feeds by tag. Anyone have other ideas that will make tags even more useful?
I dropped in support for automatic tagging yesterday. Or was that the day before? Anyway, the code automatically looks for interesting words and tags feeds based on what is in them. It’s a good first try, not bad for a day of work, but it could use some improvement.
User-specified tags are pretty much done as well, but not launched yet. I need to figure out a way to integrate the two tagging systems.
The goal behind the tagging system is to make it easier for people to find existing feeds related to what they’re looking for. And, honestly, the goal behind THAT is to drive more page views which may one day generate more ad revenue. I ain’t running a charity here.
I’ll be pounding out a bunch of new changes over the next few weeks, and then things will come to a grinding halt. I’m going to work for Picnik later this month, and that won’t leave much time for this little hobby of mine. I’ll have more to say about Picnik in a later post.
Things have been quiet on this blog lately, but it’s not because I’m ignoring Feedwhip. On the contrary: I’ve been getting up early every morning and putting in some hours before switching over to the job that pays the bills.
Feedwhip’s big server died a few months ago, and I had to push the service onto a much less powerful server (from 2GB ram to 512MB and from two fast CPUs to one slower one). This had the expected effect of slowing down the service overall. Instead of running through all the feeds once per hour, it took upwards of four or five hours — even with user limits scaled way back.
Ever since that downgrade I’ve been working on performance issues. I drastically reduced the memory footprint, and rewrote a ton of the feed processing code to make it much more efficient. After several fits and starts, this weekend I finally started seeing this:
[Graph: Feedwhip’s bandwidth consumption over time]
That’s a graph of the amount of bandwidth Feedwhip consumes. As you can see, it’s got a nice, steady, hourly rhythm to it. Previous incarnations would get stuck at a steady 100 kbps (far too low to be able to process everything in an hour), or slowly trail off over time (due to memory leaks or fatal errors). Now, though, I’ve got an awesome, steady signal. Seriously, I love that graph.
These performance improvements have taken a long time and been occasionally frustrating, so it’s really gratifying to see them finally pay off. The real beneficiaries, though, will be the end users. First of all, I’m now ready to throw the code onto a hosted server somewhere outside my basement — which means everybody will be able to get decent connection speeds and an improved user experience. And even more important, I can start working on a whole slew of new features that I know everyone is going to love.
Kid666.com talks a bit about Feedwhip in a recent blog post.
Technical limitations aside, what I liked best about Feedwhip was the idea of getting a feed for information from anywhere. I’d like to see something which takes this further and produces a usable interface to page scrape directly into a feed. Something like that which could be thrown at Pipes would be incredible. A user could create a feed of any information they wanted without any coding at all and subscribe to it with any feed reader at all.
This is exactly the kind of thing I had in mind when I first created Feedwhip — the ability to take any information from anywhere on the net and turn it into something more useful to you. Progress towards that goal has been slow lately, but it’s gratifying to know that somebody else can sense where we’re heading way, way before we get there.
Although I haven’t been posting much lately, don’t worry — Feedwhip is still alive and well and I’m still slowly pushing it forward. I’ve got a paying job which is taking priority, though, so I don’t have much extra time for blogging right now.
In the meantime, Feedwhip was mentioned in this blog post as a potential acquisition target for next month.
How much of a chance is there for essentialist alternatives in a post-autonomy world? Today’s small startup (e.g. http://www.feedwhip.com/) will be in the pocket of one media giant or the other by next month. And of course, this is not a new development.
Woo hoo! Seriously, though, I am open to M&A discussions…
Feedwhip’s big server has been stumbling heavily the past few days and as a result, no notifications were going out. The culprit is a bad hard disk. To get things running again I have moved all of Feedwhip’s backend operations onto a different, much less powerful server. And so, I’ve also had to dial down everyone’s notification frequencies and subscription limits.
The timing of this move is actually a little convenient, since I’m in the process of acquiring a full-time job, and my opportunities to work on Feedwhip are going to be severely limited in the future. I need to dial back Feedwhip’s growth and bandwidth consumption to a reasonable level. Remember, I’m not making any money off of Feedwhip — the cash is definitely flowing in the opposite direction.
I know this will come as a disappointment to people who have come to depend on Feedwhip for hourly updates of their hundreds of feeds, but it’s just not possible for me to support that level of service going forward. Feedwhip will continue to operate, and will continue to be free, but I can’t afford to be as generous as before.
If you have any questions, I’d be happy to answer them. If you’re interested in moving your feeds to another notification provider, I can help you migrate. Just drop us a line on our contact page.
The database server is struggling with a hard disk which is on the verge of failing. Rebooting seems to fix things only temporarily. The data is safely backed up, so that’s not a concern, but I don’t have another server able to handle the load of the database server. So, I don’t have a good solution at hand. Anyone want to buy me a new server?
As I’ve been clicking around local companies to see who’s hiring for what, I’ve come up with a short list of criteria for the kind of company that I’d like to work at:
- Inside the Seattle city limits. Sitting in my car in traffic is a colossal waste of time. I’ll most likely be taking the bus or riding my bike to work, and I’d like to keep my commute short. Also, working in an interesting neighborhood in Seattle would be nice, too.
- Small. “Small” could mean anything from two to a hundred people. This rules out big players like Microsoft, Amazon, Adobe… The biggest reason I want a small company is that I want to feel like my contribution is important to the bottom line. I never got that feeling at Microsoft — no matter how hard I worked, Windows would still make the company a gazillion dollars. Furthermore, big companies tend to require extreme specialization, whereas I’ve got a breadth of talent that I’d like to see exercised. Finally, big companies tend towards big, proven ideas. I want to be somewhere that will be willing to give small ideas a shot.
- Flexibility with my time. I’ve been working from home for 3 years now, and I’ve grown accustomed to setting my own hours. I would love to get back in an office and interact with real humans on a daily basis, but I also like taking my daughter to the grocery store every other day. I’m being realistic with this one, though — face time with my coworkers is important to me, personally, and to the company’s overall productivity.
- A real salary and benefits. For the past three years I’ve worked for free, I’ve worked for equity, and I’ve worked for half-pay. Sadly, stock options don’t pay the bills.
- A company I believe in. This is really two things: first, there’s got to be a realistic business model in place. Not just an idea for a popular product, but an idea for a money-making product. Secondly, I want the product to be something I can really get excited about, and get excited about telling my friends and family about. My wife, parents, and in-laws use my current project (Feedwhip) on a regular basis, and that is really gratifying.
- Good people. Despite being last on the list, this one may actually be the most important. I’ll sacrifice a lot of the other bullet points to be working with smart, creative people in a productive, supportive environment. However, since every company talks about how they only hire smart, creative people to work in their amazing workplace, this is just going to be a gut-feeling call. Either I click with the people, or I don’t.
Although I still have a ton of ideas about how to improve Feedwhip, I’m putting things on hold. Feedwhip has been fun, but I’ll be lucky if it ever pays its own bandwidth costs, let alone other bills like, say, the mortgage. I’ve been thinking about getting a “real” job lately, and I’ve done a bit of looking at the local job market.
What’s been surprising (and a little disappointing) is how little demand there is for PHP engineers. ASP.NET/C# and Java dominate the market. Ruby is starting to make some inroads. PHP is really nowhere to be found — especially if you’re looking for advanced engineering work.
To that end, I’ve decided to pick up (yet) another language. As a seasoned engineer, I’ve gotten to the point where the language itself isn’t so important — they’re all more or less the same, although Ruby is a bit of an outlier — but it’s the surrounding framework and tools that take time to learn.
I’ve done some work in C#/.net in the past, but I’d like to stay away from MS platforms for now (why pay for software when the equivalent is available for free?). Java would be fine, but it appears that a lot of Java developers are choosing to do their new work in Ruby — so, Ruby it is!
I spent about a week playing with Ruby before starting Feedwhip, but found the amount of magic (things which just happened without you really understanding how or why) a little frustrating to dig through. In the end I decided to borrow a bunch of ideas from Rails and create my own framework in PHP, an endeavor which was fun, educational, and pretty darn successful.
Now, I’m diving into Ruby on Rails waters for the second time. After plumbing the depths of PHP and model-view-controller frameworks over the past 18 months, I think I’ve got a better appreciation for just how handy Ruby’s magic is.
I’ve been grinding away at Feedwhip’s performance issues for almost a month now. I’ve made some huge improvements: connect times are 25x faster, page generation is 10x faster, and overall throughput is more than 50x better.
Here are a few tips and tricks that I can pass on:
- First of all, know your code: use a profiler. I used APD and it showed me exactly where CPU cycles were being burned — in some surprising places.
- Calling functions in PHP is expensive. This is surprising and stupid and all those nicely abstracted classes you created will end up costing you in the long run. PHP benefits greatly from caching values locally. So, for example, don’t write this:
    for( $i = 0; $i < count( $my_array ); $i++ ) …
Instead, do this:
    $c = count( $my_array );
    for( $i = 0; $i < $c; $i++ ) …
You can eliminate function calls in surprising ways. For example, instead of doing this:
    if( strlen( $my_string ) > 0 ) …
do this:
    if( isset( $my_string[0] ) ) …
isset is a language construct, not a function, and it operates much faster.
Pay special attention to function calls inside of loops and sort comparison functions. Remember that if your object implements __get or __set (so that you can do $object->property arbitrarily), then each access of those properties is a function call. Cache those values in a local variable if you’re looping or sorting.
As part of my performance improvements, I got rid of lots of function calls which gave me nicely abstracted object properties, and instead I accessed the properties directly in the array in which they were stored. Sucks for overriding functionality in subclasses, but that’s the price you’ve got to pay.
- Refer to your database server by IP address instead of by name. This one-line change (you do store this value in just one place, right?) gave me a jaw-dropping 10x performance boost. If your database is on the same machine as the PHP code, you can call it 127.0.0.1.
- To reduce load times, use an opcode cache. I use APC. It is free and extremely easy to use.
- Even with an opcode cache, you need to reduce the amount of code that is loaded. Use PHP’s __autoload function to dynamically pull in only the files you need to use.
- Cache pages which don’t change very often. It is trivial to create a file caching system on the web server: there are plenty of sample classes available, or you can roll your own in about an hour like I did. Feedwhip tends to have lots of dynamic content, so this doesn’t work for every page, but right now I only need to do real work for about 10% of the RSS requests. The rest of the time I either serve up a 304 (not modified) or dump a file directly out of the cache.
- Cache compute-expensive data. A bigger hard disk is cheaper than a second server. Feedwhip is caching RSS requests, simplified versions of HTML pages, generated feed items, and the list of most recent feed items for a subscription. We could go back and recalculate all of those values, as needed, off of the original HTML snaps, but that would take a horrific amount of time.
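To give an idea of how small a roll-your-own file cache can be, here is a minimal sketch. It is not the cache Feedwhip uses (that code isn’t shown here); the class name, the method names, and the time-based expiry are all assumptions made for the example, and it needs PHP 7.4 or later for the typed properties.

```php
<?php
// Hypothetical minimal file cache; names and expiry policy are illustrative only.
final class FileCache
{
    private string $dir;
    private int $ttl; // seconds a cached file stays fresh

    public function __construct(string $dir, int $ttlSeconds)
    {
        $this->dir = rtrim($dir, '/');
        $this->ttl = $ttlSeconds;
        if (!is_dir($this->dir)) {
            mkdir($this->dir, 0775, true);
        }
    }

    // Hash the key so any string (a URL, a feed id) maps to a safe file name.
    private function path(string $key): string
    {
        return $this->dir . '/' . sha1($key) . '.cache';
    }

    // Returns the cached body, or null if it is missing or older than the TTL.
    public function get(string $key): ?string
    {
        $path = $this->path($key);
        clearstatcache(true, $path); // avoid stale mtime information
        if (!is_file($path) || time() - filemtime($path) > $this->ttl) {
            return null;
        }
        $data = file_get_contents($path);
        return $data === false ? null : $data;
    }

    // Write to a temp file and rename, so readers never see a half-written file.
    public function put(string $key, string $body): void
    {
        $tmp = $this->path($key) . '.' . uniqid('', true) . '.tmp';
        file_put_contents($tmp, $body);
        rename($tmp, $this->path($key));
    }

    // Serve from the cache when possible; otherwise do the real work once and store it.
    public function remember(string $key, callable $generate): string
    {
        $cached = $this->get($key);
        if ($cached !== null) {
            return $cached;
        }
        $body = $generate();
        $this->put($key, $body);
        return $body;
    }
}
```

A controller could then wrap its expensive work in something like `$cache->remember('rss:' . $feedId, $buildFeed)`, where `$feedId` and `$buildFeed` are placeholders for whatever identifies and generates the page.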
Now that Feedwhip’s performance is back in the realm of usefulness, I’m going to step away from the code for a week or two. I need to get some perspective on where Feedwhip is and where it needs to go. As always, I love to hear from Feedwhip’s users — it just takes one suggestion to get the feature you’ve always wanted!
When I got back to my computer this afternoon I took a look at the performance of my RSS feeds over the past few hours:
Page (ms) | Database (ms)
3498.8924 | 1808.4851
3522.5980 | 1610.3802
2838.9859 | 1510.7083
428.4015 | 191.3470
588.4480 | 238.6685
415.9069 | 123.9909
525.5345 | 215.9800
Each row is the average time (in milliseconds) it took to generate a page for a given hour. The left column is the average time for each page, and the right column is the amount of time just spent querying the database.
Four hours ago, something big changed. At first, I thought that four hours ago something big broke, like maybe all of the crawlers had suddenly stopped working and taken their load off the database, but no, everything is working fine. Almost too fine.
Then I remembered the last thing I’d done before I took Ruby out this morning…
Recalling something that was mentioned at last month’s PHP conference, I changed the web server’s configuration to refer to the database server by IP address instead of by name. So, instead of querying db.feedwhip.com, it was going to 216.172.217.XXX. That one change knocked almost 90% off the average amount of time I spent waiting on the database.
Wow.
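As a sanity check on the “almost 90%” figure, the database column from the hourly table can be averaged before and after the change. The split into three “before” rows and four “after” rows is an assumption read off the table (the numbers drop sharply at the fourth row); on that assumption the reduction comes out at roughly 88%.

```php
<?php
// Average database time per page (ms), copied from the hourly table in this post.
// Assumption: the first three rows predate the change and the last four follow it.
$before = [1808.4851, 1610.3802, 1510.7083];
$after  = [191.3470, 238.6685, 123.9909, 215.9800];

function mean(array $values): float
{
    return array_sum($values) / count($values);
}

// Fraction of the database wait that disappeared after the change.
$reduction = 1 - mean($after) / mean($before);

printf(
    "before: %.1f ms, after: %.1f ms, reduction: %.0f%%\n",
    mean($before),
    mean($after),
    $reduction * 100
);
```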
Old code:
msecs/first-response: 332.921 mean, 2232.32 max, 107.822 min
New code:
msecs/first-response: 7.84185 mean, 22.079 max, 6.986 min
YEAH!
When I pushed my latest changes out to the servers last night, things broke. Specifically, the HTTP server was prematurely dropping the connection and so browsers were seeing most, but not all, of each HTML page. Unfortunately this is really bad because all of the JavaScript which makes the pages work is at the end.
This wasn’t happening on my test server, but happened on both production servers. The test server is running FC5 and PHP 5.1. The production servers are FC4 and PHP 5.0. A bit of scrambling led me to suspect something to do with sessions and PHP’s session_write_close, which changed its behavior between 5.0 and 5.1 and which can, apparently, just quietly terminate your connection. I tried explicit flush(), and I tried commenting out session_write_close.
After a few hours, I gave in and decided to upgrade to FC5 (so that I could get PHP 5.1). Not exactly the best-planned of upgrades, but it went fairly smoothly nonetheless. I kicked off the upgrades around midnight last night, and this morning they appeared to have worked beautifully. A few tweaks to some config files and we’re back online!
Feedwhip is temporarily offline while I perform some not-so-routine maintenance. I’m expecting that the servers will be up again tomorrow.
By far the most database-intensive thing Feedwhip does is look for feed items. Every change to a feed gets stored as a feed item, even tiny little changes to comment counters, and so finding the changes you care about for a given feed can be really slow. Some of the RSS feeds take hundreds of seconds to generate. Wow, that’s slow.
RSS feeds generally don’t change as often as people poll them for changes. This means that I keep looking up and regenerating the same page over and over again. The solution, of course, is to cache the changes on the web server so that we don’t need to hit the database and grind through the code unnecessarily. This point gets mentioned over and over again when people talk about how something as slow as PHP (see my previous post) can power a site as popular as Digg.
Of all the perf changes I’ve done (or have planned), I think this one will probably have the biggest impact. Which makes me wonder why I’ve waited so long to do it…
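Because feed readers poll far more often than feeds change, much of the saving comes from answering “nothing new” as cheaply as possible. Here is a minimal sketch of that decision using an ETag built from the cached body. The function name and the returned array shape are hypothetical, and a real handler would also need to consider headers such as If-Modified-Since.

```php
<?php
// Hypothetical helper: given an already-cached feed body and the client's
// If-None-Match header, decide between a full 200 response and an empty 304.
function conditionalResponse(string $body, ?string $ifNoneMatch): array
{
    // A strong ETag derived from the content: same body, same tag.
    $etag = '"' . sha1($body) . '"';

    if ($ifNoneMatch !== null && trim($ifNoneMatch) === $etag) {
        // The client already has this exact version; send headers only.
        return ['status' => 304, 'headers' => ['ETag' => $etag], 'body' => ''];
    }
    return ['status' => 200, 'headers' => ['ETag' => $etag], 'body' => $body];
}
```

In a request handler the second argument would typically come from `$_SERVER['HTTP_IF_NONE_MATCH'] ?? null`, and neither branch needs to touch the database as long as the body comes out of a cache.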
Thanks to some late nights and doting grandmothers, I’ve been able to work extra-hard on Feedwhip’s performance issues the past few days.
For a while there, I was in despair. Using my framework to render a static, three-word web page was giving me a throughput of 15 pages/sec, and the connect time was a horrific 300ms. At the PHP conference, Rasmus Lerdorf had said anything more than 10ms is just too slow. 10 milliseconds!
Well, I got to work. My framework automatically loads in all the PHP code you need right at the start, but that means it also loads the code you don’t need. Loading the models, controllers, and helper classes on demand brought the connect time down to 265ms, 220ms, and 200ms respectively. Better, but not really in the ballpark.
Profiling the code was giving unexpected results. Retrieving the value of a variable in my model class, an action which occurs a bajillion times, was really slow. It turns out that calling any functions in PHP, even ones which just return a value, is horrendously slow. So, I spent all day today looking at my loops and storing values locally outside of the loops.
Frankly, the discovery that PHP functions are slow was a big disappointment. My wonderfully architected and abstracted framework, which uses lots of virtual functions and inheritance, is just plain slow. I’ve now got a new perspective on how to code in PHP: use lots of locally cached values and avoid function calls. Unfortunately, when you’re more than a year into a function-call-heavy framework, things can be hard to change.
Because I had to touch so much code along the way, these changes will need a lot of testing. It’ll probably be a while — maybe a week — before they can be pushed live.
Feedwhip keeps track of the last two weeks’ worth of changes to your subscriptions. Anything past that gets deleted so that the servers’ hard disks don’t fill up.
Unfortunately, deleting individual items from database tables can be really slow. Once per hour, we were looking in the database at each feed’s items for entries that were old. This was consuming a lot of the database server’s time, so I came up with a better solution.
Now, we store the feed items in rotating tables. There are several of them, and each contains a week’s worth of data. Once per week, a cron job deletes the oldest table, renames all the other tables, and creates a fresh new table. This means that I’m no longer sucking up cycles looking for old items to delete — I know everything in the oldest table is old, and so I drop them all at once.
It also shrinks the size of the table containing the most likely source of items, but at the cost of having to do more than one lookup to find all the items for a given feed. Since most feeds are updated fairly regularly, you probably don’t need to go back more than a week to find the ten most recent items.
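The weekly rotation can be sketched as a function that produces the statements a cron job would run. Everything specific here is an assumption: the table naming scheme (a base name plus a numeric suffix, with 0 as the current week), the number of tables, and the MySQL-style SQL are made up for illustration and are not Feedwhip’s actual schema.

```php
<?php
// Sketch only: table names and MySQL-style SQL are assumptions for illustration.
// Tables are named <base>_0 (current week) up to <base>_<tableCount - 1> (oldest).
function rotationStatements(string $base, int $tableCount): array
{
    $oldest = $tableCount - 1;

    // Dropping a whole table is far cheaper than deleting its rows one by one.
    $sql = ["DROP TABLE IF EXISTS {$base}_{$oldest}"];

    // Shift every surviving table one slot older, oldest survivor first,
    // so no rename ever collides with an existing table name.
    for ($i = $oldest - 1; $i >= 0; $i--) {
        $sql[] = sprintf('RENAME TABLE %s_%d TO %s_%d', $base, $i, $base, $i + 1);
    }

    // Create an empty table for the new week with the same structure.
    $sql[] = "CREATE TABLE {$base}_0 LIKE {$base}_1";

    return $sql;
}
```

Looking up the most recent items for a feed then means querying the `_0` table first and falling back to older tables only when it returns fewer rows than needed, which is the extra lookup cost mentioned above.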
It’s my hope that this will lighten the load on the database server and eliminate some of the strange slow behavior I’ve been seeing — such as primary key lookups taking more than a second (they should be more-or-less instant).
If this doesn’t improve things dramatically (or even if it does), I’ve still got more plans on how to improve the performance — the easiest of which is throwing more RAM into the database server.
Tonight I pushed a big upgrade out to Feedwhip’s servers. I’ve spent a bunch of time examining performance and this upgrade tackles a few of the big issues Feedwhip has been facing. You’ll see some improvements, depending on what you’re doing, but there’s still a lot more to be done in the performance arena. Rest assured, I’m working on it every chance I get.
Okay, maybe not every chance. I spent yesterday and today redesigning the “feed detail” page. The old one was a jumbled layout of boxes and buttons that looked like crap. The new one is much easier on the eyes and should be less intimidating to first-time users who get redirected from an external website.
The UI redesign was done based on suggestions from Feedwhip’s users. If there’s something you think can be improved, let me know!
The conference was a typical mix of blindingly brilliant and creakingly dull. I didn’t get what I’d expected out of it (networking and a chance to show off Feedwhip’s blender framework), but I’m pleased with what I learned: that I’ve been a Bad Engineer.
Because my primary responsibility is taking care of my daughter, I’m left with just a few hours each week to work on Feedwhip. As a result, I’ve been in a rush. Each day when nap time rolls around I rush to get through my email and take care of some features and try to get something exciting out for the users. Feedwhip is an internet app, after all, and I’m working on internet time. Everything needs to be done yesterday.
But as a result, I’ve been cutting corners — especially when it comes to testing. Feedwhip is actually quite stable, more or less, but getting the testing done takes forever because I don’t have any good test suites written. After all, they take time and I don’t have enough of it. But not having test suites means my overall development is way less efficient. I can’t just type one command in and get a report on my latest changes — I need to test them all by hand.
The keynote by Rasmus Lerdorf (PHP’s inventor) was particularly interesting. He walked through a round of profiling on a sample app, showing the tricks and tools he’d use to gain a 500-fold increase in performance.
Profiling! Test suites! This is pretty basic stuff. Without either I don’t really have a good idea of what’s working well, and what should be fixed — especially with performance being such a big issue with Feedwhip right now.
For example, I’ve got some code that converts HTML into a simplified, extra-marked-up form that Feedwhip uses internally. By using profiling, I was able to figure out that this code was one of the big bottlenecks for the walkers.
As part of this code, I look up the current HTML tag in an array to see if it should be processed, dropped, etc. I was using the PHP in_array() function to do this. Well, it turns out that this function was taking up 25% of the execution time in this one function. Changing the structure of the array (so that I could search by key instead of by value) and combining several arrays into one (just a good idea) cut the execution time in half! Without profiling, there’s no way I’d have known to even look at the in_array() function.
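The difference between searching by value and searching by key looks roughly like this. The tag lists and action names are invented for the example, since the real arrays aren’t shown in this post.

```php
<?php
// Hypothetical tag lists; the real arrays are not shown in the post.
$dropTags = ['script', 'style', 'font'];
$keepTags = ['p', 'a', 'h1'];

// Before: in_array() scans each list by value on every single tag.
function actionSlow(string $tag, array $dropTags, array $keepTags): string
{
    if (in_array($tag, $dropTags, true)) {
        return 'drop';
    }
    if (in_array($tag, $keepTags, true)) {
        return 'keep';
    }
    return 'ignore';
}

// After: one combined array keyed by tag name, so a lookup is a single hash access.
$tagActions = array_fill_keys($dropTags, 'drop') + array_fill_keys($keepTags, 'keep');

function actionFast(string $tag, array $tagActions): string
{
    return $tagActions[$tag] ?? 'ignore';
}
```

Both functions give the same answers; the keyed version simply avoids walking the lists, which matters when the lookup runs once per tag on every page.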
So, I’ve come back from the conference with a new perspective on my engineering processes. Instead of new Feedwhip features on my to-do lists, it’s things like “unit tests”, “continuous integration”, and “benchmarking”. Once I get all these important foundations in place, I’ll be able to churn out new features more quickly and with more confidence.
I’ll be at the Vancouver PHP 2007 conference next week. If you’re going to be there, look me up! I’m the guy with the laptop covered in stickers.
Every few months the UPS (uninterruptible power supply, basically a power bar with built-in battery) in the server closet starts emitting a steady, high-pitched beep. Kate says it does this because it is overloaded. The only way to get it to shut up (did I mention that the server closet is literally two feet from my desk chair?) is to turn it off and on again. Which, unfortunately, means I need to gracefully power down all the systems in the closet, reset the UPS, and then bring them back up again.
This time around, I took advantage of the forced power down and moved two not-so-essential computers off the UPS, so hopefully the beeping won’t happen again.
This is my very long-winded way of saying that Feedwhip had a brief (less than 5 minutes) outage today.
A new filter option has been added to the filter drop-down: “Try to split changes into individual blog posts.” This option is on by default.
Normally, Feedwhip examines the content of a change and divides it into individual blog entries. Feedwhip’s algorithms are pretty darn smart about this and work really well for blogs. But if you’re not whipping a blog, then you probably don’t want this artificial splitting to be applied to your changes. So now, there’s an option to turn splitting off if you want.
A single snapshot might still generate multiple changes if those changes are separated by more than a few lines on the page. So, a change at the top and a change at the bottom will show up as two changes — even if they were noticed at the same time.
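The “separated by more than a few lines” rule amounts to grouping changed line numbers into runs. Here is a minimal sketch of that idea. The function name and the threshold of three lines are assumptions for illustration; Feedwhip’s actual cutoff and algorithm aren’t given here.

```php
<?php
// Hypothetical sketch: group changed line numbers into separate "changes"
// whenever two neighbours are more than $maxGap lines apart.
function groupChangedLines(array $lineNumbers, int $maxGap = 3): array
{
    sort($lineNumbers);
    $groups = [];
    foreach ($lineNumbers as $n) {
        $last = count($groups) - 1;
        if ($last >= 0 && $n - end($groups[$last]) <= $maxGap) {
            // Close enough to the previous changed line: same change.
            $groups[$last][] = $n;
        } else {
            // Too far away (or the very first line): start a new change.
            $groups[] = [$n];
        }
    }
    return $groups;
}
```

With the default gap, edits on lines 2, 3 and 5 plus edits on lines 40 and 41 come back as two groups, matching the top-of-page and bottom-of-page example above.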
As part of this release I also cleaned up the formatting and made the filtering code a bit smarter about not showing changes you’re not interested in.
As always, your feedback guides the development process. Let me know what you think!
The update I mentioned in my last post was pushed live last night. A few issues have cropped up — mostly a case of too much information being presented — and I’ll be tweaking the algorithms to make things look better over the next few days. Feel free to forward examples of poorly-processed change emails to change-is-good at feedwhip.com.
It’s been a while since I’ve posted, but that’s generally a good thing — it means I’ve got my head down, working on a new release. I’m just about done with my latest round of changes and I think everybody’s going to love this update.
First of all, I’ve completely rewritten the change splitting code. This is the algorithm that takes a single change to a blog and breaks it up into individual blog posts, and it’s what makes Feedwhip so special. Feedwhip is the only service of its kind (that I know of) that is able to generate RSS feeds for any web page without any configuration. With this release, the change splitter is more accurate and less likely to pick the wrong line to start a change.
The other big update is to the way changes are displayed. Small changes will be presented inline and more context is presented around each change. This will help you to know what you’re looking at when an item is changed. If you’ve whipped one of those pesky Terms of Service pages, now you’ll see exactly what words are being modified in the paragraph.
Put together, these two changes amount to a big improvement to Feedwhip’s most important feature: showing people what’s changed. I’m doing my final round of testing this afternoon, and if things go well I’ll be releasing really soon now!
About 30 minutes ago I pushed out an update that has reduced the average database time per page load to less than a third of what it used to be. The change is specific to pages that show feed items (such as an RSS feed that we generate). These pages will now load much faster, and because of the overall reduced load on the database, everything should be faster.
The change was a relatively easy one to make. What we were doing is querying the database for all the items that match a certain feed and date range. This could, potentially, be a few thousand items (more likely a hundred or so), and since we generally only need the last 10, I was loading them into the web server for processing in blocks of 10. So what was happening is that we were executing the same find-me-these-items query over and over, and throwing away most of the results each time except for the block of 10 we were interested in.
The new code uses an unbuffered SQL function called mysql_unbuffered_query(). This call runs the query on the database but doesn't fetch and buffer the whole result set up front; the database holds onto the results until we're done. We can then pull items out, one at a time, until we've got the ones we want. The end result is that we do just one database search per feed instead of several or dozens. And the end result of THAT is the average database time per page query is now down below 5 seconds instead of up around 15 to 20.
As I said, this specific improvement will mostly affect pages that display feed items, but because of the reduced load on the database there should be a slight improvement across the board.
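Here is a rough sketch of the "pull rows one at a time and stop early" idea. It doesn't talk to a real database: `fake_result_set()` is a made-up generator standing in for an unbuffered result, and `take_matching()` is a hypothetical helper, not Feedwhip's code. With the newer mysqli extension, passing `MYSQLI_USE_RESULT` to `query()` asks for an unbuffered result in much the same spirit.

```php
<?php
// Pull rows from any iterable one at a time and stop as soon as
// $limit rows have passed the $keep test.
function take_matching(iterable $rows, callable $keep, int $limit): array
{
    $found = array();
    if ($limit <= 0) {
        return $found;
    }
    foreach ($rows as $row) {
        if ($keep($row)) {
            $found[] = $row;
            if (count($found) >= $limit) {
                break;    // stop reading; the remaining rows are never fetched
            }
        }
    }
    return $found;
}

// Stand-in for an unbuffered result set. $fetched counts the rows handed out.
function fake_result_set(int $total, int &$fetched): Generator
{
    for ($i = 1; $i <= $total; $i++) {
        $fetched++;
        yield array('id' => $i, 'feed_id' => $i % 2);
    }
}

$fetched = 0;
$items = take_matching(
    fake_result_set(1000, $fetched),
    function ($row) { return $row['feed_id'] === 1; },
    10
);
```

In this toy run only the first 19 of the 1000 rows are ever produced, because the tenth matching row is row 19 and the loop stops there.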
At my previous job (link might stop working any day now) I was the lead web developer (and lead Windows developer, and lead PHP developer…). It was always exciting to be working on new features for our users. But I often got distracted by the triumvirate of managers leading the company. “Statistics!” they’d cry. “We need more statistics!” Occasionally, I’d grudgingly comply.
Getting the right view on your data can mean the difference between success and failure. Getting a good reporting system in place is one of those not-so-sexy infrastructure things that doesn’t get users all excited, but can provide crucial insight into how your systems work and how they get used — which in turn tells you what sexy new features are the most important.
Feedwhip has been generating a decent daily report for me for the past year, but as I’ve been continuing to tackle performance issues, it’s been clear I don’t have enough info. So, I started logging more detailed stats about how pages are generated.
Lo and behold: useful data was obtained! Did you know that Feedwhip handles an RSS request every six seconds? And that serving up RSS requests is by far the most db-intensive thing I do?
With those two facts in my back pocket, I realized it was time (probably long past time) to add support for etags and last-modified headers. These headers let me report back to an RSS reader that a page hasn’t changed — so I don’t need to regenerate the content.
I don’t cache the content I generate (mostly because it changes so often, although this could be another performance enhancer to explore), so every time there’s a request for an RSS feed I need to go searching among the 1.5 million RSS items in the database for ones that match the user’s filter requirements. Some RSS feeds took more than 100 seconds to generate — and most of that was spent rifling through the database. By avoiding the generation of these pages altogether, I’ll be taking a huge load off the database — which means everything else will respond that much more quickly.
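A minimal sketch of the check, assuming the time a feed last changed can be looked up cheaply. The function names and the way the ETag is built are illustrative assumptions, not Feedwhip's code, and the If-None-Match handling is simplified to a single exact tag.

```php
<?php
// Build an ETag from the feed id and the time it last changed (assumed scheme).
function feed_etag(int $feedId, int $lastChangedTs): string
{
    return '"' . md5($feedId . ':' . $lastChangedTs) . '"';
}

// Decide whether the request can be answered with "304 Not Modified".
// $requestHeaders is assumed to look like array('If-None-Match' => '"abc"').
function is_not_modified(array $requestHeaders, string $etag, int $lastChangedTs): bool
{
    if (isset($requestHeaders['If-None-Match'])) {
        // When present, If-None-Match takes precedence over If-Modified-Since.
        return trim($requestHeaders['If-None-Match']) === $etag;
    }
    if (isset($requestHeaders['If-Modified-Since'])) {
        $since = strtotime($requestHeaders['If-Modified-Since']);
        return $since !== false && $lastChangedTs <= $since;
    }
    return false;    // unconditional request: the feed has to be generated
}

$lastChanged = gmmktime(12, 0, 0, 1, 6, 2007);
$etag = feed_etag(42, $lastChanged);
$skip = is_not_modified(array('If-None-Match' => $etag), $etag, $lastChanged);
```

A page would run this check before touching the items table, and when it returns true send `header('HTTP/1.1 304 Not Modified')` and stop there.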
So, it’s all just a matter of finding the right statistics to give you an informative view on your system. Now with that done, I’ll go back to adding some sexy new features.
Yet another wind- and snowstorm is bearing down on Seattle — the third one this year. Let’s hope the power and phone lines hold out better than last time.
Update: the next day:
Well, the wind came and went with a few flickers of the lights but no loss of service. Tonight: snowstorm!
Some users have complained (and rightly so) that loading pages from Feedwhip is too slow. Optimizing the system and expanding its capabilities to match user growth is a never-ending battle, but it is one that I’m taking seriously and will continue to work on as Feedwhip becomes more and more popular.
Today I scored a major victory by reducing the amount of HTML code in the “Your Feeds” page to less than 1/3 of its former size. This page will now load much faster, especially if you’ve got as many feeds as I do. The other pages on the website weren’t as prime for optimization, so you probably won’t notice any big improvements elsewhere, although I did tweak things a little.
A friend sent me a link to (yet another) competitor in the crowded page-change-watching field. NoticeThat generates RSS feeds and sends emails when pages change. It’s a little thin on features when compared to Feedwhip, but sometimes that simplicity can be an advantage.
One thing that looks great is the way they present small text changes inline and in context (see this screenshot). I want to do something just like that ASAP.
Also, they’re proudly tea-powered, just like Feedwhip!
In order to make Feedwhip as easy to use as possible, way back in 2006 I created something known as a weak account. Weak accounts are created whenever a user subscribes to a feed without first signing up. Although this lowers the barrier to usage, it does create a few headaches because users are then stuck in a halfway world where they have accounts, but can’t properly login to access them.
This is now changed! Weak accounts are no longer created when you subscribe without signing up. Instead, we create a password for you and mail it to you. This has the added benefit of making the account self-verifying, because you can’t log in unless you’ve got the password which was sent to your email address. Users can still sign up directly if they want to, since it just makes sense (to me) to always have a “sign up” link.
As a tiny little aside, please note that all the automatic passwords are generated as “pseudo-words” with alternating consonants and vowels so that they’re much easier to type.
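A sketch of what such a generator could look like. The letter sets, the default length, and the use of `random_int()` (which needs PHP 7 or later) are all assumptions for illustration; the real generator may differ.

```php
<?php
// Generate a pronounceable "pseudo-word" by alternating consonants and vowels.
// Hypothetical sketch, not Feedwhip's actual password code.
function pseudo_word_password(int $length = 8): string
{
    $consonants = 'bcdfghjkmnprstvwz';    // assumed set of consonants
    $vowels = 'aeiou';
    $password = '';
    for ($i = 0; $i < $length; $i++) {
        // Even positions get a consonant, odd positions get a vowel.
        $pool = ($i % 2 === 0) ? $consonants : $vowels;
        $password .= $pool[random_int(0, strlen($pool) - 1)];
    }
    return $password;
}

$password = pseudo_word_password();
```

The result is random, but it always follows the consonant-vowel pattern, which is what makes it easy to read off an email and type.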
One of the nice things about an interpreted language like PHP is that you don’t need to worry about memory management. The underlying system is able to figure out what objects are no longer needed and get rid of them on your behalf. You’re free to spend your time writing useful code instead of reinventing the memory management wheel.
In theory.
As I wrote about earlier, my walker processes were gobbling up large (and growing) amounts of memory for no good reason. Well, I finally tracked down and fixed the problem. Gory details to follow, so if you’re not a coder you can stop reading now.
Feedwhip’s Blender framework has a built-in caching system. Whenever an object is loaded from the database, it stores a pointer to that object in a global table. The next time somebody wants to use that same object, we can just pull it out of memory instead of hitting the database. This not only reduces the load on the database, but cuts down on some heavy computation we do to regenerate full web pages based on just the differences between each snapshot.
This works well but there’s an obvious problem — if you let the walker run long enough, eventually you’ll have pulled the entire database into memory. Or maybe PHP will just run out of memory and die. Either way, you’ve got a problem. So, I added a function to clear the global cache. This is what it looks like, more or less:
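function clear_global_cache()
{
    // Without this declaration the assignment would only create a local variable.
    global $the_global_cache;
    $the_global_cache = array();
}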
The theory is that by assigning the global cache to an empty array, the system will notice that all the objects it used to contain are no longer needed, and it’ll clean those up. Again, that’s in theory, and in practice nothing was happening. So, I changed the code to a series of array_pop()s, and that seemed to improve things, but not fix them entirely.
The problem is that in addition to having a global cache, I’ve got a local cache. The local cache is attached to each object, and it stores pointers to objects that it has directly referenced. For example, every subscription has an associated user. The first time you ask a subscription for its user, it’ll go to the database and grab it. The next time, it’ll just use the local cache. This works well, except that I’ve now got all kinds of objects pointing at each other. This is a classic circular reference problem and I had assumed that PHP was smart enough to detect these, but I guess I was wrong.
To fix this problem I used the classic solution to the classic circular reference problem: I created a shutdown() method. Actually, my method is called clear_local_cache(), and it is called recursively on all the objects in all the caches. So, now the clear_global_cache code looks like this:
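static function clear_global_cache()
{
    // Assumption: the cache is a true global; the excerpt doesn't show how it is reached.
    global $global_cache;
    while( count( $global_cache ) )
    {
        // Each cache entry may itself be an array of cached objects.
        $obj = array_pop( $global_cache );
        if( is_array( $obj ) )
        {
            while( count( $obj ) )
            {
                $obj2 = array_pop( $obj );
                if( NULL !== $obj2 &&
                    is_a( $obj2, "BlenderObject" ) )
                {
                    // Break the circular references held in the object's local cache.
                    $obj2->clear_local_cache();
                }
                $obj2 = NULL;
            }
        }
        $obj = NULL;
    }
}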
The memory usage (according to the memory_get_usage function) now hovers nicely right around 5MB per walker. Sadly, top is still reporting 20-50MB per PHP process after about 20 minutes, and growing, albeit more slowly. Happily, overall bandwidth consumption is higher — which is probably the best metric I have for how efficiently the system is running.
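The clear_local_cache() method itself isn't shown above, so here is a guess at its general shape. `CachedObject` is a hypothetical stand-in for a Blender object, and the "already clearing" guard is one assumed way to keep the recursion from looping forever when two objects point at each other.

```php
<?php
// Hypothetical stand-in for an object with a per-object ("local") cache.
class CachedObject
{
    public $local_cache = array();
    private $clearing = false;

    public function clear_local_cache()
    {
        if ($this->clearing) {
            return;    // already being cleared: stops infinite recursion on cycles
        }
        $this->clearing = true;
        $cached = $this->local_cache;
        $this->local_cache = array();    // drop our own references first
        foreach ($cached as $obj) {
            if ($obj instanceof CachedObject) {
                $obj->clear_local_cache();    // then clear everything we pointed at
            }
        }
        $this->clearing = false;
    }
}

// A subscription caches its user, and the user caches the subscription: a cycle.
$user = new CachedObject();
$subscription = new CachedObject();
$subscription->local_cache['user'] = $user;
$user->local_cache['subscription'] = $subscription;

$subscription->clear_local_cache();    // both local caches are now empty
```

Once the two objects no longer point at each other, plain reference counting can free them. PHP 5.3 and later also include a cycle collector (see `gc_collect_cycles()`), which handles this kind of leftover cycle without a manual shutdown step.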
The changes I’ve made are good enough for now, I think. Time to move on to more visible features.
I hope you enjoyed your holidays. In addition to a wonderful time spent with my family, I immersed myself in my annual week-long video game binge. This year’s choice was Age of Empires III, which was enjoyable — if a little bit too similar to AoE I and II.
With gaming out of my system for the next twelve months (more or less), it’s back to work on Feedwhip. 2007 is only a few days old but we’ve already managed to push some good updates to Feedwhip’s servers, which I’ll talk about in subsequent posts.
2007 should be an interesting year for Feedwhip. In addition to all the exciting new features I’ve got planned, the continued growth in our user base (which is great — keep spreading the word!) also means that at some point we’re going to have to make some big changes to accommodate all those feeds. Moving to a colo? Charging for premium accounts? Donations? Trying to wring a few more pennies out of Google ads?
I’m not sure what changes we’ll make, if any, or when, if ever. But I’ll be sure to give plenty of notice before making any major philosophical changes to Feedwhip.
Have a wonderful New Year!
Yesterday I noticed that the bandwidth graph (thank you, cacti!) for Feedwhip’s backend server had a peak every two hours. Coincidentally, I restart all the walkers every two hours. So, this morning I turned off the restart and just let them go for as long as they wanted.
Sure enough, the bandwidth usage slowly tailed off, and the walkers got further and further behind. A look at the processes on the server revealed a probable cause: memory leaks! Sixteen walker processes were using 150MB each, and the kernel swap process (which handles shuffling data to disk and back when the RAM gets full) was working hard. So, it appears that I need to restart the walkers more often than every two hours, otherwise they’ll slowly eat up memory and their processing speed will slow way down.
Of course, this raises the question: WHY are the walkers leaking memory?
The Blender framework (which Feedwhip is based on) uses some caching of objects to reduce the load on the database, but this cache is cleared at the start of every cycle. There may be circular references living in the cache when it is cleared, and I have to assume/hope that PHP is smart enough to handle those. I’ll take a quick glance through the caching code for bugs, but I don’t think this is the problem.
Blender makes liberal use of PHP’s include statement to pull in templates for sending emails and rendering web pages. A bit of web research has suggested that each one of these includes may be staying in memory. So every time I send an email, my memory consumption goes up a bit. As you might guess, I’m sending a lot of emails. A redesign of the Blender templating system might fix this (if it really is the problem), but that’s a big, messy job that I’d rather not undertake.
I think a better solution is to create a slightly more sophisticated process launcher that automatically rotates processes in and out after, oh, twenty minutes or so. Right now, I’ve just got a scheduled task (aka a cron job) that runs every hour, stops all the walkers and then starts them up again. I could replace that with code that ran once every minute and restarted the oldest process…
Some thinking on this is in order. In the meantime, increasing my relaunch frequency to once per hour should keep the walkers’ memory consumption under control.
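To make the rotate-the-oldest idea concrete, here is a small sketch of the decision a once-a-minute job could make. Everything in it is an assumption: the function name, the shape of the `$walkers` array (process id mapped to start time), and the twenty-minute limit taken from the musing above.

```php
<?php
// $walkers maps a process id to the Unix time that walker was started.
// Returns the pid of the oldest walker if it is at least $maxAge seconds old,
// or null when nothing needs restarting yet.
function pick_walker_to_restart(array $walkers, int $now, int $maxAge = 1200): ?int
{
    $oldestPid = null;
    $oldestStart = PHP_INT_MAX;
    foreach ($walkers as $pid => $startedAt) {
        if ($startedAt < $oldestStart) {
            $oldestPid = $pid;
            $oldestStart = $startedAt;
        }
    }
    if ($oldestPid === null || $now - $oldestStart < $maxAge) {
        return null;
    }
    return $oldestPid;
}

// Walker 102 has been running for 1300 seconds, so it is the one to restart.
$pid = pick_walker_to_restart(array(101 => 1000, 102 => 400, 103 => 900), 1700);
```

Restarting one walker per minute spreads the restarts out, so the download capacity never drops to zero the way it does when every walker is stopped at once.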
Update, 4 hours later
The refresh-once-per-hour thing is working really well. There’s no backlog, and the bandwidth graph is both smoother and taller (meaning more bits coming in) than before. I’d still like to nail down the memory leak thing, though, because as Feedwhip expands we’ll need to have even more walkers going simultaneously, and we can’t have each walker sucking up 50MB of memory for no good reason.
A call to Qwest this morning confirmed that they had indeed set my bandwidth limit too low when they didn’t fix my line which they insist was never broken in the first place. Anyway, we’re now correctly set at the higher bandwidth level.
The weird thing, though, is that the walkers (background processes which look for feeds to download, and then download them) aren’t running any faster, even with almost three times the bandwidth available. Thinking that maybe the bottleneck is at the other end of the connection (i.e., the websites I’m downloading from aren’t providing me data fast enough) I tried upping the number of walkers from 6 to 16. That way, I’d be able to fill up my bandwidth by downloading from more sites simultaneously. But still, no change.
This one’s a real head-scratcher. Feedwhip’s processing capabilities are more or less maxed out, but it doesn’t appear to be a bandwidth or processing speed issue. Anyway, I’ll keep looking at the numbers (and generating more), and see if I can figure out where the bottleneck is.
After an agonizing 4 days without Feedwhip, we’re back online! The internet connection miraculously restored itself at some point today. Because, of course, there was never actually anything wrong with it, according to Qwest. Except that my bandwidth settings are now different. I guess Qwest likes to lie about their status, not tell you when they do maintenance, and then claim there was nothing wrong in the first place. Oh, and along the way, they’ll cut your bandwidth in half.
Hopefully this is the last interruption for a long, long time. I took the opportunity while we were down to rearrange the server closet, so now I’m not so scared about accidentally unplugging the power while reaching around back for an ethernet cable.
Over the next few days I’ll be going back over my to-do list and trying to figure out what the hell I was working on before all hell broke loose. Some nifty new features should be popping up very early in the New Year.
As advertised, we lost power last night and are still without power today. It’s my hope that we’ll be back up and running some time today, but there’s no way to know when power will be restored.
Update: Dec. 16, 11am
Power was restored in the early afternoon yesterday, but the internet connection has been misbehaving badly ever since. Pinging external servers gives us about 30% lost packets. Qwest claims everything is working fine on their end. We’ve tried rebooting the modem and switching hubs. We’re seeing the same behavior across all servers. The walkers are running very slowly as a result of all this and notifications are delayed at best, but mostly not going out at all.
I’ve got a call in to our ISP, but it appears they don’t work weekends — not even the weekend after a major storm wipes out power to half the homes in Seattle.
Update: Dec. 17, 10pm
Nothing new to report, I’m afraid. My ISP has been unresponsive. I’m hoping they’ll have some answers when they show up for work on Monday. In the meantime, I’ve been shopping around for a new ISP.
Update: Dec. 18, 5pm
After spending the day on the line with my ISP, we’re still without an internet connection. We’ve replaced modems, hubs, and switches, called Qwest, reconfigured the modems, moved me to a whole other network, and still nothing. I’m able to ping external servers, usually losing about 20% of packets, but can’t establish any http connections. They just get stuck in a “waiting” state in the browser.
Sadly, we’ve made no progress on this problem. I can’t say when Feedwhip will be back online.
Another windstorm is forecast for Seattle for tonight (Dec 14) and it promises to be even bigger than the last one. Although we are located right next to a substation, there’s a decent chance we’ll lose power again.
A windstorm blew through Seattle this morning and took our electricity down with it. UPS lasted about 10 minutes, while the full outage itself was about 3 hours.
I’m running some tests now, but it appears everything has recovered gracefully. Thanks for your patience.
Feedwhip suffered a four-hour outage today as our internet connection inexplicably disappeared. I’ll be following this up with both our ISP (drizzle.com) and our DSL provider (Qwest) on Monday.
I’ve spent a lot of time the past few months improving the stability of Feedwhip’s software and servers, and it’s really frustrating to have all that work undermined by careless bandwidth providers. Anyone want to recommend a good, cheap co-lo?
A few minor updates went live today:
- A link to this page, the Feedwhip blog, is now on the menu bar.
- A new vertical ad is on the feed detail page. The bottom ad on the feed detail page tended to get pushed way down to the bottom and out of sight, making it not very useful. So, hopefully this one will get more eyeballs. Ads aren’t paying the bills — not even close — but it is good to know how they might pay the bills if I had, oh, a billion users. Remember, if you don’t like the ads, then sign up for a premium account. It’s still free!
- Fixed a bug in the way “last changed” is displayed on the Your Feeds page. Now each feed tracks its own last changed setting independently. Before, putting feeds into groups messed this up.
Here’s what’s coming down the pipe for Feedwhip in the next few weeks:
- I’ve noticed a significant number of people registering with phone-based email accounts. Unfortunately, these don’t work very well with the complicated style sheets and verbose descriptions that are currently used. So, I’m going to add a new setting that lets people specify whether they want their notifications to be fancy (fully formatted html), simple (plain text), or terse (as few words as possible).
- I want to make it even easier for website designers to use feedwhip. Right now, the “share” page isn’t getting much traction. So, I want users to be able to create new feeds just by hitting a single URL. Something like feedwhip.com/instant/?url=myurl.com. That’s all you would need to do, and if the feed doesn’t exist then feedwhip would instantly create it and take you to the feed detail page.
- As part of the above change, I want to make it easier to export your filter settings (This will be even more important in the context of the next feature I describe). If you create the perfect bunch of settings to extract data from a feed, you should be able to easily share those settings with anyone. There are some privacy issues, where people shouldn’t be able to just plop onto a random page and see what keywords you’re searching for. And there are issues where sometimes people want to subscribe to a bunch of settings and have those settings be changed whenever the original owner changes them, and sometimes they want to subscribe and not have them change ever. I’m not sure which should be the default, or how to easily explain the difference to end users.
- I had a request from a user to be able to track when one specific number on a webpage changes. Right now, we can’t support that, but it’s something I’d love to add. The way it would work is users could create sub-feeds. They’d take the original data from a webpage and run it through some kind of a filter, creating a whole new feed. They could then ask for notifications when the new feed changes. Some examples of filters:
- a “chop” filter which looks for two phrases and cuts out everything before, after, or in between them.
- a general regex (regular expression — basically a fancy text searching notation) filter that lets you look for patterns and extract any content you want
- a general javascript filter — this one is more of a version 2 feature, but the idea is that you could write some arbitrary piece of javascript code that would process the html and then generate whatever you want based on it. Kind of like what GreaseMonkey does for firefox.
- A few places in the UI could use some clean up. The detail page for a feed group, for example, is pretty ugly.
- I think there’s still a bug or two lurking in the notification engine — particularly a race condition that leads to double notifications.
Okay, so I think that’s more like a few months as opposed to a few weeks, but that’s what I’ll be working on for the time being. Visit the Feedwhip contact page if you want to throw in your two cents.
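To give a feel for the sub-feed filters in the list above, here is a rough sketch of a "chop" filter and a regex filter. Neither exists in Feedwhip yet; the function names and behavior (returning null when nothing matches) are assumptions for illustration only.

```php
<?php
// Hypothetical "chop" filter: keep only the text between two marker phrases.
// Returns null when either phrase is missing.
function chop_between(string $text, string $startPhrase, string $endPhrase): ?string
{
    $start = strpos($text, $startPhrase);
    if ($start === false) {
        return null;
    }
    $start += strlen($startPhrase);
    $end = strpos($text, $endPhrase, $start);
    if ($end === false) {
        return null;
    }
    return trim(substr($text, $start, $end - $start));
}

// Hypothetical regex filter: return the first capture group, or null.
function regex_extract(string $text, string $pattern): ?string
{
    return preg_match($pattern, $text, $m) === 1 && isset($m[1]) ? $m[1] : null;
}

// Tracking one specific number on a page:
$price = chop_between('<p>Price: <b>19.99</b> USD</p>', 'Price: <b>', '</b>');
$temp = regex_extract('Temperature: 42 F', '/Temperature: (\d+)/');
```

Here `$price` comes out as the string 19.99 and `$temp` as 42. A sub-feed would then be the stream of these extracted values, with a notification whenever the value differs from the last one.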
Whoops. I put together a script to purge old data out of the database without remembering that I need some old data to compare against the new data coming in. As a result, some unchanged feeds suddenly triggered as changing because there was nothing old to compare against. This only happened for feeds which haven’t changed since at least June, so hopefully it won’t affect too many of Feedwhip’s users.
Back to the drawing board…
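One way to avoid this kind of mistake is to never purge the newest snapshot of a feed, no matter how old it is. The sketch below shows that rule on plain arrays. The record layout (id, feed_id, taken_at) and the function name are assumptions, not Feedwhip's schema.

```php
<?php
// $snapshots: list of array('id' => int, 'feed_id' => int, 'taken_at' => int).
// Returns the ids that are older than $cutoff AND are not the newest
// snapshot of their feed, so every feed keeps something to compare against.
function snapshots_to_purge(array $snapshots, int $cutoff): array
{
    $newest = array();    // feed_id => the snapshot with the latest taken_at
    foreach ($snapshots as $s) {
        $feed = $s['feed_id'];
        if (!isset($newest[$feed]) || $s['taken_at'] > $newest[$feed]['taken_at']) {
            $newest[$feed] = $s;
        }
    }
    $purge = array();
    foreach ($snapshots as $s) {
        $isNewest = $newest[$s['feed_id']]['id'] === $s['id'];
        if ($s['taken_at'] < $cutoff && !$isNewest) {
            $purge[] = $s['id'];
        }
    }
    return $purge;
}

$ids = snapshots_to_purge(array(
    array('id' => 1, 'feed_id' => 1, 'taken_at' => 100),
    array('id' => 2, 'feed_id' => 1, 'taken_at' => 200),
    array('id' => 3, 'feed_id' => 2, 'taken_at' => 50),
), 300);
```

All three snapshots are older than the cutoff, but only snapshot 1 is purged: snapshots 2 and 3 are each the latest copy of their feed.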
Feedwhip was inaccessible today for about an hour starting at 7:30am PST. The DSL connection to our house suffers from sporadic outages, and nobody upstream of me is willing to take responsibility for it. For the record, I’m using Qwest to provide the lines, and drizzle.com is my ISP. If either of them read this and want to make amends, you know where to find me.
The real solution, of course, is to move Feedwhip’s servers to a co-lo (a fancy warehouse full of other computers with redundant electricity and bandwidth). That will happen one day, I’m sure, but for now we just can’t afford it.
Some people (most notably, yours truly) are seeing double notifications and/or doubled items inside of notifications. This only happens for feeds that are in groups. What’s happening is that a walker claims the feed group and starts processing it. At about the same time, a different walker claims one of the feeds inside the group. The first walker, processing the feed group, also processes all the feeds inside the group, just to make sure everybody’s got the latest data, and so the same feed ends up getting processed twice at the same time. This is a pretty classic example of what’s called a race condition in the programming world.
Anyway, boring details aside, there should be a fix for this soon.
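One possible shape for a fix (not necessarily the one that will ship) is to make claiming a group also claim every feed inside it, all or nothing. The class below is a hypothetical in-memory sketch of that rule; a real version would need the claim to be atomic in the database, which this does not attempt.

```php
<?php
// In-memory sketch of a claim table. Illustrates the rule only.
class ClaimTable
{
    private $claimedFeeds = array();    // feed id => walker id

    // Claim every feed in $feedIds for $walkerId, or claim nothing at all.
    public function claim(int $walkerId, array $feedIds): bool
    {
        foreach ($feedIds as $id) {
            if (isset($this->claimedFeeds[$id])) {
                return false;    // someone else already holds one of these feeds
            }
        }
        foreach ($feedIds as $id) {
            $this->claimedFeeds[$id] = $walkerId;
        }
        return true;
    }

    // Give up every feed held by $walkerId.
    public function release(int $walkerId): void
    {
        $this->claimedFeeds = array_filter(
            $this->claimedFeeds,
            function ($owner) use ($walkerId) { return $owner !== $walkerId; }
        );
    }
}

$claims = new ClaimTable();
$first = $claims->claim(1, array(10, 11, 12));    // walker 1 takes the whole group
$second = $claims->claim(2, array(11));           // walker 2 is refused feed 11
```

Because the second walker is turned away while the group is being processed, feed 11 is only handled once.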
With my parents watching the baby, I went down to some local coffee shops to get some work done. Uncomfortable seating and glaring sunshine aside, I managed to make a few changes that should improve the way notifications are handled.
Some feeds were timing out on download. The default timeout was 20 seconds, and this was too short for some that just sat there and did nothing. This was particularly egregious when the webpage was only partially downloaded: we didn’t detect the timeout, so we were seeing all kinds of major changes when the real problem was just a slow feed. Well, now we can detect timeouts when we download. [For the techies among you: we use PHP's internal curl calls to attempt the download. These occasionally fail for no good reason, and when that happens we spawn an external curl process at the unix command line. Detecting timeouts-after-partial-downloads for these latter calls was missing.]
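For the external curl case, here is a sketch of how a timeout after a partial download can be told apart from a good download. The helper names are made up, but the curl details are real: `--max-time` sets a hard time limit, and the curl command exits with status 28 when the operation times out, even if some of the page already arrived.

```php
<?php
// Build the command line for an external curl call with a hard time limit.
function curl_command(string $url, int $timeoutSeconds): string
{
    return 'curl --silent --max-time ' . $timeoutSeconds . ' ' . escapeshellarg($url);
}

// Classify the result of an external curl run from its exit status and output.
function classify_download(int $exitCode, string $body): string
{
    if ($exitCode === 28) {
        return 'timeout';    // timed out, so any body we got is only partial
    }
    if ($exitCode !== 0) {
        return 'error';
    }
    return $body === '' ? 'empty' : 'ok';
}

// Usage sketch (not run here, since it needs the network):
// exec(curl_command('http://example.com/', 20), $lines, $exitCode);
// $outcome = classify_download($exitCode, implode("\n", $lines));
```

The key point is that the exit status is checked before the body is looked at, so half a page is never mistaken for a page that changed.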
Finally, we added a few new fields to the subscriptions table. Instead of just tracking the last time a subscription had an email sent, we now also track the last time it changed (which can vary with your filter settings) and the last time we checked it for updates (since subscription checking can happen on a different schedule than the underlying feed).
Last week’s update introduced some bugs into the notification system. Some people are getting double notifications, and others are receiving entire web pages instead of just the changed parts. I’ve blocked off a big bunch of time to work on this tomorrow, so hopefully things will start running smoothly again soon.
1. The walkers* are slow today due to the continuing DDoS against everydns.net. Current backlog is a few hours. The plan is to wait this one out while looking into our options for improving the reliability of our dns services. Update 3pm: we’re all caught up again.
2. The database was temporarily offline for several hours early this morning. Why does the kernel have to kill my most important process when it runs out of memory? Update 3pm: this server now has a juicy new 4GB swapfile to play with (on top of its existing 2GB ram and 2GB swap) so hopefully this won’t happen very often in the future.
* a “walker” is a piece of code that walks the feed list in the Feedwhip database, looking for feeds that need some attention. You trivia hounds will be delighted to know that we run six walkers simultaneously.
EveryDNS is Feedwhip’s DNS provider, and when they were hit by a massive DDoS last week, Feedwhip felt the effects. Mostly, we felt it because our crawlers slowed down to about a quarter of their usual speed. I suspect this is because some reverse-DNS-lookup queries were timing out and then failing, causing all the downloads to take much longer than usual.
Well, I’m seeing the same thing again today and when I try to visit everydns.net, I find their site is down. Time to move on to another DNS provider, I think. Right now we’re using EveryDNS in a “mirrored primary” configuration — they mirror the configuration on our local DNS server, and I point the master name servers up to them. The idea is that they can continue providing services if my own name server goes down. But if my name server is down, then chances are EVERYTHING at Feedwhip is down. Maybe I should just point everything to me and cut a failure point out of the loop.
FC5 is now installed on the development server, and Cacti is providing some useful insight into how Feedwhip is working. Apparently, my web/mail server is consuming just as much bandwidth as my database/crawler server, which makes me raise my eyebrows. My initial suspicion is that I’m being inundated with spam, but we’ll see…
Here’s how the past two days have gone:
- I want to monitor network bandwidth so I’d like to install Cacti.
- Cacti segfaults when I try to view its install page. The reported solution is to upgrade to PHP 5.2.
- I’m running FC4, which is now unsupported so no new packages are available for it.
- I need to upgrade to FC5. This isn’t working. I am stuck in dependency hell.
If this doesn’t mean anything to you, then suffice it to say that I spent the past two days working very hard and not getting anything done. I’m tempted to just take one of Feedwhip’s servers offline and start from scratch.
Update: 3pm
I think I’ve finally got things straightened out:
- I had downloaded and installed FC5 before updating the kernel
- That led to all kinds of dependency brokenness
- Then I messed up my yum repository information
- Then I tried uninstalling the FC5 RPM, and that made my $releasever environment variable go away
So, to fix things I went back and cleaned up my yum repositories, and pointed them by hand to the FC4 directories. Then I did a “yum upgrade kernel” and that ran with no dependencies. Now I’m running a general “yum upgrade”, and that’s going to take many hours because it’s upgrading just about everything. Next up (later tonight) I’ll go back to #1, above, and see if it’ll finally work.
This is a new blog to chronicle the ongoing development of Feedwhip.
Here’s a few quick bullets to get you up to speed:
- Feedwhip is a free, web-based service that detects changes to other web pages. It then sends you an email or generates an RSS feed based on those changes.
- Feedwhip is run by yours truly, Steve Leroux. I’m an at-home Dad with a 9-month-old daughter and I work on Feedwhip between diaper changes and bottle feedings. I’m also a professional software engineer with more than 10 years’ experience developing for Windows, Unix, and the Web.
- Feedwhip is a little more than a year old right now.
- Feedwhip is hosted on three Fedora Core 4 servers in my basement. The code is written in PHP 5 and I’m using MySQL for the database. Apache 2 serves the pages.
- Feedwhip is written using the Blender framework, which you’ve never heard of because I made it from scratch. It is a PHP-based MVC framework which borrows many ideas from Ruby’s Rails framework. If I ever get enough spare time, I’d love to clean up the code and release Blender as an open-source project. Odds of me getting enough spare time are fairly small.
