Yesterday I noticed that the bandwidth graph (thank you, cacti!) for Feedwhip’s backend server had a peak every two hours. Coincidentally, I restart all the walkers every two hours. So, this morning I turned off the restart and just let them go for as long as they wanted.

Sure enough, the bandwidth usage slowly tailed off, and the walkers got further and further behind. A look at the processes on the server revealed a probable cause: memory leaks! Sixteen walker processes were using 150MB each, and the kernel swap process (which handles shuffling data to disk and back when the RAM gets full) was working hard. So, it appears that I need to restart the walkers more often than every two hours; otherwise they’ll slowly eat up memory and their processing speed will slow way down.

Of course, this raises the question: WHY are the walkers leaking memory?

The Blender framework (which Feedwhip is based on) uses some caching of objects to reduce the load on the database, but this cache is cleared at the start of every cycle. There may be circular references living in the cache when it is cleared, and I have to assume/hope that PHP is smart enough to handle those. I’ll take a quick glance through the caching code for bugs, but I don’t think this is the problem.

Blender makes liberal use of PHP’s include statement to pull in templates for sending emails and rendering web pages. A bit of web research has suggested that each one of these includes may be staying in memory. So every time I send an email, my memory consumption goes up a bit. As you might guess, I’m sending a lot of emails. A redesign of the Blender templating system might fix this (if it really is the problem), but that’s a big, messy job that I’d rather not undertake.

I think a better solution is to create a slightly more sophisticated process launcher that automatically rotates processes in and out after, oh, twenty minutes or so. Right now, I’ve just got a scheduled task (aka a cron job) that runs every hour, stops all the walkers and then starts them up again. I could replace that with code that ran once every minute and restarted the oldest process…
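To make the rotation idea concrete, here is a minimal sketch of what one tick of such a launcher might look like. It is written in Python purely for illustration; the `Walker` record, the `rotate_once` function, and the `restart` callback are hypothetical names and not part of Feedwhip or Blender. It assumes the launcher records when each walker was started, and that something like cron calls `rotate_once` once a minute.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional

MAX_AGE_SECONDS = 20 * 60  # the "twenty minutes or so" mentioned above


@dataclass
class Walker:
    """Bookkeeping for one walker process (hypothetical structure)."""
    name: str
    started_at: float  # seconds since the epoch


def pick_walker_to_restart(
    walkers: List[Walker], now: float, max_age: float = MAX_AGE_SECONDS
) -> Optional[Walker]:
    """Return the oldest walker if it has reached max_age, else None."""
    if not walkers:
        return None
    oldest = min(walkers, key=lambda w: w.started_at)
    if now - oldest.started_at >= max_age:
        return oldest
    return None


def rotate_once(
    walkers: List[Walker],
    restart: Callable[[Walker], None],
    now: Optional[float] = None,
    max_age: float = MAX_AGE_SECONDS,
) -> Optional[str]:
    """One scheduler tick: restart at most one walker, the oldest overdue one."""
    if now is None:
        now = time.time()
    victim = pick_walker_to_restart(walkers, now, max_age)
    if victim is None:
        return None
    restart(victim)          # caller-supplied: stop and relaunch the real process
    victim.started_at = now  # the replacement's clock starts from this tick
    return victim.name
```

Because at most one walker is replaced per tick, the restarts end up staggered rather than all landing at once, which should smooth out the bandwidth peaks that come from restarting every walker together. The actual stop and start logic is left to the `restart` callback, since that depends on how the walkers are launched.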

Some thinking on this is in order. In the meantime, increasing my relaunch frequency to once per hour should keep the walkers’ memory consumption under control.

Update, 4 hours later

The refresh-once-per-hour thing is working really well. There’s no backlog, and the bandwidth graph is both smoother and taller (meaning more bits coming in) than before. I’d still like to nail down the memory leak thing, though, because as Feedwhip expands we’ll need to have even more walkers going simultaneously, and we can’t have each walker sucking up 50MB of memory for no good reason.