Walkers were randomly dying on certain feeds. After poking around for a bit, I noticed that the HTML for those feeds had some really, really stupid nested tags: hundreds and hundreds of nested <font> tags, for example. Anyway, it turns out that PHP isn’t very good at deep recursion, and when a website hit somewhere around 500 layers of nested HTML tags, PHP threw in the towel on behalf of my html-simplification routines.
It’s becoming clearer and clearer: PHP is simply not a suitable choice for creating stable back-end processes.
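
For what it's worth, the nesting problem itself does not require recursion at all. Here is a minimal sketch in Python (not the PHP routines described above, which aren't shown here) that simplifies HTML with the standard library's event-driven `html.parser`, so nesting depth never touches the call stack. The `ALLOWED_TAGS` whitelist, the `Simplifier` class, and the `simplify` function are all hypothetical names made up for illustration, and dropping every attribute is an assumption about what "simplification" means.

```python
import html
from html.parser import HTMLParser

# Hypothetical whitelist: the real simplification rules are not given in the post.
ALLOWED_TAGS = {"p", "a", "b", "i", "em", "strong", "ul", "ol", "li", "br"}
VOID_TAGS = {"br"}  # tags that never have a closing tag


class Simplifier(HTMLParser):
    """Keeps whitelisted tags (without attributes) and all text; drops the rest."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # output fragments, joined at the end
        self.depth = 0       # current nesting depth, tracked with a counter
        self.max_depth = 0   # deepest nesting seen in the input

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)
        if tag in ALLOWED_TAGS:
            self.out.append(f"<{tag}>")  # attributes are deliberately dropped

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(0, self.depth - 1)
        if tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives unescaped (convert_charrefs=True), so re-escape it.
        self.out.append(html.escape(data, quote=False))


def simplify(markup):
    """Return (simplified_html, max_nesting_depth) without recursing."""
    parser = Simplifier()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out), parser.max_depth


if __name__ == "__main__":
    # 5000 nested <font> tags: far deeper than the ~500 layers that caused trouble.
    nasty = "<font>" * 5000 + "<b>hi</b>" + "</font>" * 5000
    print(simplify(nasty))
```

Because the parser just fires one callback per tag and the depth lives in an integer, a feed with thousands of nested tags costs no more stack than a flat one. The same idea should carry over to any language with a streaming (SAX-style) parser.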

2 comments
August 21, 2007 at 8:46 am
Ansel
BeautifulSoup is a great Python library for dealing with awful HTML. It’s been *extremely* useful for dealing with HTML from the wild wild web.
January 26, 2008 at 11:52 am
MSinclair
PHP users can try htmLawed, which filters specified HTML tags/attributes, balances and properly nests HTML elements, etc.