The title isn’t really accurate since, well, these aren’t fundamentals. They’re not fun either, so it was my best effort to get some sarcasm into the title.
I love programming in PHP, and I must admit, I do enjoy finding solutions to problems that most people would probably tear their hair out over; but from time-to-time, PHP throws me a curve ball covered in shards of glass with a label attached that says “from PHP with love.”
I’ve been working on some new classes and functions at work to make our web applications faster and better, and part of that involves parsing HTML to extract information. You’d think this would be a relatively easy task in PHP and, as always, you’ve got several ways to accomplish it. The method we use is DOMDocument, as it allows you to use XPath to query for the stuff you want. The problem I had was to do with character encoding: even though I was running all incoming text through iconv to ensure it was UTF-8, DOMDocument often turned the text into complete garbage. It was ignoring my declaration that the text was UTF-8 and trying to autodetect the encoding, resulting in nonsense.
After several hours of trying to work out why DOMDocument was about as obedient as a 16-year-old emo suffering from ADHD, I found a solution, and the best bit of all is that it is painfully simple but not at all obvious (like all PHP bugs that drag on).
If you ever want to pass a valid XHTML document into DOMDocument and you’re sure of the encoding, add an XML declaration to the very top of the HTML if there isn’t one. So if your document begins:
<html><head><title>….
Change it to this:
<?xml version="1.0" encoding="utf-8"?><html><head><title>
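For anyone who wants to see the trick in context, here is a minimal sketch of how it might look in code. It assumes the markup is loaded with DOMDocument::loadHTML() and has already been converted to UTF-8; the helper name withXmlDeclaration() and the sample markup are made up for illustration and are not from our actual code base.

```php
<?php
// Hypothetical helper: prepend an XML declaration unless the markup
// already starts with one. Assumes $html really is in $encoding.
function withXmlDeclaration(string $html, string $encoding = 'utf-8'): string
{
    if (preg_match('/^\s*<\?xml\b/i', $html)) {
        return $html; // already declared, leave it alone
    }
    return '<?xml version="1.0" encoding="' . $encoding . '"?>' . $html;
}

// Sample UTF-8 markup containing non-ASCII characters (illustrative only).
$html = '<html><head><title>Café</title></head><body><p>Grüße</p></body></html>';

// Collect libxml warnings internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML(withXmlDeclaration($html));
libxml_clear_errors();

// Query the parsed document with XPath, as described above.
$xpath = new DOMXPath($doc);
$title = $xpath->query('//title')->item(0)->textContent;

echo $title, PHP_EOL;
```

The check for an existing declaration is there so the helper can be run over every incoming document without doubling up the declaration on ones that already have it.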
Then ask no questions, smile and make the dev team a coffee 🙂