HTML started without XML backing, so there were no strict syntax rules: people and browsers wrote and accepted markup in every possible way. Although the XHTML standard was introduced later, there are still quite a few websites, including popular ones such as Google and Yahoo, that serve plenty of HTML which is not valid XHTML. It is remarkable how the major browsers tolerate all those errors and still render a decent, viewable page.
One of the main reasons to convert an HTML page to XHTML is that you can then apply XPath to the resulting DOM and extract pieces of information easily. There are various use cases for doing this. This post is mainly about the Java tools available for the job and what worked best for me.
The page at http://java-source.net/open-source/html-parsers lists a number of parsers that can handle HTML.
My requirements were:
1. It should be LGPL or under a similar license that allows inclusion in commercial products.
2. It should be standards based so that I can easily swap it for another parser if needed. Hence I was only interested in parsers that produce a DOM using the standard org.w3c.dom.* classes.
So I evaluated a few and started off with JTidy. The results were not good for some popular websites, probably because JTidy hasn't been updated since around 2001. Then I tried Cobra and eventually NekoHTML. My initial concern with NekoHTML was its Xerces dependency; however, it doesn't need the entire Xerces distribution and comes with a smaller, pre-packaged Xerces component.
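For reference, here is a minimal sketch of the parsing step with NekoHTML. The URL and the lower-casing property are illustrative choices, not something the original setup prescribes:

```java
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class HtmlToDom {
    public static Document parse(String url) throws Exception {
        DOMParser parser = new DOMParser();
        // NekoHTML upper-cases element names by default; this optional
        // property switches them to lower case, which reads more naturally
        // in XPath expressions later on.
        parser.setProperty(
            "http://cyberneko.org/html/properties/names/elems", "lower");
        parser.parse(new InputSource(url));
        return parser.getDocument();  // a standard org.w3c.dom.Document
    }
}
```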
One issue I ran into when writing the document back out as XML via a Transformer was that I hadn't set OutputKeys.METHOD to "xml". As a result, the output file wasn't well-formed XML, and it took a while to figure that out.
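A sketch of the serialization step with that property set (the output file name is arbitrary):

```java
import java.io.File;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class DomWriter {
    public static void write(Document doc, File out) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        // Without this, the identity transformer may default to the "html"
        // output method (it does so when the root element is html) and emit
        // markup that is not well-formed XML.
        t.setOutputProperty(OutputKeys.METHOD, "xml");
        t.setOutputProperty(OutputKeys.INDENT, "yes");
        t.transform(new DOMSource(doc), new StreamResult(out));
    }
}
```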
Now that I have the HTML as XML, I can use XPath and do whatever transformations I need.
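As an illustration, extracting all link targets from the parsed document might look like this, assuming the parser was configured for lower-case element names as in the sketch above (otherwise the expression would be "//A/@href"):

```java
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class LinkExtractor {
    public static void printLinks(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Select the href attribute of every anchor element in the document.
        NodeList hrefs = (NodeList) xpath.evaluate(
            "//a/@href", doc, XPathConstants.NODESET);
        for (int i = 0; i < hrefs.getLength(); i++) {
            System.out.println(hrefs.item(i).getNodeValue());
        }
    }
}
```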