HTML to XHTML / XML

HTML did not start out as an XML application, so well-formedness was never enforced. People wrote markup in every possible way, and browsers accepted it. Although the XHTML standard was introduced later, quite a few websites, including popular ones such as Google and Yahoo, still serve a lot of HTML that is not XHTML. It is amazing how well the popular browsers tolerate all those errors and still produce a decent, viewable page.

One of the main reasons to convert an HTML page to XHTML is to be able to apply XPath to the resulting DOM and extract pieces of information easily. People want to do this for various reasons and use cases. This post is mainly about which Java tools are available for the job and what worked best for me.

At http://java-source.net/open-source/html-parsers there is a list of parsers that can handle HTML.

My requirements were:

1. It should be under the LGPL or a similar license that allows inclusion in commercial products.
2. It should be standards-based, so that I can easily replace it with another parser if needed. Hence I was only interested in parsers that produce a DOM built from the org.w3c.dom.* interfaces.
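
To illustrate the second requirement, here is a minimal sketch of what "replaceable" can look like in practice. The interface name HtmlDomParser and the JDK_XML stand-in are my own illustrative names, not part of any library; the stand-in uses the JDK's XML parser and therefore only copes with well-formed input. The point is that the consuming code touches nothing but org.w3c.dom.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class ParserSwap {
    /** Hypothetical seam: any HTML parser that can produce a W3C DOM fits behind it. */
    interface HtmlDomParser {
        Document parse(InputStream in) throws Exception;
    }

    /** Stand-in implementation using the JDK XML parser; handles well-formed input only. */
    static final HtmlDomParser JDK_XML = in ->
            DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);

    /** Consumer code depends only on org.w3c.dom, not on the parser that built the tree. */
    static int countLinks(Document doc) {
        return doc.getElementsByTagName("a").getLength();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><a href='x'>x</a><a href='y'>y</a></body></html>";
        InputStream in = new ByteArrayInputStream(page.getBytes(StandardCharsets.UTF_8));
        System.out.println(countLinks(JDK_XML.parse(in)));
    }
}
```

Swapping in a real HTML parser would then mean writing one more HtmlDomParser implementation and leaving countLinks untouched.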

I started my evaluation with JTidy. The results were not good for some popular websites, probably because JTidy hasn’t been updated since 2001 or so. Then I tried Cobra and eventually NekoHTML. My initial concern with NekoHTML was its Xerces dependency. However, it doesn’t need the entire Xerces distribution and ships with a smaller, pre-packaged Xerces component.

One issue I ran into when parsing a document and writing it back out as XML through a transformation was that I hadn’t set OutputKeys.METHOD to xml or xhtml. As a result, the output file wasn’t valid XML, and it took a while to figure that out.
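
A minimal sketch of that fix, using only the JDK's javax.xml.transform API. The class name DomToXml and the helpers serialize and sampleDoc are illustrative names of mine, and the tiny document is built by hand rather than parsed from a real page. The value "xml" is one of the standard output methods; whether "xhtml" is accepted depends on the transformer implementation, so I have not used it here.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DomToXml {
    /** Serializes a DOM with an explicitly chosen output method ("xml", "html", ...). */
    static String serialize(Document doc, String method) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer(); // identity transform
        t.setOutputProperty(OutputKeys.METHOD, method);
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    /** Hand-built stand-in for a parsed page: html > body > br. */
    static Document sampleDoc() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = doc.createElement("html");
        Element body = doc.createElement("body");
        body.appendChild(doc.createElement("br"));
        html.appendChild(body);
        doc.appendChild(html);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        // With "xml", empty elements are closed, so the output can be parsed again as XML.
        System.out.println(serialize(sampleDoc(), "xml"));
        // With "html", void elements such as br are typically left unclosed, which is not XML.
        System.out.println(serialize(sampleDoc(), "html"));
    }
}
```

When no method is set and the root element is named html, a transformer may fall back to HTML output, which would explain the behaviour I saw.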

Now that I have the HTML as XML, I can use XPath and do whatever transformations I need.
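
As a rough sketch of the XPath step, the following pulls every link target out of a DOM with the JDK's javax.xml.xpath API. The class and method names (XPathOnPage, parse, linkTargets) and the sample markup are assumptions made for illustration; the sample is well-formed, so the JDK XML parser stands in for the HTML parser here. As far as I know, NekoHTML reports element names in upper case by default, so against its output the expression may need to be //A/@href instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathOnPage {
    /** Parses well-formed markup into a W3C DOM (stand-in for the HTML parser). */
    static Document parse(String markup) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
    }

    /** Returns the href value of every a element that has one, in document order. */
    static List<String> linkTargets(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hrefs = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < hrefs.getLength(); i++) {
            result.add(hrefs.item(i).getNodeValue());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<html><body><a href='/one'>1</a><p><a href='/two'>2</a></p></body></html>");
        System.out.println(linkTargets(doc));
    }
}
```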

Filed under HTML/XML/XHTML

3 responses to “HTML to XHTML / XML”

  1. Thanks for the tip!

    Regards,

    Enoch

  2. fariza

    Hi, I am a newbie to Java and would like to extract the DOM representation of an HTML page. I read that NekoHTML allows us to do just that, but how do I go about writing a Java program for it? Is it sufficient to just add the NekoHTML jar files? Do I need to get the Xerces files as well?

    Hope you could give me a head start. Really would appreciate it, thanks.

  3. S

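    Yes, Xerces is required. As I mentioned above, NekoHTML ships a subset of Xerces, but in my case I ended up using the entire Xerces distribution, since Tomcat also comes with Xerces and there is no point in having two versions of it. As for the code, the snippet is below. I have wrapped it in a method so that it stands on its own; urlStr, cto, rto and userAgent are supplied by the caller.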

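    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import org.apache.xerces.parsers.DOMParser;
    import org.apache.xerces.xni.parser.XMLDocumentFilter;
    import org.cyberneko.html.HTMLConfiguration;
    import org.cyberneko.html.filters.Purifier;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;

    public class HtmlFetcher {
        // cto and rto are the connect and read timeouts in milliseconds
        static Document fetch(String urlStr, int cto, int rto, String userAgent) throws Exception {
            URL url = new URL(urlStr);
            HttpURLConnection hcon = (HttpURLConnection) url.openConnection();
            hcon.setConnectTimeout(cto);
            hcon.setReadTimeout(rto);
            hcon.setUseCaches(false);
            hcon.setRequestProperty("User-Agent", userAgent);
            hcon.setRequestProperty("Accept-Charset", "utf-8");
            hcon.setRequestProperty("Keep-Alive", "300");
            // Close the response stream once parsing is done
            try (InputStream instr = hcon.getInputStream()) {
                HTMLConfiguration config = new HTMLConfiguration();
                // Purifier cleans up constructs that would not be well-formed XML
                XMLDocumentFilter[] filters = { new Purifier() };
                config.setProperty("http://cyberneko.org/html/properties/filters", filters);
                DOMParser dp = new DOMParser(config);
                dp.setFeature("http://apache.org/xml/features/dom/include-ignorable-whitespace", false);
                dp.parse(new InputSource(instr));
                return dp.getDocument();
            }
        }
    }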
