We all know Google’s PageRank plays a key role in the order of the results displayed on the page. However, based on my research, there is no legal way to look up the PageRank of a website. Google doesn’t provide a web service for this.
Today, I came across an icon on a website that displayed the TrafficRank of that site.
Clicking the link took me to Amazon, where I could see the traffic rank of the website.
After a bit of research, I found that Amazon’s web services provide the TrafficRank and LinkInCount (sort of what PageRank does), and other useful information about a website, via a web service interface. They also have sample code in various languages.
Note that most of Amazon’s web services are not free, but many of them are reasonably priced. For example, the UrlInfo web service, which provides the above info, is free for the first 10,000 requests per month and currently costs a mere $0.15 per 1,000 requests thereafter.
Anyway, I think with Amazon’s service there is at least a legal way of obtaining the popularity of a website.
I needed to use the GBP symbol (the pound sign) in a regular expression. I thought it would be simple, but realized my keyboard doesn’t have that symbol. Then, after a bit of research, I found that I could use the Unicode escape for the pound sign (\u00A3) in the regular expression. Based on the documentation for java.util.regex.Pattern, I tried it and it worked as desired.
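Here is a minimal sketch of the idea. The class name, the method name and the price pattern itself are made up for illustration; the only part taken from the Pattern documentation is that the regex engine understands the \uhhhh escape, so "\\u00A3" in Java source matches the pound sign without it ever being typed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the first pound amount out of a piece of text.
class PoundPattern {
    // The doubled backslash hands the six-character escape to the regex engine,
    // which resolves it to the character U+00A3 (the pound sign).
    static final Pattern PRICE = Pattern.compile("\\u00A3\\s*(\\d+(?:\\.\\d{2})?)");

    // Returns the numeric part of the first pound amount, or null if there is none.
    static String firstAmount(String text) {
        Matcher m = PRICE.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The single-backslash escape below is resolved by the Java compiler instead.
        System.out.println(firstAmount("Total: \u00A3 19.99 incl. VAT")); // prints 19.99
    }
}
```

Both forms of the escape end up matching the same character; the difference is only whether the compiler or the regex engine resolves it.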
HTML started without XML backing, so there were no rules. People and browsers wrote the code in every possible way. So, although the XHTML standard was introduced later on, there are still quite a few websites, including such popular ones as Google and Yahoo, that have a lot of HTML that is not valid XHTML. It is amazing to see how the popular browsers tolerate all those errors and still produce a decent, viewable page.
One of the main reasons to convert an HTML page to XHTML is to make it possible to apply XPath to the resulting DOM and extract pieces of information easily. There are various reasons and use cases for doing this. This post is mainly about which Java tools are out there to achieve this and what worked best for me.
At http://java-source.net/open-source/html-parsers there is a list of parsers that can parse HTML.
My requirements are:
1. It should be under the LGPL or a similar license that allows inclusion in commercial products.
2. It should be standards-based so that I can easily replace it with another parser if needed. Hence I was only interested in parsers that produce a DOM built from the org.w3c.dom.* interfaces.
So, I evaluated the options and started off with JTidy. The results were not good for some popular websites! This is probably because JTidy hasn’t been updated since 2001 or so. Then I tried Cobra and eventually NekoHTML. My initial concern with NekoHTML was the Xerces dependency. However, it doesn’t need the entire Xerces and comes with a smaller, pre-packaged Xerces component.
One issue I had when parsing the document and writing it back out as XML using a transformation was that I hadn’t set OutputKeys.METHOD to xml or xhtml. So the resulting file wasn’t valid XML, and it took a while to figure that out.
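A minimal sketch of that fix, using only the JDK's javax.xml.transform classes. The class and method names are hypothetical, and the tiny document is built by hand instead of coming from an HTML parser. A likely explanation for the original problem (an assumption on my part, based on the XSLT 1.0 rules) is that when the root element is named html and no method is set, the serializer may default to HTML output, which writes empty elements such as br without closing them.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper: writes a DOM back out as well-formed XML.
class DomToXml {
    static String serialize(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            // Without this line the output method is left to the serializer's default.
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Stand-in for the DOM an HTML parser would produce.
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            Element body = doc.createElement("body");
            body.appendChild(doc.createElement("br"));
            html.appendChild(body);
            doc.appendChild(html);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // True if the text can be parsed again as XML, false otherwise.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String xml = serialize(sampleDocument());
        System.out.println(xml);
        System.out.println("well-formed: " + isWellFormed(xml));
    }
}
```

Re-parsing the output, as isWellFormed does, is a cheap way to catch this kind of mistake early.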
Now that I have the HTML as XML, I am able to use XPath and do whatever transformations I need.
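To give an idea of what that looks like, here is a small sketch using the JDK's javax.xml.xpath API. It is not necessarily the exact code from my project: the class name, the method name and the choice of extracting link targets are just for illustration, and the input is assumed to be already well-formed with lowercase element names and no namespace. XPath is case-sensitive, so the expression has to match whatever element-name case the HTML parser produces.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: collects the href of every link in a well-formed page.
class LinkExtractor {
    static List<String> hrefs(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Select the href attribute nodes of all a elements, in document order.
            NodeList nodes = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeValue());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"a.html\">A</a>"
                + "<p><a href=\"b.html\">B</a></p></body></html>";
        System.out.println(hrefs(page)); // prints [a.html, b.html]
    }
}
```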
Right now I am watching America’s Got Talent. A teenage girl just sang a yodel that was so charming. Last week another 11-year-old girl sang a completely different type of song, which is equally difficult, and she blew the judges and the audience away. Before that, a guy did finger tapping. It seems he is the only guy in the world who can do that! And just now, the magician pair who changed their clothes left and right, front and back, so fast and so elegantly, surprised everyone.
While there are some bad performances, the ones that are good are really good. Irrespective of who wins this competition, they are all great people with great talent. It’s unfair to pick just one as the winner when that means comparing apples to oranges, but that’s life. Unlike American Idol, So You Think You Can Dance and other shows, where the comparison is between similar talents, this one is completely different.
Why is this being discussed here at poeticcode?
There are usually heated debates about which programming language is best, which development framework is cool, typed vs. typeless, functional vs. object-oriented, etc. Using popularity, economics or some other barometer, perhaps only one of them will come out as No. 1. But then, they are all winners in their own right!
Just like “Talent” is the only common denominator for “America’s Got Talent”, perhaps “Turing complete” is the only common denominator for the various programming languages.
On the http://froogle.google.com web page, below the search box, they have a section called “A few of the items recently found with Froogle:” which displays a list of keywords that people searched for on Froogle. It appears that this list is dynamic and shows what users have searched for RECENTLY. However, accessing the page several times a day and consolidating the list of unique keywords shows that they keep displaying a random selection from a total of 954 unique searches. So the word “RECENTLY” is definitely misleading.
I expected this list to refresh at least once a day. But even after observing the page for several days, I see the same list of 954 unique searches being used.
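The consolidation step is simple enough to sketch. This is only an illustration: the class and method names are hypothetical, fetching and scraping the page is left out entirely, and lowercasing and trimming the keywords before comparing them is my own assumption about what should count as the same search.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: accumulates the unique keywords seen across page visits.
class KeywordTally {
    private final Set<String> seen = new TreeSet<>();

    // Adds the keywords from one visit and returns how many were not seen before.
    int addSnapshot(List<String> keywords) {
        int added = 0;
        for (String keyword : keywords) {
            // Assumption: case and surrounding whitespace do not distinguish searches.
            if (seen.add(keyword.trim().toLowerCase(Locale.ROOT))) {
                added++;
            }
        }
        return added;
    }

    int uniqueCount() {
        return seen.size();
    }

    public static void main(String[] args) {
        KeywordTally tally = new KeywordTally();
        tally.addSnapshot(Arrays.asList("ipod", "Digital Camera"));
        tally.addSnapshot(Arrays.asList("digital camera", "laptop"));
        System.out.println(tally.uniqueCount()); // prints 3
    }
}
```

If the list really were refreshed with recent searches, the number of new keywords per visit would stay above zero; once it drops to zero and stays there, the pool is fixed.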