Category Archives: Google

PageRank vs Delicious Tags

How can Yahoo! improve its search results? Google nailed it down for about a decade using their so-called PageRank algorithm. But because it's a patented technology, others can't copy it. However, it's not just PageRank that really improved the search results. Another key element of Google's approach is giving more importance to the anchor text that linking pages use to describe the target page. This is quite powerful because someone who provides ERP application performance tuning services can stuff a bunch of keywords onto every page of their website, whether or not a page is really about that topic. But sites that link to any of those pages will use only a handful of keywords to describe the page appropriately.

The very concept of identifying what a page is about from an external reference to it is similar to deriving what a page is about from the tags used to bookmark that page on del.icio.us. So, instead of relying purely on the keywords listed on a page while crawling and indexing it, Yahoo! could look up the tags associated with that page in the del.icio.us repository and use them to give extra weight to the keywords appearing in the tags. One good thing about this approach is that the page's tag cloud on del.icio.us reflects what people generally think the page is about. For example, even though tocloud.com provides tag cloud generation tools, when someone sees that page, they think of various other things such as tagcloud, tagging, seo, etc. But tagcloud should get more weight than seo, because the del.icio.us tag cloud for tocloud.com shows the tagcloud tag in a much bolder font than the seo tag.
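To make the idea concrete, here is a minimal sketch in Python of how tag counts could be folded into a page's keyword weights. The fetch_tag_counts helper, the weighting factor, and the numbers are all made up; del.icio.us's actual API and Yahoo!'s real indexing pipeline are obviously not this simple.

```python
# Minimal sketch: boost a page's on-page keyword scores with tag counts
# pulled from a social-bookmarking service. fetch_tag_counts() is a
# hypothetical stand-in for a del.icio.us lookup.

def fetch_tag_counts(url):
    """Pretend lookup: how many users tagged `url` with each tag."""
    # In reality this would call the bookmarking service's API.
    return {"tagcloud": 120, "tagging": 45, "seo": 8}

def weighted_keywords(on_page_scores, url, tag_boost=2.0):
    """Combine on-page keyword scores with social tag counts.

    on_page_scores: dict of keyword -> score derived from the page itself.
    tag_boost: hypothetical multiplier controlling how much tags matter.
    """
    tag_counts = fetch_tag_counts(url)
    total_tags = sum(tag_counts.values()) or 1
    combined = dict(on_page_scores)
    for tag, count in tag_counts.items():
        # Tags used by more people get proportionally more extra weight.
        combined[tag] = combined.get(tag, 0.0) + tag_boost * count / total_tags
    return combined

print(weighted_keywords({"tagcloud": 1.0, "generator": 0.5}, "http://www.tocloud.com"))
```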

Of course, just as Google has to deal with issues such as link farms and backlink manipulation, people may start creating fake accounts and tagging their pages with all sorts of keywords to influence the search results. So Yahoo! would need some clever algorithms to tell fake taggers from real users and filter out any such manipulation.

Now, what will live.com do to figure out the true purpose of a website? Whoever finds an answer can start the third search engine that could be quite successful.

Filed under del.icio.us, Google, search engine, Yahoo!

Why did Google introduce Custom Search Engines?

The real answer? I don't know! But I'd like to guess.

Remember the days when Google used to display the total number of pages they had indexed? Which was something like 4 billion? Then one fine day, they changed it to around 8 billion. Within a week, I think Yahoo! claimed they indexed even more than that (18 billion or some such large number, even though their results consistently returned fewer hits than Google's). The thing is, the web is exploding. And with all the social networking, Web 2.0, blogs, etc., the rate at which content gets generated only keeps increasing (including this blog!). Which means that no matter whether it is the PageRank algorithm or some other extremely smart algorithm, at such huge volumes of data, noise is bound to exist. Add to this the continuous quest by smart webmasters and SEO gurus to beat Google's algorithms, which only contributes to suboptimal/undesired pages bubbling up.

Also, one of the reasons for the sudden explosion in the number of pages Google (and Yahoo!) index is that they started indexing pages that contain a few parameters in their URLs. In the past, mostly static URLs were crawled by the search engines, with perhaps at most one additional parameter allowed. I believe Google now handles up to three or perhaps more. Either way, people realized this limitation (or behavior) of the search engines and, as a result, started opening up their content as static URLs using techniques such as URL rewrites, subdomains, etc. Who would want to do this? Mostly retailers who want their entire product catalog available for public search. No wonder that when you search “site:amazon.com” on Google, you get more than 20 million results!
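As an aside, here is a toy Python sketch of the kind of mapping a URL rewrite performs. Real sites would configure this in their web server, and the path layout and parameter names here are purely illustrative.

```python
# Rough illustration of the URL-rewrite trick: expose a dynamic catalog URL
# with query parameters as a static-looking path that crawlers are more
# willing to index. The URL structure below is invented for illustration.
from urllib.parse import urlparse, parse_qs

def to_static_url(dynamic_url):
    """Map e.g. /catalog?category=books&item=1234 to /catalog/books/1234."""
    parsed = urlparse(dynamic_url)
    params = parse_qs(parsed.query)
    category = params.get("category", ["misc"])[0]
    item = params.get("item", [""])[0]
    return f"{parsed.path}/{category}/{item}"

print(to_static_url("http://example.com/catalog?category=books&item=1234"))
# -> /catalog/books/1234  (the web server would map this back internally)
```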

Considering all the above reasons, it's obvious that for a single search engine covering these billions of pages (which one day will probably hit a googol of pages – a pretty long shot), this collective knowledge (or spam, or whatever you call it) has to be intelligently broken into smaller subsets so that people can search within the subsets most relevant to them.

This is interesting! Because Yahoo!, which has since kind of lost to Google in search volume, was back in the '90s the revolutionary search engine that organized web pages into various categories people could search and browse. So, after all the PageRanks and other search engine techniques, we are in a way back to categorization (this is an oversimplification, and I don't deny the importance of all these algorithms).

The only fundamental difference, though, is that instead of the categories being controlled by a single entity on its own servers, the CSE concept allows individuals to categorize websites in whatever way they want. This is quite powerful: there is no bottleneck of someone having to view and approve, subject matter experts know best how to classify the content, and innovative ideas like Fortune 500 Companies Search and My Meta Matrimonials become possible. Several other CSEs can be found at cselinks.com.

In addition to improving the quality of the search results, another benefit to the search engine companies, if people start using more and more of these CSEs, is that the load on their servers will decrease. At first thought, one might think that introducing so many custom search engines would put a lot of load on these companies' servers. However, given that each CSE covers a much, much smaller subset of the entire web, the amount of work the search engine needs to do is probably far less!

I don't know the exact algorithms used to make these CSEs fast. But intuitively, I think of searching a CSE as similar to searching a single partition of a range/list/hash partitioned table in a database. Without such partitioning, the database has to do a Full Table Scan of the entire table instead of scanning just a single partition (for those queries that require an FTS). Of course, the fundamental difference between the two, CSEs and partitioned tables, is that with a CSE the data can be partitioned in as many different ways as desired, while a partitioned table is partitioned upfront in one and only one way.

My knowledge of information retrieval is limited to how Lucene works. In Lucene, each token has the list of documents it occurs in, and searching is a matter of taking the search keyword (complex search expressions with booleans are more sophisticated, but the fundamental concept remains the same, or at least let's keep this discussion simple) and walking through that token's vector of documents. The key thing, though, is the order in which this vector is presented to the user, and this is where Google's PageRank comes into the picture. Lucene uses a priority queue to get the top 100 most relevant results and, in the process, also gets the total count. Google probably does the same, but for a CSE it would additionally apply a filter based on the sites specified in the CSE. One thing to note is that Google allows giving a site of your choice higher priority than the others (even if they have a higher PageRank), which means the relevancy has to be calculated dynamically based on the CSE. With this simple change to the existing algorithm, though, I don't really see much performance benefit for the CSE, since it in fact adds an extra filtering operation.

A different approach would be to build smaller indexes for the subsets (keeping the doc ids the same as in the global index and just reprocessing the existing index), which avoids the filter operation as well as having to loop through all the postings from sites not part of the CSE. The downside is the amount of additional disk space.
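To make the trade-off concrete, here is a toy Python model of both approaches. It is only a sketch of the idea: the posting-list format, scores, and site names are invented, and this is not how Google or Lucene actually implement it.

```python
# Toy model of the two approaches, using a Lucene-style inverted index
# simplified to token -> list of (doc_id, site, score).
import heapq

POSTINGS = {
    "erp": [(1, "example.com", 0.9), (2, "blog.example.org", 0.4),
            (3, "shop.example.net", 0.7)],
}

# Approach 1: search the global index and filter by the CSE's site list.
def search_with_filter(token, cse_sites, top_n=100):
    hits = [(score, doc_id) for doc_id, site, score in POSTINGS.get(token, [])
            if site in cse_sites]                 # extra per-posting filter
    return heapq.nlargest(top_n, hits)            # priority-queue style top-N

# Approach 2: pre-build a smaller index per CSE (same doc ids as the global
# index), so the filter and the skipped postings disappear at query time,
# at the cost of extra disk space.
def build_cse_index(cse_sites):
    return {tok: [p for p in plist if p[1] in cse_sites]
            for tok, plist in POSTINGS.items()}

def search_cse_index(cse_index, token, top_n=100):
    hits = [(score, doc_id) for doc_id, _, score in cse_index.get(token, [])]
    return heapq.nlargest(top_n, hits)

cse = {"example.com", "shop.example.net"}
print(search_with_filter("erp", cse))
print(search_cse_index(build_cse_index(cse), "erp"))
```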

So, while I presented two possible ways of implementing CSEs, I am sure there are much better ways of doing this that the smart PhDs at Google have figured out.

Irrespective of the performance, the most important point is that with better-quality search results, Google will be able to serve more targeted ads, because Google has additional context for the search: the context of the CSE. Obviously, more targeted ads mean more revenue, and perhaps that extra revenue justifies the extra cost, should there be no better way to optimize CSE retrieval compared to regular search.

Filed under custom search engine, Google, Google CSE, Google Search, Information Retrieval, search engine

Google’s PageRank, Amazon’s TrafficRank, LinkInCount

We all know Google's PageRank plays a key role in how results are ordered on the page. However, based on my research, there is no legal way to query the PageRank of a website; Google doesn't provide a web service for this.

Today, I came across an icon on a website that displayed the TrafficRank of that site. Clicking the link

http://www.amazon.com/exec/obidos/tg/url/-/www.clickbot.net/104-9220080-0165529

took me to Amazon, where I could see the traffic rank of the website.

After a bit of research, I found that Amazon's web services provide the TrafficRank and LinkInCount (sort of what PageRank does), along with other useful information about a website, through a web service interface. They also have sample code in various languages.

Note that most of Amazon's web services are not free, but many of them are reasonably priced. For example, the UrlInfo web service, which provides the above information, is free for the first 10,000 requests per month and currently charges a mere $0.15 per 1,000 requests thereafter.
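For a sense of scale, a quick back-of-the-envelope calculation using the quoted pricing (the helper below just encodes that arithmetic and is not part of Amazon's API):

```python
# Cost estimate for the pricing quoted above: first 10,000 requests per month
# free, then $0.15 per 1,000 requests. Numbers come from the post itself.
def monthly_cost(requests, free_quota=10_000, rate_per_1000=0.15):
    billable = max(0, requests - free_quota)
    return billable / 1000 * rate_per_1000

print(monthly_cost(10_000))   # 0.0  -> within the free tier
print(monthly_cost(110_000))  # 15.0 -> 100,000 billable requests, about $15
```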

Anyway, I think with Amazon's service, there is at least a legal way of obtaining the popularity of a website.

Filed under amazon web services, Google

“A few of the items recently found with Froogle:”

On the http://froogle.google.com web page, below the search box, there is a section called “A few of the items recently found with Froogle:” which displays a list of keywords that people searched for on Froogle. It appears at first that this list is dynamic and shows what users have searched for RECENTLY. However, accessing the page several times a day and consolidating the list of unique keywords shows that they randomly rotate through a list of 954 unique searches. So the word “RECENTLY” is definitely misleading.

I expected this list to refresh at least once a day. But even after observing the page for several days, I see the same list of 954 unique searches being used.

Filed under Froogle, Google