The real answer? I don’t know! But I like to guess.
Remember the days when Google used to display the total number of pages it had indexed? It was something like 4 billion. Then one fine day, they changed it to 8 billion or so. Within a week, I think Yahoo! said they indexed even more than that (18 billion or some such large number, I think, even though their searches consistently returned fewer results than Google’s). The thing is, the web is exploding. And with all the social networking, Web 2.0, blogs etc., the rate at which content is generated keeps increasing (including this blog!). Which means that, whether it is the PageRank algorithm or some other extremely smart algorithm, at such huge volumes of data some amount of noise is bound to exist. Add to this the continuous quest by smart webmasters and SEO gurus to beat Google’s algorithms, which has only contributed to suboptimal or undesired pages bubbling up.
Also, one of the reasons for the sudden explosion in the number of pages Google (and Yahoo!) indexed is that they started indexing pages that contained a few parameters in their URLs. That is, in the past, search engines crawled mostly static URLs, and perhaps allowed at most 1 additional parameter. But I believe Google now handles up to 3, perhaps more. Even otherwise, people realized this limitation (or behavior) of the search engines and, as a result, started exposing their content as static URLs using techniques such as URL rewrites, subdomains etc. Who would want to do this? Well, mostly retailers who want their entire product catalog available for public search. No wonder that when you search “site:amazon.com” on Google, you get more than 20 million results!
Considering all the above reasons, when a single search engine has to search these billions of pages (which one day will probably hit a googol of pages, a pretty long shot), it’s imperative that this collective knowledge (or spam, or whatever you call it) be intelligently broken into smaller subsets, so that people can search within the subsets that are most relevant to them.
This is interesting! Because Yahoo!, which kind of lost to Google in search volume, was back in the 90s the revolutionary search engine that organized web pages into various categories that people could search or browse. So, after all the page ranks and other search engine techniques, we are kind of back to categorization (well, this is an oversimplification and I don’t deny the importance of all these algorithms).
The only fundamental difference, though, is that instead of a single entity controlling the categories on its own servers, the CSE concept allows individuals to categorize websites in whatever way they want. This is quite powerful: there is no bottleneck of someone having to review and approve, subject matter experts know best how to classify the content, and innovative ideas like Fortune 500 Companies Search and My Meta Matrimonials become possible. Several other CSEs can be found at cselinks.com.
In addition to improving the quality of the search results, another benefit to the search engine companies, if people start using more and more of these CSEs, is that the load on their servers will decrease. At first thought, one might think that introducing so many custom search engines would add a lot of load on these companies’ servers. However, given that each CSE covers a much smaller subset of the entire web, the amount of work the search engine needs to do per query is probably far less!
I don’t know the exact algorithms used to make these CSEs much faster. But intuitively, I think of searching a CSE as similar to searching a single partition of a range/list/hash partitioned table in a database. Without such partitioning, the database has to do a Full Table Scan over the whole table instead of a Full Table Scan within a single partition (for those queries that require an FTS). Of course, the fundamental difference between the two, the CSE and partitioned tables, is that in the case of the CSE the data can be partitioned in as many different ways as people like, while in the case of partitioned tables the data is partitioned upfront in one and only one way.
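To make the partition analogy a bit more concrete, here is a minimal Python sketch. It is only an illustration of the idea, not how any real database or search engine does it: the rows, the `site` partition key and the function names are all made up for this example. It simply counts how many rows have to be examined with and without partitioning.

```python
from collections import defaultdict

# Hypothetical toy "table"; the rows and the "site" column are invented for illustration.
ROWS = [
    {"site": "a.com", "text": "red shoes"},
    {"site": "a.com", "text": "blue hat"},
    {"site": "b.com", "text": "red car"},
    {"site": "c.com", "text": "green shoes"},
]


def build_partitions(rows, key):
    """List-partition the rows once, upfront, by the value of one column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return partitions


def full_scan(rows, predicate):
    """Examine every row; return the matches and how many rows were examined."""
    examined = 0
    matches = []
    for row in rows:
        examined += 1
        if predicate(row):
            matches.append(row)
    return matches, examined


def partition_scan(partitions, partition_value, predicate):
    """Full scan restricted to a single partition."""
    return full_scan(partitions.get(partition_value, []), predicate)


def has_red(row):
    return "red" in row["text"].split()


PARTITIONS = build_partitions(ROWS, "site")
all_matches, all_examined = full_scan(ROWS, has_red)
part_matches, part_examined = partition_scan(PARTITIONS, "a.com", has_red)
```

In this toy data the unpartitioned scan examines all four rows, while the scan restricted to the `a.com` partition examines only two. Note that the table here is partitioned in exactly one way (by `site`), which is the limitation that a CSE does not have.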
My knowledge of information retrieval is limited to how Lucene works. Per that, each token has a list of the documents it occurs in, and a search is a matter of taking the search keyword (complex search expressions with booleans are more sophisticated, but the fundamental concept remains the same, at least enough to keep this discussion simple) and walking through that token’s vector of documents. The key thing, though, is the order in which this vector is presented to the user. This is where Google’s PageRank should come into the picture. Lucene uses a priority queue to get the top 100 most relevant results and in the process also gets the total count. Google probably does the same, but as part of that it now has to apply a filter based on the sites specified in the CSE. One thing to note is that Google allows giving higher priority to a site of your choice over another (even if the other has a higher PageRank), which means the relevancy has to be calculated dynamically based on the CSE. Anyway, with this simple change to the existing algorithm, I don’t really see much performance benefit from the CSE, as it is in fact adding the extra operation of filtering. However, a different approach would be to build smaller indexes for the subsets (keeping the docids the same as in the global index and just processing the existing index), which would avoid the filter operation as well as having to loop through all the sites that are not part of the CSE. The downside of this is the additional disk space.
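The two approaches above can be sketched in a few lines of Python. This is purely my guess at the shape of the idea, not Lucene’s or Google’s actual implementation: the documents, the `SCORE` table (a made-up stand-in for a PageRank-like relevance score) and all function names are assumptions for illustration, and the per-site priority boost is left out to keep it short.

```python
import heapq
from collections import defaultdict

# Hypothetical corpus: docid -> (site, text). All values are invented.
DOCS = {
    1: ("a.com", "cheap red shoes"),
    2: ("b.com", "red shoes on sale"),
    3: ("c.com", "red car"),
    4: ("a.com", "blue shoes"),
}
# Made-up static scores standing in for a PageRank-like relevance.
SCORE = {1: 0.2, 2: 0.9, 3: 0.5, 4: 0.7}


def build_index(docs):
    """Global inverted index: token -> sorted list of docids containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        _, text = docs[doc_id]
        for token in set(text.split()):
            index[token].append(doc_id)
    return index


def search_with_filter(index, docs, token, cse_sites, k=100):
    """Approach 1: walk the global postings and filter by site at query time."""
    hits = [d for d in index.get(token, []) if docs[d][0] in cse_sites]
    # heapq.nlargest plays the role of the priority queue for the top k.
    return heapq.nlargest(k, hits, key=SCORE.get), len(hits)


def build_sub_index(index, docs, cse_sites):
    """Approach 2: derive a smaller index from the global one, same docids."""
    sub_index = {}
    for token, postings in index.items():
        kept = [d for d in postings if docs[d][0] in cse_sites]
        if kept:
            sub_index[token] = kept
    return sub_index


def search_sub_index(sub_index, token, k=100):
    """Query the pre-built subset index; no per-query site filter needed."""
    hits = sub_index.get(token, [])
    return heapq.nlargest(k, hits, key=SCORE.get), len(hits)


INDEX = build_index(DOCS)
CSE_SITES = {"a.com", "c.com"}  # the sites in a hypothetical CSE
SUB_INDEX = build_sub_index(INDEX, DOCS, CSE_SITES)

filtered = search_with_filter(INDEX, DOCS, "shoes", CSE_SITES)
prebuilt = search_sub_index(SUB_INDEX, "shoes")
```

Both calls return the same top results and total count for the CSE. The difference is where the work happens: the first filters every posting on each query, while the second pays the filtering cost once at build time and stores an extra copy of the relevant postings, which is the disk space trade-off mentioned above.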
So, while I have presented two possible ways of implementing CSEs, I am sure there are much better ways of doing the same that the smart PhDs at Google may have figured out.
Irrespective of performance, the most important point is that with better-quality search results, Google will be able to provide more targeted ads, because it has an additional context for the search: the context of the CSE. Obviously, more targeted ads mean more revenue, and perhaps that extra revenue justifies the extra cost, should there be no better way to optimize CSE retrieval compared to regular search.