On Linux I use the locate command to find files by name, and grep to search the contents of files within a directory recursively. On Windows, I have always used only filename-based search. I don't run Windows' search indexing service, nor do I have Google Desktop installed (they already know a lot about my web searches; why give away more information about me?).
Anyway, the purpose of this post is to pen a couple of my thoughts on Google's recent complaint about Vista's search capabilities (or, more precisely, about Google's ability to replace the built-in functionality with its own).
Unstructured textual data is fundamentally searched using the tf-idf (term frequency-inverse document frequency) technique.
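For the curious, here is a minimal sketch of tf-idf scoring in Python. This is my own toy illustration, not any particular search product's implementation: each term in a document is weighted by how often it appears there, discounted by how many documents contain it at all.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score every term in every document by tf-idf.

    docs: dict mapping document name -> text.
    Returns: dict mapping document name -> {term: score}.
    """
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        scores[name] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return scores

docs = {
    "a.txt": "search index search engine",
    "b.txt": "database index views",
}
scores = tf_idf(docs)
# "search" appears only in a.txt, so it scores high there; "index"
# appears in every document, so its idf is log(2/2) = 0 and it scores 0.
```

A term that appears everywhere (like "the") ends up with a score of zero, which is exactly the behavior that makes tf-idf useful for ranking.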
It works pretty well when no one tries to influence the search results. When this algorithm is applied directly to the web, though, it fails miserably, because webmasters and SEO experts tweak their pages to game the basic algorithm. Then came the $100+ billion idea called PageRank from the Google guys, which figured out a way to beat this biased tweaking. That's why they are No. 1 in internet search today.
However, no one is tweaking the files on your desktop so that theirs show up first, are they? This isn't the case either at home or at work; essentially, intranet information is mostly unbiased. In addition, PageRank, which is based mainly on links between pages, is mostly irrelevant there, since files on disk are mostly non-HTML: PDF, Word, PowerPoint, spreadsheet, and plain text files. Of course, there may be HTML files too, but they aren't the majority. Since PageRank is irrelevant for personal/intranet files, it doesn't matter whether one uses Google's desktop search or any other search based on the tf-idf algorithm mentioned earlier.
Google's argument is that people should have a choice. There are people both for and against Google's position; some believe this should be an OS functionality. Even though I use neither Google Desktop, nor Vista, nor Windows' search indexing service, my thinking is that, since PageRank offers no advantage here, it doesn't matter which one is used.
However, from a technical standpoint, my thought process runs along the following lines, based on my familiarity with materialized views. A materialized view (MV) is a database concept used to speed up queries, much the same way a text index speeds up text search compared with a plain grep, which has to scan the entire text of each file.

MVs can be classified as fast-refreshable or full-refresh. That is, depending on the complexity of the query being materialized, it is either possible to compute the query incrementally (since the last refresh) or necessary to recompute it completely. Luckily, tf-idf is incremental.

MVs can also be refreshed immediately or in deferred mode. In immediate mode, as the underlying table is updated (and on commit), the MV gets recomputed. The equivalent when text-indexing a file system is that as soon as a file is created, updated, or deleted, the text index is updated. The benefit of immediate refresh is that there is no need for a long-running periodic process to update the index. The reason the periodic process takes so long is that it takes a while to figure out which files have already been indexed and which have been modified, created, or deleted since the last pass; with immediate refresh, that work is spread over each file operation. The disadvantage of immediate refresh is that it slows down the individual operations, but that cost is mostly negligible. With dual CPU cores these days, bookkeeping tasks like text indexing can happen on the spare core. The same can be said for the long-running periodic process as well, but the fact remains that it has to scan the entire disk, much like a virus-scanning service.
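To show what "immediate refresh" looks like for a text index, here is a toy inverted index in Python. This is my own sketch, not any product's design: each file event touches only that one file's postings, so no full rescan is ever needed.

```python
from collections import Counter, defaultdict

class IncrementalIndex:
    """Toy inverted index with immediate ("on commit") refresh: each
    create/update/delete event updates only that file's postings,
    analogous to a fast-refreshable materialized view."""

    def __init__(self):
        self.postings = defaultdict(Counter)   # term -> {path: count}
        self.doc_terms = {}                    # path -> set of its terms

    def upsert(self, path, text):
        """Handle a file create or update event."""
        self.delete(path)                      # drop stale postings, if any
        counts = Counter(text.lower().split())
        for term, n in counts.items():
            self.postings[term][path] = n
        self.doc_terms[path] = set(counts)

    def delete(self, path):
        """Handle a file delete event."""
        for term in self.doc_terms.pop(path, ()):
            del self.postings[term][path]

    def search(self, term):
        return dict(self.postings[term.lower()])

idx = IncrementalIndex()
idx.upsert("notes.txt", "vista search vista")
idx.upsert("todo.txt", "search index")
idx.upsert("notes.txt", "linux grep")   # an update replaces old postings
```

Each `upsert` or `delete` is proportional to the size of the one file that changed, which is exactly why spreading the work over file operations beats a periodic whole-disk pass.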
Now, the ability to detect a change to a file lives in the filesystem module, which is core OS functionality. Who can do this more efficiently than the people who wrote the file system in the first place? So, if Microsoft supports this kind of incremental search indexing in the file system, then with the quality of the search results being no different (due to the lack of any PageRank advantage over a file repository), which one would you choose?
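To make the contrast concrete, here is a stdlib-only sketch (my own illustration) of what a periodic indexer without filesystem help must do on every pass: walk the entire tree and compare each file's modification time against the last index time. A filesystem that pushes change notifications to the indexer avoids this scan entirely.

```python
import os

def changed_since(root, last_index_time):
    """Find files modified after last_index_time the hard way: a full
    recursive walk of the tree. This is the cost a periodic indexer
    pays on every pass; filesystem-level change notification hands the
    indexer the same list without any scanning."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) > last_index_time:
                    changed.append(path)
            except OSError:
                pass  # file deleted or unreadable mid-scan
    return changed
```

The walk touches every directory entry on disk regardless of how few files actually changed, which is the waste the in-filesystem approach eliminates.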