Monthly Archives: June 2007

PageRank vs Delicious Tags

How can Yahoo! improve its search results? Google has had this nailed down for about a decade with its so-called PageRank algorithm. But because it is patented technology, others can't simply copy it. However, it's not just PageRank that really improved the search results. Another key element of Google's approach is giving more importance to the anchor text that links use to describe the target page. This is quite powerful because someone who provides ERP application performance tuning services can stuff a bunch of keywords onto every page of the website whether or not a page is really about that topic, but sites that link to any of those pages will use only a handful of keywords to describe the page appropriately.

The very concept of identifying what a page is about from an external reference to it, using a particular description, is similar to deriving what a page is about from the tags used to bookmark it on del.icio.us. So, instead of relying purely on the keywords listed on a page while crawling and indexing it, Yahoo! could look up the tags associated with that page in the del.icio.us repository and combine them to give extra weight to the keywords used in the tags. One good thing about this approach is that the page's tag cloud on del.icio.us tells you what people generally think the page is about. For example, even though tocloud.com provides tag cloud generation tools, when people see that page they think of various other things such as tagcloud, tagging, seo, etc., but with more weight on tagcloud than on seo, because the del.icio.us tag cloud for tocloud.com shows the tagcloud tag much bolder than the seo tag.
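To make the idea concrete, here is a rough sketch of what such a combined score might look like. The helper names (keywordScore, fetchDeliciousTags) and the data shapes are purely hypothetical placeholders of mine, not any real Yahoo! or del.icio.us API:

```javascript
// Hypothetical sketch: blend a page's keyword score with its del.icio.us tags.
// keywordScore and fetchDeliciousTags are made-up placeholders supplied by the caller.
function rankScore(pageUrl, query, keywordScore, fetchDeliciousTags) {
  var baseScore = keywordScore(pageUrl, query);  // e.g. tf-idf over the page text

  var tags = fetchDeliciousTags(pageUrl);        // e.g. { tagcloud: 120, seo: 15 }
  var tagBoost = 0;
  for (var tag in tags) {
    if (query.indexOf(tag) !== -1) {
      // More bookmarks carrying a matching tag means a bigger boost,
      // so "tagcloud" would count for more than "seo" on tocloud.com.
      tagBoost += Math.log(1 + tags[tag]);
    }
  }
  return baseScore + tagBoost;
}
```

The logarithm is just one way to keep a heavily used tag from completely drowning out the on-page score.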

Of course, just as Google has to deal with issues such as link farms and bought backlinks, people may start creating fake accounts and tagging their own pages with all sorts of keywords to influence the search results. So Yahoo! would need some clever algorithms to distinguish fake users from real users doing the tagging, and to filter out any such manipulation.

Now, what will live.com do to figure out the true purpose of a website? Whoever finds an answer can start the third search engine, and it could be quite successful.



Filed under del.icio.us, Google, search engine, Yahoo!

Desktop Search

On Linux, I use the locate command to find files by name and grep to search the contents of files within a directory (recursively). On Windows, I have always used only filename-based search. I don't run the Windows search indexing service, nor do I have Google Desktop installed (they already know a lot about my web searches; why give away more info about me?).

Anyway, the purpose of this post is to pen a couple of my thoughts on Google's recent complaint about Vista's search capabilities (or, to be more precise, about Google's inability to replace the built-in functionality with its own).

Unstructured textual data is fundamentally searched using the technology outlined at

http://en.wikipedia.org/wiki/Tf-idf

It works pretty well when no one tries to influence the search results. When this algorithm is applied directly to the web, it fails miserably because webmasters and SEO experts tweak their pages to game the basic algorithm. Then came the $100+ billion idea called PageRank, from the Google guys, which figured out a way to beat this biased tweaking. That's why they are No. 1 in internet search today.
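For reference, here is a minimal toy sketch of the tf-idf scoring described on that Wikipedia page; the names and the simplifications are my own, not any particular library's implementation:

```javascript
// Minimal tf-idf sketch:
//   tf  = how often the term appears in one document,
//   idf = how rare the term is across the whole collection.
function tfIdf(term, doc, allDocs) {
  var tf = doc.filter(function (w) { return w === term; }).length / doc.length;

  var docsWithTerm = allDocs.filter(function (d) {
    return d.indexOf(term) !== -1;
  }).length;
  if (docsWithTerm === 0) { return 0; }

  var idf = Math.log(allDocs.length / docsWithTerm);
  return tf * idf;
}

// Example: documents represented as arrays of lowercased words.
var docs = [["erp", "tuning", "services"], ["erp", "erp", "consulting"]];
tfIdf("erp", docs[1], docs);    // 0: "erp" is in every document, so it carries no weight
tfIdf("tuning", docs[0], docs); // positive: "tuning" is rarer across the collection
```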

However, files on your desktop are not being tweaked by anyone trying to make their files show up first, are they? That isn't the case either at home or at work; essentially, intranet information is mostly unbiased. In addition, PageRank, which is based mainly on links between pages, is mostly irrelevant, since files on the disk are mostly non-HTML files: PDFs, Word documents, presentations, spreadsheets, text files. Of course, there may be HTML files too, but they are not the majority. Since PageRank is irrelevant for personal/intranet files, it doesn't matter much whether one uses Google's desktop search or any other search based on the tf-idf algorithm mentioned earlier.

Google's argument is that people should have a choice. There are people both for and against Google's position; some believe this should be OS functionality. Even though I don't use Google Desktop, Vista, or the Windows search indexing service, my thinking is that since the PageRank algorithm offers no advantage here, it doesn't matter which one is used.

However, from a technical standpoint, I have the following thought process, based on my familiarity with materialized views. A materialized view (MV) is a database concept used to speed up queries, much the same way a text index helps speed up text search compared to a plain grep, which has to scan the entire text of each file. MVs can be classified as fast-refreshable or full-refresh: depending on the complexity of the query being materialized, it is either possible to compute the query incrementally (since the last refresh) or necessary to recompute it completely. Luckily, tf-idf can be maintained incrementally. MVs can also be refreshed immediately or in deferred mode. In immediate mode, as the underlying table is updated (and on commit), the MV gets recomputed.

The equivalent, in the case of text-indexing a file system, is that as and when a file is created, updated, or deleted, the text index is updated immediately. The benefit of immediate refresh is that there is no need for a periodic long-running process to update the index. The reason the periodic process takes so long is that it has to figure out which files have already been indexed and which have been modified, created, or deleted since the last run; with immediate refresh, that work is spread over each file operation. The disadvantage of immediate refresh is that it slows down individual file operations, but that cost is mostly negligible. With dual CPU cores these days, bookkeeping tasks like text indexing can run on the spare core. The same can be said for the long-running periodic process, but the fact remains that it has to scan the entire disk, much like a virus-scanning service.
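To illustrate the immediate-refresh idea, here is a rough sketch of my own (using Node.js-style file APIs purely for illustration; this is not how Vista, Google Desktop, or any real indexer is implemented). Each file change triggers a small, local index update instead of a periodic full-disk scan:

```javascript
// Rough sketch of immediate (incremental) index refresh for a directory tree.
var fs = require("fs");
var path = require("path");

var index = {};     // term -> { filename: true, ... }
var fileTerms = {}; // filename -> [terms], so a file's stale postings can be dropped

function reindexFile(filename) {
  // Drop the old postings for this file (handles both updates and deletes).
  (fileTerms[filename] || []).forEach(function (term) {
    if (index[term]) { delete index[term][filename]; }
  });
  delete fileTerms[filename];

  if (!fs.existsSync(filename)) { return; } // the file was deleted

  var terms = fs.readFileSync(filename, "utf8").toLowerCase().split(/\W+/);
  fileTerms[filename] = terms;
  terms.forEach(function (term) {
    if (!term) { return; }
    index[term] = index[term] || {};
    index[term][filename] = true;
  });
}

// Each create/update/delete triggers a small amount of work right away,
// so no periodic full-disk scan is needed.
var root = "/home/me/docs";
fs.watch(root, { recursive: true }, function (event, filename) {
  if (filename) { reindexFile(path.join(root, filename)); }
});
```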

Now, the ability to detect a change to a file lives in the filesystem module, which is core OS functionality. Who can do this more efficiently than the one who wrote the file system in the first place? So, if Microsoft supports this kind of incremental file-system search indexing, and the quality of the results is no different (because PageRank offers no advantage in a file repository), which one would you choose?


Filed under desktop search, google desktop, vista search

Even a good site can be insecure

It's a long story, but I ended up at doba.com through Guy Kawasaki's LinkedIn profile. I liked the concept and was about to register for the affiliate program. Then I noticed that the registration form, which requires an SSN/Tax ID, has to be submitted over HTTP and not HTTPS. That's when I gave up. The thing is, even a popular website (this site has an Alexa rank of 5,901 as of this writing) can be insecure.


Filed under online security

OOP (Object Oriented Programming / Pain) with JavaScript

As Web 2.0 applications typically make heavy use of JavaScript (including AJAX) to provide better usability, the client-side scripting code is going to bloat. Managing this code as a bunch of variables and functions soon becomes unmanageable, so it's better to start using the OOP features of JavaScript. A few other reasons for using OOP, beyond code bloat: it reduces global-namespace proliferation and the potential for name conflicts, and when providing a library for others to create mashups of your service (like the popular Google Maps), abstracting the entire API through a set of objects is the better solution.

Anyway, I have just been experimenting with the OOP concepts and learned something new. Apparently, using this.variable makes a variable public, while declaring it as "var variable" makes it private. I don't have any problem with that. What I am not happy about is that a public variable must always be accessed as this.variable within the other member functions, instead of directly as variable; private variables defined using var, however, can be accessed directly. In Java, for example, irrespective of whether public, protected, or private is used to define a member variable, it can always be accessed within a member function without qualifying it with the this reference. That is not the case with JavaScript, which is a bit of a pain and has been one source of my bugs.
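A minimal sketch of what I mean (Counter is just a made-up example constructor):

```javascript
// Public vs. private members in a JavaScript constructor function.
function Counter() {
  var secret = 0;   // private: exists only inside this closure
  this.total = 0;   // public: a property of the constructed object

  this.increment = function () {
    secret = secret + 1;          // private vars are accessed directly
    this.total = this.total + 1;  // public vars need the "this." qualifier
    // total = total + 1;         // would NOT work: "total" is not in scope here
  };
}

var c = new Counter();
c.increment();
// c.total is now 1 and visible from outside; "secret" is not reachable at all
```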


Filed under javascript

A good open source SVG editor

I am not an artist, but I like creating logos that represent my creative ideas. I do this so that I can take a break from the actual coding of those ideas. I have used various programs in the past, like GIMP and Paint.NET. This time around, however, I wanted to create the logo in SVG format so that I can scale it to whatever icon size I want without losing detail. So I found Inkscape and created my first SVG logo. It was quite easy. One thing I noticed with all the various painting/image-editing programs, though, is that they don't have out-of-the-box pentagons, hexagons, and other n-sided polygons, which are quite useful in designing logos. I had to create one manually for my needs, but the best thing about Inkscape is that after creating the polygon, it's possible to control the individual nodes of the polygon, and even to make some of the sides curvy. All in all, it's a nice program and, best of all, it's free!
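For anyone building such a shape by hand, here is a small sketch of my own for computing the vertices of a regular n-sided polygon for an SVG polygon "points" attribute (not a feature of Inkscape, just plain geometry):

```javascript
// Compute the vertex list of a regular n-sided polygon for an SVG <polygon>.
function polygonPoints(n, cx, cy, radius) {
  var points = [];
  for (var i = 0; i < n; i++) {
    var angle = (2 * Math.PI * i) / n - Math.PI / 2; // start at the top vertex
    var x = cx + radius * Math.cos(angle);
    var y = cy + radius * Math.sin(angle);
    points.push(x.toFixed(2) + "," + y.toFixed(2));
  }
  return points.join(" ");
}

// Example: a hexagon centered at (50, 50) with radius 40,
// usable directly as <polygon points="..." />
var hexagon = polygonPoints(6, 50, 50, 40);
```

From there, the individual nodes can be nudged or converted to curves in the editor.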


Filed under logo design, svg editor

WiTricity (none, WEP, WPA, WPA2, …)

Yesterday there was big news about WiTricity, the ability to transfer electricity without wires. This is perhaps one of the best inventions of this century, no denying it. But today I was thinking about the economics of it. With a regular wireless connection (used for data networks), security makes it possible to keep unintended users off your network. But what happens with WiTricity? With no such ability to "encrypt" electricity, while you are powering your indoors with it, people sitting in a car near your office or house could start drawing some of that power using the same wireless technology. How do you prevent that?

I am not being pessimistic with this technology, just wearing my entrepreneur and security admin hats. Worst case, it will still have some limited use cases.


Filed under WiTricity

University 2.0

What is University 2.0? This is just an idea I am mulling over. With the cost of education going up, people may not have the time, money, and commitment to study beyond a Bachelor's or Master's all the way to completing a PhD. That doesn't mean they can't continue, at a later stage, to invest time and effort in something they are passionate about. If that something is in line with the work they are doing, even better.

But why should doing independent research be termed University 2.0? Well, that's where I want to bring Web 2.0 ideas into this independent research. Perhaps there will be open, free (or inexpensive) collaboration for doing research: for example, an open-source research journal, willing volunteers ready to spend time mentoring research aspirants (this should be possible, just as people are ready to contribute to Wikipedia), and perhaps even virtual degrees (those who are interested in researching on the side probably don't really care about the certificates, but hey, why not?). And since these days every social-network system seems interested in attaching a number to everything (perhaps it all started with PageRank?), perhaps some kind of popularity ranking for each piece of research.


Filed under PhD, research, university research, Web 2.0