Monthly Archives: February 2007

Ignoring mouse events on the scrollbar

Say you implemented a popup. To make the popup disappear when the user clicks on some other area of the browser, you would capture mouse events. So far so good. However, if the popup extends beyond the visible area, the user will use the mouse to scroll with the scrollbars. Unfortunately, by the time the scrollbar is adjusted and the mouse button is released, an onmouseup event fires, which closes the popup. One way to prevent this is to check whether the source element of the event is the HTML element. Something like if (source.nodeName == "HTML") return; where source is the source element of the event.
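
Here is a minimal sketch of that check inside a document-level onmouseup handler. The popup is assumed to be a DIV with id "popup" (a hypothetical id), and the handler covers both the IE and W3C event models:

    document.onmouseup = function (e) {
      e = e || window.event;                   // W3C vs IE event object
      var source = e.target || e.srcElement;   // W3C vs IE source element
      if (source.nodeName == "HTML") return;   // came from the scrollbar: ignore

      var popup = document.getElementById("popup");
      if (!popup) return;

      // walk up from the source to see whether the click landed inside the popup
      var node = source;
      while (node && node != popup) node = node.parentNode;
      if (!node) popup.style.display = "none"; // clicked outside: close the popup
    };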

Filed under javascript, js/dhtml

Using document vs element.ownerDocument

DOM elements have a field called ownerDocument. Given that there is already a document variable available in JavaScript, I thought it was just a redundant piece of data (though it's part of the DOM specification). Then, today I realized a specific case where it is really useful. The use case was to open an IFRAME as a popup when the user clicks on a link. The IFRAME itself had to be embedded in a DIV element per my earlier article, Changing IFRAME content in IE. Now, the issue was, once the IFRAME was loaded, I didn't want it to have scrollbars, so I had written a function to sync the IFRAME's required size with the DIV element's size. This works fine, but one issue remains: sometimes the IFRAME is wider than what fits between the cursor and the right edge of the browser, and I need to move it left. However, for this I need to access the document that owns the DIV from JavaScript code executed within the IFRAME's page after it loads. The way to do it is window.frameElement.parentNode.ownerDocument!
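
A sketch of that adjustment, run from a script inside the IFRAME's page once it has loaded. The wrapping DIV, its absolute positioning and the sizing logic are assumptions for illustration:

    window.onload = function () {
      var div = window.frameElement.parentNode;   // the DIV wrapping this IFRAME in the parent page
      var parentDoc = div.ownerDocument;          // the parent page's document, not this IFRAME's

      // size the DIV to this page's content so the IFRAME needs no scrollbars
      var width = document.body.scrollWidth;
      div.style.width = width + "px";

      // if the DIV would run past the right edge of the parent window, shift it left
      var viewportWidth = parentDoc.documentElement.clientWidth || parentDoc.body.clientWidth;
      var left = parseInt(div.style.left, 10) || 0;
      if (left + width > viewportWidth) {
        div.style.left = Math.max(0, viewportWidth - width) + "px";
      }
    };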

Filed under DHTML, javascript

Why did Google introduce Custom Search Engines?

The real answer? I don’t know! But I like to guess.

Remember the days when Google used to display the total number of pages they had indexed? Which was like 4 billion or something? Then one fine day they changed it to something like 8 billion? Within a week, I think Yahoo! said they index even more than that (18 billion or some such large number, even though their results consistently returned fewer than Google's). Thing is, the web is exploding. And with all the social networking, Web 2.0, blogs etc., the rate at which content is generated is only increasing (including this blog!). Which means, no matter whether it is the PageRank algorithm or some other extremely smart algorithm, at such huge volumes of data a certain amount of noise is bound to exist. Add to this the continuous quest by smart webmasters and SEO gurus to beat Google's algorithms, which only contributes to suboptimal/undesired pages bubbling up.

Also, one of the reasons for the sudden explosion in Google's (and Yahoo!'s) number of pages is that they started indexing pages with a few parameters in their URLs. That is, in the past, mostly static URLs were crawled by the search engines, and perhaps at most 1 additional parameter was allowed. I believe Google now handles up to 3 or perhaps more. Even otherwise, people realized this limitation (or behavior) of the search engines and, as a result, started opening up their content as static URLs using various techniques such as URL rewrites, subdomains etc. Who would want to do this? Mostly retailers who want their entire product catalog available for public search. No wonder, when you search "site:amazon.com" on Google, you get more than 20 million results!

Considering all the above reasons, it's obvious that for a single search engine providing search over these billions of pages (which one day will probably hit a googol of pages, a pretty long shot), it's imperative that this collective knowledge (or spam, or whatever you call it) be intelligently broken into smaller subsets so that people can search within the subsets most relevant to them.

This is interesting! Because Yahoo!, which has since kind of lost to Google in search volume, was back in the 90s the revolutionary search engine that organized web pages into categories which people could search and browse. So, after all the page ranks and other search engine techniques, we are kind of back to categorization (well, this is an oversimplification, and I don't deny the importance of all these algorithms).

The only fundamental difference, though, is that instead of the categories being controlled by a single entity on its servers, the CSE concept allows individuals to categorize websites in whatever way they want. And this is quite powerful: there is no bottleneck of someone having to view and approve, subject matter experts know best how to classify the content, and innovative ideas like Fortune 500 Companies Search and My Meta Matrimonials become possible. Several other CSEs can be found at cselinks.com.

In addition to improving the quality of the search results, another benefit to the search engine companies, if people start using more and more of these CSEs, is that the load on their servers will decrease. At first thought, one might think that introducing so many custom search engines would put a lot of load on these companies' servers. However, given that a CSE covers a much, much smaller subset of the entire web, the amount of work the search engine needs to do is probably far less!

I don't know the exact algorithms used to make these CSEs faster. But intuitively, I can think of searching a CSE as similar to searching a single partition of a range/list/hash partitioned table in a database. Without such partitioning, the database has to do a full table scan of the entire table instead of a scan within a single partition (for those queries that require a full scan). Of course, the fundamental difference between the two, CSEs and partitioned tables, is that with a CSE the data can be partitioned in as many different ways as you like, while a partitioned table is partitioned upfront in one and only one way.
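
A toy illustration of that analogy (hypothetical data, nothing like Google's actual implementation): pages "partitioned" by site, so a query restricted to a CSE's sites only scans those partitions instead of the whole collection.

    // pages keyed by site, playing the role of table partitions
    var pagesBySite = {
      "example.edu": [{ url: "http://example.edu/a", text: "compiler lecture notes" }],
      "example.org": [{ url: "http://example.org/b", text: "compilers for beginners" }],
      "example.com": [{ url: "http://example.com/c", text: "cheap compilers!!!" }]
    };

    // sites plays the role of the CSE: only those partitions are scanned
    function search(keyword, sites) {
      var results = [];
      for (var i = 0; i < sites.length; i++) {
        var pages = pagesBySite[sites[i]] || [];
        for (var j = 0; j < pages.length; j++) {
          if (pages[j].text.indexOf(keyword) != -1) results.push(pages[j].url);
        }
      }
      return results;
    }

    // a "full table scan" would pass every site; a CSE is more like:
    search("compilers", ["example.edu", "example.org"]);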

My knowledge of information retrieval is limited to how Lucene works. Per that, each token has the list of documents it occurred in, and searching is a matter of taking the search keyword (complex expressions with booleans are more sophisticated, but the fundamental concept remains the same, or at least it keeps this discussion simpler) and walking through that token's vector of documents. The key thing, though, is the order in which this vector is presented to the user. This is where Google's PageRank should come into the picture. Lucene uses a priority queue to get the top 100 most relevant results and in the process also gets the total count. Google probably does the same, but as part of that it would have to apply a filter based on the sites specified in the CSE. One thing to note is that Google lets you give a site of your choice higher priority than another (even if the other has a higher PageRank), which means the relevancy has to be calculated dynamically based on the CSE. Anyway, with this simple change to the existing algorithm, I don't really see much benefit for the CSE, as it is in fact adding an extra filtering operation. However, a different approach would be to build smaller indexes for the subsets (keeping the doc ids the same as in the global index and just reprocessing the existing index), which avoids the filter operation as well as having to loop through all the sites that are not part of the CSE. The downside is the additional disk space.
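
A toy sketch of the two approaches (hypothetical posting lists and scores; real Lucene/Google internals are of course far more involved):

    // global inverted index: token -> posting list of [docId, score]
    var postings = { "compilers": [[1, 0.9], [2, 0.7], [3, 0.4], [4, 0.2]] };
    var docSite = { 1: "a.com", 2: "b.org", 3: "a.com", 4: "c.net" };

    // Approach 1: keep the global index and filter every hit against the CSE's sites
    function searchWithFilter(token, cseSites, topN) {
      var hits = postings[token] || [];
      var results = [];
      for (var i = 0; i < hits.length; i++) {
        if (cseSites[docSite[hits[i][0]]]) results.push(hits[i]);  // the extra filter step
      }
      results.sort(function (a, b) { return b[1] - a[1]; });       // stand-in for the priority queue
      return results.slice(0, topN);
    }

    // Approach 2: prebuild a smaller per-CSE index with the same docIds,
    // trading extra disk space for skipping the filter at query time
    var csePostings = { "compilers": [[1, 0.9], [2, 0.7], [3, 0.4]] };
    function searchCseIndex(token, topN) {
      return (csePostings[token] || []).slice(0, topN);            // already filtered and sorted
    }

    searchWithFilter("compilers", { "a.com": true, "b.org": true }, 100);
    searchCseIndex("compilers", 100);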

So, while I presented two possible ways of implementing CSEs, I am sure the smart PhDs at Google may have figured out much better ways of doing the same.

Irrespective of the performance, the most important point is that, with better quality search results, Google will be able to provide more targeted ads, because Google has additional context for the search: the context of the CSE. Obviously, more targeted ads mean more revenue, and perhaps that extra revenue justifies the extra cost, should there be no better way to optimize CSE retrieval compared to regular search.

Filed under custom search engine, Google, Google CSE, Google Search, Information Retrieval, search engine

Custom Search Engine by Google

Google has started allowing people to roll out their own search engines. This is a very powerful feature if you think about it. For example, university departments can have a custom search engine that only brings back results from a set of websites related to that department's topic.

So, the key is to identify a group of websites that can be logically grouped to provide more targeted search results. While that is the idea, given that Google's CSE (CSE blog) allows up to 5000 links, people may come up with innovative uses for such large numbers. For example, the Publicly Traded Companies Search Engine provides search across thousands of publicly traded companies.

One good thing about Google's CSE is that it allows for collaboration. So the 5000-link limit can be overcome by working with other people to add more links. Of course, as long as the set of links has that extra meta-data with it, this should be fine. But abusing this to add any random set of links will not serve the intent. Good luck rolling out your own CSE.

Filed under custom search engine, Google Search, search engine

Amazon.com’s AST vs Snap.com’s SPA

Do you have a website with links to other sites? Do you want people to preview the target site’s homepage without leaving your page? The solution is to offer a thumbnail preview. But how do you get these thumbnails? There are two solutions. One is Amazon.com’s Alexa Site Thumbnails. The other is Snap.com’s Snap Preview Anywhere™.

So, which is right for you? That’s what I want to discuss here.

1. Cost: AST charges $0.20 per 1000 impressions while SPA is free. Obviously, this is going to play a big role in your decision.
2. Code: With SPA, you just register and get a snippet of HTML code that you put in your page; the snippet actually loads a JavaScript file. With AST, you write code on the server side, get the links to the thumbnails and generate your page.
3. DHTML: Say you generate content on your page dynamically on the client side (because of mashups or whatever it is). With AST you have full control. However, this will not work with SPA (I need to find out if I am wrong by doing more research), as their JavaScript gets invoked only when the page is loaded.
4. Tracking Preview Actions: With AST this is possible because, once you get the link on the server side, you can substitute it with a redirection URL so that the request for the preview image first hits your website and is then redirected to AST's actual link (see the sketch after this list). With SPA, you can't do this.
5. Homepage or Any page: AST only supports the homepage of a website. SPA supports any page. Of course, if SPA doesn't already have a thumbnail for a particular page in their system, they just queue it up.
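
A minimal sketch of the tracking idea in item 4. The wrapTrackedThumbnail function and the /thumb-redirect endpoint are hypothetical, not part of AST's API; the point is only that your server emits a URL on your own site, logs the hit, and then redirects to the real thumbnail image:

    // emit this instead of the raw AST thumbnail URL when generating the page
    function wrapTrackedThumbnail(astThumbnailUrl, targetSite) {
      return "/thumb-redirect?site=" + encodeURIComponent(targetSite) +
             "&img=" + encodeURIComponent(astThumbnailUrl);
    }

    // the page then contains something like:
    //   <img src="/thumb-redirect?site=example.com&img=http%3A%2F%2F..." />
    // and the /thumb-redirect handler records the impression and responds
    // with a 302 redirect to the img URL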

Hope this helps in deciding the right service for your site. If anyone thinks the above statements about SPA are wrong, please post a comment explaining how to achieve the same with SPA.

Filed under DHTML, site preview

Top social networking sites & their technologies

From a Yahoo! article I came to know that the top 3 social networking sites in the US are MySpace.com, Facebook.com and bebo.com. Interestingly, each of these companies uses a completely different middle-tier stack. MySpace.com uses Microsoft IIS 6.0, Facebook.com uses Apache 1.3.37 and bebo.com uses Resin/3.0.21. MySpace.com uses .cfm (ColdFusion?) for its dynamic content, Facebook.com uses PHP and bebo.com uses JSP. I am not sure which database is used by MySpace.com and Facebook.com, but bebo.com claims it's powered by Oracle. Chances are one or both of the other two are powered by MySQL. Facebook.com's job descriptions seem to ask for experience preferably in MySQL and then Oracle, so perhaps they are using both databases.

Filed under website architecture