Category Archives: Google

People Search

Today I was searching for the name of a person who is at the executive level at a small software company. The first two pages of results contained a few web pages about this person, but many more were about another person with the same name, mostly discussing some lawsuit that person was involved in.

Then I thought there should be a better way to search by a person’s name. Here is my idea.

1) First and foremost, the search engine should recognize that the search is about a person’s name.
2) The search engine should have the capability to distinguish two webpages that contain the same name but refer to different people. This is not easy, but context should help a lot.
3) More importantly, when presenting the results, each search result should be associated with an image, some kind of gravatar. This would help searchers distinguish results about one person from results about the other. In some cases, just reading the surrounding text is enough, but that may not always be the case. However, if the search engine could detect which of the two (or more) people a webpage refers to, then even when only the name is used in the search, the results could still be visually labeled to indicate which page is about which person (a rough sketch follows).
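Here is that sketch in JavaScript, assuming a hypothetical entity-resolution step has already tagged each result with an entityId and an avatarUrl (both field names are made up for illustration; the hard part, of course, is the disambiguation itself, not the grouping):

// Toy illustration: group search results by the person they refer to
// and attach an avatar per person, so results about the executive are
// visually separated from results about the person in the lawsuit.
// entityId and avatarUrl are hypothetical fields that a context-based
// disambiguation step would have to produce.
const results = [
  { url: 'http://example.com/exec-bio',  entityId: 'name-1', avatarUrl: 'http://img.example/1.png' },
  { url: 'http://example.com/lawsuit',   entityId: 'name-2', avatarUrl: 'http://img.example/2.png' },
  { url: 'http://example.com/interview', entityId: 'name-1', avatarUrl: 'http://img.example/1.png' },
];

function groupByPerson(results) {
  const groups = {};
  for (const r of results) {
    (groups[r.entityId] = groups[r.entityId] || { avatar: r.avatarUrl, pages: [] }).pages.push(r.url);
  }
  return groups;
}

console.log(groupByPerson(results));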

Now, which large company wants to do this? Google? Yahoo? Bing? Don’t patent your stuff though, because I just made it public. He he he.

Leave a comment

Filed under Bing, Google, Yahoo!

Google’s Final April Fool Prank

Last year, I woke up on April 1st, went to the office, and tried my best to fool the people around me. No one got fooled. I was quite surprised why that was the case. At the end of the day, after talking to a few colleagues who are pranksters like me, it turned out they had a similar experience: no one got fooled. So we all got together and brainstormed the reason behind it.

Apparently, most people do a Google search within the first 5 minutes of sitting in front of the computer at work. So everyone who visited Google’s homepage either got fooled or was reminded that it’s the day for making fools of people. The surprise element that many pranksters count on to fool friends and colleagues has been taken away by Google, thanks to its monopoly.

After realizing this, a bunch of us sent an email to Google’s CEO Eric Schmidt telling him how we are no longer able to make the best of April 1st. Recognizing our problem, he promised to pull off one last prank in 2010 and then stop. I am hoping that he and Google will stick to their promise.

Leave a comment

Filed under Google

Don’t Be Evil, Locally, The New Google Motto

This FT.com article says that Google is reversing its self-imposed policy of not allowing gambling-related ads in the UK. The article says:

[Google has been reviewing its gambling advertising policy "to ensure it is as consistent as possible with local business practices", said James Cashmore, industry leader, entertainment and media, at the company. "We hope this change will enhance the search experience for users and help advertisers connect with interested consumers."]

There are many local businesses that are illegal in most parts of the world. But hey, if there are millions of dollars at stake, why not? So “Don’t be evil” all of a sudden got qualified by geography.

Leave a comment

Filed under Google

Text Link Ads & Google

My SEO knowledge is limited, yet I have noticed an important factor with Google: it seems to give importance to content that is most recent. In fact, recency seems to carry a bit more weight than PageRank. For example, when a fresh page optimized for a given keyword is created on a website whose homepage has a PageRank of 3, it easily beats pages on sites with PageRanks as high as 6 and 7. But as time passes, the page keeps getting demoted, eventually returning to its rightful place based on the PageRank effect alone.

So, essentially, Google’s search results seem to be based on

Freshness(page) + PR(page)

For the fresh pages where I have seen this correlation, I had also had a PageRank 4 blog link to them. So it’s not clear whether the freshness is of the page itself or of the link to the page. If it’s the latter, then it’s important to realize that, since Google takes time to find your new page, index it, and gradually promote it from its sandbox to the main search servers, there is a lag. That means, if you buy a text link ad on a website, its effect may only be felt 10 to 30 days later.
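To illustrate the behavior I am describing, here is a toy sketch of what such a combined score might look like, assuming the freshness boost decays over a few weeks (the decay constant and weights are completely made up; this is not Google’s actual formula):

// Hypothetical ranking score: a freshness boost that decays with age,
// added to the static PageRank. All numbers are purely illustrative.
function score(pageRank, ageInDays) {
  const freshnessBoost = 4 * Math.exp(-ageInDays / 15); // fades over a few weeks
  return pageRank + freshnessBoost;
}

console.log(score(3, 1));  // ~6.7: a day-old PR3 page outranks...
console.log(score(6, 90)); // ~6.0: ...an older PR6 page
console.log(score(3, 60)); // ~3.1: after two months it settles back to its PageRank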

Also, it’s unclear what the adverse impact on a page’s score is when an inlink is removed. That is, I don’t know whether Google takes note of a drop in inlinks and immediately propagates it to the main index, or waits for a while. If it propagates immediately, then it’s better to extend the text link ad service beyond one month. Essentially, in that case, set aside a budget for 2 months and expect good results for only 30 to 50 days. So even though the text link ads prove to be twice as expensive, if they are helping the site, they are worth it. I see that a lot of affiliates can benefit from this behavior of Google.

BTW, I recently read that Google no longer gives credit to domain names containing ‘-’ that are created specifically for SEO. I wonder whether using an ‘_’ in the URL instead, for example for something like cordless power tools, gets any extra credit or not.

Leave a comment

Filed under Google, text link ads

Google Search Privacy

If you are a webmaster, you know that Google shows you the “top search queries” as well as the “top search query clicks,” along with the corresponding “average top position” for each of those queries. As a webmaster, I am sure you love this information. But what if you are the end user issuing those queries?

Let’s pause and see what these two types of information are. One gives the position of your web page for the searches that are conducted, while the other gives the position of your web page when the user actually clicked on it in the search results. The first piece of information is simple and straightforward. But how is it possible to get the second piece? The only way is by tracking the users’ clicks.

Recently, Google’s search results also have a “View and manage your web history” link at the top of the page. I personally can’t understand why people would want to keep a record of what they have searched in the past. No one really wants that, especially if they are concerned about privacy.

So, how is it possible to track the clicks? If you look at the search result links, they have the standard href syntax, so each link points directly to the target website, and when you hover over it you see that URL in the status bar. However, there is also an onmousedown event handler that routes the click through a function that hijacks the original link and replaces it with a redirect through Google. That’s how Google knows about the click.
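The pattern looks roughly like the snippet below. This is a simplified illustration of the technique described; the actual markup, function, and parameter names Google uses are different and change over time:

// The visible href points to the real site, but an onmousedown handler
// swaps it for a redirect that logs the click just before navigation.
//
//   <a href="http://example.com/" onmousedown="return trackClick(this)">Example</a>
//
function trackClick(anchor) {
  // Rewrite the link at the last moment: the status bar showed the real
  // URL, but the actual navigation goes through the tracking redirect.
  anchor.href = 'http://www.google.com/url?url=' + encodeURIComponent(anchor.href);
  return true; // let the click proceed through the redirect
}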

So, if you are extra cautious about privacy, what can you do? I searched userscripts.org for GreaseMonkey user scripts that fix this issue. One seemed to fix it, but the way it did so was to register an additional event listener that sets the link back to the original. The reason that author had to do it that way is probably that, from within GreaseMonkey scripts, it’s not possible to directly alter the page’s event handlers; instead, one has to use addEventListener to register an additional listener, so the listener set by the original content cannot be prevented.
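A rough sketch of what such a reset-based user script might look like (this is my own illustration, not the actual script from userscripts.org):

// Register our own capturing listener (the only option from a GreaseMonkey
// script) that remembers the href as rendered and puts it back right after
// the page's own onmousedown handler has rewritten it.
document.addEventListener('mousedown', function (e) {
  const anchor = e.target && e.target.closest ? e.target.closest('a') : null;
  if (anchor && anchor.href) {
    const original = anchor.href;
    setTimeout(function () { anchor.href = original; }, 0);
  }
}, true);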

While resetting the link back to the original is fine, the way I addressed this problem is with a one-liner. It is:


unsafeWindow.clk = function() { };

That’s it. What this does is replace the window.clk function of the results document, the one called from the onmousedown event listener, with a function that does nothing. Of course, this is specific to Google; the earlier idea of resetting the link may work as a more generic approach.
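For completeness, a minimal GreaseMonkey user script built around that one-liner might look like this (the metadata values and the @include pattern are just examples):

// ==UserScript==
// @name        Disable Google click tracking
// @namespace   http://example.com/userscripts
// @include     http://www.google.com/search*
// ==/UserScript==

// Replace the page's click-logging function with a no-op, so the
// onmousedown handler on each result link has nothing useful to call.
unsafeWindow.clk = function () { };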

Leave a comment

Filed under Google, Google Search Privacy

Thoughts On Net Neutrality

I am thinking more about Net Neutrality these days. These are just some devil’s-advocate thoughts, not finalized opinions.

I see two types of proponents of Net Neutrality: those who don’t want either content providers or consumers to pay additional fees to the ISPs, and those who don’t want content providers to pay the ISPs but are fine with ISPs offering tiered pricing to consumers. I frankly don’t understand the first category, since installing and operating a network is not free and someone has to pay for that work. So I want to explore whether the second type of proponent is correct.

Now take YouTube, for example, which generates a lot of bandwidth demand due to its video streaming. It’s free for end users, and the service makes money through ads. No good service can really be free; someone has to pay for it. A different model for YouTube would be to charge consumers a minimal fee and not have ads at all. However, YouTube wouldn’t want to do this, because it knows it’s possible to make far more money by making advertisers bid for the ad space than by charging consumers a flat rate.

Basically, everyone knows that a B2B model is usually much more profitable and has higher margins than a B2C model. So these very Net Neutrality proponents, who argue that the ISPs should fund the additional network bandwidth by charging consumers based on their usage, essentially prescribing a B2C model for the ISPs, would never choose a B2C model for themselves.

Think about it: Google could have chosen to make search a subscription-based service for consumers and let the various businesses put their ads in the search results for free, instead of making them bid for their position.

If content providers desire to have their content reach end users without paying the ISPs, even for the bandwidth alone, let alone by bidding for that bandwidth, wouldn’t every website have the same desire to reach consumers through the search engine for free?

In the above analogy,

Content Provider = Website
ISP = Search Engine
Consumer = Consumer
ISP Subscription = Search Service Subscription (note that the ISP price need not be the same as the search service price).

If the search service providers (SSPs) don’t want websites to get a free ride on their precious page-view bandwidth, why would an ISP want content providers to get a free ride on its network bandwidth?

Let me know how the above thinking is flawed or can be reinforced with tweaks.

Leave a comment

Filed under AdWords, Google, ISP, MSN, Net Neutrality, SSP, Yahoo!, YouTube

Why did Google acquire Grand Central?

I don’t know the real answer. I’d like to pen down the main reason I can think of.

Let me first digress a bit. If you use LinkedIn, you know that it’s possible for LinkedIn to build a profile of you based on the people you are connected to, in addition to all the personal details you provide about yourself. However, personal information like school and work will not completely distinguish two people. As the saying goes, “a man is known by the company he keeps”: beyond the personal information, the LinkedIn connections give more information about a person.

The more accurate a profile a company has of a person, the better it can target its services. For Google, that typically means advertising. With a service like Grand Central, Google will be able to amass relationship data from phone calls (A calls B). Currently, LinkedIn has no way to give weight to a relationship: when two childhood buddies connect on LinkedIn, that’s no different from when a recruiter connects with someone. Given that, beyond the initial connection, the actual email communication happens outside LinkedIn, there is no good way for LinkedIn to assign additional weight to each relationship.

On the other hand, the services offered by Grand Central allow it to track who is calling you, all the time. The more calls you receive from a number, the more weight can be given to that connection; a toy sketch of this kind of weighting follows.
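Something like the following, where every call bumps the weight of that relationship edge (the data and names are made up for illustration):

// Each call record (A calls B) increases the weight of that relationship.
const calls = [
  { from: 'realtor',   to: 'you' },
  { from: 'realtor',   to: 'you' },
  { from: 'recruiter', to: 'you' },
];

function relationshipWeights(calls) {
  const weights = {};
  for (const c of calls) {
    const edge = c.from + '->' + c.to;
    weights[edge] = (weights[edge] || 0) + 1;
  }
  return weights;
}

console.log(relationshipWeights(calls)); // { 'realtor->you': 2, 'recruiter->you': 1 }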

In addition, say you are trying to buy a house (well, now is not the right time to do so in many parts of the US, but say you are one of those still thinking of buying one). If Grand Central figures out that you are working with some local real estate agent, based on the calls you have been regularly receiving, Google can start showing you mortgage ads, real estate ads, etc. Of course, it can do that based on what you are searching for as well. But based on what it knows about that particular realtor, it can target you even better.

In fact, Google has already been doing this with email. While Yahoo! and Hotmail choose not to automatically add every address you send email to into your address book, Gmail does the opposite. It’s essentially cataloging your entire network, and the more you keep using Gmail, the more it can learn about you! By acquiring Grand Central, it not only knows your email network, it also knows your phone network!

Leave a comment

Filed under Google, Grand Central, linkedin

PageRank vs Delicious Tags

How can Yahoo! improve its search results? Google has nailed it for about a decade using its so-called PageRank algorithm. But because it’s a patented technology, others can’t copy it. However, it’s not just PageRank that really improved the search results. One other key thing in Google’s approach is giving more importance to the anchor text used to describe the target page. This is quite powerful, because someone who provides ERP application performance tuning services can stuff a bunch of keywords onto every page of the website, whether or not a page is really about that topic, but sites that link to any of those pages will use only a handful of keywords to describe the page appropriately.

The concept of identifying what a page is about from an external reference to it, using a particular description, is similar to deriving what a page is about from the tags used to bookmark it on del.icio.us. So, instead of purely relying on the keywords listed on a page, while crawling and indexing a webpage Yahoo! could look up the tags associated with that page in the del.icio.us repository and combine them, giving extra weight to the keywords used as tags. One good thing about this approach is that the tag cloud of the page on del.icio.us reflects what people generally think the page is about. For example, even though tocloud.com provides tag cloud generation tools, when someone sees that page they think of various other things such as tagcloud, tagging, seo, etc. But tagcloud should get more weight than seo, because the del.icio.us tag cloud for tocloud.com shows the tagcloud tag much bolder than the seo tag. A rough sketch of this weighting follows.
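A minimal sketch, assuming the tag counts have already been fetched from del.icio.us at indexing time (the numbers, names, and weighting scheme below are made up for illustration):

// Boost a page's keyword scores using how often people bookmarked the page
// with that tag on del.icio.us. Inputs are made-up examples.
function combinedScores(pageKeywordScores, tagCounts, tagWeight) {
  const totalTags = Object.values(tagCounts).reduce(function (a, b) { return a + b; }, 0) || 1;
  const terms = new Set(Object.keys(pageKeywordScores).concat(Object.keys(tagCounts)));
  const scores = {};
  for (const term of terms) {
    const onPage   = pageKeywordScores[term] || 0;
    const fromTags = (tagCounts[term] || 0) / totalTags;
    scores[term] = onPage + tagWeight * fromTags;
  }
  return scores;
}

// e.g. for tocloud.com: "tagcloud" is tagged far more often than "seo",
// so it ends up with the bigger boost.
console.log(combinedScores(
  { tagcloud: 0.4, seo: 0.4 },
  { tagcloud: 120, tagging: 30, seo: 10 },
  1.0
));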

Of course, just as Google has to deal with issues such as link farms, backlink manipulation, etc., people may start creating fake accounts and tagging their pages with all sorts of keywords to influence the search results. So Yahoo! would need some clever algorithms to detect fake taggers vs. real users and filter out any such manipulation.

Now, what will live.com do to figure out the true purpose of a website? Whoever finds an answer can start the third search engine that could be quite successful.

Leave a comment

Filed under del.icio.us, Google, search engine, Yahoo!

Why did Google introduce Custom Search Engines?

The real answer? I don’t know! But I’d like to guess.

Remember the days when Google used to display the total number of pages they had indexed? Which was like 4 billion or something? Then one fine day they changed it to like 8 billion or so? Within a week, I think Yahoo! said they index even more than that (18 billion or some such large number, even though their queries consistently returned fewer results than Google’s). The thing is, the web is exploding. With all the social networking, Web 2.0, blogs, etc., the rate at which content gets generated only keeps increasing (including this blog!). Which means, no matter whether it’s the PageRank algorithm or some other extremely smart algorithm, at such huge volumes of data a certain amount of noise is bound to exist. Add to this the continuous quest by smart webmasters and SEO gurus to beat Google’s algorithms, which only contributes to suboptimal and undesired pages bubbling up.

Also, one of the reasons for the sudden explosion in Google’s (and Yahoo!’s) page counts is that they started indexing pages that contain a few parameters in their URLs. In the past, mostly static URLs were crawled by the search engines, with perhaps at most one additional parameter allowed, but I believe Google now handles up to 3 or perhaps more. Even so, people realized this limitation (or behavior) of the search engines and, as a result, started exposing their content as static URLs using various techniques such as URL rewrites, subdomains, etc. Who would want to do this? Mostly retailers who want their entire product catalog available for public search. No wonder, when you search “site:amazon.com” on Google, you get more than 20 million results!

Considering all the above reasons, with a single search engine searching across these billions of pages (which one day will probably hit a googol of pages, a pretty long shot), it’s imperative that this collective knowledge (or spam, or whatever you call it) be intelligently broken into smaller subsets so people can search within the subsets most relevant to them.

This is interesting! Because Yahoo!, which kind of lost to Google in search volume, was back in the ’90s the revolutionary search engine that organized web pages into various categories that people could search and browse. So, after all the PageRanks and other search engine techniques, we are kind of back to categorization (well, this is an oversimplification, and I don’t deny the importance of all these algorithms).

The only fundamental difference, though, is that instead of the categories being controlled by a single entity on its servers, the CSE concept allows individuals to categorize websites in whatever way they want. And this is quite powerful: there is no bottleneck of someone having to review and approve, subject matter experts know best how to classify the content, and innovative ideas like Fortune 500 Companies Search and My Meta Matrimonials become possible. Several other CSEs can be found at cselinks.com.

In addition to improving the quality of the search results, another benefit to the search engine companies, if people start using more and more of these CSEs, is that the load on their servers may decrease. At first thought, one might think that introducing so many custom search engines would put a lot of load on these companies’ servers. However, given that a CSE covers a much, much smaller subset of the entire web, the amount of work the search engine needs to do per query is probably far less!

I don’t know the exact algorithms used to make these CSEs fast. But intuitively, I think of searching a CSE as similar to searching a single partition of a range-, list-, or hash-partitioned table in a database. Without such partitioning, the database has to do a full table scan instead of a scan of a single partition (for those queries that require an FTS). Of course, the fundamental difference between the two, CSEs and partitioned tables, is that with CSEs the data can be partitioned in as many different ways as you like, while a partitioned table is partitioned upfront in one and only one way.

My knowledge of information retrieval is limited to how Lucene works. In Lucene, each token has the list of documents it occurs in, and answering a query is a matter of taking the search keyword (complex boolean search expressions are more sophisticated, but the fundamental concept remains the same, or at least let’s keep this discussion simple) and walking through that token’s vector of documents. The key thing, though, is the order in which this vector is presented to the user; this is where Google’s PageRank should come into the picture. Lucene uses a priority queue to get the top 100 most relevant results and, in the process, also gets the total count. Google probably does the same, but for a CSE it would additionally apply a filter based on the sites specified in the CSE. One thing to note is that Google allows giving a site of your choice higher priority than another (even if the other has a higher PageRank), which means the relevancy has to be calculated dynamically based on the CSE. Anyway, with this simple change to the existing algorithm, I don’t really see much benefit for the CSE, as it in fact adds an extra filtering operation. A different approach, however, would be to build smaller indexes for the subsets (keeping the docids the same as in the global index and just reprocessing the existing index), which avoids both the filter operation and looping through all the sites that are not part of the CSE. The downside is the additional disk space. A small sketch of both approaches follows.
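Over a toy inverted index (postings lists keyed by term), the two approaches might look like this; the first filters the global postings against the CSE’s site list at query time, the second pre-builds a per-CSE sub-index (all data is made up, and the scoring is a stand-in for the real relevancy calculation):

// Toy inverted index: term -> list of { docId, site, score } postings.
const index = {
  matrimonial: [
    { docId: 1, site: 'example-a.com', score: 0.9 },
    { docId: 2, site: 'example-b.com', score: 0.7 },
    { docId: 3, site: 'example-c.com', score: 0.8 },
  ],
};

// Approach 1: filter the global postings by the CSE's site set at query
// time (extra work per query, but no extra disk space).
function searchWithFilter(term, cseSites) {
  return (index[term] || [])
    .filter(function (p) { return cseSites.has(p.site); })
    .sort(function (a, b) { return b.score - a.score; }); // stand-in for the priority queue
}

// Approach 2: pre-build a smaller index containing only the CSE's sites,
// keeping the docids the same (no filtering at query time, extra disk space).
function buildSubIndex(cseSites) {
  const sub = {};
  for (const term of Object.keys(index)) {
    sub[term] = index[term].filter(function (p) { return cseSites.has(p.site); });
  }
  return sub;
}

const cse = new Set(['example-a.com', 'example-c.com']);
console.log(searchWithFilter('matrimonial', cse));
console.log(buildSubIndex(cse).matrimonial);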

So, while I presented two possible ways of implementing CSEs, I am sure there are much better ways of doing the same thing that the smart PhDs at Google have figured out.

Irrespective of performance, the most important point is that with better-quality search results, Google will be able to serve more targeted ads, because it has additional context for the search: the context of the CSE. Obviously, more targeted ads mean more revenue, and perhaps that extra revenue justifies the extra cost, should there be no better way to optimize CSE retrieval compared to regular search.

1 Comment

Filed under custom search engine, Google, Google CSE, Google Search, Information Retrieval, search engine

Google’s PageRank, Amazon’s TrafficRank, LinkInCount

We all know Google’s PageRank plays a key role in how results are ordered on the page. However, based on my research, there is no legitimate way to query the PageRank of a website; Google doesn’t provide a web service for it.

Today, I came across an icon on a website that displayed the TrafficRank of that site.
Clicking the link

http://www.amazon.com/exec/obidos/tg/url/-/www.clickbot.net/104-9220080-0165529

took me to Amazon where I could see the traffic rank of the website.

After a bit of research, I found that Amazon’s web services provide the TrafficRank and LinkInCount (sort of like what PageRank is based on), along with other useful information about a website, via a web service interface. They also have sample code in various languages.

Note that most of Amazon’s web services are not free, but many of them are reasonably priced. For example, the UrlInfo web service, which provides the above information, is free for the first 10,000 requests per month and currently costs a mere $0.15 per 1,000 requests thereafter.

Anyway, I think with Amazon’s service there is at least a legitimate way of obtaining the popularity of a website.

Leave a comment

Filed under amazon web services, Google