Category Archives: Google Search

Are you losing to your competitor?

A few years back, using my experience in a certain area, I created a piece of software, gave it a name, created a website with the same name, and started selling it. It's a very niche area, and only a small set of the target audience actually seeks the solution and pays for it. My software is priced at $$ and my sales are in the single digits. So the potential is less than a thousand per year. But who knows, with time it could have become a little bit more.

Instead, after about a year, things went the other way. Hardly anyone was even contacting me to find out more. Yes, the economy had been bad since 2008 and all that, but that's not the reason for the dismal performance I had. Out of nowhere, some guy started offering a similar solution for free. People are expected to register a domain name and use his solution by mapping their domain name to his server.

Which is fine. You could say, if someone can offer it for free, why would anyone pay you $$? First, there are some problems with what that person is doing, but I won't go into those details. What I don't think is appropriate is for that person to optimize his website around my product. He literally used my product name, which is very specific, and did SEO around it. Since his customers map their domains to his server, he gets free links from all of those websites back to his main site, and in the anchor text he even used my product name. So the theme is "An alternative to xyz" or "An xyz for free" and so on, where xyz is my product name.

Only because my product name and my domain name are the same, and because Google at least has the sense to give a matching domain name a lot more weight, my website does come first when someone searches for xyz. The problem is that the next link in the results is his, which says "A free xyz".

That's how I got screwed, and hardly anyone wanted to purchase my product. The truth of the matter is, there are clear advantages to buying the software, installing it on one's own server, and using it, versus mapping one's domain to a free solution. However, in a page of search results there is no way to explain those differences to potential customers.

I am not sure how this can be solved by Google or anyone.



Filed under Google Search, SEO

Why did Google introduce Custom Search Engines?

The real answer? I don’t know! But I like to guess.

Remember the days when Google used to display the total number of pages they had indexed? It was something like 4 billion. Then one fine day they changed it to around 8 billion, and within a week, I think, Yahoo! claimed they indexed even more than that (18 billion or some such large number, even though their results consistently returned fewer hits than Google's). The thing is, the web is exploding. With all the social networking, Web 2.0, blogs, etc., the rate at which content gets generated is only increasing (including this blog!). Which means that no matter whether it is the PageRank algorithm or some other extremely smart algorithm, at such huge volumes of data a certain amount of noise is bound to exist. Add to this the continuous quest by smart webmasters and SEO gurus to beat Google's algorithms, and suboptimal or undesired pages are bound to bubble up.

Also, one of the reasons for the sudden explosion in Google's (and Yahoo!'s) page counts is that they started indexing pages that contain a few parameters in their URLs. In the past, mostly static URLs were crawled by the search engines, and perhaps at most one additional parameter was allowed; I believe Google now handles up to three, or perhaps more. Even so, people recognized this limitation (or behavior) of the search engines and started opening up their content as static URLs using techniques such as URL rewrites, subdomains, etc. Who would want to do this? Mostly retailers who want their entire product catalog available for public search. No wonder a search for "site:amazon.com" on Google returns more than 20 million results!
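To make the URL-rewrite trick concrete, here is a minimal sketch in Python. The paths and parameter names (/products/1234, catalog.php, product_id) are made up for illustration; real sites typically do this with web-server rewrite rules rather than application code, but the mapping is the same idea: a crawler sees a clean, parameter-free URL while the server internally serves the dynamic page.

```python
import re

# Hypothetical rewrite rule: expose a dynamic catalog page under a
# static-looking, parameter-free URL that crawlers are happy to index.
STATIC_URL = re.compile(r"^/products/(\d+)$")

def rewrite(path):
    """Map /products/1234 -> /catalog.php?product_id=1234 (internal URL)."""
    match = STATIC_URL.match(path)
    if match:
        return "/catalog.php?product_id=" + match.group(1)
    return path  # everything else passes through unchanged

print(rewrite("/products/1234"))   # -> /catalog.php?product_id=1234
print(rewrite("/about.html"))      # -> /about.html
```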

Considering all the above, it's obvious that for a single search engine serving these billions of pages (which may one day hit a googol of pages, admittedly a pretty long shot), this collective knowledge (or spam, or whatever you call it) has to be intelligently broken into smaller subsets, so that people can search within the subsets most relevant to them.

This is interesting! Yahoo!, which more or less lost to Google in search volume, was back in the '90s the revolutionary search engine that organized web pages into categories people could search and browse. So, after all the PageRank and other search-engine techniques, we are in a way back to categorization (this is an oversimplification, and I don't deny the importance of all these algorithms).

The fundamental difference, though, is that instead of the categories being controlled by a single entity on its servers, the CSE concept lets individuals categorize websites in whatever way they want. This is quite powerful: there is no bottleneck of someone having to review and approve, subject matter experts know best how to classify the content, and innovative ideas like a Fortune 500 Companies Search or My Meta Matrimonials become possible. Several other CSEs can be found at cselinks.com.

In addition to improving the quality of the search results, another benefit to the search engine companies, if people start using more and more of these CSEs, is that the load on their servers could decrease. At first one might think that introducing so many custom search engines would add a lot of load. However, given that each CSE covers a much, much smaller subset of the entire web, the amount of work the search engine needs to do per query is probably far less.

I don't know the exact algorithms used to make these CSEs fast. But intuitively, I think of searching a CSE as similar to searching a single partition of a range/list/hash-partitioned table in a database. Without such partitioning, the database has to do a full table scan instead of a full scan of a single partition (for those queries that require an FTS). Of course, the fundamental difference between the two is that a CSE can partition the data in as many different ways as you like, while a partitioned table is partitioned upfront in one and only one way.

My knowledge of information retrieval is limited to how Lucene works. There, each token has the list of documents it occurred in, and answering a query is a matter of taking the search keyword (complex expressions with booleans are more sophisticated, but the fundamental concept remains the same, or at least it keeps this discussion simpler) and walking through that token's list of documents. The key thing is the order in which this list is presented to the user, and this is where Google's PageRank comes into the picture. Lucene uses a priority queue to get the top 100 most relevant results and, in the process, also gets the total count. Google probably does the same, but for a CSE it would additionally apply a filter based on the sites specified in the CSE. Note that Google lets you give a site of your choice higher priority than another (even one with a higher PageRank), which means the relevancy has to be calculated dynamically per CSE. With this simple change to the existing algorithm, though, I don't really see much benefit to the CSE, since it in fact adds an extra filtering operation. A different approach would be to build smaller indexes for the subsets (keeping the docids the same as in the global index and just post-processing the existing index), which avoids both the filter operation and looping over all the sites not in the CSE. The downside is the additional disk space.
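To make the two approaches concrete, here is a toy sketch in Python. The documents, sites, and scores are invented, and real Lucene or Google internals are far more involved; this only illustrates the contrast between (a) filtering a global posting list by the CSE's site set at query time and (b) pre-building a smaller per-CSE index so no filtering is needed later.

```python
import heapq
from collections import defaultdict

# Toy corpus: doc id -> (site, relevance score for some fixed query term).
# All values are made up for illustration.
DOCS = {
    1: ("example.edu", 0.9),
    2: ("shop.example.com", 0.4),
    3: ("blog.example.org", 0.7),
    4: ("example.edu", 0.2),
}

# Global inverted index: term -> posting list of doc ids.
GLOBAL_INDEX = {"databases": [1, 2, 3, 4]}

def search_with_filter(term, cse_sites, k=2):
    """Approach (a): walk the global posting list, skip docs outside the CSE."""
    postings = GLOBAL_INDEX.get(term, [])
    hits = [(DOCS[d][1], d) for d in postings if DOCS[d][0] in cse_sites]
    return heapq.nlargest(k, hits)  # priority-queue style top-k

def build_sub_index(cse_sites):
    """Approach (b): pre-build a smaller index holding only the CSE's docs."""
    sub = defaultdict(list)
    for term, postings in GLOBAL_INDEX.items():
        sub[term] = [d for d in postings if DOCS[d][0] in cse_sites]
    return sub

cse = {"example.edu", "blog.example.org"}
print(search_with_filter("databases", cse))  # filter applied at query time
sub_index = build_sub_index(cse)
print(sub_index["databases"])                # no filtering needed at query time
```

The trade-off visible even in this toy version is exactly the one described above: approach (a) does extra work per query, approach (b) does the work once per CSE and pays for it in storage.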

So, while I have presented two possible ways of implementing CSEs, I am sure the smart PhDs at Google have figured out much better ways of doing the same.

Irrespective of the performance, the most important point is that with better-quality search results, Google will be able to serve more targeted ads, because it has an additional context for the search: the context of the CSE. Obviously, more targeted ads mean more revenue, and perhaps that extra revenue justifies the extra cost, should there be no better way to optimize CSE retrieval compared to regular search.


Filed under custom search engine, Google, Google CSE, Google Search, Information Retrieval, search engine

Custom Search Engine by Google

Google has started allowing people to roll out their own search engines. This is a very powerful feature if you think about it. For example, university departments can have a custom search engine that only brings back results from a set of websites related to that department's subject.

So, the key is to identify a group of websites that can be logically grouped together to provide more targeted search results. While that is the idea, given that Google's CSE (see the CSE blog) allows up to 5000 links, people may come up with innovative uses for such large numbers. For example, the Publicly Traded Companies Search Engine provides search across thousands of publicly traded companies.
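As a rough way to picture what a CSE does, here is a small Python sketch that mimics the idea with Google's ordinary site: operator. This is not how a CSE is actually configured (that happens on Google's side and scales to the 5000-link limit); the site names below are hypothetical placeholders.

```python
from urllib.parse import quote_plus

# A tiny, hand-picked "custom search" scope (hypothetical site names).
SITES = ["cs.example.edu", "math.example.edu", "physics.example.edu"]

def cse_like_query(keywords, sites=SITES):
    """Build a Google query restricted to a curated list of sites."""
    restriction = " OR ".join("site:" + s for s in sites)
    query = f"{keywords} ({restriction})"
    return "https://www.google.com/search?q=" + quote_plus(query)

print(cse_like_query("admission requirements"))
```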

One good thing about Google's CSE is that it allows for collaboration, so the 5000-link limit can be overcome by working with other people to add more links. Of course, as long as the added links carry that extra meta-data with them, this should be fine; abusing it to add any random set of links will not serve the intent. Good luck rolling out your own CSE.


Filed under custom search engine, Google Search, search engine

Online Search Privacy

Recently AOL leaked three months' worth of searches conducted by their users, supposedly by mistake. Even though usernames are not given out in the logs, there has been a lot of criticism, based on this article, about how it became possible to derive the identity of a person from the searches she conducted.

In AOL's case, the searches are tied back to individual users through their AOL accounts. Since Google doesn't have subscriptions like AOL, is it safe to assume your privacy is safe with Google? Not if you are using Gmail as well. Why? Because when you log in to Gmail and keep it open all the time, like I do, then every time you do a search, Google knows that it's you (with a particular Gmail address) doing the search. And that's scary, isn't it?

People keep complaining about this privacy issue. One way it could be resolved is if you could run two separate instances of Firefox, each as a separate process, with neither knowing about the cookies of the other. Perhaps for various technical reasons this is not possible, but where there is a will, there is a way. Isn't it?

On the other hand, even if that were possible, would you want to keep switching between two separate browsers? That's the whole reason we all love Firefox's tabbed browsing in the first place.

So, what can be done to browse in a single browser and yet protect your privacy against Google? Or, for that matter, against any other company that can track your usage of one free service based on your being logged into another of their services?

I think I found a solution. It's basically the IE Tab plugin I talked about in my previous blog post. Here is what you can do.

1. Get Firefox if you don't already have it, and install it.
2. Then download the "IE Tab" plugin mentioned above, install it, and restart the browser.
3. Now go to https://gmail.google.com/ and switch the tab to IE by clicking the "IE Tab" icon at the bottom of the status bar.
4. Log in to Gmail.
5. Now open another tab and happily keep searching the web; Google won't get your Gmail cookie to track you with.

How do you know it's working? In the past, when you visited Google's homepage to search, you would see a "Logout" link; now you will see a "Sign In" link instead. Google no longer knows that you are logged in, because IE's cookie store is separate from Firefox's.
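As a rough illustration of why this works (a toy Python sketch, not what the browsers actually do internally, and the cookie name is made up): two HTTP clients with separate cookie jars simply never share what one of them was given, so requests from the second client carry no identifying cookie.

```python
import requests

# Two independent cookie stores, like IE and Firefox keeping separate
# cookie databases. No real Google endpoints are contacted here.
gmail_browser = requests.Session()   # stands in for the IE tab running Gmail
search_browser = requests.Session()  # stands in for the rest of Firefox

# Pretend Gmail set a login cookie in the first "browser" (hypothetical name/value).
gmail_browser.cookies.set("GMAIL_AT", "secret-session-token", domain="google.com")

print("gmail_browser cookies: ", gmail_browser.cookies.get_dict())   # has the token
print("search_browser cookies:", search_browser.cookies.get_dict())  # empty, so searches
                                                                      # carry no identity
```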

Of course, they can still use your machine's IP address to try to correlate searches, but that is a lot less precise than tying your searches to you personally.


Filed under Gmail, Google Search, Privacy, Tech - Tips