Category Archives: search engine

Product Catalog Search By Color

Today I happened to see a website that offered searching for products by color. I had actually seen this on another site a few months back, but I didn't think much about the underlying technology. Today, my first reaction was "wow! are they hiring people to look at each product image and capture the colors?" Then I realized this can be done easily by processing the product image. The idea is that every image is made up of a bunch of pixels, and the color of each pixel is available through the image API. So one approach is to get the frequency of each color, order the colors by frequency, and finally pick the first N (or use some threshold). However, as with any image processing, there are alternative choices. For example, if the image is a JPEG instead of a GIF, the number of distinct colors is very large and the frequency of each individual color might be tiny; so treating all colors that are very similar as one single color would help. Similarly, a color with a high frequency could be just small specks scattered all over the image and not really useful, while a ring with a small diamond in the middle could contain a very small but very important color. So picking colors based on clustering rather than purely on frequency is also a good choice. The only catch is that there needs to be a way to exclude the background color, which in most product images is white.
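
A rough sketch of the frequency-based approach, assuming the Pillow imaging library is available; the channel bucketing and the near-white background cutoff are simplifications of my own:

```python
from collections import Counter
from PIL import Image  # assumes Pillow is installed

def dominant_colors(path, top_n=3, bucket=32, white_cutoff=230):
    """Rough sketch: bucket similar colors together, skip near-white
    background pixels, and return the most frequent buckets."""
    img = Image.open(path).convert("RGB")
    counts = Counter()
    for r, g, b in img.getdata():
        # Treat near-white pixels as background and ignore them.
        if r > white_cutoff and g > white_cutoff and b > white_cutoff:
            continue
        # Quantize each channel so very similar colors fall into the
        # same bucket (handles the many shades in a JPEG).
        counts[(r // bucket * bucket,
                g // bucket * bucket,
                b // bucket * bucket)] += 1
    return [color for color, _ in counts.most_common(top_n)]

# print(dominant_colors("ring.jpg"))  # hypothetical product image path
```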

Keeping all the above in mind, assume each product is associated with a few colors. The next step is to take the color the user picked and match it against the product colors within a delta, since an exact match is not always possible and may not return many choices.
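
A minimal sketch of that matching step, using plain Euclidean distance in RGB space and a hypothetical color index:

```python
def color_distance(c1, c2):
    """Simple Euclidean distance in RGB space (crude but common)."""
    return sum((a - b) ** 2 for a, b in zip(c1, c2)) ** 0.5

def matching_products(picked, catalog, delta=60):
    """Return products whose indexed colors are within `delta` of the
    color the user picked, since exact matches are rare."""
    return [pid for pid, colors in catalog.items()
            if any(color_distance(picked, c) <= delta for c in colors)]

# Hypothetical color index built from the product images.
catalog = {"ring-123": [(192, 192, 192), (64, 160, 224)],
           "bag-456": [(160, 32, 32)]}
print(matching_products((200, 200, 200), catalog))  # -> ['ring-123']
```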

For a retailer, doing the above simply means processing the images already in its system and creating a color index. However, if a search engine were to do this, it would first have to retrieve each product image for processing.



Filed under Product Catalog, search engine, Search Indexing

PageRank vs Delicious Tags

How can Yahoo! improve its search results? Google nailed it for about a decade with its so-called PageRank algorithm. But because it's a patented technology, others can't copy it. However, it's not just PageRank that really improved the search results. One other key thing in Google's approach is giving more importance to the anchor text that a link uses to describe the target page. This is quite powerful: someone who provides ERP application performance tuning services can stuff a bunch of keywords on every page of the website whether or not a page is really about that topic, but the sites that link to any of those pages will use only a handful of keywords to describe the page appropriately.
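
As a toy illustration of that idea (purely a sketch with made-up data), aggregating the anchor text of inbound links gives a description of the page that the page owner does not fully control:

```python
from collections import Counter

# Hypothetical inbound links: (source page, anchor text) pairs that all
# point at the same ERP tuning page.
inbound_links = [
    ("blog-a.com/post1", "ERP performance tuning services"),
    ("forum-b.com/thread9", "great ERP tuning consultants"),
    ("news-c.com/story", "performance tuning for ERP applications"),
]

# Terms other sites use to describe the page; these tend to be a more
# honest summary than the page's own keyword stuffing.
anchor_terms = Counter(word.lower()
                       for _, text in inbound_links
                       for word in text.split())
print(anchor_terms.most_common(5))
```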

The very concept of identifying what a page is about from an external reference to it is similar to deriving what a page is about from the tags used to bookmark it on del.icio.us. So, instead of relying purely on the keywords listed on a page while crawling and indexing it, Yahoo! could look up the tags associated with that page in the del.icio.us repository and combine them to give extra weight to the keywords used in the tags. One good thing about this approach is that the page's tag cloud on delicious captures what people generally think the page is about. For example, even though tocloud.com provides tag cloud generation tools, when someone sees that page they think of various other things such as tagcloud, tagging, seo, etc. But tagcloud should get more weight than seo, because the del.icio.us tag cloud for tocloud.com shows the tagcloud tag much bolder than the seo tag.
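
A minimal sketch of how such a combination might look, with made-up numbers and a hypothetical boosting scheme (del.icio.us exposed tag counts, but the exact weighting below is my own invention):

```python
def boosted_weights(page_keywords, delicious_tags, tag_boost=2.0):
    """Hypothetical scoring: start from on-page keyword frequencies and
    boost each term by its share of the page's del.icio.us tag cloud."""
    weights = dict(page_keywords)  # term -> on-page frequency
    total_tags = sum(delicious_tags.values()) or 1
    for tag, count in delicious_tags.items():
        share = count / total_tags          # bolder tags get a bigger share
        weights[tag] = weights.get(tag, 0) + tag_boost * share
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical numbers for tocloud.com.
page_keywords = {"tagcloud": 12, "generator": 5, "seo": 1}
delicious_tags = {"tagcloud": 80, "tagging": 40, "seo": 10}
print(boosted_weights(page_keywords, delicious_tags))
```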

Of course, just as Google has to deal with issues such as link farms and backlink schemes, people may start creating fake accounts and tagging their own pages with all sorts of keywords to influence the search results. So Yahoo! would need some clever algorithms to distinguish fake users from real users who are tagging, and to filter out any such manipulation.

Now, what will live.com do to figure out the true purpose of a website? Whoever finds an answer can start the third search engine that can be quite successful.


Filed under del.icio.us, Google, search engine, Yahoo!

Driving Search Volume Through Articles

With Google continuing to be No. 1 and the other search engines constantly losing the search battle, Yahoo! seems to be trying out a few new things. One I have been seeing for the last few days is linking a few select words which, when highlighted, pop up a small inline box (similar to contextual ads) letting the user click through to search results. I personally don't like that. Then today I saw on their homepage a link about the most popular puppies, which enumerates a list of the top 20 most popular puppy breeds. Each breed name is a link that takes you to the search results for that breed. Again, I don't see why one would want to do a search while reading an article, but the fact that you don't know you will be taken to the search results makes you click it, perhaps with the hope of seeing more details like photos, only to be greeted with a list of ads and web search results (and images).

Whether this technique is useful to the audience or not, it certainly is a good tactic to raise the search volume for Yahoo!


Filed under search engine, search engine volume, Yahoo!

Why did Google introduce Custom Search Engines?

The real answer? I don't know! But I'd like to guess.

Remember the days when Google used to display the total number of pages they had indexed? It was something like 4 billion. Then one fine day they changed it to something like 8 billion. Within a week, I think Yahoo! said they index even more than that (18 billion or some such large number, even though their results consistently returned fewer than Google's). The thing is, the web is exploding. And with all the social networking, Web 2.0, blogs, etc., the rate at which content is getting generated is only increasing (including this blog!). Which means, no matter whether it is the PageRank algorithm or some other extremely smart algorithm, at such huge volumes of data a certain amount of noise is bound to exist. Add to this the continuous quest by smart webmasters and SEO gurus to beat Google's algorithms, which only contributes to suboptimal or undesired pages bubbling up.

Also, one of the reasons for the sudden explosion in Google's (and Yahoo!'s) page counts is that they started indexing pages that contain a few parameters in their URLs. In the past, mostly static URLs were crawled by the search engines, with perhaps at most one additional parameter allowed; I believe Google now handles up to three or perhaps more. Either way, people realized this limitation (or behavior of the search engines) and started exposing their content as static URLs using techniques such as URL rewrites, subdomains, etc. Who would want to do this? Mostly retailers who want their entire product catalog available for public search. No wonder a search for "site:amazon.com" on Google returns more than 20 million results!

Considering all the above, for a single search engine searching these billions of pages (which one day will probably hit a googol of pages, though that is a pretty long shot), it's imperative that this collective knowledge (or spam, or whatever you call it) be intelligently broken into smaller subsets so people can search within the subsets that are most relevant to them.

This is interesting! Yahoo!, which has since largely lost to Google in search volume, was back in the '90s the revolutionary search engine that organized web pages into various categories that people could search and browse. So, after all the PageRanks and other search engine techniques, we are in a sense back to categorization (this is an oversimplification, of course, and I don't deny the importance of all these algorithms).

The only fundamental difference is that instead of the categories being controlled by a single entity on its own servers, the CSE concept allows individuals to categorize websites in whatever way they want. And this is quite powerful: there is no bottleneck of someone having to view and approve, subject matter experts know best how to classify the content, and innovative ideas like Fortune 500 Companies Search and My Meta Matrimonials become possible. Several other CSEs can be found at cselinks.com.

In addition to improving the quality of the search results, another benefit to the search engine companies, if people start using more and more of these CSEs, is that the load on their servers should decrease. At first thought, one might think that introducing so many custom search engines would add a lot of load on these companies' servers. However, given that a CSE covers a much, much smaller subset of the entire web, the amount of work the search engine needs to do per query is probably far less!

I don't know the exact algorithms used to make these CSEs fast. But intuitively, I think of searching a CSE as similar to searching a single partition of a range/list/hash partitioned table in a database. Without such partitioning, the database has to do a full table scan of the entire table instead of a full scan of a single partition (for those queries that require one). Of course, the fundamental difference between the two is that with a CSE the data can be partitioned in as many different ways as you like, while a partitioned table is partitioned upfront in one and only one way.

My knowledge of information retrieval is limited to how Lucene works. In Lucene, each token has a postings list of the documents it occurs in, and answering a query is a matter of taking the search keyword (complex boolean expressions are more sophisticated, but the fundamental concept remains the same, at least for the purposes of this discussion) and walking through that token's list of documents. The key thing, though, is the order in which this list is presented to the user; this is where Google's PageRank comes into the picture. Lucene uses a priority queue to get the top 100 most relevant results and, in the process, also gets the total count. Google probably does the same, but as part of that it would now have to apply a filter based on the sites specified in the CSE. One thing to note is that Google allows giving a site of your choice higher priority than another (even one with a higher PageRank), which means the relevancy has to be calculated dynamically per CSE. With this simple change to the existing algorithm, though, I don't really see much benefit for the CSE, since it in fact adds an extra filtering operation. A different approach would be to build smaller indexes for the subsets (keeping the doc ids the same as in the global index and just reprocessing the existing index), which avoids both the filter operation and having to loop through all the sites that are not part of the CSE. The downside is the additional disk space.
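
A toy sketch of the two approaches, with made-up postings data (real engines store postings far more compactly and score far more elaborately):

```python
# Toy postings: token -> list of (doc_id, score); each doc belongs to a site.
postings = {"puppies": [(1, 0.9), (2, 0.7), (3, 0.4)]}
doc_site = {1: "akc.org", 2: "example-blog.com", 3: "wikipedia.org"}

def search_with_filter(token, cse_sites, top_n=100):
    """Approach 1: walk the global postings list and filter by CSE sites,
    paying the cost of the extra filter on every query."""
    hits = [(doc, score) for doc, score in postings.get(token, [])
            if doc_site[doc] in cse_sites]
    return sorted(hits, key=lambda h: h[1], reverse=True)[:top_n]

def build_sub_index(cse_sites):
    """Approach 2: pre-build a smaller index containing only the CSE's
    documents (same doc ids as the global index), trading disk for speed."""
    return {tok: [(d, s) for d, s in plist if doc_site[d] in cse_sites]
            for tok, plist in postings.items()}

cse = {"akc.org", "wikipedia.org"}
print(search_with_filter("puppies", cse))
print(build_sub_index(cse)["puppies"])
```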

So, while I presented two possible ways of implementing CSEs, I am sure there may be much better ways of doing the same that the smart PhDs at Google may have figured out.

Irrespective of the performance, the most important point is that with better-quality search results, Google will be able to serve more targeted ads, because it has an additional context for the search: the context of the CSE. Obviously, more targeted ads means more revenue, and perhaps that extra revenue justifies the extra cost, should there be no better way to optimize CSE retrieval compared to regular search.


Filed under custom search engine, Google, Google CSE, Google Search, Information Retrieval, search engine

Custom Search Engine by Google

Google has started allowing people to roll out their own search engines. This is a very powerful feature if you think about it. For example, a university department can have a custom search engine that only returns results from a set of websites related to that department's field.

So the key is to identify a group of websites that can be logically grouped to provide more targeted search results. While that is the idea, given that Google's CSE (cse blog) allows up to 5000 links, people may come up with innovative uses for such large numbers. For example, the Publicly Traded Companies Search Engine provides search across thousands of publicly traded companies.

One good thing about Google's CSE is that it allows collaboration, so the 5000-link limit can be overcome by working with other people to add more links. Of course, as long as the set of links carries that extra meta-data with it, this should be fine; but abusing it to add any random set of links will not serve the intent. Good luck rolling out your own CSE.


Filed under custom search engine, Google Search, search engine