Today I happened upon a website that offers searching for products by color. I had actually seen this on another site a few months back, but I didn't think much about the underlying technology. Today, my first reaction was "wow, are they hiring people to look at each product image and catalog its colors?" Then I realized this can be done easily by processing the product image. Every image is made of a bunch of pixels, and the color of each pixel is available through any imaging API. So one approach is to count the frequency of each color, order the colors by frequency, and finally pick the first N (or those above some threshold). As with any image processing, though, there are alternatives to consider. For example, if the image is a JPEG instead of a GIF, there are far too many distinct colors and the frequency of each individual color may be tiny; merging all very similar colors into one helps. Similarly, a color with high frequency could be just small specks scattered all over the image and not really useful, while a ring with a small diamond in the middle contains a very small but very important color. So a color chosen by clustering, rather than purely by frequency, is also a good choice. The only catch is that there needs to be a way to exclude the background color, which in most product images is white.
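As a rough sketch of the frequency approach, here is a toy Python function operating on a list of (r, g, b) pixel tuples (the form an imaging library such as Pillow would hand you). The bucketing size, white cutoff, and the function itself are my own illustrative choices, not anything the site necessarily does:

```python
from collections import Counter

def dominant_colors(pixels, n=3, bucket=32, white_cutoff=230):
    """Return up to n dominant colors from (r, g, b) pixel tuples."""
    counts = Counter()
    for r, g, b in pixels:
        # Skip near-white pixels: usually the product-photo background.
        if r >= white_cutoff and g >= white_cutoff and b >= white_cutoff:
            continue
        # Merge very similar colors into one bucket so JPEG noise
        # doesn't split a single perceived color into hundreds.
        key = (r // bucket * bucket, g // bucket * bucket, b // bucket * bucket)
        counts[key] += 1
    return [color for color, _ in counts.most_common(n)]

# Toy "image": mostly white background, a red product, a few blue specks.
pixels = [(255, 255, 255)] * 500 + [(200, 10, 10)] * 300 + [(10, 10, 200)] * 20
print(dominant_colors(pixels))  # → [(192, 0, 0), (0, 0, 192)]
```

A clustering variant (k-means over pixel colors, say) would replace the bucketing step, which is what you would reach for to catch that small-diamond case.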
Keeping all the above in mind, assume each product is associated with a few colors. The next step is to take the color the user has picked and match it against the product colors within some delta, since an exact match is rarely possible.
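The delta matching could look something like the sketch below. The Euclidean RGB distance, the delta value, and the sample index are all hypothetical (a perceptual color space like CIELAB would match human judgment better, but plain RGB keeps the idea visible):

```python
import math

def color_distance(c1, c2):
    """Euclidean distance between two RGB colors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def matches(picked, product_colors, delta=60):
    """A product matches if any of its colors is within delta of the pick."""
    return any(color_distance(picked, c) <= delta for c in product_colors)

# Hypothetical index: product id -> dominant colors extracted earlier.
index = {
    "ring-101": [(230, 230, 235), (40, 40, 45)],   # silver ring, dark stone
    "bag-202":  [(180, 20, 30)],                   # red handbag
}
picked = (200, 0, 0)  # user clicked a red swatch
print([pid for pid, cols in index.items() if matches(picked, cols)])
# → ['bag-202']
```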
For a retailer, doing the above is simply a matter of processing the images already in its system and building a color index. If this were to be done by a search engine, however, the search engine would first have to retrieve each product image before processing it.
If you have a dynamic website that is search engine friendly, chances are your entire database of some entity, such as a list of products, is exposed in pages that bots can easily reach by crawling. This comes with a cost: the bots keep crawling regularly, even if at a slow rate, and you have to render those dynamic pages one after another, costing you both bandwidth and CPU cycles. To avoid this, you can write your dynamic pages so that they are "last modified" aware. Say you have a product page that lists the product details as well as user comments. While the product details themselves are relatively static, content such as comments could change every few days, especially in the first few months after a product launches. Of course, if you do dynamic pricing, that's different. But even then, people seldom search on an exact price string (they want to know if a product is available below a certain price, not at an exact price), so don't worry that your latest price is not indexed. Besides, between the time a price is indexed and the time it appears in search results, it may have changed yet again.
So, how do you make your dynamic page "last modified" aware? Bots use a conditional HTTP GET request that passes an If-Modified-Since header carrying a specific date. The bot is essentially asking you to respond with the full content only if it has been modified since that date. Otherwise, you can respond with a status code of 304, which tells the bot that nothing has changed. Search engine crawlers like Google's, which record the last time they crawled each page, use these conditional requests so that they can save the same bandwidth and CPU cycles as you.
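The server-side check can be sketched in a few lines of Python, assuming you track a last-modified timestamp per page (for a product page, say, the newer of the last product update and the last comment). The `respond` helper is illustrative, not a real framework API:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime, format_datetime

def respond(page_last_modified, if_modified_since_header):
    """Return (status, headers): 304 with no body, or 200 with Last-Modified."""
    if if_modified_since_header:
        since = parsedate_to_datetime(if_modified_since_header)
        if page_last_modified <= since:
            # Nothing changed: skip rendering, send an empty 304.
            return 304, None
    # Changed (or unconditional request): render the full page and
    # advertise its timestamp so the bot can be conditional next time.
    return 200, {"Last-Modified": format_datetime(page_last_modified, usegmt=True)}

page_changed = datetime(2009, 3, 1, tzinfo=timezone.utc)
print(respond(page_changed, "Sun, 15 Mar 2009 00:00:00 GMT"))  # → (304, None)
```

The win is that the 304 path returns before any database queries or template rendering happen, which is exactly the bandwidth and CPU saving described above.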
How do you know if you are making use of this functionality? It's easy: check your log files for 304 response codes on requests from Googlebot or other search engine bots. If you only ever see 200 and never 304, you are not using this feature.
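A quick way to do that check is to tally status codes for bot requests in a combined-format access log. The two sample log lines below are made up for illustration:

```python
import re

def bot_status_counts(lines, bot="Googlebot"):
    """Count HTTP status codes on log lines mentioning the given bot."""
    counts = {}
    for line in lines:
        if bot in line:
            # Combined log format: status code follows the quoted request.
            m = re.search(r'" (\d{3}) ', line)
            if m:
                counts[m.group(1)] = counts.get(m.group(1), 0) + 1
    return counts

# Hypothetical sample lines from an Apache/Nginx combined log.
log_lines = [
    '66.249.66.1 - - [15/Mar/2009:10:00:00 +0000] "GET /product/42 HTTP/1.1" 200 5120 "-" "Googlebot/2.1"',
    '66.249.66.1 - - [16/Mar/2009:10:00:00 +0000] "GET /product/42 HTTP/1.1" 304 0 "-" "Googlebot/2.1"',
]
print(bot_status_counts(log_lines))  # → {'200': 1, '304': 1}
```

If the '304' key never shows up in the tally, your pages are ignoring If-Modified-Since.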