If you are a webmaster who checks your website statistics periodically, you know that a bunch of crawlers, mostly from Google and Yahoo!, visit your website. One thing I noticed is that Googlebot typically visits a website from a single IP address on any given day (the frequency and IP variation may depend on the popularity of the website), while Yahoo! Slurp visits the site from multiple IP addresses. I think Yahoo! does this to parallelize their crawling.

However, given the choice between parallelizing a crawl within one site and parallelizing it across multiple sites, the latter is probably preferable for a few reasons. One is that the website being crawled needs to expend fewer resources (a single crawler can reuse a keep-alive connection, and there are no concurrent crawler connections). The other is that, if you use ordinary web statistics software that can't filter out crawler visits, the visitor count is inflated when the site is crawled from multiple IPs. Also, the latest visits report in my website's cPanel groups visits by IP address, so there are many separate entries for Yahoo! but a single consolidated entry for Googlebot. I wonder if there is a way to specify the maximum number of concurrent crawlers per bot.
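To illustrate the reporting problem, here is a minimal sketch of how visits could be consolidated per crawler instead of per IP address. It assumes you can get `(ip, user_agent)` pairs out of your access log; the `BOT_MARKERS` table, the function names, and the sample data are all hypothetical, and the user-agent substrings should be checked against what actually shows up in your own logs.

```python
from collections import defaultdict

# Hypothetical mapping of display name -> substring expected in the
# User-Agent header. Verify these against your own access log.
BOT_MARKERS = {
    "Googlebot": "googlebot",
    "Yahoo! Slurp": "slurp",
}


def classify(user_agent):
    """Return the crawler name for a user agent, or None for other visitors."""
    ua = user_agent.lower()
    for name, marker in BOT_MARKERS.items():
        if marker in ua:
            return name
    return None


def summarize(visits):
    """Group visits by crawler rather than by IP.

    visits: iterable of (ip, user_agent) pairs.
    Returns (distinct IP count per crawler, distinct non-crawler IP count).
    """
    ips_by_bot = defaultdict(set)
    other_ips = set()
    for ip, user_agent in visits:
        bot = classify(user_agent)
        if bot is None:
            other_ips.add(ip)
        else:
            ips_by_bot[bot].add(ip)
    bot_counts = {name: len(ips) for name, ips in ips_by_bot.items()}
    return bot_counts, len(other_ips)


# Made-up sample using documentation-only IP ranges.
sample = [
    ("192.0.2.10", "Mozilla/5.0 (compatible; Googlebot/2.1)"),
    ("192.0.2.10", "Mozilla/5.0 (compatible; Googlebot/2.1)"),
    ("198.51.100.1", "Mozilla/5.0 (compatible; Yahoo! Slurp)"),
    ("198.51.100.2", "Mozilla/5.0 (compatible; Yahoo! Slurp)"),
    ("198.51.100.3", "Mozilla/5.0 (compatible; Yahoo! Slurp)"),
    ("203.0.113.7", "Mozilla/5.0 (Windows NT 10.0) Firefox/120.0"),
]

bot_counts, other_visitors = summarize(sample)
print(bot_counts)
print(other_visitors)
```

With a report like this, a crawler that arrives from many IPs shows up as one entry with an IP count next to it, and those IPs no longer inflate the visitor number.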