I work with a dozen or so clients at any given time, and over the last three weeks or so I’ve noticed that some sites on small servers with limited capacity are being “eaten alive” by searchbots. And not just by the usual suspects (Google, Yahoo, MSN), but by specialized searchbots that exhibit a kind of behavior I haven’t seen much before. It used to be that web site owners prayed for the searchbots to come by, and searchbots were by and large sparing in their examination of pages: they didn’t hit a site very hard at all, but built up an overall picture of the site’s pages over a long time. [1. Illustration: “Spider & Crossbones” pirate flag]
But times are changing rapidly! Even a site with very little human traffic may be suddenly and catastrophically overwhelmed by searchbot traffic.
Sites on small servers are frequently configured so that they can serve perhaps a dozen or two simultaneous visitors[2. Web servers have limited RAM, and because of the way popular web server software, like Apache, is usually configured, once the RAM is full they either slow down or stop serving visitors entirely. The condition is sometimes called “wedging” since it’s like trying to drive a wedge into a crack in a log.]. Searchbots (the robotic spiders that crawl the web on behalf of search engines) don’t use a web site the same way humans do. A human site visitor downloads a page, a bunch of photos, and some style sheets, and then sits there for a few seconds (at least) reading or looking at the page before clicking for more. Web servers like those I maintain for my clients are configured to handle this kind of “human paced” load, and we have lots of tricks [3. Like offloading the photos to content delivery networks.] so human visitors can be served really fast. WordPress sites, for example, require considerable CPU time to create a dynamic page that’s composed of data, photos, plugins and other widgets. So we have the server cache the finished pages and serve those cached copies rapidly, rather than spending a lot of server CPU time regenerating them for every visitor. A cached page might require only a tiny fraction of a second to serve, compared to the seconds it takes to build the page in the first place.
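To make the caching idea concrete, here is a minimal sketch in Python. It is not how WordPress or any particular caching plugin actually works; the `PageCache` class, the `slow_render` function, and the 50 ms build time are all made up for illustration. The point is only that the first request for a page pays the full build cost, and every later request for the same page does not.

```python
import time


class PageCache:
    """Tiny in-memory page cache (hypothetical; real caches also expire entries)."""

    def __init__(self, render):
        self._render = render   # function that builds a page from scratch
        self._pages = {}        # url -> finished HTML
        self.hits = 0
        self.misses = 0

    def get(self, url):
        if url in self._pages:
            # Cached copy: no rebuilding needed.
            self.hits += 1
            return self._pages[url]
        # Not cached yet: build the page once and keep the result.
        self.misses += 1
        html = self._render(url)
        self._pages[url] = html
        return html


def slow_render(url):
    """Stand-in for an expensive dynamic page build (assumed 50 ms here)."""
    time.sleep(0.05)
    return f"<html><body>Page for {url}</body></html>"


cache = PageCache(slow_render)
first = cache.get("/about")    # miss: the page is built
second = cache.get("/about")   # hit: served from memory
```

After those two calls, `cache.misses` is 1 and `cache.hits` is 1: the page was built only once, and the second visitor got the stored copy.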
But searchbots frequently look only at the core page, and not at the photos[4. There are specialized searchbots now that look only at photos or videos.], and then quickly move to the next page they want to investigate. Sometimes a searchbot will request 5 to 10 pages in a single second, whereas human visitors usually pace themselves at a page every few seconds. When a searchbot explores like this it can rapidly max out a small server. What’s more, human visitors tend to cluster and look at similar things, while searchbots may request pages all over the place, completely unconnected to each other. The human visitors, because they’re interested in similar topics, will end up hitting cached pages, while the searchbot, making 30x the normal number of requests per second [5. Say 10 pages per second rather than 1 every 3 seconds for a human], hits pages all over the site, most of which are not in the cache and have to be built from scratch.
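The arithmetic and the cache effect can be sketched in a few lines of Python. The request rates come from the figures above; the traffic patterns are assumptions chosen for illustration (100 human requests spread over 5 popular pages, 100 bot requests to 100 different pages), and the cache is idealized: it starts empty and never throws anything away.

```python
def hit_rate(requests):
    """Fraction of requests served from a cache that starts empty and never evicts."""
    cached = set()
    hits = 0
    for url in requests:
        if url in cached:
            hits += 1
        else:
            cached.add(url)   # first request for this page: build and store it
    return hits / len(requests)


# Rates from the text: a bot at 10 pages per second,
# a human at 1 page every 3 seconds.
bot_pages_per_second = 10
human_seconds_per_page = 3
ratio = bot_pages_per_second * human_seconds_per_page   # 30x

# Assumed traffic: humans cluster on 5 popular pages,
# the bot asks for 100 different pages.
human_requests = [f"/post/{i % 5}" for i in range(100)]
bot_requests = [f"/post/{i}" for i in range(100)]

human_hit_rate = hit_rate(human_requests)   # 95 of 100 requests come from cache
bot_hit_rate = hit_rate(bot_requests)       # every request is a fresh page build
```

Under these assumptions the humans get 95% of their pages from the cache, while the bot gets none, so the bot is not only making 30 times as many requests but also forcing the expensive path on every one of them.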
The worst of the “bad behavior,” however, arises from certain bots (I’ll name them in a later article) that “anticipate” what their masters might want to see and do a “look-ahead”: instead of picking up a single page, they might pick up a page and 5 to 10 related pages, regardless of whether their master wanted those pages. You can think of them as building a repository of pages, stemming from the top or home page, that a visitor might want to see, just on the off chance that a visitor will come along wanting that specific page.
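As a rough illustration of why look-ahead multiplies the load, here is a small Python sketch. It is a guess at the general shape of the behavior, not the code of any actual bot; the `fetch_with_lookahead` function, the `site_links` map, and the limit of 10 related pages are all hypothetical.

```python
def fetch_with_lookahead(start, links, max_related=10):
    """Return the pages a look-ahead bot would request for one wanted page.

    `links` maps a URL to the URLs it links to (a made-up site map).
    """
    fetched = [start]
    for url in links.get(start, [])[:max_related]:
        if url not in fetched:
            fetched.append(url)   # requested "just in case", not because anyone asked
    return fetched


# Hypothetical site: the home page links to 8 posts.
site_links = {"/": [f"/post/{i}" for i in range(1, 9)]}

requests_made = fetch_with_lookahead("/", site_links)
```

Here one page that somebody actually wanted turns into 9 requests against the server, and the 8 extra pages are exactly the scattered, probably uncached kind.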
Although the spiders are usually there for legitimate purposes, related to fancy and sometimes useful new online services, this kind of spidering can really drag down a server!