Swarming Searchbots from Amazon AWS

I wrote about this a few months ago in “Are hungry searchbots eating your site alive?”, but the saga continues! I need a rescue mission, so will someone please send in the SWAT team?

[Geek warning—this post is really for geeks only]

Here’s the short version:

If you tweet your blog posts, there are hundreds of bots reading the Twitter feed and waiting for your post;
These bots immediately descend on your web server (following a tweet) and spider all over the place;
If your blog is WordPress-powered or requires significant CPU or database resources to generate a page, this can slow your server at exactly the time when you most need the capacity for human visitors;
The majority of these swarming bots do not properly identify themselves to your server; and
The majority of them are coming from AWS now.
It’s time to firewall unidentified bots hosted at AWS out of our blogs!

Here are the details:

One of my clients insists on tweeting every post that goes up on her web site, and that means dozens of posts a day because the site syndicates posts (in other words, it “repeats” posts from the blogs of her contributors) from about 100 different sources. It’s a great idea to tweet your posts, because in theory it generates lots of human traffic. But in this case, within seconds of any tweet there are hundreds of Twitter-following searchbots swarming all over the site checking every possible page[1. These are bots on servers of companies that presumably “live” off reading tweets and then directing traffic to the tweeted web sites.]. They do (HTTP) GETs of the pages, and they don’t bother looking at photos or support files (which is nice of them, actually, because it lowers the server load). They just hit the pages, and sometimes a single bot will hit the same page 4 or 5 times in a row. Why would a bot need to GET a page 4 times in a row? It only needs it once, doesn’t it?

The problem is that on a WordPress-powered site, each GET causes the full generation of all of the dynamic content on that page (if it’s not cached yet). This can take a considerable amount of CPU and database time: generating a single page can require several seconds of CPU time. Bots that hit frequently seen pages, like the home page, will get a cached version, which is “easy on the server” and requires very few resources to deliver. But bots that crawl all over the site force the regeneration of uncached pages, and that can put a significant load on a small server! For example, if an uncached page takes 3 CPU seconds to generate, a bot hitting it four times in one second creates the need for 4 x 3 = 12 seconds of CPU time in a single second of clock time. It’s this unrestricted (and way too fast!) crawling that causes a server to fall to its knees and surrender, and stop serving pages to human visitors. Unfortunately, the crawlers hit the site immediately after a tweet goes out, which is when you want the server to be at maximum readiness to handle human traffic!
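For the geeks who like to see the arithmetic spelled out, here is a rough sketch of that 4 x 3 calculation. The function name and the `cache_hit_ratio` parameter are my own illustrative inventions, and the model assumes (as a simplification) that a cached page costs no CPU at all.

```python
def cpu_seconds_demanded(requests_per_second, cpu_seconds_per_page, cache_hit_ratio=0.0):
    """Estimate CPU seconds of work requested per second of clock time.

    Simplifying assumption: cached pages cost nothing, so only the
    uncached share of the requests contributes to the load.
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    uncached_requests = requests_per_second * (1.0 - cache_hit_ratio)
    return uncached_requests * cpu_seconds_per_page


# The example from the text: one bot, 4 hits per second, 3 CPU seconds per page.
print(cpu_seconds_demanded(4, 3))
```

Anything above 1.0 per CPU core means the server is being asked for more work than it can finish in real time, and that is with just one bot.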

The majority of these swarming bots are now coming from IP addresses assigned to Amazon AWS. Bots usually identify themselves to a site when they GET a page[2. …by including a string called the User-Agent in their request], saying who or what company is doing the search, and usually giving a URL so you can find out more about the company or the searchbot. The majority of these new bots I’m seeing do not identify themselves. Well-behaved bots like Google and Bing, Yahoo and MSN, all identify themselves, and they only GET pages on a slower schedule[3. …that I can define in the robots.txt file]. It’s the outlaws that are causing the problems.
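To make “identifies itself” a bit more concrete, here is a minimal sketch of how one might sort User-Agent strings into rough buckets. The category names and the “contains a URL” rule are my own assumptions for illustration, not any kind of standard, and a real browser will usually land in the `no_contact_url` bucket too.

```python
import re

# Assumed heuristic: a self-identifying bot includes a URL in its User-Agent.
URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


def classify_user_agent(user_agent):
    """Return 'missing', 'has_contact_url', or 'no_contact_url'."""
    # Many access logs record an absent User-Agent as "-".
    if user_agent is None or user_agent.strip() in ("", "-"):
        return "missing"
    if URL_PATTERN.search(user_agent):
        return "has_contact_url"
    return "no_contact_url"


googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
for ua in (googlebot, "-", "SomeFetcher/1.0"):
    print(repr(ua), "->", classify_user_agent(ua))
```

The `SomeFetcher/1.0` string is made up. The interesting bucket for this post is `missing`, since those are the requests that give you no one to complain to.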

What I am seeing these days is dozens of AWS-based bots spidering the site, each making many requests per second, with no identification at all!
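Spotting the “many requests per second” offenders can be sketched roughly like this, assuming you have already pulled (IP address, timestamp) pairs out of your access log. The function name, the threshold of 3 hits, and the one-second window are all arbitrary choices of mine, and the sample addresses come from the ranges reserved for documentation, not from real traffic.

```python
from collections import defaultdict, deque


def find_bursty_clients(hits, max_hits=3, window_seconds=1.0):
    """Return the set of IPs that exceed max_hits within any sliding window.

    hits: iterable of (ip, timestamp_in_seconds), in time order per IP.
    """
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    offenders = set()
    for ip, ts in hits:
        window = recent[ip]
        window.append(ts)
        # Drop timestamps that have fallen out of the window.
        while ts - window[0] > window_seconds:
            window.popleft()
        if len(window) > max_hits:
            offenders.add(ip)
    return offenders


sample_hits = [
    ("203.0.113.7", 0.0), ("203.0.113.7", 0.2),
    ("203.0.113.7", 0.4), ("203.0.113.7", 0.6),   # four hits in under a second
    ("198.51.100.9", 0.0), ("198.51.100.9", 5.0),  # a polite visitor
]
print(find_bursty_clients(sample_hits))
```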

I had hoped I could come to a compromise solution that would keep these guys from clobbering the client site but still let them in, but I can’t find one. So the firewalls went up today to refuse service to most AWS-based searchers. Sorry guys, but bad behavior on the part of many has caused grief for the few human visitors, and we have to pay attention to our real readers, who are the humans, after all.
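The blocking rule itself boils down to something like the sketch below. This is not my actual firewall configuration, just an illustration of the decision in Python. The two networks listed are placeholder documentation ranges and NOT real AWS addresses; real ranges would have to come from Amazon's own published list. The function name `should_block` is hypothetical as well.

```python
import ipaddress

# Placeholder networks (reserved documentation ranges), standing in for
# whatever AWS ranges you decide to block. These are NOT real AWS ranges.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def should_block(remote_ip, user_agent, networks=BLOCKED_NETWORKS):
    """Block only requests that are both from a listed network and unidentified."""
    address = ipaddress.ip_address(remote_ip)
    from_listed_network = any(address in network for network in networks)
    unidentified = (user_agent or "").strip() in ("", "-")
    return from_listed_network and unidentified


print(should_block("203.0.113.50", ""))               # listed network, no User-Agent
print(should_block("203.0.113.50", "SomeFetcher/1.0"))  # listed network, but identified
print(should_block("192.0.2.10", ""))                 # unidentified, but not listed
```

Requiring both conditions keeps the rule narrow: a bot that says who it is gets through even from a listed network, and so does an anonymous visitor from anywhere else.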
