
My website is being pounded by spiders, what can I do?

Question:

The Access database in my ASP-based web site is overloaded because spiders are crawling my site all day. Do you think it is a good idea to create another source specially for the spiders and only send “real people” to my live database-based site? Or do you think this will cause problems? Do you have any suggestions?

I can imagine some readers looking at this and going “spiders? There are spiders on the internet?” Indeed there are. And they are something that many website owners need to be aware of and deal with appropriately.

You see, spiders are generally a good thing.

But displaying different content for them? Well, that’s a bad thing. A really bad thing.


Spiders “walk the web”. They’re programs that automatically visit web sites, look at all the links on the web site, and then go visit all the pages and other websites that those links point to. Repeat that process indefinitely, and with almost every page on the web linking to some other page on the web, a spider that just follows links should be able to visit or access almost everything that’s on the internet. At least in theory.

Spiders are a good thing because the big search engines, like Google and Yahoo, use a spider (actually many spiders) to examine web pages for inclusion in their massive search indexes. Their spiders will also come back and visit web pages “every so often”, so as to keep their index up to date with any changes you’ve made since the last time the spider visited.

There are two “problems” with spiders:

  • There are a lot of them. Probably thousands of spiders all attempting to visit every website, and often repeatedly. Every search engine, every custom search engine, a bunch of academic projects, and who knows what else may have its own spider attempting to visit your site. The load can add up.
  • Sometimes they misbehave. Since it’s a computer program, a spider could ask for pages faster than your web server could deliver them, but a “well behaved” spider won’t. Sometimes a reputable spider like Yahoo’s or Google’s will get confused, and sometimes spiders simply aren’t well behaved. In those cases, a spider can bring a site to its knees.

The problems with the load caused by spiders are exacerbated, of course, if your website is designed poorly, runs on a low-performing server, or has insufficient bandwidth.

In the original question, you indicated that you’re using Microsoft Access as your database. I dearly love Access for many things, but being the database behind a web site isn’t one of them. It’s not something Access was designed for, and it would be one of the first places I’d look for performance-related issues under moderate to heavy load. More appropriate technologies include Microsoft SQL Server, MySQL, and others.
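If you do move to a server-based database, the change on the ASP side is often little more than the connection string. Here’s a rough sketch in classic ASP (VBScript), assuming you’re using ADO; the file path, server name, database name, and credentials below are placeholders, not anything from a real site:

    ' Illustrative only -- paths, names, and credentials are placeholders.

    ' A typical Access (Jet) connection string:
    Dim connAccess
    connAccess = "Provider=Microsoft.Jet.OLEDB.4.0;" & _
                 "Data Source=C:\data\mysite.mdb;"

    ' A roughly equivalent SQL Server connection string:
    Dim connSql
    connSql = "Provider=SQLOLEDB;Data Source=localhost;" & _
              "Initial Catalog=mysite;User ID=webuser;Password=secret;"

    Dim db
    Set db = Server.CreateObject("ADODB.Connection")
    db.Open connSql   ' the rest of your ADO code stays largely the same

Your queries may need some adjustment as well, since Access and SQL Server don’t use exactly the same SQL dialect, but switching the database engine out from under an ASP site is a well-trodden path.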

I do want to be clear that presenting one set of content to the spiders and another to “real” users is a very, very bad idea if you want to rank on the search engines. For one thing, you run the risk of providing the wrong content when people click on a search engine result, which is a bad experience for the users. Even worse, though, most search engines explicitly prohibit this behavior. If you present one set of content to real users and something different to the spiders, you run a very real risk of being banned from the search engine results entirely.

So once you’ve cleared up your site performance issues, what are your options?

The first is something called “robots.txt”: a text file you place in the root of your website that instructs spiders as to what they may, and may not, do. Using robots.txt you can tell specific spiders which parts of your site they’re allowed to scan. That means you can also tell a specific spider not to scan your site at all. If you don’t care about search engine rankings, you can even tell all spiders not to scan your site.
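Here’s a minimal sketch of what a robots.txt might look like. The spider name and the directory are made up for illustration; each spider identifies itself with its own “User-agent” name, which you can find in your server’s access logs:

    # Tell one particular spider (a made-up name here) to stay away entirely
    User-agent: ExampleBadBot
    Disallow: /

    # Tell all other spiders to skip a database-heavy area of the site
    User-agent: *
    Disallow: /search/

Changing that last “Disallow” line to a bare “/” would tell every well behaved spider to stay out of the entire site.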

The downside to robots.txt is that it relies on spiders being “well behaved”. Spiders have to choose to follow the instructions that you place in robots.txt, and most do. Certainly legitimate spiders do. But what about the rest? What about the ones that ignore what you’ve said in robots.txt completely?

Your only real recourse that I’m aware of is to block them at the IP level. That means first identifying the offending spider by examining your server access logs, then determining the IP address or address range the spider accesses your site from, and finally using some technique on your web server to block that address or range from accessing your site. The exact technique varies, but on Apache web servers, for example, it’s often as easy as a simple entry in your .htaccess configuration file.
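As a sketch only, here’s what such an entry might look like in .htaccess on an older (2.x era) Apache server; the addresses shown are placeholders from a reserved documentation range, so substitute whatever you actually found in your logs:

    # Block one offending address and one partial range
    # (192.0.2.x is a reserved example range -- use the addresses
    # from your own access logs instead)
    Order allow,deny
    Allow from all
    Deny from 192.0.2.15
    Deny from 192.0.2.

Newer versions of Apache use a different, “Require”-based syntax, and other web servers, including IIS, have their own ways to restrict access by IP address, so check the documentation for whatever you’re running.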

Finally, I want to mention that spiders aren’t the only way your site can get overloaded. Of course, you could just be very popular, and I hope that’s a problem you’ll have to face some day soon :-). However, spammers have also entered the picture. Spammers have started to use tools that automatically fill in any form they might find on your site, in the hopes that what they post will somehow get published. Spammers are also on the lookout for vulnerabilities in CGI scripts that they can then hijack to use as an email spam-sending relay. If you find your server is overloaded, be sure to check out exactly what is causing the problem so you can take the right action.

 
