
My website is being pounded by spiders, what can I do?

Question:

The Access database in my ASP-based web site is overloaded because spiders
are crawling my site all day. Do you think it is a good idea to create another
source specially for the spiders and only send “real people” to my live
database-based site? Or do you think this will cause problems? Do you have any
suggestions?

I can imagine some readers looking at this and going “spiders? There are
spiders on the internet?” Indeed there are. And they are something that many
website owners need to be aware of and deal with appropriately.

You see, spiders are generally a good thing.

But displaying different content for them? Well, that’s a bad thing. A
really bad thing.


Spiders “walk the web”. They’re programs that automatically visit web sites,
look at all the links on the web site, and then go visit all the pages and
other websites that those links point to. Repeat that process indefinitely, and
with almost every page on the web linking to some other page on the web, a
spider that just follows links should be able to visit or access almost
everything that’s on the internet. At least in theory.
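
To make the idea concrete, here is a toy sketch, in Python, of the basic
loop a spider performs: fetch a page, collect its links, then follow them.
It uses only Python's standard library, the starting URL is a placeholder,
and real search-engine crawlers are vastly more sophisticated (politeness
delays, robots.txt handling, duplicate detection, and much more).

    # Toy spider: fetch a page, collect its links, follow them (illustration only).
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkCollector(HTMLParser):
        """Collect the href of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, max_pages=10):
        """Visit start_url, then keep following links until max_pages have been seen."""
        to_visit = [start_url]
        seen = set()
        while to_visit and len(seen) < max_pages:
            url = to_visit.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url).read().decode("utf-8", errors="replace")
            except Exception:
                continue  # skip pages that fail to load or aren't fetchable
            parser = LinkCollector()
            parser.feed(html)
            # Queue every link on this page, resolved to an absolute URL.
            to_visit.extend(urljoin(url, link) for link in parser.links)
        return seen

    # crawl("http://www.example.com/")  # hypothetical starting point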

Spiders are a good thing because the big search engines like Google, Yahoo,
and others use spiders (many of them, in fact) to examine web pages for
inclusion in their massive search indexes. Their spiders also come back and
revisit your pages “every so often” to keep their index up to date with any
changes you’ve made since the spider’s last visit.

There are two “problems” with spiders:

  • There are a lot of them. Probably thousands of spiders all
    attempting to visit every website, and often repeatedly. Every search engine,
    every custom search engine, a bunch of academic projects, and who knows what
    else may have its own spider attempting to visit your site. The load can add
    up.

  • Sometimes they misbehave. Because it’s just a computer program, a
    spider can ask for pages faster than your web server can deliver them;
    a “well behaved” spider deliberately won’t. But sometimes even a
    reputable spider like Yahoo’s or Google’s gets confused, and sometimes
    spiders simply aren’t well behaved. In those cases, a spider can bring
    a site to its knees.

The load problems caused by spiders are exacerbated, of course, if your
website is designed poorly, runs on a low-performing server, or has
insufficient bandwidth.

In the original question, you indicated that you’re using Microsoft Access
as your database. I dearly love Access for many things, but being the
database behind a web site isn’t one of them. It’s not something Access was
designed for, and it would be one of the first places I’d look for
performance-related issues under moderate to heavy load. More appropriate
technologies include Microsoft SQL Server, MySQL, or others.

I do want to be clear that presenting one set of content to the spiders and
another to “real” users is a very, very bad idea if you want to rank on the
search engines. For one thing, you run the risk of providing the wrong content
when people click on a search engine result, which is a bad experience for the
users. Even worse, though, most search engines explicitly prohibit this
behavior. If you present one set of content to real users and something
different to the spiders, you run a very real risk of being banned from the
search engine results entirely.

So once you’ve cleared up your site performance issues, what are your
options?

The first is something called “robots.txt”. This is a text file that you
place in the root of your website that instructs the spiders as to what they
may, and may not, do. Using robots.txt you can tell specific spiders what parts
of your site they are allowed to scan. That means you can also tell a specific
spider not to scan your site at all. If you don’t care about search engine
rankings, you can even tell all spiders not to scan your site.
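
For illustration only, a minimal robots.txt might look something like the
following. The spider name “ExampleBadBot” and the “/search/” path are
placeholders; you’d substitute the actual spider names and directories that
apply to your site.

    # Tell one particular (hypothetical) spider to stay away entirely
    User-agent: ExampleBadBot
    Disallow: /

    # Tell every other spider to stay out of a database-heavy area
    User-agent: *
    Disallow: /search/

    # To keep all spiders off the entire site instead, you would use:
    # User-agent: *
    # Disallow: /

The file must be named robots.txt and placed at the top level of your site
(for example, http://www.example.com/robots.txt) for spiders to find it.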

The downside to robots.txt is that it relies on spiders being “well
behaved”. Spiders have to choose to follow the instructions that you place in
robots.txt, and most do. Certainly legitimate spiders do. But what about the
rest? What about the ones that ignore what you’ve said in robots.txt
completely?

Your only real recourse, that I’m aware of, is to block them at the IP
level. That means first identifying the offending spider by examining your
server access logs. Then determine the IP address, or address range, that
the spider is accessing your site from. Finally, use some technique on your
web server to block that address or range from accessing your site. The
exact technique varies, but on Apache web servers, for example, it’s often
as easy as a simple entry in your .htaccess configuration file.
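
As a sketch only: on older Apache servers (the 2.2 line and earlier), that
.htaccess entry might look like the following, where 192.0.2.0/24 is a
documentation-reserved placeholder for whatever address range you identified
in your logs. Apache 2.4 and later replace the Order/Allow/Deny directives
with “Require not ip”.

    # Block a misbehaving spider's address range (placeholder range shown)
    Order Allow,Deny
    Allow from all
    Deny from 192.0.2.0/24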

Finally, I want to mention that spiders aren’t the only way your site can
get overloaded. Of course, you could just be very popular, and I hope
that’s a problem you’ll have to face some day soon :-). However, spammers
have also entered the picture. Spammers have started to use tools that
automatically fill in any form they find on your site, in the hopes that
what they post will somehow get published. Spammers are also on the lookout
for vulnerabilities in CGI scripts that they can hijack to use as an email
spam-sending relay. If you find your server is overloaded, be sure to check
out exactly what is causing the problem so you can take the right action.
