I just learned the phrase “Deep web” today and I’m reading an excellent
Wikipedia article explaining that it’s not just for criminals and terrorists. A
few years ago for personal reasons, I chose to write a memoir style blog on
perhaps the most popular social networking site then used for people over the
age of 50. I chose this site not only because it was even more popular than
Facebook, I chose it specifically because its content was dynamically
generated. Finally, I chose it because I knew not only that I had ready-made
readers waiting but an accessible list of those who’d be interested in what I
wrote. I contacted the website company at which I was a paid member and checked to
see if any search engine could cache what I wrote. I was told no. What I
wrote was a very personal, Festivus-style airing of grievances. I received tons
of email responses from my readers about the various grievances I aired, all
positive and supportive.
Now, Leo, learning about the Deep Web, I’m wondering if what that website
customer service rep told me (that the site was cache-proof) was valid. Are there
search engines on the Deep Web? If so, are those search engines tailored to find
criminals and terrorists or even regular people?
In this excerpt from Answercast #21, I look at the idea of a “Deep Web” and what information you might find about yourself there.
So there are a number of interesting issues here. I too had not heard the phrase “Deep Web,” although I’ve certainly been familiar with several of the concepts.
Ultimately, here’s my take on it. If an individual can view your web page without jumping through any special hoops (say, connecting to a special network, or entering a username and password), then I’m convinced that some search engine has already spidered it and potentially cached it.
There’s nothing to protect it from that kind of activity.
Limiting website spidering
What the rep may have been talking about is a file called robots.txt. It’s a file that website owners can place on their site to tell search engines which pages may and may not be spidered. There’s also information you can place in an HTML page itself, a robots meta tag, that says much the same thing and can also ask that the page not be indexed or cached. Together, these will keep the pages out of the popular search engines.
You will not find those pages in Google; you will not find those pages in Bing.
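To make that a little more concrete, here is a minimal sketch in Python of what those two signals look like and how a well-behaved crawler might read them. The robots.txt rules, the /blogs/ path, the example.com URLs, and the sample page are all made up for illustration; I don't know what the site in question actually used. The parsing relies on Python's standard urllib.robotparser and html.parser modules.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking every crawler to stay out of /blogs/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /blogs/
"""

# Hypothetical page carrying the per-page equivalent: a robots meta tag.
PAGE_HTML = """\
<html><head>
<meta name="robots" content="noindex, noarchive">
</head><body>My memoir...</body></html>
"""


def parse_robots(text):
    """Load robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def meta_directives(html):
    """Return the set of robots directives found in an HTML page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    rules = parse_robots(ROBOTS_TXT)
    print(rules.can_fetch("AnyBot", "https://example.com/blogs/my-post"))
    print(rules.can_fetch("AnyBot", "https://example.com/about"))
    print(sorted(meta_directives(PAGE_HTML)))
```

Note that everything here happens on the crawler's side: the site only publishes the request, and it is up to the crawler to read it and act on it.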
The problem is that techniques like robots.txt, and the information placed in the HTML file, are purely voluntary. They’re not a technological solution. They basically tell a search engine, “Hey, please don’t index me. Please don’t cache me.”
Good search engines will respect that. Others may not. There’s no requirement that they do so; it’s simply a gentleman’s agreement.
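Here is a rough Python sketch of why that agreement is only as good as the crawler's manners. The SITE dictionary is a hypothetical in-memory stand-in for a publicly reachable website, and the crawler names and paths are invented for the example; no real crawler is being described. The only difference between the two functions is whether they bother to ask.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical site: an in-memory stand-in for publicly reachable pages.
SITE = {
    "/robots.txt": "User-agent: *\nDisallow: /blogs/\n",
    "/blogs/my-memoir": "A very personal airing of grievances.",
}


def polite_crawl(path, agent="PoliteBot"):
    """Consult robots.txt first and skip anything it disallows."""
    rules = RobotFileParser()
    rules.parse(SITE["/robots.txt"].splitlines())
    if not rules.can_fetch(agent, path):
        return None  # the crawler chooses to walk away
    return SITE.get(path)


def impolite_crawl(path):
    """Never look at robots.txt; just take whatever is public."""
    return SITE.get(path)


if __name__ == "__main__":
    print(polite_crawl("/blogs/my-memoir"))
    print(impolite_crawl("/blogs/my-memoir"))
```

Nothing on the site's side stops the second function from working. The page is served to anyone who asks for it, which is exactly the situation with content that needs no login.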
Because your website, or your content, is available without any special hoops to jump through, I’m convinced that there are search engines, caching utilities, and spiders out there that have spidered your site and cached its content, and will continue to do so.
Programs crawl the web constantly
There’s really not a lot we can do about that.
We normally think of there being only a handful of search engines: Google, Bing, and maybe Yahoo back in the day. In reality, there are thousands and thousands of search engines and spiders crawling the web almost constantly. They range from:
- Competing search engines, to
- Government-sponsored data collection utilities, to
- University research projects that are out there just sort of surfing everything they can find.
So, in reality, the fundamental rule of the internet still applies.
Once you post something that is publicly accessible, for all practical purposes, you have lost control over it. You can remove it, but you cannot know whether somebody has already cached it, made a copy of it, or reposted it somewhere else. That’s just the nature of how the public internet works, dark or otherwise.