
May the source be with you, but remember the KISS principle ;-)
Rogue robots

(Robots that ignore the robots.txt file)

The number of bots accessing popular websites exceeds the number of real users by a wide margin. For example, in one week the Softpanorama site was accessed from 14735 unique addresses. Fewer than 5K of them can be classified as "real users" (users that actually read at least one page on the site). That means that bots represent at least 66% of all IP addresses that accessed the site.

Only around 200 of those bots read the robots.txt file, so all the other robots can be viewed as rogue. In other words, rogue robots dominate the Web. An IP that fires GET requests non-stop (50 or more requests per minute) and does not read robots.txt should be classified as a rogue robot too.
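
The rule above can be sketched in a few lines of Python. This is only an illustration, not the script used on the site: it assumes the access log is in the Apache Combined Log Format, and the function name find_rogue_ips and its max_per_minute parameter are made up for this example.

```python
import re
from collections import Counter, defaultdict

# Assumed format: Apache Combined Log Format, e.g.
# 203.0.113.5 - - [24/Aug/2012:03:51:00 -0700] "GET /a.html HTTP/1.0" 200 512 "-" "agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)[^"]*"'
)

def find_rogue_ips(lines, max_per_minute=50):
    """Return IPs that hit the rate limit in some minute and never read robots.txt."""
    fetched_robots = set()
    per_minute = defaultdict(Counter)  # ip -> {minute: number of GET requests}
    for line in lines:
        m = LOG_RE.match(line)
        if m is None:
            continue  # skip lines that do not look like log entries
        ip = m["ip"]
        if m["path"].split("?", 1)[0] == "/robots.txt":
            fetched_robots.add(ip)
        if m["method"] == "GET":
            # The first 17 characters of the timestamp are "24/Aug/2012:03:51",
            # i.e. the timestamp truncated to the minute.
            per_minute[ip][m["ts"][:17]] += 1
    return {
        ip
        for ip, buckets in per_minute.items()
        if ip not in fetched_robots and max(buckets.values()) >= max_per_minute
    }
```

Feeding it the lines of one day's log gives a first list of candidates; the threshold of 50 comes from the text above and will probably need tuning for a busier or quieter site.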

Most robots "uncritically" use URLs from the pages they scan, and it looks like many of their source URLs are "poisoned". That includes Google and Microsoft robots. What is worse, a crazy URL that a robot picks up is requested again and again; they seem to have no mechanism for lowering the credibility of pages that contain many broken URLs. So much for Google intelligence and the quality of Google programmers. Judging from actual behaviour, they just don't care.

But, truth be told, the behavior of every robot has some suspicious elements.

One important method of distinguishing whether a robot is "crazy"/undebugged or outright evil is to check whether it obeys the robots.txt file. You can include a couple of "test" directories for a particular robot and observe the results. You also can (and should) include all old (now non-existent) directories and see which robots still attempt to access files in them.
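
A minimal sketch of such a check follows. It assumes Combined Log Format lines and that the same "test" and old directories are listed as Disallow lines in robots.txt; the function name trap_hits and the directory names in the usage note are hypothetical.

```python
import re
from collections import Counter

# Assumed format: Apache Combined Log Format (the user agent is the last quoted field).
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def trap_hits(lines, trap_prefixes):
    """Count requests per (ip, user agent) that fall under a disallowed prefix."""
    prefixes = tuple(trap_prefixes)  # str.startswith accepts a tuple of prefixes
    hits = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m and m["path"].startswith(prefixes):
            hits[(m["ip"], m["agent"])] += 1
    return hits
```

For example, trap_hits(log_lines, ["/trap-a/", "/OldDir/"]) lists every client that wandered into those two directories. A robot that shows up here after it has fetched robots.txt is ignoring it.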

The robots.txt patterns are matched by simple prefix comparison: a pattern matches every URL path that starts with it. Care should therefore be taken that patterns meant to match directories end with the final '/' character; otherwise all files with names starting with that substring will match, rather than just those in the intended directory.

For example:
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
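
The effect of the missing trailing slash is easy to see in a small sketch. The function below only imitates the prefix rule described above; it is not taken from any real crawler, and the name is_disallowed is invented for the example.

```python
def is_disallowed(path, disallow_rules):
    """Prefix matching: a non-empty rule blocks every path that starts with it."""
    return any(bool(rule) and path.startswith(rule) for rule in disallow_rules)

# Without the final '/', the rule also swallows files that merely share the prefix.
print(is_disallowed("/tmp.html", ["/tmp"]))       # True
print(is_disallowed("/tmp.html", ["/tmp/"]))      # False
print(is_disallowed("/tmp/cache.db", ["/tmp/"]))  # True
```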

The Robot Exclusion Standard does not mention anything about the "*" character in the Disallow: statement. Some crawlers, like Googlebot and Slurp, recognize strings containing "*", while MSNbot and Teoma interpret it in different ways.
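
To show why this matters, here is a sketch of two possible readings of the same rule. Both functions are illustrations written for this page, not the code of any real crawler: literal_match treats "*" as an ordinary character, while wildcard_match assumes that "*" stands for any sequence of characters.

```python
import re

def literal_match(path, rule):
    """Original-standard reading: '*' is just another character in the prefix."""
    return path.startswith(rule)

def wildcard_match(path, rule):
    """Extended reading (assumed here): '*' matches any run of characters."""
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.match(pattern, path) is not None  # re.match anchors at the start

rule = "/*.pdf"
print(literal_match("/docs/manual.pdf", rule))   # False: no path starts with "/*.pdf"
print(wildcard_match("/docs/manual.pdf", rule))  # True
```

A rule without "*" gives the same answer under both readings, so plain prefixes are the safer choice when the target robot is unknown.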

If a robot does not obey robots.txt or produces far too many 404 errors by requesting non-existent URLs, it should be hunted and killed ;-).
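
Finding the heaviest producers of 404 errors can be sketched as follows. The Combined Log Format is again an assumption, and the function name not_found_offenders and the default threshold of 20 are arbitrary choices for the example, not values from this page.

```python
import re
from collections import Counter

# Assumed format: Apache Combined Log Format; only the IP and status are needed here.
LOG_RE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) ')

def not_found_offenders(lines, min_404=20):
    """Map each IP with at least min_404 misses to (404 count, total requests)."""
    totals = Counter()
    misses = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m is None:
            continue
        totals[m["ip"]] += 1
        if m["status"] == "404":
            misses[m["ip"]] += 1
    return {ip: (n, totals[ip]) for ip, n in misses.items() if n >= min_404}
```

Returning the total next to the 404 count makes it easier to tell a robot that requests almost nothing but garbage from a busy client that occasionally follows a broken link.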

For example, here is a definitely evil robot :-)

- - [24/Aug/2012:03:51:00 -0700] "GET /Net/telnet.shtml%0D HTTP/1.0" 404 12973 "-" "Wget/1.12 (linux-gnu)"
- - [24/Aug/2012:03:52:15 -0700] "GET /Algorithms/index.shtml%0D HTTP/1.0" 404 12973 "-" "Wget/1.12 (linux-gnu)"
- - [24/Aug/2012:03:54:14 -0700] "GET /Bulletin/archive.shtml%0D HTTP/1.0" 404 12973 "-" "Wget/1.12 (linux-gnu)"
- - [24/Aug/2012:03:54:14 -0700] "GET /Scripting/perl.shtml%0D HTTP/1.0" 404 12973 "-" "Wget/1.12 (linux-gnu)"
- - [24/Aug/2012:03:54:21 -0700] "GET /Freenix/linux.shtml%0D HTTP/1.0" 404 12973 "-" "Wget/1.12 (linux-gnu)"
- - [24/Aug/2012:03:55:22 -0700] "GET /Solaris/Whitepaper/index.shtml%0D HTTP/1.0" 404 12973 "-" "Wget/1.12 (linux-gnu)"
- - [24/Aug/2012:04:01:39 -0700] "GET /Antivirus/Spyware/index.shtml%0D HTTP/1.0" 404 12973 "-" "Wget/1.12 (linux-gnu)"
- - [24/Aug/2012:04:21:30 -0700] "GET /Skeptics/cs_skeptic.shtml%0D HTTP/1.0" 404 12973 "-" "Wget/1.12 (linux-gnu)"
- - [24/Aug/2012:04:23:16 -0700] "GET /WWW/index.shtml%0D HTTP/1.0" 404 12973 "-" "Wget/1.12 (linux-gnu)"
- - [24/Aug/2012:04:24:17 -0700] "GET /Bookshelf/xml.shtml%0D HTTP/1.0" 404 12973 "-" "Wget/1.12 (linux-gnu)"
- - [24/Aug/2012:04:25:00 -0700] "GET /Social/overload.shtml%0D HTTP/1.0" 404 12973 "-" "Wget/1.12 (linux-gnu)"
- - [24/Aug/2012:04:25:24 -0700] "GET /Admin/index.shtml%0D HTTP/1.0" 404 12973 "-" "Wget/1.12 (linux-gnu)"

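
Every request above ends in %0D, which is the URL-encoded carriage return character. One plausible but unconfirmed explanation is a list of URLs saved with Windows line endings and then fed to wget. Whatever the cause, such requests are easy to flag; the sketch below uses the standard urllib.parse.unquote, and the function name has_control_chars is made up for the example.

```python
from urllib.parse import unquote

def has_control_chars(request_path):
    """True if the decoded path contains an ASCII control character such as CR."""
    decoded = unquote(request_path)
    return any(ord(ch) < 32 or ord(ch) == 127 for ch in decoded)

print(has_control_chars("/Net/telnet.shtml%0D"))  # True
print(has_control_chars("/Net/telnet.shtml"))     # False
```

No page on a normal site has a control character in its name, so a client that sends such paths repeatedly is at best undebugged.
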
Very similar to "crazy robots" are "obnoxious copiers", who overload the site by trying to mirror all of its content, sometimes several times a day. For example:

Bug#699133 wget When issuing the following exact command wget -m I get wget malloc() smallb

Web scraping - Wikipedia, the free encyclopedia

Technical measures to stop bots

The administrator of a website can use various measures to stop or slow a bot. Some techniques include:

Incapsula Finds Malicious Bots Account for Approximately 30 Percent of Internet Traffic

Other report findings include:

"We have been conducting this study since 2012, and one constant in our findings is that malicious bots are becoming increasingly sophisticated and harder to distinguish from humans. These bots pose a huge threat to websites and are capable of large-scale hack attacks, DDoS floods, spam schemes and click fraud campaigns," said Marc Gaffan, CEO of Incapsula. "With the vulnerabilities exposed in the past year, notably Shellshock, it is more important than ever that companies operating websites are diligent in securing their sites from malicious traffic."

Robots exclusion standard - Wikipedia, the free encyclopedia



The Last but not Least

Technology is dominated by two types of people: those who understand what they do not manage and those who manage what they do not understand. ~ Archibald Putt, Ph.D.

Copyright © 1996-2021 by Softpanorama Society.


Last modified: March, 29, 2020