Ban malicious bots with this Perl Script



zestgourmet
03-08-2005, 11:36 AM
I've been noticing a lot of bad bots (Cyveillance, for example) crawling my site lately and ignoring robots.txt. I found the following solution on WebmasterWorld, and it sounds very promising. Will this work on WestHost's shared servers?

P.S. Here are links to the full threads on WebmasterWorld:
http://www.webmasterworld.com/forum13/1699.htm
http://www.webmasterworld.com/forum13/1183.htm



The script protects any part of your directory hierarchy under the control of the .htaccess file that it modifies.




#!/usr/local/bin/perl

$htadir = $ENV{DOCUMENT_ROOT};
$htafile = "/.htaccess";

# Form full pathname to .htaccess file
$htapath = "$htadir"."$htafile";

# Get the bad-bot's IP address, convert to regular-expressions (regex) format by escaping all
# periods.
$remaddr = $ENV{REMOTE_ADDR};
$remaddr =~ s/\./\\\./gi;

# Get User-agent & current time
$usragnt = $ENV{HTTP_USER_AGENT};
$date = scalar localtime(time);

# Open the .htaccess file and wait for an exclusive lock. This prevents multiple instances of this
# script from running past the flock statement, and prevents them from trying to read and write the
# file at the same time, which would corrupt it. When .htaccess is closed, the lock is released.
#
# Open existing .htaccess file in r/w append mode, lock it, rewind to start, read current contents
# into array.
open(HTACCESS,"+>>$htapath") || die $!;
flock(HTACCESS,2);
seek(HTACCESS,0,0);
@contents = <HTACCESS>;
# Empty existing .htaccess file, then write new IP ban line and previous contents to it
truncate(HTACCESS,0);
print HTACCESS ("SetEnvIf Remote_Addr \^$remaddr\$ getout \# $date $usragnt\n");
print HTACCESS (@contents);
# close the .htaccess file, releasing lock - allow other instances of this script to proceed.
close(HTACCESS);

# Write html output to server response
print ("Content-type: text/html\n\n");
print ("<html><head><title>Fatal Error</title></head>\n");
print ("<body text=\"#000000\" bgcolor=\"#FFFFFF\">\n");
print ("<p>Fatal error</p></body></html>\n");
exit;


Basic install:
Upload this file to your cgi-bin and name it trap.pl. Set permissions with chmod 755 (owner: rwx, group: r-x, world: r-x). Place a 1x1-pixel transparent .gif on one or more of your pages and link it to /about.cgi?id=13 or similar. In .htaccess, rewrite /about.cgi to /cgi-bin/trap.pl, and disallow /about.cgi in robots.txt so good 'bots won't fetch it. (Be sure to post the updated robots.txt hours or even days before uploading and installing this script - some slow-cycle robots need a long time to read it!)
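For reference, here is a minimal sketch of the two supporting pieces. The trap URL /about.cgi comes from the description above; the RewriteRule flags and the exact robots.txt wording are assumptions, so adjust them to your own setup. In .htaccess, assuming mod_rewrite is available:


RewriteEngine On
RewriteRule ^about\.cgi$ /cgi-bin/trap.pl [L]


And in robots.txt, so well-behaved bots stay away from the trap:


User-agent: *
Disallow: /about.cgi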

Add the following to your .htaccess to block the IP addresses written by the script:

# Block bad-bots using lines written by the trap.pl script above
SetEnvIf Request_URI "^(/403.*\.html|/robots\.txt)$" allowsome
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>


How this works:
If a bot finds the link and ignores the disallow in robots.txt, it will attempt to fetch /about.cgi. The rewrite rule in .htaccess sends that request to trap.pl. The script runs, grabs the bot's IP address, and converts it to regular-expressions format. It then opens your .htaccess file and writes a new line at the beginning, "SetEnvIf (IP-address) getout", followed by a comment containing a timestamp and the user-agent. Finally it closes your .htaccess file, releasing the lock, and serves a very short html page (included in the script) to the requestor.
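As a worked example (the IP address, timestamp, and user-agent are hypothetical), if a bot at 10.20.30.40 identifying itself as BadBot/1.0 hits the trap, the line the script prepends to .htaccess would look like this:


SetEnvIf Remote_Addr ^10\.20\.30\.40$ getout # Tue Mar  8 11:36:00 2005 BadBot/1.0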

On the next request from that IP address, the new .htaccess file is processed, and the "deny from env=getout" directive blocks that IP address and any others added by the script.

Note that due to the construction of the "deny from env=getout" section added to .htaccess, all requestors, including bad bots, are allowed to fetch robots.txt and files beginning with "/403" and ending with ".html". This gives the bots a chance to read robots.txt, and it also allows them to fetch my custom error pages. This last item is important to prevent an "infinite request loop" once a bot is banned.
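To sketch that last point (the file name 403error.html is only an assumption - anything starting with "403" and ending in ".html" matches the pattern above), the custom error page can be wired up in .htaccess like this:


ErrorDocument 403 /403error.html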

Warning: If you copy the code from the WebmasterWorld threads linked above, replace all broken vertical pipe symbols "¦" with solid vertical pipe symbols "|" before attempting to use it - those characters get mangled by posting on WebmasterWorld.

This script works very nicely - Thanks Key_Master!

Jim

sunzon
03-08-2005, 05:58 PM
Yes it will - there's no reason it wouldn't. I know this script and use it on one site (with a different provider), and it works fine, no problem.

Personally, I think it's playing with fire, which is why I don't use it throughout my sites: it's quite possible a good spider will pick up the anchored 1x1-pixel image and treat it as SEO spamming.