  1. #1

Ban malicious bots with this Perl Script

I've been noticing a lot of bad bots (Cyveillance, for example) crawling my site lately and ignoring robots.txt. I found the following solution on WebmasterWorld, and it sounds very promising. Will it work on WestHost's shared servers?

P.S. Here are links to the full WebmasterWorld threads:
    http://www.webmasterworld.com/forum13/1699.htm
    http://www.webmasterworld.com/forum13/1183.htm


    The script protects any part of your directory hierarchy under the control of the .htaccess file that it modifies.




    #!/usr/local/bin/perl

    $htadir = $ENV{DOCUMENT_ROOT};
    $htafile = "/\.htaccess";

    # Form full pathname to .htaccess file
    $htapath = "$htadir"."$htafile";

    # Get the bad-bot's IP address, convert to regular-expressions (regex) format by escaping all
    # periods.
    $remaddr = $ENV{REMOTE_ADDR};
    $remaddr =~ s/\./\\\./gi;

    # Get User-agent & current time
    $usragnt = $ENV{HTTP_USER_AGENT};
    $date = scalar localtime(time);

    # Open the .htaccess file and wait for an exclusive lock. This prevents multiple instances of this
    # script from running past the flock statement, and prevents them from trying to read and write the
    # file at the same time, which would corrupt it. When .htaccess is closed, the lock is released.
    #
    # Open existing .htaccess file in r/w append mode, lock it, rewind to start, read current contents
    # into array.
    open(HTACCESS,"+>>$htapath") ¦¦ die $!;
    flock(HTACCESS,2);
    seek(HTACCESS,0,0);
    @contents = <HTACCESS>;
    # Empty existing .htaccess file, then write new IP ban line and previous contents to it
    truncate(HTACCESS,0);
    print HTACCESS ("SetEnvIf Remote_Addr \^$remaddr\$ getout \# $date $usragnt\n");
    print HTACCESS (@contents);
    # close the .htaccess file, releasing lock - allow other instances of this script to proceed.
    close(HTACCESS);

    # Write html output to server response
    print ("Content-type: text/html\n\n");
    print ("<html><head><title>Fatal Error</title></head>\n");
    print ("<body text=\"#000000\" bgcolor=\"#FFFFFF\">\n");
    print ("<p>Fatal error</p></body></html>\n");
    exit;


    Basic install:
Upload this file to your cgi-bin and name it trap.pl. Set permissions with chmod 755 (owner: rwx, group: r-x, world: r-x). Create a link on a 1x1-pixel transparent .gif in one or more of your pages, and point it at /about.cgi?id=13 or similar. In .htaccess, rewrite /about.cgi to /cgi-bin/trap.pl. Disallow the /about.cgi file in robots.txt so good bots won't fetch it. (Be sure to post the updated robots.txt hours or even days before uploading and installing this script - some slow-cycle robots need a long time to read it!)
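For reference, here is a minimal sketch of the rewrite rule and robots.txt entries that step assumes (the /about.cgi decoy name and the /cgi-bin/trap.pl path are just the examples used above, and the rewrite rule assumes mod_rewrite is available on your account):

    # In the document-root .htaccess: send the decoy URL to the trap script
    RewriteEngine On
    RewriteRule ^about\.cgi$ /cgi-bin/trap.pl [L]

    # In robots.txt: ask well-behaved bots to stay away from the decoy
    User-agent: *
    Disallow: /about.cgi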

    Add the following code to your .htaccess to block the IP address records written by the script:

    # Block bad-bots using lines written by bad_bot.pl script above
    SetEnvIf Request_URI "^(/403.*\.html|/robots\.txt)$" allowsome
    <Files *>
    order deny,allow
    deny from env=getout
    allow from env=allowsome
    </Files>


    How this works:
If a bot finds the link and ignores the Disallow in robots.txt, it will attempt to fetch /about.cgi. .htaccess rewrites that request to trap.pl. trap.pl runs, grabs the bot's IP address, and converts it to regular-expression format. It then opens your .htaccess file and writes a new line at the beginning: SetEnvIf (IP-address) getout, followed by a comment containing a timestamp and the user-agent. It then closes your .htaccess file and serves a very short HTML page (included in the script) to the requestor.
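For example, a hit from a hypothetical bad bot at 192.0.2.45 would leave a line roughly like this at the top of .htaccess (the IP address, date, and user-agent shown here are made up):

    SetEnvIf Remote_Addr ^192\.0\.2\.45$ getout # Sat Jun 7 14:22:10 2003 SomeBadBot/1.0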

On the next request from that IP address, the updated .htaccess file is processed, and the "deny from env=getout" directive blocks that IP address along with any others the script has added.

    Note that due to the construction of the "deny from getout" section added to .htaccess, all requestors including bad-bots are allowed to fetch robots.txt and files beginning with "403" and ending with ".html". This allows the bots a chance to read robots.txt, and it also allows them to fetch my custom error pages. This last item is important to prevent an "infinite request loop" once a bot is banned.
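As an illustration of that last point, the custom 403 page only needs a name that matches the allowsome pattern above; a hypothetical example, assuming a page called /403-banned.html exists in the document root:

    ErrorDocument 403 /403-banned.html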

Warning: In the original WebmasterWorld post, the solid vertical pipe symbols "|" appear as broken pipes "¦" because the forum software changes them; make sure the script and the .htaccess line above use solid pipes (as shown here), or they will not work.

    This script works very nicely - Thanks Key_Master!

    Jim

  2. #2


Yes, it will; there's no reason it wouldn't.
I use it on one site (with a different provider), and it works fine, no problems.

Personally, I think it's playing with fire, which is why I don't use it across all my sites: it's quite possible that a good spider will pick up the linked 1x1-pixel image and treat it as SEO spamming.
    tell me and I'll forget; show me and I may remember; involve me and I'll understand.
