View Full Version : ia_archiver Robot

03-31-2006, 10:30 AM
I recently noticed a BIG increase in bandwidth on one of my accounts and thought I should explore what was going on. By examining the current access_log for the account I found that within about nine hours I had 3985 hits, with 2264 of those showing a user agent of ia_archiver. Now this is a "legit" robot that comes to your site and archives it. You can read more about it here (http://pages.alexa.com/help/webmasters/index.html). In theory this robot should not be hitting a site enough to cause such an increase in bandwidth, but in this case it appears to be running amok. :)

Right away I disallowed the robot in my robots.txt file by adding the following:

User-agent: ia_archiver
Disallow: /
Of course that did not stop the spider, since it was already in the process of spidering my site, but it should keep it at bay in the future. So now I needed a way to cut back the bandwidth it was eating up, and I decided to add a rewrite to the .htaccess file in my public root. This is simply a copy of something I used in the past to keep another "bad spider" at bay that was doing the same thing.

# prevent the ia archiver (case-insensitive prefix match on the user agent)
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [NC]
RewriteRule ^.*$ [R,L]

It is still hitting my site but at least now it is not eating up as much bandwidth. I guess I could have gone through and added a ban on the IP but used this approach instead since I was not 100% sure if the robot stuck with the same IP or not.

So how would someone else handle this? Any other solutions that may be better?

03-31-2006, 10:37 AM
I forgot to mention that to get the count of hits from the robot I used grep over SSH.

grep -c ia_archiver /var/log/httpd/access_log
The above command using the -c option returns the line count of matches instead of the lines themselves. Also if you happen to want to "watch" the hits on your site you can use this command in SSH.

tail -f /var/log/httpd/access_log
It returns each new line added to the access_log.
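For a breakdown by user agent rather than a single count, something along these lines might help. It is only a sketch: `hits_by_agent` is a made-up helper name, and it assumes the log is in Apache's "combined" format, where the user agent is the last quoted field on each line.

```shell
#!/bin/sh
# hits_by_agent: hypothetical helper, not part of any tool mentioned above.
# Assumes Apache "combined" log format. Splitting each line on double quotes
# puts the request in field 2, the referer in field 4 and the user agent in
# field 6.
hits_by_agent() {
    awk -F'"' '{ count[$6]++ } END { for (a in count) print count[a], a }' "$1" |
        sort -rn
}

# Example, using the log path from the commands above:
# hits_by_agent /var/log/httpd/access_log | head
```

The busiest user agents come out first, which makes a runaway robot easy to spot.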

03-31-2006, 01:11 PM
Alexa is supposed to be a well behaved robot and will respect the instructions in your robots.txt file. So all you need to do is tell her what to index and what not to index. I don't think you need to bother with the Rewrite rules.

03-31-2006, 01:22 PM
Well, that is what I thought also, but I explored their site some and they do mention that if the robot has already started spidering your site it will not stop until its next visit, and that you would have to contact them to have it removed. What I find really odd is that normally Alexa would not be hitting my site this much. It has been doing it for a few days now. It is possible that I am misinterpreting the access_logs.

03-31-2006, 02:26 PM
She's only visited a couple of sites that I've checked and has only indexed a couple of pages and then left... now I'm feeling left out... :(

03-31-2006, 02:29 PM
Let me grab a bit of an access file and see if you think I am mis-reading it.

03-31-2006, 02:31 PM
- - [31/Mar/2006:14:28:42 -0700] "GET /donate.php?sid=bb094a532cb6d5ca9ce88148fb30800d HTTP/1.0" 302 314 "-" "ia_archiver"
- - [31/Mar/2006:14:28:45 -0700] "GET /donate.php?sid=bd2b229b730c9304bf4757ff69b2d8ff HTTP/1.0" 302 314 "-" "ia_archiver"
- - [31/Mar/2006:14:28:48 -0700] "GET /viewforum.php?f=7&sid=af432fbacf41ad5376a5a9dea89a27a8 HTTP/1.0" 302 328 "-" "ia_archiver"
- - [31/Mar/2006:14:28:49 -0700] "GET /viewforum.php?f=7&sid=b5b969e8fd5efdcff404d550bccf3847 HTTP/1.0" 302 328 "-" "ia_archiver"
- - [31/Mar/2006:14:28:51 -0700] "GET /donate.php?sid=d4967d8e2a2b7569bf52c455d70c18c6 HTTP/1.0" 302 314 "-" "ia_archiver"
- - [31/Mar/2006:14:28:55 -0700] "GET /viewforum.php?f=7&sid=ba55124a918d9fca5a5e5a97f442f79f HTTP/1.0" 302 328 "-" "ia_archiver"
- - [31/Mar/2006:14:28:57 -0700] "GET /donate.php?sid=dab7daf22b4f04cbe653110f170034a2 HTTP/1.0" 302 314 "-" "ia_archiver"
- - [31/Mar/2006:14:29:01 -0700] "GET /viewforum.php?f=7&sid=bcd13f6ff00de0e23ea64de5f5356383 HTTP/1.0" 302 328 "-" "ia_archiver"
- - [31/Mar/2006:14:29:03 -0700] "GET /donate.php?sid=eabc4891bfdf5355b6f4dc2f77abcf87 HTTP/1.0" 302 314 "-" "ia_archiver"
- - [31/Mar/2006:14:29:07 -0700] "GET /viewforum.php?f=7&sid=c3e58f710cc30f6704513e14f32d2c9d HTTP/1.0" 302 328 "-" "ia_archiver"
- - [31/Mar/2006:14:29:10 -0700] "GET /donate.php?sid=f110ed1efffdfbb3b47f1594b9c19eec HTTP/1.0" 302 314 "-" "ia_archiver"
- - [31/Mar/2006:14:29:13 -0700] "GET /viewforum.php?f=7&sid=c4245f9a7bd3ae21b5e78b0667dc77ee HTTP/1.0" 302 328 "-" "ia_archiver"
- - [31/Mar/2006:14:29:15 -0700] "GET /groupcp.php?sid=8ed4ea52a04587883e8fa81ad9ddcfde HTTP/1.0" 302 314 "-" "ia_archiver"

03-31-2006, 02:46 PM
You know what, I may very well have messed myself up. :) I added the stuff yesterday to block Alexa, and just now while reviewing my logs I noticed this:

- - [31/Mar/2006:04:41:49 -0700] "GET /robots.txt HTTP/1.0" 302 277 "-" "ia_archiver"

Looks like she asked for the robots.txt file again this morning and dummy me did not let her have it. I am also wondering if I am sending back the wrong code with the rewrite, since the server is sending a 302, which I think is a temporary-move result. hmmmm....
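A quick way to check from the log whether the robot ever got robots.txt is to tally the status codes returned for those requests. The sketch below is only illustrative: `robots_statuses` is a made-up helper name and it assumes Apache's "combined" log format. A 200 would suggest the file was served, while a 302 suggests the rewrite rule caught the request.

```shell
#!/bin/sh
# robots_statuses: hypothetical helper, not part of any tool in this thread.
# Usage: robots_statuses USER_AGENT LOGFILE
# Assumes Apache "combined" log format. Splitting on double quotes gives the
# request line in field 2, " status bytes " in field 3 and the user agent in
# field 6.
robots_statuses() {
    awk -F'"' -v agent="$1" '
        $6 == agent && $2 ~ /^GET \/robots\.txt / {
            split($3, post, " ")      # post[1] = status, post[2] = bytes
            seen[post[1]]++
        }
        END { for (s in seen) print seen[s], s }
    ' "$2"
}

# Example, using the log path mentioned earlier in the thread:
# robots_statuses ia_archiver /var/log/httpd/access_log
```

Each output line is a count followed by the status code.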

03-31-2006, 03:27 PM
Ok, I got fed up and decided to simply ban the IP via the Site Manager. I am still leaving the other things in place. The reason for this is that I am almost convinced this was not Alexa but a bot masquerading as her. I think I will do some further investigating, but for now banning the IP knocks them out totally. :)

04-01-2006, 12:57 AM
You'll need to ban a block of IPs, as Alexa operates on the range: -

In your access_log excerpt, the section:
[...] HTTP/1.0" 302 314 "-" "ia_archiver"

that number '314' is the number of bytes transferred, so you can check the bandwidth of any IP by grep'ing for the IP number and then summing up the contents of that field to get the bandwidth used.
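As a rough sketch of that summing step: the helper below matches the host field exactly with awk rather than using grep, so that one address does not also match longer ones that start with it. `bytes_for_host` is a made-up name, the address in the example is a placeholder from the documentation range, and the log is assumed to be in Apache's "combined" format with the client address as the first field.

```shell
#!/bin/sh
# bytes_for_host: hypothetical helper, not part of any tool in this thread.
# Usage: bytes_for_host IP LOGFILE
# Assumes Apache "combined" log format. Splitting on double quotes leaves
# "host ident user [date] " in field 1 and " status bytes " in field 3.
bytes_for_host() {
    awk -F'"' -v host="$1" '
        {
            split($1, pre, " ")       # pre[1]  = client address
            split($3, post, " ")      # post[2] = bytes sent, or "-" if none
            if (pre[1] == host && post[2] ~ /^[0-9]+$/) total += post[2]
        }
        END { print total + 0 }
    ' "$2"
}

# Example (192.0.2.10 is a placeholder address, not one from this thread):
# bytes_for_host 192.0.2.10 /var/log/httpd/access_log
```

The result is a byte count for the log file given, so it only covers the period that file spans.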

04-01-2006, 09:17 AM
I now have another account that seems to be having a bandwidth problem. It is already at 32% on the first day of the month... this site is not that popular. :)

I am not finding the ia_archiver entry, but in the last archived access_log there are some very odd entries. They are not all in the normal format and one is so long it takes up almost the entire log.

Just wanted to post to give folks a heads up that they may want to keep an eye on their accounts and see if anyone may be experiencing something similar.

04-01-2006, 09:54 AM
Well I forgot to follow my first rule of thumb... "Do not panic!". The archived access_log that I thought was showing "odd entries" is fine. Just a note to anyone out there... Make sure that you have the binary box clicked when you download it and not the text box. :)

It is still odd that the bandwidth is so high on this account on the first day and I am still exploring the cause.

04-01-2006, 10:53 AM
The issue was with the Reseller Manager. I was reviewing the bandwidth in Reseller Manager, and upon looking at the bandwidth in the Site Manager for the individual account found that it was showing at 1%. Big difference. :) I had seen this behavior in the past but not to this drastic degree. Normally it was only off by a couple of percentage points.

The "fix" for this is to re-apply the package for the account in question via the Reseller Manager. WestHost is aware of the issue.