We're currently having some problems with some web spiders beating up our webservers sucking up available sessions in our application and slurping up a whole bunch of our bandwidth. We're interested in rate-limiting them. I found what appeared to be a very relevant iRule at http://devcentral.f5.com/Default.aspx?tabid=109 (third place winner), but when I try to load it up in the iRule editor it complains. It complains, I believe, because HTTP Headers are not available from within CLIENT_ACCEPTED and CLIENT_CLOSED logic. That makes sense because CLIENT_ACCEPTED and CLIENT_CLOSED are associated with building and destroying tcp connections (i believe), so it wouldn't make sense for data (headers/req-uri's) to be transferred at that time. Does anyone have any suggestions on how to accomplish this or something similar?

If you changed the CLIENT_ACCEPTED event to HTTP_REQUEST and CLIENT_CLOSED to HTTP_RESPONSE, you'd get per HTTP request throttling, instead of per-TCP connection throttling (Click here). Aaron

I was thinking about that, it would limit to N pending requests. I was worried about how to respond when I wanted to reject the request. It's easy at the tcp level, just reject the connection. At the request-level I have to think about how to respond without negatively impacting our page -rank. I'm currently working on a rule which would rate limit to N requests per M seconds and have the same problem. How do I tell msn-search to buzz off without pissing off them off too much? -- though this won't be a big problem in the future as we've added crawl-delay to our robots.txt for future crawling.

Here's where I am with my rate limiting iRule, though I havnt even checked to see if it'll parse yet: when RULE_INIT { array set ::active_crawlers { } set ::min_interval 1 set ::rate_limit_message "You've been rate limited for sending more than 1 request every $::min_interval seconds." } when HTTP_REQUEST { set user_agent [string tolower [HTTP::header "User-Agent"]] if { [matchclass [$user_agent contains $::Crawlers] } Throttle crawlers. set curr_time [clock seconds] if { [info exists ::active_crawlers($user_agent)] } { if { [ $::active_crawlers($user_agent) < $curr_time ] } { set ::active_crawlers($user_agent) [expr {$curr_time + $::min_interval}] } else { block it somehow HTTP::respond 500 content $::rate_limit_message } } else { set ::active_crawlers($user_agent) [expr {$curr_time + $::min_interval}] } } }

You can still use reject to send a TCP reset (even from an HTTP_ event). I haven't delved into search engine optimization, but technically it would be appropriate to send a 503 response back. Google seems to handle the 503 as you'd hope: http://googlewebmastercentral.blogspot.com/2006/08/all-about-googlebot.html If my site is down for maintenance, how can I tell Googlebot to come back later rather than to index the "down for maintenance" page? You should configure your server to return a status of 503 (network unavailable) rather than 200 (successful). That lets Googlebot know to try the pages again later. What should I do if Googlebot is crawling my site too much? You can contact us -- we'll work with you to make sure we don't overwhelm your server's bandwidth. We're experimenting with a feature in our webmaster tools for you to provide input on your crawl rate, and have gotten great feedback so far, so we hope to offer it to everyone soon. Aaron

Here's the rate limiting approach: when RULE_INIT { array set ::active_crawlers { } set ::min_interval 1 } when HTTP_REQUEST { set user_agent [string tolower [HTTP::header "User-Agent"]] Logic only relevant for crawler user agents if { [matchclass $user_agent contains $::Crawlers] } { Throttle crawlers. set curr_time [clock seconds] if { [info exists ::active_crawlers($user_agent)] } { if { [ $::active_crawlers($user_agent) < $curr_time ] } { set ::active_crawlers($user_agent) [expr {$curr_time + $::min_interval}] } else { reject } } else { set ::active_crawlers($user_agent) [expr {$curr_time + $::min_interval}] } } }

Rate limiting Search Spiders

13 Replies

aneilsingh_5064
Nimbostratus
Jan 24, 2012
@Colin, I am on version 10.2.1. I would be very interested in some working examples.

Personally I think I would rather just have the actual irule to contain the list.

And of course performance is important.

Thanks
hooleylist
Cirrostratus
Jan 24, 2012
I think you can tell Google and Bing to crawl your sites at a slower rate:

http://www.bing.com/community/site_blogs/b/webmaster/archive/2009/08/10/crawl-delay-and-the-bing-crawler-msnbot.aspx

In the robots.txt file, within the generic user agent section, add the crawl-delay directive as shown in the example below:

User-agent: *

Crawl-delay: 1

http://googlewebmastercentral.blogspot.com/2008/12/more-control-of-googlebots-crawl-rate.html

We've upgraded the crawl rate setting in Webmaster Tools so that webmasters experiencing problems with Googlebot can now provide us more specific information. Crawl rate for your site determines the time used by Googlebot to crawl your site on each visit.

If those options don't work for you it might be better to assign a rateclass for search engine spiders rather than sending back a 503. It should add less overhead on LTM and provide faster overall crawl times. Of course, I'm not an SEO expert so this is something you might want to research before using.

You could use a list of spider user-agents like this to identify spiders:

http://www.useragentstring.com/pages/Crawlerlist/

You could either check the User-Agent header using a switch statement or putting the header tokens in a data group and using the class command to do the lookup. Once you identify a spider you could assign a rate class:

http://devcentral.f5.com/wiki/iRules.switch.ashx

http://devcentral.f5.com/wiki/iRules.class.ashx

http://devcentral.f5.com/wiki/iRules.rateclass.ashx

Aaron
aneilsingh_5064
Nimbostratus
Jan 24, 2012
Hi there, we are a SaaS company and therefore don't have control of all our communities submissions to Search engines. As well each of our Communities we allow the ability for them to have a robots.txt file as well. These robots.txt files are handled and server up with Code and not sitting in the root of the typical web server.

Here you can see why I need to control speed from the F5 for robots.

When searching on this site it seems this thread beginning is what I am after. Now I am just hoping someone has a working examples and can help me.

thanks for your input.

Forum Discussion

Rate limiting Search Spiders

13 Replies

Recent Discussions

What is the meaning is 52% block in WAF

Rundeck ansible F5 errors

rewrite Azure AD response for portal access via web portal

AS3 Monitoring multiple ports selectively

Open Redirection Mitigation

Related Content

How to search the latest security academic papers

F5 logs - search query

Introduction to F5 Distributed Cloud Console Rate Limiting Feature

In search of a security incident response system for the masses

L3/4/DNS DDoS Reporting with Elastic Search and Kibana