Forum Discussion

Jai_Macker_3993
Nimbostratus
Dec 02, 2008

Slowdown with irule and large data group?

Hi,

We have been seeing peak period slowdowns, and F5 Support has already looked through our config and found nothing that stands out. We do use a fairly complex iRule with a data group lookup, so Support suggested we post in the forum for help on that front. What we are seeing is TMM CPU usage regularly going over 95% and occasionally pegging at 100%, with only about 200 HTTP requests/sec. This seems to slow down network traffic through the BIG-IP, even ping times to/from other boxes, compared to non-peak periods. Below is some background on our setup, a few questions, and the iRule we use. Please take a look and let us know if you can help. Thank you!

Background on our setup:

-We host a large number of relatively low traffic sites

-Total peak period traffic is currently only about 200 req/sec

-All sites use the same virtual server, since they all point to a single IP we host as part of our CDN setup

-We need to balance these sites intelligently across various web servers to keep like assets together as much as possible for caching optimization (we use a "memcached-like" distributed cache along with local "near caching" for further in-VM optimization, which has significantly reduced request processing time on our web servers)

-Version: BIG-IP 9.3.0 Build 178.5

-We have been using the same iRule/data group setup since early this year; it was originally set up with the help of an F5 field engineer

-Balancing to specific web servers is done via an iRule lookup against a data group file that has grown relatively large over time and is currently 1.8 MB with 48K entries. (In the past few months we added a batch of even lower-traffic sites, so the number of data group entries nearly doubled while traffic increased only 20-30%)

-This may not be scaling as traffic and the data group file have grown, as evidenced by the peak period slowdowns

Questions

-What kind of algorithm is used for data group lookups, and how CPU intensive are they? Is there a way to optimize them?

-Our requests per second are not significant compared to the iRule execution rates others mention here, so while I have just read about some syntax/variable optimizations that we will try, I am wondering where our optimization efforts are best focused: on iRule syntax or on the overall iRule/data group design?

-Is cookie persistence much lighter on CPU cycles than data group lookups? One thought we had was to have only the first hit use the data group and have it set a session cookie for subsequent hits. Is there an ideal way to do this within the iRule? It seems the iRule is processed before any default persistence profile on the virtual server, so both setting and checking the cookie would have to happen in the iRule? (A rough sketch of what I have in mind follows below.)
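
To make that last question concrete, here is a rough sketch of the first-hit-only approach I have in mind. The cookie name (AffinityPool) and the simplified flow are illustrative assumptions only; it ignores the static/path-rewrite logic of the full rule below:

 when HTTP_REQUEST {
    # Illustrative sketch only: reuse a previously chosen pool from a session
    # cookie so the data group lookup runs only on the first request.
    set cookiePool [HTTP::cookie "AffinityPool"]
    if { $cookiePool ne "" && [active_members $cookiePool] > 0 } {
        pool $cookiePool
        return
    }
    # First hit (or stale cookie): fall back to the data group lookup
    set dest [findclass [HTTP::host] $::VirtualHost " "]
    if { $dest ne "" && [active_members $dest] > 0 } {
        pool $dest
        set needCookie 1
    } else {
        pool PoolALL
    }
 }

 when HTTP_RESPONSE {
    # Hand the chosen pool back to the client as a session cookie
    if { [info exists needCookie] && $needCookie } {
        HTTP::cookie insert name "AffinityPool" value [LB::server pool]
        set needCookie 0
    }
 }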

The irule:

 
 when RULE_INIT { 
     # Enable this to debug affinity translations via log messages in /var/log/ltm 
     set ::AffinityDebug 0 
     set ::AffinityInfo 0 
  
     if { $::AffinityDebug } { 
          log local0. "Affinity debugging enabled" 
     } 
  
    set ::ends_array [list ".htm" ".dcss" "/robots.txt" "/sitemap.xml" "/_gateway" \ 
    "/FileUpload" "/SecureFinancingGateway" "/SecureStoreGateway" \ 
    "/SecureGateway" "/GoogleVerify" ] 
   
    set ::starts_array [list "/demos" "/remoting" "/builder" \ 
        "/designs" "/shows" "/inventory" "/sb" "/showroom" \ 
        "/media" "/edit" "uploader" "/upload" "/link" ] 
 } 
  
 when CLIENT_ACCEPTED { 
    set isRetry 0 
 } 
  
 when HTTP_REQUEST { 
    if { $::AffinityDebug } { 
         log local0. "1: host: [HTTP::host] path: [HTTP::path] full URI [HTTP::uri]" 
    } 
    set path [HTTP::path] 
    set host [HTTP::host] 
  
     # do any path rewriting that's necessary 
  switch -glob $host { 
    "*designs.xxx.com" { 
      switch -glob $path { 
        "/" { 
          if { $::AffinityDebug } { 
            log local0. "leaving designs as /" 
          } 
        } 
      } 
    } 
    "yyy.com" - 
    "www.zzz.com" { 
      switch -glob $path { 
       "/" { 
          if { $::AffinityDebug } { 
            log local0. "rewriting zzz/ to /views/index" 
          } 
          set path "/views/index" 
       } 
       "*htm" { 
          # while this regex is not ideal, it is barely hit based on the above switch 
          regexp {^/(.*)\.htm(.*)} $path htmGarbage uriPart 
          set path "/views/$uriPart" 
       } 
      } 
    } 
    default { 
      switch -glob $path { 
        "/" { 
          if { $::AffinityDebug } { 
            log local0. "rewriting to /index.htm" 
          } 
          set path "/index.htm" 
        } 
      } 
    } 
  } 
  
  
   
    if { $::AffinityDebug } { 
         log local0. "2: after processing, path is $path" 
     } 
  
     # select static content and divert to static servers 
    set static_request 1 
   
    foreach ender $::ends_array { 
        if { $path ends_with $ender } { 
            if { $::AffinityDebug } { 
                 log local0. "$path matched end rule $ender- will be sent to dynamic" 
             } 
            set static_request 0 
            break 
        } 
    } 
   
    if {$static_request} { 
            foreach starter $::starts_array { 
                if { $path starts_with $starter } { 
                    if { $::AffinityDebug } { 
                         log local0. "$path matched start rule $starter- will be sent to dynamic" 
                     } 
                    set static_request 0 
                    break 
                } 
            } 
    } 
   
    HTTP::path "$path" 
      if { $host ends_with "zzz.com" } { 
        if { $::AffinityDebug } { 
            log local0. "$host $path will be sent to ZZZPool" 
        } 
       pool ZZZPool 
    } elseif { $path starts_with "/uploader" } { 
        if { $::AffinityDebug } { 
            log local0. "$path matched start rule /uploader - will be sent to uploader pool" 
        } 
        pool UploaderPool 
    } elseif { $path starts_with "/sbapp" } { 
        if { $::AffinityDebug } { 
            log local0. "$path matched start rule /sbapp- will be sent to sbapp" 
        } 
        pool SbappPool 
    } elseif { $path starts_with "/webchat" } { 
        if { $::AffinityDebug } { 
            log local0. "$path matched start rule /webchat- will be sent to jive" 
        } 
        pool JivePool 
    } elseif { $static_request } { 
        if { $::AffinityDebug } { 
             log local0. "$path did not match any app rules- sending to static pool" 
         } 
        pool StaticPool 
    } else { 
         # determine server from host 
         # DATA GROUP LOOKUP 
        set dest [findclass [HTTP::host] $::VirtualHost " "] 
        if { $dest ne "" && [active_members $dest] && ! $isRetry } { 
            if { $::AffinityInfo } { 
                log local0. "[HTTP::host] - $dest" 
            } 
            pool $dest 
        } else { 
            if { $::AffinityInfo } { 
                if { $dest == "" } { 
                    log local0. "[HTTP::host] not found in vhosts- round robining" 
                } elseif { ! [active_members $dest] } { 
                    log local0. "$dest has no active members- round robining [HTTP::host]" 
                } else { 
                    log local0. "[HTTP::host] for $dest is being retried- round robining" 
                } 
             } 
            pool PoolALL 
        } 
    } 
 } 
  
 when LB_FAILED { 
    if { $::AffinityInfo } { 
        log "lb failed event triggered " 
    } 
    if { [active_members PoolALL] > 0 } { 
        set isRetry 1 
        LB::reselect pool PoolALL 
    } else { 
        HTTP::respond 502 
    } 
 } 
 

7 Replies

  • Not sure if this is feasible with the number of sites you are talking about, but I have had a similar situation where, instead of doing a data group lookup, we simply went with a strict naming convention for pools so that each pool name matches the site's hostname with "_pool" appended. That way you just read the Host header in the request and tack "_pool" onto it to form the pool name (a minimal sketch of the idea follows this reply).

     

     

    Denny
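
    For illustration, a minimal sketch of that convention-based selection; the "_pool" suffix and the PoolALL fallback are just assumptions here, not something from Jai's config:

    when HTTP_REQUEST {
       # Derive the pool name from the Host header,
       # e.g. www.example.com -> www.example.com_pool
       set poolName "[string tolower [HTTP::host]]_pool"
       if { [catch { pool $poolName }] } {
          # No pool by that name on this unit; fall back to a catch-all pool
          pool PoolALL
       }
    }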
  • Thanks for your reply Denny. Sadly, a naming convention for pools will not work for us, as we have many URLs that are not alike but are part of the same "group" and are balanced to the same web servers in order to take advantage of the local caching. The last resort would be for us to give up that local-caching advantage, if the processing for it under load costs more than it gains, and just balance evenly between the web servers based on source IP or the like. However, I would first like to figure out how to best take advantage of our current architecture with BIG-IP and give that a try before heading down another route.
  • Colin_Walker_12's avatar
    Colin_Walker_12
    Historic F5 Account
    That looks like an awfully intensive iRule to me. Any time you're looking at doing for loops, multiple global variables, and array lookups... that starts getting pretty processor intensive, especially the loops.

     

     

    Your thought about using a cookie after the first lookup is a good one as well. I'd recommend trying to get to the point where a cookie is used for persistence after the first request, which will GREATLY reduce the number of times this iRule needs to be processed. After that I'd start looking into trimming down the iRule itself; we can certainly help with that if needed, but pursuing the persistence option is a good first step. (One way to trim the loops is sketched after this reply.)

     

     

    Colin
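
    Picking up Colin's point about the loops, one possible trim, sketched below, is to fold the per-request foreach scans of $::ends_array and $::starts_array into a single switch -glob on the path, so the literal prefixes/suffixes live in the rule itself. Only a few of the original entries are shown; this is a sketch, not a drop-in replacement:

    when HTTP_REQUEST {
       # Classify the request with one switch instead of looping over the
       # global lists on every request. Remaining suffixes/prefixes from the
       # original lists would follow the same pattern.
       set static_request 1
       switch -glob [HTTP::path] {
          "*.htm" -
          "*.dcss" -
          "*/robots.txt" -
          "*/sitemap.xml" -
          "/demos*" -
          "/remoting*" -
          "/builder*" -
          "/media*" {
             # Matched a dynamic suffix/prefix, so not static content
             set static_request 0
          }
       }
    }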
  • We added the session cookie and this helped a lot, with CPU usage now down to around 30% during peak periods. So the major immediate issue is resolved; thanks to everyone for their help with that.

     

     

    However, as Colin said, the iRule is still pretty intensive, so I would like to improve it further so it is not an issue again down the road. I turned the "timing on" flag on before and after the session cookie change and that showed a 4x improvement, but from this: http://devcentral.f5.com/Default.aspx?tabid=53&forumid=5&tpage=1&view=topic&postid=3650 I can tell that iRule evaluation is still using about 20% of CPU at 200 req/sec. (The rough calculation behind that is sketched at the end of this reply.)

     

     

    For further debugging, is there a way to use "timing on" for specific blocks of code within an iRule event, or can it only be turned on at the rule and event level? I'm basically wondering whether we can get more detailed info on the rule's current execution with bigpipe before going into the trial-and-error stage, i.e. take out the for loops, reset the stats, recheck the average CPU cycles to see their effect, then move on to the next block of code and repeat.

     

     

    Thanks,

     

    Jai
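
    For reference, the back-of-the-envelope behind that 20% figure. The cycle count and clock speed below are purely illustrative assumptions, not measurements from our box:

    # Estimate the iRule's share of one CPU from the average cycles per
    # execution reported by the timing statistics. All numbers are hypothetical.
    set avg_cycles   3000000      ;# assumed average CPU cycles per iRule execution
    set reqs_per_sec 200          ;# request rate from this thread
    set cpu_hz       3000000000   ;# assumed 3 GHz CPU clock
    puts [expr {double($avg_cycles) * $reqs_per_sec / $cpu_hz * 100}]
    # => 20.0 (percent of one CPU, with these assumed numbers)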
  • Hi Jai,

     

     

    I'm pretty sure you can enable timing per rule or per event, but not more granular than the event. You get a syntax error when trying to enable/disable timing within a rule event:

     

     

    when HTTP_REQUEST {
       timing on
       log local0. "test"
       timing off
    }

    line 2: [command is not valid in the current scope] [timing on]
    line 4: [command is not valid in the current scope] [timing off]

     

     

    Aaron
  • Colin_Walker_12's avatar
    Colin_Walker_12
    Historic F5 Account
    Aaron's correct here. Timing can be enabled for the entire rule, or per event, but anything more granular is going to fail.

     

     

    Colin