Forum Discussion

Keith_Richards_
Nimbostratus
Jul 05, 2007

LB_FAILED event

Can someone please tell me how long the LTM will wait, given that the selected server hasn't responded, before triggering the LB_FAILED event?

I have noticed that when people use LB_FAILED for passive monitoring they also put in an LB::detach and LB::down. I don't understand why you use the LB::detach - is it just good housekeeping on the LTM? When we used LB::down on our system, pool members were getting removed due to slow responses even though they weren't really down/offline.

Oh, and sorry, but another thing. I'm using universal persistence as below:


when HTTP_REQUEST {
    set uri [HTTP::uri]
    set jsess [findstr $uri "jsessionid" 11 "?"]
    log local0. "Entering REQUEST, jsess is: $jsess"
    if { $jsess ne "" } {
        persist uie $jsess
    }
}
when HTTP_RESPONSE {
    # $uri was saved during HTTP_REQUEST above
    if { $uri contains "jsessionid" } {
        set jsess1 [findstr $uri "jsessionid" 11 "?"]
        log local0. "jsessionid found, jsess is: $jsess1"
        persist add uie $jsess1
    }
}

The virtual server uses a second iRule for LB_FAILED events. I have seen other examples in the forum that delete persistence using a 'persist delete' statement - should I use this? I assumed that if a connection to a member fails with an LB_FAILED event, the connection will be reselected (as below) and a new persistence record created?


when LB_FAILED {
    set selected_server [LB::server addr]
    if { $selected_server eq "" } {
        log local0. "No mdex node available"
    } else {
        log local0. "Node: ${selected_server} not responding."
        # Select another node
        LB::reselect
    }
}

Thanks, Keith

8 Replies

  • Deb_Allen_18
    Historic F5 Account
    Not sure what the timeout is before LB_FAILED is triggered, anybody else know?

     

     

    The connection and persistence table relationships are not cleared if a selected node that looks UP fails to respond (thus triggering LB_FAILED). Only a monitor marking the node DOWN clears the related server-side table entries.

     

     

    So with OneConnect, LB::detach is necessary before a re-select to clear the connection table relationship between the client and the backend server; otherwise the same node will be re-selected. (That might not be specific to OneConnect, but I think I'm remembering it correctly; somebody will straighten me out, I'm sure.)

    Same idea re: removing the persistence record in LB_FAILED when re-selecting. Not sure it's as necessary, but it would ensure the old persistence table relationship is cleared, and a new one would be created when the connection is re-load-balanced.
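
    Untested, but putting those two suggestions together in the LB_FAILED rule would look something like this (assuming the OneConnect profile is on the virtual and that $jsess was set by the HTTP_REQUEST rule above):

    when LB_FAILED {
        # Clear the serverside connection binding so the next pick
        # isn't glued back to the failed member (relevant with OneConnect)
        LB::detach

        # Drop the stale persistence record if we have a key for it;
        # a fresh one gets added when the connection is re-load-balanced
        if { [info exists jsess] && $jsess ne "" } {
            persist delete uie $jsess
        }

        # Pick another pool member
        LB::reselect
    }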

     

     

    (LB::down isn't well integrated with monitors yet, so I would recommend against using it in this situation, preferring instead good monitoring with appropriate intervals configured.)

     

     

    HTH

     

    /deb
  • An oldie but goodie from Dr. Teeth:

     

     

     

    LB_FAILED is triggered in a variety of circumstances. I'll list a few off the top of my head:

    a) no pool selected
    b) no available pool members in selected pool
    c) no route to pool member
    d) failed to connect to pool member

    With regard to d), the "max retrans syn" option in the TCP profile will affect the timeout.

     

  • Deb_Allen_18
    Historic F5 Account
    Ah, nice work! Didn't think of that, & didn't realize that was in the profile.

    So the first retransmission when there's no response is typically at 3 seconds, and the typical backoff algorithm doubles the wait time after each failed attempt. I verified at the LTM command line, and it looks like max syn retrans is set to 5 there, producing the following progression and timing out after 93 seconds:
    Trying 172.24.2.200...
    11:26:24.455153 172.24.2.41.55507 > 172.24.2.200.http: S ...
    11:26:27.452465 172.24.2.41.55507 > 172.24.2.200.http: S ...
    11:26:33.452465 172.24.2.41.55507 > 172.24.2.200.http: S ...
    11:26:45.452465 172.24.2.41.55507 > 172.24.2.200.http: S ...
    11:27:09.452481 172.24.2.41.55507 > 172.24.2.200.http: S ...
    11:27:57.452466 172.24.2.41.55507 > 172.24.2.200.http: S ...
    telnet: connect to address 172.24.2.200: Connection timed out

    Looks like "Maximum Syn Retransmissions" is set to 4 in LTM's default tcp profile though, so LB_FAILED would be triggered if server didn't respond in 45 seconds:
      1st SYN:  0
      2nd SYN: +3 seconds
      3rd SYN: +6 seconds
      4th SYN: +12 seconds
      5th SYN: +24 seconds
     ======================
     LB_FAILED: 45 seconds
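
    Put another way (assuming the standard 3-second initial retransmit timer with doubling back-off), the wait before LB_FAILED works out to roughly 3 * (2^N - 1) seconds for N SYN retransmissions, which matches the 93s (N=5) and 45s (N=4) figures above. A quick Tcl sanity check:

    set base 3   ;# initial retransmit timer in seconds (assumed)
    foreach max_retrans {4 5} {
        set timeout 0
        for {set i 0} {$i < $max_retrans} {incr i} {
            # each retransmission waits twice as long as the previous one
            incr timeout [expr {$base * (1 << $i)}]
        }
        puts "max syn retrans $max_retrans -> LB_FAILED after ~${timeout}s"
    }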

    I will update the LB_FAILED wiki page with details to that effect.

    /deb
  • Hey Deb,

     

     

    How did you do the test mentioned in your last post?

     

     

    We are having issues with a black-box solution we are load balancing with our F5s. We have an iRule in place for LB_FAILED, and it is being hit a lot more than I would expect. We don't see the members of the pool drop, just the log entry saying "LB failed" (which we told the iRule to log so we could see what the users saw).

     

     

    Any thoughts?
  • Hi Andrew,

     

     

    You can check the LB_FAILED wiki page for details:

     

     

    http://devcentral.f5.com/wiki/default.aspx/iRules/lb_failed

     

    LB_FAILED is triggered when LTM is ready to send the request to a pool member and one hasn’t been chosen (the system failed to select a pool or a pool member), is unreachable (when no route to the target exists), or is non-responsive (fails to respond to a connection request).
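
    If it helps narrow down which of those cases you're hitting, you could log a few basics from the event itself. A rough sketch (in the "nothing selected" cases, LB::server addr will simply come back empty):

    when LB_FAILED {
        # log the virtual, the client, and whichever member (if any) was picked
        log local0. "LB_FAILED on [virtual name]: client [IP::client_addr], selected member '[LB::server addr]'"
    }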

     

     

    How often do you see LB_FAILED triggered? If it's fairly consistent you could try running a tcpdump to capture the communication. I'd try capturing the client and serverside connection info in the trace so you can see exactly what's happening.

     

     

    Aaron
  • Just want to note that the code segment above that tests whether a member was selected (set selected_server [LB::server addr]) and then falls through to the else and LB::reselect can/will send this thing into a reselection loop. I just did that, and since I don't know how to get it to stop (maybe someone knows a way to stop an iRule gracefully?), I failed the box over to the HA partner and restarted TMM.

     

     

    Anyway, my question is how the health monitor can show the member as good while load balancing is failing; apparently it is failing to connect to the pool member. How can that occur? Any insight would be helpful.

     

    I'm on version 11, and another article indicated there is a new feature, [event info], that will return the reason, but I'm getting nothing in return, so help with that would be appreciated too!
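
    For reference, the sort of usage that article describes would be roughly this (assuming event info is available in LB_FAILED on 11.x):

    when LB_FAILED {
        # event info should describe why load balancing failed
        log local0. "LB_FAILED reason: [event info]"
    }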

     

     

    Thanks for any help here. We have a pool that is available, but connections still end up hitting LB_FAILED.
  • Not sure this helps but are you SNATting? If so, perhaps the server is accepting connections from the F5 Self-IP address but not the SNAT address?

    On the subject of the loop, I'd recommend employing some simple counting to ensure the looping is short-lived:

    
    when CLIENT_ACCEPTED {
        set loopcounter 0
    }
    when LB_FAILED {
        set selected_server [LB::server addr]
        if { $selected_server eq "" } {
            log local0. "No mdex node available"
        } elseif { $loopcounter <= 4 } {
            log local0. "Node: ${selected_server} not responding."
            # Select another node
            incr loopcounter
            LB::reselect
        }
    }
    
  • Can I create a virtual server (VS_Primary) with an iRule like the following:

     

     

    when LB_FAILED {
        virtual VS_Backup
    }