Forum Discussion

wtwagon_99154
Feb 10, 2010

Health Check Issues

The Scenario:

 

 

We are performing some load testing of an application and are using a health check as follows:

 

 

GET /blah/blah.asmx HTTP/1.1\r\nUser-Agent:Mozilla\r\nhost:blahblah.blah.net\r\n

 

 

We look for a 200 OK

 

 

Monitor interval: 5 seconds

 

Timeout: 16 seconds

 

 

We have 2 servers in the pool and they are all green at the moment.

 

 

The Problem:

 

 

As soon as we start throwing LoadRunner load tests at this VIP, all the servers go red almost immediately (we are throwing a small amount of load, around 40 concurrent sessions). What really gets me is that I can run the exact same request as above during the load test and successfully receive a 200 OK.

 

 

I also tried changing the interval to 15 seconds with a timeout of 46 seconds and saw the same problem (although this time it happened after 5 minutes instead of almost immediately).

 

 

What steps can be taken to diagnose why the F5 thinks these nodes are down when they clearly are not?
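One way to gather evidence is to replay the monitor's request against a pool member on a fixed schedule while the load test runs, and record how long each response takes relative to the 16-second timeout. The following is a minimal sketch, not the monitor's actual implementation. It assumes the send string and the "200 OK" receive string from the post; the helper names (`build_request`, `response_matches`, `probe`) are made up for illustration. It also assumes the request needs a terminating blank line and reads only the first chunk of the response, since HTTP/1.1 connections are persistent by default and the server may not close the socket.

```python
import socket
import time
from typing import Tuple

# Values copied from the monitor described in the post.
SEND_STRING = "GET /blah/blah.asmx HTTP/1.1\r\nUser-Agent:Mozilla\r\nhost:blahblah.blah.net\r\n"
RECEIVE_STRING = "200 OK"
TIMEOUT_SECONDS = 16
INTERVAL_SECONDS = 5


def build_request(send_string: str) -> bytes:
    """Ensure the header block ends with a blank line (assumption: the raw
    send string alone is not a complete HTTP request)."""
    text = send_string
    while not text.endswith("\r\n\r\n"):
        text += "\r\n"
    return text.encode("ascii")


def response_matches(response: bytes, receive_string: str = RECEIVE_STRING) -> bool:
    """Return True if the receive string appears anywhere in the response bytes."""
    return receive_string.encode("ascii") in response


def probe(host: str, port: int = 80, timeout: float = TIMEOUT_SECONDS) -> Tuple[bool, float]:
    """Open a fresh TCP connection, send the request, read the first chunk of
    the response, and return (matched, elapsed_seconds)."""
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(build_request(SEND_STRING))
            response = conn.recv(4096)
    except OSError:
        # Covers refused connections, resets, and timeouts.
        return False, time.monotonic() - started
    return response_matches(response), time.monotonic() - started
```

Calling `probe()` with a pool member's address every `INTERVAL_SECONDS` during the load test and logging the result gives a timeline to compare against the moments the F5 marks the members down. If the probes keep succeeding well inside the timeout while the members go red, the cause is more likely on the path between the F5 and the servers than in the application itself.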

 

 

18 Replies

  • We have exactly the same issue. The F5 stops doing the health checks for no apparent reason (marks the node as down) and then resumes them after about 3 minutes (174 sec). When the F5 resumes the health checks, the node is marked as up immediately.

     

     

    I've got the tcpdump traces (taken on the F5) that confirm the F5 isn't doing any health checks during the outage. I've also got the bigd log, and there are no failed health checks in it.

     

     

    Did you log the issue with F5 and get a patch or workaround?

     

     

    Kind regards,

     

     

    Jan
  • Hi, sorry to open an old topic, but I have this exact same issue.

     

     

    Did anyone have any luck finding out what the problem is, and is there a solution at all?

     

     

    Thanks in advance.
  • so is there really a problem here or not?

     

    seems people like to say they have evidence of a problem, yet no followup on any support cases filed or solutions found.

     

     

    i'm suggesting that this topic be removed as it could be stirring things up that aren't valid.

     

     

    I ask because we are experiencing symptoms that would indicate that the f5 marks a member down and stops polling for a time and then marks it up and resumes polling. sounds like the above. but is that really the case?

     

     

    any insight on this would be valuable, but without it this discussion thread is pretty dead, IMHO.
  • Sorry for the long wait.

     

     

    In our case it was a configuration issue. The health checks were running from the same IP as the actual traffic for the virtual server. In some cases a health check would reuse a recently used TCP source port. Our firewall treated this as a late packet for a recently closed flow and dropped it.

     

     

    The issue was caused by a faulty interface configuration. In an active/passive setup, you shouldn't use the unit ID in the floating IP configuration. This caused the F5 to use the interface IP for both the health checks and the virtual server traffic (we are using SNAT). Normally the virtual server traffic should SNAT behind the floating IP. After opening a ticket with F5 we changed the config, and now the health checks run from the interface IP, separate from the virtual server traffic, which is NATed behind the floating IP.

     

     

    Since then the conflicting TCP port issue on our firewall has been resolved.

     

     

    I know my explanation is a bit blurry, but this was 3 years ago, and due to a technical issue I cannot restore my old PST files to dig up the actual F5 case emails.

     

     

    Jan
  • We have the same behaviour (LTM 1600, 11.1 HF4). Under some unknown circumstances, the LTM stops health monitoring (always the same node with the same service). This marks the node down and the service is no longer accessible. After about three minutes (174 sec) it begins health checking again. On our passive box we see the same behaviour, but never at the same time. Any hints?

     

    Many thanks.

     

    Tom
  • I know it is intermittent, but would it be possible to capture bigd debug, tcpdump and a qkview while the issue is occurring?

     

     

    Troubleshooting Ltm Monitors

     

    https://devcentral.f5.com/wiki/AdvDesignConfig.TroubleshootingLtmMonitors.ashx
  • bigd debug actually provides too much output. I'm currently "tcpdumping", hoping that the issue occurs...
  • Quote: bigd debug provides actually too much output. i'm currently "tcpdumping", hoping, that the issue occurs.....

    Please do not forget to generate a qkview then. qkview collects a number of system statistics, which could also be helpful.
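Once a capture exists, two of the symptoms described in the replies can be checked mechanically: pauses in the monitor traffic (such as the 174-second gap) and reuse of the same source port toward the same pool member within a short time (the firewall case described earlier in the thread). The following is a rough sketch under stated assumptions. The `Syn` record layout is hypothetical and would have to be filled from your own tcpdump export, the sample addresses and timestamps are made up, and the reuse `window` should be set to whatever state timeout your firewall actually uses.

```python
from typing import Dict, List, NamedTuple, Tuple


class Syn(NamedTuple):
    """One connection attempt taken from a capture (hypothetical record layout)."""
    timestamp: float  # seconds since the start of the capture
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def find_monitor_gaps(syns: List[Syn], interval: float,
                      slack: float = 2.0) -> List[Tuple[float, float]]:
    """Return (start, end) pairs where consecutive probes are further apart
    than interval * slack."""
    times = sorted(s.timestamp for s in syns)
    return [(a, b) for a, b in zip(times, times[1:]) if b - a > interval * slack]


def find_tuple_reuse(syns: List[Syn], window: float) -> List[Tuple[Syn, Syn]]:
    """Return pairs of attempts that reuse the same source/destination
    address and port combination within `window` seconds."""
    last_seen: Dict[Tuple[str, int, str, int], Syn] = {}
    reused: List[Tuple[Syn, Syn]] = []
    for syn in sorted(syns, key=lambda s: s.timestamp):
        key = (syn.src_ip, syn.src_port, syn.dst_ip, syn.dst_port)
        previous = last_seen.get(key)
        if previous is not None and syn.timestamp - previous.timestamp <= window:
            reused.append((previous, syn))
        last_seen[key] = syn
    return reused


# Made-up sample: probes every 5 s, then a long pause, with one source port reused.
sample = [
    Syn(0.0, "192.0.2.1", 40001, "192.0.2.20", 80),
    Syn(5.0, "192.0.2.1", 40002, "192.0.2.20", 80),
    Syn(10.0, "192.0.2.1", 40003, "192.0.2.20", 80),
    Syn(184.0, "192.0.2.1", 40001, "192.0.2.20", 80),
    Syn(189.0, "192.0.2.1", 40004, "192.0.2.20", 80),
]
gaps = find_monitor_gaps(sample, interval=5.0)
reuse = find_tuple_reuse(sample, window=240.0)
```

If `gaps` is non-empty while the pool members were reachable, the monitor traffic itself stopped, which matches what the tcpdump traces in the first reply showed. If `reuse` is non-empty, it is worth checking whether a stateful device between the F5 and the servers drops the reused connection attempts.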