Health Check Issues

The Scenario:

We are performing some load testing of an application and are using a health check as follows:

GET /blah/blah.asmx HTTP/1.1\r\nUser-Agent:Mozilla\r\nhost:blahblah.blah.net\r\n

We look for a 200 OK

Monitor interval: 5 seconds

Timeout: 16 seconds
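To see whether the servers answer slowly under load rather than not at all, the same request can be timed from a separate machine while the test runs. The following is a minimal sketch, not the monitor's actual implementation: the host and path are the anonymized values from the post, the `Connection: close` header and the terminating blank line are additions made so the script gets a complete response, and the function names are hypothetical.

```python
import socket
import time

# Anonymized values from the post; replace with the real pool member.
HOST = "blahblah.blah.net"
PORT = 80
PATH = "/blah/blah.asmx"
RECEIVE_STRING = b"200 OK"


def build_request(path: str, host: str) -> bytes:
    """Build a request like the monitor's send string, ended by a blank line."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        "User-Agent: Mozilla\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"  # assumption: not part of the original send string
        "\r\n"
    ).encode("ascii")


def response_matches(data: bytes, receive_string: bytes = RECEIVE_STRING) -> bool:
    """Mimic a receive-string check: succeed if the marker appears anywhere."""
    return receive_string in data


def probe(host: str, port: int, path: str, timeout: float = 5.0):
    """Send one probe and return (matched, seconds_elapsed).

    The timeout applies to the connect and to each recv call separately.
    """
    start = time.monotonic()
    data = b""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(build_request(path, host))
            while not response_matches(data):
                chunk = sock.recv(4096)
                if not chunk:
                    break  # server closed the connection
                data += chunk
    except OSError:
        pass  # refused, reset, or timed out: counts as a failed probe
    return response_matches(data), time.monotonic() - start
```

Calling `probe(HOST, PORT, PATH)` every 5 seconds during the load test and logging the elapsed time would show whether responses ever approach the monitor timeout.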

We have 2 servers in the pool, and both are green at the moment.

The Problem:

As soon as we start throwing LoadRunner load tests at this VIP, all the servers go red almost immediately (we are throwing a small amount of load, around 40 concurrent sessions). What really gets me is that I can run the exact same request as above during the load test and still receive a 200 OK.

I also tried changing the interval to 15 seconds with a timeout of 46 seconds and saw the same problem (although this time it happened after 5 minutes instead of almost immediately).
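Both interval/timeout pairs above follow the commonly cited guideline of timeout = 3 × interval + 1, so in each case a member is marked down only after three consecutive probes go unanswered. A small sketch of that arithmetic (the function names are hypothetical):

```python
def recommended_timeout(interval: int) -> int:
    """Timeout that allows three missed probes plus one second of slack."""
    return 3 * interval + 1


def probes_before_timeout(interval: int, timeout: int) -> int:
    """Count probes sent strictly before the timeout expires,
    measured from the last successful response."""
    return (timeout - 1) // interval


for interval in (5, 15):
    timeout = recommended_timeout(interval)
    print(interval, timeout, probes_before_timeout(interval, timeout))
```

With the longer pair, three consecutive failures take 46 seconds to accumulate instead of 16, which would delay the mark-down but not prevent it.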

What steps can be taken to try to diagnose why the F5 is thinking these nodes are down, when they are clearly not?

management

monitoring

18 Replies

Jan_Renard
Nimbostratus
Apr 01, 2010
We have exactly the same issue. The F5 stops doing the health checks for no apparent reason (it marks the node as down) and then resumes the health checks after about 3 minutes (174 sec). When the F5 resumes the health checks, the node is marked as up immediately.

I've got the tcpdump traces (taken on the F5) that confirm the F5 isn't doing any health checks during the outage. I've also got the bigd log, and there are no failed health checks in it.

Did you log the issue with F5 and get a patch or workaround?

Kind regards,

Jan
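A pause like the 174-second one described above can be located in a capture by looking for gaps between consecutive probe timestamps. The sketch below assumes the timestamps (in seconds) have already been extracted from the tcpdump trace, which is not shown here; `find_gaps`, the tolerance factor, and the sample data are hypothetical.

```python
def find_gaps(timestamps, interval, tolerance=2.0):
    """Return (start, end, length) for each pause longer than interval * tolerance."""
    ordered = sorted(timestamps)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > interval * tolerance:
            gaps.append((prev, cur, cur - prev))
    return gaps


# Made-up probe times with a 5-second interval and one long pause.
sample = [0, 5, 10, 15, 189, 194]
print(find_gaps(sample, interval=5))
```

Comparing the start of each gap with the time the member was marked down would show whether the probes stopped before or after the status change.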
Rikardo_77456
Nimbostratus
Apr 29, 2012
Hi, sorry to open an old topic, but I have this exact same issue.

Did anyone have any luck finding out what the problem is, or find a solution?

Thanks in advance.
brad_11480
Nimbostratus
May 15, 2012
so is there really a problem here or not?

seems people like to say they have evidence of a problem, yet no followup on any support cases filed or solutions found.

i'm suggesting that this topic be removed as it could be stirring things up that aren't valid.

I ask because we are experiencing symptoms that would indicate that the f5 marks a member down and stops polling for a time and then marks it up and resumes polling. sounds like the above. but is that really the case?

any insight on this would be valuable, but without it this discussion thread is pretty dead, IMHO.
Jan_Renard
Nimbostratus
May 16, 2012
Sorry for the long wait.

In our case it was a configuration issue. The health checks were running from the same IP as the actual traffic for the virtual server. In some cases a health check would reuse a TCP source port that had recently been used by virtual server traffic. Our firewall treated this as a late packet for a recently closed flow and dropped it.

The issue was caused by a faulty interface configuration. In an active/passive setup, you shouldn't use the unit ID in the floating IP configuration. Doing so caused the F5 to use the interface IP for both the health checks and the virtual server traffic (we are using SNAT). Normally the virtual server traffic should SNAT behind the floating IP. After opening a ticket with F5, we changed the config, and now the health checks run from the interface IP, separate from the virtual server traffic, which is NATed behind the floating IP.

Since then the conflicting tcp port issue on our firewall has been resolved.

I know my explanation is a bit blurry, but this was 3 years ago, and due to a technical issue I cannot restore my old PST files to dig up the actual F5 case emails.

Jan
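The port conflict described above comes down to the same source IP, source port, destination IP, and destination port being reopened shortly after the previous connection closed. A minimal sketch of how such reuse could be spotted in a list of connection records follows; the `Conn` record, `find_reuse`, and the 120-second quiet period are all assumptions for illustration, since the firewall's actual timer is not given in this thread.

```python
from typing import NamedTuple


class Conn(NamedTuple):
    opened: float   # seconds
    closed: float   # seconds
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def find_reuse(conns, quiet_period=120.0):
    """Return (earlier, later) pairs where a 4-tuple is reopened
    less than quiet_period seconds after its previous close."""
    last_by_tuple = {}
    hits = []
    for conn in sorted(conns, key=lambda c: c.opened):
        key = (conn.src_ip, conn.src_port, conn.dst_ip, conn.dst_port)
        prev = last_by_tuple.get(key)
        if prev is not None and conn.opened - prev.closed < quiet_period:
            hits.append((prev, conn))
        last_by_tuple[key] = conn
    return hits
```

When health checks and SNATed client traffic share one source IP, they draw from the same pool of source ports, so collisions like these become possible under load. Giving each its own source IP, as in the fix above, removes the overlap.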
tomtux_93477
Nimbostratus
Jul 05, 2012
We have the same behaviour (ltm 1600 11.1 hf4). Under some unknown circumstances, the LTM stops health monitoring (always the same node with the same service). This marks the node down, and the service is no longer accessible. After about three minutes (174 sec.) it starts health checking again. On our passive box we see the same behaviour, but never at the same time. Any hints?

Many thanks.

Tom
nitass
Employee
Jul 05, 2012
i know it is intermittent, but would it be possible to get bigd debug, tcpdump and qkview while the issue is occurring?

Troubleshooting Ltm Monitors

https://devcentral.f5.com/wiki/AdvDesignConfig.TroubleshootingLtmMonitors.ashx
tomtux_93477
Nimbostratus
Jul 05, 2012
bigd debug actually produces too much output. i'm currently "tcpdumping", hoping that the issue occurs...
nitass
Employee
Jul 05, 2012
> bigd debug provides actually too much output. i'm currently "tcpdumping", hoping, that the issue occurs.....

please do not forget to generate a qkview then. qkview collects a number of system statistics, which could also be helpful.
