Forum Discussion

lkchen
Nimbostratus
Feb 06, 2014

Tolerant gateway_icmp monitor?

We have gateway failsafe configured for our HA pair. Despite being misconfigured (and no mention of it when we previously had a support case open), it seemed to do the right thing, including after we had changed the state preference from active/standby to none/none. Until recently....

 

Both sides had been configured to use the same gateway pool... so the secondary wouldn't fail over when it lost its gateway. A couple of weekends ago, things were flopping from one side to the other in the datacenter... but the F5 switched to the secondary and stayed there. It's also configured with some VLAN failsafes, which didn't trigger.

 

However, ICMP packet loss continues to be a growing problem in our datacenter. Things that are only checked by ICMP in our Nagios appear to go up and down throughout the day whenever packet loss exceeds 60% (the default threshold in check_icmp)...

 

But any single lost ICMP packet for gateway_icmp results in the F5 failing over....

 

Since VLAN failsafe didn't trigger during the last disruption, turning off gateway failsafe doesn't seem to be an acceptable way of coping with our network.

 

So, is there an alternative gateway_icmp monitor that does multiple pings, or that requires a certain number of gateway_icmp probes to fail in a row before triggering failover?

 

Though I wonder if I should add the VLAN that the gateway is in to the VLAN failsafe? (It's currently watching the VLAN for the public-facing VSs and an internal VLAN that has a number of servers for core services.)

 

8 Replies

  • uni
    Altostratus

    Create a new monitor with gateway_icmp as the parent. You can then adjust the polling parameters to suit your needs. I can't remember the default, and I have no access at the moment, but I think it is one poll every 5 seconds, failing if there is no response after 16 seconds. That seems pretty resilient to me; if you are getting flapping with that, I think you have a bigger problem to sort out. However, you could try changing the timeout to perhaps 31 seconds, which effectively allows 6 polls to fail before the node is marked down.
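
    For instance, a minimal tmsh sketch of that (the monitor and pool names here are just placeholders; adjust interval/timeout to taste):

        # Child monitor inheriting from the built-in gateway_icmp, kept at a
        # 5-second interval but given a 31-second timeout, so roughly 6
        # consecutive probes have to go unanswered before it marks down.
        tmsh create ltm monitor gateway-icmp gw_icmp_tolerant defaults-from gateway_icmp interval 5 timeout 31

        # Then point the gateway pool at the new monitor instead of the default.
        tmsh modify ltm pool your_gateway_pool monitor gw_icmp_tolerant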

     

  • So, is there an alternative gateway_icmp monitor that does multiple pings, or that requires a certain number of gateway_icmp probes to fail in a row before triggering failover?

     

    As uni explained, this is done via the interval and timeout settings in the health monitor.

     

    Though I wonder if I should add the VLAN that the gateway is in to the VLAN failsafe? (It's currently watching the VLAN for the public-facing VSs and an internal VLAN that has a number of servers for core services.)

     

    There are two configurations: one is VLAN failsafe and the other is an HA group. If you want to fail over when an interface is down, an HA group is the way to go (rough tmsh sketch after this reply).

     

    sol13297: Overview of VLAN failsafe (10.x - 11.x)

     

    http://support.f5.com/kb/en-us/solutions/public/13000/200/sol13297.html

     

    Manual Chapter: Understanding Fast Failover

     

    http://support.f5.com/kb/en-us/products/big-ip_ltm/manuals/product/bigip_redundant_systems_config_11_0_0/8.html

     

    • Robert_47833
      Altostratus
      Should the interface be set up as a trunk, and then configured in the HA group?
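
    A rough tmsh sketch of the two options described above, for 11.x (the VLAN, trunk, and group names are placeholders, and the exact attribute names and failsafe-action values should be checked against your version's tmsh reference before use):

        # Option 1: VLAN failsafe on the VLAN facing the gateway --
        # fail over if no traffic is seen on that VLAN within the timeout.
        tmsh modify net vlan external_vlan failsafe enabled failsafe-timeout 90 failsafe-action failover

        # Option 2: HA group -- score this unit on the health of a trunk (and/or pool),
        # so that losing the trunk's interfaces lowers the score and triggers failover.
        tmsh create sys ha-group ha_group_1 active-bonus 10 trunks add { uplink_trunk { weight 20 } }

    And on Robert's question: as far as I understand it, the HA group scores trunks rather than bare interfaces, so the interfaces you care about would normally be placed in a trunk first.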
  • lkchen
    Nimbostratus

    Well, there seems to be some disagreement between how the F5 monitors actually work and what the other administrators are claiming.

     

    They claim the gateway was only down for ~10 seconds, implying that ping 1 and ping 2 would time out but ping 3 would succeed (and come back in under 1s). But the F5 failed over, suggesting that it reacted to ping 1 timing out.

     

    And as evidence they point to the standby unit, which logged a 10-second outage of its gateway (and which is configured not to react while it is standby). I would think that actually indicates the outage had been at least 26 seconds: the down event already implies roughly 16 seconds with no response, plus the 10 seconds until the up event (see the rough timeline at the end of this reply).

     

    I don't suppose there's a verbose logging option that can show whether the gateway_icmp monitor is working correctly?

     

    Other administrators have made similar complaints when the F5 reports that their web server or other application servers are marked down, including servers where they connect to the VIP themselves and it eventually responds (after almost a minute... and the monitor timeout was set to 31 seconds....)

     

    But then, everything is always the F5's fault.

     

    Like a few weekends ago... when there was much strangeness going on with the datacenter network (a ticket was logged at 12:40pm about it). The F5 failed over around 6:25pm, resulting in major service outages lasting about an hour, until I started poking around on the F5s and the secondary finally seemed to start moving traffic... which seemed to coincide with doing a config sync from primary to secondary. I wouldn't have thought there would have been that much difference between them, but I didn't think to check.

     

    What I did see was that there were pools where the primary could see one member and the secondary could see the other member... or a pool completely down to one unit and completely up to the other, etc., suggesting that there was a problem between the two main datacenter switches. The primary is connected to switch A, and the secondary is connected to switch B.

     

    It was after this disruption that I discovered the problem with how gateway failsafe had been configured... since the secondary was reporting outages but stayed active.
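
    For reference, here is roughly how the timing works out, assuming the stock gateway_icmp defaults of interval 5s / timeout 16s (substitute the real values if a custom monitor is in use):

        t=0s    last successful ping reply; the 16s timeout timer resets
        t=5s    probe sent, no reply
        t=10s   probe sent, no reply
        t=15s   probe sent, no reply
        t=16s   timeout expires -> pool member marked down, sod logs the failover action
        ...
        t=26s   next successful reply -> member marked up again

    So a down/up pair 10 seconds apart in the log points to the gateway going roughly 16 + 10 = 26 seconds without answering, not 10.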

     

  • They claim the gateway was only down for ~10 seconds, implying that ping 1 and ping 2 would time out but ping 3 would succeed (and come back in under 1s). But the F5 failed over, suggesting that it reacted to ping 1 timing out.

     

    If the failover was triggered by failsafe, it should show up in /var/log/ltm.
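
    For example, something along these lines from the shell should pull out the relevant events, including the rotated copies (the grep pattern is only a suggestion, and it assumes the rotated logs are kept as /var/log/ltm.1.gz and so on):

        # current log
        grep -E 'sod|failsafe|fails action|Standby|Active' /var/log/ltm

        # rotated, compressed logs
        zgrep -E 'sod|failsafe|fails action|Standby|Active' /var/log/ltm.*.gz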

     

  • lkchen
    Nimbostratus

    All I can find for the failover is:

    Feb  6 11:45:56 local/pri-4600 notice mcpd[3447]: 01070638:5: Pool member bcj.bda.cfd.cab:0 monitor status down.
    Feb  6 11:45:56 local/pri-4600 notice mcpd[3447]: 01070638:5: Pool member bcj.bda.cfd.cab:0 monitor status down.
    Feb  6 11:45:56 local/pri-4600 notice mcpd[3447]: 01070638:5: Pool member bcj.bda.cfd.cab:0 monitor status down.
    Feb  6 11:45:56 local/pri-4600 notice sod[3440]: 01140029:5: HA pool_memb_down FWSM_Wildcard_Virt1 fails action is failover.
    Feb  6 11:45:56 local/pri-4600 notice sod[3440]: 010c0018:5: Standby
    

    It doesn't give me any assurance that the other 2 pings in the window had, or would have, also failed.

    I presume the 3 lines are because there are 3 pools on the F5 checking the same gateway... the FWSM_Wildcard_Virt pool used by the numerous "FWSM_F5_xxx_Routing" 'Forwarding (IP)' virtual servers that provides the default route for most of the VLANs behind the F5, and the FWSM_Wildcard_Virt2 pool defined for Unit 2. (A quick way to confirm which pools reference that gateway is sketched at the end of this reply.)

    Even under normal circumstances, ping times are:

    100 packets transmitted, 100 received, 0% packet loss, time 99055ms
    rtt min/avg/max/mdev = 0.377/1.036/1.990/0.323 ms
    

    Continuing through that day's log:

    Feb  6 11:46:40 local/pri-4600 notice mcpd[3447]: 01070727:5: Pool member bcj.bda.cfd.cab:0 monitor status up.
    Feb  6 11:46:40 local/pri-4600 notice mcpd[3447]: 01070727:5: Pool member bcj.bda.cfd.cab:0 monitor status up.
    Feb  6 11:46:40 local/pri-4600 notice mcpd[3447]: 01070727:5: Pool member bcj.bda.cfd.cab:0 monitor status up.
    Feb  6 11:46:40 local/pri-4600 notice sod[3440]: 01140030:5: HA pool_memb_down FWSM_Wildcard_Virt1 is now responding.
    Feb  6 12:26:21 local/pri-4600 notice mcpd[3447]: 01070638:5: Pool member bcj.bda.cfd.cab:0 monitor status down.
    Feb  6 12:26:21 local/pri-4600 notice mcpd[3447]: 01070638:5: Pool member bcj.bda.cfd.cab:0 monitor status down.
    Feb  6 12:26:21 local/pri-4600 notice mcpd[3447]: 01070638:5: Pool member bcj.bda.cfd.cab:0 monitor status down.
    Feb  6 12:26:21 local/pri-4600 notice sod[3440]: 01140029:5: HA pool_memb_down FWSM_Wildcard_Virt1 fails action is failover.
    Feb  6 12:26:26 local/pri-4600 notice mcpd[3447]: 01070727:5: Pool member bcj.bda.cfd.cab:0 monitor status up.
    Feb  6 12:26:26 local/pri-4600 notice mcpd[3447]: 01070727:5: Pool member bcj.bda.cfd.cab:0 monitor status up.
    Feb  6 12:26:26 local/pri-4600 notice mcpd[3447]: 01070727:5: Pool member bcj.bda.cfd.cab:0 monitor status up.
    Feb  6 12:26:26 local/pri-4600 notice sod[3440]: 01140030:5: HA pool_memb_down FWSM_Wildcard_Virt1 is now responding.
    

    While on the secondary:

    Feb  6 09:37:00 local/sec-4600 notice mcpd[3448]: 01070638:5: Pool member bcj.bda.cfd.cab:0 monitor status down.
    Feb  6 09:37:00 local/sec-4600 notice mcpd[3448]: 01070638:5: Pool member bcj.bda.cfd.cab:0 monitor status down.
    Feb  6 09:37:00 local/sec-4600 notice mcpd[3448]: 01070638:5: Pool member bcj.bda.cfd.cab:0 monitor status down.
    Feb  6 09:37:00 local/sec-4600 notice sod[3453]: 01140029:5: HA pool_memb_down FWSM_Wildcard_Virt2 fails action is failover.
    Feb  6 09:37:10 local/sec-4600 notice mcpd[3448]: 01070727:5: Pool member bcj.bda.cfd.cab:0 monitor status up.
    Feb  6 09:37:10 local/sec-4600 notice mcpd[3448]: 01070727:5: Pool member bcj.bda.cfd.cab:0 monitor status up.
    Feb  6 09:37:10 local/sec-4600 notice mcpd[3448]: 01070727:5: Pool member bcj.bda.cfd.cab:0 monitor status up.
    Feb  6 09:37:10 local/sec-4600 notice sod[3453]: 01140030:5: HA pool_memb_down FWSM_Wildcard_Virt2 is now responding.
    Feb  6 11:45:56 local/sec-4600 notice sod[3453]: 010c0019:5: Active
    

    And these are actually from the second back-and-forth... the logs for the earlier pri->sec failover have rotated off by now.

    I had switched back from sec to pri early on Feb 4th (before I had checked my messages and found that it was a snow day... and later the 5th was also a snow day 😉). The switch back from sec to pri on the 4th took longer than expected, because when I had recreated its gateway failsafe I had inadvertently left its action as Reboot, which caused it to reboot over and over again until I was able to disable it.

    That was challenging, because even though I could ssh in well before the boot had finished, I couldn't reconfigure HA until it knew whether it was licensed for it. I don't recall if I saw that the last time I got into a reboot loop... though I think that was probably because I hadn't thought to try ssh'ing, since sshd starts fairly early in the boot, and I was instead racing the narrow window between the console login prompt and the failsafe reboot. It's times like these that I think our password is overly complex.... 🙂

    I switched back again early on the 7th... so far it has stayed.
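
    (The quick check mentioned above: something like the following should list every pool that references the gateway address; bcj.bda.cfd.cab stands in for the real gateway IP here, as in the logs above.)

        # list every pool on a single line and pick out the ones containing the gateway member
        tmsh list ltm pool one-line | grep 'bcj.bda.cfd.cab'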