Forum Discussion

lkchen
Nimbostratus
Feb 06, 2014

Tolerant gateway_icmp monitor?

We have gateway failsafe configured for our HA pair. Despite being misconfigured (and no mention of it when we previously had a support case open), it seemed to do the right thing, including after we had changed the state preference from active/standby to none/none. Until recently....

 

Both sides had been configured to use the same gateway pool... so the secondary wouldn't fail over when it lost its gateway. A couple of weekends ago, things were flopping from one side to the other in the datacenter... but the F5 switched to the secondary and stayed there. It's also configured with some VLAN failsafes, which didn't trigger.

 

However, ICMP packet loss continues to be a growing problem in our datacenter. Things that are only checked by ICMP in our Nagios appear to go up and down throughout the day whenever packet loss exceeds 60% (the default threshold in check_icmp)...

 

But any single lost ICMP packet for gateway_icmp results in the F5 failing over....

 

Since VLAN failsafe didn't trigger during the last disruption, turning off gateway failsafe doesn't seem to be an acceptable way of coping with our network.

 

So, is there an alternative gateway_icmp monitor that does multiple pings, or that requires a certain number of gateway_icmp probes to fail in a row before triggering failover?

 

Though I wonder if I should add the VLAN that the gateway is in to the VLAN failsafe? (It's currently watching the VLAN for the public-facing VSs and an internal VLAN that has a number of servers for core services.)

 

8 Replies

  • uni
    Altostratus

    Create a new monitor with gateway_icmp as the parent. You can then adjust the polling parameters to suit your needs. I can't remember the default, and I have no access at the moment, but I think it is one poll every 5 seconds, failing if there is no response after 16 seconds. That seems pretty resilient to me; if you are getting flapping with that, I think you have a bigger problem to sort out. However, you could try changing the timeout to perhaps 31 seconds, which effectively allows 6 polls to fail before the node is marked down.
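
    For instance, a minimal tmsh sketch of that (the monitor and pool names here are just placeholders; adjust interval/timeout to taste):

        # Child monitor inheriting from the built-in gateway_icmp, kept at a
        # 5-second interval but given a 31-second timeout, so roughly 6
        # consecutive probes have to go unanswered before it marks down.
        tmsh create ltm monitor gateway-icmp gw_icmp_tolerant defaults-from gateway_icmp interval 5 timeout 31

        # Then point the gateway pool at the new monitor instead of the default.
        tmsh modify ltm pool your_gateway_pool monitor gw_icmp_tolerant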

     

  • So, is there an alternative gateway_icmp monitor that does multiple pings, or that requires a certain number of gateway_icmp probes to fail in a row before triggering failover?

     

    As uni explained, this is done via the interval and timeout settings in the health monitor.

     

    Though I wonder if I should add the VLAN that the gateway is in to the VLAN failsafe? (It's currently watching the VLAN for the public-facing VSs and an internal VLAN that has a number of servers for core services.)

     

    There are two configurations: one is VLAN failsafe and the other is an HA group. If you want to fail over when an interface is down, an HA group is the way to go (rough tmsh sketch after this reply).

     

    sol13297: Overview of VLAN failsafe (10.x - 11.x)

     

    http://support.f5.com/kb/en-us/solutions/public/13000/200/sol13297.html

     

    Manual Chapter: Understanding Fast Failover

     

    http://support.f5.com/kb/en-us/products/big-ip_ltm/manuals/product/bigip_redundant_systems_config_11_0_0/8.html

     

    • Robert_47833
      Altostratus
      Should the interface be set up as a trunk, and then configured in the HA group?
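
    A rough tmsh sketch of the two options described above, for 11.x (the VLAN, trunk, and group names are placeholders, and the exact attribute names and failsafe-action values should be checked against your version's tmsh reference before use):

        # Option 1: VLAN failsafe on the VLAN facing the gateway --
        # fail over if no traffic is seen on that VLAN within the timeout.
        tmsh modify net vlan external_vlan failsafe enabled failsafe-timeout 90 failsafe-action failover

        # Option 2: HA group -- score this unit on the health of a trunk (and/or pool),
        # so that losing the trunk's interfaces lowers the score and triggers failover.
        tmsh create sys ha-group ha_group_1 active-bonus 10 trunks add { uplink_trunk { weight 20 } }

    And on Robert's question: as far as I understand it, the HA group scores trunks rather than bare interfaces, so the interfaces you care about would normally be placed in a trunk first.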
  • lkchen
    Nimbostratus

    Well, there seems to be some disagreement between how the F5 monitors actually work and what the other administrators are claiming.

     

    They claim the gateway was only down for ~10 seconds, implying that ping 1 and ping 2 would time out but ping 3 would succeed (and come back in under 1s). But the F5 failed over, suggesting that it reacted to ping 1 timing out.

     

    And as evidence they point to the standby unit, which logged a 10-second outage of its gateway (and which is configured not to react while it is standby). I would think that actually indicates the outage had been at least 26 seconds: the down event already implies roughly 16 seconds with no response, plus the 10 seconds until the up event (see the rough timeline at the end of this reply).

     

    I don't suppose there's a verbose logging option that can show whether the gateway_icmp monitor is working correctly?

     

    Other administrators have made similar complaints when the F5 reports that their web server or other application servers are marked down, including servers where they connect to the VIP themselves and it eventually responds (after almost a minute... and the monitor timeout was set to 31 seconds....)

     

    But then, everything is always the F5's fault.

     

    Like a few weekends ago... when there was much strangeness going on with the datacenter network (a ticket was logged at 12:40pm about it). The F5 failed over around 6:25pm, resulting in major service outages lasting about an hour, until I started poking around on the F5s and the secondary finally seemed to start moving traffic... which seemed to coincide with doing a config sync from primary to secondary. I wouldn't have thought there would have been that much difference between them, but I didn't think to check.

     

    What I did see was that there were pools where the primary could see one member and the secondary could see the other member... or a pool completely down to one unit and completely up to the other, etc., suggesting that there was a problem between the two main datacenter switches. The primary is connected to switch A, and the secondary is connected to switch B.

     

    It was after this disruption that I discovered the problem with how gateway failsafe had been configured... since the secondary was reporting outages but stayed active.
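
    For reference, here is roughly how the timing works out, assuming the stock gateway_icmp defaults of interval 5s / timeout 16s (substitute the real values if a custom monitor is in use):

        t=0s    last successful ping reply; the 16s timeout timer resets
        t=5s    probe sent, no reply
        t=10s   probe sent, no reply
        t=15s   probe sent, no reply
        t=16s   timeout expires -> pool member marked down, sod logs the failover action
        ...
        t=26s   next successful reply -> member marked up again

    So a down/up pair 10 seconds apart in the log points to the gateway going roughly 16 + 10 = 26 seconds without answering, not 10.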

     

  • They claim the gateway was only down for ~10 seconds, implying that ping 1 and ping 2 would time out but ping 3 would succeed (and come back in under 1s). But the F5 failed over, suggesting that it reacted to ping 1 timing out.

     

    If the failover was triggered by failsafe, it should show up in /var/log/ltm.
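
    For example, something along these lines from the shell should pull out the relevant events, including the rotated copies (the grep pattern is only a suggestion, and it assumes the rotated logs are kept as /var/log/ltm.1.gz and so on):

        # current log
        grep -E 'sod|failsafe|fails action|Standby|Active' /var/log/ltm

        # rotated, compressed logs
        zgrep -E 'sod|failsafe|fails action|Standby|Active' /var/log/ltm.*.gz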

     

  • lkchen
    Nimbostratus

    All I can find for the failover is:

    Feb  6 11:45:56 local/pri-4600 notice mcpd[3447]: 01070638:5: Pool member bcj.bda.cfd.cab:0 monitor status down.
    Feb  6 11:45:56 local/pri-4600 notice mcpd[3447]: 01070638:5: Pool member bcj.bda.cfd.cab:0 monitor status down.
    Feb  6 11:45:56 local/pri-4600 notice mcpd[3447]: 01070638:5: Pool member bcj.bda.cfd.cab:0 monitor status down.
    Feb  6 11:45:56 local/pri-4600 notice sod[3440]: 01140029:5: HA pool_memb_down FWSM_Wildcard_Virt1 fails action is failover.
    Feb  6 11:45:56 local/pri-4600 notice sod[3440]: 010c0018:5: Standby
    

    It doesn't give me any assurance that the other 2 pings in the window had, or would have, also failed.

    I presume the 3 lines are because there are 3 pools on the F5 checking the same gateway... the FWSM_Wildcard_Virt pool used by the numerous "FWSM_F5_xxx_Routing" 'Forwarding (IP)' virtual servers that provides the default route for most of the VLANs behind the F5, and the FWSM_Wildcard_Virt2 pool defined for Unit 2. (A quick way to confirm which pools reference that gateway is sketched at the end of this reply.)

    Even under normal circumstances, ping times are:

    100 packets transmitted, 100 received, 0% packet loss, time 99055ms
    rtt min/avg/max/mdev = 0.377/1.036/1.990/0.323 ms
    

    Continuing through that day's log:

    Feb  6 11:46:40 local/pri-4600 notice mcpd[3447]: 01070727:5: Pool member bcj.bda.cfd.cab:0 monitor status up.
    Feb  6 11:46:40 local/pri-4600 notice mcpd[3447]: 01070727:5: Pool member bcj.bda.cfd.cab:0 monitor status up.
    Feb  6 11:46:40 local/pri-4600 notice mcpd[3447]: 01070727:5: Pool member bcj.bda.cfd.cab:0 monitor status up.
    Feb  6 11:46:40 local/pri-4600 notice sod[3440]: 01140030:5: HA pool_memb_down FWSM_Wildcard_Virt1 is now responding.
    Feb  6 12:26:21 local/pri-4600 notice mcpd[3447]: 01070638:5: Pool member bcj.bda.cfd.cab:0 monitor status down.
    Feb  6 12:26:21 local/pri-4600 notice mcpd[3447]: 01070638:5: Pool member bcj.bda.cfd.cab:0 monitor status down.
    Feb  6 12:26:21 local/pri-4600 notice mcpd[3447]: 01070638:5: Pool member bcj.bda.cfd.cab:0 monitor status down.
    Feb  6 12:26:21 local/pri-4600 notice sod[3440]: 01140029:5: HA pool_memb_down FWSM_Wildcard_Virt1 fails action is failover.
    Feb  6 12:26:26 local/pri-4600 notice mcpd[3447]: 01070727:5: Pool member bcj.bda.cfd.cab:0 monitor status up.
    Feb  6 12:26:26 local/pri-4600 notice mcpd[3447]: 01070727:5: Pool member bcj.bda.cfd.cab:0 monitor status up.
    Feb  6 12:26:26 local/pri-4600 notice mcpd[3447]: 01070727:5: Pool member bcj.bda.cfd.cab:0 monitor status up.
    Feb  6 12:26:26 local/pri-4600 notice sod[3440]: 01140030:5: HA pool_memb_down FWSM_Wildcard_Virt1 is now responding.
    

    While on the secondary:

    Feb  6 09:37:00 local/sec-4600 notice mcpd[3448]: 01070638:5: Pool member bcj.bda.cfd.cab:0 monitor status down.
    Feb  6 09:37:00 local/sec-4600 notice mcpd[3448]: 01070638:5: Pool member bcj.bda.cfd.cab:0 monitor status down.
    Feb  6 09:37:00 local/sec-4600 notice mcpd[3448]: 01070638:5: Pool member bcj.bda.cfd.cab:0 monitor status down.
    Feb  6 09:37:00 local/sec-4600 notice sod[3453]: 01140029:5: HA pool_memb_down FWSM_Wildcard_Virt2 fails action is failover.
    Feb  6 09:37:10 local/sec-4600 notice mcpd[3448]: 01070727:5: Pool member bcj.bda.cfd.cab:0 monitor status up.
    Feb  6 09:37:10 local/sec-4600 notice mcpd[3448]: 01070727:5: Pool member bcj.bda.cfd.cab:0 monitor status up.
    Feb  6 09:37:10 local/sec-4600 notice mcpd[3448]: 01070727:5: Pool member bcj.bda.cfd.cab:0 monitor status up.
    Feb  6 09:37:10 local/sec-4600 notice sod[3453]: 01140030:5: HA pool_memb_down FWSM_Wildcard_Virt2 is now responding.
    Feb  6 11:45:56 local/sec-4600 notice sod[3453]: 010c0019:5: Active
    

    And these are actually from the second back-and-forth... the logs for the earlier pri->sec failover have rotated off by now.

    I had switched back from sec to pri early on Feb 4th (before I had checked my messages and found that it was a snow day... and later the 5th was also a snow day 😉). The switch back from sec to pri on the 4th took longer than expected, because when I had recreated its gateway failsafe I had inadvertently left its action as Reboot, which caused it to reboot over and over again until I was able to disable it.

    That was challenging, because even though I could ssh in well before the boot had finished, I couldn't reconfigure HA until it knew whether it was licensed for it. I don't recall if I saw that the last time I got into a reboot loop... though I think that was probably because I hadn't thought to try ssh'ing, since sshd starts fairly early in the boot, and I was instead racing the narrow window between the console login prompt and the failsafe reboot. It's times like these that I think our password is overly complex.... 🙂

    I switched back again early on the 7th... so far it has stayed.
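
    (The quick check mentioned above: something like the following should list every pool that references the gateway address; bcj.bda.cfd.cab stands in for the real gateway IP here, as in the logs above.)

        # list every pool on a single line and pick out the ones containing the gateway member
        tmsh list ltm pool one-line | grep 'bcj.bda.cfd.cab'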