Forum Discussion

Son_of_Tom_1379
Jun 02, 2014

BigIP LTM Pool Monitors Stuck on "Checking"

We have dissolved an F5 BIG-IP HA pair in order to redesign/update the configuration of a system without any downtime. The plan is to run two BIG-IPs side by side and, once we're happy with the new build, cut over and re-establish HA on the new build.

The issue is with restarting the BIG-IP with the new configuration: about half of the monitors stay in the "checking" state. If we remove the monitors from a pool and re-add them, the monitors come up and the pool members come back online. The monitors that don't come up appear to be random, and include many different types such as tcp-half-open, HTTP, complex HTTP, and even DNS monitors (we're trying to avoid ICMP).

The nodes being monitored can even be the same. For example, one of the monitors stuck at the moment is an Exchange OWA check, whereas the same node monitored by an RPC check is up.

This is 11.5 HF2. The ltm log doesn't seem to contain much information apart from "monitor status up" when the monitors come back online after being manually removed/re-added to a pool.

It's as if there are too many monitors for the system to bring up post reboot, and that doesn't fill me with warm feelings.

Any help would be appreciated!

7 Replies

  • Hi!

    That sounds a bit strange. For lack of other suggestions, have you tried to force an MCPD reload?

    http://support.f5.com/kb/en-us/solutions/public/13000/000/sol13030.html

    /Patrik
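    The workaround in that article boils down to flagging mcpd so that it rebuilds its binary database from the text configuration on the next boot. A rough sketch (see SOL13030 for the exact steps on your version):

    ```shell
    # Flag mcpd to reload the configuration from the text files on next start
    touch /service/mcpd/forceload
    # A reboot then rebuilds the binary configuration database
    reboot
    ```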

  • I understand monitors are distributed across blades (i.e., not all blades perform the same health checks). I suspect one of the blades may be overwhelmed, especially if you have that many monitors. To see which blade handles which monitor, you can turn on bigd debug and check /var/log/bigdlog. Please make sure you turn the debug off afterwards; otherwise, it will eat up your disk space.

    root@(VIP2400-R77-S2)(cfg-sync Standalone)(/S1-green-P:Active)(/Common)(tmos) list sys db bigd.debug
    sys db bigd.debug {
        value "enable"
    }
    root@(VIP2400-R77-S2)(cfg-sync Standalone)(/S1-green-P:Active)(/Common)(tmos) list sys db bigd.dbgfile
    sys db bigd.dbgfile {
        value "/var/log/bigdlog"
    }
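    If it helps, those db variables can be toggled from tmsh rather than listed only; a minimal sketch (the log path is the default shown in the listing above):

    ```shell
    # Turn on bigd debug logging (output goes to the file named by bigd.dbgfile)
    tmsh modify sys db bigd.debug value enable
    # ...reproduce the problem, then inspect which monitor instances bigd ran
    less /var/log/bigdlog
    # Turn debug back off so the log does not fill the disk
    tmsh modify sys db bigd.debug value disable
    ```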
    
    • Son_of_Tom_1379

      Thanks Patrik, I'll give that a go, but I don't see a manual restart of a service post reboot as a viable solution. If it works around the issue for now that would suffice; we'd just need to make it procedure. The strange part is the old system never needed this, albeit the old system used mainly ICMP monitors, with only a couple of http/tcp monitors in operation.

      Nitass, this is a single 1600 system (until I put it into HA, then there will be two), but I will certainly review some verbose logs.

      A new point on this: the system has been running overnight with email alerting in place, and I've received about 30 alerts of members going down. The old system did not have alerting in place (SNMP or otherwise), so I'm not sure whether this is usual, but we've never had an issue with accessing services.

      The next question is, how many monitors is too many? We have about 50 nodes; the nodes reporting down are using http monitors, and are only ever down for about 10 seconds (or so it reports when the monitor comes back up). I'm tempted to just increase the timeout, as it's at the default 5 / 16, as perhaps that's too low.

      Thanks for your time guys, I'll report back my next set of findings.
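      On the timeout idea: the usual F5 convention is timeout = (3 × interval) + 1, so a sketch of loosening a default 5/16 http monitor might look like this (the monitor name here is hypothetical):

      ```shell
      # Inspect current settings on the monitor (name is an example)
      tmsh list ltm monitor http my_http_monitor
      # Relax from the default interval 5 / timeout 16, keeping timeout = 3*interval + 1
      tmsh modify ltm monitor http my_http_monitor interval 10 timeout 31
      tmsh save sys config
      ```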
  • This morning a similar issue was discussed with a customer (running v11.4.1).

    Monitors on one machine (in a sync-failover device-group) remained in "checking" (blue) state after power-on.

    (Unfortunately no qkview was pulled before he tried to fix the problem.)

    In /var/log/ltm.1.gz I noticed multiple log entries as follows:

    err mcpd[7267]: 01070712:3: Caught configuration exception (0), Can't find monitor rule: 3065 - ltm/validation/MonitorRule.cpp, line 992

    I guess there was an inconsistency in the binary configuration file (the typical symptoms, such as config sync failures or archive save failures, did not apply).

    He fixed the situation by changing the pool configurations (a similar approach to the one described in the initial post), which probably forced mcpd to save the config to the binary configuration file.

    (Please note that after a boot, the configuration is loaded from the binary configuration files by default.)

    Ideally one would open a support case before attempting a workaround.

    If there is no time to get it handled by support, I would try to stop mcpd, delete the binary configuration files, start mcpd and load the configuration from the text configuration files.

    Happy new year to everyone monitoring this thread! 🙂
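    The manual cleanup described above would look roughly like the following. The paths are my assumption (mcpd's binary database normally lives under /var/db/mcpd*), and the supported route is the forceload procedure from SOL13030:

    ```shell
    # Unsupported sketch -- prefer "touch /service/mcpd/forceload && reboot"
    bigstart stop mcpd       # stop the configuration daemon
    rm -f /var/db/mcpd*      # remove the binary configuration database (assumed path)
    bigstart start mcpd      # mcpd restarts with an empty database
    tmsh load sys config     # repopulate it from the text configuration files
    ```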
  • Don't know if this will help, but I had a similar issue with the "can't find monitor rule" log message after trying to do a sync, and Syed Nazir suggested the following:

    Try the following to correct the issue:

    1. Reload the configuration by executing:
      tmsh load sys config
    2. If the issue is not resolved, restart TMM by executing:
      bigstart restart tmm
    3. If the issue is still not resolved, a reboot of the F5 should resolve it.