Forum Discussion
nitass
Jun 02, 2014 · Employee
I understand monitors are distributed across blades (i.e. not all blades run the same health check). I suspect a single blade may be overwhelmed, especially if you have that many monitors. To see which blade handles which monitor, you can turn on bigd debug logging and check /var/log/bigdlog. Please make sure you turn the debug off afterwards; otherwise it will eat up your disk space.
root@(VIP2400-R77-S2)(cfg-sync Standalone)(/S1-green-P:Active)(/Common)(tmos) list sys db bigd.debug
sys db bigd.debug {
value "enable"
}
root@(VIP2400-R77-S2)(cfg-sync Standalone)(/S1-green-P:Active)(/Common)(tmos) list sys db bigd.dbgfile
sys db bigd.dbgfile {
value "/var/log/bigdlog"
}
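For reference, a minimal sketch of toggling that debug logging via the same db variables shown in the listing above; treat the exact workflow as an assumption and verify on your own unit before relying on it:

# From tmsh: enable bigd debug logging (writes to the file named by bigd.dbgfile)
modify sys db bigd.debug value enable

# From bash: watch the debug log as monitors fire
tail -f /var/log/bigdlog

# From tmsh: turn it back off when done, so the log does not fill the disk
modify sys db bigd.debug value disable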
- Son_of_Tom_1379 · Jun 02, 2014 · Nimbostratus

Thanks Patrik, I'll give that a go, but I don't see a manual restart of a service post-reboot as a viable solution. If it works around the issue for now, that would suffice; we would just need to make it procedure. The strange part is that the old system never needed this, although the old system used mainly ICMP monitors; there were only a couple of HTTP/TCP monitors in operation.

Nitass, this is a single 1600 system (until I put it into HA, then there will be two), but I will certainly review some verbose logs.

A new data point: the system has been running overnight with email alerting in place, and I've received about 30 alerts of members going down. The old system did not have alerting in place (SNMP or otherwise), so I'm not sure whether this is usual, but we've never had an issue with accessing services.

The next question is: how many monitors is too many? We have about 50 nodes; the nodes that are reporting down use HTTP monitors, and are only ever down for about 10 seconds (or so it reports when the monitor comes back up). I'm tempted to just increase the timeout, as it's at the default 5 / 16, and perhaps that's too low.

Thanks for your time, guys. I'll report back my next set of findings.
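If you do raise the timeout, a common BIG-IP convention is timeout = (3 × interval) + 1, which is where the default 5 / 16 comes from. A minimal sketch of a custom monitor with a longer window, assuming hypothetical names my_http_monitor and my_pool:

# From tmsh: create a custom HTTP monitor based on the built-in http monitor
# (interval 10 / timeout 31 follows the 3n+1 convention; names are hypothetical)
create ltm monitor http my_http_monitor defaults-from http interval 10 timeout 31

# Assign the custom monitor to the pool
modify ltm pool my_pool monitor my_http_monitor

The longer window means a member survives two missed probes instead of being marked down after transient slowness, at the cost of taking longer to detect a genuine outage.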