Forum Discussion

daboochmeister
Aug 22, 2018

GTM Oracle monitor succeeding, but server not marked as up (no reply from big3d)

Env: GTM 11.5.2

We have a wide IP based on pools that use an Oracle monitor. That monitor performs a SQL query to check the read/write status of the database, and if that status is good, marks the Oracle server accessed as "up".
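
For anyone following along, the check this monitor performs can be sketched roughly as below. This is only an illustration in Python, not F5's actual implementation: `monitor_is_up` and its parameters are hypothetical names that mirror the monitor's recv, recv-row and recv-column settings, and the substring match is an assumption.

```python
def monitor_is_up(rows, recv, recv_row=1, recv_column=1):
    """Return True if the cell at (recv_row, recv_column) contains recv.

    rows is a list of tuples, as a database driver would return them.
    Row and column numbers are 1-based, like the monitor settings.
    Substring matching is an assumption made for this sketch.
    """
    if recv_row < 1 or recv_column < 1:
        return False
    if recv_row > len(rows):
        return False
    row = rows[recv_row - 1]
    if recv_column > len(row):
        return False
    return recv in str(row[recv_column - 1])


# Simulated results of: select open_mode from v$database
print(monitor_is_up([("READ WRITE",)], "READ WRITE"))  # True
print(monitor_is_up([("READ ONLY",)], "READ WRITE"))   # False
```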

This monitor is failing for 3 of our 6 Oracle IPs. The monitor itself appears to succeed: with debugging turned on, I see success in the monitor's own debug log as well as in DBDaemon-0.log (copies of the relevant lines from those logs are below). But the gtm log reports that there was no reply from big3d, and so the 3 servers are being marked as down. The "no reply" error even occurs on the same GTM, which reports the error for its own self IP. iqdump shows no problem connecting to big3d; everything looks nominal.

These same 3 Oracle servers/IPs are used in other pools with other Oracle monitors, and those are all working correctly.

None of the "known" causes that I could find for a "no reply from big3d" error seems to apply - e.g., there are no multiple traffic groups involved (in fact, there's no LTM virtual server involved) and no iQuery issues (neither connectivity nor certificates nor anything else). All other big3d-related activities work fine: the rest of our wide IPs are fine, the LTM VIP status is registering correctly on the GTMs, etc.

Any thoughts?

Here are lines from various log files. First, the monitor debug log (I'm fuzzing our internal IPs and the password, but they are correct):

********** Debugging session beginning at: Wed Aug 22 03:59:06 2018

Arguments 1-2:
::ffff:172.16.XX.XX
1521

Environment variables:
COUNT=0
DATABASE=(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=%node_ip%)(PORT=%node_port%))(CONNECT_DATA=(SERVICE_NAME=app_odb)))
DEBUG=yes
MON_TMPL_NAME=/Common/MyAccount-ODB_prod_monitor
NODE_IP=::ffff:172.16.XX.XX
NODE_PORT=1521
PASSWORD=XXXXX
RECVCOLUMN=1
RECVROW=1
RECV_I=READ WRITE
SEND=select open_mode from v$database
USERNAME=XXXXX
--
TMOS_RD: 0 (0)
Daemon port: 1521
count='0' converts to '0'
Command-line PID filename: /var/run/ORACLE__Common_MyAccount-ODB_prod_monitor_::ffff:172.16.XX.XX-0_1521.pid
PID file /var/run/DBDaemon-0.pid exists. Checking for correctness of PID.
DBDaemon on port 1521 says its PID is 9115.
PID matches
Asking daemon to ping remote database.
Recvd: 'oracle.jdbc.OracleDriver
'
Recvd: 'jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=172.16.XX.XX)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=app_odb)))
'
Recvd: 'XXXXX
'
Recvd: 'XXXXX
'
Recvd: 'select open_mode from v$database
'
Recvd: 'READ WRITE
'
Recvd: '1
'
Recvd: '1
'
Recvd: 'jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=172.16.121.35)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=app_odb)))(Thread-34756097):
'
Recvd: '!Up!
'
up
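
The debug output above shows NODE_IP arriving as an IPv4-mapped IPv6 address (::ffff:...) while the JDBC URL carries the plain IPv4 address. A rough Python sketch of that substitution follows, purely as an illustration of what the log suggests and not the monitor script's real code. `fill_descriptor` is a hypothetical helper, and 172.16.0.10 is a placeholder address, not one of ours.

```python
import ipaddress


def fill_descriptor(template, node_ip, node_port):
    """Substitute %node_ip% and %node_port% into a connect descriptor.

    An IPv4-mapped IPv6 address (::ffff:a.b.c.d) is reduced to its
    IPv4 form first, matching what the debug log appears to show.
    """
    addr = ipaddress.ip_address(node_ip)
    if addr.version == 6 and addr.ipv4_mapped is not None:
        addr = addr.ipv4_mapped
    return (template.replace("%node_ip%", str(addr))
                    .replace("%node_port%", str(node_port)))


template = ("(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=%node_ip%)"
            "(PORT=%node_port%))(CONNECT_DATA=(SERVICE_NAME=app_odb)))")

# Placeholder address, used only for the example.
url = "jdbc:oracle:thin:@" + fill_descriptor(template, "::ffff:172.16.0.10", 1521)
print(url)
```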

Now an extract from the DBDaemon log:

2018-08-22 03:59:50.992: (Thread-34756183): Count: 0
2018-08-22 03:59:51.02: jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=172.16.121.36)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=app_odb)))(Thread-34756183): DB connect succeeded.
2018-08-22 03:59:51.02: jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=172.16.121.36)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=app_odb)))(Thread-34756183): Query message: select open_mode from v$database
2018-08-22 03:59:51.06: jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=172.16.121.36)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=app_odb)))(Thread-34756183): Send Query success
2018-08-22 03:59:51.06: jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=172.16.121.36)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=app_odb)))(Thread-34756183): Response from server: OPEN_MODE: 'READ WRITE'
2018-08-22 03:59:51.06: jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=172.16.121.36)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=app_odb)))(Thread-34756183): Checking for recv string: READ WRITE
2018-08-22 03:59:51.07: jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=172.16.121.36)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=app_odb)))(Thread-34756183): Analyze Response success
2018-08-22 03:59:51.07: jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=172.16.121.36)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=app_odb)))(Thread-34756183): Current count: 199  Count : 0

And from the gtm log:

Aug 22 01:28:17 gc-www-ns-01 alert gtmd[18606]: 011ae0f2:1: Monitor instance /Common/MyAccount-ODB_prod_monitor 172.16.XX.XX:1521 UNKNOWN_MONITOR_STATE --> DOWN from /Common/gc-www-ns-01 (no reply from big3d /Common/gc-www-ns-01(172.23.XX.XX): timed out)
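
To see which monitor instances are affected and why, lines like the one above can be pulled apart with a small script. The Python below is a minimal sketch: the regular expression is written only against the single log line shown here, so other gtmd message formats may not match, and `parse_transition` is a hypothetical helper name.

```python
import re

# Pattern derived from the one gtmd line quoted above; an assumption,
# not an official description of the log format.
PATTERN = re.compile(
    r"Monitor instance (?P<monitor>\S+) (?P<target>\S+) "
    r"(?P<old>\w+) --> (?P<new>\w+) from (?P<source>\S+) "
    r"\((?P<reason>.*)\)$"
)


def parse_transition(line):
    """Return the fields of a monitor state transition, or None."""
    match = PATTERN.search(line.strip())
    return match.groupdict() if match else None


sample = (
    "Aug 22 01:28:17 gc-www-ns-01 alert gtmd[18606]: 011ae0f2:1: "
    "Monitor instance /Common/MyAccount-ODB_prod_monitor 172.16.XX.XX:1521 "
    "UNKNOWN_MONITOR_STATE --> DOWN from /Common/gc-www-ns-01 "
    "(no reply from big3d /Common/gc-www-ns-01(172.23.XX.XX): timed out)"
)

info = parse_transition(sample)
print(info["target"], info["old"], "->", info["new"])
print(info["reason"])
```

Run over the whole gtm log, this makes it easy to group transitions by target and reason and confirm that only these 3 IPs hit the big3d timeout.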

2 Replies

  • Can you help us understand your setup first?

    tmsh list gtm wideip 

    tmsh list gtm pool 

    Is the GTM server a generic host or a BIG-IP LTM?

    If it's a generic host, please share the GTM monitor that is applied to the GTM VS.

    At this point, from your GTM logs (Monitor instance /Common/MyAccount-ODB_prod_monitor), I can assume there is no LTM involved.

  • Sure. The pool is:

    gtm pool MyAccount-ODB-GC_pool {
        alternate-mode none
        fallback-mode none
        max-address-returned 3
        members {
            GC-Oracle-Prod-SCAN:GC-Oracle-Prod-SCAN-IP1 {
                order 0
            }
            GC-Oracle-Prod-SCAN:GC-Oracle-Prod-SCAN-IP2 {
                order 1
            }
            GC-Oracle-Prod-SCAN:GC-Oracle-Prod-SCAN-IP3 {
                order 2
            }
        }
        monitor MyAccount-ODB_prod_monitor
    }
    

    The server definition is:

    gtm server GC-Oracle-Prod-SCAN {
        addresses {
            172.16.XX.YY {
                device-name GC-Oracle-Prod-SCAN
            }
        }
        datacenter "/Common/Primary Data Center"
        product generic-host
        virtual-servers {
            GC-Oracle-Prod-SCAN-IP1 {
                destination 172.16.XX.YY:ncube-lm
            }
            GC-Oracle-Prod-SCAN-IP2 {
                destination 172.16.XX.ZZ:ncube-lm
            }
            GC-Oracle-Prod-SCAN-IP3 {
                destination 172.16.XX.AA:ncube-lm
            }
        }
    }
    

    The monitor:

    gtm monitor oracle MyAccount-ODB_prod_monitor {
        count 0
        database (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=%node_ip%)(PORT=%node_port%))(CONNECT_DATA=(SERVICE_NAME=app_odb)))
        debug no
        defaults-from oracle
        description "Oracle monitor of read-write status of ODB database"
        destination *:*
        interval 30
        password $M$Vw$9eeRFG0XmGiaevM5ubIKAw==
        probe-timeout 5
        recv "READ WRITE"
        recv-column 1
        recv-row 1
        send "select open_mode from v$database"
        timeout 91
        username F5CHECK
    }
    

    I didn't include the wide IP, as it's downstream of the issue - the issue is that the pool above is never marked as up, even though it should be.

    That same server object (with its 3 virtual servers) works fine in other similar monitors. That same monitor works fine for a different server object (with 3 different virtual servers). And, the monitor works fine on THESE IPs, per the monitor debug log (which shows the correct results being returned from Oracle, and the servers being marked as !Up!, see my post above); the DBDaemon log also shows the monitor correctly working, and returning "Up". The only breakdown appears to be getting the status from big3d (which times out). Note there are no other timeouts on any other statuses (statii?) being reported by big3d.

    Really stumped. I may restart big3d as a precaution - is that safe to do, or will it potentially interrupt GSLB resolution of hostnames? (we have 4 standalone GTMs in this particular iQuery mesh).

    thx