Forum Discussion

Ovov
Sep 01, 2023

Sync-failover group doesn't sync properly

Hello,

I need some help with a basic Active/Standby setup in which I can't get the two nodes to sync data. This is the error I end up with: "did not receive last sync successfully"

VLANs are configured like this:

VLAN    Tag  Tagged interface
Client  11   1.1
HA      13   1.3
Server  12   1.2

Self IPs and routes are as follows:

 

 

[root@bigip1:Active:Standalone] config # ip route
default via 192.168.159.2 dev mgmt  metric 4096 
10.11.11.0/24 dev Client  proto kernel  scope link  src 10.11.11.111 
10.12.12.0/24 dev Server  proto kernel  scope link  src 10.12.12.121 
10.13.13.0/24 dev HA  proto kernel  scope link  src 10.13.13.131 
127.1.1.0/24 dev tmm  proto kernel  scope link  src 127.1.1.254 
127.7.0.0/16 via 127.1.1.253 dev tmm 
127.20.0.0/16 dev tmm_bp  proto kernel  scope link  src 127.20.0.254 
192.168.159.0/24 dev mgmt  proto kernel  scope link  src 192.168.159.129


[root@bigip2:Active:Standalone] config # ip route
default via 192.168.159.2 dev mgmt  metric 4096 
10.11.11.0/24 dev Client  proto kernel  scope link  src 10.11.11.112 
10.12.12.0/24 dev Server  proto kernel  scope link  src 10.12.12.122 
10.13.13.0/24 dev HA  proto kernel  scope link  src 10.13.13.132 
127.1.1.0/24 dev tmm  proto kernel  scope link  src 127.1.1.254 
127.7.0.0/16 via 127.1.1.253 dev tmm 
127.20.0.0/16 dev tmm_bp  proto kernel  scope link  src 127.20.0.254 
192.168.159.0/24 dev mgmt  proto kernel  scope link  src 192.168.159.130 

 

 

Floating IPs on both devices are set to:
- Client: 10.11.11.110
- Server: 10.12.12.120

Both devices have certificates, their time is in sync via NTP, and they have the same version, 17.1.0.2 Build 0.0.2 (provisioned from the same OVA), and license.

Config sync is set to: HA self IPs
Failover network is: HA + Management
Mirroring: HA + Server

BigIP1 is Online, BigIP2 is Forced Offline before I start building the cluster.

Hosts are connected via VMware Workstation LAN Segments, so no filtering is applied. I double-checked that I can see packets with "tcpdump -nn -i" on each of the Client/Server/HA interfaces, for example when opening an SSH connection from the other host to the IP of the interface being watched.

Then I add device trust. Soon both devices are shown as "In sync" in the device_trust_group.

Then I create a sync-failover group of the two devices with Automatic Incremental Sync and Max sync size = 10240. After this, the sync statuses are as follows:
- device_trust_group = In Sync
- Sync-Failover-Group = Awaiting Initial Sync

If I run "tcpdump -nn -i any tcp" I mostly see packets on HA network for ports 1029 and 4343
If I run "tcpdump -nn -i any udp" I mostly see packets on HA network for port 1026

tmm log

 

 

Sep 1 22:39:29 bigip1.sq.cloud notice mcpd[7261]: 01071436:5: CMI listener established at 10.13.13.131 port 6699
Sep 1 22:39:29 bigip1.sq.cloud err mcpd[7261]: 0107142f:3: Can't connect to CMI peer 10.13.13.132, TMM outbound listener not yet created
Sep 1 22:39:29 bigip1.sq.cloud err mcpd[7261]: 0107142f:3: Can't connect to CMI peer 10.13.13.132, TMM outbound listener not yet created
Sep 1 22:39:32 bigip1.sq.cloud notice mcpd[7261]: 01071451:5: Received CMI hello from /Common/bigip2.sq.cloud
Sep 1 22:39:34 bigip1.sq.cloud notice mcpd[7261]: 01071432:5: CMI peer connection established to 10.13.13.132 port 6699 after 0 retries
Sep 1 22:44:48 bigip1.sq.cloud notice mcpd[7261]: 01071038:5: Master Key updated by user %cmi-mcpd-peer-10.13.13.132
Sep 1 22:52:33 bigip1.sq.cloud notice mcpd[7261]: 01071451:5: Received CMI hello from /Common/bigip2.sq.cloud
Sep 1 22:57:33 bigip1.sq.cloud notice mcpd[7261]: 01071451:5: Received CMI hello from /Common/bigip2.sq.cloud
Sep 1 23:01:09 bigip1.sq.cloud notice mcpd[7261]: 01070430:5: end_transaction message timeout on connection 0xedc5a0c8 (user %cmi-mcpd-peer-10.13.13.132)
Sep 1 23:01:09 bigip1.sq.cloud notice mcpd[7261]: 01070418:5: connection 0xedc5a0c8 (user %cmi-mcpd-peer-10.13.13.132) was closed with active requests
Sep 1 23:01:09 bigip1.sq.cloud notice mcpd[7261]: 0107143c:5: Connection to CMI peer 10.13.13.132 has been removed
Sep 1 23:01:09 bigip1.sq.cloud notice mcpd[7261]: 01071432:5: CMI peer connection established to 10.13.13.132 port 6699 after 0 retries
Sep 1 23:06:11 bigip1.sq.cloud notice mcpd[7261]: 01070430:5: end_transaction message timeout on connection 0xedc5a0c8 (user %cmi-mcpd-peer-10.13.13.132)
Sep 1 23:06:11 bigip1.sq.cloud notice mcpd[7261]: 01070418:5: connection 0xedc5a0c8 (user %cmi-mcpd-peer-10.13.13.132) was closed with active requests
Sep 1 23:06:11 bigip1.sq.cloud notice mcpd[7261]: 0107143c:5: Connection to CMI peer 10.13.13.132 has been removed
Sep 1 23:06:11 bigip1.sq.cloud notice mcpd[7261]: 01071432:5: CMI peer connection established to 10.13.13.132 port 6699 after 0 retries
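
The end_transaction timeouts in this log are spaced about five minutes apart (23:01:09, then 23:06:11). As a rough sketch, the gaps can be measured with a small shell helper. The function name timeout_gaps is made up for illustration, and it assumes the syslog timestamp layout shown above with all entries falling on the same day.

```shell
# Print the gap in seconds between consecutive "end_transaction message
# timeout" (message ID 01070430) lines read from stdin.
# Assumption: lines start with "Mon D HH:MM:SS" and do not cross midnight.
timeout_gaps() {
    awk '/01070430/ {
        split($3, t, ":")                     # $3 is the HH:MM:SS field
        now = t[1] * 3600 + t[2] * 60 + t[3]
        if (seen) print now - prev            # gap since the previous timeout
        prev = now
        seen = 1
    }'
}
```

Feeding it a saved copy of the log (for example `timeout_gaps < saved_log.txt`, where the file name is a placeholder) prints one number per repeat. A steady value may suggest a fixed timer on a sync transaction that never completes, rather than random link drops.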

 

 

Lastly I push the configuration from the device that is in the online state to the Sync-Failover-Group.

Then the sync status is as shown in the screenshot at the beginning of this message. The suggested sync actions (push A or B to group) do not help. I have looked through K63243467 and K13946.

I'd appreciate any suggestions that could resolve this or let me properly push/pull the config. Thank you!

  • Thank you for the hints! I followed some of the actions described in ID882609, though it wasn't exactly my situation. Specifically, one of the devices failed to restart tmm correctly (bigstart restart tmm) and started printing the following message every two seconds: Re-starting mcpd

    I restarted that second device and ran tail -f /var/log/tmm on both hosts.

    First device

     

    Sep 2 13:55:11 bigip2.xx.yyyy notice mcpd[6967]: 01b00004:5: There is an unfinished full sync already being sent for device group /Common/Sync-Failover-Group on connection 0xea1726c8, delaying new sync until current one finishes.

     

    The second device, the one with sync issues, logged end_transaction message timeouts

     

    Sep 2 13:45:10 bigip1.xx.yyyy notice mcpd[7158]: 01070430:5: end_transaction message timeout on connection 0xe685c948 (user %cmi-mcpd-peer-10.13.13.132)
    Sep 2 13:45:10 bigip1.xx.yyyy notice mcpd[7158]: 01070418:5: connection 0xe685c948 (user %cmi-mcpd-peer-10.13.13.132) was closed with active requests
    Sep 2 13:45:10 bigip1.xx.yyyy notice mcpd[7158]: 0107143c:5: Connection to CMI peer 10.13.13.132 has been removed
    Sep 2 13:45:10 bigip1.xx.yyyy notice mcpd[7158]: 01071432:5: CMI peer connection established to 10.13.13.132 port 6699 after 0 retries
    Sep 2 13:50:10 bigip1.xx.yyyy notice mcpd[7158]: 01070430:5: end_transaction message timeout on connection 0xe685c948 (user %cmi-mcpd-peer-10.13.13.132)
    Sep 2 13:50:10 bigip1.xx.yyyy notice mcpd[7158]: 01070418:5: connection 0xe685c948 (user %cmi-mcpd-peer-10.13.13.132) was closed with active requests
    Sep 2 13:50:10 bigip1.xx.yyyy notice mcpd[7158]: 0107143c:5: Connection to CMI peer 10.13.13.132 has been removed
    Sep 2 13:50:10 bigip1.xx.yyyy notice mcpd[7158]: 01071432:5: CMI peer connection established to 10.13.13.132 port 6699 after 0 retries

     

    That error message led me to K25064172 and K10142141. Although I'm not running in AWS, my VMware Workstation VMs use the vmxnet3 driver, so I tried switching to sock as suggested in that KB.

    [root@bigip1:Standby:Not All Devices Synced] config # lspci -nn | grep -i eth
    03:00.0 Ethernet controller [0200]: VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
    0b:00.0 Ethernet controller [0200]: VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
    13:00.0 Ethernet controller [0200]: VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
    1b:00.0 Ethernet controller [0200]: VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
    
    [root@bigip1:Standby:Not All Devices Synced] config # tmctl -d blade tmm/device_probed
    pci_bdf      pseudo_name type      available_drivers     driver_in_use
    ------------ ----------- --------- --------------------- -------------
    0000:03:00.0             F5DEV_PCI xnet, vmxnet3, sock,
    0000:13:00.0 1.2         F5DEV_PCI xnet, vmxnet3, sock,  vmxnet3
    0000:0b:00.0 1.1         F5DEV_PCI xnet, vmxnet3, sock,  vmxnet3
    0000:1b:00.0 1.3         F5DEV_PCI xnet, vmxnet3, sock,  vmxnet3

    The fix for VMware is:

    echo "device driver vendor_dev 15ad:07b0 sock" >> /config/tmm_init.tcl

    After restarting both nodes, I saw the desired "In Sync" status.

    Interestingly, I hit this issue on two separate computers running the same VMware Workstation version. I also installed three different versions of BIG-IP and always got the same result. Even stranger, if I created a Sync-Only group instead of Sync-Failover, there were no issues at all. I suspect it is some compatibility issue.
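
The driver override in the reply above appends a line with a plain echo, so running it twice leaves a duplicate entry. A minimal sketch of a repeatable variant follows. The helper names eth_pci_ids and set_sock_driver are invented for illustration, and the only thing taken from the post is the format of the override line itself.

```shell
# List the unique vendor:device IDs of Ethernet controllers found in
# `lspci -nn` output on stdin, e.g. 15ad:07b0 for VMXNET3.
eth_pci_ids() {
    grep -i 'ethernet controller' |
        grep -oE '\[[0-9a-fA-F]{4}:[0-9a-fA-F]{4}\]' |
        tr -d '[]' | sort -u
}

# Append the sock driver override for one ID to the given file, but only
# when that exact line is not already present, so reruns are harmless.
# Both arguments are supplied by the caller; nothing is hard-coded here.
set_sock_driver() {
    id=$1
    file=$2
    line="device driver vendor_dev $id sock"
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

On a BIG-IP VE this might be used as `lspci -nn | eth_pci_ids` to confirm the ID, then `set_sock_driver 15ad:07b0 /config/tmm_init.tcl`, followed by the restart described in the post.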

  • Check the connectivity between the BIG-IPs via the HA interface IPs

    10.13.13.131 and 10.13.13.132

    Also check the port lockdown settings for the HA self IP.

    Make sure the HA interface is tagged or untagged consistently on both devices.

    Run telnet to port 4353 between the BIG-IPs on the HA self IPs.
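
Where telnet is not installed, the port 4353 check suggested above can be approximated with bash alone. This is a sketch rather than an F5-provided tool: the function name port_open is hypothetical, it relies on bash's /dev/tcp redirection and the coreutils timeout command, and a success only shows that something accepted the TCP connection.

```shell
# Return 0 if a TCP connection to host:port succeeds within the timeout
# (default 3 seconds), non-zero otherwise.
# Needs bash (for /dev/tcp) and the coreutils `timeout` command.
port_open() {
    host=$1
    port=$2
    secs=${3:-3}
    # The inner bash receives host as $0 and port as $1.
    timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

For the addresses in this thread, one would run `port_open 10.13.13.132 4353 && echo reachable` on bigip1 and the mirror-image check toward 10.13.13.131 on bigip2.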

     

    • Ovov

      Thank you for the suggestion.

      I haven't found the issue, however:

      • The port lockdown setting for the HA self IPs is "Allow All" on both devices
      • Both HA interfaces are tagged with the same VLAN 13
      • The connection on port 4353 works fine; I can see packets travelling both ways on both hosts. Checked with: tcpdump -nn -i HA tcp port 4353

         First host

       

      09:39:39.272348 IP 10.13.13.132.4353 > 10.13.13.131.57460: Flags [P.], seq 71446:72894, ack 0, win 9018, options [nop,nop,TS val 1419664648 ecr 1419664639], length 1448 in slot1/tmm1 lis=_cgc_outbound_/Common/bigip2.xx.yyyy_6699 port=1.3 trunk=
      09:39:39.272436 IP 10.13.13.131.57460 > 10.13.13.132.4353: Flags [.], ack 72894, win 65535, options [nop,nop,TS val 1419664647 ecr 1419664648], length 0 out slot1/tmm1 lis=_cgc_outbound_/Common/bigip2.xx.yyyy_6699 port=1.3 trunk=
      09:39:39.283026 IP 10.13.13.132.4353 > 10.13.13.131.57460: Flags [.], seq 72894:74342, ack 0, win 9018, options [nop,nop,TS val 1419664651 ecr 1419664647], length 1448 in slot1/tmm1 lis=_cgc_outbound_/Common/bigip2.xx.yyyy_6699 port=1.3 trunk=
      09:39:39.283110 IP 10.13.13.132.4353 > 10.13.13.131.57460: Flags [P.], seq 74342:74400, ack 0, win 9018, options [nop,nop,TS val 1419664651 ecr 1419664647], length 58 in slot1/tmm1 lis=_cgc_outbound_/Common/bigip2.xx.yyyy_6699 port=1.3 trunk=
      09:39:39.793529 IP 10.13.13.132.25677 > 10.13.13.131.4353: Flags [P.], seq 1:203, ack 1, win 12316, length 202 in slot1/tmm1 lis=_cgc_inbound_/Common/bigip1.xx.yyyy port=1.3 trunk=
      09:39:39.793643 IP 10.13.13.131.4353 > 10.13.13.132.25677: Flags [.], ack 203, win 16189, length 0 out slot1/tmm1 lis=_cgc_inbound_/Common/bigip1.xx.yyyy port=1.3 trunk=
      09:39:39.811879 IP 10.13.13.131.4353 > 10.13.13.132.25677: Flags [P.], seq 1:76, ack 203, win 16189, length 75 out slot1/tmm1 lis=_cgc_inbound_/Common/bigip1.xx.yyyy port=1.3 trunk=
      09:39:39.813850 IP 10.13.13.132.25677 > 10.13.13.131.4353: Flags [.], ack 76, win 12391, length 0 in slot1/tmm1 lis=_cgc_inbound_/Common/bigip1.xx.yyyy port=1.3 trunk=
      09:39:39.824753 IP 10.13.13.131.57460 > 10.13.13.132.4353: Flags [P.], seq 0:202, ack 72894, win 65535, options [nop,nop,TS val 1419665200 ecr 1419664648], length 202 out slot1/tmm1 lis=_cgc_outbound_/Common/bigip2.xx.yyyy_6699 port=1.3 trunk=

       

      Second host

       

      09:41:24.654511 IP 10.13.13.132.4353 > 10.13.13.131.51678: Flags [P.], seq 39154:40551, ack 1, win 6565, options [nop,nop,TS val 1419770029 ecr 1419770026], length 1397 out slot1/tmm1 lis=_cgc_inbound_/Common/bigip2.xx.yyyy port=1.3 trunk=
      09:41:24.658487 IP 10.13.13.131.51678 > 10.13.13.132.4353: Flags [.], ack 40551, win 65535, options [nop,nop,TS val 1419770030 ecr 1419770029], length 0 in slot1/tmm1 lis=_cgc_inbound_/Common/bigip2.xx.yyyy port=1.3 trunk=
      09:41:24.658558 IP 10.13.13.132.4353 > 10.13.13.131.51678: Flags [P.], seq 40551:42079, ack 1, win 6565, options [nop,nop,TS val 1419770033 ecr 1419770030], length 1528 out slot1/tmm1 lis=_cgc_inbound_/Common/bigip2.xx.yyyy port=1.3 trunk=
      09:41:25.189243 IP 10.13.13.132.25677 > 10.13.13.131.4353: Flags [.], ack 3575478456, win 13042, length 0 out slot1/tmm1 lis=_cgc_outbound_/Common/bigip1.xx.yyyy_6699 port=1.3 trunk=
      09:41:25.190545 IP 10.13.13.131.4353 > 10.13.13.132.25677: Flags [.], ack 1, win 18138, length 0 in slot1/tmm1 lis=_cgc_outbound_/Common/bigip1.xx.yyyy_6699 port=1.3 trunk=
      09:41:25.190633 IP 10.13.13.132.25677 > 10.13.13.131.4353: Flags [.], ack 1, win 13042, length 0 out slot1/tmm1 lis=_cgc_outbound_/Common/bigip1.xx.yyyy_6699 port=1.3 trunk=
      09:41:25.191423 IP 10.13.13.131.4353 > 10.13.13.132.25677: Flags [.], ack 1, win 18138, length 0 in slot1/tmm1 lis=_cgc_outbound_/Common/bigip1.xx.yyyy_6699 port=1.3 trunk=
      09:41:25.658648 IP 10.13.13.132.4353 > 10.13.13.131.51678: Flags [.], seq 40551:41999, ack 1, win 6565, options [nop,nop,TS val 1419771033 ecr 1419770030], length 1448 out slot1/tmm1 lis=_cgc_inbound_/Common/bigip2.xx.yyyy port=1.3 trunk=
      09:41:25.764044 IP 10.13.13.131.51678 > 10.13.13.132.4353: Flags [.], ack 41999, win 65535, options [nop,nop,TS val 1419771136 ecr 1419771033], length 0 in slot1/tmm1 lis=_cgc_inbound_/Common/bigip2.xx.yyyy port=1.3 trunk=
      09:41:25.764175 IP 10.13.13.132.4353 > 10.13.13.131.51678: Flags [P.], seq 41999:42079, ack 1, win 6565, options [nop,nop,TS val 1419771139 ecr 1419771136], length 80 out slot1/tmm1 lis=_cgc_inbound_/Common/bigip2.xx.yyyy port=1.3 trunk=
      09:41:25.764206 IP 10.13.13.132.4353 > 10.13.13.131.51678: Flags [P.], seq 42079:43527, ack 1, win 6565, options [nop,nop,TS val 1419771139 ecr 1419771136], length 1448 out slot1/tmm1 lis=_cgc_inbound_/Common/bigip2.xx.yyyy port=1.3 trunk=
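
In the second host's capture, the segment starting at seq 40551 is sent twice about one second apart (first with length 1528, then with length 1448), which looks like a retransmission. The excerpt is too short to be conclusive, but repeats like this can be picked out of longer captures with a small filter. The sketch below is a heuristic with an invented function name, repeated_segments; it only compares the "seq A:B" values that tcpdump prints for each src > dst pair.

```shell
# Read tcpdump text from stdin and report every data segment whose
# starting sequence number was already seen on the same src > dst flow.
# Output: "<first seen time> -> <repeat time> <src> seq <start>"
repeated_segments() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($i == "seq") {
                split($(i + 1), s, ":")       # "40551:42079," -> s[1] = 40551
                key = $3 " " $5 " " s[1]      # src, dst and start of segment
                if (key in first)
                    print first[key] " -> " $1 " " $3 " seq " s[1]
                else
                    first[key] = $1
            }
        }
    }'
}
```

It could be fed from a live capture, for example `tcpdump -nn -i HA tcp port 4353 | repeated_segments`, using the same tcpdump options as in the reply above.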