Sync-failover group doesn't sync properly
Hello,
I need some help with a basic Active/Standby setup where I can't get the two nodes to sync their configuration. This is the error I keep ending up with: "did not receive last sync successfully"
VLANs are configured like this (a rough tmsh equivalent follows the table):
VLAN   | Tag | Tagged interface
Client | 11  | 1.1
HA     | 13  | 1.3
Server | 12  | 1.2
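For completeness, this is roughly how the VLANs above can be created in tmsh (I built them in the GUI, so treat the commands as an equivalent sketch; names, tags and interfaces are exactly those in the table):
# run on each unit
tmsh create net vlan Client tag 11 interfaces add { 1.1 { tagged } }
tmsh create net vlan HA     tag 13 interfaces add { 1.3 { tagged } }
tmsh create net vlan Server tag 12 interfaces add { 1.2 { tagged } }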
Self IPs and routes are as follows:
[root@bigip1:Active:Standalone] config # ip route
default via 192.168.159.2 dev mgmt metric 4096
10.11.11.0/24 dev Client proto kernel scope link src 10.11.11.111
10.12.12.0/24 dev Server proto kernel scope link src 10.12.12.121
10.13.13.0/24 dev HA proto kernel scope link src 10.13.13.131
127.1.1.0/24 dev tmm proto kernel scope link src 127.1.1.254
127.7.0.0/16 via 127.1.1.253 dev tmm
127.20.0.0/16 dev tmm_bp proto kernel scope link src 127.20.0.254
192.168.159.0/24 dev mgmt proto kernel scope link src 192.168.159.129
[root@bigip2:Active:Standalone] config # ip route
default via 192.168.159.2 dev mgmt metric 4096
10.11.11.0/24 dev Client proto kernel scope link src 10.11.11.112
10.12.12.0/24 dev Server proto kernel scope link src 10.12.12.122
10.13.13.0/24 dev HA proto kernel scope link src 10.13.13.132
127.1.1.0/24 dev tmm proto kernel scope link src 127.1.1.254
127.7.0.0/16 via 127.1.1.253 dev tmm
127.20.0.0/16 dev tmm_bp proto kernel scope link src 127.20.0.254
192.168.159.0/24 dev mgmt proto kernel scope link src 192.168.159.130
Floating IPs on both devices are set to the following (a tmsh sketch follows the list):
- Client: 10.11.11.110
- Server: 10.12.12.120
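The per-unit and floating self IPs were created along these lines; the sketch below shows only bigip1 and the Client VLAN, and the object names are just placeholders:
# non-floating self IP for this unit (stays in traffic-group-local-only by default)
tmsh create net self client-self address 10.11.11.111/24 vlan Client allow-service default
# floating self IP shared by the pair
tmsh create net self client-float address 10.11.11.110/24 vlan Client traffic-group traffic-group-1 allow-service default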
Both devices have device certificates, their time is in sync via NTP, and they have the same version (17.1.0.2 Build 0.0.2, provisioned from the same OVA) and license.
Config sync is set to: HA self IPs
Failover network is: HA + Management
Mirroring: HA + Server (tmsh equivalents sketched below)
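On bigip1 the corresponding cm device settings look roughly like this (a sketch; bigip2 uses its own .132/.122/.130 addresses):
# config sync over the HA self IP
tmsh modify cm device bigip1.sq.cloud configsync-ip 10.13.13.131
# network failover over HA + management
tmsh modify cm device bigip1.sq.cloud unicast-address { { ip 10.13.13.131 port 1026 } { ip 192.168.159.129 port 1026 } }
# connection mirroring over HA, secondary over Server
tmsh modify cm device bigip1.sq.cloud mirror-ip 10.13.13.131 mirror-secondary-ip 10.12.12.121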
BigIP1 is Online and BigIP2 is Forced Offline before I start building the cluster.
The hosts are connected via VMware Workstation LAN segments, so no filtering is applied. I double-checked that I can see packets in "tcpdump -nn -i <interface>" on each of the Client/Server/HA interfaces when, for example, establishing an SSH connection from the other host to the self IP of the interface being watched.
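For example, a check like this on bigip1 while opening SSH from bigip2 over the HA segment (on BIG-IP the VLAN name is a valid tcpdump interface):
# on bigip1: watch the HA VLAN while bigip2 (10.13.13.132) connects to 10.13.13.131
tcpdump -nn -i HA host 10.13.13.132 and port 22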
Then I add the device trust. Soon both devices are shown as "In Sync" in device_trust_group.
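Adding the trust can also be done and verified from tmsh; the modify syntax below is from memory, so treat it as a sketch only:
# on bigip1, add bigip2 as a trusted peer (credentials are the peer's admin account)
tmsh modify cm trust-domain Root ca-devices add { 10.13.13.132 } name bigip2.sq.cloud username admin password <peer-admin-password>
# verify the trust and the overall sync state
tmsh list cm trust-domain Root ca-devices
tmsh show cm sync-status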
Then I create a sync-failover device group containing the two devices, with Automatic with Incremental Sync and Max Incremental Sync Size = 10240 (see the tmsh sketch after the status list). After this, the sync statuses are as follows:
- device_trust_group = In Sync
- Sync-Failover-Group = Awaiting Initial Sync
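In tmsh terms the device group creation looks roughly like this (a sketch; I believe the size attribute name below is what the GUI field maps to, so double-check it):
# sync-failover group with automatic incremental sync
tmsh create cm device-group Sync-Failover-Group devices add { bigip1.sq.cloud bigip2.sq.cloud } type sync-failover auto-sync enabled
# GUI "Maximum Incremental Sync Size (KB)" = 10240
tmsh modify cm device-group Sync-Failover-Group incremental-config-sync-size-max 10240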
If I run "tcpdump -nn -i any tcp" I mostly see packets on HA network for ports 1029 and 4343
If I run "tcpdump -nn -i any udp" I mostly see packets on HA network for port 1026
tmm log
Sep 1 22:39:29 bigip1.sq.cloud notice mcpd[7261]: 01071436:5: CMI listener established at 10.13.13.131 port 6699
Sep 1 22:39:29 bigip1.sq.cloud err mcpd[7261]: 0107142f:3: Can't connect to CMI peer 10.13.13.132, TMM outbound listener not yet created
Sep 1 22:39:29 bigip1.sq.cloud err mcpd[7261]: 0107142f:3: Can't connect to CMI peer 10.13.13.132, TMM outbound listener not yet created
Sep 1 22:39:32 bigip1.sq.cloud notice mcpd[7261]: 01071451:5: Received CMI hello from /Common/bigip2.sq.cloud
Sep 1 22:39:34 bigip1.sq.cloud notice mcpd[7261]: 01071432:5: CMI peer connection established to 10.13.13.132 port 6699 after 0 retries
Sep 1 22:44:48 bigip1.sq.cloud notice mcpd[7261]: 01071038:5: Master Key updated by user %cmi-mcpd-peer-10.13.13.132
Sep 1 22:52:33 bigip1.sq.cloud notice mcpd[7261]: 01071451:5: Received CMI hello from /Common/bigip2.sq.cloud
Sep 1 22:57:33 bigip1.sq.cloud notice mcpd[7261]: 01071451:5: Received CMI hello from /Common/bigip2.sq.cloud
Sep 1 23:01:09 bigip1.sq.cloud notice mcpd[7261]: 01070430:5: end_transaction message timeout on connection 0xedc5a0c8 (user %cmi-mcpd-peer-10.13.13.132)
Sep 1 23:01:09 bigip1.sq.cloud notice mcpd[7261]: 01070418:5: connection 0xedc5a0c8 (user %cmi-mcpd-peer-10.13.13.132) was closed with active requests
Sep 1 23:01:09 bigip1.sq.cloud notice mcpd[7261]: 0107143c:5: Connection to CMI peer 10.13.13.132 has been removed
Sep 1 23:01:09 bigip1.sq.cloud notice mcpd[7261]: 01071432:5: CMI peer connection established to 10.13.13.132 port 6699 after 0 retries
Sep 1 23:06:11 bigip1.sq.cloud notice mcpd[7261]: 01070430:5: end_transaction message timeout on connection 0xedc5a0c8 (user %cmi-mcpd-peer-10.13.13.132)
Sep 1 23:06:11 bigip1.sq.cloud notice mcpd[7261]: 01070418:5: connection 0xedc5a0c8 (user %cmi-mcpd-peer-10.13.13.132) was closed with active requests
Sep 1 23:06:11 bigip1.sq.cloud notice mcpd[7261]: 0107143c:5: Connection to CMI peer 10.13.13.132 has been removed
Sep 1 23:06:11 bigip1.sq.cloud notice mcpd[7261]: 01071432:5: CMI peer connection established to 10.13.13.132 port 6699 after 0 retries
Lastly, I push the configuration from the device that is in the Online state to the Sync-Failover-Group.
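In tmsh that push corresponds to:
# push this device's config to the group (force-full-load-push can be added for a full sync)
tmsh run cm config-sync to-group Sync-Failover-Group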
Then the sync status is the one quoted at the beginning of this message ("did not receive last sync successfully"). The suggested sync actions (push either device to the group) do not help. I have already looked through K63243467 and K13946.
I'd appreciate any suggestions that could resolve this or properly push/pull the config. Thank you!
Thank you for the hints! I followed some of the actions described in ID882609, though it wasn't exactly the situation I had. Specifically, one of the devices failed to restart tmm correctly after "bigstart restart tmm"; it started spawning the message "Re-starting mcpd" every two seconds.
I restarted that second device and ran "tail -f /var/log/tmm" on both hosts.
First device
Sep 2 13:55:11 bigip2.xx.yyyy notice mcpd[6967]: 01b00004:5: There is an unfinished full sync already being sent for device group /Common/Sync-Failover-Group on connection 0xea1726c8, delaying new sync until current one finishes.
The second device, the one with the sync issues, kept logging end_transaction message timeouts:
Sep 2 13:45:10 bigip1.xx.yyyy notice mcpd[7158]: 01070430:5: end_transaction message timeout on connection 0xe685c948 (user %cmi-mcpd-peer-10.13.13.132)
Sep 2 13:45:10 bigip1.xx.yyyy notice mcpd[7158]: 01070418:5: connection 0xe685c948 (user %cmi-mcpd-peer-10.13.13.132) was closed with active requests
Sep 2 13:45:10 bigip1.xx.yyyy notice mcpd[7158]: 0107143c:5: Connection to CMI peer 10.13.13.132 has been removed
Sep 2 13:45:10 bigip1.xx.yyyy notice mcpd[7158]: 01071432:5: CMI peer connection established to 10.13.13.132 port 6699 after 0 retries
Sep 2 13:50:10 bigip1.xx.yyyy notice mcpd[7158]: 01070430:5: end_transaction message timeout on connection 0xe685c948 (user %cmi-mcpd-peer-10.13.13.132)
Sep 2 13:50:10 bigip1.xx.yyyy notice mcpd[7158]: 01070418:5: connection 0xe685c948 (user %cmi-mcpd-peer-10.13.13.132) was closed with active requests
Sep 2 13:50:10 bigip1.xx.yyyy notice mcpd[7158]: 0107143c:5: Connection to CMI peer 10.13.13.132 has been removed
Sep 2 13:50:10 bigip1.xx.yyyy notice mcpd[7158]: 01071432:5: CMI peer connection established to 10.13.13.132 port 6699 after 0 retries
That error message led me to K25064172 and K10142141. Even though I'm not running in AWS, my VMware Workstation VMs use the vmxnet3 driver, so I tried switching to the sock driver as suggested in those articles.
[root@bigip1:Standby:Not All Devices Synced] config # lspci -nn | grep -i eth
03:00.0 Ethernet controller [0200]: VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
0b:00.0 Ethernet controller [0200]: VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
13:00.0 Ethernet controller [0200]: VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
1b:00.0 Ethernet controller [0200]: VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
[root@bigip1:Standby:Not All Devices Synced] config # tmctl -d blade tmm/device_probed
pci_bdf      pseudo_name type      available_drivers    driver_in_use
------------ ----------- --------- -------------------- -------------
0000:03:00.0             F5DEV_PCI xnet, vmxnet3, sock,
0000:13:00.0 1.2         F5DEV_PCI xnet, vmxnet3, sock, vmxnet3
0000:0b:00.0 1.1         F5DEV_PCI xnet, vmxnet3, sock, vmxnet3
0000:1b:00.0 1.3         F5DEV_PCI xnet, vmxnet3, sock, vmxnet3
The fix for VMware is:
echo "device driver vendor_dev 15ad:07b0 sock" >> /config/tmm_init.tcl
After I restarted both nodes, I finally saw the desired "In Sync" status.
Interestingly enough, I hit this issue on two separate computers running the same VMware Workstation version. I also reinstalled three different BIG-IP versions and always got the same result. Another odd thing is that if I created a Sync-Only group instead of a Sync-Failover group, there were no issues at all. I think it must be some compatibility issue.