Forum Discussion

Brian_Dean
Nimbostratus
Nov 03, 2015

Pool Member Maintenance Process

I got a call today from a developer letting me know he would like to perform some maintenance on a pool of about 5 servers. So I spent some time reading over the documentation regarding pool member and node statuses, set up a packet capture on the unit active for the traffic group, and began trying combinations of statuses to confirm their behavior. My goal is to be able to "flip a flag" on a pool member or node and have TMOS re-associate existing connections with a different pool member without exception.

 

Environment

 

  • 3900 BIG-IP 11.6 HF5 Active/Standby
  • One virtual named http_vs, one pool named http_pool, five nodes named by IP address all members of http_pool.
  • The http_vs virtual has a default persistence profile of cookie with no fallback.

From here on out I'll talk about just one of the http_pool members, 192.168.0.10.

 

I had the following tcpdump running on the active unit: tcpdump -s 0 -nni 0.0 host 192.168.0.10 and tcp port 80. I repeatedly reviewed the persistence table with tmsh show ltm persistence persist-records node-addr 192.168.0.10 node-port 80. As expected with cookie persistence, I never saw any records.
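
For reference, here is the same monitoring setup collected into one place as a rough sketch; the capture filename is a placeholder of my own, not part of the original setup:

    # Capture full packets to/from the member on interface 0.0 (all VLANs on a BIG-IP)
    tcpdump -s 0 -nni 0.0 -w /var/tmp/member-192.168.0.10.pcap host 192.168.0.10 and tcp port 80 &

    # Check for persistence records pointing at this member. With cookie
    # persistence the table stays empty, since the cookie lives client-side.
    tmsh show ltm persistence persist-records node-addr 192.168.0.10 node-port 80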

 

I executed each of the following commands and monitored the behavior:

 

  • tmsh modify ltm pool http_pool members modify { 192.168.0.10:80 { state user-down } }
  • tmsh modify ltm pool http_pool members modify { 192.168.0.10:80 { session user-disabled } }
  • tmsh modify ltm pool http_pool members modify { 192.168.0.10:80 { session user-disabled state user-down } }
  • tmsh modify ltm pool http_pool members delete { 192.168.0.10:80 }
  • tmsh modify ltm node 192.168.0.10 { state user-down }
  • tmsh modify ltm node 192.168.0.10 { session user-disabled }
  • tmsh modify ltm node 192.168.0.10 { session user-disabled state user-down }
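
In each case, the resulting administrative and monitor state can be confirmed with something like the following (a rough check; exact output and member-filtering syntax may vary by TMOS version):

    # Status of the specific pool member
    tmsh show ltm pool http_pool members { 192.168.0.10:80 }

    # Status of the underlying node
    tmsh show ltm node 192.168.0.10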

The pool member had about 400 active connections at the time each command was executed, and in each case it took about 15 minutes for all connections to expire. I verified this by executing the following about every 30 seconds: tmsh show sys connection ss-server-addr 192.168.0.10 ss-server-port 80. Most of this behavior is expected based on the documentation, which is good. I was, however, surprised that removing the member from the pool didn't force TMOS to make a new load-balancing decision.
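
A minimal watch loop for the drain, assuming one line of show output per connection (the grep pattern is an approximation and may need adjusting to your version's output format):

    # Poll the member's connection count every 30 seconds until it drains.
    while true; do
        COUNT=$(tmsh show sys connection ss-server-addr 192.168.0.10 ss-server-port 80 | grep -c 192.168.0.10:80)
        echo "$(date)  ${COUNT} connections remaining to 192.168.0.10:80"
        [ "${COUNT}" -eq 0 ] && break
        sleep 30
    done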

 

I thought about deleting the existing connections after removing the member from the pool with tmsh delete sys connection ss-server-addr 192.168.0.10 ss-server-port 80, but I don't think this is going to be transparent to the client.

 

I looked around for an already-vetted process, but material on the topic is pretty sparse. Ultimately, what I'm after is a way to administratively reproduce what happens when a monitor marks a pool member down: I would like existing connections to be re-associated with a different pool member.

 

I'll keep looking, but I was curious what others are doing in cases like this. I can also see this being beneficial if a pool member is still online but experiencing some layer 7 problem not detected by a monitor, i.e., some way to stop all traffic to a pool member without the client being aware of the problem. Or am I asking for the holy grail?

 

Thanks in advance for giving my situation some thought.

 

2 Replies

  • In this scenario there is actually no way of re-associating existing connections with another pool member; the only option is to drain connections with the disabled state. There is a setting on the pool called Action On Service Down that tells the BIG-IP what to do with existing connections when a pool member becomes unavailable (either because a monitor marks it down or because it is set to forced offline). One of its options is Reselect, but that only works in a very limited number of scenarios, none of which apply here:

     

    https://support.f5.com/kb/en-us/solutions/public/15000/000/sol15095.html?sr=49120474

     

    So yeah, planning ahead is key in a maintenance situation: start by disabling the member well ahead of time so that connections can drain. This is what most of my customers tend to do, unless they are in a hurry, in which case they set Action On Service Down to Reject and then force the member offline (a rough tmsh sketch of both approaches follows at the end of this reply). Of course this is not transparent to the users, since their connections will be reset by the BIG-IP, but it's better than having the client wait for timeouts when the server is completely unresponsive.

     

    What you noted about removing a member from the pool is expected behavior - the connection table is the supreme ruler of the BIG-IP: if a connection exists in the connection table, it is honored regardless of what the configuration looks like.
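
    A minimal tmsh sketch of both approaches, using the pool and member names from the original post. The service-down-action property name and the mapping of the GUI "Reject" option to "reset" in tmsh are my recollection for 11.x, so verify them on your version before relying on this:

        # Approach 1: graceful drain - disable the member well before the window.
        # Existing and persisted connections keep flowing, but no new connections
        # are load balanced to it.
        tmsh modify ltm pool http_pool members modify { 192.168.0.10:80 { session user-disabled } }

        # ...once the connection count for 192.168.0.10:80 reaches zero,
        # take the member fully down for the maintenance work.
        tmsh modify ltm pool http_pool members modify { 192.168.0.10:80 { state user-down } }

        # Approach 2: fast cutover - reset existing connections instead of draining.
        # Clients see their connections reset, so this is not transparent.
        tmsh modify ltm pool http_pool service-down-action reset
        tmsh modify ltm pool http_pool members modify { 192.168.0.10:80 { session user-disabled state user-down } }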