Forum Discussion

RoutingLoop_179's avatar
Nov 02, 2013

LTM VE performance while load balancing DNS (DNS profile) and max connections

Hi, I'll be raising a case via my SE as well, but wanted to post on here in case others had some ideas. It's probably going to be a bit of a long post - sorry.

 

We experienced a major service outage yesterday when our LTM VE, which load balances our DNS traffic, seemed to crumble. We haven't yet determined the initial cause, but we believe we have a good idea of the sequence of events, and we have some queries regarding how F5 VE and ESXi interact with regard to UDP traffic.

 

In normal working conditions we average 10K cps of DNS-only traffic - with the LTM DNS profile applied this usually equates to roughly 500 open active connections. I have always seen it stay around this figure unless the LTM stops seeing the DNS responses come back, at which point the number of open connections obviously starts to rocket.
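
(For reference, the cps and open-connection figures above come from the LTM stats; something like the following tmsh commands should show the same counters - the virtual server name below is just a placeholder:)

# connection totals and current connection count for the DNS virtual server (placeholder name)
tmsh show ltm virtual vs_dns_udp
# system-wide client-side connection stats, including new connections per second
tmsh show sys performance connections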

 

What we've now seen, both in load testing and now in live, is that at about 40-50K cps the LTM/ESXi 5 combination seems to start dropping packets; however, the LTM does not show any dropped packets on its interfaces.

 

Regarding the outage - we basically experienced a cascading failure. It seems cps shot up to ~45K (we haven't yet been able to determine whether this was cause or effect), and the consequence was that the LTM started losing packets, most importantly the probes to the DNS servers (both the ICMP and DNS monitors), so the LTM took all our DNS servers offline. That made the situation worse by swamping us with a storm of DNS requests from client retries, so the probes were continually flapping and the LTM could never recover. We recovered the situation by rate limiting the virtual server and removing the probes from the pool so DNS was simply forwarded regardless. My experience in load testing is that the LTM copes well while the number of open connections stays low, but as soon as we hit some threshold the open connection count rockets because we begin to drop traffic.
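
(For anyone needing to make the same change in a hurry, the equivalent from tmsh is roughly the following - the object names and the rate-limit value are placeholders, so adjust them for your own environment:)

# cap new connections per second on the DNS virtual server (placeholder name and value)
tmsh modify ltm virtual vs_dns_udp rate-limit 20000
# remove health monitors from the pool so DNS is forwarded regardless of probe state
tmsh modify ltm pool pool_dns monitor none
# save the running configuration
tmsh save sys config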

 

So does anyone know what the expected cps limit is for a VE on the ESXi hypervisor? Ours is literally only doing DNS load balancing: v11.4, default VE deployment from the template. We only seemed to hit 50% CPU across 2 vCPUs.
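
(One thing I intend to check when I retest is whether a single TMM was saturating while the overall average looked fine - something like the commands below should show that, assuming the v11.x tmsh syntax:)

# per-TMM CPU and memory usage - a single hot TMM can hide behind a ~50% average
tmsh show sys tmm-info
# per-core CPU breakdown
tmsh show sys cpu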

 

I've read a bit about UDP buffer tuning on Linux hosts running under ESXi, but is it possible to tune the UDP buffers on the F5 LTM in the same way, e.g. something like 'sysctl -w net.core.rmem_max=26214400'?
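
(For reference, the generic Linux guest tuning I've seen quoted looks roughly like the lines below; I don't know whether it is safe or even meaningful on a BIG-IP, so treat it as a sketch of the Linux-side approach rather than an F5 recommendation:)

# check the current socket receive buffer limits
sysctl net.core.rmem_max net.core.rmem_default
# raise them (values are the commonly quoted examples, not F5-specific)
sysctl -w net.core.rmem_max=26214400
sysctl -w net.core.rmem_default=26214400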

 

Does anyone know if there are any F5 best practices for ESXi?

 

Sorry for the long post - hopefully someone will read it.

 

3 Replies

  • Thanks for posting; this is very useful information and the root cause will be even more so.

     

    Just so it's clear, what is CPS please? I assume you had a full license? What does the license screen show? How did RAM usage look? Any rate classes etc. configured?

     

    Perhaps use the 'tmsh show ltm profile dns' command to understand the DNS rate limits of your implementation/license?
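
    For example, something along these lines (purely illustrative):

    # per-profile DNS statistics
    tmsh show ltm profile dns
    # licensed modules and limits
    tmsh show sys license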

     

    Interesting that it's v11.4; I'm aware that v11.3 onwards contains some significant performance improvements over prior versions.

     

  • Hi - thanks for replying. Sorry, CPS refers to connections per second as seen in the stats on the LTM - although we are only talking about DNS on the LTM (UDP, hence connectionless), so I suppose a better description would be flows per second, but the LTM shows them as connections. Unfortunately we still have not determined the root cause which started it all off.

     

    Yes, it's a full production license - 1Gbps throughput (although 40K cps of DNS equates to only about 60Mbps of total throughput). There was a very minor spike in RAM usage, but nothing I would have been concerned about. The VE has 2 vCPUs - the busiest CPU was peaking at ~50%. I saw similar performance on v11.2 when load testing (we've only just upgraded to 11.4 for particular features), but at the time I put it down to an LTM VE limitation, as the CPU was reaching nearly 100%; it now seems the LTM VE can use multiple vCPUs, hence an increase in performance, like you say.

     

    Licensed Service Rates        DNS    GTM
    Effective Rate Limit (RPS)      0      0
    Configured Object Count         0      0
    Rate Rejected Requests          0      0

     

    None of the interfaces on the LTM VE show any packet drops. Unfortunately we didn't get to check the network stats on the ESXi host (ESXTOP) before we got service restored. I am going to try to replicate the conditions in the lab with load testing and see whether it's ESXi that is dropping traffic. I have a niggle that it could be related to UDP buffers on ESXi; however, much of the information recommends changing these on the guest VMs, which in this case would mean changing them on the LTM - something I'm quite reluctant to do, as I assume the LTM is already pre-tuned. I'd be interested if anyone has any experience with this.
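
    (My rough test plan is something like the following - the esxtop keystroke is from memory, so double-check it against the VMware docs:)

    # on the BIG-IP: interface counters, including the Drops and Errs columns
    tmsh show net interface
    # on the ESXi host over SSH: start esxtop, then press 'n' for the network view
    esxtop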

     

    http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&externalId=1010071

     

    I've attached a couple of screenshots of our LTM usage stats.

     


  • Thanks.

     

    I'm no virtualisation expert, but I wouldn't have thought ESXi should be buffering anything. Regardless, any changes made using sysctl and the like only affect the HMS (the Linux host management subsystem) and its management interface, not the TMM/traffic interfaces you're concerned with.

     

    A typical high connection rate issue would normally hammer RAM rather than the CPU, so there's probably no point inspecting the timeouts around that.
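
    If you do want to rule memory out, a quick check would be something like:

    # TMM and host memory usage summary
    tmsh show sys memory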

     

    Let us know what support say when they come back to you.