Forum Discussion

daboochmeister
Apr 30, 2019

Fault-tolerant DNS load balancing via LTM - preventing any dropped requests?

(Sorry if this is a re-post. I posted a few weeks back, but that post appears to be messed up in the DevCentral database.)

 

Env: LTMs running 13.1.1.4 (we also have GTMs, also at 13.1.1.4, but I don't believe they're relevant)

 

We are encountering times when our internal DNS responders (Infoblox, btw) will drop individual queries, or simply not respond to them. This happens very infrequently, and a standards-compliant client should simply retry with an extended timeout. But for technical reasons, we have been given a requirement to provide a fault-tolerant DNS interface that will not exhibit this behavior.
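
For reference, the retry-and-extend-timeout behavior mentioned above can be sketched roughly as follows. This is only an illustration of what a well-behaved client does, not how any particular resolver library is implemented; the function names `build_query` and `query_with_retry`, the doubling of the timeout, and the default of three attempts are all assumptions made for the example.

```python
import socket
import struct


def build_query(name, query_id, qtype=1):
    """Encode a minimal DNS query: one question, class IN, recursion desired."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", qtype, 1)
    return header + question


def query_with_retry(server, payload, first_timeout=1.0, attempts=3):
    """Send payload over UDP; on timeout, resend it with a doubled timeout.

    Returns the raw response bytes, or None if every attempt timed out.
    """
    timeout = first_timeout
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for _ in range(attempts):
            sock.settimeout(timeout)
            sock.sendto(payload, server)
            try:
                while True:
                    data, _addr = sock.recvfrom(4096)
                    # Accept only a datagram whose DNS ID matches the query.
                    if data[:2] == payload[:2]:
                        return data
            except socket.timeout:
                timeout *= 2  # extend the wait before the next attempt
    return None
```

A client like this masks a single dropped query at the cost of extra latency. A client that sends once and gives up gets no such protection, which is why the requirement lands on the load balancer.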

 

Is there any way to implement such fault tolerance in an LTM VIP that proxies UDP-based DNS requests?

 

"Action on Service Down" and "Request Queueing" seem to be fundamentally connection-oriented (i.e., TCP oriented), based both on their description and some preliminary testing. "Reselect Tries" sounds like exactly what we need, but seems not to be affecting UDP traffic ...

 

We have DNS Controllers (GTMs) as well ... and use them for GSLB ... but it's not clear to me how they could be leveraged for such fault tolerance for our standard DNS services (moving all our zones from Infoblox to the GTMs as authoritative is ... daunting).

 

Any recommendations, iRules to implement the equivalent of request queueing, etc.? Thank you!

 

2 Replies

  • "We are encountering times when our internal DNS responders (Infoblox, btw) will drop individual queries, or simply not respond to them."

    Do you know why they aren't responding? If the reason is known, you may be able to solve it with an iRule or an existing feature on the F5 device.

    Is the query even reaching the DNS responders? There are cases where queries will probably not reach the responders at all, so the client must be able to retry. Even with a fault-tolerant system, the client needs the ability to retry.

  • Action on Service Down should work. However, it only works when the F5 is able to identify that the pool member is actually down.

     

    Have you tried Reselect as the Action on Service Down, with Datagram LB enabled in the UDP profile? Maybe even lower the idle timeout in the UDP profile? I haven't actually tried it, but I'm offering a few options that I think will help, based on my understanding of how the F5 works.

     

    The above changes will reduce the chances of lost queries but will not completely eliminate them.
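
To make the reselect idea in this thread concrete, here is a small Python sketch of the logic being asked for: send the datagram to one pool member and, if no answer arrives in time, resend the same datagram to the next member. It is not an iRule and does not claim to show how BIG-IP implements Reselect Tries internally. The function name `forward_with_reselect`, its parameters, and the timeout value are hypothetical and chosen only for illustration.

```python
import socket


def forward_with_reselect(payload, members, reselect_tries=2, timeout=0.5):
    """Try pool members in order, moving to the next one when a member is silent.

    members is a list of (host, port) tuples. Returns (response, responder)
    on success, or (None, None) if no member answered.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        # One initial attempt plus up to reselect_tries further members.
        for member in members[: reselect_tries + 1]:
            try:
                sock.sendto(payload, member)
                data, responder = sock.recvfrom(4096)
            except OSError:
                # Timeout or a reported send/receive failure: pick the next member.
                continue
            return data, responder
    return None, None
```

Note that the decision to reselect here is driven by a per-query timeout, not by a health monitor marking the member down. That difference is the gap described above: a member that is up but silently drops one query looks healthy to the monitor.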