DNS-based failover between AWS Availability Zones and Split DNS

Working in the AWS public cloud, you have to adapt to a world of guaranteed failure at unpredictable times. Combining LTM for high availability within a single Availability Zone with GTM across Availability Zones and regions provides an architecture that can survive the chaos monkeys.

The following is part of a demo environment that I built for AWS. It highlights a few useful features of LTM/GTM, including:

  • Monitoring LTM services from GTM
  • Using GTM for outbound DNS resolution
  • Creating split DNS records for EIP vs. internal VPC IP
  • Creating topology load balancing records to avoid cross-AZ communication for active-active services

Demo

Here’s what the overall architecture looks like:

We create a few wide IP records to demonstrate different failure scenarios (these DNS records are isolated to my demo environment); a configuration sketch follows the list:

  • prefer-d.f5demo.com
    • Active/Standby from D to E
  • prefer-e.f5demo.com
    • Active/Standby from E to D
  • active-active.f5demo.com
    • Round robin between D and E (unless the request comes from an internal client)
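
For reference, here’s a minimal tmsh sketch of the wide IP configuration. The GTM server and virtual server names (bigip_d:vs_web_d, etc.) are placeholders I’m assuming for this demo, and exact syntax varies by BIG-IP version:

    # Pools pointing at the LTM virtual servers in each AZ
    tmsh create gtm pool pool_east_1d members add { bigip_d:vs_web_d }
    tmsh create gtm pool pool_east_1e members add { bigip_e:vs_web_e }

    # Active/standby: answer from D while it is up, fall back to E
    tmsh create gtm wideip prefer-d.f5demo.com pool-lb-mode global-availability \
        pools add { pool_east_1d { order 0 } pool_east_1e { order 1 } }

    # Same pattern in the other direction
    tmsh create gtm wideip prefer-e.f5demo.com pool-lb-mode global-availability \
        pools add { pool_east_1e { order 0 } pool_east_1d { order 1 } }

    # Active/active: round robin across both AZs
    tmsh create gtm wideip active-active.f5demo.com pool-lb-mode round-robin \
        pools add { pool_east_1d pool_east_1e }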

Here’s what things look like from an external user when everything is healthy:

Here’s another view from the command line using curl:
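
If you want to reproduce that check yourself, a simple loop works (this assumes each backend returns a response body that identifies its AZ):

    # Fetch the active-active record a few times; with round robin the
    # responses should alternate between the D and E backends.
    for i in $(seq 1 6); do
      curl -s http://active-active.f5demo.com/
    done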

Taking a look from an internal user, we can see that the behavior is slightly different. In this case a request from the US-EAST-1D AZ will always stay within the same AZ when communicating with the active-active service, and it accesses the service using the internal IP address rather than the EIP. If you have data-heavy services, this provides some cost savings by avoiding cross-AZ data transfer charges.
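
You can confirm the split-DNS answer from an internal instance with dig (the listener address 10.0.1.53 is a placeholder for your internal DNS cache listener):

    # From an instance in US-EAST-1D: the topology records should return
    # the internal VPC address for the same AZ, not the public EIP.
    dig +short active-active.f5demo.com @10.0.1.53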

Simulating a failure of the D services (stopping the web server), we can see that connections initially fail while the client is still trying to reach the service using the US-EAST-1D IP address:

Once the client refreshes its DNS record (default 30-second TTL), we can see that it is now communicating only with US-EAST-1E services:
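
A quick way to watch the cutover is to poll the record while the failure is in progress (a minimal sketch; the answer should flip once the TTL expires):

    # Print the current answer once per second; after the 30-second TTL
    # expires, the address should change from US-EAST-1D to US-EAST-1E.
    while true; do
      printf '%s ' "$(date +%T)"
      dig +short prefer-d.f5demo.com
      sleep 1
    done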

Setting it up

To create this demo you’ll need:

  • LTM/GTM devices
  • Two AZs
  • Some backend services

Once you have these, you can build out a standard LTM/GTM environment in AWS:

  • Create a DNS cache (required for topology LB).
  • Point your AWS instances at your DNS cache listeners (and make sure the listeners are only accessible to your internal clients!).
  • Build up your split DNS / topology records; a configuration sketch follows this list.
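
Here’s a minimal tmsh sketch of those pieces. The cache, profile, listener, subnet, and pool names are placeholders I’m assuming for this demo, and exact syntax varies by BIG-IP version:

    # DNS cache (required for topology LB), wired in via a DNS profile
    tmsh create ltm dns cache resolver internal_cache
    tmsh create ltm profile dns dns_internal enable-cache yes cache internal_cache

    # Internal-only listener that your AWS instances use as their resolver
    tmsh create gtm listener internal_listener address 10.0.1.53 \
        profiles add { dns_internal }

    # Treat the US-EAST-1D subnet as a named region
    tmsh create gtm region region_east_1d region-members add { subnet 10.0.1.0/24 }

    # Internal D clients prefer the D pool (avoids cross-AZ traffic)
    tmsh create gtm topology ldns: region region_east_1d \
        server: pool pool_east_1d score 100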

You can also extend this example cross-region (with the limitation that your internal IP space would not be accessible across VPCs)!

More Chaos

The example above illustrates a single failure scenario (loss of US-EAST-1D web services). You can imagine several other scenarios that could cause a failure, including but not limited to:

  • Loss of US-EAST-1E services
  • Loss of a single AZ
  • Loss of a single LTM/GTM device

The demo environment is built to survive these and keep on running despite the best efforts of any chaos monkeys.

Published May 22, 2015
