The Skeleton in the Global Server Load Balancing Closet
Like urban legends, every few years this one rears its head and makes its rounds. It is certainly true that everyone who has an e-mail address has received some message claiming that something bad is going on, or someone said something they didn’t, or that someone influential wrote a letter that turns out to be wishful thinking. I often point the propagators of such urban legends to Snopes because the folks who run Snopes are dedicated to hunting down the truth regarding these tidbits that make their way to the status of urban legend. It would be nice, wouldn’t it, if there were such a thing for technical issues; a technology-focused Snopes, if you will. But there isn’t, and every few years a technical urban legend rears its head again and sends some folks into a panic. And we, as an industry, have to respond and provide some answers. This is certainly the case with Global Server Load Balancing (GSLB) and Round-Robin DNS (RR DNS).

CLAIM: DNS Based Global Server Load Balancing (GSLB) Doesn’t Work
STATUS: Inaccurate

ORIGINS
The origin of this skeleton in GSLB’s closet is a 2004 paper written by Pete Tenereillo, “Why DNS Based Global Server Load Balancing (GSLB) Doesn’t Work.” It is important to note that at the time of the writing Pete was not only very experienced with these technologies but was also considered an industry expert. Pete was intimately involved in the early days of load balancing and global server load balancing, being an instrumental part of projects at both Cisco and Alteon (Nortel, now Radware). So his perspective on the subject certainly came from experience and even “inside” knowledge about the way in which GSLB worked and was actually deployed in the “real” world.

The premise upon which Pete bases his conclusion, i.e. GSLB doesn’t work, is that the features and functionality over and above those offered by standard DNS servers are inherently flawed: they sound good in theory, but don’t work.
His ultimate conclusion is that the only way to implement true global high-availability is to use multiple A records, which are already a standard function of DNS servers.

“DNS based Global Server Load Balancing (GSLB) solutions exist to provide features and functionality over and above what is available in standard DNS servers. This paper explains the pitfalls in using such features for the most common Internet services, including HTTP, HTTPS, FTP, streaming media, and any other application or protocol that relies on browser based client access.
…
An Axiom
The only way to achieve high-availability GSLB for browser based clients is to include the use of multiple A records”

It would be easy to dismiss Pete’s concerns based on the fact that his commentary is nearly seven years old at this point. But the basic principles upon which DNS and GSLB are implemented today are still based on the same theories with which Pete takes issue. What Pete missed in 2004 and what remains missing from this treatise is twofold. First, GSLB implementations at that time, and today, do in fact support returning multiple A records, a.k.a. Round-Robin DNS. Second, the features and functionality provided over and above standard DNS do, in fact, address the issues raised and these features and functionality have, in fact, evolved over the past seven years.

Is returning multiple A records to LDNS the only way of achieving High Availability? How is advanced health checking an important component of providing High Availability?

Many people misuse the term ‘high availability’ by indicating that it only equates to when a site is either up or down. This type of binary thinking is misguided and is purely technical in focus. Our customers have all indicated that high availability also includes performance of the application or site. The reason is that by business definitions if a site or application is too slow it is unavailable.
Poor performance directly impacts productivity, one of the key performance indicators used to measure the effectiveness of business employees and processes. As a result, high availability can be achieved in a number of different ways. Intelligent GSLB solutions, through advanced monitoring and statistical correlation, take into account not only whether the site is up or down, but such detail as hop count, packet loss, round-trip time, and data-center capacity, to name a few. These metrics then transparently provide users with the most efficient and intelligent way of steering traffic and achieving high availability. Geolocation is another means of steering traffic to the appropriate service location, as are any number of client and business-specific variables. Context is important to application delivery in general, but is a critical component of GSLB to maintain availability – including performance.

The round robin handling of the A records by the Local DNS (LDNS) is a well known problem in the industry. When multiple A records are handed back to the LDNS for an address resolution, the LDNS shuffles the list and returns the A records to the client without honoring the order in which it received them. The next time the client requests an address, the LDNS responds with a differently ordered list of A records. This LDNS behavior makes it very difficult to predict the order in which A records are being returned to the client.

In order to overcome this problem, many prefer to configure a GSLB solution to send back one A record to the LDNS. When compared with just a ‘plain’ DNS server that would send back any one of the site addresses with a TTL value, an intelligent GSLB sends back the address of the best performing site, based on the metrics that are important to the business, and sets the TTL value. A majority of the LDNS that are RFC compliant will honor the TTL value and resolve again after the TTL value has expired.
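The single-answer behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular GSLB product works: the `Site` fields, the weights inside `score`, the documentation-range IP addresses, and the default TTL are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Hypothetical snapshot of one site's monitored metrics."""
    name: str
    address: str
    up: bool
    round_trip_ms: float
    packet_loss: float    # fraction between 0 and 1
    capacity_used: float  # fraction between 0 and 1


def score(site: Site) -> float:
    """Lower is better. The weighting is illustrative, not a vendor formula."""
    return site.round_trip_ms * (1 + 10 * site.packet_loss) * (1 + site.capacity_used)


def answer(sites: list[Site], count: int = 1, ttl: int = 30) -> list[tuple[str, int]]:
    """Return (address, ttl) pairs for the `count` best sites that are up."""
    healthy = [s for s in sites if s.up]   # sites that are down are never returned
    ranked = sorted(healthy, key=score)    # best performing site first
    return [(s.address, ttl) for s in ranked[:count]]


# Example data using addresses reserved for documentation.
SITES = [
    Site("east", "192.0.2.10", True, 40.0, 0.00, 0.5),
    Site("west", "198.51.100.20", True, 25.0, 0.02, 0.9),
    Site("south", "203.0.113.30", False, 10.0, 0.00, 0.1),
]

if __name__ == "__main__":
    print(answer(SITES))           # one A record for the LDNS
    print(answer(SITES, count=2))  # the two best sites, if multiple records are wanted
```

Because only one address is returned by default, LDNS shuffling has nothing to reorder; the short TTL is what forces the LDNS to ask again and pick up a fresh decision.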
The GSLB performs advanced health checking and sends back the address of the best performing site, taking into account metrics like application availability, site capacity, round trip time, hops, and completion rate, hence providing the best user experience and meeting applicable business service level agreements.

In the event of a site failure (when the link is down or because of a catastrophic event), existing clients would continue trying to connect to the unavailable site for a period of time equal to the TTL value. Say the GSLB sets a TTL value of 30 seconds when returning an A record back to the LDNS. As soon as the 30 second time period expires, the LDNS resolves again and the GSLB uses its advanced health checking capability and determines that one of the multiple sites is unavailable. The GSLB then starts to direct users transparently to the best performing site by returning the address of that site back to the LDNS. A flexible GSLB will also provide a Manual Resume option that gives operators the choice of letting the unavailable site stay down to mitigate the commonly known back end database synchronization problem.

An intelligent GSLB also has the option of sending multiple A records, which allows delivery of content from the best performing sites. For example, let’s say an enterprise wants to deliver its content using 10 sites and wants to provide high availability. Using sophisticated health checking, the GSLB can determine the two best performing sites and return their addresses to the users. The GSLB would track each site’s performance and send back the best sites based on current network and site conditions (context) for every resolution. Slow sites or sites that are down would never be sent back to the user.

What about the issues with DNS Browser Caching?

Of all the issues raised by Pete in his seminal work of 2004, this is likely the one that is still relevant.
Browser technology has evolved, yes, but the underlying network stack and functionality have not, mainly because DNS itself has not changed much in the past ten years. Most modern browsers may or may not (evidence is conflicting and documentation nailing it down scant) honor DNS TTL but they have, at least, reduced the caching interval on the client side. This may or may not - depending on timing - result in a slight delay in the event of a failure while resolution catches up but it does not have nearly the dramatic negative impact it once had. In early days, a delay of 15 minutes could be expected. Today that delay can generally be counted in seconds. It is, admittedly, still a delay but one that is certainly more acceptable to most business stakeholders and customers alike.

And yet while the issue of DNS browser caching is still technically accurate, it is not all that relevant; the same solution Pete suggests to address the issue – RR DNS – has always been available as an option for GSLB. Any technology, when not configured to address the issue for which it was implemented, can be considered a failure. This is not a strike against the technology, but against the particular implementation. Instances of browser caching impacting site availability and performance are minimal in most cases, and for those organizations for which such instances would be completely unacceptable it is simply a matter of mitigation using the proper policies and configuration.

SUMMARY
What it comes down to is that Pete, in his paper, is pushing for the use of Round-Robin DNS (RR DNS). Modern Global Server Load Balancing (GSLB) solutions fully support this option today and generally speaking always have. However, the focus on the technical aspects completely ignores the impact of business requirements and agreements and does not take into consideration the functions and features over and above standard DNS that assist in supporting those requirements.
For example, health-checking has come a long way since 2004. It includes not only simple up-down indicators and performance-based indicators but is now able to incorporate a full range of contextual variables into the equation. Location, client-type, client-network, data center network, capacity… all these parameters can be leveraged to perform “health” checks that enable a more accurate and ultimately adaptable decision. Interestingly, standard DNS servers leveraged to implement a GSLB solution are neither capable of such health checks nor provide the means by which they can be implemented. Such “health monitoring” is, however, a standard offering for GSLB solutions.

NEW FACTORS to CONSIDER
Given the dynamism inherent not only to local data centers but to global implementations, and the inclusion of cloud computing and virtualization, GSLB must also provide the means by which management, maintenance, and process automation can be accomplished. Traditional DNS solutions like BIND do not provide such means of control, nor are they enabled with the ability to participate in the collaborative processes necessary to automate the migration and capacity fulfillment functions for which virtualization and cloud computing are and will be used. Thus a simple RR DNS implementation may be desirable, but the solution through which it is implemented must be more modern and capable of addressing management and business concerns as well. These are the “functions and features” over and above standard DNS servers that provide value to organizations regardless of the technical details of the algorithms and methods used to distribute DNS records.

Additionally, traditional DNS solutions – while supporting new security initiatives like DNSSEC – are less able to handle such initiatives in a dynamic environment. A GSLB must be able to provide dynamic signing of records so that global server load balancing can continue to work alongside DNSSEC.
DNSSEC introduces a variety of challenges associated with GSLB that cannot be easily or efficiently addressed by standard DNS services. Modern GSLB solutions can and do address these challenges while enabling integration and support for other emerging data center models that make use of cloud computing and virtualization.

This skeleton is sure to creep out of the closet yet again in a few years, primarily because DNS itself is not changing. Extensions such as DNSSEC occasionally crop up to address issues that arise over time, but the core principles upon which DNS has always operated are still true and are likely to remain true for some time. What has changed are the data center architectures, technology, and business requirements that IT organizations must support, in part through the use of DNS and GSLB. The fact is that GSLB does work and modern GSLB solutions do provide the means by which both technical and business requirements can be met while simultaneously addressing new and emerging challenges associated with the steady march of technology toward a dynamic data center.

So You Put an Application in the Cloud. Now what?
F5 Friday: Hyperlocalize Applications for Everyone
Location-Aware Load Balancing
Cloud Needs Context-Aware Provisioning
What is a Strategic Point of Control Anyway?
German DPA Issues Legal Opinion on Cloud Computing
As Cloud Computing Goes International, Whose Laws Matter?
Load Balancing in a Cloud

WILS: Virtualization, Clustering, and Disaster Recovery
#virtualization Clustering is local. Disaster recovery is global.

There are two levels of reliability for an application. There’s local and there’s global. We might want to consider it more simply as “inside” and “outside” reliability.

Virtualization enables local reliability – the inside kind of reliability. Whether you’re relying upon clustering or load balancing (each has advantages and disadvantages, but for purposes of reliability and this discussion we’ll assume equal capabilities) to provide the abstraction isn’t as important as recognizing that in terms of reliability you’re acting at the local, i.e. inside, level. A cluster or pool, in load balancing parlance, is able to maintain local reliability by distributing load across multiple instances of the application. We can transparently add or remove instances to achieve the elasticity necessary to meet demand, thus ensuring reliability. In the event of a local disaster, such as the failure of a virtual machine, we can take the failed instance out of the rotation and even provision another to replace it.

What clustering (load balancing) can’t do is address global reliability, i.e. outside reliability. Global reliability must be addressed using a different technology, normally referred to as Global Server Load Balancing (GSLB). The terminology grew out of the days when global reliability was achieved by load balancing individual servers across the globe to ensure a failure in the network or at a specific location could not interrupt the service. As demand grew, GSLB performed the same functions, but did so at a site level, essentially load balancing sites instead of individual servers. The name remains, however confusing that may be to the uninitiated.

To achieve global reliability you need GSLB. To avoid the detrimental effects of a disaster in the network or at the site level, you must be able to direct users to an active location.
This is realized in most implementations through simple DNS load balancing techniques; i.e. when a user makes a request the GSLB service responds with the IP address of an appropriate, active site. GSLB is capable of much more complex decision making, however, and decisions can be based on a variety of business and operational parameters, at the discretion of the organization. The GSLB service monitors each of the local sites, and is able to detect an outage within seconds and begin directing users elsewhere.

At the local level, clustering and load balancing also monitor the “health” of individual instances and can react similarly in the event of a failure, but do so only at the local level. If the site fails, as might be the case in the event of a disaster, the local service is unable to do anything about it. It can’t redirect globally, it can’t notify other components. It’s just gone.

For disaster recovery purposes, this is important stuff. When cloud first drifted onto the scene it was postulated that the cheaper compute would make implementing secondary data centers specifically for disaster recovery purposes more financially feasible for a wider variety of organizations. While that’s true in the sense that it’s way cheaper than building a secondary data center, many of the technological foundations remain the same: GSLB and a replicated environment. Some folks balk at the replication and point to transparent migration as a solution. After all, why pay even pennies on the hour for instances that may never be put into commission? The problem is that transparent migration of virtual machines is only useful while the VMs are live and running. If they aren’t, such as might be the case in the event of a disaster, the site can’t be replicated and global reliability fails. A cluster-to-cluster failover via a bridged network to the cloud might sound like a good idea, but it isn’t practical when applied to a disaster recovery scenario.
Too much depends on the availability of the site, of the network, and of the clustering/load balancing mechanism itself. If any one of the components has failed, global reliability is unrealizable.

To achieve true global reliability regardless of the involvement of cloud computing, you’re going to need to implement a good old-fashioned GSLB architecture, complete with the network components and replicated application infrastructure. Local reliability (inside) may be achievable with virtual clustering solutions, but global reliability requires a very different architecture and set of technologies. Disaster recovery strategies cannot rely on local reliability; they must be based on global reliability.

WILS: Write It Like Seth. Seth Godin always gets his point across with brevity and wit. WILS is an attempt to be concise about application delivery topics and just get straight to the point. No dilly dallying around.

Back to Basics: Load balancing Virtualized Applications
The Cost of Ignoring ‘Non-Human’ Visitors
Cloud Bursting: Gateway Drug for Hybrid Cloud
The HTTP 2.0 War has Just Begun
Why Layer 7 Load Balancing Doesn’t Suck
Network versus Application Layer Prioritization
WILS: The Many Faces of TCP
WILS: WPO versus FEO

F5 Friday: Devops for DNS
#devops #cloud Managing a global presence – especially in the cloud – can introduce additional complexity.

Back in the day when virtualization and cloud were just making waves, one of the first challenges made obvious was managing IP addresses. As VM density increased, there were more IP network management tasks that had to be handled – from distributing and assigning IP addresses to VLAN configuration to DNS entries. All this had to be done manually. It was recognized there was a growing gap between the volatility in the IP network due to virtualization and cloud and the ability of operations to handle it, but very little was done to address it.

One of the forerunners of automation in the IP management space was Infoblox. Only we didn't call it "automation" then, we called it "Infrastructure 2.0". After an initial focus on managing the internal volatility in the IP network, the increase in architectures adopting a hyper-hybrid cloud model is turning that focus outward, toward the need to more efficiently manage the global IP network space. The global IP network space, too, has volatility and may in fact require more flexibility as organizations seek to leverage cloud bursting and balancing architectures to assure availability and performance to their end-users.

One of the requisites of a highly available global-spanning architecture is the deployment of multiple global server load balancing (GSLB) solutions such as BIG-IP Global Traffic Manager (GTM). To assure availability a la disaster recovery/business continuity initiatives, it is imperative to deploy what are essentially redundant yet independently operating global load balancing devices. This distribution means multiple, remote devices that must be managed and, just as importantly, that must tie into global IP address management frameworks.
Most of this today is not automated; organizations advancing their devops initiatives may have already begun to embrace this demesne and automate using available tooling such as scripting and device APIs, but for the most part organizations have not yet focused on this problem (having quite a bit of work to do internally in the first place).

This is integration work, it's management work, it's a job for devops – and it's an important one. The ability to integrate and seamlessly manage hyper-hybrid architectures is paramount to enabling federated cloud ecosystems in which organizations can move about as demand and costs require without requiring labor-intensive activity on the part of operations. Automating and centralizing a federated ecosystem at the global IP network layer is a transformational shift on par with the impact of the steam train in the US's old west. The impact of faster and further was profound and enabled expansion of population and business alike. Federation enabled by the appropriate toolsets and processes will provide similar benefits, enabling business and IT to expand and improve their services to their end-users by leaps and bounds, without incurring the costs or risks of a disconnected set of remotely deployed resources.

F5 and Infoblox have enabled exactly this type of solution, comprising integration of F5 GTM via our iControl API with Infoblox Load Balancer Manager (LBM). The solution merges appliance-based DNS, DHCP, and IP address management with a network of standalone BIG-IP GTM devices to create a single management grid. With lots of devops goodness like changing and synchronizing configuration in a hyper-hybrid (or just highly distributed) environment, the integrated solution is an enabler of broader, more dynamic and distributed architectures.
It enables the automation of tasks without scripting, assures a consistent workflow with pre-configured "best practices" for DNS management, and automates daily operational tasks such as synchronizing updates and checking on status.

You can read more in the solution profile Automate DNS Network and Global Traffic Management or in one of Don's excellent blogs on the topic:
F5 Friday: Infoblox and F5 Do DNS and Global Load Balancing Right.
DNS Architecture in the 21st Century

Related blogs & articles:
Global Server Load Balancing Resources:
Creating a DNS Blackhole. On Purpose
DNS Is Like Your Mom
No DNS? No… Anything
DNS Gets an Upgrade
BIG-IP v11: Operational Efficiency for Federal Government Agencies
DNSSEC: Is Your Infrastructure Ready?
Global Server Load Balancing Overview
High-Performance DNS Services in BIG-IP Version 11 [PDF]
The DDoS Threat Spectrum [PDF]
Availability and the Cloud [PDF]
Technology Alliance Partnership Update Week of September 14th 2012

Lori MacVittie is a Senior Technical Marketing Manager, responsible for education and evangelism across F5’s entire product suite. Prior to joining F5, MacVittie was an award-winning technology editor at Network Computing Magazine. She holds a B.S. in Information and Computing Science from the University of Wisconsin at Green Bay, and an M.S. in Computer Science from Nova Southeastern University. She is the author of XAML in a Nutshell and a co-author of The Cloud Security Rules.