Load Balancing Fu: Beware the Algorithm and Sticky Sessions
The choice of load balancing algorithms can directly impact – for good or ill – the performance, behavior and capacity of applications. Beware making incompatible choices in architecture and algorithms. One of the most persistent issues encountered when deploying applications in scalable architectures involves sessions and the need for persistence-based (a.k.a. sticky) load balancing services to maintain state for the duration of an end-user’s session. It is common enough that even the rudimentary load balancing services offered by cloud computing providers such as Amazon include the option to enable persistence-based load balancing. While the use of persistence addresses the problem of maintaining session state, it introduces other operational issues that must also be addressed to ensure consistent operational behavior of load balancing services. In particular, the use of the Round Robin load balancing algorithm in conjunction with persistence-based load balancing should be discouraged if not outright disallowed. ROUND ROBIN + PERSISTENCE –> POTENTIALLY UNEQUAL DISTRIBUTION of LOAD When scaling applications there are two primary concerns: concurrent user capacity and performance. These two concerns are interrelated in that as capacity is consumed, performance degrades. This is particularly true of applications storing state as each request requires that the application server perform a lookup to retrieve the user session. The more sessions stored, the longer it takes to find and retrieve the session. The exactly efficiency of such lookups is determined by the underlying storage data structure and algorithm used to search the structure for the appropriate session. If you remember your undergraduate classes in data structures and computing Big (O) you’ll remember that some structures scale more efficiently in terms of performance than do others. The general rule of thumb, however, is that the more data stored, the longer the lookup. Only the amount of degradation is variable based on the efficiency of the algorithms used. Therefore, the more sessions in use on an application server instance, the poorer the performance. This is one of the reasons you want to choose a load balancing algorithm that evenly distributes load across all instances and ultimately why lots of little web servers scaled out offer better performance than a few, scaled up web servers. Now, when you apply persistence to the load balancing equation it essentially interrupts the normal operation of the algorithm, ignoring it. That’s the way it’s supposed to work: the algorithm essentially applies only to requests until a server-side session (state) is established and thereafter (when the session has been created) you want the end-user to interact with the same server to ensure consistent and expected application behavior. For example, consider this solution note for BIG-IP. Note that this is true of all load balancing services: A persistence profile allows a returning client to connect directly to the server to which it last connected. In some cases, assigning a persistence profile to a virtual server can create the appearance that the BIG-IP system is incorrectly distributing more requests to a particular server. However, when you enable a persistence profile for a virtual server, a returning client is allowed to bypass the load balancing method and connect directly to the pool member. As a result, the traffic load across pool members may be uneven, especially if the persistence profile is configured with a high timeout value. -- Causes of Uneven Traffic Distribution Across BIG-IP Pool Members So far so good. The problem with round robin- – and reason I’m picking on Round Robin specifically - is that round robin is pretty, well, dumb in its decision making. It doesn’t factor anything into its decision regarding which instance gets the next request. It’s as simple as “next in line", period. Depending on the number of users and at what point a session is created, this can lead to scenarios in which the majority of sessions are created on just a few instances. The result is a couple of overwhelmed instances (with performance degradations commensurate with the reduction in available resources) and a bunch of barely touched instances. The smaller the pool of instances, the more likely it is that a small number of servers will be disproportionately burdened. Again, lots of little (virtual) web servers scales out more evenly and efficiently than a few big (virtual) web servers. Assuming a pool of similarly-capable instances (RAM and CPU about equal on all) there are other load balancing algorithms that should be considered more appropriate for use in conjunction with persistence-based load balancing configurations. Least connections should provide better distribution, although the assumption that an active connection is equivalent to the number of sessions currently in memory on the application server could prove to be incorrect at some point, leading to the same situation as would be the case with the choice of round robin. It is still a better option, but not an infallible one. Fastest response time is likely a better indicator of capacity as we know that responses times increase along with resource consumption, thus a faster responding instance is likely (but not guaranteed) to have more capacity available. Again, this algorithm in conjunction with persistence is not a panacea. Better options for a load balancing algorithm include those that are application aware; that is, algorithms that can factor into the decision making process the current load on the application instance and thus direct requests toward less burdened instances, resulting in a more even distribution of load across available instances. NON-ALGORITHMIC SOLUTIONS There are also non-algorithmic, i.e. architectural, solutions that can address this issue. DIVIDE and CONQUER In cloud computing environments, where it is less likely to find available algorithms other than industry standard (none of which are application-aware), it may be necessary to approach the problem with a divide and conquer strategy, i.e. lots of little servers. Rather than choosing one or two “large” instances, choose to scale out with four or five “small” instances, thus providing a better (but not guaranteed) statistical chance of load being distributed more evenly across instances. FLANKING STRATEGY If the option is available, an architectural “flanking” strategy that leverages layer 7 load balancing, a.k.a. content/application switching, will also provide better consumptive rates as well as more consistent performance. An architectural strategy of this sort is in line with sharding practices at the data layer in that it separates out by some attribute different kinds of content and serves that content from separate pools. Thus, image or other static content may come from one pool of resources while session-oriented, process intensive dynamic content may come from another pool. This allows different strategies – and algorithms – to be used simultaneously without sacrificing the notion of a single point of entry through which all users interact on the client-side. Regardless of how you choose to address the potential impact on capacity, it is important to recognize the intimate relationship between infrastructure services and applications. A more integrated architectural approach to application delivery can result in a much more efficient and better performing application. Understanding the relationship between delivery services and application performance and capacity can also help improve on operational costs, especially in cloud computing environments that constrain the choices of load balancing algorithms. As always, test early and test often and test under high load if you want to be assured that the load balancing algorithm is suitable to meet your operational and business requirements. WILS: Why Does Load Balancing Improve Application Performance? Load Balancing in a Cloud Infrastructure Scalability Pattern: Sharding Sessions Infrastructure Scalability Pattern: Partition by Function or Type It’s 2am: Do You Know What Algorithm Your Load Balancer is Using? Lots of Little Virtual Web Applications Scale Out Better than Scaling Up Sessions, Sessions Everywhere Choosing a Load Balancing Algorithm Requires DevOps Fu Amazon Makes the Cloud Sticky To Boldly Go Where No Production Application Has Gone Before Cloud Testing: The Next Generation2.3KViews0likes1CommentLoad Balancing on the Inside
Business critical internal processing systems often require high-availability and fault tolerance, too. Load balancing and application delivery is almost always associated with scaling out interactive, web-based applications. Rarely does anyone think about load balancing and application delivery in batch processing systems even when those systems might be critical to the business they are supporting. But scaling out non-interactive processing systems and providing high-availability to such critical systems is just as easily accomplished for an application delivery controller (ADC) as it is to scale out an interactive web-based application. Maybe easier. When that system also requires a bit more intelligence than just simple load balancing, it makes a lot of sense to look closer at a context-aware system that can support all the requirements in a single solution. THE SCENARIO A batch document processing system uses a document ID to match all related documents to the same “case.” The first time a document ID is encountered, it creates a new “case” and subsequent documents bearing that ID are attached to the original case. To ensure processing around the clock, a redundant set of application servers is configured to process the documents, and the vendor’s application server clustering solution is used to load balance documents (in simple round-robin fashion) across the two instances. A load test is conducted, ramping up to 2500 documents per hour (41 per minute, fewer than 1 per second). During the test it is discovered that in some situations two documents with the same ID will arrive at the clustering solution in order. They will each be load balanced to separate instances. There is no existing “case” for this document id. Because of processing times and load on the servers, both documents result in the creation of separate “cases.” The test is considered a failure. Because the system, while managing the load fine from a network perspective, executed incorrectly under load from a process perspective. The solution? Reconfigure the clustering solution to an active-standby configuration, thus introducing the process latency needed to ensure that the scenario does not occur. Retest. Success. The result? The investment in the second instance of the application server – hardware, software licenses, management, maintenance – is wasted. It is a “failover” node only and reduces the overall capacity – and ultimately performance at higher load levels – of the system. WHEN CONTEXT MATTERS This scenario is real; it was described to me by a program manager at a Fortune 500 with a great deal of frustration as it seemed, to her anyway, that the architects could not come up with a working solution other than wasting a perfectly good set of resources. Instinctively she described a solution that leveraged persistence to force all documents with the same ID to the same server as it had been proven repeatedly that if all documents with the same ID were processed by the same application server that the system processed them correctly and associated them with the right “case” in all situations. But the application server clustering solution, which can provide server affinity (persistence) based on a few variables, was for some reason not able to support affinity (persistence) based on the document ID. After a few questions regarding the overall system and processing times it became clear that a context-aware application delivery controller could indeed solve this problem. The solution is fairly simple, actually, and based on existing persistence-based load balancing solutions. It is a given that documents with the same ID are batch processed within minutes of each other. Thus, a persistence table with a life of an hour or even thirty-minutes would provide the proper context in which documents could be processed and directed to the “right” web application server. This requires context; it requires that the load balancing solution, the application delivery controller, be aware of not only what it is processing but what it has processed already, and where it’s been sent. Document ID Based Persistence Logic Extract the document ID from the document Check the persistence table for the document ID If the document ID already exists, route the document to the same server as the previous document(s) with that ID If the document ID does not exist, decide which server the document will be sent to for processing and create an entry in the persistence table Wash. Rinse. Repeat. This problem is really about process level execution; about enforcing a business requirement on the technological implementation. In order to achieve compliance with the business process expectations it is necessary to be able to view each request in the context of that process rather than as an individual request that needs to be executed. Thus each touch point in the architecture that needs to manipulate, transform, or perform some task with or on or to the request needs to be able to take into consideration the process; it needs to be context-aware so that its decisions are made within the context of the entire process and not just the individual request. Layer 7 switching, application load balancing, application delivery. Whatever you want to call it, it is the way in which load balancing becomes context-aware and becomes collaborative. It enables the business requirements to be not only taken into consideration but enforced while ensuring that CapEx and OpEx investments in additional systems are not left to sit idle; wasted. It improves capacity essentially by introducing process latency into the equation. By forcing the process to follow a particular path the application delivery controller assists in the technological implementation meeting the goals of the business. In order words, it aligns IT with the business. Sometimes the marketing fluff is more solid than it appears. To Boldly Go Where No Production Application Has Gone Before WILS: Network Load Balancing versus Application Load Balancing Sessions and Cookies and Persistence, oh my! Persistent and Persistence, What's the Difference? If Load Balancers Are Dead Why Do We Keep Talking About Them? A new era in application delivery Infrastructure 2.0: The Diseconomy of Scale Virus The Politics of Load Balancing Business-Layer Load Balancing Not all application requests are created equal244Views0likes1CommentWILS: How can a load balancer keep a single server site available?
Most people don’t start thinking they need a “load balancer” until they need a second server. But even if you’ve only got one server a “load balancer” can help with availability, with performance, and make the transition later on to a multiple server site a whole lot easier. Before we reveal the secret sauce, let me first say that if you have only one server and the application crashes or the network stack flakes out, you’re out of luck. There are a lot of things load balancers/application delivery controllers can do with only one server, but automagically fixing application crashes or network connectivity issues ain’t in the list. If these are concerns, then you really do need a second server. But if you’re just worried about standing up to the load then a Load balancer for even a single server can definitely give you a boost.421Views0likes2CommentsBack to Basics: The Theory of (Performance) Relativity
#webperf #ado Choice of load balancing algorithms is critical to ensuring consistent and acceptable performance One of the primary reasons folks use a Load balancer is scalability with a secondary driver of maintaining performance. We all know the data exists to prove that "seconds matter" and current users of the web have itchy fingers, ready to head for the competition the microsecond they experience any kind of delay. Similarly, we know that productivity is inherently tied to performance. With more and more critical business functions "webified", the longer it takes to load a page the longer the delay a customer or help desk service representative experiences, reducing the number of calls or customers that can be serviced in any given measurable period. So performance is paramount, I see no reason to persuade you further to come to that conclusion. Ensuring performance then is a vital operational directive. One of the ways operations tries to meet that objective is through load balancing. Distributing load ensures available and can be used to offset any latency introduced by increasing capacity (again, I don't think there's anyone who'll argue against the premise that load and performance degradation are inherently tied together). But just adding a load balancing service isn't enough. The algorithm used to distribute load will invariably impact performance – for better or for worse. Consider the industry standard "fastest response time" algorithm. This algorithm distributes load based on the historical performance of each instance in the pool (farm). On the surface, this seems like a good choice. After all, what you want is the fastest response time, so why not base load balancing decisions on the metric against which you are going to be measured? The answer is simple: "fastest" is relative. With very light load on a pool of, say, three servers, "fastest" might mean sub-second responses. But as load increases and performance decreases, "fastest" might start creeping up into the seconds – if not more. Sure, you're still automagically choosing the fastest of the three servers, but "fastest" is absolutely relative to the measurements of all three servers. Thus, "fastest response time" is probably a poor choice if one of your goals is measured in response time to the ultimate customer – unless you combine it with an upper connection limit. HOW TO USE "FASTEST RESPONSE TIME" ALGORITHMS CORRECTLY One of the negatives of adopting a cloud computing paradigm with a nearly religious-like zeal is that you buy into the notion that utilization is the most important metric in the data center. You simply do not want to be wasting CPU cycles, because that means you're inefficient and not leveraging cloud to its fullest potential. Well, horse-puckey. The reality is that 100% utilization and consistently well-performing applications do not go hand in hand. Period. You can have one, but not the other. You're going to have to choose which is more important a measurement – fast applications or full utilization. In the six years I spent load testing everything from web applications to web application firewalls to load balancers to XML gateways one axiom always, always, remained true: As load increases performance decreases. You're welcome to test and retest and retest again to prove that wrong, but good luck. I've never seen performance increase or even stay the same as utilization approaches 100%. Now, once you accept that reality you can use it to your advantage. You know that performance is going to decrease as load increases, you just don't know at what point the degradation will become unacceptable to your users. So you need to test to find that breaking point. You want to stress the application and measure the degradation, noting the number of concurrent connections at which performance starts to degrade into unacceptable territory. That is your connection limit. Keep track of that limit (for the application, specifically, because not all applications will have the same limits). When you configure your load balancing service you can now select fastest response time but you also need to input hard connection limits on a per-instance basis. This prevents each instance from passing through the load-performance confluence that causes end-users to start calling up the help desk or sighing "the computer is slow" while on the phone with their customers. This means testing. Not once, not twice, but at least three runs. Make sure you've found the right load-performance confluence point and write it down. On your hand, in permanent marker. While cloud computing and virtualization have certainly simplified load balancing services in terms of deployment, it's still up to you to figure out the right settings and configuration options to ensure that your applications are performing with the appropriate levels of "fast". Back to Basics: Load balancing Virtualized Applications Performance in the Cloud: Business Jitter is Bad The “All of the Above” Approach to Improving Application Performance The Three Axioms of Application Delivery Face the facts: Cloud performance isn't always stable Data Center Feng Shui: Architecting for Predictable Performance A Formula for Quantifying Productivity of Web Applications Enterprise Apps are Not Written for Speed313Views0likes0CommentsArchitecting Scalable Infrastructures: CPS versus DPS
#webperf As we continue to find new ways to make connections more efficient, capacity planning must look to other metrics to ensure scalability without compromising performance. Infrastructure metrics have always been focused on speeds and feeds. Throughput, packets per second, connections per second, etc… These metrics have been used to evaluate and compare network infrastructure for years, ultimately being used as a critical component in data center design. This makes sense. After all, it's not rocket science to figure out that a firewall capable of handling 10,000 connections per second (CPS) will overwhelm a next hop (load balancer, A/V scanner, etc… ) device only capable of 5,000 CPS. Or will it? The problem with old skool performance metrics is they focus on ingress, not egress capacity. With SDN pushing a new focus on both northbound and southbound capabilities, it makes sense to revisit the metrics upon which we evaluate infrastructure and design data centers. CONNECTIONS versus DECISIONS As we've progressed from focusing on packets to sessions, from IP addresses to users, from servers to applications, we've necessarily seen an evolution in the intelligence of network components. It's not just application delivery that's gotten smarter, it's everything. Security, access control, bandwidth management, even routing (think NAC), has become much more intelligent. But that intelligence comes at a price: processing. That processing turns into latency as each device takes a certain amount of time to inspect, evaluate and ultimate decide what to do with the data. And therein lies the key to our conundrum: it makes a decision. That decision might be routing based or security based or even logging based. What the decision is is not as important as the fact that it must be made. SDN necessarily brings this key differentiator between legacy and next-generation infrastructure to the fore, as it's just software-defined but software-deciding networking. When a switch doesn't know what to do with a packet in SDN it asks the controller, which evaluates and makes a decision. The capacity of SDN – and of any modern infrastructure – is at least partially determined by how fast it can make decisions. Examples of decisions: URI-based routing (load balancers, application delivery controllers) Virus-scanning SPAM scanning Traffic anomaly scanning (IPS/IDS) SQLi / XSS inspection (web application firewalls) SYN flood protection (firewalls) BYOD policy enforcement (access control systems) Content scrubbing (web application firewalls) The DPS capacity of a system is not the same as its connection capacity, which is merely the measure of how many new connections a second can be established (and in many cases how many connections can be simultaneously sustained). Such a measure is merely determining how optimized the networking stack of any given solution might be, as connections – whether TCP or UDP or SMTP – are protocol oriented and it is the networking stack that determines how well connections are managed. The CPS rate of any given device tells us nothing about how well it will actually perform its appointed tasks. That's what the Decisions Per Second (DPS) metric tells us. CONSIDERING BOTH CPS and DPS Reality is that most systems will have a higher CPS compared to its DPS. That's not necessarily bad, as evaluating data as it flows through a device requires processing, and processing necessarily takes time. Using both CPS and DPS merely recognizes this truth and forces it to the fore, where it can be used to better design the network. A combined metric helps design the network by offering insight into the real capacity of a given device, rather than a marketing capacity. When we look only at CPS, for example, we might feel perfectly comfortable with a topological design with a flow of similar CPS capacities. But what we really want is to make sure that DPS –> CPS (and vice-versa) capabilities were matched up correctly, lest we introduce more latency than is necessary into a given flow. What we don't want is to end up with is a device with a high DPS rate feeding into a device with a lower CPS rate. We also don't want to design a flow in which DPS rates successively decline. Doing so means we're adding more and more latency into the equation. The DPS rate is a much better indicator of capacity than CPS for designing high-performance networks because it is a realistic measure of performance, and yet a high DPS coupled with a low CPS would be disastrous. Luckily, it is almost always the case that a mismatch in CPS and DPS will favor CPS, with DPS being the lower of the two metrics in almost all cases. What we want to see is as close a CPS:DPS ratio as possible. The ideal is 1:1, of course, but given the nature of inspecting data it is unrealistic to expect such a tight ratio. Still, if the ratio becomes too high, it indicates a potential bottleneck in the network that must be addressed. For example, assume an extreme case of a CPS:DPS of 2:1. The device can establish 10,000 CPS, but only process at a rate of 5,000 DPS, leading to increasing latency or other undesirable performance issues as connections queue up waiting to be processed. Obviously there's more at play than just new CPS and DPS (concurrent connection capability is also a factor) but the new CPS and DPS relationship is a good general indicator of potential issues. Knowing the DPS of a device enables architects to properly scale out the infrastructure to remediate potential bottlenecks. This is particularly true when TCP multiplexing is in play, because it necessarily reduces CPS to the target systems but in no way impacts the DPS. On the ingress, too, are emerging protocols like SPDY that make more efficient use of TCP connections, making CPS an unreliable measure of capacity, especially if DPS is significantly lower than the CPS rating of the system. Relying upon CPS alone – particularly when using TCP connection management technologies - as a means to achieve scalability can negatively impact performance. Testing systems to understand their DPS rate is paramount to designing a scalable infrastructure with consistent performance. The Need for (HTML5) Speed SPDY versus HTML5 WebSockets Y U No Support SPDY Yet? Curing the Cloud Performance Arrhythmia F5 Friday: Performance, Throughput and DPS Data Center Feng Shui: Architecting for Predictable Performance On Cloud, Integration and Performance769Views0likes0CommentsWILS: Moore’s Law + Application (Un)Scalability = Virtualization
#Virtualization was inevitable. One of the interesting side effects of having been a developer before migrating to a more network-focused view of the world* is it’s easier to understand the limitations and constraints posed on networking-based software, such as web servers. During the early days of virtualization adoption, particularly related to efforts around architecting more scalable applications, VMware (and others) did a number of performance and capacity-related tests in 2010 that concluded “lots of little web servers” scale and perform better than a few “big” web servers. Although virtualization overhead varies depending on the workload, the observed 16 percent performance degradation is an expected result when running the highly I/O‐intensive SPECweb2005 workload. But when we added the second processor, the performance difference between the two‐CPU native configuration and the virtual configuration that consisted of two virtual machines running in parallel quickly diminished to 9 percent. As we further increased the number of processors, the configuration using multiple virtual machines did not exhibit the scalability bottlenecks observed on the single native node, and the cumulative performance of the configuration with multiple virtual machines well exceeded the performance of a single native node. -- “Consolidating Web Applications Using VMware Infrastructure” [PDF, VMware] The primary reason for this is session management and the corresponding amount of memory required. Capacity is a simple case of being constrained by the size of the data required to store the session. Performance, however, is a matter of computer science (and lots of math). We could go through the Big O math of hash tables versus linked lists versus binary search trees, et al, but suffice to say that in general, most algorithmic performance degrades the larger N is (where N is the number of entries in the data store, regardless of the actual mechanism) with varying performance for inserts and lookups. Thus, it is no surprise that for most web servers, hard-coded limitations on the maximum number of clients, threads, and connections exist. All these related to session management and have an impact on capacity as well as performance. One assumes the default limitations are those the developers, after extensive experience and testing, have determined provide the optimal amount of capacity without sacrificing performance. It should be noted that these limitations do not scale along with Moore’s law. The speed of the CPU (or number of CPUs) does impact performance, but not necessarily capacity – because capacity is about sessions and longevity of sessions (which today is very long given our tendency toward Web 2.0 interactive, real-time refreshing applications). This constraint does not, however, have any impact whatsoever on the growth of computing power and resources. Memory continues to grow as do the number of CPUs, cores, and speed with which instructions can be executed. What the end result of this is that “scale up” is no longer really an option for increasing capacity of applications. Adding more CPUs or more memory exposes the reality of diminishing returns. The second 4GB of memory does not net you the same capacity in terms of users and/or connections as the first 4GB, because performance degrades in conjunction with increase in memory utilization. Again, we could go into the performance characteristics of the underlying algorithms where resizing and searching of core data structures becomes more and more expensive, but let’s leave that to those so inclined to dig into the math. The result is it shouldn’t have been a surprise when research showed “lots of little web servers”, i.e. scale out, was better than “a few big web servers”, i.e. scale up. Virtualization - or some solution similar that enabled operators to partition out the increasing amount of compute resources in such a way as to create “lots of little web servers” – was inevitable because networked applications simply do not scale along with Moore’s Law. *A Master’s degree in Computer Science doesn’t hurt here, either, at least for understanding the performance of algorithms and their various limitations Lots of Little Virtual Web Applications Scale Out Better than Scaling Up Consolidating Web Applications Using VMware Infrastructure To Take Advantage of Cloud Computing You Must Unlearn, Luke. It’s 2am: Do You Know What Algorithm Your Load Balancer is Using? Virtual Machine Density as the New Measure of IT Efficiency230Views0likes0CommentsF5 Friday: Performance, Throughput and DPS
No, not World of Warcraft “Damage per Second” - infrastructure “Decisions per second”. Metrics are tricky. Period. Comparing metrics is even trickier. The purpose of performance metrics is, of course, to measure performance. But like most tests, before you can administer such a test you really need to know what it is you’re testing. Saying “performance” isn’t enough and never has been, as the term has a wide variety of meanings that are highly dependent on a number of factors. The problem with measuring infrastructure performance today – and this will continue to be a major obstacle in metrics-based comparisons of cloud computing infrastructure services – is that we’re still relying on fairly simple measurements as a means to determine performance. We still focus on speeds and feeds, on wires and protocols processing. We look at throughput, packets per second (PPS) and connections per second (CPS) for network and transport layer protocols. While these are generally accurate for what they’re measuring, we start running into real problems when we evaluate the performance of any component – infrastructure or application – in which processing, i.e. decision making, must occur. Consider the difference in performance metrics between a simple HTTP request / response in which the request is nothing more than a GET request paired with a 0-byte payload response and an HTTP POST request filled with data that requires processing not only on the application server, but on the database, and the serialization of a JSON response. The metrics that describe the performance of these two requests will almost certainly show that the former has a higher capacity and faster response time than the latter. Obviously those who wish to portray a high-performance solution are going to leverage the former test, knowing full well that those metrics are “best case” and will almost never be seen in a real environment because a real environment must perform processing, as per the latter test. Suggestions that a standardized testing environment, similar to application performance comparisons using the Pet Shop Application, are generally met with a frown because using a standardized application to induce real processing delays doesn’t actually test the infrastructure component’s processing capabilities, it merely adds latency on the back-end and stresses capacity of the infrastructure component. Too, such a yardstick would fail to really test what’s important – the speed and capacity of an infrastructure component to perform processing itself, to make decisions and apply them on the component – whether it be security or application routing or transformational in nature. It’s an accepted fact that processing of any kind, at any point along the application delivery service chain induces latency which impacts capacity. Performance numbers used in comparisons should reveal the capacity of a system including that processing impact. Complicating the matter is the fact that since there are no accepted standards for performance measurement, different vendors can use the same term to discuss metrics measured in totally different ways. THROUGHPUT versus PERFORMANCE Infrastructure components, especially those that operate at the higher layers of the networking stack, make decisions all the time. A firewall service may make a fairly simple decision: is this request for this port on this IP address allowed or denied at this time? An identity and access management solution must make similar decisions, taking into account other factors, answering the question is this user coming from this location on this device allowed to access this resource at this time? Application delivery controllers, a.k.a. load balancers, must also make decisions: which instance has the appropriate resources to respond to this user and this particular request within specified performance parameters at this time? We’re not just passing packets anymore, and therefore performance tests that measure only the surface ability to pass packets or open and close connections is simply not enough. Infrastructure today is making decisions and because those decisions often require interception, inspecting and processing of application data – not just individual packets – it becomes more important to compare solutions from the perspective of decisions per second rather than surface-layer protocol per second measurements. Decision-based performance metrics are a more accurate gauge as to how the solution will perform in a “real” environment, to be sure, as it’s portraying the component’s ability to do what it was intended to do: make decisions and perform processing on data. Layer 4 or HTTP throughput metrics seldom come close to representing the performance impact that normal processing will have on a system, and, while important, should only be used with caution when considering performance. Consider the metrics presented by Zeus Technologies in a recent performance test (Zeus Traffic Manager - VMware vSphere 4 Performance on Cisco UCS – 2010 and F5’s performance results from 2010 (F5 2010 Performance Report) While showing impressive throughput in both cases, it also shows the performance impact that occurs when additional processing – decisions – are added into the mix. The ability of any infrastructure component to pass packets or manage connections (TCP capacity) is all well and good, but these metrics are always negatively impacted once the component begins actually doing something, i.e. making decisions. Being able to handle almost 20 Gbps throughput is great but if that measurement wasn’t taken while decisions were being made at the same time, your mileage is not just likely to vary – it will vary wildly. Throughput is important, don’t get me wrong. It’s part of – or should be part of – the equation used to determine what solution will best fit the business and operational needs of the organization. But it’s only part of the equation, and probably a minor part of that decision at that. Decision based metrics should also be one of the primary means of evaluating the performance of an infrastructure component today. “High performance” cannot be measured effectively based on merely passing packets or making connections – high performance means being able to push packets, manage connections and make decisions, all at the same time. This is increasingly a fact of data center life as infrastructure components continue to become more “intelligent”, as they become a first class citizen in the enterprise infrastructure architecture and are more integrated and relied upon to assist in providing the services required to support today’s highly motile data center models. Evaluating a simple load balancing service based on its ability to move HTTP packets from one interface to the other with no inspection or processing is nice, but if you’re ultimately planning on using it to support persistence-based routing, a.k.a. sticky sessions, then the rate at which the service executes the decisions necessary to support that service should be as important – if not more – to your decision making processes. DECISIONS per SECOND There are very few pieces of infrastructure on which decisions are not made on a daily basis. Even the use of VLANs requires inspection and decision-making to occur on the simplest of switches. Identity and access management solutions must evaluate a broad spectrum of data in order to make a simple “deny” or “allow” decision and application delivery services make a variety of decisions across the security, acceleration and optimization demesne for every request they process. And because every solution is architected differently and comprised of different components internally, the speed and accuracy with which such decisions are made are variable and will certainly impact the ability of an architecture to meet or exceed business and operational service-level expectations. If you’re not testing that aspect of the delivery chain before you make a decision, you’re likely to either be pleasantly surprised or hopelessly disappointed in the decision making performance of those solutions. It’s time to start talking about decisions per second and performance of infrastructure in the context it’s actually used in data center architectures rather than as stand-alone, packet-processing, connection-oriented devices. And as we do, we need to remember that every network is different, carrying different amounts of traffic from different applications. That means any published performance numbers are simply guidelines and will not accurately represent the performance experienced in an actual implementation. However, the published numbers can be valuable tools in comparing products… as long as they are based on the same or very similar testing methodology. Before using any numbers from any vendor, understand how those numbers were generated and what they really mean, how much additional processing do they include (if any). When looking at published performance measurements for a device that will be making decisions and processing traffic, make sure you are using metrics based on performing that processing. 1024 Words: Ch-ch-chain of Fools On Cloud, Integration and Performance As Client-Server Style Applications Resurface Performance Metrics Must Include the API F5 Friday: Speeds, Feeds and Boats Data Center Feng Shui: Architecting for Predictable Performance Operational Risk Comprises More Than Just Security Challenging the Firewall Data Center Dogma Dispelling the New SSL Myth495Views0likes0CommentsF5 Friday: The Data Center is Always Greener on the Other Side of the ADC
Organizations interested in greening their data centers (both green as in cash as well as in grass) will benefit from the ability to reduce, reuse and recycle in just 4Us of rack space with a leaner, greener F5 VIPRION According to the latest data from the U.S. Energy Information Administration, the average cost of electricity for commercial use rose from 9.63 (Jan 2010) to 9.88 (Jan 2011) cents per kWh. If you think that’s not significant, consider that the average cost of powering one device in the data center has increased by 3% from 2010 to 2011 – an average of about $5 per 250w device. On a per device basis, that’s not so bad, but start multiplying that by the number of devices in an enterprise-class data center and it begins to get fairly significant fairly quickly – especially given that we haven’t started calculating the costs to cool the devices yet, either. Medium is the New Large in Enterprise Sometimes It Is About the Hardware VIPRION 2400 and vCMP Presentation VIPRION Platform Resources vCMP: License to Virtualize Virtual Clustered Multiprocessing (vCMP) The ROI of Application Delivery Controllers in Traditional and Virtualized Environments If a Network Can’t Go Virtual Then Virtual Must Come to the Network Data Center Feng Shui: Architecting for Predictable Performance188Views0likes0CommentsWAN Optimization is not Application Acceleration
Increasingly WAN optimization solutions are adopting the application acceleration moniker, implying a focus that just does not exist. WAN optimization solutions are designed to improve the performance of the network, not applications, and while the former does beget improvements of the latter, true application acceleration solutions offer greater opportunity for improving efficiency and end-user experience as well as aiding in consolidation efforts that result in a reduction in operating and capital expenditure costs. WAN Optimization solutions are, as their title implies, focused on the WAN; on the network. It is their task to improve the utilization of bandwidth, arrest the effects of network congestion, and apply quality of service policies to speed delivery of critical application data by respecting application prioritization. WAN Optimization solutions achieve these goals primarily through the use of data de-duplication techniques. These techniques require a pair of devices as the technology is most often based on a replacement algorithm that seeks out common blocks of data and replaces them with a smaller representative tag or indicator that is interpreted by the paired device such that it can reinsert the common block of data before passing it on to the receiver. The base techniques used by WAN optimization are thus highly effective in scenarios in which large files are transferred back and forth over a connection by one or many people, as large chunks of data are often repeated and the de-duplication process significantly reduces the amount of data traversing the WAN and thus improves performance. Most WAN optimization solutions specifically implement “application” level acceleration for protocols aimed at the transfer of files such as CIFS and SAMBA. But WAN optimization solutions do very little to aid in the improvement of application performance when the data being exchanged is highly volatile and already transferred in small chunks. Web applications today are highly dynamic and personalized, making it less likely that a WAN optimization solution will find chunks of duplicated data large enough to make the overhead of the replacement process beneficial to application performance. In fact, the process of examining small chunks of data for potential duplicated chunks can introduce additional latency that actual degrades performance, much in the same way compression of small chunks of data can be detrimental to application performance. Too, WAN optimization solutions require deployment in pairs which results in what little benefits these solutions offer for web applications being enjoyed only by end-users in a location served by a “remote” device. Customers, partners, and roaming employees will not see improvements in performance because they are not served by a “remote” device. Application acceleration solutions, however, are not constrained by such limitations. Application acceleration solutions act at the higher layers of the stack, from TCP to HTTP, and attempt to improve performance through the optimization of protocols and the applications themselves. The optimizations of TCP, for example, reduce the overhead associated with TCP session management on servers and improve the capacity and performance of the actual application which in turn results in improved response times. The understanding of HTTP and both the browser and server allows application acceleration solutions to employ techniques that leverage cached data and industry standard compression to reduce the amount of data transferred without requiring a “remote” device. Application acceleration solutions are generally asymmetric, with some few also offering a symmetric mode. The former ensures that regardless of the location of the user, partner, or employee that some form of acceleration will provide a better end-user experience while the latter employs more traditional WAN optimization-like functionality to increase the improvements for clients served by a “remote” device. Regardless of the mode, application acceleration solutions improve the efficiency of servers and applications which results in higher capacities and can aid in consolidation efforts (fewer servers are required to serve the same user base with better performance) or simply lengthens the time available before additional investment in servers – and the associated licensing and management costs – must be made. Both WAN optimization and application acceleration aim to improve application performance, but they are not the same solutions nor do they even focus on the same types of applications. It is important to understand the type of application you want to accelerate before choosing a solution. If you are primarily concerned with office productivity applications and the exchange of large files (including backups, virtual images, etc…) between offices, then certainly WAN optimization solutions will provide greater benefits than application acceleration. If you’re concerned primarily about web application performance then application acceleration solutions will offer the greatest boost in performance and efficiency gains. But do not confuse WAN optimization with application acceleration. There is a reason WAN optimization-focused providers have recently begun to partner with application acceleration and application delivery providers – because there is a marked difference between the two types of solutions and a single offering that combines them both is not (yet) available.791Views0likes2CommentsWILS: Load Balancing and Ephemeral Port Exhaustion
Understanding the relationship between SNAT and connection limitations in full proxy intermediaries. If you’ve previously delved into the world of SNAT (which is becoming increasingly important in large-scale implementations, such as those in the service provider world) you remember that SNAT essentially provides an IP address from which a full-proxy intermediary can communicate with server-side resources and maintain control over the return routing path. There is an interesting relationship between intermediaries that leverage two separate TCP stacks (such as full-proxies) and SNAT in terms of concurrent (open) connections that can be supported by any given “virtual” server (or virtual IP address, as they are often referred to in the industry). The number of ephemeral ports that can be used by any client IP address is 65535. Programmer types will recognize that as a natural limitation imposed by the use of an unsigned short integer (16 bits) in many programming languages. Now, what that means is that for each SNAT address assigned to a virtual IP address, a theoretical total of 65535 connections can be open at any other single address at any given time. This is because in a full-proxy architecture the intermediary is acting as a client and while servers use well-known ports for communication, clients do not. They use ephemeral (temporary) ports, the value of which is communicated to the server in the source port field in the request. Each additional SNAT address available increases the total number of connections by some portion of that space. As you should never use ephemeral ports in the privileged range (port numbers under 1024 are traditionally reserved for firewall and other sanity checkers - see /etc/services on any Unix box) that number can be as many as 64512 available ports between the SNAT address and any other IP address. For example, if a server pool (virtual or iron) has 24 members and assuming the SNAT address is configured to use ephemeral ports in the range of 1024-65535, then a single SNAT address results in a total of 24 x 48k = 1,152k concurrent connections to the pool. If the SNAT is assigned to a virtual server that is targeting a single address (like another virtual server or another intermediate device) then the total connections is 1 x 48k = 48k connections. Obviously this has a rather profound impact on scalability and capacity planning. If you only have one SNAT address available and you need the capabilities of a full-proxy (such as payload inspection inbound and out) you can only support a limited number of connections (and by extension, users). Some solutions provide the means by which these limitations can be mitigated, such as the ability to configure a SNAT pool (a set of dedicated IP addresses) from which SNAT addresses can be automatically pulled and used to automatically increase the number of available ephemeral ports. Running out of ephemeral ports is known as “ephemeral port exhaustion” as you have exhausted the ports available from which a connection to the server resource can be made. In practice the number of ephemeral ports available for any given IP address can be limited by operating system implementations and is always much lower than the 65535 available per IP address. For example, the IANA official suggestion is that ephemeral ports use 49152 through 65535, which means a limitation of 16383 open connections per address. Any full-proxy intermediary that has adopted this suggestion would necessarily require more SNAT addresses to scale an application to more concurrent connections. One of the advantages of a solution implementing a custom TCP/IP stack, then, is that they can ignore the suggestion on ephemeral port assignment typically imposed at the operating system or underlying software layer and increase the range to the full 65535 if desired. Another major advantage is making aggressive use of TIME-WAIT recycling. Normal TCP stacks hold on to the ephemeral port for seconds to minutes after a connection closes. This leads to odd bursting behavior. With proper use of TCP timestamps you can recycle that ephemeral port almost immediately. Regardless, it is an important relationship to remember, especially if it appears that the Load balancer (intermediary) is suddenly the bottleneck when demand increases. It may be that you don’t have enough IP addresses and thus ports available to handle the load. WILS: Write It Like Seth. Seth Godin always gets his point across with brevity and wit. WILS is an ATTEMPT TO BE concise about application delivery TOPICS AND just get straight to the point. NO DILLY DALLYING AROUND. Related Posts All WILS Topics on DevCentral Server Virtualization versus Server Virtualization885Views0likes0Comments