Data Center Feng Shui: Architecting for Predictable Performance

Variable latency is a Very Bad Thing™ – that’s why we build core networks based on hardware, not software.

One of the key components of a successful scaling strategy is recognizing when more (or less) capacity is required and then acting on that information.

We call cloud and auto-scaling “on-demand,” but in reality we’re taking action based on historical data: the past five or ten minutes of performance and load on a given resource. Ultimately this requires some predictive capability, whether in systems or in people. Based on data regarding the health and performance of an application, a decision must be made whether to add or subtract capacity. The ability to successfully predict capacity needs even ten minutes hence in the data center requires not prescient operations but predictable patterns of performance.
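
To make that concrete, below is a minimal sketch in Python of the reactive loop described above: a scaling decision made against a sliding window of recent metrics. The thresholds, window size, and latency-based trigger are hypothetical placeholders, not a prescription; real values depend on the application and its SLAs.

```python
from collections import deque
from statistics import mean

# Hypothetical thresholds; real values depend on the application and its SLAs.
SCALE_UP_LATENCY_MS = 250.0   # add capacity when average latency exceeds this
SCALE_DOWN_LATENCY_MS = 50.0  # remove capacity when it falls below this
WINDOW_SIZE = 10              # how many recent samples count as "history"

window = deque(maxlen=WINDOW_SIZE)  # sliding window of recent latency samples

def record_and_decide(latency_ms: float) -> str:
    """Record one latency sample and return a scaling decision.

    The decision looks backward at the last WINDOW_SIZE samples;
    "on-demand" scaling is really a reaction to recent history.
    """
    window.append(latency_ms)
    if len(window) < WINDOW_SIZE:
        return "hold"  # not enough history to act on yet
    avg = mean(window)
    if avg > SCALE_UP_LATENCY_MS:
        return "scale-up"
    if avg < SCALE_DOWN_LATENCY_MS:
        return "scale-down"
    return "hold"

# Feed in ten calm samples, then a spike that drags the window average up.
for sample in [40.0] * 10 + [900.0] * 5:
    decision = record_and_decide(sample)
print(decision)  # -> "scale-up" once the window average crosses the threshold
```

Note that the decision is only as good as the window: if the samples are wildly variable, the average whipsaws and the decision flaps between scaling up and scaling down. Predictable performance is what makes the history trustworthy.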

This technique is not new, and it can be used outside the data center, too. Sometimes for purposes with which we are not all that comfortable:

The first stage measures the time it takes to send a data packet to the target and converts it into a distance – a common geolocation technique that narrows the target's possible location to a radius of around 200 kilometers. Wang and colleagues then send data packets to the known Google Maps landmark servers in this large area to find which routers they pass through. When a landmark machine and the target computer have shared a router, the researchers can compare how long a packet takes to reach each machine from the router; converted into an estimate of distance, this time difference narrows the search down further. 'We shrink the size of the area where the target potentially is,' explains Wang. Finally, they repeat the landmark search at this more fine-grained level: comparing delay times once more, they establish which landmark server is closest to the target.

-- Pinpointing a Computer to Within 690 Meters, Bruce Schneier
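
The arithmetic behind that first stage is worth seeing, because it exposes the technique’s dependence on stable measurements. The sketch below uses the textbook approximation that signals propagate through fiber at roughly two-thirds the speed of light; the specific RTT value is illustrative only.

```python
# Rough sketch of the first stage described above: converting a round-trip
# time into a maximum possible distance. Queuing and processing delays mean
# the true distance is almost always much shorter than this upper bound.

SPEED_IN_FIBER_M_PER_S = 2.0e8  # ~2/3 the speed of light, a common approximation

def max_distance_km(rtt_ms: float) -> float:
    """Upper bound on the target's distance implied by a round-trip time."""
    one_way_s = (rtt_ms / 1000.0) / 2.0
    return one_way_s * SPEED_IN_FIBER_M_PER_S / 1000.0

# A 2 ms RTT bounds the target to within ~200 km, the radius mentioned in
# the quote above -- but only if the RTT measurement itself is stable.
print(max_distance_km(2.0))  # -> 200.0
```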


Of note is this technique’s reliance on measuring packet performance. If the time it takes a packet to go from point A to point B is highly variable, the technique becomes less and less reliable. Similarly, if the performance of applications – which includes the network, storage and application delivery network components – is highly variable, it can lead to capacity adjustments that fail to align with operational and business goals.

SOMETIMES it is ABOUT the HARDWARE: THE NETWORK ASIC

This issue rears its ugly head quite often, and it has become more frequent with the explosion of virtualization and the increasing adoption of cloud computing. First it was the applications; now it’s the network.

After all, if cloud and virtualization and auto-scaling can provide such operational efficiency for applications – consistent performance and better control over the operational consistency of scaling out and down on demand – it makes sense that the same could be true for the network. Thus it is no surprise that some vendors are moving to offer virtual network appliances (VNAs) for what have traditionally been hardware-only network components such as routers and switches.

The x86 architecture is rapidly becoming a superior price/performance hardware alternative in networking; the packet processing performance on x86 increased 100X in the last four years.

-- Vyatta CEO Kelly Herrell, “Vyatta readies for virtual machine explosion”

This is a true statement – at least the latter half. Packet processing performance on x86 has increased in the past four years and will no doubt continue to increase in the next four. And yet the statement ignores the definition of “performance” in terms of consistency, and the rate at which latency increases as load on an x86 processor grows. We need to remember why we moved to hardware for routing and switching in the first place. It wasn’t always that way, after all, and many of us remember with varying degrees of fondness the software implementations that were eventually replaced by consistently performing hardware-based solutions.

Routing and switching lend themselves exceptionally well to low-cost, optimized ASICs because all inspection of the packet is governed by well-known IETF and IEEE protocols contained in the packet headers. There is little to no benefit in implementing such functions in software: servers are more expensive on a per-megabit basis than a router, and a software implementation will not scale as linearly and will have much higher latency (plus variability in latency) than a hardware-based solution. This is especially true with the emerging 40G and 100G standards, which will not scale linearly with the cost of CPUs.

Cisco routers in the 1990s were software-based and didn’t change until companies like Extreme, Foundry (now Brocade) and Packet Engines (now Alcatel-Lucent) came along and proved you could do wire-speed, low-latency, hardware-based layer 3 routing at Gigabit speeds. Arista (founded largely by former Cisco employees) originally took the virtual appliance approach but has since retreated to a hardware-based solution focused on high 10G density with ultra-low latency. F5, as well, has long leveraged high-speed hardware ASICs in its platforms because of the price-performance benefits afforded by such solutions. The integration of switching with load balancing was a giant leap forward for application delivery because it normalized network performance and allowed internal engineers to focus on the more variable performance of functions and features at the upper layers of the stack, namely the application layer.
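
To illustrate why those fixed, well-known header formats matter, here is a toy longest-prefix-match lookup done in software; the routing table, prefixes, and next-hop names are all invented for illustration. Software pays this lookup on a general-purpose CPU for every packet, while an ASIC with TCAM resolves the same match in effectively constant time, and it is precisely because the matching rules are fixed by IETF/IEEE standards that the logic can be burned into silicon at all.

```python
import ipaddress

# A toy software routing table. Every packet pays this lookup cost in
# software; a hardware ASIC/TCAM resolves the same match in effectively
# constant time per packet.
ROUTES = {
    ipaddress.ip_network("10.0.0.0/8"): "core-1",
    ipaddress.ip_network("10.1.0.0/16"): "edge-3",
    ipaddress.ip_network("10.1.2.0/24"): "access-7",
    ipaddress.ip_network("0.0.0.0/0"): "default-gw",
}

def next_hop(dst: str) -> str:
    """Longest-prefix match over the software table."""
    addr = ipaddress.ip_address(dst)
    best = None
    for net, hop in ROUTES.items():
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, hop)
    return best[1]

print(next_hop("10.1.2.40"))  # -> access-7 (most specific /24 wins)
print(next_hop("192.0.2.1"))  # -> default-gw (only the /0 matches)
```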

PREDICTABILITY and CONSISTENCY are KEY

The benefit of consistent network performance is normalization and predictability. When the performance impact of the network is predictable, it becomes a constant rather than a variable in the formula used to scale applications out and down.
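
A quick, illustrative calculation shows why. Suppose a hypothetical 100 ms end-to-end response budget, and compare a network with a constant 10 ms latency to one with the same average latency but exponentially distributed jitter (all numbers invented for illustration):

```python
import random
from statistics import quantiles

random.seed(42)  # reproducible illustration

APP_BUDGET_MS = 100.0  # hypothetical end-to-end response-time budget

# Scenario A: predictable network -- a constant 10 ms.
stable = [10.0] * 10_000

# Scenario B: same mean latency, but highly variable under load.
jittery = [random.expovariate(1 / 10.0) for _ in range(10_000)]

for name, samples in (("stable", stable), ("jittery", jittery)):
    p99 = quantiles(samples, n=100)[98]  # 99th-percentile network latency
    app_budget = APP_BUDGET_MS - p99     # what's left for the application
    print(f"{name}: network p99={p99:.1f} ms, app budget={app_budget:.1f} ms")
```

Both networks have the same mean latency, but the jittery one forces operations to reserve headroom for the 99th percentile, cutting the application’s share of the budget nearly in half. With a predictable network, that term in the capacity formula collapses to a constant.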

It is less efficient to model a data center that must adjust for increasing load on both applications and the network, while also figuring in the performance degradation inherent in both as RAM and CPU provisioning limits are approached. This variability has a serious impact not only on operations but on the business as well. When performance-sensitive applications – and not just those in the financial sector – lose the consistent application (and therefore network) performance they rely upon, employee efficiency degrades, productivity suffers, and all hell breaks loose in the data center.

The impact on application performance from sudden influxes of traffic is well understood – and it’s not a positive one. This has less to do with the application itself and more to do with the way the network stack interoperates with the network; the overhead of managing network traffic and TCP connections is oft-times the culprit behind application outages in the face of highly fluctuating load. That’s why the use of application delivery controllers, rather than simple load balancing, is so important to maintaining availability and consistency of performance. The same holds true for any solution that is moved from hardware to software: it becomes reliant on software to manage the network and transport protocol layers, which, as load increases, places a heavier burden on the general-purpose CPU and ultimately degrades performance by introducing variable amounts of latency.
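
Basic queueing theory suggests why that burden grows nonlinearly. The sketch below uses the textbook M/M/1 formulas (a single server with random arrivals) as a stand-in for one CPU core processing packets in software; the service rate is a made-up figure for illustration.

```python
# A minimal queueing sketch (M/M/1 formulas) of why per-packet software
# processing produces variable latency as load grows: as CPU utilization
# approaches saturation, both the mean delay and its spread explode.
SERVICE_RATE = 1_000_000  # hypothetical: packets/sec one CPU core can handle

for utilization in (0.5, 0.8, 0.9, 0.95, 0.99):
    arrival_rate = utilization * SERVICE_RATE
    # M/M/1 mean time in system, in microseconds: 1 / (mu - lambda).
    mean_us = 1e6 / (SERVICE_RATE - arrival_rate)
    # For M/M/1, time in system is exponentially distributed, so its
    # standard deviation equals its mean: spread grows as fast as latency.
    print(f"util={utilization:.0%}: mean={mean_us:.0f} us, stddev={mean_us:.0f} us")
```

In this model, mean delay doubles between 90% and 95% utilization and quintuples again by 99%, and because the standard deviation tracks the mean, the variability grows just as fast as the latency itself.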

Variable latency is bad. It impacts business productivity; it impacts application performance and availability. It reduces the ability of operations to predict capacity needs with the granularity necessary to achieve a highly efficient and dynamic data center. It’s not a question of whether these functions can be accomplished in software; of course they can. Everything is, after all, ultimately software. In the case of ASICs, that software has simply been reduced, optimized and tied tightly to a very small but powerful piece of hardware that is not only high-performance, but performs consistently.

Sometimes it is about the hardware.


Published Apr 13, 2011
Version 1.0
