Know Your IaaS: Boundary Identifies Performance Lags Introduced by the Cloud

By Boundary on 11 19 2012

With the holidays approaching, and along with them a critical e-commerce season, vendors are spending millions to identify performance lags and to ensure low latency and uniform performance across their infrastructure. But building low-latency applications in the public cloud is tricky. Cloud applications depend on infrastructure you don’t control, and it can be difficult to see how that underlying infrastructure is affecting your application.

I’ve been working on an internal application for our sales organization. I thought that Amazon’s “t1.micro” instances would be perfect for this app. Amazon announced this class of instances in 2010 for applications that require “a small amount of consistent CPU resources,” and notes that these instances can provide burst capacity when additional resources are available on the host machine.

We set up HAProxy in front of our application tier and ran the application. As load increased, so did latency. Over time, round-trip time (RTT) as measured by Boundary rose to a full 6 seconds! Here’s a screenshot of Application Visualization, which uses the color red to indicate unacceptable latency. The HAProxy tier was almost completely red.

Interestingly enough, we had to stop one of the HAProxy nodes, and when we started it up again, it moved within Amazon’s infrastructure, and voilà, the latency disappeared.

So, What Changed?

Boundary automatically posts notes, called Annotations, to the timeline to notify you when something has changed. This includes code deploys from your build and release tools as well as underlying system changes. I could see immediately that nothing external to the system had changed; we’d introduced no additional traffic and made no changes to the system itself (aside from respawning the instance, of course). So what changed?

Those of you with experience running applications in the public cloud – especially if you’ve experimented with Micro instances – already know the answer. What had changed was my instance location in the Amazon EC2 infrastructure.

In public cloud environments, the host on which your instance runs is managed by the provider’s provisioning infrastructure. In the case of our lonely Micro instances, we’d begun the test with an instance running on a host machine without burst capacity to spare, then moved it to another host within EC2 that was able to offer us a little more breathing room beyond the base provision guaranteed to Micro instances.

It’s critical to be able to measure network health, including signs of congestion and oversubscription, in an IaaS (infrastructure as a service) environment. Boundary helps pinpoint the root cause of these issues by empowering operations professionals to spot outliers and poor performers. These tools help companies quickly identify when the underlying infrastructure is causing application performance problems.
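
To make “spotting outliers” a little more concrete, here is a rough Python sketch of the idea. This is not how Boundary does it; the function name, the node names, the sample RTT values and the 3x threshold are all assumptions made up for illustration. It compares each node’s median RTT against the median of its peers and flags any node that is far slower than the rest of its tier.

```python
from statistics import median


def find_latency_outliers(rtt_ms_by_node, factor=3.0):
    """Return {node: median_rtt_ms} for nodes much slower than their peers.

    A node is flagged when its median RTT exceeds `factor` times the
    median of the other nodes' medians. The threshold is an arbitrary
    assumption for this sketch, not a recommended value.
    """
    # Median RTT per node, skipping nodes with no samples.
    medians = {
        node: median(samples)
        for node, samples in rtt_ms_by_node.items()
        if samples
    }

    outliers = {}
    for node, node_median in medians.items():
        peers = [m for other, m in medians.items() if other != node]
        if not peers:
            continue  # a lone node has nothing to be compared against
        if node_median > factor * median(peers):
            outliers[node] = node_median
    return outliers


# Hypothetical RTT samples in milliseconds for three nodes in one tier.
samples = {
    "node-a": [180, 210, 195, 220],
    "node-b": [5800, 6000, 6100, 5900],
    "node-c": [190, 205, 200, 215],
}

print(find_latency_outliers(samples))
```

Medians are used rather than averages so that a handful of slow requests on an otherwise healthy node doesn’t get it flagged. Comparing nodes within the same tier matters too: if every node is slow, the problem is more likely load or the application than one unlucky host.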

Once HAProxy02 was healthy and serving sub-second latency values once again, I stopped and restarted HAProxy01. As expected, it moved within Amazon’s infrastructure and relaunched with a fresh IP, but within 15 seconds, I could see that application latency was worse in this location! I stopped and restarted again, and this time, landed in a location where application latency was ideal. While this sort of leapfrogging might not be appropriate for production operations, it offers a nice look into what users might observe when running close to capacity on instances that operate on a burstable capacity model, whether in EC2 or somewhere else.

Here’s AppVis showing a much healthier demo environment:

What’s the moral of the story?

In cloud environments, performance problems aren’t always introduced by you or your users. Sometimes, the problems are introduced by the underlying cloud infrastructure or by relying on burst capacity without realizing it. Having tools that pinpoint and alert you to exactly where problems are introduced can save hours of troubleshooting and enable teams to head off performance problems before they become widespread fires.