Another Cloud Outage – Another Boundary Early Warning

By Boundary on 11 01 2012

Last week we showed you how Boundary spotted a problem developing in AWS before Amazon publicly announced there were issues. Many articles have been written about that outage, including a postmortem from Netflix. Interestingly, Netflix saw exactly the same issues and behavior as Boundary, and observed the issue two full hours before Amazon announced a problem.

Yesterday, the Windows Azure platform suffered performance degradation in its North and West Europe sub-regions. This time, the emerging issue and its impact were caught by a Boundary customer, Qbranch, before Microsoft announced any problems.

At 9:55AM UTC, Microsoft released this notice to its Azure customers:

Azure Notice of Performance Issues

The cause was traced to two faulty network switches. At 1:26PM UTC, Oct. 31, the issue was resolved and Microsoft posted an update that things were operating normally. Again, these are typical problems that can occur in any cloud service and in your own physical data center.

But Fredrik Lindstrom of Qbranch said that, using Boundary, they observed problems developing in their Azure environment 15 hours before Microsoft announced an issue.

“We run a multi-tenant service which makes heavy use of Windows Azure Servicebus for relayed messaging to extend our reach into customer environments,” said Fredrik. “Think of Servicebus as a proxy for messaging which greatly simplifies the task of integrating systems hidden behind firewalls and other nasty things.”

“We first noticed trouble Tuesday evening around 7 pm UTC, in the form of high latency and timeouts for traffic going across the Servicebus. Our customers only use our service during regular office hours, so the impact was minimal. Next morning, we still had the same issue and Microsoft notified customers of networking issues at 9:55 UTC.”

Fredrik snapped this image of Qbranch’s Boundary dashboard, showing the problem emerging on the evening of Oct 30th. Note: the time scale in the image is in hours.

Azure Service Degradation Problem Observed

“The times match up exactly. Also note that the issue appeared to go away during the night, only to reappear in the morning,” Fredrik observed.

Fredrik also shared this photo of the Out of Order Packets network anomaly they were seeing in their Azure environment. At 1:26PM UTC (2:26PM Qbranch time), the exact time the issue was resolved in Azure, Qbranch’s Boundary dashboard immediately reflected the fix. The team at Qbranch could relax, having definitive proof that the Azure issue causing their problems had been resolved.

Out of Order Packets - Issue Fixed


In cloud environments, many of the details of what’s going on under the OS are abstracted. It is very hard to differentiate between problems caused by the underlying infrastructure and problems introduced by application code or OS configuration. With Boundary, Qbranch was able to immediately isolate where the problem was introduced, monitor the impact, and verify that it was truly resolved.

The lesson from this is clear. You cannot rely upon your Cloud vendor to tell you when Cloud issues are affecting your application performance. For a variety of reasons, Cloud providers may not announce a developing problem until they are absolutely certain there is a problem. For business-critical and latency-sensitive applications deployed in the Cloud, you must have visibility into emerging problems that are impacting your application. Otherwise, you will waste time troubleshooting in the wrong area, potentially losing hours or even days, only to finally determine that the problem is in the underlying Cloud service.

Boundary shows you when the cloud infrastructure is impacting any part of your application (unlike traditional APM tools, which look only at the performance of the application code). In seconds, Boundary can isolate the problem to the Cloud provider’s network and give you the information to resolve the issue quickly. Anyone running business-critical or latency-sensitive applications in the Cloud benefits from using Boundary.

With our free service, why wouldn’t you?