Another Cloud Outage – Another Boundary Early Warning
By Boundary on November 1, 2012
Last week we showed you how Boundary spotted a problem developing in AWS before Amazon publicly announced there were issues. Many articles have been written about this, including a postmortem from Netflix. What’s interesting is that Netflix saw exactly the same issues and behavior that Boundary did, including observing the problem two full hours before Amazon announced it.
Yesterday, the Windows Azure platform suffered performance degradation in its North Europe and West Europe sub-regions. This time, the emerging issue and its impact were caught by a Boundary customer, Qbranch, before Microsoft announced any problems.
At 9:55AM UTC, Microsoft released this notice to its Azure customers:
The cause was traced to two faulty network switches. At 1:26 PM UTC on Oct. 31, the issue was resolved and Microsoft posted an update that things were operating normally. Again, these are typical problems that can occur in any cloud service, and in your own physical data center.
But Fredrik Lindstrom of Qbranch said that, using Boundary, they observed problems developing in their Azure environment 15 hours before Microsoft announced an issue.
“We run a multi-tenant service which makes heavy use of Windows Azure Servicebus for relayed messaging to extend our reach into customer environments,” said Fredrik. “Think of Servicebus as a proxy for messaging which greatly simplifies the task of integrating systems hidden behind firewalls and other nasty things.”
“We first noticed trouble Tuesday evening around 7 pm UTC, in the form of high latency and timeouts for traffic going across the Servicebus. Our customers only use our service during regular office hours, so the impact was minimal. Next morning, we still had the same issue and Microsoft notified customers of networking issues at 9:55 UTC.”
Fredrik snapped this image of the Boundary dashboard at Qbranch showing the problem emerging on the evening of Oct. 30. Note: the time scale in the image is hours.
“The times match up exactly; also note that the issue appeared to go away during the night, only to reappear in the morning,” Fredrik observed.
Fredrik shared this photo of the Out of Order Packets network anomaly they were seeing in their Azure environment. At 1:26 PM UTC (2:26 PM Qbranch time), the exact moment the issue was resolved in Azure, Qbranch’s Boundary dashboard immediately reflected the fix. The team at Qbranch could relax, with definitive proof that the Azure issue causing their problems had been resolved.
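As a rough illustration (not Boundary’s actual implementation), out-of-order delivery on a TCP flow can be detected by comparing each arriving segment’s sequence number against the highest sequence number already observed; a segment that arrives “behind” that high-water mark was reordered in transit. The sketch below, with a hypothetical `count_out_of_order` helper, shows the idea:

```python
def count_out_of_order(seq_numbers):
    """Count segments arriving with a sequence number lower than the
    highest sequence number already seen on the flow -- a rough proxy
    for an 'out of order packets' metric."""
    highest = None
    out_of_order = 0
    for seq in seq_numbers:
        if highest is not None and seq < highest:
            # Arrived behind the high-water mark: reordered in transit.
            out_of_order += 1
        else:
            highest = seq
    return out_of_order

# In-order delivery: no anomalies.
print(count_out_of_order([1, 2, 3, 4]))        # 0
# Segment 3 arrives after 4 and 5 were seen: one out-of-order arrival.
print(count_out_of_order([1, 2, 4, 5, 3, 6]))  # 1
```

A sustained spike in a counter like this across many flows, as in Qbranch’s dashboard, points at the network path rather than the application.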
In cloud environments, many of the details of what’s going on under the OS are abstracted. It is very hard to differentiate between problems caused by the underlying infrastructure and problems introduced by application code or OS configuration. With Boundary, Qbranch was able to immediately isolate where the problem was introduced, monitor the impact, and verify that it was truly resolved.