Skype Outage Triggered By Microsoft Windows Patch
Last week's Skype outage has been explained, and it appears to have been triggered by a gazillion Windows boxes rebooting after downloading a new system patch:
[from Patch Tuesday update triggered Skype outage | The Register][...] Skype has blamed last week's prolonged outage on the effects of Microsoft's Patch Tuesday.
The latest security update from Microsoft required a system reboot. The effect of so many machines rebooting and subsequently trying to log onto the Skype VoIP network triggered system instability and a prolonged outage of almost two days starting on Thursday
Here's what Skype's spokesman Villu Arak had to say:
The disruption was triggered by a massive restart of our users’ computers across the globe within a very short timeframe as they re-booted after receiving a routine set of patches through Windows Update. The high number of restarts affected Skype’s network resources. This caused a flood of log-in requests, which, combined with the lack of peer-to-peer network resources, prompted a chain reaction that had a critical impact.
Normally Skype’s peer-to-peer network has an inbuilt ability to self-heal, however, this event revealed a previously unseen software bug within the network resource allocation algorithm which prevented the self-healing function from working quickly. Regrettably, as a result of this disruption, Skype was unavailable to the majority of its users for approximately two days.
The issue has now been identified explicitly within Skype. We can confirm categorically that no malicious activities were attributed or that our users’ security was not, at any point, at risk.
Yikes. Sounds like a house of cards scenario: an inherent flaw of the Microsoft Windows distribution model -- getting gazillions to reboot at roughly the same time -- leads to the discovery of an inherent flaw in a massive peer-to-peer architecture. Gazillions of users trying to log in at the same moment, while those same gazillions are offline, swamp the system, and since they all keep retrying, the system can't right itself. Kind of like five panicked people all trying to climb into a capsized canoe, endlessly recapsizing it.
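For anyone wondering what the "everybody retries at once" problem looks like from the client side, here is a minimal sketch of one common mitigation: exponential backoff with random jitter. To be clear, this is not Skype's fix, which has not been published in any detail. The names `backoff_delays`, `login_with_backoff`, and `try_login`, and the `base` and `cap` values, are all made up for illustration.

```python
import random
import time


def backoff_delays(max_attempts, base=1.0, cap=300.0, rng=None):
    """Yield a randomized wait in seconds for each retry ("full jitter")."""
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        # The ceiling doubles with every failed attempt, up to `cap`.
        ceiling = min(cap, base * (2 ** attempt))
        # A uniform pick in [0, ceiling] keeps clients from retrying in lockstep.
        yield rng.uniform(0, ceiling)


def login_with_backoff(try_login, max_attempts=8, sleep=time.sleep, rng=None):
    """Call the hypothetical `try_login` until it succeeds or attempts run out."""
    for delay in backoff_delays(max_attempts, rng=rng):
        if try_login():
            return True
        # Wait a random, growing interval before the next attempt.
        sleep(delay)
    return False
```

The idea is that a client whose login fails waits longer and longer before retrying, and the random component spreads the retries out in time, so a crowd that all rebooted in the same few minutes does not keep hammering the login servers in waves.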
As we get more connected, it becomes ever more important that the foundational elements we rely on be super resilient and tolerant of disruptions. I hope the Skype people go back to the drawing board and walk through the innards of the system to avoid this in the future. And it's a message for anyone trying to scale a system to support gazillions of users.
One of the problems of large working systems is that it becomes functionally impossible to simulate actual working conditions: you have to turn them on and let them run. But, just like living systems, they must evolve or die. If Skype -- or any other key element of the web fabric that we rely on -- begins to show deep design flaws that make the system unstable, people will quickly migrate to other solutions.
