Now that the dust has settled, we have had a chance to look more closely at what happened this morning. One of our core routers experienced a still-unknown hardware error. This error caused the router to reboot in a semi-functional state. It retained enough configuration that our other core routers decided they did not need to assume authority or remove the problem router from the configuration.
Once this was determined, it was fairly simple for our engineers to remove the problem router from the network topology, allowing the other routers to take over. At that point, most of the network came back online.
A group of edge switches remained confused by the port problems from the primary router and refused to fail over their RSTP configurations to the other routers. These switches required manual intervention to bring them back online.
Neither of these problems was expected and we are still determining why they occurred. We will be making whatever changes are necessary to our network to ensure that our redundancy and failover configuration works optimally in the future. Additionally, we will be doing a full hardware diagnosis on the original problem router to determine exactly what hardware needs to be replaced.
I apologize for the inconvenience that this has caused, and wish to reassure everyone that we will address all identified problems and build an even stronger network.
Throughout the issue, the data center's spokespeople posted updates to Twitter, Facebook, and even WebHostingTalk's forums. This kind of transparency from a NOC is admirable and rare (which is why we try to emulate it ourselves, providing as much information as possible to our own customers in as many venues as possible). I've leased servers from other companies that felt several hours of downtime didn't require an explanation to anyone.
They called in every engineer they have to fix this issue in a timely manner. This is the first network-affecting outage they've had since 2004, which doesn't make today's situation acceptable, but it does put it in a bit of context.
We chose this particular data center as our server provider because of their commitment to reliability, redundancy, and expedient customer service. If anything, this morning's incident has solidified that view: having an issue of this magnitude resolved in a little over an hour is a testament to their dedication. In this industry especially, a company is best judged by how it reacts in a worst-case scenario.