I'm continuing to investigate the story of the outage Tuesday at 365 Main's San Francisco datacenter that brought down some of the most well-known sites on the Internet. Right now, a 365 Main executive is blaming failures at 5 out of its 10 generators. That's right: Fully half of 365 Main's generators failed right as San Francisco experienced a power outage. More to come on this soon, but for now, here's the memo from Marcy Maxwell, 365 Main's head of security.
From: "Marcy Maxwell"
To: "Engineering"
Sent: 7/25/07 5:08 PM
Subject: UPDATE: POWER EVENT - Fourth Notice
UPDATE: 5:00 P.M., Wednesday, July 25, 2007
A complete investigation of the power incident continues with several specialists and 365 Main employees working around the clock to address the incident.
Generator/Electrical Design Overview
The San Francisco facility has ten 2.1 MW back-up generators to be used in the event of a loss of utility. The electrical design is N+2, meaning 8 primary generators can successfully power the building (labeled 1-8), with 2 generators available on stand-by (labeled Back-up 1 and Back-up 2) in case there are any failures with the primary 8.
Each primary generator backs up a corresponding colocation room, with generator 1 backing up colocation room 1, generator 2 backing up colocation room 2, and so on.
Series of Electrical Events
The following is a description of the electrical events that took place in the San Francisco facility following the power surge on July 24, 2007:
* When the initial surge was detected at 1:47 p.m., the building's electrical system attempted to roll all colocation rooms to diesel generator power.
* Generator 1 detected a problem in its start sequence and shut itself down within 8-10 seconds. The cause of the start-up failure is still under investigation though engineers have narrowed the list of suspected components to 2-3 items. We are testing each of these suspected components to determine if service or replacement is the best option. Generator 1 was started manually by on-site engineers and reestablished stable diesel power by 2:24 p.m.
* After initial failure, Generator 1 attempted to pass its 732 kW load to Back-up 1, which also detected a problem in its start sequence. The exact cause of the Back-up 1 start sequence failure is also under investigation.
* After Generator 1 and Back-up 1 failed to carry the 732 kW, the load was transferred to Back-up 2 which correctly accepted the load as designed.
* Generator 3 started up and ran for 30 seconds before it too detected a problem in the start sequence and passed an additional 780 kW to Back-up 2 as designed.
* Generator 4 started up and ran for 2 seconds before detecting a problem in the start sequence, passing its 900 kW load on to Back-up 2. This 900 kW brought the total load on Back-up 2 to over 2.4 MW, ultimately overloading the 2.1 MW Back-up 2 unit, causing it to fail. Generator 4 was manually started and brought back into operations at 2:22 p.m. Generator 4 was switched to utility operations at 7:05 a.m. on 7/25 to address an exhaust leak but is operational and available in the event of another outage.
* Generators 2, 5, 6, 7 and 8 all operated as designed and carried their respective loads appropriately.
* By 1:30 p.m. on Wednesday, July 25, after assurance from PG&E officials that utility power had been stable for at least 18 continuous hours, 365 Main placed diesel engines back in standby and switched generators 2, 5, 6, 7 and 8 to utility power.
* Customers in colocation rooms 2, 4, 5, 6, 7 & 8 are once again powered by utility, and are backed up in an N+1 configuration with Back-up 2 generator available.
* Generators that had failed during the start-up sequence but were performing normally after manual start (1 & 3) continue to operate on diesel and will not be switched back to utility until the root causes of their respective failures are corrected.
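The load figures in the list above can be checked with a short sketch. This is not part of the memo: the function and variable names are hypothetical, and it assumes the transferred loads simply add up on Back-up 2 and that Back-up 1 never carried any load. Only the kW figures and the 2.1 MW rating come from the memo.

```python
# Illustrative sketch only: replays the load transfers described in the memo.
# Names are hypothetical; the kW figures and 2.1 MW rating are from the memo.

BACKUP_RATING_KW = 2100  # each generator is rated at 2.1 MW

# Loads handed to Back-up 2, in the order the memo lists them.
TRANSFERS_KW = [
    ("Generator 1", 732),
    ("Generator 3", 780),
    ("Generator 4", 900),
]


def replay_transfers(transfers, rating_kw):
    """Return the running totals and the name of the first transfer
    that pushed the total above the rating (None if none did)."""
    total = 0
    totals = []
    tripped_by = None
    for name, load_kw in transfers:
        total += load_kw
        totals.append((name, total))
        if tripped_by is None and total > rating_kw:
            tripped_by = name
    return totals, tripped_by


totals, tripped_by = replay_transfers(TRANSFERS_KW, BACKUP_RATING_KW)
for name, total in totals:
    print(f"after {name}: {total} kW on Back-up 2")
print(f"rating first exceeded by the transfer from: {tripped_by}")
```

Under those assumptions the running total reaches 2,412 kW after the third transfer, roughly 312 kW above the 2,100 kW rating, which matches the memo's "over 2.4 MW".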
Other Discoveries
* In addition to previously known affected colocation rooms 1, 3 and 4, we have discovered that several customers in colo room 7 were affected by a 490 millisecond outage caused when the dual power input PDUs in colo 7 experienced open circuits on both sources. A dedicated team of engineers is currently investigating the PDU issue.
Next Steps
* Determine exact cause of generator start-up failure and PDU issues through comprehensive testing methodology.
* Replacements for all suspected components have been ordered and are en route.
* Continue to run generators 1 & 3 on diesel power until automatic start-up failure root cause is corrected.
* Continue to update customers with details of the ongoing investigation.
Regards,
Marcy
Marcy Maxwell
Vice President, Security
365 Main Inc.
"The World's Finest Data Centers"

Comments
I'm surprised that they keep these servers in San Francisco at all, at least without a back-up in a non-earthquake-prone location, like, say, Kansas. I'm sure SF is due for a good shaking any day now.
KANSAS! LOL! That's Dorothy Territory.
One big Tornado and the servers are on their way to OZ!
@deliriousnyc: My mind has been wandering similarly lately; this has only served to stoke the fires. What -will- happen when there's a big ole earthquake in SF? I'm sure this isn't the only datacenter in town.
@Rick: Meant to say what will happen to the internets?
The unanswered question here -- how often do they test their generators' startup sequences?
The likelihood of that much failure at once either points to a lack of testing of the generators, or a lack of honesty in what happened.
Or just an amazing amount of bad luck. Anything is possible. Just highly unlikely. (Sherlock Holmes's attitude on the highly unlikely notwithstanding.)
@synnik: The problem with testing backups is that while you do monthly tests as part of standard maintenance, it's not a *real* test - the load is manually transferred from utility power to generators and back again. This way, if a failure is detected, the building won't go dark. The only *real* test of a backup system, and the automatic failover mechanisms, is to literally pull the plug, which data centers are understandably reluctant to do, given the consequences of a failure - exactly what we saw happen here.
@rekoil, bullshit. There's no reason you can't shunt away from supplying the room on the up side but still simulate a service failure on the back side. I've done that, and so should 365 have been, although they obviously weren't.
It costs more to be able to isolate your generators (rather than insist on passing through them), but there's no reason you can't run the room off batteries like usual and the batteries off mains while pulling the gens out of the loop.