
A drunk employee kills all of the websites you care about

365 Main, a datacenter on the edge of San Francisco's Financial District, is popular with Soma startups for its proximity and its state-of-the-art facilities. Or it used to be, anyway, until a power outage took down sites including Craigslist, Six Apart's TypePad and LiveJournal blogging sites, local listings site Yelp, and blog search engine Technorati. The cause? You won't believe it.

A source close to the company says:

Someone came in shitfaced drunk, got angry, went berserk, and fucked up a lot of stuff. There's an outage on 40 or so racks at minimum.
Whoever it is, while we like how you roll in theory, in practice, we'd appreciate it if you laid off the servers running websites we actually use. (Update: I no longer know whether to trust the source who sent in the tip about a drunk employee.)

We're sure 365 Main will deny that such a thing could ever happen. And, conveniently, the neighborhood is having power troubles, too. But here's a question: When you have several levels of redundant power, what could bring your customers' servers down other than something like an employee physically ripping the plugs out of the wall? Or, with less effort, hitting the emergency-power-off switch that San Francisco's building codes require 365 Main to install?

Update: Technorati's Dave Sifry just sent this email:

Folks,

I just wanted to let you know, it looks like San Francisco is having a MAJOR power event, with outages from the Financial district all the way down to Daly City. One of our colos at 365 Main Street has experienced a power outage (never mind that they always swear up and down that this kind of event can't possibly happen, oh no, they have multiple redundant systems and they charge us up the wazoo to make sure that we'll have business continuity, so of course, this isn't really happening, oh yes) however, our other data centers are all up and running, so we hope to be back up and running as quickly as possible.

I'll keep you all updated on progress, and I appreciate you bearing with us as we work our way through this...

Dave

2:42 PM on Tue Jul 24 2007
By Owen Thomas
147,514 views
52 comments

Comments

  • The power in SOMA is coming on and off intermittently. I don't think this drunk guy did that too!

  • I think your source is wrong. The power's going off all over town (see twitter), not just at one data center. PG&E just told a co-worker that 10,000 people are affected, and it will be fixed between now and 4:30.

  • RedEnvelope Reports Two Years of Continuous Uptime at 365 Main's San Francisco Data Center

    [www.prnewswire.com]

  • I'm down the block from 365 and the power is fine here.

    Does 365 have no backup power supplies? A single point of failure for all power? Shame on them. Even those amateurs at One Wilshire seem to never be affected by outages.

  • The power went out at 425 Market Street for about three minutes after 1pm Tuesday. Scads of fire trucks rushed to the site.

  • You all live in a real cool place! Where do I sign up?

  • Never ascribe to malice what can easily be explained by the usual hosting provider's propensity to screw everything up.

  • We're in colo 2 and all of our power circuits are up. (Most of our racks are fed from two different PDUs - we haven't lost either.)

    Six Apart had some stuff in colo 4 at one point - not sure if they're still in that room or what.

  • This is the second time in two years that 365Main has dropped offline. Totally unacceptable. And totally the reason why you need multiple datacenters regardless of any promises your provider makes to you.

  • haha, these other sites should take a tip from redenvelope.com: (who also has their servers @ 365 Main)
    "Customer Service Is our Top Priority. That's why we're working today to upgrade our system and make it easier to use."

    It's not an outage... it's an upgrade!

  • Why do you still have this shitty rumor up, when it's clear that this is because of the power outage? Would you just believe any random crap anyone wrote in? Because I'm voting for "gremlins" if that's the case. That, or Death Eaters in honor of the Potterdammerung.

  • While I doubt it's due to a "drunk employee", having a power outage is NO EXCUSE for a datacenter, especially one such as 365 Main. The power should never go out.

  • From TypePad:

    The TypePad service is currently unavailable due to power issues at
    our co-location facility. This means that the TypePad application and
    your TypePad blog are not reachable at this time. This began at
    approximately 1:50 pm Pacific Daylight Time today, Tuesday July 24
    2007.

    We are working closely with our hosting partner to bring TypePad back
    online as soon as possible. We sincerely apologize for the
    inconvenience this is causing, and we appreciate your patience. We
    will send another email update with more information as soon as
    possible.

    Thank you,
    The TypePad team

  • So, I've never been inside 365 Main but based on my experience in the colos I have been in (Globix, Exodus, SVColo) I'm calling bullsh*t. I have a hard time believing that Craigslist, Typepad, Netflix, and the like are just in shared racks, rather than in their own cage.

    Getting to the power plugs (or trashing servers) inside a cage is a pretty big PITA unless you have the key/combo.

    My take is that unless it was a 365 Main employee that got into the battery room and went all Pete Townshend on the place I'm going to go ahead and believe them.

    (just my $0.02US)

  • I'd also be surprised if this story were true. Seems like a power outage in SF. [laughingsquid.com] And for the record, it wasn't me. [valleywag.com]

  • I have 20 servers at a different colo in SF and even though they experienced some power issues, their own power generators have kept everything up and running just fine.

    I don't see why 365Main couldn't have handled it the same way. Obviously, like this article suggests, something else is the reason for this giant blunder. I don't think they'll ever admit what has happened though.

  • Well, it might have been a power outage, but isn't a site like this supposed to have UPS and generators, so they stay lit no matter how long the lights are out outside?

  • Second Life's asset server is having issues because of these power cuts too. I 'spect there's a fair few angry residents, as per the usual.

  • @Figaro: Yep. You could hear diesel generators cranking away while walking around Soma during the outage. 365 Main? Quiet except for the rack-and-stackers waiting to get inside. No generator noise there.

  • I have servers inside of 365 Main.

    They are up.

    They have been up all day.

    I am shelled into them right now.

    Please, refrain from uninformed blanket statements and judgements of the facility and the companies that occupy it based on second-hand (or further afield) information.

    Yes, there are some big sites there. There are both shared hosting and dedicated co-location facilities.

  • @rekoil: yes, they are supposed to. I have a quarter rack @ 365... I've toured the colo a couple times... they have HUGE back-up power generators. Very strange that they didn't kick in.

  • What would Valleywag be saying if they were hosted at 365 Main now? Hmmmm.....

  • It sounds like only one of the power circuits went down - it didn't cut over correctly to the UPSes during the outage. Sounds like a pretty bad thing to happen to such a well-regarded DC.

  • SixApart?

    I felt a grave disturbance in the Force - as if millions of fanfic writers cried out at once, and were suddenly silenced...

  • Even funnier since 365 just put out this press release [www.365main.com]


  • Purple monkey, that's easy!

    Server not Found
    Firefox can't find the server at www.valleywag.com.


    * Check the address for typing errors such as ww.example.com instead of www.example.com

    * If you are unable to load any pages, check your computer's network connection.

    * If your computer or network is protected by a firewall or proxy, make sure that Firefox is permitted to access the Web.

    * Remember, this is Valleywag and anything posted to it as a news item is as likely to be fact as fiction.



  • I work in a datacenter and generators are only meant to run for so long before power will be needed. Redundancy in power, telco carriers, and proper cooling is key, so be sure to talk to someone about HOW the redundancy works when power goes out, or if a carrier drops. A lot of people fall for the "24/7/365" blurb without actually verifying.


  • A press release from March describes the backup power at their 5 data centers. It is ironically headlined:

    PG&E RECOGNIZES 365 MAIN FOR DATA CENTER
    ENERGY REDUCTION PROGRAMS

    San Francisco Data Center Energy Reductions Contribute to Power Reliability
    and Environmental Protection


    [365main.com]

    Pacific Gas and Electric Company has recognized
    365 Main's San Francisco data center for its noteworthy energy-reduction accomplishments while participating in PG&E's Critical Peak Pricing (CPP) program. The CPP program is designed to curtail energy load during critical peak days to offset the possibility of an energy emergency...

    The most significant power reduction, however, was attributed to an innovative testing procedure for the building's back-up generators.

    Each of 365 Main's five national data centers is equipped with powerful back-up generators to ensure customer uptime in the event of a power outage. In 365 Main's founding data center in San Francisco, the company maintains ten 2.1 MW (megawatt) generators manufactured by Hitec. These generators, known as Continuous Power System (CPS) generators, run 24 hours a day, ready to deliver 100 percent power to the data center in the event of an outage.

    As part of a comprehensive preventative maintenance program, 365 Main tests each Hitec generator once a month by running each of the 3000 horsepower diesel engines for two hours. By replacing a dated, inefficient generator-testing procedure, 365 Main reduced utility power consumption by as much as 12.5 percent during monthly tests.

  • I would agree with Mat-Honan and DavidU. I'm sitting in Telecom One right now on the back-side of 365 Main, and we're still having some rippling issues.

    There are, in fact, very large, redundant diesel generators under the foundation in the basement, so understandably from the outside it's possible that one might not hear them.
    However, I wouldn't put it past 365 to hire an incompetent, asshole employee. Their churn rate for security alone is ridiculous. They go through at least a whole set of guards/NOC techs every few weeks or so. It's not a rotation either, most have said they've just seen people quit.
    365 is a good looking facility, and that's about it.

    Redundancy in all aspects is important in this regard as well.

  • Nice to know that any drunk employee can wander in and take out his aggression on my Livejournal server! Thanks a lot Mr. Killjoy! Now how in the heck am I gonna publish my Mayor McCheese - Harry Potter fan fiction story!?

    By the way, the power outage hasn't affected any other part of San Francisco. The Presidio area is just dandy.

  • @Figaro: I think it's about 10 zillion times more likely to be a technical fuckup (they can't switch over to the generators due to stupid) than the Very, Very Emo Disgruntled Employee story.

    Who, coincidentally, went on his angry drunken rampage just as power issues hit the city? Occam's Razor says uh-uh.

  • More importantly, what type of cocktail or alcohol was this person drinking? I need to stock some for my personal use, yet maintain a policy of zero tolerance for this concoction for my employees.

  • AND... it looks like (nsfw) KINK.com is also affected... dammit!!

    - Chuck

  • As a current 365 Main customer I can tell you that there are unhappy employees there. I wasn't interested enough to find out why, but they exist.

    I can also tell you that they have a power kill switch for each room. At least, if my memory serves me correctly, they do.

    Each room is probably something like a megawatt. What do you think is going to happen to the local power grid if you just shunt 1 megawatt back on to it? PG&E aren't going to take that lying down and bits of the city will probably trip... Hmmmm....

    As we've seen all over the country, this can have a "knock on" effect and take out more bits of the city... Or the entire North East.

    It's possible. I wouldn't rule out a drunken Emo just yet.

    365 Main use Hitec power backup systems which are inertia driven, not batteries, so the chances of it just going wrong by "stupid" are remote at best. I dunno, I'm thinking there MIGHT be something to this.

  • This appears to be human error because 365main is designed to a minimum of n+1 redundancy so even if the power was out in SF for a week they would be powered by their gensets. However, human error often gets in the way. 99% of datacenter outages are caused by human error, not failure of equipment.
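
For what it's worth, the N+1 claim is easy to put rough numbers on. The sketch below uses the ten 2.1 MW generators from the press release quoted in an earlier comment. It assumes "N+1" means one unit can be out of service while the rest carry the load; the 18 MW load is a made-up figure for illustration, not 365 Main's actual draw.

```python
# Figures from the press release quoted in the comments.
GENERATOR_COUNT = 10
GENERATOR_MW = 2.1


def usable_capacity_mw(units, unit_mw, spares=1):
    """Capacity left when `spares` units are out of service (N+1 means spares=1)."""
    if spares >= units:
        raise ValueError("need at least one unit in service")
    return (units - spares) * unit_mw


def survives(load_mw, units, unit_mw, failed):
    """True if the remaining units can still carry the load."""
    return (units - failed) * unit_mw >= load_mw


# ASSUMPTION: hypothetical building load, chosen only for illustration.
HYPOTHETICAL_LOAD_MW = 18.0

if __name__ == "__main__":
    print(round(usable_capacity_mw(GENERATOR_COUNT, GENERATOR_MW), 1), "MW with one unit down")
    for failed in range(3):
        ok = survives(HYPOTHETICAL_LOAD_MW, GENERATOR_COUNT, GENERATOR_MW, failed)
        print(failed, "unit(s) failed:", "load carried" if ok else "load dropped")
```

Under these assumptions the building rides out one failed unit but not two, and none of it helps if the units never take over in the first place.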

  • A couple years ago they pointed out that all of SF's power came from a southern direction and there was no redundancy from the North (Marin) or the East Bay, so if power is out, there's no flow from another end to pick up the slack - it terminates in SF. Doesn't seem very smart and maybe they never got it fixed?

  • Here's what really went down at 365main:

    365main, like all facilities built by Above.net back in the day, doesn't have a battery backup UPS. Instead, they have these things called "CPS", or continuous power systems. What they are is very very large flywheels that sit between electric motors and generators. So the power from PG&E never directly touches 365main. PGE power drives the motors which turn the flywheels which then turn the generators (or alternators, I don't remember the exact details) which in turn power the facility. There are 10 of these on their roof.

    The flywheels (the CPS system) can run the generator at full load for up to 60 seconds according to the specs.

    There are also 10 large diesel engines up on the roof, connected to these flywheels. If the power is out for more than 15 seconds, the generators start up, and clutch in and drive the flywheels. There are no generators in the basement. (There is a large fuel storage in the basement, and the fuel is pumped up to the roof. There are smaller fuel tanks on the roof as well.)

    Here's what I think happened. Since there were several brief outages in a row before the power went out for good, it seems that the CPS (flywheel) systems weren't fully back up to speed when the next outage occurred. Since several of these grid power interruptions happened in a row, and were shorter than the time required to trigger generator startup, the generators were not automatically started, BUT the CPS didn't have time to get back up to full capacity. By the 6th power glitch, there wasn't enough energy stored in the flywheels to keep the system going long enough for the diesel generators to start up and come to speed before switching over.

    Why they didn't just manually switch on the generators at that point is beyond me.

    So they had a brief power outage. By our logs, it looks like it was at the most 2 minutes, but probably closer to 20 seconds or so.

    Here's the letter they sent to their customers about this:

    This afternoon a power outage in San Francisco affected the 365 Main St. data
    center. In the process of 6 cascading outages, one of the outages was not
    protected and reset systems in many of the colo facilities of that building.
    This resulted in the following:

    - Some of our routers were momentarily down, causing network issues. These
    were resolved within minutes. Network issues would have been noticed in our
    San Francisco, San Jose, and Oakland facilities.

    - DNS servers lost power and did not properly come back up. This has been
    resolved after about an hour of downtime and may have caused issues for many
    GNi customers that would appear as network issues

    - Blades in the BC environment were reset as a result of the power loss.
    While all boxes seem to be back up we are investigating issues as they come in

    - One of our SAN systems may have been affected. This is being checked on
    right now

    If you have been experiencing network or DNS issues, please test your
    connections again. Note that blades in the DVB environment were not affected.

    We apologize for this inconvenience. Once the current issues at hand are
    resolved, we will be investigating why the redundancy in our colocation power
    did not work as it should have, and we will be producing a postmortem report.
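
The flywheel theory in the comment above can be poked at with a toy model. This is only a sketch of that commenter's hypothesis, not 365 Main's actual control logic. The 60-second ride-through and the 15-second start threshold come from the comment; the diesel spin-up time, the flywheel recharge rate, and the outage timings are all invented assumptions.

```python
FULL_RIDE_THROUGH_S = 60.0  # from the comment: flywheel carries full load for up to 60 s
START_THRESHOLD_S = 15.0    # from the comment: diesels start after 15 s without grid power
SPINUP_S = 10.0             # ASSUMPTION: time for a diesel to start and come to speed
RECHARGE_RATE = 0.1         # ASSUMPTION: seconds of ride-through regained per second on grid


def simulate(events):
    """Run a sequence of (outage_seconds, grid_seconds_afterwards) pairs.

    Returns the 1-based index of the outage that drops the load,
    or None if the flywheel covers every outage.
    """
    stored = FULL_RIDE_THROUGH_S
    for index, (outage_s, grid_s) in enumerate(events, start=1):
        if outage_s <= START_THRESHOLD_S:
            # Short glitch: the diesels never start, so the flywheel covers all of it.
            needed = outage_s
        else:
            # Long outage: the flywheel must last until the diesel is up to speed.
            needed = START_THRESHOLD_S + SPINUP_S
        if stored < needed:
            return index
        stored -= needed
        # Grid power (or the diesel) slowly spins the flywheel back up.
        stored = min(FULL_RIDE_THROUGH_S, stored + grid_s * RECHARGE_RATE)
    return None


if __name__ == "__main__":
    # ASSUMPTION: five 10 s glitches 20 s apart, then a long outage.
    cascade = [(10.0, 20.0)] * 5 + [(600.0, 0.0)]
    print("cascade drops load at outage:", simulate(cascade))
    print("single long outage drops load at:", simulate([(600.0, 0.0)]))
```

With these made-up numbers a single long outage is handled, while the sixth outage in the cascade finds the flywheel too depleted to bridge the gap until the diesel takes over. Different assumed recharge rates or spin-up times shift which outage fails.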

  • I was at 365main colo 3 working on my system when the power went out. Luckily none of my servers were on at the time. The room went dark and a lot quieter! It was crazy. The generators are not in the basement but on the roof, and I could hear them from inside and you could also hear them and smell the gasoline from outside too. Then the lights were back on half dimmed. There was no drunken employee (anyone that has worked with 365main would believe me). 365Main sent out a statement saying the building as of 4pm was 100% operational and still running on generators until PG&E can confirm that utility power is stable.

  • 365 Main was also working on a major electrical upgrade all week, and the outage might have been bad timing for them...


  • How is it that frickin' Nordstrom kept their store, just down the street from 365, open, and their computers running, sans power in the neighborhood, and that the same power outage crashed the biggest server farm on the west coast? Was EVERYONE at 365 Main liquored?

  • This kind of story is why I read valleywag far less these days.

  • They may charge for the back up power service but do they actually provide it?

  • Sorry. Didn't mean to whiz on the ciscos. I'm ok now though, 'cept for this damn headache.