Can the drones deliver my packets?
February 28, 2017 12:16 PM

 
This is a big black eye for Amazon, but I don't think it'll affect them too much. There is too much integration with their services for too many applications for this to result in anything more than a shrug.

Also this tweet is excellent.


And their status page.
posted by durandal at 12:20 PM on February 28, 2017 [11 favorites]


Well that may be a problem I'll just check if my site is working with IsItDownRightNow. Oh. Wait. No.
posted by The Bellman at 12:21 PM on February 28, 2017 [13 favorites]


Boo US-EAST-1! Yes US-WEST-1! Surely the best region, and not at all the next one likely to have a major outage.
posted by Going To Maine at 12:22 PM on February 28, 2017 [10 favorites]


Our call systems are down here at work. Which is fine by me. I work for a telecommunications company and so we're working on e-mails until it gets resolved. Hehe, the office is in good spirits, pancake day is helping, though I fear for when it comes back online and we're smashed with a thousand calls all at once.
posted by Fizz at 12:22 PM on February 28, 2017 [1 favorite]


we have always been at war with US-EAST-1
posted by radicalawyer at 12:24 PM on February 28, 2017 [71 favorites]


Just tried to check something on Strava and it's down. "Strava site issues due to Amazon Web Services outage."
posted by spikeleemajortomdickandharryconnickjrmints at 12:24 PM on February 28, 2017 [2 favorites]


Had a little bit of panic when Github went, but my office has yet to start using Github's issue tracker.
posted by ocschwar at 12:24 PM on February 28, 2017


their status page

there was an amusing period when that page was itself affected
posted by thelonius at 12:24 PM on February 28, 2017 [4 favorites]


Coincidentally, I work for a company that makes call systems and this is causing some outages. I would say that my office is in somewhat ... worse spirits.
posted by one of these days at 12:25 PM on February 28, 2017 [2 favorites]


Had a little bit of panic when Github went, but my office has yet to start using Github's issue tracker.

GitHub seems to be up?
posted by Going To Maine at 12:27 PM on February 28, 2017


Github file upload on issues down
Slack file upload down
etc etc
posted by hleehowon at 12:27 PM on February 28, 2017 [3 favorites]


One day it's all going to come tumbling down and then where will you get your cat .gifs?
posted by the uncomplicated soups of my childhood at 12:29 PM on February 28, 2017 [3 favorites]


Is it a coincidence that this happened the day after Google turned on their new robot?
posted by durandal at 12:31 PM on February 28, 2017 [1 favorite]


One day it's all going to come tumbling down and then where will you get your cat .gifs?

I'll scan them in manually.
posted by zombieflanders at 12:31 PM on February 28, 2017 [8 favorites]


Google.
posted by Autumnheart at 12:32 PM on February 28, 2017 [1 favorite]


I'll scan them in manually

How? And why?
posted by Sys Rq at 12:33 PM on February 28, 2017 [8 favorites]


One day it's all going to come tumbling down and then where will you get your cat .gifs?

The copy machine.
posted by Fizz at 12:33 PM on February 28, 2017 [2 favorites]


Just make new ones and then upload them to a personal host via FTP?

My cats will be famous!
posted by Autumnheart at 12:36 PM on February 28, 2017


The red checkmark was indeed hosted on S3.
posted by floatboth at 12:40 PM on February 28, 2017 [5 favorites]


Gah. My last day on a project and I can't commit any code because no tests will run since our servers are running on AWS.
posted by octothorpe at 12:43 PM on February 28, 2017 [3 favorites]


My rails app looks fine until you try to retrieve an actual audio file . . .
posted by aspersioncast at 12:44 PM on February 28, 2017


Just make new ones and then upload them to a personal host via FTP?

That's what I thought 15 years ago. Oh how things come full circle. It will always remain my greatest pleasure in memory of monsieur le chat Malo, seeing cats take over the web.
posted by fraula at 12:45 PM on February 28, 2017 [2 favorites]


Good thing it's Mardi Gras day and nobody is working anyway. At least here in New Orleans.
posted by Bringer Tom at 12:46 PM on February 28, 2017 [2 favorites]


I must confess I'm madly curious about the AV Club article "that doesn't reflect our values" that they were madly apologizing for on their Twitter feed.
posted by Kitteh at 12:50 PM on February 28, 2017


That explains Quora being down I guess.

Not that I go there.
posted by leotrotsky at 12:51 PM on February 28, 2017 [2 favorites]


I blame Faye Dunaway and Warren Beatty.
posted by theora55 at 12:51 PM on February 28, 2017 [3 favorites]


My old job (the one that laid me off because the new boss played favorites instead of basing it on skill, not that I'm bitter...) uses a ton of the education tools that are now down because of this. I find myself filled with mild glee.

Not to say my current job isn't also suffering from this, but everyone here is a lot more tech-savvy and we're all mostly just "lol the cloud! Am I right?"
posted by INFJ at 12:52 PM on February 28, 2017 [2 favorites]


Have they really come full circle, though? Is there any real difference between uploading something to "the cloud" vs. buying your own domain and uploading stuff to a hosting account? I can't think of one, except that you don't need an FTP program.
posted by Autumnheart at 12:52 PM on February 28, 2017 [1 favorite]


we have always been at war with US-EAST-1

I think most people know exactly what US-EAST-1 values are.
posted by leotrotsky at 12:53 PM on February 28, 2017 [6 favorites]


Nvm, I found it and yeah, it was gross.
posted by Kitteh at 12:53 PM on February 28, 2017


Also, shouldn't AWS have better redundancy so that massive outages don't take a ton of clients down? Seems like that would be a reasonable goal of Amazon's.
posted by Autumnheart at 12:54 PM on February 28, 2017


> I must confess I'm madly curious as to what the AV Club article "that doesn't reflect our values" they were madly apologizing for on their twitter feed.

It's this story AFAIK: Well, that was fast: Turns out Oscars viral star “Gary from Chicago” was fresh out of jail.
posted by skynxnex at 12:54 PM on February 28, 2017 [2 favorites]


Oops, sorry for not seeing your NVM. Re: AV Club. It's weird they had a (now deleted, I guess?) tweet that said they had edited it. Maybe they were in the workflow process but didn't publish it before the S3 (and more really!) outage made it impossible.

Back to Amazon: this has brought down some of our products, and for the things running through Heroku we can't even really check the status, since you can't log into Heroku and their API is down. Ah well.
posted by skynxnex at 12:56 PM on February 28, 2017


Autumnheart - see b1tr0t's comment.

"Usually you have to pay extra for that kind of action, Cotton" Only there's no usually here.
posted by k5.user at 12:57 PM on February 28, 2017


b1tr0t: deffo times and places where all the AZ's are fucked

so you have some orchestration to also have a google compute dealie or something...
posted by hleehowon at 1:02 PM on February 28, 2017 [1 favorite]


Yeah, my understanding is that all the AZs in US-EAST-1 were down for S3 and other services, so you'd have to do multi-region which is harder and more expensive often times (but still what you need to do to be really HA).
posted by skynxnex at 1:04 PM on February 28, 2017 [2 favorites]


How many nines?
posted by indubitable at 1:05 PM on February 28, 2017 [4 favorites]


Is Slack down?
posted by awfurby at 1:08 PM on February 28, 2017


from the status page, still a lot of red icons!!

Update at 12:52 PM PST: We are seeing recovery for S3 object retrievals, listing and deletions. We continue to work on recovery for adding new objects to S3 and expect to start seeing improved error rates within the hour

A lot of devops discussions will be occurring in the next few weeks :-)
posted by sammyo at 1:08 PM on February 28, 2017


Slack isn't down, but the ability to upload items/attach items in messages is down.
posted by INFJ at 1:09 PM on February 28, 2017


I work for a software company. We just migrated our product to AWS over the weekend. I had the good fortune to take this week off. 8-)
posted by kevinbelt at 1:13 PM on February 28, 2017 [12 favorites]


MeFi has become an alert system better than Twitter (I'd dread having to "watch" the twit constantly)

From a certain (perhaps biased :) mefi: BREAKING: Books are still working just fine.
posted by sammyo at 1:17 PM on February 28, 2017 [4 favorites]


I was wondering why my site at the NIH was down and I had a moment of fear that the administration had cancelled my entire institute without telling me.
posted by Sophie1 at 1:21 PM on February 28, 2017 [13 favorites]


Software company here as well, and I had to demo our product a few times right as AWS went down. Luckily we also have a parallel EU site so I switched over to that and our customers were impressed that I could still show them something when so much of their other stuff was down. A+ for our devops team.
posted by rmless at 1:25 PM on February 28, 2017 [9 favorites]


You knew the Russian hackers wouldn't like all those negative WaPo articles about Trump, Jeff.
posted by jamjam at 1:39 PM on February 28, 2017 [4 favorites]


BREAKING: Books are still working just fine.

Yeah, I'll just pull up a book from my Kindle Clo ...oh.
posted by leotrotsky at 1:40 PM on February 28, 2017 [2 favorites]


I noticed that I wasn't able to track a recent Amazon order, and thought "uh oh...".
posted by newfers at 1:42 PM on February 28, 2017


You knew the Russian hackers wouldn't like all those negative WaPo articles about Trump, Jeff.

Oh, crap. There's actually a nonzero chance this is the reason.
posted by leotrotsky at 1:42 PM on February 28, 2017 [5 favorites]


SREs draw blades
shame pours from US-EAST-1
absolution knife
posted by Abehammerb Lincoln at 1:42 PM on February 28, 2017 [5 favorites]


I was wondering why my site at the NIH was down and I had a moment of fear that the administration had cancelled my entire institute without telling me.

That's scheduled for Wednesday.
posted by srboisvert at 1:53 PM on February 28, 2017 [2 favorites]


I was finally getting used to my new project management/to-do list app, and now this happened.

I'm going back to plastering my workspace with post-it notes. Only major downtime is when one falls between my desk and the wall.
posted by Kabanos at 2:09 PM on February 28, 2017 [1 favorite]


Boo US-EAST-1! Yes US-WEST-1! Surely the best region, and not at all the next one likely to have a major outage.
posted by Going To Maine


Pfft.
posted by Celsius1414 at 2:10 PM on February 28, 2017


Books are still working just fine.

Books? You mean, like... printed out web pages?
posted by fatbird at 2:11 PM on February 28, 2017 [1 favorite]


I guess I was thinking more from an individual standpoint, as opposed to a corporate standpoint. That being said, I wonder how much money these companies are losing right now and kind of come back to the thought that sometimes hosting your own isn't the worst idea in the world.
posted by Autumnheart at 2:12 PM on February 28, 2017


Is it possible it's affecting domestic flights?
posted by saladin at 2:12 PM on February 28, 2017


Tweet of the day: Area SRE Regrets Storing S3 Outage Contingency Plan on S3

Comment on /r/aws suggests it is not far from the truth. S3 is pretty mind-boggling, but it is almost a single point of failure: a lot of the AWS (and other) infrastructure takes advantage of, and relies on, the super fast and cheap place to drop files.
posted by sammyo at 2:23 PM on February 28, 2017 [2 favorites]


Enquiring minds currently stuck at Jacksonville International Airport want to know...
posted by saladin at 2:25 PM on February 28, 2017 [1 favorite]




> BREAKING: Books are still working just fine.

As credit card processors?
posted by The corpse in the library at 2:26 PM on February 28, 2017 [2 favorites]


I'm kind of shocked that people who are doing Business Things with AWS that involve real money don't at least use multi-region replication.
posted by indubitable at 2:29 PM on February 28, 2017 [3 favorites]


Enquiring minds currently stuck at Jacksonville International Airport want to know...

enquiring minds should probably just sit at Shula's and drink more.
posted by Dr. Twist at 2:31 PM on February 28, 2017 [2 favorites]


It's pretty easy to think you don't depend on a single global service.
posted by Skorgu at 2:40 PM on February 28, 2017 [3 favorites]


If you have business-critical applications where downtime is measured in millions per hour, I question the wisdom of not existing in multiple AZs and regions.
posted by vuron at 3:07 PM on February 28, 2017 [1 favorite]


Can the drones deliver my packets

A Seagate 8 TB drive weighs, according to Amazon, a little under 2 lbs. Let's call it 2.
An MQ-1 Predator ("drone") has a payload of 450 lbs, according to the USAF.
That makes 225 drives, or 1800 TB.
The median Comcast subscriber uses 88 GB a month. Call 1 TB = 1000 GB.
So a drone could 'deliver' 20454 household-months of data.

The long latency means it's not good for commercial nets, though.
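
For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It uses only the figures quoted above (8 TB per drive, 2 lbs per drive, 450 lbs of payload, 88 GB per household per month, 1 TB = 1000 GB); the function and constant names are made up for illustration, and none of the input figures are independently verified here.

```python
# Figures as quoted in the comment; treated as assumptions, not verified specs.
DRIVE_CAPACITY_TB = 8
DRIVE_WEIGHT_LB = 2            # rounded up from "a little under 2 lbs"
PAYLOAD_LB = 450
GB_PER_TB = 1000
MEDIAN_USAGE_GB_PER_MONTH = 88


def drone_payload():
    """Return (drives, capacity in TB, whole household-months of data)."""
    drives = PAYLOAD_LB // DRIVE_WEIGHT_LB
    capacity_tb = drives * DRIVE_CAPACITY_TB
    # Integer division: only count complete household-months.
    household_months = capacity_tb * GB_PER_TB // MEDIAN_USAGE_GB_PER_MONTH
    return drives, capacity_tb, household_months


drives, capacity_tb, household_months = drone_payload()
print(f"{drives} drives, {capacity_tb} TB, {household_months} household-months")
```

The result matches the numbers above: 225 drives, 1800 TB, and 20454 household-months once the fractional remainder is dropped.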
posted by the man of twists and turns at 3:07 PM on February 28, 2017 [9 favorites]


I love so very much that you worked that out.

Speaking of nets though, there'd better be a good one on the ground for receiving that payload, because you didn't account for the chute (unless it's built into the fudging on the drive weight).

Is this basically a Cory Doctorow novel?
posted by aspersioncast at 3:25 PM on February 28, 2017 [1 favorite]


It is a black eye for anyone who calls themselves a software architect yet completely disregards the AWS availability zones.

Multiple AZs wouldn't help you in this outage. S3 is region-specific. Distributing across AWS regions is a lot trickier than across AZs.
posted by RobotVoodooPower at 3:25 PM on February 28, 2017 [5 favorites]


So a drone could 'deliver' 20454 household-months of data.

The long latency means it's not good for commercial nets, though.


Also, my neighbor has a BB gun. So it's not good for the ping rate.
posted by parliboy at 3:27 PM on February 28, 2017 [6 favorites]


a Predator would laugh at a BB gun. it's a full-sized aircraft.
posted by indubitable at 3:32 PM on February 28, 2017 [1 favorite]


Books are still working just fine.

I backed one of those on Kickstarter once. Wasn't impressed.
posted by Mr.Encyclopedia at 3:39 PM on February 28, 2017 [1 favorite]


I wonder if people read that S3 resiliency is achieved by multiple availability zones, and didn't consider that Amazon doesn't move your data between regions without your express instruction. And S3 feels quite global, and people don't secure MongoDB, let alone think out a multi-region strategy for something with 9 elevens of mumble...

So the whole region has a problem, and it's the component that you had already mentally checked off as bombproof. And like that — it's gone.
posted by Wrinkled Stumpskin at 3:44 PM on February 28, 2017


Clearly in order to have a reliable cloud deployment, you need to deploy to at least 3 separate cloud services, so if one goes down the other two can form a quorum.
posted by idiopath at 3:45 PM on February 28, 2017 [6 favorites]


From a certain (perhaps biased :) mefi: BREAKING: Books are still working just fine.

Yes, my Kindle, which is in airplane mode until I finish The Traitor Baru Cormorant, is working just fine. Does the main character ever start acting like the genius she's supposed to be? I'm ready to give up and let the library delete it.
posted by betweenthebars at 3:54 PM on February 28, 2017 [1 favorite]


i think you're right, idiopath. even the savviest of tech firms got bitten by this because HA has come to imply bombproof, when really the probabilistic event is possible, but not likely.
posted by j_curiouser at 3:56 PM on February 28, 2017


N.B. the rule of "at least 3 so there's a quorum" is a common theme in distributed systems where you need strong consistency amongst multiple nodes, and my suggestion of using 3 separate cloud providers is merely a joke riffing on that theme, rather than carefully considered operations advice that would help you economically deploy a highly available service. I Am Not Your System Architect YMMV etc.
posted by idiopath at 4:06 PM on February 28, 2017 [2 favorites]


I've discovered that the occasional outage (that does not result in loss of life or property) reminds your customers not to take you for granted.
posted by RobotVoodooPower at 4:14 PM on February 28, 2017 [3 favorites]


Definitely the case on the maintenance side of the house. If things work too perfectly, an employee purge is just around the corner.
posted by Strange_Robinson at 4:25 PM on February 28, 2017 [1 favorite]


You do need to come up with a process to sync your data, or your highest priority data to other regions.
You are correct, but doing this right is not trivial for most large scale applications.
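
A rough sketch of only the easy half of this, reading from a second region when the first one fails, assuming the data has already been synced there by some other process. Everything in it is hypothetical: `read_with_failover`, `primary`, `replica` and `replica_store` are stand-ins invented for illustration, not any real AWS API. The hard parts (keeping the copies in sync, handling writes, deciding which copy is authoritative) are exactly what this sketch leaves out.

```python
from typing import Callable, Iterable, Optional

# A "fetcher" is any callable that returns an object's bytes for a key
# and raises an exception if its region cannot serve the request.
Fetcher = Callable[[str], bytes]


def read_with_failover(key: str, fetchers: Iterable[Fetcher]) -> bytes:
    """Try each region's fetcher in order and return the first success."""
    last_error: Optional[Exception] = None
    for fetch in fetchers:
        try:
            return fetch(key)
        except Exception as err:  # any failure means "try the next region"
            last_error = err
    raise RuntimeError(f"all regions failed for {key!r}") from last_error


def primary(key: str) -> bytes:
    # Hypothetical stand-in for a region that is currently down.
    raise ConnectionError("primary region unavailable")


# Hypothetical stand-in for a second region holding a previously synced copy.
replica_store = {"cat.gif": b"GIF89a"}


def replica(key: str) -> bytes:
    return replica_store[key]


print(read_with_failover("cat.gif", [primary, replica]))
```

Note that the replica may be stale if the sync lags behind, so this trades consistency for availability on reads.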
posted by primethyme at 4:59 PM on February 28, 2017 [2 favorites]


Every time I see us-east-1 I think of the old U.S. East realm in Diablo II.
posted by limeonaire at 5:00 PM on February 28, 2017 [3 favorites]


Clearly in order to have a reliable cloud deployment, you need to deploy to at least 3 separate cloud services, so if one goes down the other two can form a quorum.

Ideally five, so that if one goes down while you're doing maintenance on another, you'll still maintain quorum.
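
The arithmetic behind the three-versus-five joke, as a small Python sketch. It assumes a simple majority quorum (more than half of the nodes must agree); the function names are made up for illustration.

```python
def quorum_size(n: int) -> int:
    """Smallest number of nodes that forms a strict majority of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many nodes can be lost while a majority is still reachable."""
    return n - quorum_size(n)


for n in (3, 4, 5):
    print(f"{n} nodes: quorum of {quorum_size(n)}, "
          f"survives {tolerated_failures(n)} failure(s)")
```

Three nodes survive one failure and five survive two, which is why five covers an outage during maintenance. Four nodes still only survive one, so even counts buy nothing extra.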
posted by invitapriore at 5:00 PM on February 28, 2017 [1 favorite]


<strike>Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway</strike> - Predator drone filled with 8 TB drives buzzing through the air at 135 mph.


This outage has been randomly annoying in how many pages/services connect and/or load, but then their ostensibly XSS/CDS'ed resources fail to load or initialize at all, especially larger media files or embedded images and other media.

Something I've been vaguely pondering for roughly ten years about general cloud computing, be it storage, services or virtualization, is whether there is some kind of cross-platform service, either for small to medium infrastructure or for consumers, that will effectively and intelligently integrate an array of cloud/virtual/colo/real services for either failover protection for small services or atomized consumer end use.

A sort of smaller, consumer version of the data management, integration and redundancy tools that Big Data uses.

I know major infrastructure has this kind of failover, and I know there are solutions for medium-ish sites/services, but it seems like there's an emerging market for the small business, small site and/or consumer market that can integrate and manage data over a variety of online/cloud sources and offer a hybrid approach and sort of management front end for the wide variety of media and storage solutions that people use today.
posted by loquacious at 5:28 PM on February 28, 2017 [2 favorites]


AWS reminds me of a Too Big to Fail bank. So many other organizations directly and indirectly dependent on them, and impossible to forecast how failures would cascade throughout the system.
posted by mikemacman at 8:51 PM on February 28, 2017 [1 favorite]


Massive Internet Outrage, on the other hand, never ends
posted by atoxyl at 9:41 PM on February 28, 2017


<strike>Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway</strike> - Predator drone filled with 8 TB drives buzzing through the air at 135 mph.

And they said RFC 1149 would never be a scalable solution.
posted by radwolf76 at 10:06 PM on February 28, 2017 [2 favorites]


Oh.. PACKETS... thought that said pancakes.... I was hopeful for a moment.
posted by HuronBob at 10:12 PM on February 28, 2017 [1 favorite]


This is like saying that a microwave oven reminds you of the space shuttle because you don't know how either one works.

Ouch! A little harsh, no?

By 'system', I don't mean an organization's systems, I mean the global, interconnected web services ecosystem, in the same sense that a bank's failure can reverberate throughout the financial system in poorly understood ways.
posted by mikemacman at 9:26 AM on March 1, 2017 [1 favorite]


The thing is, AWS is pretty shockingly reliable, yeah? So I don't know that DevOps who didn't have a backup plan for this are going to necessarily push developing one to the top of the stack instead of gambling that such massive outages are one-off freaks.
posted by Going To Maine at 9:48 AM on March 1, 2017


No, the obscurity of AWS is not at all comparable to the financial system.

That said - If you’re an end user of both finance and web services that depend on AWS, the consequences of a failure of either one can appear exactly the same: far-reaching inability to work, potentially impacting other people who depend on that. Imagine the power outage that took out NYC a few years back. Perhaps a better metaphor, given that no one (I think?) died. Definable-but-obscure cause, far-reaching consequences.
posted by Going To Maine at 11:33 AM on March 1, 2017 [1 favorite]


The obscurity of AWS is not at all comparable to the financial system. Systemic risk in the banking sector is an unknown-unknown. AWS dependencies are extremely easy to know: you just look at the bill they send you!

But what about other services you use, that themselves use S3 or CloudFront? Your exposure to AWS failures isn't limited to your direct dependencies with them.
posted by mikemacman at 11:45 AM on March 1, 2017 [1 favorite]


b1tr0t, I think you're missing the point about the too-big-to-fail thing, both in terms of AWS and in terms of banks. Nobody really cares about the reasons for a failure. If you understood every cent on JPMorganChase's balance sheet, would that make it less of a problem if the bank were going to collapse? (You could make a case that understanding every cent could reduce risk, but unless you're Jamie Dimon, I don't know how much practical effect it would have.) Likewise, does knowing why AWS had problems do anything to help the people affected? Too-big-to-fail is inherently a consequentialist question. The different causes of bank failure and AWS failure mean little to someone affected by them.
posted by kevinbelt at 12:23 PM on March 1, 2017 [1 favorite]


People who lost their homes in the financial crisis will probably take exception at being compared to people who lost access to their Rare Pepe stash for a day.

Who is spewing out FUD?

My initial point was that AWS is so central to so much of the web services that we application developers all rely on (without even knowing that we're dependent on them, in the case of transitive dependencies via 3rd parties) that it is very hard to understand the consequences of AWS failure on our web infrastructure, just as it was very difficult for a single bank or hedge fund to understand their exposure to, say, Lehman failing.
posted by mikemacman at 1:23 PM on March 1, 2017


Do you get it? You are deliberately arguing against a point no one is making. No one is saying that we shouldn't understand how or why things fail. Of course we should.

Brake fluid lines fail differently than brake pads, and nearly all mechanics understand how. But if your brakes give out and you rear end the guy in front of you, that guy (to say nothing of the policeman who comes to the scene or the insurance adjuster who gets involved) isn't just going to say "oh, mechanics understand that, you're good to go". You're going to get a ticket and your insurance is going to pay for the damage. Somewhere on a bureaucratic form, there's a checkbox to determine whether the cause was the brake line, the pads, distracted driving, etc. But only the bureaucrat cares. The guy whose car got hit just wants his bumper replaced.

This isn't a question of knowing how things work. It's a question of antitrust regulation.

People whose businesses lost income due to AWS will probably take exception to being compared to a rare Pepe stash, whatever that is.
posted by kevinbelt at 2:14 PM on March 1, 2017 [2 favorites]


No one is saying that the AWS outage exposed a fundamental flaw in cloud computing. We're saying that AWS should be subject to increased oversight due to its dominant market position.
posted by kevinbelt at 2:19 PM on March 1, 2017 [1 favorite]


The power grid analogy is a decent one in some respects; there's a large subset of people who basically understand how to wire a lamp, a smaller subset who can calculate load for a whole breaker box or install an uninterruptible redundant battery backup, and a much smaller subset who actually understand how the power is being generated and distributed from the plant/across the grid.

A lot of people relying on AWS are analogous to that middle category, and plenty are using hardware-as-a-service so they don't have to think about how it all fits together, whether or not they have a basic understanding of distributed network fundamentals.
Not to mention that the extent to which all these different platforms are reliant on AWS isn't necessarily mapped out by some 2-person company building a rails app.

Does that really mean it should be subject to increased oversight? Maybe; the power grid is. The new FCC chair doesn't seem like the most forward-thinking individual, and I'd worry about some of the potential consequences for legislation regarding things like uptime, especially because there are already a ton of contractual obligations at work.
posted by aspersioncast at 5:41 AM on March 2, 2017


I'm surprised at the relative paucity of postmortems on the outage. Hopefully Amazon will publish something explaining what happened on their end. Meanwhile, can anyone suggest ways for AWS users to mitigate this kind of risk in the future, aside from DIY?
posted by philosophygeek at 7:07 AM on March 2, 2017


The "you should have been using multiple regions" thing is a bit facile. First of all, AWS availability zones within a region are supposed to be separate physical data centers with no shared points of failure ("even extremely uncommon disasters such as fires, tornados or flooding would only affect a single Availability Zone"), and the S3 documentation says it's designed for "99.999999999% durability" and "99.99% availability" by "redundantly storing your objects on multiple devices across multiple facilities." So that is what you are paying for.
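
To put the availability figure in concrete terms, here is a minimal Python sketch that converts a percentage into allowed downtime. It assumes a 365-day year and that the percentage is measured over the whole year, which may not match how any actual SLA is measured; the names are made up for illustration. The durability figure is a separate promise about not losing objects, so it is not converted here.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes per year a service may be unreachable at this availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% available: about {minutes:.1f} minutes of downtime per year")
```

Under those assumptions, 99.99% works out to a bit under an hour of downtime per year.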

So what are we to think when an entire region becomes unavailable? Either a bug in the S3 software was pushed to production, or there was some kind of cascading failure they didn't anticipate. If it's the former, then using another region only helps if Amazon kindly avoids pushing the bug to every region at once. If it's the latter, the same issue is lurking in every region. Either way, the service is not operating as advertised.

Second, because of network latency and the CAP theorem, there's no free lunch. You have to choose between availability and consistency when things fail. If it's a bank, I want them to choose to make the ATM unavailable if they can't guarantee my balance is correct. And I don't mind waiting a few seconds for a distributed transaction to ensure that. If it's a dumb web site, sure, I don't really care if my last comment disappears, and I don't want to wait many seconds to post something just to ensure a globally consistent comment thread. It all depends on the application.
posted by mubba at 8:50 AM on March 2, 2017 [1 favorite]




Nobody expects the "rm -rf /" typo!
posted by Xyanthilous P. Harrierstick at 1:01 PM on March 2, 2017




This thread has been archived and is closed to new comments