I understand a certain level of displeasure at their lack of specificity while they mitigated the issue. But... in this case, the time to remediate doesn't really change your response to the threat. No matter what, you need to change all your keys: generate new private keys, get your certificates re-issued, and so on.
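For anyone who hasn't done it before, the re-key step itself is pretty mechanical: generate a fresh private key, build a CSR, get the cert re-issued, and revoke the old one. A minimal sketch using Python's cryptography package - the domain and file names are placeholders, not anything from the post:

    from cryptography import x509
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import rsa
    from cryptography.x509.oid import NameOID

    # Generate a brand-new private key; the old one has to be assumed compromised.
    key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

    # Build a CSR for the re-issued certificate (placeholder domain).
    csr = (
        x509.CertificateSigningRequestBuilder()
        .subject_name(x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "example.com")]))
        .sign(key, hashes.SHA256())
    )

    with open("example.key", "wb") as f:
        f.write(key.private_bytes(
            serialization.Encoding.PEM,
            serialization.PrivateFormat.TraditionalOpenSSL,
            serialization.NoEncryption(),
        ))
    with open("example.csr", "wb") as f:
        f.write(csr.public_bytes(serialization.Encoding.PEM))

You send the CSR to your CA, install the re-issued cert, then revoke the old one.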
They got it fixed within 48 hours, globally, which, if you ask me, is incredible at their scale.
I would hardly describe anything AWS does as amateur. But maybe that's just me.
Calling the team at AWS "amateurs" is a great way to discredit everything the author wrote. AWS is a gigantic infrastructure and they got everything fixed IN LESS THAN 48 HOURS. That is not an amateur response time.
Sorry their updates weren't to your liking but they were responding and posting bulletins the whole time and again: they solved the issue very quickly given the number of clients they support.
The way they communicated it was very, very amateurish. The client needed to know when they could re-key their certs - and it seems it was impossible to tell. That's something that needed to be done ASAP. Without knowing precisely when the environment that affected them was updated, this client couldn't get that done as quickly as they might have.
It's not amateurish, it was just poor communication in a situation that they've (fortunately) not had to deal with before. Which happens from time to time; and good companies recognize their failures and fix them. One thing I know about Amazon from friends who work there is that they don't tolerate failure. They have a culture of owning your mistakes and fixing them; anyone who doesn't buy into that attitude will get fired pretty quickly (and Amazon fires a lot of people).
It's pretty fucking professional to update the infrastructure that runs half the internet in under 48 hours with no issues. But again, communication can be a problem when you have as many customers as they do.
OP raised some legitimate concerns, but his credibility was undercut by attacking Amazon and calling them names. Ironically, his post was a much more amateur move as his concerns would likely be taken more seriously if he had stuck to the issues and not resorted to name-calling. The essence of professionalism is sticking to the issues at hand and not being sidetracked by extraneous factors.
> sorry their updates weren't to your liking but...
Folks, this is an account created 19 hours ago making half an apology and then rationalizing not listening to customers because they solved a "big" problem quickly. That's an ad hominum argument and as a result, a big tell on intent.
From my perspective, pushing out a new SSL build to a bunch of load balancers in a highly automated network like AWS is probably, by this point, a trivial task. Actually listening to the customer and responding decently is MUCH harder. Clearly it could be done better, which is the point of the post.
Rise above getting offended/scared about being called "amateurs" and start talking more about what goes on in that creepy black box that is AWS. You owe the world that much, at the very least.
> From my perspective, pushing out a new SSL build to a bunch of load balancers in a highly automated network like AWS is probably, by this point, a trivial task.
These responses I always see on HN when there is an AWS issue always show me how disconnected many of the commenters are from reality, or from ever being involved in a huge infrastructure.
Sure, the AWS status page isn't a hip, web 2.0, AJAX-backed, d3.js-powered dashboard. Yes, they don't update it every 3 minutes with new info, but many (most) of the problems that one-off customers see never affect enough customers to make it into a dashboard post. I do think they need to speed up their status updates, but these posts need to get OK'd by a decent number of people before they get thrown up.
There are usually multiple ELB instances living on every rack of every datacenter in every AZ in every region of AWS. Relaunching / patching hundreds of thousands of instances in 48 hours with minimal disruption to customers is a lot harder than you think.
I'm not disconnected from reality. I actually understand the problem at hand and understand it is a lot of work for some engineers. However, it's still likely it's old hat techniques by now, hence the 'trivial' remark. Also, the context of the comment I was responding to was correlating 'fixing in 48 hours' to not being marketing amateurs.
My primary point agrees with your second paragraph, which is that they could do better on the status updates. Unfortunately this has been going on for YEARS at AWS, so it's worth ratcheting up the tone when talking about it. It's important, and they need to fix it.
Hey there super sleuth... not sure what you're trying to imply but I'm not affiliated with Amazon nor do I use their AWS service. I do not speak for them. I am not making an apology on their behalf. I suggest you stop levying false accusations and avoid using words that you simply don't understand ie. ad hominum (not only did you spell it wrong, but you've misapplied it).
To clarify: I'm a long-time HN reader who finally got around to making an account (and certainly not for the express purpose of defending Amazon). However, I did want to call the author out on writing a terribly unfair, knee-jerk, heat-of-the-moment indictment of AWS. This type of thing is unfortunately all too common in the tech community: actual amateurs writing as if they are a central authority on subjects they have something approaching zero understanding of - for example, the multitude of complex engineering and PR challenges a service provider like AWS faces during something like the Great OpenSSL Exploit of 2014. What I'm trying to say is: cut them some slack. Their response seemed perfectly reasonable to me.
I spell things wrong all the time and sometimes hominem doesn't get caught by the spell checker. So fucking what? It's ironic that you call this out because, well, it's an ad hominem argument in and of itself. You are attacking one thing (or supporting one thing) to prove another point. Wikipedia says it best: a "claim or argument is rejected on the basis of some irrelevant fact". Claiming AWS isn't practicing amateur hour based on the fact they rolled out fixes in 48 hours is making an ad hominem argument. Amazon's marketing department is distinctly different from their engineering department. It's irrelevant that they are technically competent enough to patch this when you consider that the marketing/communications department could give two shits about how they tell anyone what has been fixed.
In retrospect, what I should have done is called out the blaming statements you made in your first post. That's what brought me to action and caused me to write my response the way I did. I should know better than trying to rationalize with someone who is in dissonance. BTW, narrowrail called you out below for this blaming statement here. Pay attention - people are giving you feedback. Take it or leave it.
Vote down all my comments if that makes you feel better. Karma is meant to burn. It's also a tell that this story dropped off the main page and I'm still getting downvotes on my comment. AWS koolaid much?
Oh, and FWIW, I am a super sleuth. A super sleuth of human behavior and emotional response. I also watch what I say about others, trying not to blame and indicate opinion where needed. That's why I said your behaviors were a 'tell on intent'. I have no idea who you are or why you created an account just to comment on this story, but I guarantee there is more to it than what meets the eye.
As a long-time HN reader, you should know that the condescending "super sleuth" mention was unnecessary. It should also be apparent that a highly defensive comment from a very new account would raise eyebrows. The "hope this helps" ending also comes off as passive-aggressive. We can do better.
The condescension and passive-aggression was fully intentional and really the only way to respond to such an asinine comment. Hope THIS helps; I could do better. Next time perhaps active aggression will be called for?
Their communication is not impressive. I work with AWS daily and I've been aware of issues LONG before they publicly announce anything on their status pages.
In this case, they probably didn't want to be too explicit about the details of patching tens of thousands of machines while the remediation was still ongoing.
I do agree that it's unexpectedly hard to find a link to the "security notice" page anywhere.
That outage notice you've linked includes a promise to improve communication.
Also, how'd you find that link? If you happened to just have it lying around, that's fine, but it would be better if these things were linked somewhere customers can find them when new ones are posted (like a page covering service post-mortems). The timeline is also missing little details like the year the outage happened, and a point of contact (it's signed by "the aws team") if you have questions.
Trust me, this doesn't cover all their issues. Don't get me wrong, we love our AWS stack, but we've had all sorts of weird stuff happen over the years, from zombie ELBs hitting the wrong hosts to invisible SGs... Sometimes we see from Twitter that others are hitting the same things, but AWS never confirms or denies anything; they just stall, then tell you when it's fixed. I guess it's support in the sense of having something to give to your PHB, but when something goes wrong within AWS, it's a very black box - which isn't surprising.
Sadly this page is not updated often enough.
I mean, when there's a known issue with one of the AWS services, this page displays a little "i" icon, which is barely visible.
And when you encounter a problem with your AWS stuff that clearly comes from their side, if the problem isn't widespread, they just say nothing. At that point you can search on your own for hours to be sure it's not your responsibility, and afterwards you just wait, blind.
You're absolutely right. The author of the post is just whining for the sake of whining. For every post hating on Amazon there's one hating on Heroku <http://www.holovaty.com/writing/aws-notes>. Best course of action is to just deal with it.
I read the words in your blog post and came to a completely different conclusion.
Despite not putting time stamps on their communications (which you seem really really upset about), they fixed everything for everybody in like a single day. You, their customer (and better still, me, their customer) didn't have to lift a finger.
This is exactly why we farm out infrastructure to companies like Amazon. They have a whole squad of smart people standing around leaning against a post all day every day waiting for something like this to happen so they can jump in and fix everything for us.
My little one-player business has no such team o' dudes on standby. If it weren't for the fact that Amazon is cleaning up for me, I'd be two days into having a really bad time and getting no productive work done.
Actually, they had to re-key their certificates. They did have work to do, and they couldn't do it till the upgrades at AWS were done.
So yeah, you'd still be having a bad day. Because you have no team o' dudes on standby, your already busy staff are champing at the bit to get the certificates sorted out, potentially unable to restore service, and constantly checking to see whether Amazon has completed its patching. i.e. little to no productive work is being done.
The author suggests that Amazon only made one post, updating it throughout the process. But at the link above you will see four posts regarding this issue, with the first one having been updated once to add information. No, they are not timestamped, but they are dated.
While it may be fair to criticize AWS for its customer communications during this process, I'm fairly certain that if he had a backstage pass to such a comprehensive process of remediating thousands of production systems with zero downtime, he would perhaps find the team somewhat less... amateurish.
Obviously they were working as fast as they possibly could without risking major outages. They probably had millions of servers to update.
I'd even argue that it's not a good idea to advertise, "these servers are vulnerable to this attack".
AWS is massive and organizing that kind of update by an army of engineers isn't easy.
You received a non-personalized message because AWS support probably heard from tens of thousands of irate customers demanding that their systems be patched immediately. For some reason they weren't equipped to handle that kind of volume, but I'm sure they will learn from this and hopefully next time the response will be faster, if possible.
TFA gives you a clue why you need to know when all ELBs are updated:
* "However, due to nobody outside of AWS knowing exactly how ELB works, this could just mean that the machines currently responding to the requests are patched, but in the next request, it could hit an unpatched ELB machine."
* "We then wrote to support to hear if it was now safe to re-key the certificates, but did not hear from them for hours."
Summary: You re-key your certificates, thinking you are all good. Now an attacker hits a non-patched ELB, exploits the issue and gets your new keys.
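You can even see the moving-target part from outside: the ELB hostname resolves to a rotating set of nodes, so a one-off "does this IP look patched?" spot check proves nothing about the rest of the fleet. A rough sketch (the hostname is a made-up placeholder):

    import socket
    import time

    ELB_HOSTNAME = "my-app-123456789.us-east-1.elb.amazonaws.com"  # placeholder

    seen = set()
    for _ in range(10):
        # Each DNS lookup may return a different subset of the ELB's nodes.
        infos = socket.getaddrinfo(ELB_HOSTNAME, 443, proto=socket.IPPROTO_TCP)
        ips = {info[4][0] for info in infos}
        new = ips - seen
        if new:
            print("new ELB nodes seen:", sorted(new))
        seen |= ips
        time.sleep(30)

    print("distinct nodes observed:", len(seen))

Which is exactly why the only useful signal is AWS saying "every node behind your ELB is patched" - you can't verify it yourself.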
So you pay for revocation of certificates twice. The first time you revoked the certificates, someone could have compromised them immediately afterwards. How is that a good solution?
Is payment for revocation standard practice? I've reissued and revoked probably a dozen Comodo certs (both directly and via Namecheap as a reseller) without issue.
So ... as a professional organization, why does OpsBeat continue to contract with such an amateur organization? You could scale up your own servers, round-robin DNS (more for ELB), etc if you're not happy with their performance.
Instead, a group of dedicated professionals updated a world-wide infrastructure in less than two days. If you were running your own systems would you have managed that? Yes, you would have known exactly when you were done but could you predict ahead of time when you'd be done?
So, as engineers we make trade-offs and AWS is a pretty clear winner when you look at the TCO of having a scalable architecture. Once you've made that trade-off, the down-side is that you don't have the ultimate flexibility provided by a bare-metal host.
- They managed to create and test a deployment procedure in a couple of hours.
- Deployed this update to thousands of machines spread across multiple continents in 48 hours.
- There was no downtime. No action required from customers.
- Everything seems to be in order now.
They managed to do everything they could without customer involvement. The rest (retiring keys and stuff) cannot be handled by them due to the nature of the problem.
Now that the technical issue has been resolved by other smart people, I'm gonna replace our cert keys and be just fine.
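If your TLS terminates at an ELB, the swap after re-keying is a couple of API calls once the CA hands back the re-issued cert. A rough sketch with boto3 against the classic ELB API - the load balancer name, certificate name, and file paths are placeholders:

    import boto3

    iam = boto3.client("iam")
    elb = boto3.client("elb")  # classic ELB

    with open("example.key") as f:
        private_key = f.read()
    with open("example.crt") as f:
        certificate = f.read()

    # Upload the re-issued cert under a new name so the old one stays around
    # until the listener switch has clearly taken effect.
    resp = iam.upload_server_certificate(
        ServerCertificateName="example-com-rekeyed",
        CertificateBody=certificate,
        PrivateKeyBody=private_key,
    )
    cert_arn = resp["ServerCertificateMetadata"]["Arn"]

    # Point the HTTPS listener at the new certificate.
    elb.set_load_balancer_listener_ssl_certificate(
        LoadBalancerName="my-load-balancer",
        LoadBalancerPort=443,
        SSLCertificateId=cert_arn,
    )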
Yes, and "sit there refreshing a page over and over waiting to see if they update it because they refuse to even give you a hint of when they expect to be done" is not acceptable.
Yes, that is not good enough, and it was already over a day into their complete lack of communication. Just because you don't agree with it, doesn't mean it is not an argument. I would expect a company of amazon's size to give customers precise and useful info. Not a vague "should be done in a few hours" after a day of silence.
Agreed. I'm seeing a lot of people here saying "The service is great - we didn't have to do anything!". Sure hope someone tells those people to re-key their certificates at some point.
If they didn't already have a tested deployment procedure, that is a bigger indictment upon their professionalism than the woeful communication during the incident.
While the author is perhaps overly critical of the response of AWS to this issue, he has a point. When your business is suffering from an issue related to your provider and your customers are calling every 10 minutes to ask when you are going to fix it, you need your service provider to give you as much timely information as possible so you can relay it to your customers. The best thing they could have done would have been to provide gratuitous, empty information, every half hour. Even just saying "we're still working on it folks but no new information" will keep people calm and provide needed feedback for concerned individuals.
It is an appropriate insult when the party you're conducting business with is, or presents themselves as, professional.
It's about the expectations you convey to your customers. If you don't want "amateur" to sting, don't pretend not to be one. Make sure your customers have an accurate understanding of the services you provide and your ability to provide them.
Hopefully this openssl issue has shown large organizations the need to have a way to quickly roll out security patches, ideally even before the vendor has released an updated package.
Imagine tomorrow that someone finds a remotely exploitable kernel issue, perhaps involving UDP packet handling. If you have the right infrastructure in place, you should be able to drop the patch file in the right directory and run a script that builds a new system package, runs some automated testing, and then pushes that package out immediately using whatever rolling update strategy is normally used, but at an accelerated pace.
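As a sketch of what that "accelerated pace" push step could look like once the patched package has been built and tested - every host name, the package file, the service restarted, and the health-check URL here are hypothetical:

    import subprocess

    HOSTS = ["web1.internal", "web2.internal", "web3.internal"]  # hypothetical inventory
    PACKAGE = "libssl-patched.deb"                               # your rebuilt package
    BATCH_SIZE = 1  # raise this once the first batch looks healthy

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    for i in range(0, len(HOSTS), BATCH_SIZE):
        batch = HOSTS[i:i + BATCH_SIZE]
        for host in batch:
            run(["scp", PACKAGE, host + ":/tmp/"])
            # Install the patched library and restart whatever links against it
            # (nginx is just an example).
            run(["ssh", host, "sudo dpkg -i /tmp/" + PACKAGE + " && sudo service nginx restart"])
        # Smoke-test between batches so a bad build can't take down the whole fleet.
        for host in batch:
            run(["curl", "-fsS", "-o", "/dev/null", "https://" + host + "/healthz"])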
I wish I had time to build something that makes patching system packages on debian systems simpler, making it trivial for businesses to "fork" the distribution as necessary to work around issues (whether they be security critical or not). I've written more thoughts on the matter on my blog: http://stevenjewel.com/2013/10/hacking-open-source/
If you're managing a smaller set of servers, I've been pretty happy with apticron and nullmailer as a way to make sure security updates are applied everywhere. It'd be nice if it could receive notification of security issues faster, perhaps via some sort of push mechanism, but it at least gets things taken care of within 24 hours.
BTW, I remember from the CloudFlare blog that they were notified in advance of the bug and had already patched it. How come big names like AWS and Heroku did not get this prior information? Who decides which companies hear it before the public does?
From the Cloudflare blog: "This bug fix is a successful example of what is called responsible disclosure. Instead of disclosing the vulnerability to the public right away, the people notified of the problem tracked down the appropriate stakeholders and gave them a chance to fix the vulnerability before it went public."
Don't throw out the baby with the bathwater. I think most users care more that the vulnerability was patched within 48 hours than about the lack of proper communication.
In this case Amazon chose to focus on fixing the problem versus communicating every detail. The potential consequence in dollars of not fixing the issue in a timely manner is likely in the triple-digit millions. I would imagine the cost of having some small number of users complain about how they weren't up to date with communications isn't worth them focusing on it.
Also, I would imagine that customers such as Netflix and other major clients likely got more in-depth communication.
I'm sure it's not going to really affect Amazon's bottom line if opbeat decides to move to a different provider. If you are a very small fish in a big ocean, expect to be treated that way. It's sad to say that but that's the reality.
Personally, I'll take them getting this fixed as fast as possible, versus getting hourly updates telling me how they're still working on it. For example, I just want to hear that they found the MH370 plane. I'm tired of reading news stories with the same depressing message.
This seems highly pedantic and nitpicky. It seems like you're looking for things to criticize.
Use of Heroku as an example of communications leadership is misplaced. I have had support requests sit around unanswered for days at Heroku. Don't get me wrong - I love Heroku. But you have to pay them an arm and a leg monthly in order to get quick support response times. AWS, on the other hand, seems very responsive to all requests.
The fact that all of this was fixed so quickly given the size of AWS infrastructure is in and of itself very impressive. Sometimes there isn't much to say except "We're working on it". You can argue semantics, but half the internet was in the same boat yesterday, and sure, maybe they communicated poorly, but "amateur hour" isn't fair imho.
I have experienced issues with several online services, as I expect everyone here has as well. I am impressed with how Heroku handled it, with mandatory updates every 3 hours, clear instructions to their customers, and apologies when the procedures caused inconvenience. Not everyone handles these things with such care.
Amateur hour at upbeat.com ... OP clearly has limited experience with large hosting vendors.
It's frustrating to wait for updated information, but AWS delivered reasonable details as they were available. Could it improve? Probably. Does it rate worse than other vendors? No, not at all. Consider recent Rackspace or Azure outages, information follows hours later, sometimes days.
This seems a bit harsh. While it would have been nice to have a bit more transparency, calling them amateur is not realistic. It takes time to patch a bunch (read: thousands) of servers, and a one-day turnaround is not all that bad.
If you think about the sheer number of logical load balancers they have to update (tens of thousands?) and the time it took them to update all of them, I'm incredibly impressed by their quick turn around.