I understand a certain level of displeasure at their lack of specificity while they mitigated the issue. But... in this case, the time to remediate doesn't really change your response to the threat. No matter what, you need to change all your keys: generate new private keys, get your certificates re-issued, and so on.
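For anyone who hasn't done it before, the re-key step itself is pretty mechanical: generate a fresh private key, build a CSR, get the cert re-issued, and revoke the old one. A minimal sketch using Python's cryptography package - the domain and file names are placeholders, not anything from the post:

    from cryptography import x509
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import rsa
    from cryptography.x509.oid import NameOID

    # Generate a brand-new private key; the old one has to be assumed compromised.
    key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

    # Build a CSR for the re-issued certificate (placeholder domain).
    csr = (
        x509.CertificateSigningRequestBuilder()
        .subject_name(x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "example.com")]))
        .sign(key, hashes.SHA256())
    )

    with open("example.key", "wb") as f:
        f.write(key.private_bytes(
            serialization.Encoding.PEM,
            serialization.PrivateFormat.TraditionalOpenSSL,
            serialization.NoEncryption(),
        ))
    with open("example.csr", "wb") as f:
        f.write(csr.public_bytes(serialization.Encoding.PEM))

You send the CSR to your CA, install the re-issued cert, then revoke the old one.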
They got it fixed within 48 hours, globally, which, if you ask me, is incredible at their scale.
I would hardly describe anything AWS does as amateur. But maybe that's just me.
Calling the team at AWS "amateurs" is a great way to discredit everything the author wrote. AWS is a gigantic infrastructure and they got everything fixed IN LESS THAN 48 HOURS. That is not an amateur response time.
Sorry their updates weren't to your liking but they were responding and posting bulletins the whole time and again: they solved the issue very quickly given the number of clients they support.
The way they communicated it was very, very amateurish. The client needed to know when they could re-key their certs - and it seems it was impossible to tell. That's something that needed to be done ASAP. Without knowing precisely when the environment that affected them was updated, this client couldn't get that done as quickly as they might have.
It's not amateurish, it was just poor communication in a situation that they've (fortunately) not had to deal with before. Which happens from time to time; and good companies recognize their failures and fix them. One thing I know about Amazon from friends who work there is that they don't tolerate failure. They have a culture of owning your mistakes and fixing them; anyone who doesn't buy into that attitude will get fired pretty quickly (and Amazon fires a lot of people).
It's pretty fucking professional to update the infrastructure that runs half the internet in under 48 hours with no issues. But again, communication can be a problem when you have as many customers as they do.
OP raised some legitimate concerns, but his credibility was undercut by attacking Amazon and calling them names. Ironically, his post was a much more amateur move as his concerns would likely be taken more seriously if he had stuck to the issues and not resorted to name-calling. The essence of professionalism is sticking to the issues at hand and not being sidetracked by extraneous factors.
> sorry their updates weren't to your liking but...
Folks, this is an account created 19 hours ago making half an apology and then rationalizing not listening to customers because they solved a "big" problem quickly. That's an ad hominum argument and as a result, a big tell on intent.
From my perspective, pushing out a new SSL build to a bunch of load balancers in a highly automated network like AWS is probably, by this point, a trivial task. Actually listening to the customer and responding decently is MUCH harder. Clearly it could be done better, which is the point of the post.
Rise above getting offended/scared about being called "amateurs" and start talking more about what goes on in that creepy black box that is AWS. You owe the world that much, at the very least.
> From my perspective, pushing out a new SSL build to a bunch of load balancers in a highly automated network like AWS is probably, by this point, a trivial task.
These responses I always see on HN when there is an AWS issue always show me how disconnected many of the commenters are from reality, or from ever being involved in a huge infrastructure.
Sure, the AWS status page isn't a hip, web 2.0, AJAX-backed, d3.js-powered dashboard. Yes, they don't update it every 3 minutes with new info, but many (most) of the problems that one-off customers see never affect enough customers to make it into a dashboard post. I do think they need to speed up their status updates, but these posts need to get OK'd by a decent number of people before they get thrown up.
There are usually multiple ELB instances living on every rack of every datacenter in every AZ in every region of AWS. Relaunching / patching hundreds of thousands of instances in 48 hours with minimal disruption to customers is a lot harder than you think.
I'm not disconnected from reality. I actually understand the problem at hand and understand it is a lot of work for some engineers. However, it's still likely it's old hat techniques by now, hence the 'trivial' remark. Also, the context of the comment I was responding to was correlating 'fixing in 48 hours' to not being marketing amateurs.
My primary point agrees with your second paragraph, which is that they could do better on the status updates. Unfortunately this has been going on for YEARS at AWS, so it's worth ratcheting up the tone when talking about it. It's important, and they need to fix it.
Hey there super sleuth... not sure what you're trying to imply but I'm not affiliated with Amazon nor do I use their AWS service. I do not speak for them. I am not making an apology on their behalf. I suggest you stop levying false accusations and avoid using words that you simply don't understand ie. ad hominum (not only did you spell it wrong, but you've misapplied it).
To clarify: I'm a long-time HN reader who finally got around to making an account (and certainly not for the express purpose of defending Amazon). However, I did want to call the author out on writing a terribly unfair, knee-jerk, heat-of-the-moment indictment of AWS. This type of thing is unfortunately all too common in the tech community: actual amateurs writing as if they are a central authority on subjects they have something approaching zero understanding of - for example, the multitude of complex engineering and PR challenges a service provider like AWS faces during something like the Great OpenSSL Exploit of 2014. What I'm trying to say is: cut them some slack. Their response seemed perfectly reasonable to me.
I spell things wrong all the time and sometimes hominem doesn't get caught by the spell checker. So fucking what? It's ironic that you call this out because, well, it's an ad hominem argument in and of itself. You are attacking one thing (or supporting one thing) to prove another point. Wikipedia says it best: a "claim or argument is rejected on the basis of some irrelevant fact". Claiming AWS isn't practicing amateur hour based on the fact they rolled out fixes in 48 hours is making an ad hominem argument. Amazon's marketing department is distinctly different from their engineering department. It's irrelevant that they are technically competent enough to patch this when you consider that the marketing/communications department could give two shits about how they tell anyone what has been fixed.
In retrospect, what I should have done is called out the blaming statements you made in your first post. That's what brought me to action and caused me to write my response the way I did. I should know better than trying to rationalize with someone who is in dissonance. BTW, narrowrail called you out below for this blaming statement here. Pay attention - people are giving you feedback. Take it or leave it.
Vote down all my comments if that makes you feel better. Karma is meant to burn. It's also a tell that this story dropped off the main page and I'm still getting downvotes on my comment. AWS koolaid much?
Oh, and FWIW, I am a super sleuth. A super sleuth of human behavior and emotional response. I also watch what I say about others, trying not to blame and indicate opinion where needed. That's why I said your behaviors were a 'tell on intent'. I have no idea who you are or why you created an account just to comment on this story, but I guarantee there is more to it than what meets the eye.
As a long-time HN reader, you should know that the condescending "super sleuth" mention was unnecessary. It should also be apparent that a highly defensive comment from a very new account would raise eyebrows. The "hope this helps" ending also comes off as passive-aggressive. We can do better.
The condescension and passive-aggression was fully intentional and really the only way to respond to such an asinine comment. Hope THIS helps; I could do better. Next time perhaps active aggression will be called for?
Their communication is not impressive. I work with AWS daily and I've been aware of issues LONG before they publicly announce anything on their status pages.
In this case, they probably didn't want to be too explicit about the details of patching tens of thousands of machines while the remediation was still ongoing.
I do agree that it's unexpectedly hard to find a link to the "security notice" page anywhere.
That outage notice you've linked includes a promise to improve communication.
Also, how'd you find that link? If you happened to just have it lying around, that's fine, but it would be better if these things were linked somewhere customers can find them when new ones are posted (like a page covering service post-mortems). The timeline is also missing little details like the year the outage happened, and a point of contact (it's signed by "the aws team") if you have questions.
Trust me, this doesn't cover all their issues. Don't get me wrong, we love our AWS stack, but we've had all sorts of weird stuff happen over the years, from zombie ELBs hitting the wrong hosts to invisible SGs... Sometimes we see from Twitter that others are hitting the same things, but AWS never confirms or denies anything; they just stall, then tell you when it's fixed. I guess it's support in the sense of having something to give to your PHB, but when something goes wrong within AWS, it's a very black box - which isn't surprising.
Sadly this page is not updated often enough.
I mean, when there's a known issue with one of the AWS services, this page displays a little "i" icon, which is barely visible.
And when you encounter a problem with your AWS stuff that clearly comes from their side, if the problem isn't widespread, they just say nothing. At that point you can search on your own for hours to be sure it's not your responsibility, and afterwards you just wait, blind.
You're absolutely right. The author of the post is just whining for the sake of whining. For every post hating on Amazon there's one hating on Heroku <http://www.holovaty.com/writing/aws-notes>. Best course of action is to just deal with it.
I read the words in your blog post and came to a completely different conclusion.
Despite not putting time stamps on their communications (which you seem really really upset about), they fixed everything for everybody in like a single day. You, their customer (and better still, me, their customer) didn't have to lift a finger.
This is exactly why we farm out infrastructure to companies like Amazon. They have a whole squad of smart people standing around leaning against a post all day every day waiting for something like this to happen so they can jump in and fix everything for us.
My little one-player business has no such team o' dudes on standby. If it weren't for the fact that Amazon is cleaning up for me, I'd be two days into having a really bad time and getting no productive work done.
Actually, they had to re-key their certificates. They did have work to do, and they couldn't do it till the upgrades at AWS were done.
So yeah, you'd still be having a bad day. Because you have no team o' dudes on standby, your already busy staff are champing at the bit to get the certificates sorted out, potentially unable to restore service, and constantly checking to see whether Amazon has completed its patching. i.e. little to no productive work is being done.
The author suggests that Amazon only made one post, updating it throughout the process. But at the link above you will see four posts regarding this issue, with the first one having been updated once to add information. No, they are not timestamped, but they are dated.
While it may be fair to criticize AWS for its customer communications during this process, I'm fairly certain that if he had a backstage pass to such a comprehensive process of remediating thousands of production systems with zero downtime, he would perhaps find the team somewhat less... amateurish.
Obviously they were working as fast as they possibly could without risking major outages. They probably had millions of servers to update.
I'd even argue that it's not a good idea to advertise, "these servers are vulnerable to this attack".
AWS is massive and organizing that kind of update by an army of engineers isn't easy.
You received a non-personalized message because AWS support probably heard from tens of thousands of irate customers demanding that their systems be patched immediately. For some reason they weren't equipped to handle that kind of volume, but I'm sure they will learn from this and hopefully next time the response will be faster, if possible.
TFA gives you a clue why you need to know when all ELBs are updated:
* "However, due to nobody outside of AWS knowing exactly how ELB works, this could just mean that the machines currently responding to the requests are patched, but in the next request, it could hit an unpatched ELB machine."
* "We then wrote to support to hear if it was now safe to re-key the certificates, but did not hear from them for hours."
Summary: You re-key your certificates, thinking you are all good. Now an attacker hits a non-patched ELB, exploits the issue and gets your new keys.
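You can even see the moving-target part from outside: the ELB hostname resolves to a rotating set of nodes, so a one-off "does this IP look patched?" spot check proves nothing about the rest of the fleet. A rough sketch (the hostname is a made-up placeholder):

    import socket
    import time

    ELB_HOSTNAME = "my-app-123456789.us-east-1.elb.amazonaws.com"  # placeholder

    seen = set()
    for _ in range(10):
        # Each DNS lookup may return a different subset of the ELB's nodes.
        infos = socket.getaddrinfo(ELB_HOSTNAME, 443, proto=socket.IPPROTO_TCP)
        ips = {info[4][0] for info in infos}
        new = ips - seen
        if new:
            print("new ELB nodes seen:", sorted(new))
        seen |= ips
        time.sleep(30)

    print("distinct nodes observed:", len(seen))

Which is exactly why the only useful signal is AWS saying "every node behind your ELB is patched" - you can't verify it yourself.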
So you pay for revocation of certificates twice. The first time you revoked the certificates, someone could have compromised them immediately afterwards. How is that a good solution?
Is payment for revocation standard practice? I've reissued and revoked probably a dozen Comodo certs (both directly and via Namecheap as a reseller) without issue.
So ... as a professional organization, why does OpsBeat continue to contract with such an amateur organization? You could scale up your own servers, round-robin DNS (more for ELB), etc if you're not happy with their performance.
Instead, a group of dedicated professionals updated a world-wide infrastructure in less than two days. If you were running your own systems would you have managed that? Yes, you would have known exactly when you were done but could you predict ahead of time when you'd be done?
So, as engineers we make trade-offs and AWS is a pretty clear winner when you look at the TCO of having a scalable architecture. Once you've made that trade-off, the down-side is that you don't have the ultimate flexibility provided by a bare-metal host.
- They managed to create and test a deployment procedure in a couple of hours.
- Deployed this update to thousands of machines spread across multiple continents in 48 hours.
- There was no downtime. No action required from customers.
- Everything seems to be in order now.
They managed to do everything they could without customer involvement. The rest (retiring keys and stuff) cannot be handled by them due to the nature of the problem.
Now that the technical issue has been resolved by other smart people, I'm gonna replace our cert keys and be just fine.
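If your TLS terminates at an ELB, the swap after re-keying is a couple of API calls once the CA hands back the re-issued cert. A rough sketch with boto3 against the classic ELB API - the load balancer name, certificate name, and file paths are placeholders:

    import boto3

    iam = boto3.client("iam")
    elb = boto3.client("elb")  # classic ELB

    with open("example.key") as f:
        private_key = f.read()
    with open("example.crt") as f:
        certificate = f.read()

    # Upload the re-issued cert under a new name so the old one stays around
    # until the listener switch has clearly taken effect.
    resp = iam.upload_server_certificate(
        ServerCertificateName="example-com-rekeyed",
        CertificateBody=certificate,
        PrivateKeyBody=private_key,
    )
    cert_arn = resp["ServerCertificateMetadata"]["Arn"]

    # Point the HTTPS listener at the new certificate.
    elb.set_load_balancer_listener_ssl_certificate(
        LoadBalancerName="my-load-balancer",
        LoadBalancerPort=443,
        SSLCertificateId=cert_arn,
    )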
Yes, and "sit there refreshing a page over and over waiting to see if they update it because they refuse to even give you a hint of when they expect to be done" is not acceptable.
Yes, that is not good enough, and it was already over a day into their complete lack of communication. Just because you don't agree with it, doesn't mean it is not an argument. I would expect a company of amazon's size to give customers precise and useful info. Not a vague "should be done in a few hours" after a day of silence.
Agreed. I'm seeing a lot of people here saying "The service is great - we didn't have to do anything!". Sure hope someone tells those people to re-key their certificates at some point.
If they didn't already have a tested deployment procedure, that is a bigger indictment upon their professionalism than the woeful communication during the incident.
While the author is perhaps overly critical of the response of AWS to this issue, he has a point. When your business is suffering from an issue related to your provider and your customers are calling every 10 minutes to ask when you are going to fix it, you need your service provider to give you as much timely information as possible so you can relay it to your customers. The best thing they could have done would have been to provide gratuitous, empty information, every half hour. Even just saying "we're still working on it folks but no new information" will keep people calm and provide needed feedback for concerned individuals.
It is an appropriate insult when the party you're conducting business with is, or presents themselves as, professional.
It's about the expectations you convey to your customers. If you don't want "amateur" to sting, don't pretend not to be one. Make sure your customers have an accurate understanding of the services you provide and your ability to provide them.
Hopefully this openssl issue has shown large organizations the need to have a way to quickly roll out security patches, ideally even before the vendor has released an updated package.
Imagine tomorrow that someone finds a remotely exploitable kernel issue, perhaps involving UDP packet handling. If you have the right infrastructure in place, you should be able to drop the patch file in the right directory and run a script that builds a new system package, runs some automated testing, and then pushes that package out immediately using whatever rolling update strategy is normally used, but at an accelerated pace.
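As a sketch of what that "accelerated pace" push step could look like once the patched package has been built and tested - every host name, the package file, the service restarted, and the health-check URL here are hypothetical:

    import subprocess

    HOSTS = ["web1.internal", "web2.internal", "web3.internal"]  # hypothetical inventory
    PACKAGE = "libssl-patched.deb"                               # your rebuilt package
    BATCH_SIZE = 1  # raise this once the first batch looks healthy

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    for i in range(0, len(HOSTS), BATCH_SIZE):
        batch = HOSTS[i:i + BATCH_SIZE]
        for host in batch:
            run(["scp", PACKAGE, host + ":/tmp/"])
            # Install the patched library and restart whatever links against it
            # (nginx is just an example).
            run(["ssh", host, "sudo dpkg -i /tmp/" + PACKAGE + " && sudo service nginx restart"])
        # Smoke-test between batches so a bad build can't take down the whole fleet.
        for host in batch:
            run(["curl", "-fsS", "-o", "/dev/null", "https://" + host + "/healthz"])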
I wish I had time to build something that makes patching system packages on debian systems simpler, making it trivial for businesses to "fork" the distribution as necessary to work around issues (whether they be security critical or not). I've written more thoughts on the matter on my blog: http://stevenjewel.com/2013/10/hacking-open-source/
If you're managing a smaller set of servers, I've been pretty happy with apticron and nullmailer as a way to make sure security updates are applied everywhere. It'd be nice if it could receive notification of security issues faster, perhaps via some sort of push mechanism, but it at least gets things taken care of within 24 hours.
BTW, I remember from the CloudFlare blog that they were notified in advance of the bug and had already patched it. How come big names like AWS and Heroku did not get this prior information? Who decides which companies hear it before the public does?
From the Cloudflare blog: "This bug fix is a successful example of what is called responsible disclosure. Instead of disclosing the vulnerability to the public right away, the people notified of the problem tracked down the appropriate stakeholders and gave them a chance to fix the vulnerability before it went public."
Don't throw out the baby with the bathwater. I think most users care more that the vulnerability was patched within 48 hours than about the lack of proper communication.
In this case Amazon chose to focus on fixing the problem versus communicating every detail. The potential consequence in dollars of not fixing the issue in a timely manner is likely in the triple-digit millions. I would imagine the cost of having some small number of users complain about how they weren't up to date with communications isn't worth them focusing on it.
Also, I would imagine that customers such as Netflix and other major clients likely got more in-depth communication.
I'm sure it's not going to really affect Amazon's bottom line if opbeat decides to move to a different provider. If you are a very small fish in a big ocean, expect to be treated that way. It's sad to say that but that's the reality.
Personally, I'll take them getting this fixed as fast as possible, versus getting hourly updates telling me how they're still working on it. For example, I just want to hear that they found the MH370 plane. I'm tired of reading news stories with the same depressing message.
This seems highly pedantic and nitpicky. It seems like you're looking for things to criticize.
Use of Heroku as an example of communications leadership is misplaced. I have had support requests sit around unanswered for days at Heroku. Don't get me wrong - I love Heroku. But you have to pay them an arm and a leg monthly in order to get quick support response times. AWS, on the other hand, seems very responsive to all requests.
The fact that all of this was fixed so quickly given the size of AWS infrastructure is in and of itself very impressive. Sometimes there isn't much to say except "We're working on it". You can argue semantics, but half the internet was in the same boat yesterday, and sure, maybe they communicated poorly, but "amateur hour" isn't fair imho.
I have experienced issues with several online services, as I expect everyone here has as well. I am impressed with how Heroku handled it, with mandatory updates every 3 hours, clear instructions to their customers, and apologies when the procedures caused inconvenience. Not everyone handles these things with such care.
Amateur hour at upbeat.com ... OP clearly has limited experience with large hosting vendors.
It's frustrating to wait for updated information, but AWS delivered reasonable details as they were available. Could it improve? Probably. Does it rate worse than other vendors? No, not at all. Consider recent Rackspace or Azure outages, information follows hours later, sometimes days.
This seems a bit harsh. While it would have been nice to have a bit more transparency, calling them amateur is not realistic. It takes time to patch a bunch (read: thousands) of servers, and a one-day turnaround is not all that bad.
If you think about the sheer number of logical load balancers they have to update (tens of thousands?) and the time it took them to update all of them, I'm incredibly impressed by their quick turn around.