Downtime (staff.tumblr.com)
82 points by inmygarage on Dec 7, 2010 | 45 comments


Contrast this with DHH's response to today's Campfire outage: http://twitter.com/#!/dhh

Not only is 37signals refunding everyone for the month, DHH is replying to pretty much everyone personally to tell them so.


Everything about their response to that issue impresses me. Seriously, I hope to be as good when (not if) one of my screwups next makes it past the code designed to prevent my screwups from reaching daylight.

http://productblog.37signals.com/products/2010/12/explaining...


Tumblr has 1.5 million daily users. I think it might make more sense for them to write an apologetic blog post and reiterate how they're focussing on not letting this happen again :)


Also, it's free.


All Tumblr users received a full refund for not just a single month but an entire year!


37signals have always raked other companies over the coals for not being sincere enough in their apologies. It's impressive to see that they really do put their money where their mouth is.


this is awesome. not suggesting tumblr handled their situation poorly, but even the refund alone is stellar service. when i just read about this, i got the same feeling i did when netflix resolved a missing dvd on the phone with me in less than 5 minutes.

though, "We didn't earn your money with this service." is a little harsh. actually, lots of it seems harsh. cheer up dhh, we still like you (and i even got a few 500s i think).


I wonder how much of the incident and poor handling of the response is due to losing their particularly good CTO (Marco) a few months ago. I would be a bit concerned if I were a tumblr user or investor.

Experience with LiveJournal, Friendster, Twitter, etc. has been that problems don't just magically fix themselves; absent someone with enough vision to figure out potential problems and actually solve them in technology and business process, you're kind of fucked. In the case of Twitter, they had enough money to buy some excellent talent. In the case of LJ, Brad stepped up as an amazing hacker (with limited financial resources, and really one of the first to solve the problem). In the case of Friendster, bad executive leadership (CEO/board being ineffectual, VP Eng being a tool) killed them.


As a longtime Tumblr user, I can tell you that serious performance problems, frequent downtime blips, and bugs have been there for a long time, way before Marco left. I'm not saying they're his fault (he seems awesome), just that they don't correlate with his departure.


Maybe part of the problem is that NYC doesn't have as many people familiar with running high volume consumer web apps at scale?

They have banking people (who use a largely different technology stack, budget, and way of working), and people who have worked for smaller companies, but they don't have a huge number of veterans of large consumer webapps. Google's NYC office is probably more of a sink than a source for ops talent, and there isn't a large presence of declining giants to strip for engineers like there is in Silicon Valley.

Do people move from SF/SV to NYC to work for established companies? I could see this if you had personal reasons for wanting to move (being from NYC and going back, or just wanting to go there), but it doesn't seem like a compelling option otherwise.


People who don't use Tumblr will interpret this blog post differently from people who use Tumblr all the time. If you use + love Tumblr (as I do http://news.ycombinator.com/item?id=1973546) you're really less disappointed in this specific 24 hour failure (these things happen), and more disappointed about what's not being talked about. This would have been a great time to start opening up about some specific chronic issues, including communication style that is more close-lipped than Apple.

Still rooting for Tumblr like crazy, but bummed out :(


Perhaps those features will be available, for a fee. I mean, they're still not making any money. I hope they have some kind of plan up their sleeve.


I guess I don't grok the value added by Tumblr. What exactly do they offer that makes for a better value proposition than Wordpress or Posterous?


Like Twitter, Tumblr has some socialness that makes it quite easy to have a conversation larger than yourself. Visible Liking, Reblogging, and Following are implemented in really nice ways. Wordpress and Posterous are great, but you're pretty much by yourself. (Posterous has some basic socialness, but not like Tumblr) With Tumblr it's easy to discover stuff you're interested in, and have people find your stuff w/out having to rely on hustling your work. From the outside it's hard to tell (and just looks like a blog) but from the inside view of a user, it's really quite like Twitter not in format, but in experience.*

* I make many comparisons to Twitter here. They are extremely different. But they are both social and function like places, so I reference Twitter to make it easy for outsiders to grok.


Wordpress and Posterous are designed by geeks. Tumblr is designed by a nongeek.

First off, Tumblr is not primarily a blog. It's what we call a "tumblelog", which is a stupid word; I like to call it "vomit" or "spew". The fact that it's the sleekest blog platform out there is secondary to its larger function as a spewing platform.

Blogs are for creating. Spew is for recycling, for breaking things down into small teeny pieces, for streams and streams of things which are essentially meaningless but contribute to a larger whole. Now, not all blogs are strictly bloglike either; the Linked List format is somewhat in between the two, though Linked Lists are usually more disciplined in nature. Another way to draw the distinction is to say that blogs are for building things, while tumblelogs are for shaping them. Or you could call blogs classical music and tumblelogs jazz. One is looser, more freeform, more about the movement than about the individual notes. The other is about the finished product, or about realizing a vision.

A defining characteristic of a true geek is that he builds. Doesn't matter what he builds; what matters is that he appreciates structures. So when we look at a blogging platform, our needs traditionally tend to be focused on constructing more elaborate things. They also tend to be relatively solitary in nature. (Wordpress is; I never had a friend that used Posterous and so I don't know how they handle following things.)

I like slagging on Posterous because I'm still bothered that they get compared to Tumblr when they're really entirely different. Tumblr's breakthrough was its deconstruction of the blog format. You could post an image without a title. You could post a quote without something to frame it. You could post ANYTHING without a datestamp, or a "posted by" attribute, because nobody cares about them, they just care about the flow of content. Posterous has lines of datestamping, and they don't handle title-less posts. They're all about traditional title-body posting. They're all about "blogs", about these elaborate constructions. They aren't broken down like Tumblr.

Tumblr's a fucking awesome engine because it removes everything bloggy about blogs. It's designed for a steady vomitstream of thoughts and ideas. Not your thoughts or ideas. Anybody's. And it strips away everything we associate with blogs, and with everything it strips away it becomes sleeker, lither, more powerful. No comments means no conversations other than reblogs, and reblogs are great because first off, they let you improvise off anybody else's stream, and second off, they make it IMPOSSIBLE to participate without being a "primary creator" with a flow that other people are following, versus blog comments where every commenter is attached to one site at a time.

That means it attracts people who don't build things. People who just want to push out content without worrying about being judged for value. But they want to push it out, because it is a creative and cathartic act just to release these ideas out into the world, or to change their flow. Lower barrier to entry, more ability to interact meaningfully. You can participate without any skill but still find that people are interested in you.

Now, I use Tumblr primarily as a building tool. I find that its theming system makes it very easy for me to design custom interfaces for complex publishing sites, and yet still push through the entire site as a streaming feed to any Tumblr user who wants it. And it's used by lots of serious designers who appreciate its versatility. But I'm an edge user. Every HN user here who uses Tumblr is an edge user. The real users don't post here, because they're not out to build things, they're out to just express themselves loudly and with fury.


Your explanation also bears on this line from the post: "We've nearly quadrupled our engineering team this month alone"

Without more information, this can make people trust Tumblr either more in the future (we went from 3 to 12 engineers) or less (we went from 200 to 800). I'd worry in the latter case, because it might show that Tumblr is not aware of Brooks's law (http://en.wikipedia.org/wiki/Brooks%27s_law), which I always find unreassuring.


Great explanation. As someone who doesn't use Tumblr, this clarified a lot. Thanks!


>We’ve nearly quadrupled our engineering team this month alone...

How do you create and keep a good culture when you're growing that fast? Anyone have experience with this sort of scenario?


Unless the number of engineers they have w/ all the hiring is still less than 20, you don't. You hope that your culture is in place and strong enough to have some say over what happens in the future, but realistically if you don't have REAL leadership in the company (I mean, strong people managers, strong programming leaders and strong culture leaders...all are needed), the company won't look anything like what it does today in a year. That may be good or bad (I don't know the culture there), but it just is almost inevitable (sure, there are exceptions...but the general rule is...)

People would argue that places like facebook and/or google might have retained their culture etc etc. I doubt that there was a point when someone grew the ENTIRE engineering team 4x in a 30 day span at those companies (unless it was from a small number to a still-small number).

That number is a bit arbitrary, but makes sense from what I've seen and experienced at other companies. A 4x number is hard to overcome when you have more than 4-8 core original engineers. Think about what that means. If you start with 5 devs, 30 days later you have 20? Are they seniors that can live on their own (where did you find 15 senior people in 30 days?)? Are the original people all leads who are now responsible for shepherding these new devs onward into the company? (Who is doing the real work for the first couple of weeks while these people are coming up to speed? 5 people generally don't have fully automated systems, builds, deployments, automated tests...)

Imagine if this number was 8->32 in 30 days? 10->40? 20-> 80? 100->400?

You have to lose something. I think what you lose becomes the key thing that defines what your culture is going forward...this would look to be one of those moments in a company when someone really, really needs to decide what is important to them. This is when the REAL culture will be defined for the company.


Adding more people to a software project often has the effect of making it take longer, at least for mid to near term milestones. Quadrupling sounds like the adage of pouring gasoline on a fire.
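One way to put rough numbers on this, as a sketch rather than anything Tumblr has published: The Mythical Man-Month models coordination cost as the number of pairwise communication channels, n(n-1)/2. The team sizes below are the hypothetical ones floated elsewhere in this thread, not Tumblr's real headcount.

```python
def channels(n):
    """Pairwise communication channels in a team of n people: n(n-1)/2."""
    return n * (n - 1) // 2

# Hypothetical "before" team sizes; each one is quadrupled.
for before in (5, 8, 10, 20, 100):
    after = before * 4
    print(f"{before:>3} -> {after:>3} engineers: "
          f"{channels(before):>5} -> {channels(after):>6} channels")
```

Going from 5 to 20 people takes the channel count from 10 to 190, so a 4x headcount is closer to a 16x-19x coordination problem under this model.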


Your attitude may be appropriate for software development, but if we're talking about the number of IT people watching the server farm, more is probably better.


I know that - you know that - and they probably do too.

I think in this case the point is to reassure people that they are taking it seriously.


True, but it may not be the case that all of this new staff is going toward increasing the size of existing project teams... they could be spinning up whole new teams for new projects - ones for which the idea had been stuck on the back-burner.


I'm surprised that this has so many upvotes. Are you all not this simplistic?


It's probably that the upvoters have all read The Mythical Man-Month.


Yes, my comment is a clear rip off of MMM.


Right, but they are a company thinking in terms of years going forward. Gotta get people hired at some point.


What did it quadruple from? One to four? Two to eight? I'm being serious; isn't all of tumblr numbered at around 20 people, or am I mistaken? Their About page says "we've grown from a team of two, to more than ten people."


It sounds to me like their biggest problem was growing too slowly.


I understand they were busy, but only two or three updates about progress during the entire ordeal (on Twitter) seemed a little low to me. Their "we'll be back shortly" page didn't link to their Twitter page either, so for many people it was just a black hole for updates about when a surprisingly large chunk of the web would return.


I've been around when chunks of a large web site went down. Everyone is scrambling to fix it. While we know communication is worth the time and effort, it is also difficult when you're focused on a singular task. You do get blinders.

That's not to say it's a good excuse, and you certainly learn from it - but it is understandable.


Everyone learns quickly after a major outage (they happen) that they need to have a game plan moving forward for communicating to users and customers. I hope tumblr took note of this lesson.


The postmortem is pretty weak, and we geeks would have loved to see more detail. Tumblr would have gotten some good karma with a detailed explanation. Well, unless this is all made up just to look impressive and the real reason was something else, like human error that wiped the production DB. But even in that case honesty would have paid off - remember GitHub's recent DB wipe and their excellent explanation?

On a side note - anyone know what DB they are using? The cynic in me is thinking "Hey, another MongoDB + FourSquare 'success' story of webscale awesomeness."


Agreed. I'd like to see more detail. Cf. facebook's recent post-mortem (http://www.facebook.com/notes/facebook-engineering/more-deta...). Lots of detail. When a downtime blog post is as vague as Tumblr's was, my first instinct is that there's something they are deliberately not telling us about what went wrong. Or maybe they haven't figured out exactly why it went down yet. I hope a more detailed postmortem is coming.


Doesn’t actually explain what happened, or why a database cluster outage means more than a read-only situation.

I understand this is aimed at all users, but I’m still disappointed.


Don't be; their engineers have better things to patch up at the moment than writing a detailed post. Hopefully they will, though, like foursquare did with their mongodb outage - http://nosql.mypopescu.com/post/1265191137/foursquare-mongod...


Todo:

Move the blog/status site outside your network (linode.com)

Work on a process to follow if you have another outage.

   - One person to handle communication (blog post / respond to users)

Work on a faster way to recover from such a failure. Maybe have a read-only version you can switch to as a "maintenance mode"?

Done:

Probably the biggest outage you'll face.

20+ hour outages don't usually happen.

Learning from it. . .
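To make the read-only "maintenance mode" item concrete, here is a minimal sketch, assuming a Python WSGI app (nothing in the thread says what Tumblr actually runs). The class name `ReadOnlyMode`, the `enabled` flag, and the 600-second `Retry-After` value are all made up for illustration.

```python
READ_ONLY_METHODS = {"GET", "HEAD", "OPTIONS"}

class ReadOnlyMode:
    """Hypothetical WSGI middleware: while enabled, pass reads through
    to the wrapped app and answer every write with a 503."""

    def __init__(self, app, enabled=False):
        self.app = app
        self.enabled = enabled  # flip to True during an incident

    def __call__(self, environ, start_response):
        method = environ.get("REQUEST_METHOD", "GET").upper()
        if self.enabled and method not in READ_ONLY_METHODS:
            body = b"Read-only maintenance mode: posting is temporarily disabled.\n"
            start_response("503 Service Unavailable", [
                ("Content-Type", "text/plain; charset=utf-8"),
                ("Content-Length", str(len(body))),
                ("Retry-After", "600"),  # assumed value, in seconds
            ])
            return [body]
        # Reads, or any request while the flag is off, go to the real app.
        return self.app(environ, start_response)
```

This only blocks writes at the HTTP layer; it still assumes the reads can be served from something that survived the failure (a replica or cache), which is the hard part.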


eBay had multiple really long outages. I'd say a site which has had one 20h outage is way more likely to have similarly long outages in the future -- it demonstrates that neither the technical nor procedural measures are there to prevent them.

Of course, once you have enough 20h outages, you don't have to worry about the problem anymore.


EDIT - originally this comment started with "Wow, set the snark to stun!", but the parent has since been modified, which eliminates the need.

We've all had too many examples of bad PR-attracting events and their fallout over the years. A wise man wrote 'One shouldn't be forming a(n incident-response) team during a crisis', and I don't expect different here. Do I hope for better as someone who uses and admires the Tumblr platform, and expect more communication next time? Yes.

The quantcast graph pointing to the 1 _billion_ increase over the past _two_months_? Easy to play armchair qb to that, but by jeebus, it's almost surprising they made it this long.


they had someone responding on tumblrhelp with chat and sympathetic words throughout most of the outage.


Pretty weak compared to other post-mortems.


They should have a better ongoing communication method than Twitter in the event of downtime. I'd suggest getting a separate hosting account with a reliable third party, and planning on running something that looks like (but isn't) a normal tumblelog, but is instead something very reliable, like editing an HTML file on a web server that everyone on the team has shell access to.
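A sketch of how small that "edit an HTML file" status page can be, assuming Python is available on the third-party box. The function name `publish_status` and the default `status.html` path are invented for this example; the write goes to a temp file first and is swapped in with `os.replace` so visitors never see a half-written page.

```python
import html
import os
import tempfile
from datetime import datetime, timezone

def publish_status(message, path="status.html", now=None):
    """Overwrite a static status page with one timestamped message."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%d %H:%M UTC")
    page = (
        "<!doctype html>\n<title>Service status</title>\n"
        f"<p><strong>{stamp}</strong> {html.escape(message)}</p>\n"
    )
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory, then swap it into place.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(page)
    os.chmod(tmp, 0o644)  # mkstemp creates 0600; the web server must read it
    os.replace(tmp, path)
    return page
```

Anyone on the team with shell access could then run something like `publish_status("Database cluster is recovering; posting is disabled.")` from a Python prompt.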


This points to the necessity of a data storage solution which doesn't involve waiting for hours upon hours for the caches to warm in order for your service to be reliable.


technical details?


My favorite opinion on the handling of this event is, by far, this one:

http://twitter.com/b6n/status/11877355945463808



