
Hmm. This looks to me like a lot of the savings were realized by moving away from managed services into a scenario where there’s more operator overhead. The AWS bill gets lower, but what about the cost of the engineering work?


Does anyone else find the costs associated with running well-tested, well-developed systems overblown? If you know how to adjust some basic parameters, you can solve 99% of use cases (adjust memory and a handful of other settings).

Examples I can think of are RabbitMQ and Cassandra. But in general, we have some really battle-tested software these days that has become simpler to configure and run over time. People seem scared to run their own these days.


I vouched for this comment because it’s a valid point and I’m not sure why it was killed.

I happen to disagree strongly, though: lots of engineers in my experience undervalue the work of systems administrators and underestimate the effort needed to operationalize any technology.

Running your own is absolutely fine if you are willing to keep your stack small and invest time learning the tools you pick. But there are still horror stories of people thinking snapshots are backups, turning the wrong knobs and turning off fsync on their databases, ...


Yeah, exactly. And unless you are FB scale, you can just run a single Docker container and never really have to worry (provided you know how to use Docker).

Most small startups are actually the ones who don’t really need SaaS services.


> Yeah, exactly. And unless you are FB scale, you can just run a single Docker container and never really have to worry

This has not been the case at multiple employers and/or consulting clients.

If you're providing software to an enterprise, this will almost never fly. That single Docker container will have an outage when basically anything happens: the container dies, systemd fails to restart it, the node dies, a network switch dies, the data center has basically any major issue, etc.

I think your comment brings value, but it's probably biased by your own experience of running a consumer-to-consumer startup.


I think you're misinterpreting my comment. I meant specifically most small-time startups, not small-time startups deploying enterprise apps. If you're deploying enterprise apps, then by definition you're, for all intents and purposes, "FB scale."

A lot of SaaS products promise infinite scalability, a need that often never materializes for most small-time startups.


Sometimes.

But developers are part of this problem too. There are plenty of times when I see devs immediately reach for tools instead of learning just a little bit more about what they already have. My favorite example is when folks want to add a NoSQL db into the mix on top of a traditional db. Not because there's a real performance need, but because for their use case it is 'easier'. Never mind that their problem could possibly have been solved by just writing their own SQL instead of trusting a garbage ORM...


This is probably a tradeoff in a lot of AWS related stuff; you pay a premium for convenience. But depending on your workload it can pay itself off fairly quickly; AWS bills can go up quite fast, whereas personnel costs are fairly predictable.


False equivalence. The engineer will be doing more than just cloud work.

This comparison is the #1 flawed sales tactic the cloud companies use to convince you you're saving money.


> False equivalence. The engineer will be doing more than just cloud work.

> This comparison is the #1 flawed sales tactic the cloud companies use to convince you you're saving money

Time is of a limited quantity and time spent managing postgres backups (for example) is time not spent doing other (possibly more meaningful/impactful _to the business_) work.


How much liability can you claim against AWS if there's an issue with their RDS backups?


How much liability can you claim against Cloud Employee if there's an issue with your RDS backups?


What is the chance that Amazon causes an issue with your RDS backup, versus a Cloud Employee?

The answer is definitely not clear to me at all

EDIT: no sarcasm, I legitimately don't know which I would choose as a biz owner


I would personally always have an off-platform backup to fall back on, as protection against the platform going down, accidental damage to data or malicious damage to data. Snapshots in cold storage too.


Liability? Probably none. See section "11. Limitations of Liability." here: https://aws.amazon.com/agreement/

RDS SLAs are here (they don't mention backups though, so not sure how that's handled): https://aws.amazon.com/rds/sla/

Not a lawyer or anything, but my layman's understanding is that you are essentially opting in to waiving liability when you sign up for AWS and accept the terms and conditions; instead of liability, you agree to accept service credits if SLAs are not met.


what is involved in managing backups? isn't that just a cronjob?


First, you need to write the cronjob. But what goes in there? You need to decide exactly how you're going to make a backup, and the process may differ by what's being backed up. Ideally you want a quiescent snapshot, but the way you do that varies by application. What if the application is a distributed application, in which case you need to synchronize the snapshot process among all its nodes? What if it's a master-replica design, where the node that runs the cron job may vary based on the current topology?

And if you need some sort of cluster-aware lock to coordinate backups among different peers, you'll need to decide which system works for you, implement it, and maintain that as a separate system. And if that needs to be upgraded, figure out a bulletproof process for upgrading it while it's still being used as a coordinator.

Then, you need to ensure there's storage for the backup. You need to decide what kind of storage you're going to use, make sure you've got enough space, figure out how to encrypt the storage (very important in secure environments), how to protect the storage using authn/authz. And lots of environments have retention and storage lifecycle policies - you don't want to put the old backups on the expensive fast media; you want it on the cheap slow media. And some environments make you dispose of old data, so you have to figure out how to age it out but without ever losing the backups you want to keep.

Finally, you need to make sure the backups you create are valid and usable. So you'll want to build an automated regression testing procedure to ensure that every time you make a change (regardless of how minor) to the system being backed up or the backup process, that you end up with usable backups.
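That last validation step is the one most shops skip. A minimal sketch of the idea in Python, using `tarfile` as a stand-in for a real backup tool (all names here are illustrative, not any particular production setup):

```python
import hashlib
import os
import tarfile


def make_backup(src_dir, dest_path):
    """Archive src_dir (recursively, tarfile's default) and return a
    sha256 digest of the resulting artifact."""
    with tarfile.open(dest_path, "w:gz") as tar:
        tar.add(src_dir, arcname=os.path.basename(src_dir))
    with open(dest_path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def verify_backup(dest_path, expected_digest):
    """Refuse to call a backup 'done' until it is non-empty, matches its
    recorded checksum, and its members are at least listable."""
    if not os.path.exists(dest_path) or os.path.getsize(dest_path) == 0:
        return False
    with open(dest_path, "rb") as f:
        if hashlib.sha256(f.read()).hexdigest() != expected_digest:
            return False
    with tarfile.open(dest_path, "r:gz") as tar:
        return len(tar.getmembers()) > 0
```

The point is the shape, not the tool: the job only reports success after the artifact has been re-opened and checked, never merely written.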

(Disclaimer: I work for AWS, but opinions expressed here are my own and not necessarily those of my employer.)


You make it sound like there aren’t cookbooks for many of these scenarios and that the company will have to invent these scripts and procedures by hand.

Yes it is work, but this company’s whole reason for being is to save AWS spend, so I assume they have patterns they employ for their clients regularly that achieve their SLO.


There are original-definition cookbooks and yet it still costs me time to provision my own lunch vs using the managed service of my corner restaurant.


Yes, managed services are better in many cases.


Sometimes there are cookbooks, but they are of varying quality and often don't have dedicated resources to maintain them, so I would use them with great caution. You also have to implement them and often maintain the underlying infrastructure.

But I was really responding to the brusque naiveté of the "just write a cronjob" response.


“managed services” don’t necessarily save you the headache of making sure backups are usable. But other points are valid.


There was a time when I used to think the same; then I found my backups were corrupted (or had stopped because they ran out of space, etc.) just at the moment I needed them.


I once worked at an IT shop that worked closely with the construction industry. A new sports stadium was being built and we were doing panoramic photos during each stage of construction, and rendering them in a web app where the facilities team could “peel back the layers” and see what was behind the wall or under the floor all the way down to the foundation. This was... 15 years ago? So it was pretty neat technology and a little less ubiquitous than today.

Well, our storage server barfed and the data was gone. Went to restore from backups, all the hourly tar files were there... but were zero bytes.

We looked at the backup script the engineer had put together and it was one of those classic “didn’t give the right parameter to have tar recurse” type bugs. Unfortunately we lost all the photos of the foundation and many of the photos of the electric being run. Oops.


One wonders why tar even has an option (and a default!) to not recurse. What would be the use case for that?

The "normal" use case seems to be recursive archival. Sounds like somebody chose the wrong default... it would be an interesting software archaeology project to figure out where this "feature" originated.


Not for a reliable solution. For example, assume you have a master and a replica database for reliability: what happens if the master, where the cron runs, fails? Do you remember to set the cron up on the replica? From my experience, having worked on backup software, over 10% of servers that need to be backed up are not.

System reliability is hard, and the cloud makes that easier.
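One common way around the "cron follows the primary" problem is to install the same job on every node and gate it on the current cluster role. A hypothetical sketch, where the `is_primary` check is caller-supplied (for Postgres it might wrap `SELECT NOT pg_is_in_recovery()`):

```python
def run_backup_if_primary(is_primary, do_backup):
    """Run the scheduled backup only on the node that is currently the
    primary, so the identical cron entry can live on every node.

    is_primary: callable returning True on the current leader.
    do_backup:  callable performing the actual backup.
    """
    if not is_primary():
        # Replicas exit quietly; after a failover the new primary's
        # copy of this job starts doing the work automatically.
        return "skipped: not primary"
    do_backup()
    return "backup completed"
```

Even this sketch quietly assumes a reliable role check and alerting when no node backs up at all, which is exactly the kind of edge work the comment above is pointing at.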


You need a lot of storage, and you need to make sure the backup is readable (view the file contents or try a restore).

The number one backup solution nowadays is AWS S3, because it's easy-to-use, unlimited storage.

How does a company handle backups without S3? Usually they don't. That would require employees to buy machines or a SAN with tens of TB of storage and maintain them (weeks of lead time on orders, plus travelling to the datacenter once in a while). It's too much hassle, so nevermind.


Easy to use unlimited storage is a sure recipe for not finding what you actually need, restoring the wrong backup, etc.

Unless you take your DR plans seriously, the cloud doesn't eliminate risk, it just changes it.

The place I work at forces a failover on a monthly basis, and does a full-on offsite DR exercise twice a year.

I'm sure it took time to set it all up, but now that it's there it takes almost no effort to continue.


The cloud eliminates the most common risks, namely ops and developers simply giving up on backups because there is nowhere to store them, or not being able to access them when needed.

Typical new sysadmin in large corp: The backup storage is full and backups have been failing since before I joined, should we do something about it?

Oh we raised tickets to request more disks. They will take months to arrive if they ever pass approvals.


Backups are a solved problem and have been for decades.

Can you think of a better example?


They’re a solved problem in the sense that the tools exist, but that still carries an ongoing operational cost to correctly set up, secure, monitor and test – and there are plenty of examples of expensive failures or security breaches caused by people thinking it was an easy, solved problem.

Deciding whether you get enough benefit from doing that yourself is a classic business trade off which any experienced engineer should consider.


Yes. I’ve found that the amount you have to learn to use a managed service often equals or exceeds the amount you have to learn to run something on EC2 or on-prem. The automation/management costs of AWS or equivalent are a lot higher than people think and not significantly different from the costs to learn Linux and enough networking to do an “old-fashioned” deploy.


Much of that rings true, but I find some of the cloud abstractions can help to make the steady-state ops time required lower (especially for a side project where you really don’t want to deal with life interruptions).


The cloud was supposed to help you get rid of all these pesky sysadmins. Imagine the savings!

Now the cloud is so complicated that you have to hire "devops". It's the same people as before, with a higher salary.


I don't have a dog in this race; I'm not partnered with any cloud provider. I will say that based upon what the article discusses, they save a bunch of money on AWS Glue by... running their own ETL pipeline inside of ECS instead. What's the maintenance burden of that decision? It's certainly not zero.


That depends on the scale and other particulars that are only shared/known inside the company.

In my experience, beyond a certain scale it simply doesn't make any sense to use managed services anymore. There is a high initial upfront cost in development hours and hardware that is amortized over a very long time after, and this upfront cost is partially paid for by the reduced cloud bill.


That also depends on the markup.

EKS on EC2 costs the EC2 price plus a flat fee for the control plane, so that might make more sense than running your own Kubernetes on EC2 (although I have no experience to say how much time this actually saves you).

Managed Kafka costing 2x the cost of EC2 infrastructure? Probably not worth it.
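The markup framing above can be made concrete with back-of-envelope arithmetic. The numbers below are illustrative only, not current AWS pricing:

```python
def managed_premium(managed_monthly, raw_infra_monthly):
    """Total managed-service cost as a multiple of the raw infra spend."""
    return managed_monthly / raw_infra_monthly


# Hypothetical monthly EC2 node spend either way.
nodes = 3000.0

# EKS-style pricing: you pay for the nodes plus a small flat
# control-plane fee, so the premium stays close to 1x.
eks_premium = managed_premium(nodes + 75.0, nodes)

# "Managed Kafka costing 2x the cost of EC2 infrastructure",
# per the comment above: a 2x premium.
kafka_premium = managed_premium(2 * nodes, nodes)
```

The decision flips depending on which side of that ratio the managed service lands, which is the comment's point.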


Most ETL processes should be declarative and configured in text files, the way you configure CloudFormation or Terraform. Once you have that, the execution piece is relatively straightforward. I think a lot of the issues and costs with ETL come from poor architecture decisions. For instance, we all made the mistake of running Extract processes with too many transformations. An extract should be an extract; transformation should come later.
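As a toy illustration of that split, here is a hypothetical declarative spec with a generic runner. The spec shape and op names are made up for illustration, not any real tool's schema:

```python
# A declarative pipeline spec, in the spirit of a CloudFormation or
# Terraform template: the extract is a pure projection, and every
# transformation lives in config rather than in the extract code.
PIPELINE = {
    "extract": {"source": "orders", "columns": ["id", "amount", "ts"]},
    "transform": [
        {"op": "filter", "column": "amount", "gt": 0},
        {"op": "rename", "from": "ts", "to": "created_at"},
    ],
}


def run_pipeline(spec, rows):
    """Execute the spec against an in-memory list of row dicts."""
    # Extract: column projection only, no business logic.
    cols = spec["extract"]["columns"]
    out = [{c: r[c] for c in cols} for r in rows]
    # Transform: a separate pass, driven entirely by the spec.
    for step in spec["transform"]:
        if step["op"] == "filter":
            out = [r for r in out if r[step["column"]] > step["gt"]]
        elif step["op"] == "rename":
            out = [{(step["to"] if k == step["from"] else k): v
                    for k, v in r.items()} for r in out]
    return out
```

Keeping the transformations in data like this is what makes it cheap to re-run or reorder them later without touching the extract.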


It’s a “false equivalence” or “flawed sales tactic” to suggest planning using total costs? That’s what both engineers and business people are supposed to do - and reflexively attacking it really does not cast your motives in a good light.


It's a false equivalence to suggest that managed services have zero staff costs, and that using a DIY database has a staff cost measured in whole FTEs.


That could be the answer to the question which was actually asked but if you read the thread again, notice that you’re arguing against a claim nobody made.


Scroll up, my dude:

> False equivalence. The engineer will be doing more than just cloud work.

> This comparison is the #1 flawed sales tactic the cloud companies use to convince you youre saving money

"False equivalence. The engineer will be doing more than just cloud work" -> "It's a false equivalence to suggest that [...] using a DIY database has a staff cost measured in whole FTEs."

Hey, maybe AWS should launch some kind of ML-powered reading-comprehension-as-a-service?


Yes, do scroll up — note that the portions you quoted were the strawmen which nojito tossed out, not the original question, and perhaps ponder whether accusing someone else of not reading for comprehension is adding anything to the conversation.


While I don't entirely disagree I think it should be made clear that both can be true, even at the same time. To spin up a 9 node managed Elasticsearch cluster load-balanced across two regions takes a competent engineer with practice roughly a couple of hours, or twenty to thirty minutes if they were smart and terraformed it out previously. Now there's a whole host of potential problems that come along with using that managed Elasticsearch cluster too (no access to "cluster mode", no tunability, etc). But if those potential problems don't apply to you and a very vanilla ES cluster suits your use case then you're fine.

Alternatively that practiced engineer could have spun up a self-managed ES cluster in a couple of DCs in about the same time, but now has the obligation to maintain those servers (patching, etc.). Maybe that marginal cost is damn near zero - Chef has been deployed to all instances and enforces patching, and there's already good security monitoring in place, etc. The cost of that engineer managing that box, as with a managed ES in AWS, is practically nothing.

TL;DR: as in all cases, it depends.


Agreed - there is a tradeoff that must factor in many things: engineer competency or the ability to get competent engineers, state of the product itself (maybe Elasticsearch as a service was an interim step in a longer term vision), complexity of the managed service itself, integratability (is that a word?) into other AWS services, maturity of the managed service, and probably a few other things I'm missing.

We've seen our teams go both from managed to non-managed and from non-managed to managed with relative success. To give a sense of scale, across all of our accounts we spend way north of $3 million/month at AWS, so this has happened within our realm quite a few times. The short, unsatisfying answer is that _it depends_. We have an internal policy from the suits that "if there's a managed version, use it" but most of our teams are thankfully smart enough not to take that at face value and to do their own analysis.


> integratability

interoperability?


Thanks :)


My personal favorite is "Move off of AWS MSK". No big deal - we just fire up some Kafka brokers and ZooKeeper nodes in ECS! All we gotta do is run several more supporting services to keep the cluster healthy and deal with the nightmare of Apache security ourselves.

As far as I'm concerned MSK is cheap - one broker is priced at roughly the same as two equivalent EC2 instances. And you don't have to worry about ZooKeeper at all!


Hi Corey! I'm the author of the blog post, and I definitely agree with you. Nine times out of ten, engineering teams underestimate the cost of engineering work, as well as the opportunity cost of managing non-core functionality internally or moving away from managed services.

For us, the pipeline was actually easier to build with Flink than with Glue because of the restrictions Amazon places on Glue, and that factored into our decision.


If you're already paying the cost (both engineering time and compute wise) for EMR, I can't imagine it takes more effort to create a new Flink job than a new Glue job?

The advantage of Glue, or the corresponding serverless GCP ETL option (Dataflow), is that it's serverless and elastic, but it sounds like their workload wasn't applicable.


Unfortunately that's not nearly how AWS works. AWS breaks down everything and charges you for it separately. Flink and Glue are entirely different animals.


Can you explain more? From what I understand, if you're already running a Flink cluster on AWS and you have capacity for another job, you aren't charged more, no?

I haven't used Glue, but it seems like it's able to do stream processing on Kinesis and dumping to S3 or whatnot, so it seems like there's overlap with using EMR running Flink?



