Predecessor may imply "worse" or "outdated" (although may not be the intent of the OP). I want to clarify that is definitely not the case: Kubernetes is a joke compared to Borg when it comes to running Google workloads (on many dimensions, most importantly scale).
I don't know about "joke". Kubernetes has a much better design and interface than Borg; oftentimes I am frustrated trying to work around design flaws in Borg.
However yes, Borg has been enriched with about a decade of features and fixes that Google needs. A replacement of Borg is many, many years away.
This post highlights why many start disliking Google -- it's that vibe of superiority that comes across often. These days, Google's workload (e.g. QPS) isn't that unique anymore...
Moving Borg to Kubernetes will eventually happen, because it doesn't make sense to maintain two systems that solve the same problem. And because Kubernetes is open source, it will eventually be superior, thanks to its diverse group of contributors.
Huh? How is superiority relevant here? I don’t work at Google (I have in the past); both systems come out of that company, and some of the same people have worked on both. If anything, Google’s marketing department would probably prefer for people to believe that Kubernetes ~= Borg to help sell GKE, not the other way around. Kubernetes is basically limited to several thousand hosts. That doesn’t even register at Google scale. Other folks do have high-QPS systems, but none really use Kubernetes to manage the entire cluster; Facebook comes to mind, for example, with an in-house system. I would bet against that prediction; I don’t believe such a thing is even on the roadmap internally.
It's less and less relevant, a lot of Google things can run on the current limits for k8s: 5000 nodes / 150k pods in a single cluster.
Not everything at Google is large, and many other companies run very large, Google-scale infra on Kubernetes using multiple clusters with federation, regions, etc.
I know this is pedantic, but I'd argue it's "an inspiration for": in its present state, Kubernetes is unable to scale to a datacenter, let alone globally at Google scale.
On a related note, we have built an E2E-verified, tamper-evident CI/CD pipeline for the Datadog Agent integrations [1]: the Agent will trust and install only integrations that correspond to source code that has been signed by our developers. If there is an attack anywhere between our developers and end users, it will be caught.
Unlike Binary Authorization for Borg, our security guarantees are publicly verifiable.
Binary Authorization for Borg is for verifying binaries running inside Google, not code installed on end-user machines. Having the authorization be "publicly verifiable" makes no sense.
Agreed with this statement.
It's a best practice generally to verify all software updates originate from a particular source before applying them in your environment. Most over the wire updates do this.
What's different with Binary Authorization for Borg is that within Google, that last verification step means more than just "came from Google", but "came from Google and went through all previous necessary checks", because of the way the CI/CD system works together.
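To illustrate the distinction (a minimal sketch only -- the stage names below are invented, not Google's actual checks, and real attestations would be cryptographically signed records, not booleans):

```python
# Hypothetical sketch: deploy-time verification means more than "came from
# the right place" -- every prior CI/CD stage must have produced a
# (signature-checked) attestation. Stage names are illustrative.
REQUIRED_STAGES = ["code_reviewed", "built_from_submitted_code", "tests_passed"]

def verify_release(attestations):
    """attestations: dict of stage name -> bool (signatures assumed already verified)."""
    missing = [stage for stage in REQUIRED_STAGES if not attestations.get(stage)]
    if missing:
        raise PermissionError(f"deployment blocked; missing attestations: {missing}")
    return True
```

A binary that was merely signed by the right organization, but skipped a stage, is still rejected; that's the "went through all previous necessary checks" part.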
Disclosure: I work at Google and helped write this whitepaper on Binary Authorization for Borg.
I am perfectly aware that Binary Authorization for Borg is for binaries running inside Google.
I am saying that our solution provides almost the opposite: publicly verifiable assurance that you are running legitimate binaries, despite their being built by automation.
The biggest difference for me is that in-toto allows you to define any set of upstream metadata requirements in a very open format, whereas Binary Authorization has a set of centrally defined requirements that teams tend to implement in tranches to meet minimum requirements. It may sound better to have a free-form format, but in practice I've found that it makes it harder for people to know what they should actually do.
In Binary Authorization for Borg, services still define service-specific policies, but pick from a previously defined set of potential requirements. See the section on service-specific policies: https://cloud.google.com/security/binary-authorization-for-b...
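The "pick from a predefined set" idea can be sketched in a few lines (the requirement names below are invented for illustration and are not Google's actual menu):

```python
# Illustrative sketch: services compose a policy by picking from a centrally
# defined menu of requirements, rather than writing free-form metadata rules.
ALLOWED_REQUIREMENTS = {
    "built_from_checked_in_code",
    "code_reviewed",
    "built_by_trusted_builder",
    "provenance_audit_logged",
}

def make_service_policy(service, picked):
    """Reject any requirement that isn't in the predefined set."""
    unknown = set(picked) - ALLOWED_REQUIREMENTS
    if unknown:
        raise ValueError(f"{service}: not in the predefined set: {sorted(unknown)}")
    return {"service": service, "requires": sorted(set(picked))}
```

The trade-off is exactly the one described above: less expressive than free-form metadata, but it's always obvious what a team can and should require.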
You can more easily compare Grafeas and Kritis (OSS projects Google developed, which are similar to GCR Vulnerability Scanning and Binary Authorization for GKE), to in-toto. In fact, I gave a talk covering some of the options for this here: https://youtu.be/uDWXKKEO8NU?t=1314
Disclosure: I work at Google and helped write this whitepaper on Binary Authorization for Borg.
A comment, not a question: Though I think it was worth the cost, I'd say this was one of the most painful mandates/rollouts I've had to endure. The cost to developer productivity was pretty significant. I would have liked to have seen that impact discussed.
I've seen references to gVisor being used 'in production' for Google App Engine and Cloud Run and so forth.
Scanning through recent commits and the GitHub repo, this is clearly not the case -- there are way too many outstanding issues and outright missing support for various things. Is this another project that was written internally in a different language and then ported out?
gVisor is not a rewritten version of an internal tool. The code you see really does run in production for App Engine and Cloud Run. While there are some internal modifications to better integrate with internal infrastructure, the vast majority of the code is identical to open source, critically including all of the system call handling, filesystem, and memory management code.
While browsing through our issues will show that we still have plenty to work on, the vast majority of applications work well inside gVisor.
I cannot say anything about internal use of gVisor. Sorry.
As a bystander from outside, I generally don't like VM-type mechanisms as a security mechanism, unless it's actually a VM hypervisor: that way hardware can be utilized to define a relatively simpler and more robust security model. (Of course, I'm not saying hardware is always superior -- please don't chase me in this direction.)
On the contrary, true software sandboxes like eBPF and WebAssembly, with limited capabilities in their building blocks and clearly defined application scenarios, are better ways to do security in software.
One thing that really squicked me out when I left Google is how other companies, even large and sophisticated ones, are using all kinds of garbage that comes from Canonical or Red Hat or Percona, and they have NO IDEA what's in there. Say what you want about Google's NIH culture, but in regards to code provenance and verifiable builds they are doing the right thing and many others are not.
Whilst it would be nice if everyone had the time and resources to code review and build their entire source dependency tree, is this ever going to be a reality for the long tail of enterprises who struggle with even resourcing / recruiting for their current workload? I think the vast majority are going to continue outsourcing this responsibility onto enterprise distros / vendors for a long time to come.
I think things would be easier for the long tail with more investment from all in the tools space - better support for monolithic repos, unified CI/CD systems, etc.
If you are a large tech company (1-5k employees) there are far bigger risks than dodgy binary builds from upstream (like API keys leaked to GitHub...).
However, if you are a hyperscale, high-value company (i.e. a place which has enough data or digital cash to be worth dicking with) then it's a worthy problem.
I think 'garbage' is a strong word, but I believe what the original poster is trying to say is that there are a lot of binaries, packages, and libraries that most organizations will consume from upstream and not verify directly. This requires either trust in a third party (often many third parties, in the case of open source), or more intense validation of those components and any changes to them.
Binary Authorization for Borg performs verification for pieces that come out of Google's CI/CD pipeline. For third party code, see in the doc, "When importing changes from third party or open source code, we verify that the change is appropriate (for example, the latest version)."
Disclosure: I work at Google and helped write this whitepaper on Binary Authorization for Borg.
Literally anything that comes from a vendor in a package? Percona server/toolkit? Every binary package in Ubuntu? The Linux kernel as built and distributed by Red Hat?
> Many of the CI/CD controls we describe in this paper are placed where your code is developed, reviewed, and maintained by one organization. If you are in this situation, consider how you will include third party code as part of your policy requirements. For example, you could initially exempt the code, while you move towards an ideal state of keeping a repository of all third party code used, and regularly vet that code against your security requirements.
I don't know how much third party code is in use at Google these days, but I'd be curious to know if there is a formal effort at cataloging most-often used / most-sensitive third party code and prioritizing reviews of it.
I've thought about the problem of vetting programming language packages (pypi, npm, rubygems, whatever) off and on. It seems like the only two tenable strategies are "don't pin anything / always use tip of master" and "freeze deps, vet transitive deps at that frozen point, vendor the corresponding deps, and if you ever need to update a requirement for a feature or bugfix, go through the process again".
The latter seems like it could be out-sourced to a certain degree, if you trusted other organizations to "vet transitive deps".
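The "freeze, vet, vendor" strategy hinges on being able to tell when the vetted snapshot no longer matches what's on disk. A minimal sketch of that piece (function names are my own, not any real tool's API):

```python
import hashlib
import pathlib

def snapshot(vendor_dir):
    """Record a SHA-256 per vendored file at the moment the deps were vetted."""
    root = pathlib.Path(vendor_dir)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*")) if p.is_file()
    }

def check_frozen(vendor_dir, frozen):
    """Fail loudly if anything changed since the vetted snapshot: an updated
    requirement means re-running the vetting process, not silently trusting it."""
    current = snapshot(vendor_dir)
    changed = {k for k in frozen.keys() | current.keys() if frozen.get(k) != current.get(k)}
    if changed:
        raise RuntimeError(f"re-vet required; files changed: {sorted(changed)}")
```

In practice, package managers that support hash-pinned lockfiles (e.g. pip's `--require-hashes` mode) give you the "check" half for free; the expensive part remains the human vetting at each freeze point.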
This means that dependencies on a third-party library can be found simply by looking at deps lines in BUILD rules. That can then inform which projects you want to run (for example) fuzzers on:
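As a rough sketch of that lookup (a real Bazel setup would use `bazel query` for this; the regex scan below is a deliberately crude stand-in, and the target label is hypothetical):

```python
import pathlib
import re

def reverse_deps(repo_root, target):
    """Crude reverse-dependency scan: find BUILD files whose deps lists
    mention `target` (e.g. '//third_party/zlib'). A real monorepo would use
    `bazel query "rdeps(//..., TARGET)"` instead of regexes."""
    root = pathlib.Path(repo_root)
    hits = []
    for build in sorted(root.rglob("BUILD")):
        # look for the label inside any deps = [...] list in the BUILD file
        for deps_list in re.findall(r"deps\s*=\s*\[([^\]]*)\]", build.read_text()):
            if target in deps_list:
                hits.append(str(build.relative_to(root)))
                break
    return hits
```

The list of hits is then your prioritized worklist: every project that depends on the library is a candidate for fuzzing or extra review when that library changes.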
At Microsoft, we just require all binaries to be signed on production systems. Some systems are configured to block execution of unsigned code. Where we can't do that, monitoring cuts an immediate sev-2 and wakes us up if any unsigned code is executed.
Does Linux not have a way to run only signed ELFs?
Thanks for that -- I love MS security practices/stories; interesting.
Yes, Linux does have ELF signing. I'm guessing you are speaking more generally about ensuring that "only signed code is allowed to execute", rather than just "making sure ELF binaries are signed" based on the remaining context of your comment.
As with Windows, making sure that exe files are signed isn't enough (there are also PowerShell scripts, drivers, the kernel, firmware, etc.), and the goal may be "block execution of unsigned code" or even just "block privileged execution of unsigned code".
As for "signed code execution": as I frequently discover when I go looking for "how to do Foo on Linux", there's more than one way[0], depending on the need/device/system. Windows is in a lot of places you don't expect -- ATMs, IoT devices, etc. -- and for Linux, it'd be easier to come up with a list of device types that haven't had a Linux kernel running on them. LWN had a write-up in 2017 -- I know I've read more current coverage, but theirs was a good summary and answers your question. The Linux Integrity Measurement Architecture (IMA)[2] is a more complete approach. Those are the more general-use options that I was aware of.
[0] Often at least 8; and there's usually a few of them arguing with each other over something that's between the extremes of "whose dad would actually win in a boxing match" and ... religion. /s
If a single person can generate a signed binary by themselves, then restricting execution to signed binaries does not address the threat model in the article.
That's a big difference from the starting state that Google had, which was that a single person could create a signed production binary from unsubmitted code all by themselves.
(It was very convenient for iterating on one-off fixes in production in an emergency, but you would rightly question how someone gets into that position in the first place. Plus there was no guarantee that the code would ever get submitted, and post-fix code review might cause the code to be subtly broken prior to being committed to the monorepo.)
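The fix for that starting state is multi-party authorization: a binary is only deployable when independently signed by more than one identity. A toy sketch of the check (HMAC with per-person secrets is a stand-in for real asymmetric code signing, and the identities/keys are invented):

```python
import hashlib
import hmac

# Toy stand-in for real code signing: HMAC over the binary with per-person
# secrets. The people and keys below are hypothetical.
KEYS = {"alice": b"alice-secret", "bob": b"bob-secret"}

def sign(person, binary):
    return hmac.new(KEYS[person], binary, hashlib.sha256).hexdigest()

def deployable(binary, signatures):
    """Require valid signatures from two *distinct* people, so no single
    person can produce a production binary from unsubmitted code alone."""
    valid = {p for p, sig in signatures.items()
             if p in KEYS and hmac.compare_digest(sig, sign(p, binary))}
    return len(valid) >= 2
```

In a real CI/CD pipeline the "second signer" is usually the trusted build system itself, which only signs artifacts built from reviewed, submitted code, which is what closes the single-person loophole described above.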
The second time I read your comment, I read it a little differently, and the second part of your response sounded AMAZING.
monitoring cuts an immediate sev-2 [on systems that cannot be configured to block execution of unsigned code] if any unsigned code is executed
I can't wrap my brain around how it would be possible to detect "any unsigned code is executed" on a system that can't ensure the kernel it booted from passed a signature check, so I'm guessing that was simplified to avoid a miles-long comment.
I'm not trying to be pedantic -- I'm actually curious if you have any more you can share/point me at: measures that Microsoft has found effective[0] at reducing the attack surface of those unfortunate devices stuck at the level of "alert if what's running looks right but fails a cryptographic check". I mean, GPO has allowed a white-list of executables for as long as I can remember, and there have always been permissions in the OS to control rights "provided everyone is playing by the rules", but the GPO was trivial to defeat[1]. Are there any other interesting measures you can share? Also, when you talk about "binaries signed on production systems", are you referring to the Intelli-code capabilities built into Windows[2] or something further?
[0] I'm sure, even they, are guilty of having some tooling that's little more than theatre and wasted CPU cycles in places.
[1] And I'm guessing these are made up of older systems with a larger percentage of software that can be easily fooled into calling into a DLL that it didn't think it was calling into.
[2] I haven't spent a lot of time reading on the topic, but I have a code-signing certificate -- I don't know a lot about the topic, though, was just curious if it's something "I can play with", or if it includes something I can't get (whether or not that's "because it's only available internally", or "it's enterprise expensive"). :)
I think anyone can set it up; it's a feature you can administer through Group Policy. Actually, the entire boot chain from the firmware to the kernel and all of userspace is verified: EFI is verified by Secure Boot, which only runs signed kernels, which only run signed drivers and signed kernel32/user32/etc. So during a normal boot, all running code has been signed and verified.
If that group policy setting I mentioned is enabled, then running unsigned code will either (depending on the setting) write a security event to the system log, which is uploaded and causes a sev-2, or (if set to strict) simply blocked, with a "Windows has protected your computer" message.
It's definitely available publicly - I've personally imaged servers with stock Windows Server 2019 and slipstreamed our internal monitoring agent (which doesn't do anything magic, it just watched the Windows Security ETW log), and personally gotten sev-2s when I tried running unsigned code.
Hence I really wonder why Linux doesn't do this.. I thought SELinux or something might; it could make things slower but my system runs quite fast so I suspect they keep a volatile cache of verified SHAs.
(Also I should mention that even assembly references are signed, so unsigned assemblies refuse to load.)
There's the IMA subsystem, which generally does the same, providing verification for executables -- at least ones that are linked and loaded normally; for obvious reasons it can't fix cases where you have a signed binary that loads unsigned code into itself.
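For reference, a minimal sketch of what an IMA appraisal policy can look like (rules are loaded via securityfs; exact rule support varies by kernel configuration, so treat this as illustrative rather than a drop-in policy):

```
# Require a valid signature (appraise_type=imasig) on files that are
# executed or mmapped executable, and measure executions into the log.
appraise func=BPRM_CHECK appraise_type=imasig
appraise func=MMAP_CHECK mask=MAY_EXEC appraise_type=imasig
measure func=BPRM_CHECK
```

The signatures themselves live in each file's security.ima extended attribute, which is why distribution support matters: someone has to sign the packages' files at build time.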
The system supports integration with TPM-stored keys, so you can bind to kernel verification or the whole firmware chain of trust (a question of which PCRs you end up using to bind the keys).
Unfortunately, it's not well known (especially if you don't compile your own kernel, reading through all the options), and generally not many distributions look into providing it -- plus everything involving Secure Boot and the TPM has to deal with poisoned opinions in the open source community.
Thanks for the reply -- I guess what I'm wondering is how one can reliably alert about execution of unsigned code on a system that cannot prevent the kernel/OS/firmware from being replaced by something evil. Based on your reply, it sounds like that's a prerequisite[0]; otherwise an evil kernel could simply report that signatures are passing for everything under its control.
I can think of several (imperfect) approaches for protecting a system that meets those prerequisites but is incapable of, say, validating userland programs -- though even that would be tricky to do reliably unless the kernel is checking signatures on execution, rather than passively (via filesystem checks, etc.). This is more a fault of any system that "alerts when something out-of-policy happens" vs. one that "prevents something out-of-policy from happening".
The part I have a difficult time wrapping my brain around is what kind of system exists with those limitations? If the system is capable of verifying that it is running signed firmware, bootloader, and kernel (along with any other components that are doing signature verification), why isn't that same system capable of preventing every unsigned application from running, eliminating the need to issue a sev-2 for the impossible situation of "it's running unsigned code unexpectedly"? The two major examples that come to mind -- Windows and Linux -- handle both requirements (albeit with additional software on the Linux side/additional configuration on the Windows side).
Recent Windows versions do not have this limitation, AFAIK. Intellicode/code signing has been included for quite some time. It's mature, a lot of it is enabled by default with balanced default settings and as you mentioned, it's not terribly difficult to set up a system to be very strict with its handling of signed code execution[1]. I should clarify -- despite my last comment -- that I believe the way Microsoft implemented this to be well thought out. Though my Dad[2] will probably always click through the SmartScreen message, no matter how ugly it looks, and it can certainly offer a false sense of security for those who leave it in that default configuration, it still adds a nice layer to the onion (without being obnoxious) and serves as a foundation/required component to enable strict policies[3]. And I prefer this approach to a more strict "walled garden" approach -- freedom/usefulness has a security cost[4].
On the Linux side, depending on the distribution, this may or may not be available "out of the box". On the distribution I run -- openSUSE Tumbleweed -- enforcing bootloader/kernel signing and enabling trusted boot is straightforward (simple if your hardware is natively handled out of the box, but still relatively easy if you have to compile in a module, AFAIK -- having never gone through the pain myself). And there are a few options after that for userland that are comprehensive, in that the aim isn't just to cover ELF executables but to ensure that the important parts of the system are cryptographically verified to be unaltered[5].
[0] Assuming the kernel handles signature verification, or minimally verifies the signature of whatever does handle that.
[1] PowerShell is a great example; though it frustrates the hell out of people (me, and anecdotally, everyone I've worked with immediately after trying to run their first script on a new OS load), I love that the default configuration is (maybe was?) "Signed Only", requiring a bit of work (far less in the last version of Windows 10 that I used) to allow unsigned script execution, paired with a permissions system that lets administrators prevent lower security accounts from altering that setting (which also defaults/might only be allowed to be set by administrators in the first place).
[2] Not picking on him -- he was a heck of a power user back in the day, but thinking the average Joe/Jane who re-uses a handful of passwords (just got the parents out of this habit) and are otherwise oblivious to security warnings.
[3] In some ways, it's an extension of the GPO executable white-list -- which was easily bypassed by renaming evil-executable.exe to something that is allowed. Now, contents are checked and the white-lists are tiered (i.e. video/sound drivers must be signed by the user and an MS certificate, which doesn't happen until the program is evaluated ... IIRC, of course).
[4] A computer that's melted down into a mass of plastic/metal is hack-proof ... and use-proof. Less snarkily: a computer that has no hardware capable of allowing it to communicate on a network is going to be many orders of magnitude more difficult to breach by simply following good practices on the physical-security front, but it'll be similarly useless if that system was intended to host a publicly accessible web site.
[5] As was mentioned, before, even saying "ensure all executable code is signed" isn't enough for some environments -- ensuring configuration is unaltered is an obvious one. Even signing state data is necessary in some cases, i.e. a system is hibernated, drive is removed/file is altered, and the system is booted, loading evil-hiberfile.sys. Sheesh, security work is fun -- close one hole and the "what abouts" start popping up everywhere else.
AFAIK, it wouldn't be able to unless Python or Ruby provide a way to verify the signatures of the code they're executing, but even then ... that's not really what code signing can do/is for.
Start with Intellicode/Windows SmartScreen signing -- I have a code-signing certificate; it cost $100, and I had to bring a few forms of identification to a notary public and prove a few things about my physical address (which was awesome when you use an MVNO and don't have a land-line). At the end of the day, I can sign code that ticks the happy boxes in Windows 10, Visual Studio, etc. Obviously, the certificate signs evil as well as good. And my signature only really says "signed with the certificate that this person purchased".
On the more extreme end, imagine a hypothetical system -- from bootloader to OS and its configuration, everything must be signed with both my key and the key of my buddy Greg. We did everything perfectly so as to ensure that the bits inside these 5 programs are the only bits that have ever been signed with these keys and any other edge cases are covered -- i.e. it's a system that perfectly covers "This computer will do whatever these 5 programs tell it to do", and "We know these 5 programs are guaranteed to be unaltered"
That, alone, isn't a protection against "damage". More than code-signing is necessary for that:
1. Understand what the whatever is that the "allowed" programs do -- and (optionally/extreme cases) don't allow programs that can execute arbitrary, unsigned/unverified code.
2. Use every tool the operating system makes available to enforce what the program is allowed to do. If `python` is needed to run a script that copies file A on a remote server to file B on the local server, a start is to run it with permissions that prevent Python from damaging more than file A if the intended script were replaced with something-evil.py. How narrowly you'll be able to define these rules depends on the capabilities of the operating system/hardware platform.
3. Further isolate (à la containers) -- pop it in a sub-environment (container/virtual machine) running a signed image that contains the Python interpreter and the script -- no idea if Docker provides anything here, but there's nothing to stop one from moving the whole thing to a small device with a signed bootloader/image containing the script that refuses to boot if any bit is changed on the device. It just depends on how insane you want to get.
It's these permissions systems that prevent "damage" -- really, if the permissions systems were "perfect" they'd let you specify, precisely "in what context the application is allowed to perform what action". Code signing is, arguably, one of these permissions systems, too. Granted, without effective code-signing involved, the permissions checks can be bypassed in an undetectable manner.
I can't speak for MS -- not my employer -- but I worked for a large global telecom doing a mess of infrastructure development centered around security (among other things). At the time, code signing/secure boot as it exists today ... I don't even know that it existed. We had a system that was allowed to run a single data-translation program; it was possible to make this thing meet the "spirit" of the above with a mix of hardware/physical security[0]. We had machines that enforced policies about what was allowed and disallowed. Every service's security needs are different, and while there was certainly a baseline ("everything at least does this"), once you started wanting the assurances that "ensuring only signed code runs" involves, that usually landed the service in the highest-security configuration for everything else (multiple rings of restricted, internal, white-list-only networks where inbound/outbound traffic is tightly controlled and documented), followed by an analysis of the service to see how little we could give it and still make it work.
[0] I recall a box that ran a single DOS executable at boot from removable media where the drive had been jumpered to be read-only. It was locked in a cabinet in a data center that myself and five-or-so others had access to, and its purpose was to receive data from a serial connection, dance around with it a bit, and upload the resulting file at 1200 baud to God only knows where.
Re (1), when you say "don't allow programs that can execute arbitrary, unsigned/unverified code" -- does this mean we are blocking all scripting languages?
For (2) and (3), I agree that capabilities and containers are very important, but they are not really related to code-signing -- either python is signed or not.
So I don't see how code signing + python can co-exist. Once you've allowed your python (ruby, perl) binary, the "binary whitelist" is pretty useless. Seems like other technologies -- containers, sandboxes, SELinux-like labeling, etc. -- are the only way to go.
Has anyone outside of Google implemented something similar in spirit to this for K8s or ECS? What was the threat model you were considering when you built it? Was it worth it?
Kritis[0] is a K8s implementation of this that intends to block deployments of images that haven't been properly vetted beforehand, or has critical vulnerabilities, etc.
It's more a defense against getting NSA'd (via the specific threat model of an attacker secretly replacing a security service with an implementation that looks very similar but is much easier to crack).
I think that's right. I would strengthen that statement slightly: it's about ensuring that no actor -- whether an insider, someone who has stolen their credentials, or someone who has otherwise compromised them -- can single-handedly perform an action that accesses user data without it being known to another actor, via access logs, approvals, etc.
In terms of the upstream introduction of a new vulnerability, Binary Authorization for Borg can only verify that the code was in fact merged. See the section on third party code, "When importing changes from third party or open source code, we verify that the change is appropriate (for example, the latest version)."
Disclosure: I work at Google and helped write this whitepaper on Binary Authorization for Borg.
"Our infrastructure is containerized, using a cluster management system called Borg."
I was hoping they had some predictable, indexed build for borg backup[1].
[1] https://www.stavros.io/posts/holy-grail-backups/