Let’s Encrypt, OAuth 2, and Kubernetes Ingress (fromatob.com)
177 points by fortytw2 on Feb 22, 2017 | hide | past | favorite | 30 comments


Suggestion to anybody reading this: don't use a DaemonSet for this. This really ought to be a Deployment of nginx-ingress pods behind a service exposed as `type: LoadBalancer` (if your cloud provider supports LoadBalancer services). Then just create DNS aliases and configure nginx to do session affinity if needed, etc. Not only will it scale with your load instead of your cluster size, but you can also update it with a rolling update today; DaemonSets cannot yet do that.
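As a sketch, that setup might look like the following (names, replica counts, and the controller image tag are illustrative, not from the article):

```yaml
# Ingress controller as a Deployment, scaled independently of node count
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginx-ingress
spec:
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx-ingress
    spec:
      containers:
      - name: nginx-ingress
        image: gcr.io/google_containers/nginx-ingress-controller:0.8.3
        args:
        - /nginx-ingress-controller
        - --default-backend-service=$(POD_NAMESPACE)/default-http-backend
---
# Cloud load balancer in front of the Deployment
apiVersion: v1
kind: Service
metadata:
  name: nginx-ingress
spec:
  type: LoadBalancer
  selector:
    app: nginx-ingress
  ports:
  - name: http
    port: 80
  - name: https
    port: 443
```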

Really the most important part, though, is that DaemonSets are for services that need to run on each host. Like a log collection service [1] or prometheus node exporter [2].

[1] https://github.com/kubernetes/kubernetes/tree/master/cluster...

[2] https://github.com/prometheus/node_exporter


So I (the author) am a bit torn on this - a `type: LoadBalancer` service will create a NodePort underneath (yet another internal loadbalancer) and map those ports to a $cloud-platform-tcp-loadbalancer. By using a DaemonSet with a host port bound, you avoid a layer of internal routing.

I'm not so sure if one approach is _particularly_ better than the other though.
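For comparison, the DaemonSet-with-host-port variant from the article looks roughly like this (image and labels illustrative):

```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nginx-ingress
spec:
  template:
    metadata:
      labels:
        app: nginx-ingress
    spec:
      containers:
      - name: nginx-ingress
        image: gcr.io/google_containers/nginx-ingress-controller:0.8.3
        ports:
        # Binding host ports directly skips the NodePort/iptables hop
        - containerPort: 80
          hostPort: 80
        - containerPort: 443
          hostPort: 443
```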


As far as the routing layer, that's going to be a very small gain as it's all handled by the kernel via iptables anyway. What you give up by using a DaemonSet is a constant overhead for every single worker node that gets created. You have to set aside capacity for traffic you might not have.

A Deployment can be scaled by actual utilization of the pod, and in the near future (if not already) via custom metrics of your own design (e.g. prometheus).

So if you have to spin up 50 new nodes to handle some batch machine learning work, you're going to wastefully create 50 new nginx instances when there's no new ingress traffic to handle. With a deployment, it just scales naturally as needed. So it's not about the marginal gains, it's about using the right tool for the job. :-)


Fair enough points :) I'll see about reworking the article, and our deployment a bit - keeping in mind that LoadBalancer services are platform dependent


Services are all handled with iptables, but in current versions of Kubernetes you end up sending a lot of traffic to different nodes for no reason. There's an open issue to prevent that and send traffic to a pod on the local node where possible, but it's pretty gross right now.

This sucks for performance/reliability reasons. It also makes it crazy difficult to keep track of visitors' source IPs.


FWIW, you can constrain a DaemonSet to run on a specific set of nodes by using nodeAffinity, but you would need to label those nodes appropriately, which you are going to do anyway in a production cluster (e.g. dedicating a specific set of nodes to running infra components such as routers or registries).
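A minimal sketch of that, assuming you've labeled your infra nodes with e.g. `role=infra` (a simple nodeSelector is enough for this case):

```yaml
# First: kubectl label node <node-name> role=infra
spec:
  template:
    spec:
      # Pods of this DaemonSet only land on labeled infra nodes
      nodeSelector:
        role: infra
```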


We started nginx-ingress as a deployment, and we converted it to a DaemonSet:

- We rarely deploy new versions of the ingress controller

- We can't (or don't know how to) choose which nodes the pods will go to. If I make a deployment with 10 replicas, there's a chance they'll all land on the same node

- Because we can't control how the pods are distributed, when a node containing the ingress pod went down, there was a noticeable blip of downtime (~27 seconds). That's kinda unacceptable.

- Nginx ingress is pretty light. I don't mind having just one of them on each node.

- Since we put our databases and stateful stuff outside Kubernetes, we also decided to separate the web-facing Kubernetes cluster from the worker ones. This solves the problem of the "spin up 50 new nodes to handle some batch machine learning job".

So far, so good. I would actually suggest that you use a DaemonSet for this, just like I suggest you convert Kube-DNS to a daemonset (it's not by default on GKE for some obscure reason).


In order to get the proper source IP with TCP load balancing before Kubernetes 1.5, you needed to use a DaemonSet with host networking.

Kubernetes 1.5 introduced Source IP using Source NAT and Health Checks: https://kubernetes.io/docs/tutorials/services/source-ip/
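In 1.5 this is exposed as a beta annotation on the Service (it later became a first-class field); a sketch:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx-ingress
  annotations:
    # Route only to pods on the node that received the traffic,
    # preserving the client source IP (beta annotation in 1.5)
    service.beta.kubernetes.io/external-traffic: OnlyLocal
spec:
  type: LoadBalancer
  selector:
    app: nginx-ingress
  ports:
  - port: 80
```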

You still must write scheduling rules so that pods are scheduled to have at most 1 instance running per node.
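A sketch of such a scheduling rule using pod anti-affinity (assuming the pods carry an `app: nginx-ingress` label; this is an annotation in 1.5 and a first-class field from 1.6):

```yaml
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: nginx-ingress
            # Never co-schedule two ingress pods on the same node
            topologyKey: kubernetes.io/hostname
```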


It's worth noting that there is a discussion on GitHub [0] about building Let's Encrypt automatic cert creation directly into the ingress controller.

[0] https://github.com/kubernetes/kubernetes/issues/19899


That's cool, I've done pretty much the same thing for our internal services. I noticed you use the github org for oauth2proxy.

In our setup, I wanted to add authentication to a few dozen subdomains, but use a single oauth2proxy instance. GitHub OAuth makes this kind of gross: the callback must point to the same subdomain you're trying to authenticate. But it does allow something like /oauth2/callback/route.to.this.instead

In the end, to achieve what I wanted (a single oauth2proxy for multiple internal services) I had to:

- fork oauth2proxy and make a few small changes to the redirect-url implementation

- create a small service which takes oauth.acme.co/oauth2/callback/subdomain.acme.co and redirects to subdomain.acme.co, to comply with GitHub's OAuth requirements

- create a small reverse proxy in Go which does something similar to nginx_auth_request. I had a few specific reasons to do this (like proxying websockets and supporting JWT directly): https://gist.github.com/groob/ea563ea1f3092449cd75eeb78213cd...

I hope that someone ends up writing a k8s ingress controller specific to this use case.



Very nice! I remember talking with you regarding some of this in k8s slack when I was trying to figure out how to wire it all up.

Thank you for all the work you do on the ingress project by the way.


Note one significant gotcha with this approach: the Ingress does TLS termination, so the hop from the Ingress to your pod is unencrypted.

That might be OK if 1) your data isn't sensitive or 2) you're running on your own metal (and so you control the network), but on GKE your nodes are on Google's SDN, and so you're sending your traffic across their DCs in the clear.

There are a couple of pieces of hard-to-find config required to achieve TLS-to-the-pod with Ingress:

1) You need to enable ssl-passthrough on your nginx ingress; this is a simple annotation: https://github.com/kubernetes/contrib/issues/1854. This will use nginx's streaming mode to route requests with SNI without terminating the TLS connection.
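A hedged sketch of that annotation on an Ingress resource; at the time of writing it lived under the `ingress.kubernetes.io/` prefix, and depending on the controller version you may also need to start the controller with `--enable-ssl-passthrough` (hostnames and service names below are illustrative):

```yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: my-app
  annotations:
    # Stream TLS straight through to the backend based on SNI,
    # without terminating it at the ingress
    ingress.kubernetes.io/ssl-passthrough: "true"
spec:
  rules:
  - host: app.example.com
    http:
      paths:
      - backend:
          serviceName: my-app
          servicePort: 443
```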

2) Now you'll need a way of getting your certs into the pod; kube-lego attaches the certs to the Ingress pod, which is not what you want for TLS-to-the-pod. https://github.com/PalmStoneGames/kube-cert-manager/ lets you do this in an automated way, by creating k8s secrets containing the letsencrypt certs.

3) Your pods will need an SSL proxy to terminate the TLS connection. I use a modified version of https://github.com/GoogleCloudPlatform/nginx-ssl-proxy.

4) You'll want a way to dynamically create DNS entries; Mate is a good approach here. Note that once you enable automatic DNS names for your Services, then it becomes less important to share a single public IP using SNI. You can actually abandon the Ingress, and have Mate set up your generated DNS records to point to the Service's LoadBalancer IP.

(As an aside, if you stick with Nginx Ingress, you can connect it to the outside world using a Kubernetes loadbalancer, instead of having to use a Terraform LB; the (hard-to-find and fairly new) config flag for that is `publish-service` (https://github.com/kubernetes/ingress/blob/master/core/pkg/i...).)


I wonder how much of a vulnerability that really is. The SDN encapsulates everything and is supposedly IP-spoofing-secure, so in principle there's no way for anyone else in the same DC to get your traffic.

Of course, you could have a local attacker get in through other means, and then access local DC traffic within your SDN. But if you get to that point, you probably have bigger problem than terminating SSL.


Your main attack vectors are:

1) A disgruntled employee sets up a surreptitious tap on the network to see if any secret material comes through. A high value target would be `Authorization: Bearer` in HTTP headers, but there are plenty of other things to slurp up.

2) A normally honest employee running an unrelated network tap to diagnose an issue with the SDN spots your Authorization headers (or other secrets), and knowing that they have a legitimate reason to have the wire capture, copies out the key material. This is very hard to prevent, since network admins can and should be tapping the network from time to time.

I'm not particularly concerned about someone from outside Google breaking into the SDN fabric, though a hypervisor breach could leak network traffic from other tenants on your instance (if you are sharing).


Yes and no.

You're sending traffic over google's SDN in the clear, which is still encrypted by google if you believe:

https://cloud.google.com/security/security-design/


I am fairly sure that traffic between nodes is not encrypted; I have dug into this in some detail with GCP staff. If you can quote some sources to the contrary I'd be interested to see them.

(The linked security design doc mentions 'encrypted in transit _to the data center_', but it doesn't address traffic inside the DC, last time I read through it in detail).


No, I suspect you're right, even though the linked security design doc does say they encapsulate all app-level traffic (such as HTTP) in their own RPC.


Would an overlay network with a shared secret for encryption of pod-to-pod networking be another solution to this problem? I feel like the ideal should involve keeping the key material in as few places as possible.


Yes, that's an option -- the simplest option would be a single overlay network, with the node-node tunnels encrypted using IPSec or similar (https://github.com/coreos/flannel/issues/6 or https://www.weave.works/documentation/net-latest-how-it-work... or https://github.com/projectcalico/felix/issues/997). I think this would be tricky to configure in GKE though.

With a secret-per-pod, your key material lives in etcd on the API server, and gets mounted in a tmpfs on each pod that is given the Secret. Only Pods in the Secret's namespace can access the Secret, so if you have RBAC configured correctly it should be possible to lock this down tightly to only the code that needs the Secret. (I'm not sure how to do this in GKE; I'm currently treating each cluster as a single security domain.)
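A sketch of the per-pod mount, assuming a Secret named `tls-cert` holding the Let's Encrypt cert and key (names illustrative):

```yaml
spec:
  containers:
  - name: ssl-proxy
    volumeMounts:
    - name: tls
      mountPath: /etc/tls
      readOnly: true
  volumes:
  - name: tls
    secret:
      secretName: tls-cert   # surfaced to the container via tmpfs
```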


If you are open to using an (awesome) vendor for this kind of thing, I would highly recommend taking a look at Backplane (https://www.backplane.io/). It's as simple as running a sidecar Backplane agent container alongside any container that you want to route HTTP traffic to, and then shaping the traffic via their API (it becomes your load balancer as well). They automatically provision Let's Encrypt certificates for your endpoints, so you don't have to worry about any of that. We have been using it at Mux in production for months and have been very happy with the results. Backplane also has other nice built-in features, like blue/green deploys and OAuth support, that are really nice to have out of the box.


Make sure to use an HPA and set up resource constraints on that ingress controller pod. Unbounded resource utilization may bite you in the a$$.

https://kubernetes.io/docs/user-guide/horizontal-pod-autosca...
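A sketch combining both suggestions (the numbers are placeholders, not recommendations):

```yaml
# Resource requests/limits on the ingress controller container
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 256Mi
---
# HPA scaling the controller Deployment on CPU utilization
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-ingress
spec:
  scaleTargetRef:
    apiVersion: extensions/v1beta1
    kind: Deployment
    name: nginx-ingress
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
```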

Also, you may have redacted it but you don't appear to be adding a service with a static IP:

  spec:
    loadBalancerIP: 1.2.3.4

Not having a global static IP for publicly accessible resources seems risky for uptime.

We've moved away from ingress controllers to services with static IPs + HPAs on nginx pods for this reason. Having to add a service + ingress controller adds complexity and doesn't really add value (IMO), since you can easily add nginx.conf as a ConfigMap and get the same ease of configuration as an ingress controller. Your mileage may vary with Let's Encrypt integrations.
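A sketch of that ConfigMap pattern (the config contents and service name are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-conf
data:
  nginx.conf: |
    events {}
    http {
      server {
        listen 80;
        location / {
          # Proxy to an in-cluster service by its DNS name
          proxy_pass http://my-app.default.svc.cluster.local;
        }
      }
    }
```

The ConfigMap is then mounted into the nginx pods as a volume, so config changes are a `kubectl apply` plus a rolling restart rather than a new image build.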


This is a really great post, excited to give the OAuth 2 auth a try.

FWIW, an easier way to get started with the NGINX Ingress and kube-lego services is using the official Helm[1] Charts for them (https://github.com/kubernetes/charts/tree/master/stable/ngin... and https://github.com/kubernetes/charts/tree/master/stable/kube...).

[1] https://github.com/kubernetes/helm


That's impressive, but also quite some effort. It feels like premature optimization given the (rather low) traffic of fromAtoB. On the other hand, it's always good to have a scalable deployment when dealing with RoR apps.


> On GCP, the HTTP load balancers do not support TLS-SNI, which means you need a new frontend IP address per SSL certificate you have. For internal services, this is a pain, as you cannot point a wildcard DNS entry to a single IP, like *.fromatob.com, and then have everything just work.

Wouldn't a wildcard SSL cert + wildcard DNS entry work even without SNI support here? I haven't used the GCP load balancer, but as long as you are serving a single certificate (*.fromatob.com), the client/server don't have to rely on SNI at all.


Question for the author: We just migrated some stuff to GCP as well, but do not use Kubernetes. For managing infrastructure we only use Packer, bash, and Google Cloud deployment YAML files (similar to the Kubernetes manifests).

Why do you still need Saltstack, and how do you find Terraform? Why do you need Terraform (I suppose it is for your non-Kubernetes infrastructure)?


For the moment at least, it's much more comfortable for us to keep our databases outside of kubernetes, so we use saltstack(masterless), packer, and terraform to manage them. We also use terraform to manage all of our DNS, which is split between Route53 and the GCP DNS service.


Thanks! I have been meaning to give terraform a try to replace some of our custom gcloud + gcloud deployment descriptors. Also so that we don't need a separate docker compose version for development (I'm assuming in theory you can run terraform to do what docker compose does?).


> Also so that we don't need a separate docker compose version for development (I'm assuming in theory you can run terraform to do what docker compose does?).

Not really. Terraform is much more meant as a tool for manipulating production infrastructure (primarily clouds), not for orchestrating Docker containers (including locally).

I'd strongly recommend you use the right tool for the job and it's a very rare job where Terraform is a good alternative to docker-compose.


Discovered kube-lego via Google a few weeks ago and I am really excited to try it with my next product. Thanks for this post.



