I recently learned about a CoreOS-specific operator for that purpose: https://github.com/coreos/container-linux-update-operator. In standard CoreOS installations, once system updates are installed, the node is automatically rebooted using a semaphore. Instead, this uses the Kubernetes API to drain the pods from the node for a clean shutdown.
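The operator does this through the Kubernetes API, but the equivalent sequence is easy to see from the command line. A minimal sketch of a pre-reboot drain hook (the script structure and DRY_RUN guard are illustrative, not from the operator itself):

```shell
#!/bin/sh
# Sketch of a drain-before-reboot hook. DRY_RUN=1 (the default here)
# prints the commands instead of executing them, so the sequence can
# be inspected without a cluster.
set -eu

DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

drain_and_reboot() {
  node="$1"
  # Cordon the node and evict its pods via the API. DaemonSet pods are
  # skipped since they would just be recreated on the same node.
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s
  # Only reboot once the drain succeeded (set -e aborts otherwise).
  run systemctl reboot
}

drain_and_reboot "${NODE:-$(hostname)}"
```

After the node comes back up, a `kubectl uncordon "$node"` makes it schedulable again; the operator automates that half of the cycle too.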
Kured has a feature to hold off reboots if any Prometheus alerts are firing; that’s a nice touch.
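For reference, that gate is configured with flags on the kured daemon. A sketch of the invocation (flag names are per the kured README; the Prometheus URL and the alert names in the ignore regex are assumptions about your cluster):

```shell
# Illustrative kured flags: block reboots while alerts are firing,
# except for alerts matching the filter regex, which are ignored.
kured \
  --reboot-sentinel=/var/run/reboot-required \
  --prometheus-url=http://prometheus.monitoring.svc:9090 \
  --alert-filter-regex='^(RebootRequired|SomeIgnorableAlert)$'
```

In a real deployment these flags go in the args of kured's DaemonSet spec rather than on a command line.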
Performing unattended upgrades seems like a great idea, until a bug introduced in an upgrade causes a performance or operability regression, introduces an even worse security bug, or is otherwise problematic. At worst, you could automate yourself into an outage.
IMO good reliability practices counsel against this approach — certainly don't do it without testing these upgrades in a staging environment first.
How does this play out in an actual failure scenario? How can an administrator detect failures? Does it apply to the entire OS, including kernel upgrades, or just your own software?
Imagine etcd or docker fails during the upgrade on one of the master nodes.
Our monitoring system, Satellite, will report the failure and the upgrade process will stop.
The administrator will then use our diagnostic utility to diagnose the problem, fix it, and resume the upgrade from the last failed step, but in manual mode.
> Does it apply to the entire OS, including kernel upgrades, or just your own software?
One reason I could see is if upgrading nodes in place is substantially faster or cheaper than migrating their data to new ones.
For instance, if the nodes are storage nodes in a redundant storage system, taking each one offline briefly for a reboot, and letting the others handle the slightly higher load, is a lot quicker than spinning up a new node and replicating all of the data over there so you can de-provision the old one.
Also, even if they don't have a lot of data, the time and resulting expense of spinning up and provisioning a new node while the old one is still online could add up to higher costs than just performing a reboot during a point of lower load.
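As a back-of-envelope illustration (all the numbers here are assumptions, not from the thread), the asymmetry is mostly about moving the data:

```shell
#!/bin/sh
# Rough comparison: rebooting a storage node in place vs. replicating
# its data to a freshly provisioned replacement. Numbers are made up
# for illustration.
DATA_GB=10000          # 10 TB of data on the node
NET_GBPS=10            # 10 Gbit/s effective replication bandwidth
REBOOT_SECONDS=300     # ~5 minutes of downtime for a reboot

# Seconds to copy DATA_GB gigabytes at NET_GBPS gigabits/second:
# multiply by 8 to convert bytes to bits.
replicate_seconds=$(( DATA_GB * 8 / NET_GBPS ))

echo "reboot in place: ${REBOOT_SECONDS}s"
echo "replicate away:  ${replicate_seconds}s"
```

With these assumed numbers the replacement path takes over two hours of replication per node, versus a few minutes of slightly elevated load on the node's peers during a reboot.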
Bare metal in and of itself doesn't necessitate a long-running "substrate" operating system. Systems like Warewulf^WPerceus^WWarewulf[1] and xcat[2] have made it easy to provision stateless bare-metal hosts with almost docker-image-like filesystem images or "capsules".
Arguably data locality (as a parent-sibling comment suggests) is one reason you'd want to keep the substrate image around for a long time, but even then, there's no reason you have to destroy a data volume when reprovisioning an os volume. (nor do you even need an os volume if you run your substrate from ram).
xcat claims support for diskless service nodes on their website. I can't speak for how long it's been officially supported, but I seem to recall IBM claiming support in 2009, when they sold us a pair of iDataPlex clusters at UC.
Amusing side note: That was an interesting project with regard to system provisioning. The team responsible for the "south" cluster was from SDSC made up of folks who developed ROCKS[1], and the "north" cluster was managed by my team at LBNL, made up of folks who developed Warewulf (though at the time it was called Perceus). Each team was adamant about using their own tooling, so it made for some really long conference calls about keeping runtimes in sync.
The service node's OS itself is/can-be stateless, but you still need to either laboriously copy the /install data every boot or else mount it in the running image somehow. So it's easy to provision them but you still need some stable data storage for them. The "stateless" xcat stuff is mostly aimed at compute-side operations where everything has a shared filesystem anyway.
I'm familiar with rocks, but I've never seen a warewulf cluster in the wild -- are you still using it?
To be fair, Warewulf largely 'suffered' from the same issue -- you could pack a fat initrd to be delivered over the network, but doing that at every boot with the full stack needed for compute at the time was crazy slow, and in some hardware configurations just wouldn't work. In an HPC environment we had shared filesystems out the wazoo (and with some amazing performance), so it made sense to mount up the majority of the os filesystem over NFS (and I wanna say some creative use of overlays).
I was only with the group a couple years nearly a decade ago (my job is up in the clouds these days) but it looks like Warewulf is still actively used: http://metacluster.lbl.gov/warewulf
This is pretty neat. We accomplished something similar but simpler: unattended reboots at a randomised time spread across 24 hours, with a script that drains the node first. Not perfect, but simple to set up.
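A rough sketch of that approach, assuming a site-specific drain script (the `drain.sh` name is a placeholder):

```shell
#!/bin/sh
# Each node independently picks a uniformly random minute in the next
# 24 hours, then drains itself and reboots when that time arrives.
set -eu

# Pick a random offset in [0, 1440) minutes. /dev/urandom + od is used
# instead of $RANDOM, which plain POSIX sh doesn't guarantee.
random_minute() {
  r=$(od -An -N2 -tu2 /dev/urandom | tr -d ' ')
  echo $(( r % 1440 ))
}

offset=$(random_minute)
echo "rebooting in ${offset} minutes"
# The actual sleep/drain/reboot, commented out for the sketch:
# sleep $(( offset * 60 ))
# ./drain.sh "$(hostname)" && systemctl reboot
```

Because every node draws its offset independently, reboots scatter across the day instead of all nodes going down at once — simple, though with no coordination to guarantee two replicas never overlap.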