I recently learned about a CoreOS-specific operator for that purpose: https://github.com/coreos/container-linux-update-operator. In standard CoreOS installations, once system updates are installed, the node is automatically rebooted using a semaphore. Instead, this uses the Kubernetes API to drain the pods from the node for a clean shutdown.
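The operator does this through the Kubernetes API, but the equivalent sequence is easy to see from the command line. A minimal sketch of a pre-reboot drain hook (the script structure and DRY_RUN guard are illustrative, not from the operator itself):

```shell
#!/bin/sh
# Sketch of a drain-before-reboot hook. DRY_RUN=1 (the default here)
# prints the commands instead of executing them, so the sequence can
# be inspected without a cluster.
set -eu

DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

drain_and_reboot() {
  node="$1"
  # Cordon the node and evict its pods via the API. DaemonSet pods are
  # skipped since they would just be recreated on the same node.
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s
  # Only reboot once the drain succeeded (set -e aborts otherwise).
  run systemctl reboot
}

drain_and_reboot "${NODE:-$(hostname)}"
```

After the node comes back up, a `kubectl uncordon "$node"` makes it schedulable again; the operator automates that half of the cycle too.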
Kured has a feature to hold off reboots if any Prometheus alerts are firing; that’s a nice touch.
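For reference, that gate is configured with flags on the kured daemon. A sketch of the invocation (flag names are per the kured README; the Prometheus URL and the alert names in the ignore regex are assumptions about your cluster):

```shell
# Illustrative kured flags: block reboots while alerts are firing,
# except for alerts matching the filter regex, which are ignored.
kured \
  --reboot-sentinel=/var/run/reboot-required \
  --prometheus-url=http://prometheus.monitoring.svc:9090 \
  --alert-filter-regex='^(RebootRequired|SomeIgnorableAlert)$'
```

In a real deployment these flags go in the args of kured's DaemonSet spec rather than on a command line.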
Performing unattended upgrades seems like a great idea, until a bug introduced in an upgrade causes a performance or operability regression, introduces an even worse security bug, or is otherwise problematic. At worst, you could automate yourself into an outage.
IMO good reliability practices counsel against this approach — certainly don't do it without testing these upgrades in a staging environment first.
How does this play out in an actual failure scenario? How can an administrator detect failures? Does it apply to the entire OS, including kernel upgrades, or just your own software?
Imagine etcd or docker fails during the upgrade on one of the master nodes.
Our monitoring system, Satellite, will report the failure and the upgrade process will stop.
The administrator will then use our diagnostic utility to diagnose the problem, fix it, and resume the upgrade from the last failed step, but in manual mode.
> Does it apply to the entire OS, including kernel upgrades, or just your own software?
One reason I could see is if upgrading nodes in place is substantially faster or cheaper than migrating their data to new ones.
For instance, if the nodes are storage nodes in a redundant storage system, taking each one offline briefly for a reboot, and letting the others handle the slightly higher load, is a lot quicker than spinning up a new node and replicating all of the data over there so you can de-provision the old one.
Also, even if they don't have a lot of data, the time and resulting expense of spinning up and provisioning a new node while the old one is still online could add up to higher costs than just performing a reboot during a point of lower load.
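As a back-of-envelope illustration (all the numbers here are assumptions, not from the thread), the asymmetry is mostly about moving the data:

```shell
#!/bin/sh
# Rough comparison: rebooting a storage node in place vs. replicating
# its data to a freshly provisioned replacement. Numbers are made up
# for illustration.
DATA_GB=10000          # 10 TB of data on the node
NET_GBPS=10            # 10 Gbit/s effective replication bandwidth
REBOOT_SECONDS=300     # ~5 minutes of downtime for a reboot

# Seconds to copy DATA_GB gigabytes at NET_GBPS gigabits/second:
# multiply by 8 to convert bytes to bits.
replicate_seconds=$(( DATA_GB * 8 / NET_GBPS ))

echo "reboot in place: ${REBOOT_SECONDS}s"
echo "replicate away:  ${replicate_seconds}s"
```

With these assumed numbers the replacement path takes over two hours of replication per node, versus a few minutes of slightly elevated load on the node's peers during a reboot.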
Bare metal in and of itself doesn't necessitate a long-running "substrate" operating system. Systems like Warewulf^WPerceus^WWarewulf[1] and xcat[2] have made it easy to provision stateless bare-metal hosts with almost docker-image-like filesystem images or "capsules".
Arguably data locality (as a parent-sibling comment suggests) is one reason you'd want to keep the substrate image around for a long time, but even then, there's no reason you have to destroy a data volume when reprovisioning an os volume. (nor do you even need an os volume if you run your substrate from ram).
xcat claims support for diskless service nodes on their website. I can't speak for how long it's been officially supported, but I seem to recall IBM claiming support in 2009, when they sold us a pair of iDataPlex clusters at UC.
Amusing side note: That was an interesting project with regard to system provisioning. The team responsible for the "south" cluster was from SDSC made up of folks who developed ROCKS[1], and the "north" cluster was managed by my team at LBNL, made up of folks who developed Warewulf (though at the time it was called Perceus). Each team was adamant about using their own tooling, so it made for some really long conference calls about keeping runtimes in sync.
The service node's OS itself is/can-be stateless, but you still need to either laboriously copy the /install data every boot or else mount it in the running image somehow. So it's easy to provision them but you still need some stable data storage for them. The "stateless" xcat stuff is mostly aimed at compute-side operations where everything has a shared filesystem anyway.
I'm familiar with rocks, but I've never seen a warewulf cluster in the wild -- are you still using it?
To be fair, Warewulf largely 'suffered' from the same issue -- you could pack a fat initrd to be delivered over the network, but doing that at every boot with the full stack needed for compute at the time was crazy slow, and in some hardware configurations just wouldn't work. In an HPC environment we had shared filesystems out the wazoo (and with some amazing performance), so it made sense to mount up the majority of the os filesystem over NFS (and I wanna say some creative use of overlays).
I was only with the group a couple years nearly a decade ago (my job is up in the clouds these days) but it looks like Warewulf is still actively used: http://metacluster.lbl.gov/warewulf
This is pretty neat. We accomplished something similar but simpler: unattended reboots at a randomised time spread across 24 hours, with a script that drains the node first. Not perfect, but simple to set up.
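A rough sketch of that approach, assuming a site-specific drain script (the `drain.sh` name is a placeholder):

```shell
#!/bin/sh
# Each node independently picks a uniformly random minute in the next
# 24 hours, then drains itself and reboots when that time arrives.
set -eu

# Pick a random offset in [0, 1440) minutes. /dev/urandom + od is used
# instead of $RANDOM, which plain POSIX sh doesn't guarantee.
random_minute() {
  r=$(od -An -N2 -tu2 /dev/urandom | tr -d ' ')
  echo $(( r % 1440 ))
}

offset=$(random_minute)
echo "rebooting in ${offset} minutes"
# The actual sleep/drain/reboot, commented out for the sketch:
# sleep $(( offset * 60 ))
# ./drain.sh "$(hostname)" && systemctl reboot
```

Because every node draws its offset independently, reboots scatter across the day instead of all nodes going down at once — simple, though with no coordination to guarantee two replicas never overlap.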