A colleague from a different department was managing a fleet of servers doing a lot of computing. Their BMCs would just stop working after a few weeks of uptime, every time. He was annoyed by it but mostly ignored it, as the BMC was only used for additional monitoring. The power outlets in the racks were remotely controllable, so you could still hard reset a server if required.
However, as this fleet was used for computation, after a while they noticed that whenever the BMC stopped working, the performance of the system increased by almost 10% or so. Definitely a non-negligible amount. So they kept the machines running in the broken-BMC state for as long as possible.
I've seen pretty much the opposite close to a decade ago with a number of AMD Opteron/15h-based servers we had, where the BMC could end up in a bad state that made the ipmi-driver-spawned kernel thread handling it burn a whole core for no good reason.
I wonder what the BMC was doing to hurt performance. Maybe it was questionable DVFS that stopped when the BMC died? Poor fan management that caused thermal throttling?
That's a pretty good guess: lots of BMCs have fan control, and turning it off should make all fans stay at max power.
In a datacenter the odds of anyone noticing that the fans are always on high is practically nil, unless you are specifically monitoring fan RPM. Most folks don't bother, as what you are actually interested in is temperature.
We had an interesting incident where one of our datacenter temperature sensors kept on rising, and operators on site could clearly hear the noise increase of all servers going to max fan speed following a BMC "crash" triggered by a network loop on our IPMI lan.
It took us a while to identify the issue. All systems were running fine, but we had to shut down many racks to keep the temperature from rising too high.
Anecdotally, this reminds me of a terrible experience with some Dell hardware back around 2011. I volunteered to do an installation of some hardware in a POP cage at one of the London UK colos.
Hardware arrived a day or two late, got it unpacked and carted upstairs. Spent a few days installing it (had to build a deployment environment on my Mac in a VM because the servers had no CDROM drive, only NetBoot and USB; our centralised imaging system hadn't been built with "installs outside of core DC network" in mind).
Discovered that the machines had been ordered with iDRACs, not DRACs. Damn things did not work until the box was booted into Linux and a driver to interface/activate the board was loaded. This, of course, worked wonderfully when the kernel locked up with a panic.
The machines were being used to test custom kernel modifications. Panics and lockups were common. Every time it happened, the engineers in the US had to ask "smart hands" to go reset the machine in the cage.
Experiment was a success though. Turned into the foundation for how a very large edge network was built.
This is brought to you by the BMC with a KVM-over-IP that wouldn't accept '2' entered on the (virtual) keyboard in any way or form.
This is one of those things that I'd probably be willing to spend a ton of time analysing to find the root cause if I could. Dump the memory image, debug the code, and figure out why it's just that (and possibly other) keys. My guess just from the description is that a bitflip happened exactly to the entry of a character map translation table or similar.
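A toy sketch of that bit-flip hypothesis, with made-up scancode values (a real BMC keymap is a binary table, not a Python dict): a single flipped bit in one translation-table entry is enough to kill exactly one key while leaving every other key fine.

```python
# Hypothetical illustration: a scancode -> character table where one
# entry takes a single-bit hit. Values are invented, not a real HID table.
keymap = {i: chr(ord('0') + i) for i in range(10)}  # scancodes 0..9 -> '0'..'9'

def flip_bit(table, scancode, bit):
    """Return a copy of the table with one bit flipped in one entry."""
    corrupted = dict(table)
    corrupted[scancode] = chr(ord(corrupted[scancode]) ^ (1 << bit))
    return corrupted

bad = flip_bit(keymap, 2, 4)   # '2' is 0x32; 0x32 ^ 0x10 = 0x22 = '"'
print(bad[2])                  # the '2' key now types '"'; all others fine
```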
This stuff seems to happen on Proxmox when you talk to a VM console using the noVNC web interface. If I recall correctly, it wouldn't handle modifier keys correctly for me, so stuff like Shift-2 for @ wouldn't work.
I suspect it is some kind of console and raw keycode mismatch with the remote ui, maybe via the browser.
I do development on OpenBMC (https://github.com/openbmc/openbmc), which is an open source BMC implementation built with BitBake, primarily targeting ASPEED and Nuvoton BMC chips.
I was looking into this recently. It doesn’t seem like you can easily get your hands on BMC hardware. There is one project I found where they’re using an FPGA and everything is open source but it still looked far from easy.
Afaik the problem is typically not hardware but software, since most server motherboards come with a BMC chip on them. Are you designing a motherboard? All I would ask for from a BMC is root access. But now everyone (except the big players, who design their own motherboards) is basically stuck with a stupid embedded Linux with shitty software that has half-baked features they don't even need, but now need to care about. When all you'd need would be a way to access a host serial console, read sensors, control the boot sequence, and perhaps re-flash the BIOS, i.e. most likely just basic interactions on serial interfaces.
> Are you designing a motherboard?
Yes, the team I work for designs server motherboards.
BMC security is what keeps me up at night. Firmware software quality is low, and often not up to date. I think openbmc does a good job in both respects.
If the servers in question happen to be on the list of supported hardware, quite possibly. (I don't know of any up-to-date online list, but running `source setup` in the root of the source tree will print it.)
Not returned, but I've seen clients refuse to go with HP again because iLO sucked too much, and come upgrade time went Dell. Unfortunately for them, right around the time iLO got okay-ish and iDRAC got shitty.
This was me, though I ended up moving to SuperMicro instead. Not fancy, but their BMC seems to get the job done, and also doesn't cost a significant chunk of extra money for basic functionality, and then even more money to have IPMI on a dedicated connection instead of shared. And HP's frigging BIOS wouldn't even work with their own HP rack console! But it was happy with an Apple mouse and keyboard. Argh! Making me irritated again just remembering it. But yeah, we didn't return them, but we did sell them off to some other poor sucker and never bought an HP again. They may be great at large scale with the higher-end central management software suite and support etc., but the BMC definitely was a dealbreaker at an SMB level. It's one of the few differentiating features vs a pile of other competitors who do the same basic thing, so I do think it's not entirely meaningless to get it badly wrong.
> SuperMicro instead. Not fancy but their BMC seems to get the job done
Kind of depends which generation of servers. I had worked with a lot of X9 and X10 (Xeon E5-2600 v1-v4), which were alright, as long as you don't mind outdated Java (well, the newest X10 BMCs do HTML5 consoles too, IIRC); but I recently started renting an X8 server personally, and it's worse... My favorite is when serial over LAN just stops responding when you go from console redirection to an OS-opened serial port (and back)... real helpful for inputting disk encryption passphrases. Oh well, I'm renting this server because it's cheap; it's also 10+ years old, and it works enough.
Totally fair. Yeah, I should have specified I'm talking fairly new stuff, at least new enough that it all has HTML5 consoles and mildly more polish. Not that there aren't problems; less a "bug" than a "cutting edge for newbies" thing, it does near-zilch certificate validation, for example. So when the friendly new guy wasn't paying attention, generated a client instead of a server certificate, and uploaded it to a remote one, it cheerfully accepted it and whoops! Operator error of course, but the real point was that remotely rectifying it proved surprisingly impossible despite still having admin serial/SSH access. Documentation is bad and it didn't seem to like typical reset codes or tools.
I guess after lots of problems with expensive "high end" fancy stuff like HPE's and co, I kind of felt resigned that all BMC/IPMI was kind of crap, and at least with SM's I felt less obviously squeezed. Like I said, the math will all be different for high-end stuff and big herds. But neither Dell's nor HPE's struck me as "oh yeah, that's worth an extra $300 on every single last server!"
I was just fiddling around with a SuperMicro X8 IPMI the other day. The X8 IPMI stuff is terrible, e.g. the warning that your Java installation is outdated on opening the website etc.
Turns out, you can actually install X9 IPMI firmware on X8 boards as the platform files are still shipped.
Might be worth checking out, if this improves things for you. It did for me.
Check out https://github.com/devicenull/ipmi_firmware_tools for unpacking (and repacking) the SuperMicro firmware. The developer just merged my patches making it work with some of the X8 boards.
As long as your board is listed in /etc/defaults of the IPMI tree you should be good.
Datacenter / Cloud Service hardware is a race to the bottom, we want simplicity and ability to work with off-the-shelf and widely available tools (Redfish (cURL), IPMI (freeipmi, openipmi, ipmitool)) so we can integrate these things into our hardware management platforms.
In this world, the legacy vendors like Dell and HP feel like they have to differentiate their hardware somehow, or they lose all the margin. So they'll charge you for all those "value-add" things like KVM or OOB Firmware Updates, because they can't make money on the machines themselves anymore.
The irony is all that extra garbage they add to their servers is exactly the opposite of what you want at high scale and really only serves the "enterprise" market that tends to deploy VMware and hand-manage servers and need point-and-click stuff since there's no incentive to write software to manage small environments like that.
Yes, and as essentially a "hidden flat tax" it burns a lot more for large quantities of lower-end hardware vs single bigger iron. Looking back at the records, at the time adding their iLO/M.2 comm card (needed to have a dedicated iLO port on those servers) was $65 from Provantage, and then the license to actually do basic stuff with it, like a web console, was $227. So ~$290 all told extra. We've got a few higher-end systems, $10-20k NAS or larger hypervisor systems, and at that level sure, an extra $300 stings a bit but is set against a lot of other stuff. At medium range, more like $3k-5k, obviously worse. But some use cases called for say 6x $500-600 systems instead of a single $3k-6k system, and for a few projects someone was trying to go as cheap as possible and grabbed some still pretty new (Gen10) plain vanilla ones off of used/bankruptcy sales for like $300-400 because the price seemed so attractive. But then using iLO properly could nearly double the price, and at the lower end there is a big difference in the hardware you can get for $600-700 vs $300-400, and then multiply that per unit.
Meanwhile competitor systems all have dedicated ports out of box. SuperMicro does have a paid "full unlock" for their BMC, but the only thing it does is add bios updates and such. All the core normal management functionality is there by default. It also only costs $30.
Granted HPs had other irritations like really wonky proprietary fan control (heck, proprietary fan cables too) that would mysteriously fail to function with different flavors of the same OS, and couldn't be overridden from the BMC (then what's the point!?). Also they were slow with EPYC options when that's what a lot of us really wanted to be switching for, the performance and value propositions were getting really good vs Intel who were also jerks.
Like lots of big players, the experience may get different if you're buying hundreds to thousands or more units and have a dedicated account manager who takes care of all this for you etc etc. But x86-based servers are a pretty damn competitive market, and at some point one has to stop and ask why hours are being burned futzing with stuff when literally the entire basic point of getting "server class" hardware with remote management functionality is to save man hours by NOT having to futz. So yeah, there's a little rant I didn't even know I still had in me years later :). HP you silly goofs.
I've personally been involved in decisions like that, as well. If you're throwing stuff into a datacenter and the BMC doesn't work right[1], the hardware is basically a brick. Vendors should be blackballed for their incompetence.
[1]: all things i have personal experience with:
- chassis bootdev pxe => doesn't do that, just reboots to normal OS
- chassis power off => doesn't do it (oh, here's an ipmi raw command you can use for this BMC version. NO!)
- DHCP server sends an option in an offer that the BMC doesn't understand => BMC drops the whole offer, no IP at all.
- gets scanned by auditor-checkbox-as-a-service (Qualys), locks up, sends the host CPU to 100%, and locks that up too.
- Not supporting IPv6 properly (if there's a place to start deploying IPv6 it's BMCs), i.e. uses SLAAC properly, but doesn't use the gateway from the RA, so you actually can't use it from outside its own segment - needs a firmware update to fix, but uhh, we didn't dual-stack the BMC network because the whole point is to get that IP space back.
We had to write a test suite for vendors to run against their BMC and validate these things and you were disqualified if you failed.
The device in question didn't have NTP, so to get accurate timestamps in the BMC logs, its clock had to be synced from the host OS.
The fun part was them fucking up the watchdog implementation in the BMC, which worked on wall-clock time.
Which means if you changed the time in between the OS sending a watchdog ping and the BMC checking for the timeout, you got a system reboot.
But it was intermittent enough that you might set up the cron job to update the BMC clock (because you want accurate time on it for log correlation), and the timing was unlucky enough that you'd never connect it to the cause when it finally happened.
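The failure mode is easy to reproduce in miniature. This is a hypothetical sketch, not the vendor's firmware: a watchdog that computes its deadline from whatever clock it is given fires spuriously the moment a wall clock is stepped forward, which a monotonic clock would avoid.

```python
class ToyWatchdog:
    """Illustrative only: a watchdog whose timeout is computed from an
    injected clock. Feeding it a steppable wall clock reproduces the bug;
    something like time.monotonic() would not."""

    def __init__(self, timeout_s, clock):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_ping = clock()

    def ping(self):
        self.last_ping = self.clock()

    def expired(self):
        return self.clock() - self.last_ping > self.timeout_s

# Simulated wall clock that we can step, like an NTP/cron sync would.
now = [1000.0]
wd = ToyWatchdog(timeout_s=60, clock=lambda: now[0])

wd.ping()            # host OS pings the watchdog...
now[0] += 3600       # ...then the clock gets stepped forward an hour
print(wd.expired())  # True: spurious timeout, system reboot
```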
Also, we generally observed that BMCs got better. Ten years ago you were lucky if the BMC was working and not crashing every few weeks (mostly on IBM gear), to the point we had to stop alerting on that, and it was unreliable as a method for OOB access; so far anything newer than a few years old has been fine. Although I have no idea how you make modern hardware take minutes to return a list of sensors (I'm looking at you, Lenovo).
I am honestly surprised how bad many of these are, and in production, no less.
I recently set up a supermicro system and spent a whole day just trying to figure out what to install to get the stupid ancient Java crap to load so I could mount an ISO.
There are reasons for tools like https://github.com/ixs/kvm-cli. It seems like every operations team built their own version that logs into the web interfaces, downloads the java stuff and then runs it locally...
We have various Supermicro boards in production at work with BMCs from 2018 or so. The ATEN iKVM on them works just fine with a recent OpenJDK 11 and OpenWebStart. I’ve found that all the features work including mounting ISOs and doing remote upgrades. No need to whine about installing anything ancient or spreading ridiculous Java FUD.
Sysadmin for 15 years here though, Java was always a problem. The version wasn't always the worst bit, but it was always an exercise in frustration.
Mostly security controls to blame, but I have had so many issues across so many systems that I cannot stand by and let you claim this is FUD about Java.
the .NET applets are also a problem (because who has a compatible IE version?), but they worked more consistently than the Java ones back in the day.
The HTML5 ones are the only ones that seem to work consistently; but that could be biased as HTML5 is much newer, so BMCs implementing that might be updated with more regularity. (or be more modern hardware)
With current browsers, Java applets are not supported anymore. On some older HPE systems, they didn't update the firmware to provide alternatives.
I've seen multiple vendors with problematic code that didn't work with newer Java versions.
This is by no means meant to bash Java. Some non-Java BMCs can be horrible as well (e.g. require many TCP ports in a firewall/tunnel unfriendly way or require SSH with old algorithms that are no longer enabled by default, or telnet..)
I've really only had experience with Supermicro BMCs, but I totally believe you that there are lots of crufty OOB environments in the wild which are hard to work with. While it's true that applets don't work anymore (probably a good thing), and therefore the experience isn't as integrated or seamless as it once was, it's a practical matter to just log into the BMC, click on the console preview, and use the JNLP file to launch the console via OpenWebStart as an independent application outside of the browser. One other thing is that the self-signed certs from these older implementations are often expired and therefore throw an extra warning or two when you launch these interfaces, but you just click through them and carry on.
This is one of the reasons I like to architect networks for netbooting (so no remote media needed) plus force every physical server to boot UEFI-only - because UEFI supports serial console properly, unlike BIOS, so I can just use IPMI Serial-over-LAN support.
Combination of those two generally removes the need for any of the advanced features that required custom clients or even a Web browser
Even worse was one of mine last year that needed Flash. Apparently we neglected to update it. I can handle ancient Java, but trying to get Flash setup was going to be futile, so I just went to the data center.
That was an old Cisco server, right? I think those models still require Flash even if they're fully updated, and people have to use VMs with Flash installed to access the BMC.
Yep. I think you may be right, it’s EOL and probably doesn’t have any more updates available. I have a VM for when I need old Java, but I was going to need an older VM to run Flash, and that just wasn’t how I wanted to spend my time. :D
A few years ago I worked as a grunt for a fleet-wide ESXi upgrade and we took the occasion to update iDRAC. The number one step in the procedure was to reboot the iDRAC, no matter how good its state looked. I have never ever seen both high uptime and a completely functional iDRAC at the same time, across ~200 servers.
Yep. And I still think I borked one iDRAC precisely because I ignored the reboot, because it worked fine on its identical twin sitting right next to it.
In a proper environment, the BMC is on its own dedicated NIC, with no way to bridge to that network from the machine, and the only access from the host machine being via the root/admin account.
And then there are people that just port-forward BMC ports to the internet as cheap remote KVM...
The operative word being "proper". In practice I see it accessible on the LAN far too often. The ipmi v2 protocol is so bad that if you just request to login with a known account name (which is probably 'admin') the BMC server will _provide you the password hash_ for you to crack at your leisure.
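For the curious, that flaw (CVE-2013-4786) is baked into the RAKP handshake itself, not any particular implementation. A simplified sketch, with the session data collapsed into one placeholder blob rather than the real field layout:

```python
# Sketch of why IPMI 2.0's RAKP is offline-crackable (CVE-2013-4786).
# In RAKP message 2 the BMC returns an HMAC keyed by the user's stored
# password to ANY client that starts a session. The session_data blob
# below is a placeholder; the real message concatenates the session
# nonces, BMC GUID, requested role, and username.
import hashlib
import hmac

def rakp2_hmac(password: bytes, session_data: bytes) -> bytes:
    return hmac.new(password, session_data, hashlib.sha1).digest()

# Attacker records one handshake for a known username...
session_data = b"<nonces + GUID + role + username>"
leaked = rakp2_hmac(b"admin123", session_data)  # what the BMC hands back

# ...then brute-forces offline, never touching the BMC again:
for guess in [b"password", b"letmein", b"admin123"]:
    if hmac.compare_digest(rakp2_hmac(guess, session_data), leaked):
        print("cracked:", guess.decode())
        break
```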
IMO it's wrong - It's terrible product design that servers have network ports that will cause catastrophic failures unless carefully only connected to a special expertly secured un-network for fragile things.
It’s not wrong regardless of the BMC security. You don’t plug things into the network that don’t need to be plugged in. BMCs only need network access for servicing most cases.
As a big OpenBSD aficionado I always wish I could just run bog-standard OpenBSD on my BMCs. IPMI does not do much for me, so I would be fine even without a good open source IPMI stack. Most people would just want to run Linux, but the point is: the BMC is just a small computer in charge of monitoring your big computer. However, the BMC is usually unable to run an off-the-shelf operating system, so it is running an old proprietary version of Linux with proprietary software. This situation sucks.
An idea I have floating around in the back of my head is to just use a raspberry pi as a bmc on consumer grade hardware, I am sure it would turn out more complicated than this, but basically just hook the power button, i2c, and other headers to the pi's gpio. now you have a bmc that runs an off the shelf os.
> An idea I have floating around in the back of my head is to just use a raspberry pi as a bmc on consumer grade hardware, I am sure it would turn out more complicated than this, but basically just hook the power button, i2c, and other headers to the pi's gpio. now you have a bmc that runs an off the shelf os.
BMCs generally run Linux and monitor the SoC functionality that the hardware is designed around. You need a vendor specific software stack for the hardware monitoring. The Rasp Pi is a toy.
On a Sun box we had, the system controller would panic the domains every so many days (I think ~700).
You could have rebooted the domains in the chassis for regular patches, but if you hadn't restarted the SC, you were in for a surprise.
Yes I remember something similar on Sun Fire 6800.
Another issue was a firmware bug on the Sun Netra X1 where rebooting or updating the LOM would result in a reset of the host. Not fun with UFS without logging enabled.
I can't remember which direction this went, but I had a Netra T1 hooked to another Sun machine (240R? V440?, don't remember), and resetting one would send a break out on the serial console... which would send the other into the ok> prompt in OBP.
We finally got serial servers out of that one, though!
This is why staggered reboots of stuff, weekly or monthly, avoids this class of problem. It's simple and some might say it dumbs down the role of infrastructure management, but it sure as hell beats the feeling in the middle of the workday/workweek ... "It's lost grip. NFI what to do. Can't see anything. Reboot it FFS."
Bold of you to assume a regular reboot will also reboot the BMC (it won't although I guess there might be BMCs which do). Some things you really need a full cold boot. I've seen a test cluster of storage servers where after rebooting the whole cluster all in one go enough failed to boot that data was unavailable until a few servers were fixed due to flaky RAM that failed to make it past memory training on boot but was "fine" as far as we could tell until we rebooted.
I'd mostly agree with you, but this isn't always as simple as it may seem to "just reboot it" and there can be subtle differences between what you're exercising with your rolling reboots and what actually happens in a real complete power loss scenario. Plenty of stuff can break and you'll have no idea until you're trying to get stuff back up after a power loss event and you're up a creek.
BMCs have to be some of the most unreliable devices that I've worked with. Some of the issues I encountered at my last job:
* [ASRock BMC] The BMC firmware updater sometimes causes the NICs to get stuck in a bad state where every NIC has the same MAC address. This can be resolved via a proprietary UEFI application for reflashing the correct MAC address.
* [Dell iDRAC] Local authentication randomly stops working due to some tmpfs running out of space (you can see the message if there's an active SSH session). IPMI/SSH occasionally works well enough to issue a reboot command, but when it doesn't, sending the BMC reset command to /dev/ipmi0 in the server OS is needed.
* [Dell iDRAC] Setting an asset tag via the IPMI DCMI command has a 15 character limit. If that limit is exceeded, the success response is still returned, but when querying the asset tag, random junk data longer than 15 bytes is returned. If I had to guess, I bet there was an sprintf() call somewhere in there :). This was fixed in newer iDRAC firmware. Now, it stores the last 15 bytes of the asset tag instead of returning an error.
* [Lenovo IMM] The shift or alt key sometimes gets stuck on the emulated keyboard without having used the remote console since the last BMC reboot. Can't be fixed by repressing the button, neither physically nor via the virtual keyboard. BMC reboot required.
* [Lenovo IMM] Booting the UEFI shell sometimes crashes both the system and the BMC.
* [Lenovo IMM] BMC and BIOS update sometimes claims to have succeeded, but didn't actually take effect.
* [Lenovo IMM] Rebooting the BMC via the web UI or SSH sometimes doesn't work. Making 50+ simultaneous requests to the login page is enough to crash and restart some component that allows the BMC reboot command to work again though.
* [Supermicro BMC] The BIOS update sometimes doesn't fully upload, but claims that it did. It still parses the header, so the new/old version fields look correct. Sometimes, rebooting the BMC and reflashing works. Other times, only a USB drive + USB keyboard + a recovery key combination fixes it.
* [Supermicro BMC] The remote console sometimes completely fails to initialize (though I've only seen this on servers where the BMC uptime was measured in years). Not just a blank screen. The GPU device was just... gone.
* [Supermicro BMC] Various IPMI commands just lie about successful execution. For example, setting the asset tag via the FRU succeeds, but has no effect. Those commands require toggling a write lock bit via an OEM command, which I only found by reverse engineering. Other commands, like the set asset tag DCMI command, leave the data in a corrupted state until a BMC reboot.
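The iDRAC asset-tag behavior above (success reported, garbage stored) is the classic signature of an unchecked write into a fixed-size field. A hypothetical Python model of that failure mode; the 15-byte layout and function names are invented for illustration, not Dell's firmware:

```python
# Toy model of an unchecked write into a 15-byte FRU field. The layout
# (asset tag followed by a neighboring field) is invented for illustration.
FIELD_LEN = 15

def set_asset_tag_buggy(store: bytearray, tag: bytes) -> bool:
    store[:len(tag)] = tag   # no length check: overlong tags clobber
    return True              # whatever follows, yet still report success

def set_asset_tag_fixed(store: bytearray, tag: bytes) -> bool:
    if len(tag) > FIELD_LEN:
        return False         # reject overlong input instead of lying
    store[:FIELD_LEN] = tag.ljust(FIELD_LEN, b"\x00")
    return True

fru = bytearray(b"\x00" * FIELD_LEN + b"NEXT-FIELD")
set_asset_tag_buggy(fru, b"a-very-long-asset-tag")  # 21 bytes into 15
print(fru[FIELD_LEN:])   # the neighboring field is now corrupted
```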
And finally, not really a bug, but an interesting thing about the Lenovo IMM. Instead of exposing information via standard IPMI features, like FRU or DCMI commands, the Lenovo IMM implements a virtual filesystem over OEM IPMI commands. These are commands, like (my naming) open_ro, open_rw, get_size, read, write, and close. They sometimes fail too. I think I ended up making all commands retry up to 10 times with a 5 second delay. At least Lenovo gets error return values right :).
To query the asset tag, you have to open_ro the "config.efi" file, get the size (because read-until-EOF doesn't always work), do a read loop, and close the file. Then, you have the XML data from the file you can query (20 seconds later due to retries). (If anyone ever needs to deal with the Lenovo IMM programmatically, I'd highly recommend the pyghmi [1] library. Wish I knew about it before reverse engineering their proprietary commands...)
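The retry-with-delay workaround described here is generic enough to sketch. The 10-attempt / 5-second numbers mirror the comment, but the wrapper shape is mine, not pyghmi's or Lenovo's API:

```python
import time

def with_retries(fn, attempts=10, delay_s=5):
    """Call fn(), retrying on failure. Attempt count and delay mirror the
    workaround above; the broad exception catch is illustrative -- narrow
    it to whatever your IPMI library actually raises."""
    last_err = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as e:  # illustrative catch-all
            last_err = e
            time.sleep(delay_s)
    raise last_err
```

Usage would look like `data = with_retries(lambda: imm_read_file("config.efi"))`, where `imm_read_file` is a hypothetical stand-in for the OEM open_ro/get_size/read/close dance.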
> * [Supermicro BMC] The BIOS update sometimes doesn't fully upload, but claims that it did.
Or it uploads via the WebGUI but balks with different errors. Flash with SUM from SuperDoctor:
SET IP=192.168.128.20
SET USER=ADMIN
SET PASS=ADMIN
.\SUM -i %IP% -u %USER% -p %PASS% -c UpdateBios --file E:\Shares\Temp\x10sle-f\bios.bin --force_update --reboot
.\SUM -i %IP% -u %USER% -p %PASS% -c GetBmcInfo
* [Supermicro BMC] A BMC on a dedicated Ethernet port gets moved to the shared LAN port on a blade reboot. Which also means that if you shut down the machine, you can no longer power it up remotely.
> * [Supermicro BMC] Various IPMI commands just lie about successful execution. For example, setting the asset tag via the FRU succeeds, but has no effect. Those commands require toggling a write lock bit via an OEM command, which I only found by reverse engineering. Other commands, like the set asset tag DCMI command, leave the data in a corrupted state until a BMC reboot.
we have Supermicro server with serial flashed to be something like 12345678 which means it probably doesn't work reliably on their production line either lmao
I've literally had the '2' problem before, which is interesting; in my case I needed to completely de-power the machine, a reboot of the BMC was not enough.
It appears that a large number of these are rebranded AMI MegaRAC software running on ASPEED processors, which are little ARM chips with a virtual display card hanging off a PCIe x2 link embedded into the mainboard.
AFAIK larger vendors like Dell and HP have their own thing.
Experiences with iDRAC 6, 7, and 8 have been terrible. After high uptime they stop responding via HTTP and SNMP; it's just rubbish. A reboot of the BMC sometimes helped, sometimes only a power drain did. Back then even ProSupport could not do much; they support a system unsupported by their own devs.
The latest iDRACs (with the fancy blue GUI) work a little better. I have no numbers to back all that, it's just a feeling; maybe the problems are yet to come.
Given the Dell pricing, they should be better. But I've heard from colleagues that iLO and the others are not much better either.
A few of my recent systems have come with a built-in combo BMC on the motherboard NICs. I haven't seen this before, is it a new trend? I'm imagining they put in a switch in there with the NICs? How does this even work?
I'm doing a few timing sensitive projects involving hardware timestamping in the NICs. Does this mess with say timing variability? I've disabled the BMCs out of paranoia but I don't know the topology inside.
What I remember was losing it just at a crucial point in the PXE boot process. You're also dependent on the same switches as the main net, and have to rely on fragile firewalling or VLANs to restrict access to the BMCs.
Ah, those wonderful BMCs that require some obsolete version of Java... that doesn't work anyway because the certificates are expired...
I also remember a long series of machines (Supermicro I think) whose BMCs would stop responding after a few weeks if there was too much traffic on the network.
Also those BMCs that silently hijack eth0, breaking your server's bonded connections, when the dedicated BMC connector gets disconnected for some reason...
I've had to factory reset a Dell iDRAC multiple times when some commands were failing with never-before-seen internal errors. Thankfully you can factory reset but keep the network config, so no need for physical access.
Not just BMCs... 'modern' (I learned this 6-7 years ago) network cards are pretty much separate computers that handle dataflow up and down their stack.
We had IDSes that were happy, up, network interface counts were climbing, but they were STONE DEAF to the network traffic we were actually interested in.
We racreset our iDracs monthly. Seemed stupid when I started, but I'm grateful after working with them enough. Newer versions seem to be a bit better than 11th and 12th gen iDracs; but I would never rule out a racreset, even on the newest gen.
> Server BMCs are little computers running ancient versions of Linux with software that's probably terribly written and they stay running forever, which means all sorts of opportunities for slow bugs. Reboot away!
Such a silly comment. If your BMC is updated then it will use a recent version of Linux.