Do any tools like, say, Memtest86 use test patterns that will detect row hammer vulnerability? And for that matter, is it something that can be fixed on any given system by changing the RAS/CAS frequency or the voltage?
Many years ago, I ran a couple of PC shops. When EDO RAM first came out we had a lot of problems with random crashes being reported by customers, usually when they were running Word. We would typically run memory tests for a few hours in QAFE, which was then the best at detecting errors, but it would consistently fail to find anything wrong with these systems.
It turned out that the most reliable way to reproduce the problem was to run the shareware version of Duke Nukem. So we had a copy of it on all our diagnostic disks. If a system could get through the first level without any on-screen corruption, you knew it wasn't going to come back to the shop, so that became a standard test on all new PCs.
You made me remember an extremely interesting article about the Guild Wars team. They were annoyed with the number of weird bug reports, so one of their programmers found a way to make the game self-analyze its data and determine whether it was hitting a RAM hardware issue. He found that about 1% of all their players had computers with faulty RAM, and that a good number of bug reports came from these computers. The check saved them a lot of money, since the devs no longer needed to chase "ghost" bugs, only real ones.
I had bad RAM on my primary Linux desktop some ~six years ago and it caused an endless series of problems, data corruption, and hard reboots every few months. It happened so infrequently that I never quite tracked it down, but I'd copied hundreds of gigabytes of data between computers over my LAN and lots of random files still have some corrupted bits in them to this day from that.
I finally got serious about tracking down the issue, ran memtest, and sure enough, it discovered a faulty DIMM pretty quickly. I yanked it out and the problems immediately went away.
Memtest86's rowhammer test is pretty reliable. In fact, since the test's inclusion, so many Rowhammer-related errors have been observed that the developers of Memtest86 split the test into two passes (a high-frequency hammer test and a lower-frequency, rate-limited hammer test) due to the sheer number of users failing the high-frequency Rowhammer pass. [1]
I guess the number of "enthusiasts" running Memtest on their (or their acquaintances') PCs is still minuscule. So in my opinion no one would notice, except if a return questionnaire explicitly included a
[√] error while using memtest
field. Keep in mind that newer versions of Microsoft's Windows include their own memory tester. Most people will, if at all, use this.
As a hardware engineer, I find the idea of security vulnerabilities based on how close components on a chip are to one another fascinating. Most of the hardware implementations we still use aren't really designed to prevent these exploits. Aside from rowhammer, there are also "sidechannel" attacks on shared CPUs, and measuring the electromagnetic fields generated by a PC to reverse-engineer sensitive data.
I'm certain that in the future these concerns will play a greater role in how we design computer components and systems. Previously these attacks were infeasible due to the lack of computing power available to sort through what is seemingly just noise.
Can this be exploited by VPSs, docker containers, and the like? As throwaway2048 mentioned, this can be executed from regular JS, so it's not a difficult thing to exploit, and it seems that ECC RAM doesn't help, so any server with vulnerable RAM could be exploited. Is there anything, from the software side of things, that can/will be done to mitigate this?
The researchers were also able to flip the bits inside DDR3 DIMMs installed on an enterprise-grade server. The tests succeeded even though all of the DDR3 modules included a protection known as ECC; the servers completely locked up or spontaneously rebooted, usually within three minutes of the tests commencing.
Yes; locking up or rebooting seems like a reasonable thing to do if the memory has multiple unrecoverable errors. It certainly beats silent corruption.
But the failure mode for ECC is silent corruption, followed (maybe) by a crash, right? ECC can detect and correct one or two (or a few more?) bit-flips at a time, but if there's too many it won't notice that something's gone wrong.
The system then goes merrily along and either crashes (because you've flipped some bits at random) or recovers (because those bits weren't important) or hands the attacker root (because you've managed to flip some particular target bits).
So if the computer crashes or locks up, you can probably conclude that you've defeated the ECC and flipped some bits, and the system is vulnerable to Rowhammer. It's (maybe) just a matter of time before someone engineers a way to flip the specific bits needed to hand the attacker root.
The memory can recover one, detect two, and will throw up loud warning bells to any competent admin as long as it is doing so. If you see a sudden spike of ECC errors from a server, the memory is probably bad, and you replace it. If you see a sudden spike of ECC errors from a large number of servers, something is seriously wrong, and you investigate - I'd be more likely to blame it on bad power myself, or an EM field/radiation source.
Even when trying to trigger rowhammer, the graph of bit flips is going to look like a logarithmic curve: single-bit flips will be common, two-bit flips will be rare, and undetected flips will be real but ultimately not much to talk about compared to the discovered flips. You know they're there by the unrecoverable rate, and you replace the RAM, discard any replicated data on that machine, and move on.
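To put rough numbers on "single-bit flips common, two-bit rare, three-bit rarer still", here is a back-of-the-envelope sketch. It assumes every bit in a 72-bit ECC word flips independently with the same probability p, which is my simplification and not something claimed in this thread; real Rowhammer flips cluster in particular weak cells, so treat this as intuition only. The function name and the value of p are made up for illustration.

```python
from math import comb


def flip_distribution(p, word_bits=72, max_k=3):
    """Binomial probability of exactly k flipped bits in one ECC word.

    p         -- assumed independent per-bit flip probability (hypothetical)
    word_bits -- 64 data bits + 8 check bits for a typical ECC word
    Returns a list [P(0 flips), P(1 flip), ..., P(max_k flips)].
    """
    return [
        comb(word_bits, k) * p**k * (1 - p) ** (word_bits - k)
        for k in range(max_k + 1)
    ]


if __name__ == "__main__":
    p = 1e-4  # illustrative value only
    for k, prob in enumerate(flip_distribution(p)):
        print(f"P({k} flips in one word) = {prob:.3e}")
```

Under that independence assumption each extra flip in the same word costs several orders of magnitude, which is the shape both sides of this argument are reasoning about.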
> The memory can recover one, detect two, and will throw up loud warning bells to any competent admin as long as it is doing so.
I don't have personal experience here, but one of the important claims in the paper is that this warning is not given on all servers:
Unfortunately, server vendors routinely use a technique
called ECC threshold or the 'leaky bucket' algorithm where
they count ECC errors for a period of time and report them
only if they reach certain levels of failure. From what we
understand, this threshold is commonly above 100 per hour,
but this remains a trade secret and varies based on the
server vendor. So, to see ECC errors (MCE in Linux or
WHEA in Windows), there generally needs to be 100 bit flips
per hour or greater. This makes “seeing” Rowhammer on
server error logs more difficult.
In addition, we have observed some server vendors will
NEVER report ECC events back to the OS, although they might
get logged into IPMI. Typically, users expect to see
correctable ECC errors logged directly to the OS or that
halt the system when they cannot be corrected. During our
investigation into this phenomenon, we even encountered one
server that neither reported ECC events to the OS nor
halted when bit flips were not correctable. The end result
was data corruption at the application level.
This is something, in our opinion, that should never happen
on an ECC protected server system.
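For anyone wondering how a "leaky bucket" hides errors, here is a minimal sketch of the general technique. The class name, the threshold of 100 and the leak rate are my own illustrative choices (the paper says the real values are vendor trade secrets), so this is not any vendor's firmware logic.

```python
class LeakyBucket:
    """Toy ECC error threshold: report only when errors outpace the leak."""

    def __init__(self, threshold, leak_per_hour):
        self.threshold = threshold                  # level at which we report
        self.leak_per_second = leak_per_hour / 3600.0
        self.level = 0.0                            # current bucket fill
        self.last = None                            # time of previous error

    def record_error(self, now):
        """Record one corrected error at time `now` (seconds).

        Returns True if the bucket has reached the reporting threshold.
        """
        if self.last is not None:
            # Drain the bucket for the time elapsed since the last error.
            leaked = (now - self.last) * self.leak_per_second
            self.level = max(0.0, self.level - leaked)
        self.last = now
        self.level += 1.0
        return self.level >= self.threshold


if __name__ == "__main__":
    slow = LeakyBucket(threshold=100, leak_per_hour=100)
    # 50 corrected errors per hour (one every 72 s), sustained for 10 hours.
    print(any(slow.record_error(t * 72.0) for t in range(500)))

    burst = LeakyBucket(threshold=100, leak_per_hour=100)
    # One corrected error per second for 200 seconds.
    print(any(burst.record_error(float(t)) for t in range(200)))
```

With these made-up parameters a steady trickle below the leak rate is never reported, while a dense burst is, which is the behaviour the quoted passage describes.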
On the servers that I use the server will log to ipmi sel only after a threshold but the Linux OS will get an MCE for each error so a monitoring tool can detect and alert even before the BIOS logs anything.
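One way a monitoring tool could watch for this on Linux is to poll the EDAC counters in sysfs. This is a rough sketch, assuming the EDAC driver for your memory controller is loaded and exposes `ce_count` and `ue_count` under `/sys/devices/system/edac/mc/mc*/`. The function names are mine, and it obviously cannot see errors that the platform never forwards to the OS.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller_name: (corrected, uncorrected)} from EDAC sysfs.

    Returns an empty dict if the directory is missing (no EDAC driver).
    """
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers we cannot read or parse
        counts[mc.name] = (ce, ue)
    return counts


def new_errors(previous, current):
    """Return per-controller (corrected, uncorrected) deltas that are non-zero."""
    deltas = {}
    for name, (ce, ue) in current.items():
        prev_ce, prev_ue = previous.get(name, (0, 0))
        delta = (ce - prev_ce, ue - prev_ue)
        if delta != (0, 0):
            deltas[name] = delta
    return deltas


if __name__ == "__main__":
    print(read_edac_counts())
```

Calling `read_edac_counts()` on a timer and alerting whenever `new_errors()` is non-empty gives you a per-error signal without waiting for a BIOS threshold.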
I really wish they'd have expanded on that somewhere and actually stated what manufacturers/models of server they tested and which ones don't report ECC errors.
To be fair, the expected failure mode of unrecoverable memory corruption in kernel space should probably be a panic followed by a reboot. Unfortunately, as you said, if you manage to corrupt more than two bits in a row it's a crapshoot if ECC DRAM will even detect the corruption or not.
> ECC can detect and correct one or two (or a few more?) bit-flips at a time, but if there's too many it won't notice that something's gone wrong.
ECC can normally detect and correct a single bit flip, and detect but not correct a double bit flip, in the "line" of bits protected by it. The ECC protection is local, not global; if there are many flipped bits but they are far enough from each other, each "line" will only have a correctable single-bit flip.
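To make "correct one, detect two" concrete, here is a small sketch of a textbook extended Hamming (SECDED) code over a 64-bit word, giving 72 bits in total. This is an assumption-laden toy: real memory controllers use their own (72,64) codes and layouts, and the function names are mine. It also shows why three flips in the same word can slip through as a bogus "correction".

```python
def encode(data_bits):
    """Encode a list of 0/1 data bits. Index 0 holds the overall parity bit,
    power-of-two indices hold Hamming parity bits, the rest hold data."""
    m = len(data_bits)
    r = 0
    while (1 << r) < m + r + 1:
        r += 1
    n = m + r
    code = [0] * (n + 1)
    it = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):              # not a power of two: data position
            code[pos] = next(it)
    for i in range(r):
        p = 1 << i
        parity = 0
        for pos in range(1, n + 1):
            if pos & p:
                parity ^= code[pos]
        code[p] = parity
    code[0] = sum(code[1:]) % 2          # overall parity enables double detection
    return code


def decode(code):
    """Return (status, word); status is 'ok', 'corrected' or 'uncorrectable'."""
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos
    odd_flips = sum(code) % 2 == 1
    if syndrome == 0 and not odd_flips:
        return "ok", list(code)
    if odd_flips and syndrome <= n:
        fixed = list(code)
        fixed[syndrome] ^= 1             # syndrome 0 means the parity bit itself
        return "corrected", fixed
    return "uncorrectable", list(code)


def extract(code):
    """Pull the data bits back out of a codeword."""
    return [code[pos] for pos in range(1, len(code)) if pos & (pos - 1)]


def flip(code, *positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(code)
    for pos in positions:
        out[pos] ^= 1
    return out


if __name__ == "__main__":
    data = [int(b) for b in format(0xDEADBEEFCAFEF00D, "064b")]
    word = encode(data)
    for label, bad in [("1 flip", flip(word, 17)),
                       ("2 flips", flip(word, 5, 9)),
                       ("3 flips", flip(word, 1, 2, 3))]:
        status, fixed = decode(bad)
        print(label, status, "data intact:", extract(fixed) == data)
```

In the three-flip case the decoder sees an odd number of flips and "corrects" the wrong bit, so it reports success while handing back corrupted data.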
Co-author of the original exploit writing here. ECC doesn't "make the attack take longer", it turns it from an exploitable security issue into a denial-of-service issue (as odds are the ECC will detect the bitflip and the kernel will shut down the box).
Then why's the paper[1] say different? The odds of a double-flip are smaller than a single-flip; a single flip gets corrected, a double stops the box (depending on the bios etc), an even rarer triple may not get detected... the odds are exactly why it just takes longer. This paper reports observing it happening.
I presume 'tdullien' assumes that "attack" refers to privilege escalation. I think he's claiming you are unlikely to get an undetected and exploitable triple bit flip before you crash the machine with the more frequent single and double bit flips. Thus he says that ECC "turns it from an exploitable security issue into a denial-of-service issue".
Does the paper claim that in fact triple bit flips are sufficiently more frequent than one would expect, and thus an exploitable security issue still exists with ECC? This would be interesting news. But if you are defining "attack" to include denial-of-service by crashing the machine, I presume the disagreement is only about terminology.
So I do not know (in the mathematical sense) that ChipKill won't create a problem, but I think it is unlikely.
In general, if the ECC infrastructure works (e.g. DRAM detecting bit corruption and reporting it to the OS), the odds of an attacker finding a piece of RAM that flips bits "just right" to bypass the checksumming without triggering an OS panic are low enough that I'd rather worry about software bugs.
Not just supercomputer vendors - Chipkill has been a standard option on IBM X-Series (their standard server range) for years now. I remember deploying a whole bunch of servers back in 2011 or so that had it.
It can mitigate the DoS if you have enough spare RAM. If you are right on the edge, that removed RAM will turn into an OOM condition and will become a DoS.
Memory manufacturers all use heavy testing to both test designs and to weed out bad chips. This seems to be a weakness in these tests and should have been caught. Row-hammering is no different than applying test vectors. I'm sure those tests now include row-hammering but it will take a while for the new designs to come out.
A pragmatic solution is probably "go for the DIMMs that are identified as not apparently vulnerable", if you're in a position where you have to care about this faster than t(vendors release software updates to mitigate this class of attack on your platforms).