Once thought safe, DDR4 memory shown to be vulnerable to “Rowhammer”

robert_tweed · on March 18, 2016

Do any tools, such as Memtest86, use test patterns that will detect Rowhammer vulnerability? And for that matter, is it something that can be fixed on any given system by changing the RAS/CAS frequency or the voltage?

Many years ago, I ran a couple of PC shops. When EDO RAM first came out we had a lot of problems with random crashes being reported by customers, usually when they were running Word. We would typically run memory tests for a few hours in QAFE, which was then the best at detecting errors, but it would consistently fail to find anything wrong with these systems.

It turned out that the most reliable way to reproduce the problem was to run the shareware version of Duke Nukem. So we had a copy of it on all our diagnostic disks. If a system could get through the first level without any on-screen corruption, you knew it wasn't going to come back to the shop, so that became a standard test on all new PCs.

speeder · on March 18, 2016

You reminded me of an extremely interesting article about the Guild Wars team. They were annoyed by the number of weird bug reports, so one of their programmers worked out how to make the game self-analyze its data and detect whether the cause was a RAM hardware issue. He found that about 1% of all their players had computers with faulty RAM, and that a good share of the bug reports came from those computers. The check saved them a lot of money, since devs no longer had to chase "ghost" bugs, only real ones.
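
For illustration only, here is a minimal sketch of the general idea of a program checking its own memory: write known patterns, read them back, and report any byte that changed. This is not the Guild Wars team's code (the article's details are not reproduced here); the names `fill`, `verify`, and `self_check` are made up, and a Python buffer this small will mostly sit in CPU cache, so a real check would need far larger buffers and native code.

```python
# Classic alternating test patterns; the choice is an assumption, not from the article.
PATTERNS = (0x00, 0xFF, 0xAA, 0x55)

def fill(buf, pattern):
    """Overwrite every byte of buf with the given pattern byte."""
    buf[:] = bytes([pattern]) * len(buf)

def verify(buf, pattern):
    """Return the offsets whose contents no longer match the pattern."""
    return [i for i, b in enumerate(buf) if b != pattern]

def self_check(size_bytes=1 << 16):
    """Run every pattern over a scratch buffer; return {pattern: bad_offsets}."""
    buf = bytearray(size_bytes)
    failures = {}
    for pattern in PATTERNS:
        fill(buf, pattern)
        bad = verify(buf, pattern)
        if bad:
            failures[pattern] = bad
    return failures
```

An empty dict from `self_check()` means nothing was caught, not that the RAM is good.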

CydeWeys · on March 18, 2016

I had bad RAM on my primary Linux desktop some ~six years ago and it caused an endless series of problems, data corruption, and hard reboots every few months. It happened so infrequently that I never quite tracked it down, but I'd copied hundreds of gigabytes of data between computers over my LAN and lots of random files still have some corrupted bits in them to this day from that.

I finally got serious about tracking down the issue, ran memtest, and sure enough, it discovered a faulty DIMM pretty quickly. I yanked it out and the problems immediately went away.

yuhong · on March 18, 2016

Newer versions of Memtest86 have a Rowhammer test.

swinglock · on March 18, 2016

But is it reliable? Since this is news, I assume memtest won't detect most rowhammer vulnerable systems.

kondbg · on March 18, 2016

Memtest86's rowhammer test is pretty reliable. In fact, since the test's inclusion, so many Rowhammer-related errors have been observed that the developers of Memtest86 split the test into two passes (a high-frequency hammer test and a lower-frequency, rate-limited hammer test) due to the sheer number of users failing the high-frequency Rowhammer pass. [1]

[1] http://www.memtest86.com/troubleshooting.htm#hammer

yuhong · on March 18, 2016

I wonder what the return rate for memory is due to this test.

cnvogel · on March 20, 2016

I guess the number of "enthusiasts" running Memtest on their (or their acquaintances') PCs is still minuscule. So in my opinion no one would notice, except if a return questionnaire explicitly included a

    [√] error while using memtest

field. Keep in mind that newer versions of Microsoft's Windows include their own memory tester. Most people will, if at all, use this.

yuhong · on March 18, 2016

I submitted the original paper this is based on (notice most of it comes from Micron DDR4 chips): https://news.ycombinator.com/item?id=11308525

zymhan · on March 18, 2016

As a hardware engineer, I find the idea of security vulnerabilities based on how close components on a chip are to one another fascinating. Most of the hardware implementations we still use aren't really designed to prevent these exploits. Aside from Rowhammer, there are also side-channel attacks on shared CPUs, and measuring the electromagnetic fields generated by a PC to reverse-engineer sensitive data.

I'm certain that in the future these concerns will play a greater role in how we design computer components and systems. Previously these attacks were infeasible due to the lack of computing power available to sort through what is seemingly just noise.

pocketarc · on March 18, 2016

Can this be exploited by VPSs, docker containers, and the like? As throwaway2048 mentioned, this can be executed from regular JS, so it's not a difficult thing to exploit, and it seems that ECC RAM doesn't help, so any server with vulnerable RAM could be exploited. Is there anything, from the software side of things, that can/will be done to mitigate this?

smaili · on March 18, 2016

Link to actual paper - http://www.thirdio.com/rowhammer.pdf

ori_b · on March 18, 2016

So. Time to move to ECC by default?

SixSigma · on March 18, 2016

from TFA

The researchers were also able to flip the bits inside DDR3 DIMMs installed on an enterprise-grade server. The tests succeeded even though all of the DDR3 modules included a protection known as ECC. The servers completely locked up or spontaneously rebooted, usually within three minutes of the tests commencing.

ori_b · on March 18, 2016

Yes; locking up or rebooting seems like a reasonable thing to do if the memory has multiple unrecoverable errors. It certainly beats silent corruption.

roywiggins · on March 18, 2016

But the failure mode for ECC is silent corruption, followed (maybe) by a crash, right? ECC can detect and correct one or two (or a few more?) bit-flips at a time, but if there are too many it won't notice that something's gone wrong.

The system then goes merrily along and either crashes (because you've flipped some bits at random) or recovers (because those bits weren't important) or hands the attacker root (because you've managed to flip some particular target bits).

So if the computer crashes or locks up, you can probably conclude that you've defeated the ECC and flipped some bits, and the system is vulnerable to Rowhammer. It's (maybe) just a matter of time before someone engineers a way to flip the specific bits needed to hand the attacker root.

GauntletWizard · on March 18, 2016

The memory can recover one, detect two, and will throw up loud warning bells to any competent admin as long as it is doing so. If you see a sudden spike of ECC errors from a server, the memory is probably bad, and you replace it. If you see a sudden spike of ECC errors from a large number of servers, something is seriously wrong, and you investigate - I'd be more likely to blame it on bad power myself, or an EM field/radiation source.

Even when trying to trigger Rowhammer, the graph of bit flips is going to look like a logarithmic curve - single-bit flips will be common, two-bit flips will be rare, and undetected flips will be real but ultimately not much to talk about compared to the discovered flips. You know they're there by the unrecoverable rate, and you replace the RAM, discard any replicated data on that machine, and move on.
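
To put rough numbers on that shape, here is a back-of-the-envelope sketch. It assumes each bit of a 72-bit ECC word (64 data + 8 check bits) flips independently with the same probability, which real Rowhammer flips do not strictly obey, and the per-bit probability used below is purely illustrative. The function name is made up for this example.

```python
from math import comb

def flip_count_probabilities(p_bit, word_bits=72, max_flips=3):
    """Binomial probability of exactly k flipped bits in one ECC word.

    Assumes independent, identically likely flips per bit (a simplification).
    Returns {k: probability} for k = 0..max_flips.
    """
    return {
        k: comb(word_bits, k) * p_bit**k * (1 - p_bit) ** (word_bits - k)
        for k in range(max_flips + 1)
    }

if __name__ == "__main__":
    for k, prob in flip_count_probabilities(1e-4).items():
        print(f"{k} flipped bit(s): {prob:.3e}")
```

With the assumed `p_bit = 1e-4`, each additional flip in the same word is a few hundred times less likely than the one before, which is the steep fall-off described above.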

nkurz · on March 18, 2016

> The memory can recover one, detect two, and will throw up loud warning bells to any competent admin as long as it is doing so.

I don't have personal experience here, but one of the important claims in the paper is that this warning is not given on all servers:

  Unfortunately, server vendors routinely use a technique   
  called ECC threshold or the 'leaky bucket' algorithm where 
  they count ECC errors for a period of time and report them 
  only if they reach certain levels of failure. From what we 
  understand, this threshold is commonly above 100 per hour, 
  but this remains a trade secret and varies based on the 
  server vendor. So, to see ECC errors (MCE in Linux or
  WHEA in Windows), there generally needs to be 100 bit flips 
  per hour or greater. This makes “seeing” Rowhammer on 
  server error logs more difficult.

  In addition, we have observed some server vendors will 
  NEVER report ECC events back to the OS, although they might 
  get logged into IPMI. Typically, users expect to see 
  correctable ECC errors logged directly to the OS or that 
  halt the system when they cannot be corrected. During our 
  investigation into this phenomenon, we even encountered one 
  server that neither reported ECC events to the OS nor
  halted when bit flips were not correctable. The end result 
  was data corruption at the application level.
  This is something, in our opinion, that should never happen 
  on an ECC protected server system.

http://www.thirdio.com/rowhammer.pdf
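
A minimal sketch of what a "leaky bucket" threshold like the one quoted above could look like. The paper says the real thresholds are trade secrets, so the class name, the threshold of 100, and the leak rate of 100 per hour are all assumptions chosen to mirror the quoted figure, not any vendor's firmware.

```python
class LeakyBucket:
    """Suppress error reports until errors arrive faster than the bucket drains."""

    def __init__(self, threshold=100, leak_per_second=100 / 3600):
        self.threshold = threshold              # assumed: level that triggers a report
        self.leak_per_second = leak_per_second  # assumed: drain rate of the bucket
        self.level = 0.0
        self.last_time = None

    def record_error(self, now):
        """Record one corrected error at time `now` (seconds).

        Returns True if the bucket level has reached the reporting threshold.
        """
        if self.last_time is not None:
            elapsed = now - self.last_time
            # Drain the bucket for the time that passed, never below empty.
            self.level = max(0.0, self.level - elapsed * self.leak_per_second)
        self.last_time = now
        self.level += 1.0
        return self.level >= self.threshold
```

Under these assumed numbers, one corrected error per minute (60 per hour) never fills the bucket and stays invisible, while one per second trips the threshold after a little over 100 errors.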

baruch · on March 18, 2016

On the servers that I use, the server will log to the IPMI SEL only after a threshold, but the Linux OS will get an MCE for each error, so a monitoring tool can detect and alert even before the BIOS logs anything.
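
As a sketch of what such a monitoring check might look like on Linux: when an EDAC driver is loaded for the platform, the kernel exposes per-memory-controller counters under `/sys/devices/system/edac/mc/`. The function name below is made up, and whether these counters are populated at all depends on the hardware and firmware, which is exactly the caveat raised in the paper.

```python
from pathlib import Path

def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: {"corrected": n, "uncorrected": n}} from EDAC sysfs.

    Returns an empty dict if the directory is missing (e.g. no EDAC driver).
    """
    root = Path(edac_root)
    counts = {}
    if not root.is_dir():
        return counts
    for mc_dir in sorted(root.glob("mc[0-9]*")):
        try:
            corrected = int((mc_dir / "ce_count").read_text())
            uncorrected = int((mc_dir / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # skip controllers whose counters are unreadable
        counts[mc_dir.name] = {"corrected": corrected, "uncorrected": uncorrected}
    return counts

if __name__ == "__main__":
    for name, c in read_edac_counts().items():
        print(name, c)
```

A monitoring tool would poll this and alert on any increase rather than on the absolute value.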

BuildTheRobots · on March 18, 2016

I really wish they'd have expanded on that somewhere and actually stated what manufacturers/models of server they tested and which ones don't report ECC errors.

brainfire · on March 19, 2016

Really, without details like that it's just unverifiable FUD.

snuxoll · on March 18, 2016

To be fair, the expected failure mode of unrecoverable memory corruption in kernel space should probably be a panic followed by a reboot. Unfortunately, as you said, if you manage to corrupt more than two bits in a row it's a crapshoot if ECC DRAM will even detect the corruption or not.

cesarb · on March 18, 2016

> ECC can detect and correct one or two (or a few more?) bit-flips at a time, but if there's too many it won't notice that something's gone wrong.

ECC can normally detect and correct a single bit flip, and detect but not correct a double bit flip, in the "line" of bits protected by it. The ECC protection is local, not global; if there are many flipped bits but they are far enough from each other, each "line" will only have a correctable single-bit flip.
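
To make the single-correct/double-detect behaviour concrete, here is a toy SECDED code: an extended Hamming(8,4) code protecting 4 data bits. Real ECC DIMMs typically protect 64 data bits with 8 check bits, and memory controllers use their own code constructions, so this is only a scaled-down illustration; `encode` and `decode` are names invented for the example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit codeword (list of bits).

    Index 0 is the overall parity bit; indices 1, 2, 4 are Hamming parity
    bits; indices 3, 5, 6, 7 carry the data.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code

def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos            # XOR of the positions of all set bits
    odd_parity = sum(code) % 2         # 1 means an odd number of bits flipped
    if syndrome == 0 and odd_parity == 0:
        status = "ok"
    elif odd_parity == 1:
        # Treated as a single flip; the syndrome names the position
        # (0 means the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two flips, detectable only.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

Three flips in one word look like a single flip to this decoder: flipping positions 1, 2 and 4 of `encode(0)` yields `("corrected", 8)`, i.e. wrong data with no alarm, which is the silent-corruption case discussed upthread.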

wumpus · on March 18, 2016

ECC just makes the attack take longer.

tdullien · on March 18, 2016

Co-author of the original exploit writing here. ECC doesn't "make the attack take longer", it turns it from an exploitable security issue into a denial-of-service issue (as odds are the ECC will detect the bitflip and the kernel will shut down the box).

wumpus · on March 18, 2016

Then why does the paper [1] say different? The odds of a double flip are smaller than a single flip; a single flip gets corrected, a double stops the box (depending on the BIOS etc.), an even rarer triple may not get detected... the odds are exactly why it just takes longer. This paper reports observing it happening.

1: http://www.thirdio.com/rowhammer.pdf

nkurz · on March 18, 2016

I presume 'tdullien' assumes that "attack" refers to privilege escalation. I think he's claiming you are unlikely to get an undetected and exploitable triple bit flip before you crash the machine with the more common single and double bit flips. Thus he says that ECC "turns it from an exploitable security issue into a denial-of-service issue".

Does the paper claim that in fact triple bit flips are sufficiently more frequent than one would expect, and thus an exploitable security issue still exists with ECC? This would be interesting news. But if you are defining "attack" to include denial-of-service by crashing the machine, I presume the disagreement is only about terminology.

nickpsecurity · on March 18, 2016

I noticed supercomputer vendors and Oracle are using ChipKill plus ECC. Does adding that change the results any, or does it still reduce to DoS?

tdullien · on March 18, 2016

So I do not know (in the mathematical sense) that ChipKill won't create a problem, but I think it is unlikely.

In general, if the ECC infrastructure works (e.g. DRAM detecting bit corruption and reporting it to the OS), the odds of an attacker finding a piece of RAM that flips bits "just right" to bypass the checksumming without triggering an OS panic are low enough that I'd rather worry about software bugs.

DoS may still be a pretty viable attack, though.

darkr · on March 18, 2016

Not just supercomputer vendors - Chipkill has been a standard option on IBM X-Series (their standard server range) for years now. I remember deploying a whole bunch of servers back in 2011 or so that had it.

baruch · on March 18, 2016

It can mitigate the DoS if you have enough spare RAM. If you are right on the edge, the removed RAM will turn into an OOM condition and become a DoS.

nickpsecurity · on March 18, 2016

Appreciate everyone's insightful replies. :)

rodgerd · on March 18, 2016

That could still be extraordinarily entertaining for an attacker to execute on a shared hosting environment, though.

Animats · on March 18, 2016

Can you execute Rowhammer attacks from WebAssembly?

throwaway2048 · on March 18, 2016

You can execute it from regular JS https://github.com/IAIK/rowhammerjs

mchahn · on March 18, 2016

Memory manufacturers all use heavy testing to both test designs and to weed out bad chips. This seems to be a weakness in these tests and should have been caught. Row-hammering is no different than applying test vectors. I'm sure those tests now include row-hammering but it will take a while for the new designs to come out.

cloudsloth · on March 18, 2016

Is a pragmatic solution more virtualized memory, other software solutions, hardware shielding, lower data density, or some combination of these?

rincebrain · on March 18, 2016

A pragmatic solution is probably "go for the DIMMs that are identified as not apparently vulnerable", if you're in a position where you have to care about this faster than t(vendors release software updates to mitigate this class of attack on your platforms).