Your jemalloc case is using 1564672 kB, which means you only have 764 hugepages (1564672 kB / 2048 kB). Since the RAM access patterns in the parent process are random, you only need ~1000 accesses to trigger enough COWs (copy-on-write) to cover most of the ~1.5GB. redis-benchmark simulates 50 clients by default and you use a pipelining factor of 4, so you have 200 in-flight requests at any one time. I am not familiar with how the redis server prioritizes the requests to be served, but if it is like most network servers it is not deterministic (whatever comes first from the network event loop). So while on average a request may wait for 200 other ones to be served, a small fraction of them will have to wait longer because of this non-deterministic factor. It is not unimaginable that some have to wait for 500, maybe 1000 other requests. Waiting for 1000 requests means waiting for a COW of most of the ~1.5GB, which would take a few hundred milliseconds and would absolutely explain your latency spike.
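To put a rough number on "most of the ~1.5GB", here is a back-of-the-envelope sketch in Python. It assumes each of the 1000 accesses is a write landing on a uniformly random huge page, independently of the others; that is my simplification, not something measured on a real Redis instance. The function name is made up for illustration.

```python
HUGE_PAGE_KB = 2048   # 2MB transparent huge pages
rss_kb = 1564672      # memory used in the jemalloc case

huge_pages = rss_kb // HUGE_PAGE_KB


def expected_pages_touched(total_pages: int, accesses: int) -> float:
    """Expected number of distinct pages hit by `accesses` uniformly
    random, independent writes over `total_pages` pages."""
    return total_pages * (1 - (1 - 1 / total_pages) ** accesses)


touched = expected_pages_touched(huge_pages, 1000)
fraction = touched / huge_pages

print(f"huge pages: {huge_pages}")
print(f"expected distinct pages COW'd after 1000 writes: {touched:.0f}")
print(f"fraction of the resident memory copied: {fraction:.0%}")
```

Under those assumptions roughly three quarters of the huge pages get copied within 1000 writes, which is in line with the estimate above.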
On the other hand your malloc case is using 4kB pages. If a request has to wait for 1000 other in-flight requests to be executed, that means waiting for at most 1000 pages (4MB) to be COW'd, which would take on the order of 0.1-1.0 milliseconds. This is why the latency is much lower.
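The same comparison as a small Python sketch. The copy bandwidth of 5 GB/s is an assumed round figure, not a measurement, and the model is an upper bound: it pretends every one of the 1000 writes hits a different page and caps the total at the resident size. `cow_cost` is a hypothetical helper name.

```python
ASSUMED_COPY_BANDWIDTH = 5e9      # bytes/second, assumption for illustration
RESIDENT_BYTES = 1564672 * 1024   # ~1.5GB in use, from the jemalloc case


def cow_cost(page_bytes: int, writes: int) -> tuple[int, float]:
    """Upper bound on bytes copied and seconds spent if every write
    triggers a COW of a distinct page of size `page_bytes`."""
    copied = min(page_bytes * writes, RESIDENT_BYTES)
    return copied, copied / ASSUMED_COPY_BANDWIDTH


small_bytes, small_secs = cow_cost(4 * 1024, 1000)         # 4kB pages
huge_bytes, huge_secs = cow_cost(2 * 1024 * 1024, 1000)    # 2MB pages

print(f"4kB pages: {small_bytes / 1e6:.1f} MB copied, {small_secs * 1e3:.2f} ms")
print(f"2MB pages: {huge_bytes / 1e9:.2f} GB copied, {huge_secs * 1e3:.0f} ms")
```

With these numbers the 4kB case stays under a millisecond while the 2MB case lands in the hundreds of milliseconds, the same orders of magnitude as the estimates in the text.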
tl;dr: smaller pages (4kB vs 2MB) allow finer granularity of the COW mechanism, and lead to lower latencies.
This was my first thought as well, but actually this is not what is happening AFAIK, and the performance hit is likely due to inefficient huge page allocation. There are reasons I believe this, but I'm actually checking in a more systematic way right now before saying random things.
EDIT: you were exactly right. This is what happens: there are 50 clients in the benchmark, with many queued requests, and since the benchmark is designed to touch all the keys evenly, every client served in a given event loop cycle has a big chance of hitting a page fault. This seemed unrealistic to me, since I saw the spike in a single event-loop cycle, but that is actually how it works. Thanks!