We can see in this specific case there was better cache locality: more data was served from the L1 and L2 caches, and L3 cache misses dropped (with no L3 hits at all, since nothing had to be fetched from L3 in the first place).
Six cycles for bounds checks that the branch predictor never has to rewind on are nothing compared to saving a couple of trips to L3.
> We can see in this specific case there was better cache locality
Dramatically better L1 and L2 cache behavior. It seems clear that the additional instruction load of the Rust driver is partially made up by the excellent cache utilization.
This "Rust vs C" document is just one part of a larger analysis of network driver implementations in many languages: C, Rust, Go, C#, Java, OCaml, Haskell, Swift, JavaScript, and Python. Have a look at the top-level README.md of that GitHub repo.