You can compile your app with the threaded runtime, in which case the threads are scheduled amongst the available CPU cores.
Edit: The benchmark in question does just this, with 5 cores. So while the docs you cut-n-pasted are technically accurate, they do not apply to this particular run of this particular benchmark.
Well, fair enough: presumably GHC is creating 5 kernel threads under the covers, whereas the C implementation creates 503 kernel threads. Therefore the C implementation incurs approximately 100x (503 / 5) the context-switching overhead. Once again, apples to oranges.
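To make the distinction concrete, here is a minimal sketch (not the benchmark's actual source) of 503 lightweight Haskell threads passing a token around a ring. The names `ring` and `worker`, the file name `Ring.hs`, and the hop count of 1000 are all made up for illustration. With the threaded runtime, the number of capabilities is set by `+RTS -N`, and the `forkIO` threads are multiplexed over them rather than each getting its own kernel thread.

```haskell
-- Build: ghc -O2 -threaded -rtsopts Ring.hs
-- Run:   ./Ring +RTS -N5 -RTS
import Control.Concurrent (forkIO, getNumCapabilities)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM_, replicateM)

-- | Pass a token around a ring of n lightweight threads (n >= 1 assumed).
-- The token starts at thread 1 carrying `hops`; each pass decrements it.
-- Returns the 1-based index of the thread that receives the token at zero.
ring :: Int -> Int -> IO Int
ring n hops = do
  boxes <- replicateM n newEmptyMVar          -- one mailbox per thread
  done  <- newEmptyMVar
  -- Thread i reads from its own mailbox and writes to its neighbour's.
  forM_ (zip3 [1 ..] boxes (drop 1 (cycle boxes))) $ \(i, inbox, outbox) ->
    forkIO (worker i inbox outbox done)       -- lightweight thread, not a kernel thread
  forM_ (take 1 boxes) $ \first -> putMVar first hops
  takeMVar done

worker :: Int -> MVar Int -> MVar Int -> MVar Int -> IO ()
worker i inbox outbox done = loop
  where
    loop = do
      t <- takeMVar inbox
      if t == 0
        then putMVar done i                   -- report where the token stopped
        else putMVar outbox (t - 1) >> loop

main :: IO ()
main = do
  caps <- getNumCapabilities                  -- set by +RTS -N
  putStrLn ("capabilities: " ++ show caps)
  winner <- ring 503 1000
  putStrLn ("token stopped at thread " ++ show winner)
```

Because only one thread in the ring is runnable at any moment, this kind of workload has little parallelism to exploit no matter how many capabilities are available, which may be part of why the per-CPU numbers look the way they do.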
> the threads are scheduled amongst the available CPU cores.
Actually, the per-CPU idle stats suggest that the Haskell program ran entirely on a single CPU, so it wasn't actually utilizing all 4 cores anyway.
> Edit: The benchmark in question does just this, with 5 cores. So while the docs you cut-n-pasted are technically accurate, they do not apply to this particular run of this particular benchmark.