What actually amazes me is that... Python is good enough:
> A naive Python script that loops over all needles and calls str.find in a loop to find all matches.
> working through all 3.5 gigabytes of test data once
> Time is ~6.5s.
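For concreteness, the naive approach described in the quote can be sketched roughly like this (the function and variable names are illustrative, not taken from the benchmark):

```python
# Sketch of the "naive" approach: loop over every needle and repeatedly
# call str.find to collect all match offsets in the haystack.
def find_all(haystack: str, needles: list[str]) -> dict[str, list[int]]:
    """Return every match offset of each needle, found via repeated str.find."""
    matches: dict[str, list[int]] = {}
    for needle in needles:
        offsets = []
        pos = haystack.find(needle)
        while pos != -1:
            offsets.append(pos)
            # Advance one past the last hit so overlapping matches are found too.
            pos = haystack.find(needle, pos + 1)
        matches[needle] = offsets
    return matches

print(find_all("abracadabra", ["abra", "cad"]))
# → {'abra': [0, 7], 'cad': [4]}
```

All the heavy lifting happens inside `str.find`, which is why the interpreter overhead barely shows up.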
I love these kinds of benchmarks. They consistently show that for specialised cases you do need to drop down to C/Rust/hand-written Haskell implementations. In many, many more cases, even Python with a naive implementation is good enough.
Not to diss Haskell or Rust. Only saying: choose your tools as you see fit, and benchmark.
Also kudos to the authors for using the median instead of the mean in the graphs.
> Firstly, the program spends most of its time in the heavily-optimized str.find function, which is implemented in C.
I'm not going to outright call this "cheating", but almost without exception, when Python is fast it's because it isn't Python; it's C.
It definitely feels worth noting that Python is radically slower than other languages, and the main "tool" it provides for improving performance is to rely on another language entirely.
Compare Java: the runtime is written in C++, but practically all libraries, including the standard library, are written in Java, and pure-Java libraries are greatly preferred.
> even Python with a naive implementation is good enough.
There is nothing naive about their implementation.
1) It uses the C FFI (as many low-level Python functions do)
2) It uses a suboptimal, but still not naive, algorithm.
Just because it isn't identical in implementation to the other languages doesn't mean it's much simpler. It's fast because enough people cared to make it that way; it wasn't by some accident.
That is a fragile benchmark. For simple scripts that rely exclusively on Python procedures written in C, Python is often very fast. Once you start working on that data using procedures actually written in Python, performance is usually far from stellar.
I have had that happen to me many times. Back in the Guile 2.0 days I tried porting some utils to Python based on preliminary benchmarks that were often an order of magnitude faster than my Guile ones, but once the actual logic was implemented the difference was gone. Then Guile 2.2 (and now Guile 3) happened and everything got magically faster, even though less and less of the runtime is written in C.
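The gap being described is easy to demonstrate with a small, illustrative micro-benchmark (the haystack size and needle here are arbitrary assumptions, not from the article): the same substring search done via the C-implemented `str.find` versus a pure-Python scan.

```python
import timeit

# Arbitrary test data for illustration: a long run of filler with the
# needle at the very end, forcing a full scan in both versions.
haystack = "x" * 100_000 + "needle"

def find_pure(h: str, n: str) -> int:
    """Naive pure-Python scan; every comparison runs in the interpreter."""
    width = len(n)
    for i in range(len(h) - width + 1):
        if h[i:i + width] == n:
            return i
    return -1

c_time = timeit.timeit(lambda: haystack.find("needle"), number=5)
py_time = timeit.timeit(lambda: find_pure(haystack, "needle"), number=5)
print(f"str.find: {c_time:.4f}s  pure Python: {py_time:.4f}s")
```

On a typical CPython build the pure-Python version is orders of magnitude slower, which is exactly the "procedures actually written in Python" effect described above.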