> Nim compilation process took an additional 702 ms That's horrifyingly slow for...

cyber_kinetist · on Sept 25, 2021

Nim is actually one of the fastest to compile out of the compiled languages out there, on par with Go. Although this is a bit subjective, I think a second of compilation is good enough for light scripting tasks. (And being a statically typed language, it catches a good chunk of errors before compilation is finished.)

Nim's advantage is that it uses a good old C compiler for the backend (which has been hyperoptimized for decades), but the frontend (transpiler) is also pretty fast. Nim's compilation speed should improve a bit when incremental compilation support is added (which would probably also help with a lot of Nim's other current issues, IDE tooling for example).

benjamin-lee · on Sept 25, 2021

I didn't post it because it's quite big (150M) but readily available from the NCBI Virus portal [1]. I would love to see how well other languages compete both for speed and simplicity.

[1] https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType...

brabel · on Sept 25, 2021

I couldn't get your 150M file, so I used one of the smaller files I could get by clicking on the first set shown in the table (the FASTA file was only 30KB) and duplicated it until it was around 150MB.
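The duplication step described above can be sketched in Python. The function name `inflate_file` and its parameters are made up for illustration; this is not the command actually used in the thread. Writing only whole copies keeps every FASTA record intact and leaves the GC fraction unchanged, since both counts scale by the same factor.

```python
def inflate_file(src_path, dst_path, target_bytes):
    """Write whole copies of src_path to dst_path until it reaches target_bytes.

    Returns the number of copies written. Only whole copies are written,
    so no FASTA record is cut in half.
    """
    with open(src_path, "rb") as src:
        chunk = src.read()  # fine for a small (~30KB) source file
    if not chunk:
        raise ValueError("source file is empty")
    if not chunk.endswith(b"\n"):
        chunk += b"\n"  # keep the next copy's header on its own line
    copies = -(-target_bytes // len(chunk))  # ceiling division
    with open(dst_path, "wb") as dst:
        for _ in range(copies):
            dst.write(chunk)
    return copies
```

For example, `inflate_file("small.fasta", "big.fasta", 150 * 1000 * 1000)` would produce a file of at least 150 MB (both file names are placeholders).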

Here's a comparison with Common Lisp:

    ~/fasta-dna $ time python3 run.py
    0.3797277865097147
    21.828 secs

    ~/fasta-dna $ time sbcl --script run.lisp
    0.37972778
    2.415 secs

    ~/fasta-dna $ ls -al nc_045512.2.fasta
    -rw-r--r-- 1 156095639 2021-09-25 11:15 nc_045512.2.fasta

So, almost as fast as Nim (the time includes compilation time)?

Here's the Common Lisp code:

    (with-open-file (in "nc_045512.2.fasta")
      (loop with gc = 0 and total = 0
            for line = (read-line in nil)
            while line
            ;; Skip FASTA header lines (starting with '>'); tolerate empty lines.
            do (unless (and (plusp (length line)) (char= (char line 0) #\>))
                 (loop for ch across line
                       do (incf total)
                          (when (or (char= ch #\C) (char= ch #\G))
                            (incf gc))))
            finally (format t "~f~%" (/ gc total))))

With a top-level function and some type declarations it could run even faster, I think.

EDIT: compiling the Lisp code to FASL and annotating the types brings the total runtime to 2.0 seconds. Running it from source increases the time only very slightly, to 2.08 seconds, which shows how incredibly fast the SBCL compiler is. Taking 0.7 seconds to compile a few lines of code is crazy; imagine what happens when your project grows to many thousands of lines.

The Lisp code still can't really match the speed of Nim, which is really C at runtime, once compile time is excluded. But if you need a scripting language, CL is great (especially when used with the REPL and SLIME).
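For comparison, the same loop can be sketched in Python. The `run.py` timed above is not shown in this thread, so this is an assumption about its logic that mirrors the Lisp version, not the actual script; the name `gc_fraction` is made up.

```python
def gc_fraction(path):
    """Return the fraction of sequence characters that are 'C' or 'G'."""
    gc = 0
    total = 0
    with open(path, "r") as handle:
        for line in handle:
            if line.startswith(">"):
                continue  # FASTA header line, not sequence data
            seq = line.rstrip("\n")
            gc += seq.count("C") + seq.count("G")
            total += len(seq)
    # Raises ZeroDivisionError if the file holds no sequence characters.
    return gc / total
```

Calling `print(gc_fraction("nc_045512.2.fasta"))` would print the ratio for the file used in the timings above.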

cb321 · on Sept 25, 2021

@brabel - The Nim compiler actually builds a relatively large `system` package every time. (They are also working on speeding up compiles.) So, compile time does not scale as badly as you think. E.g., you might need 50 to 100 times as much "user level" source code to double the compile time.

Also, @benjamin-lee, this version of the Nim program is a bit lower level, but probably much faster:

    import memfiles as mf
    var gc = 0
    var total = 0

    var f = mf.open("orthocoronavirinae.fasta")
    for line in memSlices(f):
        let n = line.size
        let cs = cast[cstring](line.data)
        if n > 0 and cs[0] == '>': # ignore comment lines
            continue
        for i in 0 ..< n:
            let letter = cs[i]
            if letter == 'C' or letter == 'G':
                gc += 1
            total += 1

    echo(gc.float / total.float)
    mf.close(f) # not really needed; process about to end

Compile with -d:danger and so on, of course. (On a small 30 KB test file I got about a 1.7x speed-up over the version in the blog post. I also could not find the 150 MB file. Multiplying up the tiny 30 KB file like @brabel did, I got only a 1.25x speed-up, down to 0.5 seconds. So it might not be worth the low-levelness, but a real file might tilt more towards the 1.7x end.)
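The memory-mapping idea carries over to other languages. Below is a minimal Python sketch using the standard `mmap` module, as an illustration of the technique only; it says nothing about the speed-ups measured for Nim. The function name `gc_fraction_mmap` is made up.

```python
import mmap

def gc_fraction_mmap(path):
    """GC fraction of a FASTA file, scanning a read-only memory map."""
    gc = 0
    total = 0
    with open(path, "rb") as handle:
        # Length 0 maps the whole file; mapping an empty file raises ValueError.
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for line in iter(mm.readline, b""):
                if line.startswith(b">"):
                    continue  # header line, skipped as in the Nim version
                seq = line.rstrip(b"\r\n")
                gc += seq.count(b"C") + seq.count(b"G")
                total += len(seq)
    return gc / total
```

Working on bytes avoids decoding every line to text, which is the closest Python analogue to reading `cs[i]` directly from the mapped buffer.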

brabel · on Sept 25, 2021

I clicked on the big Download button and selected "all records", and it downloaded over 3.5GB before I gave up... Which file exactly should I use?

benjamin-lee · on Sept 25, 2021

I'm sorry, I completely forgot that the file I used was from six months ago when I wrote the blog post (and then promptly forgot to publish it). In the last half year, the number of coronavirus sequences has increased dramatically. One thing that you could do to drop the file size down is to filter for only complete and unambiguous sequences, which drops the number down from 1.6 million to ~100k [1].

Alternatively, the exact file I used for the post is available for one week here with MD5 sum 3c33c3c4c2610f650c779291668450c9 [2]. Anyone who wants the file is free to reach out to me directly (email is on site).

[1] https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType...

[2] https://file.io/nUNc7cG5i8gj
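Anyone who does get hold of the file can check it against the MD5 sum quoted above. A minimal sketch with Python's `hashlib`; the function names and the chunk size are arbitrary choices for illustration:

```python
import hashlib

# Checksum quoted in the comment above.
EXPECTED_MD5 = "3c33c3c4c2610f650c779291668450c9"

def md5_of_file(path, chunk_size=1 << 20):
    """Hash the file in 1 MiB chunks so it never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_expected(path, expected=EXPECTED_MD5):
    """True if the file's MD5 hex digest equals the expected one."""
    return md5_of_file(path) == expected.lower()
```

`matches_expected("orthocoronavirinae.fasta")` would then return `True` only for a byte-identical copy (the file name is taken from the Nim snippet earlier in the thread and may differ from the downloaded one).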

hexo · on Sept 25, 2021

The file at [2] is already gone :(

tandav · on Sept 25, 2021

Can you upload your 150M file somewhere? If I follow the link in your comment, there are a bunch of small files. Did you concatenate them?