silentvoice's comments

What are some bad behaviors you've seen with NFS, ZFS, and RAID? How did you diagnose them, and how did that lead you to this solution?


NFS -- very slow reads, much slower than `cp /nfs/path/to/file.txt ~/file.txt`. I generally suspect these have to do with some pathological behavior in the app reading the file (eg: doing 1-byte reads when linearly scanning through the file). I diagnose with simple `iotop`, timing the application doing the reads vs `cp`, and looking at a plethora of random networking tools (eg: tcptop, ...). I've also very crudely looked at `top`/`htop` output to confirm that an app is not CPU-bound, as a first guideline.
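
To make the pathological pattern concrete, here's a minimal Rust sketch (the path is a placeholder) contrasting a 1-byte-per-syscall scan with a buffered one:

    use std::fs::File;
    use std::io::{BufReader, Read};

    fn main() -> std::io::Result<()> {
        let mut byte = [0u8; 1];

        // Pathological: one read() syscall (and potentially one NFS
        // round trip) per byte of the file.
        let mut f = File::open("/nfs/path/to/file.txt")?;
        while f.read(&mut byte)? == 1 { /* scan */ }

        // Sane: the same loop on top, but the reader fetches 64 KiB at
        // a time and the 1-byte reads are served from memory.
        let f = File::open("/nfs/path/to/file.txt")?;
        let mut r = BufReader::with_capacity(64 * 1024, f);
        while r.read(&mut byte)? == 1 { /* scan */ }

        Ok(())
    }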

ZFS -- slow reads due to pool-level decompression. ZFS has its own utilities; IIRC it's something like `zpool iostat` to see raw disk vs filesystem IO.

RAID -- with heterogeneous disks in something like RAID 6, you get the speed of the slowest disk. This shows up when doing fio benchmarking (the first thing I do after setting up a new filesystem/mounts). It could be that better software has ameliorated this since (I last checked something like 5ish years ago).


I have no submission for this, but I joined the hype in my own way by optimizing the training loop. These tiny models are not really well suited to frameworks like PyTorch, and with highly patient AI agents we can now inline the whole thing into C++ just to see what happens, which I do below:

https://www.reidatcheson.com/transformer/llm/ml/cuda%20graph...


oh boy I've got opinions here.

Basically I just don't want to hear about "the state of SIMD in Rust" unless it is about a dramatic improvement in autovectorization in the Rust compiler.

80%-90% or so of real life vectorization can be achieved in C or C++ just by writing code in a way that lets it be autovectorized. Intrinsics get you the rest of the way on harder code. Autovectorization is essentially a solved problem for the vast majority of floating point code.

Not so with Rust, because of a dogmatic approach to floating point arithmetic that assumes bitwise reproducibility is the "right" answer for everyone (actually, it's the right answer for almost nobody), to the point of not even letting a user opt into these optimizations with a flag. And once you get to the point of writing intrinsics, you have to handwrite code for every new architecture, when an autovectorizer could have gotten you 80%-90% of the way there from a single source, and often that is enough.

The contention with the above is that if a user needs SIMD, they can just use some SIMD API and make their intention more clear. This is essentially an argument that we should handwrite intrinsics. Well, guess what: I'm a programmer, and I use compilers because they _do this for me_, and indeed they do so very easily in C or C++ when I instruct them that I'm ok with reordering operations and other "accuracy impacting" optimizations.

The huge joke on us is that these optimizations generally have the effect of _improving_ accuracy, because they reduce the number of rounding steps, either by simply reducing the number of operations or by using fused multiply-adds, which round only once.
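
A tiny sketch of that last point (toy numbers of my own, chosen so the difference shows up in f32): `mul_add` does not round the product before the subtraction, so the low-order term survives.

    fn main() {
        // a = 1 + 2^-12, so a*a = 1 + 2^-11 + 2^-24 exactly.
        let a = 1.0f32 + 2.0f32.powi(-12);

        // a*a rounds to 1 + 2^-11 first; the 2^-24 term is lost.
        let separate = a * a - 1.0;

        // The fused multiply-add rounds once, after the subtraction,
        // and keeps it.
        let fused = a.mul_add(a, -1.0);

        println!("separate = {separate:e}"); // 4.8828125e-4  (= 2^-11)
        println!("fused    = {fused:e}");    // ~4.8834085e-4 (= 2^-11 + 2^-24)
    }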


>80%-90% or so of real life vectorization can be achieved in C or C++ just by writing code in a way that lets it be autovectorized.

Yep. I was pleasantly surprised by the autovectorization quality of recent Clang at work a few days ago. If you write code where the compiler can infer that trip counts are multiples of 4, 8, etc., it goes off and emits pretty decent NEON/AVX code. The rest, as you say, is handled quite well by intrinsics these days.

Autovectorization was definitely poorer 5-10 years ago on older compiler toolchains.


Keep an eye out for the algebraic operations on floats currently in nightly then: https://doc.rust-lang.org/nightly/std/primitive.f32.html#alg...
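
If I read the linked docs right (feature gate and method names per current nightly, so they may change), a reduction written with them looks like this, and LLVM is then free to reassociate and vectorize it without any global fast-math switch:

    #![feature(float_algebraic)]

    // Dot product whose adds/muls carry the "algebraic" flags, so the
    // compiler may reorder the reduction.
    pub fn dot(a: &[f32], b: &[f32]) -> f32 {
        a.iter()
            .zip(b)
            .fold(0.0f32, |acc, (&x, &y)| acc.algebraic_add(x.algebraic_mul(y)))
    }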


I stumbled on these recently; you can do these in CUDA kernels too. I have some "todo: mul_add here" comments in my Rust code!
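
For anyone who hasn't met it: `mul_add` is a drop-in for the multiply-then-add pattern, e.g. a Horner-style polynomial evaluation (my example, not from my actual code):

    // Evaluate c[0] + c[1]*x + c[2]*x^2 + c[3]*x^3 by Horner's rule.
    // Each mul_add is a single fused operation with one rounding (and
    // one FMA instruction on targets that have it).
    fn horner_cubic(c: [f64; 4], x: f64) -> f64 {
        c[3].mul_add(x, c[2]).mul_add(x, c[1]).mul_add(x, c[0])
    }

    fn main() {
        // 1 - 3*2^2 + 2^3 = 1 - 12 + 8 = -3
        println!("{}", horner_cubic([1.0, 0.0, -3.0, 1.0], 2.0));
    }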


So you have to write fugly code just to get something that should be a compiler switch?


Yikes. Sounds like we need this in Rust ASAP. (I write a lot of parallelizable code; GPU-centric, but CPU SIMD is a good fallback for machines that don't have Nvidia GPUs.) I find the manual SIMD packing/unpacking clumsy, especially when managing it alongside non-SIMD CPU and GPU code.
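
For concreteness, the manual packing/unpacking I mean looks roughly like this with nightly `std::simd` (module paths and trait names as of a recent nightly; they have moved between releases):

    #![feature(portable_simd)]
    use std::simd::{f32x8, prelude::*};

    // Manual pack/compute/unpack for a dot product: 8 lanes at a time,
    // with a scalar cleanup pass for the tail.
    pub fn dot(a: &[f32], b: &[f32]) -> f32 {
        assert_eq!(a.len(), b.len());
        let mut acc = f32x8::splat(0.0);
        let mut ca = a.chunks_exact(8);
        let mut cb = b.chunks_exact(8);
        for (xa, xb) in (&mut ca).zip(&mut cb) {
            acc += f32x8::from_slice(xa) * f32x8::from_slice(xb);
        }
        let tail: f32 = ca.remainder().iter().zip(cb.remainder()).map(|(x, y)| x * y).sum();
        acc.reduce_sum() + tail
    }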


There are two sides to numerical linear algebra. The first is the "linear algebra" part, which is very mathematically sophisticated; the language you choose to represent these concepts matters less than your understanding, and pencil and paper is the ideal place to prove out that understanding.

The "numerical" part is a minefield because it will take all your math and demolish it. just about every theoretical result you proved out above will hold not true and require extra-special-handholding in code to retain _some_ utility.

As such, I think the language to use is the one that gets you as fast as possible from an idea to seeing whether it crosses the numerical minefield unscathed, and these days that is Python. It is just so fast to test out a concept, get immediate feedback in the form of plotting (or just plain dumb logging if you like), and you can nearly instantly share this with someone even if you're on ARM+Linux and they are on Intel+Windows.

The most problematic issue with Python & NumPy, as it relates to learning the _numerical_ side of linear algebra, is making sure you haven't unintentionally promoted a floating point precision somewhere. For example: if your key claim is that an algorithm works entirely in a 32-bit working precision, but Python silently promoted your key accumulator to 64 bits, you might get a misleading idea of how effective the algorithm was. But these promotions don't happen in a vacuum, and if you understand how the language works they won't happen.
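
The effect is easy to demonstrate in a language where the promotion has to be explicit; a toy Rust sketch (the numbers are mine) of how a 64-bit accumulator hides the error the true 32-bit algorithm would incur:

    fn main() {
        let xs = vec![0.1f32; 10_000_000];

        // True 32-bit working precision: visibly wrong after ten
        // million adds, since the increment is below the accumulator's ulp.
        let sum32: f32 = xs.iter().sum();

        // Promoted accumulator: looks far more accurate, but tells you
        // nothing about the f32 algorithm you meant to test.
        let sum64: f64 = xs.iter().map(|&x| x as f64).sum();

        println!("f32 accumulator: {sum32}"); // noticeably off from 1e6
        println!("f64 accumulator: {sum64}"); // ~1e6
    }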

edit: & I have worked professionally with fortran for a long time, having known some of the committee members and BLAS working group. so I have no particular bias against the language


Hi, I wrote the linked blog post - and I feel a little silly that I didn't check that _both_ loops vectorized. So I fixed the Rust implementation to keep a running vector of partial sums which I finish up at the end - this one did vectorize. The result was a 2x performance bump, which I'm about to include in the blog post as an update.

If it's OK I'll link to this comment as the inspiration.

On iterators versus the raw loop: for some reason, when I use the raw loop _nothing_ vectorizes, not even the obvious loop. What I read online is that bounds checking happens inside the loop body because Rust doesn't know where those indices are coming from. Using iterators instead is supposed to fix this, and it did seem to in my experiments.
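
For other readers, the partial-sums version looks roughly like this (a sketch of the pattern, not the post's exact code):

    // Eight independent partial sums break the serial dependence of the
    // naive reduction, so the compiler can keep the whole loop in vector
    // registers. chunks_exact also lets bounds checks be hoisted out.
    pub fn sum(xs: &[f32]) -> f32 {
        let mut acc = [0.0f32; 8];
        let mut chunks = xs.chunks_exact(8);
        for chunk in &mut chunks {
            for i in 0..8 {
                acc[i] += chunk[i];
            }
        }
        acc.iter().sum::<f32>() + chunks.remainder().iter().sum::<f32>()
    }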


I liked your trick of iterating on chunks to force the compiler to vectorize the code! Now that the code is properly vectorized, you can add the `mul_add` function, and this time you'll see a significant speedup. I tried it on my machine and it made the code 20% faster.

See the generated assembler here: https://rust.godbolt.org/z/G5A2u0
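
Putting the two tricks together, the kernel ends up looking something like this (a rough sketch of the shape, not necessarily the exact code behind the godbolt link):

    // Chunked dot product with the multiply and add fused: eight
    // independent mul_add chains, scalar cleanup for the tail.
    pub fn dot(a: &[f32], b: &[f32]) -> f32 {
        assert_eq!(a.len(), b.len());
        let mut acc = [0.0f32; 8];
        let mut ca = a.chunks_exact(8);
        let mut cb = b.chunks_exact(8);
        for (xa, xb) in (&mut ca).zip(&mut cb) {
            for i in 0..8 {
                acc[i] = xa[i].mul_add(xb[i], acc[i]);
            }
        }
        let tail: f32 = ca.remainder().iter().zip(cb.remainder()).map(|(x, y)| x * y).sum();
        acc.iter().sum::<f32>() + tail
    }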


Thanks! The chunks trick was a fairly straightforward translation of what I would do in C++ if the compiler wouldn't vectorize the reduction for some reason. These days most compilers will do it if you pass enough flags, a fact I really took for granted when doing this, because Rust is more conservative.

I've tried using mul_add, but at the moment performance isn't much better. I also noticed someone else running a big parallel build on my machine, so I'll wait a bit and then run the full sweep over the problem sizes with mul_add.

So it seems the existence of FMA didn't really have a performance implication, except to confirm that Rust wasn't passing "fast math" to LLVM where Clang was. It just so happens that "fast math" also allows vectorization of reductions.


Great to hear that you managed another 2x speedup! Sure, feel free to link my comment if you like.


Coming from a PhD in math, I can offer a good trick for assessing grand mathematical claims:

Google the authors.

Maybe unfair to intelligent amateurs, but based on my decade of experience, this tells you whether to take something seriously.

You might need to adjust the Google terms for hard-to-google names; just use common sense.


Except Patrick Sole is a real mathematician who has been working in this area for some years now (http://www.emis.de/journals/JTNB/2007-2/article03.pdf), so it's not easy to dismiss it just based on a name.

The most plausible explanation is the one given here: https://www.reddit.com/r/math/comments/3vnrqj/two_authors_cl... -- "Zhu sent Sole some questions about his Robin inequality paper, including Zhu's ideas for proving RH. Sole responded, but there was some communication breakdown that led to Zhu thinking Sole endorsed his ideas. Zhu typed up his idea and added Sole's name to it in order to get the paper read. This is of course unethical, but given that Zhu thought his proof was correct, in his mind he was doing Sole a favor."

This was published on Saturday, so my best guess is that on Monday Patrick Sole will either post a refutation or will claim it is true and everyone will shit a brick (unlikely).


I wasn't clear enough. I was responding to the flow of comments of the form "the Riemann hypothesis is hard, so this is unlikely to be true." Sure, that's true, but doing a little more research could inform that opinion well past the zeroth-order approximation of "it's a hard problem."

I didn't actually take my own advice, I just wait for Terence Tao to write a post then I know it's true :)


I think the average HNer will hear about it when a proof happens for real, through their everyday channels: Twitter, Facebook, Reddit, HN, etc. will be FULL of it (with good reason). Remember the Higgs boson?


Maybe a better term would be "blind" rather than black-box. I think the goal is simply to hold optimization to the same level of reproducibility expected of most scientific fields today: if a researcher is allowed to introduce a hundred tunable parameters that make their algorithm converge on all the standard test cases, then they haven't created a reproducible optimizer - they have created a benchmark solver.


How does this fare for "performance portability"? I use this solution in C when the function is very expensive, so a little extra indirection really won't make any difference - but it can potentially improve reusability a lot. Is inlining calls to a function via a pointer a basic optimization that any self-respecting compiler should be able to do, or an advanced one that I can't count on working across platforms? Given the possibility of dynamic libraries, I don't see how it could be inlined in all cases, so at least some kind of analysis must be done before trying it.


Of course, my statement is far from a complete and definitive answer; it's a mere suggestion of another method to add to OP's list.

If the functions are defined in different binaries which are dynamically linked, the chances of inlining approach zero [0], although this method may still work across binaries, in contrast to all the techniques from the article. To inline the argument function, the compiler must have the definitions of both functions. To enable this in a library/library-consumer scenario, the higher-level function [1] can be placed in a header file and marked static, which guarantees both end up in the same module.

Inlining function pointers doesn't seem to be an advanced optimization technique (it is not that different from constant propagation). GCC 4.5 does this for same-module functions even at -O1 [2].

Personally, I prefer to emphasize modularity in my code and move to other solutions only when something is identified as a bottleneck (which seems to be a nice rule of thumb for all optimizations).

[0] I would love to be proven wrong by some kind of JIT-ing dynamic linker.

[1] i.e., the function accepting other functions as arguments.

[2] http://stackoverflow.com/questions/2959535/c-function-pointe...


Inlining a call to a function via a pointer, if the function is static and implemented in the same .c file or an included header, is a basic optimization.

I'm less sure about the compiler figuring out that the user wants (in that case) process_image to be inlined into each of the wrapper functions. However, if you don't mind a bit of nonportability, it's easy to force it to do so: mark the function to be inlined __forceinline on MSVC, __attribute__((always_inline)) on GCC/Clang/ICC/etc.


Curious that it doesn't mention C11's `_Generic` selection. A very nice way to achieve a lot of useful "overloading" behavior.

http://en.wikipedia.org/wiki/C11_%28C_standard_revision%29


I was simply not aware of it, as I never really looked deeply at the features of anything later than C99 (I only vaguely remember the threading features, which looked appealing). That's interesting; I might end up mentioning it in the conclusion later.

Thank you.


The largest issue I see with this is compiler vendors dragging their feet on C11 support, vs. C++11 support, which is much, much more competitive.


The idea of using DSLs in this field is attractive. The amount of math you have to hold in your head to really start talking about large-scale simulations is staggering, and it gets more and more complicated as you scale up. There are thousands of moving parts that must be accounted for, and any one of them can burn you.

I wonder, however, with full respect to the groups who have made good progress in making the idea a reality, why they never attempt collaboration with computer science, a field that has been perfecting language design for so long. I remember speaking to someone doing a similar project, and they didn't even know that parsing was a "thing," yet they were implementing a DSL.

Since I'm more on the math side and not on the computer science side I can't comment on how much duplicated work is actually happening, but it seems like a lot.


In their 2014 presentation [1] they say that their compiler is implemented in C++ using flex/bison (a lexer/parser generator combo), so at least in this case they're aware of the standard tools.

[1] http://equelle.org/presentations/

