Const-me · 2026-06-15T19:50:36 1781553036

“Is it because of government regulations, do we need to deregulate?”

Insufficient law enforcement. The same memory manufacturers already broke antitrust laws in the past and pleaded guilty. Apparently the fines were too small for these companies to care, and the people responsible were promoted instead of being punished. More information: https://en.wikipedia.org/wiki/DRAM_price_fixing_scandal

Const-me · 2026-06-11T20:04:00 1781208240

Might be jurisdiction. Let’s say a person who is not a Polish citizen commits and broadcasts a crime outside of Poland, then tries to enter Poland. IANAL but I think this law sends that person to jail as long as the video is accessible from inside Poland.

Const-me · 2026-06-11T12:28:54 1781180934

> let alone more performant

Not anymore. On modern hardware, the only operation where integers win is single cycle add/sub. For the rest of operations (multiplication, division, square roots, etc.) floating point is faster, sometimes by a lot.

rasz · 2026-06-12T07:56:42 1781251002

If you care about performance you use logs and then multiplications turn into integer additions.

Const-me · 2026-06-12T11:29:29 1781263769

On modern processors, floating point addition often has equal performance to floating point multiplication. For example, on AMD Zen4 it’s 3 cycles latency and 0.5 cycles throughput.

I’m not sure that trick is going to work in the context of computer graphics. To transform vectors or multiply matrices you need a mix of multiplications and additions, or an equivalent sequence of FMAs.

Const-me · 2026-06-10T14:02:45 1781100165

Good article. Worth noting that the C# standard library handles most of that complexity, no regular expressions required. Call System.Net.Mail.MailAddress.TryCreate and, if it succeeds, read the Address property to get the normalised address.

Const-me · 2026-06-10T13:46:36 1781099196

Cool trick, but personally I don’t trust C bitfields. When I need something like that, I usually create a C++ class or a C# structure with a single private uint64 field, and public methods to extract or manipulate the logical fields.

Because the class/structure only has a single uint64 field, compilers are likely to pass the value in a single general-purpose register. I believe that’s unlikely to happen for a structure with bit fields.

If you target AVX2 or newer you usually also have BMI1 and BMI2. Intrinsics like bextr and bzhi are probably faster than whatever code compilers generate for bit fields.

Binary compatibility of bit fields is a moot point, using them at the API surface across compilers or languages is not ideal. A structure with a single uint64 field is very compatible.
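To illustrate the single-uint64 approach, here is a minimal C++ sketch. The class name, the three fields, and their widths are all hypothetical and only serve the example. It uses portable shifts and masks rather than BMI intrinsics; whether a compiler lowers these to bextr or bzhi depends on the target flags.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical packed descriptor: three logical fields in one 64-bit integer.
// Layout (made up for this example): id in bits [0..39], kind in bits [40..47],
// flags in bits [48..63].
class PackedDesc
{
    uint64_t bits = 0;

    // Extract `width` bits starting at bit `offset`; width must be in [1, 63]
    static constexpr uint64_t extract( uint64_t v, unsigned offset, unsigned width )
    {
        return ( v >> offset ) & ( ( uint64_t( 1 ) << width ) - 1 );
    }

    // Replace `width` bits starting at bit `offset` with the low bits of `field`
    static constexpr uint64_t insert( uint64_t v, uint64_t field, unsigned offset, unsigned width )
    {
        const uint64_t mask = ( ( uint64_t( 1 ) << width ) - 1 ) << offset;
        return ( v & ~mask ) | ( ( field << offset ) & mask );
    }

public:
    uint64_t id() const { return extract( bits, 0, 40 ); }
    uint32_t kind() const { return (uint32_t)extract( bits, 40, 8 ); }
    uint32_t flags() const { return (uint32_t)extract( bits, 48, 16 ); }

    void setId( uint64_t v )
    {
        assert( v < ( uint64_t( 1 ) << 40 ) ); // value must fit in 40 bits
        bits = insert( bits, v, 0, 40 );
    }
    void setKind( uint32_t v )
    {
        assert( v < 256u ); // value must fit in 8 bits
        bits = insert( bits, v, 40, 8 );
    }
    void setFlags( uint32_t v )
    {
        assert( v < 65536u ); // value must fit in 16 bits
        bits = insert( bits, v, 48, 16 );
    }

    // The raw value is what you would pass across an API boundary
    uint64_t raw() const { return bits; }
};

static_assert( sizeof( PackedDesc ) == sizeof( uint64_t ), "must stay a single 64-bit value" );
```

The layout is spelled out in the code instead of being left to the compiler, which is the point: another language or compiler only needs to agree on one uint64 and the documented bit positions.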

Const-me · 2026-06-05T23:42:31 1780702951

None so far. When I try to use these language models in the primary areas of my expertise like SIMD or GPGPU they fail to do any good. When I ask them to implement some general-purpose stuff, the output is too low quality to be useful in my software.

Still, I find them incredibly useful for code review (despite being unable to write good C++ or C#, they are smart enough to detect issues there), and also for dealing with technologies outside of my area of expertise like Python or web stuff.

Const-me · 2026-05-31T20:05:04 1780257904

> Performance should not be priority #1. Security should be.

For a web browser, or a server in a bank, sure. For anything else, questionable.

> adding a sandbox around a memory-unsafe codec is going to be way more expensive

In the modern world, the overhead of strong sandboxes is surprisingly small. The nuclear but most reliable option is a hardware-assisted VM. On modern computers with SLAT and virtualized IO the overhead for most use cases is negligible. If you want something lighter weight, you can use the multi-user nature of all modern OS kernels and isolate the codec into a separate process with restricted permissions. In that case the sandboxing overhead is approximately zero.

Const-me · 2026-05-26T14:49:00 1779806940

The AVX2 SIMD version is not ideal. Too many instructions, and it needs constant vectors. I would rather do it like this: https://godbolt.org/z/cn6YKbfYd

Const-me · 2026-05-23T21:39:26 1779572366

> Most of those FLOPS are constrained by memory bandwidth

I believe inference with large enough batch size is almost always compute bound, simply due to algorithmic complexity.

Each step of tiled matrix multiplication with square tiles of size N^2 takes O(N^2) memory loads and O(N^3) compute operations. With N = 32 or 64, you will likely saturate compute even on iGPUs with DDR4 or DDR5 memory pretending to be VRAM.
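A back-of-the-envelope version of that ratio, as a small C++ sketch. The assumptions are mine and not measured: FP32 elements, one N×N tile loaded from each input matrix per step, and one multiply-add counted as 2 FLOPs. The names `TileCost` and `tileStepCost` are made up for the example.

```cpp
#include <cassert>
#include <cstddef>

// Rough cost of one tile step: load an N*N tile of A and an N*N tile of B,
// then multiply them. Ignores caches, output writes, and everything else.
struct TileCost
{
    double bytesLoaded;
    double flops;
    double flopsPerByte() const { return flops / bytesLoaded; }
};

inline TileCost tileStepCost( size_t n, size_t bytesPerElement = 4 )
{
    assert( n > 0 && bytesPerElement > 0 );
    const double dn = (double)n;
    TileCost c;
    c.bytesLoaded = 2.0 * dn * dn * (double)bytesPerElement; // two tiles, N^2 elements each
    c.flops = 2.0 * dn * dn * dn;                            // N^3 multiply-adds, 2 FLOPs each
    return c;
}
```

Under these assumptions the ratio works out to N / 4 FLOPs per byte loaded: 8 for N = 32 and 16 for N = 64. Whether that is enough to be compute bound depends on the ratio of peak FLOPS to memory bandwidth of the particular GPU.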

zzzoom · 2026-05-24T01:03:41 1779584621

Prefill (GEMM) is compute bound, decode (GEMV) is memory bound.

Const-me · 2026-05-24T06:56:18 1779605778

> decode (GEMV) is memory bound

Decode with batch size 1 is GEMV. Batching makes the decode GEMM too.
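A naive scalar sketch of why batching changes the picture, assuming row-major storage and a hypothetical function name `batchedDecodeStep`. It is nowhere near a real kernel, it only shows the data reuse: each weight is loaded once and then used for every sequence in the batch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Multiply one weight matrix (rows x cols, row-major) by `batch` input vectors.
// x is batch x cols, the result is batch x rows, both row-major.
inline std::vector<float> batchedDecodeStep( const std::vector<float>& w, size_t rows, size_t cols,
    const std::vector<float>& x, size_t batch )
{
    assert( w.size() == rows * cols );
    assert( x.size() == batch * cols );
    std::vector<float> y( batch * rows, 0.0f );
    for( size_t r = 0; r < rows; r++ )
    {
        for( size_t c = 0; c < cols; c++ )
        {
            const float weight = w[ r * cols + c ]; // single load of this weight
            for( size_t b = 0; b < batch; b++ )      // reused once per batch element
                y[ b * rows + r ] += weight * x[ b * cols + c ];
        }
    }
    return y;
}
```

With batch = 1 this is a plain GEMV, one multiply-add per weight loaded. With batch = B it is a GEMM with B multiply-adds per weight loaded, so the arithmetic intensity grows with the batch size.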

Const-me · 2026-03-31T10:03:24 1774951404

While data centres indeed have awesome internet connectivity, don’t forget the bandwidth is shared by all clients using a particular server.

If you have a 100 Mbit/sec internet connection at home, and a computer in a data centre has 10 Gbit/sec but is serving 200 concurrent clients, each client gets about 50 Mbit/sec, so your home connection is twice as fast.
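The arithmetic, as a tiny C++ sketch. It assumes the uplink is split equally between equally active clients and that nothing else is the bottleneck; the function name `perClientMbit` is made up.

```cpp
#include <cassert>

// Fair-share estimate: server uplink divided equally among concurrent clients.
inline double perClientMbit( double serverMbit, unsigned clients )
{
    assert( clients > 0 );
    return serverMbit / (double)clients;
}
```

For the numbers above, perClientMbit( 10000.0, 200 ) gives 50 Mbit/sec, half of the 100 Mbit/sec home connection.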