I really don’t want to give anyone ideas, but doesn’t this make the Nvidia 5090 ...

layer8 · 2026-05-24T18:00:30 1779645630

An unbelievably good deal at $4000 plus?

johnvanommen · 2026-05-24T18:01:48 1779645708

Possibly the best deal there is

I really need to shut up, or bite the bullet and buy one.

If you graph the tokens per second on the 5090, your jaw will hit the floor at how cheap it is

gruez · 2026-05-24T18:25:43 1779647143

With only 32gb of vram, you can only run small/quantized models, in which case what's the point? At $4000, that gets you 20 months of 10x claude or chatgpt subscriptions, which provide far better models. You'd need some use case where you can tolerate worse models, and use a steady supply of them. That doesn't match most people's usage patterns.
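For reference, the "20 months" figure can be reproduced with a quick sketch. The $200/month price is an assumption inferred from the numbers in the comment ($4000 over 20 months), and `monthly_running_cost` is a hypothetical knob for electricity and similar costs, not something stated above.

```python
def months_to_break_even(hardware_cost: float,
                         monthly_subscription: float,
                         monthly_running_cost: float = 0.0) -> float:
    """Months after which buying hardware costs the same as subscribing.

    monthly_running_cost is a hypothetical allowance for electricity etc.
    """
    monthly_saving = monthly_subscription - monthly_running_cost
    if monthly_saving <= 0:
        raise ValueError("running costs meet or exceed the subscription price")
    return hardware_cost / monthly_saving


# $4000 card vs. an assumed $200/month subscription, ignoring running costs
print(months_to_break_even(4000, 200))  # 20.0
```

Any non-zero running cost pushes the break-even point further out.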

regularfry · 2026-05-24T19:40:14 1779651614

If you can do what you need with qwen3.6-27b, it starts to look really interesting. That model is crazy good for the size, but it's a pain tweaking the params to run it on a 4090 with decent context and decent token speed. A 5090 looks tasty from that point of view, and only more so if you think in terms of the probability of that model being roflstomped by something in the same weight class in the next couple of years. I reckon that probability is significantly non-zero, but fundamentally it's a guess.

gruez · 2026-05-24T20:48:29 1779655709

>If you can do what you need with qwen3.6-27b, it starts to look really interesting.

What's the use case here? Churning out massive amounts of slop code through autonomous agents? Running openclaw 24/7? I think the proliferation of codex and claude code, compared to any of the cheaper open models, suggests that at least for most software development, the 50-75% discount of open models isn't worth the hassle of the decreased intelligence.

weitendorf · 2026-05-24T22:55:46 1779663346

I think there is a reasonable basis for taking a gamble that small models capable of fitting on a 32GB card will continue to advance over the next 5 years and eventually approach Gemini Flash 3.5 / Sonnet 4.6 levels of capabilities, which I would consider to be past the threshold of “probably worth the cost and hassle of running 24/7” if the upfront cost of the hardware was palatable.

My use case would primarily be in search, integration, and indexing other software projects with my own, as well as transcription/indexing of interesting video and audio content (eg Dwarkesh interviews) that I don’t have time to watch but want to easily search and apply to my projects, and search/indexing for useful information from things like Linux kernel and security mailing lists. Basically there is a lot of stuff that, if the cost were low enough, I would point a reasonably intelligent AI at to distill out useful information and apply it to my projects, or just cherry pick the interesting things out and surface them to me so I don’t have to wade through all the mundane stuff and man-made slop getting in the way.

gruez · 2026-05-24T23:16:04 1779664564

>My use case would primarily be in search, integration, and indexing other software projects with my own, as well as transcription/indexing of interesting video and audio content (eg Dwarkesh interviews) that I don’t have time to watch but want to easily search and apply to my projects, and search/indexing for useful information from things like Linux kernel and security mailing lists. Basically there is a lot of stuff that, if the cost were low enough, I would point a reasonably intelligent AI at to distill out useful information and apply it to my projects, or just cherry pick the interesting things out and surface them to me so I don’t have to wade through all the mundane stuff and man-made slop getting in the way.

All of that feels like something that a $20 chatgpt pro subscription is for, maybe with slightly better tool use capabilities. There's no way that a $4000 purchase on a GPU would ever be worth it if all you're doing is running a handful of queries per day.

weitendorf · 2026-05-25T00:48:53 1779670133

It would require much more than a couple of queries per day; I want to basically do bulk ingestion and search/evaluation/integration across tens of thousands of videos and software projects (if it were cheap enough and smart enough). It would basically be setting up and operating a pretty large data ingestion and coding agent pipeline, which I would want to itself be mostly automated.

It’s ok if you don’t want to do the same kind of thing but I find it weird how dismissive so many people get about wanting to use LLMs for large projects, or how anybody who says they’re using them for these kinds of things (I’m doing similar for other stuff) gets challenged on what they’re doing it for.

echoangle · 2026-05-24T19:02:29 1779649349

Or you want to process private data or don’t have reliable connectivity. There are a few more reasons for local models I think.

EnPissant · 2026-05-24T19:11:19 1779649879

Also, electricity isn't free.

tom_alexander · 2026-05-25T01:19:21 1779671961

With enough solar panels it is!

HDBaseT · 2026-05-25T02:48:42 1779677322

Not quite.

Free for approximately 8 hours (assuming perfect weather conditions) and excluding unit cost and maintenance cost.

It has a cost.

tom_alexander · 2026-05-25T13:17:48 1779715068

My area has a net-metering plan available, so you can send any surplus out to the grid to offset energy pulled from the grid, essentially treating the grid like a large battery. That can extend the 8 hours into full 24-hour coverage with enough panels.

free652 · 2026-05-25T00:49:49 1779670189

I don't have a 5090, I have a 395+ and I use it for GPU-assisted OCR, embedding vectors, speech-to-text, etc. I have the freedom to use a large library of various models, and I can fit a lot in 128gb.

I don't use it for coding, I have $20 Gemini, $20 codex, etc.

But then I got the framework board for $1700, now it's $2700

Galanwe · 2026-05-24T19:06:35 1779649595

The 5090 is crap for inference, unless you like dummy models; sure, those will run at light speed. All the rage is MoE with 500B-1T weights nowadays.

zozbot234 · 2026-05-24T20:40:37 1779655237

MoE is fine. You can put the shared weights on the 5090 (will fit handily even for the largest models) and expert weights on CPU, possibly with weights offload from storage.

EnPissant · 2026-05-25T07:44:57 1779695097

Even if you could fit a 500B model's expert weights in very fast system RAM, it would run so slow as to be useless.
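A back-of-envelope way to put a ceiling on this, as a sketch only: if decode is limited by memory bandwidth, single-stream speed is at most bandwidth divided by the bytes of active weights read per token. The function name and all the numbers below are illustrative assumptions, not measurements of any real model or machine.

```python
def decode_tokens_per_second(active_params: float,
                             bytes_per_param: float,
                             bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-stream decode speed, assuming every active
    weight is read from memory once per generated token."""
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token


# Illustrative assumptions: 30e9 active parameters per token,
# 4-bit weights (0.5 bytes each), 100 GB/s of system RAM bandwidth.
estimate = decode_tokens_per_second(30e9, 0.5, 100e9)
print(f"{estimate:.1f} tokens/s upper bound")
```

Plugging in your own active parameter count and measured bandwidth gives the ceiling for your setup; real throughput will be lower.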

zozbot234 · 2026-05-25T09:13:43 1779700423

That's really only "useless" if the only thing you care about is a quick real-time response. Contrary to common perception, MoE models do benefit from batching requests together even when run on a single node, you just have to ensure you have at least ~5 parallel requests in flight (and that's for the very sparsest models) to really see the aggregate benefit.

(Intuitively, that's because the issue of whether any active weights are being shared among requests - thus, any memory throughput is being reused - is a generalized birthday problem. That's why even having a few parallel requests is quite effective. Especially since the "random" choice of experts happens anew at any single layer, so there's a lot of independent samples.)

EnPissant · 2026-05-25T09:37:12 1779701832

This is just wishful thinking.

For prefill, it's really easy to batch MoE and get really good tk/s, even on a single stream.

For decode, you will run into the problem that:

1) you need more parallel requests which means more memory for context

2) 5 requests will not give you very much expert overlap on parallel requests

zozbot234 · 2026-05-25T09:49:57 1779702597

You don't need "very much" expert overlap to see aggregate gains at scale, you just need some of it; that's where the "birthday" framing becomes relevant. Memory for context is an issue, but recent models like DeepSeek V4 use very little of it even at relatively large contexts.

EnPissant · 2026-05-25T09:58:35 1779703115

>You don't need "very much" expert overlap to see aggregate gains at scale, you just need some of it

I'm not sure what you are claiming. Decode is bottlenecked by memory bandwidth. To see a speedup of 2x, you have to ensure each expert weight memory fetch can be used by 2 parallel streams. What exactly is the average factor you are claiming for 5x parallel streams (due to "birthday paradox" factors)? The Birthday paradox isn't really relevant here. It's about coverage, not parallelism.

> Memory for context is an issue, but recent models like DeepSeek V4 use very little of it even at relatively large contexts.

This is not true.

zozbot234 · 2026-05-25T11:02:36 1779706956

An aggregate speedup of 2x is a lot, we don't need that in a local context. Local hardware is heavily constrained by power and thermals, not just bandwidth; so all we really care about is raising compute intensity for decode a little bit to relax the memory bandwidth constraint. The average factor will depend on just how sparse the model is and how far you can push parallelism, there isn't just one single answer.

EnPissant · 2026-05-25T18:13:08 1779732788

But you won't see 2x expert re-use, the speedup with 5 streams will be tiny.
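The "average factor" being argued about can be estimated with a small sketch, under strong simplifying assumptions: each stream picks `top_k` of `num_experts` experts uniformly and independently per layer. Real routers are not uniform, and the 256-expert, top-8 configuration is a hypothetical example, not a specific model.

```python
def expected_distinct_experts(num_experts: int, top_k: int, streams: int) -> float:
    """Expected number of distinct experts touched in one layer when each
    stream independently picks top_k experts uniformly at random."""
    p_untouched = (1 - top_k / num_experts) ** streams
    return num_experts * (1 - p_untouched)


def reuse_factor(num_experts: int, top_k: int, streams: int) -> float:
    """Average number of streams served per expert weight fetch."""
    fetched = expected_distinct_experts(num_experts, top_k, streams)
    return streams * top_k / fetched


# Hypothetical sparse MoE: 256 experts per layer, 8 active per token
for streams in (1, 5, 32):
    print(streams, round(reuse_factor(256, 8, streams), 2))
```

Under these assumptions 5 streams give a reuse factor of about 1.06, and the factor grows with more streams or a denser model. Correlated routing in a real model would push it higher.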

mattmanser · 2026-05-24T17:50:35 1779645035

It's gone up like 300% in cost in the last year.

JacobAsmuth · 2026-05-24T17:55:20 1779645320

Which surely is the highest it'll ever be! You're suggesting that the price will go down in the future? Would love to hear more about your thought process!

bcrosby95 · 2026-05-24T18:30:00 1779647400

Are you saying we're entering a period where tech increases in price instead of decreases? I guess it depends upon time horizon, but your statement isn't very specific.

JacobAsmuth · 2026-05-24T22:47:35 1779662855

Yeah man, obviously. RTX 5090s will almost certainly increase in price over the next two years as memory shortages get worse.

EnPissant · 2026-05-24T18:10:45 1779646245

There was only a very brief time it was selling for MSRP (last fall for $2000). Even if you use that as the previous data point, it's only 200% increased.

no-name-here · 2026-05-25T00:30:10 1779669010

> it's only 200% increased.

If it's 4k instead of 2k msrp, that's a 100% increase.

EnPissant · 2026-05-25T22:44:11 1779749051

You are correct. I should have said "increased to 200%".
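The two ways of stating the change, as a minimal sketch using the $2000 and $4000 figures from this subthread (the function names are just illustrative):

```python
def percent_increase(old: float, new: float) -> float:
    """Change relative to the old value, as a percentage."""
    return (new - old) / old * 100


def percent_of_original(old: float, new: float) -> float:
    """New value expressed as a percentage of the old value."""
    return new / old * 100


print(percent_increase(2000, 4000))     # 100.0
print(percent_of_original(2000, 4000))  # 200.0
```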

johnvanommen · 2026-05-24T18:03:52 1779645832

I believe msrp is $2000 right?

forrestthewoods · 2026-05-24T17:58:11 1779645491

if you can buy one!

The RTX 5090 is faster than an H200. It just has less ram (32GB vs 141GB), doesn't have NVLink, and technically isn't allowed to be used in a datacenter.

The datacenter GPUs sell at an 80% margin. They're incredibly overpriced. But the laws of supply and demand are undefeated and so here we all are.

alphabeta3r56 · 2026-05-24T18:01:33 1779645693

> The RTX 5090 is faster than an H200. It just has less ram

H200 has HBM and much more 64-bit compute

forrestthewoods · 2026-05-24T18:58:04 1779649084

Let me try again.

RTX 5090 has more CUDA cores that run at a higher clock speed. H200 has more RAM and significantly more RAM bandwidth.

Which one is net faster depends on your use case. But you may be very surprised that many workflows are faster on an RTX 5090!
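One common way to reason about "depends on your use case" is a roofline-style estimate: attainable throughput is the smaller of peak compute and bandwidth times arithmetic intensity. The sketch below uses two hypothetical devices with made-up numbers, not the real specs of either card.

```python
def attainable_flops(peak_flops: float,
                     bandwidth_bytes_per_s: float,
                     flops_per_byte: float) -> float:
    """Roofline estimate: limited by compute or by memory traffic."""
    return min(peak_flops, bandwidth_bytes_per_s * flops_per_byte)


# Hypothetical devices (made-up numbers, NOT real specs)
compute_heavy = dict(peak_flops=100e12, bandwidth_bytes_per_s=2e12)
bandwidth_heavy = dict(peak_flops=60e12, bandwidth_bytes_per_s=5e12)

for intensity in (1, 10, 100):  # FLOPs performed per byte moved
    a = attainable_flops(flops_per_byte=intensity, **compute_heavy)
    b = attainable_flops(flops_per_byte=intensity, **bandwidth_heavy)
    winner = "compute_heavy" if a > b else "bandwidth_heavy"
    print(intensity, winner)
```

With these made-up numbers the bandwidth-heavy device wins at low arithmetic intensity and the compute-heavy one wins once the workload does enough math per byte.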