
until i can see some decent attempt at an implementation (either in an FPGA, or an open-source simulator), where i can see cycle counts for some benchmarks, i will be erring on the side of scepticism, primarily because this CPU design still has some major issues (in my opinion).

compilers should not be deciding instruction scheduling - this is exactly the issue that kills VLIW ISAs (e.g. Intel's Itanium CPU, AMD's Northern Islands and prior GPUs) when used for anything other than DSP-like codes. as much power and die space as an out-of-order scheduler takes, run time is the only time (really) that the order of operations can be determined (in my humble opinion).



Every compiler tries to schedule instructions the best it can, it's just that for a lot of code you can't do a very good job due to uncertainty about memory delays. The Mill has a very interesting but relatively simple way of dealing with load delay uncertainty, which they cover in an earlier lecture.

And really, there are plenty of problems where static scheduling works just fine. Matrix multiplication, signal processing, stuff where the execution flow is regular enough that you can predict memory delays actually runs just as well (and in some cases faster!) on in-order machines where the compiler schedules everything out ahead of time. That's why your cell phone has an in-order VLIW DSP or two along with the ARM cores that run the application code and the GPU. This tends to correlate with floating point code, which is why Intel left the floating point unit on Silvermont in-order while upgrading the integer and memory units to be OoO.
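For the regular, predictable case, the win from compile-time scheduling is easy to see in a toy model. The sketch below is a made-up 1-wide in-order machine with invented latencies (not any real ISA): it counts cycles for the same computation in naive order and in a compiler-hoisted order.

```python
# Toy 1-wide in-order machine: instructions are (dest, sources, latency).
# Issue is strictly in program order; an instruction stalls until all of
# its source registers are ready. Latencies are invented for illustration.

def cycles_in_order(program):
    """Return the cycle at which the last result becomes available."""
    ready = {}                    # register -> cycle its value is usable
    next_issue = 0
    finish = 0
    for dest, srcs, latency in program:
        t = max([next_issue] + [ready.get(s, 0) for s in srcs])
        ready[dest] = t + latency
        next_issue = t + 1
        finish = max(finish, t + latency)
    return finish

# (a*b) + (c*d) with 3-cycle loads and multiplies, 1-cycle add.
naive = [
    ("a", [], 3), ("b", [], 3), ("ab", ["a", "b"], 3),
    ("c", [], 3), ("d", [], 3), ("cd", ["c", "d"], 3),
    ("s", ["ab", "cd"], 1),
]
# Same work, but the compiler hoists all the loads to hide their latency.
scheduled = [
    ("a", [], 3), ("b", [], 3), ("c", [], 3), ("d", [], 3),
    ("ab", ["a", "b"], 3), ("cd", ["c", "d"], 3),
    ("s", ["ab", "cd"], 1),
]
```

With fully predictable latencies the reordered version finishes in 10 cycles instead of 13, and no out-of-order hardware was needed to find that schedule; the whole game is whether the latencies really are predictable.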

I certainly understand skepticism, and want to see this on production silicon before assuming that it'll work as well as they seem to think. But there's a lot of very interesting innovation there and I hope they license their patents to DSP makers even if their processor is a dud for application code.


there are plenty of data parallel problems, but general computer usage isn't really one of them.

another issue - say the Mill is released. when the next version comes, with different latencies, belt lengths, etc, what does that mean for the binaries?

i'm not exactly a fan of x86, but OoO scheduling has enabled the exploitation of instruction level parallelism without requiring changes to code, and similarly, enables other vendors to implement the ISA in various fashions. i.e. an in-order, lower-power single-core CPU with 512k cache can execute the same instructions that a big OoO quad-core CPU with 20MB cache can.

as i said, i'm sceptical.. but i hope my scepticism is misplaced. and the only way we'll find out is when we have a decent implementation of a Mill to benchmark with.


Oh, the fact that the pipeline is exposed means that future changes to the pipeline would either have to break binary compatibility or be intensely painful in some other way. The Mill family they're proposing isn't even binary compatible with itself! There are lots of areas where that doesn't matter so much, but I don't see them ever making major inroads into the PC market for that reason. Or maybe I'm wrong and virtual machines and LLVM are going to eat the world and people won't distribute binaries anymore except for the OS and system libraries, which you might actually be able to recompile for different versions.

I think them succeeding in making a processor that can run a real application really fast is quite plausible. Actual commercial success is much less likely.

EDIT: Well, I could see the Mill challenging MIPS for dominance in routers, say. Or in a Tivo.


A potential solution is to not distribute software in machine code, but in an intermediate form that can easily be translated into machine code for the target architecture. Then you can do this conversion at install time.
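As a sketch of the install-time idea (all opcodes, model names, and the IR here are invented for illustration, not any real encoding): the installer holds one opcode table per target model and lowers the shipped intermediate form to whichever model is present.

```python
# Hypothetical install-time specializer: software ships as a
# target-neutral IR; the installer lowers it to the bit encoding of the
# model actually present. Opcode tables and model names are invented.

PORTABLE_IR = [("load", "r1"), ("load", "r2"), ("add", "r1", "r2")]

ENCODINGS = {
    "model_a": {"load": 0x10, "add": 0x2A},
    "model_b": {"load": 0x07, "add": 0x03},
}

def specialize(ir, model):
    """Map each IR operation to the target model's opcode."""
    table = ENCODINGS[model]
    return [(table[op], *operands) for op, *operands in ir]
```

A real specializer would also reschedule for the target's latencies and resources; the point is only that the translation happens once, at install time, not on every run.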


And that's exactly what we do. Not our idea - the IBM AS/400 line (still sold and widely successful after several total changes to the underlying ISA) does the same.

Not only does the bit encoding change with new Mill models, but each family member across the line has a unique bit encoding, yet any load module runs on any present or future member without rewrite or recompile. Details (and a live demo where I'll ask the audience to create a new instruction, and 20 minutes later will write and execute code using that instruction on the Mill simulator) in the January talk - sign up for announcements at ootbcomp.com/mailing-list.


"run time is the only time (really) that the order of operations can be determined"

For perfect optimization, yes. It's possible that accepting 50% of the performance of the OoO scheduler, by moving to precalculation at compile time or to the OS doing some kind of pseudo-JIT compilation, might save so much latency, thermal budget, and power that you can stuff in more cache or more "whatever", resulting in overall higher system throughput. It's not just the traditional tradeoffs, because you're making a major architectural change.

An FPGA might be unable to model the complete systemic tradeoffs, or more specifically the designer might not model the tradeoffs correctly. The point of the arch might be that real silicon might have space for twice the cache memory (or whatever) with the new arch, but a lazy comparison implementation might have a meg of cache for each to keep the implementation simple, which would miss the point.

It's an interesting idea to think about, regardless of whether it turns out to be correct.


Have you watched all of the lectures? Your questions are answered in them. The Mill is not at all like other VLIW architectures.


I think the real problem is that it isn't clear what makes it different from VLIW architectures aside from esoteric features like this that I have major issues with (other comment: https://news.ycombinator.com/item?id=6874853).


He is talking about the fundamental problem of VLIW architectures. They rely on the static scheduling of programs to enable instruction level parallelism -- which is critical to performance. Scheduling is an NP-complete problem.


Scheduling is the bin-packing problem, which is indeed NP-hard. However, heuristics known for 40+ years get within low single-digit percentages of a perfect schedule, and generally do better than OoO hardware scheduling because static scheduling is not constrained by instruction window size. We didn't invent those heuristics, but they work for us too.
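The textbook form of such a heuristic is critical-path list scheduling: rank each instruction by the latency-weighted height of its path to a sink, then greedily issue the highest-ranked ready instructions each cycle. A minimal sketch of that classic technique (not the Mill's actual scheduler):

```python
# Minimal critical-path list scheduler (the classic greedy heuristic).
# deps maps each instruction to the set of instructions it depends on
# (a dependence DAG); latency maps each instruction to its result
# latency; width is how many instructions may issue per cycle.

def list_schedule(deps, latency, width):
    """Return {instruction: issue cycle} for a width-wide machine."""
    succs = {i: [] for i in deps}
    for i, ds in deps.items():
        for d in ds:
            succs[d].append(i)

    # Priority: longest latency path from this instruction to any sink.
    prio = {}
    def height(i):
        if i not in prio:
            prio[i] = latency[i] + max((height(s) for s in succs[i]),
                                       default=0)
        return prio[i]
    for i in deps:
        height(i)

    done_at, sched = {}, {}
    remaining, cycle = set(deps), 0
    while remaining:
        ready = [i for i in remaining
                 if all(d in done_at and done_at[d] <= cycle
                        for d in deps[i])]
        # Issue the most critical ready instructions first.
        for i in sorted(ready, key=lambda i: -prio[i])[:width]:
            sched[i] = cycle
            done_at[i] = cycle + latency[i]
            remaining.remove(i)
        cycle += 1
    return sched
```

On real dependence DAGs this greedy pass typically lands within a few percent of optimal, which is why the NP-hardness of the exact problem isn't a practical obstacle.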


The harder the problem, the worse it's going to be to calculate it on the fly.


not true. an OoO scheduler has a fixed window size with which to re-order. and as soon as you have data-dependent branches or dependencies, your compiler cannot possibly schedule instructions effectively.
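The window limit is easy to picture with a toy model (invented machine and latencies, not any real core): the hardware may issue out of order, but only from among the next `window` instructions in program order.

```python
# Toy 1-wide machine that issues out of order, but only from the next
# `window` instructions in program order. window=1 degenerates to pure
# in-order; a larger window hides more latency. Latencies are invented.

def cycles_with_window(program, window):
    """program: list of (dest, sources, latency). Return finish cycle."""
    ready = {}                    # register -> cycle its value is usable
    pending = list(program)
    cycle = finish = 0
    while pending:
        for idx, (dest, srcs, lat) in enumerate(pending[:window]):
            if all(ready.get(s, 0) <= cycle for s in srcs):
                ready[dest] = cycle + lat
                finish = max(finish, cycle + lat)
                pending.pop(idx)
                break             # at most one issue per cycle
        cycle += 1
    return finish

# A 3-cycle load, a dependent add, then an independent instruction.
prog = [("x", [], 3), ("y", ["x"], 1), ("z", [], 1)]
```

With window=1 the machine stalls behind the load and finishes at cycle 5; with window=2 it slips the independent instruction into the stall and finishes at cycle 4. A compiler with the whole function in view effectively has an unbounded window, but only for dependences that are knowable at compile time.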


he proposes backwards scheduling from the sink to the sources inside each function. I don't know how costly that is (NP-complete is not that bad when n is low).


He hasn't said how this would be relevant to the Mill.


> They rely on the static scheduling ...

Or run-time profiling.



