Run-ahead transfer prediction in the Mill CPU architecture (ootbcomp.com)
63 points by signa11 on Dec 9, 2013 | hide | past | favorite | 39 comments


The point of the page is that the Mill CPU (of which I know almost nothing; I don't think I had heard of it before) does not do branch prediction; instead it focuses on the memory traffic and does transfer prediction:

[...] it predicts transfers rather than branches [...]

So the title of the post is annoyingly misleading, and I would really appreciate it if someone could edit it.


Did you watch the video of the talk? The wording in the summary is a bit misleading.

The Mill does exit prediction on extended basic blocks -- it predicts which branch will be taken rather than if a particular branch will be taken, and it can chain these predictions.


I've gotta say, organizing the code in terms of EBBs instead of calls makes a lot of the features in earlier videos make a lot more sense. I have no idea how they'll handle interrupts or avoid having to deal with them, but I'm sure they have a story for that. Really gotta re-watch the memory video now.

Hand coding assembly for this would be really painful, but we already knew this with the belt. The Mill really seems to have been designed with compilers in mind.


Sure was! I write compilers; much better to have the hardware team do the work :-)

Mill interrupts are nothing but involuntary function calls, and handlers are perfectly ordinary functions. There are no privileged operations; all security is memory based - an upcoming talk will cover that. ootbcomp.com/mailing-list to get talk announcements.


until i can see some decent attempt at an implementation (either in an FPGA, or an open-source simulator), where i can see cycle counts for some benchmarks, i will be erring on the side of scepticism, primarily because this CPU design still has some major issues (in my opinion).

compilers should not be deciding instruction scheduling - this is exactly the issue that kills VLIW ISAs (e.g. Intel's Itanium CPU, AMD's Northern Islands and prior GPUs) when used for anything other than DSP-like codes. for all the power and die space an out-of-order scheduler takes, run time is the only time (really) that the order of operations can be determined (in my humble opinion).


Every compiler tries to schedule instructions the best it can, it's just that for a lot of code you can't do a very good job due to uncertainty about memory delays. The Mill has a very interesting but relatively simple way of dealing with load delay uncertainty, which they cover in an earlier lecture.

And really, there are plenty of problems where static scheduling works just fine. Matrix multiplication, signal processing -- stuff where the execution flow is regular enough that you can predict memory delays -- actually runs just as well (and in some cases faster!) on in-order machines where the compiler schedules everything out ahead of time. That's why your cell phone has an in-order VLIW DSP or two along with the ARM cores that run the application code and the GPU. This tends to correlate with floating-point code, which is why Intel left the floating-point unit on Silvermont in-order while upgrading the integer and memory units to be OoO.

I certainly understand skepticism, and want to see this on production silicon before assuming that it'll work as well as they seem to think. But there's a lot of very interesting innovation there and I hope they license their patents to DSP makers even if their processor is a dud for application code.


there are plenty of data parallel problems, but general computer usage isn't really one of them.

another issue - say the Mill is released. when the next version comes, with different latencies, belt lengths, etc, what does that mean for the binaries?

i'm not exactly a fan of x86, but OoO scheduling has enabled the exploitation of instruction-level parallelism without requiring changes to code, and similarly enables other vendors to implement the ISA in various fashions. i.e. an in-order, lower-power single-core CPU with 512K cache can execute the same instructions that a big OoO quad-core CPU with 20MB cache can.

as i said, i'm sceptical.. but i hope my scepticism is misplaced. and the only way we'll find out is when we have a decent implementation of a Mill to benchmark with.


Oh, the fact that the pipeline is exposed means that future changes to the pipeline would either have to break binary compatibility or be intensely painful in some other way. The Mill family they're proposing isn't even binary compatible with itself! There are lots of areas where that doesn't matter so much, but I don't see them ever making major inroads into the PC market for that reason. Or maybe I'm wrong, and virtual machines and LLVM are going to eat the world, and people won't distribute binaries anymore except for the OS and system libraries, which you might actually be able to recompile for different versions.

I think them succeeding in making a processor that can run a real application really fast is quite plausible. Actual commercial success is much less likely.

EDIT: Well, I could see the Mill challenging MIPS for dominance in routers, say. Or in a Tivo.


A potential solution is to not distribute software in machine code, but in an intermediate form that can easily be translated into machine code for the target architecture. Then you can do this conversion at install time.


And that's exactly what we do. Not our idea - the IBM AS/400 line (still sold and widely successful after several total changes to the underlying ISA) does the same.

Not only does the bit encoding change with new Mill models, but each family member across the line has a unique bit encoding, yet any load module runs on any present or future member without rewrite or recompile. Details (and a live demo where I'll ask the audience to create a new instruction, and 20 minutes later will write and execute code using that instruction on the Mill simulator) in the January talk - sign up for announcements at ootbcomp.com/mailing-list.


"run time is the only time (really) that the order of operations can be determined"

For perfect optimization, yes. It's possible that accepting 50% of the performance of the OoO scheduler, by moving to precalculation at compile time or to the OS doing some kind of pseudo-JIT compilation, might save so much latency, thermal budget, and power that you can stuff in more cache or more "whatever", resulting in overall higher system throughput. It's not just the traditional tradeoffs, because you're making a major architectural change.

An FPGA might be unable to model the complete systemic tradeoffs, or more specifically the designer might not model the tradeoff correctly. The point of the arch might be that real silicon has space for twice the cache memory (or whatever) with the new arch, but a lazy comparison implementation might have a meg of cache for each to keep the implementation simple, which would miss the point.

It's an interesting idea to think about, regardless of whether it turns out to be correct.


Have you watched all of the lectures? Your questions are answered in them. The Mill is not at all like other VLIW architectures.


I think the real problem is that it isn't clear what makes it different from VLIW architectures, aside from esoteric features like this that I have major issues with (other comment: https://news.ycombinator.com/item?id=6874853).


He is talking about the fundamental problem of VLIW architectures: they rely on static scheduling of programs to enable instruction-level parallelism -- which is critical to performance. Scheduling is an NP-complete problem.


Scheduling is the bin-packing problem, which is indeed NP-complete. However, long-known (40+ years) heuristics get within low single-digit percentages of perfect, and generally do better than OoO hardware scheduling because static scheduling is not constrained by instruction window size. We didn't invent those heuristics, but they work for us too.


The harder the problem, the worse it's going to be to calculate it on the fly.


not true. an OoO scheduler has a fixed window size with which to re-order. and as soon as you have data-dependent branches or dependencies, your compiler cannot possibly schedule instructions effectively.


he proposes scheduling backwards from the sink to the sources inside each function. I don't know how much complexity that is (because NP-complete is not that bad when n is low)


He hasn't said how this would be relevant to the Mill.


> They rely on the static scheduling ...

Or run-time profiling.


I found it interesting that in a presentation given in 2013, the (only) comparison to a competitor's processor was the Pentium 4 (Prescott) from 2004(!).


He's just using the P4 because it has the longest mispredict penalty of any recent mass-produced CPU. I'd guess that the Mill is going to be clocked fairly slow given the cycle latency numbers I'm seeing, so I wouldn't be surprised if the Mill's 5-cycle mispredict penalty came out to the same time delay as a 10-cycle penalty on a Sandy Bridge.


It was given for dramatic effect; as he adds, the Prescott had an extremely long pipeline.

It's not really relevant to compare to other more modern processors, because those both exist and have a traditional architecture. The Mill both does not exist and does not have a traditional architecture (at least in the prediction regard).


Are there architectures where a programmer can just use a specific opcode to inform the cpu what's most likely?


There were some, at least at some points. However, they cannot generally be used, and I don't think they are included in current CPUs. You can, however, give hints to GCC, and GCC should convert them to those opcodes if they exist. You do this by using "__builtin_expect".


Yes, the Pentium 4 had such instructions, but it turns out they are mostly useless, because a modern dynamic branch predictor can a) quickly learn any branch and adapt quickly, b) use simple metaheuristics like a loop detector or BTFN (backward taken, forward not taken -- backward paths are more likely to be taken), and c) adaptively change the prediction based on the outcome of previous branches.

Any static prediction will do worse in case (a) on any branch whose behavior changes, will do much worse in case (c), and will always do worse in case (b) if the branch occurs in a loop. It doesn't matter how 'smart' the static predictions are or whether they come from some guided optimization or otherwise; static predictions are worse in the vast majority of cases.

So when he talks about how he is going to build branch prediction into the ISA and load it asynchronously from memory, this simply isn't going to improve performance. I might also point out that the claim that context switching hurts branch prediction performance results from the changed behavior of the branches in a new context, not from the processor being unable to reuse old predictions.

I think the Mill design isn't feasible, and this is a good example of one problem: a massive new hardware investment in a CPU unit that caches and manipulates predictions in main memory, adding complexity to the internal pipeline, and that ends up performing demonstrably worse (in most cases).


Looks like we need to add some clarifying slides to this presentation :-(. I'll do my best here.

Prediction is not in the instruction set; there are no "likely taken" flags, opcodes, or the like. You are correct that machines with these features have found them useless, because compilers cannot predict very well. That's why we use (saved) dynamic experience rather than static prediction.

In all machines, task change kills prediction behavior because the new task uses the prediction hardware and table for its own code, overwriting the learned predictions that were present before the change. When control switches back to the original task, the table contents and other state are no longer what they had been, and everything has to be retrained again. Training is costly due to mispredictions.

The predictions that get loaded are those from previous executions that were streamed out to memory, post-processed, and saved in the load module. They are not as good as up-to-the-cycle dynamic predictions, because programs have phases and the saved predictions can only capture the overall experience; still, they are much better than static prediction, or no prediction at all, which is the situation at program initiation or after a different task has scrubbed the tables.

Lastly, I agree that your proposed design isn't feasible. That's why the Mill doesn't work that way.

Ivan


There is a gcc extension that doesn't necessarily inform the CPU on x86/64, but it does restructure code so that the unlikely branch doesn't pollute the cache:

        #define likely(x)       __builtin_expect ((x), 1)
        #define unlikely(x)     __builtin_expect ((x), 0)
The machine code for the unlikely branch will be moved off into an unrelated area of memory, while the likely code will stay located with the rest of your function's code.


Yes, it's called static branch prediction. I'm not sure how popular it is in modern ISAs, but MIPS and SPARC had such instructions.


The Pentium 4 introduced prefix opcodes that hinted that the next branch would be likely or unlikely to be taken. These are still legal, but afaik post-NetBurst CPUs ignore them, and Intel's optimization manuals have dropped mention of them.


I think those instructions are marked as obsolete in the MIPS ISA manuals.


When a new branch is found, the CPU will predict it based on some simple and documented heuristics. It's been a while since I checked, but IIRC it boils down to: backward jumps are predicted to be taken (think loops) while forward jumps are predicted not to be taken.

When you give your compiler hints about what to expect, using either profiling data or explicit hints like __builtin_expect, the compiler will generally try to rearrange the code to be favourable for the CPU's branch prediction.


it's better to let the compiler decide (with gcc, you build a profiling version, run it, then rebuild the code with the profiling results). unless you have something unusual and have some deep insight into your code (unlikely), you're not going to do better than the compiler.


I find it strange you find it so unlikely that people have that much insight into their own code. The code I work on has several instances of likely/unlikely in the main path for cases that we know happen either once, rarely, or when we are moving off the optimized path.


Have you tried benchmarking it against a non-annotated compiled version?


I have not, but I believe those who originally wrote it did.


i've never come across code that has performed better with manual hints than with compiler profiling.

at best, you can match the compiler, but i find that improbable unless you have only a handful of branches.


These are situations where the branch will be taken, literally, millions of times in one direction, and once in the other.


There are many DSP systems that have this (also, lots of attention to prefetching the right stuff). Compilers can pretend to get it right for you, or you can just dive into the assembly if you want to save money and buy a slower CPU.



