I don't think it's humanly possible to do it perfectly. Heuristically, the Architecture manual gets you 95% of the way there.

For really detailed insight into code optimization, the Intel Profiler ($$) gives you a lot of tools for precise instruction scheduling (e.g. an indication of which instructions are stalling during execution of your code, useful analysis of cache miss rates, and which instructions caused those cache misses). ARM also provides a profiler that may do the same for ARM chips, but it is insanely expensive.

You can make do with Linux stochastic profilers, but it may be helpful to have some utility code that dumps the relevant profiling registers for your CPU (e.g. L1, L2, and L3 cache-miss counts, branch-misprediction counts, processor stall counts, &c.). I'm not sure what x86 processors provide, but writing code to dump ARM profiling registers proved to be incredibly useful in a recent profiling and optimization misadventure.
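A rough sketch of what such a counter-dumping utility could look like, with some assumptions stated up front: it targets Linux only, and instead of reading the PMU registers directly (which is what I did on ARM, and which generally needs kernel help to enable) it goes through the kernel's `perf_event_open` syscall. The helper names `counter_open`, `counter_start` and `counter_stop` are made up for illustration.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open one generic hardware counter (e.g. PERF_COUNT_HW_CACHE_MISSES)
 * for the calling thread, on whichever CPU it runs.
 * Returns a file descriptor, or -1 if the kernel refuses
 * (no permission, no PMU in this VM, unsupported event, ...). */
static int counter_open(uint64_t config)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = config;
    attr.disabled = 1;        /* start stopped; enabled explicitly below */
    attr.exclude_kernel = 1;  /* count user-space only */
    attr.exclude_hv = 1;

    /* glibc has no wrapper for this syscall, so call it directly. */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
}

/* Zero the counter and start counting. */
static void counter_start(int fd)
{
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
}

/* Stop counting and return the count, or -1 if the read fails.
 * The descriptor stays open; the caller close()s it when done. */
static long long counter_stop(int fd)
{
    long long count = -1;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    if (read(fd, &count, sizeof count) != (ssize_t)sizeof count)
        return -1;
    return count;
}
```

Typical use would be to open one descriptor per event of interest (`PERF_COUNT_HW_CACHE_MISSES`, `PERF_COUNT_HW_BRANCH_MISSES`, `PERF_COUNT_HW_STALLED_CYCLES_BACKEND`, &c.), call `counter_start` before the code under test and `counter_stop` after it. Whether the open succeeds depends on `/proc/sys/kernel/perf_event_paranoid` and on whether the hardware (or the VM you're in) exposes the counters at all, so treat -1 as "not available here" rather than as a bug.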

Fwiw, unless you're using instructions that don't map well onto high-level languages, it's pretty difficult to beat well-tweaked GCC-generated code by more than a few percent. I imagine LLVM is the same. Unless you're writing code whose welfare depends on being 3% faster than a competitor, it's probably not worth dropping into assembler.

With a bit of tweaking you can even get all the major C/C++ compilers to generate annoyingly good SIMD code, consistently, from non-SIMD C/C++ by encouraging them to auto-vectorize your loops.
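As a minimal illustration of the kind of tweaking I mean (the function and its name are just an example, not anything from a real codebase): a plain scalar loop with `restrict`-qualified pointers and a simple counted loop body is usually enough for GCC or Clang to vectorize at `-O3`. In C++ you'd spell the qualifier `__restrict`, which is a compiler extension rather than standard.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i], written as ordinary scalar C.
 * `restrict` promises the compiler that x and y never overlap, which
 * removes the aliasing doubt that otherwise blocks (or forces a runtime
 * check around) vectorization. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    /* Simple trip count, unit stride, no early exits, no function calls:
     * the shape auto-vectorizers handle best. */
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

To see whether the compiler actually vectorized the loop, ask it: `gcc -O3 -march=native -fopt-info-vec` or `clang -O3 -march=native -Rpass=loop-vectorize` will report which loops were vectorized. Then read the generated assembler anyway, because "vectorized" and "vectorized well" are not the same thing.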

The other way to learn is to do. Profile EVERYTHING with a stochastic profiler. Tweak based on your necessarily limited understanding of the architecture. Profile again to confirm that your optimization actually is valid. Repeat until done.