I'd say the critical question really isn't around speed of an interpreter, since if you care about speed you're using something like a JIT compilation model to actual machine code anyway. A more important question, if you want the VM to become more widely useful, is to make the instruction set easy to compile to.
A big advantage of a stack-based architecture (in my experience) is that it's simpler to write a compiler for because you don't have to worry about register allocation. My compiler writing experience is a bit limited (I've done some work targeting the JVM, and some very limited MIPS assembly coding), but I've found that not having to track register state is a very helpful simplification.
Some VMs (LLVM and Parrot, that I know of) give you an arbitrary number of virtual registers, which are then mapped to physical registers by the compiler, so allocation is not an issue.
Nearly ALL register-based IRs give you an infinite number of registers. Register allocation is not done until absolutely the last minute. This alone is one good reason to favor register-based machines. Tack a register allocator and a basic dependence-analysis phase onto the VM and you have an instant JIT "executor": interpreter + compiler + runtime all in one convenient binary.
It's still harder to write compilers for them. Stack-based forms are easy: you push the data onto the stack and pop it again when you need it, and the VM's compiler handles the heavy lifting. With a register form, specifically one with an infinite number of registers, you are effectively doing SSA in your compiler to map values onto 'registers' (SSA variables) and writing out the raw intermediate form. Stack-based forms are simply easier to write front-end compilers for.
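To illustrate how little bookkeeping a stack-targeting front end needs, here is a minimal sketch (the AST shape and opcode names are invented for illustration, not taken from any particular VM): code generation for an expression tree is just a recursive walk that pushes operands, with no register state to track.

```python
def compile_expr(node, out):
    """Compile ('const', n) / ('add', l, r) / ('mul', l, r) trees to stack code."""
    op, *args = node
    if op == "const":
        out.append(("PUSH", args[0]))   # operand goes on the stack
    else:
        compile_expr(args[0], out)      # left operand ends up on the stack
        compile_expr(args[1], out)      # right operand lands on top of it
        out.append(("ADD",) if op == "add" else ("MUL",))  # pops 2, pushes 1

code = []
compile_expr(("add", ("mul", ("const", 2), ("const", 3)), ("const", 4)), code)
# code is now: PUSH 2, PUSH 3, MUL, PUSH 4, ADD
```

The register-form equivalent would have to name a fresh destination for every intermediate result, which is exactly the SSA-style bookkeeping described above.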
This had some interesting information, but it's meaningless. If you care about performance, you won't be using interpretation in your VM, and code that's quick to interpret is almost never code that's quick when compiled. The challenges are simply different when you're dealing with compiled VM code.
"meaningless" is probably a bit harsh--usually there is a very good reason to use interpretation in the first place: ease of implementation and portability. you get both things for free when using an interpreter but have to take care when implementing a compiled system (even gets worse with a just-in-time compiler [needs interpreter AND compilation stuff]).
Yes, interpretation has strengths, but those go away when you try to do 'optimized' interpretation. Interpretation optimizations, particularly in register forms, are not only difficult to implement but effectively non-portable: they'll run everywhere, but the speedups won't carry over, because all architectures are not created equal.
If you care about performance, going the compiler route is the only way. In addition, JIT compilers do not need an interpreter as well; it's just often done that way to offset the cost of compilation for routines that aren't hit often.
concerning your second point: you're absolutely right, there are a couple of direct compilation systems (without interpreters). for example, the cacao jvm for dec alphas (afaik) used to be a jit compiler without an interpreter (but they added one afterwards)
the paper is a very interesting read; for those who are too lazy or just looking for the gist, here it is:
instead of using the prevalent stack-based interpreter architecture, register-based interpreters need fewer instructions, since all those load & store instructions are no longer necessary. however, the total amount of information cannot shrink, so it is encoded as quadruple code instead, i.e., the tuple (opcode, destination register, source register 1, source register 2). quadruple code requires more space, so the bytecode binaries become bigger, but the code needs fewer instruction dispatches than its stack-based counterpart.
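the dispatch-count trade-off is easy to see in a toy quadruple interpreter (all opcode and register names below are made up for illustration): computing 2*3+4 costs just two dispatches here, versus five (push, push, mul, push, add) on a stack machine, at the price of wider instructions.

```python
def run_quads(quads, regs):
    """Execute (opcode, dest, src1, src2) quadruples against a register file."""
    for op, d, s1, s2 in quads:        # one dispatch per quadruple
        if op == "mul":
            regs[d] = regs[s1] * regs[s2]
        elif op == "add":
            regs[d] = regs[s1] + regs[s2]
    return regs

# registers r0..r2 pre-loaded with constants; compute r3 = r0*r1 + r2
regs = {"r0": 2, "r1": 3, "r2": 4}
run_quads([("mul", "r3", "r0", "r1"),
           ("add", "r3", "r3", "r2")], regs)
# regs["r3"] == 10, after only two dispatch iterations
```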
NOTE: this paper implements the optimization for the jvm, but lua uses a register based architecture, too! (AFAIR google's dalvik [of android fame] uses a register based approach, too--probably to save energy [since dispatches in interpreters require indirect branches, which are quite expensive])
Can't stack-based bytecode languages get compiled to code that doesn't have lots of load & store instructions, by using registers to represent the top several words of the stack?
You could use that route, but it's easier to just convert to an SSA form and allocate registers intelligently when you compile the code. Compiling stack-based code, even in an optimized way, is very simple.
They can, but now you have register spilling, register allocation, and your instruction sizes aren't significantly smaller, since you still refer to registers.
All in all, it's a bad idea unless you're doing it in hardware (where you simply can't have enough registers due to cost [in money and die area] issues)
yeah, i think i know what you mean, there are two ways:
1) explicit top of stack elements (e.g. the a-stack architecture of ocaml always keeps the TOS element in a register)
2) implicit top of stack element handling; the technique is called "stack caching" and the paper to read there is from ertl in 1995.
PS: by jit compiling, this redundant stack traffic can easily be eliminated.
PPS: the points i mentioned are only "easily" implementable when your host programming language supports primitive types (such as ints, longs, floats, etc.). whenever you are dealing with "objects" (i.e. pointers to structs) you have to do (un-)boxing which lessens the advantage of stack caching...
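idea (1) above, keeping the top-of-stack element in a register, can be sketched roughly like this (here the "register" is just a local variable `tos`; opcode names are invented, and the sketch assumes values are never None, which serves as the empty-cache sentinel). binary ops then need only one memory pop instead of two pops and a push.

```python
def run(code):
    """Interpret stack bytecode with the top-of-stack cached in a local."""
    stack = []
    tos = None                       # cached top-of-stack "register"
    for ins in code:
        op, *arg = ins
        if op == "PUSH":
            if tos is not None:
                stack.append(tos)    # spill the old TOS to memory
            tos = arg[0]
        elif op == "ADD":
            tos = stack.pop() + tos  # one pop; result stays in the cache
        elif op == "MUL":
            tos = stack.pop() * tos
    return tos

# run([("PUSH", 2), ("PUSH", 3), ("MUL",), ("PUSH", 4), ("ADD",)]) == 10
```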