Try Implementing threaded code interpreters by C language without tail-merging

When we want to write a fast interpreter,
Threaded code
http://en.wikipedia.org/wiki/Threaded_code
is a important technic.
As Mike Pall(auther of
Luajit
http://luajit.org
) said, implementing threaded code is difficult when we use a C language.
http://article.gmane.org/gmane.comp.lang.lua.general/75426
"
* Tail-merging and CSE will happily join all these common tails of
each instruction and generate a single dispatch point. Ick. You
can try to disable some optimizations for this piece of code,
but this will negatively impact all paths.
"

But I might reach a way.

Tail-merging optimization is opposite to inlining optimization.
while inlining replicates common codes, tail-merging gathers common codes.
The border line is determined by the code size.
So, if the common code is small, compilers tend to select inlining than tail-merging.
Especially, if each code block has their own instruction close to tail, only the last "goto label" is left as common code.
Now, I can write such a code for GNU C compiler(gcc).

op_foo:
do_foo(pc, inst);
pc++;
inst = memory[pc];
asm("#foo");
goto *next_op_label;
op_bar:
pc = do_bar(pc, inst);
inst = memory[pc];
asm("#bar");
goto *next_op_label;

Contents in each "asm()" don't have meaning than comments for assembler.
But comments are for human beings completely.
So computers can never understand them and can't determine whether can merge them or not.

Unfortunately, results are vary.
Looking at generated assembler code, there are some register move instructions inserted between "asm("#bar")" and "goto *next_op_label".

.L2: /* op_foo */
... /* operation for op_foo */
#APP
# 142 "test.c" 1
        #foo
# 0 "" 2
#NO_APP
        movl    %eax, %ecx
        .p2align 4,,10
        .p2align 3
.L25: /* merged tail */
        movl    %r9d, %edx
        movl    %r10d, %r9d
        jmp     *%r8
...
.L4: /* op_bar */
... /* operation for op_bar */
#APP
# 95 "test.c" 1
        #bar
# 0 "" 2
#NO_APP
        movl    %eax, %ecx
        jmp     .L25 /* goto merged tail! */

These instructions bloat the tail code.
I doubt this is the reason why inlining don't occur.
To remove these moves, I added register usage extention.

resister int pc asm("ebx"), inst asm("ecx");
...
op_bar:
pc = do_bar(pc, inst);
inst = memory[pc];
asm("#bar");
goto *next_op_label;

Register usage extension needs machine specific register names, so hurts portability.
Or, anyway I have to check whether tail-merging occur or not, which require me some machine specific knowledges.
But gcc takes over rest knowledges from me - select and schedule machine instructions.
For linear code such as each block of threaded code, compilers still make almost best codes.

With explicit register allocation for some local variables and "asm(#unique_comment)" followed immediately by computed gotos, C compliler can make a good threaded code.
Maybe Make still beat them, but I can't with assembler.

<see also>
nominolo's Blog: Implementing Fast Interpreters: Discussion & Further Reading
http://nominolo.blogspot.jp/2012/07/implementing-fast-interpreters_31.html

一人一党党

一人一人の、一人一人による、一人一人のための政治制度を!

Try Implementing threaded code interpreters by C language without tail-merging