解释器与 Template-based JIT 编译器

假如我们有一段某种编程语言的 stack-based 字节码：

const1   ; 将常量 1 压入栈顶
const3
add
const3
mult
halt

在一个经典的 C 编写的解释器中，我们可能会有一个用类似 switch 的东西实现的循环，来解释执行这段计算 (3 + 1) * 3 的字节码。

enum {
    OP_CONST1,
    OP_CONST2,
    OP_CONST3,
    OP_ADD,
    OP_MULT,
    OP_HALT,
};

void interpret(uint8_t insn[], uint32_t stack[]) {
    int pc = 0;
    uint32_t *st = stack;
next:
    switch (insn[pc++]) {
    case OP_CONST1:
        *st++ = 1; goto next;
    case OP_CONST2:
        *st++ = 2; goto next;
    case OP_CONST3:
        *st++ = 3; goto next;
    case OP_ADD:
        st[-2] = st[-1] + st[-2];
        st--; goto next;
    case OP_MULT:
        st[-2] = st[-1] * st[-2];
        st--; goto next;
    case OP_HALT:
        return;
    }
}

有很多原因导致这段代码性能不佳：

由于 C 标准的原因，switch 会插入额外的条件检查指令
如果在优化后使用了跳转表，我们需要先从栈顶读取一条 op，再用 op 作为索引从跳转表加载一个 native 地址，然后再间接跳转到这个地址。
分支预测的缺陷：字节码中的 op 以不同频率出现，但 switch 只有一个入口处的跳转用于指令 dispatch。这对 CPU 的基于跳转历史的分支预测器是致命的。

为了改进 switch dispatch 的缺陷，可以使用 GNU C 的 label-as-value 和computed-goto 拓展：

不再有额外的范围检查。
改进的分支预测：每条 op 之后都会有一个跳转，而不像 switch 那样在入口处集中跳转。这稍微有利于分支预测器的工作。
然而，我们仍然需要多次内存加载以获取下一条指令地址。

enum {
    OP_CONST1,
    OP_CONST2,
    OP_CONST3,
    OP_ADD,
    OP_MULT,
    OP_HALT,
};

void interpret(uint8_t insn[], uint32_t stack[]) {
    static void *ops[] = {
        &&L_CONST1, &&L_CONST2, &&L_CONST3, &&L_ADD, &&L_MULT, &&L_HALT,
    };
    long pc = 0;
    uint32_t *st = stack;
    goto *ops[insn[pc++]];
L_CONST1:
    *st++ = 1;
    goto *ops[insn[pc++]];
L_CONST2:
    *st++ = 2;
    goto *ops[insn[pc++]];
L_CONST3:
    *st++ = 3;
    goto *ops[insn[pc++]];
L_ADD:
    st[-2] = st[-1] + st[-2];
    st--;
    goto *ops[insn[pc++]];
L_MULT:
    st[-2] = st[-1] * st[-2];
    st--;
    goto *ops[insn[pc++]];
L_HALT:
    return;
}

如果实现了完全体的 direct threading，还可以进一步减少内存读取的次数。概念上而言，这需要将字节码序列从 {OP_CONST1, OP_CONST3, OP_ADD, OP_CONST3, OP_MULT, OP_HALT} 替换为 { &&L_CONST1, &&L_CONST3, &&L_ADD, &&L_CONST3, &&L_MULT, &&L_HALT}。这样我们即可减少一次内存寻址与读取。

有人观察到，在上面的循环中，解释器循环仍让 CPU 承受了巨大的分支预测失败的风险。尽管进行指令 dispatch 的位置从 switch 入口处分散到了每个 op 实现的末尾，但还是很不好进行预测。例如，参考最开始提到的字节码，我们两次使用了 const3 这个 op，然而它们的后继并不相同。这意味着分支预测器可能会在 const3 尾部的 dispatching jump 上频频预测失败。对于一个希望尽量跑的快的程序来说，如此之多的预测失败是令人难以接受的。由于这个问题产生的根本原因是物理机的 PC 和 VM 的 PC 缺乏相关性所导致的，所以我们也将这个问题称作 context problem。

Template-based JIT

也被称作 Template Interpreter，HotSpot VM 就使用了这个说法。

针对经典解释器上出现的问题问题，可以这样想：既然单个 op 的处理代码最后的跳转会由于实际字节码序列中同一 op 出现在不同的指令之前导致预测失败，那我们应该直接为每个不同位置的 op 都按照输入字节码的顺序生成处理代码和跳转的副本。这样可以做到每个 indirect jump 既对应了特定的 native 地址，又对应了虚拟指令的控制流，从而使得 CPU 可以充分利用硬件预测机制。而且，注意到在生成副本后，字节码中线性的控制流之间的 op 我们也不再需要间接跳转了，进一步提高了了性能。现在生成的代码中的跳转恰好对应于字节码控制流中的跳转。

生成处理代码副本的办法很简单，就是提前生成一堆对应体系结构的二进制代码也就是 template，然后运行时简单根据字节码序列的顺序把这些代码粘贴起来。例如，假设 VM 的 stack pointer 始终被放在 rsi 寄存器，我们可以为上面提到的字节码提前生成好以下模板。

OP_CONST1:
    mov     dword ptr [rsi], 1
    add     rsi, 4
----------------------------------------
OP_CONST2:
    mov     dword ptr [rsi], 2
    add     rsi, 4
----------------------------------------
OP_CONST3:
    mov     dword ptr [rsi], 3
    add     rsi, 4
----------------------------------------
OP_ADD:
    mov     eax, dword ptr [rsi-4]
    add     dword ptr [rsi-8], eax
    sub     rsi, 4
----------------------------------------
OP_MULT:
    mov     eax, dword ptr [rsi-8]
    imul    eax, dword ptr [rsi-4]
    mov     dword ptr [rsi-8], eax
    sub     rsi, 4
----------------------------------------
OP_HALT:
    ret

运行时，我们根据字节码，把上面模板拼接起来：

    mov     dword ptr [rsi], 1
    add     rsi, 4
    mov     dword ptr [rsi], 3
    add     rsi, 4
    mov     eax, dword ptr [rsi - 4]
    add     dword ptr [rsi - 8], eax
    sub     rsi, 4
    mov     dword ptr [rsi], 3
    add     rsi, 4
    mov     eax, dword ptr [rsi-8]
    imul    eax, dword ptr [rsi-4]
    mov     dword ptr [rsi-8], eax
    sub     rsi, 4
    ret

这样就生成一段 (相对高效的) 使用本机代码执行字节码语义的代码。为了简单起见我们忽略了分支等操作，一方面这段代码几乎是原本字节码 op 的一对一翻译，仍然离不开对解释器流程的模仿；另一方面由于生成了 native 代码，似乎也属于 Just-In-Time 编译的范畴，所以产生了 Template-based JIT 与 Template Interpreter 两种不同的称谓。

Template-based JIT#

Template-based JIT