Technique: Compile Time Code Generation and Optimization

C++ constexpr is really powerful. In this blog post, we’ll write a compiler that can parse a Brainfuck program given as string literal, and generate optimized assembly instructions that can then be executed at runtime. The best part: we neither have to actually generate assembly nor optimize anything ourselves! Instead we trick the compiler into doing all the hard work for us.

The same technique can be used whenever you want to specify some sort of “program” in a different way and translate it at runtime: regexes, routing tables, etc.

Brainfuck

Brainfuck is a “simple” Turing-complete programming language. When executed, a Brainfuck program modifies an array of bytes via a data pointer, which is controlled by one of eight commands:

>  increment the data pointer
<  decrement the data pointer
+  increment the byte at the data pointer
-  decrement the byte at the data pointer
.  write the byte at the data pointer to standard output
,  read a byte from standard input and store it at the data pointer
[  if the byte at the data pointer is zero, jump forward past the matching ]
]  if the byte at the data pointer is nonzero, jump back to the instruction after the matching [

All other characters are considered comments and are ignored.

For more details (in particular: how do I actually do something useful?!), read the Wikipedia article.

Step 1: A traditional Brainfuck VM

We first build a traditional VM for executing Brainfuck. A program for our VM is an array of instructions:

enum class op
{
    ptr_inc,     // >
    ptr_dec,     // <
    data_inc,    // +
    data_dec,    // -
    write,       // .
    read,        // ,
    jmp_ifz,     // [, jump if zero
    jmp,         // ], unconditional jump
};

template <std::size_t InstructionCapacity>
struct program
{
    std::size_t inst_count;
    op          inst[InstructionCapacity];
    std::size_t inst_jmp[InstructionCapacity];
};

The first six operations directly correspond to the commands; the loop commands have been lowered to jumps, which means we don’t need to scan the input string for the matching [ or ] while executing. The jump target of instruction inst[i] is stored in inst_jmp[i]; it is the index of the destination. For a non-jump instruction, the value in that array is ignored. For example, in the loop [->+<], the [ at index 0 has jump target 6 (one past the ]), and the ] at index 5 has jump target 0.

As we ultimately want to execute a Brainfuck program known at compile-time, I’m using a simple fixed-size array to store the instructions; we always know an upper bound on the size, since in the worst case every character of the input is one instruction.

We can now write an execute() function that takes a program and the data_ptr by using a loop and a switch statement:

template <std::size_t InstructionCapacity>
void execute(const program<InstructionCapacity>& program,
             unsigned char* data_ptr)
{
    auto inst_ptr = std::size_t(0);
    while (inst_ptr < program.inst_count)
    {
        switch (program.inst[inst_ptr])
        {
        case op::ptr_inc:
            ++data_ptr;
            ++inst_ptr;
            break;
        case op::ptr_dec:
            --data_ptr;
            ++inst_ptr;
            break;
        case op::data_inc:
            ++*data_ptr;
            ++inst_ptr;
            break;
        case op::data_dec:
            --*data_ptr;
            ++inst_ptr;
            break;
        case op::write:
            std::putchar(*data_ptr);
            ++inst_ptr;
            break;
        case op::read:
            *data_ptr = static_cast<unsigned char>(std::getchar());
            ++inst_ptr;
            break;
        case op::jmp_ifz:
            if (*data_ptr == 0)
                inst_ptr = program.inst_jmp[inst_ptr];
            else
                ++inst_ptr;
            break;
        case op::jmp:
            inst_ptr = program.inst_jmp[inst_ptr];
            break;
        }
    }
}

What’s left is to parse a string literal into a program. Note that we can use the length of the string literal, which is a compile-time constant, as InstructionCapacity (in the worst-case, each character of the string is one instruction). To implement loops, we can use a stack that remembers the position of the last opened [.

template <std::size_t N>
constexpr auto parse(const char (&str)[N])
{
    program<N> result{};

    std::size_t jump_stack[N] = {};
    std::size_t jump_stack_top = 0;

    for (auto ptr = str; *ptr; ++ptr)
    {
        if (*ptr ==  '>')
            result.inst[result.inst_count++] = op::ptr_inc;
        else if (*ptr ==  '<')
            result.inst[result.inst_count++] = op::ptr_dec;
        else if (*ptr ==  '+')
            result.inst[result.inst_count++] = op::data_inc;
        else if (*ptr ==  '-')
            result.inst[result.inst_count++] = op::data_dec;
        else if (*ptr ==  '.')
            result.inst[result.inst_count++] = op::write;
        else if (*ptr ==  ',')
            result.inst[result.inst_count++] = op::read;
        else if (*ptr == '[')
        {
            jump_stack[jump_stack_top++] = result.inst_count;
            result.inst[result.inst_count++] = op::jmp_ifz;
        }
        else if (*ptr == ']')
        {
            auto open = jump_stack[--jump_stack_top];
            auto close = result.inst_count++;

            result.inst[close] = op::jmp;
            result.inst_jmp[close] = open;

            result.inst_jmp[open] = close + 1;
        }
    }

    return result;
}

Putting it together, we can now parse and execute a Brainfuck program given as string literal:

// `x = std::getchar(); y = x + 3; std::putchar(y);`
static constexpr auto add3 = parse(",>+++<[->+<]>.");

// Use this array for our data_ptr.
unsigned char memory[1024] = {};
execute(add3, memory);

Note that parsing happens entirely at compile-time, but execution at runtime. It is already nice that we can do that!

The generated assembly is straightforward: clang has decided to turn the switch into a lookup table, and the code for each instruction is only a couple of assembly instructions.

If you want to go down that route more, optimizing by adding meta instructions, or JIT compilation, I highly recommend this series by Eli Bendersky.

However, we’re doing something different.

Step 2: Tail recursion

We’re now going to change the way we write the program. It doesn’t really change anything yet, but makes it easier to motivate the next step: turning the iterative version of execute() with the loop into a recursive version. This is done by passing every variable that changes during the loop, i.e. inst_ptr, as an additional argument. We then remove the loop and turn ++inst_ptr; break; into return execute(program, data_ptr, inst_ptr + 1);.

Normally, recursion would be worse than iteration, as it can lead to a stack overflow. However, here we have tail recursion, where the recursive call doesn’t actually need to push a new stack frame, but can simply update the arguments and jump back to the beginning of the function, just like a loop. Of course, the compiler needs to perform that optimization, otherwise deep recursion quickly results in a stack overflow. The clang attribute [[clang::musttail]] can be used to force clang’s hand. It is omitted in the snippet below for readability.

The new execute() function looks like this:

template <std::size_t InstructionCapacity>
void execute(const program<InstructionCapacity>& program,
             unsigned char* data_ptr,
             std::size_t inst_ptr = 0)
{
    if (inst_ptr >= program.inst_count)
        return; // Execution is finished.

    switch (program.inst[inst_ptr])
    {
    case op::ptr_inc:
        ++data_ptr;
        return execute(program, data_ptr, inst_ptr + 1);
    case op::ptr_dec:
        --data_ptr;
        return execute(program, data_ptr, inst_ptr + 1);
    case op::data_inc:
        ++*data_ptr;
        return execute(program, data_ptr, inst_ptr + 1);
    case op::data_dec:
        --*data_ptr;
        return execute(program, data_ptr, inst_ptr + 1);
    case op::write:
        std::putchar(*data_ptr);
        return execute(program, data_ptr, inst_ptr + 1);
    case op::read:
        *data_ptr = static_cast<unsigned char>(std::getchar());
        return execute(program, data_ptr, inst_ptr + 1);
    case op::jmp_ifz:
        if (*data_ptr == 0)
            return execute(program, data_ptr, program.inst_jmp[inst_ptr]);
        else
            return execute(program, data_ptr, inst_ptr + 1);
    case op::jmp:
        return execute(program, data_ptr, program.inst_jmp[inst_ptr]);
    }
}

Here the generated assembly appears to be slightly longer, but otherwise looks the same. This is not surprising, as we haven’t really changed anything for the compiler!

In general, this strategy of implementing a VM can be better than the switch statement version: by repeatedly calling a function that takes the relevant state as arguments, the calling convention ensures that it is kept in registers! Read this blog post to learn more about the technique.

Let’s actually change the generated assembly now.

Step 3: Making it a template

If you look carefully at the tail recursive version, you can make the following observation: in each recursive call, the new value of inst_ptr is either the old value plus a compile-time constant (1), or a value read from the inst_jmp array, which is also computed at compile-time. This means that if we know the value of inst_ptr before executing an instruction at compile-time, we also know its next value at compile-time. In the case of jmp_ifz, there is a branch on a runtime value, but the destination of each branch is fixed.

Furthermore, if we know the value of inst_ptr at compile-time, we don’t need a runtime switch either, as the corresponding instruction in the inst array is also a compile-time constant.

This means we can turn execute(const program&, unsigned char* data_ptr, std::size_t inst_ptr) into a template, where program and inst_ptr are given as template parameters! We can pass the program as a template parameter, as it’s computed at compile-time. And we can also pass inst_ptr as a template parameter, as it’s initially 0 and later only modified by other compile-time constants. Then we replace the switch with if constexpr, and instead of tail recursion, we have tail calls to a different instantiation of the template.

template <const auto& Program, std::size_t InstPtr = 0>
constexpr void execute(unsigned char* data_ptr)
{
    if constexpr (InstPtr >= Program.inst_count)
    {
        // Execution is finished.
        return;
    }
    else if constexpr (Program.inst[InstPtr] == op::ptr_inc)
    {
        ++data_ptr;
        return execute<Program, InstPtr + 1>(data_ptr);
    }
    else if constexpr (Program.inst[InstPtr] == op::ptr_dec)
    {
        --data_ptr;
        return execute<Program, InstPtr + 1>(data_ptr);
    }
    else if constexpr (Program.inst[InstPtr] == op::data_inc)
    {
        ++*data_ptr;
        return execute<Program, InstPtr + 1>(data_ptr);
    }
    else if constexpr (Program.inst[InstPtr] == op::data_dec)
    {
        --*data_ptr;
        return execute<Program, InstPtr + 1>(data_ptr);
    }
    else if constexpr (Program.inst[InstPtr] == op::write)
    {
        std::putchar(*data_ptr);
        return execute<Program, InstPtr + 1>(data_ptr);
    }
    else if constexpr (Program.inst[InstPtr] == op::read)
    {
        *data_ptr = static_cast<unsigned char>(std::getchar());
        return execute<Program, InstPtr + 1>(data_ptr);
    }
    else if constexpr (Program.inst[InstPtr] == op::jmp_ifz)
    {
        if (*data_ptr == 0)
            return execute<Program, Program.inst_jmp[InstPtr]>(data_ptr);
        else
            return execute<Program, InstPtr + 1>(data_ptr);
    }
    else if constexpr (Program.inst[InstPtr] == op::jmp)
    {
        return execute<Program, Program.inst_jmp[InstPtr]>(data_ptr);
    }
}

Note that we don’t even need C++20’s extended NTTP to pass the program as a template argument. As it’s a global variable, we can pass it by reference. This also means that name mangling is a lot nicer, as it’s just an address, and doesn’t need to specify the value of every array member.

Now take a look at the assembly: all the dispatch has disappeared, replaced by “call std::getchar(), add 3, call std::putchar()”! This is possible because we do the dispatch entirely at compile-time: the compiler sees a series of tail calls, which are trivial to fuse together and optimize.

Furthermore, as all accesses into Program’s arrays are compile-time constants, there is no need for Program to appear in the binary at all. This means that there is no additional memory to store the instructions.

Conclusion

While nice, how is this actually useful? We can just write the equivalent behavior in C++ directly, without bothering to parse a Brainfuck program.

However, there are situations where you want to specify something in a different language at compile-time, and have it execute efficiently at runtime. For example, regexes: given a compile-time string literal, we can generate the instructions for our regex VM, and leverage this technique to get efficient code generation for runtime execution. This is basically the way Hana’s CTRE library works. Similarly, I’m currently using it in lexy to generate efficient code for matching against a set of string literals.

Whenever you want to specify something in a DSL at compile-time and have it execute efficiently on dynamic input, this hybrid approach of separating the static program state and the dynamic data, using if constexpr and tail recursion (or just relying on inlining if you don’t have loops) works.