I accidentally wrote a Turing-complete parsing library

25 Nov 2021 by Jonathan

I’m currently working on lexy, a C++ parsing DSL library: you describe how input should be parsed, and lexy generates code for it, taking care of error recovery, parse tree generation, and parse values. Such parser generators are classified based on the expressiveness of the corresponding formal language. For example, a strict regular expression can only parse regular languages, which is a strict subset of a deterministic context free language, and so on.

lexy, being essentially syntax sugar for a recursive descent parser with (manually specified!) arbitrary lookahead but no other state, falls in the latter category. Parsers in that category are unable to parse context-sensitive languages such as XML with matching tags. To handle them, I’ve added support for “context variables”: state that can be modified during parsing.

However, in a recent refactoring of the context variables implementation, I’ve accidentally removed a big limitation, which makes lexy Turing-complete: the parser is thus able to do arbitrary computation while parsing the input.

TL;DR: I’ve written a lexy grammar that is able to execute, not just parse, a simple Turing-complete language.

lexy’s context variables

I’ve added context variables to lexy for my XML parser example: an XML tag has an opening and closing tag, which must be identical:

<open>
  <hello>World</hello> <!-- ok -->
</close> <!-- not ok, not identical -->

To parse this, you need to parse an opening tag, remember what it was, and compare it when you have the closing tag. This is not possible with a traditional context-free grammar. Likewise, you can’t parse something like “n a’s, then n b’s, then n c’s”, as there is no way to remember the count and “read” it two times.

Note that n a’s followed by n b’s is possible by using recursion to remember the count. P = ["a" P "b"].

lexy’s context variables make that possible. For example, dsl::context_counter is essentially an int that can be modified during parsing: we can create it, initializing it to a value, and then increment/decrement it while consuming input. This allows us to parse the language described above:

struct production
{
    static constexpr auto rule = [] {
        // Parse 'a's and count them.
        auto ac = dsl::context_counter<struct ac_id>;
        auto a  = ac.create() + ac.push(dsl::while_(dsl::lit_c<'a'>));

        // Parse 'b's and count them.
        auto bc = dsl::context_counter<struct bc_id>;
        auto b  = bc.create() + bc.push(dsl::while_(dsl::lit_c<'b'>));

        // Parse 'c's and count them.
        auto cc = dsl::context_counter<struct cc_id>;
        auto c  = cc.create() + cc.push(dsl::while_(dsl::lit_c<'c'>));

        // Check that they're equal.
        auto check = dsl::equal_counts(ac, bc, cc);

        return a + b + c + check;
    }();
};

lexy’s grammar is specified in productions, which are classes that have a rule member. This rule describes how it’s parsed; here we’re using an immediately invoked lambda. Read the tutorial if you want to learn more.

This production creates three counters, one for a, one for b, and one for c. We then parse the character repeatedly, while incrementing the counter for every character we encounter. At the end, we assert that they’re all equal.

When I originally implemented context variables, they were local to a single production: all variables created inside a production cannot be accessed outside of it. This made it impossible to combine context variables with recursion.

But during a recent refactoring of the context interface, I moved the storage of context variables to the global control block. This means that they are now available in all child productions!

Without realizing it, I’ve accidentally made lexy grammars Turing-complete. This means lexy can not only parse programming languages, but execute them directly!

If you’re curious about the technical details: the context variables are stored in a type-erased linked-list whose node are stored on the stack and identified by comparing a type id. The code is here.

The WHILE programming language

Let’s consider a simple Turing-complete language, which actually resembles an actual programming language: WHILE. It has (infinite) unsigned integer variables a, b, c, ..., addition/subtraction of constants, and a while loop. That’s enough for Turing-completeness, but let’s give it an if statement as well, for convenience.

It’s EBNF grammar looks like this:

identifier = { letter }
literal    = { digit }

assign = identifier ":=" literal ";"
add    = identifier "+=" literal ";"
sub    = identifier "-=" literal ";"

if    = "if" identifier "{" { stmt } "}" ["else" "{" { stmt } "}"]
while = while identifier "{" { stmt } "}"

stmt = assign | add | sub | if | while

That’s it. Note that you can only assign, add, or subtract constants from variables, not other variables. This makes simple tasks such as a := b quite tedious, but possible:

// a := b using a temporary variable x.
x := 0
while b // while (b != 0)
{
   b -= 1;
   a += 1;
   x += 1;
}
// At this point, a = x = b, but b's value was lost, restore it.
while x
{
    x -= 1;
    b += 1;
}

The above code works, because all variables are unsigned integers.

As mentioned, WHILE is Turing-complete: given infinitely many variables, it can be used to compute everything that can be computed. I won’t prove it here, but to illustrate, here is a program that computes the nth Fibonacci number:

// Input: i, which is the Fibonacci number that should be computed.
// Output: o, which will contain the ith Fibonacci number.
a := 0;
b := 1;

while i {
    i -= 1;

    // c := a;
    c := 0;
    while a { a -= 1; c += 1; }

    // c += b;
    // x := b;
    x := 0;
    while b { b -= 1; c += 1; x += 1; }

    // a := x; (which is the original b)
    a := 0;
    while x { x -= 1; a += 1; }

    // b := c; (which is the original a + b)
    b := 0;
    while c { c -= 1; b += 1; }
}

// o := a;
while a { a -= 1; o += 1; }

Let’s go and execute that with a lexy grammar.

Executing WHILE: Variables

To execute WHILE with lexy, we need to store the values of all variables. As you’ve probably guessed, we’re using dsl::context_counter for that. There are two problems with that approach that we need to solve.

First, the “name” of a context counter is given by a type. If we want N variables, we need to create N types. In particular, we cannot support infinite or user-defined variables, but only a finite set specified in the grammar.

This makes WHILE no longer Turing-complete, but that’s okay: Turing-completeness requires infinite memory, but computers are finite. The limit is fixed but arbitrary, so given enough patience during compilation, we can make it arbitrarily large.

In the code, we’ll use a template for variables:

template <char Name>
struct var_id
{};

template <char Name>
constexpr auto var = dsl::context_counter<var_id<Name>>;

The second issues is the way a dsl::context_counter can be modified: there is .inc()/.dec(), which increments/decrements it by one, and .push()/.pop(), which adds/subtracts the number of character consumed by a rule.

In WHILE, variables are specified in decimal: this means that we first need to (somehow) convert a read a decimal number while executing the matching number of .inc() calls. It’s possible, but incredibly tedious.

A more pragmatic solution is to switch to unary numbers, i.e. Tally marks: then the number N consists of N characters and we can use .push()/.pop() directly.

a := ||||||; // a := 6
b := ;       // b := 0

This obviously doesn’t affect the Turing-completeness of WHILE.

Parsing a number is just as simple as parsing zero-or-more | characters:

struct number
{
    static constexpr auto rule  = dsl::while_(dsl::lit_c<'|'>);
};

Executing WHILE: Variable statements

The three “variable statements” identifier := number, identifier += number, and identifier -= number need to modify a different context counter depending on the variable name. This means we don’t have a single production, but a production template:

template <char Name>
struct var_stmt
{
    // If we start with `Name`, we parse a variable statement.
    static constexpr auto rule = dsl::lit_c<Name> >> …;
};

The actual code uses dsl::keyword instead, to prevent var_stmt<'w'> triggering on while. The >> specifies a branch condition.

The actual body on the statement then needs to modify var<Name> accordingly. Addition and subtraction directly map to .push() and .pop():

auto add = LEXY_LIT("+=") >> var<Name>.push(dsl::p<number>);
auto dec = LEXY_LIT("-=") >> var<Name>.pop(dsl::p<number>);

Assignment is more tricky: we can use .push() only if the variable is currently zero. To reset a variable we use a loop that decrements the counter as long as it’s not zero:

// Note that this does not consume any input.
auto reset = dsl::loop(var<Name>.is_zero() >> dsl::break_
                       | dsl::else_ >> var<Name>.dec());

I could have easily added a .reset() function to a counter, but I didn’t want to add any rules to lexy just for this silly example.

Putting that all together, we have the complete production:

template <char Name>
struct var_stmt
{
    static constexpr auto rule = [] {
        auto reset = dsl::loop(var<Name>.is_zero() >> dsl::break_
                              | dsl::else_ >> var<Name>.dec());

        auto assign = LEXY_LIT(":=") >> reset + var<Name>.push(dsl::p<number>);

        auto add = LEXY_LIT("+=") >> var<Name>.push(dsl::p<number>);
        auto dec = LEXY_LIT("-=") >> var<Name>.pop(dsl::p<number>);

        return dsl::lit_c<Name> >> (assign | add | dec) + dsl::semicolon;
    }();
};

Executing WHILE: If statements

Similar to the variable statements, if statements also need to be templated on the variable name. It calls var<Name>.is_zero() and branches accordingly: if the variable name is zero, we skip over the if and execute the else, if there is any. Otherwise, we execute the if and skip over any else.

The body of an if/else is a list of statements surrounded by curly brackets. To execute that, we simply need to parse them: as seen with the var_stmt, parsing the input will modify the counters accordingly. As lexy has built-in support for list of things surrounded by brackets, this is straightforward:

struct execute_body
{
    static constexpr auto rule
      = dsl::curly_bracketed.opt_list(dsl::recurse<statement>);
};

To skip over the statement without executing it, we can just add separate versions of the productions that just parse them, without touching the counters. Instead, I’ve opted for a different approach: the body consists of a balanced sequence of curly brackets; we just need to discard input until we’ve seen as many opening as closing brackets. This is something dsl::context_counter was actually designed for:

struct skip_body
{
    static constexpr auto rule = [] {
        // A separate counter, just for this production.
        auto bracket_counter = dsl::context_counter<skip_body>;

        // If the counter is zero, we exit the loop.
        auto check_balance
            = dsl::if_(bracket_counter.is_zero() >> dsl::break_);

        // { increments
        auto open  = LEXY_LIT("{") >> bracket_counter.inc();
        // } decrements
        auto close = LEXY_LIT("}") >> bracket_counter.dec();
        // other characters are skipped
        auto skip  = open | close | dsl::ascii::character;

        // Create a bracket counter which is initially zero,
        // and skip until we've reached zero.
        return bracket_counter.create()
                  + dsl::loop(skip + check_balance);
    }();
};

The actual code also handles curly brackets in comments.

An if statement for the variable Name then just selects the correct version based on the value of the variable counter:

template <char Name>
struct if_stmt
{
    static constexpr auto rule = [] {
        // Parse the body and skip over else.
        auto skip_else = dsl::if_(kw_else >> dsl::p<skip_body>);
        auto non_zero = dsl::p<execute_body> + skip_else;

        // Skip the body and parse else.
        auto exec_else = dsl::if_(kw_else >> dsl::p<execute_body>);
        auto zero     = dsl::p<skip_body> + exec_else;

        auto select = var<Name>.is_zero() >> zero
                      | dsl::else_ >> non_zero;

        return LEXY_LIT("if") + dsl::lit_c<Name> + select;
    }();
};

The actual code is slightly different: if_stmt is templated over all possible variable names, skips the common leading if, and then tries to find the matching variable name.

Executing WHILE: While statements

Parsing a while statement is similar to if: we branch on var<Name>.is_zero() and either skip the body or execute it. But after executing the body of the loop we might need to do execute it again!

This means that when we execute the body, we must then rewind the input back to the beginning of the while loop to try again. lexy has dsl::peek() for that: it parses a rule but doesn’t consume the input. However, dsl::peek() does not provide access to context variables!

This isn’t a technical limitation; I could easily change the implementation of dsl::peek() to forward context variables to the inner rules. I just don’t have a reason for it other than supporting a WHILE interpreter. As such, this is the only case where I needed to write a custom rule for the example. lexy_ext::rewind(rule) parses rule with access to context variables and then rewinds the input back to the original implementation.

With that, executing a while statement is straightforward:

template <char Name>
struct while_stmt
{
    static constexpr auto rule = [] {
        // After executing the body, rewind back.
        auto non_zero = dsl_ext::rewind(dsl::p<execute_body>);
        // After skipping the body, we break.
        auto zero = dsl::p<skip_body> + dsl::break_;

        // We repeatedly select execution or skipping.
        auto loop = dsl::loop(var<Name>.is_zero() >> zero
                              | dsl::else_ >> non_zero);

        return LEXY_LIT("while") + dsl::lit_c<Name> + loop;
    }();
};

Executing WHILE: The program

Putting it all together is the statement production that just dispatches to var_stmt, if_stmt, and while_stmt for various variables, and a top-level program production. The latter needs to create all dsl::context_counter objects and parse stmts until the end of file is reached. We then get the value of the o variable and return it as the result.

struct program
{
    // Whitespace is space and C++ style comments.
    static constexpr auto whitespace = dsl::ascii::space
          | LEXY_LIT("//") >> dsl::until(dsl::ascii::newline);

    static constexpr auto rule = [] {
        // Create all variables (initialized to zero).
        auto create = (var<'a'>.create() + …);

        // Parse (and thus execute) a list of statements.
        auto run = dsl::loop(dsl::eof >> dsl::break_
                              | dsl::else_ >> dsl::p<statement>);

        // We produce the value of o (for output).
        auto output = var<'o'>.value();

        return dsl::whitespace + create + run + output;
    }();

    // Parsing `program` produces a value, which is an `int`.
    static constexpr auto value = lexy::construct<int>;
};

The actual code is templated on the list of variables.

Now we can read a file as input and parse it, which will execute the program:

auto file = lexy::read_file(filename);
if (!file)
{
    // error handling
}

// Executes program!
auto result = lexy::parse<program>(file.buffer(),
                                   lexy_ext::report_error);
if (!result)
    return 2;

std::printf("result: %d\n", result.value());

The full code is here:

You need to explicitly build the lexy_example_turing target, which takes a while (15 seconds on my laptop) due to the big number of template instantiations.

Conclusion

Is this useful? Definitely not.

On the contrary, Turing-complete languages are problematic. For example, lexy grammars can create infinite loops, which are now impossible to detect in the general case – thanks to WHILE, it reduces to the Halting problem.

However, lexy already had infinite loops:

// dsl::any always succeeds, so dsl::while_() loops forever.
static constexpr auto rule = dsl::while_(dsl::any);

This is because lexy grammars are not actually declarative: they’re syntax sugar for a hand-written parser where you need to specify exactly how everything is parsed, when and how it should backtrack, and in what order to try alternatives.

Turing-completeness requires the use of dsl::context_* rules, which can be easily audited. In general, it’s a good idea to avoid using them by parsing a more general input and validating it later. This allows better error messages and error recovery.

I didn’t plan for Turing-complete rules, but I’m not going to revert the refactor that introduced it: it’s a much cleaner and simpler implementation now, and I’d have to go out of my way to restore the previous behavior.

If you actually need to do complex stuff during parsing, it’s better to use dsl::scan instead. This rule allows you to parse some productions by hand; see an example here.

Appendix: Prime number tests in WHILE

The following code implements the main loop of a simple prime tester in WHILE. It uses the modified syntax with unary numbers that can be executed by lexy.

// Check whether i > 2 is a prime number.
i := |||||; // 5

// The current divisor we're checking.
d := ||;

// Main loop that checks all divisors.
e := |;
while e {
    // j := i;
    j := ; x := ;
    while i { i -= |; j += |; x += |; }
    while x { x -= |; i += |; }

    // Check whether i is divisible by d.
    // We do that by repeatedly decrementing j and m,
    // setting m to d when it is zero.
    //
    // If j is zero, we exit the loop.
    // m is then only also zero if they've reached zero at the same time,
    // i.e. j is divisible by m.
    m := ;
    while j {
        if m {
            j -= |;
            m -= |;
        } else {
            // m := d;
            m := ; x := ;
            while d { d -= |; m += |; x += |; }
            while x { x -= |; d += |; }
        }
    }

    if m {
        // i is not divisible by d.
        // Check whether d * d > i.

        // j := i;
        j := ; x := ;
        while i { i -= |; j += |; x += |; }
        while x { x -= |; i += |; }

        // b := d;
        // c := d;
        b := ; c := ; x := ;
        while d { d -= |; b += |; c += |; x += |; }
        while x { x -= |; d += |; }

        // a := d * d by doing a += b, c times.
        a := ;
        while c {
            c -= |;

            // a += b.
            x := ;
            while b {
                b -= |;
                a += |;
                x += |;
            }

            // b is now zero, restore it.
            while x {
                x -= |;
                b += |;
            }
        }

        // Determine whether a > j, by subtracting both.
        while a {
            if j {
                j -= |;
                a -= |;
                // Note: this can't set both j and a to zero,
                // as a == j => d divides i.
            } else {
                // j is zero, but a isn't.
                // This means that a > j <=> d * d > i <=> i is prime.
                a := ; // exit inner loop
                e := ; // exit outer loop
                o := |; // set result
            }
        }

        // Increment d by one.
        d += |;
    } else {
        // i is divisible by m, so it is not prime.
        e := ; // exit outer loop
        o := ; // set output to false
    }
}