Implementation Challenge: Lossless, compact parse tree with iterative traversal

28 Apr 2021 by Jonathan

My parser combinator library lexy was originally designed to parse some grammar into a user-defined data structure, comparable to Boost.Spirit. This is ideal for parsing simple “data” grammars like JSON or email addresses, and also works for parsing programming languages: simply parse into your AST. However, by design lexy::parse() will only forward data explicitly produced by the parsing combinators which does not include punctuation, comments, or whitespace.

Inspired by matklad’s blog post about modern parser generators, I’ve decided to add a way to retain all the information and produce a lossless parse tree by calling lexy::parse_as_tree(). This requires no changes to your existing grammar and simply switches the output. With that, I could also add an online playground that visualizes the parse tree of a given grammar on the given input.

Implementing the actual code that produces a parse tree during parsing wasn’t too hard – I’ve already had a handler that controls what happens during parsing to implement lexy::match() and lexy::validate(). The challenging part was the actual data structure for storing a parse tree: it should be memory-efficient, as it can be big, and users should be able to easily iterate over every node without requiring recursion.

The Baseline

Fundamentally, lexy::parse_tree is a structured view over the original input, which must be kept alive. It is an m-ary tree containing two kinds of nodes: tokens and production. A token node is a leaf node of the tree and stores a span of the input together with the token kind (essentially an enum). Iterating over all the tokens in the tree and concatenating their spans yields back the original input (the tree is lossless). A production node is a non-leaf node: its children are other production nodes or tokens; the node itself only stores an identifier of the production.

In C++, it looks like this (simplified):

template <typename Iterator, typename TokenKind>
struct pt_node_token
{
    // The span of the input occupied by the token.
    Iterator begin, end;
    // What kind of token it is.
    TokenKind kind;
};

template <typename Iterator, typename TokenKind>
struct pt_node_production
{
    // In the grammar, productions are identified by their type.
    // Here we type-erase that to the name of the productions
    // (two productions are equal iff the name has the same address).
    const char* production_name;
    // The children are either other production nodes or tokens.
    std::vector<std::variant<pt_node_production<Iterator, TokenKind>,
        pt_node_token<Iterator, TokenKind>> children;
};

The fact that there are two different nodes is an implementation detail; the actual parse tree hides them away:

template <typename Iterator, typename TokenKind>
class parse_tree
{
    using node_p = pt_node_production<Iterator, TokenKind>;
    using node_t = pt_node_token<Iterator, TokenKind>;

public:
    // A view over an actual node.
    class node
    {
    public:
      // Returns something that can be compared to a token kind,
      // with functions like `.is_token()`, `.is_production()` etc.
      auto kind() const;

      // A range that iterates over the children of the node.
      // For a token node, this is the empty range.
      auto children() const;

      // A range that iterates over the span of a token in the input.
      // For a production node, this is empty.
      auto lexeme() const;

    private:
        // The pointer to the implementation.
        // This is okay, as the tree is immutable,
        // so once we've parsed everything, the address won't change.
        std::variant<node_p*, node_t*> _ptr;
    };

    node root() const
    {
        return node{&_root};
    }

private:
    // The tree stores the root node, which owns all children.
    pt_node_production<Iterator, TokenKind> _root;
};

The full interface of lexy::parse_tree is documented here. A complete example that parses some input into a parse tree and then prints it out is on Compiler Explorer.

While this basic design would certainly works, it has a couple of issues:

Usability: the nodes don’t have a pointer to their parent. This also means it is impossible to iterate over all siblings of a node.
Memory Efficiency: sizeof(pt_node_token) is 3 * sizeof(void*), sizeof(pt_node_production) is 2 * sizeof(void*) + 3 * sizeof(void*), and std::variant<pt_node_token, pt_node_production> is 6 * sizeof(void*) (size of largest variant + tag rounded up to pointer size). A parse tree contains a lot of nodes, so those 48 bytes can add up (and they don’t even include a parent pointer).
Memory Allocations: most production nodes have only a couple of child nodes, so we’re doing many small vector allocations.
Recursion: if you want to iterate over all descendants of a node, you need to do a DFS, which requires recursion (or a separate stack).

We will tackle all of those problems to create an optimized implementation that requires 3 * sizeof(void*) per node, which includes a way to access the parent, does allocations in multiples of 4 KiB and can be traversed by simply following pointers without recursion.

Step 1: Compressing tokens

Currently, pt_node_token stores two iterators, which are pointers for most inputs, and a TokenKind, which is an enum. By default, enum’s are int, which allows for 4 billion different token kinds. This is overkill, so let’s use a std::uint_least16_t instead: 65536 different tokens should be enough for everybody. Then we also don’t need the TokenKind template parameter – the higher-level node is still (indirectly) templated and can do the casts for us.

template <typename Iterator>
struct pt_node_token
{
    Iterator begin, end;
    std::uint_least16_t kind;
};

Note that sizeof(pt_node_token) is still 24 bytes, but we only want to store two pointers and 16 bits! Let’s fix that.

If we restrict ourselves to random access iterators, we don’t need to store two iterators to define a range: we can store an iterator and a size instead. A token is mostly small: there are many single character tokens or short keywords like int. The longest tokens are string literals, but even those are unlikely to exceed the four gigabyte limit of a 32 bit integer:

template <typename Iterator>
struct pt_node_token
{
    Iterator begin;
    std::uint_least32_t size;
    std::uint_least16_t kind;
};

Now a token is only 2 * sizeof(void*), but parse_tree::node can still reconstruct the same information.

The actual code supports non-random-access iterators by conditionally switching between [begin, end) and begin, size at the cost of space.

Step 2: A compressed node pointer type

The final design will need lots of pointers to nodes. In the baseline, they’re expressed as std::variant<node_p*, node_t*>; let’s create a separate type for it:

template <typename Iterator>
class pt_node_ptr
{
    void* _ptr;
    bool _is_token;

public:
    pt_node_ptr()
    : _ptr(nullptr), _is_token(fale)
    {}

    void set(pt_node_token<Iterator>* ptr)
    {
        _ptr = ptr;
        _is_token = true;
    }
    void set(pt_node_production<Iterator>* ptr) { … }

    auto token() const
    {
        return _is_token
          ? static_cast<pt_node_token<Iterator>*>(_ptr) : nullptr;
    }
    auto production() const { … }
};

The reason for the unorthodox .set() instead of e.g. a constructor, will become apparent later on.

pt_node_ptr is essentially the same as the variant, but instead of a union we’re using a void*. This hasn’t bought us anything, but now we can optimize by realizing something about the possible values of _ptr: it is either null, in which case we don’t care, or it points to a token or production node, which have a certain alignment!

Both pt_node_token and pt_node_production store pointers, which have an alignment of 8 on a 64 bit system. This means that every address that is a valid for a node must be a multiple of 8. In binary, addresses that are a multiple of 8 end in three zeroes.

So while we need a 64 bit pointers, we always know three bits of the pointer value: the last ones will be zero. This is more than enough to store a boolean!

template <typename Iterator>
class pt_node_ptr
{
    std::uintptr_t _address;

    explicit pt_node_ptr(void* ptr, unsigned type)
    : _value(reinterpret_cast<std::uintptr_t>(ptr))
    {
        // Assert that the alignment is correct.
        assert((_value & 0b1) == 0);
        // Set the type.
        _value |= (type & 0b1);
    }

public:
    // Pointers to a token have the last bit set to zero.
    static constexpr auto type_token      = 0b0u;
    // Pointers to a production have the last bit set to one.
    static constexpr auto type_production = 0b1u;

    pt_node_ptr() : pt_node_ptr(nullptr, type_token) {}

    void set(pt_node_token<Iterator>* ptr)
    {
        *this = pt_node_ptr(ptr, type_token);
    }
    void set(pt_node_production<Iterator>* ptr)
    {
        *this = pt_node_ptr(ptr, type_production);
    }

    unsigned type() const
    {
        // The type is given in the last bit.
        return _address & 0b1;
    }
    void* base() const
    {
        // The pointer value has the last bit set to zero.
        auto cleared = _address & ~std::uintptr_t(0b1);
        return reinterpret_cast<void*>(cleared);
    }

    auto token() const
    {
        return type() == type_token
          ? static_cast<pt_node_token<Iterator>*>(base()) : nullptr;
    }
    auto production() const { … }
};

Now we’re having a void* plus tag without space overhead!

For a more general implementation of this pattern, check out LLVM’s PointerUnion.

Step 3: Stack allocation

At this point we’re having a space-efficient way to point to nodes in the tree, so we could just go ahead and add a parent pointer to every node. This wouldn’t work, however. While creating a production node for example, we’re repeatedly pushing its children into the std::vector, which needs to reallocate at some point. Upon reallocation, the memory address of all elements changes, which is problematic if one element is a finished production node whose children point back to it.

So we need a way to provide stable addresses for nodes. One easy way is to switch from std::vector<std::variant<pt_node_production<Iterator>, pt_node_token<Iterator>>> to std::vector<std::unique_ptr<std::variant<...>>>>>>>>. But this wouldn’t help our memory allocation situation.

Instead, I’ve used a stack allocator: we allocate big (4 KiB) memory blocks and use that memory for all our nodes. As we don’t free nodes until the entire tree gets destroyed, we can allocate by simply advancing a pointer. Once we’ve reached the end of our block, we allocate a new one and store it in a linked list of blocks.

class pt_buffer
{
    static constexpr std::size_t block_size = 4096 - sizeof(void*);

    // One of the memory blocks.
    struct block
    {
        // Pointer to the next block;
        // for the active block it is nullptr.
        block* next;
        // The actual memory we're re-using.
        unsigned char memory[block_size];

        static block* allocate()
        {
            auto memory = ::operator new (sizeof(block));
            auto b = ::new (memory) block;
            b->next = nullptr;
            return b;
        }
    };

    // The initial block in our linked list of blocks.
    // It only matters in the destructor,
    // where we walk all the blocks.
    block* _head;

    block* _cur_block;
    unsigned char* _cur_pos;

public:
    // Appropriate constructors/destructors.
    // Destructor releases everything by deallocating all blocks.

    // Reserves space by allocating a new block if necessary.
    void reserve(std::size_t size)
    {
        auto remaining_capacity
          = (_cur_block->memory + block_size) - _cur_pos;
        if (remaining_capacity < size)
        {
            // Allocate new block and add to list.
            auto next = block::allocate();
            _cur_block->next = next;
            // Update current block and memory.
            _cur_block = next;
            _cur_pos = &_cur_block->memory[0];
        }
        // Now we can safely use [_cur_pos, _cur_pos + size)
    }

    // Creates an object of the given type.
    // Space must have been reserved before.
    template <typename T, typename... Args>
    T* allocate(Args&&... args)
    {
        // Sanity checks omitted:
        // T needs to have pointer alignment
        // (otherwise we need to take care of alignment),
        // and reserve() must have been called.

        // The memory is [_cur_pos, _cur_pos + sizeof(T)).
        auto memory = _cur_pos;
        _cur_pos += sizeof(T);
        return ::new (memory) T(LEXY_FWD(args)...);
    }
};

Once we store every node of the tree in the buffer by reserving memory for it and then allocating it, its memory address won’t ever change, so we can safely store pointers to it even while the tree is in construction.

The children of a production node are now a std::vector<pt_node_ptr<Iterator>>: a plain pointer is enough, as the memory is owned by the tree and not the individual nodes, and it implicitly stores the type of the node without extra memory overhead.

The nodes are never destroyed by the buffer, so the std::vector leaks. This will be fixed below and the actual implementation has a static_assert for trivial copyability.

Step 4: Linked lists

The advantage of storing a std::vector of children is that you have random-access. This, however, doesn’t help you much here: you rarely want to access the nth child of a node, but the child that has a specific kind (whose index can vary due to whitespace and other nodes). The downside of std::vector is the additional memory allocation to store all the pointers as well as the space overhead – three pointers.

Instead, we can switch to a good old intrusive linked list: we give every node – productions and tokens – a pt_node_ptr to the next node in the list. Every node is only in one list as it has only one parent, so this works out.

Now I can know what you’re saying: a linked list is a bad data structure.

And this is true for something like std::list where we allocate every node separately. But here, all the nodes already live in the buffer and aren’t individually allocated. They’re also close together which helps with caching effects. For example, consider the following tree:

It’s memory layout is:

production | Hello | child | w | o | r | l | d | !

As you can see, the child nodes of production immediately follow the node. Only when we jump over child productions do we need to skip over all of their children.

To implement the linked list we’re introducing a base class that stores all members that are common for every node, i.e. the next pointer:

template <typename Iterator>
struct pt_node
{
    pt_node_ptr<Iterator> ptr;
};

template <typename Iterator>
struct pt_node_token : pt_node<Iterator>
{
    // as before
};

template <typename Iterator>
struct pt_node_production : pt_node<Iterator>
{
    // see below
};

Then we can also change pt_node_ptr such that .base() does not return void* but the common base class pointer pt_node*. This allows access to the ptr member without taking a branch that queries the type first.

In pt_node_production, we replace std::vector<pt_node_ptr<Iterator>> by a pointer to the first element and the number of children:

template <typename Iterator>
struct pt_node_production : pt_node<Iterator>
{
    const char* production_name;
    std::size_t child_count;
    pt_node_ptr<Iterator> first_child;
};

Now to add a child, we insert it into the end of the linked list and increment the child count. Appending something to a linked list requires a pointer to the currently last element of the list, but this is only relevant for construction so it doesn’t need to be stored as part of the tree.

Iteration over the children of a node starts with the first_child and then just follows each node’s .ptr.

This is already an improvement, but we can go even better: in most cases, the first child of a production node is stored immediately afterwards, so we don’t need first_child. All we need to do is remember the type; the address is just this + 1! Only when we have a production node at the end of the buffer do we need to have a pointer to the first child, as then it is in a different memory block.

The idea now is to remove the first_child pointer and instead store flags that remember the type of the first child and whether or not it is immediately adjacent. If it is, we can reconstruct a pt_node_ptr to the first child by combining the type with the address this + 1, otherwise, the memory immediately following the production node will contain the actual address. Note that this will only happen once per 4 KiB block by definition.

template <typename Iterator>
struct pt_node_production : pt_node<Iterator>
{
    const char* production_name;
    // We use a bit field to store the flags as one std::size_t.
    std::size_t child_count : sizeof(std::size_t) * CHAR_BIT - 2;
    std::size_t first_child_adjacent : 1;
    std::size_t first_child_type : 1;

    pt_node_ptr<Iterator> first_child()
    {
        // Get a pointer to the memory immediately afterwards.
        auto memory = static_cast<void*>(this + 1);
        if (first_child_adjacent)
        {
            // The memory contains the node.
            pt_node_ptr<Iterator> result;
            result.set(static_cast<pt_node<Iterator>*>(memory),
                       first_child_type);
            return result;
        }
        else
        {
            // The memory contains the actual pointer.
            return *static_cast<pt_node_ptr<Iterator>*>(memory);
        }
    }
};

Of course, it is important to reserve space to store an adjacent pointer:

// In the code that adds a production node.

// Reserve enough for production and trailing pointer.
auto space_needed
  = sizeof(pt_node_production<Iterator>)
  + sizeof(pt_node_ptr<Iterator>);
buffer.reserve(space_needed);
auto node = buffer.allocate<pt_node_production<Iterator>>(…);

.reserve() ensures that there is enough contiguous memory for both, but .allocate() advances only the part necessary to fit the node. If we do a subsequent allocation for a child node on the same block, this will use the memory that we reserved for the pointer – but then it is okay as that memory is immediately afterwards and we don’t need the pointer to the first child! We only need the pointer if a subsequent allocation is placed on a new block, but in that case the remaining space of the old block is left untouched and we can store it there.

Check the builder code for the full details.

Note that the block size of 4 KiB is enough for the 8 bytes to store a pointer to the next block, 170 nodes with size 24 each, and one trailing pointer for the first child of the last production node. This is just a coincidence – I didn’t plan for it when picking the block size.

Step 5: Parent pointers

Now we’re having stable addresses for nodes and a compact pointer: just add a pt_node_ptr<Iterator> parent member to the pt_node base class to give every node access to the pointer, right?

Well, that would add 8 bytes to every node which would bring the size up to 32. I don’t find that acceptable, especially as accessing a parent is not a common operation. Luckily, we don’t have to add an additional member, there is one available: the existing .ptr member of the linked list.

Every production node knows how to get to its first child, and from there it follows the .ptr member to the next child. The last child of a production node is identified by a .ptr member that is nullptr. So let’s just point back to the production node instead!

For that, we need to change pt_node_ptr such that it stores one additional bit of information for a pointer: the role. A pointer either has the “sibling” role, which means that a node is not the last child of a production and .ptr points to the next child of the parent (i.e. its sibling), or the “parent” role, which means that node is the last child and .ptr points back to the parent. As we’re having an alignment of 8, there are two more zeros we haven’t used:

template <typename Iterator>
class pt_node_ptr
{
    std::uintptr_t _address;

    explicit pt_node_ptr(void* ptr, unsigned type, unsigned role)
    : _value(reinterpret_cast<std::uintptr_t>(ptr))
    {
        // Assert that the alignment is correct.
        assert((_value & 0b11) == 0);
        // Set the type.
        _value |= (type & 0b1);
        // Set the role.
        _value |= (role & 0b1) << 1;
    }

public:
    static constexpr auto role_sibling = 0b0u;
    static constexpr auto role_parent  = 0b1u;

    …

    // previously, it was just `set`
    void set_sibling(pt_node_token<Iterator>* ptr) { … }
    void set_sibling(pt_node_production<Iterator>* ptr) { … }

    // No need to overload for token nodes;
    // they won't ever be parents.
    void set_parent(pt_node_production<Iterator>* ptr)
    {
        *this = pt_node_ptr(ptr, type_production, role_parent);
    }

    unsigned type() const
    {
        // The type is given in the last bit.
        return _address & 0b1;
    }
    unsigned role() const
    {
        // The role is given in the second last bit.
        return (_address & 0b10) >> 1;
    }

    pt_node<Iterator>* base() const
    {
        // The pointer value has the last two bits set to zero.
        auto cleared = _address & ~std::uintptr_t(0b11);
        return reinterpret_cast<pt_node<Iterator>*>(cleared);
    }

    auto token() const { … }
    auto production() const { … }
};

The builder just needs to ensure that the .ptr member of the last child of a production is set appropriately and we’re good to go. To query the parent of a node, just keep following its .ptr member until we’ve reached one whose role is role_parent – it points to the parent node. This is O(number of children), but that’s okay as we’re getting something else for free.

Step 6: Traversal

For any tree, there are three useful ranges we might want to iterate over.

The first range is the range of all direct children of a node. This was always possible. With the new design we obtain a pointer to the first child from the production node, and keep iterating .ptr until it has role_parent, then we’re done.

The second range is the range of all siblings of a node. With the new design, this is possible as well: just follow .ptr until we’ve reached role_parent, then go to the parent’s first child. Iteration stops when we’ve reached the initial node again.

The third range is to iterate over all, direct and indirect, children of a node. For the root node, this means iterating over the entire tree in a depth-first manner. Usually, this involves a combination of looping over the direct children and recursively visiting each child. This requires linear stack space and doesn’t fit nicely into C++’s iterator model.

But with the new design it can be done completely iteratively without recursion. The idea is to go the first child of the start node and then just keep following .ptr. When the type of ptr is a production and the role is “sibling”, we’ve reached a production for a first time and now need to visit its children. As such, we next go to the first child of the production. However, when the type of ptr is a production and the role is “parent”, we’ve already visited it before and just came back to it. Then we continue with the production’s .ptr to continue with its siblings.

Or, as the logic for the increment operator of a C++ iterator:

iterator& operator++()
{
    if (_cur.token())
        // A token has no children, so we continue with its sibling.
        _cur = _cur.base()->ptr;
    else if (_cur.is_sibling_ptr())
        // We've reached the production for the first time,
        // go to its first child.
        _cur = _cur.production()->first_child();
    else if (_cur.is_parent_ptr())
        // We've reached the production for the second time,
        // go to its sibling.
        _cur = _cur.base()->ptr;

    return *this;
}

Traversing the parse tree is thus as easy as traversing a linked list! No recursion, no stack, just simple pointer chasing. And remember the memory layout of our example tree:

production | Hello | child | w | o | r | l | d | !

We visit production, then Hello, then child, then w, … – traversal just follows the block in order. Only after a child do we jump back to the parent, and then back over all the children. But most of the time, we’re just dereferencing a pointer that points 24 bytes further into the memory block, as if we were iterating over an array!

Of course, it isn’t equivalent to an array, as the CPU cannot read the next pointer until the current element has been loaded. With an array, the next element offset is statically known which enables optimizations.

Conclusion

I’ve stopped at this point, but further optimizations are possible. For example, using virtual memory we can allocate huge amounts of memory – more than ever needed for the tree – and only commit it as its needed. This makes the needed for a linked list of blocks unnecessary, making allocation in the buffer faster and simplifying the production node.

The extension header lexy_ext/parse_tree_algorithm.hpp contains some useful algorithms for working with parse trees. For example, lexy_ext::child(tree, node, lexy::identifier_token_kind) returns the first (direct) child node that is an identifier. This requires iterating over all the children, which needs to load every previous child into memory, checking its kind, and determining whether it is an identifier. It would be faster to add something like a std::vector<std::pair<pt_node_ptr<Iterator>, TokenKind> to a production node – then the algorithm only needs to iterate over the vector which doesn’t load all previous children into memory.

However, whether or not such optimizations are worthwhile requires more benchmarking than I’ve done so far.