cppast - A library to parse and work with the C++ AST

20 Apr 2017 by Jonathan

Last year I started standardese, a C++ documentation generator. In order to provide exact documentation, I need to parse C++ code. As I didn’t want to waste time implementing my own parser, which would take ages and not work most of the time, I opted to use libclang.

libclang is a C API that exposes the C++ abstract syntax tree (AST) and is built on top of clang. And clang is a good and conforming C++ compiler, so I expected an interface to read the AST that just works and gives me the information I need.

Well, I was wrong. Here’s why and how I solved it.

tl;dr: cppast.

libclang problems

libclang isn’t terrible. It has a reasonably easy-to-use interface, so it is quick to get going. And as it is based on clang, it has no problem dealing with conforming C++ code. Furthermore, it supports GCC and MSVC compiler extensions and is fast enough for a documentation generator.

However, as its website advertises, it doesn’t expose the full AST.

If you just need to do basic tasks, like “print all functions in the given file”, it works well. But for standardese, I needed access to the full AST, in order to provide good documentation. And libclang simply doesn’t provide that.

The reason for that is simple: libclang features are implemented on-demand. Need to get more information about XXX for your project? Implement that yourself. So it works great for things other people already needed, but not for the rest.

Now in hindsight I should have probably used LibTooling even though the API isn’t stable, and from what I’ve heard, it is difficult to use in a standalone project. But instead I opted for a different path:

I started to work around libclang’s limitations.

libclang workarounds

Read “hacks”.

For example, libclang doesn’t expose whether or not a function is marked noexcept, and if so, what the noexcept expression is, if it is conditional. It does, however, expose all tokens of a function.

See where I’m going with this?

I thought to myself “hm, that’s easy, just loop over the function tokens and see if you can find noexcept”. That’s what I did.
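To make the idea concrete, here is a minimal sketch of such a naive scan. It deliberately leaves libclang out: `tokens` is assumed to hold the spellings of a function declaration's tokens, and `has_noexcept` is a hypothetical helper name, not something taken from standardese.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: `tokens` stands in for the token spellings of one
// function declaration, in source order.
bool has_noexcept(const std::vector<std::string>& tokens)
{
    // Naive check: the function counts as noexcept if the keyword
    // appears anywhere among its tokens.
    return std::find(tokens.begin(), tokens.end(), "noexcept") != tokens.end();
}
```

Note that this only answers "is the keyword there?". It says nothing about a conditional noexcept expression, and it breaks as soon as the keyword is not spelled out literally.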

The first problem I ran into was macros. For compatibility reasons sometimes the noexcept is hidden behind a macro. But the libclang tokens are not preprocessed, so I needed to do that.
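The macro problem can be illustrated with a small sketch. Everything here is an assumption made for illustration: the token lists are plain strings, `MY_NOEXCEPT` is a made-up macro name, and `expand_object_macros` only replaces object-like macros one level deep, which is far less than a real preprocessor does.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

using token_list = std::vector<std::string>;

// Naive scan for the keyword among the token spellings.
bool contains_noexcept(const token_list& tokens)
{
    return std::find(tokens.begin(), tokens.end(), "noexcept") != tokens.end();
}

// Hypothetical, heavily simplified expansion: every token that names an
// object-like macro is replaced by its replacement tokens, one level only.
// Function-like macros, nested expansion and token pasting are ignored.
token_list expand_object_macros(const token_list& tokens,
                                const std::map<std::string, token_list>& macros)
{
    token_list result;
    for (const auto& tok : tokens)
    {
        auto iter = macros.find(tok);
        if (iter == macros.end())
            result.push_back(tok);
        else
            result.insert(result.end(), iter->second.begin(), iter->second.end());
    }
    return result;
}
```

For a declaration such as `void f() MY_NOEXCEPT;`, the scan over the raw tokens finds nothing. Only after expansion does the keyword show up.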

Instead of implementing my own preprocessor, I decided to use Boost.Wave which is advertised as a conforming C++ preprocessor implementation. I wrapped the tokenization behind an interface that automatically preprocessed those.

Now this had two consequences:

Compilation times of standardese exploded: As most files needed to parse tokens, most files needed Boost.Wave, which includes Boost.Spirit, which takes ages to compile.
The approach didn’t work if you had a macro to generate a couple of boilerplate functions.

So I had to resort to preprocessing the entire file with Boost.Wave. This improved compilation times as now only one file needed it, but wasn’t perfect either: Wave can’t preprocess standard library files due to many extensions, so I had to resort to a system that selects the header files that should be preprocessed. But more importantly: Boost.Wave is slow, so I wasn’t too happy.

After I wasted too much time trying to write my own preprocessor (macro expansion is surprisingly tricky), I resorted to using clang directly for preprocessing. If you pass -E clang will output the file after it has been preprocessed. So I did exactly that: I used a process library to call clang and parse the output. In particular, -E also expands all includes, which I didn’t want, requiring me to undo that. This wasn’t hard, thanks to the line marker output. I also used the opportunity to parse macros and include directives. While the preprocessor is still the slowest part, I’m happy with it.
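As a rough illustration of the line marker trick: the preprocessed output contains lines of the form `# <line> "<file>" <flags>` whenever the current file changes, so it is possible to track which file each line came from. The sketch below is an assumption about how this could be done, not the actual standardese code. The function name `strip_included_lines` is made up, and it simply drops everything that does not come from the main file (it does not handle quotes inside file names).

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: keeps only the lines that originate from `main_file`,
// using the `# <line> "<file>" <flags>` markers to track the current file.
std::string strip_included_lines(const std::string& preprocessed,
                                 const std::string& main_file)
{
    std::istringstream in(preprocessed);
    std::string        line, result;
    bool               in_main = false;

    while (std::getline(in, line))
    {
        if (line.compare(0, 2, "# ") == 0)
        {
            // Looks like a line marker: the file name sits between the quotes.
            auto first = line.find('"');
            auto last  = line.rfind('"');
            if (first != std::string::npos && last > first)
            {
                in_main = line.substr(first + 1, last - first - 1) == main_file;
                continue; // the marker itself is never part of the result
            }
        }

        if (in_main)
        {
            result += line;
            result += '\n';
        }
    }
    return result;
}
```

The file name passed as `main_file` has to match the spelling in the markers, which is usually the path as it was given on the command line.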

So now I can safely scan the tokens of an entity to get the additional required information. But what started as a simple “loop and see if it contains a token” quickly grew into a ball of more or less smart heuristics as I needed to get more and more advanced information (contextual keywords like override and final, I’m looking at you). The end result works for any code I threw at it, and while I could come up with various edge cases, nobody uses them in real world code™.
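To see why contextual keywords are annoying for a token scan, consider the following valid C++. It is a made-up example, not code from standardese: `final` and `override` only have special meaning in certain positions and are ordinary identifiers everywhere else, so "the token final appears" proves nothing on its own.

```cpp
struct base
{
    virtual ~base() = default;

    // Here `final` is just the name of a member function.
    virtual int final() const
    {
        return 1;
    }
};

// Here `final` marks the class as not derivable.
struct derived final : base
{
    // The first `final` is the function name,
    // the trailing `override final` are the specifiers.
    int final() const override final
    {
        return 2;
    }
};

inline int call_through_base()
{
    int override = 0; // `override` as an ordinary variable name
    derived     d;
    const base& b = d;
    return b.final() + override; // calls derived::final()
}
```

A heuristic has to take the position of the token into account, for example whether it comes after the closing parenthesis of the parameter list, to tell these uses apart.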

But my parsing code was a convoluted and unmaintainable mess. It didn’t help that I needed to work around various inconsistencies in the libclang API: Just take a look at this file!

And as the parsing code was strongly coupled with the standardese code, the entire project became messy. I originally designed standardese to be both a library you can use to generate documentation as you wish, and a tool. But with the current state, it’s not possible.

So I needed a different solution.

libclang outsourcing

Why am I telling you this story?

Because I have now found a way to get the C++ AST, but it is not usable, and if you need the AST yourself, you have to go through all the same workarounds.

So I did the only reasonable thing: I extracted the mess into a different project.

I had two goals:

Provide a clean API to work with the AST and hide all the parsing code in the implementation. This only pollutes one place with my libclang workarounds.
Be independent from the underlying parsing implementation. This allows multiple backends or switching backends without affecting the usage code.
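The two goals can be pictured with a small sketch. All names in it (`entity`, `parser`, `fake_parser`, `count_entities`) are hypothetical and chosen for illustration only, they are not the cppast API. The point is merely that user code sees an abstract parser and a backend-free entity tree, while a backend such as one built on libclang would live behind the interface.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical entity tree that stores nothing from the parsing backend.
struct entity
{
    std::string         kind;
    std::string         name;
    std::vector<entity> children;
};

// Hypothetical backend interface: user code only depends on this.
class parser
{
public:
    virtual ~parser() = default;

    virtual entity parse(const std::string& path) const = 0;
};

// Stand-in backend returning a canned tree; a real backend would run an
// actual parser here and translate its result into `entity` objects.
class fake_parser final : public parser
{
public:
    entity parse(const std::string& path) const override
    {
        return entity{"file", path, {entity{"function", "foo", {}}}};
    }
};

// Usage code written purely against the backend-independent tree.
std::size_t count_entities(const entity& e)
{
    std::size_t count = 1;
    for (const auto& child : e.children)
        count += count_entities(child);
    return count;
}
```

Because `count_entities` never mentions a backend type, swapping `fake_parser` for another implementation does not affect it.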

The standardese parsing API was a more or less thin wrapper over libclang. Each cpp_entity stored the libclang CXCursor and using it was a mix between my API and libclang. In order to achieve my goals, I had to completely isolate the entity hierarchy from libclang. For that, I had to mirror some infrastructure like resolving cross references, or create an entirely new hierarchy for the types: Previously I simply used libclang’s CXType, now I have cpp_type and derived classes.

But the end result was totally worth it: I have a clean and modern representation of the C++ AST. It uses type_safe in the API, which makes it more expressive, and does not expose anything from libclang.

The parsing interface is as simple as possible - just call one member function - and all the messy details are physically hidden away. It can currently parse basically everything you can put in a header file, except variable templates. This is a libclang limitation - they’re currently “unexposed”, meaning that you only get the tokens concatenated into a string, and no further information. Of course, there are some other limitations I can’t easily work around, but those are rare edge cases and only lead to things being unexposed.

It also has a complete cpp_type hierarchy, but the cpp_expression hierarchy currently only has two classes (literal and unexposed). It also does not parse function bodies, i.e. statements, or attributes. But those features will be added as needed (yell at me, if you want them).

I’ve also implemented other parts of standardese there: It features support for documentation comments in various formats and the same smart comment matching system, as well as a way to iterate over unmatched comments. And it also supports customizable code generation of an AST entity I can use to generate the synopsis.

Currently all AST entities are immutable, but I’ll change that, so you can both synthesize new entities and modify existing ones. This will also vastly simplify standardese’s code.

I will probably also add a more high level visitation interface, like clang’s AST matchers.

I can now proudly present:

cppast - a collection of libclang workarounds

Currently it is just the first prototype and I haven’t actually integrated it in standardese yet. This will probably require some changes to the API, so right now, everything’s unstable. But I encourage you to check it out. It features a simple command line tool that “pretty”-prints the AST, so please check if it can handle your own code.

I changed, simplified or used a different approach for some workarounds, so code that could be parsed with standardese might not be parsed anymore.

As a bonus, I now have an exhaustive list of libclang limitations and bugs, so if I find the time, I can actually fix them and remove some of my workarounds. For that reason I’m not going to be supporting older LLVM versions: Right now, I suggest you use it with clang 4.0, but 3.9.1 works as well (except for friend and include directives). And as soon as 4.1 is released, I’ll drop 3.9.1 support.

If you’re writing a project that requires the AST - reflection library, documentation generator, code generator - consider using cppast.

This blog post was written for my old blog design and ported over. If there are any issues, please let me know.