Standardese documentation generator version 0.2: Entity linking, index generation & more

10 Aug 2016 by Jonathan

Two months ago I released standardese version 0.1. I promised that the next version wouldn’t take as long as the first one, which took one month.

Well, I’m really not good at estimates.

But this release brings the last missing features to make standardese an actually usable documentation generator: index generation, referring to other parts of the documentation and more output formats, as well as other amazing features like an overhauled comment system. It also includes a lot of internal changes and bug fixes.

foonathan/standardese is a C++ documentation tool that aims to be a replacement for Doxygen. It is still WIP but already supports enough that you could start using it for your documentation, although it probably contains many bugs.

An update on the parsing situation

In the last post I complained about libclang and its limitations. Its lack of features forced me to do my own parsing run over the source code with the help of Boost.Wave.

It is obvious that my parsing code isn’t perfect. Since the last update I’ve fixed many bugs for more or less unusual combinations of C++ features that my code couldn’t handle. Still, after all those fixes, I know about a couple of special cases that my code won’t handle right. But those are really weird combinations of special syntax rules; I doubt anybody is going to write them without deliberately trying to break my code.

I am not going to work much on the parsing code. The remaining bugs will be fixed “on-demand”: if your code isn’t handled correctly, I’ll fix it. But if nobody finds the bugs, I won’t fix them.

My parsing code has a different disadvantage: Boost.Wave is a huge, header-only library that massively increases the compilation time of the library. So it isn’t a permanent solution.

If somebody knows a simple C++ preprocessor and tokenizer library, contact me! And libclang isn’t enough because I couldn’t get its tokenizer to pre-process the source.

In response to the parsing situation, many people suggested that I should contact the developers and ask them about it. I wrote a mailing list - eh - mail doing that. The general response was that libclang follows the same policy I do with my bugs: If somebody complains, they might do something. But it is definitely faster if you just submit patches yourself.

So as a long-term goal, I have to do exactly that. But for now I’ll use my Boost.Wave parser: After all, it works. After standardese has most of the features I’ve planned, I’ll go back and do something about libclang, but not now.

If someone else is willing to write patches for me, please contact me.

Comment formatting

In the previous version you could use Markdown to format the documentation comments. But you could only use Markdown because the output format was Markdown and the comments were just copied over.

This has changed: the comment text is now properly parsed, while still allowing you to use Markdown, or more precisely CommonMark. The parsing is done by the cmark library.

In the last post I ranted about libclang. Now I want to praise cmark. It is an amazing library with a simple, well-designed, consistent C API that exposes everything I need. Except for an (already fixed) issue with their CMake, it is simply perfect for my use. I highly recommend it.

cmark’s C hierarchy is parsed and used to create a simple class hierarchy. This AST is slightly modified for my needs and also supports the standardese sections and commands.

You can now specify sections at the beginning of a CommonMark paragraph and commands in each line of a command paragraph (a paragraph starting with a command). This looks like so:

/// The implicit brief section.
///
/// \effects The effects paragraph.
/// Still effects.
///
/// \returns The returns paragraph.
/// \effects <- this is a literal string here.
///
/// \param bar A parameter documentation.
/// Still the entire paragraph.
///
/// \unique_name foo
/// \exclude
void foo(int bar);

If you don’t like the verbose empty lines you can also set the option comment.implicit_paragraph to true. This will start a new paragraph with each line of the comment.

The last paragraph starts with a command so each line is parsed properly, unlike in the literal string. Read the readme for further information about sections and commands and/or the rest of this post for more information about the commands.
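To illustrate what recognizing a command at the start of a line could look like, here is a minimal sketch. The type parsed_line and the function split_command are hypothetical names made up for this example, not standardese’s actual code, and the sketch only looks at a single line; whether a section counts still depends on where the line sits in its paragraph.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch, not part of standardese.
struct parsed_line
{
    std::string command; // empty if the line does not start with a command
    std::string text;    // the remaining text of the line
};

// Splits a comment line into a leading `\command` and the rest of the text.
inline parsed_line split_command(const std::string& line)
{
    parsed_line result;
    if (line.empty() || line[0] != '\\')
    {
        result.text = line; // plain text, no command
        return result;
    }

    // the command name runs up to the first space (or the end of the line)
    std::size_t end = line.find(' ');
    if (end == std::string::npos)
    {
        result.command = line.substr(1);
        return result;
    }
    result.command = line.substr(1, end - 1);
    result.text    = line.substr(end + 1);
    return result;
}
```

With this sketch, a line such as `\unique_name foo` splits into the command `unique_name` and the text `foo`, while `Still effects.` has no command at all.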

Comment matching

Previously I used the libclang function clang_Cursor_getRawCommentText() to get the comment of an entity. Like many things in libclang this had some limitations and didn’t return comments for some entities like macros.

Now this system is completely changed. The entire source code is scanned for documentation comments and their content is stored. Supported are ///, //!, /** ... */ and /*! ... */ as well as end-of-line comments (//<). The scanner automatically strips the comment marker as well as one whitespace character following it; for C style comments it also ignores the unnecessary * at the start of the following lines, if there are any:

/** This is comment text.
 * This again, without the star.
 *   This has two leading spaces, because one is stripped.
 */
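The stripping rule for the continuation lines of a C style comment can be sketched roughly as follows. This is only an illustration under assumptions: the function name strip_continuation is made up, the opening and closing markers are assumed to be handled elsewhere, and lines without a leading star are assumed to be kept unchanged.

```cpp
#include <cstddef>
#include <string>

// Strips the leading `*` and one following whitespace character from a
// continuation line of a C style documentation comment.
// Assumption: a line without a leading `*` is returned unchanged.
inline std::string strip_continuation(const std::string& line)
{
    std::size_t pos = line.find_first_not_of(" \t");
    if (pos == std::string::npos || line[pos] != '*')
        return line; // no star to ignore

    ++pos; // skip the star itself
    if (pos < line.size() && (line[pos] == ' ' || line[pos] == '\t'))
        ++pos; // skip exactly one whitespace character after the star
    return line.substr(pos);
}
```

Applied to the third line of the example above, the result keeps two of the three spaces after the star, because only one is stripped.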

End-of-line comments are also supported and they merge with following C++ style documentation comments:

enum foo
{
    a, //< Comment for a.
    b, //< Comment for b.
    /// Still comment for b.
    c, //< Comment for c.
};

The comments are matched to the entity that is either directly below or on the same line as the comment. Furthermore, this system allows remote comments as well as inline comments.

Remote comments

Each entity has an associated unique-name, more on that in the linking paragraph. You can also write a comment without a corresponding entity and specify it yourself using the entity command:

void foo();

/// \entity foo
///
/// This is the comment for the function foo.

This is a remote comment and allows putting the documentation at a different place than the entity.

As long as that “different place” is still a file given to standardese. It currently must be a valid C++ file as well, although “documentation-only” files will be supported in the future.

A special shortcut command for files is the file command. It is the same as \entity current-file-name and allows writing documentation for the file.

Inline comments

There are some entities you cannot document with a matching comment; for those you must use a remote comment. They are (template) parameters and base classes.

To document those I’ve added support for inline comments. You can document them in the comment for their corresponding parent entity using the param, tparam or base command.

param and tparam are actually aliases.

The next paragraph is then the documentation for that inline entity:

/// Documentation for function.
///
/// \param foo Documentation for parameter foo.
///
/// \param bar Documentation for parameter bar.
void func(int foo, int bar);

This is the same as:

/// Documentation for function.
void func(int foo, int bar);

/// \entity func(int,int).foo
///
/// Documentation for parameter foo.

/// \entity func(int,int).bar
///
/// Documentation for parameter bar.

Note that currently inline comments aren’t specially rendered; they’re treated as any other entity and get their own heading with synopsis.

Entity linking

One important feature that took a lot of internal refactoring to make it work is entity linking, i.e. the ability to link to a different entity. I’ve decided to use the regular CommonMark links but without a URL:

/// See [here as well](<> "foo").
void bar();

/// This is foo.
void foo();

This is just a CommonMark link with an empty URL (<>) and a title that is the unique name of the entity you want to link to. In this case the link text is different from the unique name of the entity linked to. But in most cases it is the same, so you can just use the following shorthand syntax:

/// See [foo]().

No matter the syntax, standardese will fill in the link with the URL of the linked entity.

The unique name

For both linking and the remote comments you need the unique name of the entity. The unique name is basically the full name of the entity with a few exceptions as shown in the example:

struct foo {}; // unique name is `foo`

void func(); // unique name is `func()`

void func(int a, const char* b); // unique name is `func(int, const char*)`
                                 // unique name of parameter a is `func(int, const char*).a`
                                 // unique name of parameter b is `func(int, const char*).b`

namespace ns // unique name is `ns`
{
    class bar {}; // unique name is `ns::bar`

    template <typename T> // unique name of parameter is `ns::templ<T>.T`
    struct templ // unique name is `ns::templ<T>`
    : T // unique name is `ns::templ<T>::T`
    {
        void func() const; // unique name is `ns::templ<T>::func() const`
    }; 
}

For functions it also needs to contain the signature and for templates the names of the template parameters. (Template) parameters themselves are written after a . following their parent. All whitespace in a unique name will be erased before processing, so it doesn’t matter how you format it. Furthermore, you don’t need to put empty parentheses () for a function without a signature.
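The whitespace rule means that differently formatted spellings of the same unique name compare equal once they are normalized. A minimal sketch of such a normalization, with the hypothetical function name erase_whitespace (not taken from standardese itself):

```cpp
#include <cctype>
#include <string>

// Removes every whitespace character from a unique name, so that
// `func(int, const char*)` and `func( int,const char * )` compare equal.
inline std::string erase_whitespace(const std::string& name)
{
    std::string result;
    for (char c : name)
    {
        // cast to unsigned char: std::isspace is undefined for negative values
        if (!std::isspace(static_cast<unsigned char>(c)))
            result += c;
    }
    return result;
}
```

Note that this really erases all whitespace, including the space inside `const char*`, which is harmless as long as both sides of a comparison are normalized the same way.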

If a function isn’t overloaded you aren’t required to write the signature at all. This is the short unique name.

But still this can be too long and verbose, so you can change the unique name to an arbitrary string with the unique_name command:

/// The unique name is now `function`.
///
/// \unique_name function
void fancy_function(const char* ptr, int a, ...);

This also works with remote comments:

/// \unique_name function
void fancy_function(const char* ptr, int a, ...);

/// \entity function
///
/// Comment for the former `fancy_function`.

Link target

It was quite tricky to calculate the corresponding URL for an entity because for example the name of the file depends on the output format. For that reason the links are only resolved before everything is written out to the file. standardese generates documentation on a per-file basis, so all entities of a file are documented in one output file. When generating the documentation it sets the output file name - without extension! - for all entities.

The output file name of a foo/bar/baz.hpp is foo__bar__baz.
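One way to derive such a name is sketched below. It is an assumption-laden illustration, not the actual implementation: the function name output_file_name is made up, and it assumes that only forward slashes act as directory separators and that the extension is everything from the last dot of the file name.

```cpp
#include <cstddef>
#include <string>

// Maps a source path like `foo/bar/baz.hpp` to `foo__bar__baz`.
// Assumption: `/` is the only directory separator.
inline std::string output_file_name(const std::string& path)
{
    // drop the extension, but only if the last dot belongs to the file name
    std::size_t slash = path.find_last_of('/');
    std::size_t dot   = path.find_last_of('.');
    std::string stem  = path;
    if (dot != std::string::npos && (slash == std::string::npos || dot > slash))
        stem = path.substr(0, dot);

    // replace each directory separator with a double underscore
    std::string result;
    for (char c : stem)
    {
        if (c == '/')
            result += "__";
        else
            result += c;
    }
    return result;
}
```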

When a link is resolved by the renderer, both the output file name and the final extension are known, so it can generate the URLs.

To link to a specific entity and not only the file it is in, the output needs to contain anchors. Unfortunately, CommonMark still doesn’t support them! So I’ve created a “fake” AST entity md_anchor that actually maps to a CMARK_NODE_HTML_INLINE that renders the HTML anchor code. The anchor of an entity is just the unique name, so that worked out perfectly.

I’ll embed standardese documentation for my projects on this website, which uses Jekyll. Jekyll takes the rendered standardese CommonMark files and transforms them into HTML. There is one problem though: the links rendered by the CommonMark renderer are file-name.md#entity-name, whereas Jekyll will change all files so that they use an HTML extension! To solve this problem I’ve also added an output.link_extension option. This overrides the extension the renderer will use for the links.
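Putting the pieces together, building the URL of an entity could look roughly like this. The function entity_url and its parameters are hypothetical; the sketch only assumes what is described above, namely that a link consists of the output file name, an extension and the unique name as anchor, and that a non-empty link extension overrides the extension of the output format.

```cpp
#include <string>

// Builds `output-name.extension#unique-name`.
// `link_extension` plays the role of the output.link_extension option:
// if it is non-empty, it replaces the extension of the output format.
inline std::string entity_url(const std::string& output_name,
                              const std::string& format_extension,
                              const std::string& link_extension,
                              const std::string& unique_name)
{
    const std::string& extension =
        link_extension.empty() ? format_extension : link_extension;
    return output_name + '.' + extension + '#' + unique_name;
}
```

For the Jekyll case, passing `md` as format extension and `html` as link extension yields links ending in `.html#...` even though the files written out are CommonMark.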

External links

Linking to other entities of the documentation isn’t the only thing you want. You also need to link to entities of other projects and to their documentation, for example you want:

/// See [std::vector::push_back()]().
void foo();

Well, this works! It will link to cppreference.com.

This is due to the support for external documentation with the output.external_doc option. The following value is set implicitly:

std::=http://en.cppreference.com/mwiki/index.php?title=Special%3ASearch&search=$$

For all entity links in namespace std this will link to the search for that entity; the $$ is replaced by the given unique name of the entity. But you can also set it for other namespaces and their documentation.
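The substitution itself is simple enough to sketch. The function external_url below is a hypothetical illustration; it assumes the unique name is inserted verbatim, without any URL escaping, and that an empty string signals "no external documentation for this name".

```cpp
#include <cstddef>
#include <string>

// Returns the external URL for `unique_name`, or an empty string if the
// name does not start with `prefix` (e.g. `std::`).
// Assumption: the name is inserted as-is, without URL escaping.
inline std::string external_url(const std::string& prefix,
                                const std::string& url_template,
                                const std::string& unique_name)
{
    if (unique_name.compare(0, prefix.size(), prefix) != 0)
        return ""; // entity is not in the configured namespace

    std::string url = url_template;
    std::size_t pos = url.find("$$");
    if (pos != std::string::npos)
        url.replace(pos, 2, unique_name); // substitute the placeholder
    return url;
}
```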

Index generation

Thanks to the infrastructure required for entity linking, it was also very easy to generate index files. standardese will generate two indices: a file index and an entity index.

The file index is in a file named standardese_files.XXX and contains a list of all the files that are documented. The entity index is in a file named standardese_entities.XXX and contains a list of all the entities in namespace scope, i.e. classes and free functions. The entity index is grouped by namespaces and also contains the brief section of the documentation.

More output formats

Thanks to cmark it was also trivial to add more output formats. standardese now supports rendering to CommonMark and HTML and has experimental support for Latex and Man. It also supports dumping the AST in an XML format.

This is implemented with the cmark_render_XXX() functions. The Latex and Man formats, which cannot include HTML, don’t work so well because of my anchor hack, but this will be tackled in a different version.

Other changes

I’ve also added some other features.

For example, the library was designed for multi-threaded execution from the beginning and now the tool also uses a thread pool to distribute generation across more cores. The default number of worker threads is the number of cores; this can be changed with the --jobs or -j option.
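Choosing the number of workers can be sketched with the standard library alone. The function worker_count is a hypothetical helper, and treating 0 as "option not given" as well as falling back to a single thread when the core count is unknown are assumptions of this sketch, not documented behavior of the tool.

```cpp
#include <thread>

// Picks the number of worker threads.
// `jobs_option` is the value of a --jobs / -j style option;
// assumption: 0 means the option was not given.
inline unsigned worker_count(unsigned jobs_option)
{
    if (jobs_option != 0)
        return jobs_option; // explicit request wins

    // hardware_concurrency() may return 0 if the value cannot be determined
    unsigned cores = std::thread::hardware_concurrency();
    return cores != 0 ? cores : 1;
}
```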

I’ve also added an exclude command. An entity that is excluded will not appear in the documentation, at all:

/// \exclude
struct foo {};

/// A type.
using type = foo;

Will generate:

using type = implementation-defined;

as synopsis.

The build system has also been overhauled and the handling of external dependencies has changed. But I’ve already covered that in another blog post.

What’s next?

This update adds many important features that mature standardese and make it more than just a basic prototype. Thanks to many amazing people it also has better support on various platforms. The parser is also improved, so I now encourage you to start using standardese for your own documentation. I will also use it to finally write the documentation for standardese itself.

Of course, work’s not finished. The next version will tackle entity groups and modules as well as finally some of the more advanced features that will truly make standardese the best C++ documentation generator.

So check it out and share it!

This blog post was written for my old blog design and ported over. If there are any issues, please let me know.