How I have beaten Boost.Pool #2: Inlining is key

Calling a function has a certain overhead: registers must be saved, a new stack frame pushed, and so on. For small functions this overhead costs more than the actual work the function does!

For those, it is much better if the compiler copies the implementation directly into the call site. This is what inlining does.

Luckily, the compiler is usually able to do this optimization. Or is it?

In this series, I’ll explain my changes and share some lessons about optimization I’ve learned in the process of beating Boost.Pool. This time I’m going to cover inlining: I’ll share some of the guidelines I’ve learned and also give you a look into some of memory’s internal code and design.

Boost.Pool has an (unfair) advantage: It is entirely header-only.

Except for the error handling part and the like. But that is not the performance-critical code path.

If a function is defined in a header, the compiler can inline it very easily.

I mean, the corresponding keyword is even named inline.

And when an entire library is defined in headers - like in the case of Boost.Pool - the compiler sees all the function implementations you call and can inline them without hassle. This can make the code very fast very easily.

On the other hand, my library is not entirely header-only. Although the allocators in question - memory_stack and memory_pool - are in fact templates, they do not contain much of the implementation. To explain that, let’s explore the internal structure of my library a little bit.

In part 1 I’ve very briefly explained that both stacks and pools take huge memory blocks and use those for their allocation. Allocators that operate on huge memory blocks and use a certain allocation strategy on them are called arena allocators. They have to do two things:

- Manage the big memory blocks themselves.
- Implement the actual allocation strategy on those blocks.

And following the single responsibility principle, I’ve split these between two different classes as well: the management of the memory blocks is outsourced to the class template memory_arena, and the allocation is handled by internal classes.

One of them is detail::fixed_memory_stack for memory_stack. It is a stack allocator on a single memory block. The three classes detail::free_memory_list, detail::ordered_free_memory_list and detail::small_free_memory_list are the three free list implementations used by memory_pool.

They all provide the same interface, to allow customization of memory_pool via the node_pool, array_pool and small_node_pool tags. Those tags provide a typedef type naming the corresponding internal class to be used as the free list inside the memory_pool.
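
For illustration, such a tag might look roughly like this - a sketch only, the actual definitions in memory differ in their details:

struct node_pool
{
	// the free list implementation this tag selects
	typedef detail::free_memory_list type;
};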

All internal classes have in common that they do not allocate memory themselves and do not own any of the memory they are working on. And those internal classes are not header-only but defined in source files.

With the help of those internal classes, the arena allocators themselves are straightforward: they just forward to the internal class if it still has memory available; otherwise they request a new memory block from the memory_arena.

For example, this is the entire code of memory_pool::allocate_node:

void* allocate_node()
{
	if (free_list_.empty())
		allocate_block();
	FOONATHAN_MEMORY_ASSERT(!free_list_.empty());
	return free_list_.allocate();
}

If the free list is empty, it requests a new memory block and inserts it into the free list; this is done by the helper function allocate_block(). Otherwise - or afterwards - it can just call free_list_.allocate(). Deallocation is even simpler: it just forwards to free_list_.deallocate().
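
In a simplified sketch, it looks like this:

void deallocate_node(void *ptr) FOONATHAN_NOEXCEPT
{
	// just put the node back onto the free list
	free_list_.deallocate(ptr);
}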

And the allocation functions of the internal classes are themselves quite short, so they are perfect candidates for inlining. Yet only the call into the header-only template is inlined, not the calls to the internal helpers, because those are defined in a source file.

This might surprise you, since everyone tells you that it doesn’t matter whether you declare functions in a header or source file. The compiler is smart enough; inline is just a hint anyway.

I was surprised as well.

Turns out, the compiler cannot inline as well as everybody says.

I am only talking about GCC here. Other compilers might inline better by default.

What helps is so-called link-time optimization (LTO). With it, GCC can inline more of my code. This alone gave me a speedup of up to 500 percent, without changing a single line!

Now that’s how you optimize!

With CMake-based projects and GCC, you have to add -flto to both CMAKE_CXX_FLAGS and CMAKE_EXE_LINKER_FLAGS.
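
For example, something along these lines in your CMakeLists.txt:

set(CMAKE_CXX_FLAGS        "${CMAKE_CXX_FLAGS} -flto")
set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -flto")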

Guideline II: Look at the assembler

At this point you might wonder how I’ve found out that the compiler hasn’t completely inlined my calls.

Or you don’t because the title gave it away.

The answer is simple: I have looked at the generated assembler code. When writing performance critical code you should always look at the assembler to check that all your nice abstractions are optimized away.

It is very easy to see the generated assembler with CMake based projects. Just modify the CMAKE_CXX_FLAGS to include the right flag, e.g. -save-temps under GCC.

You might also want -fverbose-asm, which makes the assembler output a little bit more readable by adding comments. There is also the binutils tool objdump, which can show you the assembler right next to the source code, but it gets very confusing with optimizations enabled. Note that you should disable LTO when looking at the assembler.
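
For GCC and CMake, that could look like this:

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -save-temps -fverbose-asm")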

Then just build your code as normal. Inside the build directory you will find files with the .s extension; this is the assembler output you are looking for.

It is a bit trickier to get the assembler code of templates since they are not actually compiled as long as they are not instantiated. Also, their definition will be put into the file they are instantiated in, not into the file they are defined in (which is a header). What works for me is an otherwise empty file with an explicit template instantiation; you can then find the entire template code in its assembler output.
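
Such a file could look like this - a hypothetical example, substitute whatever template you are interested in:

// asm_probe.cpp - otherwise empty, exists only to produce assembler output
#include "my_allocator.hpp"

// the explicit instantiation forces the compiler to emit the full template code
template class my_allocator<int>;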

Inspecting the assembler to see whether your code is properly inlined sounds more difficult than it is. But don’t worry, you don’t have to actually understand assembler for that.

For example, I’ve never properly learned how to write - or even read - assembler code.

Let’s say you want to know whether a function foo() is inlined. For that, you have to look at a calling function bar() and check whether foo()’s code appears there: you can only see whether a given function has been flattened through inlining of the functions it calls.

Look through the code until you spot some gibberish that contains the name of your calling function - this is its mangled name. There you’ll find its assembler code.

Then look for call or jmp instructions (or something similar) whose operand is the name of the function that should have been inlined. If the assembler code contains those, the calling function is still calling some functions at the assembler level. As a rule of thumb, a call is “worse” than a jmp: a jmp is just a direct jump to a different place in the code, whereas a call is a more expensive “regular” function call.
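
If you want to try this on something small first, here is a hypothetical probe - compile it with g++ -O2 -save-temps -c probe.cpp and search probe.s for a call instruction mentioning foo:

// probe.cpp - is foo() inlined into bar()?
int foo(int x)
{
	return x + 1;
}

int bar(int x)
{
	return foo(x) * 2; // at -O2 this call should be inlined away
}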

What also helps in understanding the assembler is selectively commenting out some code parts to see which statements generate which assembler instructions.

Guideline III: Put performance critical functions into header files

Yes, this guideline is obvious.

Even if you have enabled link-time optimization, the compiler still cannot inline everything that is needed.

Let’s consider memory_stack::allocate() as an example:

void* allocate(std::size_t size, std::size_t alignment)
{
	detail::check_allocation_size(size, next_capacity(), info());
	auto mem = stack_.allocate(block_end(), size, alignment);
	if (!mem)
	{
		allocate_block();
		mem = stack_.allocate(block_end(), size, alignment);
		FOONATHAN_MEMORY_ASSERT(mem);
	}
	return mem;
}

Note: This is actually the code of version 0.5. It has changed significantly due to something that will be covered in a later part of the series.

First, it calls allocate() on the detail::fixed_memory_stack. If that allocation fails because the fixed memory of the internal stack is exhausted, it allocates a new block. Again, the helper function allocate_block() will - just like in memory_pool - request a new memory block from the memory_arena and give it to the internal implementation. After that, the allocation from the fixed stack cannot fail - this is ensured by the check in the first line.
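
In a simplified sketch, allocate_block() does something like this - illustrative only, the real code differs (among other things due to the block caching mentioned later):

void allocate_block()
{
	// request the next big memory block from the arena...
	auto block = arena_.allocate_block();
	// ...and let the internal stack allocate from it from now on
	stack_ = detail::fixed_memory_stack(block.memory);
}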

But note the call to the helper function block_end() that is passed to the fixed stack. It is needed because the stack does not maintain a pointer to the end of the current memory block, just to the current top of the stack.

This was to save space - and because this information is implicitly available one level higher anyway; read on.

But the stack needs this information to determine whether the current memory block still has enough space, so it is passed to the allocation function.

block_end() requests the current block from the memory_arena via its current_block() function. A memory_block consists of a pointer to the memory and its size, so the end of the block can be computed very easily.
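
In code, a sketch of block_end() could look like this - the member names are assumptions following the text, not necessarily the real implementation:

const char* block_end() const FOONATHAN_NOEXCEPT
{
	auto block = arena_.current_block(); // a pointer plus a size
	return static_cast<const char*>(block.memory) + block.size;
}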

memory_arena::current_block() isn’t quite as straightforward, though. Since the arena can grow, i.e. manage multiple memory blocks at once, it needs to store all of them somewhere. It does so by putting them into a singly linked list of memory blocks, where the next pointer of each block is embedded into the block itself. In a fashion similar to memory_stack/memory_pool, memory_arena itself is a template, because it can be customized by a BlockAllocator, and just manages multiple other classes.

One of them is detail::memory_block_stack, which implements this linked list. It looks like so:

class memory_block_stack
{
public:
	// default ctor, dtor, move, swap omitted
	// typedefs omitted

	// pushes a memory block
	void push(allocated_mb block) FOONATHAN_NOEXCEPT;

	// pops a memory block and returns the original block
	allocated_mb pop() FOONATHAN_NOEXCEPT;

	// ...

	inserted_mb top() const FOONATHAN_NOEXCEPT;

	// empty(), size()

private:
	struct node;
	node *head_;
};

Again, the 0.5 version.

Conceptually, it deals with two kinds of memory blocks: those returned directly by the BlockAllocator, which are passed to push() and will be returned by pop(); and the blocks usable by the arena allocator, which are a little bit smaller because they also contain the list node. The top block of the latter kind is returned by top(), which is directly called by memory_arena::current_block().

Because the class only needs a pointer to the first node, the node type itself can be an incomplete type that is only declared in the header and defined in the source file. This allows me to change the node type without affecting clients at all.

push() creates the node type inside the block and adjusts the block size because it is now smaller. It also inserts the block into the list. pop() erases the node from the list and increases the block size again.
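
A sketch of push() - under the assumption of a hypothetical node layout like struct node { node *prev; std::size_t usable_size; } with a static offset constant; the real 0.5 code may differ:

void memory_block_stack::push(allocated_mb block) FOONATHAN_NOEXCEPT
{
	// construct the list node at the beginning of the new block;
	// the usable size shrinks by the node's footprint
	auto n = ::new (block.memory) node(head_, block.size - node::offset);
	head_ = n;
}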

top() does not need to adjust the block size, but it does need to adjust the pointer: it points to the node structure, which must eventually be returned to the BlockAllocator but must not be overwritten by the arena allocator. It looks like so:

memory_block_stack::inserted_mb memory_block_stack::top() const FOONATHAN_NOEXCEPT
{
    FOONATHAN_MEMORY_ASSERT(head_);
    auto mem = static_cast<void*>(head_);
    return {static_cast<char*>(mem) + node::offset, head_->usable_size};
}

Because top() requires access both to node’s member variables and to node::offset - which requires knowing the size and thus the full definition of node - it cannot be put into the header directly: there, only the declaration of node is available. And, more importantly, the compiler is then unable to inline the call to top(), and thus ultimately the call to block_end() inside memory_stack.

This is bad.

The overhead of a function call is bigger than the actual cost of the allocation code here!

So to avoid this overhead, I chose speed over compile-time insulation and moved the definition of memory_block_stack::node into the header, allowing top() to be defined there as well.

As a matter of fact, I’ve put the entire implementation of detail::fixed_memory_stack into the header, too. But that class has changed substantially anyway.

Guideline IV: Identify performance critical code paths

Before you now blindly follow guideline III and move all functions called by a performance-critical function into header files, let me tell you the next guideline.

All but the most trivial functions have multiple paths of execution: the normal code path, the abnormal code path taken in case of an error, and maybe others. Look at each of the code paths and identify those that are taken in most cases. Then optimize only those.

For example, take a look at memory_stack::allocate() again:

void* allocate(std::size_t size, std::size_t alignment)
{
	if (size > next_capacity())
		handle_error();
	auto mem = stack_.allocate(block_end(), size, alignment);
	if (!mem)
	{
		allocate_block();
		mem = stack_.allocate(block_end(), size, alignment);
		FOONATHAN_MEMORY_ASSERT(mem);
	}
	return mem;
}

I have manually inlined the call to detail::check_allocation_size() in the first line.

This function has four code paths, three directly visible:

1. The size is too big: the check in the first line fails and handle_error() is called.
2. The stack has enough memory left: stack_.allocate() succeeds and the pointer is returned right away.
3. The stack does not have enough memory left: a new block is requested via allocate_block() and the allocation is retried.
4. Hidden inside allocate_block(): requesting a new memory block can itself fail, which triggers error handling.

allocate_block() has in fact many more possible code paths because the memory_arena also implements caching of recently freed memory blocks. But conceptually, only two matter for this discussion: it either succeeds (case 3) or fails (case 4).

Of those four cases, the second one is - by far - the most common. Cases 1 and 4 are error handling routines, which do not need to be optimized by definition, and case 3 is expensive anyway (it has to allocate new memory from the OS in the default implementation).

Case 2 is also the one where inlining matters most, because there the allocation itself consists of only a few fast instructions. For that reason, I’ve taken special care to inline everything on that path, but not on the others. For example, case 3 will ultimately call detail::memory_block_stack::push() to save the new block, and that function is not put into the header file.

Guideline V: Help the compiler with debugging functions

Incorrect memory management can lead to many difficult-to-trace errors. For that reason, good (memory related) libraries provide ways to help with debugging. Mine is no exception.

In debug mode, a complex system of debug checks and facilities is active. Those can either detect common errors all by themselves - like buffer overflows or many cases of invalid deallocation pointers/double-frees - or help the user in detecting them - like use-after-free. Of course, those facilities have a significant overhead and are thus disabled in release mode. Then they should have zero overhead; it should be as if they didn’t exist in the first place!

A common way to implement them is to ensure exactly that: that they aren’t there if disabled.

This means macros.

But I absolutely hate interface macros, PREFIX_THEY_ARE_HORRIBLE(true). Thus I use them only when absolutely necessary and find other ways of implementing things whenever I can.

I look forward to the day when I can get rid of the assertion macros and use std::source_location instead.

A full explanation of the debug system is out of scope here.

Maybe a topic of a future blog post?

Instead, let’s just focus on detail::debug_fill(). It works similarly to std::memset() and fills an array with a certain value, but only if FOONATHAN_MEMORY_DEBUG_FILL is set to true.

It is for example called after memory is freed in order to help detect use-after-free errors. But this function is the basis for many more checks and is thus frequently called in all allocators. According to guideline IV, it is thus extremely important that it completely disappears if debug filling is disabled.

I’ve implemented it like so; debug_magic is an enum specifying the different values:

#if FOONATHAN_MEMORY_DEBUG_FILL
    void detail::debug_fill(void *memory, std::size_t size, debug_magic m) FOONATHAN_NOEXCEPT
    {
        // simplified
        std::memset(memory, static_cast<int>(m), size);
    }

    // other functions omitted
#else
    void detail::debug_fill(void *, std::size_t, debug_magic) FOONATHAN_NOEXCEPT {}

    // likewise
#endif

If FOONATHAN_MEMORY_DEBUG_FILL is false, the function has an empty body. A function with an empty body should be completely optimized out, right?

Well, this code is in a source file. And as it turns out, the compiler performs the entire setup for a function call just to immediately return in the called function!

This is a huge overhead, which is completely unnecessary!

Thus, in order to achieve proper inlining, I’ve extracted the empty definitions of all debug functions into the header files. Only then do they really disappear from the assembler output.
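
In the header, the result looks roughly like this - a simplified sketch of the pattern, not the verbatim code:

#if FOONATHAN_MEMORY_DEBUG_FILL
	// the real implementation stays in the source file
	void debug_fill(void *memory, std::size_t size, debug_magic m) FOONATHAN_NOEXCEPT;
#else
	// empty inline definition the compiler can optimize away entirely
	inline void debug_fill(void *, std::size_t, debug_magic) FOONATHAN_NOEXCEPT {}
#endif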

The implementation for the debug case has stayed inside the source files, because in debug mode nobody cares about performance anyway.

Conclusion

Allowing more and better inlining wasn’t the only optimization I’ve done, but it alone was responsible for around 50% of the entire speedup.

Ensuring that certain performance-critical functions are inlined can thus give you a huge performance boost. I recommend following these guidelines to make your code faster.

Of course, don’t follow them blindly! Profile, optimize, profile again.

In the next post I’ll deal with branches.

This blog post was written for my old blog design and ported over. If there are any issues, please let me know.