What is the unit of a text column number?

I’ve recently published my parsing combinator library lexy. One of the things it does is issue a lexy::error if the input does not match the grammar. This error has a .position() which gives you the position where the error occurred.

In order to keep the happy path fast, .position() is not something that is easy to use for end users: it is simply an iterator into the input range. This is no good to a human user who wants something like line and column number to easily locate the problematic input.

Converting an iterator into line/column location seems simple enough: set line = column = 1 and iterate over the entire input until you’ve reached the position of the iterator. Every time you see a newline, increment the line number and set the column number back to 1. Otherwise, the column is incremented every time you … see what exactly?

What exactly is a “column” of a text and how do I compute it?

Approach #1: Count chars

Let’s just write the basic version without thinking too much about the problem:

#include <utility>

template <typename Input, typename Iterator>
auto find_location(const Input& input, Iterator position)
{
    auto line   = 1;
    auto column = 1;

    for (auto iter = input.begin(); iter != input.end(); ++iter)
    {
        if (iter == position)
        {
            // We found the location.
            break;
        }
        else if (*iter == '\n') // End of a line.
        {
            ++line;
            column = 1;
        }
        else
        {
            // Advance one column.
            ++column;
        }
    }

    return std::make_pair(line, column);
}

When we encounter a newline, we advance to the next line. Otherwise, we increment the column. Once we’ve reached the position in the input we’re looking for, we exit the loop and return the result.
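
For illustration, a hypothetical call (the input and error position here are made up):

const std::string input = "foo();\nbar(42;";
// Suppose the error iterator points at the second ';' (index 13).
auto [line, column] = find_location(input, input.begin() + 13);
// line == 2, column == 7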

This works and is quite simple and intuitive. I’ve tested a couple of text editors and compilers, and it seems like this algorithm is used by clang, GCC before version 11, as well as neovim’s col() function.

Yet this algorithm is “wrong”.

We’re counting the number of chars in a line, which in a Unicode world has no relation to any notion of “character”. Input like ä, €, or 𝔄 will count for 2, 3, and 4 columns respectively in UTF-8, but 1, 1, and 2 in UTF-16.
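
To make this concrete, here is a small illustration using string literals; it assumes the source file itself is saved as UTF-8:

static_assert(sizeof(u8"ä") - 1 == 2, "2 UTF-8 code units");
static_assert(sizeof(u8"€") - 1 == 3, "3 UTF-8 code units");
static_assert(sizeof(u8"𝔄") - 1 == 4, "4 UTF-8 code units");

static_assert(sizeof(u"ä") / sizeof(char16_t) - 1 == 1, "1 UTF-16 code unit");
static_assert(sizeof(u"€") / sizeof(char16_t) - 1 == 1, "1 UTF-16 code unit");
static_assert(sizeof(u"𝔄") / sizeof(char16_t) - 1 == 2, "2 UTF-16 code units");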

So we need to do better.

Approach #2: Count code points

Let’s assume the input is encoded in UTF-8 for the sake of discussion. UTF-8 is a multibyte encoding, which means that some “characters” are encoded using a sequence of char. A single char is called a code unit and a sequence of code units is used to encode a code point. “Characters” like ä, €, or 𝔄 are one code point, but encoded as multiple code units.

So we need to count code points, not chars:

for (auto iter = input.begin(); iter != input.end(); )
{
    if (iter == position)
    {
        // We found the location.
        break;
    }
    else if (*iter == '\n') // End of a line.
    {
        ++iter; // Skip the newline itself.
        ++line;
        column = 1;
    }
    else
    {
        // One code point is a column.
        skip_code_point(iter, input.end());
        ++column;
    }
}

The function skip_code_point() does the necessary logic to advance the iterator to the next code point. This is not too complicated (you just need to look at the bit pattern of the initial code unit), so I’ve omitted it here for brevity.

For reference, here is lexy’s straightforward implementation. It also extracts the value of the code point and handles various encoding errors such as overlong sequences. None of that is really necessary here though.
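
As a rough idea, here is a minimal sketch of what such a skip_code_point() could look like for a UTF-8 input, without any of lexy’s validation:

template <typename Iterator>
void skip_code_point(Iterator& iter, Iterator end)
{
    // The bit pattern of the initial code unit tells us how long the sequence is.
    auto initial = static_cast<unsigned char>(*iter);

    auto length = 1;
    if ((initial & 0b1110'0000) == 0b1100'0000)
        length = 2; // 110xxxxx: two code units
    else if ((initial & 0b1111'0000) == 0b1110'0000)
        length = 3; // 1110xxxx: three code units
    else if ((initial & 0b1111'1000) == 0b1111'0000)
        length = 4; // 11110xxx: four code units

    while (length-- > 0 && iter != end)
        ++iter;
}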

Counting code points means that even multibyte “characters” are treated as a single column and we’re no longer exposing their actual encoding. This algorithm seems to be used by the Rust compiler.

So, counting columns is a bit more complicated than you’d initially expect but it’s still manageable. lexy already provided rules to match Unicode code points, so let’s just use those in the actual implementation and call it a day.

Except it’s not that simple.

Handling text is never that simple.

Approach #3: Count grapheme clusters

Notice how I put “character” in quotes?

That’s because a “character” doesn’t really have a precise definition like code unit or code point. The closest thing to what a non-tech person would describe as a character is a Unicode grapheme cluster: a string that roughly corresponds to a single glyph in the font.

And of course, a single code point isn’t always enough to encode one grapheme cluster – you might need multiple. You can combine many Latin characters with special code points to form characters such as f̃, w͜͝, or s̷̙̃, which are 2, 3, and 4 code points respectively. There are also scripts such as Hangul or Thai which make use of multiple code points that are combined when rendered – and then there are emojis.

Emojis easily combine many, many code points into one symbol. It begins with flag emojis such as 🇪🇺, which is actually a special “E” followed by “U”, continues with emojis such as 🧑‍🔬 (scientist), which is 🧑 (person) glued together with 🔬 (microscope) using a special joiner code point, and ends at the absolute pinnacle of code point combinations – the family emoji 👪. How do you make a family? Easy, you take a person (with optional skin tone and gender modifier) and glue it with another person, as well as their children. That way you can easily end up with a single “character” consisting of ten or more code points!
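
For reference, the two combined emojis above decompose into these code points:

🇪🇺  = U+1F1EA REGIONAL INDICATOR SYMBOL LETTER E
    + U+1F1FA REGIONAL INDICATOR SYMBOL LETTER U

🧑‍🔬 = U+1F9D1 PERSON
    + U+200D  ZERO WIDTH JOINER
    + U+1F52C MICROSCOPE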

So to properly count “characters”, we need to advance the position not by a code point, but by an entire grapheme cluster. This is what “real” text programs such as LibreOffice do.

While this is certainly doable, it seems complicated (and I’m not even sure that covers emoji sequences…?). So before implementing it, let’s make sure this is actually the approach we want.

Approach #4: Count virtual columns

When reporting an error, the compiler also underlines the relevant part of the input:

error: this is not how things work!
  my_really_cool_program(42);
                         ^^ this is wrong

For that, it needs to know how many spaces to print before printing the underline. If we define a column as that number of spaces, this is also referred to as a virtual column. It is reported by neovim’s virtcol() function and used by GCC since version 11 (as recommended by the GNU standard, apparently).

Counting the number of equivalent spaces is not trivial in general, as that depends on the font. However, here we can safely assume a monospace font where every glyph has the same width (mono space, right?).

Except of course it doesn’t.

Most Chinese, Japanese or Korean characters are rendered twice as wide as most other characters, even in a monospace font:

1234 // 4 characters
全角 // 2 characters

And there are also wide versions of some normal characters, such as Ａ (not A). But there is a Unicode standard and a lookup table, so that doesn’t seem too bad.
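
On POSIX systems, one such lookup is exposed as wcwidth() – it is not part of standard C++, and how up to date its data is depends on the C library. A minimal sketch:

#include <clocale>
#include <wchar.h>

int display_width(wchar_t code_point)
{
    // Requires a UTF-8 locale to have been set beforehand,
    // e.g. std::setlocale(LC_ALL, "en_US.UTF-8").
    // Returns 1 for narrow characters, 2 for wide ones (e.g. most CJK),
    // and -1 for non-printable ones.
    return wcwidth(code_point);
}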

Except that this doesn’t cover emojis, which are also rendered twice as wide:

12
🙂

And then there is \t, the tab character. Dare I ask: how many spaces is a tab?

GCC seems to say “8”, for some reason. This awful choice means that the underline alignment breaks when I view an error message in neovim’s embedded terminal, where \t is rendered as four spaces, but the underline assumes it’s eight.

I mean, it would break if I were to use a literal \t character in my source code. I’m not a monster, so it doesn’t.

The incompatibilities between GCC and neovim don’t stop there either: remember those emojis glued together from multiple code points?

Well, of course neovim doesn’t render them properly. 🧑‍🔬 isn’t displayed as 🧑‍🔬 but as 🧑<200d>🔬, where 200d is the value of the glue code point. This means that, according to neovim, 🧑‍🔬’s virtual column length is 2 (first emoji) + 6 (length of '<200d>') + 2 (second emoji), so you need 10 spaces to account for it in the underline. GCC, however, prints only 4 spaces (2 for each emoji, and 0 for the invisible glue code point), which means it also gets misaligned in neovim’s terminal.

And can you really blame it?

In my “real” terminal, 🧑‍🔬 is rendered as 🧑🔬, so printing four spaces is correct there (although that’s only because my terminal doesn’t render it properly either; if it did, it would be two). So to answer “how many spaces is this character wide?”, we still need to ask the environment/font we’re using – even for monospace fonts!
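
To make that dependence on the environment explicit, here is a minimal sketch of a virtual column computation. Both the tab width and the width lookup (for example wcwidth() from above) have to be supplied by the caller, and decode_code_point() is a hypothetical helper that decodes one code point and advances the iterator:

template <typename Iterator, typename WidthFn>
int virtual_column(Iterator line_begin, Iterator position, int tab_width, WidthFn width)
{
    // Number of spaces needed before `position`, i.e. the length of the underline prefix.
    auto column = 0;
    for (auto iter = line_begin; iter != position;)
    {
        if (*iter == '\t')
        {
            // Jump to the next tab stop.
            column += tab_width - column % tab_width;
            ++iter;
        }
        else
        {
            auto cp = decode_code_point(iter, position); // hypothetical; advances iter
            column += width(cp); // 1 or 2, depending on the environment's lookup table
        }
    }
    return column;
}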

Needless to say, this approach doesn’t seem right either.

And now what?

So, to recap, we’ve seen four approaches:

- counting code units (chars),
- counting code points,
- counting grapheme clusters, and
- counting virtual columns.

What should we do?

To answer that we have to take a step back and actually look at why we need column information in the first place. In particular, there are two distinct use cases: editors and compilers.

For an editor, we display columns to inform the user about the cursor position. There, I think counting grapheme clusters is the right approach. This has the advantage that the column directly corresponds to “how often do I need to press l (or the right arrow key) to get to that column”, as cursor movement is also based around grapheme clusters. Telling the user “you’re at position 5”, which means “press the arrow key five times to get there”, is quite nice.

For a compiler, we display columns so the user can locate the position of an error. If the user looks at the output and then manually goes to that error location, this should also be the number of grapheme clusters, as that corresponds to arrow movement.

But nobody looks at an error message and manually navigates to the location using the column information! Your IDE/vim setup automatically jumps to the error location (or you just look at the underline and manually go there without looking at the column at all).

This means that the error location should be written in a format that is easily parseable by the IDE, in units that are easy to compute – i.e. code units. Counting code units is simple and fast and there is only one unique definition of it.

Contrast this with virtual columns, which is what GCC is going to use: to compute them properly, you depend on the environment! In particular, neovim’s and GCC’s definitions disagree, which means automatic jumping to an error location is impossible. GNU’s decision to use virtual columns by default in the future seems misguided.

Don’t get me wrong – virtual columns have their place, e.g. for computing the underline. But even then, it is entirely non-trivial to compute: do I report the correct value of two for 🧑‍🔬, or am I bug-compatible with most terminals and say it’s four? In either case, it doesn’t work inside neovim, as there it’s rendered differently still. Not to mention tab, where there is no correct answer that works everywhere.

Using such a brittle unit with no clear definition in something that should be parsable by machines is just asking for trouble. I can understand why neovim chooses to use it as its column position: it is the one that most closely resembles an actual column. But I don’t think even that is useful for a user: why would you need to know the equivalent number of spaces to indicate a position?

That leaves code points, which are a middle ground: complicated to compute and not really useful for users. However, unlike code units, they are independent of the actual encoding. So if you have an input file in UTF-16, but the compiler uses UTF-8 internally, giving positions in code points gives the same result for compiler and editor.

One scenario where this happens is with the use of a language server. Input files are usually UTF-8, but the language server protocol assumes UTF-16. Indicating column information in code points would be ideal, but they use UTF-16 code units instead, which requires servers to transcode. Note that there is an open issue to use code points instead, as that would be portable.
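
For illustration, here is a rough sketch of the transcoding a server has to do to turn a UTF-8 code unit column into the UTF-16 code unit column the protocol expects (assuming valid UTF-8 and ignoring error handling):

#include <cstddef>
#include <string_view>

std::size_t utf8_column_to_utf16(std::string_view line, std::size_t utf8_offset)
{
    std::size_t utf16_units = 0;
    for (std::size_t i = 0; i < utf8_offset && i < line.size();)
    {
        auto initial = static_cast<unsigned char>(line[i]);
        if (initial < 0x80)
        {
            i += 1; // ASCII, one UTF-16 code unit
            utf16_units += 1;
        }
        else if (initial < 0xE0)
        {
            i += 2; // two-byte sequence, still one UTF-16 code unit
            utf16_units += 1;
        }
        else if (initial < 0xF0)
        {
            i += 3; // three-byte sequence, still one UTF-16 code unit
            utf16_units += 1;
        }
        else
        {
            i += 4; // four-byte sequence, needs a UTF-16 surrogate pair
            utf16_units += 2;
        }
    }
    return utf16_units;
}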

Conclusion

Summarized in one table:

Counting            Machines   Humans               Portable
Code units          easy       not useful           no
Code points         moderate   not useful           yes
Grapheme clusters   hard       useful               yes
Virtual columns     hard       not really useful?   absolutely not

So, use code units as the unit if the location is intended to be parsed by machines (such as compiler error messages); use grapheme clusters as the unit if the location is intended to be useful to humans (such as in text editors).

Use code points over code units if you need to communicate between different encodings.

Use virtual columns only if that’s what you actually need (e.g. to align multiple lines). Using it as a portable output format such as in error messages is just asking for trouble.

In lexy, the unit was and is actually customizable – so you can define column as “number of As in the line” if you want. But I’m definitely going to discuss this issue a bit more in the documentation.