New integer types I’d like to see

29 Sep 2022 by Jonathan

(Most) C++ implementations provide at least 8, 16, 32, and 64-bit signed and unsigned integer types. There are annoying implicit conversions, discussions about undefined behavior on overflow (some think it’s too much UB, others think it’s not enough), but for the most part they do the job well. Newer languages like Rust copied that design, but fixed the conversions and overflow behavior.

Still, I think there is room for innovation here. Let me talk about three new families of integer types I’d like to see.

Symmetric signed integers

The de-facto representation of signed integers on modern hardware is two’s complement. There positive values have the most significant bit set to zero, while it is set for negative values. To get the absolute value of a negative number, flip all bits and add one.

For example, on an 8-bit integer, 42 is 0b0'0101010: sign bit zero, the rest representing 42 in binary. On the other hand, -42 is 0b1'1010110: if you flip all bits you get 0b0'0101001, add one and you’re back at 0b0'0101010, which is 42. Crucially, 0b1'1111111 is -1 and more negative values go down to 0b1'0000000, which is -128.

Notice something interesting about the last value? If you do the conversion, flipping gives you 0b0'1111111, and the addition of one results in 0b1'0000000 – overflowing into the sign bit!

This means there absolute value of the smallest value is bigger than the absolute value of the biggest value 0b0'1111111 or 127; there are more negative values than positive values, because the positive half where the sign bit isn’t set also contain the number zero.

I really don’t like this asymmetry – it leads to annoying edge cases in all sort of integer APIs.

For starters, abs(x) for integers x isn’t a total function: abs(INT_MIN) isn’t representable. Likewise, x * (-1) isn’t total either: INT_MIN * (-1) overflows. It gets even funnier when you consider division: surely x / y cannot overflow as it division makes things smaller, right? Wrong, INT_MIN / (-1) overflows (and raises a division by zero (!) error on x86). Furthermore, INT_MIN % (-1) is also UB.

So here’s my wish: a signed integer where INT_MIN == -INT_MAX, by moving the minimal value one higher.

First, you’re not losing anything useful: I’d argue the extra negative number isn’t important in most use cases. After all, if you had an extra number, wouldn’t it make more sense to have an additional positive number instead of a negative number?

Second, you’re getting symmetry back. All the operations mentioned above are now symmetric and can’t overflow. This makes them a lot easier to reason about.

Third, you’re getting an unused bit pattern 0b1'0000000, the old INT_MIN, which you can interpret however you like. While you in principle could turn it into some sort of negative zero for extra symmetry, please don’t (just use one’s complement or sign magnitude instead). Instead we should copy a different feature from floating point arithmetic: not-a-number or NaN. Let’s call it INT_NAN = 0b1'0000000.

Just like floating point’s NaN, INT_NAN isn’t a valid integer value. In an ideal world, it would also be sticky on arithmetic, so INT_NAN ¤ x == INT_NAN, but that requires hardware support for efficiency. Instead, let’s just say arithmetic on INT_NAN is undefined behavior; sanitizers can then insert assertions in debug mode.

Why do I want INT_NAN and thus add an additional precondition to every integer arithmetic?

Because I really like sentinel values.

For example, with INT_NAN, it is possible to have sizeof(std::optional<int>) == int: instead of having to store an additional boolean to keep track of the existence of an optional, we can just store INT_NAN instead. Likewise, a closed hash table needs some way to distinguish between empty and non-empty entries. Having a sentinel value removes the need for additional meta data.

Now you might not like it that we’re picking a sentinel value here. What if you want to store INT_NAN in an std::optional<int>?

Well, INT_NAN isn’t a number, so why do you want to store it in an int? Only if you need some sort of sentinel value on your own. This is similar to NaN boxing of floating point values – you lose the ability to store (most) NaNs, but gain more efficient storage. However, unlike floating point arithmetic where e.g. 0/0 can result in NaN, under my model, no arithmetic operation on integers can result in INT_NAN as overflow is undefined behavior. So you really need to get out of your way by assigning INT_NAN to introduce integer NaN’s in your code.

You might not like that I suggest arithmetic on INT_NAN should be UB if you’ve been burned by aggressive compiler optimizations in the past. However, UB in the standard by itself is not a bad thing; UB literally means that the standard poses no requirements on the behavior, which gives the compiler the most freedom. They can assume it does not happen and optimize accordingly, but they can also insert debug assertions (either catching all, or rough checks with false negatives), or give it well-defined behavior. Most mainstream compilers do the first interpretation by default, but for example I’m currently working on a C interpreter, where it will panic on all instances on UB.

Unsigned integers with one bit missing

Using unsigned integer types in C++ is controversial, to say the least. An argument in favor is the ability to express in the type system in the type system when something cannot be negative. An argument against is the fact that subtraction easily results in big values due to integer overflow at zero.

Unrelated, but “integer underflow” is not a thing. Exceeding the minimal value of an integer is still integer overflow. Underflow occurs when you have a number that is too close to zero to be represented as a floating point.

As a proponent of unsigned integer types, I can’t argue against the annoying overflow on subtraction. It can cause all sorts of nasty bugs from buffer overflow to out of memory errors. Switching to a signed integer makes sense here as a very negative value makes it really obvious that an error occurred, and that is also the position many people take and use signed for everything. This is a shame, since you lose the ability to express yourself in the type system.

Since you apparently don’t need the extra bit of storage space in many situations, I’d like to have something that is logically a 63 bit unsigned integers as opposed to a 64 bit one. In assembly, it is represented the same way a 64 bit signed integer would. However, it is undefined behavior if it ever stores a negative value. This is similar to bool: it is logically a single bit but physically represented as a byte.

This sounds just like a signed integer with extra steps and more UB, so why bother?

First, compared to int, it automatically comes with a precondition that it cannot be negative. Second, fewer overflow checks are necessary in debug mode. Since we’re having a wide range of invalid values, we can just check the value whenever we do a store operation and not after every single arithmetic operation. Sure, we could overflow in an intermediate expression and then undo the overflow in subsequent arithmetic, but it is a nice trade-off between performance and safety.

Again, this sort of lax debug check is not possible if the standard were to require program termination on overflow.

In fact, those unsigned integer types are exactly equivalent to using the corresponding signed integer type and asserting that it is non-negative whenever necessary. It is just built into the type system and implemented by the compiler, instead of a programmer written precondition.

Distinct bit vectors vs integer type

I’ve recently worked on a compiler for a language that makes a distinction between signed integer types, unsigned integer types, and bit integer types. The first two families support only arithmetic operations while only the last one support bit operations. I was skeptical at first but come to really like the distinction.

I always found it weird how we treat an integer type both as a number and do arithmetic on it while also modifying individual bits. When you’re doing math, you rarely need to modify individual digits! This is especially true with signed integers, where the sign bit messes everything up and leads to implementation-defined or undefined behavior on shift operations.

Sure, distinguishing the two makes writing optimizations like shift instead of division or bitwise and instead of modulo more annoying to write, and some fancy bit hacks require more casts, but it also makes it really obvious what’s going on: you’re starting to treat an integer as a sequence of bits for some performance benefit, which requires some sort of documentation.

When I mentioned the shift optimizations, I’m not talking about replacing x / 2 by x >> 1 – the compiler is going to do that. I’m talking about things like hash % hash_table_size, where hash_table_size is always a power of two, but the compiler can’t know that.

While we’re at it: why do we reserve so many tokens for bit operations? How often do you actually need |, & or ~ to warrant an entire character that can’t be used for anything else? Not to mention that they have the wrong precedence in C, and aren’t really useful on their own: you often want is_bit_set(x, n), extract_bits(x, low, high) or other higher-level operations built on top of the bitwise operations. I’d like to see the operations delegated to (built-in) standard library functions, so we can re-use the operators for something else like pipelines.

Conclusion

There are a lot of new languages popping up in the C++ space recently, I’d love to see some of them experiment in such a fundamental area.

Symmetric signed integers make so many fundamental APIs nicer, and a 63 bit unsigned size_t can combine the best of the signed/unsigned world for containers. Sure, you still want the real unsigned types in situations where you want the extra bit since it doubles the range, but I think it would be fine to not have the true INT_MIN except for interop with C. Distinct bit vectors can make code more expressive, but the casts can also get really annoying. I’d still like to see someone try.