Pointer arithmetic can be safe

Advocates of memory-safe languages sometimes contrast them with C by saying that they don't have pointers, or (when someone points out how impractical they'd be if that were really true) that they don't have pointer arithmetic. This is supposed to make them safer. Because pointer arithmetic is unsafe, right?

Not necessarily. Pointer arithmetic in C happens to be unsafe, but this is not a problem with pointer arithmetic, only with C — or rather with its conventional implementation technique. C pointers are usually implemented as raw addresses, and pointer arithmetic as simple arithmetic on those addresses. The C standard, however, doesn't require this. It only requires pointer arithmetic (and comparisons) to work on pointers to the elements of an array (or one past the end), and it leaves the behavior of dereferencing a pointer that doesn't point into an array undefined. It doesn't require bounds checking, but it doesn't prohibit it either. So it's possible to make a conforming C implementation with bounds checking on all pointer operations.

This has been done. Zeta-C (source here) was a C compiler for Lisp machines, which don't support unsafe array access at all. Scott Burson, the author, explains how it handled this:

All pointers were represented as pairs of an array and an index

Pointer arithmetic operated on the index, leaving the array intact. Pointer dereferences used the index with the ordinary, safe array operations, so all pointer dereferences were bounds-checked. Since Zeta-C also fixed C's other memory-safety problems (free did nothing, uninitialized variables were not garbage, and casts could not forge pointers), it was a memory-safe C compiler. This was part of its attraction — people didn't use it because they wanted to run C programs on Lisp machines, but because they wanted to debug their C programs in a safe implementation with the Lisp machine's excellent tools.
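In modern C terms, the idea looks something like the sketch below. This is my illustration, not Zeta-C's actual code (which was Lisp, not C), and the names are made up: a pointer is a struct carrying the array, its length, and an index; arithmetic touches only the index, and every dereference is checked against the length.

  #include <stddef.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* Illustrative fat pointer: an array (with its length) plus an index,
     instead of a raw address. Names are made up, not Zeta-C's. */
  typedef struct {
      int    *base;   /* start of the underlying array */
      size_t  len;    /* number of elements in that array */
      size_t  index;  /* current offset within it */
  } fat_ptr;

  /* Pointer arithmetic adjusts only the index; the array part is untouched. */
  static fat_ptr fat_add(fat_ptr p, ptrdiff_t n) {
      p.index += (size_t)n;   /* may legally sit one past the end */
      return p;
  }

  /* Every dereference goes through a bounds check against the array length. */
  static int fat_deref(fat_ptr p) {
      if (p.index >= p.len) {
          fprintf(stderr, "out-of-bounds dereference at index %zu\n", p.index);
          abort();
      }
      return p.base[p.index];
  }

  int main(void) {
      int a[4] = {10, 20, 30, 40};
      fat_ptr p = {a, 4, 0};
      printf("%d\n", fat_deref(fat_add(p, 3)));   /* prints 40 */
      /* fat_deref(fat_add(p, 4)) would abort instead of reading past the end. */
      return 0;
  }

This only shows the bounds-checking part; the other fixes Zeta-C made (to free, uninitialized variables, and casts) need more than a change of pointer representation.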

C programmers are well aware that memory unsafety is their biggest problem, and many other tools have been written to deal with it, but few of them recreate this feature. The technique is known to implementors (often by the name “fat pointers”) and is available as a patch for GCC. But it's not considered a standard part of the C debugging toolkit, even though it's easier to implement, and has a smaller performance cost, than commonly used tools like valgrind. I don't understand why. Wouldn't it be nice if your C compiler had a --safe mode which eliminated most memory safety bugs?

Update December 2013: The big problem with fat pointers is that they're incompatible with ABIs that use raw pointers, and virtually all interesting C programs make system or library calls through such an ABI. So a practical implementation of fat pointers needs to support raw pointers too, which adds complexity and greatly reduces the benefit.
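Concretely (continuing the made-up fat_ptr representation from the sketch above), every call through a raw-pointer ABI has to flatten the fat pointer to a plain address, and the bounds information stops there:

  #include <stddef.h>
  #include <string.h>

  /* The same illustrative fat pointer as in the earlier sketch. */
  typedef struct { char *base; size_t len; size_t index; } fat_ptr;

  /* At the ABI boundary the bounds are stripped off; only the raw
     address can be passed to the callee. */
  static char *fat_to_raw(fat_ptr p) {
      return p.base + p.index;
  }

  static void zero_fill(fat_ptr dst, size_t n) {
      /* Nothing checks that n <= dst.len - dst.index; once memset has the
         raw address, it will happily run past the end of the array. */
      memset(fat_to_raw(dst), 0, n);
  }

  int main(void) {
      char buf[8];
      fat_ptr p = {buf, sizeof buf, 0};
      zero_fill(p, sizeof buf);   /* fine */
      /* zero_fill(p, 100) would overrun buf, with no check anywhere. */
      return 0;
  }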

Scott also tells a cute story about portability:

If you looked real closely, there were lots of little corners of C semantics where ZETA-C was not correct. In practice, however, one very rarely tripped over any of these.

For instance, I used Lisp integers for C int and long. This meant bignums would be created automatically, as usual in Lisp. Technically this is not a correct C implementation (even though I don't think the standard specifically says that the length of int and long shall be finite, one can take this as implied) but it very rarely ran into trouble. The only such case I remember, which was rather amusing, was a program that did something like

  int i;
  for (i = 1; i; i <<= 1) ...

(shifting a 1 bit left repeatedly, expecting it to fall off the left end of the word).

On a Lisp machine the bit never fell off; it just grew into a bignum, so the loop never terminated. Who expects their C programs to run on a machine with infinite word size?
