by MostlyStatic

Saturday, April 14, 2018

Duff’s device — beautiful C

I love how C/C++ is so loose with syntax. The grammar that defines C/C++ is really minimal: everything that is not rejected is accepted (rather than the other way around). This leads to many super interesting edge cases. One example is Duff’s device [1]. Tom Duff was working at Lucasfilm, and he was trying to optimize some code that was copying memory to a piece of hardware.

The code he was looking at was this:

void send(register short *to, register short *from, register count) {
  do
    *to = *from++;
  while (--count > 0);
}

In C, register is a hint to the compiler, advising it to store that variable in a processor register instead of in memory (in 1984 when Duff came up with this, register was more than a hint, today on most hardware register means nothing). This code just copies count bytes from from to to.

Loop unrolling used to be a reasonable way to optimize a for loop. In loop unrolling, instead of iterating the for loop over i from 0 ... N you iterate ii from 0 ... floor(N/M) with a step size of M and replace the body with M copies of the body wherein i goes through ii, ..., ii+M-1 and you deal with the possible overflow (the overflow happens if M doesn’t divide N). That means you will have to do fewer jumps (JNEs) in the ASM (at the expense of a larger body). This ‘optimization’ is enabled by gcc flags such as -faggressive-loop-optimizations, -funroll-all-loops, -floop-unroll-and-jam. Today none of these flags are included in the gcc optimization flags -O1, -O2, -O3. That’s because the M extra jumps are now so fast compared to anything in the body that there's either no difference in speed, or the code is now slower due to pageing of the executable.

But back in 1984, Duff knew those jumps would make a difference, and -funroll-all-loops didn't exist, so he wanted to manually unroll that loop. He wanted to unroll it into multiples of 8, because that was optimal for his hardware. He came up with this code:

void send(register short *to, register short *from, register count) {
  register n = (count + 7)/8;
  switch (count % 8) {
    case 0: do { *to = *from++;
      case 7: *to = *from++;
      case 6: *to = *from++;
      case 5: *to = *from++;
      case 4: *to = *from++;
      case 3: *to = *from++;
      case 2: *to = *from++;
      case 1: *to = *from++;
    } while(--n > 0);  
  }
}

This code is wild because it interleaves a while loop within the labels of a switch statement, and also makes use of the ‘pass through’ feature of the switch statement. Basically, on the first pass, the switch statement bumps you to the case that you would need in order to not overflow the buffer (and instead exactly exhaust the whole buffer) after n-1 subsequent full passes through the while. After the first pass, the case labels are ignored. After Tom Duff published this device, a panel of members of the ANSI C committee convened to rule on whether or not it was correct C code: they said it was correct (what is not rejected is accepted).

This totally ingenious syntax is valid C/C++, and back in the 80s would produce the fewest instructions, and fastest execution required to implement this send. Tom Duff said this about the device: ‘It amazes me that after 10 years of writing C there are still little corners that I haven’t explored fully ... Many people have said that the worst feature of C is that switches don't break automatically before each case label. This code forms some sort of argument in that debate, but I’m not sure whether it's for or against.’

Tom Duff first wrote about this in May 1984, the same month that Indiana Jones and the Temple of Doom was screened [1]. It’s not impossible that one thing lead to another.

Happy hacking!

[1] Re: Explanation, please! https://www.lysator.liu.se/c/duffs-device.html