by MostlyStatic

Saturday, April 14, 2018

Duff’s device — beautiful C

I love how C/C++ is so loose with syntax. The grammar that defines C/C++ is really minimal: everything that is not rejected is accepted (rather than the other way around). This leads to many super interesting edge cases. One example is Duff’s device [1]. Tom Duff was working at Lucasfilm, and he was trying to optimize some code that was copying memory to a piece of hardware.

The code he was looking at was this:

void send(register short *to, register short *from, register count) {
  do
    *to = *from++;
  while (--count > 0);
}

In C, register is a hint to the compiler, advising it to store that variable in a processor register instead of in memory (in 1984 when Duff came up with this, register was more than a hint, today on most hardware register means nothing). This code just copies count bytes from from to to.

Loop unrolling used to be a reasonable way to optimize a for loop. In loop unrolling, instead of iterating the for loop over i from 0 ... N you iterate ii from 0 ... floor(N/M) with a step size of M and replace the body with M copies of the body wherein i goes through ii, ..., ii+M-1 and you deal with the possible overflow (the overflow happens if M doesn’t divide N). That means you will have to do fewer jumps (JNEs) in the ASM (at the expense of a larger body). This ‘optimization’ is enabled by gcc flags such as -faggressive-loop-optimizations, -funroll-all-loops, -floop-unroll-and-jam. Today none of these flags are included in the gcc optimization flags -O1, -O2, -O3. That’s because the M extra jumps are now so fast compared to anything in the body that there's either no difference in speed, or the code is now slower due to pageing of the executable.

But back in 1984, Duff knew those jumps would make a difference, and -funroll-all-loops didn't exist, so he wanted to manually unroll that loop. He wanted to unroll it into multiples of 8, because that was optimal for his hardware. He came up with this code:

void send(register short *to, register short *from, register count) {
  register n = (count + 7)/8;
  switch (count % 8) {
    case 0: do { *to = *from++;
      case 7: *to = *from++;
      case 6: *to = *from++;
      case 5: *to = *from++;
      case 4: *to = *from++;
      case 3: *to = *from++;
      case 2: *to = *from++;
      case 1: *to = *from++;
    } while(--n > 0);  
  }
}

This code is wild because it interleaves a while loop within the labels of a switch statement, and also makes use of the ‘pass through’ feature of the switch statement. Basically, on the first pass, the switch statement bumps you to the case that you would need in order to not overflow the buffer (and instead exactly exhaust the whole buffer) after n-1 subsequent full passes through the while. After the first pass, the case labels are ignored. After Tom Duff published this device, a panel of members of the ANSI C committee convened to rule on whether or not it was correct C code: they said it was correct (what is not rejected is accepted).

This totally ingenious syntax is valid C/C++, and back in the 80s would produce the fewest instructions, and fastest execution required to implement this send. Tom Duff said this about the device: ‘It amazes me that after 10 years of writing C there are still little corners that I haven’t explored fully ... Many people have said that the worst feature of C is that switches don't break automatically before each case label. This code forms some sort of argument in that debate, but I’m not sure whether it's for or against.’

Tom Duff first wrote about this in May 1984, the same month that Indiana Jones and the Temple of Doom was screened [1]. It’s not impossible that one thing lead to another.

Happy hacking!

[1] Re: Explanation, please! https://www.lysator.liu.se/c/duffs-device.html

Monday, July 31, 2017

Random MATLAB musings

Even though 2006 was a while ago, it still seems unlikely that software as mainstream as MATLAB would have shipped in that year with a bug in the default pseudorandom number generator. And yet, that’s exactly what happened [1]. You can still reproduce this bug (in MATLAB 2017a) by rewinding to the old defaults and plotting the output of the rand function.

rand('state', 0);
Z = rand(28, 100000);
condition = Z(1, :) < 1/4;
scatter(Z(16, condition), Z(28, condition), '.');


RNG FAIL [1]


The variable Z should have independent entries, but conditioning on the first row induces this ugly dependence between the 16th and 28th rows. The bug is long since squashed, but if you aren’t controlling the environment your software is run on, you shoud add rng('default'); to the top of your code. That will ‘fix’ your random stream in the unlikely event that your software is either run with an ancient version of MATLAB, or run just after some old legacy code that messes with the random stream.

rng('default'); also means that your code should produce the same output each time it’s run. That’s useful when scientific rigour is required, or if you’re hitting a bug triggered by the random stream, which often happens with Markov chain Monte Carlo (MCMC) for example.

If you want independent runs (i.e., random restarts in MCMC), you could require that your user provides the seeds. But if you don’t want to bug your users for seeds, how can you make sure the seeds are random? K&R suggests something like srand((unsigned)time(NULL)); [2]. Converting this C code to MATLAB (or any other language) can lead to dependence between runs if you launch a bunch of jobs all at once, and that is not acceptable. One option to get around that (which works on Linux/MacOS/cygwin, etc ...) is to initialise from the hardware random data device, like so:

rng('default');
fp = fopen('/dev/random', 'rb');
seed = fread(fp, 1, 'uint32');
fclose(fp);
rng(seed); % Set the seed for MATLAB's SRNG.

I usually let the user provide an optional argument seed to my main function, and then hit the hardware random device for the seed if they don’t provide it. Make sure to tell the user what seed was used, so they can sub it in to reproduce their result:

rng('default');
if ~exist('seed', 'var')
    fp = fopen('/dev/random', 'rb');
    seed = fread(fp, 1, 'uint32');
    fclose(fp);
end
rng(seed);
fprintf(1, 'Using seed: %ld\n', seed);

The hardware random device should be enough, but if you want overkill, use a genuine random number generated by the radioactive decay of a Cæsium-137 isotope. You can get that from this more-than-a-little-jank looking setup in Lignieres, Switzerland.

The hotbits setup


You need an API key to request these Cæsium-137 random numbers, which you can get here. Once you have your API key, sub it in for 'HB1XXXXXX' in this following code:

rng('default');
command = ['curl -s ' ...
  '"https://www.fourmilab.ch/cgi-bin/Hotbits' ...
  '?nbytes=4' ...
  '&fmt=hex' ...
  '&apikey=HB1XXXXXX"' ...
  ' | grep -A1 "<pre>"' ...
  ' | tail -n 1'];
[~, seed] = system(command);
seed = hex2dec(strtrim(seed));
rng(seed);
fprintf(1, 'Using seed: %ld\n', seed);

All of this reproducibility suff is contingent on your code not being multithreaded. If you’re threading with parfor for example, you won’t get reproducible results even if you control the seed (the order of the threads is not guaranteed and not governed by your seed). You can recover reproducibility by spamming rng calls everywhere, which works great for MCMC algorithms that can thread within each MCMC iteration. You also have to use the 'twister' random number generator, as 'default' lets parfor override the rng calls (in MATLAB 2017a at least). Here's some skeleton code for reproducible parfor:

seed = 2666;
rng(seed, 'twister');
seeds = randi(2^32 - 1, n + 1, 'uint32');
seeds(:) = seeds(randperm(n + 1));
% A 
parfor i = 1:n
  rng(seeds(i), 'twister');
  % B
end
rng(seeds(n + 1), 'twister');
% C

Your code is % A, % B and % C.

Happy hacking!

[1] P. Savicky. 2006. A strong nonrandom pattern in MATLAB default random number generator. Tehcnical Computing Prague.
[2] B. Kernighan and D. Ritchie. 1987. The C Programming Language, Second Edition. Prentice Hall.