2023-May-11

Multiword monotonic time counters

Suppose we have a multiprocessor system on which the widest possible atomic memory access is one word, and we wish to maintain a monotonic counter wider than one word, which all processors can read in a wait-free manner.

If the counter is maintained by hardware, e.g. a memory-mapped register, then a common trick is

extern volatile uint32_t counter[2];

int64_t read_counter() {
  uint32_t hi, lo, hi2;
  do {
    hi = counter[1];
    lo = counter[0];
    hi2 = counter[1];
  } while (hi != hi2);
  return ((int64_t)hi) << 32 | lo;
}

If the loads cannot be reordered, then hi2 == hi means the low word did not wrap between those loads. (This assumes it is unrealistic for the high word to have traveled through its entire range.)

However, if the counter is in RAM updated by software, this is insufficient. There's no ordering of the writes which prevents momentarily observing a torn counter. Classic solutions are

The problem also arises if the counter is partly maintained in hardware (necessarily the case for a high-resolution time counter) and partly by software. In this case, we cannot use the above techniques, because typical hardware does not implement them.

Extending a hardware counter

Suppose counter[0] is a hardware register which generates an interrupt when it overflows, and we wish to extend it into the bits of counter[1] by tracking those overflows. After an interrupt is generated, the hardware counter has already wrapped, but the high word will take a few cycles to update. Readers need additional information to correctly combine the high and low words.

One possibility is to also read the pending interrupt flag from the hardware. The ISR can increment counter[1] by one, clear the interrupt flag, then increment counter[1] again. Readers know that an overflow has occurred either when the interrupt flag is set, or when counter[1] is odd. This depends on two assumptions:

void handle_counter_interrupt() {
  counter[1]++;
  pending_interrupt[COUNTER] = false;
  counter[1]++;
}

uint32_t read_counter_hi() {
  uint32_t hi = counter[1];
  bool pi = pending_interrupt[COUNTER];
  return (hi >> 1) + (pi || (hi & 1));
}

int64_t read_counter() {
  uint32_t hi, lo, hi2;
  do {
    hi = read_counter_hi();
    lo = counter[0];
    hi2 = read_counter_hi();
  } while (hi != hi2);
  return ((int64_t)hi) << 32 | lo;
}

Another approach is to require some other source of wakeups which arrive more frequently than counter[0] overflows, and overlap counter[0] and counter[1] by one bit, periodically duplicating the MSB of counter[0] into the LSB counter[1]. This works even if counter[0] itself is incapable of generating interrupts, and readers don't need access to pending interrupt state.

// Must be called periodically.
void maintain_counter() {
  uint32_t hi = counter[1];
  uint32_t lo = counter[0];
  // Check that the counter hasn't gone more than 1/4th of the way around since we were last called.
  static uint32_t last = 0;  // nonzero once set
  assert(last == 0 || ((lo | 1) - last) <= 1<<30);
  last = lo | 1;
  // XOR LSB of counter[1] with MSB of counter[0].
  uint32_t diff = (hi & 1) ^ (lo >> 31);
  // If they differ, incrementing counter[1] will make them match.
  counter[1] += diff;
}

int64_t read_counter() {
  uint32_t hi = counter[1];
  uint32_t lo = counter[0];
  // Imitate maintenance update.
  uint32_t diff = (hi & 1) ^ (lo >> 31);
  hi += diff;
  // Combine words.  The one overlapping bit will match.
  return ((int64_t)hi) << 31 | lo;
}

Both routines are branch-free, no atomics are required, and any memory ordering is fine. This technique can be extended to more words.

Generalizations

Given a regular source of wakeups more frequent than counter[0] overflows, we can also treat the expected counter range (before the next wakeup) as shared state guarded by a generic synchronization mechanism, such as RCU. This is convenient when there is other shared state (such as calibration parameters) readers will need in order to interpret the counter.

Linux uses a reader-writer lock in which readers retry until a consistent generation number is seen, called a [seqlock]. FreeBSD has an analogous [timehands] mechanism.

References