02 Apr 2024

20% cheaper find-first-set bit using a new de Bruijn-like sequence

tldr: Find the first 3 bits of the lsb position using a de Bruijn-like sequence instead of finding them by dichotomy. Saves 20% of gas.

Merged in solady.

Given a sequence of bits, the function that returns the position of the least significant nonzero bit (the lsb) is often named ffs (for first set bit) or ctz (for count trailing zeroes).

Finding the lsb position is very useful when you’re using EVM words as arrays. For instance Uniswap v3 uses it to find the next initialized tick in a given tick range.

EVM words are 256 bits so you need 8 bits of information to know the lsb position.

Solady’s implementation does the following:

Remove all set bits except the least significant (x := x & ~x+1).
Binary-search the 3 high bits of the position (“in which half of the word is the lsb? in which half of that half is it? in which half of that half is the lsb?”).
Use a de Bruijn sequence to find the remaining 5 bits of the position.

Step 2 can be improved using a de Bruijn-like sequence.

In a de Bruijn sequence every substring of a given length appears uniquely. So if you have a de Bruijn sequence m and an isolated bit in x, you can find a unique bit sequence for every possible bit position in x by computing x * m which is equal to m << (position of set bit in x). Then, with a hash table, you map that sequence to the position of the isolated bit.

For instance with 4 bits let m := 0110 and let x be each possible isolated bit (we won’t look at the hash table in this example):

m * 0001 = 0110
m * 0010 = 1100
m * 0100 = 1000
m * 1000 = 0000

The first two bits of the result are different each time.

Now our goal is not to find the lsb position in a 256-bit word, it’s just to find the 3 high bits of the lsb position. In other words our goal is to find in which 32-bit block of a 256-bit word the lsb is.

  bit pos in x     255-224    223-192    ...    63-32    31-0
block num in x        7          6                1        0

Suppose we had an m such that x * m identified the 32-bit block that has a set bit. Then we could again use a hash table and map x * m to the position of the 32-bit block.

There is such a sequence: m := 0xb6db6db6ddddddddd34d34d349249249210842108c6318c639ce739cffffffff.

Here is m in binary, blocks are numbered from the left since multiplication shifts m left:

      m in 32-bit blocks         block number in x
                                
10110110110110110110110110110110     block 0
11011101110111011101110111011101     block 1
11010011010011010011010011010011     block 2
01001001001001001001001001001001     block 3
00100001000010000100001000010000     block 4
10001100011000110001100011000110     block 5
00111001110011100111001110011100     block 6
11111111111111111111111111111111     block 7

What is special about m is that any 6-bit sequence appears in at most one block.

101101 only appears in block 0
001001 only appears in block 3
000110 only appears in block 5
111001 only appears in block 6
Blocks spillover into one another so the last 6-bit sequence of block 0 is 011011.
The bits after the bottom block are 0 so 100000 appears in block 7.
000000 does not appear, we will see why later.

Because of this special property the first 6 bits of x * m uniquely identify the block containing the set bit of x.

Definition of the special property

Start from a definition that includes de Bruijn sequences: "a sequence of length n such that every substring of length k in the sequence is unique"

Generalize it: "A sequence of length n such that if a substring of length k starts at indices i and j then ⌊i/d⌋ = ⌊j/d⌋". The initial definition has d=1.

In yul:

  // x given as input

  // Isolate first bit
  x := and(x, add(not(x), 1))

  // de Bruijn-like sequence
  let m := 0xb6db6db6ddddddddd34d34d349249249210842108c6318c639ce739cffffffff

  // Get a 6-bit pattern
  let pattern := shr(250,mul(x,m))

The next step is mapping pattern to the correct block number. We’ll use a mapping table h.

Suppose pattern identifies block number 0bXYZ. We would like h such that the 3 bits at position pattern represent the block number: h[pattern] = X, h[pattern+1] = Y, h[pattern+2] = Z.

If this simple approach worked we would be able to do

  // Second hashmap
  h := 0x...

  first3bits := shr(253,shl(pattern,h))

It doesn’t work because of collisions. For instance suppose the pattern 000001 appears in block 1 (0b001) and the pattern 000010 appears in block 6 (0b110). h must be constructed in two mutually incompatible ways (positions in h are counted from the left to make bit cleanup use one less instruction):

h = . 0 0 1 . . .        (forced by block(0000001) = 1)
  clash ↕ 
h = . . 1 1 0 . .        (forced by block(0000010) = 6)

So we shift pattern to have more room and use p := pattern << 2 as an index into h. This solves the collision issue. The extra room allows us to handle the special case where the input is 0 and have ffs return 256 without any extra ifs: we put the invalid block number 0b1000 in position zero of h.

The value of h (constructed from m) is 0x8040405543005266443200005020610674053026020000107506200176117077. It is constructed like this:

// h is a empty word
// m_patterns is the set of 
// (pattern, bits of the pattern block number) pairs 
// that characterize m
h[0] = 1
for (pattern,block_high_bit,block_mid_bit,block_low_bit) in m_patterns:
  p = pattern << 2
  // h[p] is 0
  h[p+1] = block_high_bit
  h[p+2] = block_mid_bit
  h[p+3] = block_low_bit

The 1000 start of h is not erased in the for loop because 000000 is not in m.

Now h works like this: with a pattern of m and p = pattern << 2, h[p:p+3] is the block number of pattern. And h[0:3] = 0b1000.

The complete code is

  // Isolate first bit
  x := and(x, add(not(x), 1))

  // de Bruijn-like sequence: bit to pattern mapping
  let m := 0xb6db6db6ddddddddd34d34d349249249210842108c6318c639ce739cffffffff

  // Get a 6-bit pattern
  let pattern := shr(250,mul(x,m))

  // pattern to position mapping
  let h := 0x8040405543005266443200005020610674053026020000107506200176117077

  // Get first 3 bits of c, with an extra bit so that c=256 when x=0.
  // 6-bit pattern is shifted so all 256 bits of h are addressed
  let c := shr(252,shl(shl(2,pattern),h))

  // Store bits in c, shifted to their correct position
  c := shl(5,c)

  // Get last 5 bits of c (regular de Bruijn lookup)
  c := or(c, byte(shr(251, mul(shr(c, x), shl(224, 0x077cb531))), 
      0x00011c021d0e18031e16140f191104081f1b0d17151310071a0c12060b050a09))

The resulting code uses about 20% less gas using solidity 0.8.24, with via_ir (determined with forge snapshot --mt FFS --isolate).

Extra notes

There is no m with 5-bit patterns. With 5-bit patterns you could address h with the bytes instruction since the address of a byte in a word is 5 bits long.
There are multiple candidates for m with 6-bit patterns.
There may be an m such that the corresponding h does not have collisions. Then shifting pattern left may be unnecessary (you would still have to account for the case x = 0 somehow).