20% cheaper find-first-set bit using a new de Bruijn-like sequence
tldr: Find the first 3 bits of the lsb position using a de Bruijn-like sequence instead of finding them by dichotomy. Saves 20% of gas.
Given a sequence of bits, the function that returns the position of the least significant nonzero bit (the lsb) is often named ffs
(for first set bit) or ctz
(for count trailing zeroes).
Finding the lsb position is very useful when you’re using EVM words as arrays. For instance Uniswap v3 uses it to find the next initialized tick in a given tick range.
EVM words are 256 bits so you need 8 bits of information to know the lsb position.
Solady’s implementation does the following:
- Remove all set bits except the least significant (
x := x & ~x+1
). - Binary-search the 3 high bits of the position (“in which half of the word is the lsb? in which half of that half is it? in which half of that half is the lsb?”).
- Use a de Bruijn sequence to find the remaining 5 bits of the position.
Step 2 can be improved using a de Bruijn-like sequence.
In a de Bruijn sequence every substring of a given length appears uniquely. So if you have a de Bruijn sequence m
and an isolated bit in x
, you can find a unique bit sequence for every possible bit position in x
by computing x * m
which is equal to m << (position of set bit in x)
. Then, with a hash table, you map that sequence to the position of the isolated bit.
For instance with 4 bits let m := 0110
and let x
be each possible isolated bit (we won’t look at the hash table in this example):
m * 0001 = 0110
m * 0010 = 1100
m * 0100 = 1000
m * 1000 = 0000
The first two bits of the result are different each time.
Now our goal is not to find the lsb position in a 256-bit word, it’s just to find the 3 high bits of the lsb position. In other words our goal is to find in which 32-bit block of a 256-bit word the lsb is.
bit pos in x 255-224 223-192 ... 63-32 31-0
block num in x 7 6 1 0
Suppose we had an m
such that x * m
identified the 32-bit block that has a set bit. Then we could again use a hash table and map x * m
to the position of the 32-bit block.
There is such a sequence: m := 0xb6db6db6ddddddddd34d34d349249249210842108c6318c639ce739cffffffff
.
Here is m
in binary, blocks are numbered from the left since multiplication shifts m
left:
m in 32-bit blocks block number in x
10110110110110110110110110110110 block 0
11011101110111011101110111011101 block 1
11010011010011010011010011010011 block 2
01001001001001001001001001001001 block 3
00100001000010000100001000010000 block 4
10001100011000110001100011000110 block 5
00111001110011100111001110011100 block 6
11111111111111111111111111111111 block 7
What is special about m
is that any 6-bit sequence appears in at most one block.
101101
only appears in block 0001001
only appears in block 3000110
only appears in block 5111001
only appears in block 6- Blocks spillover into one another so the last 6-bit sequence of block 0 is
011011
. - The bits after the bottom block are 0 so
100000
appears in block 7. 000000
does not appear, we will see why later.
Because of this special property the first 6 bits of x * m
uniquely identify the block containing the set bit of x
.
In yul:
// x given as input
// Isolate first bit
x := and(x, add(not(x), 1))
// de Bruijn-like sequence
let m := 0xb6db6db6ddddddddd34d34d349249249210842108c6318c639ce739cffffffff
// Get a 6-bit pattern
let pattern := shr(250,mul(x,m))
The next step is mapping pattern
to the correct block number. We’ll use a mapping table h
.
Suppose pattern
identifies block number 0bXYZ
. We would like h
such that the 3 bits at position pattern
represent the block number: h[pattern] = X
, h[pattern+1] = Y
, h[pattern+2] = Z
.
If this simple approach worked we would be able to do
// Second hashmap
h := 0x...
first3bits := shr(253,shl(pattern,h))
It doesn’t work because of collisions. For instance suppose the pattern 000001
appears in block 1 (0b001) and the pattern 000010
appears in block 6 (0b110). h
must be constructed in two mutually incompatible ways (positions in h
are counted from the left to make bit cleanup use one less instruction):
h = . 0 0 1 . . . (forced by block(0000001) = 1)
clash ↕
h = . . 1 1 0 . . (forced by block(0000010) = 6)
So we shift pattern
to have more room and use p := pattern << 2
as an index into h
. This solves the collision issue. The extra room allows us to handle the special case where the input is 0 and have ffs
return 256 without any extra ifs: we put the invalid block number 0b1000 in position zero of h
.
The value of h
(constructed from m
) is 0x8040405543005266443200005020610674053026020000107506200176117077
. It is constructed like this:
// h is a empty word
// m_patterns is the set of
// (pattern, bits of the pattern block number) pairs
// that characterize m
h[0] = 1
for (pattern,block_high_bit,block_mid_bit,block_low_bit) in m_patterns:
p = pattern << 2
// h[p] is 0
h[p+1] = block_high_bit
h[p+2] = block_mid_bit
h[p+3] = block_low_bit
The 1000
start of h
is not erased in the for loop because 000000
is not in m
.
Now h
works like this: with a pattern
of m
and p = pattern << 2
, h[p:p+3]
is the block number of pattern
. And h[0:3] = 0b1000
.
The complete code is
// Isolate first bit
x := and(x, add(not(x), 1))
// de Bruijn-like sequence: bit to pattern mapping
let m := 0xb6db6db6ddddddddd34d34d349249249210842108c6318c639ce739cffffffff
// Get a 6-bit pattern
let pattern := shr(250,mul(x,m))
// pattern to position mapping
let h := 0x8040405543005266443200005020610674053026020000107506200176117077
// Get first 3 bits of c, with an extra bit so that c=256 when x=0.
// 6-bit pattern is shifted so all 256 bits of h are addressed
let c := shr(252,shl(shl(2,pattern),h))
// Store bits in c, shifted to their correct position
c := shl(5,c)
// Get last 5 bits of c (regular de Bruijn lookup)
c := or(c, byte(shr(251, mul(shr(c, x), shl(224, 0x077cb531))),
0x00011c021d0e18031e16140f191104081f1b0d17151310071a0c12060b050a09))
The resulting code uses about 20% less gas using solidity 0.8.24, with via_ir (determined with forge snapshot --mt FFS --isolate
).
Extra notes
- There is no
m
with 5-bit patterns. With 5-bit patterns you could addressh
with thebytes
instruction since the address of a byte in a word is 5 bits long. - There are multiple candidates for
m
with 6-bit patterns. - There may be an
m
such that the correspondingh
does not have collisions. Then shiftingpattern
left may be unnecessary (you would still have to account for the casex = 0
somehow).