This is a repository that contains a port of Stream VByte to Go. Notably, this repo takes extra care to leverage SIMD techniques to achieve better performance. Currently, there is support for x86_64 architectures that have AVX and AVX2 hardware instructions. In cases where that is not available, or on non x86_64 architectures there is a portable scalar implementation. We also perform a runtime check to make sure that the necessary ISA is available and if not fallback to the scalar approach.
There are several existing implementations:
- Reference C/C++
- Rust
- Go
- Note: only has a scalar implementation which prompted this implementation with SIMD techniques.
goos: darwin
goarch: amd64
pkg: github.com/theMPatel/streamvbyte-simdgo/pkg
cpu: Intel(R) Core(TM) i7-8700B CPU @ 3.20GHz
--
MemCopy8Uint32-12 463986302 2.608 ns/op 12269.03 MB/s
goos: darwin
goarch: amd64
pkg: github.com/theMPatel/streamvbyte-simdgo/pkg/decode
cpu: Intel(R) Core(TM) i7-8700B CPU @ 3.20GHz
--
Get8uint32Fast-12 377839186 3.170 ns/op 10095.99 MB/s
Get8uint32DeltaFast-12 298522095 4.455 ns/op 7183.20 MB/s
Get8uint32Scalar-12 63384603 19.28 ns/op 1659.36 MB/s
Get8uint32DeltaScalar-12 58705828 20.04 ns/op 1596.46 MB/s
Get8uint32Varint-12 27369775 43.77 ns/op 731.10 MB/s
Get8uint32DeltaVarint-12 20924770 57.30 ns/op 558.46 MB/s
goos: darwin
goarch: amd64
pkg: github.com/theMPatel/streamvbyte-simdgo/pkg/encode
cpu: Intel(R) Core(TM) i7-8700B CPU @ 3.20GHz
--
Put8uint32Fast-12 297620898 3.864 ns/op 8281.18 MB/s
Put8uint32DeltaFast-12 276545827 4.350 ns/op 7356.59 MB/s
Put8uint32Scalar-12 41200776 28.59 ns/op 1119.30 MB/s
Put8uint32DeltaScalar-12 37773458 30.65 ns/op 1044.11 MB/s
Put8uint32Varint-12 58668867 17.20 ns/op 1860.67 MB/s
Put8uint32DeltaVarint-12 61446153 22.88 ns/op 1398.80 MB/s
goos: darwin
goarch: amd64
pkg: github.com/theMPatel/streamvbyte-simdgo/pkg/stream/reader
cpu: Intel(R) Core(TM) i7-8700B CPU @ 3.20GHz
--
ReadAllFast/Count_1e0-12 99354789 12.24 ns/op 326.80 MB/s
ReadAllFast/Count_1e1-12 28076071 42.81 ns/op 934.43 MB/s
ReadAllFast/Count_1e2-12 11041639 107.2 ns/op 3730.16 MB/s
ReadAllFast/Count_1e3-12 1645387 729.9 ns/op 5480.00 MB/s
ReadAllFast/Count_1e4-12 170894 7034 ns/op 5686.52 MB/s
ReadAllFast/Count_1e5-12 16848 70969 ns/op 5636.29 MB/s
ReadAllFast/Count_1e6-12 1513 728516 ns/op 5490.62 MB/s
ReadAllFast/Count_1e7-12 152 7835111 ns/op 5105.22 MB/s
ReadAllDeltaFast/Count_1e0-12 92727970 13.10 ns/op 305.44 MB/s
ReadAllDeltaFast/Count_1e1-12 26164140 45.89 ns/op 871.61 MB/s
ReadAllDeltaFast/Count_1e2-12 9458992 128.5 ns/op 3113.55 MB/s
ReadAllDeltaFast/Count_1e3-12 1277408 934.4 ns/op 4280.69 MB/s
ReadAllDeltaFast/Count_1e4-12 144405 8318 ns/op 4808.88 MB/s
ReadAllDeltaFast/Count_1e5-12 14444 83151 ns/op 4810.55 MB/s
ReadAllDeltaFast/Count_1e6-12 1426 846305 ns/op 4726.43 MB/s
ReadAllDeltaFast/Count_1e7-12 127 9337355 ns/op 4283.87 MB/s
ReadAllScalar/Count_1e0-12 122650209 9.770 ns/op 409.43 MB/s
ReadAllScalar/Count_1e1-12 38012136 31.63 ns/op 1264.64 MB/s
ReadAllScalar/Count_1e2-12 4999376 241.6 ns/op 1655.30 MB/s
ReadAllScalar/Count_1e3-12 500337 2459 ns/op 1626.38 MB/s
ReadAllScalar/Count_1e4-12 50247 24034 ns/op 1664.34 MB/s
ReadAllScalar/Count_1e5-12 5032 238354 ns/op 1678.17 MB/s
ReadAllScalar/Count_1e6-12 499 2405669 ns/op 1662.74 MB/s
ReadAllScalar/Count_1e7-12 46 24533207 ns/op 1630.44 MB/s
ReadAllDeltaScalar/Count_1e0-12 100000000 10.32 ns/op 387.49 MB/s
ReadAllDeltaScalar/Count_1e1-12 36915704 32.52 ns/op 1230.08 MB/s
ReadAllDeltaScalar/Count_1e2-12 4818140 249.8 ns/op 1601.58 MB/s
ReadAllDeltaScalar/Count_1e3-12 512492 2374 ns/op 1685.20 MB/s
ReadAllDeltaScalar/Count_1e4-12 51004 23639 ns/op 1692.11 MB/s
ReadAllDeltaScalar/Count_1e5-12 3568 333168 ns/op 1200.60 MB/s
ReadAllDeltaScalar/Count_1e6-12 520 2304864 ns/op 1735.46 MB/s
ReadAllDeltaScalar/Count_1e7-12 48 24810555 ns/op 1612.22 MB/s
ReadAllVarint/Count_1e0-12 121348074 9.967 ns/op 401.34 MB/s
ReadAllVarint/Count_1e1-12 21056739 57.34 ns/op 697.64 MB/s
ReadAllVarint/Count_1e2-12 2025081 589.0 ns/op 679.15 MB/s
ReadAllVarint/Count_1e3-12 205881 5851 ns/op 683.69 MB/s
ReadAllVarint/Count_1e4-12 20906 57446 ns/op 696.31 MB/s
ReadAllVarint/Count_1e5-12 2037 580620 ns/op 688.92 MB/s
ReadAllVarint/Count_1e6-12 208 5755083 ns/op 695.04 MB/s
ReadAllVarint/Count_1e7-12 20 57872736 ns/op 691.17 MB/s
ReadAllDeltaVarint/Count_1e0-12 139763250 8.318 ns/op 480.87 MB/s
ReadAllDeltaVarint/Count_1e1-12 19199100 62.49 ns/op 640.11 MB/s
ReadAllDeltaVarint/Count_1e2-12 2149660 556.6 ns/op 718.65 MB/s
ReadAllDeltaVarint/Count_1e3-12 207122 5810 ns/op 688.41 MB/s
ReadAllDeltaVarint/Count_1e4-12 22680 53200 ns/op 751.88 MB/s
ReadAllDeltaVarint/Count_1e5-12 2145 500177 ns/op 799.72 MB/s
ReadAllDeltaVarint/Count_1e6-12 228 5262741 ns/op 760.06 MB/s
ReadAllDeltaVarint/Count_1e7-12 27 42000722 ns/op 952.36 MB/s
goos: darwin
goarch: amd64
pkg: github.com/theMPatel/streamvbyte-simdgo/pkg/stream/writer
cpu: Intel(R) Core(TM) i7-8700B CPU @ 3.20GHz
--
WriteAllFast/Count_1e0-12 54152408 22.05 ns/op 181.36 MB/s
WriteAllFast/Count_1e1-12 27681948 43.17 ns/op 926.49 MB/s
WriteAllFast/Count_1e2-12 7136480 167.0 ns/op 2395.79 MB/s
WriteAllFast/Count_1e3-12 928952 1273 ns/op 3141.14 MB/s
WriteAllFast/Count_1e4-12 96117 12012 ns/op 3329.93 MB/s
WriteAllFast/Count_1e5-12 9718 114260 ns/op 3500.80 MB/s
WriteAllFast/Count_1e6-12 879 1242927 ns/op 3218.21 MB/s
WriteAllFast/Count_1e7-12 100 10368754 ns/op 3857.74 MB/s
WriteAllDeltaFast/Count_1e0-12 50489378 23.38 ns/op 171.06 MB/s
WriteAllDeltaFast/Count_1e1-12 26866423 45.03 ns/op 888.33 MB/s
WriteAllDeltaFast/Count_1e2-12 6695125 175.8 ns/op 2275.37 MB/s
WriteAllDeltaFast/Count_1e3-12 899895 1391 ns/op 2875.71 MB/s
WriteAllDeltaFast/Count_1e4-12 90394 12958 ns/op 3086.82 MB/s
WriteAllDeltaFast/Count_1e5-12 10000 122319 ns/op 3270.13 MB/s
WriteAllDeltaFast/Count_1e6-12 945 1249546 ns/op 3201.16 MB/s
WriteAllDeltaFast/Count_1e7-12 100 11461852 ns/op 3489.84 MB/s
WriteAllScalar/Count_1e0-12 56106489 21.72 ns/op 184.18 MB/s
WriteAllScalar/Count_1e1-12 18309972 65.09 ns/op 614.51 MB/s
WriteAllScalar/Count_1e2-12 2776918 433.5 ns/op 922.63 MB/s
WriteAllScalar/Count_1e3-12 289309 4209 ns/op 950.38 MB/s
WriteAllScalar/Count_1e4-12 29497 40884 ns/op 978.38 MB/s
WriteAllScalar/Count_1e5-12 3027 399959 ns/op 1000.10 MB/s
WriteAllScalar/Count_1e6-12 296 4010161 ns/op 997.47 MB/s
WriteAllScalar/Count_1e7-12 28 38753790 ns/op 1032.16 MB/s
WriteAllDeltaScalar/Count_1e0-12 54981757 21.90 ns/op 182.65 MB/s
WriteAllDeltaScalar/Count_1e1-12 17823349 67.10 ns/op 596.14 MB/s
WriteAllDeltaScalar/Count_1e2-12 2711672 442.4 ns/op 904.09 MB/s
WriteAllDeltaScalar/Count_1e3-12 292664 4130 ns/op 968.62 MB/s
WriteAllDeltaScalar/Count_1e4-12 29340 41014 ns/op 975.28 MB/s
WriteAllDeltaScalar/Count_1e5-12 2289 516113 ns/op 775.02 MB/s
WriteAllDeltaScalar/Count_1e6-12 302 3930860 ns/op 1017.59 MB/s
WriteAllDeltaScalar/Count_1e7-12 30 41357670 ns/op 967.17 MB/s
WriteAllVarint/Count_1e0-12 208214545 5.720 ns/op 699.32 MB/s
WriteAllVarint/Count_1e1-12 43083270 28.02 ns/op 1427.34 MB/s
WriteAllVarint/Count_1e2-12 4972045 242.8 ns/op 1647.67 MB/s
WriteAllVarint/Count_1e3-12 499011 2409 ns/op 1660.60 MB/s
WriteAllVarint/Count_1e4-12 51022 23590 ns/op 1695.67 MB/s
WriteAllVarint/Count_1e5-12 5216 231741 ns/op 1726.07 MB/s
WriteAllVarint/Count_1e6-12 518 2305364 ns/op 1735.08 MB/s
WriteAllVarint/Count_1e7-12 50 24905825 ns/op 1606.05 MB/s
WriteAllDeltaVarint/Count_1e0-12 175269966 6.792 ns/op 588.93 MB/s
WriteAllDeltaVarint/Count_1e1-12 51799438 23.38 ns/op 1710.63 MB/s
WriteAllDeltaVarint/Count_1e2-12 5417458 221.3 ns/op 1807.60 MB/s
WriteAllDeltaVarint/Count_1e3-12 539414 2243 ns/op 1783.48 MB/s
WriteAllDeltaVarint/Count_1e4-12 52717 22753 ns/op 1757.99 MB/s
WriteAllDeltaVarint/Count_1e5-12 5716 210456 ns/op 1900.63 MB/s
WriteAllDeltaVarint/Count_1e6-12 495 2453672 ns/op 1630.21 MB/s
WriteAllDeltaVarint/Count_1e7-12 70 17491186 ns/op 2286.87 MB/s
A note on the benchmarks: An array of random uint32's is generated and then encoded/decoded over and over again. An attempt is made to ensure that some of these benchmarks reflect the most probable real world performance metrics.
Stream VByte uses the same underlying format as Google's Group Varint approach. Lemire et al. wanted to see if there was a way to improve the performance even more and introduced a clever twist to enable better performance via SIMD techniques. The basic goal of the Group Varint format is to be able to achieve similar compression characteristics as the VByte format for integers and also be able to load and process them really quickly.
The insight that backs the VByte encoding is noticing that you oftentimes don't need 32 bits to encode a 32-bit integer. Take for example an unsigned integer that is less than 2^8 (256). This integer will have bits set in the lowest byte of a 32-bit integer, while the remaining 3 bytes will simply be zeros.
111 in binary:
00000000 00000000 00000000 01101111
An approach you can take to compress this integer is to encode the integer using a variable number of bytes. For example, you can use the lower 7 bits to store data, i.e. bits from the original integer, and then use the MSB as a continuation bit. If the MSB bit is on, i.e. is 1, then more bytes are needed to decode this particular integer. Below is an example where you might need 2 bytes to store the number 1234.
1234 in binary:
00000000 00000000 00000100 11010010
Num compressed:
v v Continuation bits
0|0001001| 1|1010010|
^ ^ Data bits
If you want to decode this integer, you simply build up the number iteratively. I.e. you OR the last 7 bits of every byte shifted to the appropriate length to your 32-bit integer until you find a byte that doesn't have a continuation bit set. Note that this works the same for 64-bit numbers.
The problem with this approach is that it can introduce a lot of branch mis-predictions during encoding/decoding. During the decoding phase, you don't know ahead of time the number of bytes that were used to encode the integer you are currently processing and so you need to iterate until you find a byte without a continuation bit on. If you have integers that are nonuniform, i.e. integers that require random numbers of bytes to encode relative to one another, this can pose a challenge to the processor's branch predictor. These mis-predictions can cause major slowdowns in processor pipelines and so was born the Group Varint format.
The Group Varint (varint-GB) format assumes that everything you hope to achieve, you can do with 32-bit integers. It introduces the concept of a control byte which is simply a byte that stores the encoded lengths of a group of 4 32-bit integers, hence Group Varint. 32-bit integers only require up to 4 bytes to properly encode. This means that you can represent their lengths with 2 bits using a zero-indexed length i.e. 0, 1, 2, and 3 to represent integers that require 1, 2, 3 and 4 bytes to encode, respectively.
00000000 00000000 00000000 01101111 = 111
00000000 00000000 00000100 11010010 = 1234
00000000 00001100 00001010 10000011 = 789123
01000000 00000000 00000000 00000000 = 1073741824
Num Len 2-bit control
----------------------------------
111 1 0b00
1234 2 0b01
789123 3 0b10
1073741824 4 0b11
Final Control byte
0b11100100
Encoded data (little endian right-to-left bottom-to-top)
0b01000000 0b00000000 0b00000000 0b00000000 0b00001100
0b00001010 0b10000011 0b00000100 0b11010010 0b01101111
You can then prefix every group of 4 encoded 32-bit integers with their control byte and then use it during decoding. The obvious downside is that you pay a storage cost of one byte for every 4 integers you want to encode. For 2^20 encoded integers, that's an extra 256 KB of extra space: totally marginal. The great upside, though, is that you've now removed almost all branches from your decoding phase. You know exactly how many data bytes you need to read from a buffer for a particular number and then can use branchless decoding.
package foo
import (
"encoding/binary"
)
func decodeOne(input []byte, size uint8) uint32 {
buf := make([]byte, 4)
copy(buf, input[:size])
// func (littleEndian) Uint32(b []byte) uint32 {
// _ = b[3]
// return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
// }
return binary.LittleEndian.Uint32(buf)
}
func main() {
ctrl := uint8(0b11_10_01_00)
data := []byte{
0b01101111, 0b11010010, 0b00000100,
0b10000011, 0b00001010, 0b00001100,
0b00000000, 0b00000000, 0b00000000,
0b01000000,
}
len0 := (ctrl & 3) + 1 // 1
len1 := (ctrl >> 2 & 3) + 1 // 2
len2 := (ctrl >> 4 & 3) + 1 // 3
len3 := (ctrl >> 6 & 3) + 1 // 4
_ = decodeOne(data, len0) // 111
_ = decodeOne(data[len0:], len1) // 1234
_ = decodeOne(data[len0+len1:], len2) // 789_123
_ = decodeOne(data[len0+len1+len2:], len3) // 1_073_741_824
}
Unfortunately, accelerating decoding of the varint-GB format with only SIMD techniques has proven unsuccessful. The below excerpt from the paper outlines why.
To understand why it might be difficult to accelerate the decoding of data compressed in the VARINT-GB format compared to the VARINT-G8IU format, consider that we cannot decode faster than we can access the control bytes. In VARINT-G8IU, the control bytes are conveniently always located nine compressed bytes apart. Thus while a control byte is being processed, or even before, our superscalar processor can load and start processing upcoming control bytes, as their locations are predictable. Instructions depending on these control bytes can be reordered by the processor for best performance. However, in the VARINT-GB format, there is a strong data dependency: the location of the next control byte depends on the current control byte. This increases the risk that the processor remains underutilized, delayed by the latency between issuing the load for the next control byte and waiting for it to be ready.
Additionally, they prove that decoding 4 integers at a time using 128-bit registers is faster than trying to decode a variable number of integers that fit into an 8-byte register, i.e. the varint-G8IU approach.
Lemire et al. have devised a brilliant SIMD algorithm for simultaneously generating two control bytes for a group of 8 integers. The best way to understand this algorithm is to understand how it works on a single integer and then assume it works in a vectorized form (it does). Going forward we'll use control bits stream to represent these control bytes we are building.
00000000 00000000 00000100 11010010 // 1234
Let's take one of the previous integers that we were looking at, 1234
, and walk through an example
of how the 2-bit control is generated for it using SIMD techniques. The goal is to be able to, for
any 32-bit integer, generate a 2-bit zero indexed length value. For example, if you have an integer
that requires 2 bytes to be encoded, we want for the algorithm to generate 0b01
.
00000000 00000000 00000100 11010010 // 1234
00000001 00000001 00000001 00000001 // 0x0101 mask
----------------------------------- // byte-min(1234, 0x0101)
00000000 00000000 00000001 00000001
The algorithm first uses a mask where every byte is equal to 1. If you perform a per-byte min operation on our integer and the 1's mask, the result will have a 1 at every byte that had a value in the original integer.
00000000 00000000 00000001 00000001
----------------------------------- // pack unsigned saturating 16-bit to 8-bit
00000000 00000000 00000000 11111111
Now you perform a 16-bit to 8-bit unsigned saturating pack operation. Practically this means that you're taking every 16-bit value and trying to shove that into 8 bits. If the 16-bit integer is larger than the largest unsigned integer 8 bits can support, the pack saturates to the largest unsigned 8-bit value.
Why this is performed will become more clear in the subsequent steps, however, at a high level, for every
integer you want to encode, you want for the MSB of two consecutive bytes in the control bits stream
to be representative of the final 2-bit control. For example, if you have a 3-byte integer, you want the
MSB of two consecutive bytes to be 1 and 0, in that order. The reason you would want this is that
there is a vector pack instruction that takes the MSB from every byte in the control bits stream
and packs it into the lowest byte. This would thus represent the value 0b10
in the final byte for
this 3-byte integer, which is what we want.
Performing a 16-bit to 8-bit unsigned saturating pack has the effect that you can use the saturation behavior to conditionally turn on the MSB of these bytes depending on which bytes have values in the original 32-bit integer.
00000000 00000000 00000000 11111111 // control bits stream
00000001 00000001 00000001 00000001 // 0x0101 mask
----------------------------------- // signed 16-bit min
00000000 00000000 00000000 11111111
We then take the 1's mask we used before and perform a signed 16-bit min operation. The reason for this is more clear if you look at an example using a 3-byte integer.
00000000 00001100 00001010 10000011 // 789123
00000001 00000001 00000001 00000001 // 0x0101 mask
----------------------------------- // byte-min(789123, 0x0101)
00000000 00000001 00000001 00000001
----------------------------------- // pack unsigned saturating 16-bit to 8-bit
00000000 00000000 00000001 11111111
00000001 00000001 00000001 00000001 // 0x0101 mask
----------------------------------- // signed 16-bit min
00000000 00000000 00000001 00000001
The signed 16-bit min operation has three important effects.
First, for 3-byte integers, it has the effect of turning off the MSB of the lowest byte. This is necessary
because a 3-byte integer should have a 2-bit control that is 0b10
and without this step using the MSB pack
operation would result in a 2-bit control that looks something like 0b_1
, where the lowest bit is on.
Obviously this is wrong, since only integers that require 2 or 4 bytes to encode should have that lower bit
on, i.e. 1 or 3 as a zero-indexed length.
Second, for 4-byte integers, the signed aspect has the effect of leaving both MSBs of the 2 bytes on. When using the
MSB pack operation later on, it will result in a 2-bit control value of 0b11
, which is what we want.
Third, for 1 and 2 byte integers, it has no effect. This is great for 2-byte values since the MSB will remain on and 1 byte values will not have any MSB on anyways, so it is effectively a noop in both scenarios.
00000000 00000000 00000000 11111111 // control bits stream (original 1234)
01111111 00000000 01111111 00000000 // 0x7F00 mask
----------------------------------- // add unsigned saturating 16-bit
01111111 00000000 01111111 11111111
Next, we take a mask with the value 0x7F00
and perform an unsigned saturating add to the control bits stream.
In the case for the integer 1234
this has no real effect. We maintain the MSB in the lowest byte. You'll note,
however, that the only byte that has its MSB on is the last one, so performing an MSB pack operation would result
in a value of 0b0001
, which is what we want. An example of this step on the integer 789123
might paint a clearer
picture.
00000000 00000000 00000001 00000001 // control bits stream (789123)
01111111 00000000 01111111 00000000 // 0x7F00 mask
----------------------------------- // add unsigned saturating 16-bit
01111111 00000000 11111111 00000001
You'll note here that the addition of 0x01
with 0x7F
in the upper byte results in the MSB of the resulting upper
byte turning on. The MSB in the lower byte remains off and now an MSB pack operation will resolve to 0b0010
,
which is what we want. The unsigned saturation behavior is really important for 4-byte numbers that only have
bits in the most significant byte on. An example below:
01000000 00000000 00000000 00000000 // 1073741824
00000001 00000001 00000001 00000001 // 0x0101 mask
----------------------------------- // byte-min(1073741824, 0x0101)
00000001 00000000 00000000 00000000
----------------------------------- // pack unsigned saturating 16-bit to 8-bit
00000000 00000000 11111111 00000000
00000001 00000001 00000001 00000001 // 0x0101 mask
----------------------------------- // signed 16-bit min
00000000 00000000 11111111 00000000
01111111 00000000 01111111 00000000 // 0x7F00 mask
----------------------------------- // add unsigned saturating 16-bit
01111111 00000000 11111111 11111111
Note here that because only the upper byte had a value in it, the lowest byte in the control bits stream remains
zero for the duration of the algorithm. This poses an issue, since for a 4-byte value, we want for the 2-bit
control to result in a value of 0b11
. Performing a 16-bit unsigned saturating addition has the effect of
turning on all bits in the lower byte, and thus we get a result with the MSB in the lower byte on.
01111111 00000000 11111111 00000001 // control bits stream (789123)
----------------------------------- // move byte mask
00000000 00000000 00000000 00000010 // 2-bit control
The final move byte mask is performed on the control bits stream, and we now have the result we wanted. Now that you see that this works for 1 integer, you know how it can work for 8 integers simultaneously, since we use vector instructions that operate on 128 bit registers.
The next problem to be solved is how to take a group of 4 integers, and compress it by removing extraneous/unused bytes so that all you're left with is a stream of data bytes with real information. Let's take two numbers from our examples above.
789123 1234
00000000 00001100 00001010 10000011 | 00000000 00000000 00000100 11010010
-------------------------------------------------------------------------
00001100 00001010 10000011 00000100 11010010 // packed
Here, we can use a shuffle operation. Vector shuffle operations rearrange the bytes in an input register according to some provided mask into a destination register. Every position in the mask stores an offset into the source vector stream that represents the data byte that should go into that position.
input [1234, 789123] (little endian R-to-L)
00000000 00001100 00001010 10000011 00000000 00000000 00000100 11010010
| | | | |
| | |____________________ | |
| |_____________________ | | |
|____________________ | | | |
v v v v v
0xff 0xff 0xff 0x06 0x05 0x04 0x01 0x00 // mask in hex
-----------------------------------------------------------------------
00000000 00000000 00000000 00001100 00001010 10000011 00000100 11010010 // packed
We keep a prebuilt lookup table that contains a mapping from control byte to the necessary mask and simply load that after we construct the control byte above. In addition, we keep a lookup table for a mapping from control bytes to total encoded length. This allows us to know by how much to increment the output pointer and overwrite, for example, the redundant upper 3 bytes in the above shuffle example.
Unpacking during decoding is the same as the above, but in reverse. We need to go from a packed format
to an unpacked memory format. We keep lookup tables to maintain a mapping from control byte to the reverse
shuffle mask, and then perform a shuffle operation to output to an uint32
array.