For all Firecracker versions prior to v1.0.0, the emulated block device uses a synchronous IO engine for executing the device requests, based on blocking system calls. Firecracker v1.0.0 adds support for an asynchronous block device IO engine.
> **Warning**: Support is currently in developer preview. See the developer preview status section below for more info.
The `Async` engine leverages io_uring for executing requests asynchronously, achieving overall higher throughput by taking better advantage of the block device hardware, which typically supports queue depths greater than 1.
The block IO engine is configured via the `PUT /drives` API call (pre-boot only), with the `io_engine` field taking one of two possible values:

- `Sync` (default)
- `Async` (in developer preview)
The `Sync` variant is the default, in order to provide backwards compatibility with older Firecracker versions.
> **Note**: The vhost-user block device is another option for block IO; it requires an external backend process.
```bash
curl --unix-socket ${socket} -i \
  -X PUT "http://localhost/drives/rootfs" \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d "{
        \"drive_id\": \"rootfs\",
        \"path_on_host\": \"${drive_path}\",
        \"is_root_device\": true,
        \"is_read_only\": false,
        \"io_engine\": \"Sync\"
      }"
```
Firecracker requires a minimum host kernel version of 5.10.51 for the `Async` IO engine. This requirement is based on the availability of the io_uring subsystem, as well as a couple of features and bugfixes that were added in newer kernel versions.
If a block device is configured with the `Async` io_engine on a host kernel older than 5.10.51, the API call will return a `400 Bad Request` with a descriptive error message.
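As a quick pre-flight check, the host kernel version can be compared against this minimum before selecting the `Async` engine. Below is a minimal sketch; the `sort -V` comparison is just one way to do this, and it assumes a standard `uname -r` version string:

```bash
# Minimal sketch: verify the host kernel is at least 5.10.51 before
# configuring a drive with "io_engine": "Async".
required="5.10.51"
current="$(uname -r | cut -d- -f1)"
if [ "$(printf '%s\n%s\n' "$required" "$current" | sort -V | head -n1)" = "$required" ]; then
  echo "Kernel ${current}: Async IO engine supported"
else
  echo "Kernel ${current}: use the Sync IO engine"
fi
```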
Performance is strictly tied to the host kernel version; the data gathered here may not be relevant for kernels newer than, or modified from, 5.10.
When using the `Async` variant, there is added latency on device creation (up to ~110 ms), caused by the extra io_uring system calls performed by Firecracker. This translates to higher latencies on any of these operations:
- API call duration for block device config
- Boot time for VMs started via JSON config files
- Snapshot restore time
For use cases where the lowest latency on the aforementioned operations is desired, it is recommended to use the `Sync` IO engine.
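To quantify this cost on a particular host, the drive configuration call can be timed for each engine. This is a minimal sketch, assuming the same `${socket}` and `${drive_path}` variables as the example above and a microVM that has not yet been booted:

```bash
# Minimal sketch: compare drive configuration latency between the two engines.
# PUT /drives is pre-boot only, so this must run before starting the microVM.
for engine in Sync Async; do
  echo "io_engine: ${engine}"
  time curl --unix-socket ${socket} -s -o /dev/null \
    -X PUT "http://localhost/drives/rootfs" \
    -H "accept: application/json" \
    -H "Content-Type: application/json" \
    -d "{
          \"drive_id\": \"rootfs\",
          \"path_on_host\": \"${drive_path}\",
          \"is_root_device\": true,
          \"is_read_only\": false,
          \"io_engine\": \"${engine}\"
        }"
done
```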
The `Async` engine's performance potential is showcased when the block device backing files are placed on a physical disk that supports efficient parallel execution of requests, such as an NVMe drive. It is also recommended to evenly distribute the backing files across the available drives of a host, to limit contention in high-density scenarios.
Our performance measurements were made on NVMe drives, and we found that:
- For read workloads which operate on data that is not present in the host page cache, the performance improvement for `Async` is about 1.5x-3x in overall efficiency (IOPS per CPU load) and up to 30x in total IOPS.
- For write workloads, the `Async` engine brings an improvement of about 20-45% in total IOPS, but performs worse than the `Sync` engine in total efficiency (IOPS per CPU load). This means that while Firecracker will achieve better performance, it will come at the cost of consuming more CPU for the kernel workers. In this case, the VMM CPU load is also reduced, which should translate into a performance increase in hybrid workloads (block+net+vsock).
Whether or not using the `Async` engine is a good idea performance-wise depends on the workloads and the amount of spare CPU available on the host. According to our NVMe experiments, io_uring will always bring performance improvements, provided that there are enough available CPU resources.
It is recommended that users perform some tests with examples of expected workloads and measure efficiency as IOPS per CPU load.
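One way to run such a test is with `fio` from inside the guest while observing CPU usage on the host. The sketch below is illustrative only: the device path `/dev/vdb` and all job parameters are placeholders to be adapted to the expected workload:

```bash
# Minimal sketch: guest-side random read benchmark against an attached drive.
# /dev/vdb and the job parameters below are placeholders, not recommendations.
fio --name=rand-read \
    --filename=/dev/vdb \
    --direct=1 \
    --rw=randread \
    --bs=4k \
    --iodepth=32 \
    --numjobs=4 \
    --runtime=60 \
    --time_based \
    --group_reporting
```

Divide the reported IOPS by the host-side CPU load of the Firecracker process and the io_uring kernel workers (observed e.g. with `top` or `pidstat`) to compare the two engines on an IOPS per CPU load basis.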
View the release policy for information about developer preview terminology.
The `Async` io_engine is not yet suitable for production use. It will be made available for production once Firecracker has support for a host kernel that implements mitigation mechanisms for the following threats:
The number of io_uring kernel workers assigned to one Firecracker block device is upper-bounded by:

`1 + NUMA_COUNT * min(size_of_ring, 4 * NUMBER_OF_CPUS)`
This formula is derived from the 5.10 Linux kernel code; `size_of_ring` is hardcoded to `128` in Firecracker.
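For illustration, the bound can be estimated for a given host as follows (a minimal sketch; querying the NUMA node and CPU counts via `lscpu` and `nproc` is an assumption, and any equivalent source works):

```bash
# Minimal sketch: estimate the per-device io_uring worker upper bound.
NUMA_COUNT=$(lscpu | awk '/^NUMA node\(s\)/ {print $3}')
NR_CPUS=$(nproc)
SIZE_OF_RING=128  # hardcoded in Firecracker
echo $(( 1 + NUMA_COUNT * (SIZE_OF_RING < 4 * NR_CPUS ? SIZE_OF_RING : 4 * NR_CPUS) ))
```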
Depending on the number of microVMs that can concurrently live on a host and the number of block devices configured for each microVM, the kernel PID limit may be reached, resulting in failure to create any new process.
Kernels starting with 5.15 expose a configuration option for customising this upper bound. Once possible, we plan on exposing this in the Firecracker drive configuration interface.
The io_uring kernel workers are spawned in the root cgroup of the system. They don’t inherit the Firecracker cgroup, cannot be moved out of the root cgroup and their names don't contain any information about the microVM's PID. This makes it impossible to attribute a worker to a specific Firecracker VM and limit the CPU and memory consumption of said workers via cgroups.
Starting with kernel 5.12 (currently unsupported), the Firecracker cgroup is inherited by the io_uring workers.
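The workers can still be observed on the host. This minimal sketch assumes the pre-5.12 worker thread naming (`io_wqe_worker-*`), with 5.12+ kernels using `iou-wrk-*` instead:

```bash
# Minimal sketch: list io_uring worker threads on the host. Note that on 5.10
# kernels the names carry no reference to the owning Firecracker process.
ps -e -o pid,comm | grep -E 'io_wqe_worker|iou-wrk'
```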
We plan on marking the `Async` engine as production-ready once an LTS Linux kernel including mitigations for the aforementioned threats is released and support for it is added in Firecracker.
Read more about Firecracker's kernel support policy.