
What's new with io_uring in 6.11 and 6.12


Speedup of MSG_RING requests

MSG_RING requests can be used to send messages from one ring to another - either data of some sort, or to pass direct/fixed file descriptors between rings. 6.11 adds a more efficient way to handle remote posting, particularly on rings set up with IORING_SETUP_DEFER_TASKRUN. No changes are required on the application side to take advantage of this feature.
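
For illustration, here's a minimal sketch of posting a message from one ring to another with liburing's io_uring_prep_msg_ring(3); the helper name and error handling are illustrative:

#include <liburing.h>
#include <errno.h>

/* Post a CQE on 'dst' carrying 'data' in its user_data field; the
 * 'len' argument (0 here) ends up in cqe->res on the target ring. */
static int send_ring_msg(struct io_uring *src, struct io_uring *dst,
                         __u64 data)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(src);

    if (!sqe)
        return -ENOMEM;
    io_uring_prep_msg_ring(sqe, dst->ring_fd, 0, data, 0);
    return io_uring_submit(src);
}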

Main commit

Add support for bind/listen

Native support for bind and listen operations has been added to io_uring. This is particularly useful if a connection has been instantiated with any of the direct variants of IORING_OP_ACCEPT, as those instantiate a direct/fixed io_uring file descriptor, and hence no normal file descriptor exists for these connections. This means the regular bind(2) and listen(2) system calls cannot be used. Added in 6.11. See the liburing io_uring_prep_bind(3) and io_uring_prep_listen(3) man pages for details.
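
As a rough sketch, the two preps can be linked so they run in order (a normal file descriptor is used here for brevity; a direct descriptor would additionally set IOSQE_FIXED_FILE and pass the fixed-file index as the fd):

#include <liburing.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

/* Bind and listen on an already-created socket purely via io_uring.
 * The two SQEs are linked so listen only runs if bind succeeds (a
 * failed link completes the listen with -ECANCELED). */
static int setup_listener(struct io_uring *ring, int sockfd,
                          unsigned short port)
{
    struct sockaddr_in addr = { 0 };
    struct io_uring_sqe *sqe;

    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port = htons(port);

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_bind(sqe, sockfd, (struct sockaddr *) &addr,
                       sizeof(addr));
    sqe->flags |= IOSQE_IO_LINK;

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_listen(sqe, sockfd, SOMAXCONN);

    return io_uring_submit(ring);
}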

bind listen

Improved support for coalescing huge page segments

When registering huge pages as IO buffers, rather than breaking them up into smaller hardware-sized pages, bigger segments can now be used directly. This enables faster iteration of buffers, and also a smaller memory footprint, as a single huge page would previously have used many index entries in io_uring's storage. While huge pages have long been supported for registered buffers, these additions make regions spanning multiple huge pages more efficient as well. As a result, the lower layers (like the block/storage layer and DMA mapping) can deal with these IOs in a more efficient manner, improving performance and reducing overhead. Added in 6.12. This feature is transparent to applications; it simply makes registered huge page buffers more efficient.
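
No API changes are involved, but for reference, here's a sketch of registering a buffer backed by a 2MB huge page (assuming hugetlb pages are available on the system):

#include <liburing.h>
#include <sys/mman.h>

#define HUGE_SZ (2 * 1024 * 1024)

/* Back a registered buffer with a single 2MB huge page, which the
 * kernel can now track and iterate as one large segment. */
static int register_huge_buf(struct io_uring *ring)
{
    struct iovec iov;
    void *buf;

    buf = mmap(NULL, HUGE_SZ, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    iov.iov_base = buf;
    iov.iov_len = HUGE_SZ;
    return io_uring_register_buffers(ring, &iov, 1);
}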

Main commit

Support for async discard requests

Linux has a block ioctl to issue discard requests to a device, but like other ioctls, it is fully synchronous. This means it blocks the calling thread, and achieving any kind of parallelism for discard operations requires many threads. Needless to say, this is inefficient. 6.12 adds support for discard operations through io_uring, in a fully async manner. Performance details on a basic NVMe device are provided in the merge commit linked below. Since then I did some testing on another device, and not only do async discards use a fraction of the CPU compared to the number of threads needed to sustain the same number of inflight IOs, they were also 5-6x faster at the same work.
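
A rough sketch of queueing several discard ranges in one submission, assuming the io_uring_prep_cmd_discard(3) helper from liburing 2.8 (error handling trimmed):

#include <liburing.h>
#include <stdint.h>

/* Queue 'nr' discard ranges of 'chunk' bytes each against an open
 * block device, then submit them all in one go. */
static int queue_discards(struct io_uring *ring, int fd,
                          uint64_t start, uint64_t chunk, int nr)
{
    for (int i = 0; i < nr; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (!sqe)
            break;
        io_uring_prep_cmd_discard(sqe, fd, start + i * chunk, chunk);
        sqe->user_data = i;
    }
    return io_uring_submit(ring);
}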

Merge

Support for minimum timeout waits

Normally when waiting on events with io_uring, the caller specifies a number of events to wait for, and may also supply a timeout for the wait operation. The wait stops when either condition has been met: either the desired number of events is available, or the wait timed out. In case of a timeout, some events may still be available for the application to process. Applications tend to specify a timeout based on the latency they can tolerate, but since it's not unusual for applications to go through periods of varying load, a single generic timeout can be difficult to choose. This is where the min timeout comes in. If set, an application waits based on the following joint conditions, where n is the number of events being waited for, t is the minimum timeout, and T is the overall timeout:

  1. Wait for n events to become available in t time.
  2. If n events are available, waiting is done and success is returned.
  3. If t time has elapsed and 1 or more events are available, waiting is done and success is returned.
  4. If t time has elapsed and 0 events are available, continue waiting until T time has passed.
  5. If any event becomes available after t time has elapsed, but before T, waiting is done and success is returned.
  6. If T time has expired and no events are available, -ETIME is returned.

This allows applications to set a short minimum timeout, t, to define the latency accepted for a request, while still allowing a much longer overall timeout, T, to expire if no events are available. This helps avoid excessive context switches during periods of low activity. It's worth mentioning that transitioning between the minimum and overall timeout does not entail any context switches for the application. Added in 6.12.
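
A sketch of such a wait, assuming liburing 2.8's io_uring_submit_and_wait_min_timeout(3), which takes the minimum timeout in microseconds alongside the overall timeout:

#include <liburing.h>

/* Wait for up to 8 completions with a 10 usec minimum timeout (t)
 * and a 1 second overall timeout (T). Returns -ETIME if T expires
 * with no completions available. */
static int wait_batch(struct io_uring *ring)
{
    struct __kernel_timespec ts = { .tv_sec = 1 };  /* overall timeout T */
    unsigned min_wait_usec = 10;                    /* minimum timeout t */
    struct io_uring_cqe *cqe;

    return io_uring_submit_and_wait_min_timeout(ring, &cqe, 8, &ts,
                                                min_wait_usec, NULL);
}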

Main commit

Support for absolute timeouts and other clock sources

Waiting on events with a timeout had previously only been supported with relative timeouts. Some use cases would really like absolute timeouts as well, mostly from an efficiency point of view, as they would otherwise need extra calls in the application to retrieve the current time. And contrary to what seems to be popular belief, retrieving the current time is not necessarily a super cheap (or free) operation. io_uring now supports specifying absolute as well as relative timeouts, and specifying either CLOCK_MONOTONIC or CLOCK_BOOTTIME as the clock source. Available in 6.12.
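
A sketch of selecting the clock source, assuming liburing 2.8's io_uring_register_clock(3) and the struct io_uring_clock_register UAPI added in 6.12; once a clock is registered, wait timeouts on the ring are interpreted against that clock:

#include <liburing.h>
#include <time.h>

/* Register CLOCK_BOOTTIME as this ring's wait clock source. */
static int use_boottime_clock(struct io_uring *ring)
{
    struct io_uring_clock_register reg = {
        .clockid = CLOCK_BOOTTIME,
    };

    return io_uring_register_clock(ring, &reg);
}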

Absolute timeouts Selectable clock source

Incremental provided buffer consumption

Provided buffers are a way for applications to provide buffers upfront, typically for reading from sockets. This enables io_uring to pick the next available buffer to receive into when data becomes available on the socket. The alternative to provided buffers is assigning a buffer to a receive operation when it's submitted to io_uring. While this works fine, it can tie up a lot of memory in cases where it's uncertain when data will arrive. The most efficient type of provided buffers are ring provided buffers (see io_uring_setup_buf_ring(3) and related man pages). Normally a provided buffer is wholly consumed when picked. This means that if the provided buffers in a given buffer group ID are 4K in size, a receive operation that only gets 1K of data will still consume the entire buffer. If applications have a mix of smaller and bigger (e.g. streaming) receives, sizing buffers appropriately may be difficult.

In 6.12, support has been added for incremental consumption. This enables the application to provide much larger buffers, and only have individual receives consume exactly the amount out of that buffer that they need.

This means that both the application and the kernel need to keep track of the current receive point. Each receive still passes back a buffer ID and the size consumed; the only difference is that, previously, the next receive would always use the next buffer in the ring. Now the same buffer ID may be returned across multiple receives, each at an offset into that buffer from where the previous receive left off. Example:

Application registers a provided buffer ring, and adds two 32K buffers to the ring.

Buffer1 address: 0x1000000 (buffer ID 0)
Buffer2 address: 0x2000000 (buffer ID 1)

A recv completion is received with the following values:

cqe->res        0x1000  (4k bytes received)
cqe->flags      0x11    (CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 0)

and the application now knows that 4096 bytes of data are available at 0x1000000, the start of that buffer, and that more data from this buffer will be coming. Now the next receive comes in:

cqe->res        0x2000  (8k bytes received)
cqe->flags      0x11    (CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 0)

which tells the application that 8k is available where the last completion left off, at 0x1001000. Next completion is:

cqe->res        0x5000  (20k bytes received)
cqe->flags      0x1     (CQE_F_BUFFER set, buffer ID 0)

and the application now knows that 20k of data is available at 0x1003000, which is where the previous receive ended. CQE_F_BUF_MORE isn't set, as no more data is available in this buffer ID: the three receives (4k + 8k + 20k) have now consumed the entire 32K buffer. The next completion is then:

cqe->res        0x1000  (4k bytes received)
cqe->flags      0x10011 (CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 1)

which tells the application that buffer ID 1 is now the current one, hence there's 4k of valid data at 0x2000000. 0x2001000 will be the next receive point for this buffer ID.

When a buffer will be reused by future CQE completions, IORING_CQE_F_BUF_MORE will be set in cqe->flags. This tells the application that the kernel isn't done with the buffer yet, and that it should expect more completions for this buffer ID. This flag will only be set for provided buffer rings set up with IOU_PBUF_RING_INC, as that's the only type of buffer that can see multiple consecutive completions for the same buffer ID. For any other provided buffer type, any completion that passes back a buffer to the application is final.

Once a buffer has been fully consumed, the buffer ring head is incremented, and the next receive will indicate the next buffer ID in cqe->flags.
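
The bookkeeping the example above implies might look as follows; the inc_group struct and consume_data() are illustrative application state, not liburing API, and error handling (cqe->res < 0) is elided:

#include <liburing.h>
#include <stddef.h>

struct inc_group {
    void *base[2];      /* buffer ID -> start address */
    size_t off[2];      /* buffer ID -> current receive point */
};

/* application-specific data processing; hypothetical */
static void consume_data(void *data, int len)
{
}

static void handle_recv_cqe(struct inc_group *g, struct io_uring_cqe *cqe)
{
    unsigned buf_id = cqe->flags >> IORING_CQE_BUFFER_SHIFT;

    /* cqe->res bytes of fresh data start at the current receive point */
    consume_data((char *) g->base[buf_id] + g->off[buf_id], cqe->res);

    if (cqe->flags & IORING_CQE_F_BUF_MORE) {
        /* kernel isn't done with this buffer; advance the receive point */
        g->off[buf_id] += cqe->res;
    } else {
        /* buffer fully consumed; the next CQE moves to a new buffer ID */
        g->off[buf_id] = 0;
    }
}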

On the send side, the application can manage how much data is sent from an existing buffer by setting sqe->len to the desired send length.

An application requests incremental consumption by setting IOU_PBUF_RING_INC in the provided buffer ring registration. Outside of that, provided buffer ring setup and buffer additions are done as before. The only change is that an application may see multiple completions for the same buffer ID, and hence needs to track where the next receive will happen.

Note that, like existing provided buffer rings, incremental consumption should not be used with IOSQE_ASYNC, as buffer selection and operation completion both require the ring to remain locked for the duration of the operation. Otherwise a full buffer will be consumed, regardless of the size of the IO done.

To set up a provided buffer ring with incremental consumption, the IOU_PBUF_RING_INC flag must be given to io_uring_setup_buf_ring(3) or io_uring_register_buf_ring(3). Available in 6.12.
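
For example, a sketch of registering a two-buffer incremental ring matching the example above (error handling trimmed):

#include <liburing.h>
#include <stdlib.h>

#define BUF_SIZE  (32 * 1024)
#define NR_BUFS   2
#define BGID      0

/* Register a two-entry incremental buffer ring (group ID 0), matching
 * the two 32K buffers in the example above. */
static struct io_uring_buf_ring *setup_inc_ring(struct io_uring *ring)
{
    struct io_uring_buf_ring *br;
    int err;

    br = io_uring_setup_buf_ring(ring, NR_BUFS, BGID,
                                 IOU_PBUF_RING_INC, &err);
    if (!br)
        return NULL;

    for (int i = 0; i < NR_BUFS; i++) {
        io_uring_buf_ring_add(br, malloc(BUF_SIZE), BUF_SIZE, i,
                              io_uring_buf_ring_mask(NR_BUFS), i);
    }
    io_uring_buf_ring_advance(br, NR_BUFS);
    return br;
}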

Main commit

Registered buffer cloning support

An application may register IO buffers with io_uring for more efficient storage IO with O_DIRECT. If the application has multiple threads, it's not unusual to register the same set of buffers with one or more rings on each thread. Normally buffer registration is fast enough that this doesn't pose a problem, but registering really large amounts of memory (hundreds of gigabytes) does still take some time. On my local test system, registering 900GB of memory (think caching system) took about 1 second to complete. A user reported that registering 700GB for their application took more than 2 seconds. While this isn't a big issue at application startup, if threads are more ephemeral in nature, registration times of that magnitude are not acceptable.

Buffer cloning allows a registration to be cloned from an existing ring (the source) into a new ring (the destination). For the above 900GB case, rather than spending around 1 second on the registration, it can now be done in 17 microseconds on the same system. This puts it into the realm of something that can be done dynamically, rather than only at startup.

See io_uring_clone_buffers(3) for more details. Available in 6.12.
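
A sketch of the intended usage, cloning the main ring's registration into a freshly created per-thread ring (the destination must not have buffers registered already):

#include <liburing.h>

/* Set up a per-thread ring and clone the main ring's registered
 * buffers into it, avoiding a full re-registration. */
static int thread_ring_init(struct io_uring *thread_ring,
                            struct io_uring *main_ring)
{
    int ret;

    ret = io_uring_queue_init(64, thread_ring, 0);
    if (ret)
        return ret;

    return io_uring_clone_buffers(thread_ring, main_ring);
}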

Main commit