What's new with io_uring in 6.11 and 6.12
MSG_RING requests can be used to send messages from one ring to another - either data
of some sort, or to pass direct/fixed file descriptors between rings. 6.11 adds a more
efficient way to handle remote posting, particularly on rings setup with
IORING_SETUP_DEFER_TASKRUN. No changes are required on the application side to take
advantage of this feature.
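While the 6.11 improvement itself needs no API changes, here's a minimal sketch of sending a message between two rings with liburing, for reference:

```c
#include <liburing.h>

/* Minimal sketch: post a completion on another ring with MSG_RING.
 * The destination ring sees a CQE with ->user_data == data. */
static int send_ring_msg(struct io_uring *src, struct io_uring *dst,
                         __u64 data)
{
        struct io_uring_sqe *sqe = io_uring_get_sqe(src);

        if (!sqe)
                return -1;
        io_uring_prep_msg_ring(sqe, dst->ring_fd, 0, data, 0);
        return io_uring_submit(src);
}
```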
Support has been added to natively support bind and listen operations in io_uring. This
is particularly useful if a socket has been instantiated with any of the direct variants
of IORING_OP_SOCKET, as those directly instantiate a direct/fixed io_uring file
descriptor, and hence no normal file descriptor exists for these sockets. This means the
regular bind(2) and listen(2) system calls cannot be used. Added in 6.11. See the
liburing io_uring_prep_bind(3) and io_uring_prep_listen(3) man pages for details.
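As a rough sketch of how this might look, assuming a direct socket was created earlier at a known fixed file table index (sock_idx here is illustrative):

```c
#include <liburing.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Sketch: bind and listen on a direct socket, entirely through io_uring.
 * Assumes 'sock_idx' is the fixed file table index filled in by an
 * earlier direct IORING_OP_SOCKET request. */
static int bind_and_listen(struct io_uring *ring, int sock_idx, int port)
{
        struct sockaddr_in addr = { 0 };
        struct io_uring_sqe *sqe;

        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(port);

        /* link bind -> listen so they execute in order */
        sqe = io_uring_get_sqe(ring);
        io_uring_prep_bind(sqe, sock_idx, (struct sockaddr *) &addr,
                           sizeof(addr));
        sqe->flags |= IOSQE_FIXED_FILE | IOSQE_IO_LINK;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_listen(sqe, sock_idx, SOMAXCONN);
        sqe->flags |= IOSQE_FIXED_FILE;

        return io_uring_submit(ring);
}
```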
When registering huge pages as IO buffers, rather than breaking them up into hardware-sized smaller pages, bigger segments can now be used directly. This enables faster iteration of buffers, and also a smaller memory footprint, as a single huge page would previously have used many indexes to be stored by io_uring. While huge pages have been supported for a long time for registered buffers, with these additions they can be dealt with more efficiently for regions spanning multiple huge pages. This allows the lower layers (like the block/storage layer and DMA mapping) to deal with these IOs in a more efficient manner, improving performance and reducing overhead. Added in 6.12. This feature is transparent to applications; it'll just make registered huge page buffers more efficient.
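The feature requires no application changes, but for illustration, a sketch of registering a buffer backed by huge pages (assuming hugetlb pages are configured on the system and len is a multiple of the huge page size):

```c
#include <liburing.h>
#include <sys/mman.h>

/* Sketch: back a registered IO buffer with huge pages, so the kernel
 * can map it as large segments rather than many small pages. */
static int register_huge_buffer(struct io_uring *ring, size_t len)
{
        struct iovec iov;
        void *buf;

        buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (buf == MAP_FAILED)
                return -1;

        iov.iov_base = buf;
        iov.iov_len = len;
        return io_uring_register_buffers(ring, &iov, 1);
}
```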
Linux has a block ioctl to issue discard requests to a device, but like other ioctls, it is fully synchronous. This means that it'll block the calling thread, and to achieve any kind of parallelism for discard operations, many threads must be used. Needless to say, this is inefficient. 6.12 adds support for discard operations through io_uring, in a fully async manner. Performance details on a basic NVMe device are provided in the linked merge commit below. Since then I did some testing on another device, and not only do async discards use a fraction of the CPU compared to the number of threads required to keep the same number of IOs in flight, they were also 5-6x faster for the same work.
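A sketch of queueing an async discard, assuming liburing's io_uring_prep_cmd_discard() helper is available (liburing 2.8+, kernel 6.12+):

```c
#include <liburing.h>
#include <stdint.h>

/* Sketch: queue an async discard of a byte range on a block device fd. */
static int queue_discard(struct io_uring *ring, int bdev_fd,
                         uint64_t offset, uint64_t nbytes)
{
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (!sqe)
                return -1;
        io_uring_prep_cmd_discard(sqe, bdev_fd, offset, nbytes);
        return io_uring_submit(ring);
}
```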
Normally when waiting on events with io_uring, a certain number of events to wait for is specified
by the caller. The caller may also supply a timeout for the wait operation. The wait stops when
either condition has been met - either the desired number of events are available, or the waiting
timed out. In case of a timeout, some events may be available to process by the application. Applications
tend to specify a timeout based on the latency they can tolerate. As it's not unusual for applications
to have varying periods of how busy they are, specifying a generic timeout can be difficult. This is
where min timeout comes in - if set, an application may wait based on the following joint conditions,
where n is the number of events being waited for, t is the minimum timeout, and T is the overall timeout.
- Wait for n events to become available in t time.
- If n events are available, waiting is done and success is returned.
- If t time has elapsed and 1 or more events are available, waiting is done and success is returned.
- If t time has elapsed and 0 events are available, continue waiting until T time has passed.
- If any event becomes available after t time has elapsed, but before T, waiting is done and success is returned.
- If T time has expired and no events are available, -ETIME is returned.
This allows applications to set a short timeout, t, to define the latency accepted for a request,
while still allowing a much longer T to expire if no events are available. This helps avoid
excessive context switches during periods of less activity. It's worth mentioning that transitioning
between the minimum and overall timeout does not entail any context switches of the application. Added
in 6.12.
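A sketch of how this might be used, assuming liburing's io_uring_submit_and_wait_min_timeout() helper and its argument order (liburing 2.8+):

```c
#include <liburing.h>
#include <stddef.h>

/* Sketch: wait for up to 8 completions with a min timeout t of 1ms and
 * an overall timeout T of 100ms. */
static int wait_with_min_timeout(struct io_uring *ring)
{
        struct __kernel_timespec T = { .tv_sec = 0, .tv_nsec = 100000000 };
        struct io_uring_cqe *cqe;

        /* min wait time t is given in usec: 1000 == 1ms */
        return io_uring_submit_and_wait_min_timeout(ring, &cqe, 8, &T,
                                                    1000, NULL);
}
```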
Waiting on events with a timeout has only been supported with relative timeouts. Some use cases would
really like absolute timeouts as well, mostly from an efficiency point of view, as they would otherwise
need to do extra time retrieval calls in the application. And contrary to what seems to be popular
belief, retrieving the current time is not necessarily a super cheap (or free) operation. Now io_uring
supports specifying absolute timeouts as well as relative timeouts, and specifying either
CLOCK_MONOTONIC or CLOCK_BOOTTIME as the clock source. Available in 6.12.
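A sketch of selecting a different clock source, assuming liburing's io_uring_register_clock() helper and the kernel's io_uring_clock_register structure (6.12+):

```c
#include <liburing.h>
#include <time.h>

/* Sketch: switch the clock used for wait timeouts to CLOCK_BOOTTIME. */
static int use_boottime_clock(struct io_uring *ring)
{
        struct io_uring_clock_register reg = { .clockid = CLOCK_BOOTTIME };

        return io_uring_register_clock(ring, &reg);
}
```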
Provided buffers are a way for applications to provide buffers upfront for, typically, reading from
sockets. This enables io_uring to pick the next available buffer to receive into when data becomes
available from the socket. The alternative to provided buffers is assigning a buffer to a receive
operation when it's submitted to io_uring. While this works fine, it can tie up a lot of memory
in cases where it's uncertain when data will become available. The most efficient type of provided
buffers are ring provided buffers (see io_uring_setup_buf_ring(3) and related man pages). Normally
provided buffers are wholly consumed when picked. This means that if the provided buffers in a given
buffer group ID are 4K in size, then a receive operation that only gets 1K of data will still consume
the entire buffer. If applications have a mix of smaller and bigger (e.g. streaming) receives, then
appropriately sizing buffers may be difficult.
In 6.12, support has been added for incremental consumption. This enables the application to provide much larger buffers, and have individual receives consume only exactly as much of that buffer as they need.
This means that both the application and the kernel need to keep track of what the current receive point is. Each receive will still pass back a buffer ID and the size consumed; the only difference is that, previously, the next receive would always be from the next buffer in the ring. Now the same buffer ID may be returned for multiple receives, each at an offset into that buffer from where the previous receive left off. Example:
Application registers a provided buffer ring, and adds two 32K buffers to the ring.
Buffer1 address: 0x1000000 (buffer ID 0)
Buffer2 address: 0x2000000 (buffer ID 1)
A recv completion is received with the following values:
cqe->res 0x1000 (4k bytes received)
cqe->flags 0x11 (CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 0)
and the application now knows that 4096 bytes of data are available at 0x1000000, the start of that buffer, and that more data from this buffer will be coming. Now the next receive comes in:
cqe->res 0x2000 (8k bytes received)
cqe->flags 0x11 (CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 0)
which tells the application that 8k is available where the last completion left off, at 0x1001000. Next completion is:
cqe->res 0x5000 (20k bytes received)
cqe->flags 0x1 (CQE_F_BUFFER set, buffer ID 0)
and the application now knows that 20k of data is available at 0x1003000, which is where the
previous receive ended. CQE_F_BUF_MORE isn't set, as no more data is available in this buffer
ID. The next completion is then:
cqe->res 0x1000 (4k bytes received)
cqe->flags 0x10011 (CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 1)
which tells the application that buffer ID 1 is now the current one, hence there's 4k of valid data at 0x2000000. 0x2001000 will be the next receive point for this buffer ID.
When a buffer will be reused for future CQE completions, IORING_CQE_F_BUF_MORE will be set in
cqe->flags. This tells the application that the kernel isn't done with the buffer yet, and that
it should expect more completions for this buffer ID. It will only be set by provided buffer rings setup
with IOU_PBUF_RING_INC, as that's the only type of buffer that will see multiple consecutive
completions for the same buffer ID. For any other provided buffer type, any completion that passes back
a buffer to the application is final.
Once a buffer has been fully consumed, the buffer ring head is incremented, and the next receive will
indicate the next buffer ID in cqe->flags.
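A sketch of the completion handling this implies; process_data(), buf_base[], and next_off[] are hypothetical application bookkeeping (the start of each registered buffer, and the current offset into it):

```c
#include <liburing.h>
#include <stddef.h>

/* Application-defined consumer (hypothetical). */
void process_data(unsigned char *data, int len);

/* Sketch: handle a receive completion from a buffer ring registered
 * with IOU_PBUF_RING_INC. */
static void handle_recv_cqe(struct io_uring_cqe *cqe,
                            unsigned char *buf_base[], size_t next_off[])
{
        unsigned bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;

        if (cqe->res <= 0)
                return;

        /* cqe->res bytes landed where the previous receive left off */
        process_data(buf_base[bid] + next_off[bid], cqe->res);

        if (cqe->flags & IORING_CQE_F_BUF_MORE) {
                /* kernel isn't done with this buffer; advance the offset */
                next_off[bid] += cqe->res;
        } else {
                /* buffer fully consumed; reset for when it's added back */
                next_off[bid] = 0;
        }
}
```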
On the send side, the application can manage how much data is sent from an existing buffer by setting
sqe->len to the desired send length.
An application can request incremental consumption by setting IOU_PBUF_RING_INC in the provided
buffer ring registration. Outside of that, any provided buffer ring setup and buffer additions are done
as before, with no changes there. The only change is in how an application may see multiple completions for
the same buffer ID, hence needing to know where the next receive will happen.
Note that like existing provided buffer rings, this should not be used with IOSQE_ASYNC, as both
require the ring to remain locked over the duration of the buffer selection and the operation completion.
Otherwise, a full buffer will be consumed regardless of the size of the IO done.
To set up a provided buffer ring with incremental consumption, the IOU_PBUF_RING_INC flag must be
given to io_uring_setup_buf_ring(3) or io_uring_register_buf_ring(3). Available in 6.12.
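A sketch of registering such a ring with the liburing helpers named above:

```c
#include <liburing.h>

/* Sketch: register an 8-entry buffer ring with incremental consumption
 * and hand it one large 1MB buffer. 'base' is application memory. */
static struct io_uring_buf_ring *setup_inc_ring(struct io_uring *ring,
                                                unsigned char *base)
{
        struct io_uring_buf_ring *br;
        int ret;

        br = io_uring_setup_buf_ring(ring, 8, 0, IOU_PBUF_RING_INC, &ret);
        if (!br)
                return NULL;

        /* add one 1MB buffer as buffer ID 0, then publish it */
        io_uring_buf_ring_add(br, base, 1024 * 1024, 0,
                              io_uring_buf_ring_mask(8), 0);
        io_uring_buf_ring_advance(br, 1);
        return br;
}
```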
An application may register IO buffers with io_uring for more efficient storage IO with O_DIRECT.
If the application has multiple threads, it's not unusual to register the same set of buffers with one
or more rings on each thread. Normally buffer registration is fast enough that this doesn't pose a
problem, but registering really large amounts of memory (hundreds of gigabytes) does still take
some time. On my local test system, registering 900GB of memory (think caching system) took about 1 second
to complete. A user reported that registering 700GB for his application took more than 2 seconds. While this
isn't a big issue for application startup, if threads are more ephemeral in nature, then registration times
of that nature are not acceptable.
Buffer cloning allows an application to clone a registration from an existing ring (the source) into a new ring (the destination). For the above 900GB case, rather than spending around 1 second on the registration, it can now be done in 17 microseconds on the same system. This puts it into the realm of something that can be done dynamically, rather than only as a startup operation.
See io_uring_clone_buffers(3) for more details. Available in 6.12.
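A sketch of how cloning might be used; presumably the speedup comes from the kernel taking references on the existing registration rather than re-mapping the memory:

```c
#include <liburing.h>

/* Sketch: duplicate the source ring's registered buffers into a freshly
 * created destination ring, avoiding a full re-registration. The
 * destination must not have buffers registered already. */
static int clone_registered_buffers(struct io_uring *dst,
                                    struct io_uring *src)
{
        return io_uring_clone_buffers(dst, src);
}
```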