io_uring_submit_and_wait() unexpected behavior #1301
I still cannot figure this out after 3 days. A single submitted nvme request takes 50-60 microseconds to complete on my system. 512 submitted requests (4K reads from random offsets of a large file) cause io_uring_submit_and_wait(&ring, 1) to block for 4000-6000 microseconds and return all 512 cqe's at once. How can I make it return after 60 microseconds with just a few cqe's? I suspect there is some kernel setting that I am missing. If @axboe or @isilence could offer any tips on what to try, I would really appreciate that! Thank you. |
So it appears that what's blocking is not the wait() part but the submission itself. The latency is just as high when I replace io_uring_submit_and_wait() with io_uring_submit(). It takes ~4 milliseconds to submit 500 read requests. Looking at /sys/kernel/debug/tracing/trace_pipe, there is a 100-200 microsecond wait every 32 nvme_setup_cmd requests. For example:
<...>-396466 [003] ..... 1610636.311239: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=4, cmdid=37280, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_read slba=993271344, len=7, ctrl=0x0, dsmgmt=0, reftag=0)
I am guessing some queue somewhere fills up and blocks the call. /sys/block/nvme0n1/queue/nr_requests on my system is 1023. Any suggestions on what to check / tweak so that I could have 500 in-flight read requests to the SSD without blocking? |
Could this be related to #1184? nr_requests is set to a large number in my case (1023), but perhaps the actual NVMe drive has a much lower queue size. I am seeing this behavior with EXT4 and kernels ranging from 6.5 to 6.11. |
Yes, it sounds like ext4 ignores NOWAIT somewhere. Does it reproduce with newer kernels? |
Thank you for your reply, Pavel. I am seeing this on 6.11 and block devices too (e.g. /dev/nvme1n1), no filesystem needed. More often than not, io_uring_submit() takes a millisecond or more, then io_uring_wait_cqe() completes instantly and all 512 cqe's are returned at once. These are random 4K reads on a fast SSD. I will post some code to demonstrate this issue tomorrow. |
Here is a quick test:
This submits 512 4K read requests into random offsets of a file opened with O_DIRECT, then waits for the first cqe and prints how long the two system calls took and how many cqes were reaped in total. On my Ubuntu 24.04 laptop (6.8 kernel) and a WD_BLACK SN850X SSD, I am seeing results such as -- EXT4
-- Block device
So essentially this behaves like blocking IO -- nearly all the time is spent in submit(). On another desktop running 6.11 and with Solidigm P44 Pro and Corsair MP600 Pro drives, the first wait_cqe() always takes a long time (SSD wake up time?), but if I run the test sequentially, I get
Ideally, I would expect the submit() to return almost instantly -- anything that could block should be punted to some internal queue -- and then the first wait_cqe() to return after 50-100 us with a small subset of completed reads. Thanks. |
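The attached tester isn't preserved in this copy of the thread. A minimal sketch of an equivalent program, based only on the description above, might look like this (the 4 GiB span, build line, and minimal error handling are assumptions):

```c
/* Build (assumed): gcc -O2 -o qd_test qd_test.c -luring
 * Run:             ./qd_test <file-or-block-device>                */
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define QD    512
#define BLKSZ 4096
#define SPAN  (4ULL << 30)   /* assume the target is at least 4 GiB */

static long usec_between(struct timeval *a, struct timeval *b)
{
	return (b->tv_sec - a->tv_sec) * 1000000L + (b->tv_usec - a->tv_usec);
}

int main(int argc, char **argv)
{
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	struct timeval t0, t1, t2;
	int fd, i, reaped = 0;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0 || io_uring_queue_init(QD, &ring, 0) < 0)
		return 1;

	/* queue 512 random 4K reads; O_DIRECT needs aligned buffers/offsets */
	for (i = 0; i < QD; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
		void *buf;
		if (posix_memalign(&buf, BLKSZ, BLKSZ))
			return 1;
		io_uring_prep_read(sqe, fd, buf, BLKSZ,
				   (random() % (SPAN / BLKSZ)) * BLKSZ);
	}

	gettimeofday(&t0, NULL);
	io_uring_submit(&ring);             /* how long does submission take? */
	gettimeofday(&t1, NULL);
	io_uring_wait_cqe(&ring, &cqe);     /* how long until the first cqe?  */
	gettimeofday(&t2, NULL);

	/* count everything that has already completed */
	do {
		io_uring_cqe_seen(&ring, cqe);
		reaped++;
	} while (io_uring_peek_cqe(&ring, &cqe) == 0);

	printf("submit: %ld us, first wait: %ld us, cqes reaped: %d\n",
	       usec_between(&t0, &t1), usec_between(&t1, &t2), reaped);
	return 0;
}
```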
Updated the code in previous comment to use gettimeofday() instead of clock(). The timings still don't look right. |
I tried a version of the above with the following improvements:
I also tried IORING_SETUP_COOP_TASKRUN, but that didn't seem to help anything. Timings look better now, but still not what I would expect from a fast SSD: 6.8 kernel laptop:
6.11 kernel desktop:
Most of the time is still in submit(), and wait_cqe() returns fairly quickly with nearly all cqes.
Any ideas what else to try? Would nvme polled mode help here? |
Thanks for the repro, I'll take a look.
Only COOP_TASKRUN / DEFER_TASKRUN might matter for the problem, others are optimisations but shouldn't considerably change latency.
Hmm, that's more mysterious then, but I've seen all sorts of weird stuff, like raw block IO unexpectedly waiting on a filesystem's atime update because the device node lived on a filesystem instead of in /dev/. I'll try it, but I might ask you to run a bpftrace script if it doesn't reproduce. |
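A minimal sketch of creating a ring with the task-run flags discussed here, using the liburing setup flags; the fallback to COOP_TASKRUN on older kernels is an assumption about how one might structure it:

```c
#include <errno.h>
#include <liburing.h>

/* DEFER_TASKRUN defers completion work until the task waits for it and
 * requires SINGLE_ISSUER plus a 6.1+ kernel; COOP_TASKRUN (5.19+) is the
 * milder variant that avoids forcibly interrupting the task. */
static int setup_ring(struct io_uring *ring, unsigned entries)
{
	int ret = io_uring_queue_init(entries, ring,
			IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN);
	if (ret == -EINVAL)   /* flags not supported on this kernel */
		ret = io_uring_queue_init(entries, ring, IORING_SETUP_COOP_TASKRUN);
	return ret;
}
```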
Can you run back to back QD=1 and QD=512 for both nvme and fs? |
Sure. I changed IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN to just IORING_SETUP_COOP_TASKRUN as that gives slightly better results. Ubuntu 24.04 laptop, 6.8 kernel, WD_BLACK SN850X. Fairly low-end but recent Intel CPU. -- EXT4
-- Block device
On the faster desktop, 512 sqe submission time is about 700 us for a block device. I wonder -- could it be that everything is working as expected, and it's just expensive to submit a read request in linux? Assuming 1.4 us per read request, that's about 5000 CPU cycles. Perhaps between io_uring, block device, nvme driver, etc. there is enough complexity to account for that? And so by the time submit() processes 64 requests, the previous 64 requests are already back from the SSD, making it appear like submit() blocks. |
FWIW,
Right, that's what I was thinking about. You're just submitting too many requests at the same time. 1.4us per request is reasonable, and it further depends on the kernel config and what block features are enabled. I'd say don't submit so many; 512 is neither good for latency nor for throughput, e.g. the kernel limits how many requests can be in flight to a device, see |
Makes sense, thank you for your help. |
With IORING_SETUP_IOPOLL (nvme.poll_queues=4), things are quite a bit faster -- less than 1 us per 4K read request, or over 1M IOPS. Unfortunately, most of the time is still spent in io_uring_submit() even with lower QDs. This makes liburing behave more like blocking IO: by the time submit() returns, nearly all requests are already completed. I suppose I should try nvme passthrough next. Perhaps there is less overhead there, and submit() will return faster. The motivation for trying to improve submit() latency is to have the following flow: submit() a bunch of requests, then process completed requests while additional requests are in flight. Because submit() blocks until nearly everything is processed, it's difficult to have requests to an nvme in flight while doing other work. |
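A minimal sketch of the polled-mode setup mentioned above; the device path, entry count, and helper shape are illustrative assumptions:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <unistd.h>

/* With IORING_SETUP_IOPOLL, completions are reaped by polling the device
 * from the wait side (io_uring_wait_cqe() / io_uring_submit_and_wait())
 * instead of via interrupts, so it needs O_DIRECT I/O and a driver with
 * poll queues configured, e.g. nvme.poll_queues=4. Returns the fd, or -1. */
static int open_polled(struct io_uring *ring, const char *dev, unsigned entries)
{
	int fd = open(dev, O_RDONLY | O_DIRECT);   /* e.g. "/dev/nvme0n1" */
	if (fd < 0)
		return -1;
	if (io_uring_queue_init(entries, ring, IORING_SETUP_IOPOLL) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```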
You seem to be equating "submitting many requests takes quite a bit of time" with "blocking". Is there any evidence of blocking here? There should not be blocking, if there is, then there's certainly a bug. That part I'd like to get to the bottom of. I haven't really looked into this case as Pavel was looking at it, but one thing that stuck out to me is that your batch size is way too big. Most NVMe devices will have a per-queue submit limit of 1023, generally. Sometimes it's smaller. You can see what yours is by checking |
In terms of overhead, a distro-level configuration will have a bunch of unnecessary overhead. Things like blk-throttle, blk-latency, blk-wbt, etc - all of those will add overhead to the IO submission and completion path. Some of that is just running more code per IO, some of it is doing expensive things like taking many timestamps per IO. Newer kernels will do better in this regard. I've done 12-14M IOPS on a single IO core with a more optimized configuration; most fat distro-like configurations will do less than that. But on a normal modern desktop or server-class CPU that isn't too MHz starved, I'd still expect at least 5M per core. To get those kinds of efficiencies, you do need to have some kind of "mechanical sympathy" in terms of how you drive devices. That's back to my earlier point of not overloading the stack by doing chunks of 512 requests. |
Thank you for the quick reply! Right, so QD=128 is what I am trying. The timings for me on a fast SSD (spec'ed at ~1.4M random 4K read IOPS), 6.11 kernel, nvme.poll_queues=4, and a Ryzen 7700X CPU look like this:
io_uring_submit() - 120 us
This is reading from a block device (/dev/nvme0n1), no filesystem involved. I will try to optimize the kernel configuration as you suggested. Thank you for the pointers on where to start. When you say that you expect 5M per core, is that reads from a block device, not just nops? Do you think it's worth trying nvme passthrough commands to try to reduce the latency in submit()? |
120 usec is a lot. Is that submitting 128? What happens if you do 32, and wait on 32? 128 at a time and waiting on all of them is not a good model. I would not look at trying passthrough just yet. Yes it'll cut some overhead, but it still feels like there's something not optimal just yet which will bite you later anyway. |
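A sketch of the smaller-batch model being suggested, submitting 32 and waiting for those 32 before reaping and refilling; prep_read_at_random_offset() is a hypothetical stand-in for the SQE preparation the tester already does:

```c
#include <liburing.h>

#define BATCH 32

/* Hypothetical helper assumed to exist elsewhere in the tester. */
void prep_read_at_random_offset(struct io_uring_sqe *sqe, int fd);

static void run_batches(struct io_uring *ring, int fd, int nr_batches)
{
	struct io_uring_cqe *cqe;
	unsigned head, count;

	for (int b = 0; b < nr_batches; b++) {
		for (int i = 0; i < BATCH; i++)
			prep_read_at_random_offset(io_uring_get_sqe(ring), fd);

		/* one syscall: push the batch and wait for all of it to finish */
		io_uring_submit_and_wait(ring, BATCH);

		count = 0;
		io_uring_for_each_cqe(ring, head, cqe)
			count++;                 /* check/process cqe->res here */
		io_uring_cq_advance(ring, count);
	}
}
```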
I never quote nop numbers; this is doing actual reads from a device. |
If you attach your current tester I can try it here. |
Thanks, here is the tester, quick and dirty, with some changes since my last message:
There are brief comments at the beginning of the file on how to compile and run. On my system I get the following results:
So it looks like it's closer to 0.5 us per submitted request for QD = 64, and around 0.7 us per request for QD = 256. In my previous test, I only ran one submit / reap cycle, and the submission time was higher (120 us for QD = 128). I guess there is an additional cost to accessing the memory buffers for the first time. Curious what numbers you see on your system. Regarding QD, I think 128-256 is not a bad number to aim for. My SSD claims 50 us read time and 1.4M IOPS, which suggests that 70 requests are processed in parallel. You need a larger number of requests than that to have at least one random read on each parallel path. Additionally, some requests need to be queued in flight on the SSD while the current batch is executing. |
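As a back-of-the-envelope check of that sizing argument, Little's law applied to the spec numbers quoted above:

```latex
N = \text{IOPS} \times \text{latency}
  = 1.4\times10^{6}\ \text{s}^{-1} \times 50\times10^{-6}\ \text{s}
  = 70\ \text{requests in flight}
```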
For filesystems, the submit() times are a bit worse. I see ~120 us / 128 requests for both ext4 and xfs. |
I traced the previous program, no unexpected waiting there, but to be honest it's a simple normal bdev / nvme read, so not likely to misbehave. The only possible scenario would be some trouble with the additional stuff like qos or throttling in your kernel config. For that we can take a perf profile to see the overhead, and I can write a tracer for you to run; I'll return to it in a couple of hours, but again, the numbers suggest it's just pure accumulated overhead of the submission path.
Make sure it doesn't exceed
Check that the thread is not CPU bound, i.e. iowait + idle > 0%. I'd also assume the claimed IOPS figure is for 512B reads unless said otherwise. I don't know your drive, but the bottleneck isn't always in the flash itself, it can be e.g. the ftl / htl |
Try this one, if there is anything suspicious we can dig deeper and record timings.
|
I must be missing something obvious here. Apologies in advance.
I am trying to read an ext4 file from an nvme opened with O_DIRECT.
I initialize io_uring_queue_init(512, &ring, 0) and add 512 4K read requests at random offsets in the file.
I then call io_uring_submit_and_wait(&ring, 1).
My expectation is that this should take 70-100 microseconds and return a subset of completed reads. Instead, the call takes 6-8 milliseconds and returns all 512. Why would it block until all reads are completed when wait_nr = 1?
The kernel is 6.8.