io_uring_submit_and_wait() unexpected behavior #1301

Closed
andyg24 opened this issue Dec 10, 2024 · 25 comments

@andyg24

andyg24 commented Dec 10, 2024

I must be missing something obvious here. Apologies in advance.

I am trying to read an ext4 file from an nvme opened with O_DIRECT.

I initialize io_uring_queue_init(512, &ring, 0) and add 512 4K read requests at random offsets in the file.

I then call io_uring_submit_and_wait(&ring, 1).

My expectation is that this should take 70-100 microseconds and return a subset of completed reads. Instead, the call takes 6-8 milliseconds and returns all 512. Why would it block until all reads are completed when wait_nr = 1?

The kernel is 6.8.

@titaniumdev80

@andyg24 could you let me know if you know Dave Skender's email address?

@andyg24
Author

andyg24 commented Dec 13, 2024

I still cannot figure this out after 3 days. A single submitted nvme request takes 50-60 microseconds to complete on my system. 512 submitted requests (4K reads from random offsets of a large file) cause io_uring_submit_and_wait(&ring, 1) to block for 4000-6000 microseconds and return all 512 cqe's at once. How can I make it return after 60 microseconds with just a few cqe's?

I suspect there is some kernel setting that I am missing.

If @axboe or @isilence could offer any tips on what to try, I would really appreciate that!

Thank you.

@andyg24
Author

andyg24 commented Dec 13, 2024

So it appears that what's blocking is not the wait() part but the submission itself. The latency is just as high when I replace io_uring_submit_and_wait() with io_uring_submit(). It takes ~4 milliseconds to submit 500 read requests.

Looking at /sys/kernel/debug/tracing/trace_pipe, there is a 100-200 microsecond wait every 32 nvme_setup_cmd requests. For example,

<...>-396466 [003] ..... 1610636.311239: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=4, cmdid=37280, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_read slba=993271344, len=7, ctrl=0x0, dsmgmt=0, reftag=0)
<...>-396466 [003] ..... 1610636.311440: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=4, cmdid=12675, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_read slba=709173472, len=7, ctrl=0x0, dsmgmt=0, reftag=0)

I am guessing some queue somewhere fills up and blocks the call. /sys/block/nvme0n1/queue/nr_requests on my system is 1023.

Any suggestions on what to check / tweak so that I could have 500 in-flight read requests to the SSD without blocking?

@andyg24
Author

andyg24 commented Dec 13, 2024

Could this be related to #1184?

nr_requests is set to a large number in my case (1023), but perhaps the actual NVME drive has a much lower queue size. I am seeing this behavior with EXT4 and kernels ranging from 6.5 to 6.11.

@isilence
Collaborator

Yes, it sounds like ext4 ignores NOWAIT somewhere. Does it reproduce with newer kernels?

@andyg24
Author

andyg24 commented Dec 19, 2024

Thank you for your reply, Pavel.

I am seeing this on 6.11 and block devices too (e.g. /dev/nvme1n1), no filesystem needed. More often than not, io_uring_submit() takes a millisecond or more, then io_uring_wait_cqe() completes instantly and all 512 cqe's are returned at once. These are random 4K reads on a fast SSD.

I will post some code to demonstrate this issue tomorrow.

@andyg24
Author

andyg24 commented Dec 19, 2024

Here is a quick test:

// dd if=/dev/zero of=test.dat bs=1M count=100
// gcc -o io_test io_test.cc -luring
// ./io_test test.dat or ./io_test /dev/nvmeXnX

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <time.h>

#include "liburing.h"

#define QD 512

int main(int argc, char *argv[]) {
  srand(time(NULL));
  struct io_uring ring;

  if (argc != 2) {
    fprintf(stderr, "Usage: io_test <file>\n");
    exit(1);
  }

  if (io_uring_queue_init(QD, &ring, 0 /* flags */) != 0) {
    perror("io_uring_queue_init");
    exit(1);
  }

  int fd = open(argv[1], O_RDONLY | O_DIRECT, 0644);

  if (fd < 0) {
    perror("Unable to open file");
    exit(1);
  }

  char* buffer = (char*) aligned_alloc(4096, QD * 4096);

  if (!buffer) {
    perror("aligned_alloc");
    exit(1);
  }

  // Prepare read requests
  struct io_uring_sqe *sqe;
  for (int i = 0; i < QD; i++) {
    sqe = io_uring_get_sqe(&ring);
    if (!sqe) {
      perror("io_uring_get_sqe");
      exit(1);
    }
    io_uring_prep_read(sqe, fd, buffer + 4096 * i,
                       4096 /* size */,
                       4096 * (rand() % 20000) /* file offset */);
  }

  // Submit requests and reap completions
  struct timeval submit_start, submit_end;
  gettimeofday(&submit_start, NULL);
  int ret = io_uring_submit(&ring);
  gettimeofday(&submit_end, NULL);

  if (ret != QD) {
    perror("io_uring_submit");
    exit(1);
  }

  struct io_uring_cqe *cqe;
  struct timeval wait_start, wait_end;
  gettimeofday(&wait_start, NULL);
  ret = io_uring_wait_cqe(&ring, &cqe);
  gettimeofday(&wait_end, NULL);

  if (ret < 0) {
    perror("io_uring_wait_cqe");
    exit(1);
  }

  unsigned head;
  int num_cqes = 0;

  io_uring_for_each_cqe(&ring, head, cqe) {
    num_cqes++;
  }

  int submit_us = submit_end.tv_usec - submit_start.tv_usec;
  if (submit_us < 0) submit_us += 1000000;
  int wait_us = wait_end.tv_usec - wait_start.tv_usec;
  if (wait_us < 0) wait_us += 1000000;

  printf("Submitted %d sqes in %d us, waited %d us for %d cqes\n",
         QD, submit_us, wait_us, num_cqes);

  io_uring_cq_advance(&ring, num_cqes);
  io_uring_queue_exit(&ring);

  close(fd);
  free(buffer);

  return 0;
}

This submits 512 4K read requests into random offsets of a file opened with O_DIRECT, then waits for the first cqe and prints how long the two system calls took and how many cqes were reaped in total.

On my Ubuntu 24.04 laptop (6.8 kernel) and a WD_BLACK SN850X SSD, I am seeing results such as

-- EXT4

# dd if=/dev/zero of=test.dat bs=1M count=100
# ./io_test test.dat 
Submitted 512 sqes in 7150 us, waited 1 us for 480 cqes

-- Block device

# ./io_test /dev/nvme0n1
Submitted 512 sqes in 4442 us, waited 0 us for 507 cqes

So essentially this behaves like blocking IO -- nearly all the time is spent in submit().

On another desktop running 6.11 and with Solidigm P44 Pro and Corsair MP600 Pro drives, the first wait_cqe() always takes a long time (SSD wake up time?), but if I run the test sequentially, I get

# ./io_test /dev/nvme0n1; ./io_test /dev/nvme0n1
Submitted 512 sqes in 1425 us, waited 6300 us for 287 cqes
Submitted 512 sqes in 1544 us, waited 0 us for 506 cqes

Ideally, I would expect the submit() to return almost instantly -- anything that could block should be punted to some internal queue -- and then the first wait_cqe() to return after 50-100 us with a small subset of completed reads.

Thanks.

@andyg24
Author

andyg24 commented Dec 19, 2024

Updated the code in the previous comment to use gettimeofday() instead of clock(). The timings still don't look right.

@andyg24
Author

andyg24 commented Dec 20, 2024

I tried a version of the above with the following improvements:

  • Registered files
  • Registered buffers
  • IORING_SETUP_SINGLE_ISSUER and IORING_SETUP_DEFER_TASKRUN flags in io_uring_queue_init().

I also tried IORING_SETUP_COOP_TASKRUN, but that didn't seem to help anything.

Timings look better now, but still not what I would expect from a fast SSD:

6.8 kernel laptop:

# ./io_test /dev/nvme0n1
Submitted 512 sqes in 1552 us, waited 70 us for 471 cqes

6.11 kernel desktop:

# ./io_test /dev/nvme0n1; ./io_test /dev/nvme0n1; ./io_test /dev/nvme0n1
Submitted 512 sqes in 538 us, waited 6912 us for 267 cqes
Submitted 512 sqes in 663 us, waited 53 us for 475 cqes
Submitted 512 sqes in 640 us, waited 52 us for 475 cqes

Most of the time is still in submit(), and wait_cqe() returns fairly quickly with nearly all cqes.

// dd if=/dev/zero of=test.dat bs=1M count=100
// gcc -o io_test io_test.cc -luring
// ./io_test test.dat or ./io_test /dev/nvmeXnX

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <time.h>

#include "liburing.h"

#define QD 512

int main(int argc, char *argv[]) {
  srand(time(NULL));
  struct io_uring ring;

  if (argc != 2) {
    fprintf(stderr, "Usage: io_test <file>\n");
    exit(1);
  }

  if (io_uring_queue_init(QD, &ring,
          IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN) != 0) {
    perror("io_uring_queue_init");
    exit(1);
  }

  int fd = open(argv[1], O_RDONLY | O_DIRECT | O_LARGEFILE, 0644);

  if (fd < 0) {
    perror("Unable to open file");
    exit(1);
  }

  // Register files
  io_uring_register_files(&ring, &fd, 1);

  // Register buffers
  iovec* iovecs = (iovec*) calloc(QD, sizeof(struct iovec));

  char* bufs[QD];

  for (int i = 0; i < QD; i++) {
    char* buf = (char*) aligned_alloc(4096, 4096);
    bufs[i] = buf;
    iovecs[i].iov_base = buf;
    iovecs[i].iov_len = 4096;
  }

  io_uring_register_buffers(&ring, iovecs, QD);

  // Prepare read requests
  struct io_uring_sqe *sqe;
  for (int i = 0; i < QD; i++) {
    sqe = io_uring_get_sqe(&ring);
    if (!sqe) {
      perror("io_uring_get_sqe");
      exit(1);
    }
    io_uring_prep_read_fixed(
        sqe, 0, bufs[i],
        4096 /* size */,
        4096 * (rand() % 20000) /* file offset */,
        i /* buf_index */);
    sqe->flags |= IOSQE_FIXED_FILE;
  }

  // Submit requests and reap completions
  struct timeval submit_start, submit_end;
  gettimeofday(&submit_start, NULL);

  int ret = io_uring_submit(&ring);

  gettimeofday(&submit_end, NULL);

  if (ret != QD) {
    perror("io_uring_submit");
    exit(1);
  }

  struct timeval wait_start, wait_end;
  gettimeofday(&wait_start, NULL);

  struct io_uring_cqe *cqe;
  ret = io_uring_wait_cqe(&ring, &cqe);

  gettimeofday(&wait_end, NULL);

  if (ret < 0) {
    perror("io_uring_wait_cqe");
    exit(1);
  }

  unsigned head;
  int num_cqes = 0;

  io_uring_for_each_cqe(&ring, head, cqe) {
    num_cqes++;
  }

  io_uring_cq_advance(&ring, num_cqes);

  int submit_us = submit_end.tv_usec - submit_start.tv_usec;
  if (submit_us < 0) submit_us += 1000000;
  int wait_us = wait_end.tv_usec - wait_start.tv_usec;
  if (wait_us < 0) wait_us += 1000000;

  printf("Submitted %d sqes in %d us, waited %d us for %d cqes\n",
         QD, submit_us, wait_us, num_cqes);

  io_uring_unregister_files(&ring);
  io_uring_unregister_buffers(&ring);
  io_uring_queue_exit(&ring);

  close(fd);

  for (int i = 0; i < QD; i++) {
    free(bufs[i]);
  }

  return 0;
}

Any ideas what else to try? Would nvme polled mode help here?

@isilence
Collaborator

Thanks for the repro, I'll take a look.

> I tried a version of the above with the following improvements:

Only COOP_TASKRUN / DEFER_TASKRUN might matter for the problem; the others are optimisations but shouldn't change latency considerably.

> I am seeing this on 6.11 and block devices too (e.g. /dev/nvme1n1), no filesystem needed.

Hmm, that's more mysterious then, but I've seen all sorts of weird stuff, like raw block IO unexpectedly waiting on a filesystem's atime update because the device node lived on a filesystem instead of in /dev/. I'll try it, but I might ask you to run a bpftrace script if it doesn't reproduce.

@isilence
Collaborator

Can you run back to back QD=1 and QD=512 for both nvme and fs?

@andyg24
Author

andyg24 commented Dec 20, 2024

Sure. I changed IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN to just IORING_SETUP_COOP_TASKRUN as that gives slightly better results.

Ubuntu 24.04 laptop, 6.8 kernel, WD_BLACK SN850X. Fairly low-end but recent Intel CPU.

-- EXT4

# ./io_test_512 test.dat; ./io_test_1 test.dat; ./io_test_512 test.dat
Submitted 512 sqes in 2409 us, waited 0 us for 476 cqes
Submitted 1 sqes in 32 us, waited 125 us for 1 cqes
Submitted 512 sqes in 2149 us, waited 0 us for 476 cqes

-- Block device

# ./io_test_512 /dev/nvme0n1; ./io_test_1 /dev/nvme0n1; ./io_test_512 /dev/nvme0n1
Submitted 512 sqes in 1048 us, waited 0 us for 321 cqes
Submitted 1 sqes in 16 us, waited 28 us for 1 cqes
Submitted 512 sqes in 978 us, waited 1 us for 380 cqes

On the faster desktop, 512 sqe submission time is about 700 us for a block device.

I wonder -- could it be that everything is working as expected, and it's just expensive to submit a read request in linux? Assuming 1.4 us per read request, that's about 5000 CPU cycles. Perhaps between io_uring, block device, nvme driver, etc. there is enough complexity to account for that? And so by the time submit() processes 64 requests, the previous 64 requests are already back from the SSD, making it appear like submit() blocks.

@isilence
Collaborator

isilence commented Dec 20, 2024

> Sure. I changed IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN to just IORING_SETUP_COOP_TASKRUN as that gives slightly better results.

FWIW, IORING_SETUP_DEFER_TASKRUN is usually recommended and should be more efficient.

> I wonder -- could it be that everything is working as expected, and it's just expensive to submit a read request in linux? Assuming 1.4 us per read request, that's about 5000 CPU cycles. Perhaps between io_uring, block device, nvme driver, etc. there is enough complexity to account for that?

Right, that's what I was thinking about. You're just submitting too many requests at the same time. 1.4us per request is reasonable, and it further depends on the kernel config and what block features are enabled.

I'd say don't submit so many; 512 is good for neither latency nor throughput. For example, the kernel limits how many in-flight requests there can be per device, see /sys/block/<device name>/queue/nr_requests, which usually defaults to something around 32 or 64. SSDs have limitations as well.

@andyg24
Author

andyg24 commented Dec 20, 2024

Makes sense, thank you for your help.

@andyg24 andyg24 closed this as completed Dec 20, 2024
@andyg24
Author

andyg24 commented Dec 23, 2024

With IORING_SETUP_IOPOLL (nvme.poll_queues=4), things are quite a bit faster -- less than 1 us per 4K read request, or over 1M IOPS. Unfortunately, most of the time is still spent in io_uring_submit() even with lower QDs. This makes liburing behave more like blocking IO: by the time submit() returns, nearly all requests are already completed.

I suppose I should try nvme passthrough next. Perhaps there is less overhead there, and submit() will return faster.

The motivation for trying to improve submit() latency is to have the following flow: submit() a bunch of requests, then process completed requests while additional requests are in flight. Because submit() blocks until nearly everything is processed, it's difficult to have requests to an nvme in flight while doing other work.

@axboe
Owner

axboe commented Dec 23, 2024

You seem to be equating "submitting many requests takes quite a bit of time" with "blocking". Is there any evidence of blocking here? There should not be blocking, if there is, then there's certainly a bug. That part I'd like to get to the bottom of.

I haven't really looked into this case as Pavel was looking at it, but one thing that stuck out to me is that your batch size is way too big. Most NVMe devices will have a per-queue submit limit of 1023, generally; sometimes it's smaller. You can see what yours is by checking /sys/block/nvmeXn1/queue/nr_requests. Experimentally, I've found that even the fastest NVMe devices reach peak performance at a queue depth of 128. Submitting (and keeping in flight) 512 at a time is way too big: you're just adding latency, both on the submission side and on the device side. Rather than doing huge chunks like that, submit e.g. 32 at a time, and aim to keep 128 or less in flight on the device side. You'll see much better performance that way.
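To illustrate that shape, here's a rough, untested sketch that decouples the submit batch (32) from the target in-flight depth (128). The file name, the constants (BATCH, DEPTH, TOTAL, NBLOCKS), the round-robin buffer reuse and the minimal error handling are all illustrative, not a reference implementation:

// gcc -o batched_read batched_read.c -luring
// ./batched_read /dev/nvmeXnX or ./batched_read test.dat

#define _GNU_SOURCE   /* O_DIRECT */

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#include "liburing.h"

#define DEPTH   128   /* max requests kept in flight on the device side */
#define BATCH    32   /* SQEs handed to io_uring_submit() per call */
#define NBLOCKS 20000 /* pick random offsets among this many 4K blocks */
#define TOTAL   4096  /* total reads to issue for the demo */

int main(int argc, char *argv[]) {
  if (argc != 2) {
    fprintf(stderr, "Usage: %s <file or block device>\n", argv[0]);
    return 1;
  }
  srand(time(NULL));

  struct io_uring ring;
  if (io_uring_queue_init(DEPTH, &ring, 0) != 0)
    return 1;

  int fd = open(argv[1], O_RDONLY | O_DIRECT);
  if (fd < 0) {
    perror("open");
    return 1;
  }

  char *buf = (char *) aligned_alloc(4096, DEPTH * 4096);
  if (!buf)
    return 1;

  int inflight = 0, issued = 0, done = 0;

  while (done < TOTAL) {
    // Top up: queue a small batch while staying under DEPTH in flight.
    int batch = 0;
    while (batch < BATCH && inflight + batch < DEPTH && issued < TOTAL) {
      struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
      if (!sqe)
        break;
      // Buffers are reused round-robin; overlapping reads into the same
      // buffer are harmless here since the data is discarded.
      io_uring_prep_read(sqe, fd, buf + 4096 * (issued % DEPTH), 4096,
                         4096ULL * (rand() % NBLOCKS));
      issued++;
      batch++;
    }
    if (batch) {
      io_uring_submit(&ring);
      inflight += batch;
    }

    // Wait for at least one completion, then drain whatever is ready.
    struct io_uring_cqe *cqe;
    if (io_uring_wait_cqe(&ring, &cqe) < 0)
      break;
    unsigned head;
    int reaped = 0;
    io_uring_for_each_cqe(&ring, head, cqe) {
      if (cqe->res < 0)
        fprintf(stderr, "read failed: %d\n", cqe->res);
      reaped++;
    }
    io_uring_cq_advance(&ring, reaped);
    inflight -= reaped;
    done += reaped;
  }

  printf("completed %d reads\n", done);
  io_uring_queue_exit(&ring);
  close(fd);
  free(buf);
  return 0;
}

With something like this, no single io_uring_submit() call has to push hundreds of SQEs, and completions are reaped continuously instead of all at once.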

@axboe
Owner

axboe commented Dec 23, 2024

In terms of overhead, a distro-level configuration will have a bunch of unnecessary overhead. Things like blk-throttle, blk-latency, blk-wbt, etc. will all add overhead to the IO submission and completion path. Some of that is just running more code per IO; some of it is doing expensive things like taking many timestamps per IO. Newer kernels do better in this regard.

I've done 12-14M IOPS on a single IO core with a more optimized configuration; a fat distro-like configuration will do less than that. But on a normal modern desktop or server-class CPU that isn't too MHz starved, I'd still expect at least 5M per core. To get those kinds of efficiencies, you do need some kind of "mechanical sympathy" in terms of how you drive the devices. That's back to my earlier point of not overloading the stack by doing chunks of 512 requests.

@andyg24
Author

andyg24 commented Dec 23, 2024

Thank you for the quick reply!

Right, so QD=128 is what I am trying. The timings for me on a fast SSD (spec'ed at ~1.4M random 4K read IOPS), a 6.11 kernel, nvme.poll_queues=4, and a Ryzen 7700X CPU look like this:

io_uring_submit() - 120 us
io_uring_wait_cqe() -- i.e. waiting for the first cqe -- 0 us
waiting for all remaining cqes - 60 us

This is reading from a block device (/dev/nvme0n1), no filesystem involved.

I will try to optimize kernel configuration as you suggested. Thank you for the pointers on where to start. When you say that you expect 5M per core, is that reads from a block device, not just nops?

Do you think it's worth trying nvme passthrough commands to try to reduce the latency in submit()?

@axboe
Owner

axboe commented Dec 24, 2024

120 usec is a lot. Is that submitting 128? What happens if you do 32, and wait on 32? 128 at a time and waiting on all of them is not a good model.

I would not look at trying pass through just yet. Yes it'll cut some overhead, but still feels like there's something not optimal just yet which will bite you later anyway.

@axboe
Owner

axboe commented Dec 24, 2024

I never quote nop numbers; this is doing actual reads from a device.

@axboe
Owner

axboe commented Dec 24, 2024

If you attach your current tester I can try it here.

@andyg24
Author

andyg24 commented Dec 24, 2024

Thanks, here is the tester, quick and dirty, with some changes since my last message:

// dd if=/dev/zero of=test.dat bs=1M count=100
// gcc -o io_test io_test.cc -luring
// ./io_test 128 test.dat or ./io_test 128 /dev/nvmeXnX

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <time.h>

#include "liburing.h"

#define MAX_QD 1024

int get_us(struct timeval ts1, struct timeval ts2) {
  int us = ts1.tv_usec - ts2.tv_usec;
  return (us < 0) ? us + 1000000 : us;
}

int main(int argc, char *argv[]) {
  srand(time(NULL));

  struct io_uring ring;

  if (argc != 3) {
    fprintf(stderr, "Usage: io_test <qd> <file>\n");
    exit(1);
  }

  int qd = atoi(argv[1]);

  if (qd < 1 || qd > MAX_QD) {
    fprintf(stderr, "Invalid qd argument\n");
    exit(1);
  }

  if (io_uring_queue_init(qd, &ring,
          IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_IOPOLL) != 0) {
    perror("io_uring_queue_init");
    exit(1);
  }

  int fd = open(argv[2], O_RDONLY | O_DIRECT | O_LARGEFILE, 0644);

  if (fd < 0) {
    perror("Unable to open file");
    exit(1);
  }

  // Register files
  io_uring_register_files(&ring, &fd, 1);

  // Register buffers
  iovec* iovecs = (iovec*) calloc(qd, sizeof(struct iovec));

  char* bufs[qd];

  for (int i = 0; i < qd; i++) {
    char* buf = (char*) aligned_alloc(4096, 4096);
    bufs[i] = buf;
    iovecs[i].iov_base = buf;
    iovecs[i].iov_len = 4096;
  }

  io_uring_register_buffers(&ring, iovecs, qd);

  // Measure submit / reap timings a number of times
  for (int cycle = 0; cycle < 16; cycle++) {

    struct timeval start_ts, prepare_ts, submit_ts, reap_one_ts, reap_all_ts;
    gettimeofday(&start_ts, NULL);

    // Prepare read requests
    struct io_uring_sqe *sqe;
    for (int i = 0; i < qd; i++) {
      sqe = io_uring_get_sqe(&ring);
      if (!sqe) {
        perror("io_uring_get_sqe");
        exit(1);
      }
      io_uring_prep_read_fixed(
          sqe, 0, bufs[i],
          4096 /* size */,
          4096 * (rand() % 20000) /* file offset */,
          i /* buf_index */);
      sqe->flags |= IOSQE_FIXED_FILE;
    }

    gettimeofday(&prepare_ts, NULL);

    // Submit requests
    int ret = io_uring_submit(&ring);
    gettimeofday(&submit_ts, NULL);

    if (ret != qd) {
      perror("io_uring_submit");
      exit(1);
    }

    struct io_uring_cqe *cqe;
    ret = io_uring_wait_cqe(&ring, &cqe);

    gettimeofday(&reap_one_ts, NULL);

    if (ret < 0) {
      perror("io_uring_wait_cqe");
      exit(1);
    }

    unsigned head;
    int num_cqes = 0;

    while (num_cqes < qd) {
      if (io_uring_peek_cqe(&ring, &cqe) == 0) {
        io_uring_cqe_seen(&ring, cqe);
        num_cqes++;
      }
    }

    gettimeofday(&reap_all_ts, NULL);

    int prepare_us = get_us(prepare_ts, start_ts);
    int submit_us = get_us(submit_ts, prepare_ts);
    int reap_one_us = get_us(reap_one_ts, submit_ts);
    int reap_rest_us = get_us(reap_all_ts, reap_one_ts);

    printf("Submitted %d in %d us, reaped 1 cqe in %d us, rest in %d us\n",
           qd, submit_us, reap_one_us, reap_rest_us);
  }

  io_uring_unregister_files(&ring);
  io_uring_unregister_buffers(&ring);
  io_uring_queue_exit(&ring);

  close(fd);

  for (int i = 0; i < qd; i++) {
    free(bufs[i]);
  }

  return 0;
}

There are brief comments at the beginning of the file on how to compile and run. On my system I get the following results:

# ./io_test 64 /dev/nvme0n1
Submitted 64 in 68 us, reaped 1 cqe in 7054 us, rest in 93 us
Submitted 64 in 38 us, reaped 1 cqe in 907 us, rest in 217 us
Submitted 64 in 34 us, reaped 1 cqe in 6 us, rest in 128 us
Submitted 64 in 35 us, reaped 1 cqe in 3 us, rest in 178 us
Submitted 64 in 35 us, reaped 1 cqe in 5 us, rest in 66 us
Submitted 64 in 35 us, reaped 1 cqe in 4 us, rest in 55 us
Submitted 64 in 35 us, reaped 1 cqe in 5 us, rest in 157 us
Submitted 64 in 34 us, reaped 1 cqe in 4 us, rest in 190 us
Submitted 64 in 35 us, reaped 1 cqe in 3 us, rest in 159 us
Submitted 64 in 35 us, reaped 1 cqe in 4 us, rest in 165 us
Submitted 64 in 35 us, reaped 1 cqe in 5 us, rest in 182 us
Submitted 64 in 35 us, reaped 1 cqe in 5 us, rest in 84 us
Submitted 64 in 35 us, reaped 1 cqe in 5 us, rest in 159 us
Submitted 64 in 35 us, reaped 1 cqe in 5 us, rest in 155 us
Submitted 64 in 35 us, reaped 1 cqe in 6 us, rest in 175 us
Submitted 64 in 34 us, reaped 1 cqe in 3 us, rest in 164 us
# ./io_test 128 /dev/nvme0n1
Submitted 128 in 126 us, reaped 1 cqe in 6983 us, rest in 1154 us
Submitted 128 in 111 us, reaped 1 cqe in 0 us, rest in 174 us
Submitted 128 in 97 us, reaped 1 cqe in 0 us, rest in 82 us
Submitted 128 in 89 us, reaped 1 cqe in 0 us, rest in 158 us
Submitted 128 in 75 us, reaped 1 cqe in 0 us, rest in 150 us
Submitted 128 in 76 us, reaped 1 cqe in 0 us, rest in 108 us
Submitted 128 in 79 us, reaped 1 cqe in 0 us, rest in 182 us
Submitted 128 in 75 us, reaped 1 cqe in 0 us, rest in 189 us
Submitted 128 in 75 us, reaped 1 cqe in 0 us, rest in 199 us
Submitted 128 in 77 us, reaped 1 cqe in 3 us, rest in 147 us
Submitted 128 in 78 us, reaped 1 cqe in 0 us, rest in 178 us
Submitted 128 in 76 us, reaped 1 cqe in 0 us, rest in 165 us
Submitted 128 in 74 us, reaped 1 cqe in 0 us, rest in 111 us
Submitted 128 in 78 us, reaped 1 cqe in 0 us, rest in 59 us
Submitted 128 in 76 us, reaped 1 cqe in 0 us, rest in 63 us
Submitted 128 in 79 us, reaped 1 cqe in 0 us, rest in 119 us
# ./io_test 256 /dev/nvme0n1
Submitted 256 in 230 us, reaped 1 cqe in 8056 us, rest in 1162 us
Submitted 256 in 170 us, reaped 1 cqe in 0 us, rest in 144 us
Submitted 256 in 180 us, reaped 1 cqe in 0 us, rest in 169 us
Submitted 256 in 174 us, reaped 1 cqe in 0 us, rest in 127 us
Submitted 256 in 191 us, reaped 1 cqe in 0 us, rest in 137 us
Submitted 256 in 179 us, reaped 1 cqe in 0 us, rest in 70 us
Submitted 256 in 166 us, reaped 1 cqe in 0 us, rest in 86 us
Submitted 256 in 168 us, reaped 1 cqe in 0 us, rest in 143 us
Submitted 256 in 168 us, reaped 1 cqe in 0 us, rest in 83 us
Submitted 256 in 170 us, reaped 1 cqe in 0 us, rest in 64 us
Submitted 256 in 175 us, reaped 1 cqe in 0 us, rest in 60 us
Submitted 256 in 168 us, reaped 1 cqe in 0 us, rest in 54 us
Submitted 256 in 181 us, reaped 1 cqe in 0 us, rest in 68 us
Submitted 256 in 168 us, reaped 1 cqe in 0 us, rest in 57 us
Submitted 256 in 166 us, reaped 1 cqe in 0 us, rest in 166 us
Submitted 256 in 172 us, reaped 1 cqe in 0 us, rest in 92 us

So it looks like it's closer to 0.5 us per submitted request at qd = 64, and around 0.7 us per request at qd = 256. In my previous test I only ran one submit / reap cycle, and the submission time was higher (120 us for qd = 128). I guess there is an additional cost to touching the memory buffers for the first time.

Curious what numbers you see on your system.

Regarding QD, I think 128-256 is not a bad number to aim for. My SSD claims 50 us read time and 1.4M IOPS, which suggests that 70 requests are processed in parallel. You need a larger number of requests than that to have at least one random read on each parallel path. Additionally, some requests need to be in flight (queued on the SSD) while the current batch is executing.
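(For what it's worth, that estimate is just Little's law: requests in flight ≈ IOPS × latency ≈ 1.4M/s × 50 us ≈ 70.)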

@andyg24
Author

andyg24 commented Dec 24, 2024

For filesystems, the submit() times are a bit worse. I see ~120 us / 128 requests for both ext4 and xfs.

@isilence
Collaborator

I traced the previous program, no unexpected waiting there, but to be honest it's a simple normal bdev / nvme read, so not likely to misbehave. The only possible scenario would be some trouble with additional stuff like qos or throttling in your kernel config. For that we can take a perf profile to see the overhead, and I can write a tracer for you to run (I'll return to it in a couple of hours), but again, the numbers suggest it's just the pure accumulated overhead of the submission path.

> Regarding QD, I think 128-256 is not a bad number to aim for.

Make sure it doesn't exceed /sys/block/nvmeXn1/queue/nr_requests, but in any case, if you're submitting too many at the same time your latency will be limited by the submission path. I'd recommend the same thing Jens said: decouple QD from the number of requests per submission, i.e. submit a small batch, check for completions without waiting (for DEFER_TASKRUN see the get_events() helpers), and repeat until you reach the desired QD; there's a rough sketch of that reap step at the end of this comment.

> My SSD claims 50 us read time and 1.4M IOPS, which suggests that 70 requests are processed in parallel.

Check that the thread is not CPU bound, i.e. iowait + idle > 0%. I'd also assume the claim is for 512B reads unless said otherwise. I don't know your drive, but drives don't always bottleneck in the flash itself; it can just as well be e.g. the ftl / htl.
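To make that reap step concrete, here is a rough sketch of checking for completions without waiting (assumes liburing 2.3+ for io_uring_get_events(); reap_ready() is just an illustrative name, not an existing helper):

#include "liburing.h"

/* Drain whatever has already completed without blocking. With
 * IORING_SETUP_DEFER_TASKRUN, completions are only posted when deferred
 * task work runs, so flush it explicitly before peeking at the CQ ring. */
static int reap_ready(struct io_uring *ring)
{
  struct io_uring_cqe *cqe;
  unsigned head;
  int count = 0;

  io_uring_get_events(ring);  /* run task work, post pending CQEs */

  io_uring_for_each_cqe(ring, head, cqe) {
    /* handle cqe->res / cqe->user_data here */
    count++;
  }
  io_uring_cq_advance(ring, count);
  return count;
}

The issue loop then becomes: top up a small batch of SQEs, io_uring_submit() (or io_uring_submit_and_get_events()), reap whatever is ready, and only fall back to io_uring_wait_cqe() when there is nothing else useful to do.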

@isilence
Collaborator

Try this one; if there is anything suspicious we can dig deeper and record timings.

// bpftrace wait-trace.bt -p <pid>

kprobe:io_submit_sqes {
	@ts[curtask] = 1;
}

kretprobe:io_submit_sqes {
	@ts[curtask] = 0;
}

tracepoint:sched:sched_switch {
	if (args.prev_pid && @ts[curtask]) {
		@[kstack] = count();
	}
}

END {
	clear(@ts);
}
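Any stacks it prints are places where the submitting task was switched out while inside io_submit_sqes(); if the output stays empty, the submission path isn't sleeping and the time is just CPU overhead.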
