-
Epoll does it with a pretty dirty (IMHO) hack; something similar for io_uring would add a good amount of overhead to every io_uring request (probably two spin lock/unlock pairs). I don't think it would be better or more performant than implementing it in userspace, not to mention the additional complexity and a couple of extra checks in the path of those who don't care about it. There could also be some optimisations, e.g. with fixed files, but that already gives me shivers.
With sockets the problem is easily solved by issuing a shutdown; I was arguing that it'd be really great to also have that for non-sockets, especially if I/O on those may never complete. That's one option, but it requires kernel changes.
Not necessarily a solution, but a couple of thoughts below on making it more efficient:
/* fast path: the bit is usually already set, so no atomic op is needed */
if (unlikely(!test_bit(file->used_by_thread_bitvec, current_thread_id))) {
    /* slow path: mark this thread as a user of the file */
    atomic_set_bit(file->used_by_thread_bitvec, current_thread_id);
}
Another way would be to dup the file when you transfer it between cores, so all threads will have to close it when they're done. How do you transfer files? Is it done explicitly via the framework API?
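For illustration, a minimal sketch of that dup-on-transfer idea (the hand-off function and its callers are hypothetical, not part of any existing API): each core ends up with its own descriptor for the same open file, so a close on one core can't race with I/O issued through another core's descriptor.

#include <unistd.h>

/* Hypothetical hand-off between cores: the receiving core gets its own
 * duplicate of the descriptor and owns closing it. dup() creates a new
 * fd referring to the same underlying open file description. */
static int fd_for_transfer(int fd)
{
    int dup_fd = dup(fd);
    if (dup_fd < 0)
        return -1;          /* out of fds, etc. */
    /* hand dup_fd to the target core via the framework's transfer API;
     * the sending core keeps, and eventually closes, its own fd */
    return dup_fd;
}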
-
why wouldn't
-
As Pavel says, shutdown(2) is good for sockets, and hopefully that covers most of your use cases? It sounds like you will need some synchronization between threads in any case, however? E.g.:
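A minimal sketch of the shutdown-then-close pattern for sockets (the drain step is left abstract; it stands in for whatever per-ring completion handling the runtime already does):

#include <sys/socket.h>
#include <unistd.h>

/* shutdown() wakes up reads/writes blocked on this socket in every
 * core's ring: they complete with 0 or an error instead of waiting
 * forever. Only after those completions are reaped is it safe to
 * actually close the descriptor. */
static void retire_socket(int sockfd)
{
    shutdown(sockfd, SHUT_RDWR);
    /* ... let each ring reap the now-completed requests ... */
    close(sockfd);
}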
-
Also note this statement ("every I/O operation adds a refcount to the fd that it's operating on") is wrong: it adds a refcount to the underlying file, not to the fd. There are also complicated cases with fixed files, where files can be open for a lot longer than you expected.
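The fixed-files point can be seen with a small liburing sketch (the file name is arbitrary): once a file is registered with the ring, io_uring holds its own reference to the underlying file, so close() on the fd alone does not release it.

#include <liburing.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);

    int fd = open("/tmp/example.txt", O_RDWR | O_CREAT, 0644);
    int fds[1] = { fd };

    /* registration takes a reference to the underlying struct file */
    io_uring_register_files(&ring, fds, 1);

    /* drops the fd-table reference only; the file stays open because
     * the ring still holds its registered-file reference */
    close(fd);

    /* the file is really released only when it is unregistered
     * (or the ring is torn down) */
    io_uring_unregister_files(&ring);
    io_uring_queue_exit(&ring);
    return 0;
}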
-
Context: I have a language runtime system where we have one OS thread per core, and we run a user-space thread scheduler for lightweight threads on each core. To add support for io_uring, we would want to have one io_uring per core, so we can issue and collect I/O operations from a single OS thread.
The difficulty is with closing fds. The fds are shared across all cores: any lightweight thread on any core can issue I/O on any fd. As discussed in #932, to close an fd in a robust way, we have to cancel any outstanding I/O on the fd too (otherwise the fd doesn't actually get closed, and we could have resource leaks for lightweight threads still blocked on I/O for that fd).
The problem is, there may be outstanding I/O operations on any core's io_uring, not just the ring where the close is being issued. So either we need a shared data structure to know whether operations are outstanding on other cores (and arrange to issue cancels on those cores), or we need to blindly issue a cancel on each of the cores. This quickly becomes very expensive in terms of cross-core traffic. And it's frustrating to always pay this cost just to support sloppy applications.
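For concreteness, the blind variant might look like the following sketch, assuming kernel 5.19+ and liburing 2.2+ (where IORING_ASYNC_CANCEL_FD can cancel everything outstanding on an fd in one request) and a hypothetical per-core ring array; in a real runtime each prep/submit would have to run on the ring's owning thread:

#include <liburing.h>
#include <unistd.h>

/* assumed: one ring per core, each owned by that core's OS thread */
extern struct io_uring rings[];
extern int num_cores;

static void cancel_everywhere_then_close(int fd)
{
    for (int i = 0; i < num_cores; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&rings[i]);
        if (!sqe)
            continue;   /* SQ full; a real runtime would retry */
        /* cancel every outstanding request on fd in this ring */
        io_uring_prep_cancel_fd(sqe, fd, IORING_ASYNC_CANCEL_ALL);
        io_uring_submit(&rings[i]);
    }
    /* ... wait for the cancellation CQEs from every ring ... */
    close(fd);
}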
Does anyone have any good ideas?
Unfortunately it isn't a solution to insist that fds are limited to use on a single core. This isn't something that can be changed in this context. The user's lightweight threads can communicate with each other and pass fds (or the higher-level library handles that contain an fd) between themselves. For example, a classic socket accept loop would accept an fd on (a lightweight thread on) one core and then fork a lightweight thread (which could go to any core) to handle the new connection.
It's also not a robust solution to insist that the user application cancel all I/O operations before closing. Yes, that's what applications should do, but a language runtime has to cope with sloppy user applications too (e.g. by throwing exceptions to the lightweight threads still blocked on I/O on an fd that another thread closed).
It seems to me the best approach would be if io_uring could have a mode where close on an fd would automagically cancel any poll waiters on that fd. It wouldn't be necessary to cancel all ongoing I/O, just the I/O operations that can wait indefinitely. This is the way that select, poll and epoll work (epoll is what the language runtime in question currently uses). With epoll, closing an fd that is registered in an epoll set will generate a notification for that fd with an error.
As I understand it, with io_uring, every I/O operation adds a refcount to the fd that it's operating on, and that is what stops close from really closing files/pipes/sockets, because the refcount is not 0. But somehow select, poll and epoll manage to work without this behaviour of keeping the file alive while the poll is outstanding. Perhaps io_uring could do the same, at least for operations waiting in the poll set. My guess is that if this isn't a zero-cost thing, it'd be behaviour best requested via a flag to io_uring_setup.
Thoughts?
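Until something like that exists, one userspace approximation is to tag the indefinite-wait poll requests so that close can cancel just those. A minimal sketch assuming liburing 2.x; the tagging scheme and helper names are made up for illustration:

#include <liburing.h>
#include <poll.h>
#include <unistd.h>

/* hypothetical tagging scheme: encode the fd into the poll request's
 * user_data so its waiters can be found again at close time */
#define POLL_TAG(fd) ((__u64)(fd) | (1ULL << 63))

static void add_poll_waiter(struct io_uring *ring, int fd)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_poll_add(sqe, fd, POLLIN);
    io_uring_sqe_set_data64(sqe, POLL_TAG(fd));
}

/* cancel only the indefinite poll waiters, not short-lived I/O,
 * then close once the completions have been reaped */
static void cancel_poll_then_close(struct io_uring *ring, int fd)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_poll_remove(sqe, POLL_TAG(fd));
    io_uring_submit(ring);
    /* ... reap the poll-remove CQE and the cancelled poll's CQE ... */
    close(fd);
}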