
post_retry support for mirroring #171

Open
petersilva opened this issue Oct 29, 2024 · 2 comments
Labels
enhancement: New feature or request
mirroring: issue affects or brought to light from mirroring deployment
question: Further information is requested
worries: needs work to clarify status

Comments

@petersilva
Contributor

The client has concerns about the robustness of mirroring post generation during broker outages. Currently, I think the user jobs will just hang, trying desperately to publish notices to the broker.

The post_retry logic (actually all retry logic) depends on having one retry list per process. Each instance has one file per retry queue (download and post being the ones that exist currently.) In the context of libsr3shim this does not make much sense: the processes are typically short-lived, non-daemons.

  • There will be hundreds of thousands of post_retry.pid files created (one per process) if the broker goes down.
  • The process itself cannot retry the posts, because in order to keep user jobs going, the process has to end.

An alternative to the thousands of .pid files would be to post to a pipe, or a named pipe, per node... in which case you need a janitor that reads the named pipe. You end up creating a second IPC network to robustify your IPC network.

Taking the simpler option:

  • we would create post_retry files per process... so there would be hundreds of thousands of such files created during a run. These processes end; they will not retry the posting themselves.
  • something needs a janitor process (likely a Python scheduled watch?) that finds the post_retry.pid files, perhaps puts them into the conventional retry_queue, and deletes the post_retry.pid files. To avoid contention, it makes sure the files it reads are > 1 minute old before trying to process them.
  • the janitor then needs to retry the posts conventionally.

This is one suggested implementation.
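A minimal sketch of what that janitor sweep could look like, assuming the shim drops one serialized notice per line into per-process files under a spool directory. The paths, file layout, and 60-second age threshold below are assumptions for illustration, not the actual sr3 layout:

```python
import os
import time

# Hypothetical locations; the real sr3 spool layout may differ.
SHIM_RETRY_DIR = os.path.expanduser("~/.cache/sr3/shim_post_retry")
MAIN_RETRY_FILE = os.path.expanduser("~/.cache/sr3/post/retry_queue")
MIN_AGE = 60  # seconds; only touch files presumed abandoned by their writer


def sweep_once():
    """Move per-process post_retry files into the conventional retry queue."""
    now = time.time()
    for name in os.listdir(SHIM_RETRY_DIR):
        path = os.path.join(SHIM_RETRY_DIR, name)
        try:
            if now - os.path.getmtime(path) < MIN_AGE:
                continue  # writer may still be alive; leave it for the next pass
            with open(path) as leftover, open(MAIN_RETRY_FILE, "a") as queue:
                for line in leftover:  # one serialized notice per line
                    queue.write(line)
            os.unlink(path)  # consumed; remove so it is not re-queued
        except OSError:
            # file vanished or became unreadable; skip, retry on the next sweep
            continue


if __name__ == "__main__":
    while True:
        sweep_once()
        time.sleep(30)
```

The age check is what stands in for locking here: the janitor never races a live writer, it only collects files old enough to be considered orphaned.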

@petersilva
Contributor Author

@reidsunderland @habilinour: this is what I did not have time to explain during the meeting.

petersilva added the enhancement, question, worries, and mirroring labels on Oct 29, 2024
@petersilva
Contributor Author

Avoiding contention is probably harder than that... you need to combine hostname and pid, because you might get pid conflicts between nodes... I was thinking we could check the proc table to avoid conflicts, but we would have to check the proc table on all nodes, or run the janitor on all nodes, which feels ridiculously expensive. That's why I was using 1 minute... it might need a longer time.
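A minimal sketch of that naming scheme, assuming the writing process (not the janitor) builds the file name; the helper name and spool directory are hypothetical:

```python
import os
import socket
import time


def post_retry_filename(spool_dir):
    """Per-process retry file name intended to be unique across the cluster.

    The hostname disambiguates identical pids on different nodes, and the
    timestamp guards against pid reuse on the same node, so a single
    cluster-wide janitor never has to consult any proc table.
    """
    return os.path.join(
        spool_dir,
        "post_retry.%s.%d.%d" % (socket.gethostname(), os.getpid(), int(time.time())),
    )
```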

I hope we can just run 1 janitor for the whole cluster. During peak times it falls behind, and catches up later... everything is late anyways... minutes don't matter in this situation.
