
post_retry support for mirroring #171

Open
petersilva opened this issue Oct 29, 2024 · 2 comments
Labels
enhancement: New feature or request
mirroring: issue affects or brought to light from mirroring deployment
question: Further information is requested
worries: needs work to clarify status

Comments

@petersilva
Contributor

The client has concerns about the robustness of mirroring post generation during broker outages. Currently, I think the user jobs will just hang, trying desperately to publish notices to the broker.

The post_retry logic (actually all retry logic) depends on having one retry list per process. Each instance has one file per retry queue (download and post being the ones that exist currently.) In the context of libsr3shim this does not make much sense: the processes are typically short-lived, non-daemons.

  • There will be hundreds of thousands of post_retry.pid files created (one per process) if the broker goes down.
  • The process itself cannot retry the posts, because in order to keep user jobs going, the process has to end.

An alternative to the thousands of .pid files would be to post to a pipe, or a named pipe, per node... in which case you need a janitor that reads the named pipe. You end up creating a second IPC network to robustify your IPC network.

Taking the simpler option:

  • we would create post_retry files per process... so there would be hundreds of thousands of such files created during a run. These processes end; they will not retry the posting themselves.
  • something needs a janitor process (likely a Python scheduled watch?) that finds the post_retry.pid files, perhaps puts them into the conventional retry_queue, and deletes the post_retry.pid files. To avoid contention, it makes sure the files it reads are > 1 minute old before trying to process them.
  • the janitor then needs to retry the posts conventionally.

This is one suggested implementation.
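A minimal sketch of what that janitor sweep could look like, assuming the shim drops one serialized notice per line into per-process files under a spool directory. The paths, file layout, and 60-second age threshold below are assumptions for illustration, not the actual sr3 layout:

```python
import os
import time

# Hypothetical locations; the real sr3 spool layout may differ.
SHIM_RETRY_DIR = os.path.expanduser("~/.cache/sr3/shim_post_retry")
MAIN_RETRY_FILE = os.path.expanduser("~/.cache/sr3/post/retry_queue")
MIN_AGE = 60  # seconds; only touch files presumed abandoned by their writer


def sweep_once():
    """Move per-process post_retry files into the conventional retry queue."""
    now = time.time()
    for name in os.listdir(SHIM_RETRY_DIR):
        path = os.path.join(SHIM_RETRY_DIR, name)
        try:
            if now - os.path.getmtime(path) < MIN_AGE:
                continue  # writer may still be alive; leave it for the next pass
            with open(path) as leftover, open(MAIN_RETRY_FILE, "a") as queue:
                for line in leftover:  # one serialized notice per line
                    queue.write(line)
            os.unlink(path)  # consumed; remove so it is not re-queued
        except OSError:
            # file vanished or became unreadable; skip, retry on the next sweep
            continue


if __name__ == "__main__":
    while True:
        sweep_once()
        time.sleep(30)
```

The age check is what stands in for locking here: the janitor never races a live writer, it only collects files old enough to be considered orphaned.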

@petersilva
Contributor Author

@reidsunderland @habilinour: this is what I did not have time to explain during the meeting.

petersilva added the enhancement, question, worries, and mirroring labels on Oct 29, 2024
@petersilva
Contributor Author

Avoiding contention is probably harder than that... you need to combine hostname and pid, because you might get pid conflicts between nodes... I was thinking we could check the proc table to avoid conflicts, but we would have to check the proc table on all nodes, or run the janitor on all nodes, which feels ridiculously expensive. That's why I was using 1 minute... it might need a longer time.
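A minimal sketch of that naming scheme, assuming the writing process (not the janitor) builds the file name; the helper name and spool directory are hypothetical:

```python
import os
import socket
import time


def post_retry_filename(spool_dir):
    """Per-process retry file name intended to be unique across the cluster.

    The hostname disambiguates identical pids on different nodes, and the
    timestamp guards against pid reuse on the same node, so a single
    cluster-wide janitor never has to consult any proc table.
    """
    return os.path.join(
        spool_dir,
        "post_retry.%s.%d.%d" % (socket.gethostname(), os.getpid(), int(time.time())),
    )
```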

I hope we can just run 1 janitor for the whole cluster. During peak times it falls behind, and catches up later... everything is late anyways... minutes don't matter in this situation.
