Fix tests hanging when a worker dies unexpectedly due to an unrecoverable error such as a segfault #81

Closed
wants to merge 1 commit

Conversation

@obrie commented Mar 11, 2024

This fixes flatware hanging indefinitely when a worker dies unexpectedly due to an unrecoverable error such as a segmentation fault.

Context

Flatware currently has no CHLD signal handling in place, so when a worker dies unexpectedly there is nothing to catch the death of that process and recover. This can happen if a spec triggers a segmentation fault. In our specs this happens rarely and randomly, but when it does, it causes our CI to sit there waiting forever.
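
To illustrate the failure mode, here's a minimal, self-contained sketch (not flatware's actual code, just the general shape of the problem): a parent that blocks until every worker has reported back will wait forever if one worker dies before reporting.

```ruby
require "socket"
require "tmpdir"

socket_path = File.join(Dir.mktmpdir, "sink.sock")
server = UNIXServer.new(socket_path)

2.times do |i|
  fork do
    # Stand-in for a segfault: the first worker dies before it can report.
    Process.kill(:KILL, Process.pid) if i.zero?
    UNIXSocket.open(socket_path) { |sock| sock.puts "finished" }
  end
end

# With no CHLD handling, the parent never learns that a worker died and
# blocks here forever, waiting for a report that will never arrive.
2.times { server.accept.gets }
```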

Solution

The proposed solution is to introduce a class that's responsible for "managing" workers. Specifically, this class will listen for CHLD signals and re-spawn a worker if the process was killed unexpectedly.

We can define "unexpected" based on whether the sink proactively deleted the worker when returning Job.sentinel.

Additionally, when a worker is re-spawned, it is re-assigned the same work it previously held.
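
Roughly, the shape I have in mind is sketched below. To be clear, this is hypothetical: `WorkerManager`, the method names, and the injected worker loop are placeholders rather than flatware's existing internals.

```ruby
class WorkerManager
  def initialize(&worker_loop)
    @worker_loop    = worker_loop # the worker's main loop, injected for the sketch
    @assignments    = {}          # pid => the work that worker was given
    @expected_exits = []          # pids the sink shut down on purpose (Job.sentinel)

    trap(:CHLD) do
      # Reap every child that has exited; respawn the ones we didn't expect to die.
      while (pid = Process.wait(-1, Process::WNOHANG))
        respawn(pid) unless @expected_exits.delete(pid)
      end
    rescue Errno::ECHILD
      # no children left to reap
    end
  end

  def spawn(jobs)
    pid = fork { @worker_loop.call(jobs) }
    @assignments[pid] = jobs
    pid
  end

  # The sink calls this right before it deliberately shuts a worker down.
  def expect_exit(pid)
    @expected_exits << pid
  end

  private

  def respawn(dead_pid)
    jobs = @assignments.delete(dead_pid)
    spawn(jobs) if jobs # hand the same work to a fresh worker
  end
end
```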

Alternative Solutions

Possible alternative solutions:

  • Don't retry specs when a worker dies unexpectedly, but make sure that flatware is aware of this situation and eventually returns a non-zero exit code

Open Questions

  • Is this a reasonable path forward to address this issue?
  • What, if anything, should we do for unrecoverable exceptions that are caused consistently (e.g. if a spec always causes a segmentation fault)? Should there be a maximum number of retries?
  • Are there other scenarios I haven't accounted for?

TODO

Assuming we move forward with the approach:

  • Update specs

Looking for any feedback on the high-level approach before updating the specs. Thanks!

@briandunn (Owner) commented Mar 13, 2024

Hi @obrie, thanks for looking into this! Trapping CLD and terminating the sink is a great idea. But what to do about it? In these situations I tend to look to plain rspec, and it turns out that if you have an unexpected exit in rspec, it does nothing.

I experimented with trapping CLD. You can try it out by running bin/flatware rspec in the repo root. I think this behavior is close to ideal. Would it work for your use case? We shouldn't try to restart the worker; we should just print the list of specs that failed to run and exit non-zero. A CI script could do another run with that list, but if it doesn't happen often, a rerun of the whole suite would be simpler.
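
For illustration, the behavior described above amounts to something like the following sketch (simplified, not the actual change; `unassigned_jobs` and `jobs_checked_out_by` are hypothetical helpers standing in for whatever bookkeeping the sink already does):

```ruby
# On CLD: reap the child, and if it died abnormally, report the specs that
# never ran and exit non-zero instead of trying to respawn anything.
trap(:CLD) do
  while (reaped = Process.wait2(-1, Process::WNOHANG))
    pid, status = reaped
    next if status.success? # a worker finishing and exiting cleanly is fine

    remaining = unassigned_jobs + jobs_checked_out_by(pid) # hypothetical helpers
    warn "A worker (pid #{pid}) exited unexpectedly: #{status.inspect}"
    warn "The following specs were not run:"
    remaining.each { |job| warn "  #{job}" }
    exit 1
  end
rescue Errno::ECHILD
  nil
end
```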

Another cool feature would be to report which spec caused the crash.
I believe the top spec in the un-run list will be the offender, but that is just incidental. To do that right we would need to implement a new formatter event for when a spec starts and track that in the sink. So probably an incremental improvement.

Breaking out a WorkerManager seems ok... as long as its distinct responsibilities are clear. So the Sink:

  • distributes jobs to workers
  • handles messages from workers
  • handles SIGINT (?)
  • tracks what work is completed

And the WorkerManager:

  • spawns worker processes
  • handles CLD
  • something else?

Not sure it's pulling its weight if we remove respawn.
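
Concretely, the split we're talking about would look something like this skeleton (hypothetical names and empty stubs, not existing flatware classes):

```ruby
class Sink
  def distribute(jobs, workers); end # hand jobs out to workers
  def handle_message(message); end   # process results coming back from workers
  def handle_sigint; end             # (?) keep current SIGINT behavior here
  def completed; end                 # track what work is completed
end

class WorkerManager
  def spawn(count); end              # fork the worker processes
  def handle_cld(pid, status); end   # notice worker deaths via the CLD trap
end
```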

@obrie (Author) commented Mar 13, 2024

I was thinking about it this morning and came to the same conclusion: I agree that the worker should just exit and reflect the same behavior as rspec. I'll try out your branch later today, but I think that's the right approach 👍 Assuming we move forward with that, I agree it's probably not worth breaking the logic out into a WorkerManager at this point.

Thanks for the quick follow-up!

@obrie (Author) commented Mar 14, 2024

Looks like your solution should work for us 👍

Thanks! I'll close this in the meantime.
