Fix tests hanging when a worker dies unexpectedly due to an unrecoverable error such as a segfault #81

Closed
wants to merge 1 commit

Conversation

@obrie commented Mar 11, 2024

This fixes flatware hanging indefinitely when a worker dies unexpectedly due to an unrecoverable error such as a segmentation fault.

Context

Flatware currently has no CHLD signal handling in place, so when a worker dies unexpectedly there is nothing to catch the death of that process and recover. This can happen if a spec triggers a segmentation fault. In our specs this happens rarely and randomly, but when it does, it causes our CI to sit there waiting forever.
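
To illustrate the failure mode, here's a minimal, self-contained sketch (not flatware's actual code, just the general shape of the problem): a parent that blocks until every worker has reported back will wait forever if one worker dies before reporting.

```ruby
require "socket"
require "tmpdir"

socket_path = File.join(Dir.mktmpdir, "sink.sock")
server = UNIXServer.new(socket_path)

2.times do |i|
  fork do
    # Stand-in for a segfault: the first worker dies before it can report.
    Process.kill(:KILL, Process.pid) if i.zero?
    UNIXSocket.open(socket_path) { |sock| sock.puts "finished" }
  end
end

# With no CHLD handling, the parent never learns that a worker died and
# blocks here forever, waiting for a report that will never arrive.
2.times { server.accept.gets }
```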

Solution

The proposed solution is to introduce a class that's responsible for "managing" workers. Specifically, this class will listen for CHLD signals and re-spawn a worker if the process was killed unexpectedly.

We can define "unexpected" based on whether the sink proactively deleted the worker when returning Job.sentinel.

Additionally, when a worker is re-spawned, it is re-assigned the same work it previously held.
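
Roughly, the shape I have in mind is sketched below. To be clear, this is hypothetical: `WorkerManager`, the method names, and the injected worker loop are placeholders rather than flatware's existing internals.

```ruby
class WorkerManager
  def initialize(&worker_loop)
    @worker_loop    = worker_loop # the worker's main loop, injected for the sketch
    @assignments    = {}          # pid => the work that worker was given
    @expected_exits = []          # pids the sink shut down on purpose (Job.sentinel)

    trap(:CHLD) do
      # Reap every child that has exited; respawn the ones we didn't expect to die.
      while (pid = Process.wait(-1, Process::WNOHANG))
        respawn(pid) unless @expected_exits.delete(pid)
      end
    rescue Errno::ECHILD
      # no children left to reap
    end
  end

  def spawn(jobs)
    pid = fork { @worker_loop.call(jobs) }
    @assignments[pid] = jobs
    pid
  end

  # The sink calls this right before it deliberately shuts a worker down.
  def expect_exit(pid)
    @expected_exits << pid
  end

  private

  def respawn(dead_pid)
    jobs = @assignments.delete(dead_pid)
    spawn(jobs) if jobs # hand the same work to a fresh worker
  end
end
```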

Alternative Solutions

Possible alternative solutions:

  • Don't retry specs when a worker dies unexpectedly, but make sure that flatware is aware of this situation and eventually returns a non-zero exit code

Open Questions

  • Is this a reasonable path forward to address this issue?
  • What, if anything, should we do for unrecoverable exceptions that are caused consistently (e.g. if a spec always causes a segmentation fault)? Should there be a maximum number of retries?
  • Are there other scenarios I haven't accounted for?

TODO

Assuming we move forward with the approach:

  • Update specs

Looking for any feedback on the high-level approach before updating the specs. Thanks!

@briandunn (Owner) commented Mar 13, 2024

Hi @obrie, thanks for looking into this! Trapping CLD and terminating the sink is a great idea. But what to do about it? In these situations I tend to look to plain rspec, and it turns out that if you have an unexpected exit in rspec, it does nothing.

I experimented with trapping CLD. You can try it out by running bin/flatware rspec in the repo root. I think this behavior is close to ideal. Would it work for your use case? We shouldn't try to restart the worker; we should just print the list of specs that failed to run and exit non-zero. A CI script could do another run with that list, but if it doesn't happen often, a rerun of the whole suite would be simpler.
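
For illustration, the behavior described above amounts to something like the following sketch (simplified, not the actual change; `unassigned_jobs` and `jobs_checked_out_by` are hypothetical helpers standing in for whatever bookkeeping the sink already does):

```ruby
# On CLD: reap the child, and if it died abnormally, report the specs that
# never ran and exit non-zero instead of trying to respawn anything.
trap(:CLD) do
  while (reaped = Process.wait2(-1, Process::WNOHANG))
    pid, status = reaped
    next if status.success? # a worker finishing and exiting cleanly is fine

    remaining = unassigned_jobs + jobs_checked_out_by(pid) # hypothetical helpers
    warn "A worker (pid #{pid}) exited unexpectedly: #{status.inspect}"
    warn "The following specs were not run:"
    remaining.each { |job| warn "  #{job}" }
    exit 1
  end
rescue Errno::ECHILD
  nil
end
```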

Another cool feature would be to report which spec caused the crash.
I believe the top spec in the un-run list will be the offender, but that is just incidental. To do that right we would need to implement a new formatter event for when a spec starts and track that in the sink. So probably an incremental improvement.

Breaking out a WorkerManager seems ok... as long as its distinct responsibilities are clear. So the Sink:

  • distributes jobs to workers
  • handles messages from workers
  • handles SIGINT (?)
  • tracks what work is completed

And the WorkerManager:

  • spawns worker processes
  • handles CLD
  • something else?

Not sure it's pulling its weight if we remove respawn.
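
Concretely, the split we're talking about would look something like this skeleton (hypothetical names and empty stubs, not existing flatware classes):

```ruby
class Sink
  def distribute(jobs, workers); end # hand jobs out to workers
  def handle_message(message); end   # process results coming back from workers
  def handle_sigint; end             # (?) keep current SIGINT behavior here
  def completed; end                 # track what work is completed
end

class WorkerManager
  def spawn(count); end              # fork the worker processes
  def handle_cld(pid, status); end   # notice worker deaths via the CLD trap
end
```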

@obrie (Author) commented Mar 13, 2024

I was thinking about it this morning and came to the same conclusion: I agree that the worker should just exit and reflect the same behavior as rspec. I'll try out your branch later today, but I think that's the right approach 👍 Assuming we move forward with that, I agree it's probably not worth breaking the logic out into a WorkerManager at this point.

Thanks for the quick follow-up!

@obrie (Author) commented Mar 14, 2024

Looks like your solution should work for us 👍

Thanks! I'll close this in the meantime.
