Add a tracing warning when a thread blocks steps #162
@LucioFranco would you be able to review?
Overall looks good. I have just two small things we should fix or think about.
src/sim.rs
Outdated
  if self.elapsed > self.config.duration && !is_finished {
      return Err(format!(
          "Ran for duration: {:?} steps: {} without completing",
-         self.config.duration, self.steps,
+         self.config.duration, self.steps.load(Relaxed),
You can avoid this load because the fetch_add above returns the previous value, so you can just add 1 to that and get the current value.
Nice, will adjust that!
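To illustrate the suggestion above, here is a minimal sketch of deriving the current count from the return value of fetch_add. The helper name bump_steps and the AtomicU64 counter type are assumptions made for this example; the PR's actual field type is not shown in the excerpt.

```rust
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

/// Increments the shared step counter and returns the new (post-increment) count.
/// `bump_steps` is a hypothetical helper, not a function from the PR.
fn bump_steps(steps: &AtomicU64) -> u64 {
    // fetch_add returns the value held *before* the addition,
    // so the current value is that result plus one.
    let prev = steps.fetch_add(1, Relaxed);
    prev + 1
}

fn main() {
    let steps = AtomicU64::new(0);
    assert_eq!(bump_steps(&steps), 1);
    assert_eq!(bump_steps(&steps), 2);
    // With a single writer, the derived value matches what a separate load observes.
    assert_eq!(steps.load(Relaxed), 2);
}
```

Reusing the returned value also means the number in the error message is the one this thread produced, rather than whatever a later load happens to see.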
src/sim.rs
Outdated
        Ok(_) | Err(TryRecvError::Disconnected) => break,
        _ => {}
    }
    std::thread::sleep(Duration::from_secs(10));
I wonder if we want to add a limit, or make this back off exponentially, so that the noise is reduced in, say, a CI scenario where it may time out and swamp the logs?
That's a good point. We could log it only once per step, including the step it's stuck on, and skip the log after that. I'd think something is likely broken if a single step takes 10s of real time.
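The "log once per stuck step" idea could look roughly like the sketch below. StallTracker and observe are hypothetical names, and eprintln! stands in for the tracing warning so the example has no external dependencies.

```rust
/// Remembers the last observed step and whether a warning was already emitted for it.
/// Hypothetical helper for illustration; not taken from the PR.
struct StallTracker {
    last_seen: Option<u64>,
    warned: bool,
}

impl StallTracker {
    fn new() -> Self {
        Self { last_seen: None, warned: false }
    }

    /// Called on each periodic check. Returns Some(step) only the first time
    /// `current` is seen unchanged since the previous check.
    fn observe(&mut self, current: u64) -> Option<u64> {
        if self.last_seen == Some(current) {
            if self.warned {
                return None; // already reported this stuck step
            }
            self.warned = true;
            return Some(current);
        }
        // Progress was made: remember the new step and re-arm the warning.
        self.last_seen = Some(current);
        self.warned = false;
        None
    }
}

fn main() {
    let mut tracker = StallTracker::new();
    // Simulated step counts read at successive checks.
    for step in [3, 3, 3, 4, 4] {
        if let Some(stuck) = tracker.observe(step) {
            // In the simulator this would be a tracing warning.
            eprintln!("simulation has not progressed past step {stuck}");
        }
    }
}
```

With this shape, a run that hangs on one step produces a single line per stuck step instead of one line every interval.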
Add a warning to the sim when a given host or client blocks progress in a simulation run. This works by spawning a background thread for each run that periodically checks the steps taken by the simulation. If the number of steps is the same between checks then the thread adds the tracing info.
52bdca7 to e890f25
Did you consider a strategy where this is fallible? If X duration of real time elapses and the sim doesn't progress, fail the test. This saves operators from cancelling runaway builds.
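As a rough sketch of the fallible variant described here, the check below returns an error once the step count has been unchanged for longer than a real-time limit. ProgressDeadline and its methods are hypothetical names. The sketch covers only the detection; how a background thread would actually fail a test whose main thread is blocked is left open.

```rust
use std::time::{Duration, Instant};

/// Fails once the step counter has been unchanged for longer than `limit`.
/// Hypothetical type for illustration; not taken from the PR.
struct ProgressDeadline {
    limit: Duration,
    last_step: u64,
    last_progress: Instant,
}

impl ProgressDeadline {
    fn new(limit: Duration, start_step: u64, now: Instant) -> Self {
        Self { limit, last_step: start_step, last_progress: now }
    }

    /// Records the step observed at `now`; errors if no progress within `limit`.
    fn check(&mut self, step: u64, now: Instant) -> Result<(), String> {
        if step != self.last_step {
            // Progress was made, so restart the clock.
            self.last_step = step;
            self.last_progress = now;
            return Ok(());
        }
        let stalled_for = now.saturating_duration_since(self.last_progress);
        if stalled_for > self.limit {
            Err(format!("no progress past step {} for {:?}", step, stalled_for))
        } else {
            Ok(())
        }
    }
}

fn main() {
    let start = Instant::now();
    let mut deadline = ProgressDeadline::new(Duration::from_secs(30), 0, start);
    // Still on step 0 after 10s: within the limit.
    assert!(deadline.check(0, start + Duration::from_secs(10)).is_ok());
    // Still on step 0 after 31s: over the limit.
    assert!(deadline.check(0, start + Duration::from_secs(31)).is_err());
}
```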
src/sim.rs
Outdated
loop {
    let prev = steps.load(std::sync::atomic::Ordering::Relaxed);
    // Exit if main thread has.
    match rx.try_recv() {
How does this behave when you call run() N times in a row? It looks like you could spawn a ton of threads that don't clean up for 10s.
That's a good point. I'll adjust it so it uses recv_timeout; that way, if the sim exits early, the background thread will be closed too and it'll clean up straight away.
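A minimal sketch of the recv_timeout approach, under the assumption that the watcher shares an Arc<AtomicU64> step counter with the sim. spawn_watcher is a hypothetical function name, and the stalled-check counter stands in for the tracing warning so the behavior can be observed.

```rust
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};
use std::sync::mpsc::{self, RecvTimeoutError};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Spawns a watcher thread and returns a handle yielding the number of checks
/// on which the step counter had not moved. Hypothetical helper for illustration.
fn spawn_watcher(
    steps: Arc<AtomicU64>,
    rx: mpsc::Receiver<()>,
    interval: Duration,
) -> JoinHandle<u32> {
    thread::spawn(move || {
        let mut stalled_checks = 0;
        loop {
            let prev = steps.load(Relaxed);
            match rx.recv_timeout(interval) {
                // Nothing arrived within the interval: the sim is still running,
                // so compare the counter against the earlier reading.
                Err(RecvTimeoutError::Timeout) => {
                    if steps.load(Relaxed) == prev {
                        stalled_checks += 1; // the real code would log a warning here
                    }
                }
                // A message, or every sender dropped: wake up and exit right away.
                Ok(()) | Err(RecvTimeoutError::Disconnected) => break,
            }
        }
        stalled_checks
    })
}

fn main() {
    let steps = Arc::new(AtomicU64::new(0));
    let (tx, rx) = mpsc::channel();
    let watcher = spawn_watcher(steps.clone(), rx, Duration::from_secs(10));

    // The sim finishes early: the watcher returns without waiting out the 10s interval.
    let _ = tx.send(());
    let stalled = watcher.join().unwrap();
    assert_eq!(stalled, 0);
}
```

Unlike try_recv followed by thread::sleep, the blocking wait and the exit signal share one call, so repeated run() invocations should not leave sleeping threads behind.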
src/sim.rs
Outdated
loop {
    let is_finished = self.step()?;

    if is_finished {
        let _ = tx.send(());
Doesn't the drop handle this for you?
Good point
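The point about drop can be checked with a small standalone sketch using std::sync::mpsc: once the only sender is dropped, the receiver reports Disconnected without any explicit send. receiver_states is a hypothetical helper name.

```rust
use std::sync::mpsc::{self, RecvTimeoutError, TryRecvError};
use std::time::Duration;

/// Returns what a receiver observes before and after its only sender is dropped.
/// Hypothetical helper for illustration.
fn receiver_states() -> (Result<(), TryRecvError>, Result<(), RecvTimeoutError>) {
    let (tx, rx) = mpsc::channel::<()>();
    // Sender alive, nothing sent: the channel is merely empty.
    let before = rx.try_recv();
    // Dropping the sender (for example when it goes out of scope) disconnects the channel.
    drop(tx);
    // This returns immediately instead of waiting for the timeout.
    let after = rx.recv_timeout(Duration::from_secs(10));
    (before, after)
}

fn main() {
    let (before, after) = receiver_states();
    assert_eq!(before, Err(TryRecvError::Empty));
    assert_eq!(after, Err(RecvTimeoutError::Disconnected));
}
```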
I'll step back on this. To be less opinionated, we can instead expect CI systems running these to set the timeout. For example, nextest is pretty great for configuring this.
@mcches or @LucioFranco would you mind giving this another look when you have a min? I've addressed all prev comments.
Add a warning to the sim when a given host or client blocks progress in a simulation run. This works by spawning a background thread for each run that periodically checks the steps taken by the simulation. If the number of steps is the same between checks then the thread adds the tracing info.
Few questions here:
cargo test without --nocapture
Fixes #160