-
-
Notifications
You must be signed in to change notification settings - Fork 345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tools for finding scheduler-dependent "heisenbugs" #239
Comments
An interesting question is how this might relate to testing network protocols. We already have some mechanisms in the memory-stream code to make tests where data trickles in on a weird schedule; one way to think of this is that when one task calls Right now we use ad hoc randomized sleeps for this kind of thing, which is a bit cumbersome in general, and I especially struggled when writing the ssl tests to get them to give good coverage over all the weird scheduling/network interactions that can happen there. In particular one problem is that if you have one task that does 1 sleep, and another task that does 3 sleeps, then it's very rare that the second task will ever complete first, just because P(X) > P(X1 + X2 + X3) is not high when X, X1, X2, X3, are all equidistributed. A more systematic state exploration system with some global view of the test could potentially do much better here. |
Here's an interesting, well-posed but non-obvious question: if you just go ahead and run the same test lots of times ("stress testing"), then this isn't very effective because you mostly explore the same interleavings over and over. Is there some algorithm that generates uniform samples from the space of interleavings? (Obviously this would have to be a stateful algorithm whose long-term stationary distribution has this property.) |
Some first-order objections to the "uniform samples from space of interleavings" idea:
async def t1():
global x
x = 1
for _ in range(10):
await yield_briefly()
async def t2():
for _ in range(10):
await yield_briefly()
global x
x = 2
async with trio.open_nursery() as nursery:
nursery.spawn(t1)
nursery.spawn(t2)
assert x == 2 In the vast, vast majority of interleavings (including all the ones you'd see during regular testing), this works. There's exactly one interleaving where it fails, which is the one where t2 runs to completion before t1 even starts. To find this kind of thing we need to find unusual interleavings, not just arbitrary interleavings – the ones that are "far" from the average interleaving in some sense. The ctrigger paper has some interesting discussion of what makes interleavings more or less similar. Not sure how applicable it is. |
Here's some notes from roc on making this work in practice on a firefox heisenbug that CHESS couldn't catch, with a practical heuristic algorithm: http://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html |
There's also: https://github.com/osrg/namazu (from the comments on the post above) |
@ssanderson has an interesting scheme he's implemented in gevent, where he annotates his actual logic with calls like [Edit: here's what it might look like in Trio: #533] |
This sounds like a case for writing a scheduler on top of Hypothesis! Check out generic state machines - if you can define a Of course, that means that most of the work has just moved one layer up...
And at this point the result could accurately be described as a state-of-the-art fuzzing scheduler 🚀 |
Oh, happened to be re-reading roc's post above, and noticed that he also has some followups: It looks like there's also a talk here, for those who are into that kind of thing: https://air.mozilla.org/chaos-mode-and-rr/ (In case it's not clear, I think these are probably the most useful posts to look at first, because they are relentlessly practical in a way that all the other links in this thread are not.) If/when we seriously attempt this we should probably send him a ping to check if he has any other insights to share. |
Lines 1424 to 1436 in 3877632
One of the major problems in reproducing scheduler-dependent bugs is that the scheduler is in fact non-deterministic even if task timing and behavior is fully deterministic, because each Is there any strong reason not to use the global (revisiting this long after: we've had |
It's not really OK for random libraries to invoke the global |
Thinking about this a little more today. You could get quite a long way with: class Scheduler(ABC):
@abstractmethod
def schedule(self, newly_runnable: List[Task]) -> List[Task]:
"Returns the tasks to run in the next batch" The current scheduler would be: class CurrentBuiltinScheduler(Scheduler):
def schedule(self, newly_runnable):
_r.shuffle(newly_runnable)
return newly_runnable A big question though is how to support schedulers that don't want to schedule everything immediately. Example 1: a WFQ-style scheduler (#32) would want to keep track of all runnable tasks, and then repeatedly choose one task to run next. Example 2: rr's chaos mode involves sometimes intentionally refusing to schedule an otherwise runnable task for a certain amount of time, basically injecting a With the API described above, I think you can technically almost handle these – just stash some of the Hmm. I guess one way to approach this is to look at all the places where
(NB Speaking of deadlocks, I think if we do build a system that's good at provoking them, we probably also need to level up our tools for detecting and debugging them! If you throw the chaos scheduler at some test and it just never returns then that's... kinda useful to know I guess, but not really the best UX. Simple timeouts will get us some way (python-trio/pytest-trio#53). And I guess it's easy to detect a global deadlock in a system that doesn't interact with the outside world (if there are no runnable tasks and no timeouts scheduled, then... it's dead). Actually, we can also detect that we're not interacting with the outside world, by looking at registered I/O sources... hmm. I'm gonna split this off into a new issue, #1085. Maybe we should consider two different modes for this kind of torture-test debugging: one where we assume we control everything (autojump clock, fake network, etc.), and one that tries to apply less aggressive perturbations to a system that's actually interacting with the kernel (at least over localhost). I'm not sure whether it's better to try to start by building a chaos scheduler that just works for self-contained systems and then generalize it, or to try to start by building a chaos scheduler that only worries about scheduling, but then can be composed with autojump clock etc. Given that the chaos scheduler cares about time and clocks care about scheduling, though, this might be tricky. Heck, maybe we should combine the |
Thinking about how the pieces could fit together here... I think a more refined API for a pluggable scheduler would be (including current scheduler for comparison): class Scheduler:
def add_runnable(self, task: Task) -> None:
"Inform the scheduler that a task has become runnable"
self.runq.append(task)
def schedule_deadline(self) -> float:
"Ask the scheduler how long we can wait (on the Trio clock) before invoking schedule_batch"
return -math.inf if self.runq else math.inf
def schedule_batch(self) -> List[Task]:
"Get the next batch of tasks to run"
batch = self.runq
self.runq = []
return batch I think this is enough to implement both the current scheduler, and rr's chaos scheduler, when using a regular clock. When using the autojump clock, we have two problems: first, weird stuff will happen if the scheduler tries to mess with the special clock task, as mentioned above. But also: the autojump clock would need to be modified so that if the scheduler's timeout is the nearest timeout, then it jumps to that, instead of jumping to the next task timeout. Right now the autojump clock gets this by calling First let's check our assumptions. This particular complexity comes from deciding that the scheduler timeout should use the trio clock rather than the system clock. Is that the right decision? The intuition is that you want a chaos scheduler to be able to test things like, what if this task gets randomly delayed for a few seconds? But, you don't necessarily want to wait a few seconds to find out what happens. This is weirdly different from rr's setting, where you have preemptive multitasking and threads running on a real CPU, so when they introduce a multi-second delay in one thread, they really mean a multi-second delay! And in particular that determines both how much progress the other threads can make (CPU time) and how much time passes for timeouts (clock time). In our case, these two things are split: with the autojump clock, a task can get arbitrary amounts of CPU time in zero clock time, or arbitrary amounts of clock time can pass with no CPU usage. If we make the scheduler timeout use the Trio clock, then it becomes a knob to control clock time only, not CPU time; if we make it use the system clock, then it becomes a knob to control CPU time only, not clock time. But the scheduler already has complete control over the allocation of CPU time to processes, because it's the scheduler. Also, people will get annoyed if they use the autojump clock and then the scheduler decides to sit for a few seconds doing nothing, while also blocking everything else from running, which seems like something that would inevitably happen sometimes if the scheduler used the system clock. A naive version of rr's chaos mode could have a problem when combined with the autojump clock. Imagine a spin lock: task A takes the lock, and then the chaos scheduler decides that task A will sleep for 1 second. Task B then spins, waiting for A to drop the lock. Because task B is spinning, the autojump clock thinks the program is making progress, so it doesn't advance the clock. But because the clock doesn't advance, task A never gets scheduled, so task B keeps spinning forever. There's a pretty simple solution though: every time B spins, it passes through the scheduler, and the scheduler decides again to schedule task B instead of task A. So to avoid this, when the scheduler decides to deschedule a task for a while, it should reschedule after EITHER X seconds have passed on the clock, OR Y reschedules have been performed. (Also, probably no-one uses spinlocks in Trio anyway, so it wouldn't even matter.) So.... I'm leaning towards, yeah, the scheduler should use clock time. This then suggests that the clock is an even lower-level layer than the scheduler. The current autojump clock violates layering, by scheduling a task. So... probably we should make it stop doing that, by enhancing the run loop to directly provide the information the autojump clock needs. What does that look like? I guess we'd add some extra check to the run loop, very similar to how we handle (A nice thing here is that we could make this unconditionally a lower priority than So I guess the obvious approach would be to add these two hooks to the Of if we want to be minimalist, I guess technically the That's all kinda hacky, but it's... not as terrible as I expected? We even made it impossible to enumerate the set of installed instruments, so even if we did use an instrument it would be invisible to the user. Or on the other end of the option-space, it wouldn't be terrible to hardcode the autojump clock directly into the core. The data is in: people love it. It's not actually clear whether people ever want any other clocks (aside from the default one, of course). AFAIK there are no other implementations of the |
The instrument-based autojump clock is a cool idea, but I don't think it can be implemented using the current set of instrumentation hooks without sacrificing some edge cases. In particular:
|
Opened #1587 to discuss ways to implement the autojump clock without using a task |
Race conditions suck, but they're a common kind of bug in concurrent programs. We should help people test for them if we can.
Some interesting research out of MSR:
CHESS assumes you have a preemptive multitasking system and works by hooking synchronization primitives, and explores different legal paths through the resulting happens-before graph. This makes it tricky to find data races, i.e., places where two threads access the same variable without using any synchronization primitive; you really want to be able to try schedules where threads get preempted in the middle of these. In practice, it sounds like the way they do this (see page 7) is to first run some heavyweight data race detector that instruments memory reads and writes, and then use the result to add new annotations for the CHESS scheduler.
For us, we don't AFAIK have any reasonable way to do data race detection in Python (maybe you could write something with PyPy? it's non-trivial), but we do have direct access to the scheduler, so we can in principle explore all possible preemption decisions directly instead of having to infer them from synchronization points. However, this is likely to be somewhat inefficient – for CHESS it really needs to cut down the space of preemption points because otherwise it has to consider a preemption at every instruction which is ridiculously intractable, and we're in a better position then that, but the average program does still have lots of cases where two tasks that aren't interacting at all have some preemption points and this generates an exponential space of boring possible schedules.
They have a lot of interesting stuff in the paper (section 4) about how to structure a search, recovering from non-determinism in the code-under-test, some (not enough) detail about fair scheduling (they want to handle code with spin-locks). It looks like this is the fair scheduling paper; it looks like the main idea is that if a thread calls
sched_yield
or equivalent then it means it can't make progress, so you should lower its priority. That's... not a good signal in the trio context. Fortunately it's not clear that anyone wants to write spin-locks anyway... (we use the equivalent a few times in our test suite, but mostly only in the earliest tests I wrote before we had much infrastructure, and certainly it's a terrible thing to do in real code).Possibly useful citation: "ConTest [15] is a lightweight testing tool that attempts to create scheduling variance without resorting to systematic generation of all executions. In contrast, CHESS obtains greater control over thread scheduling to offer higher coverage guarantees and better reproducibility."
GAMBIT then adds a smarter best-first search algorithm on top of the basic CHESS framework. A lot of the cleverness in GAMBIT is about figuring out which states are equivalent and thus can be collapsed, which we don't really have access to. But the overall ideas are probably relevant. (It's also possible that we could use something like instrumented synchronization primitives plus the DPOR algorithm to pick "interesting" schedules to try first, and then fall back on more brute-force randomized search.)
I suspect there are two general approaches that are likely to be most useful in our context:
Hypothesis-style randomized exploration of the space of scheduling decisions, with some clever heuristics to guide the search towards edge cases that are likely to find bugs, and maybe even hypothesis-style minimization. (The papers above have some citations to other papers showing that low-preemption traces are often sufficient to demonstrate bugs in practice.)
Providing some sort of mechanism for users to explicitly guide the exploration to a subset of cases that they've identified as a problem (either by intuition or especially for regression tests and increasing coverage). You can imagine relatively simple things like "use these priorities until task 3 hits the yield point on line 72 and then switch to task 5", maybe?
An intermediate form between these is one of the heuristics mentioned in the GAMBIT paper, of letting the programmer name some specific functions they want to "focus" on, and using that to increase the number of preemptions that happen within those functions. (This is sort of mentioned in the CHESS paper too when talking about their state space reduction techniques.)
Possibly implementation for scheduler control: call an instrument hook before scheduling a batch, with a list of runnable tasks (useful as a general notification anyway), and let them optionally return something to control the scheduling of this batch.
See also: #77
The text was updated successfully, but these errors were encountered: