A probe set to stop on failure does not stop the experiment #4991

Open
bjoky opened this issue Dec 13, 2024 · 0 comments
bjoky commented Dec 13, 2024

What happened:

An experiment with a probe set to "stop on failure" does not stop the experiment if it detects a failure. The experiment continues to run for the specified duration instead. I've tried this out with a pod-delete fault and an HTTP probe.

There are two lines logged in the pod running the fault. First this:
time="2024-12-13T15:32:10Z" level=error msg="The myapp http probe has been Failed, err: {\"errorCode\":\"HTTP_PROBE_FAILURE\",\"phase\":\"ChaosInject\",\"reason\":\"Actual value: 503. Expected value: should be equal to 200\",\"target\":\"myapp\"}"
This is expected when the probe fails.

The next line is where the problem occurs:
time="2024-12-13T15:32:10Z" level=error msg="Unable to patch chaosengine to stop, err: {\"errorCode\":\"HTTP_PROBE_FAILURE\",\"phase\":\"ChaosInject\",\"reason\":\"Actual value: 503. Expected value: should be equal to 200\",\"target\":\"myapp\"}"

What you expected to happen:

The experiment should have been interrupted. The Chaos Engine should have been set to stop.

Where can this issue be corrected? (optional)

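I believe this can be fixed in the probe logic in litmus-go. It looks like the probe failure, which is what triggers the code path that stops the experiment, is also treated as an error on that same path, so the step that sets the Chaos Engine to stop is skipped. The second log line points the same way: the error attached to "Unable to patch chaosengine to stop" is identical to the HTTP_PROBE_FAILURE error in the first line.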
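To make the suspicion concrete, here is a minimal Go sketch of the pattern I think is involved. It is not the litmus-go source: `patchFunc`, `recordVerdict`, `stopConflated`, and `stopSeparated` are hypothetical names made up for illustration. I am assuming only that the probe error is returned before the patch runs.

```go
package main

import (
	"errors"
	"fmt"
)

// patchFunc is a hypothetical stand-in for the call that sets the engine state.
type patchFunc func(state string) error

// errProbeFailed mimics the probe failure reported in the logs.
var errProbeFailed = errors.New("HTTP_PROBE_FAILURE: actual 503, expected 200")

// recordVerdict is hypothetical bookkeeping that hands the probe error back.
func recordVerdict(probeErr error) error { return probeErr }

// stopConflated models the suspected behaviour: the probe error causes an
// early return, so the patch is never attempted.
func stopConflated(probeErr error, patch patchFunc) error {
	if err := recordVerdict(probeErr); err != nil {
		return err // early return: patch is skipped
	}
	return patch("stop")
}

// stopSeparated treats the probe failure as the expected trigger and only
// reports errors that come from the patch itself.
func stopSeparated(probeErr error, patch patchFunc) error {
	_ = recordVerdict(probeErr) // expected here; must not block the patch
	if err := patch("stop"); err != nil {
		return fmt.Errorf("patch chaosengine to stop: %w", err)
	}
	return nil
}

func main() {
	state := ""
	patch := func(s string) error { state = s; return nil }

	err := stopConflated(errProbeFailed, patch)
	fmt.Printf("conflated: err=%v state=%q\n", err, state)

	err = stopSeparated(errProbeFailed, patch)
	fmt.Printf("separated: err=%v state=%q\n", err, state)
}
```

In the first variant the caller would log the probe error under the "Unable to patch" message while the engine state stays untouched, which matches what I see. Whether the real code has this shape still needs to be checked against the litmus-go source.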

How to reproduce it (as minimally and precisely as possible):

  1. Create a pod-delete experiment that will fail.
  2. Create an HTTP probe that can detect the failure.
  3. Set the probe to stop on failure.
  4. Run the experiment.

Tested on Litmus 3.13.0

Anything else we need to know?:

@bjoky bjoky added the bug label Dec 13, 2024