Oom protection #1321

Merged

DiegoTavares
Collaborator

The current logic relies on hardcoded values which are not suitable for large hosts. The new logic takes into account the size of hosts and also tries to be more aggressive with misbehaving frames.

Prevent the host from entering an OOM state, where the oom-killer might start killing important OS processes. The kill logic kicks in when one of the following conditions is met:

  • Host has less than OOM_MEMORY_LEFT_THRESHOLD_PERCENT memory available
  • A frame is taking more than OOM_FRAME_OVERBOARD_PERCENT of what it had reserved

For frames that are using more than they had reserved but not above the threshold, negotiate expanding the reservations with other frames on the same host.
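The two conditions and the fallback negotiation can be sketched roughly as follows. This is only an illustration of the decision flow described above, not the code from this PR: the names `Action`, `Frame`, `host_is_near_oom` and `frame_action` are hypothetical, the default values are made up, and it assumes that OOM_FRAME_OVERBOARD_PERCENT means "percent above the reservation" and that -1 disables the whole per-frame check.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    EXPAND_RESERVATION = "expand_reservation"
    KILL = "kill"


# Hypothetical defaults; the real values are not stated in this description.
OOM_MEMORY_LEFT_THRESHOLD_PERCENT = 5
OOM_FRAME_OVERBOARD_PERCENT = 20  # assumption: -1 turns the per-frame check off


@dataclass
class Frame:
    reserved_kb: int
    used_kb: int


def host_is_near_oom(total_kb, free_kb,
                     threshold_percent=OOM_MEMORY_LEFT_THRESHOLD_PERCENT):
    """True when the host has less than threshold_percent of its memory left."""
    return free_kb * 100 < total_kb * threshold_percent


def frame_action(frame, overboard_percent=OOM_FRAME_OVERBOARD_PERCENT):
    """Decide what to do with one frame based on used vs. reserved memory."""
    if overboard_percent < 0:
        # Assumption: a negative value disables the overboard protection.
        return Action.NONE
    if frame.used_kb <= frame.reserved_kb:
        return Action.NONE
    # Assumption: "overboard" means usage above reserved * (1 + percent/100).
    if frame.used_kb * 100 > frame.reserved_kb * (100 + overboard_percent):
        return Action.KILL
    # Over the reservation but under the threshold: try to grow the
    # reservation by negotiating with other frames on the same host.
    return Action.EXPAND_RESERVATION
```

Scaling the threshold by the host's total memory, rather than comparing against a fixed number of bytes, is what lets the same setting work for both small and large hosts.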

Additional observations:

  • Setting OOM_FRAME_OVERBOARD_PERCENT to -1 will turn off the frame overboard protection feature.
  • Frames killed by this logic will be marked to be retried and will trigger the existing memory bump logic, in which the memory requirements for the layer will be increased to try to allocate a larger portion of memory for subsequent executions.
  • On rare occasions, a frame can leave child processes behind. Because rqd keeps track of the procs related to the original frame, it continues to report the frame as active, and cuebot continues to request that it be killed. To avoid this condition, new logic prevents repeated kill requests from being sent more than FRAME_KILL_RETRY_LIMIT times in 3 minutes (configurable).
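One way to picture the kill retry limit is a per-frame sliding window, sketched below. This is an assumption about the general technique, not the implementation merged here: `KillRequestLimiter`, `should_send_kill`, `KILL_RETRY_WINDOW_SECONDS` and the limit value of 3 are all hypothetical names and numbers chosen for illustration.

```python
import time
from collections import defaultdict, deque

# Hypothetical values: the description only says the limit applies
# "in 3 minutes (configurable)".
FRAME_KILL_RETRY_LIMIT = 3
KILL_RETRY_WINDOW_SECONDS = 180


class KillRequestLimiter:
    """Allows at most `limit` kill requests per frame within a time window."""

    def __init__(self, limit=FRAME_KILL_RETRY_LIMIT,
                 window_seconds=KILL_RETRY_WINDOW_SECONDS,
                 clock=time.monotonic):
        self._limit = limit
        self._window = window_seconds
        self._clock = clock
        # frame id -> timestamps of recent kill requests, oldest first
        self._attempts = defaultdict(deque)

    def should_send_kill(self, frame_id):
        """Record an attempt and return True if the request may be sent."""
        now = self._clock()
        attempts = self._attempts[frame_id]
        # Drop attempts that have fallen out of the window.
        while attempts and now - attempts[0] >= self._window:
            attempts.popleft()
        if len(attempts) >= self._limit:
            # Limit reached: the caller should skip the kill request and
            # report the frame as blocked instead.
            return False
        attempts.append(now)
        return True
```

A caller would check `limiter.should_send_kill(frame_id)` before each kill request, so a frame that cannot be killed stops consuming kill attempts once the limit is hit.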

…or large hosts. The new logic takes into account the size of hosts and also tries to be more aggressive with misbehaving frames.

Prevent the host from entering an OOM state, where the oom-killer might start killing important OS processes. The kill logic kicks in when one of the following conditions is met:

Host has less than OOM_MEMORY_LEFT_THRESHOLD_PERCENT memory available
A frame is taking more than OOM_FRAME_OVERBOARD_PERCENT of what it had reserved

For frames that are using more than they had reserved but not above the threshold, negotiate expanding the reservations with other frames on the same host.

(cherry picked from commit e88a5295f23bd927614de6d5af6a09d496d3e6ac)
(cherry picked from commit b88f7bcb1ad43f83fb8357576c33483dc2bf4952)
(cherry picked from commit 647e75e2254c7a7ff68c544e438080f412bf04c1)
There's an error condition on rqd where a frame that cannot be killed ends up preventing the host from picking up new jobs. This logic limits the number of repeated killRequests to give the host a chance to pick up new jobs. At the same time, blocked frames are logged to spcue.log so they can be handled manually.

(cherry picked from commit aea4864ef66aca494fb455a7c103e4a832b63d41)
@DiegoTavares
Collaborator Author

@bcipriano Ready for review

Collaborator

@bcipriano bcipriano left a comment

Overall looks good, some minor comments. Pretty complex but everything's broken out really nicely.

@DiegoTavares
Collaborator Author

Ready for another round

Signed-off-by: Diego Tavares <[email protected]>
@DiegoTavares DiegoTavares merged commit e3136f4 into AcademySoftwareFoundation:master Nov 8, 2023
10 of 11 checks passed
carlosfelgarcia pushed a commit to carlosfelgarcia/OpenCue that referenced this pull request May 22, 2024
* The current logic relies on hardcoded values which are not suitable for large hosts. The new logic takes into account the size of hosts and also tries to be more aggressive with misbehaving frames.

Prevent the host from entering an OOM state, where the oom-killer might start killing important OS processes. The kill logic kicks in when one of the following conditions is met:

Host has less than OOM_MEMORY_LEFT_THRESHOLD_PERCENT memory available
A frame is taking more than OOM_FRAME_OVERBOARD_PERCENT of what it had reserved

For frames that are using more than they had reserved but not above the threshold, negotiate expanding the reservations with other frames on the same host.

(cherry picked from commit e88a5295f23bd927614de6d5af6a09d496d3e6ac)

* Frames killed for OOM should be retried

(cherry picked from commit b88f7bcb1ad43f83fb8357576c33483dc2bf4952)

* OOM_FRAME_OVERBOARD_ALLOWED_THRESHOLD can be deactivated with -1

(cherry picked from commit 647e75e2254c7a7ff68c544e438080f412bf04c1)

* Limit the number of kill retries

There's an error condition on rqd where a frame that cannot be killed ends up preventing the host from picking up new jobs. This logic limits the number of repeated killRequests to give the host a chance to pick up new jobs. At the same time, blocked frames are logged to spcue.log so they can be handled manually.

(cherry picked from commit aea4864ef66aca494fb455a7c103e4a832b63d41)

* Fix merge conflicts

* Handle MR comments

* Minor improvements to the logic

Signed-off-by: Diego Tavares <[email protected]>

---------

Signed-off-by: Diego Tavares <[email protected]>