Oom protection #1321
Merged: DiegoTavares merged 7 commits into AcademySoftwareFoundation:master from DiegoTavares:oom_protection on Nov 8, 2023
…or large hosts. The new logic takes into account the size of hosts and also tries to be more aggressive with misbehaving frames.

Prevent host from entering an OOM state where oom-killer might start killing important OS processes. The kill logic will kick in when one of the following conditions is met:
- Host has less than OOM_MEMORY_LEFT_THRESHOLD_PERCENT memory available
- A frame is taking more than OOM_FRAME_OVERBOARD_PERCENT of what it had reserved

For frames that are using more than they had reserved but not above the threshold, negotiate expanding the reservations with other frames on the same host

(cherry picked from commit e88a5295f23bd927614de6d5af6a09d496d3e6ac)
(cherry picked from commit b88f7bcb1ad43f83fb8357576c33483dc2bf4952)
(cherry picked from commit 647e75e2254c7a7ff68c544e438080f412bf04c1)
There's an error condition on rqd where a frame that cannot be killed will end up preventing the host from picking up new jobs. This logic limits the number of repeated killRequests to give host a chance to pick up new jobs. At the same time, blocked frames are logged to spcue.log to be handled manually. (cherry picked from commit aea4864ef66aca494fb455a7c103e4a832b63d41)
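The idea of capping repeated kill requests can be illustrated with a minimal Java sketch. Everything here is an assumption made for illustration: the class name `KillRetryLimiterSketch`, the `tryAcquire` method, the in-memory map, and the fixed-window counting are hypothetical and are not taken from the cuebot code in this pull request.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: allow at most `retryLimit` kill requests per frame
// within a fixed time window. Not the cuebot implementation.
class KillRetryLimiterSketch {
    private static final class Attempts {
        long windowStartMs;
        int count;
    }

    private final int retryLimit;
    private final long windowMs;
    private final Map<String, Attempts> attemptsByFrame = new HashMap<>();

    KillRetryLimiterSketch(int retryLimit, long windowMs) {
        this.retryLimit = retryLimit;
        this.windowMs = windowMs;
    }

    // Returns true if another kill request may be sent for this frame.
    // Returns false once the limit is reached; a caller could then log the
    // frame as blocked so it can be handled manually.
    boolean tryAcquire(String frameId, long nowMs) {
        Attempts attempts = attemptsByFrame.get(frameId);
        if (attempts == null || nowMs - attempts.windowStartMs >= windowMs) {
            // First request for this frame, or the previous window expired.
            attempts = new Attempts();
            attempts.windowStartMs = nowMs;
            attemptsByFrame.put(frameId, attempts);
        }
        if (attempts.count >= retryLimit) {
            return false;
        }
        attempts.count++;
        return true;
    }

    public static void main(String[] args) {
        // Assumed values for illustration: 2 attempts per 3-minute window.
        KillRetryLimiterSketch limiter = new KillRetryLimiterSketch(2, 180_000L);
        for (int i = 0; i < 3; i++) {
            System.out.println("attempt " + i + " allowed: "
                    + limiter.tryAcquire("frame-a", i * 1_000L));
        }
    }
}
```

Passing the current time in as a parameter keeps the sketch deterministic; a real dispatcher would more likely read a clock and use a thread-safe structure.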
DiegoTavares requested review from bcipriano, gregdenton, jrray, smith1511, larsbijl, IdrisMiles and splhack as code owners on October 4, 2023 21:20
DiegoTavares force-pushed the oom_protection branch from fbab6e9 to 80aa877 on October 4, 2023 21:29
@bcipriano Ready for review
bcipriano requested changes on Oct 5, 2023
Overall looks good, some minor comments. Pretty complex but everything's broken out really nicely.
cuebot/src/main/java/com/imageworks/spcue/dispatcher/DispatchSupport.java
cuebot/src/main/java/com/imageworks/spcue/dispatcher/HostReportHandler.java
cuebot/src/main/java/com/imageworks/spcue/dispatcher/HostReportHandler.java
cuebot/src/main/java/com/imageworks/spcue/dispatcher/HostReportHandler.java
cuebot/src/main/java/com/imageworks/spcue/dispatcher/HostReportHandler.java (Outdated)
cuebot/src/main/java/com/imageworks/spcue/dispatcher/HostReportHandler.java (Outdated)
cuebot/src/main/java/com/imageworks/spcue/dispatcher/HostReportHandler.java (Outdated)
cuebot/src/main/java/com/imageworks/spcue/dispatcher/commands/DispatchRqdKillFrameMemory.java (Outdated)
Ready for another round
bcipriano approved these changes on Nov 8, 2023
cuebot/src/main/java/com/imageworks/spcue/dispatcher/HostReportHandler.java
Signed-off-by: Diego Tavares <[email protected]>
DiegoTavares merged commit e3136f4 into AcademySoftwareFoundation:master on Nov 8, 2023
10 of 11 checks passed
carlosfelgarcia pushed a commit to carlosfelgarcia/OpenCue that referenced this pull request on May 22, 2024
* The current logic relies on hardcoded values which are not suitable for large hosts. The new logic takes into account the size of hosts and also tries to be more aggressive with misbehaving frames.

  Prevent host from entering an OOM state where oom-killer might start killing important OS processes. The kill logic will kick in when one of the following conditions is met:
  - Host has less than OOM_MEMORY_LEFT_THRESHOLD_PERCENT memory available
  - A frame is taking more than OOM_FRAME_OVERBOARD_PERCENT of what it had reserved

  For frames that are using more than they had reserved but not above the threshold, negotiate expanding the reservations with other frames on the same host

  (cherry picked from commit e88a5295f23bd927614de6d5af6a09d496d3e6ac)

* Frames killed for OOM should be retried

  (cherry picked from commit b88f7bcb1ad43f83fb8357576c33483dc2bf4952)

* OOM_FRAME_OVERBOARD_ALLOWED_THRESHOLD can be deactivated with -1

  (cherry picked from commit 647e75e2254c7a7ff68c544e438080f412bf04c1)

* Limit the number of kill retries

  There's an error condition on rqd where a frame that cannot be killed will end up preventing the host from picking up new jobs. This logic limits the number of repeated killRequests to give host a chance to pick up new jobs. At the same time, blocked frames are logged to spcue.log to be handled manually.

  (cherry picked from commit aea4864ef66aca494fb455a7c103e4a832b63d41)

* Fix merge conflicts

* Handle MR comments

* Minor improvements to the logic

Signed-off-by: Diego Tavares <[email protected]>

---------

Signed-off-by: Diego Tavares <[email protected]>
The current logic relies on hardcoded values which are not suitable for large hosts. The new logic takes into account the size of hosts and also tries to be more aggressive with misbehaving frames.

Prevent the host from entering an OOM state where oom-killer might start killing important OS processes. The kill logic will kick in when one of the following conditions is met:
- Host has less than OOM_MEMORY_LEFT_THRESHOLD_PERCENT memory available
- A frame is taking more than OOM_FRAME_OVERBOARD_PERCENT of what it had reserved

For frames that are using more than they had reserved but not above the threshold, negotiate expanding the reservations with other frames on the same host.

Additional observations:
- Setting OOM_FRAME_OVERBOARD_PERCENT to -1 will turn off the frame overboard protection feature.
- Kill requests for a frame are limited to FRAME_KILL_RETRY_LIMIT times in 3 minutes (configurable).
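
To make the two conditions concrete, here is a minimal Java sketch of how such checks could look. It is not the code from this pull request: the class `OomDecisionSketch`, the `Action` enum, and both method names are hypothetical. It also assumes that the thresholds are whole percentages (for example, 5 means 5%) and that "more than OOM_FRAME_OVERBOARD_PERCENT of what it had reserved" means exceeding the reservation by more than that percentage; the description above does not settle either point.

```java
// Hypothetical sketch of the two OOM checks described above.
// Names, units (KB), and percentage semantics are assumptions.
class OomDecisionSketch {
    enum Action { NONE, EXPAND_RESERVATION, KILL }

    // Condition 1: the host has less than `thresholdPercent` of its
    // total memory still available.
    static boolean hostIsLowOnMemory(long totalKb, long freeKb,
                                     double thresholdPercent) {
        return freeKb * 100.0 / totalKb < thresholdPercent;
    }

    // Condition 2: a frame uses more than it reserved.
    // A negative `overboardPercent` (such as -1) disables the kill branch,
    // so an over-reserved frame only triggers a reservation negotiation.
    static Action frameAction(long reservedKb, long usedKb,
                              double overboardPercent) {
        if (usedKb <= reservedKb) {
            return Action.NONE;
        }
        double overPercent = (usedKb - reservedKb) * 100.0 / reservedKb;
        if (overboardPercent >= 0 && overPercent > overboardPercent) {
            return Action.KILL;
        }
        return Action.EXPAND_RESERVATION;
    }

    public static void main(String[] args) {
        // Illustrative values only; real defaults are not stated here.
        System.out.println(hostIsLowOnMemory(100_000, 4_000, 5.0));
        System.out.println(frameAction(1_000, 1_100, 20.0));
        System.out.println(frameAction(1_000, 1_300, 20.0));
        System.out.println(frameAction(1_000, 5_000, -1));
    }
}
```

The sketch deliberately leaves out which frame is chosen when the host itself is low on memory, since the description does not specify a selection rule.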