Support automatic discovery of MIG devices #992

Open
DrAuYueng opened this issue Oct 15, 2024 · 2 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

DrAuYueng commented Oct 15, 2024

While using k8s-device-plugin in our Kubernetes cluster, we found that in MIG mode:

  1. No device plugin instance is started for a newly created GPU instance (GI).
  2. The status of a newly created compute instance (CI) is not shown on the node.

When we delete the k8s-device-plugin Pod so that it is recreated, the resources are displayed correctly.
It seems that newly created MIG resources are not discovered automatically.

klueska (Contributor) commented Oct 15, 2024

That is correct. The device-plugin needs to be restarted after a MIG reconfiguration.
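
For readers handling this by hand, here is a minimal sketch of the restart step, assuming the plugin is deployed as a DaemonSet. The DaemonSet name and namespace are placeholders that depend on how the plugin was installed, and `restart_plugin_command` is a hypothetical helper, not part of the device plugin. It only builds the `kubectl rollout restart` command so that it can be inspected before it is run.

```python
import shlex


def restart_plugin_command(daemonset: str, namespace: str) -> list[str]:
    """Build a kubectl command that restarts the device-plugin DaemonSet.

    Both arguments are assumptions about the local deployment; check the
    real names with `kubectl get daemonsets -A` first.
    """
    return ["kubectl", "rollout", "restart", f"daemonset/{daemonset}", "-n", namespace]


# Placeholder values: substitute the names used in your cluster.
cmd = restart_plugin_command("nvidia-device-plugin-daemonset", "kube-system")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the MIG reconfiguration has finished.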

If you use the GPU operator, this process is automated for you by a component called the mig-manager, so that you don't have to manage this complexity yourself.

Using the mig-manager, you can dynamically reconfigure the set of available MIG devices on a node by setting a node label. Details can be found here: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html#example-reconfiguring-mig-profiles
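
As a rough illustration of the node-label approach, the sketch below builds the `kubectl label` command described in the linked GPU operator documentation, which uses the `nvidia.com/mig.config` label. The node name is a placeholder, `all-1g.5gb` is only an example and must match a configuration known to the mig-manager on your cluster, and `mig_config_label_command` is a hypothetical helper.

```python
import shlex


def mig_config_label_command(node: str, mig_config: str) -> list[str]:
    """Build a kubectl command that sets the MIG configuration label on a node.

    The mig-manager watches this label and reconfigures the node's GPUs;
    `mig_config` must name a configuration the mig-manager knows about.
    """
    label = f"nvidia.com/mig.config={mig_config}"
    # --overwrite is needed because the label normally already exists.
    return ["kubectl", "label", "node", node, label, "--overwrite"]


# Placeholder node name and example configuration name.
cmd = mig_config_label_command("my-gpu-node", "all-1g.5gb")
print(shlex.join(cmd))
```

With the GPU operator in place, no manual restart of the device plugin should be needed after this label change, since the mig-manager automates that step.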

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 14, 2025