feat: Scanning GPU allocation map #2273

Open: wants to merge 16 commits into base: main

Conversation

jopemachine
Member

@jopemachine jopemachine commented Jun 13, 2024

Resolves #3327. (https://github.com/lablup/giftbox/issues/638) (BA-428) (GF-67).

Implements an API that lets an administrator check how fGPU is allocated across agents through the GPU alloc map (the allocation state of each GPU device).

How it works

The GPU allocation is calculated by reading the resource.txt file in each kernel's scratch directory and summing the allocation information in its KernelResourceSpec.
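
The summation step can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: it assumes each kernel's allocation has already been parsed from resource.txt into a mapping of device ID to fGPU share, and `build_gpu_alloc_map` is a hypothetical name.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Iterable, Mapping


def build_gpu_alloc_map(
    kernel_allocs: Iterable[Mapping[str, Decimal]],
) -> dict[str, Decimal]:
    """Sum per-kernel device allocations into one total per device.

    `kernel_allocs` is an assumed, already-parsed form of the per-kernel
    allocation data: one {device_id: share} mapping per kernel.
    """
    totals: defaultdict[str, Decimal] = defaultdict(Decimal)  # missing keys start at 0
    for per_device in kernel_allocs:
        for device_id, share in per_device.items():
            totals[device_id] += share
    return dict(totals)
```

Using `Decimal` instead of `float` keeps fractional shares such as 0.2 and 0.6 exact when they are added together.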

Usage example

Note

This describes the same issue addressed in issue #638.

Tested using mock-accelerator.

Here is a simple example for testing this PR.

When I specify the two mock GPU devices below in mock-accelerator.toml, I have 2 fGPUs in total.

devices = [
  { mother_uuid = "c59395cd-ac91-4cd3-a1b0-3d2568aa2d01", model_name = "CUDA GPU", numa_node = 0, subproc_count = 108, memory_size = "2G", is_mig_device = false },
  { mother_uuid = "c59395cd-ac91-4cd3-a1b0-3d2568aa2d02", model_name = "CUDA GPU", numa_node = 1, subproc_count = 108, memory_size = "2G", is_mig_device = false },
]

Then I create two sessions with the commands below:

❯ ./backend.ai session create \
            -r cpu=1 -r mem=2g -r cuda.shares=0.2 \
            cr.backend.ai/testing/ngc-pytorch:23.10-pytorch2.1-py310-cuda12.2
∙ Session ID e114540d-bd7e-4765-bb25-4b00a47feb51 is created and ready.
∙ This session provides the following app services: sshd, ttyd, jupyter, jupyterlab, vscode, tensorboard, mlflow-ui, nniboard

❯ ./backend.ai session create \
            -r cpu=1 -r mem=2g -r cuda.shares=1.2 \
            cr.backend.ai/testing/ngc-pytorch:23.10-pytorch2.1-py310-cuda12.2
∙ Session ID 91bf45c7-43f3-4c52-9e49-48d49bc897f7 is created and ready.
∙ This session provides the following app services: sshd, ttyd, jupyter, jupyterlab, vscode, tensorboard, mlflow-ui, nniboard

I can then query gpu_alloc_map in JSON format with the following query.

query ($agent_id: String!) {
  agent(agent_id: $agent_id) {
    gpu_alloc_map
  }
}

{
  "data": {
    "agent": {
      "gpu_alloc_map": "{\"c59395cd-ac91-4cd3-a1b0-3d2568aa2d02\": \"0.80\", \"c59395cd-ac91-4cd3-a1b0-3d2568aa2d01\": \"0.60\"}"
    }
  }
}
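
Note that gpu_alloc_map comes back as a JSON-encoded string inside the JSON response, so a client has to decode it a second time. A minimal sketch of that decoding, where `parse_gpu_alloc_map` is a hypothetical helper and the sample payload is rebuilt from the response above:

```python
import json
from decimal import Decimal


def parse_gpu_alloc_map(response_text: str) -> dict[str, Decimal]:
    """Decode the response, then decode the nested gpu_alloc_map string."""
    payload = json.loads(response_text)
    raw_map = payload["data"]["agent"]["gpu_alloc_map"]  # still a JSON string here
    return {device_id: Decimal(share) for device_id, share in json.loads(raw_map).items()}


# Rebuild the sample response shown above (inner map serialized as a string).
inner = json.dumps({
    "c59395cd-ac91-4cd3-a1b0-3d2568aa2d02": "0.80",
    "c59395cd-ac91-4cd3-a1b0-3d2568aa2d01": "0.60",
})
sample_response = json.dumps({"data": {"agent": {"gpu_alloc_map": inner}}})

alloc_map = parse_gpu_alloc_map(sample_response)
```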

We can see that the two mock GPU devices have been allocated 0.6 and 0.8 fGPU respectively.

The first request allocated 0.2 fGPU to the second GPU.
Because no single device could accommodate the 1.2 fGPU of the second request, it was split evenly, with 0.6 fGPU allocated to each GPU device. The second GPU therefore holds 0.2 + 0.6 = 0.8 fGPU and the first holds 0.6 fGPU.
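
The arithmetic behind these numbers can be checked with a short script. This only mirrors the behaviour described in this example (an even split across both devices); it is not the allocator's actual implementation, and `split_evenly` is a hypothetical helper.

```python
from decimal import Decimal

GPU_1 = "c59395cd-ac91-4cd3-a1b0-3d2568aa2d01"
GPU_2 = "c59395cd-ac91-4cd3-a1b0-3d2568aa2d02"


def split_evenly(request: Decimal, device_ids: list[str]) -> dict[str, Decimal]:
    """Divide one request into equal shares, one per device."""
    share = request / len(device_ids)
    return {device_id: share for device_id in device_ids}


# First session: 0.2 fGPU landed on the second GPU.
first = {GPU_2: Decimal("0.2")}
# Second session: 1.2 fGPU did not fit on one device, so it was split in two.
second = split_evenly(Decimal("1.2"), [GPU_1, GPU_2])

# Per-device totals across both sessions.
totals = {
    device_id: first.get(device_id, Decimal("0")) + second[device_id]
    for device_id in (GPU_1, GPU_2)
}
```

The totals match the gpu_alloc_map returned by the query, and their sum equals the 1.4 fGPU requested across the two sessions.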

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue
  • API server-client counterparts (e.g., manager API -> client SDK)


graphite-app bot commented Jun 13, 2024

Your org has enabled the Graphite merge queue for merging into main

Add the label “flow:merge-queue” to the PR and Graphite will automatically add it to the merge queue when it’s ready to merge. Or use the label “flow:hotfix” to add to the merge queue as a hot fix.

You must have a Graphite account and log in to Graphite in order to use the merge queue. Sign up using this link.

@github-actions github-actions bot added comp:manager Related to Manager component comp:agent Related to Agent component size:M 30~100 LoC labels Jun 13, 2024
Member Author

jopemachine commented Jun 13, 2024

@jopemachine jopemachine changed the title feat: Support scanning GPU allocation feat: Support scanning GPU allocation map Jun 13, 2024
@jopemachine jopemachine added this to the 24.03 milestone Jun 13, 2024
@jopemachine jopemachine marked this pull request as ready for review June 13, 2024 05:22
@kyujin-cho kyujin-cho modified the milestones: 24.03, 24.09 Jun 16, 2024
@jopemachine jopemachine changed the title feat: Support scanning GPU allocation map feat: Scanning GPU allocation map Jun 24, 2024
@jopemachine jopemachine requested a review from achimnol July 29, 2024 05:03
@jopemachine jopemachine force-pushed the topic/06-13-feat_support_scanning_gpu_allocation branch from e86422e to a1f7c71 Compare September 30, 2024 13:24
@jopemachine jopemachine force-pushed the topic/06-13-feat_support_scanning_gpu_allocation branch 2 times, most recently from 689f599 to d31bca6 Compare November 5, 2024 03:31
@jopemachine jopemachine added the type:feature Add new features label Nov 5, 2024
@jopemachine jopemachine modified the milestones: 24.09, 24.12 Dec 6, 2024
@jopemachine jopemachine force-pushed the topic/06-13-feat_support_scanning_gpu_allocation branch from 1eda694 to 2e9a2d6 Compare December 6, 2024 09:11
@jopemachine jopemachine marked this pull request as draft December 23, 2024 06:26
@jopemachine jopemachine force-pushed the topic/06-13-feat_support_scanning_gpu_allocation branch from f3056c8 to 0c45322 Compare December 24, 2024 02:35
@jopemachine jopemachine marked this pull request as ready for review December 24, 2024 03:31
Collaborator

@HyeockJinKim HyeockJinKim left a comment

lgtm

Collaborator

@HyeockJinKim HyeockJinKim left a comment

lgtm

@jopemachine jopemachine force-pushed the topic/06-13-feat_support_scanning_gpu_allocation branch from cc44683 to 4704dd6 Compare December 26, 2024 06:57
Labels
comp:agent Related to Agent component comp:manager Related to Manager component size:M 30~100 LoC type:feature Add new features
Development

Successfully merging this pull request may close these issues.

Implement API for scanning GPU allocation map
3 participants