feat: Scanning GPU allocation map #2273

jopemachine · 2024-06-13T03:21:04Z

Resolves #3327. (https://github.com/lablup/giftbox/issues/638) (BA-428) (GF-67).

Implement an API that lets administrators check how fGPU is allocated on each agent through a GPU alloc map (the allocation state of each GPU device).

How it works

The GPU allocation is calculated by reading the resource.txt file in each kernel's scratch directory and summing up the per-device allocation information in its KernelResourceSpec.
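The summing step can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it assumes the per-kernel allocations have already been read from each resource.txt into plain mappings of device ID to allocated amount, and the function name, variable names, and device IDs are hypothetical.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Iterable, Mapping


def sum_device_allocations(
    per_kernel_allocs: Iterable[Mapping[str, Decimal]],
) -> dict[str, Decimal]:
    """Sum per-device fGPU allocations across all kernels of one agent.

    Hypothetical helper: each mapping stands for the allocation of a single
    kernel (device ID -> allocated fGPU amount).
    """
    alloc_map: defaultdict[str, Decimal] = defaultdict(Decimal)
    for kernel_alloc in per_kernel_allocs:
        for device_id, amount in kernel_alloc.items():
            # Decimal avoids the rounding noise of binary floats (0.2 + 0.6).
            alloc_map[device_id] += amount
    return dict(alloc_map)


# Two kernels: one uses 0.2 fGPU on a single device, the other is split
# evenly across both devices (made-up device IDs).
kernels = [
    {"gpu-1": Decimal("0.2")},
    {"gpu-0": Decimal("0.6"), "gpu-1": Decimal("0.6")},
]
result = sum_device_allocations(kernels)
```

Using Decimal rather than float keeps the summed values exact, which matters when they are reported as fixed-point strings.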

Usage example

Note

This describes the same issue addressed in issue #638.

Tested using mock-accelerator.

Here is a simple example for testing this PR.

When I specify the two mock GPU devices below in mock-accelerator.toml, I have 2 fGPUs in total.

devices = [
  { mother_uuid = "c59395cd-ac91-4cd3-a1b0-3d2568aa2d01", model_name = "CUDA GPU", numa_node = 0, subproc_count = 108, memory_size = "2G", is_mig_device = false },
  { mother_uuid = "c59395cd-ac91-4cd3-a1b0-3d2568aa2d02", model_name = "CUDA GPU", numa_node = 1, subproc_count = 108, memory_size = "2G", is_mig_device = false },
]

After creating two sessions with the commands below,

❯ ./backend.ai session create \
            -r cpu=1 -r mem=2g -r cuda.shares=0.2 \
            cr.backend.ai/testing/ngc-pytorch:23.10-pytorch2.1-py310-cuda12.2
∙ Session ID e114540d-bd7e-4765-bb25-4b00a47feb51 is created and ready.
∙ This session provides the following app services: sshd, ttyd, jupyter, jupyterlab, vscode, tensorboard, mlflow-ui, nniboard

❯ ./backend.ai session create \
            -r cpu=1 -r mem=2g -r cuda.shares=1.2 \
            cr.backend.ai/testing/ngc-pytorch:23.10-pytorch2.1-py310-cuda12.2
∙ Session ID 91bf45c7-43f3-4c52-9e49-48d49bc897f7 is created and ready.
∙ This session provides the following app services: sshd, ttyd, jupyter, jupyterlab, vscode, tensorboard, mlflow-ui, nniboard

I can query gpu_alloc_map, which is returned as a JSON string, using the following query.

query ($agent_id: String!) {
  agent(agent_id: $agent_id) {
    gpu_alloc_map
  }
}

{
  "data": {
    "agent": {
      "gpu_alloc_map": "{\"c59395cd-ac91-4cd3-a1b0-3d2568aa2d02\": \"0.80\", \"c59395cd-ac91-4cd3-a1b0-3d2568aa2d01\": \"0.60\"}"
    }
  }
}

We can see that the two mock GPU devices have been allocated 0.6 and 0.8 fGPU, respectively.

The first request allocated 0.2 fGPU to the second GPU.
Since no single device could accommodate the 1.2 fGPU of the second request, it was split evenly, with 0.6 fGPU allocated to each GPU device.
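Note that gpu_alloc_map in the response is a JSON-encoded string nested inside the GraphQL JSON response, so a client has to decode it a second time. A minimal client-side sketch, assuming the response has already been parsed into a dict (the helper name decode_gpu_alloc_map is hypothetical and not part of the client SDK):

```python
import json
from decimal import Decimal

# Rebuild the response shown above; the inner map is itself a JSON string.
inner = json.dumps({
    "c59395cd-ac91-4cd3-a1b0-3d2568aa2d02": "0.80",
    "c59395cd-ac91-4cd3-a1b0-3d2568aa2d01": "0.60",
})
response = {"data": {"agent": {"gpu_alloc_map": inner}}}


def decode_gpu_alloc_map(response: dict) -> dict[str, Decimal]:
    """Decode the nested JSON string into a device ID -> fGPU mapping."""
    raw = response["data"]["agent"]["gpu_alloc_map"]
    return {
        device_id: Decimal(value)
        for device_id, value in json.loads(raw).items()
    }


alloc_map = decode_gpu_alloc_map(response)
# The total should equal the sum of the requested cuda.shares (0.2 + 1.2).
total = sum(alloc_map.values(), Decimal("0"))
```

Comparing the total against the sum of the requested cuda.shares values is a quick sanity check when testing with mock-accelerator.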

Checklist: (if applicable)

Milestone metadata specifying the target backport version
Mention to the original issue
API server-client counterparts (e.g., manager API -> client SDK)

graphite-app · 2024-06-13T03:21:10Z

Your org has enabled the Graphite merge queue for merging into main

Add the label “flow:merge-queue” to the PR and Graphite will automatically add it to the merge queue when it’s ready to merge. Or use the label “flow:hotfix” to add to the merge queue as a hot fix.

You must have a Graphite account and log in to Graphite in order to use the merge queue. Sign up using this link.

jopemachine · 2024-06-13T03:21:19Z

This stack of pull requests is managed by Graphite. Learn more about stacking.

src/ai/backend/agent/resources.py

HyeockJinKim

lgtm

HyeockJinKim

lgtm

Co-authored-by: octodog <[email protected]>

github-actions bot assigned jopemachine Jun 13, 2024

github-actions bot added the comp:manager (Related to Manager component), comp:agent (Related to Agent component), and size:M (30~100 LoC) labels Jun 13, 2024

jopemachine changed the title ~~feat: Support scanning GPU allocation~~ feat: Support scanning GPU allocation map Jun 13, 2024

jopemachine added this to the 24.03 milestone Jun 13, 2024

jopemachine marked this pull request as ready for review June 13, 2024 05:22

jopemachine mentioned this pull request Jun 13, 2024

fix: Session creation failure due to wrong type check when using mock-accelerator #2272

Closed

7 tasks

kyujin-cho modified the milestones: 24.03, 24.09 Jun 16, 2024

jopemachine changed the title ~~feat: Support scanning GPU allocation map~~ feat: Scanning GPU allocation map Jun 24, 2024

jopemachine requested a review from achimnol July 29, 2024 05:03

jopemachine force-pushed the topic/06-13-feat_support_scanning_gpu_allocation branch from e86422e to a1f7c71 Compare September 30, 2024 13:24

jopemachine force-pushed the topic/06-13-feat_support_scanning_gpu_allocation branch 2 times, most recently from 689f599 to d31bca6 Compare November 5, 2024 03:31

jopemachine added the type:feature (Add new features) label Nov 5, 2024

HyeockJinKim reviewed Nov 27, 2024

src/ai/backend/agent/resources.py (outdated, resolved)

jopemachine requested a review from HyeockJinKim November 29, 2024 03:31

jopemachine commented Dec 6, 2024

src/ai/backend/agent/resources.py (resolved)

jopemachine modified the milestones: 24.09, 24.12 Dec 6, 2024

jopemachine force-pushed the topic/06-13-feat_support_scanning_gpu_allocation branch from 1eda694 to 2e9a2d6 Compare December 6, 2024 09:11

HyeockJinKim reviewed Dec 16, 2024

src/ai/backend/agent/resources.py (outdated, resolved)

jopemachine marked this pull request as draft December 23, 2024 06:26

jopemachine force-pushed the topic/06-13-feat_support_scanning_gpu_allocation branch from f3056c8 to 0c45322 Compare December 24, 2024 02:35

jopemachine mentioned this pull request Dec 24, 2024

feat: Cache gpu_alloc_map in Redis, and Add RescanGPUAllocMaps mutation #3293

Open

2 tasks

jopemachine marked this pull request as ready for review December 24, 2024 03:31

HyeockJinKim approved these changes Dec 26, 2024

jopemachine and others added 16 commits December 26, 2024 06:39

feat: Implement GPU alloc map scanning through KernelResourceSpec

007f209

feat: Add gpu_alloc_map field to Agent GQL field

c2ddb1a

chore: Add milestone to gpu_alloc_map field

da23246

chore: Add fragment

3c0dd23

chore: Change milestone to 24.09.0.

6efae05

fix: Merge with main

2628939

chore: Update schema

baf72e4

fix: Remove unrelevant change

e925cc4

chore: Remove useless newline

dabd595

refactor: scan_gpu_alloc_map *(Reflect feedback)

2ba70c0

chore: Change milestone to 24.12

46cef26

chore: Rename news fragmnet

a30f741

fix: Use semaphore and TaskGroup

09be83a

fix: Remove semaphore and Use asyncio.as_completed

3ada4ae

fix: Improve exception handling

35833ae

chore: update GraphQL schema dump

4704dd6

Co-authored-by: octodog <[email protected]>

jopemachine force-pushed the topic/06-13-feat_support_scanning_gpu_allocation branch from cc44683 to 4704dd6 Compare December 26, 2024 06:57
