
(shortfin-sd) Multi-device program initialization fails in SPX mode #467

Closed
monorimet opened this issue Nov 10, 2024 · 2 comments

@monorimet
Contributor

With the caching allocator enabled and async allocations disabled:

~/SHARK-Platform/shortfin$ SHORTFIN_ALLOCATORS=caching SHORTFIN_AMDGPU_LOGICAL_DEVICES_PER_PHYSICAL_DEVICE=1 python -m shortfin_apps.sd.server --model_config=./python/shortfin_apps/sd/examples/sdxl_config_i8.json --device=amdgpu --fibers_per_device=4 --workers_per_device=1 --isolation="none" --flagfile=./python/shortfin_apps/sd/examples/sdxl_flags_gfx942.txt --build_preference=compile --device_ids 1 2
[2024-11-10 09:47:53.078] [info] Configure allocator amdgpu:0:0@0 = [caching]
[2024-11-10 09:47:53.491] [info] Configure allocator amdgpu:1:0@0 = [caching]
INFO:shortfin_apps.sd.components.manager:Created local system with ['amdgpu:1:0@0', 'amdgpu:0:0@0'] devices
Servicing 4 outstanding tasks
Completed BuildFile[bin](sdxl/stable_diffusion_xl_base_1_0_clip_bs1_64_fp16_amdgpu-gfx942.vmfb)
Completed BuildFile[gen](sdxl/stable_diffusion_xl_base_1_0_clip_dataset_fp16.irpa)
Completed BuildFile[gen](sdxl/stable_diffusion_xl_base_1_0_clip_bs1_64_fp16.mlir)
Servicing 1 outstanding tasks
Completed BuildEntrypoint(path='sdxl')
Servicing 4 outstanding tasks
Completed BuildFile[bin](sdxl/stable_diffusion_xl_base_1_0_punet_bs1_64_1024x1024_i8_amdgpu-gfx942.vmfb)
Completed BuildFile[gen](sdxl/stable_diffusion_xl_base_1_0_punet_bs1_64_1024x1024_i8.mlir)
Completed BuildFile[gen](sdxl/stable_diffusion_xl_base_1_0_punet_dataset_i8.irpa)
Servicing 1 outstanding tasks
Completed BuildEntrypoint(path='sdxl')
Servicing 4 outstanding tasks
Completed BuildFile[gen](sdxl/stable_diffusion_xl_base_1_0_vae_dataset_fp16.irpa)
Completed BuildFile[gen](sdxl/stable_diffusion_xl_base_1_0_vae_bs1_1024x1024_fp16.mlir)
Completed BuildFile[bin](sdxl/stable_diffusion_xl_base_1_0_vae_bs1_1024x1024_fp16_amdgpu-gfx942.vmfb)
Servicing 1 outstanding tasks
Completed BuildEntrypoint(path='sdxl')
Servicing 3 outstanding tasks
Completed BuildFile[gen](sdxl/stable_diffusion_xl_base_1_0_EulerDiscreteScheduler_bs1_1024x1024_fp16.mlir)
Completed BuildFile[bin](sdxl/stable_diffusion_xl_base_1_0_EulerDiscreteScheduler_bs1_1024x1024_fp16_amdgpu-gfx942.vmfb)
Servicing 1 outstanding tasks
Completed BuildEntrypoint(path='sdxl')
INFO:root:Loading parameter fiber 'model' from: genfiles/sdxl/stable_diffusion_xl_base_1_0_clip_dataset_fp16.irpa
INFO:root:Loading parameter fiber 'model' from: genfiles/sdxl/stable_diffusion_xl_base_1_0_punet_dataset_i8.irpa
INFO:root:Loading parameter fiber 'model' from: genfiles/sdxl/stable_diffusion_xl_base_1_0_vae_dataset_fp16.irpa
INFO:uvicorn.error:Started server process [1742761]
INFO:uvicorn.error:Waiting for application startup.
INFO:shortfin_apps.sd.components.manager:Starting system manager
INFO:root:Initializing service 'sd':
INFO:root:ServiceManager(
  INFERENCE DEVICES : 
     [Device(name='amdgpu:0:0@0', ordinal=0:0, node_affinity=0, capabilities=0x0), Device(name='amdgpu:1:0@0', ordinal=1:0, node_affinity=0, capabilities=0x0)]

  MODEL PARAMS : 
     base model : SDXL 
     output size (H,W) : [[1024, 1024]] 
     max token sequence length : 64 
     classifier free guidance : True 

  SERVICE PARAMS : 
     fibers per device : 4
     program isolation mode : ProgramIsolation.NONE

  INFERENCE MODULES : 
     clip : [ProgramModule('compiled_clip', version=0, exports=[encode_prompts$async(0rrrrrr_rr), encode_prompts(0rrrr_rr), __init(0v_v)])]
     unet : [ProgramModule('compiled_punet', version=0, exports=[main$async(0rrrrrrrr_r), main(0rrrrrr_r), __init(0v_v)])]
     vae : [ProgramModule('compiled_vae', version=0, exports=[decode$async(0rrr_r), decode(0r_r), __init(0v_v)])]
     scheduler : [ProgramModule('compiled_scheduler', version=0, exports=[run_initialize$async(0rrrr_rrrr), run_initialize(0rr_rrrr), run_scale$async(0rrrrrr_rrrr), run_scale(0rrrr_rrrr), run_step$async(0rrrrrr_r), run_step(0rrrr_r), __init(0v_v)])]

  INFERENCE PARAMETERS : 
     clip : [<_shortfin_default.lib.local.StaticProgramParameters object at 0x7f7bae8a0730>]
     unet : [<_shortfin_default.lib.local.StaticProgramParameters object at 0x7f3b5809bd70>]
     vae : [<_shortfin_default.lib.local.StaticProgramParameters object at 0x7f7bae88f870>]
)
INFO:shortfin_apps.sd.components.manager:Shutting down system manager
INFO:root:System manager command processor stopped
ERROR:uvicorn.error:Traceback (most recent call last):
  File "/home/eagarvey/SHARK-Platform/.venv/lib/python3.12/site-packages/starlette/routing.py", line 693, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  File "/usr/local/lib/python3.12/contextlib.py", line 204, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/eagarvey/SHARK-Platform/shortfin/python/shortfin_apps/sd/server.py", line 50, in lifespan
    service.start()
  File "/home/eagarvey/SHARK-Platform/shortfin/python/shortfin_apps/sd/components/service.py", line 136, in start
    self.inference_programs[worker_idx][component] = sf.Program(
                                                     ^^^^^^^^^^^
ValueError: iree/runtime/src/iree/hal/drivers/hip/event_semaphore.c:350: ABORTED; while calling import; while invoking native function hal.device.queue.dealloca; 
[ 0] bytecode compiled_clip.__init:52672 genfiles/sdxl/stable_diffusion_xl_base_1_0_clip_bs1_64_fp16.mlir:3:3

ERROR:uvicorn.error:Application startup failed. Exiting.

The same error occurs with the default allocator and async allocations enabled, using the following server CLI input:

SHORTFIN_AMDGPU_LOGICAL_DEVICES_PER_PHYSICAL_DEVICE=1 python -m shortfin_apps.sd.server --model_config=./python/shortfin_apps/sd/examples/sdxl_config_i8.json --device=amdgpu --fibers_per_device=4 --workers_per_device=1 --isolation="none" --flagfile=./python/shortfin_apps/sd/examples/sdxl_flags_gfx942.txt --build_preference=compile --amdgpu_async_allocations --device_ids 1 2 
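For comparison, a hypothetical single-device sanity check (same flags as above, with --device_ids reduced to one id so the multi-device initialization path is not exercised); this invocation is an illustrative assumption, not part of the original report:

# Assumed single-device control run, reusing only flags from the commands above.
SHORTFIN_AMDGPU_LOGICAL_DEVICES_PER_PHYSICAL_DEVICE=1 python -m shortfin_apps.sd.server \
  --model_config=./python/shortfin_apps/sd/examples/sdxl_config_i8.json \
  --device=amdgpu --fibers_per_device=4 --workers_per_device=1 --isolation="none" \
  --flagfile=./python/shortfin_apps/sd/examples/sdxl_flags_gfx942.txt \
  --build_preference=compile --device_ids 1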
@AWoloszyn
Contributor

Can you pull this branch:
https://github.com/AWoloszyn/iree/tree/hip-ctx
and see if it fixes this for you? It seems to work for me, but I want to make sure it's the right direction. If it looks OK, I will clean it up and get it landed upstream.
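For anyone following along, a rough sketch of how that branch might be pulled into an existing local IREE checkout; the remote name and the SHORTFIN_IREE_SOURCE_DIR rebuild step are assumptions about a typical developer setup, not instructions from this thread:

# Assumed workflow: fetch the proposed branch into a local IREE clone.
git remote add awoloszyn https://github.com/AWoloszyn/iree.git   # remote name is arbitrary
git fetch awoloszyn hip-ctx
git checkout -b hip-ctx awoloszyn/hip-ctx
# Then rebuild shortfin against that source tree; SHORTFIN_IREE_SOURCE_DIR is
# assumed here as the hook used by the shortfin developer build.
SHORTFIN_IREE_SOURCE_DIR=$PWD pip install -e ~/SHARK-Platform/shortfin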

@monorimet
Contributor Author

monorimet commented Nov 12, 2024

Resolved by iree-org/iree#19103
