Enable ROCM in CI #999
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/999
Note: Links to docs will display an error until the docs builds have been completed.
❌ 3 New Failures
As of commit bde89de with merge base 63d142c:
NEW FAILURES - The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
No ciflow labels are configured for this repo.
@atalman I'm not sure the no-sudo flag does anything. I tried a few variants for the value, like true and "true", and got the same result.
@pytorchbot rebase
torch-spec: '--pre torch --index-url https://download.pytorch.org/whl/nightly/rocm6.2.4'
gpu-arch-type: "rocm"
gpu-arch-version: "6.2.4"
Move this section to the test-nightly job since that's where we are testing the nightly wheels, and it uses linux_job_v2.yml, which should be updated as per pytorch/test-infra#6003 (comment)
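For reference, the matrix entry above pairs a `gpu-arch-type` and `gpu-arch-version` with a matching nightly index URL. A minimal sketch of that relationship follows; the helper names `nightly_index_url` and `torch_spec` are hypothetical and are not part of torchao or test-infra, and the CUDA suffix rule (`12.4` becomes `cu124`) is an assumption based on the usual wheel index naming.

```python
# Illustrative only: these helpers are hypothetical, not torchao or test-infra code.
NIGHTLY_BASE = "https://download.pytorch.org/whl/nightly"


def nightly_index_url(gpu_arch_type: str, gpu_arch_version: str = "") -> str:
    """Build the nightly wheel index URL for one matrix entry."""
    if gpu_arch_type == "rocm":
        # ROCm keeps the dotted version, e.g. rocm6.2.4
        suffix = f"rocm{gpu_arch_version}"
    elif gpu_arch_type == "cuda":
        # Assumed convention: CUDA drops the dot, e.g. 12.4 -> cu124
        suffix = "cu" + gpu_arch_version.replace(".", "")
    elif gpu_arch_type == "cpu":
        suffix = "cpu"
    else:
        raise ValueError(f"unknown gpu-arch-type: {gpu_arch_type!r}")
    return f"{NIGHTLY_BASE}/{suffix}"


def torch_spec(gpu_arch_type: str, gpu_arch_version: str = "") -> str:
    """Return the pip arguments used as the torch-spec value in the matrix."""
    return f"--pre torch --index-url {nightly_index_url(gpu_arch_type, gpu_arch_version)}"


if __name__ == "__main__":
    print(torch_spec("rocm", "6.2.4"))
```

Keeping the three fields derived from one place would make it harder for `torch-spec` and `gpu-arch-version` to drift apart when the ROCm version is bumped.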
with:
  timeout: 120
  no-sudo: ${{ matrix.gpu-arch-type == 'rocm' }}
  rocm: ${{ matrix.gpu-arch-type == 'rocm' }}
  continue-on-error: ${{ matrix.gpu-arch-type == 'rocm' }}
This should definitely not be checked in, since it's only there for us to gather a complete list of test failures. @msaroufim Would we merge this PR only after ROCm CI is fully clean? I'd rather get all these infra changes merged so that we run torchao CI on ROCm regularly, and maybe skip any failing tests on ROCm while we work separately to enable them.
It's up to you. The main constraint is that we can't have CI run red per commit or on main, since that just causes confusion and people slowly learn to ignore red. So if you'd like to merge some variant of this PR that doesn't run on commits or on main, we can try to merge it more quickly.
Personally, I'd favor merging the test skips as part of this work. Then we can enable tests one by one while keeping CI green.
@petrex Please note that the torchao team would like this PR to be merged with a clean signal for ROCm, so please skip any failing tests as part of this PR.
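One way to do the skipping discussed in this thread is sketched below. It assumes ROCm builds can be detected through `torch.version.hip`, which is a version string on ROCm wheels and `None` otherwise. The names `is_rocm_build` and `skip_if_rocm` are hypothetical and not existing torchao helpers, and the test body is a placeholder.

```python
import unittest

try:
    import torch
except ImportError:  # keeps the sketch importable on machines without PyTorch
    torch = None


def is_rocm_build() -> bool:
    """Return True when the installed PyTorch wheel was built for ROCm."""
    # torch.version.hip is a version string on ROCm builds and None otherwise.
    return torch is not None and getattr(torch.version, "hip", None) is not None


# Hypothetical decorator name, not an existing torchao helper.
skip_if_rocm = unittest.skipIf(
    is_rocm_build(), "known failure on ROCm, tracked separately"
)


class ExampleTest(unittest.TestCase):
    @skip_if_rocm
    def test_known_rocm_failure(self):
        # Placeholder body standing in for a test that currently fails on ROCm.
        self.assertEqual(1 + 1, 2)


if __name__ == "__main__":
    unittest.main()
```

Skipped tests are still reported in the run summary, so the list of tests left to enable stays visible while the ROCm job remains green, and `continue-on-error` would no longer be needed.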
Needed for pytorch/test-infra#6003 and pytorch/ao#999
Pull Request resolved: #143590
Approved by: https://github.com/atalman
Co-authored-by: Jithun Nair <[email protected]>
Happy new year @jithunnair-amd @amdfaa. Is this feature/PR ready to deploy?
2 pending items:
The credential is working now. There is a new failure related to chown in the CI job https://github.com/pytorch/ao/actions/runs/12656214677/job/35334719646, but I think that's a separate issue.
Salient points:
The above PR shows that we've migrated to almalinux-builder because CentOS 7 has reached EOL. In line with this switch, regression_test.yml has been changed to no longer install devtoolset-10.
torchao/utils.py in invocation of torch.cuda.get_device_properties()
Needs changes in pytorch/test-infra#6104