Enable ROCM in CI #999

Open
wants to merge 98 commits into
base: main

msaroufim (Member) commented Oct 3, 2024

Salient points:

  • Add ROCm nightly wheels to test matrix
  • Refactor conda-builder -> almalinux-builder pytorch#140157
    The above PR shows that the builder image has moved to almalinux-builder because CentOS 7 reached end of life. regression_test.yml has been changed accordingly so that it no longer installs devtoolset-10.
  • Fix a bug in how torchao/utils.py invokes torch.cuda.get_device_properties()
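
The description does not say what was wrong with the call in torchao/utils.py. Purely as an illustration, a defensive way to query device properties might look like the sketch below. The helper name `gpu_arch_name` is hypothetical, and passing an explicit device index is an assumption about a reasonable calling pattern, not a description of the actual fix in this PR.

```python
from typing import Optional


def gpu_arch_name(device_index: int = 0) -> Optional[str]:
    """Return an architecture string for a GPU, or None if it cannot be queried.

    Hypothetical helper for illustration; it is not part of torchao.
    """
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        # No usable GPU (this check covers both CUDA and ROCm builds).
        return None
    # Pass the device explicitly instead of relying on a default argument.
    props = torch.cuda.get_device_properties(device_index)
    # Assumption: gcnArchName may be missing on some builds or versions,
    # so fall back to the generic device name.
    return getattr(props, "gcnArchName", None) or props.name


print(gpu_arch_name())
```

On a machine without PyTorch or without a GPU the helper returns `None`, so callers can branch on the result instead of catching an exception.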

Needs changes in pytorch/test-infra#6104

pytorch-bot bot commented Oct 3, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/999

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures

As of commit bde89de with merge base 63d142c (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label Oct 3, 2024

pytorch-bot bot commented Oct 3, 2024

No ciflow labels are configured for this repo.
For information on how to enable CIFlow bot see this wiki

msaroufim (Member, Author) commented

@atalman I'm not sure the no-sudo flag does anything. I tried a few variants for the value, such as true and "true", and got the same result.

amdfaa (Collaborator) commented Nov 19, 2024

@pytorchbot rebase

torch-spec: '--pre torch --index-url https://download.pytorch.org/whl/nightly/rocm6.2.4'
gpu-arch-type: "rocm"
gpu-arch-version: "6.2.4"

Collaborator

Move this section to the test-nightly job since that's where we are testing the nightly wheels, and it uses linux_job_v2.yml, which should be updated as per pytorch/test-infra#6003 (comment)

with:
  timeout: 120
  no-sudo: ${{ matrix.gpu-arch-type == 'rocm' }}
  rocm: ${{ matrix.gpu-arch-type == 'rocm' }}
  continue-on-error: ${{ matrix.gpu-arch-type == 'rocm' }}

Collaborator

This should definitely not be checked in, since it exists only so that we can gather a complete list of test failures. @msaroufim Would we merge this PR only after ROCm CI is fully clean? I'd rather get all these infra changes merged so that torchao CI runs on ROCm regularly, and skip any failing tests for ROCm while we work separately to enable them.

msaroufim (Member, Author) commented Dec 20, 2024

It's up to you. The main constraint is that we can't have CI run red per commit or on main, since that causes confusion and people slowly learn to ignore red. So if you'd like to merge some variant of this PR that does not run per commit or on main, we can try to merge it more quickly.

Personally, I'd favor merging the test skips as part of this work; we can then enable tests one by one while keeping CI green.
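
One way to skip known-failing tests on ROCm while keeping the rest of the suite green is a conditional skip decorator. The sketch below is only an illustration: `running_on_rocm`, `skip_if_rocm`, and `ExampleTest` are hypothetical names, and detecting ROCm through `torch.version.hip` is an assumption about how the check could be written, not the mechanism this PR uses.

```python
import unittest


def running_on_rocm() -> bool:
    """Best-effort check for a ROCm build of PyTorch (hypothetical helper)."""
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return False
    # torch.version.hip is a version string on ROCm builds and None otherwise.
    return getattr(torch.version, "hip", None) is not None


# Reusable decorator: the wrapped test is reported as skipped on ROCm
# and runs normally everywhere else.
skip_if_rocm = unittest.skipIf(
    running_on_rocm(), "known failure on ROCm; enablement tracked separately"
)


class ExampleTest(unittest.TestCase):
    @skip_if_rocm
    def test_not_yet_enabled_on_rocm(self):
        self.assertEqual(1 + 1, 2)

    def test_runs_everywhere(self):
        self.assertTrue(True)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Because skipped tests are reported separately from failures, the job stays green and each skip can later be removed in its own small change.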

Collaborator

@petrex Please note that the torchao team would like this PR to be merged with a clean signal for ROCm, so please skip any failing tests as part of this PR.

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Dec 20, 2024
petrex (Collaborator) commented Jan 6, 2025

happy new year @jithunnair-amd @amdfaa Is this feature/PR ready to deploy?

jithunnair-amd (Collaborator) commented

> happy new year @jithunnair-amd @amdfaa Is this feature/PR ready to deploy?

2 pending items:

  1. Resolve the issue with AWS credentials that causes some steps to fail in the CI job, e.g. https://github.com/pytorch/ao/actions/runs/12656214677/job/35268262299#step:17:39. This needs the PyTorch infra team to enable the "AWS trust policy" for the torchao repository. cc @huydhn to submit a PR for that.
  2. Skip any failing unit tests for ROCm as part of this PR, as per Enable ROCM in CI #999 (comment). cc @petrex to add skips to this PR.

huydhn (Contributor) commented Jan 8, 2025

> 1. Resolve issue with aws credentials causing some steps to fail in CI job eg. https://github.com/pytorch/ao/actions/runs/12656214677/job/35268262299#step:17:39 : this one needs pytorch infra team to enable "AWS trust policy" for torchao repository cc @huydhn to submit a PR for that

The credential is working now. There is a new failure related to chown on the CI job https://github.com/pytorch/ao/actions/runs/12656214677/job/35334719646, but I think that is a separate issue.

Labels
ciflow/rocm, CLA Signed, module: rocm, topic: not user facing
7 participants