-
Notifications
You must be signed in to change notification settings - Fork 645
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Moved windows workflow to Kubernetes hosted runner #18967
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please link logs of this running in the PR description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for linking https://github.com/iree-org/iree/actions/runs/11619486082/job/32359204800.
Here's what I'm looking for / seeing:
Timing
- Total time: 57m
- Time for checkout: 2m30s (good - GitHub's large hosted runners were taking 8-10 minutes sometimes)
- Installing deps: around 1 minute total (fine, low priority to optimize, way better than the 20 minutes installing Visual Studio / MSVC before)
- Build: 52 minutes
- Test: 1 minute
Other misc things
- CMake version 3.30.5: fine. Could match what we use on Linux (CMake 3.23.2: https://github.com/iree-org/base-docker-images/blob/f97cc2f710ae11184742a5f4e28b044c4c481ecd/dockerfiles/cpubuilder_ubuntu_jammy.Dockerfile#L31-L35), but we'll want to upgrade to 3.29+ soon anyways: Migrate to CMake version 3.29+ and adopt
CMAKE_LINKER_TYPE
option #18598 - May want to start bumping the setup-python action to install and test 3.12 or 3.13 (Build and publish Python 3.13 and 3.13t wheels #18652)
- Build is using MSVC as expected (sometimes it defaults to another toolchain :P)
-- The C compiler identification is MSVC 19.41.34123.0 -- The CXX compiler identification is MSVC 19.41.34123.0
- A few compiler warnings but nothing new I think: https://github.com/iree-org/iree/actions/runs/11619486082/job/32359204800#step:9:6760
- ccache reports that 98% of calls are "cacheable", so that's a good indicator that having a remote cache will work as we'd like:
+ ccache --show-stats Cacheable calls: 5826 / 5887 (98.96%) Hits: 0 / 5826 ( 0.00%) Direct: 0 Preprocessed: 0 Misses: 5826 / 5826 (100.0%) Uncacheable calls: 61 / 5887 ( 1.04%) Local storage: Cache size (GiB): 0.0 / 5.0 ( 0.00%) Remote storage: Hits: 0 / 5826 ( 0.00%) Misses: 5826 / 5826 (100.0%)
- Full list of tests ran as we'd expect:
100% tests passed, 0 tests failed out of 1502
- name: "Clean up build dir" | ||
run: Remove-Item -Path "C:\mnt\azure\build-windows" -Recurse -Force |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed? Are these runner persistent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although, I have it run at the beginning because if the workflow fails, it doesn't reach the cleanup step. Do you know how to add this to the process that runs after the workflow is completed, regardless of if it succeeds or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? I thought our Linux cluster had ephemeral runners? Wouldn't the Windows cluster follow the same convention? (All of this should go on the tracking issue, PR description, etc. - I'm reading between the lines here when it would be much easier to review the design if it was documented)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know how to add this to the process that runs after the workflow is completed, regardless of if it succeeds or not?
In the workflow file there are options: https://stackoverflow.com/questions/58858429/how-to-run-a-github-actions-step-even-if-the-previous-step-fails-while-still-f
For this sort of self-hosted runner setup though, it's usually safer to configure at the runner level, like https://github.com/iree-org/iree/blob/main/build_tools/github_actions/runner/config/hooks/post_job.sh (maybe ARC has a similar hook you can connect to)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ScottTodd These are still ephemeral runners that scale up and down based on github activity. The runners that are spun up use the same storage mount that we clean up after every run due to some storage problems with the ones that the runners are coming up with. Because we are completely clearing the storage every time, this should be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, the OS partition is ephemeral but the storage is persistent? Do multiple runners share the same storage disk then...? Can you elaborate on "some storage problems" somewhere (GitHub tracking issue would be best, but I'll settle for a PR review thread like here or one of our internal chats) so we can investigate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was initially for the nightly case, but yes, as we scale this up and move it to presubmit, we should be moving to an ephemeral storage mount which is pretty easy to do in kubernetes (in which case we remove the cleanup step as well). @Eliasj42 has more context on the issues with the storage problems, but I remember he tried an exhaustive list of things over a couple of days which is when we decided to pivot to a mount
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, but we'll be aiming for presubmit (pull_request
) or at least postsubmit (push
). Nightly already works today and is free, so this would "just" be speeding up that build and taking on some extra billing costs if we stop after just this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Postsubmit with a 1 hour build on a single runner may be reasonable to start putting mileage on this.
ae985df
to
38e56c4
Compare
Please also follow the policies at https://iree.dev/developers/general/contributing/
And general github best practices
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! This is much better now. If we have capacity, a 1 hour build might be okay to enable on presubmit. I'd like to draw the line at 20 minutes, but we do have some 20 minute build + 20 minute test workflows. Another 20 minutes total on top of that isn't too crazy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for linking https://github.com/iree-org/iree/actions/runs/11619486082/job/32359204800.
Here's what I'm looking for / seeing:
Timing
- Total time: 57m
- Time for checkout: 2m30s (good - GitHub's large hosted runners were taking 8-10 minutes sometimes)
- Installing deps: around 1 minute total (fine, low priority to optimize, way better than the 20 minutes installing Visual Studio / MSVC before)
- Build: 52 minutes
- Test: 1 minute
Other misc things
- CMake version 3.30.5: fine. Could match what we use on Linux (CMake 3.23.2: https://github.com/iree-org/base-docker-images/blob/f97cc2f710ae11184742a5f4e28b044c4c481ecd/dockerfiles/cpubuilder_ubuntu_jammy.Dockerfile#L31-L35), but we'll want to upgrade to 3.29+ soon anyways: Migrate to CMake version 3.29+ and adopt
CMAKE_LINKER_TYPE
option #18598 - May want to start bumping the setup-python action to install and test 3.12 or 3.13 (Build and publish Python 3.13 and 3.13t wheels #18652)
- Build is using MSVC as expected (sometimes it defaults to another toolchain :P)
-- The C compiler identification is MSVC 19.41.34123.0 -- The CXX compiler identification is MSVC 19.41.34123.0
- A few compiler warnings but nothing new I think: https://github.com/iree-org/iree/actions/runs/11619486082/job/32359204800#step:9:6760
- ccache reports that 98% of calls are "cacheable", so that's a good indicator that having a remote cache will work as we'd like:
+ ccache --show-stats Cacheable calls: 5826 / 5887 (98.96%) Hits: 0 / 5826 ( 0.00%) Direct: 0 Preprocessed: 0 Misses: 5826 / 5826 (100.0%) Uncacheable calls: 61 / 5887 ( 1.04%) Local storage: Cache size (GiB): 0.0 / 5.0 ( 0.00%) Remote storage: Hits: 0 / 5826 ( 0.00%) Misses: 5826 / 5826 (100.0%)
- Full list of tests ran as we'd expect:
100% tests passed, 0 tests failed out of 1502
6a1dca3
to
54844b9
Compare
Signed-off-by: Elias Joseph <[email protected]> Signed-off-by: Elias Joseph <[email protected]> Signed-off-by: Elias Joseph <[email protected]> added build dir env variables Signed-off-by: Elias Joseph <[email protected]> Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
f650492
to
3086ce6
Compare
Signed-off-by: Elias Joseph <[email protected]>
…g/iree into windows-runner-staging-2 ask git why this merge is necessary I have no idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you share the latest logs of this workflow running, and put them in the PR description?
tests/e2e/stablehlo_models/mnist_train_test/mnist_train_test.py
Outdated
Show resolved
Hide resolved
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
SCCACHE_AZURE_CONNECTION_STRING: "${{ secrets.AZURE_CCACHE_CONNECTION_STRING }}" | ||
SCCACHE_AZURE_BLOB_CONTAINER: ccache-container | ||
SCCACHE_CCACHE_ZSTD_LEVEL: 10 | ||
SCCACHE_AZURE_KEY_PREFIX: "ci_windows_x64_msvc" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI looks like we'll need some extra setup to use sccache on Windows beyond just running the build_all.sh script.
Linux uses source ./build_tools/cmake/setup_sccache.sh
. Can write a similar script for Windows or work something back into build_all.sh
.
Fine to keep this here, but it does make it seem like caching is expected to work, when I think it won't. Something to look into as part of #18557
Moved the windows MSVC workflow to a kubernetes hosted windows runner. Also moved a file location to Azure instead of GCP. workflow running successfully: https://github.com/iree-org/iree/actions/runs/11827489914/job/32955730437 Adresses iree-org#18813: runs windows workflow through a runner self hosted through kubernetes on a 32 vcpu machine. Still need to implement sccache. Job currently takes about 1 hour, may need to upgrade to 64 vcpu machine, although implementation of sccache will also save time ci-exactly: windows_x64_msvc --------- Signed-off-by: Elias Joseph <[email protected]> Co-authored-by: Elias Joseph <[email protected]>
Moved the windows MSVC workflow to a kubernetes hosted windows runner. Also moved a file location to Azure instead of GCP. workflow running successfully: https://github.com/iree-org/iree/actions/runs/11827489914/job/32955730437 Adresses iree-org#18813: runs windows workflow through a runner self hosted through kubernetes on a 32 vcpu machine. Still need to implement sccache. Job currently takes about 1 hour, may need to upgrade to 64 vcpu machine, although implementation of sccache will also save time ci-exactly: windows_x64_msvc --------- Signed-off-by: Elias Joseph <[email protected]> Co-authored-by: Elias Joseph <[email protected]> Signed-off-by: Giacomo Serafini <[email protected]>
Moved the windows MSVC workflow to a kubernetes hosted windows runner. Also moved a file location to Azure instead of GCP.
workflow running successfully: https://github.com/iree-org/iree/actions/runs/11827489914/job/32955730437
Adresses #18813: runs windows workflow through a runner self hosted through kubernetes on a 32 vcpu machine. Still need to implement sccache.
Job currently takes about 1 hour, may need to upgrade to 64 vcpu machine, although implementation of sccache will also save time
ci-exactly: windows_x64_msvc