[CPU] Make VectorPreProcStrategy consider undefined behaviors #18146
Conversation
It looks okay to me. Just a few nits about the function name and the test.
@hanhanW I realized deciding the vectorization strategy actually happens before we do it, but I feel like we should also test the peeling part (that the computation of residual values is correct). Let me know what you think.

We split the tests based on CPU arch. There are a few pipeline tests in pipeline_peel_and_vectorize_tests.mlir; you can add a test to that file with a preset configuration.
LGTM, just one nit about testing.
(cc @banach-space)
The vectorization pass should not introduce extra undefined behavior. In some cases, using the masking strategy can result in a divide-by-zero exception, because the masking strategy by default loads zero values where the mask is false. This patch addresses the issue by falling back to loop peeling when the masking strategy could introduce extra undefined behavior.

Signed-off-by: Alan Li <[email protected]>
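To make the failure mode concrete, here is a minimal MLIR sketch (assuming a 6-element remainder op tiled with a 4-wide vector; %a and %b are hypothetical buffers) of the masked second tile:

```mlir
// Hypothetical sketch: second tile of 6 elements with a 4-wide vector.
// Only lanes 0-1 are active; inactive lanes get the pass-through value 0.
%c2   = arith.constant 2 : index
%c4   = arith.constant 4 : index
%pad  = arith.constant dense<0> : vector<4xi32>
%mask = vector.create_mask %c2 : vector<4xi1>
%lhs  = vector.maskedload %a[%c4], %mask, %pad
        : memref<6xi32>, vector<4xi1>, vector<4xi32> into vector<4xi32>
%rhs  = vector.maskedload %b[%c4], %mask, %pad
        : memref<6xi32>, vector<4xi1>, vector<4xi32> into vector<4xi32>
// Lanes 2-3 of %rhs are the pass-through 0, so this lowers to an LLVM srem
// with zero divisors in those lanes, which is immediate UB.
%rem  = arith.remsi %lhs, %rhs : vector<4xi32>
```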
Hey, avoiding masking like this is a bit counter-intuitive to me TBH:

In general, this would depend on the ISA, right? I've just double-checked DIV for SVE and note that (extra emphasis from me):

So, masking should be totally safe, no?
This PR appears to have fixed a previously failing test, causing CI to fail: https://github.com/iree-org/iree/actions/runs/10408412024/job/28826260334#step:8:54

```
=================================== FAILURES ===================================
_ IREE compile and run: test_mod_uint64::model.mlir::model.mlir::cpu_llvm_sync _
[gw1] linux -- Python 3.11.9 /home/runner/work/iree/iree/venv/bin/python
[XPASS(strict)] Expected run to fail (included in 'expected_run_failures')
--- generated report log file: /tmp/iree_tests_onnx_cpu_llvm_sync_logs.json ----
```
Please watch CI results before merging; auto-merge will not block on that job at the moment.
Oops... Hmm, why is a failing test not blocking auto-merge? Is it generally discouraged to use auto-merge for IREE?
Our self-hosted GitHub Actions runners are too unreliable at the moment for auto-merge. They flake regularly and I don't want them blocking all commit traffic. Auto-merge will wait for the `linux_x64_clang` and `linux_x64_clang_asan` jobs, which build and run unit tests. It is not currently waiting for model tests, GPU tests, ONNX op tests, Android builds, etc. (anything in `pkgci.yml`).
Weird... I saw the same test "failure" (newly passing) on https://github.com/iree-org/iree/actions/runs/10409729467/job/28830476078?pr=18223, which I think was before including this code... I wonder if the test is flaky? It certainly looked related to the code changes here 🤔
Masking is safe for SVE, yes, but generally speaking, in LLVM IR semantics, integer division or remainder by zero is undefined behavior. Since we are lowering to LLVM IR, less undefined behavior is better..?
caused by previous commit: * 66ed138 - [CPU] Make VectorPreProcStrategy consider undefined behaviors (iree-org#18146) Signed-off-by: Alan Li <[email protected]>
SVE is just a specific example of an ISA with masking. A similar argument would hold for any ISA with native masking.

Isn't the point of masking to make sure that the division doesn't happen? I might be missing something here, but I think it would be good to try compiling your example to assembly to see what happens there (I am AFK, but can try SVE tomorrow). Do you have an example of an ISA without masking?
Below is the main loop for SVE:

```asm
.LBB0_1:
sub x15, x12, x8
cmp x15, #4
csel x15, x15, x13, lo
dup v2.2d, x15
mov w15, w14
mov w14, wzr
cmhi v3.2d, v2.2d, v0.2d
cmhi v2.2d, v2.2d, v1.2d
uzp1 v2.4s, v2.4s, v3.4s
cmpne p1.s, p0/z, z2.s, #0
ld1w { z2.s }, p1/z, [x9, x8, lsl #2]
ld1w { z3.s }, p1/z, [x10, x8, lsl #2]
.loc 1 8 12
movprfx z4, z2
sdiv z4.s, p0/m, z4.s, z3.s
mls v2.4s, v4.4s, v3.4s
st1w { z2.s }, p1, [x11, x8, lsl #2]
mov w8, #4
.loc 1 6 10
tbnz w15, #0, .LBB0_1
mov w0, wzr
.loc 1 6 10 epilogue_begin is_stmt 0
ldp x29, x30, [sp], #16
ret
```

As expected, it looks fine. This instruction computes the predicate:

```asm
cmpne p1.s, p0/z, z2.s, #0
```

Here's the example that I compiled (based on your test):

```mlir
module {
  func.func @test_mod_vectorizing_strategy_peeling(%3 : tensor<6xi32>, %4: tensor<6xi32>) -> tensor<6xi32> {
    %c0 = arith.constant 0 : index
    %5 = tensor.empty() : tensor<6xi32>
    %6 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>], iterator_types = ["parallel"]} ins(%3, %4 : tensor<6xi32>, tensor<6xi32>) outs(%5 : tensor<6xi32>) {
    ^bb0(%in: i32, %in_0: i32, %out: i32):
      %7 = arith.remsi %in, %in_0 : i32
      linalg.yield %7 : i32
    } -> tensor<6xi32>
    return %6 : tensor<6xi32>
  }
}
```

Note, upstream IREE doesn't support scalable vectorisation of

Anyway, as hinted in my previous messages, this should probably be disabled for ISAs that support masking. WDYT? I can add this to my TODO list and send a PR once my other bits for vectorising
How does it happen in practice? I could have a wrong understanding about masking. I think they will eventually be lowered to masked loads/stores, and produce native_vector_size vectors. The rest of the computation ops all take the results to perform the computation. So say that we have 6 elements and two tiles, the second tile loads

In Alan's investigation, this is the case at the LLVM level:

```llvm
%28 = getelementptr i32, ptr %9, i64 %20, !dbg !81
%29 = tail call <4 x i32> @llvm.masked.load.v4i32.p0(ptr %28, i32 4, <4 x i1> %25, <4 x i32> zeroinitializer), !dbg !81
%30 = srem <4 x i32> %27, %29, !dbg !81
```

where vectorization actually loads zero as the divisor, due to the mask being false and the zeroinitializer. I don't really remember how the SVE codegen config works right now. IIRC, we first peel the loop, and then apply scalable vectorization on the main loop? If this is the case, I think we're doing the same thing here. The main loop is vectorized without masking on the x86 path. The distinction in the IR is that one has scalable vectors and the other does not. Correct me if I'm wrong. :)
Yes, we can disable it if we know the ISA supports masking. Thanks for offering the help!
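For reference, a minimal sketch (this is not what the patch does; %lhs, %rhs, and %mask are assumed to come from masked loads like the ones sketched earlier) of how the zero pass-through divisor could instead be neutralized at the vector level:

```mlir
// Hypothetical alternative to peeling: swap the zero-filled divisors in the
// inactive lanes for 1 before dividing, so the lowered srem never sees a zero.
// The results in those lanes are discarded by the masked store anyway.
%one  = arith.constant dense<1> : vector<4xi32>
%safe = arith.select %mask, %rhs, %one : vector<4xi1>, vector<4xi32>
%rem  = arith.remsi %lhs, %safe : vector<4xi32>
```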
Hey @hanhanW 👋🏻
ISel :-) (*)

Yes, looking at your LLVM example it's quite apparent that there's potential UB. But then the backend can do its magic and leverage the ISA to get rid of that and generate the predicated sdiv.
Correct :) (well, strictly speaking we don't have scalable vectorization for
Sure, but then I wonder: shouldn't every iteration in the main loop "consume" exactly vector-width elements (where vector-width matches the hardware vector-register length)? So even without masking, it would be fine?

(*) I will be OOO for a while, but will double-check before sending a PR for SVE.
Yeah, I'm happy that we are on the same page now. The issue is that we generate potential-UB IR, and then some backends can get rid of the UB because they have such support. I can't find a predicated sdiv
Yes, that's right on the x86 path. On the x86 path (or say, without scalable vectors), there are two strategies: one is peeling and the other is masking. The operations in the main loop are all vectorized without masking, because they have static shapes. Things are a little different on the scalable vectorization path: the "masking" strategy that we use on the x86 path becomes "peel first, then mask the main loop". I think you will likely hit the same issue if the ISA does not support a predicated sdiv.
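For contrast, a minimal sketch of what the peeling fallback produces for the 6-element example (hand-written; the function and buffer names are hypothetical):

```mlir
func.func @rem_peeled(%a: memref<6xi32>, %b: memref<6xi32>, %out: memref<6xi32>) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c4 = arith.constant 4 : index
  %c6 = arith.constant 6 : index
  // Main loop: only full, unmasked 4-wide iterations, so no lane ever holds
  // a pass-through zero divisor.
  scf.for %i = %c0 to %c4 step %c4 {
    %l = vector.load %a[%i] : memref<6xi32>, vector<4xi32>
    %r = vector.load %b[%i] : memref<6xi32>, vector<4xi32>
    %m = arith.remsi %l, %r : vector<4xi32>
    vector.store %m, %out[%i] : memref<6xi32>, vector<4xi32>
  }
  // Peeled epilogue: the 2 leftover elements are handled as scalars.
  scf.for %i = %c4 to %c6 step %c1 {
    %l = memref.load %a[%i] : memref<6xi32>
    %r = memref.load %b[%i] : memref<6xi32>
    %m = arith.remsi %l, %r : i32
    memref.store %m, %out[%i] : memref<6xi32>
  }
  return
}
```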