
[CPU] Make VectorPreProcStrategy consider undefined behaviors #18146

Merged: 1 commit merged into iree-org:main from lialan/vectorizer_fix on Aug 15, 2024

Conversation

@lialan (Contributor) commented Aug 7, 2024:

The vectorization pass should not introduce extra undefined behavior.

In some cases, using the masking strategy can result in a divide-by-zero exception, because masked loads by default fill lanes with zero where the mask is false.

This patch addresses the issue by falling back to loop peeling whenever the masking strategy could introduce extra undefined behavior.
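
For illustration, here is a minimal sketch (not taken from this patch; the function name, shapes, and values are made up) of the kind of IR the masking strategy produces: the masked reads pad inactive lanes with zero, but the element-wise remainder still executes on every lane, so the padded lanes compute `x % 0`.

```mlir
// Hypothetical example, assuming an 8-wide vector covering 6 valid elements.
func.func @masked_remsi_sketch(%a: tensor<8xi32>, %b: tensor<8xi32>) -> vector<8xi32> {
  %c0 = arith.constant 0 : index
  %c6 = arith.constant 6 : index
  %pad = arith.constant 0 : i32
  // Only the first 6 lanes are active; the vectorizer pads the rest with 0.
  %mask = vector.create_mask %c6 : vector<8xi1>
  %lhs = vector.transfer_read %a[%c0], %pad, %mask : tensor<8xi32>, vector<8xi32>
  %rhs = vector.transfer_read %b[%c0], %pad, %mask : tensor<8xi32>, vector<8xi32>
  // Lanes 6 and 7 of %rhs hold the padding value 0, so those lanes compute
  // x % 0, which becomes undefined behavior once lowered to LLVM's srem.
  %res = arith.remsi %lhs, %rhs : vector<8xi32>
  return %res : vector<8xi32>
}
```

Peeling avoids this because the main loop only processes full vectors, and the leftover elements are handled by scalar code.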

@lialan force-pushed the lialan/vectorizer_fix branch 2 times, most recently from 6f48c86 to 6e5ca43 on August 8, 2024 02:35
@lialan marked this pull request as ready for review on August 8, 2024 16:46
@hanhanW (Contributor) left a review comment:

It looks okay to me. Just a few nits about the function name and the test.

@lialan force-pushed the lialan/vectorizer_fix branch 2 times, most recently from 7fe922f to 8fa6b46 on August 9, 2024 11:26

@lialan (Contributor, author) commented Aug 9, 2024:

@hanhanW I realized that the vectorization strategy is decided before we do peeling (obvious in hindsight, of course), so I created a new test file to check the decision-making part instead.

But I feel like we should also test the peeling part (that the residual values are computed correctly). Let me know what you think.

@hanhanW (Contributor) commented Aug 9, 2024:

> @hanhanW I realized that the vectorization strategy is decided before we do peeling (obvious in hindsight, of course), so I created a new test file to check the decision-making part instead.

We split the tests based on CPU arch, so there are a few select_*_lowering_strategy.mlir files in https://github.com/iree-org/iree/tree/main/compiler/src/iree/compiler/Codegen/LLVMCPU/test. Here I'd suggest putting the test in select_x86_64_lowering_strategy.mlir, because the target in the test is x86.

> But I feel like we should also test the peeling part (that the residual values are computed correctly). Let me know what you think.

There are a few pipeline tests in pipeline_peel_and_vectorize_tests.mlir. You can add a test to that file with a preset configuration.

@lialan force-pushed the lialan/vectorizer_fix branch 2 times, most recently from 8eb4337 to 237d74f on August 14, 2024 01:06

@hanhanW (Contributor) left a review comment:

LGTM, just one nit about testing.

(cc @banach-space )

@hanhanW changed the title from "Make VectorPreProcStrategy consider undefined behaviors" to "[CPU] Make VectorPreProcStrategy consider undefined behaviors" on Aug 15, 2024
@lialan force-pushed the lialan/vectorizer_fix branch from 237d74f to cb252e6 on August 15, 2024 18:05
The vectorization pass should not introduce extra undefined behavior.

In some cases, using the masking strategy can result in a divide-by-zero
exception, because masked loads by default fill lanes with zero where the
mask is false.

This patch addresses the issue by falling back to loop peeling whenever
the masking strategy could introduce extra undefined behavior.

Signed-off-by: Alan Li <[email protected]>
@lialan force-pushed the lialan/vectorizer_fix branch from cb252e6 to d8443e9 on August 15, 2024 18:12
@lialan enabled auto-merge (squash) on August 15, 2024 18:22
@lialan linked an issue on Aug 15, 2024 that may be closed by this pull request
@lialan merged commit 66ed138 into iree-org:main on Aug 15, 2024
33 of 36 checks passed
@banach-space (Collaborator) commented:
Hey, avoiding masking like this is a bit counter-intuitive to me TBH:

> In some cases, using the masking strategy can result in a divide-by-zero exception, because masked loads by default fill lanes with zero where the mask is false.

In general, this would depend on the ISA, right? I've just double-checked DIV for SVE and note that (emphasis mine):

> Signed divide active elements of the first source vector by corresponding elements of the second source vector

So, masking should be totally safe, no?

A Member commented:

This PR appears to have fixed a previously failing test, causing CI to fail: https://github.com/iree-org/iree/actions/runs/10408412024/job/28826260334#step:8:54

=================================== FAILURES ===================================
_ IREE compile and run: test_mod_uint64::model.mlir::model.mlir::cpu_llvm_sync _
[gw1] linux -- Python 3.11.9 /home/runner/work/iree/iree/venv/bin/python
[XPASS(strict)] Expected run to fail (included in 'expected_run_failures')
--- generated report log file: /tmp/iree_tests_onnx_cpu_llvm_sync_logs.json ----

Please watch CI results before merging; auto-merge will not block on that job at the moment.

@lialan (Contributor, author) replied:

Oops... Hmm, why is a failing test not blocking auto-merge? Is auto-merge generally discouraged for IREE, then?

A Member replied:

Our self-hosted GitHub Actions runners are too unreliable at the moment for auto-merge. They flake regularly and I don't want them blocking all commit traffic. Auto-merge will wait for the linux_x64_clang and linux_x64_clang_asan jobs, which build and run unit tests. It is not currently waiting for model tests, GPU tests, ONNX op tests, Android builds, etc. (anything in pkgci.yml).

A Member commented:

Weird... I saw the same test "failure" (newly passing) on https://github.com/iree-org/iree/actions/runs/10409729467/job/28830476078?pr=18223, which I think was from before this code was included... I wonder if the test is flaky? It certainly looked related to the code changes here 🤔

@lialan (Contributor, author) commented Aug 15, 2024:

> Hey, avoiding masking like this is a bit counter-intuitive to me TBH:
>
> > In some cases, using the masking strategy can result in a divide-by-zero exception, because masked loads by default fill lanes with zero where the mask is false.
>
> In general, this would depend on the ISA, right? I've just double-checked DIV for SVE and note that (emphasis mine):
>
> > Signed divide active elements of the first source vector by corresponding elements of the second source vector
>
> So, masking should be totally safe, no?

Masking is safe for SVE, yes. But generally speaking, in LLVM IR semantics:

> Division by zero is undefined behavior. For vectors, if any element of the divisor is zero, the operation has undefined behavior.

Since we are lowering to LLVM IR, less undefined behavior is better..?

lialan added a commit to lialan/iree that referenced this pull request Aug 15, 2024
caused by previous commit:
* 66ed138 - [CPU] Make VectorPreProcStrategy consider undefined behaviors (iree-org#18146)

Signed-off-by: Alan Li <[email protected]>
@banach-space (Collaborator) commented:
> Masking is safe for SVE, yes.

SVE is just one example of an ISA with masking. A similar argument would hold for any ISA with native masking.

> But generally speaking, in LLVM IR semantics:
>
> > Division by zero is undefined behavior. For vectors, if any element of the divisor is zero, the operation has undefined behavior.

Isn't the point of masking to make sure that the division doesn't happen?

I might be missing something here, but I think it would be good to compile your example to assembly to see what happens there (I am AFK, but can try SVE tomorrow). Do you have an example for an ISA without masking?

lialan added a commit that referenced this pull request Aug 15, 2024
caused by previous commit:
* 66ed138 - [CPU] Make VectorPreProcStrategy consider undefined behaviors (#18146)

Signed-off-by: Alan Li <[email protected]>
@banach-space (Collaborator) commented Aug 16, 2024:

Below is the main loop for SVE:

.LBB0_1:
  sub x15, x12, x8
  cmp x15, #4
  csel  x15, x15, x13, lo
  dup v2.2d, x15
  mov w15, w14
  mov w14, wzr
  cmhi  v3.2d, v2.2d, v0.2d
  cmhi  v2.2d, v2.2d, v1.2d
  uzp1  v2.4s, v2.4s, v3.4s
  cmpne p1.s, p0/z, z2.s, #0
  ld1w  { z2.s }, p1/z, [x9, x8, lsl #2]
  ld1w  { z3.s }, p1/z, [x10, x8, lsl #2]
  .loc  1 8 12
  movprfx z4, z2
  sdiv  z4.s, p0/m, z4.s, z3.s
  mls v2.4s, v4.4s, v3.4s
  st1w  { z2.s }, p1, [x11, x8, lsl #2]
  mov w8, #4
  .loc  1 6 10
  tbnz  w15, #0, .LBB0_1
  mov w0, wzr
  .loc  1 6 10 epilogue_begin is_stmt 0
  ldp x29, x30, [sp], #16
  ret

As expected, it looks fine. This instruction computes the predicate:

  cmpne p1.s, p0/z, z2.s, #0

Here's the example that I compiled (based on your test):

module {
  func.func @test_mod_vectorizing_strategy_peeling(%3 : tensor<6xi32>, %4: tensor<6xi32>) -> tensor<6xi32> {
    %c0 = arith.constant 0 : index
    %5 = tensor.empty() : tensor<6xi32>
    %6 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>], iterator_types = ["parallel"]} ins(%3, %4 : tensor<6xi32>, tensor<6xi32>) outs(%5 : tensor<6xi32>) {
    ^bb0(%in: i32, %in_0: i32, %out: i32):
      %7 = arith.remsi %in, %in_0 : i32
      linalg.yield %7 : i32
    } -> tensor<6xi32>
    return %6 : tensor<6xi32>
  }
}

Note that upstream IREE doesn't support scalable vectorisation of linalg.generic just yet; I am a bit behind with my PRs 😅 (I generated that using my downstream version).

Anyway, as hinted in my previous messages, this should probably be disabled for ISAs that support masking. WDYT? I can add this to my TODO list and send a PR once my other bits for vectorising linalg.generic (for SVE) are ready.

@hanhanW (Contributor) commented Aug 16, 2024:

> Isn't the point of masking to make sure that the division doesn't happen?
>
> I might be missing something here, but I think it would be good to compile your example to assembly to see what happens there (I am AFK, but can try SVE tomorrow). Do you have an example for an ISA without masking?

How does it happen in practice? I could have the wrong understanding about masking. I think the reads will eventually be lowered to masked loads/stores that produce native_vector_size vectors, and the rest of the computation ops all take those results to perform the computation. So say we have 6 elements and two tiles: the second tile loads [v_0, v_1, 0, 0] for LHS and [v_2, v_3, 0, 0] for RHS, and the divisor becomes zero in the padded lanes, which could lead to UB. The main reason is that the upstream vectorizer always uses zeros as the padding value.

In Alan's investigation, this is the case at the LLVM level:

  %28 = getelementptr i32, ptr %9, i64 %20, !dbg !81
  %29 = tail call <4 x i32> @llvm.masked.load.v4i32.p0(ptr %28, i32 4, <4 x i1> %25, <4 x i32> zeroinitializer), !dbg !81
  %30 = srem <4 x i32> %27, %29, !dbg !81

where the vectorized code actually loads zero as the divisor in the masked-off lanes, due to the mask and the zeroinitializer padding.

I don't really remember how the SVE codegen configuration works right now. IIRC, we first peel the loop and then apply scalable vectorization to the main loop? If that's the case, I think we're doing the same thing here. The main loop is vectorized without masking on the x86 path. The distinction in the IR is that one uses scalable vectors and the other does not. Correct me if I'm wrong. :)

> Anyway, as hinted in my previous messages, this should probably be disabled for ISAs that support masking. WDYT? I can add this to my TODO list and send a PR once my other bits for vectorising linalg.generic (for SVE) are ready.

Yes, we can disable it if we know the ISA supports masking. Thanks for offering to help!

@banach-space (Collaborator) commented:
Hey @hanhanW 👋🏻

> How does it happen in practice?

ISel :-) (*)

Yes, looking at your LLVM example it's quite apparent that there's potential UB. But then the backend can do its magic and leverage the ISA to get rid of that and generate the predicated sdiv instruction that I shared in my example. IIUC, that happens during ISel.

> IIRC, we first peel the loop and then apply scalable vectorization to the main loop? If that's the case, I think we're doing the same thing here.

Correct :) (well, strictly speaking we don't have scalable vectorization for linalg.generic yet, but that's a separate matter)

> The main loop is vectorized without masking on the x86 path.

Sure, but then I wonder: shouldn't every iteration in the main loop "consume" exactly vector-width elements (where vector-width matches the hardware vector-register length)? So even without masking, it would be fine?

(*) I will be OOO for a while, but will double-check before sending a PR for SVE.

@hanhanW (Contributor) commented Aug 16, 2024:

> Yes, looking at your LLVM example it's quite apparent that there's potential UB. But then the backend can do its magic and leverage the ISA to get rid of that and generate the predicated sdiv instruction that I shared in my example. IIUC, that happens during ISel.

Yeah, I'm happy that we are on the same page now. The issue is that we generate IR with potential UB, and then some backends can get rid of the UB because they have support for that.

I can't find a predicated sdiv instruction in the x86 ISA, so we end up disabling the masking strategy. But sure, we can enable it on the SVE path if the ISA has that support.

> Sure, but then I wonder: shouldn't every iteration in the main loop "consume" exactly vector-width elements (where vector-width matches the hardware vector-register length)? So even without masking, it would be fine?

Yes, that's right on the x86 path. On the x86 path (or, say, without scalable vectors), there are two strategies: one is peeling and the other is masking. The operations in the main loop are all vectorized without masking, because they have static shapes.

Things are a little different on the scalable vectorization path. The "masking" strategy that we use on the x86 path becomes "peel first, then mask the main loop". I think you will likely hit the same issue if the ISA does not support predicated sdiv.
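
For reference, here is a rough sketch (illustrative only, with assumed 6-element shapes, a 4-wide vector, and a made-up function name; not the IR IREE actually emits) of what the peeling strategy looks like: the main loop processes full, unmasked vectors, and the peeled remainder runs as scalar code, so no lane ever divides by a padded zero.

```mlir
// Illustrative only: a peeled, vectorized remsi over 6 elements.
func.func @peeled_remsi_sketch(%a: memref<6xi32>, %b: memref<6xi32>, %out: memref<6xi32>) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c4 = arith.constant 4 : index
  %c6 = arith.constant 6 : index
  %pad = arith.constant 0 : i32
  // Main loop: full vector<4xi32> iterations, no mask and no padded lanes.
  scf.for %i = %c0 to %c4 step %c4 {
    %lhs = vector.transfer_read %a[%i], %pad : memref<6xi32>, vector<4xi32>
    %rhs = vector.transfer_read %b[%i], %pad : memref<6xi32>, vector<4xi32>
    %rem = arith.remsi %lhs, %rhs : vector<4xi32>
    vector.transfer_write %rem, %out[%i] : vector<4xi32>, memref<6xi32>
  }
  // Peeled epilogue: the 2 leftover elements are handled with scalar ops.
  scf.for %i = %c4 to %c6 step %c1 {
    %l = memref.load %a[%i] : memref<6xi32>
    %r = memref.load %b[%i] : memref<6xi32>
    %s = arith.remsi %l, %r : i32
    memref.store %s, %out[%i] : memref<6xi32>
  }
  return
}
```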

Successfully merging this pull request may close these issues.

arith.remsi causes floating point exception on llvm-cpu