
[CPU] Make VectorPreProcStrategy consider undefined behaviors #18146

Merged: 1 commit merged into iree-org:main from lialan/vectorizer_fix on Aug 15, 2024

Conversation

@lialan (Contributor) commented Aug 7, 2024:

The vectorization pass should not introduce extra undefined behavior.

In some cases, using the masking strategy can result in a divide-by-zero exception, because masked loads by default fill lanes with zero where the mask is false.

This patch addresses the issue by falling back to loop peeling whenever the masking strategy could introduce extra undefined behavior.
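
For illustration, here is a minimal sketch (not taken from this patch; the function name, shapes, and values are made up) of the kind of IR the masking strategy produces: the masked reads pad inactive lanes with zero, but the element-wise remainder still executes on every lane, so the padded lanes compute `x % 0`.

```mlir
// Hypothetical example, assuming an 8-wide vector covering 6 valid elements.
func.func @masked_remsi_sketch(%a: tensor<8xi32>, %b: tensor<8xi32>) -> vector<8xi32> {
  %c0 = arith.constant 0 : index
  %c6 = arith.constant 6 : index
  %pad = arith.constant 0 : i32
  // Only the first 6 lanes are active; the vectorizer pads the rest with 0.
  %mask = vector.create_mask %c6 : vector<8xi1>
  %lhs = vector.transfer_read %a[%c0], %pad, %mask : tensor<8xi32>, vector<8xi32>
  %rhs = vector.transfer_read %b[%c0], %pad, %mask : tensor<8xi32>, vector<8xi32>
  // Lanes 6 and 7 of %rhs hold the padding value 0, so those lanes compute
  // x % 0, which becomes undefined behavior once lowered to LLVM's srem.
  %res = arith.remsi %lhs, %rhs : vector<8xi32>
  return %res : vector<8xi32>
}
```

Peeling avoids this because the main loop only processes full vectors, and the leftover elements are handled by scalar code.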

@lialan force-pushed the lialan/vectorizer_fix branch 2 times, most recently from 6f48c86 to 6e5ca43 on August 8, 2024 02:35
@lialan marked this pull request as ready for review on August 8, 2024 16:46
@hanhanW (Contributor) left a review comment:

It looks okay to me. Just a few nits about the function name and the test.

@lialan force-pushed the lialan/vectorizer_fix branch 2 times, most recently from 7fe922f to 8fa6b46 on August 9, 2024 11:26

@lialan (Contributor, author) commented Aug 9, 2024:

@hanhanW I realized that the vectorization strategy is decided before we do peeling (obvious in hindsight, of course), so I created a new test file to check the decision-making part instead.

But I feel like we should also test the peeling part (that the residual values are computed correctly). Let me know what you think.

@hanhanW (Contributor) commented Aug 9, 2024:

> @hanhanW I realized that the vectorization strategy is decided before we do peeling (obvious in hindsight, of course), so I created a new test file to check the decision-making part instead.

We split the tests based on CPU arch, so there are a few select_*_lowering_strategy.mlir files in https://github.com/iree-org/iree/tree/main/compiler/src/iree/compiler/Codegen/LLVMCPU/test. Here I'd suggest putting the test in select_x86_64_lowering_strategy.mlir, because the target in the test is x86.

> But I feel like we should also test the peeling part (that the residual values are computed correctly). Let me know what you think.

There are a few pipeline tests in pipeline_peel_and_vectorize_tests.mlir. You can add a test to that file with a preset configuration.

@lialan force-pushed the lialan/vectorizer_fix branch 2 times, most recently from 8eb4337 to 237d74f on August 14, 2024 01:06

@hanhanW (Contributor) left a review comment:

LGTM, just one nit about testing.

(cc @banach-space )

@hanhanW changed the title from "Make VectorPreProcStrategy consider undefined behaviors" to "[CPU] Make VectorPreProcStrategy consider undefined behaviors" on Aug 15, 2024
@lialan force-pushed the lialan/vectorizer_fix branch from 237d74f to cb252e6 on August 15, 2024 18:05
The vectorization pass should not introduce extra undefined behavior.

In some cases, using the masking strategy can result in a divide-by-zero
exception, because masked loads by default fill lanes with zero where the
mask is false.

This patch addresses the issue by falling back to loop peeling whenever
the masking strategy could introduce extra undefined behavior.

Signed-off-by: Alan Li <[email protected]>
@lialan force-pushed the lialan/vectorizer_fix branch from cb252e6 to d8443e9 on August 15, 2024 18:12
@lialan enabled auto-merge (squash) on August 15, 2024 18:22
@lialan linked an issue on Aug 15, 2024 that may be closed by this pull request
@lialan merged commit 66ed138 into iree-org:main on Aug 15, 2024
33 of 36 checks passed
@banach-space (Collaborator) commented:
Hey, avoiding masking like this is a bit counter-intuitive to me TBH:

> In some cases, using the masking strategy can result in a divide-by-zero exception, because masked loads by default fill lanes with zero where the mask is false.

In general, this would depend on the ISA, right? I've just double-checked DIV for SVE and note that (emphasis mine):

> Signed divide active elements of the first source vector by corresponding elements of the second source vector

So, masking should be totally safe, no?

A Member commented:

This PR appears to have fixed a previously failing test, causing CI to fail: https://github.com/iree-org/iree/actions/runs/10408412024/job/28826260334#step:8:54

=================================== FAILURES ===================================
_ IREE compile and run: test_mod_uint64::model.mlir::model.mlir::cpu_llvm_sync _
[gw1] linux -- Python 3.11.9 /home/runner/work/iree/iree/venv/bin/python
[XPASS(strict)] Expected run to fail (included in 'expected_run_failures')
--- generated report log file: /tmp/iree_tests_onnx_cpu_llvm_sync_logs.json ----

Please watch CI results before merging; auto-merge will not block on that job at the moment.

@lialan (Contributor, author) replied:

Oops... Hmm, why is a failing test not blocking auto-merge? Is auto-merge generally discouraged for IREE, then?

A Member replied:

Our self-hosted GitHub Actions runners are too unreliable at the moment for auto-merge. They flake regularly and I don't want them blocking all commit traffic. Auto-merge will wait for the linux_x64_clang and linux_x64_clang_asan jobs, which build and run unit tests. It is not currently waiting for model tests, GPU tests, ONNX op tests, Android builds, etc. (anything in pkgci.yml).

A Member commented:

Weird... I saw the same test "failure" (newly passing) on https://github.com/iree-org/iree/actions/runs/10409729467/job/28830476078?pr=18223, which I think was from before this code was included... I wonder if the test is flaky? It certainly looked related to the code changes here 🤔

@lialan (Contributor, author) commented Aug 15, 2024:

> Hey, avoiding masking like this is a bit counter-intuitive to me TBH:
>
> > In some cases, using the masking strategy can result in a divide-by-zero exception, because masked loads by default fill lanes with zero where the mask is false.
>
> In general, this would depend on the ISA, right? I've just double-checked DIV for SVE and note that (emphasis mine):
>
> > Signed divide active elements of the first source vector by corresponding elements of the second source vector
>
> So, masking should be totally safe, no?

Masking is safe for SVE, yes. But generally speaking, in LLVM IR semantics:

> Division by zero is undefined behavior. For vectors, if any element of the divisor is zero, the operation has undefined behavior.

Since we are lowering to LLVM IR, less undefined behavior is better..?

lialan added a commit to lialan/iree that referenced this pull request Aug 15, 2024
caused by previous commit:
* 66ed138 - [CPU] Make VectorPreProcStrategy consider undefined behaviors (iree-org#18146)

Signed-off-by: Alan Li <[email protected]>
@banach-space (Collaborator) commented:
> Masking is safe for SVE, yes.

SVE is just one example of an ISA with masking. A similar argument would hold for any ISA with native masking.

> But generally speaking, in LLVM IR semantics:
>
> > Division by zero is undefined behavior. For vectors, if any element of the divisor is zero, the operation has undefined behavior.

Isn't the point of masking to make sure that the division doesn't happen?

I might be missing something here, but I think it would be good to compile your example to assembly to see what happens there (I am AFK, but can try SVE tomorrow). Do you have an example for an ISA without masking?

lialan added a commit that referenced this pull request Aug 15, 2024
caused by previous commit:
* 66ed138 - [CPU] Make VectorPreProcStrategy consider undefined behaviors (#18146)

Signed-off-by: Alan Li <[email protected]>
@banach-space (Collaborator) commented Aug 16, 2024:

Below is the main loop for SVE:

.LBB0_1:
  sub x15, x12, x8
  cmp x15, #4
  csel  x15, x15, x13, lo
  dup v2.2d, x15
  mov w15, w14
  mov w14, wzr
  cmhi  v3.2d, v2.2d, v0.2d
  cmhi  v2.2d, v2.2d, v1.2d
  uzp1  v2.4s, v2.4s, v3.4s
  cmpne p1.s, p0/z, z2.s, #0
  ld1w  { z2.s }, p1/z, [x9, x8, lsl #2]
  ld1w  { z3.s }, p1/z, [x10, x8, lsl #2]
  .loc  1 8 12
  movprfx z4, z2
  sdiv  z4.s, p0/m, z4.s, z3.s
  mls v2.4s, v4.4s, v3.4s
  st1w  { z2.s }, p1, [x11, x8, lsl #2]
  mov w8, #4
  .loc  1 6 10
  tbnz  w15, #0, .LBB0_1
  mov w0, wzr
  .loc  1 6 10 epilogue_begin is_stmt 0
  ldp x29, x30, [sp], #16
  ret

As expected, it looks fine. This instruction computes the predicate:

  cmpne p1.s, p0/z, z2.s, #0

Here's the example that I compiled (based on your test):

module {
  func.func @test_mod_vectorizing_strategy_peeling(%3 : tensor<6xi32>, %4: tensor<6xi32>) -> tensor<6xi32> {
    %c0 = arith.constant 0 : index
    %5 = tensor.empty() : tensor<6xi32>
    %6 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>], iterator_types = ["parallel"]} ins(%3, %4 : tensor<6xi32>, tensor<6xi32>) outs(%5 : tensor<6xi32>) {
    ^bb0(%in: i32, %in_0: i32, %out: i32):
      %7 = arith.remsi %in, %in_0 : i32
      linalg.yield %7 : i32
    } -> tensor<6xi32>
    return %6 : tensor<6xi32>
  }
}

Note that upstream IREE doesn't support scalable vectorisation of linalg.generic just yet; I am a bit behind with my PRs 😅 (I generated that using my downstream version).

Anyway, as hinted in my previous messages, this should probably be disabled for ISAs that support masking. WDYT? I can add this to my TODO list and send a PR once my other bits for vectorising linalg.generic (for SVE) are ready.

@hanhanW (Contributor) commented Aug 16, 2024:

> Isn't the point of masking to make sure that the division doesn't happen?
>
> I might be missing something here, but I think it would be good to compile your example to assembly to see what happens there (I am AFK, but can try SVE tomorrow). Do you have an example for an ISA without masking?

How does it happen in practice? I could have the wrong understanding about masking. I think the reads will eventually be lowered to masked loads/stores that produce native_vector_size vectors, and the rest of the computation ops all take those results to perform the computation. So say we have 6 elements and two tiles: the second tile loads [v_0, v_1, 0, 0] for LHS and [v_2, v_3, 0, 0] for RHS, and the divisor becomes zero in the padded lanes, which could lead to UB. The main reason is that the upstream vectorizer always uses zeros as the padding value.

In Alan's investigation, this is the case at the LLVM level:

  %28 = getelementptr i32, ptr %9, i64 %20, !dbg !81
  %29 = tail call <4 x i32> @llvm.masked.load.v4i32.p0(ptr %28, i32 4, <4 x i1> %25, <4 x i32> zeroinitializer), !dbg !81
  %30 = srem <4 x i32> %27, %29, !dbg !81

where the vectorized code actually loads zero as the divisor in the masked-off lanes, due to the mask and the zeroinitializer padding.

I don't really remember how the SVE codegen configuration works right now. IIRC, we first peel the loop and then apply scalable vectorization to the main loop? If that's the case, I think we're doing the same thing here. The main loop is vectorized without masking on the x86 path. The distinction in the IR is that one uses scalable vectors and the other does not. Correct me if I'm wrong. :)

> Anyway, as hinted in my previous messages, this should probably be disabled for ISAs that support masking. WDYT? I can add this to my TODO list and send a PR once my other bits for vectorising linalg.generic (for SVE) are ready.

Yes, we can disable it if we know the ISA supports masking. Thanks for offering to help!

@banach-space (Collaborator) commented:
Hey @hanhanW 👋🏻

> How does it happen in practice?

ISel :-) (*)

Yes, looking at your LLVM example it's quite apparent that there's potential UB. But then the backend can do its magic and leverage the ISA to get rid of that and generate the predicated sdiv instruction that I shared in my example. IIUC, that happens during ISel.

> IIRC, we first peel the loop and then apply scalable vectorization to the main loop? If that's the case, I think we're doing the same thing here.

Correct :) (well, strictly speaking we don't have scalable vectorization for linalg.generic yet, but that's a separate matter)

> The main loop is vectorized without masking on the x86 path.

Sure, but then I wonder: shouldn't every iteration in the main loop "consume" exactly vector-width elements (where vector-width matches the hardware vector-register length)? So even without masking, it would be fine?

(*) I will be OOO for a while, but will double-check before sending a PR for SVE.

@hanhanW (Contributor) commented Aug 16, 2024:

> Yes, looking at your LLVM example it's quite apparent that there's potential UB. But then the backend can do its magic and leverage the ISA to get rid of that and generate the predicated sdiv instruction that I shared in my example. IIUC, that happens during ISel.

Yeah, I'm happy that we are on the same page now. The issue is that we generate IR with potential UB, and then some backends can get rid of the UB because they have support for that.

I can't find a predicated sdiv instruction in the x86 ISA, so we end up disabling the masking strategy. But sure, we can enable it on the SVE path if the ISA has that support.

> Sure, but then I wonder: shouldn't every iteration in the main loop "consume" exactly vector-width elements (where vector-width matches the hardware vector-register length)? So even without masking, it would be fine?

Yes, that's right on the x86 path. On the x86 path (or, say, without scalable vectors), there are two strategies: one is peeling and the other is masking. The operations in the main loop are all vectorized without masking, because they have static shapes.

Things are a little different on the scalable vectorization path. The "masking" strategy that we use on the x86 path becomes "peel first, then mask the main loop". I think you will likely hit the same issue if the ISA does not support predicated sdiv.
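
For reference, here is a rough sketch (illustrative only, with assumed 6-element shapes, a 4-wide vector, and a made-up function name; not the IR IREE actually emits) of what the peeling strategy looks like: the main loop processes full, unmasked vectors, and the peeled remainder runs as scalar code, so no lane ever divides by a padded zero.

```mlir
// Illustrative only: a peeled, vectorized remsi over 6 elements.
func.func @peeled_remsi_sketch(%a: memref<6xi32>, %b: memref<6xi32>, %out: memref<6xi32>) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c4 = arith.constant 4 : index
  %c6 = arith.constant 6 : index
  %pad = arith.constant 0 : i32
  // Main loop: full vector<4xi32> iterations, no mask and no padded lanes.
  scf.for %i = %c0 to %c4 step %c4 {
    %lhs = vector.transfer_read %a[%i], %pad : memref<6xi32>, vector<4xi32>
    %rhs = vector.transfer_read %b[%i], %pad : memref<6xi32>, vector<4xi32>
    %rem = arith.remsi %lhs, %rhs : vector<4xi32>
    vector.transfer_write %rem, %out[%i] : vector<4xi32>, memref<6xi32>
  }
  // Peeled epilogue: the 2 leftover elements are handled with scalar ops.
  scf.for %i = %c4 to %c6 step %c1 {
    %l = memref.load %a[%i] : memref<6xi32>
    %r = memref.load %b[%i] : memref<6xi32>
    %s = arith.remsi %l, %r : i32
    memref.store %s, %out[%i] : memref<6xi32>
  }
  return
}
```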

Successfully merging this pull request may close these issues.

arith.remsi causes floating point exception on llvm-cpu