
DNS - Data-tiling prototype for targeting multi-device. #18738

Draft · wants to merge 25 commits into main from hanhan-encoding-interface-prototype-v2

Conversation

@hanhanW (Contributor) commented Oct 10, 2024

This branch demonstrates how data-tiling and heterogeneous computing work together in IREE.

Design Doc: https://hackmd.io/@hwPnnvLBTB-JGVMeh-bCEA/Sy9nvDhb1e

IR dump: https://gist.github.com/hanhanW/5029dc652aec1379102e43e702aaf15b

```mlir
// Zen4 CPU
#executable_target_embedded_elf_x86_64_with_encoding_solver = #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64",
  {cpu = "znver4", cpu_features = "+mmx,+popcnt,+sse,+sse2,+sse3,+ssse3,+sse4.1,+sse4.2,+avx,+avx2,+sse4a,+fma,+avx512f,+bmi,+bmi2,+aes,+pclmul,+avx512vl,+avx512bw,+avx512dq,+avx512cd,+avx512vbmi,+avx512ifma,+avx512vpopcntdq,+avx512vbmi2,+gfni,+vpclmulqdq,+avx512vnni,+avx512bitalg,+avx512bf16,+adx,+clflushopt,+clwb,+clzero,+cx16,+cx8,+f16c,+fsgsbase,+crc32,+invpcid,+rdpru,+sahf,+lzcnt,+movbe,+mwaitx,+x87,+pku,+evex512,+prfchw,+rdpid,+rdrnd,+rdseed,+sha,+shstk,+vaes,+wbnoinvd,+xsave,+xsavec,+xsaveopt,+xsaves,+fxsr",
   data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128",
   native_vector_size = 64 : i64,
   target_triple = "x86_64-unknown-unknown-eabi-elf",
   encoding_solver = #iree_cpu.cpu_encoding_solver<>
   }>

// VMVX with ukernels enabled.
#executable_target_vmvx_bytecode_fb = #hal.executable.target<"vmvx", "vmvx-bytecode-fb", {encoding_solver = #iree_cpu.vmvx_encoding_solver<>, ukernels = "all"}>

util.global private @device_a = #hal.device.target<"local", {ordinal = 0 : index}, [
  #executable_target_embedded_elf_x86_64_with_encoding_solver
]> : !hal.device
util.global private @device_b = #hal.device.target<"local", {ordinal = 1 : index}, [
  #executable_target_vmvx_bytecode_fb
]> : !hal.device

func.func @foo(
  %lhs: tensor<?x?xf32> {iree.abi.affinity = #hal.device.affinity<@device_a>},
  %rhs: tensor<?x?xf32> {iree.abi.affinity = #hal.device.affinity<@device_a>}) -> (tensor<?x?xf32> {iree.abi.affinity = #hal.device.affinity<@device_a>}) {

  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %M = tensor.dim %lhs, %c0 : tensor<?x?xf32>
  %K = tensor.dim %lhs, %c1 : tensor<?x?xf32>
  %N = tensor.dim %rhs, %c1 : tensor<?x?xf32>
  %cst = arith.constant 0.0 : f32
  %init = tensor.empty(%M, %N) : tensor<?x?xf32>
  %fill = linalg.fill ins(%cst : f32) outs(%init : tensor<?x?xf32>) -> tensor<?x?xf32>
  %op = linalg.matmul
      ins(%lhs, %rhs : tensor<?x?xf32>, tensor<?x?xf32>)
      outs(%fill : tensor<?x?xf32>) -> tensor<?x?xf32>
  // Execute matmul on device_a and transfer the result to device_b
  %transient_op = flow.tensor.transfer %op : tensor<?x?xf32>{%M, %N} to #hal.device.affinity<@device_b>

  // Transfer input data to device_b
  %lhsb = flow.tensor.transfer %lhs : tensor<?x?xf32>{%M, %K} to #hal.device.affinity<@device_b>
  %rhsb = flow.tensor.transfer %rhs : tensor<?x?xf32>{%K, %N} to #hal.device.affinity<@device_b>
  %initb = tensor.empty(%M, %N) : tensor<?x?xf32>
  %fillb = linalg.fill ins(%cst : f32) outs(%initb : tensor<?x?xf32>) -> tensor<?x?xf32>
  // Execute matmul on device_b and accumulate the result and the result from device_a.
  %opb = linalg.matmul
      ins(%lhsb, %rhsb : tensor<?x?xf32>, tensor<?x?xf32>)
      outs(%fillb : tensor<?x?xf32>) -> tensor<?x?xf32>
  %add = arith.addf %transient_op, %opb : tensor<?x?xf32>

  // Transfer the result from device_b -> device_a.
  %result_a = flow.tensor.transfer %add : tensor<?x?xf32>{%M, %N} to #hal.device.affinity<@device_a>

  // Return the result on device_a.
  func.return %result_a : tensor<?x?xf32>
}
```
```shell
# Compilation
iree-compile --iree-execution-model=async-external ~/matmul.mlir -o /tmp/z.vmfb --iree-global-opt-enable-early-materialization=false

# Execution
iree-run-module --module=/tmp/z.vmfb --function=foo --input=2x3xf32=1,2,3,4,5,6 --input=3x5xf32=1 --device=local-task --device=local-task

# EXEC @foo
# result[0]: hal.buffer_view
# 2x5xf32=[12 12 12 12 12][30 30 30 30 30]
```
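The result is twice the plain 2x3 × 3x5 product because the same matmul runs on both device_a and device_b and the two results are accumulated by the arith.addf. As a sanity check (not part of this PR), the expected output can be reproduced with NumPy, where the second --input is a splat that fills the whole 3x5 RHS with 1.0:

```python
import numpy as np

# lhs comes from --input=2x3xf32=1,2,3,4,5,6; rhs from the splat --input=3x5xf32=1.
lhs = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32)
rhs = np.ones((3, 5), dtype=np.float32)

result_a = lhs @ rhs        # matmul executed on device_a
result_b = lhs @ rhs        # the same matmul executed on device_b
print(result_a + result_b)  # accumulated result: rows of 12 and 30
```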

@ScottTodd (Member) commented:

@hanhanW do you want to keep this PR open with the "check bazel deps" title? Seems like you are pushing to this branch fairly regularly.

@hanhanW changed the title from "DNS - check bazel deps" to "DNS - Data-tiling prototype for targeting multi-device." on Nov 6, 2024
@hanhanW (Contributor, Author) commented Nov 6, 2024

> @hanhanW do you want to keep this PR open with the "check bazel deps" title? Seems like you are pushing to this branch fairly regularly.

I forgot this PR was open while I was updating my branch, so the title never got updated. :/ Sorry about that; I've updated the title now.

This is the third take. It introduces a "cloneWithConfig" interface method to solve the duplicated-config issue.

Signed-off-by: hanhanW <[email protected]>
The prototype is not the final state; ideally we should introduce another attribute interface to handle materializations.

The current implementation works only if the encoding target is the same as the execution target.

Signed-off-by: hanhanW <[email protected]>
Signed-off-by: hanhanW <[email protected]>
The boundary operations (i.e., bindings, flow.tensor.load/store, etc.) have an EncodingSolver attached, which defines the layout for the inputs/outputs.

The encodings on the operations (e.g., compute ops) capture all the original encoding fields, and we know that they will be executed on the execution device, so we should be able to insert ops that convert a tensor from layout_a to layout_b.

This adds the indexing maps back; we can revisit that later.

Signed-off-by: hanhanW <[email protected]>
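To make the idea above concrete, here is a minimal, purely illustrative Python sketch (assuming hypothetical solvers and tile sizes; none of these names are IREE APIs): each device's encoding solver resolves an abstract encoding to a concrete layout, and a relayout step is needed whenever an encoded tensor moves between devices whose resolved layouts differ.

```python
# Purely illustrative model of the layout-resolution idea; the solver names,
# tile sizes, and op names below are hypothetical, not real IREE APIs.
from dataclasses import dataclass


@dataclass(frozen=True)
class Encoding:
    op: str            # e.g. "matmul"
    operand: str       # "lhs", "rhs", or "result"
    element_type: str  # e.g. "f32"


def cpu_solver(enc: Encoding) -> tuple[int, int]:
    # A CPU-flavored solver might pick larger, SIMD-friendly inner tiles.
    return (16, 16)


def vmvx_solver(enc: Encoding) -> tuple[int, int]:
    # A VMVX/ukernel-flavored solver might pick smaller tiles.
    return (8, 8)


def transfer(enc: Encoding, src_solver, dst_solver) -> list[str]:
    """Conceptual steps needed to move an encoded tensor between devices."""
    src, dst = src_solver(enc), dst_solver(enc)
    if src == dst:
        return ["transfer"]
    # Resolved layouts differ: undo the source layout, transfer, then re-tile
    # for the destination layout.
    return [f"unpack{src}", "transfer", f"pack{dst}"]


print(transfer(Encoding("matmul", "result", "f32"), cpu_solver, vmvx_solver))
# ['unpack(16, 16)', 'transfer', 'pack(8, 8)']
```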
Note that this should not be an issue once we land it on the main branch, because the refactoring will happen and both paths will use the same code.

Signed-off-by: hanhanW <[email protected]>