[DT] Plans for the buffer allocation in data-tiling #17924
@benvanik I implemented the AffinityAnalysisDialectInterface interface and a pass that attaches the list of executables to the encodings. I'm going to look at SpecializeEncodingsPass tomorrow. Just in case I misunderstood our discussion, could you skim through the IR or the implementation when you're available? The implementation only modifies the encodings in stream.tensor.sizeof ops. Snippet of the IR dump:

// Before the pass
%10 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2],
round_dims_to = array<i64: 32, 32, 32>>>{%0, %1} : index
// After the pass
%10 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2],
round_dims_to = array<i64: 32, 32, 32>, targets = [#executable_target_vmvx_bytecode_fb]>>{%0, %1} : index
Looks promising! Some initial notes:
Thanks for the note, it is helpful!
I see, makes sense!
I found that ...
the trait would allow for filtering to just those ops that may have tensors we want to change, instead of all ops in the program - so your update code, instead of isa<TensorSizeOfOp>, would be hasTrait<TensorPhaseOp>, then walk the op and change any type attrs
SG! I'm using the tensor.sizeof op for the prototype now; I will switch it to TensorPhaseOp later. (I'd like to see at least one e2e workflow working.) But this is exactly what I'm looking for! In my prototype, I filter the ops with isa<TensorSizeOfOp>.
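For reference, a minimal C++ sketch of that trait-based walk, assuming a trait spelled roughly IREE::Stream::OpTrait::TensorPhaseOp and a hypothetical addTargetsToEncoding helper (neither name is taken from the actual code): filter to ops carrying the tensor-phase trait, then rebuild any type attribute whose tensor type has an encoding.

```cpp
// Sketch only: the trait spelling and the helper below are assumptions.
#include "llvm/ADT/SmallVector.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/IR/BuiltinTypes.h"

using namespace mlir;
using namespace mlir::iree_compiler;

// Hypothetical hook that attaches the resolved executable targets to an
// encoding attribute (the real data comes from the affinity analysis).
Attribute addTargetsToEncoding(Attribute encoding, ArrayAttr targets);

void updateEncodingsInTypeAttrs(ModuleOp moduleOp, ArrayAttr targets) {
  moduleOp.walk([&](Operation *op) {
    // Filter by the tensor-phase trait instead of isa<TensorSizeOfOp>, so
    // every op that may carry encoded tensor types is covered.
    if (!op->hasTrait<IREE::Stream::OpTrait::TensorPhaseOp>())
      return;
    for (NamedAttribute namedAttr : llvm::to_vector(op->getAttrs())) {
      auto typeAttr = dyn_cast<TypeAttr>(namedAttr.getValue());
      if (!typeAttr)
        continue;
      auto tensorType = dyn_cast<RankedTensorType>(typeAttr.getValue());
      if (!tensorType || !tensorType.getEncoding())
        continue;
      // Rebuild the tensor type with the updated encoding in place.
      auto newType = RankedTensorType::get(
          tensorType.getShape(), tensorType.getElementType(),
          addTargetsToEncoding(tensorType.getEncoding(), targets));
      op->setAttr(namedAttr.getName(), TypeAttr::get(newType));
    }
  });
}
```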
I made some progress and got stuck in specialization. The issues are mostly about how we gather affinities, clone dispatches, and update encodings, especially for the multi-device concept. I was going to ping @benvanik on discord, then I realized that it is Friday afternoon! So I'm leaving messages here, and hopefully we can walk through an example next Monday.

Progress update and potential issue in EncodingSolver

I moved the Stream dialect interface from analysis/ to IR/ and verified that there are no dependency issues. I finished the backend encoding solver prototype (using VMVX) and found that there is a duplication issue when we create the solver. The difficulty is that the solver needs to access the target config (like cpu_features, iree_gpu configurations, etc.). We can either (a) pass the dictionary through an interface method (i.e., calculateStorageElementCountInBytes) or (b) store the information in a parameter (like the snippet below). The issue with (a) is that we need to hold the dictionary somewhere until we resolve all the encoding information; it will make the EncodingAttr's ... The issue with (b) is that we duplicate the config in the IR: one copy is in the solver, and the other is in ExecutableTargets.
Both solutions look bad to me. I think we need (c): let ExecutableTargetAttr inherit from an Encoding attribute interface. It is something similar to what we discussed in the note:
In the getExecutableTarget methods, we can populate the attribute and store it to the ... The prototype goes with approach (b) for now. It does not matter which one is implemented in the prototype; I'm not worried about it because it is solvable. I just need some input about which path I should go down. I like (c) better, what do you think?

Specialization Issue

This part is hard to work out without an example. I'm at a state where I can produce the required encoding attributes, so I'd like to look at the IR details together with @benvanik. My first step is creating the input IR and studying the multi-device concept. The writeup is good, btw. I learned that a device could refer to a list of available devices: an AffinityAttr indicates a device (which can be a list, and the device is selected from the list). My understanding was wrong because I thought that it includes all the devices. I'm inlining the note below, and I need @benvanik to help me unpack more context from it. I don't understand the meaning of the terminology ...
The snippet below is inlined from the output of the current MakeEncodingSolvable pass. Let's take the following as an example:

#executable_target_embedded_elf_x86_64_ = #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {cpu = "generic", cpu_features = "", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 16 : i64, target_triple = "x86_64-unknown-unknown-eabi-elf"}>
#executable_target_vmvx_bytecode_fb = #hal.executable.target<"vmvx", "vmvx-bytecode-fb", {ukernels = "none"}>
#map = affine_map<(d0, d1, d2) -> (d0, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d2, d1)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
#device_target_local = #hal.device.target<"local", [#executable_target_embedded_elf_x86_64_]> : !hal.device
#device_target_local1 = #hal.device.target<"local", [#executable_target_vmvx_bytecode_fb]> : !hal.device
module attributes {stream.affinity.default = #hal.device.affinity<@__device_0>} {
util.global private @__device_0 = #hal.device.select<[#device_target_local, #device_target_local1]> : !hal.device
stream.executable private @foo_dispatch_0 {
stream.executable.export public @foo_dispatch_0_set_encoding_LHS_DxD workgroups(%arg0: index, %arg1: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_slice %arg0, %arg1
stream.return %x, %y, %z : index, index, index
}
builtin.module {
func.func @foo_dispatch_0_set_encoding_LHS_DxD(%arg0: !stream.binding, %arg1: index, %arg2: index, %arg3: !stream.binding) {
// ...
return
}
}
// ...
util.func public @foo(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub, iree.reflection = {iree.abi.declaration = "sync func @foo(%input0: tensor<?x?xf32>, %input1: tensor<?x?xf32>) -> (%output0: tensor<?x?xf32>)"}} {
%c0 = arith.constant 0 : index
%0 = hal.buffer_view.dim<%arg0 : !hal.buffer_view>[0] : index
%1 = hal.buffer_view.dim<%arg0 : !hal.buffer_view>[1] : index
%element_type_f32 = hal.element_type<f32> : i32
%dense_row_major = hal.encoding_type<dense_row_major> : i32
hal.buffer_view.assert<%arg0 : !hal.buffer_view> message("input0") shape([%0, %1]) type(%element_type_f32) encoding(%dense_row_major)
%2 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32>{%0, %1} : index
%3 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg0 : !hal.buffer_view -> tensor<?x?xf32>{%0, %1} in !stream.resource<external>{%2}
%4 = stream.async.transfer %3 : !stream.resource<external>{%2} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%2}
%5 = hal.buffer_view.dim<%arg1 : !hal.buffer_view>[0] : index
%6 = hal.buffer_view.dim<%arg1 : !hal.buffer_view>[1] : index
hal.buffer_view.assert<%arg1 : !hal.buffer_view> message("input1") shape([%5, %6]) type(%element_type_f32) encoding(%dense_row_major)
%7 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32>{%5, %6} : index
%8 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg1 : !hal.buffer_view -> tensor<?x?xf32>{%5, %6} in !stream.resource<external>{%7}
%9 = stream.async.transfer %8 : !stream.resource<external>{%7} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%7}
%10 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>{%0, %1} : index
%11 = stream.async.dispatch on(#hal.device.affinity<@__device_0>) @foo_dispatch_0::@foo_dispatch_0_set_encoding_LHS_DxD[%0, %1](%4[%c0 to %2 for %2], %0, %1) : (!stream.resource<*>{%2}, index, index) -> !stream.resource<*>{%10}
%12 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32, #iree_encoding.encoding<operand_index = 1 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>{%5, %6} : index
%13 = stream.async.dispatch on(#hal.device.affinity<@__device_0>) @foo_dispatch_1::@foo_dispatch_1_set_encoding_RHS_DxD[%5, %6](%9[%c0 to %7 for %7], %5, %6) : (!stream.resource<*>{%7}, index, index) -> !stream.resource<*>{%12}
%14 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32, #iree_encoding.encoding<operand_index = 2 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>{%0, %5} : index
%15 = stream.async.dispatch on(#hal.device.affinity<@__device_0>) @foo_dispatch_2::@foo_dispatch_2_matmul_DxDxD_f32[%1, %6, %0, %5](%11[%c0 to %10 for %10], %13[%c0 to %12 for %12], %1, %6, %0, %5) : (!stream.resource<*>{%10}, !stream.resource<*>{%12}, index, index, index, index) -> !stream.resource<*>{%14}
%16 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32>{%0, %5} : index
%17 = stream.async.dispatch on(#hal.device.affinity<@__device_0>) @foo_dispatch_3::@foo_dispatch_3_unset_encoding_RESULT_DxD[%0, %5](%15[%c0 to %14 for %14], %0, %5) : (!stream.resource<*>{%14}, index, index) -> !stream.resource<*>{%16}
%18 = stream.async.transfer %17 : !stream.resource<*>{%16} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<external>{%16}
%19 = stream.tensor.export on(#hal.device.affinity<@__device_0>) %18 : tensor<?x?xf32>{%0, %5} in !stream.resource<external>{%16} -> !hal.buffer_view
util.return %19 : !hal.buffer_view
}
}

What does "dispatch site" mean? There are ops like:

%11 = stream.async.dispatch
on(#hal.device.affinity<@__device_0>)
@foo_dispatch_0::@foo_dispatch_0_set_encoding_LHS_DxD
[%0, %1](%4[%c0 to %2 for %2], %0, %1)
: (!stream.resource<*>{%2}, index, index) -> !stream.resource<*>{%10}

What do we do when we duplicate an executable? Does it mean that we are cloning more functions? E.g., that

stream.executable private @foo_dispatch_0 {
stream.executable.export public @foo_dispatch_0_set_encoding_LHS_DxD workgroups(%arg0: index, %arg1: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_slice %arg0, %arg1
stream.return %x, %y, %z : index, index, index
}
builtin.module {
func.func @foo_dispatch_0_set_encoding_LHS_DxD(%arg0: !stream.binding, %arg1: index, %arg2: index, %arg3: !stream.binding) {
// ...
return
}

becomes

stream.executable private @foo_dispatch_0 {
stream.executable.export public @foo_dispatch_0_set_encoding_LHS_DxD workgroups(%arg0: index, %arg1: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_slice %arg0, %arg1
stream.return %x, %y, %z : index, index, index
}
builtin.module {
// The device name can be a suffix; I make it a prefix for readability.
func.func @llvmcpu_foo_dispatch_0_set_encoding_LHS_DxD {
// the target field in the encoding becomes [#executable_target_embedded_elf_x86_64_ ]
}
func.func @vmvx_foo_dispatch_0_set_encoding_LHS_DxD {
// the target field in the encoding becomes [#executable_target_vmvx_bytecode_fb ]
}
func.func @foo_dispatch_0_set_encoding_LHS_DxD(%arg0: !stream.binding, %arg1: index, %arg2: index, %arg3: !stream.binding) {
// ...
return
}

Is that correct? If so, what should the main function look like? It was a single ... Also, how do I get the "execution affinity"? I assume that it means the actual device that we're going to run on; is that correct?

%11 = stream.async.dispatch
on(#hal.device.affinity<@__device_0>)
@foo_dispatch_0::@foo_dispatch_0_set_encoding_LHS_DxD
[%0, %1](%4[%c0 to %2 for %2], %0, %1)
: (!stream.resource<*>{%2}, index, index) -> !stream.resource<*>{%10}

It will be very helpful if we can look at the IR and modify a few of them manually sometime next week!
Let's chat next week, but I'm confused about what a solver is and why it needs anything at all on it. We shouldn't have any duplication. The solver is just a way to reference a function, essentially, and doesn't need any information of its own (unless there is encoding-specific information).
(epic progress, though! it's really coming together :)
I have a prototype that addresses the dup config issue. One of the challenges is that the attribute is not mutable, so we cannot update a field once we create it. The other challenge is that the interface can't have parameters (which is fair). So my solution is to declare an interface method to get the config. The prototype wraps the whole dictionary config into "encoding". In HAL::ExecutableTargetAttr, I renamed the ... Without this commit, the IR is:

#executable_target_vmvx_bytecode_fb =
#hal.executable.target<
"vmvx",
"vmvx-bytecode-fb",
{encoding_solver = #iree_cpu.vmvx_encoding_solver<target_configuration = {ukernels = "none"}>, {ukernels = "none"} }
>

With the commit, the IR is:

#executable_target_vmvx_bytecode_fb =
#hal.executable.target<
"vmvx",
"vmvx-bytecode-fb",
{encoding_solver = #iree_cpu.vmvx_encoding_solver<target_configuration = {ukernels = "none"}>}
>

Side note: I found a bug about dialect registration in a few passes during the prototype. The overwritten ...
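To show where this removes the duplication, here is a hedged sketch of the host-side size query once the solver sits inside the existing configuration dictionary; the interface attribute class name and the exact signature of calculateStorageElementCountInBytes are assumptions drawn from the discussion above, not upstream API.

```cpp
// Sketch, not the real IREE API: interface name and method signature assumed.
#include "mlir/IR/Builders.h"
#include "mlir/IR/BuiltinAttributes.h"

using namespace mlir;
using namespace mlir::iree_compiler;

Value calculateEncodedSize(OpBuilder &builder, Location loc,
                           IREE::HAL::ExecutableTargetAttr targetAttr,
                           RankedTensorType encodedType,
                           ValueRange dynamicDims) {
  // The solver is stored inside the target's existing configuration, so no
  // second copy of {ukernels = "none", ...} has to live anywhere else.
  DictionaryAttr config = targetAttr.getConfiguration();
  if (!config)
    return nullptr;
  // Interface attribute class name below is an assumption.
  auto solver = config.getAs<IREE::Encoding::EncodingSolverInterfaceAttr>(
      "encoding_solver");
  if (!solver)
    return nullptr;  // No solver: caller falls back to the unencoded size.
  return solver.calculateStorageElementCountInBytes(builder, loc, encodedType,
                                                    dynamicDims);
}
```

Option (c) from the earlier comment is the same idea, except the interface would be implemented by ExecutableTargetAttr itself rather than by a nested "encoding_solver" entry.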
I wrote an example that runs one matmul on device_a and the same matmul on device_b; it gives me the multi-device IR that we want to solve in the SpecializeEncoding pass. I put some of the critical IR in the snippet, and now I think I understand what we want to duplicate for executables. The set_encoding_LHS dispatch is used by both devices, while both refer to the same function. We need to duplicate the site (i.e., the functions inside an export) and update the site to point at the relevant duplicated executables.

#executable_target_embedded_elf_x86_64_ = #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {cpu = "generic", cpu_features = "", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 16 : i64, target_triple = "x86_64-unknown-unknown-eabi-elf"}>
#executable_target_vmvx_bytecode_fb = #hal.executable.target<"vmvx", "vmvx-bytecode-fb", {encoding_solver = #iree_cpu.vmvx_encoding_solver<target_configuration = {ukernels = "none"}>}>
#map = affine_map<(d0, d1, d2) -> (d0, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d2, d1)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
#map3 = affine_map<(d0, d1) -> (d0, d1)>
#device_target_local = #hal.device.target<"local", [#executable_target_vmvx_bytecode_fb]> : !hal.device
#device_target_local1 = #hal.device.target<"local", [#executable_target_embedded_elf_x86_64_]> : !hal.device
stream.executable private @foo_dispatch_0 {
stream.executable.export public @foo_dispatch_0_set_encoding_LHS_DxD workgroups(%arg0: index, %arg1: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_slice %arg0, %arg1
stream.return %x, %y, %z : index, index, index
}
builtin.module {
func.func @foo_dispatch_0_set_encoding_LHS_DxD(%arg0: !stream.binding, %arg1: index, %arg2: index, %arg3: !stream.binding) {
%c0 = arith.constant 0 : index
%0 = flow.dispatch.workload.ordinal %arg1, 0 : index
%1 = flow.dispatch.workload.ordinal %arg2, 1 : index
%2 = stream.binding.subspan %arg0[%c0] : !stream.binding -> !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%0, %1}
%3 = stream.binding.subspan %arg3[%c0] : !stream.binding -> !flow.dispatch.tensor<writeonly:tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>>{%0, %1}
%4 = flow.dispatch.tensor.load %2, offsets = [0, 0], sizes = [%0, %1], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%0, %1} -> tensor<?x?xf32>
%5 = iree_encoding.set_encoding %4 : tensor<?x?xf32> -> tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>
flow.dispatch.tensor.store %5, %3, offsets = [0, 0], sizes = [%0, %1], strides = [1, 1] : tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>> -> !flow.dispatch.tensor<writeonly:tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>>{%0, %1}
return
}
}
}
// ...
util.func public @foo(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.fence, %arg3: !hal.fence) -> !hal.buffer_view attributes {iree.abi.stub, iree.reflection = {iree.abi.declaration = "async func @foo(%input0: tensor<?x?xf32> {iree.abi.affinity = #hal.device.promise<@device_a>}, %input1: tensor<?x?xf32> {iree.abi.affinity = #hal.device.promise<@device_a>}) -> (%output0: tensor<?x?xf32> {iree.abi.affinity = #hal.device.promise<@device_a>})", iree.abi.model = "coarse-fences"}} {
// ...
%14 = stream.async.dispatch on(#hal.device.affinity<@device_a>) @foo_dispatch_0::@foo_dispatch_0_set_encoding_LHS_DxD[%0, %1](%6[%c0 to %2 for %2], %0, %1) : (!stream.resource<*>{%2}, index, index) -> !stream.resource<*>{%13}
// ...
%25 = stream.async.dispatch on(#hal.device.affinity<@device_b>) @foo_dispatch_0::@foo_dispatch_0_set_encoding_LHS_DxD[%0, %1](%22[%c0 to %2 for %2], %0, %1) : (!stream.resource<*>{%2}, index, index) -> !stream.resource<*>{%24}
// ...

The previous one (e.g., ...) ... I'll start implementing the rest of the SpecializeEncoding pass and share updates.
Note: here is the example input that I used in the prototype. What the IR does is ...
module {
func.func @foo(%arg0: tensor<?x?xf32> {iree.abi.affinity = #hal.device.promise<@device_a>}, %arg1: tensor<?x?xf32> {iree.abi.affinity = #hal.device.promise<@device_a>}) -> (tensor<?x?xf32> {iree.abi.affinity = #hal.device.promise<@device_a>}) {
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%dim = tensor.dim %arg0, %c0 : tensor<?x?xf32>
%dim_0 = tensor.dim %arg0, %c1 : tensor<?x?xf32>
%dim_1 = tensor.dim %arg1, %c1 : tensor<?x?xf32>
%cst = arith.constant 0.000000e+00 : f32
%0 = tensor.empty(%dim, %dim_1) : tensor<?x?xf32>
%1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<?x?xf32>) -> tensor<?x?xf32>
%2 = linalg.matmul ins(%arg0, %arg1 : tensor<?x?xf32>, tensor<?x?xf32>) outs(%1 : tensor<?x?xf32>) -> tensor<?x?xf32>
%3 = flow.tensor.transfer %2 : tensor<?x?xf32>{%dim, %dim_1} to #hal.device.promise<@device_b>
%4 = flow.tensor.transfer %arg0 : tensor<?x?xf32>{%dim, %dim_0} to #hal.device.promise<@device_b>
%5 = flow.tensor.transfer %arg1 : tensor<?x?xf32>{%dim_0, %dim_1} to #hal.device.promise<@device_b>
%6 = tensor.empty(%dim, %dim_1) : tensor<?x?xf32>
%7 = linalg.fill ins(%cst : f32) outs(%6 : tensor<?x?xf32>) -> tensor<?x?xf32>
%8 = linalg.matmul ins(%4, %5 : tensor<?x?xf32>, tensor<?x?xf32>) outs(%7 : tensor<?x?xf32>) -> tensor<?x?xf32>
%9 = arith.addf %3, %8 : tensor<?x?xf32>
%10 = flow.tensor.transfer %9 : tensor<?x?xf32>{%dim, %dim_1} to #hal.device.promise<@device_a>
return %10 : tensor<?x?xf32>
}
}

Commands to generate the IR: ...
I have a second take on the dup config issue without HAL attribute changes. It still creates an additional level of wrapping, but it is scoped within the Codegen directory. I.e., I create a new method (i.e., ...). It looks cleaner because the host/HAL side does not really care about the configuration field (IMO). These are target features and should(?) only be used by Codegen backends.
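A sketch of what such a Codegen-scoped method could look like; the attribute class and accessor names mirror the prototype above and are assumptions, not upstream IREE API.

```cpp
// Sketch: attribute class/accessor names follow the prototype, not upstream.
#include "mlir/IR/BuiltinAttributes.h"

using namespace mlir;
using namespace mlir::iree_compiler;

DictionaryAttr getEncodingTargetConfiguration(
    IREE::HAL::ExecutableTargetAttr targetAttr) {
  DictionaryAttr config = targetAttr.getConfiguration();
  if (!config)
    return nullptr;
  // #iree_cpu.vmvx_encoding_solver<target_configuration = {...}> wraps the
  // target features; only codegen ever needs to look inside it.
  if (auto solver = config.getAs<IREE::CPU::VMVXEncodingSolverAttr>(
          "encoding_solver"))
    return solver.getTargetConfiguration();
  // No solver present: fall back to the raw configuration dictionary.
  return config;
}
```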
Although I haven't finished updating the cloned executable part, it looks like I'm doing something wrong, so I'm posting the update here and looking for feedback. I have a commit which collects the "export -> affinities" maps and duplicates the stream.executable ops; the commit also updates the entry points of each stream.async.dispatch op.
IR before my SpecializedEncoding pass:

#executable_target_embedded_elf_x86_64_ = #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {cpu = "generic", cpu_features = "", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 16 : i64, target_triple = "x86_64-unknown-unknown-eabi-elf"}>
#executable_target_vmvx_bytecode_fb = #hal.executable.target<"vmvx", "vmvx-bytecode-fb", {encoding_solver = #iree_cpu.vmvx_encoding_solver<target_configuration = {ukernels = "none"}>}>
#map = affine_map<(d0, d1, d2) -> (d0, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d2, d1)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
#map3 = affine_map<(d0, d1) -> (d0, d1)>
#device_target_local = #hal.device.target<"local", [#executable_target_vmvx_bytecode_fb]> : !hal.device
#device_target_local1 = #hal.device.target<"local", [#executable_target_embedded_elf_x86_64_]> : !hal.device
stream.executable private @foo_dispatch_0 {
stream.executable.export public @foo_dispatch_0_set_encoding_LHS_DxD workgroups(%arg0: index, %arg1: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_slice %arg0, %arg1
stream.return %x, %y, %z : index, index, index
}
builtin.module {
func.func @foo_dispatch_0_set_encoding_LHS_DxD(%arg0: !stream.binding, %arg1: index, %arg2: index, %arg3: !stream.binding) {
%c0 = arith.constant 0 : index
%0 = flow.dispatch.workload.ordinal %arg1, 0 : index
%1 = flow.dispatch.workload.ordinal %arg2, 1 : index
%2 = stream.binding.subspan %arg0[%c0] : !stream.binding -> !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%0, %1}
%3 = stream.binding.subspan %arg3[%c0] : !stream.binding -> !flow.dispatch.tensor<writeonly:tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>>{%0, %1}
%4 = flow.dispatch.tensor.load %2, offsets = [0, 0], sizes = [%0, %1], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%0, %1} -> tensor<?x?xf32>
%5 = iree_encoding.set_encoding %4 : tensor<?x?xf32> -> tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>
flow.dispatch.tensor.store %5, %3, offsets = [0, 0], sizes = [%0, %1], strides = [1, 1] : tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>> -> !flow.dispatch.tensor<writeonly:tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>>{%0, %1}
return
}
}
}
// ...
util.func public @foo(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.fence, %arg3: !hal.fence) -> !hal.buffer_view attributes {iree.abi.stub, iree.reflection = {iree.abi.declaration = "async func @foo(%input0: tensor<?x?xf32> {iree.abi.affinity = #hal.device.promise<@device_a>}, %input1: tensor<?x?xf32> {iree.abi.affinity = #hal.device.promise<@device_a>}) -> (%output0: tensor<?x?xf32> {iree.abi.affinity = #hal.device.promise<@device_a>})", iree.abi.model = "coarse-fences"}} {
// ...
%14 = stream.async.dispatch on(#hal.device.affinity<@device_a>) @foo_dispatch_0::@foo_dispatch_0_set_encoding_LHS_DxD[%0, %1](%6[%c0 to %2 for %2], %0, %1) : (!stream.resource<*>{%2}, index, index) -> !stream.resource<*>{%13}
// ...
%25 = stream.async.dispatch on(#hal.device.affinity<@device_b>) @foo_dispatch_0::@foo_dispatch_0_set_encoding_LHS_DxD[%0, %1](%22[%c0 to %2 for %2], %0, %1) : (!stream.resource<*>{%2}, index, index) -> !stream.resource<*>{%24}
// ...

IR after my SpecializedEncoding pass:

stream.executable private @foo_dispatch_0 { ... }
// cloned executable
stream.executable private @foo_dispatch_0_0 {
// The body is the same, not changed.
stream.executable.export public @foo_dispatch_0_set_encoding_LHS_DxD workgroups(%arg0: index, %arg1: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_slice %arg0, %arg1
stream.return %x, %y, %z : index, index, index
}
builtin.module {
func.func @foo_dispatch_0_set_encoding_LHS_DxD(%arg0: !stream.binding, %arg1: index, %arg2: index, %arg3: !stream.binding) {
%c0 = arith.constant 0 : index
%0 = flow.dispatch.workload.ordinal %arg1, 0 : index
%1 = flow.dispatch.workload.ordinal %arg2, 1 : index
%2 = stream.binding.subspan %arg0[%c0] : !stream.binding -> !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%0, %1}
%3 = stream.binding.subspan %arg3[%c0] : !stream.binding -> !flow.dispatch.tensor<writeonly:tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>>{%0, %1}
%4 = flow.dispatch.tensor.load %2, offsets = [0, 0], sizes = [%0, %1], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%0, %1} -> tensor<?x?xf32>
%5 = iree_encoding.set_encoding %4 : tensor<?x?xf32> -> tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>
flow.dispatch.tensor.store %5, %3, offsets = [0, 0], sizes = [%0, %1], strides = [1, 1] : tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>> -> !flow.dispatch.tensor<writeonly:tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>>{%0, %1}
return
}
}
}
// ...
%14 = stream.async.dispatch on(#hal.device.affinity<@device_a>) @foo_dispatch_0::@foo_dispatch_0_set_encoding_LHS_DxD[%0, %1](%6[%c0 to %2 for %2], %0, %1) : (!stream.resource<*>{%2}, index, index) -> !stream.resource<*>{%13}
// The updated stream.async.dispatch op now has @foo_dispatch_0_0::@... entry point.
%25 = stream.async.dispatch on(#hal.device.affinity<@device_b>) @foo_dispatch_0_0::@foo_dispatch_0_set_encoding_LHS_DxD[%0, %1](%22[%c0 to %2 for %2], %0, %1) : (!stream.resource<*>{%2}, index, index) -> !stream.resource<*>{%24}

@benvanik do I clone the ...?
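For context on the cloning question, here is a rough sketch of the specialization loop being described; the five helpers are placeholders for the affinity analysis and Stream accessors (not real IREE API). It clones each stream.executable once per distinct resolved-target set and repoints the dispatch sites at the matching clone.

```cpp
// Sketch only: the helpers below are placeholders, not real IREE API.
#include "llvm/ADT/MapVector.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/IR/SymbolTable.h"

using namespace mlir;

Attribute getExecutionAffinity(Operation *dispatchSite);
ArrayAttr resolveExecutableTargets(Attribute affinity);
Operation *lookupExecutable(SymbolTable &symbolTable, Operation *dispatchSite);
void retargetDispatchSite(Operation *dispatchSite, StringRef executableName);
void rewriteEncodings(Operation *executable, ArrayAttr targets);

void specializeExecutables(ModuleOp moduleOp,
                           ArrayRef<Operation *> dispatchSites) {
  SymbolTable symbolTable(moduleOp);
  // executable -> (resolved target set -> specialized clone name)
  llvm::MapVector<Operation *, llvm::MapVector<Attribute, std::string>> clones;
  for (Operation *site : dispatchSites) {
    Operation *executable = lookupExecutable(symbolTable, site);
    ArrayAttr targets = resolveExecutableTargets(getExecutionAffinity(site));
    auto &perTarget = clones[executable];
    auto it = perTarget.find(targets);
    if (it == perTarget.end()) {
      // First time this executable is dispatched with this target set: clone
      // it, give the clone a unique symbol, and specialize its encodings.
      Operation *clone = executable->clone();
      symbolTable.insert(clone);
      rewriteEncodings(clone, targets);
      StringRef name = SymbolTable::getSymbolName(clone).getValue();
      it = perTarget.insert({targets, name.str()}).first;
    }
    // Every dispatch site keeps its own affinity but points at the clone
    // that matches its resolved targets.
    retargetDispatchSite(site, it->second);
  }
}
```

Whether the clone should be a whole stream.executable (as above) or extra functions inside the original executable is exactly the open question here.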
Adding the status update to the issue: #18738 has the prototype, and you can find the design doc at https://hackmd.io/@hwPnnvLBTB-JGVMeh-bCEA/Sy9nvDhb1e. We had a great brainstorm yesterday, and here are the required items to land the prototype on the main branch:
We also chatted about the case where a device has several executable targets. IMO, we're able to specialize that case in my prototype; it will be the next TODO after I land the prototype on the main branch. The other topic on my mind is how to cancel encodings properly. Ben suggested looking at it at the flow level and turning them into flow.clone ops when we know that the potential targets do not implement encodings. The information can be queried by the same analysis -- but it's fuzzier. It's on my TODO list.
I have a prototype for encoding information compression. It still carries the whole config (as an intermediate step) when we populate the attributes from the HALAffinityAnalysisDialectInterface implementation. The main difference is that we introduce an ... On the codegen side, there are two groups of encodings: one has an encoding_solver and the other does not. The boundary operations (e.g., hal.binding, flow.dispatch.tensor.load/store, etc.) have solver attributes which describe the incoming layout. The other operations do not have solver attributes and they are all compute ops, which means that they will be executed on the device that is attached on ...

@benvanik @bjacob @qedawkins @MaheshRavishankar Does the way that I define the layout look good to you?

#encoding_solver = #iree_cpu.cpu_encoding_solver<>
#encoding_solver1 = #iree_cpu.vmvx_encoding_solver<>
#encoding_solver2 = #iree_cpu.cpu_encoding_solver<target_configuration = {innerDimsPos = [0, 1], innerTileSizes = [16, 1], outerDimsPerm = [0, 1]}>
#encoding_solver3 = #iree_cpu.vmvx_encoding_solver<target_configuration = {innerDimsPos = [0, 1], innerTileSizes = [-9223372036854775808, -9223372036854775808], outerDimsPerm = [0, 1]}>
#encoding_solver4 = #iree_cpu.cpu_encoding_solver<target_configuration = {innerDimsPos = [1, 0], innerTileSizes = [16, 1], outerDimsPerm = [1, 0]}>
#encoding_solver5 = #iree_cpu.vmvx_encoding_solver<target_configuration = {innerDimsPos = [1, 0], innerTileSizes = [-9223372036854775808, -9223372036854775808], outerDimsPerm = [1, 0]}>
#encoding_solver6 = #iree_cpu.cpu_encoding_solver<target_configuration = {innerDimsPos = [0, 1], innerTileSizes = [16, 16], outerDimsPerm = [0, 1]}>
#map = affine_map<(d0, d1, d2) -> (d0, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d2, d1)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
#map3 = affine_map<(d0, d1) -> (d0, d1)>
#encoding = #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], targets = [#encoding_solver2]>
#encoding1 = #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>
#encoding2 = #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], targets = [#encoding_solver3]>
#encoding3 = #iree_encoding.encoding<operand_index = 1 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], targets = [#encoding_solver4]>
#encoding4 = #iree_encoding.encoding<operand_index = 1 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>
#encoding5 = #iree_encoding.encoding<operand_index = 1 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], targets = [#encoding_solver5]>
#encoding6 = #iree_encoding.encoding<operand_index = 2 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], targets = [#encoding_solver6]>
#encoding7 = #iree_encoding.encoding<operand_index = 2 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>
#encoding8 = #iree_encoding.encoding<operand_index = 2 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], targets = [#encoding_solver3]>
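To make the intent of these layout attributes concrete: once a layout such as #encoding_solver2 is known, the allocation-size query reduces to rounding each data-tiled dimension up to its inner tile size. A plain arithmetic sketch, not IREE code; dynamic tile sizes (printed as INT64_MIN above) would need an agreed upper bound instead.

```cpp
// Plain arithmetic sketch of the size query implied by a layout like
// {innerDimsPos = [0, 1], innerTileSizes = [16, 1], outerDimsPerm = [0, 1]}.
#include <cstdint>
#include <vector>

int64_t roundUpTo(int64_t value, int64_t multiple) {
  return ((value + multiple - 1) / multiple) * multiple;
}

// dims: logical tensor shape; innerDimsPos[i] is the dimension tiled by
// innerTileSizes[i].
int64_t storageSizeInBytes(std::vector<int64_t> dims,
                           const std::vector<int64_t> &innerDimsPos,
                           const std::vector<int64_t> &innerTileSizes,
                           int64_t elementBytes) {
  // Pad every data-tiled dimension up to a multiple of its inner tile size.
  for (size_t i = 0; i < innerDimsPos.size(); ++i)
    dims[innerDimsPos[i]] = roundUpTo(dims[innerDimsPos[i]], innerTileSizes[i]);
  int64_t elements = 1;
  for (int64_t d : dims)
    elements *= d;
  return elements * elementBytes;
}

// e.g. a 100x200 f32 LHS with innerTileSizes = [16, 1] pads to 112x200:
// storageSizeInBytes({100, 200}, {0, 1}, {16, 1}, 4) == 112 * 200 * 4.
```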
Okay, I have verified that e2e is working with the new changes. I still don't have a good name for the attribute interface; perhaps I'll just call it EncodingAttrInterface for now, and we can always rename it later. So I'm going to start breaking down my prototype and landing it on the main branch.
I think we can name it EncodingLayoutAttrInterface.
The revision introduces an optional "layouts" field to EncodingAttr. It is an array of attributes that describe the potential layouts on the device. It is an array because a device could have several executable targets. Note that it can be any attribute with an encoding attribute interface implementation. The expectation of the field is to bridge the logic between host code and device code. If an attribute does not implement the interface, it can be discarded at any time. It is a step towards iree-org#17924 Signed-off-by: hanhanW <[email protected]>
The revision introduces a new attribute interface in the Encoding dialect. The interface is used to query layout information needed to materialize encoding attributes. Any backend can implement the interface to interpret an encoding layout based on its needs. The current expectation of the interface is to propagate layout information from backends to the host compilation or other targets. The expectation can be adjusted as long as we identify additional needs for encodings/layouts. It is a step towards iree-org#17924 Signed-off-by: hanhanW <[email protected]>
The revision introduces an optional "layouts" field to EncodingAttr. It is an array of attributes that describe the potential layouts on the device. It is an array because a device could have several executable targets. Note that it can be any attribute that implements EncodingLayoutAttrInterface. The expectation of the field is to bridge the logic between host code and device code. If an attribute does not implement the interface, it can be discarded at any time. The revision also updates the TODO item for the `round_dims_to` field, because IREE is going to use the new "layouts" field and the upcoming attribute interface to handle the allocation problem. It is a step towards #17924 Signed-off-by: hanhanW <[email protected]>
…ils (#19234) The revision moves the MaterializeEncodingInfo struct and the TileSwizzle struct to `compiler/Codegen/Dialect/Codegen/Utils/Utils.[h|cpp]`. It is a preparation for #17924 because they are how we define layouts in data-tiling, and we're going to expose the layouts to EncodingAttr. The revision also updates the namespace in `GPUTileSwizzleUtils.[h|cpp]` to follow the convention, and fixes a typo in the license: the files were all created in 2024, so the year should be 2024: 740e301 --------- Signed-off-by: hanhanW <[email protected]>
The branch demonstrates how data-tiling and heterogeneous computing work together in IREE: #18738
Design Doc: https://hackmd.io/@hwPnnvLBTB-JGVMeh-bCEA/Sy9nvDhb1e
IR dump: https://gist.github.com/hanhanW/5029dc652aec1379102e43e702aaf15b
How I think about buffer allocation in data-tiling
What we can get from here is: ...
Execution Plan
Retire the query_upper_bound op and CPUMaterializeUpperBoundTileSize pass.
Goal: remove old operations and decouple the deps between HAL and the CPU specific pass.
Plan: Update the max_padding semantics in the encoding. If it is set, the backend should take it into account and select appropriate inner tile sizes (to avoid out-of-bounds access). If it is not set, the backend can pick whatever inner tile sizes it wants. In the current default path (which will eventually be moved to preprocessing), we do not set the max_padding attribute. In the path that we're building, we set the max_padding attribute to hint the actual buffer size to Stream (see the sizing sketch after this list).
Finish the data-tiling fusion and basic functional GPU data-tiling
See #17722 for more details. Basically, we want to enable fusion for mmt4d ops on the CPU side, and build the data-tiling path for GPU. There are some changes needed in the CPU backend because mmt4d fusion is new. It is scoped in issue #17722.
Outcome: we'll be able to flip data-tiling to the fusion path and use data-tiling in the multi-device project.
Move SetEncoding and MaterializeEncoding from GlobalOpt to preprocessing
Learn buffer allocation for multi-device (i.e., LCM?); see the sizing sketch after this list.
More items: TBD
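As referenced in the plan items above, here is a small sketch of the two sizing ideas, under the assumption that max_padding is an upper bound the backend promises not to exceed and that a buffer shared across devices is padded to the LCM of the per-device tile sizes; both are assumptions about a design that is still open.

```cpp
// Sketch of the sizing ideas from the plan above (assumptions, not the
// final design).
#include <cstdint>
#include <numeric>
#include <vector>

int64_t roundUp(int64_t value, int64_t multiple) {
  return ((value + multiple - 1) / multiple) * multiple;
}

// Host-side bound: each backend promises its inner tile sizes never exceed
// max_padding, so padding every dim to max_padding is always enough.
int64_t upperBoundSize(const std::vector<int64_t> &dims, int64_t maxPadding,
                       int64_t elementBytes) {
  int64_t elements = 1;
  for (int64_t d : dims)
    elements *= roundUp(d, maxPadding);
  return elements * elementBytes;
}

// Multi-device bound: pad a dim to the LCM of the tile sizes chosen by the
// devices that may touch the buffer.
int64_t lcmTile(const std::vector<int64_t> &tileSizesPerDevice) {
  int64_t result = 1;
  for (int64_t t : tileSizesPerDevice)
    result = std::lcm(result, t);
  return result;
}
```

For example, lcm(16, 8) = 16, so a dimension padded for a 16-wide tile is also large enough for the 8-wide layout of the same dimension.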
cc @MaheshRavishankar @benvanik @bjacob @Max191 @pashu123 @lialan