Overall Model Testing/Benchmarking Plan #19115

Open · 14 tasks
saienduri opened this issue Nov 12, 2024 · 0 comments
Labels: infrastructure/benchmark (benchmarking infrastructure), infrastructure (build systems, CI, or testing)

saienduri commented Nov 12, 2024

Tentative Plan to Extend Benchmarking/Validation Testing of PyTorch/ONNX Models

Halo Models

Halo models are currently thoroughly tested and benchmarked in model-validation and benchmarking, respectively. They live in the experimental section of IREE, and a few improvements need to be made on both the compiler side and the testing side before we can move them out of experimental.

For our halo models (sdxl, llama, flux, etc.), we can expect the modeling to be implemented in sharktank, and we can assume that MLIR generation is well tested and taken care of there. IREE testing of halo models will always start at the MLIR stage. Like the current implementation, we can continue to host these artifacts in Azure or move them somewhere more accessible such as Hugging Face.
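If we do move the MLIR artifacts to Hugging Face, fetching them could look like the minimal sketch below. The repo id and file layout are hypothetical placeholders, not existing locations, and are only meant to show how little boilerplate this path would need.

```python
# Minimal sketch: pull a halo model's MLIR input from a Hugging Face dataset repo.
# The repo id and filename below are hypothetical placeholders.
from huggingface_hub import hf_hub_download

mlir_path = hf_hub_download(
    repo_id="amd-shark/halo-model-artifacts",  # hypothetical repo
    filename="sdxl/unet/model.mlir",           # hypothetical layout: <model>/<submodel>/model.mlir
    repo_type="dataset",
)
print(f"Downloaded MLIR to {mlir_path}")
```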

The following tasks would help us move out of experimental and build something reliable, easy to navigate, and easy to recreate locally:

Model Validation

Compiler Tasks:

  • Remove the need for complex compiler flag configurations. Maintaining a long list of flags that varies with the submodel and data type makes the testing hard to scale and hard for developers to debug or recreate locally.
  • Remove the need for a spec file for tuning. We currently require a roughly 300-line spec file from the codegen side to get good performance on our halo models. We should find a way to enable these optimizations by default in the compiler.

Testing Tasks:

  • Scale the backends being tested (currently cpu, mi250, mi300).
  • Find a better way than the VmfbManager to share data between tests that compile and run here.
  • Currently, there is boilerplate code to download each file we need from Azure. Switch to a single artifact table per halo model and iterate over it (see the sketch after this list). example
  • Find a better way to keep track of all the compiler flag configurations, especially as we scale the backends being tested. Also consider whether switching to the iree.build path would alleviate some of the pressure (https://github.com/nod-ai/SHARK-Platform/pull/427/files).
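A rough sketch of what a centralized artifact and flag table could look like, assuming we keep pulling from Azure over HTTPS. The URLs, submodel names, and flag sets below are illustrative placeholders, not the current configuration.

```python
# Sketch: one artifact/flag table per halo model instead of per-file download boilerplate.
# URLs, submodel names, and flag lists are illustrative placeholders.
import urllib.request
from pathlib import Path

SDXL_ARTIFACTS = {
    "unet": "https://example.blob.core.windows.net/halo/sdxl/unet/model.mlir",  # placeholder URL
    "clip": "https://example.blob.core.windows.net/halo/sdxl/clip/model.mlir",  # placeholder URL
    "vae":  "https://example.blob.core.windows.net/halo/sdxl/vae/model.mlir",   # placeholder URL
}

# Per-backend compile flags kept in one place instead of scattered across tests.
BACKEND_FLAGS = {
    "cpu":   ["--iree-hal-target-backends=llvm-cpu"],
    "mi300": ["--iree-hal-target-backends=rocm", "--iree-hip-target=gfx942"],
}

def fetch_artifacts(artifacts: dict[str, str], dest: Path) -> dict[str, Path]:
    """Download every artifact in the table and return local paths keyed by submodel."""
    paths = {}
    for submodel, url in artifacts.items():
        local = dest / submodel / "model.mlir"
        local.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, local)
        paths[submodel] = local
    return paths
```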

Model Benchmarking

Testing Tasks:

  • Currently, a developer has to specify a long list of flags to recreate the benchmarking setup that matches the workflow file (source, workflow). Find a way to enable sensible defaults and let CI override them with CLI flags, so local recreation is easier (see the sketch after this list).
  • The benchmark is currently one pytest with over 300 lines of code with incoming changes here. The reason is that this makes it easy to keep track of data and generate custom reports for latency, dispatch count, and binary size. Find a better way to split the tests up while using some pytest hook mechanism to still generate the reports (we might have to find a middle ground here). See this comment for more details.
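One possible shape for "defaults locally, overrides in CI" plus a consolidated report is plain pytest hooks, sketched below. The option names and report fields are hypothetical, not what the current benchmark test uses.

```python
# conftest.py sketch: defaults a developer can run as-is, CI overrides via CLI flags.
# Option names and report fields are hypothetical placeholders.
def pytest_addoption(parser):
    parser.addoption("--target-backend", default="cpu",
                     help="Backend to benchmark (default works locally without extra flags).")
    parser.addoption("--benchmark-repetitions", type=int, default=3,
                     help="Number of benchmark repetitions (CI may raise this).")

# Collect per-test results during the run and emit one consolidated report at the end.
_RESULTS = []

def record_benchmark(name, latency_ms, dispatch_count, binary_size):
    """Called by individual benchmark tests instead of one monolithic test."""
    _RESULTS.append((name, latency_ms, dispatch_count, binary_size))

def pytest_terminal_summary(terminalreporter):
    if not _RESULTS:
        return
    terminalreporter.section("model benchmark summary")
    for name, latency_ms, dispatches, size in _RESULTS:
        terminalreporter.write_line(
            f"{name}: {latency_ms:.2f} ms, {dispatches} dispatches, {size} bytes")
```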

General Models

For the general model suite, we can start mainly with PyTorch and ONNX, as these are the two frameworks we rely on. All of these tests should live in the iree-test-suite. For ONNX, we have a supported path in IREE to import MLIR from ONNX source files example. For PyTorch models, we rely on the iree-turbine repo to export to MLIR example, so we can decide to either include this intake path or start at the MLIR. Both intake paths are sketched below.
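As a rough sketch of the two intake paths under discussion: the ONNX side goes through the iree-import-onnx tool, and the PyTorch side goes through iree-turbine's aot.export. Treat the exact export API usage as an assumption to verify against the current iree-turbine release; the file names and the toy model are placeholders.

```python
# Sketch of the two MLIR intake paths; file names and the toy model are placeholders.
import subprocess
import torch
from iree.turbine import aot  # assumption: current iree-turbine export API

# ONNX path: the iree-import-onnx tool converts an .onnx file to MLIR.
subprocess.run(["iree-import-onnx", "model.onnx", "-o", "model_onnx.mlir"], check=True)

# PyTorch path: export an nn.Module to MLIR via iree-turbine.
class ToyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x)

exported = aot.export(ToyModel(), torch.randn(4, 8))
exported.save_mlir("model_torch.mlir")
```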

We have already had some work on this here, which is a good starting point to expand on.

Testing Tasks:

  • Figure out the intake path for pytorch models and set it up similar to the onnx path.
  • Add more models and configure a backend matrix (just cpu at the moment): here for backends and here for adding more models.
  • Figure out pytest hooks to apply different xfails based on the backend as we scale here (see the sketch after this list).
  • For benchmarking, it should depend on the model test. We could add it as a benchmark step that runs after compilation, if compilation passes. Whatever we decide, it should run simple configurations for each backend that passes in the validation flow.
  • Figure out how we want to store all the model artifacts in this test suite. The artifacts we intake (mlir, onnx files, etc.) should all be stored in Hugging Face with a directory for each model. Include a small README in each Hugging Face directory describing where the model is from and what it entails.
  • Figure out how we want to share artifacts between the validation and benchmarking steps. The way we do this in IREE is by setting a model outputs directory as a global env variable.
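A possible shape for the per-backend xfail handling, using standard pytest hooks. The XFAILS table contents and the --target-backend option are hypothetical placeholders.

```python
# conftest.py sketch: mark known-failing (test, backend) pairs as xfail at collection time.
# The XFAILS table and the --target-backend option are hypothetical placeholders.
import pytest

XFAILS = {
    "cpu":   {"test_resnet50"},                   # placeholder entries
    "mi300": {"test_opt_125m", "test_resnet50"},  # placeholder entries
}

def pytest_addoption(parser):
    parser.addoption("--target-backend", default="cpu")

def pytest_collection_modifyitems(config, items):
    backend = config.getoption("--target-backend")
    expected_failures = XFAILS.get(backend, set())
    for item in items:
        base_name = item.name.split("[")[0]  # strip any parametrization suffix
        if base_name in expected_failures:
            item.add_marker(pytest.mark.xfail(
                reason=f"known failure on {backend}", strict=False))
```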