[Tuner] Improving ease of use for the tuner #814

Open
1 of 26 tasks
Max191 opened this issue Jan 10, 2025 · 5 comments

@Max191
Contributor

Max191 commented Jan 10, 2025

Overview and Goals

This issue lists the goals for the future state of the tuner, focusing on better testing and ease of setup and use.

In the simplest terms, the end goal of this issue is for the tuner to have little to no setup time: if a user is able to compile and run a program, then the user should be able to tune that program (nearly) just as easily. This means that nearly all of the current tuning process needs to be automated and hooked into components that are generated directly by the compiler, which leads to the next point:

Another focus of this issue is to continue hooking the tuner into components generated directly by the compiler. The current state of the tuner requires the user to know about many special flags (marking root ops, dumping benchmarks, etc.), and then manually arrange the necessary inputs (flag file, benchmark files) and outputs (concatenated tuning TD spec). All of the inputs to the tuner should be directly and easily generated by the compiler, and all outputs should be directly generated by the tuner.

Future Tasks

There is a lot to be done, so I will try to break down some of the work into smaller sub-projects:

Extracting Dispatches to Tune

In the current state, the first manual step of tuning is to collect a Tracy profile and pick out the top dispatches to tune based on their runtime percentage in the model. This should ultimately be automated.
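As a rough illustration of the selection step (the profile format here, a plain mapping from dispatch name to total runtime, is an assumption rather than an existing output format):

```python
# Hypothetical sketch: selecting the top dispatches to tune from a profile.
# The input format (dispatch name -> total runtime) is an assumption; the
# automated pipeline would consume whatever the compiler/runtime emits.
def select_top_dispatches(dispatch_times: dict[str, float],
                          coverage: float = 0.9) -> list[str]:
    """Return the dispatches that together account for `coverage` of total runtime."""
    total = sum(dispatch_times.values())
    selected: list[str] = []
    accumulated = 0.0
    for name, time in sorted(dispatch_times.items(), key=lambda kv: kv[1], reverse=True):
        if accumulated / total >= coverage:
            break
        selected.append(name)
        accumulated += time
    return selected
```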

Offload Work to the Compiler

There is a lot of Python code in the tuner to go from benchmark -> candidate TD spec. Ideally, the compiler should generate something that is easy for the tuner to ingest, and the TD spec should be very simple to create.

  • Create friendlier TransformDialect ops for tuning. We currently use transform.iree.match.cast_compatible_dag_from_root to match the operation, but this op is very sensitive to extra attributes, and we need to be careful about what attributes are present in the TD spec. Ideally there should be a TD op designed for tuning spec matching, which is less sensitive to extraneous attributes.
  • Expose utils for finding tunable ops to the Python bindings. We are using a hacky attribute set by a compiler flag to match the root op of a dispatch, but there should be an exposed function for finding the set of tunable ops in a dispatch (see the sketch after this list).
  • Move the matching logic out of the tuner, and expose it as Python bindings. This lets us remove most of the dispatch parsing logic.
  • Use Python bindings for more of the tuner's work (building TD specs, finding contraction dimensions, etc.).
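For illustration only, the binding surface the tuner would like to call might look roughly like the sketch below; every name and signature here is hypothetical, and none of these functions exist in the compiler today:

```python
# Hypothetical sketch of compiler-exposed Python bindings for the tuner.
# None of these functions exist today; the names and signatures are placeholders
# for the utilities the tuner would call instead of parsing dispatch IR itself.
from dataclasses import dataclass

@dataclass
class TunableOp:
    """Minimal information the tuner needs about a root op in a dispatch."""
    name: str                 # e.g. "linalg.generic"
    indexing_maps: list[str]  # textual affine maps, for illustration
    shapes: list[list[int]]   # static operand shapes

def find_tunable_ops(dispatch_mlir: str) -> list[TunableOp]:
    """Placeholder for a binding that returns the tunable ops in a dispatch."""
    raise NotImplementedError("illustrative only; would be backed by the compiler")

def build_td_spec(op: TunableOp, lowering_config: dict) -> str:
    """Placeholder for a binding that emits a matching + application TD spec."""
    raise NotImplementedError("illustrative only; would be backed by the compiler")
```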

Tuner Ease of Use

This refers to an overall easier user experience: reducing the number of flags the user has to supply and automating the tuner's setup process.

  • Automate generation of compile/benchmark flag files. This should be done in the compiler, so a user who compiles and benchmarks a program can simply add an option to dump the flags for later use in tuning.
  • Create better defaults for tuner flags. This includes things like the codegen pipeline, the search space for GPU pipeline options, and the number of candidates of each type. The user should not have to be aware of any tuner implementation details, and these flags should have defaults that work well out of the box.
  • Create a general tuning loop that can be used to automagically tune a model, given the compilation and benchmarking flags. We have been relying on the examples/simple example for tuning, but that is only meant to be an example of how to write a tuning client. There should be a central tuning loop, and it should be obvious to the user how to use it (see the sketch after this list).
  • Automatically generate concatenated TD specs after tuning.
  • Generate better logs with more condensed and organized information.
  • Improve documentation. Anyone should be able to use the tuner, so the documentation should be clear enough for people outside of the IREE contributor community to follow. Any drawbacks, risks, and potential failures should be well documented: anything that can go wrong should be clearly explained, along with steps for what to do when it does.
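As a sketch of the kind of user-facing entry point a central tuning loop could provide (the function name, arguments, and behavior described in the comments are assumptions about the desired end state, not an existing API):

```python
# Hypothetical sketch of a central tuning-loop entry point. Today users adapt
# examples/simple instead of calling a shared, documented function like this.
from pathlib import Path

def tune(model_mlir: Path, compile_flags_file: Path, benchmark_flags_file: Path,
         output_dir: Path, num_candidates: int = 1024) -> Path:
    """Run the full tuning loop and return the path of the concatenated TD spec.

    Intended flow, with good defaults baked in:
      1. Generate `num_candidates` candidate configurations.
      2. Compile and benchmark the candidates against a baseline.
      3. Write and return the concatenated tuning spec for the best candidates.
    """
    spec_path = output_dir / "tuning_spec.mlir"
    ...  # the actual loop would live in a shared library, not in user code
    return spec_path
```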

Improve Benchmarking Stability

This is partly documentation, partly implementation. We can implement features to attempt to reduce noise as much as possible, and warn when noise is detected, but it is impossible to prevent all noise, so a user should also be aware of things that cause noisy tuning results.

  • Automatically find a cutoff point for model-in-the-loop tuning. This can be based on some number of standard deviations of the iree-benchmark-module baseline times. For example, if a dispatch takes less time than the standard deviation of the baseline benchmark times, then it should not be tuned with model-in-the-loop (see the sketch after this list).
  • Periodically check for machine noise by running baseline benchmarks.
  • Run baseline benchmarks serially per device. It is critical that we have non-noisy baseline benchmarks, and baseline benchmarks are a small fraction of the total benchmarking time. Ensuring that only one benchmark is running per device at a time reduces overall noise in devices with split partitions (QPX and CPX modes on MI3xx).
  • Control the number of benchmarks running on each device during candidate benchmarking. Too many on one device (in QPX and CPX) cause a lot of noise, especially with model-in-the-loop. We should have a good default for this, but it should also be tunable based on the machine (different machines may have different tolerances).
  • Document causes of benchmark instability (model-in-the-loop tuning with small dispatches, noisy machines, too many parallel devices for benchmarking).
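A minimal sketch of the cutoff idea from the first bullet, assuming baseline samples from iree-benchmark-module have already been collected; the one-standard-deviation threshold is just an example default:

```python
import statistics

# Minimal sketch of the model-in-the-loop cutoff: if a dispatch's runtime is
# below the noise floor of the full-model baseline, improvements to it cannot
# be measured reliably, so it should be tuned dispatch-only instead.
def should_tune_with_model(dispatch_time_us: float,
                           baseline_samples_us: list[float],
                           num_std_devs: float = 1.0) -> bool:
    noise_floor = num_std_devs * statistics.stdev(baseline_samples_us)
    return dispatch_time_us > noise_floor
```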

Further Tuning Support and Maintainability

  • Support tuning of dynamic shaped ops.
  • Support more dispatch types:
    • Arbitrary conv layouts
    • Horizontally fused contractions
    • Attention
    • Fusions
  • Enable numerical accuracy tests to validate tuned specs during the tuning loop. This can be done with the user-supplied benchmark command for the full model: check top candidates until one is found whose numerics match the baseline.
  • Decouple pipeline constraints from the shared ProblemSize struct. As more operation types are supported, different problems require different kinds of information. It would be better to have some sort of ProblemSizeInterface with implementations for constraint generation based on a given codegen pipeline.
  • Reorder candidate generation so that the expected best candidates (Po2 tile sizes, etc.) are generated first; see the sketch below. This could reduce tuning time significantly by allowing fewer candidates while still achieving good tuning results.
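A small sketch of the reordering idea (the candidate representation, a tuple of tile sizes, is a simplification of whatever the generator actually produces):

```python
# Illustrative sketch: put candidates whose tile sizes are all powers of two
# ahead of the rest, so a reduced candidate budget still covers the most
# promising configurations first.
def is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0

def order_candidates(candidates: list[tuple[int, ...]]) -> list[tuple[int, ...]]:
    # sorted() is stable, so the original order is preserved within each group.
    return sorted(candidates,
                  key=lambda tile_sizes: not all(is_power_of_two(t) for t in tile_sizes))

# With a candidate budget, the tuner can then take only the first N entries:
# candidates = order_candidates(all_candidates)[:num_candidates]
```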

Improve Test Coverage in the Tuner

The poor test coverage was made very clear in the last sprint for SDXL tuning, when many bugs were found in the new tuner path once real model tuning loops were being used. There needs to be better overall test coverage and error handling in the tuner, since each bug hit at the end of a tuning run wastes a lot of time, which matters greatly when working under time pressure.

  • Add tests for runtime failures of all external calls within the tuner
  • Restructure code to make more parts testable with mocking; see the sketch after this list. All code in the tuner should have tests written for it, and large functions should be broken down into smaller functions that can be easily mocked and tested.
  • Eventually add e2e tuning loop tests. This would probably require CPU tuning to be implemented, since we do not want to require GPU runners for tuning tests, but it would be good to have e2e tests of the full tuner flow running in CI.
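As an example of the kind of test this restructuring enables (the `compile_candidate` helper is hypothetical; the point is that thin wrappers around external tools like iree-compile can have their failure paths tested without running the real binary):

```python
import subprocess
from unittest import mock

def compile_candidate(flags: list[str]) -> bool:
    """Hypothetical thin wrapper around an iree-compile invocation."""
    result = subprocess.run(["iree-compile", *flags], capture_output=True)
    return result.returncode == 0

def test_compile_candidate_reports_failure():
    # Simulate iree-compile failing without ever invoking the real tool.
    failed = mock.Mock(returncode=1, stdout=b"", stderr=b"error: ...")
    with mock.patch("subprocess.run", return_value=failed):
        assert compile_candidate(["--dummy-flag"]) is False
```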

Packaging Default Tuning Specs with IREE

We should also have a good solution for packaging tuning specs with IREE, so we can get good performance out of the box for certain important ops/shapes.

  • Create some more generic TD ops for matching and applying specs on special operations we care about. This prevents us from having gigantic tuning specs (with many different shapes, permutations, etc.) that need to be maintained and shipped with IREE. A good first target is attention, since it benefits a lot from tuning specs.
@kuhar
Member

kuhar commented Jan 10, 2025

We currently use transform.iree.match.cast_compatible_dag_from_root to match the operation,

Another issue is that it does not support matching constants that may be used in bodies of linalg ops.

@kuhar
Member

kuhar commented Jan 10, 2025

Another big action item should be to automatically collect profiles so that users don't have to collect Tracy traces and manually select ops to tune. This is described in the original tuner issue: iree-org/iree#16952 . This will require compiler support as well.

One more thing: support dispatches with dynamic shapes. This requires us to add support for generating benchmarks for dynamic shapes: iree-org/iree#19518

@Max191
Contributor Author

Max191 commented Jan 10, 2025

Another big action item should be to automatically collect profiles so that users don't have to collect Tracy traces and manually select ops to tune. This is described in the original tuner issue: iree-org/iree#16952 . This will require compiler support as well.

One more thing: support dispatches with dynamic shapes. This requires us to add support for generating benchmarks for dynamic shapes: iree-org/iree#19518

Thanks for the suggestions! I'll add them to the task list. When you say automatically collect profiles, do you specifically mean Tracy profiles? One of my tasks above talks about adding some simple hooks in the compiler to track total run time, but I did not include automating the full Tracy trace, since I didn't think it was necessary for the tuning loop.

@kuhar
Member

kuhar commented Jan 10, 2025

Not exactly tracy profiles but something equivalent with enough fidelity for the tuner to identify top dispatches. Ideally we should survey existing profile data formats used in PGO/AutoFDO and pick something portable, if that exists.

@Max191
Contributor Author

Max191 commented Feb 25, 2025

I have added some more bullets to the list at the top, but we have a lot of tasks to work on here. Let's try to order them by priority. I'll start in this comment with what I think is the best first task to tackle, and we can build from there and create sub-issues.

1. Support More Dispatch Types

The immediate first priority in my mind is to add support for tuning more dispatch types, since it has a direct impact on how far we can tune a given model. This requires us to lay some initial groundwork, though:

  1. We need to get rid of the lengthy op_matchers.py logic, and offload it to the compiler. This means exposing things like linalg::inferContractionDims, linalg::isaContractionOpInterface, linalg::inferConvolutionDims, etc. The more ops we add, the longer this op matching logic will get, and it will quickly become difficult to maintain.
  2. Decouple the dispatch constraints generation from the singular ProblemSize dataclass. New operations will not fit into the single problem type, and we will have to keep extending the class to fit more problem types. This quickly gets out of control.
    • A better solution is to have an abstract class for ProblemSize that requires implementations of: 1. a function to match the dispatch and pull shape information from it, and 2. a function to generate constraints for a given codegen pipeline based on the extracted information (see the sketch after this list). Then, each dispatch tuner can implement its own ProblemSize, and the main tuning loop uses the interface to generate constraints and tune.
  3. Come up with a better way of generating spec matching + application named sequences. The current way of string replacement in a predefined TD spec is not good enough for operations like attention. Attention requires additional operation attributes that are separate from the lowering_config, so we cannot simply set the lowering config like with other ops. One way to do this better would be to create some special transform dialect ops that we use for setting configs for the ops we need. Then, when we build the spec in the tuner, we create the specific transform op that is needed for matching + application for the dispatch type.
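A minimal Python sketch of the interface described in item 2; the method names and return types are assumptions, and the constraint representation would be whatever the tuner's solver actually consumes:

```python
from abc import ABC, abstractmethod
from typing import Optional

# Minimal sketch of the ProblemSize interface idea from item 2. Names and
# signatures are assumptions; each dispatch tuner (contraction, convolution,
# attention, ...) would provide its own implementation, and the main tuning
# loop would only ever interact with this interface.
class ProblemSizeInterface(ABC):
    @classmethod
    @abstractmethod
    def match(cls, dispatch_ir: str) -> Optional["ProblemSizeInterface"]:
        """Match the dispatch and extract its shape information, or return None."""

    @abstractmethod
    def generate_constraints(self, codegen_pipeline: str) -> list:
        """Generate solver constraints for the given codegen pipeline."""
```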

The above 3 tasks are the important things on my mind right now before we start adding tuning support for more dispatch types. We can start with these tasks, and build on them or break them down as needed.

cc @kuhar @bangtianliu
