Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Oxidize UnitarySynthesis pass #13141

Merged
merged 51 commits into from
Oct 18, 2024
Merged

Conversation

ElePT
Copy link
Contributor

@ElePT ElePT commented Sep 12, 2024

Summary

Closes #12210. This is an initial effort to port UnitarySynthesis to Rust, currently limited to the target + default plugin path (custom plugins will run in Python).

Details and comments

TODO:

  • Add unit tests for qsd path and 1q synthesis path
  • Address drawer issue after calling dag_to_circuit (not reflected in unit tests)
  • Clean up comments
  • Review use of dag.push_back and see if it can be done with the new dag.apply_operation_back method

@ElePT ElePT force-pushed the oxidize-unitary-synthesis branch from 479fbe0 to e4ec3ff Compare September 12, 2024 10:20
@mtreinish mtreinish added performance Rust This PR or issue is related to Rust code in the repository mod: transpiler Issues and PRs related to Transpiler labels Sep 12, 2024
@ElePT ElePT force-pushed the oxidize-unitary-synthesis branch from 8de52b5 to e11ffe6 Compare September 12, 2024 10:55
@mtreinish mtreinish added this to the 1.3.0 milestone Sep 12, 2024
@ElePT ElePT force-pushed the oxidize-unitary-synthesis branch from e11ffe6 to 1296234 Compare September 12, 2024 11:48
@coveralls
Copy link

coveralls commented Sep 12, 2024

Pull Request Test Coverage Report for Build 11409692028

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 784 of 849 (92.34%) changed or added relevant lines in 5 files are covered.
  • 856 unchanged lines in 37 files lost coverage.
  • Overall coverage decreased (-0.3%) to 88.535%

Changes Missing Coverage Covered Lines Changed/Added Lines %
crates/accelerate/src/two_qubit_decompose.rs 14 15 93.33%
crates/accelerate/src/unitary_synthesis.rs 761 825 92.24%
Files with Coverage Reduction New Missed Lines %
qiskit/circuit/library/standard_gates/u.py 1 93.0%
crates/qasm2/src/expr.rs 1 94.02%
crates/accelerate/src/elide_permutations.rs 1 97.06%
crates/accelerate/src/error_map.rs 1 85.42%
crates/accelerate/src/circuit_library/quantum_volume.rs 1 96.69%
qiskit/circuit/library/standard_gates/rz.py 2 95.83%
qiskit/circuit/library/quantum_volume.py 2 95.45%
crates/circuit/src/lib.rs 2 96.15%
qiskit/synthesis/stabilizer/stabilizer_decompose.py 2 96.77%
crates/circuit/src/bit_data.rs 2 94.62%
Totals Coverage Status
Change from base Build 11241264267: -0.3%
Covered Lines: 73846
Relevant Lines: 83409

💛 - Coveralls

@ElePT
Copy link
Contributor Author

ElePT commented Sep 13, 2024

Performance is starting to look promising:

Benchmarks that have improved:

| Change   | Before [2fa6dd65] <main>   | After [739a5c84] <oxidize-unitary-synthesis-benchmarking-3>   |   Ratio | Benchmark (Parameter)                                              |
|----------|----------------------------|---------------------------------------------------------------|---------|--------------------------------------------------------------------|
| -        | 1.30±0.04s                 | 1.17±0.02s                                                    |    0.9  | utility_scale.UtilityScaleBenchmarks.time_qaoa('ecr')              |
| -        | 483±60ms                   | 375±2ms                                                       |    0.78 | utility_scale.UtilityScaleBenchmarks.time_square_heisenberg('cx')  |
| -        | 695±90ms                   | 476±20ms                                                      |    0.69 | utility_scale.UtilityScaleBenchmarks.time_square_heisenberg('cz')  |
| -        | 705±100ms                  | 481±10ms                                                      |    0.68 | utility_scale.UtilityScaleBenchmarks.time_square_heisenberg('ecr') |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@ElePT
Copy link
Contributor Author

ElePT commented Sep 16, 2024

Benchmarking status as of eb815d1 (more benchmarks show improvement but the improvement looks slightly smaller).

| Change   | Before [2fa6dd65] <unitary-test>   | After [266a22f0] <oxidize-unitary-synthesis>   |   Ratio | Benchmark (Parameter)                                              |
|----------|------------------------------------|------------------------------------------------|---------|--------------------------------------------------------------------|
| -        | 1.38±0.04s                         | 1.21±0.03s                                     |    0.87 | utility_scale.UtilityScaleBenchmarks.time_qaoa('cz')               |
| -        | 425±8ms                            | 361±8ms                                        |    0.85 | utility_scale.UtilityScaleBenchmarks.time_square_heisenberg('cx')  |
| -        | 566±10ms                           | 467±10ms                                       |    0.83 | utility_scale.UtilityScaleBenchmarks.time_square_heisenberg('cz')  |
| -        | 3.94±0.2s                          | 3.04±0.02s                                     |    0.77 | utility_scale.UtilityScaleBenchmarks.time_qv('ecr')                |
| -        | 4.21±0.1s                          | 3.15±0.04s                                     |    0.75 | utility_scale.UtilityScaleBenchmarks.time_qv('cz')                 |
| -        | 626±40ms                           | 456±4ms                                        |    0.73 | utility_scale.UtilityScaleBenchmarks.time_square_heisenberg('ecr') |

@ElePT ElePT marked this pull request as ready for review September 16, 2024 16:06
@qiskit-bot
Copy link
Collaborator

One or more of the following people are relevant to this code:

  • @Qiskit/terra-core
  • @kevinhartman
  • @levbishop
  • @mtreinish

@ElePT ElePT changed the title [WIP] Oxidize UnitarySynthesis pass Oxidize UnitarySynthesis pass Sep 16, 2024
Copy link
Member

@mtreinish mtreinish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a first pass through this, thanks for doing this @ElePT! I left some inline comments. The two high level ones that stuck out the most to me from my first pass is that this PR seems to make a lot of things public, some of them make sense like the dagcircuit methods, but others I'm skeptical we need to be reusing or accessing outside of their definition. Some of the internal attributes of structures for example.

The other thing that I think is potentially a bigger issue especially for performance is around some of the typing choices and how that leads to a lot of conversions. The most concrete example is things are normalized on DAGCircuit between all the different decomposer, but this leads to creating an expensive object to be constructed when it'd be far more efficient to iterate over the synthesized sequence directly and add the gates directly to the dag.

crates/accelerate/src/two_qubit_decompose.rs Outdated Show resolved Hide resolved
qiskit/transpiler/passes/synthesis/unitary_synthesis.py Outdated Show resolved Hide resolved
Comment on lines 186 to 187
let circuit_to_dag = imports::CIRCUIT_TO_DAG.get_bound(py);
let dag_to_circuit = imports::DAG_TO_CIRCUIT.get_bound(py);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These exist in rust now, see: https://github.com/Qiskit/qiskit/blob/main/crates/circuit/src/converters.rs so I'd suggest calling them directly. Especially if you're doing circuit_to_dag you can call the inner DAGCircuit constructor and avoid the python overhead when you've built a CircuitData directly in rust (not that I think that's the case here).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any updates on when we will start using these here? I understand that you may not want to use dag_to_circuit in rust to avoid using CircuitData but circuit_to_dag takes a QuantumCircuit instance, and should work the same way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the code to use circuit_to_dag from the new converters crate in a912248. I am hesitant with dag_to_circuit because to use it I'd have to add a few extra conversion steps.

crates/accelerate/src/unitary_synthesis.rs Outdated Show resolved Hide resolved
crates/circuit/src/dag_circuit.rs Outdated Show resolved Hide resolved
crates/accelerate/src/unitary_synthesis.rs Outdated Show resolved Hide resolved
crates/accelerate/src/unitary_synthesis.rs Outdated Show resolved Hide resolved
crates/accelerate/src/unitary_synthesis.rs Outdated Show resolved Hide resolved
crates/accelerate/src/unitary_synthesis.rs Outdated Show resolved Hide resolved
@ElePT ElePT force-pushed the oxidize-unitary-synthesis branch 2 times, most recently from e560bf8 to 7498419 Compare September 18, 2024 10:08
@ElePT
Copy link
Contributor Author

ElePT commented Oct 9, 2024

This is the latest benchmarking output after pulling from main, again a smaller number of tests improve but the improvement seems better.

| Change   | Before [90e92a46] <oxidize-unitary-synthesis^2>   | After [9cacf6fc] <oxidize-unitary-synthesis>   |   Ratio | Benchmark (Parameter)                                              |
|----------|---------------------------------------------------|------------------------------------------------|---------|--------------------------------------------------------------------|
| +        | 14.7±0.6ms                                        | 18.5±1ms                                       |    1.25 | utility_scale.UtilityScaleBenchmarks.time_parse_qaoa_n100(‘cx’)    |
| -        | 512±100ms                                         | 363±9ms                                        |    0.71 | utility_scale.UtilityScaleBenchmarks.time_square_heisenberg(‘cx’)  |
| -        | 841±100ms                                         | 444±30ms                                       |    0.53 | utility_scale.UtilityScaleBenchmarks.time_square_heisenberg(‘cz’)  |
| -        | 858±70ms                                          | 446±30ms                                       |    0.52 | utility_scale.UtilityScaleBenchmarks.time_square_heisenberg(‘ecr’) |

@ElePT
Copy link
Contributor Author

ElePT commented Oct 14, 2024

I also ran the transpiler benchmarks for QV, which might give a better idea of the speedup than the utility scale ones, and got the following numbers on the tests that use the rust path (the rust path is only followed if transpile is called with a target, and some of these tests give coupling map/basis gates directly). I added a rust-path test for the 50x20 example to have more data to look at.

| Change   | Before [90e92a46] <oxidize-unitary-synthesis^2>   | After [9cacf6fc] <oxidize-unitary-synthesis>   |   Ratio | Benchmark (Parameter)                                                                       |
|----------|---------------------------------------------------|------------------------------------------------|---------|---------------------------------------------------------------------------------------------|
| -        | 138±0.8ms                                         | 121±2ms                                        |    0.88 | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(3)                    |
| -        | 691±20ms                                          | 600±30ms                                       |    0.87 | transpiler_levels.TranspilerLevelBenchmarks.time_quantum_volume_transpile_50_x_20_target(2) |
| -        | 100±1ms                                           | 87.4±4ms                                       |    0.87 | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(2)                    |
| -        | 62.4±0.8ms                                        | 52.8±4ms                                       |    0.85 | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(1)                    |
| -        | 321±3ms                                           | 262±10ms                                       |    0.82 | transpiler_levels.TranspilerLevelBenchmarks.time_quantum_volume_transpile_50_x_20_target(0) |
| -        | 62.0±1ms                                          | 44.9±0.5ms                                     |    0.72 | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(0)                    |

Copy link
Contributor

@raynelfss raynelfss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is about ready in my opinion. We'll see about the panicking and errror handling on the synth_sequences based on the comments I just left. I'd appreciate if @ShellyGarion or @alexanderivrii can take a look at the arithmetic, or a second look from the team.

));
}

let target_basis_set = get_target_basis_set(target, qubits[0]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should something like this be part of #13308? Since we're moving towards a more Target centric transpiler, being able to extract the basis gates automatically. I see this being done in many transpiler passes whenever basis gates are not provided.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, absolutely, I would include this.

let mut embodiment =
xx_embodiments.get_item(op.to_object(py).getattr(py, "base_class")?)?;

if embodiment.getattr("parameters")?.len()? == 1 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, I wanted to ask since CircuitData has a version of assign_parameters that can be worked entirely through rust.

let synth = if let DecomposerType::TwoQubitBasisDecomposer(decomp) = &decomposer_2q.decomposer {
decomp.call_inner(su4_mat.view(), None, is_approximate, None)?
} else {
panic!("synth_su4_sequence should only be called for TwoQubitBasisDecomposer.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's unreachable code you should use unreachable!(), the only downside is that these exceptions cannot really be caught from Python code. However, I think it is better for debugging purposes.

@ShellyGarion
Copy link
Member

I think this is about ready in my opinion. We'll see about the panicking and errror handling on the synth_sequences based on the comments I just left. I'd appreciate if @ShellyGarion or @alexanderivrii can take a look at the arithmetic, or a second look from the team.

The general framework looks good to me.
For a 1-qubit unitary we use euler_one_qubit_decomposer.
For a 2-qubit unitary we use the TwoQubitBasisDecomposer (default for the basis gates 'cx', 'cz' or 'ecr') or the XXDecomposer.
Later, we also plan to add the TwoQubitControlledUDecomposer (which should be default for gates like 'rzz' with an arbitrary angle), see #13139 and #13320.
For a larger unitary, we use the qs_decomposition.

Copy link
Member

@mtreinish mtreinish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't made it through all of the rust code yet, but from what I seen this is looking good enough to merge (once the build failure is fixed). I think there are some potential optimizations and improvements we'll want to make down the road but this is large enough we could nitpick forever on particular pieces of it and never merge it when we can make incremental improvements after it merges. So if I don't circle back to this in a timely manner don't let that stop someone else from feeling empowered to approve and merge it.

crates/accelerate/src/unitary_synthesis.rs Outdated Show resolved Hide resolved
// no need to bother trying the XXDecomposer.
static GOODBYE_SET: [&str; 3] = ["cx", "cz", "ecr"];

fn get_target_basis_set(target: &Target, qubit: PhysicalQubit) -> EulerBasisSet {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there is value in eventually consolidating this with the similar logic in euler_one_qubit_decomposer. It's different enough that it's not worth it right now, but I think there is a path to unifying the functions probably in a follow up PR.

crates/accelerate/src/unitary_synthesis.rs Outdated Show resolved Hide resolved
crates/accelerate/src/unitary_synthesis.rs Outdated Show resolved Hide resolved
target: &Target,
) -> f64 {
let mut gate_fidelities = Vec::new();
let mut score_instruction =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious why did you end up going for a closure here? I probably would have just put this in the loop body directly. It doesn't really matter so no reason to change it, I was just wondering.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I was originally inlining it in the code, switched a couple of times between that and the function, and it ended up like this, but no proper reason for it.

Comment on lines 232 to 233
let node_ids: Vec<NodeIndex> = dag.op_nodes(false).collect();
for node in node_ids {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to do this as a separate loop over all the nodes? I feel like we should be able to integrate this into the main loop over the nodes in topological order below and reduce the need to iterate over the dag twice.

If not we can do a simple check up front and call dag.has_control_flow() to check if we have any control flow in the dag at all and if not we don't have to bother iterating.

Comment on lines 249 to 253
let new_ids = dag.get_qargs(inst.qubits).iter().map(|qarg| {
qubit_indices
.get_item(qarg.0 as usize)
.expect("Unexpected index error in DAG")
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we need to be using a PyList for this? I feel like using a vec here makes more sense. There is a conversion cost coming from python but if we're calling this function recursively from rust or repeatedly accessing it optimizing for that will end up being a lot more efficient.

@raynelfss
Copy link
Contributor

Currently reviewing... Any small changes I'll commit myself and merge

Copy link
Contributor

@raynelfss raynelfss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you for your hard work on this. Hopefully, we can make the interners private again in the future (hopefully #13335 is sufficient). I am excited about all the speedups this PR introduces :)

crates/accelerate/src/unitary_synthesis.rs Show resolved Hide resolved
crates/accelerate/src/unitary_synthesis.rs Outdated Show resolved Hide resolved
Comment on lines +257 to +274
let res = py_run_main_loop(
py,
&mut circuit_to_dag(
py,
QuantumCircuitData::extract_bound(&raw_block?)?,
false,
None,
None,
)?,
new_ids,
min_qubits,
target,
coupling_edges,
approximation_degree,
natural_direction,
)?;
new_blocks.push(dag_to_circuit.call1((res,))?);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do this in a follow-up but we should consider adding a variant of this that also works with QuantumCircuit or CircuitData preferably. And so we avoid these two conversions.

Comment on lines +318 to +327
out_dag.apply_operation_back(
py,
gate.into(),
&[qubit],
&[],
Some(new_params),
ExtraInstructionAttributes::default(),
#[cfg(feature = "cache_pygates")]
None,
)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the follow-up we should consider using the stuff that will be added by #13335

crates/accelerate/src/unitary_synthesis.rs Outdated Show resolved Hide resolved
@raynelfss raynelfss enabled auto-merge October 18, 2024 19:18
@raynelfss raynelfss linked an issue Oct 18, 2024 that may be closed by this pull request
@raynelfss raynelfss added this pull request to the merge queue Oct 18, 2024
Merged via the queue into Qiskit:main with commit a9b8f4a Oct 18, 2024
15 checks passed
raynelfss added a commit to raynelfss/qiskit that referenced this pull request Oct 18, 2024
- Leverage usage of new methods in `UnitarySynthesis` after Qiskit#13141 merged.
mtreinish added a commit to mtreinish/qiskit-core that referenced this pull request Oct 28, 2024
This commit fixes a performance regression that was introduced in
PR Qiskit#13141. When the pass is looking up the preferred synthesis direction
for a unitary based on the connectvity constraints the connectivity was
being provided as a PyList. To look up the edge in connectivity set this
meant we needed to iterate over the list and then create a set that rust
could lookup if it contains an edge or it's reverse. This has
significant overhead because its iterating via python and also iterating
per decomposition. This commit addresses this by changing the input type
to be a HashSet from Python so Pyo3 will convert a pyset directly to a
HashSet once at call time and that's used by reference for lookups
directly instead of needing to iterate over the list each time.
mtreinish added a commit to mtreinish/qiskit-core that referenced this pull request Oct 28, 2024
This commit fixes a performance regression that was introduced in
PR Qiskit#13141. When the pass is looking up the preferred synthesis direction
for a unitary based on the connectvity constraints the connectivity was
being provided as a PyList. To look up the edge in connectivity set this
meant we needed to iterate over the list and then create a set that rust
could lookup if it contains an edge or it's reverse. This has
significant overhead because its iterating via python and also iterating
per decomposition. This commit addresses this by changing the input type
to be a HashSet from Python so Pyo3 will convert a pyset directly to a
HashSet once at call time and that's used by reference for lookups
directly instead of needing to iterate over the list each time.
github-merge-queue bot pushed a commit that referenced this pull request Oct 29, 2024
* Fix performance regression in UnitarySynthesis

This commit fixes a performance regression that was introduced in
PR #13141. When the pass is looking up the preferred synthesis direction
for a unitary based on the connectvity constraints the connectivity was
being provided as a PyList. To look up the edge in connectivity set this
meant we needed to iterate over the list and then create a set that rust
could lookup if it contains an edge or it's reverse. This has
significant overhead because its iterating via python and also iterating
per decomposition. This commit addresses this by changing the input type
to be a HashSet from Python so Pyo3 will convert a pyset directly to a
HashSet once at call time and that's used by reference for lookups
directly instead of needing to iterate over the list each time.

* Avoid constructing plugin data views with default plugin

If we're using the default plugin the execution will happen in rust
and we don't need to build the plugin data views that are defined in the
plugin interface. Profiling the benchpress test using the hamlib
hamiltonian:

ham_graph-2D-grid-nonpbc-qubitnodes_Lx-5_Ly-186_h-0.5-all-to-all

that caught this originally regression was showing an inordinate amount
of time being spent in the construction of
`_build_gate_lengths_by_qubit` and `_build_gate_errors_by_qubit` which
isn't being used because this happens internally in rust now. To
mitigate this overhead this commit migrates the sections computing these
values to the code branch that uses them and not the default plugin
branch that uses rust.
@ShellyGarion ShellyGarion added the Changelog: None Do not include in changelog label Nov 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Changelog: None Do not include in changelog mod: transpiler Issues and PRs related to Transpiler performance Rust This PR or issue is related to Rust code in the repository
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Port UnitarySynthesis to Rust
6 participants