Self-audit
April 24, 2026 · ~6 min read

Two of our four capability axes were perfectly rank-correlated on 9 devices. Here's what we did.

The Workload-Conditioned Physical Projection (WCPP) framework decomposes device capability into four axes: connectivity (Γ), coherence (Φ), gate fidelity (F), and throughput (T). A reviewer asked the obvious question — are they actually independent on real data? We measured. One pair was fine. One pair was not.

The feared pair, and the feared answer

The most intuitive worry: Γ (from BSEQ, which measures Bell-inequality violation) and F (from EPLG / WIT, which measures per-gate fidelity) both depend on 2-qubit gate quality. So they should be strongly positively correlated, right?

Measured: ρ(Γ, F) = −0.56 on N = 11. Negative, moderate. Not what the reviewer expected, and not what we expected either.

Physical interpretation: current superconducting hardware reaches high Γ (Heron-class IBM devices, dense coupling graphs) at the cost of per-gate error from crosstalk. Conversely, sparse or small devices (wukong_72, IQM Garnet) achieve high F by being well-isolated. BSEQ and EPLG are measuring genuinely different quantities; in the current market, the engineering pressures on the two push against each other.

So the most-feared collinearity does not exist. Good.

The pair we did not fear, and the answer that broke

Measured: ρ(Γ, Φ) = +0.76 Pearson, ρs = +1.00 Spearman on N = 9. Rank-perfect. Under a two-sided permutation test, p = 2/9! ≈ 5.5 × 10⁻⁶. Not a small-N artifact.
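The exact permutation p-value is easy to reproduce. A minimal sketch (the rank vectors here are illustrative, not the real snapshot; the real computation is scripts/axis_correlation.py): enumerate all 9! rank assignments for Φ and count those at least as extreme as the observed |ρs| = 1.

```python
import itertools
import math

def spearman_from_ranks(r1, r2):
    # Spearman rho via the classic d^2 formula (valid when there are no ties)
    n = len(r1)
    d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

n = 9
base = list(range(n))                            # Γ ranks for the 9 devices
observed = abs(spearman_from_ranks(base, base))  # Φ is rank-identical: |ρs| = 1

# Exact two-sided test: enumerate all 9! = 362,880 rank assignments for Φ
hits = sum(
    abs(spearman_from_ranks(base, perm)) >= observed - 1e-12
    for perm in itertools.permutations(base)
)
p_value = hits / math.factorial(n)
# Only the identity and the full reversal reach |ρs| = 1, so hits == 2
# and p_value == 2/9! ≈ 5.5e-6
```

Only two of the 362,880 permutations (the identity and the exact reversal) achieve |ρs| = 1, which is where 2/9! comes from.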

This was not the pair we worried about. It is the pair that turns out to be problematic. Every measured device in our snapshot ranks the same way on Γ as on Φ. That is the empirical signature of two axes carrying the same information, which is precisely what the framework's workload-conditioning logic requires not to happen.

The chemistry workload's default weights put 15% on Γ and 30% on Φ (F carries 45%, T carries 10%). If Γ and Φ are rank-identical on the measured subset, the effective independent information in the Φ weight is close to zero on those devices. The chemistry ranking is not invalidated, but it is doing less than the four-axis story suggests.
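To make "close to zero effective information" concrete, here is a toy sketch. It assumes WCPP aggregates axes as a plain weighted sum of normalised scores (an assumption for illustration; the actual aggregation is defined in the paper) and takes the extreme case where Φ is an affine copy of Γ. Folding Φ's 30% into Γ then leaves the ranking unchanged — the Φ weight orders nothing on its own. All device names and values are hypothetical.

```python
# Hypothetical normalised axis values; phi is made an affine copy of gamma,
# the extreme case of the rank-identity seen in the measured snapshot.
devices = {
    "dev_a": {"gamma": 0.9, "F": 0.5, "T": 0.6},
    "dev_b": {"gamma": 0.6, "F": 0.9, "T": 0.4},
    "dev_c": {"gamma": 0.3, "F": 0.7, "T": 0.8},
}
for d in devices.values():
    d["phi"] = 0.8 * d["gamma"] + 0.1

w_chem = {"gamma": 0.15, "phi": 0.30, "F": 0.45, "T": 0.10}
# Fold phi's weight into gamma: 0.15 + 0.30 * 0.8 = 0.39
# (the affine constant only shifts every score by the same amount)
w_folded = {"gamma": 0.39, "phi": 0.00, "F": 0.45, "T": 0.10}

def rank(weights):
    score = lambda name: sum(weights[k] * devices[name][k] for k in weights)
    return sorted(devices, key=score, reverse=True)

# rank(w_chem) == rank(w_folded): phi contributes no independent ordering
```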

What the full matrix says

| Axis pair | N  | Pearson ρ | Spearman ρs | Classification     |
|-----------|----|-----------|-------------|--------------------|
| Γ – F     | 11 | −0.56     | −0.52       | moderate, negative |
| Γ – Φ     | 9  | +0.76     | +1.00       | strong positive    |
| Γ – T     | 6  | +0.48     | +0.37      | moderate           |
| Φ – F     | 9  | ≈ 0       | −0.27       | weak               |
| Φ – T     | 6  | +0.30     | +0.37       | weak               |
| F – T     | 6  | −0.17     | −0.03       | weak               |

Pairwise correlations on the 13-device Metriq snapshot, measured-only subset per pair. Source: scripts/axis_correlation.py and paper §8.8b.
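The "measured-only subset per pair" treatment is simple to reproduce. A minimal sketch, with hypothetical device values and None marking an axis not measured on that device (the shipped computation is scripts/axis_correlation.py):

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs)
                    * sum((y - my) ** 2 for y in ys))
    return num / den

def spearman(xs, ys):
    # Spearman = Pearson on ranks (no tie handling; fine for distinct values)
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for k, i in enumerate(order):
            r[i] = float(k)
        return r
    return pearson(ranks(xs), ranks(ys))

# Hypothetical 5-device slice: None = axis not measured on that device
gamma = [0.2, 0.5, None, 0.8, 0.9]
phi   = [1.1, 2.0, 3.2, None, 4.0]

# Measured-only subset for this pair, as in the table's per-pair N
pair = [(g, p) for g, p in zip(gamma, phi) if g is not None and p is not None]
gs, ps = zip(*pair)
# len(gs) is the per-pair N; the pair above is monotone, so spearman == 1.0
```

Note that N differs per pair exactly because the subset is recomputed per pair, which is why the table mixes N = 11, 9, and 6.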

Three pairs within Pearson ±0.3 of zero. One pair moderately anti-correlated (Γ–F, the one we feared). One pair strongly collinear (Γ–Φ, the one we did not). This is, in raw information-theoretic terms, a 3.5-axis snapshot rather than a 4-axis snapshot.

What we are doing about it

Two options we considered:

  1. Collapse Γ and Φ into a single axis. Rejected — it breaks the interpretability of the workload conditioning (teams want to reason about connectivity and coherence separately even if the current device population does not separate them well), and a neutral-atom or sparse-long-coherence device entering the snapshot would break the pattern anyway.
  2. Report an orthogonalised Φ̃ alongside the native Φ. Accepted. v1.2 will publish Φ̃ = Φ − β · Γ (β estimated by OLS on the current population) and re-rank chemistry / large-depth workloads under both Φ and Φ̃. Any disagreement between the two rankings will be flagged so users see exactly which device decisions depend on the collinear component.

We also considered hoping the problem would go away as the snapshot grew. That is a bet on unseen data, not a fix. We do not want the procurement workflow to depend on "trust us, it will resolve".
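The Φ̃ construction itself is small once β is fit. A sketch with hypothetical axis values (the published numbers come from scripts/axis_correlation.py): fit β by OLS of Φ on Γ, then subtract.

```python
def ols_beta(gamma, phi):
    # Slope of the OLS regression of phi on gamma
    n = len(gamma)
    mg, mp = sum(gamma) / n, sum(phi) / n
    num = sum((g - mg) * (p - mp) for g, p in zip(gamma, phi))
    den = sum((g - mg) ** 2 for g in gamma)
    return num / den

# Hypothetical Γ and Φ values for the measured devices
gamma = [0.20, 0.35, 0.50, 0.65, 0.80, 0.90]
phi   = [1.10, 1.60, 2.30, 2.50, 3.40, 3.60]

beta = ols_beta(gamma, phi)
phi_tilde = [p - beta * g for p, g in zip(phi, gamma)]

# By construction the residual Φ̃ has zero sample covariance with Γ,
# so the orthogonalised axis carries only the non-collinear component.
mg = sum(gamma) / len(gamma)
cov = sum((g - mg) * t for g, t in zip(gamma, phi_tilde))
# abs(cov) is ~0 up to floating-point noise
```

Any device whose chemistry rank changes between Φ and Φ̃ is exactly a device whose ranking rests on the collinear component — which is what v1.2 will flag.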

Why this is in a blog post, not a press release

We found a weakness in our own framework by measuring it. The framework paper ships with a dedicated §8.8b describing the result and a specific v1.2 remediation plan rather than a vague caveat. In a commercial benchmarking product, that transparency is the product: the alternative is that someone else finds the weakness first, in their own procurement review, and the story stops being under our control.

Source paper: Qlro WCPP v1.2, Zenodo DOI 10.5281/zenodo.19785800, §8.8b "Axis correlation — testing the four-axis independence assumption".
Reproduce: scripts/axis_correlation.py · runs in ~5 seconds on the shipped Metriq snapshot.