We ran 100 circuits across 4 quantum vendors. Here's the r(τ) you can't unsee.
A two-point empirical observation from the v0.9 Phase E cross-vendor validation: the correlation between Qlro's predicted fidelity and hardware-observed fidelity decays measurably as the calibration snapshot ages. This is a physical observable nobody in the quantum benchmarking community has characterised yet.
The setup
100 executions across IBM Heron, IQM (Garnet + Emerald), Rigetti Cepheus-1-108Q, and IonQ Forte. Five out-of-sample circuit families (GHZ, Bernstein–Vazirani, VQE ansatz, Deep Ladder, W-state) spanning 15× in 2-qubit-gate count and 12× in depth. The mixture-consistent predictor F̂ = K · F_ideal + (1−K) · F_uniform was scored against observed hardware metric values.
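As a concrete sketch of the predictor, here is a minimal implementation. The names are illustrative only; how K is estimated from the calibration snapshot, and what F_ideal and F_uniform are for a given circuit family, are defined in the paper and taken as given inputs here:

```python
import numpy as np

# Illustrative sketch of the mixture-consistent predictor.
# K, F_ideal, and F_uniform are assumed inputs (scalars or arrays over
# circuits); estimating K from the calibration snapshot is out of scope.
def predicted_fidelity(K, F_ideal, F_uniform):
    """F̂ = K * F_ideal + (1 - K) * F_uniform, elementwise over circuits."""
    K = np.asarray(K, dtype=float)
    return K * np.asarray(F_ideal, dtype=float) \
         + (1.0 - K) * np.asarray(F_uniform, dtype=float)
```

At K = 1 the predictor returns the ideal fidelity; at K = 0 it falls back to the uniform-mixture value, so F̂ always lies between the two.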
The two points
| Subset | N | τ (days since Metriq snapshot) | Pearson r |
|---|---|---|---|
| Same-day (IQM + IonQ) | 70 | ≈ 0 | 0.964 |
| Rigetti Cepheus | 30 | 3 | 0.690 |
r(τ) at τ=0 and τ=3 days on comparable superconducting / trapped-ion hardware. Source: v1.2 paper Table 18b.
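For reproducibility: the r in the table is the standard sample Pearson correlation between predicted and observed fidelities over each subset. A minimal NumPy version, for anyone re-running the comparison on their own executions:

```python
import numpy as np

def pearson_r(pred, obs):
    """Sample Pearson correlation between predicted and observed values."""
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    pc = pred - pred.mean()
    oc = obs - obs.mean()
    return float((pc @ oc) / np.sqrt((pc @ pc) * (oc @ oc)))
```

This is equivalent to `scipy.stats.pearsonr(pred, obs)[0]`, without the SciPy dependency.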
A drop from 0.96 to 0.69 over three days is a lot. It is roughly what you'd expect if the calibration drift at a superconducting device saturates on a multi-day timescale — but as far as we are aware it is the first quantitative data point on this question, published ahead of any vendor-neutral systematic study.
Why this matters for device selection
Two consequences, both operational rather than theoretical:
- No single number answers "how accurate is Qlro?" The answer depends on calibration age at execution. A paper or procurement document that quotes r = 0.96 without saying when the snapshot was taken is reporting one point on a decay curve, not a steady-state accuracy.
- The economically meaningful quantity is ⟨r(τ)⟩ over realistic deployment τ distributions, not r at any fixed τ. If your team executes once a week and the snapshot is refreshed monthly, the number that matters is the integral under r(τ) between τ = 0 and τ ≈ 14 days.
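A hedged sketch of what computing ⟨r(τ)⟩ looks like, assuming — purely for illustration, neither assumption is established by the two data points — a linear decay through (τ = 0, r = 0.964) and (τ = 3, r = 0.690), clamped at zero, and a uniform deployment distribution of τ on [0, 14] days:

```python
import numpy as np

# Illustrative r(τ) model: linear decay through the two measured points,
# clamped at 0. A real model needs more points and a principled floor.
def r_of_tau(tau):
    slope = (0.690 - 0.964) / 3.0            # per-day decay between the points
    return np.maximum(0.964 + slope * np.asarray(tau, float), 0.0)

# ⟨r(τ)⟩ under a uniform τ distribution on [0, 14] days,
# approximated as the mean of r(τ) on a dense grid.
taus = np.linspace(0.0, 14.0, 10_000)
mean_r = float(r_of_tau(taus).mean())
```

The point of the sketch is the mechanics, not the number: under any τ distribution, the deployment-relevant accuracy is an average over the decay curve, which is strictly below the headline same-day r.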
What the Cepheus subset is not
A fair critique: r = 0.69 on Cepheus could be drift, could be a single recalibration event, could be routing differences, could be transpiler changes. With one time point and one vendor we cannot say which. What we can say is that shot noise is ruled out statistically (a ~30 pp drift in mean GHZ fidelity between Phase A and Phase E is ≥20σ shot-noise on 1000-shot binomials) and that every alternative explanation, in practice, reduces to the same product requirement: calibration freshness matters for the estimator's absolute accuracy.
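The shot-noise bound is quick to verify: a fidelity estimated from 1000 shots has a binomial standard error of at most √(0.25/1000) ≈ 1.6 pp (the p = 0.5 worst case), so a 30 pp shift is ≈19σ even under the worst-case error, and larger once the actual fidelities (away from 0.5) are used — consistent with the ≥20σ figure:

```python
import math

# Worst-case binomial standard error for a fidelity estimated from
# 1000 shots; the error is maximized at p = 0.5.
shots = 1000
se_max = math.sqrt(0.5 * 0.5 / shots)   # ≈ 0.0158, i.e. 1.6 pp

# A 30 pp drift expressed in worst-case shot-noise sigmas.
z = 0.30 / se_max
```

For fidelities nearer 0 or 1, `se_max` shrinks and the significance grows, so the worst case is the conservative bound.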
Why we publish this
A vendor-neutral benchmarking product has one obligation a vendor does not: it must report the conditions under which its numbers stop holding. The headline r=0.96 on our stable subset is real. The r=0.69 on the drifted subset is also real. Publishing only the first would be marketing; publishing both, with the framing that they are points on a physical decay curve, is science.