We ran 100 circuits across 4 quantum vendors. Here's the r(τ) you can't unsee.
A two-point empirical observation from the v0.9 Phase E cross-vendor validation: the correlation between Qlro's predicted fidelity and hardware-observed fidelity decays measurably as the calibration snapshot ages. This is a physical observable nobody in the quantum benchmarking community has characterised yet.
The setup
100 executions across IBM Heron, IQM (Garnet + Emerald), Rigetti Cepheus-1-108Q, and IonQ Forte. Five out-of-sample circuit families (GHZ, Bernstein–Vazirani, VQE ansatz, Deep Ladder, W-state) spanning 15× in 2-qubit-gate count and 12× in depth. The mixture-consistent predictor F̂ = K · F_ideal + (1−K) · F_uniform was scored against observed hardware metric values.
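As a concrete sketch of the predictor, here is a minimal implementation. The names are illustrative only; how K is estimated from the calibration snapshot, and what F_ideal and F_uniform are for a given circuit family, are defined in the paper and taken as given inputs here:

```python
import numpy as np

# Illustrative sketch of the mixture-consistent predictor.
# K, F_ideal, and F_uniform are assumed inputs (scalars or arrays over
# circuits); estimating K from the calibration snapshot is out of scope.
def predicted_fidelity(K, F_ideal, F_uniform):
    """F̂ = K * F_ideal + (1 - K) * F_uniform, elementwise over circuits."""
    K = np.asarray(K, dtype=float)
    return K * np.asarray(F_ideal, dtype=float) \
         + (1.0 - K) * np.asarray(F_uniform, dtype=float)
```

At K = 1 the predictor returns the ideal fidelity; at K = 0 it falls back to the uniform-mixture value, so F̂ always lies between the two.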
The two points
| Subset | N | τ (days since Metriq snapshot) | Pearson r |
|---|---|---|---|
| Same-day (IQM + IonQ) | 70 | ≈ 0 | 0.964 |
| Rigetti Cepheus | 30 | 3 | 0.690 |
r(τ) at τ=0 and τ=3 days on comparable superconducting / trapped-ion hardware. Source: v1.2 paper Table 18b.
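For reproducibility: the r in the table is the standard sample Pearson correlation between predicted and observed fidelities over each subset. A minimal NumPy version, for anyone re-running the comparison on their own executions:

```python
import numpy as np

def pearson_r(pred, obs):
    """Sample Pearson correlation between predicted and observed values."""
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    pc = pred - pred.mean()
    oc = obs - obs.mean()
    return float((pc @ oc) / np.sqrt((pc @ pc) * (oc @ oc)))
```

This is equivalent to `scipy.stats.pearsonr(pred, obs)[0]`, without the SciPy dependency.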
A drop from 0.96 to 0.69 over three days is a lot. It is roughly what you'd expect if the calibration drift at a superconducting device saturates on a multi-day timescale — but as far as we are aware it is the first quantitative data point on this question, published ahead of any vendor-neutral systematic study.
Why this matters for device selection
Two consequences, both operational rather than theoretical:
- No single number answers "how accurate is Qlro?" The answer depends on calibration age at execution. A paper or procurement document that quotes r = 0.96 without saying when the snapshot was taken is reporting one point on a decay curve, not a steady-state accuracy.
- The economically meaningful quantity is ⟨r(τ)⟩ over realistic deployment τ distributions, not r at any fixed τ. If your team executes once a week and the snapshot is refreshed monthly, the number that matters is the integral under r(τ) between τ = 0 and τ ≈ 14 days.
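A hedged sketch of what computing ⟨r(τ)⟩ looks like, assuming — purely for illustration, neither assumption is established by the two data points — a linear decay through (τ = 0, r = 0.964) and (τ = 3, r = 0.690), clamped at zero, and a uniform deployment distribution of τ on [0, 14] days:

```python
import numpy as np

# Illustrative r(τ) model: linear decay through the two measured points,
# clamped at 0. A real model needs more points and a principled floor.
def r_of_tau(tau):
    slope = (0.690 - 0.964) / 3.0            # per-day decay between the points
    return np.maximum(0.964 + slope * np.asarray(tau, float), 0.0)

# ⟨r(τ)⟩ under a uniform τ distribution on [0, 14] days,
# approximated as the mean of r(τ) on a dense grid.
taus = np.linspace(0.0, 14.0, 10_000)
mean_r = float(r_of_tau(taus).mean())
```

The point of the sketch is the mechanics, not the number: under any τ distribution, the deployment-relevant accuracy is an average over the decay curve, which is strictly below the headline same-day r.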
What the Cepheus subset is not
A fair critique: r = 0.69 on Cepheus could be drift, could be a single recalibration event, could be routing differences, could be transpiler changes. With one time point and one vendor we cannot say which. What we can say is that shot noise is ruled out statistically (a ~30 pp drift in mean GHZ fidelity between Phase A and Phase E is ≥20σ shot-noise on 1000-shot binomials) and that every alternative explanation, in practice, reduces to the same product requirement: calibration freshness matters for the estimator's absolute accuracy.
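The shot-noise bound is quick to verify: a fidelity estimated from 1000 shots has a binomial standard error of at most √(0.25/1000) ≈ 1.6 pp (the p = 0.5 worst case), so a 30 pp shift is ≈19σ even under the worst-case error, and larger once the actual fidelities (away from 0.5) are used — consistent with the ≥20σ figure:

```python
import math

# Worst-case binomial standard error for a fidelity estimated from
# 1000 shots; the error is maximized at p = 0.5.
shots = 1000
se_max = math.sqrt(0.5 * 0.5 / shots)   # ≈ 0.0158, i.e. 1.6 pp

# A 30 pp drift expressed in worst-case shot-noise sigmas.
z = 0.30 / se_max
```

For fidelities nearer 0 or 1, `se_max` shrinks and the significance grows, so the worst case is the conservative bound.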
Why we publish this
A vendor-neutral benchmarking product has one obligation a vendor does not: it must report the conditions under which its numbers stop holding. The headline r=0.96 on our stable subset is real. The r=0.69 on the drifted subset is also real. Publishing only the first would be marketing; publishing both, with the framing that they are points on a physical decay curve, is science.