Regime Backtest Plan

Updated: 2026-05-24 12:50 CEST

This plan defines the calibration work required before ibkr regime may expose forecast probabilities. Until then, the dashboard should continue to report evidence balance, risk score, confidence, bands, and scoped warnings only.

Historical Inputs

Build a daily panel with one row per U.S. trading day, using only data that would have been observable at that day's decision time.

evaluates intraday reads.

streak state.

source revisions.

dates and revisions.

coverage, warning details, and cache timestamp.

50/200-DMA percentages, new-high/new-low counts, and cache timestamp.

a licensed point-in-time source. Keep it outside the production feature set until the live sourcing question is solved; never fill it with ETF/futures proxies.

Target Stress Definitions

Test several targets rather than tune to one story:

days exceeding 3%, 5%, and 8%.

change exceeding a fixed percentile of its trailing distribution.

observations.

USD/JPY yen-strengthening shock occur inside the target window. MOVE can be evaluated as a candidate target only after a licensed point-in-time source is available.

Targets must be evaluated from the timestamp at which the dashboard would have been read; do not let later official revisions change the historical feature row used for that read.

Walk-Forward Calibration

Use expanding or rolling walk-forward windows:

probability model is actually fitted), false-alarm rate, missed-stress rate, and average lead time.

cluster worst-band count, and any proposed cluster weighting.

Look-Ahead Controls

conservative publication delay and label that assumption.

chains, OI, IV, and the exact method version can be reconstructed.

a missing critical row into a fake green/yellow/red reading.

Missing Rows and Revisions

Score each historical row with the same contract as live JSON:

Evaluate both the predictive signal and the coverage signal. A model that looks good only after dropping unavailable gamma/breadth/official-file days is not a valid live dashboard model.

Cluster Weight Evaluation

Test cluster logic separately from row thresholds:

label, regardless of green row count.

Promote a learned weight only if it improves out-of-sample utility without making stale/unavailable critical rows look safer than they are.

Current Recommendation

Keep the present threshold labels marked heuristic: true and pending_backtest: true. Do not add forecast probabilities. The next useful implementation step is a point-in-time data builder that writes one compact daily JSONL row matching the live ibkr regime --json shape, followed by a separate notebook/script that evaluates the targets above.