The Validation Gauntlet

The 11 gates

Every strategy is judged against the same 11 pre-registered gates — locked before any out-of-sample number is computed, so nothing can be tuned to pass. Most strategies fail at least one. The honest ones tell you which.

Minimum sample

≥ 100 trades / rebalances

Why it matters. Statistics need n. A strategy with 30 trades has no signal you can trust — only noise that happens to look like one.

CAUGHT NightMeanRevert produced ~8 trades/year — 8× short of the bar and statistically un-validatable.

Profit factor

PF ≥ 1.20 net of costs

Why it matters. Gross profit on a backtest is free. A real edge has to clear a deployment bar after realistic spread, slippage, commission and swap.

CAUGHT Basket-grid PF 0.51, Forex Fury 0.73–0.90, D1 trend 0.94, carry 0.84 — all below the line.
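
As a rough illustration of this gate, the sketch below computes profit factor before and after costs. The trade values, the flat `cost_per_trade`, and the helper name `profit_factor` are assumptions made for the example, not the actual cost model behind the scoreboard.

```python
def profit_factor(trade_pnls):
    """Gross profit divided by gross loss, both taken as positive numbers."""
    gross_profit = sum(p for p in trade_pnls if p > 0)
    gross_loss = -sum(p for p in trade_pnls if p < 0)
    if gross_loss == 0:
        # No losing trades: PF is unbounded (or 0.0 if nothing was won either).
        return float("inf") if gross_profit > 0 else 0.0
    return gross_profit / gross_loss


# Hypothetical per-trade results in account currency, before costs.
gross_pnls = [120.0, -80.0, 60.0, -50.0, 90.0, -40.0]
cost_per_trade = 5.0  # assumed all-in cost: spread + slippage + commission + swap

net_pnls = [p - cost_per_trade for p in gross_pnls]
pf_gross = profit_factor(gross_pnls)
pf_net = profit_factor(net_pnls)
passes_pf_gate = pf_net >= 1.20
print(f"gross PF {pf_gross:.2f}, net PF {pf_net:.2f}, passes: {passes_pf_gate}")
```

In this toy sample, costs pull the profit factor from about 1.59 down to about 1.38: every winner shrinks and every loser grows, so the ratio is squeezed from both sides.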

Risk-adjusted return

Annualized Sharpe ≥ 0.6

Why it matters. Profit factor ignores volatility. Sharpe asks whether you are paid for the risk you take, not just whether you end up green.

CAUGHT Crabel NR7 Sharpe −0.83; hardened carry −0.29.
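
A minimal sketch of an annualized Sharpe calculation, assuming daily returns, 252 periods per year, and a zero risk-free rate. None of these conventions is stated in the gate itself, and the sample returns are hypothetical.

```python
import math


def annualized_sharpe(returns, periods_per_year=252, risk_free_per_period=0.0):
    """Mean excess return over its sample standard deviation, scaled to a year."""
    excess = [r - risk_free_per_period for r in returns]
    n = len(excess)
    mu = sum(excess) / n
    var = sum((x - mu) ** 2 for x in excess) / (n - 1)  # sample variance
    if var == 0:
        raise ValueError("zero volatility: Sharpe is undefined")
    return mu / math.sqrt(var) * math.sqrt(periods_per_year)


# Hypothetical daily returns.
daily_returns = [0.010, -0.005, 0.004, -0.002, 0.003, -0.006, 0.007, 0.001]
sharpe = annualized_sharpe(daily_returns)
passes_sharpe_gate = sharpe >= 0.6
print(f"annualized Sharpe {sharpe:.2f}, passes: {passes_sharpe_gate}")
```

The square-root-of-time scaling assumes returns are roughly uncorrelated from one period to the next, so strongly autocorrelated returns would make this figure optimistic.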

Survivability

Max drawdown ≤ 12%

Why it matters. A small account cannot ride out a 50% drawdown, psychologically or financially, long enough to see the recovery. Drawdown is the gate that kills real deployments.

CAUGHT Turtle −49% and NR7 −50% fail here even when the edge is real.
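
Maximum drawdown can be sketched as the largest peak-to-trough fall of the compounded equity curve. The version below assumes simple per-period returns that compound, and the sample returns are hypothetical.

```python
def max_drawdown(returns):
    """Largest fractional decline from a running equity peak (0.25 means 25%)."""
    equity = 1.0
    peak = 1.0
    worst = 0.0
    for r in returns:
        equity *= 1.0 + r          # compound the period return
        peak = max(peak, equity)   # highest equity seen so far
        worst = max(worst, (peak - equity) / peak)
    return worst


# Hypothetical period returns.
period_returns = [0.10, -0.20, 0.05, -0.10, 0.30]
mdd = max_drawdown(period_returns)
passes_dd_gate = mdd <= 0.12
print(f"max drawdown {mdd:.1%}, passes: {passes_dd_gate}")
```

In this sample the equity peaks at 1.10 and troughs at 0.8316, a 24.4% drawdown, so the strategy fails the 12% gate even though it ends higher than it started.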

Consistency

Positive in ≥ 60% of years

Why it matters. One lucky regime is not an edge. A durable strategy makes money across most periods, not in a single favourable window.

CAUGHT Carry was positive in only 38% of years; NR7 in 20%.

Bootstrap significance

Block-bootstrap 95% lower-bound Sharpe > 0

Why it matters. Resampling blocks of returns asks: given the autocorrelation in the data, is the Sharpe reliably above zero, or could this path have been luck?

CAUGHT Carry lower-bound −0.83; funding harvest −1.54.
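
The resampling idea can be sketched with a circular block bootstrap. The block length of 20, the 500 resamples, and the reading of "95% lower bound" as the one-sided 5th percentile are all assumptions for illustration, and the return series is synthetic.

```python
import math
import random


def sharpe(returns, periods_per_year=252):
    n = len(returns)
    mu = sum(returns) / n
    var = sum((r - mu) ** 2 for r in returns) / (n - 1)
    return 0.0 if var == 0 else mu / math.sqrt(var) * math.sqrt(periods_per_year)


def block_bootstrap_lower_sharpe(returns, block_len=20, n_boot=500, alpha=0.05, seed=0):
    """Approximate alpha-quantile of the Sharpe across circular block resamples."""
    rng = random.Random(seed)
    n = len(returns)
    n_blocks = math.ceil(n / block_len)
    stats = []
    for _ in range(n_boot):
        sample = []
        for _ in range(n_blocks):
            start = rng.randrange(n)
            # Copy one contiguous block, wrapping around the end of the series
            # so that local autocorrelation inside the block is preserved.
            sample.extend(returns[(start + k) % n] for k in range(block_len))
        stats.append(sharpe(sample[:n]))
    stats.sort()
    return stats[int(alpha * n_boot)]


# Synthetic daily returns with a positive drift (illustration only).
data_rng = random.Random(1)
daily_returns = [data_rng.gauss(0.002, 0.005) for _ in range(500)]
lower_bound = block_bootstrap_lower_sharpe(daily_returns)
passes_bootstrap_gate = lower_bound > 0
print(f"lower-bound Sharpe {lower_bound:.2f}, passes: {passes_bootstrap_gate}")
```

Resampling whole blocks rather than single returns is the point of the gate: shuffling individual observations would destroy the autocorrelation and make the lower bound look tighter than it is.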

Placebo (the kill shot)

Real PF > 95th percentile of ≥ 200 random-permutation placebos

Why it matters. Re-run the system hundreds of times with the trade labels or signal signs shuffled. If random does as well as your signal, you have no signal — just a backtest-shaped object.

CAUGHT Carry lost to 33% of random sign-permutations; ETH/BTC pairs lost to 58%. This gate retires more strategies than any other.
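
One way to picture the placebo test is the sketch below, which shuffles a +1/−1 signal across time and compares the real profit factor with the placebo distribution. The signal model, the synthetic market returns, and the nearest-rank percentile rule are assumptions for the example, not the production procedure.

```python
import random


def profit_factor(pnls):
    gross_profit = sum(p for p in pnls if p > 0)
    gross_loss = -sum(p for p in pnls if p < 0)
    if gross_loss == 0:
        return float("inf") if gross_profit > 0 else 0.0
    return gross_profit / gross_loss


def placebo_gate(signals, market_returns, n_placebos=200, percentile=95, seed=0):
    """Return (real PF, placebo threshold, share of placebos >= real, pass flag)."""
    rng = random.Random(seed)
    real_pf = profit_factor([s * r for s, r in zip(signals, market_returns)])
    shuffled = list(signals)
    placebo_pfs = []
    for _ in range(n_placebos):
        rng.shuffle(shuffled)  # break any link between signal and next return
        placebo_pfs.append(profit_factor([s * r for s, r in zip(shuffled, market_returns)]))
    placebo_pfs.sort()
    # Nearest-rank percentile, computed in integers to avoid float rounding.
    rank = (percentile * n_placebos + 99) // 100
    threshold = placebo_pfs[rank - 1]
    beaten_by = sum(pf >= real_pf for pf in placebo_pfs) / n_placebos
    return real_pf, threshold, beaten_by, real_pf > threshold


# Synthetic market and a hypothetical signal that is right 70% of the time.
rng = random.Random(42)
market_returns = [rng.gauss(0.0, 0.01) for _ in range(300)]
signals = [(1 if r > 0 else -1) * (1 if rng.random() < 0.7 else -1) for r in market_returns]

real_pf, threshold, beaten_by, passes_placebo_gate = placebo_gate(signals, market_returns)
print(f"real PF {real_pf:.2f}, placebo 95th pct {threshold:.2f}, "
      f"lost to {beaten_by:.0%} of placebos, passes: {passes_placebo_gate}")
```

The `beaten_by` figure corresponds to the "lost to 33% of random sign-permutations" style of reporting: it is the share of shuffled runs that did at least as well as the real signal.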

Cost robustness

2× cost stress: PF > 1.0

Why it matters. You never know real costs exactly. Double them. A thin edge that only survives at modelled costs is not deployable.

CAUGHT Funding harvest collapsed to PF 0.49 under 2× costs — the binding constraint that flagged its whipsaw drag.
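
The cost stress amounts to re-computing net profit factor with every trade's cost multiplied by two. In the sketch below the gross results and per-trade costs are hypothetical, chosen to show a thin edge that survives modelled costs but not doubled ones.

```python
def profit_factor(pnls):
    gross_profit = sum(p for p in pnls if p > 0)
    gross_loss = -sum(p for p in pnls if p < 0)
    if gross_loss == 0:
        return float("inf") if gross_profit > 0 else 0.0
    return gross_profit / gross_loss


def stressed_profit_factor(gross_pnls, costs, multiplier=1.0):
    """Net PF after subtracting each trade's modelled cost times a multiplier."""
    return profit_factor([g - multiplier * c for g, c in zip(gross_pnls, costs)])


# Hypothetical gross trade results and modelled per-trade costs.
gross_pnls = [30.0, -20.0, 25.0, -22.0, 18.0, -15.0]
costs = [2.0] * len(gross_pnls)

pf_modelled = stressed_profit_factor(gross_pnls, costs, multiplier=1.0)
pf_stressed = stressed_profit_factor(gross_pnls, costs, multiplier=2.0)
passes_cost_gate = pf_stressed > 1.0
print(f"PF at 1x costs {pf_modelled:.2f}, at 2x costs {pf_stressed:.2f}, "
      f"passes: {passes_cost_gate}")
```

Here the profit factor is about 1.06 at modelled costs and about 0.88 once costs are doubled, which is the pattern this gate is meant to flag.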

Multiple-testing correction

Deflated Sharpe Ratio (trial-adjusted PSR) > 0.95

Why it matters. The #1 cause of fake backtests is trying many configurations and reporting the winner. The deflated Sharpe penalises that selection. It is the math that catches data-mining.

CAUGHT Carry DSR 0.012; funding harvest 0.847 — both flagged as selection-fragile.
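
For readers who want the mechanics, here is a sketch of the deflated Sharpe ratio following the published Bailey and López de Prado formulation, using per-period (not annualized) Sharpe ratios. The inputs `n_trials` and `trial_sr_std` (the spread of Sharpe ratios across the configurations tried) are assumptions you would have to record yourself, and the returns are synthetic.

```python
import math
import random
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
STD_NORMAL = NormalDist()


def probabilistic_sharpe(returns, benchmark_sr=0.0):
    """P(true per-period Sharpe > benchmark_sr), adjusted for skew and kurtosis."""
    n = len(returns)
    mu = sum(returns) / n
    sd = math.sqrt(sum((r - mu) ** 2 for r in returns) / n)
    sr = mu / sd
    skew = sum(((r - mu) / sd) ** 3 for r in returns) / n
    kurt = sum(((r - mu) / sd) ** 4 for r in returns) / n  # about 3 for normal returns
    z = (sr - benchmark_sr) * math.sqrt(n - 1)
    z /= math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr * sr)
    return STD_NORMAL.cdf(z)


def expected_max_sharpe(n_trials, trial_sr_std):
    """Sharpe the best of n_trials skill-less configurations is expected to show."""
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    a = STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    b = STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return trial_sr_std * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)


def deflated_sharpe(returns, n_trials, trial_sr_std):
    """PSR measured against the expected-maximum benchmark instead of zero."""
    return probabilistic_sharpe(returns, expected_max_sharpe(n_trials, trial_sr_std))


# Synthetic daily returns; 50 trials and a 0.05 spread are assumed inputs.
rng = random.Random(7)
returns = [rng.gauss(0.001, 0.01) for _ in range(750)]
dsr = deflated_sharpe(returns, n_trials=50, trial_sr_std=0.05)
passes_dsr_gate = dsr > 0.95
print(f"DSR {dsr:.3f}, passes: {passes_dsr_gate}")
```

The more configurations were tried, the higher the benchmark the observed Sharpe has to clear, so the same return series can pass after two trials and fail after fifty.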

Concentration

No single instrument / component / year > 40% of P/L

Why it matters. A "portfolio" edge carried by one instrument is one instrument with a story. Concentration is fragility wearing a diversification costume.

CAUGHT D1 trend made money only on gold; Turtle 43% and VRP 51% concentration fail this gate.
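
A simple way to check concentration is to take each component's share of total net P/L. The sketch below assumes that definition, treats a non-positive total as an automatic fail, and uses hypothetical instrument names and figures. The same function would apply to P/L grouped by year or by sub-strategy.

```python
def max_pnl_share(pnl_by_component):
    """Return (component, share of total net P/L) for the largest contributor."""
    total = sum(pnl_by_component.values())
    name, pnl = max(pnl_by_component.items(), key=lambda kv: kv[1])
    if total <= 0:
        # Assumption: with no net profit there is nothing to attribute,
        # so report an unbounded share and let the gate fail.
        return name, float("inf")
    return name, pnl / total


# Hypothetical net P/L per instrument.
pnl_by_instrument = {"XAUUSD": 450.0, "EURUSD": 250.0, "GBPUSD": 180.0, "USDJPY": 120.0}

top_name, top_share = max_pnl_share(pnl_by_instrument)
passes_concentration_gate = top_share <= 0.40
print(f"{top_name} contributes {top_share:.0%} of P/L, passes: {passes_concentration_gate}")
```

With these numbers a single instrument carries 45% of the profit, so the portfolio fails the 40% limit even though all four instruments are profitable.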

Walk-forward persistence

Out-of-sample PF ≥ 0.9× in-sample PF

Why it matters. Re-fit on a rolling window, test forward, repeat. A real edge persists; an overfit one decays the moment it leaves the data it was tuned on.

CAUGHT The decay test that separates a robust signal from a curve fit.

How the method works

What is The Validation Gauntlet?

The Validation Gauntlet is the pre-registered 11-gate validation framework Validated Strategies uses to test every trading strategy. The specification — instruments, costs, rules, and all 11 pass/fail thresholds — is locked before any out-of-sample metric is computed, so nothing can be tuned after the fact to make a result pass.

What is the placebo gate and why does it matter most?

The placebo gate re-runs the strategy 200+ times with its trade labels or signal signs randomly shuffled, then checks whether the real profit factor beats the 95th percentile of those random runs. If shuffled-random does as well as your signal, the signal carries no information. It is the single gate that retires the most strategies — the carry trade and the ETH/BTC pairs trade both died here.

Why is a deflated Sharpe ratio used?

Because the most common way backtests lie is selection: try fifty configurations, report the best one. The deflated Sharpe ratio builds on the probabilistic Sharpe ratio, which adjusts the Sharpe for sample length and the non-normality of returns, and then raises the benchmark from zero to the Sharpe that the best of that many skill-less trials would be expected to show. The result is the probability that the true Sharpe clears that benchmark. It is the explicit multiple-testing correction in the battery.

See the gates in action on the scoreboard — every strategy lists exactly which gates it passed and failed.