Briefing · structural — no expiry

Backtests always look good

Issued 2026-07-05 · established framework · Confidence high

Every backtested track record a reader is shown is a survivor, and the surviving is the trick. The chart arrived because it looked good; the thousands that looked bad were never sent. Before examining any specific performance claim, it pays to understand the machine that manufactures them — because the machine guarantees a brilliant-looking output whether or not any skill exists anywhere in the process.

Built after the fact

A backtest is a search through data that already happened. The builder chooses indicators, thresholds, entry rules, exit rules, and date ranges, then adjusts until the historical curve looks right. With enough adjustable parts, any past can be fitted perfectly — including its random noise — and the coincidences captured in the fit will not repeat.
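The fitting process can be sketched in a few lines. This is a minimal illustration, not any particular vendor's method: the rule family (trade in the direction of a trailing-sum signal), the names `rule_returns` and `annualised_sharpe`, the 252-day year, and the seed are all assumptions made for the example. The "market" is pure zero-mean noise, so there is nothing real to find.

```python
import math
import random
import statistics

def annualised_sharpe(returns, periods_per_year=252):
    """Annualised Sharpe ratio of a daily return series, risk-free rate taken as zero."""
    return statistics.fmean(returns) / statistics.stdev(returns) * math.sqrt(periods_per_year)

def rule_returns(market, lookback, direction):
    """Daily returns from holding `direction` times the sign of the trailing `lookback`-day sum."""
    out = []
    for t in range(lookback, len(market)):
        signal = 1 if sum(market[t - lookback:t]) > 0 else -1
        out.append(direction * signal * market[t])
    return out

rng = random.Random(7)
# Pure noise: zero-mean daily returns with no exploitable pattern by construction.
history = [rng.gauss(0.0, 0.01) for _ in range(1000)]  # data the builder can see
future = [rng.gauss(0.0, 0.01) for _ in range(1000)]   # data the builder never saw

# The "adjustable parts": 40 lookbacks x 2 directions = 80 candidate rules.
candidates = [(lookback, direction) for lookback in range(1, 41) for direction in (1, -1)]

# Adjust until the historical curve looks right: keep the best in-sample rule.
in_sample = {c: annualised_sharpe(rule_returns(history, *c)) for c in candidates}
best_rule = max(in_sample, key=in_sample.get)
best_in_sample = in_sample[best_rule]

# Then run the same rule on data it was not fitted to.
out_of_sample = annualised_sharpe(rule_returns(future, *best_rule))

print(f"best rule (lookback, direction): {best_rule}")
print(f"in-sample Sharpe:     {best_in_sample:.2f}")
print(f"out-of-sample Sharpe: {out_of_sample:.2f}")
```

The in-sample winner is positive by construction, because the search includes both directions of every rule. The out-of-sample figure has an expected value of zero, since the data contain no pattern; both numbers change with the seed.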

The mathematics of luck

The mathematics is unforgiving. Test enough strategies against the same history and the best one looks skilled even when every strategy is pure noise: the expected best Sharpe ratio (return earned per unit of risk taken) rises with the number of attempts through luck alone. Around a thousand noise strategies, each judged on about a year of data, produce a luckiest specimen with an annualised Sharpe ratio above 3, and the standard √(2 ln N) approximation puts the ceiling near 3.7. That is a figure a genuinely skilled fund would celebrate, achieved with zero real edge.

Bailey and López de Prado formalised this as backtest overfitting: the curve in the marketing material is the winner of a private tournament the reader never saw, and the polish of the curve measures the intensity of the tournament, not the quality of the idea.
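The private tournament can be simulated directly. The sketch below is a rough illustration under stated assumptions, not the procedure from the cited paper: every strategy is one year (252 days, an assumption) of independent zero-mean Gaussian daily returns, and the names `annualised_sharpe` and `best_noise_sharpe` are hypothetical.

```python
import math
import random
import statistics

TRADING_DAYS = 252  # assumption: each strategy is judged on one year of daily returns

def annualised_sharpe(daily_returns):
    """Annualised Sharpe ratio of a daily return series, risk-free rate taken as zero."""
    mean = statistics.fmean(daily_returns)
    stdev = statistics.stdev(daily_returns)
    return mean / stdev * math.sqrt(TRADING_DAYS)

def best_noise_sharpe(n_strategies, rng):
    """Best Sharpe ratio among strategies whose returns are pure zero-mean noise."""
    best = -math.inf
    for _ in range(n_strategies):
        returns = [rng.gauss(0.0, 0.01) for _ in range(TRADING_DAYS)]
        best = max(best, annualised_sharpe(returns))
    return best

rng = random.Random(42)
# Run tournaments of increasing size and keep only each winner.
results = {n: best_noise_sharpe(n, rng) for n in (1, 10, 100, 1000)}
for n, sharpe in results.items():
    print(f"{n:>5} noise strategies -> best Sharpe {sharpe:.2f}")
```

Under these assumptions the winner of the thousand-strategy tournament typically lands somewhat above 3, in the neighbourhood of the figure quoted above. The exact value depends on the seed and on how much history each strategy is judged on: longer samples shrink the luck, shorter ones inflate it.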

The silent graveyard

Survivorship bias runs the same selection one level up — on funds and accounts rather than parameters.

Consider a firm that launches sixteen funds with essentially random strategies. After one year, roughly eight are up by chance; the eight losers are quietly closed. After four years of the same pruning, the expected survivor count is one: a fund that has been up every single year purely by coin flip. That fund gets the glossy campaign while fifteen closures vanish from the record.

The graveyard is silent by construction. Closed funds stop reporting, deleted accounts stop posting, and abandoned model portfolios leave no trace, so the visible sample is systematically unrepresentative of the attempted population. Any performance figure that arrives without a count of the failures alongside it is a numerator with the denominator amputated.
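The sixteen-fund example can be checked with a short simulation. This is a sketch under the example's own simplification that each fund's year is an independent fair coin flip; the function name `surviving_funds`, the seed, and the number of simulated firms are assumptions for illustration.

```python
import random

def surviving_funds(n_funds, n_years, rng):
    """Count funds that were up every year when each year is a fair coin flip."""
    survivors = n_funds
    for _year in range(n_years):
        # Each remaining fund is up with probability 1/2; the losers are closed.
        survivors = sum(1 for _fund in range(survivors) if rng.random() < 0.5)
    return survivors

rng = random.Random(0)
n_firms = 20_000
counts = [surviving_funds(16, 4, rng) for _ in range(n_firms)]

mean_survivors = sum(counts) / n_firms
share_with_star = sum(1 for c in counts if c >= 1) / n_firms

# Closed-form values for comparison.
expected_survivors = 16 * 0.5 ** 4               # = 1.0
prob_at_least_one = 1 - (1 - 0.5 ** 4) ** 16     # about 0.64

print(f"simulated mean survivors per firm: {mean_survivors:.2f} (exact {expected_survivors:.2f})")
print(f"firms with at least one perfect fund: {share_with_star:.1%} (exact {prob_at_least_one:.1%})")
```

On average one fund per firm ends with a perfect four-year record, and roughly two firms in three have at least one such fund to advertise, with no skill anywhere in the process.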

One trial is not a track record

A single result — one portfolio, one year, one anecdote — carries almost no information about skill, because skill is a property of a distribution and one draw does not reveal a distribution. The base rate makes this concrete: across decades of SPIVA-style scorecards, roughly 80–90% of professional active funds fail to beat their simple benchmark index after fees over horizons of ten years or more. Sharpe’s arithmetic explains why this is not a fixable flaw: active managers collectively are the market, so the average active dollar must trail the average passive dollar by exactly the difference in costs — an accounting identity, not an opinion.

Against a prior that low, one reported win moves nothing.

A coin that landed heads once is not evidence of a two-headed coin.
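Sharpe's arithmetic above can be worked through with numbers. The figures below (market return, passive share, cost levels) are hypothetical and chosen only to make the identity visible; none of them come from the scorecards or from Sharpe's paper.

```python
# Illustrative inputs only; all values are assumptions for the example.
market_return = 0.08    # gross return of the whole market over the period
passive_share = 0.40    # fraction of market value held passively
active_share = 1.0 - passive_share
passive_cost = 0.001    # annual cost of passive management
active_cost = 0.010     # annual cost of active management

# Passive holders own the market portfolio, so before costs they earn the market return.
passive_gross = market_return

# The market is the value-weighted sum of the two groups:
#   market_return = passive_share * passive_gross + active_share * active_gross
# Solving for the active group's aggregate gross return:
active_gross = (market_return - passive_share * passive_gross) / active_share

passive_net = passive_gross - passive_cost
active_net = active_gross - active_cost
gap = passive_net - active_net

print(f"active gross return: {active_gross:.4f}")
print(f"passive net {passive_net:.4f} vs active net {active_net:.4f}")
print(f"gap {gap:.4f} = cost difference {active_cost - passive_cost:.4f}")
```

Changing `passive_share` or `market_return` leaves the gap untouched: the average active dollar trails by the cost difference whatever the market does, which is why the shortfall is an identity and not a forecast. Individual managers can still land on either side of that average.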

Two 2026 specimens

A hypothetical but familiar artefact: “my AI portfolio returned 34% in six months,” with a screenshot. Three mechanisms operate at once.

The post exists because the number is good — losing portfolios do not get posted, so the selection happened before the reader arrived.
Six months is one draw: in a rising market, a random basket of popular names performs well, and no benchmark is offered against what a broad index did over the same window.
And the AI framing adds authority without adding evidence — a language model generated the portfolio, but it did not generate a track record, and the anecdote remains a single trial however sophisticated the generator.

Behind the one visible post stands the usual graveyard: the prompts, portfolios, and screenshots that lost, unposted.

The fund marketing email

Equally familiar: “our machine-learning strategy returned 22% annually (backtested to 2010).” The parenthesis is load-bearing. A strategy built in the present and tested back to 2010 was constructed knowing every crash, every winning sector, and every regime change since 2010 — it passed an exam while holding the answer key.

Machine learning enlarges this problem rather than solving it: modern pipelines can evaluate millions of parameter combinations, so the winning configuration is drawn from a vastly larger pool of attempts, and luck’s ceiling scales with search capacity. The more powerful the machine, the better the backtest looks — and the weaker the inference it supports.
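A rough sketch of how luck's ceiling grows with search size. It uses the common √(2 ln N) approximation to the expected best of N independent standard normal draws and assumes each configuration's annualised Sharpe estimate carries a sampling error of about one unit (roughly one year of data), so the outputs read as Sharpe ratios. The function name `luck_ceiling` is hypothetical. Real parameter grids are heavily correlated and behave like a smaller effective N, so these figures are an upper-end illustration.

```python
import math

def luck_ceiling(n_trials):
    """Approximate best-of-N luck, sqrt(2 ln N), for independent standard normal draws."""
    return math.sqrt(2.0 * math.log(n_trials))

for n_trials in (10**3, 10**6, 10**9):
    print(f"{n_trials:>13,} configurations -> luck ceiling ~ {luck_ceiling(n_trials):.2f}")
```

The ceiling grows only with the logarithm of the search size, but it never stops growing: each thousandfold increase in configurations adds more than one full Sharpe unit of pure luck.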

Where this skepticism breaks

The filter has a real limitation: it also rejects genuine, not-yet-proven skill. A new manager with true edge and a short record is indistinguishable, from the outside, from the lucky survivor — the discipline simply refuses to pay for the distinction before evidence exists.

Live, dated, independently audited records spanning at least one full drawdown, with results on data the strategy never saw during construction, do constitute evidence and deserve to be weighed as such. The rule allocates the burden of proof; it does not claim skill is impossible.

The two questions

Two questions dissolve most performance claims before any deeper diligence is spent.

First: is this record live or backtested? Live means real money, dated trades, independent audit. “Backtested,” “simulated,” “model portfolio,” and “would have returned” all mean built after the fact — a hypothesis wearing the costume of a result.

Second: how many tries existed before this one was shown? How many strategies were tested, how many funds launched, how many portfolios run — and what happened to the ones not in the brochure? An honest operation can answer with a number; a survivorship machine cannot answer at all.

Neither question requires mathematics, market knowledge, or confrontation. They cost nothing, and they relocate the conversation from the quality of the curve — which is manufactured — to the size of the graveyard, which is where the truth about skill actually lives.

Sources: Sharpe, W. F. (1991). The Arithmetic of Active Management. Financial Analysts Journal. · Bailey, D. H., Borwein, J. M., López de Prado, M., & Zhu, Q. J. (2014). Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting. Notices of the AMS. · Harvey, C. R., Liu, Y., & Zhu, H. (2016). …and the Cross-Section of Expected Returns. Review of Financial Studies. · S&P Dow Jones Indices. SPIVA Scorecards (ongoing series).

Frameworks: Sharpe arithmetic of active management · Bailey–López de Prado backtest overfitting · survivorship bias · base-rate reasoning

Educational material, not investment advice. No prediction is made here; no security is recommended.