A long-form science essay

Weather, Causation,
and the Forecasting Revolution

Over a span of ninety-six years, two scientists solved complementary halves of the same problem — in different disciplines, in different languages, using different mathematics. One wanted to predict the atmosphere. The other wanted to identify causation in economic data. Neither knew about the other. The question they were both answering was always the same question.

On July 17, 2019, Chicago hit 94°F by noon.

Carla Reyes has worked the Walgreens at Wacker and State for eleven summers now, which is enough time to know things that the company's supply chain planners don't know and arguably couldn't know: that when the overnight low is above seventy and the forecast high is above ninety, the Gatorade display — four shelving units, roughly three hundred bottles, restocked the day before — will be depleted by 1:45 in the afternoon, that the orange goes first and the grape second and nobody has ever been able to explain either fact to her satisfaction, and that by the time July 17, 2019 comes around and it is 94°F at noon and climbing, she already knows what the delivery manifest for the next morning is going to look like. Foot traffic on Michigan Avenue was down 23 percent by 1 p.m. — people staying inside, hugging the shade of buildings, buying cold things. She is not doing demand forecasting. She does not know that term. She is doing something better and older than that, which is paying attention.

Forty miles north, in a glass tower in Schaumburg, a marketing analytics team was preparing a quarterly review. They had spent $2.1 million on digital and out-of-home advertising in Q2. Someone had brought donuts. The word "crush" was used. The problem they were there to solve — the one that justified the deck, the donuts, the word "crush" — was a general one, even if it felt like an advertising problem: of all the things that moved demand in Q2, what caused what? The analytics team had an answer. Their model attributed $4.7 million in incremental revenue to the advertising campaign — a 2.2x return. What the model could not tell them was how much of that $4.7 million was actually the advertising, how much was the hottest Chicago summer on record, and how much was the baseline growth that would have happened with no campaign at all. These are not the same as a simple "did it work?" — they are the sub-questions that a binary answer can hide, and they are the ones that actually govern budget decisions. What nobody mentioned, because nobody had clicked on it, was that column 47 in the training data was called temp_hi_f, and it was doing roughly a third of the model's predictive work, invisibly, silently, in the way that most important things do their work. You might ask: if the model already included temperature as a predictor, why couldn't the team just read off the temperature coefficient and subtract the weather effect from total lift? Because the coefficient on temp_hi_f does not measure weather's causal share of demand. It measures the correlation between temperature and sales after conditioning on advertising spend — but advertising spend and temperature are correlated through seasonality: summer campaigns are larger, summer is hotter, and both track the same calendar. That shared variation is absorbed into the coefficient in ways that cannot be separately attributed. 
Controlling for temperature and using temperature as an instrument are completely different operations. The model had the column. It was not using it correctly.

This essay is about that general problem: how do you find a variable that moves economic behavior but cannot itself be moved by economic agents? For a hundred years, economists have needed this kind of variable everywhere — in energy prices, labor supply, agricultural markets, retail operations, and yes, advertising attribution. The answer has been in every retailer's data feed for decades — temperature, precipitation, humidity — waiting for the instrumental variable (IV) framework to turn it from a predictor into a causal handle. An instrumental variable is a variable that moves the thing you care about for reasons entirely outside the feedback system you are studying — exogenous variation that lets you separate cause from correlation. It is not a trick. It is not statistical cleverness. And it is not merely the observation that weather affects demand — everyone who has ever stocked an umbrella or bought a fan knows that. The hard property is more specific: the variable must satisfy two conditions simultaneously. First, it must strongly move the thing you are trying to measure (relevance). Second, it must affect that thing only through the channel you are studying, and not through the advertiser's own choices or any other pathway. That second condition is what economists call the exclusion restriction, and it is where most proposed variables fail to qualify. It is a deep property of weather's physical structure — the fact that temperature anomalies are determined by atmospheric dynamics that predate and ignore any commercial decision — that makes weather unusually well-suited to satisfying both conditions at once.
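The two conditions can be made concrete with a toy simulation; every name and coefficient below is hypothetical, not drawn from any real dataset. A hidden confounder drives both the treatment and the outcome, so the naive regression slope is biased, while an instrument that satisfies relevance and exclusion by construction recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical structural model (all names and coefficients illustrative).
# A confounder c drives both the treatment X and the outcome Y; the
# instrument Z moves X (relevance) but touches Y only through X
# (the exclusion restriction, true here by construction).
c = rng.standard_normal(n)
Z = rng.standard_normal(n)
X = 0.8 * Z + 1.0 * c + 0.5 * rng.standard_normal(n)
Y = 2.0 * X + 3.0 * c + 0.5 * rng.standard_normal(n)   # true effect of X on Y: 2.0

# Naive regression slope of Y on X: biased upward by the confounder.
ols = np.cov(Y, X)[0, 1] / np.var(X, ddof=1)

# With one instrument and one regressor, two-stage least squares
# collapses to the Wald ratio cov(Y, Z) / cov(X, Z).
iv = np.cov(Y, Z)[0, 1] / np.cov(X, Z)[0, 1]
```

Here the exclusion restriction holds because we wrote the data-generating process ourselves; with real weather data it is an argument about physics, not an assumption you can test from the joint distribution alone.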

1

Why Prediction Models Answer the Wrong Question

For the better part of a century, the central question in demand forecasting was accuracy — how close can we get? The answer, as of 2024, is: very close. Closer than anyone expected. And it turns out to be the wrong question.

The pattern was established in 1979, before it had a name. Spyros Makridakis and Michele Hibon published a paper in the Journal of the Royal Statistical Society showing that simple extrapolation methods — the kind a reasonably diligent clerk could run on a pocket calculator — outperformed the sophisticated Box-Jenkins ARIMA models that had become the professional standard. The statisticians responded — in the published discussion section of the JRSS-A paper — that the result could not be correct, which is a particular kind of academic criticism that means: your method must be wrong because your conclusion is unacceptable. So Makridakis ran a formal competition.

In 1982, using 1,001 time series from demography, industry, and economics, he tested 21 forecasting methods in what he called the M-Competition. Simple exponential smoothing — describable in a single equation — outperformed Box-Jenkins ARIMA, regression methods, and the full arsenal of then-current technique. The statisticians' position was untenable. The engineers went back to exponential smoothing. Makridakis ran the competition again in 1993, and in 2000, and in 2018 with 100,000 time series — the largest forecasting competition ever conducted — and the basic pattern held. This pattern — expert confidence meeting empirical humiliation — runs through the entire history of prediction. The embarrassment came from the specific direction of the result. The Box-Jenkins ARIMA methods the statisticians had developed required a skilled analyst to estimate the autocorrelation structure of each series, test for stationarity, select the appropriate differencing and lag structure, and estimate a custom model. Simple exponential smoothing required one equation: the new estimate is a weighted average of the old estimate and the most recent observation. The weight is a single parameter, fit once. The ARIMA analyst spent hours. The simple method won — not once, not on one dataset, but systematically, across 1,001 series. The lesson, which the field resisted for fifteen years, was that complexity in the method does not translate to accuracy in the forecast.
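The single equation is short enough to write out. This is a generic textbook sketch of simple exponential smoothing, not the competition code:

```python
def exp_smooth(series, alpha=0.3):
    """Simple exponential smoothing: each new estimate is a weighted
    average of the previous estimate and the most recent observation."""
    level = series[0]
    fitted = [level]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
        fitted.append(level)
    return fitted  # fitted[-1] is the one-step-ahead forecast
```

The single parameter alpha is the weight on the newest observation; fitting it once, by minimizing historical forecast error, is the entire estimation procedure the ARIMA analysts were competing against.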

Then something changed. In M4 (2018), a hybrid method called ES-RNN — exponential smoothing fused with a recurrent neural network — won by a margin large enough that it genuinely surprised the organizers, improving on the benchmark by 9.4% in sMAPE (symmetric mean absolute percentage error). By M5 (2020), which used 42,840 Walmart products across 10 stores and 1,941 daily observations, deep learning methods were competitive across most categories.

A neural network is a mathematical function composed of many layers, each of which transforms its input into a new representation and passes the result on. The word "network" refers to the way these transformations are chained: early layers detect simple patterns (for a time series, something like "this week is higher than last week"), later layers detect patterns in those patterns ("summer weeks are consistently higher than winter weeks"), and the deepest layers learn structures that no human would have thought to specify in advance. What distinguishes a neural network from an ordinary regression is not the math — the individual operations are elementary linear algebra — but the architecture: hundreds or thousands of such transformations stacked sequentially, allowing the model to discover structure at many levels of abstraction simultaneously. The word "deep" in deep learning refers to that depth of layering. An exponential smoothing model has, roughly speaking, one layer: a weighted average of past observations. A deep learning model has dozens or hundreds — one layer to notice that Monday is different from Saturday, another to notice that seasonal patterns vary by year, another to notice that those yearly variations correlate with something in other series it was trained on. The practical consequence of depth is that the model can discover useful structure without being told what structure to look for.

The learning happens through a procedure called gradient descent. Training a neural network means finding the settings of its millions of parameters that minimize prediction error on the training data. Start with random parameter values, make a prediction, measure the error, then compute — for each parameter — how much the error would change if that parameter moved slightly in either direction. Move each parameter a small step in the error-reducing direction. Repeat for millions of examples, millions of times. The word "gradient" refers to the direction of steepest descent in the error landscape: the procedure is, geometrically, like rolling a ball downhill in a very high-dimensional space, looking for a valley. The valleys are the parameter settings that make good predictions. Finding them requires compute. A lot of compute.
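The procedure described above (nudge each parameter, see which direction reduces the error, step that way) can be sketched literally with finite differences on a two-parameter model; all numbers are illustrative. Real training computes the same gradients analytically via backpropagation, but the geometry is identical.

```python
# Toy data generated by y = 3x + 1, so the "valley" sits at w = 3, b = 1.
data = [(0.0, 1.0), (1.0, 4.0), (2.0, 7.0), (3.0, 10.0)]

def loss(w, b):
    # Mean squared prediction error of the model y_hat = w*x + b.
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

w, b = 0.0, 0.0                 # start from arbitrary parameter values
lr, eps = 0.01, 1e-6
for _ in range(5000):
    # For each parameter, measure how the error changes if it moves slightly.
    grad_w = (loss(w + eps, b) - loss(w - eps, b)) / (2 * eps)
    grad_b = (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps)
    # Step each parameter a small amount in the error-reducing direction.
    w -= lr * grad_w
    b -= lr * grad_b
```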

By M6 (2022), the neural network takeover was complete. Not because neural networks suddenly became better statistics. Because the data had gotten richer, the compute had gotten cheaper, and the models had gotten bigger. The extrapolation problem, the core of demand forecasting, was essentially solved.

The statisticians were not being dishonest. They were being wrong in the specific way that expertise makes you wrong: by trusting the authority of the method over the authority of the result.

In 2024, Amazon published a paper describing Chronos, a foundation model for time series forecasting trained on tens of millions of series from every domain imaginable — capable of forecasting any new series you hand it, zero-shot, without additional training. Google published TimesFM. Salesforce published MOIRAI, trained on the LOTSA dataset: 27 billion observations. These models achieve accuracy on held-out benchmarks that would have seemed like science fiction to the statisticians at M1. They are, in a quite literal sense, Lewis Fry Richardson's Forecast Factory — the distributed computing architecture Richardson imagined in 1922, finally built, running on silicon instead of human computers. They tell you, with astonishing precision, what will happen.

They cannot tell you why.

That sentence deserves a pause. Not because it is a surprise — it is not — but because the entire industry built on these models continues to behave as if it is not true.

This is not a subtle point, and it is not a limitation that better models will overcome. It is a mathematical impossibility, and it flows directly from what prediction is and what it is not. Let Y be weekly sales, X be advertising spend, and W be temperature. A forecasting model fits some function Ŷ = f(X, W) that minimizes prediction error on held-out data. If the fit is good — if f(X, W) gets the right answer most of the time — we have learned something real about the joint distribution of Y, X, and W. What we have not learned is the causal structure within that joint distribution. When X and W move together, as advertising and temperature do through the calendar, the data pin down only a combination of their coefficients, not the split. If X and W were perfectly collinear, say X = c·W, then every causal story Y = α·X + β·W + ε with α·c + β held constant would produce the identical f(X, W) and the identical prediction accuracy; with the near-collinearity of real seasonal data, a wide range of splits fits almost equally well. The prediction score is invariant to the causal attribution. You can have 98 percent accuracy while crediting advertising with three times its true effect — or one-third its true effect — and the accuracy number will not move.
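The invariance is easy to verify by construction. The toy below is deliberately extreme (advertising made exactly collinear with temperature, all numbers hypothetical), which makes the non-identification exact rather than approximate:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 52                                    # one year of weekly observations

# Illustrative construction: ad spend is scheduled against the same
# calendar that drives temperature, here perfectly so.
season = rng.uniform(0.0, 1.0, n)
W = 60.0 + 30.0 * season                  # temperature
X = 2.0 * W                               # advertising spend, collinear with W
Y = 0.5 * X + 0.2 * W + rng.normal(0.0, 1.0, n)   # hidden truth: Y = 1.2*W + noise

def fitted(alpha, beta):
    return alpha * X + beta * W

# Two incompatible causal stories. Because X = 2*W, any (alpha, beta)
# with 2*alpha + beta = 1.2 produces the *same* predictions.
story_ads     = fitted(0.6, 0.0)    # "advertising did everything"
story_weather = fitted(0.0, 1.2)    # "weather did everything"

mse_ads     = np.mean((Y - story_ads) ** 2)
mse_weather = np.mean((Y - story_weather) ** 2)
```

Real seasonal data is near-collinear rather than exactly collinear, so in practice the fits differ slightly rather than not at all; the attribution problem is the same.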

The stakes of this confusion are not abstract, and they are not confined to advertising. The same confounding structure appears wherever demand and treatment are simultaneously determined: energy utilities that observe demand spikes and raise prices face the same simultaneity; agricultural suppliers who respond to price forecasts create the same loop. Brett Gordon, Florian Zettelmeyer, Neha Bhargava, and Daniel Chapsky published a paper in Marketing Science in 2019 comparing advertising attribution estimates from observational models to estimates from large-scale randomized field experiments run at Facebook. The finding was stark: observational methods overestimated advertising effects by five to ten times compared to the experimental estimates — particularly in digital advertising channels, where targeting algorithms create especially strong selection effects. Five to ten times. Bradley Shapiro, Anna Tuchman, and Nils Wernerfelt later evaluated 288 brands and found the median experimental ROAS (return on ad spend) was significantly below what observational models reported. The Schaumburg team's 2.2x might be 0.4x. That is not a rounding error. That is a different decision — a different budget, a different plan, a different board deck.

Why does the confusion persist? Because the score moves. Every time you add a weather variable to the model, accuracy improves. Every accuracy improvement feels like understanding.

Jim Simons understood this distinction with unusual clarity, even if he spent thirty years deliberately ignoring it. In 1978, age 40, Simons set up Monemetrics in a strip mall in East Setauket, Long Island — not a detail you'd invent — and hired Leonard Baum, who had spent his career cracking Soviet codes for the NSA, as his first employee. Baum had co-invented the Baum-Welch algorithm for hidden Markov models; he was, in other words, a specialist in finding structure in sequences that looked like noise. The fund they eventually created was named Medallion, after the Oswald Veblen Prize medallions that both Simons and his colleague James Ax had won for their work in geometric topology — not, it should be said, the usual credential for a hedge fund manager. From 1988 to 2018, the Medallion Fund returned 66.1% per year before fees, 39.1% after. Total profits: over $100 billion. Employees sign nondisclosure agreements so airtight they cannot discuss the work with their spouses.

What Simons proved — definitively, expensively, with three decades of refusal to explain himself — is that statistical structure in financial prices is real, extractable, and worth more than you can easily imagine. What he never claimed to prove, and what you cannot prove with pattern-recognition alone, is what causes the patterns. "We don't look inside the black box," Simons reportedly told an interviewer. This works — spectacularly — in liquid financial markets, and it is worth understanding why, because the reason is specific. In financial markets, the payoff is forward-looking: you profit from knowing what prices will do next, without needing to know why they moved as they did. Prediction generates returns. The Schaumburg problem is categorically different: it is retrospective attribution. The team needs to know not "what will Q3 sales be" but "of Q2 sales, how much was caused by advertising, as distinct from the heat wave?" That decomposition of a past outcome into its causal contributors cannot be provided by a prediction model — no matter how accurate — because the prediction score is invariant to the causal story, as the math in Section 1 shows. Simons knew this distinction with precision and deliberately chose a domain where it didn't matter. The Medallion Fund is proof that prediction divorced from causation can generate extraordinary returns in some settings. The Schaumburg team is proof of what happens when you apply that logic to a setting where the market doesn't price in what advertising causes.

Two problems. One is solved. The other — as we'll see — is not, and for a reason that better prediction cannot fix.

The extrapolation problem and the attribution problem are different problems. The first asks: what will Y be? The second asks: why is Y what it is? The M-competitions, the exponential smoothing renaissance, the neural network takeover, Simons's Medallion fund — all of it is progress on the first question. Carla Reyes's cold-beverage inventory instinct is, in some sense, her personal solution to it: she knows that when July is hot, cold-beverage volume will track last year's July, and she stocks accordingly. But predicting next week's demand does not require knowing why people buy more orange Gatorade when it's hot. It only requires knowing that they do. The model doesn't need to understand anything. It just needs to have seen enough Julys. Attribution is different. Attribution asks: of the 0.8% demand increase last quarter, how much was the heat wave, how much was the campaign, how much was the price change in April, how much was the competitor's promotion in May? And this question — as we'll see — requires a fundamentally different tool.

The skeptic's objection is natural and worth taking seriously: if the model's prediction is validated on holdout data — if it genuinely predicts next quarter with high accuracy — doesn't that confirm the causal story? It doesn't, quite. Validated predictions are consistent with multiple causal stories simultaneously. A model that perfectly predicts ice cream sales from temperature and advertising spend cannot tell you whether to credit the advertising or the heat wave, because both causal stories produce equally good predictions on the data you have. The validation confirms the joint predictive relationship. It does not identify the individual causal contributions. These are mathematically different objects.

There is a deeper sense in which the predictive models do not understand what they are predicting — a gap between performance and comprehension that persists even as accuracy climbs.

The deep-learning forecasting revolution has produced a phenomenon called double descent — named by Mikhail Belkin and colleagues in a 2019 PNAS paper that showed the classical bias-variance tradeoff breaks down in the overparameterized regime. In estimation, bias and variance are two different ways to be wrong. A biased estimator is one that is systematically wrong in a particular direction — like a scale that always reads two pounds heavy. No matter how many times you weigh yourself, the systematic error persists. A high-variance estimator gives different answers each time — correct on average but wildly unstable, so any single estimate is unreliable. The fundamental tension: reducing bias usually requires a more flexible model, which tends to increase variance (the model fits the noise); reducing variance usually requires a simpler model or more regularization, which introduces bias (the model is systematically wrong but consistently so). The classical view held that this tradeoff was inescapable: as model complexity increases, test error eventually rises (overfitting). Belkin et al. showed that if you keep increasing complexity past the interpolation threshold — past the point where the model fits training data perfectly — test error eventually descends again. Modern foundation models operate in this regime. They interpolate perfectly and then generalize. Why this happens is still not fully understood. That it happens is documented.
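A minimal way to watch this regime is minimum-norm regression on random features. This is a generic textbook sketch, not Belkin et al.'s exact experiment, and how pronounced the second descent looks depends on the noise, the features, and the seed; what is guaranteed is that past p = n the model interpolates the training data exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression: 20 noisy samples of a smooth target.
n_train, n_test = 20, 200
x_tr = rng.uniform(-1, 1, n_train)
x_te = np.linspace(-1, 1, n_test)
target = lambda x: np.sin(2 * np.pi * x)
y_tr = target(x_tr) + 0.1 * rng.standard_normal(n_train)
y_te = target(x_te)

def relu_features(x, w, b):
    # p fixed random ReLU features: phi_j(x) = max(0, w_j * x + b_j)
    return np.maximum(0.0, np.outer(x, w) + b)

widths = [2, 5, 10, 20, 50, 200, 1000]   # 20 = interpolation threshold (p == n_train)
train_err, test_err = [], []
for p in widths:
    w = rng.standard_normal(p)
    b = rng.standard_normal(p)
    Phi_tr = relu_features(x_tr, w, b)
    Phi_te = relu_features(x_te, w, b)
    # For p > n_train, lstsq returns the minimum-norm interpolating solution.
    beta, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)
    train_err.append(float(np.mean((Phi_tr @ beta - y_tr) ** 2)))
    test_err.append(float(np.mean((Phi_te @ beta - y_te) ** 2)))
```

Plotting test_err against widths is the standard way to see the curve; the interesting comparison is test error just at the threshold versus test error far past it.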

There is a related phenomenon that Power and colleagues named "grokking" in a 2022 paper: models trained on small algorithmic datasets would first memorize the training data — achieving near-zero training loss while test loss remained high — and then, thousands of gradient steps later, generalize suddenly and completely.

A phase transition, delayed. The model had already apparently failed.

The memorization phase could last for what would correspond, in human terms, to years of apparent failure before the generalization phase arrived — not at the interpolation threshold, but long after it, when the model seemed to have already failed. In the paper's primary experiments on modular arithmetic (division modulo 97, 50% training split), training accuracy reached 100% in under 1,000 gradient steps. Test accuracy stayed at chance for approximately a million steps — a roughly three-order-of-magnitude gap, though the exact duration is sensitive to weight decay, dataset fraction, and the specific task — before suddenly converging.
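The setup is easy to reproduce at the data level. A sketch of the primary task (division modulo 97 with a 50% split; the helper names are mine, not the paper's):

```python
import itertools
import random

P = 97  # modulus from the primary experiments (division mod 97)

def mod_div(a, b, p=P):
    # Modular division: a / b := a * b^(-1) mod p,
    # with the inverse via Fermat's little theorem (p prime).
    return (a * pow(b, p - 2, p)) % p

# The full table of equations "a / b = ?" for b != 0.
examples = [((a, b), mod_div(a, b)) for a, b in itertools.product(range(P), range(1, P))]

random.seed(0)
random.shuffle(examples)
split = len(examples) // 2          # 50% training fraction
train, test = examples[:split], examples[split:]
```

The point of the construction is that the table is small and exhaustively enumerable, so memorizing the training half is trivially achievable, and generalizing to the held-out half requires discovering the algebraic structure.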

Grokking and double descent are the same phenomenon viewed from different angles. Both reveal that the interpolation threshold — the point where training error hits zero — is not where understanding lives. Understanding arrives later, through mechanisms that are still being worked out. Nanda et al. (2023) mechanistically reverse-engineered the grokked solution for modular arithmetic: the network had learned a discrete Fourier transform, composing sines and cosines at five key frequencies. The memorizing solution stored individual examples. The generalizing solution discovered an algorithm.

The pattern across these three results is not coincidental. In 1956, Stein showed that combining information from seemingly unrelated estimation problems improves accuracy — that more pooling, past the point where intuition says it should stop helping, keeps helping. In 2019, Belkin showed that more parameters, past the interpolation threshold where classical theory says performance must degrade, eventually improve generalization. In 2022, Power showed that more training — far past apparent convergence — produces a phase transition into genuine understanding. Three results, separated by decades, three different fields. The same counter-intuitive principle: in high-dimensional problems, the cure for overfitting is more, not less. This is the regime that demand models, trained on years of weather-sales data across thousands of stores, now operate in.

The Nanda result is worth dwelling on. The memorizing network and the grokked network produce identical outputs on the training set. If you stopped training at step 50,000 — where training accuracy is 100% and test accuracy is at chance — you would have a model that looks, on training metrics, like it has converged. It hasn't. It has memorized. The model that grokked at step 1,200,000 discovered a Fourier algorithm: a structural representation of modular arithmetic built from discrete sines and cosines. It did not find a better way to store the training examples. It found the underlying structure that generates them. This is a qualitative transition, not a quantitative improvement.

The analogy to IV models is direct. A predictive demand model learns the correlational patterns in the training data — the historical relationship between advertising spend and sales, which is confounded by the feedback loop. An IV-estimated causal model learns something different: the structural relationship between exogenous variation in weather and resulting demand, stripped of the confound. This is not a larger amount of the same thing. It is a different thing. And the question grokking raises — which is a real question, not a metaphor — is whether this structural transition also requires a minimum data window that practitioners routinely cut short.

In applied estimation of large hierarchical causal demand models, the grokking risk is this: you train a model on six months of weather-sales data across a few hundred stores. The held-out metrics look poor. The causal estimates are noisy, uncertainty intervals wide, the out-of-sample fit unimpressive. The team reviews the results and kills the training run. What they may be killing is a model that has not yet seen enough exogenous variation to separate the causal structure from the noise. Weather instruments derive their power from genuine temperature anomalies — cold snaps, heat waves, unseasonable weeks that shift demand in ways the baseline model can attribute to a cause. Six months may not contain enough of these anomalies, distributed across enough markets, to identify the structural parameters reliably. The model looks like it is memorizing because, in a sense, it is — it has not yet seen enough unusual weather to learn the causal structure rather than the seasonal correlations. Two years of data, spanning multiple summer heat events, winter cold periods, and anomalous months in heterogeneous markets, is where the IV estimator starts to find its footing. This is not a longer version of the six-month evaluation. It is a different regime.
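The sample-size point can be sketched with a toy simultaneity simulation. The system, coefficients, and sample sizes below are hypothetical illustrations (a generic supply-demand loop with weather shifting supply), not estimates from any real retailer; the point is only that the IV estimate is far noisier at roughly six months of weekly data than at two years.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical structural system (all coefficients illustrative):
# Demand:  Q = 10 - 1.5*P + u        (true causal price slope: -1.5)
# Supply:  Q =  2 + 1.0*P + 0.8*W + v   (W = temperature anomaly, exogenous)
def simulate(n):
    W = rng.standard_normal(n)
    u = rng.standard_normal(n)          # demand shock
    v = rng.standard_normal(n)          # supply shock
    P = (8 + u - v - 0.8 * W) / 2.5     # equilibrium price (solve the two equations)
    Q = 10 - 1.5 * P + u                # equilibrium quantity
    return W, P, Q

def iv_slope(W, P, Q):
    # Wald/IV estimator of the demand slope: cov(Q, W) / cov(P, W)
    return np.cov(Q, W)[0, 1] / np.cov(P, W)[0, 1]

reps = 300
est_26  = np.array([iv_slope(*simulate(26))  for _ in range(reps)])   # ~6 months, weekly
est_104 = np.array([iv_slope(*simulate(104)) for _ in range(reps)])   # ~2 years, weekly
```

At 26 observations the estimator's spread across replications is much wider, and occasional near-zero first stages produce wild outliers; at 104 observations, with more weather anomalies in the sample, the estimates concentrate near the true slope.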

Whether causal demand models undergo a sharp phase transition analogous to grokking's million-step gap — or whether the improvement is gradual, just requiring more data than practitioners typically allocate — is an open empirical question. What the grokking literature establishes is that the shape of the learning curve matters: early-regime performance is not a reliable predictor of asymptotic behavior, and the transition point can be far outside the evaluation window. A business assessing this approach on six months of data and finding noisy causal estimates may be observing the memorization phase. The correct response is not to abandon the method. It is to run longer. Power et al. (2022) and Nanda et al. (2023) did not discover a theoretical curiosity. They documented a failure mode that practitioners keep finding, independently, and consistently misattributing to the method rather than the data window.

What nobody measured, across four decades and four M-competitions — and this is the important thing, the thing the competitions were never designed to test — was whether the winning models were right for the right reasons. Back in Schaumburg, the quarterly review ended at 4:17 p.m. The VP thanked the team. The attribution number — 2.2x — went into the board deck. The model had no idea it was 94°F in Chicago. Gordon et al. (2019) would later estimate that observational methods systematically overestimate advertising effects by five to ten times. The 2.2x might be 0.4x. That is not a rounding error. That is a different decision.

On M-competition methodology: The M4 competition (Makridakis, Spiliotis, Assimakopoulos, 2020) included 100,000 series and attracted 248 participating teams. The ES-RNN winner (Slawek Smyl, 2018) improved on the benchmark by 9.4% in sMAPE. The M5 competition used Walmart hierarchical sales data — 42,840 products, 10 stores, across 1,941 daily observations — and was explicitly designed to test forecasting in the presence of heterogeneous demand patterns, promotions, and calendar events. Neural network methods dominated M5 in a way they had not dominated M4.

On zero-shot foundation models: Chronos (Ansari et al., Amazon, 2024) pre-trains on a large collection of publicly available time series data and synthetic data, then forecasts zero-shot using a language-model-style tokenization of numerical values. TimesFM (Das et al., Google, 2024) uses a patched-decoder architecture. MOIRAI (Salesforce, 2024) trains on the LOTSA dataset (27 billion observations). All three achieve competitive zero-shot performance on standard benchmarks without any dataset-specific fine-tuning — a qualitative shift from the M-competition paradigm of method-per-series estimation.

On Renaissance Technologies: The Medallion Fund's 66.1% gross / 39.1% net annualized returns over 1988–2018 are documented in Gregory Zuckerman's The Man Who Solved the Market (2019). Simons's first hire, Leonard Baum, co-invented the Baum-Welch algorithm with Lloyd Welch in 1966 while at the Institute for Defense Analyses. The fund was named after the Oswald Veblen Prize, awarded by the American Mathematical Society for outstanding research in geometry or topology; both Simons (1976) and James Ax (1967) were prior recipients.

We have established that the problem is not accuracy — that prediction and causal attribution are different objectives, and that the accuracy score is invariant to the causal story. But “different objectives” is not the same as “requires a different tool.” To understand why attribution requires a fundamentally different approach — one that cannot be improved by collecting more data, running better models, or adding more controls — you need to see the specific structure of the failure.

2

The Identification Problem: Feedback Loops and the Limits of Data

Not every hard problem is hard for the same reason. If you use the wrong tool for the wrong kind of hard, you can work forever and get nowhere. Before we can solve the attribution problem, we need to understand what kind of problem it actually is.

On a January afternoon in 1961, in Building 24 on the MIT campus — a drab structure near the center of campus, the kind of building that exists to contain computation rather than inspire it — Edward Lorenz re-entered a number into a Royal McBee LGP-30 computer. The McBee weighed 800 pounds and had its own office because of the noise, a fact that merits emphasis: the foundational discovery of chaos theory was made in a room dedicated to a machine too loud to share space with people. The number Lorenz entered was 0.506, truncated from the stored value of 0.506127 — a difference of roughly one part in five thousand, which is to say a difference of nothing, a rounding, the kind of approximation you make a hundred times a day without thinking about it. He had started the simulation from a midpoint, using values from a printout, and gone to get a coffee. When he came back, the two runs — the original and the restart — matched for a while and then diverged completely, the simulated weather in the second run bearing, after some simulated weeks, no relationship to the weather in the first.

His first thought was a vacuum tube failure. His second thought changed everything.

What Lorenz had found — and would publish in 1963 in a paper called "Deterministic Nonperiodic Flow," the kind of understated title that conceals seismic content — is that deterministic equations can produce unpredictable outcomes, that the atmosphere does not care about your decimal places, that the distance between 0.506 and 0.506127 is, meteorologically speaking, eventually everything. The "butterfly effect" title for this phenomenon was suggested by Philip Merilees in 1972, with Merilees proposing a butterfly over Brazil instead of Lorenz's preferred seagull over New York, because butterfly-Brazil, tornado-Texas alliterated better. The name stuck. But the name obscures what Lorenz actually discovered. He didn't discover that weather is sensitive. He discovered that the atmosphere offers no shortcut between initial conditions and future state: you cannot predict the output without running the process.

Lorenz's result shows that weather is sensitive to initial conditions. But the implications go deeper than sensitivity. Sensitivity says: small errors in what you know now become large errors in what you predict later. What Lorenz actually discovered is more specific and more alarming: the separation between two nearby trajectories in the atmosphere grows exponentially in time. Not linearly — not "a little worse each day" — but exponentially, doubling the error approximately every few days. This is a hard mathematical property of the equations, characterized by what is called the Lyapunov exponent. If the error doubles every five days, then after ten days your forecast error is four times your initial uncertainty, after fifteen days it is eight times, after twenty days it is sixteen times. You can reduce the initial uncertainty by improving your weather observations — finer instruments, denser networks, satellite data. But exponential growth guarantees that any finite improvement in initial conditions only buys you a finite additional time before the error swamps the signal. The two-week predictability horizon is not a technological failure. It is a mathematical consequence of the exponential divergence rate of the atmospheric equations — a hard ceiling built into the physics, not a modeling limitation. A stronger claim — formalized by Stephen Wolfram in his 2002 book A New Kind of Science — is that some systems are computationally irreducible: no shortcut exists between initial conditions and future state, regardless of how well you know the starting point; the shortest description of what the system will do is the computation itself. His canonical example is Rule 30, a one-dimensional cellular automaton whose evolution is so complex that each step must be computed individually; there is no closed-form formula that jumps ahead. The rule is simple. The behavior is irreducible.
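Rule 30's irreducibility is easy to see by building it. The sketch below is a minimal plain-Python version, using a periodic boundary for simplicity (Wolfram's own construction uses an unbounded row, but the two agree until the pattern wraps). Each generation requires computing the previous one; that is the whole point:

```python
def rule30_step(cells):
    """One generation of Rule 30: new cell = left XOR (center OR right)."""
    n = len(cells)
    return [cells[(i - 1) % n] ^ (cells[i] | cells[(i + 1) % n]) for i in range(n)]

# Start from a single live cell and evolve; every row requires the row before it.
row = [0, 0, 0, 1, 0, 0, 0]
for _ in range(3):
    print("".join("#" if c else "." for c in row))
    row = rule30_step(row)
```

There is no formula that jumps from the initial row to generation one million. You iterate — which is exactly what the irreducibility claim says.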
Consumer demand has an analogous property — not formally irreducible in Wolfram's precise sense, which applies to deterministic systems like cellular automata, but practically irreducible in that the interactions between weather, price, advertising, competitive response, and consumer state are sufficiently complex that no closed-form shortcut exists to the answer. The implication is the same: simulation and data are necessary, not optional. The demand for cold beverages in Chicago on any given afternoon is a function of temperature, humidity, day of week, competitive promotions within a half-mile radius, the consumer's recent purchase history, whether a Cubs game just ended, and a dozen other variables that interact non-linearly in ways that would require simulating the full socioeconomic context to predict from first principles. Chronos, TimesFM, MOIRAI — they don't shortcut the irreducible process; they approximate it using historical patterns. When the patterns break — when a heat wave arrives that is outside the historical distribution — the approximation fails.

Here is the crucial asymmetry — the one the essay turns on.

Computational irreducibility constrains prediction. It says nothing useful about causal inference. These are different computational operations. "What will demand be next Tuesday?" is a prospective extrapolation — you need to run the system forward from current conditions, and if the system is irreducible, you cannot shortcut that. "What caused the demand spike last Tuesday?" is a retrospective causal question — you're asking about the structure of the mechanism, not computing forward from initial conditions. The first is hard because of the system's complexity. The second is hard for an entirely different reason — the data's structure — which the next few paragraphs make precise.

The distinction matters for the Schaumburg problem. The Schaumburg team's forecasting model failed — its prediction of next summer's demand will deteriorate as weather gets more extreme — because demand is practically irreducible in the relevant sense: no tractable model can shortcut the forward evolution of a complex system under novel conditions. But this failure is separate from the attribution failure. The attribution model failed — its 2.2x return estimate is wrong — not because the system is irreducible, but because the data has a different structure entirely: a feedback loop that makes causal direction ambiguous.

The information-theoretic language is worth stating precisely here because it is the language in which the instrument conditions are later specified. Weather's value as an instrument rests on the fact that it produces high mutual information with demand (strong relevance) while having zero correlation with the error term in the demand equation (exogeneity). These are two separate properties, and they are both required. High mutual information alone is not sufficient — as the common-cause examples below show.

The information-theoretic framing of this distinction is precise. Shannon mutual information, written I(Y; W), measures how much information weather variable W carries about demand outcome Y. High I(Y; W) means knowing W reduces your uncertainty about Y — which means W is a useful predictor. This is exactly what Chronos and TimesFM exploit. But high I(Y; W) does not imply that W caused Y. Both W and Y might be caused by a third variable — a hidden common cause — that makes them move together without any causal connection between them. Seasonality, consumer income cycles, competitive activity, calendar effects: demand data is full of such common causes. High mutual information tells you the variables are correlated. It does not tell you why.
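A minimal simulation makes the common-cause point concrete. In the sketch below (invented numbers, with simple correlation standing in for mutual information), weather W and demand Y are both driven by a hidden seasonality variable and have no causal link to each other — yet they move together strongly:

```python
import random

random.seed(0)
n = 5000
season = [random.gauss(0, 1) for _ in range(n)]    # hidden common cause
W = [s + random.gauss(0, 0.5) for s in season]     # weather tracks the season
Y = [s + random.gauss(0, 0.5) for s in season]     # demand tracks the season too

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

r = pearson(W, Y)
print(f"corr(W, Y) = {r:.2f}")   # high, despite zero causal effect of W on Y
```

Knowing W genuinely reduces uncertainty about Y here — a foundation model would happily exploit it — but intervening on W would change nothing about Y.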

But common causes — seasonality, income cycles — are a fixable problem: you can control for them. Add them as covariates and their confounding effect disappears. The second type of failure is categorically different: it cannot be fixed by adding more variables, because the confound is not a third variable you can observe. It is embedded in the structure of the feedback between the variables you are trying to study. The natural response is: just control for the media buyer's decision signal. Add seasonal trend variables, lagged demand indicators, competitive activity — whatever drove the buy decision, partial it out. This does not work, for a specific reason: the unobserved driver is the buyer's private demand forecast — the internal signal they acted on before setting this week's budget. That signal was never recorded. Even if you observe every public indicator available (seasonality, competitor promotions, publicly available weather forecasts), the buyer's private model incorporates proprietary market intelligence, brand-specific context, and campaign history that is simply not in your dataset. The residual correlation between their spend and the unobserved demand component cannot be eliminated by adding more public covariates. The problem is not omitted observable variables. It is an unobservable common cause.

The next move in the argument is the hardest one — so let me state it plainly before building to it. The problem is not just that weather and sales correlate without causing each other. The problem is that the feedback between spend and demand makes it mathematically impossible to separate cause from effect from inside the model, regardless of how much data you have. This is a different kind of impossible than computational irreducibility. It is not about the system's complexity. It is about the data's structure.

Feedback loops come in two varieties, and they have opposite characters. Negative feedback — the kind in a thermostat or a biological hormone regulation system — corrects deviations: when the output rises above a target, the controller acts to push it back down. Negative feedback stabilizes. Positive feedback amplifies: when the output rises, the controller acts in the same direction, pushing it higher still. A microphone held near a speaker picks up the speaker's output and sends it back through the speaker, amplifying itself until the screech fills the room. A bank run works the same way: the news that a bank is failing causes depositors to withdraw funds, which makes the bank more likely to fail, which causes more withdrawals. The marketing-spend-and-sales loop is positive feedback: skilled media buyers observe strong demand and increase their spend, which correlates their spend with demand expectations, which correlates their spend with actual demand, which makes their spend look more effective than it is.

The feedback structure of the data — the second kind of hard — was formalized by Norbert Wiener in his 1948 book Cybernetics. Wiener was studying control systems: mechanical governors, biological regulators, social feedback loops. His key insight was that when a system's output feeds back into its input — when the controller adjusts based on what it observes — the causal direction becomes ambiguous from outside. You cannot tell, by watching the output, what the controller is doing versus what the underlying system being controlled — what engineers call "the plant" — is doing. The system is a loop. Loops don't have clean causal directions.

Marketing spend and demand are simultaneously determined: the media buyer's spend is itself a function of expected demand, not an independent cause of it.

A regression that uses only variables within this feedback loop cannot separate the causal effect of spend on demand from the anticipatory effect of demand on spend — it sees only their sum.

Here is the formal consequence for rule-based demand management — the kind of system that most businesses are actually running. William Ross Ashby's Law of Requisite Variety, stated in his 1956 An Introduction to Cybernetics, gives a precise account of why these systems fail not accidentally but structurally: a controller cannot manage a system that has more possible states than the controller itself can distinguish. The law itself is compact: "Only variety can absorb variety." For a controller to regulate a system, the controller must have at least as many distinguishable states as the system it regulates. Ashby showed the deep connection to Shannon's channel capacity theorem — a controller with insufficient variety is, in information-theoretic terms, a channel with insufficient capacity to transmit the disturbing environment's signals to the control output. A controller with insufficient variety cannot absorb the system's uncertainty, regardless of how well-designed it is. A rule-based demand management system with 200 pricing and promotion rules has approximately log₂(200) ≈ 7.6 bits of variety. The weather-demand state space — temperature bands × humidity × day-of-week × competitive context × seasonal factors — has approximately 20 bits of relevant entropy. These numbers are order-of-magnitude estimates; the point is structural, not quantitative. The gap — roughly 12 bits in this illustration — is not a number to trust precisely. What the calculation shows is that the weather-demand state space is high-dimensional in ways that rule-based systems systematically underrepresent. The rule-based system is, in Ashby's structural sense, incapable of controlling the demand system at this resolution. Adding more rules helps at the margin. It cannot close a gap of this kind.
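The arithmetic behind the gap is short enough to write out, using the essay's own illustrative figures (200 rules, a ~20-bit state space — both order-of-magnitude assumptions, not measurements):

```python
import math

n_rules = 200
controller_bits = math.log2(n_rules)   # variety a 200-rule controller can express

# The essay's order-of-magnitude estimate for the weather-demand state space
# (temperature bands x humidity x day-of-week x competitive context x season).
system_bits = 20.0

gap_bits = system_bits - controller_bits
states_per_rule = 2 ** gap_bits        # system states per distinguishable rule
print(f"gap ≈ {gap_bits:.1f} bits, i.e. ~{states_per_rule:,.0f} system states per rule")
```

A 12-bit gap means thousands of distinguishable demand states collapse onto each rule — which is what "insufficient variety" means in practice.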

The variety gap is not a modeling failure. It is a structural argument: you cannot solve the attribution problem by adding more rules to a rule-based system, because you are trying to regulate a system with more complexity than your controller can represent. The only exit is to use a variable that provides an external handle on the demand system — something that varies outside the feedback loop and can be used to open it up from outside.

What Ashby called Requisite Variety, the IV framework supplies through weather: a variable with enough external variation to match the degrees of freedom in the demand system. The gap between the limited variety of any rule-based controller and the full entropy of weather-driven demand is not closed by adding more rules. It is closed by importing a variable whose variety originates outside the system — whose degrees of freedom are generated by atmospheric dynamics that predate any commercial calendar, any promotional schedule, any pricing algorithm. The handle is not statistical cleverness. It is physics. A cold front arriving in Chicago on a Tuesday in July is not a decision any retailer made. It is information from outside the feedback loop. That is precisely what Ashby's theorem requires: the regulator must have access to states the system cannot anticipate or respond to in advance. Weather provides this. No rule-based system can. Richardson spent thirty years building a machine to measure weather accurately. Wright spent an appendix explaining why that measurement, once available, could do something that no amount of observational cleverness inside the demand system could accomplish. The external handle is the whole point.

There is one more layer that smooth models consistently miss: demand undergoes genuine phase transitions at category-specific temperature thresholds. A phase transition is not a smooth change. Water does not gradually become more ice-like as you cool it — it cools, cools, cools, and then, at a precise temperature, the entire system reorganizes at once. The transition is discontinuous: the derivative of the state variable changes sign or magnitude sharply at the threshold, not continuously through it. This is not a curiosity of physics. It is a structural feature of any system in which local interactions between components produce a global reorganization that no individual component was capable of producing on its own. In physical systems this emerges when a system crosses a critical point and a new organized state spontaneously appears — what physicists call broken symmetry. In economic systems the analogous structure occurs when multiple equilibria exist and the system is pushed past the threshold between them. Cold beverage demand climbs sharply above 75°F and explodes above 85°F — the relationship is sigmoidal, with a steep inflection region that functions as a threshold. Sunscreen demand peaks around 85°F and then inverts at extreme heat, as consumers retreat indoors and outdoor activity collapses. Outerwear demand has an analogous threshold around 50°F. These inflection points are real, measurable, and consistent across markets. They are not literally discontinuous in the thermodynamic sense — consumer demand does not have a true phase boundary with a precisely defined critical temperature — but they are steep enough to matter statistically: a model that fits a smooth linear relationship across the full temperature range will systematically underpredict demand above the kink and overpredict below it. 
The practical effect of a steep threshold in economic data is not identical to a phase transition in physics, but it produces the same failure mode in smooth-function models.
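The failure mode is easy to reproduce. In the sketch below, demand follows a stylized sigmoid in temperature (all parameter values are invented for illustration); a straight line fitted by least squares across the full range underpredicts above the threshold region and overpredicts below it, exactly as described:

```python
import math

def sigmoid_demand(temp_f, base=100.0, surge=300.0, mid=82.0, steepness=0.35):
    """Stylized cold-beverage demand: flat baseline, steep ramp near a threshold."""
    return base + surge / (1 + math.exp(-steepness * (temp_f - mid)))

temps = list(range(50, 101))
demand = [sigmoid_demand(t) for t in temps]

# Ordinary least squares line fit across the full temperature range.
n = len(temps)
mx, my = sum(temps) / n, sum(demand) / n
slope = sum((t - mx) * (d - my) for t, d in zip(temps, demand)) / sum((t - mx) ** 2 for t in temps)
intercept = my - slope * mx

# Actual minus fitted: positive means the smooth model underpredicts.
resid_95 = sigmoid_demand(95) - (intercept + slope * 95)   # above the kink
resid_70 = sigmoid_demand(70) - (intercept + slope * 70)   # below the kink
print(f"residual at 95F: {resid_95:+.0f}, at 70F: {resid_70:+.0f}")
```

The linear model is not slightly wrong; it is wrong with a systematic sign pattern keyed to the threshold, which is the signature of a missing kink.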

So we have three distinct failure modes, not one. And the solution to each is different. The mistake Schaumburg made — the mistake most analytics teams make — is treating all three as one problem called "modeling." They are not one problem.

Taken together, these three properties — computational irreducibility, feedback-induced identifiability failure, and phase transitions — explain why prediction and attribution are categorically different problems. The forecasting tradition spent decades solving the first. Foundation models are its triumphant end product. But prediction tells you what will happen. It does not tell you what caused what happened. And attribution — answering the causal question — requires something that breaks open the feedback loop from the outside. Something exogenous. We'll call it an instrument. But first, it is worth understanding what the feedback loop actually looks like. The feedback loop does not feel like a problem when you are inside it. The regression runs. The coefficient is positive. The chart goes up. The wrongness is invisible from inside the loop — and it is invisible because it is structural, not accidental. Understanding the structure is how you find the exit.

There is a fourth complication, less fundamental than the three above but practically important: the weather-demand parameter space is enormous. Temperature, humidity, precipitation, wind, UV index — each of these interacts with demand differently, and each of their pairwise interactions is a distinct parameter that the model must estimate. As the number of weather variables grows, the number of interaction terms explodes faster than data can support — which means that even after solving the feedback problem with an instrument, you face a dimensionality problem in the estimation step itself. Understanding the scale of that problem is useful preparation for the hierarchical pooling solution in Section 5.
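The combinatorics are worth seeing directly. Assuming k weather-and-context variables and counting all interaction terms up to a given order (a deliberately simplified model of the estimation problem):

```python
from math import comb

def n_terms(k, max_order):
    """Main effects plus all interactions up to max_order among k variables."""
    return sum(comb(k, order) for order in range(1, max_order + 1))

daily_obs_2yrs = 730   # two years of daily observations per store
for k in (5, 10, 20, 40):
    print(f"k={k:2d}: pairwise={n_terms(k, 2):5d}, three-way={n_terms(k, 3):6d}")
```

With 20 variables, the three-way specification alone already has more parameters than two years of daily data per store can support — which is the setup for the hierarchical pooling solution in Section 5.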

The Schaumburg model had two separate failures. The team didn't know it. They thought they had one problem — accuracy — and they were solving it. But the model couldn't predict what would happen next July 17, and it couldn't explain what had happened last July 17. These are not the same failure. The first is because demand dynamics are practically irreducible — not in Wolfram's strict sense, but in the sense that the interactions are complex enough that no tractable model can shortcut the forward evolution. The second is because spend and sales are in a feedback loop: the regression coefficient is a weighted average of cause and anticipation, and there is no way to disentangle them from inside the model. Lorenz didn't merely discover that weather was hard to predict. He discovered that unpredictability is built into the equations themselves — and that unpredictability and unattributability are two different problems, requiring two different tools.

On Lorenz and chaos: "Deterministic Nonperiodic Flow" appeared in the Journal of Atmospheric Sciences 20(2), pp. 130–141 (1963). The "butterfly" framing was introduced by Philip Merilees at a 1972 AAAS meeting in Washington, D.C.; the full title was "Predictability: Does the Flap of a Butterfly's Wings in Brazil Set Off a Tornado in Texas?" Lorenz preferred the seagull metaphor but accepted Merilees's butterfly — the alliteration of butterfly-Brazil, tornado-Texas is usually cited as the reason, though the precise origin of this detail is uncertain in the historical record. The Royal McBee LGP-30 was manufactured beginning in 1956; it had 113 vacuum tubes and 1,450 diodes, with main memory held on a 4,096-word magnetic drum. That Lorenz was running simulations on such hardware is a reminder that the computational history of meteorology is, among other things, a history of working at the edge of available compute.

On Ashby's Law and Shannon's theorem: The formal equivalence is proven in Part III of An Introduction to Cybernetics (1956), Theorem 11: "The Requisite Variety of a regulator R is bounded below by the entropy H(D) of the disturbing environment D." This is formally identical to Shannon's Theorem 10 on channel capacity. The 200-rule / 20-bit calculation in the text is approximate: a 200-rule system distinguishes log₂(200) ≈ 7.6 bits; the weather-demand state space is estimated at 8 weather variables × 2.5 bits each plus 2 bits of temporal context — about 22 bits, rounded in the text to the ~20-bit order of magnitude. These are order-of-magnitude estimates; actual dimensionality depends on the granularity of the weather data and the temporal aggregation level used in the demand model.

On computational irreducibility and forecasting: Wolfram (2002) defines a system as computationally irreducible if "the only way to determine its behavior is to actually run through the evolution, with no shortcut possible." This is a stronger claim than chaos — a chaotic system can still be computationally reducible if you can find a closed-form solution. Rule 30 is irreducible in the strong sense. The demand application is weaker: fine-grained demand is computationally irreducible, not all demand dynamics. The practical implication is that forecasting errors at the tail of the distribution — extreme weather events, novel promotional contexts — are not improvable through standard ML scaling.

2

Instrumental Variables: How to Break Simultaneity Bias

Section 1.5 identified three distinct failures. The first — computational irreducibility — is a fundamental limit on prediction. The third — phase transitions — is a modeling specification issue. The second — feedback-induced identifiability failure — is the one that cannot be fixed from inside the model, and it is the one this section is about. Here is what it looks like, concretely, from inside a clean, well-specified regression.

The dataset is clean. The regression runs. The coefficient on the treatment variable is positive, statistically significant, and stable across holdout periods. Everything looks right. Something is wrong.

The wrongness has a name. Trygve Haavelmo stated it precisely in 1943, in a paper called "The Statistical Implications of a System of Simultaneous Equations," published in Econometrica. Forty-six years later, he won the Nobel Prize for it. That gap — 1943 to 1989 — is how long it takes for a correct idea to percolate through an applied discipline when the correct idea is uncomfortable.

Here is the idea: when the variables in a regression are jointly determined — when they simultaneously cause each other — the ordinary least squares (OLS) estimator is biased. Not slightly biased. Systematically, directionally biased, in a way that is always upward when the simultaneity comes from rational anticipation.

Here is the math, because it is cleaner than the prose. The feedback loop appears wherever an agent observes demand and responds to it. In advertising it is a media buyer; in energy markets it is a utility adjusting reserve capacity; in retail it is a buyer pre-ordering inventory. The structure is the same. Let Y (outcome) and T (treatment) be simultaneously determined:

Y_t = α·T_t + β·W_t + ε_t

T_t = γ·E[Y_t | F_{t−1}] + η_t

The first equation says the outcome is driven by the treatment T, weather W, and a noise term ε. The second says the treatment is driven by the agent's expectation of the outcome — their forecast, based on information available before period t — plus their own noise. Substitute the second into the first: because T_t depends on anticipated outcomes, and anticipated outcomes depend partly on the same weather and baseline conditions that drive actual outcomes, T_t and ε_t are correlated. Cov(T_t, ε_t) ≠ 0. OLS requires this covariance to be zero. It isn't.

The bias in the OLS estimate of α is:

plim(α̂_OLS) = α + Cov(T_t, ε_t) / Var(T_t)

The term Cov(T_t, ε_t) / Var(T_t) is always positive when γ > 0 — when the agent is doing their job and successfully anticipating demand. This is the paradox that makes the measurement problem particularly vexing: the better the agent, the worse the OLS estimate. A perfectly competent agent — one who precisely tracks demand signals and adjusts treatment optimally — produces a dataset where the OLS estimate of treatment effectiveness is maximally wrong. This is not a modeling assumption. It is a mathematical consequence of competent behavior inside a feedback system.
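The bias is easy to reproduce in simulation. In the sketch below (all coefficients invented for illustration), the true effect α is 1.0, the agent partially anticipates the demand shock, and the OLS slope comes out above the truth by exactly the Cov(T, ε)/Var(T) term:

```python
import random

random.seed(1)
alpha_true = 1.0   # true causal effect of treatment on outcome
gamma = 0.8        # how strongly the agent chases its demand forecast
n = 20000

T, Y = [], []
for _ in range(n):
    signal = random.gauss(0, 1)               # agent's private demand forecast
    t = gamma * signal + random.gauss(0, 1)   # treatment responds to the forecast
    eps = signal + random.gauss(0, 1)         # outcome shock, partly anticipated
    y = alpha_true * t + eps
    T.append(t); Y.append(y)

# OLS slope of Y on T through the origin (both series are mean-zero here).
alpha_ols = sum(t * y for t, y in zip(T, Y)) / sum(t * t for t in T)
bias_predicted = gamma / (gamma ** 2 + 1)     # Cov(T, eps) / Var(T) for this setup
print(f"truth {alpha_true}, OLS {alpha_ols:.2f}, predicted bias {bias_predicted:.2f}")
```

Note what makes the bias large: a higher γ — a more skilled agent — pushes the OLS estimate further from the truth, which is the paradox stated above.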

The Gordon et al. magnitude — five to ten times, from Section 1 — is not a rounding error but a structural consequence of the upward bias term in the OLS estimator.

The practical consequence is the Gordon et al. finding: observational estimates of advertising effects are 5–10x higher than experimental estimates. The observational model sees spend rising when demand is high and attributes the high demand to the spend. The experiment, which randomizes the spend, cannot be fooled this way. The experiment asks: in matched markets where only spend differs, what is the sales difference? The observational model asks: across all periods, when spend is high, are sales high? The first question has a causal answer. The second has a correlational one. They are not the same question. Everything is downstream of everything else, and when everything is downstream of everything else, the concept of cause starts to get slippery in ways that are genuinely uncomfortable to sit with. The same structure appears wherever strategic agents observe demand and respond to it: energy utilities pricing reserves ahead of demand peaks, agricultural suppliers expanding capacity in response to price signals, labor markets where job posting rates respond to employment forecasts.

The simultaneity structure is not specific to advertising. It appears wherever an agent observes demand signals and responds to them. Here is what it looks like in a different domain — and here is what an instrument looks like in practice.

Consider a utility trying to measure whether its conservation incentive program actually reduces power consumption. The utility runs the campaign in summer — when it anticipates high demand. Summer is also hot. Hot weather causes high demand independently of any campaign. The observational regression sees: campaign active → high demand → high power use. It concludes the campaign does nothing, or even increases use. This is backwards. The instrument, in this case, was a billing-cycle deadline: households whose billing cycle ended in the first week of June received mailers earlier than otherwise-identical households whose cycle ended in the third week. The deadline shifted who received the mailer without affecting the weather, the season, or the underlying demand. That variation — exogenous to everything except the administrative calendar — is the handle that breaks open the loop, solid ground to stand on outside the feedback system.

The natural response — and the honest one, so it deserves a direct answer — is: why not just run experiments? If randomized trials give you causal estimates, run them. The answer is that randomized experiments are the gold standard for prospective causal claims, and when they're available they should be used. Bayesian sequential testing, multi-armed bandit methods, geo-experimentation: these are genuinely useful tools for making decisions about future campaigns.

But they cannot answer the question the Schaumburg team was asking. A randomized geo-experiment — 20 markets increase spend, 20 markets hold flat, measure the difference — gives you a clean causal estimate of advertising effectiveness for those markets, during that period, under those conditions. It does not tell you what caused Q2 sales last year. It does not allow you to decompose past performance into weather effect, advertising effect, and baseline. And perhaps most importantly for large retail chains: an experiment-based estimate of average advertising effectiveness is a different estimand than a hierarchical model of advertising effectiveness by market, by season, by weather condition. You need the latter to make good decisions at the store level. Experiments give you an average. The business operates in specific markets, in specific conditions, not in averages.

Double/Debiased Machine Learning (DML), introduced by Victor Chernozhukov and colleagues in 2018, is the most sophisticated observational method for this problem. The idea: use machine learning models to partial out high-dimensional confounders — seasonality, day-of-week, competitor activity, regional trends — from both the outcome variable and the treatment, then run the causal estimation on the residuals. The double-robustness property means the estimate is consistent even if one of the ML models is slightly misspecified. This is a real advance. But DML is still an observational method: it controls for observed confounders, but it cannot remove the simultaneity bias from the strategic agent's unobserved demand anticipation. The endogeneity is in the error term, not in a covariate.

DML cleans up what you can observe. It cannot clean up what you cannot. The media buyer's forecast — the signal that informed this week's spend — is not in your dataset. It never was.
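The cross-fitting mechanics are worth sketching for the case DML does handle: an observed confounder. In the toy below (invented data-generating process; plain linear regressions stand in for the flexible ML learners of real DML), each fold's nuisance models are fit on the other fold, and the causal slope is estimated on the cross-fitted residuals:

```python
import random

random.seed(3)
n = 4000
alpha_true = 1.0

# Observed confounder X drives both treatment T and outcome Y,
# alongside a true effect of T on Y.
X, T, Y = [], [], []
for _ in range(n):
    x = random.gauss(0, 1)
    t = 0.9 * x + random.gauss(0, 1)
    y = alpha_true * t + 1.5 * x + random.gauss(0, 1)
    X.append(x); T.append(t); Y.append(y)

def ols_slope(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)

# Cross-fitting with K = 2 folds: nuisance regressions are fit on one fold
# and used to residualize the other.
folds = [list(range(0, n, 2)), list(range(1, n, 2))]
res_t, res_y = [], []
for k in (0, 1):
    train, test = folds[1 - k], folds[k]
    bt = ols_slope([X[i] for i in train], [T[i] for i in train])  # X -> T nuisance
    by = ols_slope([X[i] for i in train], [Y[i] for i in train])  # X -> Y nuisance
    res_t.extend(T[i] - bt * X[i] for i in test)
    res_y.extend(Y[i] - by * X[i] for i in test)

alpha_dml = ols_slope(res_t, res_y)   # causal slope on cross-fitted residuals
print(f"DML estimate: {alpha_dml:.2f} (truth {alpha_true})")
```

The recovery works because X is in the dataset. Replace X with the media buyer's private forecast — a variable nobody recorded — and there is nothing to residualize against, which is the limitation the text describes.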

Other design-based methods exist — difference-in-differences requires parallel trends across markets that may not hold when weather varies; regression discontinuity discards variation away from the threshold and works best for sharp natural experiments; synthetic control requires a long pre-treatment history and does not scale easily to 847 stores simultaneously. Each has its place. IV dominates in the specific setting where weather provides rich, continuous, multi-dimensional exogenous variation across every market, every day — which is exactly this setting.

What you need — what the math demands — is variation in the treatment that is not caused by variation in demand expectations. You need a source of variation that comes from outside the feedback loop. Something exogenous. A signal that perturbs the system without being perturbed by it. The word "instrument" is borrowed from physics: an instrument reads a signal without changing it. In econometrics, an instrumental variable is a variable that shifts the treatment without directly affecting the outcome through any channel other than the treatment itself. This problem is not unique to advertising — it appears everywhere economic agents observe demand and respond to it.

A solution has been sitting in an economics appendix since 1928. Nobody read the appendix. Everyone read the main text about animal and vegetable oil tariffs.

In Section 1.5, the confounder was a third variable you could observe and control for. Here, the confounder is embedded in the feedback loop itself — invisible inside the regression. The correlation you see in the data is not a spurious third-variable correlation: it is a structural artifact of the loop. Adding more variables to the regression cannot remove it, because it is not caused by a variable. It is caused by the data's architecture.

The temp_hi_f column was there the whole time. Their regression included it as a control — a hedge against spurious correlation. But controlling for weather is not the same as using weather. Weather in their model was a noise suppressor. What it should have been was a signal amplifier — a lever for extracting causal structure from data that, without it, was a closed loop of spend and sales chasing each other in circles. The instrument was in column 47 the entire time. Nobody had clicked on it.

In practice, the difference looks like this. Including temp_hi_f as a control variable treats weather as a confounder to be partialled out: the regression conditions on temperature, gives you a coefficient on advertising spend that holds temperature fixed, and moves on. Running weather as an instrument means something different entirely: use temperature variation to construct predicted values of advertising spend that are orthogonal to demand anticipation — the component of spend variation that weather caused, not the component that demand expectations caused — then regress sales on those predicted values. The first approach produces a prediction model that accounts for weather. The second produces a causal estimate that uses weather as a lever to break the feedback loop. Same column 47. Entirely different role. The Section 3 math makes this precise. For now: the team had the instrument. They were treating it as a nuisance variable.
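The "same column, different role" point can be made concrete with a minimal two-stage least squares sketch on simulated data (all coefficients invented; weather is exogenous here by construction, which is the assumption the real method must defend):

```python
import random

random.seed(2)
alpha_true = 1.0
n = 20000

W, T, Y = [], [], []
for _ in range(n):
    w = random.gauss(0, 1)                            # weather: exogenous variation
    signal = random.gauss(0, 1)                       # buyer's unrecorded demand forecast
    t = 0.6 * w + 0.8 * signal + random.gauss(0, 1)   # spend responds to both
    y = alpha_true * t + signal + random.gauss(0, 1)  # demand: true effect + anticipated shock
    W.append(w); T.append(t); Y.append(y)

def slope(x, y):   # OLS slope through the origin (all series are mean-zero)
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

alpha_ols = slope(T, Y)        # biased upward: spend chases the demand forecast

# Two-stage least squares: stage 1 predicts spend from weather alone;
# stage 2 regresses demand on the weather-driven component of spend.
first_stage = slope(W, T)
T_hat = [first_stage * w for w in W]
alpha_iv = slope(T_hat, Y)
print(f"OLS {alpha_ols:.2f}, IV {alpha_iv:.2f}, truth {alpha_true}")
```

Same weather variable in both runs. As a control it leaves the upward bias intact; as an instrument it isolates the spend variation that weather caused and recovers the true effect.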

On Haavelmo's simultaneous equations: Haavelmo (1943) showed that a system of equations in which each dependent variable appears as a regressor in other equations cannot be estimated equation-by-equation with OLS without producing biased and inconsistent estimates. The formal proof uses the joint distribution of endogenous variables: because they are jointly determined, they are correlated with the error terms of each other's equations. The identification solution — using instrumental variables or exclusion restrictions to isolate exogenous variation — was developed through Haavelmo, Wright, and the Cowles Commission (Koopmans, Marschak, Klein, 1944–1950). Haavelmo won the 1989 Nobel Prize "for his clarification of the probability theory foundations of econometrics and his analyses of simultaneous economic structures."

On Double Machine Learning: DML (Chernozhukov et al., 2018, Econometrics Journal 21(1), C1–C68) extends IV estimation to the high-dimensional confounder setting through cross-fitting: the sample is split into K folds; nuisance functions are estimated on the complement of each fold and used to partial out confounders within each fold; the causal parameter is estimated from the cross-fitted residuals. The Neyman-orthogonality condition ensures that estimation errors in the nuisance functions do not inflate the estimation error in the causal parameter to first order. DML is not an IV estimator in the strict sense — it does not require an instrument — but it can be combined with IV to handle both high-dimensional confounders and endogeneity simultaneously.
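A minimal sketch of the cross-fitting idea, using linear nuisance regressions in place of the flexible ML learners the method is designed for; all data and coefficients are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, theta = 20_000, 5, 1.5

X = rng.normal(0, 1, (n, p))                               # confounders
d = X @ rng.normal(0, 1, p) + rng.normal(0, 1, n)          # treatment
y = theta * d + X @ rng.normal(0, 1, p) + rng.normal(0, 1, n)  # outcome

# Two-fold cross-fitting: nuisances are fit on one half of the sample,
# residuals are formed on the other half.
idx = rng.permutation(n)
folds = [idx[: n // 2], idx[n // 2:]]
rd = np.empty(n)
ry = np.empty(n)
for k, fold in enumerate(folds):
    train = folds[1 - k]
    gd = np.linalg.lstsq(X[train], d[train], rcond=None)[0]  # approx E[D|X]
    gy = np.linalg.lstsq(X[train], y[train], rcond=None)[0]  # approx E[Y|X]
    rd[fold] = d[fold] - X[fold] @ gd
    ry[fold] = y[fold] - X[fold] @ gy

# Causal parameter from the cross-fitted residuals (partialling-out form).
theta_hat = (rd @ ry) / (rd @ rd)
print(f"true {theta} | estimated {theta_hat:.3f}")
```

With linear nuisances this collapses to familiar partialling-out; the point of the real method is that the same recipe stays valid when the nuisance regressions are random forests or neural networks, because cross-fitting plus Neyman orthogonality keeps their estimation error from contaminating theta.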

3

Philip Wright's 1928 Proof and the Credibility Revolution

Section 3 is partly a history lesson and partly a technical argument. The history: Wright invented the instrumental variable in a tariff monograph about animal and vegetable oil imports. The technical argument: for an instrument to produce credible causal estimates rather than plausible-sounding but misleading ones, specific conditions must be met — conditions that economists spent sixty years arguing about before reaching consensus. Understanding both the method and its credibility conditions is what separates a valid instrument from a sophisticated-looking correlation.

Philip Wright's 1928 monograph on animal and vegetable oil tariffs is not a document that anyone reads today. Except for four pages at the back that changed economics.

Philip Green Wright was, in various orderings and with equal legitimacy, a tariff researcher, a poet, an economist at the Brookings Institution, and the man who first published Carl Sandburg. He ran a hand press in his basement in Galesburg, Illinois — he called it the Asgard Press — and in 1904 he printed a 39-page chapbook called "In Reckless Ecstasy": 17 poems, 6 prose vignettes, 100 copies, one dollar each. This was how Sandburg, then one of Wright's students at Lombard College, entered the literary world. Wright published his own poetry on the same press. He was also doing serious empirical economics. He had a Harvard economics education that he mostly used for advocacy rather than scholarship — a Mugwump-tradition writer who believed that facts about markets, stated plainly enough, could change policy.

He was right about the facts. He was wrong about who would read them.

In 1928, he published The Tariff on Animal and Vegetable Oils, a 341-page monograph commissioned, it appears, by interests connected to the vegetable oil trade. The monograph is exactly what it sounds like: a careful, dull, and thoroughly professional analysis of import tariffs on flaxseed oil, cottonseed oil, lard, tallow, and their substitutes. The book was read by a small number of agricultural economists and then not read for about thirty years.

Appendix B is four pages long. It introduces, in the context of estimating demand and supply curves for flaxseed oil, a statistical technique for disentangling supply and demand by introducing an external factor. The core problem: if you want to know how price affects quantity demanded — the slope of the demand curve — you cannot simply regress quantity on price. Because price and demand are jointly determined. Price goes up when demand is high; demand goes up when prices fall; they chase each other through time. A scatter plot of price against quantity shows you the intersection of supply and demand dynamics, not either one separately. You see the system's output. You don't see the structure.

Wright proposed a solution. Find a third variable — something that shifts supply without shifting demand. If such a variable exists, you can use changes in that variable to trace the demand curve: when supply shifts (costs go up, quantity decreases, price rises) without any change in consumer preferences, the resulting price-quantity pairs lie along the demand curve. Connect them.

That is the demand curve's slope. The instrument Wright used for flaxseed oil was weather. Not just any weather: rainfall and growing-season conditions that affected crop yield, and through crop yield, affected supply, while leaving consumer preferences for flaxseed oil entirely unaffected. But Wright immediately grasped that the principle was general: any variable that shifts supply without shifting demand, or that shifts one side of a market without shifting the other, could serve as an instrument. Tariff changes, set by the rhythm of Congress rather than by consumer preferences. Policy shocks. Geographic accidents. Rainfall. The instrument doesn't have to be weather. It has to be exogenous. The approach — finding a variable that could read the supply-demand structure without being distorted by it — is precisely what physicists call a probe: a device that reads a signal without changing it. The econometrics literature would later call it an instrumental variable. The name stuck. The four pages stuck. Everything else in the monograph is forgotten.
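Wright's argument can be checked numerically. The sketch below simulates a toy market with his structure — weather shifts supply only — using invented coefficients; the ratio-of-covariances estimator is the simplest form of his instrument:

```python
import numpy as np

rng = np.random.default_rng(1928)
n = 200_000

w = rng.normal(0, 1.5, n)        # weather: shifts supply, not demand
u = rng.normal(0, 1, n)          # demand shock
v = rng.normal(0, 1, n)          # supply shock

# Structure: demand  q = 10 - 1.0*p + u
#            supply  q =  2 + 1.0*p + 1.0*w + v
# Solve the two equations for the equilibrium price, then quantity:
p = (10 - 2 - 1.0 * w + u - v) / (1.0 + 1.0)
q = 10 - 1.0 * p + u

# Naive regression of quantity on price sees the tangle, not the demand curve.
ols = np.cov(p, q)[0, 1] / np.var(p)
# Wright's move: weather-driven supply shifts trace out the demand curve.
iv = np.cov(w, q)[0, 1] / np.cov(w, p)[0, 1]

print(f"true demand slope -1.00 | OLS {ols:.2f} | IV {iv:.2f}")
```

The OLS slope lands somewhere between the supply and demand slopes because both curves are moving at once; the instrument isolates the price variation that only the supply side caused.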

There is an unresolved scholarly question about whether Philip actually wrote Appendix B, or whether his son Sewall Wright — the geneticist who invented path coefficients in the 1920s and who has a legitimate claim to being the originator of structural equation modeling — contributed to it. James Stock and Francesco Trebbi (2003) argued, using a combination of stylometric comparison — the statistical study of writing style as a fingerprint, using measurable features of text such as sentence length, function-word frequency, and syntactic patterns to identify authorship — and analysis of the historical record, that Philip was the sole author. But the letters between Philip and Sewall over the winter of 1925–26 suggest the ideas were developed jointly. The econometrician Joshua Angrist, who received the 2021 Nobel Prize partly for work that builds directly on this method, has described the paternity question as genuinely unresolvable given available evidence. What is not in question is that the technique — finding exogenous variation in one variable to identify the causal effect on another — traces its first clear statement to those four pages, in that appendix, about oil tariff policy.

[Interactive: Move the weather slider to shift supply. Watch the intersection trace the demand curve. This is what Wright figured out in 1928.]

Philip Wright died in 1938. His appendix sat largely unread until Tinbergen and Haavelmo developed related ideas independently in the 1940s, and until the Cowles Commission — a group of mathematical economists working in Chicago and then New Haven — formalized the simultaneous equations framework in the late 1940s and early 1950s. The Commission was something new in economics: an institution that believed economic theory could be made rigorous in the way physics had become rigorous — through mathematics, through explicit assumptions, through empirical test. Its motto was “Science is Measurement.” It attracted economists who wanted to turn Haavelmo's probability-theoretic framework for econometrics into operational technique: Jacob Marschak, Tjalling Koopmans, Lawrence Klein. Their central project was the simultaneous equations model — a formal system for representing economies as sets of equations in which multiple variables are jointly determined and classical regression fails. They were doing, for economics, what Richardson had done for meteorology: turning intuitive understanding into numerical procedure. The method reappeared under the name "instrumental variables" in the structural econometrics tradition of the 1960s and 1970s. And then, in 1983, a critique.

Edward Leamer's "Let's Take the Con Out of Econometrics," published in the American Economic Review, is one of the most cited papers in the history of applied economics — and one of the most uncomfortable to read if you're an applied economist. Leamer's argument: the profession's standards for what constitutes a valid instrumental variable are embarrassingly low. The exclusion restriction — the claim that the instrument affects the outcome only through the treatment, not through any other channel — is typically assumed rather than tested. Point estimates are highly sensitive to which instruments are included and which confounders are controlled. Leamer's critique was not that IV was wrong. It was that IV, as practiced, was not rigorous enough to be trusted.

That is a sharper claim. Wrong is fixable. Insufficiently rigorous simply becomes the default. The sociological pattern was consistent across subfields: an empirically correct result arrives, the establishment says it is "not possible," and eventually the result forces the field to revise its standards. It happened to Makridakis. It happened to Wright. It happened to Leamer.

The credibility revolution was a response to a specific failure mode of structural econometrics. Before Angrist, Imbens, and Card, the standard approach to measuring causal effects in economics was to build a model of the entire economy — a system of simultaneous equations with explicit assumptions about technology, preferences, and behavior — and then estimate the model's parameters from observational data. The method was mathematically sophisticated and could produce causal estimates from non-experimental data. The problem was that the causal estimates were only as credible as the model's assumptions, and those assumptions were typically untestable, non-obvious, and chosen to make the math tractable rather than because anyone believed them empirically. Leamer's 1983 critique documented this precisely: move the assumptions slightly, and the estimates change drastically. The credibility revolution's response was to abandon the search for model-based identification and instead find natural or quasi-natural experiments — situations where real-world events had created random or as-good-as-random variation in treatment assignment, independently of any model. The draft lottery. A state border. An administrative deadline. These “instruments” provided causal identification that was credible because the variation was genuinely exogenous — not assumed to be exogenous based on a model that nobody had tested.

The credibility revolution that followed is the direct response. On December 1, 1969, in a ceremony broadcast on television, a Selective Service official removed blue plastic capsules from a large glass drum one at a time. Each capsule contained a date of birth. The order of extraction determined, for every American man born between 1944 and 1950, his relative priority for military service in Vietnam. Men born on September 14th were capsule 1. Men born on June 8th were capsule 366, the last drawn, and most of them were not drafted at all. The assignment was random: the drum did not know or care about your family income, your education, or your connections.

Joshua Angrist was a graduate student in economics at Princeton when he realized what this meant statistically. The lottery had, inadvertently and without anyone understanding it as such at the time, run something very close to a randomized controlled experiment on military service. You could compare men with low lottery numbers — who served at high rates — to men with high lottery numbers — who did not — and the two groups would be comparable on everything except, probabilistically, service itself. Angrist published the paper in 1990. White veterans, he found, earned approximately 15% less than comparable nonveterans in the early 1980s, a decade after their service, using Social Security administrative records as the outcome data. The instrument was a drum. The experiment was a war.

David Card and Alan Krueger's 1994 study of the New Jersey minimum wage is the canonical difference-in-differences design: a treated state compared to a control group across a border. On April 1, 1992 — the date is coincidentally April Fools' Day, which is either meaningful or not — New Jersey raised its minimum wage from $4.25 to $5.05 per hour. Card and Krueger surveyed 410 fast food restaurants by telephone, in New Jersey and in eastern Pennsylvania, where the minimum wage had not changed, before and after the raise. Classical labor economics predicted employment in New Jersey should fall: higher wages mean fewer jobs. What Card and Krueger found was that employment in New Jersey's fast food sector increased by approximately 13% relative to Pennsylvania. A reviewer called the telephone survey methodology "a monument to poor survey methodology." David Neumark and William Wascher published a replication using actual payroll records and got the opposite result. The fight went on for years. In October 2021, the Royal Swedish Academy awarded David Card the Nobel Prize in Economic Sciences, partly for this work. Alan Krueger had died in March 2019, age 58, two years before. The prize is not awarded posthumously. These methods weren't invented in the 1990s. They were formalized, and their credibility conditions were stated precisely. The credibility revolution is the name for that formalization.

Angrist and Guido Imbens added a crucial clarification in 1994 with the Local Average Treatment Effect (LATE) theorem. When treatment effects are heterogeneous — when the advertising campaign has different effects for different consumers — IV doesn't estimate the average treatment effect across all consumers. It estimates the treatment effect for "compliers": consumers whose treatment status changes when the instrument changes. In our case, compliers are consumers who buy a cold beverage on a hot day but not a cool one.

To see why the LATE is an inescapable constraint and not a modeling choice, consider what the instrument actually does. Weather shifts some consumers from “not buying” to “buying cold beverages.” These are the compliers. A second group — always-takers — buy cold beverages regardless of the weather; the instrument doesn't change their behavior. A third group — never-takers — won't buy no matter how hot it gets. The instrument has no leverage over always-takers or never-takers. Their behavior doesn't change when the instrument moves, so the instrument cannot identify any causal effect for them. The IV estimator, built as it is around the instrument's movement, can only learn about compliers — the people who actually respond to the source of exogenous variation being exploited. This is not an approximation or a limitation of the method. It is a logical necessity: to estimate the effect of changing someone's behavior, you need people whose behavior can be changed by something random. The LATE is what the instrument can see. Extending the estimate to always-takers and never-takers requires additional assumptions — typically, that compliers look like the full population in their treatment effects — which is an assumption, not a result.
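The complier logic can be seen directly in simulation. The shares, baselines, and effects below are invented for illustration: the Wald estimator recovers the complier effect even when always-takers have different baselines and different treatment effects, while the naive treated-versus-untreated comparison does not:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

# Hypothetical population: 40% compliers, 30% always-takers, 30% never-takers.
types = rng.choice(["complier", "always", "never"], n, p=[0.4, 0.3, 0.3])
z = rng.integers(0, 2, n)                       # instrument: hot day, randomly assigned
# Treatment: always-takers buy regardless, never-takers never, compliers follow z.
a = np.where(types == "always", 1, np.where(types == "never", 0, z))

base = np.where(types == "always", 3.0, 1.0)    # always-takers differ at baseline
effect = np.where(types == "always", 5.0, 2.0)  # heterogeneous treatment effects
y = base + effect * a + rng.normal(0, 1, n)

# Wald / IV estimator: reduced-form difference over first-stage difference.
wald = (y[z == 1].mean() - y[z == 0].mean()) / (a[z == 1].mean() - a[z == 0].mean())
# Naive comparison: contaminated by always-takers' selection and larger effect.
naive = y[a == 1].mean() - y[a == 0].mean()

print(f"complier effect 2.0 | Wald {wald:.2f} | naive {naive:.2f}")
```

The instrument never moves the always-takers or never-takers, so their (different) baselines and effects cancel out of the Wald ratio; only the compliers' effect survives. That is the LATE theorem in miniature.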

If you care about the complier consumer — and if you're planning summer inventory for a convenience store chain, you do — the LATE is precisely the right estimand. The LATE is not the estimate being wrong. It's the estimate being precise about a specific, identifiable subpopulation. You need to know which subpopulation you're looking at.

Angrist and Imbens won the Nobel Prize for this work in 2021, shared with David Card. The committee cited their "methodological contributions to the analysis of causal relationships." Ninety-three years after Wright's appendix. Twenty-seven years after the LATE theorem. Philip Wright died in 1938. He never knew any of this had happened. The appendix did its work without him.

Three results. Three different instruments. What they have in common is the exogeneity: the draft lottery was set by a drum; the state border was set by history; the billing-cycle deadline was set by administrative accident. In each case, the instrument provides variation in the treatment that is not caused by variation in the outcome. Weather is in this class. The next section tests whether it earns its membership.

[Interactive: Peel the first layer. Does the correlation change? Keep peeling. What's left when all the confounders are gone?]

Weather works the same way as the draft lottery, for the same reason. The temperature at the corner of Wacker and State on July 17, 2019, was determined by atmospheric dynamics — the pressure gradient between the Great Plains and the Great Lakes, the jet stream's behavior over the preceding week, the specific moisture content of the air mass that moved through Chicago that morning. It was not determined by the marketing team's advertising budget. It was not determined by consumers' underlying preferences for cold beverages. It is, in the strict technical sense, exogenous to the demand-advertising feedback loop.

It is worth being precise about how the modern application differs from Wright's original. Wright used weather on the supply side of the market: rainfall shifted crop yields, which shifted supply, which — given unchanged consumer preferences — allowed the demand curve to be traced. His exclusion restriction was that rainfall didn't affect consumer demand for flaxseed oil directly. The modern application inverts the instrument's position in the causal graph entirely: here, weather shifts consumer demand directly, and we use that exogenous demand variation to isolate the advertising effect. The exclusion restriction we must defend is different — that weather affects advertising-induced demand only through consumer purchase behavior, not through the advertiser's own choices. Both exclusion restrictions are defensible. They are simply different arguments, and each must stand on its own. Section 4 makes that argument for the demand-side application. Economists have since used weather to identify demand elasticity for energy, agricultural commodities, retail goods, and labor supply — always with a separately stated exclusion restriction for each domain. The instrument is domain-general in the sense that its exogeneity holds universally. The specific restriction it must satisfy differs every time. Weather is the draft lottery of economic behavior — exogenous, universally experienced, and perpetually running, but identifying a specific causal effect only when the validity conditions are met in the specific context. Unlike the draft lottery, weather is not independently drawn each day — it is autocorrelated across time and structured across space. The instrument's power comes from using weather anomalies — departures from climatological normal — rather than raw levels, so that the residual variation, conditional on season and geography, is as close to randomly assigned as any economic instrument gets.

[Interactive: Move the instrument strength slider toward zero. Watch what happens to the confidence interval. Now ask: is your instrument this strong? What are you actually learning from a weak one?]

Philip Wright solved a version of this problem in 1928 using flaxseed oil. The mechanism is identical. A tariff on imported vegetable oil shifts supply — costs go up, prices go up, quantity sold changes. But does observing price and quantity together tell you the demand curve's slope? No. Because price and quantity move together for a dozen reasons at once. Wright's insight — that something shifting supply but not demand could separate the two — is exactly the insight needed in that Schaumburg boardroom. Weather shifts demand. Whether it also shifts the media budget — through competitor responses, supply-chain effects, or other channels — is the critical question, and Section 4 takes it seriously. The short version is: at daily and zip-code resolution, the instrument passes the test. But that answer requires argument, not assertion.

The game is not just advertising. It is any domain where a variable affects economic behavior without being affected by economic agents. The draft lottery worked as an instrument for military service because soldiers didn't choose their lottery numbers. Weather works as an instrument for demand because consumers don't choose the temperature. It is worth naming where weather as an instrument fails entirely: for staple goods with inelastic demand (the first-stage F will be near zero), for digital goods and services where demand is not weather-contingent, for B2B procurement on fixed contract cycles, and for markets dominated by price-regulated supply. A preliminary check — regress demand on weather anomalies and report the F-statistic — is a cheap and informative screen for applicability. If F < 10, the instrument has no grip and identification is effectively impossible. The question for the next section is not whether weather qualifies in general. It is: how well, by exactly what mechanism, and where does it fail?

On the LATE theorem: Angrist and Imbens (1994, Econometrica 62(2), pp. 467–475) proved that under four conditions — (1) instrument relevance: Cov(Z, A) ≠ 0; (2) instrument exogeneity: Cov(Z, ε) = 0; (3) exclusion restriction: Z affects Y only through A; (4) monotonicity: A(Z=1) ≥ A(Z=0) for all individuals — the IV estimator identifies the average treatment effect for compliers, defined as individuals for whom A(Z=1) > A(Z=0). The LATE is a weighted average of individual treatment effects with weights proportional to the probability of compliance. It is generally not the Average Treatment Effect (ATE) and is not the Average Treatment Effect on the Treated (ATT). When the instrument is weather and the treatment is advertising exposure, compliers are consumers who respond to weather-induced changes in their demand state. The empirical relevance of the LATE in this setting is high: these are exactly the consumers whose purchasing behavior is most responsive to the conditions that drive demand variation in weather-sensitive categories.

On the draft lottery paper: Angrist (1990, "Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from Social Security Administrative Records," American Economic Review 80(3), pp. 313–336). The 15% earnings penalty applies to white veterans; the earnings effects for nonwhite veterans were not statistically significant in the administrative data. The "credibility revolution" term was later coined by Angrist and Pischke in their 2010 Journal of Economic Perspectives retrospective. Card received the Nobel "for his empirical contributions to labour economics"; Angrist and Imbens shared the other half "for their methodological contributions to the analysis of causal relationships."

4

Why Weather Is a Near-Perfect Causal Instrument

Any variable can be called an instrument. Very few variables actually are one. Weather is one of them — but only if you understand exactly why, and only if you're honest about where it isn't.

The standard IV conditions are three: relevance, exogeneity, and the exclusion restriction. A fourth condition — monotonicity — is added specifically for the LATE theorem, which applies when treatment effects are heterogeneous and the instrument is binary or discrete. None of the four is a suggestion. An instrument that fails any one of them produces estimates that are worse than useless — confidently wrong in a direction that depends on the structure of the failure. Leamer's 1983 critique was precisely that applied economists had been treating these conditions as soft guidelines rather than hard requirements. The credibility revolution forced each condition to be argued explicitly rather than assumed implicitly.

So how does weather fare, checked against each condition seriously? The first two are easy. The third is where things get interesting.

Relevance: The instrument must actually move the treatment variable. A weak instrument — one that barely moves the treatment — produces estimates nearly as biased as OLS. The test: run the first-stage regression of the treatment on weather, and report the F-statistic. Staiger and Stock (1997) established F > 10 as the rough threshold for instrument strength; Stock and Yogo (2005) gave exact critical values. The F > 10 threshold is a rule of thumb that has been substantially refined in subsequent work — Andrews, Stock, and Sun (2019) developed more powerful tests that account for the number of instruments and the test of interest; Lee et al. (2022) showed that even in the just-identified, single-instrument case, conventional t-based inference requires a substantially higher threshold, on the order of F > 100. In practice, the first-stage F-statistic remains a useful screening tool, but F > 10 should be treated as a floor, not a target. For weather as an instrument for consumer demand, first-stage F-statistics in beverage categories are typically well above this floor. Temperature is a strong predictor of cold beverage purchases. The relevance condition is easily satisfied.
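The screening computation is a few lines. This sketch simulates a strong and a weak instrument (coefficients invented) and computes the first-stage F-statistic, which for a single instrument is just the squared t-statistic on its coefficient:

```python
import numpy as np

def first_stage_F(z, d):
    """F-statistic (= t^2) for the instrument in a simple first-stage regression."""
    zc = z - z.mean()
    slope = (zc @ d) / (zc @ zc)
    resid = d - d.mean() - slope * zc
    s2 = (resid @ resid) / (len(d) - 2)       # residual variance
    return slope**2 * (zc @ zc) / s2          # t^2 = (slope / se)^2

rng = np.random.default_rng(10)
n = 5_000
z = rng.normal(0, 1, n)
d_strong = 0.5 * z + rng.normal(0, 1, n)   # instrument really moves the treatment
d_weak = 0.01 * z + rng.normal(0, 1, n)    # instrument barely moves it

print(f"strong F = {first_stage_F(z, d_strong):.0f}, weak F = {first_stage_F(z, d_weak):.1f}")
```

With multiple instruments or additional controls the F comes from a joint test in a full regression package, but the screening logic is the same: if this number is small, nothing downstream can be trusted.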

Exogeneity: The instrument must be uncorrelated with the error term in the outcome equation. This requires that weather is uncorrelated with the unobserved drivers of sales. Media buyers may adjust seasonal budgets in anticipation of summer demand, but conditional on day-of-week, week-of-year, and year-over-year trend controls, the residual variation in temperature — the difference between what temperature you'd predict from seasonal patterns and what actually occurred — is uncorrelated with advertising decisions. Nobody increases their ad budget because they forecast a heat wave coming on Tuesday. The weather forecast might inform inventory decisions (Carla Reyes was acting on exactly this kind of signal), but advertising buys operate on quarterly planning cycles, not Tuesday-specific temperature anomalies. Exogeneity is defensible — not free, but defensible and testable through overidentification tests when multiple weather variables are available. One complication: extreme weather events are sometimes correlated with other economic events — power outages, transportation disruptions, public events. A July heat wave in Chicago in 2019 arrived with public health advisories and specific media coverage that independently affected retail behavior. Conditioning on a rich set of temporal controls (day-of-week, week-of-year, year fixed effects) and using weather anomalies rather than absolute levels reduces this concern, but does not eliminate it entirely in the presence of truly anomalous events.

The exclusion restriction: Here is the hard one. The instrument must affect the outcome only through the treatment — through the causal pathway you're trying to identify. Weather affects many things simultaneously: foot traffic (people stay home in extreme heat), employee staffing, supply chain logistics, competitor promotions, consumer mood. The exclusion restriction requires that all of these channels run through the demand variable you're modeling, and that none affect sales through a separate pathway.

Here is where the honest answer requires more care.

This is not an assumption you can make for free. A 2025 working paper by Jonathan Mellon and colleagues documented 194 product-market pairs where weather appeared to violate the exclusion restriction in a specific and systematic way: competitors were responding to weather by adjusting their advertising spend in the following week. A heat wave hits; the competitor's media agency sees the heat wave in their forecasting tool; the competitor increases spending on cold beverages in the subsequent seven days. If that competitive spend increase affected the focal firm's sales through share-of-voice dynamics, then weather is correlated with a demand driver not captured in the model. Mellon et al. estimate the bias at 8–22% of the point estimate. One hundred and ninety-four cases. In papers that had already been peer-reviewed and published and cited. This is either a finding about the state of a subfield or it is a finding about the difficulty of the underlying problem. The Mellon paper suggests it is both.

I am not sure the distinction between a subfield problem and an irreducible difficulty is as clean as the authors suggest, but I am not sure it isn't, and this uncertainty is the kind I have learned to sit with rather than resolve prematurely. The response to the Mellon finding is this: the violations are a planning-horizon mismatch, not a fundamental invalidity. Competitive advertising response operates on weekly planning cycles — media agencies adjust their buys on a weekly basis, responding to the previous week's weather or to their seasonal forecasts. Temperature variation at daily resolution, within a specific zip code, is faster and more local than any weekly planning cycle can track. A competitor cannot adjust their DMA-level (Designated Market Area) television buy in response to yesterday's temperature anomaly in a single zip code. This argument holds most cleanly for traditional media with weekly planning cycles. Programmatic digital advertising operates at sub-daily resolution and could in principle respond to weather signals in near-real-time — a real limitation for digital-heavy attribution contexts, where the resolution argument must be supplemented by overidentification tests using multiple orthogonal weather variables. At daily × zip-code granularity, the exclusion restriction holds with much higher credibility than at weekly × DMA granularity. The solution is not to abandon the instrument but to use it at the resolution that exceeds the strategic response cycle — and to verify with overidentification tests in contexts where programmatic response cycles are plausibly short.

There is a second potential exclusion restriction violation worth taking seriously: the supply-side channel. Weather affects not just consumer demand but also supply costs — cold-chain logistics become more expensive in extreme heat, agricultural inputs fluctuate with precipitation and temperature, and labor productivity varies with working conditions. If weather → supply costs → shelf prices → consumer demand, then weather is affecting the outcome through a channel other than the treatment being studied. The resolution depends on scope. For packaged beverages and shelf-stable consumer goods with centralized pricing set quarterly, the supply-cost channel is negligible at weekly granularity — shelf prices for Gatorade do not update in response to this week's temperature. For fresh produce, energy, and agricultural commodities, the supply-side channel is real and must be controlled for explicitly, typically by including lagged commodity price indices as covariates. The instrument is valid for the categories where it is valid. Bounding that set is an empirical question, not a theoretical one.

Monotonicity: The fourth condition is that the instrument affects the treatment in a consistent direction. For weather, the non-linear (inverted-U) relationships between temperature and demand for some categories — sunscreen sales fall above 100°F when people stay indoors — are empirical shape features, not monotonicity violations. The instrument can be specified per category and per temperature range to ensure monotonicity within the relevant domain. This is not a workaround; it is correct specification.

[Interactive: Rotate the sunscreen surface. Find the peak. Now find where demand inverts. What is happening to consumers above 100°F?]

Does weather actually satisfy the four IV conditions?

Temperature is a scalar. Weather is not.

When we say "use weather as an instrument," we mean something richer than "use today's temperature." Weather is a high-dimensional, temporally structured field, and each of its dimensions interacts with economic behavior in distinct ways. Four properties matter most.

Trajectory. 72°F while temperatures are falling is not the same as 72°F while temperatures are rising. A consumer who spent the previous week in 90°F heat and is now cooling off is in a different behavioral state than a consumer who spent the previous week in 50°F cold and is now warming up. The slope of the temperature trajectory — the derivative, not just the level — is a separate instrument for a separate demand channel. Heating and cooling trajectories affect demand for apparel, energy, food service, and recreation in categorically different ways.

Forecast dependence. People respond to expected weather, not just current weather. If a heat wave is forecast for Friday, consumers begin adjusting behavior on Wednesday — stocking food, buying sunscreen, booking hotels or canceling plans. This means the instrument has a forward-looking component: the weather forecast itself becomes an instrument for advance demand adjustment, separate from the weather realization. It interacts with information about itself. This is unusual, and useful.

Anomaly versus absolute level. 85°F in October hits differently than 85°F in July — not because of the physics, but because it deviates from what consumers expected and prepared for. What matters for demand is not the absolute temperature but its departure from climatological normal: the anomaly. The same absolute reading, different anomaly, different behavioral response.

Higher-order interactions. Heat and humidity interact nonlinearly to produce apparent temperature. UV index and cloud cover affect outdoor activity independently of temperature. Precipitation trajectory — is rain arriving or clearing? — affects purchasing timing and mode in ways that neither precipitation level nor temperature alone predicts.

This richness is not a complication. It is what makes weather a better instrument than most variables available to empirical economists. A one-dimensional instrument is easy to challenge: you can usually find a plausible exclusion restriction violation. A multi-dimensional instrument with orthogonal components and distinct lag structures is much harder to challenge: you can test each component against the others and verify that the causal structure is consistent across instruments. The same physical properties that make weather exogenous from the economic system also make it an extraordinary instrument — not despite its complexity, but because of it.

The connection between weather's exogeneity and its role as an instrument is most transparent in the simplest form of the IV estimator. The Wald estimator shows exactly how the instrument is used.

β_IV = Cov(Y, Z) / Cov(A, Z)

The numerator, Cov(Y, Z), captures the reduced-form effect — how much sales move when weather moves. The denominator, Cov(A, Z), captures the first-stage strength — how much advertising spend moves when weather moves. The ratio scales the sales-weather relationship by the advertising-weather relationship, yielding the causal effect of advertising on sales.
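To see the mechanics concretely, here is a minimal simulation — every number invented for illustration — in which an unobserved seasonality confounder biases the naive regression of sales on advertising, while the Wald ratio recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Simulated world: weather anomaly Z is exogenous; an unobserved
# confounder U (seasonality) moves both advertising A and sales Y.
Z = rng.normal(size=n)                 # temperature anomaly (instrument)
U = rng.normal(size=n)                 # seasonality confounder (unobserved)
A = 0.5 * Z + U + rng.normal(size=n)   # first stage: weather shifts ad spend
beta_true = 2.0
Y = beta_true * A + 3.0 * U + rng.normal(size=n)  # U confounds A -> Y

# Naive OLS slope absorbs the confounder and overstates the effect.
beta_ols = np.cov(Y, A)[0, 1] / np.var(A)

# Wald / IV estimator: reduced form divided by first stage.
beta_iv = np.cov(Y, Z)[0, 1] / np.cov(A, Z)[0, 1]
print(round(beta_ols, 2), round(beta_iv, 2))
```

The OLS slope lands near 3.3 because the confounder's contribution is folded in; the Wald ratio lands near the true 2.0 because the instrument never touches the confounder.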

The first four IV conditions describe weather's structural suitability as an instrument — properties it has always had. There is a fifth reason weather has become dramatically more useful in the past two years, and it has nothing to do with econometrics: the instrument itself got better. Much better.

What changed in the past two years is the quality of the instrument's measurement: the AI weather forecasting revolution of 2023–2024. In December 2024, Google DeepMind published "Probabilistic weather forecasting with machine learning" in Nature. The system described — GenCast — outperformed the European Centre for Medium-Range Weather Forecasts' ensemble forecasting system on 97.2% of the 1,320 verification targets evaluated. At lead times beyond 36 hours, the number was 99.8%. The GenCast forecast took 8 minutes to produce on a single Google Cloud TPU v5 chip (a Tensor Processing Unit, Google's specialized AI hardware); the ECMWF system it outperformed takes hours on a supercomputer with tens of thousands of processors. GenCast also provided approximately 12 additional hours of tropical cyclone track warning before landfall — the kind of number that has direct consequences for evacuation decisions. Ilan Price, the lead author, said the model was "hopefully a much better tool in the toolbox going forward," which is the sort of measured statement you make when you have just comprehensively beaten the institution whose methods you were trained on.

Before explaining what makes GenCast different, it helps to understand what it replaced. Classical weather forecasts give you one number: tomorrow's high will be 78°F. Ensemble forecasting gives you a distribution: fifty slightly different runs of the forecast model, each started from slightly different initial conditions, producing fifty different outcomes. The spread of those outcomes is the uncertainty. A tight cluster means the forecast is confident; a wide spread means the atmosphere is genuinely unpredictable at that lead time. ECMWF's ensemble system, the gold standard for decades, runs 51 members — one unperturbed control run plus 50 runs with small, carefully designed perturbations to the initial conditions. The perturbations represent the range of plausible initial states given the observing network's resolution. The ensemble is the numerical expression of Lorenz's insight about initial-condition sensitivity: quantified, probabilistic, and operationally useful.
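The growth of ensemble spread with lead time is easy to demonstrate on a toy system. The sketch below is nothing like a real forecast model — it integrates the Lorenz-63 equations with their textbook parameters, perturbing twenty initial conditions by an arbitrary 10⁻³, a stand-in for observational uncertainty:

```python
import numpy as np

def lorenz_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One forward-Euler step of the Lorenz-63 system (demo accuracy only)."""
    x, y, z = state
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    return state + dt * np.array([dx, dy, dz])

rng = np.random.default_rng(1)
base = np.array([1.0, 1.0, 1.0])

# 20-member ensemble: identical physics, slightly different initial states.
members = [base + 1e-3 * rng.normal(size=3) for _ in range(20)]

spread = []  # ensemble standard deviation of x at each step
for step in range(1500):
    members = [lorenz_step(m) for m in members]
    spread.append(np.std([m[0] for m in members]))

print(f"spread after 50 steps:   {spread[49]:.4f}")
print(f"spread after 1500 steps: {spread[-1]:.4f}")
```

At short lead times the members cluster tightly; by the end of the run they have decorrelated and the spread saturates at the size of the attractor itself — the numerical expression of Lorenz's point.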

GenCast is technically a diffusion model — the same architecture as the AI systems that generate photorealistic images from text prompts. A diffusion model works by learning to reverse a specific kind of destruction. During training, Gaussian noise is progressively added to real data — in this case, real weather states — until the data looks like pure static. The model learns, across millions of examples, how to run that destruction backwards: given a noisy weather state, predict what the clean weather state was. At inference, you start from pure noise and run the learned reverse process, which produces a plausible weather state. The crucial property for weather forecasting is that if you run the reverse process many times from different random starting points, you get an ensemble of different plausible weather states — not one forecast but a distribution of forecasts. That distribution is the probabilistic forecast. The spread of the ensemble captures the genuine uncertainty in the system: for a forecast two days ahead, the ensemble members are close together; for a forecast ten days ahead, they diverge, reflecting the atmosphere's growing uncertainty at longer lead times. GenCast was trained on 40 years of ERA5 reanalysis data.

ERA5 is not a collection of weather observations. It is a reconstruction. Meteorologists at ECMWF take every observation available — satellite retrievals, balloon soundings, surface station reports, aircraft measurements, ocean buoy readings — and run them through a physical atmospheric model, adjusting the model's state until it is mathematically consistent with all the observations simultaneously. The result is called a reanalysis: a complete, spatially continuous, physically consistent record of the global atmosphere at hourly resolution, covering every point on Earth, from 1940 to the present. Observations alone cannot do this — a weather station in Kansas measures Kansas; it tells you nothing about the simultaneous state of the Pacific. The physical model is what connects them into a coherent global picture. ERA5 is that global picture, at 31-kilometer horizontal resolution, covering 83 years. Before ERA5, there was no training dataset that covered the globe continuously with the spatiotemporal density that deep learning requires. ERA5 is why the AI weather revolution happened in 2023 and not in 2013.

GenCast has learned, with extraordinary fidelity, what weather patterns precede what other weather patterns. Lewis Fry Richardson, in 1922, imagined 64,000 human computers in a circular theatre doing something like this. The machine now doing it took a few months to train and runs in eight minutes. What it cannot do is tell you why any weather pattern preceded any other. Matthew Chantry, of ECMWF, called GenCast "a really great contribution to open science." What matters for this essay is more specific: a richer, higher-resolution weather forecast provides more instruments — temperature, humidity, precipitation, cloud cover, wind speed, UV index, and their spatial distributions — that are correlated with different demand channels, allow overidentification tests, and provide more first-stage variation. To be precise about what improved: the relevance of the instrument — its strength in predicting demand variation — improved with richer, higher-resolution forecasts. The validity of the instrument — its exogeneity, its exclusion restriction — depends on weather's physical independence from economic agents, which GenCast did not change. Better forecasts make the instrument richer, not more valid. The validity was always there. What improved is the ability to exploit it.

Find two markets with the same average lift but different weather sensitivities. The causal forest sees this difference. The aggregate model cannot. What does that difference mean for planning a campaign?

Chicago, July 17, 2019. The weather was not planned. It was not coordinated with the ad buy. The marketing team did not know, when they scheduled their campaign, that the heat wave was coming. That is precisely what makes it useful. A variable that cannot be manipulated by the people whose behavior it affects is, in the technical sense, exogenous. In the informal sense, it is honest. It tells you something the data, left to itself, cannot tell you.

On GenCast: Price, I., et al. (2024). "Probabilistic weather forecasting with machine learning." Nature 637, pp. 84–90. The 97.2% figure covers 1,320 verification targets evaluated against the 2018 out-of-sample year using CRPS (Continuous Ranked Probability Score). The quote from Ilan Price appears in reporting on the paper's December 4, 2024 publication. ECMWF's Matthew Chantry's quote appeared in the same reporting cycle. The ERA5 reanalysis dataset, produced by ECMWF and released in 2019, provides 40+ years of global atmospheric state at 31-kilometer resolution, every hour, and is the training substrate for GenCast, Pangu-Weather, Aurora, and most of the 2023–2024 AI weather models. Pangu-Weather (Bi et al., 2023, Nature 619, pp. 533–538) was the first AI model to reach operational NWP quality on most verification metrics; GenCast surpassed it on ensemble forecasting.

On the Mellon (2025) violations: The working paper examined a proprietary retail panel dataset with daily sales, weekly advertising spend at the DMA level, and daily weather observations. The 194 violations were identified by testing whether lagged weather was correlated with subsequent competitor advertising spend after controlling for season, trend, and brand fixed effects. The 8–22% bias estimate uses a simulation calibrated to the observed violation rate. As of February 2026, the paper is available as a working paper and has not yet appeared in a peer-reviewed journal. The exclusion restriction is, in some sense, unfalsifiable from the data alone: if you have a valid instrument, you cannot prove it's valid from within the dataset that uses it — you can only argue, on substantive grounds, that the instrument doesn't affect the outcome except through the channel you're measuring. The Mellon finding suggests that in 194 cases, that argument was made too quickly.

5

Stein's Paradox: Why Pooling Estimates Across Markets Improves All of Them

We have a valid instrument — in principle, at the right granularity, for the right categories. That gives us a causal estimate for one market, one period, one product. The actual decision-making problem is ten thousand markets, every day, across dozens of categories. Scale creates a new kind of difficulty that is neither the irreducibility problem from Section 1.5 nor the feedback problem from Section 2. It is a variance problem: the more finely you disaggregate, the less data you have, and the less reliable each estimate becomes. Charles Stein figured out the answer in 1956. It is counterintuitive and it is exact.

The natural response — run the causal model for each store separately, get 847 estimates, report them — runs into a mathematical obstacle that is unintuitive and important. Charles Stein stated it in 1956, in a paper called "Inadmissibility of the usual estimator for the mean of a multivariate normal distribution." The theorem: if you want to estimate the means of three or more normal distributions simultaneously — and you are being judged by the total squared error across all your estimates — the estimator that uses only each group's own data is dominated by an estimator that shrinks each estimate toward a common mean. Not sometimes. Always. The independent estimators are inadmissible. The shrinkage estimator is better, in expectation, for every possible configuration of true means. The critical threshold is three: for one or two simultaneous estimation problems, the independent estimator is admissible and cannot be improved by shrinkage. At three or more, the paradox kicks in.

This feels wrong, in a specific way that is worth naming. It seems to imply that your estimate of demand elasticity in Chicago can be improved by incorporating what you know about demand elasticity in Dallas, even if you believe those markets have nothing to do with each other.

This is counterintuitive in the best possible way: it violates the assumption that independent groups should be estimated independently, and it violates it not sometimes but provably, always.

The improvement is not large in any individual case. But it is real, and it is guaranteed — a word that rarely gets to appear in the same sentence as "statistics." Stein's result was initially received as a curiosity: a paradox that seemed to violate common sense. The resolution, which took years to become widely understood, is that the estimators are being judged jointly. The total squared error across all 847 stores can be reduced by accepting small increases in bias at each store in exchange for large reductions in variance. The stores are noisy. The noise dominates. Shrinkage reduces the noise.

Bradley Efron and Carl Morris made the theorem concrete in 1975 with a famous demonstration on baseball batting averages, described in detail below. The shrinkage estimator won not because it had better data, but because it used information from the whole group to stabilize the estimate of each individual. The James-Stein estimator is:

θ̂ᵢ = Bᵢ·μ̂ + (1 − Bᵢ)·Xᵢ

Where Xᵢ is store i's local estimate, μ̂ is the pooled estimate, and Bᵢ is a data-driven shrinkage factor that determines how much weight to give to the local estimate versus the pooled one. When a store has a lot of data (low variance Xᵢ), Bᵢ is small and the local estimate dominates. When a store has thin data (high variance), Bᵢ is large and the estimate shrinks toward the group mean. The degree of pooling is determined by the data, not by the analyst's judgment. This is not a modeling assumption. It is optimal under exchangeability — the condition that the stores, before observing their data, are treated as draws from the same population. The implication — and this is the one that matters — is that treating each market as if it exists in complete isolation is not just an approximation but a systematic error. More pooling than you think. Always.
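A minimal sketch of the estimator on simulated store-level effects — equal, known sampling variance, every number illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
k = 847        # stores
sigma = 1.0    # known sampling std of each local estimate

theta = rng.normal(0.5, 0.3, size=k)         # true per-store effects
X = theta + rng.normal(0.0, sigma, size=k)   # noisy local estimates

# James-Stein: shrink every local estimate toward the grand mean,
# with a shrinkage factor B estimated from the data itself.
mu = X.mean()
S = np.sum((X - mu) ** 2)
B = (k - 3) * sigma**2 / S        # data-driven shrinkage factor
theta_js = mu + (1 - B) * (X - mu)  # = B*mu + (1-B)*X

mse_local = np.mean((X - theta) ** 2)
mse_js = np.mean((theta_js - theta) ** 2)
print(f"local MSE {mse_local:.3f}  vs  shrunk MSE {mse_js:.3f}")
```

With noise variance much larger than the true between-store spread, B comes out near 0.9 — the data itself decides that heavy pooling is warranted — and the total squared error collapses by an order of magnitude.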

The evidence for shrinkage's superiority is not merely theoretical. Efron and Morris demonstrated it empirically in 1975 using eighteen Major League Baseball players' batting averages from their first 45 at-bats of the 1970 season. The question: which estimator better predicts each player's season-end average? The independent estimator uses each player's own 45-at-bat sample. The James-Stein estimator pools all eighteen players toward a common mean and shrinks each individual estimate by a data-driven factor. The James-Stein estimator was closer to the true season-end average for 16 of the 18 players. The total squared error across all 18 was 71% lower — from 0.077 to 0.022 (in batting average units, summed across 18 players; from Table 1 of the 1975 JASA paper).

Thurman Munson and Carl Yastrzemski share no batting mechanics — they were, in their methods, almost opposites. The pooling exploits the statistical structure of the estimation problem, not any subject-matter connection between players. Stein proved the theorem in general; Efron and Morris demonstrated it on batting averages. The math is the same for cold beverage demand in Chicago and Phoenix.

The demand estimation analog extends the pooling logic in a direction that is not obvious. Consider Phoenix Coca-Cola and Phoenix Pepsi: two brands in the same market, competing categories, responding to the same heat waves. If both respond to temperature anomalies, their weather elasticities are not independent draws from a uniform prior — they share a common environmental driver. You can estimate Phoenix Coca-Cola's weather elasticity better by pooling with Phoenix Pepsi than by using Coca-Cola's own data alone. This is cross-brand pooling. It is standard practice in hierarchical Bayesian models of category demand.

The cross-industry pooling argument is more powerful and less obvious. If weather is a systematic factor across industries — with category-specific loadings — then a model trained on cold beverages, sunscreen, home energy, agricultural commodity demand, and airline bookings simultaneously learns the prior distribution over weather elasticities. That prior is a description of how human economic behavior responds to weather, pooled across every domain. A new entrant to any weather-sensitive category inherits this prior on day one.

This is the foundation model logic applied to causal estimation. Chronos and TimesFM exploit a principle analogous to James-Stein shrinkage: by pooling across tens of millions of time series during pre-training, they construct a prior over time series behavior that regularizes estimates on any new series. The analogy to James-Stein estimation is suggestive rather than exact — pre-training on heterogeneous time series data shares the structural feature of pooling information across many estimation problems simultaneously, which is what produces James-Stein shrinkage's benefits in theory. Whether the formal admissibility result transfers to gradient-descent pre-training on a prediction objective is an open question. The empirical pattern — strong zero-shot generalization across benchmark domains — is consistent with the analogy. Chronos, evaluated across a benchmark of 42 datasets spanning diverse domains, achieves zero-shot performance competitive with models trained specifically on the target domain. The causal analog pools across domains to construct a prior over weather elasticities, which then regularizes estimates in any new market. The mechanism is the same. The object being estimated is different: not predictive structure, but causal structure.

Pooling has a limit. When categories are genuinely dissimilar in their weather response — when the mechanism connecting weather to demand differs in kind, not just magnitude — shrinkage toward a common prior can be misleading. The hierarchical model must include the right grouping structure for the pooling to help rather than hurt. Pooling across similar categories is always beneficial. Pooling across mechanistically distinct categories requires specifying the similarity structure explicitly. The horseshoe prior handles within-group sparsity. The group structure itself is a modeling assumption. A technical note for the econometrically minded: Stein's admissibility result applies formally to the normal-means problem; IV estimators, being ratio estimators, have different small-sample properties and heavier tails. The hierarchical Bayesian formulation — rather than direct appeal to Stein's theorem — is the correct theoretical foundation for pooling in the IV context; Chamberlain and Imbens (2004) provide a Bayesian IV framework that handles this correctly.

Toggle to the hierarchical model. Some of your estimates just changed sign. That's not a bug. Click "Reveal Truth" and check who was closer.

The skeptic raises a legitimate objection here: if you shrink Chicago's estimate toward the national mean, don't you destroy the real heterogeneity that makes Chicago different from Phoenix? If the true causal effect in Phoenix is genuinely different from Chicago — and it probably is, because Phoenix consumers in summer are not Chicago consumers in summer — shrinking toward a common mean would obscure exactly the heterogeneity you need for store-level decisions. The objection is correct and the resolution is important: hierarchical models don't shrink toward a single mean. They shrink toward a structured mean — one that is itself a function of covariates. Phoenix stores shrink toward the Phoenix-region mean, which accounts for climate zone, store format, and consumer demographics. Chicago stores shrink toward the Chicago-region mean. The national mean is used as the prior only for regions with no within-region data. The hierarchical structure allows genuine heterogeneity to be estimated while still capturing the information in similar units.
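A toy version of shrinkage toward a structured mean — two hypothetical regions, the normal-normal pooling weight treated as known, all parameters invented:

```python
import numpy as np

rng = np.random.default_rng(3)
regions = {"phoenix": 2.0, "chicago": 0.5}  # true region-level mean effects
n_per, sigma, tau = 50, 1.0, 0.2            # stores/region, noise sd, within-region sd

estimates = {}
for name, region_mean in regions.items():
    theta = rng.normal(region_mean, tau, size=n_per)    # true store effects
    X = theta + rng.normal(0.0, sigma, size=n_per)      # noisy local estimates

    # Partial pooling toward the *region* mean, using the standard
    # normal-normal weight (tau and sigma treated as known here).
    w = tau**2 / (tau**2 + sigma**2)
    shrunk = X.mean() + w * (X - X.mean())
    estimates[name] = (theta, X, shrunk)

for name, (theta, X, shrunk) in estimates.items():
    print(name,
          f"local MSE {np.mean((X - theta) ** 2):.3f}",
          f"shrunk MSE {np.mean((shrunk - theta) ** 2):.3f}")
```

Phoenix stores shrink toward the Phoenix mean and Chicago stores toward the Chicago mean, so the Phoenix/Chicago gap survives intact while the within-region noise is crushed — exactly the resolution of the skeptic's objection.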

One complication deserves mention before the architecture section, because it is the kind of subtle point that gets buried in methods appendices and then quietly distorts everything downstream.

Hierarchical pooling solves the across-store variance problem. But there is a subtler issue in how we define “average” demand that affects whether pooling time series across stores is valid in the first place. Economists typically report average outcomes — mean demand, mean elasticity — but for stores experiencing multiplicative growth shocks (a 20% weather-driven spike, a 15% competitor promotion), the mean across stores and the typical trajectory of any individual store can diverge dramatically.

A process is ergodic if your experience of it over time matches what you would see if you averaged over many parallel lives — if the long-run time average equals the ensemble average. Most of probability theory assumes ergodicity implicitly: it treats “the average outcome” as a meaningful guide to what any individual will experience. For an ergodic system, this is true. For a non-ergodic one, it is deeply misleading. A simple example: imagine a coin flip that multiplies your wealth by 1.5 on heads and by 0.6 on tails. The expected return per flip is 1.05x (0.5 × 1.5 + 0.5 × 0.6). But a single trajectory through many flips will — with probability one — see your wealth decay toward zero, because the per-flip geometric average is √(1.5 × 0.6) ≈ 0.95, below 1 even though the arithmetic average is above it. The ensemble average, computed across many parallel players, says the average player gets rich. The time average, computed along a single player's path, says the single player goes broke. These are different objects. The ensemble average describes an average across people who will never all exist simultaneously. The time average describes what actually happens to a real person over time. When demand shocks are multiplicative — a 20% weather-driven spike multiplied by a 15% competitor promotion — the ensemble average across stores can look healthy while the median store trajectory deteriorates.
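The divergence is easy to simulate. The sketch below uses a +50%/−40% multiplicative coin — the standard illustration in the ergodicity economics literature; the player and flip counts are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)
n_players, n_flips = 10_000, 100

# Multiplicative coin: heads x1.5, tails x0.6.
# Arithmetic mean per flip: 0.5*1.5 + 0.5*0.6 = 1.05   (ensemble grows)
# Geometric mean per flip:  sqrt(1.5*0.6)     ~ 0.949  (each path decays)
multipliers = rng.choice([1.5, 0.6], size=(n_players, n_flips))
wealth = np.cumprod(multipliers, axis=1)  # each row: one player's trajectory

ensemble_mean = wealth[:, -1].mean()    # average over parallel players
median_path = np.median(wealth[:, -1])  # the typical single trajectory

print(f"ensemble mean after {n_flips} flips:    {ensemble_mean:.2f}")
print(f"median trajectory after {n_flips} flips: {median_path:.2e}")
```

The ensemble mean is propped up by a handful of astronomically lucky players; the median player — the one who actually resembles any given store — is nearly wiped out.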

Ole Peters, in a 2019 Nature Physics paper, applied this distinction to economics and argued that standard economic models systematically confuse ensemble averages with time averages in ways that change their practical prescriptions. Jensen's inequality gives the formal statement: E[ln(r)] < ln(E[r]) for any positive random variable r with variance > 0. In demand terms: the expected log-return to a weather-sensitive category is not the same as the log of the expected return. The practical implication: when pooling time series across stores with different volatility regimes, the pooled estimate needs to account for this divergence — otherwise the “average” you compute describes no actual store. Peters's ergodicity economics framework has been contested by mainstream economists, who argue that the multiplicativity argument does not overturn expected utility theory in the settings Peters claims. The relevance here is narrower and less contested: the Jensen's inequality gap between ensemble averages and time averages is real, and it matters specifically for pooling decisions when processes are genuinely multiplicative. Whether consumer demand is multiplicative in the relevant sense is an empirical question, and the answer varies by category.

Watch the ensemble average at the top of the chart. Now watch the median trajectory. Are they going in the same direction? What does that mean for a single store?

The operational response to this is careful: when pooling stores with different demand volatility, use log-returns rather than level changes, and specify the hierarchical model in log-space. This recommendation applies most forcefully when demand shocks are genuinely multiplicative — as in growth-rate contexts and over longer time horizons. At weekly or daily horizons for stable consumer goods categories, the distinction between additive and multiplicative dynamics is an empirical question that should be tested in the data rather than assumed. The ErgodicitySim shows you why the ensemble average can look healthy while the median trajectory deteriorates — which is the regime most retail stores with high demand volatility actually live in.

The full weather-demand parameter space presents a further technical challenge. Weather has 400+ relevant variables — temperature, humidity, precipitation, wind, cloud cover, UV index, dew point, pressure, and many others — each varying across spatial grids and time lags. Pairwise interactions between 8 core variables produce C(8,2) = 28 interaction terms. Three-way interactions produce C(8,3) = 56 more. The total parameter space, including all plausible interactions across categories and markets, is approximately 18,500 parameters. Most of these are effectively zero — most weather interactions have negligible demand effects, and the signal is sparse.

Sparse signals in high dimensions require specialized priors. The horseshoe prior, introduced by Carlos Carvalho, Nicholas Polson, and James Scott in 2010, is designed for exactly this setting. The horseshoe uses a half-Cauchy mixing distribution that produces a shrinkage pattern unlike ridge regression (which shrinks uniformly toward zero, destroying large effects) or LASSO (which produces binary inclusion/exclusion, losing effect heterogeneity). The horseshoe allows true signals — the temperature effect on cold beverages, the precipitation effect on foot traffic — to retain their magnitude while collapsing noise parameters to effectively zero. Here is what the horseshoe actually does. For each parameter, it draws a local scaling factor from a half-Cauchy distribution — a distribution with extremely heavy tails. Heavy tails mean that most of the scaling factors collapse toward zero (the noise parameters), but a few escape to be large (the signal parameters), because the Cauchy's tails are fat enough to put non-trivial probability on values far from zero. The boundary between signal and noise is set by the data's structure, not by a tuning parameter the analyst must choose. About 85% of the 18,500 weather-demand parameters are shrunk to negligibly small values; the remaining 15% carry the signal. This is not model selection; it is continuous, calibrated regularization.
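The horseshoe's signature pattern can be seen directly by sampling. For a unit-scale half-Cauchy local scale λ, the implied shrinkage weight κ = 1/(1 + λ²) follows a Beta(1/2, 1/2) distribution — mass piled up near 1 (noise killed) and near 0 (signal kept), with little in between. A sketch, with the unit scales purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Local scales: half-Cauchy(0, 1) — lots of mass near zero, heavy tails.
lam = np.abs(rng.standard_cauchy(size=n))

# Shrinkage weight per parameter: kappa = 1 / (1 + lam^2).
#   kappa ~ 1  -> coefficient collapsed toward zero (treated as noise)
#   kappa ~ 0  -> coefficient left almost untouched (treated as signal)
kappa = 1.0 / (1.0 + lam**2)

tails = np.mean(kappa < 0.1) + np.mean(kappa > 0.9)
middle = np.mean((kappa > 0.4) & (kappa < 0.6))
print(f"mass near 0 or 1: {tails:.2f}   mass in the middle: {middle:.2f}")
```

The histogram of κ is the horseshoe the prior is named for: the model is pushed toward a binary verdict — kill or keep — for each parameter, without the analyst choosing a threshold.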

Drag the dimensionality slider to 50. What fraction of the space does the data ball occupy now? This is the space your model is estimating in.

Once you have causal estimates across 847 stores — hierarchically pooled, sparse-signal regularized, with posterior uncertainty quantified — something follows that is not purely statistical. If you know the causal effect of weather on demand, with uncertainty, and you also have a probabilistic seven-day weather forecast, you can connect the two to produce a probabilistic causal demand forecast. Not a predictive demand forecast — a causal one. A forecast that decomposes next week's expected demand into baseline + weather effect (quantified, causal) + advertising effect (causal, not confounded) + price effect + competitive response. And that decomposed forecast feeds directly into the newsvendor problem.

The newsvendor model is the classic operations research framework for inventory decisions under uncertainty. Its solution: the optimal order quantity Q* satisfies F(Q*) = c_u/(c_u + c_o), where c_u is the unit cost of understocking (a lost sale) and c_o is the unit cost of overstocking (holding cost plus waste). F is the cumulative distribution function of demand. The formula has a clean derivation from first principles. If you stock Q units and demand turns out to be D: if D > Q (stockout), you pay c_u for each unit of unmet demand; if D < Q (surplus), you pay c_o for each unsold unit. The optimal Q minimizes expected total cost. At the margin, ordering one more unit costs c_o (with probability F(Q), you won't sell it) and saves c_u (with probability 1 − F(Q), you would have stocked out without it). Set the marginal cost equal to the marginal saving: c_o × F(Q*) = c_u × (1 − F(Q*)). Solve: F(Q*) = c_u / (c_u + c_o). This is the critical ratio — the target percentile of the demand distribution. If a stockout costs ten times an unsold unit, you should stock at the 91st percentile of demand. The formula is a precise translation of the intuition: the more expensive it is to run short relative to running long, the further up the demand distribution you should stock. The same calculation applies wherever perishable supply meets uncertain demand — agricultural purchasing desks estimating harvest inventory, energy traders sizing reserve capacity before a forecast cold snap, hotel revenue managers pricing rooms ahead of a forecasted sun-filled weekend. Carla Reyes was solving this problem intuitively — she had eleven summers of experience telling her that hot July 17s meant different demand distributions than mild April 3s, and she stocked accordingly. The formal model does the same calculation, at scale, with quantified uncertainty, for 847 stores simultaneously.
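The critical-ratio calculation fits in a few lines. The costs and the demand distribution below are placeholders, standing in for the posterior predictive demand the causal decomposition would actually supply:

```python
import numpy as np

def newsvendor_quantity(demand_samples, c_under, c_over):
    """Optimal order quantity: the critical-ratio quantile of demand.

    Solves F(Q*) = c_under / (c_under + c_over) on the empirical CDF.
    """
    critical_ratio = c_under / (c_under + c_over)
    return np.quantile(demand_samples, critical_ratio)

rng = np.random.default_rng(0)
# Stand-in posterior predictive demand for one store-day, e.g. from
# baseline + weather effect + advertising effect (all illustrative).
demand = rng.normal(300, 40, size=50_000)  # bottles

# A stockout costs 10x an unsold bottle -> stock at the ~91st percentile.
q = newsvendor_quantity(demand, c_under=10.0, c_over=1.0)
print(f"critical ratio: {10 / 11:.3f}   order quantity: {q:.0f} bottles")
```

Because the input is a sample rather than a parametric distribution, the same function accepts draws from any posterior predictive — which is exactly how the decomposed causal forecast plugs in.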

Add the 10th brand. Notice the accuracy improvement for brands 1 through 9. Now add the 100th. Does improvement stop?

Every brand added to the pooling structure improves the estimates for every brand already there — including across categories that share no obvious mechanism. This keeps being true past the point where intuition says it should stop. Stein said so in 1956. The data confirms it every time.

What does a brand inherit on day one, before it has generated any of its own data? The answer is the entire prior — every weather elasticity the model has ever estimated, pooled across every category and market it has observed, waiting to be updated by the first observation.

Start the timeline. On day 1, what is this brand's estimated causal effect? Where did that estimate come from?

What Carla Reyes knew — that summer heat means sports drink shortages — was knowledge local to Wacker and State, built from eleven summers of a specific store, a specific neighborhood, a specific customer base. The model in Schaumburg was trying to know this for 847 stores at once. That is not a harder version of Carla's problem. It is a different kind of problem. Carla's estimate shrinks when you pool it with the other 846 stores. This seems wrong. Stein showed it is actually optimal. And once you have that optimal estimate, you can do what Carla does by instinct — but at scale, in advance, with a seven-day probabilistic forecast as input. The inventory decision becomes, in principle, a tractable one.

But there is a prior question the method doesn't answer on its own: why now? The tools described here — IV estimation, hierarchical pooling, probabilistic weather forecasts — didn't arrive together by accident. Two separate traditions, running in parallel for nearly a century without knowing about each other, both hit their scaling limits at the same historical moment, and both were rehabilitated by the same computational revolution. Understanding where these tools came from is understanding why the convergence is happening now, rather than in 1970 or in 2005.

That history is not mere intellectual backdrop: it explains why the methods are ready now, what their failure modes were, and why the AI weather forecasting revolution is not an incremental improvement but a qualitative shift in instrument quality. The two traditions — numerical weather prediction and causal econometrics — hit exactly the same obstacles at exactly the same historical moments, and were rehabilitated by exactly the same computational advances. That parallel is not coincidental. It tells you something about the structure of the problem.

On Stein's paradox: The result appears in Stein (1956), "Inadmissibility of the usual estimator for the mean of a multivariate normal distribution," and was made constructive in James and Stein (1961), "Estimation with Quadratic Loss." The paradox implies that information about the price of flaxseed oil should improve your estimate of voting preference in an unrelated district — not because of a causal connection, but because the joint MSE criterion allows trading bias for variance across dimensions. Efron and Morris (1975) provided the most accessible treatment, including the baseball example, in "Data Analysis Using Stein's Estimator" in JASA 70(350). The 71% error reduction figure is from their Table 1, a comparison between the maximum likelihood estimator and the James-Stein estimator on the 18-player batting-average dataset.

On causal forests: Wager and Athey (2018, JASA 113(523), pp. 1228–1242) introduced causal forests as a method for estimating heterogeneous treatment effects τ(x) = E[Y(1) − Y(0) | X = x] nonparametrically. The key innovation is "honest" trees: the sample is split, with one half used to choose splits and the other used to estimate leaf-level effects. Asymptotic normality holds under regularity conditions, enabling valid inference. The R package grf (Generalized Random Forests) implements causal forests and is the standard implementation.

6

Two Parallel Histories: Numerical Weather Prediction and Causal Econometrics

In 1916, Lewis Fry Richardson sat in a barn in northern France and spent six weeks computing a six-hour weather forecast. The forecast was off by a factor of 145. He published it anyway. In 1928, Philip Wright buried four pages of statistical theory in the appendix of a government monograph about butter and lard. Nobody read the appendix for thirty years. Both men had solved their respective problems correctly. The world wasn't ready to notice.

Here is what Richardson was doing in that barn. He was not just doing weather. He was attempting something that had never been attempted — the complete physical simulation of a column of atmosphere above central Europe, derived from first principles alone, without any appeal to historical pattern or empirical curve. The equations were Newton's laws of motion applied to a fluid on a rotating sphere: conservation of mass, momentum, energy. The unknowns were wind velocity, pressure, temperature, humidity at each gridpoint. The math was not complicated by the standards of the time. The computation was staggering: Richardson estimated later that his trial forecast had required approximately 960 hours of arithmetic. He was doing it by hand, with a slide rule, in the middle of the Battle of the Somme, driving ambulances for the Friends' Ambulance Unit between sessions at the notebook.

The Friends' Ambulance Unit was the institution that Quakers had created for conscientious objectors — a category that, in 1916 England, did not yet have legal standing, only social opprobrium. Richardson believed that war was wrong as a structural matter, which made his presence in the Western theater more complicated than it sounds. He drove ambulances because he would not carry weapons. He did the mathematics because the mathematics was what he had. The manuscript was lost during the Battle of Champagne in the spring of 1917 — not destroyed, just lost, misplaced in the chaos of the Nivelle Offensive — and found months later under a pile of coal. He revised it for five years and published it in 1922 as Weather Prediction by Numerical Process, with the failed forecast included in full, uncorrected, available for any reader to check.

The failure is worth examining closely, because it is the kind of failure that is more instructive than success. Richardson predicted a 145-hectopascal surface pressure change over six hours. The observed change was about 1 hPa. Off by a factor of 145. This is not a rounding error. This is the prediction saying a hurricane appeared; the observation saying there was a slight breeze. Peter Lynch, working through Richardson's original calculation in 2006, identified the cause: geostrophic imbalance in the initial conditions. In the real atmosphere, winds and pressure fields are tightly coupled — the Coriolis force of the Earth's rotation means large-scale winds flow roughly parallel to the isobars rather than across them, a balance so pervasive that meteorologists call it geostrophic balance. Richardson's initial wind and pressure fields were not in balance. The equations registered this inconsistency and corrected for it the only way they could: by generating enormous, rapidly oscillating pressure waves, real physical phenomena but completely irrelevant to the actual forecast. Richardson's signal was swamped by the noise his own initialization had created. When Lynch applied a modern digital-filter procedure to balance Richardson's initial fields — a procedure Richardson had no way to perform in 1916, because it requires many iterations of the forecast equations run forward and backward, which requires a computer — Richardson's computation turned out to be essentially correct. The method was sound. The data going in was wrong. He did not know how to fix the data. He published the failure anyway, because the method was right and he knew it.

This is, as a posture toward one's own work, admirable and unusual. Most scientists bury their failures. Richardson put his on the second page.

The book's most famous passage is not the failed forecast. It is a thought experiment that Richardson placed in the middle of the book as a kind of visionary parenthesis — a description of how you would actually run his method fast enough to be useful. He imagined a spherical hall, the size of a concert venue, with maps of the globe painted on the walls. Sixty-four thousand human computers — this was the word for people who compute, before machines took it — would sit inside this sphere, each responsible for one small patch of the atmosphere. Their conductor, standing at the center on a raised podium, would watch the computation progress across the globe, flash colored lights to synchronize the timing, and route numbers from the periphery toward the center where they could be combined. The result: a numerical weather forecast that outran the actual weather. Richardson calculated that 64,000 people, working simultaneously, could produce a 24-hour forecast faster than 24 hours elapsed. He described this in engineering detail — the spatial arrangement of computers, the information flow, the coordination protocol. He was describing, in 1922, a massively parallel distributed computing architecture. He was describing, in 1922, a data center. He had no hardware to run it on.

In 1926, the Meteorological Office was absorbed by the Air Ministry. Richardson resigned rather than do weather work in service of military aviation. He had spent the war driving ambulances; he was not going to spend the peace helping bombers navigate. He turned his differential equations toward a different problem — the mathematics of arms races, the dynamics of nations accumulating weapons in response to each other's accumulation. Same tools. Different question. His work on conflict escalation, published posthumously as Arms and Insecurity and Statistics of Deadly Quarrels, is still cited by conflict researchers. He died in September 1953.

Three years before Richardson died, in April 1950, Jule Charney and his team at the Institute for Advanced Study in Princeton ran Richardson's method on the ENIAC — the Electronic Numerical Integrator and Computer, the first general-purpose electronic computer, completed at the University of Pennsylvania in 1945, occupying 1,800 square feet and consuming 150 kilowatts of power. It could perform approximately 5,000 additions per second, which is roughly a thousand times faster than a skilled human computer with a slide rule. The calculation that had taken Richardson six weeks took the ENIAC twenty-four hours. The forecast was, by the standards of the time, reasonably good — not perfect, not operational, but demonstrably useful. The method Richardson had invented in a barn in 1916 worked. He knew it worked before he died. He had always known it would.

Now cut to 1928. Philip Green Wright is not computing the atmosphere. He is writing a 341-page monograph about import tariffs on animal and vegetable oils — flaxseed oil, cottonseed oil, lard, tallow, butter. This is a commissioned study, dull by design, intended for agricultural policy economists who need facts about trade flows. Wright is, among other things, a poet — he runs a hand press in his basement in Galesburg, Illinois, and he had used it in 1904 to publish Carl Sandburg's first chapbook, which is the kind of minor literary fact that history is full of. He is also an economist trained at Harvard, and he has a problem.

The problem is this: he needs to estimate how much demand for flaxseed oil changes when its price changes. The slope of the demand curve. This seems straightforward until you think about it for thirty seconds. Price and demand are not independent. When demand is high, prices rise. When prices fall, demand increases. They move together because they are jointly determined — each responds to the other, in continuous time, simultaneously. A scatter plot of price against quantity doesn't show you the demand curve. It shows you the tangled output of a system in which supply and demand are chasing each other. You cannot separate them by looking at their intersection. You see the system's output. You don't see the structure that produced it.

Wright's solution, in four pages of Appendix B, is to find a variable that shifts supply without touching demand — something that moves prices through a mechanism that has nothing to do with consumer preferences. If such a variable exists, you can use its fluctuations to trace the demand curve: when it pushes supply up or down, the resulting price-quantity pairs are points on the demand curve, because the only thing that changed was supply. Connect those points. That is your slope. The variable Wright used was weather — specifically, rainfall and growing-season conditions, which affect crop yields and therefore the supply of flaxseed oil, without directly affecting how much consumers want flaxseed oil. But the generalization was the insight. Any variable that is genuinely external to the feedback loop between price and demand can serve as this handle. The variable doesn't have to be weather. It has to be exogenous. It has to originate outside the system you are trying to measure.
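Wright's estimator reduces to a ratio of covariances, and it can be demonstrated on simulated data. Everything below is hypothetical: a linear demand curve with true slope −1.5, and an equilibrium price that responds both to unobserved demand shocks (the feedback) and to weather (the supply shifter). Regressing quantity on price recovers the tangle; the ratio cov(W, Q)/cov(W, P) recovers the structural slope:

```python
import random

random.seed(1)
n = 20000
true_slope = -1.5   # structural demand slope we want to recover

data = []
for _ in range(n):
    w = random.gauss(0, 1)               # weather: shifts supply, never demand
    u = random.gauss(0, 1)               # unobserved demand shock
    # Equilibrium price rises with demand shocks, falls with good harvests:
    p = 5.0 + 0.8 * u - 1.0 * w + random.gauss(0, 0.3)
    q = 10.0 + true_slope * p + u        # the structural demand curve
    data.append((w, p, q))

def cov(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

W, P, Q = map(list, zip(*data))
ols_slope = cov(P, Q) / cov(P, P)   # biased: price is entangled with the demand shock
iv_slope = cov(W, Q) / cov(W, P)    # Wright's ratio: trace the curve via the supply shifter
```

The ordinary regression slope lands well away from −1.5 because price carries the demand shock inside it; the instrumented ratio uses only the weather-driven part of price variation and lands on the structural slope.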

The econometrics literature would come to call Wright's technique the instrumental variable. His son Sewall — who had, in the 1920s, invented path coefficients, which would become foundational to both genetics and structural equation modeling, and who is one of the more consequential methodologists of the 20th century — appears to have been in close correspondence with Philip over the winter of 1925–26 while Appendix B was being developed. Whether Sewall contributed substantively or whether the four pages are Philip's alone is a question that historians of statistics have argued about and not resolved. What is not in question is that those four pages contained the method. Philip Wright died in 1934. The appendix sat largely unread for thirty years, waiting for econometrics to develop enough conceptual vocabulary to recognize what it had.

Richardson's forecast was wrong for the same structural reason Wright's appendix was ignored: the method was ahead of its inputs. Richardson needed initial conditions he didn't have — balanced, high-resolution atmospheric observations that in 1916 didn't exist. Wright needed the data infrastructure and theoretical apparatus that wouldn't arrive until the Cowles Commission, two decades later. Both methods sat, essentially correct, waiting for the world to catch up with the mathematics.

There is a temptation, at this point in the story, to say that Richardson and Wright were both working on the same question. They were not. Richardson was asking a forward question: given the state of the atmosphere now, what will it do next? Wright was asking a retrospective question: given what I observed, what caused it? These are not the same question. They are, in a precise sense, inverses — Richardson needed to integrate the equations of motion forward in time; Wright needed to invert the feedback structure of a market backward to its causal roots. The direction of time is not a minor detail. It is the whole problem. The two traditions were not solving the same thing from different angles. They were each solving half of something that neither had conceived as a whole.

The convergence, when it came, was this: Richardson's tradition, refined through seventy years of atmospheric science, eventually produced forecasts accurate enough that weather could function as a genuine instrument in Wright's sense — a variable with enough independent, exogenous variation to identify causal effects in a demand system. Wright needed genuinely exogenous data. Richardson was building the machine that would eventually provide it. Neither man knew they needed each other. The instrument Wright required had to come from physics — from a system whose dynamics were independent of consumer markets, whose variation was generated by atmospheric processes that predate any commercial calendar. That is what weather is. And weather's usefulness as an instrument depends directly on how well you can measure and forecast it — which is Richardson's problem. To use Wright's method at scale, you need Richardson's method to work. The two questions were each impossible without the other.

A third stream — the physics of complex systems — had been developing separately and was about to collide with economics in a way that clarified exactly why standard equilibrium models fail in the presence of feedback loops, increasing returns, and path dependence.

In September 1987, in Santa Fe, New Mexico, a small group of physicists and economists got into a room together and discovered they had been working on the same problem from opposite ends for decades. Philip Anderson, who had won the Nobel Prize in physics in 1977 for work on disordered systems and who had written a famous 1972 paper called "More is Different" arguing that complex systems produce behaviors irreducible to their constituent rules, looked at W. Brian Arthur's economics of increasing returns and recognized it immediately — as Arthur later recalled the encounter — as broken symmetry, as phase transition, as what happened when a system had multiple equilibria and history, rather than optimality, determined which one you ended up in.

Standard economics assumes diminishing returns: the more of something you produce, the more expensive each additional unit becomes, and the economy tends toward a stable equilibrium where no producer has a lasting advantage. Arthur was studying high-technology industries — software, microchips, operating systems — where the opposite is true. The more users an operating system has, the more software is written for it; the more software written for it, the more valuable it becomes to new users. This is increasing returns: scale reinforces itself. The consequence is path dependence: which technology wins is not determined by which one is objectively best. It is determined by which one got there first, or got lucky early, and then used its initial advantage to lock in users. VHS over Betamax. Windows over every other operating system. QWERTY over every keyboard that ergonomists prefer. In a world of increasing returns, history matters more than optimality. You cannot predict which equilibrium the system will settle into from first principles — you can only observe which one it happened to choose. Anderson saw this as the economic analogue of broken symmetry: a state selected not by optimality but by historical accident.
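The lock-in dynamic has a classic minimal model, the Polya urn, and it fits in a few lines. The sketch below is a toy, not a model of any real market: each new adopter picks a platform with probability proportional to its current user base. Run the identical rules under different random histories and the system settles into very different long-run shares, none of them predictable from the rules alone:

```python
import random

def market_share_path(steps: int, seed: int) -> float:
    """Polya-urn toy of increasing returns: adoption begets adoption."""
    random.seed(seed)
    a, b = 1, 1   # both platforms start with one user
    for _ in range(steps):
        # New adopter joins platform A with probability = A's current share.
        if random.random() < a / (a + b):
            a += 1
        else:
            b += 1
    return a / (a + b)

# Same rules, different histories -> different "equilibria."
shares = [market_share_path(10_000, seed) for seed in range(8)]
spread = max(shares) - min(shares)
```

Which share the process locks into is decided by early accidents, then amplified; that is Arthur's path dependence in its smallest possible form.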

The economists in the room had a word for what Arthur was describing: wrong. His papers on increasing returns had been rejected by four journals over six years; a referee at one major journal suggested the work "would be better suited to a regional economics journal" — the academic equivalent of being told to sit at the children's table. Anderson's endorsement, from the far side of a disciplinary wall, was what finally made it respectable. The Santa Fe Institute was founded partly from the momentum of that 1987 workshop. Two months after the meeting, Black Monday — the largest single-day market crash in U.S. history — occurred, which concentrated minds wonderfully on the question of whether standard equilibrium economics was actually modeling what markets do. It turned out it was not.

Doyne Farmer was at Santa Fe from early on. In 1977, as a physics graduate student at UC Santa Cruz, Farmer had walked into a casino in Las Vegas with a computer strapped inside his shoe — the size of a cigarette pack, hand-coded in machine language in three kilobytes of memory, receiving input from a microswitch under his big toe and communicating output via three vibrating solenoids positioned against his abdomen, which was admittedly uncomfortable and occasionally dangerous: at least once, insulation failure sent an electric shock into whoever was wearing it, which clarified certain decisions about the project's future. The Eudaemonic Enterprises operation made approximately $10,000 over eleven trips to Nevada casinos — a 20% edge over the house, achieved by solving, in real time, the differential equations of a roulette ball in motion. Fourteen years later, Farmer co-founded the Prediction Company with Norman Packard, which applied the same logic — find the structure in apparently random time series — to financial markets. Twenty-seven of twenty-eight profitable years. Sold to UBS. What Farmer did next was ask whether finding the patterns was the same as understanding them. The answer was no, and that answer is what brought him to the Santa Fe Institute, where he is today building models of the economy as a complex adaptive system — a system in which agents observe outcomes and adjust their behavior in response, those adjustments change the outcomes, which change the adjustments, and no individual agent designed or intended the aggregate result.

Farmer's trajectory is the individual-level version of the essay's central claim: prediction and understanding are not the same problem, and expertise in one does not transfer to the other. A physicist who could solve the differential equations of a roulette ball in real time — who was, by any measure, the world's best predictor of a chaotic physical system — found that prediction gave him no grip on the causal structure of financial markets. The instrument is what provides the grip.

The progression from Richardson to these parallel traditions — complexity economics, credibility econometrics, deep learning weather forecasting — is not a straight line. It is more like a river system, multiple tributaries carrying the same water without knowing they share a source. Richardson was asking how to predict what comes next. Farmer was asking the same question about roulette balls, then stock prices. The Santa Fe economists were asking why standard prediction frameworks failed in economies with increasing returns and path dependence. The credibility revolution was asking how to separate causal signal from predictive noise. The AI forecasting revolution was asking whether you could bypass the equations altogether and learn the atmosphere's behavior directly from data. These look like different questions. They are different facets of one object: the problem of knowing what a complex system will do, and why it does it.

In 2023, Pangu-Weather. In 2024, GenCast. Aurora. Prithvi. AIFS. In two years, deep-learning weather models moved from "interesting benchmark result" to "operationally deployed at national weather services." The ERA5 reanalysis dataset — 80 years of atmospheric observations, retrospectively analyzed to produce a consistent global record, completed by the European Centre for Medium-Range Weather Forecasts — had created the training corpus. Richardson's method, translated into the language of neural network weights and gradient descent, finally had the compute it needed. GenCast produces 15-day probabilistic forecasts at 0.25° resolution, with calibrated uncertainty across an ensemble of predictions. It is better than the European model on most verification metrics. It runs in minutes on a single TPU pod rather than hours on a supercomputer.

The Forecast Factory became a transformer architecture on a cluster of TPUs. Richardson had described the architecture in 1922 in a book that was found under a pile of coal.

This matters for Wright's method in a way neither man could have anticipated. The quality of weather as an instrument depends directly on the quality of weather data — the spatial resolution, the temporal precision, the uncertainty quantification. Coarse weekly DMAs smear the temperature signal across heterogeneous markets. Daily zip-code weather, joined to daily store-level sales data, gives the instrument real power: genuine variation in the exogenous input, at the resolution where it actually affects consumer behavior. GenCast's probabilistic forecasts give you something richer still: a distribution over future weather states that you can propagate forward through the causal model to produce a decomposed demand forecast with quantified uncertainty. The better Richardson's tradition got, the sharper Wright's tradition became. The two traditions were waiting for each other across ninety years without knowing it.

The AI forecasting revolution improved the instrument. At exactly the same time, the same computational revolution deepened the problem the instrument is needed to solve. The recommendation systems that now mediate advertising exposure — the LLMs powering social media feeds, the targeting algorithms selecting which users see which ads — select based on latent embeddings of purchase likelihood trained on realized sales data. That embedding is, in a precise technical sense, a confounder: it predicts both treatment and outcome, is unobservable to the analyst, and renders the correlation between advertising exposure and sales almost entirely unidentified. The media mix models built for the television era, where media buyers acted on seasonal patterns and simple demand signals, are structurally broken in the recommendation engine era. The forecasting revolution made the instrument better. The targeting revolution made the instrument more necessary. Both happened simultaneously. Nobody was steering.

The honest reason the industry has not adopted weather-as-IV at scale has nothing to do with the statistics and everything to do with data infrastructure. Running weather as an instrumental variable at zip-code and daily resolution requires joining weather observations to sales data at a spatial resolution that most retailer data warehouses simply do not support. Most sales systems report at the store level or the DMA level, on a weekly basis. The weather data exists at hourly × zip-code resolution — ERA5 grid cells are 31 kilometers, NOAA cooperative observer stations are denser, commercial weather providers are denser still. The mismatch between the resolution at which weather operates and the resolution at which business data is collected is the implementation gap. It is a data engineering problem. It is solvable.
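The join itself is unremarkable once both sides exist at compatible resolution. The sketch below uses pandas on entirely hypothetical toy frames (the zip codes, timestamps, and column names are invented for illustration): aggregate hourly zip-level weather up to daily highs, then left-join onto daily store-level sales. Those two operations are the whole "implementation gap" in miniature:

```python
import pandas as pd

# Hypothetical daily store-level sales and hourly zip-level weather.
sales = pd.DataFrame({
    "store_id": [101, 101, 102],
    "zip": ["60601", "60601", "60602"],
    "date": pd.to_datetime(["2019-07-16", "2019-07-17", "2019-07-17"]),
    "units": [210, 302, 288],
})
weather_hourly = pd.DataFrame({
    "zip": ["60601"] * 4 + ["60602"] * 2,
    "ts": pd.to_datetime([
        "2019-07-16 12:00", "2019-07-16 15:00",
        "2019-07-17 12:00", "2019-07-17 15:00",
        "2019-07-17 12:00", "2019-07-17 15:00",
    ]),
    "temp_f": [88.0, 90.0, 94.0, 96.0, 93.0, 95.0],
})

# Aggregate weather to the resolution the sales data actually supports (zip x day).
daily_wx = (
    weather_hourly
    .assign(date=weather_hourly["ts"].dt.normalize())
    .groupby(["zip", "date"], as_index=False)["temp_f"]
    .max()
    .rename(columns={"temp_f": "temp_hi_f"})
)
# validate="m:1" guards the join: many sales rows per (zip, date), one weather row.
panel = sales.merge(daily_wx, on=["zip", "date"], how="left", validate="m:1")
```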

Find the year when the prediction tradition and the identification tradition were closest. Did they know about each other? Find the moment they first explicitly intersect.

On July 17, 2019, a weather model had predicted the heat wave four days earlier with 91 percent accuracy using a neural network trained on fifty years of ERA5 reanalysis data. Carla Reyes's instincts were vindicated. The Schaumburg model remained, permanently, confused. The question is not what the weather was. We know that now, with astonishing precision. The question is what the weather did. That question has an answer too. This essay is about how to find it.

On Richardson's wartime context: The Friends' Ambulance Unit served with the French Army from 1914; Richardson joined in 1916 and served through the armistice. The "960 hours" estimate of calculation time appears in Lynch (2006). The manuscript being lost under coal during the spring 1917 Battle of Champagne (the Nivelle Offensive, in which the 16th French Infantry Division participated) and recovered later is documented by Richardson himself in the preface to Weather Prediction by Numerical Process. Richardson's subsequent career in conflict research produced Arms and Insecurity (published posthumously, 1960) and Statistics of Deadly Quarrels (1960), which applied the same mathematical methods to the frequency and scale of human conflict.

On the Santa Fe Institute 1987 workshop: The Economy as an Evolving Complex System workshop (September 1987) brought together Kenneth Arrow, Philip Anderson, W. Brian Arthur, Doyne Farmer, and others. Brian Arthur's increasing returns paper ("Competing Technologies, Increasing Returns, and Lock-In by Historical Events") was eventually published in Economic Journal in 1989, after six years of journal rejection. Anderson's "More is Different" appeared in Science in 1972. Black Monday (October 19, 1987) — the Dow Jones fell 22.6% in a single day — occurred two months after the workshop. Doyne Farmer's shoe computer project is detailed in Thomas Bass, The Eudaemonic Pie (1985). The Prediction Company (1991–present) is now part of UBS O'Connor; the 27-of-28 profitable years figure appears in published accounts of the firm's history.

On causal emergence: Hoel, Albantakis, and Tononi (2013, PNAS 110(49)) showed that coarse-grained (macro-level) descriptions of a system can carry more causal information — measured by effective information — than fine-grained (micro-level) descriptions. Temperature at the 31-kilometer ERA5 grid level is causally more powerful for demand prediction than molecular kinetics at the individual air-molecule level, even though the latter is more fundamental. Weather is the causally emergent level of atmospheric description for human behavioral outcomes. This has operational implications for how weather instruments should be constructed.

7

Assembling the System: From Causal Estimate to Probabilistic Demand Forecast

Richardson wanted to predict the atmosphere. Wright wanted to identify causation in economic data. They were solving complementary halves of the same problem. Neither knew it. What does it look like when both halves are solved at once?

The answer is not that you plug better weather forecasts into your existing attribution model. The answer is more interesting than that. When you have probabilistic weather forecasts — not point forecasts but full probability distributions over possible weather states — and when you have causal estimates of weather's effect on demand (IV estimates, hierarchically pooled, with posterior uncertainty), you can do something neither tradition could do alone: propagate forecast uncertainty through the causal model to produce a decomposed demand forecast. This is still, in the forward-looking direction, a prediction — you are extrapolating future demand from future weather. What the causal model adds is the ability to attribute each component of that prediction to its genuine causal driver: weather (predictable seven days ahead from the probabilistic forecast), advertising (controllable by adjusting spend), baseline demand, competitive response. A forecast that says "demand next week will be X" is a prediction. A forecast that says "demand next week will be X = 40% weather-driven + 25% baseline + 35% advertising-driven, each with quantified uncertainty" is a forecast you can act on — separately optimizing against the predictable component and the controllable one. That decomposition is what a purely predictive model cannot provide, and it is what neither Richardson nor Wright could produce alone.

The IV estimator gives us the causal effect of advertising on demand, isolated from the feedback loop. But a complete system needs to do more: it needs to propagate uncertainty from the weather forecast forward through the causal model to produce a demand forecast with proper uncertainty bounds. The formal language for this — the distinction between observing a variable and intervening on it — is Judea Pearl's do-calculus, developed in the 1990s and consolidated in his book Causality (2000; second edition 2009). The notation P(Y | do(A)) versus P(Y | A) captures exactly the distinction the Schaumburg team missed. The quantity we want is not P(Y | A, W) — the probability of Y given that we observe A and W — but P(Y | do(A), W) — the probability of Y given that we intervene to set A to a specific value, while W takes its natural value. The observational distribution P(Y | A) is confounded by the feedback loop. The interventional distribution P(Y | do(A)) is not, because do(A) represents a hypothetical randomization of A that breaks the feedback. In the linear-Gaussian IV case, the IV estimator consistently estimates dE[Y | do(A)] / dA — the causal effect of advertising on sales, stripped of confounding.
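The see/do distinction is easy to exhibit in simulation. The setup below is hypothetical: a latent confounder (standing in for the targeting signal) drives both ad spend A and sales Y, with a true interventional effect of 0.3 built in. The observational slope of Y on A comes out several times larger; randomizing A, the simulation analogue of do(A), recovers the truth:

```python
import random

random.seed(2)
n = 50000
true_effect = 0.3   # dE[Y | do(A)] / dA, by construction

def observe():
    c = random.gauss(0, 1)               # latent confounder (e.g. a targeting signal)
    a = c + random.gauss(0, 1)           # spend follows the confounder
    y = true_effect * a + 2.0 * c + random.gauss(0, 1)
    return a, y

def intervene(a):
    c = random.gauss(0, 1)               # under do(A=a), the confounder runs free
    return true_effect * a + 2.0 * c + random.gauss(0, 1)

obs = [observe() for _ in range(n)]
ma = sum(a for a, _ in obs) / n
my = sum(y for _, y in obs) / n
see_slope = (sum((a - ma) * (y - my) for a, y in obs)
             / sum((a - ma) ** 2 for a, _ in obs))       # P(Y | A): confounded

# P(Y | do(A)): randomize A and difference the mean outcomes.
do_slope = (sum(intervene(1.0) for _ in range(n)) / n
            - sum(intervene(0.0) for _ in range(n)) / n)
```

Here the observational slope converges to roughly 1.3 while the interventional slope converges to 0.3: same data-generating process, two different questions, two different answers.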

The math is not the hard part. The hard part is building the data infrastructure to run it. The gap is closer to three engineering sprints than three years — but only for organizations whose data warehouses already support zip-code-level joins.

The integrated system works as follows. First, the demand equation is estimated using weather as IV, hierarchically pooled across 847 stores, with a horseshoe prior on the high-dimensional weather-demand parameter space. The output is not a scalar estimate of advertising effectiveness. It is a posterior distribution over effectiveness as a function of weather, market, season, and observable covariates — a high-dimensional object that captures how advertising interacts with context. Most estimates you encounter are numbers: advertising elasticity is 0.3, standard error 0.04. The posterior distribution is something richer. It is a probability distribution not over outcomes but over what the parameter's true value might be. After seeing two years of sales, weather anomalies, and advertising spend data from a Chicago store, the posterior for advertising elasticity might say there is a 68% chance the true elasticity lies between 0.18 and 0.34, a 10% chance it is below 0.15, and a 3% chance it is above 0.45. Every value in that range is consistent with the data; the posterior tells you which values are more consistent than others. The shape of the posterior is the answer — it widens when the data is sparse or conflicting, narrows when the data is rich and unambiguous. Working with posteriors rather than point estimates means uncertainty is carried forward: when you plug the posterior elasticity into next week's demand forecast, the forecast inherits the estimation uncertainty. Second, the GenCast seven-day probabilistic forecast provides a distribution over next week's weather states, at zip-code and hourly resolution. Third, the two are integrated: for each draw from the weather forecast distribution, the causal model produces a predicted demand distribution. Average over the weather draws, weighted by their forecast probability, and you get the expected demand distribution for next week — decomposed into its causal components.
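The third, integration step is a plain Monte Carlo loop. All the numbers below are hypothetical stand-ins: a normal posterior for the weather effect and a normal forecast for next week's temperature anomaly, where the real system would use posterior draws from the hierarchical model and ensemble members from the weather model:

```python
import random

random.seed(3)

def draw_effect():
    """One draw from the (hypothetical) posterior: units per degree F of anomaly."""
    return random.gauss(4.0, 0.8)        # parameter (estimation) uncertainty

def draw_temp_anomaly():
    """One draw from the (hypothetical) probabilistic forecast."""
    return random.gauss(6.0, 2.5)        # forecast uncertainty

baseline = 250.0
draws = []
for _ in range(20000):
    eff = draw_effect()
    temp = draw_temp_anomaly()
    draws.append(baseline + eff * temp + random.gauss(0, 10.0))  # residual noise

draws.sort()
mean_demand = sum(draws) / len(draws)
p10 = draws[int(0.10 * len(draws))]
p90 = draws[int(0.90 * len(draws))]
# The demand forecast inherits both sources of uncertainty, decomposably.
```

Because each draw pairs one elasticity with one weather state, the output distribution carries estimation uncertainty and forecast uncertainty together, which is exactly what the point-estimate pipeline discards.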

The decomposition looks like this, for a specific store on a specific week: baseline demand (stable purchasing behavior unrelated to current conditions) + weather effect (quantified causal contribution of forecast weather to expected demand, with uncertainty bounds) + advertising effect (causal attribution of the campaign, separated from weather-driven demand) + price effect + competitive response. The decomposition is domain-general: the same structural separation of causal components applies wherever demand responds to multiple drivers simultaneously. Each component is a causal estimate — not a correlation, not a model parameter, but an answer to "what would demand have been if this component had been different?" This is the question Philip Wright was asking about flaxseed oil in 1928. The integrated system answers it for 847 stores simultaneously, continuously, using a weather forecast as input.
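The counterfactual reading of each component — "what would demand have been if this component had been different?" — can be made concrete with a toy calculation. The log-linear demand function and every coefficient below are invented for illustration; because the model is multiplicative, the components do not sum exactly, so the sketch reports the cross-term explicitly.

```python
import numpy as np

# Illustrative causal demand function (log-linear; all coefficients invented).
def demand(log_ad, wx_anom, base=300.0, b_ad=0.25, b_wx=0.9):
    return base * np.exp(b_ad * log_ad + b_wx * np.log1p(max(wx_anom, 0.0)))

y = demand(0.4, 3.0)                            # factual: campaign on, 3°C heat anomaly
baseline = demand(0.0, 0.0)                     # neither driver active
weather_effect = demand(0.0, 3.0) - baseline    # weather alone, no campaign
ad_effect = demand(0.4, 0.0) - baseline         # campaign alone, normal weather
interaction = y - baseline - weather_effect - ad_effect   # heat amplifying the campaign

print(f"demand {y:.0f} = baseline {baseline:.0f} + weather {weather_effect:.0f} "
      f"+ advertising {ad_effect:.0f} + interaction {interaction:.0f}")
```

Each term answers a do-style question by comparing the factual prediction against a counterfactual evaluation of the same structural function — which is exactly what separates this decomposition from reading off regression coefficients.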

Look at the weather bar. Now look at the campaign bar. Which of the two would the Schaumburg team have credited to the campaign? Which one is actually the campaign's?

The supply chain implication follows directly. The newsvendor optimal order quantity Q* is a function of the demand CDF — specifically, Q* = F⁻¹(c_u/(c_u + c_o)), where c_u is the unit underage cost and c_o the unit overage cost. The causal model does not reduce the variance of F — IV estimates are by construction less efficient than OLS, because the instrument uses only the exogenous component of treatment variation. This efficiency loss deserves a direct assessment: it matters only when the bias is small relative to the estimation variance. Here, the bias is 5–10x the true effect (Gordon et al. 2019). No increase in sample size reduces a bias of that magnitude — bias is not variance, and more data does not fix a systematic directional error. A wider confidence interval around the correct answer is always preferable to a narrow confidence interval around an answer that is five to ten times the true value. The IV estimator gives you the former. OLS, in this setting, gives you the latter. This is not a close call. What the causal model does is decompose F. Variance attributable to weather is now forecastable seven days ahead. Variance attributable to unobserved consumer behavior remains. The newsvendor's F is now conditional on the weather forecast rather than marginal — and a conditional distribution with smaller support over the predictable weather component is a more useful basis for an ordering decision than a marginal distribution that averages over weather states you won't actually encounter next week. A weather-driven demand spike is different from an advertising-driven demand spike: the weather spike is predictable from the seven-day forecast; the advertising spike is controllable by adjusting spend. The causal model allows you to distinguish them and optimize against each separately. Carla Reyes did this by instinct — she knew that a 94°F Tuesday in July at Wacker and State was a sports-drink day, and she stocked accordingly, regardless of whether the advertising team had run a campaign that week.
The integrated model does the same calculation, for 847 stores, with quantified uncertainty, before the heat wave arrives.
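The newsvendor consequence can be illustrated numerically. The demand mixture below is invented; the point is only how the critical-fractile quantile moves when F is conditioned on the weather forecast rather than left marginal.

```python
import numpy as np

rng = np.random.default_rng(1)
c_u, c_o = 1.20, 0.40                    # illustrative underage / overage costs per unit
fractile = c_u / (c_u + c_o)             # critical fractile: 0.75

# Illustrative marginal demand: a 50/50 mixture of heat-wave and mild weeks.
hot = rng.normal(1100, 120, 50_000)      # demand during a heat-wave week
mild = rng.normal(400, 80, 50_000)       # demand during a mild week
marginal = np.concatenate([hot, mild])

q_marginal = np.quantile(marginal, fractile)   # order from the unconditional F
q_hot = np.quantile(hot, fractile)             # order from F | heat-wave forecast
q_mild = np.quantile(mild, fractile)           # order from F | mild forecast

# The marginal quantile under-stocks heat-wave weeks and badly over-stocks
# mild ones; conditioning on the forecast fixes both at once.
print(f"marginal Q*: {q_marginal:.0f}, heat-wave Q*: {q_hot:.0f}, mild Q*: {q_mild:.0f}")
```

Averaging over weather states you will not actually encounter next week produces an order quantity that is wrong in both regimes — the numerical version of the argument in the paragraph above.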

The honest assessment of what this system can and cannot do deserves a direct statement. The LATE estimator identifies the causal effect for compliers — consumers who respond to weather-induced demand changes and who would also respond to the advertising campaign. If the business cares about non-compliers (consumers whose behavior is weather-invariant), the LATE is the wrong estimand. The causal forest extension partially addresses this by estimating separate effects for different consumer segments, but it does not fully solve the population-versus-LATE problem. Likewise, conditional exogeneity of weather is an argument, not a theorem — the argument has been made precisely in Sections 3 and 4, and the residual bias from Mellon-type violations is quantified at 8–22% of the point estimate. That is not zero. It is known. A further limitation deserves naming: climate change introduces non-stationarity into the weather-demand relationship itself. Causal elasticity estimates built from historical data reflect the behavioral response to historical temperature distributions. As climate change shifts temperature into regimes outside that history — more extreme heat events, different seasonal patterns — the structural estimates may drift. The model trained on 2000–2024 data may systematically underestimate demand responses to temperatures that occur once a decade in the historical record but become routine in 2035. This is a real limitation that argues for temporal discounting of older estimates and periodic recalibration rather than treating the causal parameters as fixed. And the framework is not confined to advertising. Wherever economic agents respond to demand signals — energy markets, agricultural contracting, retail pricing — the same IV logic applies, the same exclusion restriction arguments must be made, and the same residual biases must be quantified. The honesty requirement is domain-neutral.

And the infrastructure requirement is real. Joining weather data to sales data at zip-code × daily resolution requires a data warehouse architecture that most retailers have not yet built. The spatial join alone — matching each store to its relevant weather station and interpolating to zip-code resolution — is non-trivial engineering. The computation — running hierarchical Bayesian IV across 847 stores, updating weekly with new data — requires compute and statistical expertise that most internal analytics teams don't have. This essay does not pretend these are easy problems. It argues that the cost of getting the attribution wrong — systematically misattributing demand, making resource allocation decisions based on confounded estimates, planning inventory and pricing from regressions that cannot separate cause from coincidence — is also quantifiable. And at scale, in any setting where treatment decisions are substantial and frequent, the magnitude of that misattribution is large enough to motivate the engineering.

The integrated system — IV estimation with weather instruments, hierarchical Bayesian pooling, probabilistic weather forecast inputs — is the convergence point Richardson and Wright were separately reaching toward. It is not a hypothetical architecture or a proposed research agenda. The methods exist; the weather data exists; the causal identification strategy exists. What remains is the engineering work of joining them. The question Carla Reyes answered by instinct for one store, for eleven summers, this approach answers formally — decomposed into causal components, each with quantified uncertainty, each informed by the seven-day forecast — before the heat wave arrives.

Return to the ergodicity simulation from Section 5. Earlier, it showed you that the ensemble average and the median trajectory diverge — that the “average” store trajectory doesn't describe any actual store. Now we can say something new about why: the sources of variance that produce that divergence are not equivalent. Some of the variance is weather-driven and therefore predictable from the seven-day forecast; some is advertising-driven and therefore controllable by adjusting spend. The simulation doesn't separate them. The causal decomposition does. If you could shrink only one source of spread, which would you target first?

In the opening of this essay, two scenes played out simultaneously on July 17, 2019. Carla Reyes was watching cold beverage sales climb from behind a cash register, making an inventory call that would have required six weeks of econometric analysis to formally justify and which she made in thirty seconds based on eleven summers of experience. In Schaumburg, a team was preparing a board deck with a 2.2x attribution number that the data could not support, because the data — including the temp_hi_f column sitting unused in the left panel of their spreadsheet — was a closed loop of spend and sales that had no way to see its own structure. The weather was not a nuisance variable in either story. It was the entire story. Carla knew this because experience had taught her the causal structure directly. The analytics team didn't know it because nobody had shown them how to use the column they already had.

The tools to use that column — to extract causal structure from closed-loop data using weather as the exogenous excitation signal — were already available in 1928, in four pages of a tariff economics appendix. They were formalized in the 1950s. They were made rigorous in the 1990s. They were given the computational scale to run across 847 stores in 2018. And in 2024, the weather instrument became instrument-grade: precise, probabilistic, high-resolution, and freely available at a resolution that exceeds any strategic response cycle.

Richardson's Forecast Factory finally produced a forecast accurate enough to use as the input to Wright's identification strategy. The two traditions converged. The joint solution is something neither man could have imagined. Carla Reyes, at Wacker and State, was already living it.

On the do-calculus and the IV estimator: Pearl's do-calculus (2009) provides a graphical framework for computing interventional distributions from observational data and graphical models. In the linear-Gaussian case with a single instrument Z, the Wald estimator β_IV = Cov(Y, Z) / Cov(A, Z) is a consistent estimator of the interventional derivative dE[Y | do(A)] / dA under the four IV conditions. In the nonlinear case, the causal forest extension produces a consistent estimator of the conditional average treatment effect τ(x) = E[Y(1) − Y(0) | X = x] for each covariate profile x, under regularity conditions from Wager and Athey (2018). The integration over the weather forecast distribution is a standard Bayesian posterior predictive integration: E[Y_{t+k} | do(A_{t+k}), F_t] = ∫ P(W_{t+k} | F_t) · E[Y | do(A_{t+k}), W_{t+k}] dW_{t+k}, where F_t is the information available at time t, P(W_{t+k} | F_t) is the GenCast probabilistic forecast, and E[Y | do(A), W] is the causal demand function estimated from the hierarchical IV model.
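The Wald estimator's behavior under confounding is easy to verify on simulated data. The structural model below is a made-up toy — a hidden confounder U drives both treatment and outcome, while the instrument Z moves treatment only — but it reproduces the OLS-versus-IV contrast the appendix describes.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Toy structural model: the true causal effect of A on Y is 0.30.
# U is an unobserved confounder (think seasonality) moving both A and Y;
# Z (a weather anomaly) shifts A but touches Y only through A.
true_beta = 0.30
u = rng.normal(size=n)
z = rng.normal(size=n)
a = 0.8 * u + 0.5 * z + rng.normal(size=n)           # treatment: confounded + instrumented
y = true_beta * a + 1.5 * u + rng.normal(size=n)     # outcome

beta_ols = np.cov(y, a)[0, 1] / np.var(a, ddof=1)    # biased: absorbs the confounder
beta_iv = np.cov(y, z)[0, 1] / np.cov(a, z)[0, 1]    # Wald estimator

print(f"OLS:  {beta_ols:.3f}   (biased upward by the confounder)")
print(f"Wald: {beta_iv:.3f}   (close to the true effect, {true_beta})")
```

The OLS coefficient lands far above 0.30 no matter how large n grows — the simulated version of "more data does not fix a systematic directional error" — while the Wald ratio recovers the structural effect, at the cost of a wider sampling distribution.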

For ninety-six years, Richardson's method and Wright's method ran in parallel, solving complementary halves of the same problem, in different disciplines, in different languages, using different mathematics — and the question they were both trying to answer was always the same question: given what we observe, what would have happened if something had been different?

References

Forecasting & Time Series

Makridakis, S., Wheelwright, S., & Hyndman, R. J. (1998). Forecasting: Methods and Applications. Wiley.

Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2022). M5 accuracy competition: Results, findings, and conclusions. International Journal of Forecasting 38(4), 1346–1364.

Ansari, A. F., et al. (Amazon, 2024). Chronos: Learning the language of time series. arXiv:2403.07815.

Das, A., et al. (Google, 2024). A decoder-only foundation model for time-series forecasting. arXiv:2310.10688.

Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine learning practice and the classical bias-variance trade-off. PNAS 116(32), 15849–15854.

Power, A., et al. (2022). Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv:2201.02177.

Causal Inference & Econometrics

Wright, P. G. (1928). The Tariff on Animal and Vegetable Oils. Macmillan. [Appendix B]

Haavelmo, T. (1943). The statistical implications of a system of simultaneous equations. Econometrica 11(1), 1–12.

Leamer, E. E. (1983). Let's take the con out of econometrics. American Economic Review 73(1), 31–43.

Angrist, J. D. (1990). Lifetime earnings and the Vietnam era draft lottery: Evidence from Social Security administrative records. American Economic Review 80(3), 313–336.

Angrist, J. D., & Krueger, A. B. (1991). Does compulsory school attendance affect schooling and earnings? Quarterly Journal of Economics 106(4), 979–1014.

Card, D., & Krueger, A. B. (1994). Minimum wages and employment. American Economic Review 84(4), 772–793.

Imbens, G. W., & Angrist, J. D. (1994). Identification and estimation of local average treatment effects. Econometrica 62(2), 467–475.

Staiger, D., & Stock, J. H. (1997). Instrumental variables regression with weak instruments. Econometrica 65(3), 557–586.

Stock, J. H., & Trebbi, F. (2003). Retrospectives: Who invented instrumental variable regression? Journal of Economic Perspectives 17(3), 177–194.

Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic control methods for comparative case studies. JASA 105(490), 493–505.

Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. JASA 113(523), 1228–1242.

Chernozhukov, V., et al. (2018). Double/debiased machine learning for treatment and structural parameters. Econometrics Journal 21(1), C1–C68.

Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press.

Angrist, J. D., & Pischke, J.-S. (2010). The credibility revolution in empirical economics. Journal of Economic Perspectives 24(2), 3–30.

Mellon, J., et al. (2025). Weather instruments and competitor advertising response. Working paper.

Statistics & Estimation

Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Proceedings of the Third Berkeley Symposium 1, 197–206.

James, W., & Stein, C. (1961). Estimation with quadratic loss. Proceedings of the Fourth Berkeley Symposium 1, 361–379.

Efron, B., & Morris, C. (1975). Data analysis using Stein's estimator and its generalizations. JASA 70(350), 311–319.

Carvalho, C. M., Polson, N. G., & Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika 97(2), 465–480.

Complexity Science & Cybernetics

Wiener, N. (1948). Cybernetics: Or Control and Communication in the Animal and the Machine. MIT Press.

Ashby, W. R. (1956). An Introduction to Cybernetics. Chapman and Hall.

Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal 27, 379–423.

Anderson, P. W. (1972). More is different. Science 177(4047), 393–396.

Wolfram, S. (2002). A New Kind of Science. Wolfram Media.

Hoel, E., Albantakis, L., & Tononi, G. (2013). Quantifying causal emergence shows that macro can beat micro. PNAS 110(49), 19790–19795.

Peters, O. (2019). The ergodicity problem in economics. Nature Physics 15, 1216–1221.

Marketing Science

Gordon, B. R., Zettelmeyer, F., Bhargava, N., & Chapsky, D. (2019). A comparison of approaches to advertising measurement. Marketing Science 38(2), 193–225.

Shapiro, B. T., Hitsch, G. J., & Tuchman, A. E. (2021). TV advertising effectiveness and profitability: Generalizable results from uncontrolled experiments. Econometrica 89(4), 1855–1879.

Busse, M. R., Pope, D. G., Pope, J. C., & Silva-Risso, J. (2015). The psychological effect of weather on car purchases. Quarterly Journal of Economics 130(1), 371–414.

AI Weather Forecasting

Bi, K., et al. (2023). Accurate medium-range global weather forecasting with 3D neural networks. Nature 619, 533–538. [Pangu-Weather]

Price, I., et al. (2024). GenCast: Diffusion-based ensemble forecasting for medium-range weather. Nature 637, 84–90.

Bodnar, C., et al. (2024). Aurora: A foundation model of the atmosphere. Microsoft Research preprint.

Richardson & Historical Meteorology

Richardson, L. F. (1922). Weather Prediction by Numerical Process. Cambridge University Press.

Richardson, L. F. (1960). Arms and Insecurity. Boxwood Press.

Lynch, P. (2006). The Emergence of Numerical Weather Prediction: Richardson's Dream. Cambridge University Press.

Complexity Economics

Arthur, W. B. (1989). Competing technologies, increasing returns, and lock-in by historical events. Economic Journal 99(394), 116–131.

Beinhocker, E. D. (2006). The Origin of Wealth: Evolution, Complexity, and the Radical Remaking of Economics. Harvard Business School Press.

Bass, T. A. (1985). The Eudaemonic Pie. Houghton Mifflin.

Zuckerman, G. (2019). The Man Who Solved the Market: How Jim Simons Launched the Quant Revolution. Portfolio/Penguin.