
Prediction, Causation, and the Weather

How two intellectual traditions—one that asks “what will happen?” and one that asks “what did this cause?”—converge on weather and commercial demand

85 degrees Fahrenheit in Phoenix and 85 degrees in Houston do not mean the same thing. In Phoenix, 85 is pleasant—below the summer average, a relief that brings people outdoors and into stores. In Houston, at 90% humidity, 85 is oppressive—the kind of heat that keeps people indoors, shifts purchases toward cold drinks, and suppresses foot traffic. The same number, measured on the same thermometer scale, produces opposite effects on what people buy. A lookup table cannot capture this. A monthly average destroys it. And yet this distinction matters: across the U.S. economy, weather-attributable variability amounts to roughly $485 billion annually—3.4% of GDP across all eleven non-governmental sectors.

In December 2024, a DeepMind team published GenCast, a neural network producing probabilistic weather forecasts that beat the gold-standard ECMWF ensemble system on 97.2% of evaluation targets—99.8% beyond 36 hours. By February 2025, ECMWF had operationalized its own AI forecasting system. NOAA followed in December 2025 with three AI models, one of which uses 0.3% of the computing power of the system it augments. Simultaneously, a separate revolution has been unfolding in time series forecasting. Amazon’s Chronos quantizes demand signals into 4,096 discrete tokens and processes them with a language model architecture. Google’s TimesFM treats time series as patches. IBM’s FlowState—just 9.1 million parameters, presented at NeurIPS 2025—models them as continuous flows. These foundation models learn cross-domain temporal patterns and forecast zero-shot on data they have never seen. Two revolutions are happening at once: AI weather forecasting and foundation models for demand. Together, they could produce a golden age of economic prediction. But there is a catch. The most powerful prediction machinery in history still cannot answer whether the weather caused the demand shift, or merely coincided with it.

The Oldest Prediction Problem

Prediction as the Engine of Science

The desire to predict weather’s impact on harvests is one of the oldest motivations for systematic knowledge. The Babylonian MUL.APIN tablets, compiled around 1000 BCE, recorded celestial phenomena not for their own sake but to predict seasonal weather for planting—astronomy was agricultural economics. Egyptian nilometers measured the Nile’s annual flood, one of the first forecasting instruments, because the entire economy depended on the prediction. Indian monsoon knowledge organized the largest trade network in the ancient world. When Aristotle wrote his Meteorologica around 340 BCE—literally the origin of the word “meteorology”—he was continuing a tradition already millennia old.

This is not just about weather. Astronomy drove geometry and trigonometry. Navigation drove coordinate systems and Harrison’s chronometer. Insurance drove probability theory—Pascal and Fermat, corresponding in 1654 about a gambling problem, laid the mathematical foundations that would eventually price risk across every industry. Physics drove calculus. Economics drove regression and time series analysis. Computing drove weather simulation—von Neumann chose weather prediction as one of the first problems for electronic computers. Machine learning drives pattern recognition at scale. Every major mathematical framework was invented to predict something. This essay is about one of the oldest prediction problems there is.

From Richardson’s Forecast Factory to Foundation Models

In 1922, Lewis Fry Richardson attempted to compute six hours of weather by hand. It took six weeks. The result was spectacularly wrong—he predicted a pressure change of 145 hectopascals, an absurd figure given that normal atmospheric pressure hovers around 1013 hPa. Decades later, Peter Lynch demonstrated that with proper initialization, Richardson’s method produces accurate results. The mathematics was right; the data was wrong. Before abandoning the attempt, Richardson envisioned a “forecast factory”: 64,000 human computers arrayed in a vast globe-shaped hall, the ceiling painted as the North Pole, the pit as Antarctica, a conductor using a beam of red light to pace the lagging sections and restrain the fast.

In 1950, John von Neumann realized Richardson’s dream electronically. Jule Charney—born in San Francisco to immigrant tailors, who studied mathematics at UCLA and switched to meteorology in Oslo during the war—led the team that produced the first computer-generated weather forecast on the ENIAC. A 24-hour forecast took approximately 24 hours. The computer barely kept pace with weather itself.

Then came chaos. In 1961, Edward Lorenz, working on a Royal McBee LGP-30 computer so loud it needed its own office, entered 0.506 instead of 0.506127 into a weather simulation and returned to find a completely different scenario. The discovery, published in 1963, established that deterministic systems can produce unpredictable behavior—an insight so important that the metaphor coined for it, the “butterfly effect,” entered common language. (The metaphor itself was suggested by Philip Merilees at a 1972 AAAS session; Lorenz’s original metaphor was a seagull.)

Interactive

Try to identify which correlations hide genuine causal relationships. Most people’s intuitions are reliably wrong.

For four decades, the forecasting competition known as the M-Competitions delivered a humiliating lesson. In M1 (1982), M2 (1993), and M3 (2000), statisticians armed with PhDs and elaborate models were consistently beaten by simple exponential smoothing—a method that weights recent observations more heavily and can be implemented in a spreadsheet. Then the pattern reversed. In M4 (2018), ES-RNN, a hybrid neural network built by Slawek Smyl at Uber, was roughly 10% more accurate than the best traditional benchmark. In M5 (2020), using real Walmart sales data, machine learning methods dominated. Forty years of accumulated wisdom overturned in two competitions.

Interactive

Compare forecast methods across four decades of competition. The best predictor cannot answer why demand changed.

The Deep Learning Revolution

The M5 result did not appear from nowhere. It was the culmination of a seventy-year arc in artificial intelligence—one punctuated by winters, false starts, and sudden accelerations that together constitute one of the great intellectual dramas of the twentieth century.

The story begins with Frank Rosenblatt’s Perceptron (1958), a learning machine inspired by neurons that could classify simple patterns. Minsky and Papert formalized its limitations in 1969—a single-layer network cannot learn the XOR function—and helped trigger the first AI winter. Neural networks fell out of fashion for nearly two decades. In 1986, Rumelhart, Hinton, and Williams demonstrated that backpropagation—computing gradients through multiple layers by repeated application of the chain rule—could train deep networks effectively. The second act had begun, but the hardware was not ready. Through the 1990s, support vector machines dominated: mathematically elegant, provably optimal in certain senses, but fundamentally limited by their inability to learn hierarchical representations from raw data.

Two inventions changed the trajectory. In 1997, Hochreiter and Schmidhuber introduced Long Short-Term Memory networks, solving the vanishing gradient problem that had crippled recurrent neural networks. LSTMs could learn dependencies across hundreds of time steps—the first architecture with genuine temporal memory, and the backbone of sequence modeling for the next two decades. Then in 2012, Krizhevsky, Sutskever, and Hinton’s AlexNet won the ImageNet competition by a staggering margin, proving that deep networks with sufficient data and GPU compute could learn representations no hand-engineered feature could match. The modern deep learning era had arrived.

What followed was a cascade of architectural innovation. Bahdanau and colleagues (2014) introduced the attention mechanism, letting models dynamically focus on relevant parts of their input rather than compressing everything into a fixed-size vector. In 2017, Vaswani and colleagues at Google published “Attention Is All You Need,” replacing recurrence entirely with self-attention. The Transformer architecture scaled where recurrent networks could not: parallelizable across sequence length, with performance that improved predictably with compute, data, and parameters—a relationship Kaplan and colleagues would quantify as scaling laws in 2020. Within two years of the original paper, BERT and GPT had launched the modern language model paradigm. The same architecture would soon reshape every domain it touched, including time series forecasting.

For time series specifically, the neural revolution arrived in stages. DeepAR (Salinas et al., Amazon, 2020) brought probabilistic forecasting with autoregressive RNNs—each prediction was not a point but a distribution. N-BEATS (Oreshkin et al., 2020) demonstrated that pure deep learning, with no handcrafted features and no domain knowledge, could beat statistical methods on the M4 benchmark. The Temporal Fusion Transformer (Lim et al., 2021) added interpretable attention across multiple forecast horizons. Then came a shock: Zeng and colleagues (2023) showed that DLinear—a single-layer linear model—outperformed complex Transformer architectures on several long-horizon benchmarks. The field was forced to ask whether architectural complexity was actually helping. PatchTST (Nie et al., 2023) resolved the tension by borrowing from computer vision: slice the time series into patches, attend across them, and let channel independence prevent cross-variable overfitting. The Transformer’s advantage was real—but only when the architecture respected the structure of temporal data.

The latest stage is foundation models for time series—pretrained on millions of diverse series and capable of zero-shot forecasting on data never seen during training. Chronos treats demand as language, quantizing values into discrete tokens and processing them with a T5 architecture. TimesFM treats it as vision, slicing series into patches and attending across them. Moirai uses a mixture-of-experts architecture, achieving 17% better accuracy with 65 times fewer parameters than its predecessor. FlowState models time series as continuous flows using state space models, achieving state-of-the-art results with just 9.1 million parameters. Meanwhile, Mamba and other selective state space models are bringing linear-complexity sequence modeling to long-horizon forecasting, with MambaTS achieving state-of-the-art results across eight benchmarks in 2024.

Interactive

The classical forecasting toolkit: adjust the smoothing parameter to see how exponential smoothing weighs recent data against history.

The classical toolkit: exponential smoothing and ARIMA

Exponential smoothing (Brown 1956, extended by Holt 1957 and Winters 1960) produces forecasts as weighted averages of past observations, with exponentially decreasing weights. The smoothing parameter α controls how quickly the model forgets: α near 1 tracks recent data closely; α near 0 produces stable, slowly-updating forecasts. Holt added a trend component; Winters added seasonality.
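The update rule can be sketched in a few lines of Python. This is a minimal illustration of simple (level-only) smoothing; the demand numbers are invented:

```python
# Simple exponential smoothing: the smoothed level is a weighted average of
# past observations with exponentially decaying weights, controlled by alpha.
def exponential_smoothing(series, alpha):
    """Return the smoothed level after each observation.

    The last value is the forecast for the next period.
    """
    level = series[0]               # initialize level at the first observation
    levels = [level]
    for y in series[1:]:
        # Blend the new observation with the old level.
        level = alpha * y + (1 - alpha) * level
        levels.append(level)
    return levels

demand = [100, 102, 98, 110, 107, 111]
print(exponential_smoothing(demand, alpha=0.3))
# alpha near 1 tracks recent data closely; alpha near 0 barely moves.
```

With α = 0.3 the final level lands around 106: the model has absorbed the upward drift but discounted the older, lower observations.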

ARIMA (Box & Jenkins, 1970) decomposes a time series into autoregressive (past values predict future values), integrated (differencing to achieve stationarity), and moving average (past forecast errors predict future values) components. The (p,d,q) notation specifies the order of each. For forty years, selecting these orders was as much art as science—which is partly why simple methods kept winning the M-Competitions.
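The AR and I components can be sketched with NumPy alone. This is a toy ARIMA(1,1,0) on simulated data with an assumed true coefficient of 0.6, not a full Box–Jenkins procedure:

```python
import numpy as np

# Sketch of the AR and I parts of ARIMA: difference once (d=1) to remove
# the trend, then fit an AR(1) by least squares (p=1, q=0).
rng = np.random.default_rng(0)
n = 500
phi = 0.6                          # true autoregressive coefficient (assumed)
noise = rng.normal(size=n)
dy = np.zeros(n)                   # the stationary differenced series
for t in range(1, n):
    dy[t] = phi * dy[t - 1] + noise[t]
y = 50 + np.cumsum(dy)             # integrate back: a trending series

diff = np.diff(y)                  # d=1: recover the stationary series
# Regress diff[t] on diff[t-1] to estimate phi.
phi_hat = np.linalg.lstsq(diff[:-1, None], diff[1:], rcond=None)[0][0]
print(round(phi_hat, 2))           # close to the true 0.6
forecast = y[-1] + phi_hat * diff[-1]   # one-step-ahead ARIMA(1,1,0) forecast
```

The hard part Box and Jenkins left to the analyst—choosing p, d, and q from autocorrelation diagnostics—is exactly what made the method an art for forty years.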

The $485 Billion Practice Gap

The science is extraordinary. In practice, most companies use Excel, simple regression, or vendor black boxes. Weather enters demand planning—if it enters at all—as a binary flag (rain or no rain) or a monthly average. A 2011 study by Lazo and colleagues in the Bulletin of the American Meteorological Society estimated $485 billion in annual weather-attributable economic variability across all non-governmental sectors of the U.S. economy. Roth Tran (2023), publishing in Management Science, found that a one-standard-deviation favorable weather shock produces a 5.6% increase in retail sales at indoor stores, and that lost sales due to bad weather are persistent, not merely deferred. A 2025 study involving 1.2 million model training runs across 50 states over ten years found that incorporating weather data improves grocery forecasting accuracy by 20.2% and casual dining by 12.2%.

Prediction Cannot Answer “Why”

Even if this practice gap were closed, a deeper problem remains. In M6 (2022–2023), a competition involving 100 securities, forecast accuracy had zero correlation with investment returns. The top forecaster and the top investor were different participants. A foundation model can ingest temperature as a covariate and produce an excellent forecast. It cannot tell you whether temperature caused the shift in demand or merely accompanied it. Prediction and causal attribution are different problems requiring different mathematics.

Two intellectual traditions address them. One asks “what will happen?” The other asks “what did this cause?” They spent a century developing in parallel. The domain where they converge is weather and commercial demand. The remainder of this essay follows that convergence.


Weather as Natural Experiment

Philip Wright and the Invention of Instrumental Variables

In 1928, Philip Wright—a poet, publisher, and professor at Lombard College in Galesburg, Illinois—published a study of tariffs on animal and vegetable oils. He ran the Asgard Press from his basement and had published Carl Sandburg’s first book, In Reckless Ecstasy, in 1904, when Sandburg was his student. In Appendix B of the tariff study—not even the main text—Wright described a technique for disentangling the simultaneous determination of supply and demand. His insight: certain variables shift only the supply curve (including weather conditions affecting crop yields) while leaving the demand curve untouched. By exploiting these supply-shifting variables, one could trace the causal arrow from supply shocks to price changes. The technique is now called instrumental variable estimation. It would take sixty years to be fully appreciated. It has since won two Nobel Prizes.

Wright developed the approach collaboratively with his son Sewall Wright, the geneticist who independently invented path analysis—a method for tracing causal relationships through diagrams of variables. Stock and Trebbi confirmed the attribution through stylometric analysis in 2003.

Interactive

Wright’s identification strategy: a supply-shifting variable (weather) traces the demand curve by moving one equation while holding the other still.

The Credibility Revolution in Economics

For decades, the potential of natural experiments went underappreciated. In 1983, Edward Leamer published a remarkable provocation in the American Economic Review: “Hardly anyone takes data analyses seriously. Or perhaps more accurately, hardly anyone takes anyone else’s data analyses seriously.” The problem was that researchers could reach almost any conclusion by choosing which variables to include, how to specify the model, and which observations to exclude. Leamer called for sensitivity analysis—testing whether conclusions survive under different assumptions. But the field needed more than diagnostics. It needed a different source of variation entirely.

The credibility revolution provided one. Angrist and Krueger (1991) used quarter of birth as an instrument for years of schooling. Card and Krueger (1994) compared fast-food employment in 410 restaurants across the New Jersey–Pennsylvania border after a minimum wage increase. Imbens and Angrist (1994) formalized the Local Average Treatment Effect, clarifying what instrumental variables actually estimate. In 2021, Angrist, Imbens, and Card received the Nobel Prize in Economics.

Four Properties That Make Weather Uniquely Powerful

Weather possesses four properties that make it a uniquely powerful instrument. It is exogenous: no advertiser, retailer, or competitor can cause the weather. It is continuously varying: across 210 designated market areas in the United States, weather generates 76,650 unique market-days of variation every year. It is high-frequency: weather changes daily while most planning cycles operate weekly or monthly. And it is universal: no tracking pixel, no consent form, no IRB approval required. Every day, across every market, the atmosphere randomly assigns different treatments to different populations. This is the equivalent of running thousands of randomized controlled trials simultaneously, for free, forever.

Interactive

Every day, 210 U.S. market areas receive randomly different weather treatments—the world’s largest ongoing natural experiment.

Mellon (2025), in the American Journal of Political Science, documented 159 studies that use weather as an instrumental variable, spanning the full breadth of economics—agriculture, industry, health, conflict, and growth. Dell, Jones, and Olken’s canonical 2014 review in the Journal of Economic Literature surveyed weather’s ubiquity as a source of identification across disciplines. The empirical evidence is remarkably specific. Busse and colleagues (2015), analyzing 40 million car transactions in the Quarterly Journal of Economics, found that a 20-degree Fahrenheit above-normal temperature anomaly increases convertible market share by 8.5%. Buyers project their current weather experience onto future utility—a form of psychological projection bias. Unilever’s CFO, Graeme Pitkethly, noted in 2023 that “when it gets too hot, people move away from ice cream and buy a cold drink instead.” Even the companies selling the products know the relationship is nonlinear.

Pearl’s Ladder of Causation

The question this essay is ultimately asking can be written in the notation of Judea Pearl’s structural causal models: what is P(demand | do(weather = w))? The “do”—Pearl’s do-operator—marks the difference between observation and intervention. Not “what did we observe when it was hot?” but “what would happen if we could make it hot while holding everything else constant?” Pearl’s Ladder of Causation distinguishes three rungs. The first is association: ice cream sales correlate with sunburn. The second is intervention: what happens if we set the temperature to 95°F? The third is counterfactual: what would have happened if it had not rained? The instrumental variable tradition and Pearl’s structural causal model tradition approach the same insight from different angles. Do-calculus’s three rules are sound and complete for identifying causal effects from observational data given a known causal graph.

Pearl’s do-calculus: the three rules

Given a causal graph G, the three rules of do-calculus allow manipulation of interventional distributions: (1) Insertion/deletion of observations: if a variable is independent of the outcome given the intervention and observed variables in a modified graph, it can be added or removed from the conditioning set. (2) Action/observation exchange: under specific d-separation conditions in a modified graph, an intervention can be replaced by an observation. (3) Insertion/deletion of actions: if all causal paths from the intervention to the outcome are blocked in a modified graph, the intervention has no effect and can be removed. These three rules, combined with standard probability axioms, are sufficient to derive all identifiable causal effects.
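In the standard notation (writing G with an overbar on X for the graph with arrows into X removed, and an underbar on Z for the graph with arrows out of Z removed), the three rules read:

```latex
% Rule 1 (insertion/deletion of observations):
P(y \mid do(x), z, w) = P(y \mid do(x), w)
  \quad \text{if } (Y \perp\!\!\!\perp Z \mid X, W) \text{ in } G_{\overline{X}}

% Rule 2 (action/observation exchange):
P(y \mid do(x), do(z), w) = P(y \mid do(x), z, w)
  \quad \text{if } (Y \perp\!\!\!\perp Z \mid X, W) \text{ in } G_{\overline{X}\,\underline{Z}}

% Rule 3 (insertion/deletion of actions):
P(y \mid do(x), do(z), w) = P(y \mid do(x), w)
  \quad \text{if } (Y \perp\!\!\!\perp Z \mid X, W) \text{ in } G_{\overline{X}\,\overline{Z(W)}}
```

Here Z(W) denotes the Z-nodes that are not ancestors of any W-node in the graph with arrows into X removed.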

From FourCastNet to Operational AI Weather

The instrument is becoming dramatically more precise. In February 2022, NVIDIA released FourCastNet—the starting gun for AI weather forecasting, 45,000 times faster than traditional numerical weather prediction. In July 2023, Huawei’s Pangu-Weather became the first AI model to beat the operational ECMWF Integrated Forecasting System; its paper was the first in Nature solely authored by a Chinese technology company. In November 2023, DeepMind’s GraphCast surpassed the ECMWF high-resolution model on over 90% of 1,300 test areas and predicted Hurricane Lee’s Nova Scotia landfall nine days in advance. Then GenCast (December 2024): a probabilistic diffusion model beating the ensemble system on 97.2% of targets. NeuralGCM hybridized physics and machine learning. Microsoft’s Aurora, a 1.3-billion-parameter foundation model, generalized across weather, climate, and air quality. ECMWF’s AIFS went operational on February 25, 2025. NOAA deployed three AI models in December 2025. NVIDIA launched Earth-2, an open weather AI stack, in January 2026. From FourCastNet to ECMWF operational deployment: exactly 36 months.

Better weather forecasts make the instrument more precise. GenCast’s 2025 hurricane season results demonstrated this concretely: for a Category 5 intensification event, the model was 140 km closer to the true position on the 5-day track—a 1.5-day improvement in effective lead time, the kind of gain that typically takes over a decade of incremental progress. The weather community completed this AI transition. The demand modeling community has not yet begun.

The Exclusion Restriction and 194 Potential Violations

For the instrument to work, weather must affect demand through only the channel being studied. This is the exclusion restriction. What if it does not hold? Mellon (2025) documented 194 potential exclusion restriction violations across 159 weather-IV studies. Some results are extremely sensitive: Cinelli and Hazlett’s (2020) sensitivity analysis framework shows that for certain studies, an omitted variable explaining as little as 0.01% of outcome variance could nullify the IV estimate.

194 potential exclusion restriction violations. Studies where confounders could explain away the result. A systematic review that calls into question every weather-IV result in the literature. This should be devastating.

It is not.

Interactive

Peel away confounding variables one by one to reveal the causal signal underneath. Double machine learning automates this process.

The 194 violations are not 194 reasons to abandon the instrument. They are 194 documented causal channels. Double machine learning (Chernozhukov et al., 2018, with seven authors including Esther Duflo) separates prediction from identification: use machine learning to predict and remove everything you can explain, then estimate the causal effect on the residual. The “double” means you do this from both sides—predicting both the treatment and the outcome, then using the orthogonal residuals. Cross-fitting ensures the same data used for prediction is not used for inference. The result: root-n convergence rates for the causal parameter even when the nuisance functions are estimated with flexible machine learning methods.

Interactive

194 exclusion restriction violations, reframed: each is a documented causal channel that a multi-channel model should capture.

Causal forests (Wager and Athey, 2018) go further, discovering heterogeneous treatment effects—the fact that weather’s impact varies by location, season, and product category. Their honesty constraint (splitting and estimation samples are separate) provides valid confidence intervals with asymptotic normality. Wager’s 2024 textbook unified the pedagogy. Synthetic control methods (Abadie, 2003 and 2010) build mathematical twins of treated units from weighted combinations of untreated ones; the augmented version (Ben-Michael, Feller, and Rothstein, 2021) adds ridge-regression bias correction. These methods do not rescue a broken instrument. They operationalize the comprehensive multi-channel model that the critique itself demands.

Interactive

Treatment effects vary by location and context. Causal forests discover this heterogeneity automatically.

Sensitivity Analysis and Honest Confidence Intervals

What does the toolkit not solve? Conley, Hansen, and Rossi (2012) developed bounds for plausible exogeneity—how much the exclusion restriction can be violated before the conclusion changes. Cinelli and Hazlett (2020) formalized sensitivity analysis with robustness values: how strong would a confounder need to be to explain away this result? Their sensemakr R package makes this analysis routine. Conformal prediction (Lei and Candès, 2021) provides distribution-free coverage guarantees for counterfactual prediction intervals—the true value falls in this interval 90% of the time, guaranteed, with no distributional assumptions. The honest answer is not zero residual violation, but bounded violation with known confidence intervals.
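Split conformal prediction is simple enough to sketch directly. The point predictor below is deliberately crude (the training mean) to make the key property vivid: the coverage guarantee does not depend on the model being any good. The data and sample sizes are invented:

```python
import numpy as np

# Split conformal prediction: a distribution-free 90% prediction interval.
rng = np.random.default_rng(42)
train = rng.normal(loc=10, scale=2, size=500)   # fit the point predictor
calib = rng.normal(loc=10, scale=2, size=500)   # held-out calibration set
test = rng.normal(loc=10, scale=2, size=2000)   # fresh data to cover

pred = train.mean()                             # the "model": predict the mean
scores = np.abs(calib - pred)                   # calibration residuals
n = len(scores)
# Conformal quantile with the finite-sample (n+1) correction.
q = np.quantile(scores, np.ceil(0.9 * (n + 1)) / n)

covered = np.mean((test >= pred - q) & (test <= pred + q))
print(round(covered, 3))                        # close to 0.90 by construction
```

Swapping in a real regression model changes only the `pred` line; the guarantee is unchanged, which is what “distribution-free” buys you.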

DML mechanics: orthogonal scores and cross-fitting

Double/debiased machine learning constructs Neyman-orthogonal score functions that are locally insensitive to errors in the nuisance parameter estimates. The procedure: (1) Split data into K folds. (2) For each fold, estimate nuisance functions (the conditional expectations of treatment and outcome given controls) on the remaining K-1 folds. (3) Compute residuals on the held-out fold. (4) Estimate the causal parameter from the residuals. Cross-fitting prevents overfitting bias. The result converges at the parametric rate (root-n) even when the nuisance functions are estimated at slower nonparametric rates. Production-ready implementations: DoubleML (Python/R) and EconML (Microsoft).
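The four steps above can be sketched end to end. This is a minimal simulation with an assumed true effect of 2.0; the nuisance learner here is plain OLS for self-containment, where a production pipeline would use a flexible ML model (or the DoubleML/EconML packages):

```python
import numpy as np

# Double ML sketch: partial the treatment (weather) and the outcome (sales)
# out of the controls on held-out folds, then regress residual on residual.
rng = np.random.default_rng(1)
n = 4000
X = rng.normal(size=(n, 3))                     # controls (season, market, ...)
weather = X @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=n)
theta = 2.0                                     # true causal effect (assumed)
sales = theta * weather + X @ np.array([1.0, 0.5, -0.7]) + rng.normal(size=n)

def fit_predict(X_tr, y_tr, X_te):
    """Nuisance learner: OLS here; any flexible ML model could stand in."""
    beta = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]
    return X_te @ beta

K = 5                                           # cross-fitting folds
folds = np.array_split(rng.permutation(n), K)
res_w, res_y = np.empty(n), np.empty(n)
for k in range(K):
    te = folds[k]
    tr = np.concatenate([folds[j] for j in range(K) if j != k])
    # Residuals on the held-out fold, nuisances fit on the other K-1 folds.
    res_w[te] = weather[te] - fit_predict(X[tr], weather[tr], X[te])
    res_y[te] = sales[te] - fit_predict(X[tr], sales[tr], X[te])

theta_hat = (res_w @ res_y) / (res_w @ res_w)   # residual-on-residual slope
print(round(theta_hat, 2))                      # close to the true 2.0
```

The cross-fitting loop is the whole trick: no observation’s residual is computed from a nuisance model that saw that observation.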

AI weather model architectures compared

GraphCast (DeepMind, 2023): Graph neural network operating on an icosahedral mesh. Encodes atmospheric state as node features, message-passes across edges representing spatial adjacency, decodes to the next time step. Autoregressive multi-step rollout.

GenCast (DeepMind, 2024): Diffusion model generating an ensemble of weather trajectories. Produces probabilistic forecasts natively rather than as a post-processing step. Beats ENS on 97.2% of targets.

NeuralGCM (Google, 2024): Hybrid physics+ML. Retains the dynamical core (atmospheric fluid equations) and replaces parameterized sub-grid processes (convection, radiation, turbulence) with learned neural networks. Principled physics conservation.

Aurora (Microsoft, 2025): Foundation model approach. 1.3B parameters, pretrained on diverse atmospheric data, fine-tuned for weather, climate, and air quality. Generalizes across tasks.


The Nonlinear Demand Landscape

The weather-demand relationship is not a line. It is a landscape—nonlinear, discontinuous, context-dependent, and high-dimensional. The same temperature means different things in different markets. The same market responds differently in different seasons. Rule systems fail not by small margins but by orders of magnitude, because the structure they are trying to compress is computationally irreducible.

Interactive

Same temperature, opposite meaning. Compare how 85°F affects behavior in Phoenix versus Houston.

Phase Transitions in Demand

At 32°F, water becomes ice and road salt demand goes vertical. UV index crosses a sunscreen-saturation threshold. Rainfall triggers umbrella purchases in a step function, not a gradient. The critical insight is that anomaly matters more than absolute value: 60°F in February moves more inventory than 90°F in August, because what people respond to is departure from expectation. Cough medicine demand peaks three to four weeks after a cold snap—a lagged, nonlinear relationship invisible to any contemporaneous model. Hot sauce sales correlate with UV index, not temperature. A 2025 study found that the temperature-consumption relationship follows an inverted U, peaking at approximately 28.6°C (83.5°F)—above which further warming suppresses demand.

Interactive

Demand does not respond linearly to weather. Explore the discontinuities, thresholds, and phase transitions in the response surface.

Ashby’s Law: Why Rule Systems Fail

W. Ross Ashby proved in 1956, in Part Three of An Introduction to Cybernetics, that a controller must have at least as much variety as the system it controls. “Only variety can absorb variety”—a result that corresponds to Shannon’s Theorem 10. Consider what this means for weather-demand modeling. A typical retail operation manages perhaps 200 weather rules. The actual system generates millions of distinct weather-demand states across products, markets, seasons, and their interactions. The variety gap is orders of magnitude. This was proved in 1956. The retail industry has been running insufficient-variety controllers ever since.
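The variety gap is easy to make concrete with back-of-envelope arithmetic. The magnitudes below are illustrative assumptions, not figures from the studies cited:

```python
# Ashby's variety gap, with illustrative (made-up) magnitudes: a rule book
# versus the state space it is supposed to control.
products, markets, seasons, weather_bins = 500, 210, 4, 12
demand_states = products * markets * seasons * weather_bins
rules = 200                      # a generously sized rule-based system
print(demand_states)             # 5,040,000 distinct weather-demand states
print(demand_states // rules)    # each rule must absorb ~25,200 states
```

Even before interactions between variables, the controller is four orders of magnitude short of the variety it needs to absorb.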

Interactive

Ashby’s Law in practice: compare the variety of a rule-based weather system against the variety of actual demand states.

These demand surfaces share a property with Wolfram’s Rule 30: they are computationally irreducible. You cannot shortcut the computation. The fine structure can be discovered by a model—but it cannot be compressed into a set of rules. This is why rule-based weather adjustments fail. They are attempting to compress a surface that resists compression.

Interactive

The fine structure of weather-demand relationships cannot be compressed into rules. Let the hidden pattern emerge.

Why Averaging Across Markets Destroys Information

Ole Peters demonstrated in a 2019 Nature Physics paper that for multiplicative processes, the ensemble average—the expected value across many parallel instances—diverges from the time average—the expected outcome for a single instance over time. Consider a multiplicative coin flip: 50% chance of gaining 50%, 50% chance of losing 40%. The ensemble average grows at +5% per flip. The time average shrinks at a geometric rate of −5.1%. A single player almost surely goes broke. Two markets with identical average temperatures but different weather volatility have fundamentally different demand dynamics—even when their averages look the same. Climate change shifts variance, not just the mean. This is a mathematical fact with important implications for understanding demand dynamics in individual markets.
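Peters’ coin flip takes five lines to verify, analytically and by simulation:

```python
import numpy as np

# Peters' multiplicative coin flip: +50% or -40% with equal probability.
up, down = 1.5, 0.6

ensemble_growth = 0.5 * up + 0.5 * down    # expected multiplier per flip
time_growth = np.sqrt(up * down)           # geometric mean per flip
print(round(ensemble_growth, 2))           # 1.05: ensemble grows +5% per flip
print(round(time_growth, 4))               # 0.9487: one player loses ~5.1%/flip

# Simulate a single player for 2,000 flips: the time average, not the
# ensemble average, is what a single trajectory actually experiences.
rng = np.random.default_rng(0)
wealth = np.prod(rng.choice([up, down], size=2000))
print(wealth < 1)                          # the player ends below where they started
```

The gap between 1.05 and 0.9487 is the whole point: averaging across parallel markets answers a question no individual market ever faces.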

Interactive

The ensemble average and the time average diverge for multiplicative processes. Watch a single player go broke while the expected value grows.

What the surface actually looks like: Portland shows a +31% demand response to the same temperature anomaly that produces +8% in Phoenix. Temperature, humidity, anomaly from local baseline, product category, day of week, seasonal position—these variables interact to produce a landscape with cliffs, plateaus, and valleys. Every retail transaction is a sensor reading of human behavior under specific environmental conditions. The network of stores across a country constitutes the largest continuous sensor network for human behavioral response to weather ever assembled. Commerce is not only an economic system. It is a measurement system.

Interactive

The full weather-demand response surface: a landscape of cliffs, plateaus, and valleys that no lookup table can capture.

Ergodicity economics: the mathematics

For a multiplicative process X(t+1) = X(t) · r, where r is drawn from a distribution, the ensemble average growth rate is E[r] while the time-average growth rate is exp(E[ln(r)]). By Jensen’s inequality, for any non-degenerate distribution, E[ln(r)] < ln(E[r]). The more volatile the multiplier, the larger the gap. The Kelly criterion (1956)—maximizing the expected logarithm of wealth—emerges not as an assumption about risk preferences but as a consequence of multiplicative dynamics. Log utility is not a behavioral assumption; it is a mathematical result.


Why Estimating Together Beats Estimating Alone

An outdoor apparel startup has two stores and 90 days of data, and the demand surface from the previous section has cliffs and valleys. Ninety data points are not enough to map it. Individual estimates are noise. The startup faces the cold-start problem, and it is not alone: every new brand, every new product, every new market begins with insufficient data.

Stein’s Paradox: Joint Estimation Beats Individual Estimation

In 1956, Charles Stein presented a result at the Third Berkeley Symposium that was met with disbelief. Estimating three or more quantities together—even if the quantities are completely unrelated—produces lower total expected squared error than estimating each one separately. Baseball batting averages, rainfall in Zurich, the price of butter in Cleveland: estimate them jointly, and the expected total squared error decreases. Maximum likelihood estimation—the workhorse of statistics—is inadmissible under squared-error loss in dimension three and above. James and Stein (1961) provided the explicit dominating estimator. Efron and Morris (1975) quantified the effect using baseball data: total squared prediction error dropped from 0.077 (individual estimates) to 0.022 (shrinkage estimates)—a 71% reduction. The 1977 Scientific American article was written specifically because the mathematical community found the result so shocking. Many tried to find the error in the proof. There is no error.

Imagine being told that knowing the price of butter in Cleveland helps you estimate baseball batting averages. It sounds insane. It is mathematically proven.

Interactive

Stein’s paradox made tangible: adjust the shrinkage parameter and watch total estimation error decrease as individual estimates move toward the group.

The mechanism is hierarchical Bayesian estimation with empirical Bayes shrinkage (Robbins, 1956). Information about the population constrains individual estimates toward more plausible values. The noisier your individual estimate, the more you borrow from the group. As your own data accumulates, the shrinkage relaxes—your estimate moves from the population average toward your own signal. Gelman and Hill (2006) unified the pedagogy of these multilevel models.

Solving the Cold-Start Problem with Partial Pooling

Return to the outdoor apparel startup. It joins a network of 200 brands. On its first hot weekend, it inherits the temperature-demand pattern from its product category through partial pooling. After six months, its own signal has diverged—this brand has an anomalous temperature response, perhaps because its customers skew toward a demographic that behaves differently in heat. The shrinkage relaxes. But information flows in both directions: the startup’s anomalous pattern improves the population posterior for all future entrants. Each new participant improves every existing estimate. The improvement follows a logarithmic curve: each new participant helps less than the last, but the improvement never stops.
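The shrinkage mechanics fit in a few lines. A normal-normal sketch of partial pooling—the variances and the 0.31/0.10 response numbers are illustrative assumptions, not estimates from any real network:

```python
def pooled_estimate(own_mean, n_own, sigma2, pop_mean, tau2):
    """Normal-normal partial pooling: a precision-weighted blend of a brand's
    own sample mean and the population (category) mean."""
    if n_own == 0:
        return pop_mean  # cold start: inherit the category prior outright
    w = (n_own / sigma2) / (n_own / sigma2 + 1.0 / tau2)  # weight on own data
    return w * own_mean + (1.0 - w) * pop_mean

pop_mean = 0.10            # category-level demand response to a temperature anomaly
own_mean = 0.31            # this brand's noisy individual estimate
sigma2, tau2 = 0.04, 0.01  # per-observation noise; between-brand variance

# As the brand's own data accumulates, the estimate walks from the
# population mean toward the brand's own signal.
for n in (0, 5, 30, 180):
    print(n, round(pooled_estimate(own_mean, n, sigma2, pop_mean, tau2), 3))
```

At `n_own = 0` the estimate is the category mean; by day 180 it has nearly reached the brand's own signal, exactly the "shrinkage relaxes" behavior described above.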

Interactive

From noise to signal: watch a new brand’s weather-demand estimate improve from day one to day 360 as data accumulates and pooling information flows.

There are natural boundaries. Sunscreen and ski equipment cannot be naively pooled—they have opposite temperature responses. Automatic shrinkage reduction activates when within-group variation exceeds between-group variation. Heavy-tailed priors accommodate outlier brands. Hard constraints prevent the group from overriding strong individual signals. The hierarchy is not a straitjacket; it is a prior that yields to data.

Interactive

Each new participant improves every existing estimate. Information compounds across the network.

The James-Stein estimator

For p ≥ 3 independent normal means with known equal variance σ², the James-Stein estimator is: θ̂_JS = (1 - (p-2)σ² / ||X||²) · X, where X is the vector of observed means and ||X||² is its squared norm. The shrinkage factor (p-2)σ²/||X||² pulls estimates toward zero (or any chosen target). The estimator dominates the MLE uniformly—for every possible true value of θ, the expected squared error is smaller. This is not a Bayesian result requiring a prior; it is a frequentist result holding for all θ.
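A quick simulation makes the dominance visible. This is a sketch, not the Efron-Morris baseball data: the true means, noise level, and replication count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)

def james_stein(x, sigma2=1.0):
    """Shrink the vector of observed means toward zero (p >= 3, known variance)."""
    return (1.0 - (x.size - 2) * sigma2 / np.sum(x**2)) * x

p, reps = 10, 20_000
theta = rng.normal(0.0, 1.0, size=p)               # fixed, unknown true means

X = theta + rng.normal(0.0, 1.0, size=(reps, p))   # one noisy observation per rep
mle_risk = np.mean(np.sum((X - theta) ** 2, axis=1))
js = np.array([james_stein(x) for x in X])
js_risk = np.mean(np.sum((js - theta) ** 2, axis=1))

print(mle_risk, js_risk)  # MLE risk is ~p = 10; James-Stein comes in strictly lower
```

In practice one uses the positive-part variant, clipping the shrinkage factor at zero, which improves further on the plain estimator above.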


When Prediction Meets Causal Inference

A Century of Parallel Development

The reader has now encountered both intellectual traditions. One produces forecasts of extraordinary accuracy. The other extracts causal effects from observational data. Each developed over roughly a century. Each hit a wall the other had already solved.

The prediction tradition runs from Galton (1886, regression to the mean) through exponential smoothing (1956–1960), Box-Jenkins ARIMA (1970), the M-Competitions (1982–2024), the deep learning revolution—backpropagation, LSTMs, Transformers, scaling laws—neural forecasting (DeepAR, N-BEATS, TFT, PatchTST), and into the current foundation model era (Chronos, TimesFM, Moirai, FlowState). This tradition can predict. It cannot attribute.

The identification tradition runs from Wright (1928, Appendix B) through the Cowles Commission, Pearl’s structural causal models, Leamer’s credibility crisis (1983), the natural experiments revolution (1991–1994), Nobel 2021, and into double machine learning (Chernozhukov et al., 2018), causal forests (Wager and Athey, 2018), difference-in-differences with continuous treatment (Callaway, Goodman-Bacon, and Sant’Anna, 2025), and sensitivity analysis (Cinelli and Hazlett, 2020). This tradition can attribute. It needed prediction’s machinery to handle the complexity.

Interactive

A century of parallel development: the prediction tradition and the identification tradition, converging for the first time. Scroll to explore.

Why Weather and Demand Are the Convergence Point

Weather provides the exogenous variation that causal inference requires. Demand provides the high-dimensional surface that prediction was built for. The domain requires both; neither alone suffices. And the tools have only just become powerful enough. AI weather forecasting went from paper to operational deployment in 36 months. Time series foundation models now offer multiple tokenization strategies—quantization, patching, continuous state space models—across architectures ranging from 9.1 million to 710 million parameters. Double machine learning and causal forests matured around 2018; Wager’s textbook unified the pedagogy in 2024. Difference-in-differences with continuous treatment, published in 2025, handles weather’s continuous nature directly. These capabilities existed separately for years. The synthesis that combines them does not yet exist—but for the first time, it could.

The integration works the way a doctor reasons about a medicine. A doctor predicts a patient will recover. But the doctor also needs to know: was it the medicine, or would the patient have recovered anyway? That is identification. A prediction model says “demand will be X.” The causal step says “of that X, weather contributed Y—and here is why we believe that, with these confidence intervals and these sensitivity bounds.”

What the Synthesis Actually Looks Like

Imagine a retailer asks a concrete question: how much of last month’s 12% sales increase in the Southeast was caused by weather? The prediction tradition alone would say: we forecast 8% growth, actual was 12%, so the residual is 4 percentage points. But that residual could be anything—a competitor’s stockout, a viral TikTok, a pricing change. The identification tradition alone would say: we can estimate the local average treatment effect of temperature on sales using weather as an instrument. But with a simple linear IV, the estimate ignores the nonlinear surface—the phase transitions, the anomaly effects, the category interactions. Neither tradition, alone, can give a credible answer.

The synthesis combines them. A foundation model captures the demand surface—the full nonlinear, high-dimensional relationship between weather and purchases. Double machine learning separates prediction from identification: the model predicts and removes everything it can explain (day-of-week effects, trend, seasonality, promotions), then estimates the causal effect of weather on the residual, orthogonalizing from both sides. Causal forests discover that this effect varies—the Southeast responded differently from the Northwest, categories with outdoor exposure responded differently from shelf-stable goods. Hierarchical estimation pools information across hundreds of brands, giving even the retailer’s newest product line a usable estimate from day one. And conformal prediction wraps the answer in honest intervals: “weather caused between 2.8 and 5.1 percentage points of that increase, at 90% coverage, with no distributional assumptions.”
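The orthogonalization step at the heart of double machine learning can be sketched in a few lines. Everything here is simulated and illustrative: the quadratic feature map stands in for whatever flexible learner the system actually uses, and two-fold cross-fitting keeps the nuisance fits honest.

```python
import numpy as np

rng = np.random.default_rng(7)
n, theta_true = 4000, 1.0

# Simulated confounded data: covariates X drive both weather exposure W and demand Y.
X = rng.normal(size=(n, 5))
W = X @ np.array([0.5, -0.3, 0.2, 0.0, 0.1]) + 0.5 * X[:, 0] ** 2 + rng.normal(size=n)
Y = theta_true * W + X @ np.array([0.4, 0.2, -0.5, 0.3, 0.0]) + X[:, 1] ** 2 + rng.normal(size=n)

Phi = np.hstack([X, X**2])  # feature map standing in for a flexible ML learner

def ridge_fit_predict(Ztr, ytr, Zte, lam=1e-3):
    beta = np.linalg.solve(Ztr.T @ Ztr + lam * np.eye(Ztr.shape[1]), Ztr.T @ ytr)
    return Zte @ beta

# Two-fold cross-fitting: nuisances are fit on one half, residualized on the other.
half = n // 2
folds = [(slice(0, half), slice(half, n)), (slice(half, n), slice(0, half))]
W_res, Y_res = np.empty(n), np.empty(n)
for train, test in folds:
    W_res[test] = W[test] - ridge_fit_predict(Phi[train], W[train], Phi[test])
    Y_res[test] = Y[test] - ridge_fit_predict(Phi[train], Y[train], Phi[test])

theta_hat = np.sum(W_res * Y_res) / np.sum(W_res**2)  # residual-on-residual OLS
print(theta_hat)  # close to the true effect of 1.0
```

Predicting W and Y from X separately, then regressing residual on residual, is the orthogonalization the text describes: the nuisance models absorb everything the covariates can explain, and only the exogenous variation in W identifies the effect.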

That is one credible number, extracted from a system with hundreds of interacting variables, bounded by confidence intervals, and accompanied by an honest accounting of what remains unknown.
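The interval-wrapping step, split conformal prediction, is strikingly simple. A sketch on simulated data—the least-squares predictor is a stand-in for any point forecaster:

```python
import numpy as np

rng = np.random.default_rng(1)

n_train, n_cal, n_test = 1000, 1000, 1000

def draw(n):
    x = rng.uniform(-2, 2, size=(n, 1))
    y = 2.0 * x[:, 0] + rng.normal(scale=0.5, size=n)
    return np.hstack([np.ones((n, 1)), x]), y

Xtr, ytr = draw(n_train)   # fit the point predictor
Xcal, ycal = draw(n_cal)   # held-out calibration set
Xte, yte = draw(n_test)    # fresh test set

beta = np.linalg.lstsq(Xtr, ytr, rcond=None)[0]

# Split conformal: the (1 - alpha) quantile of calibration residuals
# gives a distribution-free interval half-width.
alpha = 0.10
scores = np.abs(ycal - Xcal @ beta)
k = int(np.ceil((n_cal + 1) * (1 - alpha)))
q = np.sort(scores)[k - 1]

coverage = np.mean(np.abs(yte - Xte @ beta) <= q)
print(q, coverage)  # empirical coverage is ~0.90 by construction
```

The guarantee is marginal coverage with no distributional assumptions on the data or the model; that is what "honest intervals" means in the paragraph above.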

Emergent Structure at Scale

What happens when this synthesis operates across thousands of brands and markets simultaneously? Two phenomena from deep learning suggest the answer. The first is grokking (Power et al., 2022): models trained on algorithmic tasks memorize their training data for thousands of optimization steps and then suddenly, discontinuously generalize—a phase transition from brute-force interpolation to genuine structure discovery. The mechanism is linked to weight decay slowly eroding memorization circuits until generalizing circuits dominate. The second is double descent (Belkin et al., 2019): classical wisdom holds that more parameters mean more overfitting. In practice, beyond the interpolation threshold, error falls again with massive overparameterization—the model discovers cross-market transfer, where patterns learned in one context generalize to others.

These are not just theoretical curiosities. They predict something specific about a system that combines causal identification with large-scale prediction: given enough brands, enough markets, enough weather variation, structure emerges that no individual dataset could reveal. A temperature anomaly in Portland teaches the model something about Portland, but also something about every market with similar climate patterns. A seasonal phase transition in sunscreen demand sharpens the estimate for every outdoor-exposure category. The network effect from Section 4—Stein’s paradox, hierarchical pooling—combines with the causal identification from Section 2 to produce a system where each new participant improves every existing causal estimate.

Interactive

Grokking: watch a model memorize its training data for thousands of steps, then suddenly discover generalizable structure.

Interactive

Classical wisdom says more parameters mean more overfitting. Beyond the interpolation threshold, error falls again.

Why the Convergence Happens at Weather’s Scale

There is a principled reason the convergence happens at the level of weather rather than, say, molecular dynamics. Hoel, Albantakis, and Tononi demonstrated in a 2013 PNAS paper that coarse-grained descriptions of a system can carry more causal information than fine-grained ones. Temperature is not merely a convenient shorthand for molecular kinetics. It is a causally emergent variable—the macro-level description has more effective information than the micro. The 2025 revision of this work addresses criticisms of the original measure, though substantive objections remain open. But the core formal result is robust: higher-level descriptions can be strictly more causally informative. Weather is the right level of description because it is the level at which causal influence on human behavior is maximized. The atmosphere’s molecular state carries less information about demand than its temperature, humidity, and wind.

Climate change makes this more urgent, not less. As weather distributions shift, the natural experiments intensify—unprecedented temperature anomalies, novel precipitation patterns, weather sequences outside historical training distributions. This is simultaneously a threat (historical models may not generalize) and an opportunity (new variation provides new identification). A system built on causal foundations, not just correlational ones, has a structural advantage: it can detect when its estimates become unreliable, because the sensitivity analysis framework tells it when confounders grow too strong or the exclusion restriction becomes implausible.

An Honest Accounting

No system built on these ideas will work everywhere. Categories with negligible weather sensitivity—software, digital media, financial instruments—are outside the domain. Markets with fewer than 60 days of data and no pooling partners produce estimates too uncertain to act on. Extreme weather events beyond the training distribution require epistemic humility: the model should say “I don’t know” rather than extrapolate. And if the climate baseline is shifting, the question of how long a historical window remains valid has no settled answer. These are not disclaimers. They are the boundary conditions of the mathematics, and a system worth building should encode them explicitly—flagging when it operates near the edge of its competence rather than pretending omniscience.

The Argument for Building It

This essay has traced two intellectual traditions across a century. One, the prediction tradition, runs from Galton through the Perceptron and backpropagation, LSTMs and Transformers, scaling laws and foundation models that can forecast zero-shot on unseen time series. The other, the identification tradition, runs from Wright’s 1928 appendix through double machine learning and causal forests that can estimate heterogeneous treatment effects with valid confidence intervals. Each hit a wall the other had already solved. The domain where they converge is weather and commercial demand—a domain where exogenous variation is abundant, the causal surface is rich enough to reward sophisticated modeling, and the economic stakes justify the effort.

The synthesis described in this essay—AI weather forecasts as instruments, foundation model architectures for the nonlinear demand surface, double machine learning for identification, causal forests for heterogeneity, hierarchical pooling across a network of participants, conformal prediction for honest intervals—is not hypothetical. Every component is published, peer-reviewed, and has production-ready implementations. What does not yet exist is the system that combines them: a model that ingests probabilistic weather forecasts as instrumental variables, uses them to identify causal effects across a high-dimensional demand surface, pools information hierarchically across hundreds of brands and thousands of markets, and wraps every estimate in distribution-free confidence intervals.

The mathematical argument says it should be built. The tools exist. The data exists. The AI weather revolution has compressed what was once a decade of incremental progress into 36 months of transformative change. The remaining work is engineering—the disciplined assembly of capabilities that have never been combined.

If this problem interests you—as a researcher, a practitioner, or a skeptic—we would like to hear from you.

Difference-in-differences with continuous treatment

Classical difference-in-differences compares treated and control groups before and after a binary treatment. Callaway, Goodman-Bacon, and Sant’Anna (2025) extend this to continuous treatments—directly relevant for weather, which does not turn on and off but varies continuously in intensity. The key identification assumption shifts from parallel trends (binary) to a dose-response function with a common trend assumption across doses. This allows estimation of the causal effect of each additional degree of temperature, each additional millimeter of rainfall, at each point along the distribution.
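A deliberately stylized two-period sketch shows the basic identification move: first-difference away the unit effects, then read the average causal response off the dose gradient. This is not the Callaway, Goodman-Bacon, and Sant'Anna estimator itself (which handles staggered timing and dose-specific trends); all numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000

# Each unit receives a continuous "dose" in period 2 (e.g., degrees of anomaly).
dose = rng.uniform(0, 10, size=n)
unit_fe = rng.normal(size=n)    # time-invariant unit heterogeneity
trend = 0.7                     # common trend shared across all doses
beta_per_degree = 0.25          # true causal effect per unit of dose

y1 = unit_fe + rng.normal(scale=0.5, size=n)
y2 = unit_fe + trend + beta_per_degree * dose + rng.normal(scale=0.5, size=n)

# First-differencing removes the unit effects; under a common trend across
# doses, the slope of the outcome change on dose is the average causal response.
dy = y2 - y1
D = np.column_stack([np.ones(n), dose])
intercept, slope = np.linalg.lstsq(D, dy, rcond=None)[0]
print(intercept, slope)  # intercept recovers the trend; slope recovers ~0.25
```

The fragile part, which this simulation assumes away, is that units at different doses would have trended identically absent treatment; that is the assumption the paper scrutinizes.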

Grokking and double descent: the mathematical details

Grokking (Power et al., 2022): On algorithmic datasets (modular arithmetic, group operations), models achieve 100% training accuracy early but near-random test accuracy for thousands of additional steps. Then test accuracy jumps to near-perfect, often within a narrow window of training steps. The mechanism appears linked to weight decay: regularization slowly erodes the memorization circuits, eventually allowing generalizing circuits to dominate.

Double descent (Belkin et al., 2019): The classical bias-variance tradeoff predicts that increasing model complexity beyond a certain point increases test error. In practice, there are three regimes: (1) classical regime (underfitting to good fit), (2) interpolation threshold (model has exactly enough capacity to memorize training data—test error peaks), (3) modern interpolating regime (further increasing capacity decreases test error). Nakkiran et al. (2020) documented three variants: model-wise (more parameters), epoch-wise (more training), and sample-wise (more data can temporarily hurt).

Causal emergence: effective information

Hoel et al. (2013) define effective information (EI) as the mutual information between the current state and next state of a system when the input distribution is set to its maximum entropy. A macro-level description (e.g., temperature) can have higher EI than the corresponding micro-level description (molecular kinetics) because coarse-graining removes degeneracy—multiple micro-states that produce identical macro-transitions. Critics (Dewhurst 2021, Eberhardt & Lee 2022) argue that the result depends on the specific choice of coarse-graining and the maximum entropy input assumption. Hoel’s 2025 revision (“Causal Emergence 2.0”) addresses these concerns with a refined measure.
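The measure is simple enough to compute directly. Below is a toy system in the spirit of Hoel's examples—this particular 8-state construction is illustrative, not the paper's exact figure:

```python
import numpy as np

def effective_information(T):
    """EI = I(X; Y) with X uniform (maximum entropy) and Y | X = x ~ T[x]."""
    px = np.full(T.shape[0], 1.0 / T.shape[0])
    py = px @ T  # marginal distribution over next states
    logs = np.where(T > 0, np.log2(np.where(T > 0, T, 1.0) / py), 0.0)
    return float(np.sum(px[:, None] * T * logs))

# Micro system: 7 "blob" states transition uniformly among themselves (pure
# degeneracy/noise); state 7 maps to itself deterministically.
T_micro = np.zeros((8, 8))
T_micro[:7, :7] = 1.0 / 7.0
T_micro[7, 7] = 1.0

# Macro coarse-graining {0..6} -> blob, {7} -> solo: both transitions deterministic.
T_macro = np.eye(2)

ei_micro = effective_information(T_micro)  # ~0.54 bits
ei_macro = effective_information(T_macro)  # 1.0 bit: the macro beats the micro
print(ei_micro, ei_macro)
```

Coarse-graining collapses the seven degenerate micro-states into one macro-state whose transition is deterministic, so the macro description carries strictly more effective information, which is the formal content of "macro can beat micro."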


nate@schmiedehaus.com


References

Prediction Tradition

  1. Galton, F. (1886). “Regression towards Mediocrity in Hereditary Stature.” Journal of the Anthropological Institute, 15, 246–263.
  2. Box, G.E.P. & Jenkins, G.M. (1970). Time Series Analysis: Forecasting and Control. Holden-Day.
  3. Makridakis, S. et al. (1982). “The accuracy of extrapolation (time series) methods: Results of a forecasting competition.” Journal of Forecasting, 1(2), 111–153.
  4. Makridakis, S. et al. (1993). “The M2-Competition.” International Journal of Forecasting, 9(1), 5–22.
  5. Makridakis, S. & Hibon, M. (2000). “The M3-Competition.” International Journal of Forecasting, 16(4), 451–476.
  6. Makridakis, S. et al. (2020). “The M4 Competition.” International Journal of Forecasting, 36(1), 54–74.
  7. Makridakis, S. et al. (2022). “M5 accuracy competition.” International Journal of Forecasting, 38(4), 1346–1364.
  8. Makridakis, S. et al. (2024). “The M6 financial forecasting competition.” International Journal of Forecasting.
  9. Rosenblatt, F. (1958). “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain.” Psychological Review, 65(6), 386–408.
  10. Rumelhart, D.E., Hinton, G.E. & Williams, R.J. (1986). “Learning representations by back-propagating errors.” Nature, 323, 533–536.
  11. Hochreiter, S. & Schmidhuber, J. (1997). “Long Short-Term Memory.” Neural Computation, 9(8), 1735–1780.
  12. Krizhevsky, A., Sutskever, I. & Hinton, G.E. (2012). “ImageNet Classification with Deep Convolutional Neural Networks.” NeurIPS 2012.
  13. Bahdanau, D., Cho, K. & Bengio, Y. (2014). “Neural Machine Translation by Jointly Learning to Align and Translate.” arXiv:1409.0473.
  14. Vaswani, A. et al. (2017). “Attention Is All You Need.” NeurIPS 2017.
  15. Kaplan, J. et al. (2020). “Scaling Laws for Neural Language Models.” arXiv:2001.08361.
  16. Salinas, D. et al. (2020). “DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks.” International Journal of Forecasting, 36(3), 1181–1191.
  17. Oreshkin, B.N. et al. (2020). “N-BEATS: Neural Basis Expansion Analysis for Interpretable Time Series Forecasting.” ICLR 2020.
  18. Lim, B. et al. (2021). “Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting.” International Journal of Forecasting, 37(4), 1748–1764.
  19. Zeng, A. et al. (2023). “Are Transformers Effective for Time Series Forecasting?” AAAI 2023.
  20. Nie, Y. et al. (2023). “A Time Series is Worth 64 Words: Long-term Forecasting with Transformers.” ICLR 2023.
  21. Ansari, A.F. et al. (2024). “Chronos: Learning the Language of Time Series.” arXiv:2403.07815.
  22. Ansari, A.F. et al. (2025). “Chronos-2: From Univariate to Universal Forecasting.” arXiv:2510.15821.
  23. Das, A. et al. (2024). “A decoder-only foundation model for time-series forecasting.” (TimesFM) ICML 2024.
  24. Woo, G. et al. (2024). “Unified Training of Universal Time Series Forecasting Transformers.” (Moirai) ICML 2024.
  25. Woo, G. et al. (2025). “Moirai 2.0.” arXiv:2511.11698.
  26. Schmidt, N. et al. (2025). “FlowState: Sampling Rate Invariant Time Series Forecasting.” NeurIPS 2025.
  27. Garza, A. & Challu, C. (2023). “TimeGPT-1.” arXiv:2310.03589.
  28. Power, A. et al. (2022). “Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets.” arXiv:2201.02177.
  29. Belkin, M. et al. (2019). “Reconciling modern machine learning practice and the bias-variance trade-off.” PNAS, 116(32), 15849–15854.
  30. Nakkiran, P. et al. (2021). “Deep Double Descent.” Journal of Statistical Mechanics.

Identification Tradition

  1. Wright, P.G. (1928). The Tariff on Animal and Vegetable Oils. Macmillan.
  2. Stock, J.H. & Trebbi, F. (2003). “Who Invented Instrumental Variable Regression?” Journal of Economic Perspectives, 17(3).
  3. Leamer, E.E. (1983). “Let’s Take the Con out of Econometrics.” American Economic Review, 73(1), 31–43.
  4. Angrist, J.D. & Krueger, A.B. (1991). “Does Compulsory School Attendance Affect Schooling and Earnings?” Quarterly Journal of Economics.
  5. Card, D. & Krueger, A.B. (1994). “Minimum Wages and Employment.” American Economic Review, 84(4), 772–793.
  6. Imbens, G.W. & Angrist, J.D. (1994). “Identification and Estimation of Local Average Treatment Effects.” Econometrica, 62(2), 467–476.
  7. Pearl, J. (2009). Causality: Models, Reasoning, and Inference. 2nd ed. Cambridge University Press.
  8. Chernozhukov, V. et al. (2018). “Double/Debiased Machine Learning for Treatment and Structural Parameters.” Econometrics Journal, 21(1), C1–C68.
  9. Wager, S. & Athey, S. (2018). “Estimation and Inference of Heterogeneous Treatment Effects using Random Forests.” JASA, 113(523), 1228–1242.
  10. Wager, S. (2024). Causal Inference: A Statistical Learning Approach.
  11. Abadie, A. & Gardeazabal, J. (2003). “The Economic Costs of Conflict.” American Economic Review, 93(1), 113–132.
  12. Abadie, A., Diamond, A. & Hainmueller, J. (2010). “Synthetic Control Methods.” JASA, 105(490), 493–505.
  13. Ben-Michael, E., Feller, A. & Rothstein, J. (2021). “The Augmented Synthetic Control Method.” JASA, 116(536), 1789–1803.
  14. Callaway, B. & Sant’Anna, P.H.C. (2021). “Difference-in-Differences with Multiple Time Periods.” Journal of Econometrics.
  15. Callaway, B., Goodman-Bacon, A. & Sant’Anna, P.H.C. (2025). “Difference-in-Differences with a Continuous Treatment.”
  16. Cinelli, C. & Hazlett, C. (2020). “Making Sense of Sensitivity.” Journal of the Royal Statistical Society: Series B.
  17. Conley, T.G., Hansen, C.B. & Rossi, P.E. (2012). “Plausibly Exogenous.” Review of Economics and Statistics.
  18. Kiciman, E. et al. (2023/2025). “Causal Reasoning and Large Language Models.” arXiv:2305.00050.
  19. Lei, L. & Candès, E.J. (2021). “Conformal Inference of Counterfactuals and Individual Treatment Effects.” Journal of the Royal Statistical Society: Series B, 83(5), 911–938.
  20. Chernozhukov, V., Wüthrich, K. & Zhu, Y. (2021). “An Exact and Robust Conformal Inference Method for Counterfactual and Synthetic Controls.” JASA, 116(536).

Complexity Science

  1. Ashby, W.R. (1956). An Introduction to Cybernetics. Chapman and Hall.
  2. Peters, O. (2019). “The ergodicity problem in economics.” Nature Physics, 15, 1216–1221.
  3. Hoel, E.P., Albantakis, L. & Tononi, G. (2013). “Quantifying causal emergence shows that macro can beat micro.” PNAS, 110(49), 19790–19795.
  4. Hoel, E.P. (2017). “When the Map Is Better Than the Territory.” Entropy, 19(5), 188.
  5. Hoel, E.P. (2025). “Causal Emergence 2.0.” arXiv:2503.13395.
  6. Lorenz, E.N. (1963). “Deterministic Nonperiodic Flow.” Journal of the Atmospheric Sciences, 20, 130–141.
  7. Richardson, L.F. (1922). Weather Prediction by Numerical Process. Cambridge University Press.
  8. Charney, J.G., Fjörtoft, R. & von Neumann, J. (1950). “Numerical Integration of the Barotropic Vorticity Equation.” Tellus.

Weather-Demand Evidence

  1. Dell, M., Jones, B.F. & Olken, B.A. (2014). “What Do We Learn from the Weather? The New Climate-Economy Literature.” Journal of Economic Literature, 52(3), 740–798.
  2. Mellon, J. (2025). “Rain, Rain, Go Away: 194 Potential Exclusion-Restriction Violations for Studies Using Weather as an Instrumental Variable.” American Journal of Political Science.
  3. Busse, M.R. et al. (2015). “The Psychological Effect of Weather on Car Purchases.” Quarterly Journal of Economics, 130(1), 371–414.
  4. Lazo, J.K. et al. (2011). “U.S. Economic Sensitivity to Weather Variability.” Bulletin of the American Meteorological Society, 92(6), 709–720.
  5. Roth Tran, B. (2023). “Sellin’ in the Rain: Weather, Climate, and Retail Sales.” Management Science, 69(12), 7423–7447.

AI Weather Models

  1. Pathak, J. et al. (2022). “FourCastNet: A Global Data-driven High-resolution Weather Forecasting Model.” arXiv:2202.11214.
  2. Bi, K. et al. (2023). “Accurate medium-range global weather forecasting with 3D neural networks.” Nature, 619, 533–538.
  3. Lam, R. et al. (2023). “Learning skillful medium-range global weather forecasting.” Science, 382(6677), 1416–1421.
  4. Price, I. et al. (2024). “Probabilistic weather forecasting with machine learning.” Nature, 636, 84–90.
  5. Kochkov, D. et al. (2024). “Neural general circulation models for weather and climate.” Nature, 632, 1060–1066.
  6. Bodnar, C. et al. (2024/2025). “Aurora: A Foundation Model of the Atmosphere.” Nature.

Statistical Estimation

  1. Stein, C. (1956). “Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal Distribution.” Proceedings of the Third Berkeley Symposium.
  2. James, W. & Stein, C. (1961). “Estimation with Quadratic Loss.” Proceedings of the Fourth Berkeley Symposium.
  3. Efron, B. & Morris, C. (1975). “Data Analysis Using Stein’s Estimator and its Generalizations.” JASA, 70(350), 311–319.
  4. Efron, B. & Morris, C. (1977). “Stein’s Paradox in Statistics.” Scientific American, 236(5), 119–127.
  5. Robbins, H. (1956). “An Empirical Bayes Approach to Statistics.” Proceedings of the Third Berkeley Symposium.
  6. Gelman, A. & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.