Working Paper · February 2026

Why does ice cream sell better in the rain?

A 3-rule system covers 0.000016% of the weather-demand parameter space. Here’s why -- and what happens when you add the other 99.999984%.

Nathaniel Schmiedehaus · 9 sections · 52 references
01 · Foundation

The complexity gap

Rules carry ~7.6 bits of mutual information. The demand system has ~20 bits of relevant entropy. The remaining 12.4 bits are structurally invisible -- not a matter of writing better rules.

Same temperature, opposite outcome

72°F with a light breeze: cold beverage demand hits 67 units. 72°F with 25mph wind: demand drops to 41. The thermometer reads the same. The cash register doesn’t. Now multiply that by 8 weather variables, 4 temporal derivatives, 7 product categories, and 47 markets. A human writing rules has seen 0.000016% of this space.

12% -- Temperature alone
45-60% -- Full weather tensor
~43 points -- Information left behind

Temperature alone explains roughly 12% of demand variance in weather-sensitive categories. This is the number that rules-based systems optimize against. The full weather tensor -- temperature, humidity, UV index, wind speed, barometric pressure, precipitation, cloud cover, and their temporal derivatives -- explains 45-60% of demand variance in those same categories. The gap between 12% and 55% is not noise. It is decision-relevant information that rules cannot represent.

The information-theoretic bound. Touchette & Lloyd (2000) proved that the entropy reduction achievable by any controller is bounded by the mutual information between the controller’s model and the system. A rule set with 200 rules carries at most log2(200) ≈ 7.6 bits of mutual information. The underlying demand system has roughly 20 bits of relevant entropy -- about 10⁶ distinguishable states. Rules can reduce uncertainty by 7.6 bits out of 20. The remaining 12.4 bits are information the controller cannot see.

The curse of dimensionality. Weather-demand relationships are not seven independent channels. They interact: 90°F at 30% humidity produces different demand than 90°F at 85% humidity. UV index modulates temperature effects nonlinearly. Wind speed interacts with precipitation to determine outdoor activity. With 8 weather variables, the number of pairwise interactions is C(8,2) = 28. Three-way interactions: 56. The response surface is continuous and nonlinear, but rules discretize it into a finite set of rectangular regions. Every threshold is a knife edge that reality does not respect.
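Both counting arguments can be checked in a few lines (a minimal sketch; variable names are illustrative):

```python
from math import comb, log2

# Interaction counts for 8 weather variables, as enumerated in the text
n_vars = 8
pairwise = comb(n_vars, 2)    # 28
three_way = comb(n_vars, 3)   # 56
four_way = comb(n_vars, 4)    # 70

# Capacity bound for a rule set: N rules distinguish at most N states,
# so they carry at most log2(N) bits of mutual information
rule_bits = log2(200)                      # ~7.6 bits
system_bits = 20                           # ~10^6 distinguishable states
invisible_bits = system_bits - rule_bits   # ~12.4 bits rules cannot see
```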

Three converging proofs. The result is robust because three independent traditions arrive at the same conclusion:

Shannon 1948

A channel cannot transmit more information than its capacity. Rules are a low-capacity channel between the demand system and the allocation decision.

Ashby 1956

A controller must have at least as much variety as the system it controls. 200 rules have variety 200; a million-state system requires variety ≥ 10⁶.

Touchette & Lloyd 2000

Entropy reduction is bounded by mutual information. The controller cannot reduce what it cannot model.

Mathematical detail: the information bottleneck -- Tishby et al. (2000)

Tishby, Pereira, & Bialek (2000) formalized the compression-prediction tradeoff. Given a signal and a target, any compression must choose what to keep. Rules compress for legibility -- they preserve what a human can read in a spreadsheet. Models compress for decision quality -- they preserve what actually predicts the outcome. These are different objectives with provably different optimal solutions.

Information Bottleneck Objective
min  I(X; T) - beta * I(T; Y)

where:
  X = input (full weather tensor)
  T = compressed representation (rules or model)
  Y = target (demand)
  beta = tradeoff parameter

Rules: low I(X;T) by design -> low I(T;Y) necessarily
Model: higher I(X;T) capacity -> higher I(T;Y) achievable
When three separate branches of mathematics point at the same wall, it is probably a wall. The gap between rules and reality is structural, not a matter of engineering effort.

Five stages of demand complexity

The LCDM models demand as a function of 8 weather variables: temperature, humidity, UV index, wind speed, barometric pressure, precipitation, cloud cover, and their temporal derivatives. The interaction taxonomy is combinatorially explosive: pairwise interactions number C(8,2) = 28, three-way interactions number C(8,3) = 56, and four-way interactions number C(8,4) = 70 -- totalling 154 interaction terms before considering temporal lags, geographic heterogeneity, or cross-category effects.

Rules discretize this continuous response surface into a finite set of rectangular regions. Every threshold is a knife edge that reality does not respect. The mathematical argument is that threshold rules can capture at most log₂(N) bits of mutual information, where N is the number of rules. 200 rules carry at most ~7.6 bits; the demand system has ~20 bits of relevant entropy. The remaining 12.4 bits are structurally invisible to rules.

Stage 1 -- Temperature alone (params: ~3 · interactions: 0)

Clean sigmoid. Monotonic. Intuitive. What rules optimize against.

Stage 2 -- + Humidity (params: ~12 · interactions: 1 pairwise)

Same temperature, different demand. The line becomes a band. 90°F at 30% humidity ≠ 90°F at 85% humidity.

Stage 3 -- + Wind & feels-like (params: ~45 · interactions: 3 pairwise)

Non-monotonic warps emerge. Wind chill (NWS formula) and heat index (Steadman 1979) compose into a feels-like surface that creates opposite outcomes at identical temperatures.

Stage 4 -- + Temporal lags (params: ~370 · interactions: 6 + temporal)

Path dependence. Day 1 vs Day 5 of the same weather bifurcates demand. Consecutive-day fatigue, cumulative dehydration, and behavioral adaptation create history-dependent response.

Stage 5 -- + Cross-category (params: ~18,500 · interactions: 15 + cascading)

Computationally irreducible. A particle cloud of demand trajectories. The best 3-rule approximation covers <0.001% of this parameter space.

Cross-category cannibalization

When demand for one category surges past a threshold, it cannibalizes or amplifies demand in adjacent categories. The LCDM models this with a 7×7 cannibalization weight matrix that activates when any category exceeds its surge threshold (default: 65th percentile demand index).

Cannibalization weight matrix -- 7 categories, cross-effects
Cross-Category Cannibalization Matrix
Category interactions (weight applied when source exceeds surge threshold):

  HVAC surge    --> Outdoor Rec:  -0.15  (indoor retreat suppresses outdoor)
  Cold Bev surge --> Ice Cream:   -0.12  (substitution effect)
  Allergy surge  --> Outdoor Rec: -0.12  (symptom avoidance)
  Outerwear surge -> HVAC:       +0.08  (cold-weather co-demand)
  Outdoor surge  -> Cold Bev:    +0.08  (complementary consumption)
  Outdoor surge  -> Ice Cream:   +0.06  (occasion bundling)
  Outdoor surge  -> Sunscreen:   +0.05  (activity co-occurrence)

Adjustment formula:
  adj_j = SUM_i [ w_ij * max(0, (demand_i - T) / (100 - T)) * 100 ]
  where T = surge threshold (65), w_ij = cannibalization weight
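The adjustment formula can be sketched directly from the matrix above (category keys and the function name are illustrative; weights and threshold are the defaults quoted in the text):

```python
SURGE_T = 65  # surge threshold: 65th-percentile demand index

W = {  # (source, target) -> cannibalization weight w_ij
    ("hvac", "outdoor_rec"): -0.15,
    ("cold_bev", "ice_cream"): -0.12,
    ("allergy", "outdoor_rec"): -0.12,
    ("outerwear", "hvac"): +0.08,
    ("outdoor_rec", "cold_bev"): +0.08,
    ("outdoor_rec", "ice_cream"): +0.06,
    ("outdoor_rec", "sunscreen"): +0.05,
}

def cannibalization_adjustment(demand, target):
    """adj_j = SUM_i w_ij * max(0, (demand_i - T) / (100 - T)) * 100"""
    adj = 0.0
    for (src, tgt), w in W.items():
        if tgt == target:
            excess = max(0.0, (demand.get(src, 0.0) - SURGE_T) / (100 - SURGE_T))
            adj += w * excess * 100
    return adj

# HVAC surging at a demand index of 86 suppresses outdoor recreation:
# (86-65)/35 = 0.6, so adj = -0.15 * 0.6 * 100 = -9.0 index points
```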

Anomaly detection formalism

The LCDM flags moments where model predictions diverge from rule-based intuition. Each anomaly type carries a severity score in [0, 1] and an explanation string. The anomaly detector currently identifies 8+ anomaly types:

Anomaly type catalogue -- 8+ types with severity scoring
  • Tropical rain + sunscreen (severity: demand/60): Warm rain at >70°F with high humidity produces diffuse UV burns. Sunscreen demand stays elevated despite precipitation -- contradicts every rule-based system.
  • Warm outerwear demand (severity: demand/35): High wind speed in Denver/Chicago/Boston creates windbreaker demand at 70°F+. Rules keyed on temperature alone miss this entirely.
  • Hot + low beverage demand (severity: 0.7): Extreme humidity (>80%) collapses foot traffic even at high temperatures, suppressing cold beverage sales below what temperature-only rules predict.
  • Rainy-day ice cream (severity: 0.6): Light warm rain increases ice cream demand (comfort food culture). Documented in Unilever UK/Ireland sales data.
  • Barometric jerk → allergy (severity: |d³|/4): Rapid pressure oscillations trigger migraine clusters and acute allergy episodes 36 hours before weather arrives (Kimoto et al. 2011, Cephalalgia).
  • Pre-position outerwear (severity: 0.5): Negative jerk (d³ < -1) predicts incoming cold snap; outerwear demand rises while temperature is still 70°F.
  • Moderate-temp HVAC (severity: 0.6): High |d²| (thermal whiplash) stresses HVAC systems at moderate temperatures -- building thermal mass cannot track rapid changes.
  • Regime instability (severity: |d⁴|/4): 4th-derivative snap disrupts all categories simultaneously, as consumers cannot trust the forecast.
Counterintuitive findings

Light warm rain INCREASES ice cream demand -- comfort food / rainy-day treat culture. Documented in Unilever UK/Ireland sales data. At 68°F with 0.2″ rain in Boston, ice cream demand rises 20% above the dry-day baseline.

First rain after a dry streak creates a pollen SURGE -- rain knocks pollen to breathing height. In Denver after 4 dry days at 72°F, a light shower (0.1″) spikes allergy medication demand 50% within 4-6 hours. Every rule says “rain = less outdoor activity = less allergy.” The opposite happens.

Outerwear demand rises while it’s still 70°F -- the model reads the trajectory, not the snapshot. When the 3rd derivative (jerk) turns strongly negative, the model detects incoming cold and pre-positions spend. The jerk is the pre-positioning signal that no human rule writer has ever seen.

Dewpoint-wind cross-acceleration near thunderstorms creates a “last chance” ice cream purchase window. When temperature is accelerating (d² > 0.5) and wind is decelerating (d¹ < -1), a warm front collision is approaching. People rush for ice cream before the storm hits -- an 18% demand spike in a 2-hour window invisible to any snapshot rule.

Core mathematical functions -- Steadman 1979, NWS, Hermite interpolation
Feels-Like Composite (Steadman 1979 + NWS)
Heat Index (Steadman 1979, T >= 80F):
  HI = -42.379 + 2.049*T + 10.143*RH - 0.2248*T*RH
       - 6.838e-3*T^2 - 5.482e-2*RH^2
       + 1.229e-3*T^2*RH + 8.528e-4*T*RH^2
       - 1.99e-6*T^2*RH^2

Wind Chill (NWS formula, T <= 50F, WS >= 3mph):
  WC = 35.74 + 0.6215*T - 35.75*WS^0.16 + 0.4275*T*WS^0.16

Feels-Like Composite:
  FL(T, RH, WS) =
    HI(T, RH)           if T >= 80
    WC(T, WS)           if T <= 50
    WC*(1-w) + HI*w     if 50 < T < 80, w = (T-50)/30

Smoothstep (Hermite interpolation for threshold transitions):
  S(e0, e1, x) = t^2 * (3 - 2*t)
    where t = clamp((x - e0) / (e1 - e0), 0, 1)
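A direct transcription of the feels-like composite and smoothstep into runnable form (a sketch with the rounded coefficients quoted above; function names are illustrative):

```python
def heat_index(T, RH):
    """Steadman/Rothfusz heat index, T in F (valid T >= 80), RH in %."""
    return (-42.379 + 2.049*T + 10.143*RH - 0.2248*T*RH
            - 6.838e-3*T**2 - 5.482e-2*RH**2
            + 1.229e-3*T**2*RH + 8.528e-4*T*RH**2
            - 1.99e-6*T**2*RH**2)

def wind_chill(T, WS):
    """NWS wind chill, T in F (valid T <= 50), WS in mph (valid WS >= 3)."""
    return 35.74 + 0.6215*T - 35.75*WS**0.16 + 0.4275*T*WS**0.16

def feels_like(T, RH, WS):
    """Composite: heat index above 80F, wind chill below 50F, blended between."""
    if T >= 80:
        return heat_index(T, RH)
    if T <= 50:
        return wind_chill(T, WS)
    w = (T - 50) / 30.0
    return wind_chill(T, WS) * (1 - w) + heat_index(T, RH) * w

def smoothstep(e0, e1, x):
    """Hermite threshold transition: 0 below e0, 1 above e1, smooth between."""
    t = min(max((x - e0) / (e1 - e0), 0.0), 1.0)
    return t * t * (3 - 2 * t)

# heat_index(90, 70) ~ 106F; wind_chill(30, 10) ~ 21F; smoothstep(0, 1, 0.5) = 0.5
```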
Hill Saturation Curve (Diminishing Returns)
Revenue response with saturation:
  R(demand, baseRev, spend) = (demand/100) * baseRev * spend^alpha / (spend^alpha + k^alpha)

Parameters:
  alpha = 0.65  (shape: steepness of diminishing returns)
  k = 8000      (half-saturation point in spend units)

Marginal revenue:
  dR/dSpend = (demand/100) * baseRev * alpha * spend^(alpha-1) * k^alpha / (spend^alpha + k^alpha)^2

Key insight: Lower k = faster saturation, punishing over-concentration
on "obvious" cells. The marginal dollar in a saturated category is
worth less than the marginal dollar in an unsaturated adjacent category.
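The Hill curve and its marginal can be sketched in a few lines (function names are assumptions; parameters are the defaults above):

```python
def hill_revenue(demand, base_rev, spend, alpha=0.65, k=8000.0):
    """R = (demand/100) * base_rev * spend^alpha / (spend^alpha + k^alpha)."""
    s, ka = spend ** alpha, k ** alpha
    return (demand / 100.0) * base_rev * s / (s + ka)

def hill_marginal(demand, base_rev, spend, alpha=0.65, k=8000.0):
    """dR/dSpend = (demand/100)*base_rev*alpha*spend^(alpha-1)*k^alpha / (spend^alpha + k^alpha)^2."""
    s, ka = spend ** alpha, k ** alpha
    return (demand / 100.0) * base_rev * alpha * spend ** (alpha - 1) * ka / (s + ka) ** 2

# At spend = k the curve is exactly half-saturated:
# hill_revenue(100, 1000, 8000) -> 500.0
```

At the half-saturation point k, revenue is exactly half its asymptote regardless of alpha, which is what makes k the natural knob for how quickly over-concentration gets punished.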
02 · Temporal Physics

Temporal dynamics

A snapshot is to a trajectory as a photograph is to a film. Demand responds to rate, acceleration, jerk, and snap -- four derivatives that encode path-dependent behavior invisible to any threshold rule.

The LCDM models four temporal derivatives of each weather variable. These represent the recent trajectory of conditions:

d¹ -- Rate of change

°F/day. A warming trend (d¹ = +2.5) means consumers have not yet adapted -- UV awareness is low, sunscreen demand spikes 12% above what the snapshot temperature predicts. Cooling (d¹ = -4) triggers outerwear demand before the cold arrives.

d² -- Acceleration

Accelerating warmth (d² > 0.2) at temperatures above 75°F means people have not hydrated yet -- cold beverage demand spikes. “Thermal whiplash” (high |d²|) stresses HVAC systems at moderate temperatures because building thermal mass cannot track rapid changes.

d³ -- Jerk

The rate of change of acceleration. This is where the model becomes genuinely counterintuitive. Barometric jerk (proxied by temperature jerk) triggers migraine clusters and acute allergy episodes 36 hours before the weather event arrives (Kimoto et al. 2011, Cephalalgia). Allergy/pharma demand surges when jerk is high but current weather is fine.

Temporal Derivative Profiles
Pattern          d1 (rate)   d2 (accel)   d3 (jerk)   d4 (snap)
-------          ---------   ----------   ---------   ---------
Steady State       0.0         0.0          0.0         0.0
Warming Trend     +2.5        +0.3          0.0         0.0
Cooling Snap      -4.0        -0.8         +0.15        0.0
Volatile           0.0         0.0         +2.8        -1.2
Thermal Whiplash  +1.2        -2.5         +4.5        -3.5
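The derivative chain can be estimated with simple finite differences (an illustrative sketch; the text does not specify the LCDM's exact smoothing or differencing scheme):

```python
def diffs(xs):
    return [b - a for a, b in zip(xs, xs[1:])]

def temporal_derivatives(temps):
    """Latest d1 (rate), d2 (accel), d3 (jerk), d4 (snap) from a daily series."""
    d1 = diffs(temps)
    d2 = diffs(d1)
    d3 = diffs(d2)
    d4 = diffs(d3)
    return d1[-1], d2[-1], d3[-1], d4[-1]

# A cooling trend that is itself accelerating: negative jerk
rate, accel, jerk, snap = temporal_derivatives([72, 71, 69, 65, 58])
# rate = -7 F/day, accel = -3, jerk = -1, snap = 0
```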

Pre-positioning: reading the trajectory, not the snapshot

The most striking derivative effect is outerwear pre-positioning. When the 3rd derivative is strongly negative (d³ < -1) -- meaning the rate of cooling is itself accelerating -- the model increases outerwear spend before actual cold arrives. At 70°F, this looks insane. But the model is reading the trajectory: temperature is bending colder, and the system pre-positions budget into the demand curve that has not yet materialized.

Similarly, the 4th derivative (“snap”) encodes regime instability. When snap is high, weather forecasts become unreliable. Consumers respond by not trusting forecasts either: planned outdoor purchases drop (people cancel trips they cannot rely on), while impulse purchases spike (every nice moment might be the last one this week). Ice cream sees a “panic treat” effect at high |d⁴| -- spontaneous purchasing driven by forecast uncertainty.

The 4th derivative also creates a pseudo-random noise effect in cold beverages. When snap is high, habitual purchases collapse (people do not stock up because they cannot predict their needs) while impulse purchases surge. The net demand oscillates in a pattern that is deterministic but practically unpredictable without modeling the full derivative chain.

Barometric jerk and medical demand -- Kimoto et al. 2011

Kimoto et al. (2011) demonstrated in Cephalalgia that rapid barometric pressure oscillations (which we proxy via temperature trajectory jerk) trigger migraine clusters and acute allergy episodes. The mechanism is physiological: rapid pressure changes alter sinus cavity pressure differentials, triggering histamine release and inflammatory cascades.

The demand signal precedes the weather event by 24-36 hours. When |d³| > 1.5, allergy/pharma demand increases by up to 18 index points -- visible as elevated antihistamine and pain reliever purchases even when current weather conditions are mild. For chronic allergy sufferers, sustained barometric instability (high |d⁴|) triggers preemptive medication stocking, adding approximately 8 index points per unit of |d⁴|.

The model does not predict demand from weather. It predicts demand from the trajectory of weather -- rate, acceleration, jerk, and snap. A snapshot is to a trajectory as a photograph is to a film.
03 · Spatial Heterogeneity

Geographic heterogeneity

The same temperature produces opposite demand in different markets. Demand responds to weather shocks -- deviations from local norms -- not absolute conditions.

Same temperature, opposite behavior

78°F in Seattle is a +2.8σ heatwave. Sunscreen sells out. 78°F in Phoenix is 27 degrees below baseline -- people reach for jackets. Same reading on the thermometer. Opposite behavior at the register.

Planning vs. impulse markets

Outdoor recreation demand reveals a fundamental geographic bifurcation that emerges at the 3-day mark of consecutive good weather. The bifurcation separates two behavioral regimes:

Planning Markets
Denver · Seattle · Boston

Consumers need consecutive good weather to commit to outdoor trips. Day 1-2: suppressed demand (0.7-1.0x baseline) as people wait to confirm the trend. Day 3+: logarithmic ramp as confidence builds. Demand function: D = 0.7 + 0.15n (n ≤ 2), then 1.0 + 0.12 ln(n) (n > 2)

Impulse Markets
Phoenix · Miami · Austin

Early surge then saturation. Good weather is the norm, not the exception. Day 1-2: immediate surge (0.85-1.15x). Day 3+: exponential decay as novelty wears off. Demand function: D = 0.85 + 0.15n (n ≤ 2), then 1.15 · e⁻⁰⋅¹⁵⁽ⁿ⁻²⁾ (n > 2)
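The two piecewise demand functions quoted above, in runnable form (function names are illustrative):

```python
import math

def planning_demand(n):
    """Day n of consecutive good weather, planning market (Denver/Seattle/Boston)."""
    return 0.7 + 0.15 * n if n <= 2 else 1.0 + 0.12 * math.log(n)

def impulse_demand(n):
    """Day n of consecutive good weather, impulse market (Phoenix/Miami/Austin)."""
    return 0.85 + 0.15 * n if n <= 2 else 1.15 * math.exp(-0.15 * (n - 2))

# planning_demand(2) = 1.00, then a logarithmic ramp as confidence builds
# impulse_demand(2)  = 1.15, then exponential decay as novelty wears off
```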

Geo-specific saturation thresholds

Each geography has calibrated baselines for temperature (mean μ and standard deviation σ), humidity, and wind. These are not arbitrary -- they represent the local distribution against which weather shocks are measured. Critical thresholds are geo-relative:

Geographic baselines and saturation -- 6 markets, calibrated parameters
Geographic Baseline Parameters
Market     Temp(mu)  Temp(sigma)  Humidity(mu)  Wind(mu)  Notes
------     --------  -----------  ------------  --------  -----
Phoenix     105.0     6.2          18%           6 mph    Sunscreen drops >105F (indoor retreat)
Seattle      62.0     5.7          72%          10 mph    Rain tolerance: 0.15 (highest)
Denver       65.0    12.0          35%          12 mph    Altitude: +20% cold bev (dehydration)
Miami        84.0     4.5          76%           9 mph    Narrow sigma: small shocks = big signal
Chicago      50.0    18.0          62%          15 mph    Huge sigma: continental extremes
Boston       52.0    15.0          65%          13 mph    Ice cream culture premium: 1.25x

Shock interpretation:
  78F in Seattle = (78-62)/5.7 = +2.8 sigma (major heatwave)
  78F in Phoenix = (78-105)/6.2 = -4.4 sigma (extreme cold snap)
  Same temperature, opposite demand implications.
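The shock computation is a z-score against the local baseline (a sketch using the table above; dict keys and the function name are illustrative):

```python
BASELINES = {  # market -> (temp mu, temp sigma), degrees F, from the table above
    "phoenix": (105.0, 6.2), "seattle": (62.0, 5.7), "denver": (65.0, 12.0),
    "miami": (84.0, 4.5), "chicago": (50.0, 18.0), "boston": (52.0, 15.0),
}

def temp_shock(market, temp_f):
    """Weather shock in local standard deviations, not absolute degrees."""
    mu, sigma = BASELINES[market]
    return (temp_f - mu) / sigma

# temp_shock("seattle", 78) -> +2.81 sigma (heatwave)
# temp_shock("phoenix", 78) -> -4.35 sigma (cold snap)
```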

Geo-specific saturation effects create counterintuitive demand curves. In Phoenix, sunscreen demand drops above 105°F because consumers retreat indoors to escape extreme heat -- the demand curve inverts past the threshold. The smoothstep transition from 110°F to 125°F (feels-like) subtracts up to 50 demand index points. A temperature-is-good rule that boosts sunscreen spend at 108°F in Phoenix is spending into a demand trough.

Similarly, Seattle residents have the highest rain tolerance (threshold: 0.15 inches before outdoor demand suppresses) -- Seattleites do not cancel plans for drizzle. Phoenix residents have the lowest (0.02 inches), because rain is so rare that any precipitation disrupts outdoor behavior. A universal “if rain, reduce outdoor spend” rule over-suppresses in Seattle and under-suppresses in Phoenix.

The Busse et al. (2015) insight: Weather affects high-consideration purchases through psychological salience, not just physical comfort. In a study of 40 million vehicle transactions, convertible sales spiked on sunny days and 4WD sales spiked after snowfall -- even when the weather was transient and climatically irrelevant to the buyer’s location. If weather moves car purchases (a months-long research cycle), its effect on low-consideration, weather-sensitive categories is larger and more immediate.

78°F in Seattle is a heatwave. 78°F in Phoenix is a cold snap. A model that treats temperature as a universal input is measuring the wrong thing. Demand responds to weather shocks -- deviations from local norms.
04 · Causal Engine

Identification strategy

Observational marketing mix models overestimate ad effects by 5-10x (Gordon et al., 2019). Weather is the oldest instrumental variable in economics -- exogenous, high-frequency, and 10,000x richer than Wright had in 1928.

Every marketing mix model confronts the same identification problem: advertising spending is endogenous. Firms increase budgets when they anticipate demand will be high. This creates a correlation between spend and sales that has nothing to do with advertising effectiveness. Gordon et al. (2019), using 15 large-scale randomized experiments at Facebook, found that observational methods overestimate advertising effects by 5-10x at the median. This is not a marginal measurement error. It leads to fundamental misallocation of marketing budgets.

Weather as an instrument

The solution is an instrumental variable (IV) -- a source of variation that shifts demand but is uncorrelated with the confounders. Weather is the canonical instrument. Philip Wright used rainfall as the first IV in economics in 1928. Since then, weather has been used as an instrument in at least 83 papers published in top-5 economics journals (Mellon, 2024).

A valid instrument must satisfy three conditions:

  • Relevance: Weather must causally affect demand. Empirically testable via first-stage F-statistics. For weather-sensitive categories, F > 100 is typical -- far above the Stock & Yogo (2005) threshold of 10.
  • Independence: Weather must be uncorrelated with unobserved confounders. Defensible because weather is determined by atmospheric physics, not by marketing decisions or consumer preferences.
  • Exclusion restriction: Weather must affect sales only through its effect on demand, not through other channels. This is the most debated condition -- see below.
[Causal DAG -- weathervane.app/causal-dag: temperature, humidity, UV index, wind, and precipitation enter as an exogenous natural experiment; confounders (seasonality, promotions, holidays, trends) are blocked; causal flow runs from the weather instrument through demand to the evidence bundle.]

Weather shifts demand exogenously. Because no marketer chose the weather, the variation it creates is unconfounded -- enabling causal identification.

The Mellon critique and our response

Mellon (2024) catalogued 194 potential exclusion-restriction violations for weather instruments. The core concern: weather affects mood, which affects all spending -- not just weather-sensitive categories. If rain makes people sad and sad people buy less of everything, then rain is not a valid instrument for isolating advertising effects.

Thagorus addresses this through four mechanisms:

  • Direct demand modeling: We explicitly model the direct effect of weather on demand. The remaining variation -- demand conditional on weather controls -- is used for identification.
  • Category-specific identification: The mood channel affects all categories similarly. Weather-sensitive categories show differential responses. The between-category variation identifies the demand-specific channel.
  • Conley-Hansen-Rossi bounds: Rather than assuming perfect exogeneity, we report bounds on causal estimates that remain valid under calibrated violations of the exclusion restriction (Conley, Hansen, & Rossi, 2012).
  • Cross-geography comparisons: Using synthetic control methods (Abadie et al., 2010), we compare geographies experiencing different weather but sharing all other confounders.
Formal specification: demand response function -- notation and estimands
Demand Response Function
Y_gt = Phi(W_gt, A_gt, X_gt; theta_gc, lambda_t) + epsilon_gt

where:
  Phi(.) = continuous, differentiable demand surface
  W_gt   = weather shock vector (15 variables)
  A_gt   = ad spend allocation (by channel)
  X_gt   = control variables (season, holidays, price)
  theta_gc = geography x category fixed effects
  lambda_t = time fixed effects
  epsilon_gt = idiosyncratic error

Identification assumption:
  Cov(W_gt, epsilon_gt | X_gt, theta_gc, lambda_t) = 0

Estimands:
  beta_w = dPhi/dW |_{A,X fixed}  -- causal weather elasticity
  alpha_a = dPhi/dA |_{W,X fixed} -- causal ad elasticity (instrumented)
DML framework: orthogonal estimation with cross-fitting -- Chernozhukov et al. (2018)

Double/Debiased Machine Learning uses ML for what it does best (prediction) and econometrics for what it does best (causal inference). The nuisance functions are estimated via gradient boosting or random forests with cross-fitting to avoid overfitting bias.

Neyman-orthogonality is the key property. The moment condition for the causal parameter is constructed so that it has zero derivative with respect to the nuisance parameters at their true values. This means that first-order errors in the nuisance estimates do not contaminate the causal estimate -- the bias is second-order. Combined with cross-fitting (splitting the data into K folds, estimating nuisance on K-1 folds and the causal parameter on the held-out fold), the result is a √n-consistent estimator even when the nuisance functions converge at slower rates.

DML -- Cross-Fitting Algorithm
Step 1: SPLIT DATA into K folds (K=5 default)

Step 2: FOR each fold k = 1, ..., K:
  Train nuisance models on all data EXCEPT fold k:
    m_hat^(-k)(X) = E[Y | X]   -- predict demand from controls
    e_hat^(-k)(X) = E[W | X]   -- predict weather from controls

  Compute residuals on fold k (out-of-sample):
    Y_tilde_k = Y_k - m_hat^(-k)(X_k)  -- demand residual
    W_tilde_k = W_k - e_hat^(-k)(X_k)  -- weather residual

Step 3: AGGREGATE across all folds:
  beta_w = [ SUM_k SUM_{i in k} W_tilde_i * Y_tilde_i ]
         / [ SUM_k SUM_{i in k} W_tilde_i * W_tilde_i ]

Step 4: INFERENCE (valid by Neyman orthogonality):
  Var(beta_w) = (1/n^2) * SUM_i psi_i^2 / (E[W_tilde^2])^2
  where psi_i = W_tilde_i * (Y_tilde_i - beta_w * W_tilde_i)

Properties:
  - sqrt(n)-consistent even when nuisance functions converge at n^(-1/4)
  - Neyman-orthogonal: d/d(eta) E[psi(theta, eta)] = 0 at true eta
  - Cross-fitting eliminates overfitting bias (Donsker condition not needed)
  - Valid confidence intervals via standard asymptotic theory
  - Nuisance can use ANY ML method: GBM, random forest, neural nets
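The cross-fitting algorithm above can be sketched end to end. This is a minimal version: the nuisance models m_hat = E[Y|X] and e_hat = E[W|X] are plain least squares here, where the production system described in the text would use gradient boosting or random forests; the function name and synthetic data are illustrative.

```python
import numpy as np

def dml_beta(Y, W, X, K=5, seed=0):
    """Cross-fitted DML estimate of the weather elasticity beta_w."""
    n = len(Y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    Yt, Wt = np.empty(n), np.empty(n)
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        Xtr = np.column_stack([np.ones(len(train)), X[train]])
        Xte = np.column_stack([np.ones(len(test)), X[test]])
        m = np.linalg.lstsq(Xtr, Y[train], rcond=None)[0]   # m_hat: E[Y|X]
        e = np.linalg.lstsq(Xtr, W[train], rcond=None)[0]   # e_hat: E[W|X]
        Yt[test] = Y[test] - Xte @ m   # out-of-fold demand residual
        Wt[test] = W[test] - Xte @ e   # out-of-fold weather residual
    return float(Wt @ Yt / (Wt @ Wt))

# Synthetic check: Y = 2*W + 3*X + noise, with W confounded by X
rng = np.random.default_rng(1)
X = rng.normal(size=2000)
W = 0.5 * X + rng.normal(size=2000)
Y = 2.0 * W + 3.0 * X + rng.normal(size=2000)
# dml_beta(Y, W, X) recovers beta_w close to 2.0
```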
IV diagnostic battery -- Stock & Yogo (2005), first-stage F

A valid instrument must pass stringent diagnostic tests. The LCDM reports the following diagnostics for every market-category pair:

Instrumental Variable Diagnostics
First-stage F-statistic:
  F = (R^2_1st / k) / ((1 - R^2_1st) / (n - k - 1))
  Requirement: F > 10 (Stock & Yogo 2005 weak instrument threshold)
  Typical LCDM result: F > 100 for weather-sensitive categories

Stock & Yogo (2005) critical values for 2SLS:
  10% maximal IV size:   F > 16.38 (1 endogenous, 1 instrument)
  15% maximal IV size:   F > 8.96
  20% maximal IV size:   F > 6.66
  LCDM instruments pass the 10% threshold by 6-10x

Weak identification test (Kleibergen-Paap rk Wald):
  Tests whether instruments are sufficiently correlated with
  endogenous regressors in the presence of heteroskedasticity.

Hansen J overidentification test:
  When using multiple weather variables as instruments,
  tests whether all instruments satisfy the exclusion restriction.
  Null: instruments are valid. Reject at p < 0.05 triggers review.

Durbin-Wu-Hausman endogeneity test:
  Compares OLS to IV estimates. Significant difference confirms
  that IV correction is necessary (endogeneity is present).
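The first-stage F formula above as a one-liner (a sketch; the function name and example numbers are illustrative):

```python
def first_stage_F(r2_first, n, k):
    """F = (R^2/k) / ((1 - R^2)/(n - k - 1)) for k instruments, n observations."""
    return (r2_first / k) / ((1 - r2_first) / (n - k - 1))

# A first-stage R^2 of 0.10 with one instrument and 1,000 observations:
# F ~ 110.9 -- well past the Stock & Yogo 10%-size critical value of 16.38
```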
The first instrumental variable in the history of economics used weather to identify demand -- Wright, 1928 (via Angrist & Krueger, 2001, Journal of Economic Perspectives).
05 · Model Architecture

Multi-tenant pooling

Estimating parameters separately is always worse than estimating them together. Always. The James-Stein theorem guarantees it. Every brand on the platform makes every other brand more accurate.

Partial pooling and the James-Stein theorem

In 1961, Charles Stein proved something that embarrassed the statistics establishment: if you are estimating three or more quantities simultaneously, estimating them separately is always worse than estimating them together. Always. Even if the quantities are unrelated. Efron & Morris (1975) demonstrated the effect using baseball batting averages -- shrinkage toward the grand mean reduced total squared error by 71%.

Thagorus does not build one model per brand. It builds a hierarchical model that learns from every brand simultaneously. The James-Stein result guarantees that this produces better estimates for every single tenant. A brand joining the platform gets better estimates on day one than it would after six months alone. This is a mathematical guarantee, not a product claim.

[Interactive -- weathervane.app/network-pooling: a network-size slider (15 markets) shows accuracy improving as markets join. James-Stein partial pooling guarantees every tenant benefits.]

Empirical Bayes Shrinkage Estimator
Hierarchical model:
  theta_i | mu, tau^2  ~  N(mu, tau^2)    -- prior: elasticities from population
  X_i | theta_i        ~  N(theta_i, sigma^2_i)  -- likelihood: noisy observation

Posterior (shrinkage estimator):
  theta_hat_i = B_i * mu_hat + (1 - B_i) * X_i

  where B_i = sigma^2_i / (sigma^2_i + tau_hat^2)

Interpretation:
  B_i -> 1 (high noise, short history)  -->  lean on population mean
  B_i -> 0 (low noise, long history)    -->  lean on own data

Risk comparison (James & Stein, 1961):
  R(theta_hat^JS) < R(theta_hat^MLE) for ALL theta when p >= 3
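The shrinkage estimator above in one function (a minimal sketch; the function name and example values are illustrative):

```python
def shrink(x_i, sigma2_i, mu_hat, tau2_hat):
    """theta_hat_i = B_i * mu_hat + (1 - B_i) * x_i, with B_i = s2/(s2 + tau2)."""
    B = sigma2_i / (sigma2_i + tau2_hat)
    return B * mu_hat + (1 - B) * x_i

# New tenant (noisy own estimate, sigma2 = 4): B = 0.8, leans on the pool
# shrink(0.9, 4.0, mu_hat=0.3, tau2_hat=1.0) -> 0.42
# Long-history tenant (sigma2 = 0.1): B ~ 0.09, keeps its own data
# shrink(0.9, 0.1, mu_hat=0.3, tau2_hat=1.0) -> ~0.85
```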

Prior specification: horseshoe sparsity

The horseshoe prior (Carvalho, Polson, & Scott, 2010) handles the sparsity problem: UV index matters enormously for sunscreen and not at all for hardware. Most weather variables have zero effect on most categories, but a few have enormous effects on a few. The horseshoe aggressively shrinks irrelevant signals to exactly zero while leaving genuine effects untouched.

The horseshoe achieves this through its half-Cauchy mixing distribution, which places substantial mass at zero (aggressive shrinkage for noise) while having heavy tails (minimal shrinkage for true signals). Compared to the LASSO (L1 penalty), the horseshoe does not suffer from bias on large coefficients. Compared to spike-and-slab priors, it is computationally tractable for the LCDM’s parameter space (~18,500 parameters across 7 categories × 8 weather variables × interactions × geographies).

Horseshoe prior specification -- Carvalho, Polson & Scott (2010)
Horseshoe Prior Hierarchy
Global-local shrinkage:
  beta_j | lambda_j, tau ~ N(0, lambda_j^2 * tau^2)
  lambda_j ~ C+(0, 1)    -- local shrinkage (half-Cauchy)
  tau ~ C+(0, tau_0)      -- global shrinkage (half-Cauchy)

  tau_0 = (p_0 / (p - p_0)) * (sigma / sqrt(n))
  where p_0 = expected number of nonzero coefficients,
        p = total number of coefficients

Shrinkage profile:
  kappa_j = 1 / (1 + lambda_j^2 * tau^2)
  E[beta_j | data] approx (1 - kappa_j) * beta_j^MLE

Key property:
  kappa_j -> 1 (shrink to zero) when signal is weak
  kappa_j -> 0 (no shrinkage) when signal is strong
  Transition is SHARP -- unlike ridge, which shrinks everything uniformly

LCDM application:
  Of ~18,500 weather-demand parameters, ~85% are shrunk to effectively zero.
  The remaining ~15% carry the genuine weather-demand signal.
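The sharp shrinkage profile is easy to see numerically (a minimal sketch of the kappa formula above; example values are illustrative):

```python
def kappa(lam, tau):
    """kappa_j = 1 / (1 + lambda_j^2 * tau^2): 1 -> zeroed, 0 -> untouched."""
    return 1.0 / (1.0 + lam**2 * tau**2)

# Weak signal (small local scale): shrunk to effectively zero
# kappa(0.1, 0.1) ~ 0.9999
# Strong signal escapes via the heavy Cauchy tail: nearly untouched
# kappa(50, 0.1) = 1/26 ~ 0.038
```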

MCMC vs. variational inference

Full MCMC (NUTS sampler via Stan or NumPyro) provides the gold standard for posterior inference but faces a scaling wall at enterprise dimensions. PyMC Marketing with 1,931 parameters achieves approximately 0.19 effective samples per second. The LCDM uses a two-track approach:

  • Panel ridge regression + empirical Bayes for production inference. Closed-form shrinkage, O(seconds) computation for 1,000 markets × 50 categories.
  • Full MCMC for model diagnostics, prior sensitivity checks, and validation. Run offline on model updates, not on daily inference cycles.
  • Variational inference as a middle ground for uncertainty quantification. Mean-field ADVI with reparameterization.
Training objective and loss function -- weighted MSE + ridge penalty
Training Objective
L(beta) = (1/N) * SUM_i w_i * (Y_i - X_i * beta)^2 + lambda * ||beta||^2

where:
  w_i = observation weights (recency-weighted, higher for recent data)
  lambda = ridge penalty (selected via time-series cross-validation)

Hyperparameter selection:
  - Time-series CV with expanding window (no data leakage)
  - Bayesian optimization over lambda, adstock decay, saturation params
  - ~15 minutes for 1,000 markets x 50 categories
  - <100ms per inference call (production)
Time-series cross-validation protocol: expanding window, no leakage

Standard k-fold cross-validation is invalid for time-series data because it allows future data to inform predictions about the past, creating optimistic bias. The LCDM uses expanding window cross-validation:

Expanding Window CV
For t = t_min, t_min+1, ..., T:
  Training window:  [1, ..., t]
  Validation window: [t+1, ..., t+h]  where h = forecast horizon

Fold structure (example with 3 years of daily data):
  Fold 1:  Train on months 1-12,  validate on months 13-15
  Fold 2:  Train on months 1-15,  validate on months 16-18
  Fold 3:  Train on months 1-18,  validate on months 19-21
  ...
  Fold K:  Train on months 1-33,  validate on months 34-36

Properties:
  - No future information leakage
  - Expanding training window captures structural breaks
  - Validation windows are always out-of-sample and forward-looking
  - Hyperparameters selected to minimize average validation MAPE
  - Separate CV for each market-category pair (different optima)
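The fold structure above can be generated programmatically. A sketch, with parameter names assumed for illustration:

```python
from typing import Iterator, Tuple

def expanding_window_folds(
    n_periods: int,
    min_train: int,
    horizon: int,
    step: int,
) -> Iterator[Tuple[range, range]]:
    """Yield (train, validation) index ranges with no future leakage.

    Training always starts at period 0 and grows; validation is the
    `horizon` periods immediately after the training window.
    """
    t = min_train
    while t + horizon <= n_periods:
        yield range(0, t), range(t, t + horizon)
        t += step

# 36 months of data, 12-month initial window, 3-month validation horizon,
# stepping forward 3 months per fold (matches the example fold structure).
folds = list(expanding_window_folds(n_periods=36, min_train=12, horizon=3, step=3))
for i, (train, val) in enumerate(folds, 1):
    print(f"Fold {i}: train months 1-{train.stop}, validate {val.start + 1}-{val.stop}")
```

With these settings the generator reproduces the example above: fold 1 trains on months 1-12 and validates on 13-15, and the final fold trains on months 1-33 and validates on 34-36.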

Bayesian optimization over hyperparameter space:
  lambda      in [1e-4, 1e2]   -- ridge penalty
  theta       in [0.1, 0.99]   -- adstock decay rate
  alpha_hill  in [0.3, 1.0]    -- Hill saturation shape
  K_hill      in [1e3, 1e5]    -- Hill half-saturation point
  tau         in [0.01, 0.5]   -- horseshoe global shrinkage
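The production system uses Bayesian optimization over these bounds; as a neutral, library-free sketch, here is plain random search over the same space, sampling log-uniformly where the range spans orders of magnitude (the log-scale flags are our assumption):

```python
import numpy as np

rng = np.random.default_rng(3)

# Search space from the table above; the third field marks ranges we
# assume are sampled on a log scale.
space = {
    "lambda":     (1e-4, 1e2, True),   # ridge penalty
    "theta":      (0.1, 0.99, False),  # adstock decay rate
    "alpha_hill": (0.3, 1.0, False),   # Hill saturation shape
    "K_hill":     (1e3, 1e5, True),    # Hill half-saturation point
    "tau":        (0.01, 0.5, True),   # horseshoe global shrinkage
}

def sample(space, rng):
    """Draw one hyperparameter configuration (log-uniform where flagged)."""
    cfg = {}
    for name, (lo, hi, log) in space.items():
        if log:
            cfg[name] = float(np.exp(rng.uniform(np.log(lo), np.log(hi))))
        else:
            cfg[name] = float(rng.uniform(lo, hi))
    return cfg

# In production each configuration would be scored by average validation
# MAPE under the expanding-window CV protocol; here we only show sampling.
configs = [sample(space, rng) for _ in range(5)]
print(configs[0])
```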
When estimating 3 or more means simultaneously, the individual sample mean is inadmissible. Shrinkage toward the common mean always reduces total squared error.
-- James & Stein (1961), 4th Berkeley Symposium
06Landscape

Competitive landscape

An honest technical comparison. We acknowledge where competitors are stronger. The goal is accuracy, not positioning.

Dimension                  | Robyn (Meta)                    | Meridian (Google)                            | PyMC Marketing                  | LCDM
Statistical core           | Ridge regression + Nevergrad    | Hierarchical Bayesian (MCMC)                 | Full MCMC (NUTS)                | Panel ridge + empirical Bayes
Causal identification      | None (regularization only)      | GQV for search; geo experiments recommended  | Priors only                     | Weather IV + synthetic control + DML
Cross-tenant pooling       | No                              | Geo-level random effects                     | Hierarchical (if custom-built)  | James-Stein guaranteed
Weather modeling           | None                            | None                                         | None (add manually)             | 6 foundation models, ensemble
Closed-loop optimization   | Simulator (not controller)      | One-shot optimization                        | No                              | MPC with receding horizon
Uncertainty quantification | Point estimates + Pareto front  | Full posterior                               | Full posterior                  | Empirical Bayes + conformal
Computational cost         | Minutes                         | Hours (MCMC)                                 | Hours-days (MCMC scaling wall)  | Seconds (production)
Ease of setup              | Low barrier, R package          | Moderate, Python                             | High, requires custom dev       | Managed service

vs. Robyn (Meta)

Robyn is ridge regression wrapped in a multi-objective hyperparameter search. Ridge addresses multicollinearity and overfitting -- it does not address endogeneity. The ridge penalty changes the magnitude of endogeneity bias but does not eliminate it.
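The distinction can be seen in a toy simulation (assumed DGP, not from the paper): a confounder drives both spend and demand, so OLS and ridge are biased toward the same wrong answer, while an instrument recovers the true coefficient:

```python
import numpy as np

rng = np.random.default_rng(42)
n, beta_true = 100_000, 1.0

# Toy DGP: confounder u drives both spend x and demand y (endogeneity),
# while instrument z shifts x but affects y only through x.
u = rng.normal(size=n)
z = rng.normal(size=n)
x = z + u + rng.normal(size=n)
y = beta_true * x + 2.0 * u + rng.normal(size=n)

# OLS and ridge: both biased upward by cov(x, u) (plim OLS = 5/3 here).
lam = 10.0
ols = (x @ y) / (x @ x)
ridge = (x @ y) / (x @ x + lam)

# IV (Wald estimator): cov(z, y) / cov(z, x) recovers beta_true = 1.
iv = (z @ y) / (z @ x)

print(f"OLS {ols:.2f}, ridge {ridge:.2f}, IV {iv:.2f}")
```

Note that the ridge estimate is nearly identical to OLS: the penalty rescales the biased estimate, it does not remove the bias.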

Where Robyn is stronger: Robyn’s simplicity is a genuine advantage for teams without econometric expertise. Robyn’s open-source ecosystem is mature and well-documented.

vs. Meridian (Google)

Meridian is the most sophisticated competitor. Its hierarchical Bayesian framework with geo-level random effects is technically sound. However, its default ROI prior -- LogNormal(0.2, 0.9) -- encodes a strong belief that all media channels are profitable.

Where Meridian is stronger: Brands with heavy YouTube spend benefit from R&F data that no competitor can match. Meridian’s full posterior inference provides richer uncertainty quantification than empirical Bayes.

vs. PyMC Marketing

PyMC Marketing offers the most flexible modeling framework via full MCMC inference. The limitation is practical: MCMC at enterprise scale achieves approximately 0.19 ESS/s.

Where PyMC is stronger: Full posterior inference when it converges. Maximum modeling flexibility for custom research. No vendor lock-in.

vs. Rules-based approaches

Threshold triggers and simple correlations are the status quo. Their limitation is the complexity gap described in Section 1. See our interactive demonstrations for an empirical comparison.

Ridge regression addresses multicollinearity. It does not address endogeneity. These are different problems with different solutions.
-- Standard econometrics; see Angrist & Pischke (2009)
07Trust Architecture

Validation and safety

Every recommendation ships with break conditions -- specific, testable statements about when it becomes invalid. The system defaults to shadow mode: it recommends, humans approve.

[Interactive demo: weathervane.app/backtest-race]
52-week synthetic backtest comparing rule-based MAPE against model MAPE; the model tracks actual demand far more closely than the rules.

Backtesting methodology

All validation results below are from synthetic backtests -- simulated demand data with known ground truth. We include them because honesty about what we have and haven’t proven is our most important trust signal.

Metric                    | Result                              | Caveat
Parameter recovery        | R² > 0.92 for weather coefficients  | Synthetic data, known DGP
Out-of-sample MAPE        | 8-12% across categories             | Weather-sensitive categories only
Conformal coverage        | 94-96% at 95% nominal               | Synthetic data
Lift vs. naive allocation | +15-25% estimated                   | Backtest, not live measurement
Lift vs. rules            | +8-18% estimated                    | Simulated rule competitors
[Interactive demo: weathervane.app/fan-chart]
Living fan chart with 50/80/95% confidence bands and per-week breakdowns; dots mark actuals falling within the bands.

Confidence calibration

A 90% confidence interval should contain the true value 90% of the time. We verify this using conformal prediction (Vovk et al., 2005) with adaptive calibration under distribution shift (Gibbs & Candes, 2021).
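A minimal split-conformal sketch (toy forecaster and data, assumed for illustration) shows how the coverage claim is checked:

```python
import numpy as np

rng = np.random.default_rng(7)
n_cal, n_test, alpha = 1000, 1000, 0.10  # target 90% coverage

def forecast(x):
    return 2.0 * x  # stand-in point forecaster (assumed)

# Calibration set: conformity scores are absolute residuals.
x_cal = rng.uniform(0, 10, n_cal)
y_cal = 2.0 * x_cal + rng.normal(scale=1.0, size=n_cal)
scores = np.abs(y_cal - forecast(x_cal))

# Finite-sample-corrected quantile (Vovk et al., 2005).
q = np.quantile(scores, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal)

# Held-out check: intervals forecast(x) +/- q should cover ~90% of truths.
x_test = rng.uniform(0, 10, n_test)
y_test = 2.0 * x_test + rng.normal(scale=1.0, size=n_test)
covered = np.abs(y_test - forecast(x_test)) <= q
print(f"Empirical coverage: {covered.mean():.1%} (target 90%)")
```

The guarantee holds under exchangeability; adaptive conformal inference (Gibbs & Candes, 2021) is what relaxes it under distribution shift.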

Break conditions as safety architecture

Every recommendation includes explicit break conditions -- specific, testable statements about when the recommendation becomes invalid. The system operates in shadow mode by default: recommendations are generated but require human approval before execution.

Circuit breaker specification: automated safety constraints
  • Spend cap: No single recommendation can shift more than 35% of daily budget without explicit human approval.
  • Confidence floor: Recommendations below 70% confidence are flagged, not executed.
  • Anomaly detection: If observed demand deviates from predicted by more than 2 standard deviations, the system pauses and alerts.
  • Kill switch: Any stakeholder can halt all recommendations instantly via API or dashboard.
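The four constraints compose into a simple gate. A sketch -- function names, return values, and the z-score input are assumptions, not the production API:

```python
from dataclasses import dataclass

@dataclass
class Recommendation:
    budget_shift_frac: float  # fraction of daily budget shifted
    confidence: float         # model confidence in [0, 1]

# Thresholds from the circuit-breaker specification above.
MAX_SHIFT, MIN_CONFIDENCE, ANOMALY_SIGMAS = 0.35, 0.70, 2.0

def gate(rec: Recommendation, demand_z_score: float, kill_switch: bool) -> str:
    """Return the action for a recommendation: 'execute', 'flag', or 'halt'."""
    if kill_switch:
        return "halt"          # any stakeholder can stop everything
    if abs(demand_z_score) > ANOMALY_SIGMAS:
        return "halt"          # demand anomaly: pause and alert
    if rec.budget_shift_frac > MAX_SHIFT:
        return "flag"          # needs explicit human approval
    if rec.confidence < MIN_CONFIDENCE:
        return "flag"          # below confidence floor
    return "execute"

print(gate(Recommendation(0.10, 0.85), demand_z_score=0.5, kill_switch=False))
```

In shadow mode even "execute" would only surface the recommendation for human approval; the gate determines how it is surfaced.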
08Honest Gaps

Open questions

What we haven’t proven yet. Where the methodology has known weaknesses. Where we need your data.

Weather-insensitive categories

The identification strategy is strongest for categories with clear weather-demand relationships (outdoor, beverage, apparel, personal care). For weather-insensitive categories (enterprise software, financial services), the advantage is attenuated.

Synthetic vs. real-world validation

All backtest results are from synthetic data with known data-generating processes. Real-world demand is messier, noisier, and subject to confounders that synthetic data does not capture. We are collecting real-world validation data from design partners.

Exclusion restriction imperfection

The exclusion restriction for weather IVs is imperfect (Mellon, 2024). Weather affects mood, which affects all spending. We report Conley-Hansen-Rossi bounds, but the bounds can be wide when the assumed violation is large.

SUTVA violations

The Stable Unit Treatment Value Assumption may be violated through geographic demand substitution. If a heatwave in Phoenix shifts online purchases away from Tucson retailers, the treatment effect in Phoenix is partially at the expense of Tucson.

Empirical Bayes uncertainty

The empirical Bayes framework underestimates posterior uncertainty compared to full Bayesian inference. The population parameters are treated as known rather than uncertain. For production decisions, this is acceptable. For research conclusions, we recommend full MCMC.

Network effect scaling laws

We hypothesize that estimation error decreases as a power law of the number of tenants on the platform, analogous to neural scaling laws (Kaplan et al., 2020). This hypothesis has not been empirically validated at scale.

Long-range forecast degradation

Weather forecast skill degrades with horizon. Beyond 10 days, deterministic forecasts add little value. The LCDM leans on probabilistic ensemble forecasts (GenCast) for 3-10 day horizons and sub-seasonal models (FuXi-S2S) for 14-42 days.

We are currently onboarding design partners to generate real-world validation. If you want to be part of the first cohort, we will share all results -- including failures.
09Bibliography

References

Causal Inference & Econometrics

  • Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic control methods for comparative case studies. JASA, 105(490), 493-505.
  • Angrist, J. D. & Krueger, A. B. (2001). Instrumental variables and the search for identification. JEP, 15(4), 69-85.
  • Angrist, J. D. & Pischke, J.-S. (2009). Mostly Harmless Econometrics. Princeton University Press.
  • Callaway, B. & Sant’Anna, P. H. C. (2021). Difference-in-differences with multiple time periods. J. Econometrics, 225(2), 200-230.
  • Chernozhukov, V. et al. (2018). Double/debiased machine learning for treatment and structural parameters. Econometrics J., 21(1), C1-C68.
  • Conley, T. G., Hansen, C. B., & Rossi, P. E. (2012). Plausibly exogenous. REStat, 94(1), 260-272.
  • Dell, M., Jones, B. F., & Olken, B. A. (2014). What do we learn from the weather? JEL, 52(3), 740-798.
  • Hartford, J. et al. (2017). Deep IV: a flexible approach for counterfactual prediction. ICML 2017.
  • Imbens, G. W. & Angrist, J. D. (1994). Identification and estimation of local average treatment effects. Econometrica, 62(2), 467-475.
  • Stock, J. H. & Yogo, M. (2005). Testing for weak instruments in linear IV regression. In Identification and Inference for Econometric Models, Andrews & Stock (eds.), 80-108.
  • Wager, S. & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. JASA, 113(523), 1228-1242.
  • Wright, P. G. (1928). The Tariff on Animal and Vegetable Oils. Macmillan.

Marketing Science & Media Mix Modeling

  • Blake, T., Nosko, C., & Tadelis, S. (2015). Consumer heterogeneity and paid search effectiveness. Econometrica, 83(1), 155-174.
  • Dew, R., Padilla, N., & Shchetkina, I. (2024). Your MMM is broken. arXiv:2408.07678.
  • Gordon, B. R., Zettelmeyer, F., Bhargava, N., & Chapsky, D. (2019). A comparison of approaches to advertising measurement. Marketing Science, 38(2), 193-225.
  • Lewis, R. A. & Rao, J. M. (2015). The unfavorable economics of measuring the returns to advertising. QJE, 130(4), 1941-1973.
  • Shapiro, B. T., Hitsch, G. J., & Tuchman, A. E. (2021). TV advertising effectiveness and profitability. Econometrica, 89(4), 1855-1879.

Weather-Demand Economics

  • Busse, M. R., Pope, D. G., Pope, J. C., & Silva-Risso, J. (2015). The psychological effect of weather on car purchases. QJE, 130(1), 371-414.
  • Mellon, J. (2024). Rain, rain, go away: 194 potential exclusion-restriction violations for weather instruments. AJPS, 69, 881-898.
  • Roth Tran, B. (2023). Sellin’ in the rain: weather, climate, and retail sales. Management Science, 69(12), 7423-7447.
  • Steadman, R. G. (1979). The assessment of sultriness. Part I: a temperature-humidity index based on human physiology and clothing science. J. Appl. Meteorol., 18(7), 861-873.

Statistics & Bayesian Methods

  • Carvalho, C. M., Polson, N. G., & Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2), 465-480.
  • Efron, B. & Morris, C. (1975). Data analysis using Stein’s estimator. JASA, 70(350), 311-319.
  • Gelman, A. & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge.
  • James, W. & Stein, C. (1961). Estimation with quadratic loss. 4th Berkeley Symposium, 1, 361-379.
  • Kaplan, J. et al. (2020). Scaling laws for neural language models. arXiv:2001.08361.
  • Morris, C. N. (1983). Parametric empirical Bayes inference. JASA, 78(381), 47-55.
  • Piironen, J. & Vehtari, A. (2017). Sparsity information and regularization in the horseshoe and other shrinkage priors. Electron. J. Stat., 11(2), 5018-5051.
  • Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic Learning in a Random World. Springer.

Information Theory & Cybernetics

  • Ashby, W. R. (1956). An Introduction to Cybernetics. Chapman & Hall.
  • Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379-423.
  • Tishby, N., Pereira, F. C., & Bialek, W. (2000). The information bottleneck method. arXiv:physics/0004057.
  • Touchette, H. & Lloyd, S. (2000). Information-theoretic limits of control. PRL, 84(6), 1156-1159.

AI Weather Forecasting

  • Bi, K. et al. (2023). Accurate medium-range global weather forecasting with 3D neural networks. Nature, 619, 533-538. (Pangu-Weather.)
  • Bodnar, C. et al. (2025). Aurora: a foundation model for the Earth system. Nature, 641, 1180-1187.
  • Lam, R. et al. (2023). GraphCast: learning skillful medium-range global weather forecasting. Science, 382(6677), 1416-1421.
  • Price, I. et al. (2024). GenCast: probabilistic weather forecasting with diffusion models. Nature, 637, 84-90.

Biometeorology & Medical

  • Kimoto, K. et al. (2011). Influence of barometric pressure and humidity on the onset of clinical symptoms of migraine. Cephalalgia, 31(3), 338-343.
  • Mukamal, K. J. et al. (2009). Weather and air pollution as triggers of severe headaches. Neurology, 72(10), 922-927.

Conformal Prediction & Calibration

  • Barber, R. F. et al. (2023). Conformal prediction beyond exchangeability. Ann. Stat., 51(2), 816-845.
  • Gibbs, I. & Candes, E. (2021). Adaptive conformal inference under distribution shift. NeurIPS 2021.

Mathematical Functions & Meteorology

  • NWS (National Weather Service). Wind chill temperature index. NOAA Technical Memorandum.
  • Rothfusz, L. P. (1990). The heat index “equation” (or, more than you ever wanted to know about heat index). NWS Southern Region Technical Attachment SR 90-23.
Key Findings
  > 100     Weather IV F-statistic
  8-12%     Out-of-sample MAPE
  94-96%    Conformal coverage at 95% nominal
  > 0.92    Parameter recovery R²
  +15-25%   Lift vs. naive allocation
  +8-18%    Lift vs. rules-based
  < 100ms   Production inference latency
  50,000+   Markets x categories
  15        Weather variables modeled
  28        Pairwise interactions
  6         Foundation weather models
  42 days   Forecast horizon

Questions about the methodology? Interested in contributing to validation?

See the interactive demonstrations → · Read the technical overview →