The complexity gap
Rules carry ~7.6 bits of mutual information. The demand system has ~20 bits of relevant entropy. The remaining 12.4 bits are structurally invisible -- not a matter of writing better rules.
72°F with a light breeze: cold beverage demand hits 67 units. 72°F with 25mph wind: demand drops to 41. The thermometer reads the same. The cash register doesn’t. Now multiply that by 8 weather variables, 4 temporal derivatives, 7 product categories, and 47 markets. A human writing rules has seen 0.000016% of this space.
Temperature alone explains roughly 12% of demand variance in weather-sensitive categories. This is the number that rules-based systems optimize against. The full weather tensor -- temperature, humidity, UV index, wind speed, barometric pressure, precipitation, cloud cover, and their temporal derivatives -- explains 45-60% of demand variance in those same categories. The gap between 12% and 55% is not noise. It is decision-relevant information that rules cannot represent.
The information-theoretic bound. Touchette & Lloyd (2000) proved that the entropy reduction achievable by any controller is bounded by the mutual information between the controller's model and the system. A rule set with 200 rules carries at most log2(200) ≈ 7.6 bits of mutual information. The underlying demand system has roughly 20 bits of relevant entropy -- about 10^6 distinguishable states. Rules can reduce uncertainty by 7.6 bits out of 20. The remaining 12.4 bits are information the controller cannot see.
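The arithmetic of the bound is simple enough to check directly. A minimal sketch (the 20-bit system entropy is the section's own estimate, hard-coded here):

```python
import math

def rule_info_bound(n_rules: int) -> float:
    """Upper bound on mutual information (bits) a rule set can carry:
    a lookup over N rules distinguishes at most N states."""
    return math.log2(n_rules)

SYSTEM_ENTROPY_BITS = 20.0  # relevant entropy of the demand system (per the text)

bound = rule_info_bound(200)
gap = SYSTEM_ENTROPY_BITS - bound
print(f"rule bound: {bound:.1f} bits, structurally invisible: {gap:.1f} bits")
# bound rounds to 7.6 bits, gap to 12.4 bits
```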
The curse of dimensionality. Weather-demand relationships are not seven independent channels. They interact: 90°F at 30% humidity produces different demand than 90°F at 85% humidity. UV index modulates temperature effects nonlinearly. Wind speed interacts with precipitation to determine outdoor activity. With 8 weather variables, the number of pairwise interactions is C(8,2) = 28. Three-way interactions: 56. The response surface is continuous and nonlinear, but rules discretize it into a finite set of rectangular regions. Every threshold is a knife edge that reality does not respect.
Three converging proofs. The result is robust because three independent traditions arrive at the same conclusion:
Mathematical detail: the information bottleneck (Tishby et al., 2000)
Tishby, Pereira, & Bialek (2000) formalized the compression-prediction tradeoff. Given a signal and a target, any compression must choose what to keep. Rules compress for legibility -- they preserve what a human can read in a spreadsheet. Models compress for decision quality -- they preserve what actually predicts the outcome. These are different objectives with provably different optimal solutions.
min_T  I(X; T) - beta * I(T; Y)

where:
  X = input (full weather tensor)
  T = compressed representation (rules or model)
  Y = target (demand)
  beta = tradeoff parameter

Rules: low I(X;T) by design     -> low I(T;Y) necessarily
Model: higher I(X;T) capacity   -> higher I(T;Y) achievable
When three separate branches of mathematics point at the same wall, it is probably a wall. The gap between rules and reality is structural, not a matter of engineering effort.
Five stages of demand complexity
The LCDM models demand as a function of 8 weather variables: temperature, humidity, UV index, wind speed, barometric pressure, precipitation, cloud cover, and their temporal derivatives. The interaction taxonomy is combinatorially explosive: pairwise interactions number C(8,2) = 28, three-way interactions number C(8,3) = 56, and four-way interactions number C(8,4) = 70 -- totalling 154 interaction terms before considering temporal lags, geographic heterogeneity, or cross-category effects.
Rules discretize this continuous response surface into a finite set of rectangular regions. Every threshold is a knife edge that reality does not respect. The mathematical argument is that threshold rules can capture at most log₂(N) bits of mutual information, where N is the number of rules. 200 rules carry at most ~7.6 bits; the demand system has ~20 bits of relevant entropy. The remaining 12.4 bits are structurally invisible to rules.
Cross-category cannibalization
When demand for one category surges past a threshold, it cannibalizes or amplifies demand in adjacent categories. The LCDM models this with a 7×7 cannibalization weight matrix that activates when any category exceeds its surge threshold (default: 65th percentile demand index).
Cannibalization weight matrix (7 categories, cross-effects)
Category interactions (weight applied when source exceeds surge threshold):

  HVAC surge      -> Outdoor Rec: -0.15  (indoor retreat suppresses outdoor)
  Cold Bev surge  -> Ice Cream:   -0.12  (substitution effect)
  Allergy surge   -> Outdoor Rec: -0.12  (symptom avoidance)
  Outerwear surge -> HVAC:        +0.08  (cold-weather co-demand)
  Outdoor surge   -> Cold Bev:    +0.08  (complementary consumption)
  Outdoor surge   -> Ice Cream:   +0.06  (occasion bundling)
  Outdoor surge   -> Sunscreen:   +0.05  (activity co-occurrence)

Adjustment formula:
  adj_j = SUM_i [ w_ij * max(0, (demand_i - T) / (100 - T)) * 100 ]
  where T = surge threshold (65), w_ij = cannibalization weight
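The adjustment formula translates directly to code. An illustrative Python sketch (weights and threshold come from the matrix above; the category keys and dict layout are our assumptions):

```python
SURGE_T = 65.0  # surge threshold (65th percentile demand index)

# WEIGHTS[source][target] = cannibalization weight w_ij (from the matrix above)
WEIGHTS = {
    "hvac":      {"outdoor": -0.15},
    "cold_bev":  {"ice_cream": -0.12},
    "allergy":   {"outdoor": -0.12},
    "outerwear": {"hvac": +0.08},
    "outdoor":   {"cold_bev": +0.08, "ice_cream": +0.06, "sunscreen": +0.05},
}

def surge_excess(demand, t=SURGE_T):
    """Normalized surge intensity in [0, 1]: how far past the threshold."""
    return max(0.0, (demand - t) / (100.0 - t))

def adjustments(demand):
    """adj_j = SUM_i w_ij * surge_excess(demand_i) * 100"""
    adj = {}
    for src, targets in WEIGHTS.items():
        excess = surge_excess(demand.get(src, 0.0))
        for tgt, w in targets.items():
            adj[tgt] = adj.get(tgt, 0.0) + w * excess * 100.0
    return adj

# HVAC demand index 82 (past the surge threshold) suppresses Outdoor Rec:
print(adjustments({"hvac": 82.0}))
```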
Anomaly detection formalism
The LCDM flags moments where model predictions diverge from rule-based intuition. Each anomaly type carries a severity score in [0, 1] and an explanation string. The anomaly detector currently identifies 8+ anomaly types:
Anomaly type catalogue (8+ types with severity scoring)
- Tropical rain + sunscreen (severity: demand/60): Warm rain at >70°F with high humidity produces diffuse UV burns. Sunscreen demand stays elevated despite precipitation -- contradicts every rule-based system.
- Warm outerwear demand (severity: demand/35): High wind speed in Denver/Chicago/Boston creates windbreaker demand at 70°F+. Rules keyed on temperature alone miss this entirely.
- Hot + low beverage demand (severity: 0.7): Extreme humidity (>80%) collapses foot traffic even at high temperatures, suppressing cold beverage sales below what temperature-only rules predict.
- Rainy-day ice cream (severity: 0.6): Light warm rain increases ice cream demand (comfort food culture). Documented in Unilever UK/Ireland sales data.
- Barometric jerk → allergy (severity: |d³|/4): Rapid pressure oscillations trigger migraine clusters and acute allergy episodes 24-36 hours before weather arrives (Kimoto et al. 2011, Cephalalgia).
- Pre-position outerwear (severity: 0.5): Negative jerk (d³ < -1) predicts incoming cold snap; outerwear demand rises while temperature is still 70°F.
- Moderate-temp HVAC (severity: 0.6): High |d²| (thermal whiplash) stresses HVAC systems at moderate temperatures -- building thermal mass cannot track rapid changes.
- Regime instability (severity: |d⁴|/4): 4th-derivative snap disrupts all categories simultaneously, as consumers cannot trust the forecast.
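As a sketch of how such a catalogue might be coded, here are two of the detectors above in Python. The severity formulas follow the list; the field names, demand units, and the 20 mph wind cutoff are illustrative assumptions:

```python
def detect_anomalies(obs):
    """Return a list of {type, severity, why} dicts for one observation.
    Implements two detectors from the catalogue; thresholds are illustrative."""
    out = []
    # Tropical rain + sunscreen: warm rain + high humidity, sunscreen stays elevated
    if (obs["temp_f"] > 70 and obs["precip_in"] > 0
            and obs["humidity"] > 0.7 and obs["sunscreen_demand"] > 40):
        out.append({
            "type": "tropical_rain_sunscreen",
            "severity": min(1.0, obs["sunscreen_demand"] / 60),  # demand/60
            "why": "warm rain + high humidity: diffuse UV keeps sunscreen demand up",
        })
    # Warm outerwear: high wind creates windbreaker demand at 70F+
    if obs["temp_f"] >= 70 and obs["wind_mph"] > 20 and obs["outerwear_demand"] > 20:
        out.append({
            "type": "warm_outerwear",
            "severity": min(1.0, obs["outerwear_demand"] / 35),  # demand/35
            "why": "wind-driven windbreaker demand invisible to temperature rules",
        })
    return out

obs = {"temp_f": 74, "precip_in": 0.3, "humidity": 0.82,
       "wind_mph": 24, "sunscreen_demand": 48, "outerwear_demand": 28}
for a in detect_anomalies(obs):
    print(a["type"], round(a["severity"], 2))
```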
Light warm rain INCREASES ice cream demand -- comfort food / rainy-day treat culture. Documented in Unilever UK/Ireland sales data. At 68°F with 0.2″ rain in Boston, ice cream demand rises 20% above the dry-day baseline.
First rain after a dry streak creates a pollen SURGE -- rain knocks pollen to breathing height. In Denver after 4 dry days at 72°F, a light shower (0.1″) spikes allergy medication demand 50% within 4-6 hours. Every rule says “rain = less outdoor activity = less allergy.” The opposite happens.
Outerwear demand rises while it’s still 70°F -- the model reads the trajectory, not the snapshot. When the 3rd derivative (jerk) turns strongly negative, the model detects incoming cold and pre-positions spend. The jerk is the pre-positioning signal that no human rule writer has ever seen.
Dewpoint-wind cross-acceleration near thunderstorms creates a “last chance” ice cream purchase window. When temperature is accelerating (d² > 0.5) and wind is decelerating (d¹ < -1), a warm front collision is approaching. People rush for ice cream before the storm hits -- an 18% demand spike in a 2-hour window invisible to any snapshot rule.
Core mathematical functions (Steadman 1979, NWS, Hermite interpolation)
Heat Index (Steadman 1979, T >= 80F):
HI = -42.379 + 2.049*T + 10.143*RH - 0.2248*T*RH
- 6.838e-3*T^2 - 5.482e-2*RH^2
+ 1.229e-3*T^2*RH + 8.528e-4*T*RH^2
- 1.99e-6*T^2*RH^2
Wind Chill (NWS formula, T <= 50F, WS >= 3mph):
WC = 35.74 + 0.6215*T - 35.75*WS^0.16 + 0.4275*T*WS^0.16
Feels-Like Composite:
FL(T, RH, WS) =
HI(T, RH) if T >= 80
WC(T, WS) if T <= 50
WC*(1-w) + HI*w if 50 < T < 80, w = (T-50)/30
Smoothstep (Hermite interpolation for threshold transitions):
S(e0, e1, x) = t^2 * (3 - 2*t)
where t = clamp((x - e0) / (e1 - e0), 0, 1)

Revenue response with saturation:

R(demand, baseRev, spend) = (demand/100) * baseRev * spend^alpha / (spend^alpha + k^alpha)

Parameters:
  alpha = 0.65  (shape: steepness of diminishing returns)
  k = 8000      (half-saturation point in spend units)

Marginal revenue:
  dR/dSpend = (demand/100) * baseRev * alpha * spend^(alpha-1) * k^alpha / (spend^alpha + k^alpha)^2

Key insight: lower k = faster saturation, punishing over-concentration on "obvious" cells. The marginal dollar in a saturated category is worth less than the marginal dollar in an unsaturated adjacent category.
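The formulas in this section translate directly to code. A minimal Python sketch (function names are ours; coefficients are copied from the formulas above, and the blend region applies the heat-index regression below its nominal 80°F validity, exactly as the composite specifies):

```python
def heat_index(t, rh):
    """Steadman/Rothfusz heat index regression (T in F, RH in percent)."""
    return (-42.379 + 2.049*t + 10.143*rh - 0.2248*t*rh
            - 6.838e-3*t**2 - 5.482e-2*rh**2
            + 1.229e-3*t**2*rh + 8.528e-4*t*rh**2
            - 1.99e-6*t**2*rh**2)

def wind_chill(t, ws):
    """NWS wind chill (T in F, wind speed in mph)."""
    return 35.74 + 0.6215*t - 35.75*ws**0.16 + 0.4275*t*ws**0.16

def feels_like(t, rh, ws):
    """Composite: HI above 80F, WC below 50F, linear blend in between."""
    if t >= 80:
        return heat_index(t, rh)
    if t <= 50:
        return wind_chill(t, ws)
    w = (t - 50) / 30  # blend weight rises from 0 at 50F to 1 at 80F
    return wind_chill(t, ws) * (1 - w) + heat_index(t, rh) * w

def smoothstep(e0, e1, x):
    """Hermite interpolation for threshold transitions: S = t^2 (3 - 2t)."""
    t = min(max((x - e0) / (e1 - e0), 0.0), 1.0)
    return t * t * (3 - 2 * t)

def revenue(demand, base_rev, spend, alpha=0.65, k=8000.0):
    """Hill-saturation revenue response; spend = k yields half of max revenue."""
    return (demand / 100.0) * base_rev * spend**alpha / (spend**alpha + k**alpha)
```

At spend = k the saturation factor is exactly 0.5, which is what makes k the half-saturation point.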
Temporal dynamics
A snapshot is to a trajectory as a photograph is to a film. Demand responds to rate, acceleration, jerk, and snap -- four derivatives that encode path-dependent behavior invisible to any threshold rule.
The LCDM models four temporal derivatives of each weather variable. These represent the recent trajectory of conditions:
| Pattern | d1 (rate) | d2 (accel) | d3 (jerk) | d4 (snap) |
|---|---|---|---|---|
| Steady State | 0.0 | 0.0 | 0.0 | 0.0 |
| Warming Trend | +2.5 | +0.3 | 0.0 | 0.0 |
| Cooling Snap | -4.0 | -0.8 | +0.15 | 0.0 |
| Volatile | 0.0 | 0.0 | +2.8 | -1.2 |
| Thermal Whiplash | +1.2 | -2.5 | +4.5 | -3.5 |
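One plausible way to compute such a derivative chain is repeated finite differencing of the recent series. A sketch assuming unit time steps (the production estimator may differ):

```python
import numpy as np

def derivative_chain(series, order=4):
    """Finite-difference approximations of d1..d4 at the latest point,
    via repeated np.diff; assumes evenly spaced (unit) time steps."""
    out = {}
    x = np.asarray(series, dtype=float)
    for d in range(1, order + 1):
        x = np.diff(x)
        out[f"d{d}"] = float(x[-1]) if len(x) else float("nan")
    return out

# Warming trend that bends over at the end: d1 turns negative at the last step
temps = [70, 72, 74.5, 76.5, 77.5, 77.0]
print(derivative_chain(temps))
```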
Pre-positioning: reading the trajectory, not the snapshot
The most striking derivative effect is outerwear pre-positioning. When the 3rd derivative is strongly negative (d³ < -1) -- meaning the rate of cooling is itself accelerating -- the model increases outerwear spend before actual cold arrives. At 70°F, this looks insane. But the model is reading the trajectory: temperature is bending colder, and the system pre-positions budget into the demand curve that has not yet materialized.
Similarly, the 4th derivative (“snap”) encodes regime instability. When snap is high, weather forecasts become unreliable. Consumers respond by not trusting forecasts either: planned outdoor purchases drop (people cancel trips they cannot rely on), while impulse purchases spike (every nice moment might be the last one this week). Ice cream sees a “panic treat” effect at high |d⁴| -- spontaneous purchasing driven by forecast uncertainty.
The 4th derivative also creates a pseudo-random noise effect in cold beverages. When snap is high, habitual purchases collapse (people do not stock up because they cannot predict their needs) while impulse purchases surge. The net demand oscillates in a pattern that is deterministic but practically unpredictable without modeling the full derivative chain.
Barometric jerk and medical demand (Kimoto et al., 2011)
Kimoto et al. (2011) demonstrated in Cephalalgia that rapid barometric pressure oscillations (which we proxy via temperature trajectory jerk) trigger migraine clusters and acute allergy episodes. The mechanism is physiological: rapid pressure changes alter sinus cavity pressure differentials, triggering histamine release and inflammatory cascades.
The demand signal precedes the weather event by 24-36 hours. When |d³| > 1.5, allergy/pharma demand increases by up to 18 index points -- visible as elevated antihistamine and pain reliever purchases even when current weather conditions are mild. For chronic allergy sufferers, sustained barometric instability (high |d⁴|) triggers preemptive medication stocking, adding approximately 8 index points per unit of |d⁴|.
The model does not predict demand from weather. It predicts demand from the trajectory of weather -- rate, acceleration, jerk, and snap. A snapshot is to a trajectory as a photograph is to a film.
Geographic heterogeneity
The same temperature produces opposite demand in different markets. Demand responds to weather shocks -- deviations from local norms -- not absolute conditions.
78°F in Seattle is a +2.8σ heatwave. Sunscreen sells out. 78°F in Phoenix is 27 degrees below baseline -- people reach for jackets. Same reading on the thermometer. Opposite behavior at the register.
Planning vs. impulse markets
Outdoor recreation demand reveals a fundamental geographic bifurcation that emerges at the 3-day mark of consecutive good weather, separating planning markets from impulse markets.
Geo-specific saturation thresholds
Each geography has calibrated baselines for temperature (mean μ and standard deviation σ), humidity, and wind. These are not arbitrary -- they represent the local distribution against which weather shocks are measured. Critical thresholds are geo-relative:
Geographic baselines and saturation (6 markets, calibrated parameters)
| Market | Temp (μ) | Temp (σ) | Humidity (μ) | Wind (μ) | Notes |
|---|---|---|---|---|---|
| Phoenix | 105.0°F | 6.2 | 18% | 6 mph | Sunscreen drops >105°F (indoor retreat) |
| Seattle | 62.0°F | 5.7 | 72% | 10 mph | Rain tolerance: 0.15″ (highest) |
| Denver | 65.0°F | 12.0 | 35% | 12 mph | Altitude: +20% cold bev (dehydration) |
| Miami | 84.0°F | 4.5 | 76% | 9 mph | Narrow σ: small shocks = big signal |
| Chicago | 50.0°F | 18.0 | 62% | 15 mph | Huge σ: continental extremes |
| Boston | 52.0°F | 15.0 | 65% | 13 mph | Ice cream culture premium: 1.25× |

Shock interpretation:
  78°F in Seattle = (78 - 62) / 5.7 = +2.8σ (major heatwave)
  78°F in Phoenix = (78 - 105) / 6.2 = -4.4σ (extreme cold snap)
  Same temperature, opposite demand implications.
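Shock interpretation is a one-line z-score against the local baseline. A sketch using the Seattle and Phoenix rows above:

```python
# Local temperature baselines (mu, sigma) from the table above
BASELINES = {
    "seattle": (62.0, 5.7),
    "phoenix": (105.0, 6.2),
}

def temp_shock_sigma(market, temp_f):
    """Weather shock in local standard deviations: (T - mu) / sigma."""
    mu, sigma = BASELINES[market]
    return (temp_f - mu) / sigma

print(round(temp_shock_sigma("seattle", 78), 1))  # +2.8 (major heatwave)
print(round(temp_shock_sigma("phoenix", 78), 1))  # -4.4 (extreme cold snap)
```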
Geo-specific saturation effects create counterintuitive demand curves. In Phoenix, sunscreen demand drops above 105°F because consumers retreat indoors to escape extreme heat -- the demand curve inverts past the threshold. The smoothstep transition from 110°F to 125°F (feels-like) subtracts up to 50 demand index points. A temperature-is-good rule that boosts sunscreen spend at 108°F in Phoenix is spending into a demand trough.
Similarly, Seattle residents have the highest rain tolerance (threshold: 0.15 inches before outdoor demand suppresses) -- Seattleites do not cancel plans for drizzle. Phoenix residents have the lowest (0.02 inches), because rain is so rare that any precipitation disrupts outdoor behavior. A universal “if rain, reduce outdoor spend” rule over-suppresses in Seattle and under-suppresses in Phoenix.
The Busse et al. (2015) insight: Weather affects high-consideration purchases through psychological salience, not just physical comfort. In a study of 40 million vehicle transactions, convertible sales spiked on sunny days and 4WD sales spiked after snowfall -- even when the weather was transient and climatically irrelevant to the buyer’s location. If weather moves car purchases (a months-long research cycle), its effect on low-consideration, weather-sensitive categories is larger and more immediate.
78°F in Seattle is a heatwave. 78°F in Phoenix is a cold snap. A model that treats temperature as a universal input is measuring the wrong thing. Demand responds to weather shocks -- deviations from local norms.
Identification strategy
Observational marketing mix models overestimate ad effects by 5-10x (Gordon et al., 2019). Weather is the oldest instrumental variable in economics -- exogenous, high-frequency, and 10,000x richer than Wright had in 1928.
Every marketing mix model confronts the same identification problem: advertising spending is endogenous. Firms increase budgets when they anticipate demand will be high. This creates a correlation between spend and sales that has nothing to do with advertising effectiveness. Gordon et al. (2019), using 15 large-scale randomized experiments at Facebook, found that observational methods overestimate advertising effects by 5-10x at the median. This is not a marginal measurement error. It leads to fundamental misallocation of marketing budgets.
Weather as an instrument
The solution is an instrumental variable (IV) -- a source of variation that shifts demand but is uncorrelated with the confounders. Weather is the canonical instrument. Philip Wright used rainfall as the first IV in economics in 1928. Since then, weather has been used as an instrument in at least 83 papers published in top-5 economics journals (Mellon, 2024).
A valid instrument must satisfy three conditions:
- Relevance: Weather must causally affect demand. Empirically testable via first-stage F-statistics. For weather-sensitive categories, F > 100 is typical -- far above the Stock & Yogo (2005) threshold of 10.
- Independence: Weather must be uncorrelated with unobserved confounders. Defensible because weather is determined by atmospheric physics, not by marketing decisions or consumer preferences.
- Exclusion restriction: Weather must affect sales only through its effect on demand, not through other channels. This is the most debated condition -- see below.
The Mellon critique and our response
Mellon (2024) catalogued 194 potential exclusion-restriction violations for weather instruments. The core concern: weather affects mood, which affects all spending -- not just weather-sensitive categories. If rain makes people sad and sad people buy less of everything, then rain is not a valid instrument for isolating advertising effects.
Thagorus addresses this through four mechanisms:
- Direct demand modeling: We explicitly model the direct effect of weather on demand. The remaining variation -- demand conditional on weather controls -- is used for identification.
- Category-specific identification: The mood channel affects all categories similarly. Weather-sensitive categories show differential responses. The between-category variation identifies the demand-specific channel.
- Conley-Hansen-Rossi bounds: Rather than assuming perfect exogeneity, we report bounds on causal estimates that remain valid under calibrated violations of the exclusion restriction (Conley, Hansen, & Rossi, 2012).
- Cross-geography comparisons: Using synthetic control methods (Abadie et al., 2010), we compare geographies experiencing different weather but sharing all other confounders.
Formal specification: demand response function (notation and estimands)
Y_gt = Phi(W_gt, A_gt, X_gt; theta_gc, lambda_t) + epsilon_gt
where:
Phi(.) = continuous, differentiable demand surface
W_gt = weather shock vector (15 variables)
A_gt = ad spend allocation (by channel)
X_gt = control variables (season, holidays, price)
theta_gc = geography x category fixed effects
lambda_t = time fixed effects
epsilon_gt = idiosyncratic error
Identification assumption:
Cov(W_gt, epsilon_gt | X_gt, theta_gc, lambda_t) = 0
Estimands:
beta_w = dPhi/dW |_{A,X fixed} -- causal weather elasticity
alpha_a = dPhi/dA |_{W,X fixed} -- causal ad elasticity (instrumented)

DML framework: orthogonal estimation with cross-fitting (Chernozhukov et al., 2018)
Double/Debiased Machine Learning uses ML for what it does best (prediction) and econometrics for what it does best (causal inference). The nuisance functions are estimated via gradient boosting or random forests with cross-fitting to avoid overfitting bias.
Neyman-orthogonality is the key property. The moment condition for the causal parameter is constructed so that it has zero derivative with respect to the nuisance parameters at their true values. This means that first-order errors in the nuisance estimates do not contaminate the causal estimate -- the bias is second-order. Combined with cross-fitting (splitting the data into K folds, estimating nuisance on K-1 folds and the causal parameter on the held-out fold), the result is a √n-consistent estimator even when the nuisance functions converge at slower rates.
Step 1: SPLIT DATA into K folds (K=5 default)
Step 2: FOR each fold k = 1, ..., K:
Train nuisance models on all data EXCEPT fold k:
m_hat^(-k)(X) = E[Y | X] -- predict demand from controls
e_hat^(-k)(X) = E[W | X] -- predict weather from controls
Compute residuals on fold k (out-of-sample):
Y_tilde_k = Y_k - m_hat^(-k)(X_k) -- demand residual
W_tilde_k = W_k - e_hat^(-k)(X_k) -- weather residual
Step 3: AGGREGATE across all folds:
beta_w = [ SUM_k SUM_{i in k} W_tilde_i * Y_tilde_i ]
/ [ SUM_k SUM_{i in k} W_tilde_i * W_tilde_i ]
Step 4: INFERENCE (valid by Neyman orthogonality):
Var(beta_w) = (1/n^2) * SUM_i psi_i^2 / (E[W_tilde^2])^2
where psi_i = W_tilde_i * (Y_tilde_i - beta_w * W_tilde_i)
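A compact numerical sketch of Steps 1-3 (numpy only; a quadratic least-squares fit stands in for the gradient-boosting nuisance learners, and the confounded simulation is illustrative):

```python
import numpy as np

def _fit_predict(X_tr, y_tr, X_te):
    """Tiny nuisance learner: least squares on quadratic features.
    (A stand-in for the GBM / random-forest learners in the text.)"""
    def feats(X):
        return np.column_stack([np.ones(len(X)), X, X**2])
    coef, *_ = np.linalg.lstsq(feats(X_tr), y_tr, rcond=None)
    return feats(X_te) @ coef

def dml_beta(Y, W, X, n_folds=5, seed=0):
    """Cross-fitted orthogonal estimate of beta_w (Steps 1-3 above)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(Y)), n_folds)
    Y_res, W_res = np.empty_like(Y), np.empty_like(W)
    for k in range(n_folds):
        te = folds[k]
        tr = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        Y_res[te] = Y[te] - _fit_predict(X[tr], Y[tr], X[te])  # demand residual
        W_res[te] = W[te] - _fit_predict(X[tr], W[tr], X[te])  # weather residual
    return float(W_res @ Y_res) / float(W_res @ W_res)

# Confounded simulation: X drives both W and Y; true weather effect is 2.0
rng = np.random.default_rng(1)
X = rng.normal(size=4000)
W = 0.8 * X + rng.normal(size=4000)
Y = 2.0 * W + 1.5 * X**2 + rng.normal(size=4000)
beta_hat = dml_beta(Y, W, X)  # should recover ~2.0 despite confounding
```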
Properties:
- sqrt(n)-consistent even when nuisance functions converge at n^(-1/4)
- Neyman-orthogonal: d/d(eta) E[psi(theta, eta)] = 0 at true eta
- Cross-fitting eliminates overfitting bias (Donsker condition not needed)
- Valid confidence intervals via standard asymptotic theory
- Nuisance can use ANY ML method: GBM, random forest, neural nets

IV diagnostic battery (Stock & Yogo 2005, first-stage F)
A valid instrument must pass stringent diagnostic tests. The LCDM reports the following diagnostics for every market-category pair:
First-stage F-statistic:
  F = (R^2_1st / k) / ((1 - R^2_1st) / (n - k - 1))
  Requirement: F > 10 (Stock & Yogo 2005 weak instrument threshold)
  Typical LCDM result: F > 100 for weather-sensitive categories

Stock & Yogo (2005) critical values for 2SLS:
  10% maximal IV size: F > 16.38 (1 endogenous, 1 instrument)
  15% maximal IV size: F > 8.96
  20% maximal IV size: F > 6.66
  LCDM instruments pass the 10% threshold by 6-10x

Weak identification test (Kleibergen-Paap rk Wald): tests whether instruments are sufficiently correlated with endogenous regressors in the presence of heteroskedasticity.

Hansen J overidentification test: when using multiple weather variables as instruments, tests whether all instruments satisfy the exclusion restriction. Null: instruments are valid. Rejection at p < 0.05 triggers review.

Durbin-Wu-Hausman endogeneity test: compares OLS to IV estimates. A significant difference confirms that IV correction is necessary (endogeneity is present).
The first instrumental variable in the history of economics used weather to identify demand -- Wright, 1928. (via Angrist & Krueger, 2001, Journal of Economic Perspectives)
Multi-tenant pooling
Estimating parameters separately is always worse than estimating them together. Always. The James-Stein theorem guarantees it. Every brand on the platform makes every other brand more accurate.
Partial pooling and the James-Stein theorem
In 1961, James and Stein proved something that embarrassed the statistics establishment: if you are estimating three or more quantities simultaneously, estimating them separately is always worse than estimating them together. Always. Even if the quantities are unrelated. Efron & Morris (1975) demonstrated the effect using baseball batting averages -- shrinkage toward the grand mean reduced total squared error by 71%.
Thagorus does not build one model per brand. It builds a hierarchical model that learns from every brand simultaneously. The James-Stein result guarantees that this produces better estimates for every single tenant. A brand joining the platform gets better estimates on day one than it would after six months alone. This is a mathematical guarantee, not a product claim.
Hierarchical model:
  theta_i | mu, tau^2 ~ N(mu, tau^2)      -- prior: elasticities from population
  X_i | theta_i ~ N(theta_i, sigma^2_i)   -- likelihood: noisy observation

Posterior (shrinkage estimator):
  theta_hat_i = B_i * mu_hat + (1 - B_i) * X_i
  where B_i = sigma^2_i / (sigma^2_i + tau_hat^2)

Interpretation:
  B_i -> 1 (high noise, short history) --> lean on population mean
  B_i -> 0 (low noise, long history)   --> lean on own data

Risk comparison (James & Stein, 1961):
  R(theta_hat^JS) < R(theta_hat^MLE) for ALL theta when p >= 3
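A minimal empirical-Bayes version of the shrinkage estimator above (method-of-moments tau^2 is one simple choice; the elasticity and variance numbers are illustrative):

```python
import numpy as np

def eb_shrink(x, sigma2):
    """Empirical-Bayes partial pooling per the formulas above.
    x: per-tenant estimates; sigma2: their sampling variances."""
    mu_hat = np.average(x, weights=1.0 / sigma2)           # precision-weighted pooled mean
    tau2_hat = max(np.var(x) - np.mean(sigma2), 1e-12)     # method-of-moments tau^2
    B = sigma2 / (sigma2 + tau2_hat)                       # shrinkage factors
    return B * mu_hat + (1.0 - B) * x, B

x = np.array([0.90, 0.40, 0.65, 0.70])          # noisy per-brand elasticities
sigma2 = np.array([0.04, 0.04, 0.002, 0.002])   # short vs long histories
theta, B = eb_shrink(x, sigma2)
# Noisy brands (high sigma2) are pulled hard toward the pooled mean;
# well-measured brands barely move.
```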
Prior specification: horseshoe sparsity
The horseshoe prior (Carvalho, Polson, & Scott, 2010) handles the sparsity problem: UV index matters enormously for sunscreen and not at all for hardware. Most weather variables have zero effect on most categories, but a few have enormous effects on a few. The horseshoe aggressively shrinks irrelevant signals to exactly zero while leaving genuine effects untouched.
The horseshoe achieves this through its half-Cauchy mixing distribution, which places substantial mass at zero (aggressive shrinkage for noise) while having heavy tails (minimal shrinkage for true signals). Compared to the LASSO (L1 penalty), the horseshoe does not suffer from bias on large coefficients. Compared to spike-and-slab priors, it is computationally tractable for the LCDM’s parameter space (~18,500 parameters across 7 categories × 8 weather variables × interactions × geographies).
Horseshoe prior specification (Carvalho, Polson & Scott, 2010)
Global-local shrinkage:
beta_j | lambda_j, tau ~ N(0, lambda_j^2 * tau^2)
lambda_j ~ C+(0, 1) -- local shrinkage (half-Cauchy)
tau ~ C+(0, tau_0) -- global shrinkage (half-Cauchy)
tau_0 = (p_0 / (p - p_0)) * (sigma / sqrt(n))
where p_0 = expected number of nonzero coefficients,
p = total number of coefficients
Shrinkage profile:
kappa_j = 1 / (1 + lambda_j^2 * tau^2)
E[beta_j | data] approx (1 - kappa_j) * beta_j^MLE
Key property:
kappa_j -> 1 (shrink to zero) when signal is weak
kappa_j -> 0 (no shrinkage) when signal is strong
Transition is SHARP -- unlike ridge, which shrinks everything uniformly
LCDM application:
Of ~18,500 weather-demand parameters, ~85% are shrunk to effectively zero.
The remaining ~15% carry the genuine weather-demand signal.

MCMC vs. variational inference
Full MCMC (NUTS sampler via Stan or NumPyro) provides the gold standard for posterior inference but faces a scaling wall at enterprise dimensions. PyMC Marketing with 1,931 parameters achieves approximately 0.19 effective samples per second. The LCDM uses a two-track approach:
- Panel ridge regression + empirical Bayes for production inference. Closed-form shrinkage, O(seconds) computation for 1,000 markets × 50 categories.
- Full MCMC for model diagnostics, prior sensitivity checks, and validation. Run offline on model updates, not on daily inference cycles.
- Variational inference as a middle ground for uncertainty quantification. Mean-field ADVI with reparameterization.
Training objective and loss function (weighted MSE + ridge penalty)
L(beta) = (1/N) * SUM_i w_i * (Y_i - X_i * beta)^2 + lambda * ||beta||^2

where:
  w_i = observation weights (recency-weighted, higher for recent data)
  lambda = ridge penalty (selected via time-series cross-validation)

Hyperparameter selection:
- Time-series CV with expanding window (no data leakage)
- Bayesian optimization over lambda, adstock decay, saturation params
- ~15 minutes for 1,000 markets x 50 categories
- <100ms per inference call (production)
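The objective has a closed-form minimizer, which is what makes the O(seconds) production path possible. A sketch (the half-life recency scheme is an illustrative assumption):

```python
import numpy as np

def weighted_ridge(X, y, w, lam):
    """Closed-form minimizer of the weighted-MSE + ridge objective above:
    solves (X'WX/n + lam*I) beta = X'Wy/n."""
    n, p = X.shape
    Wm = np.diag(w)
    A = X.T @ Wm @ X / n + lam * np.eye(p)
    b = X.T @ Wm @ y / n
    return np.linalg.solve(A, b)

def recency_weights(n, half_life=90):
    """Exponential recency weighting: the newest observation has weight 1,
    halving every `half_life` periods into the past."""
    age = np.arange(n)[::-1]  # data ordered oldest -> newest
    return 0.5 ** (age / half_life)

# Sanity check: tiny penalty + noiseless linear data recovers beta exactly
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true
beta_hat = weighted_ridge(X, y, recency_weights(200), lam=1e-8)
```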
Time-series cross-validation protocol (expanding window, no leakage)
Standard k-fold cross-validation is invalid for time-series data because it allows future data to inform predictions about the past, creating optimistic bias. The LCDM uses expanding window cross-validation:
For t = t_min, t_min+1, ..., T:
  Training window:   [1, ..., t]
  Validation window: [t+1, ..., t+h]   where h = forecast horizon

Fold structure (example with 3 years of daily data):
  Fold 1: Train on months 1-12, validate on months 13-15
  Fold 2: Train on months 1-15, validate on months 16-18
  Fold 3: Train on months 1-18, validate on months 19-21
  ...
  Fold K: Train on months 1-33, validate on months 34-36

Properties:
- No future information leakage
- Expanding training window captures structural breaks
- Validation windows are always out-of-sample and forward-looking
- Hyperparameters selected to minimize average validation MAPE
- Separate CV for each market-category pair (different optima)

Bayesian optimization over hyperparameter space:
  lambda in [1e-4, 1e2]       -- ridge penalty
  theta in [0.1, 0.99]        -- adstock decay rate
  alpha_hill in [0.3, 1.0]    -- Hill saturation shape
  K_hill in [1e3, 1e5]        -- Hill half-saturation point
  tau in [0.01, 0.5]          -- horseshoe global shrinkage
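The fold generator is a few lines. A sketch (months stand in for whatever period granularity is used; the initial window, horizon, and step follow the example above):

```python
def expanding_window_folds(n_periods, initial=12, horizon=3, step=3):
    """Expanding-window CV splits: train on periods [0, t), validate on
    [t, t + horizon). The training window grows; validation never leaks."""
    t = initial
    while t + horizon <= n_periods:
        yield list(range(t)), list(range(t, t + horizon))
        t += step

for k, (tr, va) in enumerate(expanding_window_folds(36), 1):
    print(f"Fold {k}: train months 1-{len(tr)}, validate months {va[0]+1}-{va[-1]+1}")
# Fold 1: train months 1-12, validate months 13-15
```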
When estimating 3 or more means simultaneously, the individual sample mean is inadmissible. Shrinkage toward the common mean always reduces total squared error. (James & Stein, 1961, 4th Berkeley Symposium)
Competitive landscape
An honest technical comparison. We acknowledge where competitors are stronger. The goal is accuracy, not positioning.
| Dimension | Robyn (Meta) | Meridian (Google) | PyMC Marketing | LCDM |
|---|---|---|---|---|
| Statistical core | Ridge regression + Nevergrad | Hierarchical Bayesian (MCMC) | Full MCMC (NUTS) | Panel ridge + empirical Bayes |
| Causal identification | None (regularization only) | GQV for search; geo experiments recommended | Priors only | Weather IV + synthetic control + DML |
| Cross-tenant pooling | No | Geo-level random effects | Hierarchical (if custom-built) | James-Stein guaranteed |
| Weather modeling | None | None | None (add manually) | 6 foundation models, ensemble |
| Closed-loop optimization | Simulator (not controller) | One-shot optimization | No | MPC with receding horizon |
| Uncertainty quantification | Point estimates + Pareto front | Full posterior | Full posterior | Empirical Bayes + conformal |
| Computational cost | Minutes | Hours (MCMC) | Hours-days (MCMC scaling wall) | Seconds (production) |
| Ease of setup | Low barrier, R package | Moderate, Python | High, requires custom dev | Managed service |
vs. Robyn (Meta)
Robyn is ridge regression wrapped in a multi-objective hyperparameter search. Ridge addresses multicollinearity and overfitting -- it does not address endogeneity. The ridge penalty changes the magnitude of endogeneity bias but does not eliminate it.
Where Robyn is stronger: Robyn’s simplicity is a genuine advantage for teams without econometric expertise. Robyn’s open-source ecosystem is mature and well-documented.
vs. Meridian (Google)
Meridian is the most sophisticated competitor. Its hierarchical Bayesian framework with geo-level random effects is technically sound. However, its default ROI prior -- LogNormal(0.2, 0.9) -- encodes a strong belief that all media channels are profitable.
Where Meridian is stronger: Brands with heavy YouTube spend benefit from R&F data that no competitor can match. Meridian’s full posterior inference provides richer uncertainty quantification than empirical Bayes.
vs. PyMC Marketing
PyMC Marketing offers the most flexible modeling framework via full MCMC inference. The limitation is practical: MCMC at enterprise scale achieves approximately 0.19 ESS/s.
Where PyMC is stronger: Full posterior inference when it converges. Maximum modeling flexibility for custom research. No vendor lock-in.
vs. Rules-based approaches
Threshold triggers and simple correlations are the status quo. Their limitation is the complexity gap described in Section 1. See our interactive demonstrations for an empirical comparison.
Ridge regression addresses multicollinearity. It does not address endogeneity. These are different problems with different solutions.Standard econometrics; see Angrist & Pischke (2009)
Validation and safety
Every recommendation ships with break conditions -- specific, testable statements about when it becomes invalid. The system defaults to shadow mode: it recommends, humans approve.
Backtesting methodology
All validation results below are from synthetic backtests -- simulated demand data with known ground truth. We include them because honesty about what we have and haven’t proven is our most important trust signal.
| Metric | Result | Caveat |
|---|---|---|
| Parameter recovery | R² > 0.92 for weather coefficients | Synthetic data, known DGP |
| Out-of-sample MAPE | 8-12% across categories | Weather-sensitive categories only |
| Conformal coverage | 94-96% at 95% nominal | Synthetic data |
| Lift vs. naive allocation | +15-25% estimated | Backtest, not live measurement |
| Lift vs. rules | +8-18% estimated | Simulated rule competitors |
Confidence calibration
A 90% confidence interval should contain the true value 90% of the time. We verify this using conformal prediction (Vovk et al., 2005) with adaptive calibration under distribution shift (Gibbs & Candes, 2021).
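A split-conformal check of that claim takes a dozen lines. A sketch with Gaussian residuals (the production calibration runs on held-out model residuals, not simulated ones):

```python
import numpy as np

def conformal_halfwidth(cal_residuals, alpha=0.10):
    """Split-conformal interval half-width: the ceil((n+1)(1-alpha))/n
    empirical quantile of absolute calibration residuals (Vovk et al., 2005)."""
    n = len(cal_residuals)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return float(np.quantile(np.abs(cal_residuals), min(q, 1.0)))

rng = np.random.default_rng(0)
cal = rng.normal(0, 1, 1000)      # calibration-set residuals
test = rng.normal(0, 1, 10000)    # fresh residuals from the same distribution
width = conformal_halfwidth(cal, alpha=0.10)
coverage = np.mean(np.abs(test) <= width)
# coverage should land close to the 90% nominal level, per the conformal guarantee
```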
Break conditions as safety architecture
Break conditions make every recommendation falsifiable: each one ships with specific, testable statements of the circumstances under which it no longer holds. Shadow mode is the default operating state -- the system generates recommendations, but nothing executes without human approval.
Circuit breaker specification
Automated safety constraints:
- Spend cap: No single recommendation can shift more than 35% of daily budget without explicit human approval.
- Confidence floor: Recommendations below 70% confidence are flagged, not executed.
- Anomaly detection: If observed demand deviates from predicted by more than 2 standard deviations, the system pauses and alerts.
- Kill switch: Any stakeholder can halt all recommendations instantly via API or dashboard.
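The four constraints above can be expressed as a single gate that every recommendation passes through before execution. A sketch in Python -- the `Recommendation` fields and the `check_circuit_breakers` helper are illustrative, not the production schema:

```python
from dataclasses import dataclass

# Hypothetical recommendation record; field names are illustrative.
@dataclass
class Recommendation:
    budget_shift_pct: float   # share of daily budget moved (0-1)
    confidence: float         # model confidence (0-1)

def check_circuit_breakers(rec, demand_z, kill_switch=False,
                           spend_cap=0.35, conf_floor=0.70, z_limit=2.0):
    """Return (allowed, reason), mirroring the four constraints above."""
    if kill_switch:
        return False, "kill switch engaged"
    if abs(demand_z) > z_limit:
        return False, "demand anomaly: pause and alert"
    if rec.budget_shift_pct > spend_cap:
        return False, "spend cap: needs explicit human approval"
    if rec.confidence < conf_floor:
        return False, "confidence floor: flagged, not executed"
    return True, "ok"

print(check_circuit_breakers(Recommendation(0.10, 0.85), demand_z=0.4))
# -> (True, 'ok')
```

Checks are ordered from most to least severe: the kill switch and anomaly pause override everything, while the spend cap and confidence floor route individual recommendations to human review rather than halting the system.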
Open questions
What we haven’t proven yet. Where the methodology has known weaknesses. Where we need your data.
We are currently onboarding design partners to generate real-world validation. If you want to be part of the first cohort, we will share all results -- including failures.
References
Causal Inference & Econometrics
- Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic control methods for comparative case studies. JASA, 105(490), 493-505.
- Angrist, J. D. & Krueger, A. B. (2001). Instrumental variables and the search for identification. JEP, 15(4), 69-85.
- Angrist, J. D. & Pischke, J.-S. (2009). Mostly Harmless Econometrics. Princeton University Press.
- Callaway, B. & Sant’Anna, P. H. C. (2021). Difference-in-differences with multiple time periods. J. Econometrics, 225(2), 200-230.
- Chernozhukov, V. et al. (2018). Double/debiased machine learning for treatment and structural parameters. Econometrics J., 21(1), C1-C68.
- Conley, T. G., Hansen, C. B., & Rossi, P. E. (2012). Plausibly exogenous. REStat, 94(1), 260-272.
- Dell, M., Jones, B. F., & Olken, B. A. (2014). What do we learn from the weather? JEL, 52(3), 740-798.
- Hartford, J. et al. (2017). Deep IV: a flexible approach for counterfactual prediction. ICML 2017.
- Imbens, G. W. & Angrist, J. D. (1994). Identification and estimation of local average treatment effects. Econometrica, 62(2), 467-475.
- Stock, J. H. & Yogo, M. (2005). Testing for weak instruments in linear IV regression. In Identification and Inference for Econometric Models, Andrews & Stock (eds.), 80-108.
- Wager, S. & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. JASA, 113(523), 1228-1242.
- Wright, P. G. (1928). The Tariff on Animal and Vegetable Oils. Macmillan.
Marketing Science & Media Mix Modeling
- Blake, T., Nosko, C., & Tadelis, S. (2015). Consumer heterogeneity and paid search effectiveness. Econometrica, 83(1), 155-174.
- Dew, R., Padilla, N., & Shchetkina, I. (2024). Your MMM is broken. arXiv:2408.07678.
- Gordon, B. R., Zettelmeyer, F., Bhargava, N., & Chapsky, D. (2019). A comparison of approaches to advertising measurement. Marketing Science, 38(2), 193-225.
- Lewis, R. A. & Rao, J. M. (2015). The unfavorable economics of measuring the returns to advertising. QJE, 130(4), 1941-1973.
- Shapiro, B. T., Hitsch, G. J., & Tuchman, A. E. (2021). TV advertising effectiveness and profitability. Econometrica, 89(4), 1855-1879.
Weather-Demand Economics
- Busse, M. R., Pope, D. G., Pope, J. C., & Silva-Risso, J. (2015). The psychological effect of weather on car purchases. QJE, 130(1), 371-414.
- Mellon, J. (2024). Rain, rain, go away: 194 potential exclusion-restriction violations for weather instruments. AJPS, 69, 881-898.
- Roth Tran, B. (2023). Sellin’ in the rain: weather, climate, and retail sales. Management Science, 69(12), 7423-7447.
- Steadman, R. G. (1979). The assessment of sultriness. Part I: a temperature-humidity index based on human physiology and clothing science. J. Appl. Meteorol., 18(7), 861-873.
Statistics & Bayesian Methods
- Carvalho, C. M., Polson, N. G., & Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2), 465-480.
- Efron, B. & Morris, C. (1975). Data analysis using Stein’s estimator. JASA, 70(350), 311-319.
- Gelman, A. & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge.
- James, W. & Stein, C. (1961). Estimation with quadratic loss. 4th Berkeley Symposium, 1, 361-379.
- Kaplan, J. et al. (2020). Scaling laws for neural language models. arXiv:2001.08361.
- Morris, C. N. (1983). Parametric empirical Bayes inference. JASA, 78(381), 47-55.
- Piironen, J. & Vehtari, A. (2017). Sparsity information and regularization in the horseshoe and other shrinkage priors. Electron. J. Stat., 11(2), 5018-5051.
- Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic Learning in a Random World. Springer.
Information Theory & Cybernetics
- Ashby, W. R. (1956). An Introduction to Cybernetics. Chapman & Hall.
- Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379-423.
- Tishby, N., Pereira, F. C., & Bialek, W. (2000). The information bottleneck method. arXiv:physics/0004057.
- Touchette, H. & Lloyd, S. (2000). Information-theoretic limits of control. PRL, 84(6), 1156-1159.
AI Weather Forecasting
- Bi, K. et al. (2023). Accurate medium-range global weather forecasting with 3D neural networks. Nature, 619, 533-538. (Pangu-Weather.)
- Bodnar, C. et al. (2025). Aurora: a foundation model for the Earth system. Nature, 641, 1180-1187.
- Lam, R. et al. (2023). GraphCast: learning skillful medium-range global weather forecasting. Science, 382(6677), 1416-1421.
- Price, I. et al. (2024). GenCast: probabilistic weather forecasting with diffusion models. Nature, 637, 84-90.
Biometeorology & Medical
- Kimoto, K. et al. (2011). Influence of barometric pressure and humidity on the onset of clinical symptoms of migraine. Cephalalgia, 31(3), 338-343.
- Mukamal, K. J. et al. (2009). Weather and air pollution as triggers of severe headaches. Neurology, 72(10), 922-927.
Conformal Prediction & Calibration
- Barber, R. F. et al. (2023). Conformal prediction beyond exchangeability. Ann. Stat., 51(2), 816-845.
- Gibbs, I. & Candes, E. (2021). Adaptive conformal inference under distribution shift. NeurIPS 2021.
Mathematical Functions & Meteorology
- NWS (National Weather Service). Wind chill temperature index. NOAA Technical Memorandum.
- Rothfusz, L. P. (1990). The heat index “equation” (or, more than you ever wanted to know about heat index). NWS Southern Region Technical Attachment SR 90-23.