WEATHERVANE RESEARCH · FEBRUARY 2026

How to build an economic
forecasting model for everyone

The complete playbook for building, training, and scaling a causal demand intelligence platform — from first data point to network dominance.

Nathaniel Schmiedehaus

THE THESIS

Five things to know

Before the details — five claims this document will prove.

The data is achievable

574M observations per year from 3,143 counties × 500 categories. Two years of daily data reaches full causal identification. 3–5 design partners provide enough.

The moat isn't compute

~$95K Year 1 infrastructure — serious but 1,000× cheaper than frontier LLMs. The model architecture is an open research problem we solve via neural architecture search. The real moat is the causal graph that only grows with partners.

Known risks have known mitigations

Grokking, double descent, catastrophic forgetting — each is a real threat with peer-reviewed countermeasures. Grokfast accelerates grokking 50×. Deliberate overparameterization tames double descent. The playbook exists; execution is the challenge.

Growth improves the product

Stein's Paradox: every new partner improves every existing partner's estimates. A pool chemical company in Phoenix sharpens a raincoat brand's model in Seattle.

Weather is the wedge, not the ceiling

The LCDM is a general-purpose causal inference engine. Weather-sensitive products are the beachhead. The platform extends to pricing, inventory, creative, and economic nowcasting.

The AlphaFold parallel

AlphaFold didn't succeed because DeepMind had better neural networks. It succeeded because the Protein Data Bank — decades of painstaking crystallography — gave the model something to learn from. The architecture was necessary but not sufficient. The data was the moat.

[Figure: the same pattern, AlphaFold vs. the Weathervane LCDM. The model (Evoformer neural net ↔ IV + Stein shrinkage), the dataset (Protein Data Bank, 170K+ crystal structures from 50 years of crystallography ↔ causal demand panel, county × category × day, 574M obs/yr per partner), and the gap (the folding problem ↔ correlation vs. causation, with weather as the instrument). Breakthrough = model + proprietary dataset + bridged gap.]

Causal demand modeling has the same structure. The statistical methods exist (instrumental variables, James-Stein shrinkage, hierarchical Bayes). The weather data exists (NOAA, ERA5). What didn't exist was a panel of actual sales outcomes dense enough to estimate causal effects across thousands of county-category pairs. That's what Thagorus's partner network creates — and like the Protein Data Bank, each new contribution makes everyone's estimates more precise.

| | AlphaFold | Thagorus LCDM |
|---|---|---|
| The model | Transformer architecture | Causal inference (IV + Stein shrinkage) |
| The dataset | Protein Data Bank (crystallography) | Multi-partner sales panel (3,143 counties × 500 categories) |
| The gap | Computational chemistry vs. reality | Correlation vs. causation (weather as natural experiment) |
| Why it couldn't exist before | No ground-truth protein structures at scale | No multi-partner causal demand panel at county-category resolution |
| The moat | Years of crystallography data | Years of partner sales data + daily causal learning |
The pattern

Breakthrough = Model + Domain-Specific Empirical Data + Fast Feedback Loop. No single component is sufficient. The partner panel is Thagorus's Protein Data Bank — proprietary, compounding, and impossible to replicate from a standing start.

Nobody has assembled a multi-partner causal demand panel because the dataset simply doesn't pre-exist. Nielsen tracks sales. NOAA tracks weather. But nobody has systematically joined them at the county-category-day level, cleaned for confounders, and estimated causal effects. The data has to be built through partner integrations — and the only way to build it is to get live with real partners generating real signal.

This is why Thagorus's early partners are so valuable. They're not just customers paying for forecasts. They're co-creating the first dataset of its kind.

Why now

Three things converged in the last three years that make this possible for the first time.

1
Causal ML matured. Double/debiased machine learning, generalized random forests, and modern IV estimators moved from academic papers to production-grade libraries (EconML, DoWhy) in the last 3 years. The methods existed in theory; now they run at scale.
2
Weather data got granular. NOAA's HRRR model now provides hourly, 3km-resolution forecasts. ERA5 reanalysis covers 40+ years at county level. The instrumental variable (weather) went from noisy proxy to high-resolution signal.
3
Retail data infrastructure exists. Shopify, Amazon Seller Central, POS APIs, and data clean rooms mean partners can share sales data with privacy guarantees. Five years ago, assembling a multi-retailer panel required enterprise data deals. Now it's an API integration.

Any one of these alone isn't enough. Together, they make a Large Causal Demand Model feasible for the first time — the same way AlphaFold required the convergence of transformers, the Protein Data Bank, and sufficient compute to train on both.

The Data Flywheel

Every new partner is simultaneously a customer and a data source. Their sales signal improves the model for everyone — including themselves. Revenue and data accumulation are the same act.

Chapter 1

The data

What the model needs to learn

The LCDM's panel dataset: county × category × day, ~60 dimensions per observation.

Weather (Raw): ~22 dims
Temperature (°F)
Humidity (%)
Precipitation (in)
Wind speed (mph)
UV index
Heat index
Dew point
Cloud cover (%)
Barometric pressure
Visibility (mi)
Weather (Derived): ~30 dims
Anomaly vs 30yr normal
7-day temp trend (slope)
Heating degree days
Cooling degree days
Feels-like delta
Temp × humidity interaction
Precip × weekend interaction
UV × season interaction
Sales & Demand: ~18 dims
Daily units sold
Revenue ($)
Avg order value
Gross margin
Return rate
Conversion rate
Basket size
Ad Spend & Perf: ~20 dims
Google spend ($)
Meta spend ($)
TikTok spend ($)
Amazon spend ($)
Total impressions
CTR by channel
CPC by channel
ROAS by channel
Temporal & Calendar: ~16 dims
Day of week (one-hot)
Month (cyclical sin/cos)
Week of year
Holiday flag
Payday proximity
Inventory & Supply: ~8 dims
Stock level (units)
Days of supply
Reorder point proximity
Stockout probability
Promotions & Pricing: ~10 dims
Active promo flag
Discount depth (%)
Promo channel
Days into promo
Competitor promo flag
Macro & External: ~14 dims
CPI (regional)
Gas price (local)
Unemployment rate
Consumer confidence index
Mortgage rate
Cross-Entity: ~18 dims
County population
Median income
Urban/rural classification
Category penetration rate
3,143 counties × ~500 categories × 365 days = >574M observations/year, each carrying ~156 raw features (~60 effective dimensions after interaction reduction)
Sports drinks on shelves — weather drives what people buy
Weather-sensitive products span every aisle — beverages, apparel, outdoor, home, pharmacy
574M observations per year · 1.15B observations at the 2-year target for full identification · ~60 dimensions per observation · F > 20 instrument-strength target

Most demand forecasting is correlation-based. A model sees that ice cream sales and sunscreen sales both spike in July and concludes they're related. But correlation isn't actionable — you can't intervene on a correlation.

Thagorus's approach is structural, not statistical. Weather is an instrumental variable — it affects demand but isn't affected by pricing, promotions, or competitor actions. This lets us isolate the causal effect of weather on demand, the way a randomized trial isolates drug effects. The result isn't "these things tend to move together." It's "a 10-degree temperature anomaly in Harris County causes a 23% increase in sunscreen demand within 48 hours, holding all else equal."

That's the difference between a correlation and a prescription. One tells you what happened. The other tells you what to do.

The LCDM's identification strategy rests on a panel dataset where the cross-sectional unit is a county × product-category pair observed daily. Each observation carries weather features, sales signals, ad-spend by channel, inventory levels, and promotional calendars — roughly 60 effective dimensions per observation after interaction reduction. Weather events serve as instrumental variables across all 3,143 counties simultaneously.

3,143 U.S. counties × ~500 product categories × 365 days per year = 573,597,500 observations per year, rounded to 574M. Each observation is a unique county-category-day triple with its associated feature vector.
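
The headline number is direct arithmetic, easy to verify:

```python
# Panel size from the figures above: counties × categories × days.
counties = 3_143      # U.S. county equivalents (Census)
categories = 500      # extended BLS expenditure categories
days = 365

obs_per_year = counties * categories * days
print(f"{obs_per_year:,}")  # 573,597,500 — rounds to ~574M
```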

The U.S. Census Bureau defines 3,143 county-equivalent administrative divisions (3,007 counties + 136 county equivalents in Louisiana, Alaska, and independent cities). Counties are the finest geographic resolution at which both NOAA weather data and BLS economic data are consistently available, making them the natural unit for the LCDM's spatial panel.

The Bureau of Labor Statistics Consumer Expenditure Survey defines roughly 300 base expenditure categories. We extend this to ~500 by adding weather-sensitive subcategories: splitting "outdoor recreation" into equipment vs. apparel, "beverages" into hot vs. cold, "home improvement" into indoor vs. outdoor projects. The additional granularity is necessary because weather affects subcategories asymmetrically — a heat wave boosts iced coffee but suppresses hot coffee, and the model needs to see both.

| Window | Observations | Seasonal Cycles | IV Power |
|---|---|---|---|
| 90 days | ~143M | 0.25 | Weak — network priors only |
| 6 months | ~287M | 0.5 | Noisy; "memorization phase" |
| 1 year | ~574M | 1.0 | Baseline seasonal coverage |
| 2 years | ~1.15B | 2.0 | Full identification regime |

Causal identification via weather instruments requires observing county-category pairs across multiple extremes. Two years provide at minimum two independent realizations of each seasonal extreme, enabling F > 20 for weather-sensitive categories.

| Signal | Dimensions | Source |
|---|---|---|
| Weather | ~20 | NOAA GHCN-Daily + forecast APIs |
| Sales/demand | ~8 | Shopify, Amazon SP-API, POS |
| Ad spend | ~12 | Google Ads, Meta, TikTok APIs |
| Inventory | ~4 | Tenant ERP / inventory system |
| Promotions | ~6 | Tenant promo calendar |
| Macro | ~10 | FRED, BLS, Census |
Key Insight

Data quality dominates quantity for causal inference. Missing confounder data creates omitted variable bias that no amount of additional observations can fix. This is why design partner onboarding requires OAuth access to all ad platforms and sales channels — partial data destroys identification.

Children playing in the rain
Weather shapes demand. The model learns why.
Chapter 2

The model

How the LCDM is built

The LCDM: six stages from raw data to causal predictions.

[Figure: LCDM pipeline. Raw data (weather + sales + ads) → features (~60 dims/obs) → 12-layer transformer encoder → causal identification (IV, F > 20) → Stein shrinkage pooling → 10-day forecasts. Daily morning cycle: raw data to predictions in <200ms.]

The Large Causal Demand Model uses the same transformer architecture behind ChatGPT — but instead of predicting the next word, it predicts the next day's demand and identifies what caused it. The architecture is not fixed: we treat it as a research problem solved through neural architecture search, starting from a strong baseline and optimizing for our specific data structure.

1. Data ingestion

Every morning, the pipeline pulls the previous day's weather from NOAA and forecast APIs, sales data from Shopify and Amazon, ad spend from Google and Meta, and inventory snapshots from each partner. These feeds are aligned to a consistent county-category-day grid.

2. Feature engineering

Raw temperature tells you almost nothing. The feature engine transforms it into weather anomalies — deviations from the 30-year normal for that county and day. 95°F in Phoenix is unremarkable; 95°F in Seattle is a five-sigma event. The engine also constructs interaction terms: the humidity-temperature combo that drives "feels-like" discomfort, the UV trajectory over 5 days that predicts sunscreen demand, the wind chill delta that triggers coat purchases.
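
A minimal sketch of the anomaly encoding (the 30-year normals below are invented for illustration):

```python
from statistics import mean, stdev

def weather_anomaly(temp_f, normals):
    """Z-score of today's temperature against the 30-year normal
    for this county and calendar day."""
    return (temp_f - mean(normals)) / stdev(normals)

# Invented July normals for two counties:
phoenix_july = [104, 106, 103, 107, 105, 106, 104, 108, 105, 106]
seattle_july = [75, 77, 74, 78, 76, 73, 77, 75, 76, 74]

print(round(weather_anomaly(95, phoenix_july), 1))  # negative: below normal in Phoenix
print(round(weather_anomaly(95, seattle_july), 1))  # large positive: extreme in Seattle
```

The same 95°F reading maps to opposite-signed features in the two counties, which is exactly why the model sees anomalies rather than raw temperatures.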

3. Transformer encoder

The core is a transformer adapted for multivariate time series. The starting architecture — informed by BERT-base and time-series foundation models like MOIRAI and TimesFM — uses ~12 layers with a ~768-dimensional embedding. But these are hyperparameters to be optimized via NAS, not gospel. What matters is that each attention head learns different temporal and cross-category patterns: one head might learn that rain in Miami suppresses outdoor dining within hours; another, that a Midwest cold snap predicts heating equipment demand 3–5 days later.

4. Causal identification layer

This is where the LCDM diverges from every other demand model. Standard models learn correlations — "when it's hot, sunscreen sells." The causal layer asks: "how much of this increase is caused by the heat, and how much would have happened anyway due to summer promotions, school breaks, or seasonal trends?" Weather is a natural experiment — nobody controls or predicts it perfectly. When an unexpected heat wave hits Florida but not New York, and sunscreen sales spike only in Florida, the difference is causally attributable to the heat. The causal layer uses this logic at scale across all 3,143 counties.
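
The logic can be sketched with a toy single-instrument estimator on synthetic data (all numbers invented; the production layer runs panel IV, not this two-variable Wald ratio). Plain regression absorbs the unobserved confounder; the instrument does not:

```python
import random

def slope(x, y):
    """OLS slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var

def iv_slope(z, x, y):
    """Single-instrument IV (Wald) estimate: cov(z, y) / cov(z, x)."""
    return slope(z, y) / slope(z, x)

random.seed(0)
n = 10_000
z = [random.gauss(0, 1) for _ in range(n)]       # weather shock (exogenous)
u = [random.gauss(0, 1) for _ in range(n)]       # unobserved confounder (promos, season)
x = [zi + ui for zi, ui in zip(z, u)]            # observed driver, confounded by u
y = [2.0 * xi + 3.0 * ui + random.gauss(0, 0.1)
     for xi, ui in zip(x, u)]                    # true causal effect = 2.0

print(round(slope(x, y), 2))      # naive OLS: biased upward (≈3.5)
print(round(iv_slope(z, x, y), 2))  # IV recovers ≈2.0
```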

5. Network pooling

James-Stein shrinkage pools estimates across all partners. A new partner with 90 days of data inherits the statistical power of the entire network. The shrinkage factor is learned per category and geography — this is the mathematical foundation of Thagorus's network effect.
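
The pooling step can be sketched with the classic positive-part James-Stein formula (toy numbers; in production the shrinkage factor is learned per category and geography, as noted above):

```python
def james_stein(estimates, noise_var):
    """Shrink noisy per-county estimates toward their pooled mean.
    Positive-part James-Stein with an assumed known noise variance."""
    k = len(estimates)
    m = sum(estimates) / k
    spread = sum((e - m) ** 2 for e in estimates)
    shrink = max(0.0, 1 - (k - 2) * noise_var / spread)
    return [m + shrink * (e - m) for e in estimates]

raw = [0.5, 3.0, 1.2, 2.4, 0.9]       # noisy county-level effect estimates
pooled = james_stein(raw, noise_var=0.5)
print([round(v, 2) for v in pooled])  # every estimate pulled toward the mean
```

The noisier the individual estimates relative to their spread, the harder they are pulled toward the pooled mean; a brand-new partner's 90-day estimates start mostly pooled and individualize as their data accumulates.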

6. Daily predictions

Every morning, the trained model produces a 10-day demand forecast for each partner, category, and geography — with confidence intervals from ensemble weather uncertainty, causal attribution breakdowns, and recommended budget adjustments.

The 12-layer / 768-dim / 12-head configuration is a well-studied starting point, not a final architecture. BERT-base, GPT-2 Small, and time-series foundation models (MOIRAI, TimesFM, Chronos) all converge on similar dimensions for medium-scale tasks. Our NAS pipeline will systematically explore width, depth, attention head count, and the causal sub-network structure. The final architecture will almost certainly differ — the point is that we start from a strong, well-understood baseline rather than guessing.

The target parameter range is ~200–400M, informed by Chinchilla scaling laws applied to our dataset size (~1.15B observations). For a BERT-base-style starting point: 768 embedding dim × 12 heads × 12 layers gives ~85M in the transformer stack. Add input embeddings (~46M), positional encodings, output heads, and the causal identification sub-network (~40M for IV estimation + Stein shrinkage) — baseline ~270M. NAS will explore the 100M–500M range to find the compute-optimal point.

At any point in this range, the model is firmly "medium" — 10–50× smaller than GPT-2 Large, small enough for single-GPU inference in <200ms, but large enough to capture the nonlinear interactions between weather, geography, category, and demand that make causal identification work.

An instrumental variable must satisfy three requirements:

1. Relevance: The instrument must be correlated with the treatment (weather must actually affect consumer behavior). Measured by the first-stage F-statistic — F > 10 is the minimum (Staiger & Stock, 1997), F > 20 is strong (Stock & Yogo, 2005).

2. Independence: The instrument must be unrelated to confounders (weather cannot be caused by your marketing budget or competitor actions). Weather is exogenous by definition.

3. Exclusion: The instrument must only affect the outcome through the treatment. Addressed by daily × county resolution, which controls for supply-side effects and competitive responses.
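
The relevance condition (1) is directly checkable; here is a minimal sketch with one instrument and synthetic data (with a single instrument, F is just the squared first-stage t-statistic):

```python
import math
import random

def first_stage_f(z, x):
    """First-stage F for a single instrument: regress treatment x on
    instrument z and return the squared t-statistic of the slope."""
    n = len(z)
    mz, mx = sum(z) / n, sum(x) / n
    szz = sum((zi - mz) ** 2 for zi in z)
    b = sum((zi - mz) * (xi - mx) for zi, xi in zip(z, x)) / szz
    a = mx - b * mz
    rss = sum((xi - (a + b * zi)) ** 2 for zi, xi in zip(z, x))
    se_b = math.sqrt(rss / (n - 2) / szz)
    return (b / se_b) ** 2

random.seed(1)
z = [random.gauss(0, 1) for _ in range(500)]      # instrument (weather anomaly)
x = [0.3 * zi + random.gauss(0, 1) for zi in z]   # modestly relevant treatment
print(round(first_stage_f(z, x), 1))              # comfortably above the F > 10 minimum
```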

Data preprocessing: Raw feeds are cleaned, aligned to the county-category-day grid, and split into train (80%), validation (10%), and test (10%) sets with temporal splits to prevent data leakage.

Batching: Mini-batches of 2,048 county-category-day observations, stratified by geography and category.

Optimizer: AdamW with weight decay 0.01, betas (0.9, 0.999), gradient clipping at 1.0.

Learning rate schedule: Linear warmup over 1,000 steps to peak LR of 3e-4, then cosine decay to 1e-5.

Checkpointing: Save every 500 steps. Keep best 5 checkpoints by validation loss. Final model is an exponential moving average of the last 3 checkpoints.

Validation: Evaluate on held-out counties (spatial generalization) and held-out time periods (temporal generalization) separately.

Early stopping: Halt if validation loss does not improve for 10 consecutive evaluations (5,000 steps). Combined with weight decay, this prevents epoch-wise double descent.
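
The warmup-plus-cosine schedule above is a pure function of the step count; a sketch (the total step count here is an assumption):

```python
import math

def lr_at_step(step, warmup=1_000, total=50_000, peak=3e-4, floor=1e-5):
    """Linear warmup to the peak LR, then cosine decay to the floor,
    per the schedule described above. `total` is illustrative."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

print(lr_at_step(500))     # mid-warmup: half the peak rate
print(lr_at_step(1_000))   # exactly the peak LR
print(lr_at_step(50_000))  # fully annealed to the floor
```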

Chapter 3

The science

Two surprising things that happen during training

Grokking: the model memorizes first, then suddenly learns the real pattern.

[Figure: train vs. test loss over training steps (log scale). Train loss falls early; test loss sits on a memorization plateau, then drops sharply at the grokking transition.]

Imagine a student who memorizes every answer for a math exam — "Question 7 is 42, Question 12 is 17" — and aces it. New exam, different numbers: they fail. They keep studying. For weeks, nothing visibly changes. Then suddenly, overnight, they understand the underlying math and can solve any problem they've never seen.

What changed? The student stopped memorizing individual data points and started discovering the structure underneath. For the LCDM, that means the model transitions from memorizing "90°F in Miami on June 3 → sunscreen +15%" to understanding the mechanism: it's not the absolute temperature that matters, but the deviation from the 30-year normal, modulated by humidity interactions, baseline seasonal demand, and how quickly the weather changed. It discovers that the same mechanism operates differently across geographies — 90°F in Phoenix barely registers while 90°F in Portland is a demand shock. It finds cross-category cascades: the same heat event that boosts pool chemical sales also suppresses hot coffee and outdoor dining, and these relationships are causal, not just correlated.

That's the "extra information" — not more data points, but the discovery of latent causal structure that was always present in the training data but invisible to a memorizing model. Neel Nanda et al. (2023) showed this transition happens because the model's internal representations reorganize from lookup tables to generalizable circuits. Grokfast (Lee et al., 2024) accelerates this 50× by amplifying slow-varying gradient components, reducing the grokking budget from thousands of GPU-hours to hundreds.
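
The Grokfast-EMA idea is simple enough to sketch: low-pass filter each parameter's gradient history and amplify the slow-varying component before the optimizer step (hyperparameters below are illustrative, not the paper's tuned values):

```python
def grokfast_filter(grad_stream, alpha=0.98, lamb=2.0):
    """Grokfast-EMA sketch: h_t = alpha*h_{t-1} + (1-alpha)*g_t,
    then g_hat = g_t + lamb*h_t. Persistent (slow) gradient
    directions get amplified; transient noise mostly cancels."""
    h = [0.0] * len(grad_stream[0])
    out = []
    for g in grad_stream:  # one gradient vector per training step
        h = [alpha * hi + (1 - alpha) * gi for hi, gi in zip(h, g)]
        out.append([gi + lamb * hi for gi, hi in zip(g, h)])
    return out

steps = [[1.0, -0.5]] * 200          # a persistent gradient direction...
filtered = grokfast_filter(steps)
print(filtered[-1])                  # ...ends up amplified roughly (1 + lamb)x
```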

| Scenario | Extra Chip-hrs | Cost @ $0.48/hr | Cost @ $1.20/hr |
|---|---|---|---|
| With Grokfast (50×) | ~200 | $96 | $240 |
| Standard (3× train time) | ~3,600 | $1,728 | $4,320 |
| Worst case (10× train time) | ~12,000 | $5,760 | $14,400 |

Reference: Nanda, N. et al. (2023). "Progress measures for grokking via mechanistic interpretability." ICLR 2023.

Double descent: more parameters can actually reduce error, defying classical intuition.

[Figure: test error vs. model complexity (parameters). The classical U-curve predicts rising error past the interpolation threshold; the actual curve peaks there, then descends again in the overparameterized regime, where the LCDM sits.]

Classical ML teaches that there's a "sweet spot" for model size: too small and it underfits, too large and it overfits. This U-shaped bias-variance trade-off is every ML textbook's chapter 1. It turns out to be incomplete.

When you keep making the model bigger past the point where it can perfectly memorize the training data (the "interpolation threshold"), something unexpected happens: test error peaks, then descends again. Larger models actually perform better, not worse. This is double descent, documented by Belkin et al. (2019) and Nakkiran et al. (2021), and it overturns fifty years of statistical intuition. For the LCDM, it means deliberate overparameterization — combined with weight decay and early stopping — is a feature, not a risk.

Model-wise double descent: Test error peaks when model capacity matches dataset size, then decreases as the model grows.

Epoch-wise double descent: Test error decreases, increases during the "critical regime," then decreases again. Eliminated by early stopping + tuned weight decay.

Sample-wise double descent: Adding more data can temporarily worsen performance. Resolves with either more data or larger models.

LCDM position: At ~270M parameters on ~1.15B observations (ratio 1:4.3), we're firmly in the overparameterized regime. The strongest defense is deliberate overparameterization + weight decay + early stopping + hierarchical shrinkage.

Catastrophic forgetting: Updating with new partner data risks degrading existing partners. Mitigated via elastic weight consolidation (EWC) and monthly full-network retrains on the complete pooled dataset.
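
EWC's core is a quadratic anchor on weights that mattered for existing partners; a sketch with hypothetical names and toy values (Kirkpatrick et al., 2017):

```python
def ewc_penalty(params, anchor, fisher, lamb=100.0):
    """Elastic weight consolidation: penalize movement away from
    `anchor` (weights after training on existing partners), scaled by
    Fisher information `fisher` (how much each weight mattered).
    `lamb` is illustrative."""
    return 0.5 * lamb * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor, fisher)
    )

anchor = [0.8, -1.2, 0.3]
fisher = [5.0, 0.1, 2.0]     # first weight is critical, second barely used
drifted = [0.9, -0.2, 0.3]   # a new-partner update moved the first two
print(ewc_penalty(drifted, anchor, fisher))
```

The large drift on the unimportant second weight costs little; the small drift on the critical first weight dominates the penalty, which is exactly the behavior that protects existing partners.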

Distribution shift: Weather-demand relationships are non-stationary. Addressed with anomaly encoding (deviations from 30-year normals, not raw temps), rolling retraining, and 4σ divergence detection with automatic checkpoint fallback.

Mode collapse: Prevented by conformal prediction wrappers, ensemble disagreement monitoring across 5 checkpoints, and heteroscedastic output heads.

Instrument strength: The first-stage F-statistic must exceed 10 and ideally 20+ for reliable causal estimates. Monte Carlo simulations show 24 months of data achieves F > 20 for 80% of county-category pairs.

Storm clouds forming over landscape

Standing on the shoulders of AI weather

The LCDM doesn't predict weather — it consumes the best weather forecasts available and uses them as instruments. A new generation of AI weather models produces forecasts that rival the European Centre (ECMWF) at a fraction of the cost.

| Model | Developer | Resolution | Inference | Notes |
|---|---|---|---|---|
| GraphCast | Google DeepMind | 0.25° | <1 min | 10-day forecast, single TPU, open weights |
| GenCast | Google DeepMind | 0.25° | ~8 min | Probabilistic ensembles, diffusion model |
| FourCastNet | NVIDIA | 0.25° | <2 sec | Fourier Neural Operator, 7-day forecast |
| Pangu-Weather | Huawei | 0.25° | ~1.4 sec | 3D Earth-specific transformer |
| Aurora | Microsoft | 0.1° | <1 min | Foundation model, flexible fine-tuning |

Physics + ML hybrid

Pure ML risks learning spurious correlations. Pure physics can't capture nonlinear demand responses. The LCDM uses a hybrid approach:

  • Physics-informed priors: Known relationships are encoded as Bayesian priors, not hard constraints. The model can override them with sufficient data.
  • Ensemble weather inputs: We ensemble GraphCast + GenCast + NOAA GFS. Ensemble disagreement provides built-in uncertainty for downstream causal estimates.
  • Physics-guided regularization: The loss function penalizes causal estimates that violate known physical constraints (e.g., negative temperature elasticity for heating products above 70°F).
  • Interpretable + flexible: The physics component is fully interpretable. The ML component captures nonlinearities the physics can't. Partners see both layers.
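
The sign-constraint penalty in the third bullet can be sketched directly (function names, shapes, and the weight are hypothetical):

```python
def physics_penalty(elasticities, temps_f, weight=10.0):
    """Penalize causal estimates that violate a known constraint:
    heating-product demand should not rise with temperature above 70°F
    (the example constraint from the text). Added to the training loss."""
    total = 0.0
    for beta, t in zip(elasticities, temps_f):
        if t > 70 and beta > 0:  # positive elasticity where it should be <= 0
            total += beta ** 2
    return weight * total

# One violation (+0.4 elasticity at 75°F) contributes 10 * 0.4^2 = 1.6:
print(round(physics_penalty([0.4, -0.2, 0.1], [75, 80, 65]), 3))  # 1.6
```

Because it is a soft penalty rather than a hard constraint, sufficient contrary evidence in the data can still override the prior, as the first bullet requires.
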
Cost Advantage

GraphCast generates a 10-day global forecast in under 1 minute on a single TPU v4. At $3.22/hr for a TPU v4, that's about $0.05 per global forecast — roughly $18/year for daily county-level weather inputs. ECMWF's operational HRES system costs on the order of $50M/year to run; we get comparable forecast quality for a minuscule fraction of that.

Chapter 4

The economics

What it takes to build the economic graph

Building the economic graph is a multi-stage, capital-intensive engineering challenge.

Seed ($2–5M): 3–5 partners. Prove the model.
Series A ($10–20M): 20–50 partners. Scale the team.
Series B ($40–80M): 200+ partners. Build the platform.
Series C+ ($150M+): 1,000+ partners. Economic infrastructure.

We considered selling weather-demand analytics as a tool. But demand data is fragmented across thousands of retailers, each with different POS systems, category taxonomies, and data quality. Selling into that fragmentation means years of enterprise sales cycles and bespoke integrations.

Instead, Thagorus controls the full inference pipeline: data ingestion, causal estimation, forecast generation, and decision delivery. Partners send us data; we send them decisions. We own the statistical methodology, the cross-partner shrinkage, and the feedback loop. This full-stack approach is more capital-intensive, but it's the only way to ensure the causal estimates are actually valid — and the only way to capture the network effects that make the model improve with scale.

The capital buys time to recruit PhD-level causal inference researchers, secure enterprise data partnerships that take 6–18 months to close, build SOC 2-compliant infrastructure, and stand up real-time serving for millions of daily predictions — before competitors realize what's possible.

Seed ($2–5M): 3–4 person team (ML + eng), $95K compute/infra, 3–5 design partners, SOC 2 Type I. Proves the model works on real data.

Series A ($10–20M): 12–18 person team, $200K–$500K/yr compute, 20–50 partners, SOC 2 Type II, enterprise sales motion. Proves the business.

Series B+ ($40–80M): 40–60 person team, $1M–$3M/yr compute, 200+ partners, in-house GPU cluster, FedRAMP. Builds the platform.

| Investment Area | Seed | Series A | Series B+ |
|---|---|---|---|
| Team | 3–4 (ML + eng) | 12–18 (ML, eng, data, sales) | 40–60 (full org) |
| Compute + Infra | $25K–$50K/yr | $200K–$500K/yr | $1M–$3M/yr |
| Data partnerships | 3–5 design partners | 20–50 paid + BDRs | 200+ self-serve + enterprise |
| Compliance & security | SOC 2 Type I | SOC 2 Type II, pen testing | FedRAMP, financial regs |
| GTM + Sales | Founder-led | 2–3 AEs + marketing | Full sales org + partnerships |

Foundation model companies raise billions because compute is their moat. Thagorus's moat is the causal graph — the data network, not the compute.

| | Foundation Model | LCDM |
|---|---|---|
| Parameters | ~1.8 trillion | 270 million |
| Training data | The entire internet | 1.15B structured observations |
| Where $ goes | Compute (thousands of GPUs) | Team, data, partners, GTM |
| Moat source | Scale of compute | Scale of causal graph |
| Defensibility | Anyone with $100M+ can try | 2+ years of multi-tenant data |

This is why $150M builds a category-defining economic intelligence platform while $10B builds one more language model. The capital goes to assets that compound — data partnerships, the causal graph, regulatory moats. The LCDM's architecture is defensible but ultimately replicable. The multi-partner causal panel is neither. Lead with the data asset, not the model.

At each stage, the platform becomes something qualitatively different:

Stage 1: Weather intelligence (Seed → Series A)

5–20 partners in weather-sensitive categories. The LCDM resolves DMA-level causal effects. Partners get causal demand signals no existing tool can provide. ARR: $120K–$2M.

Stage 2: Demand intelligence (Series A → Series B)

50–200 partners across dozens of verticals. The causal graph resolves county-level effects and transcends weather. Partners start asking: "what's causing the demand shift in Phoenix this week?" The model begins seeing cross-category demand cascades. This is the inflection point. ARR: $5M–$25M.

Stage 3: Economic forecasting (Series B → IPO)

500–5,000+ partners. The causal graph becomes a nowcasting engine for the real economy. Transaction signals from thousands of businesses, combined with weather and new instruments, create something that doesn't exist today — a real-time, causally-identified model of consumer economic behavior at county-level granularity. ARR: $50M+.

How costs evolve

Costs scale sublinearly. Adding partners doesn't mean retraining from scratch. At 1,000 partners: ~$200K/year training, ~$150K/year inference, ~$100K/year infra. Total ~$450K/year on $50M+ ARR = 99%+ gross margin.

Infrastructure budget

Year 1 Cost Breakdown — Spot Optimized

| Item | Chip-hours | Cost |
|---|---|---|
| Base training runs | 3,000 | $2,160 |
| Hyperparameter search (12×) | 36,000 | $17,280 |
| Neural architecture search | 12,000 | $5,760 |
| Grokking buffer (5×) | 15,000 | $7,200 |
| Ablation & validation | 6,000 | $2,880 |
| Monthly retrains (12×) | 36,000 | $25,920 |
| Inference serving (24/7) | 8,760 | $6,307 |
| Weather APIs (3 providers) | | $8,400 |
| Data storage & pipeline | | $4,800 |
| Cloud infra & monitoring | | $6,000 |
| Spot preemption overhead | | $8,300 |
| Year 1 Total (blended spot/on-demand @ $0.48–$0.72/chip-hr) | | $95,007 |

The standard FLOPs estimate for transformer training is C = 6 × N × D, where N is parameters and D is dataset size.

N = 200–400M parameters (NAS range). D = 1.15B observations. Using the midpoint (300M): C = 6 × 300M × 1.15B = 2.07 × 10^18 FLOPs per pass.

A TPU v5e delivers ~393 TFLOPS (BF16) sustained. At ~40% practical throughput: ~4.4 chip-hours per epoch. 200–400 epochs → ~3,000 chip-hours base.

Multipliers: 12× HPO sweeps across the architecture search space, 5× grokking buffer (reduced by Grokfast), 2× ablation & validation, 12× monthly production retrains.

Compute total: ~108,000 chip-hours at blended $0.48–$0.72/hr = ~$61,200. Add 24/7 inference serving ($6,300), weather APIs × 3 providers ($8,400), data pipeline ($4,800), cloud infra ($6,000), and spot preemption overhead ($8,300) = ~$95K Year 1.

This is 3–4× higher than a naive spot-only estimate because it accounts for: mixed spot/on-demand pricing, NAS requiring broader search, inference costs for production serving, and the overhead of preemption recovery.
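
The core of the estimate can be re-run in a few lines (throughput and utilization figures as assumed in the text):

```python
# C = 6 * N * D, the standard transformer training FLOPs estimate.
N = 300e6                  # parameters (NAS midpoint)
D = 1.15e9                 # observations
C = 6 * N * D
print(f"{C:.2e}")          # 2.07e+18 FLOPs per pass

sustained = 393e12 * 0.40  # TPU v5e BF16 at ~40% practical throughput
hours_per_pass = C / sustained / 3600
print(round(hours_per_pass, 1))  # ~3.7 chip-hours at these idealized numbers,
                                 # in the ballpark of the ~4.4 quoted above
```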

Google Cloud TPU (Q1 2026 rates):

| Chip | On-Demand | 1-yr CUD | 3-yr CUD | Spot |
|---|---|---|---|---|
| TPU v5e | $1.20 | $0.84 | $0.54 | ~$0.48 |
| TPU v5p | $4.20 | $2.94 | $1.89 | ~$1.68 |
| TPU v6e | $1.38 | $0.97 | $0.55 | ~$0.55 |

AWS EC2 GPU:

| Instance | GPUs | On-Demand | Per-GPU | Spot |
|---|---|---|---|---|
| p4d.24xlarge | 8× A100 | $22.03/hr | $2.75 | ~$7.20 |
| p5.48xlarge | 8× H100 | $33.10/hr | $4.14 | ~$13.20 |

Local vs. cloud

An in-house GPU workstation amortizes to competitive hourly rates. The recommended approach is hybrid — local for daily work, cloud for burst compute.

Local (4× RTX 4090)

$13,000 upfront, amortizes to $0.12/GPU-hr over 3 years. Always available, no spot preemption. ~35,000 GPU-hours/year at 24/7 utilization.

Cloud (TPU v5e spot)

$0.48/chip-hr with no commitment. Elastic scaling to 32+ chips for hyperparameter search. Year 1 cloud estimate: ~$6,900 for burst HPO + architecture search.

| Configuration | Upfront | Amortized/yr | Effective $/GPU-hr |
|---|---|---|---|
| 1× RTX 4090 workstation | ~$3,500 | $1,167 | $0.13 |
| 4× RTX 4090 server | ~$13,000 | $4,333 | $0.12 |
| Lambda Scalar (8× A6000) | ~$48,000 | $16,000 | $0.23 |
| NVIDIA DGX A100 | ~$199,000 | $66,333 | $0.95 |
Sunscreen products displayed in warm light
From data to decisions


Chapter 5

The strategy

Solving the cold start

The causal demand panel doesn't exist in any usable form. Not at Nielsen. Not at IRI. Not at any retailer. The data is the company — and it only exists if we build it. Every decision should be evaluated against: does this get us to 10 partners with 12 months of data faster?

Workstream A
Synthetic Proof-of-Concept

Build a fully functional demo using synthetic demand data calibrated to real weather. BLS retail sales indices + NOAA weather generate realistic panels.

Workstream B
Design Partner Program

3–5 DTC e-commerce brands provide OAuth access in exchange for 6–12 months free service. They get analytics they could never build in-house. We get the data to train the model.

Workstream C
TSFM Baseline

Fine-tune MOIRAI on available weather-demand data. Delivers a working product within weeks of data access while the LCDM trains.

Workstream D
Public Data Pre-training

Pre-train on BLS Consumer Expenditure Survey, Census Retail Trade, FRED, Kilts-Nielsen panels. Transfer learning cuts per-partner data requirements dramatically.

Ideal design partners

Tier 1 — Dream
National Home Improvement

Think Home Depot, Lowe's, Tractor Supply. Thousands of SKUs from HVAC to outdoor furniture. Strong seasonality, massive geographic spread, years of POS data.

~500+ categories · ~2,000 locations · $5B+ ad spend · 10+ yrs history
Tier 2 — Sweet Spot
DTC Outdoor / Seasonal Brand

Companies like YETI, Hydro Flask, Solo Stove, Traeger. Clear weather signal, Shopify or D2C data, $5M–$50M ad spend. Fast to onboard via API.

20–100 categories · National DTC · $5M–$50M ad spend · Shopify/Amazon
Tier 2 — Sweet Spot
CPG Beverage / Food

Companies like Liquid Death, Athletic Brewing, Olipop. Beverage sales are strongly weather-driven. Both DTC and retail channel data available.

10–50 SKUs · Multi-channel · $10M+ ad spend · Strong weather signal
Tier 3 — Quick Win
Seasonal Apparel / Beauty

Sunscreen, outdoor apparel, seasonal skincare. Shopify-native brands with clear weather sensitivity. Fast onboarding, immediate signal.

5–30 SKUs · Shopify · $1M–$10M ad spend · 90-day onboard

A company selling both sunscreen and books provides a natural within-company control. When a heat wave hits and sunscreen sales spike while book sales stay flat, we can isolate the causal effect of temperature on sunscreen — the books are the control group.
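The within-company control logic is standard difference-in-differences. A toy sketch with synthetic numbers (all sales figures are invented for illustration):

```python
# A heat wave is the "treatment", sunscreen the treated category,
# books the within-company control. Difference-in-differences
# isolates the weather effect from company-wide shocks.

# Daily units sold: 5 days before the heat wave, 5 days during.
sunscreen_pre, sunscreen_during = [100, 98, 103, 101, 99], [160, 158, 165, 161, 157]
books_pre, books_during = [50, 51, 49, 50, 52], [51, 50, 52, 49, 51]

mean = lambda xs: sum(xs) / len(xs)

# Change in each category across the event window.
delta_sunscreen = mean(sunscreen_during) - mean(sunscreen_pre)  # weather + common shocks
delta_books = mean(books_during) - mean(books_pre)              # common shocks only

# Subtracting the control's change removes company-wide trends.
heat_wave_effect = delta_sunscreen - delta_books
print(f"Estimated heat-wave effect: {heat_wave_effect:+.1f} units/day")  # +59.8
```

The book sales absorb anything that hit the whole company that week (a promo, a press mention), so what remains is attributable to the weather.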

Integration | Purpose | Effort
Shopify / Amazon SP-API | Sales, orders, inventory by SKU/day | OAuth, <1 hr
Google Ads API | Spend, impressions, clicks by campaign/day | OAuth, <1 hr
Meta Marketing API | Spend, reach, conversions by campaign/day | OAuth, <1 hr
Historical data export | Backfill 12–24 months | CSV/API, 1–3 hrs
Data sharing agreement | Legal, NDA, data usage terms | 1–2 weeks

A typical Tier 2 partner with 50 categories across 200 DMAs generates 50 × 200 × 365 = 3.65M observations per year. With ~60 dimensions per observation, that's 219M data points annually. Three such partners provide 10.95M observations — enough for statistically significant causal estimates within 6 months.
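Spelled out as code (all figures are the ones quoted above):

```python
# Back-of-envelope panel size for one Tier 2 partner:
# 50 categories x 200 DMAs at daily grain, ~60 dimensions per observation.
categories, dmas, days, dims = 50, 200, 365, 60

obs_per_partner_year = categories * dmas * days      # 3,650,000
datapoints_per_year = obs_per_partner_year * dims    # 219,000,000
three_partner_pool = 3 * obs_per_partner_year        # 10,950,000

print(f"{obs_per_partner_year:,} observations per partner-year")
print(f"{datapoints_per_year:,} data points per partner-year")
print(f"{three_partner_pool:,} pooled observations across three partners")
```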


Why every new partner makes every existing partner better

As the network grows, minimum data requirements drop — that's how "for everyone" works.

1. Network pooling: Each new partner contributes statistical power to every other partner via Stein shrinkage. The 50th partner gets dramatically better estimates than the 5th.
2. Category coverage: More partners across more categories = richer cross-category causal graph.
3. Geographic coverage: More partners across more geographies = better spatial identification. A cold snap with 10 unaffected southern partners is a clean natural experiment.
4. Instrument diversity: Different weather events serve as independent instruments. More variation = stronger F-statistics.
5. Switching costs accumulate: A partner on the network for 2 years benefits from a causal graph that took 2 years of network-wide data to build. A competitor starting from scratch cannot match it.
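Point 4's "stronger F-statistics" refers to the standard first-stage strength test for instrumental variables (Stock & Yogo, reference 14): a first-stage F above roughly 10 signals a usable instrument. A minimal sketch on synthetic data (the temperature series, slope, and noise level are invented for illustration):

```python
import random

random.seed(0)
n = 500

# Instrument: daily temperature. Endogenous regressor: store foot traffic,
# assumed here to respond to temperature with slope 0.8 plus noise.
temperature = [random.gauss(70, 10) for _ in range(n)]
foot_traffic = [0.8 * t + random.gauss(0, 5) for t in temperature]

mean_t = sum(temperature) / n
mean_f = sum(foot_traffic) / n
cov = sum((t - mean_t) * (f - mean_f) for t, f in zip(temperature, foot_traffic)) / n
var_t = sum((t - mean_t) ** 2 for t in temperature) / n
var_f = sum((f - mean_f) ** 2 for f in foot_traffic) / n

r2 = cov ** 2 / (var_t * var_f)   # first-stage R² (single instrument)
f_stat = r2 / (1 - r2) * (n - 2)  # F = t² when there is one instrument
print(f"first-stage F = {f_stat:.0f}")
```

More weather variation across partners raises the explained share of the first stage, which is exactly what pushes this F upward.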

In 1961, Charles Stein proved something counterintuitive: when you estimate three or more quantities at once, shrinking the estimates toward one another gives lower total error than estimating each one separately, even if the quantities are unrelated. This is the James-Stein estimator, and it's the mathematical foundation of Thagorus's network effect.

Today, only companies with massive data teams can do causal demand modeling. Thagorus inverts this. A small DTC brand selling $2M/year has no chance of building these models alone. But with James-Stein shrinkage, a new partner joining with even 6 months of data in a single category immediately gets causal estimates informed by every other partner in the network.

The minimum data requirement drops as the network grows. A food truck in Austin and a $5B retailer both benefit from the same causal graph. The food truck couldn't build this alone in a hundred years. But it doesn't have to — the network already did the work.

The shrinkage factor for partner i is:

λᵢ = σᵢ² / (σᵢ² + τ²)

where σᵢ² is partner i's own estimation variance and τ² is the between-partner variance. Noisy partners (large σᵢ²) get λᵢ close to 1 and are pulled strongly toward the group mean; data-rich partners barely move. As the network grows, the group mean is estimated ever more precisely, so every partner's estimates are pulled toward a more accurate target. Empirical Bayes estimates show 40–60% variance reduction vs. partner-only estimation.
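The shrinkage rule can be sketched in a few lines. This is a minimal empirical-Bayes illustration with invented numbers; the method-of-moments plug-in for τ² and the toy data are assumptions, not the production estimator:

```python
import statistics

def shrink(estimates, variances):
    """Pull each raw estimate b_i toward the group mean with weight λ_i."""
    grand_mean = statistics.fmean(estimates)
    # Between-partner variance: spread of estimates beyond sampling noise
    # (simple method-of-moments plug-in, floored at zero).
    tau2 = max(statistics.pvariance(estimates) - statistics.fmean(variances), 0.0)
    out = []
    for b_i, s2_i in zip(estimates, variances):
        lam = s2_i / (s2_i + tau2) if (s2_i + tau2) > 0 else 1.0  # λ_i as above
        out.append(lam * grand_mean + (1 - lam) * b_i)
    return out

raw = [2.0, 2.4, 1.8, 5.0]  # partner-only effect estimates
var = [0.1, 0.1, 0.1, 4.0]  # the 4th partner is new and noisy
shrunk = shrink(raw, var)
# The noisy 4th partner is pulled strongly toward the group mean (2.8);
# the data-rich partners barely move.
```

This is the mechanism by which a new partner with thin data immediately inherits the network's statistical strength.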

What this unlocks at scale

Network size | Minimum partner data | What becomes possible
5 partners | 12 months, 20+ categories | DMA-level weather effects for similar businesses
50 partners | 6 months, 5+ categories | County-level effects; cross-category signals
500 partners | 90 days, 1+ category | Instant causal estimates for any weather-sensitive business
5,000+ partners | 30 days, any category | Real-time demand nowcasting; economic forecasting for everyone
Speed of learning is the moat

Weather changes daily. Sales happen daily. Thagorus's causal learning loop runs every 24 hours — faster than any consulting engagement, quarterly review, or annual planning cycle. A competitor starting today faces the same cold-start problem we faced, but we've been compounding daily signal across a growing partner network. The advantage isn't the model architecture (which is published science). The advantage is the accumulated daily learning that no one can fast-forward.

Stein's Paradox says that estimating many things simultaneously, with shrinkage toward a common mean, is more accurate in total squared error than estimating each one alone. This isn't a business strategy; it's a mathematical theorem. The company with the most partner data will have the most accurate estimates. A competitor with half the partners doesn't just lose half the accuracy: each of their estimates borrows less statistical strength from the others, so every estimate is individually noisier. This creates a natural winner-take-most dynamic driven by mathematics, not just economics.

At sufficient network density, Thagorus becomes the default infrastructure for demand intelligence — the way Stripe became the default for payments or Twilio for communications. Not because it's cheaper, but because the network effects make it categorically better than anything you could build yourself, regardless of how much you spend.

The platform play

Three layers, each serving different customers from the same causal graph.


Thagorus starts as a product (causal demand signals) but becomes a platform as the network grows. The same pattern made Stripe inevitable for payments and Twilio inevitable for communications.

Layer 1: Data platform. The causal graph becomes the most comprehensive real-time map of American consumer demand. Third-party developers, analytics firms, and financial institutions build on top of it via API.

Layer 2: Intelligence marketplace. Partners opt in to share anonymized, aggregated signals. A $49/mo food truck gets demand insights calibrated by Fortune 500 grocery data; the Fortune 500 gets granularity from thousands of small businesses filling geographic gaps.

Layer 3: Economic infrastructure. Hedge funds, government agencies, and central banks subscribe to GDP nowcasting, regional consumer confidence, and sector rotation signals. This is a new asset class built on the same data that powers the $49/mo dashboard.

What the causal engine powers

Decision Domain | Value
Ad spend allocation | Optimal budget across channels & geos, causally identified
Dynamic pricing | Price elasticity estimates in demand context
Inventory positioning | Pre-position ahead of demand surges before competitors react
Product decisions | Which SKUs to promote by demand regime
Financial signals | Real-time consumer spending indicators for hedge funds & macro
Economic nowcasting | County-level GDP estimation from transaction + environmental signals

The causal demand graph becomes valuable to industries far beyond performance marketing:

Customer Segment | What They Pay For | Why It's Unique
Hedge funds & quant traders | Real-time alternative data on consumer spending | Causally identified — not scraped, not correlative
Insurance & reinsurance | Granular weather-economic impact models | County-level loss exposure calibrated to actual outcomes
Supply chain & logistics | Demand forecasts for inventory pre-positioning | Causal signals 7–14 days ahead of traditional indicators
Commercial real estate | Location intelligence | Multi-category demand maps that reveal site potential
Government & central banks | Economic nowcasting | Monthly GDP with 2-month lag → daily county-level in real time
CPG & food companies | Category-level demand planning | Weather × geography × category interactions at scale
Small businesses ($49/mo) | Causal intelligence they could never build alone | Network does the work — 90 days of data is enough

Each segment represents a distinct revenue line. A hedge fund paying $50K/year for alpha signals and a food truck paying $49/month both pull from the same causal graph — and both contribute to it. That's the platform.

Pricing

Just as ChatGPT made AI accessible, Thagorus makes causal business intelligence accessible. Marginal cost to serve: $3–$15/month. We price for adoption velocity and network growth.

Starter
$49/mo
Any business, any size
  • 5 categories, your region
  • Weekly demand signals
  • Connect Shopify or CSV
  • Causal demand dashboard
Self-serve — start in minutes
Growth
$499/mo
$500K–$5M annual ad spend
  • 50 categories, all DMAs
  • Daily causal demand signals
  • Full channel coverage
  • Slack & email alerts
Designed for scaling brands
Pro (Recommended)
$2,500/mo
$5M–$50M annual ad spend
  • 200 categories, all DMAs
  • Real-time causal signals
  • Budget optimization engine
  • API access + scenario planning
The core offering
Enterprise
Custom
$50M+ annual ad spend
  • Unlimited categories
  • Custom model calibration
  • White-label & VPC deploy
  • SLA + performance fee option
0.5–1% of incremental lift
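A quick sanity check on the margin structure. The per-tier marginal costs below are an assumption, chosen inside the document's stated $3–$15/month range; tier prices are from the table above:

```python
tiers = {  # name: (price $/mo, assumed marginal cost to serve $/mo)
    "Starter": (49, 3),
    "Growth": (499, 5),
    "Pro": (2_500, 15),
}
for name, (price, cost) in tiers.items():
    margin = 1 - cost / price
    print(f"{name}: {margin:.1%} gross margin")  # every tier clears 90%
```

Even the entry tier clears 90%, and the blend across tiers lands comfortably above the 97% figure quoted below.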

Unit economics

97%+
Gross Margin
$30K
Avg. ACV (blended)
>120%
Target NRR
Company | Funding | Valuation | Relevance
Scale AI | $600M+ | $13.8B | Data infra for AI; network effects in labeling
Measured | $47M | ~$200M | Closest comp — incrementality for ad spend
Recast | $18M | ~$80M | Bayesian MMM as a service; Series A 2023
Tomorrow.io | $190M | ~$1B | Weather intelligence platform; launched satellite
Round | Target | Gate | Use of Funds
Pre-seed | $500K–$1.5M | Synthetic proof + 3 design partners | Founder, compute ($13K local + $7K cloud), pipeline
Seed | $2M–$5M | v1 live; 5+ partners; 2+ case studies | Team (3–4), sales, dedicated compute
Series A | $10M–$20M | $500K+ ARR; 20+ partners; v2 deployed | GTM, enterprise, R&D, in-house GPU cluster
Series B | $40M–$80M | $5M+ ARR; 200+ partners; demand intelligence | Vertical expansion, API platform, new instruments
Series C / Growth | $100M+ | $25M+ ARR; 1,000+ partners; economic forecasting | Gov / finance verticals, international, R&D lab
At scale | — | 5,000+ partners; real-time economic graph | The economic forecasting platform for everyone
Chapter 6

The vision

The real-time economic
graph for the world.

Weather-sensitive products are the wedge — sunscreen, cold medicine, patio furniture, winter gear. They have the strongest causal signal, the fastest feedback loops, and the clearest ROI. But the model architecture generalizes. Once the causal estimation pipeline works for sunscreen in Phoenix, the same instrumental variable framework applies to HVAC parts in Chicago, energy drinks in Miami, or umbrella inventory in Seattle. The wedge is narrow; the platform is broad.

At 20 partners you have weather intelligence. At 200 you have demand intelligence. At 2,000 you have something that doesn't exist yet — a causally-identified, real-time model of how the economy actually works, at the resolution of individual counties and categories, updated daily. The Fed gets monthly aggregates with a two-month lag. Thagorus's partners will see it happening in real time.

That's what "economic forecasting for everyone" means. Not a dashboard. Not a prediction. A living, causal understanding of why people buy what they buy, where, and when — available to every company willing to contribute their signal to the graph. Demand planning is going to have its AlphaFold moment. We're building the Protein Data Bank.

nate@schmiedehaus.com →

References

  1. Power, A. et al. (2022). "Grokking: Generalization beyond overfitting on small algorithmic datasets." arXiv:2201.02177.
  2. Lee, J. et al. (2024). "Grokfast: Accelerated Grokking by Amplifying Slow Gradients." arXiv:2405.20233.
  3. Nakkiran, P. et al. (2021). "Deep Double Descent." JSTAT. OpenAI.
  4. Belkin, M. et al. (2019). "Reconciling modern ML practice and the classical bias-variance trade-off." PNAS 116(32).
  5. Hoffmann, J. et al. (2022). "Training Compute-Optimal Large Language Models." DeepMind.
  6. Shi, J. et al. (2024). "Scaling Law for Time Series Forecasting." NeurIPS 2024.
  7. Das, A. et al. (2024). "A Decoder-Only Foundation Model for Time-Series Forecasting." Google. (TimesFM)
  8. Woo, G. et al. (2024). "Unified Training of Universal Time Series Forecasting Transformers." Salesforce. (MOIRAI)
  9. Ansari, A. F. et al. (2024). "Chronos: Learning the Language of Time Series." Amazon.
  10. Lam, R. et al. (2023). "Learning skillful medium-range global weather forecasting." Science. Google DeepMind. (GraphCast)
  11. Price, I. et al. (2024). "GenCast: Diffusion-based ensemble forecasting for medium-range weather." Google DeepMind.
  12. Pathak, J. et al. (2022). "FourCastNet: A Global Data-driven High-resolution Weather Forecasting Model." NVIDIA.
  13. Bi, K. et al. (2023). "Accurate medium-range global weather forecasting with 3D neural networks." Nature. Huawei. (Pangu-Weather)
  14. Stock, J. H. & Yogo, M. (2005). "Testing for Weak Instruments in Linear IV Regression." Cambridge UP.
  15. James, W. & Stein, C. (1961). "Estimation with Quadratic Loss." Fourth Berkeley Symposium.
  16. Chernozhukov, V. et al. (2018). "Double/Debiased ML for Treatment and Structural Parameters." Econometrics Journal.
  17. Hartford, J. et al. (2017). "Deep IV: A Flexible Approach for Counterfactual Prediction." ICML 2017.
  18. Nanda, N. et al. (2023). "Progress measures for grokking via mechanistic interpretability." ICLR 2023.
  19. Heckel, R. & Yilmaz, F. F. (2024). "Regularization-wise double descent." ICLR 2024.
  20. Bessemer (2025). "The AI pricing and monetization playbook." bvp.com/atlas.
  21. Jumper, J. et al. (2021). "Highly accurate protein structure prediction with AlphaFold." Nature 596, 583–589.
  22. Conviction (2025). "Plausible Schemes: Measured Physics." conviction.com/startups.html.