WEATHERVANE RESEARCH · FEBRUARY 2026

How to build an economic
forecasting model for everyone

The complete playbook for building, training, and scaling a causal demand intelligence platform — from first data point to network dominance.

Nathaniel Schmiedehaus

THE THESIS

Five things to know

Before the details — five claims this document will prove.

The data is achievable

574M observations per year from 3,143 counties × 500 categories. Two years of daily data reaches full causal identification. 3–5 design partners provide enough.

The moat isn't compute

~$95K Year 1 infrastructure — serious but 1,000× cheaper than frontier LLMs. The model architecture is an open research problem we solve via neural architecture search. The real moat is the causal graph that only grows with partners.

Known risks have known mitigations

Grokking, double descent, catastrophic forgetting — each is a real threat with peer-reviewed countermeasures. Grokfast accelerates grokking 50×. Deliberate overparameterization tames double descent. The playbook exists; execution is the challenge.

Growth improves the product

Stein's Paradox: every new partner improves every existing partner's estimates. A pool chemical company in Phoenix sharpens a raincoat brand's model in Seattle.

Weather is the wedge, not the ceiling

The LCDM is a general-purpose causal inference engine. Weather-sensitive products are the beachhead. The platform extends to pricing, inventory, creative, and economic nowcasting.

The AlphaFold parallel

AlphaFold didn't succeed because DeepMind had better neural networks. It succeeded because the Protein Data Bank — decades of painstaking crystallography — gave the model something to learn from. The architecture was necessary but not sufficient. The data was the moat.

[Figure: the same pattern, AlphaFold vs. the Weathervane LCDM. The model (Evoformer neural net ↔ IV + Stein shrinkage), the dataset (Protein Data Bank, 170K+ crystal structures from 50 years of crystallography ↔ causal demand panel, county × category × day, 574M obs/yr per partner), and the gap (the folding problem ↔ correlation vs. causation, with weather as the instrument). Breakthrough = model + proprietary dataset + bridged gap.]

Causal demand modeling has the same structure. The statistical methods exist (instrumental variables, James-Stein shrinkage, hierarchical Bayes). The weather data exists (NOAA, ERA5). What didn't exist was a panel of actual sales outcomes dense enough to estimate causal effects across thousands of county-category pairs. That's what Thagorus's partner network creates — and like the Protein Data Bank, each new contribution makes everyone's estimates more precise.

| | AlphaFold | Thagorus LCDM |
|---|---|---|
| The model | Transformer architecture | Causal inference (IV + Stein shrinkage) |
| The dataset | Protein Data Bank (crystallography) | Multi-partner sales panel (3,143 counties × 500 categories) |
| The gap | Computational chemistry vs. reality | Correlation vs. causation (weather as natural experiment) |
| Why it couldn't exist before | No ground-truth protein structures at scale | No multi-partner causal demand panel at county-category resolution |
| The moat | Years of crystallography data | Years of partner sales data + daily causal learning |
The pattern

Breakthrough = Model + Domain-Specific Empirical Data + Fast Feedback Loop. No single component is sufficient. The partner panel is Thagorus's Protein Data Bank — proprietary, compounding, and impossible to replicate from a standing start.

Nobody has assembled a multi-partner causal demand panel because the dataset simply doesn't pre-exist. Nielsen tracks sales. NOAA tracks weather. But nobody has systematically joined them at the county-category-day level, cleaned for confounders, and estimated causal effects. The data has to be built through partner integrations — and the only way to build it is to get live with real partners generating real signal.

This is why Thagorus's early partners are so valuable. They're not just customers paying for forecasts. They're co-creating the first dataset of its kind.

Why now

Three things converged in the last three years that make this possible for the first time.

1
Causal ML matured. Double/debiased machine learning, generalized random forests, and modern IV estimators moved from academic papers to production-grade libraries (EconML, DoWhy) in the last 3 years. The methods existed in theory; now they run at scale.
2
Weather data got granular. NOAA's HRRR model now provides hourly, 3km-resolution forecasts. ERA5 reanalysis covers 40+ years at county level. The instrumental variable (weather) went from noisy proxy to high-resolution signal.
3
Retail data infrastructure exists. Shopify, Amazon Seller Central, POS APIs, and data clean rooms mean partners can share sales data with privacy guarantees. Five years ago, assembling a multi-retailer panel required enterprise data deals. Now it's an API integration.

Any one of these alone isn't enough. Together, they make a Large Causal Demand Model feasible for the first time — the same way AlphaFold required the convergence of transformers, the Protein Data Bank, and sufficient compute to train on both.

The Data Flywheel

Every new partner is simultaneously a customer and a data source. Their sales signal improves the model for everyone — including themselves. Revenue and data accumulation are the same act.

Chapter 1

The data

What the model needs to learn

The LCDM's panel dataset: county × category × day, ~60 dimensions per observation.

Weather (Raw): ~22 dims
Temperature (°F)
Humidity (%)
Precipitation (in)
Wind speed (mph)
UV index
Heat index
Dew point
Cloud cover (%)
Barometric pressure
Visibility (mi)
Weather (Derived): ~30 dims
Anomaly vs 30yr normal
7-day temp trend (slope)
Heating degree days
Cooling degree days
Feels-like delta
Temp × humidity interaction
Precip × weekend interaction
UV × season interaction
Sales & Demand: ~18 dims
Daily units sold
Revenue ($)
Avg order value
Gross margin
Return rate
Conversion rate
Basket size
Ad Spend & Perf: ~20 dims
Google spend ($)
Meta spend ($)
TikTok spend ($)
Amazon spend ($)
Total impressions
CTR by channel
CPC by channel
ROAS by channel
Temporal & Calendar: ~16 dims
Day of week (one-hot)
Month (cyclical sin/cos)
Week of year
Holiday flag
Payday proximity
Inventory & Supply: ~8 dims
Stock level (units)
Days of supply
Reorder point proximity
Stockout probability
Promotions & Pricing: ~10 dims
Active promo flag
Discount depth (%)
Promo channel
Days into promo
Competitor promo flag
Macro & External: ~14 dims
CPI (regional)
Gas price (local)
Unemployment rate
Consumer confidence index
Mortgage rate
Cross-Entity: ~18 dims
County population
Median income
Urban/rural classification
Category penetration rate
3,143 counties × ~500 categories × 365 days = >574M observations/year, each carrying ~156 raw features (~60 effective dimensions after interaction reduction)
Sports drinks on shelves — weather drives what people buy
Weather-sensitive products span every aisle — beverages, apparel, outdoor, home, pharmacy
574M observations per year · 1.15B observations at the 2-year target for full identification · ~60 dimensions per observation · F > 20 instrument-strength target

Most demand forecasting is correlation-based. A model sees that ice cream sales and sunscreen sales both spike in July and concludes they're related. But correlation isn't actionable — you can't intervene on a correlation.

Thagorus's approach is structural, not statistical. Weather is an instrumental variable — it affects demand but isn't affected by pricing, promotions, or competitor actions. This lets us isolate the causal effect of weather on demand, the way a randomized trial isolates drug effects. The result isn't "these things tend to move together." It's "a 10-degree temperature anomaly in Harris County causes a 23% increase in sunscreen demand within 48 hours, holding all else equal."

That's the difference between a correlation and a prescription. One tells you what happened. The other tells you what to do.

The LCDM's identification strategy rests on a panel dataset where the cross-sectional unit is a county × product-category pair observed daily. Each observation carries weather features, sales signals, ad-spend by channel, inventory levels, and promotional calendars — roughly 60 effective dimensions per observation after interaction reduction. Weather events serve as instrumental variables across all 3,143 counties simultaneously.

3,143 U.S. counties × ~500 product categories × 365 days per year = 573,597,500 observations per year, rounded to 574M. Each observation is a unique county-category-day triple with its associated feature vector.
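
The headline number is direct arithmetic, easy to verify:

```python
# Panel size from the figures above: counties × categories × days.
counties = 3_143      # U.S. county equivalents (Census)
categories = 500      # extended BLS expenditure categories
days = 365

obs_per_year = counties * categories * days
print(f"{obs_per_year:,}")  # 573,597,500 — rounds to ~574M
```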

The U.S. Census Bureau defines 3,143 county-equivalent administrative divisions (3,007 counties + 136 county equivalents in Louisiana, Alaska, and independent cities). Counties are the finest geographic resolution at which both NOAA weather data and BLS economic data are consistently available, making them the natural unit for the LCDM's spatial panel.

The Bureau of Labor Statistics Consumer Expenditure Survey defines roughly 300 base expenditure categories. We extend this to ~500 by adding weather-sensitive subcategories: splitting "outdoor recreation" into equipment vs. apparel, "beverages" into hot vs. cold, "home improvement" into indoor vs. outdoor projects. The additional granularity is necessary because weather affects subcategories asymmetrically — a heat wave boosts iced coffee but suppresses hot coffee, and the model needs to see both.

| Window | Observations | Seasonal Cycles | IV Power |
|---|---|---|---|
| 90 days | ~143M | 0.25 | Weak — network priors only |
| 6 months | ~287M | 0.5 | Noisy; "memorization phase" |
| 1 year | ~574M | 1.0 | Baseline seasonal coverage |
| 2 years | ~1.15B | 2.0 | Full identification regime |

Causal identification via weather instruments requires observing county-category pairs across multiple extremes. Two years provide at minimum two independent realizations of each seasonal extreme, enabling F > 20 for weather-sensitive categories.

| Signal | Dimensions | Source |
|---|---|---|
| Weather | ~20 | NOAA GHCN-Daily + forecast APIs |
| Sales/demand | ~8 | Shopify, Amazon SP-API, POS |
| Ad spend | ~12 | Google Ads, Meta, TikTok APIs |
| Inventory | ~4 | Tenant ERP / inventory system |
| Promotions | ~6 | Tenant promo calendar |
| Macro | ~10 | FRED, BLS, Census |
Key Insight

Data quality dominates quantity for causal inference. Missing confounder data creates omitted variable bias that no amount of additional observations can fix. This is why design partner onboarding requires OAuth access to all ad platforms and sales channels — partial data destroys identification.

Children playing in the rain
Weather shapes demand. The model learns why.
Chapter 2

The model

How the LCDM is built

The LCDM: six stages from raw data to causal predictions.

[Figure: LCDM pipeline. Raw data (weather + sales + ads) → features (~60 dims/obs) → 12-layer transformer encoder → causal identification (IV, F > 20) → Stein shrinkage pooling → 10-day forecasts. Daily morning cycle: raw data to predictions in <200ms.]

The Large Causal Demand Model uses the same transformer architecture behind ChatGPT — but instead of predicting the next word, it predicts the next day's demand and identifies what caused it. The architecture is not fixed: we treat it as a research problem solved through neural architecture search, starting from a strong baseline and optimizing for our specific data structure.

1. Data ingestion

Every morning, the pipeline pulls the previous day's weather from NOAA and forecast APIs, sales data from Shopify and Amazon, ad spend from Google and Meta, and inventory snapshots from each partner. These feeds are aligned to a consistent county-category-day grid.

2. Feature engineering

Raw temperature tells you almost nothing. The feature engine transforms it into weather anomalies — deviations from the 30-year normal for that county and day. 95°F in Phoenix is unremarkable; 95°F in Seattle is a five-sigma event. The engine also constructs interaction terms: the humidity-temperature combo that drives "feels-like" discomfort, the UV trajectory over 5 days that predicts sunscreen demand, the wind chill delta that triggers coat purchases.
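
A minimal sketch of the anomaly encoding (the 30-year normals below are invented for illustration):

```python
from statistics import mean, stdev

def weather_anomaly(temp_f, normals):
    """Z-score of today's temperature against the 30-year normal
    for this county and calendar day."""
    return (temp_f - mean(normals)) / stdev(normals)

# Invented July normals for two counties:
phoenix_july = [104, 106, 103, 107, 105, 106, 104, 108, 105, 106]
seattle_july = [75, 77, 74, 78, 76, 73, 77, 75, 76, 74]

print(round(weather_anomaly(95, phoenix_july), 1))  # negative: below normal in Phoenix
print(round(weather_anomaly(95, seattle_july), 1))  # large positive: extreme in Seattle
```

The same 95°F reading maps to opposite-signed features in the two counties, which is exactly why the model sees anomalies rather than raw temperatures.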

3. Transformer encoder

The core is a transformer adapted for multivariate time series. The starting architecture — informed by BERT-base and time-series foundation models like MOIRAI and TimesFM — uses ~12 layers with a ~768-dimensional embedding. But these are hyperparameters to be optimized via NAS, not gospel. What matters is that each attention head learns different temporal and cross-category patterns: one head might learn that rain in Miami suppresses outdoor dining within hours; another, that a Midwest cold snap predicts heating equipment demand 3–5 days later.

4. Causal identification layer

This is where the LCDM diverges from every other demand model. Standard models learn correlations — "when it's hot, sunscreen sells." The causal layer asks: "how much of this increase is caused by the heat, and how much would have happened anyway due to summer promotions, school breaks, or seasonal trends?" Weather is a natural experiment — nobody controls or predicts it perfectly. When an unexpected heat wave hits Florida but not New York, and sunscreen sales spike only in Florida, the difference is causally attributable to the heat. The causal layer uses this logic at scale across all 3,143 counties.
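
The logic can be sketched with a toy single-instrument estimator on synthetic data (all numbers invented; the production layer runs panel IV, not this two-variable Wald ratio). Plain regression absorbs the unobserved confounder; the instrument does not:

```python
import random

def slope(x, y):
    """OLS slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var

def iv_slope(z, x, y):
    """Single-instrument IV (Wald) estimate: cov(z, y) / cov(z, x)."""
    return slope(z, y) / slope(z, x)

random.seed(0)
n = 10_000
z = [random.gauss(0, 1) for _ in range(n)]       # weather shock (exogenous)
u = [random.gauss(0, 1) for _ in range(n)]       # unobserved confounder (promos, season)
x = [zi + ui for zi, ui in zip(z, u)]            # observed driver, confounded by u
y = [2.0 * xi + 3.0 * ui + random.gauss(0, 0.1)
     for xi, ui in zip(x, u)]                    # true causal effect = 2.0

print(round(slope(x, y), 2))      # naive OLS: biased upward (≈3.5)
print(round(iv_slope(z, x, y), 2))  # IV recovers ≈2.0
```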

5. Network pooling

James-Stein shrinkage pools estimates across all partners. A new partner with 90 days of data inherits the statistical power of the entire network. The shrinkage factor is learned per category and geography — this is the mathematical foundation of Thagorus's network effect.
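
The pooling step can be sketched with the classic positive-part James-Stein formula (toy numbers; in production the shrinkage factor is learned per category and geography, as noted above):

```python
def james_stein(estimates, noise_var):
    """Shrink noisy per-county estimates toward their pooled mean.
    Positive-part James-Stein with an assumed known noise variance."""
    k = len(estimates)
    m = sum(estimates) / k
    spread = sum((e - m) ** 2 for e in estimates)
    shrink = max(0.0, 1 - (k - 2) * noise_var / spread)
    return [m + shrink * (e - m) for e in estimates]

raw = [0.5, 3.0, 1.2, 2.4, 0.9]       # noisy county-level effect estimates
pooled = james_stein(raw, noise_var=0.5)
print([round(v, 2) for v in pooled])  # every estimate pulled toward the mean
```

The noisier the individual estimates relative to their spread, the harder they are pulled toward the pooled mean; a brand-new partner's 90-day estimates start mostly pooled and individualize as their data accumulates.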

6. Daily predictions

Every morning, the trained model produces a 10-day demand forecast for each partner, category, and geography — with confidence intervals from ensemble weather uncertainty, causal attribution breakdowns, and recommended budget adjustments.

The 12-layer / 768-dim / 12-head configuration is a well-studied starting point, not a final architecture. BERT-base, GPT-2 Small, and time-series foundation models (MOIRAI, TimesFM, Chronos) all converge on similar dimensions for medium-scale tasks. Our NAS pipeline will systematically explore width, depth, attention head count, and the causal sub-network structure. The final architecture will almost certainly differ — the point is that we start from a strong, well-understood baseline rather than guessing.

The target parameter range is ~200–400M, informed by Chinchilla scaling laws applied to our dataset size (~1.15B observations). For a BERT-base-style starting point: 768 embedding dim × 12 heads × 12 layers gives ~85M in the transformer stack. Add input embeddings (~46M), positional encodings, output heads, and the causal identification sub-network (~40M for IV estimation + Stein shrinkage) — baseline ~270M. NAS will explore the 100M–500M range to find the compute-optimal point.

At any point in this range, the model is firmly "medium" — 10–50× smaller than GPT-2 Large, small enough for single-GPU inference in <200ms, but large enough to capture the nonlinear interactions between weather, geography, category, and demand that make causal identification work.

An instrumental variable must satisfy three requirements:

1. Relevance: The instrument must be correlated with the treatment (weather must actually affect consumer behavior). Measured by the first-stage F-statistic — F > 10 is the minimum (Staiger & Stock, 1997), F > 20 is strong (Stock & Yogo, 2005).

2. Independence: The instrument must be unrelated to confounders (weather cannot be caused by your marketing budget or competitor actions). Weather is exogenous by definition.

3. Exclusion: The instrument must only affect the outcome through the treatment. Addressed by daily × county resolution, which controls for supply-side effects and competitive responses.
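
The relevance condition (1) is directly checkable; here is a minimal sketch with one instrument and synthetic data (with a single instrument, F is just the squared first-stage t-statistic):

```python
import math
import random

def first_stage_f(z, x):
    """First-stage F for a single instrument: regress treatment x on
    instrument z and return the squared t-statistic of the slope."""
    n = len(z)
    mz, mx = sum(z) / n, sum(x) / n
    szz = sum((zi - mz) ** 2 for zi in z)
    b = sum((zi - mz) * (xi - mx) for zi, xi in zip(z, x)) / szz
    a = mx - b * mz
    rss = sum((xi - (a + b * zi)) ** 2 for zi, xi in zip(z, x))
    se_b = math.sqrt(rss / (n - 2) / szz)
    return (b / se_b) ** 2

random.seed(1)
z = [random.gauss(0, 1) for _ in range(500)]      # instrument (weather anomaly)
x = [0.3 * zi + random.gauss(0, 1) for zi in z]   # modestly relevant treatment
print(round(first_stage_f(z, x), 1))              # comfortably above the F > 10 minimum
```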

Data preprocessing: Raw feeds are cleaned, aligned to the county-category-day grid, and split into train (80%), validation (10%), and test (10%) sets with temporal splits to prevent data leakage.

Batching: Mini-batches of 2,048 county-category-day observations, stratified by geography and category.

Optimizer: AdamW with weight decay 0.01, betas (0.9, 0.999), gradient clipping at 1.0.

Learning rate schedule: Linear warmup over 1,000 steps to peak LR of 3e-4, then cosine decay to 1e-5.

Checkpointing: Save every 500 steps. Keep best 5 checkpoints by validation loss. Final model is an exponential moving average of the last 3 checkpoints.

Validation: Evaluate on held-out counties (spatial generalization) and held-out time periods (temporal generalization) separately.

Early stopping: Halt if validation loss does not improve for 10 consecutive evaluations (5,000 steps). Combined with weight decay, this prevents epoch-wise double descent.
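
The warmup-plus-cosine schedule above is a pure function of the step count; a sketch (the total step count here is an assumption):

```python
import math

def lr_at_step(step, warmup=1_000, total=50_000, peak=3e-4, floor=1e-5):
    """Linear warmup to the peak LR, then cosine decay to the floor,
    per the schedule described above. `total` is illustrative."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

print(lr_at_step(500))     # mid-warmup: half the peak rate
print(lr_at_step(1_000))   # exactly the peak LR
print(lr_at_step(50_000))  # fully annealed to the floor
```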

Chapter 3

The science

Two surprising things that happen during training

Grokking: the model memorizes first, then suddenly learns the real pattern.

[Figure: train vs. test loss over training steps (log scale). Train loss falls early; test loss sits on a memorization plateau, then drops sharply at the grokking transition.]

Imagine a student who memorizes every answer for a math exam — "Question 7 is 42, Question 12 is 17" — and aces it. New exam, different numbers: they fail. They keep studying. For weeks, nothing visibly changes. Then suddenly, overnight, they understand the underlying math and can solve any problem they've never seen.

What changed? The student stopped memorizing individual data points and started discovering the structure underneath. For the LCDM, that means the model transitions from memorizing "90°F in Miami on June 3 → sunscreen +15%" to understanding the mechanism: it's not the absolute temperature that matters, but the deviation from the 30-year normal, modulated by humidity interactions, baseline seasonal demand, and how quickly the weather changed. It discovers that the same mechanism operates differently across geographies — 90°F in Phoenix barely registers while 90°F in Portland is a demand shock. It finds cross-category cascades: the same heat event that boosts pool chemical sales also suppresses hot coffee and outdoor dining, and these relationships are causal, not just correlated.

That's the "extra information" — not more data points, but the discovery of latent causal structure that was always present in the training data but invisible to a memorizing model. Neel Nanda et al. (2023) showed this transition happens because the model's internal representations reorganize from lookup tables to generalizable circuits. Grokfast (Lee et al., 2024) accelerates this 50× by amplifying slow-varying gradient components, reducing the grokking budget from thousands of GPU-hours to hundreds.
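
The Grokfast-EMA idea is simple enough to sketch: low-pass filter each parameter's gradient history and amplify the slow-varying component before the optimizer step (hyperparameters below are illustrative, not the paper's tuned values):

```python
def grokfast_filter(grad_stream, alpha=0.98, lamb=2.0):
    """Grokfast-EMA sketch: h_t = alpha*h_{t-1} + (1-alpha)*g_t,
    then g_hat = g_t + lamb*h_t. Persistent (slow) gradient
    directions get amplified; transient noise mostly cancels."""
    h = [0.0] * len(grad_stream[0])
    out = []
    for g in grad_stream:  # one gradient vector per training step
        h = [alpha * hi + (1 - alpha) * gi for hi, gi in zip(h, g)]
        out.append([gi + lamb * hi for gi, hi in zip(g, h)])
    return out

steps = [[1.0, -0.5]] * 200          # a persistent gradient direction...
filtered = grokfast_filter(steps)
print(filtered[-1])                  # ...ends up amplified roughly (1 + lamb)x
```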

| Scenario | Extra Chip-hrs | Cost @ $0.48/hr | Cost @ $1.20/hr |
|---|---|---|---|
| With Grokfast (50×) | ~200 | $96 | $240 |
| Standard (3× train time) | ~3,600 | $1,728 | $4,320 |
| Worst case (10× train time) | ~12,000 | $5,760 | $14,400 |

Reference: Nanda, N. et al. (2023). "Progress measures for grokking via mechanistic interpretability." ICLR 2023.

Double descent: more parameters can actually reduce error, defying classical intuition.

[Figure: test error vs. model complexity (parameters). The classical U-curve predicts rising error past the interpolation threshold; the actual curve peaks there, then descends again in the overparameterized regime, where the LCDM sits.]

Classical ML teaches that there's a "sweet spot" for model size: too small and it underfits, too large and it overfits. This U-shaped bias-variance trade-off is every ML textbook's chapter 1. It turns out to be incomplete.

When you keep making the model bigger past the point where it can perfectly memorize the training data (the "interpolation threshold"), something unexpected happens: test error peaks, then descends again. Larger models actually perform better, not worse. This is double descent, documented by Belkin et al. (2019) and Nakkiran et al. (2021), and it overturns fifty years of statistical intuition. For the LCDM, it means deliberate overparameterization — combined with weight decay and early stopping — is a feature, not a risk.

Model-wise double descent: Test error peaks when model capacity matches dataset size, then decreases as the model grows.

Epoch-wise double descent: Test error decreases, increases during the "critical regime," then decreases again. Eliminated by early stopping + tuned weight decay.

Sample-wise double descent: Adding more data can temporarily worsen performance. Resolves with either more data or larger models.

LCDM position: At ~270M parameters on ~1.15B observations (ratio 1:4.3), we're firmly in the overparameterized regime. The strongest defense is deliberate overparameterization + weight decay + early stopping + hierarchical shrinkage.

Catastrophic forgetting: Updating with new partner data risks degrading existing partners. Mitigated via elastic weight consolidation (EWC) and monthly full-network retrains on the complete pooled dataset.
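
EWC's core is a quadratic anchor on weights that mattered for existing partners; a sketch with hypothetical names and toy values (Kirkpatrick et al., 2017):

```python
def ewc_penalty(params, anchor, fisher, lamb=100.0):
    """Elastic weight consolidation: penalize movement away from
    `anchor` (weights after training on existing partners), scaled by
    Fisher information `fisher` (how much each weight mattered).
    `lamb` is illustrative."""
    return 0.5 * lamb * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor, fisher)
    )

anchor = [0.8, -1.2, 0.3]
fisher = [5.0, 0.1, 2.0]     # first weight is critical, second barely used
drifted = [0.9, -0.2, 0.3]   # a new-partner update moved the first two
print(ewc_penalty(drifted, anchor, fisher))
```

The large drift on the unimportant second weight costs little; the small drift on the critical first weight dominates the penalty, which is exactly the behavior that protects existing partners.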

Distribution shift: Weather-demand relationships are non-stationary. Addressed with anomaly encoding (deviations from 30-year normals, not raw temps), rolling retraining, and 4σ divergence detection with automatic checkpoint fallback.

Mode collapse: Prevented by conformal prediction wrappers, ensemble disagreement monitoring across 5 checkpoints, and heteroscedastic output heads.

Instrument strength: The first-stage F-statistic must exceed 10 and ideally 20+ for reliable causal estimates. Monte Carlo simulations show 24 months of data achieves F > 20 for 80% of county-category pairs.

Storm clouds forming over landscape

Standing on the shoulders of AI weather

The LCDM doesn't predict weather — it consumes the best weather forecasts available and uses them as instruments. A new generation of AI weather models produces forecasts that rival the European Centre (ECMWF) at a fraction of the cost.

| Model | Developer | Resolution | Inference | Notes |
|---|---|---|---|---|
| GraphCast | Google DeepMind | 0.25° | <1 min | 10-day forecast, single TPU, open weights |
| GenCast | Google DeepMind | 0.25° | ~8 min | Probabilistic ensembles, diffusion model |
| FourCastNet | NVIDIA | 0.25° | <2 sec | Fourier Neural Operator, 7-day forecast |
| Pangu-Weather | Huawei | 0.25° | ~1.4 sec | 3D Earth-specific transformer |
| Aurora | Microsoft | 0.1° | <1 min | Foundation model, flexible fine-tuning |

Physics + ML hybrid

Pure ML risks learning spurious correlations. Pure physics can't capture nonlinear demand responses. The LCDM uses a hybrid approach:

  • Physics-informed priors: Known relationships are encoded as Bayesian priors, not hard constraints. The model can override them with sufficient data.
  • Ensemble weather inputs: We ensemble GraphCast + GenCast + NOAA GFS. Ensemble disagreement provides built-in uncertainty for downstream causal estimates.
  • Physics-guided regularization: The loss function penalizes causal estimates that violate known physical constraints (e.g., negative temperature elasticity for heating products above 70°F).
  • Interpretable + flexible: The physics component is fully interpretable. The ML component captures nonlinearities the physics can't. Partners see both layers.
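
The sign-constraint penalty in the third bullet can be sketched directly (function names, shapes, and the weight are hypothetical):

```python
def physics_penalty(elasticities, temps_f, weight=10.0):
    """Penalize causal estimates that violate a known constraint:
    heating-product demand should not rise with temperature above 70°F
    (the example constraint from the text). Added to the training loss."""
    total = 0.0
    for beta, t in zip(elasticities, temps_f):
        if t > 70 and beta > 0:  # positive elasticity where it should be <= 0
            total += beta ** 2
    return weight * total

# One violation (+0.4 elasticity at 75°F) contributes 10 * 0.4^2 = 1.6:
print(round(physics_penalty([0.4, -0.2, 0.1], [75, 80, 65]), 3))  # 1.6
```

Because it is a soft penalty rather than a hard constraint, sufficient contrary evidence in the data can still override the prior, as the first bullet requires.
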
Cost Advantage

GraphCast generates a 10-day global forecast in under 1 minute on a single TPU v4. At $3.22/hr for a TPU v4, that's about $0.05 per global forecast — roughly $18/year for daily county-level weather inputs. ECMWF's operational HRES system costs on the order of $50M/year to run; we get comparable forecast quality for a minuscule fraction of that.

Chapter 4

The economics

What it takes to build the economic graph

Building the economic graph is a multi-stage, capital-intensive engineering challenge.

Seed ($2–5M): 3–5 partners. Prove the model.
Series A ($10–20M): 20–50 partners. Scale the team.
Series B ($40–80M): 200+ partners. Build the platform.
Series C+ ($150M+): 1,000+ partners. Economic infrastructure.

We considered selling weather-demand analytics as a tool. But demand data is fragmented across thousands of retailers, each with different POS systems, category taxonomies, and data quality. Selling into that fragmentation means years of enterprise sales cycles and bespoke integrations.

Instead, Thagorus controls the full inference pipeline: data ingestion, causal estimation, forecast generation, and decision delivery. Partners send us data; we send them decisions. We own the statistical methodology, the cross-partner shrinkage, and the feedback loop. This full-stack approach is more capital-intensive, but it's the only way to ensure the causal estimates are actually valid — and the only way to capture the network effects that make the model improve with scale.

The capital buys time to recruit PhD-level causal inference researchers, secure enterprise data partnerships that take 6–18 months to close, build SOC 2-compliant infrastructure, and stand up real-time serving for millions of daily predictions — before competitors realize what's possible.

Seed ($2–5M): 3–4 person team (ML + eng), $95K compute/infra, 3–5 design partners, SOC 2 Type I. Proves the model works on real data.

Series A ($10–20M): 12–18 person team, $200K–$500K/yr compute, 20–50 partners, SOC 2 Type II, enterprise sales motion. Proves the business.

Series B+ ($40–80M): 40–60 person team, $1M–$3M/yr compute, 200+ partners, in-house GPU cluster, FedRAMP. Builds the platform.

| Investment Area | Seed | Series A | Series B+ |
|---|---|---|---|
| Team | 3–4 (ML + eng) | 12–18 (ML, eng, data, sales) | 40–60 (full org) |
| Compute + Infra | $25K–$50K/yr | $200K–$500K/yr | $1M–$3M/yr |
| Data partnerships | 3–5 design partners | 20–50 paid + BDRs | 200+ self-serve + enterprise |
| Compliance & security | SOC 2 Type I | SOC 2 Type II, pen testing | FedRAMP, financial regs |
| GTM + Sales | Founder-led | 2–3 AEs + marketing | Full sales org + partnerships |

Foundation model companies raise billions because compute is their moat. Thagorus's moat is the causal graph — the data network, not the compute.

| | Foundation Model | LCDM |
|---|---|---|
| Parameters | ~1.8 trillion | 270 million |
| Training data | The entire internet | 1.15B structured observations |
| Where $ goes | Compute (thousands of GPUs) | Team, data, partners, GTM |
| Moat source | Scale of compute | Scale of causal graph |
| Defensibility | Anyone with $100M+ can try | 2+ years of multi-tenant data |

This is why $150M builds a category-defining economic intelligence platform while $10B builds one more language model. The capital goes to assets that compound — data partnerships, the causal graph, regulatory moats. The LCDM's architecture is defensible but ultimately replicable. The multi-partner causal panel is neither. Lead with the data asset, not the model.

At each stage, the platform becomes something qualitatively different:

Stage 1: Weather intelligence (Seed → Series A)

5–20 partners in weather-sensitive categories. The LCDM resolves DMA-level causal effects. Partners get causal demand signals no existing tool can provide. ARR: $120K–$2M.

Stage 2: Demand intelligence (Series A → Series B)

50–200 partners across dozens of verticals. The causal graph resolves county-level effects and transcends weather. Partners start asking: "what's causing the demand shift in Phoenix this week?" The model begins seeing cross-category demand cascades. This is the inflection point. ARR: $5M–$25M.

Stage 3: Economic forecasting (Series B → IPO)

500–5,000+ partners. The causal graph becomes a nowcasting engine for the real economy. Transaction signals from thousands of businesses, combined with weather and new instruments, create something that doesn't exist today — a real-time, causally-identified model of consumer economic behavior at county-level granularity. ARR: $50M+.

How costs evolve

Costs scale sublinearly. Adding partners doesn't mean retraining from scratch. At 1,000 partners: ~$200K/year training, ~$150K/year inference, ~$100K/year infra. Total ~$450K/year on $50M+ ARR = 99%+ gross margin.

Infrastructure budget

Year 1 Cost Breakdown — Spot Optimized

| Item | Chip-hours | Cost |
|---|---|---|
| Base training runs | 3,000 | $2,160 |
| Hyperparameter search (12×) | 36,000 | $17,280 |
| Neural architecture search | 12,000 | $5,760 |
| Grokking buffer (5×) | 15,000 | $7,200 |
| Ablation & validation | 6,000 | $2,880 |
| Monthly retrains (12×) | 36,000 | $25,920 |
| Inference serving (24/7) | 8,760 | $6,307 |
| Weather APIs (3 providers) | | $8,400 |
| Data storage & pipeline | | $4,800 |
| Cloud infra & monitoring | | $6,000 |
| Spot preemption overhead | | $8,300 |
| Year 1 Total (blended spot/on-demand @ $0.48–$0.72/chip-hr) | | $95,007 |

The standard FLOPs estimate for transformer training is C = 6 × N × D, where N is parameters and D is dataset size.

N = 200–400M parameters (NAS range). D = 1.15B observations. Using the midpoint (300M): C = 6 × 300M × 1.15B = 2.07 × 10^18 FLOPs per pass.

A TPU v5e delivers ~393 TFLOPS (BF16) sustained. At ~40% practical throughput: ~4.4 chip-hours per epoch. 200–400 epochs → ~3,000 chip-hours base.

Multipliers: 12× HPO sweeps across the architecture search space, 5× grokking buffer (reduced by Grokfast), 2× ablation & validation, 12× monthly production retrains.

Compute total: ~108,000 chip-hours at blended $0.48–$0.72/hr = ~$61,200. Add 24/7 inference serving ($6,300), weather APIs × 3 providers ($8,400), data pipeline ($4,800), cloud infra ($6,000), and spot preemption overhead ($8,300) = ~$95K Year 1.

This is 3–4× higher than a naive spot-only estimate because it accounts for: mixed spot/on-demand pricing, NAS requiring broader search, inference costs for production serving, and the overhead of preemption recovery.
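
The core of the estimate can be re-run in a few lines (throughput and utilization figures as assumed in the text):

```python
# C = 6 * N * D, the standard transformer training FLOPs estimate.
N = 300e6                  # parameters (NAS midpoint)
D = 1.15e9                 # observations
C = 6 * N * D
print(f"{C:.2e}")          # 2.07e+18 FLOPs per pass

sustained = 393e12 * 0.40  # TPU v5e BF16 at ~40% practical throughput
hours_per_pass = C / sustained / 3600
print(round(hours_per_pass, 1))  # ~3.7 chip-hours at these idealized numbers,
                                 # in the ballpark of the ~4.4 quoted above
```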

Google Cloud TPU (Q1 2026 rates):

| Chip | On-Demand | 1-yr CUD | 3-yr CUD | Spot |
|---|---|---|---|---|
| TPU v5e | $1.20 | $0.84 | $0.54 | ~$0.48 |
| TPU v5p | $4.20 | $2.94 | $1.89 | ~$1.68 |
| TPU v6e | $1.38 | $0.97 | $0.55 | ~$0.55 |

AWS EC2 GPU:

| Instance | GPUs | On-Demand | Per-GPU | Spot |
|---|---|---|---|---|
| p4d.24xlarge | 8× A100 | $22.03/hr | $2.75 | ~$7.20 |
| p5.48xlarge | 8× H100 | $33.10/hr | $4.14 | ~$13.20 |

Local vs. cloud

An in-house GPU workstation amortizes to competitive hourly rates. The recommended approach is hybrid — local for daily work, cloud for burst compute.

Local (4× RTX 4090)

$13,000 upfront, amortizes to $0.12/GPU-hr over 3 years. Always available, no spot preemption. ~35,000 GPU-hours/year at 24/7 utilization.

Cloud (TPU v5e spot)

$0.48/chip-hr with no commitment. Elastic scaling to 32+ chips for hyperparameter search. Year 1 cloud estimate: ~$6,900 for burst HPO + architecture search.

| Configuration | Upfront | Amortized/yr | Effective $/GPU-hr |
|---|---|---|---|
| 1× RTX 4090 workstation | ~$3,500 | $1,167 | $0.13 |
| 4× RTX 4090 server | ~$13,000 | $4,333 | $0.12 |
| Lambda Scalar (8× A6000) | ~$48,000 | $16,000 | $0.23 |
| NVIDIA DGX A100 | ~$199,000 | $66,333 | $0.95 |
Sunscreen products displayed in warm light
From data to decisions


Chapter 5

The strategy

Solving the cold start

The causal demand panel doesn't exist in any usable form. Not at Nielsen. Not at IRI. Not at any retailer. The data is the company — and it only exists if we build it. Every decision should be evaluated against: does this get us to 10 partners with 12 months of data faster?

Workstream A
Synthetic Proof-of-Concept

Build a fully functional demo using synthetic demand data calibrated to real weather. BLS retail sales indices + NOAA weather generate realistic panels.

Workstream B
Design Partner Program

3–5 DTC e-commerce brands provide OAuth access in exchange for 6–12 months free service. They get analytics they could never build in-house. We get the data to train the model.

Workstream C
TSFM Baseline

Fine-tune MOIRAI on available weather-demand data. Delivers a working product within weeks of data access while the LCDM trains.

Workstream D
Public Data Pre-training

Pre-train on BLS Consumer Expenditure Survey, Census Retail Trade, FRED, Kilts-Nielsen panels. Transfer learning cuts per-partner data requirements dramatically.

Ideal design partners

Tier 1 — Dream
National Home Improvement

Think Home Depot, Lowe's, Tractor Supply. Thousands of SKUs from HVAC to outdoor furniture. Strong seasonality, massive geographic spread, years of POS data.

~500+ categories · ~2,000 locations · $5B+ ad spend · 10+ yrs history
Tier 2 — Sweet Spot
DTC Outdoor / Seasonal Brand

Companies like YETI, Hydro Flask, Solo Stove, Traeger. Clear weather signal, Shopify or D2C data, $5M–$50M ad spend. Fast to onboard via API.

20–100 categories · National DTC · $5M–$50M ad spend · Shopify/Amazon
Tier 2 — Sweet Spot
CPG Beverage / Food

Companies like Liquid Death, Athletic Brewing, Olipop. Beverage sales are strongly weather-driven. Both DTC and retail channel data available.

10–50 SKUs · Multi-channel · $10M+ ad spend · Strong weather signal
Tier 3 — Quick Win
Seasonal Apparel / Beauty

Sunscreen, outdoor apparel, seasonal skincare. Shopify-native brands with clear weather sensitivity. Fast onboarding, immediate signal.

5–30 SKUs · Shopify · $1M–$10M ad spend · 90-day onboard

A company selling both sunscreen and books provides a natural within-company control. When a heat wave hits and sunscreen sales spike while book sales stay flat, we can isolate the causal effect of temperature on sunscreen — the books are the control group.
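The within-company control logic is standard difference-in-differences. A toy sketch with synthetic numbers (all sales figures are invented for illustration):

```python
# A heat wave is the "treatment", sunscreen the treated category,
# books the within-company control. Difference-in-differences
# isolates the weather effect from company-wide shocks.

# Daily units sold: 5 days before the heat wave, 5 days during.
sunscreen_pre, sunscreen_during = [100, 98, 103, 101, 99], [160, 158, 165, 161, 157]
books_pre, books_during = [50, 51, 49, 50, 52], [51, 50, 52, 49, 51]

mean = lambda xs: sum(xs) / len(xs)

# Change in each category across the event window.
delta_sunscreen = mean(sunscreen_during) - mean(sunscreen_pre)  # weather + common shocks
delta_books = mean(books_during) - mean(books_pre)              # common shocks only

# Subtracting the control's change removes company-wide trends.
heat_wave_effect = delta_sunscreen - delta_books
print(f"Estimated heat-wave effect: {heat_wave_effect:+.1f} units/day")  # +59.8
```

The book sales absorb anything that hit the whole company that week (a promo, a press mention), so what remains is attributable to the weather.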

Integration | Purpose | Effort
Shopify / Amazon SP-API | Sales, orders, inventory by SKU/day | OAuth, <1 hr
Google Ads API | Spend, impressions, clicks by campaign/day | OAuth, <1 hr
Meta Marketing API | Spend, reach, conversions by campaign/day | OAuth, <1 hr
Historical data export | Backfill 12–24 months | CSV/API, 1–3 hrs
Data sharing agreement | Legal, NDA, data usage terms | 1–2 weeks

A typical Tier 2 partner with 50 categories across 200 DMAs generates 50 × 200 × 365 = 3.65M observations per year. With ~60 dimensions per observation, that's 219M data points annually. Three such partners provide 10.95M observations — enough for statistically significant causal estimates within 6 months.
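Spelled out as code (all figures are the ones quoted above):

```python
# Back-of-envelope panel size for one Tier 2 partner:
# 50 categories x 200 DMAs at daily grain, ~60 dimensions per observation.
categories, dmas, days, dims = 50, 200, 365, 60

obs_per_partner_year = categories * dmas * days      # 3,650,000
datapoints_per_year = obs_per_partner_year * dims    # 219,000,000
three_partner_pool = 3 * obs_per_partner_year        # 10,950,000

print(f"{obs_per_partner_year:,} observations per partner-year")
print(f"{datapoints_per_year:,} data points per partner-year")
print(f"{three_partner_pool:,} pooled observations across three partners")
```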


Why every new partner makes every existing partner better

As the network grows, minimum data requirements drop — that's how "for everyone" works.

1. Network pooling: Each new partner contributes statistical power to every other partner via Stein shrinkage. The 50th partner gets dramatically better estimates than the 5th.
2. Category coverage: More partners across more categories = richer cross-category causal graph.
3. Geographic coverage: More partners across more geographies = better spatial identification. A cold snap with 10 unaffected southern partners is a clean natural experiment.
4. Instrument diversity: Different weather events serve as independent instruments. More variation = stronger F-statistics.
5. Switching costs accumulate: A partner on the network for 2 years benefits from a causal graph that took 2 years of network-wide data to build. A competitor starting from scratch cannot match it.
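Point 4's "stronger F-statistics" refers to the standard first-stage strength test for instrumental variables (Stock & Yogo, reference 14): a first-stage F above roughly 10 signals a usable instrument. A minimal sketch on synthetic data (the temperature series, slope, and noise level are invented for illustration):

```python
import random

random.seed(0)
n = 500

# Instrument: daily temperature. Endogenous regressor: store foot traffic,
# assumed here to respond to temperature with slope 0.8 plus noise.
temperature = [random.gauss(70, 10) for _ in range(n)]
foot_traffic = [0.8 * t + random.gauss(0, 5) for t in temperature]

mean_t = sum(temperature) / n
mean_f = sum(foot_traffic) / n
cov = sum((t - mean_t) * (f - mean_f) for t, f in zip(temperature, foot_traffic)) / n
var_t = sum((t - mean_t) ** 2 for t in temperature) / n
var_f = sum((f - mean_f) ** 2 for f in foot_traffic) / n

r2 = cov ** 2 / (var_t * var_f)   # first-stage R² (single instrument)
f_stat = r2 / (1 - r2) * (n - 2)  # F = t² when there is one instrument
print(f"first-stage F = {f_stat:.0f}")
```

More weather variation across partners raises the explained share of the first stage, which is exactly what pushes this F upward.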

In 1961, Charles Stein proved something counterintuitive: when you estimate three or more quantities at once, shrinking the estimates toward one another gives lower total error than estimating each one separately, even if the quantities are unrelated. This is the James-Stein estimator, and it's the mathematical foundation of Thagorus's network effect.

Today, only companies with massive data teams can do causal demand modeling. Thagorus inverts this. A small DTC brand selling $2M/year has no chance of building these models alone. But with James-Stein shrinkage, a new partner joining with even 6 months of data in a single category immediately gets causal estimates informed by every other partner in the network.

The minimum data requirement drops as the network grows. A food truck in Austin and a $5B retailer both benefit from the same causal graph. The food truck couldn't build this alone in a hundred years. But it doesn't have to — the network already did the work.

The shrinkage factor for partner i is:

λᵢ = σᵢ² / (σᵢ² + τ²)

where σᵢ² is partner i's own estimation variance and τ² is the between-partner variance. Noisy partners (large σᵢ²) get λᵢ close to 1 and are pulled strongly toward the group mean; data-rich partners barely move. As the network grows, the group mean is estimated ever more precisely, so every partner's estimates are pulled toward a more accurate target. Empirical Bayes estimates show 40–60% variance reduction vs. partner-only estimation.
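The shrinkage rule can be sketched in a few lines. This is a minimal empirical-Bayes illustration with invented numbers; the method-of-moments plug-in for τ² and the toy data are assumptions, not the production estimator:

```python
import statistics

def shrink(estimates, variances):
    """Pull each raw estimate b_i toward the group mean with weight λ_i."""
    grand_mean = statistics.fmean(estimates)
    # Between-partner variance: spread of estimates beyond sampling noise
    # (simple method-of-moments plug-in, floored at zero).
    tau2 = max(statistics.pvariance(estimates) - statistics.fmean(variances), 0.0)
    out = []
    for b_i, s2_i in zip(estimates, variances):
        lam = s2_i / (s2_i + tau2) if (s2_i + tau2) > 0 else 1.0  # λ_i as above
        out.append(lam * grand_mean + (1 - lam) * b_i)
    return out

raw = [2.0, 2.4, 1.8, 5.0]  # partner-only effect estimates
var = [0.1, 0.1, 0.1, 4.0]  # the 4th partner is new and noisy
shrunk = shrink(raw, var)
# The noisy 4th partner is pulled strongly toward the group mean (2.8);
# the data-rich partners barely move.
```

This is the mechanism by which a new partner with thin data immediately inherits the network's statistical strength.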

What this unlocks at scale

Network size | Minimum partner data | What becomes possible
5 partners | 12 months, 20+ categories | DMA-level weather effects for similar businesses
50 partners | 6 months, 5+ categories | County-level effects; cross-category signals
500 partners | 90 days, 1+ category | Instant causal estimates for any weather-sensitive business
5,000+ partners | 30 days, any category | Real-time demand nowcasting; economic forecasting for everyone
Speed of learning is the moat

Weather changes daily. Sales happen daily. Thagorus's causal learning loop runs every 24 hours — faster than any consulting engagement, quarterly review, or annual planning cycle. A competitor starting today faces the same cold-start problem we faced, but we've been compounding daily signal across a growing partner network. The advantage isn't the model architecture (which is published science). The advantage is the accumulated daily learning that no one can fast-forward.

Stein's Paradox says that estimating many things simultaneously, with shrinkage toward a common mean, is more accurate in total squared error than estimating each one alone. This isn't a business strategy; it's a mathematical theorem. The company with the most partner data will have the most accurate estimates. A competitor with half the partners doesn't just lose half the accuracy: each of their estimates borrows less statistical strength from the others, so every estimate is individually noisier. This creates a natural winner-take-most dynamic driven by mathematics, not just economics.

At sufficient network density, Thagorus becomes the default infrastructure for demand intelligence — the way Stripe became the default for payments or Twilio for communications. Not because it's cheaper, but because the network effects make it categorically better than anything you could build yourself, regardless of how much you spend.

The platform play

Three layers, each serving different customers from the same causal graph.


Thagorus starts as a product (causal demand signals) but becomes a platform as the network grows. The same pattern made Stripe inevitable for payments and Twilio inevitable for communications.

Layer 1: Data platform. The causal graph becomes the most comprehensive real-time map of American consumer demand. Third-party developers, analytics firms, and financial institutions build on top of it via API.

Layer 2: Intelligence marketplace. Partners opt in to share anonymized, aggregated signals. A $49/mo food truck gets demand insights calibrated by Fortune 500 grocery data; the Fortune 500 gets granularity from thousands of small businesses filling geographic gaps.

Layer 3: Economic infrastructure. Hedge funds, government agencies, and central banks subscribe to GDP nowcasting, regional consumer confidence, and sector rotation signals. This is a new asset class built on the same data that powers the $49/mo dashboard.

What the causal engine powers

Decision Domain | Value
Ad spend allocation | Optimal budget across channels & geos, causally identified
Dynamic pricing | Price elasticity estimates in demand context
Inventory positioning | Pre-position ahead of demand surges before competitors react
Product decisions | Which SKUs to promote by demand regime
Financial signals | Real-time consumer spending indicators for hedge funds & macro
Economic nowcasting | County-level GDP estimation from transaction + environmental signals

The causal demand graph becomes valuable to industries far beyond performance marketing:

Customer Segment | What They Pay For | Why It's Unique
Hedge funds & quant traders | Real-time alternative data on consumer spending | Causally identified — not scraped, not correlative
Insurance & reinsurance | Granular weather-economic impact models | County-level loss exposure calibrated to actual outcomes
Supply chain & logistics | Demand forecasts for inventory pre-positioning | Causal signals 7–14 days ahead of traditional indicators
Commercial real estate | Location intelligence | Multi-category demand maps that reveal site potential
Government & central banks | Economic nowcasting | Monthly GDP with 2-month lag → daily county-level in real time
CPG & food companies | Category-level demand planning | Weather × geography × category interactions at scale
Small businesses ($49/mo) | Causal intelligence they could never build alone | Network does the work — 90 days of data is enough

Each segment represents a distinct revenue line. A hedge fund paying $50K/year for alpha signals and a food truck paying $49/month both pull from the same causal graph — and both contribute to it. That's the platform.

Pricing

Just as ChatGPT made AI accessible, Thagorus makes causal business intelligence accessible. Marginal cost to serve: $3–$15/month. We price for adoption velocity and network growth.

Starter
$49/mo
Any business, any size
  • 5 categories, your region
  • Weekly demand signals
  • Connect Shopify or CSV
  • Causal demand dashboard
Self-serve — start in minutes
Growth
$499/mo
$500K–$5M annual ad spend
  • 50 categories, all DMAs
  • Daily causal demand signals
  • Full channel coverage
  • Slack & email alerts
Designed for scaling brands
Pro (Recommended)
$2,500/mo
$5M–$50M annual ad spend
  • 200 categories, all DMAs
  • Real-time causal signals
  • Budget optimization engine
  • API access + scenario planning
The core offering
Enterprise
Custom
$50M+ annual ad spend
  • Unlimited categories
  • Custom model calibration
  • White-label & VPC deploy
  • SLA + performance fee option
0.5–1% of incremental lift
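A quick sanity check on the margin structure. The per-tier marginal costs below are an assumption, chosen inside the document's stated $3–$15/month range; tier prices are from the table above:

```python
tiers = {  # name: (price $/mo, assumed marginal cost to serve $/mo)
    "Starter": (49, 3),
    "Growth": (499, 5),
    "Pro": (2_500, 15),
}
for name, (price, cost) in tiers.items():
    margin = 1 - cost / price
    print(f"{name}: {margin:.1%} gross margin")  # every tier clears 90%
```

Even the entry tier clears 90%, and the blend across tiers lands comfortably above the 97% figure quoted below.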

Unit economics

97%+
Gross Margin
$30K
Avg. ACV (blended)
>120%
Target NRR
Company | Funding | Valuation | Relevance
Scale AI | $600M+ | $13.8B | Data infra for AI; network effects in labeling
Measured | $47M | ~$200M | Closest comp — incrementality for ad spend
Recast | $18M | ~$80M | Bayesian MMM as a service; Series A 2023
Tomorrow.io | $190M | ~$1B | Weather intelligence platform; launched satellite
Round | Target | Gate | Use of Funds
Pre-seed | $500K–$1.5M | Synthetic proof + 3 design partners | Founder, compute ($13K local + $7K cloud), pipeline
Seed | $2M–$5M | v1 live; 5+ partners; 2+ case studies | Team (3–4), sales, dedicated compute
Series A | $10M–$20M | $500K+ ARR; 20+ partners; v2 deployed | GTM, enterprise, R&D, in-house GPU cluster
Series B | $40M–$80M | $5M+ ARR; 200+ partners; demand intelligence | Vertical expansion, API platform, new instruments
Series C / Growth | $100M+ | $25M+ ARR; 1,000+ partners; economic forecasting | Gov / finance verticals, international, R&D lab
At scale | — | 5,000+ partners; real-time economic graph | The economic forecasting platform for everyone
Chapter 6

The vision

The real-time economic
graph for the world.

Weather-sensitive products are the wedge — sunscreen, cold medicine, patio furniture, winter gear. They have the strongest causal signal, the fastest feedback loops, and the clearest ROI. But the model architecture generalizes. Once the causal estimation pipeline works for sunscreen in Phoenix, the same instrumental variable framework applies to HVAC parts in Chicago, energy drinks in Miami, or umbrella inventory in Seattle. The wedge is narrow; the platform is broad.

At 20 partners you have weather intelligence. At 200 you have demand intelligence. At 2,000 you have something that doesn't exist yet — a causally-identified, real-time model of how the economy actually works, at the resolution of individual counties and categories, updated daily. The Fed gets monthly aggregates with a two-month lag. Thagorus's partners will see it happening in real time.

That's what "economic forecasting for everyone" means. Not a dashboard. Not a prediction. A living, causal understanding of why people buy what they buy, where, and when — available to every company willing to contribute their signal to the graph. Demand planning is going to have its AlphaFold moment. We're building the Protein Data Bank.

nate@schmiedehaus.com →

References

  1. Power, A. et al. (2022). "Grokking: Generalization beyond overfitting on small algorithmic datasets." arXiv:2201.02177.
  2. Lee, J. et al. (2024). "Grokfast: Accelerated Grokking by Amplifying Slow Gradients." arXiv:2405.20233.
  3. Nakkiran, P. et al. (2021). "Deep Double Descent." JSTAT. OpenAI.
  4. Belkin, M. et al. (2019). "Reconciling modern ML practice and the classical bias-variance trade-off." PNAS 116(32).
  5. Hoffmann, J. et al. (2022). "Training Compute-Optimal Large Language Models." DeepMind.
  6. Shi, J. et al. (2024). "Scaling Law for Time Series Forecasting." NeurIPS 2024.
  7. Das, A. et al. (2024). "A Decoder-Only Foundation Model for Time-Series Forecasting." Google. (TimesFM)
  8. Woo, G. et al. (2024). "Unified Training of Universal Time Series Forecasting Transformers." Salesforce. (MOIRAI)
  9. Ansari, A. F. et al. (2024). "Chronos: Learning the Language of Time Series." Amazon.
  10. Lam, R. et al. (2023). "Learning skillful medium-range global weather forecasting." Science. Google DeepMind. (GraphCast)
  11. Price, I. et al. (2024). "GenCast: Diffusion-based ensemble forecasting for medium-range weather." Google DeepMind.
  12. Pathak, J. et al. (2022). "FourCastNet: A Global Data-driven High-resolution Weather Forecasting Model." NVIDIA.
  13. Bi, K. et al. (2023). "Accurate medium-range global weather forecasting with 3D neural networks." Nature. Huawei. (Pangu-Weather)
  14. Stock, J. H. & Yogo, M. (2005). "Testing for Weak Instruments in Linear IV Regression." Cambridge UP.
  15. James, W. & Stein, C. (1961). "Estimation with Quadratic Loss." Fourth Berkeley Symposium.
  16. Chernozhukov, V. et al. (2018). "Double/Debiased ML for Treatment and Structural Parameters." Econometrics Journal.
  17. Hartford, J. et al. (2017). "Deep IV: A Flexible Approach for Counterfactual Prediction." ICML 2017.
  18. Nanda, N. et al. (2023). "Progress measures for grokking via mechanistic interpretability." ICLR 2023.
  19. Heckel, R. & Yilmaz, F. F. (2024). "Regularization-wise double descent." ICLR 2024.
  20. Bessemer (2025). "The AI pricing and monetization playbook." bvp.com/atlas.
  21. Jumper, J. et al. (2021). "Highly accurate protein structure prediction with AlphaFold." Nature 596, 583–589.
  22. Conviction (2025). "Plausible Schemes: Measured Physics." conviction.com/startups.html.