Thursday, April 16, 2026

Forecasting Store Sales at Scale: Lessons from 1,782 Time Series

Jose Borges

When Corporación Favorita - Ecuador's largest grocery retailer - opened a Kaggle competition to forecast daily sales across its stores, the dataset looked deceptively simple: four years of sales history, a handful of supporting files, and a single ask. Predict the next 15 days.

The twist: you're forecasting 1,782 time series simultaneously - every combination of 54 stores and 33 product families. Some sell thousands of units a day; others sell zero on most days of the year. A single model has to handle all of them.

Here's what I learned building a solution that beat the strongest naive baseline by 38% on RMSLE.

The Problem

The simplest forecast you can make is "tomorrow will look like last week." For a grocery chain, that's not a bad guess - demand is heavily seasonal and weekly patterns are stable. Any model you build has to prove it's doing more than memorizing the day of the week.

Two naive baselines set the bar:

  • Seasonal naive (repeat last year's value): RMSLE ≈ 0.84
  • Last-week repeat: RMSLE ≈ 0.71

Those are surprisingly hard to beat. A lot of "sophisticated" time series models don't.
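
For readers who want to reproduce the comparison, here is a minimal sketch of the metric and the last-week baseline. It is not the notebook's code: the function names and the toy series are made up for illustration, and it handles a single series rather than all 1,782.

```python
import math

def rmsle(actual, predicted):
    """Root mean squared logarithmic error for non-negative values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared) / len(squared))

def last_week_repeat(history, horizon):
    """Forecast `horizon` days by tiling the last 7 observed days."""
    if len(history) < 7:
        raise ValueError("need at least one full week of history")
    last_week = history[-7:]
    return [last_week[i % 7] for i in range(horizon)]

# Toy daily sales with a weekly pattern (made-up numbers).
history = [10, 12, 11, 13, 20, 30, 25] * 4
forecast = last_week_repeat(history, horizon=15)
```

Because RMSLE works on log1p of the values, it penalizes relative rather than absolute errors, which keeps the high-volume series from dominating the score.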

What Actually Moved the Needle

Three things, in order of impact.

1. Feature engineering, not model complexity

I spent roughly 70% of my time on features and 30% on modeling. The features that mattered most:

  • Lag features: sales 7, 14, 28, 365 days ago
  • Rolling statistics: 7-day and 28-day mean/std/max by store-family
  • Calendar features: day of week, month, Ecuadorian national holidays, payday (the 15th and the last day of the month - Ecuadorian salaries are paid twice a month, which materially shifts grocery demand)
  • Oil price: Ecuador's economy is oil-dependent; a lagged 7-day moving average of the crude oil price correlates with discretionary spending
  • Promotions: onpromotion counts per store-family per day, plus 7-day rolling sums
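
As a rough illustration of the lag and rolling features in the list above, here is a dependency-free sketch for a single store-family series. The function name, the feature names, and the choice of population standard deviation are assumptions for this example. The one property that matters is that every feature for day t is built only from days before t.

```python
from statistics import mean, pstdev

def lag_and_rolling_features(sales, lags=(7, 14, 28, 365), windows=(7, 28)):
    """Return one feature dict per day for a single daily sales series.

    Features for day t use only sales[:t], so the target value never
    leaks into its own features. Missing history is reported as None.
    """
    rows = []
    for t in range(len(sales)):
        row = {}
        for lag in lags:
            row[f"lag_{lag}"] = sales[t - lag] if t >= lag else None
        for w in windows:
            window = sales[t - w:t] if t >= w else None
            has_window = window is not None
            row[f"roll_mean_{w}"] = mean(window) if has_window else None
            row[f"roll_std_{w}"] = pstdev(window) if has_window else None
            row[f"roll_max_{w}"] = max(window) if has_window else None
        rows.append(row)
    return rows

# Toy series: sales on day i equal i, which makes the features easy to check.
rows = lag_and_rolling_features(list(range(40)))
```

On a full dataset the same idea would be applied per store-family group, so that lags and windows never cross from one series into another.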

The single biggest jump in accuracy came from adding the payday features. That's a domain signal a generic model would never infer on its own.
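
A payday flag is cheap to build from the date alone. The sketch below assumes paydays fall on the 15th and the last calendar day of each month. The `days_since_payday` counter is an extra I'm adding for illustration, not a feature described above.

```python
import calendar
from datetime import date

def payday_features(day):
    """Payday flag plus days elapsed since the most recent payday."""
    last_day = calendar.monthrange(day.year, day.month)[1]
    is_payday = day.day == 15 or day.day == last_day
    if day.day == last_day:
        days_since = 0
    elif day.day >= 15:
        days_since = day.day - 15
    else:
        # Before the 15th, the most recent payday was the last day
        # of the previous month, which is `day.day` days ago.
        days_since = day.day
    return {"is_payday": is_payday, "days_since_payday": days_since}

mid_month = payday_features(date(2017, 8, 15))
month_end = payday_features(date(2016, 2, 29))  # leap-year February
early = payday_features(date(2017, 8, 1))
```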

2. Validation that actually simulates deployment

Random k-fold on time series is a bug disguised as a best practice. If your validation splits let the model see "future" rows during training, your offline metrics will be optimistic - sometimes catastrophically so.

I used a walk-forward scheme: train on weeks 1–N, validate on weeks N+1 and N+2, step forward, repeat. It's slower, but it mirrors the real production setting - and the gap between my validation error and my leaderboard score ended up under 3%, which is what you want.
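
Here is a minimal sketch of what such a split generator can look like. The expanding training window, the two-week step, and the function and parameter names are assumptions for illustration. The invariant is that every training window ends before its validation window begins.

```python
from datetime import date, timedelta

def walk_forward_splits(start, end, min_train_weeks, val_weeks=2, step_weeks=2):
    """Yield (train_start, train_end, val_start, val_end), all inclusive.

    The training window always starts at `start` and grows by
    `step_weeks` per fold; validation covers the `val_weeks` after it.
    """
    train_end = start + timedelta(weeks=min_train_weeks) - timedelta(days=1)
    while True:
        val_start = train_end + timedelta(days=1)
        val_end = val_start + timedelta(weeks=val_weeks) - timedelta(days=1)
        if val_end > end:
            break  # not enough data left for a full validation window
        yield start, train_end, val_start, val_end
        train_end += timedelta(weeks=step_weeks)

# Twelve weeks of toy date range, starting with four weeks of training data.
folds = list(walk_forward_splits(date(2017, 1, 2), date(2017, 3, 26), min_train_weeks=4))
```

Rolling features and lags should be computed inside each fold's date boundaries (or strictly from past values), otherwise the split is honest but the features still leak.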

3. Boosted trees over deep learning (for this dataset)

I tried two approaches in parallel:

  • LightGBM with engineered features
  • A small neural sequence model (think: TFT-lite or N-HiTS) on raw sequences

LightGBM won by a meaningful margin at roughly 1/40th the training cost. Deep learning for time series shines when you have homogeneous, high-frequency data and massive history - think electricity load curves with millions of samples per year. For 1,782 mid-frequency series with 4 years of history and rich exogenous signals, gradient boosting is still the pragmatic choice in 2026.

Results

Metric                                           Value
Final RMSLE (private leaderboard)                0.44
vs. seasonal-naive baseline                      −48%
vs. last-week baseline                           −38%
Training time (full model)                       ~22 min on a MacBook
Inference (15-day forecast, all 1,782 series)    < 3 sec

The 38% improvement over last-week naive is the number I lead with when I talk about this project - it's the most honest comparison, because nobody deploying a forecasting system is going up against a model that's already terrible.
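
The percentages are relative reductions in RMSLE. A quick check using the rounded scores quoted in this post (the helper name is just for illustration):

```python
def relative_reduction(baseline, model):
    """Fraction by which `model` error is lower than `baseline` error."""
    return (baseline - model) / baseline

vs_last_week = relative_reduction(0.71, 0.44)  # (0.71 - 0.44) / 0.71
vs_seasonal = relative_reduction(0.84, 0.44)   # (0.84 - 0.44) / 0.84
```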

What I'd Do Differently

  • Spend more time on cold-start families. A handful of product families had sparse history (opened mid-dataset or had long zero runs). A hierarchical model that pools information across similar families would likely lift those specifically.
  • Reconcile forecasts. My per-store-family forecasts don't sum to a consistent store-level total. In a real deployment, that inconsistency becomes operationally annoying. Hierarchical reconciliation (MinT, ERM) would fix it.
  • Uncertainty, not just point forecasts. Operations teams don't just need a number - they need to know how confident the model is. Quantile regression or conformal prediction on top of the LightGBM output would make the forecast decision-ready.
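
To make the uncertainty point concrete, here is a minimal sketch of split conformal prediction wrapped around any point forecaster. It is something I would try, not something in the current solution. The function names are hypothetical, and the fixed-width interval in raw sales units is a simplification; scaling by volume or working in log space would probably suit mixed high- and low-volume series better.

```python
import math

def conformal_halfwidth(calib_actual, calib_pred, alpha=0.1):
    """Half-width of a (1 - alpha) split-conformal interval.

    Uses absolute residuals on a held-out calibration set that the
    model was not trained on.
    """
    scores = sorted(abs(a - p) for a, p in zip(calib_actual, calib_pred))
    n = len(scores)
    # Finite-sample-corrected rank of the residual quantile.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this coverage
    return scores[rank - 1]

def to_interval(point_forecast, halfwidth):
    """Symmetric interval around the forecast, floored at zero sales."""
    return max(0.0, point_forecast - halfwidth), point_forecast + halfwidth

# Toy calibration set (made-up numbers): residuals are 1, 2, ..., 20.
calib_pred = [0.0] * 20
calib_actual = [float(i) for i in range(1, 21)]
halfwidth = conformal_halfwidth(calib_actual, calib_pred, alpha=0.1)
low, high = to_interval(25.0, halfwidth)
```

The coverage guarantee assumes calibration and future residuals are exchangeable, which time series only approximate, so the calibration window should be recent and the realized coverage checked in walk-forward validation.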

Takeaways for Anyone Working on Forecasting

  • Baselines are honest. Beat them publicly, not just on your own validation set.
  • The model is the easy part. The features, the splits, and the loss function are where the game is won.
  • Boring, interpretable models win in production. Every time I've reached for a transformer on tabular-adjacent time series, a well-tuned LightGBM has been within a whisker of it - and it shipped faster, ran cheaper, and was easier to debug.

The full notebook, requirements, and submission files are up on GitHub. Happy to chat if you're working on anything similar.
