The utility industry has adopted ML-based forecasting faster than most sectors, but the "ML always beats classical statistics" framing that circulates in vendor literature doesn't survive contact with real operational data. ARIMA and its variants outperform gradient-boosted models in specific, identifiable conditions. Understanding when matters more than picking a winner.
ARIMA (AutoRegressive Integrated Moving Average) and its seasonal variant SARIMA were designed for univariate time series with clear trend and seasonal structure. Load forecasting is among the most suitable applications for SARIMA because power consumption has exceptionally strong weekly and annual seasonality: load on a Tuesday in August is highly predictable from load on the previous Tuesday in August, particularly for systems without major behind-the-meter (BTM) solar or large industrial customers.
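The weekly-seasonality claim can be checked directly on interval data with a sample autocorrelation at the weekly lag (168 hours). A minimal stdlib-only sketch on synthetic hourly load; the base load, amplitudes, and weekday/weekend step below are invented for illustration, not taken from any real system:

```python
import math

# Synthetic hourly load (MW) with a daily cycle and a weekday/weekend step.
# All magnitudes are illustrative.
hours = range(24 * 7 * 8)  # 8 weeks of hourly observations
load = [
    500
    + 80 * math.sin(2 * math.pi * h / 24)         # daily cycle
    + 40 * (0 if (h // 24) % 7 in (5, 6) else 1)  # weekday/weekend step
    for h in hours
]

def autocorr(x, lag):
    """Sample autocorrelation of series x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return cov / var

# High at the weekly lag; low (here negative) at an off-cycle lag.
print(round(autocorr(load, 168), 3), round(autocorr(load, 84), 3))
```

This is the same structure a SARIMA ACF diagnostic exposes: a strong spike at the seasonal lag tells you the weekly term belongs in the model.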
SARIMA's practical advantages in load forecasting are: (1) it requires less training data than ensemble methods — 12 months of interval data is sufficient for a well-specified model, while gradient-boosted models typically need 24+ months to generalize reliably across seasonal regimes; (2) it has fewer hyperparameters to tune, reducing the risk of overfitting to idiosyncratic historical events; and (3) its error structure is interpretable — when a SARIMA model performs poorly, the diagnostic outputs (ACF/PACF plots, residual analysis) clearly indicate whether the issue is under-differencing, missing seasonality terms, or forecast horizon degradation.
For load profiles that are dominated by time-of-day and seasonal patterns — traditional residential and small commercial accounts without smart thermostats or large rooftop solar — SARIMA-based models often achieve competitive accuracy with gradient boosting at 1–3% of the computational cost.
Gradient-boosted decision tree ensembles (XGBoost, LightGBM, and their variants) outperform ARIMA in load forecasting specifically when the relationship between weather variables and load is non-linear and context-dependent. The classic example is the temperature-load relationship during transition seasons: in spring and fall, the load response to a given temperature reading depends on whether the previous week was warm or cold (thermostat recalibration behavior), what day of week it is, and whether it's a major holiday week. ARIMA can incorporate temperature as an exogenous variable through ARIMAX, but it models the relationship as linear — a constant MW-per-degree coefficient that doesn't capture these interactions.
Gradient boosting handles these interactions natively through its tree structure, where temperature effects can differ based on day-of-week, recent history, and other features. The practical difference shows up most clearly in: spring and fall transition period forecasts, holiday load patterns (which break standard day-of-week seasonality), and systems with significant industrial load that responds to price signals or production schedules rather than temperature.
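The linear-coefficient limitation is easy to see with a toy least-squares fit. The numbers below are invented for illustration: weekday cooling load responds at 15 MW/°C, weekend load at 7.5 MW/°C, and a single pooled coefficient (the ARIMAX assumption) splits the difference, mis-fitting both regimes. The day-of-week split is exactly the kind of interaction a tree model discovers on its own:

```python
# Toy data (assumed numbers): (temp_C, is_weekday, load_MW)
data = [
    (30, 1, 620), (32, 1, 650), (34, 1, 680),
    (30, 0, 560), (32, 0, 575), (34, 0, 590),
]

def fit_slope(points):
    """Least-squares slope of load on temperature (MW per degree C)."""
    n = len(points)
    mt = sum(t for t, _ in points) / n
    ml = sum(l for _, l in points) / n
    num = sum((t - mt) * (l - ml) for t, l in points)
    den = sum((t - mt) ** 2 for t, _ in points)
    return num / den

pooled = fit_slope([(t, l) for t, w, l in data])            # one global coefficient
weekday = fit_slope([(t, l) for t, w, l in data if w])      # regime-specific
weekend = fit_slope([(t, l) for t, w, l in data if not w])  # regime-specific

print(pooled, weekday, weekend)  # → 11.25 15.0 7.5
```

The pooled 11.25 MW/°C coefficient is wrong for every hour of the week: it over-predicts weekend response and under-predicts weekday response.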
To put numbers to the comparison: across a dataset of six utility systems ranging from 200 to 800 MW peak load, covering 36 months of 15-minute interval data (2021–2023), mean absolute percentage error (MAPE) at the 15-minute forecast horizon shows the following pattern:
The aggregate MAPE across all conditions shows XGBoost with approximately a 15% improvement over SARIMA, but that aggregate obscures where the gains concentrate: gradient boosting earns its accuracy improvement primarily during high-variability conditions, not during the stable demand periods that represent most operating hours.
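The masking effect is simple arithmetic. In the hedged sketch below, all numbers are invented (not taken from the six-system dataset above): a model with roughly 1% MAPE in stable hours and 10% MAPE in volatile hours reports a modest aggregate that reveals neither figure:

```python
# Illustrative records only: (actual_MW, forecast_MW, regime)
records = [
    (400, 404, "stable"), (410, 406, "stable"), (395, 399, "stable"),
    (405, 401, "stable"), (390, 394, "stable"), (415, 411, "stable"),
    (380, 418, "volatile"), (420, 378, "volatile"),
]

def mape(rows):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs(a - f) / a for a, f, _ in rows) / len(rows)

overall = mape(records)
stable = mape([r for r in records if r[2] == "stable"])
volatile = mape([r for r in records if r[2] == "volatile"])
print(round(stable, 2), round(volatile, 2), round(overall, 2))
```

Comparing models on `overall` alone would hide a 10x error gap between regimes, which is why regime-segmented MAPE is the more useful evaluation.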
The practical implication of the accuracy analysis is that a blended model — one that routes forecast requests to SARIMA during stable-regime conditions and gradient boosting during high-variability conditions — can outperform either model alone. This isn't a novel insight in academic forecasting literature, but it's underused in production utility systems.
The blending logic requires a regime classifier: a real-time assessment of the active forecast conditions that routes each request to the appropriate model component. Simple classifiers based on day-type (holiday/non-holiday, weekday/weekend), season, and NWP-derived cloud cover fraction capture most of the regime variation with minimal operational complexity. More sophisticated classifiers that incorporate recent forecast error trends (if the model has been under-predicting for the last 6 hours, shift weight toward the component that adjusts faster) can improve aggregate accuracy further.
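A simple day-type/season/cloud router of the kind described above can be a few lines of code. Everything in this sketch is a placeholder assumption: the holiday set, the transition-season month boundaries, and the 0.3–0.7 cloud-fraction band would all need tuning per system:

```python
from datetime import date

# Hypothetical holiday calendar; a production system would load the
# utility's own observed-holiday list.
HOLIDAYS = {date(2024, 7, 4), date(2024, 12, 25)}

def route_forecast(day: date, cloud_fraction: float) -> str:
    """Return which model component should serve this forecast request."""
    if day in HOLIDAYS:
        return "gbm"    # holidays break day-of-week seasonality
    if day.month in (3, 4, 5, 9, 10):
        return "gbm"    # transition seasons (assumed boundaries)
    if 0.3 <= cloud_fraction <= 0.7:
        return "gbm"    # partly cloudy: variable BTM solar output
    return "sarima"     # stable regime: cheap, interpretable component

print(route_forecast(date(2024, 1, 15), 0.1))  # stable winter weekday
```

The error-trend extension would add one more input (recent residual bias) and shift the routing threshold rather than hard-switching.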
The operational cost of the ensemble approach is model maintenance: SARIMA coefficients need re-estimation when load patterns shift (e.g., when a large industrial customer modifies operating schedule), and gradient-boosted models need periodic retraining as new weather-load relationships are observed. Managing two model components requires more systematic monitoring than relying on a single model — but the accuracy gains during high-impact operating periods typically justify the overhead.
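The "systematic monitoring" piece can start as a rolling-residual bias check: if a component's forecast errors stay one-sided for a sustained window, flag it for re-estimation or retraining. A minimal sketch; the window length and 5 MW threshold are placeholder values, not tuned recommendations:

```python
from collections import deque

class ResidualDriftMonitor:
    """Flags sustained one-sided forecast bias in a model component."""

    def __init__(self, window=96, threshold_mw=5.0):
        self.residuals = deque(maxlen=window)  # rolling residual buffer
        self.threshold_mw = threshold_mw

    def update(self, actual_mw, forecast_mw):
        """Record one residual; return True once sustained bias is detected."""
        self.residuals.append(actual_mw - forecast_mw)
        if len(self.residuals) < self.residuals.maxlen:
            return False  # not enough history yet
        bias = sum(self.residuals) / len(self.residuals)
        return abs(bias) > self.threshold_mw

# Tiny window for demonstration: four consecutive under-predictions.
monitor = ResidualDriftMonitor(window=4, threshold_mw=5.0)
flags = [monitor.update(a, f) for a, f in
         [(410, 402), (415, 406), (412, 404), (418, 409)]]
print(flags)
```

A mean-bias check like this catches pattern shifts (e.g., the industrial schedule change mentioned above) much sooner than a quarterly aggregate-MAPE review would.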
LSTM (Long Short-Term Memory) networks and transformer-based architectures have shown competitive accuracy in load forecasting research, with several published studies reporting MAPE improvements of 5–15% over gradient boosting on systems with high complexity and long training datasets. Whether these improvements transfer to production utility environments depends heavily on data availability: LSTMs trained on fewer than 24 months of 15-minute data often exhibit high variance across different training windows, making operational reliability worse than simpler models despite better average benchmark performance.
For utilities with 36+ months of clean interval data and the engineering capacity to manage deep learning model infrastructure, LSTM-based or transformer-based models are worth evaluating. For the majority of mid-tier utilities — those with 12–24 months of usable historical data and limited ML operations capability — gradient-boosted ensembles with SARIMA blending offer a better reliability-to-complexity tradeoff.
The honest summary: gradient boosting outperforms ARIMA in conditions that matter most for balancing costs. Neither is as accurate as a well-tuned ensemble during the operating conditions — partly cloudy, transition season, holiday periods — where forecast error translates into the highest real-time imbalance charges.
When evaluating a forecasting system, the model architecture question is secondary to several operational questions that more directly determine whether accuracy benchmarks translate into operational value: Does the system provide probabilistic forecasts with calibrated confidence intervals, not just point forecasts? Are confidence intervals calibrated empirically on your system's data or derived from model assumptions? Does the system maintain separate model components for different operating regimes, or does it apply a single model across all conditions? And does the vendor provide error attribution — identifying which input features or conditions are responsible for forecast misses, rather than just reporting aggregate MAPE?
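The empirical-calibration question has a concrete test: count how often actual load lands inside the forecast's stated interval. A stdlib-only sketch with synthetic interval/actual pairs (the numbers are invented for illustration):

```python
# Each record: (lower_MW, upper_MW, actual_MW) for a nominal 90% interval.
intervals = [
    (390, 430, 410), (400, 440, 445), (380, 420, 415),
    (395, 435, 400), (385, 425, 424), (405, 445, 404),
    (390, 430, 412), (400, 440, 433), (385, 425, 390),
    (395, 435, 427),
]

def empirical_coverage(rows):
    """Fraction of actuals that fall inside their stated interval."""
    hits = sum(1 for lo, hi, actual in rows if lo <= actual <= hi)
    return hits / len(rows)

coverage = empirical_coverage(intervals)
print(coverage)  # nominal 90% interval covering only 80% is too narrow
```

Run on a year of your own history, a nominal 90% interval that covers well under 90% of actuals means the vendor's intervals are derived from model assumptions rather than calibrated on your system.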
Without answers to these questions, a comparison of aggregate MAPE across model architectures doesn't tell you which system will reduce your imbalance charges, which is the operational objective that matters.
Model selection is automatic based on real-time regime classification. You see the forecast — not the plumbing behind it.
Start Pilot Program