Best Practices

How to Build a Load Forecasting Pilot That Produces Defensible Numbers

Designing a rigorous load forecasting pilot with proper train-test splits

Vendor-run pilots with favorable design choices — training on the test period, selecting the best-performing historical window, reporting MAPE without error distribution breakdowns — can produce accuracy numbers that don't survive contact with new data in production. A utility-controlled pilot with a pre-specified evaluation methodology produces numbers that are actually predictive of operational performance.

The Foundational Design Principle: Temporal Hold-Out

Load forecasting accuracy must be evaluated on data that was not used in any form during model training. This seems obvious, but vendor-provided pilot results frequently violate it in subtle ways: models trained on 24 months of data and evaluated on the "last 3 months" often use the full 27-month dataset for feature engineering and normalization, leaking information about the test period into the feature set.

The rigorous approach is a strict temporal cutoff: all data before date T is available for training; all data after date T is the evaluation set; no information from the post-T period is used in feature engineering, normalization, or hyperparameter selection. The evaluation period should contain at least: one summer peak period, one winter heating period, at least two holiday weeks, and at least two weeks in each of the spring and fall transition seasons. This typically requires 12 months of evaluation data, which means the training set should be at least 24 months (longer for ensemble models that benefit from more seasonal diversity).
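The cutoff discipline described above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the function name, column names, and standardization choice are assumptions. The key point is that normalization statistics are computed from the pre-cutoff window only and then reused, unchanged, on the evaluation window.

```python
import numpy as np
import pandas as pd

def temporal_split(df: pd.DataFrame, cutoff, target: str = "load_mw"):
    """Split on a strict temporal cutoff and fit normalization
    statistics on the training window only, so no post-cutoff
    information leaks into the feature set."""
    train = df[df.index < cutoff]
    test = df[df.index >= cutoff]

    feature_cols = [c for c in df.columns if c != target]
    # Normalization parameters come from the training period ONLY.
    mu = train[feature_cols].mean()
    sigma = train[feature_cols].std(ddof=0).replace(0, 1.0)

    train_X = (train[feature_cols] - mu) / sigma
    test_X = (test[feature_cols] - mu) / sigma  # reuse train statistics
    return train_X, train[target], test_X, test[target]
```

The same rule applies to hyperparameter selection: any validation split used for tuning must also sit entirely before the cutoff.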

Shorter evaluation periods that happen to include favorable conditions — a summer with unusually stable temperatures, no major holidays — will produce MAPE numbers that look excellent but won't generalize. The evaluation period should be selected to include the conditions where the model is expected to perform worst, not best.

Defining Accuracy Metrics Before Seeing Results

Pilot evaluation criteria should be fixed in writing before the model is trained, not after reviewing preliminary results. Vendors who are willing to commit to specific accuracy targets before training on your data are demonstrating confidence that the system is generalizable; vendors who insist on seeing the data before committing to accuracy expectations are signaling the opposite.

The metrics to specify in advance are: (1) primary MAPE target at the 15-minute forecast horizon, (2) secondary MAPE target at the 1-hour horizon, (3) MAPE breakdown by season and hour-of-day, (4) confidence interval coverage at 80% and 95% bands (what percentage of actual values fall within the stated confidence intervals), and (5) maximum allowable MAPE during peak demand periods (hours where system load exceeds the 90th percentile of historical load).
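The three metric families above (MAPE, interval coverage, peak-period MAPE) are simple enough to pin down in code as part of the pre-specified evaluation protocol, so there is no ambiguity later about how they are computed. The function names and the 90th-percentile peak definition below follow the text; everything else is an illustrative sketch.

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    a = np.asarray(actual, float)
    f = np.asarray(forecast, float)
    return 100.0 * np.mean(np.abs((a - f) / a))

def interval_coverage(actual, lower, upper):
    """Fraction of actual values falling inside the stated band;
    compare against the nominal 80% or 95% level."""
    a = np.asarray(actual, float)
    return float(np.mean((a >= np.asarray(lower, float))
                         & (a <= np.asarray(upper, float))))

def peak_mape(actual, forecast, historical_load, pct=90):
    """MAPE restricted to intervals where load exceeds the given
    percentile of historical load (90th, per the pilot criteria)."""
    threshold = np.percentile(np.asarray(historical_load, float), pct)
    mask = np.asarray(actual, float) > threshold
    return mape(np.asarray(actual, float)[mask],
                np.asarray(forecast, float)[mask])
```

Writing these into the pilot agreement verbatim removes a common dispute: whether "peak hours" means calendar on-peak hours or load-defined peaks. The definition above is load-defined.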

The peak-period MAPE constraint is particularly important because, as discussed in our analysis of ISO/RTO imbalance energy settlement costs, forecast errors during peak demand periods generate disproportionate settlement charges. A model that achieves excellent aggregate MAPE through strong off-peak performance but fails during peak hours is not operationally valuable.

Baseline Comparison: What You're Comparing Against

Absolute MAPE numbers are less informative than relative improvement over your current baseline. The baseline should be your existing forecasting method, applied to the same evaluation period using the same data available at the time of each forecast. Common baseline choices:

  • Persistence model: Forecast for period T equals actual load for the same period in the previous week. This is a simple but surprisingly strong baseline, particularly for day-of-week-dominated load patterns. Any forecasting system that doesn't beat a persistence model by a meaningful margin is not worth operational consideration.
  • Your current production system: The most directly relevant baseline. Running the new system in parallel with your existing system, on live data for 30 days, with both systems producing forecasts in real time, provides an apples-to-apples comparison. It also validates that the new system works in your actual data environment, not just in a historical back-test.
  • Vendor's published benchmark: Useful for context but not for procurement decisions. Benchmark datasets used in academic publications and vendor materials are selected for well-behaved load profiles with clean data. Your system may be significantly harder to forecast.

Data Transfer and Environment Validation

A pilot that runs on a carefully curated, manually prepared data extract doesn't validate that the system will work with your operational data in production. The pilot data pipeline should use the same data transfer mechanism, the same polling cadence, and the same data quality issues that will be present in production. If your SCADA historian has 2% missing intervals and the pilot uses a clean dataset with all gaps filled, the pilot results won't reflect production performance.

Including at least one known data quality period in the evaluation set — a week where SCADA data quality was degraded, if such a period is available in the historical record — tests the vendor's data validation handling. A system that fails badly on a week where communication interruptions caused 15-minute SCADA gaps is not production-ready regardless of its accuracy on clean data.
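One way to verify that the pilot extract actually carries the historian's gap rate, rather than a cleaned version of it, is to measure the missing-interval fraction directly on both. A small sketch, assuming 15-minute SCADA data with a DatetimeIndex (the function name is illustrative):

```python
import pandas as pd

def missing_interval_fraction(load: pd.Series, freq: str = "15min") -> float:
    """Fraction of expected intervals with no reading over the span
    covered by the series. Gaps in the index and NaN readings both
    count as missing."""
    expected = pd.date_range(load.index.min(), load.index.max(), freq=freq)
    return float(load.reindex(expected).isna().mean())
```

If the historian shows roughly 2% missing intervals and the pilot extract shows near zero, the extract has been cleaned and the pilot results will overstate production performance.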

Evaluating Model Behavior Under Distribution Shift

Load profiles change over time as customer mix changes, building stock becomes more efficient, and behind-the-meter generation grows. A model trained on historical data will degrade in accuracy as the distribution of load patterns shifts away from the training distribution. The pilot should explicitly test for this degradation by evaluating accuracy on the most recent 6 months of data separately from older historical data.

If accuracy is substantially worse on the most recent 6 months than on older data, after controlling for seasonal effects, the model has not adapted to recent load profile changes. The relevant questions for the vendor are: what is the model recalibration mechanism, what triggers it, and in existing deployments, how long did load profile changes take to degrade model performance to an operationally significant degree?
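A simple way to control for season in this comparison is to benchmark the most recent 6 months against the same six calendar months one year earlier, rather than against the immediately preceding window. A sketch of that year-over-year check, assuming timestamped actual and forecast series (the function name and window choices are illustrative):

```python
import numpy as np
import pandas as pd

def yoy_degradation(actual: pd.Series, forecast: pd.Series) -> float:
    """MAPE on the most recent 6 months minus MAPE on the same six
    calendar months one year earlier. Seasonal effects are held
    roughly constant; a large positive gap suggests the model has
    not tracked recent load profile changes."""
    err = 100.0 * np.abs((actual - forecast) / actual)
    end = err.index.max()
    recent = err[err.index > end - pd.DateOffset(months=6)]
    prior = err[(err.index > end - pd.DateOffset(months=18))
                & (err.index <= end - pd.DateOffset(months=12))]
    return float(recent.mean() - prior.mean())
```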

Settlement Cost Analysis: Closing the Loop

The purpose of a load forecasting pilot is ultimately to determine whether the system will reduce operating costs, not to produce impressive accuracy statistics. The settlement cost analysis connects accuracy improvements to dollar value: using the forecast error logs from the evaluation period and the actual real-time LMPs from your ISO/RTO market, calculate the imbalance energy charges that would have been incurred under the pilot system compared to your baseline system.

This calculation requires access to your actual ISO/RTO settlement interval data: real-time LMPs by delivery point for each 5-minute dispatch interval during the evaluation period. With this data, the imbalance cost comparison is straightforward: for each interval, multiply (forecast error MW) × (real-time LMP $/MWh) × (1/12 h, the length of a 5-minute interval) to get the interval imbalance charge, then sum across the evaluation period. The difference between pilot system charges and baseline system charges, pro-rated to an annual figure based on the evaluation period length, is the expected annual savings.
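The per-interval arithmetic above reduces to a few lines. This sketch uses absolute error as a conservative simplification; if your ISO/RTO tariff nets over- and under-forecast deviations into charges and credits, the signed error with the applicable deviation rates would replace it. The function name is an assumption.

```python
import numpy as np

def imbalance_cost(forecast_mw, actual_mw, rt_lmp, interval_h=1.0 / 12.0):
    """Sum of per-interval imbalance charges over the evaluation
    period: |forecast error MW| * real-time LMP ($/MWh) * interval
    length in hours (1/12 h for a 5-minute dispatch interval)."""
    err_mw = np.abs(np.asarray(forecast_mw, float) - np.asarray(actual_mw, float))
    return float(np.sum(err_mw * np.asarray(rt_lmp, float) * interval_h))
```

Run this once with the pilot system's interval error log and once with the baseline's, over the same intervals and the same LMPs; the difference is the dollar figure the pilot exists to produce.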

A vendor who resists providing forecast error logs at the interval level — claiming that aggregate accuracy statistics are sufficient for the settlement cost analysis — is preventing the utility from doing the analysis that would most credibly demonstrate value. Interval-level logs are a contractual deliverable that should be specified before the pilot begins.

Transparent Pilots

GridKern provides interval-level error logs, hold-out accuracy statistics, and settlement cost analysis as standard pilot deliverables

We commit to accuracy targets before training on your data. The pilot results are yours to keep and independently verify.
