Using historical data to predict future outcomes is pivotal across many domains: revenue projections, weather forecasts, stock market trends, sports predictions, and beyond. All of these depend on extracting useful insights from historical data to address diverse forecasting challenges.
To predict future outcomes, a predictive model is built from historical data. Over the past few years, I’ve built dozens of predictive models using a range of techniques, from simple to complex, and from spreadsheets to machine learning. In this post, I’ll discuss the general strategies employed to build predictive models. (If you’re a data scientist, you might scoff at how I’m defining things because you’ve read Forecasting: Principles and Practice backwards and forwards. That’s OK. The book is great!)
Time series data is data that is collected or recorded sequentially over time. Each data point in a time series dataset is associated with a timestamp. Time series forecasting is a crucial analytical approach applied to such data, aiming to predict future values or trends based on the historical patterns observed. By leveraging the temporal order of the data, time series forecasting methods can capture and analyze trends, seasonality, and other recurring patterns to make informed predictions about future developments.
Imagine we are forecasting electricity ⚡ demand for a town using a Simple Moving Average (SMA). Each day, we average the demand observed over the previous 30 days and use that average as the next day's forecast. The following day, we recalculate by incorporating the newest day's data and dropping the oldest. This rolling approach lets the forecast track recent trends and fluctuations in electricity demand. If, over the past 30 days, demand has fluctuated between 3.4 MWh/day and 3.6 MWh/day, our SMA forecast for the next day would fall somewhere in that range, reflecting the evolving pattern of demand over time.
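Here's a minimal sketch of this in Python with pandas; the demand numbers are made up for illustration:

```python
import pandas as pd

# 30 days of made-up demand data in MWh/day.
demand = pd.Series(
    [3.4, 3.5, 3.6, 3.5, 3.4, 3.6] * 5,
    index=pd.date_range("2024-01-01", periods=30, freq="D"),
)

# The 30-day simple moving average; its latest value is the
# forecast for tomorrow.
sma_forecast = demand.rolling(window=30).mean().iloc[-1]
print(f"SMA forecast for tomorrow: {sma_forecast:.2f} MWh")
```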
Most people have likely used SMA to do time series forecasting, without necessarily using the term SMA. The benefit of SMA is that it creates a simple and intuitive predictive model, making it accessible for quick estimations and general trend observations. However, the tradeoffs of SMA are notable. SMA tends to smooth out extreme fluctuations, making it less responsive to sudden changes or outliers in the data. Additionally, it may lag behind abrupt shifts in the underlying pattern of the time series.
To address these problems, most forecasters end up using a more sophisticated strategy. One option is the Exponential Moving Average (EMA), which assigns exponentially decreasing weights to older data points, giving more importance to the most recent observations; this makes EMA more responsive to changes than SMA. Autoregressive integrated moving average (ARIMA) is another popular strategy, and its seasonal extension (SARIMA) works particularly well on time series data with seasonal patterns.
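As a rough sketch, here's what both might look like in Python, using pandas for the EMA and statsmodels for ARIMA (the (1, 1, 1) order is a placeholder, not a tuned choice):

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

demand = pd.Series(
    [3.4, 3.5, 3.6, 3.5, 3.4, 3.6] * 5,
    index=pd.date_range("2024-01-01", periods=30, freq="D"),
)

# EMA: exponentially decaying weights, so recent days count more.
ema_forecast = demand.ewm(span=30).mean().iloc[-1]

# ARIMA: the (p, d, q) order below is a placeholder, not a tuned choice.
arima_forecast = ARIMA(demand, order=(1, 1, 1)).fit().forecast(steps=1)
print(ema_forecast, arima_forecast.iloc[0])
```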
Regression analysis also relies on historical data, but it differs in its approach and objectives. In this method, the emphasis is on establishing a mathematical relationship between the input variables and the corresponding output variable. Unlike time series forecasting, where the primary goal is to predict future values based on temporal patterns, supervised regression aims to understand and quantify the relationships between variables. Through the training of a regression model on historical data, the algorithm learns to generalize and predict outcomes for new, unseen data points.
Returning to the electrical ⚡ demand forecasting scenario: instead of using SMA, we could use regression analysis, building a model where historical electrical demand is the dependent variable and independent variables such as daily temperature and day of the week act as predictors. This approach captures not only the historical patterns but also the influence of external factors on electricity consumption. For instance, the model might reveal that higher temperatures are associated with increased demand for cooling, or that certain days of the week exhibit distinct usage patterns. By considering these variables, regression analysis provides a more nuanced and context-specific prediction, improving both the accuracy and interpretability of our forecast.
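A minimal sketch of this idea with scikit-learn, using invented numbers (temperature in °C and a 0–6 day-of-week index):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: columns are [temperature in °C, day of week 0-6].
X = np.array([[30, 0], [32, 1], [25, 5], [28, 2], [35, 3], [22, 6]])
y = np.array([3.6, 3.7, 3.3, 3.5, 3.9, 3.2])  # demand in MWh/day

model = LinearRegression().fit(X, y)

# Predict demand for a 33 °C Wednesday (day index 2).
print(model.predict(np.array([[33, 2]])))
```

(Encoding day of week as a single integer is itself a questionable modeling choice, since it implies an ordering on the days; more on that next.)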
Regression analysis is not a free lunch, however. In addition to greater complexity, its performance depends heavily on which independent variables are selected and how each of them is encoded. This process is typically called “feature engineering”, and it is part art and part science. Choices about which variables to include or exclude, and how to translate them into numerical features, can significantly impact the model’s performance.
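For example, one common fix for the day-of-week encoding above is one-hot encoding, sketched here with pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "temp_c": [30, 32, 25],
    "day_of_week": ["Mon", "Tue", "Sat"],
})

# One-hot encode day of week so the model doesn't treat it as
# an ordered number; each day becomes its own 0/1 column.
features = pd.get_dummies(df, columns=["day_of_week"])
print(features)
```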
Time series forecasting with covariates combines elements of regression analysis with time series forecasting. In this approach, a traditional time series model is enhanced by incorporating additional independent variables, known as covariates or exogenous variables. These may include external factors such as economic indicators, environmental conditions, or other variables that influence the time series behavior. By integrating these covariates into the forecasting model, practitioners can capture more nuanced relationships and improve the accuracy of predictions, especially when external factors play a significant role in shaping the time series patterns.
Using the electrical ⚡ demand forecasting scenario, we first compute the historical SMA (e.g., a 30-day moving average) over the course of a year. We then examine the correlation between the SMA values and temperature, and calculate an adjustment factor based on temperature. Our final predictive model takes historical consumption data, computes the 30-day moving average, and applies the temperature-based adjustment using the forecasted temperature.
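Here's one way this might look in Python, using synthetic data and a simple linear fit for the temperature adjustment (illustrative, not a production recipe):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic year of data: demand loosely tracks temperature.
rng = np.random.default_rng(0)
days = pd.date_range("2023-01-01", periods=365, freq="D")
temp = 20 + 10 * np.sin(np.arange(365) * 2 * np.pi / 365) + rng.normal(0, 2, 365)
demand = pd.Series(3.0 + 0.02 * temp + rng.normal(0, 0.05, 365), index=days)

# Step 1: the 30-day SMA over the year.
sma = demand.rolling(30).mean()

# Step 2: regress the gap between actual demand and the SMA on
# temperature to learn the adjustment factor.
mask = sma.notna()
adjuster = LinearRegression().fit(
    temp[mask].reshape(-1, 1), (demand - sma)[mask]
)

# Step 3: final forecast = latest SMA + adjustment for tomorrow's
# forecasted temperature.
tomorrow_temp = 28.0
forecast = sma.iloc[-1] + adjuster.predict(np.array([[tomorrow_temp]]))[0]
print(f"Adjusted forecast: {forecast:.2f} MWh")
```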
While adding covariates to a time series forecast can improve accuracy, the feature engineering required for a successful model is even more complex and requires additional experimentation. Each independent variable needs to be aligned to the same time index as the series being forecast. For example, temperature varies over the course of the day, so should we use the temperature at noon? The average temperature for the day? In addition, the dynamic nature of the data demands continuous updates and adjustments to the predictive model to account for changing patterns and relationships.
There are other techniques for building predictive models from historical data. These include Facebook Prophet and various forms of deep learning models (e.g., LSTMs, a type of recurrent neural network, are a popular architecture for time series forecasting). In general, these strategies, while exciting, require significant engineering to outperform standard regression analysis or ARIMA forecasting techniques. That said, in cases where accuracy truly matters, ensemble approaches that combine multiple techniques can produce superior results. Uber, for example, used a hybrid model combining exponential smoothing with a neural network to predict driver supply and demand, with very strong results.
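As an example of how approachable some of these tools are, here's a minimal Prophet sketch, assuming the prophet package is installed and using invented toy data:

```python
import pandas as pd
from prophet import Prophet

# Prophet expects columns "ds" (timestamp) and "y" (value).
df = pd.DataFrame({
    "ds": pd.date_range("2023-01-01", periods=365, freq="D"),
    "y": 3.5 + 0.2 * pd.Series(range(365)).mod(7).eq(5),  # toy data
})

m = Prophet()
m.fit(df)
future = m.make_future_dataframe(periods=30)  # forecast 30 days ahead
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```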
When I started building predictive models, I frequently conflated time series forecasting and regression analysis. I realized that this was because both of these approaches predict a numerical output … but that’s where the similarities end. Understanding these different strategies, and how they apply to your data, is the first step to unlocking the power of predictive models.
In my next article, I’ll talk about applying these different strategies to B2B analytics. In the meantime, if you have a bunch of B2B data that you want to analyze and don’t know where to start, please send an email to info@amorphousdata.com 😀.
(Edit: This post got a good discussion on Hacker News).