Exploring Auto ARIMA in Python for Multiple Time Series Forecasting

14 min read · May 8

Forecasting is the process of using historical data to predict future events or trends. It is a critical tool for businesses and organizations, allowing them to plan for future changes and opportunities. In this article we will discuss the `auto_arima()` function and how it can help us forecast multiple time series at once.

A while ago, my boss tasked me with forecasting job opening trends for the next few months. The request was not just for the total number of job openings, but also segmented by job categories and countries in our market.

Okay, that’s a lot.

My first thought, once I knew I had to produce a forecast, was to use ARIMA with the standard procedure. If we ask ChatGPT what the steps are, the answer looks like this (feel free to skip it if you have already mastered ARIMA):

1. Stationarity Check: The first step in ARIMA modeling is to check for stationarity of the time series. Stationarity means that the statistical properties of the time series such as the mean and variance remain constant over time. If the time series is not stationary, it can be made stationary by taking the first or second difference or applying a seasonal differencing. The Augmented Dickey-Fuller (ADF) test is commonly used to test for stationarity.
2. Identification of p, d, and q: The next step is to identify the values of p, d, and q that should be used in the ARIMA model. Here, p is the order of the autoregressive (AR) term, d is the degree of differencing, and q is the order of the moving average (MA) term. These values can be identified by analyzing the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the time series.
3. Model Fitting: Once the values of p, d, and q have been identified, an ARIMA model is fit to the time series using maximum likelihood estimation. The fitted model is then checked for goodness of fit using diagnostic tests such as the Ljung-Box test and residual plots.
4. Model Selection: Several ARIMA models may fit the data well. The best model is selected based on goodness of fit measures such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC).
5. Forecasting: Finally, the selected ARIMA model is used to forecast future values of the time series. The forecast can be obtained using recursive or direct methods.
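
To make steps 1 and 2 a little more concrete, here is a minimal sketch in plain NumPy. It is not part of the original workflow: the random-walk series and the `sample_acf` helper are assumptions made up for illustration. It shows that a non-stationary series has an autocorrelation that stays close to 1, while its first difference (d = 1) looks like noise.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation of x for lags 0..nlags (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom for k in range(nlags + 1)])

# Toy series: a random walk, which is non-stationary by construction
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=500))

# First difference (d = 1) recovers the stationary white-noise increments
dy = np.diff(y)

acf_level = sample_acf(y, nlags=5)
acf_diff = sample_acf(dy, nlags=5)
print("ACF of level:     ", np.round(acf_level, 2))
print("ACF of difference:", np.round(acf_diff, 2))
```

In practice you would read these values from ACF/PACF plots and confirm stationarity with a formal test such as ADF, which is exactly the manual work the rest of this article tries to avoid.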

There are many articles that explain how to carry out each step in Python. I believe this is the ideal way to forecast, but my task required me to repeat all of these steps for many segments produced by combining job categories and countries (not to mention some other special cases).

The steps also require close observation and are somewhat “subjective” (maybe I will get cancelled by my fellow statisticians). Determining p, d, and q is one example: I say “subjective” because we need to look closely at the ACF and PACF charts and judge the values from them. If you are interested in understanding this better, you can find more in this article.

Here comes auto_arima() from pmdarima

So I was too lazy to follow the standard procedure for developing an ARIMA model, and I remembered that R has something that does all of this “automatically”. After a little searching, I found the `auto_arima()` function from the pmdarima library (see doc here).

Basically, `auto_arima()` picks the differencing order d with a unit-root test and then searches for the p and q that give the lowest information criterion (AIC, the Akaike Information Criterion, by default; BIC, the Bayesian Information Criterion, is also available). I really like this method because the traditional way of evaluating ACF and PACF plots is time consuming and honestly… I keep forgetting how to read them. So in the name of “laziness”, this method is really a winner.
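
To illustrate what “pick the order with the lowest AIC” means, here is a toy sketch. It is an assumption-laden simplification, not what pmdarima does internally: it only fits pure AR(p) models by least squares on simulated data, and the `ar_aic` helper is hypothetical. The AIC is computed up to an additive constant, which does not affect the ranking.

```python
import numpy as np

def ar_aic(y, p, max_p):
    """Gaussian AIC (up to a constant) of an AR(p) model with intercept, fitted by least squares.

    All candidates are scored on the same observations (from index max_p on)
    so that their AIC values are comparable.
    """
    target = y[max_p:]
    # Design matrix: intercept plus lags 1..p
    cols = [np.ones(len(target))] + [y[max_p - k : len(y) - k] for k in range(1, p + 1)]
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    rss = np.sum((target - X @ beta) ** 2)
    n, k = len(target), p + 2  # p lag coefficients + intercept + noise variance
    return n * np.log(rss / n) + 2 * k

# Simulate an AR(2) process so that we know the "true" order
rng = np.random.default_rng(42)
n_obs = 1000
y = np.zeros(n_obs)
for t in range(2, n_obs):
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + rng.normal()

# Score every candidate order and keep the one with the lowest AIC
max_p = 4
aic = {p: ar_aic(y, p, max_p) for p in range(max_p + 1)}
best_p = min(aic, key=aic.get)
print(aic, "-> best p:", best_p)
```

The penalty term `2 * k` is what stops the search from always preferring the largest model: an extra lag has to reduce the residual error enough to pay for itself.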

Now let’s talk about how `auto_arima()` compares to the traditional method. It really depends on how good the analyst is at interpreting ACF and PACF plots. A comprehensive comparison has been made in the story below.

In conclusion, `auto_arima()` is a real benefit for carrying out ARIMA analysis efficiently. Assuming the author of that comparison evaluated the seasonality and the ACF/PACF plots correctly in the traditional way, the result of `auto_arima()` was, surprisingly, much better than the traditional one.

Let’s go back to the reason I started this whole search. My task is to forecast multiple time series that come from combinations of job categories and market countries. How can `auto_arima()` help me with this task?

a. Data Preparation

To demonstrate the process, I will use the Iowa Liquor Retail Sales dataset from the BigQuery public datasets. You can pull the same dataset with the following query:

```sql
SELECT
  county,
  DATE_TRUNC(date, MONTH) AS month,
  SUM(bottles_sold) AS total_bottles_sold,
  SUM(sale_dollars) AS total_sale_dollars
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE county IS NOT NULL
GROUP BY 1, 2
ORDER BY 1, 2
```

Yes, this dataset contains time series information on liquor sales, including total bottles sold and total sales in dollars. I aggregated the numbers by the county where the liquor was sold and by month. Using this dataset, we will forecast each county separately, replicating my case where I needed to forecast multiple job segments.

After saving the dataset as a CSV (to minimize the cost of pulling the data), let’s load it as a data frame in Python and see what the data looks like, including its distribution, using the `describe()` function. Since we also need to check how many counties appear in the dataset and the time range, we add the `include = 'all'` argument to `describe()`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read the Iowa liquor sales data from a CSV file into a pandas DataFrame called 'df'
df = pd.read_csv('./dataset/iowa_liquor_sales_data.csv')

# Convert the 'month' column to a datetime format using the pandas to_datetime() function
df['month'] = pd.to_datetime(df['month'])

# Display a random sample of 10 rows for exploratory purposes (display() is available in notebooks)
display(df.sample(10))

# Generate summary statistics for all columns, with datetime columns treated as numeric
# (on pandas >= 2.0, drop datetime_is_numeric: the argument was removed and this is the default)
display(df.describe(include='all', datetime_is_numeric=True))
```

It turns out that we have 104 counties in the dataset! The time range is also quite long, from 2012 to 2022. Let’s cut it down to the most recent years, from January 2017 to November 2022 (the last month in the dataset). For the counties, let’s find the top 10 by average total bottles sold since 2017.

```python
# Filter the DataFrame to include only rows with a 'month' value of January 2017 or later
df = df[df['month'] >= pd.to_datetime('2017-01-01')]

# Group the DataFrame by county and calculate the average total bottles sold per county
df_agg = df.groupby(['county'])['total_bottles_sold'].mean().reset_index()

# Sort the resulting DataFrame in descending order by total bottles sold and select the top 10 counties
df_agg = df_agg.sort_values(['total_bottles_sold'], ascending = False).reset_index(drop = True)
df_agg = df_agg.head(10)

# Create a horizontal bar plot of the top 10 counties by average total bottles sold using Seaborn
fig, axs = plt.subplots(1,1, figsize = (8,6))
sns.barplot(ax = axs, data = df_agg, x = 'total_bottles_sold', y = 'county')
plt.show()
```

Polk is the county with the highest average number of bottles sold, followed by the other nine counties. Notice that Polk is significantly higher than the rest. Let’s keep the top 10 counties and look at their trends in a time series chart.

```python
# Get the top 10 counties by average total bottles sold from the previously created 'df_agg' DataFrame
county_top10 = df_agg['county']

# Filter the original DataFrame to include only the top 10 counties
df_top10 = df[df['county'].isin(county_top10)]

# Create a 2x1 grid of subplots: the top subplot shows the monthly total bottles sold for each
# of the top 10 counties; the bottom subplot shows the monthly mean across those 10 counties
# (without 'hue', seaborn aggregates the counties and draws a confidence band)
fig, axs = plt.subplots(2,1, figsize = (15,13))
sns.lineplot(ax = axs[0], data = df_top10, x = 'month', y = 'total_bottles_sold', hue = 'county')
sns.lineplot(ax = axs[1], data = df_top10, x = 'month', y = 'total_bottles_sold')
plt.show()
```

The time series chart is consistent with the earlier chart of average bottles sold: Polk shows really strong numbers that grow slightly over time.

If you look closely, there is a pattern here (which makes the data more interesting, even though we know this is a toy dataset, or is it?). At the beginning of every year, the total bottles sold drops compared with the previous month and then slowly recovers over the following months. I am not sure about the actual context, but there is clearly seasonality here.

Alright, we now more or less understand the dataset and its characteristics, especially the trend of the time series. Let’s prepare the data before we apply the function.

Our current data is a long-format table in which the county names are values of the county field. Since we will later add values for the forecast results, it is better to have a pivot table in which each column represents a county. Let’s define the function:
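
One quick way to check a pattern like this numerically, rather than by eye, is to average the series by calendar month. The sketch below uses made-up numbers (a flat level with a dip every January), so the series and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy monthly series (hypothetical numbers): flat level with a dip every January
months = pd.date_range("2017-01-01", "2022-11-01", freq="MS")
values = np.where(months.month == 1, 80.0, 100.0)
toy = pd.Series(values, index=months, name="total_bottles_sold")

# Average by calendar month: a seasonal pattern shows up as systematic differences
monthly_profile = toy.groupby(toy.index.month).mean()
print(monthly_profile)
```

On the real data, the same `groupby` on the month of the date index of a single county's series should show whether January is systematically lower than the other months.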

```python
def get_actual_data(data, date, values, category, start_date, end_date):
    # Pivot the data to aggregate total bottles sold by county and month
    data_pivot = pd.pivot_table(data,
                                index = date,
                                values = values,
                                columns = category,
                                aggfunc = 'sum',
                                fill_value = 0)

    # Create a table of monthly dates from start_date to end_date
    monthly_table = pd.DataFrame({'month' : pd.date_range(start = start_date, end = end_date, freq='MS')})

    # Merge the monthly table with the pivot table to fill in missing months with 0s
    # (this assumes the date column is named 'month')
    data_result = monthly_table.merge(data_pivot, left_on = 'month', right_on = 'month', how = 'left').fillna(0)
    data_result = data_result.set_index('month')

    return data_result

# Call the function to get actual sales data for the top 10 counties
df_actual = get_actual_data(data = df_top10,
                            date = 'month',
                            values = 'total_bottles_sold',
                            category = 'county',
                            start_date = '2017-01-01',
                            end_date = '2022-11-01')

# Display the result
df_actual
```

In the `get_actual_data` function, we first pivot the original table so that each county becomes a column. After that, we create a monthly table and merge the pivot table into it. The objective of this procedure is to make sure the dataset has a complete series of months from `start_date` to `end_date`, without any gaps. If a county has a missing month, its value is replaced with 0.
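
To see the gap-filling idea on a tiny example, here is a sketch with hypothetical data. Note that it uses `reindex` against a complete monthly calendar instead of the merge used in `get_actual_data`; the effect on missing months is the same.

```python
import pandas as pd

# Hypothetical long-format data: county B has no row for February,
# and nobody has a row for April
toy_long = pd.DataFrame({
    "month": pd.to_datetime(["2022-01-01", "2022-02-01", "2022-03-01",
                             "2022-01-01", "2022-03-01"]),
    "county": ["A", "A", "A", "B", "B"],
    "total_bottles_sold": [10, 12, 11, 5, 7],
})

# Long to wide: one column per county; fill_value covers B's missing February
toy_wide = toy_long.pivot_table(index="month", columns="county",
                                values="total_bottles_sold",
                                aggfunc="sum", fill_value=0)

# Reindex against a complete monthly calendar so fully absent months become 0
full_months = pd.date_range("2022-01-01", "2022-04-01", freq="MS")
toy_wide = toy_wide.reindex(full_months, fill_value=0)
print(toy_wide)
```

Whether 0 is the right fill value depends on the data: here a missing month is read as “no sales”, which is an assumption worth checking before modeling.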

The result will be like this.

b. Apply auto_arima() for single timeseries

Okay, let’s try fitting an ARIMA model to a single county first to understand how the function works.

```python
# import necessary libraries
import pmdarima as pm

# get actual data for county Polk
data_actual = df_actual['POLK']

# set seasonal to True
seasonal = True

# use pmdarima to automatically select the best ARIMA model
model = pm.auto_arima(data_actual,
                      m=12,                 # seasonal period (12 months per year)
                      seasonal=seasonal,    # True if the series is seasonal
                      d=None,               # let the function determine 'd'
                      test='adf',           # use the ADF test to find 'd'
                      start_p=0, start_q=0, # starting values of p and q
                      max_p=12, max_q=12,   # maximum p and q
                      D=None,               # let the function determine 'D'
                      trace=True,
                      error_action='ignore',
                      suppress_warnings=True,
                      stepwise=True)

# print model summary
print(model.summary())
```

We use Polk county as the sample and set `seasonal` to `True`, since the previous plot showed a seasonal pattern. The `auto_arima` function takes a number of arguments. I put a comment on each of them above, but the important ones to note are `d` and `D`, which we set to `None`. This lets the function determine their values itself instead of us having to analyze the stationarity of the series. We could also supply our own values, but since we will later forecast multiple series, I think it is better to let the function decide than to inspect the series one by one to make each of them stationary.

The `auto_arima` function performed a stepwise search to find the ARIMA model with the lowest AIC value. It searched through various combinations of the p, d, and q parameters of ARIMA, and the P, D, and Q parameters of seasonal ARIMA (SARIMA) models.

The trace output lists the AIC of each candidate model that was evaluated, and the summary shows the estimated parameters of the selected model. The best model is the one with the lowest AIC, which in this case is a non-seasonal ARIMA(0,1,0) model with an intercept term. Its AIC is 1400.206, the smallest value among all candidates.
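
It helps to know what an ARIMA(0,1,0) model with an intercept actually predicts: it is a random walk with drift, so each forecast is the last observed value plus a constant step per period. The sketch below is a simplified illustration with hypothetical numbers. It estimates the drift as the mean of the first differences, which is an assumption here; the maximum likelihood estimate reported by pmdarima may differ slightly.

```python
import numpy as np

def drift_forecast(y, n_periods):
    """Point forecasts from a random walk with drift: y_T + h * c for h = 1..n_periods."""
    y = np.asarray(y, dtype=float)
    c = np.diff(y).mean()                  # estimated drift per period
    steps = np.arange(1, n_periods + 1)    # forecast horizons 1, 2, ..., n_periods
    return y[-1] + steps * c

history = np.array([100.0, 104.0, 103.0, 109.0, 112.0])  # hypothetical values
fc = drift_forecast(history, n_periods=3)
print(fc)
```

Here the differences are 4, -1, 6, and 3, so the drift is 3 and the forecasts are 115, 118, and 121. This is why the forecast from such a model looks like a straight line continuing the average trend, with no seasonal shape.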

c. Multiple timeseries forecasting with auto_arima()

Great, now we know how it works for a single county. Let’s use the same methodology and build a function that applies it to multiple time series. Besides finding the best model for each series, we will also predict the 24 months following the last observation and plot the results.

First, let’s define three functions: `get_forecast_group()`, `get_combined_data()`, and `get_plot_fc()`. I will put all three in one code snippet. It looks long, but we will walk through each function.

```python
from dateutil.relativedelta import relativedelta
import matplotlib.dates as mdates

def get_forecast_group(data, n_periods, seasonal):
    # Initialize empty lists to store forecast data
    data_fc = []
    data_lower = []
    data_upper = []
    data_aic = []
    data_fitted = []

    # Iterate over columns in data
    for group in data.columns:
        # Fit an ARIMA model using the auto_arima function
        data_actual = data[group]
        model = pm.auto_arima(data_actual,
                              start_p=0, start_q=0, # starting values of p and q
                              max_p=12, max_q=12,   # maximum p and q
                              test='adf',           # use the ADF test to find 'd'
                              seasonal=seasonal,    # True if the series is seasonal
                              m=12,                 # seasonal period (12 months per year)
                              d=None,               # let the function determine 'd'
                              D=None,               # let the function determine 'D'
                              trace=False,
                              error_action='ignore',
                              suppress_warnings=True,
                              stepwise=True)

        # Generate forecast and confidence intervals for n_periods into the future
        fc, confint = model.predict(n_periods=n_periods, return_conf_int=True)
        confint = np.asarray(confint)

        # Append plain arrays so values are assigned by position, not by index label
        data_fc.append(np.asarray(fc))
        data_lower.append(confint[:, 0])
        data_upper.append(confint[:, 1])
        data_aic.append(model.aic())
        data_fitted.append(np.asarray(model.fittedvalues()))

    # Monthly index for the forecast horizon, starting one month after the last observation
    index_of_fc = pd.date_range(pd.to_datetime(data.index[-1]) + relativedelta(months = +1), periods = n_periods, freq = 'MS')

    # Create dataframes for forecast, lower bound, upper bound, AIC, and fitted values
    df_fc = pd.DataFrame(index = index_of_fc)
    df_lower = pd.DataFrame(index = index_of_fc)
    df_upper = pd.DataFrame(index = index_of_fc)
    df_aic = pd.DataFrame(index = ['aic'])  # one row, so the scalar AIC is actually stored
    df_fitted = pd.DataFrame(index = data.index)

    # Populate dataframes with forecast data
    for i, group in enumerate(data.columns):
        df_fc[group] = data_fc[i]
        df_lower[group] = data_lower[i]
        df_upper[group] = data_upper[i]
        df_aic[group] = data_aic[i]
        df_fitted[group] = data_fitted[i]

    return df_fc, df_lower, df_upper, df_aic, df_fitted

def get_combined_data(df_actual, df_forecast):
    # Work on copies so the caller's DataFrames do not get an extra 'desc' column
    data_actual = df_actual.copy()
    data_forecast = df_forecast.copy()

    # Add a 'desc' column to indicate whether the data is actual or forecast
    data_actual['desc'] = 'Actual'
    data_forecast['desc'] = 'Forecast'

    # Combine actual and forecast data into a single DataFrame and reset the index
    df_act_fc = pd.concat([data_actual, data_forecast]).reset_index()

    # Rename the index column to 'month'
    df_act_fc = df_act_fc.rename(columns={'index': 'month'})

    # Return the combined DataFrame
    return df_act_fc

def get_plot_fc(df_act_fc, df_lower, df_upper, df_fitted, nrow, ncol, figsize_x, figsize_y, category_field_values, title, ylabel):
    # Set the years and months locators and formatter
    years = mdates.YearLocator()    # every year
    months = mdates.MonthLocator()  # every month
    years_fmt = mdates.DateFormatter('%Y')

    # Melt the data for plotting
    df_melt = df_act_fc.melt(id_vars = ['month', 'desc'])
    df_melt_fitted = df_fitted.reset_index().melt(id_vars = ['month'])

    # Create subplots and set the title (axs is indexed in 2D, so nrow and ncol must both be > 1)
    fig, axs = plt.subplots(nrow, ncol, figsize = (figsize_x,figsize_y))
    fig.suptitle(title, size = 20, y = 0.90)

    i = 0
    j = 0
    for cat in category_field_values:
        # Filter data for the current category
        df_plot = df_melt[df_melt['variable'] == cat]
        df_lower_plot = df_lower[cat]
        df_upper_plot = df_upper[cat]
        df_plot_fitted = df_melt_fitted[df_melt_fitted['variable'] == cat]

        # Plot the actual and forecasted data
        sns.lineplot(ax = axs[j,i], data = df_plot, x = 'month', y = 'value', hue = 'desc', marker = 'o')

        # Plot the fitted values as a semi-transparent line
        sns.lineplot(ax = axs[j,i], data = df_plot_fitted, x = 'month', y = 'value', dashes=True, alpha = 0.5)

        # Set the x-label, y-label, and fill between the lower and upper bounds of the forecast
        axs[j,i].set_xlabel(cat, size = 15)
        axs[j,i].set_ylabel(ylabel, size = 15)
        axs[j,i].fill_between(df_lower_plot.index,
                              df_lower_plot,
                              df_upper_plot,
                              color='k', alpha=.15)

        # Set the legend and y-limits
        axs[j,i].legend(loc = 'upper left')
        axs[j,i].set_ylim([df_plot['value'].min()-1000, df_plot['value'].max()+1000])

        # Set the x-axis tickers and format
        axs[j,i].xaxis.set_major_locator(years)
        axs[j,i].xaxis.set_major_formatter(years_fmt)
        axs[j,i].xaxis.set_minor_locator(months)

        # Move to the next column, wrapping to the next row when the current row is full
        i = i + 1
        if i >= ncol:
            j = j + 1
            i = 0

    plt.show()
```

The first function fits the ARIMA model to the input data and then produces the forecast from the best model found. The process of finding the best model is the same as in the previous section, but now we do not print the search steps. Besides the forecast itself, other useful figures complete our understanding of it, such as the upper and lower bounds of the forecast’s confidence interval.

Let’s call the first function with our previous `df_actual` as input. The other required arguments are the number of periods to forecast and whether the series is seasonal. In this example we pass 24 months as `n_periods` and `True` as `seasonal`.

```python
df_fc, df_lower, df_upper, df_aic, df_fitted = get_forecast_group(data = df_actual,
                                                                  n_periods = 24,
                                                                  seasonal = True)
```

The forecast contains as many rows as the number of periods we requested. Next, we combine the actual data with the forecast in the second function.
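
As a sanity check on the shape of the output, the forecast index can be reproduced on its own. This sketch assumes the last observed month is November 2022, as in our `df_actual`, and uses `pd.DateOffset` rather than the `relativedelta` used inside `get_forecast_group`; both shift the date by one month.

```python
import pandas as pd

last_month = pd.Timestamp("2022-11-01")   # last observed month (assumed, as in df_actual)
n_periods = 24

# Month-start dates beginning one month after the last observation
index_of_fc = pd.date_range(last_month + pd.DateOffset(months=1),
                            periods=n_periods, freq="MS")
print(index_of_fc[0], index_of_fc[-1], len(index_of_fc))
```

With these inputs the 24 forecast rows run from December 2022 to November 2024.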

`df_act_fc = get_combined_data(df_actual = df_actual, df_forecast = df_fc)`

Finally, we plot the results in a grid chart that includes all of the county time series, along with the forecast, the confidence interval, and the fitted values. And voila! Here are the results of our long process.

```python
get_plot_fc(df_act_fc,
            df_lower,
            df_upper,
            df_fitted,
            nrow = 5, ncol = 2,
            figsize_x = 25, figsize_y = 25,
            category_field_values = df_act_fc.drop(['month', 'desc'], axis = 1).columns,
            title = 'Total Bottle Sold on Top 10 Counties',
            ylabel = 'Bottles')
```

Based on the results above, we can gain some insights by examining the forecast for the next 24 months:

• Polk, the strongest county, shows a very confident result that it will experience a positive trend over the next 24 months.
• Pottawatta is also predicted to experience stable growth during this period.
• In contrast, most other counties show a stagnant trend with small fluctuations.

The result can be used to optimize production and inventory management, ensuring that the right products are available in the right quantities at the right time.

Conclusion

In this article, we have demonstrated the application of the `auto_arima()` function to multiple time series forecasting. One advantage of using `auto_arima()` instead of the traditional method is its efficiency and scalability, particularly when dealing with many segments.

Because the output is a rough forecast that comes with some margin of error, it is a reasonable method when approximate numbers are enough. However, if we need detailed and specific forecasts, we should consider the traditional step-by-step method to establish a strong statistical foundation.

Let’s connect on LinkedIn.