Exploring Auto ARIMA in Python for Multiple Time Series Forecasting

Iqbal Rahmadhan
14 min readMay 8, 2023

--

Forecasting is the process of using historical data to predict future events or trends. It is a critical tool for businesses and organizations, allowing them to plan for future changes and opportunities. In this article we will discuss about auto_arima() function and how this method can help us to applying forecasting for multiple timeseries data.

Sunset view of my hometown. Captured by Aisah.

A while ago, my boss tasked me with forecasting job opening trends for the next few months. The request was not just for the total number of job openings, but also segmented by job categories and countries in our market.

Okay that’s a lot.

My first impression when I knew I will do a forecasting is using ARIMA with standard procedure. If we ask ChatGPT what is the steps, it would be like this (can skip if you are already mastered the ARIMA):

  1. Stationarity Check: The first step in ARIMA modeling is to check for stationarity of the time series. Stationarity means that the statistical properties of the time series such as the mean and variance remain constant over time. If the time series is not stationary, it can be made stationary by taking the first or second difference or applying a seasonal differencing. The Augmented Dickey-Fuller (ADF) test is commonly used to test for stationarity.
  2. Identification of p, d, and q: The next step is to identify the values of p, d, and q that should be used in the ARIMA model. Here, p is the order of the autoregressive (AR) term, d is the degree of differencing, and q is the order of the moving average (MA) term. These values can be identified by analyzing the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the time series.
  3. Model Fitting: Once the values of p, d, and q have been identified, an ARIMA model is fit to the time series using maximum likelihood estimation. The fitted model is then checked for goodness of fit using diagnostic tests such as the Ljung-Box test and residual plots.
  4. Model Selection: Several ARIMA models may fit the data well. The best model is selected based on goodness of fit measures such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC).
  5. Forecasting: Finally, the selected ARIMA model is used to forecast future values of the time series. The forecast can be obtained using recursive or direct methods.

There are many articles that explain each steps on how to use it using Python. I believe this should be the ideal way to do the forecasting, but again, my task require me to do all the steps for many segmentation that produced from the combination between the job categories and the country (not mentioned some other special cases).

Also, the steps required some detail observation and kind of “subjective” (maybe I will got cancelled by my fellow statistician). For example like when determining the p, d, and q values. I said “subjective” here because we need to look closely to the ACF and PACF chart and judge them to get the values of p, d, and q. If you interested to understand this more, can find it in this article.

Here comes auto_arima() from pmdarima

So I was too lazy to follow standard procedure of developing ARIMA model and I remember in R we have something like to do all of this “automatically”. After little searching, I found auto_arima() function from pmdarima library (see doc here).

Basically, auto_arima()works to find the optimal order of p, d, and q by taking the lower AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) parameters. I really like this method since the traditional way is to evaluate the ACF and PACF plots is time consuming and honestly… I keep forget how to evaluate it. So in the name of “laziness”, this method is really a winner.

Now let’s talk about how is this auto_arima()compared to traditional method. It really depend on how good is the analyst to interpret ACF and PACF plot. An comprehensive comparison has been made in below story.

In conclusion, auto_arima()really gives benefit on performing ARIMA model analysis in an effective way. Assuming the author is correctly evaluating the seasonality and also ACF/PACF plots in traditional way, the result of auto_arima()surprisingly much better than the traditional one.

Let’s go back to initial purpose why I am doing this whole searching. The task that I need to do is to do forecasting for multiple timeseries that coming from combination of job categories and market countries. How auto_arima()can help me on this task?

a. Data Preparation

To demonstrate the process, I will use a dataset of Iowa Liquor Retail Sales from BigQuery public dataset. You can pull the same dataset

SELECT 
county,
DATE_TRUNC(date, MONTH) AS month,
SUM(bottles_sold) AS total_bottles_sold,
SUM(sale_dollars) AS total_sale_dollars
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE county IS NOT NULL
GROUP BY 1,2
ORDER BY 1,2

Yes, this is a dataset that contain timeseries information of liquor sales including total bottles sold and total sale in dollar. I aggregate the number based on county where the liquor has been sold and in monthly timeframe. Using this dataset we will do forecasting for each of the county to replicate my case where I need to forecast multiple combination of job segmentation.

After saving our dataset in CSV (to minimize the cost of pulling the data), let’s call it as data frame in Python and see how is the data looks like also the distribution using describe()function. Since we also need to check how many county that appears in the dataset and also the time range, we need to add include = 'all' argument in the describe()function.

# Read in the Iowa liquor sales data from a CSV file and store it in a pandas DataFrame object called 'df'
df = pd.read_csv('./dataset/iowa_liquor_sales_data.csv')

# Convert the 'month' column to a datetime format using the pandas to_datetime() function
df['month'] = pd.to_datetime(df['month'])

# Display a random sample of 10 rows from the DataFrame for exploratory purposes
display(df.sample(10))

# Generate a summary of the DataFrame's statistics, including datetime columns treated as numeric
display(df.describe(include='all', datetime_is_numeric=True)))

It turns out that we have 104 counties in the dataset! Also the time range is quite long since 2012 to 2022. Let’s cut the time range into five years back before the last month of the dataset which is from January 2017 to November 2022. For counties, let’s find top 10 counties based on average total bottle sold from 2017.

# Filter the DataFrame to include only rows with a 'month' value of January 2017 or later
df = df[df['month'] >= pd.to_datetime('2017-01-01')]

# Group the DataFrame by county and calculate the average total bottles sold per county
df_agg = df.groupby(['county'])['total_bottles_sold'].mean().reset_index()

# Sort the resulting DataFrame in descending order by total bottles sold and select the top 10 counties
df_agg = df_agg.sort_values(['total_bottles_sold'], ascending = False).reset_index(drop = True)
df_agg = df_agg.head(10)

# Create a horizontal bar plot of the top 10 counties by average total bottles sold using Seaborn
fig, axs = plt.subplots(1,1, figsize = (8,6))
sns.barplot(ax = axs, data = df_agg, x = 'total_bottles_sold', y = 'county')
plt.show()

Now we have Polk as the most county that sold bottle in average, and also other counties. Notice that Polk is significantly higher than the rest here. Let’s keep the top 10 counties list and take a look at the trend using timeseries chart.

# Get the top 10 counties by average total bottles sold from the previously created 'df_agg' DataFrame
county_top10 = df_agg['county']

# Filter the original DataFrame to include only the top 10 counties
df_top10 = df[df['county'].isin(county_top10)]

# Create a 2x1 grid of subplots and plot the monthly total bottles sold for each of the top 10 counties in the top subplot
# and the overall monthly total bottles sold for all counties combined in the bottom subplot
fig, axs = plt.subplots(2,1, figsize = (15,13))
sns.lineplot(ax = axs[0], data = df_top10, x = 'month', y = 'total_bottles_sold', hue = 'county')
sns.lineplot(ax = axs[1], data = df_top10, x = 'month', y = 'total_bottles_sold')
plt.show()

The timeseries chart is quite consistent with the average total bottle sold chart that we have before where Polk is showing really strong number and slightly growing over time.

If you observed closely, there is a pattern that we can find here (make the data is more interesting even we know this is a toy dataset, or is it?). Every beginning of the year the total bottle sold is dropping than the last month and slowly get the number again in the next month. I am not sure about the actual context, but clearly there is a seasonality detected here.

Alright, more or less we already understand the dataset and its characteristic, especially regarding the trend of the timeseries. Let’s prepare the data before we applied the function.

Our current data is a long type table where county values is in the county field. Since later we will add values for the forecasting result, it is better to have a pivot table where each column represent the county. Let’s define the function:

def get_actual_data(data, date, values, category, start_date, end_date):
# Pivot the data to aggregate total bottles sold by county and month
data_pivot = pd.pivot_table(data,
index = date,
values = values,
columns = category,
aggfunc=np.sum,
fill_value=0)

# Create a table of monthly dates from start_date to end_date
monthly_table = pd.DataFrame({'month' : pd.date_range(start = start_date, end = end_date, freq='MS')})

# Merge the monthly table with the pivot table to fill in missing months with 0s
data_result = monthly_table.merge(data_pivot, left_on = 'month', right_on = 'month', how = 'left').fillna(0)
data_result = data_result.set_index('month')
return data_result

# Call the function to get actual sales data for the top 10 counties
df_actual = get_actual_data(data = df_top10,
date = 'month',
values = 'total_bottles_sold',
category = 'county',
start_date = '2017-01-01',
end_date = '2022-11-01')

# Display the result
df_actual

In get_actual_data function, firstly we do a pivot to the original table with the column of county. After that we create a monthly table that will be merged into the pivot table. The objective of this procedure is to make sure in the dataset we will have a complete series of month for start_date to end_date without missing any month. If we found some county has a missing month, then we can replace it with value of 0.

The result will be like this.

b. Apply auto_arima() for single timeseries

Okay, let’s try to fit ARIMA model with a single county first to understand how is the function works.

# import necessary libraries
from statsmodels.tsa.arima_model import ARIMA
import pmdarima as pm

# get actual data for county Polk
data_actual = df_actual['POLK']

# set seasonal to True
seasonal = True

# use pmdarima to automatically select best ARIMA model
model = pm.auto_arima(data_actual,
m=12, # frequency of series
seasonal=seasonal, # TRUE if seasonal series
d=None, # let model determine 'd'
test='adf', # use adftest to find optimal 'd'
start_p=0, start_q=0, # minimum p and q
max_p=12, max_q=12, # maximum p and q
D=None, # let model determine 'D'
trace=True,
error_action='ignore',
suppress_warnings=True,
stepwise=True)

# print model summary
print(model.summary())

We will use Polk county as sample and set the seasonal as True since we know there is a seasonal trend from previous plot that we have. In auto_arima function we need to put some values. I put comment on each arguments above but one of the important values to note is d and D values where we sate as None. Doing this will give the model to determine the value of them instead of we need to analysis the stationary of the model. We can also put the value by our number, but in this case later we will forecast multiple series that I think it is better to let the model determine instead of we need to look one by one to get the stationary series.

The auto_arima function performed a stepwise search to find the ARIMA model with the lowest AIC value. It searched through various combinations of the p, d, and q parameters of ARIMA, and the P, D, and Q parameters of seasonal ARIMA (SARIMA) models.

The summary displays the AIC values and estimated model parameters for each candidate model that was evaluated. The best model is determined based on the lowest AIC value, which in this case is a non-seasonal ARIMA(0,1,0) model with an intercept term. The AIC for this model is 1400.206, which is the smallest value among all candidate models.

c. Multiple timeseries forecasting with auto_arima()

Great, now we know how it works for a single county. Let’s use the same methodology and build a function that apply this function for multiple timeseries. Beside finding the best model for each timeseries, we also try to predict the next 24 month since the last data that we have and plot it in charts.

First, let’s define three functions :get_forecast_group() , get_combined_data() , and get_plot_fc() . I will write down the three function in a snippet code, looks like a long code but we will run through on each of the function.

def get_forecast_group(data, n_periods, seasonal):
# Initialize empty lists to store forecast data
data_fc = []
data_lower = []
data_upper = []
data_aic = []
data_fitted = []

# Iterate over columns in data
for group in data.columns:
# Fit an ARIMA model using the auto_arima function
data_actual = data[group]
model = pm.auto_arima(data_actual,
start_p=0, start_q=0,
max_p=12, max_q=12, # maximum p and q
test='adf', # use adftest to find optimal 'd'
seasonal=seasonal, # TRUE if seasonal series
m=12, # frequency of series
d=None, # let model determine 'd'
D=None, # let model determine 'D'
trace=False,
error_action='ignore',
suppress_warnings=True,
stepwise=True)

# Generate forecast and confidence intervals for n_periods into the future
fc, confint = model.predict(n_periods=n_periods, return_conf_int=True)
index_of_fc = pd.date_range(pd.to_datetime(data_actual.index[-1]) + relativedelta(months = +1), periods = n_periods, freq = 'MS')

# Append forecast data to lists
data_fc.append(fc)
data_lower.append(confint[:, 0])
data_upper.append(confint[:, 1])
data_aic.append(model.aic())
data_fitted.append(model.fittedvalues())

# Create dataframes for forecast, lower bound, and upper bound
df_fc = pd.DataFrame(index = index_of_fc)
df_lower = pd.DataFrame(index = index_of_fc)
df_upper = pd.DataFrame(index = index_of_fc)
df_aic = pd.DataFrame()
df_fitted = pd.DataFrame(index = data_actual.index)

# Populate dataframes with forecast data
i = 0
for group in data.columns:
df_fc[group] = data_fc[i][:]
df_lower[group] = data_lower[i][:]
df_upper[group] = data_upper[i][:]
df_aic[group] = data_aic[i]
df_fitted[group] = data_fitted[i][:]
i = i + 1

return df_fc, df_lower, df_upper, df_aic, df_fitted

def get_combined_data(df_actual, df_forecast):
# Assign input data to separate variables
data_actual = df_actual
data_forecast = df_forecast

# Add a 'desc' column to indicate whether the data is actual or forecast
data_actual['desc'] = 'Actual'
data_forecast['desc'] = 'Forecast'

# Combine actual and forecast data into a single DataFrame and reset the index
df_act_fc = pd.concat([data_actual, data_forecast]).reset_index()

# Rename the index column to 'month'
df_act_fc = df_act_fc.rename(columns={'index': 'month'})

# Return the combined DataFrame
return df_act_fc

def get_plot_fc(df_act_fc, df_lower, df_upper, df_fitted, nrow, ncol, figsize_x, figsize_y, category_field_values, title, ylabel):
# Set the years and months locators and formatter
years = mdates.YearLocator() # every year
months = mdates.MonthLocator() # every month
years_fmt = mdates.DateFormatter('%Y')

# Melt the data for plotting
df_melt = df_act_fc.melt(id_vars = ['month', 'desc'])
df_melt_fitted = df_fitted.reset_index().melt(id_vars = ['month'])

# Create subplots and set the title
fig, axs = plt.subplots(nrow, ncol, figsize = (figsize_x,figsize_y))
fig.suptitle(title, size = 20, y = 0.90)

i = 0
j = 0
for cat in category_field_values:
# Filter data for the current category
df_plot = df_melt[df_melt['variable'] == cat]
df_lower_plot = df_lower[cat]
df_upper_plot = df_upper[cat]
df_plot_fitted = df_melt_fitted[df_melt_fitted['variable'] == cat]

# Plot the actual and forecasted data
sns.lineplot(ax = axs[j,i], data = df_plot, x = 'month', y = 'value', hue = 'desc', marker = 'o')
# Plot the fitted data with dashed lines
sns.lineplot(ax = axs[j,i], data = df_plot_fitted, x = 'month', y = 'value', dashes=True, alpha = 0.5)
# Set the x-label, y-label, and fill between the lower and upper bounds of the forecast
axs[j, i].set_xlabel(cat, size = 15)
axs[j, i].set_ylabel(ylabel, size = 15)
axs[j,i].fill_between(df_lower_plot.index,
df_lower_plot,
df_upper_plot,
color='k', alpha=.15)
# Set the legend and y-limits
axs[j,i].legend(loc = 'upper left')
axs[j,i].set_ylim([df_plot['value'].min()-1000, df_plot['value'].max()+1000])

# Set the x-axis tickers and format
axs[j,i].xaxis.set_major_locator(years)
axs[j,i].xaxis.set_major_formatter(years_fmt)
axs[j,i].xaxis.set_minor_locator(months)

i = i + 1
if i >= ncol:
j = j + 1
i = 0

plt.show()

The first function objective is to fit the ARIMA model with the input data, then get the forecast based on the best model that we get. The process of getting the best model is the same with previous section but now we will not print the steps of searching). Beside the forecast result, other useful figure also can be used to complete our understanding of the forecast, such as the upper and lower bound of forecast’s confident interval.

Let’s call the first function with the input of our previous df_actual . Other arguments that required here is how many periods that we want to return in the forecast and is it a seasonal or not. In this example we will put 24 months as n_periods input and True as seasonal input.

df_fc, df_lower, df_upper, df_aic, df_fitted = get_forecast_group(data = df_actual, 
n_periods = 24,
seasonal = True)

Basically the function will return the number of rows as what we request in the argument. Then we will combine the actual data with the forecast data in the second function.

df_act_fc = get_combined_data(df_actual = df_actual, df_forecast = df_fc)

Finally, we will plot the results in a grid chart that includes all of the country timeseries, along with the forecast, the confidence interval, and the fitted number. And voila! Here are the results of our long process.

get_plot_fc(df_act_fc, 
df_lower,
df_upper,
df_fitted,
nrow = 5, ncol = 2,
figsize_x = 25, figsize_y = 25,
category_field_values = df_act_fc.drop(['month', 'desc'], axis = 1).columns,
title = 'Total Bottle Sold on Top 10 Counties',
ylabel = 'Bottles')

Based on the results above, we can gain some insights by examining the forecast for the next 24 months:

  • Polk, the strongest county, shows a very confident result that it will experience a positive trend over the next 24 months.
  • Pottawatta is also predicted to experience stable growth during this period.
  • In contrast, most other counties show a stagnant trend with small fluctuations.

The result can be used to optimize production and inventory management, ensuring that the right products are available in the right quantities at the right time.

Conclusion

In this article, we have demonstrated the application of the auto_arima() function for multiple timeseries forecasting. One of the advantages of using auto_arima()instead of the traditional method is its effectiveness and scalability, particularly when dealing with many segmentations.

As the output provides a rough forecast with some level of error tolerance, it is a safe method to use. However, if we need to perform a detailed and specific forecasting, we should consider using the traditional step-by-step method to establish a strong statistical foundation.

Let’s connect with me in LinkedIn.

--

--