Project Overview:¶

The goal of this project is to predict future daily sales of 9146 items in Walmart for the next 28 days.

This notebook covers the entire forecasting process:

  • Loading, preprocessing and optimizing data using Pandas.
  • Exploratory data analysis to gain insights.
  • Implementing the accuracy measure (RMSSE).
  • Feature engineering based on the insights gained from EDA.
  • Choosing parameters for the machine learning model (LightGBM).
  • Fitting models and predicting future sales using statistical and machine learning methods.
  • Creating an ensemble of all models.
  • Visualizing forecasts and comparing the accuracy of all models.

The dataset used in this project is one of the most realistic retail time-series datasets available - the actual Walmart data. Walmart is one of the largest retailers in the world, operating more than 10,500 stores and clubs in 20 countries (as of 2023). The data was made available for the M5 time-series forecasting competition. To bring it even closer to real-world data, it was preprocessed so that it imitates store transactions.

This project is based on materials and knowledge that I gained from: https://corise.com/course/forecasting-with-machine-learning

Data Wrangling¶

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.graphics.tsaplots import plot_pacf, plot_acf

import warnings
warnings.filterwarnings("ignore", category=UserWarning)

from prophet import Prophet
from statsforecast import StatsForecast
from statsforecast.models import HoltWinters
from mlforecast import MLForecast
from sklearn.preprocessing import OrdinalEncoder
from numba import njit
from window_ops.rolling import rolling_mean, rolling_std
import lightgbm as lgb
import gc

Getting our data in the right format¶

Time-series data has to be collected from some real-world, data-generating process. That means that raw data comes in as a series of observations. Depending on your experience with time-series data, you may be used to data that looks like this:

Date Sales
2022-01-01 23
2022-01-02 45
2022-01-03 12
2022-01-04 67
2022-01-05 89

But, if you're in retail, each of those "sales" probably came in some JSON from some point-of-sale system (i.e. cash register) that probably looked something like this:

{
    "timestamp": 2022-01-01 12:34:56,
    "product_id": 5,
    "store_id": 12,
    "category_id": 36,
    ...
}

Usually, it's the job of a data engineer to collect all of these records and aggregate them into a nice, tabular format, but it's worth at least having an appreciation for how it's done. So, we're going to start from a mock version of a transactions table.

In [ ]:
transactions = pd.read_csv(f'transactions_data.csv')

transactions.head()
Out[ ]:
date id item_id dept_id cat_id store_id state_id
0 2013-01-01 13:41:03 HOBBIES_1_004_TX_1_evaluation HOBBIES_1_004 HOBBIES_1 HOBBIES TX_1 TX
1 2013-01-01 07:30:52 HOBBIES_1_004_TX_1_evaluation HOBBIES_1_004 HOBBIES_1 HOBBIES TX_1 TX
2 2013-01-01 11:17:38 HOBBIES_1_004_TX_1_evaluation HOBBIES_1_004 HOBBIES_1 HOBBIES TX_1 TX
3 2013-01-01 20:18:59 HOBBIES_1_025_TX_1_evaluation HOBBIES_1_025 HOBBIES_1 HOBBIES TX_1 TX
4 2013-01-01 21:36:09 HOBBIES_1_028_TX_1_evaluation HOBBIES_1_028 HOBBIES_1 HOBBIES TX_1 TX
In [ ]:
transactions.dtypes
Out[ ]:
date        object
id          object
item_id     object
dept_id     object
cat_id      object
store_id    object
state_id    object
dtype: object

You can see that this is a DataFrame where each row relates to purchases for an individual item. Here's a little data dictionary:

  • date: the time at which an item was bought, down to the second
  • id: the product ID. Each of these is an individual item at a specific store.
  • item_id: this is an identifier for items, but not at the store level. You can use this to find the same item at different stores.
  • dept_id: department ID. One level up from item_id in the hierarchy
  • cat_id: category ID. One level up from dept_id in the hierarchy
  • store_id: identifies the specific store where the product was bought
  • state_id: identifies the specific state where the product was bought

date is supposed to be a datetime-like object, but you can see that when we loaded it from disk, it was loaded in as a string. Let's convert that column to datetime.

The goal is to transform this dataset into one that's easy to analyze and train models on. For this project, the goal is going to be to work at the daily level. So, the first step is to aggregate our transactions data up to the daily level.

To be more specific, this is what we want it to look like:

In [ ]:
# This is a hefty table, so just peeking at the first 5 rows
pd.read_csv(f'sales_data_sampled.csv', nrows=5)
Out[ ]:
date id item_id dept_id cat_id store_id state_id sales
0 2013-01-01 HOBBIES_1_004_TX_1_evaluation HOBBIES_1_004 HOBBIES_1 HOBBIES TX_1 TX 3
1 2013-01-01 HOBBIES_2_075_TX_1_evaluation HOBBIES_2_075 HOBBIES_2 HOBBIES TX_1 TX 0
2 2013-01-01 HOUSEHOLD_1_247_TX_1_evaluation HOUSEHOLD_1_247 HOUSEHOLD_1 HOUSEHOLD TX_1 TX 0
3 2013-01-01 HOUSEHOLD_1_266_TX_1_evaluation HOUSEHOLD_1_266 HOUSEHOLD_1 HOUSEHOLD TX_1 TX 0
4 2013-01-01 FOODS_1_001_TX_1_evaluation FOODS_1_001 FOODS_1 FOODS TX_1 TX 0

You can see that the sales column is really just a daily count of transactions for that particular id.

In the cell below, I will create a dataframe called data, which is the transactions dataframe aggregated to the daily level. It should look like the above, except it won't have zero sales days.

Here is how to convert the transactions data into the daily sales dataframe:

In [ ]:
data = (
    transactions
    .assign(
        date = lambda df: pd.to_datetime(df.date).dt.date   # First converts date str object into a datetime and extracts only year-month-day information
    )
    .pipe(lambda df: df.groupby(list(df.columns))['id'].count()) # Converts each transaction into sales based on id count grouped by rest of the columns
    .reset_index(name='sales')
)
data
Out[ ]:
date id item_id dept_id cat_id store_id state_id sales
0 2013-01-01 FOODS_1_004_TX_1_evaluation FOODS_1_004 FOODS_1 FOODS TX_1 TX 20
1 2013-01-01 FOODS_1_004_TX_2_evaluation FOODS_1_004 FOODS_1 FOODS TX_2 TX 20
2 2013-01-01 FOODS_1_004_TX_3_evaluation FOODS_1_004 FOODS_1 FOODS TX_3 TX 4
3 2013-01-01 FOODS_1_005_TX_2_evaluation FOODS_1_005 FOODS_1 FOODS TX_2 TX 1
4 2013-01-01 FOODS_1_009_TX_2_evaluation FOODS_1_009 FOODS_1 FOODS TX_2 TX 3
... ... ... ... ... ... ... ... ...
3895933 2016-05-22 HOUSEHOLD_2_511_TX_3_evaluation HOUSEHOLD_2_511 HOUSEHOLD_2 HOUSEHOLD TX_3 TX 4
3895934 2016-05-22 HOUSEHOLD_2_513_TX_1_evaluation HOUSEHOLD_2_513 HOUSEHOLD_2 HOUSEHOLD TX_1 TX 2
3895935 2016-05-22 HOUSEHOLD_2_514_TX_3_evaluation HOUSEHOLD_2_514 HOUSEHOLD_2 HOUSEHOLD TX_3 TX 1
3895936 2016-05-22 HOUSEHOLD_2_516_TX_2_evaluation HOUSEHOLD_2_516 HOUSEHOLD_2 HOUSEHOLD TX_2 TX 1
3895937 2016-05-22 HOUSEHOLD_2_516_TX_3_evaluation HOUSEHOLD_2_516 HOUSEHOLD_2 HOUSEHOLD TX_3 TX 2

3895938 rows × 8 columns

Optimizing the data¶

Let's take a look at how our data is being stored in memory.

In [ ]:
data.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3895938 entries, 0 to 3895937
Data columns (total 8 columns):
 #   Column    Dtype 
---  ------    ----- 
 0   date      object
 1   id        object
 2   item_id   object
 3   dept_id   object
 4   cat_id    object
 5   store_id  object
 6   state_id  object
 7   sales     int64 
dtypes: int64(1), object(7)
memory usage: 1.6 GB

1.6 GB of data for our purposes is certainly no joke. But how much of that is really necessary?

Most of our data is stored in the least memory efficient format for pandas: strings (objects). Let's fix that.

When a column is converted to the category dtype, pandas internally replaces the string values with integer codes. This saves memory and improves performance, since integer codes take far less space than repeated strings. At the same time, pandas still lets you refer to and filter those columns by their original string values, which makes categories a very convenient and efficient way to handle string columns and optimize the data.
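
For intuition, here is a tiny sketch (not part of the original pipeline) of what the category dtype does under the hood - the integer codes and the string labels are both accessible:

In [ ]:
# Small illustration of the category dtype: integer codes behind the scenes,
# string labels on the surface
s = pd.Series(['TX_1', 'TX_2', 'TX_1', 'TX_3']).astype('category')
print(s.cat.categories)      # Index(['TX_1', 'TX_2', 'TX_3'], dtype='object')
print(s.cat.codes.tolist())  # [0, 1, 0, 2]
print(s[s == 'TX_1'])        # filtering still works with the original string values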

Let's see how converting into category results in less memory usage:

In [ ]:
data = (
    data
    .assign(
        id = lambda df: df.id.astype('category'),
        item_id = lambda df: df.item_id.astype('category'),
        cat_id = lambda df: df.cat_id.astype('category'),
        store_id = lambda df: df.store_id.astype('category'),
        state_id = lambda df: df.state_id.astype('category'),
        dept_id = lambda df: df.dept_id.astype('category')
    )
)
In [ ]:
data.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3895938 entries, 0 to 3895937
Data columns (total 8 columns):
 #   Column    Dtype   
---  ------    -----   
 0   date      object  
 1   id        category
 2   item_id   category
 3   dept_id   category
 4   cat_id    category
 5   store_id  category
 6   state_id  category
 7   sales     int64   
dtypes: category(6), int64(1), object(1)
memory usage: 209.3 MB

The final DataFrame is down to 209.3 MB, which is about 13% of the original size!

While we're at it, it's worth talking about the best way to store this data on disk. If we saved this as a CSV, it wouldn't maintain any of the data type modifications we just made. Pandas offers a bunch of options for saving DataFrames, but here are the two I'd recommend:

  • Parquet has basically become the industry standard for storing tabular data on disk. It's a columnar file format that automatically compresses your data (which it does really well) and will maintain any data types you use in Pandas, with only a couple exceptions.

  • Feather is also a columnar data format, but it optimizes heavily for read speed. Your file size will be much bigger than Parquet's, but it's really useful when you need to heavily optimize for data reading.
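
For completeness, here is what the Feather round trip would look like, as a small sketch (assumes pyarrow is installed; data.feather is just an illustrative file name). Below, we'll actually use Parquet.

In [ ]:
# Feather round trip: larger files than Parquet, but optimized for read speed
data.to_feather('data.feather')
data = pd.read_feather('data.feather')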

In [ ]:
data.to_parquet('data.parquet')
In [ ]:
data = pd.read_parquet('data.parquet')
In [ ]:
data.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3895938 entries, 0 to 3895937
Data columns (total 8 columns):
 #   Column    Dtype   
---  ------    -----   
 0   date      object  
 1   id        category
 2   item_id   category
 3   dept_id   category
 4   cat_id    category
 5   store_id  category
 6   state_id  category
 7   sales     int64   
dtypes: category(6), int64(1), object(1)
memory usage: 209.3 MB

On my local machine, loading the original CSV took ~13.7 seconds, while loading the Parquet file took only ~0.1 seconds. And our data types were maintained! Nice!

Finishing up our data pre-processing¶

There's one last modification we need to make to our data before it's ready to go. The way that we converted transactions into sales was slightly problematic because now, when a product doesn't sell it just isn't present in our data, rather than appearing as a zero.

That's an issue for our forecasting models, so it needs to be fixed.
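
Here is a toy, single-series sketch of the problem and of the fix we're about to scale up (illustrative dates only):

In [ ]:
# A single id with no sales on Jan 2nd: the day is simply missing from the data
s = pd.Series({pd.Timestamp('2022-01-01'): 3, pd.Timestamp('2022-01-03'): 5})
full_days = pd.date_range('2022-01-01', '2022-01-03', freq='D')
print(s.reindex(full_days, fill_value=0))  # Jan 2nd now appears explicitly as 0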

To handle that for every (date, id) pair at once, a MultiIndex can be created:

First, set your index to columns that the DataFrame is distinct on (date and id).

In [ ]:
data = data.set_index(['date','id'])

Now, create a MultiIndex with all combinations of daily dates and ids using pd.MultiIndex.from_product and use it and .reindex() to fill the gaps in your data.

In [ ]:
# Create date_range that contains all dates between minimum and maximum date from the data
min_date = data.index.get_level_values('date').min()
max_date = data.index.get_level_values('date').max()
dates = pd.date_range(start=min_date, end=max_date, freq='D').rename('date')

# Get all unique ids
ids = data.index.get_level_values('id').unique()

index_to_select = pd.MultiIndex.from_product([dates, ids], names=['date', 'id'])
index_to_select
Out[ ]:
MultiIndex([('2013-01-01', 'FOODS_1_004_TX_1_evaluation'),
            ('2013-01-01', 'FOODS_1_004_TX_2_evaluation'),
            ('2013-01-01', 'FOODS_1_004_TX_3_evaluation'),
            ('2013-01-01', 'FOODS_1_005_TX_2_evaluation'),
            ('2013-01-01', 'FOODS_1_009_TX_2_evaluation'),
            ('2013-01-01', 'FOODS_1_012_TX_1_evaluation'),
            ('2013-01-01', 'FOODS_1_012_TX_2_evaluation'),
            ('2013-01-01', 'FOODS_1_013_TX_3_evaluation'),
            ('2013-01-01', 'FOODS_1_014_TX_3_evaluation'),
            ('2013-01-01', 'FOODS_1_015_TX_2_evaluation'),
            ...
            ('2016-05-22', 'FOODS_2_133_TX_3_evaluation'),
            ('2016-05-22', 'FOODS_2_177_TX_2_evaluation'),
            ('2016-05-22', 'FOODS_2_117_TX_2_evaluation'),
            ('2016-05-22', 'FOODS_2_209_TX_3_evaluation'),
            ('2016-05-22', 'FOODS_2_256_TX_2_evaluation'),
            ('2016-05-22', 'FOODS_2_117_TX_1_evaluation'),
            ('2016-05-22', 'FOODS_2_256_TX_3_evaluation'),
            ('2016-05-22', 'FOODS_2_256_TX_1_evaluation'),
            ('2016-05-22', 'FOODS_3_296_TX_2_evaluation'),
            ('2016-05-22', 'FOODS_2_069_TX_3_evaluation')],
           names=['date', 'id'], length=11322748)
In [ ]:
data = (
 data
    .reindex(index_to_select)
    .sort_index()
)
data.head()
Out[ ]:
item_id dept_id cat_id store_id state_id sales
date id
2013-01-01 FOODS_1_001_TX_1_evaluation NaN NaN NaN NaN NaN NaN
FOODS_1_001_TX_2_evaluation NaN NaN NaN NaN NaN NaN
FOODS_1_001_TX_3_evaluation NaN NaN NaN NaN NaN NaN
FOODS_1_002_TX_1_evaluation NaN NaN NaN NaN NaN NaN
FOODS_1_002_TX_2_evaluation NaN NaN NaN NaN NaN NaN

Finally, let's fill the resulting NaNs in the dataframe. It's tempting to use .groupby().fillna(method='ffill') (and backfilling), but unfortunately this method is quite slow on grouped data. A better way to handle the missing values is to create a temporary dataframe that maps every unique id to its static attributes and use it to fill the NAs in our data.

In [ ]:
data = data.reindex(index_to_select)
temp = data.reset_index()['id']

## split ID column and create other columns
temp.drop_duplicates(inplace=True)
temp = temp.str.split('_', expand=True)
temp['id'] = data.reset_index()['id']
temp['item_id2']=temp[0]+'_'+temp[1]+'_'+temp[2]
temp['dept_id2']=temp[0]+'_'+temp[1]
temp['cat_id2']=temp[0]
temp['store_id2']=temp[3]+'_'+temp[4]
temp['state_id2']=temp[3]
temp.drop([0,1,2,3,4,5],axis=1,inplace=True)
temp.set_index('id',inplace=True)

# data.reset_index(inplace=True)
# data = pd.merge(data,temp, on=['id'], how='left')

## join temp data to help with fillna
data = data.join(temp,on=['id'])

data['item_id'].fillna(data['item_id2'],inplace=True)
data['dept_id'].fillna(data['dept_id2'],inplace=True)
data['cat_id'].fillna(data['cat_id2'],inplace=True)
data['store_id'].fillna(data['store_id2'],inplace=True)
data['state_id'].fillna(data['state_id2'],inplace=True)

data.drop(['item_id2','dept_id2','cat_id2','store_id2','state_id2'],axis=1,inplace=True)

data.fillna({'sales':0},inplace=True)

data
Out[ ]:
item_id dept_id cat_id store_id state_id sales
date id
2013-01-01 FOODS_1_004_TX_1_evaluation FOODS_1_004 FOODS_1 FOODS TX_1 TX 20.0
FOODS_1_004_TX_2_evaluation FOODS_1_004 FOODS_1 FOODS TX_2 TX 20.0
FOODS_1_004_TX_3_evaluation FOODS_1_004 FOODS_1 FOODS TX_3 TX 4.0
FOODS_1_005_TX_2_evaluation FOODS_1_005 FOODS_1 FOODS TX_2 TX 1.0
FOODS_1_009_TX_2_evaluation FOODS_1_009 FOODS_1 FOODS TX_2 TX 3.0
... ... ... ... ... ... ... ...
2016-05-22 FOODS_2_117_TX_1_evaluation FOODS_2_117 FOODS_2 FOODS TX_1 TX 0.0
FOODS_2_256_TX_3_evaluation FOODS_2_256 FOODS_2 FOODS TX_3 TX 1.0
FOODS_2_256_TX_1_evaluation FOODS_2_256 FOODS_2 FOODS TX_1 TX 0.0
FOODS_3_296_TX_2_evaluation FOODS_3_296 FOODS_3 FOODS TX_2 TX 3.0
FOODS_2_069_TX_3_evaluation FOODS_2_069 FOODS_2 FOODS TX_3 TX 0.0

11322748 rows × 6 columns

In [ ]:
# Check if there are any NAs in filled dataframe
data.isna().sum()
Out[ ]:
item_id     0
dept_id     0
cat_id      0
store_id    0
state_id    0
sales       0
dtype: int64
In [ ]:
# Check if sum of sales from original dataframe is equal to sum of sales from preprocessed above transaction dataframe
# If true, preprocessing was done without any errors
print('Sales sum in original data: {}'.format(pd.read_csv(f'sales_data.csv', usecols=['date', 'id', 'sales'])['sales'].sum()))
print('Sales sum in preprocessed transaction data: {}'.format(int(data.sales.sum())))
Sales sum in original data: 12905715
Sales sum in preprocessed transaction data: 12905715

Exploring our data¶

Exploratory data analysis is crucial for building the best models. It allows us to understand the data and assess which features should be created from the time-series history to obtain the best accuracy while modeling.

For this section, the goal is to find a few insights about the data that might be helpful for building models.

  1. Looking for seasonal patterns and trends at high-level sales aggregations (department, categories)
  2. How do high-volume items compare to low-volume/intermittent items? What sort of seasonal patterns are at play at item level?
  3. Does the same item show different behavior at different stores?

There are many techniques for understanding the impact of trend and seasonality patterns in the analyzed time series.

  • Moving Averages (MA): Visualizing moving averages is a helpful way to smooth out fluctuations in the data. Shorter MA windows help reveal underlying recurring patterns, while longer MA windows show the overall trendline.
  • Autocorrelation: Autocorrelation measures the degree of similarity between a time series and a lagged version of itself. It allows us to understand to what extent past data points are correlated with present ones. For instance, beer sales are highly dependent on the day of the week, as many people drink during the weekends; in that case beer sales are expected to show strong autocorrelation at lag 7 (see the short sketch after this list).
  • Prophet: Prophet is an automated forecasting procedure designed by Facebook. It decomposes a time series into two main components (optionally more), trend and seasonalities, by fitting separate curves that break the series down into trend and seasonality terms. It is a very useful tool for understanding the impact of each component in the data.
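
As a quick illustration of the autocorrelation idea, here is a small sketch using the already-imported plot_acf on total daily sales; the day-of-week effect should show up as spikes at lags 7, 14, 21, and so on:

In [ ]:
# ACF of total daily sales across all items; weekly seasonality appears
# as spikes at lags that are multiples of 7
total_daily_sales = data.groupby('date')['sales'].sum()
fig, ax = plt.subplots(figsize=(12, 4))
plot_acf(total_daily_sales, lags=60, ax=ax)
plt.show()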

Before deep diving into EDA, let's check sales distribution:

In [ ]:
# Group data per each id
id_total = data.reset_index().groupby(['id']).sales.sum().reset_index(name = 'total_sales').dropna()

sns.set_style('whitegrid')

fig, ax = plt.subplots(figsize=(12, 6))
sns.kdeplot(data = id_total, x = 'total_sales', ax = ax)
ax.set_title('Distribution of Total sales per each item')
Out[ ]:
Text(0.5, 1.0, 'Distribution of Total sales per each item')

The distribution is highly right-skewed. Most items' total sales are concentrated near zero volume.
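
A quick numeric check of that claim (a small sketch):

In [ ]:
# Quantify the right skew of total sales per item
print('Skewness: {:.2f}'.format(id_total['total_sales'].skew()))
print(id_total['total_sales'].describe(percentiles=[0.25, 0.5, 0.75, 0.95]))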

1. Looking for seasonal patterns and trends at higher aggregations (department, categories)¶

In [ ]:
departments = list(data.dept_id.unique())
categories = list(data.cat_id.unique())

dept_sales = data.groupby(['date', 'dept_id']).sales.sum().unstack('dept_id').fillna(0)
cat_sales = data.groupby(['date', 'cat_id']).sales.sum().unstack('cat_id').fillna(0)
In [ ]:
sns.set_style('whitegrid')
fig, ax = plt.subplots(len(categories), 1, figsize=(12, 6*len(categories)))

for i, cat in enumerate(categories):

    # Plot actuals
    sns.lineplot(data=cat_sales, 
                 x = cat_sales.index, 
                 y = cat, 
                 color = 'lightsteelblue', 
                 ax = ax[i], 
                 label = 'y')

    # Plot 30 days moving average
    sns.lineplot(data=cat_sales.rolling(30, min_periods = 1).mean(),
                x=cat_sales.index, 
                y=cat, color='indianred', 
                ax=ax[i], 
                label = '30days MA')
    
    # Plot 365 days moving average
    sns.lineplot(data=cat_sales.rolling(365, min_periods = 1).mean(), 
                 x=cat_sales.index, 
                 y=cat, 
                 color='darkblue', 
                 ax=ax[i], 
                 label = '365days MA')

    ax[i].set_title('Category: {}'.format(cat))
    ax[i].set_xlabel('')

plt.tight_layout()

Based on the 30-day moving average, we can conclude that there is a clear seasonal pattern in HOUSEHOLD and FOODS sales: the red line rises and falls around the same periods each year. Looking at the 365-day moving average, the HOUSEHOLD and HOBBIES categories are in an upward trend, while the FOODS trendline is flat.

In [ ]:
sns.set_style('whitegrid')
fig, ax = plt.subplots(len(departments), 1, figsize=(12, 6*len(departments)))

for i, dept in enumerate(departments):

    # Plot actuals
    sns.lineplot(data=dept_sales, 
                 x = dept_sales.index, 
                 y = dept, 
                 color = 'lightsteelblue', 
                 ax = ax[i], 
                 label = 'y')

    # Plot 30 days moving average
    sns.lineplot(data=dept_sales.rolling(30, min_periods = 1).mean(),
                x=dept_sales.index, 
                y=dept, color='indianred', 
                ax=ax[i], 
                label = '30days MA')
    
    # Plot 365 days moving average
    sns.lineplot(data=dept_sales.rolling(365, min_periods = 1).mean(), 
                 x=dept_sales.index, 
                 y=dept, 
                 color='darkblue', 
                 ax=ax[i], 
                 label = '365days MA')

    ax[i].set_title('Department: {}'.format(dept))
    ax[i].set_xlabel('')
    
plt.tight_layout()

Based on the charts above, we can conclude that there is yearly seasonality in most of the departments.

Let's fit a Prophet model for each department and investigate its components:

In [ ]:
df_prophet_dept_id = pd.melt(dept_sales.reset_index(), id_vars = ['date'], var_name='dept_id', value_name='y').rename(columns = {'date':'ds'})

for i, dept in enumerate(departments):
    
    model = Prophet(
                    seasonality_mode='multiplicative', 
                    weekly_seasonality=True, 
                    yearly_seasonality=4, 
                    changepoint_prior_scale=0.05
                    )
    
    model.add_seasonality(
    name='monthly', 
    period=365.25/12, 
    fourier_order=4,
    mode='multiplicative'
    )

    
    model.fit(df_prophet_dept_id[df_prophet_dept_id['dept_id'] == dept])
    
    future = model.make_future_dataframe(periods=0)
    
    forecast = model.predict(future)
    
    # Plot the forecast components
    fig1 = model.plot(forecast)
    fig = model.plot_components(forecast)
    fig1.axes[0].set_title(f'Prophet fit for Department {dept}', y=1.05)
    fig.axes[0].set_title(f'Forecast Components for Department {dept}', y=1.05)
    plt.show()
11:13:24 - cmdstanpy - INFO - Chain [1] start processing
11:13:24 - cmdstanpy - INFO - Chain [1] done processing
11:13:27 - cmdstanpy - INFO - Chain [1] start processing
11:13:27 - cmdstanpy - INFO - Chain [1] done processing
11:13:29 - cmdstanpy - INFO - Chain [1] start processing
11:13:29 - cmdstanpy - INFO - Chain [1] done processing
11:13:31 - cmdstanpy - INFO - Chain [1] start processing
11:13:31 - cmdstanpy - INFO - Chain [1] done processing
11:13:33 - cmdstanpy - INFO - Chain [1] start processing
11:13:34 - cmdstanpy - INFO - Chain [1] done processing
11:13:36 - cmdstanpy - INFO - Chain [1] start processing
11:13:36 - cmdstanpy - INFO - Chain [1] done processing
11:13:38 - cmdstanpy - INFO - Chain [1] start processing
11:13:38 - cmdstanpy - INFO - Chain [1] done processing

There is one pattern that applies to all sales time series aggregated at the department level: strong weekly seasonality. Sales are highly dependent on the day of the week, with most consumers shopping during the weekends.

2. How do high-volume items compare to low-volume/intermittent items? What sort of seasonal patterns are at play at item level?¶

To investigate high- and low-volume items, let's create lists of the 25 highest- and 25 lowest-selling items by total sales:

In [ ]:
# group data by item
item_sales = data.groupby(['date','item_id'])['sales'].sum().unstack('item_id').fillna(0)
item_sales_total = data.reset_index().groupby(['item_id']).sales.sum().reset_index(name = 'total_sales')

# Create list of top_n lowest/highest items by volume
top_n = 25

low_volume_items_list = list(item_sales_total.sort_values(by = 'total_sales', ascending = True).head(top_n).item_id.unique())
high_volume_items_list = list(item_sales_total.sort_values(by = 'total_sales', ascending = False).head(top_n).item_id.unique())
In [ ]:
# Inspect lowest items by volume
fig, ax = plt.subplots(len(low_volume_items_list), 2, figsize=(18, 4*len(low_volume_items_list)))

for i, item in enumerate(low_volume_items_list):
    
    sns.lineplot(data=item_sales[item].reset_index(), 
             x = item_sales[item].index, 
             y = item, 
             color = 'lightsteelblue', 
             ax = ax[i,0], 
             label = 'y'
            )

    sns.lineplot(data=item_sales[item].rolling(365, min_periods = 1).mean().reset_index(), 
         x = item_sales[item].index, 
         y = item, 
         color = 'indianred', 
         ax = ax[i,0], 
         label = '365days MA',
         linewidth = 2
        )
    
    
    sns.lineplot(data=item_sales[item].rolling(30, min_periods = 1).mean().reset_index(), 
         x = item_sales[item].index, 
         y = item, 
         color = 'darkblue', 
         ax = ax[i,0], 
         label = '30days MA'
        )
    
    pd.plotting.autocorrelation_plot(item_sales[item], ax=ax[i,1], color = 'steelblue')
    
    ax[i,0].set_title('Low-Volume item Sales: {}'.format(item))
    ax[i,1].set_title('Low-Volume item ACF: {}'.format(item))

plt.tight_layout()

Low-volume items show very little or no pattern at all.

Some item sales are very random/noisy or intermittent.

Only a couple of items display some kind of yearly seasonality, like FOODS_2_073, where there is a significant ACF spike at lag 365.

The 365-day MA shows that most low-volume items have a rather flat trendline.

The 30-day MA does not display any recurring pattern, which indicates no yearly seasonality in most cases.

In [ ]:
# Inspect highest items by volume
fig, ax = plt.subplots(len(high_volume_items_list), 2, figsize=(18, 4*len(high_volume_items_list)))

for i, item in enumerate(high_volume_items_list):
    
    sns.lineplot(data=item_sales[item].reset_index(), 
             x = item_sales[item].index, 
             y = item, 
             color = 'lightsteelblue', 
             ax = ax[i,0], 
             label = 'y'
            )

    sns.lineplot(data=item_sales[item].rolling(365, min_periods = 1).mean().reset_index(), 
         x = item_sales[item].index, 
         y = item, 
         color = 'indianred', 
         ax = ax[i,0], 
         label = '365days MA',
         linewidth = 2
        )
    
    sns.lineplot(data=item_sales[item].rolling(30, min_periods = 1).mean().reset_index(), 
         x = item_sales[item].index, 
         y = item, 
         color = 'darkblue', 
         ax = ax[i,0], 
         label = '30days MA'
        )
    
    pd.plotting.autocorrelation_plot(item_sales[item], ax=ax[i,1], color = 'steelblue')
    
    ax[i,0].set_title('High-Volume item Sales: {}'.format(item))
    ax[i,1].set_title('High-Volume item ACF: {}'.format(item))

plt.tight_layout()

High-volume item sales are seasonal.

The ACF shows significant autocorrelation between current sales and sales from a year earlier for most of the inspected items.

The trendlines shown by the 365-day MA point in different directions across items.

Based on the 30-day MA, we can conclude that there is a recurring seasonal pattern for most of the items, for example FOODS_3_365.

3. Does the same item show different behavior at different stores?¶

In [ ]:
# Inspect lowest items by volume per store
stores_list = list(data.store_id.unique())

fig, ax = plt.subplots(len(low_volume_items_list), len(stores_list), figsize=(6*len(stores_list), 4*len(low_volume_items_list)))

for j, store in enumerate(stores_list):
    for i, item in enumerate(low_volume_items_list):

        data_aux = data.loc[(data['item_id'] == item) & (data['store_id'] == store)]

        sns.lineplot(data=data_aux, 
                 x = data_aux.index.get_level_values('date'), 
                 y = 'sales', 
                 color = 'lightsteelblue', 
                 ax = ax[i,j], 
                 label = 'y'
                )
        
        sns.lineplot(data=data_aux.sales.rolling(365, min_periods = 1).mean().reset_index(), 
             x = item_sales[item].index, 
             y = 'sales', 
             color = 'indianred', 
             ax = ax[i,j], 
             label = '365days MA',
             linewidth = 2
            )

        sns.lineplot(data=data_aux.sales.rolling(30, min_periods = 1).mean().reset_index(), 
             x = item_sales[item].index, 
             y = 'sales', 
             color = 'darkblue', 
             ax = ax[i,j], 
             label = '30days MA'
            )
    
        ax[i,j].set_title('{}, {}'.format(store, item), fontsize = 10)
        ax[i,j].set_xlabel('')
        ax[i,j].set_ylabel('')
        ax[i,j].tick_params(axis='x', which='both', labelbottom=False)

Each row represents a single item, while each column represents a store.

Low-volume item sales are rather random across stores, although there are stores that share the same pattern for a specific item, for example HOUSEHOLD_2_162.

In [ ]:
# Inspect highest items by volume per store
stores_list = list(data.store_id.unique())

fig, ax = plt.subplots(len(high_volume_items_list), len(stores_list), figsize=(6*len(stores_list), 4*len(high_volume_items_list)))

for j, store in enumerate(stores_list):
    for i, item in enumerate(high_volume_items_list):

        data_aux = data.loc[(data['item_id'] == item) & (data['store_id'] == store)]

        sns.lineplot(data=data_aux, 
                 x = data_aux.index.get_level_values('date'), 
                 y = 'sales', 
                 color = 'lightsteelblue', 
                 ax = ax[i,j], 
                 label = 'y'
                )
        
        sns.lineplot(data=data_aux.sales.rolling(365, min_periods = 1).mean().reset_index(), 
             x = item_sales[item].index, 
             y = 'sales', 
             color = 'indianred', 
             ax = ax[i,j], 
             label = '365days MA',
             linewidth = 2
            )

        sns.lineplot(data=data_aux.sales.rolling(30, min_periods = 1).mean().reset_index(), 
             x = item_sales[item].index, 
             y = 'sales', 
             color = 'darkblue', 
             ax = ax[i,j], 
             label = '30days MA'
            )
    
        ax[i,j].set_title('{}, {}'.format(store, item), fontsize = 10)
        ax[i,j].set_xlabel('')
        ax[i,j].set_ylabel('')
        ax[i,j].tick_params(axis='x', which='both', labelbottom=False)

Each row represents a single item, while each column represents a store.

High-volume item sales share similar patterns across stores; there is clearly similar seasonality and trend for the same item in different stores.

For example, the FOODS_3_808 item is declining in all stores while sharing the same seasonal pattern.

Modeling¶

Before we get to modeling, let's create our evaluation setup. The models that we're going to create have a 28-day forecast horizon, and our goal is to best approximate "average" sales.

The first step is to implement our evaluation metric. The original competition used a metric called RMSSE, or "Root Mean Squared Scaled Error." It's similar to the MASE metric that we discussed before, except that it optimizes better for "average" sales (as opposed to MASE, which optimizes for the median, since it's an absolute error metric). The competition actually used a weighted version of RMSSE, which is technically more robust, but we're going to stick to RMSSE. Here's what RMSSE looks like:

$RMSSE = \sqrt{\frac{1}{h}\frac{\sum^{n+h}_{t=n+1} (Y_t - \hat{Y}_t)^2}{\frac{1}{n-1}\sum^n_{t=2} (Y_t - Y_{t-1})^2}}$

where $Y_t$ is the actual future value of sales at date $t$, $\hat{Y}_t$ is your forecast for date $t$, $n$ is the number of dates in our training set, and $h$ is our forecast horizon (28 days, in our case).

That looks intimidating! But, similarly to MASE, you can break it down into two parts:

  • The numerator: $\frac{1}{h}\sum^{n+h}_{t=n+1} (Y_t - \hat{Y}_t)^2$, which is just the MSE for every prediction in the validation set.
  • The denominator: $\frac{1}{n-1}\sum^n_{t=2} (Y_t - Y_{t-1})^2$, which is just the MSE over the entire training set if your forecast was a naive, one-day-ahead forecast. We refer to this as the "scale" since it's really just a benchmark -- errors less than this are better than the benchmark, and errors greater than this are worse.

Of course, the "naive, one-day-ahead forecast" part only works if you calculate both the numerator and denominator separately for each id. So, the idea here is that you are effectively calculating an RMSSE value for each id, and then averaging those to get the final RMSSE.

Last comment: there are products in the dataset that don't start showing sales for some time. For those products, the denominator is only supposed to be calculated after the first sale in the dataset. I'd recommend just dropping the records for those products until that first sale, which is straightforward to do using .cumsum() over sales while grouping by id.

In [ ]:
# Filter out records before each product's first sale using cumsum

data['cumsum'] = data.groupby(['id']).sales.transform('cumsum')

# filter for one product to investigate if cumsum is working
print('before removing records before first sale:')
display(data.loc[(slice(None),'FOODS_1_003_TX_2_evaluation'),:].head(3))

# filter out product rows before first sales
data = data[data['cumsum'] != 0]

# Drop cumsum column
data.drop('cumsum',axis = 1, inplace = True)

print('after removing records before first sale:')
display(data.loc[(slice(None),'FOODS_1_003_TX_2_evaluation'),:].head(3))
before removing records before first sale:
item_id dept_id cat_id store_id state_id sales cumsum
date id
2013-01-02 FOODS_1_003_TX_2_evaluation FOODS_1_003 FOODS_1 FOODS TX_2 TX 3.0 3.0
2013-01-03 FOODS_1_003_TX_2_evaluation FOODS_1_003 FOODS_1 FOODS TX_2 TX 0.0 3.0
2013-01-04 FOODS_1_003_TX_2_evaluation FOODS_1_003 FOODS_1 FOODS TX_2 TX 0.0 3.0
after removing records before first sale:
item_id dept_id cat_id store_id state_id sales
date id
2013-01-02 FOODS_1_003_TX_2_evaluation FOODS_1_003 FOODS_1 FOODS TX_2 TX 3.0
2013-01-03 FOODS_1_003_TX_2_evaluation FOODS_1_003 FOODS_1 FOODS TX_2 TX 0.0
2013-01-04 FOODS_1_003_TX_2_evaluation FOODS_1_003 FOODS_1 FOODS TX_2 TX 0.0

The filtered data for the FOODS_1_003_TX_2_evaluation product now starts at its first sale, so everything seems to be working fine.

Let's implement RMSSE and test it on a simple example (the result should be equal to 0.92290404515501):

In [ ]:
def rmsse(train, val, y_preds, model):
    
    denominator = (
        train.groupby("unique_id")["y"]
        .apply(lambda x: (x.diff() ** 2).sum() / (len(x) - 1))
        .reset_index(name="denominator")
    )

    numerator = (
        pd.merge(val, y_preds, how = 'left', on = ['ds', 'unique_id'])
        .groupby('unique_id')
        .apply(lambda x: ((x["y"] - x[model]) ** 2).sum() / len(x))
        .reset_index(name="numerator")
    )
    
    numerator = numerator.merge(denominator, on=["unique_id"])
    return np.mean(np.sqrt(numerator["numerator"] / numerator["denominator"]).replace(np.inf,0))


test_train = pd.DataFrame({
    'ds': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'j'],
    'unique_id': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
    'y': [3, 2, 5, 100, 150, 60, 10, 20, 30],
})
test_val = pd.DataFrame({
    'ds': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'j'],
    'unique_id': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
    'y': [6, 1, 4, 200, 120, 270, 10, 20, 30],
})

test_y_pred = pd.DataFrame({
    'ds': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'j'],
    'unique_id': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
    'predictions': [1, 2, 3, 180, 160, 240, 20, 30, 40],
})

rmsse(test_train, test_val, test_y_pred, 'predictions')
Out[ ]:
0.92290404515501

Training some models!¶

Finally, we can train some models!

Libraries used: statsforecast and mlforecast by Nixtla.

For modeling I will be using LightGBM. There are several reasons why LightGBM is a powerful tool for forecasting retail sales and why it was used in many of the top-scoring solutions in the M5 competition.

  • LightGBM is highly efficient and faster than other traditional tree-based algorithms, which is crucial in retail, where large amounts of data must be processed.
  • Good at handling non-linear relationships: LightGBM is good at modeling non-linear relationships, which is important in retail forecasting since sales patterns are often non-linear and complex.
  • Feature importance can be inspected, which lets us see which variables the model finds most helpful and is useful when diagnosing our models.
  • LightGBM can learn from many time series at once, so it's more likely to be prepared for future patterns that haven't occurred yet in one time series but have already occurred in another.

'Objective' is a very important LightGBM parameter. As we are dealing with highly right-skewed data, it makes sense to set the objective function to 'tweedie'. The Tweedie distribution assumes that the data is mostly concentrated near 0 with plenty of outliers on the right side, the same as our data. The rest of the parameters are chosen based on my intuition after studying the EDA part.
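
A quick sanity check of how zero-heavy the target is (a small sketch):

In [ ]:
# Share of item-days with zero sales, i.e. the motivation for the Tweedie objective
zero_share = (data['sales'] == 0).mean()
print('Share of zero-sales item-days: {:.1%}'.format(zero_share))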

For benchmarking purposes I will calculate predictions for each unique_id with the Holt-Winters method. Holt-Winters is an old-school time-series forecasting method which assumes that the impact of previous observations on the current one decreases exponentially. It also has trend and seasonality components. The Holt-Winters RMSSE will be used as a benchmark for improving the LightGBM models.
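
To make the "exponentially decreasing impact" idea concrete, here is a tiny sketch of simple exponential smoothing (my own illustration, not StatsForecast's implementation; Holt-Winters adds trend and seasonal terms on top of this):

In [ ]:
def simple_exp_smoothing_forecast(y, alpha=0.3):
    """One-step-ahead forecast: older observations get exponentially smaller weights."""
    level = y[0]
    for obs in y[1:]:
        level = alpha * obs + (1 - alpha) * level  # new observation weighted by alpha, past level by (1 - alpha)
    return level

print(simple_exp_smoothing_forecast([10.0, 12.0, 11.0, 13.0]))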

In [ ]:
# Load Prices data
prices = pd.read_parquet(f'prices.parquet')
prices = prices.reset_index(drop = False)
prices.head()

# Load events data
events = pd.read_parquet(f'calendar.parquet')
events = events.reset_index(drop = False)
# Create dummy variables for events
events = pd.get_dummies(events, columns = list(events.iloc[:,2:].columns))
list_of_events = list(events.iloc[:,1:].columns)
In [ ]:
# Merge prices data
data_with_prices = pd.merge(data.reset_index(),prices, how = 'left', on = ['date','store_id','item_id']).set_index(['date','id'])
data_with_prices.head()
Out[ ]:
item_id dept_id cat_id store_id state_id sales sell_price
date id
2013-01-01 FOODS_1_004_TX_1_evaluation FOODS_1_004 FOODS_1 FOODS TX_1 TX 20.0 1.78
FOODS_1_004_TX_2_evaluation FOODS_1_004 FOODS_1 FOODS TX_2 TX 20.0 1.78
FOODS_1_004_TX_3_evaluation FOODS_1_004 FOODS_1 FOODS TX_3 TX 4.0 1.78
FOODS_1_005_TX_2_evaluation FOODS_1_005 FOODS_1 FOODS TX_2 TX 1.0 3.28
FOODS_1_009_TX_2_evaluation FOODS_1_009 FOODS_1 FOODS TX_2 TX 3.0 2.68
In [ ]:
# Merge events
data_with_prices_and_events = pd.merge(data_with_prices.reset_index(),events, how = 'left', on = ['date']).set_index(['date','id'])
data_with_prices_and_events[list_of_events] = data_with_prices_and_events[list_of_events].fillna(0)
data_with_prices_and_events.head()
Out[ ]:
item_id dept_id cat_id store_id state_id sales sell_price snap_TX event_name_1_Chanukah End event_name_1_Christmas ... event_type_1_Cultural event_type_1_National event_type_1_Religious event_type_1_Sporting event_name_2_Cinco De Mayo event_name_2_Easter event_name_2_Father's day event_name_2_OrthodoxEaster event_type_2_Cultural event_type_2_Religious
date id
2013-01-01 FOODS_1_004_TX_1_evaluation FOODS_1_004 FOODS_1 FOODS TX_1 TX 20.0 1.78 1 0 0 ... 0 1 0 0 0 0 0 0 0 0
FOODS_1_004_TX_2_evaluation FOODS_1_004 FOODS_1 FOODS TX_2 TX 20.0 1.78 1 0 0 ... 0 1 0 0 0 0 0 0 0 0
FOODS_1_004_TX_3_evaluation FOODS_1_004 FOODS_1 FOODS TX_3 TX 4.0 1.78 1 0 0 ... 0 1 0 0 0 0 0 0 0 0
FOODS_1_005_TX_2_evaluation FOODS_1_005 FOODS_1 FOODS TX_2 TX 1.0 3.28 1 0 0 ... 0 1 0 0 0 0 0 0 0 0
FOODS_1_009_TX_2_evaluation FOODS_1_009 FOODS_1 FOODS TX_2 TX 3.0 2.68 1 0 0 ... 0 1 0 0 0 0 0 0 0 0

5 rows × 48 columns

HoltWinters¶

Great resource about HoltWinters: https://otexts.com/fpp2/holt-winters.html.

In [ ]:
val = (
    data_with_prices_and_events
    .reset_index()
    .groupby('id')
    .tail(28)
    .rename(columns={
        'date': 'ds',
        'id': 'unique_id',
        'sales': 'y',
    })
)
train = (
    data_with_prices_and_events
    .reset_index()
    .drop(val.index)
    .rename(columns={
        'date': 'ds',
        'id': 'unique_id',
        'sales': 'y',
    })
)
In [ ]:
sf = StatsForecast(
    models = [HoltWinters(season_length=7)], # Set weekly seasonality
    freq = 'D'
)

sf.fit(
    (
    train[train['unique_id'] != 'FOODS_2_069_TX_3_evaluation'][['ds','unique_id','y']] # Filter out one item with only 4 observations
    .assign(
        unique_id = lambda df: df.unique_id.astype('str')
    )
    )
)

holtwinters_preds = sf.predict(h = 28).reset_index().sort_values(by = ['unique_id','ds'])

plot_data_ets = (
    pd.concat([
        train.groupby('unique_id').tail(45)[['unique_id', 'ds', 'y']], 
        val[['unique_id', 'ds', 'y']].sort_values(by = ['unique_id','ds']), 
        holtwinters_preds
    ])
)
In [ ]:
rmsse_holtwinters = rmsse(train, val, holtwinters_preds, 'HoltWinters')

Recursive Forecasting¶

Recursive forecasting is a time-series forecasting method in which a model is trained on historical data up to a certain point in time and then used to make forecasts iteratively: each prediction is appended to the history as if it were new data, the features are recalculated, and the next step is predicted. This process is repeated until the whole forecast horizon has been covered. Steps (a minimal sketch follows the list):

  1. Predict one step out
  2. Concatenate your prediction onto your ground-truth data
  3. Recalculate your features, using the prediction as a new ground-truth data point
  4. Repeat
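
To make that loop concrete, here is a minimal sketch of the recursive idea for a single series (my own illustration, not MLForecast's internals); model and make_features are hypothetical placeholders for a fitted one-step model and a feature builder:

In [ ]:
def recursive_forecast(history, model, make_features, horizon=28):
    """history: a daily pd.Series indexed by date; model and make_features are hypothetical placeholders."""
    history = history.copy()
    pred_dates, preds = [], []
    for _ in range(horizon):
        X = make_features(history)             # recalculate features from the (extended) history
        y_hat = float(model.predict(X)[-1])    # predict one step out
        next_date = history.index[-1] + pd.Timedelta(days=1)
        history.loc[next_date] = y_hat         # treat the prediction as a new ground-truth point
        pred_dates.append(next_date)
        preds.append(y_hat)
    return pd.Series(preds, index=pred_dates)  # repeat until the horizon is covered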

Based on the EDA, strong yearly and weekly seasonality can be observed, especially for high-volume items. In that case it makes sense to create rolling features with different windows. Feature engineering to capture the impact of the observed time-series components:

  • Rolling averages and standard deviations with shorter windows (7, 14, 30 days) will capture the more frequent seasonal patterns and fluctuations. To include the recent trend and mid-term volatility in the model, I will also create 60- and 180-day rolling averages and standard deviations.
  • Lags at different offsets will add an autoregressive component.
  • Price information will be used as a regressor whose future values are known.
  • The columns representing item_id, cat_id, dept_id and store_id are going to be encoded as static variables.
In [ ]:
# split into training and validation sets and conform the column names to what MLForecast expects
val = (
    data_with_prices_and_events
    .reset_index()
    .groupby('id')
    .tail(28)
    .rename(columns={
        'date': 'ds',
        'id': 'unique_id',
        'sales': 'y',
    })
)
train = (
    data_with_prices_and_events
    .reset_index()
    .drop(val.index)
    .rename(columns={
        'date': 'ds',
        'id': 'unique_id',
        'sales': 'y',
    })
)

# label encode categorical features
cat_feats = ['unique_id', 'item_id', 'dept_id', 'cat_id']
enc_cat_feats = [f'{feat}_enc' for feat in cat_feats]

encoder = OrdinalEncoder()
train[enc_cat_feats] = encoder.fit_transform(train[cat_feats])
val[enc_cat_feats] = encoder.transform(val[cat_feats])

reference_cols = ['unique_id', 'ds', 'y','sell_price']

# add features to this list if you want to use them
features = reference_cols + enc_cat_feats + list_of_events
train = train[features]
val = val[features]

@njit
def rollingmean7d(x):
    return rolling_mean(x, window_size=7)

@njit
def rollingmean14d(x):
    return rolling_mean(x, window_size=14)

@njit
def rollingmean30d(x):
    return rolling_mean(x, window_size=30)

@njit
def rollingmean60d(x):
    return rolling_mean(x, window_size=60)

@njit
def rollingmean180d(x):
    return rolling_mean(x, window_size=180)


#####################################################

@njit
def rollingstd7d(x):
    return rolling_std(x, window_size=7)

@njit
def rollingstd14d(x):
    return rolling_std(x, window_size=14)

@njit
def rollingstd30d(x):
    return rolling_std(x, window_size=30)

@njit
def rollingstd60d(x):
    return rolling_std(x, window_size=60)

@njit
def rollingstd180d(x):
    return rolling_std(x, window_size=180)




# feel free to tweak these parameters!
model_params = {
    'verbose': -1,
    'num_leaves': 256,
    'n_estimators': 500,
    'objective': 'tweedie',
    'tweedie_variance_power': 1.1,
    'metric': 'rmse',
    'learning_rate': 0.03
}




models = [
    lgb.LGBMRegressor(**model_params),
]


fcst = MLForecast(
    models=models,
    freq='D',
    # dictionary reads like this:
    # {number of days to lag the feature: [list of functions to apply to the lagged data]}
    lags=[7, 
          14,
          30,
          31,
          365],
    lag_transforms = {
        7:  [rollingmean7d,
             rollingmean14d,
             rollingmean30d,
             rollingmean60d,
             rollingmean180d,
             rollingstd7d,
             rollingstd14d,
             rollingstd30d,
             rollingstd60d,
             rollingstd180d]
    },
    date_features=['year', 'month', 'day', 'dayofweek', 'quarter']
)

# don't worry about null value warnings. LightGBM and XGBoost can handle it!
fcst.fit(
    train, 
    id_col='unique_id', 
    time_col='ds', 
    target_col='y', 
    dropna=False,
    static_features = enc_cat_feats
)

recursive_preds = fcst.predict(28, 
                           # Add Future values for prices from validation dataset
                           dynamic_dfs = [val[['ds','unique_id','sell_price'] + list_of_events]])

# plot the last 45 days of the training set, the validation set, and the predictions
plot_data_recursive = (
    pd.concat([
        train.groupby('unique_id').tail(45)[['unique_id', 'ds', 'y']], 
        val[['unique_id', 'ds', 'y']], 
        recursive_preds
    ])
)
In [ ]:
rmsse_recursive = rmsse(train, val, recursive_preds, 'LGBMRegressor')
In [ ]:
lgb.plot_importance(fcst.models_['LGBMRegressor'], max_num_features=15 , figsize=(12,6))
Out[ ]:
<AxesSubplot: title={'center': 'Feature importance'}, xlabel='Feature importance', ylabel='Features'>

According to the LightGBM feature importance, the most important variable is price. This is expected, as we are modeling retail sales, which are highly affected by price changes and promotions.

Direct Forecasting¶

Direct forecasting is a bit different from the recursive approach. It directly uses past values of the time series, lagged at least as far as the forecast horizon, to predict the future, instead of iteratively recalculating the features with each prediction.

The idea is: for each department, train 4 different models:

  1. Model_1 - Forecasting 1st week of validation set
  2. Model_2 - Forecasting 2nd week of validation set
  3. Model_3 - Forecasting 3rd week of validation set
  4. Model_4 - Forecasting 4th week of validation set

In total there will be 7*4 = 28 LightGBM models (7 departments times 4 weeks). The item_id and unique_id variables are going to be treated as categoricals. The variables capturing autoregressive relationships, seasonalities and trends are going to be the same as in the recursive approach.

However, each weekly model will use features that are lagged specifically for its week (lag >= forecast horizon); the sketch below shows how the 28-day horizon maps onto the four weekly models.
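
As a small sketch of that mapping (my own illustration, using the weekly model names defined further below):

In [ ]:
# Map each of the 28 forecast days onto the weekly model responsible for it
weekly_models = {1: 'model_1stweek', 2: 'model_2ndweek', 3: 'model_3rdweek', 4: 'model_4thweek'}
horizon_to_model = {day: weekly_models[(day - 1) // 7 + 1] for day in range(1, 29)}
print(horizon_to_model[1], horizon_to_model[8], horizon_to_model[28])  # model_1stweek model_2ndweek model_4thweek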

Feature engineering with the help of the MLForecast library:

In [ ]:
@njit
def rollingmean7d(x):
    return rolling_mean(x, window_size=7)

@njit
def rollingmean14d(x):
    return rolling_mean(x, window_size=14)

@njit
def rollingmean30d(x):
    return rolling_mean(x, window_size=30)

@njit
def rollingmean60d(x):
    return rolling_mean(x, window_size=60)

@njit
def rollingmean180d(x):
    return rolling_mean(x, window_size=180)


#####################################################

@njit
def rollingstd7d(x):
    return rolling_std(x, window_size=7)

@njit
def rollingstd14d(x):
    return rolling_std(x, window_size=14)

@njit
def rollingstd30d(x):
    return rolling_std(x, window_size=30)

@njit
def rollingstd60d(x):
    return rolling_std(x, window_size=60)

@njit
def rollingstd180d(x):
    return rolling_std(x, window_size=180)




# feel free to tweak these parameters!
model_params = {
    'verbose': -1,
    'num_leaves': 256,
    'n_estimators': 500,
    'objective': 'tweedie',
    'tweedie_variance_power': 1.1,
    'metric': 'rmse',
    'learning_rate': 0.03
}



models = [
    lgb.LGBMRegressor(**model_params),
]


fcst = MLForecast(
    models=models,
    freq='D',
    # dictionary reads like this:
    # {number of days to lag the feature: [list of functions to apply to the lagged data]}
    lags=[7, 
          8, 
          9, 
          10, 
          11, 
          12, 
          13, 
          14,
          15,
          16,
          17,
          18,
          19,
          20,
          21,
          22,
          23,
          24,
          25,
          26,
          27,
          28,
          29,
          30,
          31,
          32,
          33,
          34,
          365],
    lag_transforms = {
        7:  [rollingmean7d,
             rollingmean14d,
             rollingmean30d,
             rollingmean60d,
             rollingmean180d,
             rollingstd7d,
             rollingstd14d,
             rollingstd30d,
             rollingstd60d,
             rollingstd180d],
        14:  [rollingmean7d,
             rollingmean14d,
             rollingmean30d,
             rollingmean60d,
             rollingmean180d,
             rollingstd7d,
             rollingstd14d,
             rollingstd30d,
             rollingstd60d,
             rollingstd180d],
        21:  [rollingmean7d,
             rollingmean14d,
             rollingmean30d,
             rollingmean60d,
             rollingmean180d,
             rollingstd7d,
             rollingstd14d,
             rollingstd30d,
             rollingstd60d,
             rollingstd180d],
        28:  [rollingmean7d,
             rollingmean14d,
             rollingmean30d,
             rollingmean60d,
             rollingmean180d,
             rollingstd7d,
             rollingstd14d,
             rollingstd30d,
             rollingstd60d,
             rollingstd180d]
    },
    date_features=['year', 'month', 'day', 'dayofweek', 'quarter']
)


# Create dataframe to preprocess
data_with_prices_toprep = (
    data_with_prices_and_events
    .reset_index()
    .rename(columns = {'date':'ds',
                       'id':'unique_id',
                       'sales': 'y'}).
    drop(['state_id','cat_id'],axis = 1)

)

categorical_features = ['unique_id', 'item_id','dept_id','store_id']
encoder = OrdinalEncoder()
data_with_prices_toprep[categorical_features] = encoder.fit_transform(data_with_prices_toprep[categorical_features])

# Feature engineering per unique id
prep = fcst.preprocess(
    data_with_prices_toprep, 
    id_col='unique_id', 
    time_col='ds', 
    target_col='y', 
    dropna=False,
    static_features = ['item_id','dept_id','store_id']
)


# Split train/val data
val_direct = (
    prep
    .groupby('unique_id')
    .tail(28)
)
train_direct = (
    prep
    .drop(val.index)
)
In [ ]:
# Clean unused dataframes
del data;  gc.collect() 
Out[ ]:
353641
In [ ]:
# Create lists of different feature categories
sales_features = list(train.columns[(train.columns.str.contains('lag')) | (train.columns =='y')])
calendar_features = ['year', 'month', 'day', 'dayofweek', 'quarter']
In [ ]:
# Create features for each multi-horizon model
lags_model_1stweek = ['lag7','lag8','lag9','lag10','lag11','lag12','lag13']
rollingfeatures_model_1stweek =  list(train.columns[(train.columns.str.endswith('_lag7'))])
model_1stweek_features = lags_model_1stweek + rollingfeatures_model_1stweek + calendar_features + ['unique_id', 'item_id'] + list_of_events

lags_model_2ndweek = ['lag14','lag15','lag16','lag17','lag18','lag19','lag20']
rollingfeatures_model_2ndweek =  list(train.columns[(train.columns.str.endswith('_lag14'))])
model_2ndweek_features = lags_model_2ndweek + rollingfeatures_model_2ndweek + calendar_features + ['unique_id', 'item_id'] + list_of_events

lags_model_3rdweek = ['lag21','lag22','lag23','lag24','lag25','lag26','lag27']
rollingfeatures_model_3rdweek =  list(train.columns[(train.columns.str.endswith('_lag21'))])
model_3rdweek_features = lags_model_3rdweek + rollingfeatures_model_3rdweek + calendar_features + ['unique_id', 'item_id'] + list_of_events

lags_model_4thweek = ['lag28','lag29','lag30','lag31','lag32','lag33','lag34']
rollingfeatures_model_4thweek =  list(train.columns[(train.columns.str.endswith('_lag28'))])
model_4thweek_features = lags_model_4thweek + rollingfeatures_model_4thweek + calendar_features + ['unique_id', 'item_id'] + list_of_events
model_4thweek_features

# Create dictionary for each weekly model containing features that are going to be used
models_per_week = {
    'model_1stweek':model_1stweek_features,
    'model_2ndweek':model_2ndweek_features,
    'model_3rdweek':model_3rdweek_features,
    'model_4thweek':model_4thweek_features
}

for i, features in models_per_week.items():
    print('Model: {}'.format(i))
    print('Features used:')
    print(features)
    print(' ')
Model: model_1stweek
Features used:
['lag7', 'lag8', 'lag9', 'lag10', 'lag11', 'lag12', 'lag13', 'year', 'month', 'day', 'dayofweek', 'quarter', 'unique_id', 'item_id', 'snap_TX', 'event_name_1_Chanukah End', 'event_name_1_Christmas', 'event_name_1_Cinco De Mayo', 'event_name_1_ColumbusDay', 'event_name_1_Easter', 'event_name_1_Eid al-Fitr', 'event_name_1_EidAlAdha', "event_name_1_Father's day", 'event_name_1_Halloween', 'event_name_1_IndependenceDay', 'event_name_1_LaborDay', 'event_name_1_LentStart', 'event_name_1_LentWeek2', 'event_name_1_MartinLutherKingDay', 'event_name_1_MemorialDay', "event_name_1_Mother's day", 'event_name_1_NBAFinalsEnd', 'event_name_1_NBAFinalsStart', 'event_name_1_NewYear', 'event_name_1_OrthodoxChristmas', 'event_name_1_OrthodoxEaster', 'event_name_1_Pesach End', 'event_name_1_PresidentsDay', 'event_name_1_Purim End', 'event_name_1_Ramadan starts', 'event_name_1_StPatricksDay', 'event_name_1_SuperBowl', 'event_name_1_Thanksgiving', 'event_name_1_ValentinesDay', 'event_name_1_VeteransDay', 'event_type_1_Cultural', 'event_type_1_National', 'event_type_1_Religious', 'event_type_1_Sporting', 'event_name_2_Cinco De Mayo', 'event_name_2_Easter', "event_name_2_Father's day", 'event_name_2_OrthodoxEaster', 'event_type_2_Cultural', 'event_type_2_Religious']
 
Model: model_2ndweek
Features used:
['lag14', 'lag15', 'lag16', 'lag17', 'lag18', 'lag19', 'lag20', 'year', 'month', 'day', 'dayofweek', 'quarter', 'unique_id', 'item_id', 'snap_TX', 'event_name_1_Chanukah End', 'event_name_1_Christmas', 'event_name_1_Cinco De Mayo', 'event_name_1_ColumbusDay', 'event_name_1_Easter', 'event_name_1_Eid al-Fitr', 'event_name_1_EidAlAdha', "event_name_1_Father's day", 'event_name_1_Halloween', 'event_name_1_IndependenceDay', 'event_name_1_LaborDay', 'event_name_1_LentStart', 'event_name_1_LentWeek2', 'event_name_1_MartinLutherKingDay', 'event_name_1_MemorialDay', "event_name_1_Mother's day", 'event_name_1_NBAFinalsEnd', 'event_name_1_NBAFinalsStart', 'event_name_1_NewYear', 'event_name_1_OrthodoxChristmas', 'event_name_1_OrthodoxEaster', 'event_name_1_Pesach End', 'event_name_1_PresidentsDay', 'event_name_1_Purim End', 'event_name_1_Ramadan starts', 'event_name_1_StPatricksDay', 'event_name_1_SuperBowl', 'event_name_1_Thanksgiving', 'event_name_1_ValentinesDay', 'event_name_1_VeteransDay', 'event_type_1_Cultural', 'event_type_1_National', 'event_type_1_Religious', 'event_type_1_Sporting', 'event_name_2_Cinco De Mayo', 'event_name_2_Easter', "event_name_2_Father's day", 'event_name_2_OrthodoxEaster', 'event_type_2_Cultural', 'event_type_2_Religious']
 
Model: model_3rdweek
Features used:
['lag21', 'lag22', 'lag23', 'lag24', 'lag25', 'lag26', 'lag27', 'year', 'month', 'day', 'dayofweek', 'quarter', 'unique_id', 'item_id', 'snap_TX', 'event_name_1_Chanukah End', 'event_name_1_Christmas', 'event_name_1_Cinco De Mayo', 'event_name_1_ColumbusDay', 'event_name_1_Easter', 'event_name_1_Eid al-Fitr', 'event_name_1_EidAlAdha', "event_name_1_Father's day", 'event_name_1_Halloween', 'event_name_1_IndependenceDay', 'event_name_1_LaborDay', 'event_name_1_LentStart', 'event_name_1_LentWeek2', 'event_name_1_MartinLutherKingDay', 'event_name_1_MemorialDay', "event_name_1_Mother's day", 'event_name_1_NBAFinalsEnd', 'event_name_1_NBAFinalsStart', 'event_name_1_NewYear', 'event_name_1_OrthodoxChristmas', 'event_name_1_OrthodoxEaster', 'event_name_1_Pesach End', 'event_name_1_PresidentsDay', 'event_name_1_Purim End', 'event_name_1_Ramadan starts', 'event_name_1_StPatricksDay', 'event_name_1_SuperBowl', 'event_name_1_Thanksgiving', 'event_name_1_ValentinesDay', 'event_name_1_VeteransDay', 'event_type_1_Cultural', 'event_type_1_National', 'event_type_1_Religious', 'event_type_1_Sporting', 'event_name_2_Cinco De Mayo', 'event_name_2_Easter', "event_name_2_Father's day", 'event_name_2_OrthodoxEaster', 'event_type_2_Cultural', 'event_type_2_Religious']
 
Model: model_4thweek
Features used:
['lag28', 'lag29', 'lag30', 'lag31', 'lag32', 'lag33', 'lag34', 'year', 'month', 'day', 'dayofweek', 'quarter', 'unique_id', 'item_id', 'snap_TX', 'event_name_1_Chanukah End', 'event_name_1_Christmas', 'event_name_1_Cinco De Mayo', 'event_name_1_ColumbusDay', 'event_name_1_Easter', 'event_name_1_Eid al-Fitr', 'event_name_1_EidAlAdha', "event_name_1_Father's day", 'event_name_1_Halloween', 'event_name_1_IndependenceDay', 'event_name_1_LaborDay', 'event_name_1_LentStart', 'event_name_1_LentWeek2', 'event_name_1_MartinLutherKingDay', 'event_name_1_MemorialDay', "event_name_1_Mother's day", 'event_name_1_NBAFinalsEnd', 'event_name_1_NBAFinalsStart', 'event_name_1_NewYear', 'event_name_1_OrthodoxChristmas', 'event_name_1_OrthodoxEaster', 'event_name_1_Pesach End', 'event_name_1_PresidentsDay', 'event_name_1_Purim End', 'event_name_1_Ramadan starts', 'event_name_1_StPatricksDay', 'event_name_1_SuperBowl', 'event_name_1_Thanksgiving', 'event_name_1_ValentinesDay', 'event_name_1_VeteransDay', 'event_type_1_Cultural', 'event_type_1_National', 'event_type_1_Religious', 'event_type_1_Sporting', 'event_name_2_Cinco De Mayo', 'event_name_2_Easter', "event_name_2_Father's day", 'event_name_2_OrthodoxEaster', 'event_type_2_Cultural', 'event_type_2_Religious']
 
In [ ]:
# Change id variables into category so LightGBM performs proper tree splits on those columns
train_direct['unique_id'] = train_direct['unique_id'].astype('category')
train_direct['item_id'] = train_direct['item_id'].astype('category')
In [ ]:
model_params = {
    'verbose': -1,
    'num_leaves': 256,
    'n_estimators': 500,
    'objective': 'tweedie',
    'tweedie_variance_power': 1.1,
    'metric': 'rmse',
    'learning_rate': 0.03
}

# create an empty dictionary to store the models
models_dict = {}

# loop through each department in the training set
for dept in train_direct.dept_id.unique():
    
    models_dict[dept] = {}
    
    # Create temporary dataframe for each department
    dept_data = train_direct[(train_direct['dept_id'] == dept)]
    
    # loop through each multi-horizon model
    for i, features in models_per_week.items():
        
        # Create X for each department and weekly model (take features assigned to weekly models)
        X = dept_data[features]
        y = dept_data['y']
        
        # train models
        lgb_train = lgb.Dataset(X, y)
        model = lgb.train(model_params, lgb_train)
        
        # Update dictionary for each weekly model and department
        models_dict[dept][i] = model
In [ ]:
# # Save models - UNCOMMENT TO SAVE THE MODELS
# models_dict_tosave = models_dict.copy()

# for dept in train_direct.dept_id.unique():
#     for i, features in models_per_week.items():
#         print(models_dict_tosave[dept][i].save_model(str(dept) + '_' + str(i) + '_' + 'model.txt'))
In [ ]:
# Change id variables into category for validation set
val_direct['unique_id'] = val_direct['unique_id'].astype('category')
val_direct['item_id'] = val_direct['item_id'].astype('category')
In [ ]:
# Create Predictions Dataframe
direct_preds = pd.DataFrame()

# loop through each department in the validation set
for dept in val_direct.dept_id.unique():
    
    dept_data = val_direct[(val_direct['dept_id'] == dept)]
    
    # loop through each multi-horizon model
    for i, features in models_per_week.items():

        # Create Predictions
        X_val = dept_data[features]
        preds = pd.DataFrame(models_dict[dept][i].predict(X_val))
        preds.columns = ['LGBMRegressor']
        preds['model'] = i
        preds['dept_id'] = dept
        preds['ds'] = dept_data.reset_index()['ds']
        preds['unique_id'] = dept_data.reset_index()['unique_id']
        
        
        # Filter only predictions that are assigned specifically to each weekly model
        if i == 'model_1stweek':
            preds = preds[preds['ds'] <= pd.to_datetime('2016-05-01')]
            
        elif i == 'model_2ndweek':
            preds = preds[(preds['ds'] > pd.to_datetime('2016-05-01'))
                         & (preds['ds'] <= pd.to_datetime('2016-05-08'))] 
            
        elif i == 'model_3rdweek':
            preds = preds[(preds['ds'] > pd.to_datetime('2016-05-08')) 
                         & (preds['ds'] <= pd.to_datetime('2016-05-15'))]
            
        elif i == 'model_4thweek':
            preds = preds[(preds['ds'] > pd.to_datetime('2016-05-15'))
                         & (preds['ds'] <= pd.to_datetime('2016-05-22'))]
            
        direct_preds = pd.concat([preds,direct_preds],axis = 0)
In [ ]:
# Invert the categorical feature encoding for the training set
train_direct[categorical_features] = pd.DataFrame(encoder.inverse_transform(train_direct[categorical_features]), columns = train_direct[categorical_features].columns)

# Merge multi-horizon predictions with the categorical features from the validation set so the encoding can be inverted next
direct_preds = pd.merge(direct_preds.drop('dept_id', axis = 1), val_direct[categorical_features + ['ds']], how = 'left', on = ['ds','unique_id'])

# Inverse encoding for direct predictions
direct_preds[categorical_features] = pd.DataFrame(encoder.inverse_transform(direct_preds[categorical_features]), columns = direct_preds[categorical_features].columns)
In [ ]:
val_direct = val_direct.reset_index(drop = True)
val_direct[categorical_features] = pd.DataFrame(encoder.inverse_transform(val_direct[categorical_features]), columns = val_direct[categorical_features].columns)
In [ ]:
rmsse_direct = rmsse(train, val, direct_preds, 'LGBMRegressor')
In [ ]:
rmsse_direct
Out[ ]:
0.7493979496196602
In [ ]:
# combine the last 45 days of the training set, the validation set, and the predictions for plotting
plot_data_direct = (
    pd.concat([
        train.groupby('unique_id').tail(45)[['unique_id', 'ds', 'y']], 
        val[['unique_id', 'ds', 'y']], 
        direct_preds
    ])
)

Results & Conclusions¶

In [ ]:
# Group to plot total sales for each model
data_grouped_multihorizon = plot_data_direct.groupby('ds')[['y','LGBMRegressor']].sum().reset_index()
data_grouped_recursive  = plot_data_recursive.groupby('ds')[['y','LGBMRegressor']].sum().reset_index()
data_grouped_ets  = plot_data_ets.groupby('ds')[['y','HoltWinters']].sum().reset_index()

# Combine all predictors into one dataframe
plot_data = (
    data_grouped_ets.set_index('ds')
    .join(
        data_grouped_multihorizon.set_index('ds')
        .drop('y',axis =1).rename(columns = {'LGBMRegressor':'LGBM_MultiHorizon'}))
    .join(data_grouped_recursive.set_index('ds')
        .drop('y',axis =1).rename(columns = {'LGBMRegressor':'LGBM_Recursive'}))
)
plot_data
Out[ ]:
y HoltWinters LGBM_MultiHorizon LGBM_Recursive
ds
2016-03-11 11066.0 0.000000 0.000000 0.000000
2016-03-12 13113.0 0.000000 0.000000 0.000000
2016-03-13 13466.0 0.000000 0.000000 0.000000
2016-03-14 11882.0 0.000000 0.000000 0.000000
2016-03-15 11659.0 0.000000 0.000000 0.000000
... ... ... ... ...
2016-05-18 10375.0 9579.687500 9532.604587 9406.946322
2016-05-19 9162.0 9690.434570 9578.767776 9412.847663
2016-05-20 12303.0 10573.802734 10444.268249 10176.920809
2016-05-21 13681.0 12760.348633 12419.145383 12436.963334
2016-05-22 14815.0 13333.077148 12828.134820 12881.801693

73 rows × 4 columns

In [ ]:
# Create model lists
models_list = ['HoltWinters', 'LGBM_MultiHorizon', 'LGBM_Recursive']

# Define MAPE formula
def MAPE(df, ac, preds):
    return round(np.mean(abs(df[ac] - df[preds])/df[ac])*100, 2)

# Create MAPE results for Total Sales and save in the dataframe
mape_df = pd.DataFrame()
mape = pd.DataFrame()
for model in models_list:
    mape['mape'] = pd.Series(MAPE(plot_data[plot_data[model] != 0], 'y',model))
    mape['model'] = model
    mape_df = pd.concat([mape, mape_df],axis = 0)

mape_df
Out[ ]:
mape model
0 6.89 LGBM_Recursive
0 8.10 LGBM_MultiHorizon
0 8.81 HoltWinters
In [ ]:
fig, ax = plt.subplots(figsize=(18, 10))

# Plot Total Sales and Total Predictions by date
sns.lineplot(data=plot_data, 
         x = 'ds', 
         y = 'y', 
         color = 'steelblue', 
         ax = ax, 
         label = 'y'
        )

sns.lineplot(data=plot_data[plot_data['LGBM_MultiHorizon'] != 0], 
         x = 'ds', 
         y = 'LGBM_MultiHorizon', 
         color = 'darkblue', 
         ax = ax, 
         label = 'Predictions Direct Strategy'
        )

sns.lineplot(data=plot_data[plot_data['LGBM_Recursive'] != 0], 
         x = 'ds', 
         y = 'LGBM_Recursive', 
         color = 'indianred', 
         ax = ax, 
         label = 'Predictions Recursive Strategy'
        )

sns.lineplot(data=plot_data[plot_data['HoltWinters'] != 0], 
         x = 'ds', 
         y = 'HoltWinters', 
         color = 'green', 
         ax = ax, 
         label = 'Predictions HoltWinters'
        )

ax.set_title(
    'Total Sales and Predictions \n'
    '(Recursive Strategy: RMSSE = {}, Total Sales MAPE = {}%) \n'
    '(Direct Strategy: RMSSE = {}, Total Sales MAPE = {}%) \n'
    '(HoltWinters: RMSSE = {}, Total Sales MAPE = {}%)'.format(
        round(rmsse_recursive, 4), mape_df[mape_df['model'] == 'LGBM_Recursive']['mape'][0],
        round(rmsse_direct, 4), mape_df[mape_df['model'] == 'LGBM_MultiHorizon']['mape'][0],
        round(rmsse_holtwinters, 4), mape_df[mape_df['model'] == 'HoltWinters']['mape'][0]
    ),
    fontsize=15
)
Out[ ]:
Text(0.5, 1.0, 'Total Sales and Predictions \n(Recursive Strategy: RMSSE = 0.7447, Total Sales MAPE = 6.89%) \n(Direct Strategy: RMSSE = 0.7494, Total Sales MAPE = 8.1%) \n(HoltWinters: RMSSE = 0.7508, Total Sales MAPE = 8.81%)')

End Notes:¶

The lowest average Root Mean Squared Scaled Error per item was obtained by LightGBM predicted with the recursive strategy (0.7447).

The lowest Mean Absolute Percentage Error on total sales was also obtained by the recursive LightGBM (6.89%).

This project is the result of many experiments. Here is a brief summary of what helped my models and what didn't, along with whether each outcome was expected beforehand.

  • Direct strategy is slightly less accurate than recursive strategy (not expected).

  • Including price information resulted in a huge accuracy improvement (expected).

  • RMSSE for recursive forecasting was around 1.5; after including price information it went down to around 0.75.

  • The direct approach takes longer to train the models (expected).

  • Including rolling features to capture mid-term trends (windows of 60-180 days) helped to improve accuracy (expected); see the sketch after this list.

  • Throwing in a bunch of rolling features lagged by different offsets (7, 14, 21, 28 days) helped to improve accuracy in the recursive strategy (unexpected, and not included in the final results).

  • Adding events resulted in an accuracy improvement, from 0.747 to 0.746 RMSSE for the recursive strategy (expected).
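
As referenced in the rolling-features bullet above, here is a minimal sketch of how such lagged rolling-mean features can be declared with MLForecast and window_ops (both imported at the top of this notebook). The lag offsets and window sizes below are illustrative only, not the exact configuration used for the recursive model.

In [ ]:
# Illustrative sketch only: lagged rolling-mean features via MLForecast's lag_transforms.
# The lags and window sizes are examples, not the exact setup used in this notebook.
from mlforecast import MLForecast
from window_ops.rolling import rolling_mean
import lightgbm as lgb

sketch_fcst = MLForecast(
    models=[lgb.LGBMRegressor(objective='tweedie', tweedie_variance_power=1.1)],
    freq='D',
    lags=[7, 14, 21, 28],
    lag_transforms={
        7: [(rolling_mean, 60), (rolling_mean, 180)],   # mid-term trend features
        14: [(rolling_mean, 60)],                       # the same feature shifted further back
        28: [(rolling_mean, 60)],
    },
    date_features=['year', 'month', 'day', 'dayofweek'],
)

# sketch_fcst.fit(train, id_col='unique_id', time_col='ds', target_col='y')
# sketch_preds = sketch_fcst.predict(28)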

Things that could possibly improve accuracy:¶

- ensembling (creating weighted averages of all predictions usually leads to better accuracy)

Ensembling is a technique that combines predictions from multiple models. It often leads to better accuracy, especially in time series forecasting. Consider a scenario with two models, one overforecasting and the other underforecasting: averaging their predictions improves accuracy, because the individual errors partially cancel out when combined.

The easiest way to create an ensemble is a simple average; in my experience it usually works quite well. Most of the time, however, a weighted average is better. To calculate the weights for each model, you can fit a linear regression where the target is the dependent variable and the model predictions are the regressors. The fitted coefficients can be treated as weights, since linear regression is built to minimize MSE. If MSE is not the objective function, a general-purpose optimizer is the better choice. A minimal sketch is shown after this list.

- building recursive/direct models on data filtered by categories/departments

- experimenting with more features (seasonal Fourier series, seasonal rolling features, rolling features aggregated on different levels)

- hyperparameter tuning with Optuna

- different machine learning/deep learning models (NBEATS could be a good fit for this exercise)
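
To make the weighted-average idea concrete, here is the sketch referenced in the ensembling note above. It reuses the total-sales frame plot_data built in the Results section purely for illustration; in practice the weights should be fitted per series on a separate validation window, not on the data used for the final evaluation.

In [ ]:
# Minimal ensembling sketch (illustration only): fit combination weights with a
# linear regression on the validation-period predictions from `plot_data`.
from sklearn.linear_model import LinearRegression

ens_data = plot_data[plot_data['LGBM_Recursive'] != 0]    # keep only forecasted dates
X_ens = ens_data[['HoltWinters', 'LGBM_MultiHorizon', 'LGBM_Recursive']]
y_ens = ens_data['y']

# Linear regression minimizes MSE; positive=True (recent scikit-learn versions)
# keeps the weights non-negative
ens_model = LinearRegression(fit_intercept=False, positive=True)
ens_model.fit(X_ens, y_ens)
weights = ens_model.coef_                                 # one weight per model

ensemble_preds = X_ens @ weights                          # weighted combination of the forecasts

If the metric of interest is not MSE (e.g. RMSSE), the same weights can instead be searched with a general-purpose optimizer such as scipy.optimize.minimize.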