The goal of this project is to predict daily sales of 9,146 Walmart items for the next 28 days.
This notebook covers the entire forecasting process: preparing raw transaction data, exploratory analysis, feature engineering, modeling, and evaluation.
The dataset used in this project is one of the most realistic retail time-series datasets available: actual Walmart data. Walmart is one of the largest retailers in the world, operating more than 10,500 stores and clubs in 20 countries (as of 2023). The data was made available for the M5 time series forecasting competition. To bring it even closer to a real-world dataset, it was preprocessed so that it imitates store transactions.
This project is based on materials and knowledge that I gained from: https://corise.com/course/forecasting-with-machine-learning
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.graphics.tsaplots import plot_pacf, plot_acf
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
from prophet import Prophet
from statsforecast import StatsForecast
from statsforecast.models import HoltWinters
from mlforecast import MLForecast
from sklearn.preprocessing import OrdinalEncoder
from numba import njit
from window_ops.rolling import rolling_mean, rolling_std
import lightgbm as lgb
import gc
Time-series data has to be collected from some real-world, data-generating process. That means that raw data comes in as a series of observations. Depending on your experience with time-series data, you may be used to data that looks like this:
Date | Sales |
---|---|
2022-01-01 | 23 |
2022-01-02 | 45 |
2022-01-03 | 12 |
2022-01-04 | 67 |
2022-01-05 | 89 |
But, if you're in retail, each of those "sales" probably came in as a piece of JSON from some point-of-sale system (i.e. a cash register) that looked something like this:
{
"timestamp": "2022-01-01 12:34:56",
"product_id": 5,
"store_id": 12,
"category_id": 36,
...
}
Usually, it's the job of a data engineer to collect all of these records and aggregate them into a nice, tabular format, but it's worth at least having an appreciation for how it's done. So, we're going to start from a mock version of a transactions table.
transactions = pd.read_csv(f'transactions_data.csv')
transactions.head()
date | id | item_id | dept_id | cat_id | store_id | state_id | |
---|---|---|---|---|---|---|---|
0 | 2013-01-01 13:41:03 | HOBBIES_1_004_TX_1_evaluation | HOBBIES_1_004 | HOBBIES_1 | HOBBIES | TX_1 | TX |
1 | 2013-01-01 07:30:52 | HOBBIES_1_004_TX_1_evaluation | HOBBIES_1_004 | HOBBIES_1 | HOBBIES | TX_1 | TX |
2 | 2013-01-01 11:17:38 | HOBBIES_1_004_TX_1_evaluation | HOBBIES_1_004 | HOBBIES_1 | HOBBIES | TX_1 | TX |
3 | 2013-01-01 20:18:59 | HOBBIES_1_025_TX_1_evaluation | HOBBIES_1_025 | HOBBIES_1 | HOBBIES | TX_1 | TX |
4 | 2013-01-01 21:36:09 | HOBBIES_1_028_TX_1_evaluation | HOBBIES_1_028 | HOBBIES_1 | HOBBIES | TX_1 | TX |
transactions.dtypes
date object id object item_id object dept_id object cat_id object store_id object state_id object dtype: object
You can see that this is a DataFrame where each row relates to purchases for an individual item. Here's a little data dictionary:
- date: the time at which an item was bought, down to the second
- id: the product ID. Each of these is an individual item at a specific store.
- item_id: this is an identifier for items, but not at the store level. You can use this to find the same item at different stores.
- dept_id: department ID. One level up from item_id in the hierarchy.
- cat_id: category ID. One level up from dept_id in the hierarchy.
- store_id: identifies the specific store where the product was bought
- state_id: identifies the specific state where the product was bought
date is supposed to be a datetime-like object, but you can see that when we loaded it from disk, it was loaded in as a string. Let's convert that column to datetime.
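A minimal sketch of that conversion (the aggregation pipeline below folds this same step into its .assign() call):
transactions['date'] = pd.to_datetime(transactions['date'])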
The goal is to transform this dataset into one that's easy to analyze and train models on. For this project, we're going to work at the daily level, so the first step is to aggregate our transactions data up to daily granularity.
To be more specific, this is what we want it to look like:
# This is a hefty table, so just peeking at the first 5 rows
pd.read_csv(f'sales_data_sampled.csv', nrows=5)
date | id | item_id | dept_id | cat_id | store_id | state_id | sales | |
---|---|---|---|---|---|---|---|---|
0 | 2013-01-01 | HOBBIES_1_004_TX_1_evaluation | HOBBIES_1_004 | HOBBIES_1 | HOBBIES | TX_1 | TX | 3 |
1 | 2013-01-01 | HOBBIES_2_075_TX_1_evaluation | HOBBIES_2_075 | HOBBIES_2 | HOBBIES | TX_1 | TX | 0 |
2 | 2013-01-01 | HOUSEHOLD_1_247_TX_1_evaluation | HOUSEHOLD_1_247 | HOUSEHOLD_1 | HOUSEHOLD | TX_1 | TX | 0 |
3 | 2013-01-01 | HOUSEHOLD_1_266_TX_1_evaluation | HOUSEHOLD_1_266 | HOUSEHOLD_1 | HOUSEHOLD | TX_1 | TX | 0 |
4 | 2013-01-01 | FOODS_1_001_TX_1_evaluation | FOODS_1_001 | FOODS_1 | FOODS | TX_1 | TX | 0 |
You can see that the sales column is really just a daily count of transactions for that particular id.
In the cell below, I will create a dataframe called data, which is the transactions dataframe aggregated to the daily level. It should look like the above, except it won't have zero-sales days.
Here is how to convert the transactions data into the sales dataframe:
data = (
transactions
.assign(
date = lambda df: pd.to_datetime(df.date).dt.date # convert the date string into a datetime and keep only the year-month-day part
)
.pipe(lambda df: df.groupby(list(df.columns))['id'].count()) # group by every column and count rows per (date, id) to get daily sales
.reset_index(name='sales')
)
data
date | id | item_id | dept_id | cat_id | store_id | state_id | sales | |
---|---|---|---|---|---|---|---|---|
0 | 2013-01-01 | FOODS_1_004_TX_1_evaluation | FOODS_1_004 | FOODS_1 | FOODS | TX_1 | TX | 20 |
1 | 2013-01-01 | FOODS_1_004_TX_2_evaluation | FOODS_1_004 | FOODS_1 | FOODS | TX_2 | TX | 20 |
2 | 2013-01-01 | FOODS_1_004_TX_3_evaluation | FOODS_1_004 | FOODS_1 | FOODS | TX_3 | TX | 4 |
3 | 2013-01-01 | FOODS_1_005_TX_2_evaluation | FOODS_1_005 | FOODS_1 | FOODS | TX_2 | TX | 1 |
4 | 2013-01-01 | FOODS_1_009_TX_2_evaluation | FOODS_1_009 | FOODS_1 | FOODS | TX_2 | TX | 3 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
3895933 | 2016-05-22 | HOUSEHOLD_2_511_TX_3_evaluation | HOUSEHOLD_2_511 | HOUSEHOLD_2 | HOUSEHOLD | TX_3 | TX | 4 |
3895934 | 2016-05-22 | HOUSEHOLD_2_513_TX_1_evaluation | HOUSEHOLD_2_513 | HOUSEHOLD_2 | HOUSEHOLD | TX_1 | TX | 2 |
3895935 | 2016-05-22 | HOUSEHOLD_2_514_TX_3_evaluation | HOUSEHOLD_2_514 | HOUSEHOLD_2 | HOUSEHOLD | TX_3 | TX | 1 |
3895936 | 2016-05-22 | HOUSEHOLD_2_516_TX_2_evaluation | HOUSEHOLD_2_516 | HOUSEHOLD_2 | HOUSEHOLD | TX_2 | TX | 1 |
3895937 | 2016-05-22 | HOUSEHOLD_2_516_TX_3_evaluation | HOUSEHOLD_2_516 | HOUSEHOLD_2 | HOUSEHOLD | TX_3 | TX | 2 |
3895938 rows × 8 columns
Let's take a look at how our data is being stored in memory.
data.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3895938 entries, 0 to 3895937 Data columns (total 8 columns): # Column Dtype --- ------ ----- 0 date object 1 id object 2 item_id object 3 dept_id object 4 cat_id object 5 store_id object 6 state_id object 7 sales int64 dtypes: int64(1), object(7) memory usage: 1.6 GB
1.6 GB for our purposes is certainly no joke. But how much of that is really necessary?
Most of our data is stored in the least memory-efficient format for pandas: strings (objects). Let's fix that.
When you change a column's type to category, pandas converts the string values into integer codes behind the scenes. This saves memory and improves performance, since integer codes take less space than repeated strings. However, pandas still lets you refer to and filter those columns by their original string values, which makes it a very convenient and efficient way to handle string columns and optimize the data.
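Here is a tiny illustration with a hypothetical toy Series (not part of our dataset): the values are stored as integer codes, yet you can still filter by the original strings:
s = pd.Series(['TX_1', 'TX_2', 'TX_1'], dtype='category')
print(s.cat.codes.tolist())   # [0, 1, 0]: the integer codes pandas stores under the hood
print(s[s == 'TX_1'])         # filtering by the original string value still works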
Let's see how converting to category results in less memory usage:
data = (
data
.assign(
id = lambda df: df.id.astype('category'),
item_id = lambda df: df.item_id.astype('category'),
cat_id = lambda df: df.cat_id.astype('category'),
store_id = lambda df: df.store_id.astype('category'),
state_id = lambda df: df.state_id.astype('category'),
dept_id = lambda df: df.dept_id.astype('category')
)
)
data.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3895938 entries, 0 to 3895937 Data columns (total 8 columns): # Column Dtype --- ------ ----- 0 date object 1 id category 2 item_id category 3 dept_id category 4 cat_id category 5 store_id category 6 state_id category 7 sales int64 dtypes: category(6), int64(1), object(1) memory usage: 209.3 MB
The final DataFrame is down to 209.3 MB, which is about 13% of the original size!
While we're at it, it's worth talking about the best way to store this data on disk. If we saved this as a CSV, it wouldn't maintain any of the data type modifications we just made. Pandas offers a bunch of options for saving DataFrames, but here are the two I'd recommend:
Parquet has basically become the industry standard for storing tabular data on disk. It's a columnar file format that automatically compresses your data (which it does really well) and will maintain any data types you use in Pandas, with only a couple exceptions.
Feather is also a columnar data format, but it optimizes heavily for read speed. Your file size will be much bigger than with Parquet, but it's really useful when you need to heavily optimize for data reading.
data.to_parquet('data.parquet')
data = pd.read_parquet('data.parquet')
data.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3895938 entries, 0 to 3895937 Data columns (total 8 columns): # Column Dtype --- ------ ----- 0 date object 1 id category 2 item_id category 3 dept_id category 4 cat_id category 5 store_id category 6 state_id category 7 sales int64 dtypes: category(6), int64(1), object(1) memory usage: 209.3 MB
On my local machine, loading the original CSV took ~13.7 seconds, while reading the Parquet file took only 0.1 seconds. And our data types were maintained! Nice!
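If you ever need Feather instead, the calls are analogous. A minimal sketch, assuming pyarrow is installed (note that to_feather requires a default RangeIndex, which data still has at this point):
data.to_feather('data.feather')
data = pd.read_feather('data.feather')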
There's one last modification we need to make to our data before it's ready to go. The way that we converted transactions into sales was slightly problematic because now, when a product doesn't sell it just isn't present in our data, rather than appearing as a zero.
That's an issue for our forecasting models, so it needs to be fixed.
To handle that, we can create a MultiIndex:
First, set your index to the columns that the DataFrame is distinct on (date and id).
data = data.set_index(['date','id'])
Now, create a MultiIndex with all combinations of daily dates and ids using pd.MultiIndex.from_product, and use it with .reindex() to fill the gaps in your data.
# Create date_range that contains all dates between minimum and maximum date from the data
min_date = data.index.get_level_values('date').min()
max_date = data.index.get_level_values('date').max()
dates = pd.date_range(start=min_date, end=max_date, freq='D').rename('date')
# Get all unique ids
ids = data.index.get_level_values('id').unique()
index_to_select = pd.MultiIndex.from_product([dates, ids], names=['date', 'id'])
index_to_select
MultiIndex([('2013-01-01', 'FOODS_1_004_TX_1_evaluation'), ('2013-01-01', 'FOODS_1_004_TX_2_evaluation'), ('2013-01-01', 'FOODS_1_004_TX_3_evaluation'), ('2013-01-01', 'FOODS_1_005_TX_2_evaluation'), ('2013-01-01', 'FOODS_1_009_TX_2_evaluation'), ('2013-01-01', 'FOODS_1_012_TX_1_evaluation'), ('2013-01-01', 'FOODS_1_012_TX_2_evaluation'), ('2013-01-01', 'FOODS_1_013_TX_3_evaluation'), ('2013-01-01', 'FOODS_1_014_TX_3_evaluation'), ('2013-01-01', 'FOODS_1_015_TX_2_evaluation'), ... ('2016-05-22', 'FOODS_2_133_TX_3_evaluation'), ('2016-05-22', 'FOODS_2_177_TX_2_evaluation'), ('2016-05-22', 'FOODS_2_117_TX_2_evaluation'), ('2016-05-22', 'FOODS_2_209_TX_3_evaluation'), ('2016-05-22', 'FOODS_2_256_TX_2_evaluation'), ('2016-05-22', 'FOODS_2_117_TX_1_evaluation'), ('2016-05-22', 'FOODS_2_256_TX_3_evaluation'), ('2016-05-22', 'FOODS_2_256_TX_1_evaluation'), ('2016-05-22', 'FOODS_3_296_TX_2_evaluation'), ('2016-05-22', 'FOODS_2_069_TX_3_evaluation')], names=['date', 'id'], length=11322748)
data = (
data
.reindex(index_to_select)
.sort_index()
)
data.head()
item_id | dept_id | cat_id | store_id | state_id | sales | ||
---|---|---|---|---|---|---|---|
date | id | ||||||
2013-01-01 | FOODS_1_001_TX_1_evaluation | NaN | NaN | NaN | NaN | NaN | NaN |
FOODS_1_001_TX_2_evaluation | NaN | NaN | NaN | NaN | NaN | NaN | |
FOODS_1_001_TX_3_evaluation | NaN | NaN | NaN | NaN | NaN | NaN | |
FOODS_1_002_TX_1_evaluation | NaN | NaN | NaN | NaN | NaN | NaN | |
FOODS_1_002_TX_2_evaluation | NaN | NaN | NaN | NaN | NaN | NaN |
Finally, let's fill the resulting NaNs in the dataframe. It's tempting to use .groupby().fillna(method='ffill') (and then backfill), but unfortunately that approach is quite slow on grouped data. A better way to handle the missing values is to create a temporary dataframe that maps every unique id to its attributes, and use it to fill the NAs in our data.
data = data.reindex(index_to_select)
# Build a lookup of static attributes for every unique id by splitting the id string
temp = data.reset_index()['id']
temp.drop_duplicates(inplace=True)
temp = temp.str.split('_', expand=True)
temp['id'] = data.reset_index()['id']
temp['item_id2']=temp[0]+'_'+temp[1]+'_'+temp[2]
temp['dept_id2']=temp[0]+'_'+temp[1]
temp['cat_id2']=temp[0]
temp['store_id2']=temp[3]+'_'+temp[4]
temp['state_id2']=temp[3]
temp.drop([0,1,2,3,4,5],axis=1,inplace=True)
temp.set_index('id',inplace=True)
# Join the attribute lookup onto the reindexed data and use it to fill the missing attributes
data = data.join(temp,on=['id'])
data['item_id'].fillna(data['item_id2'],inplace=True)
data['dept_id'].fillna(data['dept_id2'],inplace=True)
data['cat_id'].fillna(data['cat_id2'],inplace=True)
data['store_id'].fillna(data['store_id2'],inplace=True)
data['state_id'].fillna(data['state_id2'],inplace=True)
data.drop(['item_id2','dept_id2','cat_id2','store_id2','state_id2'],axis=1,inplace=True)
# Days with no transactions are true zero-sales days
data.fillna({'sales':0},inplace=True)
data
item_id | dept_id | cat_id | store_id | state_id | sales | ||
---|---|---|---|---|---|---|---|
date | id | ||||||
2013-01-01 | FOODS_1_004_TX_1_evaluation | FOODS_1_004 | FOODS_1 | FOODS | TX_1 | TX | 20.0 |
FOODS_1_004_TX_2_evaluation | FOODS_1_004 | FOODS_1 | FOODS | TX_2 | TX | 20.0 | |
FOODS_1_004_TX_3_evaluation | FOODS_1_004 | FOODS_1 | FOODS | TX_3 | TX | 4.0 | |
FOODS_1_005_TX_2_evaluation | FOODS_1_005 | FOODS_1 | FOODS | TX_2 | TX | 1.0 | |
FOODS_1_009_TX_2_evaluation | FOODS_1_009 | FOODS_1 | FOODS | TX_2 | TX | 3.0 | |
... | ... | ... | ... | ... | ... | ... | ... |
2016-05-22 | FOODS_2_117_TX_1_evaluation | FOODS_2_117 | FOODS_2 | FOODS | TX_1 | TX | 0.0 |
FOODS_2_256_TX_3_evaluation | FOODS_2_256 | FOODS_2 | FOODS | TX_3 | TX | 1.0 | |
FOODS_2_256_TX_1_evaluation | FOODS_2_256 | FOODS_2 | FOODS | TX_1 | TX | 0.0 | |
FOODS_3_296_TX_2_evaluation | FOODS_3_296 | FOODS_3 | FOODS | TX_2 | TX | 3.0 | |
FOODS_2_069_TX_3_evaluation | FOODS_2_069 | FOODS_2 | FOODS | TX_3 | TX | 0.0 |
11322748 rows × 6 columns
# Check if there are any NAs in filled dataframe
data.isna().sum()
item_id 0 dept_id 0 cat_id 0 store_id 0 state_id 0 sales 0 dtype: int64
# Check if sum of sales from original dataframe is equal to sum of sales from preprocessed above transaction dataframe
# If true, preprocessing was done without any errors
print('Sales sum in original data: {}'.format(pd.read_csv(f'sales_data.csv', usecols=['date', 'id', 'sales'])['sales'].sum()))
print('Sales sum in preprocessed transaction data: {}'.format(int(data.sales.sum())))
Sales sum in original data: 12905715 Sales sum in preprocessed transaction data: 12905715
Exploratory data analysis is crucial for building the best models. It lets us understand the data and assess which features should be created from the time-series history to get the best accuracy when modeling.
For this section, the goal is to find a few insights about the data that might be helpful for building models.
There are many techniques for understanding the impact of trend and seasonality patterns in the analyzed time series.
Before deep-diving into EDA, let's check the sales distribution:
# Group data per each id
id_total = data.reset_index().groupby(['id']).sales.sum().reset_index(name = 'total_sales').dropna()
sns.set_style('whitegrid')
fig, ax = plt.subplots(figsize=(12, 6))
sns.kdeplot(data = id_total, ax = ax)
ax.set_title('Distribution of Total sales per each item')
Text(0.5, 1.0, 'Distribution of Total sales per each item')
The distribution is highly skewed to the right. Most items' total sales are concentrated near zero volume.
departments = list(data.dept_id.unique())
categories = list(data.cat_id.unique())
dept_sales = data.groupby(['date', 'dept_id']).sales.sum().unstack('dept_id').fillna(0)
cat_sales = data.groupby(['date', 'cat_id']).sales.sum().unstack('cat_id').fillna(0)
sns.set_style('whitegrid')
fig, ax = plt.subplots(len(categories), 1, figsize=(12, 6*len(categories)))
for i, cat in enumerate(categories):
# Plot actuals
sns.lineplot(data=cat_sales,
x = cat_sales.index,
y = cat,
color = 'lightsteelblue',
ax = ax[i],
label = 'y')
# Plot 30 days moving average
sns.lineplot(data=cat_sales.rolling(30, min_periods = 1).mean(),
x=cat_sales.index,
y=cat, color='indianred',
ax=ax[i],
label = '30days MA')
# Plot 365 days moving average
sns.lineplot(data=cat_sales.rolling(365, min_periods = 1).mean(),
x=cat_sales.index,
y=cat,
color='darkblue',
ax=ax[i],
label = '365days MA')
ax[i].set_title('Category: {}'.format(cat))
ax[i].set_xlabel('')
plt.tight_layout()
Based on the 30-day moving average, we can conclude that there is a clear seasonality pattern in Household and Foods sales; the red line rises and falls around the same periods each year. Looking at the 365-day moving average, the Household and Hobbies categories are in an upward trend, while the Foods trendline is flat.
sns.set_style('whitegrid')
fig, ax = plt.subplots(len(departments), 1, figsize=(12, 6*len(departments)))
for i, dept in enumerate(departments):
# Plot actuals
sns.lineplot(data=dept_sales,
x = dept_sales.index,
y = dept,
color = 'lightsteelblue',
ax = ax[i],
label = 'y')
# Plot 30 days moving average
sns.lineplot(data=dept_sales.rolling(30, min_periods = 1).mean(),
x=dept_sales.index,
y=dept, color='indianred',
ax=ax[i],
label = '30days MA')
# Plot 365 days moving average
sns.lineplot(data=dept_sales.rolling(365, min_periods = 1).mean(),
x=dept_sales.index,
y=dept,
color='darkblue',
ax=ax[i],
label = '365days MA')
ax[i].set_title('Department: {}'.format(dept))
ax[i].set_xlabel('')
plt.tight_layout()
Based on the charts above, we can conclude that there is yearly seasonality in most of the departments.
Let's fit a Prophet model for each department and investigate its components:
df_prophet_dept_id = pd.melt(dept_sales.reset_index(), id_vars = ['date'], var_name='dept_id', value_name='y').rename(columns = {'date':'ds'})
for i, dept in enumerate(departments):
model = Prophet(
seasonality_mode='multiplicative',
weekly_seasonality=True,
yearly_seasonality=4,
changepoint_prior_scale=0.05
)
model.add_seasonality(
name='monthly',
period=365.25/12,
fourier_order=4,
mode='multiplicative'
)
model.fit(df_prophet_dept_id[df_prophet_dept_id['dept_id'] == dept])
future = model.make_future_dataframe(periods=0)
forecast = model.predict(future)
# Plot the forecast components
fig1 = model.plot(forecast)
fig = model.plot_components(forecast)
fig1.axes[0].set_title(f'Prophet fit for Department {dept}', y=1.05)
fig.axes[0].set_title(f'Forecast Components for Department {dept}', y=1.05)
plt.show()
11:13:24 - cmdstanpy - INFO - Chain [1] start processing 11:13:24 - cmdstanpy - INFO - Chain [1] done processing
11:13:27 - cmdstanpy - INFO - Chain [1] start processing 11:13:27 - cmdstanpy - INFO - Chain [1] done processing
11:13:29 - cmdstanpy - INFO - Chain [1] start processing 11:13:29 - cmdstanpy - INFO - Chain [1] done processing
11:13:31 - cmdstanpy - INFO - Chain [1] start processing 11:13:31 - cmdstanpy - INFO - Chain [1] done processing
11:13:33 - cmdstanpy - INFO - Chain [1] start processing 11:13:34 - cmdstanpy - INFO - Chain [1] done processing
11:13:36 - cmdstanpy - INFO - Chain [1] start processing 11:13:36 - cmdstanpy - INFO - Chain [1] done processing
11:13:38 - cmdstanpy - INFO - Chain [1] start processing 11:13:38 - cmdstanpy - INFO - Chain [1] done processing
One pattern applies to all sales time series aggregated at the department level: strong weekly seasonality. Sales are highly dependent on the day of the week, and most consumers shop on weekends.
To investigate high- and low-volume items, let's create lists of the top 25 highest- and lowest-selling items by total sales:
# group data by item
item_sales = data.groupby(['date','item_id'])['sales'].sum().unstack('item_id').fillna(0)
item_sales_total = data.reset_index().groupby(['item_id']).sales.sum().reset_index(name = 'total_sales')
# Create list of top_n lowest/highest items by volume
top_n = 25
low_volume_items_list = list(item_sales_total.sort_values(by = 'total_sales', ascending = True).head(top_n).item_id.unique())
high_volume_items_list = list(item_sales_total.sort_values(by = 'total_sales', ascending = False).head(top_n).item_id.unique())
# Inspect lowest items by volume
fig, ax = plt.subplots(len(low_volume_items_list), 2, figsize=(18, 4*len(low_volume_items_list)))
for i, item in enumerate(low_volume_items_list):
sns.lineplot(data=item_sales[item].reset_index(),
x = item_sales[item].index,
y = item,
color = 'lightsteelblue',
ax = ax[i,0],
label = 'y'
)
sns.lineplot(data=item_sales[item].rolling(365, min_periods = 1).mean().reset_index(),
x = item_sales[item].index,
y = item,
color = 'indianred',
ax = ax[i,0],
label = '365days MA',
linewidth = 2
)
sns.lineplot(data=item_sales[item].rolling(30, min_periods = 1).mean().reset_index(),
x = item_sales[item].index,
y = item,
color = 'darkblue',
ax = ax[i,0],
label = '30days MA'
)
pd.plotting.autocorrelation_plot(item_sales[item], ax=ax[i,1], color = 'steelblue')
ax[i,0].set_title('Low-Volume item Sales: {}'.format(item))
ax[i,1].set_title('Low-Volume item ACF: {}'.format(item))
plt.tight_layout()
Low-volume items show very little or no pattern at all.
Some item sales are very random/noisy or intermittent.
Only a couple of items display some kind of yearly seasonality, like FOODS_2_073, where there is a significant ACF spike at lag 365.
The 365-day MA shows that most low-volume items have a rather flat trendline.
The 30-day MA does not display any recurring pattern, which indicates no yearly seasonality in most cases.
# Inspect highest items by volume
fig, ax = plt.subplots(len(high_volume_items_list), 2, figsize=(18, 4*len(high_volume_items_list)))
for i, item in enumerate(high_volume_items_list):
sns.lineplot(data=item_sales[item].reset_index(),
x = item_sales[item].index,
y = item,
color = 'lightsteelblue',
ax = ax[i,0],
label = 'y'
)
sns.lineplot(data=item_sales[item].rolling(365, min_periods = 1).mean().reset_index(),
x = item_sales[item].index,
y = item,
color = 'indianred',
ax = ax[i,0],
label = '365days MA',
linewidth = 2
)
sns.lineplot(data=item_sales[item].rolling(30, min_periods = 1).mean().reset_index(),
x = item_sales[item].index,
y = item,
color = 'darkblue',
ax = ax[i,0],
label = '30days MA'
)
pd.plotting.autocorrelation_plot(item_sales[item], ax=ax[i,1], color = 'steelblue')
ax[i,0].set_title('High-Volume item Sales: {}'.format(item))
ax[i,1].set_title('High-Volume item ACF: {}'.format(item))
plt.tight_layout()
High-volume item sales are seasonal.
The ACF shows significant autocorrelation between current sales and sales from a year ago for most of the inspected items.
The trendlines displayed by the 365-day MA point in different directions across items.
Based on the 30-day MA, we can conclude that there is a recurring seasonal pattern for most of the items, for example FOODS_3_365.
# Inspect lowest items by volume per store
stores_list = list(data.store_id.unique())
fig, ax = plt.subplots(len(low_volume_items_list), len(stores_list), figsize=(6*len(stores_list), 4*len(low_volume_items_list)))
for j, store in enumerate(stores_list):
for i, item in enumerate(low_volume_items_list):
data_aux = data.loc[(data['item_id'] == item) & (data['store_id'] == store)]
sns.lineplot(data=data_aux,
x = data_aux.index.get_level_values('date'),
y = 'sales',
color = 'lightsteelblue',
ax = ax[i,j],
label = 'y'
)
sns.lineplot(data=data_aux.sales.rolling(365, min_periods = 1).mean().reset_index(),
x = item_sales[item].index,
y = 'sales',
color = 'indianred',
ax = ax[i,j],
label = '365days MA',
linewidth = 2
)
sns.lineplot(data=data_aux.sales.rolling(30, min_periods = 1).mean().reset_index(),
x = item_sales[item].index,
y = 'sales',
color = 'darkblue',
ax = ax[i,j],
label = '30days MA'
)
ax[i,j].set_title('{}, {}'.format(store, item), fontsize = 10)
ax[i,j].set_xlabel('')
ax[i,j].set_ylabel('')
ax[i,j].tick_params(axis='x', which='both', labelbottom=False)
Each row represents a single item, while each column represents a store.
Low-volume item sales are rather random across stores; however, some stores share the same patterns for a specific item, for example HOUSEHOLD_2_162.
# Inspect highest items by volume per store
stores_list = list(data.store_id.unique())
fig, ax = plt.subplots(len(high_volume_items_list), len(stores_list), figsize=(6*len(stores_list), 4*len(high_volume_items_list)))
for j, store in enumerate(stores_list):
for i, item in enumerate(high_volume_items_list):
data_aux = data.loc[(data['item_id'] == item) & (data['store_id'] == store)]
sns.lineplot(data=data_aux,
x = data_aux.index.get_level_values('date'),
y = 'sales',
color = 'lightsteelblue',
ax = ax[i,j],
label = 'y'
)
sns.lineplot(data=data_aux.sales.rolling(365, min_periods = 1).mean().reset_index(),
x = item_sales[item].index,
y = 'sales',
color = 'indianred',
ax = ax[i,j],
label = '365days MA',
linewidth = 2
)
sns.lineplot(data=data_aux.sales.rolling(30, min_periods = 1).mean().reset_index(),
x = item_sales[item].index,
y = 'sales',
color = 'darkblue',
ax = ax[i,j],
label = '30days MA'
)
ax[i,j].set_title('{}, {}'.format(store, item), fontsize = 10)
ax[i,j].set_xlabel('')
ax[i,j].set_ylabel('')
ax[i,j].tick_params(axis='x', which='both', labelbottom=False)
Each row represents a single item, while each column represents a store.
High-volume item sales share similar patterns across stores; there is clearly similar seasonality and trend for the same item in different stores.
For example, FOODS_3_808 is declining in all stores while sharing the same seasonal pattern.
Before we get to modeling, let's create our evaluation setup. The models that we're going to create have a 28-day forecast horizon, and our goal is to best approximate "average" sales.
The first step is to implement our evaluation metric. The original competition used a metric called RMSSE, or "Root Mean Squared Scaled Error." It's similar to the MASE metric that we discussed before, except that it optimizes better for "average" sales (as opposed to MASE, which optimizes for the median, since it's an absolute error metric). The competition actually used a weighted version of RMSSE, which is technically more robust, but we're going to stick to plain RMSSE. Here's what RMSSE looks like:
$RMSSE = \sqrt{\frac{1}{h}\frac{\sum^{n+h}_{t=n+1} (Y_t - \hat{Y}_t)^2}{\frac{1}{n-1}\sum^n_{t=2} (Y_t - Y_{t-1})^2}}$
where $Y_t$ is the actual future value of sales at date $t$, $\hat{Y}_t$ is your forecast for date $t$, $n$ is the number of dates in our training set, and $h$ is our forecast horizon (28 days, in our case).
That looks intimidating! But, similarly to MASE, you can break it down into two parts:
- the numerator is the mean squared error of your forecast over the 28-day horizon;
- the denominator is the mean squared error of a naive, one-day-ahead forecast (predicting each day with the previous day's sales) over the training set.
Of course, the "naive, one-day-ahead forecast" part only works if you calculate both the numerator and denominator separately for each id. So, the idea here is that you are effectively calculating an RMSSE value for each id, and then averaging those to get the final RMSSE.
One last comment: there are products in the dataset that don't show any sales for some time at the start. For those products, the denominator is only supposed to be calculated after the first sale in the dataset. I'd recommend just dropping the records for those products up until that first sale, which is straightforward to do using .cumsum() over sales while grouping by id.
# Filter out records before each product's first sale using a cumulative sum of sales
data['cumsum'] = data.groupby(['id']).sales.transform('cumsum')
# filter for one product to investigate if cumsum is working
print('before removing records before first sale:')
display(data.loc[(slice(None),'FOODS_1_003_TX_2_evaluation'),:].head(3))
# filter out product rows before first sales
data = data[data['cumsum'] != 0]
# Drop cumsum column
data.drop('cumsum',axis = 1, inplace = True)
print('after removing records before first sale:')
display(data.loc[(slice(None),'FOODS_1_003_TX_2_evaluation'),:].head(3))
before removing records before first sale:
item_id | dept_id | cat_id | store_id | state_id | sales | cumsum | ||
---|---|---|---|---|---|---|---|---|
date | id | |||||||
2013-01-02 | FOODS_1_003_TX_2_evaluation | FOODS_1_003 | FOODS_1 | FOODS | TX_2 | TX | 3.0 | 3.0 |
2013-01-03 | FOODS_1_003_TX_2_evaluation | FOODS_1_003 | FOODS_1 | FOODS | TX_2 | TX | 0.0 | 3.0 |
2013-01-04 | FOODS_1_003_TX_2_evaluation | FOODS_1_003 | FOODS_1 | FOODS | TX_2 | TX | 0.0 | 3.0 |
after removing records before first sale:
item_id | dept_id | cat_id | store_id | state_id | sales | ||
---|---|---|---|---|---|---|---|
date | id | ||||||
2013-01-02 | FOODS_1_003_TX_2_evaluation | FOODS_1_003 | FOODS_1 | FOODS | TX_2 | TX | 3.0 |
2013-01-03 | FOODS_1_003_TX_2_evaluation | FOODS_1_003 | FOODS_1 | FOODS | TX_2 | TX | 0.0 |
2013-01-04 | FOODS_1_003_TX_2_evaluation | FOODS_1_003 | FOODS_1 | FOODS | TX_2 | TX | 0.0 |
The filtered data for the FOODS_1_003_TX_2_evaluation product now starts at its first sale; everything seems to be working fine.
Let's implement RMSSE and test it on a simple example (the result should be equal to 0.92290404515501):
def rmsse(train, val, y_preds, model):
# Denominator: per-id MSE of the naive one-day-ahead forecast on the training set
denominator = (
train.groupby("unique_id")["y"]
.apply(lambda x: (x.diff() ** 2).sum() / (len(x) - 1))
.reset_index(name="denominator")
)
# Numerator: per-id MSE of the model forecast over the validation horizon
numerator = (
pd.merge(val, y_preds, how = 'left', on = ['ds', 'unique_id'])
.groupby('unique_id')
.apply(lambda x: ((x["y"] - x[model]) ** 2).sum() / len(x))
.reset_index(name="numerator")
)
numerator = numerator.merge(denominator, on=["unique_id"])
# Average the per-id RMSSE values (ids with a zero denominator produce inf and are replaced with 0)
return np.mean(np.sqrt(numerator["numerator"] / numerator["denominator"]).replace(np.inf,0))
test_train = pd.DataFrame({
'ds': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'j'],
'unique_id': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
'y': [3, 2, 5, 100, 150, 60, 10, 20, 30],
})
test_val = pd.DataFrame({
'ds': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'j'],
'unique_id': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
'y': [6, 1, 4, 200, 120, 270, 10, 20, 30],
})
test_y_pred = pd.DataFrame({
'ds': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'j'],
'unique_id': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
'predictions': [1, 2, 3, 180, 160, 240, 20, 30, 40],
})
rmsse(test_train, test_val, test_y_pred, 'predictions')
0.92290404515501
Finally, we can train some models!
Libraries used: statsforecast and mlforecast by nixtla.
For modeling I will be using LightGBM. There are several reasons why LightGBM is a powerful tool for forecasting retail sales, and it was used in many of the highest-scoring solutions in the M5 competition.
'Objective' is a very important LightGBM parameter. As we are dealing with highly right-skewed data, it makes sense to set the objective function to 'tweedie'. The Tweedie distribution assumes that the data is mostly concentrated near 0 with plenty of outliers on the right side (just like our data). The rest of the parameters were chosen based on my intuition after studying the EDA.
For benchmarking purposes I will calculate predictions for each unique_id with the HoltWinters method. HoltWinters is an old-school time series forecasting method which assumes that the impact of past observations on the current one decreases exponentially. It also has trend and seasonality components. The HoltWinters RMSSE will be used as a benchmark to improve the LightGBM models.
# Load Prices data
prices = pd.read_parquet(f'prices.parquet')
prices = prices.reset_index(drop = False)
prices.head()
# Load events data
events = pd.read_parquet(f'calendar.parquet')
events = events.reset_index(drop = False)
# Create dummy variables for events
events = pd.get_dummies(events, columns = list(events.iloc[:,2:].columns))
list_of_events = list(events.iloc[:,1:].columns)
# Merge prices data
data_with_prices = pd.merge(data.reset_index(),prices, how = 'left', on = ['date','store_id','item_id']).set_index(['date','id'])
data_with_prices.head()
item_id | dept_id | cat_id | store_id | state_id | sales | sell_price | ||
---|---|---|---|---|---|---|---|---|
date | id | |||||||
2013-01-01 | FOODS_1_004_TX_1_evaluation | FOODS_1_004 | FOODS_1 | FOODS | TX_1 | TX | 20.0 | 1.78 |
FOODS_1_004_TX_2_evaluation | FOODS_1_004 | FOODS_1 | FOODS | TX_2 | TX | 20.0 | 1.78 | |
FOODS_1_004_TX_3_evaluation | FOODS_1_004 | FOODS_1 | FOODS | TX_3 | TX | 4.0 | 1.78 | |
FOODS_1_005_TX_2_evaluation | FOODS_1_005 | FOODS_1 | FOODS | TX_2 | TX | 1.0 | 3.28 | |
FOODS_1_009_TX_2_evaluation | FOODS_1_009 | FOODS_1 | FOODS | TX_2 | TX | 3.0 | 2.68 |
# Merge events
data_with_prices_and_events = pd.merge(data_with_prices.reset_index(),events, how = 'left', on = ['date']).set_index(['date','id'])
data_with_prices_and_events[list_of_events] = data_with_prices_and_events[list_of_events].fillna(0)
data_with_prices_and_events.head()
item_id | dept_id | cat_id | store_id | state_id | sales | sell_price | snap_TX | event_name_1_Chanukah End | event_name_1_Christmas | ... | event_type_1_Cultural | event_type_1_National | event_type_1_Religious | event_type_1_Sporting | event_name_2_Cinco De Mayo | event_name_2_Easter | event_name_2_Father's day | event_name_2_OrthodoxEaster | event_type_2_Cultural | event_type_2_Religious | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | id | |||||||||||||||||||||
2013-01-01 | FOODS_1_004_TX_1_evaluation | FOODS_1_004 | FOODS_1 | FOODS | TX_1 | TX | 20.0 | 1.78 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
FOODS_1_004_TX_2_evaluation | FOODS_1_004 | FOODS_1 | FOODS | TX_2 | TX | 20.0 | 1.78 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
FOODS_1_004_TX_3_evaluation | FOODS_1_004 | FOODS_1 | FOODS | TX_3 | TX | 4.0 | 1.78 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
FOODS_1_005_TX_2_evaluation | FOODS_1_005 | FOODS_1 | FOODS | TX_2 | TX | 1.0 | 3.28 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
FOODS_1_009_TX_2_evaluation | FOODS_1_009 | FOODS_1 | FOODS | TX_2 | TX | 3.0 | 2.68 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 48 columns
Great resource about HoltWinters: https://otexts.com/fpp2/holt-winters.html.
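As a quick reference, the additive Holt-Winters equations from the resource above combine a level, a trend, and a seasonal component with season length $m$ (here the seasonality is weekly, so $m=7$):
$\ell_t = \alpha(y_t - s_{t-m}) + (1-\alpha)(\ell_{t-1} + b_{t-1})$
$b_t = \beta^*(\ell_t - \ell_{t-1}) + (1-\beta^*)b_{t-1}$
$s_t = \gamma(y_t - \ell_{t-1} - b_{t-1}) + (1-\gamma)s_{t-m}$
$\hat{y}_{t+h|t} = \ell_t + hb_t + s_{t+h-m(k+1)}$
where $k$ is the integer part of $(h-1)/m$, so the forecast always picks its seasonal index from the last observed season.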
val = (
data_with_prices_and_events
.reset_index()
.groupby('id')
.tail(28)
.rename(columns={
'date': 'ds',
'id': 'unique_id',
'sales': 'y',
})
)
train = (
data_with_prices_and_events
.reset_index()
.drop(val.index)
.rename(columns={
'date': 'ds',
'id': 'unique_id',
'sales': 'y',
})
)
sf = StatsForecast(
models = [HoltWinters(season_length=7)], # set weekly seasonality
freq = 'D'
)
sf.fit(
(
train[train['unique_id'] != 'FOODS_2_069_TX_3_evaluation'][['ds','unique_id','y']] # Filter out one item with only 4 observations
.assign(
unique_id = lambda df: df.unique_id.astype('str')
)
)
)
holtwinters_preds = sf.predict(h = 28).reset_index().sort_values(by = ['unique_id','ds'])
plot_data_ets = (
pd.concat([
train.groupby('unique_id').tail(45)[['unique_id', 'ds', 'y']],
val[['unique_id', 'ds', 'y']].sort_values(by = ['unique_id','ds']),
holtwinters_preds
])
)
rmsse_holtwinters = rmsse(train, val, holtwinters_preds, 'HoltWinters')
Recursive forecasting is a time series forecasting strategy in which a model is trained on historical data up to a certain point in time and then used to forecast the future one step at a time: each prediction is appended to the history, the features are recomputed, and the next step is predicted, until the desired forecast horizon is reached (a sketch of the loop follows below). Steps:
- Train a one-step-ahead model on the historical data.
- Predict the next day.
- Append that prediction to the series and recompute the lag/rolling features.
- Repeat until the whole 28-day horizon is covered.
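A minimal, self-contained sketch of that loop, using a linear model as a stand-in for LightGBM and a single window of 7 lags (MLForecast performs this loop internally for the real model, so this is purely illustrative):
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
history = list(rng.poisson(5, 200))                    # hypothetical daily sales for one series

# Train a one-step-ahead model on lag features
X = np.array([history[i - 7:i] for i in range(7, len(history))])
y = np.array(history[7:])
one_step_model = LinearRegression().fit(X, y)          # stand-in for LightGBM

# Recursive prediction: feed each forecast back in as history
preds = []
for _ in range(28):                                    # 28-day horizon
    features = np.array(history[-7:]).reshape(1, -1)   # recompute the lag features
    next_pred = one_step_model.predict(features)[0]
    preds.append(next_pred)
    history.append(next_pred)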
Based on the EDA, strong yearly and weekly seasonality can be observed, especially for high-volume items. In that case it makes sense to create rolling features with different windows. Feature engineering to capture the impact of the observed time-series components:
# split into training and validation sets and conform the column names to what MLForecast expects
val = (
data_with_prices_and_events
.reset_index()
.groupby('id')
.tail(28)
.rename(columns={
'date': 'ds',
'id': 'unique_id',
'sales': 'y',
})
)
train = (
data_with_prices_and_events
.reset_index()
.drop(val.index)
.rename(columns={
'date': 'ds',
'id': 'unique_id',
'sales': 'y',
})
)
# label encode categorical features
cat_feats = ['unique_id', 'item_id', 'dept_id', 'cat_id']
enc_cat_feats = [f'{feat}_enc' for feat in cat_feats]
encoder = OrdinalEncoder()
train[enc_cat_feats] = encoder.fit_transform(train[cat_feats])
val[enc_cat_feats] = encoder.transform(val[cat_feats])
reference_cols = ['unique_id', 'ds', 'y','sell_price']
# add features to this list if you want to use them
features = reference_cols + enc_cat_feats + list_of_events
train = train[features]
val = val[features]
@njit
def rollingmean7d(x):
return rolling_mean(x, window_size=7)
@njit
def rollingmean14d(x):
return rolling_mean(x, window_size=14)
@njit
def rollingmean30d(x):
return rolling_mean(x, window_size=30)
@njit
def rollingmean60d(x):
return rolling_mean(x, window_size=60)
@njit
def rollingmean180d(x):
return rolling_mean(x, window_size=180)
#####################################################
@njit
def rollingstd7d(x):
return rolling_std(x, window_size=7)
@njit
def rollingstd14d(x):
return rolling_std(x, window_size=14)
@njit
def rollingstd30d(x):
return rolling_std(x, window_size=30)
@njit
def rollingstd60d(x):
return rolling_std(x, window_size=60)
@njit
def rollingstd180d(x):
return rolling_std(x, window_size=180)
# feel free to tweak these parameters!
model_params = {
'verbose': -1,
'num_leaves': 256,
'n_estimators': 500,
'objective': 'tweedie',
'tweedie_variance_power': 1.1,
'metric': 'rmse',
'learning_rate': 0.03
}
models = [
lgb.LGBMRegressor(**model_params),
]
fcst = MLForecast(
models=models,
freq='D',
# dictionary reads like this:
# {number of days to lag the feature: [list of functions to apply to the lagged data]}
lags=[7,
14,
30,
31,
365],
lag_transforms = {
7: [rollingmean7d,
rollingmean14d,
rollingmean30d,
rollingmean60d,
rollingmean180d,
rollingstd7d,
rollingstd14d,
rollingstd30d,
rollingstd60d,
rollingstd180d]
},
date_features=['year', 'month', 'day', 'dayofweek', 'quarter']
)
# don't worry about null value warnings. LightGBM and XGBoost can handle them!
fcst.fit(
train,
id_col='unique_id',
time_col='ds',
target_col='y',
dropna=False,
static_features = enc_cat_feats
)
recursive_preds = fcst.predict(28,
# Add Future values for prices from validation dataset
dynamic_dfs = [val[['ds','unique_id','sell_price'] + list_of_events]])
# plot the last 45 days of the training set, the validation set, and the predictions
plot_data_recursive = (
pd.concat([
train.groupby('unique_id').tail(45)[['unique_id', 'ds', 'y']],
val[['unique_id', 'ds', 'y']],
recursive_preds
])
)
rmsse_recursive = rmsse(train, val, recursive_preds, 'LGBMRegressor')
lgb.plot_importance(fcst.models_['LGBMRegressor'], max_num_features=15 , figsize=(12,6))
<AxesSubplot: title={'center': 'Feature importance'}, xlabel='Feature importance', ylabel='Features'>
According to the LightGBM feature importance, the most important variable is price. This is somewhat expected, as we are modeling retail sales, which are highly affected by price changes and promotions.
Direct forecasting is a bit different from the recursive approach. It uses past values of the time series directly to predict the future, instead of iteratively recalculating the features with each prediction.
The idea: for each department, train 4 different models, one for each week of the 28-day forecast horizon.
In total there will be 7*4 = 28 LightGBM models (7 departments times 4 weeks). The variables item_id and unique_id are treated as categoricals. The variables that capture autoregressive relationships, seasonality and trend are the same as in the recursive approach.
However, each weekly model uses features that are lagged specifically for its week (lag >= forecast horizon).
Feature engineering with the help of the MLForecast library:
@njit
def rollingmean7d(x):
return rolling_mean(x, window_size=7)
@njit
def rollingmean14d(x):
return rolling_mean(x, window_size=14)
@njit
def rollingmean30d(x):
return rolling_mean(x, window_size=30)
@njit
def rollingmean60d(x):
return rolling_mean(x, window_size=60)
@njit
def rollingmean180d(x):
return rolling_mean(x, window_size=180)
#####################################################
@njit
def rollingstd7d(x):
return rolling_std(x, window_size=7)
@njit
def rollingstd14d(x):
return rolling_std(x, window_size=14)
@njit
def rollingstd30d(x):
return rolling_std(x, window_size=30)
@njit
def rollingstd60d(x):
return rolling_std(x, window_size=60)
@njit
def rollingstd180d(x):
return rolling_std(x, window_size=180)
# feel free to tweak these parameters!
model_params = {
'verbose': -1,
'num_leaves': 256,
'n_estimators': 500,
'objective': 'tweedie',
'tweedie_variance_power': 1.1,
'metric': 'rmse',
'learning_rate': 0.03
}
models = [
lgb.LGBMRegressor(**model_params),
]
fcst = MLForecast(
models=models,
freq='D',
# dictionary reads like this:
# {number of days to lag the feature: [list of functions to apply to the lagged data]}
lags=[7,
8,
9,
10,
11,
12,
13,
14,
15,
16,
17,
18,
19,
20,
21,
22,
23,
24,
25,
26,
27,
28,
29,
30,
31,
32,
33,
34,
365],
lag_transforms = {
7: [rollingmean7d,
rollingmean14d,
rollingmean30d,
rollingmean60d,
rollingmean180d,
rollingstd7d,
rollingstd14d,
rollingstd30d,
rollingstd60d,
rollingstd180d],
14: [rollingmean7d,
rollingmean14d,
rollingmean30d,
rollingmean60d,
rollingmean180d,
rollingstd7d,
rollingstd14d,
rollingstd30d,
rollingstd60d,
rollingstd180d],
21: [rollingmean7d,
rollingmean14d,
rollingmean30d,
rollingmean60d,
rollingmean180d,
rollingstd7d,
rollingstd14d,
rollingstd30d,
rollingstd60d,
rollingstd180d],
28: [rollingmean7d,
rollingmean14d,
rollingmean30d,
rollingmean60d,
rollingmean180d,
rollingstd7d,
rollingstd14d,
rollingstd30d,
rollingstd60d,
rollingstd180d]
},
date_features=['year', 'month', 'day', 'dayofweek', 'quarter']
)
# Create dataframe to preprocess
data_with_prices_toprep = (
data_with_prices_and_events
.reset_index()
.rename(columns = {'date':'ds',
'id':'unique_id',
'sales': 'y'}).
drop(['state_id','cat_id'],axis = 1)
)
categorical_features = ['unique_id', 'item_id','dept_id','store_id']
encoder = OrdinalEncoder()
data_with_prices_toprep[categorical_features] = encoder.fit_transform(data_with_prices_toprep[categorical_features])
# Feature engineering per unique id
prep = fcst.preprocess(
data_with_prices_toprep,
id_col='unique_id',
time_col='ds',
target_col='y',
dropna=False,
static_features = ['item_id','dept_id','store_id']
)
# Split train/val data
val_direct = (
prep
.groupby('unique_id')
.tail(28)
)
train_direct = (
prep
.drop(val.index)
)
# Clean unused dataframes
del data; gc.collect()
353641
# Create lists of different feature categories
sales_features = list(train.columns[(train.columns.str.contains('lag')) | (train.columns =='y')])
calendar_features = ['year', 'month', 'day', 'dayofweek', 'quarter']
# Create features for each multi-horizon model
lags_model_1stweek = ['lag7','lag8','lag9','lag10','lag11','lag12','lag13']
rollingfeatures_model_1stweek = list(train.columns[(train.columns.str.endswith('_lag7'))])
model_1stweek_features = lags_model_1stweek + rollingfeatures_model_1stweek + calendar_features + ['unique_id', 'item_id'] + list_of_events
lags_model_2ndweek = ['lag14','lag15','lag16','lag17','lag18','lag19','lag20']
rollingfeatures_model_2ndweek = list(train.columns[(train.columns.str.endswith('_lag14'))])
model_2ndweek_features = lags_model_2ndweek + rollingfeatures_model_2ndweek + calendar_features + ['unique_id', 'item_id'] + list_of_events
lags_model_3rdweek = ['lag21','lag22','lag23','lag24','lag25','lag26','lag27']
rollingfeatures_model_3rdweek = list(train.columns[(train.columns.str.endswith('_lag21'))])
model_3rdweek_features = lags_model_3rdweek + rollingfeatures_model_3rdweek + calendar_features + ['unique_id', 'item_id'] + list_of_events
lags_model_4thweek = ['lag28','lag29','lag30','lag31','lag32','lag33','lag34']
rollingfeatures_model_4thweek = list(train.columns[(train.columns.str.endswith('_lag28'))])
model_4thweek_features = lags_model_4thweek + rollingfeatures_model_4thweek + calendar_features + ['unique_id', 'item_id'] + list_of_events
model_4thweek_features
# Create dictionary for each weekly model containing features that are going to be used
models_per_week = {
'model_1stweek':model_1stweek_features,
'model_2ndweek':model_2ndweek_features,
'model_3rdweek':model_3rdweek_features,
'model_4thweek':model_4thweek_features
}
for i, features in models_per_week.items():
print('Model: {}'.format(i))
print('Features used:')
print(features)
print(' ')
Model: model_1stweek Features used: ['lag7', 'lag8', 'lag9', 'lag10', 'lag11', 'lag12', 'lag13', 'year', 'month', 'day', 'dayofweek', 'quarter', 'unique_id', 'item_id', 'snap_TX', 'event_name_1_Chanukah End', 'event_name_1_Christmas', 'event_name_1_Cinco De Mayo', 'event_name_1_ColumbusDay', 'event_name_1_Easter', 'event_name_1_Eid al-Fitr', 'event_name_1_EidAlAdha', "event_name_1_Father's day", 'event_name_1_Halloween', 'event_name_1_IndependenceDay', 'event_name_1_LaborDay', 'event_name_1_LentStart', 'event_name_1_LentWeek2', 'event_name_1_MartinLutherKingDay', 'event_name_1_MemorialDay', "event_name_1_Mother's day", 'event_name_1_NBAFinalsEnd', 'event_name_1_NBAFinalsStart', 'event_name_1_NewYear', 'event_name_1_OrthodoxChristmas', 'event_name_1_OrthodoxEaster', 'event_name_1_Pesach End', 'event_name_1_PresidentsDay', 'event_name_1_Purim End', 'event_name_1_Ramadan starts', 'event_name_1_StPatricksDay', 'event_name_1_SuperBowl', 'event_name_1_Thanksgiving', 'event_name_1_ValentinesDay', 'event_name_1_VeteransDay', 'event_type_1_Cultural', 'event_type_1_National', 'event_type_1_Religious', 'event_type_1_Sporting', 'event_name_2_Cinco De Mayo', 'event_name_2_Easter', "event_name_2_Father's day", 'event_name_2_OrthodoxEaster', 'event_type_2_Cultural', 'event_type_2_Religious'] Model: model_2ndweek Features used: ['lag14', 'lag15', 'lag16', 'lag17', 'lag18', 'lag19', 'lag20', 'year', 'month', 'day', 'dayofweek', 'quarter', 'unique_id', 'item_id', 'snap_TX', 'event_name_1_Chanukah End', 'event_name_1_Christmas', 'event_name_1_Cinco De Mayo', 'event_name_1_ColumbusDay', 'event_name_1_Easter', 'event_name_1_Eid al-Fitr', 'event_name_1_EidAlAdha', "event_name_1_Father's day", 'event_name_1_Halloween', 'event_name_1_IndependenceDay', 'event_name_1_LaborDay', 'event_name_1_LentStart', 'event_name_1_LentWeek2', 'event_name_1_MartinLutherKingDay', 'event_name_1_MemorialDay', "event_name_1_Mother's day", 'event_name_1_NBAFinalsEnd', 'event_name_1_NBAFinalsStart', 'event_name_1_NewYear', 'event_name_1_OrthodoxChristmas', 'event_name_1_OrthodoxEaster', 'event_name_1_Pesach End', 'event_name_1_PresidentsDay', 'event_name_1_Purim End', 'event_name_1_Ramadan starts', 'event_name_1_StPatricksDay', 'event_name_1_SuperBowl', 'event_name_1_Thanksgiving', 'event_name_1_ValentinesDay', 'event_name_1_VeteransDay', 'event_type_1_Cultural', 'event_type_1_National', 'event_type_1_Religious', 'event_type_1_Sporting', 'event_name_2_Cinco De Mayo', 'event_name_2_Easter', "event_name_2_Father's day", 'event_name_2_OrthodoxEaster', 'event_type_2_Cultural', 'event_type_2_Religious'] Model: model_3rdweek Features used: ['lag21', 'lag22', 'lag23', 'lag24', 'lag25', 'lag26', 'lag27', 'year', 'month', 'day', 'dayofweek', 'quarter', 'unique_id', 'item_id', 'snap_TX', 'event_name_1_Chanukah End', 'event_name_1_Christmas', 'event_name_1_Cinco De Mayo', 'event_name_1_ColumbusDay', 'event_name_1_Easter', 'event_name_1_Eid al-Fitr', 'event_name_1_EidAlAdha', "event_name_1_Father's day", 'event_name_1_Halloween', 'event_name_1_IndependenceDay', 'event_name_1_LaborDay', 'event_name_1_LentStart', 'event_name_1_LentWeek2', 'event_name_1_MartinLutherKingDay', 'event_name_1_MemorialDay', "event_name_1_Mother's day", 'event_name_1_NBAFinalsEnd', 'event_name_1_NBAFinalsStart', 'event_name_1_NewYear', 'event_name_1_OrthodoxChristmas', 'event_name_1_OrthodoxEaster', 'event_name_1_Pesach End', 'event_name_1_PresidentsDay', 'event_name_1_Purim End', 'event_name_1_Ramadan starts', 'event_name_1_StPatricksDay', 'event_name_1_SuperBowl', 
'event_name_1_Thanksgiving', 'event_name_1_ValentinesDay', 'event_name_1_VeteransDay', 'event_type_1_Cultural', 'event_type_1_National', 'event_type_1_Religious', 'event_type_1_Sporting', 'event_name_2_Cinco De Mayo', 'event_name_2_Easter', "event_name_2_Father's day", 'event_name_2_OrthodoxEaster', 'event_type_2_Cultural', 'event_type_2_Religious'] Model: model_4thweek Features used: ['lag28', 'lag29', 'lag30', 'lag31', 'lag32', 'lag33', 'lag34', 'year', 'month', 'day', 'dayofweek', 'quarter', 'unique_id', 'item_id', 'snap_TX', 'event_name_1_Chanukah End', 'event_name_1_Christmas', 'event_name_1_Cinco De Mayo', 'event_name_1_ColumbusDay', 'event_name_1_Easter', 'event_name_1_Eid al-Fitr', 'event_name_1_EidAlAdha', "event_name_1_Father's day", 'event_name_1_Halloween', 'event_name_1_IndependenceDay', 'event_name_1_LaborDay', 'event_name_1_LentStart', 'event_name_1_LentWeek2', 'event_name_1_MartinLutherKingDay', 'event_name_1_MemorialDay', "event_name_1_Mother's day", 'event_name_1_NBAFinalsEnd', 'event_name_1_NBAFinalsStart', 'event_name_1_NewYear', 'event_name_1_OrthodoxChristmas', 'event_name_1_OrthodoxEaster', 'event_name_1_Pesach End', 'event_name_1_PresidentsDay', 'event_name_1_Purim End', 'event_name_1_Ramadan starts', 'event_name_1_StPatricksDay', 'event_name_1_SuperBowl', 'event_name_1_Thanksgiving', 'event_name_1_ValentinesDay', 'event_name_1_VeteransDay', 'event_type_1_Cultural', 'event_type_1_National', 'event_type_1_Religious', 'event_type_1_Sporting', 'event_name_2_Cinco De Mayo', 'event_name_2_Easter', "event_name_2_Father's day", 'event_name_2_OrthodoxEaster', 'event_type_2_Cultural', 'event_type_2_Religious']
# Change id variables into category so LightGBM performs correct tree split based on those columns
train_direct['unique_id'] = train_direct['unique_id'].astype('category')
train_direct['item_id'] = train_direct['item_id'].astype('category')
model_params = {
'verbose': -1,
'num_leaves': 256,
'n_estimators': 500,
'objective': 'tweedie',
'tweedie_variance_power': 1.1,
'metric': 'rmse',
'learning_rate': 0.03
}
# create an empty dictionary to store the models
models_dict = {}
# loop through each department in the validation set
for dept in train_direct.dept_id.unique():
models_dict[dept] = {}
# Create temporary dataframe for each department
dept_data = train_direct[(train_direct['dept_id'] == dept)]
# loop through each multi-horizon model
for i, features in models_per_week.items():
# Create X for each department and weekly model (take features assigned to weekly models)
X = dept_data[features]
y = dept_data['y']
# train models
lgb_train = lgb.Dataset(X, y)
model = lgb.train(model_params, lgb_train)
# Update dictionary for each weekly model and department
models_dict[dept][i] = model
# # Save models UNCOMMENT TO SAVE THE MODELS
# models_dict_tosave = models_dict.copy()
# for dept in train.dept_id.unique():
# for i, features in models_per_week.items():
# print(models_dict_tosave[dept][i].save_model(str(dept) + '_' + str(i) + '_' + 'mode.txt'))
# Change id variables into category for validation set
val_direct['unique_id'] = val_direct['unique_id'].astype('category')
val_direct['item_id'] = val_direct['item_id'].astype('category')
# Create Predictions Dataframe
direct_preds = pd.DataFrame()
# loop through each department in the validation set
for dept in val_direct.dept_id.unique():
dept_data = val_direct[(val_direct['dept_id'] == dept)]
# loop through each multi-horizon model
for i, features in models_per_week.items():
# Create Predictions
X_val = dept_data[features]
preds = pd.DataFrame(models_dict[dept][i].predict(X_val))
preds.columns = ['LGBMRegressor']
preds['model'] = i
preds['dept_id'] = dept
preds['ds'] = dept_data.reset_index()['ds']
preds['unique_id'] = dept_data.reset_index()['unique_id']
# Keep only the predictions assigned specifically to each weekly model's horizon
if i == 'model_1stweek':
preds = preds[preds['ds'] <= pd.to_datetime('2016-05-01')]
elif i == 'model_2ndweek':
preds = preds[(preds['ds'] > pd.to_datetime('2016-05-01'))
& (preds['ds'] <= pd.to_datetime('2016-05-08'))]
elif i == 'model_3rdweek':
preds = preds[(preds['ds'] > pd.to_datetime('2016-05-08'))
& (preds['ds'] <= pd.to_datetime('2016-05-15'))]
elif i == 'model_4thweek':
preds = preds[(preds['ds'] > pd.to_datetime('2016-05-15'))
& (preds['ds'] <= pd.to_datetime('2016-05-22'))]
direct_preds = pd.concat([preds,direct_preds],axis = 0)
# Inverse back categorical features encoding
train_direct[categorical_features] = pd.DataFrame(encoder.inverse_transform(train_direct[categorical_features]), columns = train_direct[categorical_features].columns)
# Merge multi horizon predictions with categorical features from val set to inverse encoding later on
direct_preds = pd.merge(direct_preds.drop('dept_id', axis = 1), val_direct[categorical_features + ['ds']], how = 'left', on = ['ds','unique_id'])
# Inverse encoding for direct predictions
direct_preds[categorical_features] = pd.DataFrame(encoder.inverse_transform(direct_preds[categorical_features]), columns = direct_preds[categorical_features].columns)
val_direct = val_direct.reset_index(drop = True)
val_direct[categorical_features] = pd.DataFrame(encoder.inverse_transform(val_direct[categorical_features]), columns = val_direct[categorical_features].columns)
rmsse_direct = rmsse(train, val, direct_preds, 'LGBMRegressor')
rmsse_direct
0.7493979496196602
# plot the last 45 days of the training set, the validation set, and the predictions
plot_data_direct = (
pd.concat([
train.groupby('unique_id').tail(45)[['unique_id', 'ds', 'y']],
val[['unique_id', 'ds', 'y']],
direct_preds
])
)
# Group to plot total sales for each model
data_grouped_mutlihorizon = plot_data_direct.groupby('ds')[['y','LGBMRegressor']].sum().reset_index()
data_grouped_recursive = plot_data_recursive.groupby('ds')[['y','LGBMRegressor']].sum().reset_index()
data_grouped_ets = plot_data_ets.groupby('ds')[['y','HoltWinters']].sum().reset_index()
# Combine all predictors into one dataframe
plot_data = (
data_grouped_ets.set_index('ds')
.join(
data_grouped_mutlihorizon.set_index('ds')
.drop('y',axis =1).rename(columns = {'LGBMRegressor':'LGBM_MultiHorizon'}))
.join(data_grouped_recursive.set_index('ds')
.drop('y',axis =1).rename(columns = {'LGBMRegressor':'LGBM_Recursive'}))
)
plot_data
y | HoltWinters | LGBM_MultiHorizon | LGBM_Recursive | |
---|---|---|---|---|
ds | ||||
2016-03-11 | 11066.0 | 0.000000 | 0.000000 | 0.000000 |
2016-03-12 | 13113.0 | 0.000000 | 0.000000 | 0.000000 |
2016-03-13 | 13466.0 | 0.000000 | 0.000000 | 0.000000 |
2016-03-14 | 11882.0 | 0.000000 | 0.000000 | 0.000000 |
2016-03-15 | 11659.0 | 0.000000 | 0.000000 | 0.000000 |
... | ... | ... | ... | ... |
2016-05-18 | 10375.0 | 9579.687500 | 9532.604587 | 9406.946322 |
2016-05-19 | 9162.0 | 9690.434570 | 9578.767776 | 9412.847663 |
2016-05-20 | 12303.0 | 10573.802734 | 10444.268249 | 10176.920809 |
2016-05-21 | 13681.0 | 12760.348633 | 12419.145383 | 12436.963334 |
2016-05-22 | 14815.0 | 13333.077148 | 12828.134820 | 12881.801693 |
73 rows × 4 columns
# Create model lists
models_list = ['HoltWinters', 'LGBM_MultiHorizon', 'LGBM_Recursive']
# Define MAPE formula
def MAPE(df, ac, preds):
return round(np.mean(abs(df[ac] - df[preds])/df[ac])*100, 2)
# Create MAPE results for Total Sales and save in the dataframe
mape_df = pd.DataFrame()
mape = pd.DataFrame()
for model in models_list:
mape['mape'] = pd.Series(MAPE(plot_data[plot_data[model] != 0], 'y',model))
mape['model'] = model
mape_df = pd.concat([mape, mape_df],axis = 0)
mape_df
mape | model | |
---|---|---|
0 | 6.89 | LGBM_Recursive |
0 | 8.10 | LGBM_MultiHorizon |
0 | 8.81 | HoltWinters |
fig, ax = plt.subplots(figsize=(18, 10))
# Plot Total Sales and Total Predictions by date
sns.lineplot(data=plot_data,
x = 'ds',
y = 'y',
color = 'steelblue',
ax = ax,
label = 'y'
)
sns.lineplot(data=plot_data[plot_data['LGBM_MultiHorizon'] != 0],
x = 'ds',
y = 'LGBM_MultiHorizon',
color = 'darkblue',
ax = ax,
label = 'Predictions Direct Strategy'
)
sns.lineplot(data=plot_data[plot_data['LGBM_Recursive'] != 0],
x = 'ds',
y = 'LGBM_Recursive',
color = 'indianred',
ax = ax,
label = 'Predictions Recursive Strategy'
)
sns.lineplot(data=plot_data[plot_data['HoltWinters'] != 0],
x = 'ds',
y = 'HoltWinters',
color = 'green',
ax = ax,
label = 'Predictions HoltWinters'
)
ax.set_title('Total Sales and Predictions \n(Recursive Strategy: RMSSE = {}, Total Sales MAPE = {}%) \n(Direct Strategy: RMSSE = {}, Total Sales MAPE = {}%) \n(HoltWinters: RMSSE = {}, Total Sales MAPE = {}%)'.format(round(rmsse_recursive,4),mape_df[mape_df['model'] == 'LGBM_Recursive']['mape'][0],round(rmsse_direct,4),mape_df[mape_df['model'] == 'LGBM_MultiHorizon']['mape'][0], round(rmsse_holtwinters, 4),mape_df[mape_df['model'] == 'HoltWinters']['mape'][0]), fontsize = 15)
Text(0.5, 1.0, 'Total Sales and Predictions \n(Recursive Strategy: RMSSE = 0.7447, Total Sales MAPE = 6.89%) \n(Direct Strategy: RMSSE = 0.7494, Total Sales MAPE = 8.1%) \n(HoltWinters: RMSSE = 0.7508, Total Sales MAPE = 8.81%)')
The lowest average Root Mean Squared Scaled Error per item was obtained by LightGBM with the recursive strategy (0.7447).
The lowest Mean Absolute Percentage Error on total sales was also obtained by the recursive LightGBM (6.89%).
This project is the result of many experiments. Here is a brief summary of what helped my models and what didn't, and whether it was expected before seeing the results:
The direct strategy is slightly less accurate than the recursive strategy (not expected).
Including price information resulted in a huge accuracy improvement (expected).
RMSSE for recursive forecasting was around 1.5; after including price information it went down to around 0.75.
The direct approach takes longer to train the models (expected).
Including rolling features to capture mid-term trends (windows of 60-180 days) helped to improve accuracy (expected).
Throwing in a bunch of rolling features lagged by different numbers of days (7, 14, 21, 28) helped to improve accuracy in the recursive strategy (unexpected, and not included in the final results).
Adding events resulted in an accuracy improvement, from 0.747 to 0.746 RMSSE for the recursive strategy (expected).
- ensembling (creating weighted averages of all predictions usually leads to better accuracy)
Ensembling is a technique for combining predictions from multiple models. It often leads to better accuracy, especially in time series forecasting. Think of a scenario with two models, one over-forecasting and the other under-forecasting: averaging their predictions leads to better accuracy, because the individual errors partially cancel out.
The easiest way to create an ensemble is a simple average, which in my experience usually works pretty well. Most of the time, though, a weighted average is better. How do you calculate the weights for each model? You can fit a linear regression in which the target is the dependent variable and the model predictions are the regressors; the fitted coefficients can be treated as weights, since linear regression minimizes MSE. If MSE is not your objective function, a linear optimizer is a better choice. A small sketch of this idea follows after this list.
- building recursive/direct models on data filtered by categories/departments
- experimenting with more features (seasonal Fourier series, seasonal rolling features, rolling features aggregated on different levels)
- hyperparameter tuning with Optuna
- trying different machine learning/deep learning models (NBEATS could be a good fit for this exercise)
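A minimal sketch of the weighted-average ensembling idea, with made-up numbers (the arrays stand in for validation actuals and two models' predictions; none of these names come from the notebook):
from sklearn.linear_model import LinearRegression
import numpy as np

y_true = np.array([10.0, 12.0, 9.0, 14.0])      # hypothetical validation actuals
preds_a = np.array([11.0, 13.0, 10.0, 15.0])    # model A tends to over-forecast
preds_b = np.array([8.0, 10.0, 8.0, 12.0])      # model B tends to under-forecast

# Fit a linear regression of the actuals on the model predictions;
# the coefficients act as ensemble weights (they minimize MSE)
X = np.column_stack([preds_a, preds_b])
weights = LinearRegression(fit_intercept=False).fit(X, y_true).coef_
ensemble_preds = X @ weights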