Data Preprocessing
General steps
- step
Only one event type is kept: view, click, or detail, depending on the dataset. Each of these events corresponds to a user visiting an item’s detail page, so together they form the sequences for the next-click prediction task.
- step
Sessionisation of user histories. When no session information is available, the sessions are computed using 1 hour as the session gap threshold. If the sessions are precomputed, the precomputed IDs are used.
- step
Unnecessary data is discarded. Only the session ID, item ID, and timestamp are kept.
- step
Filtering of consecutive repeated items. If the user visits the same item multiple times in succession, only the first occurrence is kept. E.g. (i,i,j) is reduced to (i,j), but (i,j,i) is not modified.
- step
The dataset is iteratively filtered: sessions shorter than 2 events and items occurring fewer than 5 times are removed until the dataset no longer changes.
- step
Time-based train and test splits. The number of test days is defined for each dataset (tday, expressed in seconds), yielding a split time tsplit, which is calculated by subtracting tday from the last timestamp of the dataset. This produces a train_full and a test dataset: sessions starting after tsplit are assigned to test, and events with timestamps smaller than tsplit are assigned to train_full.
Note
After this step, train_full may contain sessions with only one event; these are discarded.
The same split is then repeated on train_full, yielding train_tr and train_valid.
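The general steps above (sessionisation, removal of consecutive repeats, iterative filtering, and the time-based split) can be sketched in plain Python. This is a minimal illustration, not the project's actual preprocessing code. The function names, the tuple layout `(user_id or session_id, item_id, timestamp)`, the strict `>` comparison against the session gap, and the order in which the two filters are applied are all assumptions made for the example.

```python
from collections import Counter

SESSION_GAP = 3600  # 1 hour, in seconds


def sessionize(events, gap=SESSION_GAP):
    """Map (user_id, item_id, ts) rows to (session_id, item_id, ts) rows.

    A new session starts when the user changes or when the pause since the
    user's previous event exceeds `gap` (strict '>' is an assumption).
    """
    out, session_id, prev_user, prev_ts = [], 0, None, None
    for user, item, ts in sorted(events, key=lambda e: (e[0], e[2])):
        if user != prev_user or ts - prev_ts > gap:
            session_id += 1
        out.append((session_id, item, ts))
        prev_user, prev_ts = user, ts
    return out


def drop_consecutive_repeats(events):
    """Keep only the first of back-to-back visits to the same item in a session."""
    out = []
    for row in sorted(events, key=lambda e: (e[0], e[2])):
        if out and out[-1][0] == row[0] and out[-1][1] == row[1]:
            continue  # same session, same item as the previous event
        out.append(row)
    return out


def filter_short_and_rare(events, min_session_len=2, min_item_support=5):
    """Apply the item and session filters repeatedly until nothing changes."""
    while True:
        n_before = len(events)
        item_count = Counter(item for _, item, _ in events)
        events = [e for e in events if item_count[e[1]] >= min_item_support]
        session_len = Counter(sid for sid, _, _ in events)
        events = [e for e in events if session_len[e[0]] >= min_session_len]
        if len(events) == n_before:
            return events


def time_split(events, t_day):
    """Return (train, test), split at t_split = last timestamp - t_day."""
    t_split = max(ts for _, _, ts in events) - t_day
    start = {}
    for sid, _, ts in events:
        start[sid] = min(ts, start.get(sid, ts))
    # Sessions starting after t_split form the test set.
    test = [e for e in events if start[e[0]] > t_split]
    # Events before t_split form the training set.
    train = [e for e in events if e[2] < t_split]
    # Sessions cut by the split may be left with one event; discard those.
    train_len = Counter(sid for sid, _, _ in train)
    train = [e for e in train if train_len[e[0]] >= 2]
    return train, test
```

Under these assumptions, calling `time_split(events, t_day)` once gives `train_full` and `test`, and calling it again on `train_full` gives `train_tr` and `train_valid`.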
Dataset-specific steps
- step
The October and November datasets are concatenated.
Item IDs and user IDs are reindexed with integers to allow faster data sorting (the final dataset will contain the original IDs).
view events are used
- step
session gap threshold is 1 hour
- step
tday is 1 day (86400 sec) for Rees46
- step
clicks and test files are concatenated
- step
tday is 1 day (86400 sec) for Yoochoose
- step
detail events are used
- step
tday is 1 day (86400 sec) for Coveo
- step
timestamps are computed from the eventdate of a session’s first event and the timeframe (elapsed time since the first event).
- step
tday is 7 days (604800 sec) for Diginetica
- step
view events are used
- step
session gap threshold is 1 hour
- step
tday is 7 days (604800 sec) for Retailrocket
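The dataset-specific settings can be collected in a small lookup table, and the Diginetica timestamp reconstruction can be written as a short helper. The sketch below is illustrative only. The dictionary keys and function names are hypothetical, and the helper assumes that eventdate is a `YYYY-MM-DD` string interpreted as midnight UTC and that timeframe is given in milliseconds; neither detail is stated in the text above.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400

# Length of the test window per dataset, in days (values from the steps above;
# the lowercase keys are labels chosen for this example).
TEST_DAYS = {
    "rees46": 1,
    "yoochoose": 1,
    "coveo": 1,
    "diginetica": 7,
    "retailrocket": 7,
}


def t_day_seconds(dataset):
    """Return tday in seconds for the given dataset label."""
    return TEST_DAYS[dataset] * SECONDS_PER_DAY


def diginetica_timestamps(session_rows):
    """Compute event timestamps for one session.

    session_rows: list of (eventdate, timeframe_ms) pairs.
    The base time is taken from the eventdate of the session's first event
    (the one with the smallest timeframe); each event's timestamp is the base
    plus its elapsed time. Format, time zone, and unit are assumptions.
    """
    first_date = min(session_rows, key=lambda r: r[1])[0]
    base = datetime.strptime(first_date, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return [base.timestamp() + tf / 1000.0 for _, tf in session_rows]
```

For example, `t_day_seconds("diginetica")` yields the 7-day window in seconds, which can be passed as the split offset when computing tsplit.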