Data Preprocessing

General steps

  1. step

    Only view / click / detail events are kept, depending on the dataset. These events represent the user visiting an item’s detail page, thus they describe the sequences for the next-click prediction task.

  2. step

    Sessionisation of user histories. When no session information is available, the sessions are computed using 1 hour as the session gap threshold. If the sessions are precomputed, the precomputed IDs are used.

  3. step

    All other data is discarded. Only the session ID, item ID, and timestamp are kept.

  4. step

    Filtering of consecutive repeated items. If the user visits the same item multiple times in succession, only the first occurrence is kept. E.g. (i,i,j) is reduced to (i,j), but (i,j,i) is not modified.

  5. step

    The dataset is filtered iteratively: sessions with fewer than 2 events and items occurring fewer than 5 times are removed until the dataset no longer changes.

  6. step

    Time-based train and test splits. The number of test days is defined for each dataset (tday, given in seconds), yielding a split time tsplit. tsplit is calculated by subtracting tday from the last timestamp of the dataset. We create a train_full and a test dataset. Sessions starting after tsplit are assigned to test. Events with timestamps smaller than tsplit are assigned to train_full.

    Note

    After this step train_full may contain sessions with only one event; these are discarded.

    This process is repeated for train_full, yielding train_tr and train_valid.
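
Steps 2, 4 and 5 above can be sketched in plain Python as follows. This is only an illustrative sketch, not the project's actual preprocessing code: the function names, the tuple layout of the events, the strict `>` comparison against the session gap, and the order in which the two filters of step 5 are applied are all assumptions.

```python
from collections import Counter

SESSION_GAP = 3600  # 1 hour in seconds (step 2)


def sessionise(events, gap=SESSION_GAP):
    """Step 2 sketch. `events` holds (user_id, item_id, timestamp) tuples.

    Returns (session_id, item_id, timestamp) tuples. A new session starts
    when the user changes or when the time since the user's previous event
    exceeds `gap` (the strict comparison is an assumption).
    """
    out = []
    session_id, prev_user, prev_ts = -1, None, None
    for user, item, ts in sorted(events, key=lambda e: (e[0], e[2])):
        if user != prev_user or ts - prev_ts > gap:
            session_id += 1
        out.append((session_id, item, ts))
        prev_user, prev_ts = user, ts
    return out


def drop_consecutive_repeats(events):
    """Step 4 sketch: keep only the first of consecutive identical items."""
    out = []
    prev_session, prev_item = None, None
    for session, item, ts in sorted(events, key=lambda e: (e[0], e[2])):
        if session != prev_session or item != prev_item:
            out.append((session, item, ts))
        prev_session, prev_item = session, item
    return out


def filter_support(events, min_session_len=2, min_item_support=5):
    """Step 5 sketch: repeat both filters until the dataset stops changing."""
    while True:
        before = len(events)
        item_counts = Counter(item for _, item, _ in events)
        events = [e for e in events if item_counts[e[1]] >= min_item_support]
        session_lens = Counter(session for session, _, _ in events)
        events = [e for e in events if session_lens[e[0]] >= min_session_len]
        if len(events) == before:
            return events
```

The loop in `filter_support` is needed because the two filters interact: removing a rare item can shorten a session below the minimum length, and removing that session can in turn push another item below the support threshold.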

Dataset-specific steps

Rees46

  1. step
    1. The October and November datasets are concatenated.

    2. Item IDs and user IDs are reindexed with integers to allow faster data sorting (the final dataset contains the original IDs).

    3. view events are used.

  2. step
    • The session gap threshold is 1 hour.

  6. step
    • tday is 1 day (86400 sec).

Yoochoose

  1. step
    1. The clicks and test files are concatenated.

  6. step
    • tday is 1 day (86400 sec).

Coveo

  1. step
    1. detail events are used.

  6. step
    • tday is 1 day (86400 sec).

Diginetica

  1. step
    1. Timestamps are computed from the eventdate of a session’s first event and the timeframe (elapsed time since the first event).

  6. step
    • tday is 7 days (604800 sec).

Retailrocket

  1. step
    1. view events are used.

  2. step
    • The session gap threshold is 1 hour.

  6. step
    • tday is 7 days (604800 sec).
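
The per-dataset tday values feed into the time-based split of general step 6. The sketch below is hedged throughout: the `TDAY` table and the `time_split` function are hypothetical names, bare tday numbers are read as days and converted to seconds, events are assumed to be (session ID, item ID, timestamp) tuples, and the boundary handling follows the wording of step 6 literally (sessions starting strictly after tsplit go to test, events strictly before tsplit go to train).

```python
from collections import Counter

DAY = 86400  # seconds in one day

# tday per dataset in seconds; reading the bare numbers as days is an assumption.
TDAY = {
    "Rees46": 1 * DAY,
    "Yoochoose": 1 * DAY,
    "Coveo": 1 * DAY,
    "Diginetica": 7 * DAY,
    "Retailrocket": 7 * DAY,
}


def time_split(events, tday):
    """Split (session_id, item_id, timestamp) tuples into (train, test)."""
    t_split = max(ts for _, _, ts in events) - tday

    # First timestamp of every session.
    session_start = {}
    for session, _, ts in events:
        session_start[session] = min(ts, session_start.get(session, ts))

    # Sessions starting after t_split form the test set.
    test = [e for e in events if session_start[e[0]] > t_split]
    # Events before t_split form the training set.
    train = [e for e in events if e[2] < t_split]

    # Cutting at t_split can leave one-event sessions in train; drop them.
    lens = Counter(session for session, _, _ in train)
    train = [e for e in train if lens[e[0]] >= 2]
    return train, test
```

Calling `time_split(events, TDAY["Diginetica"])` would give train_full and test. Calling it again on train_full would give train_tr and train_valid; reusing the same tday for this second split is an assumption, since the text does not state it.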