We provide two versions of the dataset:
The Synerise dataset is based on user click behavior that was recorded at a real-world online retailer’s website, covering a period of six months. The data consists of five types of events and product attributes:
product_buy, add_to_cart, remove_from_cart (each of these event types has the same three columns):
client_id (int64): Numeric ID of the client (user).
timestamp (object): Date and time of the event in the format YYYY-MM-DD HH:mm:ss.
sku (int64): Numeric ID of the item.
page_visit:
client_id (int64): Numeric ID of the client.
timestamp (object): Date and time of the event in the format YYYY-MM-DD HH:mm:ss.
url (int64): Numeric ID of a visited URL. No explicit information is provided about what (e.g., which item) is presented on a particular page.
search_query:
client_id (int64): Numeric ID of the client.
timestamp (object): Date and time of the event in the format YYYY-MM-DD HH:mm:ss.
query (object): The textual embedding of the search query, compressed to a 20-dimensional integer array using the product quantization method.
Product properties (product_properties.parquet):
sku (int64): Numeric ID of the item.
category (int64): Numeric ID of the item category.
price (int64): Numeric ID of the item's price bucket.
embedding (object): A textual embedding of the product name, compressed using the product quantization method.
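Because timestamp, query, and embedding are stored with the object dtype, they usually need to be converted before use. The following is a minimal sketch, not part of the official challenge code. It assumes the events are already loaded into a pandas DataFrame, and that a quantized embedding cell is either array-like or a string such as "[1 2 3]"; the helper names parse_timestamps and decode_codes are hypothetical.

```python
import numpy as np
import pandas as pd

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # corresponds to YYYY-MM-DD HH:mm:ss


def parse_timestamps(events: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose object-typed timestamp column is converted to datetime64."""
    out = events.copy()
    out["timestamp"] = pd.to_datetime(out["timestamp"], format=TIMESTAMP_FORMAT)
    return out


def decode_codes(value) -> np.ndarray:
    """Convert one quantized embedding cell into an int64 array.

    Assumption: the cell is array-like or a bracketed string of integers
    separated by spaces or commas. Check the actual storage format first.
    """
    if isinstance(value, str):
        tokens = value.replace(",", " ").strip("[] \n").split()
        return np.array([int(t) for t in tokens], dtype=np.int64)
    return np.asarray(value, dtype=np.int64)


# Illustrative rows only; the values are made up.
events = pd.DataFrame(
    {
        "client_id": [1, 2],
        "timestamp": ["2022-06-01 10:00:00", "2022-06-01 10:05:30"],
        "sku": [10, 11],
    }
)
parsed = parse_timestamps(events)
codes = decode_codes("[1 2 3]")
```

Passing an explicit format string makes parsing fail loudly on rows that do not match the documented layout, instead of silently guessing.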
For the specific purposes of the RecSys 2025 Challenge, the Synerise dataset was heavily preprocessed. Temporally, the events were split into three distinct parts:
The dataset directory contains a restricted version of the additional item attributes file (product_properties.parquet), containing only properties for items that appeared in the pre-training time frame. The other files are stored in two subdirectories, input and target, whose contents are described in the following sections.
This directory contains the same types of events (product_buy, add_to_cart, remove_from_cart, page_visit, search_query) as the raw dataset, but here they are restricted to the pre-training time frame.
In addition, the directory stores a NumPy file, relevant_clients.npy, containing a subset of 1M client IDs. In the challenge, participants were required to create Universal Behavioral Profiles for the clients whose client_id is listed in relevant_clients.npy.
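Restricting an event table to the listed clients can be sketched as below. This is an illustration under stated assumptions, not the challenge's own loader: the function names are hypothetical, the tiny arrays stand in for the real files, and in practice the IDs would come from something like np.load("relevant_clients.npy") with the path adjusted to your local copy.

```python
import numpy as np
import pandas as pd


def restrict_to_relevant(events: pd.DataFrame, relevant_ids: np.ndarray) -> pd.DataFrame:
    """Keep only rows whose client_id is in the relevant client list."""
    mask = events["client_id"].isin(relevant_ids)
    return events.loc[mask].reset_index(drop=True)


def clients_without_events(events: pd.DataFrame, relevant_ids: np.ndarray) -> np.ndarray:
    """Return relevant client IDs that have no row in this event table."""
    seen = events["client_id"].unique()
    return np.setdiff1d(relevant_ids, seen)


# Stand-in data; real IDs would be loaded from relevant_clients.npy.
relevant_ids = np.array([2, 4], dtype=np.int64)
events = pd.DataFrame({"client_id": [1, 2, 3, 2], "sku": [10, 11, 12, 13]})

restricted = restrict_to_relevant(events, relevant_ids)
missing = clients_without_events(events, relevant_ids)
```

The second helper is useful as a sanity check: a profile is still required for every listed client, including any that have no rows in a particular event table.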
This directory contains two event-related files.
train_target.parquet: Contains interactions of clients during the training time frame.
validation_target.parquet: Contains interactions of clients during the evaluation time frame.
Note that these event logs contain all types of interactions.
In addition, the directory contains NumPy files that the provided evaluation scripts use to score submissions. In particular, they store the items, categories, and prices for which ground truth labels are computed.
propensity_category.npy: Contains a subset of 100 categories for which the model is asked to provide predictions.
popularity_propensity_category.npy: Contains popularity scores for categories from the propensity_category.npy file. These scores are used to compute the Novelty measure.
propensity_sku.npy: Contains a subset of 100 products for which the model is asked to provide predictions.
popularity_propensity_sku.npy: Contains popularity scores for products from the propensity_sku.npy file. These scores are used to compute the Novelty measure.
propensity_new_sku.npy: Contains a subset of 20 products that do not appear in the pre-training time frame and for which the model is asked to provide predictions.
popularity_propensity_new_sku.npy: Contains popularity scores for products from the propensity_new_sku.npy file. These scores are used to compute the Novelty measure.
propensity_price.npy: Contains a set of 100 price buckets for which the model is asked to provide predictions.
popularity_propensity_price.npy: Contains popularity scores for price buckets from the propensity_price.npy file. These scores are used to compute the Novelty measure.
Finally, the directory contains active_clients.npy, a subset of relevant clients with at least one product_buy event in their history. Active clients are used to compute the churn target.
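To show how these files might fit together, here is a hedged sketch of building ground truth labels. The exact label definitions used by the provided evaluation scripts are not stated in this description, so the code rests on two assumptions: a client is positive for a category if they bought at least one product of that category in the target time frame, and an active client has churned if they made no purchase in that time frame. The function names are hypothetical, and the small arrays stand in for propensity_category.npy and active_clients.npy.

```python
import numpy as np
import pandas as pd


def category_propensity_labels(buys, properties, clients, target_categories):
    """Multi-hot matrix: rows follow `clients`, columns follow `target_categories`.

    Assumed definition: 1 if the client bought any product of that category.
    `clients` and `target_categories` are assumed to hold unique values.
    """
    # Attach the category of each purchased item via its sku.
    merged = buys.merge(properties[["sku", "category"]], on="sku", how="inner")
    keep = merged["client_id"].isin(clients) & merged["category"].isin(target_categories)
    merged = merged[keep]
    rows = pd.Index(clients).get_indexer(merged["client_id"])
    cols = pd.Index(target_categories).get_indexer(merged["category"])
    labels = np.zeros((len(clients), len(target_categories)), dtype=np.int8)
    labels[rows, cols] = 1
    return labels


def churn_labels(buys, active_clients):
    """Assumed definition: 1 if an active client has no purchase in the target frame."""
    bought = np.isin(active_clients, buys["client_id"].unique())
    return (~bought).astype(np.int8)


# Stand-in data; `buys` represents product_buy events from a target file.
properties = pd.DataFrame({"sku": [10, 11, 12], "category": [100, 200, 300]})
buys = pd.DataFrame({"client_id": [1, 1, 3], "sku": [10, 12, 11]})
clients = np.array([1, 2, 3])
target_categories = np.array([100, 200])  # would come from propensity_category.npy
active_clients = np.array([1, 2])         # would come from active_clients.npy

propensity = category_propensity_labels(buys, properties, clients, target_categories)
churn = churn_labels(buys, active_clients)
```

For scoring actual submissions, rely on the provided evaluation scripts rather than on this sketch.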
We recommend using the preprocessed dataset together with the challenge code.