My solution for the Instacart Market Basket Analysis competition on Kaggle.
Given anonymized data on customer orders over time, predict which previously purchased products will be in a user’s next order. F1 score is used evaluate the prediction accuracy.
The dataeset contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, between 4 and 100 of their orders are provided.
Data Exploration Analysis Notebook
The problem was reformulated as a binary classification problem: given a user u, a product p, and the user's previous n orders, predict if the product will be reordered in user's n+1 order.
Exploratory data analysis has been performed to understand what features might be related to users' reorder behaviors. 4 groups of features are used as input, which include word embedding learned for product_id using word2vec.
- product_reorder_rate', 'product_reorder_probability', 'vec1', 'vec2', 'product_nb_users', 'product_mean_add_to_cart_position', 'prod_buy_second_time_total_cnt', 'frequency_mean', 'frequency_std
- 'user_total_orders', 'user_total_items', 'total_distinct_items', 'user_average_days_between_orders', 'user_average_basket', 'user_median_basket', 'user_reorder_ratio', 'user_period', 'user_nb_aisles', 'user_nb_departments',
- 'order_hour_of_day', 'days_since_prior_order', 'days_since_ratio', 'aisle_id', 'department_id', 'product_orders', 'product_reorders',
- 'UP_orders', 'UP_orders_ratio', 'UP_average_pos_in_cart', 'UP_orders_since_last','UP_delta_hour_vs_last', 'UP_first_order', 'UP_last_order', 'UP_order_rate_since_first_order', 'UP_recency', 'UP_recency_day', 'UP_recency_exp', 'UP_interval_distance', 'UP_interval_mean', 'UP_interval_std', 'UP_order_number_skew', 'UP_order_streak'
Gradient boosting algorithms are used. Both lightgbm and xgboost are used to train the model. The prediction result is a value in the range of [0,1] for each (user,product) pair. A weighted average is used to average over multiple models to generate prediction.
The numerical probability needs to be converted to binary decisions. A threshold h is used to determine whether a product will be reordered. A product will be predicted as reorder is its predicted reorder probability is larger than threshold.
Different threshold value are used in diferent order in order to maximize F1 score. An algorithm proposed in this paper [1] is implemented. The idea is maximize the F1 score expectation given the predicted posteriors P = {p_1, p_2, ... , P_n} for all the products a user has purchased.
The goal is to find the value k so that by reordering the products with largest k reorder probability, the expected F1 score for this order is the largest.
[1] Ye, N., Chai, K., Lee, W., and Chieu, H. "Optimizing F-measure: A Tale of Two Approaches", ICML, 2012.
- lightgbm==2.0
- xgboost==0.6
- gensim==2.3
- numpy==1.13
- pandas== 0.19
- scikit-learn==0.19
- Put raw csv files into data/raw directory
- Preprocessing: order_features.py order
- Model Training:
- maximizeF1.py