MercadoLibre_2020

Repo for my work on the MeLi Data Challenge 2020 (4th place on the public leaderboard). Requires ~16GB RAM to run.

Required packages can be found in requirements.txt

The main file to run is read_input.py. It checks for the presence of a few files:

"ratio files": These correspond to feature extraction. They are mostly ratios answering the question:

If a domain_id/item_id/category_id was viewed in the history object, how likely is it to be the domain_id/item_id/category_id of the item that was actually bought?
We also use a FastText model to extract features corresponding to the likelihood of the item id belonging to the Spanish/Portuguese/English language, though these proved to be insignificant w.r.t. NDCG. To do this, we download a pre-trained FastText model.

Notably, these ratios can overfit on low-frequency items. When training the LGB and RNN models, we therefore recalculate the ratios as if the current purchase and its associated history were excluded from the training set.
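As a rough illustration of the leave-one-out idea, the sketch below computes a "viewed, then bought" ratio per item while excluding each session's own contribution. The function name and the `(viewed_item_ids, bought_item_id)` input layout are assumptions made for this example, not the repo's actual code or data format.

```python
from collections import Counter


def leave_one_out_ratios(sessions):
    """sessions: list of (viewed_item_ids, bought_item_id) pairs (hypothetical layout).

    Returns one dict per session mapping each viewed item to the fraction of
    OTHER sessions that viewed the item and also bought it.
    """
    viewed = Counter()             # sessions in which the item was viewed
    viewed_and_bought = Counter()  # sessions in which it was viewed and bought
    for views, bought in sessions:
        for item in set(views):
            viewed[item] += 1
            if item == bought:
                viewed_and_bought[item] += 1

    rows = []
    for views, bought in sessions:
        feats = {}
        for item in set(views):
            # Subtract this session's own contribution before taking the ratio.
            n_viewed = viewed[item] - 1
            n_hit = viewed_and_bought[item] - (1 if item == bought else 0)
            feats[item] = n_hit / n_viewed if n_viewed > 0 else 0.0
        rows.append(feats)
    return rows
```

Without the subtraction, an item seen in a single session that ended in its purchase would get a ratio of 1.0 for that very session, which leaks the label into the feature.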

"RNN_pred": We use an RNN model to predict the domain_id by looking at the features of the last 30 unique items viewed. We also use the SentenceTransformer package to extract a 512-dimensional embedding of the first two words.
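Selecting the last 30 unique items viewed could look like the sketch below. It assumes the history is an oldest-to-newest list of item ids and that a repeated item counts at its most recent view; `last_unique_items` is a hypothetical helper, not a function from the repo.

```python
def last_unique_items(viewed_ids, k=30):
    """Return up to k distinct item ids, ordered oldest to newest by last view."""
    seen = set()
    recent = []
    # Walk backwards so each item is kept at its most recent position.
    for item in reversed(viewed_ids):
        if item not in seen:
            seen.add(item)
            recent.append(item)
            if len(recent) == k:
                break
    recent.reverse()
    return recent
```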

"Light Gradient Boosting (lgb.pkl)": Similar to the RNN, we train an LGBMRanker to rank items. There is a special function that gathers the necessary input matrix with relevant features.

"Neural_Domain_Identifier": A neural net that uses features from observed domains in the history object to predict the domain of the purchased object. A domain may appear several times with different items, so we aggregate each feature per domain with max, min, mean and std. As extra features, we also use the output of another trained model (see domain_string_identifier.py), which predicts normalized domain probabilities given a title string.
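The per-domain aggregation can be sketched as follows. The `(domain_id, value)` input layout and the use of the population standard deviation are assumptions for this example; the repo may compute these statistics differently.

```python
from collections import defaultdict
from statistics import mean, pstdev


def domain_stats(history):
    """history: list of (domain_id, value) pairs, one per viewed item (hypothetical layout)."""
    by_domain = defaultdict(list)
    for domain, value in history:
        by_domain[domain].append(value)
    # Collapse the variable number of items per domain into four fixed features.
    return {
        domain: {
            "max": max(values),
            "min": min(values),
            "mean": mean(values),
            "std": pstdev(values),  # population std; 0.0 for a single item
        }
        for domain, values in by_domain.items()
    }
```

Aggregating this way gives the neural net a fixed-size input per domain regardless of how many items of that domain appear in the history.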

Our final model is hierarchical.

First, we rank the items. There are several rankings: recency, frequency, LGB model predictions, and so on.

The LightGradientBoosting model (lgb.pkl) was trained to rank items, and the RNN model does the same. We use a linear combination of rankings with hard-coded coefficients that performed well on the validation set. We also use "Neural_Domain_Identifier" to filter out items whose domain receives a low score. (WARNING: it's better to use the saved model weights or the predictions already saved as csv. If you retrain, the optimal cutoff might have to be chosen again w.r.t. validation.)

Finally, we use a directed graph to recommend items that were previously bought after viewing the same items as in this purchase's history. If that fails, we first rank by domain, then recommend the most popular items (100 * times_bought + 1 * times_searched).
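The graph lookup and popularity fallback might be sketched as below. The edge definition (viewed item -> bought item, weighted by co-occurrence count) and all function names are assumptions for this example, and the intermediate "rank by domain" step is omitted for brevity.

```python
from collections import Counter, defaultdict


def build_graph(sessions):
    """Directed edges viewed_item -> bought_item, weighted by how often they co-occur."""
    graph = defaultdict(Counter)
    for views, bought in sessions:
        for item in set(views):
            graph[item][bought] += 1
    return graph


def recommend(views, graph, times_bought, times_searched, k=10):
    """Follow graph edges from the viewed items; fall back to popularity if none exist."""
    votes = Counter()
    for item in set(views):
        votes.update(graph.get(item, {}))
    if votes:
        return sorted(votes, key=lambda i: (-votes[i], str(i)))[:k]

    # Fallback: popularity score from the text, 100 * times_bought + 1 * times_searched.
    def popularity(i):
        return 100 * times_bought.get(i, 0) + times_searched.get(i, 0)

    candidates = set(times_bought) | set(times_searched)
    return sorted(candidates, key=lambda i: (-popularity(i), str(i)))[:k]
```

The 100:1 weighting means a single purchase outweighs up to 99 searches, so the fallback favors items with proven sales over items that are merely browsed.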
