This is the implementation of a paper accepted at COLING 2016.
- For text classification and information retrieval tasks, text data has to be represented as a fixed-dimension vector.
- We propose a simple feature construction technique named Graded Weighted Bag of Word Vectors (GWBoWV).
- We demonstrate our method through experiments on multi-class classification on the 20Newsgroup dataset and multi-label text classification on the Reuters-21578 dataset.
There are two folders, 20news and Reuters, which contain the code for multi-class classification on the 20Newsgroup dataset and multi-label classification on the Reuters-21578 dataset, respectively.
Change directory to 20news to experiment on the 20Newsgroup dataset, and create the train and test TSV files as follows:
$ cd 20news
$ python create_tsv.py
Get word vectors for all words in the vocabulary:
$ python Word2Vec.py 200
# Word2Vec.py takes the word vector dimension as an argument. We used 200.
Get Graded Weighted Bag of Word Vectors (GWBoWV) for the documents in the train and test sets, along with the prediction accuracy on the test set:
$ python gwbowv.py 200 60
# gwbowv.py takes the word vector dimension and the number of clusters as arguments. We used a word vector dimension of 200 and 60 clusters.
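The two arguments above determine the size of each document vector. The following is a minimal sketch, not the code in gwbowv.py. It assumes that each vocabulary word has a hard cluster assignment and an idf weight, that word vectors are summed per cluster with idf weighting, and that the per-cluster idf sums are appended at the end. The function name `gwbowv_document_vector` and all of its parameters are hypothetical.

```python
import numpy as np


def gwbowv_document_vector(tokens, word_vectors, word_cluster, word_idf, num_clusters):
    """Build one fixed-length document vector (hypothetical helper).

    tokens       : list of words in the document
    word_vectors : dict word -> sequence of floats (all the same length)
    word_cluster : dict word -> cluster index in [0, num_clusters)
    word_idf     : dict word -> idf weight
    """
    dim = len(next(iter(word_vectors.values())))
    cluster_vecs = np.zeros((num_clusters, dim))
    cluster_idf = np.zeros(num_clusters)
    for tok in tokens:
        if tok not in word_vectors:
            continue  # skip out-of-vocabulary words
        k = word_cluster[tok]
        # Assumed weighting: add the idf-weighted word vector to its cluster slot.
        cluster_vecs[k] += word_idf[tok] * np.asarray(word_vectors[tok], dtype=float)
        cluster_idf[k] += word_idf[tok]
    # Concatenate all cluster vectors, then the per-cluster idf sums.
    return np.concatenate([cluster_vecs.ravel(), cluster_idf])


# Toy usage: 2-dimensional word vectors and 2 clusters.
toy_vectors = {"ball": [1.0, 0.0], "vote": [0.0, 1.0]}
toy_clusters = {"ball": 0, "vote": 1}
toy_idf = {"ball": 2.0, "vote": 0.5}
doc_vec = gwbowv_document_vector(
    ["ball", "vote", "ball", "unknown"], toy_vectors, toy_clusters, toy_idf, 2
)
print(doc_vec.shape)
```

Under these assumptions, a word vector dimension of 200 with 60 clusters would give document vectors of length 60 * 200 + 60 = 12060.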
Change directory to Reuters to experiment on the Reuters-21578 dataset. Because the Reuters data is in SGML format, parse it and create a pickle file of the parsed data as follows:
$ python create_data.py
# We don't save train and test files locally. We split data into train and test whenever needed.
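Splitting on demand, as described above, could look like the following sketch. It assumes that each parsed record carries a field marking it as a training or test document. The actual layout of the pickle file written by create_data.py is not documented here, so the `split`, `topics`, and `text` keys and the `split_records` helper are hypothetical.

```python
import pickle


def split_records(records):
    """Split parsed records into train and test lists by their 'split' field."""
    train = [r for r in records if r["split"] == "TRAIN"]
    test = [r for r in records if r["split"] == "TEST"]
    return train, test


# Hypothetical parsed records; the real pickle layout may differ.
records = [
    {"split": "TRAIN", "topics": ["earn"], "text": "profit rose this quarter"},
    {"split": "TRAIN", "topics": ["grain", "wheat"], "text": "wheat exports fell"},
    {"split": "TEST", "topics": ["earn"], "text": "net income was flat"},
]

# Round-trip through pickle in memory to mimic loading the parsed file.
loaded = pickle.loads(pickle.dumps(records))
train, test = split_records(loaded)
print(len(train), len(test))
```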
Get word vectors for all words in the vocabulary:
$ python Word2Vec.py 200
# Word2Vec.py takes the word vector dimension as an argument. We used 200.
Get Graded Weighted Bag of Word Vectors (GWBoWV) for the documents in the train and test sets:
$ python gwBoWV.py 200 60
# gwBoWV.py takes the word vector dimension and the number of clusters as arguments. We used a word vector dimension of 200 and 60 clusters.
Get performance metrics on test set:
$ python metrics.py 200 60
# metrics.py takes word vector dimension and number of clusters as arguments. We took word vector dimension as 200 and number of clusters as 60.
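The metrics reported by metrics.py are not listed here. As an illustration of how multi-label predictions are commonly scored, the sketch below computes micro-averaged precision, recall, and F1 from binary label-indicator matrices. The choice of micro-averaging and the function name `micro_prf` are assumptions, not a description of metrics.py.

```python
import numpy as np


def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall and F1 for binary indicator matrices.

    Rows are documents, columns are labels; entries are 0 or 1.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    # Count over every (document, label) pair at once.
    tp = float(np.sum(y_true & y_pred))
    fp = float(np.sum(~y_true & y_pred))
    fn = float(np.sum(y_true & ~y_pred))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy usage: 2 documents, 3 labels.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
print(micro_prf(y_true, y_pred))
```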
Minimum requirements:
- Python 2.7+
- NumPy 1.8+
- Scikit-learn
- Pandas
- Gensim
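A quick way to confirm that the packages listed above can be imported is sketched below. The helper `find_missing` is hypothetical and only checks importability, not version numbers.

```python
import importlib


def find_missing(modules):
    """Return the module names that are missing or cannot be imported."""
    missing = []
    for name in modules:
        try:
            importlib.import_module(name)
        except Exception:
            # Covers both absent packages and broken installations.
            missing.append(name)
    return missing


# Import names for NumPy, Scikit-learn, Pandas and Gensim.
required = ["numpy", "sklearn", "pandas", "gensim"]
missing = find_missing(required)
print("Missing packages: %s" % (", ".join(missing) if missing else "none"))
```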
For the theory and an explanation of GWBoWV, please see https://aclweb.org/anthology/C/C16/C16-1052.pdf.
If you use the code, please cite this paper:
Vivek Gupta, Harish Karnick, Ashendra Bansal, Pradhuman Jhala, "Product Classification in E-Commerce using Distributional Semantics", in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 536–546, Osaka, Japan, December 11-17 2016.
@inproceedings{Gupta2016ProductCI,
  title={Product Classification in E-Commerce using Distributional Semantics},
  author={Vivek Gupta and Harish Karnick and Ashendra Bansal and Pradhuman Jhala},
  booktitle={COLING},
  year={2016}
}
Note: You need not download the 20Newsgroup or Reuters-21578 datasets. Both datasets are included in their respective directories.