Please cite the following two papers if you are using this tool. Thanks!
- Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R. Voss, Jiawei Han, "Automated Phrase Mining from Massive Text Corpora", submitted to SIGKDD 2017, under review. arXiv:1702.04457 [cs.CL]
- Jialu Liu*, Jingbo Shang*, Chi Wang, Xiang Ren and Jiawei Han, "Mining Quality Phrases from Massive Text Corpora", Proc. of 2015 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'15), Melbourne, Australia, May 2015. (* equally contributed, slides)
The original version is shangjingbo1226/AutoPhrase.
This fork is mainly designed for SparseTP, a topic modeling tool for phrases, to be published in the 29th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'17).
- Efficient Topic Modeling on Phrases via Sparsity, Weijing Huang, Wei Chen, Tengjiao Wang and Shibo Tao, Proceedings of the 29th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'17), Boston, USA, Nov 2017. (slides)
The modifications in this fork are threefold.
- We provide a portal script runAutoPhrase.sh that processes the raw input file and produces the final result file input_forTopicModel.txt, which is used as the input of SparseTP.
- We add filter.py to remove low-quality phrases (e.g., score < 0.5) and produce the high-quality phrase file results/filtered_phrases.txt; a sketch of this step appears after this list. With the high-quality phrases, we update src/segment.cpp to segment the raw input file. Finally, we add prepare_for_topicmodeling.py to produce the result file input_forTopicModel.txt, where each line has the format word_1,word_2,word_3,...,word_n,phrase_1,...,phrase_m\n and represents a single document in the corpus.
- We add several running examples to provide a "one click" way to quickly learn how to use this tool. Running example 1 processes the 20newsgroups dataset; running example 2 processes the Wikipedia articles under the Mathematics category (the JSON data are available at Dropbox); running examples 3 and 4 are designed for Chemistry (available JSON data) and Argentina (available JSON data).
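As an illustration of the filtering step, here is a minimal Python sketch. It assumes AutoPhrase's ranked phrase list is a text file with one score<TAB>phrase pair per line; the input file name AutoPhrase.txt is a placeholder, and only the 0.5 threshold and output path come from the description above. filter.py remains the authoritative implementation.

```python
# Minimal sketch of the filtering step; see filter.py for the real script.
# Assumes each line of AutoPhrase's ranked list reads "<score>\t<phrase>".
THRESHOLD = 0.5  # phrases scoring below this are considered low quality

with open('AutoPhrase.txt', encoding='utf-8') as fin, \
     open('results/filtered_phrases.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        score, phrase = line.rstrip('\n').split('\t', 1)
        if float(score) >= THRESHOLD:  # keep only high-quality phrases
            fout.write(phrase + '\n')
```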
bash runAutoPhrase.sh $input_file
$input_file is the path of the input file, which contains the whole corpus, with each line representing a single document. The result file will be stored in results/input_forTopicModel.txt.
1. bash runningExample1.sh
After running on the 20newsgroups dataset, the result file can be found at results/input_forTopicModel.txt.
Alternatively, for a quick view without running it, the result can be downloaded from Dropbox.
Take one line of the result file as an example; it represents a document after phrase extraction: alt,introduction,april,version,introduction,atheism,mathew,...,read,article,mathew,version,pgp signed message,frequently asked questions,faq files,strong atheism,weak atheism,strong atheism,god exists,point of view,weak atheism,...,god exists,peer pressure,pgp signature,pgp signature
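For downstream use, here is a minimal sketch (not part of the repo) of loading such a result file into per-document token lists; the file path follows the description above, and treating whitespace-containing tokens as phrases is an assumption based on the example line.

```python
# Sketch: load results/input_forTopicModel.txt into per-document token lists.
# Each line is one document; tokens are comma-separated, and extracted
# phrases appear as multi-word tokens (assumption based on the sample line).
docs = []
with open('results/input_forTopicModel.txt', encoding='utf-8') as fin:
    for line in fin:
        tokens = [t for t in line.rstrip('\n').split(',') if t]
        docs.append(tokens)

phrases_in_first = sum(1 for t in docs[0] if ' ' in t)
print(len(docs), 'documents;', phrases_in_first, 'phrases in the first one')
```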
2. bash runningExample2.sh
After running on the Mathematics Wiki dataset, the result file can be found at results/input_forTopicModel.txt.
Alternatively, for a quick view without running it, the result can be downloaded from Dropbox.
Take one line of the result file as an example; it represents a document after phrase extraction: kohli,scientist,lab,cambridge,majority,research,field,machine,learning,vision,contributions,game,theory,psychometrics,picture,josh,semantic,paint,kinect,fusion,voxel,crf,inference,microsoft research,discrete algorithms,programming language,higher order,graphical models
3. bash runningExample3.sh
After running on the Chemistry Wiki dataset, the result file can be found at results/input_forTopicModel.txt.
Alternatively, for a quick view without running it, the result can be downloaded from Dropbox.
4. bash runningExample4.sh
After running on the Argentina Wiki dataset, the result file can be found at results/input_forTopicModel.txt.
Alternatively, for a quick view without running it, the result can be downloaded from Dropbox.
We tested runAutoPhrase.sh on a single machine with a 4-core 3.4GHz CPU and 24 GB RAM. To see how it behaves on a very large input file, we took all Wikipedia pages as input: 5,738,260 articles, 2,036,099,636 tokens, 10.67 GB in total. To fit within our limited memory, we split this big file into 5 smaller ones, each about 2.1 GB in size. We then ran AutoPhrase sequentially on these 5 split files, where each 2.1 GB file costs 24 GB of memory. After 12.5 hours, we obtained the processed result for the Wikipedia pages.
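A minimal sketch of the splitting step, assuming one document per line as described above; the input and output file names are placeholders, and the 5-way split matches the setup above.

```python
# Sketch: split a one-document-per-line corpus into 5 parts, streaming so
# the 10.67 GB file is never loaded into memory at once. File names are
# placeholders; each part can then be fed to runAutoPhrase.sh in turn.
N_PARTS = 5
TOTAL_DOCS = 5738260  # article count reported above
per_part = (TOTAL_DOCS + N_PARTS - 1) // N_PARTS

with open('wikipedia.txt', encoding='utf-8') as fin:
    for i in range(N_PARTS):
        with open('wikipedia_part%d.txt' % i, 'w', encoding='utf-8') as fout:
            for _ in range(per_part):
                line = fin.readline()
                if not line:
                    break
                fout.write(line)
```

Each part is then processed in turn, e.g., bash runAutoPhrase.sh wikipedia_part0.txt, and so on for the remaining parts.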
In short, we summarize the performance in the following table.
Setting | Input file size | Memory cost | Time cost |
---|---|---|---|
Directly | 2.1 GB | 24 GB | 2.5 hours |
Running on 5 split files sequentially | 10.67 GB | 24 GB | 12.5 hours |