Random Forest Language Model Toolkit
Yi Su <[email protected]>
o Introduction
The Random Forest Language Model Toolkit is a C++ software package based on
the SRI LM Toolkit (http://www.speech.sri.com/projects/srilm/). We follow
the design philosophy and coding conventions of the SRI LM Toolkit and use
their low-level data structure classes as well as some higher level
procedural code to "stand on the shoulders of giants".
The Random Forest Language Model is a collection of randomized decision tree
language models with a proven record of good performance. See the following
papers for reference:
Peng Xu and Frederick Jelinek. Random forests in language modeling. In
Proceedings of EMNLP 2004, pages 325-332, Barcelona, Spain, 2004.
Association for Computational Linguistics.
Peng Xu. Random forests and the data sparseness problem in language modeling.
PhD thesis, Johns Hopkins University, 2005.
Yi Su, Frederick Jelinek, and Sanjeev Khudanpur. Large-scale random forest
language models for speech recognition. In Proceedings of INTERSPEECH-2007,
volume 1, pages 598-601, Antwerp, Belgium, 2007.
Yi Su. Knowledge integration into language models: a random forest approach.
PhD thesis, Johns Hopkins University, 2009.
o Contents
You should find the following files and directories after unpacking:
INSTALL # installation guide
LICENSE # legal stuff
README # this file you are reading
/bin
/doc # generated Doxygen docs
/obj
/scripts # bash scripts
setup.pl # script to set up working directories
/src # C++ source code
See INSTALL for the installation guide.
o Tutorial/Recipe
I think the best way to get started is to walk you through a typical use of
this toolkit. I assume that you have a working knowledge of the SRI LM
Toolkit. The scripts under the /scripts directory are readily usable if you
use Sun Grid Engine (SGE) as your parallel computing environment, and it
should not be hard to adapt them to other setups.
Scripts under the /scripts directory:
functs.sh # all bash function definitions; meat of the whole thing
go.sh # an example of the recipe
rf_ppl.sh # average probabilities and compute perplexity
rf_train.sh # build trees and test them
run_rf.sh # start an experiment run by submitting jobs to SGE
txt2ngram.sh # helper script to convert a text file into counts properly
vars.sh # path names and parameters
Now the RECIPE:
0. Set up the directory structure used by the scripts.
$ export WORK=<your working directory for rflm>
$ cd $SRILM/rflm
$ ./setup.pl $WORK $SRILM i686
Note that the third argument to ./setup.pl is the machine type you used
to install the SRI LM Toolkit.
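For orientation, the rest of this recipe assumes $WORK ends up with
subdirectories along these lines (inferred from the paths used below, so
treat this as a sketch rather than the authoritative output of setup.pl):
  $WORK/data      # training/heldout/test text and n-gram counts
  $WORK/dicts     # vocabularies
  $WORK/models    # baseline n-gram models
  $WORK/exps      # experiment runs and results
  $WORK/scripts   # the bash scripts listed above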
1. Clean up your data and divide it into training, heldout and testing sets.
For example, I preprocess the WSJ portion of the Penn Treebank by lowercasing
words, removing punctuation and collapsing numbers into the ``N'' token. Then
I use sections 00-20 for training, 21-22 for heldout and 23-24 for testing.
$WORK/data/upenn/text.00-20
$WORK/data/upenn/text.21-22
$WORK/data/upenn/text.23-24
The first line of text.00-20 looks like (here it comes, mr. pierre
vinken!):
<s> pierre vinken N years old will join the board as a nonexecutive
director nov. N </s>
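For illustration only, a crude lowercasing and number-collapsing pass could
look like the sketch below; the input name raw.00-20 is hypothetical, and the
real preprocessing is evidently more careful (note that the abbreviation
"nov." keeps its period in the example above):
$ tr '[:upper:]' '[:lower:]' < raw.00-20 \
    | sed 's|[0-9][0-9,./-]*|N|g' \
    > data/upenn/text.00-20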
2. Generate counts from text using the script txt2ngram.sh, which calls
SRI LM Toolkit's ngram-count to do the job.
$ cd $WORK
$ ./scripts/txt2ngram.sh -o 3 data/upenn/text.00-20 \
data/upenn/counts/counts.00-20.3gm.gz
$ ./scripts/txt2ngram.sh -o 3 data/upenn/text.21-22 \
data/upenn/counts/counts.21-22.3gm.gz
$ ./scripts/txt2ngram.sh -o 3 -t data/upenn/text.23-24 \
data/upenn/counts/counts.23-24.3gm.gz
Note that you should use the "-t" option to generate your test counts.
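If you are curious what the helper does, txt2ngram.sh presumably wraps the
SRI LM Toolkit's ngram-count along these lines (a sketch of standard usage,
not the script's exact contents; see the script itself for the real options
and for what -t changes when preparing test counts):
$ ngram-count -order 3 -text data/upenn/text.00-20 \
    -sort -write data/upenn/counts/counts.00-20.3gm.gz
The SRI LM tools read and write .gz files transparently, so the gzipped
output name works as-is.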
3. Get your vocabulary and put it under $WORK/dicts.
For example, I use the 10K most frequent words as my vocabulary.
$WORK/dicts/upenn.10k.vocab
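One way to build such a list from unigram counts with ngram-count and
standard Unix tools (my sketch, following the 10K cutoff above; note that it
will also pick up frequent non-word tokens such as <s> and </s>):
$ ngram-count -order 1 -text data/upenn/text.00-20 -write - \
    | sort -k2,2nr | head -10000 | cut -f1 \
    > dicts/upenn.10k.vocab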
4. Train a Kneser-Ney-smoothed 3gram LM for backoff and bailout.
Note that it is OK to include the heldout data here; doing so gives a
slightly lower perplexity.
$ cd $WORK
$ cat data/upenn/text.00-20 data/upenn/text.21-22 > data/upenn/text.00-22
$ ./scripts/txt2ngram.sh -o 3 data/upenn/text.00-22 \
data/upenn/counts/counts.00-22.3gm.gz
Now use a bash script to call the function trainLM in functs.sh
(see go.sh):
mkdir -p $root/models/upenn/kn.3gm
order=3
vocab=$root/dicts/upenn.10k.vocab
train=$root/data/upenn/counts/counts.00-22.3gm.gz
dist=$root/models/upenn/kn.3gm/discount
lm=$root/models/upenn/kn.3gm/lm.kn.3gm.gz
trainLM $order $vocab $train $dist $lm
exit 1;
Note that you should also find files named discount.[123] under the model
directory $WORK/models/upenn/kn.3gm. They will be used in later steps.
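For orientation, trainLM presumably wraps an ngram-count invocation roughly
like the one below, with the discount.[123] files holding the estimated
Kneser-Ney discount parameters per n-gram order. This is my reading of
standard SRI LM Toolkit usage, not the function's verbatim body (see
functs.sh for that):
$ ngram-count -order 3 -vocab dicts/upenn.10k.vocab \
    -read data/upenn/counts/counts.00-22.3gm.gz \
    -kndiscount -interpolate \
    -kn1 models/upenn/kn.3gm/discount.1 \
    -kn2 models/upenn/kn.3gm/discount.2 \
    -kn3 models/upenn/kn.3gm/discount.3 \
    -lm models/upenn/kn.3gm/lm.kn.3gm.gz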
5. Train and test a random forest language model.
$ ./scripts/run_rf.sh baseline
Note that this script uses SGE for parallelization.
6. Check your result.
$ cat exps/upenn/baseline/ppl
file /home/yisu/Work/rflm-dev/data/upenn/counts/counts.23-24.3gm.gz:
3761 sentences, 78669 words, 0 OOVs
0 zeroprobs, logprob= -174163 ppl= 129.674 ppl1= 163.631
Note that the aggregated probabilities on the test ngrams are packed into an
ARPA-format LM in $WORK/exps/upenn/baseline/lm.gz .
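As a sanity check, you can evaluate that ARPA LM directly with the SRI LM
Toolkit's ngram tool against the test counts (a usage sketch; ngram's
-counts option computes a -ppl-style result from n-gram counts instead of
raw text):
$ ngram -order 3 -lm exps/upenn/baseline/lm.gz \
    -counts data/upenn/counts/counts.23-24.3gm.gz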
That's it for a simple tutorial/recipe. Now you should start looking at those
scripts and source code to get your hands dirty.
o Citing
If you use this toolkit in research, I would appreciate it if you could cite
it in a footnote or a bibliography entry like the following:
Yi Su. Random Forest Language Model Toolkit, http://www.clsp.jhu.edu/~yisu/rflm.html.
o Copyright
The Random Forest Language Model Toolkit is subject to the SRILM Community
Research License Version 1.0 (the "License"); you may not use the files in
this toolkit except in compliance with the License. A copy of the License is
included in the root directory. Software distributed under the License is
distributed on an "AS IS" basis, WITHOUT WARRANTY OF ANY KIND, either express
or implied. See the License for the specific language governing rights and
limitations under the License. This software is Copyright (c) SRI
International, 1995-2008 and Copyright (c) Yi Su, 2005-2009. All rights
reserved.