Anserini Experiments on ClueWeb09 (Category B)

Indexing:

nohup sh target/appassembler/bin/IndexCollection -collection CW09Collection \
 -input /path/to/cw09/ClueWeb09_English_1/ -generator JsoupGenerator \
 -index lucene-index.cw09b.pos+docvectors -threads 32 -positions -docvectors -optimize \
 > log.cw09b.pos+docvectors &

The directory /path/to/cw09/ClueWeb09_English_1 should be the root directory of the ClueWeb09B collection; that is, ls /path/to/cw09/ClueWeb09_English_1 should list the subdirectories en0000 through enwp03. The command above builds an index that stores term positions (-positions) and document vectors for relevance feedback (-docvectors); -optimize force-merges all index segments into one.
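Since indexing runs for a long time in the background, it can help to check the input path before launching it. The following is a minimal sketch, not part of Anserini: check_cw09b_root is a hypothetical helper that only tests for the first and last subdirectories named above, so it does not confirm the collection is complete.

```shell
# Hypothetical helper (not part of Anserini): return 0 if the directory
# contains both en0000 and enwp03, the first and last subdirectories
# mentioned above. Subdirectories in between are not checked.
check_cw09b_root() {
  root="$1"
  for sub in en0000 enwp03; do
    if [ ! -d "$root/$sub" ]; then
      echo "missing $root/$sub" >&2
      return 1
    fi
  done
}
```

For example, check_cw09b_root /path/to/cw09/ClueWeb09_English_1 && echo ok prints ok only when both subdirectories exist.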

After indexing is done, you should be able to perform a retrieval as follows:

sh target/appassembler/bin/SearchWebCollection \
  -topicreader Webxml -index lucene-index.cw09b.pos+docvectors -bm25 \
  -topics src/main/resources/topics-and-qrels/topics.web.51-100.txt -output run.web.51-100.bm25.txt

To select the retrieval model, specify -bm25 for BM25 or -ql for query likelihood; add -rm3 to apply the RM3 relevance feedback model, which requires an index built with -docvectors.
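The four model combinations described above can be scripted. The sketch below is one possible wrapper, not something shipped with Anserini: model_flags and run_all are hypothetical names, the output file names for the non-BM25 runs simply extend the run.web.51-100.bm25.txt pattern, and topic files for ranges other than 51-100 are assumed to follow the same naming as topics.web.51-100.txt.

```shell
# Hypothetical helper: map a run label to the retrieval flags described above.
model_flags() {
  case "$1" in
    bm25)     echo "-bm25" ;;
    bm25+rm3) echo "-bm25 -rm3" ;;
    ql)       echo "-ql" ;;
    ql+rm3)   echo "-ql -rm3" ;;
    *)        echo "unknown model: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: run all four configurations for one topic range
# (default 51-100). Run from the Anserini root after indexing has finished.
run_all() {
  range="${1:-51-100}"
  for model in bm25 bm25+rm3 ql ql+rm3; do
    # $(model_flags ...) is deliberately unquoted so "-bm25 -rm3"
    # is passed as two separate arguments.
    sh target/appassembler/bin/SearchWebCollection \
      -topicreader Webxml -index lucene-index.cw09b.pos+docvectors \
      $(model_flags "$model") \
      -topics "src/main/resources/topics-and-qrels/topics.web.${range}.txt" \
      -output "run.web.${range}.${model}.txt"
  done
}
```

Calling run_all with no argument would write run.web.51-100.bm25.txt, run.web.51-100.bm25+rm3.txt, run.web.51-100.ql.txt, and run.web.51-100.ql+rm3.txt.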

Topics and qrels are stored in src/main/resources/topics-and-qrels/. Use trec_eval to compute AP and P30, and use gdeval to compute NDCG@20:

eval/trec_eval.9.0/trec_eval src/main/resources/topics-and-qrels/qrels.web.51-100.txt run.web.51-100.bm25.txt
eval/gdeval.pl src/main/resources/topics-and-qrels/qrels.web.51-100.txt run.web.51-100.bm25.txt
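Both tools print many lines, so pulling out just the reported numbers can be convenient. The filters below are a sketch under two assumptions: that trec_eval prints whitespace-separated rows of the form measure, topic (all for the summary), value, with AP and P30 named map and P_30; and that gdeval.pl prints comma-separated rows (runid, topic, ndcg@20, err@20) ending in an amean row. The function names trec_metric and gdeval_ndcg are hypothetical.

```shell
# Hypothetical filter: print one summary measure from trec_eval output on stdin.
# Assumes rows of the form: <measure> <topic|all> <value>.
trec_metric() {
  awk -v m="$1" '$1 == m && $2 == "all" { print $3 }'
}

# Hypothetical filter: print mean NDCG@20 from gdeval.pl output on stdin.
# Assumes comma-separated rows whose second field is "amean" for the mean.
gdeval_ndcg() {
  awk -F, '$2 == "amean" { print $3 }'
}
```

For example, piping the trec_eval command above into trec_metric map would print only the AP value, and piping the gdeval.pl command into gdeval_ndcg would print only the mean NDCG@20.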

You should be able to replicate the following results:

| AP | BM25 | BM25+RM3 | QL | QL+RM3 |
|:---|---|---|---|---|
| TREC 2010 Web Track: Topics 51-100 | 0.1091 | 0.1065 | 0.1026 | 0.1056 |
| TREC 2011 Web Track: Topics 101-150 | 0.1095 | 0.1140 | 0.0972 | 0.1021 |
| TREC 2012 Web Track: Topics 151-200 | 0.1072 | 0.1336 | 0.1035 | 0.1120 |

| P30 | BM25 | BM25+RM3 | QL | QL+RM3 |
|:---|---|---|---|---|
| TREC 2010 Web Track: Topics 51-100 | 0.2667 | 0.2583 | 0.2403 | 0.2528 |
| TREC 2011 Web Track: Topics 101-150 | 0.2540 | 0.2627 | 0.2220 | 0.2267 |
| TREC 2012 Web Track: Topics 151-200 | 0.2187 | 0.2313 | 0.2027 | 0.2007 |

| NDCG@20 | BM25 | BM25+RM3 | QL | QL+RM3 |
|:---|---|---|---|---|
| TREC 2010 Web Track: Topics 51-100 | 0.1320 | 0.1443 | 0.1131 | 0.1307 |
| TREC 2011 Web Track: Topics 101-150 | 0.1915 | 0.1920 | 0.1633 | 0.1670 |
| TREC 2012 Web Track: Topics 151-200 | 0.0977 | 0.1309 | 0.0862 | 0.1027 |