Recommend scoring hits with BM25(k1=0.9,b=0.4).
Currently different engines use different parameters for BM25, e.g. Tantivy and
Lucene use (k1=1.2,b=0.75) while PISA uses (k1=0.9,b=0.4). Robertson et al. had
initially suggested that 1.2/0.75 would make good defaults for BM25 but Trotman
et al. later suggested that 0.9/0.4 would make better defaults and this seems
to be the consensus nowadays.
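To make the role of the two parameters concrete, here is a minimal sketch of a per-term BM25 score, assuming the common textbook formulation. The class and method names (`Bm25Sketch`, `idf`, `termScore`) and the example statistics are illustrative assumptions and are not taken from any of the engines; some engines also drop the constant `(k1 + 1)` factor, which does not change the ranking.

```java
// Hypothetical sketch, not code from Lucene, Tantivy or PISA.
final class Bm25Sketch {

    // IDF variant with "+1" inside the log so that it stays non-negative.
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Contribution of one query term to the score of one document.
    // k1 controls term-frequency saturation, b controls length normalization.
    static double termScore(double tf, double docLen, double avgDocLen,
                            double idf, double k1, double b) {
        double norm = k1 * (1.0 - b + b * docLen / avgDocLen);
        return idf * tf * (k1 + 1.0) / (tf + norm);
    }

    public static void main(String[] args) {
        // Made-up collection statistics, for illustration only.
        double idf = idf(1_000_000L, 1_000L);
        double avgDocLen = 400.0;
        double[][] params = {{1.2, 0.75}, {0.9, 0.4}};
        for (double[] p : params) {
            double shortDoc = termScore(3.0, 100.0, avgDocLen, idf, p[0], p[1]);
            double longDoc = termScore(3.0, 2000.0, avgDocLen, idf, p[0], p[1]);
            System.out.printf("k1=%.2f b=%.2f short=%.4f long=%.4f%n",
                    p[0], p[1], shortDoc, longDoc);
        }
    }
}
```

With a smaller `b`, document length has less influence on the score, and with a smaller `k1` repeated occurrences of a term saturate sooner, so the two parameter sets can order the same hits differently.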

The ranking function matters because it affects which hits may be skipped via
dynamic pruning, which in turn affects search performance.
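The link between scoring and pruning can be sketched as follows. This is a simplified illustration, assuming a global per-term upper bound of `idf * (k1 + 1)` derived from the textbook BM25 formula; the names `PruningSketch`, `maxTermScore` and `canSkip` are hypothetical, and real engines typically rely on tighter bounds (for example per block of postings).

```java
// Hypothetical sketch of a score-upper-bound check, not code from any engine.
final class PruningSketch {

    // tf / (tf + norm) is always below 1, so a term can never contribute
    // more than idf * (k1 + 1) under the textbook BM25 formula.
    static double maxTermScore(double idf, double k1) {
        return idf * (k1 + 1.0);
    }

    // A candidate can be skipped when even the sum of the upper bounds of the
    // terms it may match cannot beat the score of the current k-th best hit.
    static boolean canSkip(double[] termUpperBounds, double threshold) {
        double sum = 0.0;
        for (double upperBound : termUpperBounds) {
            sum += upperBound;
        }
        return sum <= threshold;
    }

    public static void main(String[] args) {
        double[] idfs = {2.0, 0.5}; // made-up values for a rare and a common term
        for (double k1 : new double[] {1.2, 0.9}) {
            double[] bounds = new double[idfs.length];
            for (int i = 0; i < idfs.length; i++) {
                bounds[i] = maxTermScore(idfs[i], k1);
            }
            System.out.println("k1=" + k1 + " canSkip=" + canSkip(bounds, 5.0));
        }
    }
}
```

Because both the upper bounds and the scores of the hits already collected depend on `k1` and `b`, engines configured with different parameters may skip different amounts of work on the same query.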

Closes quickwit-oss#45
jpountz committed Sep 25, 2023
1 parent 0ca432c commit dbee92c
Showing 6 changed files with 25 additions and 7 deletions.
15 changes: 13 additions & 2 deletions CONTRIBUTE.md
@@ -1,6 +1,6 @@
# Adding another engine

-Currently only tantivy and lucene are supported, but you can add another search
+Currently only Tantivy and Lucene are supported, but you can add another search
engine by creating a directory in the engines directory and add a `Makefile`
implementing the following commands :

@@ -20,7 +20,7 @@ Stemming should be disabled. Tokenization should be something reasonably close t

Starts a program that will get `tests` from stdin, and output
a result hit count as fast as possible. *If this is not your language's default,
-be sure to flush stdout after writing your answer".
+be sure to flush stdout after writing your answer*.

The tests consist in a command followed by a query.

@@ -39,6 +39,17 @@ Queries are expressed in the Lucene query language.

If a command is not supported, just print to stdout "UNSUPPORTED".

+# Recommendations for new engines
+
+Engines are recommended to follow the below guidelines:
+- Indexing is not measured and may be multi-threaded.
+- Engines may optimize for read-only access, e.g. by merging multiple segments
+down to a single one or performing document reordering.
+- Search operations must run in a single thread.
+- Hits must be ranked according to the
+[BM25](https://en.wikipedia.org/wiki/Okapi_BM25) ranking function with
+standard parameters `k1`=0.9 and `b`=0.4.
+- Result caches must be disabled.

# Adding tests

9 changes: 4 additions & 5 deletions README.md
@@ -29,8 +29,8 @@ The corpus used is the English wikipedia. Stemming is disabled. Queries have bee
from the [AOL query dataset](https://en.wikipedia.org/wiki/AOL_search_data_leak)
(but do not contain any personal information).

-Out of a random sample of query, we filtered queries that had at least two terms and yield at least 1 hit when searches as
-a phrase query.
+Out of a random sample of query, we filtered queries that had at least two terms and yield at least 1 hit when searched
+as a phrase query.

For each of these query, we then run them as :
- `intersection`
@@ -49,15 +49,14 @@ All tests are run once in order to make sure that
- Java's JIT already kicked in.

Test are run in a single thread.
-Out of 5 runs, we only retain the best score, so Garbage Collection likely does not matter.

+Out of 10 runs, we only retain the best score, so Garbage Collection likely does not matter.

## Engine specific detail

### Lucene

- Query cache is disabled.
-- GC should not influence the results as we pick the best out of 5 runs.
+- GC should not influence the results as we pick the best out of 10 runs.
- JVM used was openjdk 10.0.1 2018-04-17

### Tantivy
2 changes: 2 additions & 0 deletions engines/lucene-7.2.1/src/main/java/DoQuery.java
@@ -5,6 +5,7 @@
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
+import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.store.FSDirectory;

import java.io.BufferedReader;
@@ -19,6 +20,7 @@ public static void main(String[] args) throws IOException, ParseException {
try (IndexReader reader = DirectoryReader.open(FSDirectory.open(indexDir))) {
final IndexSearcher searcher = new IndexSearcher(reader);
searcher.setQueryCache(null);
+searcher.setSimilarity(new BM25Similarity(0.9f, 0.4f));
try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(System.in))) {
final QueryParser queryParser = new QueryParser("text", new StandardAnalyzer(CharArraySet.EMPTY_SET));
String line;
2 changes: 2 additions & 0 deletions engines/lucene-8.0.0/src/main/java/DoQuery.java
@@ -5,6 +5,7 @@
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
+import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.store.FSDirectory;

import java.io.BufferedReader;
@@ -19,6 +20,7 @@ public static void main(String[] args) throws IOException, ParseException {
try (IndexReader reader = DirectoryReader.open(FSDirectory.open(indexDir))) {
final IndexSearcher searcher = new IndexSearcher(reader);
searcher.setQueryCache(null);
+searcher.setSimilarity(new BM25Similarity(0.9f, 0.4f));
try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(System.in))) {
final QueryParser queryParser = new QueryParser("text", new StandardAnalyzer(CharArraySet.EMPTY_SET));
String line;
2 changes: 2 additions & 0 deletions engines/lucene-8.10.1/src/main/java/DoQuery.java
@@ -5,6 +5,7 @@
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
+import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.store.FSDirectory;

import java.io.BufferedReader;
@@ -19,6 +20,7 @@ public static void main(String[] args) throws IOException, ParseException {
try (IndexReader reader = DirectoryReader.open(FSDirectory.open(indexDir))) {
final IndexSearcher searcher = new IndexSearcher(reader);
searcher.setQueryCache(null);
+searcher.setSimilarity(new BM25Similarity(0.9f, 0.4f));
try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(System.in))) {
final QueryParser queryParser = new QueryParser("text", new StandardAnalyzer(CharArraySet.EMPTY_SET));
String line;
2 changes: 2 additions & 0 deletions engines/lucene-9.6.0/src/main/java/DoQuery.java
@@ -5,6 +5,7 @@
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
+import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.store.FSDirectory;

import java.io.BufferedReader;
@@ -19,6 +20,7 @@ public static void main(String[] args) throws IOException, ParseException {
try (IndexReader reader = DirectoryReader.open(FSDirectory.open(indexDir))) {
final IndexSearcher searcher = new IndexSearcher(reader);
searcher.setQueryCache(null);
+searcher.setSimilarity(new BM25Similarity(0.9f, 0.4f));
try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(System.in))) {
final QueryParser queryParser = new QueryParser("text", new StandardAnalyzer(CharArraySet.EMPTY_SET));
String line;