Term Query is not tokenized (?) #296
Comments
Hi @afbarbaro 😄 This is the old problem of how stemmers don't always do what we think they do. Here's a small program I made to test queries quickly:
from tantivy import SchemaBuilder, Index, Document
schema = SchemaBuilder().add_text_field("body", stored=True, tokenizer_name="en_stem").build()
index = Index(schema=schema, path=None)
writer = index.writer(heap_size=15_000_000, num_threads=1)
doc = Document()
doc.add_text("body", "lovely freshness older")
writer.add_document(doc)
doc = Document()
doc.add_text("body", "Australian monarchy")
writer.add_document(doc)
doc = Document()
doc.add_text("body", "Titanic sinks")
writer.add_document(doc)
writer.commit()
index.reload()
searcher = index.searcher()
def find(query_text: str):
    q = index.parse_query(query_text, ["body"])
    if hits := searcher.search(q).hits:
        for _, doc_address in hits:
            doc = searcher.doc(doc_address)
            print(f"{query_text} hit: {doc['body']}")
    else:
        print(f"{query_text} not found")
# Run with `python -i main.py`
The idea is to run it with `python -i main.py`. Indeed, as you saw:
$ python -i main.py
>>> find('monarch')
monarch not found
>>> find('monarchs')
monarchs not found
>>> find('monarchy')
monarchy hit: ['Australian monarchy']
However, you'll see that if we search for `Australians`, we do get a hit:
>>> find('Australians')
Australians hit: ['Australian monarchy']
This means that tantivy-py is indeed stemming the query text. I checked the code and the query parser does make use of the tokenizers registered on the fields. Stemming can be very surprising, for example:
>>> find('monarchies')
monarchies hit: ['Australian monarchy']
Did you expect […]? In my test I do get a hit using `Titan`:
>>> find('Titan')
Titan hit: ['Titanic sinks']
>>> find('Titans')
Titans hit: ['Titanic sinks']
>>> find('Titanics')
Titanics hit: ['Titanic sinks']
>>> find('titanics')
titanics hit: ['Titanic sinks']
Sometimes stemming can be very frustrating. For example, in the first document you would think `old` would match `older`, but it does not:
>>> find('old')
old not found
>>> find('older')
older hit: ['lovely freshness older']
Even `oldest` does not hit:
>>> find('oldest')
oldest not found
This is because the stem for […] |
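To make the session above easier to reason about, here is a minimal sketch of why some variants hit and others do not: a query word only matches when it reduces to exactly the same stem as an indexed word. The `toy_stem` function below is a hypothetical three-rule stemmer written for illustration; it is not the Snowball algorithm behind `en_stem`, and it only aims to mimic the `monarch*`, `Australian*` and `old*` results shown above.

```python
def toy_stem(word: str) -> str:
    """Hypothetical three-rule stemmer (an assumption, NOT the real en_stem)."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "i"              # monarchies -> monarchi
    if w.endswith("s") and not w.endswith("ss"):
        w = w[:-1]                       # australians -> australian
    if w.endswith("y") and len(w) > 2 and w[-2] not in "aeiou":
        w = w[:-1] + "i"                 # monarchy -> monarchi
    return w                             # no rule strips -er or -est


# Stems stored at index time for two of the documents in the example.
indexed = {toy_stem(t) for t in "lovely freshness older Australian monarchy".split()}


def would_hit(query_word: str) -> bool:
    # A hit requires the query stem to equal an indexed stem exactly.
    return toy_stem(query_word) in indexed


for q in ["monarch", "monarchs", "monarchy", "monarchies",
          "Australians", "old", "older", "oldest"]:
    print(f"{q!r:14} stem={toy_stem(q)!r:14} hit={would_hit(q)}")
```

Under these toy rules `monarchy` and `monarchies` share a stem that `monarch` does not, and `old`, `older` and `oldest` all stay distinct because nothing strips comparative suffixes.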
@cjrh thanks so much for the detailed explanation. I now see the difference between your example and what I was doing: you're using `index.parse_query`, whereas I was constructing the query myself:
subqueries = [(Occur.Should, Query.term_query(index.schema, 'body', term)) for term in terms]
query = tantivy.Query.boolean_query(subqueries)
So I see that the stemming is indeed applied to the query text when using `index.parse_query`. One of the reasons I was constructing the query myself is that in the Python version of […]
So I guess my question now is: […]
Again, thank you for your help! |
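The difference discussed here can be sketched with a toy inverted index. This is only an illustration, assuming (as the behaviour in this thread suggests) that a term query looks up the term exactly as given while a parsed query first runs the text through the field's analyzer. `ToyIndex`, `analyze`, `term_lookup` and `parsed_lookup` are hypothetical names, not tantivy-py APIs, and the analyzer is a stand-in for `en_stem`.

```python
from collections import defaultdict


def analyze(text: str) -> list[str]:
    """Hypothetical analyzer: lowercase, split on whitespace, drop one trailing 's'."""
    tokens = []
    for raw in text.lower().split():
        if raw.endswith("s") and not raw.endswith("ss"):
            raw = raw[:-1]
        tokens.append(raw)
    return tokens


class ToyIndex:
    def __init__(self) -> None:
        self.docs: list[str] = []
        self.postings: dict[str, set[int]] = defaultdict(set)

    def add(self, text: str) -> None:
        doc_id = len(self.docs)
        self.docs.append(text)
        for token in analyze(text):          # analysis happens at index time
            self.postings[token].add(doc_id)

    def term_lookup(self, term: str) -> list[int]:
        # Like a hand-built term query: the term is used verbatim.
        return sorted(self.postings.get(term, set()))

    def parsed_lookup(self, text: str) -> list[int]:
        # Like a parsed query: the text goes through the same analyzer first.
        hits: set[int] = set()
        for token in analyze(text):
            hits |= self.postings.get(token, set())
        return sorted(hits)


index = ToyIndex()
index.add("Australian monarchy")
index.add("Titanic sinks")

print(index.term_lookup("Australians"))    # raw term, not in the postings
print(index.parsed_lookup("Australians"))  # analyzed to 'australian' first
print(index.term_lookup("australian"))     # already in analyzed form
```

In this model a hand-built term query only matches if the caller supplies the term in its already-analyzed form, which is why access to the tokenizer matters when constructing queries manually.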
Also, for this:
There is a hack that I can use at least temporarily to get the stemmed terms by running |
Ah I see. To answer your question, this is not by design. Tantivy-py has been built by a fairly large number of volunteers and drive-by contributors over the years, so there is relatively little that is specifically "designed". Earlier on we wanted to avoid adding all the fine-grained query classes (and other classes) and just have the […] I think more people will come across this now that we have the […] |
Maybe. It wraps |
Thanks @cjrh. The reason I didn't see these additional args is that the Python bindings are outdated: tantivy-py/tantivy/tantivy.pyi Line 364 in e3de7b1
Are these created by hand or by some process? Maybe fixing this is a simple PR I can contribute. |
Yes please, it would be great if you could update those type annotations. |
Yes, they are created by hand. It would be a simple PR. |
I'm testing `tantivy-py`, which I'm finding pretty great. However, I bumped into what seems to be an issue with the Python package: it seems that term queries are not tokenized when using the `searcher.search(query, ..)` method, so I can't really use the `en_stem` tokenizer (since it's not exposed for me to tokenize the query, only the indexing of documents).
I'm testing `tantivy-py` with the Simple Wikipedia Example Set from Cohere and here's what I see with a few sample queries: […]
Is this a "feature" or a "bug"? I don't mind tokenizing the query myself before calling the `search` method, but tokenizers are not exposed in the Python bindings.
Any suggestions?