add a class to index multiple documents #12

louisabraham · 2020-02-27T13:13:13Z

See my proposal and code in debatem1/pydivsufsort#4

grantjenks · 2021-12-05T22:20:00Z

FWIW, I hadn't seen the `get_document_index` API, but I implemented something similar to the bisect-based approach.

In my case, I have a few hundred thousand documents, each with a few thousand tokens. After a learning pass that converts the tokens to integers, an additional integer (a document token) is created for each document. The array of integers is then constructed as "doc_token1,num1,num2,...,doc_token2,num3,num4,..." and fed to the suffix array and longest common prefix functions. A second linear pass stores the index of each document token so that it can be bisected. It's hard to keep track of all the indexes, prefixes, suffixes, and whatnot, but it's fast!
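A minimal sketch of that layout, for illustration only and not the actual code described above. The helper names `build_corpus` and `document_of` are hypothetical, the two passes are folded into one function, and the naive `sorted`-based suffix array is just a stand-in for whatever suffix array function is really used:

```python
from bisect import bisect_right


def build_corpus(docs):
    """Concatenate tokenized documents, prefixing each with a unique integer."""
    # "Learning pass": map every distinct token to a small integer id.
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))

    n_tokens = len(vocab)
    corpus, doc_starts = [], []
    for doc_id, doc in enumerate(docs):
        # Remember where this document's marker sits in the flat array.
        doc_starts.append(len(corpus))
        # Document marker: offset past the vocabulary so it never collides
        # with a real token id and never matches across documents.
        corpus.append(n_tokens + doc_id)
        corpus.extend(vocab[tok] for tok in doc)
    return corpus, doc_starts


def document_of(position, doc_starts):
    """Return the index of the document containing `position` in the corpus."""
    return bisect_right(doc_starts, position) - 1


docs = [["a", "b", "c"], ["b", "c", "d"], ["c", "d"]]
corpus, doc_starts = build_corpus(docs)

# Stand-in suffix array (quadratic, toy inputs only); a real library call
# would replace this line.
suffix_array = sorted(range(len(corpus)), key=lambda i: corpus[i:])

# Map every suffix back to the document it starts in.
suffix_docs = [document_of(pos, doc_starts) for pos in suffix_array]
```

Because `doc_starts` is built in increasing order, each lookup is a single `bisect_right` call, so mapping a suffix position to its document costs O(log D) for D documents.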

louisabraham · 2022-03-30T10:45:08Z

Very interesting, don't hesitate to share your code!
