Fwiw, I didn't see the get_document_index API, but I implemented something similar to the bisect-based approach.
In my case, I have a few hundred thousand documents, each with a few thousand tokens. After a learning pass that converts the tokens to integers, an additional unique integer (a per-document sentinel, the "doc token") is created for each document. The array of integers is then constructed as "doc_token1,num1,num2,...,doc_token2,num3,num4,..." and fed to the suffix array and longest common prefix functions. A second linear pass stores the index of each doc token so that any position in the array can be bisected back to its document. It's hard to keep track of all the indexes, prefixes, and suffixes, but it's fast!
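A minimal sketch of that layout and the bisect lookup, using only the standard library (function and variable names here are my own illustrations, not the actual implementation; the suffix array / LCP step via pydivsufsort is omitted):

```python
import bisect

def build_corpus_array(docs):
    """Flatten tokenized documents into one integer array.

    Each document is preceded by a unique sentinel integer (the
    "additional integer per document" described above), so the array
    looks like: [S0, t, t, ..., S1, t, t, ...]. Regular token ids
    start above the sentinel range so they never collide.
    """
    n_docs = len(docs)
    vocab = {}
    corpus = []
    sentinel_positions = []  # index of each doc's sentinel, ascending
    for doc_id, tokens in enumerate(docs):
        sentinel_positions.append(len(corpus))
        corpus.append(doc_id)  # sentinels are 0..n_docs-1
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = n_docs + len(vocab)  # ids start after sentinels
            corpus.append(vocab[tok])
    return corpus, sentinel_positions

def doc_for_position(sentinel_positions, pos):
    """Map a position in the corpus array (e.g. a suffix-array entry)
    back to the document containing it, by bisecting the sorted list
    of sentinel positions collected in the second linear pass."""
    return bisect.bisect_right(sentinel_positions, pos) - 1

docs = [["a", "b"], ["b", "c"]]
corpus, sentinels = build_corpus_array(docs)
print(corpus)                          # [0, 2, 3, 1, 3, 4]
print(doc_for_position(sentinels, 2))  # 0 (position 2 is inside doc 0)
print(doc_for_position(sentinels, 4))  # 1 (position 4 is inside doc 1)
```

The `corpus` list is what would be handed to the suffix array and LCP functions; `bisect_right` on the sentinel positions is the O(log n) document lookup.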
See my proposal and code in debatem1/pydivsufsort#4