
Kerko with a large database - Incremental sync? #26

Open
emanjavacas opened this issue May 8, 2024 · 7 comments
Labels
question Further information is requested

Comments

@emanjavacas

Hi there,

First of all, thanks for this amazing piece of software. For a project I am working on, we need to publish a relatively large Zotero database and make it searchable. Kerko seems to be the best fit for the job, but we may have to index up to 450k items. I am wondering whether you envision any issues in deploying this with Kerko. I've been syncing some libraries I have (about 5k items), and I see that this takes considerable time. I assume that syncing 450k items would probably take weeks, which is in principle not a big deal, as long as future syncs are incremental. But I am unsure about this...

Looking forward to hearing your opinion on this.

best regards,

@davidlesieur
Member

Glad to hear that Kerko is of interest for your project. Regarding the size of your database, I think there might be a few issues:

  1. Zotero: I have never tested it with that many items. Perhaps it can handle them, but with 450k items it is likely to become laggy and unpleasant to use. My understanding is that performance will be improved in Zotero 7 (currently beta), but 450k items is still an order of magnitude more items than what usually works comfortably in Zotero.
  2. Incremental sync: Kerko has two databases. The first is a cache that it builds by retrieving items from Zotero; Kerko's cache sync from Zotero is incremental. The second is the search index, which Kerko builds from its cache and rebuilds in full whenever the cache has changed. Building the search index is much faster than synchronizing from Zotero, but for a large database it can still take significant time. I plan to make indexing incremental in the future, but this is not a trivial task: the search index is a denormalized database, so to implement incremental indexing we have to take dependencies into account, e.g., relations between items and relations between items and collections (see the sketch after this list). I do not yet have funding for this work.
  3. Search engine: Kerko's search engine is Whoosh, which is likely to be slow given the size of your database (both at searching and at indexing). Also, the Whoosh project has become moribund; people with large databases have encountered search and indexing issues that have never been addressed. I have a nice plan for replacing Whoosh with a higher-performance solution in the future, but this needs funding too.
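
To illustrate point 2, here is a rough sketch in plain Python, with made-up names (this is not Kerko's actual code), of why incremental indexing of a denormalized index requires dependency tracking: when one record changes, every index document that embeds data from it has to be refreshed as well.

```python
# Hypothetical sketch, not Kerko's implementation: compute which index documents
# must be rebuilt after some items or collections changed in the cache. Because
# the search index is denormalized, a document may embed data from related items
# and from the collections the item belongs to.

def keys_to_reindex(changed_keys, relations, collection_members):
    """
    changed_keys: keys (items or collections) that changed in the cache.
    relations: maps an item key to the set of item keys it is related to.
    collection_members: maps a collection key to the set of its member item keys.
    """
    affected = set(changed_keys)
    # An item related to a changed item embeds some of its data (e.g., the
    # "related items" shown on its page), so it must be re-indexed too.
    for key, related in relations.items():
        if related & changed_keys:
            affected.add(key)
    # If a collection changed (e.g., it was renamed or moved), every member
    # item's document carries that collection's facet value and must be rebuilt.
    for collection_key, members in collection_members.items():
        if collection_key in changed_keys:
            affected |= members
    return affected

# Item B relates to A; A changed, so both A and B must be re-indexed.
print(keys_to_reindex({"A"}, {"B": {"A"}}, {}))  # {'A', 'B'}
```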

I'll be happy to work on the above issues 2 & 3 when I get sufficient funding, but there's not much we can do about issue 1. It seems to me that your project requires that all 3 be addressed.

I hope this helps!

@davidlesieur davidlesieur added the question Further information is requested label May 10, 2024
@davidlesieur davidlesieur changed the title Incremental sync? Kerko with a large database - Incremental sync? May 10, 2024
@emanjavacas
Author

Thanks a lot for your quick and thorough reply. I will consult and see what happens, especially considering issue 1, which seems to be the bottleneck.

@emanjavacas
Author

Hi!

I've been syncing and testing the app with a 200k-item database, and the main issue I can see is that there are over 30k topics. This generates an index HTML file of about 20 MB, which of course is suboptimal. I am trying to see if there's a way to deactivate the facets (or at least the topics facet, since the other facets are fine).

I have two questions about it.

  • I started digging a bit into the Kerko codebase, but when I install Kerko from the locally modified repo, I get an `AttributeError: module 'kerko' has no attribute 'TRANSLATION_DIRECTORIES'` error.
  • What would be the easiest way to disable facets?

Thanks for your work!

@mgao6767
Contributor

Thanks very much for your work, @davidlesieur!
For issue 3, Elasticsearch may be a good choice. I'm working on a drop-in replacement for the current Whoosh searcher: mgao6767@4b01c84. Search is fast with 10k+ docs, but I'm a newbie, so it's going to take some time for me to implement the same search logic and fix minor bugs.

@davidlesieur
Member

@mgao6767 Very interesting! I have used Elasticsearch (and Solr as well) on other projects. However, I feel a separate search server is overkill for most Kerko projects. It would also make deploying Kerko much harder (many users are researchers who can get by with a Python stack, but might be deterred by infrastructure complexity). I'm considering Tantivy, which would get transparently installed along with Python packages.
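
For anyone curious, here is a minimal sketch of what using the Tantivy Python bindings (the `tantivy` package on PyPI) looks like; the API details are from memory, so please check the tantivy-py documentation before relying on them:

```python
# Minimal tantivy-py sketch (API recalled from memory; verify against the docs).
import tantivy

# Define a schema and create an in-memory index.
schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("title", stored=True)
schema_builder.add_text_field("body", stored=True)
schema = schema_builder.build()
index = tantivy.Index(schema)

# Add a document and commit it.
writer = index.writer()
writer.add_document(tantivy.Document(
    title=["The Old Man and the Sea"],
    body=["He was an old man who fished alone in a skiff in the Gulf Stream."],
))
writer.commit()

# Search the index.
index.reload()
searcher = index.searcher()
query = index.parse_query("fished", ["title", "body"])
for score, address in searcher.search(query, 10).hits:
    print(score, searcher.doc(address)["title"])
```

The appeal over a search server is that this runs in-process: the engine gets installed along with the Python packages, with no extra infrastructure to deploy.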

I have also thought about supporting multiple search engines, but that would require introducing more abstractions in Kerko's architecture. This is certainly doable, but I'm not entirely convinced the increased complexity would be worth the effort and maintenance burden. However, had an abstraction layer been built in the first place, it would now be easier to replace Whoosh... Nice dilemma!

@mgao6767
Contributor


Thanks @davidlesieur! I didn't know much about Tantivy and the like, but it seems really great.

I'm no professional programmer. My current implementation uses Whoosh to generate the query, which is then translated into an ES query. Not ideal, but at least it works. Perhaps we could just write adapters for new engines and add a config option to choose which one to use?
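
For illustration, here is a rough sketch of what such an adapter layer could look like; the names are made up and this is not Kerko's actual architecture, just the general shape of the idea:

```python
# Hypothetical sketch of a pluggable search-backend adapter; names are invented
# and this does not reflect Kerko's actual architecture.
from typing import Iterable, Protocol


class SearchBackend(Protocol):
    """Minimal interface every engine adapter would implement."""

    def index_documents(self, docs: Iterable[dict]) -> None: ...

    def search(self, query: str, limit: int = 10) -> list[dict]: ...


class WhooshBackend:
    def index_documents(self, docs: Iterable[dict]) -> None:
        ...  # write the documents to a Whoosh index

    def search(self, query: str, limit: int = 10) -> list[dict]:
        ...  # parse the query with Whoosh and return the hits


class ElasticsearchBackend:
    def index_documents(self, docs: Iterable[dict]) -> None:
        ...  # bulk-index the documents into an Elasticsearch index

    def search(self, query: str, limit: int = 10) -> list[dict]:
        ...  # translate the query to the Elasticsearch DSL and return the hits


# A config option would pick the adapter, e.g. SEARCH_BACKEND = "elasticsearch".
BACKENDS = {"whoosh": WhooshBackend, "elasticsearch": ElasticsearchBackend}


def get_backend(name: str) -> SearchBackend:
    return BACKENDS[name]()
```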

Please do let me know how it goes with Tantivy. I'm happy to contribute as much as I can.

@davidlesieur
Member

@mgao6767 At the moment, I'm still looking for funding in order to get going with replacing Whoosh.
