REST API endpoints for document store management #279

Open
courtarro opened this issue Apr 7, 2021 · 4 comments
Comments

@courtarro

My objective is to use Odinson as an NLP backend to view a directory of documents, annotating and indexing as new documents are brought in, then view the full dependency parse of a single entire document and allow the user to execute pattern queries against the parsed output.

To support this kind of use case, I think I would need the following additional REST API endpoints:

  • Adding documents to the corpus
  • Annotating one or more documents
  • Indexing one or more documents
  • Removing or modifying documents

Are any of these capabilities already possible with the REST API? If not, would it be possible to plan toward implementing them?

Thanks in advance!

@myedibleenso
Member

Thanks, @courtarro.

> Indexing one or more documents

This one is on our roadmap!

> Adding documents to the corpus

Can you explain what you mean by this and what you would expect the endpoint to do?

> Annotating one or more documents

If it were added, this one would probably end up in a separate REST API in extra, as Odinson is not meant to do this sort of annotation. As a further wrinkle, the library we're using in extra for parsing, tagging, and preprocessing is very memory-intensive. The runnable in extra is only meant to serve as an example of how one could produce OdinsonDocument JSON. Somewhat tangentially, I'm working on a small Python library that will make it easy to generate the OdinsonDocument JSON with spaCy.

I do think others would find something like what you describe convenient, though, so we'd welcome a contribution to extra if this is something that interests you!

> Removing or modifying documents

There are some ongoing discussions about this. It seems like a /remove endpoint and an /index endpoint would be sufficient to cover the "modify" use case. Do you agree?
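To illustrate why a remove operation plus an index operation would be enough to cover "modify", here is a minimal in-memory sketch. The `InMemoryIndex` class and the `modify` helper are hypothetical names made up for this illustration; they are not part of Odinson or its REST API.

```python
class InMemoryIndex:
    """Toy stand-in for a document index (hypothetical, not Odinson's code)."""

    def __init__(self):
        self._docs = {}

    def index(self, doc_id, doc):
        # Refuse to silently overwrite: callers must remove first.
        if doc_id in self._docs:
            raise KeyError(f"{doc_id} is already indexed")
        self._docs[doc_id] = doc

    def remove(self, doc_id):
        # Returns True if something was removed, False if the id was unknown.
        return self._docs.pop(doc_id, None) is not None

    def get(self, doc_id):
        return self._docs.get(doc_id)


def modify(idx, doc_id, new_doc):
    """Express 'modify' as remove followed by index."""
    idx.remove(doc_id)
    idx.index(doc_id, new_doc)


idx = InMemoryIndex()
idx.index("doc-1", {"text": "first version"})
modify(idx, "doc-1", {"text": "second version"})
```

One trade-off of composing the two calls is that the document is briefly absent between the remove and the index, which a real service might need to guard against.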

@courtarro
Author

courtarro commented Apr 7, 2021

Concerning adding documents: if a user wishes to add a document to Odinson's corpus, I don't believe there's currently a way to do that through the REST API. So if indexing is the same thing, then that's sufficient.

That's interesting about annotation. I was thinking of the extra features without considering how they're distinct from the rest of Odinson. If it's possible to make the extra capabilities available via REST, that would be great. I've been using the CoreNLP version of the extra annotator, which yields good results to execute rules against. Ideally, I'd want to be able to say "Here's a big text file representation of a 200-page PDF. Can you (the Odinson/Odinson extras REST API) please annotate this text file and store it within the Odinson index so that I can execute rules against it?"

Using spaCy to generate annotations would be good, though I do like the results from CoreNLP. Python is definitely my go-to language right now, but one reason I'm focused on Odinson vs. spaCy (despite having used spaCy a good bit in the past) is that spaCy's pattern matching engine is missing the critical feature of named capture. Odinson's rules/grammars, CoreNLP's Semgrex, and JAPE all support named capture, whereas spaCy has not yet added that capability.

As for removing: the REST convention is to support the DELETE method on a document endpoint. Just as you could add a document via the upcoming index endpoint with PUT or POST, you could send a DELETE to get rid of it again.
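As a rough sketch of what that could look like from a Python client, the snippet below builds (but does not send) a PUT and a DELETE request for the same document URL. The base URL, the `/documents/{id}` path, and the payload shape are all assumptions made for illustration; the actual endpoint had not been published at the time of this thread.

```python
import json
from urllib.request import Request

# Assumption: where the service is running; not a documented default.
BASE_URL = "http://localhost:9000"


def document_url(doc_id):
    # Hypothetical resource path; the real endpoint may differ.
    return f"{BASE_URL}/documents/{doc_id}"


def build_index_request(doc_id, document):
    """PUT the document JSON to its own URL to add or replace it."""
    body = json.dumps(document).encode("utf-8")
    return Request(
        document_url(doc_id),
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def build_delete_request(doc_id):
    """DELETE the same URL to remove the document again."""
    return Request(document_url(doc_id), method="DELETE")


# Placeholder payload, not the real OdinsonDocument schema.
put_req = build_index_request("doc-1", {"id": "doc-1", "text": "example"})
del_req = build_delete_request("doc-1")
```

Passing either request to `urllib.request.urlopen` would send it; that step is left out here because it needs a running server.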

@myedibleenso
Member

NOTE: depends on #282

@myedibleenso
Member

The new version of the REST API will support indexing, updating, and deleting documents. The changes have been made, but the project is not yet ready to be made public.
