
Partially loading utterances from a selected dataset #48

Open
wwwidonja opened this issue Jun 4, 2020 · 3 comments

@wwwidonja

The documentation states on multiple pages:

However, it is possible to partially load utterances from a dataset to carry out processing of large corpora sequentially.

Alas, the provided link leads to a 404.

Is this still possible? For individual conversation summarization problems, loading just a single conversation would be invaluable, since datasets for large subreddits currently take significant time and compute to load in full.

@calebchiam
Collaborator

Hi @wwwidonja, thanks for raising this. We did some restructuring of the repo recently, so that broke some links.

Here is the link.

And here is the documentation for Corpus. The initialization parameters utterance_start_index and utterance_end_index are what you're looking for.
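
As a rough illustration (stdlib only, not ConvoKit's actual implementation), these parameters effectively select a contiguous line range of the corpus's utterances.jsonl file. The helper function and file contents below are hypothetical, and the end index is assumed to be inclusive:

```python
import json
import tempfile
from itertools import islice
from pathlib import Path

# Hypothetical miniature utterances.jsonl (five utterances, one conversation).
records = [{"id": f"utt{i}", "conversation_id": "conv0", "text": f"msg {i}"}
           for i in range(5)]
path = Path(tempfile.mkdtemp()) / "utterances.jsonl"
path.write_text("\n".join(json.dumps(r) for r in records))

def load_utterances(jsonl_path, start_index=0, end_index=None):
    """Read only lines [start_index, end_index] of a JSONL file -- a sketch of
    the line range that utterance_start_index / utterance_end_index select."""
    stop = None if end_index is None else end_index + 1
    with open(jsonl_path) as f:
        return [json.loads(line) for line in islice(f, start_index, stop)]

print([u["id"] for u in load_utterances(path, start_index=1, end_index=3)])
# → ['utt1', 'utt2', 'utt3']
```

Because the slice is purely positional, it says nothing about conversation boundaries, which is the limitation raised in the next comment.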

@wwwidonja
Author

Thanks for the response.
This does not, however, solve my problem entirely.
What I'm trying to achieve is fetching a single conversation (or a computationally acceptable small subset of conversations) together with all of its corresponding utterances. As far as I understand, these parameters only let me fetch a contiguous range of utterances, with no guarantee that all utterances of a given conversation are included.

I hope you'll be able to consider my issue as a feature request.

Thank you very much!

@calebchiam
Collaborator

calebchiam commented Jun 4, 2020

We've thought about this before, but simply put, there is no way to do this, given that the corpus is loaded from plain JSON Lines files (which allow no indexing other than line by line). If you'd like to work with a smaller subset of the corpus, I'd recommend loading the full corpus, using filter_conversations_by(), and then dumping the result so you have a smaller corpus to work with.

Of course, if you have other ideas for how we might implement your feature request, we're happy to hear it.

EDIT: I should clarify that there is no way to do this elegantly, but it would be possible to filter utterances.jsonl and conversations.json for a specific conversation_id (this just requires iterating through the whole file). It's not clear this is a common enough use case to include in the package, but you could implement the filtering yourself programmatically.
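
The linear scan described above could be sketched as follows (stdlib only; the helper function and sample data are hypothetical, and the conversation_id field follows ConvoKit's utterance layout):

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample data in utterances.jsonl layout: two conversations.
utts = [
    {"id": "u1", "conversation_id": "c1", "text": "hi"},
    {"id": "u2", "conversation_id": "c2", "text": "hello"},
    {"id": "u3", "conversation_id": "c1", "text": "bye"},
]
path = Path(tempfile.mkdtemp()) / "utterances.jsonl"
path.write_text("\n".join(json.dumps(u) for u in utts))

def filter_utterances_by_conversation(jsonl_path, conversation_id):
    """One linear pass over utterances.jsonl -- O(n) in file size, since the
    format has no index; keeps only utterances of the given conversation."""
    with open(jsonl_path) as f:
        return [u for u in map(json.loads, f)
                if u.get("conversation_id") == conversation_id]

print([u["id"] for u in filter_utterances_by_conversation(path, "c1")])
# → ['u1', 'u3']
```

The same pass could be applied to conversations.json to keep the matching conversation record, and the filtered lines written back out as a smaller corpus directory.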
