Partially loading utterances from a selected dataset #48
Comments
Hi @wwwidonja, thanks for raising this. We did some restructuring of the repo recently, so that broke some links. Here is the link. And here is the documentation for Corpus. The initialization parameters
Thanks for the response. I hope you'll be able to consider my issue as a feature request. Thank you very much!
We've thought about this before, but simply put, there is no way to do this, because the corpus is loaded from simple JSON Lines files, which do not allow any kind of indexing other than line by line. If you'd like to work with a smaller subset of the corpus, I'd recommend loading the full corpus, using Of course, if you have other ideas for how we might implement your feature request, we're happy to hear them.
EDIT: I should clarify that there is no way to do this elegantly, but it would be possible to filter utterances.jsonl and conversations.json for a specific conversation_id. (This just requires iterating through the whole file.) It's not clear this is a common enough use case to include in the package, but you could implement it yourself by filtering utterances.jsonl and conversations.json programmatically.
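For anyone who wants to try the filtering approach described above, here is a minimal sketch using only the Python standard library. It is not part of the package. It assumes that each line of utterances.jsonl is a JSON object carrying its conversation ID under a `conversation_id` key, and that conversations.json is a single JSON object keyed by conversation ID; check both against your copy of the dataset. The function name `filter_corpus_files` and its `id_field` parameter are hypothetical.

```python
import json
from pathlib import Path


def filter_corpus_files(src_dir, dst_dir, conversation_id, id_field="conversation_id"):
    """Copy only the data for one conversation into dst_dir.

    Assumptions (verify against your dataset):
      * utterances.jsonl holds one JSON object per line, with the
        conversation ID stored under `id_field`;
      * conversations.json is one JSON object keyed by conversation ID.
    Returns the number of utterances kept.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)

    kept = 0
    # Stream the utterances line by line so the full file never sits in memory.
    with open(src / "utterances.jsonl", encoding="utf-8") as fin, \
         open(dst / "utterances.jsonl", "w", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            if json.loads(line).get(id_field) == conversation_id:
                fout.write(line if line.endswith("\n") else line + "\n")
                kept += 1

    # conversations.json is a single document, so it is parsed in one go.
    with open(src / "conversations.json", encoding="utf-8") as fin:
        conversations = json.load(fin)
    subset = {cid: meta for cid, meta in conversations.items() if cid == conversation_id}
    with open(dst / "conversations.json", "w", encoding="utf-8") as fout:
        json.dump(subset, fout)

    return kept
```

This still reads every line of utterances.jsonl once, but it avoids building the full corpus in memory. Any other files in the corpus directory are not handled here and would probably need to be copied or filtered as well before the result can be loaded as a corpus.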
The documentation states on multiple pages:
alas, the provided link leads to a 404.
Is this still possible? For individual conversation summarization problems, loading just a single conversation would be invaluable, as currently, datasets for large subreddits take significant computational power to load.