
Partially loading utterances from a selected dataset #48

Open
wwwidonja opened this issue Jun 4, 2020 · 3 comments

@wwwidonja

The documentation states on multiple pages:

However, it is possible to partially load utterances from a dataset to carry out processing of large corpora sequentially.

Alas, the provided link leads to a 404.

Is this still possible? For individual conversation summarization problems, loading just a single conversation would be invaluable, since datasets for large subreddits currently take significant time and compute to load in full.

@calebchiam
Collaborator

Hi @wwwidonja, thanks for raising this. We did some restructuring of the repo recently, so that broke some links.

Here is the link.

And here is the documentation for Corpus. The initialization parameters utterance_start_index and utterance_end_index are what you're looking for.
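
As a rough illustration (stdlib only, not ConvoKit's actual implementation), these parameters effectively select a contiguous line range of the corpus's utterances.jsonl file. The helper function and file contents below are hypothetical, and the end index is assumed to be inclusive:

```python
import json
import tempfile
from itertools import islice
from pathlib import Path

# Hypothetical miniature utterances.jsonl (five utterances, one conversation).
records = [{"id": f"utt{i}", "conversation_id": "conv0", "text": f"msg {i}"}
           for i in range(5)]
path = Path(tempfile.mkdtemp()) / "utterances.jsonl"
path.write_text("\n".join(json.dumps(r) for r in records))

def load_utterances(jsonl_path, start_index=0, end_index=None):
    """Read only lines [start_index, end_index] of a JSONL file -- a sketch of
    the line range that utterance_start_index / utterance_end_index select."""
    stop = None if end_index is None else end_index + 1
    with open(jsonl_path) as f:
        return [json.loads(line) for line in islice(f, start_index, stop)]

print([u["id"] for u in load_utterances(path, start_index=1, end_index=3)])
# → ['utt1', 'utt2', 'utt3']
```

Because the slice is purely positional, it says nothing about conversation boundaries, which is the limitation raised in the next comment.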

@wwwidonja
Author

Thanks for the response.
This does not, however, solve my problem entirely.
What I'm trying to achieve is fetching a single conversation (or a computationally acceptable small subset of conversations) together with all of its corresponding utterances. As far as I understand, these parameters only let me fetch a contiguous range of utterances, with no guarantee that all utterances of a given conversation are included.

I hope you'll be able to consider my issue as a feature request.

Thank you very much!

@calebchiam
Collaborator

calebchiam commented Jun 4, 2020

We've thought about this before, but simply put, there is no way to do this, given that the corpus is loaded from plain JSON Lines files (which allow no indexing other than line by line). If you'd like to work with a smaller subset of the corpus, I'd recommend loading the full corpus, using filter_conversations_by(), and then dumping the result so you have a smaller corpus to work with.

Of course, if you have other ideas for how we might implement your feature request, we're happy to hear it.

EDIT: I should clarify that there is no way to do this elegantly, but it would be possible to filter utterances.jsonl and conversations.json for a specific conversation_id (this just requires iterating through the whole file). It's not clear this is a common enough use case to include in the package, but you could implement the filtering yourself programmatically.
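
The linear scan described above could be sketched as follows (stdlib only; the helper function and sample data are hypothetical, and the conversation_id field follows ConvoKit's utterance layout):

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample data in utterances.jsonl layout: two conversations.
utts = [
    {"id": "u1", "conversation_id": "c1", "text": "hi"},
    {"id": "u2", "conversation_id": "c2", "text": "hello"},
    {"id": "u3", "conversation_id": "c1", "text": "bye"},
]
path = Path(tempfile.mkdtemp()) / "utterances.jsonl"
path.write_text("\n".join(json.dumps(u) for u in utts))

def filter_utterances_by_conversation(jsonl_path, conversation_id):
    """One linear pass over utterances.jsonl -- O(n) in file size, since the
    format has no index; keeps only utterances of the given conversation."""
    with open(jsonl_path) as f:
        return [u for u in map(json.loads, f)
                if u.get("conversation_id") == conversation_id]

print([u["id"] for u in filter_utterances_by_conversation(path, "c1")])
# → ['u1', 'u3']
```

The same pass could be applied to conversations.json to keep the matching conversation record, and the filtered lines written back out as a smaller corpus directory.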
