CMV Corpus Link Typo Fix #223

Merged
merged 3 commits into from
Jun 14, 2024
2 changes: 1 addition & 1 deletion README.md
@@ -57,7 +57,7 @@ Available as an interactive notebook: [full version (fine-tuning + inference)](h
ConvoKit ships with several datasets ready for use "out-of-the-box".
These datasets can be downloaded using the `convokit.download()` [helper function](https://github.com/CornellNLP/ConvoKit/blob/master/convokit/util.py). Alternatively you can access them directly [here](http://zissou.infosci.cornell.edu/convokit/datasets/).

-### [Conversations Gone Awry Datasets]([Wikipedia](https://convokit.cornell.edu/documentation/awry.html)/[CMV](https://convokit.cornell.edu/documentation/awry_cmv.html))
+### Conversations Gone Awry Datasets ([Wikipedia](https://convokit.cornell.edu/documentation/awry.html)/[CMV](https://convokit.cornell.edu/documentation/awry_cmv.html))

Two related corpora of conversations that derail into antisocial behavior. One corpus (CGA-WIKI) consists of Wikipedia talk page conversations that derail into personal attacks as labeled by crowdworkers (4,188 conversations containing 30,021 comments). The other (CGA-CMV) consists of discussion threads on the subreddit ChangeMyView (CMV) that derail into rule-violating behavior as determined by the presence of a moderator intervention (6,842 conversations containing 42,964 comments).
Name for download: `conversations-gone-awry-corpus` (for CGA-WIKI) or `conversations-gone-awry-cmv-corpus` (for CGA-CMV)
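
For readers of this diff, a minimal sketch of how the two download names above would typically be used. It assumes ConvoKit is installed and that `convokit.download()` returns a local path accepted by `Corpus(filename=...)`; the `CORPUS_NAMES` mapping and the `load_cga` helper are hypothetical names introduced only for illustration.

```python
# Download names taken from the README text above.
CORPUS_NAMES = {
    "CGA-WIKI": "conversations-gone-awry-corpus",
    "CGA-CMV": "conversations-gone-awry-cmv-corpus",
}


def load_cga(variant: str):
    """Download (if needed) and load one of the two CGA corpora.

    `variant` is "CGA-WIKI" or "CGA-CMV". This helper is illustrative,
    not part of ConvoKit.
    """
    # Imported lazily so the mapping above can be used without ConvoKit installed.
    from convokit import Corpus, download

    return Corpus(filename=download(CORPUS_NAMES[variant]))
```

Calling `load_cga("CGA-CMV")` would fetch the corpus on first use, so it needs network access and disk space.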
12 changes: 6 additions & 6 deletions docs/source/awry_cmv.rst
@@ -9,7 +9,7 @@ Distributed together with: Trouble on the Horizon: Forecasting the Derailment of

Summaries of conversation dynamics described in: How Did We Get Here? Summarizing Conversation Dynamics. Yilun Hua, Nick Chernogor, Yuzhe Gu, Seoyon Julie Jeong, Miranda Luo, Cristian Danescu-Niculescu-Mizil. NAACL 2024.

-Example usage of the corpus and summaries: `SCD and Basic Examples <https://github.com/CornellNLP/ConvoKit/blob/master/examples/conversations-gone-awry-cmv-corpus/scd_example.ipynb>`_
+Example usage of the corpus and summaries: `SCD and Basic Examples <https://github.com/CornellNLP/ConvoKit/blob/master/examples/conversations-gone-awry-cmv/scd-example.ipynb>`_

Dataset details
---------------
@@ -52,11 +52,11 @@ Metadata for each conversation include:
* has_removed_comment: whether the final comment in this thread was removed by CMV moderators for violation of Rule 2
* split: which split (train, val, or test) this conversation was used in for the experiments described in "Trouble on the Horizon"
* summary_meta: metadata related to conversation summaries, a list of dictionaries (one per summary available, possibly empty) with the following keys:
-* * summary_text: the text of the summary;
-* * summary_type: whether the summary is written by humans ("human_written_SCD") or generated automatically using the procedural prompt ("machine_generated_SCD");
-* * up_to_utterance_id: the last utterance considered when creating the summary;
-* * truncated_by: the number of utterances the transcript was truncated by when creating the summary (starting from the end);
-* * scd_split: whether the summary was in the train/test/validation split in the 2024 Summarizing Conversations Dynamics paper;
+* summary_text: the text of the summary;
+* summary_type: whether the summary is written by humans ("human_written_SCD") or generated automatically using the procedural prompt ("machine_generated_SCD");
+* up_to_utterance_id: the last utterance considered when creating the summary;
+* truncated_by: the number of utterances the transcript was truncated by when creating the summary (starting from the end);
+* scd_split: whether the summary was in the train/test/validation split in the 2024 Summarizing Conversations Dynamics paper;
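
As a reading aid for the `summary_meta` layout listed in this hunk, here is a small sketch that filters a conversation's summaries by `summary_type`. The `sample_meta` dictionary and all of its values are invented for illustration, and `summaries_of_type` is a hypothetical helper; in a loaded corpus the same dictionary would presumably come from a conversation's metadata.

```python
from typing import Any

# Hypothetical conversation metadata mirroring the documented keys.
# All values below are made up for illustration.
sample_meta: dict[str, Any] = {
    "has_removed_comment": True,
    "split": "train",
    "summary_meta": [
        {
            "summary_text": "Two speakers disagree and the tone sharpens.",
            "summary_type": "human_written_SCD",
            "up_to_utterance_id": "utt_3",
            "truncated_by": 1,
            "scd_split": "test",
        },
        {
            "summary_text": "The exchange starts civil and grows tense.",
            "summary_type": "machine_generated_SCD",
            "up_to_utterance_id": "utt_3",
            "truncated_by": 1,
            "scd_split": "test",
        },
    ],
}


def summaries_of_type(meta: dict[str, Any], summary_type: str) -> list[dict[str, Any]]:
    """Return the summary entries whose summary_type matches.

    summary_meta may be missing or empty, so default to an empty list.
    """
    return [
        entry
        for entry in meta.get("summary_meta", [])
        if entry.get("summary_type") == summary_type
    ]


human_summaries = summaries_of_type(sample_meta, "human_written_SCD")
for entry in human_summaries:
    print(entry["up_to_utterance_id"], entry["summary_text"])
```

Because the list is documented as "possibly empty", the helper returns an empty list rather than raising when no summaries are present.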


Usage