WikiConv example Jupyter NoteBook Code not working #96

Closed
ErikJSchmidt opened this issue May 5, 2021 · 2 comments
ErikJSchmidt commented May 5, 2021

Situation

Hey there, I want to work with the WikiConv corpus, and I think the ConvoKit framework should make that a lot easier.
To get started I tried to follow along with the example notebook. I downloaded the 2003 corpus on my machine via
from convokit import Corpus, download

corpus_dir_path_cluster = "my/path/"
wikiconv_2003 = Corpus(filename=download("wikiconv-2003", data_dir=corpus_dir_path_cluster))

I then started copying the code from the notebook into a Python file step by step to make it work.
Now, when using print_final_conversation, everything works fine as long as the modification, deletion, and restoration lists are not used.

The problem

Now, for example, when I only put the conversation with id '1275892.3573.3573' in the random_conversations list, like:
random_conversations = [wikiconv_2003.get_conversation('1275892.3573.3573')]

and call
print_final_conversation(random_conversations, wikiconv_2003)

then we get into the function check_lists_for_match with param str(utterance) =
Utterance(id: '1277066.3845.3845', conversation_id: 1275892.3573.3573, reply-to: 1275892.3573.3573, speaker: Speaker(id: Ruhrjung, vectors: [], meta: {'user_id': '10582'}), timestamp: 1060644108.0, text: 'Germany is no state of USA. That is established, even in Wikipedia_talk:Naming_conventions_(city_names), where virtually no arguments for the comma-notion are presented, although the debate there seems to have fallen asleep (about July 22nd). ', vectors: [], meta: {'is_section_header': False, 'indentation': '1', 'toxicity': 0.0887862, 'sever_toxicity': 0.01593855, 'ancestor_id': '1275929.3845.3845', 'rev_id': '1275929', 'parent_id': None, 'original': ({'id': '1275929.3845.3845', 'root': '1275892.3573.3573', 'reply_to': '1275892.3573.3573', 'timestamp': 1060640971.0, 'text': 'Germany is no state of USA. That is established, even in Wikipedia_talk:Naming_conventions_(city_names), where virtually no arguments for the comma-notion are presented, although the debate there seems to have fallen asleep. ', 'meta': {'is_section_header': False, 'indentation': '1', 'toxicity': 0.0887862, 'sever_toxicity': 0.01593855, 'ancestor_id': '1275929.3845.3845', 'rev_id': '1275929', 'parent_id': None, 'original': None, 'modification': [], 'deletion': [], 'restoration': []}}), 'modification': [({'id': '1277066.3845.3845', 'root': '1275892.3573.3573', 'reply_to': '1275892.3573.3573', 'timestamp': 1060644108.0, 'text': 'Germany is no state of USA. That is established, even in Wikipedia_talk:Naming_conventions_(city_names), where virtually no arguments for the comma-notion are presented, although the debate there seems to have fallen asleep (about July 22nd). ', 'meta': {'is_section_header': False, 'indentation': '1', 'toxicity': 0.06772689, 'sever_toxicity': 0.009684636, 'ancestor_id': '1275929.3845.3845', 'rev_id': '1277066', 'parent_id': '1275929.3845.3845', 'original': None, 'modification': [], 'deletion': [], 'restoration': []}})], 'deletion': [], 'restoration': []})

and therefore str(modification_list) =
Modifications[({'id': '1277066.3845.3845', 'root': '1275892.3573.3573', 'reply_to': '1275892.3573.3573', 'timestamp': 1060644108.0, 'text': 'Germany is no state of USA. That is established, even in Wikipedia_talk:Naming_conventions_(city_names), where virtually no arguments for the comma-notion are presented, although the debate there seems to have fallen asleep (about July 22nd). ', 'meta': {'is_section_header': False, 'indentation': '1', 'toxicity': 0.06772689, 'sever_toxicity': 0.009684636, 'ancestor_id': '1275929.3845.3845', 'rev_id': '1277066', 'parent_id': '1275929.3845.3845', 'original': None, 'modification': [], 'deletion': [], 'restoration': []}})]

That leads to the check
if (utterance_val.id == next_utterance_value.reply_to):

which produces
AttributeError: 'Utterance' object has no attribute '_id'

My insights so far

When looking at the modification list I see 'id': '1277066.3845.3845', which I guess should correspond to the utterance's _id. So I don't get why the id is said to be missing.
I am not too familiar with Python, but I wonder why str(utterance) prints the id as id: '1277066.3845.3845' without quotes, while the utterance in the modification list has 'id': '1277066.3845.3845' where id is wrapped in quotes. To me it looks like the utterances in the modification list are in a JSON-like format, and I don't know why.

Also, the notebook was last updated on Nov 20, 2019, while the wikiconv models were last updated on Dec 1, 2020. So maybe the notebook is outdated?

Conclusion

It would help me a lot if anyone could check whether they can reproduce this AttributeError, or whether this is a problem only on my end.

Thank you all for your help.

ErikJSchmidt changed the title from "WikiConv example Jupyter NoteBook not working" to "WikiConv example Jupyter NoteBook Code not working" on May 5, 2021
jpwchang (Collaborator) commented:

Hey there @ErikJSchmidt,

This is related to #59 - basically, the current WikiConv corpora were created with an older version of ConvoKit, and in the meantime the ConvoKit utterance API has changed in ways that make the saved modification metadata no longer compatible with the current API (specifically, the current API has utterance.id redirect to utterance._id, but there was no such distinction in ConvoKit when the modification data was computed, leading to the error that you see). The reason you see this with the specific conversation you listed, and not with other conversations, is that the conversation you found happens to contain some modification data (most conversations don't, so when randomly selecting conversations chances are the erroring code will never be triggered). We're working on constructing an updated WikiConv which will address this issue, but as noted in the linked comments this may take a while due to computational resource issues.
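
To illustrate the mismatch, here is a minimal, simplified sketch (not ConvoKit's actual Utterance class) of how a property that reads _id fails on old-format objects that only store a plain id attribute:

class OldFormatUtterance:
    # Simplified stand-in for the current API: `id` is a property that
    # reads the private attribute `_id`.
    @property
    def id(self):
        return self._id

u = OldFormatUtterance()
u.__dict__['id'] = '1277066.3845.3845'  # old serialized data stored a plain 'id'

print(u.__dict__['id'])  # works: '1277066.3845.3845'
try:
    print(u.id)          # the property looks for `_id`, which does not exist
except AttributeError as e:
    print(e)             # 'OldFormatUtterance' object has no attribute '_id'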

Thankfully, you do not have to wait for an updated WikiConv if you just want to run the example code - there is a workaround that can be used to avoid this issue! What you can do is bypass the utterance API entirely by unpacking each modification object into a dict. To do this, you can simply replace all calls to utterance.id with utterance.__dict__['id']. So the final modified version of check_lists_for_match would look like this:

def check_lists_for_match(x, conversation_ids_list, utterance, next_utterance_value, conversation_corpus):
    # Pull the stored edit histories out of the utterance metadata
    modification_list = utterance.meta['modification']
    deletion_list = utterance.meta['deletion']
    restoration_list = utterance.meta['restoration']
    if (len(modification_list) > 0):
        for utterance_val in modification_list:
            # Bypass the .id property (which expects _id) by reading the raw attribute dict
            if (utterance_val.__dict__['id'] == next_utterance_value.reply_to):
                conversation_ids_list.insert(x+1, next_utterance_value.id)
    if (len(deletion_list) > 0):
        for utterance_val in deletion_list:
            if (utterance_val.__dict__['id'] == next_utterance_value.reply_to):
                conversation_ids_list.insert(x+1, next_utterance_value.id)
    if (len(restoration_list) > 0):
        for utterance_val in restoration_list:
            if (utterance_val.__dict__['id'] == next_utterance_value.reply_to):
                conversation_ids_list.insert(x+1, next_utterance_value.id)

This change will allow print_final_conversation to run as expected.

Note that if you want to run the entire demo notebook, there is a similar issue in the function sort_changes_by_timestamp, where the workaround will again need to be applied, replacing all instances of utterance.user.name with utterance.user.__dict__['_name'].
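
For reference, a minimal sketch of that second workaround (the helper name below is hypothetical and not from the notebook; the surrounding sort_changes_by_timestamp logic stays as it is):

def get_speaker_name(utterance):
    # Old-format data stores the speaker name under '_name' in the raw
    # attribute dict instead of exposing it through the .name property,
    # so read it directly (same bypass as above).
    return utterance.user.__dict__['_name']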

Hope this helps! From my testing this workaround allows the entire notebook to run without issue on the conversation you mentioned, but let us know if anything else comes up. And sorry for the inconvenience!

jpwchang (Collaborator) commented:

As of 03/21/2022, the ConvoKit WikiConv corpora have been updated so that the modification metadata now works correctly; the workaround described in the previous comment is no longer needed.
