Fix human_ans_spans entries to match snippets from human_ans_indices #2

Open
wants to merge 1 commit into
base: master

@lewtun lewtun commented Mar 12, 2021

This PR fixes a mismatch between some entries of the human_ans_spans column and the corresponding span of text in the review column. For example, line 17 of electronics/splits/train.csv has the following text in the human_ans_spans column:

The anti - glare function does work as described

but the actual text in the review column has no spaces around the hyphen in "anti-glare":

The anti-glare function does work as described

Looking at other examples, it seems that some sort of post-processing has been applied to create whitespace around punctuation characters. By using the start and end indices in the human_ans_indices column as the ground truth, I find the following mismatch percentages per domain/split:

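| Split | restaurants | electronics | books | grocery | movies | tripadvisor |
| --- | --- | --- | --- | --- | --- | --- |
| test.csv | 7.13 | 7.52 | 7.37 | 10.44 | 7.39 | 9.9 |
| dev.csv | 10.88 | 7.28 | 5.28 | 10.09 | 4.68 | 7.5 |
| train.csv | 8.59 | 7.76 | 8.06 | 9.82 | 8.48 | 8.82 |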
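A minimal sketch of how such a percentage could be computed, assuming human_ans_indices holds a string encoding a (start, end) pair as the fix function in this PR implies. The helper name `mismatch_percentage` and the toy rows are hypothetical and are not the script that produced the figures in the table.

```python
import ast

import pandas as pd


def mismatch_percentage(df: pd.DataFrame) -> float:
    """Percentage of rows whose human_ans_spans differs from the indexed review slice."""

    def is_mismatch(row: pd.Series) -> bool:
        # Assumption: the indices are stored as a string holding a (start, end) pair
        start_idx, end_idx = ast.literal_eval(row["human_ans_indices"])
        return row["human_ans_spans"] != row["review"][start_idx:end_idx]

    return round(100 * df.apply(is_mismatch, axis=1).mean(), 2)


# Toy rows invented for illustration; the first mirrors the "anti-glare" example
toy = pd.DataFrame(
    {
        "review": ["The anti-glare function does work as described", "Great sound"],
        "human_ans_indices": ["(0, 46)", "(0, 11)"],
        "human_ans_spans": [
            "The anti - glare function does work as described",
            "Great sound",
        ],
    }
)
print(mismatch_percentage(toy))
```

Treating the indices as ground truth means a row counts as a mismatch whenever the stored span is not character-for-character equal to the slice of the review.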

To fix this, I used the following function:

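import ast
from pathlib import Path

import pandas as pd


def fix_answer_spans(path_to_file: Path) -> None:
    def extract_answer_spans(row: pd.Series) -> str:
        # human_ans_indices is a string encoding a (start, end) pair
        start_idx, end_idx = ast.literal_eval(row["human_ans_indices"])
        return row["review"][start_idx:end_idx]

    # Rebuild every span from the review text, then overwrite the file in place
    df = pd.read_csv(path_to_file)
    df["human_ans_spans"] = df.apply(extract_answer_spans, axis=1)
    df.to_csv(path_to_file, index=False)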
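One way to reach every file is sketched below. It assumes that all domains follow the same `<domain>/splits/<split>` layout as electronics/splits/train.csv; only that one path is confirmed in this description. The helper `iter_split_files` and the constants are hypothetical names introduced for illustration.

```python
from pathlib import Path
from typing import Iterator

# Domain and split names as they appear in the mismatch table
DOMAINS = ["restaurants", "electronics", "books", "grocery", "movies", "tripadvisor"]
SPLITS = ["train.csv", "dev.csv", "test.csv"]


def iter_split_files(root: Path) -> Iterator[Path]:
    """Yield each existing <root>/<domain>/splits/<split> file (assumed layout)."""
    for domain in DOMAINS:
        for split in SPLITS:
            path = root / domain / "splits" / split
            # Skip combinations that are not present on disk
            if path.exists():
                yield path
```

Calling `fix_answer_spans(path)` for each `path` in `iter_split_files(Path("."))` would then rewrite the files in place, so it is worth running on a clean checkout where the changes can be reviewed with `git diff`.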
