Fix human_ans_spans entries to match snippets from human_ans_indices #2

lewtun · 2021-03-12T09:39:39Z

This PR fixes a mismatch between some entries of the human_ans_spans column and the corresponding span of text in the review column. For example, on line 17 of electronics/splits/train.csv the human_ans_spans column contains the text

The anti - glare function does work as described

but the actual text in the review column has no spaces around the hyphen in "anti-glare":

The anti-glare function does work as described

Looking at other examples, it seems that some sort of post-processing has been applied to create whitespace around punctuation characters. By using the start and end indices in the human_ans_indices column as the ground truth, I find the following mismatch percentages per domain/split:

split	restaurants	electronics	books	grocery	movies	tripadvisor
test.csv	7.13	7.52	7.37	10.44	7.39	9.9
dev.csv	10.88	7.28	5.28	10.09	4.68	7.5
train.csv	8.59	7.76	8.06	9.82	8.48	8.82
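As a rough sketch of how such percentages could be computed (not necessarily the exact script behind the table above): slice the review with the stored indices and compare the result to human_ans_spans row by row. The helper name mismatch_percentage and the toy rows are hypothetical, and the code assumes human_ans_indices holds a string that parses to a (start, end) pair.

```python
import ast

import pandas as pd


def mismatch_percentage(df: pd.DataFrame) -> float:
    """Percentage of rows whose human_ans_spans differs from review[start:end]."""

    def span_from_indices(row: pd.Series) -> str:
        # Assumption: the column stores a literal such as "(0, 46)"
        start_idx, end_idx = ast.literal_eval(row["human_ans_indices"])
        return row["review"][start_idx:end_idx]

    expected = df.apply(span_from_indices, axis=1)
    mismatched = df["human_ans_spans"] != expected
    return float(round(100 * mismatched.mean(), 2))


# Toy data (hypothetical): the first row mimics the "anti - glare" case,
# the second row already matches its indexed slice.
toy = pd.DataFrame(
    {
        "review": [
            "The anti-glare function does work as described. Great screen.",
            "Battery life is great.",
        ],
        "human_ans_indices": ["(0, 46)", "(0, 12)"],
        "human_ans_spans": [
            "The anti - glare function does work as described",
            "Battery life",
        ],
    }
)
toy_result = mismatch_percentage(toy)
```

Running this on each split file via pd.read_csv would give one number per cell of the table.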

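To fix this, I applied the following function to each file. It regenerates human_ans_spans by slicing the review text with the indices in human_ans_indices: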

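import ast
from pathlib import Path

import pandas as pd


def fix_answer_spans(path_to_file: Path):
    def extract_answer_spans(row: pd.Series):
        # human_ans_indices is a string holding a (start, end) pair; parse it without eval
        start_idx, end_idx = ast.literal_eval(row["human_ans_indices"])
        # Slice the review so the span matches it character for character
        return row["review"][start_idx:end_idx]

    df = pd.read_csv(path_to_file)
    df["human_ans_spans"] = df.apply(extract_answer_spans, axis=1)
    # Overwrite the original file in place
    df.to_csv(path_to_file, index=False)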
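To run a per-file fix over every domain and split, a small driver along these lines could be used. This is only a sketch: apply_to_all_splits is a hypothetical helper, and it assumes every domain follows the same <domain>/splits/<split> layout as electronics/splits/train.csv, which this PR only shows for electronics.

```python
from pathlib import Path
from typing import Callable, List

# Domain and split names as they appear in the mismatch table
DOMAINS = ["restaurants", "electronics", "books", "grocery", "movies", "tripadvisor"]
SPLITS = ["train.csv", "dev.csv", "test.csv"]


def apply_to_all_splits(root: Path, fix: Callable[[Path], None]) -> List[Path]:
    """Call fix on each <root>/<domain>/splits/<split> file that exists."""
    processed = []
    for domain in DOMAINS:
        for split in SPLITS:
            # Assumed layout, inferred from electronics/splits/train.csv
            path = root / domain / "splits" / split
            if path.exists():
                fix(path)
                processed.append(path)
    return processed
```

For example, apply_to_all_splits(Path("."), fix_answer_spans) from the repository root would rewrite each file it finds and return the list of paths it touched.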

Fix human_ans_spans entries to match snippets from human_ans_indices

d0207ed

lewtun mentioned this pull request May 2, 2021

Add SubjQA dataset huggingface/datasets#2302

Merged
