Issue with MVBench Evaluation #227

Open
Backdrop9019 opened this issue Aug 24, 2024 · 1 comment

@Backdrop9019

It seems that there is an issue with the evaluation method in MVBench. Currently, correctness is verified by splitting the prediction and comparing only its first segment (word) against the correct answer. However, this approach causes any prediction whose first segment is just a closing parenthesis ")" to be treated as correct.
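
For illustration, here is a minimal sketch of the kind of first-token, substring-based comparison being described (a hypothetical reconstruction, not the verbatim MVBench code):

def check_ans_first_token(pred, gt):
    # Compare only the first whitespace-separated token of the prediction
    # against the first token of the ground truth (e.g. "(A)").
    pred_option = pred.lower().split(' ')[0]
    gt_option = gt.lower().split(' ')[0]
    # Substring matching is what lets a bare ")" slip through:
    # ")" is contained in "(a)", so it is scored as correct.
    return pred_option in gt_option or gt_option in pred_option

check_ans_first_token(")", "(A) The dog jumps.")        # True -- counted as correct
check_ans_first_token("(B) ...", "(A) The dog jumps.")  # False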

I believe it is essential to add a step that verifies whether the letter of the answer option is correct.

While running your code, I mistakenly added an extra space in the answer prompt, making it “best option: ( ”, and noticed a significant increase in performance.

It would be great if the evaluation method could be made more robust!

@yinanhe (Member) commented Oct 11, 2024

I apologize for not noticing your issue sooner, and I appreciate you bringing it to my attention. I agree that the current evaluation method needs improvement. Relying solely on the first segment for correctness verification can lead to inaccuracies, especially when a standalone closing parenthesis is accepted as correct. Adding a step to verify the letter of the answer option seems like a crucial enhancement. This would help ensure that predictions are valid. I also appreciate your observation about the extra space in the answer prompt and its impact on performance. It highlights the importance of refining our model and evaluation criteria to avoid such pitfalls.

I have also noticed this issue in the MVBench implementation in lmms-eval and have made corresponding modifications, which you can check here: MVBench in lmms-eval. Thank you for your insights!

Here’s a refined version of the check_ans function based on your suggestions:

def check_ans(pred, gt):
    flag = False
    
    # Split predictions and ground truth into options and content
    pred_list = pred.lower().split(' ')
    pred_option, pred_content = pred_list[0], ' '.join(pred_list[1:])
    
    gt_list = gt.lower().split(' ')
    gt_option, gt_content = gt_list[0], ' '.join(gt_list[1:])
    
    # Remove trailing period from ground truth content if present
    if gt_content.endswith('.'):
        gt_content = gt_content[:-1]

    # Clean options by removing certain characters
    pred_option = pred_option.replace('.', '').replace('(', '').replace(')', '')
    gt_option = gt_option.replace('.', '').replace('(', '').replace(')', '')
    
    # Additional check: if pred_option does not contain any option letter a-e, return False
    if not any(char in pred_option for char in 'abcde'):
        return False
    # Check for equality or inclusion
    if pred_option == gt_option:
        flag = True
    elif gt_option in pred_option:
        flag = True
        
    return flag
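
For example, with the refined function above (illustrative inputs):

print(check_ans(")", "(A) The dog jumps."))                   # False -- a bare ")" is no longer accepted
print(check_ans("(A) The dog jumps.", "(A) The dog jumps."))  # True -- option letter matches
print(check_ans("B", "(A) The dog jumps."))                   # False -- wrong option letter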
