Issue with MVBench Evaluation #227

Open
Backdrop9019 opened this issue Aug 24, 2024 · 1 comment

@Backdrop9019

It seems that there is an issue with the evaluation method in MVBench. Currently, correctness is verified by splitting the prediction and comparing only its first segment (word) against the correct answer. However, this approach causes any prediction whose first segment is just a closing parenthesis ")" to be treated as correct.
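
For illustration, here is a minimal sketch of the kind of first-token, substring-based comparison being described (a hypothetical reconstruction, not the verbatim MVBench code):

def check_ans_first_token(pred, gt):
    # Compare only the first whitespace-separated token of the prediction
    # against the first token of the ground truth (e.g. "(A)").
    pred_option = pred.lower().split(' ')[0]
    gt_option = gt.lower().split(' ')[0]
    # Substring matching is what lets a bare ")" slip through:
    # ")" is contained in "(a)", so it is scored as correct.
    return pred_option in gt_option or gt_option in pred_option

check_ans_first_token(")", "(A) The dog jumps.")        # True -- counted as correct
check_ans_first_token("(B) ...", "(A) The dog jumps.")  # False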

I believe it is essential to add a step that verifies whether the letter of the answer option is correct.

While running your code, I mistakenly added an extra space in the answer prompt, making it “best option: ( ”, and noticed a significant increase in performance.

It would be great if the evaluation method could be made more robust!

@yinanhe (Member) commented Oct 11, 2024

I apologize for not noticing your issue sooner, and I appreciate you bringing it to my attention. I agree that the current evaluation method needs improvement. Relying solely on the first segment for correctness verification can lead to inaccuracies, especially when a standalone closing parenthesis is accepted as correct. Adding a step to verify the letter of the answer option seems like a crucial enhancement. This would help ensure that predictions are valid. I also appreciate your observation about the extra space in the answer prompt and its impact on performance. It highlights the importance of refining our model and evaluation criteria to avoid such pitfalls.

I have also noticed this issue in the MVBench implementation in lmms-eval and have made corresponding modifications, which you can check here: MVBench in lmms-eval. Thank you for your insights!

Here’s a refined version of the check_ans function based on your suggestions:

def check_ans(pred, gt):
    flag = False
    
    # Split predictions and ground truth into options and content
    pred_list = pred.lower().split(' ')
    pred_option, pred_content = pred_list[0], ' '.join(pred_list[1:])
    
    gt_list = gt.lower().split(' ')
    gt_option, gt_content = gt_list[0], ' '.join(gt_list[1:])
    
    # Remove trailing period from ground truth content if present
    if gt_content.endswith('.'):
        gt_content = gt_content[:-1]

    # Clean options by removing certain characters
    pred_option = pred_option.replace('.', '').replace('(', '').replace(')', '')
    gt_option = gt_option.replace('.', '').replace('(', '').replace(')', '')
    
    # Additional check: if pred_option does not contain any option letter a-e, return False
    if not any(char in pred_option for char in 'abcde'):
        return False
    # Check for equality or inclusion
    if pred_option == gt_option:
        flag = True
    elif gt_option in pred_option:
        flag = True
        
    return flag
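
For example, with the refined function above (illustrative inputs):

print(check_ans(")", "(A) The dog jumps."))                   # False -- a bare ")" is no longer accepted
print(check_ans("(A) The dog jumps.", "(A) The dog jumps."))  # True -- option letter matches
print(check_ans("B", "(A) The dog jumps."))                   # False -- wrong option letter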
