How stable is run_evaluation.py with gold patch for SWE-bench_Verified? #294

Open
MarcCote opened this issue Jan 20, 2025 · 1 comment
Labels
documentation Improvements or additions to documentation

MarcCote commented Jan 20, 2025

Describe the issue

I ran the evaluation script with --predictions_path gold on the 500 tasks in SWE-bench_Verified, and 14 of them come back unresolved.

I'm using the current main branch of swebench: c63a113

This is the exact command:

    python -m swebench.harness.run_evaluation --predictions_path gold --max_workers 25 --run_id validate-gold-verified --dataset_name princeton-nlp/SWE-bench_Verified --cache_level instance

These are the unresolved task IDs:

    "unresolved_ids": [
        "astropy__astropy-7166",
        "astropy__astropy-7336",
        "astropy__astropy-7606",
        "astropy__astropy-7671",
        "astropy__astropy-8707",
        "astropy__astropy-8872",
        "django__django-10097",
        "matplotlib__matplotlib-20488",
        "psf__requests-2317",
        "pylint-dev__pylint-6528",
        "pylint-dev__pylint-7080",
        "pylint-dev__pylint-7277",
        "sphinx-doc__sphinx-10323",
        "sphinx-doc__sphinx-10435"
    ],

Then I ran it a second time and got 15 unresolved tasks: the same 14 as before, plus psf__requests-1766:

    "unresolved_ids": [
        "astropy__astropy-7166",
        "astropy__astropy-7336",
        "astropy__astropy-7606",
        "astropy__astropy-7671",
        "astropy__astropy-8707",
        "astropy__astropy-8872",
        "django__django-10097",
        "matplotlib__matplotlib-20488",
        "psf__requests-1766",
        "psf__requests-2317",
        "pylint-dev__pylint-6528",
        "pylint-dev__pylint-7080",
        "pylint-dev__pylint-7277",
        "sphinx-doc__sphinx-10323",
        "sphinx-doc__sphinx-10435"
    ],
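
To make the difference between the two runs explicit, here is a minimal sketch that compares their "unresolved_ids" lists with set operations. It only assumes that each run's report is available as a dict with an "unresolved_ids" key, as in the excerpts above. The helper name compare_runs is hypothetical and not part of swebench.

```python
# Unresolved IDs from the first gold-patch run, copied from the report excerpt above.
run_1 = {
    "unresolved_ids": [
        "astropy__astropy-7166",
        "astropy__astropy-7336",
        "astropy__astropy-7606",
        "astropy__astropy-7671",
        "astropy__astropy-8707",
        "astropy__astropy-8872",
        "django__django-10097",
        "matplotlib__matplotlib-20488",
        "psf__requests-2317",
        "pylint-dev__pylint-6528",
        "pylint-dev__pylint-7080",
        "pylint-dev__pylint-7277",
        "sphinx-doc__sphinx-10323",
        "sphinx-doc__sphinx-10435",
    ]
}

# The second run reported the same 14 IDs plus one additional instance.
run_2 = {"unresolved_ids": run_1["unresolved_ids"] + ["psf__requests-1766"]}


def compare_runs(first, second):
    """Hypothetical helper: split unresolved IDs into stable and run-specific sets."""
    a = set(first["unresolved_ids"])
    b = set(second["unresolved_ids"])
    return {
        "unresolved_in_both": sorted(a & b),  # fail with the gold patch in both runs
        "only_in_first": sorted(a - b),       # unresolved only in the first run
        "only_in_second": sorted(b - a),      # unresolved only in the second run
    }


diff = compare_runs(run_1, run_2)
for group, ids in diff.items():
    print(f"{group} ({len(ids)}): {ids}")
```

IDs that appear in only one run are candidates for flaky tests, while IDs in both runs point to instances that consistently fail even with the gold patch.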

Suggest an improvement to documentation

No response

MarcCote added the documentation label Jan 20, 2025
MarcCote (Author) commented

This might be related to #225, #167, #246, #267, and #274.
