How stable is run_evaluation.py with gold patch for SWE-bench_Verified? #294

MarcCote · 2025-01-20T21:09:58Z

Describe the issue

I ran the evaluation script with --predictions_path gold on the 500 tasks in SWE-bench_Verified, and 14 of them fail.

I'm using the current main branch of swebench: c63a113

This is the exact command:

python -m swebench.harness.run_evaluation --predictions_path gold --max_workers 25 --run_id validate-gold-verified --dataset_name princeton-nlp/SWE-bench_Verified --cache_level instance

These are the unresolved task IDs:

    "unresolved_ids": [
        "astropy__astropy-7166",
        "astropy__astropy-7336",
        "astropy__astropy-7606",
        "astropy__astropy-7671",
        "astropy__astropy-8707",
        "astropy__astropy-8872",
        "django__django-10097",
        "matplotlib__matplotlib-20488",
        "psf__requests-2317",
        "pylint-dev__pylint-6528",
        "pylint-dev__pylint-7080",
        "pylint-dev__pylint-7277",
        "sphinx-doc__sphinx-10323",
        "sphinx-doc__sphinx-10435"
    ],

Then I ran it a second time and got 15 unresolved tasks (psf__requests-1766 is the only addition):

    "unresolved_ids": [
        "astropy__astropy-7166",
        "astropy__astropy-7336",
        "astropy__astropy-7606",
        "astropy__astropy-7671",
        "astropy__astropy-8707",
        "astropy__astropy-8872",
        "django__django-10097",
        "matplotlib__matplotlib-20488",
        "psf__requests-1766",
        "psf__requests-2317",
        "pylint-dev__pylint-6528",
        "pylint-dev__pylint-7080",
        "pylint-dev__pylint-7277",
        "sphinx-doc__sphinx-10323",
        "sphinx-doc__sphinx-10435"
    ],
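To make the difference between the two runs explicit, here is a minimal sketch that compares the two "unresolved_ids" lists with set operations. The helper names (`load_unresolved`, `compare_runs`) and the labels "stable" and "flaky" are my own for illustration and are not part of swebench. The report path passed to `load_unresolved` would be whatever JSON report your run produced; the only thing assumed about it is the "unresolved_ids" key shown above.

```python
import json
from pathlib import Path


def load_unresolved(report_path):
    """Read the "unresolved_ids" list from a JSON report (hypothetical path)."""
    report = json.loads(Path(report_path).read_text())
    return set(report["unresolved_ids"])


def compare_runs(first, second):
    """Return (IDs unresolved in both runs, IDs unresolved in only one run)."""
    first, second = set(first), set(second)
    stable = sorted(first & second)  # failed every time
    flaky = sorted(first ^ second)   # failed in exactly one run
    return stable, flaky


# IDs copied from the two gold-patch runs reported above.
run_1 = {
    "astropy__astropy-7166",
    "astropy__astropy-7336",
    "astropy__astropy-7606",
    "astropy__astropy-7671",
    "astropy__astropy-8707",
    "astropy__astropy-8872",
    "django__django-10097",
    "matplotlib__matplotlib-20488",
    "psf__requests-2317",
    "pylint-dev__pylint-6528",
    "pylint-dev__pylint-7080",
    "pylint-dev__pylint-7277",
    "sphinx-doc__sphinx-10323",
    "sphinx-doc__sphinx-10435",
}
run_2 = run_1 | {"psf__requests-1766"}

stable, flaky = compare_runs(run_1, run_2)
print(len(stable), "unresolved in both runs")
print("unresolved in only one run:", flaky)
```

With the lists from this issue, 14 IDs are unresolved in both runs and only psf__requests-1766 differs, so that instance looks nondeterministic while the other 14 fail consistently.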

Suggest an improvement to documentation

No response


MarcCote · 2025-01-20T21:10:50Z

This might be related to #225, #167, #246, #267, and #274.

MarcCote added the "documentation" label (Improvements or additions to documentation) on Jan 20, 2025
