Inconsistent result of gold patch #328

williamd4112 · 2025-02-18T19:41:09Z

Describe the bug

While testing the instance astropy__astropy-6938, I found that the gold patch provided in SWE-Bench_Lite produces different results under different versions of SWE-bench.

Here is the gold patch for astropy__astropy-6938:

diff --git a/astropy/io/fits/fitsrec.py b/astropy/io/fits/fitsrec.py
--- a/astropy/io/fits/fitsrec.py
+++ b/astropy/io/fits/fitsrec.py
@@ -1261,7 +1261,7 @@ def _scale_back_ascii(self, col_idx, input_field, output_field):
 
         # Replace exponent separator in floating point numbers
         if 'D' in format:
-            output_field.replace(encode_ascii('E'), encode_ascii('D'))
+            output_field[:] = output_field.replace(b'E', b'D')
 
 
 def _get_recarray_field(array, key):
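
For context, the patch matters because the original line discarded the return value of `replace`. Below is a minimal sketch of that behavior, assuming NumPy is installed. It uses `np.char.replace` on a plain bytes array as a stand-in for the method that `fitsrec.py` calls on `output_field`, and the sample values are made up for illustration.

```python
import numpy as np

# Hypothetical ASCII-table column holding floats with an 'E' exponent.
field = np.array([b"1.0E+03", b"2.5E-01"], dtype="S7")

# Pre-patch pattern: replace returns a new array, so dropping the
# return value leaves the original buffer untouched.
np.char.replace(field, b"E", b"D")
unchanged = field.tolist()

# Post-patch pattern: write the result back into the existing buffer.
field[:] = np.char.replace(field, b"E", b"D")
changed = field.tolist()

print(unchanged)
print(changed)
```

The slice assignment writes into the same memory, which is what a caller holding a reference to the output array would need.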

I'm testing this patch with the following command:

python -m swebench.harness.run_evaluation \
    --predictions_path gold \
    --max_workers 1 \
    --instance_ids astropy__astropy-6938 \
    --run_id validate-gold \
    --cache_level instance

The report from the latest SWE-bench, which shows the gold patch failing, is the following:

{
    "astropy__astropy-6938": {
        "patch_is_None": false,
        "patch_exists": true,
        "patch_successfully_applied": true,
        "resolved": false,
        "tests_status": {
            "FAIL_TO_PASS": {
                "success": [],
                "failure": [
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_ascii_table_data",
                    "astropy/io/fits/tests/test_table.py::TestTableFunctions::test_ascii_table"
                ]
            },
            "PASS_TO_PASS": {
                "success": [],
                "failure": [
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_sample_file",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_image_create",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_scaled_data",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_scaled_data_auto_rescale",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_uint16_data",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_groups_hdu_data",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_open_with_no_keywords",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_writeto_convenience",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_hdu_writeto",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_datasum_only",
                    "astropy/io/fits/tests/test_table.py::test_regression_scalar_indexing"
                ]
            },
            "FAIL_TO_FAIL": {
                "success": [],
                "failure": []
            },
            "PASS_TO_FAIL": {
                "success": [],
                "failure": []
            }
        }
    }
}

However, I got a different report from SWE-Gym's fork of SWE-bench:

{
    "astropy__astropy-6938": {
        "patch_is_None": false,
        "patch_exists": true,
        "patch_successfully_applied": true,
        "resolved": true,
        "tests_status": {
            "FAIL_TO_PASS": {
                "success": [
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_ascii_table_data",
                    "astropy/io/fits/tests/test_table.py::TestTableFunctions::test_ascii_table"
                ],
                "failure": []
            },
            "PASS_TO_PASS": {
                "success": [
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_sample_file",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_image_create",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_scaled_data",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_scaled_data_auto_rescale",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_uint16_data",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_groups_hdu_data",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_open_with_no_keywords",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_writeto_convenience",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_hdu_writeto",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_datasum_only",
                    "astropy/io/fits/tests/test_table.py::test_regression_scalar_indexing"
                ],
                "failure": []
            },
            "FAIL_TO_FAIL": {
                "success": [],
                "failure": []
            },
            "PASS_TO_FAIL": {
                "success": [],
                "failure": []
            }
        }
    }
}

Both runs use the same Docker images. Could this be caused by a bug introduced in the newer version of SWE-bench?
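
To make the difference between the two reports easier to see, here is a rough sketch of how they could be compared programmatically. The helper names `status_by_test` and `flipped_tests` are hypothetical and not part of SWE-bench, and the two dictionaries are trimmed to one test per category for brevity. It only assumes the report layout shown above.

```python
def status_by_test(report, instance_id):
    """Map each test name to its (category, outcome) pair."""
    statuses = {}
    for category, buckets in report[instance_id]["tests_status"].items():
        for outcome, tests in buckets.items():
            for test in tests:
                statuses[test] = (category, outcome)
    return statuses


def flipped_tests(report_a, report_b, instance_id):
    """Return the sorted test names whose status differs between two reports."""
    a = status_by_test(report_a, instance_id)
    b = status_by_test(report_b, instance_id)
    return sorted(t for t in a.keys() | b.keys() if a.get(t) != b.get(t))


# Trimmed-down stand-ins for the two reports (one test per category).
F2P = "astropy/io/fits/tests/test_table.py::TestTableFunctions::test_ascii_table"
P2P = "astropy/io/fits/tests/test_table.py::test_regression_scalar_indexing"
iid = "astropy__astropy-6938"

newer = {iid: {"tests_status": {
    "FAIL_TO_PASS": {"success": [], "failure": [F2P]},
    "PASS_TO_PASS": {"success": [], "failure": [P2P]},
}}}
older = {iid: {"tests_status": {
    "FAIL_TO_PASS": {"success": [F2P], "failure": []},
    "PASS_TO_PASS": {"success": [P2P], "failure": []},
}}}

for name in flipped_tests(newer, older, iid):
    print(name)
```

On the full reports above, every listed test changes outcome, including all of the PASS_TO_PASS tests.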

Steps/Code to Reproduce

Run this command:

python -m swebench.harness.run_evaluation \
    --predictions_path gold \
    --max_workers 1 \
    --instance_ids astropy__astropy-6938 \
    --run_id validate-gold \
    --cache_level instance

Expected Results

{
    "astropy__astropy-6938": {
        "patch_is_None": false,
        "patch_exists": true,
        "patch_successfully_applied": true,
        "resolved": true,
        "tests_status": {
            "FAIL_TO_PASS": {
                "success": [
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_ascii_table_data",
                    "astropy/io/fits/tests/test_table.py::TestTableFunctions::test_ascii_table"
                ],
                "failure": []
            },
            "PASS_TO_PASS": {
                "success": [
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_sample_file",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_image_create",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_scaled_data",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_scaled_data_auto_rescale",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_uint16_data",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_groups_hdu_data",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_open_with_no_keywords",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_writeto_convenience",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_hdu_writeto",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_datasum_only",
                    "astropy/io/fits/tests/test_table.py::test_regression_scalar_indexing"
                ],
                "failure": []
            },
            "FAIL_TO_FAIL": {
                "success": [],
                "failure": []
            },
            "PASS_TO_FAIL": {
                "success": [],
                "failure": []
            }
        }
    }
}

Actual Results

{
    "astropy__astropy-6938": {
        "patch_is_None": false,
        "patch_exists": true,
        "patch_successfully_applied": true,
        "resolved": false,
        "tests_status": {
            "FAIL_TO_PASS": {
                "success": [],
                "failure": [
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_ascii_table_data",
                    "astropy/io/fits/tests/test_table.py::TestTableFunctions::test_ascii_table"
                ]
            },
            "PASS_TO_PASS": {
                "success": [],
                "failure": [
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_sample_file",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_image_create",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_scaled_data",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_scaled_data_auto_rescale",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_uint16_data",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_groups_hdu_data",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_open_with_no_keywords",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_writeto_convenience",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_hdu_writeto",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_datasum_only",
                    "astropy/io/fits/tests/test_table.py::test_regression_scalar_indexing"
                ]
            },
            "FAIL_TO_FAIL": {
                "success": [],
                "failure": []
            },
            "PASS_TO_FAIL": {
                "success": [],
                "failure": []
            }
        }
    }
}

System Information

SWE-Bench (commit: 2f621d5)
SWE-Bench Fork (commit: 242429c188fcfd06aad13fce9a54d450470bf0ac)

williamd4112 added the bug Something isn't working label Feb 18, 2025