Inconsistent result of gold patch #328

Open
williamd4112 opened this issue Feb 18, 2025 · 0 comments
Labels
bug Something isn't working

Describe the bug

When testing the instance astropy__astropy-6938, I noticed that the gold patch provided in SWE-bench_Lite produces different results under different versions of SWE-bench.

Here is the gold patch for astropy__astropy-6938:

diff --git a/astropy/io/fits/fitsrec.py b/astropy/io/fits/fitsrec.py
--- a/astropy/io/fits/fitsrec.py
+++ b/astropy/io/fits/fitsrec.py
@@ -1261,7 +1261,7 @@ def _scale_back_ascii(self, col_idx, input_field, output_field):
 
         # Replace exponent separator in floating point numbers
         if 'D' in format:
-            output_field.replace(encode_ascii('E'), encode_ascii('D'))
+            output_field[:] = output_field.replace(b'E', b'D')
 
 
 def _get_recarray_field(array, key):
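
For context, here is a minimal sketch of what the one-line change in the patch does. It uses a stdlib bytearray as a stand-in, which is an assumption made for illustration: the real output_field in astropy is a NumPy character array, not a bytearray. In both cases replace() returns a new object and leaves the original untouched, so the pre-patch line discards its result, while the slice assignment writes the result back into the existing buffer.

```python
# Stand-in for output_field (assumption: the real object is a NumPy
# character array; bytearray is used only because it is in the stdlib
# and has the same "replace returns a copy" behavior).
output_field = bytearray(b"1.5E+03 2.0E-01")
alias = output_field  # a second reference to the same buffer

# Pre-patch form: the returned copy is thrown away, the buffer is unchanged.
output_field.replace(b"E", b"D")
before = bytes(output_field)

# Patched form: slice assignment copies the result into the same buffer,
# so every reference to that buffer sees the new contents.
output_field[:] = output_field.replace(b"E", b"D")
after = bytes(alias)

print(before)
print(after)
```

The alias is there to show that the patched form updates the buffer in place rather than rebinding a local name.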

I'm testing this patch with the following command:

python -m swebench.harness.run_evaluation \
    --predictions_path gold \
    --max_workers 1 \
    --instance_ids astropy__astropy-6938 \
    --run_id validate-gold \
    --cache_level instance

The report from the latest SWE-bench, which marks the gold patch as failed, is the following:

{
    "astropy__astropy-6938": {
        "patch_is_None": false,
        "patch_exists": true,
        "patch_successfully_applied": true,
        "resolved": false,
        "tests_status": {
            "FAIL_TO_PASS": {
                "success": [],
                "failure": [
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_ascii_table_data",
                    "astropy/io/fits/tests/test_table.py::TestTableFunctions::test_ascii_table"
                ]
            },
            "PASS_TO_PASS": {
                "success": [],
                "failure": [
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_sample_file",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_image_create",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_scaled_data",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_scaled_data_auto_rescale",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_uint16_data",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_groups_hdu_data",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_open_with_no_keywords",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_writeto_convenience",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_hdu_writeto",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_datasum_only",
                    "astropy/io/fits/tests/test_table.py::test_regression_scalar_indexing"
                ]
            },
            "FAIL_TO_FAIL": {
                "success": [],
                "failure": []
            },
            "PASS_TO_FAIL": {
                "success": [],
                "failure": []
            }
        }
    }
}

However, I got a different report from SWE-Gym's fork of SWE-bench:

{
    "astropy__astropy-6938": {
        "patch_is_None": false,
        "patch_exists": true,
        "patch_successfully_applied": true,
        "resolved": true,
        "tests_status": {
            "FAIL_TO_PASS": {
                "success": [
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_ascii_table_data",
                    "astropy/io/fits/tests/test_table.py::TestTableFunctions::test_ascii_table"
                ],
                "failure": []
            },
            "PASS_TO_PASS": {
                "success": [
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_sample_file",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_image_create",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_scaled_data",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_scaled_data_auto_rescale",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_uint16_data",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_groups_hdu_data",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_open_with_no_keywords",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_writeto_convenience",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_hdu_writeto",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_datasum_only",
                    "astropy/io/fits/tests/test_table.py::test_regression_scalar_indexing"
                ],
                "failure": []
            },
            "FAIL_TO_FAIL": {
                "success": [],
                "failure": []
            },
            "PASS_TO_FAIL": {
                "success": [],
                "failure": []
            }
        }
    }
}

Both runs use the same Docker images. Could this be caused by a bug introduced in the newer version of SWE-bench?
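
To compare the two reports test by test rather than by eye, something like the sketch below could help. It assumes both reports have been loaded (for example with json.load) into dicts with the structure shown above. The function names status_map and diff_reports are hypothetical helpers, not part of SWE-bench.

```python
def status_map(report, instance_id):
    """Flatten tests_status into {test_name: (category, outcome)}."""
    flat = {}
    for category, outcomes in report[instance_id]["tests_status"].items():
        for outcome, tests in outcomes.items():
            for test in tests:
                flat[test] = (category, outcome)
    return flat


def diff_reports(report_a, report_b, instance_id):
    """Return tests whose (category, outcome) differs between two reports."""
    a = status_map(report_a, instance_id)
    b = status_map(report_b, instance_id)
    return {
        test: (a.get(test), b.get(test))
        for test in sorted(set(a) | set(b))
        if a.get(test) != b.get(test)
    }


# Tiny hypothetical reports in the same shape as the ones in this issue.
latest = {"inst": {"tests_status": {
    "FAIL_TO_PASS": {"success": [], "failure": ["t1"]},
    "PASS_TO_PASS": {"success": [], "failure": ["t2"]},
}}}
fork = {"inst": {"tests_status": {
    "FAIL_TO_PASS": {"success": ["t1"], "failure": []},
    "PASS_TO_PASS": {"success": ["t2"], "failure": []},
}}}

differences = diff_reports(latest, fork, "inst")
for test, (old, new) in differences.items():
    print(test, old, "->", new)
```

Applied to the two reports above, this would list all 13 tests, since every FAIL_TO_PASS and PASS_TO_PASS entry moves from "failure" to "success" between the two runs.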

Steps/Code to Reproduce

Run this command:

python -m swebench.harness.run_evaluation \
    --predictions_path gold \
    --max_workers 1 \
    --instance_ids astropy__astropy-6938 \
    --run_id validate-gold \
    --cache_level instance

Expected Results

{
    "astropy__astropy-6938": {
        "patch_is_None": false,
        "patch_exists": true,
        "patch_successfully_applied": true,
        "resolved": true,
        "tests_status": {
            "FAIL_TO_PASS": {
                "success": [
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_ascii_table_data",
                    "astropy/io/fits/tests/test_table.py::TestTableFunctions::test_ascii_table"
                ],
                "failure": []
            },
            "PASS_TO_PASS": {
                "success": [
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_sample_file",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_image_create",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_scaled_data",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_scaled_data_auto_rescale",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_uint16_data",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_groups_hdu_data",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_open_with_no_keywords",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_writeto_convenience",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_hdu_writeto",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_datasum_only",
                    "astropy/io/fits/tests/test_table.py::test_regression_scalar_indexing"
                ],
                "failure": []
            },
            "FAIL_TO_FAIL": {
                "success": [],
                "failure": []
            },
            "PASS_TO_FAIL": {
                "success": [],
                "failure": []
            }
        }
    }
}

Actual Results

{
    "astropy__astropy-6938": {
        "patch_is_None": false,
        "patch_exists": true,
        "patch_successfully_applied": true,
        "resolved": false,
        "tests_status": {
            "FAIL_TO_PASS": {
                "success": [],
                "failure": [
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_ascii_table_data",
                    "astropy/io/fits/tests/test_table.py::TestTableFunctions::test_ascii_table"
                ]
            },
            "PASS_TO_PASS": {
                "success": [],
                "failure": [
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_sample_file",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_image_create",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_scaled_data",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_scaled_data_auto_rescale",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_uint16_data",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_groups_hdu_data",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_open_with_no_keywords",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_writeto_convenience",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_hdu_writeto",
                    "astropy/io/fits/tests/test_checksum.py::TestChecksumFunctions::test_datasum_only",
                    "astropy/io/fits/tests/test_table.py::test_regression_scalar_indexing"
                ]
            },
            "FAIL_TO_FAIL": {
                "success": [],
                "failure": []
            },
            "PASS_TO_FAIL": {
                "success": [],
                "failure": []
            }
        }
    }
}

System Information

@williamd4112 williamd4112 added the bug Something isn't working label Feb 18, 2025