Realtime test sometime failing? #2515

Closed
ADBond opened this issue Nov 20, 2024 · 11 comments · Fixed by #2589

ADBond (Contributor) commented Nov 20, 2024

The test tests/test_realtime.py::test_realtime_cache_different_settings[duckdb] failed in Python 3.9 on this run. Re-running the job allowed the test to pass, which is odd, as I can't see where any indeterminacy could occur.

Error was: Binder Error: Values list "l" does not have a column named "tf_first_name".

Full test output
________________ test_realtime_cache_different_settings[duckdb] ________________

self = <splink.internals.duckdb.database_api.DuckDBAPI object at 0x7fb7780ef4c0>
final_sql = 'CREATE TABLE __splink__realtime_compare_records_aj2njrcn AS\nWITH __splink__compare_records_left AS (\n  SELECT\n    ..._adj_city,\n  "email_l",\n  "email_r",\n  gamma_email,\n  bf_email,\n  match_key\nFROM __splink__df_match_weight_parts'
templated_name = '__splink__realtime_compare_records'
physical_name = '__splink__realtime_compare_records_aj2njrcn'

    @final
    def _log_and_run_sql_execution(
        self, final_sql: str, templated_name: str, physical_name: str
    ) -> TablishType:
        """
        Log some sql, then call _run_sql_execution()
        Any errors will be converted to SplinkException with more detail
        names are only relevant for logging, not execution
        """
        logger.debug(execute_sql_logging_message_info(templated_name, physical_name))
        logger.log(5, log_sql(final_sql))
        try:
>           return self._execute_sql_against_backend(final_sql)

splink/internals/database_api.py:63: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <splink.internals.duckdb.database_api.DuckDBAPI object at 0x7fb7780ef4c0>
final_sql = 'CREATE TABLE __splink__realtime_compare_records_aj2njrcn AS \nWITH\n\n__splink__compare_records_left as (\nselect * f..._tf_adj_city,"email_l","email_r",gamma_email,bf_email,match_key \n    from __splink__df_match_weight_parts\n    \n    '

    def _execute_sql_against_backend(self, final_sql: str) -> duckdb.DuckDBPyRelation:
>       return self._con.sql(final_sql)
E       duckdb.duckdb.BinderException: Binder Error: Values list "l" does not have a column named "tf_first_name"
E       LINE 11: ...from __splink__compare_records_left_aj2njrcn), 
E       
E       __splink__compare_records_rig...
E                                                          ^

splink/internals/duckdb/database_api.py:102: BinderException

The above exception was the direct cause of the following exception:

test_helpers = {'duckdb': (<class 'tests.helpers.DuckDBTestHelper'>, []), 'spark': (<class 'tests.helpers.SparkTestHelper'>, [<functi...ite': (<class 'tests.helpers.SQLiteTestHelper'>, []), 'postgres': (<class 'tests.helpers.PostgresTestHelper'>, [None])}
dialect = 'duckdb'

    @mark_with_dialects_excluding()
    def test_realtime_cache_different_settings(test_helpers, dialect):
        helper = test_helpers[dialect]
        db_api = helper.extra_linker_args()["db_api"]
    
        df1 = pd.DataFrame(
            [
                {
                    "unique_id": 0,
                    "first_name": "Julia",
                    "surname": "Taylor",
                    "city": "London",
                    "email": "[email protected]",
                }
            ]
        )
    
        df2 = pd.DataFrame(
            [
                {
                    "unique_id": 1,
                    "first_name": "Julia",
                    "surname": "Taylor",
                    "city": "London",
                    "email": "[email protected]",
                }
            ]
        )
    
        settings_1 = SettingsCreator(
            link_type="dedupe_only",
            comparisons=[
                cl.ExactMatch("first_name"),
                cl.ExactMatch("surname"),
                cl.ExactMatch("city"),
            ],
            blocking_rules_to_generate_predictions=[block_on("first_name")],
        )
    
        settings_2 = SettingsCreator(
            link_type="dedupe_only",
            comparisons=[
                cl.ExactMatch("first_name"),
                cl.ExactMatch("surname"),
                cl.ExactMatch("email"),
            ],
            blocking_rules_to_generate_predictions=[block_on("first_name")],
        )
    
>       res1 = compare_records(
            df1, df2, settings_1, db_api, use_sql_from_cache=True
        ).as_record_dict()[0]["match_weight"]

tests/test_realtime.py:343: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
splink/internals/realtime.py:88: in compare_records
    return db_api._sql_to_splink_dataframe(
splink/internals/database_api.py:93: in _sql_to_splink_dataframe
    spark_df = self._log_and_run_sql_execution(sql, templated_name, physical_name)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <splink.internals.duckdb.database_api.DuckDBAPI object at 0x7fb7780ef4c0>
final_sql = 'CREATE TABLE __splink__realtime_compare_records_aj2njrcn AS\nWITH __splink__compare_records_left AS (\n  SELECT\n    ..._adj_city,\n  "email_l",\n  "email_r",\n  gamma_email,\n  bf_email,\n  match_key\nFROM __splink__df_match_weight_parts'
templated_name = '__splink__realtime_compare_records'
physical_name = '__splink__realtime_compare_records_aj2njrcn'

    @final
    def _log_and_run_sql_execution(
        self, final_sql: str, templated_name: str, physical_name: str
    ) -> TablishType:
        """
        Log some sql, then call _run_sql_execution()
        Any errors will be converted to SplinkException with more detail
        names are only relevant for logging, not execution
        """
        logger.debug(execute_sql_logging_message_info(templated_name, physical_name))
        logger.log(5, log_sql(final_sql))
        try:
            return self._execute_sql_against_backend(final_sql)
        except Exception as e:
            # Parse our SQL through sqlglot to pretty print
            try:
                final_sql = sqlglot.parse_one(
                    final_sql,
                    read=self.sql_dialect.sqlglot_dialect,
                ).sql(pretty=True)
                # if sqlglot produces any errors, just report the raw SQL
            except Exception:
                pass
    
>           raise SplinkException(
                f"Error executing the following sql for table "
                f"`{templated_name}`({physical_name}):\n{final_sql}"
                f"\n\nError was: {e}"
            ) from e
E           splink.internals.exceptions.SplinkException: Error executing the following sql for table `__splink__realtime_compare_records`(__splink__realtime_compare_records_aj2njrcn):
E           CREATE TABLE __splink__realtime_compare_records_aj2njrcn AS
E           WITH __splink__compare_records_left AS (
E             SELECT
E               *
E             FROM __splink__compare_records_left_aj2njrcn
E           ), __splink__compare_records_right AS (
E             SELECT
E               *
E             FROM __splink__compare_records_right_aj2njrcn
E           ), __splink__compare_two_records_blocked AS (
E             SELECT
E               "l"."unique_id" AS "unique_id_l",
E               "r"."unique_id" AS "unique_id_r",
E               "l"."first_name" AS "first_name_l",
E               "r"."first_name" AS "first_name_r",
E               "l"."tf_first_name" AS "tf_first_name_l",
E               "r"."tf_first_name" AS "tf_first_name_r",
E               "l"."surname" AS "surname_l",
E               "r"."surname" AS "surname_r",
E               "l"."city" AS "city_l",
E               "r"."city" AS "city_r",
E               "l"."tf_city" AS "tf_city_l",
E               "r"."tf_city" AS "tf_city_r",
E               "l"."email" AS "email_l",
E               "r"."email" AS "email_r",
E               0 AS match_key
E             FROM __splink__compare_records_left AS l
E             CROSS JOIN __splink__compare_records_right AS r
E           ), __splink__df_comparison_vectors AS (
E             SELECT
E               "unique_id_l",
E               "unique_id_r",
E               "first_name_l",
E               "first_name_r",
E               CASE
E                 WHEN "first_name_l" IS NULL OR "first_name_r" IS NULL
E                 THEN -1
E                 WHEN "first_name_l" = "first_name_r"
E                 THEN 1
E                 ELSE 0
E               END AS gamma_first_name,
E               "tf_first_name_l",
E               "tf_first_name_r",
E               "surname_l",
E               "surname_r",
E               CASE
E                 WHEN "surname_l" IS NULL OR "surname_r" IS NULL
E                 THEN -1
E                 WHEN "surname_l" = "surname_r"
E                 THEN 1
E                 ELSE 0
E               END AS gamma_surname,
E               "city_l",
E               "city_r",
E               CASE
E                 WHEN "city_l" IS NULL OR "city_r" IS NULL
E                 THEN -1
E                 WHEN "city_l" = "city_r"
E                 THEN 1
E                 ELSE 0
E               END AS gamma_city,
E               "tf_city_l",
E               "tf_city_r",
E               "email_l",
E               "email_r",
E               CASE
E                 WHEN "email_l" IS NULL OR "email_r" IS NULL
E                 THEN -1
E                 WHEN "email_l" = "email_r"
E                 THEN 1
E                 ELSE 0
E               END AS gamma_email,
E               match_key
E             FROM __splink__compare_two_records_blocked
E           ), __splink__df_match_weight_parts AS (
E             SELECT
E               "unique_id_l",
E               "unique_id_r",
E               "first_name_l",
E               "first_name_r",
E               gamma_first_name,
E               "tf_first_name_l",
E               "tf_first_name_r",
E               CASE
E                 WHEN gamma_first_name = -1
E                 THEN CAST(1.0 AS DOUBLE)
E                 WHEN gamma_first_name = 1
E                 THEN CAST(1024.0 AS DOUBLE)
E                 WHEN gamma_first_name = 0
E                 THEN CAST(0.03125 AS DOUBLE)
E               END AS bf_first_name,
E               CASE
E                 WHEN gamma_first_name = -1
E                 THEN CAST(1 AS DOUBLE)
E                 WHEN gamma_first_name = 1
E                 THEN (
E                   CASE
E                     WHEN NOT COALESCE("tf_first_name_l", "tf_first_name_r") IS NULL
E                     THEN POWER(
E                       CAST(0.000927734375 AS DOUBLE) / NULLIF(
E                         (
E                           CASE
E                             WHEN COALESCE("tf_first_name_l", "tf_first_name_r") >= COALESCE("tf_first_name_r", "tf_first_name_l")
E                             THEN COALESCE("tf_first_name_l", "tf_first_name_r")
E                             ELSE COALESCE("tf_first_name_r", "tf_first_name_l")
E                           END
E                         ),
E                         0
E                       ),
E                       CAST(1.0 AS DOUBLE)
E                     )
E                     ELSE CAST(1 AS DOUBLE)
E                   END
E                 )
E                 WHEN gamma_first_name = 0
E                 THEN CAST(1 AS DOUBLE)
E               END AS bf_tf_adj_first_name,
E               "surname_l",
E               "surname_r",
E               gamma_surname,
E               CASE
E                 WHEN gamma_surname = -1
E                 THEN CAST(1.0 AS DOUBLE)
E                 WHEN gamma_surname = 1
E                 THEN CAST(1024.0 AS DOUBLE)
E                 WHEN gamma_surname = 0
E                 THEN CAST(0.03125 AS DOUBLE)
E               END AS bf_surname,
E               "city_l",
E               "city_r",
E               gamma_city,
E               "tf_city_l",
E               "tf_city_r",
E               CASE
E                 WHEN gamma_city = -1
E                 THEN CAST(1.0 AS DOUBLE)
E                 WHEN gamma_city = 1
E                 THEN CAST(1024.0 AS DOUBLE)
E                 WHEN gamma_city = 0
E                 THEN CAST(0.03125 AS DOUBLE)
E               END AS bf_city,
E               CASE
E                 WHEN gamma_city = -1
E                 THEN CAST(1 AS DOUBLE)
E                 WHEN gamma_city = 1
E                 THEN (
E                   CASE
E                     WHEN NOT COALESCE("tf_city_l", "tf_city_r") IS NULL
E                     THEN POWER(
E                       CAST(0.000927734375 AS DOUBLE) / NULLIF(
E                         (
E                           CASE
E                             WHEN COALESCE("tf_city_l", "tf_city_r") >= COALESCE("tf_city_r", "tf_city_l")
E                             THEN COALESCE("tf_city_l", "tf_city_r")
E                             ELSE COALESCE("tf_city_r", "tf_city_l")
E                           END
E                         ),
E                         0
E                       ),
E                       CAST(1.0 AS DOUBLE)
E                     )
E                     ELSE CAST(1 AS DOUBLE)
E                   END
E                 )
E                 WHEN gamma_city = 0
E                 THEN CAST(1 AS DOUBLE)
E               END AS bf_tf_adj_city,
E               "email_l",
E               "email_r",
E               gamma_email,
E               CASE
E                 WHEN gamma_email = -1
E                 THEN CAST(1.0 AS DOUBLE)
E                 WHEN gamma_email = 1
E                 THEN CAST(1024.0 AS DOUBLE)
E                 WHEN gamma_email = 0
E                 THEN CAST(0.03125 AS DOUBLE)
E               END AS bf_email,
E               match_key
E             FROM __splink__df_comparison_vectors
E           )
E           SELECT
E             LOG(
E               2,
E               CAST(0.00010001000100010001 AS DOUBLE) * bf_first_name * bf_tf_adj_first_name * bf_surname * bf_city * bf_tf_adj_city * bf_email
E             ) AS match_weight,
E             CASE
E               WHEN bf_first_name = CAST('infinity' AS DOUBLE)
E               OR bf_tf_adj_first_name = CAST('infinity' AS DOUBLE)
E               OR bf_surname = CAST('infinity' AS DOUBLE)
E               OR bf_city = CAST('infinity' AS DOUBLE)
E               OR bf_tf_adj_city = CAST('infinity' AS DOUBLE)
E               OR bf_email = CAST('infinity' AS DOUBLE)
E               THEN 1.0
E               ELSE (
E                 CAST(0.00010001000100010001 AS DOUBLE) * bf_first_name * bf_tf_adj_first_name * bf_surname * bf_city * bf_tf_adj_city * bf_email
E               ) / NULLIF(
E                 (
E                   1 + (
E                     CAST(0.00010001000100010001 AS DOUBLE) * bf_first_name * bf_tf_adj_first_name * bf_surname * bf_city * bf_tf_adj_city * bf_email
E                   )
E                 ),
E                 0
E               )
E             END AS match_probability,
E             "unique_id_l",
E             "unique_id_r",
E             "first_name_l",
E             "first_name_r",
E             gamma_first_name,
E             "tf_first_name_l",
E             "tf_first_name_r",
E             bf_first_name,
E             bf_tf_adj_first_name,
E             "surname_l",
E             "surname_r",
E             gamma_surname,
E             bf_surname,
E             "city_l",
E             "city_r",
E             gamma_city,
E             "tf_city_l",
E             "tf_city_r",
E             bf_city,
E             bf_tf_adj_city,
E             "email_l",
E             "email_r",
E             gamma_email,
E             bf_email,
E             match_key
E           FROM __splink__df_match_weight_parts
E           
E           Error was: Binder Error: Values list "l" does not have a column named "tf_first_name"
E           LINE 11: ...from __splink__compare_records_left_aj2njrcn), 
E           
E           __splink__compare_records_rig...
E                                                              ^

splink/internals/database_api.py:75: SplinkException
Test summary output
=========================== short test summary info ============================
FAILED tests/test_realtime.py::test_realtime_cache_different_settings[duckdb] - splink.internals.exceptions.SplinkException: Error executing the following sql for table `__splink__realtime_compare_records`(__splink__realtime_compare_records_aj2njrcn):
CREATE TABLE __splink__realtime_compare_records_aj2njrcn AS
WITH __splink__compare_records_left AS (
  SELECT
    *
  FROM __splink__compare_records_left_aj2njrcn
), __splink__compare_records_right AS (
  SELECT
    *
  FROM __splink__compare_records_right_aj2njrcn
), __splink__compare_two_records_blocked AS (
  SELECT
    "l"."unique_id" AS "unique_id_l",
    "r"."unique_id" AS "unique_id_r",
    "l"."first_name" AS "first_name_l",
    "r"."first_name" AS "first_name_r",
    "l"."tf_first_name" AS "tf_first_name_l",
    "r"."tf_first_name" AS "tf_first_name_r",
    "l"."surname" AS "surname_l",
    "r"."surname" AS "surname_r",
    "l"."city" AS "city_l",
    "r"."city" AS "city_r",
    "l"."tf_city" AS "tf_city_l",
    "r"."tf_city" AS "tf_city_r",
    "l"."email" AS "email_l",
    "r"."email" AS "email_r",
    0 AS match_key
  FROM __splink__compare_records_left AS l
  CROSS JOIN __splink__compare_records_right AS r
), __splink__df_comparison_vectors AS (
  SELECT
    "unique_id_l",
    "unique_id_r",
    "first_name_l",
    "first_name_r",
    CASE
      WHEN "first_name_l" IS NULL OR "first_name_r" IS NULL
      THEN -1
      WHEN "first_name_l" = "first_name_r"
      THEN 1
      ELSE 0
    END AS gamma_first_name,
    "tf_first_name_l",
    "tf_first_name_r",
    "surname_l",
    "surname_r",
    CASE
      WHEN "surname_l" IS NULL OR "surname_r" IS NULL
      THEN -1
      WHEN "surname_l" = "surname_r"
      THEN 1
      ELSE 0
    END AS gamma_surname,
    "city_l",
    "city_r",
    CASE
      WHEN "city_l" IS NULL OR "city_r" IS NULL
      THEN -1
      WHEN "city_l" = "city_r"
      THEN 1
      ELSE 0
    END AS gamma_city,
    "tf_city_l",
    "tf_city_r",
    "email_l",
    "email_r",
    CASE
      WHEN "email_l" IS NULL OR "email_r" IS NULL
      THEN -1
      WHEN "email_l" = "email_r"
      THEN 1
      ELSE 0
    END AS gamma_email,
    match_key
  FROM __splink__compare_two_records_blocked
), __splink__df_match_weight_parts AS (
  SELECT
    "unique_id_l",
    "unique_id_r",
    "first_name_l",
    "first_name_r",
    gamma_first_name,
    "tf_first_name_l",
    "tf_first_name_r",
    CASE
      WHEN gamma_first_name = -1
      THEN CAST(1.0 AS DOUBLE)
      WHEN gamma_first_name = 1
      THEN CAST(1024.0 AS DOUBLE)
      WHEN gamma_first_name = 0
      THEN CAST(0.03125 AS DOUBLE)
    END AS bf_first_name,
    CASE
      WHEN gamma_first_name = -1
      THEN CAST(1 AS DOUBLE)
      WHEN gamma_first_name = 1
      THEN (
        CASE
          WHEN NOT COALESCE("tf_first_name_l", "tf_first_name_r") IS NULL
          THEN POWER(
            CAST(0.000927734375 AS DOUBLE) / NULLIF(
              (
                CASE
                  WHEN COALESCE("tf_first_name_l", "tf_first_name_r") >= COALESCE("tf_first_name_r", "tf_first_name_l")
                  THEN COALESCE("tf_first_name_l", "tf_first_name_r")
                  ELSE COALESCE("tf_first_name_r", "tf_first_name_l")
                END
              ),
              0
            ),
            CAST(1.0 AS DOUBLE)
          )
          ELSE CAST(1 AS DOUBLE)
        END
      )
      WHEN gamma_first_name = 0
      THEN CAST(1 AS DOUBLE)
    END AS bf_tf_adj_first_name,
    "surname_l",
    "surname_r",
    gamma_surname,
    CASE
      WHEN gamma_surname = -1
      THEN CAST(1.0 AS DOUBLE)
      WHEN gamma_surname = 1
      THEN CAST(1024.0 AS DOUBLE)
      WHEN gamma_surname = 0
      THEN CAST(0.03125 AS DOUBLE)
    END AS bf_surname,
    "city_l",
    "city_r",
    gamma_city,
    "tf_city_l",
    "tf_city_r",
    CASE
      WHEN gamma_city = -1
      THEN CAST(1.0 AS DOUBLE)
      WHEN gamma_city = 1
      THEN CAST(1024.0 AS DOUBLE)
      WHEN gamma_city = 0
      THEN CAST(0.03125 AS DOUBLE)
    END AS bf_city,
    CASE
      WHEN gamma_city = -1
      THEN CAST(1 AS DOUBLE)
      WHEN gamma_city = 1
      THEN (
        CASE
          WHEN NOT COALESCE("tf_city_l", "tf_city_r") IS NULL
          THEN POWER(
            CAST(0.000927734375 AS DOUBLE) / NULLIF(
              (
                CASE
                  WHEN COALESCE("tf_city_l", "tf_city_r") >= COALESCE("tf_city_r", "tf_city_l")
                  THEN COALESCE("tf_city_l", "tf_city_r")
                  ELSE COALESCE("tf_city_r", "tf_city_l")
                END
              ),
              0
            ),
            CAST(1.0 AS DOUBLE)
          )
          ELSE CAST(1 AS DOUBLE)
        END
      )
      WHEN gamma_city = 0
      THEN CAST(1 AS DOUBLE)
    END AS bf_tf_adj_city,
    "email_l",
    "email_r",
    gamma_email,
    CASE
      WHEN gamma_email = -1
      THEN CAST(1.0 AS DOUBLE)
      WHEN gamma_email = 1
      THEN CAST(1024.0 AS DOUBLE)
      WHEN gamma_email = 0
      THEN CAST(0.03125 AS DOUBLE)
    END AS bf_email,
    match_key
  FROM __splink__df_comparison_vectors
)
SELECT
  LOG(
    2,
    CAST(0.00010001000100010001 AS DOUBLE) * bf_first_name * bf_tf_adj_first_name * bf_surname * bf_city * bf_tf_adj_city * bf_email
  ) AS match_weight,
  CASE
    WHEN bf_first_name = CAST('infinity' AS DOUBLE)
    OR bf_tf_adj_first_name = CAST('infinity' AS DOUBLE)
    OR bf_surname = CAST('infinity' AS DOUBLE)
    OR bf_city = CAST('infinity' AS DOUBLE)
    OR bf_tf_adj_city = CAST('infinity' AS DOUBLE)
    OR bf_email = CAST('infinity' AS DOUBLE)
    THEN 1.0
    ELSE (
      CAST(0.00010001000100010001 AS DOUBLE) * bf_first_name * bf_tf_adj_first_name * bf_surname * bf_city * bf_tf_adj_city * bf_email
    ) / NULLIF(
      (
        1 + (
          CAST(0.00010001000100010001 AS DOUBLE) * bf_first_name * bf_tf_adj_first_name * bf_surname * bf_city * bf_tf_adj_city * bf_email
        )
      ),
      0
    )
  END AS match_probability,
  "unique_id_l",
  "unique_id_r",
  "first_name_l",
  "first_name_r",
  gamma_first_name,
  "tf_first_name_l",
  "tf_first_name_r",
  bf_first_name,
  bf_tf_adj_first_name,
  "surname_l",
  "surname_r",
  gamma_surname,
  bf_surname,
  "city_l",
  "city_r",
  gamma_city,
  "tf_city_l",
  "tf_city_r",
  bf_city,
  bf_tf_adj_city,
  "email_l",
  "email_r",
  gamma_email,
  bf_email,
  match_key
FROM __splink__df_match_weight_parts

Error was: Binder Error: Values list "l" does not have a column named "tf_first_name"
LINE 11: ...from __splink__compare_records_left_aj2njrcn), 

__splink__compare_records_rig...
                                                   ^
= 1 failed, 321 passed, 12 skipped, 395 deselected, 16 warnings in 116.40s (0:01:56) =
Environment info (3.9.20)
Package operations: 36 installs, 0 updates, 1 removal

  - Removing setuptools (75.5.0)
  - Installing six (1.16.0)
  - Installing attrs (24.2.0)
  - Installing python-dateutil (2.9.0.post0)
  - Installing rpds-py (0.20.1)
  - Installing referencing (0.35.1)
  - Installing jsonschema-specifications (2023.12.1)
  - Installing markupsafe (2.1.5)
  - Installing numpy (1.24.4)
  - Installing pytz (2024.2)
  - Installing tzdata (2024.2)
  - Installing exceptiongroup (1.2.2)
  - Installing greenlet (3.1.1)
  - Installing iniconfig (2.0.0)
  - Installing jinja2 (3.1.4)
  - Installing jsonschema (4.23.0)
  - Installing packaging (24.2)
  - Installing narwhals (1.14.1)
  - Installing pandas (2.2.3)
  - Installing pluggy (1.5.0)
  - Installing py4j (0.10.9.7)
  - Installing pyarrow (17.0.0)
  - Installing texttable (1.7.0)
  - Installing tomli (2.1.0)
  - Installing typing-extensions (4.12.2)
  - Installing altair (5.4.1)
  - Installing duckdb (1.1.3)
  - Installing igraph (0.11.8)
  - Installing networkx (3.1)
  - Installing psycopg2-binary (2.9.10)
  - Installing pyspark (3.5.3)
  - Installing pytest (8.3.3)
  - Installing rapidfuzz (3.9.7)
  - Installing ruff (0.4.10)
  - Installing sqlalchemy (2.0.36)
  - Installing sqlglot (25.31.4)
  - Installing tabulate (0.9.0)
ADBond added the testing label Nov 20, 2024
ADBond (Contributor, Author) commented Nov 25, 2024

See also this run (Python 3.13 + Postgres): (psycopg2.errors.UndefinedColumn) column l.tf_first_name does not exist.

ADBond (Contributor, Author) commented Nov 25, 2024

It looks like the test tests/test_realtime.py::test_realtime_cache_different_settings is trying to run SQL that relates to the settings in the earlier tests (either of them - they should generate the same SQL).

It's likely to do with the SQL caching mechanism. I've looked through the stack a bit to try to figure out what code might be non-deterministic. My current hypothesis is based around the fact that we use id(settings) to generate a cache key:

  • in test_realtime_cache_multiple_records or test_realtime_cache_two_records we define a settings object that has tf adjustments on first_name and city. Let's say this settings object has an id of 123
  • we call compare_records. The SQL gets cached using the key 123
  • the test finishes, and the settings object gets garbage collected
  • test_realtime_cache_different_settings defines new settings objects that do not have term frequency adjustments at all (and in fact each have fewer columns). One of these settings objects is assigned an id of 123 by Python (as the previous settings object has been garbage collected, this id is no longer reserved, and Python decides to re-use it)
  • we call compare_records with use_sql_from_cache=True. Even on the first call we get a cache hit, as the cache is a global object, and it has a SQL entry corresponding to the key 123. When we try to run this SQL, we get an error, as the input data lacks the corresponding term-frequency columns (we don't need them!)

So far this is just a theory. I think this scenario should be possible, but I haven't managed to generate an id collision like this locally in a toy example. I also haven't managed to recreate the test failure locally. If this is the cause then it may well be down to differences between my environment and CI (which could well affect gc behaviour and memory layout).

If this is the issue, then it should be relatively easy to fix with a slight tweak to the caching strategy (and maybe we should do this anyway). But it would be nice to get to the bottom of it first, as this kind of failure could mask more subtle bugs. If, for instance, the only difference between the two settings objects were in the comparison levels (or even just the parameter values!), then the 'incorrect' SQL would still execute. If we were only dealing with the match_probability (rather than looking at individual match weights or other intermediate columns) we might never become aware that we are in fact running a different model from the one we thought.
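
To make the failure mode concrete, here is a minimal sketch of an id-keyed cache of this shape (build_sql and run are hypothetical placeholders, not the actual Splink internals):

_sql_cache: dict[int, str] = {}

def compare_records_cached(settings, use_sql_from_cache=False):
    key = id(settings)  # ids are only unique among *currently live* objects
    if use_sql_from_cache and key in _sql_cache:
        # if a dead object's id has been recycled, this is a stale hit:
        # we run SQL that was generated for a different settings object
        return run(_sql_cache[key])
    sql = build_sql(settings)
    _sql_cache[key] = sql
    return run(sql)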

ADBond (Contributor, Author) commented Nov 25, 2024

Note this line from the Python docs on id():

Two objects with non-overlapping lifetimes may have the same id() value.

which suggests that this mechanism is at least plausible. How likely it is I don't really have a sense of.
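
A toy illustration of that caveat (CPython-specific, and allocator-dependent - the final check may well print False on any given run):

import gc

class Settings:
    pass

old = Settings()
old_id = id(old)
del old       # the only reference dies
gc.collect()  # make sure the object is gone (CPython usually frees it immediately)

new = Settings()          # may be allocated at the same address...
print(id(new) == old_id)  # ...in which case this prints True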

ADBond (Contributor, Author) commented Nov 25, 2024

Okay, I believe that this is in fact what is occurring. I have run tests repeatedly on my fork in CI to get a failure, with some additional logging in place. This run has a failure, and you can see in the logs that a settings id has been re-used. In particular, the settings object instantiated in test_realtime_cache_multiple_records has an id of 140110989126384. Then in test_realtime_cache_different_settings the second settings object is also assigned the id 140110989126384 (the previous object having fallen out of scope, and apparently been garbage collected in the meantime), which means we look up the wrong SQL, and thus the test fails:

tests/test_realtime.py::test_realtime_cache_multiple_records[duckdb] 
-------------------------------- live log call ---------------------------------
WARNING  root:realtime.py:94 Settings object had id: 140110989126384
WARNING  splink.internals.realtime:realtime.py:38 Setting cache for 140110989126384
WARNING  root:realtime.py:94 Settings object had id: 140110989126384
WARNING  splink.internals.realtime:realtime.py:38 Setting cache for 140110989126384
WARNING  root:realtime.py:94 Settings object had id: 140110989126384
WARNING  splink.internals.realtime:realtime.py:27 Getting cache for 140110989126384
WARNING  root:realtime.py:94 Settings object had id: 140110989126384
WARNING  splink.internals.realtime:realtime.py:27 Getting cache for 140110989126384
WARNING  root:realtime.py:94 Settings object had id: 140110989126384
WARNING  splink.internals.realtime:realtime.py:38 Setting cache for 140110989126384
WARNING  root:realtime.py:94 Settings object had id: 140110989126384
WARNING  splink.internals.realtime:realtime.py:27 Getting cache for 140110989126384
PASSED                                                                   [ 66%]
tests/test_realtime.py::test_realtime_cache_different_settings[duckdb] 
-------------------------------- live log call ---------------------------------
WARNING  root:realtime.py:94 Settings object had id: 140110989243008
WARNING  splink.internals.realtime:realtime.py:38 Setting cache for 140110989243008
WARNING  root:realtime.py:94 Settings object had id: 140110989126384
WARNING  splink.internals.realtime:realtime.py:27 Getting cache for 140110989126384
FAILED     

For reference, here is the equivalent portion of the logs for a non-failing test run - in this case the object gets an id distinct from those in previous tests, so we do not get a cache hit, and construct the SQL afresh:

tests/test_realtime.py::test_realtime_cache_multiple_records[duckdb] 
-------------------------------- live log call ---------------------------------
WARNING  root:realtime.py:94 Settings object had id: 139678152719376
WARNING  splink.internals.realtime:realtime.py:38 Setting cache for 139678152719376
WARNING  root:realtime.py:94 Settings object had id: 139678152719376
WARNING  splink.internals.realtime:realtime.py:38 Setting cache for 139678152719376
WARNING  root:realtime.py:94 Settings object had id: 139678152719376
WARNING  splink.internals.realtime:realtime.py:27 Getting cache for 139678152719376
WARNING  root:realtime.py:94 Settings object had id: 139678152719376
WARNING  splink.internals.realtime:realtime.py:27 Getting cache for 139678152719376
WARNING  root:realtime.py:94 Settings object had id: 139678152719376
WARNING  splink.internals.realtime:realtime.py:38 Setting cache for 139678152719376
WARNING  root:realtime.py:94 Settings object had id: 139678152719376
WARNING  splink.internals.realtime:realtime.py:27 Getting cache for 139678152719376
PASSED                                                                   [ 66%]
tests/test_realtime.py::test_realtime_cache_different_settings[duckdb] 
-------------------------------- live log call ---------------------------------
WARNING  root:realtime.py:94 Settings object had id: 139678152854448
WARNING  splink.internals.realtime:realtime.py:38 Setting cache for 139678152854448
WARNING  root:realtime.py:94 Settings object had id: 139678152852816
WARNING  splink.internals.realtime:realtime.py:38 Setting cache for 139678152852816
WARNING  root:realtime.py:94 Settings object had id: 139678152854448
WARNING  splink.internals.realtime:realtime.py:27 Getting cache for 139678152854448
PASSED  

ADBond self-assigned this Nov 25, 2024
RobinL (Member) commented Nov 26, 2024

Thanks for looking into this, and excellent detective work! I checked back on this one because I just got the same failure here.

The reason I used id(settings) was simply that it seemed like a low-overhead way of uniquely identifying a settings object. Obviously it didn't work as anticipated, but any other mechanism should do. It was just to avoid caching errors in the case where the user is making predictions from two different models - leading to the potential for one model to use the cached SQL from the other.

ADBond (Contributor, Author) commented Nov 26, 2024

Update on this, to document my thinking:
Currently I have a small partial fix for this. Instead of passing in the id, we pass in the object itself. Internally we still use the id as a key, but alongside the SQL and uid we store a weak reference to the object itself. When we get a would-be cache hit, if the referent is dead we delete the cache entry and return nothing. The trouble with this is that:

  • it won't work for dict or str settings, as they can't be weakreffed, so we would need to handle these differently in some fashion
  • this feels like a bit of an indirect approach

I've run it with some hooks to confirm that, in cases where this issue would have arisen, it fixes things.

I could merge that in now, and it would solve the intermittent test failures. But as it's not currently a blocking issue, I'm going to explore another option based on hashing (snapshots of) the objects. I realise it's probably a fairly niche bug, but I think it would be useful to try to get this right, as there is potential for quite a subtle (and thus hard-to-notice) error here.
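
For reference, a minimal sketch of the weakref-guarded lookup described above (names illustrative, not the actual internals):

import weakref

# key: id(settings); value: (sql, weak reference to the settings object)
_cache: dict[int, tuple[str, weakref.ref]] = {}

def set_cached_sql(settings, sql: str) -> None:
    # weakref.ref raises TypeError for dict/str inputs - the limitation noted above
    _cache[id(settings)] = (sql, weakref.ref(settings))

def get_cached_sql(settings):
    entry = _cache.get(id(settings))
    if entry is None:
        return None
    sql, ref = entry
    if ref() is not settings:     # referent dead, or id recycled by a new object
        del _cache[id(settings)]  # evict the stale entry
        return None
    return sql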

RobinL (Member) commented Nov 26, 2024

Interesting - the weakref idea is definitely an improvement, and feels like the correct/principled solution for the object case; it's a shame about the dict and str issue.

I agree that it's worth solving this 'properly'

Could we simply set our own uuid __settings_id on the settings object, check for its existence, and use that as the cache key if it exists?

  • This feels like it works for SettingsCreator, Settings, dict
  • If the input is a path/str, could we use the path/str itself as the settings id?

But, I appreciate it's a bit hacky!
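
Something like this rough sketch, say (attribute and helper names hypothetical):

import uuid

def settings_cache_key(settings):
    # unlike id(), a uuid can never be recycled after garbage collection;
    # works for objects - a plain dict would need the key stored as an entry instead
    if not hasattr(settings, "__settings_id"):
        settings.__settings_id = uuid.uuid4().hex
    return settings.__settings_id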

The loading directly from a path scenario is interesting because it's potentially a common production use case. The simplicity of 'if settings is a string, use it as a key' seems potentially nice - easy to understand and high performance.

(apologies for the mistake in the first place - I had assumed that because the id is a big number, the chance of collision was negligible, but I can see now why that's not the case)

Or maybe it's just:

  • if dict or string, use hash
  • if obj, use weakref.

That doesn't feel too bad...

ADBond (Contributor, Author) commented Dec 3, 2024

I had thought about using a __settings_id, but I'm a bit wary, as I feel you could get into some weird edge cases. For example, if a user copied a SettingsCreator so that they could tweak a comparison, we'd have to override the default copying behaviour to avoid surprises. It might be a sensible way to go eventually, but for now I can't quite work through the implications.

The other, separate issue I wonder about with loading from a path is that the file might change between calls. So you get the SQL, then another process updates the model, but you still get the cached SQL because the path is the same. I'm not sure for the moment how likely this kind of scenario is, though.

I don't think there's a great deal we can do about that - we could read the file in and use a hash of its contents, but then we have a disk read every time, which probably isn't ideal if we are aiming for high performance. I suppose we could query the file metadata for the last-modified time (which is maybe cached?), but I don't know off the top of my head whether that is still too expensive. Or possibly leave it, and just emit a warning. At any rate I won't do anything here, as I think it is really a separate (albeit related) concern.
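
For the record, the last-modified-time option might look something like this sketch (os.stat is a single metadata syscall rather than a full read; the function name is hypothetical):

import os

def path_cache_key(path: str) -> tuple:
    # pair the path with its last-modified time, so that rewriting the
    # model file invalidates any cached SQL keyed on the old version
    return (path, os.stat(path).st_mtime_ns)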

Or maybe it's just:

  • if dict or string, use hash
  • if obj, use weakref.

That doesn't feel too bad...

My current feeling is that this is probably the best way, and we can get something merged. I will keep it in the back of my mind though, and will follow up with some additional testing to try to cover some potential edge cases.

ADBond (Contributor, Author) commented Jan 16, 2025

After this fell by the wayside I am keen to get it over the line, as it keeps coming up and is pretty annoying.

The complexity is the dict. We can't easily use a hash, as it is not directly hashable. Furthermore there is a lot of ambiguity in the things that live inside it - we can have e.g. comparisons that are dicts, or ComparisonCreators, and even if they are dicts the levels can be a mix of ComparisonLevelCreators. So we would need to convert everything to native Python types and jsonify it in a canonical way, or maybe define some notion of hashing on our custom objects (which feels iffy). All of that feels like quite a lot of machinery for this one issue.

I'm tempted to go down the __settings_id route, but I'm not sure it's so simple in the case of a dict (which is where we really need it). I think it gets a bit messy, as to be effective we need to reflect this id back into the actual dict (and it always feels a bit strange modifying objects that users supply, especially in cases like this where it might not be expected). As we process dicts together with paths (which we'll be treating differently, as the dict won't be persistent) it's all a bit tangled.

Another option (which I don't love) is just turning off the caching mechanism when users supply settings as a dict (along with a warning). This is a very specific use case, and we are generally encouraging the use of SettingsCreator over a dict anyway, so I'm not sure it's unreasonable to impose a small limitation on users in this one instance. The function will still work; they just won't have access to caching.

Or maybe we do the opposite and always retrieve from the cache when we are dealing with dicts. That way users can still use dicts at high performance, with the caveat that if they are dealing with multiple settings they might get the wrong SQL. That feels like less of an impact to work around, but potentially more damaging for users who ignore or do not read the warning (although who would ever ignore a warning like that!).

I'm writing this out partly to help me remember what the complications are, in case other things come up and it takes me a while to get back to this.

RobinL (Member) commented Jan 16, 2025

Thanks - yeah would be good to get this over the line.

I agree we want to avoid:

a lot of machinery for this one issue.

Overall I'm inclined towards simpler but less complete solutions that are bug-free, rather than something more complex that works in every circumstance.

So I think I'd err on the side of:

just turning off the caching mechanism when users supply settings as a dict (along with a warning)

I think what matters is that it's possible to get this performance boost - so that people embedding Splink into realtime systems have at least one option that gives them great performance. Such people should be 'deep' enough into Splink that they've read any warnings we emit.

I think it's much less important that it always works for a typical user just playing about with Splink in say a Jupyter notebook.

ADBond (Contributor, Author) commented Jan 20, 2025

The complexity is the dict. We can't easily use a hash, as it is not directly hashable. Furthermore there is a lot of ambiguity in the things that live inside it - we can have e.g. comparisons that are dicts, or ComparisonCreators, and even if they are dicts the levels can be a mix of ComparisonLevelCreators. So we would need to convert everything to native Python types and jsonify it in a canonical way, or maybe define some notion of hashing on our custom objects (which feels iffy). All of that feels like quite a lot of machinery for this one issue.

Just for future reference: it turns out I had already done the work to allow dicts to be fully converted to native Python types, so the solution uses this plus a JSON dump as a key.
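
A minimal sketch of that keying approach, assuming the dict has already been reduced to native Python types (the function name is hypothetical):

import json

def settings_cache_key(settings_dict: dict) -> str:
    # sort_keys makes the dump canonical, so two dicts with the same
    # content produce the same key regardless of insertion order
    return json.dumps(settings_dict, sort_keys=True)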
