
[FEAT] Exact match level required for TF adjustment - can this be avoided? #2006

Open
samkodes opened this issue Feb 28, 2024 · 22 comments
Labels
enhancement New feature or request

Comments

@samkodes
Contributor

samkodes commented Feb 28, 2024

Is your proposal related to a problem?

I have an application where I wish to link dataset A, representing noisy transactions, to dataset B, a master entity list. For dataset B (the master list) I am using arrays to represent multi-valued fields (e.g. known alternate office phone numbers); however dataset A (the transactions) has only single values in the corresponding fields (arrays of length 1). Because office phone numbers can be shared between multiple people, I would like to use term frequency adjustments. I can use dataset B, my master entity list, to calculate a user-supplied TF table for the relevant fields (exploding the multi-valued fields for an accurate count), which will then be applied to the singleton arrays in dataset A. In my application any match between a phone number in dataset A and any of the entries in the corresponding array field in dataset B counts as a match; so, to compare arrays, I am using array intersection functions with any non-zero intersection as my single non-null, non-"else", comparison level.

The problem is that in order to use TF adjustments, this code looks for u-probabilities corresponding to an "exact match" level in the comparison. An "exact match" seems to be identified via this code, which seems to look for exact field comparisons in the SQL. Since I am using array intersections, I do not have an "exact match" level, and an error is thrown - "An exact match level is required to make a term frequency adjustment on a comparison level that is not an exact match."

One alternative is to add an additional exact match level to the model. The exact match will only happen when my multi-valued field in my master list (dataset B) happens to have a single value. However, this produces 2 match levels in my model, with distinct parameter estimates, when I only want one.

After a lot of thinking, I think I understand why this u-probability is needed. In the term frequency adjustment math here the reference u-probability is just a presentation convenience - the u-probabilities cancel out but are explicitly added to ensure that a term frequency adjustment can be represented as an addition to the weight that would occur without term frequency adjustment. In the calling code here, the exact-match u-probability is used to create the term frequency adjustment. In the exact matching case another identical u-probability must be calculated somewhere else that cancels out.

However, if the match is inexact, I am assuming that in general the u-probability that is calculated elsewhere is for the inexact match. TF adjustments should modify u-probabilities based on the exact field value distribution, regardless of whether the match is exact or not. My reasoning is that value multiplicity in the dataset is the relevant factor - if I have 10 "Sam"s in dataset B I am 10 times as likely to u-match "Sam" in dataset A but also 10 times as likely to fuzzy u-match "Spam" in dataset A, conditional on having value "Sam" in dataset B. However, we need something to adjust by the multiplicity 10. For exact matching, we multiply the multiplicity by 1/N to get T_x, because for exact matching in the absence of duplicates u_x is ~1/N. The problem is estimating u_x, in the absence of duplicate values, for a fuzzy match - i.e. we don't know in general how many records will be hit by a fuzzy match, and we can't estimate this for every x by random sampling (too many samples!). So instead we use an "inflation factor" of u-inexact/u-exact and multiply this by T_x. That's the behaviour here if u-probabilities don't cancel.
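To make that concrete, here is a minimal numerical sketch of the arithmetic described above (all numbers invented for illustration; this is my reading of the current behaviour, not a quote of Splink internals):

import math

# Invented illustrative values
m_inexact = 0.05   # m probability for the fuzzy level
u_exact = 0.005    # average u for an exact match
u_inexact = 0.02   # average u for the fuzzy level
T_x = 0.001        # term frequency of the observed value x

# Current behaviour as I understand it: scale T_x by the "inflation factor"
# u_inexact / u_exact to get a term-specific u for the fuzzy level
u_x = T_x * (u_inexact / u_exact)

# The match weight can be written directly ...
weight_direct = math.log2(m_inexact / u_x)

# ... or as "average weight + TF adjustment", which is why the exact-match u
# appears in the adjustment term: the u_inexact factors cancel
weight_decomposed = math.log2(m_inexact / u_inexact) + math.log2(u_exact / T_x)

assert abs(weight_direct - weight_decomposed) < 1e-9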

The question then is how could Splink get the u-exact probability if, as in my case, there is no exact match level in the model. One option would be to add extra code to sample u-exact probabilities any time there is a TF adjustment (if not already doing so), and store these separately somewhere for retrieval as needed.

Describe the solution you'd like

Rather than demanding a corresponding exact match level in all comparisons that use TF adjustments, can Splink simply pre-compute and store u-exact probabilities separately whenever TF adjustment is required by a model? This may be justified by the argument above - that the u-inexact/u-exact inflation factor is an approximation that allows TF adjustments to be used for inexact match levels, but that there are domains in which an exact match is not meaningfully different from certain kinds of inexact matches.

EDIT: After rethinking, this approach won't work for my application but may have other uses; see response below.

Describe alternatives you've considered

One alternative is to simply define an additional exact match level. The exact match will only happen when my multi-valued field in my master list (dataset B) happens to have a single value. However, this produces 2 match levels in my model, with distinct parameter estimates, when I only want one (i.e., I have a priori knowledge that these two levels do not provide different levels of match evidence). Moreover, the effective TF adjustments for the 2 match levels are different (because they have different inflation factors), behaviour which I don't want.

Additional context

@samkodes samkodes added the enhancement New feature or request label Feb 28, 2024
@samkodes
Contributor Author

samkodes commented Feb 28, 2024

After sleeping on it, I now think the proposed solution won't work for my application, though it does make sense for fuzzy matches like Damerau-Levenshtein.

In this case I really want u_x = T_x (term frequency), without any approximation by the inflation factor u-inexact/u-exact.

To achieve this a different solution makes sense and would be simple: simply add a custom setting in each comparison level that uses TF adjustment called "disable_tf_inflation_factor" (settable by dictionary specification to begin with, though could be added to the comparison libraries later). When that setting is on, splink should simply use u-inexact (rather than seeking u-exact) in the TF adjustment weight to cancel out the u-inexact used elsewhere, effectively setting the inflation factor to 1.

An alternative would be to allow manual specification, again through dictionary specification, of a match level to take the role of the "exact match" level, rather than using SQL parsing. This would allow any level to be used as the reference level for the TF inflation factor. I could imagine this being helpful if I were to add a fuzzy array match implementation as another match level to my model; I would then use the exact array intersection match as the reference level. Perhaps a flag called "tf_inflation_reference_level".
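For illustration, a comparison level using these proposed flags might look something like the sketch below. The column name, labels, and array-intersection SQL are placeholders for my use case (the exact SQL depends on the backend), and the two flags are only the names proposed above and in the PR - they are not part of released Splink:

any_intersection_level = {
    "sql_condition": (
        'array_length(array_intersect("phone_numbers_l", "phone_numbers_r")) > 0'
    ),
    "label_for_charts": "Any shared phone number",
    "tf_adjustment_column": "phone_numbers",
    "tf_adjustment_weight": 1.0,
    # Proposed flag: skip the u-inexact/u-exact inflation factor entirely,
    # i.e. use u_x = T_x for this level
    "disable_tf_inflation_factor": True,
    # Alternative proposed flag: point at another level to act as the
    # "exact match" reference for the inflation factor
    # "tf_inflation_reference_level": "Exact array match",
}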

@samkodes
Contributor Author

samkodes commented Mar 2, 2024

I've implemented the second and third approaches above, using custom comparison level settings flags, and set up a pull request. #2020

@RobinL
Member

RobinL commented Mar 3, 2024

Thanks for this and the PR! At a high level:

  • Getting tf adjustments to work for array based columns is a much requested feature
  • We have had various thoughts before about how to do it, but have never settled on a good solution

You can see below I've had a bit of a think about this and here are some initial notes. I ran out of time a bit because it gets pretty complicated and was making my brain hurt a bit. I hope to find a bit more time to think about this a bit more.

For now, I at least think I have most of the key concepts in my head at the moment, but would be interested in talking a bit more about exactly how you want this to work so we can think through options for potential solutions. I'd like to think about whether there are other ways of getting the functionality you want

Would you be up for a chat? - feel free to reach out at [email protected].


I'll start by writing a few notes on how term frequency adjustments currently work for clarity (and partly to remind myself!).

Starting from the basics, the u probability is a measure of the likelihood of coincidences occurring, whereby information matches amongst non-matching records. For instance, an exact match on first name for two different people.

We can measure an average value for the u probability by taking a large number of non-matching pairs, and calculating how often first name matches.

Let's say that the average u value is 1%.

However, this value does not account for skew - for example, it's much more likely to observe a match on 'John' amongst non-matching records than a match on 'Robin'. Let's say Johns make up 10% of the population and Robins make up 0.1% of the population, so the term-specific u values are 0.1 and 0.001 respectively.

So:

  • taking two non-matching records at random, 1% of the time the first name matches by coincidence
  • but if the first name is John on the first record, and we take a random non-matching record, 10% of the time the first name matches, because the second record is also named John *
  • Is that right? I wasn't quite sure how to phrase this as a conditional probability. You can't condition on both sides of the match being John!!

Current Splink implementation

In Splink we currently measure:

  • average u (1% in the above example)
  • term specific u (10% for John, 0.1% for Robin)

In the case of an exact match, we use the term specific u rather than the average u. As an implementation detail, this is calculated as the average u, with an additional adjustment for the delta between the average u and the term specific u. The reason is to make it easier to represent the term frequency adjustment graphically (John can be shown as a downward adjustment of match weight relative to average, Robin can be shown as an upward adjustment relative to average).

Implementation for fuzzy match levels

In the case of a fuzzy match like levenshtein, it's harder to define the term frequency adjustment because there are two terms. If the match is Jhon vs John, at a levenshtein distance of 1, what term should be used to make the adjustment?

The decision made within Splink is to take the more frequent of the terms, on the basis that one of the two terms is a typo or mistake and therefore is probably uncommon. So in the case of Jhon vs John, we make the term frequency adjustment on the basis of John.

We can now see a second reason why it's useful to apply the term frequency adjustment as a delta between the average u and the term specific u: the levenshtein level has a different u value than the exact match level, but we can still apply the adjustment to this alternative u value.

However, the strength of the adjustments can be quite high, and it's unclear whether they strictly apply in the case of a fuzzy match. 'John' vs 'Jhon' is probably a typo. Bryan vs Ryan are just two separate names, so it's much less clear how a term frequency adjustment should apply in this case. So we allow a second tf_adjustment_weight of between 0 and 1, which reduces the size of the adjustment. 1 = full adjustment, 0 = no adjustment, and you can pick any value in between to 'partially' apply the adjustment.
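As a small worked example of the delta idea and the partial weight (invented numbers; this mirrors the POW(average_u / term_frequency, weight) pattern that appears in the SQL later in this thread):

import math

u_avg_exact = 0.01   # average u for an exact match on first_name
tf_john = 0.10       # term frequency of "John"
tf_robin = 0.001     # term frequency of "Robin"

# The adjustment is a delta relative to the average u, so John pulls the
# match weight down and Robin pushes it up
adj_john = math.log2(u_avg_exact / tf_john)    # about -3.3
adj_robin = math.log2(u_avg_exact / tf_robin)  # about +3.3

# tf_adjustment_weight scales the Bayes factor adjustment: 1 = full, 0 = none
tf_adjustment_weight = 0.5
bf_adj_john_partial = (u_avg_exact / tf_john) ** tf_adjustment_weight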

Why do we need to find the exact match level to apply tf adjustments to a fuzzy level?

In order to apply tf adjustments to a fuzzy level we need to know: the delta between average u and term specific u. More specifically we need a value for:

  • The term specific u. This is fairly straightforward because it comes directly from the term frequency table. (Albeit with some logic to pick the most common term from Jhon and John in the table)
  • The average u to form the basis of the delta. Since the term frequency adjustment table is based on 'exact match', the average u value should be on the same basis.

How could this be used for an array based column?

It's a bit unclear how to define this when the size of the intersection of the array may be 1 or more. Possibly a bit simpler if you somehow just take the least common term in the array.

  • It's possible to calculate array based term frequencies and register the table manually, see here (a rough sketch of the 'explode and count' step follows after this list)

  • But not obvious how to compute the delta because need to:

    • Perform the tf lookup for each item in the intersection of the array
    • Compare to an average value - a bit unclear how to define the average
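As a rough sketch of the 'explode and count' step for an array column (toy data; how the resulting table would then be registered with Splink is left aside):

from collections import Counter

# Toy master list: each record has an array of alternate phone numbers
master_phone_arrays = [
    ["555-0001", "555-0002"],
    ["555-0002"],
    ["555-0003"],
]

# Explode the arrays and count each value once per record it appears in
flat = [value for arr in master_phone_arrays for value in arr]
counts = Counter(flat)
n = len(flat)  # or len(master_phone_arrays), depending on the normalisation you want

tf_table = {value: c / n for value, c in counts.items()}
# e.g. tf_table["555-0002"] == 0.5; this dict would then be supplied as a
# user-defined term frequency table for the column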

@samkodes
Contributor Author

samkodes commented Mar 3, 2024

Hi Robin - thanks for the detailed reply.

First, I should clarify that my thinking on TF is a lot clearer for the link-only setting, rather than the de-duplicate setting. I think this is because in the de-dupe setting there is no information about the duplicate-generating process, which makes inference using TF (whether a-priori or empirical estimates) involve a lot of assumptions. If duplicate records are far rarer than duplicate values, however, probably a similar analysis will apply???

In a link-only setting it seems natural to condition the analysis for any pair on the value in one of the datasets (e.g. the left). This is because the link-only task can be thought of as "for each record in left, evaluate the probability of match with each record in right". With this perspective, TFs should be calculated using the right dataset only, and we assume that there are no duplicates in the right dataset. This implies no assumptions on the data generating process for the left data set, except that each left record is independent of all others (no 1-1 match is enforced, etc; in other words duplicates in the left, but not right, dataset are allowed). Of course left and right could be swapped, but then the probability model and data-generating process assumptions would change. In other words, it seems to me that the appropriate probability model is inevitably tied to the definition of the linking task; hence my confusion regarding the de-duplication task. (Our typical assumption is that the m-probabilities do not depend on the precise value but instead reflect generic error processes that apply uniformly across the dataset, so they don't have to be brought into the analysis at all)

I prefer to think of the TF adjustment inflation for inexact (or fuzzy) matches as a scaling factor applied to TF (this is what's happening before taking logs and separating out an additive weight adjustment). This makes clear that it is simply the ratio u-inexact (or u-fuzzy)/u-exact.

How this applies to array-based matches depends on what the user wants; again, I don't think there is a general solution.

It is perhaps clearer to start by not thinking about TF "inflation" in this case - rather just try to think about u_x directly and what that should be.

For example, in my case I am looking for any array intersection at all. Because I treat the dataset whose arrays may have length greater than 1 as my "master list" (right dataset), and the dataset with singleton arrays as my (possibly duplicated) left dataset, it makes sense to define u_x as the TF for the single element on the left (with TFs calculated based on the RIGHT dataset, though). If I had left and right reversed, so my "master list" had single-valued arrays and the other dataset had multi-valued arrays, I might want to define u_x as the sum of the TFs in the multi-valued array (because a match on any of them would count, so the probability of a coincidence adds up).

If a match level requires, say, a 2-element match, then u_x would have to be defined in terms of the probability of seeing any 2-element subset in the "master list", and implementing this would be challenging unless the other data set only had arrays of length 2 (so we'd know which 2-element subset we were talking about) - otherwise the combinatorial analysis becomes more complicated ...

I will email you to set up a chat and send you a notebook showing how I applied the fork approach to one of the sample data sets; that may make this example clearer.

@samkodes
Contributor Author

samkodes commented Mar 3, 2024

Another example:

Even if an array match level only requires 1 intersection, the analysis of u_x becomes complicated as soon as both arrays can have more than 1 entry.

Above I suggested that when at most 1 array could have more than 1 entry, there were two cases depending on which was the "master list" - either u_x=TF of the singleton (when "master list" has multiple entries) or u_x=sum of TF's in the multi-valued array (when "master list" has singleton entries).

When both arrays can have more than 1 entry, the upper bound of u_x = sum of TF's in the non-master entry, and the lower bound of u_x = maximum TF of the non-master entry. But the sum of TF's figure overcounts when several of its entries are found in a single "master list" entry. A combinatorial analysis could perhaps make this more precise and suggest cheap estimates of this overcount; alternatively a model could be used for the array-generating process (e.g. Poisson for array size, followed by independent selection of entries --- but then results depend on the model assumptions, and array entries are unlikely to be independent! Moreover under an independence model overcounting should be a very minor issue and may be ignored unless arrays are very long or some terms are incredibly frequent .... )

@RobinL
Member

RobinL commented Mar 4, 2024

Still struggling a bit to get my head around this.

Maybe looking at the sql, or taking specific examples, will help clarify ideas. Would you maybe be able to look at the following example, and use it to construct a new example of the behaviour you want? I don't mean to suggest I'm asking you to write full sql, just to give me an idea of the modification you're looking for in the SQL and how it would apply?

Here's a test script:

Script to view sql
import logging

from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.comparison_library import (
    exact_match,
)
from splink.duckdb.linker import DuckDBLinker

df = splink_datasets.fake_1000


fuzzy_first_name_with_tf_adj = {
    "output_column_name": "first_name",
    "comparison_levels": [
        {
            "sql_condition": '"first_name_l" IS NULL OR "first_name_r" IS NULL',
            "label_for_charts": "Null",
            "is_null_level": True,
        },
        {
            "sql_condition": '"first_name_l" = "first_name_r"',
            "label_for_charts": "Exact match",
            "tf_adjustment_column": "first_name",
            "tf_adjustment_weight": 1.0,
        },
        {
            "sql_condition": 'levenshtein("first_name_l", "first_name_r") <= 2',
            "label_for_charts": "Levenshtein <= 2",
            "tf_adjustment_column": "first_name",
            "tf_adjustment_weight": 0.5,
        },
        {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
    ],
    "comparison_description": "Exact match vs. First_Name within levenshtein threshold 2 vs. anything else",
}

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on(["first_name"]),
        block_on(["surname"]),
    ],
    "comparisons": [
        fuzzy_first_name_with_tf_adj,
        exact_match("surname"),
        exact_match("dob"),
        exact_match("city", term_frequency_adjustments=True),
        exact_match("email"),
    ],
    "retain_intermediate_calculation_columns": True,
}


linker = DuckDBLinker(df, settings)
logging.getLogger("splink").setLevel(1)
df_predict = linker.predict()

In that script, we're using levenshtein with the first_name, and allowing TF adjustments on both the exact match and the fuzzy match

The sql is

full sql
CREATE TABLE __splink__df_predict_e91631574
        AS
        (WITH __splink__df_concat_with_tf as (select * from __splink__df_concat_with_tf_6849394a0), 
__splink__df_blocked as (
            select
            "l"."unique_id" AS "unique_id_l", "r"."unique_id" AS "unique_id_r", "l"."first_name" AS "first_name_l", "r"."first_name" AS "first_name_r", "l"."tf_first_name" AS "tf_first_name_l", "r"."tf_first_name" AS "tf_first_name_r", "l"."surname" AS "surname_l", "r"."surname" AS "surname_r", "l"."dob" AS "dob_l", "r"."dob" AS "dob_r"
            , '0' as match_key
            
            from __splink__df_concat_with_tf as l
            inner join __splink__df_concat_with_tf as r
            on
            (l."first_name" = r."first_name")
            where l."unique_id" < r."unique_id"
            
             UNION ALL 
            select
            "l"."unique_id" AS "unique_id_l", "r"."unique_id" AS "unique_id_r", "l"."first_name" AS "first_name_l", "r"."first_name" AS "first_name_r", "l"."tf_first_name" AS "tf_first_name_l", "r"."tf_first_name" AS "tf_first_name_r", "l"."surname" AS "surname_l", "r"."surname" AS "surname_r", "l"."dob" AS "dob_l", "r"."dob" AS "dob_r"
            , '1' as match_key
            
            from __splink__df_concat_with_tf as l
            inner join __splink__df_concat_with_tf as r
            on
            (l."surname" = r."surname")
            where l."unique_id" < r."unique_id"
            AND NOT (coalesce((l."first_name" = r."first_name"),false))
            ), 
__splink__df_comparison_vectors as (
    select "unique_id_l","unique_id_r","first_name_l","first_name_r",CASE WHEN "first_name_l" IS NULL OR "first_name_r" IS NULL THEN -1 WHEN "first_name_l" = "first_name_r" THEN 2 WHEN levenshtein("first_name_l", "first_name_r") <= 2 THEN 1 ELSE 0 END as gamma_first_name,"tf_first_name_l","tf_first_name_r","surname_l","surname_r",CASE WHEN "surname_l" IS NULL OR "surname_r" IS NULL THEN -1 WHEN "surname_l" = "surname_r" THEN 1 ELSE 0 END as gamma_surname,"dob_l","dob_r",CASE WHEN "dob_l" IS NULL OR "dob_r" IS NULL THEN -1 WHEN "dob_l" = "dob_r" THEN 1 ELSE 0 END as gamma_dob,match_key 
    from __splink__df_blocked
    ), 
__splink__df_match_weight_parts as (
    select "unique_id_l","unique_id_r","first_name_l","first_name_r",gamma_first_name,"tf_first_name_l","tf_first_name_r",CASE 
WHEN
gamma_first_name = -1
THEN cast(1.0 as float8)
 
WHEN
gamma_first_name = 2
THEN cast(163.97484984984985 as float8)
 
WHEN
gamma_first_name = 1
THEN cast(2.4703796561604605 as float8)
 
WHEN
gamma_first_name = 0
THEN cast(0.025404270177413344 as float8)
 END as bf_first_name ,CASE WHEN  gamma_first_name = -1 then cast(1 as float8) WHEN  gamma_first_name = 2 then
    (CASE WHEN coalesce("tf_first_name_l", "tf_first_name_r") is not null
    THEN
    POW(
        cast(0.0057935713975033705 as float8) /
    (CASE
        WHEN coalesce("tf_first_name_l", "tf_first_name_r") >= coalesce("tf_first_name_r", "tf_first_name_l")
        THEN coalesce("tf_first_name_l", "tf_first_name_r")
        ELSE coalesce("tf_first_name_r", "tf_first_name_l")
    END)
    ,
        cast(1.0 as float8)
    )
    ELSE cast(1 as float8)
    END) WHEN  gamma_first_name = 1 then
    (CASE WHEN coalesce("tf_first_name_l", "tf_first_name_r") is not null
    THEN
    POW(
        cast(0.0057935713975033705 as float8) /
    (CASE
        WHEN coalesce("tf_first_name_l", "tf_first_name_r") >= coalesce("tf_first_name_r", "tf_first_name_l")
        THEN coalesce("tf_first_name_l", "tf_first_name_r")
        ELSE coalesce("tf_first_name_r", "tf_first_name_l")
    END)
    ,
        cast(0.5 as float8)
    )
    ELSE cast(1 as float8)
    END) WHEN  gamma_first_name = 0 then cast(1 as float8) END as bf_tf_adj_first_name ,"surname_l","surname_r",gamma_surname,CASE 
WHEN
gamma_surname = -1
THEN cast(1.0 as float8)
 
WHEN
gamma_surname = 1
THEN cast(194.275 as float8)
 
WHEN
gamma_surname = 0
THEN cast(0.05024570024570029 as float8)
 END as bf_surname ,"dob_l","dob_r",gamma_dob,CASE 
WHEN
gamma_dob = -1
THEN cast(1.0 as float8)
 
WHEN
gamma_dob = 1
THEN cast(543.5567010309278 as float8)
 
WHEN
gamma_dob = 0
THEN cast(0.05008754038589973 as float8)
 END as bf_dob ,match_key 
    from __splink__df_comparison_vectors
    ) 
    select
    log2(cast(0.010101010101010102 as float8) * bf_first_name * bf_tf_adj_first_name * bf_surname * bf_dob) as match_weight,
    CASE WHEN bf_first_name = cast('infinity' as float8) OR bf_tf_adj_first_name = cast('infinity' as float8) OR bf_surname = cast('infinity' as float8) OR bf_dob = cast('infinity' as float8) THEN 1.0 ELSE (cast(0.010101010101010102 as float8) * bf_first_name * bf_tf_adj_first_name * bf_surname * bf_dob)/(1+(cast(0.010101010101010102 as float8) * bf_first_name * bf_tf_adj_first_name * bf_surname * bf_dob)) END as match_probability,
    "unique_id_l","unique_id_r","first_name_l","first_name_r",gamma_first_name,"tf_first_name_l","tf_first_name_r",bf_first_name,bf_tf_adj_first_name,"surname_l","surname_r",gamma_surname,bf_surname,"dob_l","dob_r",gamma_dob,bf_dob,match_key 
    from __splink__df_match_weight_parts
    
    order by 1
    )

Let's take a look at how the term frequency adjustments are applied in the sql.

Bayes factor first name:

CASE
WHEN
gamma_first_name = -1
THEN cast(1.0 as float8)
 
WHEN
gamma_first_name = 2
THEN cast(163.97484984984985 as float8)
 
WHEN
gamma_first_name = 1
THEN cast(2.4703796561604605 as float8)
 
WHEN
gamma_first_name = 0
THEN cast(0.025404270177413344 as float8)
 END as bf_first_name

And then the term frequency bayes factor adjustment is:

CASE WHEN  gamma_first_name = -1 then cast(1 as float8) WHEN  gamma_first_name = 2 then
    (CASE WHEN coalesce("tf_first_name_l", "tf_first_name_r") is not null
    THEN
    POW(
        cast(0.0057935713975033705 as float8) /
    (CASE
        WHEN coalesce("tf_first_name_l", "tf_first_name_r") >= coalesce("tf_first_name_r", "tf_first_name_l")
        THEN coalesce("tf_first_name_l", "tf_first_name_r")
        ELSE coalesce("tf_first_name_r", "tf_first_name_l")
    END)
    ,
        cast(1.0 as float8)
    )
    ELSE cast(1 as float8)
    END) WHEN  gamma_first_name = 1 then
    (CASE WHEN coalesce("tf_first_name_l", "tf_first_name_r") is not null
    THEN
    POW(
        cast(0.0057935713975033705 as float8) /
    (CASE
        WHEN coalesce("tf_first_name_l", "tf_first_name_r") >= coalesce("tf_first_name_r", "tf_first_name_l")
        THEN coalesce("tf_first_name_l", "tf_first_name_r")
        ELSE coalesce("tf_first_name_r", "tf_first_name_l")
    END)
    ,
        cast(0.5 as float8)
    )
    ELSE cast(1 as float8)
    END) WHEN  gamma_first_name = 0 then cast(1 as float8) END 

as bf_tf_adj_first_name

Here:
163.9748 is the bayes factor corresponding to the average exact match
2.4703 is the bayes factor corresponding to the average levenshtein <= 2 match

Then, the TF adjustment is defined as:

  • for an exact match, the difference between the average u value (0.00579) for an exact match and the relative term frequency for the name, expressed as a bayes factor
  • for the Levenshtein level, the difference between the average u value (0.00579) for an exact match and the relative term frequency for the name, but with a further modification to reduce by half (0.5) the impact of the adjustment

We then multiply bf_first_name by bf_tf_adj_first_name, so that in the case of the levenshtein level, it's combining the average u value for the levenshtein level with half (0.5) of the term frequency adjustment for the exact match.
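Putting those two CASE expressions together as arithmetic (the 2.47... and 0.00579... constants are taken from the SQL above; the term frequency is invented):

bf_lev_avg = 2.4703796561604605      # average Bayes factor for levenshtein <= 2
u_avg_exact = 0.0057935713975033705  # average u for an exact match
tf_john = 0.05                       # illustrative term frequency, not from the SQL
tf_adjustment_weight = 0.5

bf_tf_adj = (u_avg_exact / tf_john) ** tf_adjustment_weight
bf_total = bf_lev_avg * bf_tf_adj
# i.e. the average levenshtein Bayes factor, partially adjusted using the
# exact-match term frequency of the more common of the two names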

@RobinL
Member

RobinL commented Mar 4, 2024

Separately, I've had a think about how it may be possible to combine term frequencies and array columns using the current Splink codebase. Specifically whether it's possible to simultaneously:

  • account for the size of the intersection
  • account for the relative token frequencies
  • make it work for an arbitrary bag of words column

You can find a readme and some code here:
https://github.com/RobinL/array_tf_ideas

This doesn't get you a highly-granular tf adjustment, i.e. an adjustment that's different for every single combination of tokens that intersect, but it does get you quite a long way towards decent predictions.

@samkodes
Contributor Author

samkodes commented Mar 4, 2024

I like the above approach for combining arrays and TFs but feel like it makes different assumptions than when the array is just meant to be a multi-valued set of options.

In effect the above scoring approach assumes (1) independent selection of intersecting terms (conditional on size of intersection) and (2) that a bigger intersection is stronger evidence of a match. Neither is always the case.

(There should also be some sort of adjustment for both array sizes to view this scoring as an explicit data generating process measuring the "rarity" of a pairing. Possibly multiply by combinatorial coefficients (N_1 choose k)*(N_2 choose k) where k is the size of the intersection and N_1, N_2 the sizes of the arrays in the pair - I think this amounts to conditioning on N_1, N_2 and ignoring higher-order corrections required to avoid an intersection of larger size? )

In this particular application, you might also want to simply exclude stopwords and their abbreviations when data cleaning (like "limited" or "ltd"); standardizing abbreviations will lower the score when they are present because of assumption (2).

@samkodes
Contributor Author

samkodes commented Mar 4, 2024

I would like to recant my proposal that u_inexact/u_exact is an appropriate inflation factor; I now believe that it is not, and that more precise calculations are preferable.

The current approach sets match weight to:
log( m_inexact / ((u_inexact/u_exact)*TF)),
ie this sets
u_x = (u_inexact/u_exact)*TF

Unfortunately, I now believe this is wrong, even if we view u_inexact/u_exact as an "inflation factor" accounting for increased rate of coincidences due to fuzzy match. This is because the TF of 1 term, even scaled, cannot be expected to approximate the TF of all fuzzy matches to that term.

For example, suppose we are trying to find a match for "Sam" in a dataset that contains 2 "Sam"s and 8 "Spam"s . Fuzzy match (damerau-levenshtein=1) will hit all 8 Spams (assuming exact matches are not counted here), but TF for Sam remains 2/10. We would need an inflation factor of 4. But a different value in general will need a different inflation factor; for example, if our dataset also had 2 "Peg"s and 18 "Meg"s, when looking for a match for "Peg" we would find all 18 "Megs"; TF for Peg would be 2/30 and we would need an inflation factor of 9.
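A quick check of those numbers as code:

def needed_inflation(n_exact, n_fuzzy_only, n_total):
    tf_exact = n_exact / n_total      # what the TF table gives us
    u_fuzzy = n_fuzzy_only / n_total  # what we actually need for the fuzzy level
    return u_fuzzy / tf_exact

needed_inflation(2, 8, 10)   # "Sam" vs 8 "Spam"s -> 4.0
needed_inflation(2, 18, 30)  # "Peg" vs 18 "Meg"s -> 9.0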

There is no reason in general for the TF of "Sam" to tell us anything about the TF of "Spam", or the TF of "Peg" to tell us anything about the TF of "Meg". (If TFs were in some sense continuous with respect to the fuzzy-distance metric, that would be another story).

In other words, even if u_inexact/u_exact measured the average "neighborhood size" of the fuzzy match in terms of nearby unique values (or even non-unique values i.e. records), this size is in general unrelated to any particular TF itself. This means that variation in individual TFs will in general add noise to the formula above rather than making it more individualized, which is the whole point of TF adjustments.

The correct value for u_x for "Sam" is the "neighbourhood size" of the fuzzy match operator applied to "Sam" , i.e. the number of records fuzzy-matched to "Sam", divided by the master list's size.

One way to calculate u_x for a fuzzy match level is then to simply do a complete fuzzy match calculation on all values against all other values. We don't have to do this for each record in the master list, just the values, and then total up by the value's multiplicity. That saves some work, but can still be impractical for high cardinality fields.
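As a brute-force sketch of that calculation (difflib stands in for whatever fuzzy comparison the level actually uses, and the data is invented):

from collections import Counter
from difflib import SequenceMatcher

values = ["Sam", "Sam", "Spam", "Spam", "Spam", "Peg", "Meg", "Meg"]  # master list values
counts = Counter(values)
n = len(values)

def fuzzy_match(a, b, threshold=0.6):
    # Stand-in for the real comparison (e.g. Damerau-Levenshtein <= 1);
    # exact matches are excluded, as discussed above
    return a != b and SequenceMatcher(None, a, b).ratio() >= threshold

# u_x = (number of records fuzzy-matched to x, excluding exact matches) / n.
# Comparing distinct values and weighting by multiplicity avoids the full
# record-by-record cross join, but is still quadratic in the number of
# distinct values.
u_x = {x: sum(c for y, c in counts.items() if fuzzy_match(x, y)) / n for x in counts}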

If the fuzzy match satisfies the triangle inequality, there should be efficient algorithms for doing this in large datasets. (e.g. using anchor points to partition the dataset into neighbourhoods and rule out comparisons between the local neighbourhoods of distant anchor points). An algorithm and python implementation for doing this should already exist - it is a generic task - count the size of the epsilon-neighbourhood of each point in a generic metric space without calculating a complete distance matrix. Questionable whether this can be implemented easily in SQL though!

An alternative would be to estimate this by random sampling for each value. Since fuzzy matches are expected to have a higher hit-rate than exact matches, sampling may be able to provide reliable estimates with a practical number of samples per value (e.g. 1k-10k?). The fuzzier the match, the more effective a sampling approach would be.

While the disable_tf_inflation logic I've submitted could be used to omit the u_inexact/u_exact factor and use a raw user-supplied TF value, every match level would need its own TF table (better to call it a u_x table), so changes to this part of the architecture would also be required.

For the simple case of array intersections I am considering, none of this matters. But in the general case if my analysis is correct, it means the current approach to inexact-match TF adjustments may be unwise?

@RobinL
Member

RobinL commented Mar 4, 2024

Agreed - the current approach isn't very sound. I think it only really works in the case of small typos, where one side of the match is 'not a word'. E.g John Vs Jhon, but not Sam Vs Spam. We don't use it very often for this reason but there are some scenarios in which it might improve accuracy a little despite being not very theoretically grounded, especially in cases with heavy data skew (e.g. city name).

@samkodes
Contributor Author

samkodes commented Mar 5, 2024

Another few thoughts re the suggested approach to combining TFs and arrays above.

Your proposed approach may be thought of as hypothesis testing under a particular null hypothesis model (sketched above), with match levels corresponding to different p-value thresholds. This statistical model is not in general related to the statistical model used for Fellegi-Sunter, but is just one way of generating "scores" representing match strength.

Like any method of measuring match strength, different measures will suit different domains and our understanding of what constitutes a match. One downside to the proposal is that it pays no attention to non-matching terms.

Two alternative approaches that would penalize non-matching terms are inverse-TF-weighted symmetric difference (i.e. (sum of weights of union) - (sum of weights of intersection)) and inverse-TF-weighted Jaccard distance (i.e. inverse-TF-weighted symmetric difference / (sum of weights of union)). These are not probability models themselves; the first could have arbitrary values while the second is in [0,1]. I believe both of these are metrics (they satisfy the triangle inequality) so could be used with the efficient exact methods of calculating u_x I mentioned above.

@samkodes
Contributor Author

samkodes commented Mar 5, 2024

Some notes coming out of our conversation today (very fruitful!):

  • Currently, T_x is multiplied by the TF-"inflation factor" (u_inexact/u_exact) to get u_x for inexact matches.
  • In the general case this is inappropriate because inexact matches may have very different multiplicities (TFs) than x (e.g. 2 "Sam" and 10 "Sham") - discussed earlier in thread.
  • In the case where inexact matching is very unlikely to result in anything except typos, the inflation factor is also inappropriate. Here we want to use T_x directly, e.g. "Londno" 1-Levenshtein matching "London" should use u_Londno = T_London, because there are no other cities 1-Levenshtein away from "Londno". Typos matching typos will be relatively rare. In general, the situation when this approach makes sense is when the "master list" consists of sparse elements each further away from each other than the fuzzy match radius, with typos likely to fuzzy match only a single element on the "master list".
  • In the above case (sparse master list), the "inflation factor" u-inexact/u-exact is very hard to interpret anyway because of the conditional structure of the algorithm; u-inexact may not in fact be any higher than u-exact since exact matches are excluded from u-inexact; i.e. it might "deflate".

Conclusion: in general, avoiding this "inflation factor" might be a better idea.

@RobinL
Member

RobinL commented Mar 5, 2024

Thanks for the notes re our chat.

Re the arrays:

  • Yeah - I agree there are cases in which we want to pay attention to non matching terms. Agree it would be reasonably straightforward to adjust the calculation to somehow 'punish' non matching terms, and there are various ways you could do this.
  • I think one of the strengths of using these scores indirectly (as comparison levels) as opposed to using the scores directly (e.g. as a u value to drive match weights) is that they then get parameter estimates just like any other comparison level, i.e. the calculation allows us to bin matches into different groups (comparison levels). So long as those groups correspond pretty well to 'matchiness' (i.e. members of the group are similarly 'matchy'), the approach gives you something useful. Overall I think the exact approach taken is probably data-dependent, but the high level approach is promising for quite a wide variety of use cases.

@zmbc
Contributor

zmbc commented Dec 7, 2024

Hi both -- sorry for chiming in so late here, but I have some thoughts to offer on this. I think there are multiple unrelated issues at play here.

Term frequency and its assumptions

@samkodes, you wrote:

First, I should clarify that my thinking on TF is a lot clearer for the link-only setting, rather than the de-duplicate setting. I think this is because in the de-dupe setting there is no information about the duplicate-generating process, which makes inference using TF (whether a-priori or empirical estimates) involve a lot of assumptions.

This is because the link-only task can be thought of as "for each record in left, evaluate the probability of match with each record in right". With this perspective, TFs should be calculated using the right dataset only, and we assume that there are no duplicates in the right dataset.

This implies no assumptions on the data generating process for the left data set, except that each left record is independent of all others (no 1-1 match is enforced, etc; in other words duplicates in the left, but not right, dataset are allowed).

I don't think this is exactly right, but you are onto something, and I believe this exposes some slight fudges in the TF math that are papered over here.

Remember, m probabilities are $P(\text{observation} | \text{match})$ and u probabilities are $P(\text{observation} | \text{non-match})$. When we use TF, we redefine "observation": instead of simply being the comparison level observed, it is the specific values observed on both sides of the pair in the relevant field.

Therefore, m and u probabilities with TF for an exact match level are:

$$ m_{TF} = P(\text{both records' field value = X} | \text{match}) $$

$$ u_{TF} = P(\text{both records' field value = X} | \text{non-match}) $$

We can use the assumption behind Splink's u estimation approach here: the vast majority of pairs are non-matches, so we can approximate things about non-matches by looking at all pairs. (These probabilities are all evaluated across the universe of pairs.)

$$ u_{TF} \approx P(\text{both records' field value = X}) $$

We can estimate this quantity by random sampling of pairs, which might require a lot for rare values, or we can say that $P(\text{both records' field value = X}) = P(\text{a record's field value = X}) \times P(\text{a record's field value = X})$ (note these last two probabilities are evaluated across the universe of records), which is what the current TF table does.

Potential improvement We could also calculate a TF table for each dataset, and say $$P(\text{both records' field value} = X) = P(\text{a record's field value} = X | \text{dataset} = \text{left record's dataset}) \times P(\text{a record's field value} = X | \text{dataset} = \text{right record's dataset})$$

which sacrifices some sample size (if one of your datasets is too small, you don't want to estimate term frequencies from it) but, if that's not a concern, will deal better with imbalances in term frequencies between datasets.
For example, maybe the name "Mary" is much more informative for deduplicating between two criminal justice records, where the names are mostly typical men's names, than it is for linking a criminal justice record to another, gender-balanced dataset.
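A sketch of that per-dataset variant with toy data:

from collections import Counter

def tf_table(values):
    counts = Counter(values)
    return {v: c / len(values) for v, c in counts.items()}

left_names = ["Mary", "John", "John", "Robert"]   # toy data
right_names = ["Mary", "Mary", "Mary", "John"]

tf_left = tf_table(left_names)
tf_right = tf_table(right_names)

def u_tf_exact(value):
    # Probability that a random left record and a random right record both
    # take this value, per the formula above
    return tf_left.get(value, 0.0) * tf_right.get(value, 0.0)

u_tf_exact("Mary")  # 0.25 * 0.75 = 0.1875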

To tackle $m_{TF}$, observe that:

$$ m_{TF} = P(\text{both records' field value = X} | \text{match}) = P(\text{field equal} | \text{match}) \times P(\text{both records' field value = X} | \text{field equal}, \text{match}) = m \times P(\text{left record's field value = X} | \text{field equal}, \text{match}) $$

Here, the first simplifying assumption we will make is that "field equal" and "left record's field value = X" are independent, conditional on match. That is, in the set of true matches, certain values of the field in the left record are not more likely to have equality on the field.

(Note that of course all of this is completely symmetrical and you could just as easily do it with the right record instead of the left record.)

This assumption can be problematic, which has been discussed in this thread but not formalized in this way. For example, a city name of "Londno" could be a value in the left record that is highly unlikely to be equal with the city name in truly matching records, because it represents a typo that probably wasn't repeated in another data entry by/about the same person. For this reason, it is likely preferable to count term frequencies in the cleanest dataset (or data subset) available. This might make a few estimates worse, by making some errors that appear multiple times in a messier dataset appear even rarer and more distinctive, but assuming there are many possible errors to make on each true value, it should eliminate more errors from the frequency table entirely, and improve things overall.

With that, we get $m \times P(\text{left record's field value = X} | \text{match})$. But we can't estimate this without knowing something about true match status, when of course the entire point of record linkage is that we don't have this. The leap that is made in the Splink docs without acknowledgement is an assumption that $P(\text{left record's field value = X} | \text{match})$ over pairs is approximately equal to $P(\text{field value = X})$ over records. Why would these be similar? We are making a second assumption: we expect the number of true matches for a record to be independent of the field value in that record. That is, people with all last names are equally likely to exist in the other dataset (if linking) or be duplicated (if deduplicating).

This assumption can clearly be violated as well. Imagine we are linking a full-population dataset to a survey of a particular ethnic minority, who also tend to share last names. We'll directly estimate the probability two random records match, which will be (relatively) high: $\frac{N_\text{survey}}{N_\text{population} \times N_\text{survey}} = \frac{1}{N_\text{population}}$. Our TF adjustment will "double count" when it sees the names in both datasets, increasing the match probability massively because they are so "rare," when actually they are not rare at all among the matches we are trying to find; we might end up linking every pair we see with the same last name. For this reason, it is likely preferable to count term frequencies in a dataset (or data subset) that is more representative of the matches we aim to find.

I believe this means that "TFs should be calculated using the right dataset only, and we assume that there are no duplicates in the right dataset" as a strategy is both overly cautious and vulnerable to an issue. Duplicates are fine in the TF calculation data, so long as they are evenly spread across values of the field. Instead, the desirable traits in the data used to calculate TFs are (1) cleanness (in the field of interest) to minimize "rare" values that are actually errors not expected to give useful information on an exact match and (2) representativeness (in the field of interest) to get commonness of values more correct among the records we should actually link. (1) and (2) can trade off with one another, and with sample size: it's conceivable that the very best data is too small to get reliable term frequencies.

It also means that there is an alternate strategy to estimate $m_{TF}$, which is more computationally intensive but eliminates both of these assumptions: estimate it as part of EM, calculating $P(\text{both records' field value = X} | \text{match})$ directly using the estimated match probabilities at that step of the iteration. Note that this would allow the model to learn that "Londno," though rare, is not informative, because usually when an exact match is found on that, little else matches. This EM approach is mentioned as a possibility in the docs ("The TF adjustment doesn't depend on m, and therefore does not have to be estimated by the EM algorithm - it is known already") but I believe its advantages are not taken into consideration.

Term frequency with fuzzy matches

As @samkodes noted above, it is clearly not correct in general to apply the exact-match TF adjustment to a fuzzy-match level.

Just to show this is separate from the previous, consider a case where we've got a dataset that is great at satisfying the requirements laid out in the previous section: we are doing a linkage between datasets A and B, and we know somehow that A has no typos in last name. We expect a 1-to-1 match (so A is representative of true matches), but B has typos in last name, so we want to use a fuzzy match on that field. We calculate our term frequencies using A.

Let's say the two most common names in A are "Smith" and "Johnson", and they are equally common, so they get the same TF adjustment.
However, consider that there could be people in both datasets with the last name (not a typo!) "Johnston". We presume that an actual typo is equally likely with either name, so the m probabilities of pairs ("Smith", "Smiht") and ("Johnson", "Johnston") should be the same. But the u probability should clearly be greater in the second case, since there are a whole bunch of people who coincidentally have that name.

An assumption Splink is currently making is that all values are equally "crowded" -- that the ratio between coincidental exact matches and coincidental fuzzy matches is constant across values. Values that are very different than any other value will have almost no coincidental fuzzy matches that aren't exact. Values that are similar to other, especially common, values, will have lots more coincidental fuzzy matches than they will have coincidental exact matches.

I think random sampling is most promising for eliminating this assumption; that is already used in the non-TF case, and is obviously relatively SQL-friendly. Along the lines of

... one way to calculate u_x for a fuzzy match level is then to simply do a complete fuzzy match calculation on all values against all other values...

random sampling could be used in combination with aggregating duplicate values (see #2539 (comment) about the non-TF case, though I've yet to work out the math exactly), which would have benefits for high-cardinality columns that nevertheless have some common values.

With this approach, you could either designate one side as the value that should be used for TF adjustment (the cleaner dataset, in a link-only problem), or do some simple aggregation, such as an average, of the adjustments you'd make for each of the distinct values that are being fuzzy matched. This is just an approximation, since you obviously can't do a separate TF adjustment for every pair of values.

Term frequency with arrays

I completely agree with @samkodes that

I like the above approach for combining arrays and TFs but feel like it makes different assumptions than when the array is just meant to be a multi-valued set of options.

The proposed methodology makes sense for a bag-of-words column, though I haven't thought it through in immense detail, but it definitely isn't right for alternates/options. There, I think what you'd want is essentially the same as the non-array case, just considering all pairs. That is, calculate the TFs for all values, flattening out the arrays (subject to the assumptions in the first section of this comment), and then when there is an array intersection of (at least) one, use the (average?) TF adjustment of the overlapping value(s).
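A sketch of that strategy, reading "average TF adjustment of the overlapping values" as an average of the implied Bayes factor adjustments (toy numbers; the flattened TF table is as in the first section of this comment):

from statistics import mean

tf_flat = {"555-0001": 0.02, "555-0002": 0.4, "555-0003": 0.01}  # TFs over exploded arrays
u_avg_exact = 0.05  # invented average u for the intersection level

def tf_bayes_factor_adjustment(left_arr, right_arr):
    overlap = set(left_arr) & set(right_arr)
    if not overlap:
        return 1.0  # no adjustment
    # Average the adjustments implied by each overlapping value
    return mean(u_avg_exact / tf_flat[v] for v in overlap)

tf_bayes_factor_adjustment(["555-0003"], ["555-0003", "555-0002"])  # 0.05 / 0.01 = 5.0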

If you combined arrays with best-pair fuzzy matching (using #2517, perhaps 😄) you'd want to stack the strategy of the previous paragraph with the strategy of the previous section: calculate per-level, per-value u probabilities using random sampling on the flattened set, then apply the average of the two adjustments that correspond to the closest-matching pair's values (again, maybe averaging for tied pairs). Since you'd need the closest-matching pair twice (once to check that it is close enough, again to apply the TF adjustment) you'd probably want to compute it once somehow.

That was a very long comment but hopefully not a waste of time! I'd be interested to see whether this makes sense to folks, and if so, how we might go about implementing some parts of this 😃

@decohn

decohn commented Dec 11, 2024

Hi all,

A colleague and I have had some recent discussions with @zmbc regarding term frequency adjustments for array fields, so I thought I'd briefly summarize them here.

The gist is that we have a linkage spine that accumulates alternate values for fields like surname, given name, and postal code. When linking a record to the spine, we compare the record's single surname, given name, etc. to the values in the spine and make use of both fuzzy matching and term frequency adjustments. At the moment we manage this by creating multiple records in the spine for some individuals, with each record having a different combination of alternate values. This biases our weights, so we're interested in the ability to perform array comparisons while still maintaining TF adjustments and fuzzy matching.

To give a concrete example, we'd like to be able to get a positive weight when comparing "DAVID" to ["DAVDI", "ANYNAME", ...] or when comparing "ZEB" to ["ZEBB", "ANYNAME", ...], while obtaining a higher weight in the latter case due to the name being less common. Similar to the scenario outlined in this discussion, we would only be interested in the u-probability of the best matching pair(s) between a scalar value and an array, not all pairs.

It seems like this would still be a relatively big lift, but I wanted to express that, if there was interest in implementing it, it would definitely be a useful feature for us (and likely for some percentage of other users).

@samkodes
Contributor Author

Hi @zmbc - thanks for your thoughts, they are helpful. I have a couple of things to add.

Definitions of TF m-and-u probabilities:

I suppose rather than specifying a probability of a particular field value, I was thinking of conditioning on it. In other words, defining:

$$m_{TF} = P(\text{exact match} | \text{left record field value = X}, \text{match})$$

In this case it makes sense to assume that $m_{TF} = m$ since "typos" could in general be assumed to be independent of X (though this is not true for typos that arise from spelling unfamiliar names heard over the phone, etc). In effect this is the math Splink is using by not doing anything to m-probabilities when TF is used.

Similarly defining

$$u_{TF} = P(\text{exact match} | \text{left record field value = X}, \text{non-match})$$

gives you $u_{TF,x}=TF_x$, which is what Splink uses, provided term frequencies are defined using only the right data set.

Since Bayes' rule carries through when conditioning, these expressions are legitimate for calculating overall match probabilities, though it turns out this also requires assuming $P(\text{match})$ and $P(\text{left record field value=X})$ are independent, which is similar but not quite the same as the assumptions of the approach you describe. There seems to be a quite analogous structure to these two approaches.

Array-based TF approaches:

One general problem is that there is no obvious way to map TF's of array components to TF's of arrays. That is, unless we know how the arrays are generated this relationship could be totally arbitrary. And it's further complicated by the semantics of the array comparison - e.g. best of all pairs or something else.

Another problem is that it is hard to decide what to write down as a probability statement. We can't simply emulate the definition of $u_{TF}$ as you described it. For example, suppose left = [a,b,c] and right = [a, d, e] and we are using array intersection > 0 matching. In this case we care about $P(\text{pair = [a,b,c]x[a,d,e]} | \text{non-match})$. In most data sets this kind of thing will be too sparse to estimate directly from the data by looking at all pairs. So without a model of the data-generating process things are hopeless. It also ignores the array intersection logic completely.

Maybe instead you define
$$u_{TF}=P(\text{intersection contains 'a'}|\text{right size}=3, \text{left size}=3, \text{non-match})$$
Under some assumptions like array entries conditional on array size being drawn independently from underlying TF distributions, you may be able to estimate that this is something like $(3 \times TF_a) \times (3 \times TF_a)$ (one term for each side).
Maybe you can also estimate
$$m_{TF}=P(\text{intersection contains 'a'}|\text{right size}=3, \text{left size}=3, \text{match})$$
as something like $m \times TF_a$ (basically by more independence assumptions regarding size conditional on match).

If on the other hand we adopt a conditioning approach it's maybe a little easier to come up with a simple model that seems reasonable.

For example, suppose we condition on the left array [a,b,c] and the size of the right array. Our main assumption will be that the right-hand arrays, conditional on their array size, are generated by independent samples from the underlying TF distribution. (This is not particularly realistic, for example if these are first and middle names or aliases/nicknames, but it's a start ...)

Then

$$u_{TF}=P(\text{array intersection}>0 | \text{left}=[a,b,c], \text{non-match}, \text{right size}=3) = 1-(1-(TF_a+TF_b+TF_c))^3$$

It also seems reasonable, as above, to define the conditional m-probability by ignoring the specific value of the left field completely through independence assumptions:

$$m_{TF}=P(\text{array intersection}>0 | \text{left}=[a,b,c], \text{match}, \text{right size}=3) = P(\text{array intersection}>0 | \text{match})=m$$

This is a reasonably general model and is consistent with the current approach to not mess with the m-probabilities for TF adjustment.
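For reference, that u expression transcribed directly (toy numbers; the independence assumptions are exactly those stated above):

def u_tf_any_intersection(left_values, tf, right_size):
    # Probability that at least one of right_size independent draws from the
    # TF distribution lands in the left array (non-match case)
    p_single_draw_hits = sum(tf[v] for v in left_values)
    return 1 - (1 - p_single_draw_hits) ** right_size

tf = {"a": 0.02, "b": 0.01, "c": 0.005}
u_tf_any_intersection(["a", "b", "c"], tf, right_size=3)  # 1 - (1 - 0.035) ** 3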

Fuzzy-matching TFs and arrays

I think we are more-or-less on the same page regarding fuzzy matching $u_x$ terms. I think of these as the normalized size of fuzzy neighborhoods $N_x$ around each term $x$.

A conditional approach like the above could be used to define fuzzy matching on arrays under the same assumptions. For the expression:

$$u_{TF}=P(\text{array intersection}>0 | \text{left}=[a,b,c], \text{non-match}, \text{right size}=3) = 1-(1-(TF_a+TF_b+TF_c))^3$$

we can simply replace $TF_a$ with $u_a$, where $u_a$ is the (normalized) size of the fuzzy neighbourhood around $a$ (and similarly for $TF_b$ and $TF_c$). In the case that there is some intersection between the fuzzy neighbourhoods $N_a$, $N_b$, and $N_c$, if we calculate these neighbourhoods explicitly we can also calculate their union $N_{abc}$ and replace $u_a+u_b+u_c$ with $u_{abc}$. In many applications, for example if [a,b,c] are alternate spellings, there is likely to be some overlap.

Finally, @decohn's situation seems much more suited to the conditional approach.

@zmbc
Contributor

zmbc commented Dec 12, 2024

@samkodes Apologies, I should have addressed this more directly -- I do not think the conditioning approach is correct.

As I mentioned, m is $P(\text{observation} | \text{match})$. Your "conditioned" $m_{TF}$ is not equivalent to this.

$$ P(\text{observation} | \text{match}) = P(\text{both records' field value = X} | \text{match}) = P(\text{left record field value = X} | \text{match}) \times P(\text{exact match} | \text{left record field value = X}, \text{match}) \neq P(\text{exact match} | \text{left record field value = X}, \text{match}) $$

In other words, you are missing the consideration of how rare it is to find X in the left record.

Imagine having two datasets that you expect to have a 1-to-1 link (no duplicates in either, complete overlap). Your assumptions are completely symmetric in this case, and it doesn't make sense that your $m$ and $u$ probabilities would be different depending on which dataset you designate to be "left." If you include the $P(\text{left record field value = X} | \text{match})$ term, you will find that your results are both equivalent to the approach I described above, and symmetrical.

Edited to add: this is also true of your later equation $m_{TF} = P(\text{array intersection} > 0 | \text{left} = [a, b, c], \text{match}, \text{right size} = 3)$. I don't think you can introduce $\text{right size} = 3$ as a condition without considering its probability.

@zmbc
Contributor

zmbc commented Dec 12, 2024

One general problem is that there is no obvious way to map TF's of array components to TF's of arrays. That is, unless we know how the arrays are generated this relationship could be totally arbitrary. And it's further complicated by the semantics of the array comparison - e.g. best of all pairs or something else.

I don't find this distinction of whether or not we "know how the arrays are generated" to be a useful one. All possible linkage models are going to be isomorphic to some model of how the arrays were generated. We have to choose one.

I also agree that the semantics of the array comparison are absolutely key, but the "best pair" comparison (with reasonable assumptions) is relatively simple to deal with, I think, though I probably didn't get it totally right above. The key assumption will be the one you proposed: "that the right-hand arrays, conditional on their array size, are generated by independent samples from the underlying TF distribution" (except that you must make this assumption for the left-hand arrays as well, if you find my comment above convincing). You're right that it isn't particularly realistic in the case of aliases/nicknames, but the point is that the way it is unrealistic (if a pair matches between two arrays, it is much more likely that future pairs between those same arrays will match, as items within each array are similar to each other) doesn't matter if you take only the best pair.

Another problem is that it is hard to decide what to write down as a probability statement. We can't simply emulate the definition of $u_{TF}$ as you described it. For example, suppose left = [a,b,c] and right = [a, d, e] and we are using array intersection > 0 matching. In this case we care about $P(\text{pair = [a,b,c]x[a,d,e]} | \text{non-match})$. In most data sets this kind of thing will be too sparse to estimate directly from the data by looking at all pairs.

This is not an issue with arrays, it is an issue with fuzzy matching in general, as I mentioned:

This is just an approximation, since you obviously can't do a separate TF adjustment for every pair of values.

Instead, I proposed making a TF adjustment that is the average of the TF adjustments you would make for each value in the pair; more formally, this is based on the simplification that:

$$
\begin{aligned}
u_{TF} &= P(\text{pair = [a,b,c]x[a,d,e]} | \text{non-match}) \\
&= P(\text{array intersection > 0} | \text{non-match}) \times P(\text{left field value = [a,b,c]} | \text{array intersection > 0}, \text{non-match}) \\
&\quad \times P(\text{right field value = [a,d,e]} | \text{left field value = [a,b,c]}, \text{array intersection > 0}, \text{non-match}) \\
&\approx P(\text{array intersection > 0} | \text{non-match}) \times P(\text{left field value = [a,b,c]} | \text{array intersection > 0}, \text{non-match}) \\
&\quad \times P(\text{right field value = [a,d,e]} | \text{array intersection > 0}, \text{non-match})
\end{aligned}
$$

So the assumption is that $P(\text{left field value = [a,b,c]})$ is independent of $P(\text{right field value = [a,d,e]})$ conditional on $\text{array intersection > 0}$ and $\text{non-match}$; in more intuitive terms, all arrays with some intersection with [a, b, c] are equally likely to collide coincidentally in a pair with [a, b, c]. This assumption is definitely not perfect, but it seems like the best you can do while simplifying the model enough to be able to actually estimate its parameters.

@samkodes
Contributor Author

@samkodes Apologies, I should have addressed this more directly -- I do not think the conditioning approach is correct.

As I mentioned, m is $P(\text{observation} | \text{match})$. Your "conditioned" $m_{TF}$ is not equivalent to this.

$$P(\text{observation} | \text{match}) = P(\text{both records' field value = X} | \text{match}) = P(\text{left record field value = X} | \text{match}) \times P(\text{exact match} | \text{left record field value = X}, \text{match}) \neq P(\text{exact match} | \text{left record field value = X}, \text{match})$$

In other words, you are missing the consideration of how rare it is to find X in the left record.

Imagine having two datasets that you expect to have a 1-to-1 link (no duplicates in either, complete overlap). Your assumptions are completely symmetric in this case, and it doesn't make sense that your $m$ and $u$ probabilities would be different depending on which dataset you designate to be "left." If you include the $P(\text{left record field value = X} | \text{match})$ term, you will find that your results are both equivalent to the approach I described above, and symmetrical.

I think we are talking at cross-purposes. I am not saying that I am defining the same TF-m-probability that you are. In fact, I am defining a different TF-m-probability, a conditional one. The key points are (1) that conditional TF-m and TF-u probabilities can still be used via Bayes' rule to estimate a p(match) for the pair - which is ultimately what we care about - and (2) that with some assumptions, the m-probabilities estimated by Splink will pop out of the math to be used in this calculation.

As an aside, the "extra factor" in your calculation, $P(\text{left record field value = X} | \text{match})$, is (under the reasonable assumption that the field value is independent of match status) equivalent to an "extra factor" of $TF_x$, which cancels with the same factor in the u-probability when calculating P(match) and is simply ignored when conditioning.
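To make that cancellation explicit (a sketch only, under the independence assumption just stated, which is the assumption at issue in this thread): the Bayes factor for the pair is

$$\frac{P(\text{both values = X} | \text{match})}{P(\text{both values = X} | \text{non-match})} = \frac{P(\text{left} = X | \text{match}) \times P(\text{exact match} | \text{left} = X, \text{match})}{P(\text{left} = X | \text{non-match}) \times P(\text{exact match} | \text{left} = X, \text{non-match})} \approx \frac{TF_X \times m_{TF}}{TF_X \times u_{TF}} = \frac{m_{TF}}{u_{TF}}$$

so the extra factor drops out of P(match) whether or not it is carried along.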

@zmbc
Contributor

zmbc commented Dec 12, 2024

(1) that conditional TF-m and TF-u probabilities can still be used via Bayes' rule to estimate a p(match) for the pair

I think this is where something is being lost. Let me see if I can more specifically point it out.

$$ \begin{aligned} p(\text{match} | \text{obs} = \text{both values X}) &= p(\text{match}) \times \frac{p(\text{obs} = \text{both values X} | \text{match})}{p(\text{obs} = \text{both values X})}\\ &= p(\text{match}) \times \frac{p(\text{left value} = X | \text{match}) \times p(\text{both values X} | \text{left value} = X, \text{match})}{p(\text{left value} = X) \times p(\text{right value} = X | \text{left value} = X)} \end{aligned} $$

which cancels with the same factor in the u-probability when calculating P(match) and is simply ignored when conditioning.

They are not the same, and only cancel out if, as I mentioned above, you make the assumption that $p(\text{left value} = X) \approx p(\text{left value} = X | \text{match})$. That may or may not be any more true for the left dataset than the right and may be more true of the two datasets combined than of either individually.

The point is, the TF math should be independent of which data (sub)set the user wants to designate as the most representative -- it may not be either the left or the right, but both combined, or a subset of one or the other. Likewise for what data (sub)set is most representative of the array size of matches, in the array case. That is why I think the "conditional m and u" framing confuses more than it helps -- because it makes implicit something that is a completely domain- and problem-specific decision that Splink can't rely on in general.

@samkodes
Contributor Author

I think we have some basic disagreement about whether the TF math should be independent of which data set the user wants to designate. I know v3 of Splink (and I hope v4) allowed users to upload a custom TF table, and I understood that its purpose was to give a general way to control how TFs were estimated, covering a variety of use cases.

In terms of the specific proposal for array intersection TFs, I'm not sure I understand how you are thinking of implementing this proposal.

I've proposed a conditional approach that is reducible to specific array-generating processes built on individual array element term frequencies (e.g. $TF_a$, $TF_b$, etc) - and even changing the observand to "overlap=a" may be so reducible in a non-conditional approach as I suggested above. I don't yet understand how your proposal would work. Perhaps you are not interested in reducing things to underlying array element TFs?

First, when you propose that $P(\text{left field value = [a,b,c]})$ is independent of $P(\text{right field value = [a,d,e]})$ conditional on array intersection > 0 and non-match, I immediately want to say that if the arrays are generated according to some random process that depends on individual term frequencies, this independence can't be true, simply because they are more likely to overlap on a more frequent element than a less frequent one; so even if we condition on intersection > 0, the left and right field values cannot be independent.

Second, how are you going to estimate $P(\text{right value} = [a,b,c] | \text{array intersection} > 0, \text{non-match})$?

  • If you sample random pairs and restrict to pairs with array intersection > 0, you run the risk of a very small or zero sample after restriction, because many specific list values [a,b,c] will be very rare. This approach does not use the underlying TF terms at all (e.g. $TF_a$, $TF_b$, etc.).
  • If you try to calculate this using some generating process, for example modelling array size or even conditioning on array size, you may be able to derive formulas involving $TF_a$, $TF_b$, etc., but they will have to be consistent with the independence-of-field-values assumption above, which seems pretty hard to achieve ...

@zmbc
Contributor

zmbc commented Dec 12, 2024

I know v3 of Splink (and I hope v4) allowed users to upload a custom TF table, and I understood that its purpose was to give a general way to control how TFs were estimated, covering a variety of use cases.

Right -- it is precisely that flexibility that I am trying to preserve with the more explicit framing (not making it implicit that one or the other dataset is used to calculate TFs).

But I think overall, now that I understand what you meant by conditional m and u, I can do the translation in my head if necessary.

I don't yet understand how your proposal would work. Perhaps you are not interested in reducing things to underlying array element TFs?

Sorry, I got a bit confused here between two problems:

  • TF adjustment with fuzzy-matching (when the left and right values are not the same, so it is not obvious how to adjust for frequency)
  • TF adjustment with arrays of alternates

What I wrote above was my proposal for the first problem, but I used an array example, so it all got a bit muddled. It would be clearer to say that I propose

$$
\begin{aligned}
u_{TF} &= P(\text{pair = "Johnson" x "Johnston"} | \text{non-match}) \\
&= P(\text{edit distance = 1} | \text{non-match}) \times P(\text{left field value = "Johnson"} | \text{edit distance = 1}, \text{non-match}) \\
&\quad \times P(\text{right field value = "Johnston"} | \text{left field value = "Johnson"}, \text{edit distance = 1}, \text{non-match}) \\
&\approx P(\text{edit distance = 1} | \text{non-match}) \times P(\text{left field value = "Johnson"} | \text{edit distance = 1}, \text{non-match}) \\
&\quad \times P(\text{right field value = "Johnston"} | \text{edit distance = 1}, \text{non-match})
\end{aligned}
$$

and then approximating these last two terms with the "neighborhood size" of "Johnson" and "Johnston" as we've discussed. Now that I write this out, I realize it is not exactly the average of the "TF adjustments" for the two terms; it is their geometric mean. But probably easier to just think of it the way it is written above, and not as it relates to hypothetical adjustments.
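A rough Python sketch of this factorised approximation, with the two conditional terms approximated by normalised neighbourhood sizes as suggested above. The TF table, the similarity threshold standing in for the edit-distance-1 level, and the `u_level` value are all illustrative assumptions, not Splink API or real estimates:

```python
from difflib import SequenceMatcher


def neighbourhood_fraction(term, tf_table, is_fuzzy_match):
    """Total term frequency of vocabulary entries accepted by the fuzzy
    predicate against `term` -- the normalised 'neighbourhood size'."""
    return sum(tf for other, tf in tf_table.items() if is_fuzzy_match(term, other))


# Stand-in fuzzy predicate; any fuzzy level (e.g. edit distance <= 1) would do here.
def fuzzy(a, b):
    return SequenceMatcher(None, a, b).ratio() >= 0.85


tf_table = {"Johnson": 0.008, "Johnston": 0.003, "Jonson": 0.001, "Smith": 0.012}
u_level = 0.0005  # P(fuzzy level | non-match), e.g. estimated from random pairs

# u_TF ~= P(level | non-match)
#         * P(left = "Johnson"   | level, non-match)  ~ neighbourhood of "Johnson"
#         * P(right = "Johnston" | level, non-match)  ~ neighbourhood of "Johnston"
u_tf = (
    u_level
    * neighbourhood_fraction("Johnson", tf_table, fuzzy)
    * neighbourhood_fraction("Johnston", tf_table, fuzzy)
)
```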

For the arrays-of-alternates case, I think I agree with what you've proposed: using the best pair comparison, and assuming that the arrays are both random samples from the TF distribution on the array elements. Everything I've said about this up to this point has been somewhat hand-wavey, but I'd like to work through your math in #2006 (comment).
