[FEAT] Exact match level required for TF adjustment - can this be avoided? #2006
After sleeping on it, I now think the proposed solution won't work for my application, though it does make sense for fuzzy matches like Damerau-Levenshtein. In this case I really want u_x = T_x (the term frequency), without any approximation by the inflation factor u-inexact/u-exact. A different, simpler solution would achieve this: add a custom setting to each comparison level that uses TF adjustment, called "disable_tf_inflation_factor" (settable by dictionary specification to begin with, though it could be added to the comparison libraries later). When that setting is on, Splink should simply use u-inexact (rather than seeking u-exact) in the TF adjustment weight, cancelling out the u-inexact used elsewhere and effectively setting the inflation factor to 1.

An alternative would be to allow manual specification, again through dictionary specification, of a match level to take the role of the "exact match" level, rather than relying on SQL parsing. This would allow any level to be used as the reference level for the TF inflation factor. I could imagine this being helpful if I were to add a fuzzy array match implementation as another match level to my model; I would then use the exact array intersection match as the reference level. Perhaps a flag called "tf_inflation_reference_level".
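For concreteness, here is a rough sketch of how the two proposed flags might look in a dictionary-specified comparison level. Neither `disable_tf_inflation_factor` nor `tf_inflation_reference_level` exists in Splink today; they are the settings proposed in this comment, and the column names, SQL conditions and the `dual_array_fuzzy_match` function are purely illustrative.

```python
# Hypothetical comparison levels illustrating the two proposed flags.
# Neither flag exists in Splink today.
name_comparison = {
    "output_column_name": "name",
    "comparison_levels": [
        {
            "sql_condition": '"name_l" IS NULL OR "name_r" IS NULL',
            "label_for_charts": "Null",
            "is_null_level": True,
        },
        {
            # Proposal 1: use u-inexact itself in the TF adjustment, so the
            # inflation factor u-inexact/u-exact is effectively 1 and u_x = T_x.
            "sql_condition": 'array_length(array_intersect("name_l", "name_r")) > 0',
            "label_for_charts": "Array intersection",
            "tf_adjustment_column": "name",
            "tf_adjustment_weight": 1.0,
            "disable_tf_inflation_factor": True,
        },
        {
            # Proposal 2: name another level to act as the "exact match" reference
            # for the inflation factor, instead of relying on SQL parsing.
            "sql_condition": 'dual_array_fuzzy_match("name_l", "name_r")',  # placeholder
            "label_for_charts": "Fuzzy array match",
            "tf_adjustment_column": "name",
            "tf_adjustment_weight": 0.5,
            "tf_inflation_reference_level": "Array intersection",
        },
        {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
    ],
}
```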
I've implemented the second and third approaches above, using custom comparison level settings flags, and set up a pull request: #2020.
Thanks for this and the PR! At a high level:
You can see below that I've had a bit of a think about this, and here are some initial notes. I ran out of time a bit because it gets pretty complicated and was making my brain hurt. I hope to find a bit more time to think about this. For now, I think I have most of the key concepts in my head, but I'd be interested in talking a bit more about exactly how you want this to work so we can think through options for potential solutions. I'd like to think about whether there are other ways of getting the functionality you want. Would you be up for a chat? Feel free to reach out at [email protected].

I'll start by writing a few notes on how term frequency adjustments currently work, for clarity (and partly to remind myself!).

Starting from the basics, the u probability is a measure of the likelihood of coincidences occurring whereby information matches amongst non-matching records - for instance, an exact match on first name for two different people. We can measure an average value for the u probability by taking a large number of non-matching pairs and calculating how often first name matches. Let's say that the average u value is 1%. However, this value does not account for skew - for example, it's much more likely to observe a match on 'John' amongst non-matching records than a match on 'Robin'. Let's say Johns make up 10% of the population and Robins make up 0.1%, so the term-specific u values are 0.1 and 0.001 respectively.
**Current Splink implementation**

In Splink we currently measure:
In the case of an exact match, we use the term-specific u rather than the average u. As an implementation detail, this is calculated as the average u, with an additional adjustment for the delta between the average u and the term-specific u. The reason is to make it easier to represent the term frequency adjustment graphically (John can be shown as a downward adjustment of match weight relative to average, Robin as an upward adjustment relative to average).

**Implementation for fuzzy match levels**

In the case of a fuzzy match like levenshtein, it's harder to define a term frequency adjustment because there are two terms. If the match is Jhon vs John, at a levenshtein distance of 1, what term should be used to make the adjustment? The decision made within Splink is to take the more frequent of the two terms, on the basis that one of them is a typo or mistake and is therefore probably uncommon. So in the case of Jhon vs John, we make the term frequency adjustment on the basis of John.

We can now see a second reason why it's useful to apply the term frequency adjustment as a delta between the average u and the term-specific u: the levenshtein level has a different u value than the exact match level, but we can still apply the adjustment to this alternative u value.

However, the strength of the adjustments can be quite high, and it's unclear whether they strictly apply in the case of a fuzzy match. 'John' vs 'Jhon' is probably a typo; Bryan vs Ryan are just two separate names, so it's much less clear how a term frequency adjustment should apply in that case. So we allow a second parameter, tf_adjustment_weight, of between 0 and 1, which reduces the size of the adjustment: 1 = full adjustment, 0 = no adjustment, and you can pick any value in between to 'partially' apply the adjustment.

**Why do we need to find the exact match level to apply tf adjustments to a fuzzy level?**

In order to apply tf adjustments to a fuzzy level we need to know the delta between the average u and the term-specific u. More specifically, we need a value for the average u probability of an exact match on the column - the reference point against which the term-specific adjustment is expressed (sketched below).
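Putting the description above into symbols (notation mine, but consistent with the SQL shown later in this thread), the overall Bayes factor for a level with a TF adjustment factors as:

$$\text{BF}_{\text{total}} \;=\; \underbrace{\frac{m_{\text{level}}}{u_{\text{level}}}}_{\texttt{bf}} \;\times\; \underbrace{\left(\frac{u_{\text{exact}}}{\max(\text{tf}_l,\ \text{tf}_r)}\right)^{w}}_{\texttt{bf\_tf\_adj}}$$

where $u_{\text{exact}}$ is the average u of the exact match level, $\text{tf}_l, \text{tf}_r$ are the term frequencies of the two values, and $w$ is `tf_adjustment_weight` (1 = full adjustment, 0 = none). This is why the exact match level needs to be located: its average u is the reference against which the term-specific adjustment is expressed.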
**How could this be used for an array based column?**

It's a bit unclear how to define this when the size of the intersection of the arrays may be 1 or more. Possibly a bit simpler if you somehow just take the least common term in the array.
Hi Robin - thanks for the detailed reply. First, I should clarify that my thinking on TF is a lot clearer for the link-only setting than for the de-duplicate setting. I think this is because in the de-dupe setting there is no information about the duplicate-generating process, which makes inference using TF (whether a priori or empirical estimates) involve a lot of assumptions. If duplicate records are far rarer than duplicate values, however, I suspect a similar analysis will apply.

In a link-only setting it seems natural to condition the analysis for any pair on the value in one of the datasets (e.g. the left). This is because the link-only task can be thought of as "for each record in left, evaluate the probability of match with each record in right". With this perspective, TFs should be calculated using the right dataset only, and we assume that there are no duplicates in the right dataset. This implies no assumptions on the data-generating process for the left dataset, except that each left record is independent of all others (no 1-1 match is enforced, etc.; in other words, duplicates in the left, but not the right, dataset are allowed). Of course left and right could be swapped, but then the probability model and data-generating process assumptions would change. In other words, it seems to me that the appropriate probability model is inevitably tied to the definition of the linking task; hence my confusion regarding the de-duplication task. (Our typical assumption is that the m-probabilities do not depend on the precise value but instead reflect generic error processes that apply uniformly across the dataset, so they don't have to be brought into the analysis at all.)

I prefer to think of the TF adjustment inflation for inexact (or fuzzy) matches as a scaling factor applied to the TF (this is what's happening before taking logs and separating out an additive weight adjustment). This makes clear that it is simply the ratio u-inexact (or u-fuzzy)/u-exact.

How this applies to array-based matches depends on what the user wants; again, I don't think there is a general solution. It is perhaps clearer to start by not thinking about TF "inflation" in this case - rather, just try to think about u_x directly and what it should be. For example, in my case I am looking for any array intersection at all. Because I treat the dataset whose arrays may have length greater than 1 as my "master list" (right dataset), and the dataset with singleton arrays as my (possibly duplicated) left dataset, it makes sense to define u_x as the TF of the single element on the left (with TFs calculated from the RIGHT dataset). If I had left and right reversed, so my "master list" had single-valued arrays and the other dataset had multi-valued arrays, I might want to define u_x as the sum of the TFs in the multi-valued array (because a match on any of them would count, so the probability of a coincidence adds up); a sketch of both cases follows this comment. If a match level requires, say, a 2-element match, then u_x would have to be defined in terms of the probability of seeing any 2-element subset in the "master list", and implementing this would be challenging unless the other dataset only had arrays of length 2 (so we'd know which 2-element subset we were talking about) - otherwise the combinatorial analysis becomes more complicated.

I will email you to set up a chat and send you a notebook showing how I applied the fork approach to one of the sample datasets; that may make this example clearer.
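A minimal sketch of the two u_x definitions described above, assuming term frequencies have already been computed from the "master list" dataset into a plain dict; the function and variable names are illustrative, not Splink API.

```python
def u_x_any_intersection(non_master_array, tf_master):
    """u_x for an 'any intersection at all' level, in the two cases above, where
    at most one side of the comparison can have more than one entry.
    tf_master: term -> frequency, computed from the master-list dataset."""
    if len(non_master_array) == 1:
        # Master list holds the alternates: condition on the single value on the
        # other side, so u_x is just that value's term frequency in the master list.
        return tf_master.get(non_master_array[0], 0.0)
    # Master list holds single values: a coincidence on any of the alternates
    # counts, so the probabilities add up.
    return sum(tf_master.get(v, 0.0) for v in non_master_array)
```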
Another example: even if an array match level only requires one intersection, the analysis of u_x becomes complicated as soon as both arrays can have more than one entry. Above I suggested that when at most one array could have more than one entry, there were two cases depending on which was the "master list": either u_x = TF of the singleton (when the "master list" has multiple entries) or u_x = sum of the TFs in the multi-valued array (when the "master list" has singleton entries).

When both arrays can have more than one entry, the upper bound of u_x is the sum of the TFs of the non-master entry, and the lower bound of u_x is the maximum TF of the non-master entry. But the sum-of-TFs figure overcounts when several of its entries are found in a single "master list" entry. A combinatorial analysis could perhaps make this more precise and suggest cheap estimates of this overcount; alternatively, a model could be used for the array-generating process (e.g. Poisson for array size, followed by independent selection of entries - but then results depend on the model assumptions, and array entries are unlikely to be independent! Moreover, under an independence model, overcounting should be a very minor issue and may be ignored unless arrays are very long or some terms are incredibly frequent.)
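Stated symbolically (notation mine; $\mathrm{TF}_i$ are the term frequencies of the entries of the non-master array):

$$\max_i \mathrm{TF}_i \;\le\; u_x \;\le\; \sum_i \mathrm{TF}_i$$

with the upper bound overcounting whenever several entries co-occur within a single master-list array.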
Still struggling a bit to get my head around this. Maybe looking at the SQL, or taking specific examples, will help clarify ideas. Would you be able to look at the following example, and use it to construct a new example of the behaviour you want? I don't mean to suggest I'm asking you to write full SQL - just to give me an idea of the modification you're looking for in the SQL and how it would apply.

Here's a test script to view the SQL:

```python
import logging
from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.comparison_library import (
exact_match,
)
from splink.duckdb.linker import DuckDBLinker
df = splink_datasets.fake_1000
fuzzy_first_name_with_tf_adj = {
"output_column_name": "first_name",
"comparison_levels": [
{
"sql_condition": '"first_name_l" IS NULL OR "first_name_r" IS NULL',
"label_for_charts": "Null",
"is_null_level": True,
},
{
"sql_condition": '"first_name_l" = "first_name_r"',
"label_for_charts": "Exact match",
"tf_adjustment_column": "first_name",
"tf_adjustment_weight": 1.0,
},
{
"sql_condition": 'levenshtein("first_name_l", "first_name_r") <= 2',
"label_for_charts": "Levenshtein <= 2",
"tf_adjustment_column": "first_name",
"tf_adjustment_weight": 0.5,
},
{"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
],
"comparison_description": "Exact match vs. First_Name within levenshtein threshold 2 vs. anything else",
}
settings = {
"probability_two_random_records_match": 0.01,
"link_type": "dedupe_only",
"blocking_rules_to_generate_predictions": [
block_on(["first_name"]),
block_on(["surname"]),
],
"comparisons": [
fuzzy_first_name_with_tf_adj,
exact_match("surname"),
exact_match("dob"),
exact_match("city", term_frequency_adjustments=True),
exact_match("email"),
],
"retain_intermediate_calculation_columns": True,
}
linker = DuckDBLinker(df, settings)
logging.getLogger("splink").setLevel(1)
df_predict = linker.predict()
```

In that script, we're using levenshtein with first_name, and allowing TF adjustments on both the exact match and the fuzzy match levels.

The full SQL is:

```sql
CREATE TABLE __splink__df_predict_e91631574
AS
(WITH __splink__df_concat_with_tf as (select * from __splink__df_concat_with_tf_6849394a0),
__splink__df_blocked as (
select
"l"."unique_id" AS "unique_id_l", "r"."unique_id" AS "unique_id_r", "l"."first_name" AS "first_name_l", "r"."first_name" AS "first_name_r", "l"."tf_first_name" AS "tf_first_name_l", "r"."tf_first_name" AS "tf_first_name_r", "l"."surname" AS "surname_l", "r"."surname" AS "surname_r", "l"."dob" AS "dob_l", "r"."dob" AS "dob_r"
, '0' as match_key
from __splink__df_concat_with_tf as l
inner join __splink__df_concat_with_tf as r
on
(l."first_name" = r."first_name")
where l."unique_id" < r."unique_id"
UNION ALL
select
"l"."unique_id" AS "unique_id_l", "r"."unique_id" AS "unique_id_r", "l"."first_name" AS "first_name_l", "r"."first_name" AS "first_name_r", "l"."tf_first_name" AS "tf_first_name_l", "r"."tf_first_name" AS "tf_first_name_r", "l"."surname" AS "surname_l", "r"."surname" AS "surname_r", "l"."dob" AS "dob_l", "r"."dob" AS "dob_r"
, '1' as match_key
from __splink__df_concat_with_tf as l
inner join __splink__df_concat_with_tf as r
on
(l."surname" = r."surname")
where l."unique_id" < r."unique_id"
AND NOT (coalesce((l."first_name" = r."first_name"),false))
),
__splink__df_comparison_vectors as (
select "unique_id_l","unique_id_r","first_name_l","first_name_r",CASE WHEN "first_name_l" IS NULL OR "first_name_r" IS NULL THEN -1 WHEN "first_name_l" = "first_name_r" THEN 2 WHEN levenshtein("first_name_l", "first_name_r") <= 2 THEN 1 ELSE 0 END as gamma_first_name,"tf_first_name_l","tf_first_name_r","surname_l","surname_r",CASE WHEN "surname_l" IS NULL OR "surname_r" IS NULL THEN -1 WHEN "surname_l" = "surname_r" THEN 1 ELSE 0 END as gamma_surname,"dob_l","dob_r",CASE WHEN "dob_l" IS NULL OR "dob_r" IS NULL THEN -1 WHEN "dob_l" = "dob_r" THEN 1 ELSE 0 END as gamma_dob,match_key
from __splink__df_blocked
),
__splink__df_match_weight_parts as (
select "unique_id_l","unique_id_r","first_name_l","first_name_r",gamma_first_name,"tf_first_name_l","tf_first_name_r",CASE
WHEN
gamma_first_name = -1
THEN cast(1.0 as float8)
WHEN
gamma_first_name = 2
THEN cast(163.97484984984985 as float8)
WHEN
gamma_first_name = 1
THEN cast(2.4703796561604605 as float8)
WHEN
gamma_first_name = 0
THEN cast(0.025404270177413344 as float8)
END as bf_first_name ,CASE WHEN gamma_first_name = -1 then cast(1 as float8) WHEN gamma_first_name = 2 then
(CASE WHEN coalesce("tf_first_name_l", "tf_first_name_r") is not null
THEN
POW(
cast(0.0057935713975033705 as float8) /
(CASE
WHEN coalesce("tf_first_name_l", "tf_first_name_r") >= coalesce("tf_first_name_r", "tf_first_name_l")
THEN coalesce("tf_first_name_l", "tf_first_name_r")
ELSE coalesce("tf_first_name_r", "tf_first_name_l")
END)
,
cast(1.0 as float8)
)
ELSE cast(1 as float8)
END) WHEN gamma_first_name = 1 then
(CASE WHEN coalesce("tf_first_name_l", "tf_first_name_r") is not null
THEN
POW(
cast(0.0057935713975033705 as float8) /
(CASE
WHEN coalesce("tf_first_name_l", "tf_first_name_r") >= coalesce("tf_first_name_r", "tf_first_name_l")
THEN coalesce("tf_first_name_l", "tf_first_name_r")
ELSE coalesce("tf_first_name_r", "tf_first_name_l")
END)
,
cast(0.5 as float8)
)
ELSE cast(1 as float8)
END) WHEN gamma_first_name = 0 then cast(1 as float8) END as bf_tf_adj_first_name ,"surname_l","surname_r",gamma_surname,CASE
WHEN
gamma_surname = -1
THEN cast(1.0 as float8)
WHEN
gamma_surname = 1
THEN cast(194.275 as float8)
WHEN
gamma_surname = 0
THEN cast(0.05024570024570029 as float8)
END as bf_surname ,"dob_l","dob_r",gamma_dob,CASE
WHEN
gamma_dob = -1
THEN cast(1.0 as float8)
WHEN
gamma_dob = 1
THEN cast(543.5567010309278 as float8)
WHEN
gamma_dob = 0
THEN cast(0.05008754038589973 as float8)
END as bf_dob ,match_key
from __splink__df_comparison_vectors
)
select
log2(cast(0.010101010101010102 as float8) * bf_first_name * bf_tf_adj_first_name * bf_surname * bf_dob) as match_weight,
CASE WHEN bf_first_name = cast('infinity' as float8) OR bf_tf_adj_first_name = cast('infinity' as float8) OR bf_surname = cast('infinity' as float8) OR bf_dob = cast('infinity' as float8) THEN 1.0 ELSE (cast(0.010101010101010102 as float8) * bf_first_name * bf_tf_adj_first_name * bf_surname * bf_dob)/(1+(cast(0.010101010101010102 as float8) * bf_first_name * bf_tf_adj_first_name * bf_surname * bf_dob)) END as match_probability,
"unique_id_l","unique_id_r","first_name_l","first_name_r",gamma_first_name,"tf_first_name_l","tf_first_name_r",bf_first_name,bf_tf_adj_first_name,"surname_l","surname_r",gamma_surname,bf_surname,"dob_l","dob_r",gamma_dob,bf_dob,match_key
from __splink__df_match_weight_parts
order by 1
)
```

Let's take a look at how the term frequency adjustments are applied in the SQL. The Bayes factor for first_name is the CASE statement on gamma_first_name that produces bf_first_name above. The term frequency Bayes factor adjustment is then the CASE statement that produces bf_tf_adj_first_name. Here, cast(0.0057935713975033705 as float8) is the average u probability of the exact match level, and the TF adjustment is defined as this value divided by the larger of the two term frequencies (tf_first_name_l, tf_first_name_r), raised to the power of the tf_adjustment_weight (1.0 for the exact level, 0.5 for the levenshtein level). We then multiply bf_first_name by bf_tf_adj_first_name, together with the other Bayes factors and the prior, to produce the final match weight and match probability.
Separately, I've had a think about how it may be possible to combine term frequencies and array columns using the current Splink codebase - specifically, whether it's possible to simultaneously:
You can find a readme and some code here: This doesn't get you a highly granular TF adjustment, i.e. an adjustment that's different for every single combination of tokens that intersect, but it does get you quite a long way towards decent predictions.
I like the above approach for combining arrays and TFs, but I feel like it makes different assumptions than when the array is just meant to be a multi-valued set of options. In effect, the above scoring approach assumes (1) independent selection of intersecting terms (conditional on the size of the intersection) and (2) that a bigger intersection is stronger evidence of a match. Neither is always the case. (There should also be some sort of adjustment for both array sizes to view this scoring as an explicit data-generating process measuring the "rarity" of a pairing. Possibly multiply by combinatorial coefficients (N_1 choose k)*(N_2 choose k), where k is the size of the intersection and N_1, N_2 the sizes of the arrays in the pair - I think this amounts to conditioning on N_1, N_2 and ignoring higher-order corrections required to avoid an intersection of larger size?)

In this particular application, you might also want to simply exclude stopwords and their abbreviations (like "limited" or "ltd") when cleaning the data; standardizing abbreviations will lower the score when they are present, because of assumption (2).
I would like to recant my proposal that u_inexact/u_exact is an appropriate inflation factor; I now believe that it is not, and that more precise calculations are preferable.

The current approach sets the match weight using the term frequency of one of the pair's values, scaled by this inflation factor. Unfortunately, I now believe this is wrong, even if we view u_inexact/u_exact as an "inflation factor" accounting for the increased rate of coincidences due to fuzzy matching. This is because the TF of one term, even scaled, cannot be expected to approximate the TF of all fuzzy matches to that term. For example, suppose we are trying to find a match for "Sam" in a dataset that contains 2 "Sam"s and 8 "Spam"s. A fuzzy match (Damerau-Levenshtein = 1) will hit all 8 Spams (assuming exact matches are not counted here), but the TF for Sam remains 2/10; we would need an inflation factor of 4. A different value will in general need a different inflation factor: if our dataset also had 2 "Peg"s and 18 "Meg"s, when looking for a match for "Peg" we would find all 18 "Meg"s; the TF for Peg would be 2/30 and we would need an inflation factor of 9. There is no reason in general for the TF of "Sam" to tell us anything about the TF of "Spam", or the TF of "Peg" to tell us anything about the TF of "Meg". (If TFs were in some sense continuous with respect to the fuzzy-distance metric, that would be another story.) In other words, even if u_inexact/u_exact measured the average "neighbourhood size" of the fuzzy match in terms of nearby unique values (or even non-unique values, i.e. records), this size is in general unrelated to any particular TF itself. This means that variation in individual TFs will in general add noise to the formula above rather than making it more individualized, which is the whole point of TF adjustments.

The correct value of u_x for "Sam" is the "neighbourhood size" of the fuzzy match operator applied to "Sam", i.e. the number of records fuzzy-matched to "Sam", divided by the master list's size. One way to calculate u_x for a fuzzy match level is then to simply do a complete fuzzy match calculation of all values against all other values. We don't have to do this for each record in the master list, just for the distinct values, and then total up by each value's multiplicity. That saves some work, but can still be impractical for high-cardinality fields. If the fuzzy match satisfies the triangle inequality, there should be efficient algorithms for doing this in large datasets (e.g. using anchor points to partition the dataset into neighbourhoods and rule out comparisons between the local neighbourhoods of distant anchor points). An algorithm and Python implementation for doing this should already exist - it is a generic task: count the size of the epsilon-neighbourhood of each point in a generic metric space without calculating a complete distance matrix. It's questionable whether this can be implemented easily in SQL though!

An alternative would be to estimate this by random sampling for each value (sketched below). Since fuzzy matches are expected to have a higher hit rate than exact matches, sampling may be able to provide reliable estimates with a practical number of samples per value (e.g. 1k-10k?). The fuzzier the match, the more effective a sampling approach would be.

While the disable_tf_inflation logic I've submitted could be used to omit the u_inexact/u_exact factor and use a raw user-supplied TF value, every match level would need its own TF table (better to call it a u_x table), so changes to this part of the architecture would also be required.
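A rough sketch of the sampling idea mentioned above (pure Python, not Splink API, and with no blocking): for each distinct value, estimate u_x as the fraction of sampled master-list records that the fuzzy predicate hits.

```python
import random

def estimate_u_x_by_sampling(query_values, master_values, within, n_samples=5000, seed=0):
    """Estimate, per distinct query value, its fuzzy 'neighbourhood size' u_x:
    the fraction of master-list records that `within(a, b)` would match.
    `within` is any fuzzy predicate, e.g. a Damerau-Levenshtein <= 1 check."""
    rng = random.Random(seed)
    u_x = {}
    for v in set(query_values):
        sample = rng.choices(master_values, k=n_samples)  # sample records with replacement
        hits = sum(1 for w in sample if within(v, w))
        u_x[v] = hits / n_samples
    return u_x
```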
For the simple case of array intersections I am considering, none of this matters. But in the general case, if my analysis is correct, it means the current approach to inexact-match TF adjustments may be unwise?
Agreed - the current approach isn't very sound. I think it only really works in the case of small typos, where one side of the match is 'not a word' - e.g. John vs Jhon, but not Sam vs Spam. We don't use it very often for this reason, but there are some scenarios in which it might improve accuracy a little despite not being very theoretically grounded, especially in cases with heavy data skew (e.g. city name).
A few more thoughts re the suggested approach to combining TFs and arrays above.

Your proposed approach may be thought of as hypothesis testing under a particular null hypothesis model (sketched above), with match levels corresponding to different p-value thresholds. This statistical model is not in general related to the statistical model used for Fellegi-Sunter, but is just one way of generating "scores" representing match strength. Like any method of measuring match strength, different measures will suit different domains and our understanding of what constitutes a match.

One downside to the proposal is that it pays no attention to non-matching terms. Two alternative approaches that would penalize non-matching terms are the inverse-TF-weighted symmetric difference (i.e. (sum of weights of the union) - (sum of weights of the intersection)) and the inverse-TF-weighted Jaccard distance (i.e. the inverse-TF-weighted symmetric difference divided by the sum of weights of the union). These are not probability models themselves; the first can take arbitrary values while the second lies in [0,1]. I believe both are metrics (they satisfy the triangle inequality), so they could be used with the efficient exact methods of calculating u_x I mentioned above.
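A minimal sketch of the two alternative array scores described above, assuming term frequencies are available as a plain dict; the 1/tf weight is one reasonable choice of "inverse-TF" weight, not the only one.

```python
def _weight_sum(terms, tf, weight):
    # Sum of inverse-TF weights over a set of terms; terms missing from the
    # TF table would need special handling in practice.
    return sum(weight(tf[t]) for t in terms)

def weighted_symmetric_difference(a, b, tf, weight=lambda f: 1.0 / f):
    """Inverse-TF-weighted symmetric difference:
    (sum of weights over the union) - (sum of weights over the intersection)."""
    a, b = set(a), set(b)
    return _weight_sum(a | b, tf, weight) - _weight_sum(a & b, tf, weight)

def weighted_jaccard_distance(a, b, tf, weight=lambda f: 1.0 / f):
    """Inverse-TF-weighted Jaccard distance: symmetric difference divided by the
    union weight, which lies in [0, 1]."""
    a, b = set(a), set(b)
    union_w = _weight_sum(a | b, tf, weight)
    return (union_w - _weight_sum(a & b, tf, weight)) / union_w if union_w else 0.0
```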
Some notes coming out of our conversation today (very fruitful!):
Conclusion: in general, avoiding this "inflation factor" might be a better idea.
Thanks for the notes re our chat. Re the arrays:
Hi both -- sorry for chiming in so late here, but I have some thoughts to offer on this. I think there are multiple unrelated issues at play here.

**Term frequency and its assumptions**

@samkodes, you wrote:
I don't think this is exactly right, but you are onto something, and I believe this exposes some slight fudges in the TF math that are papered over here. Remember how the m and u probabilities are defined; with TF for an exact match level, they additionally pin down the specific field value X.

We can use the assumption behind Splink's u estimation approach here: the vast majority of pairs are non-matches, so we can approximate things about non-matches by looking at all pairs. (These probabilities are all evaluated across the universe of pairs.) We can estimate this quantity by random sampling of pairs, which might require a lot of samples for rare values, or we can read it off a term frequency table.

**Potential improvement**

We could also calculate a TF table for each dataset, which sacrifices some sample size (if one of your datasets is too small, you don't want to estimate term frequencies from it) but, if that's not a concern, will deal better with imbalances in term frequencies between datasets.

To tackle the m side, the first simplifying assumption we will make is that "field equal" and "left record's field value = X" are independent, conditional on match. That is, in the set of true matches, certain values of the field in the left record are not more likely to have equality on the field. (Note that of course all of this is completely symmetrical, and you could just as easily do it with the right record instead of the left record.) This assumption can be problematic, which has been discussed in this thread but not formalized in this way. For example, a city name of "Londno" could be a value in the left record that is highly unlikely to be equal with the city name in truly matching records, because it represents a typo that probably wasn't repeated in another data entry by/about the same person. For this reason, it is likely preferable to count term frequencies in the cleanest dataset (or data subset) available. This might make a few estimates worse, by making some errors that appear multiple times in a messier dataset appear even rarer and more distinctive, but assuming there are many possible errors to make on each true value, it should eliminate more errors from the frequency table entirely, and improve things overall.

With that, we also need to assume that the data used for the TF table is representative of the values seen in true matches. This assumption can clearly be violated as well. Imagine we are linking a full-population dataset to a survey of a particular ethnic minority, who also tend to share last names. We'll directly estimate the probability two random records match, which will be (relatively) high, but the term frequencies from the full-population dataset will not reflect how common those surnames are among the records we actually want to link.

I believe this means that "TFs should be calculated using the right dataset only, and we assume that there are no duplicates in the right dataset" as a strategy is both overly cautious and vulnerable to an issue. Duplicates are fine in the TF calculation data, so long as they are evenly spread across values of the field. Instead, the desirable traits in the data used to calculate TFs are (1) cleanness (in the field of interest), to minimize "rare" values that are actually errors and not expected to give useful information on an exact match, and (2) representativeness (in the field of interest), to get the commonness of values more correct among the records we should actually link. (1) and (2) can trade off with one another, and with sample size: it's conceivable that the very best data is too small to get reliable term frequencies. It also means that there is an alternate strategy for estimating these quantities.

**Term frequency with fuzzy matches**

As @samkodes noted above, it is clearly not correct in general to apply the exact-match TF adjustment to a fuzzy-match level.
Just to show this is separate from the previous, consider a case where we've got a dataset that is great at satisfying the requirements laid out in the previous section: we are doing a linkage between datasets A and B, and we know somehow that A has no typos in last name. We expect a 1-to-1 match (so A is representative of true matches), but B has typos in last name, so we want to use a fuzzy match on that field. We calculate our term frequencies using A. Let's say the two most common names in A are "Smith" and "Johnson", and they are equally common, so they get the same TF adjustment.

An assumption Splink is currently making is that all values are equally "crowded" -- that the ratio between coincidental exact matches and coincidental fuzzy matches is constant across values. Values that are very different from any other value will have almost no coincidental fuzzy matches that aren't exact. Values that are similar to other, especially common, values will have lots more coincidental fuzzy matches than they will have coincidental exact matches.

I think random sampling is most promising for eliminating this assumption; that is already used in the non-TF case, and is obviously relatively SQL-friendly. Along those lines, random sampling could be used in combination with aggregating duplicate values (see #2539 (comment) about the non-TF case, though I've yet to work out the math exactly), which would have benefits for high-cardinality columns that nevertheless have some common values. With this approach, you could either designate one side as the value that should be used for TF adjustment (the cleaner dataset, in a link-only problem), or do some simple aggregation, such as an average, of the adjustments you'd make for each of the distinct values that are being fuzzy matched. This is just an approximation, since you obviously can't do a separate TF adjustment for every pair of values.
The proposed methodology makes sense for a bag-of-words column, though I haven't thought it through it immense detail, but it definitely isn't right for alternates/options. There, I think what you'd want is essentially the same as the non-array case, just considering all pairs. That is, calculate the TFs for all values, flattening out the arrays (subject to the assumptions in the first section of this comment), and then when there is an array intersect of (at least) one, use the (average?) TF adjustment of the overlapping value(s). If you combined arrays with best-pair fuzzy matching (using #2517, perhaps 😄) you'd want to stack the strategy of the previous paragraph with the strategy of the previous section: calculate per-level, per-value u probabilities using random sampling on the flattened set, then apply the average of the two adjustments that correspond to the closest-matching pair's values (again, maybe averaging for tied pairs). Since you'd need the closest-matching pair twice (once to check that it is close enough, again to apply the TF adjustment) you'd probably want to compute it once somehow. That was a very long comment but hopefully not a waste of time! I'd be interested to see whether this makes sense to folks, and if so, how we might go about implementing some parts of this 😃 |
Hi all,

A colleague and I have had some recent discussions with @zmbc regarding term frequency adjustments for array fields, so I thought I'd chime in. The gist is that we have a linkage spine that accumulates alternate values for fields like surname, given name, and postal code. To give a concrete example, we'd like to be able to get a positive weight when comparing "DAVID" to ["DAVDI", "ANYNAME", ...] (among other such cases).

It seems like this would still be a relatively big lift, but I wanted to express that there would be interest on our side if it were implemented.
Hi @zmbc - thanks for your thoughts, they are helpful. I have a couple of things to add.

**Definitions of TF m and u probabilities**: I suppose rather than specifying the probability of a particular field value, I was thinking of conditioning on it - in other words, defining the m and u probabilities conditional on the field value. In this case it makes sense to assume that the conditional m-probability does not depend on the specific value; similarly, defining the conditional u-probability gives you the term frequency. Since Bayes' rule carries through when conditioning, these expressions are legitimate for calculating overall match probabilities, though it turns out this also requires assuming that the field value is independent of match status.

**Array-based TF approaches**: One general problem is that there is no obvious way to map TFs of array components to TFs of arrays. That is, unless we know how the arrays are generated, this relationship could be totally arbitrary. And it's further complicated by the semantics of the array comparison - e.g. best of all pairs, or something else. Another problem is that it is hard to decide what to write down as a probability statement. We can't simply emulate the single-value definition; perhaps you could instead define something at the level of whole arrays, but it's not obvious what.

If, on the other hand, we adopt a conditioning approach, it's maybe a little easier to come up with a simple model that seems reasonable. For example, suppose we condition on the left array [a,b,c] and the size of the right array. Our main assumption will be that the right-hand arrays, conditional on their array size, are generated by independent samples from the underlying TF distribution. (This is not particularly realistic, for example if these are first and middle names or aliases/nicknames, but it's a start ...) Then the u-probability for an "any intersection" level follows directly from this model (see the sketch after this comment). It also seems reasonable, as above, to define the conditional m-probability by ignoring the specific value of the left field completely, through independence assumptions. This is a reasonably general model and is consistent with the current approach of not messing with the m-probabilities for TF adjustment.

**Fuzzy-matching TFs and arrays**: I think we are more or less on the same page regarding fuzzy matching. A conditional approach like the above could be used to define fuzzy matching on arrays under the same assumptions: for the expression above, we can simply replace the exact-match term frequencies with the corresponding fuzzy-match "neighbourhood" probabilities.

Finally, @decohn's situation seems much more suited to the conditional approach.
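One way to write down the u-probability for that model, under exactly the stated assumptions (right-hand entries drawn i.i.d. from the term frequency distribution, with replacement, and distinct left-hand entries), is:

$$P\bigl(\text{intersection} \neq \emptyset \;\big|\; \text{left}=[a,b,c],\ |\text{right}|=k,\ \text{non-match}\bigr) \;=\; 1 - \bigl(1 - (\mathrm{TF}_a + \mathrm{TF}_b + \mathrm{TF}_c)\bigr)^{k}$$

This is one possible form rather than a quote of the original comment, but it follows directly from the stated generating process for an "any intersection" level.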
@samkodes Apologies, I should have addressed this more directly -- I do not think the conditioning approach is correct. As I mentioned, m is a probability over both the comparison level and the field value, given a match (see the sketch below). In other words, you are missing the consideration of how rare it is to find X in the left record. Imagine having two datasets that you expect to have a 1-to-1 link (no duplicates in either, complete overlap). Your assumptions are completely symmetric in this case, and it doesn't make sense that your quantities would depend on which dataset you happen to designate as the left one.

Edited to add: this is also true of your later equation.
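As I read it, the two definitions being contrasted here are (notation mine):

$$m_X^{\text{joint}} \;=\; P(\text{field equal} \wedge \text{left value}=X \mid \text{match}) \;=\; \underbrace{P(\text{field equal} \mid \text{left value}=X,\ \text{match})}_{m_X^{\text{cond}}} \;\cdot\; P(\text{left value}=X \mid \text{match})$$

The conditional definition keeps only the first factor; the second factor, $P(\text{left value}=X \mid \text{match})$, is the "extra factor" referred to in the surrounding comments - the consideration of how rare it is to find $X$ in the left record at all.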
I don't find this distinction of whether or not we "know how the arrays are generated" to be a useful one. All possible linkage models are going to be isomorphic to some model of how the arrays were generated. We have to choose one.

I also agree that the semantics of the array comparison are absolutely key, but the "best pair" comparison (with reasonable assumptions) is relatively simple to deal with, I think, though I probably didn't get it totally right above.

The key assumption will be the one you proposed: "that the right-hand arrays, conditional on their array size, are generated by independent samples from the underlying TF distribution" (except that you must make this assumption for the left-hand arrays as well, if you find my comment above convincing). You're right that it isn't particularly realistic in the case of aliases/nicknames, but the point is that the way it is unrealistic (if a pair matches between two arrays, it is much more likely that future pairs between those same arrays will match, as items within each array are similar to each other) doesn't matter if you take only the best pair.
This is not an issue with arrays, it is an issue with fuzzy matching in general, as I mentioned:
Instead, I proposed making a TF adjustment that is the average of the TF adjustments you would make for each value in the pair; more formally, this is based on a simplification of the joint probability of the pair of values given a non-match (I spell this out a couple of comments below). So the assumption is that the two values can be treated independently, conditional on non-match.
I think we are talking at cross purposes. I am not saying that I am defining the same TF-m-probability that you are; in fact, I am defining a different TF-m-probability, a conditional one. The key points are (1) that conditional TF-m and TF-u probabilities can still be used via Bayes' rule to estimate a p(match) for the pair - which is ultimately what we care about - and (2) that with some assumptions, the m-probabilities estimated by Splink will pop out of the math to be used in this calculation.

As an aside, the "extra factor" in your calculation, P(left record field value = X | match), is (under the reasonable assumption that the field value is independent of match status) equivalent to an "extra factor" of the unconditional P(left record field value = X), i.e. the term frequency of X.
I think this is where something is being lost. Let me see if I can more specifically point it out.
They are not the same, and only cancel out if, as I mentioned above, you make the assumption that the data you calculate term frequencies from is representative of the values found in true matches.

The point is, the TF math should be independent of which data (sub)set the user wants to designate as the most representative -- it may not be either the left or the right, but both combined, or a subset of one or the other. Likewise for which data (sub)set is most representative of the array size of matches, in the array case. That is why I think the "conditional m and u" framing confuses more than it helps -- because it makes implicit something that is a completely domain- and problem-specific decision that Splink can't rely on in general.
I think we have some basic disagreement about whether the TF math should be independent of which dataset the user wants to designate. I know v3 of Splink (and I hope v4) allowed users to upload a custom TF table, and I understood that its purpose was to be a general way to control how TFs were estimated, covering a variety of use cases.

In terms of the specific proposal for array intersection TFs, I'm not sure I understand how you are thinking of implementing it. I've proposed a conditional approach that is reducible to specific array-generating processes built on individual array element term frequencies (e.g. the independent-sampling model sketched above). First, when you propose that the left field value = [a,b,c] and the right field value = [a,d,e] are independent conditional on array intersection > 0 and non-match, I immediately want to say that if the arrays are generated according to some random process that depends on individual term frequencies, this independence can't be true, simply because they are more likely to overlap on a more frequent element than a less frequent element; so even if we condition on intersection > 0, the left and right field values cannot be independent. Second, how are you going to estimate these array-level probabilities in practice?
|
Right -- it is precisely that flexibility that I am trying to preserve with the more explicit framing (not making it implicit that one or the other dataset is used to calculate TFs). But I think overall, now that I understand what you meant by conditional m and u, I can do the translation in my head if necessary.
Sorry, I got a bit confused here between two problems:
What I wrote above was my proposal for the first problem, but I used an array example, so it all got a bit muddled. It would be clearer to say that I propose factorizing the u probability for the pair of values into two per-value terms, and then approximating those last two terms with the "neighborhood size" of "Johnson" and "Johnston", as we've discussed. Now that I write this out, I realize it is not exactly the average of the "TF adjustments" for the two terms; it is their geometric mean (see the note below). But it is probably easier to just think of it the way it is written above, and not as it relates to hypothetical adjustments.

For the arrays-of-alternates case, I think I agree with what you've proposed: using the best-pair comparison, and assuming that the arrays are both random samples from the TF distribution on the array elements. Everything I've said about this up to this point has been somewhat hand-wavey, but I'd like to work through your math in #2006 (comment).
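Stating the geometric-mean observation symbolically (notation mine; $\mathrm{adj}(t)$ denotes the per-term TF-style adjustment based on $t$'s neighbourhood size):

$$\mathrm{adj}(\text{Johnson}, \text{Johnston}) \;=\; \sqrt{\mathrm{adj}(\text{Johnson}) \cdot \mathrm{adj}(\text{Johnston})}$$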
Is your proposal related to a problem?
I have an application where I wish to link dataset A, representing noisy transactions, to dataset B, a master entity list. For dataset B (the master list) I am using arrays to represent multi-valued fields (e.g. known alternate office phone numbers); however dataset A (the transactions) has only single values in the corresponding fields (arrays of length 1). Because office phone numbers can be shared between multiple people, I would like to use term frequency adjustments. I can use dataset B, my master entity list, to calculate a user-supplied TF table for the relevant fields (exploding the multi-valued fields for an accurate count), which will then be applied to the singleton arrays in dataset A. In my application any match between a phone number in dataset A and any of the entries in the corresponding array field in dataset B counts as a match; so, to compare arrays, I am using array intersection functions with any non-zero intersection as my single non-null, non-"else", comparison level.
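For concreteness, a sketch of roughly what such a comparison looks like in Splink's dictionary format; the column name and the array SQL functions are illustrative, and this is the shape of comparison that triggers the error described below.

```python
# Illustrative only: an "any non-zero intersection" comparison with a TF adjustment
# and no exact match level. "phone" is an array-valued column.
phone_comparison = {
    "output_column_name": "phone",
    "comparison_levels": [
        {
            "sql_condition": '"phone_l" IS NULL OR "phone_r" IS NULL',
            "label_for_charts": "Null",
            "is_null_level": True,
        },
        {
            "sql_condition": 'array_length(array_intersect("phone_l", "phone_r")) > 0',
            "label_for_charts": "Any phone number in common",
            "tf_adjustment_column": "phone",  # user-supplied TF table built from dataset B
            "tf_adjustment_weight": 1.0,
        },
        {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
    ],
    "comparison_description": "Any intersection of phone numbers vs. anything else",
}
```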
The problem is that in order to use TF adjustments, this code looks for u-probabilities corresponding to an "exact match" level in the comparison. An "exact match" seems to be identified via this code, which seems to look for exact field comparisons in the SQL. Since I am using array intersections, I do not have an "exact match" level, and an error is thrown - "An exact match level is required to make a term frequency adjustment on a comparison level that is not an exact match."
One alternative is to add an additional exact match level to the model. The exact match will only happen when my multi-valued field in my master list (dataset B) happens to have a single value. However, this produces 2 match levels in my model, with distinct parameter estimates, when I only want one.
After a lot of thinking, I think I understand why this u-probability is needed. In the term frequency adjustment math here the reference u-probability is just a presentation convenience - the u-probabilities cancel out but are explicitly added to ensure that a term frequency adjustment can be represented as an addition to the weight that would occur without term frequency adjustment. In the calling code here, the exact-match u-probability is used to create the term frequency adjustment. In the exact matching case another identical u-probability must be calculated somewhere else that cancels out.
However, if the match is inexact, I am assuming that in general the u-probability that is calculated elsewhere is for the inexact match. TF adjustments should modify u-probabilities based on the exact field value distribution, regardless of whether the match is exact or not. My reasoning is that value multiplicity in the dataset is the relevant factor - if I have 10 "Sam"s in dataset B I am 10 times as likely to u-match "Sam" in dataset A but also 10 times as likely to fuzzy u-match "Spam" in dataset A, conditional on having value "Sam" in dataset B. However, we need something to adjust by the multiplicity 10. For exact matching, we multiply the multiplicity by 1/N to get T_x, because for exact matching in the absence of duplicates u_x is ~1/N. The problem is estimating u_x, in the absence of duplicate values, for a fuzzy match - i.e. we don't know in general how many records will be hit by a fuzzy match, and we can't estimate this for every x by random sampling (too many samples!). So instead we use an "inflation factor" of u-inexact/u-exact and multiply this by T_x. That's the behaviour here if u-probabilities don't cancel.
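Writing the reasoning in this paragraph as formulas (notation mine; $T_x$ is the term frequency of value $x$, $N$ the number of records, and the bars denote the level-average u probabilities):

$$T_x = \frac{\#\{x\}}{N}, \qquad u_x^{\text{exact}} \approx T_x, \qquad u_x^{\text{fuzzy}} \approx \frac{\bar u_{\text{inexact}}}{\bar u_{\text{exact}}}\, T_x$$

The question in the rest of this issue is what to use for $\bar u_{\text{exact}}$ when the comparison has no exact match level.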
The question then is how Splink could get the u-exact probability if, as in my case, there is no exact match level in the model. One option would be to add extra code to sample u-exact probabilities any time there is a TF adjustment (if not already doing so), and store these separately somewhere for retrieval as needed.
Describe the solution you'd like
Rather than demanding a corresponding exact match level in all comparisons that use TF adjustments, can Splink simply pre-compute and store u-exact probabilities separately whenever TF adjustment is required by a model? This may be justified by the argument above - that the u-inexact/u-exact inflation factor is an approximation that allows TF adjustments to be used for inexact match levels, but that there are domains in which an exact match is not meaningfully different from certain kinds of inexact matches.
EDIT: After rethinking, this approach won't work for my application but may have other uses; see response below.
Describe alternatives you've considered
One alternative is to simply define an additional exact match level. The exact match will only happen when my multi-valued field in my master list (dataset B) happens to have a single value. However, this produces 2 match levels in my model, with distinct parameter estimates, when I only want one (i.e., I have a priori knowledge that these two levels do not provide different levels of match evidence). Moreover, the effective TF adjustments for the 2 match levels are different (because they have different inflation factors), behaviour which I don't want.
Additional context