Behavior of manually_apply_fellegi_sunter_weights() #389
-
Hi Eric, it's definitely a reasonable idea. If I understand correctly, you mention two possibilities (not mutually exclusive):

1. Filtering the scored output, i.e. dropping comparisons whose posterior falls below some threshold after they have been computed.
2. Filtering during computation (within each executor), so that low-posterior comparisons are discarded as Spark works through them rather than accumulating in memory.

Both are sensible, and the greatest (computational) efficiency gains would come from (2). On the other hand, (1) is simplest to implement. (In fact, (1) is a planned feature of Splink 3 - i.e. it's not yet implemented, but it will be.) In order to achieve (1), I think there are two possibilities:
You could add the filtering step within Splink itself, or you could do it in your own code, i.e. outside Splink (something like the sketch below). I think (but haven't tested) that Spark will run this relatively efficiently and that it will result in some memory improvements.
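A minimal sketch of the own-code route, assuming the Splink 2 API in which `manually_apply_fellegi_sunter_weights()` returns a Spark DataFrame with a `match_probability` column (the paths and threshold are illustrative):

```python
from pyspark.sql import SparkSession
from splink import Splink

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("input_records")  # illustrative input path
settings = {}  # placeholder: your trained Splink 2 settings dictionary

linker = Splink(settings, df, spark)
df_e = linker.manually_apply_fellegi_sunter_weights()

# Spark evaluates lazily, so this filter is fused into the scoring
# job itself and near-zero posteriors are never persisted.
df_e = df_e.filter("match_probability > 0.01")
df_e.write.mode("overwrite").parquet("scored_comparisons_filtered")
```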
In Splink 2, (2) is a bit more tricky. Instead, you'd want a mechanism for filtering the comparison vectors. One possibility could be to add a filter argument to the relevant scoring function. In Splink 3, that functionality would fit in much more cleanly (see the sketch at the end of this comment for the general idea).

Happy to look at a PR if you'd like to add this to Splink 2. Or, if you're not confident enough to do a PR, do have a go and experiment with any of the options above. If you find something that seems to give you a big speed improvement, I can look at how to integrate it into the main codebase.
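To make the shape of that concrete: a rough sketch of fusing the posterior calculation with an early filter in PySpark. Everything below is illustrative (the m/u values, the `gamma_*` column names, and the function itself), not Splink internals.

```python
from functools import reduce
from operator import mul
from pyspark.sql import functions as F

# Illustrative m/u probabilities for two comparison columns with
# binary gammas (0 = disagree, 1 = agree); real values would come
# from the trained model.
PARAMS = {
    "surname": {"m": [0.1, 0.9], "u": [0.99, 0.01]},
    "dob":     {"m": [0.2, 0.8], "u": [0.95, 0.05]},
}

def score_and_filter(df_gammas, params=PARAMS, prior=1e-4, threshold=0.01):
    """Compute Fellegi-Sunter posteriors from the gamma columns and
    discard low-scoring comparisons within the same lazy Spark plan."""
    # Bayes factor: product over columns of m[gamma] / u[gamma].
    bayes_factor = reduce(mul, [
        F.when(F.col(f"gamma_{col}") == 1, p["m"][1] / p["u"][1])
         .otherwise(p["m"][0] / p["u"][0])
        for col, p in params.items()
    ])
    odds = (prior / (1 - prior)) * bayes_factor
    scored = df_gammas.withColumn("match_probability", odds / (odds + 1))
    # The filter fuses into the same stage as the arithmetic above,
    # so below-threshold rows are dropped inside each executor
    # rather than being materialised first.
    return scored.filter(F.col("match_probability") > threshold)
```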
-
Wanted to circle back on this, in light of your thorough recommendations above. Given the setup of our computing cluster, the easiest and fastest thing for me to do was simply to submit a batch job that iterates over partitions of the data, where each partition loads the model, applies the weights, filters the resulting output, and stores it. For linking, e.g., you can just partition data frame A into three parts and link each one to data frame B. For deduplication of data frame A, you can add a condition based on the partition number to the blocking rules each time, eliminating some proportion of the comparisons in each pass (compare within 1, within 2, within 3, then 1-2, 1-3, 2-3; see the sketch at the end of this comment). But that's not a very pleasant solution.

For Splink 3, it looks like you've added this code already. (Thank you!) For Splink 2, if I re-run this code I will try out solution (1) and let you know if I see any memory usage or speed improvements.
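For concreteness, the deduplication passes look roughly like this (a sketch; `part` is a precomputed integer partition column, and `base_blocking_rules` stands in for the model's actual rules):

```python
from itertools import combinations_with_replacement

base_blocking_rules = ["l.surname = r.surname"]  # illustrative

# Six passes cover every pair of records exactly once:
# (1,1), (1,2), (1,3), (2,2), (2,3), (3,3).
for a, b in combinations_with_replacement([1, 2, 3], 2):
    if a == b:
        extra = f"(l.part = {a} AND r.part = {a})"
    else:
        # Symmetric, so a cross-partition pair is caught regardless
        # of which record lands on the "l" side.
        extra = (f"((l.part = {a} AND r.part = {b}) OR "
                 f"(l.part = {b} AND r.part = {a}))")
    rules = [f"({rule}) AND {extra}" for rule in base_blocking_rules]
    # ...one batch job per pass: load the model with `rules`, apply
    # the weights, filter the scored output, and store it.
```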
-
Hi all,
I am having success using this package in my ongoing academic research.
I am wondering how easy it would be to implement a modification to the `manually_apply_fellegi_sunter_weights()` function, and I was hoping to get some thoughts/tips from the package authors or other users as to how one might go about it (as I am still a Spark novice).

**Existing behavior:** `manually_apply_fellegi_sunter_weights()` returns all gammas and posteriors for all comparisons defined by the blocking rules. With fairly lenient blocking rules, the output consists mostly of comparisons with posteriors of 0 (or ~0). (For example, one billion comparisons under lenient blocking rules might yield only one million posteriors greater than 0.01.) Obviously this is great for diagnostics, but it does sacrifice memory.

**Desired behavior:** An option to filter gammas and/or posteriors (within each executor?) as Spark works through the comparisons, with the goal of reducing memory usage. (I see that the function computes all of the gammas, then the posteriors, so I don't think this change would be trivial to implement.) I have been thinking about the behavior of `fastLink` in R, in which (to my understanding) comparisons are computed in batches, and comparisons with posteriors below a threshold are disposed of as the function works through the calculations. The goal here would be to do something similar (a rough sketch follows at the end of this post).

**Proposed implementation:** I think this would have to be written in such a way that the gammas returned by `add_gammas()` are a subset of those passed to the function, OR it would have to combine the gamma calculations with the posterior calculations within each executor.

I'm simply not familiar enough with how Spark executes the code in this package to know a) how easy it would be to do this, and b) if it's reasonable, how one might go about writing this modification. Any advice (including "it's not going to work because of the way Spark assigns tasks to the executors") is welcome!
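For concreteness, the fastLink-style batched disposal I have in mind (purely an illustrative sketch; `posterior` stands in for a hypothetical match-probability function and is not part of either package):

```python
def score_in_batches(record_pairs, posterior, batch_size=100_000, threshold=0.01):
    """Score comparisons batch by batch, discarding low posteriors
    immediately so that only promising pairs are ever held in memory."""
    kept = []
    for start in range(0, len(record_pairs), batch_size):
        batch = record_pairs[start:start + batch_size]
        for pair in batch:
            p = posterior(pair)  # hypothetical scoring function
            if p >= threshold:
                kept.append((pair, p))
        # Pairs below the threshold are now unreferenced and can be
        # garbage-collected before the next batch is scored.
    return kept
```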
Thanks again for publishing such a great tool.