Behavior of manually_apply_fellegi_sunter_weights() #389
-
Hi Eric, it's definitely a reasonable idea. If I understand correctly, you mention two possibilities (not mutually exclusive):

1. Filtering the scored output, i.e. dropping comparisons whose posterior falls below some threshold after they have been computed.
2. Filtering during computation (within each executor), so that low-posterior comparisons are discarded as Spark works through them rather than accumulating in memory.

Both are sensible, and the greatest (computational) efficiency gains would come from (2). On the other hand, (1) is simplest to implement. (In fact, (1) is a planned feature of Splink 3 - i.e. it's not yet implemented, but it will be.) In order to achieve (1), I think there are two possibilities:
You could add the filtering step within Splink itself, or you could do it in your own code, i.e. outside Splink (something like the sketch below). I think (but haven't tested) that Spark will run this relatively efficiently and that it will result in some memory improvements.
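A minimal sketch of the own-code route, assuming the Splink 2 API in which `manually_apply_fellegi_sunter_weights()` returns a Spark DataFrame with a `match_probability` column (the paths and threshold are illustrative):

```python
from pyspark.sql import SparkSession
from splink import Splink

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("input_records")  # illustrative input path
settings = {}  # placeholder: your trained Splink 2 settings dictionary

linker = Splink(settings, df, spark)
df_e = linker.manually_apply_fellegi_sunter_weights()

# Spark evaluates lazily, so this filter is fused into the scoring
# job itself and near-zero posteriors are never persisted.
df_e = df_e.filter("match_probability > 0.01")
df_e.write.mode("overwrite").parquet("scored_comparisons_filtered")
```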
In Splink 2, (2) is a bit more tricky. Instead, you'd want a mechanism for filtering the comparison vectors. One possibility could be to add a filter argument to the relevant scoring function. In Splink 3, that functionality would fit in much more cleanly (see the sketch at the end of this comment for the general idea).

Happy to look at a PR if you'd like to add this to Splink 2. Or, if you're not confident enough to do a PR, do have a go and experiment with any of the options above. If you find something that seems to give you a big speed improvement, I can look at how to integrate it into the main codebase.
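To make the shape of that concrete: a rough sketch of fusing the posterior calculation with an early filter in PySpark. Everything below is illustrative (the m/u values, the `gamma_*` column names, and the function itself), not Splink internals.

```python
from functools import reduce
from operator import mul
from pyspark.sql import functions as F

# Illustrative m/u probabilities for two comparison columns with
# binary gammas (0 = disagree, 1 = agree); real values would come
# from the trained model.
PARAMS = {
    "surname": {"m": [0.1, 0.9], "u": [0.99, 0.01]},
    "dob":     {"m": [0.2, 0.8], "u": [0.95, 0.05]},
}

def score_and_filter(df_gammas, params=PARAMS, prior=1e-4, threshold=0.01):
    """Compute Fellegi-Sunter posteriors from the gamma columns and
    discard low-scoring comparisons within the same lazy Spark plan."""
    # Bayes factor: product over columns of m[gamma] / u[gamma].
    bayes_factor = reduce(mul, [
        F.when(F.col(f"gamma_{col}") == 1, p["m"][1] / p["u"][1])
         .otherwise(p["m"][0] / p["u"][0])
        for col, p in params.items()
    ])
    odds = (prior / (1 - prior)) * bayes_factor
    scored = df_gammas.withColumn("match_probability", odds / (odds + 1))
    # The filter fuses into the same stage as the arithmetic above,
    # so below-threshold rows are dropped inside each executor
    # rather than being materialised first.
    return scored.filter(F.col("match_probability") > threshold)
```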
-
Wanted to circle back on this, in light of your thorough recommendations above. Given the setup of our computing cluster, the easiest and fastest thing for me to do was simply to submit a batch job that iterates over partitions of the data, where each partition loads the model, applies the weights, filters the resulting output, and stores it. For linking, e.g., you can just partition data frame A into three parts and link each one to data frame B. For deduplication of data frame A, you can add a condition based on the partition number to the blocking rules each time, eliminating some proportion of the comparisons in each pass (compare within 1, within 2, within 3, then 1-2, 1-3, 2-3; see the sketch at the end of this comment). But that's not a very pleasant solution.

For Splink 3, it looks like you've added this code already. (Thank you!) For Splink 2, if I re-run this code I will try out solution (1) and let you know if I see any memory usage or speed improvements.
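For concreteness, the deduplication passes look roughly like this (a sketch; `part` is a precomputed integer partition column, and `base_blocking_rules` stands in for the model's actual rules):

```python
from itertools import combinations_with_replacement

base_blocking_rules = ["l.surname = r.surname"]  # illustrative

# Six passes cover every pair of records exactly once:
# (1,1), (1,2), (1,3), (2,2), (2,3), (3,3).
for a, b in combinations_with_replacement([1, 2, 3], 2):
    if a == b:
        extra = f"(l.part = {a} AND r.part = {a})"
    else:
        # Symmetric, so a cross-partition pair is caught regardless
        # of which record lands on the "l" side.
        extra = (f"((l.part = {a} AND r.part = {b}) OR "
                 f"(l.part = {b} AND r.part = {a}))")
    rules = [f"({rule}) AND {extra}" for rule in base_blocking_rules]
    # ...one batch job per pass: load the model with `rules`, apply
    # the weights, filter the scored output, and store it.
```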
-
Hi all,
I am having success using this package in my ongoing academic research.
I am wondering how easy it would be to implement a modification to the `manually_apply_fellegi_sunter_weights()` function, and I was hoping to get some thoughts/tips from the package authors or other users as to how one might go about it (as I am still a Spark novice).

**Existing behavior:** `manually_apply_fellegi_sunter_weights()` returns all gammas and posteriors for all comparisons defined by the blocking rules. With fairly lenient blocking rules, the output consists mostly of comparisons with posteriors of 0 (or ~0). (For example, one billion comparisons under lenient blocking rules might yield only one million posteriors greater than 0.01.) Obviously this is great for diagnostics, but it does sacrifice memory.

**Desired behavior:** An option to filter gammas and/or posteriors (within each executor?) as Spark works through the comparisons, with the goal of reducing memory usage. (I see that the function computes all of the gammas, then the posteriors, so I don't think this change would be trivial to implement.) I have been thinking about the behavior of `fastLink` in R, in which (to my understanding) comparisons are computed in batches, and comparisons with posteriors below a threshold are disposed of as the function works through the calculations. The goal here would be to do something similar (a rough sketch follows at the end of this post).

**Proposed implementation:** I think this would have to be written in such a way that the gammas returned by `add_gammas()` are a subset of those passed to the function, OR it would have to combine the gamma calculations with the posterior calculations within each executor.

I'm simply not familiar enough with how Spark executes the code in this package to know a) how easy it would be to do this, and b) if it's reasonable, how one might go about writing this modification. Any advice (including "it's not going to work because of the way Spark assigns tasks to the executors") is welcome!
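For concreteness, the fastLink-style batched disposal I have in mind (purely an illustrative sketch; `posterior` stands in for a hypothetical match-probability function and is not part of either package):

```python
def score_in_batches(record_pairs, posterior, batch_size=100_000, threshold=0.01):
    """Score comparisons batch by batch, discarding low posteriors
    immediately so that only promising pairs are ever held in memory."""
    kept = []
    for start in range(0, len(record_pairs), batch_size):
        batch = record_pairs[start:start + batch_size]
        for pair in batch:
            p = posterior(pair)  # hypothetical scoring function
            if p >= threshold:
                kept.append((pair, p))
        # Pairs below the threshold are now unreferenced and can be
        # garbage-collected before the next batch is scored.
    return kept
```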
Thanks again for publishing such a great tool.