Blocking with similarity functions #461

ivsanro1 · 2022-05-12T14:26:19Z

Hello. Every blocking example I have seen uses = (equality) comparisons, but I would like to block on candidates whose names have, for example, a Jaro-Winkler similarity above a specified threshold (e.g. 0.7).

So I would like to do something like this:

{
    "blocking_rules": [
        "jarowinkler(name_l, name_r) > 0.7"
    ]
}

Is this possible at all? Are there any examples of more complex blocking rules?

mamonu · 2022-05-12T15:20:56Z

Yes, it is. We do this in our work.

Here is how: you can create, for example, a phonetic transformation of the two names.

Double Metaphone is very useful for ruling out names that have no chance of being the same, so it is very similar to your idea.

Have a look here for more about it:
https://en.wikipedia.org/wiki/Metaphone

You can register this function from the jar we include with splink:
from pyspark.sql import types
spark.udf.registerJavaFunction("Dmetaphone", "uk.gov.moj.dash.linkage.DoubleMetaphone", types.StringType())

Then you can do some preprocessing:

import pyspark.sql.functions as f

df = df.withColumn('Surname1DM', f.expr('Dmetaphone(Surname1)'))
df = df.withColumn('Surname2DM', f.expr('Dmetaphone(Surname2)'))

And then, in the blocking step:

{
    "blocking_rules": [
        "Surname1DM=Surname2DM", ... anything else you want to add
    ]
}

2 replies

ivsanro1 May 12, 2022
Author

Thank you for the suggestion @mamonu

If I understand correctly, you're mapping each name to its Metaphone code, but the check you then make in blocking_rules is still an equality (=), so at the end of the day it is another example of the form col_l = col_r.

My question is still unresolved, since I want to use a different comparison operator (>) in the blocking rules, together with a similarity score and a threshold, along the lines of jarowinkler(name_l, name_r) > 0.7. This is different from every example I've seen so far, which is why I want to know whether it is possible in the blocking rules.

In other words, for a given row from the left database, I want to form blocks with the names from the right database that are similar to its name. The problem with blocking rules of the type name_l = name_r is that if a name has a typo, it gets excluded from the block and therefore won't match. I want the blocks to be formed heuristically, so that similar names also end up in the block while unrelated names are still filtered out.

mamonu May 12, 2022

> The problem with blocking rules of the type name_l = name_r is that if a name has a typo, it gets excluded from the block and therefore won't match. I want the blocks to be formed heuristically, so that similar names also end up in the block while unrelated names are still filtered out.

But that's why Double Metaphone helps: very similar names with a typo are more likely to have the same Metaphone code.
Here is an example using Python's metaphone package instead of the Spark-optimised version we use; the concept is the same.
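
The idea can also be sketched without any third-party package. The block below is a minimal illustration that uses Soundex, a simpler phonetic algorithm, as a stand-in for Double Metaphone (it is not the function shipped in the splink jar, and the function name `soundex` is only chosen for this sketch). It shows a name and a misspelling of it collapsing to the same key, which is what makes equality blocking on the key tolerant of typos.

```python
# Soundex digit for each consonant; vowels and y are absent on purpose.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Return the 4-character American Soundex key of a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w are skipped and do not separate equal codes
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            key += digit
        prev = digit  # a vowel resets prev, so a repeated code is kept
        if len(key) == 4:
            break
    return key.ljust(4, "0")


# A transposition typo still lands in the same block.
print(soundex("Johnson"), soundex("Jonhson"))  # J525 J525
```

Phonetic keys are coarse by design: they will also group some names that are not the same, which is acceptable for blocking because the detailed comparison happens afterwards.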

mamonu · 2022-05-12T18:42:27Z

And think about this for a moment: with something like jarowinkler(name_l, name_r) > 0.7, you will need to run the whole Cartesian product once, as preprocessing, because the function has to be evaluated on every candidate pair. So effectively you will not be blocking (!)

For smaller workloads that's not a problem, but as the workload gets bigger it becomes one.
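
To make the cost difference concrete, here is a small sketch in plain Python (not how splink or Spark execute a join internally; the helper name `equality_block` and the toy data are made up for illustration). An equality rule lets the engine look candidates up by key, so only records sharing a key are ever paired, whereas a similarity predicate has to be scored on every left/right combination.

```python
from collections import defaultdict


def equality_block(left, right, key):
    """Pair up records that share a blocking key, using a hash lookup."""
    index = defaultdict(list)
    for j, rec in enumerate(right):
        index[key(rec)].append(j)
    pairs = []
    for i, rec in enumerate(left):
        # Only records with the same key are ever touched.
        for j in index.get(key(rec), []):
            pairs.append((i, j))
    return pairs


left = ["smith", "jones", "brown"]
right = ["smith", "smyth", "jones", "green"]

# Toy blocking key: the first letter of the name.
pairs = equality_block(left, right, key=lambda name: name[0])

# A rule such as similarity(l, r) > 0.7 must be evaluated on all of these.
full_product = len(left) * len(right)

print(len(pairs), "pairs under the equality rule")
print(full_product, "pairs to score under a similarity rule")
```

The gap between the two counts grows with the product of the table sizes, which is why the similarity rule stops being practical on large inputs.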

Now, about the inequality... that's a bigger discussion, and I will come back to it a bit later. Had a long working day and need a break 😄 👍

1 reply

ivsanro1 May 12, 2022
Author

Thank you for the suggestions. I appreciate them a lot, even if they don't directly answer my question. They'll come in handy for a different problem that I have.

I was asking how to block with similarity functions because my use case is slightly different. I used the name example for the sake of simplicity, but it's a little more complex than that.

I need to join two databases of client records. The databases look like this (database 1 has ~50k rows and database 2 has ~100k; this is an extract of 3 rows from each):

Database 1 (left)

| Address | Job | Company | Age |
| --- | --- | --- | --- |
| 3562 Hott Street, Oklahoma City | Outdoor activities/education manager | N/A | 34 |
| 765 Shingleton Road, Grand Rapids | Structural engineer | Simpson Gumpertz & Heger Inc. | 45 |
| 4360 High Meadow Lane, Pittston | Chemist, analytical | Vertex Pharmaceuticals | 52 |

Database 2 (right)

| Address | Address extra | Job | Company | Age |
| --- | --- | --- | --- | --- |
| Hott St. | Oklahoma | education | N/A | 35 |
| Shingleton Rd. | Rapids | engineer | Simpson G. & H. Inc. | 45 |
| H. Mdw. Lane | Pittston | Chemist | Vertex Pharma | 53 |

As you can see, the data quality in database 2 is much poorer, so there's no way of blocking effectively using =.

Therefore, I wanted to block on the Address_l and Address_r fields using some robust comparison (Jaro-Winkler or Levenshtein distance) and then, within each block, do the matching with more (also robust) comparisons on the other fields.

I understand that blocking with such an expensive operation is not optimal, but in this scenario it's either that or matching without blocking, which would be even more computationally expensive.

mamonu · 2022-05-12T22:57:47Z

What I can see is a lot of potential for good preprocessing. 😄
Bad data is universal, you see. :) Some of the ways to deal with it are standardisation and preprocessing, assuming the data above is representative of the kind of data you have.

For companies, you could keep only the first letter of each token:
"Simpson G. & H. Inc." -> S G H I
"Simpson Gumpertz & Heger Inc." -> S G H I

Then block on equality of those initials from the company name, but compare on the full string.
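
A minimal sketch of that initials key in plain Python follows. The function name `company_initials` is made up for illustration, and treating "N/A" as a missing value (returning `None` so it is never used as a blocking key) is an assumption based on the sample rows, not something stated above.

```python
import re
from typing import Optional


def company_initials(name: str) -> Optional[str]:
    """Reduce a company name to the first letter of each token."""
    # Assumption: "N/A" and empty strings mean "no company recorded".
    if name.strip().upper() in {"", "N/A"}:
        return None
    # Tokens are runs of letters/digits, so "&" and "." are dropped.
    tokens = re.findall(r"[A-Za-z0-9]+", name)
    return " ".join(token[0].upper() for token in tokens)


print(company_initials("Simpson G. & H. Inc."))           # S G H I
print(company_initials("Simpson Gumpertz & Heger Inc."))  # S G H I
```

Note that this key only helps when both sides keep the same number of tokens: "Vertex Pharma" and "Vertex Pharmaceuticals" both give "V P", but a name that drops a word entirely would produce a different key.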

Regarding the job column: here in the UK there is the SOC classification of jobs. You could perhaps find a way to map your job strings onto a relevant classification, then perform equality blocking on that.

Also, find standard tokens and preprocess them into a more useful form:

" Rd. " -> " Road"
" St. " -> " Street"

But if you are talking about geographical information in particular, have a look at this geocoding example:

https://gist.github.com/mamonu/daad74f56f5d584bcf3b65945ec43c10

By batching (you need to rate-limit your requests so you don't get banned by the provider) you could really fix your address quality.

On the blocking side, it doesn't need to be perfect. You only need something that keeps you from ending up with really dissimilar candidate pairs. So by transforming your columns with some preprocessing you can get better results.
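
As an illustration of the token standardisation mentioned above (" Rd. " -> " Road", " St. " -> " Street"), here is a minimal sketch in plain Python. The lookup table contains only those two abbreviations, and the function name `expand_abbreviations` is made up for this example; a real table would need to be built from the tokens that actually occur in the data.

```python
# Only the two abbreviations from the thread; extend as needed.
# Caveat: "St" can also mean "Saint", so check the data before expanding it.
ABBREVIATIONS = {"rd": "Road", "st": "Street"}


def expand_abbreviations(address: str) -> str:
    """Replace known abbreviated tokens with their full form."""
    expanded = []
    for token in address.split():
        key = token.rstrip(".,").lower()
        if key in ABBREVIATIONS:
            # Keep a trailing comma so the address structure survives.
            suffix = "," if token.endswith(",") else ""
            expanded.append(ABBREVIATIONS[key] + suffix)
        else:
            expanded.append(token)
    return " ".join(expanded)


print(expand_abbreviations("Hott St."))        # Hott Street
print(expand_abbreviations("Shingleton Rd."))  # Shingleton Road
```

After this step "Hott St." becomes "Hott Street", which is a substring of the left-hand address "3562 Hott Street, Oklahoma City", so further preprocessing (for example stripping the house number and city) would still be needed before an equality rule could match the two.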

1 reply

ivsanro1 May 13, 2022
Author

You're right that good preprocessing can work miracles :)

The idea of normalizing the addresses looks good. I'm not sure I can always end up with the same string on both sides for effective blocking, but it's something I can definitely play with.

Also, I find the geocoder the most interesting of them, because it opens up the opportunity to play a lot with the coordinates. E.g. for clients who are relatively far apart geographically, even if a client moves somewhere near their old address, you can block them quite effectively by computing the distance between the latitude/longitude pairs and then thresholding.
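
A rough sketch of that distance-and-threshold check, assuming the geocoder returns (latitude, longitude) pairs in decimal degrees. It uses the haversine great-circle formula with a mean Earth radius of 6371 km; the function names and the 5 km threshold in the example are made up for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_km(point_a, point_b, threshold_km):
    """True when two (lat, lon) points are at most threshold_km apart."""
    return haversine_km(*point_a, *point_b) <= threshold_km


# One degree of latitude is roughly 111 km.
print(round(haversine_km(0.0, 0.0, 1.0, 0.0), 1))
# Two points about 1.1 km apart pass a 5 km threshold.
print(within_km((0.0, 0.0), (0.0, 0.01), threshold_km=5))
```

Like any similarity threshold, a distance threshold used directly as a blocking rule still has to be evaluated on every pair. Rounding the coordinates into grid cells and blocking on equality of the cell would avoid that, at the cost of missing pairs that straddle a cell boundary.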

By the way, I already found the answer to my original question. It looks like the blocking rules let you use inequality operators, just as in a normal SQL expression, which is awesome. Since the expressions in case_expressions are also SQL, I tried using the registered functions from my case_expressions (e.g. jaro_winkler_sim) in the blocking_rules, and it looks like they work for blocking too (although, as you said, the quadratic complexity might make it unfeasible for big datasets).

Example:

    "blocking_rules": [
        "jaro_winkler_sim(l.address, r.address) > 0.7"
    ]

This works fine for blocking heuristically. So it's definitely possible to block like this, in case a dataset is so bad that no amount of preprocessing is enough.

It might also be good to have this on record for people who need it in the future.

Thank you for all the help, @mamonu !

RobinL · 2022-05-13T09:20:11Z

Maintainer

Note that it is possible to block on a similarity score. The syntax would be:

"jarowinkler(l.name, r.name) > 0.7"

But everything Theo says is correct: this results in the Cartesian product of record comparisons being created, evaluated, and then filtered down according to the condition. So on larger datasets it's not a good idea!
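
The "create, evaluate, filter" behaviour can be sketched in plain Python. This is only an illustration of the logic: the Jaro-Winkler implementation below is a textbook version written for this example (not the UDF that splink registers), and `similarity_block` is a hypothetical helper name.

```python
from itertools import product


def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters, in order, from each side; mismatches are
    # half-transpositions.
    m1 = [c for c, used in zip(s1, used1) if used]
    m2 = [c for c, used in zip(s2, used2) if used]
    transpositions = sum(a != b for a, b in zip(m1, m2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)


def similarity_block(left, right, threshold=0.7):
    """Score every left/right pair, then keep those above the threshold."""
    kept = []
    for (i, a), (j, b) in product(enumerate(left), enumerate(right)):
        if jaro_winkler(a, b) > threshold:
            kept.append((i, j))
    return kept


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))       # 0.9611
print(similarity_block(["martha"], ["marhta", "zzzz"]))  # [(0, 0)]
```

The loop in `similarity_block` runs `len(left) * len(right)` times regardless of how many pairs survive the filter, which is the cost described above.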
