-
Yes, it is. We do this in our work, and this is how: you can create, for example, a phonetic transformation of the two names. Double Metaphone is very useful for ruling out names that have no chance of being the same, so it is very similar to your idea. Have a look here about it. You can register this function from the jar we include with Splink, use it for some preprocessing, and then use the result in the blocking step.
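As a rough illustration of the idea (not Splink's API and not the function from the jar), here is a minimal pure-Python sketch. It assumes Soundex as a simpler stand-in for Double Metaphone, and `candidate_pairs` and the sample records are hypothetical names made up for the example. The phonetic key is computed once per record, and only records sharing a key are paired.

```python
from collections import defaultdict
from itertools import combinations

# Soundex digit groups; vowels, h, w and y carry no code.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
_CODES = {c: d for d, letters in _GROUPS.items() for c in letters}


def soundex(name: str) -> str:
    """American Soundex key (a simple stand-in for Double Metaphone)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        code = _CODES.get(c, "")
        if code and code != prev:
            key += code
        if c not in "hw":  # h and w do not separate letters with equal codes
            prev = code
    return (key + "000")[:4]  # pad or truncate to one letter plus three digits


def candidate_pairs(records):
    """Pair up only the records whose names share a phonetic key."""
    blocks = defaultdict(list)
    for rec_id, name in records:
        blocks[soundex(name)].append(rec_id)
    return sorted(p for ids in blocks.values() for p in combinations(ids, 2))


# Hypothetical sample data: (record id, surname).
records = [(1, "Smith"), (2, "Smyth"), (3, "Robert"), (4, "Rupert"), (5, "Jones")]
pairs = candidate_pairs(records)
```

Because the key is computed per record rather than per pair, the preprocessing cost grows linearly with the number of records, and the blocking step itself stays a plain equality comparison.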
-
And think about this for a moment. To block on a similarity score, you would need to run the whole Cartesian product once, as preprocessing, because the score has to be computed for every candidate pair. So effectively you will not be blocking (!). For smaller workloads that's not a problem, but as the workload gets bigger it becomes one. Now, about the inequality... it's a bigger discussion, and I will come back to it a bit later. Had a long working day and need a break 😄 👍
-
What I can see is a lot of potential for good preprocessing. 😄

For companies, you could take only the first letter of each token and block on equality of that derived value, while still comparing on the full string.

Regarding the job column: here in the UK there is a SOC classification of jobs. You can perhaps find a way to classify your job strings into a relevant classification, then perform equality blocking on that.

Also, find standard tokens and preprocess them to become more useful: " Rd. " -> " Road".

But if you are talking about geographical information in particular: https://gist.github.com/mamonu/daad74f56f5d584bcf3b65945ec43c10. By batching (you need to rate-limit your requests so you don't get banned by the provider) you could really fix your address quality.

On the blocking side, you don't need it to be perfect. You only need something that stops you ending up with really dissimilar candidate pairs. So by map-transforming your columns with some preprocessing you can get better results.
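A small sketch of two of these preprocessing steps, in plain Python and independent of Splink. The abbreviation table and the function names are hypothetical, chosen only for illustration; a real table needs care (for example, "St" can mean "Street" or "Saint").

```python
import re

# Hypothetical abbreviation table; extend or replace it for your own data.
ABBREVIATIONS = {"rd": "road", "ave": "avenue", "ln": "lane"}


def tokens(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def standardise_address(address: str) -> str:
    """Expand known abbreviations so that 'Rd.' and 'Road' compare equal."""
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens(address))


def company_initials(company: str) -> str:
    """First letter of each token, usable as an equality blocking key."""
    return "".join(t[0] for t in tokens(company))


clean = standardise_address("12 Abbey Rd.")
key_a = company_initials("Acme Trading Ltd.")
key_b = company_initials("ACME Trading Limited")
```

Here `key_a` and `key_b` come out identical, so the two spellings of the company land in the same block, and the full strings can still be compared afterwards.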
-
Note that it is possible to block on a similarity score. The syntax would be:
But everything Theo says is correct: this results in the Cartesian product of record comparisons being created, evaluated, and then filtered down according to the condition. So on larger datasets it's not a good idea!
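To make the cost concrete, here is a minimal pure-Python sketch of what blocking on a similarity threshold amounts to. It is not Splink code. It assumes `difflib.SequenceMatcher` as a stand-in score (it is not Jaro-Winkler), and `block_on_similarity` and the sample names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]; this is not Jaro-Winkler."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def block_on_similarity(names, threshold=0.7):
    """Keep pairs scoring above the threshold, counting every pair scored."""
    evaluated = 0
    kept = []
    # Every unordered pair has to be scored before it can be filtered.
    for (i, a), (j, b) in combinations(enumerate(names), 2):
        evaluated += 1
        if similarity(a, b) > threshold:
            kept.append((i, j))
    return kept, evaluated


# Hypothetical sample data.
names = ["Jonathan", "Jonathon", "Maria", "Mario", "Zed"]
kept, evaluated = block_on_similarity(names)
```

For n records, `evaluated` is always n(n-1)/2, however few pairs survive the filter. That is fine for five names and painful for five million.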
-
Hello. Every example I have seen of blocking always uses `=` (equality) comparisons, but I would like to block candidates that, e.g., have a `jaro-winkler` similarity on the name higher than a specified threshold (e.g. `0.7`). So I would like to do something like this:
Is this possible at all? Are there any examples of more complex blocking rules?