-
I am trying to estimate a number of comparisons using the following blocking rule:
but getting an error
Should I define rules that utilize some function differently? |
Beta Was this translation helpful? Give feedback.
Replies: 3 comments 4 replies
-
Here's a minimal example:
and here is the output:
Looks like SQL parser cuts prefix |
Beta Was this translation helpful? Give feedback.
-
SolutionThere seems to be a problem with how the date_diff function is transpiled in sqlglot. You can work around it using the following blocking rule:
I'd be grateful if you'd let me know if this doesn't work Other notesI should also note that these kinds of 'within a range' blocking rules execute much less efficiently than anything with an equality condition. To evaluate this blocking rules, Splink will have to generate ALL possible comparisons from the input records i.e. the cartesian product, and then evaluate this inequality condition. So if you have large input datasets, this will be extremely slow. One possible way of avoiding this may be to round your dates, and use an equality condition on the rounded dates (using more than one blocking rule). The equivalent for integers would be something like the following two blocking rules.
You need both so that e.g. the values |
Beta Was this translation helpful? Give feedback.
-
I successfully used another workaround instead of DATE_DIFF. Created two additional columns in a dataset: |
Beta Was this translation helpful? Give feedback.
Solution
There seems to be a problem with how the date_diff function is transpiled in sqlglot.
You can work around it using the following blocking rule:
I'd be grateful if you'd let me know if this doesn't work
Other notes
I should also note that these kinds of 'within a range' blocking rules execute much less efficiently than anything with an equality condition.
To evaluate this blocking rules, Splink will have to generate ALL possible comparisons from the input records i.e. the cartesian product, and then evaluate this inequality condition. So if you have large input datasets, this will be extre…