Hello, we are using Splink for deduplication on our data with Databricks/Spark. Currently we are running on v3.9.15, which works fine for us, but we wanted to migrate to v4 to stay up to date. Unfortunately, we observed very poor performance when running our model on the new version. In the current setup we are trying to dedupe 150M records, which takes about 1 hour on v3.9.15 on our Databricks cluster. On v4.0.2 the Spark jobs seem to get stuck at some point, so that even after around 16 hours the run had not finished. When comparing the logged SQL queries, it seemed to us that in v4 the predict() SQL is structured slightly differently than in v3. Is there anyone else facing such issues, or does anyone know about other changes from v3 to v4, besides the slight difference in the predict() SQL, that might have such an impact, especially when running in a Databricks/Spark environment? I attached the debug logs for reference. Any help would be appreciated!
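For reference, this is roughly how we turned on the SQL logging that produced the attached debug logs. It is only a minimal sketch, assuming Splink exposes its generated SQL through Python's standard logging under the logger name "splink" (adjust if your setup routes logs differently):

    import logging

    # Assumption: Splink writes its generated SQL to Python's standard
    # logging module under the logger name "splink". Raising the verbosity
    # makes the predict() SQL show up in the driver logs, so the v3.9.15
    # and v4.0.2 queries can be diffed side by side.
    logging.basicConfig(format="%(message)s")
    logging.getLogger("splink").setLevel(logging.DEBUG)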
-
Yes, your understanding of how the pipeline has changed between 3 and 4 is correct. Does the __splink__blocked_id_pairs table get created? There should be a message in the log of the form 'blocking completed in x seconds'. To work out what's going wrong, it'd be useful to figure out whether it's the blocking phase or the comparisons stage that's taking the time. If blocking does complete, are you able to see the files on disk? How many are there and how big is each file? Our Spark jobs run faster in Splink 4, but it's very hard to test the full range of possibilities. We save (break lineage) to parquet, not a Delta table. I guess that's a possibility, but on the other hand, if you're doing that consistently between 3 and 4, it's probably not the root cause.
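If it helps, something along these lines answers the file-count/size question from a Databricks notebook. It is only a sketch: dbutils is the Databricks notebook utility, and BLOCKED_PAIRS_DIR is a made-up path you'd replace with wherever your break-lineage/checkpoint parquet output is written:

    # Sketch for a Databricks notebook (dbutils is available there by default).
    # BLOCKED_PAIRS_DIR is an assumption: point it at the directory where the
    # materialised __splink__blocked_id_pairs parquet output lands.
    BLOCKED_PAIRS_DIR = "dbfs:/tmp/splink_checkpoints/__splink__blocked_id_pairs"

    parts = [f for f in dbutils.fs.ls(BLOCKED_PAIRS_DIR) if f.name.endswith(".parquet")]
    total_gib = sum(f.size for f in parts) / 1024**3
    print(f"{len(parts)} part files, {total_gib:.2f} GiB total")
    for f in parts:
        print(f"  {f.name}: {f.size / 1024**2:.1f} MiB")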
Hi Robin,
thanks for the response! :)
The hint to __splink__blocked_id_pairs helped me pin down the issue. Our fake data generator erroneously did not create unique values in our ID column customercode, which led to duplicated entries in the table :/

In Splink v4 this results in a massive join between __splink__df_concat_with_tf and __splink__blocked_id_pairs, which is joined via

    INNER JOIN __splink__blocked_id_pairs AS b
        ON l.customercode = b.join_key_l
    INNER JOIN __splink__df_concat_with_tf AS r
        ON r.customercode = b.join_key_r

In Splink v3, on the other hand, I assume this issue did not come up because the join is designed differently and therefore creates a different Spark execution…
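For anyone who runs into the same symptom, this is roughly the sanity check we added before calling predict(). It is a PySpark sketch assuming the input DataFrame is called df and, as in our case, customercode is the column used as Splink's unique ID:

    from pyspark.sql import functions as F

    # The column used as Splink's unique ID must actually be unique. Duplicated
    # IDs inflate the join between __splink__df_concat_with_tf and
    # __splink__blocked_id_pairs, because every repeated join key matches many rows.
    dupes = (
        df.groupBy("customercode")
          .count()                      # adds a "count" column per customercode value
          .filter(F.col("count") > 1)
    )

    n_dupes = dupes.count()
    if n_dupes > 0:
        print(f"{n_dupes} customercode values are duplicated, for example:")
        dupes.orderBy(F.desc("count")).show(10, truncate=False)
    else:
        print("customercode is unique")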