Spark Linker predictions on two PySpark dataframes #574

Answered by RobinL
prabh-singh123 asked this question in Q&A

Thanks for the report. I think this is likely a bug. Splink makes extensive use of caching for performance reasons, so I suspect what's going on here is that the input data from the first link has been cached and is being reused for the second.

In the short term, running the second link after restarting your Spark session should fix this.
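A minimal sketch of that workaround, assuming a recent Splink 3 release with the `SparkLinker` API; the settings dict, column names, and data below are illustrative placeholders, not taken from the original thread:

```python
from pyspark.sql import SparkSession
from splink.spark.linker import SparkLinker

# Illustrative settings only; swap in your real comparisons and blocking rules.
settings = {
    "link_type": "link_only",
    "blocking_rules_to_generate_predictions": ["l.surname = r.surname"],
    "comparisons": [
        {
            "output_column_name": "first_name",
            "comparison_levels": [
                {
                    "sql_condition": "first_name_l IS NULL OR first_name_r IS NULL",
                    "label_for_charts": "Null",
                    "is_null_level": True,
                },
                {
                    "sql_condition": "first_name_l = first_name_r",
                    "label_for_charts": "Exact match",
                },
                {"sql_condition": "ELSE", "label_for_charts": "All other"},
            ],
        }
    ],
}

spark = SparkSession.builder.getOrCreate()

# ...first link runs here and populates Splink's cache...

# Workaround: restart the session so cached intermediate tables
# from the first link cannot leak into the second one.
spark.stop()
spark = SparkSession.builder.getOrCreate()

# Dataframes must be recreated on the new session; objects tied to
# the stopped session are no longer usable.
df_a = spark.createDataFrame(
    [(1, "alice", "smith"), (2, "bob", "jones")],
    ["unique_id", "first_name", "surname"],
)
df_b = spark.createDataFrame(
    [(1, "alice", "smith"), (2, "robert", "jones")],
    ["unique_id", "first_name", "surname"],
)

linker = SparkLinker([df_a, df_b], settings, spark=spark)
predictions = linker.predict()
```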

I've converted this into an issue here.

You're probably aware of this already, but for smaller datasets (fewer than about a million records) you can use the DuckDB linker instead. It should be faster, and I don't think you'd run into this bug.
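A sketch of the DuckDB route under the same assumptions, reusing the illustrative settings dict from the sketch above; `DuckDBLinker` accepts plain pandas dataframes, so no Spark session is involved:

```python
import pandas as pd
from splink.duckdb.linker import DuckDBLinker

# Toy inputs with the unique_id column Splink expects by default.
df_a = pd.DataFrame(
    {"unique_id": [1, 2], "first_name": ["alice", "bob"], "surname": ["smith", "jones"]}
)
df_b = pd.DataFrame(
    {"unique_id": [1, 2], "first_name": ["alice", "robert"], "surname": ["smith", "jones"]}
)

# The comparison SQL in the settings dict above is simple enough to be
# portable between the Spark and DuckDB backends.
linker = DuckDBLinker([df_a, df_b], settings)
predictions = linker.predict()

# predict() returns a SplinkDataFrame; convert it for inspection.
print(predictions.as_pandas_dataframe().head())
```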
