Spark Linker predictions on two PySpark dataframes #574
-
I estimated model parameters (m and u) on a dataset of 10,000 points and saved the settings dictionary. Then I created a linker object with the same 10,000 points, using the saved settings dictionary. Everything works fine, and the predict method returns 121,833 rows as shown below. Next, I created a new linker object from a sampled dataframe of 1,000 points, again using the saved settings. However, when I run the predict method on this new linker object, I still get exactly the same predictions as before. Is this expected behaviour? How can I get predictions on the new dataframe I just created?
Replies: 1 comment
-
Thanks for the report. I think this is likely a bug. Splink makes extensive use of caching for performance reasons, so I suspect what's going on here is that the input data has been cached.

In the short term, re-running the second link after restarting your Spark session should fix this. I've converted this into an issue here.

You're probably aware of this already, but for smaller datasets (fewer than about a million records) you can use the DuckDB linker, which should be faster, and I don't think you would run into this bug.
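To make the suspected failure mode concrete, here is a minimal, library-free sketch of what a stale cache like this looks like. It does not use Splink's real internals; `predict` and `_cache` are hypothetical stand-ins, and the "restart your Spark session" workaround corresponds to clearing the cache:

```python
# Hypothetical illustration of the caching bug described above.
# The cache key ignores the actual input data, so a second linker
# built on different data silently gets the first linker's results.

_cache = {}

def predict(rows):
    key = "predictions"  # BUG analogue: key does not depend on `rows`
    if key not in _cache:
        # stand-in for real work: count pairwise comparisons n*(n-1)/2
        n = len(rows)
        _cache[key] = n * (n - 1) // 2
    return _cache[key]

full = list(range(10_000))
sample = list(range(1_000))

first = predict(full)      # computed fresh from the 10,000-row input
second = predict(sample)   # stale: identical to `first` despite new input

# Workaround analogue: restarting the Spark session drops the cache,
# after which the smaller input produces its own predictions.
_cache.clear()
third = predict(sample)
```

Here `second == first` even though the inputs differ, which matches the behaviour reported in the question, while `third` reflects the 1,000-row sample once the cache is cleared.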