Spark Linker predictions on two PySpark dataframes #574
-
I estimated model parameters (m and u) on a dataset of 10,000 points and saved the settings dictionary. Then I created a linker object with the same 10,000 points, using the saved settings dictionary. Everything works fine, and the predict method returns 121,833 rows as shown below. Next, I created a new linker object from a sampled dataframe of 1,000 points, again using the saved settings. However, when I run the predict method on this new linker object, I still get exactly the same predictions as before. Is this expected behaviour? How can I get predictions on the new dataframe I just created?
Replies: 1 comment
-
Thanks for the report. I think this is likely a bug. Splink makes extensive use of caching for performance reasons, so I suspect what's going on here is that the input data has been cached.

In the short term, re-running the second link after restarting your Spark session should fix this. I've converted this into an issue here.

You're probably aware of this already, but for smaller datasets (fewer than about a million records) you can use the DuckDB linker, which should be faster, and I don't think you would run into this bug.
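To make the suspected failure mode concrete, here is a minimal, library-free sketch of what a stale cache like this looks like. It does not use Splink's real internals; `predict` and `_cache` are hypothetical stand-ins, and the "restart your Spark session" workaround corresponds to clearing the cache:

```python
# Hypothetical illustration of the caching bug described above.
# The cache key ignores the actual input data, so a second linker
# built on different data silently gets the first linker's results.

_cache = {}

def predict(rows):
    key = "predictions"  # BUG analogue: key does not depend on `rows`
    if key not in _cache:
        # stand-in for real work: count pairwise comparisons n*(n-1)/2
        n = len(rows)
        _cache[key] = n * (n - 1) // 2
    return _cache[key]

full = list(range(10_000))
sample = list(range(1_000))

first = predict(full)      # computed fresh from the 10,000-row input
second = predict(sample)   # stale: identical to `first` despite new input

# Workaround analogue: restarting the Spark session drops the cache,
# after which the smaller input produces its own predictions.
_cache.clear()
third = predict(sample)
```

Here `second == first` even though the inputs differ, which matches the behaviour reported in the question, while `third` reflects the 1,000-row sample once the cache is cleared.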