
Spark version and jaccard_at_thresholds distance #673

Answered by ericmanning
juracyjr asked this question in Q&A

I think you need to register the Jaccard UDF:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, types

# Add the Splink similarity UDF jar to the Spark config before the session starts
conf = SparkConf()
conf.set("spark.jars", path_to_udf_jar)  # path to the scala-udf-similarity jar

sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)

# Register the Scala JaccardSimilarity class as a SQL function named "jaccard"
spark.udf.registerJavaFunction(
    "jaccard", "uk.gov.moj.dash.linkage.JaccardSimilarity", types.DoubleType()
)
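Once registered, the function is callable from Spark SQL, so you can sanity-check it directly. A minimal check (the input strings here are arbitrary examples, not from the original answer):

# The registered UDF should return a similarity score between 0 and 1
spark.sql("SELECT jaccard('DUDLEY', 'DUDELY') AS jaccard_sim").show()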

The UDF jar for Spark is bundled with Splink, under splink/files/spark_jars in the installed package.
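If it helps, one way to build path_to_udf_jar is to locate the jar inside the installed package. This is only a sketch: it assumes a single jar in that directory, and the exact filename varies with your Splink and Spark versions.

import glob
import os

import splink

# Locate the bundled similarity jar; the filename depends on the Splink/Spark version
jar_dir = os.path.join(os.path.dirname(splink.__file__), "files", "spark_jars")
path_to_udf_jar = glob.glob(os.path.join(jar_dir, "*.jar"))[0]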
