All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- ColumnExpression now supports accessing the first or last element of an array column via the method access_extreme_array_element() (#2585), or converting string literals to NULL via nullif() (#2586)
- Deprecated support for Python 3.8.x following end of support for that minor version (#2520)
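
As an illustration of the SQL behaviour that nullif() builds on, here is a minimal sketch using SQLite's built-in NULLIF function from Python's standard library. It does not use Splink's ColumnExpression API. The table name `people` and the placeholder value 'unknown' are invented for this example.

```python
import sqlite3

# Hypothetical data: 'unknown' is a placeholder string that should be
# treated as a missing value rather than as a real city name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [(1, "London"), (2, "unknown"), (3, "Leeds")],
)

# NULLIF(a, b) returns NULL when a equals b, and a otherwise.
rows = conn.execute(
    "SELECT id, NULLIF(city, 'unknown') FROM people ORDER BY id"
).fetchall()
conn.close()
```

Turning placeholder strings into NULL matters in record linkage because two records that both say 'unknown' would otherwise look like an exact match on that column.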
4.0.6 - 2024-12-05
- Added new PairwiseStringDistanceFunctionLevel and PairwiseStringDistanceFunctionAtThresholds for comparing array columns using a string similarity on each pair of values (#2517)
- Compare two records now allows typed inputs, not just dict (#2498)
- Clustering allows match weight args not just match probability (#2454)
- Various bugfixes for debug_mode (#2481)
- Clustering still works in DuckDB even if no edges are available (#2510)
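
To show how a match weight threshold relates to a match probability threshold, here is a small sketch. It assumes the usual relationship in which a match weight is the base-2 logarithm of the Bayes factor, and the match probability is the Bayes factor divided by one plus the Bayes factor. The function names are hypothetical and are not part of Splink's API.

```python
import math


def match_weight_to_probability(weight: float) -> float:
    """Convert a match weight (log2 of the Bayes factor) to a probability."""
    bayes_factor = 2.0 ** weight
    return bayes_factor / (1.0 + bayes_factor)


def probability_to_match_weight(probability: float) -> float:
    """Convert a match probability strictly between 0 and 1 to a match weight."""
    return math.log2(probability / (1.0 - probability))
```

Under these assumptions a match weight of 0 corresponds to a probability of 0.5, and a probability of 0.8 corresponds to a match weight of about 2.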
4.0.5 - 2024-11-06
- Dataframes to be registered when using compare_two_records, to avoid problems with data typing (because the input data can have an explicit schema) (#2493)
4.0.4 - 2024-10-13
- Added cluster_pairwise_predictions_at_multiple_thresholds to more efficiently cluster at multiple thresholds (#2437)
- Fixed issue with profile_columns using latest Altair version (#2466)
4.0.3 - 2024-09-19
- Cluster without linker by @RobinL in moj-analytical-services#2412
- Better autocomplete for dataframes by @RobinL in moj-analytical-services#2434
4.0.2 - 2024-09-19
- Match weight and m and u probabilities charts now have improved tooltips (#2392)
- Added new AbsoluteDifferenceLevel comparison level for numerical columns (#2398)
- Added new CosineSimilarityLevel and CosineSimilarityAtThresholds for comparing array columns using cosine similarity (#2405)
- Added new ArraySubsetLevel for comparing array columns (#2416)
- Fixed issue where ColumnsReversedLevel required equality on both columns (#2395)
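
The array comparisons named above rest on two simple ideas, sketched here in plain Python. This is not Splink's implementation, which generates SQL for the chosen backend. The function names are hypothetical, and how Splink treats empty arrays or NULLs is not covered.

```python
import math
from typing import Hashable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length numeric arrays."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def is_subset(smaller: Sequence[Hashable], larger: Sequence[Hashable]) -> bool:
    """True when every distinct value in `smaller` also appears in `larger`."""
    return set(smaller) <= set(larger)
```

For example, `cosine_similarity([1, 2], [2, 4])` is approximately 1 because the arrays point in the same direction, and `is_subset(["a"], ["a", "b"])` is true.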
4.0.1 - 2024-09-06
- When using DuckDB, you can now pass duckdb.DuckDBPyRelation objects as input tables to the Linker (#2375)
- It's now possible to fix values for m and u probabilities in the settings such that they are not updated/changed during training. (#2379)
- All charts can now be returned as vega lite spec dictionaries (#2361)
- Completeness chart now works correctly with indexed columns in spark (#2309)
- Completeness chart works even if you have a source_dataset column (#2323)
- SQLiteAPI can now be instantiated without error when opting not to register custom UDFs (#2342)
- Splink now runs properly when working in read-only filesystems (#2357)
- Infinite Bayes factor no longer causes SQL error in Spark (#2372)
- splink_datasets is now functional in read-only filesystems (#2378)
4.0.0 - 2024-07-24
Major release - see our blog for what's changed
3.9.15 - 2024-06-18
- Activates higher_is_more_similar kwarg in cl.distance_function_at_thresholds, see here
- linker.save_model_to_json() now correctly serialises tf_minimum_u_value and reloads. See here.
- Performance improvements on code generation, see here
3.9.14 - 2024-03-25
- Fixed IndexError: List index out of range error due to API change in SQLGlot>=23.0.0, see here
- Ability to override detection of exact match level for tf adjustments. See here for example.
- Added method for computing graph metrics (#2027)
3.9.13 - 2024-03-04
- Support for Databricks Runtime 13.x+
- Fixed bug that prevented sqlglot <= 17.0.0 from working properly (#1996)
- Fixed issues relating to duckdb 0.10.1 (#1999)
- Update sqlglot compatibility to support latest version (#1998)
3.9.12 - 2024-01-30
- Support sqlalchemy >= 2.0.0 (#1908)
3.9.11 - 2024-01-17
- Ability to block on array columns by specifying arrays_to_explode in your blocking rule. (#1692)
- Added ability to sample by density in cluster studio by @zslade in (#1754)
- Splink now fully parallelises data linkage when using DuckDB (#1796)
- Allow salting in EM training (#1832)
3.9.10 - 2023-12-07
- Fixed issue with _source_dataset_col and _source_dataset_input_column (#1731)
- Delete cached tables before resetting the cache (#1752)
3.9.9 - 2023-11-14
- Upgraded sqlglot to versions >= 13.0.0 (#1642)
- Improved logging output from settings validation (#1636) and corresponding documentation (#1674)
- Emit a warning when using a default (i.e. non-trained) value for probability_two_random_records_match (#1653)
- Fixed issue causing occasional SQL errors with certain database and catalog combinations (#1558)
- Fixed issue where the comparison vector grid was not synced with corresponding histogram values in the comparison viewer dashboard (#1652)
- Fixed issue where composing null levels would mistakenly sometimes result in a non-null level (#1672)
- Labelling tool correctly works even when offline (#1646)
- Explicitly cast values when using the postgres linker (#1693)
- Fixed issue where parameters to completeness_chart were not being applied (#1662)
- Fixed issue passing boto3_session into the Athena linker (#1733)
3.9.8 - 2023-10-05
- Added ability to delete tables with Spark when working in Databricks (#1526)
- Re-added support for python 3.7 (specifically >= 3.7.1) and adjusted dependencies in this case (#1622)
- Fix behaviour where using to_csv with Spark backend wouldn't overwrite files even when instructed to (#1635)
- Corrected path for Spark .jar file containing UDFs to work correctly for Spark < 3.0 (#1622)
- Spark UDF damerau_levensthein is now only registered for Spark >= 3.0, as it is not compatible with earlier versions (#1622)