Consider supporting latent-entity based algorithms #26

NickCrews · 2024-02-19T03:53:14Z

As I understand it (which is not well, because those papers are super dense and they don't have great examples anywhere), instead of modeling the problem as

generate candidate pairs
score the pairs
do graph algorithms eg connected components make clusters from these pairs

Steorts throws that away, and instead models it as recognizing that every record really represents some true, latent entity. Then you try to link records to these entities directly. This is a bipartite graph optimization problem: one side of the graph are the records, and the other side of the graph are the latent entities. This is better, because instead of inherently O(N^2) comparisons, there are just O(E*N), where E is the number of latent entities, and E is presumably much much smaller than N, possibly near-constant, but almost definitely sub-linear in respect to N.

This also has the nice property that you don't have to worry about the "transitive links" like you do when clustering pairwise-comparisons, since you already get the likelihood that a record refers to an entity.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider supporting latent-entity based algorithms #26

Consider supporting latent-entity based algorithms #26

NickCrews commented Feb 19, 2024 •

edited

Loading

Consider supporting latent-entity based algorithms #26

Consider supporting latent-entity based algorithms #26

Comments

NickCrews commented Feb 19, 2024 • edited Loading

NickCrews commented Feb 19, 2024 •

edited

Loading