Consider supporting latent-entity based algorithms #26

Open
NickCrews opened this issue Feb 19, 2024 · 0 comments
NickCrews commented Feb 19, 2024

See https://github.com/cleanzr/dblink

As I understand it (which is not well, because those papers are super dense and they don't have great examples anywhere), instead of modeling the problem as

  • generate candidate pairs
  • score the pairs
  • run graph algorithms (e.g. connected components) to form clusters from these pairs

Steorts throws that away and instead starts from the idea that every record really represents some true, latent entity. You then try to link records to these entities directly. This is a bipartite graph optimization problem: one side of the graph is the records, and the other side is the latent entities. This is better because, instead of an inherently O(N^2) set of pairwise comparisons, there are only O(E*N) record-to-entity comparisons, where E is the number of latent entities. E is presumably much smaller than N, possibly near-constant, and almost certainly sub-linear with respect to N.
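To make the comparison-count argument concrete, here is a rough sketch. It is NOT what dblink does (as far as I can tell, dblink fits a Bayesian model by sampling); it is just a greedy stand-in that links each record to its best-matching entity or creates a new one. The names `jaccard` and `greedy_entity_assignment`, the 0.5 threshold, and the toy records are all made up for illustration.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity; a stand-in for a real record comparison."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def greedy_entity_assignment(records, threshold=0.5):
    """Link each record to its best-matching entity, or create a new entity.

    Assumption: an entity is represented by the first record assigned to it.
    Returns (assignments, n_comparisons).
    """
    entities = []     # one representative record per latent entity
    assignments = []  # assignments[i] is the entity index of records[i]
    n_comparisons = 0
    for rec in records:
        best_idx, best_score = None, threshold
        for idx, rep in enumerate(entities):
            n_comparisons += 1
            score = jaccard(rec, rep)
            if score >= best_score:
                best_idx, best_score = idx, score
        if best_idx is None:
            # No existing entity is close enough, so this record starts one.
            entities.append(rec)
            best_idx = len(entities) - 1
        assignments.append(best_idx)
    return assignments, n_comparisons


records = [
    "john smith boston",
    "john smith boston ma",
    "jane doe denver",
    "jane doe denver co",
    "john smith",
    "jane doe",
]
assignments, n_comparisons = greedy_entity_assignment(records)
n_pairs = len(list(combinations(records, 2)))
print(assignments)             # entity index per record
print(n_comparisons, n_pairs)  # record-to-entity vs. all-pairs comparisons
```

On this toy input the record-to-entity loop does fewer comparisons than scoring all pairs, because there are only two entities. If nearly every record were its own entity (E close to N), the loop would degrade back toward O(N^2), so the benefit depends on E staying small.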

This also has the nice property that you don't have to worry about "transitive links" like you do when clustering pairwise comparisons, since you directly get the likelihood that a record refers to an entity.
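A small sketch of what I mean by the transitive-link problem, with made-up scores. In the pairwise version, records 0 and 2 land in one cluster via record 1 even though their own score is low. In the latent-entity version, each record just takes its most likely entity. The `pair_scores` and `entity_likelihoods` values and the 0.5 threshold are hypothetical, not output from any real model.

```python
def connected_components(n, edges):
    """Cluster nodes 0..n-1 by connected components, using union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Pairwise approach: hypothetical match scores between records 0, 1, 2.
pair_scores = {(0, 1): 0.9, (1, 2): 0.9, (0, 2): 0.1}
edges = [pair for pair, score in pair_scores.items() if score >= 0.5]
pairwise_clusters = connected_components(3, edges)

# Latent-entity approach: hypothetical likelihood of each record
# referring to entity "A" or entity "B".
entity_likelihoods = {
    0: {"A": 0.9, "B": 0.1},
    1: {"A": 0.6, "B": 0.4},
    2: {"A": 0.1, "B": 0.9},
}
entity_of = {
    rec: max(scores, key=scores.get)
    for rec, scores in entity_likelihoods.items()
}

print(pairwise_clusters)  # 0 and 2 are merged through 1
print(entity_of)          # each record is resolved on its own
```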
