You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
As I understand it (which is not well, because those papers are super dense and they don't have great examples anywhere), instead of modeling the problem as
generate candidate pairs
score the pairs
do graph algorithms eg connected components make clusters from these pairs
Steorts throws that away, and instead models it as recognizing that every record really represents some true, latent entity. Then you try to link records to these entities directly. This is a bipartite graph optimization problem: one side of the graph are the records, and the other side of the graph are the latent entities. This is better, because instead of inherently O(N^2) comparisons, there are just O(E*N), where E is the number of latent entities, and E is presumably much much smaller than N, possibly near-constant, but almost definitely sub-linear in respect to N.
This also has the nice property that you don't have to worry about the "transitive links" like you do when clustering pairwise-comparisons, since you already get the likelihood that a record refers to an entity.
The text was updated successfully, but these errors were encountered:
See https://github.com/cleanzr/dblink
As I understand it (which is not well, because those papers are super dense and they don't have great examples anywhere), instead of modeling the problem as
Steorts throws that away, and instead models it as recognizing that every record really represents some true, latent entity. Then you try to link records to these entities directly. This is a bipartite graph optimization problem: one side of the graph are the records, and the other side of the graph are the latent entities. This is better, because instead of inherently O(N^2) comparisons, there are just O(E*N), where E is the number of latent entities, and E is presumably much much smaller than N, possibly near-constant, but almost definitely sub-linear in respect to N.
This also has the nice property that you don't have to worry about the "transitive links" like you do when clustering pairwise-comparisons, since you already get the likelihood that a record refers to an entity.
The text was updated successfully, but these errors were encountered: