
Add more example datasets #13

Open
NickCrews opened this issue Nov 5, 2023 · 4 comments

Comments

@NickCrews
Owner

NickCrews commented Nov 5, 2023

Here are the options I've found when googling around for candidate datasets to add to mismo:

  1. Pre-trained Embeddings for Entity Resolution: An Experimental Analysis ... https://arxiv.org/pdf/2304.12329 (this links to the data at https://zenodo.org/records/6950980, but that only contains precomputed embedding vectors for each entity, and each entity is usually just some unstructured text. I would like more raw values to work from.)
  2. Profiling Entity Matching Benchmark Tasks: 21 complete benchmark tasks, with train, validation, and test splits.
  3. https://dbs.uni-leipzig.de/research/projects/benchmark-datasets-for-entity-resolution
  4. Can add more as needed
@NickCrews
Owner Author

@OlivierBinette what do you think about option 2 above? It seems professionally made, is easy to download and work with programmatically, and has a variety of attribute types and difficulties. Only one task is deduplication; the rest seem to be record linkage between 2 datasets, so that is one drawback.

Should we choose one/a few of these to be used for benchmarking and testing? Are there considerations that you think are important when choosing a dataset to use for these purposes?

@OlivierBinette
Contributor

OlivierBinette commented Jan 3, 2024

@NickCrews Number 2 above focuses on pairwise matching, not on clustering or deduplication of a large dataset. In this way, it sidesteps the class-imbalance and close non-match issues in entity resolution, which isn't great.
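The imbalance can be made concrete with a quick back-of-the-envelope sketch (synthetic numbers, not taken from any of the datasets above): in a full deduplication, candidate pairs grow quadratically with the number of records while true matches grow roughly linearly, so non-matches dominate.

```python
# Back-of-the-envelope sketch (synthetic numbers, not from any dataset above):
# in a full deduplication of n records, candidate pairs grow as n*(n-1)/2
# while true matches grow roughly linearly, so non-matches swamp the matches.
from math import comb

def pair_counts(n_records: int, n_true_matches: int) -> tuple[int, float]:
    """Return total candidate pairs and the match rate among them."""
    total_pairs = comb(n_records, 2)
    return total_pairs, n_true_matches / total_pairs

total, rate = pair_counts(n_records=10_000, n_true_matches=5_000)
print(total, rate)  # 49995000 pairs, match rate ~1e-4

# Even a classifier with a 1% false-positive rate on non-matches would emit
# ~500k false positives against 5k true matches, so precision collapses --
# something a balanced, pre-sampled pairwise benchmark never reveals.
fp = 0.01 * (total - 5_000)
precision_upper_bound = 5_000 / (5_000 + fp)  # ~0.01 even at perfect recall
```

A benchmark whose test split is a balanced sample of pairs hides exactly this regime.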

I'd recommend using data from PatentsView as one example. There is data to deduplicate on inventors, locations (city/state/country), and businesses. PatentsView does its own deduplication that we can compare to. They also have ground truth data with sampling weights and they are in the process of collecting more as well. The PatentsView team (or myself) would be able to provide you with documented datasets that are ready to deduplicate. (Note: I just finished a contract with PatentsView so I am surely a bit biased here.)
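The "ground truth with sampling weights" point matters for evaluation: if labeled pairs were drawn with unequal inclusion probabilities (e.g. oversampling close non-matches), unweighted metrics are biased. A minimal sketch of weight-corrected precision, with made-up weights and a hypothetical `weighted_precision` helper (nothing here comes from PatentsView's actual data or schema):

```python
# Sketch (hypothetical data, not PatentsView's actual schema): correcting an
# evaluation metric for unequal sampling, Horvitz-Thompson style. Each labeled
# pair carries weight = 1 / inclusion_probability of its sampling stratum.

def weighted_precision(labeled_pairs):
    """labeled_pairs: iterable of (predicted_match, is_true_match, weight)."""
    tp = sum(w for pred, truth, w in labeled_pairs if pred and truth)
    fp = sum(w for pred, truth, w in labeled_pairs if pred and not truth)
    return tp / (tp + fp)

sample = [
    (True, True, 10.0),   # pair from a stratum sampled at 10% -> weight 10
    (True, False, 2.0),   # pair from an oversampled close non-match stratum
    (True, True, 2.0),
    (False, False, 2.0),  # ignored: precision only counts predicted matches
]
print(weighted_precision(sample))  # 12 / 14 ~= 0.857
```

An unweighted precision on the same sample would be 2/3, illustrating how much the sampling design can move the estimate.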

There's also the Union Army dataset that I've used in this paper, section 4.3.2.

The RLData10000 dataset is also quite well known. It's quite simple with a limited amount of noise, great for sanity-checking things.
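As a flavor of that kind of sanity check, here is a minimal sketch with synthetic RLData-style rows (hypothetical records and a deliberately naive baseline, not the actual RLData10000 contents):

```python
# Sketch: sanity-checking on RLData-style data -- lightly noised records plus a
# known true-entity id. Rows here are made up, not from RLData10000 itself.
from collections import defaultdict

# (record_id, first_name, last_name, true_entity_id)
records = [
    (1, "CARSTEN", "MEIER", 101),
    (2, "KARSTEN", "MEIER", 101),    # noisy duplicate of entity 101
    (3, "GERD", "BAUER", 102),
    (4, "GERD", "BAUERH", 102),      # noisy duplicate of entity 102
    (5, "ROBERT", "HARTMANN", 103),  # singleton
]

def naive_dedupe(rows):
    """Baseline: cluster on exact last-name match (will miss noisy pairs)."""
    clusters = defaultdict(list)
    for rec_id, _, last, _ in rows:
        clusters[last].append(rec_id)
    return dict(clusters)

clusters = naive_dedupe(records)
# Exact matching recovers the MEIER pair but splits BAUER vs BAUERH -- the
# known entity ids make this kind of failure trivial to detect.
print(clusters)
```

With so little noise, any reasonable fuzzy matcher should recover nearly all pairs, which is what makes the dataset useful as a smoke test.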

This paper that I co-authored has a list of benchmark datasets that can be useful, focusing on structured entity resolution: https://www.science.org/doi/suppl/10.1126/sciadv.abi8021/suppl_file/sciadv.abi8021_sm.pdf

One of the datasets described there is the NC voter registration dataset, which is quite well known.

@OlivierBinette
Contributor

This paper I authored talks about ground truth benchmarks for US inventor disambiguation: https://arxiv.org/pdf/2301.03591.pdf

@NickCrews
Owner Author

Also from AI-team-UoA/pyJedAI#17 there was mention of https://zenodo.org/records/7252010
