
Add more example datasets #13

Open
NickCrews opened this issue Nov 5, 2023 · 4 comments

Comments

@NickCrews
Owner

NickCrews commented Nov 5, 2023

Here are the options I've found when googling around for candidate datasets to add to mismo:

  1. Pre-trained Embeddings for Entity Resolution: An Experimental Analysis ... https://arxiv.org/pdf/2304.12329 (this links to the data at https://zenodo.org/records/6950980, but that only contains precomputed embedding vectors for each entity, and each entity is usually just some unstructured text. I would like more raw values to work from.)
  2. Profiling Entity Matching Benchmark Tasks: 21 complete benchmark tasks, with train, validation, and test splits.
  3. https://dbs.uni-leipzig.de/research/projects/benchmark-datasets-for-entity-resolution
  4. Can add more as needed
@NickCrews
Owner Author

@OlivierBinette what do you think about option 2 above? It seems professionally made, is easy to download and work with programmatically, and has a variety of attribute types and difficulties. Only one task is deduplication; the rest seem to be record linkage between 2 datasets, so that is one drawback.

Should we choose one/a few of these to be used for benchmarking and testing? Are there considerations that you think are important when choosing a dataset to use for these purposes?

@OlivierBinette
Contributor

OlivierBinette commented Jan 3, 2024

@NickCrews Number 2 above focuses on pairwise matching, not on clustering or deduplication of a large dataset. In this way, it sidesteps the class-imbalance and close non-match issues in entity resolution, which isn't great.
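The imbalance can be made concrete with a quick back-of-the-envelope sketch (synthetic numbers, not taken from any of the datasets above): in a full deduplication, candidate pairs grow quadratically with the number of records while true matches grow roughly linearly, so non-matches dominate.

```python
# Back-of-the-envelope sketch (synthetic numbers, not from any dataset above):
# in a full deduplication of n records, candidate pairs grow as n*(n-1)/2
# while true matches grow roughly linearly, so non-matches swamp the matches.
from math import comb

def pair_counts(n_records: int, n_true_matches: int) -> tuple[int, float]:
    """Return total candidate pairs and the match rate among them."""
    total_pairs = comb(n_records, 2)
    return total_pairs, n_true_matches / total_pairs

total, rate = pair_counts(n_records=10_000, n_true_matches=5_000)
print(total, rate)  # 49995000 pairs, match rate ~1e-4

# Even a classifier with a 1% false-positive rate on non-matches would emit
# ~500k false positives against 5k true matches, so precision collapses --
# something a balanced, pre-sampled pairwise benchmark never reveals.
fp = 0.01 * (total - 5_000)
precision_upper_bound = 5_000 / (5_000 + fp)  # ~0.01 even at perfect recall
```

A benchmark whose test split is a balanced sample of pairs hides exactly this regime.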

I'd recommend using data from PatentsView as one example. There is data to deduplicate on inventors, locations (city/state/country), and businesses. PatentsView does its own deduplication that we can compare to. They also have ground truth data with sampling weights and they are in the process of collecting more as well. The PatentsView team (or myself) would be able to provide you with documented datasets that are ready to deduplicate. (Note: I just finished a contract with PatentsView so I am surely a bit biased here.)
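The "ground truth with sampling weights" point matters for evaluation: if labeled pairs were drawn with unequal inclusion probabilities (e.g. oversampling close non-matches), unweighted metrics are biased. A minimal sketch of weight-corrected precision, with made-up weights and a hypothetical `weighted_precision` helper (nothing here comes from PatentsView's actual data or schema):

```python
# Sketch (hypothetical data, not PatentsView's actual schema): correcting an
# evaluation metric for unequal sampling, Horvitz-Thompson style. Each labeled
# pair carries weight = 1 / inclusion_probability of its sampling stratum.

def weighted_precision(labeled_pairs):
    """labeled_pairs: iterable of (predicted_match, is_true_match, weight)."""
    tp = sum(w for pred, truth, w in labeled_pairs if pred and truth)
    fp = sum(w for pred, truth, w in labeled_pairs if pred and not truth)
    return tp / (tp + fp)

sample = [
    (True, True, 10.0),   # pair from a stratum sampled at 10% -> weight 10
    (True, False, 2.0),   # pair from an oversampled close non-match stratum
    (True, True, 2.0),
    (False, False, 2.0),  # ignored: precision only counts predicted matches
]
print(weighted_precision(sample))  # 12 / 14 ~= 0.857
```

An unweighted precision on the same sample would be 2/3, illustrating how much the sampling design can move the estimate.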

There's also the Union Army dataset that I've used in this paper, section 4.3.2.

The RLData10000 dataset is also quite well known. It's quite simple with a limited amount of noise, great for sanity-checking things.
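As a flavor of that kind of sanity check, here is a minimal sketch with synthetic RLData-style rows (hypothetical records and a deliberately naive baseline, not the actual RLData10000 contents):

```python
# Sketch: sanity-checking on RLData-style data -- lightly noised records plus a
# known true-entity id. Rows here are made up, not from RLData10000 itself.
from collections import defaultdict

# (record_id, first_name, last_name, true_entity_id)
records = [
    (1, "CARSTEN", "MEIER", 101),
    (2, "KARSTEN", "MEIER", 101),    # noisy duplicate of entity 101
    (3, "GERD", "BAUER", 102),
    (4, "GERD", "BAUERH", 102),      # noisy duplicate of entity 102
    (5, "ROBERT", "HARTMANN", 103),  # singleton
]

def naive_dedupe(rows):
    """Baseline: cluster on exact last-name match (will miss noisy pairs)."""
    clusters = defaultdict(list)
    for rec_id, _, last, _ in rows:
        clusters[last].append(rec_id)
    return dict(clusters)

clusters = naive_dedupe(records)
# Exact matching recovers the MEIER pair but splits BAUER vs BAUERH -- the
# known entity ids make this kind of failure trivial to detect.
print(clusters)
```

With so little noise, any reasonable fuzzy matcher should recover nearly all pairs, which is what makes the dataset useful as a smoke test.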

This paper that I co-authored has a list of benchmark datasets that can be useful, focusing on structured entity resolution: https://www.science.org/doi/suppl/10.1126/sciadv.abi8021/suppl_file/sciadv.abi8021_sm.pdf

One of the datasets described there is the NC voter registration dataset, which is quite well known.

@OlivierBinette
Contributor

This paper I authored talks about ground truth benchmarks for US inventor disambiguation: https://arxiv.org/pdf/2301.03591.pdf

@NickCrews
Owner Author

Also from AI-team-UoA/pyJedAI#17 there was mention of https://zenodo.org/records/7252010
