Code and data from Cocos et al. "Comparing Constraints for Taxonomic Organization" (NAACL 2018)
If you use the resources here, please cite our paper:
@InProceedings{N18-1030,
author = "Cocos, Anne
and Apidianaki, Marianna
and Callison-Burch, Chris",
title = "Comparing Constraints for Taxonomic Organization",
booktitle = "Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "323--333",
location = "New Orleans, Louisiana",
url = "http://aclweb.org/anthology/N18-1030"
}
Taxonomies encode knowledge about semantic hierarchies. For example, a mosquito IS-A insect and an insect IS-A organism.
More formally, a taxonomy is a graph where the nodes are entities (like mosquito, insect, and organism), and directed edges from entity node x to entity node y indicate that the relationship (x IS-A y) exists.
There has been a lot of work in the NLP community on trying to build taxonomies automatically. Generally, we can think about automatic taxonomy induction as having three sub-tasks: entity extraction (choosing the nodes in the graph), relation prediction (predicting which node pairs have an IS-A relationship), and taxonomy organization. The code in this repo focuses on that final step.
Taxonomy organization aims to choose the edges that should appear in a taxonomy. Its
input is a set of entities
The goal of our paper was to compare different taxonomy organization algorithms having different structural constraints. Specifically, we tried algorithms that varied by:
- Whether the final graph should be a directed acyclic graph (DAG) or a tree
- Whether or not the final graph must explicitly contain transitive edges (i.e. if (i-->j) is an edge, and (j-->k) is an edge, then (i-->k) must also be an edge).
This repo contains code for six algorithms, organized along those two axes:
/taxdata
-- contains test dataset, predicted relations, and synonym clusters used in the paper/taxdata/terms_test
-- entity sets in the test set/taxdata/terms_val
-- entity sets in the validation set/taxdata/gs_terms
-- filtered entity sets, including only terms appearing in WordNet (for evaluation)/taxdata/gs_taxo
-- 'gold-standard' taxonomies for entity sets ings_terms
, containing WordNet direct and transitive edges/taxdata/clusters
-- consolidated synonym clusters generated by combining all terms with PARAGRAM (Wieting et al. 2015) embedding cosine similarity >= 0.76/taxdata/relations
-- contains predicted relations (equiv[synonym], hypernym, and entmax[max(equiv, hypernym)]) for term pairs in the entity sets. Hypernym relations were predicted using HypeNET (Shwartz et al. 2016) and equiv relations give cosine similarity (min 0) of PARAGRAM (Wieting et al. 2015) embeddings
/src
-- code to build and evaluate taxonomies
- Install requirements from
src/requirements.txt
- Install required packages without pip support:
- The transitive algorithms (MaxTransGraph and MaxTransGraph) are implemented with the Gurobi optimization library. To run those models, you'll need to install Gurobi (free academic licenses available) and its Python API,
gurobipy
. - The DMST algorithms use the implementation of the Chu-Liu-Edmonds algorithm from the dependency_decoding package; you'll need to install this too.
- The transitive algorithms (MaxTransGraph and MaxTransGraph) are implemented with the Gurobi optimization library. To run those models, you'll need to install Gurobi (free academic licenses available) and its Python API,
Run experiments to generate taxonomies for the entity sets in /taxdata/terms_test
by running:
./src/run_experiments.sh