Issues when indexing AclAnthology #2084

ygorg · 2023-03-27T16:17:17Z

This PR is linked to #2069.

There are some issues when following the "Indexing the ACL Anthology with Anserini" tutorial:

1. The venues attribute is now venue and is stored in paper rather than volume.
2. The page_first and page_last attributes are sometimes in roman notation and are therefore strings, not numbers. I changed the schema, because converting roman numerals to int should happen in acl-anthology.
3. Some files are too big for fasterxml.jackson. By default, files must be smaller than 3 MB (~1,345,000 chars). I increased this limit to 10 MB.
4. The script create_hugo_yaml.py produces YAML files containing aliases and references (to prevent data redundancy). These cannot be parsed by fasterxml.jackson (YAML Anchors / References FasterXML/jackson-dataformats-text#98). One way to prevent this is to modify the script so that it does not emit them (see the updated doc). Another way is to postprocess the data in acl-anthology/build/data using

pip install ruamel.yaml.cmd
mv volumes.yaml volumes_old.yaml
yaml merge-expand volumes_old.yaml volumes.yaml
mv papers papers_old
for yaml_old in papers_old/*.yaml; do  # this takes ~10 minutes!
  yaml_new="${yaml_old##*/}"
  yaml merge-expand "$yaml_old" "papers/$yaml_new"
done

Point number 4 is hacky. Maybe there is a way to preprocess and remove the aliases in Java, in AclAnthology.java? Maybe this dataset should be read using another format (jsonl, xml, bibtex)?
I still need to update the unit tests.
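
For the page_first / page_last item above, here is a minimal sketch of what a roman-to-int conversion could look like if it were done upstream. The helper name `page_to_int` is hypothetical and is not part of acl-anthology or Anserini; this PR keeps the values as strings instead.

```python
import re

# Values of the individual roman digits (lowercase).
ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}
ROMAN_RE = re.compile(r"^[ivxlcdm]+$", re.IGNORECASE)


def page_to_int(page):
    """Hypothetical helper: return a page number as int, or None if unparseable.

    Accepts ints, decimal strings ("42") and roman numerals ("xiv").
    """
    text = str(page).strip()
    if text.isdigit():
        return int(text)
    if not ROMAN_RE.match(text):
        return None
    values = [ROMAN_VALUES[ch] for ch in text.lower()]
    total = 0
    for i, value in enumerate(values):
        # A smaller digit placed before a larger one is subtracted (e.g. "iv").
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total
```

The sketch does not validate that the numeral is well formed (for example, it accepts "iiii"), so it is only suitable for illustrating the idea.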

ygorg · 2023-04-21T14:24:57Z

I updated the unit tests with the same sample documents as before, but taken from a recent extract of the acl-anthology GitHub repo.

aryamancodes · 2023-04-21T21:07:35Z

I followed the instructions on this PR and the related issue (including both the "hacks" described in point 4) and was able to successfully index ACL Anthology. The unit tests ran successfully too. Just a small nitpick with the post-processing script, the correct version should look like:

pip install ruamel.yaml.cmd
mv volumes.yaml volumes_old.yaml
yaml merge-expand volumes_old.yaml volumes.yaml
mv papers papers_old

# since we renamed the papers dir, it no longer exists
mkdir papers

for yaml_old in papers_old/*.yaml; do  # this takes ~10 minutes!
  yaml_new="${yaml_old##*/}"
  yaml merge-expand "$yaml_old" "papers/$yaml_new"
done

lintool · 2023-04-21T21:09:12Z

@ygorg thoughts? let's sort through these final issues and then merge in your PR?

codecov-commenter · 2023-04-21T21:13:09Z

Codecov Report

Patch coverage: 92.85% and project coverage change: +0.03 🎉

Comparison is base (910821a) 58.89% compared to head (e8e324d) 58.93%.

Additional details and impacted files

@@             Coverage Diff              @@
##             master    #2084      +/-   ##
============================================
+ Coverage     58.89%   58.93%   +0.03%     
  Complexity     1188     1188              
============================================
  Files           194      194              
  Lines         11318    11328      +10     
  Branches       1486     1486              
============================================
+ Hits           6666     6676      +10     
  Misses         4170     4170              
  Partials        482      482

Impacted Files	Coverage Δ
...nserini/index/generator/AclAnthologyGenerator.java	`89.04% <ø> (ø)`
...main/java/io/anserini/collection/AclAnthology.java	`81.52% <92.85%> (+2.25%)`	⬆️


ygorg · 2023-04-22T08:42:21Z

Thank you for the review.
@aryamancodes My original comment wasn't very clear. The solution you used is very slow; my primary recommendation was to modify create_hugo_yaml.py by adding this line: Dumper.ignore_aliases = lambda self, data: True
This simplifies the process considerably.
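
The one-line change mentioned above relies on PyYAML's `ignore_aliases` hook on the dumper. Here is a minimal sketch of its effect, assuming PyYAML is installed. The `NoAliasDumper` subclass and the sample data are illustrative and are not taken from create_hugo_yaml.py:

```python
import yaml


class NoAliasDumper(yaml.Dumper):
    """Dumper that never emits anchors (&id001) or aliases (*id001)."""

    def ignore_aliases(self, data):
        # Returning True for every object makes PyYAML write the full
        # value each time instead of a reference to an earlier anchor.
        return True


# The same list object is referenced twice, which is what triggers aliasing.
shared_authors = ["Author One", "Author Two"]
data = {"paper_a": shared_authors, "paper_b": shared_authors}

with_aliases = yaml.dump(data)                           # contains &id001 / *id001
without_aliases = yaml.dump(data, Dumper=NoAliasDumper)  # fully expanded

print(with_aliases)
print(without_aliases)
```

Subclassing keeps the change local to one `yaml.dump` call, whereas assigning to `Dumper.ignore_aliases` directly, as in the line quoted above, changes the behavior wherever that dumper class is used.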

aryamancodes · 2023-04-22T12:30:19Z

That makes sense. I did test the primary recommendation and it works well!

lintool · 2023-04-22T12:32:30Z

Thanks @ygorg for the contribution!

ygorg and others added 9 commits March 27, 2023 17:57

By default fasterxml.jackson does not process files longer than ~1.345.000 characters. Adding option to increase this limit.

69a1afe

venues is now venues and is stored in the paper.

abcda9e

page_first and page_first sometimes are roman numerals stored as a string.

b02871c

create_hugo_yaml.py creates yaml files containing aliases which cannot be parsed by `fasterxml.jackson`.

7881849

Smaller code for preventing aliases in YAML dump

933c6ab

Removed extra whitespaces

073259b

Updated ACL sample doc with current yaml export and expected test values

fd581ad

Merge branch 'fix-acl' of https://github.com/ygorg/anserini into fix-acl

f94ba9c

Merge branch 'castorini:master' into fix-acl

e8e324d

lintool approved these changes Apr 22, 2023

lintool merged commit 39dfcb4 into castorini:master Apr 22, 2023

ygorg deleted the fix-acl branch April 22, 2023 12:36

aryamancodes pushed a commit to aryamancodes/anserini that referenced this pull request Apr 22, 2023

Improve AclAnthology indexing (castorini#2084)

ff7a0ee

ygorg mentioned this pull request Apr 27, 2023

Problem with indexing ACLAnthology #2109

Closed
