Issues when indexing AclAnthology #2084
…345.000 characters. Adding option to increase this limit.
…not be parsed by `fasterxml.jackson`.
I updated the unit tests with the same sample documents as before, but from a recent extract of the acl-anthology GitHub repo.
I followed the instructions on this PR and the related issue (including both the "hacks" described in point 4) and was able to successfully index ACL Anthology. The unit tests ran successfully too. Just a small nitpick with the post-processing script, the correct version should look like:
@ygorg thoughts? Let's sort through these final issues and then merge in your PR?
Codecov Report
@@ Coverage Diff @@
## master #2084 +/- ##
============================================
+ Coverage 58.89% 58.93% +0.03%
Complexity 1188 1188
============================================
Files 194 194
Lines 11318 11328 +10
Branches 1486 1486
============================================
+ Hits 6666 6676 +10
Misses 4170 4170
Partials 482 482
Thank you for the review.
That makes sense. I did test the primary recommendation and it works well!
Thanks @ygorg for the contribution!
This PR is linked to #2069.
There are some issues when following the "Indexing the ACL Anthology with Anserini" tutorial:
1. The `venues` attribute is now `venue` and is stored in the paper rather than the volume.
2. The `page_first` and `page_last` attributes are sometimes in Roman notation and are therefore strings, not numbers. I changed the schema, because translating Roman numerals to integers should happen in `acl-anthology`.
3. `fasterxml.jackson` limits the size of the files it parses. By default, files must be smaller than 3 MB (~1,345,000 characters). I increased this limit to 10 MB.
4. `create_hugo_yaml.py` produces a YAML file containing aliases and references (to prevent data redundancy). These cannot be parsed by `fasterxml.jackson` (YAML Anchors / References, FasterXML/jackson-dataformats-text#98). One way to avoid this is to modify the script so that it does not emit them (see the updated doc). Another way is to post-process the data in `acl-anthology/build/data` with a post-processing script.

Point number 4 is hacky; maybe there is a way to preprocess and remove the aliases in Java in `AclAnthology.java`? Maybe this dataset should be read using another format (JSONL, XML, BibTeX)?
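To illustrate the "another format" idea, here is a minimal sketch of writing paper records as JSON Lines. JSON has no anchor or alias syntax, so an object shared between records is simply written out in full each time. The function name `to_jsonl`, the record layout, and the sample data are all hypothetical; they are not taken from `acl-anthology` or Anserini.

```python
import json


def to_jsonl(papers):
    """Serialize a {paper_id: record} mapping as JSON Lines, one paper per line.

    Hypothetical helper: the record layout is an assumption for illustration.
    """
    lines = []
    for paper_id, record in papers.items():
        # json.dumps expands shared Python objects in place, so the output
        # never contains anchors or aliases, unlike a YAML dump.
        lines.append(json.dumps({"id": paper_id, **record}, ensure_ascii=False))
    return "\n".join(lines) + "\n"


# Made-up sample data: both papers reference the same author object,
# which is the situation that makes a YAML dumper emit an alias.
shared_author = {"first": "Jane", "last": "Doe"}
papers = {
    "P00-0001": {"title": "First paper", "author": [shared_author]},
    "P00-0002": {"title": "Second paper", "author": [shared_author]},
}
jsonl = to_jsonl(papers)
```

Each output line is a standalone JSON object, so a reader on the Java side could parse the file line by line without hitting a whole-document size limit. Whether that fits the existing collection classes is an open question.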
I still need to update the unit tests.
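For the page attributes in point 2, this PR keeps the values as strings. If the conversion were done upstream instead, it might look roughly like the sketch below. `page_to_int` is a hypothetical name, and the parser is deliberately lenient: it does not reject malformed numerals such as `iiii`.

```python
# Values of the individual Roman numeral symbols (lowercase).
ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def page_to_int(page):
    """Return a page label as an int, or None if it is neither Arabic nor Roman.

    Hypothetical helper for illustration; not part of acl-anthology or Anserini.
    """
    text = str(page).strip().lower()
    if text.isdecimal():
        return int(text)
    if not text or any(ch not in ROMAN_VALUES for ch in text):
        return None
    total = 0
    # Pair each symbol with its successor; a space marks the end of the label.
    for ch, nxt in zip(text, text[1:] + " "):
        value = ROMAN_VALUES[ch]
        # Subtractive notation: a smaller symbol before a larger one is subtracted.
        if nxt != " " and value < ROMAN_VALUES[nxt]:
            total -= value
        else:
            total += value
    return total
```

For example, `page_to_int("xiv")` returns `14`, `page_to_int(12)` returns `12`, and a label such as `"A12"` returns `None` so the caller can fall back to storing the raw string.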