Make PMC ingestion compliant with JATS records having article element defined with a namespace #1516

Open
marekhorst opened this issue Jan 10, 2025 · 0 comments

While analyzing the PMC ingestion internal datastores, I found a large volume of faults (in terms of data size) generated by past runs of the PMC ingestion process:

[marek.horst@iis-cdh5-test-gw ~]$ hadoop fs -du -h /user/dnet.production/iis/cache/ingestpmc_bck_20250109/000005
123.6 G  370.8 G  /user/dnet.production/iis/cache/ingestpmc_bck_20250109/000005/fault
1.0 T    3.1 T    /user/dnet.production/iis/cache/ingestpmc_bck_20250109/000005/meta

For further analysis, I listed the most frequent error codes:

select count(*) as licz, code from ingestpmc_20250109_fault group by code order by licz desc;

which resulted in:

481742	java.lang.RuntimeException
606	org.jdom.input.JDOMParseException
29	java.lang.NullPointerException

affecting ~2% of all the records.

Digging further into the details revealed that the plaintext extraction module is unable to extract text when the article element has an explicit namespace defined, such as https://jats.nlm.nih.gov/ns/archiving/1.0 or https://dtd.nlm.nih.gov/ns/archiving/2.3 (there are other cases as well).

We should make the JATS records parser able to extract text from such records. Since the set of possible namespace values is quite large and open to additions, we should not bind the parser to any specific list of allowed namespaces, because such a list would be difficult to maintain. It should be fine to simply accept an article element defined with any namespace.
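
The idea of accepting an article element with any namespace can be sketched as follows. This is only an illustration in Python using the standard library, not the project's actual parser code; the helper names `local_name` and `extract_article_text` and the sample documents are hypothetical. The root element is matched by its local name only, so the namespace (or its absence) does not matter.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Return the tag name without ElementTree's '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def extract_article_text(xml_source: str) -> str:
    """Extract plain text from a record whose root is an article element.

    Hypothetical helper: the root is accepted when its local name is
    'article', regardless of the namespace it is declared in.
    """
    root = ET.fromstring(xml_source)
    if local_name(root.tag) != "article":
        raise ValueError(f"unexpected root element: {root.tag}")
    # itertext() yields all descendant text in document order;
    # split/join collapses runs of whitespace into single spaces.
    return " ".join(" ".join(root.itertext()).split())


# Sample records (made up for illustration): one without a namespace,
# one using a namespace mentioned in this issue.
PLAIN = "<article><body><p>Hello world</p></body></article>"
NAMESPACED = (
    '<article xmlns="https://jats.nlm.nih.gov/ns/archiving/1.0">'
    "<body><p>Hello world</p></body></article>"
)

if __name__ == "__main__":
    print(extract_article_text(PLAIN))
    print(extract_article_text(NAMESPACED))
```

Comparing local names avoids maintaining a list of allowed namespaces, at the cost of also accepting an `article` element from an unrelated vocabulary, which seems an acceptable trade-off for this ingestion path.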

@marekhorst marekhorst self-assigned this Jan 10, 2025
marekhorst added a commit that referenced this issue Jan 10, 2025
…rticle element defined with a namespace

Text extraction converter from the JATS parser was modified so that it accepts any namespace for the article element, including an empty namespace. This is now in line with the PMC metadata extraction module, which is namespace agnostic. Both cases are verified by the newly added unit tests.
marekhorst added a commit that referenced this issue Jan 10, 2025
…rticle element defined with a namespace

Text extraction converter, which is a submodule of the JATS parser, was modified so that it accepts any namespace for the article element, including an empty namespace. This is now in line with the PMC metadata extraction module, which is namespace agnostic. Both cases are verified by the newly added unit tests.