Make PMC ingestion compliant with JATS records having article element defined with a namespace #1516

marekhorst · 2025-01-10T16:25:54Z

While analyzing the internal datastores of PMC ingestion, I found a large volume of faults generated by past runs of the PMC ingestion process:

[marek.horst@iis-cdh5-test-gw ~]$ hadoop fs -du -h /user/dnet.production/iis/cache/ingestpmc_bck_20250109/000005
123.6 G  370.8 G  /user/dnet.production/iis/cache/ingestpmc_bck_20250109/000005/fault
1.0 T    3.1 T    /user/dnet.production/iis/cache/ingestpmc_bck_20250109/000005/meta

Further analysis, listing the most frequent error codes:

select count(*) as licz, code from ingestpmc_20250109_fault group by code order by licz desc;

which resulted in:

481742	java.lang.RuntimeException
606	org.jdom.input.JDOMParseException
29	java.lang.NullPointerException

affecting ~2% of all the records.

Digging into the details revealed that the plaintext extraction module is unable to extract text when the article element has an explicit namespace defined, such as https://jats.nlm.nih.gov/ns/archiving/1.0 or https://dtd.nlm.nih.gov/ns/archiving/2.3 (there are other cases as well).

We should make the JATS record parser able to extract text from such records. Since the set of possible namespace values is large and open to additions, we should not bind the parser to a specific list of allowed namespaces, which would be difficult to maintain. It should be fine to simply accept an article element defined with any namespace.
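To illustrate the idea, here is a minimal sketch of namespace-agnostic matching on the article element. It is not the project's converter code: it uses the JDK's built-in DOM parser as a stand-in, and the class name `ArticleNamespaceSketch` and method `extractArticleText` are hypothetical. The point is only that comparing the local name, rather than the qualified name plus a fixed namespace, accepts both namespaced and non-namespaced records.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical illustration; names are not taken from the IIS codebase.
final class ArticleNamespaceSketch {

    /**
     * Returns the text content of the root element when its local name is
     * "article", regardless of the namespace (or lack of one) it is declared in.
     */
    static Optional<String> extractArticleText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Namespace awareness is required so that getLocalName() is populated.
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Element root = builder
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            // Match on the local name only; the namespace URI is deliberately ignored.
            if ("article".equals(root.getLocalName())) {
                return Optional.of(root.getTextContent().trim());
            }
            return Optional.empty();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Unable to parse XML record", e);
        }
    }

    public static void main(String[] args) {
        String namespaced = "<article xmlns=\"https://jats.nlm.nih.gov/ns/archiving/1.0\">"
                + "<body><p>sample text</p></body></article>";
        String plain = "<article><body><p>sample text</p></body></article>";
        System.out.println(extractArticleText(namespaced).isPresent());
        System.out.println(extractArticleText(plain).isPresent());
    }
}
```

Because no namespace list is consulted, a record declaring a namespace that has not been seen before is handled the same way as the known ones.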

…rticle element defined with a namespace Text extraction converter from JATS parser was modified so that it accepts any namespace for the article element, including an empty namespace. This is now in line with the PMC metadata extraction module, which is namespace agnostic. Both cases were verified by the newly added unit tests.

…rticle element defined with a namespace Text extraction converter, which is a submodule of the JATS parser, was modified so that it accepts any namespace for the article element, including an empty namespace. This is now in line with the PMC metadata extraction module, which is namespace agnostic. Both cases were verified by the newly added unit tests.

marekhorst added the functionality: metadataextraction label Jan 10, 2025

marekhorst self-assigned this Jan 10, 2025
