When analyzing the PMC ingestion internal datastores, I found a large volume of faults generated by the PMC ingestion process in the past:
[marek.horst@iis-cdh5-test-gw ~]$ hadoop fs -du -h /user/dnet.production/iis/cache/ingestpmc_bck_20250109/000005
123.6 G 370.8 G /user/dnet.production/iis/cache/ingestpmc_bck_20250109/000005/fault
1.0 T 3.1 T /user/dnet.production/iis/cache/ingestpmc_bck_20250109/000005/meta
I then analyzed the faults further by listing the most frequent error cases:
select count(*) as licz, code from ingestpmc_20250109_fault group by code order by licz desc;
The results pointed to an error affecting ~2% of all the records. Digging further into the details revealed that the plaintext extraction module is unable to extract text from an article element that has an explicit namespace defined, such as https://jats.nlm.nih.gov/ns/archiving/1.0 or https://dtd.nlm.nih.gov/ns/archiving/2.3 (there are also other cases).
We should make the JATS records parser able to extract text from such records. Since the set of possible namespace values is quite large and open to additions, we should not be bound to any specific list of allowed namespaces, because such a list would be difficult to maintain. It should be OK to simply accept an article element defined with any namespace.
…rticle element defined with a namespace
The text extraction converter from the JATS parser was modified so that it accepts any namespace for the article element, including an empty namespace. This is now in line with the PMC metadata extraction module, which is namespace agnostic. Both cases are verified by the newly added unit tests.
…rticle element defined with a namespace
The text extraction converter, which is a submodule of the JATS parser, was modified so that it accepts any namespace for the article element, including an empty namespace. This is now in line with the PMC metadata extraction module, which is namespace agnostic. Both cases are verified by the newly added unit tests.
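The idea of accepting an article element in any namespace can be illustrated with a minimal Python sketch. This is not the actual converter code; the function names `local_name` and `extract_article_text` are hypothetical, and the sample records are made up for illustration. It only shows the general technique of comparing the local element name while ignoring the namespace.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree adds to qualified tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_article_text(xml_string: str) -> str:
    """Return whitespace-normalized text of an <article> root in any namespace.

    Hypothetical helper: the check looks only at the local name, so no list
    of allowed namespaces has to be maintained.
    """
    root = ET.fromstring(xml_string)
    if local_name(root.tag) != "article":
        raise ValueError(f"unexpected root element: {root.tag}")
    # Join all text nodes, then collapse runs of whitespace.
    return " ".join(" ".join(root.itertext()).split())


# Made-up sample records: one without a namespace, one with an explicit one.
plain = "<article><body><p>Some text.</p></body></article>"
namespaced = (
    '<article xmlns="https://jats.nlm.nih.gov/ns/archiving/1.0">'
    "<body><p>Some text.</p></body></article>"
)

print(extract_article_text(plain))
print(extract_article_text(namespaced))
```

Both sample records go through the same code path, which mirrors the two cases (empty namespace and explicit namespace) mentioned for the unit tests.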