Feature/dspace harvest #63
base: main
…r all default dspace document types. Added support for TTL, N3 and RDF/XML format in sparqlapi loader.
@ivanmrsulja thanks for this nice contribution. Please check my two comments. Also, in the PR description, please correct the name of the sh/bat file that should be run (run-dspace-oaifetch.bat).
src/main/java/org/vivoweb/harvester/util/repo/TextFileRecordHandler.java
@ivanmrsulja thanks for this contribution.
I have tested the PR on Windows 10. Initially, there was an encoding issue, which Ivan fixed. It works very well now with both ingestion approaches: TDB-based and SPARQL API-based.
…ship and type bugs when performing TDB import. Fixed SPARQL update encoding issue.
@ivanmrsulja well done
@@ -193,7 +193,9 @@ public void execute() throws IOException {

if (! StringUtils.equalsIgnoreCase(strArray[1], "deleted")) {
    log.trace("Adding record: " + strArray[0]);
    this.rhOutput.addRecord(strArray[0], strArray[1], this.getClass());
    String charReferenceRegex = "(?<=^|[^&])(&#(?:[0-9]+|x[0-9a-fA-F]+);)";
    String fullyEscapedData = strArray[1].replaceAll(charReferenceRegex, "&$1").replace("&&#", "&#");
Not sure what is going on with the formatting here, but I think the regex should be a constant Pattern so that it is not recompiled on every iteration.
You are right, I will fix this ASAP! The regex actually checks for every instance of unescaped HTML predefined entities so I can escape them properly in order to avoid encoding issues in further ETL stages.
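A minimal sketch of what the reviewer's suggestion could look like, not the change that was actually committed. The class name `CharRefEscaper`, the constant `CHAR_REFERENCE`, and the method `escape` are hypothetical. The replacement strings are also an assumption: the diff as displayed uses `"&$1"` followed by `replace("&&#", "&#")`, which would cancel each other out, so this sketch assumes the original source used `&amp;` and that the page rendering decoded it.

```java
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative and not taken from the PR.
final class CharRefEscaper {
    // Compiled once at class load rather than once per record.
    // Matches a numeric character reference (decimal or hexadecimal)
    // that is not immediately preceded by another '&'.
    private static final Pattern CHAR_REFERENCE =
        Pattern.compile("(?<=^|[^&])(&#(?:[0-9]+|x[0-9a-fA-F]+);)");

    private CharRefEscaper() {}

    // Assumption: the intent is to escape the leading '&' of each match,
    // so "&#233;" becomes "&amp;#233;".
    static String escape(String data) {
        return CHAR_REFERENCE.matcher(data)
            .replaceAll("&amp;$1")          // "&#233;" -> "&amp;&#233;"
            .replace("&amp;&#", "&amp;#");  // "&amp;&#233;" -> "&amp;#233;"
    }

    public static void main(String[] args) {
        System.out.println(escape("caf&#233; and &#x1F600;"));
    }
}
```

`String.replaceAll` compiles its regex argument on every call, so inside the record loop a shared `Pattern` avoids that repeated work while producing the same result.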
Added an example of fetching publication metadata from DSpace based on the oaifetch.
Steps to run:
1. Edit dspace-oaifetch.conf.xml to point to your desired instance's endpoint, as well as other OAI properties (keep in mind that only the Dublin Core metadata format is supported at the moment).
2. Run run-dspace-oaifetch.sh (or .bat, if you are on Windows).
Closes #4021