Openalex fetch example #61
@bkampe thanks for this. I haven't tested it yet, but I briefly reviewed the code. I have one small comment about the .gitignore file and one about the SPARQL API based approach. Ivan Mrsulja got the SPARQL API based approach working for the DSpace ETL (#63): the main script takes a parameter whose value is either tdb or sparql. Could that approach be copied into this PR as well?
/example-scripts/bash-scripts/full-harvest-examples/1.13-1.15-examples/example-openalex/logs/
/example-scripts/bash-scripts/full-harvest-examples/1.13-1.15-examples/example-openalex/data/
/example-scripts/bash-scripts/full-harvest-examples/1.13-1.15-examples/example-openalex/previous-harvest/
I thought lines 3-7 already cover this case?
I think that adding the following to the root-level .gitignore will solve this issue:
**/data
**/logs
**/previous-harvest
Should have been in there, I thought. Will take a look.
LGTM! Please check out my comments, I believe they can be helpful.
//log.trace("Adding record: " + fixedkey + "_" + recID);
//log.trace("data: "+ sb.toString());
//log.info("rhOutput: "+ this.rhOutput);
//log.info("recID: "+recID);
You can probably remove these commented-out lines (and the other commented-out code snippets elsewhere); it will clean up the code slightly.
sb.append(" <");
sb.append(SpecialEntities.xmlEncode(field));
sb.append(">");

// insert field value
sb.append(SpecialEntities.xmlEncode(val.toString().trim()));

// Field END
sb.append("</");
sb.append(SpecialEntities.xmlEncode(field));
sb.append(">\n");
return sb.toString();
Can these appends be chained using StringBuilder's default builder pattern?
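A minimal sketch of what the chained form could look like. The class name `XmlTagSketch` and the `xmlEncode` helper are hypothetical stand-ins so the example runs on its own; the real code would keep calling `SpecialEntities.xmlEncode`, whose exact escaping behavior is assumed here.

```java
// Sketch only: XmlTagSketch and xmlEncode are hypothetical stand-ins, not harvester code.
class XmlTagSketch {

    // Assumed stand-in for SpecialEntities.xmlEncode: escapes the XML special characters.
    static String xmlEncode(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    // Same sequence of appends as in the diff, written as one chain.
    // StringBuilder.append returns the builder itself, which makes this possible.
    static String getTag(String field, Object val) {
        String name = xmlEncode(field);
        return new StringBuilder()
                .append(" <").append(name).append(">")        // field start
                .append(xmlEncode(val.toString().trim()))     // field value
                .append("</").append(name).append(">\n")      // field end
                .toString();
    }

    public static void main(String[] args) {
        System.out.print(getTag("title", " Bricks & mortar "));
    }
}
```

Encoding the field name once and reusing it also avoids the second `xmlEncode(field)` call.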
.replaceAll(" |/", "_")
.replaceAll("\\(|\\)", "")
.replaceAll("/", "_");
I think you can use replaceAll("[ /]", "_").replaceAll("[()]", "") to make this clearer.
I'm not good at regex. And this was obviously incrementally added. I'm happy to use your suggestion to make it cleaner.
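For reference, a small sketch comparing the two forms. The class and method names are made up for illustration; only the `replaceAll` chains come from the diff and the suggestion. The third call in the original chain never matches, because the first call has already replaced every slash.

```java
// Sketch only: KeyCleanupSketch is a hypothetical name used for illustration.
class KeyCleanupSketch {

    // The chain as it appears in the diff.
    static String original(String key) {
        return key.replaceAll(" |/", "_")
                  .replaceAll("\\(|\\)", "")
                  .replaceAll("/", "_");   // redundant: no slashes are left at this point
    }

    // The suggested form using character classes.
    static String simplified(String key) {
        return key.replaceAll("[ /]", "_")   // space or slash becomes underscore
                  .replaceAll("[()]", "");   // parentheses are dropped
    }

    public static void main(String[] args) {
        String sample = "primary location/source (id)";
        System.out.println(original(sample));
        System.out.println(simplified(sample));
    }
}
```

Both methods should return the same string for any input, so the change is purely cosmetic.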
.replaceAll(" |/", "_")
.replaceAll("\\(|\\)", "")
.replaceAll("/", "_");
if (!Character.isDigit(fixedkey.charAt(0)) && !fixedkey.equals("abstract_inverted_index")) {
Can fixedkey ever be null? If yes, then I think there should be a null check for that edge case.
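One way such a guard could look, as a sketch. `KeyGuardSketch` and `isUsableKey` are hypothetical names; the condition itself is copied from the diff. The sketch also rejects the empty string, since `charAt(0)` throws a `StringIndexOutOfBoundsException` on an empty key.

```java
// Sketch only: KeyGuardSketch and isUsableKey are hypothetical names.
class KeyGuardSketch {

    // Returns true when the key passes the check from the diff.
    // Null and empty keys are rejected before charAt(0) is reached.
    static boolean isUsableKey(String fixedkey) {
        if (fixedkey == null || fixedkey.isEmpty()) {
            return false;
        }
        return !Character.isDigit(fixedkey.charAt(0))
                && !fixedkey.equals("abstract_inverted_index");
    }

    public static void main(String[] args) {
        System.out.println(isUsableKey(null));
        System.out.println(isUsableKey("title"));
    }
}
```

Whether a null or empty key should be skipped silently or logged is a decision for the PR author; the sketch simply skips it.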
}

public String getTagName(String field, Object val) {
    StringBuffer sb = new StringBuffer();
Why is StringBuffer used here? If this class will not be used in a multithreaded environment, I think we should switch to StringBuilder everywhere, because it avoids the synchronization overhead and is faster.
@bkampe I have tested the harvester on Windows 10 and it works. Well done. Please check my comments.
<!-- <param name="file">https://api.openalex.org/works?filter=concepts.id:C10238366|C118416809|C119823426|C120991184|C122156500|C123657996|C124363303|C127416549|C147176958|C148803439|C154226666|C158049464|C158550234|C1631582|C173560066|C178432105|C190831278|C196316656|C203115093|C203299862|C205300905|C2775926657|C2776009117|C2776081408|C2776136241|C2776161637|C2776311590|C2776445639|C2776748203|C2776825979|C2777231864|C2777364373|C2777800518|C2777831296|C2778206487|C2778647717|C2778684775|C2778753569|C2778906150|C2779054714|C2779201158|C2779265402|C2779331490|C2779635184|C2780021121|C2780113678|C2780344732|C2780886216|C2780933643|C2781052401,authorships.institutions.country_code:de&per-page=200&[email protected]&cursor=*</param>-->

<param name="output">raw-records.config.xml</param>
<param name="namespaceBase">http://vivo.example.com/harvest/aims_users/</param>
What is the purpose of this parameter, and how does it differ from baseURI in *.datamap.xsl? Moreover, the value is hardcoded in the *.datamap.xsl file, so anyone who changes it here also has to update the other file for the harvesting process to work properly. I found a changenamespace-all.config file in other examples. Any chance of using it for the OpenAlex ETL process?
I just copied it from another already existing config file. I have to find out what its purpose is.
xmlns:vitro = 'http://vitro.mannlib.cornell.edu/ns/vitro/0.7#'
xmlns:vcard = 'http://www.w3.org/2006/vcard/ns#'
xmlns:kdsf-vivo = 'http://lod.tib.eu/onto/kdsf/'
xmlns:node-publication='http://vivo.example.com/harvest/aims_users/fields/publication/'
Should we keep this hardcoded?
There are better ways to handle that, but this is how the harvester was built. I kept most of the existing configuration and functionality and added only what was necessary.
xmlns:node-publication='http://vivo.example.com/harvest/aims_users/fields/publication/'
xmlns:fn='http://www.w3.org/2005/xpath-functions'
xmlns:functx='http://www.functx.com'
xmlns:vivo-oa='http://lod.tib.eu/onto/vivo-oa/'
TIB specific namespace
I will remove all the TIB specific things.
xmlns:c4o='http://purl.org/spar/c4o/' >

<xsl:output method = "xml" indent = "yes"/>
<xsl:variable name = "baseURI">https://forschungsatlas.fid-bau.de/individual/</xsl:variable>
hardcoded to TIB specific value
I will remove all the TIB specific things.
<xsl:if test="normalize-space( $cited_by_count )">
<rdf:Description rdf:about="{$baseURI}gcf_{$oaid}">
<rdf:type rdf:resource="http://purl.org/spar/c4o/GlobalCitationCount"/>
<c4o:hasGlobalCountSource rdf:resource="https://forschungsatlas01.develop.service.tib.eu/individual/n4885"/>
TIB specific
Have to find out why there is a hardcoded individual URI.
This links to the entity where the count comes from. So this is the URI of OpenAlex itself in the Forschungsatlas.
I will see how to replace this with an object for the data source OpenAlex that is not tied to a localized property.
-->
<config>
# harvesting publications from TIB – Leibniz Information Centre for Science and Technology. exchange the ROR ID to test with your institution.
<param name="file">https://api.openalex.org/works?filter=authorships.institutions.ror:https://ror.org/04aj4c181&per-page=200&[email protected]&cursor=*</param>
I would suggest adding an example with a publication date range here. I have used this one:
<param name="file">https://api.openalex.org/works?filter=authorships.institutions.ror:https://ror.org/04aj4c181,from_publication_date:2024-12-01,to_publication_date:2024-12-31&per-page=200&[email protected]&cursor=*</param>
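If the date range should change from run to run, the URL could also be generated rather than typed by hand. The following is only a sketch: `OpenAlexUrlSketch` and `monthUrl` are hypothetical names, the filter keys are taken from the URL above, and the `mailto` parameter is left out because the address is not readable in this thread.

```java
import java.time.LocalDate;
import java.time.YearMonth;

// Sketch only: OpenAlexUrlSketch and monthUrl are hypothetical names.
class OpenAlexUrlSketch {

    // Builds a works query for one institution (by ROR ID) limited to one calendar month.
    // The mailto parameter from the config file is omitted here on purpose.
    static String monthUrl(String rorId, int year, int month) {
        YearMonth ym = YearMonth.of(year, month);
        LocalDate from = ym.atDay(1);          // first day of the month
        LocalDate to = ym.atEndOfMonth();      // last day, leap years included
        return "https://api.openalex.org/works?filter=authorships.institutions.ror:" + rorId
                + ",from_publication_date:" + from   // LocalDate prints as yyyy-MM-dd
                + ",to_publication_date:" + to
                + "&per-page=200&cursor=*";
    }

    public static void main(String[] args) {
        System.out.println(monthUrl("https://ror.org/04aj4c181", 2024, 12));
    }
}
```

The generated string would still need to be written into the `file` parameter of the config, with `&` escaped as `&amp;` if the file is parsed as strict XML.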
Added an example of fetching publication metadata from OpenAlex based on the JSON fetch.
Three example queries are available in example-scripts/bash-scripts/full-harvest-examples/1.13-1.15-examples/example-openalex/openAlexfetch.config.xml. Choose one for a first test or modify it to fit your needs.
OpenAlex fetch is already used in the Research Atlas: https://forschungsatlas.fid-bau.de/research
The namespace in example-scripts/bash-scripts/full-harvest-examples/1.13-1.15-examples/example-openalex/openalex-to-vivo.datamap.xsl needs to be adjusted according to your settings in runtime.properties:
<xsl:variable name = "baseURI">https://forschungsatlas.fid-bau.de/individual/</xsl:variable>
JSONFetch.java was extended to handle nested objects. Some filtering of unwanted characters was also added to avoid problems with the XSLTranslator (javax.xml.transform).
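The thread does not show how JSONFetch.java handles nesting or which characters it filters, so the following is only an illustration of the general idea, not the code in this PR. It assumes the JSON has already been parsed into nested maps, joins nested keys with an underscore (an assumption), and strips the ASCII control characters that XML 1.0 does not allow. `FlattenSketch` and its methods are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: not the JSONFetch implementation. All names here are hypothetical.
class FlattenSketch {

    // Removes ASCII control characters that are not allowed in XML 1.0
    // (everything below 0x20 except tab, line feed and carriage return).
    static String stripControlChars(String s) {
        return s.replaceAll("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]", "");
    }

    // Flattens nested maps into a single level, e.g. {a={b=1}} becomes {a_b=1}.
    static Map<String, String> flatten(Map<String, Object> node) {
        Map<String, String> out = new LinkedHashMap<>();
        flattenInto("", node, out);
        return out;
    }

    private static void flattenInto(String prefix, Map<String, Object> node, Map<String, String> out) {
        for (Map.Entry<String, Object> e : node.entrySet()) {
            // Assumption: nested keys are joined with an underscore.
            String key = prefix.isEmpty() ? e.getKey() : prefix + "_" + e.getKey();
            Object val = e.getValue();
            if (val instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, Object> child = (Map<String, Object>) val;
                flattenInto(key, child, out);   // recurse into the nested object
            } else if (val != null) {
                out.put(key, stripControlChars(val.toString()));
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> location = new LinkedHashMap<>();
        location.put("is_oa", true);
        Map<String, Object> work = new LinkedHashMap<>();
        work.put("title", "Example");
        work.put("primary_location", location);
        System.out.println(flatten(work));
    }
}
```

Arrays are not handled in this sketch; a real implementation would need a rule for them as well.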
Closes #56