A Java program which retrieves the full-texts or datasets from the Publication-Web-Pages.

LSmyrnaios/PublicationsRetriever

PublicationsRetriever

CI Workflows

GitHub Actions

Maven CI: Build Status
CodeQL: CodeQL
GitHub Pages: pages-build-deployment


Description & basic information

A Java program which retrieves the Document and Dataset Urls from the given Publication-Web-Pages and, if wanted, can also download the full-texts and/or upload them to an S3 Object Store.
Afterwards, these full-text documents are mined (by other pieces of software), in order to enrich a much more complete set of OpenAIRE publications with inference links in the OpenAIRE Graph.

This program is used either as a stand-alone download-tool for full-texts and datasets, or as a library for the UrlsWorker's code, of OpenAIRE's "PDF Aggregation Service".

The PublicationsRetriever takes as input the PubPages with their IDs, in JSON format, and gives an output, also in JSON format, which contains the IDs, the source and target PubPages, the Document or Dataset Urls, a series of informative booleans, the MD5 "fileHash", the "fileSize", the "mimeType" and a "comment".
The "booleans" are:

  • "wasUrlChecked": it signals whether the url was checked
  • "wasUrlValid": it signals whether the url was a valid url (one that can be connected to)
  • "wasDocumentOrDatasetAccessible": it signals whether the url gave a document or dataset url
  • "wasDirectLink": it signals whether the url was a document or dataset link itself
  • "couldRetry": it signals whether it may be worth rechecking the url in the future (in case the sourceUrl gave the docOrDatasetUrl, or it resulted in an error which might be eliminated in the future, like a "ConnectionTimeout")

Note: the values of the above "booleans" are Strings: "true", "false" or "N/A".
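
Because these fields are Strings rather than JSON booleans, a consumer of the output has to treat "N/A" as a third state. The following is a minimal sketch of one way to do that in Java. The names TriStateDemo and TriState are hypothetical, chosen for illustration, and are not part of PublicationsRetriever.

```java
/** Illustrative only: TriStateDemo and TriState are hypothetical names, not project classes. */
class TriStateDemo {

    /** Mirrors the three String values the output uses for its "booleans". */
    enum TriState {
        TRUE, FALSE, NOT_APPLICABLE;

        /** Maps "true", "false" or "N/A" to a constant and rejects anything else. */
        static TriState parse(String raw) {
            switch (raw) {
                case "true":  return TRUE;
                case "false": return FALSE;
                case "N/A":   return NOT_APPLICABLE;
                default: throw new IllegalArgumentException("Unexpected value: " + raw);
            }
        }
    }

    public static void main(String[] args) {
        // Boolean.parseBoolean("N/A") returns false, which would hide the third state.
        System.out.println(TriState.parse("N/A") + " vs " + Boolean.parseBoolean("N/A"));
    }
}
```

Rejecting unknown values, instead of defaulting to "false", makes a malformed record visible early.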

The "comment" can have the following values:

  • an empty string, if the document url is retrieved, and the user specified that the document files will not be downloaded
  • a note stating that the resulting url is a dataset url
  • the DocFileFullPath, if we have chosen to download the DocFiles
  • the ErrorCause, if there was any error which prevented the discovery of the DocOrDatasetUrl (in that case, the DocOrDatasetUrl is set to "unreachable")

Sample JSON-input:

{"id":"dedup_wf_001::83872a151fd78b045e62275ca626ec94","url":"https://zenodo.org/records/884160"}

Sample JSON-output (with downloading of the full-texts):

{"id":"dedup_wf_001::83872a151fd78b045e62275ca626ec94","sourceUrl":"https://zenodo.org/records/884160","pageUrl":"https://zenodo.org/records/884160","docOrDatasetUrl":"https://zenodo.org/records/884160/files/Data_for_Policy_2017_paper_55.pdf","wasUrlChecked":"true","wasUrlValid":"true","wasDocumentOrDatasetAccessible":"true","wasDirectLink":"false","couldRetry":"true","fileHash":"4e38a82fe1182e62b1c752b50f5ea59b","fileSize":"263917","mimeType":"application/pdf","comment":"/home/labros/PublicationsRetriever/target/../example/sample_output/DocFiles/dedup_wf_001::83872a151fd78b045e62275ca626ec94.pdf"}
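
Since every value in the sample output is a plain string and the record is flat, its fields can be pulled out with a small regular expression when no JSON library is at hand. This is only a sketch under that assumption: FlatJsonDemo and fields are hypothetical names, escape sequences are not decoded, and a real JSON parser is the safer choice for anything beyond quick inspection.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: a hypothetical helper for flat records whose values are all strings. */
class FlatJsonDemo {

    // Matches "key":"value" pairs; the value may contain backslash-escaped characters.
    private static final Pattern PAIR =
            Pattern.compile("\"([^\"\\\\]+)\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");

    /** Returns the key/value pairs of one output line, in order of appearance (escapes are kept raw). */
    static Map<String, String> fields(String jsonLine) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher matcher = PAIR.matcher(jsonLine);
        while (matcher.find()) {
            result.put(matcher.group(1), matcher.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        // A shortened version of the sample output record.
        String line = "{\"id\":\"dedup_wf_001::83872a151fd78b045e62275ca626ec94\","
                + "\"couldRetry\":\"true\",\"fileSize\":\"263917\",\"mimeType\":\"application/pdf\"}";
        fields(line).forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```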

Explanation of some keywords:
PubPage: the web page with the publication's information.
DocUrl: the url of the fulltext-document-file.
DatasetUrl: the url of the dataset-file.
DocOrDatasetUrl: the url of the document or the dataset file.
Full-text: the document containing all the text of a publication.
DocFileFullPath: the full-storage-path of the fulltext-document-file.
ErrorCause: the cause of the failure to retrieve the docUrl or the docFile.
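
To check a downloaded DocFile against the "fileHash" and "fileSize" of its output record, the two values can be recomputed with the Java standard library. The sketch below assumes that "fileSize" is the length in bytes and that "fileHash" is the lowercase hexadecimal MD5 digest, as the sample output suggests. FileHashDemo and md5Hex are hypothetical names, not project code.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative only: recomputes an MD5 hex digest and a byte count for some content. */
class FileHashDemo {

    /** Returns the MD5 digest of the content as 32 lowercase hex characters. */
    static String md5Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            // Left-pad to 32 hex digits so that leading zero bytes are not lost.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available in this JVM", e);
        }
    }

    public static void main(String[] args) {
        // For a real DocFile the bytes could come from java.nio.file.Files.readAllBytes(path).
        byte[] content = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println("fileHash=" + md5Hex(content) + " fileSize=" + content.length);
    }
}
```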

The program's execution process can be found here.
This program utilizes multiple threads to speed up the process, while using politeness-delays between same-domain connections, in order to avoid overloading the data-providers.
In case no IDs are available to be used in the input, the user should provide a file containing just urls (one url per line) and specify that they wish to process a data-set with no IDs, by changing the "util.url.LoaderAndChecker.useIdUrlPairs"-variable to "false".
If you want to run it with distributed execution on multiple VMs, you may give a different starting-number for the docFiles in each instance (see the run-instructions below).
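
The idea of politeness-delays between same-domain connections can be sketched as follows. This is not the program's actual implementation, only one possible way to express it: each domain gets a "next allowed" timestamp, and a worker thread reserves the next free slot before connecting. The names PolitenessDemo and DomainThrottle, and the 200 ms delay, are assumptions made for the example.

```java
import java.net.URI;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative only: a hypothetical per-domain throttle, not the project's own code. */
class PolitenessDemo {

    static class DomainThrottle {
        private final long delayMillis;
        // Maps each domain to the earliest time (in millis) at which it may be contacted next.
        private final ConcurrentHashMap<String, Long> nextAllowed = new ConcurrentHashMap<>();

        DomainThrottle(long delayMillis) {
            this.delayMillis = delayMillis;
        }

        /** Reserves the next slot for the url's domain and returns how long the caller should wait. */
        long reserve(String url, long nowMillis) {
            String domain = URI.create(url).getHost(); // assumes an absolute url with a host
            long[] wait = new long[1];
            // compute() is atomic per key, so two threads never receive the same slot.
            nextAllowed.compute(domain, (key, next) -> {
                long start = (next == null) ? nowMillis : Math.max(next, nowMillis);
                wait[0] = start - nowMillis;
                return start + delayMillis;
            });
            return wait[0];
        }
    }

    public static void main(String[] args) throws InterruptedException {
        DomainThrottle throttle = new DomainThrottle(200); // hypothetical delay value
        String[] urls = {
            "https://zenodo.org/records/884160",
            "https://zenodo.org/records/884160/files/Data_for_Policy_2017_paper_55.pdf",
            "https://example.org/some-page"
        };
        for (String url : urls) {
            long wait = throttle.reserve(url, System.currentTimeMillis());
            Thread.sleep(wait);
            System.out.println("waited " + wait + " ms before connecting to " + url);
        }
    }
}
```

Urls of different domains never wait for each other, so the delay applies only where it protects a data-provider.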

Disclaimers:

  • Keep in mind that it's best to run the program on a small set of urls (a few hundred maybe) at first, in order to see which parameters work best for you (url-timeouts, domainsBlocking etc.).
  • Please note that PublicationsRetriever is currently in beta, so you may encounter some issues.

Install & Run (using MAVEN)

Requirements

  • Java 11
  • Maven

Procedure

To install the application, navigate to the directory of the project, where the pom.xml is located.
Then enter this command in the terminal:
mvn clean install -U

To run the application, navigate to the target directory, which will be created by MAVEN, and run the executable JAR file with the appropriate run-command.

Run with standard input/output:
java -jar publications_retriever-1.2-SNAPSHOT.jar arg1:'-inputFileFullPath' arg2:<inputFile> arg3:'-retrieveDataType' arg4:'<dataType: document | dataset | all>' arg5:'-downloadDocFiles' arg6:'-fileNameType' arg7:'idName' arg8:'-firstFileNum' arg9:'NUM' arg10:'-docFilesStorage' arg11:'storageDir' < stdIn:'inputJsonFile' > stdOut:'outputJsonFile'

Run tests with custom input/output:

  • Inside pom.xml, change the mainClass of maven-shade-plugin from "PublicationsRetriever" to "TestNonStandardInputOutput".
  • Inside src/test/.../TestNonStandardInputOutput.java, give the wanted testInput and testOutput files.
  • If you want to provide a .tsv or a .csv file with a header row, you can specify it in the util.file.FileUtils.skipFirstRow-variable, in order for the first-row (headers) to be ignored.
  • If you want to see the logging-messages in the Console, open the resources/logback.xml and change the appender-ref, from File to Console.
  • Run mvn clean install to create the new JAR file.
  • Execute the program with the following command:
    java -jar publications_retriever-1.2-SNAPSHOT.jar arg1:'-retrieveDataType' arg2:'<dataType: document | dataset | all>' arg3:'-downloadDocFiles' arg4:'-fileNameType' arg5:'numberName' arg6:'-firstFileNum' arg7:'NUM' arg8:'-docFilesStorage' arg9:'storageDir' arg10:'-inputDataUrl' arg11:'inputUrl' arg12:'-numOfThreads' arg13:'<NUM>'

    You can use the argument '-inputFileFullPath' to define the inputFile, instead of the stdin-redirection. That way, the progress percentage will appear in the logging file.

Arguments explanation:

  • -retrieveDataType and dataType will tell the program to retrieve the urls of type "document", "dataset" or "all"-dataTypes.
  • -downloadDocFiles will tell the program to download the DocFiles. The absence of this argument will cause the program to NOT download the docFiles, but just to find the DocUrls instead. Either way the DocUrls will be written to the JsonOutputFile.
  • -fileNameType and < fileNameType > will tell the program which fileName-type to use (originalName, idName, numberName).
  • -firstFileNum and < NUM > will tell the program to use numbers as DocFileNames and the first DocFile will have the given number "NUM". The absence of this argument-group will cause the program to use the original-docFileNames.
  • -docFilesStorage and storageDir will tell the program to use the given DocFiles-storageDir. If the storageDir is equal to "S3ObjectStore", then the program uploads the DocFiles to an S3 storage (see the note below). The absence of this argument will cause the program to use a pre-defined storageDir which is: "./docFiles".
  • -inputDataUrl and inputUrl will tell the program to use the given URL to retrieve the inputFile, instead of having it locally stored and redirect the Standard Input Stream.
  • -numOfThreads and NUM will tell the program to use NUM number of worker-threads.

    The order of the program's arguments matters only per pair. For example, the argument 'storageDir' always has to be placed directly after the '-docFilesStorage' argument.
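
The "order matters only per pair" rule can be illustrated with a small parser that reads each flag together with the value that follows it. This is a sketch and not the program's own argument handling. ArgPairsDemo and parse are hypothetical names, and treating -downloadDocFiles as the only value-less flag is an assumption based on the arguments listed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Illustrative only: a hypothetical flag/value parser, not the project's argument handling. */
class ArgPairsDemo {

    // Assumption: of the documented flags, only this one takes no value.
    static final Set<String> SWITCHES = Set.of("-downloadDocFiles");

    /** Reads flags in any order; each non-switch flag must be directly followed by its value. */
    static Map<String, String> parse(String[] args) {
        Map<String, String> parsed = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            String flag = args[i];
            if (!flag.startsWith("-")) {
                throw new IllegalArgumentException("Value without a preceding flag: " + flag);
            }
            if (SWITCHES.contains(flag)) {
                parsed.put(flag, "true");
            } else if (i + 1 < args.length && !args[i + 1].startsWith("-")) {
                parsed.put(flag, args[++i]); // consume the value that belongs to this flag
            } else {
                throw new IllegalArgumentException("Missing value after " + flag);
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        String[] sample = {"-docFilesStorage", "storageDir", "-downloadDocFiles",
                           "-retrieveDataType", "all", "-numOfThreads", "8"};
        System.out.println(parse(sample));
    }
}
```

Swapping whole pairs leaves the result unchanged, while placing 'storageDir' before '-docFilesStorage' is rejected.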

Notes:

  • In order to access the S3ObjectStore, you should provide the file "S3_credentials.txt", inside the working directory, which must contain the endpoint, the accessKey, the secretKey, the region and the bucket, in that order, separated by commas.
  • In case you provide a very large input (over 100,000 records) and/or you have domains with very large html-pages, please consider setting the -Xms and -Xmx Java-arguments.
    • For example: java -Xms1g -Xmx4g -jar publications_retriever-1.2-SNAPSHOT.jar ...
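
The layout of "S3_credentials.txt" described in the first note can be illustrated with a short parsing sketch. It assumes the five values sit on a single line and that whitespace around the commas may be ignored, neither of which is stated above. S3CredentialsDemo and S3Credentials are hypothetical names and do not reflect how the program reads the file.

```java
/** Illustrative only: parses one line of the form endpoint,accessKey,secretKey,region,bucket. */
class S3CredentialsDemo {

    static class S3Credentials {
        final String endpoint, accessKey, secretKey, region, bucket;

        S3Credentials(String[] parts) {
            endpoint = parts[0];
            accessKey = parts[1];
            secretKey = parts[2];
            region = parts[3];
            bucket = parts[4];
        }
    }

    /** Splits the line on commas and checks that all five values are present, in the documented order. */
    static S3Credentials parse(String line) {
        String[] parts = line.trim().split("\\s*,\\s*", -1); // -1 keeps trailing empty values
        if (parts.length != 5) {
            throw new IllegalArgumentException("Expected 5 comma-separated values, found " + parts.length);
        }
        for (String part : parts) {
            if (part.isEmpty()) {
                throw new IllegalArgumentException("Empty value in credentials line");
            }
        }
        return new S3Credentials(parts);
    }

    public static void main(String[] args) {
        // Placeholder values; a real line could be read with java.nio.file.Files.readString(path).
        S3Credentials creds = parse("https://s3.example.org, myAccessKey, mySecretKey, eu-west-1, my-bucket");
        System.out.println(creds.endpoint + " / " + creds.region + " / " + creds.bucket);
    }
}
```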

Example

You can check the functionality of PublicationsRetriever by running an example.
Type ./runExample.sh in the terminal and hit ENTER.
Then you can see the results in the example/sample_output directory.
The above script will run the following commands:

  • mvn clean install: Does a clean install.
  • rm -rf example/sample_output/*: Removes any previous example-results.
  • cd target && java -jar publications_retriever-1.2-SNAPSHOT.jar -retrieveDataType document -downloadDocFiles -fileNameType numberName -firstFileNum 1 -docFilesStorage ../example/sample_output/DocFiles < ../example/sample_input/sample_input.json > ../example/sample_output/sample_output.json
    This command will run the program with "../example/sample_input/sample_input.json" as input and "../example/sample_output/sample_output.json" as the output.
    The arguments used are:
    • -retrieveDataType and document will tell the program to retrieve the urls of type "document".
    • -downloadDocFiles which will tell the program to download the DocFiles.
    • -fileNameType numberName which will tell the program to use numbers as the docFileNames.
    • -firstFileNum 1 which will tell the program to use numbers as DocFileNames and the first DocFile will have the number 1.
    • -docFilesStorage ../example/sample_output/DocFiles which will tell the program to use the custom DocFilesStorageDir: "../example/sample_output/DocFiles".
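
For callers that prefer to start the stand-alone tool from Java code rather than from the shell, the last command of the script could be prepared with a ProcessBuilder, as sketched below. The jar name, arguments and file paths are taken from the command above. RunExampleDemo and prepare are hypothetical names, and the sketch assumes it is run from the project's root directory after the JAR has been built.

```java
import java.io.File;
import java.util.List;

/** Illustrative only: prepares the example run with stdin/stdout redirected to files. */
class RunExampleDemo {

    static ProcessBuilder prepare(File targetDir) {
        List<String> command = List.of(
                "java", "-jar", "publications_retriever-1.2-SNAPSHOT.jar",
                "-retrieveDataType", "document",
                "-downloadDocFiles",
                "-fileNameType", "numberName",
                "-firstFileNum", "1",
                "-docFilesStorage", "../example/sample_output/DocFiles");
        // The redirect files are resolved against the caller's working directory,
        // not against directory(), hence the explicit targetDir prefix.
        return new ProcessBuilder(command)
                .directory(targetDir)
                .redirectInput(new File(targetDir, "../example/sample_input/sample_input.json"))
                .redirectOutput(new File(targetDir, "../example/sample_output/sample_output.json"))
                .redirectError(ProcessBuilder.Redirect.INHERIT);
    }

    public static void main(String[] args) {
        ProcessBuilder builder = prepare(new File("target"));
        // Only prints the command; calling builder.start().waitFor() would actually launch it.
        System.out.println(String.join(" ", builder.command()));
    }
}
```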

Customizations