Home
The Harvester is an extensible Python module that enables harvesting capabilities within the ANDS Registry.
The plugin architecture allows further modules to be developed that support additional harvest methods and metadata schemas and/or profiles. These modules can be plugged in to the architecture, enabling the creation of ANDS Registry records from any web resource.
The following harvest methods are currently supported:
- HTTP-FETCH - allows the harvest of individual files from any web resource, in any format (e.g. JSON or XML)
- CKAN - JSON metadata over HTTP
- OAI-PMH - XML
- CSW (Catalogue Services for the Web) - XML
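One way to picture a plug-in architecture of this kind is a registry that maps each harvest method name to a handler class. The sketch below is illustrative only: `HANDLERS`, `register`, `BaseHarvester`, `handler_for` and the two handler classes are hypothetical names, not taken from the Harvester source, and the real module layout may differ.

```python
# Hypothetical registry; none of these names come from the Harvester source.
HANDLERS = {}


def register(method_name):
    """Class decorator that files a handler under its harvest method name."""
    def decorator(cls):
        HANDLERS[method_name.upper()] = cls
        return cls
    return decorator


class BaseHarvester:
    """Common interface that every (hypothetical) plug-in implements."""

    def __init__(self, uri):
        self.uri = uri

    def harvest(self):
        raise NotImplementedError


@register("HTTP-FETCH")
class HttpFetchHarvester(BaseHarvester):
    def harvest(self):
        return "fetch a single file from " + self.uri


@register("OAI-PMH")
class OaiPmhHarvester(BaseHarvester):
    def harvest(self):
        return "issue OAI-PMH requests against " + self.uri


def handler_for(method_name, uri):
    """Look up the handler class for a harvest method and instantiate it."""
    key = method_name.upper()
    if key not in HANDLERS:
        raise ValueError("unsupported harvest method: " + method_name)
    return HANDLERS[key](uri)
```

Under this design, supporting a new harvest method means adding one decorated class; the dispatch code itself does not change.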
Whilst the Harvester can retrieve metadata in any format, the metadata must be transformed into RIF-CS XML to be compatible with the ANDS Registry ingest process. Where a transform is required, an XSL transformation can be incorporated within the plug-in module.
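As a rough illustration of the transform step, the sketch below builds a Saxon 8 command line and runs it from Python. The helper names `saxon_command` and `transform` are hypothetical, and the Harvester's own invocation may differ. The argument order (`-o output`, then source document, then stylesheet) follows the Saxon 8 command-line convention for `net.sf.saxon.Transform`.

```python
import subprocess


def saxon_command(java_bin, saxon_jar, source_xml, xsl_file, output_xml):
    """Build the argument list for a Saxon 8 transform (nothing is executed)."""
    return [
        java_bin,
        "-cp", saxon_jar,
        "net.sf.saxon.Transform",
        "-o", output_xml,   # where the RIF-CS result is written
        source_xml,         # harvested metadata
        xsl_file,           # stylesheet supplied with the plug-in
    ]


def transform(java_bin, saxon_jar, source_xml, xsl_file, output_xml):
    """Run the transform; raises CalledProcessError if Saxon exits non-zero."""
    subprocess.check_call(
        saxon_command(java_bin, saxon_jar, source_xml, xsl_file, output_xml)
    )
```

`subprocess.check_call` is used rather than `subprocess.run` because the latter is not available in Python 3.4.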
The Harvester can perform simultaneous harvests; the maximum number of concurrent harvests can be set within its configuration.
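A capped number of simultaneous harvests can be sketched with a thread pool. This is a minimal illustration, not the Harvester's actual scheduler: `MAX_CONCURRENT_HARVESTS` and `run_harvests` are hypothetical names, and the real configuration key is not shown in this document.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical setting name; the real configuration key may differ.
MAX_CONCURRENT_HARVESTS = 3


def run_harvests(jobs, harvest_fn, max_workers=MAX_CONCURRENT_HARVESTS):
    """Run harvest_fn over jobs with at most max_workers running at once.

    Results are returned in the same order as the jobs were supplied.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(harvest_fn, jobs))
```

Jobs beyond the limit simply wait in the pool's queue until a worker becomes free.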
The following are requirements to run the Harvester:
- Python 3.4
- Java Runtime Environment (either a JRE or a JDK)
- Saxon 8
Installation of Java is covered in the Registry installation instructions. Installation of Python and Saxon is covered below.
The ANDS Harvester requires Python 3.4 to be installed on the server. Instructions on how to install Python 3.4 on a CentOS machine are available for viewing.
The following instructions install the Harvester in /usr/local/harvester, with the harvested contents stored in /var/harvested_contents:
cd /usr/local/
sudo git clone https://github.com/au-research/ANDS-Harvester.git harvester
cd harvester
sudo cp myconfig.sample myconfig.py
sudo vi myconfig.py
See below for sample/recommended settings for myconfig.py.
Use a search engine to find saxon8.jar. Install it as /usr/share/java/saxon8.jar
test_limit = 99999 #provide an upper limit for all the harvested records, will go to this limit and then complete
polling_frequency = 30 #frequency for the harvester to write to the database and check for jobs, in seconds
max_up_seconds_per_harvest = 7200 #sets the upper limit time for a single harvest thread, terminate after this number of seconds
run_dir = '/usr/local/harvester' #the current directory of the harvester
# context = ssl.SSLContext(ssl.PROTOCOL_SSLv3) # comment out this line
# context.load_cert_chain(certfile="<pathto:cert.pem>") # comment out this line
admin_email_addr = '' #the email address for the site admin, use for reporting
response_url = 'http://localhost/registry/import/put/' #URL to the import controller for the registry
data_store_path = '/var/harvested_contents'
log_dir = run_dir + '/log'
java_home = '/usr/bin/java' #path to the java_bin
saxon_jar = '/usr/share/java/saxon8.jar' #path to the saxon library
db_host = '<database host>'
db_user = '<database user>'
db_passwd = '<database password>'
db = 'dbs_registry'
harvest_table = 'harvests'
harvester_specific_datasource_attributes = "'xsl_file','title','harvest_method','uri','provider_type','advanced_harvest_mode','oai_set', 'advanced_harvest_mode'"
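To show how the two limits in the sample settings might interact, the sketch below checks a running harvest against `test_limit` and `max_up_seconds_per_harvest`. The function `should_stop` is hypothetical, and whether the Harvester compares with `>=` or `>` is an assumption.

```python
import time

# Values copied from the sample configuration above.
test_limit = 99999
max_up_seconds_per_harvest = 7200


def should_stop(started_at, records_harvested, now=None):
    """Return a reason string if the harvest should end, otherwise None."""
    if now is None:
        now = time.time()
    if records_harvested >= test_limit:
        return "record limit reached"
    if now - started_at >= max_up_seconds_per_harvest:
        return "time limit reached"
    return None
```

A harvest thread could call a check like this once per `polling_frequency` interval, at the same point where it reports progress to the database.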
Make sure the log directory is created and writable:
cd /usr/local/harvester
sudo mkdir log
sudo chmod 777 -R log
Make sure the harvested contents directory is created and writable:
cd /var/
sudo mkdir harvested_contents
sudo chmod 777 -R harvested_contents
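Before starting the service, it can be useful to confirm that both directories exist and are writable. The helper below is a hypothetical pre-flight check, not part of the Harvester. It should be run as the same user the Harvester runs as, since `os.access` tests the permissions of the calling process.

```python
import os


def check_writable_dirs(paths):
    """Return the subset of paths that are missing or not writable."""
    problems = []
    for path in paths:
        # A directory needs both write and execute permission to create files in it.
        if not os.path.isdir(path) or not os.access(path, os.W_OK | os.X_OK):
            problems.append(path)
    return problems


if __name__ == "__main__":
    bad = check_writable_dirs(["/usr/local/harvester/log", "/var/harvested_contents"])
    for path in bad:
        print("missing or not writable: " + path)
```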
The file ands-harvester is a System V init script to be copied into /etc/init.d. Copy it into place:
cd /usr/local/harvester
sudo cp ands-harvester /etc/init.d
sudo chmod 755 /etc/init.d/ands-harvester
sudo chkconfig --add ands-harvester
sudo chkconfig ands-harvester on
Start the Harvester with:
sudo service ands-harvester start