Richard Walker edited this page Sep 25, 2014 · 10 revisions

Welcome to the ANDS Harvester wiki!

ANDS-Harvester

The Harvester is an extensible Python module that enables harvesting capabilities within the ANDS Registry.

The plugin architecture allows further modules to be developed that support additional harvest methods, metadata schemas, and profiles. These modules can be ‘plugged in’ to the Harvester, enabling the creation of ANDS Registry records from any web resource.

The following harvest methods are currently supported:

  • HTTP-FETCH - harvests individual files from any web resource, in any format (e.g. JSON or XML)
  • CKAN - JSON metadata over HTTP
  • OAI-PMH - XML
  • CSW (Catalogue Services for the Web) - XML
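
One way to picture the plug-in idea is a registry that maps each harvest method name to a handler class. The sketch below is illustrative only: `HARVEST_METHODS`, `register`, `BaseHarvester`, `get_harvester` and the handler classes are hypothetical names, not taken from the Harvester source, and the handlers do no real harvesting.

```python
# Hypothetical sketch of a plug-in registry; none of these names come from
# the Harvester source code.
HARVEST_METHODS = {}


def register(method_name):
    """Class decorator that files a handler under its harvest method name."""
    def decorator(cls):
        HARVEST_METHODS[method_name.upper()] = cls
        return cls
    return decorator


class BaseHarvester:
    """Common interface shared by the illustrative plug-ins."""
    output_format = None

    def __init__(self, uri):
        self.uri = uri

    def describe(self):
        return "{} harvest of {} ({})".format(
            type(self).__name__, self.uri, self.output_format)


@register("HTTP-FETCH")
class HttpFetchHarvester(BaseHarvester):
    output_format = "any"


@register("CKAN")
class CkanHarvester(BaseHarvester):
    output_format = "json"


@register("OAI-PMH")
class OaiPmhHarvester(BaseHarvester):
    output_format = "xml"


@register("CSW")
class CswHarvester(BaseHarvester):
    output_format = "xml"


def get_harvester(method_name, uri):
    """Look up the plug-in registered for a data source's harvest method."""
    try:
        cls = HARVEST_METHODS[method_name.upper()]
    except KeyError:
        raise ValueError("Unsupported harvest method: " + method_name)
    return cls(uri)
```

With this shape, supporting a new harvest method means adding one decorated class; the lookup code does not change.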

Although the Harvester can retrieve metadata in any format, the metadata must be transformed into RIF-CS XML to be compatible with the ANDS Registry ingest process. Where a transform is required, an XSL transformation can be incorporated within the ‘plug-in’ module.
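
As a rough illustration of the transform step, the sketch below builds a Saxon command line and runs it as a subprocess. It assumes the Saxon 8 command-line form `java -jar saxon8.jar -o <output> <source> <stylesheet>` and reuses the default paths from the sample configuration; the function names and the file names in the usage note are hypothetical, and this is not necessarily how the Harvester itself invokes Saxon.

```python
import subprocess


def build_saxon_command(source_xml, stylesheet, output_xml,
                        java_bin="/usr/bin/java",
                        saxon_jar="/usr/share/java/saxon8.jar"):
    """Return the argument list for an XSL transform with Saxon 8.

    Assumes the Saxon 8 command-line form:
        java -jar saxon8.jar -o output source stylesheet
    """
    return [java_bin, "-jar", saxon_jar,
            "-o", output_xml, source_xml, stylesheet]


def transform_to_rifcs(source_xml, stylesheet, output_xml, **paths):
    """Run the transform; raises CalledProcessError if Saxon exits non-zero."""
    subprocess.check_call(
        build_saxon_command(source_xml, stylesheet, output_xml, **paths))
```

For example, `transform_to_rifcs("harvested.xml", "to_rifcs.xsl", "rifcs.xml")` would write the transformed record to `rifcs.xml`, provided Java and Saxon are installed at the assumed paths.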

The Harvester can perform simultaneous harvests; the maximum number of concurrent harvests can be set within its configuration.
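
A capped number of simultaneous harvests can be sketched with a thread pool whose size is the configured maximum. This is a minimal illustration, not the Harvester's actual scheduler: `HarvestTracker`, `run_harvests` and the `limit` parameter are hypothetical names, and the "harvest" is a short sleep standing in for real work.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class HarvestTracker:
    """Stand-in harvest job that records how many run at the same time."""

    def __init__(self):
        self._lock = threading.Lock()
        self.running = 0
        self.peak = 0

    def harvest(self, source_id):
        with self._lock:
            self.running += 1
            self.peak = max(self.peak, self.running)
        time.sleep(0.05)  # placeholder for the real harvest work
        with self._lock:
            self.running -= 1
        return source_id


def run_harvests(source_ids, limit, tracker):
    """Run every harvest, with at most `limit` executing concurrently."""
    with ThreadPoolExecutor(max_workers=limit) as pool:
        return list(pool.map(tracker.harvest, source_ids))
```

Because the pool never starts more than `limit` worker threads, `tracker.peak` cannot exceed the limit, however many data sources are queued.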

Requirements

The following are requirements to run the Harvester:

  • Python 3.4
  • Java Runtime Environment (either a JRE or a JDK)
  • Saxon 8

Installation of Java is covered in the Registry installation instructions. Installation of Python and Saxon is covered below.

Installation

The ANDS Harvester requires Python 3.4 to be installed on the server. Instructions on how to install Python 3.4 on a CentOS machine are available for viewing.

These instructions install the Harvester in /usr/local/harvester, with the harvested content stored in /var/harvested_contents.

Check out the repository

cd /usr/local/
sudo git clone https://github.com/au-research/ANDS-Harvester.git harvester

Configure the ANDS Harvester

cd harvester
sudo cp myconfig.sample myconfig.py
sudo vi myconfig.py

See below for sample/recommended settings for myconfig.py.

Download Saxon 8

Use a search engine to find saxon8.jar, then install it as /usr/share/java/saxon8.jar.

Sample Configurations

test_limit = 99999 # upper limit on the number of records harvested; the harvest completes once this limit is reached
polling_frequency = 30 # how often, in seconds, the harvester writes to the database and checks for jobs
max_up_seconds_per_harvest = 7200 # maximum run time of a single harvest thread, in seconds; the thread is terminated after this
run_dir = '/usr/local/harvester' # the directory the harvester is installed in
# context = ssl.SSLContext(ssl.PROTOCOL_SSLv3) # comment out this line
# context.load_cert_chain(certfile="<pathto:cert.pem>") # comment out this line
admin_email_addr = '' # email address of the site admin, used for reporting
response_url = 'http://localhost/registry/import/put/' # URL of the registry's import controller
data_store_path = '/var/harvested_contents' # where harvested content is stored
log_dir = run_dir + '/log'
java_home = '/usr/bin/java' # path to the java binary
saxon_jar = '/usr/share/java/saxon8.jar' # path to the Saxon library
db_host = '<database host>'
db_user = '<database user>'
db_passwd = '<database password>'
db = 'dbs_registry'
harvest_table = 'harvests'
harvester_specific_datasource_attributes = "'xsl_file','title','harvest_method','uri','provider_type','advanced_harvest_mode','oai_set', 'advanced_harvest_mode'"
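
Before starting the service, it can help to confirm that no required setting was left empty or as an angle-bracket placeholder. The check below is a hypothetical helper, not part of the Harvester: `REQUIRED_SETTINGS` and `missing_settings` are illustrative names, and the list of required settings is an assumption based on the sample above.

```python
# Hypothetical pre-flight check; the list of required settings is an
# assumption drawn from the sample configuration, not from the Harvester code.
REQUIRED_SETTINGS = [
    "run_dir", "response_url", "data_store_path", "log_dir",
    "java_home", "saxon_jar",
    "db_host", "db_user", "db_passwd", "db", "harvest_table",
]


def missing_settings(config):
    """Return required setting names that are absent, empty, or placeholders.

    `config` is a mapping of setting name to value, e.g. vars(myconfig).
    """
    missing = []
    for name in REQUIRED_SETTINGS:
        value = config.get(name)
        is_placeholder = isinstance(value, str) and value.startswith("<")
        if not value or is_placeholder:
            missing.append(name)
    return missing
```

After `import myconfig`, calling `missing_settings(vars(myconfig))` would return an empty list once every setting has been filled in.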

Permissions

Make sure the log directory exists and is writable:

cd /usr/local/harvester
sudo mkdir log
sudo chmod 777 -R log

Make sure the harvested contents directory exists and is writable:

cd /var/
sudo mkdir harvested_contents
sudo chmod 777 -R harvested_contents
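
To verify the result, a short check such as the following can be run as the user that will run the Harvester. It is a minimal sketch; `unwritable_dirs` is a hypothetical helper name, and the paths in the usage note are the ones used in these instructions.

```python
import os


def unwritable_dirs(paths):
    """Return the paths that are missing or not writable by the current user."""
    return [p for p in paths
            if not (os.path.isdir(p) and os.access(p, os.W_OK | os.X_OK))]
```

Calling `unwritable_dirs(["/usr/local/harvester/log", "/var/harvested_contents"])` should return an empty list when both directories are set up correctly.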

Install Linux service

The file ands-harvester is a System V init script. Copy it into /etc/init.d, make it executable, and register it with chkconfig:

cd /usr/local/harvester
sudo cp ands-harvester /etc/init.d
sudo chmod 755 /etc/init.d/ands-harvester
sudo chkconfig --add ands-harvester
sudo chkconfig ands-harvester on

Start the harvester

Start the harvester with:

sudo service ands-harvester start