
CKAN Harvesting User Guide


Overview

This user guide walks you through two metadata harvesting tasks. First, we harvest metadata from a GeoNetwork instance into your CKAN instance. Second, we do the reverse: harvest metadata from your CKAN instance into a GeoNetwork instance.

Harvest remote metadata into CKAN's catalogue

The remote metadata harvesting process in CKAN consists of two parts. The first part is done in CKAN's harvest web interface. The second part, which can be automated (see this link for further details), is done via the command-line interface (CLI).

1. Add harvest source in CKAN's harvest page (replace localhost with your server address):

Go to http://localhost/harvest

Click the "Add Harvest Source" button.

2. You'll be redirected to CKAN's login page. Log in as the sysadmin "admin" user we created in CKANSetupGuide.

ckan-login-page.JPG

3. On the Create Harvest Source page, enter the details of your harvest source. In this example, we harvest metadata from CSIRO's Mineral Down Under GeoNetwork instance via its OGC CSW service. We set the Update frequency field to Always for now; you can come back and change this value later.

Note: The CKAN harvester supports a number of configuration options that adjust its default harvesting behaviour (see the official CKAN harvester website for further details).

ckan-harvest-page.JPG

Manually run the harvest jobs via CLI

Important: As of release 2.0 of the ckanext-harvest extension, a number of issues prevent the extension from working properly, and they need to be worked around before you continue with this section. Properly addressing those issues is out of scope for this guide; the workaround we have come up with should not be treated as a fix. Before using the ckanext-harvest extension in production, check with its user community whether those issues have been addressed in the latest release. Now, go through the Troubleshooting section of this user guide before continuing with the rest of this section.

4. Before running the harvest job we previously created, ensure that your Python Virtual Environment is activated:
. /usr/lib/ckan/default/bin/activate

5. Check that the harvest source is in CKAN's database:

(default) $ paster --plugin=ckanext-harvest harvester sources --config=/etc/ckan/default/production.ini

Note: If your harvest source was created successfully, you should get the following result (your Source id will be different):

Source id: 503ef896-c82f-4eea-85ae-023d8c08aa65
url: http://mdu-data.arrc.csiro.au/geonetwork/srv/en/csw
type: None
active: True
owner org: None
frequency: ALWAYS
jobs: 0

There is 1 active harvest source
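If you want to check this listing from a script rather than by eye, it can be parsed into key/value pairs. The following is a minimal Python 3 sketch, assuming the output has exactly the "key: value" layout shown above. The parse_sources helper is hypothetical and is not part of ckanext-harvest.

```python
def parse_sources(text):
    """Parse 'key: value' blocks (one per harvest source) into dicts.

    Assumes every source block starts with a 'Source id:' line, as in the
    sample output of this guide. That layout is an assumption, not a
    documented format.
    """
    sources = []
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if not sep:
            continue  # blank lines and the trailing summary line
        key = key.strip()
        if key == "Source id":
            sources.append({})  # a new source block begins here
        if sources:
            sources[-1][key] = value.strip()
    return sources


sample = """Source id: 503ef896-c82f-4eea-85ae-023d8c08aa65
url: http://mdu-data.arrc.csiro.au/geonetwork/srv/en/csw
type: None
active: True
owner org: None
frequency: ALWAYS
jobs: 0

There is 1 active harvest source
"""

sources = parse_sources(sample)
for source in sources:
    print(source["Source id"], source["url"], source["frequency"])
```

All values are kept as strings (for example "True" and "0"), so convert them yourself if you need booleans or numbers.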

6. The harvesting extension uses two different queues to complete its job. The first one handles the gathering (e.g. for a CSW server, it performs a GetRecords operation), and the second one handles the fetching (e.g. for a CSW server, it performs a GetRecordById operation) and importing. First, let's start the consumer for the gathering queue.

(default) $ paster --plugin=ckanext-harvest harvester gather_consumer --config=/etc/ckan/default/production.ini

You should see the following output on the terminal where you run the above command:

2013-09-02 00:49:20,527 DEBUG [ckanext.harvest.queue] pika connection using {'retry_delay': 2.0, 'frame_max': 10000, 'channel_max': 0, 'locale': 'en_US', 'socket_timeout': 0.25, 'ssl': False, 'host': 'localhost', 'ssl_options': {}, 'virtual_host': '/', 'heartbeat': 0, 'credentials': <pika.credentials.PlainCredentials object at 0x2f72f50>, 'backpressure_detection': False, 'port': 5672, 'connection_attempts': 1}
2013-09-02 00:49:21,580 DEBUG [ckanext.harvest.queue] Gather queue consumer registered

7. On another terminal, run the following command to start the consumer for the fetching queue:

(default) $ paster --plugin=ckanext-harvest harvester fetch_consumer --config=/etc/ckan/default/production.ini

You should see the following output on the terminal where you run the above command:

2013-09-02 00:59:09,960 DEBUG [ckanext.harvest.queue] pika connection using {'retry_delay': 2.0, 'frame_max': 10000, 'channel_max': 0, 'locale': 'en_US', 'socket_timeout': 0.25, 'ssl': False, 'host': 'localhost', 'ssl_options': {}, 'virtual_host': '/', 'heartbeat': 0, 'credentials': <pika.credentials.PlainCredentials object at 0x447b050>, 'backpressure_detection': False, 'port': 5672, 'connection_attempts': 1}
2013-09-02 00:59:11,018 DEBUG [ckanext.harvest.queue] Fetch queue consumer registered
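To make the two-queue design more concrete, here is a toy, in-memory Python 3 model of the gather and fetch stages. It is only a sketch: the real extension talks to a message broker (the logs above show a pika connection) and to a CSW server, whereas this example uses standard-library queues and a dictionary as a stand-in for the remote catalogue. All function names and record identifiers here are illustrative assumptions.

```python
from queue import Queue


def gather(job_id, list_identifiers, fetch_queue):
    """Gather stage: list record identifiers and queue one fetch task each."""
    count = 0
    for identifier in list_identifiers(job_id):
        fetch_queue.put(identifier)
        count += 1
    return count


def fetch_and_import(fetch_queue, get_record, catalogue):
    """Fetch stage: take each queued identifier, fetch the record, import it."""
    while not fetch_queue.empty():
        identifier = fetch_queue.get()
        catalogue[identifier] = get_record(identifier)


# Stand-in for the remote server (hypothetical data).
remote = {
    "rec-1": "<metadata 1/>",
    "rec-2": "<metadata 2/>",
    "rec-3": "<metadata 3/>",
}

gather_queue = Queue()
fetch_queue = Queue()
catalogue = {}

gather_queue.put("job-1")   # starting a harvest job puts it on the gather queue
job = gather_queue.get()    # the gather consumer picks the job up
sent = gather(job, lambda _job: sorted(remote), fetch_queue)
fetch_and_import(fetch_queue, remote.__getitem__, catalogue)

print("sent", sent, "objects to the fetch queue; imported", len(catalogue))
```

Splitting the work this way means the cheap listing step and the slower per-record fetch and import step can run in separate processes, which is why the guide starts one consumer per queue.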

8. On a third terminal, run the following command to start any pending harvest job(s):

(default) $ paster --plugin=ckanext-harvest harvester run --config=/etc/ckan/default/production.ini

On the terminal where you start the pending harvest job, you should see the following output (your job id will be different):

2013-09-02 05:45:39,426 INFO [ckanext.harvest.logic.action.update] Harvest job run: {}
2013-09-02 05:45:39,434 INFO [ckanext.harvest.logic.action.create] Harvest job create: {'source_id': u'503ef896-c82f-4eea-85ae-023d8c08aa65'}
2013-09-02 05:45:39,450 INFO [ckanext.harvest.logic.action.create] Harvest job saved 061bc014-76ea-44ec-a091-8c363922bf10
2013-09-02 05:45:39,475 DEBUG [ckanext.harvest.queue] pika connection using {'retry_delay': 2.0, 'frame_max': 10000, 'channel_max': 0, 'locale': 'en_US', 'socket_timeout': 0.25, 'ssl': False, 'host': 'localhost', 'ssl_options': {}, 'virtual_host': '/', 'heartbeat': 0, 'credentials': <pika.credentials.PlainCredentials object at 0x33fe6d0>, 'backpressure_detection': False, 'port': 5672, 'connection_attempts': 1}
2013-09-02 05:45:40,090 INFO [ckanext.harvest.logic.action.update] Sent job 061bc014-76ea-44ec-a091-8c363922bf10 to the gather queue
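The paster commands used so far differ only in their subcommand. If you plan to script them, a small helper keeps the plugin name and configuration path in one place. This is a Python 3 sketch under the assumption that your configuration file lives at the path used in this guide; harvester_command is a hypothetical helper that only builds the argument list and does not run anything.

```python
import shlex

CONFIG = "/etc/ckan/default/production.ini"  # path used throughout this guide


def harvester_command(subcommand, config=CONFIG):
    """Return the argument list for one 'paster ... harvester <subcommand>' call."""
    return [
        "paster",
        "--plugin=ckanext-harvest",
        "harvester",
        subcommand,
        "--config=" + config,
    ]


# The subcommands used in this guide, in the order they are run.
for name in ("sources", "gather_consumer", "fetch_consumer", "run"):
    print(" ".join(shlex.quote(part) for part in harvester_command(name)))
```

The returned list can be passed to subprocess.run from within the activated virtual environment. Keep in mind that the two consumer commands keep running, so each of them needs its own process.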

9. Inspect the log on the first terminal, where you started the gathering queue consumer. If the gathering process is working (no error or exception is thrown), you should see (a) at the beginning of the log, the queue receiving the harvest job once the job has been started, and (b) at the end of the log, the number of objects sent to the fetching queue. In our case, 53 metadata records were found in our harvest source, i.e. the CSIRO Mineral Down Under GeoNetwork instance.

2013-09-02 05:45:40,131 DEBUG [ckanext.harvest.queue] Received harvest job id: 061bc014-76ea-44ec-a091-8c363922bf10
2013-09-02 05:45:40,132 DEBUG [ckanext.harvest.queue] pika connection using {'retry_delay': 2.0, 'frame_max': 10000, 'channel_max': 0, 'locale': 'en_US', 'socket_timeout': 0.25, 'ssl': False, 'host': 'localhost', 'ssl_options': {}, 'virtual_host': '/', 'heartbeat': 0, 'credentials': <pika.credentials.PlainCredentials object at 0x3c4fc90>, 'backpressure_detection': False, 'port': 5672, 'connection_attempts': 1}
2013-09-02 05:45:40,684 DEBUG [ckanext.spatial.harvesters.csw.CSW.gather] CswHarvester gather_stage for job: <HarvestJob id=061bc014-76ea-44ec-a091-8c363922bf10 created=2013-09-02 05:45:39.447744 gather_started=2013-09-02 05:45:40.684076 gather_finished=None finished=None source_id=503ef896-c82f-4eea-85ae-023d8c08aa65 status=Running>
2013-09-02 05:45:41,097 DEBUG [ckanext.spatial.harvesters.csw.CSW.gather] Starting gathering for http://mdu-data.arrc.csiro.au/geonetwork/srv/en/csw
2013-09-02 05:45:41,098 INFO [ckanext.spatial.lib.csw_client] Making CSW request: getrecords {'outputschema': 'http://www.isotc211.org/2005/gmd', 'startposition': 0, 'typenames': 'csw:Record', 'maxrecords': 10, 'keywords': [], 'esn': 'brief', 'qtype': None}
2013-09-02 05:45:41,378 INFO [ckanext.spatial.harvesters.csw.CSW.gather] Got identifier 0d5c2a35-de5c-47ab-a845-71b5e395fc41 from the CSW
2013-09-02 05:45:41,378 INFO [ckanext.spatial.harvesters.csw.CSW.gather] Got identifier e2c97096-67d0-4734-aa62-c5410a992dec from the CSW
2013-09-02 05:45:41,378 INFO [ckanext.spatial.harvesters.csw.CSW.gather] Got identifier 78a7f613-9b70-404e-a6db-0b363b461e65 from the CSW
2013-09-02 05:45:41,379 INFO [ckanext.spatial.harvesters.csw.CSW.gather] Got identifier 296ac3df-b36a-483b-a300-4bfad18ec7be from the CSW
2013-09-02 05:45:41,379 INFO [ckanext.spatial.harvesters.csw.CSW.gather] Got identifier 1e290633-526d-45ca-b7ed-58525e4ca526 from the CSW
2013-09-02 05:45:41,379 INFO [ckanext.spatial.harvesters.csw.CSW.gather] Got identifier 2ac5f52e-76b5-429e-b6d8-933edcc3cbe3 from the CSW
2013-09-02 05:45:41,379 INFO [ckanext.spatial.harvesters.csw.CSW.gather] Got identifier 63d90615-4a9b-442a-9357-54a76092dd27 from the CSW
2013-09-02 05:45:41,379 INFO [ckanext.spatial.harvesters.csw.CSW.gather] Got identifier fb00f4ec-b1d2-453c-a3c0-7c60200548c3 from the CSW
2013-09-02 05:45:41,379 INFO [ckanext.spatial.harvesters.csw.CSW.gather] Got identifier f050094d-1a20-4b53-9059-42abca6dd09f from the CSW
2013-09-02 05:45:41,379 INFO [ckanext.spatial.harvesters.csw.CSW.gather] Got identifier 0b18dd5f-61a8-43e5-bd20-4600255dd63a from the CSW
...

2013-09-02 05:45:42,500 INFO [ckanext.spatial.harvesters.csw.CSW.gather] Got identifier 43fc7d6b-3e4b-4f80-b779-a81d40bfc2db from the CSW
2013-09-02 05:45:42,500 INFO [ckanext.spatial.lib.csw_client] Making CSW request: getrecords {'outputschema': 'http://www.isotc211.org/2005/gmd', 'startposition': 50, 'typenames': 'csw:Record', 'maxrecords': 10, 'keywords': [], 'esn': 'brief', 'qtype': None}
2013-09-02 05:45:42,675 INFO [ckanext.spatial.harvesters.csw.CSW.gather] Got identifier 5b81df53-d1cb-4eea-a199-2a6d07d1e43b from the CSW
2013-09-02 05:45:42,676 INFO [ckanext.spatial.harvesters.csw.CSW.gather] Got identifier c7e665be-dc43-40cf-af44-5af613cb31b1 from the CSW
2013-09-02 05:45:42,676 INFO [ckanext.spatial.harvesters.csw.CSW.gather] Got identifier b66d436f-3619-412a-b1e3-7192240baa84 from the CSW
2013-09-02 05:45:42,676 INFO [ckanext.spatial.harvesters.csw.CSW.gather] Got identifier b4776ba2-7e65-49f7-8d02-9b6332509130 from the CSW
2013-09-02 05:45:43,256 DEBUG [ckanext.harvest.queue] Received from plugin gather_stage: 53 objects (first: [u'01158b51-0c45-487d-a6af-74b43ed1b189'] last: [u'0fad11e6-2834-4a5c-8117-89d95cece629'])
2013-09-02 05:45:43,272 DEBUG [ckanext.harvest.queue] Sent 53 objects to the fetch queue
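The log above shows the gather stage paging through the catalogue: each GetRecords request uses maxrecords 10, and the startposition values seen are 0 and 50. The following Python 3 sketch shows the start positions such paging would produce for 53 records, assuming the client simply steps startposition by the page size until the total is covered. The page_starts function is hypothetical, and the actual stopping rule in ckanext-spatial may differ.

```python
def page_starts(total, page_size):
    """Start positions needed to cover `total` records in pages of `page_size`."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return list(range(0, total, page_size))


starts = page_starts(53, 10)
print(starts)       # [0, 10, 20, 30, 40, 50]
print(len(starts))  # 6 requests
```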

10. Inspect the log on the second terminal, where you started the fetching queue consumer. This log is much more verbose than the previous two, as it contains the fetched metadata as well as validation and import messages. If the fetching and importing processes are working, you should see messages like the following for each imported metadata record. Remember that in CKANSpatialExtensionsSetupGuide we set ckanext.spatial.harvest.continue_on_validation_errors to True. With this setting, metadata records that fail the harvester's validation are still imported into the CKAN database.

2013-09-02 05:46:57,729 INFO [ckanext.spatial.lib.csw_client] Making CSW request: getrecordbyid [u'0b18dd5f-61a8-43e5-bd20-4600255dd63a'] {'esn': 'full', 'outputschema': 'http://www.isotc211.org/2005/gmd'}
2013-09-02 05:46:57,928 DEBUG [ckanext.spatial.harvesters.csw.CSW.fetch] XML content saved (len 21724)
2013-09-02 05:46:57,942 DEBUG [ckanext.spatial.harvesters.base.import] Import stage for harvest object: 0fad11e6-2834-4a5c-8117-89d95cece629
2013-09-02 05:46:57,951 DEBUG [ckanext.spatial.validation.validation] Starting validation against profile(s) iso19139
2013-09-02 05:46:57,996 INFO [ckanext.spatial.validation.validation] Validation errors found using schema Dataset schema (gmx.xsd)
2013-09-02 05:46:58,001 INFO [ckanext.spatial.validation.validation] Validating against "ISO19139 XSD Schema" profile failed
2013-09-02 05:46:58,002 DEBUG [ckanext.spatial.validation.validation] [('Dataset schema (gmx.xsd) Validation Error', None), (u"Element '{http://www.isotc211.org/2005/gco}DateTime': '' is not a valid value of the atomic type 'xs:dateTime'.", 93), (u"Element '{http://www.isotc211.org/2005/gco}Integer': '' is not a valid value of the atomic type 'xs:integer'.", 297), (u"Element '{http://www.isotc211.org/2005/gmd}EX_GeographicBoundingBox': This element is not expected.", 329)]
2013-09-02 05:46:58,002 ERROR [ckanext.spatial.harvesters.base] Validation errors found using profile iso19139 for object with GUID 0b18dd5f-61a8-43e5-bd20-4600255dd63a
2013-09-02 05:46:58,007 DEBUG [ckanext.harvest.harvesters.base] Dataset schema (gmx.xsd) Validation Error
2013-09-02 05:46:58,014 DEBUG [ckanext.harvest.harvesters.base] Element '{http://www.isotc211.org/2005/gco}DateTime': '' is not a valid value of the atomic type 'xs:dateTime'., line 93
2013-09-02 05:46:58,021 DEBUG [ckanext.harvest.harvesters.base] Element '{http://www.isotc211.org/2005/gco}Integer': '' is not a valid value of the atomic type 'xs:integer'., line 297
2013-09-02 05:46:58,029 DEBUG [ckanext.harvest.harvesters.base] Element '{http://www.isotc211.org/2005/gmd}EX_GeographicBoundingBox': This element is not expected., line 329
2013-09-02 05:46:58,032 WARNI [ckanext.spatial.model.harvested_metadata] Value not found for element 'value'
2013-09-02 05:46:58,039 WARNI [ckanext.spatial.model.harvested_metadata] Value not found for element 'url'
2013-09-02 05:46:58,039 WARNI [ckanext.spatial.model.harvested_metadata] Value not found for element 'file'
2013-09-02 05:46:58,040 WARNI [ckanext.spatial.model.harvested_metadata] Value not found for element 'file'
2013-09-02 05:46:58,224 DEBUG [ckanext.spatial.plugin] Received: '{"type": "Polygon", "coordinates": [[[133.5833333, -33.58333333], [137.8333333, -33.58333333], [137.8333333, -31.0], [133.5833333, -31.0], [133.5833333, -33.58333333]]]}'
2013-09-02 05:46:58,226 DEBUG [ckanext.spatial.lib] Created new extent for package 95aaa7a6-3155-4065-9a9e-fbb8614a9035
2013-09-02 05:46:58,897 INFO [ckanext.spatial.harvesters.base.import] Created new package 95aaa7a6-3155-4065-9a9e-fbb8614a9035 with guid 0b18dd5f-61a8-43e5-bd20-4600255dd63a
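The "Received" line in the log above carries the spatial extent of the record as a GeoJSON polygon. As a quick sanity check, you can turn such a polygon back into a west/south/east/north bounding box. This is a Python 3 sketch that assumes a simple polygon whose first ring is the exterior ring; bounding_box is a hypothetical helper, not part of ckanext-spatial.

```python
import json

# GeoJSON geometry copied from the log line above.
geometry = json.loads(
    '{"type": "Polygon", "coordinates": [[[133.5833333, -33.58333333], '
    '[137.8333333, -33.58333333], [137.8333333, -31.0], '
    '[133.5833333, -31.0], [133.5833333, -33.58333333]]]}'
)


def bounding_box(polygon):
    """Return (west, south, east, north) of a GeoJSON polygon's exterior ring."""
    ring = polygon["coordinates"][0]
    lons = [point[0] for point in ring]
    lats = [point[1] for point in ring]
    return min(lons), min(lats), max(lons), max(lats)


west, south, east, north = bounding_box(geometry)
print(west, south, east, north)
```

GeoJSON stores coordinates as longitude first, then latitude, which is why the first element of each point is treated as the east-west value.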

11. You can see all successfully harvested datasets on CKAN's datasets page (replace localhost with your server address):

http://localhost/dataset

If your csw_harvester plugin successfully harvested the metadata from our harvest source, the CKAN Datasets page should report 53 datasets found:

ckan-datasets-page.png

Harvest metadata from CKAN into GeoNetwork's catalogue

In this section, we walk you through harvesting metadata from your CKAN instance into a GeoNetwork instance via the cswserver plugin of the CKAN Geospatial Extension. We assume that you are familiar with GeoNetwork, already have a GeoNetwork instance installed, and have admin access to perform admin-related tasks such as setting up a harvest job.

Note: We'd like to thank and acknowledge Jose Garcia from the GeoNetwork User Community for his help with the issue we encountered while harvesting metadata from the CKAN CSW service. See http://osgeo-org.1560.x6.nabble.com/CKAN-csw-Metadata-Harvesting-td5073100.html for further details.

1. Log in as admin to your GeoNetwork instance and go to the Harvesting Management page as shown below:

geonetwork-harvest-page-1.PNG

2. Add a new harvest job of type "Catalogue Services for the Web ISO Profile 2.0":

geonetwork-harvest-page-2.PNG

3. Configure the new job on the harvesting management page. Enter the following Service URL (replace localhost with your server address) and save the job:

http://localhost/csw?request=GetCapabilities&service=CSW&version=2.0.2

geonetwork-harvest-page-3.PNG
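If you set up several GeoNetwork harvest jobs against different CKAN servers, you can generate the Service URL instead of editing it by hand. This is a Python 3 sketch that rebuilds the URL shown above from a host name; csw_capabilities_url is a hypothetical helper, and the /csw path and the http scheme are taken from this guide's example rather than from any guaranteed default.

```python
from urllib.parse import urlencode


def csw_capabilities_url(host):
    """Build the CSW GetCapabilities URL for the given CKAN host."""
    query = urlencode(
        {"request": "GetCapabilities", "service": "CSW", "version": "2.0.2"}
    )
    return "http://{}/csw?{}".format(host, query)


print(csw_capabilities_url("localhost"))
# http://localhost/csw?request=GetCapabilities&service=CSW&version=2.0.2
```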

4. Activate and run the newly created harvest job:

geonetwork-harvest-page-4.PNG

5. To check the progress or status of the harvesting process, press the Refresh button and hover your mouse over the tick icon below the Errors column heading for a few seconds. You should see a small popup window showing that 53 metadata records have been added to your GeoNetwork instance, as in the screenshot below:

geonetwork-harvest-page-5.PNG

Troubleshooting

1. Gather stage failed when running a harvest job.

You will get the following error messages if you run the harvest job command with release 2.0 of the ckanext-harvest extension:

2013-09-02 01:12:21,759 DEBUG [ckanext.spatial.harvesters.csw.CSW.gather] Starting gathering for http://mdu-data.arrc.csiro.au/geonetwork/srv/en/csw
2013-09-02 01:12:21,759 INFO [ckanext.spatial.lib.csw_client] Making CSW request: getrecords {'outputschema': 'http://www.isotc211.org/2005/gmd', 'startposition': 0, 'typenames': 'csw:Record', 'maxrecords': 10, 'keywords': [], 'esn': 'brief', 'qtype': None}
2013-09-02 01:12:21,764 ERROR [ckanext.spatial.harvesters.csw.CSW.gather] Exception: Traceback (most recent call last):
  File "/usr/lib/ckan/default/src/ckanext-spatial/ckanext/spatial/harvesters/csw.py", line 90, in gather_stage
    for identifier in self.csw.getidentifiers(page=10):
  File "/usr/lib/ckan/default/src/ckanext-spatial/ckanext/spatial/lib/csw_client.py", line 110, in getidentifiers
    csw.getrecords(**kwa)
  File "/usr/lib/ckan/default/src/owslib/owslib/csw.py", line 188, in getrecords
    in a future version of OWSLib.""")
DeprecationWarning: Please use the updated 'getrecords2' method instead of 'getrecords'.
The 'getrecords' method will be upgraded to use the 'getrecords2' parameters
in a future version of OWSLib.
2013-09-02 01:12:21,771 ERROR [ckanext.harvest.harvesters.base] Error gathering the identifiers from the CSW server
[Please use the updated 'getrecords2' method instead of 'getrecords'.
The 'getrecords' method will be upgraded to use the 'getrecords2' parameters
in a future version of OWSLib.]
2013-09-02 01:12:21,774 ERROR [ckanext.harvest.queue] Gather stage failed

The workaround (not a fix) for the above error is to comment out the lines (186-188) that raise the DeprecationWarning exception in the following Python source file (/usr/lib/ckan/default/src/owslib/owslib/csw.py):

#raise DeprecationWarning("""Please use the updated 'getrecords2' method instead of 'getrecords'.
#The 'getrecords' method will be upgraded to use the 'getrecords2' parameters
#in a future version of OWSLib.""")
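The reason this workaround has an effect is that the warning is raised as an exception, which aborts the call, rather than issued through Python's warnings module, which lets the call continue. The following Python 3 sketch illustrates the difference with two stand-in functions. They are illustrative only and are not OWSLib code.

```python
import warnings


def getrecords_raising():
    """Stand-in that raises the warning class as an exception."""
    raise DeprecationWarning("Please use 'getrecords2' instead of 'getrecords'.")
    return ["record"]  # never reached


def getrecords_warning():
    """Stand-in that only issues the warning and then carries on."""
    warnings.warn(
        "Please use 'getrecords2' instead of 'getrecords'.", DeprecationWarning
    )
    return ["record"]


# Raising: the caller gets an exception and no records.
try:
    getrecords_raising()
    raised = False
except DeprecationWarning:
    raised = True

# Warning: the caller still gets its records; the warning is merely recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = getrecords_warning()

print(raised, result, len(caught))
```

Commenting out the raise statement makes the library behave like the second function minus the warning, which is why it gets the harvest going but should not be mistaken for a proper fix.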

 
Topic revision: r11 - 10 Feb 2014, RichardGoh
 

Current license: All material on this collaboration platform is licensed under a Creative Commons Attribution 3.0 Australia Licence (CC BY 3.0).