Harvesting resources from remote services
Harvesting is the process by which a metadata catalogue, i.e. GeoNode, can connect to other remote catalogues and retrieve information about their resources. This process is usually performed periodically to keep the local catalogue in sync with the remote one.
When it is appropriately configured, GeoNode will contact the remote service, extract a list of relevant resources that can be harvested, and then create local resources for each remote resource. It will also keep the resources synchronized with the remote service by updating them periodically.
To explore the more advanced features of harvesters, we will mainly use the Django Administration Dashboard to create harvesters for remote services and then import their relevant resources into our local GeoNode.
Harvesting workflows
There are two main possible harvesting workflows:
Continuous harvesting
This workflow relies on the harvesting scheduler to ensure that harvested resources are continuously kept up to date with their remote counterparts.
When the time comes, the harvesting scheduler calls the update list of harvestable resources operation. Alternatively, the user may call this operation manually the first time.
When the previous operation is done, the user goes through the list of generated harvestable resources and, for each relevant harvestable resource, sets its should_be_harvested attribute to True. Alternatively, if the harvester has its harvest_new_resources_automatically attribute set to True, the harvestable resources will already be marked as to be harvested, without requiring manual user intervention.
When the time comes, the harvesting scheduler calls the perform harvesting operation. This causes the remote resources to be harvested. These now show up as resources on the local GeoNode.
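The interplay between the two operations and the two attributes can be sketched in plain Python. This is only an illustrative model: the classes and method names below are hypothetical stand-ins and are not GeoNode's own model classes or API. It shows how harvest_new_resources_automatically decides whether newly discovered resources start out marked for harvesting.

```python
from dataclasses import dataclass, field


@dataclass
class HarvestableResource:
    """Hypothetical stand-in for one entry in the list of harvestable resources."""
    title: str
    should_be_harvested: bool = False


@dataclass
class Harvester:
    """Hypothetical stand-in for a harvester; not GeoNode's model class."""
    harvest_new_resources_automatically: bool = False
    harvestable_resources: list = field(default_factory=list)

    def update_harvestable_resources(self, remote_titles):
        """Mimic the 'update list of harvestable resources' operation."""
        known = {resource.title for resource in self.harvestable_resources}
        for title in remote_titles:
            if title not in known:
                self.harvestable_resources.append(
                    HarvestableResource(
                        title,
                        should_be_harvested=self.harvest_new_resources_automatically,
                    )
                )

    def perform_harvesting(self):
        """Mimic the 'perform harvesting' operation: return the titles that get harvested."""
        return [r.title for r in self.harvestable_resources if r.should_be_harvested]


# Manual selection: nothing is harvested until the user flags a resource.
manual = Harvester()
manual.update_harvestable_resources(["roads", "rivers"])
manual.harvestable_resources[0].should_be_harvested = True
print(manual.perform_harvesting())  # ['roads']

# Automatic selection: every newly discovered resource is flagged on discovery.
automatic = Harvester(harvest_new_resources_automatically=True)
automatic.update_harvestable_resources(["roads", "rivers"])
print(automatic.perform_harvesting())  # ['roads', 'rivers']
```

In the real system the scheduler triggers both operations; here they are called by hand purely to make the order of events visible.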
One-time harvesting
This workflow is mostly manually executed by the user.
The user creates a harvester and sets its scheduling_enabled attribute to False;
The user calls the update list of harvestable resources operation;
When the previous operation is complete, the user goes through the list of generated harvestable resources and, for each relevant harvestable resource, sets its should_be_harvested attribute to True;
The user then proceeds to call the perform harvesting operation. This causes the remote resources to be harvested. These now show up as resources on the local GeoNode.
Standard harvester workers
Note
Remember that, as stated above, a harvester worker is configured by means of setting the harvester_type and harvester_type_specific_configuration attributes on the harvester.
Moreover, the format of the harvester_type_specific_configuration attribute must be a JSON object.
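As a quick way to check a configuration string before pasting it into the harvester form, the following sketch uses only Python's standard json module. The helper name parse_worker_configuration is hypothetical and is not part of GeoNode; it simply rejects any value that is valid JSON but not a JSON object (for example a list or a bare string).

```python
import json


def parse_worker_configuration(raw: str) -> dict:
    """Parse a configuration string and insist that it is a JSON object.

    Hypothetical helper for illustration; GeoNode performs its own validation.
    """
    value = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(value, dict):
        raise ValueError("harvester_type_specific_configuration must be a JSON object")
    return value


config = parse_worker_configuration('{"harvest_datasets": true, "copy_datasets": false}')
print(config)  # {'harvest_datasets': True, 'copy_datasets': False}
```

Note that JSON booleans are written in lowercase (true, false), which is why the acceptable values for the worker parameters are spelled that way.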
GeoNode harvester worker
This worker can harvest remote GeoNode deployments. In addition to creating local resources by retrieving the remote metadata, this harvester can also copy remote datasets over to the local GeoNode. This means that this harvester can even be used to generate replicated GeoNode instances.
This harvester can be used by setting harvester_type=geonode.harvesting.harvesters.geonodeharvester.GeonodeUnifiedHarvesterWorker in the harvester configuration.
It recognizes the following harvester_type_specific_configuration parameters:
- harvest_datasets: Whether to harvest remote resources of the dataset type or not. Acceptable values: true (the default) or false.
- copy_datasets: Whether to copy remote resources of the dataset type over to the local GeoNode. Acceptable values: true or false (the default).
- harvest_documents: Whether to harvest remote resources of the document type or not. Acceptable values: true (the default) or false.
- copy_documents: Whether to copy remote resources of the document type over to the local GeoNode. Acceptable values: true or false (the default).
- resource_title_filter: A string that must be present in the remote resources’ title in order for them to be acknowledged as harvestable resources. This allows for the filtering out of resources that are not relevant. Acceptable values: any alphanumeric value.
- start_date_filter: A string specifying a datetime that is used to filter out resources by their start_date. This is parsed with dateutil.parser.parse(), which means that it accepts multiple different formats (e.g. 2021-06-30T13:04:05Z).
- end_date_filter: Similar to start_date_filter but uses the resources’ end_date as a filter parameter.
- keywords_filter: A list of keywords that are used to filter remote resources.
- categories_filter: A list of categories that are used to filter remote resources.
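The parameters above can be assembled into a JSON object with a few lines of Python. This is a minimal sketch: the helper build_geonode_worker_configuration is hypothetical, the boolean defaults are the ones documented in the list, and treating the filters as unset by default and rejecting unknown keys are assumptions made here for illustration rather than documented GeoNode behaviour. The filter values are sample data.

```python
import json

# Boolean defaults as documented in the parameter list above.
DOCUMENTED_DEFAULTS = {
    "harvest_datasets": True,
    "copy_datasets": False,
    "harvest_documents": True,
    "copy_documents": False,
}

# Filter parameters; assumed here to be simply absent unless the user sets them.
FILTER_KEYS = {
    "resource_title_filter",
    "start_date_filter",
    "end_date_filter",
    "keywords_filter",
    "categories_filter",
}


def build_geonode_worker_configuration(**overrides) -> str:
    """Return a JSON string suitable for harvester_type_specific_configuration."""
    unknown = set(overrides) - set(DOCUMENTED_DEFAULTS) - FILTER_KEYS
    if unknown:
        raise ValueError(f"unrecognized parameters: {sorted(unknown)}")
    return json.dumps({**DOCUMENTED_DEFAULTS, **overrides}, sort_keys=True, indent=2)


configuration = build_geonode_worker_configuration(
    copy_datasets=True,                        # replicate dataset data, not just metadata
    resource_title_filter="roads",             # sample value
    start_date_filter="2021-06-30T13:04:05Z",  # sample value
    keywords_filter=["transport"],             # sample value
)
print(configuration)
```

The printed text can be pasted into the harvester's type-specific configuration field. Rejecting unknown keys early is a design choice that catches typos such as copy_dataset before they silently have no effect.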
Creating scheduled harvesters
Using the Continuous harvesting workflow example, sign in as admin and click on Admin to be redirected to the Django Administration Dashboard.
Search for the Harvesters tab, then click on Harvesters.
Click on Add harvester to create a new harvester.
Create the harvester with the following attributes and then click Save.
- Name: GeoNode harvester test
- Remote url: https://summit2020.cartoview.net/
- Harvester type: geonode.harvesting.harvesters.geonodeharvester.GeonodeUnifiedHarvesterWorker
After successful creation, you will be redirected to the list of available harvesters, which includes your newly created harvester.
GeoNode will create an Asynchronous harvesting session for the harvester. To view it, browse through the administration using the path Home > Harvesting > Asynchronous harvesting session.
Once the session of the discover-harvestable-resources type is complete, you can view the available harvestable resources via the path Home > Harvesting > Harvestable resources.
As previously explained, since we did not set harvest_new_resources_automatically to True on our harvester, the attribute should_be_harvested of the listed resources is set to False.
We can manually set this attribute to True for the resources that we want, so that, when the time comes, the harvesting scheduler calls the perform harvesting operation and they are harvested.
The harvester will now show that the selected resources are scheduled for harvesting.
Click on the Go icon to view the list.
You can manually harvest resources through an action. For example, select a resource from the list, set the action to Harvest selected resources, then click Go.
Note that an asynchronous harvesting session for the selected resource is created with:
- Session type: harvesting
- Status: on-going (which will be updated to finished-all-ok after it’s successfully harvested)
- Total records to process: 1
Search for the resource in GeoNode and verify that it’s been created.
WMS harvester worker
This worker can harvest resources from remote OGC WMS servers.
This harvester can be used by setting harvester_type=geonode.harvesting.harvesters.wms.OgcWmsHarvester in the harvester configuration.
It recognizes the following harvester_type_specific_configuration parameters:
- dataset_title_filter: A string that is used to filter remote WMS layers by their title property. If a remote layer’s title contains the string defined by this parameter, then the layer is recognized by the harvester worker.
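The filtering rule for dataset_title_filter amounts to a substring test on each layer title. The sketch below illustrates that rule with a hypothetical helper and made-up layer titles; it is not the worker's actual code. Two assumptions are made here that the text above does not settle: the match is treated as case-sensitive, and an empty or missing filter is treated as letting every layer through.

```python
from typing import Optional


def matches_title_filter(layer_title: str, dataset_title_filter: Optional[str]) -> bool:
    """Return True if the layer would be recognized under the given filter.

    Hypothetical helper: case-sensitive substring match, no filter means no restriction.
    """
    if not dataset_title_filter:
        return True
    return dataset_title_filter in layer_title


# Made-up titles standing in for the layers advertised by a remote WMS server.
layer_titles = ["Transportation - Roads", "Transportation - Airports", "Hydrography"]

recognized = [t for t in layer_titles if matches_title_filter(t, "Transportation")]
print(recognized)  # ['Transportation - Roads', 'Transportation - Airports']
```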
Creating unscheduled harvesters using Remote Services
Through Remote services, GeoNode provides a simpler way of creating unscheduled harvesters: a user can perform the actions explained in the One-time harvesting workflow using GeoNode web pages.
Let’s create a remote service through GeoNode by clicking on Data > Remote Services > Add Remote Service.
Fill the form with the following attributes and click Create:
- Service URL: https://carto.nationalmap.gov/arcgis/services/transportation/MapServer/WMSServer?request=GetCapabilities&service=WMS
- Service Type: Web Map Service
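The Service URL in this example is a WMS GetCapabilities request. The following sketch uses only Python's standard library to split it into the service endpoint and its request parameters, which can help when adapting the example to another server. Nothing here contacts the remote service.

```python
from urllib.parse import parse_qs, urlsplit

service_url = (
    "https://carto.nationalmap.gov/arcgis/services/transportation/MapServer/WMSServer"
    "?request=GetCapabilities&service=WMS"
)

parts = urlsplit(service_url)

# parse_qs returns a list per key; each key appears once here, so keep the first value.
query = {key: values[0] for key, values in parse_qs(parts.query).items()}

print(parts.netloc)  # carto.nationalmap.gov
print(parts.path)    # /arcgis/services/transportation/MapServer/WMSServer
print(query)         # {'request': 'GetCapabilities', 'service': 'WMS'}
```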
The service will be created and you will be redirected to the page where you can select resources to harvest into GeoNode.
By default, GeoNode will create a harvester with the values set as shown below.
After the harvester has loaded the available resources that can be imported, select the resources you want to import and click on Import Resources.
When you check the GeoNode resources page, the imported resources will be available.