Harvesting resources from remote services
Harvesting is the process by which a metadata catalogue, such as GeoNode, connects to remote catalogues and retrieves information about their resources. This process is usually performed periodically to keep the local catalogue in sync with the remote one.
When it is appropriately configured, GeoNode will contact the remote service, extract a list of relevant resources that can be harvested, and then create local resources for each remote resource. It will also keep the resources synchronized with the remote service by updating them periodically.
To explore the more advanced harvester features, we will mainly use the Django Administration Dashboard to create harvesters for remote services and then import their relevant resources into our local GeoNode.
Harvesting workflows
There are two main possible harvesting workflows:
Continuous harvesting
This workflow relies on the harvesting scheduler to ensure that harvested resources are continuously kept up to date with their remote counterparts.
When the time comes, the harvesting scheduler calls the update list of harvestable resources operation. Alternatively, the user may call this operation manually the first time.
When the previous operation is done, the user goes through the list of generated harvestable resources and, for each relevant harvestable resource, sets its should_be_harvested attribute to True. Alternatively, if the harvester has its harvest_new_resources_automatically attribute set to True, the harvestable resources will already be marked as to be harvested, without requiring manual user intervention.
When the time comes, the harvesting scheduler calls the perform harvesting operation. This causes the remote resources to be harvested. These now show up as resources on the local GeoNode.
One-time harvesting
This workflow is mostly manually executed by the user.
The user creates a harvester and sets its scheduling_enabled attribute to False;
The user calls the update list of harvestable resources operation;
When the previous operation is complete, the user goes through the list of generated harvestable resources and, for each relevant harvestable resource, sets its should_be_harvested attribute to True;
The user then proceeds to call the perform harvesting operation. This causes the remote resources to be harvested. These now show up as resources on the local GeoNode.
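The two workflows differ mainly in who triggers each operation. The following is a minimal sketch of that logic as a toy in-memory model. It is not GeoNode's implementation: the names ToyHarvester, ToyHarvestableResource, update_harvestable_resources(), perform_harvesting(), one_time_harvest() and scheduled_tick() are hypothetical, and only the attribute names scheduling_enabled, harvest_new_resources_automatically and should_be_harvested are taken from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyHarvestableResource:
    """Hypothetical stand-in for one entry in the list of harvestable resources."""
    title: str
    should_be_harvested: bool = False


@dataclass
class ToyHarvester:
    """Hypothetical in-memory model of a harvester; not GeoNode's real class."""
    remote_titles: list
    scheduling_enabled: bool = True
    harvest_new_resources_automatically: bool = False
    harvestable: list = field(default_factory=list)
    local_resources: list = field(default_factory=list)

    def update_harvestable_resources(self):
        """Model of the 'update list of harvestable resources' operation."""
        known = {r.title for r in self.harvestable}
        for title in self.remote_titles:
            if title not in known:
                # New entries inherit the harvester-level automatic flag.
                self.harvestable.append(
                    ToyHarvestableResource(
                        title,
                        should_be_harvested=self.harvest_new_resources_automatically,
                    )
                )

    def perform_harvesting(self):
        """Model of 'perform harvesting': only marked resources become local."""
        for resource in self.harvestable:
            if resource.should_be_harvested and resource.title not in self.local_resources:
                self.local_resources.append(resource.title)


def one_time_harvest(remote_titles, wanted):
    """One-time workflow: scheduling is off and the user drives every step."""
    harvester = ToyHarvester(remote_titles, scheduling_enabled=False)
    harvester.update_harvestable_resources()
    for resource in harvester.harvestable:
        if resource.title in wanted:
            resource.should_be_harvested = True
    harvester.perform_harvesting()
    return harvester.local_resources


def scheduled_tick(harvester):
    """Continuous workflow, simplified: one scheduler run in this toy model.

    Assumption: both operations run back to back here; a real scheduler may
    run them at different times, leaving room for manual review in between.
    """
    if harvester.scheduling_enabled:
        harvester.update_harvestable_resources()
        harvester.perform_harvesting()
    return harvester.local_resources
```

In this sketch, a harvester created with harvest_new_resources_automatically=True picks up every remote title on its first scheduled run, whereas one left at the default harvests nothing until someone flips should_be_harvested on individual entries.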
Standard harvester workers
Note
Remember that, as stated above, a harvester worker is configured by means of setting the harvester_type and harvester_type_specific_configuration attributes on the harvester.
Moreover, the format of the harvester_type_specific_configuration attribute must be a JSON object.
GeoNode harvester worker
This worker can harvest remote GeoNode deployments. In addition to creating local resources by retrieving the remote metadata, this harvester can also copy remote datasets over to the local GeoNode. This means that this harvester can even be used to generate replicated GeoNode instances.
This harvester can be used by setting harvester_type=geonode.harvesting.harvesters.geonodeharvester.GeonodeUnifiedHarvesterWorker in the harvester configuration.
It recognizes the following harvester_type_specific_configuration parameters:
harvest_datasets: Whether to harvest remote resources of the dataset type or not. Acceptable values: true (the default) or false.
copy_datasets: Whether to copy remote resources of the dataset type over to the local GeoNode. Acceptable values: true or false (the default).
harvest_documents: Whether to harvest remote resources of the document type or not. Acceptable values: true (the default) or false.
copy_documents: Whether to copy remote resources of the document type over to the local GeoNode. Acceptable values: true or false (the default).
resource_title_filter: A string that must be present in the remote resources’ title in order for them to be acknowledged as harvestable resources. This allows for the filtering out of resources that are not relevant. Acceptable values: any alphanumeric value.
start_date_filter: A string specifying a datetime that is used to filter out resources by their start_date. This is parsed with dateutil.parser.parse(), which means that it accepts multiple different formats (e.g. 2021-06-30T13:04:05Z).
end_date_filter: Similar to start_date_filter but uses the resources’ end_date as a filter parameter.
keywords_filter: A list of keywords that are used to filter remote resources.
categories_filter: A list of categories that are used to filter remote resources.
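Because harvester_type_specific_configuration must be a JSON object, it can help to build it programmatically and paste the result into the harvester form. The sketch below assembles such an object from the parameters listed above. The helper name build_geonode_harvester_config() and its local type checks are hypothetical and are not part of GeoNode; the parameter names and defaults are the ones described in this section.

```python
import json

# Defaults as described in the parameter list above.
DEFAULTS = {
    "harvest_datasets": True,
    "copy_datasets": False,
    "harvest_documents": True,
    "copy_documents": False,
}

BOOLEAN_KEYS = set(DEFAULTS)
STRING_KEYS = {"resource_title_filter", "start_date_filter", "end_date_filter"}
LIST_KEYS = {"keywords_filter", "categories_filter"}


def build_geonode_harvester_config(**overrides):
    """Return a JSON string usable as harvester_type_specific_configuration.

    Hypothetical helper: it only checks parameter names and basic Python
    types locally; it does not contact GeoNode or validate date formats.
    """
    for key, value in overrides.items():
        if key in BOOLEAN_KEYS:
            expected = bool
        elif key in STRING_KEYS:
            expected = str
        elif key in LIST_KEYS:
            expected = list
        else:
            raise KeyError(f"unknown parameter: {key}")
        if not isinstance(value, expected):
            raise TypeError(f"{key} must be of type {expected.__name__}")
    # Defaults are written out explicitly so the stored JSON is self-describing.
    return json.dumps({**DEFAULTS, **overrides}, indent=2)


if __name__ == "__main__":
    print(
        build_geonode_harvester_config(
            copy_datasets=True,
            resource_title_filter="roads",
            keywords_filter=["transport"],
        )
    )
```

Note that json.dumps() writes Python's True and False as the lowercase true and false that the parameter descriptions call for.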
Creating scheduled harvesters
Following the Continuous harvesting workflow, sign in as admin and click on Admin to be redirected to the Django Administration Dashboard.
Find the Harvesters tab and click on Harvesters.
Click on Add harvester to create a new harvester.
Create the harvester with the following attributes and then click Save.
Name: GeoNode harvester test
Remote url: https://summit2020.cartoview.net/
Harvester type: geonode.harvesting.harvesters.geonodeharvester.GeonodeUnifiedHarvesterWorker
After successful creation, you will be redirected to the list of available harvesters, which includes your newly created harvester.
GeoNode will create an Asynchronous harvesting session for the harvester. To view it, browse through the administration using the path Home > Harvesting > Asynchronous harvesting session.
Once the session of the discover-harvestable-resources type is complete, you can view the available harvestable resources via the path Home > Harvesting > Harvestable resources.
As previously explained, since we did not set harvest_new_resources_automatically to True on our harvester, the attribute should_be_harvested of the listed resources is set to False.
We can manually set this attribute to True for the resources that we want, so that when the time comes and the harvesting scheduler calls the perform harvesting operation, they are harvested.
The harvester will now show that the selected resources are scheduled for harvesting.
Click on the Go icon to view the list.
You can also harvest resources manually through an action. For example, select a resource from the list, set the action to Harvest selected resources, then click Go.
- Note that an asynchronous harvesting session for the selected resource is created with:
  - Session type: harvesting
  - Status: on-going (which will be updated to finished-all-ok after it’s successfully harvested)
  - Total records to process: 1
Search for the resource in GeoNode and verify that it’s been created.
WMS harvester worker
This worker can harvest resources from remote OGC WMS servers.
This harvester can be used by setting harvester_type=geonode.harvesting.harvesters.wms.OgcWmsHarvester in the harvester configuration.
It recognizes the following harvester_type_specific_configuration parameters:
dataset_title_filter: A string that is used to filter remote WMS layers by their title property. If a remote layer’s title contains the string defined by this parameter, then the layer is recognized by the harvester worker.
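The "title contains the string" rule can be illustrated with a short sketch. The function layer_matches() below is hypothetical and only mimics the described behaviour; in particular, the case-sensitive substring comparison is an assumption, and the actual worker may match differently. The last line shows how the corresponding harvester_type_specific_configuration JSON object could be produced.

```python
import json


def layer_matches(title, dataset_title_filter=None):
    """Return True when the layer title contains the filter string.

    Assumption: a plain, case-sensitive substring test. An empty or missing
    filter lets every layer through.
    """
    if not dataset_title_filter:
        return True
    return dataset_title_filter in title


# Example JSON object for the WMS worker, using the one documented parameter.
wms_config = json.dumps({"dataset_title_filter": "Road"})
```

With this filter, a layer titled "Road Segments" would be recognized, while one titled "Airports" would be skipped.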
Creating unscheduled harvesters using Remote Services
Through Remote Services, GeoNode provides a simpler way of creating unscheduled harvesters: a user can perform the actions described in the One-time harvesting workflow directly from the GeoNode web pages.
Let’s create a remote service through GeoNode by clicking on Data > Remote Services > Add Remote Service.
- Fill the form with the following attributes and click Create:
  - Service URL: https://carto.nationalmap.gov/arcgis/services/transportation/MapServer/WMSServer?request=GetCapabilities&service=WMS
  - Service Type: Web Map Service
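Before submitting the form, it can be useful to confirm that the Service URL really is a WMS GetCapabilities request. The sketch below takes the example URL apart with Python's standard urllib.parse module. The helper describe_service_url() is hypothetical and purely illustrative; it performs no network request and is not something GeoNode requires.

```python
from urllib.parse import parse_qs, urlsplit

SERVICE_URL = (
    "https://carto.nationalmap.gov/arcgis/services/transportation/"
    "MapServer/WMSServer?request=GetCapabilities&service=WMS"
)


def describe_service_url(url):
    """Split a service URL into host, path and the two WMS query parameters."""
    parts = urlsplit(url)
    # Lower-case the keys so that REQUEST=... and request=... are treated alike.
    query = {key.lower(): values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "host": parts.netloc,
        "path": parts.path,
        "service": query.get("service"),
        "request": query.get("request"),
    }


if __name__ == "__main__":
    for name, value in describe_service_url(SERVICE_URL).items():
        print(f"{name}: {value}")
```

A service value of WMS together with a request value of GetCapabilities matches the Web Map Service type selected in the form.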
The service will be created and you will be redirected to the page where you can select resources to harvest into GeoNode.
By default, GeoNode will create a harvester with the values set as shown below.
After the harvester has loaded the available resources that can be imported, select the resources you want to import and click on Import Resources.
When you check the GeoNode resources page, the imported resources will be available.