Harvesting resources from remote services

Harvesting is the process by which a metadata catalogue, such as GeoNode, connects to remote catalogues and retrieves information about their resources. This process is usually performed periodically to keep the local catalogue in sync with the remote one.

When it is appropriately configured, GeoNode will contact the remote service, extract a list of relevant resources that can be harvested, and then create local resources for each remote resource. It will also keep the resources synchronized with the remote service by updating them periodically.

To explore the more advanced harvester features, we will mainly use the Django Administration Dashboard to create harvesters for remote services and then import their relevant resources into our local GeoNode.

Harvesting workflows

There are two main possible harvesting workflows:

Continuous harvesting
One-time harvesting

Continuous harvesting

This workflow relies on the harvesting scheduler to ensure that harvested resources are continuously kept up to date with their remote counterparts.

When the time comes, the harvesting scheduler calls the update list of harvestable resources operation. Alternatively, the user may call this operation manually the first time.

When the previous operation is done, the user goes through the list of generated harvestable resources and, for each relevant harvestable resource, sets its should_be_harvested attribute to True. Alternatively, if the harvester has its harvest_new_resources_automatically attribute set to True, the harvestable resources will already be marked as to be harvested, without requiring manual user intervention.

When the time comes, the harvesting scheduler calls the perform harvesting operation. This causes the remote resources to be harvested. These now show up as resources on the local GeoNode.
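The interplay between should_be_harvested and harvest_new_resources_automatically can be illustrated with a small stand-alone model. This is only a sketch: the classes ToyHarvester and ToyHarvestableResource, their methods, and the resource titles are hypothetical and are not part of GeoNode; only the two attribute names come from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyHarvestableResource:
    # Hypothetical stand-in for a harvestable resource entry.
    title: str
    should_be_harvested: bool = False


@dataclass
class ToyHarvester:
    # Hypothetical stand-in for a harvester; not the GeoNode model.
    harvest_new_resources_automatically: bool = False
    harvestable_resources: list = field(default_factory=list)

    def update_harvestable_resources(self, remote_titles):
        """Toy version of the 'update list of harvestable resources' operation."""
        known = {r.title for r in self.harvestable_resources}
        for title in remote_titles:
            if title not in known:
                # Newly discovered resources inherit the harvester-level default.
                self.harvestable_resources.append(
                    ToyHarvestableResource(
                        title,
                        should_be_harvested=self.harvest_new_resources_automatically,
                    )
                )

    def perform_harvesting(self):
        """Toy version of 'perform harvesting': return the titles that get harvested."""
        return [r.title for r in self.harvestable_resources if r.should_be_harvested]


# Manual selection: the user marks only the relevant resource.
manual = ToyHarvester()
manual.update_harvestable_resources(["roads", "rivers"])
manual.harvestable_resources[0].should_be_harvested = True
harvested_manual = manual.perform_harvesting()

# Automatic selection: every discovered resource is already marked.
automatic = ToyHarvester(harvest_new_resources_automatically=True)
automatic.update_harvestable_resources(["roads", "rivers"])
harvested_automatic = automatic.perform_harvesting()
```

In this model the manual harvester harvests only the resource the user marked, while the automatic one harvests everything it discovered.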

One-time harvesting

This workflow is mostly manually executed by the user.

The user creates a harvester and sets its scheduling_enabled attribute to False;

The user calls the update list of harvestable resources operation;

When the previous operation is complete, the user goes through the list of generated harvestable resources and, for each relevant harvestable resource, sets its should_be_harvested attribute to True;

The user then proceeds to call the perform harvesting operation. This causes the remote resources to be harvested. These now show up as resources on the local GeoNode.

Standard harvester workers

Note

Remember that, as stated above, a harvester worker is configured by means of setting the harvester_type and harvester_type_specific_configuration attributes on the harvester.

Moreover, the format of the harvester_type_specific_configuration attribute must be a JSON object.
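Before pasting a value into the harvester_type_specific_configuration field, it can be useful to check that it really is a JSON object and not, for example, a JSON list or a bare string. The helper below is a hypothetical convenience written with the Python standard library only; it is not a GeoNode function.

```python
import json


def is_json_object(text):
    """Return True only if text parses as JSON and the top-level value is an object."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        return False
    # A JSON object is decoded to a Python dict.
    return isinstance(value, dict)


print(is_json_object('{"harvest_datasets": true}'))
print(is_json_object('["harvest_datasets"]'))
print(is_json_object("harvest_datasets=true"))
```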

GeoNode harvester worker

This worker can harvest remote GeoNode deployments. In addition to creating local resources by retrieving the remote metadata, this harvester can also copy remote datasets over to the local GeoNode. This means that this harvester can even be used to generate replicated GeoNode instances.

This harvester can be used by setting harvester_type=geonode.harvesting.harvesters.geonodeharvester.GeonodeUnifiedHarvesterWorker in the harvester configuration.

It recognizes the following harvester_type_specific_configuration parameters:

harvest_datasets: Whether to harvest remote resources of the dataset type or not. Acceptable values: true (the default) or false.

copy_datasets: Whether to copy remote resources of the dataset type over to the local GeoNode. Acceptable values: true or false (the default).

harvest_documents: Whether to harvest remote resources of the document type or not. Acceptable values: true (the default) or false.

copy_documents: Whether to copy remote resources of the document type over to the local GeoNode. Acceptable values: true or false (the default).

resource_title_filter: A string that must be present in the remote resources’ title in order for them to be acknowledged as harvestable resources. This allows for the filtering out of resources that are not relevant. Acceptable values: any alphanumeric value.

start_date_filter: A string specifying a datetime that is used to filter out resources by their start_date. This is parsed with dateutil.parser.parse(), which means that it accepts multiple different formats (e.g. 2021-06-30T13:04:05Z).

end_date_filter: Similar to start_date_filter but uses the resources’ end_date as a filter parameter.

keywords_filter: A list of keywords that are used to filter remote resources.

categories_filter: A list of categories that are used to filter remote resources.
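As an illustration, the parameters above can be assembled in Python and serialized to the JSON object expected by harvester_type_specific_configuration. The parameter names come from the list above; the values chosen here (the title filter, date, and keyword) are made-up examples, not recommendations.

```python
import json

# Example values only; the keys are the parameters documented above.
geonode_worker_config = {
    "harvest_datasets": True,
    "copy_datasets": False,
    "harvest_documents": False,
    "resource_title_filter": "roads",
    "start_date_filter": "2021-06-01T00:00:00Z",
    "keywords_filter": ["transport"],
}

# json.dumps converts Python True/False into the JSON literals true/false.
config_json = json.dumps(geonode_worker_config, indent=2)
print(config_json)
```

The printed text can then be pasted into the configuration field of the harvester form.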

Creating scheduled harvesters

Following the Continuous harvesting workflow, sign in as admin and click on Admin to be redirected to the Django Administration Dashboard.

Search for the Harvesters tab and click on Harvesters.

Click on Add harvester to create a new harvester.

Create the harvester with the following attributes and then click Save.

Name: GeoNode harvester test
Remote url: https://summit2020.cartoview.net/
Harvester type: geonode.harvesting.harvesters.geonodeharvester.GeonodeUnifiedHarvesterWorker

After successful creation, you will be redirected to the list of available harvesters, which includes your newly created harvester.

GeoNode will create an Asynchronous harvesting session for the harvester. To view it, browse through the administration using the path Home > Harvesting > Asynchronous harvesting session.

Once the session of the discover-harvestable-resources type is complete, you can view the available harvestable resources via the path Home > Harvesting > Harvestable resources.

As previously explained, since we did not set harvest_new_resources_automatically to True on our harvester, the attribute should_be_harvested of the listed resources is set to False.

We can manually set this attribute to True for the resources that we want, so that when the time comes and the harvesting scheduler calls the perform harvesting operation, they are harvested.

The harvester will now show that the selected resources are scheduled for harvesting.

Click on the Go icon to view the list.

You can also harvest resources manually through an action. For example, select a resource from the list, set the action to Harvest selected resources, then click Go.

Note that an asynchronous harvesting session for the selected resource is created with:

Session type: harvesting
Status: on-going (which will be updated to finished-all-ok after it’s successfully harvested)
Total records to process: 1

Search for the resource in GeoNode and verify that it’s been created.

WMS harvester worker

This worker can harvest resources from remote OGC WMS servers.

This harvester can be used by setting harvester_type=geonode.harvesting.harvesters.wms.OgcWmsHarvester in the harvester configuration.

It recognizes the following harvester_type_specific_configuration parameters:

dataset_title_filter: A string that is used to filter remote WMS layers by their title property. If a remote layer’s title contains the string defined by this parameter, then the layer is recognized by the harvester worker.
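The effect of dataset_title_filter can be pictured as a substring test on each layer title. The function below is a hypothetical sketch, not the worker's actual code; in particular, it assumes a case-sensitive match and that an empty filter keeps every layer, neither of which is confirmed by the text above.

```python
def filter_layers_by_title(layer_titles, dataset_title_filter=None):
    """Keep only the titles that contain the filter string (assumed case-sensitive)."""
    if not dataset_title_filter:
        # Assumption: with no filter configured, every layer is kept.
        return list(layer_titles)
    return [title for title in layer_titles if dataset_title_filter in title]


titles = ["Road Segments", "Rail Lines", "Airport Runways"]
print(filter_layers_by_title(titles, "Ra"))
print(filter_layers_by_title(titles))
```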

Creating unscheduled harvesters using Remote Services

Through Remote Services, GeoNode provides a simpler way of creating unscheduled harvesters. A user can perform the actions explained in the One-time harvesting workflow using GeoNode web pages.

Let’s create a remote service through GeoNode by clicking on Data > Remote Services > Add Remote Service.

Fill the form with the following attributes and click Create:

Service URL: https://carto.nationalmap.gov/arcgis/services/transportation/MapServer/WMSServer?request=GetCapabilities&service=WMS
Service Type: Web Map Service
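The Service URL above is a WMS GetCapabilities request. Splitting it with the Python standard library shows the endpoint and the query parameters it carries; this is just a way of inspecting the URL and says nothing about how GeoNode processes it internally.

```python
from urllib.parse import parse_qs, urlsplit

service_url = (
    "https://carto.nationalmap.gov/arcgis/services/transportation/"
    "MapServer/WMSServer?request=GetCapabilities&service=WMS"
)

parts = urlsplit(service_url)
# parse_qs returns each parameter as a list of values.
params = parse_qs(parts.query)
base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"

print(base_url)
print(params)
```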

The service will be created and you will be redirected to the page where you can select resources to harvest into GeoNode.

By default, GeoNode will create a harvester with the values set as shown below.

After the harvester has loaded the available resources that can be imported, select the resources you want to import and click on Import Resources.

When you check the GeoNode resources page, the imported resources will be available.