=========================================
Harvesting resources from remote services
=========================================

Harvesting is the process by which a metadata catalogue, *i.e.* GeoNode, can connect to other remote catalogues and retrieve information about their resources. This process is usually performed periodically to keep the local catalogue in sync with the remote one.

When it is appropriately configured, GeoNode will contact the remote service, extract a list of relevant resources that can be harvested, and then create a local resource for each remote resource. It will also keep the resources synchronized with the remote service by updating them periodically.

To explore the more advanced features of harvesters, we will mainly use the ``Django Administration Dashboard`` to create harvesters for remote services and then import their relevant resources into our local GeoNode.

Harvesting workflows
====================

There are two main harvesting workflows:

#. :ref:`Continuous harvesting <continuous-harvesting-label>`
#. :ref:`One-time harvesting <one-time-harvesting-label>`

.. _continuous-harvesting-label:

Continuous harvesting
---------------------

This workflow relies on the harvesting scheduler to ensure that harvested resources are continuously kept up to date with their remote counterparts.

#. When the time comes, the harvesting scheduler calls the :ref:`update list of harvestable resources operation `. Alternatively, the user may call this operation manually the first time.

#. When the previous operation is done, the user goes through the list of generated :ref:`harvestable resources ` and, for each relevant harvestable resource, sets its ``should_be_harvested`` attribute to ``True``. Alternatively, if the harvester has its ``harvest_new_resources_automatically`` attribute set to ``True``, the harvestable resources will already be marked as ``to be harvested``, without requiring manual user intervention.

#. When the time comes, the harvesting scheduler calls the :ref:`perform harvesting operation `. This causes the remote resources to be harvested. These now show up as resources on the local GeoNode.

.. _one-time-harvesting-label:

One-time harvesting
-------------------

This workflow is mostly executed manually by the user.

#. The user creates a harvester and sets its ``scheduling_enabled`` attribute to ``False``;

#. The user calls the :ref:`update list of harvestable resources operation `;

#. When the previous operation is complete, the user goes through the list of generated :ref:`harvestable resources ` and, for each relevant harvestable resource, sets its ``should_be_harvested`` attribute to ``True``;

#. The user then calls the :ref:`perform harvesting operation `. This causes the remote resources to be harvested. These now show up as resources on the local GeoNode.

Standard harvester workers
==========================

.. note:: Remember that, as stated above, a harvester worker is configured by setting the ``harvester_type`` and ``harvester_type_specific_configuration`` attributes on the :ref:`harvester `.

   Moreover, the ``harvester_type_specific_configuration`` attribute must be a JSON object.

.. _geonode-harvester-worker-label:

GeoNode harvester worker
------------------------

This worker can harvest remote GeoNode deployments. In addition to creating local resources by retrieving the remote metadata, this harvester can also copy remote datasets over to the local GeoNode. This means that this harvester can even be used to generate replicated GeoNode instances.

This harvester can be used by setting ``harvester_type=geonode.harvesting.harvesters.geonodeharvester.GeonodeUnifiedHarvesterWorker`` in the harvester configuration.

It recognizes the following ``harvester_type_specific_configuration`` parameters:

- **harvest_datasets**: Whether to harvest remote resources of the ``dataset`` type or not. Acceptable values: ``true`` (the default) or ``false``.
- **copy_datasets**: Whether to copy remote resources of the ``dataset`` type over to the local GeoNode. Acceptable values: ``true`` or ``false`` (the default).
- **harvest_documents**: Whether to harvest remote resources of the ``document`` type or not. Acceptable values: ``true`` (the default) or ``false``.
- **copy_documents**: Whether to copy remote resources of the ``document`` type over to the local GeoNode. Acceptable values: ``true`` or ``false`` (the default).
- **resource_title_filter**: A string that must be present in a remote resource's ``title`` for it to be acknowledged as a harvestable resource. This allows resources that are not relevant to be filtered out. Acceptable values: any alphanumeric value.
- **start_date_filter**: A string specifying a datetime that is used to filter out resources by their ``start_date``. This is parsed with :ref:`dateutil.parser.parse() `, which means that it accepts multiple different formats (e.g. ``2021-06-30T13:04:05Z``).
- **end_date_filter**: Similar to ``start_date_filter`` but uses the resources' ``end_date`` as the filter parameter.
- **keywords_filter**: A list of keywords that are used to filter remote resources.
- **categories_filter**: A list of categories that are used to filter remote resources.

Creating scheduled harvesters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Following the :ref:`Continuous harvesting <continuous-harvesting-label>` workflow, sign in as admin and click on ``Admin`` to be redirected to the ``Django Administration Dashboard``.

.. figure:: img/admin_option.png

Find the ``Harvesting`` section and click on ``Harvesters``.

.. figure:: img/harvesters_1.png

Click on ``Add harvester`` to create a new ``harvester``.

.. figure:: img/harvesters_2.png

Create the harvester with the following attributes and then click ``Save``.

- Name: ``GeoNode harvester test``
- Remote url: ``https://summit2020.cartoview.net/``
- Harvester type: ``geonode.harvesting.harvesters.geonodeharvester.GeonodeUnifiedHarvesterWorker``

.. figure:: img/harvesters_5.png

After successful creation, you will be redirected to the list of available harvesters, which includes your newly created harvester.

.. figure:: img/harvesters_50.png

GeoNode will create an ``Asynchronous harvesting session`` for the harvester. To view it, browse through the administration using the path ``Home`` > ``Harvesting`` > ``Asynchronous harvesting session``.

.. figure:: img/harvesters_51.png

Once the session of the ``discover-harvestable-resources`` type is complete, you can view the available harvestable resources via the path ``Home`` > ``Harvesting`` > ``Harvestable resources``.

.. figure:: img/harvesters_52.png

As previously explained, since we did not set ``harvest_new_resources_automatically`` to ``True`` on our harvester, the ``should_be_harvested`` attribute of the listed resources is set to ``False``. We can manually set this attribute to ``True`` for the resources we want, so that when the time comes, the harvesting scheduler calls the ``perform harvesting operation`` and harvests them.

.. figure:: img/harvesters_53.png

The harvester will now show that the selected resources are scheduled for harvesting.

.. figure:: img/harvesters_54.png

Click on the ``Go`` icon to view the list.

.. figure:: img/harvesters_55.png

You can also harvest resources manually through an action. For example, select a resource from the list, set the action to ``Harvest selected resources``, then click ``Go``.

.. figure:: img/harvesters_56.png

Note that an ``asynchronous harvesting session`` is created for the selected resource with:

- Session type: ``harvesting``
- Status: ``on-going`` (which is updated to ``finished-all-ok`` once the resource has been harvested successfully)
- Total records to process: ``1``

.. figure:: img/harvesters_57.png

Search for the resource in GeoNode and verify that it has been created.

.. figure:: img/harvesters_58.png

.. _wms-harvester-worker-label:

WMS harvester worker
--------------------

This worker can harvest resources from remote OGC WMS servers.

This harvester can be used by setting ``harvester_type=geonode.harvesting.harvesters.wms.OgcWmsHarvester`` in the harvester configuration.

It recognizes the following ``harvester_type_specific_configuration`` parameters:

- **dataset_title_filter**: A string that is used to filter remote WMS layers by their ``title`` property. If a remote layer's title contains the string defined by this parameter, then the layer is recognized by the harvester worker.

Creating unscheduled harvesters using Remote Services
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Through ``Remote services``, GeoNode provides a simpler way of creating unscheduled harvesters: a user can perform the actions described in the :ref:`One-time harvesting <one-time-harvesting-label>` workflow using GeoNode web pages.

Let's create a remote service through GeoNode by clicking on ``Data`` > ``Remote Services`` > ``Add Remote Service``.

.. figure:: img/harvesters_6.png

Fill in the form with the following attributes and click ``Create``:

- Service URL: ``https://carto.nationalmap.gov/arcgis/services/transportation/MapServer/WMSServer?request=GetCapabilities&service=WMS``
- Service Type: ``Web Map Service``

.. figure:: img/harvesters_7.png

The service will be created and you will be redirected to the page where you can select resources to harvest into GeoNode.

.. figure:: img/harvesters_8.png

By default, GeoNode will create a harvester with the values shown below.

.. figure:: img/harvesters_9.png

After the harvester has loaded the resources that can be imported, select the resources you want to import and click on ``import Resources``.

.. figure:: img/harvesters_10.png

.. figure:: img/harvesters_10_b.png

When you check the GeoNode resources page, the imported resources will be available.

.. figure:: img/harvesters_11.png

.. figure:: img/harvesters_11_b.png

.. figure:: img/harvesters_11_c.png
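
As a recap of the worker configuration described under *Standard harvester workers*, the following minimal sketch shows one way to prepare a ``harvester_type_specific_configuration`` value as a JSON object before pasting it into the harvester form. The parameter names come from the lists above; the filter values and the ``to_json_object`` helper are hypothetical and are not part of GeoNode.

```python
import json

# Hypothetical example values; only the parameter names are taken from
# the GeoNode harvester worker parameter list in this document.
geonode_worker_config = {
    "harvest_datasets": True,
    "copy_datasets": False,
    "harvest_documents": True,
    "copy_documents": False,
    "resource_title_filter": "roads",
    "start_date_filter": "2021-06-30T13:04:05Z",
    "keywords_filter": ["transport"],
}

# The WMS harvester worker documents a single parameter.
wms_worker_config = {"dataset_title_filter": "roads"}


def to_json_object(config):
    """Serialize a worker configuration (hypothetical helper, not a GeoNode API).

    The configuration must be a JSON object, so anything that is not a
    dict is rejected before serialization.
    """
    if not isinstance(config, dict):
        raise TypeError("harvester_type_specific_configuration must be a JSON object")
    return json.dumps(config, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(to_json_object(geonode_worker_config))
    print(to_json_object(wms_worker_config))
```

Python booleans are serialized as the lowercase ``true`` and ``false`` values that the parameter descriptions list as acceptable, which is the main reason to generate the text with ``json.dumps`` rather than typing it by hand.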