Provide a DCAT compliant endpoint for harvesting by Worldbank Data Hub #14

Open
2 of 4 tasks
ldodds opened this issue Apr 14, 2021 · 7 comments

ldodds commented Apr 14, 2021

The metadata in the catalogue will be harvested by the World Bank Data Hub. While they can write a custom importer, their preference is a DCAT-compliant endpoint that provides the necessary metadata along with stable identifiers for individual datasets.

At the moment there are several JSON endpoints.

We need to:

  • agree which one will be recommended for harvesting by WB DHH
  • identify any changes/mappings needed to align with DCAT (plus any extensions)
  • make the necessary code changes to support this
  • decide whether to remove any redundant endpoints

ldodds commented Apr 14, 2021

Here's a valid JSON-LD document that conforms to the DCAT and GEODCAT profile:

{
 "@context": {
   "dcat": "http://www.w3.org/ns/dcat#",
   "dct": "http://purl.org/dc/terms/",
   "foaf": "http://xmlns.com/foaf/0.1/",
   "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
   "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
   "vcard": "http://www.w3.org/2006/vcard/ns#",
   "xsd": "http://www.w3.org/2001/XMLSchema#"   
 },
 "@type": "dcat:Catalog",
 "dct:title": "Risk Data Library",
 "dct:description": "A simple catalog to find datasets compliant with the Risk Data Library",
 "foaf:homepage": "http://jkan.riskdatalibrary.org/",
 "dct:publisher": {
     "@type": "foaf:Organization",
     "foaf:name": "GFDRR",
     "foaf:homepage": "https://www.gfdrr.org/"
 },
 "dcat:dataset": [
   {
     "@id": "http://jkan.riskdatalibrary.org/datasets/exp-afg-agriculture/",
     "@type": "dcat:Dataset",
     "dct:identifier": "...",
     "dct:title": "Afghanistan agriculture",
     "dct:description": "Location, area and USD value of rainfed and irrigated agricultural crops in Afghanistan.",
     "dcat:landingPage": "http://jkan.riskdatalibrary.org/datasets/exp-afg-agriculture/",
     "dct:license": "https://creativecommons.org/licenses/by-sa/4.0/",
     "dct:publisher": {
        "@type": "foaf:Organization",
        "foaf:name": "GFDRR",
        "foaf:homepage": "https://www.gfdrr.org/"
     },
     "dcat:contactPoint": {
        "vcard:fn": "GFDRR",
        "vcard:hasEmail": "mailto:[email protected]"
     },
     "dct:spatial": [{
       "rdfs:label": "Afghanistan"
     }],
     "dcat:keyword": [ "Exposure" ],
     "dcat:distribution": [{
        "@type": "dcat:Distribution",
        "dct:title": "Afghanistan agriculture",
        "dcat:accessURL": "https://rdl-jkan-datasets.s3-ap-southeast-2.amazonaws.com/exposure/exp-afg-infrastructures.zip",
        "dct:format": "geotiff"
     }]
   }
 ]
}

Additional datasets would be added to the array value of dcat:dataset property.
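
As a rough sketch of that structure (Python, standard library only), the catalogue can be held as a dictionary and each dataset appended to the `dcat:dataset` list before serialising. The `add_dataset` helper is hypothetical and only a subset of the properties above is shown; this is an illustration, not the RDL implementation.

```python
import json

# Catalogue skeleton, following the JSON-LD example above (context trimmed).
catalog = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@type": "dcat:Catalog",
    "dct:title": "Risk Data Library",
    "dcat:dataset": [],
}


def add_dataset(catalog, dataset_id, title, description):
    """Append one dcat:Dataset entry to the catalogue (hypothetical helper)."""
    entry = {
        "@id": dataset_id,
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:description": description,
    }
    catalog["dcat:dataset"].append(entry)
    return entry


add_dataset(
    catalog,
    "http://jkan.riskdatalibrary.org/datasets/exp-afg-agriculture/",
    "Afghanistan agriculture",
    "Location, area and USD value of rainfed and irrigated agricultural crops in Afghanistan.",
)

# Serialise the catalogue as the JSON-LD document a harvester would fetch.
document = json.dumps(catalog, indent=2)
```

Calling `add_dataset` again with another dataset's details grows the same array, so a harvester always reads one document.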

I've not tried to map all of the RDL metadata into that schema yet, but it's not clear if other catalogues could harvest and import this anyway.

pzwsk commented Apr 15, 2021

which one will be recommended for harvesting by WB DHH

I guess we have several endpoints because JKAN provides them by default, right?

If so, I would simply keep the one closest to the profile above.

pzwsk commented Apr 15, 2021

I've not tried to map all of the RDL metadata into that schema yet, but it's not clear if other catalogues could harvest and import this anyway.

Agree that it is not the priority.

Collaborator

ConnectedSystems commented Apr 19, 2021

Thanks @ldodds

FYI - rdl-datasets.json is the canonical endpoint for RDL-JKAN.

It took me a while to remember where the datasets.json file came from. It is part of the base JKAN implementation and I intended to remove it at the earliest opportunity. I simply did not want to break the JKAN install while setting up the RDL instance.

I will mock up a GEODCAT endpoint shortly.

Collaborator

ConnectedSystems commented Apr 19, 2021

Hi @ldodds @matamadio

Attached is an example rdl-geodcat.json file that is auto-generated based on the included/available datasets.

The only attribute I've left alone is dct:identifier. I guess this could be the URL as well, but that is not a stable solution for the reasons outlined earlier (change of platform/service provider, etc.).
[EDIT: here I am referring to our conversation with @jeanpommier via email on the possible datapackage spec, circa 8 April 2021]

I can deploy this implementation if this example is adequate and once a solution to the identifier question above has been decided on.

rdl-geodcat.txt
[EDIT: File is provided as a .txt file as GitHub won't let me upload a JSON file]

@matamadio

Review the alignment of the metadata with the HDX schema (which also uses DCAT).

Collaborator

ConnectedSystems commented Apr 28, 2021

Turns out it was faster for me to hack together a Liquid template to generate entries.

The approach is to extract the metadata from the JKAN datasets, which are structured according to the schema specification; entries were manually mapped to DCAT fields where JKAN had not already done so. Note that this doesn't pull the data from the rendered pages, so in the long term the page-embedded metadata and the endpoint may inadvertently drift out of sync.
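
For readers who want the mapping idea in a general-purpose language, here is a minimal Python sketch. It is not the Liquid template itself: the source keys (`title`, `notes`, `license`, `resources` with `name`/`format`/`url`) and the `to_dcat_dataset` helper are assumptions made for illustration, and the real JKAN/RDL field names may differ.

```python
# Assumed mapping from JKAN-style dataset keys to DCAT properties (illustrative only).
FIELD_MAP = {
    "title": "dct:title",
    "notes": "dct:description",
    "license": "dct:license",
}


def to_dcat_dataset(dataset, page_url):
    """Build a dcat:Dataset entry from a JKAN-style dataset dictionary."""
    entry = {
        "@id": page_url,
        "@type": "dcat:Dataset",
        "dcat:landingPage": page_url,
    }
    # Copy each known field across under its DCAT property name.
    for source_key, dcat_property in FIELD_MAP.items():
        if source_key in dataset:
            entry[dcat_property] = dataset[source_key]
    # Each resource becomes one dcat:Distribution.
    entry["dcat:distribution"] = [
        {
            "@type": "dcat:Distribution",
            "dct:title": resource["name"],
            "dct:format": resource["format"],
            "dcat:downloadURL": resource["url"],
        }
        for resource in dataset.get("resources", [])
    ]
    return entry


example = {
    "title": "Madagascar Multi-Hazard loss scenarios",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "resources": [
        {
            "name": "Madagascar multi-hazard loss scenarios",
            "format": "gpkg",
            "url": "https://rdl-jkan-datasets.s3-ap-southeast-2.amazonaws.com/loss/lss-mdg-mh.gpkg",
        }
    ],
}
entry = to_dcat_dataset(example, "/datasets/lss-mdg-mh/")
```

Keeping the mapping in one table makes it easier to see which DCAT properties are still unmapped.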

@ldodds note that downloadURL is used in combination with dct:format and dcat:mediaType.

dct:format is extracted from the dataset entry, whereas dcat:mediaType is inferred from dcat:downloadURL.
Really, all it currently does is prepend "application/" to the file extension, unless the URL points to a bare CSV file.
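
The inference rule described above could be sketched in Python roughly as follows. This is not the Liquid code; the function name `infer_media_type` is hypothetical, and returning `text/csv` for a bare CSV file is my assumption about what the exception does.

```python
import posixpath
from urllib.parse import urlparse


def infer_media_type(download_url):
    """Guess a media type from the file extension of a download URL."""
    path = urlparse(download_url).path
    extension = posixpath.splitext(path)[1].lstrip(".").lower()
    if not extension:
        return None  # nothing to infer from
    if extension == "csv":
        return "text/csv"  # assumption: the bare-CSV case is handled separately
    # Otherwise simply prepend "application/" to the extension.
    return "application/" + extension


gpkg_type = infer_media_type(
    "https://rdl-jkan-datasets.s3-ap-southeast-2.amazonaws.com/loss/lss-mdg-mh.gpkg"
)
zip_type = infer_media_type(
    "https://rdl-jkan-datasets.s3-ap-southeast-2.amazonaws.com/loss/lss-mdg-mh-epc.zip"
)
```

Note that a rule like this yields values such as "application/gpkg", which may not be registered media types, so a lookup table might be worth considering later.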

Example for a multi-entry dataset page below.
Latest full example attached as txt file.

{
      "@id": "/datasets/lss-mdg-mh/",
      "@type": "dcat:Dataset",
      "dct:identifier": "/datasets/lss-mdg-mh/",
      "dct:title": "Madagascar Multi-Hazard loss scenarios",
      "dct:description": "Direct loss simulated on exposed building asset measured as Average Annual Losses (AAL) and six Return Period scenarios for multiple hazards (earthquake, pluvial flood, storm surge and strong wind).",
      "dcat:landingPage": "/datasets/lss-mdg-mh/",
      "dct:license": "https://creativecommons.org/licenses/by/4.0/",
      "dct:publisher": {
        "@type": "foaf:Organization",
        "foaf:name": "GFDRR",
        "foaf:homepage": "https://www.gfdrr.org/"
      },
      "dcat:contactPoint": {
        "vcard:fn": "GFDRR",
        "vcard:hasEmail": "[email protected]"
      },
      "dct:spatial": [
        {
          "rdfs:label": "Madagascar"
        }
      ],
      "dcat:keyword": ["Loss"],
      "dcat:distribution": [
        {
          "@type": "dcat:Distribution",
          "dct:title": "Madagascar multi-hazard loss scenarios",
          "dct:format": "gpkg",
          "dcat:downloadURL": "https://rdl-jkan-datasets.s3-ap-southeast-2.amazonaws.com/loss/lss-mdg-mh.gpkg",
          "dcat:mediaType": "application/gpkg"
        },
        {
          "@type": "dcat:Distribution",
          "dct:title": "Madagascar multi-hazard loss exceedence-probability curves",
          "dct:format": "csv",
          "dcat:downloadURL": "https://rdl-jkan-datasets.s3-ap-southeast-2.amazonaws.com/loss/lss-mdg-mh-epc.zip",
          "dcat:mediaType": "application/zip"
        }
      ]
    }

rdl-geodcat.txt
