Use Case: GBIF Biodiversity Data - Building Virtual Knowledge Graphs for Life Sciences with Ontop-VKG

The Global Biodiversity Information Facility (GBIF) publishes monthly snapshots of all occurrence records as Parquet files on Amazon S3. Using the same Ontop + DuckDB approach from the core tutorial, you can expose these biodiversity records as a SPARQL endpoint — another example of the VKG pattern applied to a different domain.

The full GBIF dataset contains billions of records spread across ~2800 Parquet shards. Querying it live on AWS is not practical with Ontop because every SPARQL query would trigger a full scan of all shards. Instead, we extract a country-level subset first, then map it with Ontop.

This chapter uses Montserrat (country code MS, ~50k records, ~5 MB) as a quick example, and Suriname (SR, ~500k records, ~65 MB) as a more realistic case.

Prerequisites¶

DuckDB installed (see Installation)
Ontop CLI installed and on your PATH
Internet access (to download from AWS S3)

Step 1 — Extract a country subset from GBIF¶

GBIF publishes monthly snapshots at:

s3://gbif-open-data-us-east-1/occurrence/YYYY-MM-01/occurrence.parquet/

The extract_country.sh script automates the extraction. It:

Lists all Parquet shards in the S3 bucket for a given month
Generates a DuckDB SQL query that scans all shards and filters by country code
Runs the query to produce a single local Parquet file

The extraction script¶

Save this as extract_country.sh:

#!/bin/bash

# === Input Parameters ===
YEAR=$1
MONTH=$2
COUNTRYCODE=$3

if [ -z "$YEAR" ] || [ -z "$MONTH" ] || [ -z "$COUNTRYCODE" ]; then
  echo "Usage: ./extract_country.sh <year> <month> <country_code>"
  echo "Example: ./extract_country.sh 2025 02 SR"
  exit 1
fi

# === Zero-pad month ===
MONTH=$(printf "%02d" $MONTH)
echo "[$(date "+%Y-%m-%d %H:%M:%S")] Step 1: Listing files in S3 bucket..."

# === Step 1: List files in S3 bucket ===
S3_PATH="s3://gbif-open-data-us-east-1/occurrence/${YEAR}-${MONTH}-01/occurrence.parquet/"
FILES=$(aws s3 ls "$S3_PATH" --no-sign-request | awk '{print $4}')

if [ -z "$FILES" ]; then
  echo "!! No files found for $YEAR-$MONTH!"
  exit 1
fi
echo "[$(date "+%Y-%m-%d %H:%M:%S")]  Found files in S3."

# === Step 2: Build DuckDB SQL ===
echo "[$(date "+%Y-%m-%d %H:%M:%S")]  Step 2: Building DuckDB SQL query..."
SAFE_COUNTRYCODE=$(echo "$COUNTRYCODE" | tr ',' '_')
SQL_FILE="query_${SAFE_COUNTRYCODE}_${YEAR}_${MONTH}.sql"
PARQUET_FILE="occurrence_${SAFE_COUNTRYCODE}_${YEAR}_${MONTH}.parquet"

echo "INSTALL httpfs;" > $SQL_FILE
echo "LOAD httpfs;" >> $SQL_FILE
echo "" >> $SQL_FILE
echo "COPY (" >> $SQL_FILE
echo "  SELECT *" >> $SQL_FILE
echo "  FROM parquet_scan([" >> $SQL_FILE

for FILE in $FILES; do
  echo "    'https://gbif-open-data-us-east-1.s3.amazonaws.com/occurrence/${YEAR}-${MONTH}-01/occurrence.parquet/$FILE'," >> $SQL_FILE
done

# Remove trailing comma
sed -i '' '$ s/,$//' $SQL_FILE

echo "  ])" >> $SQL_FILE

# === Build WHERE clause ===
if [[ "$COUNTRYCODE" == *","* ]]; then
  COUNTRY_LIST=$(echo "$COUNTRYCODE" | awk -v q="'" -F',' \
    '{for (i=1; i<=NF; i++) printf q tolower($i) q (i<NF?",":"")}')
  echo "  WHERE lower(countrycode) IN (${COUNTRY_LIST})" >> $SQL_FILE
else
  echo "  WHERE lower(countrycode) = lower('${COUNTRYCODE}')" >> $SQL_FILE
fi
echo ") TO '${PARQUET_FILE}' (FORMAT PARQUET);" >> $SQL_FILE

echo "[$(date "+%Y-%m-%d %H:%M:%S")]  SQL saved to: $SQL_FILE"

# === Step 3: Run the query ===
echo "[$(date "+%Y-%m-%d %H:%M:%S")]  Step 3: Running DuckDB query..."
duckdb < $SQL_FILE
echo "[$(date "+%Y-%m-%d %H:%M:%S")] Done! File saved: $PARQUET_FILE"

Run the extraction¶

chmod +x extract_country.sh

# Montserrat (small, fast — good for testing)
./extract_country.sh 2025 02 MS

# Suriname (larger, more realistic)
./extract_country.sh 2025 02 SR

You can also extract multiple countries at once:

./extract_country.sh 2025 02 SR,GY,GF

Step 2 — Create the OBDA mapping¶

GBIF occurrence records use Darwin Core terminology. The mapping file translates DuckDB columns into Darwin Core RDF properties.

Save this as gbif-mapping.obda:

[PrefixDeclaration]
ex:  http://example.org/occurrence/
dwc: http://rs.tdwg.org/dwc/terms/
geo: http://www.opengis.net/ont/geosparql#
wdt: http://www.wikidata.org/prop/direct/

[MappingDeclaration] @collection [[
mappingId   occurrence_all_fields
target      <http://example.org/occurrence/{gbifid}> a dwc:Occurrence ;
              ex:gbifID "{gbifid}" ;
              dwc:datasetKey "{datasetkey}" ;
              dwc:occurrenceID "{occurrenceid}" ;
              dwc:kingdom "{kingdom}" ;
              dwc:phylum "{phylum}" ;
              dwc:class "{class}" ;
              dwc:order "{order}" ;
              dwc:family "{family}" ;
              dwc:genus "{genus}" ;
              dwc:species "{species}" ;
              dwc:infraspecificEpithet "{infraspecificepithet}" ;
              dwc:taxonRank "{taxonrank}" ;
              dwc:scientificName "{scientificname}" ;
              dwc:verbatimScientificName "{verbatimscientificname}" ;
              dwc:verbatimScientificNameAuthorship "{verbatimscientificnameauthorship}" ;
              dwc:countryCode "{countrycode}" ;
              dwc:locality "{locality}" ;
              dwc:stateProvince "{stateprovince}" ;
              dwc:occurrenceStatus "{occurrencestatus}" ;
              dwc:individualCount "{individualcount}" ;
              dwc:publishingOrgKey "{publishingorgkey}" ;
              dwc:decimalLatitude "{decimallatitude}" ;
              dwc:decimalLongitude "{decimallongitude}" ;
              dwc:coordinateUncertaintyInMeters "{coordinateuncertaintyinmeters}" ;
              dwc:coordinatePrecision "{coordinateprecision}" ;
              dwc:elevation "{elevation}" ;
              dwc:elevationAccuracy "{elevationaccuracy}" ;
              dwc:depth "{depth}" ;
              dwc:depthAccuracy "{depthaccuracy}" ;
              dwc:eventDate "{eventdate}" ;
              dwc:day "{day}" ;
              dwc:month "{month}" ;
              dwc:year "{year}" ;
              dwc:taxonKey "{taxonkey}" ;
              dwc:speciesKey "{specieskey}" ;
              dwc:basisOfRecord "{basisofrecord}" ;
              dwc:institutionCode "{institutioncode}" ;
              dwc:collectionCode "{collectioncode}" ;
              dwc:catalogNumber "{catalognumber}" ;
              dwc:recordNumber "{recordnumber}" ;
              dwc:identifiedBy "{identifiedby}" ;
              dwc:dateIdentified "{dateidentified}" ;
              dwc:license "{license}" ;
              dwc:rightsHolder "{rightsholder}" ;
              dwc:recordedBy "{recordedby}" ;
              dwc:typeStatus "{typestatus}" ;
              dwc:establishmentMeans "{establishmentmeans}" ;
              dwc:lastInterpreted "{lastinterpreted}" ;
              dwc:mediaType "{mediatype}" ;
              dwc:issue "{issue}" ;
              wdt:P625 "Point({decimallongitude} {decimallatitude})"^^geo:wktLiteral .
source      SELECT gbifid, datasetkey, occurrenceid, kingdom, phylum, class,
              "order", family, genus, species, infraspecificepithet, taxonrank,
              scientificname, verbatimscientificname,
              verbatimscientificnameauthorship, countrycode, locality,
              stateprovince, occurrencestatus, individualcount, publishingorgkey,
              decimallatitude, decimallongitude, coordinateuncertaintyinmeters,
              coordinateprecision, elevation, elevationaccuracy, depth,
              depthaccuracy, eventdate, day, month, year, taxonkey, specieskey,
              basisofrecord, institutioncode, collectioncode, catalognumber,
              recordnumber, identifiedby, dateidentified, license, rightsholder,
              recordedby, typestatus, establishmentmeans, lastinterpreted,
              mediatype, issue
            FROM 'occurrence_MS_2025_02.parquet'
]]

Step 3 — Create a minimal ontology¶

Save this as gbif.ttl:

@prefix dwc: <http://rs.tdwg.org/dwc/terms/> .
@prefix ex:  <http://example.org/occurrence/> .

Step 4 — Create the properties file¶

Save this as gbif.properties:

jdbc.driver=org.duckdb.DuckDBDriver
jdbc.url=jdbc:duckdb:

Step 5 — Start the SPARQL endpoint¶

ontop endpoint \
    --ontology gbif.ttl \
    --mapping gbif-mapping.obda \
    --properties gbif.properties \
    --port=8082

Open: http://localhost:8082

Example SPARQL queries¶

Count all occurrences¶

PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>

SELECT (COUNT(?occ) AS ?count)
WHERE {
  ?occ a dwc:Occurrence .
}

List species with their taxonomic hierarchy¶

PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>

SELECT ?species ?genus ?family ?order
WHERE {
  ?occ a dwc:Occurrence ;
       dwc:species ?species ;
       dwc:genus ?genus ;
       dwc:family ?family ;
       dwc:order ?order .
}
LIMIT 20

Count occurrences per institution¶

PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>

SELECT ?institution (COUNT(?occ) AS ?count)
WHERE {
  ?occ a dwc:Occurrence ;
       dwc:institutionCode ?institution .
}
GROUP BY ?institution
ORDER BY DESC(?count)

Occurrences with coordinates (GeoSPARQL)¶

PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?species ?coord
WHERE {
  ?occ a dwc:Occurrence ;
       dwc:species ?species ;
       wdt:P625 ?coord .
}
LIMIT 20

Count species per family¶

PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>

SELECT ?family (COUNT(DISTINCT ?species) AS ?speciesCount)
WHERE {
  ?occ a dwc:Occurrence ;
       dwc:family ?family ;
       dwc:species ?species .
}
GROUP BY ?family
ORDER BY DESC(?speciesCount)
LIMIT 20

Can I query GBIF on AWS directly (without downloading)?¶

Not practically with Ontop. The full GBIF dataset consists of ~2800 Parquet shards totalling hundreds of gigabytes. Ontop translates each SPARQL query into SQL, and DuckDB would need to scan all shards over the network for every query. This makes direct S3 querying impractical for interactive SPARQL use.

The recommended workflow is:

GBIF Parquet on S3 → DuckDB extract by country → local Parquet → Ontop → SPARQL

You can query GBIF’s S3 Parquet files directly with DuckDB for one-off analytical queries (using httpfs), but Ontop’s virtual mapping approach works best with local data where repeated SQL access is fast.

Connection to the core tutorial¶

Like the OMOP use case, GBIF follows the same three-artifact VKG pattern:

Artifact	Solar system	GBIF
Database	PostgreSQL	DuckDB (Parquet)
`.properties`	`solar_system.properties`	`gbif.properties`
`.obda`	`solar_system.obda`	`gbif-mapping.obda`
`.ttl`	`solar_system.ttl`	`gbif.ttl`

The GBIF mapping uses Darwin Core vocabulary instead of a custom ontology, demonstrating how Ontop can map to established community standards.