Datasets Generation¶
This document explains how MerMEId MeLODy generates and uses datasets for search indices and dropdown menus. Understanding this process is essential for setting up a data repository, customizing search behavior, and extending the system with new entity types.
Overview¶
Datasets are aggregated RDF/Turtle files generated from your repository's data. They serve two critical functions in the editor:
- Search indices — power the entity search that loads in the editor
- Cross-reference dropdowns — populate the linked entity fields (e.g., "Birth Place", "Contributor") in the editor forms
Without published datasets, the editor can still read and edit entities locally, but search and cross-reference fields will not work.
How Datasets Are Generated¶
The Pipeline¶
The template repository includes a GitHub Actions workflow that automatically generates datasets after every push:
Push
↓
GitHub Actions: generate_datasets job
↓
Run datasets-generator.sh script
↓
For each dataset:
- Collect all RDF files from their source directory
- Execute a SPARQL CONSTRUCT query
- Output aggregated .ttl file to public/datasets/
↓
Upload public/ folder to GitHub Pages
↓
Datasets are published at your Pages URL
Tools Used¶
The generation process requires:
- static-publishing-backend Docker container — provides the RDF aggregator and SPARQL engine
- SPARQL CONSTRUCT queries — transform entity data into searchable datasets
- GitHub Actions — orchestrates the workflow on every push
All of these are configured in the template; you do not need to set anything up manually.
Dataset Files¶
When the pipeline runs, it generates Turtle (.ttl) files in the public/datasets/ directory. Each dataset corresponds to an entity type and is published to your GitHub Pages URL.
| Dataset File | Source Directory | Entity Type | Used For |
|---|---|---|---|
| persons.ttl | persons/ | Person | Search, cross-references (e.g., contributors, publishers) |
| works.ttl | works/ | Work | Search, cross-references (e.g., related works) |
| expressions.ttl | expressions/ | Expression | Search, cross-references (related expressions, works and manifestations) |
| manifestations.ttl | manifestations/ | Manifestation | Search, cross-references (related manifestations) |
| items.ttl | items/ | Item | Search, cross-references (related manifestations and items) |
| institutions.ttl | institutions/ | Institution | Search, cross-references (e.g., holding institutions/repositories) |
| places.ttl | places/ | Place | Search, cross-references (e.g., birth place) |
| venues.ttl | venues/, performanceEvents/ | Venue | Search, cross-references (e.g., performance venue) |
| events.ttl | events/ | Event | Search, cross-references (e.g., historic events of an item's creation) |
| performanceEvents.ttl | performanceEvents/ | Performance Event | Search, cross-references (e.g., first performance of an expression) |
| instrumentations.ttl | instrumentations/ | Instrumentation | Search, cross-references (e.g., instrumentation of an expression) |
| bibliography.ttl | bibliography/ | Bibliography | Search, cross-references (e.g., evidence of a performance or work) |
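To confirm that a pipeline run produced every file listed in the table above, a small check along these lines may help. It is a minimal sketch: the function name `missing_datasets` is hypothetical, and the default directory assumes the public/datasets/ output path described in this document.

```shell
# Dataset names taken from the table above (one .ttl file per name).
EXPECTED_DATASETS="persons works expressions manifestations items institutions places venues events performanceEvents instrumentations bibliography"

# Print every expected dataset file that is absent from the given directory.
# The directory defaults to public/datasets; nothing is printed if all exist.
missing_datasets() {
  local dir="${1:-public/datasets}"
  local name
  for name in $EXPECTED_DATASETS; do
    if [ ! -f "$dir/$name.ttl" ]; then
      printf '%s.ttl\n' "$name"
    fi
  done
}
```

Calling `missing_datasets` from the repository root after a local run lists the files that still need attention.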
SPARQL Query Reference¶
Each dataset is built using a SPARQL CONSTRUCT query that transforms entity data into a simplified searchable form. These queries are located in modules/datasets-generator/.
Persons Dataset¶
File: persons.sparql
Purpose: Index all persons with their full names.
Output properties:
- skos:prefLabel — formatted name (e.g., "Bach, Johann Sebastian")
Query logic:
- Combines schema:familyName and schema:givenName into a displayable label
- Falls back to family name only if given name is missing
- Sorted alphabetically by full name
Works Dataset¶
File: works.sparql
Purpose: Index all works with titles, alternative titles, and composer information.
Output properties:
- skos:prefLabel — main title
- skos:altLabel — uniform title and alternative title (if present)
- skos:broader — work classification (if present)
- schema:composer — composer name (if linked)
Query logic:
- Extracts the main title (type: MainTitle)
- Optionally includes uniform title and alternative titles as searchable variants
- Optionally extracts composer information via a contribution relationship
- Sorted by main title
Expressions Dataset¶
File: expressions.sparql
Purpose: Index all expressions with labels including incipit text for identification.
Output properties:
- skos:prefLabel — expression label with optional incipit text
Query logic:
- Uses either skos:prefLabel or rdfs:label as base label
- Optionally appends incipit text in parentheses for disambiguation
- Sorted by label
Manifestations Dataset¶
File: manifestations.sparql
Purpose: Index all manifestations with their titles and classifications.
Output properties:
- skos:prefLabel — manifestation title
- skos:broader — classification (if present)
Query logic:
- Extracts title from title subject
- Optionally includes classification for broader search
- Sorted by title
Items Dataset¶
File: items.sparql
Purpose: Index all items with titles and classifications.
Output properties:
- skos:prefLabel — item label
- skos:broader — classification
Query logic:
- Uses item label from rdfs:label
- Includes classification
- Sorted by label
Institutions Dataset¶
File: institutions.sparql
Purpose: Index all institutions with abbreviations or RISM sigla.
Output properties:
- skos:prefLabel — label with optional abbreviation or RISM siglum
- skos:broader — RISM identifier if present
Query logic:
- Uses institution name as base
- Prepends abbreviation or RISM siglum if available: "DB (Name)"
- Falls back to name only if no abbreviation or RISM siglum is available
- Sorted by label
Places Dataset¶
File: places.sparql
Purpose: Index all places with their names.
Output properties:
- skos:prefLabel — place name
Query logic:
- Extracts place name from schema:name
- Sorted alphabetically
Venues Dataset¶
File: venues.sparql
Purpose: Index all venues with their names and locations.
Output properties:
- skos:prefLabel — venue name
- skos:broader — place name (if venue is located in a place)
Query logic:
- Extracts venue name from schema:name
- Optionally includes containing place via schema:containedInPlace
- Sorted by venue name
Events Dataset¶
File: events.sparql
Purpose: Index all events with their names.
Output properties:
- skos:prefLabel — event name
Query logic:
- Extracts event name from rdfs:label
- Sorted by name
Performance Events Dataset¶
File: performanceEvents.sparql
Purpose: Index all performance events with dates and venues.
Output properties:
- skos:prefLabel — performance label with optional date and venue
Query logic:
- Uses performance event name as base
- Optionally appends date and place information for context
- Combined label format: "Name: Date Place"
- Sorted by label
Instrumentations Dataset¶
File: instrumentations.sparql
Purpose: Index all instrumentations.
Output properties:
- skos:prefLabel — instrumentation name
Query logic:
- Extracts name from rdfs:label
- Sorted alphabetically
Bibliography Dataset¶
File: bibliography.sparql
Purpose: Index all bibliography entries with abbreviations or titles.
Output properties:
- skos:prefLabel — bibliography label with abbreviation or title
Query logic:
- Prefers melod:hasAbbreviation if present
- Falls back to title if no abbreviation exists
- Sorted by label
The Generation Script¶
The file modules/datasets-generator/datasets-generator.sh orchestrates the entire process. It:
- Sets up directory paths for all entity types
- For each dataset, calls the rdf-data-aggregator tool with:
    - Source directory: where entity files are located (e.g., persons/)
    - File pattern: which files to include (e.g., *.ttl)
    - SPARQL query: the transformation logic (e.g., persons.sparql)
    - Output path: where to write the result (e.g., public/datasets/persons.ttl)
- Times each operation for performance monitoring
Example invocation:
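```shell
# Aggregate persons/*.ttl through persons.sparql into the persons dataset.
# Arguments: source directory, file pattern, query file, output file.
/static-publishing-backend rdf-data-aggregator \
  "$persons_dir_path/" \
  "*.ttl" \
  "$datasets_generator_dir_path/persons.sparql" \
  "$datasets_dir_path/persons.ttl"
```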
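The per-dataset call and the timing step described above could be wrapped in one function, roughly as follows. This is a sketch, not the contents of datasets-generator.sh: the function name `generate_dataset`, the variables `AGGREGATOR`, `QUERIES_DIR` and `OUTPUT_DIR`, and the convention that each source directory sits at the repository root under the dataset's name are all assumptions made for illustration.

```shell
# Assumed defaults; the real script defines its own path variables.
AGGREGATOR="${AGGREGATOR:-/static-publishing-backend}"
QUERIES_DIR="${QUERIES_DIR:-modules/datasets-generator}"
OUTPUT_DIR="${OUTPUT_DIR:-public/datasets}"

# Aggregate one dataset: read <name>/*.ttl, apply <name>.sparql,
# write <name>.ttl, then report the elapsed wall-clock seconds.
generate_dataset() {
  local name="$1"
  local start end
  start="$(date +%s)"
  "$AGGREGATOR" rdf-data-aggregator \
    "$name/" \
    "*.ttl" \
    "$QUERIES_DIR/$name.sparql" \
    "$OUTPUT_DIR/$name.ttl" || return 1
  end="$(date +%s)"
  echo "$name: $((end - start))s"
}
```

A loop such as `for name in persons works places; do generate_dataset "$name" || exit 1; done` would then stop at the first failing dataset. The venues dataset lists two source directories in the table, so it would likely need separate handling that this sketch does not cover.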
GitHub Pages Publishing¶
After datasets are generated, the GitHub Actions workflow publishes them to GitHub Pages:
- Datasets are created in the public/datasets/ directory
- The artifact is uploaded to GitHub Pages
- They become available at the URL configured as datasetBaseUrl
Required configuration:
- GitHub Pages must be enabled for your repository (Settings → Pages → Source: GitHub Actions)
- The repository must be public (for free GitHub Pages hosting)
For GitLab repositories, configure Pages in the CI/CD pipeline settings.
Troubleshooting¶
Datasets Are Not Published¶
Problem: Dropdowns are empty or search doesn't work.
Solutions:
- Check that GitHub Pages is enabled (Settings → Pages → Source: GitHub Actions)
- Verify that datasetBaseUrl in config.json matches your actual Pages URL
- Check that the latest pipeline run succeeded (Actions tab in GitHub)
- Clear the browser cache for the editor (Ctrl+Shift+Delete)
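One quick way to tell a publishing problem from an editor problem is to request a dataset file directly. The sketch below assumes that dataset files sit directly under the URL configured as datasetBaseUrl; the function names are hypothetical, and any URL passed in is a placeholder for your own Pages address.

```shell
# Build the URL of one dataset file from a base URL and a dataset name.
# A single trailing slash on the base URL is tolerated.
dataset_url() {
  local base="${1%/}"
  printf '%s/%s.ttl\n' "$base" "$2"
}

# Report whether a published dataset answers with HTTP 200.
check_dataset() {
  local url status
  url="$(dataset_url "$1" "$2")"
  # -s: silent, -L: follow redirects, -o: discard body, -w: print status code
  status="$(curl -s -L -o /dev/null -w '%{http_code}' "$url")"
  if [ "$status" = "200" ]; then
    echo "OK      $url"
  else
    echo "FAILED  $url (HTTP $status)"
    return 1
  fi
}
```

For example, `check_dataset "https://example.github.io/my-repo/datasets" persons` (with a placeholder URL) prints either an OK line or the failing status code. A 404 usually points to Pages configuration or a wrong base URL rather than to the editor.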
SPARQL Query Errors¶
Problem: Pipeline fails with a SPARQL error.
Solutions:
- Check the Actions logs for the specific error
- Verify namespace prefixes are defined (prefix melod:, prefix skos:, etc.)
- Ensure the entity type in the query matches your actual entity classes
- Test the query locally in an RDF editor or SPARQL IDE
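Missing prefix declarations can often be spotted before pushing. The function below is a rough heuristic, not a SPARQL parser: it strips IRIs, quoted strings and comments, then compares the prefixes a query uses against those it declares. The name `undeclared_prefixes` is hypothetical, and the empty prefix (`PREFIX :`) is not handled.

```shell
# List prefixes that a SPARQL file uses but never declares.
# Heuristic only: IRIs in <...>, quoted strings and # comments are removed
# first so that "http:" or text inside literals is not taken for a prefix.
undeclared_prefixes() {
  local file="$1"
  local declared used p
  # Declared prefixes, e.g. "PREFIX skos: <...>" yields "skos:".
  declared="$(grep -iEo 'prefix[[:space:]]+[A-Za-z][A-Za-z0-9_-]*:' "$file" \
    | sed -E 's/.*[[:space:]]//' | sort -u)"
  # Prefixes used in the query body (declaration lines are skipped).
  used="$(grep -viE '^[[:space:]]*prefix[[:space:]]' "$file" \
    | sed -E 's/<[^>]*>//g; s/"[^"]*"//g; s/#.*//' \
    | grep -Eo '[A-Za-z][A-Za-z0-9_-]*:' | sort -u)"
  for p in $used; do
    if ! printf '%s\n' "$declared" | grep -qxF "$p"; then
      printf '%s\n' "$p"
    fi
  done
}
```

Running `undeclared_prefixes modules/datasets-generator/persons.sparql` prints nothing when every used prefix has a declaration.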
Dataset File Not Generated¶
Problem: A specific dataset file is missing.
Solutions:
- Check that the source directory exists and contains .ttl files
- Verify the SPARQL query file exists in modules/datasets-generator/
- Ensure the entry is added to datasets-generator.sh
- Check that no syntax errors exist in the SPARQL query
Related Topics¶
- Repository Configuration — datasetBaseUrl and projectDomain settings
- SHACL Shapes — how form fields reference datasets
- Entity Types — data model and entity classes
- Data Model Ontologies — vocabularies used in datasets