Theory chapters#

Why are we doing this? What is the motivation behind these training materials?

At the IOER we are committed to the principles of open, reproducible, accessible and transparent science. To this end, documenting and sharing the code we use to process data and generate figures and results is critical. By using tools such as JupyterLab, we enable others to reproduce our results and workflows. We also use this framework (Jupyter Book) to document our own application programming interfaces (APIs), so that others (you!) can copy code snippets to access our data.

A Core Resource: Much of the foundational knowledge for research data management (RDM) in biodiversity is comprehensively covered in the NFDI4Biodiversity Self-Study Unit: Research Data Management for Biodiversity Data by Fischer et al. (2023). We highly recommend consulting this unit for in-depth explanations, particularly on topics like Data Management Plans (DMPs), the data life cycle, and detailed considerations for data collection, preservation, and sharing. The following sections provide a summary of key concepts, often drawing upon or aligning with this excellent resource.

Summary

  • The FAIR principles provide guidelines to make data Findable, Accessible, Interoperable, and Reusable, ensuring data is well-organized, machine-readable, and optimized for reuse across disciplines.

  • Data provenance refers to the documentation of the origin, history, and processing of data.

  • Metadata is information that describes and organizes data, enabling easier discovery and use.

  • A license defines the permissions, restrictions, and terms under which data or software can be used, shared, and modified.

FAIR principles#

The Findable, Accessible, Interoperable, Reusable (FAIR) principles (Wilkinson et al. 2016) are the culmination of more than 20 years of agreements and discussions within industry and academia on the critical issue of managing the most crucial asset of any research activity: the data.

The FAIR principles listed below follow the wording of the GO FAIR initiative.

Findable The first step in (re)using data is to find them. Metadata and data should be easy for both humans and computers to find. Machine-readable metadata plays a crucial role in enabling the automatic discovery of datasets and services (a small sketch follows the list below).

  • F1 (Meta)data are assigned a globally unique and persistent identifier

  • F2 Data are described with rich metadata (defined by R1 below)

  • F3 (Meta)data clearly and explicitly include the identifier of the data they describe

  • F4 (Meta)data are registered or indexed in a searchable resource
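To make the machine-readability point concrete, the following minimal Python sketch resolves a persistent identifier (the DOI of Wilkinson et al. 2016) via standard DOI content negotiation and reads back a machine-readable metadata record. It assumes the requests package and network access are available; it is an illustration, not part of any specific workflow.

```python
import requests

# F1: the DOI is a globally unique, persistent identifier.
# Content negotiation returns machine-readable metadata for it (CSL JSON).
doi = "10.1038/sdata.2016.18"  # Wilkinson et al. 2016, the FAIR principles paper
response = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
    timeout=30,
)
response.raise_for_status()
metadata = response.json()

# A few core fields of the machine-readable record
print(metadata["title"])
print(metadata["DOI"])
print(metadata.get("publisher"))
```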

Accessible Once users have found the required data, they need to know how to access it. This involves determining whether the data is openly available or requires authentication and authorization, such as login credentials. Users must know the methods for retrieving the data, whether through direct downloads, APIs, or repositories. Finally, it is essential to consider any restrictions or conditions on access (a small sketch follows the list below).

  • A1 (Meta)data are retrievable by their identifier using a standardised communications protocol

    • A1.1. The protocol is open, free and universally implementable

    • A1.2. The protocol allows for an authentication and authorisation procedure where necessary

  • A2 Metadata are accessible, even when the data are no longer available
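As a hedged illustration of A1, the sketch below retrieves the metadata of a record by its identifier over plain HTTPS, an open and free protocol, using the Zenodo REST API (the record ID is the Fischer et al. 2023 self-study unit cited above). Exact response fields may vary, so the code reads them defensively.

```python
import requests

# A1: (meta)data are retrievable by their identifier over a standardised,
# open protocol (HTTPS). Record 10377868 is the NFDI4Biodiversity
# self-study unit cited in this chapter.
record_id = 10377868
response = requests.get(f"https://zenodo.org/api/records/{record_id}", timeout=30)
response.raise_for_status()
record = response.json()

# Read fields defensively, since the exact response layout may vary
meta = record.get("metadata", {})
print(meta.get("title"))
print(meta.get("license"))

# List files attached to the record, if exposed by the response
for entry in record.get("files", []):
    print(entry.get("key"), entry.get("links", {}).get("self"))
```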

Interoperable Data usually need to be integrated with other data. In addition, data need to interoperate with applications or workflows for analysis, storage, and processing (see the example after the list below).

  • I1 (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.

  • I2 (Meta)data use vocabularies that follow FAIR principles

  • I3 (Meta)data include qualified references to other (meta)data
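For illustration, the snippet below expresses a single (invented) biodiversity observation with shared, formal vocabularies: JSON-LD provides the knowledge-representation layer (I1), while the terms come from the Darwin Core and Dublin Core vocabularies (I2).

```python
import json

# An occurrence record built from shared vocabularies. All values are invented.
record = {
    "@context": {
        "dwc": "http://rs.tdwg.org/dwc/terms/",  # Darwin Core terms
        "dcterms": "http://purl.org/dc/terms/",  # Dublin Core terms
    },
    "dwc:scientificName": "Quercus robur",
    "dwc:eventDate": "2024-05-17",
    "dwc:decimalLatitude": 51.05,
    "dwc:decimalLongitude": 13.74,
    "dwc:basisOfRecord": "HumanObservation",
    "dcterms:license": "https://creativecommons.org/licenses/by/4.0/",
}

print(json.dumps(record, indent=2))
```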

Reusable The ultimate goal of FAIR is to optimise data reuse. To achieve this, metadata and data should be well described so that they can be replicated and/or combined in different settings.

  • R1 (Meta)data are richly described with accurate and relevant attributes.

    • R1.1 (Meta)data are released with a clear and accessible data usage license.

    • R1.2 (Meta)data are associated with detailed provenance.

    • R1.3 (Meta)data meet domain-relevant community standards.

CARE Principles#

Complementing FAIR, the CARE Principles for Indigenous Data Governance (Collective Benefit, Authority to Control, Responsibility, and Ethics) were introduced by the Global Indigenous Data Alliance (GIDA) to address ethical concerns related to Indigenous data (Rizzolli and Imeri 2022). They emphasize the rights of Indigenous Peoples concerning their data, including information about their language, customs, and territories, and are increasingly relevant in biodiversity research involving traditional knowledge or resources from Indigenous lands.

Data Provenance#

In scientific research, ensuring reproducibility remains a cornerstone of the scientific method. Reproducibility allows other researchers to verify findings by following the same methodology, reanalyzing data, and obtaining consistent results. In Data Science, it is fundamental to provide transparent documentation, well-structured metadata, standardized workflows, and detailed provenance tracking to capture every step of data processing and analysis. Unlike workflows, which serve as structured guidelines, provenance functions more like a detailed logbook by systematically recording every step to generate a specific result. This allows researchers to trace, review, and even replicate the exact process that led to a particular outcome, ensuring its validity (Henzen et al. 2013).

For example, in typical geoscience research, provenance can include the following (a minimal logging sketch follows the list):

  • Data source: raw measurements, original vector and raster data, ground control data

  • Pre-processing Methods: Reprojection of the geodata; Clipping the dataset to the bounds of a specific study area; Data cleaning (e.g., removing clouds or irrelevant features)

  • Data processing and analysis: transformations applied, including filtering, aggregation, resampling, joining, and/or model design with the associated statistical analysis

  • Model or statistical parameters, with the associated functions and code used in computations

  • The final output and how it was generated
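A minimal, hand-rolled sketch of such a logbook is shown below; the step names, file paths, and parameters are hypothetical, and in practice a Jupyter Notebook or a workflow system would capture much of this automatically.

```python
import datetime
import json
import platform

# A minimal provenance logbook: each processing step is appended with its
# inputs, outputs and parameters, so the final result can be traced.
provenance = {
    "environment": {"python": platform.python_version()},
    "steps": [],
}

def log_step(name, inputs, outputs, parameters):
    """Append one processing step to the provenance record."""
    provenance["steps"].append({
        "name": name,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "inputs": inputs,
        "outputs": outputs,
        "parameters": parameters,
    })

# Hypothetical pre-processing steps matching the list above
log_step(
    "reproject",
    inputs=["raw/landcover.tif"],
    outputs=["work/landcover_epsg3035.tif"],
    parameters={"target_crs": "EPSG:3035"},
)
log_step(
    "clip_to_study_area",
    inputs=["work/landcover_epsg3035.tif", "raw/study_area.gpkg"],
    outputs=["work/landcover_clipped.tif"],
    parameters={"all_touched": True},
)

# Store the logbook next to the outputs
with open("provenance.json", "w", encoding="utf-8") as fh:
    json.dump(provenance, fh, indent=2)
```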

Tip

A Jupyter Notebook is an excellent tool for maintaining provenance in computational research. It records the entire workflow and provides a detailed logbook of all the data processing and analysis steps. Moreover, a Jupyter Notebook allows easy annotations to describe each step, improving clarity and documentation.

Metadata (MD): “Data About Data”#

Metadata (MD) is often described as “data about data.” It provides structured information about research data, enabling better organization, discovery, and contextualization of datasets.

Why Is Metadata Important?#

Metadata plays a crucial role in:

  • Enhancing discoverability: Well-documented metadata allows researchers to find relevant datasets quickly (e.g. by using keywords in their search).

  • Ensuring data interoperability: Standardized metadata enhances searchability and data integration by providing consistent descriptors (e.g., controlled vocabularies and standardized keywords for geospatial data). It also facilitates collecting and processing datasets across different platforms: for geospatial data, adopting Open Geospatial Consortium (OGC) standard services such as WMS, WFS, or WCS allows seamless data retrieval and processing across software and systems, including R, Python, QGIS, ArcGIS, and web-based GIS applications. Consistent metadata (e.g., uniformly defined coordinate reference systems, spatial extent, and thematic attributes) further improves interoperability, enabling researchers and analysts to integrate datasets from diverse sources efficiently.

  • Improving data reproducibility: By providing details about how data was collected and processed (e.g., by adding related links to the original data source, pre-processing algorithms, analysis-ready data, post-processing algorithms, replication packages and any related documentation such as data or software description article).

  • Facilitating long-term data usability and fit for purpose: Metadata includes essential details such as data format, provenance/lineage, licensing, and links to other resources supporting research data’s long-term sustainability and usability.

  • Promoting proper attribution, credit, and citation: Metadata elements such as Creator and License ensure that creators can be appropriately credited for their work while defining the conditions under which the data may be shared and reused.

What Does “Structured Metadata” Mean?#

Structured metadata follows a defined format. Metadata standards are classified into:

  • General-Purpose Metadata Standards: Broadly applicable (e.g., DataCite Schema, Dublin Core).

  • Domain-Specific Metadata Standards: Tailored to fields (e.g., ISO 19115 for geodata; ABCD, Darwin Core, EML for biodiversity data; see Fischer et al. (2023), Sec. 4.3.2.4 for more examples and resources like FAIRsharing.org to find standards).

Common metadata elements (largely based on DataCite) include the following (an illustrative record is sketched after the list):

  • Title: The name of the dataset or research work.

  • Creator: The individual(s) or organization(s) responsible for generating the data.

  • Abstract: A summary of the dataset’s content and purpose.

  • Keywords: Terms that help categorize and index the data for easier retrieval.

  • Format: The file type or structure of the dataset (e.g., CSV, PDF, XML).

  • Subject: The broader topic or discipline related to the data.

  • Persistent Identifier (PID): A unique identifier (such as a DOI) that ensures the dataset remains accessible and citable over time.

  • License: The terms of use specifying how the data can be shared and reused.

  • Provenance/Lineage: Information on the origin and history of the dataset, including how it was created and modified.
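For illustration, a record covering these elements might look like the following sketch; all values are invented placeholders and the field names follow the DataCite schema only loosely.

```python
import json

# An illustrative metadata record loosely based on the DataCite elements above.
# All values are invented placeholders.
metadata = {
    "title": "Land-use change in an example study region, 2000-2020",
    "creators": [{"name": "Doe, Jane", "affiliation": "Example Institute"}],
    "abstract": "Annual land-use maps derived from satellite imagery.",
    "keywords": ["land use", "remote sensing", "monitoring"],
    "format": "GeoTIFF",
    "subject": "Geography",
    "identifier": {"identifierType": "DOI", "identifier": "10.1234/example"},
    "license": "CC-BY-4.0",
    "provenance": "Derived from satellite scenes; steps recorded in provenance.json",
}

print(json.dumps(metadata, indent=2))
```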

ISO 19115 is a domain-specific metadata standard tailored to geodata, providing extensive detail on the spatial, temporal, and thematic aspects of datasets, such as:

  • Spatial Reference Information: Coordinate Reference System (CRS), Projection details, Spatial resolution (scale, ground sampling distance)

  • Temporal Extent: Period covered by the data, Frequency of updates (e.g., daily, annually)

  • Detailed Lineage and Data Provenance: Source data origin (e.g., satellite imagery, field surveys), Data processing history (e.g., transformations, filtering, aggregation), Quality control procedures applied

  • Data Quality: Positional accuracy (spatial precision), Logical consistency (topological and attribute correctness), Completeness (missing data, coverage gaps)

  • Geospatial Feature and Attribute Information: Vector feature types (e.g., points, lines, polygons), Raster properties (resolution, pixel size, band information), Thematic classification (e.g., land cover categories)

  • Geospatial Services: Web services (e.g., WMS, WFS, WCS from OGC)

Please refer to the user guide of the National Agricultural Library (NAL) of the United States for a detailed explanation of the ISO 19115 metadata elements.
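As a hedged sketch of how such OGC services are used in practice, the snippet below reads a WMS server's machine-readable capabilities document with the OWSLib Python package and lists the layers it offers. The endpoint URL is a placeholder; replace it with a service advertised in a dataset's ISO 19115 metadata.

```python
from owslib.wms import WebMapService  # pip install OWSLib

# Placeholder endpoint: substitute a real WMS, e.g. one listed under
# "Geospatial Services" in a dataset's ISO 19115 metadata record.
WMS_URL = "https://example.org/geoserver/wms"

# The GetCapabilities response is machine-readable metadata about the service
wms = WebMapService(WMS_URL, version="1.3.0")
print(wms.identification.title)

# List the layers the service offers, with bounding boxes and supported CRS
for name, layer in wms.contents.items():
    print(name, layer.title, layer.boundingBoxWGS84, layer.crsOptions)
```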

Controlled Vocabularies and Authority Files#

To ensure consistency and machine-interpretability, metadata should use controlled vocabularies (predefined terms, thesauri like AGROVOC, or ontologies like ENVO) and authority files (standardized names/identifiers for entities such as people via ORCID, organizations via ROR, or places via GeoNames). (See Fischer et al. (2023), Sec. 4.3.2.5, for details and resources such as BARTOC or the GFBio Terminology Service.)
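To make this concrete, a creator entry might embed authority-file identifiers directly in the metadata, as in the sketch below. The ORCID iD shown is ORCID's documented example record (Josiah Carberry); the ROR and GeoNames identifiers are placeholders to be replaced with real ones.

```python
import json

# Authority files give people, organisations and places stable identifiers.
creator = {
    "name": "Carberry, Josiah",
    "nameIdentifier": "https://orcid.org/0000-0002-1825-0097",  # ORCID example iD
    "affiliationIdentifier": "https://ror.org/xxxxxxxxx",       # ROR (placeholder)
}
coverage = {
    "place": "Dresden",
    "placeIdentifier": "https://www.geonames.org/xxxxxxx",      # GeoNames (placeholder)
}
# Controlled keywords, e.g. drawn from a thesaurus such as AGROVOC
keywords = ["land use", "urban ecosystems"]

print(json.dumps({"creator": creator, "coverage": coverage, "keywords": keywords}, indent=2))
```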

Intellectual Property and Licenses#

Intellectual Property (IP) refers to creations of the mind. Intellectual Property Rights (IPR), like copyright, protect these creations. A license specifies conditions for access, modification, or sharing. (See Fischer et al. 2023 Sec. 7.2.5 for an in-depth guide).

Prefer CC0 or CC BY

IOER FDZ recommends open and specific licenses for spatial data. Internationally, Creative Commons (CC) licenses are common. For Germany, GOVDATA licenses are recommended. Fischer et al. (2023) (Sec. 7.2.5) similarly advise CC0 or CC BY for creative content, warning against ND (NoDerivatives) and NC (NonCommercial) due to ambiguities that hinder reuse. For databases, Open Data Commons (ODC-PDDL, ODC-BY) or CC BY 4.0 are suitable. Raw data and metadata, often not copyrightable, can be marked with CC0 or a Public Domain Mark.

For software, use specific open-source licenses like GPL, MIT, or MPL, not CC licenses.
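Whatever license is chosen, it becomes machine-actionable when it is recorded with its SPDX identifier rather than as free text. A minimal sketch, with invented metadata records:

```python
# Record licenses by their SPDX identifiers so tools can act on them.
dataset_metadata = {
    "title": "Example dataset",
    "license": "CC-BY-4.0",  # SPDX identifier
    "license_url": "https://spdx.org/licenses/CC-BY-4.0.html",
}

software_metadata = {
    "name": "example-analysis-tool",
    "license": "MIT",  # SPDX identifier
    "license_url": "https://spdx.org/licenses/MIT.html",
}

print(dataset_metadata["license"], software_metadata["license"])
```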

Tip

Not sure which license to choose? This license chooser helps you find the appropriate one.

Selected International Creative Commons Licenses#

| License | License URL | SPDX link | Comment |
|---|---|---|---|
| CC0 | CC0 | SPDX | |
| CC-BY-4.0 | CC-BY-4.0 | SPDX | IOER-FDZ default |
| CC-BY-SA-4.0 | CC-BY-SA-4.0 | SPDX | |
| CC-BY-NC-4.0 | CC-BY-NC-4.0 | SPDX | Use with caution |
| CC-BY-ND-4.0 | CC-BY-ND-4.0 | SPDX | Use with caution |
| CC-BY-NC-SA-4.0 | CC-BY-NC-SA-4.0 | SPDX | Use with caution |
| CC-BY-NC-ND-4.0 | CC-BY-NC-ND-4.0 | SPDX | Use with caution |

CC: Creative Commons license

BY: Credit must be given to the creator.

SA: Adaptations must be shared under the same terms.

NC: Only noncommercial uses of the work are permitted.

ND: No derivatives or adaptations of the work are permitted.

Selected German Licenses#

| License | License URL | SPDX link |
|---|---|---|
| dl-de/by-2-0 | GovData | SPDX |
| dl-de/zero-2-0 | GovData | SPDX |

Note

dl-de/zero-2-0 is comparable to CC0

dl-de/by-2-0 is comparable to CC-BY-4.0

Open Source Software Licenses#

| License | License URL | SPDX link | Comment |
|---|---|---|---|
| GPL 3 | GPL-3.0 | SPDX | Copyleft license* |
| MIT | MIT License | SPDX | Permissive** |

* Copyleft: Requires developers to license any modified versions under the same terms as the original version. It is a form of “share-alike” license.
** Permissive: Allows developers to use the original software in any project without licensing any changed versions under the original terms.

References#

[1]

Marlen Fischer, Juliane Röder, Johannes Signer, Daniel Tschink, Tanja Weibulat, and Ortrun Brand. NFDI4Biodiversity Self-Study Unit: Research Data Management for Biodiversity Data. December 2023. URL: https://doi.org/10.5281/zenodo.10377868, doi:10.5281/zenodo.10377868.

[2]

Mark D. Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E. Bourne, Jildau Bouwman, Anthony J. Brookes, Tim Clark, Mercè Crosas, Ingrid Dillo, Olivier Dumon, Scott Edmunds, Chris T. Evelo, Richard Finkers, Alejandra Gonzalez-Beltran, Alasdair J.G. Gray, Paul Groth, Carole Goble, Jeffrey S. Grethe, Jaap Heringa, Peter A.C ’t Hoen, Rob Hooft, Tobias Kuhn, Ruben Kok, Joost Kok, Scott J. Lusher, Maryann E. Martone, Albert Mons, Abel L. Packer, Bengt Persson, Philippe Rocca-Serra, Marco Roos, Rene van Schaik, Susanna-Assunta Sansone, Erik Schultes, Thierry Sengstag, Ted Slater, George Strawn, Morris A. Swertz, Mark Thompson, Johan van der Lei, Erik van Mulligen, Jan Velterop, Andra Waagmeester, Peter Wittenburg, Katherine Wolstencroft, Jun Zhao, and Barend Mons. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3(1):160018, March 2016. URL: https://doi.org/10.1038/sdata.2016.18, doi:10.1038/sdata.2016.18.

[3]

Michaela Rizzolli and Sabine Imeri. CARE Principles for Indigenous Data Governance. o-bib. Das offene Bibliotheksjournal, pages 1–14, June 2022. URL: https://www.o-bib.de/bib/article/view/5815 (visited on 2025-06-02), doi:10.5282/O-BIB/5815.

[4]

Christin Henzen, Stephan Mäs, and Lars Bernard. Provenance Information in Geodata Infrastructures. In Danny Vandenbroucke, Bénédicte Bucher, and Joep Crompvoets, editors, Geographic Information Science at the Heart of Europe, pages 133–151. Springer International Publishing, Cham, 2013. URL: https://doi.org/10.1007/978-3-319-00615-4_8, doi:10.1007/978-3-319-00615-4_8.