Here, we will break down the details of the FAIR data principles in a four-part blog series: Part 1: Findable, Part 2: Accessible, Part 3: Interoperable, Part 4: Reusable.
So what are the FAIR data principles, and why are they important? The FAIR data principles emphasize machine-actionability so that computational systems can find, access, interoperate, and reuse data with minimal human intervention. This is necessary because as data grows in volume, complexity, and creation speed, it outpaces individual researchers’ ability to work with large datasets effectively.
The NIH and many funding organizations worldwide require that data generated from their funded research be managed and shared using the FAIR (Findable, Accessible, Interoperable and Reusable) data principles. These principles were first introduced in 2016 with the publication of the FAIR Guiding Principles for Scientific Data Management and Stewardship in Scientific Data and further expanded through the GO FAIR Initiative.
Adopting FAIR data principles will make it easier for researchers to use computational tools to search, process and analyze large datasets. Standardization with FAIR principles is also crucial for repurposing datasets for secondary research purposes. Below is a quick overview of the FAIR data principles described in the NIH Strategic Plan for Data Science.
- To be Findable, data must have a unique persistent identifier (e.g., a Digital Object Identifier, or DOI) to label it and make it searchable within a larger data structure
- To be Accessible, data must be easily retrievable via open systems and include effective and secure authentication and authorization procedures
- To be Interoperable, data should “use and speak the same language” using standardized vocabulary and data format
- To be Reusable, data must be adequately described to a new user, have clear information about data-usage licenses, and have a traceable “owner’s manual” or provenance
Part 1: Findable
The key component that makes data findable online is metadata. Metadata is essentially “data about data”, or data documentation, so clear and detailed documentation is essential. Metadata describes the data’s content, format, and internal organization, which allows researchers to find, use, and cite your dataset.
Metadata is required when data is deposited into a repository to be shared. However, it is best to document data from the beginning of the project through its completion to prevent important details from being lost or forgotten. Once data is deposited in a repository, it will often be assigned a DOI, a type of persistent identifier (PID) for research products, which is crucial for “findability”.
Globally unique and persistent identifiers typically take the form of an internet link (e.g., a URL to a web page that defines the concept, such as a particular human protein). Identifiers are essential to Open Science and will help others properly cite your work.
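To make the "identifier as an internet link" idea concrete, here is a minimal Python sketch that turns a DOI string into its resolvable link. The function name `doi_to_url` and the list of accepted prefixes are assumptions made for illustration; the DOI used is the one for the FAIR Guiding Principles paper listed under Readings.

```python
DOI_RESOLVER = "https://doi.org/"

# Prefixes people commonly paste in front of a DOI (an illustrative list, not exhaustive).
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(doi: str) -> str:
    """Return the https://doi.org/ link for a DOI string (hypothetical helper)."""
    doi = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    # A DOI starts with the "10." directory indicator and has a prefix/suffix split.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"does not look like a DOI: {doi!r}")
    return DOI_RESOLVER + doi


print(doi_to_url("doi:10.1038/sdata.2016.18"))
# prints https://doi.org/10.1038/sdata.2016.18
```

Storing and citing the full link form means that both people and software can follow the identifier straight to the record it names.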
Below are the key elements that should be included to create rich metadata. An easy way to create metadata is to use an Excel spreadsheet as a collection template. Fill in the available information at the start of a project and update it as the project progresses. Keep this metadata file in the same folder as the data itself.
- Title: Name of the dataset or research project that generated it. If the dataset is part of a manuscript, it is recommended (and often required) to use the manuscript title to link the data with the publication
- Creator/Author: Names and addresses of the organizations or people who created the data; the preferred format for personal names is surname first (e.g., Smith, Jane). Using ORCID iDs for authors is highly recommended
- Funder: Funding agencies/organizations, including the Crossref Funder ID
- Date: Key dates, including project start and end dates, release date, time period covered by the data and other dates related to the data lifespan, such as maintenance cycle and update schedule. The preferred format is the ISO 8601 standard (e.g., yyyy-mm-dd, or yyyy-mm-dd/yyyy-mm-dd for a range)
- Description: Keywords or phrases describing the subject or content of the data
- Place: Note the physical locations where data are collected (e.g., Washington University School of Medicine in St. Louis, MPRB Building, Room 10302, Molecular Microbiology Imaging Facility)
- Method: Describe how the data were generated, listing equipment and software used (including model and/or version numbers), formulae, algorithms, experimental protocols, reagents, and other details that one might include in a lab notebook. RRIDs (Research Resource Identifiers) are recommended for citing key resources such as antibodies, model organisms, cell lines, plasmids, and other tools (e.g., software, databases, services). Protocol DOIs can be generated using protocols.io. In addition, LabArchives, the Electronic Lab Notebook (ELN) that WashU offers to researchers at no charge, recently integrated protocols.io, making it easy to incorporate protocols into your LabArchives notebook
- Processing: A description of how the data have been filtered and processed prior to analysis
- Source: Citations to data derived from other sources, including details of where the source data is held and how it was accessed
- File inventory: A list of all files associated with the project, including extensions (e.g., baseline_CDI.csv, readme.txt)
- File structure: Directory URL where your datasets are located, along with a description of how data files are organized
- Necessary software: List special-purpose software required to create, view, analyze, or otherwise use the data
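The element list above can also be sketched as a small script for readers who prefer to generate the collection template rather than type it by hand. This is only an illustration under stated assumptions: it writes a two-column CSV (which opens in Excel) instead of a native Excel file, the helper names (`blank_template`, `iso_range`, `to_csv`) are made up for this example, and the sample title is a placeholder. Date ranges use the slash separator that ISO 8601 defines for time intervals.

```python
import csv
import io
from datetime import date

# Metadata elements, in the order listed above.
FIELDS = [
    "Title", "Creator/Author", "Funder", "Date", "Description", "Place",
    "Method", "Processing", "Source", "File inventory", "File structure",
    "Necessary software",
]


def blank_template() -> dict:
    """Return an empty record with one entry per metadata element."""
    return {field: "" for field in FIELDS}


def iso_range(start: date, end: date) -> str:
    """Format a date range as an ISO 8601 interval (start/end)."""
    if end < start:
        raise ValueError("end date is before start date")
    return f"{start.isoformat()}/{end.isoformat()}"


def to_csv(record: dict) -> str:
    """Render the record as two-column CSV text: Element, Value."""
    unknown = set(record) - set(FIELDS)
    if unknown:
        raise KeyError(f"unknown metadata elements: {sorted(unknown)}")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Element", "Value"])
    for field in FIELDS:
        writer.writerow([field, record.get(field, "")])
    return buffer.getvalue()


# Fill in what is known at the start of the project; update the rest later.
record = blank_template()
record["Title"] = "Example dataset title"  # placeholder value
record["Creator/Author"] = "Smith, Jane"
record["Date"] = iso_range(date(2022, 1, 10), date(2022, 10, 17))
record["File inventory"] = "baseline_CDI.csv; readme.txt"

print(to_csv(record))
```

Saving the printed text as a .csv file next to the data gives you a template you can reopen and extend as the project progresses.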
When you submit your datasets to a repository, you might be required to convert this metadata collection template to an open file format (e.g., .pdf or .txt) or transfer it into the repository’s metadata collection form. Once your dataset is deposited into a repository, your data will be assigned a DOI with rich metadata, allowing it to be findable online.
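As a rough illustration of the conversion step, the sketch below turns a two-column metadata table (exported from the spreadsheet as CSV) into plain text suitable for a .txt file. The column names `Element` and `Value`, the function name, and the sample rows are all assumptions for this example; your template and your repository's requirements may differ.

```python
import csv
import io


def metadata_csv_to_text(csv_text: str) -> str:
    """Convert 'Element,Value' CSV text into 'Element: value' lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        # Flag empty elements so gaps are visible before deposit.
        value = (row["Value"] or "").strip() or "(not recorded)"
        lines.append(f"{row['Element']}: {value}")
    return "\n".join(lines) + "\n"


# Hypothetical sample rows standing in for an exported template.
sample = (
    "Element,Value\n"
    "Title,Example dataset title\n"
    'Creator/Author,"Smith, Jane"\n'
    "Funder,\n"
)

text = metadata_csv_to_text(sample)
print(text)
```

The resulting text can be saved as readme.txt alongside the data files, or pasted field by field into a repository's metadata form.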
For more details about the FAIRification process for Findability (F1-F4), please visit the GO FAIR Website. In my next blog, I will review the Accessibility of the FAIR Data Principles, so please stay tuned!
Resources
Readings
- Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). DOI: https://doi.org/10.1038/sdata.2016.18
- Gould, Maria. “People, places, and things: Persistent identifiers in the scholarly communication landscape.” College & Research Libraries News [Online], 83.9 (2022): 398. Web. 17 Oct. 2022. DOI: https://doi.org/10.5860/crln.83.9.398