Practicing Data Stewardship During Research

Data Science Dispatch

Data stewardship refers to the responsible management and oversight of scientific data throughout its lifecycle, from creation and collection to storage, sharing, and preservation. It involves ensuring that data is accurate, accessible, and reusable, while also protecting its integrity and confidentiality. 

Good data stewardship is a collaborative effort that involves everyone on an NIH-funded project, including researchers, data managers, IT professionals, and Program Officers (POs). Practice good data stewardship throughout the project life cycle by following these key principles and practices.

Planning for data management

A comprehensive Data Management and Sharing (DMS) Plan is part of the foundation of data stewardship. According to the NIH DMS Policy, a DMS Plan should outline the types of data to be collected; tools, software, and code to be used during data analysis; the data standards to be used; and the plans for preserving and sharing data, including repositories where the data will be stored. 

POs at NIAID play a crucial role in good data stewardship, in part by reviewing and approving DMS Plans submitted with research grant applications. They ensure that DMS Plans are logical, feasible, and aligned with the goals of the funding agency. Learn more about NIH data sharing policies on the Data Policy and Guidance page.

Organizing and documenting during research

Once the research begins, scientists are responsible for organizing data consistently. Staying organized at the time of data collection reduces the burden of cleaning and preparing data at the end of a project. Sometimes a data manager is responsible for coordinating these tasks, but the entire research team has a hand in maintaining good stewardship during data capture. 

Controlled vocabularies and ontologies help ensure that data is labeled accurately and consistently. This practice reduces the burden of data cleaning and makes it easier for others to understand and use the data. For example:

  • For drug names, researchers can use a controlled vocabulary like RxNorm to select from a list of verified values instead of manually typing drug names.
  • For species names, an NCBI Taxon ID provides a unique, standardized identifier for each species, which makes it easier to retrieve, compare, and analyze biological data accurately across databases and research studies.
  • For other biomedical ontologies, the Ontobee data server provides downloadable ontologies for various diseases, populations, and biomedical concepts.

When collecting data during a study, researchers can apply controlled vocabularies by using data validation tools or dropdown menus in data collection software. 
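
As a minimal sketch of this kind of validation, the Python snippet below checks entered values against a small allowed list. The vocabulary contents and the names DRUG_VOCABULARY, validate_term, and find_invalid_terms are hypothetical placeholders for illustration; a real project would load its terms from an authoritative source such as RxNorm rather than hard-coding them.

```python
# Hypothetical controlled vocabulary; a real one would be loaded from an
# authoritative source rather than typed in by hand.
DRUG_VOCABULARY = {"acetaminophen", "ibuprofen", "amoxicillin"}


def validate_term(value, vocabulary):
    """Return the normalized term if it is in the vocabulary, else None."""
    normalized = value.strip().lower()
    return normalized if normalized in vocabulary else None


def find_invalid_terms(values, vocabulary):
    """Return the entries that do not match any vocabulary term."""
    return [v for v in values if validate_term(v, vocabulary) is None]


# Example entries, including one misspelling that should be flagged.
entries = ["Ibuprofen", " amoxicillin ", "ibuprofin"]
invalid = find_invalid_terms(entries, DRUG_VOCABULARY)
print(invalid)
```

Flagging unmatched entries at the time of capture, rather than silently correcting them, leaves the decision about the intended term with the person who recorded the data.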

Common data elements (CDEs) can also help standardize data collection to align with established research questions. CDEs are data elements or variables that are defined and used the same way across multiple studies. Using CDEs supports data interoperability, helps researchers meet funding requirements, and can save time when designing data-gathering protocols. The NIH CDE Repository provides searchable lists of CDEs and information about using them in research.

Perform regular quality checks to identify and correct errors, inconsistencies, and missing values. This includes using tools and techniques to clean and validate data so that it meets the required standards of accuracy and completeness. High-quality data is essential for reliable research outcomes. While these activities can be time-consuming, they are critical to ensuring that outside scientists can interpret the data after publication or the completion of a project.
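
A routine quality check could look something like the following Python sketch, which reports missing values, out-of-range numbers, and duplicate identifiers. The field names (participant_id, age, temperature_c), the acceptable ranges, and the check_records helper are all assumptions made for illustration, not part of any NIH tool or requirement.

```python
def check_records(records, required_fields, valid_ranges):
    """Return a list of (row_index, field, problem) tuples."""
    problems = []
    seen_ids = set()
    for i, record in enumerate(records):
        # Report required fields that are absent or empty.
        for field in required_fields:
            if record.get(field) in (None, ""):
                problems.append((i, field, "missing"))
        # Report numeric fields that fall outside the expected range.
        for field, (low, high) in valid_ranges.items():
            value = record.get(field)
            if value in (None, ""):
                continue  # empty values are handled by the missing check
            try:
                number = float(value)
            except (TypeError, ValueError):
                problems.append((i, field, "not numeric"))
                continue
            if not low <= number <= high:
                problems.append((i, field, "out of range"))
        # Report identifiers that appear more than once.
        record_id = record.get("participant_id")
        if record_id in seen_ids:
            problems.append((i, "participant_id", "duplicate"))
        elif record_id not in (None, ""):
            seen_ids.add(record_id)
    return problems


# Hypothetical records with one empty age, one implausible temperature,
# and one repeated participant ID.
records = [
    {"participant_id": "P001", "age": "34", "temperature_c": "36.8"},
    {"participant_id": "P002", "age": "", "temperature_c": "368"},
    {"participant_id": "P001", "age": "51", "temperature_c": "37.1"},
]
problems = check_records(
    records,
    required_fields=["participant_id", "age"],
    valid_ranges={"age": (0, 120), "temperature_c": (30.0, 45.0)},
)
for row, field, problem in problems:
    print(f"row {row}: {field} is {problem}")
```

Running a check like this on a schedule during data collection, rather than once at the end, keeps the list of problems short enough to resolve while the source of each record is still easy to trace.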

Using consistent naming conventions and version control practices for data files is another example of good data stewardship. Maintaining thorough documentation during the data capture process ensures that data can be understood and reused in the future.
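
One way to keep file names consistent is to check them against an agreed pattern. The Python sketch below assumes a hypothetical convention of project_datatype_YYYYMMDD_vNN.extension; the pattern, the example file names, and the parse_filename helper are illustrative only, and each team would substitute its own convention.

```python
import re
from datetime import datetime

# Hypothetical convention: <project>_<datatype>_<YYYYMMDD>_v<NN>.<ext>
FILENAME_PATTERN = re.compile(
    r"(?P<project>[a-z0-9]+)_(?P<datatype>[a-z0-9]+)_"
    r"(?P<date>\d{8})_v(?P<version>\d{2})\.(?P<ext>[a-z0-9]+)"
)


def parse_filename(name):
    """Return the parts of a conforming file name, or None if it does not conform."""
    match = FILENAME_PATTERN.fullmatch(name)
    if match is None:
        return None
    parts = match.groupdict()
    try:
        # Reject strings of eight digits that are not real calendar dates.
        datetime.strptime(parts["date"], "%Y%m%d")
    except ValueError:
        return None
    parts["version"] = int(parts["version"])
    return parts


for name in ["flu01_serology_20240115_v02.csv", "Final data (new).csv"]:
    status = "ok" if parse_filename(name) else "does not follow the convention"
    print(f"{name}: {status}")
```

Encoding the date and version in the name makes the history of a file visible at a glance, although it complements rather than replaces a version control system.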

Protecting data privacy and security

Sensitive data, such as personal health information, must be protected throughout the research process. This involves implementing appropriate security measures, such as password protection and encryption, and complying with relevant regulations like HIPAA.

Ensuring that only authorized personnel have access to sensitive data is a key aspect of data stewardship at every stage of the research process. During data collection, use encrypted or password-protected data storage methods that comply with relevant rules on sensitive data. When sharing data, protect study participants’ data by selecting a controlled-access database. One example is dbGaP (the database of Genotypes and Phenotypes), which stores information on genetic variations and their associated physical and clinical traits. Its holdings include detailed genetic data, health-related information, and results from studies exploring the genetic factors that influence various conditions and characteristics.

IT professionals may play a role by providing the technical infrastructure and implementing security measures, such as encryption and access controls, to protect sensitive data.

Sharing and preserving data

At the end of a study or research project, data should be shared in compliance with the DMS Plan and associated NIH data sharing policies. 

Researchers should take care to select an appropriate repository so that other researchers can discover the data. Prior to publication, researchers should also consider journal requirements for data sharing so they can conform with the standards and expectations of the scientific community.

All data should be shared with detailed “metadata” that describes the data's content and context. This practice helps make data findable, accessible, interoperable, and reusable (FAIR); in other words, it helps other researchers find and use the data, thereby maximizing its value and impact. Supplementary materials like study protocols can also be shared to make the research easier to reproduce.
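
As a rough illustration of what such metadata might contain, the Python sketch below builds a small descriptive record for a data file and prints it as JSON. The field names, the sample data, and the build_metadata helper are hypothetical; they do not follow any particular repository's schema, and the chosen repository's own metadata requirements would take precedence.

```python
import hashlib
import json


def build_metadata(data_bytes, title, description, variables):
    """Describe a data file's content and context in a plain dictionary."""
    return {
        "title": title,
        "description": description,
        # Map each variable name to a human-readable definition.
        "variables": variables,
        "size_bytes": len(data_bytes),
        # A checksum lets recipients confirm the file arrived unchanged.
        "sha256": hashlib.sha256(data_bytes).hexdigest(),
    }


# Hypothetical data file contents and descriptions.
data = b"participant_id,age\nP001,34\nP002,51\n"
metadata = build_metadata(
    data,
    title="Example cohort ages",
    description="Illustrative table of participant ages at enrollment.",
    variables={
        "participant_id": "Study-assigned participant identifier",
        "age": "Age at enrollment, in years",
    },
)
print(json.dumps(metadata, indent=2))
```

Storing a record like this alongside the data file gives later users the variable definitions they need without having to contact the original team.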

Finally, researchers should consider the long-term preservation of their data so that it remains accessible and usable over time. The easiest way to do this is to follow NIH guidelines for selecting a repository. NIAID researchers can also consult with their PO for recommendations on the appropriate repository.

By practicing good data stewardship, researchers can enhance the value and impact of their work. Effective data planning, organization, quality assurance, security, and sharing are all essential components of data stewardship that contribute to the advancement of science and the broader research community.

Learn more about managing data and complying with NIH policy by visiting the Data Policy and Guidance page.
