Data Management Skillbuilding Hub

Best Practice: Provide a citation and document provenance for your dataset



Data Life Cycle stage(s): Describe   Preserve

For appropriate attribution and provenance of a dataset, the following information should be included in the data documentation or the companion metadata file:

  • Name the people responsible for the dataset throughout its lifetime, including for each person:
    • Name
    • Contact information
    • Role (e.g., principal investigator, technician, data manager)

According to the International Polar Year Data and Information Service, an author is an individual whose intellectual work, such as a particular field experiment or algorithm, led to the creation of the dataset. People responsible for the data can include individuals, groups, compilers, or editors.

  • Description of the context of the dataset with respect to a larger project or study (include links and related documentation), if applicable.
  • Revision history, including additions of new data and error corrections.
  • Links to source data, if the data in one dataset were derived from data in another dataset.
  • List of project support (e.g., funding agencies, collaborators, material support).
  • Describe how to properly cite the dataset. The data citation should include:
    • All contributors
    • Date of dataset publication
    • Title of dataset
    • Media or URL
    • Data publisher
    • Identifier (e.g., a Digital Object Identifier)
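
The items listed above can be captured in a small machine-readable record that travels with the data. The following is a minimal Python sketch, not a prescribed schema: the class names, field names, the order of citation elements, and every value in the example record (people, addresses, URL, DOI) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Person:
    """One person responsible for the dataset."""
    name: str
    contact: str
    role: str  # e.g. "principal investigator", "technician", "data manager"


@dataclass
class DatasetRecord:
    """Attribution and provenance details kept with a dataset (illustrative fields)."""
    title: str
    contributors: List[Person]
    publication_year: int
    publisher: str
    medium_or_url: str
    identifier: Optional[str] = None  # e.g. a DOI
    revisions: List[str] = field(default_factory=list)        # revision history
    source_datasets: List[str] = field(default_factory=list)  # links to source data
    support: List[str] = field(default_factory=list)          # funders, collaborators


def _sentence(text: str) -> str:
    """End a citation element with exactly one period."""
    return text.rstrip(".") + "."


def format_citation(record: DatasetRecord) -> str:
    """Join the citation elements in one possible order (an assumption, not a standard)."""
    if not record.contributors:
        raise ValueError("a citation needs at least one contributor")
    names = ", ".join(person.name for person in record.contributors)
    parts = [
        names,
        str(record.publication_year),
        record.title,
        record.publisher,
        record.medium_or_url,
    ]
    if record.identifier:
        parts.append(record.identifier)
    return " ".join(_sentence(part) for part in parts)


# Hypothetical example record; none of these values refer to a real dataset.
example = DatasetRecord(
    title="Example Stream Temperature Data",
    contributors=[
        Person("Doe, J.", "jdoe@example.org", "principal investigator"),
        Person("Roe, R.", "rroe@example.org", "data manager"),
    ],
    publication_year=2011,
    publisher="Example Data Center",
    medium_or_url="https://example.org/dataset/1",
    identifier="doi:10.0000/example",
    revisions=["2011-09-01: initial release"],
)

print(format_citation(example))
```

Keeping the contributors, revision history, and source links in one record means the citation text can be regenerated whenever the dataset is revised. A repository or publisher may require a different element order or punctuation, so its guidance takes precedence over this sketch.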

Rationale

Documenting a dataset's origin, history, and contact information makes it possible to cite the dataset properly. Proper citation, in turn, ensures that data providers and publishers receive appropriate credit for their efforts.

Additional Information

The Oak Ridge National Laboratory Distributed Active Archive Center has guidance and rationale for citing datasets: Editorial: Citations to Published Data Sets

Buneman P, Khanna S, Tan W. 2001. Why and Where: A Characterization of Data Provenance. Pp. 316-330 in Lecture Notes in Computer Science. Springer Berlin/Heidelberg.
Osterweil LJ, Clarke LA, Ellison AM, Boose E, Podorozhny R, Wise A. 2010. Clear and precise specification of ecological data management processes and dataset provenance. IEEE Transactions on Automation Science and Engineering 7(1):189-195.
Simmhan YL, Plale B, Gannon D. 2005. A survey of data provenance in e-science. ACM SIGMOD 34(3):31-36.


Turner, D.P., W.D. Ritts, and M. Gregory. 2006. BigFoot NPP Surfaces for North and South American Sites, 2002-2004. Data set. Available on-line from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A.



Cite this best practice:

Sherry Lake, DataONE (September 01, 2011) "Best Practice: Provide a citation and document provenance for your dataset". Accessed through the Data Management Skillbuilding Hub on Aug 22, 2019


Hosted by DataONE

In collaboration with the community, DataONE has developed high-quality resources for helping educators and librarians with training in data management, including teaching materials, webinars, and a database of best practices to improve methods for data sharing and management.
