Data Management Skillbuilding Hub

Best Practice: Document the integration of multiple datasets




Data Life Cycle stage(s): Integrate

Document the steps used to integrate disparate datasets.

  • Ideally, capture the integration process systematically in an executable form, such as a script or workflow, so that it can be reproduced
  • If no scientific workflow system is used, describe the process, scripts, or queries used to integrate the data in the documentation (metadata) that accompanies the data
  • Provide a conceptual model that describes the relationships among datasets from different sources
  • Use unique identifiers in the data records to maintain data integrity by reducing duplication
  • Identify foreign key fields in the data records which support the relationship between the data sources
  • When you use datasets and data elements from within those datasets as a source for new datasets, it is important to identify and document those data within the documentation of the new/derived dataset. This is known as dataset provenance; provenance describes the origin or source of something. Just as you would cite papers that are sources for your research paper, it is critical to identify the sources of the data used within your own datasets. This will allow for:
    • tracing the chain of use of datasets and data elements
    • credit and attribution to accrue to the creators of the original datasets
    • tracing the impact on your new datasets, and on their interpretation, of any errors or new information that come to light about the original datasets or data elements
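
As an illustration of the points above on executable integration steps, unique identifiers, and foreign keys, here is a minimal Python sketch. It is not taken from DataONE: the two datasets, the field names (site_id, obs_id, temp_c), and the helper functions are hypothetical and only show one way such a step might be scripted.

```python
from collections import Counter

# Hypothetical source datasets; all names and values are illustrative assumptions.
sites = [
    {"site_id": "S1", "name": "North Meadow"},
    {"site_id": "S2", "name": "South Ridge"},
]
observations = [
    {"obs_id": "O1", "site_id": "S1", "temp_c": 11.5},
    {"obs_id": "O2", "site_id": "S2", "temp_c": 9.0},
    {"obs_id": "O3", "site_id": "S9", "temp_c": 10.2},  # no matching site
]


def duplicate_ids(records, key):
    """Return the identifier values that occur more than once."""
    counts = Counter(record[key] for record in records)
    return sorted(value for value, n in counts.items() if n > 1)


def integrate(parents, children, key):
    """Join child records to parent records on a shared key field.

    Returns (joined, orphans): the joined rows, and the child rows whose
    foreign key has no match in the parent dataset.
    """
    index = {parent[key]: parent for parent in parents}
    joined, orphans = [], []
    for child in children:
        parent = index.get(child[key])
        if parent is None:
            orphans.append(child)
        else:
            joined.append({**parent, **child})
    return joined, orphans


# Check unique identifiers before joining, then integrate on the foreign key.
assert duplicate_ids(sites, "site_id") == []
assert duplicate_ids(observations, "obs_id") == []
joined, orphans = integrate(sites, observations, "site_id")
```

Because the script itself records which key was used and which records failed to match, it can be kept with the data as part of the integration documentation.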

Rationale

Provide enough information about the process used to integrate disparate datasets so that others can properly use your data and/or your process to integrate similar data sources.
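
One possible way to supply this information is a small machine-readable provenance record that travels with the derived dataset. The Python sketch below is an assumption for illustration only: the record layout, the field names, and the placeholder identifiers are hypothetical and do not follow any particular metadata standard.

```python
import json
from datetime import date


def provenance_record(derived_name, sources, steps, created=None):
    """Build a simple provenance record for a derived dataset.

    The structure is a hypothetical example, not a formal metadata schema.
    """
    return {
        "derived_dataset": derived_name,
        "created": created or date.today().isoformat(),
        "sources": sources,
        "integration_steps": steps,
    }


record = provenance_record(
    "site_observations",
    sources=[
        # Identifiers below are placeholders, not real dataset identifiers.
        {"name": "sites", "identifier": "example-id-sites",
         "fields_used": ["site_id", "name"]},
        {"name": "observations", "identifier": "example-id-observations",
         "fields_used": ["obs_id", "site_id", "temp_c"]},
    ],
    steps=[
        "Checked that site_id and obs_id are unique within their datasets",
        "Joined observations to sites on the foreign key site_id",
        "Set aside observations with no matching site",
    ],
    created="2020-01-01",
)

# Serialize the record so it can be stored alongside the derived data.
metadata_text = json.dumps(record, indent=2)
print(metadata_text)
```

Listing the source datasets and the fields drawn from each makes it possible to trace a later correction in a source through to the derived dataset.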


Some proposed guidelines and methods of citing datasets and data elements can be found at:

Additional Information (Biblio)

A survey of approaches to automatic schema matching.



Cite this best practice:

John Porter, Juliana Freire, Marissa Strong, Michael Grady, DataONE (August 30, 2011) "Best Practice: Document the integration of multiple datasets". Accessed through the Data Management Skillbuilding Hub at on Aug 31, 2020


