Document the integration of multiple datasets
Data Life Cycle stage(s): Integrate
Document the steps used to integrate disparate datasets.
- Ideally, one would adopt mechanisms to systematically capture the integration process, e.g. in an executable form such as a script or workflow, so that it can be reproduced
- If no scientific workflow system is used, describe the process, scripts, or queries used to integrate the data in the metadata that will accompany the data
- Provide a conceptual model that describes the relationships among datasets from different sources
- Use unique identifiers in the data records to maintain data integrity by reducing duplication
- Identify foreign key fields in the data records which support the relationship between the data sources
- When you use datasets, or data elements from within those datasets, as sources for a new dataset, identify and document them in the documentation of the new (derived) dataset. This is known as dataset provenance: a record of the origin or source of the data. Just as you would cite the papers that are sources for your research paper, it is critical to identify the sources of the data used within your own datasets. This will allow for:
- tracing the chain of use of datasets and data elements
- credit and attribution to accrue to the creators of the original datasets
- tracing any impact on your new datasets, and on their interpretation, if errors or new information about the original datasets or data elements come to light
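The practices listed above can be sketched in a short script. The following is a minimal illustration, not a prescribed method: the two inlined CSV sources, the column names (`obs_id` as the unique identifier, `site_id` as the foreign key), and the layout of the provenance record are all assumptions made for the example. It joins the sources on the foreign key, drops records whose unique identifier repeats, and emits a provenance record that could accompany the derived dataset as metadata.

```python
import csv
import hashlib
import io
import json

# Hypothetical source data, inlined so the sketch is self-contained.
SITES_CSV = """site_id,site_name
S1,North Meadow
S2,River Bend
"""

OBSERVATIONS_CSV = """obs_id,site_id,species,count
O1,S1,Poa pratensis,12
O2,S2,Carex aquatilis,7
O2,S2,Carex aquatilis,7
O3,S9,Salix exigua,3
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def checksum(text):
    """SHA-256 of the source text, so the exact input can be verified later."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def integrate(sites_text, obs_text):
    """Join observations to sites and describe how the join was done."""
    sites = {row["site_id"]: row for row in read_rows(sites_text)}
    merged, seen, unmatched = [], set(), []
    duplicates = 0
    for obs in read_rows(obs_text):
        if obs["obs_id"] in seen:  # unique identifier exposes duplicates
            duplicates += 1
            continue
        seen.add(obs["obs_id"])
        site = sites.get(obs["site_id"])  # foreign key lookup
        if site is None:  # no matching site: report it, do not guess
            unmatched.append(obs["obs_id"])
            continue
        merged.append({**obs, "site_name": site["site_name"]})
    # Assumed provenance layout; adapt it to your metadata standard.
    provenance = {
        "sources": [
            {"name": "sites", "sha256": checksum(sites_text)},
            {"name": "observations", "sha256": checksum(obs_text)},
        ],
        "join": {"foreign_key": "site_id", "unique_id": "obs_id"},
        "duplicates_dropped": duplicates,
        "unmatched_obs_ids": unmatched,
    }
    return merged, provenance


merged, provenance = integrate(SITES_CSV, OBSERVATIONS_CSV)
print(json.dumps(provenance, indent=2))
```

Because the script itself is the record of the integration steps, rerunning it against the same sources reproduces the derived dataset, and the checksums show whether a source has changed since the integration was performed.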
Rationale
Provide enough information about the process used to integrate disparate datasets so that others can properly use your data and/or your process to integrate similar data sources.
Examples
Some proposed guidelines and methods of citing datasets and data elements can be found at:
- DataCite: Cite your data
- Dryad: Citing Data
- A Proposed Standard for the Scholarly Citation of Quantitative Data
- Dataverse Network Data Citation Standard
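As a rough illustration of recording a source dataset's citation in the documentation of a derived dataset, the sketch below assembles a citation string from its parts. The function name, the field order (loosely following a "Creator (Year): Title. Publisher. Identifier" pattern), and every value shown are assumptions for the example; consult the guidelines listed above for the exact form each one recommends.

```python
def format_data_citation(creators, year, title, publisher, identifier):
    """Assemble a 'Creator (Year): Title. Publisher. Identifier' style string."""
    names = "; ".join(creators)
    return f"{names} ({year}): {title}. {publisher}. {identifier}"


# All values below are placeholders, not a real dataset.
citation = format_data_citation(
    creators=["Doe, J.", "Roe, R."],
    year=2011,
    title="Example vegetation survey",
    publisher="Example Data Repository",
    identifier="doi:10.0000/example",
)
print(citation)
```

Storing one such string per source dataset in the derived dataset's metadata keeps the chain of use traceable.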
Additional Information (Biblio)
A survey of approaches to automatic schema matching.
Cite this best practice:
John Porter, Juliana Freire, Marissa Strong, Michael Grady, DataONE (August 30, 2011) "Best Practice: Document the integration of multiple datasets". Accessed through the Data Management Skillbuilding Hub at https://dataoneorg.github.io/Education/bestpractices/document-the-integration on Mar 01, 2024