Data Management Skillbuilding Hub

Best Practice: Ensure datasets used are reproducible



Data Life Cycle stage(s): Analyze, Assure, Discover

When searching for data, whether locally on one’s machine or in external repositories, one may use a variety of search terms. In addition, data are often housed in databases or clearinghouses where a query is required in order to access the data. To reproduce a search and obtain similar, if not identical, results, it is necessary to document which terms and queries were used.

  • Note the location of the originating data set
  • Document which search terms were used
  • Document any additional parameters that were used, such as any controls that were used (pull-down boxes, radio buttons, text entry forms)
  • Document the query term that was used, where possible
  • Note the database version and/or the date of the query, so you can exclude any data sets added since the query was last performed
  • Note the name of the website and URL, if applicable
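
The checklist above can be captured as a small structured record that is saved alongside the downloaded data. The following is a minimal sketch only: the `SearchRecord` class, its field names, and every value shown (repository name, URL, version string, date) are hypothetical placeholders, not part of any DataONE tool.

```python
import json
from dataclasses import dataclass, asdict
from datetime import date


@dataclass
class SearchRecord:
    """One documented search. All field names here are illustrative."""
    source_location: str    # where the originating data set lives
    search_terms: list      # terms typed into the search box
    parameters: dict        # pull-downs, radio buttons, text entry forms
    query: str              # the query string, where available
    database_version: str   # version of the database, if known
    query_date: str         # ISO date on which the search was run
    website_name: str = ""  # name of the website, if applicable
    website_url: str = ""   # URL of the website, if applicable


def to_json(record):
    """Serialize the record so it can be stored next to the data."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Placeholder values for illustration only.
record = SearchRecord(
    source_location="Example Repository, collection 'stream-gauges'",
    search_terms=["stream temperature", "2009"],
    parameters={"region_dropdown": "Northwest", "format_radio": "CSV"},
    query="select * from database_name where collection_data IS NOT NULL",
    database_version="v2.1",
    query_date=date(2011, 5, 11).isoformat(),
    website_name="Example Repository",
    website_url="https://example.org/search",
)
print(to_json(record))
```

Writing the record to a JSON file in the same directory as the retrieved data keeps the search description and the results together.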

Rationale

In order to reproduce a data set or result set, it is necessary to document which terms were originally used to capture that data. By documenting this information while the search is being conducted, one greatly enhances the chance of being able to reproduce the results at a later date.
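
One way to document a query while the search is being conducted is to log it at the moment it runs. The sketch below assumes a local SQLite database; the `run_and_log` helper, the `samples` table, and its rows are hypothetical and exist only to make the example runnable.

```python
import sqlite3
from datetime import datetime, timezone


def run_and_log(conn, sql, params, log):
    """Execute a query and record what was run at the time it runs."""
    rows = conn.execute(sql, params).fetchall()
    log.append({
        "sql": sql,
        "params": list(params),
        "run_at_utc": datetime.now(timezone.utc).isoformat(),
        "row_count": len(rows),
    })
    return rows


# Hypothetical in-memory table standing in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (site TEXT, collection_data TEXT)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [("A", "2009-06-01"), ("B", None), ("C", "2009-07-15")],
)

search_log = []
rows = run_and_log(
    conn,
    "SELECT site FROM samples WHERE collection_data IS NOT NULL ORDER BY site",
    (),
    search_log,
)
print(rows)           # rows returned by the query
print(search_log[0])  # the logged description of that query
```

Recording the row count next to the query text gives a quick check later: if the same query returns a different number of rows, the underlying database has probably changed.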

Additional Information:

A query example can be formatted either as a URL string (such as ...or.r_gc.r_pw.&fp=d6a6fda25e240d6) or as a database string (select * from database_name where collection_data IS NOT NULL).
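
When a query is expressed as a URL, splitting it into its parts makes the individual search parameters easier to record. This is a minimal sketch using Python's standard `urllib.parse` module; the `describe_query_url` helper and the example.org URL are assumptions made for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def describe_query_url(url):
    """Break a search URL into the pieces worth documenting."""
    parts = urlsplit(url)
    return {
        "website": parts.netloc,
        "path": parts.path,
        # keep_blank_values=True so that empty form fields are recorded too
        "parameters": parse_qs(parts.query, keep_blank_values=True),
    }


# Hypothetical search URL; a real one would be copied from the browser.
example_url = "https://example.org/search?q=stream+temperature&year=2009&format="
info = describe_query_url(example_url)
for name, values in info["parameters"].items():
    print(name, values)
```

Storing both the full URL and the parsed parameters is a reasonable choice: the URL can be replayed as-is, while the parameter list remains readable if the site later changes its URL scheme.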



Cite this best practice:

Cindy Parr, Heather Henkel, DataONE  (May 11, 2011) "Best Practice: Ensure datasets used are reproducible". Accessed through the Data Management Skillbuilding Hub at on Aug 31, 2020


Hosted by DataONE

In collaboration with the community, DataONE has developed high quality resources for helping educators and librarians with training in data management, including teaching materials, webinars and a database of best-practices to improve methods for data sharing and management.
