University of Maryland Press Releases on Flickr
Data are collected from sensors, sensor networks, remote sensing, observations, and more. This calls for increased attention to data management and stewardship.
The amount of available storage is not keeping up with the amount of data flooding in daily. How do we decide what data we keep?
John Gantz, IDC Corporation: The Expanding Digital Universe
CC Image by Shawyn Morrow on Flickr
CC Image by momboleum on Flickr
MEDICARE PAYMENT ERRORS NEAR $20B (CNN) December 2004
Miscoding and Billing Errors from Doctors and Hospitals totaled $20,000,000,000 in FY 2003 (9.3% error rate). The error rate measured claims that were paid despite being medically unnecessary, inadequately documented or improperly coded. In some instances, Medicare asked health care providers for medical records to back up their claims and got no response. The survey did not document instances of alleged fraud. This error rate actually was an improvement over the previous fiscal year (9.8% error rate).
AUDIT: JUSTICE STATS ON ANTI-TERROR CASES FLAWED (AP) February 2007
The Justice Department Inspector General found that only two of 26 sets of data concerning terrorism attacks were accurate. The Justice Department uses these statistics to argue for its budget. The Inspector General said the data “appear to be the result of decentralized and haphazard methods of collections … and do not appear to be intentional.”
OOPS! TECH ERROR WIPES OUT ALASKA INFO (AP) March 2007
A technician managed to delete both the data and the backup for the $38 billion Alaska oil revenue fund, money received by residents of the state. Correcting the errors cost the state an additional $220,700 (which, of course, was deducted from the receipts paid to Alaska residents).
Note: Data Costs
Consider some of the data management issues that have made headlines, affecting agencies and organizations. Data quality problems are not limited to any one organization. These examples show the costs (in terms of money lost) of a lack of data quality control.
A wildlife biologist for a small field office provided support for staff GIS needs. The data were stored on her workstation. When the biologist relocated to another office, no one understood how the data were stored or managed.
Solution: A state office GIS specialist retrieved the workstation and sifted through files trying to salvage relevant data.
Cost: 1 work month ($4,000) plus the value of data that was not recovered
The situation could have been worse, because the data were not being backed up as they would have been if stored on a server.
CC Image by ruffin_Ready on Flickr
In preparation for a Resource Management Plan, an office discovered 14 duplicate GPS inventories of roads. However, because none of the inventories had enough metadata, it was impossible to know which inventory was best or if any of the inventories actually met their requirements.
Solution: Re-inventory roads
Cost: Estimated 9 work months per inventory at $4,000 per work month (14 inventories = $504,000)
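The cost figures in the two examples above can be checked with a short Python sketch. The function name `rework_cost` and its parameters are illustrative assumptions, not part of the original module; only the work-month counts and the $4,000 rate come from the examples.

```python
# Rate quoted in the examples above: $4,000 per work month.
WORK_MONTH_RATE_USD = 4_000


def rework_cost(work_months, units=1, rate=WORK_MONTH_RATE_USD):
    """Return the labor cost in USD for `units` jobs of `work_months` each.

    Hypothetical helper for illustration only; it ignores the value of
    any data that could not be recovered.
    """
    return work_months * units * rate


# Example 1: one work month spent salvaging files from a workstation.
salvage = rework_cost(1)

# Example 2: 14 road inventories redone at an estimated 9 work months each.
reinventory = rework_cost(9, units=14)

print(f"Salvage: ${salvage:,}")
print(f"Re-inventory: ${reinventory:,}")
```

These totals cover labor only. The first example also lost data whose value is not counted here.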
"Please forgive my paranoia about protocols, standards, and data review. I'm in the latter stages of a long career with USGS (30 years, and counting), and have experienced much. Experience is the knowledge you get just after you needed it. Several times, I've seen colleagues called to court in order to testify about conditions they have observed. Without a strong tradition of constant review and approval of basic data, they would've been in deep trouble under cross-examination. Instead, they were able to produce field notes, data approval records, and the like, to back up their testimony. It's one thing to be questioned by a college student who is working on a project for school. It's another entirely to be grilled by an attorney under oath with the media present."
Source: guardian.uk
"The climate scientists at the centre of a media storm over leaked emails were yesterday cleared of accusations that they fudged their results and silenced critics, but a review found they had failed to be open enough about their work."
Manage your data for yourself:
CC Image by UWW ResNet on Flickr
Data should be managed to:
Data management and organization facilitate archiving, sharing and publishing data. These activities feed data re-use and reproducibility in science.
Image from DataONE
By re-using data collected from a variety of sources (the eBird database, land cover data, meteorology, and remotely sensed data), this project was able to compile and process the data using supercomputing to determine bird migration routes for particular species.
Images courtesy of Cornell Ornithology Lab
Recreated from Klump et al 2006
There is an abundance of data, but much of it, along with its metadata (if any were created), ends up in filing cabinets, on discarded hard drives, in hard-copy journals on library shelves, or on the web in journals that are often subscription-only.
Recreated from Klump et al 2006
Data should be properly managed and eventually be placed where they are accessible, understandable, and re-usable.
Here are a few reasons (from the UK Data Archive):
A new image processing technique reveals something not before seen in this Hubble Space Telescope image taken 11 years ago: A faint planet (arrows), the outermost of three discovered with ground-based telescopes last year around the young star HR 8799.
D. Lafrenière et al., ApJ Letters
“The first thing it tells you is how valuable maintaining long-term archives can be. Here is a major discovery that’s been lurking in the data for about 10 years!” comments Matt Mountain, director of the Space Telescope Science Institute in Baltimore, which operates Hubble.
D. Lafrenière et al., ApJ Letters
“The second thing it tells you is that having a well-calibrated archive is necessary but not sufficient to make breakthroughs; it also takes a very innovative group of people to develop very smart extraction routines that can get rid of all the artifacts to reveal the planet hidden under all that telescope and detector structure.”
A research data lifecycle illustrates the stages through which well-managed data pass from the inception of a research project to its conclusion. In the reality of scientific research, the stages do not always follow a continuous circle.
The rest of the DataONE Education Modules, handouts, and hands-on exercises are available at: https://www.dataone.org/education-modules
If data are:
Result is:
Participate in our GitHub repo: https://github.com/DataONEorg/Education
Suggested citation: DataONE Education Module: Why Data Management. DataONE. Retrieved November 12, 2016. From https://dataoneorg.github.io/Education/
Copyright license information: No rights reserved; you may enhance and reuse for your own purposes. We do ask that you provide appropriate citation and attribution to DataONE.