Data Management: Processing
Data Processing covers any set of structured activities resulting in the alteration or integration of data. Process components support validation, transformation, subsetting, summarizing, integration, and derivation, among others. Data processing can result in data ready for analysis, or generate output such as graphs and summary reports. Documenting the steps for how data are processed is essential for reproducibility and improves transparency.
"Processing" Can Exist Anywhere Within the Science Data Lifecycle
- Data validation includes documenting how source data were checked and made suitable for use.
- Data transformation changes the format of data or applies conversions without changing the meaning.
- Data derivation is the process of creating new value types that didn't exist in the source data.
- Summarizing data changes the scale of the data, while subsetting data extracts select parts of a larger dataset.
- Data integration builds a new data structure or combines datasets.
- Processes can be chained together to perform complex tasks.
- Document your processing steps in the metadata record to encourage understanding and reuse.
The Process stage of the data lifecycle does not represent a set of activities that must occur after Acquisition and before Analysis; rather, a 'process' can be implemented to support any data handling activity in the Lifecycle. Examples include basic data screening and preparation, iterating with data changes prompted during analysis, or preparing data for long-term preservation and sharing.
Process and Analyze are closely related activities when performing scientific research
It can sometimes be difficult to determine where processing ends and analysis begins. In part this is because the two concepts are often intermingled to ensure that both data and research products meet a common set of goals. Learn more about the considerations for both process and analyze activities.
Data may need to be compared to natural limits, adjacent measurements, or historical data to verify that they are suitable for use. Although this activity falls under the umbrella of Data Quality Management, it is quite common to develop a process to handle those validation steps, one that can be codified and does not require manual intervention. See the section on Managing Quality for more information about data quality in the Science Data Lifecycle.
Examples of Data Validation
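As one illustration, the checks against natural limits and adjacent measurements described above could be codified roughly as follows. This is a minimal sketch, not a USGS method: the function name `validate_series`, the flag labels, the thresholds, and the sample temperatures are all assumptions chosen for the example.

```python
def validate_series(values, lower, upper, max_step):
    """Return one flag per value: 'ok', 'out_of_range', or 'spike'."""
    flags = []
    last_good = None  # most recent value that passed both checks
    for value in values:
        if not (lower <= value <= upper):
            # Natural-limit check: value falls outside the plausible range.
            flags.append("out_of_range")
        elif last_good is not None and abs(value - last_good) > max_step:
            # Adjacent-measurement check: jump from the last good value is too large.
            flags.append("spike")
        else:
            flags.append("ok")
            last_good = value
    return flags


# Hypothetical water temperatures in degrees Celsius (illustrative only).
water_temp_c = [12.1, 12.3, 45.0, 12.4, 19.9, 12.6]
flags = validate_series(water_temp_c, lower=-0.5, upper=40.0, max_step=5.0)
print(list(zip(water_temp_c, flags)))
```

Because the checks live in a function rather than in manual steps, the same rules can be rerun whenever the source data change, and the thresholds themselves become part of the documented process.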
Sometimes it is necessary to summarize data through grouping, aggregating, and producing statistics about the data. This can be considered a 'data reduction' step in a process, where fine-grained data are recast at a scale more amenable to integration, analysis or display.
Examples of Summarization
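A minimal sketch of such a data reduction step, assuming daily values that are grouped by month and reduced to a count, mean, minimum, and maximum. The record layout, the function name `summarize_by_month`, and the numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical daily observations as (ISO date, value) pairs.
daily = [
    ("2016-05-01", 3.0),
    ("2016-05-02", 5.0),
    ("2016-06-01", 10.0),
    ("2016-06-02", 14.0),
    ("2016-06-03", 12.0),
]


def summarize_by_month(records):
    """Group daily values by 'YYYY-MM' and compute summary statistics."""
    groups = defaultdict(list)
    for date, value in records:
        groups[date[:7]].append(value)  # first 7 characters are the year and month
    return {
        month: {
            "n": len(values),
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
        }
        for month, values in sorted(groups.items())
    }


summary = summarize_by_month(daily)
for month, stats in summary.items():
    print(month, stats)
```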
Transforming data includes converting, reorganizing, or reformatting data. This action does not change the meaning of the data but can enable use in a context different from the one originally intended, or facilitate display and analysis. It is not unusual for a single dataset to be reformatted for use in different software environments (Fig 1).
Examples of Data Transformation
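The sketch below shows one possible transformation: the same measurements are reformatted from CSV to JSON and converted from feet to meters, so the format and units change but the meaning does not. The column names, sample values, and the function name `csv_to_json_records` are assumptions made for illustration.

```python
import csv
import io
import json

FEET_TO_METERS = 0.3048  # exact by definition of the international foot

# Hypothetical source table in CSV form.
csv_text = "site,gage_height_ft\nA,10.0\nB,2.5\n"


def csv_to_json_records(text):
    """Reformat CSV rows as JSON records, converting gage height to meters."""
    reader = csv.DictReader(io.StringIO(text))
    records = []
    for row in reader:
        height_m = float(row["gage_height_ft"]) * FEET_TO_METERS
        records.append({"site": row["site"], "gage_height_m": round(height_m, 4)})
    return json.dumps(records)


json_text = csv_to_json_records(csv_text)
print(json_text)
```

Renaming the column to carry the new unit (`gage_height_m`) keeps the conversion visible to anyone who later reads the output.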
Data integration builds a new data structure or combines datasets. Activities could involve merging, stacking, or concatenating data, and may use web services that allow access to authoritative data sources based on user-defined criteria.
Examples of Data Integration
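As a rough sketch of stacking and merging, the example below concatenates two yearly tables and then joins site attributes onto each row by a shared key. The table contents, the field names, and the helper names `stack` and `merge_on_key` are hypothetical. Retrieval from a web service is not shown.

```python
def stack(*tables):
    """Concatenate tables (lists of row dicts) that share the same columns."""
    return [dict(row) for table in tables for row in table]


def merge_on_key(rows, lookup, key):
    """Left join: add lookup attributes to each row; unmatched rows are kept as-is."""
    merged = []
    for row in rows:
        extra = lookup.get(row[key], {})
        merged.append({**row, **extra})
    return merged


# Hypothetical observation tables and site attributes.
obs_2015 = [{"site_id": "A", "year": 2015, "flow_cms": 1.2}]
obs_2016 = [
    {"site_id": "A", "year": 2016, "flow_cms": 1.5},
    {"site_id": "B", "year": 2016, "flow_cms": 0.7},
]
site_info = {
    "A": {"site_name": "Upper Creek"},
    "B": {"site_name": "Lower Creek"},
}

combined = merge_on_key(stack(obs_2015, obs_2016), site_info, key="site_id")
for row in combined:
    print(row)
```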
Subsetting data includes not only extracting select parts of a larger dataset (a retrieval filter), but also filtering columns or rows and excluding values from working datasets based on user-defined criteria. The result of subsetting is a more compact and well-defined set of data that meets a particular set of use requirements.
Examples of Subsetting
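A minimal sketch of subsetting, assuming a small table of samples: rows are filtered by a user-defined criterion, a missing-value code is excluded, and only the columns of interest are kept. The field names, the `-999.0` missing-value code, and the function name `subset` are assumptions for the example.

```python
MISSING = -999.0  # hypothetical missing-value code

# Hypothetical working dataset.
samples = [
    {"site": "A", "depth_m": 0.5, "ph": 7.1, "notes": "clear"},
    {"site": "A", "depth_m": 5.0, "ph": 6.8, "notes": "turbid"},
    {"site": "B", "depth_m": 0.5, "ph": MISSING, "notes": "sensor fault"},
    {"site": "B", "depth_m": 1.0, "ph": 7.4, "notes": "clear"},
]


def subset(rows, columns, keep_row):
    """Keep rows for which keep_row(row) is true, restricted to the given columns."""
    return [{col: row[col] for col in columns} for row in rows if keep_row(row)]


# Shallow samples with a valid pH reading, without the free-text notes column.
shallow = subset(
    samples,
    columns=["site", "depth_m", "ph"],
    keep_row=lambda r: r["depth_m"] <= 1.0 and r["ph"] != MISSING,
)
print(shallow)
```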
Data derivation is a processing component that creates new value types that were not present in the source data. Typically an algorithm is applied to derive a new value.
Examples of Data Derivation
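For instance, stream discharge can be derived from measured velocity and cross-sectional area as Q = v × A. The sketch below adds that derived value to each record. The field names, sample numbers, and the function name `add_discharge` are hypothetical, and the single-value formula is a deliberate simplification for illustration.

```python
def add_discharge(rows):
    """Return copies of rows with a derived discharge column (Q = velocity * area)."""
    derived = []
    for row in rows:
        new_row = dict(row)  # copy so the source data stay unchanged
        new_row["discharge_m3_s"] = row["velocity_m_s"] * row["area_m2"]
        derived.append(new_row)
    return derived


# Hypothetical source measurements; discharge is not present in the source.
measurements = [
    {"site": "A", "velocity_m_s": 0.5, "area_m2": 4.0},
    {"site": "B", "velocity_m_s": 1.5, "area_m2": 2.0},
]

with_discharge = add_discharge(measurements)
for row in with_discharge:
    print(row)
```

Working on copies keeps the source records intact, which makes it easier to show exactly which values were measured and which were derived.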
Capturing and communicating information about how data were processed is critical for reproducible science. Documentation of the changes made to data from acquisition through use and sharing in a project (what you 'did' to the data, not how you 'used' it) should be part of the formal metadata record accompanying a data release, or of a published methodology being shared more widely. Also see the next section on diagramming and workflow management.
Process Examples in Metadata
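As a hedged illustration, processing steps could be written into a lineage fragment like the one built below. The element short names (`lineage`, `procstep`, `procdesc`, `procdate`) are assumed here to follow the FGDC-CSDGM process-step convention and should be checked against the standard and a metadata validator. The step descriptions and dates are invented for the example.

```python
import xml.etree.ElementTree as ET


def build_lineage(steps):
    """Build a <lineage> element with one <procstep> per (description, date) pair."""
    lineage = ET.Element("lineage")
    for description, date in steps:
        procstep = ET.SubElement(lineage, "procstep")
        ET.SubElement(procstep, "procdesc").text = description
        ET.SubElement(procstep, "procdate").text = date  # YYYYMMDD
    return lineage


# Hypothetical processing history, listed in the order the steps were applied.
steps = [
    ("Removed records flagged as out of range.", "20160520"),
    ("Converted gage height from feet to meters.", "20160521"),
]

lineage = build_lineage(steps)
xml_text = ET.tostring(lineage, encoding="unicode")
print(xml_text)
```

This produces only a fragment. In a complete record it would sit inside the data quality section alongside the other required elements.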
Process Diagrams, Workflow Tools and Automation
The use of flowcharts, data flow diagrams, and workflow tools can be very helpful for communication and capturing the history of a work activity. Workflow Capture is described elsewhere on this website. Good descriptions of a variety of diagramming tools can be found at this Minnesota Department of Health Web site.
Data processing at USGS usually involves the use of software and programming languages, and processes to handle routine or repeated interactions with data are often automated. Modular approaches to process development provide the most utility, as well-designed components can be reused in a variety of contexts or combined with other, compatible modules to accomplish very complex tasks.
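The modular approach can be sketched as small, single-purpose functions that are combined into a pipeline. This is only an illustration under assumed names: `chain`, `drop_missing`, `fahrenheit_to_celsius`, and `average` are hypothetical helpers, and `-999.0` is an assumed missing-value code.

```python
from functools import reduce


def chain(*steps):
    """Combine processing steps into one function that applies them in order."""
    def run(data):
        return reduce(lambda result, step: step(result), steps, data)
    return run


def drop_missing(values, missing=-999.0):
    """Subsetting step: exclude the missing-value code."""
    return [v for v in values if v != missing]


def fahrenheit_to_celsius(values):
    """Transformation step: convert units without changing meaning."""
    return [(v - 32.0) * 5.0 / 9.0 for v in values]


def average(values):
    """Summarization step: reduce the series to its mean."""
    return sum(values) / len(values)


# Each module is reusable on its own; chained, they form a complete process.
mean_temp_c = chain(drop_missing, fahrenheit_to_celsius, average)
result = mean_temp_c([32.0, -999.0, 50.0, 68.0])
print(result)
```

Because each step takes data in and returns data out, a step can be swapped or reused in another pipeline without touching the others.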
- Use existing metadata conventions appropriate to the data (FGDC-CSDGM, ISO 19115, ISO 19157, CF-1.6, ACDD). Note that USGS must follow the FGDC-CSDGM or ISO standard.
- Use known and open formats (txt, GeoTIFF, netCDF).
- Use checklists to manage workflow.
- Use scripts to automate processing and enhance reproducibility.
- Use open-source solutions when possible.
- Keep code releases in public repositories such as GitHub. Note that USGS is developing internal repository capability to provide version control services, stage pre-release development, and foster code reviews.
- Save your input data (will be published at "Publish/Share" stage).
- Conduct a peer review on the processing software. To validate data produced by a 'software process,' that process should ideally be vetted.
- Produce data using standards that are appropriate to your discipline.
What the U.S. Geological Survey Manual Requires:
Policies that apply to Data Processing address appropriate documentation of the methods and actions used to modify data from an acquired state to the form used for research or produced for sharing. Metadata standards (FGDC, ISO) include sections for describing the "provenance" of data, meaning that enough "process" information is provided for the user to determine where data originated and what changes were made to get to the described form.
The USGS Manual Chapter 500.25 - USGS Scientific Integrity discusses the USGS's dedication to "preserving the integrity of the scientific activities it conducts and that are conducted on its behalf" by abiding by the Department of the Interior 305 DM 3 - Integrity of Scientific and Scholarly Activities.
The USGS Manual Chapter 502.2 - Fundamental Science Practices: Planning and Conducting Data Collection and Research includes requirements for process documentation.
"Documentation: Data collected for publication in databases or information products, regardless of the manner in which they are published (such as USGS reports, journal articles, and Web pages), must be documented to describe the methods or techniques used to collect, process, and analyze data (including computer modeling software and tools produced by USGS); the structure of the output; description of accuracy and precision; standards for metadata; and methods of quality assurance."
"Standard USGS methods are employed for distinct research activities that are conducted on a frequent or ongoing basis and for types of data that are produced in large quantities. Methods must be documented to describe the processes used and the quality-assurance procedures applied."
The USGS Manual Chapter 502.4 - Fundamental Science Practices: Review, Approval, and Release of Information Products addresses documentation of the methodology used to create data and generate research results:
"Methods used to collect data and produce results must be defensible and adequately documented."
- Gawande, A. 2010. The Checklist Manifesto: How to Get Things Right. New York: Metropolitan.
- Read, J.S., Walker, J.I., Appling, A.P., Blodgett, D.L., Read, E.K., Winslow, L.A., 2016. geoknife: reproducible web-processing of large gridded datasets. Ecography. 39(4):354-360. DOI: 10.1111/ecog.01880.
- Wilson, G., et al. Good Enough Practices in Scientific Computing. https://swcarpentry.github.io/good-enough-practices-in-scientific-computing/
- USGS Fundamental Science Practices - USGS Code of Scientific Conduct. Accessed May 20, 2016.
- Hook, L.A., Santhana Vannan, S.K., Beaty, T.W., Cook, R.B., and Wilson, B.E. 2010. Best Practices for Preparing Environmental Data Sets to Share and Archive. Oak Ridge National Laboratory. https://daac.ornl.gov/PI/BestPractices-2010.pdf. Accessed June 21, 2016.
- The Quartz Guide to Bad Data. Accessed June 21, 2016.
- Singh, M.P. & Vouk, M.A. Scientific Workflows: Scientific Computing Meets Transactional Workflows. Accessed June 21, 2016.
- IOOS Quality Assurance of Real Time Oceanographic Data. Accessed June 21, 2016.
- Open Geospatial Consortium (OGC). Accessed June 21, 2016.
- Williams, S.J., Arsenault, M.A., Buczkowski, B.J., Reid, J.A., Flocks, James, Kulp, M.A., Penland, Shea, Jenkins, C.J., 2007, Surficial sediment character of the Louisiana offshore continental shelf region: A GIS Compilation: U.S. Geological Survey Open File Report 2006-1195, nomenclature.