Data citation is the practice of citing a dataset in the same way that books or journal articles are referenced in research publications. Historically, datasets have been cited as sources inconsistently, if at all. However, researchers and institutions are beginning to recognize the importance of data citation as more data producers cite other people's datasets as well as their own. In general, data citation is a good practice that benefits researchers, data repositories and stewards, the scientific community, and the general public.
Why Cite Your Data?
Data citation is important for a number of reasons. First, citing datasets gives the researcher proper credit and recognizes scholarly effort. It also gives credit to the data stewards and repositories that manage the data, presumably for the long term. Data citation also creates accountability for creators and stewards of the dataset and reduces the danger of plagiarism once the dataset itself has been properly cited.
Second, data citation allows others to more easily locate and access a researcher's dataset for the purposes of replicating or verifying their results, which is good scientific practice. Additionally, easy location and access can facilitate discovery and encourage possible reuse of the dataset.
Lastly, the practice of data citation creates a formalized system of recognition and reward for data producers by treating datasets as citable contributions to the scientific community. Data citation allows the impact of a dataset to be tracked through the publications that cite it. Citing data formally in publications can increase the transparency of data production and encourage the production of more high-quality datasets.
Data Citation Standards
To help researchers cite data properly, several institutions and organizations have created standards for citing datasets. The mechanics of citing datasets are generally similar to the citation of journal articles and other publications. The author(s), year, title, archive/distributor, and access date are the most obvious components of a data citation.
However, datasets can be more difficult to cite because they can be more dynamic in terms of content and version. For example, a dataset can consist of multiple versions of the raw data, or it can be part of a larger dataset. The dataset itself can change over time as researchers modify or add more data. Therefore, a dataset needs a persistent identifier or locator that can be added to the citation in order to better track the dataset.
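As a rough illustration of how these components fit together, the following Python sketch assembles a citation string from the elements listed above, plus an optional version and a persistent identifier. The class name, field names, and example values are hypothetical, and the punctuation only loosely follows the USGS examples later in this section; it is not an implementation of any official citation standard.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetCitation:
    """Hypothetical container for the citation components described above."""
    authors: List[str]                   # each entry already in "Surname, Initials" form
    year: int
    title: str
    distributor: str                     # archive or distributor of the dataset
    identifier: str                      # persistent identifier or locator, e.g. a DOI URL
    version: Optional[str] = None        # e.g. "1.1"; leave as None if unversioned
    access_date: Optional[str] = None    # e.g. "2011-07-21"

    def format(self) -> str:
        """Assemble the components into a single citation string."""
        if not self.authors:
            raise ValueError("a citation needs at least one author")
        if len(self.authors) > 1:
            author_text = ", ".join(self.authors[:-1]) + ", and " + self.authors[-1]
        else:
            author_text = self.authors[0]

        # The version travels with the title so readers know which
        # state of a changing dataset was actually used.
        title_text = self.title
        if self.version:
            title_text += f" (ver. {self.version})"

        citation = (
            f"{author_text}, {self.year}, {title_text}: "
            f"{self.distributor}, {self.identifier}"
        )
        if self.access_date:
            citation += f", accessed {self.access_date}"
        return citation + "."


# Made-up example values, not a real dataset.
example = DatasetCitation(
    authors=["Doe, J.A.", "Roe, R.B."],
    year=2020,
    title="Example stream temperature measurements",
    distributor="Example Data Repository",
    identifier="https://doi.org/10.1234/EXAMPLE",
    version="1.1",
)
print(example.format())
```

Keeping the version and identifier as separate fields makes it harder to drop them by accident when a dataset is revised and cited again.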
Persistent Identifiers and Locators
Datasets should have an identifier and a locator. A persistent identifier is a unique, Web-compatible alphanumeric code that points to a specific dataset that will be preserved for the long term. A dataset identifier names the dataset, for example by its title, file name, or an object ID code. Examples of identifiers are UUID (Universally Unique Identifier), OID (Object Identifier), and LSID (Life Sciences Identifier).
A dataset locator helps find the location of the dataset. Examples of locators are URLs, directories, or registered locators. A registered locator is a unique code that points to the specific dataset that is usually separate from the metadata. Examples of data locators are DOI (Digital Object Identifier), ARK (Archival Resource Key), Handles, URLs, PURL (Persistent URL), and XRI (Extensible Resource Identifier). See Preserve > Persistent Identifiers for more information.
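DOIs appear in citations in several forms, such as a bare "doi:" prefix or a full https://doi.org/ address. The Python sketch below shows one way to normalize these forms into a single resolvable URL. The function name and the list of accepted prefixes are assumptions made for illustration, and the pattern check is a deliberately simplified test of DOI shape rather than a full validation.

```python
import re

# Prefixes that commonly precede a DOI in citations (an illustrative list).
_DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)

# Simplified shape check: "10.", a numeric registrant code, "/", then a suffix.
_DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(raw: str) -> str:
    """Return an https://doi.org/ URL for a DOI written in any common form."""
    text = raw.strip()
    lowered = text.lower()
    for prefix in _DOI_PREFIXES:
        if lowered.startswith(prefix):
            # Drop the prefix but keep the DOI's own spelling and case.
            text = text[len(prefix):].strip()
            break
    if not _DOI_PATTERN.match(text):
        raise ValueError(f"not a recognizable DOI: {raw!r}")
    return "https://doi.org/" + text


print(doi_to_url("doi:10.3334/NSIDC/gla01"))
print(doi_to_url("https://doi.org/10.5066/F7V98691"))
```

The sketch does not strip trailing punctuation, so a DOI copied together with the final period of a citation should be cleaned up before it is passed in.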
Example Data Citations for USGS Released Data
Moschetti, M.P., 2017, Database of earthquake ground motions from 3-D simulations on the Salt Lake City segment of the Wasatch fault zone, Utah: U.S. Geological Survey data release, https://doi.org/10.5066/F7V98691.
McLeod, J.M., Jelks, Howard, Pursifull, Sandra, and Johnson, N.A., 2016, Characterizing the early life history of an imperiled freshwater mussel (Ptychobranchus jonesi): U.S. Geological Survey data release, https://doi.org/10.5066/F7FT8J5T.
Barber, L.B., Weber, A.K., LeBlanc, D.R., Hull, R.B., Sunderland, E.M., and Vecitis, C.D., 2017, Poly- and perfluoroalkyl substances in contaminated groundwater, Cape Cod, Massachusetts, 2014-2015 (ver. 1.1, March 24, 2017): U.S. Geological Survey data release, https://doi.org/10.5066/F7Z899KT.
Example Data Citation for Non-USGS Data
The following example of a dataset citation is from the Earth Science Information Partners (ESIP).
Zwally, H.J., R. Schutz, C. Bentley, J. Bufton, T. Herring, J. Minster, J. Spinhirne, and R. Thomas. 2003. GLAS/ICESat L1A Global Altimetry Data V018, 15 October to 18 November 2003. National Snow and Ice Data Center. dataset accessed 2011-07-21 at doi:10.3334/NSIDC/gla01.