Data & File Formats
Standardizing data and file formats makes the data easier to work with, both for the original researcher and for others who will use the data.
- Digital data are in danger of being lost as hardware, software, and file formats become outdated.
- The safest way to reduce the risk of data loss is to convert data formats into open formats that are non-proprietary.
- Examples of good standard formats: OpenDocument Format (ODF), ASCII, tab-delimited format, comma-separated values (csv), XML, TIFF, MPEG-4.
- Data formats should be unencrypted and uncompressed.
- When working with a set of files within a project, use the same formats and procedures for saving files.
The format of the data refers to the organization of digital information that is read and processed through computer software. The format and software of the research data are typically determined by how researchers choose to collect and analyze their data or by standard norms practiced within a scientific field. File formats are usually determined by existing standards or whichever file format is commonly used within projects. However, this method (using whichever file format is commonly used) may not be a best practice as the format used may not be one that can be converted into a newer or more stable format.
- A good example of this is the program WordPerfect, which was in common use about a decade ago. There has not been a new version of WordPerfect in several years, and often files created and saved in this format can no longer be accessed.
Choosing a file format for your data can be difficult and there are many things to consider when making that decision:
- There are often many file formats for a given data type.
- Saving files in a format suitable for long-term storage may mean losing formatting information (in the case of .txt files).
- Currently-used formats may become obsolete.
- Certain software programs may save data in proprietary formats which require special software or hardware.
- As software versions are upgraded, previous versions of the file format may no longer be supported.
- Saving a file in a lossy compressed format means losing some of the original data.
Even if you save files in a commonly-used format, all digital data are at risk of becoming inaccessible as hardware and software become obsolete over time. Although backward compatibility and previous software versions can often overcome outdated formatting, these solutions are not infallible. Converting data into standard formats is therefore a safer way to preserve long-term data access, sharing, and future transformation. Be aware that storage media can degrade quickly, unexpectedly, and inconsistently. Even if you can open a file today, that doesn't mean you can tomorrow!
Open or standard formats are non-proprietary formats that can be interpreted by most software, in contrast to proprietary formats such as MS Excel, SPSS, and MS Rich Text. Examples of standard formats are OpenDocument Format (ODF), ASCII, tab-delimited format, comma-separated values (csv), XML, TIFF, and MPEG-4.
Be aware that data should also be in a format that is unencrypted and uncompressed. These formats assure that the data will be readable in its original format and easily accessible in the long term.
Below is a list of common data types and things to consider when choosing a format for the files.
Text, ASCII, tab-delimited, and comma-separated values are the best formats for long-term storage because they can be easily transferred between systems and software. These formats store only the data, not the formatting.
PDF and Microsoft Word (.docx) formats are commonly used and will preserve formatting. However, the file may not be readable after several years, because some software programs are backwards-compatible for only a handful of years. If you are storing the file for the long term, save it as a (.pdf) or (.docx) file and also as a text file. That way, if the (.pdf) or (.docx) file becomes unreadable, you can still obtain the information from the text file.
Microsoft Excel (.xlsx) files are commonly used to store data. However, files saved in older versions of the Excel format may not be readable by newer versions of Excel. For long-term access, it is good practice to store both the (.xlsx) file and a tab-delimited version (.txt).
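As a minimal sketch of producing the tab-delimited copy, the Python snippet below writes rows of values to a tab-delimited text file using only the standard library. It assumes the rows have already been read from the spreadsheet by some other tool (not shown); the function name `write_tab_delimited` and the sample rows are hypothetical.

```python
import csv
from pathlib import Path


def write_tab_delimited(rows, path):
    """Write an iterable of rows (lists of values) as tab-delimited UTF-8 text."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)
    return Path(path)


# Hypothetical sample rows standing in for one worksheet.
sample_rows = [
    ["site", "date", "count"],
    ["A1", "2015-06-01", 12],
    ["B2", "2015-06-02", 7],
]
```

Calling `write_tab_delimited(sample_rows, "bison_data_v1.txt")` would produce a plain-text file that any text editor or statistics package can open. Formatting such as cell colors and formulas is not carried over, so keep the (.xlsx) file alongside it.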
TIFF and JPEG are common file formats for images. TIFF is preferable for storing image files because you have the option of saving the file without any compression. Saving a file using "lossless" or "lossy" compression will yield a smaller file, but with differing results.
- Lossless file formats will compress the data without any loss in quality and are a better choice than lossy file formats.
- Lossy compression creates a smaller file by discarding some of the data, which may be a problem when working with the compressed version of a file. Each time a file is re-saved using lossy compression, more information is lost.
Although the loss may not be immediately discernible to the naked eye, and varies from file to file, data discarded by lossy compression cannot be recovered. Files saved without any compression are larger, but for long-term storage an uncompressed version is best. Saving a duplicate version as a compressed TIFF or JPEG may be better for viewing and sharing.
When saving files as JPEGs, be sure to choose the highest resolution possible. Lower-resolution copies are easier to share because of their smaller file size, but they are missing some of the original image data. Image quality also degrades faster as you zoom into the lower-resolution version than into the higher-resolution one.
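The difference between the two kinds of compression can be illustrated with a short Python sketch. It uses the standard `zlib` module as an example of lossless compression and a deliberately crude `quantize` function as a stand-in for the discarding step of a lossy format. Neither is how TIFF or JPEG actually encode images, and the sample values are hypothetical.

```python
import zlib

# Hypothetical 8-bit pixel values standing in for part of an image.
original = bytes([12, 13, 13, 14, 200, 201, 203, 255] * 32)

# Lossless: zlib compression is fully reversible.
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)


def quantize(data, step=16):
    """Crude stand-in for lossy compression: round each value down to a multiple of step."""
    return bytes((value // step) * step for value in data)


lossy = quantize(original)

print(len(compressed) < len(original))  # True: the compressed copy is smaller
print(restored == original)             # True: nothing was lost
print(lossy == original)                # False: fine detail has been discarded
```

Once the values have been rounded there is no way to tell whether a stored 0 was originally 12, 13, or 14, which is why an uncompressed or losslessly compressed master copy is worth keeping.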
The quality of an audio file will often vary from file to file, due to factors such as clarity of the recording. NARA (the U.S. National Archives and Records Administration) recommends the following file formats for storing audio:
- Audio Interchange File Format (AIFF)
- Uncompressed Waveform audio format (WAV)
- Audio format (AU)
- Uncompressed Broadcast Wave Format (BWF)
- Free Lossless Audio Codec (FLAC)
- Moving Picture Experts Group (MPEG) 4 Audio Lossless Coding format (ALS)
Like audio recordings, video recordings will also be of varying quality. Factors such as lighting, sound, and standard versus high-definition recording will all affect the quality of the recording. Using a file format to compress the video will make the file size smaller, but it will be at the cost of loss of quality. NARA recommends the following formats for storing video:
- Audio-Video Interleave format (AVI)
- Material Exchange Format (MXF)
- QuickTime format (MOV)
When working with a set of files within a project, consistently use the same formats and procedures for saving files. If you are planning on archiving a file, or will not be using it in the near term, consider adding a header at the top of the file that describes basic information about it (brief description of the file, author, date, other associated files, etc.). Better yet, create a metadata record!
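One possible way to add such a header to a plain-text data file is sketched below in Python. The `#` prefix, the field labels, the function name `build_header`, and every sample value are assumptions made for illustration; use whatever convention your project or repository prescribes.

```python
from datetime import date


def build_header(description, author, created, related_files):
    """Return comment lines describing a data file, each prefixed with '#'."""
    lines = [
        f"# Description: {description}",
        f"# Author: {author}",
        f"# Date: {created.isoformat()}",
        f"# Associated files: {', '.join(related_files)}",
    ]
    return "\n".join(lines) + "\n"


# All values below are hypothetical placeholders.
header = build_header(
    description="Bison counts by site",
    author="J. Doe",
    created=date(2015, 6, 1),
    related_files=["bison_data_v1.xlsx"],
)
data = "site\tdate\tcount\nA1\t2015-06-01\t12\n"
file_text = header + data
print(file_text)
```

Many data-reading tools can be told to skip lines that begin with a comment character, so a header like this need not interfere with later analysis. A separate metadata record remains the more complete option.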
Organizing Data [see Plan > Organize Files & Data for more information]
- File Naming
  - Create concise but meaningful names for each file/folder.
  - Classify broad types of files with file names.
  - Avoid using spaces or special characters, which can be read incorrectly by some software.
- Where Data Belong
  - GIS data should be placed in a geodatabase.
  - Data with attributes found in several separate tables should be placed in a relational database for greater flexibility.
  - Some data belong in a designated repository (e.g., NWIS), which requires following that repository's specific procedures for submitting data. [See Preserve > Repositories for more information]
- Include a number after the file name to indicate the version, e.g.:
  - Bisondata_1.0 = original document
  - Bisondata_1.1 = original document with minor revisions
  - Bisondata_2.0 = document with substantial revisions
- Be consistent with naming each version of the dataset.
- File Structure
  - Organize the data by data type and then by research activity.
  - Keep folder levels to only three or four deep.
  - Do not place more than 10 items within each subfolder.
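The naming and versioning advice above could be applied consistently with a small helper, sketched here in Python. The function names `safe_name` and `versioned_name`, and the choice to replace disallowed characters with underscores, are assumptions for illustration rather than a prescribed convention.

```python
import re


def safe_name(text):
    """Replace spaces and special characters with underscores."""
    cleaned = re.sub(r"[^A-Za-z0-9_-]+", "_", text.strip())
    return cleaned.strip("_")


def versioned_name(base, major, minor, extension=""):
    """Build a name such as Bisondata_1.1 from a base name and version numbers."""
    return f"{safe_name(base)}_{major}.{minor}{extension}"


print(safe_name("Bison data (final)!"))           # Bison_data_final
print(versioned_name("Bisondata", 1, 0))          # Bisondata_1.0
print(versioned_name("Bisondata", 2, 0, ".txt"))  # Bisondata_2.0.txt
```

Generating names through one function, rather than typing them by hand, is one way to keep every version of a dataset named the same way.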
Data and File Formatting
- Choose a file format that is open, non-proprietary, used widely, and can be opened and shared without special software or hardware.
- Use the same file formats within a project so that file conversions are not necessary.
- Saving files in commonly-used formats such as Microsoft Word preserves the formatting of the original document, but you may find that you are unable to open the file in later versions of the software. It's best to save two versions of the file: one in MS Word (to preserve formatting) and one as a text file (to ensure readability and ease of use at a later date).
- When converting a file, be sure to check that the new file does not contain any errors or omissions.
  - Check the actual data itself: column headings, rows, etc.
  - Check the metadata; make sure it is present and accurate.
  - Check to see if markup such as highlighting or bolded text has been lost.
- If saving an image file using compression, lossless is better than lossy.
- If saving different formats of the same file, be sure to name each file with the same name (ex: bison_data_v1.xlsx and bison_data_v1.txt).
- Once the data analysis is completed, prepare the data for long-term storage by converting the data format into standard and long-lasting formats.
- Convert data formats to standard formats for backup data as well.
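The check for errors or omissions after a conversion can be partly automated. The Python sketch below compares the column headings, row count, and row contents of the same table saved in two delimited text formats. The function names and sample contents are hypothetical, and the sketch covers only the data itself; metadata and markup still need to be checked by hand.

```python
import csv
import io


def read_table(text, delimiter):
    """Parse delimited text into a list of rows (each a list of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


def compare_tables(original, converted):
    """Return a list of human-readable problems; empty means the tables match."""
    problems = []
    if not original or not converted:
        return ["one of the tables is empty"]
    if original[0] != converted[0]:
        problems.append("column headings differ")
    if len(original) != len(converted):
        problems.append(f"row count differs: {len(original)} vs {len(converted)}")
    for number, (old, new) in enumerate(zip(original, converted), start=1):
        # Row 1 holds the headings, which were already compared above.
        if number > 1 and old != new:
            problems.append(f"row {number} differs")
    return problems


# Hypothetical contents of the same dataset saved in two formats.
csv_text = "site,date,count\nA1,2015-06-01,12\nB2,2015-06-02,7\n"
tab_text = "site\tdate\tcount\nA1\t2015-06-01\t12\n"

print(compare_tables(read_table(csv_text, ","), read_table(tab_text, "\t")))
# ['row count differs: 3 vs 2']
```

Here the tab-delimited copy is missing its last row, which the comparison reports. An empty list would indicate that the headings and every row match.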