Data types and compression methods


Introduction

We deal with large amounts of data in our analysis, and there are different ways to store and archive it. For long-term storage, we archive data using Globus. Details on how to do this can be found in this link. We consider data compression and alternative storage options when any of the following becomes a concern:

Size: reading or writing the data in its current form makes the project intractable.
Readability: is the file type best read by humans, machines, or both?
Efficiency: input/output bandwidth, that is, how fast files can be read or written.
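The size and efficiency criteria above can be checked directly before committing to a format. The following is a minimal sketch, not a benchmark: it writes the same synthetic text once as a plain file and once gzip-compressed, then reports bytes on disk and read time. The function name `measure_storage`, the file names, and the sample rows are assumptions made up for illustration.

```python
import gzip
import os
import tempfile
import time


def measure_storage(text: str, directory: str) -> dict:
    """Write `text` as plain and gzip files; report size on disk and read time."""
    data = text.encode("utf-8")
    plain_path = os.path.join(directory, "sample.csv")    # hypothetical file name
    gz_path = os.path.join(directory, "sample.csv.gz")    # hypothetical file name

    with open(plain_path, "wb") as f:
        f.write(data)
    with gzip.open(gz_path, "wb") as f:
        f.write(data)

    results = {}
    for label, path, opener in (
        ("plain", plain_path, open),
        ("gzip", gz_path, gzip.open),
    ):
        start = time.perf_counter()
        with opener(path, "rb") as f:
            content = f.read()
        elapsed = time.perf_counter() - start
        results[label] = {
            "bytes_on_disk": os.path.getsize(path),
            "read_seconds": elapsed,
            "round_trip_ok": content == data,
        }
    return results


# Synthetic, repetitive CSV-like rows (made up for illustration).
sample_text = "station,temperature,humidity\n" + "A01,21.5,0.43\n" * 50_000

with tempfile.TemporaryDirectory() as tmp:
    report = measure_storage(sample_text, tmp)

for label, stats in report.items():
    print(label, stats)
```

Timings from a single small run like this are noisy and depend on disk caching, so treat them as a rough indication and repeat the measurement on data that resembles your own.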

Learning Objectives

This training aims to help you understand:

Different methods of file compression
How to test and choose the optimal data storage type

Training activities

There are many compression and archiving utilities and formats, such as tar, gzip, lzma, xz, bzip2, pax, peazip, 7zip, shar, cpio, ar, iso, kgb and zpaq (some of which you may be familiar with, and others you may never have heard of). And there are more. These methods differ in efficiency, speed, compression ratio and use case.
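To see how methods differ in speed and compression ratio, one option is to run the same payload through several codecs and compare. The sketch below uses only the gzip, bz2 and lzma modules from the Python standard library (lzma writes the xz container by default); the function name `compare_codecs` and the sample payload are assumptions for illustration, and the result will vary with the data you feed it.

```python
import bz2
import gzip
import lzma
import time


def compare_codecs(data: bytes) -> list:
    """Compress `data` with each codec; return ratio and compression time."""
    codecs = {
        "gzip": (gzip.compress, gzip.decompress),
        "bz2": (bz2.compress, bz2.decompress),
        "xz": (lzma.compress, lzma.decompress),
    }
    rows = []
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        seconds = time.perf_counter() - start
        # Verify that the round trip is lossless before trusting the numbers.
        if decompress(packed) != data:
            raise ValueError(f"{name} round trip failed")
        rows.append(
            {
                "codec": name,
                "compressed_bytes": len(packed),
                "ratio": len(data) / len(packed),
                "compress_seconds": seconds,
            }
        )
    return rows


# Synthetic, highly repetitive log-like payload (made up for illustration).
payload = b"2024-01-01,sensor_07,12.5,OK\n" * 20_000
comparison = compare_codecs(payload)

for row in comparison:
    print(row)
```

Repetitive data such as this compresses far better than real measurements usually do, so the ratios here are optimistic; the point is the comparison workflow, not the specific numbers.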

Table 11 Data compression techniques
Topic	Commitment	Tasks	Readings	Outcomes
Parallel File Compressing	S	None	Read about Parallel File Compressing in Linux.	You will learn to compress large amounts of data with tar or gzip using multiple processes at the same time
Hierarchical Data Format (HDF)	M	Complete the steps outlined in this blog post	None	You will learn about HDF and compare write, compression and read speed using CSV and HDF5.
Utilizing Parquet File Format	M	Work with this notebook on reading and working with Parquet files	Read Efficient Storage and Querying of Geospatial Data with Parquet and DuckDB in this blog post	You will learn how to read and process Parquet files in Python
NetCDF (Network Common Data Form)	S	Work with Converting ASCII data to NetCDF in Python	Read about reducing the size of NetCDF spatial data with list representations in this blog post	You will learn how to read and process NetCDF data

Commitment: S = Short (< 1 day), M = Medium (1-5 days), L = Long (> 5 days)
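As a taste of the Parallel File Compressing topic in the table, the idea of compressing several pieces of the data at the same time can be sketched in Python. This is an illustration under stated assumptions, not the method from the linked reading: it splits the input into chunks, compresses each chunk as an independent gzip member, and concatenates the members, which still forms a valid gzip stream. It uses threads rather than separate processes to keep the example short (CPython's zlib generally releases the GIL while compressing). The function name `parallel_gzip`, the chunk size, and the worker count are hypothetical choices.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def parallel_gzip(data: bytes, chunk_size: int = 1 << 20, workers: int = 4) -> bytes:
    """Compress `data` chunk by chunk in a thread pool; return one gzip stream."""
    # Split the input into fixed-size chunks (the last one may be shorter).
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so members stay in sequence.
        members = list(pool.map(gzip.compress, chunks))
    # Concatenated gzip members form a multi-member gzip stream.
    return b"".join(members)


# Synthetic payload of a few megabytes (made up for illustration).
payload = b"x,y,value\n" + b"10.0,20.0,3.14\n" * 300_000
packed = parallel_gzip(payload)

# gzip.decompress reads multi-member streams, so the round trip is lossless.
restored = gzip.decompress(packed)
print(len(payload), len(packed), restored == payload)
```

Compressing chunks independently usually costs a little compression ratio compared with a single pass, because each chunk starts with an empty dictionary; larger chunks reduce that penalty at the price of less parallelism.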