Data types and compression methods#

Introduction#

We deal with large amounts of data in our analyses. There are different ways to store and archive that data; for long-term archiving we use Globus, and details on how to do this can be found in this link. We consider data compression and alternative storage formats when one or more of the following situations arise:

  • Size: reading or writing the data in its current form makes the project intractable

  • Readability: whether the file format is best read by humans, machines, or both

  • Efficiency: input/output bandwidth, i.e. how fast files can be read or written

Learning Objectives#

This training aims to help you understand:

  1. Different methods of file compression

  2. Testing and using the optimal data storage type

Training activities#

There are many compression and archiving utilities, such as tar, gzip, lzma, xz, bzip2, pax, peazip, 7zip, shar, cpio, ar, iso, kgb, and zpaq (some of which you may be familiar with, others perhaps not), and there are more besides. These tools differ in efficiency, speed, compression ratio, and use case.
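
To get a feel for these trade-offs, the sketch below times three compressors available in Python's standard library (gzip/zlib, bzip2, and LZMA, the algorithm behind xz) on a synthetic, repetitive payload. The payload is made up, so treat the output only as a pattern to reproduce on your own data.

```python
# Minimal sketch: compare speed and compression ratio of stdlib compressors.
import bz2
import gzip
import lzma
import time

# Synthetic, fairly repetitive payload (~11 MB) as a stand-in for real data.
payload = b"station,date,value\nA001,2020-01-01,3.14159\n" * 250_000

compressors = {
    "gzip": gzip.compress,
    "bzip2": bz2.compress,
    "lzma (xz)": lzma.compress,
}

print(f"original size: {len(payload) / 1e6:.1f} MB")
for name, compress in compressors.items():
    start = time.perf_counter()
    compressed = compress(payload)
    elapsed = time.perf_counter() - start
    ratio = len(payload) / len(compressed)
    print(f"{name:>10}: {elapsed:6.2f} s, compression ratio {ratio:5.1f}x")
```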

Table 11 Data compression techniques#

| Topic | Commitment | Tasks | Readings | Outcomes |
| --- | --- | --- | --- | --- |
| Parallel file compression | S | None | Read about Parallel File Compressing in Linux. | You will learn to compress large amounts of data using multiple processes at the same time with tar or gzip. |
| Hierarchical Data Format (HDF) | M | Complete the steps outlined in this blog post. | None | You will learn about HDF and compare write, compression, and read speed for CSV and HDF5. |
| Utilizing the Parquet file format | M | Work through this notebook on reading and working with Parquet files. | Read Efficient Storage and Querying of Geospatial Data with Parquet and DuckDB in this blog post. | You will learn how to read and process Parquet files in Python. |
| NetCDF (Network Common Data Form) | S | Work through Converting ASCII data to NetCDF in Python. | Read about reducing the size of NetCDF spatial data with list representations in this blog post. | You will learn how to read and process NetCDF data. |

  • Commitment: S = Short (< 1 day), M = Medium (1-5 days), L = Long (> 5 days)
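
The sketches below give minimal, hedged illustrations of each topic in Table 11. For parallel file compression, the usual Linux approach is a multi-threaded tool such as pigz, or several gzip processes launched at once (for example with xargs -P). The same idea is sketched in Python here by compressing one file per worker process; the data/*.csv glob and the worker count are assumptions, so point them at your own files.

```python
# Minimal sketch: gzip many files in parallel, one file per worker process.
import glob
import gzip
import shutil
from multiprocessing import Pool

def gzip_one(path: str) -> str:
    """Compress a single file to <path>.gz and return the output path."""
    out_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path

if __name__ == "__main__":
    files = glob.glob("data/*.csv")      # hypothetical input files
    with Pool(processes=4) as pool:      # 4 worker processes
        for out in pool.imap_unordered(gzip_one, files):
            print("wrote", out)
```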
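For the HDF row, here is a rough sketch of the kind of comparison the linked post walks through: write time, read time, and on-disk size for the same synthetic DataFrame stored as CSV and as compressed HDF5. It assumes pandas plus the optional PyTables (`tables`) package is installed.

```python
# Minimal sketch: compare CSV and HDF5 write/read speed and file size.
import os
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "id": np.arange(1_000_000),
    "value": rng.random(1_000_000),
})

def timeit(fn):
    """Wall-clock time of a single call to fn, in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

csv_write = timeit(lambda: df.to_csv("demo.csv", index=False))
csv_read = timeit(lambda: pd.read_csv("demo.csv"))

# complevel/complib switch on zlib compression inside the HDF5 file.
hdf_write = timeit(lambda: df.to_hdf("demo.h5", key="df", mode="w",
                                     complevel=5, complib="zlib"))
hdf_read = timeit(lambda: pd.read_hdf("demo.h5", key="df"))

print(f"CSV : write {csv_write:.2f} s, read {csv_read:.2f} s, "
      f"{os.path.getsize('demo.csv') / 1e6:.1f} MB")
print(f"HDF5: write {hdf_write:.2f} s, read {hdf_read:.2f} s, "
      f"{os.path.getsize('demo.h5') / 1e6:.1f} MB")
```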
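For the Parquet row, here is a minimal sketch of reading and processing Parquet in Python with pandas and a Parquet engine such as pyarrow; the file name and column names are invented for the example. The linked reading shows how the same files can also be queried with DuckDB.

```python
# Minimal sketch: write a small Parquet file, then read and process it.
import pandas as pd

# Write a tiny Parquet file so the example is self-contained.
df = pd.DataFrame({
    "site": ["A", "A", "B", "B"],
    "value": [1.0, 2.0, 3.0, 4.0],
})
df.to_parquet("demo.parquet", index=False)

# Parquet is columnar, so reading a subset of columns skips the rest of
# the file entirely.
subset = pd.read_parquet("demo.parquet", columns=["value"])
print(subset.describe())

# From here on it is ordinary pandas processing.
full = pd.read_parquet("demo.parquet")
print(full.groupby("site")["value"].mean())
```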
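For the NetCDF row, here is a minimal sketch of converting tabular ASCII/CSV-style data to NetCDF and reading it back with pandas and xarray (xarray needs a NetCDF backend such as netCDF4 or scipy installed). The file, station, and variable names are assumptions.

```python
# Minimal sketch: tabular data -> NetCDF file -> labelled arrays.
import pandas as pd
import xarray as xr

# Stand-in for an ASCII/CSV file: one value per (time, station) pair.
df = pd.DataFrame({
    "time": pd.to_datetime(["2020-01-01", "2020-01-01",
                            "2020-01-02", "2020-01-02"]),
    "station": ["A001", "A002", "A001", "A002"],
    "temperature": [3.1, 2.7, 4.0, 3.5],
})

# Index on the desired dimensions, convert to an xarray Dataset, and write.
ds = df.set_index(["time", "station"]).to_xarray()
ds["temperature"].attrs["units"] = "degC"   # CF-style metadata on the variable
ds.to_netcdf("temperature.nc")

# Reading it back is one call; the result behaves like labelled numpy arrays.
with xr.open_dataset("temperature.nc") as back:
    print(back["temperature"].sel(station="A001").values)
```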