Data types and compression methods#
Introduction#
We deal with large amounts of data in our analysis, and there are different ways to store and archive it. For long-term storage, we archive data using Globus. Details on how to do this can be found in this link. We consider data compression and alternative storage options when any of the following becomes a concern:
Size: Reading or writing the data a project depends on is intractable in its current form
Readability: Is the file type best read by humans, by machines, or by both?
Efficiency: Input/output bandwidth, that is, how fast files can be read or written
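To make the size and readability trade-off concrete, here is a minimal sketch using only the Python standard library. It stores the same numbers once as human-readable CSV text and once as packed binary doubles. The sample data and variable names are hypothetical and only serve the illustration.

```python
import array
import csv
import io
import random

random.seed(0)
values = [random.random() for _ in range(10_000)]  # hypothetical sample data

# Human-readable: one float per CSV row, written as text.
text_buf = io.StringIO()
writer = csv.writer(text_buf)
for v in values:
    writer.writerow([repr(v)])
csv_bytes = text_buf.getvalue().encode("utf-8")

# Machine-readable: the same floats packed as 8-byte doubles.
binary_bytes = array.array("d", values).tobytes()

# Round-trip the binary form to confirm no information is lost.
restored = array.array("d")
restored.frombytes(binary_bytes)

print(f"CSV text:      {len(csv_bytes)} bytes")
print(f"Packed binary: {len(binary_bytes)} bytes")
```

The text form can be opened in any editor, while the binary form is smaller and needs no parsing, but is only readable by a program that knows the layout.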
Learning Objectives#
This training aims to help you understand:
Different methods of file compression
Testing and using the optimal data storage type
Training activities#
There are many compression and archiving utilities, such as tar, gzip, lzma, xz, bzip2, pax, peazip, 7zip, shar, cpio, ar, iso, kgb and zpaq (some of which you may be familiar with, and others you may never have heard of). And there are more! These tools differ in efficiency, speed, compression ratio and use case.
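One way to see how these methods differ in speed and compression ratio is to run several of them on the same input. The sketch below uses the gzip, bz2 and lzma modules from the Python standard library. The payload is made-up, repetitive CSV-like text, so the numbers it prints are only indicative and will differ for your own data.

```python
import bz2
import gzip
import lzma
import time

# Hypothetical payload: repetitive CSV-like text, which compresses well.
payload = b"".join(
    f"{i},{i % 17},station_{i % 5}\n".encode("ascii") for i in range(50_000)
)

codecs = {
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "xz": (lzma.compress, lzma.decompress),  # lzma defaults to the .xz format
}

results = {}
for name, (compress, decompress) in codecs.items():
    start = time.perf_counter()
    packed = compress(payload)
    elapsed = time.perf_counter() - start
    # Verify the round trip before trusting the measurement.
    assert decompress(packed) == payload
    results[name] = (len(payload) / len(packed), elapsed)
    print(f"{name:6s} ratio {results[name][0]:6.1f}x  time {elapsed * 1000:7.1f} ms")
```

Each method runs with its default settings here. Changing the compression level shifts the balance between speed and ratio, so it is worth testing on a representative sample of your own files.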
Topic | Commitment | Tasks | Readings | Outcomes |
---|---|---|---|---|
Parallel File Compressing | S | None | Read about Parallel File Compressing in Linux. | You will learn to compress large amounts of data with several processes running at the same time, using tar or gzip |
Hierarchical Data Format (HDF) | M | Complete the steps outlined in this blog post | None | You will learn about HDF and compare write, compression and read speeds for CSV and HDF5. |
Utilizing Parquet File Format | M | Work with this notebook on reading and working with parquet files | Read Efficient Storage and Querying of Geospatial Data with Parquet and DuckDB in this blog post | You will learn how to read and process parquet file types in Python |
NetCDF (Network Common Data Form) | S | | Read about reducing the size of NetCDF spatial data with list representations in this blog post | You will learn how to read and process NetCDF data |
Commitment: S = Short (< 1 day), M = Medium (1-5 days), L = Long (> 5 days)
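As a rough illustration of the idea behind the Parallel File Compressing topic, the sketch below splits the data into chunks, compresses the chunks concurrently, and concatenates the results. It relies on the fact that a gzip stream may consist of several members back to back. Note that it uses a thread pool rather than separate processes, on the assumption that CPython's zlib releases the GIL while compressing. The function name `parallel_gzip`, the chunk size and the payload are hypothetical choices for this example, not part of the linked reading.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1 << 20  # 1 MiB per chunk; an arbitrary choice for illustration


def parallel_gzip(data: bytes, workers: int = 4) -> bytes:
    """Compress fixed-size chunks concurrently and concatenate the gzip members."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so the members stay in sequence.
        members = list(pool.map(gzip.compress, chunks))
    return b"".join(members)


data = b"timestamp,value\n" * 500_000  # hypothetical 8 MB payload
packed = parallel_gzip(data)

# A multi-member gzip stream decompresses back to the original bytes.
restored = gzip.decompress(packed)
print(f"{len(data)} bytes -> {len(packed)} bytes")
```

Compressing chunks independently usually costs a little compression ratio compared with a single pass, since each chunk starts without the context of the previous one. In exchange, the work can be spread across several cores.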