IT / Meteorology · March 13, 2022

A File Size Comparison Report for GRIB and NetCDF

Summary

This report converts an original ERA5 file in GRIB1 format into GRIB2, NetCDF3 and NetCDF4 files using different compression algorithms, applies bz2 compression to each of them, and compares the resulting file sizes. The conclusion is that, with or without bz2 compression, the smallest files belong to the GRIB2 family.

Data Introduction

The raw data used in this test is a GRIB file downloaded from the ERA5 reanalysis dataset, in GRIB1 format. To compare sizes across formats, I use several tools to convert it into each target format.

Original data download address: https://doi.org/10.5281/zenodo.6348679

Tools and Methods

The command line tools used in this article include:

  • ecCodes
  • wgrib2
  • netCDF4
  • bzip2

Among them, ecCodes and netCDF4 can be installed directly with conda, while bzip2 and wgrib2 must be installed in whatever way suits your operating system. This article runs wgrib2 from a Docker image: $ docker run agilesrc/wgrib2 bash

Size Comparison of GRIB1 and GRIB2

The GRIB1 format does not support compression, while GRIB2 does, so comparing GRIB1 and GRIB2 file sizes essentially means comparing the GRIB1 file against GRIB2 files produced with various compression methods.

Since our original file era5-sample.grib is in GRIB1 format, we first convert it to GRIB2 by running the following command in the terminal:

grib_set -s edition=2 era5-sample.grib era5-sample.grib2

Check out the new GRIB2 file:

$ grib_ls era5-sample.grib2
era5-sample.grib2
edition      centre       date         dataType     gridType     stepRange    typeOfLevel  level        shortName    packingType
2            ecmf         20211028     fc           regular_ll   8            heightAboveGround  10           10u          grid_simple
2            ecmf         20211028     fc           regular_ll   8            heightAboveGround  10           10v          grid_simple
2            ecmf         20211028     fc           regular_ll   8            heightAboveGround  2            2d           grid_simple
2            ecmf         20211028     fc           regular_ll   8            heightAboveGround  2            2t           grid_simple
2            ecmf         20211028     fc           regular_ll   8            surface      0            fal          grid_simple
2            ecmf         20211028     fc           regular_ll   7-8          surface      0            slhf         grid_simple
2            ecmf         20211028     fc           regular_ll   7-8          surface      0            ssr          grid_simple
2            ecmf         20211028     fc           regular_ll   7-8          surface      0            str          grid_simple
2            ecmf         20211028     fc           regular_ll   8            surface      0            sp           grid_simple
2            ecmf         20211028     fc           regular_ll   7-8          surface      0            sshf         grid_simple
2            ecmf         20211028     fc           regular_ll   7-8          surface      0            ssrd         grid_simple
2            ecmf         20211028     fc           regular_ll   7-8          surface      0            strd         grid_simple
2            ecmf         20211028     fc           regular_ll   7-8          surface      0            tp           grid_simple
13 of 13 messages in era5-sample.grib2

13 of 13 total messages in 1 files
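
Instead of reading the whole listing, the conversion can also be checked by counting messages per edition. This is a minimal sketch, assuming the ecCodes tool grib_get is on the PATH; the helper name count_editions is hypothetical and not part of ecCodes.

```shell
#!/bin/sh
# Hypothetical helper: print each distinct GRIB edition found in a file,
# together with the number of messages that carry it.
# Assumes the ecCodes tool grib_get is available on PATH.
count_editions() {
    # grib_get prints one line per message; -p selects the key to print.
    grib_get -p edition "$1" | sort | uniq -c |
        awk '{ print "edition " $2 ": " $1 " messages" }'
}

# Usage (not run here):
# count_editions era5-sample.grib2
```

For the converted sample, this should report a single line for edition 2 covering all 13 messages, matching the grib_ls listing.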

The edition is now 2. Next, we compress the converted GRIB2 file. In an environment where wgrib2 is installed and bash is available, run the following script in the data directory:

#!/bin/bash

ctypes=( "ieee" "simple" "complex1" "complex2" "complex3" "jpeg" "aec" "same" )

# Write one GRIB2 file per wgrib2 packing type
for ctype in "${ctypes[@]}"
do
    wgrib2 -set_grib_type "$ctype" era5-sample.grib2 -grib_out "era5-sample-$ctype.grib2"
done

This results in the following files:

$ ls -lh era5-sample-*.grib2
-rw-r--r--  1 clarmylee  staff    36M  3 12 14:02 era5-sample-aec.grib2
-rw-r--r--  1 clarmylee  staff    34M  3 12 14:01 era5-sample-complex1.grib2
-rw-r--r--  1 clarmylee  staff    27M  3 12 14:01 era5-sample-complex2.grib2
-rw-r--r--  1 clarmylee  staff    27M  3 12 14:02 era5-sample-complex3.grib2
-rw-r--r--  1 clarmylee  staff   322M  3 12 14:01 era5-sample-ieee.grib2
-rw-r--r--  1 clarmylee  staff    36M  3 12 14:02 era5-sample-jpeg.grib2
-rw-r--r--  1 clarmylee  staff    65M  3 12 14:02 era5-sample-same.grib2
-rw-r--r--  1 clarmylee  staff    65M  3 12 14:01 era5-sample-simple.grib2

[Figure: Comparison 1]

The largest file is produced by the ieee format and the smallest by complex3 (complex2 is the same size at this rounding). Apart from ieee, the original GRIB1 file is the largest. The GRIB2 file converted directly with grib_set is slightly smaller than the original, while the complex3 file is about 1/3 of the original size.
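
The relative sizes can be worked out from the listing with a little arithmetic. This is a minimal sketch; the helper name percent_of is hypothetical, and the inputs are the rounded values printed by ls -lh, so the results are only approximate.

```shell
#!/bin/sh
# Hypothetical helper: express one file size as a percentage of another.
# Inputs are the rounded MB values from the ls -lh listing above.
percent_of() {
    awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.1f\n", 100 * part / whole }'
}

percent_of 27 65    # complex3 relative to simple packing
percent_of 27 322   # complex3 relative to ieee
```

On these rounded figures, the complex3 file is a bit over 40% of the simple-packed file and under 10% of the ieee file.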

All of the GRIB files above can still be read directly by GRIB tools. As a further test, compress them with bzip2 by running $ bzip2 -k *grib* in the terminal, which produces the following files:

$ ls -lh *.bz2
-rw-r--r--  1 clarmylee  staff    25M  3 12 14:02 era5-sample-aec.grib2.bz2
-rw-r--r--  1 clarmylee  staff    33M  3 12 14:01 era5-sample-complex1.grib2.bz2
-rw-r--r--  1 clarmylee  staff    26M  3 12 14:01 era5-sample-complex2.grib2.bz2
-rw-r--r--  1 clarmylee  staff    26M  3 12 14:02 era5-sample-complex3.grib2.bz2
-rw-r--r--  1 clarmylee  staff    55M  3 12 14:01 era5-sample-ieee.grib2.bz2
-rw-r--r--  1 clarmylee  staff    26M  3 12 14:02 era5-sample-jpeg.grib2.bz2
-rw-r--r--  1 clarmylee  staff    31M  3 12 14:02 era5-sample-same.grib2.bz2
-rw-r--r--  1 clarmylee  staff    31M  3 12 14:01 era5-sample-simple.grib2.bz2
-rw-r--r--  1 clarmylee  staff    52M  3 12 13:58 era5-sample.grib.bz2
-rw-r--r--  1 clarmylee  staff    52M  3 12 13:58 era5-sample.grib2.bz2

[Figure: Comparison 2]

After bz2 compression, the smallest file is the one packed with the aec method. bz2 has the largest effect on the ieee file, while files that were already small barely shrink further.
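
To put a number on how much bz2 adds for each file, the sizes before and after can be compared byte for byte. This is a minimal sketch, assuming each FILE sits next to its FILE.bz2 (as it does after bzip2 -k) and is not empty; the helper name bz2_saving is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report how much smaller FILE.bz2 is than FILE.
# Assumes both files exist and FILE is not empty.
bz2_saving() {
    before=$(wc -c < "$1")
    after=$(wc -c < "$1.bz2")
    awk -v f="$1" -v b="$before" -v a="$after" \
        'BEGIN { printf "%s: %d -> %d bytes (%.1f%% saved)\n", f, b, a, 100 * (b - a) / b }'
}

# Usage (not run here):
# for f in era5-sample-*.grib2; do bz2_saving "$f"; done
```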

Within the GRIB ecosystem, judged purely by file size and without giving up direct readability, storing data as GRIB2 with the complex3 compression algorithm is the best option. If direct readability can be given up, aec compression followed by bz2 is also worth considering.

Size Comparison Between NetCDF3 and NetCDF4

Next, consider NetCDF3 versus NetCDF4. As with GRIB, the older NetCDF3 format has no native compression, so compressing it requires an external tool such as bzip2, whereas the newer NetCDF4 format supports native compression. The comparison is therefore really between NetCDF3 and NetCDF4 at different compression levels.

The grib_to_netcdf command supports converting GRIB to the following four NetCDF storage formats:

  • netCDF classic file format
  • netCDF 64 bit classic file format (Default)
  • netCDF-4 file format
  • netCDF-4 classic model file format

We will not go into the underlying differences between these four formats here and only compare their sizes. First, convert the GRIB file to each of the four NetCDF formats:

$ grib_to_netcdf -k 1 -o era5-sample-class.nc3 era5-sample.grib
$ grib_to_netcdf -k 2 -o era5-sample-64class.nc3 era5-sample.grib
$ grib_to_netcdf -k 3 -o era5-sample.nc4 era5-sample.grib
$ grib_to_netcdf -k 4 -o era5-sample-class.nc4 era5-sample.grib
$ ls -lh *nc*
-rw-r--r--  1 clarmylee  staff   161M  3 12 15:26 era5-sample-64class.nc3
-rw-r--r--  1 clarmylee  staff   161M  3 12 15:26 era5-sample-class.nc3
-rw-r--r--  1 clarmylee  staff   161M  3 12 15:27 era5-sample-class.nc4
-rw-r--r--  1 clarmylee  staff   161M  3 12 15:26 era5-sample.nc4

There is no significant difference in size among the NetCDF formats produced directly by grib_to_netcdf. Next, use nccopy to apply NetCDF4's native compression to the nc4 file by running the following script:

#!/bin/bash

clevels=( 0 1 2 3 4 5 6 7 8 9 )

# Write one NetCDF4 file per deflate level (0 = no compression)
for level in "${clevels[@]}"
do
    nccopy -k 'netCDF-4' -d "$level" era5-sample.nc4 "era5-sample-c$level.nc4"
done

Check the result:

$ ls -lh era5-sample-*nc*
-rw-r--r--  1 clarmylee  staff   161M  3 12 15:26 era5-sample-64class.nc3
-rw-r--r--  1 clarmylee  staff   161M  3 12 15:44 era5-sample-c0.nc4
-rw-r--r--  1 clarmylee  staff    44M  3 12 15:44 era5-sample-c1.nc4
-rw-r--r--  1 clarmylee  staff    44M  3 12 15:44 era5-sample-c2.nc4
-rw-r--r--  1 clarmylee  staff    44M  3 12 15:44 era5-sample-c3.nc4
-rw-r--r--  1 clarmylee  staff    44M  3 12 15:44 era5-sample-c4.nc4
-rw-r--r--  1 clarmylee  staff    44M  3 12 15:44 era5-sample-c5.nc4
-rw-r--r--  1 clarmylee  staff    44M  3 12 15:44 era5-sample-c6.nc4
-rw-r--r--  1 clarmylee  staff    44M  3 12 15:44 era5-sample-c7.nc4
-rw-r--r--  1 clarmylee  staff    44M  3 12 15:44 era5-sample-c8.nc4
-rw-r--r--  1 clarmylee  staff    44M  3 12 15:44 era5-sample-c9.nc4
-rw-r--r--  1 clarmylee  staff   161M  3 12 15:26 era5-sample-class.nc3
-rw-r--r--  1 clarmylee  staff   161M  3 12 15:27 era5-sample-class.nc4

Uncompressed, the nc4 file is 161M, and compression levels 1 through 9 all yield 44M. In other words, for NetCDF4 the difference between compressed and uncompressed is large, the difference between compression levels is negligible for this data, and uncompressed NetCDF4 is as bulky as NetCDF3.
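
Because levels 1 through 9 give the same rounded size, it can be worth confirming that each file really carries the requested deflate level. This is a minimal sketch, assuming the ncdump tool from the netCDF utilities is on the PATH; its -s option prints special attributes such as _DeflateLevel. The helper name show_deflate is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: show the deflate level recorded for each variable.
# Assumes ncdump (shipped alongside nccopy) is available on PATH.
show_deflate() {
    # -h prints the header only; -s adds special attributes such as _DeflateLevel.
    ncdump -h -s "$1" | grep '_DeflateLevel' ||
        echo "no deflate filter reported for $1"
}

# Usage (not run here):
# show_deflate era5-sample-c5.nc4
```

For a file written with -d 0, such as era5-sample-c0.nc4, the fallback message would be the expected outcome.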

Now apply bz2 again by running $ bzip2 -k era5-sample-*nc*, then plot the results:

[Figure: Comparison 3]

As the figure above shows, without bz2 the largest file is the uncompressed NetCDF4 file and the smallest is the NetCDF4 file at compression level 9.

[Figure: Comparison 4]

After bz2 compression, however, the NetCDF3 files end up smaller than the NetCDF4 files.

Cross-comparison of 4 Format Sizes

The following combines all of the formats above, compressed and uncompressed, at every compression level, and compares their sizes.

[Figure: Comparison 5]

The figure above is sorted by size without bz2, from largest to smallest. Among files that remain directly readable, GRIB2 with the complex3 compression algorithm is still the smallest.

[Figure: Comparison 6]

Sorted by size after bz2 compression, the smallest is still GRIB2, this time with the aec compression algorithm.
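
The same rankings can be reproduced as plain text from the files on disk. This is a minimal sketch using only standard tools; the helper name rank_by_size is hypothetical, and the glob patterns in the usage lines assume the file names produced earlier in this article.

```shell
#!/bin/sh
# Hypothetical helper: list the given files from smallest to largest,
# printing the size in bytes followed by the file name.
rank_by_size() {
    for f in "$@"; do
        printf '%s %s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
    done | sort -n
}

# Usage (not run here):
# rank_by_size era5-sample.grib era5-sample*.grib2 era5-sample*.nc[34]   # without bz2
# rank_by_size era5-sample*.bz2                                          # with bz2
```

Sorting on exact byte counts also separates files that look identical in the rounded ls -lh output.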

Conclusion

From the comparisons above, with or without bz2 compression the smallest files belong to the GRIB2 family. Without bz2, the smallest is the GRIB2 file using the complex3 compression algorithm; with bz2, the smallest is the GRIB2 file using the aec compression algorithm. This considers file size only; assessing read speed requires additional experiments.

To cite this article, please use the following citation format:
Wentao Li. (2022). 一份GRIB与NetCDF的体积对比报告 (Version v1). Zenodo. https://doi.org/10.5281/zenodo.6348695