{ "cells": [ { "cell_type": "markdown", "id": "6e0b3c7d-b3ac-4013-979a-f4b2012b48f7", "metadata": {}, "source": [ "# Compression methods for NetCDF files \n", "\n", "with Xarray\n", "\n", "This Notebook gives a guideline how to use recent basic lossy and lossless compression for netdf output via `xarray`. It is an implementation of [these slides](https://www.researchgate.net/publication/365006139_NetCDF_Compression_Improvements) using example data of Earth System Models AWI-CM. We will compare the writing speed and the compression ratio of different methods.\n", "\n", "For using lossless compression methods on netcdf files, we use the [hdf5plugin](https://github.com/silx-kit/hdf5plugin/blob/b53d70b702154a149a60f653c49b20045228aa16/doc/index.rst) tool that enables to use additional hdf5filters within python.\n", "\n", "For *lossy* compression, we use the [numcodecs](https://github.com/zarr-developers/numcodecs) lib to calculate the bitrounding.\n", "\n", "We use the [BitShuffle](https://github.com/kiyo-masui/bitshuffle) filter prior to compression whenever possible. This will rearrange the binary data in order to improve compression.\n", "\n", "## Requirements\n", "\n", "- Lossy compression requires a lot of memory. \n", "- Reading lossless compressions other than deflated requires netcdf version 4.9 or newer with access to the HDF5 filters" ] }, { "cell_type": "markdown", "id": "0f4d7d8d-e120-4d56-9b45-1bfd7eb4ba4b", "metadata": {}, "source": [ "## Lossless compression methods\n", "\n", "### [zlib]()\n", "\n", "zlib has been the standard compression method since it was introduced in netcdf. It is based on the deflate algorithm, other compression packages use dictionaries.\n", "\n", "### [Zstandard](https://github.com/facebook/zstd)\n", "\n", "Zstandard allows multithreading. It is used by package registries and is supported by linux systemss. Zstandard offers features beyond zlib/zlib-ng, including better performance and compression.\n", "\n", "### [LZ4](https://github.com/lz4/lz4)\n", "\n", "LZ4 has a focus on very fast compression. It scales with CPUs.\n", "\n", "### [Blosc](https://github.com/Blosc/c-blosc)\n", "\n", "Blosc uses the blocking technique to not only reduce the size of large datasets on-disk or in-memory, but also to accelerate memory-bound computations. Its default compressor is based on LZ4 and Zstandard." 
{ "cell_type": "code", "execution_count": null, "id": "911b4efb-47e2-422b-acf2-a3d8d37b1baf", "metadata": {}, "outputs": [], "source": [ "import hdf5plugin  # registers the additional HDF5 filters on import\n", "import time\n", "import fsspec as fs\n", "import glob\n", "import xarray as xr\n", "import tqdm" ] }, { "cell_type": "code", "execution_count": null, "id": "888676a5-4c4c-4c49-8f97-7144e2bee12d", "metadata": { "tags": [] }, "outputs": [], "source": [ "help(hdf5plugin)" ] }, { "cell_type": "code", "execution_count": null, "id": "a598a965-73c3-4700-8de2-4afa26ec3702", "metadata": {}, "outputs": [], "source": [ "# run counter used in the name of the results file\n", "%store -r times\n", "#print(times)\n", "times=0" ] }, { "cell_type": "markdown", "id": "48311eee-fdf4-4e75-9ae6-ace523381037", "metadata": {}, "source": [ "On Levante, you can use the plugins from the CMIP pool directory `/work/ik1017/hdf5plugin/plugins/`:" ] }, { "cell_type": "code", "execution_count": null, "id": "cad827c9-1445-481e-b24b-0017bf78e19b", "metadata": {}, "outputs": [], "source": [ "# point HDF5 to the precompiled filter plugins\n", "hdf5plugin.PLUGIN_PATH=\"/work/ik1017/hdf5plugin/plugins/\"\n", "%set_env HDF5_PLUGIN_PATH={hdf5plugin.PLUGIN_PATH}" ] }, { "cell_type": "markdown", "id": "c7a91342-9e70-40ba-9389-54fca72bb949", "metadata": {}, "source": [ "We use the ocean surface temperature `tos` in this example:" ] }, { "cell_type": "code", "execution_count": null, "id": "c346fc8b-f437-41e3-9c83-6f2ef49fe749", "metadata": { "tags": [] }, "outputs": [], "source": [ "source=\"/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/AWI/AWI-CM-1-1-MR/ssp370/r1i1p1f1/Omon/tos/gn/v20181218/tos_Omon_AWI-CM-1-1-MR_ssp370_r1i1p1f1_gn_201501-202012.nc\"\n", "pwd=!pwd\n", "pwd=pwd[0]\n", "source_uncompressed=f\"{pwd}/temp.nc\"\n", "# write one year of the data as an uncompressed reference file\n", "sds=xr.open_mfdataset(source).isel(time=slice(1,13))\n", "for var in sds.variables:\n", "    sds[var].encoding[\"zlib\"]=False\n", "sds.to_netcdf(source_uncompressed)" ] }, { "cell_type": "code", "execution_count": null, "id": "6c6bf599-ba32-416f-ad51-1ddb058cbd83", "metadata": {}, "outputs": [], "source": [ "sds" ] }, { "cell_type": "code", "execution_count": null, "id": "a2b12787-39e0-45cb-8ebf-b02b95e50b3d", "metadata": {}, "outputs": [], "source": [ "omon2d=xr.open_mfdataset(\n", "    source_uncompressed,\n", "    engine=\"h5netcdf\",\n", "    parallel=False\n", ")" ] }, { "cell_type": "markdown", "id": "727f4776-5667-4e74-8c08-cea0855d217c", "metadata": {}, "source": [ "The following \"*compression* : *configuration*\" dictionary is used to configure the `encoding` keyword argument of xarray's *to_netcdf*:" ] }, { "cell_type": "code", "execution_count": null, "id": "8a419d87-1dd1-4b3c-9d48-68c5ced53549", "metadata": { "tags": [] }, "outputs": [], "source": [ "comprdict=dict(\n", "    zlib=dict(\n", "        engine=\"h5netcdf\",\n", "        compr=dict(\n", "            zlib=True,\n", "            complevel=5\n", "        )\n", "    ),\n", "    zstd=dict(\n", "        engine=\"h5netcdf\",\n", "        # with recent hdf5plugin versions (e.g. in the python 3.11 environment):\n", "        compr=dict(**hdf5plugin.Bitshuffle(cname=\"zstd\"))\n", "        #compr=dict(**hdf5plugin.Zstd())\n", "    ),\n", "    lz4=dict(\n", "        engine=\"h5netcdf\",\n", "        # with recent hdf5plugin versions (e.g. in the python 3.11 environment):\n", "        compr=dict(**hdf5plugin.Bitshuffle(cname=\"lz4\"))\n", "        #compr=dict(**hdf5plugin.Bitshuffle(lz4=True))\n", "    ),\n", "    blosc=dict(\n", "        engine=\"h5netcdf\",\n", "        compr=dict(**hdf5plugin.Blosc(cname='blosclz', shuffle=1))\n", "    )\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "34e5bc3f-164e-4c40-b89f-a4e8d5866276", "metadata": {}, "outputs": [], "source": [ "comprdict[\"lz4\"]" ] },
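{ "cell_type": "markdown", "id": "added-filter-id-md", "metadata": {}, "source": [ "The mapping shown above is what `xarray` hands down to `h5py`: `compression` is the registered HDF5 filter ID and `compression_opts` are the filter parameters. As a quick sketch (the exact options tuple depends on the installed `hdf5plugin` version), any of the configurations can be inspected the same way:" ] }, { "cell_type": "code", "execution_count": null, "id": "added-filter-id-code", "metadata": {}, "outputs": [], "source": [ "# Show the raw mapping passed to h5py/HDF5 for one filter configuration:\n", "dict(**hdf5plugin.Bitshuffle(cname=\"zstd\"))" ] },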
"source": [ "sourcesize=fs.filesystem(\"file\").du(source_uncompressed)\n", "print(f\"The size of the uncompressed source file is {sourcesize/1024/1024} MB\")" ] }, { "cell_type": "code", "execution_count": null, "id": "e6202eb6-0897-4d18-b15e-69dd30b53c23", "metadata": { "tags": [] }, "outputs": [], "source": [ "resultdir={}\n", "for compr,config in tqdm.tqdm(comprdict.items()):\n", " enc=dict()\n", " for var in omon2d.data_vars:\n", " enc[var]=config[\"compr\"]\n", " start=time.time()\n", " omon2d.to_netcdf(f\"{pwd}/test_{compr}_compression.nc\",\n", " mode=\"w\",\n", " engine=config[\"engine\"],\n", " unlimited_dims=\"time\",\n", " encoding=enc,\n", " )\n", " end=time.time()\n", " resultdir[compr]=dict(\n", " speed=sourcesize/(end-start)/1024/1024,\n", " ratio=fs.filesystem(\"file\").du(f\"{pwd}/test_{compr}_compression.nc\")/sourcesize\n", " )" ] }, { "cell_type": "code", "execution_count": null, "id": "a054a804-0cd6-4f89-9d95-8805d79bac77", "metadata": {}, "outputs": [], "source": [ "with open(f\"results_{str(times)}.csv\",\"w\") as f:\n", " for k,v in resultdir.items():\n", " f.write(f\"{k},{sourcesize},{v['speed']},{v['ratio']}\\n\")" ] }, { "cell_type": "markdown", "id": "c1befda5-4bd5-4d1e-b586-8794b0046d85", "metadata": {}, "source": [ "### Reading not-deflated data\n", "\n", "Before open a file compressed with something else than zlib, you have to import hdf5plugin first:" ] }, { "cell_type": "code", "execution_count": null, "id": "d2f290c4-31a3-48ab-b0a6-c360a8bf171e", "metadata": { "tags": [] }, "outputs": [], "source": [ "import hdf5plugin\n", "import xarray as xr\n", "outf=xr.open_dataset(f\"{pwd}/test_zstd_compression.nc\",engine=\"h5netcdf\")" ] }, { "cell_type": "code", "execution_count": null, "id": "6848c002-42d2-47e3-93a5-988995f40208", "metadata": {}, "outputs": [], "source": [ "outf" ] }, { "cell_type": "markdown", "id": "d6b06f6c-4e6c-41ef-84c7-f3ef0c011c8f", "metadata": {}, "source": [ "## Lossy\n", "\n", "1. Direct `BitRound`ing with 16 bits to be kept. This precision can be considered as similar to e.g. ERA5 data (24 bit Integer space).\n", "1. Calculate number of bits with information level 0.99 via *xbitinfo*." 
{ "cell_type": "code", "execution_count": null, "id": "bf7139de-3dda-4b9f-a04b-b70d3eb018f8", "metadata": {}, "outputs": [], "source": [ "import numcodecs\n", "\n", "losstimestart=time.time()\n", "# keep 16 mantissa bits; skip the bounds variables\n", "rounding = numcodecs.BitRound(keepbits=16)\n", "for var in omon2d.data_vars:\n", "    if \"bnds\" in var:\n", "        continue\n", "    omon2d[var].data=rounding.decode(\n", "        rounding.encode(\n", "            omon2d[var].load().data\n", "        )\n", "    )\n", "losstimeend=time.time()" ] }, { "cell_type": "code", "execution_count": null, "id": "358dc7c4-5186-4265-a5f5-8d7ee7e73c18", "metadata": {}, "outputs": [], "source": [ "resultdir={}\n", "for compr,config in tqdm.tqdm(comprdict.items()):\n", "    enc=dict()\n", "    for var in omon2d.data_vars:\n", "        enc[var]=config[\"compr\"]\n", "    start=time.time()\n", "    omon2d.to_netcdf(f\"{pwd}/test_{compr}_compression_lossy.nc\",\n", "                     mode=\"w\",\n", "                     engine=config[\"engine\"],\n", "                     unlimited_dims=\"time\",\n", "                     encoding=enc,\n", "                     )\n", "    end=time.time()\n", "    resultdir[compr]=dict(\n", "        speed=sourcesize/(end-start+losstimeend-losstimestart)/1024/1024,\n", "        ratio=fs.filesystem(\"file\").du(f\"{pwd}/test_{compr}_compression_lossy.nc\")/sourcesize\n", "    )" ] }, { "cell_type": "code", "execution_count": null, "id": "bbc7a341-cf54-4073-bf26-c1549ebe44e7", "metadata": {}, "outputs": [], "source": [ "with open(f\"results_{times}.csv\",\"a\") as f:\n", "    for k,v in resultdir.items():\n", "        f.write(f\"{k}_lossy,{sourcesize},{v['speed']},{v['ratio']}\\n\")" ] }, { "cell_type": "markdown", "id": "6cece0ed-0ab0-4fd1-a515-93f36b6753a8", "metadata": {}, "source": [ "### Xbitinfo" ] }, { "cell_type": "code", "execution_count": null, "id": "8ff2e4f1-10ae-4b0b-9d4f-5ffbf871dbdb", "metadata": {}, "outputs": [], "source": [ "# reload the uncompressed source so that we start from unrounded values\n", "omon2d=xr.open_mfdataset(\n", "    source_uncompressed,\n", "    engine=\"h5netcdf\",\n", "    parallel=False\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "09e45692-8bdd-4a78-82eb-9bead71875bb", "metadata": {}, "outputs": [], "source": [ "import xbitinfo as xb" ] }, { "cell_type": "code", "execution_count": null, "id": "3c021ba4-7172-4035-9305-ae61e92bb684", "metadata": {}, "outputs": [], "source": [ "bitinfostart=time.time()\n", "for var in omon2d.data_vars:\n", "    if \"bnds\" in var:\n", "        continue\n", "    dims=[dim for dim in omon2d[[var]].dims.keys() if \"ncell\" in dim]\n", "    print(dims)\n", "    if dims:\n", "        bitinfo = xb.get_bitinformation(omon2d[[var]], dim=dims, implementation=\"python\")\n", "        keepbits = xb.get_keepbits(bitinfo, inflevel=0.99)\n", "        print(keepbits)\n", "        if keepbits[var][0] > 0:\n", "            print(keepbits[var][0])\n", "            # xr_bitround wraps numcodecs' bitround\n", "            omon2d[var] = xb.xr_bitround(omon2d[[var]], keepbits)[var]\n", "bitinfoend=time.time()" ] }, { "cell_type": "code", "execution_count": null, "id": "0d2def50-c209-41ee-8604-5fdb8554f640", "metadata": {}, "outputs": [], "source": [ "resultdir={}\n", "for compr,config in tqdm.tqdm(comprdict.items()):\n", "    enc=dict()\n", "    for var in omon2d.data_vars:\n", "        enc[var]=config[\"compr\"]\n", "    start=time.time()\n", "    omon2d.to_netcdf(f\"{pwd}/test_{compr}_compression_lossy_xbit.nc\",\n", "                     mode=\"w\",\n", "                     engine=config[\"engine\"],\n", "                     unlimited_dims=\"time\",\n", "                     encoding=enc,\n", "                     )\n", "    end=time.time()\n", "    resultdir[compr]=dict(\n", "        speed=sourcesize/(end-start+bitinfoend-bitinfostart)/1024/1024,\n", "        ratio=fs.filesystem(\"file\").du(f\"{pwd}/test_{compr}_compression_lossy_xbit.nc\")/sourcesize\n", "    )" ] },
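{ "cell_type": "markdown", "id": "added-error-check-md", "metadata": {}, "source": [ "As an optional sanity check (a sketch; `tos` is the example variable used throughout this notebook), we can compare the bitrounded data after the *xbitinfo* rounding against the uncompressed source to see how large the introduced error actually is:" ] }, { "cell_type": "code", "execution_count": null, "id": "added-error-check-code", "metadata": {}, "outputs": [], "source": [ "# Optional sketch: maximum absolute error introduced by the bitrounding above.\n", "reference = xr.open_mfdataset(source_uncompressed, engine=\"h5netcdf\")\n", "maxerr = abs(omon2d[\"tos\"] - reference[\"tos\"]).max().compute()\n", "print(f\"Maximum absolute bitrounding error in tos: {float(maxerr)}\")" ] },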
"execution_count": null, "id": "4f95632f-ac07-4192-8a96-b3bb98ebfde2", "metadata": {}, "outputs": [], "source": [ "with open(f\"results_{str(times)}.csv\",\"a\") as f:\n", " for k,v in resultdir.items():\n", " f.write(f\"{k}_lossy_xbit,{sourcesize},{v['speed']},{v['ratio']}\\n\")" ] }, { "cell_type": "markdown", "id": "3d000b88-f995-4369-b87d-98e910c57e9a", "metadata": {}, "source": [ "### Write the results" ] }, { "cell_type": "code", "execution_count": null, "id": "1b054323-937d-4bd8-a785-d3f709a4ee26", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import glob\n", "df = pd.concat((pd.read_csv(f,names=[\"type\",\"insize\",\"write_speed [Mb/s]\",\"ratio\"]) for f in glob.glob(\"results*.csv\")), ignore_index=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "9c381d31-1156-4e95-ae07-b6cc942a9288", "metadata": {}, "outputs": [], "source": [ "df.groupby(\"type\").mean()[[\"write_speed [Mb/s]\",\"ratio\"]].sort_values(by=\"write_speed [Mb/s]\",ascending=False)" ] }, { "cell_type": "code", "execution_count": null, "id": "802c5d41-2791-4e0d-bf1f-35a1e748c9ed", "metadata": {}, "outputs": [], "source": [ "!rm test_*compression*.nc" ] }, { "cell_type": "code", "execution_count": null, "id": "0fd679fe-140a-4ab3-9a09-0349ad6ab821", "metadata": {}, "outputs": [], "source": [ "!rm temp.nc" ] }, { "cell_type": "code", "execution_count": null, "id": "31141e83-3c2c-4b05-a5ee-d070a9f7d004", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "python3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 5 }