Tzis - To Zarr in Swift#
`tzis` is a small Python package which writes datasets to the Swift object storage in Zarr format in one step. It is based on a script by Tobias Kölling which uses xarray and the fsspec implementation for Swift. `tzis` is optimized for DKRZ's High Performance Computer but can also be used from local computers.
`tzis` features

- writing of different input file formats. All files that can be passed to `xarray.open_mfdataset()` can be used.
- writing an atomic dataset, i.e. one variable covering many files, into the cloud per `write_to_swift` call.
- consolidated stores. Metadata of many files are saved into one store. Conflicting metadata with varying values are combined into a list, e.g. `tracking_id`s.
- chunking along the `time` dimension. Datasets without `time` will be written directly ("unmodified") to storage.
- a swift-store implementation for using basic filesystem-like operations on the object store (like `listdir`).
In this notebook, you will learn

- the meaning of `zarr` and the `swift` object storage
- why you can benefit from `zarr` in cloud storage
- when it is a good idea to write into the cloud
- how to initialize the swift store for `tzis`, including creating a token
- how to open and configure the source dataset
- how to write data to swift
- how to set options for the zarr output
- how to access and use data from swift
- how to work with the SwiftStore similarly to file systems
Definition#
Zarr is a cloud-optimised format for climate data. By using chunk-based data access, zarr enables arrays that can be larger than memory. Both input and output operations can be parallelised. It features customization of compression methods and stores.
The Swift cloud object storage is a 🔑 key-value store where the key is a globally unique identifier and the value a representation of binary data. In contrast to a file system 📁, there are no files or directories but objects and containers/buckets. Data access is possible via the internet, i.e. `http`.
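To make the key-value idea concrete, here is a purely illustrative sketch (a plain `dict` stands in for Swift; real Zarr stores add compression and richer metadata) of how a chunked array maps onto such a store: each chunk becomes one object, addressed by a key built from its chunk indices.

```python
import json
import numpy as np

# Purely illustrative: a plain dict stands in for the Swift key-value store.
# Each chunk of the array becomes one "object", and a small metadata object
# describes the layout.
data = np.arange(16, dtype="int32").reshape(4, 4)
chunkshape = (2, 2)

store = {}
store[".zarray"] = json.dumps({"shape": data.shape, "chunks": chunkshape}).encode()
for i in range(0, data.shape[0], chunkshape[0]):
    for j in range(0, data.shape[1], chunkshape[1]):
        key = f"{i // chunkshape[0]}.{j // chunkshape[1]}"
        store[key] = data[i:i + chunkshape[0], j:j + chunkshape[1]].tobytes()

print(sorted(store))  # ['.zarray', '0.0', '0.1', '1.0', '1.1']
```

Fetching a subset of the array then only requires downloading the keys whose chunks intersect that subset.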
Motivation#
In recent years, object storage systems have become an alternative to traditional file systems because of

- Independence from computational resources. Users can access and download data from anywhere without the need for HPC access or resources.
- Scalability, because no file system or system manager has to care about the connected disks.
- A lack of storage space in general because of increasing model output volume.
- No namespace conflicts, because data is accessed via globally unique identifiers.
Large Earth System Science databases like the CMIP Data Pool at DKRZ contain netCDF formatted data. Access to and transfers of such data from an object storage can only be conducted at file level, which results in heavy download volumes and less reproducible workflows.
The cloud-optimised climate data format Zarr solves these problems by

- allowing programs to identify the chunks corresponding to the desired subset of the data before the download, so that the volume of data transfer is reduced.
- allowing users to access the data via `http`, so that neither authentication nor software on the cloud repository site is required.
- saving metadata next to the binary data. That allows programs to quickly create a virtual representation of large and complex datasets.
Zarr formatted data in the cloud makes the data as analysis ready as possible.
With `tzis`, we developed a package that makes it possible to use DKRZ's institutional cloud storage as a back-end storage for Earth System Science data. It combines `swiftclient`-based scripts, a Zarr storage implementation and a high-level `xarray` application including rechunking. Download velocity can be up to 400 MB/s. Additional validation of the data transfer ensures its completeness.
Which type of data is suitable?#
Datasets in the cloud are useful if
- they are fixed. Moving data in the cloud is very inefficient.
- they will not be prepended. Data in the cloud can easily be appended, but prepending most likely requires moving, which is not efficient.
- they are open. One advantage comes from the easy access via http. This is even easier when users do not have to log in.
Swift authentication and initialization#
Central `tzis` functions require that you specify an `OS_AUTH_TOKEN`, which allows the program to connect to the swift storage with your credentials. By default, this token is valid for one month; otherwise, you would have to log in for each new session. When you work with `swift`, this token is saved in the hidden file `~/.swiftenv`, which contains the following parameters:

- `OS_STORAGE_URL`, which is the URL associated with the storage space of the project or the user. Note that this URL cannot be opened like a swiftbrowser link, but it can be used within programs like `tzis`.
- `OS_AUTH_TOKEN`.
Be careful with the token. It should stay readable only for you. In particular, do not push it into git repositories.
Get token and url#
Tzis
includes a function to get the token or, if not available, create the token:
from tzis import swifthandling
token=swifthandling.get_token(
"dkrz",
project,
user
)
When calling `get_token`, it

- tries to read the configuration file `~/.swiftenv`
- if there is a file, checks whether the found configuration matches the specified account
- if no file was found or the configuration is invalid, creates a token:
  - it asks you for a password
  - it writes two files: `~/.swiftenv` with the configuration and `~/.swiftenv_useracc`, which contains the account and user specification for that token
- returns a dictionary with all configuration variables
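As a sketch of the first step, reading such a configuration file could look as follows. Note that the exact file format of `~/.swiftenv` is an assumption here (shell-style lines such as `export OS_AUTH_TOKEN="..."`); inspect your own file to verify, and prefer `swifthandling.get_token` in practice.

```python
import os

def read_swiftenv(path="~/.swiftenv"):
    """Sketch: collect KEY=VALUE pairs from a shell-style token file.

    The exact format of ~/.swiftenv is an assumption here (lines such as
    `export OS_AUTH_TOKEN="..."`); inspect your own file to verify.
    """
    env = {}
    with open(os.path.expanduser(path)) as f:
        for raw in f:
            line = raw.strip()
            if line.startswith("export "):
                line = line[len("export "):]
            if "=" in line and not line.startswith("#"):
                key, _, value = line.partition("=")
                env[key.strip()] = value.strip().strip('"')
    return env
```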
Initializing an output container#
After having the authentication for swift, we initialize a swift container in which we will save the data. We do that with
target_fsmap=swifthandling.get_swift_mapper(
token["OS_STORAGE_URL"],
token["OS_AUTH_TOKEN"],
container_name,
os_name=prefix_for_object_storage
)
The mandatory arguments are:

- `os_url`, which is the `OS_STORAGE_URL`
- `os_token`, which is the `OS_AUTH_TOKEN`
- `os_container`, the container name / the bucket. A container is the highest of the two storage levels in the swift object store.

These will connect you to the swift store and initialize/create the container.
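The returned `target_fsmap` is an fsspec mapper, i.e. a dict-like view onto the object store whose keys are object names and whose values are bytes. To see how such a mapper behaves without swift credentials, you can try the same interface against fsspec's built-in `memory` filesystem as a stand-in:

```python
import fsspec

# The built-in "memory" filesystem stands in for swift here, so no
# credentials are needed; the mapper interface is the same.
demo_map = fsspec.get_mapper("memory://demo-container/demo-prefix")
demo_map["hello"] = b"world"   # upload one object
print(list(demo_map))          # ['hello']
print(demo_map["hello"])       # b'world'
```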
Open and configure the source dataset#
Tzis offers a convenient function to directly open a dataset such that it has chunks fitting the target chunk size. See the chapter Writing to swift for notes related to the chunking.
from tzis import openmf
omo = openmf.open_mfdataset_optimize(
glob_path_var,
varname,
target_fsmap,
chunkdim=chunkdim,
target_mb=target_mb
)
The mandatory arguments are

- `glob_path_var`: the dataset file(s). A `str` or a `list` of source files which can be opened with

  mf_dset = xarray.open_mfdataset(glob_path_var,
                                  decode_cf=True,
                                  use_cftime=True,
                                  data_vars='minimal',
                                  coords='minimal',
                                  compat='override',
                                  combine_attrs="drop_conflicts")

- `varname`: the variable from the dataset which will be selected and then written into the object store
- `target_fsmap`: the target store mapping, e.g. from `swifthandling.get_swift_mapper`
E.g.:
path_to_dataset = "/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/"
mfs_towrite = [path_to_dataset + filename for filename in os.listdir(path_to_dataset)]
omo = openmf.open_mfdataset_optimize(mfs_towrite, "tas", target_fsmap)
omo.mf_dset
Grib input#
If you want to use `grb` input files, you can specify `cfgrib` as an engine for `xarray`:

omo = openmf.open_mfdataset_optimize(list_of_grib_files, "pr", target_fsmap, xarray_kwargs=dict(engine="cfgrib"))
Writing to swift#
After we have initialized the container and opened the dataset, we can write it into the cloud. The conversion to `zarr` is made on the way. We can specify all necessary configuration options within the `write_zarr` function:
def write_zarr(
self,
fsmap,
mf_dset,
varname,
chunkdim="time",
target_mb=0,
startchunk=0,
validity_check=False,
maxretries=3,
trusted=True,
)
The function needs

- a target store `fsmap` as an fsspec mapping
- the input xarray dataset `mf_dset`
- the variable name `varname` which should be used for rechunking

The function allows you

- to set `chunkdim`, the dimension used for chunking. No dimension other than "time" is possible yet.
- to set `target_mb`, the target size of a data chunk in megabytes. A chunk corresponds to an object in the swift object storage. It has limitations on both sides: chunks smaller than 10 MB are not efficient, while sizes larger than 2 GB are not supported.
- to set `startchunk`. If the write process was interrupted, e.g. because your dataset is very large, you can specify at which chunk the write process should restart.
- to set `maxretries`, the number of retries if the transfer is interrupted.
- to set `validity_check=True`, which validates the transfer after the data has been completely transferred. This checks whether the data in the chunks is equal to the input data.
E.g.
from tzis import tzis
outstore=tzis.write_zarr(
omo.target_fsmap,
omo.mf_dset,
omo.varname,
verbose=True,
target_mb=0
)
The output `outstore` of `write_zarr` is a new variable which packages like `xarray` can use to open the result as a consolidated dataset. The `os_name` of the container can now be changed while the `outstore` still points to the written `os_name`.
Overwriting or appending?#
write_zarr()
per default appends data if possible. It calls xarray
’s to_zarr()
function for each chunk. Before a chunk is written, it is checked if there is already a chunk for exactly the slice of the dataset that should be written. If so, the chunk is skipped. Therefore, recalling write_zarr
only overwrites chunks if they cover a different slice of the source dataset.
In order to skip chunks, you can set startchunk
. Then, the function will jump to startchunk
and start writing this.
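The resume-and-retry behaviour described above can be sketched as follows (illustrative only, not the tzis source; all names are made up): chunks are written one by one starting at `startchunk`, and each failed transfer is retried up to `maxretries` times.

```python
# Illustrative sketch of resume-and-retry (not the tzis source).
def write_chunks(write_slice, n_chunks, startchunk=0, maxretries=3):
    for chunk in range(startchunk, n_chunks):
        for attempt in range(1, maxretries + 1):
            try:
                write_slice(chunk)
                break
            except IOError:
                if attempt == maxretries:
                    raise  # give up after maxretries failed attempts

# Demo: a flaky writer that fails once per chunk before succeeding.
written, failed_once = [], set()
def flaky(chunk):
    if chunk not in failed_once:
        failed_once.add(chunk)
        raise IOError("simulated interrupted transfer")
    written.append(chunk)

write_chunks(flaky, n_chunks=4, startchunk=1)
print(written)  # [1, 2, 3]
```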
Writing another variable from the same dataset#
Define another store by using a different `os_name`:
omo.target_fsmap= swifthandling.get_swift_mapper(
token["OS_STORAGE_URL"],
token["OS_AUTH_TOKEN"],
container_name,
os_name=new_prefix_for_new_variable
)
Set another variable name `varname`:
omo.varname=varname
Write to swift:
tzis.write_zarr()
Writing another dataset into the same container#
You do not have to log in to the same store and the same container a second time. You can still use the `container` variable. Just restart at the upload step.
Options and configuration for the zarr output#
Memory and chunk size#
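tzis steers the memory footprint via `target_mb` (see Writing to swift: chunks should stay between roughly 10 MB and 2 GB). As a rough back-of-the-envelope aid, the following hypothetical helper (not part of tzis) estimates how many `time` steps fit into one chunk of a given target size:

```python
import numpy as np

def timesteps_per_chunk(shape, itemsize, target_mb):
    """Hypothetical helper (not part of tzis): how many `time` steps of a
    (time, ...) shaped variable fit into one chunk of about `target_mb` MB."""
    bytes_per_step = int(np.prod(shape[1:])) * itemsize
    return max(1, (target_mb * 1024**2) // bytes_per_step)

# e.g. a (time, lat, lon) = (1200, 192, 384) float32 variable, 100 MB chunks:
print(timesteps_per_chunk((1200, 192, 384), 4, 100))  # 355
```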
Compression#
If you don’t specify a compressor, by default Zarr uses the Blosc compressor. Blosc is generally very fast and can be configured in a variety of ways to improve the compression ratio for different types of data. Blosc is in fact a “meta-compressor”, which means that it can use a number of different compression algorithms internally to compress the data. A list of the internal compression libraries available within Blosc can be obtained via:
from numcodecs import blosc
blosc.list_compressors()
['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib', 'zstd']
The default compressor can be changed by setting the value of the zarr.storage.default_compressor variable, e.g.:
import zarr.storage
from numcodecs import Zstd, Blosc
# switch to using Zstandard
zarr.storage.default_compressor = Zstd(level=1)
A number of different compressors can be used with Zarr. A separate package called NumCodecs is available which provides a common interface to various compressor libraries including Blosc, Zstandard, LZ4, Zlib, BZ2 and LZMA. Different compressors can be provided via the compressor keyword argument accepted by all array creation functions.
Attributes#
Attributes of the dataset are handled via `xarray` in a dictionary in the `mf_dset` variable. You can add or delete attributes just like items of a dictionary:
#add an attribute
omo.attrs["new_attribute"]="New value of attribute"
print(omo.attrs["new_attribute"])
#delete the attribute
del omo.attrs["new_attribute"]
Access and use your Zarr dataset#
You can open the consolidated zarr datasets with `xarray` using a URL-prefix-like string constructed as

zarrinput = OS_STORAGE_URL + "/" + os_container + "/" + os_name
xarray.open_zarr(zarrinput, consolidated=True, decode_times=True)
This is possible if the container is public.
If your container is private, you have to use a `zarr` storage where you first log in to the store with credentials. I.e., you can also do
zarr_dset = xarray.open_zarr(container.store, consolidated=True, decode_times=True)
zarr_dset
You can also download data from the swiftbrowser manually.
Coordinates#
Sometimes you have to reset the coordinates because they get lost during the transfer to zarr:
precoords = set(
["lat_bnds", "lev_bnds", "ap", "b", "ap_bnds", "b_bnds", "lon_bnds"]
)
coords = [x for x in zarr_dset.data_vars.variables if x in precoords]
zarr_dset = zarr_dset.set_coords(coords)
Reconvert to NetCDF#
The basic reconversion to netCDF can be done with `xarray`:
written.to_netcdf(outputfilename)
Compression and encoding#
Often, the original netCDF was compressed. You can set different compressions in an encoding dictionary. For using zlib
and its compression level 1, you can set:
var_dict = dict(zlib=True, complevel=1)
encoding = {var: var_dict for var in written.data_vars}
FillValue#
`to_netcdf` might write out FillValues for coordinates, which is not CF-compliant. In order to prevent that, set an encoding as follows:
coord_dict = dict(_FillValue=False)
encoding.update({var: coord_dict for var in written.coords})
Unlimited dimensions#
Last but not least, one key element of netCDF is the unlimited dimension. You can set a corresponding keyword argument in the `to_netcdf` command. E.g., for rewriting a zarr-CMIP6 dataset into netCDF, consider compression and FillValue in the encoding and run
written.to_netcdf("testcase.nc",
format="NETCDF4_CLASSIC",
unlimited_dims="time",
encoding=encoding)
Swift storage handling with fsspec - `chmod`, `ls`, `rm`, `mv`#
The mapper from fsspec comes with a filesystem object named `fs` which maps the API calls to the Linux commands so that they become applicable, e.g.:
outstore.fs.ls(outstore.root)
The Index#
`write_zarr` automatically appends to an index file `INDEX.csv` in the parent directory. You should find it via
import os
outstore.fs.ls(os.path.dirname(outstore.root))
You can directly read that with
import pandas as pd
index_df=pd.read_csv(os.path.dirname(outstore.root)+"/INDEX.csv")
All the URLs in the column `url` should be openable with xarray, e.g.:
import xarray as xr
xr.open_zarr(index_df["url"][0], consolidated=True)
How to make a container public#
Use the `swifthandling` module:
#with a container and a prefix, you can get the container_name via os
#import os
#container_name=os.path.basename(os.path.dirname(outstore.root))
swifthandling.toggle_public(container_name)
This will either make the container of the outstore public if it was private, or make it private again by removing all access control lists if it was public. Note that only containers as a whole can be made public or private.
By hand:

1. Log in at https://swiftbrowser.dkrz.de/login/.
2. In the line of the target container, click on the arrow with the red background on the right side and click on `share`.
3. Again, click on the arrow on the right side and click on `make public`.
Remove a zarr-store, i.e. all objects with the `os_name` prefix#
Use `fsspec`:
target_fsmap.fs.rmdir(os_name)
By hand:

1. Log in at https://swiftbrowser.dkrz.de/login/.
2. To delete a whole container: in the line of the target container, click on the arrow on the right side and click on `Delete container`.
3. To delete a single store: click on the target container and select the store to be deleted. Click on the arrow on the right side and click on `Delete`.