Tzis - To Zarr in Swift#
`tzis` is a small Python package which writes datasets to the Swift object storage in Zarr format in one step. It is based on a script by Tobias Kölling which uses xarray and the fsspec implementation for Swift. `tzis` is optimized for DKRZ's High Performance Computer but can also be used from local computers.
`tzis` features

- writing of different input file formats. All files that can be passed to `xarray.open_mfdataset()` can be used.
- writing an atomic dataset, i.e. one variable covering many files, into the cloud per `write_to_swift` call.
- consolidated stores. Metadata of many files are saved into one store. Conflicting metadata with varying values are combined into a list, e.g. `tracking_id`s.
- chunking along the `time` dimension. Datasets without `time` will be written directly ("unmodified") to storage.
- a swift-store implementation for using basic filesystem-like operations on the object store (like `listdir`).
In this notebook, you will learn

- the meaning of `zarr` and the `swift` object storage
- why you can benefit from `zarr` in cloud storage
- when it is a good idea to write into the cloud
- how to initialize the swift store for `tzis`, including creating a token
- how to open and configure the source dataset
- how to write data to swift
- how to set options for the zarr output
- how to access and use data from swift
- how to work with the SwiftStore similarly to file systems
Definition#
Zarr is a cloud-optimised format for climate data. By using chunk-based data access, zarr enables arrays that can be larger than memory. Both input and output operations can be parallelised. It features customization of compression methods and stores.
The Swift cloud object storage is a 🔑 key-value store where the key is a globally unique identifier and the value a representation of binary data. In contrast to a file system 📁, there are no files or directories but objects and containers/buckets. Data access is possible via the internet, i.e. `http`.
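To make the key-value idea concrete, here is a purely illustrative sketch (a plain `dict` stands in for Swift; real Zarr stores add compression and richer metadata) of how a chunked array maps onto such a store: each chunk becomes one object, addressed by a key built from its chunk indices.

```python
import json
import numpy as np

# Purely illustrative: a plain dict stands in for the Swift key-value store.
# Each chunk of the array becomes one "object", and a small metadata object
# describes the layout.
data = np.arange(16, dtype="int32").reshape(4, 4)
chunkshape = (2, 2)

store = {}
store[".zarray"] = json.dumps({"shape": data.shape, "chunks": chunkshape}).encode()
for i in range(0, data.shape[0], chunkshape[0]):
    for j in range(0, data.shape[1], chunkshape[1]):
        key = f"{i // chunkshape[0]}.{j // chunkshape[1]}"
        store[key] = data[i:i + chunkshape[0], j:j + chunkshape[1]].tobytes()

print(sorted(store))  # ['.zarray', '0.0', '0.1', '1.0', '1.1']
```

Fetching a subset of the array then only requires downloading the keys whose chunks intersect that subset.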
Motivation#
In recent years, object storage systems have become an alternative to traditional file systems because of

- Independence from computational resources. Users can access and download data from anywhere without the need for HPC access or resources.
- Scalability, because no file system or system manager has to care about the connected disks.
- A lack of storage space in general because of increasing model output volume.
- No namespace conflicts, because data is accessed via globally unique identifiers.
Large Earth System Science databases like the CMIP Data Pool at DKRZ contain netCDF formatted data. Access to and transfers of such data from an object storage can only be conducted at file level, which results in heavy download volumes and less reproducible workflows.
The cloud-optimised climate data format Zarr solves these problems by

- allowing programs to identify the chunks corresponding to the desired subset of the data before the download, so that the volume of data transfer is reduced.
- allowing users to access the data via `http`, so that neither authentication nor software on the cloud repository site is required.
- saving metadata next to the binary data. That allows programs to quickly create a virtual representation of large and complex datasets.
Zarr formatted data in the cloud makes the data as analysis ready as possible.
With `tzis`, we developed a package that makes it possible to use DKRZ's institutional cloud storage as a back-end storage for Earth System Science data. It combines `swiftclient`-based scripts, a Zarr storage implementation and a high-level `xarray` application including rechunking. Download velocity can be up to 400 MB/s. Additional validation of the data transfer ensures its completeness.
Which type of data is suitable?#
Datasets in the cloud are useful if
- they are fixed. Moving data in the cloud is very inefficient.
- they will not be prepended. Data in the cloud can easily be appended, but prepending most likely requires moving, which is not efficient.
- they are open. One advantage comes from the easy access via http. This is even easier when users do not have to log in.
Swift authentication and initialization#
Central `tzis` functions require that you specify an `OS_AUTH_TOKEN`, which allows the program to connect to the swift storage with your credentials. By default, this token is valid for one month; otherwise, you would have to log in for each new session. When you work with `swift`, this token is saved in the hidden file `~/.swiftenv`, which contains the following parameters:

- `OS_STORAGE_URL`, which is the URL associated with the storage space of the project or the user. Note that this URL cannot be opened like a swiftbrowser link, but it can be used within programs like `tzis`.
- `OS_AUTH_TOKEN`.
Be careful with the token. It should stay readable only for you. In particular, do not push it into git repositories.
Get token and url#
Tzis
includes a function to get the token or, if not available, create the token:
from tzis import swifthandling
token=swifthandling.get_token(
"dkrz",
project,
user
)
When calling `get_token`, it

- tries to read the configuration file `~/.swiftenv`
- if there is a file, checks whether the found configuration matches the specified account
- if no file was found or the configuration is invalid, creates a token:
  - it asks you for a password
  - it writes two files: `~/.swiftenv` with the configuration and `~/.swiftenv_useracc`, which contains the account and user specification for that token
- returns a dictionary with all configuration variables
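As a sketch of the first step, reading such a configuration file could look as follows. Note that the exact file format of `~/.swiftenv` is an assumption here (shell-style lines such as `export OS_AUTH_TOKEN="..."`); inspect your own file to verify, and prefer `swifthandling.get_token` in practice.

```python
import os

def read_swiftenv(path="~/.swiftenv"):
    """Sketch: collect KEY=VALUE pairs from a shell-style token file.

    The exact format of ~/.swiftenv is an assumption here (lines such as
    `export OS_AUTH_TOKEN="..."`); inspect your own file to verify.
    """
    env = {}
    with open(os.path.expanduser(path)) as f:
        for raw in f:
            line = raw.strip()
            if line.startswith("export "):
                line = line[len("export "):]
            if "=" in line and not line.startswith("#"):
                key, _, value = line.partition("=")
                env[key.strip()] = value.strip().strip('"')
    return env
```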
Initializing an output container#
After having the authentication for swift, we initialize a swift container in which we will save the data. We do that with
target_fsmap=swifthandling.get_swift_mapper(
token["OS_STORAGE_URL"],
token["OS_AUTH_TOKEN"],
container_name,
os_name=prefix_for_object_storage
)
The mandatory arguments are:

- `os_url`, which is the `OS_STORAGE_URL`
- `os_token`, which is the `OS_AUTH_TOKEN`
- `os_container`, the container name / the bucket. A container is the highest of the two storage levels in the swift object store.

These will connect you to the swift store and initialize/create the container.
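The returned `target_fsmap` is an fsspec mapper, i.e. a dict-like view onto the object store whose keys are object names and whose values are bytes. To see how such a mapper behaves without swift credentials, you can try the same interface against fsspec's built-in `memory` filesystem as a stand-in:

```python
import fsspec

# The built-in "memory" filesystem stands in for swift here, so no
# credentials are needed; the mapper interface is the same.
demo_map = fsspec.get_mapper("memory://demo-container/demo-prefix")
demo_map["hello"] = b"world"   # upload one object
print(list(demo_map))          # ['hello']
print(demo_map["hello"])       # b'world'
```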
Open and configure the source dataset#
Tzis offers a convenient function to directly open a dataset such that it has chunks fitting the target chunk size. See the chapter Writing to swift for notes related to the chunking.
from tzis import openmf
omo = openmf.open_mfdataset_optimize(
glob_path_var,
varname,
target_fsmap,
chunkdim=chunkdim,
target_mb=target_mb
)
The mandatory arguments are

- `glob_path_var`: the dataset file(s). A `str` or a `list` of source files which can be opened with

  mf_dset = xarray.open_mfdataset(glob_path_var,
                                  decode_cf=True,
                                  use_cftime=True,
                                  data_vars='minimal',
                                  coords='minimal',
                                  compat='override',
                                  combine_attrs="drop_conflicts")

- `varname`: the variable from the dataset which will be selected and then written into the object store
- `target_fsmap`: the target store mapping, e.g. from `swifthandling.get_swift_mapper`
E.g.:
path_to_dataset = "/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/"
mfs_towrite = [path_to_dataset + filename for filename in os.listdir(path_to_dataset)]
omo = openmf.open_mfdataset_optimize(mfs_towrite, "tas", target_fsmap)
omo.mf_dset
Grib input#
If you want to use `grb` input files, you can specify `cfgrib` as an engine for `xarray`:

omo = openmf.open_mfdataset_optimize(list_of_grib_files, "pr", target_fsmap, xarray_kwargs=dict(engine="cfgrib"))
Writing to swift#
After we have initialized the container and opened the dataset, we can write it into the cloud. The conversion to `zarr` is made on the way. We can specify all necessary configuration options within the `write_zarr` function:
def write_zarr(
self,
fsmap,
mf_dset,
varname,
chunkdim="time",
target_mb=0,
startchunk=0,
validity_check=False,
maxretries=3,
trusted=True,
)
The function needs

- a target store `fsmap` as an fsspec mapping
- the input xarray dataset `mf_dset`
- the variable name `varname` which should be used for rechunking

The function allows you

- to set `chunkdim`, the dimension used for chunking. No dimension other than "time" is possible yet.
- to set `target_mb`, the target size of a data chunk in megabytes. A chunk corresponds to an object in the swift object storage. It has limitations on both sides: chunks smaller than 10 MB are not efficient, while sizes larger than 2 GB are not supported.
- to set `startchunk`. If the write process was interrupted, e.g. because your dataset is very large, you can specify at which chunk the write process should restart.
- to set `maxretries`, the number of retries if the transfer is interrupted.
- to set `validity_check=True`, which validates the transfer after the data has been completely transferred. This checks whether the data in the chunks is equal to the input data.
E.g.
from tzis import tzis
outstore=tzis.write_zarr(
omo.target_fsmap,
omo.mf_dset,
omo.varname,
verbose=True,
target_mb=0
)
The output `outstore` of `write_zarr` is a new variable which packages like `xarray` can use to open the result as a consolidated dataset. The `os_name` of the container can now be changed while the `outstore` still points to the written `os_name`.
Overwriting or appending?#
write_zarr()
per default appends data if possible. It calls xarray
’s to_zarr()
function for each chunk. Before a chunk is written, it is checked if there is already a chunk for exactly the slice of the dataset that should be written. If so, the chunk is skipped. Therefore, recalling write_zarr
only overwrites chunks if they cover a different slice of the source dataset.
In order to skip chunks, you can set startchunk
. Then, the function will jump to startchunk
and start writing this.
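The resume-and-retry behaviour described above can be sketched as follows (illustrative only, not the tzis source; all names are made up): chunks are written one by one starting at `startchunk`, and each failed transfer is retried up to `maxretries` times.

```python
# Illustrative sketch of resume-and-retry (not the tzis source).
def write_chunks(write_slice, n_chunks, startchunk=0, maxretries=3):
    for chunk in range(startchunk, n_chunks):
        for attempt in range(1, maxretries + 1):
            try:
                write_slice(chunk)
                break
            except IOError:
                if attempt == maxretries:
                    raise  # give up after maxretries failed attempts

# Demo: a flaky writer that fails once per chunk before succeeding.
written, failed_once = [], set()
def flaky(chunk):
    if chunk not in failed_once:
        failed_once.add(chunk)
        raise IOError("simulated interrupted transfer")
    written.append(chunk)

write_chunks(flaky, n_chunks=4, startchunk=1)
print(written)  # [1, 2, 3]
```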
Writing another variable from the same dataset#
Define another store by using a different `os_name`:
omo.target_fsmap= swifthandling.get_swift_mapper(
token["OS_STORAGE_URL"],
token["OS_AUTH_TOKEN"],
container_name,
os_name=new_prefix_for_new_variable
)
Set another variable name `varname`:
omo.varname=varname
Write to swift:
tzis.write_zarr()
Writing another dataset into the same container#
You do not have to log in to the same store and the same container a second time. You can still use the `container` variable. Just restart at the upload step.
Options and configuration for the zarr output#
Memory and chunk size#
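tzis steers the memory footprint via `target_mb` (see Writing to swift: chunks should stay between roughly 10 MB and 2 GB). As a rough back-of-the-envelope aid, the following hypothetical helper (not part of tzis) estimates how many `time` steps fit into one chunk of a given target size:

```python
import numpy as np

def timesteps_per_chunk(shape, itemsize, target_mb):
    """Hypothetical helper (not part of tzis): how many `time` steps of a
    (time, ...) shaped variable fit into one chunk of about `target_mb` MB."""
    bytes_per_step = int(np.prod(shape[1:])) * itemsize
    return max(1, (target_mb * 1024**2) // bytes_per_step)

# e.g. a (time, lat, lon) = (1200, 192, 384) float32 variable, 100 MB chunks:
print(timesteps_per_chunk((1200, 192, 384), 4, 100))  # 355
```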
Compression#
If you don’t specify a compressor, by default Zarr uses the Blosc compressor. Blosc is generally very fast and can be configured in a variety of ways to improve the compression ratio for different types of data. Blosc is in fact a “meta-compressor”, which means that it can use a number of different compression algorithms internally to compress the data. A list of the internal compression libraries available within Blosc can be obtained via:
from numcodecs import blosc
blosc.list_compressors()
['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib', 'zstd']
The default compressor can be changed by setting the value of the zarr.storage.default_compressor variable, e.g.:
import zarr.storage
from numcodecs import Zstd, Blosc
# switch to using Zstandard
zarr.storage.default_compressor = Zstd(level=1)
A number of different compressors can be used with Zarr. A separate package called NumCodecs is available which provides a common interface to various compressor libraries including Blosc, Zstandard, LZ4, Zlib, BZ2 and LZMA. Different compressors can be provided via the compressor keyword argument accepted by all array creation functions.
Attributes#
Attributes of the dataset are handled via `xarray` in a dictionary in the `mf_dset` variable. You can add or delete attributes just like items of a dictionary:
#add an attribute
omo.attrs["new_attribute"]="New value of attribute"
print(omo.attrs["new_attribute"])
#delete the attribute
del omo.attrs["new_attribute"]
Access and use your Zarr dataset#
You can open the consolidated zarr datasets with `xarray` using a URL-prefix-like string constructed as

zarrinput = OS_STORAGE_URL + "/" + os_container + "/" + os_name
xarray.open_zarr(zarrinput, consolidated=True, decode_times=True)
This is possible if the container is public.
If your container is private, you have to use a `zarr` storage where you first log in to the store with credentials. I.e., you can also do
zarr_dset = xarray.open_zarr(container.store, consolidated=True, decode_times=True)
zarr_dset
You can also download data from the swiftbrowser manually.
Coordinates#
Sometimes you have to reset the coordinates because they get lost during the transfer to zarr:
precoords = set(
["lat_bnds", "lev_bnds", "ap", "b", "ap_bnds", "b_bnds", "lon_bnds"]
)
coords = [x for x in zarr_dset.data_vars.variables if x in precoords]
zarr_dset = zarr_dset.set_coords(coords)
Reconvert to NetCDF#
The basic reconversion to netCDF can be done with `xarray`:
written.to_netcdf(outputfilename)
Compression and encoding#
Often, the original netCDF was compressed. You can set different compressions in an encoding dictionary. For using zlib
and its compression level 1, you can set:
var_dict = dict(zlib=True, complevel=1)
encoding = {var: var_dict for var in written.data_vars}
FillValue#
`to_netcdf` might write out FillValues for coordinates, which is not CF-compliant. In order to prevent that, set an encoding as follows:
coord_dict = dict(_FillValue=False)
encoding.update({var: coord_dict for var in written.coords})
Unlimited dimensions#
Last but not least, one key element of netCDF is the unlimited dimension. You can set a corresponding keyword argument in the `to_netcdf` command. E.g., for rewriting a zarr-CMIP6 dataset into netCDF, consider compression and FillValue in the encoding and run
written.to_netcdf("testcase.nc",
format="NETCDF4_CLASSIC",
unlimited_dims="time",
encoding=encoding)
Swift storage handling with fsspec - `chmod`, `ls`, `rm`, `mv`#
The mapper from fsspec comes with a filesystem object named `fs` which maps the API calls to the Linux commands so that they become applicable, e.g.:
outstore.fs.ls(outstore.root)
The Index#
`write_zarr` automatically appends to an index file `INDEX.csv` in the parent directory. You should find it via
import os
outstore.fs.ls(os.path.dirname(outstore.root))
You can directly read that with
import pandas as pd
index_df=pd.read_csv(os.path.dirname(outstore.root)+"/INDEX.csv")
All the URLs in the column `url` should be openable with xarray, e.g.:
import xarray as xr
xr.open_zarr(index_df["url"][0], consolidated=True)
How to make a container public#
Use the `swifthandling` module:
#with a container and a prefix, you can get the container_name via os
#import os
#container_name=os.path.basename(os.path.dirname(outstore.root))
swifthandling.toggle_public(container_name)
This will either make the container of the outstore public if it was private, or make it private again by removing all access control lists if it was public. Note that only containers as a whole can be made public or private.
By hand:

1. Log in at https://swiftbrowser.dkrz.de/login/.
2. In the line of the target container, click on the arrow with the red background on the right side and click on `share`.
3. Again, click on the arrow on the right side and click on `make public`.
Remove a zarr-store, i.e. all objects with the `os_name` prefix#
Use `fsspec`:
target_fsmap.fs.rmdir(os_name)
By hand:

1. Log in at https://swiftbrowser.dkrz.de/login/.
2. To delete a whole container: in the line of the target container, click on the arrow on the right side and click on `Delete container`.
3. To delete a single store: click on the target container and select the store to be deleted. Click on the arrow on the right side and click on `Delete`.