{
"cells": [
{
"cell_type": "markdown",
"id": "d8957c6c-7375-4bff-8985-475be95a245f",
"metadata": {},
"source": [
"# Tzis - To Zarr in Swift\n",
"\n",
"`tzis` is a small python package which\n",
"1. converts data into the [zarr](https://zarr.readthedocs.io/en/stable/) format and\n",
"1. writes it to the DKRZ's cloud storage space [swift](https://swiftbrowser.dkrz.de/)\n",
"\n",
"in one step. It is based on a script which uses [xarray](http://xarray.pydata.org/en/stable/index.html) and the `fsspec` [implementation for swift](https://github.com/d70-t/swiftspec) from Tobias Kölling. `tzis` is optimized for DKRZ's High Performance Computer but can also be used from local computers.\n",
"\n",
"`tzis` features\n",
"\n",
"- writing of **different input file formats**. All files that can be passed to `xarray.open_mfdataset()` can be used.\n",
"- **writing** an atomic dataset i.e. one variable covering many files into the cloud per `write_to_swift` call.\n",
"- **consolidated stores**. Metadata of many files are saved into one. Conflicting metadata with varying values are combined into a list, e.g. `tracking_id`s.\n",
"- **chunking** along the `time` dimension. Datasets without `time` will be written directly (\"unmodified\") to storage.\n",
"- **swift-store** implementation for using basic filesystem-like operations on the object store (like `listdir`)\n"
]
},
{
"cell_type": "markdown",
"id": "6f216a5e-dd27-42ad-ad9b-baa4b0434f86",
"metadata": {},
"source": [
"In this notebook, you will learn\n",
"\n",
"- the [meaning](#define) of `zarr` and the `swift object storage`\n",
"- why you can [benefit](#moti) from `zarr` in cloud storage\n",
"- [when](#when) it is a good idea to write into cloud\n",
"- how to [initializie the swift store](#token) for `tzis` including creating a token\n",
"- how to [open and configure](#source) the source dataset\n",
"- how to [write](#write) data to swift\n",
"- how to [set options](#output) for the zarr output\n",
"- how to [access](#access) and use data from swift\n",
"- how to work with the [SwiftStore](#swiftstore) similar to file systems"
]
},
{
"cell_type": "markdown",
"id": "8d6a1c4f-66b5-466a-84b6-8a011ac5bd82",
"metadata": {},
"source": [
"\n",
"\n",
"## Definition\n",
"\n",
"**Zarr** is a *cloud-optimised* format for climate data. By using *chunk*-based data access, `zarr` enables arrays the can be larger than memory. Both input and output operations can be parallelised. It features *customization* of compression methods and stores.\n",
"\n",
"The **Swift** cloud object storage is a 🔑 *Keyvalue* store where the key is a global unique identifier and the value a representation of binary data. In contrast to a file system 📁 , there are no files or directories but *objects and containers/buckets*. Data access is possible via internet i.e. `http`."
]
},
{
"cell_type": "markdown",
"id": "82d19de7-889b-423a-beb2-0ed10972c90e",
"metadata": {},
"source": [
"\n",
"\n",
"## Motivation\n",
"\n",
"In recent years, object storage systems became an alternative to traditional file systems because of\n",
"\n",
"- **Independency** from computational ressources. Users can access and download data from anywhere without the need of HPC access or resources\n",
"- **Scalability** because no filesystem or system manager has to care about the connected disks.\n",
"- **A lack of storage** space in general because of increasing model output volume.\n",
"- **No namespace conflicts** because data is accessed via global unique identifier\n",
"\n",
"Large Earth System Science data bases like the CMIP Data Pool at DKRZ contain [netCDF](https://github.com/Unidata/netcdf-c) formatted data. Access and transfers of such data from an object storage can only be conducted on file level which results in heavy download volumes and less reproducible workflows. \n",
"\n",
"The cloud-optimised climate data format [Zarr](https://zarr.readthedocs.io/en/stable/) solves these problems by\n",
"\n",
"- allowing programs to identify _chunks_ corresponding to the desired subset of the data before the download so that the **volume of data transfer is reduced**.\n",
"- allowing users to access the data via `http` so that both **no authentication** or software on the cloud repository site is required \n",
"- saving **meta data** next to the binary data. That allows programs to quickly create a virtual representation of large and complex datasets.\n",
"\n",
"Zarr formatted data in the cloud makes the data as *analysis ready* as possible.\n",
"\n",
"With `tzis`, we developed a package that enables to use DKRZ's insitutional cloud storage as a back end storage for Earth System Science data. It combines `swiftclient` based scripts, a *Zarr storage* implementation and a high-level `xarray` application including `rechunking`. Download velocity can be up to **400 MB/s**. Additional validation of the data transfer ensures its completeness."
]
},
{
"cell_type": "markdown",
"id": "87d9ffaf-55c4-4013-b961-3bfb94d9c99a",
"metadata": {},
"source": [
"\n",
"\n",
"## Which type of data is suitable?\n",
"\n",
"Datasets in the cloud are useful if\n",
"- they are *fixed*. Moving data in the cloud is very inefficient.\n",
"- they will not be *prepended*. Data in the cloud can be easily *appended* but *prepending* most likely requires moving which is not efficient.\n",
"- they are *open*. One advantage comes from the easy access via http. This is even easier when useres do not have to log in."
]
},
{
"cell_type": "markdown",
"id": "77feab5f-a512-4449-95c7-4daa0762f25f",
"metadata": {},
"source": [
"\n",
"\n",
"## Swift authentication and initialization\n",
"\n",
"Central `tzis` functions require that you specify an `OS_AUTH_TOKEN` which allows the program to connect to the swift storage with your credentials. This token is valid for a month per default. Otherwise, you would have to login for each new session. When you work with `swift`, this token is saved in the hidden file `~/.swiftenv` which contains the following paramter\n",
"- `OS_STORAGE_URL` which is the URL associated with the storage space of the project or the user. Note that this URL cannot be opened like a *swiftbrowser* link but instead it can be used within programs like `tzis`.\n",
"- `OS_AUTH_TOKEN`. \n",
"\n",
"**Be careful** with the token. It should stay only readable for you. Especially, do not push it into git repos."
]
},
{
"cell_type": "markdown",
"id": "aa1c7b7f-6c8e-44db-9c29-7bb9eee88a84",
"metadata": {},
"source": [
"\n",
"\n",
"### Get token and url\n",
"\n",
"`Tzis` includes a function to get the token or, if not available, create the token:\n",
"\n",
"```python\n",
"from tzis import swifthandling\n",
"token=swifthandling.get_token(\n",
" \"dkrz\",\n",
" project,\n",
" user\n",
")\n",
"```\n",
"\n",
"When calling `get_token`,\n",
"1. it tries to read in the configuration file `~/.swiftenv`\n",
"1. if there is a file, it checks, if the found configuration matches the specified *account*\n",
"1. if no file was found or the configuration is invalid, it will create a token\n",
" 1. it asks you for a password\n",
" 1. it writes two files: the `~/.swiftenv` with the configuration and `~/.swiftenv_useracc` which contains the account and user specification for that token.\n",
"1. it returns a dictionary with all configuration variables"
]
},
{
"cell_type": "markdown",
"id": "13396f94-6da3-4eb9-9602-1045fc9540c5",
"metadata": {},
"source": [
"### Initializing an output container\n",
"\n",
"After having the authentication for swift, we *initialize* a swift container in which we will save the data. We do that with\n",
"\n",
"```python\n",
"target_fsmap=swifthandling.get_swift_mapper(\n",
" token[\"OS_STORAGE_URL\"],\n",
" token[\"OS_AUTH_TOKEN\"],\n",
" container_name,\n",
" os_name=prefix_for_object_storage\n",
")\n",
"```\n",
"\n",
"The mandatory arguments are:\n",
"- `os_url` is the `OS_STORAGE_URL`\n",
"- `os_token` is the `OS_AUTH_TOKEN`\n",
"- `os_container` is the *container name* / the *bucket*. A container is the highest of two store levels in the swift object store.\n",
"\n",
"these will connect you to the swift store and initialize/create a container."
]
},
{
"cell_type": "markdown",
"id": "9ec4e347-4d16-4916-8cd0-355ddd512fe2",
"metadata": {},
"source": [
"\n",
"## Open and configure the source dataset\n",
"\n",
"Tzis offers a convenient function to directly open a dataset such that it has the chunks fitting to target chunk size. See the *Writing to Swift*-chapter for notes related to the chunking.\n",
"\n",
"```python\n",
"from tzis import openmf\n",
"omo = openmf.open_mfdataset_optimize(\n",
" glob_path_var,\n",
" varname,\n",
" target_fsmap,\n",
" chunkdim=chunkdim,\n",
" target_mb=target_mb\n",
")\n",
"\n",
"```\n",
"The mandatory arguments are\n",
"- `glob_path_var`: The dataset file(s). A `str` or a `list` of source files which can be opened with\n",
"```python\n",
"mf_dset = xarray.open_mfdataset(glob_path_var,\n",
" decode_cf=True,\n",
" use_cftime=True,\n",
" data_vars='minimal', \n",
" coords='minimal', \n",
" compat='override',\n",
" combine_attrs=\"drop_conflicts\")\n",
"```\n",
"- `varname`: The variable from the dataset which will be selected and then written into the object store\n",
"- `target_fsmap`\n",
"\n",
"E.g.:\n",
"```python\n",
"path_to_dataset = \"/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/\"\n",
"mfs_towrite=[path_var +filename for filename in os.listdir(path_to_dataset)]\n",
"container.mf_dataset=container.open_mf_dataset(openmf, \"pr\", target_fsmap)\n",
"container.mf_dataset\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "6736e17c-c4e5-4393-8ef9-13590c2397fe",
"metadata": {},
"source": [
"### Grib input\n",
"\n",
"If you want to use `grb` input files, you can specify `cfgrib` as an **engine** for `xarray`.\n",
"```python\n",
"container.open_mf_dataset(list_of_grib_files, \"pr\", xarray_kwargs=**dict(engine=\"cfgrib\"))\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "76171636-1f6c-453b-942e-c62d2b49467d",
"metadata": {
"tags": []
},
"source": [
"\n",
"\n",
"## Writing to swift\n",
"\n",
"After we have initialized the container and opened the dataset, we can **write** it into cloud. The conversion to `zarr` is made on the way. We can specify all necessary configuration options within the `write` function:\n",
"\n",
"```python\n",
"\n",
"def write_zarr(\n",
" self,\n",
" fsmap, \n",
" mf_dset,\n",
" varname,\n",
" chunkdim=\"time\",\n",
" target_mb=0,\n",
" startchunk=0,\n",
" validity_check=False,\n",
" maxretries=3,\n",
" trusted=True,\n",
")\n",
"```\n",
"The function needs\n",
"\n",
"- a target store `fsmap` as a *fsspec mapping*\n",
"- the input xarray dataset `mf_dset`\n",
"- the variable name `varname` which should be used to rechunk\n",
"\n",
"The function allows you\n",
"\n",
"- to set `chunkdim` which is the *dimension* used for chunking. There is yet no other dimension than \"time\" possible.\n",
"- to set the target size of a data chunk. A *chunk* corresponds to an object in the swift object storage. It has limitations on both sides: Chunks smaller than 10 MB are not efficient while sizes larger than 2GB are not supported.\n",
"- to set the `startchunk`. If the write process was interrupted - e.g. because your dataset is very large, you can specify at which chunk the write process should restart.\n",
"- to set the number of *retries* if the transfer is interrupted.\n",
"- to set `validity_check=True` which will validate the transfer after having the data completly transferred. This checks if the data in the chunks are equal to the input data.\n",
"\n",
"E.g.\n",
"```python\n",
"from tzis import tzis\n",
"outstore=tzis.write_zarr(\n",
" omo.target_fsmap,\n",
" omo.mf_dset,\n",
" omo.varname,\n",
" verbose=True,\n",
" target_mb=0\n",
")\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "9f045270-f61d-450d-8bc5-dd9a725c7dfb",
"metadata": {
"jp-MarkdownHeadingCollapsed": true,
"tags": []
},
"source": [
"The output `outstore` of `write_zarr` is a new variable which packages like `xarray` can use and open as a *consolidated* dataset. The `os_name` of `container` can now be changed while the `outstore` still points to the written `os_name`."
]
},
{
"cell_type": "markdown",
"id": "30fda29b-036a-4a8c-bf35-0421b1cad34e",
"metadata": {
"tags": []
},
"source": [
"### Overwriting or appending?\n",
"\n",
"`write_zarr()` per default **appends** data if possible. It calls `xarray`'s `to_zarr()` function *for each chunk*. Before a chunk is written, it is checked if there is already a chunk for exactly the **slice** of the dataset that should be written. If so, the chunk is skipped. Therefore, recalling `write_zarr` only overwrites chunks if they cover a different slice of the source dataset.\n",
"\n",
"In order to skip chunks, you can set `startchunk`. Then, the function will jump to `startchunk` and start writing this."
]
},
{
"cell_type": "markdown",
"id": "e33d8816-18bc-4cff-86b9-5cfac67de7de",
"metadata": {
"tags": []
},
"source": [
"### Writing another variable from the same dataset\n",
"\n",
"1. Define another store by using a different `os_name`:\n",
"```python\n",
"omo.target_fsmap= swifthandling.get_swift_mapper(\n",
" token[\"OS_STORAGE_URL\"],\n",
" token[\"OS_AUTH_TOKEN\"],\n",
" container_name,\n",
" os_name=new_prefix_for_new_variable\n",
")\n",
"```\n",
"2. Set another variable name `varname`:\n",
"```python\n",
"omo.varname=varname\n",
"```\n",
"3. Write to swift:\n",
"```python\n",
"tzis.write_zarr()\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "5aed5c16-bfee-49d9-8693-5d3a38893bee",
"metadata": {
"tags": []
},
"source": [
"### Writing another dataset into the same container\n",
"\n",
"You do not have to login to the same store and the same container a second time. You can still use the `container` variable. Just restart at [upload](#upload)."
]
},
{
"cell_type": "markdown",
"id": "cb4d8781-5314-4a55-a301-1300b4a94667",
"metadata": {
"tags": []
},
"source": [
"## Options and configuration for the zarr output"
]
},
{
"cell_type": "markdown",
"id": "5230a651-4f6d-4c12-a0d1-bb9bb790877d",
"metadata": {
"tags": []
},
"source": [
"### Memory and chunk size"
]
},
{
"cell_type": "markdown",
"id": "e494d109-82fa-448b-ac0d-ce4f77565949",
"metadata": {
"tags": []
},
"source": [
"### Compression\n",
"\n",
"[From Zarr docs:](https://zarr.readthedocs.io/en/v2.10.2/tutorial.html#compressors)\n",
"\n",
"> If you don’t specify a compressor, by default Zarr uses the [Blosc](https://github.com/Blosc) compressor. Blosc is generally very fast and can be configured in a variety of ways to improve the compression ratio for different types of data. Blosc is in fact a *“meta-compressor”*, which means that it can use a number of different compression algorithms internally to compress the data. A list of the internal compression libraries available within Blosc can be obtained via:\n",
"\n",
"```python\n",
"from numcodecs import blosc\n",
"blosc.list_compressors()\n",
"['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib', 'zstd']\n",
"```\n",
"\n",
"> The default compressor can be changed by setting the value of the zarr.storage.default_compressor variable, e.g.:\n",
"\n",
"```python\n",
"import zarr.storage\n",
"from numcodecs import Zstd, Blosc\n",
"# switch to using Zstandard\n",
"zarr.storage.default_compressor = Zstd(level=1)\n",
"```\n",
"\n",
"> A number of different compressors can be used with Zarr. A separate package called [NumCodecs](http://numcodecs.readthedocs.io/) is available which provides a common interface to various compressor libraries including Blosc, Zstandard, LZ4, Zlib, BZ2 and LZMA. Different compressors can be provided via the compressor keyword argument accepted by all array creation functions. "
]
},
{
"cell_type": "markdown",
"id": "afa8df5d-0101-4ed9-9063-f7ce1ba404c9",
"metadata": {
"tags": []
},
"source": [
"### Attributes\n",
"\n",
"*Attributes* of the dataset are handled in a `dict`ionary in the `container.mf_dset` variable via `xarray`. You can **add** or **delete** attributes just like items from a dictionary:\n",
"```python\n",
"#add an attribute\n",
"omo.attrs[\"new_attribute\"]=\"New value of attribute\"\n",
"print(omo.attrs[\"new_attribute\"])\n",
"\n",
"#delete the attribute\n",
"del omo.attrs[\"new_attribute\"]\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "b4e19440-3fd9-406f-80e2-752070e2e060",
"metadata": {},
"source": [
"\n",
"## Access and use your Zarr dataset\n",
"\n",
"1. You can open the *consolidated zarr datasets* with `xarray` using an URL-prefix-like string constructed as \n",
"```python\n",
"zarrinput=OS_STORAGE_URL+\"/\"+os_container+\"/\"+os_name\n",
"xarry.open_zarr(zarrinput, consolidated=True, decode_times=True)\n",
"```\n",
"This is possible if the container is *public*.\n",
"\n",
"1. If your container is *private*, you have to use a `zarr storage` where you have to login with credentials to the store first. I.e., you can also do\n",
"```python\n",
"zarr_dset = xarray.open_zarr(container.store, consolidated=True, decode_times=True)\n",
"zarr_dset\n",
"```\n",
"\n",
"1. You can download data from the [swiftbrowser](https://swiftbrowser.dkrz.de) manually"
]
},
{
"cell_type": "markdown",
"id": "c976d55c-502d-47ab-b3ef-67842f6aea11",
"metadata": {},
"source": [
"### Coordinates\n",
"\n",
"Sometimes, you have to *reset* the coordinates because it gets lost on the transfer to zarr:\n",
"```python\n",
"precoords = set(\n",
" [\"lat_bnds\", \"lev_bnds\", \"ap\", \"b\", \"ap_bnds\", \"b_bnds\", \"lon_bnds\"]\n",
")\n",
"coords = [x for x in zarr_dset.data_vars.variables if x in precoords]\n",
"zarr_dset = zarr_dset.set_coords(coords)\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "91a744a5-eb11-4d21-a2be-9f7ac7284c21",
"metadata": {},
"source": [
"### Reconvert to NetCDF\n",
"\n",
"The basic reconversion to netCDF can be done with `xarray`:\n",
"```python\n",
"written.to_netcdf(outputfilename)\n",
"```\n",
"\n",
"#### Compression and encoding:\n",
"\n",
"Often, the original netCDF was compressed. You can set different compressions in an **encoding** dictionary. For using `zlib` and its compression level 1, you can set:\n",
"\n",
"```python\n",
"var_dict = dict(zlib=True, complevel=1)\n",
"encoding = {var: var_dict for var in written.data_vars}\n",
"```\n",
"\n",
"#### FillValue\n",
"\n",
"`to_netcdf` might write out *FillValue*s for coordinates which is not compliant to CF. In order to prevent that, set an encoding as follows:\n",
"\n",
"```python\n",
"coord_dict = dict(_FillValue=False)\n",
"encoding.update({var: coord_dict for var in written.coords})\n",
"```\n",
"\n",
"#### Unlimited dimensions\n",
"\n",
"Last but not least, one key element of netCDF is the **unlimited dimension**. You can set a *keyword argument* in the `to_netcdf` command. E.g., for rewriting a zarr-CMIP6 dataset into netCDF, consider compression and fillValue in the encoding and run\n",
"\n",
"```python\n",
"written.to_netcdf(\"testcase.nc\",\n",
" format=\"NETCDF4_CLASSIC\",\n",
" unlimited_dims=\"time\",\n",
" encoding=encoding)\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "1eff5a21-b2dd-43b6-9e00-88779638b6aa",
"metadata": {},
"source": [
"\n",
"## Swift storage handling with fsspec - `chmod`, `ls`, `rm`, `mv`\n",
"\n",
"The mapper from fsspec comes with a *filesystem* object named `fs` which maps the api calls to the linux commands so that they become applicable, e.g.:\n",
"\n",
"```python\n",
"outstore.fs.ls(outstore.root)\n",
"```"
]
},
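{
"cell_type": "markdown",
"id": "7c1f2e3a-4b5d-4e6f-8a9b-0c1d2e3f4a5b",
"metadata": {},
"source": [
"A few more generic `fsspec` filesystem methods, as a sketch; whether each call is supported depends on the `swiftspec` backend:\n",
"\n",
"```python\n",
"# total size of all objects under the store prefix\n",
"outstore.fs.du(outstore.root)\n",
"\n",
"# check whether consolidated metadata exists\n",
"outstore.fs.exists(outstore.root + \"/.zmetadata\")\n",
"\n",
"# move or delete objects (destructive, therefore commented out)\n",
"# outstore.fs.mv(outstore.root + \"/old\", outstore.root + \"/new\")\n",
"# outstore.fs.rm(outstore.root + \"/obsolete\", recursive=True)\n",
"```"
]
},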
{
"cell_type": "markdown",
"id": "39f029d4-9efd-442d-8262-9ded55cf0a3d",
"metadata": {
"tags": []
},
"source": [
"### The Index\n",
"\n",
"`write_zarr` automatically appends to an index *INDEX.csv* in the parent directory. You should find it via\n",
"\n",
"```python\n",
"import os\n",
"outstore.fs.ls(os.path.dirname(outstore.root))\n",
"```\n",
"\n",
"You can directly read that with\n",
"\n",
"```python\n",
"import pandas as pd\n",
"index_df=pd.read_csv(os.path.dirname(outstore.root)+\"/INDEX.csv\")\n",
"```\n",
"\n",
"All the *url*s in the column *url* should be openable with xarray, e.g.:\n",
"\n",
"```python\n",
"import xarray as xr\n",
"xr.open_zarr(index_df[\"url\"][0], consolidated=True)\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "03a2f2e1-a4d7-4016-abe9-224663271a40",
"metadata": {},
"source": [
"### How to make a container public\n",
"\n",
"- use the `store`:\n",
"\n",
"```python\n",
"#with a container and a prefix, you can get the container_name via os\n",
"#import os\n",
"#container_name=os.path.basename(os.path.dirname(outstore.root))\n",
"\n",
"swifthandling.toggle_public(container_name)\n",
"```\n",
"\n",
"This will either make the container of the outstore *public* if it was not or it will make it *private* by removing all access control lists if it was public. Note that only container as a whole can be made public or private.\n",
"\n",
"- With hand:\n",
"\n",
"1. Log in at https://swiftbrowser.dkrz.de/login/ . \n",
"2. In the line of the target container, click on the arrow on the right side with the red background and click on `share`.\n",
"3. Again, click on the arrow on the right side and click on `make public`."
]
},
{
"cell_type": "markdown",
"id": "966d03c4-74a0-4f63-87ed-49ba6f4b29ae",
"metadata": {},
"source": [
"### Remove a zarr-`store` i.e. all objects with `os_name` prefix\n",
"\n",
"- use `fsspec`:\n",
"\n",
"```python\n",
"target_fsmap.fs.rmdir(os_name) \n",
"```\n",
"\n",
"- With hand:\n",
"\n",
"1. Log in at https://swiftbrowser.dkrz.de/login/ . \n",
"2.\n",
" - On the line of the target container, click on the arrow on the right side and click on `Delete container`.\n",
" - Click on the target container and select the store to be deleted. Click on the arrow on the right side and click on `Delete`."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "python3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.15"
}
},
"nbformat": 4,
"nbformat_minor": 5
}