Intake I part 2 - DKRZ catalog scheme, strategy and services#
Overview
🎯 objectives: Learn which intake-esm ESM-collections DKRZ offers
⌛ time_estimation: “15min”
☑️ requirements: None
© contributors: k204210
⚖ license:
Agenda
In this part, you learn:
DKRZ intake-esm catalog schema
DKRZ intake-esm catalogs for project data
Catalog dependencies on different stores
Workflow at Levante for collecting and merging catalogs into main catalog
DKRZ intake-esm catalog strategy and schema#
DKRZ catalogs aim at using one common scheme for their attributes so that combining catalogs and working with multiple catalogs at the same time becomes easy. In collaboration with NextGEMS scientists, we agreed on a set of attribute names that DKRZ intake-esm catalogs should be equipped with. The resulting scheme is named the cataloonies scheme.
Note
The cataloonies scheme is not a formal standard; it is evolving and will be adapted to new use cases. It is mainly influenced by ICON output and the CMIP standard. If you have suggestions, please contact us.
As a result, you will find redundant attributes with the same meaning in project catalogs, e.g.:
source_id, model_id, model
member_id, ensemble_member, simulation_id
Which of these attributes are loaded into the Python workflow can be configured (see intake-1).
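For example, the column selection can be passed through to pandas when opening a catalog. A minimal sketch, assuming a hypothetical catalog file name and column names that may differ on your system; note that older intake-esm versions use `csv_kwargs` instead of `read_csv_kwargs`:

```python
import intake

# Hypothetical catalog file; adapt the name to the project you work with.
cat_file = "/pool/data/Catalogs/dkrz_cmip6_disk.json"

# Load only a subset of the redundant columns into the pandas DataFrame:
esm_col = intake.open_esm_datastore(
    cat_file,
    read_csv_kwargs={"usecols": ["source_id", "member_id", "variable_id", "uri"]},
)
print(esm_col.df.columns)
```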
You will find only one version of each atomic dataset in each catalog: the most recent one available in the store. An atomic dataset is identified by a unique combination of values for all catalog attributes except the time range, and it covers the entire time span of the simulation.
The cataloonies scheme#
ESM script developers, project scientists and data managers together defined the attribute names that DKRZ intake-esm catalogs should be equipped with. One benefit, resulting from the composition of this working group, is that these attribute names can be used throughout the research data life cycle: at the earliest stage for raw model output, and at the latest stage for data that has been standardized and published.
The integration of intake-esm catalog generation into ESM run scripts is planned. That will enable the use of intake, and with it the easy use and configuration of the Python software stack for data processing, from the very beginning of the data's life.
Already existing catalogs will be retrofitted with the newly defined attributes. For some, the values will fall back to None, as there is no easy way to retrieve them without opening the asset itself, which is not feasible given the amount of published project data. For CMIP6-like projects, we can take missing information from the cmor-mip-tables, which represent the data standard of the project.
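For illustration, a CMOR mip-table is a JSON file that maps each variable to its standard metadata. A minimal sketch of such a lookup, assuming a hypothetical local copy of the `CMIP6_Amon.json` table from the cmip6-cmor-tables repository:

```python
import json

# Hypothetical local path to a CMOR mip-table (assumption).
table_file = "cmip6-cmor-tables/Tables/CMIP6_Amon.json"

with open(table_file) as f:
    mip_table = json.load(f)

# Each variable entry carries standard attributes such as frequency and
# realm, which can fill missing catalog columns for CMIP6-like projects.
entry = mip_table["variable_entry"]["tas"]
print(entry["frequency"], entry["modeling_realm"])  # e.g. "mon atmos"
```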
import pandas as pd

cataloonies_raw = [
    ["#", "Attribute name and column name", "Examples", "Description", "Comments"],
    [1, "project", "DYAMON-WINTER", "The project in which the data was produced", ""],
    [2, "institution_id", "MPIM-DWD-DKRZ", "The institution that runs the simulations", ""],
    [3, "source_id", "ICON-SAP-5km", "The Earth System Model which produced the simulations", ""],
    [4, "experiment_id", "DW-CPL / DW-ATM", "The short term for the experiment that was run", ""],
    [5, "simulation_id", "dpp1234", "The simulation/member/realization of the ensemble.", ""],
    [6, "realm", "atm / oce", "The submodel of the ESM which produces the output.", ""],
    [7, "frequency", "PT1h or 1hr style", "The frequency of the output", "ICON uses ISO format"],
    [8, "time_reduction", "mean / inst / timmax / …", "The method used for sampling and averaging along time. The same as the time part of cell_methods.", ""],
    [9, "grid_label", "gn", "A clear description of the grid for distinguishing between native and regridded grids.", ""],
    [10, "grid_id", "", "A specific identifier of the grid.", "we might need more than one (e.g. horizontal + vertical)"],
    [11, "variable_id", "tas", "The CMIP short term of the variable.", ""],
    [12, "level_type", "pressure_level, atmosphere_level", "The vertical axis type used for the variable.", ""],
    [13, "time_min", 1800, "The minimal time value covered by the asset.", ""],
    [14, "time_max", 1900, "The maximal time value covered by the asset.", ""],
    [15, "format", "netcdf/zarr/…", "The format of the asset.", ""],
    [16, "uri", "url, path-to-file", "The uri used to open and load the asset.", ""],
    [17, "(time_range)", "start-end", "Combination of time_min and time_max.", ""],
]

# Build a DataFrame and show only the name, example and description columns.
pd.DataFrame(cataloonies_raw[1:], columns=cataloonies_raw[0])[cataloonies_raw[0][1:-1]]
| | Attribute name and column name | Examples | Description |
|---|---|---|---|
| 0 | project | DYAMON-WINTER | The project in which the data was produced |
| 1 | institution_id | MPIM-DWD-DKRZ | The institution that runs the simulations |
| 2 | source_id | ICON-SAP-5km | The Earth System Model which produced the simulations |
| 3 | experiment_id | DW-CPL / DW-ATM | The short term for the experiment that was run |
| 4 | simulation_id | dpp1234 | The simulation/member/realization of the ensemble. |
| 5 | realm | atm / oce | The submodel of the ESM which produces the output. |
| 6 | frequency | PT1h or 1hr style | The frequency of the output |
| 7 | time_reduction | mean / inst / timmax / … | The method used for sampling and averaging along time. The same as the time part of cell_methods. |
| 8 | grid_label | gn | A clear description of the grid for distinguishing between native and regridded grids. |
| 9 | grid_id | | A specific identifier of the grid. |
| 10 | variable_id | tas | The CMIP short term of the variable. |
| 11 | level_type | pressure_level, atmosphere_level | The vertical axis type used for the variable. |
| 12 | time_min | 1800 | The minimal time value covered by the asset. |
| 13 | time_max | 1900 | The maximal time value covered by the asset. |
| 14 | format | netcdf/zarr/… | The format of the asset. |
| 15 | uri | url, path-to-file | The uri used to open and load the asset. |
| 16 | (time_range) | start-end | Combination of time_min and time_max. |
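Thanks to the common scheme, queries look the same across project catalogs. A minimal sketch, assuming a hypothetical cataloonies-style catalog file and example column values that may differ in your catalog:

```python
import intake

# Hypothetical catalog file following the cataloonies scheme (assumption).
esm_col = intake.open_esm_datastore("/pool/data/Catalogs/dkrz_nextgems_disk.json")

# The shared attribute names make this query portable between catalogs:
subset = esm_col.search(variable_id="tas", frequency="PT1h", realm="atm")
print(subset.df[["source_id", "simulation_id", "uri"]].head())
```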
DKRZ intake-esm catalogs for community project data#
Jobs we do for you#
We make all catalogs available

- under `/pool/data/Catalogs/` for logged-in HPC users
- in the cloud
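On Levante, you can open the main catalog directly from the pool; from outside, the cloud copy works the same way via its URL. A minimal sketch, assuming the main catalog file is named `dkrz_catalog.yaml`:

```python
import intake

# Path on the HPC system; the file name is an assumption.
dkrz_cat = intake.open_catalog("/pool/data/Catalogs/dkrz_catalog.yaml")

# List the available project catalogs (sub-catalogs):
print(list(dkrz_cat))
```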
We create and update the content of the projects' catalogs regularly with scripts that run automatically as cronjobs. We set the update frequency so that the project's data is refreshed sufficiently quickly.

- The updated catalog replaces the outdated one.
- The updated catalog is uploaded to the DKRZ swift cloud.
We plan to provide a catalog that tracks data which is removed by the update.
The data bases of project catalogs#
Creation of the `.csv.gz` table:

- A file list is created based on a `find` shell command on the project directory in the data pool.
- For the column values, filenames and paths are parsed according to the project's `path_template` and `filename_template`. These templates need to be constructed with attribute values requested and required by the project.
- Filenames that cannot be parsed are sorted out.
- If more than one version is found for a dataset, only the most recent one is kept.
- Depending on the project, additional columns can be created from the project's specifications. E.g., for CMIP6 we added an `OpenDAP` column which allows users to access data from everywhere via `http`.
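The following sketch mimics that workflow with a made-up project directory and a made-up `filename_template`; the real templates are project-specific:

```python
import re
import subprocess
import pandas as pd

# Made-up filename template: VARIABLEID_SOURCEID_EXPERIMENTID_TIMERANGE.nc
filename_pattern = re.compile(
    r"(?P<variable_id>[^_]+)_(?P<source_id>[^_]+)_"
    r"(?P<experiment_id>[^_]+)_(?P<time_range>[^_.]+)\.nc$"
)

# 1. Collect all assets with a `find` on the project directory (made-up path):
files = subprocess.run(
    ["find", "/pool/data/MYPROJECT", "-name", "*.nc"],
    capture_output=True, text=True,
).stdout.splitlines()

# 2. Parse the filenames into columns; unparsable files are sorted out:
records = []
for path in files:
    match = filename_pattern.search(path.rsplit("/", 1)[-1])
    if match:
        records.append({**match.groupdict(), "uri": path, "format": "netcdf"})

# 3. Write the compressed table that backs the intake-esm catalog:
pd.DataFrame(records).to_csv("dkrz_myproject_disk.csv.gz", index=False)
```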
As of 2022, we offer project data for a range of community projects.
Catalog dependencies on different stores#
DKRZ's catalog naming convention distinguishes between the different storage formats for as long as data access to stores like the archive is either not possible or very different from disk access. The particularities of the stores are explained in the following.
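Since the store is encoded in the catalog name (`dkrz_PROJECT_STORE`), you can filter the main catalog for a specific store. A minimal sketch, reusing the assumed main catalog file from above:

```python
import intake

dkrz_cat = intake.open_catalog("/pool/data/Catalogs/dkrz_catalog.yaml")

# Catalog names end with the store, e.g. `_disk`, `_cloud` or `_archive`:
disk_catalogs = [name for name in dkrz_cat if name.endswith("_disk")]
print(disk_catalogs)
```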
Preparing project catalogs for DKRZ’s main catalog#
- Use attributes of existing catalogs and/or templates in `/pool/data/Catalogs/Templates`, but at least `uri`, `format` and `project`.
- Set permissions to readable for everyone for
  - the data referenced in the catalog under `uri`
  - the catalog itself
- Use the naming convention for DKRZ catalogs (`dkrz_PROJECT_STORE`) for your catalog.
- Link the catalog via `ln -s PATH-TO-YOUR-CATALOG /pool/data/Catalogs/Candidates/YOUR-CATALOG`.
Your catalog will then be caught by a cronjob which

- tests your catalog:
  - against the catalog naming convention
  - by opening, searching and loading it
  - if it is a disk catalog: are all `uri` values readable?
- merges or creates your catalog:
  - if a catalog for the specified project already exists in `/pool/data/Catalogs/`, the two will be merged if possible; entries of your catalog are merged if they are not duplicates
  - else, your catalog will be written to `/work/ik1017/Catalogs` and a link will be set in `/pool/data/Catalogs/`
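A minimal sketch of these checks and the merge, with made-up file names and using pandas for the duplicate handling:

```python
import os
import re
import pandas as pd

def test_catalog(name, df):
    """Test the naming convention and, for disk, the readability of assets."""
    assert re.fullmatch(r"dkrz_[A-Za-z0-9-]+_(disk|cloud|archive)", name)
    assert all(os.access(uri, os.R_OK) for uri in df["uri"])

# Made-up candidate and existing project catalogs:
candidate = pd.read_csv("candidate_catalog.csv.gz")
existing = pd.read_csv("dkrz_myproject_disk.csv.gz")

test_catalog("dkrz_myproject_disk", candidate)

# Merge: append only those candidate entries that are not duplicates.
merged = pd.concat([existing, candidate], ignore_index=True).drop_duplicates()
merged.to_csv("dkrz_myproject_disk.csv.gz", index=False)
```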
See also
This tutorial is part of a series on intake:
You can also do another CMIP6 tutorial from the official intake page.