Intake II - Modifying intake-esm databases and saving new catalogs#
Overview
🎯 objectives: Learn how to integrate intake-esm in your workflow
⌛ time_estimation: “40min”
☑️ requirements: intake_esm.__version__ == 2023.4.*, at least 10GB memory, intake I
© contributors: k204210
⚖ license:
Agenda
Based on DKRZ’s CMIP6 catalog, you will learn in this part how to
Modify the database of the catalog
Rename values in a column
Make complex searches
Find out which member was produced the most
Save subset catalogs
The database of an intake-esm catalog is processed as a pandas DataFrame. In sections 1 and 2, you will learn how to modify this database and make complex searches with pandas commands. The tutorial covers some examples - if you aim for a deeper understanding of pandas, we recommend the extensive tutorials in its own documentation.
Note
Pandas is a powerful data analysis tool, and its DataFrame class enables users to process table-like data quickly and intuitively.
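As a minimal illustration of what such a DataFrame offers (the values here are made up and not taken from the catalog):
import pandas as pd
# A small, made-up table to illustrate typical DataFrame operations
df_example = pd.DataFrame({"variable_id": ["tas", "pr", "tas"],
                           "source_id": ["MPI-ESM1-2-HR", "MPI-ESM1-2-HR", "AWI-CM-1-1-MR"]})
# Boolean indexing: select all rows where variable_id equals "tas"
df_example[df_example["variable_id"] == "tas"]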
import intake
# dkrz_cdp=intake.open_catalog(["https://dkrz.de/s/intake"])
#
# Only for generating the web page, we need to use the original link:
dkrz_cdp=intake.open_catalog(["https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml"])
esm_dkrz=dkrz_cdp.dkrz_cmip6_disk
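The catalog's database is available as a pandas DataFrame through the df attribute. A quick look at it, assuming the catalog was opened successfully:
# The underlying database is a plain pandas DataFrame
print(type(esm_dkrz.df))
# Display the first rows and the available columns
esm_dkrz.df.head()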
Modify the database#
Assume you want to rename a short name like tas into a long name like Temperature.
We define a dictionary for renaming:
rename_dict={"tas":"Temperature",
"pr":"Precipitation"}
For all items in the rename dictionary, we will reset the value in the underlying DataFrame in place. We iterate over the dictionary with .items(), which returns key and value separately for each entry.
With the .loc attribute of the catalog's DataFrame, we can access a slice inside the DataFrame. The first argument of loc is the row indexer, the second is the column indexer.
The row indexer is the condition: all rows where the variable equals the key of our dictionary, e.g. variable_id == "tas". In general terms, that is
esm_dkrz.df["variable_id"]==short_name
The column indexer is simply
variable_id
Therefore, our code looks like:
for short_name,long_name in rename_dict.items():
    esm_dkrz.df.loc[esm_dkrz.df["variable_id"]==short_name, "variable_id"]=long_name
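As a quick sanity check (not part of the original workflow), we can count how many rows now carry the new value:
# Count the assets whose variable_id was renamed to "Temperature"
print((esm_dkrz.df["variable_id"] == "Temperature").sum())
# "tas" should no longer appear among the unique values
print("tas" in esm_dkrz.df["variable_id"].unique())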
Now, you can search for “Temperature”:
esm_dkrz.search(variable_id="Temperature")
/work/ik1017/Catalogs/dkrz_cmip6_disk catalog with 2687 dataset(s) from 140620 asset(s):
| | unique |
|---|---|
| activity_id | 19 |
| institution_id | 36 |
| source_id | 91 |
| experiment_id | 208 |
| member_id | 425 |
| table_id | 9 |
| variable_id | 1 |
| grid_label | 7 |
| dcpp_init_year | 63 |
| version | 544 |
| time_range | 15866 |
| project | 1 |
| format | 1 |
| uri | 140620 |
| derived_variable_id | 0 |
Warning
In this example, we changed variable_id values, which are predefined by the CMIP6 data standard for good reasons. Do NOT consider renaming good practice. Do NOT share a catalog with renamed variables.
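A safer pattern for future analyses is to rename on a copy of the database and leave the catalog itself untouched; a sketch, assuming a freshly opened, unmodified catalog:
# Rename on a copy so the catalog keeps the CMIP6 standard names
df_renamed = esm_dkrz.df.copy()
for short_name, long_name in rename_dict.items():
    df_renamed.loc[df_renamed["variable_id"] == short_name, "variable_id"] = long_name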
Complex searches, e.g. for combinations of attributes#
If you want to know what the “preferred” ensemble member assignment of one specific ESM, or of all ESMs, is, you can find out by grouping the underlying database with groupby. It takes the columns to group by as arguments, i.e. groupby creates a new combined index from all unique combinations of the specified columns. We have to save the returned object because we continue to work with the grouped DataFrame:
grouped_df=esm_dkrz.df.groupby(["source_id","member_id"])
Note
If you work with the underlying df attribute directly, it is no longer in the context of the catalog; it is only the DataFrame that corresponds to the catalog's database.
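For illustration, you can check the types involved; this is just an aside and not needed for the workflow:
import pandas as pd
# The catalog's database is a plain DataFrame ...
print(isinstance(esm_dkrz.df, pd.DataFrame))
# ... and grouping it returns a pandas object as well, not an intake-esm catalog
print(type(grouped_df))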
The returned grouped_df is only the start of an operation. Pandas calls this workflow split-apply-combine. We have started by splitting the DataFrame into groups.
apply: We now calculate the number of entries for each group, which can easily be done with .size().
combine: Afterwards, we reset the index with reset_index(name='counts'). The columns which were used for grouping, and which were indexes in the grouped DataFrame, become regular columns again, and the group sizes end up in a new column named "counts". The return of reset_index is a regular DataFrame.
grouped_df_size=grouped_df.size().reset_index(name='counts')
grouped_df_size
| | source_id | member_id | counts |
|---|---|---|---|
| 0 | ACCESS-CM2 | r1i1p1f1 | 9742 |
| 1 | ACCESS-CM2 | r2i1p1f1 | 1339 |
| 2 | ACCESS-CM2 | r3i1p1f1 | 966 |
| 3 | ACCESS-CM2 | r4i1p1f1 | 21 |
| 4 | ACCESS-ESM1-5 | r10i1p1f1 | 670 |
| ... | ... | ... | ... |
| 1637 | UKESM1-0-LL | r6i1p1f3 | 603 |
| 1638 | UKESM1-0-LL | r7i1p1f2 | 148 |
| 1639 | UKESM1-0-LL | r7i1p1f3 | 1305 |
| 1640 | UKESM1-0-LL | r8i1p1f2 | 8789 |
| 1641 | UKESM1-0-LL | r9i1p1f2 | 1587 |
1642 rows × 3 columns
With that DataFrame, we can already display statistics of ensemble members for specific sources. If we would like to know the occurrences of ensemble members for the source MPI-ESM1-2-HR only, we can subselect this source from the data very easily. Our condition grouped_df_size["source_id"]=="MPI-ESM1-2-HR" is simply put in brackets of the DataFrame:
grouped_df_size_mpi=grouped_df_size[grouped_df_size["source_id"]=="MPI-ESM1-2-HR"]
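Equivalently, and sometimes more readable, the same subselection can be written with the DataFrame's query method; an alternative sketch, not used in the rest of the tutorial:
# Same subselection via query() instead of boolean indexing
grouped_df_size_mpi = grouped_df_size.query('source_id == "MPI-ESM1-2-HR"')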
The DataFrame has plot options which allow us to create a figure of our data directly. One nice way to plot the counts of ensemble members for MPI-ESM1-2-HR is the bar() plot:
Note
We will use hvplot because it can create interactive figures. The same plot can be created without the .hvplot part of the command. You can also check for other plot plugins of pandas.
import hvplot.pandas
grouped_df_size_mpi.hvplot.bar(x="member_id")
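As mentioned in the note, a static figure with the same content can be produced with pandas' built-in matplotlib-based plotting, roughly like this:
# Static (matplotlib-based) equivalent of the interactive hvplot figure
grouped_df_size_mpi.plot.bar(x="member_id", y="counts")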
But how is it in general for all sources? For that, we will do another split-apply-combine to get our target measure:
split: for all unique members (groupby("member_id")), …
apply: … we calculate the mean counts (mean("counts")),
combine: and sort the values with sort_values("counts", ascending=False).
We can do that in one line:
grouped_df_member=grouped_df_size.groupby("member_id").mean("counts").sort_values("counts", ascending=False)
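A roughly equivalent, slightly more explicit way to write the same chain is to select the counts column before aggregating; note that this returns a Series rather than a DataFrame:
# Equivalent chain that aggregates only the "counts" column
grouped_df_member_alt = (grouped_df_size.groupby("member_id")["counts"]
                         .mean()
                         .sort_values(ascending=False))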
As there are hundreds of unique ensemble member values, we should only plot some, e.g. the top 10:
import hvplot.pandas
grouped_df_member.iloc[0:10,:].hvplot.bar()
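An alternative way to pick the ten members with the largest mean counts, without relying on the prior sorting, is nlargest:
# Select the 10 largest mean counts explicitly and plot them
grouped_df_member.nlargest(10, "counts").hvplot.bar()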
Save a catalog subset as a new catalog#
Tip
We highly recommend that you save the subset of the catalog which you use in your analysis. Catalogs are often not as stable as they should be. With a local copy, you can check whether the original source has changed.
If we want to save our subset catalog, which only contains metadata, we first store the search result:
esm_subset=esm_dkrz.search(variable_id="Temperature")
Secondly, intake-esm provides the serialize() function. Its only required argument is the name of the catalog, which is used as the filename. By default, it writes both parts of the catalog together into a single .json file:
esm_subset.serialize("esm_subset")
Successfully wrote ESM catalog json file to: file:///builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/esm_subset.json
Or into two separate files if we provide catalog_type="file" as a second argument. The single esm_subset.json file is over 100 MB; if we save the database in a separate .csv.gz file, we reduce that to 2 MB:
esm_subset.serialize("esm_subset", catalog_type="file")
Successfully wrote ESM catalog json file to: file:///builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/esm_subset.json
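To see the size difference between the two variants, you can check the files on disk; a quick sketch with pathlib, where the .csv.gz filename is assumed to follow the catalog name:
from pathlib import Path
# Compare the sizes of the single-file catalog and the separate database file
for name in ["esm_subset.json", "esm_subset.csv.gz"]:
    path = Path(name)
    if path.exists():
        print(name, f"{path.stat().st_size / 1e6:.1f} MB")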
Now, we can open the catalog from disk:
intake.open_esm_datastore("esm_subset.json")
esm_subset catalog with 2687 dataset(s) from 140620 asset(s):
| | unique |
|---|---|
| activity_id | 19 |
| institution_id | 36 |
| source_id | 91 |
| experiment_id | 208 |
| member_id | 425 |
| table_id | 9 |
| variable_id | 1 |
| grid_label | 7 |
| dcpp_init_year | 63 |
| version | 544 |
| time_range | 15866 |
| project | 1 |
| format | 1 |
| uri | 140620 |
| derived_variable_id | 0 |
See also
This tutorial is part of a series on intake:
You can also do another CMIP6 tutorial from the official intake page.