Intake II - Modifying intake-esm databases and saving new catalogs#
Overview
🎯 objectives: Learn how to integrate intake-esm in your workflow
⌛ time_estimation: “40min”
☑️ requirements: intake_esm.__version__ == 2023.4.*, at least 10GB memory, intake I
© contributors: k204210
⚖ license:
Agenda
Based on DKRZ’s CMIP6 catalog, you will learn in this part how to
Modify the database of the catalog
Rename values in a column
Make complex searches
Find out which member was produced the most
Save subset catalogs
The database of an intake-esm catalog is processed as a pandas DataFrame. In sections 1 and 2, you will learn how to modify this database and make complex searches with pandas commands. The tutorial covers some examples - if you aim for a deeper understanding of pandas, we recommend the extensive tutorials in its own documentation.
Note
Pandas is a powerful data analysis tool, and its DataFrame class enables users to process table-like data quickly and intuitively.
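As a minimal illustration of what such a DataFrame offers (the values here are made up and not taken from the catalog):
import pandas as pd
# A small, made-up table to illustrate typical DataFrame operations
df_example = pd.DataFrame({"variable_id": ["tas", "pr", "tas"],
                           "source_id": ["MPI-ESM1-2-HR", "MPI-ESM1-2-HR", "AWI-CM-1-1-MR"]})
# Boolean indexing: select all rows where variable_id equals "tas"
df_example[df_example["variable_id"] == "tas"]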
import intake
# dkrz_cdp=intake.open_catalog(["https://dkrz.de/s/intake"])
#
# Only for generating the web page, we need to use the original link:
dkrz_cdp=intake.open_catalog(["https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml"])
esm_dkrz=dkrz_cdp.dkrz_cmip6_disk
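The catalog's database is available as a pandas DataFrame through the df attribute. A quick look at it, assuming the catalog was opened successfully:
# The underlying database is a plain pandas DataFrame
print(type(esm_dkrz.df))
# Display the first rows and the available columns
esm_dkrz.df.head()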
Modify the database#
Assume you want to rename a short name like tas into a long name like Temperature.
We define a dictionary for renaming:
rename_dict={"tas":"Temperature",
"pr":"Precipitation"}
For all items in the rename dictionary, we will reset the value in the underlying DataFrame in place. We iterate over the dictionary with .items(), which returns key and value separately for each entry.
With the .loc attribute of the catalog's DataFrame, we can access a slice inside the DataFrame. The first argument of loc is the row indexer, the second is the column indexer.
The row indexer is the condition: all rows where the variable equals the key of our dictionary, e.g. variable_id == "tas". In general terms, that is
esm_dkrz.df["variable_id"]==short_name
The column indexer is simply
variable_id
Therefore, our code looks like:
for short_name,long_name in rename_dict.items():
    esm_dkrz.df.loc[esm_dkrz.df["variable_id"]==short_name, "variable_id"]=long_name
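As a quick sanity check (not part of the original workflow), we can count how many rows now carry the new value:
# Count the assets whose variable_id was renamed to "Temperature"
print((esm_dkrz.df["variable_id"] == "Temperature").sum())
# "tas" should no longer appear among the unique values
print("tas" in esm_dkrz.df["variable_id"].unique())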
Now, you can search for “Temperature”:
esm_dkrz.search(variable_id="Temperature")
/work/ik1017/Catalogs/dkrz_cmip6_disk catalog with 2687 dataset(s) from 140620 asset(s):
| | unique |
|---|---|
| activity_id | 19 |
| institution_id | 36 |
| source_id | 91 |
| experiment_id | 208 |
| member_id | 425 |
| table_id | 9 |
| variable_id | 1 |
| grid_label | 7 |
| dcpp_init_year | 63 |
| version | 544 |
| time_range | 15866 |
| project | 1 |
| format | 1 |
| uri | 140620 |
| derived_variable_id | 0 |
Warning
In this example, we changed variable_id values, which are predefined by the CMIP6 data standard for good reasons. Do NOT consider renaming good practice. Do NOT share a catalog with renamed variables.
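A safer pattern for future analyses is to rename on a copy of the database and leave the catalog itself untouched; a sketch, assuming a freshly opened, unmodified catalog:
# Rename on a copy so the catalog keeps the CMIP6 standard names
df_renamed = esm_dkrz.df.copy()
for short_name, long_name in rename_dict.items():
    df_renamed.loc[df_renamed["variable_id"] == short_name, "variable_id"] = long_name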
Complex searches, e.g. for combinations of attributes#
If you want to know what the “preferred” ensemble member assignment of one specific ESM, or of all ESMs, is, you can find out by grouping the underlying database with groupby. It takes the columns to group by as arguments, i.e. groupby creates a new combined index from all unique combinations of the specified columns. We have to save the returned object because we continue to work with the grouped DataFrame:
grouped_df=esm_dkrz.df.groupby(["source_id","member_id"])
Note
If you work with the underlying df attribute directly, it is no longer in the context of the catalog; it is only the DataFrame that corresponds to the catalog's database.
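For illustration, you can check the types involved; this is just an aside and not needed for the workflow:
import pandas as pd
# The catalog's database is a plain DataFrame ...
print(isinstance(esm_dkrz.df, pd.DataFrame))
# ... and grouping it returns a pandas object as well, not an intake-esm catalog
print(type(grouped_df))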
The returned grouped_df is only the start of an operation. Pandas calls this workflow split-apply-combine. We have started by splitting the DataFrame into groups.
apply: We now calculate the number of entries for each group, which can easily be done with .size().
combine: Afterwards, we reset the index with reset_index(name='counts'). The columns which were used for grouping, and which were indexes in the grouped DataFrame, become regular columns again, and the group sizes end up in a new column named "counts". The return of reset_index is a regular DataFrame.
grouped_df_size=grouped_df.size().reset_index(name='counts')
grouped_df_size
| | source_id | member_id | counts |
|---|---|---|---|
| 0 | ACCESS-CM2 | r1i1p1f1 | 9742 |
| 1 | ACCESS-CM2 | r2i1p1f1 | 1339 |
| 2 | ACCESS-CM2 | r3i1p1f1 | 966 |
| 3 | ACCESS-CM2 | r4i1p1f1 | 21 |
| 4 | ACCESS-ESM1-5 | r10i1p1f1 | 670 |
| ... | ... | ... | ... |
| 1637 | UKESM1-0-LL | r6i1p1f3 | 603 |
| 1638 | UKESM1-0-LL | r7i1p1f2 | 148 |
| 1639 | UKESM1-0-LL | r7i1p1f3 | 1305 |
| 1640 | UKESM1-0-LL | r8i1p1f2 | 8789 |
| 1641 | UKESM1-0-LL | r9i1p1f2 | 1587 |
1642 rows × 3 columns
With that DataFrame, we can already display statistics of ensemble members for specific sources. If we would like to know the occurrences of ensemble members for the source MPI-ESM1-2-HR only, we can subselect this source from the data very easily. Our condition grouped_df_size["source_id"]=="MPI-ESM1-2-HR" is simply put in brackets of the DataFrame:
grouped_df_size_mpi=grouped_df_size[grouped_df_size["source_id"]=="MPI-ESM1-2-HR"]
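Equivalently, and sometimes more readable, the same subselection can be written with the DataFrame's query method; an alternative sketch, not used in the rest of the tutorial:
# Same subselection via query() instead of boolean indexing
grouped_df_size_mpi = grouped_df_size.query('source_id == "MPI-ESM1-2-HR"')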
The DataFrame has plot options which allow us to create a figure of our data directly. One nice way to plot the counts of ensemble members for MPI-ESM1-2-HR is the bar() plot:
Note
We will use hvplot because it can create interactive figures. The same plot can be created without the .hvplot part of the command. You can also check for other plot plugins of pandas.
import hvplot.pandas
grouped_df_size_mpi.hvplot.bar(x="member_id")
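As mentioned in the note, a static figure with the same content can be produced with pandas' built-in matplotlib-based plotting, roughly like this:
# Static (matplotlib-based) equivalent of the interactive hvplot figure
grouped_df_size_mpi.plot.bar(x="member_id", y="counts")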
But how is it in general for all sources? For that, we will do another split-apply-combine to get our target measure:
split: for all unique members (groupby("member_id")), …
apply: … we calculate the mean counts (mean("counts")),
combine: and sort the values with sort_values("counts", ascending=False).
We can do that in one line:
grouped_df_member=grouped_df_size.groupby("member_id").mean("counts").sort_values("counts", ascending=False)
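A roughly equivalent, slightly more explicit way to write the same chain is to select the counts column before aggregating; note that this returns a Series rather than a DataFrame:
# Equivalent chain that aggregates only the "counts" column
grouped_df_member_alt = (grouped_df_size.groupby("member_id")["counts"]
                         .mean()
                         .sort_values(ascending=False))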
As there are hundreds of unique ensemble member values, we should only plot some, e.g. the top 10:
import hvplot.pandas
grouped_df_member.iloc[0:10,:].hvplot.bar()
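An alternative way to pick the ten members with the largest mean counts, without relying on the prior sorting, is nlargest:
# Select the 10 largest mean counts explicitly and plot them
grouped_df_member.nlargest(10, "counts").hvplot.bar()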
Save a catalog subset as a new catalog#
Tip
We highly recommend that you save the subset of the catalog which you use in your analysis. Catalogs are often not as stable as they should be. With a local copy, you can check whether the original source has changed.
If we want to save our subset catalog, which only contains metadata, we first store the search result:
esm_subset=esm_dkrz.search(variable_id="Temperature")
Secondly, intake-esm provides the serialize() function. Its only required argument is the name of the catalog, which is used as the filename. By default, it writes both parts of the catalog together into a single .json file:
esm_subset.serialize("esm_subset")
Successfully wrote ESM catalog json file to: file:///builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/esm_subset.json
Or into two separate files if we provide catalog_type="file" as a second argument. The single esm_subset.json file is over 100 MB; if we save the database in a separate .csv.gz file, we reduce that to 2 MB:
esm_subset.serialize("esm_subset", catalog_type="file")
Successfully wrote ESM catalog json file to: file:///builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/esm_subset.json
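To see the size difference between the two variants, you can check the files on disk; a quick sketch with pathlib, where the .csv.gz filename is assumed to follow the catalog name:
from pathlib import Path
# Compare the sizes of the single-file catalog and the separate database file
for name in ["esm_subset.json", "esm_subset.csv.gz"]:
    path = Path(name)
    if path.exists():
        print(name, f"{path.stat().st_size / 1e6:.1f} MB")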
Now, we can open the catalog from disk:
intake.open_esm_datastore("esm_subset.json")
esm_subset catalog with 2687 dataset(s) from 140620 asset(s):
| | unique |
|---|---|
| activity_id | 19 |
| institution_id | 36 |
| source_id | 91 |
| experiment_id | 208 |
| member_id | 425 |
| table_id | 9 |
| variable_id | 1 |
| grid_label | 7 |
| dcpp_init_year | 63 |
| version | 544 |
| time_range | 15866 |
| project | 1 |
| format | 1 |
| uri | 140620 |
| derived_variable_id | 0 |
See also
This tutorial is part of a series on intake:
You can also do another CMIP6 tutorial from the official intake page.