Advanced quantile calculation for the MPI-ESM1-2-LR CMIP6 ensemble with xarray, pandas and hvplot

Advanced quantile calculation for the MPI-ESM1-2-LR CMIP6 ensemble with xarray, pandas and hvplot#

We will show how to calculate statistics of the MPI-ESM1-2-LR ensemble published within the Coupled Model Intercomparison Project CMIP6. Quantiles of a single-model ensemble allow to estimate the simulated climate variability. This notebook will calculate 5 quantiles for a specified region on the globe and compares them with the global quantiles. Interactive plots will be used for data visualization for easy interpretation.

We will choose one variable of multiple experiments and compare the results of different models. In particular, we analyse the historical experiment in combination with one of the shared socioeconomic pathway (ssp) experiments.

This Jupyter notebook is meant to run in the Jupyterhub server of the German Climate Computing Center DKRZ. The DKRZ hosts the CMIP data pool including 4 petabytes of CMIP6 data. Please, choose the Python 3 unstable kernel on the Kernel tab above, it contains all the common geoscience packages. See more information on how to run Jupyter notebooks at DKRZ here.

Running this Jupyter notebook in your premise, which is also known as client-side computing, will require that you install the necessary packages and download data.

Learning Objectives#

How to access a dataset from the DKRZ CMIP data pool with intake-esm
How to use xarray and pandas for data analysis
How to visualize the results with hvplot

Requirements#

About 10GB memory

import intake
import pandas as pd
import hvplot.pandas
import hvplot.xarray
import numpy as np
import cartopy.crs as ccrs

ERROR 1: PROJ: proj_create_from_database: Open of /envs/share/proj failed

Parameters#

Along with the variable_id, we can specify

the number of ensemble members ens_size that we want to take into account of our analysis
a latitude and longitude combination for our region for comparing with global climate variability

# Choose one of
# pr, psl, tas, tasmax, tasmin, clt
variable_id = "tas"
# 1-30
ens_size=29
# lat and lon in deg
regions=dict(
    user_region=dict(
        latitudes=[45,55],
        longitudes=[0,10]
    )
)
%store -r

regions["global"]=dict(
    latitudes=[-90,90],
    longitudes=[0,360]
)

# get formating done automatically according to style `black`
#%load_ext lab_black

Initialization#

The intake-esm software reads Catalogs which we use to find, access and load the data we are interested in. Daily updated CMIP6 catalogs are provided in DKRZ’s cloud swift.

# Path to master catalog on the DKRZ server
#dkrz_catalog=intake.open_catalog(["https://dkrz.de/s/intake"])
#
#only for the web page we need to take the original link:
parent_col=intake.open_catalog(["https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml"])
list(parent_col)

['dkrz_cmip5_archive',
 'dkrz_cmip5_disk',
 'dkrz_cmip6_cloud',
 'dkrz_cmip6_disk',
 'dkrz_cordex_disk',
 'dkrz_dyamond-winter_disk',
 'dkrz_era5_disk',
 'dkrz_monsoon_disk',
 'dkrz_mpige_disk',
 'dkrz_nextgems_disk',
 'dkrz_palmod2_disk']

col=parent_col["dkrz_cmip6_disk"]

We are searching for the ensemble of the ESM MPI-ESM1-2-LR for the time span 1850 to 2100. We use the historical and ssp585 experiments for the analysis. For now, we are averaging variables from the table Amon which contains atmospheric (Amon) submodel output in monthly (Amon) frequency.

query = dict(
    variable_id=variable_id,
    table_id="Amon",
    experiment_id=["historical", "ssp585"], # we have excluded "ssp245" from the list because it would take 15min to finish the nb
    source_id=["MPI-ESM1-2-LR"]#, "AWI-CM-1-1-MR"],
)
col.df["uri"]=col.df["uri"].str.replace("lustre/","lustre02/") 
cat = col.search(**query)

We open our selection and hold them as xarray datasets:

print("Opening the results as xarray dsets...")
xr_dsets=cat.to_dataset_dict(progressbar=True)

Opening the results as xarray dsets...

--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.source_id.experiment_id.table_id.grid_label'

100.00% [2/2 02:49<00:00]

Post-processing#

The main post-processing step will be done in the spatial_weighted_yearly_mean function. It contains:

Select a region from the globally model data
Calculate a weighted spatial mean. The weights are only valid for a regular shaped lon-lat grid.
Average over years

This will be done for the provided data set, variable_id, lats and lons

def spatial_weighted_yearly_mean(dset, variable_id, lats, lons):
    
    var = dset.get(variable_id)
    # Var zoomed:
    var = var.sel(lat=slice(lats[0],lats[1]),
                  lon=slice(lons[0],lons[1]))
    
    # Get weights
    weights = np.cos(np.deg2rad(dset.lat))
    # Var weighted
    varwg   = var.weighted(weights)
    # Var global mean:
    vargm   = varwg.mean(("lon", "lat"))
    # Var yearly mean:
    vargmym = vargm.groupby("time.year").mean("time")
    return vargmym.values

Calculate reference climate for anomalies#

As a refrence period, we define the 1971 to 2000 time interval. This is only available in the historical experiment. We calculate a 30-yr mean of spatial weighted yearly mean arrays of the target variable.

def calculate_historical_gmym(xr_dsets, variable_id, lats, lons):
    historical = [key for key in xr_dsets.keys() if "historical" in key][0]
    dset_hist = xr_dsets[historical]
    dset_hist_thirty = dset_hist.sel(time=dset_hist.time.dt.year.isin(range(1971, 2000)))
    # 10member
    hist_gmym = spatial_weighted_yearly_mean(dset_hist_thirty, variable_id, lats, lons)
    return hist_gmym.mean()

Arrange anomalies results in a DataFrame#

For an area plot, hvplot works best on DataFrames - the xarray-extension of hvplot converts the data array into a data frame for an area plot anyway but takes way longer. So it is decided to create a table with

years from 1850 to 2100 as index
ensembles as columns

This table is filled with the anomaly of the spatial weighted yearly means of our target variable in comparison with the 30yr historical reference period 1971-2000.

def calculate_vargmym(xr_dsets, hist_gmymref, variable_id, lats, lons):
    years = [i for i in range(1850,2101)]
    vargmympd = pd.DataFrame(index=years, columns=["r"+str(i) for i in range(ens_size)], dtype=float)
    vargmympd.index.name = "Year"
    for key in xr_dsets:
        datatoappend = spatial_weighted_yearly_mean(xr_dsets[key], variable_id, lats,lons) - hist_gmymref
        insert_years = list(xr_dsets[key].get(variable_id).groupby("time.year").groups.keys())
        for i in range(ens_size):
            vargmympd.loc[insert_years, "r"+str(i)] = datatoappend[i]
    return vargmympd

Plotting#

Areas between quantiles and line for median#

The final plot should contain two areas and one line.

One area spans the interval between the first and 99th quantile i.e. it contains 98% of all values of the ensemble
One area spans the interval between the 25th and 75th quantile i.e. it contains 50% of all values of the ensemble
The line corresponds to the median i.e. it shows the 50th quantile

With hvplot, we can generate three individual plots and combine them with *:

def plot_quantiles_area(quantiles,ylabel):
    ninetyeight=quantiles.hvplot.area(grid=True,
                                      y="0.01",
                                      y2="0.99",
                                      width=820,
                                      color="aliceblue",
                                      label="98% of all values")
    fifty=quantiles.hvplot.area(grid=True,
                                y="0.25",
                                y2="0.75",
                                width=820,
                                color="aqua",
                                label="50% of all values")
    median=quantiles.hvplot.line(grid=True,
                                 y="0.5",
                                 color="cornflowerblue",
                                 label="median")
    comb=(ninetyeight*fifty*median).opts(ylabel=ylabel, legend_position='bottom_right')
    return comb

Axis label for plotting#

We can get metadata for the chosen variable from the files so that the script can generate an axis label automatically. The metadata is stored in a dict-like .attrs variable of the dataset. The function get_label returns the fitting label:

def get_label(xr_dset, variable_id):
    lname = xr_dset.get(variable_id).attrs["long_name"]
    units = xr_dset.get(variable_id).attrs["units"]
    return "Delta " + lname + "[" + units + "]"

Running the workflow#

For both regions, the user specified one and the global comparison, we run

Calculation of a 30-yr mean historical reference for anomaly calculation
Calculation of the anomalies for each year and realization spatially weighted and averaged over the region
Calculation of the quantiles
Creation of the plots

plots=[]
quantiles=[]
for k,region in regions.items():
    lats=region["latitudes"]
    lons=region["longitudes"]
    ylabel=get_label(xr_dsets[list(xr_dsets.keys())[0]], variable_id)
    label="lats: "+str(lats[0])+"-"+str(lats[1])
    print("1. Calculate historical reference...")
    hist_gmymref = calculate_historical_gmym(xr_dsets, variable_id, lats, lons)
    print("2. Calculate spatial and yearly mean of variable "+variable_id+" ...")
    var_gmym= calculate_vargmym(xr_dsets, hist_gmymref, variable_id, lats, lons)
    print("3. Calculate quantiles ...")
    quantile=var_gmym.transpose().quantile([0.01, 0.25, 0.5, 0.75, 0.99]).transpose()
    quantiles.append(quantile)
    print("4. Creating plots and diagnostic ...")
    plots.append(plot_quantiles_area(quantile, ylabel))
    plots.append((quantile[0.99]-quantile[0.01]).hvplot.line(
        color="red",
        grid=True,
        ylabel=ylabel,
        label="0.99-0.01 Quantile, "+label
    ))

1. Calculate historical reference...

2. Calculate spatial and yearly mean of variable tas ...

Calculate quantiles ...
Creating plots and diagnostic ...
Calculate historical reference...

2. Calculate spatial and yearly mean of variable tas ...

3. Calculate quantiles ...
4. Creating plots and diagnostic ...

Displaying the result#

We combine all plots into one with one column. The first two plots show the results for the region specifed by the user, the second two for the global mean.

(plots[0]+plots[1]+plots[2]+plots[3]).cols(1)

The variability of the regional average of the yearly mean temperature is about 2-3 °C for mid-latitudes close to Europe while there is less variability in the global mean. 28 out of 30 members of the MPI-ESM1-2-LR indicate that, in a ssp585 world,

Europeans will not experience another colder-than-usual year with yearly mean temperature smaller than the historical mean from about 2030 onwards.
The date when we shifted european climate to a new state which does not produce yearly temperatures which we are used to experience is from about 2068
Europeans will not experience another year with yearly mean temperature smaller than 2°C above the historical mean from 2075 onwards

The variability of the global average of the yearly mean temperature is in a narrow band with 0.5 deg C width.

no clear change of variability in the climate state transition in the ssp585 scenario
a 2°C warmer world is reached in year 2066

Add-on#

Calculating quantiles with xarray is also easy. The results can be visualized easily with hvplot.

In the following, we show you how to do that with the historical dataset. We calculate the 98% interval between the 99th and 1st quantile and plot a quadmesh plot.

historical_ds=xr_dsets[list(xr_dsets.keys())[0]]
historical_ds=historical_ds.chunk(dict(member_id=-1)).groupby("time.year").mean("time") #.isel(time=slice(0,120)).mean(dim="time")
xr_quantiles=historical_ds.quantile(q=[0.01,0.99],dim="member_id")
xr_quantiles

<xarray.Dataset>
Dimensions:         (lat: 96, bnds: 2, lon: 192, quantile: 2, year: 86,
                     dcpp_init_year: 1)
Coordinates:
  * lat             (lat) float64 -88.57 -86.72 -84.86 ... 84.86 86.72 88.57
    lat_bnds        (lat, bnds) float64 dask.array<chunksize=(96, 2), meta=np.ndarray>
  * lon             (lon) float64 0.0 1.875 3.75 5.625 ... 354.4 356.2 358.1
    lon_bnds        (lon, bnds) float64 dask.array<chunksize=(192, 2), meta=np.ndarray>
  * dcpp_init_year  (dcpp_init_year) float64 nan
  * year            (year) int64 2015 2016 2017 2018 ... 2097 2098 2099 2100
  * quantile        (quantile) float64 0.01 0.99
Dimensions without coordinates: bnds
Data variables:
    tas             (quantile, year, dcpp_init_year, lat, lon) float64 dask.array<chunksize=(2, 1, 1, 96, 192), meta=np.ndarray>

qs=xr_quantiles.sel(quantile=0.99)-xr_quantiles.sel(quantile=0.01).squeeze()
qs

<xarray.Dataset>
Dimensions:         (lat: 96, bnds: 2, lon: 192, dcpp_init_year: 1, year: 86)
Coordinates:
  * lat             (lat) float64 -88.57 -86.72 -84.86 ... 84.86 86.72 88.57
    lat_bnds        (lat, bnds) float64 dask.array<chunksize=(96, 2), meta=np.ndarray>
  * lon             (lon) float64 0.0 1.875 3.75 5.625 ... 354.4 356.2 358.1
    lon_bnds        (lon, bnds) float64 dask.array<chunksize=(192, 2), meta=np.ndarray>
  * dcpp_init_year  (dcpp_init_year) float64 nan
  * year            (year) int64 2015 2016 2017 2018 ... 2097 2098 2099 2100
Dimensions without coordinates: bnds
Data variables:
    tas             (year, dcpp_init_year, lat, lon) float64 dask.array<chunksize=(1, 1, 96, 192), meta=np.ndarray>

#qs.tas.hvplot.contourf(levels=20, geo=True, coastline=True, width=600)
qs.tas.hvplot.quadmesh(width=600)
#qs.hvplot.quadmesh(
#    'lon', 'lat', 'tas', projection=ccrs.Orthographic(-90, 30),
#    global_extent=True, frame_height=540, cmap='viridis'#,    coastline=True
#)
#hvplot.save(testplot, "testplot.html")

Variability of the yearly mean temperature varies with location. The yearly mean temperature can be simulated more precisely i.e. varies less

above oceans than above land
in tropical than in polar regions

In a next step, these results should be validated. How well simulated is the climate variability?

compare it with reanalysis statistics

Used data#

We acknowledge the CMIP community for providing the climate model data, retained and globally distributed in the framework of the ESGF. The CMIP data of this study were replicated and made available for this study by the DKRZ.”