# Data Standardization with CMOR via Python CDO

This series introduces you to the functions of **cdo cmor**.

**cdo cmor** calls the CMOR library to standardize climate model output for different projects.

## Why refine data?

Part of a sustainable scientific workflow is the refinement of the produced data to make it **FAIR**. **FAIR** research data is **F**indable, **A**ccessible, **I**nteroperable and **R**eusable. Adopting a data standard helps in particular to make data *interoperable* and *reusable* in many ways.

**Interoperability**:

- Use of a widely accepted common **data format** across all generated data such as [NetCDF]()
- The inherited **compliance** with domain-specific **conventions** like [CF]() simplifies processing.

**Reusability**:

- Data that is stored with sufficient metadata according to the data standard is analyzable without external information and therefore **self-descriptive**.
    - E.g. statistical operations and interpolations over space and time can only be processed if all temporal and spatial information, including cell and interval bounds and a full description of the vertical axis, is available in the input file.
- Continuously developed applications aim at being **compatible** with accepted data standards, which ensures long-term usability
- The definition of a **Data Reference Syntax**, including templates for storage paths and file names, allows a single file to be identified by the set of specified project attributes
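
To make the idea of a Data Reference Syntax concrete, the following minimal sketch builds a storage path and file name from a set of project attributes. The templates follow the CMIP6 directory and file name layout; the attribute values, in particular the `version` string, and the helper name `drs_path` are illustrative assumptions.

```python
from pathlib import PurePosixPath

# Illustrative attribute values for one file (placeholders, not a real dataset)
attrs = {
    "mip_era": "CMIP6",
    "activity_id": "CMIP",
    "institution_id": "MPI-M",
    "source_id": "MPI-ESM1-2-LR",
    "experiment_id": "historical",
    "member_id": "r1i1p1f1",
    "table_id": "Amon",
    "variable_id": "tas",
    "grid_label": "gn",
    "version": "v20190710",  # hypothetical version label
}

# Templates modelled on the CMIP6 Data Reference Syntax
DIR_TEMPLATE = ("{mip_era}/{activity_id}/{institution_id}/{source_id}/"
                "{experiment_id}/{member_id}/{table_id}/{variable_id}/"
                "{grid_label}/{version}")
FILE_TEMPLATE = ("{variable_id}_{table_id}_{source_id}_{experiment_id}_"
                 "{member_id}_{grid_label}_{time_range}.nc")


def drs_path(attrs, time_range):
    """Return directory plus file name for one file (hypothetical helper)."""
    directory = DIR_TEMPLATE.format(**attrs)
    filename = FILE_TEMPLATE.format(time_range=time_range, **attrs)
    return PurePosixPath(directory) / filename


path = drs_path(attrs, "185001-201412")
print(path)
```

Because every path element is a project attribute, the same set of attributes always identifies exactly one file.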

### The scope

For (very) large climate community projects like the Coupled Model Intercomparison Project (CMIP), *systematic* analysis across models is only easy if the model output is provided as **FAIR** data. In CMIP6, 2000 unique variables are defined, which can be submitted for 100 different experiments.

## Approaches

Two different approaches are possible:

- Model output adaptation
- Post-processing with specialized tools

Reasons **against** model output adaptation

- Data standards are *evolving*, which requires continuous updates of the output writing
- Once adapted to a specific standard, the output is *inflexible* with respect to new and other standards
- Conservative scientists using old but proven software will be hard to convince to switch to a new output format
- Since data standard experts rarely work on the adaptation, the task is time-consuming and error-prone

Reasons **for** post-processing with specialized tools

+ **Developer specialization**: model developers and post-processing experts can each *focus* on their own goals
+ **Guarantee**: standardizing software *ensures* that the generated data complies with the data standard, i.e. without flaws.
+ **Compatibility**: Other, older tools remain *compatible* with the original raw model output
+ **Flexibility**: Enables quick adaptation to other data standards

## Definitions

- [CMIP6](https://www.wcrp-climate.org/wgcm-cmip)
    - The recent phase 6 of the Coupled Model Intercomparison Project
- [CMIP Data Standard](https://goo.gl/neswPr)
    - Convention on climate data accepted in CMIP
- [CMOR]()
    - The Climate Model Output Rewriter can generate data compliant with the CMIP Data Standard.
- [CDO](https://code.mpimet.mpg.de/)
    - Collection of operators to process climate data.
    - The python binding is a wrapper to call a specific binary correctly

## CMOR

The [Climate Model Output Rewriter]() tool can generate data compliant with the CMIP Data Standard.

### Features

- **Different** (CMIP-like) data standards can be produced
    - No user side preparation of data standard description
- CMOR **ensures** that the output conforms to the data standard. Building upon CMOR means using synergies, which
    - avoids repeating work
    - helps to concentrate on the actual goal instead of debugging one's own cmor-lite developments

```{note}
**CMIP-like data standard** means:

Each file must  
- contain only a single output data variable 
- cover only a single simulation 
- include coordinates and additional metadata
```

## Why integrate CMOR into CDO?

CDO

- is widely used and accepted
- is actively supported by both users and developers
- has an interface that allows
    - different infile formats
    - access to all infile information no matter how it is structured
- is fast because it is written in C++

## Installation of CDO with CMOR

If you work on the DKRZ HPC system, we recommend using the versions installed here:

```bash
ls -1 /work/bm0021/cdo_incl_cmor/
```

The older CMOR version 2 is used to generate data for the CMIP5 and CORDEX-CMIP5 data standards. Due to a design change in the CMOR functions, CMOR input data is **neither upward nor downward compatible** between the versions.

The interface of cdo cmor and the format of the user input do not change with the installed CMOR version. Scripts and files used for one project can be the starting point for the next project.

### Installation with conda

Only the recent CMOR3 version can be installed and linked to CDO via conda:

```bash
conda update conda
conda create --name cdocmorenv conda-forge/label/dev::cdo -c conda-forge
conda activate cdocmorenv
```

The environment for CDO with CMOR is named **cdocmorenv** via `--name cdocmorenv`. Specifying `conda-forge/label/dev::cdo` installs CDO from the development channel, whose build includes CMOR. All other packages come from the **conda-forge channel**, selected with `-c conda-forge`.

#### Updating a conda installation of cdo with cmor:

Depending on which additional packages you have installed, you may have to lower the *channel_priority* first.

```bash
conda config --set channel_priority flexible
conda install --name cdocmorenv conda-forge/label/dev::cdo -c conda-forge
```

```{note}
Debian CDO (`sudo apt-get install cdo`) is installed **without CMOR**
```
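
Before continuing, it can be worth checking whether the `cdo` binary on your path was built with CMOR. The sketch below inspects the output of `cdo -V` and rests on the assumption that a CMOR-enabled build names CMOR in its list of linked libraries; the helper names are hypothetical, and a match is a hint rather than proof.

```python
import shutil
import subprocess


def mentions_cmor(version_text):
    """Case-insensitive search for 'cmor' in the version output."""
    return "cmor" in version_text.lower()


def cdo_has_cmor(cdo_binary="cdo"):
    """Return True if `cdo -V` mentions CMOR (assumed to indicate CMOR support)."""
    if shutil.which(cdo_binary) is None:
        return False  # no cdo binary found at all
    result = subprocess.run([cdo_binary, "-V"],
                            capture_output=True, text=True, check=False)
    # Depending on the version, cdo writes this information to stdout or stderr
    return mentions_cmor(result.stdout + result.stderr)


print("CMOR-enabled cdo found:", cdo_has_cmor())
```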

### 1. Preparation

#### Option A: On Levante

Define variables for the CDO and working directories:

In [None]:
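# Current working directory:
import os
pwd=os.getcwd()
# Directories of the examples and of the CDO installation:
workdir="/work/bm0021/cdo_incl_cmor/examples/"
cdodir="/work/bm0021/cdo_incl_cmor/"
# Info file with global attributes and control keywords:
cdocmorinfo=workdir+".cdocmorinfo"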

#### Option B: Local PC

Clone the repository with the material:

In [None]:
!git clone https://gitlab.dkrz.de/dicad-pp/cdo-incl-cmor.git

In [None]:
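# Current working directory:
import os
pwd=os.getcwd()
# Directories of the cloned repository and the hands-on material:
basedir=pwd+"/cdo-incl-cmor"
workdir=basedir+"/application/handson/"
# Info file with global attributes and control keywords:
cdocmorinfo=pwd+"/.cdocmorinfo"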

### 2. Set-up cdo in python

In [None]:
# Use the cdo binary installed in the environment of the running kernel
import sys
import os
cdobin = os.path.join(os.path.dirname(sys.executable), "cdo")

In [None]:
# Import the python bindings and connect them to the cdo binary
from cdo import Cdo
cdo = Cdo(cdobin)
cdo.debug = True
# Do not recreate output files that already exist
cdo.forceOutput = False

In [None]:
help(cdo)

### 3. Interface

In [None]:
%%capture --no-stdout
cdo.cmor(options="-h")

The operator requires at least one parameter and exactly one argument. The first parameter is always the MIP-table; further parameters are optional `key=value` pairs. The argument is the input file.
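
As a rough illustration of this interface, the sketch below composes the command line that corresponds to such a call. Nothing is executed, and `cmor_command` is a hypothetical helper, not part of the python bindings.

```python
import shlex


def cmor_command(mip_table, infile, **keywords):
    """Compose a `cdo cmor` command line (sketch only, nothing is run)."""
    # The MIP-table comes first, followed by comma-separated key=value pairs
    operator = ",".join(["cmor", mip_table] +
                        [f"{key}={value}" for key, value in keywords.items()])
    return shlex.join(["cdo", operator, infile])


print(cmor_command("Amon", "example_interface.nc", cmor_name="tas"))
# → cdo cmor,Amon,cmor_name=tas example_interface.nc
```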

## Project data standard

The project data standard is built from four different types of documents:

- **The Data Request (Dreq)**: A data standard will only be defined for variables that are *requested* for and by the project, e.g. [CMIP6](https://cmip6dr.github.io/Data_Request_Home/)
- **Output requirements (OR)**: Technical specifications for the structure, content and format of files, e.g. [CMIP6](https://goo.gl/neswPr)
- **Global attributes (GA)**: Specifications for required and optional global attributes, e.g. [CMIP6](https://goo.gl/v1drZl)
- **A registry**: Only names of institutions and ESMs that are registered are valid values of global attributes like *institution* or *source*. E.g. [CMIP6](https://github.com/WCRP-CMIP/CMIP6_CVs)

### Controlled Vocabularies (CVs) and MIP-Tables

**Dreq, GAs** and the **registry** are translated into *controlled vocabularies* (CVs). This set is also called **MIP-Tables**, e.g. [CMIP6](https://github.com/PCMDI/cmip6-cmor-tables).

- One *CV*-MIP-Table contains a condensed form of all CVs which are version controlled in the registry. It contains
    - required and optional CMIP attributes
    - allowed values for attributes
    - restrictions resulting from a setting of attributes (e.g. min. simulation years of an experiment)
    - whether additional attributes must be specified (e.g. parent attributes)
- All other MIP-Tables contain variable information

```{tip}
The MIP-Tables are *input* for CMOR. Since CMOR also implements the **OR** specifications, **CMOR output is guaranteed to be CMIP compliant**.
```

A variable can be requested for different *frequencies, dimensions or cell_methods*. E.g., it can be reasonable to provide data on model levels for reuse in ESMs while providing another version of the data on pressure levels for easy analysis.

MIP-tables are divided by their variables'

- realm
- frequencies
- grid and vertical axis types
- time cell method.

so that a variable only occurs **once** in a MIP-table. This division also keeps the tables short.

In CMIP6, the MIP-table name is constructed from a *Prefix, Frequency, Suffix and a Qualifier*. However, not all of these four parts need to be present in a MIP-table name, and not every possible combination exists as a MIP-Table.

For this notebook, we work with the CMIP6 example. You can clone the MIP-Tables repository yourself or use the submodule inside the workshop material.

In [None]:
%%bash
#!git clone https://github.com/PCMDI/cmip6-cmor-tables.git {mip_tables_dir}
rel_mip_tables_dir=configuration/cmip6/cmip6-cmor-tables/
cd $(pwd)/cdo-incl-cmor 
git submodule init ${rel_mip_tables_dir} 
git submodule update ${rel_mip_tables_dir} 
cd ${rel_mip_tables_dir} 
git checkout --track origin/01.00.31

We can **parse** the tables with the `json` package. The *Amon* MIP-Table contains a **Header** and the variable entries under the key `variable_entry`.

In [None]:
mip_tables_dir=basedir+"/configuration/cmip6/cmip6-cmor-tables/"
import json
with open(mip_tables_dir+"/Tables/CMIP6_Amon.json") as f:
    amon=json.load(f)
print(amon.keys())
print(amon["Header"].keys())
print(amon["variable_entry"].keys())
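
Since the same cmor name can appear in several MIP-Tables, it can help to list all tables that contain a given variable. The sketch below does this with the standard library only; the helper `tables_containing` and the relative path to the `Tables` directory are assumptions based on the Option B layout above.

```python
import json
from pathlib import Path


def tables_containing(tables_dir, cmor_name):
    """Return the names of all MIP-Tables that list cmor_name (hypothetical helper)."""
    hits = []
    for table_file in sorted(Path(tables_dir).glob("CMIP6_*.json")):
        with open(table_file) as f:
            table = json.load(f)
        # Tables without variable information (e.g. the CV table) are skipped
        if cmor_name in table.get("variable_entry", {}):
            hits.append(table_file.stem)
    return hits


# Assumed location of the tables relative to the current directory
tables_dir = Path("cdo-incl-cmor/configuration/cmip6/cmip6-cmor-tables/Tables")
if tables_dir.is_dir():
    print(tables_containing(tables_dir, "tas"))
```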

In [None]:
#%%capture --no-stdout
#Standardize all variables of example_interface.nc for CMIP6_Amon.json:
infotabledir="cdo-incl-cmor/configuration/cmip6/cmip6-cdocmorinfo/"
cdocmorinfos=[infotabledir+k 
              for k in ["dkrz_atts",
                        "historical_atts",
                        "mpi-esm1-2-lr_atts",
                        "cdocmorcontrol_atts",
                        "member_atts",
                        "nominalresolution_atts"]
             ]
cdocmorinfostring=','.join(cdocmorinfos)
cdo.cmor(mip_tables_dir+'/Tables/CMIP6_Amon.json,'
         'i='+cdocmorinfostring,
         input=workdir+'/example_interface.nc',
         options="-v")

### CMOR variable

The key under which a variable is listed inside a MIP-Table is called the **cmor name** of the variable.

A **CMOR-variable** is the unique combination of the **cmor name** and the corresponding **MIP-table** which includes the **cmor name**.

The data standard of the same variable can differ from one MIP-Table to another. The differences can affect all variable information, from cell methods to the grid.

- In CMIP6, the request for monthly air temperature (CMIP6_Amon.json) is different compared to daily air temperature (CMIP6_day.json)
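
To see how two CMOR-variables with the same cmor name differ, their table entries can be compared key by key. The following sketch uses two abbreviated, illustrative entries; in practice you would pass entries loaded from the MIP-Tables, e.g. `amon["variable_entry"]["tas"]`. The helper name `entry_differences` is hypothetical.

```python
def entry_differences(entry_a, entry_b):
    """Return {key: (value_a, value_b)} for every key whose values differ."""
    keys = sorted(set(entry_a) | set(entry_b))
    return {key: (entry_a.get(key), entry_b.get(key))
            for key in keys
            if entry_a.get(key) != entry_b.get(key)}


# Abbreviated, illustrative entries (not the complete table content)
tas_amon = {"frequency": "mon", "units": "K",
            "dimensions": "longitude latitude time height2m"}
tas_day = {"frequency": "day", "units": "K",
           "dimensions": "longitude latitude time height2m"}

print(entry_differences(tas_amon, tas_day))
# → {'frequency': ('mon', 'day')}
```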

### 4. cdocmorinfo 

Global attributes and operator control keywords are specified in a cdocmorinfo file.
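
A cdocmorinfo file is a plain text file of `key=value` lines. As a minimal sketch of that layout, the snippet below parses such lines into a dictionary. The sample values are illustrative, and the handling of `#` comment lines and quoted values is an assumption rather than a statement about what cdo cmor itself accepts.

```python
def parse_info(text):
    """Parse key=value lines into a dict (hypothetical helper)."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and lines starting with '#' carry no settings
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        info[key.strip()] = value.strip().strip('"')
    return info


sample = """
# illustrative values only
experiment_id=historical
source_id=MPI-ESM1-2-LR
height2m=1.5
"""
print(parse_info(sample))
```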

### Global attributes and the CV:

Project dependence of attribute nomenclature:

<table><tbody>
<tr>
<th>Attribute\project</th>
<th>CMIP6</th>
<th>CMIP5</th>
</tr>
<tr>
<td><em>MIP:</em></td>
<td>activity_id</td>
<td>project_id</td>
</tr>
<tr>
<td><em>Model:</em>
</td>
<td>source_id</td>
<td>model_id</td>
</tr>
<tr>
<td><em>Institute:</em></td>
<td>institution_id</td>
<td>institute_id</td>
</tr>
<tr>
<td><em>Ensemble member:</em></td>
<td>variant_label</td>
<td>member</td>
</tr>
<tr>
<td><em>Grid resolution:</em></td>
<td>nominal_resolution</td>
<td></td>
</tr>
</tbody></table>

Experiments are registered in the CV with attached predefined attributes:


<table><tbody>
<tr>
<th >Attributes\experiment_id</th>
<th>1pctCO2</th>
<th>amip</th>
<th>ssp585</th>
</tr>
<tr>
<td >activity_id</td>
<td>CMIP</td>
<td>CMIP</td>
<td>ScenarioMIP</td>
</tr>
<tr>
<td >experiment</td>
<td>1 percent per year increase in CO2</td>
<td>AMIP</td>
<td>update of RCP8.5 based on SSP5</td>
</tr>
<tr>
<td >sub_experiment_id</td>
<td>none</td>
<td>none</td>
<td>none</td>
</tr>
<tr>
<td >parent_activity_id</td>
<td>CMIP</td>
<td>no parent</td>
<td>CMIP</td>
</tr>
<tr>
<td >parent_experiment_id</td>
<td>piControl</td>
<td>no parent</td>
<td>historical</td>
</tr>
</tbody></table>
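
The predefined attributes of an experiment can be looked up programmatically once the CV is loaded. The sketch below works on a small inline dictionary that mirrors the `ssp585` column of the table above; the nesting `CV` → `experiment_id` and the use of lists for some values are assumptions about the layout of the CV file, and `predefined_attributes` is a hypothetical helper.

```python
def predefined_attributes(cv, experiment_id):
    """Return the attributes registered for an experiment in the CV."""
    experiments = cv["CV"]["experiment_id"]
    if experiment_id not in experiments:
        raise KeyError(f"{experiment_id!r} is not a registered experiment")
    return experiments[experiment_id]


# Inline excerpt mirroring the ssp585 column of the table above
cv = {"CV": {"experiment_id": {
    "ssp585": {
        "activity_id": ["ScenarioMIP"],
        "experiment": "update of RCP8.5 based on SSP5",
        "sub_experiment_id": ["none"],
        "parent_activity_id": ["CMIP"],
        "parent_experiment_id": ["historical"],
    },
}}}

attributes = predefined_attributes(cv, "ssp585")
print(attributes["parent_experiment_id"])
# → ['historical']
```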

In [None]:
%%capture --no-stdout
# Because
# 1. the keyword i defaults to '.cdocmorinfo' and
# 2. the attribute MIP_table_dir is specified in cdocmorinfo,
# we only need to concatenate the info files into .cdocmorinfo in our pwd:
#!rm {pwd}/.cdocmorinfo
for c in cdocmorinfos:
    !cat {c} >>{pwd}/.cdocmorinfo
# so that it is sufficient to call:
cdo.cmor(mip_tables_dir+'/Tables/CMIP6_Amon.json',input=workdir+'example_interface.nc')

### 5. Select subset of variables

In [None]:
%%capture --no-stdout
# Only process the variable with cmor_name=tas
cdo.cmor('Amon,cmor_name=tas',input=workdir+'example_interface.nc')
# Same call, but with the short keyword cn:
#cdo.cmor('Amon,cn=tas',      input=workdir+'example_interface.nc')

### 6. Variable mapping

How to map variables?

1. Know the CMOR variable you aim to produce
1. Link to the matching infile variable(s)
    - specify a recipe
1. Provide attributes

<table><tbody>
<tr>
<th >Keyword</th>
<th >Short name</th>
<th >Value format</th>
<th>Default</th>
</tr>
<tr>
<td >cmor_name</td>
<td >cn</td>
<td >Variable name included in MIP-table</td>
<td></td>
</tr>
<tr>
<td >name</td>
<td >n</td>
<td >Input variable name</td>
<td></td>
</tr>
<tr>
<td >code</td>
<td >c</td>
<td >Three-digit integer (GRIB code).</td>
<td></td>
</tr>
<tr>
<td >units</td>
<td >u</td>
<td >String. Must be readable by udunits.</td>
<td></td>
</tr>
<tr>
<td >cell_methods</td>
<td >cm</td>
<td >Character (see below)</td>
<td>m</td>
</tr>
<tr>
<td >positive</td>
<td >p</td>
<td >u=upward, d=downward</td>
<td></td>
</tr>
<tr>
<td >variable_comment</td>
<td >vc</td>
<td >String</td>
<td></td>
</tr>
</tbody></table>

In [None]:
%%capture --no-stdout
# Map the infile variable with code=167 to the CMOR variable tas.
# All mapping information describes the infile variable.
cdo.cmor('Amon,cn=tas,'
         'code=167,units=K,cell_methods=m',
         input=workdir+'example_mapping.grb')

In [None]:
%%capture --no-stdout
# Write the mapping information to a mapping table:
with open(workdir+'mapping_table.txt', 'a') as mapping_table:
    mapping_table.write('&parameter cmor_name=tas code=167 units=K cell_methods=m /\n')
# The with-block closes the file, so no explicit close() is needed.
# Select a specific variable on the command line to be mapped with mapping_table.txt:
cdo.cmor('Amon,cn=tas',
         'mapping_table='+workdir+'mapping_table.txt',
         input=workdir+'example_mapping.grb')
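
When many variables need to be mapped, the lines of the mapping table can be generated instead of typed. The following sketch reproduces the line format used in the cell above; `mapping_line` is a hypothetical helper and does not handle values that contain spaces.

```python
def mapping_line(**keywords):
    """Format one mapping table entry from keyword/value pairs."""
    pairs = " ".join(f"{key}={value}" for key, value in keywords.items())
    return f"&parameter {pairs} /"


print(mapping_line(cmor_name="tas", code=167, units="K", cell_methods="m"))
# → &parameter cmor_name=tas code=167 units=K cell_methods=m /
```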

In [None]:
%%capture --no-stdout
# Process and map all variables in example_collect.grb with the mapping table mtPERFECT.txt:
cdo.cmor('Amon',
         'mt='+workdir+'mtPERFECT.txt',
          input=workdir+'example_collect.grb')

### 7. Coordinates

In [None]:
%%capture --no-stdout
# Define the value of the z_axis height2m as 1.5 m in the cdocmorinfo file
# set up in the preparation step:
with open(cdocmorinfo, 'a') as info:
    info.write('height2m=1.5\n')
#
cdo.cmor('Amon,cn=tas',
         'z_axis=height2m',
         'mapping_table='+workdir+'mapping_table.txt',
         input=workdir+'example_T_3M.nc')