## ATMODAT Standard Compliance Checker

This notebook introduces you to the [atmodat checker](https://github.com/AtMoDat/atmodat_data_checker) which contains checks to ensure compliance with the ATMODAT Standard.

> Its core functionality is based on the [IOOS compliance checker](https://github.com/ioos/compliance-checker). The ATMODAT Standard Compliance Checker library makes use of [cc-yaml](https://github.com/cedadev/cc-yaml), which provides a plugin for the IOOS compliance checker that generates check suites from YAML descriptions. Furthermore, the Compliance Check Library is used as the basis to define generic, reusable compliance checks.

In addition, compliance with the **CF Conventions 1.4 or higher** is verified with the [CF checker](https://github.com/cedadev/cf-checker).

In this notebook, you will learn

- [how to use an environment on DKRZ HPC mistral or levante](#Preparation)
- [how to run checks with the atmodat data checker](#Application)
- [how to understand the results of the checker and analyse them further with pandas](#Results)
- [how you could proceed to cure the data with xarray if it does not pass the QC](#Curation)

### Preparation

On DKRZ's high-performance computer (HPC), we provide a `conda` environment which is useful for working with data in DKRZ’s CMIP Data Pool.

**Option 1: Activate checker libraries for working with a command-line shell**

If you like to work with shell commands, you can simply activate the environment. Prior to this, you may have
to load a module with a recent Python interpreter:

```bash
module load python3/unstable
# The following line activates the quality-assurance environment with the checker libraries so that you can execute them with shell commands:
source activate /work/bm0021/conda-envs/quality-assurance
``` 

**Option 2: Create a kernel with checker libraries to work with jupyter notebooks**

With `ipykernel` you can install a *kernel* which can be used within a jupyter server like [jupyterhub](https://jupyterhub.dkrz.de). `ipykernel` creates the kernel based on the activated environment.

```bash
module load python3/unstable
# The following line activates the quality-assurance environment with the checker libraries so that ipykernel can create the kernel from it:
source activate /work/bm0021/conda-envs/quality-assurance
python -m ipykernel install --user --name qualitychecker --display-name="qualitychecker"
```

If you run this command from within a jupyter server, you have to restart the jupyter server afterwards to be able to select the new *qualitychecker* kernel.

**Expert mode**: Running the jupyter server from a different environment than the environment in which atmodat is installed

Make sure that you:

1. Install the `cfunits` package to the jupyter environment via `conda install cfunits -c conda-forge -p $jupyterenv` and restart the kernel.
1. Add the atmodat environment to the `PATH` environment variable inside the notebook. Otherwise, the notebook's shell does not find the application `run_checks`. You can modify environment variables with the `os` package and its mapping `os.environ`. The environment of the kernel can be found with `sys` and `sys.executable`. The following block sets the environment variable `PATH` correctly:

In [None]:
import sys
import os
# Append the directory that contains the kernel's Python interpreter to PATH
os.environ["PATH"] = os.environ["PATH"] + os.pathsep + os.path.dirname(sys.executable)
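
To confirm that the `PATH` change took effect, a minimal sketch along the following lines could be used. The helper name `add_to_path` is hypothetical and not part of atmodat; `shutil.which` is the standard-library lookup for an executable on `PATH`. The sketch assumes that `run_checks` lives next to the kernel's Python interpreter.

```python
import os
import shutil
import sys


def add_to_path(directory, path_value):
    """Return path_value with directory appended, unless it is already listed."""
    entries = path_value.split(os.pathsep) if path_value else []
    if directory not in entries:
        entries.append(directory)
    return os.pathsep.join(entries)


# Directory containing the kernel's Python interpreter (and, by assumption, run_checks)
env_bin = os.path.dirname(sys.executable)
os.environ["PATH"] = add_to_path(env_bin, os.environ.get("PATH", ""))

# shutil.which returns the full path of the executable, or None if it is not found
print("run_checks found at:", shutil.which("run_checks"))
```

Unlike plain string concatenation, this variant does not add the same directory twice when the cell is executed repeatedly.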

In [None]:
# As long as the installation bug persists, we have to fetch the AtMoDat CVs manually:
env_root = os.path.dirname(os.path.dirname(sys.executable))
if "AtMoDat_CVs" not in [os.path.basename(dirpath) for (dirpath, dirs, files) in os.walk(env_root)]:
    !git clone https://github.com/AtMoDat/AtMoDat_CVs.git {env_root}/lib/python3.9/site-packages/atmodat_checklib/AtMoDat_CVs

### Data to be checked

In this tutorial, we will check a small subset of CMIP6 data which we obtain via `intake`:

In [None]:
import intake
# Path to master catalog on the DKRZ server
col_url = "https://dkrz.de/s/intake"
col_url = "https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml"
parent_col=intake.open_catalog([col_url])
list(parent_col)

# Open the catalog with the intake package and name it "col" as short for "collection"
col=parent_col["dkrz_cmip6_disk"]

In [None]:
# We simply use the first file listed in the CMIP6 catalog for our experiments
exp_file=col.df["uri"].values[0]
exp_file

### Application

The command `run_checks` can be executed from any directory from within the atmodat conda environment. 

The atmodat checker contains two modules:
- one that checks the global attributes for compliance with the ATMODAT standard
- another that performs a standard CF check (building upon the cfchecks library).

The following cell shows the usage instructions of `run_checks`:

In [None]:
!run_checks -h

The results of the performed checks are provided in the checker_output directory. By default, `run_checks` assumes writing permissions in the path where the atmodat checker is installed. If this is not the case, you must specify an output directory where you possess writing permissions with the `-op output_path` option.

In the following block, we set the *output path* to the current working directory, which we get via the shell command `pwd`. We apply `run_checks` to the `exp_file` which we selected in the previous section.

In [None]:
cwd=!pwd
cwd=cwd[0]
!run_checks -f {exp_file} -op {cwd} -s
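
Outside of a notebook, the `!` shell escape is not available. The same call could be sketched with `subprocess`, using only the flags shown above (`-f`, `-op`, `-s`). The helper name `build_run_checks_command` and the placeholder file name `example.nc` are hypothetical.

```python
import os
import shutil
import subprocess


def build_run_checks_command(file_path, output_path, summary=True):
    """Assemble the run_checks call used above as an argument list."""
    command = ["run_checks", "-f", str(file_path), "-op", str(output_path)]
    if summary:
        command.append("-s")  # also write short_summary.txt
    return command


# Placeholder paths for illustration only; replace them with exp_file and cwd
command = build_run_checks_command("example.nc", ".")
print(" ".join(command))

# Only execute the checker if it is on PATH and the input file exists
if shutil.which("run_checks") is not None and os.path.isfile("example.nc"):
    result = subprocess.run(command, capture_output=True, text=True)
    print(result.stdout)
```

Passing the arguments as a list avoids quoting problems when a path contains spaces.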

Now, we have a directory `atmodat_checker_output` in the output path (`-op`). For each run of `run_checks`, a new directory named after the timestamp is created inside it. Additionally, the directory *latest* always contains the output of the most recent run.

In [None]:
!ls {os.path.sep.join([cwd, "atmodat_checker_output"])}
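
The same listing can be sketched in pure Python with `pathlib`, assuming the layout described above (one subdirectory per run plus *latest*). The helper name `list_checker_runs` is hypothetical. The demonstration builds a throwaway directory tree with invented run names, because the exact timestamp format is not specified here.

```python
import tempfile
from pathlib import Path


def list_checker_runs(output_root):
    """Return the names of all run directories, excluding the 'latest' entry."""
    root = Path(output_root)
    return sorted(p.name for p in root.iterdir() if p.is_dir() and p.name != "latest")


# Demonstration on a temporary directory tree; the run names are invented
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "atmodat_checker_output"
    for name in ("run_a", "run_b", "latest"):
        (root / name).mkdir(parents=True)
    runs = list_checker_runs(root)
    print(runs)
```

In the notebook, the call would be `list_checker_runs(os.path.join(cwd, "atmodat_checker_output"))`.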

As we ran `run_checks` with the option `-s`, one output is the *short_summary.txt* file which we `cat` in the following:

In [None]:
output_dir_string=os.path.sep.join(["atmodat_checker_output","latest"])
output_path=os.path.sep.join([cwd, output_dir_string])
!cat {os.path.sep.join([output_path, "short_summary.txt"])}

### Results

The short summary contains information about versions, the timestamp of execution, the ratio of passed checks on attributes and errors written by the CF checker.

- The cfchecks routine only issues a warning/information message if variable metadata are completely missing.
- Zero errors in the cfchecks routine do not necessarily mean that a data file is CF compliant!

We can also have a look at the detailed output, including the exact error messages, in the *long_summary_* files, which are subdivided by severity level.

In [None]:
!cat {os.path.sep.join([output_path,"long_summary_recommended.csv"])}

In [None]:
!cat {os.path.sep.join([output_path,"long_summary_mandatory.csv"])}

We can open the *.csv* files with `pandas` to further analyse the output.

In [None]:
import pandas as pd
recommend_df=pd.read_csv(os.path.sep.join([output_path,"long_summary_recommended.csv"]))
recommend_df

There may be **missing** global attributes which are recommended by the *atmodat standard*. We can find them with pandas:

In [None]:
missing_recommend_atts=list(
 recommend_df.loc[recommend_df["Error Message"]=="global attribute is not present"]["Global Attribute"]
)
missing_recommend_atts
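
If pandas is not available in the kernel, the same selection could be sketched with the standard-library `csv` module. The column names `Global Attribute` and `Error Message` and the message text are taken from the cell above. The helper name `missing_attributes` is hypothetical and the sample rows are invented.

```python
import csv
import io


def missing_attributes(csv_file):
    """Collect the attributes whose error message reports them as not present."""
    reader = csv.DictReader(csv_file)
    return [
        row["Global Attribute"]
        for row in reader
        if row["Error Message"] == "global attribute is not present"
    ]


# Invented sample rows standing in for long_summary_recommended.csv
sample = io.StringIO(
    "Global Attribute,Error Message\n"
    "attribute_a,global attribute is not present\n"
    "attribute_b,\n"
    "attribute_c,global attribute is not present\n"
)
print(missing_attributes(sample))
```

For the real file, pass an open file handle, for example `open(os.path.join(output_path, "long_summary_recommended.csv"), newline="")`.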

### Curation

Let's take the first steps to *cure* the file by adding a missing attribute with `xarray`. We can open the file into an *xarray dataset* with:

In [None]:
import xarray as xr
exp_file_ds=xr.open_dataset(exp_file)
exp_file_ds

We can **handle and add attributes** via the `dict`-type attribute `.attrs`. Applied on the dataset, it shows all *global attributes* of the file:

In [None]:
exp_file_ds.attrs

We add all missing attributes and set a dummy value for them:

In [None]:
for att in missing_recommend_atts:
 exp_file_ds.attrs[att]="Dummy"
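
Because `.attrs` behaves like a `dict`, the filling logic can be sketched on a plain dictionary. The helper name `fill_missing_attrs` is hypothetical, and the attribute names and values below are invented. Unlike the loop above, this variant never overwrites an existing attribute and accepts real values where they are known.

```python
def fill_missing_attrs(attrs, missing, values=None, placeholder="Dummy"):
    """Add each missing attribute to attrs without overwriting existing entries."""
    values = values or {}
    for name in missing:
        # setdefault leaves attributes that already exist untouched
        attrs.setdefault(name, values.get(name, placeholder))
    return attrs


# A plain dict stands in for exp_file_ds.attrs; names and values are invented
attrs = {"title": "existing title"}
fill_missing_attrs(
    attrs,
    ["title", "attribute_a", "attribute_b"],
    values={"attribute_a": "known value"},
)
print(attrs)
```

In the notebook, the call would be `fill_missing_attrs(exp_file_ds.attrs, missing_recommend_atts)`.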

We save the modified dataset with the `to_netcdf` function:

In [None]:
exp_file_ds.to_netcdf("testfile-modified.nc")

Now, let's run `run_checks` again.

We can also only provide a directory instead of a file as an argument with the option `-p`. The checker will find all `.nc` files inside that directory.

In [None]:
!run_checks -p {cwd} -op {cwd} -s

Using the *latest* directory, here is the new summary:

In [None]:
!cat {os.path.sep.join([output_path,"short_summary.txt"])}

You can see that the checks no longer fail for the modified file: the number of passed checks has increased by the number of earlier failures.