ATMODAT Standard Compliance Checker#
This notebook introduces you to the atmodat checker, which contains checks to ensure compliance with the ATMODAT Standard.
Its core functionality is based on the IOOS compliance checker. The ATMODAT Standard Compliance Checker library makes use of cc-yaml, which provides a plugin for the IOOS compliance checker that generates check suites from YAML descriptions. Furthermore, the Compliance Check Library is used as the basis for defining generic, reusable compliance checks.
In addition, compliance with the CF Conventions 1.4 or higher is verified with the CF checker.
In this notebook, you will learn
how to use an environment on the DKRZ HPC systems Mistral or Levante
how to run checks with the atmodat data checker
how to understand the results of the checker and further analyse them with pandas
how you could proceed to cure the data with xarray if it does not pass the QC
Preparation#
On DKRZ’s high-performance computer (HPC), we provide a conda
environment which is useful for working with data in DKRZ’s CMIP Data Pool.
Option 1: Activate checker libraries for working with a command-line shell
If you would like to work with shell commands, you can simply activate the environment. Prior to this, you may have to load a module with a recent Python interpreter:
module load python3/unstable
#The following line activates the quality-assurance environment with the checker libraries so that you can execute them with shell commands:
source activate /work/bm0021/conda-envs/quality-assurance
Option 2: Create a kernel with checker libraries to work with jupyter notebooks
With ipykernel
you can install a kernel which can be used within a jupyter server like jupyterhub. ipykernel
creates the kernel based on the activated environment.
module load python3/unstable
#The following line activates the quality-assurance environment with the checker libraries so that you can execute them with shell commands:
source activate /work/bm0021/conda-envs/quality-assurance
python -m ipykernel install --user --name qualitychecker --display-name="qualitychecker"
If you run this command from within a Jupyter server, you have to restart the Jupyter server afterwards to be able to select the new quality checker kernel.
Expert mode: Running the jupyter server from a different environment than the environment in which atmodat is installed
Make sure that you:
Install the cfunits package into the jupyter environment via conda install cfunits -c conda-forge -p $jupyterenv and restart the kernel.
Add the atmodat environment to the PATH environment variable inside the notebook. Otherwise, the notebook’s shell does not find the application run_checks. You can modify environment variables with the os package and its attribute os.environ. The environment of the kernel can be found with sys and sys.executable. The following block sets the environment variable PATH correctly:
import sys
import os
os.environ["PATH"]=os.environ["PATH"]+":"+os.path.sep.join(sys.executable.split('/')[:-1])
#As long as there is the installation bug, we have to manually get the Atmodat CVs:
if not "AtMoDat_CVs" in [dirpath.split(os.path.sep)[-1]
for (dirpath, dirs, files) in os.walk(os.path.sep.join(sys.executable.split('/')[:-2]))] :
!git clone https://github.com/AtMoDat/AtMoDat_CVs.git {os.path.sep.join(sys.executable.split('/')[:-2])}/lib/python3.9/site-packages/atmodat_checklib/AtMoDat_CVs
Data to be checked#
In this tutorial, we will check a small subset of CMIP6 data which we retrieve via intake:
import intake
# Path to master catalog on the DKRZ server
# The short link below resolves to the raw catalog URL that follows
#col_url = "https://dkrz.de/s/intake"
col_url = "https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml"
parent_col=intake.open_catalog([col_url])
list(parent_col)
# Open the catalog with the intake package and name it "col" as short for "collection"
col=parent_col["dkrz_cmip6_disk"]
# We just take the path of the first file from the CMIP6 catalog because we will run some experiments on it
exp_file=col.df["uri"].values[0]
exp_file
'/work/ik1017/CMIP6/data/CMIP6/AerChemMIP/BCC/BCC-ESM1/hist-piAer/r1i1p1f1/AERmon/c2h6/gn/v20200511/c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc'
Application#
The command run_checks
can be executed from any directory from within the atmodat conda environment.
The atmodat checker contains two modules:
one that checks the global attributes for compliance with the ATMODAT standard
another that performs a standard CF check (building upon the cfchecks library).
Show usage instructions of run_checks
!run_checks -h
usage: run_checks [-h] [-v] [-op OPATH] [-cfv CFVERSION] [-check WHATCHECKS]
[-s] [-V] [-f FILE | -p PATH | -pnr PATH_NO_RECURSIVE]
Run the AtMoDat checks suits.
options:
-h, --help show this help message and exit
-v, --verbose Print output of checkers (longer runtime due to double
call of checkers)
-op OPATH, --opath OPATH
Define custom path where checker output shall be
written
-cfv CFVERSION, --cfversion CFVERSION
Define custom CF table version against which the file
shall be checked. Valid are versions from 1.3 to 1.8.
Example: "-cfv 1.6". Default is 'auto'
-check WHATCHECKS, --whatchecks WHATCHECKS
Define if AtMoDat or CF check or both shall be
executed. Valid options: AT, CF, both. Example:
"-check CF". Default is 'both'
-s, --summary Create summary of checker output
-V, --version show program's version number and exit
-f FILE, --file FILE Processes the given file
-p PATH, --path PATH Processes all files in a given path and subdirectories
(recursive file search)
-pnr PATH_NO_RECURSIVE, --path_no_recursive PATH_NO_RECURSIVE
Processes all files in a given directory
The results of the performed checks are provided in the checker_output directory. By default, run_checks
assumes write permissions in the path where the atmodat checker is installed. If this is not the case, you must specify an output directory in which you have write permissions with -op output_path
.
In the following block, we set the output path to the current working directory, which we get via the bash command pwd
. We apply run_checks
to the exp_file
which we selected in the chapter before.
cwd=!pwd
cwd=cwd[0]
!run_checks -f {exp_file} -op {cwd} -s
Running Compliance Checker on the datasets from: ['/work/ik1017/CMIP6/data/CMIP6/AerChemMIP/BCC/BCC-ESM1/hist-piAer/r1i1p1f1/AERmon/c2h6/gn/v20200511/c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc']
2023-07-05 14:33:44.523616 [INFO] :: PYESSV :: Loading vocabularies from /envs/lib/python3.11/site-packages/atmodat_checklib/AtMoDat_CVs/pyessv-archive:
2023-07-05 14:33:44.702117 [INFO] :: PYESSV :: ... loaded: atmodat
--- 13.0180 seconds for checking 1 files---
Now, we have a directory atmodat_checker_output
in the output path op
. For each run of run_checks
, a new directory named by the timestamp of the run is created inside of op
. Additionally, a directory latest always contains the output of the most recent run.
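The timestamp-style directory name can be reproduced with the standard datetime module. Note that the %Y%m%d_%H%M format string is inferred from the directory listing below, not taken from the checker’s source code:

```python
import datetime

# Sketch: reproduce the per-run directory name that run_checks appears
# to use, e.g. "20230705_1433". The format string is inferred from the
# checker output, not from its source code.
stamp = datetime.datetime(2023, 7, 5, 14, 33).strftime("%Y%m%d_%H%M")
print(stamp)  # → 20230705_1433
```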
!ls {os.path.sep.join([cwd, "atmodat_checker_output"])}
20230705_1433 latest
As we ran run_checks
with the option -s
, one output is the short_summary.txt file which we cat
in the following:
output_dir_string=os.path.sep.join(["atmodat_checker_output","latest"])
output_path=os.path.sep.join([cwd, output_dir_string])
!cat {os.path.sep.join([output_path, "short_summary.txt"])}
=== Short summary ===
ATMODAT Standard Compliance Checker Version: 1.3.2
Checking against: ATMODAT Standard 3.0, CF Version 1.7
Checked at: 2023-07-05T14:33:55
Number of checked netCDF files: 1
Mandatory ATMODAT Standard checks passed: 4/4 (0 missing, 0 error(s))
Recommended ATMODAT Standard checks passed: 9/20 (11 missing, 0 error(s))
Optional ATMODAT Standard checks passed: 3/9 (6 missing, 0 error(s))
CF checker errors: 0 (Ignoring errors related to formula_terms in boundary variables. See Known Issues section https://github.com/AtMoDat/atmodat_data_checker#known-issues )
CF checker warnings: 2
Results#
The short summary contains information about the checker versions, the timestamp of execution, the ratio of passed attribute checks, and the errors and warnings reported by the CF checker.
The cfchecks routine only issues a warning/information message if variable metadata are completely missing.
Zero errors in the cfchecks routine does not necessarily mean that a data file is CF-compliant!
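To illustrate this caveat, here is a minimal sketch in plain Python: a variable whose metadata are entirely absent triggers only a warning, yet it clearly lacks attributes that CF compliance would require. The missing_metadata helper and the chosen attribute names are illustrative only; they are not part of cfchecks:

```python
# Hypothetical sketch: a variable with completely missing metadata.
# The helper and the attribute list are illustrative, not part of cfchecks.
var_attrs = {}  # no units, no standard_name, no long_name

def missing_metadata(attrs, required=("units", "standard_name", "long_name")):
    """Return the required attribute names that are absent."""
    return [key for key in required if key not in attrs]

print(missing_metadata(var_attrs))  # → ['units', 'standard_name', 'long_name']
```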
We can also have a look into the detailed output, including the exact error messages, in the long_summary_ files, which are subdivided by severity level.
!cat {os.path.sep.join([output_path,"long_summary_recommended.csv"])}
File,Check level,Global Attribute,Error Message
,,,
c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,Conventions,ATMODAT Standard information not present
c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,creator,global attribute is not present
c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,crs,global attribute is not present
c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,geospatial_lat_resolution,global attribute is not present
c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,geospatial_lon_resolution,global attribute is not present
c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,geospatial_vertical_resolution,global attribute is not present
c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,keywords,global attribute is not present
c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,product_version,global attribute is not present
c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,source_type,global attribute value is invalid
c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,standard_name_vocabulary,global attribute is not present
c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,summary,global attribute is not present
!cat {os.path.sep.join([output_path,"long_summary_mandatory.csv"])}
File,Check level,Global Attribute,Error Message
,,,
We can open the .csv files with pandas
to further analyse the output.
import pandas as pd
recommend_df=pd.read_csv(os.path.sep.join([output_path,"long_summary_recommended.csv"]))
recommend_df
| File | Check level | Global Attribute | Error Message |
---|---|---|---|---|
0 | NaN | NaN | NaN | NaN |
1 | c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... | recommended | Conventions | ATMODAT Standard information not present |
2 | c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... | recommended | creator | global attribute is not present |
3 | c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... | recommended | crs | global attribute is not present |
4 | c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... | recommended | geospatial_lat_resolution | global attribute is not present |
5 | c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... | recommended | geospatial_lon_resolution | global attribute is not present |
6 | c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... | recommended | geospatial_vertical_resolution | global attribute is not present |
7 | c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... | recommended | keywords | global attribute is not present |
8 | c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... | recommended | product_version | global attribute is not present |
9 | c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... | recommended | source_type | global attribute value is invalid |
10 | c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... | recommended | standard_name_vocabulary | global attribute is not present |
11 | c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... | recommended | summary | global attribute is not present |
There may be missing global attributes which are recommended by the ATMODAT standard. We can find them with pandas:
missing_recommend_atts=list(
recommend_df.loc[recommend_df["Error Message"]=="global attribute is not present"]["Global Attribute"]
)
missing_recommend_atts
['creator',
'crs',
'geospatial_lat_resolution',
'geospatial_lon_resolution',
'geospatial_vertical_resolution',
'keywords',
'product_version',
'standard_name_vocabulary',
'summary']
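The same kind of filtering can be extended, for example to count how often each error message occurs. The following is a self-contained sketch: the rows below are invented stand-ins for the CSV contents, reusing only the column names from the checker output:

```python
import pandas as pd

# Invented sample rows mirroring the columns of long_summary_recommended.csv;
# the file name and rows are stand-ins, not real checker output.
recommend_df = pd.DataFrame({
    "File": ["c2h6_example.nc", "c2h6_example.nc", "c2h6_example.nc"],
    "Check level": ["recommended", "recommended", "recommended"],
    "Global Attribute": ["creator", "crs", "source_type"],
    "Error Message": [
        "global attribute is not present",
        "global attribute is not present",
        "global attribute value is invalid",
    ],
})

# Count how often each error message occurs
counts = recommend_df["Error Message"].value_counts()
print(counts)
```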
Curation#
Let’s try first steps to cure the file by adding the missing attributes with xarray
. We can open the file into an xarray dataset with:
import xarray as xr
exp_file_ds=xr.open_dataset(exp_file)
exp_file_ds
<xarray.Dataset>
Dimensions:    (time: 1980, bnds: 2, lev: 26, lat: 64, lon: 128)
Coordinates:
  * time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
  * lev        (lev) float64 0.9926 0.9706 0.9296 ... 0.01397 0.007389 0.003545
  * lat        (lat) float64 -87.86 -85.1 -82.31 -79.53 ... 82.31 85.1 87.86
  * lon        (lon) float64 0.0 2.812 5.625 8.438 ... 348.8 351.6 354.4 357.2
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) object ...
    lev_bnds   (lev, bnds) float64 ...
    p0         float64 ...
    a          (lev) float64 ...
    b          (lev) float64 ...
    ps         (time, lat, lon) float32 ...
    a_bnds     (lev, bnds) float64 ...
    b_bnds     (lev, bnds) float64 ...
    lat_bnds   (lat, bnds) float64 ...
    lon_bnds   (lon, bnds) float64 ...
    c2h6       (time, lev, lat, lon) float32 ...
Attributes: (12/49)
    Conventions:            CF-1.7 CMIP-6.2
    activity_id:            AerChemMIP
    branch_method:          Standard
    branch_time_in_child:   0.0
    branch_time_in_parent:  2110.0
    comment:                The experiments parallel historical from 1850 to ...
    ...                     ...
    title:                  BCC-ESM1 output prepared for CMIP6
    tracking_id:            hdl:21.14100/7be29ebc-8b8a-4fda-95e9-ac1dc8b3da8c
    variable_id:            c2h6
    variant_label:          r1i1p1f1
    license:                CMIP6 model data produced by BCC is licensed unde...
    cmor_version:           3.3.2
We can handle and add attributes via the dict
-type attribute .attrs
. Applied to the dataset, it shows all global attributes of the file:
exp_file_ds.attrs
{'Conventions': 'CF-1.7 CMIP-6.2',
'activity_id': 'AerChemMIP',
'branch_method': 'Standard',
'branch_time_in_child': 0.0,
'branch_time_in_parent': 2110.0,
'comment': 'The experiments parallel historical from 1850 to 2014 with all forcing applied, but fix the anthropogenic emissions of Aerosol precursors to the 1850 value that is used in piControl. The same initial conditions as r1i1p1f1 of historical, branched from year 2110 in piControl.',
'contact': 'Dr. Tongwen Wu(twwu@cma.gov.cn)',
'creation_date': '2020-05-11T06:54:48Z',
'data_specs_version': '01.00.27',
'description': 'AerChemMIP:hist-piAer',
'experiment': 'historical forcing, but with pre-industrial aerosol emissions',
'experiment_id': 'hist-piAer',
'external_variables': 'areacella',
'forcing_index': 1,
'frequency': 'mon',
'further_info_url': 'https://furtherinfo.es-doc.org/CMIP6.BCC.BCC-ESM1.hist-piAer.none.r1i1p1f1',
'grid': 'T42',
'grid_label': 'gn',
'history': '2020-05-11T06:54:48Z ; CMOR rewrote data to be consistent with CMIP6, CF-1.7 CMIP-6.2 and CF standards.',
'initialization_index': 1,
'institution': 'Beijing Climate Center, Beijing 100081, China',
'institution_id': 'BCC',
'mip_era': 'CMIP6',
'nominal_resolution': '250 km',
'parent_activity_id': 'CMIP',
'parent_experiment_id': 'piControl',
'parent_mip_era': 'CMIP6',
'parent_source_id': 'BCC-ESM1',
'parent_time_units': 'days since 1850-01-01',
'parent_variant_label': 'r1i1p1f1',
'physics_index': 1,
'product': 'model-output',
'realization_index': 1,
'realm': 'aerosol',
'references': 'Model described by Tongwen Wu et al. (JGR 2013; JMR 2014; GMD,2019). Also see http://forecast.bcccsm.ncc-cma.net/htm',
'run_variant': 'forcing: greenhouse gases,aerosol emission,solar constant,volcano mass,land use',
'source': 'BCC-ESM 1 (2017): aerosol: none atmos: BCC_AGCM3_LR (T42; 128 x 64 longitude/latitude; 26 levels; top level 2.19 hPa) atmosChem: BCC-AGCM3-Chem land: BCC_AVIM2 landIce: none ocean: MOM4 (1/3 deg 10S-10N, 1/3-1 deg 10-30 N/S, and 1 deg in high latitudes; 360 x 232 longitude/latitude; 40 levels; top grid cell 0-10 m) ocnBgchem: none seaIce: SIS2',
'source_id': 'BCC-ESM1',
'source_type': 'AER AOGCM CHEM',
'sub_experiment': 'none',
'sub_experiment_id': 'none',
'table_id': 'AERmon',
'table_info': 'Creation Date:(30 July 2018) MD5:e53ff52009d0b97d9d867dc12b6096c7',
'title': 'BCC-ESM1 output prepared for CMIP6',
'tracking_id': 'hdl:21.14100/7be29ebc-8b8a-4fda-95e9-ac1dc8b3da8c',
'variable_id': 'c2h6',
'variant_label': 'r1i1p1f1',
'license': 'CMIP6 model data produced by BCC is licensed under a Creative Commons Attribution ShareAlike 4.0 International License (https://creativecommons.org/licenses). Consult https://pcmdi.llnl.gov/CMIP6/TermsOfUse for terms of use governing CMIP6 output, including citation requirements and proper acknowledgment. Further information about this data, including some limitations, can be found via the further_info_url (recorded as a global attribute in this file) and at https:///pcmdi.llnl.gov/. The data producers and data providers make no warranty, either express or implied, including, but not limited to, warranties of merchantability and fitness for a particular purpose. All liabilities arising from the supply of the information (including any liability arising in negligence) are excluded to the fullest extent permitted by law.',
'cmor_version': '3.3.2'}
We add all missing attributes and set a dummy value for them:
for att in missing_recommend_atts:
exp_file_ds.attrs[att]="Dummy"
We save the modified dataset with the to_netcdf
function:
exp_file_ds.to_netcdf("testfile-modified.nc")
Now, let’s run run_checks
again.
We can also provide a directory instead of a file as an argument with the option -p
. The checker will then find all .nc
files inside that directory and its subdirectories.
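The recursive search behaves roughly like walking the directory tree and filtering for the .nc suffix. A minimal sketch under that assumption (the temporary directory tree below is created purely for illustration; run_checks itself may search differently):

```python
import os
import pathlib
import tempfile

# Build a throwaway directory tree just for illustration
root = pathlib.Path(tempfile.mkdtemp())
(root / "sub").mkdir()
(root / "a.nc").touch()
(root / "sub" / "b.nc").touch()
(root / "notes.txt").touch()

# Roughly what a recursive .nc search does: walk the tree, keep .nc files
nc_files = sorted(
    os.path.join(dirpath, name)
    for dirpath, _, files in os.walk(root)
    for name in files
    if name.endswith(".nc")
)
print([os.path.basename(f) for f in nc_files])  # → ['a.nc', 'b.nc']
```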
!run_checks -p {cwd} -op {cwd} -s
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/testfile-modified.nc
2023-07-05 14:34:52.550854 [INFO] :: PYESSV :: Loading vocabularies from /envs/lib/python3.11/site-packages/atmodat_checklib/AtMoDat_CVs/pyessv-archive:
2023-07-05 14:34:52.557105 [INFO] :: PYESSV :: ... loaded: atmodat
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/cdo-incl-cmor/application/handson/example_CCLM.nc
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/cdo-incl-cmor/application/handson/example_T_2M.nc
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/cdo-incl-cmor/application/handson/example_T_2M_celsius.nc
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/cdo-incl-cmor/application/handson/example_T_3M.nc
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/cdo-incl-cmor/application/handson/example_celsius.nc
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/cdo-incl-cmor/application/handson/example_gridinfo_CCLM4-8-17.nc
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/cdo-incl-cmor/application/handson/example_interface.nc
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/cdo-incl-cmor/application/handson/example_temp_day.nc
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/cdo-incl-cmor/application/handson/example_temp_mon.nc
--- 7.3080 seconds for checking 10 files---
Using the latest directory, here is the new summary:
!cat {os.path.sep.join([output_path,"short_summary.txt"])}
=== Short summary ===
ATMODAT Standard Compliance Checker Version: 1.3.2
Checking against: ATMODAT Standard 3.0, multiple CF versions (CF-1.7, CF-1.6, CF-1.0, CF-1.4)
Checked at: 2023-07-05T14:34:59
Number of checked netCDF files: 10
Mandatory ATMODAT Standard checks passed: 26/40 (13 missing, 1 error(s))
Recommended ATMODAT Standard checks passed: 33/200 (167 missing, 0 error(s))
Optional ATMODAT Standard checks passed: 6/90 (84 missing, 0 error(s))
CF checker errors: 23 (Ignoring errors related to formula_terms in boundary variables. See Known Issues section https://github.com/AtMoDat/atmodat_data_checker#known-issues )
CF checker warnings: 14
You can see that the checks no longer fail for the modified file: subtracting the failures of the earlier run from the sum of newly passed checks shows that the added attributes now pass.
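To compare two runs programmatically, the "passed: X/Y" counts can be parsed out of short_summary.txt with a regular expression. A sketch over one line copied from the summary above; the parsing pattern is ours, not part of the checker:

```python
import re

# One line copied from the short summary above
line = "Mandatory ATMODAT Standard checks passed: 26/40 (13 missing, 1 error(s))"

# Hypothetical helper logic: extract passed/total as integers
match = re.search(r"passed: (\d+)/(\d+)", line)
passed, total = (int(g) for g in match.groups())
print(passed, total)  # → 26 40
```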