ATMODAT Standard Compliance Checker#
This notebook introduces you to the atmodat checker, which contains checks to ensure compliance with the ATMODAT Standard.
Its core functionality is based on the IOOS compliance checker. The ATMODAT Standard Compliance Checker library makes use of cc-yaml, which provides a plugin for the IOOS compliance checker that generates check suites from YAML descriptions. Furthermore, the Compliance Check Library is used as the basis for defining generic, reusable compliance checks.
In addition, compliance with the CF Conventions 1.4 or higher is verified with the CF checker.
In this notebook, you will learn
how to use an environment on the DKRZ HPC systems Mistral or Levante
how to run checks with the atmodat data checker
how to understand the results of the checker and further analyse them with pandas
how you could proceed to cure the data with xarray if it does not pass the QC
Preparation#
On DKRZ’s high-performance computer (HPC), we provide a conda
environment which is useful for working with data in DKRZ’s CMIP Data Pool.
Option 1: Activate checker libraries for working with a command-line shell
If you would like to work with shell commands, you can simply activate the environment. Prior to this, you may have to load a module with a recent Python interpreter:
module load python3/unstable
#The following line activates the quality-assurance environment with the checker libraries so that you can execute them with shell commands:
source activate /work/bm0021/conda-envs/quality-assurance
Option 2: Create a kernel with checker libraries to work with jupyter notebooks
With ipykernel
you can install a kernel which can be used within a jupyter server like jupyterhub. ipykernel
creates the kernel based on the activated environment.
module load python3/unstable
#The following line activates the quality-assurance environment with the checker libraries so that you can execute them with shell commands:
source activate /work/bm0021/conda-envs/quality-assurance
python -m ipykernel install --user --name qualitychecker --display-name="qualitychecker"
If you run this command from within a Jupyter server, you have to restart the Jupyter server afterwards to be able to select the new quality checker kernel.
Expert mode: Running the jupyter server from a different environment than the environment in which atmodat is installed
Make sure that you:
Install the cfunits package into the jupyter environment via conda install cfunits -c conda-forge -p $jupyterenv and restart the kernel.
Add the atmodat environment to the PATH environment variable inside the notebook. Otherwise, the notebook’s shell does not find the application run_checks. You can modify environment variables with the os package and its attribute os.environ. The environment of the kernel can be found with sys and sys.executable. The following block sets the environment variable PATH correctly:
import sys
import os
os.environ["PATH"]=os.environ["PATH"]+":"+os.path.sep.join(sys.executable.split('/')[:-1])
#As long as there is the installation bug, we have to manually get the Atmodat CVs:
if not "AtMoDat_CVs" in [dirpath.split(os.path.sep)[-1]
for (dirpath, dirs, files) in os.walk(os.path.sep.join(sys.executable.split('/')[:-2]))] :
!git clone https://github.com/AtMoDat/AtMoDat_CVs.git {os.path.sep.join(sys.executable.split('/')[:-2])}/lib/python3.9/site-packages/atmodat_checklib/AtMoDat_CVs
Data to be checked#
In this tutorial, we will check a small subset of CMIP6 data which we retrieve via intake:
import intake
# Path to master catalog on the DKRZ server
# The short link below resolves to the raw catalog URL that follows
#col_url = "https://dkrz.de/s/intake"
col_url = "https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml"
parent_col=intake.open_catalog([col_url])
list(parent_col)
# Open the catalog with the intake package and name it "col" as short for "collection"
col=parent_col["dkrz_cmip6_disk"]
# We just take the path of the first file from the CMIP6 catalog because we will run some experiments on it
exp_file=col.df["uri"].values[0]
exp_file
'/work/ik1017/CMIP6/data/CMIP6/AerChemMIP/BCC/BCC-ESM1/hist-piAer/r1i1p1f1/AERmon/c2h6/gn/v20200511/c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc'
Application#
The command run_checks
can be executed from any directory from within the atmodat conda environment.
The atmodat checker contains two modules:
one that checks the global attributes for compliance with the ATMODAT standard
another that performs a standard CF check (building upon the cfchecks library).
Show usage instructions of run_checks
!run_checks -h
usage: run_checks [-h] [-v] [-op OPATH] [-cfv CFVERSION] [-check WHATCHECKS]
[-s] [-V] [-f FILE | -p PATH | -pnr PATH_NO_RECURSIVE]
Run the AtMoDat checks suits.
options:
-h, --help show this help message and exit
-v, --verbose Print output of checkers (longer runtime due to double
call of checkers)
-op OPATH, --opath OPATH
Define custom path where checker output shall be
written
-cfv CFVERSION, --cfversion CFVERSION
Define custom CF table version against which the file
shall be checked. Valid are versions from 1.3 to 1.8.
Example: "-cfv 1.6". Default is 'auto'
-check WHATCHECKS, --whatchecks WHATCHECKS
Define if AtMoDat or CF check or both shall be
executed. Valid options: AT, CF, both. Example:
"-check CF". Default is 'both'
-s, --summary Create summary of checker output
-V, --version show program's version number and exit
-f FILE, --file FILE Processes the given file
-p PATH, --path PATH Processes all files in a given path and subdirectories
(recursive file search)
-pnr PATH_NO_RECURSIVE, --path_no_recursive PATH_NO_RECURSIVE
Processes all files in a given directory
The results of the performed checks are provided in the checker_output directory. By default, run_checks
assumes write permissions in the path where the atmodat checker is installed. If this is not the case, you must specify an output directory in which you have write permissions with -op output_path
.
In the following block, we set the output path to the current working directory, which we get via the bash command pwd
. We apply run_checks
to the exp_file
which we selected in the chapter before.
cwd=!pwd
cwd=cwd[0]
!run_checks -f {exp_file} -op {cwd} -s
Running Compliance Checker on the datasets from: ['/work/ik1017/CMIP6/data/CMIP6/AerChemMIP/BCC/BCC-ESM1/hist-piAer/r1i1p1f1/AERmon/c2h6/gn/v20200511/c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc']
2023-07-05 14:33:44.523616 [INFO] :: PYESSV :: Loading vocabularies from /envs/lib/python3.11/site-packages/atmodat_checklib/AtMoDat_CVs/pyessv-archive:
2023-07-05 14:33:44.702117 [INFO] :: PYESSV :: ... loaded: atmodat
--- 13.0180 seconds for checking 1 files---
Now, we have a directory atmodat_checker_output
in the output path op
. For each run of run_checks
, a new directory named by the timestamp of the run is created inside of op
. Additionally, a directory latest always contains the output of the most recent run.
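The timestamp-style directory name can be reproduced with the standard datetime module. Note that the %Y%m%d_%H%M format string is inferred from the directory listing below, not taken from the checker’s source code:

```python
import datetime

# Sketch: reproduce the per-run directory name that run_checks appears
# to use, e.g. "20230705_1433". The format string is inferred from the
# checker output, not from its source code.
stamp = datetime.datetime(2023, 7, 5, 14, 33).strftime("%Y%m%d_%H%M")
print(stamp)  # → 20230705_1433
```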
!ls {os.path.sep.join([cwd, "atmodat_checker_output"])}
20230705_1433 latest
As we ran run_checks
with the option -s
, one output is the short_summary.txt file which we cat
in the following:
output_dir_string=os.path.sep.join(["atmodat_checker_output","latest"])
output_path=os.path.sep.join([cwd, output_dir_string])
!cat {os.path.sep.join([output_path, "short_summary.txt"])}
=== Short summary ===
ATMODAT Standard Compliance Checker Version: 1.3.2
Checking against: ATMODAT Standard 3.0, CF Version 1.7
Checked at: 2023-07-05T14:33:55
Number of checked netCDF files: 1
Mandatory ATMODAT Standard checks passed: 4/4 (0 missing, 0 error(s))
Recommended ATMODAT Standard checks passed: 9/20 (11 missing, 0 error(s))
Optional ATMODAT Standard checks passed: 3/9 (6 missing, 0 error(s))
CF checker errors: 0 (Ignoring errors related to formula_terms in boundary variables. See Known Issues section https://github.com/AtMoDat/atmodat_data_checker#known-issues )
CF checker warnings: 2
Results#
The short summary contains information about the checker versions, the timestamp of execution, the ratio of passed attribute checks, and the errors and warnings reported by the CF checker.
The cfchecks routine only issues a warning/information message if variable metadata are completely missing.
Zero errors in the cfchecks routine does not necessarily mean that a data file is CF-compliant!
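To illustrate this caveat, here is a minimal sketch in plain Python: a variable whose metadata are entirely absent triggers only a warning, yet it clearly lacks attributes that CF compliance would require. The missing_metadata helper and the chosen attribute names are illustrative only; they are not part of cfchecks:

```python
# Hypothetical sketch: a variable with completely missing metadata.
# The helper and the attribute list are illustrative, not part of cfchecks.
var_attrs = {}  # no units, no standard_name, no long_name

def missing_metadata(attrs, required=("units", "standard_name", "long_name")):
    """Return the required attribute names that are absent."""
    return [key for key in required if key not in attrs]

print(missing_metadata(var_attrs))  # → ['units', 'standard_name', 'long_name']
```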
We can also have a look into the detailed output, including the exact error messages, in the long_summary_ files, which are subdivided by severity level.
!cat {os.path.sep.join([output_path,"long_summary_recommended.csv"])}
File,Check level,Global Attribute,Error Message
,,,
c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,Conventions,ATMODAT Standard information not present
c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,creator,global attribute is not present
c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,crs,global attribute is not present
c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,geospatial_lat_resolution,global attribute is not present
c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,geospatial_lon_resolution,global attribute is not present
c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,geospatial_vertical_resolution,global attribute is not present
c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,keywords,global attribute is not present
c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,product_version,global attribute is not present
c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,source_type,global attribute value is invalid
c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,standard_name_vocabulary,global attribute is not present
c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,summary,global attribute is not present
!cat {os.path.sep.join([output_path,"long_summary_mandatory.csv"])}
File,Check level,Global Attribute,Error Message
,,,
We can open the .csv files with pandas
to further analyse the output.
import pandas as pd
recommend_df=pd.read_csv(os.path.sep.join([output_path,"long_summary_recommended.csv"]))
recommend_df
| File | Check level | Global Attribute | Error Message |
---|---|---|---|---|
0 | NaN | NaN | NaN | NaN |
1 | c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... | recommended | Conventions | ATMODAT Standard information not present |
2 | c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... | recommended | creator | global attribute is not present |
3 | c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... | recommended | crs | global attribute is not present |
4 | c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... | recommended | geospatial_lat_resolution | global attribute is not present |
5 | c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... | recommended | geospatial_lon_resolution | global attribute is not present |
6 | c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... | recommended | geospatial_vertical_resolution | global attribute is not present |
7 | c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... | recommended | keywords | global attribute is not present |
8 | c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... | recommended | product_version | global attribute is not present |
9 | c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... | recommended | source_type | global attribute value is invalid |
10 | c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... | recommended | standard_name_vocabulary | global attribute is not present |
11 | c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... | recommended | summary | global attribute is not present |
There may be missing global attributes which are recommended by the ATMODAT standard. We can find them with pandas:
missing_recommend_atts=list(
recommend_df.loc[recommend_df["Error Message"]=="global attribute is not present"]["Global Attribute"]
)
missing_recommend_atts
['creator',
'crs',
'geospatial_lat_resolution',
'geospatial_lon_resolution',
'geospatial_vertical_resolution',
'keywords',
'product_version',
'standard_name_vocabulary',
'summary']
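The same kind of filtering can be extended, for example to count how often each error message occurs. The following is a self-contained sketch: the rows below are invented stand-ins for the CSV contents, reusing only the column names from the checker output:

```python
import pandas as pd

# Invented sample rows mirroring the columns of long_summary_recommended.csv;
# the file name and rows are stand-ins, not real checker output.
recommend_df = pd.DataFrame({
    "File": ["c2h6_example.nc", "c2h6_example.nc", "c2h6_example.nc"],
    "Check level": ["recommended", "recommended", "recommended"],
    "Global Attribute": ["creator", "crs", "source_type"],
    "Error Message": [
        "global attribute is not present",
        "global attribute is not present",
        "global attribute value is invalid",
    ],
})

# Count how often each error message occurs
counts = recommend_df["Error Message"].value_counts()
print(counts)
```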
Curation#
Let’s try first steps to cure the file by adding the missing attributes with xarray
. We can open the file into an xarray dataset with:
import xarray as xr
exp_file_ds=xr.open_dataset(exp_file)
exp_file_ds
<xarray.Dataset>
Dimensions:    (time: 1980, bnds: 2, lev: 26, lat: 64, lon: 128)
Coordinates:
  * time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
  * lev        (lev) float64 0.9926 0.9706 0.9296 ... 0.01397 0.007389 0.003545
  * lat        (lat) float64 -87.86 -85.1 -82.31 -79.53 ... 82.31 85.1 87.86
  * lon        (lon) float64 0.0 2.812 5.625 8.438 ... 348.8 351.6 354.4 357.2
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) object ...
    lev_bnds   (lev, bnds) float64 ...
    p0         float64 ...
    a          (lev) float64 ...
    b          (lev) float64 ...
    ps         (time, lat, lon) float32 ...
    a_bnds     (lev, bnds) float64 ...
    b_bnds     (lev, bnds) float64 ...
    lat_bnds   (lat, bnds) float64 ...
    lon_bnds   (lon, bnds) float64 ...
    c2h6       (time, lev, lat, lon) float32 ...
Attributes: (12/49)
    Conventions:            CF-1.7 CMIP-6.2
    activity_id:            AerChemMIP
    branch_method:          Standard
    branch_time_in_child:   0.0
    branch_time_in_parent:  2110.0
    comment:                The experiments parallel historical from 1850 to ...
    ...                     ...
    title:                  BCC-ESM1 output prepared for CMIP6
    tracking_id:            hdl:21.14100/7be29ebc-8b8a-4fda-95e9-ac1dc8b3da8c
    variable_id:            c2h6
    variant_label:          r1i1p1f1
    license:                CMIP6 model data produced by BCC is licensed unde...
    cmor_version:           3.3.2
We can handle and add attributes via the dict
-type attribute .attrs
. Applied to the dataset, it shows all global attributes of the file:
exp_file_ds.attrs
{'Conventions': 'CF-1.7 CMIP-6.2',
'activity_id': 'AerChemMIP',
'branch_method': 'Standard',
'branch_time_in_child': 0.0,
'branch_time_in_parent': 2110.0,
'comment': 'The experiments parallel historical from 1850 to 2014 with all forcing applied, but fix the anthropogenic emissions of Aerosol precursors to the 1850 value that is used in piControl. The same initial conditions as r1i1p1f1 of historical, branched from year 2110 in piControl.',
'contact': 'Dr. Tongwen Wu(twwu@cma.gov.cn)',
'creation_date': '2020-05-11T06:54:48Z',
'data_specs_version': '01.00.27',
'description': 'AerChemMIP:hist-piAer',
'experiment': 'historical forcing, but with pre-industrial aerosol emissions',
'experiment_id': 'hist-piAer',
'external_variables': 'areacella',
'forcing_index': 1,
'frequency': 'mon',
'further_info_url': 'https://furtherinfo.es-doc.org/CMIP6.BCC.BCC-ESM1.hist-piAer.none.r1i1p1f1',
'grid': 'T42',
'grid_label': 'gn',
'history': '2020-05-11T06:54:48Z ; CMOR rewrote data to be consistent with CMIP6, CF-1.7 CMIP-6.2 and CF standards.',
'initialization_index': 1,
'institution': 'Beijing Climate Center, Beijing 100081, China',
'institution_id': 'BCC',
'mip_era': 'CMIP6',
'nominal_resolution': '250 km',
'parent_activity_id': 'CMIP',
'parent_experiment_id': 'piControl',
'parent_mip_era': 'CMIP6',
'parent_source_id': 'BCC-ESM1',
'parent_time_units': 'days since 1850-01-01',
'parent_variant_label': 'r1i1p1f1',
'physics_index': 1,
'product': 'model-output',
'realization_index': 1,
'realm': 'aerosol',
'references': 'Model described by Tongwen Wu et al. (JGR 2013; JMR 2014; GMD,2019). Also see http://forecast.bcccsm.ncc-cma.net/htm',
'run_variant': 'forcing: greenhouse gases,aerosol emission,solar constant,volcano mass,land use',
'source': 'BCC-ESM 1 (2017): aerosol: none atmos: BCC_AGCM3_LR (T42; 128 x 64 longitude/latitude; 26 levels; top level 2.19 hPa) atmosChem: BCC-AGCM3-Chem land: BCC_AVIM2 landIce: none ocean: MOM4 (1/3 deg 10S-10N, 1/3-1 deg 10-30 N/S, and 1 deg in high latitudes; 360 x 232 longitude/latitude; 40 levels; top grid cell 0-10 m) ocnBgchem: none seaIce: SIS2',
'source_id': 'BCC-ESM1',
'source_type': 'AER AOGCM CHEM',
'sub_experiment': 'none',
'sub_experiment_id': 'none',
'table_id': 'AERmon',
'table_info': 'Creation Date:(30 July 2018) MD5:e53ff52009d0b97d9d867dc12b6096c7',
'title': 'BCC-ESM1 output prepared for CMIP6',
'tracking_id': 'hdl:21.14100/7be29ebc-8b8a-4fda-95e9-ac1dc8b3da8c',
'variable_id': 'c2h6',
'variant_label': 'r1i1p1f1',
'license': 'CMIP6 model data produced by BCC is licensed under a Creative Commons Attribution ShareAlike 4.0 International License (https://creativecommons.org/licenses). Consult https://pcmdi.llnl.gov/CMIP6/TermsOfUse for terms of use governing CMIP6 output, including citation requirements and proper acknowledgment. Further information about this data, including some limitations, can be found via the further_info_url (recorded as a global attribute in this file) and at https:///pcmdi.llnl.gov/. The data producers and data providers make no warranty, either express or implied, including, but not limited to, warranties of merchantability and fitness for a particular purpose. All liabilities arising from the supply of the information (including any liability arising in negligence) are excluded to the fullest extent permitted by law.',
'cmor_version': '3.3.2'}
We add all missing attributes and set a dummy value for them:
for att in missing_recommend_atts:
exp_file_ds.attrs[att]="Dummy"
We save the modified dataset with the to_netcdf
function:
exp_file_ds.to_netcdf("testfile-modified.nc")
Now, let’s run run_checks
again.
We can also provide a directory instead of a file as an argument with the option -p
. The checker will then find all .nc
files inside that directory and its subdirectories.
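The recursive search behaves roughly like walking the directory tree and filtering for the .nc suffix. A minimal sketch under that assumption (the temporary directory tree below is created purely for illustration; run_checks itself may search differently):

```python
import os
import pathlib
import tempfile

# Build a throwaway directory tree just for illustration
root = pathlib.Path(tempfile.mkdtemp())
(root / "sub").mkdir()
(root / "a.nc").touch()
(root / "sub" / "b.nc").touch()
(root / "notes.txt").touch()

# Roughly what a recursive .nc search does: walk the tree, keep .nc files
nc_files = sorted(
    os.path.join(dirpath, name)
    for dirpath, _, files in os.walk(root)
    for name in files
    if name.endswith(".nc")
)
print([os.path.basename(f) for f in nc_files])  # → ['a.nc', 'b.nc']
```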
!run_checks -p {cwd} -op {cwd} -s
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/testfile-modified.nc
2023-07-05 14:34:52.550854 [INFO] :: PYESSV :: Loading vocabularies from /envs/lib/python3.11/site-packages/atmodat_checklib/AtMoDat_CVs/pyessv-archive:
2023-07-05 14:34:52.557105 [INFO] :: PYESSV :: ... loaded: atmodat
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/cdo-incl-cmor/application/handson/example_CCLM.nc
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/cdo-incl-cmor/application/handson/example_T_2M.nc
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/cdo-incl-cmor/application/handson/example_T_2M_celsius.nc
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/cdo-incl-cmor/application/handson/example_T_3M.nc
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/cdo-incl-cmor/application/handson/example_celsius.nc
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/cdo-incl-cmor/application/handson/example_gridinfo_CCLM4-8-17.nc
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/cdo-incl-cmor/application/handson/example_interface.nc
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/cdo-incl-cmor/application/handson/example_temp_day.nc
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/cdo-incl-cmor/application/handson/example_temp_mon.nc
--- 7.3080 seconds for checking 10 files---
Using the latest directory, here is the new summary:
!cat {os.path.sep.join([output_path,"short_summary.txt"])}
=== Short summary ===
ATMODAT Standard Compliance Checker Version: 1.3.2
Checking against: ATMODAT Standard 3.0, multiple CF versions (CF-1.7, CF-1.6, CF-1.0, CF-1.4)
Checked at: 2023-07-05T14:34:59
Number of checked netCDF files: 10
Mandatory ATMODAT Standard checks passed: 26/40 (13 missing, 1 error(s))
Recommended ATMODAT Standard checks passed: 33/200 (167 missing, 0 error(s))
Optional ATMODAT Standard checks passed: 6/90 (84 missing, 0 error(s))
CF checker errors: 23 (Ignoring errors related to formula_terms in boundary variables. See Known Issues section https://github.com/AtMoDat/atmodat_data_checker#known-issues )
CF checker warnings: 14
You can see that the checks no longer fail for the modified file: subtracting the failures of the earlier run from the sum of newly passed checks shows that the added attributes now pass.
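To compare two runs programmatically, the "passed: X/Y" counts can be parsed out of short_summary.txt with a regular expression. A sketch over one line copied from the summary above; the parsing pattern is ours, not part of the checker:

```python
import re

# One line copied from the short summary above
line = "Mandatory ATMODAT Standard checks passed: 26/40 (13 missing, 1 error(s))"

# Hypothetical helper logic: extract passed/total as integers
match = re.search(r"passed: (\d+)/(\d+)", line)
passed, total = (int(g) for g in match.groups())
print(passed, total)  # → 26 40
```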