# Intake V - create intake-esm catalog from scratch

```{admonition} Overview
:class: dropdown

![Level](https://img.shields.io/badge/Level-Intermediate-orange.svg)


🎯 **objectives**: Learn how to create `intake-esm` ESM-collections

⌛ **time_estimation**: "60min"

☑️ **requirements**: `intake_esm.__version__ == 2023.4.*`, at least 10GB memory.
- intake I, intake II

- [pandas](https://pandas.pydata.org/)
- [json](https://de.wikipedia.org/wiki/JavaScript_Object_Notation)
- [xarray](http://xarray.pydata.org/en/stable/)

© **contributors**: k204210

⚖ **license**:

```

```{admonition} Agenda
:class: tip

In this part, you learn

1. [When to build `intake-esm` collections](#motivation)
1. [How to create a standardized intake-esm catalog from scratch](#create) 
    1. [How to equip the catalog with attributes and configurations for assets and aggregation](#description)
    2. [How to add the collection of assets to the catalog](#database)
1. [How to validate and save the newly created catalog](#validate)
1. [How to configure the catalog to process multivariable assets](#multi)
```

**Intake** is a cataloging tool for data repositories. It opens catalogs with *drivers*. Drivers can be plug-ins like `intake-esm`.

This tutorial gives insight into the creation of **intake-esm catalogs**. We recommend this specific driver for intake when working with ESM data, as the plugin loads the data with the widely used and accepted tool `xarray`.

```{note}
This tutorial creates a catalog from scratch. If you work based on another catalog, it might be sufficient for you to look into [intake II - save subset]() 
``` 

<a class="anchor" id="motivation"></a>

## 1. When should I create an `intake-esm` catalog?

Cataloging your data set with a *static* catalog for *easy access* is beneficial if 
- the data set is *stable* 🏔 such that you do not have to update the content of the catalog to make it usable at all
- the data set is *very large* 🗃 such that browsing and accessing data via file system is less performant
- the data set should be *shared* 🔀 with many people such that you cannot use a data base format

<a class="anchor" id="create"></a>

## 2. Create an intake-esm catalog which complies with [esmcat-specs](https://github.com/NCAR/esm-collection-spec/blob/master/collection-spec/collection-spec.md)

In order to create a well-defined, helpful catalog, you have to answer the following questions:

- What should be the *search facets* of the catalog?
- How are [assets](#asset) of the catalog combined to a dataset?
- How should `xarray` open the data set?

For `intake-esm` catalogs, an early [standard](https://github.com/NCAR/esm-collection-spec/blob/master/collection-spec/collection-spec.md) has been developed to ensure compatibility across different `intake-esm` catalogs. We will follow those specs in this tutorial.

In this example, we will use a Python dictionary, but you could also write directly into a file with your favorite editor. We start with a catalog dictionary `intake_esm_catalog` and add the required basic meta data:

In [None]:
intake_esm_catalog={
    # we follow the esmcat specs version 0.1.0: 
    'esmcat_version': '0.1.0',
    'id': 'Intake-esmI',
    'description': "This is an intake catalog created for the intake tutorial"
}

<a class="anchor" id="description"></a>

### 2.1. Create the description

The description contains all the meta data which is necessary to understand the catalog. That makes the catalog *self-descriptive*. It also tells intake how to load the assets of the data set(s) with the specified driver.

<a class="anchor" id="defineatts"></a>

#### Define **attributes** of your catalog

The catalog's [collection](#collection) uses attributes to describe the assets. These attributes are defined in the description via python `dict`ionaries and given as a list in the `intake-esm` catalog `.json` file, e.g.:

```json
"attributes": [
    {
      "column_name": "activity_id",
      "vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_activity_id.json"
    }
]
```

and will be accessed by users from the loaded catalog variable `catalog` via:

```python
catalog.esmcat.attributes
```

Catalog's attributes should allow users to 

- effectively **browse**: 
    - The in-memory representation and the visualization tool for the catalog is a `Pandas` *DataFrame*. The `column_name` of each catalog attribute defines one column of that *DataFrame*.
    - Additionally, the `column_name` of the catalog's attributes can be used as *search facets*: they will be keyword arguments of the `catalog.search()` function
- **understand** the content: You can provide information to the attributes, e.g. by specifying a `vocabulary` for all available (or allowed) values of the attribute.

➡ The [collection](#collection) must have values for all defined attributes (see 2.2.)

➡ In other terms: If [assets](#assets) should be integrated into the catalog, they have to be described with these attributes.

```{admonition} Best Practice
:class: tip

- The best configuration is reached if all datasets can be *uniquely identified*. I.e., if the users fill out all search facets, they will end up with **only one** dataset.
- Do not add too many additional columns. Users may be confused when many search fields have similar meanings. Also, the display of the DataFrame should fit into the window width.
```
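The first recommendation can be checked before a catalog is published. The following is only a minimal sketch, applied here on the level of individual assets and assuming the collection is available as a list of dictionaries. The helper name `find_ambiguous_rows` and the sample rows are made up for illustration.

```python
from collections import Counter


def find_ambiguous_rows(rows, facets):
    """Return the facet value combinations that occur in more than one row."""
    counts = Counter(tuple(row[facet] for facet in facets) for row in rows)
    return [combo for combo, n in counts.items() if n > 1]


# Hypothetical collection rows, for illustration only
rows = [
    {"source_id": "MPI-ESM1-2-HR", "variable_id": "tas", "time_range": "201501-201912"},
    {"source_id": "MPI-ESM1-2-HR", "variable_id": "tas", "time_range": "202001-202412"},
    {"source_id": "MPI-ESM1-2-HR", "variable_id": "pr", "time_range": "201501-201912"},
]

# Two facets are not enough to tell the two "tas" rows apart ...
print(find_ambiguous_rows(rows, ["source_id", "variable_id"]))
# ... while adding time_range makes every row unique.
print(find_ambiguous_rows(rows, ["source_id", "variable_id", "time_range"]))
```

An empty result means that filling out all given facets leads to exactly one row.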

**Use case: Catalog for project data on a file system**

Given a directory tree with more than one level, ensure that:

- All files are on the same and deepest directory level.
- Each directory level has the same meaning across the project data. E.g. the deepest directory can have the meaning **version**.

This can easily be done by creating a **directory structure template** and checking the directory paths against its definition.

If the paths comply with the template, each directory level can be used as a catalog attribute.
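Such a check could look like the following sketch. It only compares the number of directory levels below a project root with the template; the function name `check_against_template` and the choice of `/work/ik1017/CMIP6/data/` as project root are assumptions made for this example.

```python
def check_against_template(directory, root, template):
    """Map each directory level below `root` to its template key.

    Raises ValueError if the directory does not fit the template.
    """
    keys = template.split("/")
    if not directory.startswith(root):
        raise ValueError(f"{directory} is not located below {root}")
    levels = [part for part in directory[len(root):].split("/") if part]
    if len(levels) != len(keys):
        raise ValueError(
            f"{directory} has {len(levels)} levels, the template expects {len(keys)}"
        )
    return dict(zip(keys, levels))


template = (
    "mip_era/activity_id/institution_id/source_id/experiment_id/"
    "member_id/table_id/variable_id/grid_label/version"
)
root = "/work/ik1017/CMIP6/data/"  # assumed project root
directory = root + "CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/"

parsed = check_against_template(directory, root, template)
print(parsed["source_id"], parsed["version"])
```

A stricter check could additionally compare each level with a controlled vocabulary.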

In [None]:
attributes=[]
directory_structure_template="mip_era/activity_id/institution_id/source_id/experiment_id/member_id/table_id/variable_id/grid_label/version"


In [None]:
for att in directory_structure_template.split('/'):
    attributes.append(
        dict(column_name=att,
             vocabulary=f"https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_{att}.json"
            )
    )
intake_esm_catalog["attributes"]=attributes    
print(intake_esm_catalog)

```{note}
For data management purposes in general, we highly recommend defining a <i>path_template</i> and a <i>filename_template</i> for a clear directory structure **before storing any data**.
```

* You can add more attributes from files or by parsing filenames
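As an illustration of parsing file names, the sketch below splits a CMIP6-style file name into its `_`-separated elements. The key list `FILENAME_KEYS`, the helper `parse_filename` and the example file name are assumptions for this example and have to be adapted to your own *filename_template*.

```python
import os

# Assumed CMIP6-style file name template; time_range is the last element
FILENAME_KEYS = [
    "variable_id", "table_id", "source_id", "experiment_id",
    "member_id", "grid_label", "time_range",
]


def parse_filename(path):
    """Split a file name into attributes according to FILENAME_KEYS."""
    stem = os.path.basename(path).rsplit(".", 1)[0]  # drop the suffix
    parts = stem.split("_")
    if len(parts) != len(FILENAME_KEYS):
        raise ValueError(f"{path} does not match the file name template")
    return dict(zip(FILENAME_KEYS, parts))


# Hypothetical file name, for illustration only
attrs = parse_filename("tas_Amon_MPI-ESM1-2-HR_ssp370_r1i1p1f1_gn_201501-201912.nc")
print(attrs["time_range"])
```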

#### Define the **assets** column of your catalog

The `assets` entry is a python `dict`ionary in the catalog similar to an attribute, e.g.:

```json
  "assets": {
    "column_name": "path",
    "format": "netcdf"
  },
```

The assets of a catalog refer to the data source that can be loaded by `intake`. Assets are essential for connecting `intake`'s function of **browsing** with the function of **accessing** the data. The `assets` entry contains

- a `column_name` which is associated with the keyword in the [collection](#collection). The value of `column_name` in the collection points at the [asset](#asset) which can be loaded by `xarray`.
- a `format` which specifies the data format of the asset.

```{note}

If you have [assets](#asset) of mixed types, you can substitute `format` by `format_column_name` so that both pieces of information for the asset are taken from the [collection](#collection)
```
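For mixed asset types, the `assets` entry could look like the following sketch. The column name `format` and the two paths are assumptions chosen for this example.

```python
# Sketch of an assets description for a collection with mixed data formats
assets_mixed = {
    "column_name": "path",
    # replaces "format": the format is read per row from this column
    "format_column_name": "format",
}

# Hypothetical collection rows: each row states the format of its own asset
rows = [
    {"path": "/path/to/file.nc", "format": "netcdf"},
    {"path": "/path/to/store.zarr", "format": "zarr"},
]

for row in rows:
    print(row[assets_mixed["column_name"]], row[assets_mixed["format_column_name"]])
```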

In [None]:
assets={
    "column_name": "path",
    "format": "netcdf"
  }
intake_esm_catalog["assets"]=assets
print(intake_esm_catalog)

#### Optional: Define **aggregation control** for your data sets

```{note}

If **aggregation_control** is not defined, intake opens one xarray dataset per asset

```

One goal of a catalog is to make access to the data as **analysis ready** as possible. Therefore, `intake-esm` features aggregating multiple [assets](#asset) to a larger single **data set**. If **aggregation_control** is defined in the [catalog](#catalog) and users run the catalog's `to_dataset_dict()` function, a Python dictionary of aggregated xarray datasets is created. The logic for merging and/or concatenating the catalog into datasets has to be configured under aggregation_control.

Aggregation either adds a new dimension to a variable or extends an existing dimension with the data from the additional [assets](#asset).

- `aggregation_control` is a `dict`ionary in the [catalog](#catalog). If it is set, three keywords have to be configured:
    - `variable_column_name`: In the [collection](#collection), the **variable name** is specified under that column. Intake-esm will aggregate [assets](#asset) with the same name only.  Thus, all [assets](#asset) to be combined to a dataset have to include at least one unique variable. If your [assets](#asset) contain more than one data variable and users should be able to subset with intake, check [multi variable assets](#multivar).
    - `groupby_attrs`: [assets](#asset) attributed with different values of the `groupby_attrs` **should not be aggregated** to one xarray dataset. E.g., if you have data for different ESMs in one catalog you do not want users to merge them into one dataset. The `groupby_attrs` will be combined to the key of the aggregated dataset in the returned dictionary of `to_dataset_dict()`.
    - `aggregations`: Specification of **how** xarray should combine [assets](#asset) with same values of these `groupby_attrs`. <a class="anchor" id="aggregations"></a>

E.g.:
```json
  "aggregation_control": {
    "variable_column_name": "variable_id",
    "groupby_attrs": [
      "activity_id",
      "institution_id"
    ],
    "aggregations": [
      {
        "type": "union",
        "attribute_name": "variable_id"
      }
    ]
  }
```

Let's start with defining `variable_column_name` and `groupby_attrs`:

In [None]:
aggregation_control=dict(
    variable_column_name="variable_id",
    groupby_attrs=[
        "activity_id",
        "institution_id"
    ]
)

```{admonition} Best Practice
:class: tip

- A well-defined aggregation control contains **all** [defined attributes](#defineatts)
```
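To illustrate the effect of `groupby_attrs`, the following plain-Python sketch groups collection rows by the configured attributes. It only mimics the idea and is not the `intake-esm` implementation. The helper `group_assets` and the rows are hypothetical, and the separator `.` is assumed to match the default key separator of `intake-esm`.

```python
from collections import defaultdict


def group_assets(rows, groupby_attrs, sep="."):
    """Collect asset paths under a key built from the groupby attributes."""
    groups = defaultdict(list)
    for row in rows:
        key = sep.join(str(row[attr]) for attr in groupby_attrs)
        groups[key].append(row["path"])
    return dict(groups)


# Hypothetical collection rows, for illustration only
rows = [
    {"activity_id": "ScenarioMIP", "institution_id": "DKRZ", "path": "a.nc"},
    {"activity_id": "ScenarioMIP", "institution_id": "DKRZ", "path": "b.nc"},
    {"activity_id": "CMIP", "institution_id": "MPI-M", "path": "c.nc"},
]

groups = group_assets(rows, ["activity_id", "institution_id"])
for key, paths in groups.items():
    print(key, paths)
```

Each key corresponds to one aggregated dataset; only assets within the same group are candidates for aggregation.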

[**Aggregations**](#aggregations):

*aggregations* is an optional list of dictionaries each of which configures

- on which dimension of the variable the [assets](#assets) should be aggregated
- optionally: what keyword arguments should be passed to xarray's `concat()` and `merge()` functions

for one attribute/column of the catalog given as `attribute_name`.

A dictionary of the aggregations list is named *aggregation object* and includes two mandatory specifications and one optional specification:
- `attribute_name`: the column name which is not a *groupby_attr* and should be used for aggregating a single variable over a dimension
- `type`: Can either be
    - `join_new`: concatenate the assets along a new dimension
    - `join_existing`: concatenate the assets along an existing dimension
    - `union`: merge the assets into one dataset
- **optional**: `options`: Keyword arguments for xarray

The following defines that assets with different values of `variable_id` are merged into one dataset:

In [None]:
aggregation_control["aggregations"]=[dict(
    attribute_name="variable_id",
    type="union"
)]

Now, we configure intake to use the column `time_range` for extending the existing dimension `time`. Therefore, we have to add `options` with `"dim": "time"` as keyword argument for xarray:

In [None]:
aggregation_control["aggregations"].append(
    dict(
        attribute_name="time_range",
        type="join_existing",
        options={ "dim": "time", "coords": "minimal", "compat": "override" }
    )
)

We can also retrospectively combine all *members* of an ensemble on a new dimension of a variable:

In [None]:
aggregation_control["aggregations"].append(
    dict(
        attribute_name= "member_id",
        type= "join_new",
        options={ "coords": "minimal", "compat": "override" }
    )
)

```{note}

It is not possible to pre-configure `dask` options for `xarray`. Be sure that users of your catalog know if and how to set <b>chunks</b>.

```

In [None]:
intake_esm_catalog["aggregation_control"]=aggregation_control
print(intake_esm_catalog)

<a class="anchor" id="database"></a>

### 2.2. Create the data base for the catalog

The [collection](#collection) of [assets](#asset) can be specified either
- under `catalog_dict` as a list of dictionaries inside the [catalog](#catalog). One asset including all attribute specifications is saved as an individual dictionary, e.g.:
```json    
    "catalog_dict": [
        {
            "filename": "/work/mh0287/m221078/prj/switch/icon-oes/experiments/khwX155/outdata/khwX155_atm_mon_18500101.nc",
            "variable": "tas_gmean"
        }
    ]
```
- or under `catalog_file` which refers to a separate `.csv` file, e.g.
```json    
    "catalog_file": "dkrz_cmip6_disk_netcdf.csv.gz"
```

### Option A: Catalog_dict implementation

Assuming we would like to create a catalog for all files in `/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/`, we can parse the path with our `directory_structure_template`:

In [None]:
trunk="/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/"
# Map each template key to the corresponding trailing directory level of trunk
template_keys=directory_structure_template.split('/')
trunk_levels=trunk.rstrip('/').split('/')[-len(template_keys):]
trunkdict=dict(zip(template_keys, trunk_levels))

Afterwards, we can associate all files in that directory with these attributes and the additional `time_range` and `path` using os:

In [None]:
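import os
# List the files with the standard library instead of the IPython-only `!ls`
filelist=sorted(os.listdir(trunk))
catalog_dict=[]
for asset in filelist:
    assetdict={}
    # The time range is the last "_"-separated element of the file name
    assetdict["time_range"]=asset.split('.')[0].split('_')[-1]
    assetdict["path"]=os.path.join(trunk, asset)
    assetdict.update(trunkdict)
    catalog_dict.append(assetdict)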

Then, we put that list of dictionaries into the catalog:

In [None]:
intake_esm_catalog["catalog_dict"]=catalog_dict

### Option B: Catalog_file implementation

The `catalog_file` format needs to comply with the following rules:

- all file types that can be opened by pandas are allowed to be set as `catalog_file`
- the `.csv` file needs a header which includes all catalog attributes

An example would be:
```csv
filename,variable
/work/mh0287/m221078/prj/switch/icon-oes/experiments/khwX155/outdata/khwX155_atm_mon_18500101.nc,tas_gmean
``` 

```{note}

- Note that the *catalog_file* can also live in the cloud, i.e. be a URL. You can host both the collection and catalog in the cloud as [DKRZ](https://swiftbrowser.dkrz.de/public/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/) does.
```

```{admonition} Best practice
:class: tip

To keep a clear overview, use the same prefix name for both `catalog` and `catalog_file`.
```

In [None]:
import pandas as pd

catalog_dict_df=pd.DataFrame(catalog_dict)
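A `catalog_file` can then be written from such records. The following minimal sketch uses only the standard library; the records and the file name `my_catalog.csv.gz` are made up for this example. With the DataFrame, `to_csv` with a `.csv.gz` file name should offer an equivalent route.

```python
import csv
import gzip
import os
import tempfile

# Hypothetical collection records, for illustration only
records = [
    {"path": "/path/to/tas_201501-201912.nc", "variable_id": "tas", "time_range": "201501-201912"},
    {"path": "/path/to/tas_202001-202412.nc", "variable_id": "tas", "time_range": "202001-202412"},
]

catalog_file = os.path.join(tempfile.mkdtemp(), "my_catalog.csv.gz")

# Write a gzip-compressed CSV whose header lists all columns
with gzip.open(catalog_file, "wt", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)

# Read the file back to verify that nothing was lost
with gzip.open(catalog_file, "rt", newline="") as f:
    restored = list(csv.DictReader(f))
print(len(restored), "rows restored")
```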

### Saving a separate data base for [assets](#asset) or use a dictionary in the catalog?

```{tabbed} Advantages catalog_dict

- Only maintain one file which contains both catalog and collection

Suitable for smaller catalogs

```

```{tabbed} Disadvantages catalog_dict

- you cannot easily compress the catalog
- you can only use **one type of data access** for the catalog content. For CMIP6, we can provide access via `netcdf` or via `opendap`. We can create two collections for the same catalog file for covering both use cases.

A separate `catalog_file` is therefore more suitable for larger catalogs

```

### Use case: Updating the collection for a living project on file system

Solution: Write a **builder** script and run it as a cronjob (automatically and regularly):

A typical builder for a community project contains the following sequence:

1. Create one or more **lists of files** based on a `find` shell command on the data base directory. This type of job is also named *crawler* as it *crawls* through the file system.
1. Read the lists of files and create a `pandas` DataFrame for these files.
1. Parse the file names and file paths and fill column values. That can be easily done by deconstructing filepaths and filenames into their parts, assuming you defined a mandatory path and file name template.
    - Filenames that cannot be parsed should be sorted out
1. The data frame is saved as the final **catalog** as a `.csv` file. You can also compress it to `.csv.gz`.
    
At DKRZ, we run scripts for project data on disk repeatedly in cronjobs to keep the catalog updated.
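The builder sequence can be sketched with the standard library alone. This is not the DKRZ or NCAR builder: it substitutes `os.walk` for the `find` command and a list of dictionaries for the DataFrame. The function `build_collection`, the short template and the throw-away directory tree are assumptions for this example.

```python
import csv
import gzip
import os
import tempfile


def build_collection(root, template, catalog_file):
    """Crawl `root`, parse the directory levels and write a compressed CSV."""
    keys = template.split("/")
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        levels = [] if rel == "." else rel.split(os.sep)
        if len(levels) != len(keys):
            continue  # files outside the deepest level are sorted out
        for name in sorted(filenames):
            if not name.endswith(".nc"):
                continue  # only NetCDF files become assets in this sketch
            row = dict(zip(keys, levels))
            row["path"] = os.path.join(dirpath, name)
            rows.append(row)
    rows.sort(key=lambda row: row["path"])
    with gzip.open(catalog_file, "wt", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=keys + ["path"])
        writer.writeheader()
        writer.writerows(rows)
    return rows


# Demo on a throw-away directory tree (all names are made up)
root = tempfile.mkdtemp()
template = "experiment_id/variable_id/version"
deepest = os.path.join(root, "ssp370", "tas", "v20190710")
os.makedirs(deepest)
open(os.path.join(deepest, "tas_201501-201912.nc"), "w").close()
open(os.path.join(root, "README.txt"), "w").close()  # ignored: wrong level

catalog_file = os.path.join(root, "catalog.csv.gz")
rows = build_collection(root, template, catalog_file)
print(rows)
```

Run regularly, for example from a cronjob, such a script recreates the collection from the current state of the file system.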

#### Builder tool examples

- The NCAR [builder tool](https://github.com/NCAR/intake-esm-datastore/tree/e253f184ccc78906a08f1580282da070b898957a/builders) for community projects like CMIP6 and CMIP5.
- DKRZ builder notebooks (based on NCAR tools) like this [Era5 notebook](https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/blob/master/builder/notebooks/dkrz_era5_disk_catalog.ipynb)

<a class="anchor" id="validate"></a>

## 3. Validate and save the catalog:

If we open the defined catalog with `open_esm_datastore()` and try `to_dataset_dict()`, we can check if our creation is successful. The resulting catalog should give us exactly 1 dataset from 18 assets as we aggregate over time.

In [None]:
import intake
validated_cat=intake.open_esm_datastore(
    obj=dict(
        df=catalog_dict_df,
        esmcat=intake_esm_catalog
    )
)
validated_cat

In [None]:
validated_cat.to_dataset_dict()
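If opening or aggregating the catalog fails, one thing worth checking is whether every row of the collection provides all columns that the description refers to. The following plain-Python pre-check is only a sketch; the helper `missing_columns` and the small example catalog are hypothetical.

```python
def missing_columns(esmcat, rows):
    """Return {row index: missing column names} for incomplete rows."""
    required = {att["column_name"] for att in esmcat.get("attributes", [])}
    required.add(esmcat["assets"]["column_name"])
    agg = esmcat.get("aggregation_control", {})
    if "variable_column_name" in agg:
        required.add(agg["variable_column_name"])
    required.update(agg.get("groupby_attrs", []))
    required.update(a["attribute_name"] for a in agg.get("aggregations", []))
    report = {}
    for i, row in enumerate(rows):
        missing = sorted(required - set(row))
        if missing:
            report[i] = missing
    return report


# Hypothetical description and collection, for illustration only
esmcat = {
    "attributes": [{"column_name": "variable_id"}],
    "assets": {"column_name": "path", "format": "netcdf"},
    "aggregation_control": {
        "variable_column_name": "variable_id",
        "groupby_attrs": ["source_id"],
        "aggregations": [{"attribute_name": "time_range", "type": "join_existing"}],
    },
}
rows = [
    {"path": "a.nc", "variable_id": "tas", "source_id": "X", "time_range": "2015-2019"},
    {"path": "b.nc", "variable_id": "tas"},
]
print(missing_columns(esmcat, rows))
```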

`intake-esm` can write catalog file(s) with the `serialize()` function. The only required argument is the **name** of the catalog, which will be used as filename. It writes the two parts of the catalog either together in a `.json` file:

In [None]:
validated_cat.serialize("validated_cat")

Or in two separate files if we provide `catalog_type="file"` as a second argument. A single `validated_cat.json` may be very large, while we can save disk space if we save the data base in a separate `.csv.gz` file:

In [None]:
validated_cat.serialize("validated_cat", catalog_type="file")

<a class="anchor" id="multi"></a>

## 4. Multivariable assets

If an [asset](#asset) contains more than one variable, `intake-esm` also features pre-selection of a variable before loading the data. [Here](https://intake-esm.readthedocs.io/en/latest/user-guide/multi-variable-assets.html) is a user guide on how to configure the collection for that.

1. the *variable_column* of the catalog must contain iterables (`list`, `tuple`, `set`) of values.
2. the user must specify a dictionary of functions for converting values in certain columns into iterables. This is done via the `read_csv_kwargs` argument (named `csv_kwargs` in older `intake-esm` releases) such that the collection needs to be opened as follows:

```python
import ast
import intake

col = intake.open_esm_datastore(
    "multi-variable-collection.json",
    # parse the string stored in the "variable" column into a Python iterable
    read_csv_kwargs={"converters": {"variable": ast.literal_eval}},
)
col
```
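The converter is needed because a `.csv` file stores every value as text. The following sketch shows the conversion on its own, using the standard library `csv` reader instead of `intake-esm`; the two-line collection is made up for this example.

```python
import ast
import csv
import io

# Hypothetical collection: the variable column holds a list literal
csv_text = '''path,variable
/path/to/file.nc,"['tas', 'pr']"
'''

row = next(csv.DictReader(io.StringIO(csv_text)))
print(type(row["variable"]).__name__)  # the raw value is a string

# ast.literal_eval safely turns the string into a Python list
variables = ast.literal_eval(row["variable"])
print(variables)
```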


```{seealso}
This tutorial is part of a series on `intake`:
* [Part 1: Introduction](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-1-introduction.html)
* [Part 2: Modifying and subsetting catalogs](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-2-subset-catalogs.html)
* [Part 3: Merging catalogs](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-3-merge-catalogs.html)
* [Part 4: Use preprocessing and create derived variables](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-4-preprocessing-derived-variables.html)
* [Part 5: How to create an intake catalog](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-5-create-esm-collection.html)

- You can also do another [CMIP6 tutorial](https://intake-esm.readthedocs.io/en/latest/user-guide/cmip6-tutorial.html) from the official intake page.

```