{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Prepare modeling data for GIS analyses\n",
    "The most common way to work with climate modeling data sets is storing them in NetCDF or GRIB format and processing them using Xarray. Raster data sets used in GIS applications are usually stored in raster formats such as GeoTIFF, BIL, or IMG and being processed mainly with the help of GDAL-based python libraries. While modeling data sets are normally georeferenced using grids explicitly described by coordinates and conforming to e.g., CF sandard, geographic reference of GIS data sets is described by a coordinate reference system (CRS). This diferences imply some steps that need to be undertaken in order to convert data sets from one format to another.\n",
    "\n",
    "In this notebook we create an aggregated map of pre-industrial Vertically Averaged Sea Water Potential Temperature (thetaot) from CMIP6 and save it as GeoTIFF.\n",
    "\n",
    "First, we import all necessary libraries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import intake, requests, aiohttp\n",
    "import matplotlib.pyplot as plt\n",
    "import cartopy.crs as ccrs\n",
    "import numpy as np\n",
    "import xarray as xr\n",
    "import cf_xarray as cfxr\n",
    "import xesmf as xe\n",
    "import scipy.sparse as sps\n",
    "\n",
    "# Rioxarray is an experimental interface between Xarray and Rasterio libraries\n",
    "# made for loading GDAL-readable rasters as xarray datasets and saving xarrays as raster files.\n",
    "import rioxarray\n",
    "#print(\"Using roocs/clisops in version %s\" % cl.__version__)\n",
    "print(\"Using xESMF in version %s\" % xe.__version__)\n",
    "\n",
    "xr.set_options(display_style = \"html\");\n",
    "\n",
    "import warnings\n",
    "warnings.simplefilter(\"ignore\") "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Find and load the data set\n",
    "Then we find the data set we need in the DKRZ open catalogue and print its metadata."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Path to master catalog on the DKRZ server\n",
    "#dkrz_catalog=intake.open_catalog([\"https://dkrz.de/s/intake\"])\n",
    "#\n",
    "#only for the web page we need to take the original link:\n",
    "dkrz_catalog=intake.open_catalog([\"https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml\"])\n",
    "# Print DKRZ open catalogues\n",
    "list(dkrz_catalog)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Open the catalog with the intake package and name it \"col\" as short for \"collection\"\n",
    "cols=dkrz_catalog._entries[\"dkrz_cmip6_disk\"]._open_args[\"read_csv_kwargs\"][\"usecols\"]+[\"opendap_url\"]\n",
    "col=dkrz_catalog.dkrz_cmip6_disk(read_csv_kwargs=dict(usecols=cols))\n",
    "cat = col.search(variable_id = \"thetaot\",\n",
    "                 experiment_id = \"historical\",\n",
    "                time_range = \"185001-194912\",\n",
    "                member_id = \"r1i1p1f3\"\n",
    "                )\n",
    "cat.df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dset = xr.open_dataset(cat.df.uri.values[0], chunks = {\"time\": 6})\n",
    "dset[\"thetaot\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, coordinates are stored as arrays of grid cell indices, which is not a common format in the GIS community.\n",
    "\n",
    "\n",
    "## 2. Calculate pre-industrail mean\n",
    "We make a subset limiting the data set to pre-industrial era, calculate the mean values per grid cell, and plot the result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dset[\"time\"] = dset[\"time\"].dt.strftime(\"%Y%m%d\")\n",
    "pre_industrial = dset[\"thetaot\"].sel(time = slice(\"1850\", \"1880\"))\n",
    "pre_industrial_mean = pre_industrial.mean(dim = \"time\", keep_attrs = True)\n",
    "pre_industrial_mean.plot() "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we plot the coastlines from cartopy over the dataset, we will see that the coordinate systems of these datasets are... a bit different."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = plt.figure(1, figsize = [30, 13])\n",
    "\n",
    "# Set the projection to use for plotting\n",
    "ax = plt.subplot(1, 1, 1, projection=ccrs.PlateCarree())\n",
    "ax.coastlines()\n",
    "\n",
    "# Pass ax as an argument when plotting. Here we assume data is in the same coordinate reference system than the projection chosen for plotting\n",
    "# isel allows to select by indices instead of the time values\n",
    "pre_industrial_mean.plot.pcolormesh(ax = ax, cmap = \"coolwarm\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we visualize our data set's grid, it becomes obvious that it is far from a standard grid."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import textwrap\n",
    "def plot_curv_grid(ds, var = \"tos\"):    \n",
    "        lat = cfxr.accessor._get_with_standard_name(ds, \"latitude\")[0]\n",
    "        lon = cfxr.accessor._get_with_standard_name(ds, \"longitude\")[0]        \n",
    "        if any([i == None for i in [lat, lon]]): \n",
    "            print(ds.attrs[\"source_id\"], \": Cannot identify latitude/longitude.\")\n",
    "            return\n",
    "        plt.figure(figsize = (16, 9), dpi=120)\n",
    "        plt.scatter(x = ds[lon], y = ds[lat], s = 0.01)\n",
    "        # x, y = np.meshgrid(ds[lon], ds[lat])\n",
    "        # plt.scatter(x = x, y = y, s = 0.01)\n",
    "        try:\n",
    "            plt.title(\"\\n\".join(textwrap.wrap(ds.attrs[\"source_id\"] + \"(\" + ds.attrs[\"source\"].split(\"ocean:\")[-1].split(\"\\n\")[0] + \")\", 120)))\n",
    "        except (KeyError, IndexError):\n",
    "            plt.title(ds.attrs[\"source_id\"])  \n",
    "plot_curv_grid(dset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Reformat the grid and regridd the data set.\n",
    "As we mentioned before, the coordinates need to be re-defined."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Later one can just install and use clisops, once functions like this are merged into the master branch\n",
    "def grid_reformat(ds):\n",
    "\n",
    "    # Define lat, lon, lat_bnds, lon_bnds\n",
    "    lat = ds.lat[:, 0]\n",
    "    lon = ds.lon[0, :]\n",
    "    lat_bnds = np.zeros((lat.shape[0], 2), dtype = \"double\")\n",
    "    lon_bnds = np.zeros((lon.shape[0], 2), dtype = \"double\")\n",
    "\n",
    "    # From (N + 1, M + 1) shaped bounds to (N, M, 4) shaped vertices\n",
    "    lat_vertices = cfxr.vertices_to_bounds(\n",
    "        ds.lat_b, (\"bnds\", \"lat\", \"lon\")\n",
    "    ).values\n",
    "    lon_vertices = cfxr.vertices_to_bounds(\n",
    "        ds.lon_b, (\"bnds\", \"lat\", \"lon\")\n",
    "    ).values\n",
    "\n",
    "    lat_vertices = np.moveaxis(lat_vertices, 0, -1)\n",
    "    lon_vertices = np.moveaxis(lon_vertices, 0, -1)\n",
    "\n",
    "    # From (N, M, 4) shaped vertices to (N, 2)  and (M, 2) shaped bounds\n",
    "    lat_bnds[:, 0] = np.min(lat_vertices[:, 0, :], axis = 1)\n",
    "    lat_bnds[:, 1] = np.max(lat_vertices[:, 0, :], axis = 1)\n",
    "    lon_bnds[:, 0] = np.min(lon_vertices[0, :, :], axis = 1)\n",
    "    lon_bnds[:, 1] = np.max(lon_vertices[0, :, :], axis = 1)\n",
    "\n",
    "    # Create dataset\n",
    "    ds_ref = xr.Dataset(\n",
    "        coords = {\n",
    "            \"lat\": ([\"lat\"], lat.data),\n",
    "            \"lon\": ([\"lon\"], lon.data),\n",
    "            \"lat_bnds\": ([\"lat\", \"bnds\"], lat_bnds.data),\n",
    "            \"lon_bnds\": ([\"lon\", \"bnds\"], lon_bnds.data),\n",
    "        },\n",
    "    )\n",
    "\n",
    "\n",
    "    # Add variable attributes to the coordinate variables\n",
    "    ds_ref[\"lat\"].attrs = {\n",
    "        \"bounds\": \"lat_bnds\",\n",
    "        \"units\": \"degrees_north\",\n",
    "        \"long_name\": \"latitude\",\n",
    "        \"standard_name\": \"latitude\",\n",
    "        \"axis\": \"Y\",\n",
    "    }\n",
    "    ds_ref[\"lon\"].attrs = {\n",
    "        \"bounds\": \"lon_bnds\",\n",
    "        \"units\": \"degrees_east\",\n",
    "        \"long_name\": \"longitude\",\n",
    "        \"standard_name\": \"longitude\",\n",
    "        \"axis\": \"X\",\n",
    "    }\n",
    "    ds_ref[\"lat_bnds\"].attrs = {\n",
    "        \"long_name\": \"latitude_bounds\",\n",
    "        \"units\": \"degrees_north\",\n",
    "    }\n",
    "    ds_ref[\"lon_bnds\"].attrs = {\n",
    "        \"long_name\": \"longitude_bounds\",\n",
    "        \"units\": \"degrees_east\",\n",
    "    }\n",
    "    \n",
    "    return ds_ref"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Specify a global 1deg target grid\n",
    "\n",
    "# With clisops\n",
    "#ds_out = clore.Grid(grid_instructor=(1., )).ds\n",
    "\n",
    "# Without clisops\n",
    "ds_out = grid_reformat(xe.util.grid_global(1., 1.))\n",
    "\n",
    "ds_out"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And now we regrid the data set to a normal 1-degree global grid."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# In case of problems, activate ESMF verbose mode\n",
    "import ESMF\n",
    "ESMF.Manager(debug = True)\n",
    "\n",
    "# Regridding methods\n",
    "#method_list = [\"bilinear\", \"nearest_s2d\", \"patch\", \"conservative\", \"conservative_normed\"]\n",
    "method_list = [\"bilinear\", \"nearest_s2d\", \"patch\"]\n",
    "\n",
    "# Function to generate the weights\n",
    "#   If grids have problems of degenerated cells near the poles there is the ignore_degenerate option\n",
    "def regrid(ds_in, ds_out, method, periodic, unmapped_to_nan = True, ignore_degenerate = None):\n",
    "    \"\"\"Convenience function for calculating regridding weights\"\"\"\n",
    "    ds_in[\"lat\"] = ds_in[\"latitude\"]\n",
    "    ds_in[\"lon\"] = ds_in[\"longitude\"]\n",
    "    return xe.Regridder(ds_in, ds_out, method, periodic = periodic, \n",
    "                        ignore_degenerate = ignore_degenerate) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%time regridder = regrid(pre_industrial_mean, ds_out, \"bilinear\", periodic = True, unmapped_to_nan = True, ignore_degenerate = None) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "regridder"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "regridded_ds = regridder(pre_industrial_mean, keep_attrs = True)\n",
    "regridded_ds"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = plt.figure(1, figsize = [30, 13])\n",
    "\n",
    "# Set the projection to use for plotting\n",
    "ax = plt.subplot(1, 1, 1, projection = ccrs.PlateCarree())\n",
    "ax.coastlines()\n",
    "\n",
    "# Pass ax as an argument when plotting. Here we assume data is in the same coordinate reference system than the projection chosen for plotting\n",
    "# isel allows to select by indices instead of the time values\n",
    "regridded_ds.plot.pcolormesh(ax = ax, cmap = \"coolwarm\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, the data has a correct CRS now.\n",
    "\n",
    "## 4. Write a CRS and save the data set as a GeoTIFF\n",
    "It is pretty straitforward to write a CRS to an xarray with rioxarray. Here we used the World Geodetic System 1984 CRS which is quite common and used by default, for example in QGIS."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#You might need to rename the axes before\n",
    "transformed = regridded_ds.rename({\"lon\":\"x\", \"lat\":\"y\"})\n",
    "\n",
    "transformed = transformed.rio.write_crs(\"EPSG:4326\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we want to re-project the data set to e.g., Web Mercator, wich is used by default in Google Maps and ArcGIS, we do it as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "transformed = transformed.rio.reproject(\"EPSG:3857\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And finally, saving our dataset as a GeoTIFF is as easy as this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "transformed.rio.to_raster(\"regridded_3857.tif\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "python3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}