[#7]: split fundamentals
g.trantham committed Apr 7, 2023
1 parent c2c5ec4 commit d572ee4
Showing 30 changed files with 3,282 additions and 21,833 deletions.
2,475 changes: 0 additions & 2,475 deletions 101/02a_kerchunk_single.ipynb

This file was deleted.

5,467 changes: 0 additions & 5,467 deletions 101/02b_kerchunk_multi.ipynb

This file was deleted.

5,452 changes: 0 additions & 5,452 deletions 101/02b_kerchunk_multi_to_years.ipynb

This file was deleted.

2,342 changes: 0 additions & 2,342 deletions 101/02c_kerchunk_multi_from_years.ipynb

This file was deleted.

5,274 changes: 0 additions & 5,274 deletions 101/02d_kerchunk.explore.ipynb

This file was deleted.

280 changes: 280 additions & 0 deletions 101/Compression.ipynb
@@ -0,0 +1,280 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Assessing Compression\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"**OBJECTIVE**: \n",
"The objective of this chapter is to demonstrate how to read an existing dataset available as an OpenDAP endpoint, and translate it into a cloud-optimized zarr on S3. \n",
"\n",
"This notebook will evaluate how well the data is compressed when it is \n",
"written do disk. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preamble\n",
"This is all stuff we are going to need: "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import logging\n",
"\n",
"import numpy as np\n",
"import xarray as xr\n",
"import hvplot.xarray\n",
"import fsspec\n",
"import zarr\n",
"\n",
"logging.basicConfig(level=logging.INFO, force=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"remove-input"
]
},
"outputs": [],
"source": [
"%run ../utils.ipynb\n",
"_versions(['xarray', 'dask', 'zarr', 'fsspec'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## AWS Credentials\n",
"\n",
"We will use the same credentials scheme we used to write the data in \n",
"{doc}`OpenDAP_to_S3`. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"os.environ['AWS_PROFILE'] = 'osn-rsignellbucket2'\n",
"os.environ['AWS_ENDPOINT'] = 'https://renc.osn.xsede.org'\n",
"\n",
"%run ../AWS.ipynb # handles credentials for us. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The zarr store\n",
"\n",
"Let's look at the zarr data store we wrote to object storage"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# OUTPUT Location: \n",
"outdir = f's3://rsignellbucket2/testing/prism/PRISM2.zarr'\n",
"# established in earlier notebooks.\n",
"\n",
"fsw = fsspec.filesystem('s3', \n",
" anon=False, \n",
" default_fill_cache=False, \n",
" skip_instance_cache=True, \n",
" client_kwargs={ 'endpoint_url': os.environ['AWS_S3_ENDPOINT'] },\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The zarr store is actually a folder/directory, with subfolders for variables, groups, etc. \n",
"We can get a quick peek at that with a couple of zarr functions:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"g = zarr.convenience.open_consolidated(fsw.get_mapper(outdir)) # read zarr metadata for named file.\n",
"print(g.tree())"
]
},
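{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the folder/directory layout concrete, we can also list the store's top-level entries \n",
"with the same fsspec filesystem. This is just a sketch; the exact listing depends on the \n",
"variables and metadata objects actually present in the store."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# List top-level entries of the zarr store: typically one \"folder\" per variable\n",
"# plus metadata objects (e.g. .zmetadata for consolidated metadata).\n",
"for entry in fsw.ls(outdir):\n",
"    print(entry)"
]
},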
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Sizing the chunks\n",
"\n",
"Using the filesystem utilities, build a datasets of file sizes:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"flist = fsw.glob(f'{outdir}/tmx/*')\n",
"fsize = [fsw.size(f) for f in flist]\n",
"da = xr.DataArray(data=np.array(fsize)/1e6, name='size')"
]
},
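{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional quick check before plotting, here are simple summary statistics of the `size` \n",
"array built above (a sketch; values are in MB):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Min / median / max of the compressed object sizes, in MB.\n",
"print(f\"min:    {da.min().item():.2f} MB\")\n",
"print(f\"median: {da.median().item():.2f} MB\")\n",
"print(f\"max:    {da.max().item():.2f} MB\")"
]
},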
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot a histogram of sizes for data files in this zarr store:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"da.hvplot.hist(title='Compressed object sizes (MB) for \"tmx\" variable', grid=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that most of the individual chunks are just over 4MB in size. Compare this with \n",
"the in-memory size for chunks, according to `xarray` -- 34MB per chunk. \n",
"\n",
"This tells us that we get an astonishing 9:1 compression ratio on this particular data. \n",
"\n",
"Let's look at another variable in this dataset:"
]
},
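{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check on the in-memory figure, we can estimate the uncompressed size of a single \n",
"chunk from the store's own chunk shape and dtype. This is a sketch: it assumes the variable is \n",
"named `tmx` and that opening with `chunks={}` preserves the on-disk chunking."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Estimate the uncompressed (in-memory) size of one 'tmx' chunk.\n",
"ds_check = xr.open_dataset(fsw.get_mapper(outdir), engine='zarr', chunks={})\n",
"tmx = ds_check['tmx']\n",
"chunk_shape = tuple(c[0] for c in tmx.chunks)  # shape of a typical (first) chunk\n",
"chunk_mb = np.prod(chunk_shape) * tmx.dtype.itemsize / 1e6\n",
"print(f\"chunk shape: {chunk_shape}, ~{chunk_mb:.1f} MB uncompressed\")"
]
},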
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"flist = fsw.glob(f'{outdir}/ppt/*')\n",
"fsize = [fsw.size(f) for f in flist]\n",
"da = xr.DataArray(data=np.array(fsize)/1e6, name='size')\n",
"da.hvplot.hist(title='Compressed object sizes (MB) for \"ppt\" variable', grid=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This variable did not compress quite as well. In the worse case, severl chunks are about\n",
"14MB each. This is still a respectable 2.5:1 compression ratio for this variable. \n"
]
},
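{
"cell_type": "markdown",
"metadata": {},
"source": [
"To put numbers on that comparison, here is a rough per-variable compression ratio: the nominal \n",
"in-memory size divided by the total size of the compressed chunk objects on disk. This is a \n",
"sketch; it assumes each variable's chunks live under `<store>/<variable>/`, as in the globs above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Rough per-variable compression ratios: nominal in-memory bytes / bytes on disk.\n",
"ds_check = xr.open_dataset(fsw.get_mapper(outdir), engine='zarr', chunks={})\n",
"for v in ['tmx', 'ppt']:\n",
"    on_disk = sum(fsw.size(f) for f in fsw.glob(f'{outdir}/{v}/*'))\n",
"    in_mem = ds_check[v].size * ds_check[v].dtype.itemsize\n",
"    print(f\"{v}: {in_mem / on_disk:.1f}:1\")"
]
},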
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Total Size\n",
"The total size in GB of this dataset as stored on disk (including all metadata) is:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fsw.du(outdir)/1e9"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we count up the in-memory sizes reported by `xarray`: "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"new_ds = xr.open_dataset(fsw.get_mapper(outdir), engine='zarr', chunks={})\n",
"total=0\n",
"for i in new_ds.variables:\n",
" n = new_ds[i].size\n",
" bytes = n * 4\n",
" total += bytes\n",
" print(f\"{i:10s}: {bytes: 12d}\")\n",
"print(\"=\" * 24, f\"\\nTOTAL : {total: 12d}\")\n",
"\n",
"print(f\"In GB: {total/1e9}\")"
]
},
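{
"cell_type": "markdown",
"metadata": {},
"source": [
"If any variable is not actually stored as 4-byte values, the hand count above will be off. As a \n",
"cross-check (a sketch using xarray's own accounting), `Dataset.nbytes` reports the nominal \n",
"in-memory size using each variable's actual dtype:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Nominal in-memory size according to xarray, using each variable's real dtype.\n",
"print(f\"In GB: {new_ds.nbytes / 1e9}\")"
]
},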
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So... our total compression ratio for the entire dataset, including file system overhead and metadata is: "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"total / fsw.du(outdir) "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.8"
}
},
"nbformat": 4,
"nbformat_minor": 4
}