Merge pull request #9 from hytest-org/gt-00-gridmet
Expands 101, includes gridmet example
Gene Trantham authored Apr 14, 2023
2 parents 455891a + 65c5866 commit aa65148
Showing 6 changed files with 743 additions and 272 deletions.
2 changes: 1 addition & 1 deletion 101/EffectSizeShape.ipynb
@@ -52,7 +52,7 @@
"## Why Chunk\n",
"The simple reason is that the full dataset won't fit in memory. It has to be\n",
"divided in some way so that only those parts of the data being actively worked\n",
"are loaded. \n",
"are loaded. (See more in {doc}`WhyChunk`).\n",
"\n",
"This has other benefits when it comes to parallel algorithms. If work can be\n",
"performed in parallel, it is easy to set it up such that a separate worker is\n",
2 changes: 1 addition & 1 deletion 101/ExamineSourceData.ipynb
@@ -85,7 +85,7 @@
"outputs": [],
"source": [
"# lazy-load\n",
"ds_in = xr.open_dataset(OPENDAP_url, decode_times=False)\n",
"ds_in = xr.open_dataset(OPENDAP_url)#, decode_times=False)\n",
"# and show it:\n",
"ds_in"
]
270 changes: 270 additions & 0 deletions 101/SecondaryExample.ipynb
@@ -0,0 +1,270 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "2979f859-1640-4633-b52a-b10cb3482fd4",
"metadata": {},
"source": [
"# Another Example\n",
"\n",
"Taking what we learned reading OpenDAP data, let's do another example, end-to-end, of\n",
"reading a dataset from a tredds server. The `gridmet` data for precipitation will be\n",
"our new example dataset: \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "00d2ca69-26f6-4b2a-bc5c-b11e0bb7f23f",
"metadata": {},
"outputs": [],
"source": [
"# Example Dataset: gridmet\n",
"DATA_url = r\"http://thredds.northwestknowledge.net:8080/thredds/dodsC/agg_met_pr_1979_CurrentYear_CONUS.nc\"\n"
]
},
{
"cell_type": "markdown",
"id": "12c9573c-f965-458c-88a8-5376ded065da",
"metadata": {},
"source": [
"## Preamble\n",
"Stuff we need..."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "19421b21-eb10-4d4d-b8a6-8a28de456a6a",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import logging\n",
"import xarray as xr\n",
"\n",
"logging.basicConfig(level=logging.INFO, force=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ddb8b2fa-d386-4cae-b8bb-e820bed27d65",
"metadata": {
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"%run ../utils.ipynb\n",
"_versions(['xarray', 'dask'])"
]
},
{
"cell_type": "markdown",
"id": "c1e5f8ac-6f5b-43a6-a7a7-4006ceda54cb",
"metadata": {},
"source": [
"## Examine Data\n",
"Lazy-load the data using defaults to see how the overall structure looks:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9ebc6a80-a610-47b9-89d0-a26e8c469f44",
"metadata": {},
"outputs": [],
"source": [
"# lazy-load\n",
"ds_in = xr.open_dataset(DATA_url + r\"#fillmismatch\", decode_coords=True, chunks={})\n",
"# and show it:\n",
"ds_in"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "abead8bb-a65e-4af6-90a0-165996f12ce1",
"metadata": {},
"outputs": [],
"source": [
"ds_in.precipitation_amount"
]
},
{
"cell_type": "markdown",
"id": "065a21ed-4230-4338-b64c-d48438c3c0ed",
"metadata": {},
"source": [
"The data is being presented to us from the server as if it is one big chunk. This is almost certainly not how it is stored on the server end. And more importantly, that 48GB chunk is too big for the server to provide all at once. Typically, data requests are capped at 500MB. \n",
"\n",
"But because we did not specify a chunk pattern, we get the illusion that it is one big chunk, and it is up to the server and the client (inside the `open_dataset()` method) to negotiate the transfer. \n",
"\n",
"A hint as to the way the server thinks of this data (absent chunking directives) is the `_ChunkSizes` attribute: `[61 98 231]` for `(day, lat, lon)`. Using that chunking pattern, the data is sized like so: "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "760ea923-404b-4ed1-9a49-be8e4549eabe",
"metadata": {},
"outputs": [],
"source": [
"# day #lat #lon #float32\n",
"bytes = 61 * 98 * 231 * 4\n",
"kbytes = bytes / (2**10)\n",
"mbytes = kbytes / (2**10)\n",
"print(f\"TMN chunk size: {bytes=} ({kbytes=:.2f})({mbytes=:.4f})\")"
]
},
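{
"cell_type": "markdown",
"id": "f0a1b2c3-d4e5-4f60-8a71-92b3c4d5e6f7",
"metadata": {},
"source": [
"As a quick check, we can look for that `_ChunkSizes` hint directly on the variable. Depending on the xarray version, the server-reported hint may surface in the variable's `.attrs` or in its `.encoding`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1b2c3d4-e5f6-4a70-8b81-03c4d5e6f708",
"metadata": {},
"outputs": [],
"source": [
"# Where xarray surfaces the server's chunk hint can vary by version;\n",
"# check the variable's attributes first, then fall back to its encoding.\n",
"var = ds_in['precipitation_amount']\n",
"var.attrs.get('_ChunkSizes', var.encoding.get('_ChunkSizes', 'not reported'))"
]
},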
{
"cell_type": "markdown",
"id": "a1469889-2267-4301-b561-4fdbae4dba8a",
"metadata": {},
"source": [
"## Establish Chunking Preference\n",
"Will proceed with the assumption that this data will most likely be taken at full extent, for short time intervals. \n",
"\n",
"Examining each of the dimensions of this dataset: "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f23f2317-e504-4882-b936-da482f2864ba",
"metadata": {},
"outputs": [],
"source": [
"day = 16169/365 #how many chunks for a year-at-a-time\n",
"day"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7a60542a-e5e8-448b-b25c-db04947f42d7",
"metadata": {},
"outputs": [],
"source": [
"lat = 585 / 3 # split into 3 chunks\n",
"lat"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "868e11d0-a3a5-4ac8-9b7f-043e7e29be3f",
"metadata": {},
"outputs": [],
"source": [
"lon = 1386 /3 # split into 3 chunks\n",
"lon"
]
},
{
"cell_type": "markdown",
"id": "da090e6f-4c11-439c-86f7-3246431b3a7b",
"metadata": {},
"source": [
"If we chunk with this pattern, how big will each chunk be? "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2e1be74e-2bcc-48cb-becc-ea10e40cf3ff",
"metadata": {},
"outputs": [],
"source": [
"# day #lat #lon #float32\n",
"bytes = 365 * 195 * 462 * 4\n",
"kbytes = bytes / (2**10)\n",
"mbytes = kbytes / (2**10)\n",
"print(f\"Chunk size: {bytes=} ({kbytes=:.2f})({mbytes=:.4f})\")"
]
},
{
"cell_type": "markdown",
"id": "99737679-82da-4ca4-b4f7-f2fe97411553",
"metadata": {},
"source": [
" 125MB chunk seems reasonable, but it does mean that a time-series read pattern will have to take in 45 chunks (assuming the spatial extent of the analysis is within one lat/lon chunk). To bring that down, let's take the time dimension as 2 years at a time. Just to make the numbers more round, we'll express the time as 24 30-day months: "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e0df8131-d350-48bb-b6c0-932486d2a13a",
"metadata": {},
"outputs": [],
"source": [
"ds_in = xr.open_dataset(\n",
" DATA_url + r\"#fillmismatch\", \n",
" decode_coords=True, \n",
" chunks={'day': 24*30, 'lon': 462, 'lat': 195}\n",
")\n",
"ds_in"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e49d852b-5f98-408f-b24b-708c6b1227aa",
"metadata": {},
"outputs": [],
"source": [
"ds_in.precipitation_amount"
]
},
{
"cell_type": "markdown",
"id": "94c55815-6cd2-4ff3-a48c-97d61739c4cc",
"metadata": {},
"source": [
"This chunk pattern favors the spatial extent. Six chunks are needed to read the entire spatial extent for one time step. \n",
"\n",
"The time data is chunked by two year blocks (assuming alignment with year boundaries, which is almost certainly not true). It would be more accurate to say that the time is in two-year-sized chunks, but may not align with the calendar year. An entire time-series for a small spatial extent will require 23 chunks to be read."
]
},
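{
"cell_type": "markdown",
"id": "c2d3e4f5-a6b7-4c80-9d91-14e5f6a7b809",
"metadata": {},
"source": [
"Those counts can be checked quickly with the same arithmetic used earlier (16169 days, a 585 x 1386 grid, 720-day time chunks):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d3e4f5a6-b7c8-4d90-8ea1-25f6a7b8c90a",
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"#        day       #lat  #lon  #float32\n",
"bytes = (24*30) * 195 * 462 * 4\n",
"mbytes = bytes / (2**20)\n",
"spatial_chunks = math.ceil(585/195) * math.ceil(1386/462)  # full spatial extent, one time step\n",
"time_chunks = math.ceil(16169 / (24*30))                   # full time series, one spatial chunk\n",
"print(f\"Chunk size: {mbytes=:.2f}; {spatial_chunks=}; {time_chunks=}\")"
]
},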
{
"cell_type": "markdown",
"id": "3873a668-7e42-4c35-84a5-9cc4a91d9bd7",
"metadata": {},
"source": [
"## Important\n",
"The chunk specification in the `open_dataset()` call does not reconfigure the data itself. It governs how the data driver formulates its requests to the server. The chunking information specified in the open dataset call helps the driver establish the boundaries for its queries. It will then request the data, a chunk at a time, from the server. How the server holds that data is hidden from us, and we really don't need to care. The data driver on our end (from within `xarray.open_dataset()`) does the necessary work to ensure data alignment and that the block/chunk sizes will align to what we want. \n"
]
},
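{
"cell_type": "markdown",
"id": "e4f5a6b7-c8d9-4ea0-9fb1-36a7b8c9d01b",
"metadata": {},
"source": [
"To see the chunk layout the driver will use on our end, we can inspect the dask-backed variable:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f5a6b7c8-d9ea-4fb0-80c1-47b8c9d0e12c",
"metadata": {},
"outputs": [],
"source": [
"# Block sizes per dimension that xarray/dask will use when requesting data.\n",
"ds_in['precipitation_amount'].chunks"
]
},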
{
"cell_type": "code",
"execution_count": null,
"id": "cc397e76-c4fb-48db-80c7-3064429ad0ca",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}