Merge pull request #9 from hytest-org/gt-00-gridmet
Expands 101, includes gridmet example
Showing 6 changed files with 743 additions and 272 deletions.
@@ -0,0 +1,270 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "2979f859-1640-4633-b52a-b10cb3482fd4",
"metadata": {},
"source": [
"# Another Example\n",
"\n",
"Taking what we learned reading OpenDAP data, let's do another example, end-to-end, of\n", | ||
"reading a dataset from a tredds server. The `gridmet` data for precipitation will be\n", | ||
"our new example dataset: \n" | ||
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "00d2ca69-26f6-4b2a-bc5c-b11e0bb7f23f",
"metadata": {},
"outputs": [],
"source": [
"# Example Dataset: gridmet\n",
"DATA_url = r\"http://thredds.northwestknowledge.net:8080/thredds/dodsC/agg_met_pr_1979_CurrentYear_CONUS.nc\"\n"
]
},
{
"cell_type": "markdown",
"id": "12c9573c-f965-458c-88a8-5376ded065da",
"metadata": {},
"source": [
"## Preamble\n",
"Stuff we need..." | ||
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "19421b21-eb10-4d4d-b8a6-8a28de456a6a",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import logging\n",
"import xarray as xr\n",
"\n",
"logging.basicConfig(level=logging.INFO, force=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ddb8b2fa-d386-4cae-b8bb-e820bed27d65",
"metadata": {
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"%run ../utils.ipynb\n",
"_versions(['xarray', 'dask'])"
]
},
{
"cell_type": "markdown",
"id": "c1e5f8ac-6f5b-43a6-a7a7-4006ceda54cb",
"metadata": {},
"source": [
"## Examine Data\n",
"Lazy-load the data using defaults to see how the overall structure looks:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9ebc6a80-a610-47b9-89d0-a26e8c469f44",
"metadata": {},
"outputs": [],
"source": [
"# lazy-load\n", | ||
"ds_in = xr.open_dataset(DATA_url + r\"#fillmismatch\", decode_coords=True, chunks={})\n", | ||
"# and show it:\n", | ||
"ds_in" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "abead8bb-a65e-4af6-90a0-165996f12ce1", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"ds_in.precipitation_amount" | ||
] | ||
}, | ||
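{
"cell_type": "markdown",
"id": "sketch-total-size-md",
"metadata": {},
"source": [
"As a quick check on the size reported in the repr above, we can ask xarray for the variable's total footprint directly. A minimal sketch (`nbytes` reports the in-memory size of the full array, not what has actually been transferred):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-total-size-code",
"metadata": {},
"outputs": [],
"source": [
"# Total in-memory size of the precipitation variable, in GiB\n",
"print(f\"{ds_in['precipitation_amount'].nbytes / 2**30:.1f} GiB\")"
]
},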
{
"cell_type": "markdown",
"id": "065a21ed-4230-4338-b64c-d48438c3c0ed",
"metadata": {},
"source": [
"The data is being presented to us from the server as if it is one big chunk. This is almost certainly not how it is stored on the server end. And more importantly, that 48GB chunk is too big for the server to provide all at once. Typically, data requests are capped at 500MB. \n",
"\n",
"But because we did not specify a chunk pattern, we get the illusion that it is one big chunk, and it is up to the server and the client (inside the `open_dataset()` method) to negotiate the transfer. \n",
"\n",
"A hint as to the way the server thinks of this data (absent chunking directives) is the `_ChunkSizes` attribute: `[61 98 231]` for `(day, lat, lon)`. Using that chunking pattern, the data is sized like so: "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "760ea923-404b-4ed1-9a49-be8e4549eabe",
"metadata": {},
"outputs": [],
"source": [
"# day #lat #lon #float32\n", | ||
"bytes = 61 * 98 * 231 * 4\n", | ||
"kbytes = bytes / (2**10)\n", | ||
"mbytes = kbytes / (2**10)\n", | ||
"print(f\"TMN chunk size: {bytes=} ({kbytes=:.2f})({mbytes=:.4f})\")" | ||
]
},
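{
"cell_type": "markdown",
"id": "sketch-chunk-hint-md",
"metadata": {},
"source": [
"If you'd rather not read that hint off the repr, it can be pulled out programmatically. A minimal sketch (depending on the backend, the hint may surface as a `_ChunkSizes` attribute or as `chunksizes` in the variable's encoding):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-chunk-hint-code",
"metadata": {},
"outputs": [],
"source": [
"var = ds_in['precipitation_amount']\n",
"# Look in .attrs first, then fall back to .encoding\n",
"hint = var.attrs.get('_ChunkSizes', var.encoding.get('chunksizes'))\n",
"print('Server chunk hint (day, lat, lon):', hint)"
]
},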
{
"cell_type": "markdown",
"id": "a1469889-2267-4301-b561-4fdbae4dba8a",
"metadata": {},
"source": [
"## Establish Chunking Preference\n",
"Will proceed with the assumption that this data will most likely be taken at full extent, for short time intervals. \n", | ||
"\n", | ||
"Examining each of the dimensions of this dataset: " | ||
] | ||
}, | ||
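{
"cell_type": "markdown",
"id": "sketch-dim-sizes-md",
"metadata": {},
"source": [
"Rather than reading the dimension lengths off the repr, we can take them straight from the dataset; a quick sketch (the hard-coded numbers used below -- 16169 days, 585 lats, 1386 lons -- come from this mapping):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-dim-sizes-code",
"metadata": {},
"outputs": [],
"source": [
"# Dimension lengths as a plain dict (expect 'day', 'lat', and 'lon' entries)\n",
"dict(ds_in.sizes)"
]
},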
{
"cell_type": "code",
"execution_count": null,
"id": "f23f2317-e504-4882-b936-da482f2864ba",
"metadata": {},
"outputs": [],
"source": [
"day = 16169/365 #how many chunks for a year-at-a-time\n", | ||
"day" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "7a60542a-e5e8-448b-b25c-db04947f42d7", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"lat = 585 / 3 # split into 3 chunks\n", | ||
"lat" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "868e11d0-a3a5-4ac8-9b7f-043e7e29be3f", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"lon = 1386 /3 # split into 3 chunks\n", | ||
"lon" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "da090e6f-4c11-439c-86f7-3246431b3a7b", | ||
"metadata": {}, | ||
"source": [ | ||
"If we chunk with this pattern, how big will each chunk be? " | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "2e1be74e-2bcc-48cb-becc-ea10e40cf3ff", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# day #lat #lon #float32\n", | ||
"bytes = 365 * 195 * 462 * 4\n", | ||
"kbytes = bytes / (2**10)\n", | ||
"mbytes = kbytes / (2**10)\n", | ||
"print(f\"Chunk size: {bytes=} ({kbytes=:.2f})({mbytes=:.4f})\")" | ||
]
},
{
"cell_type": "markdown",
"id": "99737679-82da-4ca4-b4f7-f2fe97411553",
"metadata": {},
"source": [
" 125MB chunk seems reasonable, but it does mean that a time-series read pattern will have to take in 45 chunks (assuming the spatial extent of the analysis is within one lat/lon chunk). To bring that down, let's take the time dimension as 2 years at a time. Just to make the numbers more round, we'll express the time as 24 30-day months: " | ||
]
},
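{
"cell_type": "markdown",
"id": "sketch-chunk-plan-md",
"metadata": {},
"source": [
"To compare the two candidate plans side by side, here is a small hypothetical helper (`chunk_plan_stats` is not part of the original notebook) that reports the size of one chunk in MiB and the number of chunks along each dimension, using the dimension lengths above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-chunk-plan-code",
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"def chunk_plan_stats(shape, chunks, itemsize=4):\n",
"    \"\"\"Size of one chunk (MiB) and chunk count per dimension, for a float32 variable.\"\"\"\n",
"    mib = math.prod(chunks.values()) * itemsize / 2**20\n",
"    counts = {dim: math.ceil(shape[dim] / chunks[dim]) for dim in shape}\n",
"    return round(mib, 1), counts\n",
"\n",
"shape = {'day': 16169, 'lat': 585, 'lon': 1386}\n",
"print(chunk_plan_stats(shape, {'day': 365, 'lat': 195, 'lon': 462}))    # one-year time chunks -> 45 along 'day'\n",
"print(chunk_plan_stats(shape, {'day': 24*30, 'lat': 195, 'lon': 462}))  # two-year time chunks -> 23 along 'day'"
]
},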
{
"cell_type": "code",
"execution_count": null,
"id": "e0df8131-d350-48bb-b6c0-932486d2a13a",
"metadata": {},
"outputs": [],
"source": [
"ds_in = xr.open_dataset(\n",
" DATA_url + r\"#fillmismatch\", \n",
" decode_coords=True, \n",
" chunks={'day': 24*30, 'lon': 462, 'lat': 195}\n",
")\n",
"ds_in"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e49d852b-5f98-408f-b24b-708c6b1227aa",
"metadata": {},
"outputs": [],
"source": [
"ds_in.precipitation_amount"
]
},
{
"cell_type": "markdown",
"id": "94c55815-6cd2-4ff3-a48c-97d61739c4cc",
"metadata": {},
"source": [
"This chunk pattern favors the spatial extent. Six chunks are needed to read the entire spatial extent for one time step. \n", | ||
"\n", | ||
"The time data is chunked by two year blocks (assuming alignment with year boundaries, which is almost certainly not true). It would be more accurate to say that the time is in two-year-sized chunks, but may not align with the calendar year. An entire time-series for a small spatial extent will require 23 chunks to be read." | ||
]
},
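{
"cell_type": "markdown",
"id": "sketch-chunk-count-md",
"metadata": {},
"source": [
"We can confirm those counts directly from the dask-backed array; a quick sketch (`.chunks` on a chunked DataArray is a tuple of block lengths per dimension):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-chunk-count-code",
"metadata": {},
"outputs": [],
"source": [
"da = ds_in['precipitation_amount']\n",
"# Number of chunks along each dimension: expect 23 along 'day', 3 along 'lat' and 'lon'\n",
"{dim: len(blocks) for dim, blocks in zip(da.dims, da.chunks)}"
]
},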
{
"cell_type": "markdown",
"id": "3873a668-7e42-4c35-84a5-9cc4a91d9bd7",
"metadata": {},
"source": [
"## Important\n",
"The chunk specification in the `open_dataset()` call does not reconfigure the data itself. It governs how the data driver formulates its requests to the server. The chunking information specified in the open dataset call helps the driver establish the boundaries for its queries. It will then request the data, a chunk at a time, from the server. How the server holds that data is hidden from us, and we really don't need to care. The data driver on our end (from within `xarray.open_dataset()`) does the necessary work to ensure data alignment and that the block/chunk sizes will align to what we want. \n" | ||
]
},
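{
"cell_type": "markdown",
"id": "sketch-lazy-read-md",
"metadata": {},
"source": [
"A short sketch of what that means in practice: selecting a small window only defines work over a handful of dask chunks, and nothing is transferred from the server until we explicitly compute or load the result."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-lazy-read-code",
"metadata": {},
"outputs": [],
"source": [
"# Still lazy: this just records which chunks would be needed\n",
"subset = ds_in['precipitation_amount'].isel(day=slice(0, 10), lat=slice(0, 10), lon=slice(0, 10))\n",
"print('dask chunks in this subset:', subset.data.npartitions)\n",
"# subset.load()  # uncomment to actually request the data from the server"
]
},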
{
"cell_type": "code",
"execution_count": null,
"id": "cc397e76-c4fb-48db-80c7-3064429ad0ca",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}