Typo fixes
kjdoore committed Jan 7, 2025
1 parent c1c6af0 commit 8d2c896
Showing 9 changed files with 73 additions and 74 deletions.
14 changes: 7 additions & 7 deletions 101/BasicsShapeSize.ipynb
@@ -32,11 +32,11 @@
"## Accessing the Example Dataset\n",
"\n",
"In this notebook, we will use the monthly PRISM v2 dataset as an example for understanding the effects of chunk shape and size.\n",
"Let's go ahead and read in the file using `xarray`.\n",
"To do this, we will use [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/) to get a mapper to the `zarr` file the HyTEST OSN.\n",
"Let's go ahead and read in the file using xarray.\n",
"To do this, we will use [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) to get a mapper to the Zarr file the HyTEST OSN.\n",
"\n",
"```{note}\n",
"The `xarray` loader is \"lazy\", meaning it will read just enough of the data to make decisions about its shape, structure, etc.\n",
"The xarray loader is \"lazy\", meaning it will read just enough of the data to make decisions about its shape, structure, etc.\n",
"It will pretend like the whole dataset is in memory (and we can treat it that way), but it will only load data as required.\n",
"```"
]
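The cell above opens the PRISM store lazily through an fsspec mapper. A minimal sketch of that access pattern, with a placeholder endpoint URL and store path standing in for the actual HyTEST OSN locations:

```python
# Hedged sketch of the access pattern described above; the endpoint URL and object
# path are placeholders, not the real HyTEST OSN locations.
import fsspec
import xarray as xr

fs = fsspec.filesystem(
    "s3",
    anon=True,  # anonymous read, no credentials required
    client_kwargs={"endpoint_url": "https://example-osn-endpoint.org"},  # hypothetical endpoint
)
mapper = fs.get_mapper("example-bucket/prism/prism_v2.zarr")  # hypothetical store path

# "Lazy" open: metadata is read now, data values only when actually accessed
ds = xr.open_dataset(mapper, engine="zarr")
print(ds)
```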
@@ -177,7 +177,7 @@
"Having `9.06` longitude chunks means we will have 10 chunks in practice, but that last one is not full-sized.\n",
"In this case, this means that the last chunk in the given dimension will be extremely thin. \n",
"\n",
"In the case of the latitude chunks, the extra `0.006` of a chunk means that the last, fractional chunk is only one `lat` observation.\n",
"In the case of the latitude chunks, the extra `0.006` of a chunk means that the last, fractional chunk (or [\"**partial chunk**\"](../back/Glossary.md#term-Partial-Chunk)) is only one `lat` observation.\n",
"(This occurred as `//` is floor division and `lat` does not have a number of elements divisible by 4.)\n",
"This all but guarantees that two chunks are needed for a small spatial extent near the \"end\" of the `lat` dimension.\n",
"\n",
@@ -302,10 +302,10 @@
"Here are some constraints: \n",
"\n",
"* Files Too Big:\n",
" In a `zarr` dataset, each chunk is stored as a separate binary file.\n",
" In a Zarr dataset, each chunk is stored as a separate binary file.\n",
" If we need data from a particular chunk, no matter how little or how much, that file gets opened, decompressed, and the whole thing read into memory.\n",
" A large chunk size means that there may be a lot of data transferred in situations when only a small subset of that chunk's data is actually needed.\n",
" It also means there might not be enough chunks to allow the `dask` workers to stay busy loading data in parallel.\n",
" It also means there might not be enough chunks to allow the dask workers to stay busy loading data in parallel.\n",
"\n",
"* Files Too Small:\n",
" If the chunk size is too small, the time it takes to read and decompress the data for each chunk can become comparable to the latency of S3 (typically 10-100ms).\n",
@@ -399,7 +399,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.0"
"version": "3.12.0"
}
},
"nbformat": 4,
26 changes: 13 additions & 13 deletions 101/ExamineDataChunking.ipynb
@@ -32,9 +32,9 @@
"source": [
"## Accessing the Dataset\n",
"\n",
"Before we can open at the dataset, we must first get a mapper that will easily allow for `xarray` to open the dataset.\n",
"To do this, we will use [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/) to perform an anonymous read from an endpoints outside of S3, using the S3 API (i.e., the HyTEST OSN).\n",
"This requires us to set up an S3 file system and feed it the endpoint url.\n",
"Before we can open at the dataset, we must first get a mapper that will easily allow for [xarray](https://docs.xarray.dev/en/stable/index.html) to open the dataset.\n",
"To do this, we will use [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) to perform an anonymous read from an endpoints outside of S3, using the S3 API (i.e., the HyTEST OSN).\n",
"This requires us to set up an S3 file system and feed it the endpoint URL.\n",
"We can then point the the file system to our dataset (in this case the PRISM V2 Zarr store) and get a mapper to the file."
]
},
@@ -59,7 +59,7 @@
"Now that we have our file mapper, we can open the dataset using [`xarray.open_dataset()`](https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html) with `zarr` specified as our engine.\n",
"\n",
"```{note}\n",
"The `xarray` loader is \"lazy\", meaning it will read just enough of the data to make decisions about its shape, structure, etc.\n",
"The xarray loader is \"lazy\", meaning it will read just enough of the data to make decisions about its shape, structure, etc.\n",
"It will pretend like the whole dataset is in memory (and we can treat it that way), but it will only load data as required.\n",
"```"
]
@@ -79,7 +79,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The \"rich\" HTML output to show the `xarray.Dataset` includes a lot of information, some of which is hidden behind toggles.\n",
"The \"rich\" HTML output to show the [`xarray.Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html) includes a lot of information, some of which is hidden behind toggles.\n",
"Click on the icons to the right to expand and see all the metadata available for the dataset.\n",
"The page icon will display attributes attached to the data, while the database icon will display information about the dataset.\n",
"\n",
@@ -93,7 +93,7 @@
" In this dataset, a coordinate can be used to pick out a particular cell of the array.\n",
" Asking for cells where say `lat=49.9` is possible because these coordinates map the meaningful values of latitude to the behind-the-scenes cell index needed to fetch the value. \n",
"- **Data Variables**: The variables are `tmx`, `ppt`, and `tmn`, which are associated with three indices by which data values are located in space and time (the \"Dimensions\"). \n",
"- **Indexes**: This is an internal data structure to help `xarray` quickly find items in the array.\n",
"- **Indexes**: This is an internal data structure to help xarray quickly find items in the array.\n",
"- **Attributes**: Arbitrary metadata that has been given to the dataset. \n",
"\n",
"Let's look at one of the data variables to learn more about it. "
@@ -103,7 +103,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Variable = `xarray.DataArray`\n",
"### Variable = [`xarray.DataArray`](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.html#xarray.DataArray)\n",
"\n",
"Each data variable is its own N-dimensional array (in this case, 3-dimensional, indexed by `lat`, `lon`, and `time`).\n",
"We can look at the individual variables by examining its array separately from the dataset: "
@@ -154,7 +154,7 @@
"Therefore, we need to directly access the data in a way that returns the true chunk shape of the stored dataset.\n",
"\n",
"To do this, we can simply check the variables \"encoding\".\n",
"This returns metadata that was used by `xarray` when reading the data."
"This returns metadata that was used by xarray when reading the data."
]
},
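A minimal sketch of that encoding check, continuing with the hypothetical `mapper` from the first sketch and the `tmn` variable named earlier; the exact keys present can vary by backend:

```python
# Inspect how the variable is chunked on disk in the Zarr store (sketch, not notebook output).
import xarray as xr

ds = xr.open_dataset(mapper, engine="zarr")        # lazy open, as before
print(ds["tmn"].encoding.get("chunks"))            # stored chunk shape, as a tuple
print(ds["tmn"].encoding.get("preferred_chunks"))  # same information keyed by dimension name
```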
{
@@ -184,10 +184,10 @@
"## Getting the Chunking When Reading Data\n",
"\n",
"While checking the \"encoding\" of the variable can tell you what the dataset's stored chunk shape is, it is typically easier to do this in one step when you open the dataset.\n",
"To do this, all we need is to add a another keyword when we open the dataset with `xarray`: `chunks={}`.\n",
"To do this, all we need is to add a another keyword when we open the dataset with xarray: `chunks={}`.\n",
"As per the [`xarray.open_dataset` documentation](https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html): \n",
"\n",
"> `chunks={}` loads the data with `dask` using the engine’s preferred chunk size, generally identical to the format’s chunk size.\n",
"> `chunks={}` loads the data with dask using the engine’s preferred chunk size, generally identical to the format’s chunk size.\n",
"\n",
"In other words, using `chunks={}` will load the data with chunk shape equal to `'preferred_chunks'`.\n",
"Let's check this out and see how our data looks when we include this keyword when opening."
@@ -237,12 +237,12 @@
"## Changing the Chunk Shape and Size\n",
"\n",
"Now that we know our stored chunk shape and size or how to find them, they may not always be the optimal choice for performing analysis.\n",
"For example, [`zarr` recommends a stored chunk size of at least 1 MB uncompressed](https://zarr.readthedocs.io/en/stable/tutorial.html#chunk-size-and-shape) as they give better performance.\n",
"However, [`dask` recommends chunk sizes between 10 MB and 1 GB for computations](https://docs.dask.org/en/stable/array-chunks.html#specifying-chunk-shapes), depending on the availability of RAM and the duration of computations.\n",
"For example, [Zarr recommends a stored chunk size of at least 1 MB uncompressed](https://zarr.readthedocs.io/en/stable/tutorial.html#chunk-size-and-shape) as they give better performance.\n",
"However, [dask recommends chunk sizes between 10 MB and 1 GB for computations](https://docs.dask.org/en/stable/array-chunks.html#specifying-chunk-shapes), depending on the availability of RAM and the duration of computations.\n",
"Therefore, our stored chunk size may not be large enough for optimal computations.\n",
"Thankfully, stored chunks do not need to be the same size as those we use for our computations.\n",
"In other words, we can group multiple smaller stored chunks together when performing our computations.\n",
"`xarray` makes this easy by allowing us to adjust the chunk shape and size, either as we load the data or after.\n",
"Xarray makes this easy by allowing us to adjust the chunk shape and size, either as we load the data or after.\n",
"\n",
"Let's show how this works by increasing our chunks of the minimum monthly temperature to a size of ~500 MiB.\n",
"To do so when reading in the data, all we need to do is actually specify the chunk shape to `chunks`.\n",
38 changes: 19 additions & 19 deletions 101/Rechunking.ipynb
@@ -28,11 +28,11 @@
"The goal of this notebook is to learn how to \"[**rechunk**](../back/Glossary.md#term-Rechunking)\" data.\n",
"This will be a culmination of all the [previous introductory material](index.md) where we will:\n",
"\n",
"1. [Read in a `zarr` store](ExamineDataChunking.ipynb)\n",
"1. [Read in a Zarr store](ExamineDataChunking.ipynb)\n",
"2. [Check the current chunking](ExamineDataChunking.ipynb)\n",
"3. [Choose a new chunk shape](BasicsShapeSize.ipynb)\n",
"4. Rechunk using [`Rechunker`](https://rechunker.readthedocs.io/en/latest/index.html)\n",
"5. [Confirm the proper creation of the `zarr` store by `Rechunker`](WriteChunkedFiles.ipynb)"
"4. Rechunk using [Rechunker](https://rechunker.readthedocs.io/en/latest/index.html)\n",
"5. [Confirm the proper creation of the Zarr store by Rechunker](WriteChunkedFiles.ipynb)"
]
},
{
@@ -61,7 +61,7 @@
"For the dataset in this tutorial, we will use the data from the National Water Model Reanalysis Version 2.1.\n",
"The full dataset is part of the [AWS Open Data Program](https://aws.amazon.com/opendata/), available via the S3 bucket at: `s3://noaa-nwm-retro-v2-zarr-pds/`.\n",
"\n",
"As this is a `zarr` store, we can easily read it in directly with [`xarray.open_dataset()`](https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html), including the keyword `chunks={}` to make sure it loads the data with `dask` using the stored chunks' shape and size."
"As this is a Zarr store, we can easily read it in directly with [`xarray.open_dataset()`](https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html), including the keyword `chunks={}` to make sure it loads the data with dask using the stored chunks' shape and size."
]
},
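A sketch of that direct read from the public bucket, assuming anonymous S3 access is sufficient:

```python
# Open the NWM Reanalysis Zarr store named above with its stored chunking.
import fsspec
import xarray as xr

nwm = xr.open_dataset(
    fsspec.get_mapper("s3://noaa-nwm-retro-v2-zarr-pds/", anon=True),
    engine="zarr",
    chunks={},   # keep the stored chunk shapes for the dask arrays
)
print(nwm[["streamflow", "velocity"]])
```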
{
@@ -211,7 +211,7 @@
"id": "3234b824-aa75-402f-9976-0b9d9f90e821",
"metadata": {},
"source": [
"## Rechunk with `Rechunker`\n",
"## Rechunk with [Rechunker](https://rechunker.readthedocs.io/en/latest/index.html)\n",
"\n",
"This is a relatively trivial example, due to the smaller size of the subset of the dataset.\n",
"As the whole subset can fit into memory easily, chunking in general is largely unnecesary in terms of optimizing I/O (however, parallelism is still a consideration). \n",
@@ -246,9 +246,9 @@
"id": "bfbdfce8-992d-4412-ad19-cdad10dc379c",
"metadata": {},
"source": [
"With this plan, we can ask [`rechunker`](https://rechunker.readthedocs.io/en/latest/index.html) to re-write the data using the prescribed chunking pattern.\n",
"`Rechunker` will take the currently read in data and rechunk it using an intermediate `zarr` store for efficiency.\n",
"The result will be our rechunked data saved to a new `zarr` store."
"With this plan, we can ask [Rechunker](https://rechunker.readthedocs.io/en/latest/index.html) to re-write the data using the prescribed chunking pattern.\n",
"Rechunker will take the currently read in data and rechunk it using an intermediate Zarr store for efficiency.\n",
"The result will be our rechunked data saved to a new Zarr store."
]
},
{
@@ -278,18 +278,18 @@
"Oh, that is not what we wanted!\n",
"We seem to have gotten an error indicating overlap in chunks between the read and write.\n",
"Looking at the error, it is saying that the first `time` chunk we are reading is a partial chunk and not a full chunk.\n",
"So, when `Rechunker` tries to read the data and then write the first rechunk, it is having to read two chunks to write to the one chunk.\n",
"This is a one-to-many write, which can corrupt our file when done in parallel with `dask`.\n",
"Thank goodness `Rechunker` caught this for us!\n",
"So, when Rechunker tries to read the data and then write the first rechunk, it is having to read two chunks to write to the one chunk.\n",
"This is a one-to-many write, which can corrupt our file when done in parallel with dask.\n",
"Thank goodness Rechunker caught this for us!\n",
"Reading the recommended fix, it seems the only way to go about this is to call `chunk()` and reset the chunking on the original data.\n",
"In other words, after we select the subset from the dataset, we need to realign the chunks such that the first chunk is not a partial chunk.\n",
"This is simple enough to do.\n",
"So much so, we can just do it when passing the dataset subset to `Rechunker`.\n",
"So much so, we can just do it when passing the dataset subset to Rechunker.\n",
"\n",
"```{note}\n",
"`rechunker.rechunk` does not overwrite any data.\n",
"If it sees that `rechunked_nwm.zarr` or `/tmp/scratch.zarr` already exist, it will raise an exception.\n",
"Be sure that these locations do not exist before calling `Rechunker`. \n",
"Be sure that these locations do not exist before calling Rechunker. \n",
"```"
]
},
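A hedged sketch of the corrected Rechunker call, folding in the `chunk()` realignment described above. The subset, target chunk plan, realignment size, and memory limit are illustrative assumptions; only the two store paths come from the note:

```python
# Rechunk an xarray subset with Rechunker (sketch; names and sizes are assumptions).
import rechunker

nwm_subset = nwm[["streamflow", "velocity"]].isel(feature_id=slice(0, 10_000))  # hypothetical subset

target_chunks = {
    "streamflow": {"time": 20_000, "feature_id": 1_000},  # hypothetical chunk plan
    "velocity": {"time": 20_000, "feature_id": 1_000},
    "time": None,          # leave coordinates as single chunks
    "feature_id": None,
}

result = rechunker.rechunk(
    nwm_subset.chunk({"time": 672}),   # realign so the first read chunk is not partial (size is illustrative)
    target_chunks,
    "2GB",                             # max memory per task
    "rechunked_nwm.zarr",              # target store (named in the note above)
    temp_store="/tmp/scratch.zarr",    # intermediate store (named in the note above)
)
result   # a Rechunked object; nothing is computed or written until execute() is called
```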
@@ -322,11 +322,11 @@
"Alright, that worked with no problems!\n",
"Now, we must specifically direct rechunk to calculate.\n",
"To do this, we can call `execute()` on our `result` `Rechunked` object.\n",
"Without the call to `execute()`, the `zarr` dataset will be empty, and `result` will only hold a 'task graph' outlining the calculation steps.\n",
"Without the call to `execute()`, the Zarr dataset will be empty, and `result` will only hold a 'task graph' outlining the calculation steps.\n",
"\n",
"```{tip}\n",
"The `rechunker` also writes a minimalist data group, meaning that variable metadata is not consolidated.\n",
"This is not a required step, but it will really spead up future workflows when the data is read back in using `xarray`.\n",
"Rechunker also writes a minimalist data group, meaning that variable metadata is not consolidated.\n",
"This is not a required step, but it will really spead up future workflows when the data is read back in using xarray.\n",
"```"
]
},
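A sketch of those two final steps:

```python
# Execute the rechunk plan, then consolidate metadata so later opens are faster.
import zarr

result.execute()                                  # runs the dask graph and writes the target store
zarr.consolidate_metadata("rechunked_nwm.zarr")   # optional, but speeds up reading back with xarray
```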
@@ -393,14 +393,14 @@
"source": [
"Perfect!\n",
"The maximum absolute difference between each both the `streamflow` and `velocity` variables is 0.\n",
"In other words, they are exactly the same, and `Rechunker` worked as expect.\n",
"In other words, they are exactly the same, and Rechunker worked as expect.\n",
"\n",
"Now that you know how to rechunk a `zarr` store using `Rechunker`, you should know all of the basics there are in terms of chunking.\n",
"Now that you know how to rechunk a Zarr store using Rechunker, you should know all of the basics there are in terms of chunking.\n",
"You are now ready to explore more [advanced chunking topics in chunking](../201/index.md) if you are interested!\n",
"\n",
"## Clean Up\n",
"\n",
"As we don't want to keep this rechunked `zarr` on our local machine, let's go ahead and delete it."
"As we don't want to keep this rechunked Zarr on our local machine, let's go ahead and delete it."
]
},
{
