diff --git a/101/BasicsShapeSize.ipynb b/101/BasicsShapeSize.ipynb index eb9cc44..a4a6269 100644 --- a/101/BasicsShapeSize.ipynb +++ b/101/BasicsShapeSize.ipynb @@ -32,11 +32,11 @@ "## Accessing the Example Dataset\n", "\n", "In this notebook, we will use the monthly PRISM v2 dataset as an example for understanding the effects of chunk shape and size.\n", - "Let's go ahead and read in the file using `xarray`.\n", - "To do this, we will use [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/) to get a mapper to the `zarr` file the HyTEST OSN.\n", + "Let's go ahead and read in the file using xarray.\n", + "To do this, we will use [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) to get a mapper to the Zarr file on the HyTEST OSN.\n", "\n", "```{note}\n", - "The `xarray` loader is \"lazy\", meaning it will read just enough of the data to make decisions about its shape, structure, etc.\n", + "The xarray loader is \"lazy\", meaning it will read just enough of the data to make decisions about its shape, structure, etc.\n", "It will pretend like the whole dataset is in memory (and we can treat it that way), but it will only load data as required.\n", "```" ] @@ -177,7 +177,7 @@ "Having `9.06` longitude chunks means we will have 10 chunks in practice, but that last one is not full-sized.\n", "In this case, this means that the last chunk in the given dimension will be extremely thin. \n", "\n", - "In the case of the latitude chunks, the extra `0.006` of a chunk means that the last, fractional chunk is only one `lat` observation.\n", + "In the case of the latitude chunks, the extra `0.006` of a chunk means that the last, fractional chunk (or [\"**partial chunk**\"](../back/Glossary.md#term-Partial-Chunk)) is only one `lat` observation.\n", "(This occurred as `//` is floor division and `lat` does not have a number of elements divisible by 4.)\n", "This all but guarantees that two chunks are needed for a small spatial extent near the \"end\" of the `lat` dimension.\n", "\n", @@ -302,10 +302,10 @@ "Here are some constraints: \n", "\n", "* Files Too Big:\n", - " In a `zarr` dataset, each chunk is stored as a separate binary file.\n", + " In a Zarr dataset, each chunk is stored as a separate binary file.\n", " If we need data from a particular chunk, no matter how little or how much, that file gets opened, decompressed, and the whole thing read into memory.\n", " A large chunk size means that there may be a lot of data transferred in situations when only a small subset of that chunk's data is actually needed.\n", - " It also means there might not be enough chunks to allow the `dask` workers to stay busy loading data in parallel.\n", + " It also means there might not be enough chunks to allow the dask workers to stay busy loading data in parallel.\n", "\n", "* Files Too Small:\n", " If the chunk size is too small, the time it takes to read and decompress the data for each chunk can become comparable to the latency of S3 (typically 10-100ms).\n", @@ -399,7 +399,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.13.0" + "version": "3.12.0" } }, "nbformat": 4, diff --git a/101/ExamineDataChunking.ipynb b/101/ExamineDataChunking.ipynb index 19ef43d..a71b497 100644 --- a/101/ExamineDataChunking.ipynb +++ b/101/ExamineDataChunking.ipynb @@ -32,9 +32,9 @@ "source": [ "## Accessing the Dataset\n", "\n", - "Before we can open at the dataset, we must first get a mapper that will easily allow for `xarray` to open the dataset.\n", - "To do this, we 
will use [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/) to perform an anonymous read from an endpoints outside of S3, using the S3 API (i.e., the HyTEST OSN).\n", - "This requires us to set up an S3 file system and feed it the endpoint url.\n", + "Before we can open the dataset, we must first get a mapper that will easily allow for [xarray](https://docs.xarray.dev/en/stable/index.html) to open the dataset.\n", + "To do this, we will use [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) to perform an anonymous read from an endpoint outside of S3, using the S3 API (i.e., the HyTEST OSN).\n", + "This requires us to set up an S3 file system and feed it the endpoint URL.\n", "We can then point the the file system to our dataset (in this case the PRISM V2 Zarr store) and get a mapper to the file." ] }, @@ -59,7 +59,7 @@ "Now that we have our file mapper, we can open the dataset using [`xarray.open_dataset()`](https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html) with `zarr` specified as our engine.\n", "\n", "```{note}\n", - "The `xarray` loader is \"lazy\", meaning it will read just enough of the data to make decisions about its shape, structure, etc.\n", + "The xarray loader is \"lazy\", meaning it will read just enough of the data to make decisions about its shape, structure, etc.\n", "It will pretend like the whole dataset is in memory (and we can treat it that way), but it will only load data as required.\n", "```" ] @@ -79,7 +79,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The \"rich\" HTML output to show the `xarray.Dataset` includes a lot of information, some of which is hidden behind toggles.\n", + "The \"rich\" HTML output to show the [`xarray.Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html) includes a lot of information, some of which is hidden behind toggles.\n", "Click on the icons to the right to expand and see all the metadata available for the dataset.\n", "The page icon will display attributes attached to the data, while the database icon will display information about the dataset.\n", "\n", @@ -93,7 +93,7 @@ " In this dataset, a coordinate can be used to pick out a particular cell of the array.\n", " Asking for cells where say `lat=49.9` is possible because these coordinates map the meaningful values of latitude to the behind-the-scenes cell index needed to fetch the value. \n", "- **Data Variables**: The variables are `tmx`, `ppt`, and `tmn`, which are associated with three indices by which data values are located in space and time (the \"Dimensions\"). \n", - "- **Indexes**: This is an internal data structure to help `xarray` quickly find items in the array.\n", + "- **Indexes**: This is an internal data structure to help xarray quickly find items in the array.\n", "- **Attributes**: Arbitrary metadata that has been given to the dataset. \n", "\n", "Let's look at one of the data variables to learn more about it. 
" @@ -103,7 +103,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Variable = `xarray.DataArray`\n", + "### Variable = [`xarray.DataArray`](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.html#xarray.DataArray)\n", "\n", "Each data variable is its own N-dimensional array (in this case, 3-dimensional, indexed by `lat`, `lon`, and `time`).\n", "We can look at the individual variables by examining its array separately from the dataset: " @@ -154,7 +154,7 @@ "Therefore, we need to directly access the data in a way that returns the true chunk shape of the stored dataset.\n", "\n", "To do this, we can simply check the variables \"encoding\".\n", - "This returns metadata that was used by `xarray` when reading the data." + "This returns metadata that was used by xarray when reading the data." ] }, { @@ -184,10 +184,10 @@ "## Getting the Chunking When Reading Data\n", "\n", "While checking the \"encoding\" of the variable can tell you what the dataset's stored chunk shape is, it is typically easier to do this in one step when you open the dataset.\n", - "To do this, all we need is to add a another keyword when we open the dataset with `xarray`: `chunks={}`.\n", + "To do this, all we need is to add a another keyword when we open the dataset with xarray: `chunks={}`.\n", "As per the [`xarray.open_dataset` documentation](https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html): \n", "\n", - "> `chunks={}` loads the data with `dask` using the engine’s preferred chunk size, generally identical to the format’s chunk size.\n", + "> `chunks={}` loads the data with dask using the engine’s preferred chunk size, generally identical to the format’s chunk size.\n", "\n", "In other words, using `chunks={}` will load the data with chunk shape equal to `'preferred_chunks'`.\n", "Let's check this out and see how our data looks when we include this keyword when opening." 
@@ -237,12 +237,12 @@ "## Changing the Chunk Shape and Size\n", "\n", "Now that we know our stored chunk shape and size or how to find them, they may not always be the optimal choice for performing analysis.\n", - "For example, [`zarr` recommends a stored chunk size of at least 1 MB uncompressed](https://zarr.readthedocs.io/en/stable/tutorial.html#chunk-size-and-shape) as they give better performance.\n", - "However, [`dask` recommends chunk sizes between 10 MB and 1 GB for computations](https://docs.dask.org/en/stable/array-chunks.html#specifying-chunk-shapes), depending on the availability of RAM and the duration of computations.\n", + "For example, [Zarr recommends a stored chunk size of at least 1 MB uncompressed](https://zarr.readthedocs.io/en/stable/tutorial.html#chunk-size-and-shape) as they give better performance.\n", + "However, [dask recommends chunk sizes between 10 MB and 1 GB for computations](https://docs.dask.org/en/stable/array-chunks.html#specifying-chunk-shapes), depending on the availability of RAM and the duration of computations.\n", "Therefore, our stored chunk size may not be large enough for optimal computations.\n", "Thankfully, stored chunks do not need to be the same size as those we use for our computations.\n", "In other words, we can group multiple smaller stored chunks together when performing our computations.\n", - "`xarray` makes this easy by allowing us to adjust the chunk shape and size, either as we load the data or after.\n", + "Xarray makes this easy by allowing us to adjust the chunk shape and size, either as we load the data or after.\n", "\n", "Let's show how this works by increasing our chunks of the minimum monthly temperature to a size of ~500 MiB.\n", "To do so when reading in the data, all we need to do is actually specify the chunk shape to `chunks`.\n", diff --git a/101/Rechunking.ipynb b/101/Rechunking.ipynb index e2ca0a4..eb49048 100644 --- a/101/Rechunking.ipynb +++ b/101/Rechunking.ipynb @@ -28,11 +28,11 @@ "The goal of this notebook is to learn how to \"[**rechunk**](../back/Glossary.md#term-Rechunking)\" data.\n", "This will be a culmination of all the [previous introductory material](index.md) where we will:\n", "\n", - "1. [Read in a `zarr` store](ExamineDataChunking.ipynb)\n", + "1. [Read in a Zarr store](ExamineDataChunking.ipynb)\n", "2. [Check the current chunking](ExamineDataChunking.ipynb)\n", "3. [Choose a new chunk shape](BasicsShapeSize.ipynb)\n", - "4. Rechunk using [`Rechunker`](https://rechunker.readthedocs.io/en/latest/index.html)\n", - "5. [Confirm the proper creation of the `zarr` store by `Rechunker`](WriteChunkedFiles.ipynb)" + "4. Rechunk using [Rechunker](https://rechunker.readthedocs.io/en/latest/index.html)\n", + "5. [Confirm the proper creation of the Zarr store by Rechunker](WriteChunkedFiles.ipynb)" ] }, { @@ -61,7 +61,7 @@ "For the dataset in this tutorial, we will use the data from the National Water Model Reanalysis Version 2.1.\n", "The full dataset is part of the [AWS Open Data Program](https://aws.amazon.com/opendata/), available via the S3 bucket at: `s3://noaa-nwm-retro-v2-zarr-pds/`.\n", "\n", - "As this is a `zarr` store, we can easily read it in directly with [`xarray.open_dataset()`](https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html), including the keyword `chunks={}` to make sure it loads the data with `dask` using the stored chunks' shape and size." 
+ "As this is a Zarr store, we can easily read it in directly with [`xarray.open_dataset()`](https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html), including the keyword `chunks={}` to make sure it loads the data with dask using the stored chunks' shape and size." ] }, { @@ -211,7 +211,7 @@ "id": "3234b824-aa75-402f-9976-0b9d9f90e821", "metadata": {}, "source": [ - "## Rechunk with `Rechunker`\n", + "## Rechunk with [Rechunker](https://rechunker.readthedocs.io/en/latest/index.html)\n", "\n", "This is a relatively trivial example, due to the smaller size of the subset of the dataset.\n", "As the whole subset can fit into memory easily, chunking in general is largely unnecesary in terms of optimizing I/O (however, parallelism is still a consideration). \n", @@ -246,9 +246,9 @@ "id": "bfbdfce8-992d-4412-ad19-cdad10dc379c", "metadata": {}, "source": [ - "With this plan, we can ask [`rechunker`](https://rechunker.readthedocs.io/en/latest/index.html) to re-write the data using the prescribed chunking pattern.\n", - "`Rechunker` will take the currently read in data and rechunk it using an intermediate `zarr` store for efficiency.\n", - "The result will be our rechunked data saved to a new `zarr` store." + "With this plan, we can ask [Rechunker](https://rechunker.readthedocs.io/en/latest/index.html) to re-write the data using the prescribed chunking pattern.\n", + "Rechunker will take the currently read in data and rechunk it using an intermediate Zarr store for efficiency.\n", + "The result will be our rechunked data saved to a new Zarr store." ] }, { @@ -278,18 +278,18 @@ "Oh, that is not what we wanted!\n", "We seem to have gotten an error indicating overlap in chunks between the read and write.\n", "Looking at the error, it is saying that the first `time` chunk we are reading is a partial chunk and not a full chunk.\n", - "So, when `Rechunker` tries to read the data and then write the first rechunk, it is having to read two chunks to write to the one chunk.\n", - "This is a one-to-many write, which can corrupt our file when done in parallel with `dask`.\n", - "Thank goodness `Rechunker` caught this for us!\n", + "So, when Rechunker tries to read the data and then write the first rechunk, it is having to read two chunks to write to the one chunk.\n", + "This is a one-to-many write, which can corrupt our file when done in parallel with dask.\n", + "Thank goodness Rechunker caught this for us!\n", "Reading the recommended fix, it seems the only way to go about this is to call `chunk()` and reset the chunking on the original data.\n", "In other words, after we select the subset from the dataset, we need to realign the chunks such that the first chunk is not a partial chunk.\n", "This is simple enough to do.\n", - "So much so, we can just do it when passing the dataset subset to `Rechunker`.\n", + "So much so, we can just do it when passing the dataset subset to Rechunker.\n", "\n", "```{note}\n", "`rechunker.rechunk` does not overwrite any data.\n", "If it sees that `rechunked_nwm.zarr` or `/tmp/scratch.zarr` already exist, it will raise an exception.\n", - "Be sure that these locations do not exist before calling `Rechunker`. \n", + "Be sure that these locations do not exist before calling Rechunker. 
\n", "```" ] }, @@ -322,11 +322,11 @@ "Alright, that worked with no problems!\n", "Now, we must specifically direct rechunk to calculate.\n", "To do this, we can call `execute()` on our `result` `Rechunked` object.\n", - "Without the call to `execute()`, the `zarr` dataset will be empty, and `result` will only hold a 'task graph' outlining the calculation steps.\n", + "Without the call to `execute()`, the Zarr dataset will be empty, and `result` will only hold a 'task graph' outlining the calculation steps.\n", "\n", "```{tip}\n", - "The `rechunker` also writes a minimalist data group, meaning that variable metadata is not consolidated.\n", - "This is not a required step, but it will really spead up future workflows when the data is read back in using `xarray`.\n", + "Rechunker also writes a minimalist data group, meaning that variable metadata is not consolidated.\n", + "This is not a required step, but it will really spead up future workflows when the data is read back in using xarray.\n", "```" ] }, @@ -393,14 +393,14 @@ "source": [ "Perfect!\n", "The maximum absolute difference between each both the `streamflow` and `velocity` variables is 0.\n", - "In other words, they are exactly the same, and `Rechunker` worked as expect.\n", + "In other words, they are exactly the same, and Rechunker worked as expect.\n", "\n", - "Now that you know how to rechunk a `zarr` store using `Rechunker`, you should know all of the basics there are in terms of chunking.\n", + "Now that you know how to rechunk a Zarr store using Rechunker, you should know all of the basics there are in terms of chunking.\n", "You are now ready to explore more [advanced chunking topics in chunking](../201/index.md) if you are interested!\n", "\n", "## Clean Up\n", "\n", - "As we don't want to keep this rechunked `zarr` on our local machine, let's go ahead and delete it." + "As we don't want to keep this rechunked Zarr on our local machine, let's go ahead and delete it." ] }, { diff --git a/101/WhyChunk.ipynb b/101/WhyChunk.ipynb index 2b86130..60a5f67 100644 --- a/101/WhyChunk.ipynb +++ b/101/WhyChunk.ipynb @@ -127,7 +127,7 @@ "For this analysis, you can easily see why this would be slow due to the massive amount of I/O and be better if the array could instead be chunked in column-major order.\n", "Just to make this clear, if your data was now chunked by columns, all you would have to do is read the $i^{th}$ column into memory, and you would be good to go.\n", "Meaning you would just need a single read from disk versus reading however many rows your data has.\n", - "While handling chunks may seem like it would become complicated, array-handling libraries (`numpy`, `xarray`, `pandas`, `dask`, and others) will handle all of the record-keeping to know which chunk holds what data within the dataset. " + "While handling chunks may seem like it would become complicated, array-handling libraries ([numpy](https://numpy.org/), [xarray](https://xarray.dev/), [pandas](https://pandas.pydata.org/), [dask](https://www.dask.org/), and others) will handle all of the record-keeping to know which chunk holds what data within the dataset. 
" ] }, { @@ -137,7 +137,7 @@ "## Toy Example\n", "\n", "By now, we have hopefully answered both of the question about \"what is data chunking?\" and \"why should I care?\".\n", - "To really drive home the idea, let's apply the above theoretical example using [`dask`](https://docs.dask.org/en/stable/).\n", + "To really drive home the idea, let's apply the above theoretical example using [dask](https://docs.dask.org/en/stable/).\n", "In this case, we will generate a square array of ones to test how different \"[**chunk shapes**](../back/Glossary.md#term-Chunk-shape)\" compare." ] }, @@ -157,8 +157,8 @@ "### Chunk by Rows\n", "\n", "First, let's start with the square array chunked by rows.\n", - "We'll do a 50,625x50,625 array as this is about 19 GiB, which is larger than the typical memory availablity of a laptop.\n", - "The nice thing about `dask` is that we can see how big our array and chunks are in the output. " + "We'll do a 50625x50625 array as this is about 19 GiB, which is larger than the typical memory availablity of a laptop.\n", + "The nice thing about dask is that we can see how big our array and chunks are in the output. " ] }, { @@ -232,8 +232,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "As expected, the time difference is massive.\n", - "In this case, it is about a factor of 200x faster when properly chunked (at least on my laptop)." + "As expected, the time difference is massive when properly chunked." ] }, { @@ -269,7 +268,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "As we can see, this is only 1.5x slower when accessing the first column.\n", + "As we can see, this is only slightly slower when accessing the first column compared to the column chunking.\n", "However, let's time how long it takes to access a single row." ] }, @@ -305,7 +304,7 @@ "The primary pro, as we hopefully conveyed with our previous example, is that well chunked data substantially speeds up any analysis that favors that chunk shape.\n", "However, this becomes a con when you change your analysis to one that favors a new chunk shape.\n", "In other words, data that is well-organized to optimize one kind of analysis may not suit another kind of analysis on the same data.\n", - "While not a problem for our example here, changing the chunk shape (known as \"[**rechunking**](../back/Glossary.md#term-Rechunking)\" on an established dataset is time-consuming, and it produces a separate copy of the dataset, increasing storage requirements.\n", + "While not a problem for our example here, changing the chunk shape (known as \"[**rechunking**](../back/Glossary.md#term-Rechunking)\") on an established dataset is time-consuming, and it produces a separate copy of the dataset, increasing storage requirements.\n", "The space commitment can be substantial if a complex dataset needs to be organized for many different analyses.\n", "If our example above used unique values that we wanted to keep as we changed chunking, this would have meant that rather than having a single ~19 GiB dataset, we would have needed to keep all three, tripling our storage to almost 60 GiB.\n", "Therefore, selecting an appropriate chunk shape is critical when generating widely used datasets." 
diff --git a/101/WriteChunkedFiles.ipynb b/101/WriteChunkedFiles.ipynb index e7f3c7e..e0219f8 100644 --- a/101/WriteChunkedFiles.ipynb +++ b/101/WriteChunkedFiles.ipynb @@ -25,7 +25,7 @@ "id": "9262bf36-4d8b-4b1d-9d7c-6730462cf7a6", "metadata": {}, "source": [ - "The goal of this notebook is to learn how to load a collection of `netcdf` files, chunk the data, write the data in `zarr` format, and confirm the proper creation of the `zarr` store.\n", + "The goal of this notebook is to learn how to load a collection of NetCDF files, chunk the data, write the data in Zarr format, and confirm the proper creation of the Zarr store.\n", "We will be writing to our local storage for simplicity (as this is just a tutorial notebook), but you can easily change the output path to be anywhere including cloud storage." ] }, @@ -49,9 +49,9 @@ "source": [ "## Example Dataset\n", "\n", - "In this notebook, we will use the daily gridMET precipitation dataset as an example for reading data and writing to `zarr`.\n", - "The data is currently hosted on the HyTEST OSN as a collection `netcdf` files.\n", - "To get the files, we will use [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/) to open each year of precipitation data to a list.\n", + "In this notebook, we will use the daily gridMET precipitation dataset as an example for reading data and writing to Zarr.\n", + "The data is currently hosted on the HyTEST OSN as a collection of NetCDF files.\n", + "To get the files, we will use [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) to open each year of precipitation data to a list.\n", "Then, we can read in the all the files at once using [`xarray.open_mfdataset`](https://docs.xarray.dev/en/stable/generated/xarray.open_mfdataset.html)." ] }, @@ -83,7 +83,7 @@ "source": [ "## Selecting Chunk Shape and Size\n", "\n", - "As we can see in the rich HTML output of our dataset, the netCDF files were already chunked with pattern of `{'day': 61, 'lat': 98, 'lon': 231}`.\n", + "As we can see in the rich HTML output of our dataset, the NetCDF files were already chunked with a pattern of `{'day': 61, 'lat': 98, 'lon': 231}`.\n", "However, the size of these chunks are relatively small and near the lower limit of an acceptable chunk size of 10 MiB.\n", "So, it would be better if we could increase our chunk sizes to say 70-110 MiB.\n", "To do this, we will simply use multiples of the current chunks as to not completely rechunk the dataset (grouping chunks is way faster than completely changing chunk shape).\n", @@ -191,7 +191,7 @@ "id": "1d206001-3e5e-4805-b87a-de0c22bcd7d6", "metadata": {}, "source": [ - "Now, let's save this data to a `zarr`!" + "Now, let's save this data to a Zarr store!" ] }, { @@ -207,18 +207,18 @@ "id": "72ad607c-8491-45ea-bb90-ec7394419dd6", "metadata": {}, "source": [ - "As discussed in the [Xarray User-guide for reading and writing Zarr](https://docs.xarray.dev/en/stable/user-guide/io.html#specifying-chunks-in-a-zarr-store), chunks are specified to our `zarr` store in one of three ways in the preferential order of:\n", + "As discussed in the [Xarray User-guide for reading and writing Zarr](https://docs.xarray.dev/en/stable/user-guide/io.html#specifying-chunks-in-a-zarr-store), chunks are specified to our Zarr store in one of three ways in the preferential order of:\n", "\n", " 1. Manual chunk sizing through the use of the `encoding` argument\n", - " 2. Automatic chunking based on chunks of the `dask` arrays\n", - " 3. Default chunk behavior determined by the `zarr` library\n", + " 2. 
Automatic chunking based on chunks of the dask arrays\n", + " 3. Default chunk behavior determined by the Zarr library\n", "\n", - "In our case, we updated the `dask` array chunks by calling `ds.chunk()`.\n", + "In our case, we updated the dask array chunks by calling `ds.chunk()`.\n", "Therefore, we have the correct chunks and should be good to go.\n", "\n", "```{tip}\n", "This is our preferred method over using the `encoding` argument, as the positional ordering of the chunk shape in the `encoding` argument must match the positional ordering of the dimensions in each array.\n", - "If they do not match you can get incorrect chunk shapes in the `zarr` store.\n", + "If they do not match you can get incorrect chunk shapes in the Zarr store.\n", "\n", "If you have multiple variables, using `encoding` could allow for specifying individual chunking shapes for each variable.\n", "However, if this is the case, we recommend updating each variable individually using, for example, `ds.precipitation_amount.chunk()` to change the individual variable chunk shape.\n", @@ -276,7 +276,7 @@ "source": [ "## Assessing Compression\n", "\n", - "Now that our `zarr` store is made, let's check how much the data was compressed.\n", + "Now that our Zarr store is made, let's check how much the data was compressed.\n", "By default, [Zarr uses the Blosc compressor](https://docs.xarray.dev/en/stable/user-guide/io.html#zarr-compressors-and-filters) when calling [`xarray.Dataset.to_zarr()`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.to_zarr.html) if we don't specify a compressor in the `encoding`.\n", "So, our data should be compressed by default, and we can examine each chunk on disk to confirm their compression factor.\n", "\n", @@ -327,7 +327,7 @@ "metadata": {}, "source": [ "As we can see, the total dataset (excluding coordinates) is only 85 MiB on disk, with chunk sizes varying from 76 KiB to 4.8 MiB.\n", - "This size is drastically smaller than the quoted total size for the `xarray` output, which said 2.58 GiB.\n", + "This size is drastically smaller than the quoted total size for the xarray output, which said 2.58 GiB.\n", "Same for the individual chunks, which were quoted at 73.75 MiB.\n", "Let's get an exact comparison and compression ratio for the data we read in." ] @@ -364,8 +364,8 @@ "\n", "## Appending New Chunk\n", "\n", - "Since this compression is so good, let's go ahead and add another time chunk onto our existing `zarr` store.\n", - "This is simple in `xarray`, especially since we are just appending another time chunk.\n", + "Since this compression is so good, let's go ahead and add another time chunk onto our existing Zarr store.\n", + "This is simple in xarray, especially since we are just appending another time chunk.\n", "All we have to do is [add `append_dim` to our `.to_zarr()` call to append to the time dimension](https://docs.xarray.dev/en/stable/user-guide/io.html#modifying-existing-zarr-stores)." ] }, @@ -438,10 +438,10 @@ "\n", "## Clean Up\n", "\n", - "So, hopefully now you know the basics of how to create a `zarr` store from some netCDF files and set its chunks' shape.\n", + "So, hopefully now you know the basics of how to create a Zarr store from some NetCDF files and set its chunks' shape.\n", "The same methods would apply when rechunking a dataset, which we will get into next.\n", "\n", - "As we don't want to keep this `zarr` on our local machine, let's go ahead and delete it." + "As we don't want to keep this Zarr on our local machine, let's go ahead and delete it." 
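For reference, a minimal sketch of the write-and-append pattern described above, using a tiny synthetic stand-in for the gridMET data (the store path, coordinate values, and dates are made up; the variable name and the `{'day': 61, 'lat': 98, 'lon': 231}` chunk pattern come from the notebook):

```python
import numpy as np
import pandas as pd
import xarray as xr

def make_block(days):
    # Fake precipitation values on a 98 x 231 grid, one value per day.
    data = np.zeros((len(days), 98, 231), dtype="float32")
    return xr.Dataset(
        {"precipitation_amount": (("day", "lat", "lon"), data)},
        coords={"day": days, "lat": np.arange(98), "lon": np.arange(231)},
    )

# The dask chunks set here are what to_zarr() uses as the stored chunk shape.
first = make_block(pd.date_range("2020-01-01", periods=61)).chunk(
    {"day": 61, "lat": 98, "lon": 231}
)
first.to_zarr("gridmet_demo.zarr", mode="w")

# Appending more time steps later only requires append_dim on the next write.
more = make_block(pd.date_range("2020-03-02", periods=61)).chunk(
    {"day": 61, "lat": 98, "lon": 231}
)
more.to_zarr("gridmet_demo.zarr", append_dim="day")
```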
] }, { diff --git a/201/VirtualZarr.ipynb b/201/CreateVirtualZarr.ipynb similarity index 99% rename from 201/VirtualZarr.ipynb rename to 201/CreateVirtualZarr.ipynb index de04f19..08c6664 100644 --- a/201/VirtualZarr.ipynb +++ b/201/CreateVirtualZarr.ipynb @@ -27,7 +27,7 @@ "source": [ "The objective of this notebook is to learn how to create a virtual Zarr store for a collection of NetCDF files that together make up a complete data set.\n", "To do this, we will use [Kerchunk](https://fsspec.github.io/kerchunk/index.html) and [VirtualiZarr](https://virtualizarr.readthedocs.io/en/stable/index.html).\n", - "As these two packages can both create virtual Zarr stores but do it in different ways, we will utilize them both to show how they compare in combination with [Dask](https://www.dask.org/) for parallel execution." + "As these two packages can both create virtual Zarr stores but do it in different ways, we will utilize them both to show how they compare in combination with [dask](https://www.dask.org/) for parallel execution." ] }, { diff --git a/201/AddingCRStoZarr.ipynb b/201/IncludeCRSinZarr.ipynb similarity index 100% rename from 201/AddingCRStoZarr.ipynb rename to 201/IncludeCRSinZarr.ipynb diff --git a/201/RechunkingwithDask.ipynb b/201/RechunkingwithDask.ipynb index b20770c..a566a1f 100755 --- a/201/RechunkingwithDask.ipynb +++ b/201/RechunkingwithDask.ipynb @@ -7,7 +7,7 @@ "# Rechunking Larger Datasets with Dask\n", "\n", "The goal of this notebook is to expand on the rechunking performed in the [Introductory Rechunking tutorial](../101/Rechunking.ipynb).\n", - "This notebook will perfrom the same operations, but will work on the **much** larger dataset and involve some parallelization using [Dask](https://www.dask.org/). \n", + "This notebook will perform the same operations, but will work on the **much** larger dataset and involve some parallelization using [dask](https://www.dask.org/). \n", "\n", ":::{Warning}\n", "You should only run workflows like this tutorial on a cloud or HPC compute node.\n", @@ -41,7 +41,7 @@ "Like the [Introductory Rechunking tutorial](../101/Rechunking.ipynb), we will use the data from the National Water Model Retrospective Version 2.1.\n", "The full dataset is part of the [AWS Open Data Program](https://aws.amazon.com/opendata/), available via the S3 bucket at: `s3://noaa-nwm-retro-v2-zarr-pds/`.\n", "\n", - "As this is a `zarr` store, let's read it in with [`xarray.open_dataset()`](https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html) and `engine='zarr'`." + "As this is a Zarr store, let's read it in with [`xarray.open_dataset()`](https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html) and `engine='zarr'`."
] }, { @@ -215,14 +215,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Rechunk with `Rechunker`\n", + "## Rechunk with Rechunker\n", "\n", - "With this plan, we can now ask `rechunker` to re-write the data using the prescribed chunking pattern.\n", + "With this plan, we can now ask Rechunker to re-write the data using the prescribed chunking pattern.\n", "\n", "### Set up output location\n", "\n", "Unlike with the smaller dataset in our previous rechunking tutorial, we will write this larger dataset to an object store (an S3 'bucket') on the USGS OSN.\n", - "So, we need to set that up so that `rechunker` will have a suitable place to write data.\n", + "So, we need to set that up so that Rechunker will have a suitable place to write data.\n", "\n", "First, we need to set up the AWS profile and S3 endpoit." ] @@ -281,7 +281,7 @@ "### Spin up Dask Cluster\n", "\n", "Our rechunking operation will be able to work in parallel.\n", - "To do that, we will spin up a `dask` cluster to schedule the various workers.\n", + "To do that, we will spin up a dask cluster to schedule the various workers.\n", "\n", "```{note}\n", "This cluster will be configured differently depending on where you compute is performed.\n", @@ -390,7 +390,7 @@ "source": [ "## Clean Up\n", "\n", - "As we don't want to keep this rechunked `zarr`, let's go ahead and delete it.\n", + "As we don't want to keep this rechunked Zarr, let's go ahead and delete it.\n", "We will also conform with best practices and close our Dask client and cluster." ] }, diff --git a/_toc.yml b/_toc.yml index 0b173bc..b831787 100755 --- a/_toc.yml +++ b/_toc.yml @@ -12,8 +12,8 @@ chapters: - file: 201/index sections: - file: 201/RechunkingwithDask - - file: 201/VirtualZarr - - file: 201/AddingCRStoZarr + - file: 201/CreateVirtualZarr + - file: 201/IncludeCRSinZarr - file: 201/OptimalChunkSelection # - file: 201/IcechunkTutorial - file: back/index
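For reference, the Rechunker workflow that both rechunking notebooks walk through roughly follows the pattern below. The target chunk sizes and the `feature_id` dimension name are illustrative assumptions; the bucket and the `rechunked_nwm.zarr`/`/tmp/scratch.zarr` paths are the ones mentioned in the notebooks above.

```python
import fsspec
import xarray as xr
import zarr
from rechunker import rechunk

# Open the NWM retrospective Zarr store lazily with its stored chunking.
fs = fsspec.filesystem("s3", anon=True)
ds = xr.open_dataset(
    fs.get_mapper("s3://noaa-nwm-retro-v2-zarr-pds/"), engine="zarr", chunks={}
)

# Re-align the read chunks so the first time chunk is not a partial chunk,
# then hand the subset to Rechunker (the chunk sizes here are illustrative).
subset = ds[["streamflow", "velocity"]].chunk({"time": 672})

target_chunks = {
    "streamflow": {"time": 672, "feature_id": 30000},
    "velocity": {"time": 672, "feature_id": 30000},
    "time": None,        # None = leave this coordinate's chunking alone
    "feature_id": None,
}

# On the full dataset this should run against a dask cluster, as in the 201 notebook.
plan = rechunk(
    subset,
    target_chunks=target_chunks,
    max_mem="2GB",
    target_store="rechunked_nwm.zarr",
    temp_store="/tmp/scratch.zarr",
)
plan.execute()  # nothing is written until execute() is called

# Consolidate metadata so later reads with xarray are fast.
zarr.consolidate_metadata("rechunked_nwm.zarr")
```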