diff --git a/101/WhyChunk.ipynb b/101/WhyChunk.ipynb index ddf1969..de212c5 100644 --- a/101/WhyChunk.ipynb +++ b/101/WhyChunk.ipynb @@ -6,19 +6,8 @@ "source": [ "# Why (re)Chunk Data?\n", "\n", - "Re-organizing stored data such that it matches the analysis use-case.\n", - "\n", - "Inspiration from:\n", - "\n", - "\n", - ":::{note}\n", - "* The [`rechunker` documentation](https://rechunker.readthedocs.io/en/latest/index.html) contains several \n", - "examples and a tutorial covering how to re-chunk data. Much of what is here replicates concepts covered\n", - "in that material. This document uses data that _looks_ like data you might encounter in a `HyTest` workflow.\n", - "\n", - "* The `zarr` data standard has a nice tutorial also which covers details of \n", - " [optimizing chunking strategies](https://zarr.readthedocs.io/en/stable/tutorial.html#changing-chunk-shapes-rechunking).\n", - ":::" + "If you are completely new to chunking, then you are probably interested in learning \"what is data chunking?\" and \"why should I care?\".\n", + "The goal of this notebook is to answer these two basic questions and give you an understanding of what it means for data to be chunked and why you would want to do it." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ - "## What is chunking and why should you care?\n", "\n", - "The idea of data '_chunks_' is closely aligned with the NetCDF and [zarr](https://zarr.dev/) \n", - "standards for storing N-dimensional arrays of typed data. \n", + "## What is chunking?\n", "\n", + "Since modern computers were invented, there have existed datasets that were too large to fully read into computer memory.\n", + "These datasets have come to be known as \"**larger-than-memory**\" datasets.\n", + "While these datasets may be larger than memory, we will still want to access them and perform analysis on the data.\n", + "This is where chunking comes in.\n", + "\"**Chunking**\" is the process of breaking down large amounts of data into smaller, more manageable pieces.\n", + "By breaking the data down into \"**chunks**\", we can work with pieces of the larger overall dataset in a structured way without exceeding our machine's available memory.\n", + "Additionally, proper chunking can allow for faster retrieval and analysis when we only need to work with part of the dataset.\n", + "\n", + "```{note}\n", + "Chunks are not another dimension to your data, but merely a map to how the dataset is partitioned into more palatably sized units for manipulation in memory.\n", + "```" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ + "## Why should I care?\n", "\n", - "Chunks become more important as the size of the array increases. For very large arrays, it \n", - "is helpful to organize the memory it occupies into sub-units. These sub-units are the \n", - "chunks. Note that this is not another dimension to the array, but merely a map to how the \n", - "large array is partitioned into more palatable sized units for manipulation in memory. \n", - "Array-handling libraries (`numpy`, `xarray`, `pandas`, and others) will handle all of the \n", - "record-keeping to know which chunk holds a given unit of the array. 
\n", + "The main reason you should care about chunking is that proper chunking can allow for faster retrieval and analysis of the dataset.\n", + "Even datasets that are small enough to fit into memory can still technically be chunked.\n", + "So proper chunking can potentially speed up their retrieval and analysis as well.\n", + "To help you understand this, let's begin with a simple example.\n", "\n", - "### Example - first principles\n", - "A quick side-bar to illustrate two chunking patterns for a simple 2D array. This is a \n", - "simplified use-case. Consider a square array of integer values. Just for exposition, \n", - "let's use a small array 10x10. \n", + "### Example - First Principles\n", "\n", + "In this example, we will illustrate two common memory organization (analogous to chunking) patterns that computers use when handling basic multidimensional data.\n", + "To simplify this, let's consider a small 10x10 array of integer values.\n", "\n", "$$\n", "\\def\\arraystretch{2.0}\n", @@ -73,16 +77,13 @@ "\\end{array}\n", "$$\n", "\n", - "\n", - "Computer memory is not addressed in grids -- it is a linear address space, so the\n", - "2D matrix has to be organized in memory such that it presents as 2D, while being\n", - "stored as 1D. Two common options are **row-major** \n", - "order, and **column-major** order:\n", - "* Row-Major -- A row of data occupies a contiguous block of memory. This implies that \n", - " cells which are logicall adjacent vertically are not physicall near one another in \n", - " memory. The 'distance' from `r0c0` to `r0c1` (a one-cell logical move within the row) \n", - " is short, while the 'distance' to `r1c0` (a one-cell logical move within the column) \n", - " is long.\n", + "While this is easy for us humans to visualize, computer memory is not addressed in grids.\n", + "Instead, it is organized as a linear address space.\n", + "So, the 2D matrix has to be organized in memory such that it presents as 2D, while being stored as 1D.\n", + "Two common options are **row-major** order and **column-major** order:\n", + "- **Row-Major**: A row of data occupies a contiguous block of memory.\n", + "  This implies that cells which are logically adjacent vertically are not physically near one another in memory.\n", + "  The \"distance\" from `r0c0` to `r0c1` (a one-cell logical move within the row) is short, while the \"distance\" to `r1c0` (a one-cell logical move within the column) is long.\n", "\n", "$$\n", "\\def\\arraystretch{2.0}\n", @@ -93,9 +94,8 @@ "\\end{array}\n", "$$\n", "\n", - "* Column-Major -- A column of the array occupies a contiguious block of memory. This \n", - " implies that cells which are adjacent horizontally are not near one another physically \n", - " in memory. \n", + "- **Column-Major**: A column of the array occupies a contiguous block of memory.\n", + "  This implies that cells which are adjacent horizontally are not near one another physically in memory.
\n", "\n", "$$\n", "\\def\\arraystretch{2.0}\n", @@ -106,59 +106,34 @@ "\\end{array}\n", "$$\n", "\n", + "In either mapping, `r3c5` (for example) still fetches the same value.\n", + "For a single value, this is not a problem.\n", + "The array is still indexed/addressed in the same way as far as the user is concerned, but the memory organization pattern determines how nearby an 'adjacent' index is.\n", + "This becomes important when trying to get a subsection of the data.\n", + "For example, if the array is in row-major order and we select, say, `r0`, this is fast for the computer as all the data is adjacent.\n", + "However, if we wanted `c0`, then the computer has to access every 10th value in memory, which, as you can imagine, is not as efficient.\n", "\n", - "In either mapping, `r3c5` (for example) still fetches the same value. The array \n", - "still indexes/addresses in the same way as far as the user is concerned, but the \n", - "chunking plan determines how nearby an 'adjacent' index is. \n", - "\n", - "### Example - extend to chunking\n", - "The basic idea behind chunking is an extension of this memory organization principle. \n", - "As the size of the array increases, the chunk pattern becomes more relevant. Suppose \n", - "the data is big enough that only a row or column at a time can fit into memory. \n", - "If your data is chunked by **rows**, and you need to process a **column** of data -- your \n", - "process will need to read a lot of data, skipping most of it, to get the $i^{th}$ \n", - "column value for each row. For this analysis, it would be better if the array could \n", - "be '_re-chunked_' from row-major order to column-major order. This would favor \n", - "column operations.\n", - "\n", + "### Extend to Chunking\n", "\n", - "## Pros & Cons\n", - "Data that is well-organized to optimize one kind of analysis may not suit another \n", - "kind of analysis on the same data. Re-chunking is time-consuming, and it produces \n", - "a separate copy of the dataset, increasing storage requirements. The initial time \n", - "commitment is a one-time operation so that future analyses can run quickly. The \n", - "space commitment can be substantial if a complex dataset needs to be organized for \n", - "many different analyses.\n" + "The basic idea behind chunking is an extension of this memory organization principle.\n", + "As the size of the array increases, the chunk pattern becomes more relevant.\n", + "Now suppose the square array is larger-than-memory and stored on disk such that only a single row or column can fit into memory at a time.\n", + "If your data is chunked by **rows**, and you need to process the $i^{th}$ **column**, you will have to read one row at a time into memory, skip to the $i^{th}$ column value in each row, and extract that value.\n", + "For this analysis, you can easily see why this would be slow due to the massive amount of I/O, and why it would be better if the array could instead be chunked in column-major order.\n", + "To make this clear, if your data were instead chunked by **columns**, all you would have to do is read the $i^{th}$ column into memory, and you would be good to go.\n", + "That means a single read from disk versus reading however many rows your data has.\n", + "While handling chunks may seem like it would become complicated, array-handling libraries (`numpy`, `xarray`, `pandas`, `dask`, and others) will handle all of the record-keeping to know which chunk holds what data within the dataset.\n",
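+    "\n",
+    "If you want to see the memory-layout effect for yourself before we get to the chunked example below, here is a minimal, optional sketch using `numpy` (which stores arrays in row-major order by default); the exact timings will depend on your machine:\n",
+    "\n",
+    "```python\n",
+    "import numpy as np\n",
+    "\n",
+    "row_major = np.ones((5_000, 5_000), order='C')  # rows are contiguous in memory\n",
+    "col_major = np.asfortranarray(row_major)        # same values, but columns are contiguous\n",
+    "\n",
+    "row_major[0, :].sum()  # reads one contiguous block of memory (fast)\n",
+    "row_major[:, 0].sum()  # strides across the whole array (slower)\n",
+    "col_major[:, 0].sum()  # contiguous again, since columns are contiguous here\n",
+    "```"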
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Examining a Small Dataset\n", - "Let's read a sample dataset and examine how it is chunked. \n", + "## Toy Example\n", "\n", - "As a test datasaet, we've taken a random sampling of 400 stream gages for \n", - "the month of July, 2000 from the National Water Model Reanalysis Version 2.1.\n", - "The full dataset is part of the \n", - "[AWS Open Data Program](https://aws.amazon.com/opendata/), \n", - "available via the S3 bucket at \n", - "```\n", - "s3://noaa-nwm-retrospective-2-1-zarr-pds/noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr\n", - "``` \n", - " \n", - " Our subset of that data for use in this tutorial is included in the HyTEST catalog:\n", - " " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%run ../AWS.ipynb\n", - "## Establish AWS credentials" + "By now, we have hopefully answered both of the question about \"what is data chunking?\" and \"why should I care?\".\n", + "To really drive home the idea, let's apply the above theoretical example using [`dask`](https://docs.dask.org/en/stable/).\n", + "In this case, we will generate a square array of ones to test how different chunking patterns compare." ] }, { @@ -167,61 +142,18 @@ "metadata": {}, "outputs": [], "source": [ - "import xarray as xr \n", - "import intake\n", - "url = 'https://raw.githubusercontent.com/hytest-org/hytest/main/dataset_catalog/hytest_intake_catalog.yml'\n", - "cat = intake.open_catalog(url)\n", - "sampleData = cat['rechunking-tutorial-cloud'].to_dask()\n", - "sampleData" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The critical items to notice in this output are highlighted here: \n", - "
\n",
-    "<xarray.Dataset>\n",
-    "\n",
-    "Dimensions:     (feature_id: 400, time: 744)  <-- NOTE: Two dimensions\n",
-    "\n",
-    "                   +--- most coordinates are tied to feature_id dimension \n",
-    "                   | \n",
-    "Coordinates:       V\n",
-    "    elevation   (feature_id) float32 dask.array<chunksize=(400,), meta=np.ndarray>\n",
-    "  * feature_id  (feature_id) int32 3109 189899 239166 ... 947070134 1010003783\n",
-    "    gage_id     (feature_id) |S15 dask.array<chunksize=(400,), meta=np.ndarray>\n",
-    "    latitude    (feature_id) float32 dask.array<chunksize=(400,), meta=np.ndarray>\n",
-    "    longitude   (feature_id) float32 dask.array<chunksize=(400,), meta=np.ndarray>\n",
-    "    order       (feature_id) int32 dask.array<chunksize=(400,), meta=np.ndarray>\n",
-    "  * time        (time) datetime64[ns] 2000-07-01 ... 2000-07-31T23:00:00\n",
-    "\n",
-    "Data variables:\n",
-    "    streamflow  (time, feature_id) float64 dask.array<chunksize=(256, 16), meta=np.ndarray>\n",
-    "    velocity    (time, feature_id) float64 dask.array<chunksize=(256, 16), meta=np.ndarray>\n",
-    "                 ^^^^  ^^^^^^^^^^\n",
-    "                 the data variables are addressed by both dimensions; this is 2D data.\n",
-    ""
+ "import dask.array as da" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Data Structure\n", - "\n", - "This dataset is a 'stack' of two 2D arrays. They are named 'streamflow' and 'velocity'. The indices \n", - "into each of those 2D arrays are `time` on one axis, and `feature_id` on the other. The feature id \n", - "is bound to a number of other coordinates, so you can relate/refer to a given feature by its elevation, \n", - "gage_id, latitude, longitude, or stream order. \n", + "### Chunk by Rows\n", "\n", - "Note the `chunksize` highlighted in green. This says that the data is stored in blocks mapping to 256 \n", - "adjacent time-steps for 16 adjacent features. (**NOTE**: _The original data is not chunked this way; we've \n", - "deliberately fiddled with the chunk configuration for this tutorial_)\n", - "\n", - "### Two Example Read Patterns\n", - "A time-series analysis (i.e. sampling all time-step values for a single `feature_id`) would require \n", - "multiple chunks to be fetched. " + "First, let's start with the square array chunked by rows.\n", + "We'll use a 50,000x50,000 array of 8-byte values, which comes to about 19 GiB (50,000 x 50,000 x 8 bytes), larger than the typical memory availability of a laptop.\n", + "The nice thing about `dask` is that we can see how big our array and chunks are in the output. " ] }, { "cell_type": "code", @@ -230,18 +162,19 @@ "metadata": {}, "outputs": [], "source": [ - "# Fetch all the time values for a specific feature_id\n", - "sampleData['streamflow'].sel(feature_id=1343034)" + "vals = da.ones(shape=(5e4, 5e4), chunks=(1, 5e4))\n", + "vals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "This data has 744 time-steps available, chunked into chunks of 256 values each. Three chunks are needed to hold this time-series for one feature. Not too bad, but not good either. \n", + "Now, let's see how long, on average, it takes to get the first column.\n", "\n", - "On the other hand, an analysis which samples all locations for a single point in time would need \n", - "to fetch multiple chunks also. \n" + "```{note}\n", + "We call the `.compute()` method on our slice to force the extraction to actually run, since `dask` otherwise builds the operation lazily without executing it.\n", + "```" ] }, { "cell_type": "code", @@ -250,35 +183,17 @@ "metadata": {}, "outputs": [], "source": [ - "# Fetch all the gage values for a single day\n", - "sampleData['streamflow'].sel(time='07-01-2000')" + "%%timeit\n", + "vals[:, 0].compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "### Chunk by Columns\n", "\n", - "This dataset has 400 features, broken into chunks of 16 data values in each chunk. Many \n", - "more chunks must be fetched for this read pattern. This is much worse: the I/O engine \n", - "needs to find and retrieve 25 chunks vs 3 in the previous example. Each separate chunk/file\n", - "is a full trip through the I/O stack. \n", - "\n", - "If we were going to do either of those analyses on a very large dataset with this pattern,\n", - "we'd want to re-chunk the data to optimize for our read pattern. " + "Now, let's switch the array to be chunked by columns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Re-Chunking the Sample Data\n", - "This is a trivial example, due to the small size of the dataset -- It all fits in memory easily,\n", - "so chunking is largely unnecesary in terms of optimizing I/O (parallelism is still a consideration). \n", - "But it is worth doing, as concepts will apply when we take this to the full-sized data.\n", - "\n", - "First thing we need is a chunk plan to describe the chunk layout we want. This can be generated \n", - "using various methods. 
For this dataset, it's easy enough to write it manually:" ] }, { "cell_type": "code", @@ -287,25 +202,15 @@ "metadata": {}, "outputs": [], "source": [ - "# Numbers are *size* of the chunk. \n", - "chunk_plan = {\n", - " 'streamflow': {'time': 744, 'feature_id': 1}, # all time records in one chunk for each feature_id\n", - " 'velocity': {'time': 744, 'feature_id': 1},\n", - " 'elevation': (400,),\n", - " 'gage_id': (400,),\n", - " 'latitude': (400,),\n", - " 'longitude': (400,), \n", - " 'order': (400,), \n", - " 'time': (744,),\n", - " 'feature_id': (400,)\n", - "}" + "vals = da.ones(shape=(5e4, 5e4), chunks=(5e4, 1))\n", + "vals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "With this plan, we can ask `rechunker` to re-write the data using the prescribed chunking pattern.\n" + "Time to see how much faster this is." ] }, { "cell_type": "code", @@ -314,46 +219,25 @@ "metadata": {}, "outputs": [], "source": [ - "import rechunker\n", - "outfile = r\"/tmp/outfile.zarr\"\n", - "result = rechunker.rechunk(\n", - " sampleData,\n", - " chunk_plan,\n", - " \"2GB\", #<--- Max Memory\n", - " outfile ,\n", - " temp_store=\"/tmp/scratch.zarr\" \n", - ")\n", - "_ = result.execute() # Note that we must specifically direct rechunk to calculate.\n", - "# without the call to execute(), the zarr dataset will be empty, and result will hold only\n", - "# a 'task graph' outlining the calculation steps." + "%%timeit\n", + "vals[:, 0].compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Note that `rechunker.rechunk` does not overwrite any data. If it sees that `/tmp/outfile.zarr` or `/tmp/scratch.zarr` already exist, it will balk and likely raise an exception. Be sure that these locations do not exist. \n", - "\n", - "The `rechunker` also writes a minimalist data group. Meaning that variable metadata is not consolidated. This is not a required step, but it will really spead up future workflows when the\n", - "data is read back in. " ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import zarr\n", - "_ = zarr.consolidate_metadata(outfile)" + "As expected, the time difference is massive.\n", + "In this case, it is about 200x faster when properly chunked (at least on my laptop)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Results\n", - "Let's read in the resulting re-chunked dataset to see how it looks:" + "### Balanced Chunks\n", + "\n", + "As a final example, let's check a chunking pattern that is evenly split between columns and rows." ] }, { "cell_type": "code", @@ -362,29 +246,8 @@ "metadata": {}, "outputs": [], "source": [ - "reChunkedData = xr.open_zarr(outfile)\n", - "reChunkedData" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Note here that for both `streamflow` and `velocity`, the chunksize in the `time` dimension is 744 (the total number of time steps). Analyses which favor fetching all time-step values for a given `facility_id` will prefer this chunking strategy." 
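+    "# chunks of (at most) 225 x 225 elements: a pattern balanced between rows and columns\n",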
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Comparison\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Before Re-Chunking:" + "vals = da.ones(shape=(5e4, 5e4), chunks=(225, 225))\n", + "vals" ] }, { "cell_type": "code", @@ -393,15 +256,16 @@ "metadata": {}, "outputs": [], "source": [ - "sampleData['streamflow'].sel(feature_id=1343034)\n", - "# Note: three chunks needed to service a single feature_id" + "%%timeit\n", + "vals[:, 0].compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### After re-chunking:" + "As we can see, this is only about 1.5x slower when accessing the first column.\n", + "However, let's time how long it takes to access a single row." ] }, { "cell_type": "code", @@ -410,30 +274,36 @@ "metadata": {}, "outputs": [], "source": [ - "reChunkedData['streamflow'].sel(feature_id=1343034) \n", - "# All data for the specified feature_id is in a single chunk\n" + "%%timeit\n", + "vals[0, :].compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Cleaning Up" + "As expected, it is about the same as accessing a single column.\n", + "However, that means it is drastically faster than the column chunking when accessing rows.\n", + "Therefore, a chunking pattern that balances the dimensions is more generally applicable when both dimensions are needed for analysis." ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], + "cell_type": "markdown", + "metadata": { + "tags": [] + }, "source": [ - "import shutil\n", - "if os.path.exists(outfile):\n", - " print(f\"removing {outfile}\")\n", - " shutil.rmtree(outfile)\n", - "if os.path.exists(r\"/tmp/scratch.zarr\"):\n", - " print(\"removing scratch space\")\n", - " shutil.rmtree(r\"/tmp/scratch.zarr\")" + "## Pros & Cons of Chunking\n", + "\n", + "As a wrap-up, let's review some of the pros and cons of chunking.\n", + "Some of these we have already discussed, while others may be more subtle.\n", + "The primary pro, as we hopefully conveyed with our previous example, is that well-chunked data substantially speeds up any analysis that favors that chunking pattern.\n", + "However, this becomes a con when you change your analysis to one that favors a new chunking pattern.\n", + "In other words, data that is well-organized to optimize one kind of analysis may not suit another kind of analysis on the same data.\n", + "While not a problem for our example here, changing the chunking pattern (known as \"**rechunking**\") on an established dataset is time-consuming, and it produces a separate copy of the dataset, increasing storage requirements.\n", + "The space commitment can be substantial if a complex dataset needs to be organized for many different analyses.\n", + "If our example above had used unique values that we wanted to keep as we changed chunking, then rather than having a single ~19 GiB dataset, we would have needed to keep all three copies, tripling our storage to almost 60 GiB.\n", + "Therefore, selecting an appropriate chunking pattern is critical when generating widely used datasets." ] } ], @@ -453,7 +323,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.8" + "version": "3.12.0" }, "vscode": { "interpreter": { diff --git a/101/index.md b/101/index.md index 56bba4d..4b289cf 100644 --- a/101/index.md +++ b/101/index.md @@ -1,21 +1,10 @@ -# Chunking 101 +# Introduction to Chunking -A gentle introduction to concepts and workflows. 
- -This introductory chapter will illustrate some key concepts for writing -chunked data (in zarr format) to object storage in 'the cloud'. We'll be -eventually be writing to an OSN storage device using the S3 API, although -you could, in theory, write anywhere (including a local file system). - -The illustration dataset will be PRISM(v2), accessed via its OpenDAP -endpoint at - - -Buckle up... we will get up to speed fast. +In this first series of notebooks, we will go over basic introductory topics associated with chunking. +As you will soon learn, "chunking" is an essential part of the data preparation workflow, particularly for large datasets. +The key concepts you should understand after this series include: ```{tableofcontents} ``` - -The dask performance report for the total conversion workflow is [here](../performance_reports/OpenDAP_to_S3-perfreport.html) - +Buckle up... we will get up to speed fast. diff --git a/_toc.yml b/_toc.yml index c1619d5..736ea49 100755 --- a/_toc.yml +++ b/_toc.yml @@ -2,20 +2,24 @@ format: jb-book root: index chapters: -- file: about/index - file: 101/index sections: - file: 101/WhyChunk - - file: 101/ExamineSourceData - - file: 101/EffectSizeShape - - file: 101/OpenDAP_to_S3 - - file: 101/Compression - - file: 101/SecondaryExample + # - file: 101/ExamineSourceData + # - file: 101/EffectSizeShape + # - file: 101/ReadWriteChunkedFiles + # - file: 101/Compression + # - file: 101/Rechunking + # - file: 101/OpenDAP_to_S3 + # - file: 101/SecondaryExample +- file: 201/index + # sections: + # - file: 201/TBD - file: back/index sections: - - file: helpers.md - sections: - - file: utils - - file: AWS - - file: StartNebariCluster - - file: back/Appendix_A + # - file: helpers.md + # sections: + # - file: utils + # - file: AWS + # - file: StartNebariCluster + - file: back/Glossary diff --git a/about/index.md b/about/index.md deleted file mode 100644 index 4a52b3a..0000000 --- a/about/index.md +++ /dev/null @@ -1,3 +0,0 @@ -# About This Project - -who, what, objectives, etc. \ No newline at end of file diff --git a/back/Glossary.md b/back/Glossary.md index ca3c827..5759c38 100644 --- a/back/Glossary.md +++ b/back/Glossary.md @@ -1 +1,6 @@ # Glossary + +- **Chunking**: The process of breaking down large amounts of data into smaller, more manageable pieces. +- **Chunk**: A smaller, more manageable piece of a larger dataset. +- **Larger-than-memory**: A dataset whose memory footprint is too large to fit into memory all at once. +- **Rechunking**: The process of changing the current chunking pattern of a dataset to another chunking pattern. \ No newline at end of file diff --git a/index.md b/index.md index dc7909a..4adf76c 100755 --- a/index.md +++ b/index.md @@ -1,17 +1,24 @@ -# Data Chunking +# A Data Chunking Tutorial -"Chunking" large datasets is an essential workflow in the data peparation stage of -analysis. Some of the large datasets are written with a chunking pattern which -is optimized for writing (i.e. how they are created -- model outputs, etc), and -performs poorly for reading. This depends on the analysis. +If you have found your way here, then you are probably interested in learning more about data chunking. +In this tutorial, we will go over all levels of information on data chunking, +from a basic introduction to the topic to more complex methods of selecting optimal chunk sizes and rechunking in the cloud. +Much of what is covered in this tutorial replicates concepts covered in a variety of materials that we cite as we go. 
However, that material has been adapted to use data that looks like data you might encounter in a HyTEST workflow. -Re-chunking is a useful strategy to re-write the dataset in such a way to optimize -a particular kind of analysis (i.e. time-series vs spatial). +The content is split into two primary sections: + - [Introduction to Chunking](101/index.md) + - [Advanced Topics in Chunking](201/index.md) +In [Introduction to Chunking](101/index.md), we discuss all of the basic introductory topics associated with chunking. +In [Advanced Topics in Chunking](201/index.md), we dive into more advanced topics related to chunking, +which require a firm understanding of the introductory topics. + +Feel free to read this tutorial in order (which has been set up for those new to chunking) or jump directly to the topic that interests you: ```{tableofcontents} ``` ------ -Download the environment YAML file [here](env.yml) +If you find any issues or errors in this tutorial or have any ideas for material that should be included, +please open an issue using the GitHub icon in the upper right.