Showing 6 changed files with 152 additions and 280 deletions.
@@ -1,21 +1,10 @@
-# Chunking 101
+# Introduction to Chunking
 
-A gentle introduction to concepts and workflows.
-
-This introductory chapter will illustrate some key concepts for writing
-chunked data (in zarr format) to object storage in 'the cloud'. We'll
-eventually be writing to an OSN storage device using the S3 API, although
-you could, in theory, write anywhere (including a local file system).
-
-The illustration dataset will be PRISM(v2), accessed via its OpenDAP
-endpoint at <https://cida.usgs.gov/thredds/dodsC/prism_v2.html>
-
-Buckle up... we will get up to speed fast.
+In this first series of notebooks, we will go over basic introductory topics associated with chunking.
+As you will soon learn, "chunking" is an essential part of the data preparation workflow, particularly for large datasets.
+The key concepts you should understand after this series include:
 
 ```{tableofcontents}
 ```
 
-The dask performance report for the total conversion workflow is [here](../performance_reports/OpenDAP_to_S3-perfreport.html)
-
-Buckle up... we will get up to speed fast.
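The chunked-storage idea behind the zarr writing described above can be sketched numerically: a chunked store covers an array's shape with fixed-size pieces, and the trailing chunk along each axis may be partial. A minimal pure-Python sketch (the `n_chunks` helper and the grid dimensions are illustrative assumptions, not taken from the tutorial or the PRISM dataset):

```python
import math

def n_chunks(shape, chunk_shape):
    """Number of chunks needed to cover an array of `shape` when it is
    split into pieces of `chunk_shape` (trailing chunks may be partial)."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunk_shape))

# A hypothetical daily (time, y, x) grid split into 73-day, 300x700-cell chunks:
print(n_chunks((365, 600, 1400), (73, 300, 700)))  # 5 * 2 * 2 = 20
```

Each of those 20 chunks is then a separate object in the store, which is what lets readers fetch only the pieces an analysis touches.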
This file was deleted.
@@ -1 +1,6 @@
 # Glossary
+
+- **Chunking**: The process of breaking down large amounts of data into smaller, more manageable pieces.
+- **Chunk**: A smaller, more manageable piece of a larger dataset.
+- **Larger-than-memory**: Describes a dataset whose memory footprint is too large to fit into memory all at once.
+- **Rechunking**: The process of changing the current chunking pattern of a dataset to another chunking pattern.
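The first two glossary entries can be made concrete with a small pure-Python sketch (the `chunk_slices` helper is our own illustration, not part of the tutorial):

```python
def chunk_slices(n, chunk_size):
    """Slices that break range(n) into chunks of chunk_size
    (the final chunk may be smaller)."""
    return [slice(start, min(start + chunk_size, n))
            for start in range(0, n, chunk_size)]

data = list(range(10))
chunks = [data[s] for s in chunk_slices(len(data), 4)]
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Here the ten-element dataset is "chunked" into three pieces; each piece is a "chunk" that can be read or written independently of the others.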
@@ -1,17 +1,24 @@
-# Data Chunking
+# A Data Chunking Tutorial
 
-"Chunking" large datasets is an essential workflow in the data preparation stage of
-analysis. Some large datasets are written with a chunking pattern which
-is optimized for writing (i.e. how they are created -- model outputs, etc.) and
-performs poorly for reading, depending on the analysis.
+If you have found your way here, then you are probably interested in learning more about data chunking.
+In this tutorial, we will go over all levels of information on data chunking,
+from basic introductions to the topic to complex methods of selecting optimal chunk sizes and rechunking in the cloud.
+Much of what is covered in this tutorial replicates concepts covered in a variety of materials that we cite as we go.
+However, that material has been adapted to use data that looks like data you might encounter in a HyTEST workflow.
 
-Re-chunking is a useful strategy to re-write the dataset in such a way as to optimize
-a particular kind of analysis (i.e. time-series vs. spatial).
+The content is split into two primary sections:
+
+- [Introduction to Chunking](101/index.md)
+- [Advanced Topics in Chunking](201/index.md)
+
+In [Introduction to Chunking](101/index.md), we discuss the basic introductory topics associated with chunking.
+In [Advanced Topics in Chunking](201/index.md), we dive into more advanced topics related to chunking,
+which require a firm understanding of the introductory material.
+
+Feel free to read this tutorial in order (it has been arranged for those new to chunking) or jump directly to the topic that interests you:
 
 ```{tableofcontents}
 ```
 
 -----
-Download the environment YAML file [here](env.yml)
+If you find any issues or errors in this tutorial or have any ideas for material that should be included,
+please open an issue using the GitHub icon in the upper right.
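The time-series-vs-spatial trade-off that motivates rechunking in the text above can be sketched in pure Python (a toy 2-D grid standing in for a real dataset; real workflows would use a library such as dask for this):

```python
# Toy (time, cell) grid: 6 time steps, 4 spatial cells.
grid = [[t * 10 + x for x in range(4)] for t in range(6)]

# Chunked along time (2 steps per chunk): good for spatial analysis,
# since each chunk holds every cell for a few time steps.
time_chunks = [grid[t:t + 2] for t in range(0, len(grid), 2)]

# Rechunked along space (2 cells per chunk): good for time-series analysis,
# since each chunk holds the full record for a few cells.
space_chunks = [[row[x:x + 2] for row in grid] for x in range(0, 4, 2)]

# Reading the full time series for one cell touches every time-chunk...
print(len(time_chunks))    # 3
# ...but only one space-chunk.
print(len(space_chunks))   # 2
print(space_chunks[0][0])  # [0, 1] -> cells 0 and 1 at the first time step
```

The data are identical in both layouts; only the grouping changes, which is exactly why a chunking pattern optimized for writing can perform poorly for a different read pattern.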