
Commit

Update with New intro notebook
kjdoore committed Oct 23, 2024
1 parent aa65148 commit 48aa24c
Showing 6 changed files with 152 additions and 280 deletions.
350 changes: 110 additions & 240 deletions 101/WhyChunk.ipynb

Large diffs are not rendered by default.

21 changes: 5 additions & 16 deletions 101/index.md
@@ -1,21 +1,10 @@
# Chunking 101
# Introduction to Chunking

A gentle introduction to concepts and workflows.

This introductory chapter will illustrate some key concepts for writing
chunked data (in zarr format) to object storage in 'the cloud'. We will
eventually be writing to an OSN storage device using the S3 API, although
you could, in theory, write anywhere (including a local file system).

The illustration dataset will be PRISM(v2), accessed via its OpenDAP
endpoint at <https://cida.usgs.gov/thredds/dodsC/prism_v2.html>


Buckle up... we will get up to speed fast.
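Below is a minimal sketch of that kind of chunked zarr write, assuming `xarray`, `dask`, `zarr`, and `fsspec`/`s3fs` are installed. The dataset, bucket name, and endpoint URL are all placeholders for illustration, not the PRISM data or a real OSN location.

```python
import numpy as np
import pandas as pd
import xarray as xr
import fsspec

# A small synthetic grid standing in for a PRISM-like dataset.
ds = xr.Dataset(
    {"tmax": (("time", "lat", "lon"), np.random.rand(365, 50, 100).astype("float32"))},
    coords={
        "time": pd.date_range("2020-01-01", periods=365),
        "lat": np.linspace(24.0, 50.0, 50),
        "lon": np.linspace(-125.0, -66.0, 100),
    },
)

# Pick a chunking pattern before writing; these sizes are illustrative only.
ds = ds.chunk({"time": 30, "lat": 25, "lon": 50})

# Placeholder bucket and endpoint for any S3-compatible (e.g. OSN) location
# you have credentials for.
store = fsspec.get_mapper(
    "s3://example-bucket/prism-demo.zarr",
    client_kwargs={"endpoint_url": "https://example-osn-endpoint.invalid"},
)
ds.to_zarr(store, mode="w", consolidated=True)

# Writing to a local file system uses the same call:
# ds.to_zarr("prism-demo.zarr", mode="w")
```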
In this first series of notebooks, we will go over basic introductory topics associated with chunking.
As you will soon learn, "chunking" is an essential part of the data preparation workflow, particularly for large datasets.
The key concepts you should understand after this series include:

```{tableofcontents}
```


The dask performance report for the total conversion workflow is [here](../performance_reports/OpenDAP_to_S3-perfreport.html)
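A report like that can be generated with the `performance_report` context manager from `dask.distributed`. The snippet below is a self-contained sketch using a toy workload rather than the actual OpenDAP-to-S3 conversion:

```python
from dask.distributed import Client, performance_report
import dask.array as da

# Start a local cluster; in the tutorial environment you would likely connect
# to a Nebari/Dask Gateway cluster instead.
client = Client()

# Any dask-backed workload can be profiled; this toy reduction stands in for
# the full conversion workflow.
data = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))

with performance_report(filename="example-perfreport.html"):
    data.mean().compute()

client.close()
```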

Buckle up... we will get up to speed fast.
28 changes: 16 additions & 12 deletions _toc.yml
@@ -2,20 +2,24 @@ format: jb-book
root: index

chapters:
- file: about/index
- file: 101/index
sections:
- file: 101/WhyChunk
- file: 101/ExamineSourceData
- file: 101/EffectSizeShape
- file: 101/OpenDAP_to_S3
- file: 101/Compression
- file: 101/SecondaryExample
# - file: 101/ExamineSourceData
# - file: 101/EffectSizeShape
# - file: 101/ReadWriteChunkedFiles
# - file: 101/Compression
# - file: 101/Rechunking
# - file: 101/OpenDAP_to_S3
# - file: 101/SecondaryExample
- file: 201/index
# sections:
# - file: 201/TBD
- file: back/index
sections:
- file: helpers.md
sections:
- file: utils
- file: AWS
- file: StartNebariCluster
- file: back/Appendix_A
# - file: helpers.md
# sections:
# - file: utils
# - file: AWS
# - file: StartNebariCluster
- file: back/Glossary
3 changes: 0 additions & 3 deletions about/index.md

This file was deleted.

5 changes: 5 additions & 0 deletions back/Glossary.md
@@ -1 +1,6 @@
# Glossary

- **Chunking**: The process of breaking down large amounts of data into smaller, more manageable pieces.
- **Chunk**: A smaller, more manageable piece of a larger dataset.
- **Larger-than-memory**: A dataset whose memory footprint is too large to fit into memory all at once.
- **Rechunking**: The process of changing the current chunking pattern of a dataset to a different one (see the sketch after this list).
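As a rough illustration of these terms (not part of the glossary file itself), here is a small sketch using dask arrays; the sizes are arbitrary:

```python
import dask.array as da

# "Larger-than-memory": ~80 GB of float64 values, far more than most machines
# can hold at once, represented lazily as many smaller chunks.
big = da.random.random((100_000, 100_000), chunks=(10_000, 10_000))
print(f"{big.nbytes / 1e9:.0f} GB split into {big.numblocks} blocks of shape {big.chunksize}")

# "Chunking" lets a reduction stream through the data one ~0.8 GB chunk at a time.
mean = big.mean()      # lazy task graph; nothing is read into memory yet
# mean.compute()       # uncomment to actually run the chunk-by-chunk computation

# "Rechunking": same values, different chunking pattern.
tall = big.rechunk((100_000, 1_000))
```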
25 changes: 16 additions & 9 deletions index.md
@@ -1,17 +1,24 @@
# Data Chunking
# A Data Chunking Tutorial

"Chunking" large datasets is an essential workflow in the data peparation stage of
analysis. Some of the large datasets are written with a chunking pattern which
is optimized for writing (i.e. how they are created -- model outputs, etc), and
performs poorly for reading. This depends on the analysis.
If you have found your way here, then you are probably interested in learning more about data chunking.
In this tutorial, we cover data chunking at all levels,
from a basic introduction to the topic to more advanced methods for selecting optimal chunk sizes and rechunking in the cloud.
Much of what is covered in this tutorial replicates concepts covered in a variety of materials that we cite as we go.
However, that material has been adapted to use data like what you might encounter in a HyTEST workflow.

Re-chunking is a useful strategy for re-writing a dataset in a way that optimizes it
for a particular kind of analysis (e.g. time-series vs spatial), as sketched below.
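A minimal sketch of what that looks like with xarray, using a made-up dataset and chunk sizes chosen purely for illustration:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical daily grid; variable names and sizes are made up for illustration.
ds = xr.Dataset(
    {"precip": (("time", "y", "x"), np.zeros((365, 200, 300), dtype="float32"))},
    coords={"time": pd.date_range("2020-01-01", periods=365)},
)

# Write-optimized layout: one full map per time step. Easy to append as each
# day arrives, but reading a long time series at one point touches every chunk.
write_optimized = ds.chunk({"time": 1, "y": 200, "x": 300})

# Read-optimized layout for time-series analysis: all time steps together in
# small spatial tiles, so a point extraction touches only a single chunk.
timeseries_optimized = ds.chunk({"time": 365, "y": 50, "x": 50})

# For data already on disk, the `rechunker` package (or reading with dask and
# writing back out) applies the same change without loading everything at once.
```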
The content is split into two primary sections:

- [Introduction to Chunking](101/index.md)
- [Advanced Topics in Chunking](201/index.md)

In [Introduction to Chunking](101/index.md), we discuss the basic introductory topics associated with chunking.
In [Advanced Topics in Chunking](201/index.md), we dive into more advanced topics related to chunking,
which require a firm understanding of the introductory material.

Feel free to read this tutorial in order (which has been set up for those new to chunking) or jump directly to the topic that interests you:

```{tableofcontents}
```

-----
Download the environment YAML file [here](env.yml)
If you find any issues or errors in this tutorial, or have any ideas for material that should be included,
please open an issue using the GitHub icon in the upper right.
