[#7]: split fundamentals
g.trantham committed Apr 7, 2023
1 parent c2c5ec4 commit d572ee4
Showing 30 changed files with 3,282 additions and 21,833 deletions.
2,475 changes: 0 additions & 2,475 deletions 101/02a_kerchunk_single.ipynb

This file was deleted.

5,467 changes: 0 additions & 5,467 deletions 101/02b_kerchunk_multi.ipynb

This file was deleted.

5,452 changes: 0 additions & 5,452 deletions 101/02b_kerchunk_multi_to_years.ipynb

This file was deleted.

2,342 changes: 0 additions & 2,342 deletions 101/02c_kerchunk_multi_from_years.ipynb

This file was deleted.

5,274 changes: 0 additions & 5,274 deletions 101/02d_kerchunk.explore.ipynb

This file was deleted.

280 changes: 280 additions & 0 deletions 101/Compression.ipynb
@@ -0,0 +1,280 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Assessing Compression\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"**OBJECTIVE**: \n",
"The objective of this chapter is to demonstrate how to read an existing dataset available as an OpenDAP endpoint, and translate it into a cloud-optimized zarr on S3. \n",
"\n",
"This notebook will evaluate how well the data is compressed when it is \n",
"written do disk. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preamble\n",
"This is all stuff we are going to need: "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import logging\n",
"\n",
"import numpy as np\n",
"import xarray as xr\n",
"import hvplot.xarray\n",
"import fsspec\n",
"import zarr\n",
"\n",
"logging.basicConfig(level=logging.INFO, force=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"remove-input"
]
},
"outputs": [],
"source": [
"%run ../utils.ipynb\n",
"_versions(['xarray', 'dask', 'zarr', 'fsspec'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## AWS Credentials\n",
"\n",
"We will use the same credentials scheme we used to write the data in \n",
"{doc}`OpenDAP_to_S3`. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"os.environ['AWS_PROFILE'] = 'osn-rsignellbucket2'\n",
"os.environ['AWS_ENDPOINT'] = 'https://renc.osn.xsede.org'\n",
"\n",
"%run ../AWS.ipynb # handles credentials for us. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The zarr store\n",
"\n",
"Let's look at the zarr data store we wrote to object storage"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# OUTPUT Location: \n",
"outdir = f's3://rsignellbucket2/testing/prism/PRISM2.zarr'\n",
"# established in earlier notebooks.\n",
"\n",
"fsw = fsspec.filesystem('s3', \n",
" anon=False, \n",
" default_fill_cache=False, \n",
" skip_instance_cache=True, \n",
" client_kwargs={ 'endpoint_url': os.environ['AWS_S3_ENDPOINT'] },\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The zarr store is actually a folder/directory, with subfolders for variables, groups, etc. \n",
"We can get a quick peek at that with a couple of zarr functions:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"g = zarr.convenience.open_consolidated(fsw.get_mapper(outdir)) # read zarr metadata for named file.\n",
"print(g.tree())"
]
},
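{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the folder/directory layout concrete, we can also list the store's top-level entries \n",
"with the same fsspec filesystem. This is just a sketch; the exact listing depends on the \n",
"variables and metadata objects actually present in the store."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# List top-level entries of the zarr store: typically one \"folder\" per variable\n",
"# plus metadata objects (e.g. .zmetadata for consolidated metadata).\n",
"for entry in fsw.ls(outdir):\n",
"    print(entry)"
]
},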
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Sizing the chunks\n",
"\n",
"Using the filesystem utilities, build a datasets of file sizes:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"flist = fsw.glob(f'{outdir}/tmx/*')\n",
"fsize = [fsw.size(f) for f in flist]\n",
"da = xr.DataArray(data=np.array(fsize)/1e6, name='size')"
]
},
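{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional quick check before plotting, here are simple summary statistics of the `size` \n",
"array built above (a sketch; values are in MB):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Min / median / max of the compressed object sizes, in MB.\n",
"print(f\"min:    {da.min().item():.2f} MB\")\n",
"print(f\"median: {da.median().item():.2f} MB\")\n",
"print(f\"max:    {da.max().item():.2f} MB\")"
]
},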
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot a histogram of sizes for data files in this zarr store:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"da.hvplot.hist(title='Compressed object sizes (MB) for \"tmx\" variable', grid=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that most of the individual chunks are just over 4MB in size. Compare this with \n",
"the in-memory size for chunks, according to `xarray` -- 34MB per chunk. \n",
"\n",
"This tells us that we get an astonishing 9:1 compression ratio on this particular data. \n",
"\n",
"Let's look at another variable in this dataset:"
]
},
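{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check on the in-memory figure, we can estimate the uncompressed size of a single \n",
"chunk from the store's own chunk shape and dtype. This is a sketch: it assumes the variable is \n",
"named `tmx` and that opening with `chunks={}` preserves the on-disk chunking."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Estimate the uncompressed (in-memory) size of one 'tmx' chunk.\n",
"ds_check = xr.open_dataset(fsw.get_mapper(outdir), engine='zarr', chunks={})\n",
"tmx = ds_check['tmx']\n",
"chunk_shape = tuple(c[0] for c in tmx.chunks)  # shape of a typical (first) chunk\n",
"chunk_mb = np.prod(chunk_shape) * tmx.dtype.itemsize / 1e6\n",
"print(f\"chunk shape: {chunk_shape}, ~{chunk_mb:.1f} MB uncompressed\")"
]
},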
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"flist = fsw.glob(f'{outdir}/ppt/*')\n",
"fsize = [fsw.size(f) for f in flist]\n",
"da = xr.DataArray(data=np.array(fsize)/1e6, name='size')\n",
"da.hvplot.hist(title='Compressed object sizes (MB) for \"ppt\" variable', grid=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This variable did not compress quite as well. In the worse case, severl chunks are about\n",
"14MB each. This is still a respectable 2.5:1 compression ratio for this variable. \n"
]
},
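{
"cell_type": "markdown",
"metadata": {},
"source": [
"To put numbers on that comparison, here is a rough per-variable compression ratio: the nominal \n",
"in-memory size divided by the total size of the compressed chunk objects on disk. This is a \n",
"sketch; it assumes each variable's chunks live under `<store>/<variable>/`, as in the globs above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Rough per-variable compression ratios: nominal in-memory bytes / bytes on disk.\n",
"ds_check = xr.open_dataset(fsw.get_mapper(outdir), engine='zarr', chunks={})\n",
"for v in ['tmx', 'ppt']:\n",
"    on_disk = sum(fsw.size(f) for f in fsw.glob(f'{outdir}/{v}/*'))\n",
"    in_mem = ds_check[v].size * ds_check[v].dtype.itemsize\n",
"    print(f\"{v}: {in_mem / on_disk:.1f}:1\")"
]
},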
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Total Size\n",
"The total size in GB of this dataset as stored on disk (including all metadata) is:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fsw.du(outdir)/1e9"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we count up the in-memory sizes reported by `xarray`: "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"new_ds = xr.open_dataset(fsw.get_mapper(outdir), engine='zarr', chunks={})\n",
"total=0\n",
"for i in new_ds.variables:\n",
" n = new_ds[i].size\n",
" bytes = n * 4\n",
" total += bytes\n",
" print(f\"{i:10s}: {bytes: 12d}\")\n",
"print(\"=\" * 24, f\"\\nTOTAL : {total: 12d}\")\n",
"\n",
"print(f\"In GB: {total/1e9}\")"
]
},
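{
"cell_type": "markdown",
"metadata": {},
"source": [
"If any variable is not actually stored as 4-byte values, the hand count above will be off. As a \n",
"cross-check (a sketch using xarray's own accounting), `Dataset.nbytes` reports the nominal \n",
"in-memory size using each variable's actual dtype:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Nominal in-memory size according to xarray, using each variable's real dtype.\n",
"print(f\"In GB: {new_ds.nbytes / 1e9}\")"
]
},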
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So... our total compression ratio for the entire dataset, including file system overhead and metadata is: "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"total / fsw.du(outdir) "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.8"
}
},
"nbformat": 4,
"nbformat_minor": 4
}