{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Assessing Compression\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
"**OBJECTIVE**: \n", | ||
"The objective of this chapter is to demonstrate how to read an existing dataset available as an OpenDAP endpoint, and translate it into a cloud-optimized zarr on S3. \n", | ||
"\n", | ||
"This notebook will evaluate how well the data is compressed when it is \n", | ||
"written do disk. " | ||
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preamble\n",
    "This is all stuff we are going to need: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import logging\n",
    "\n",
    "import numpy as np\n",
    "import xarray as xr\n",
    "import hvplot.xarray\n",
    "import fsspec\n",
    "import zarr\n",
    "\n",
    "logging.basicConfig(level=logging.INFO, force=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [],
   "source": [
    "%run ../utils.ipynb\n",
    "_versions(['xarray', 'dask', 'zarr', 'fsspec'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## AWS Credentials\n",
    "\n",
    "We will use the same credentials scheme we used to write the data in \n",
    "{doc}`OpenDAP_to_S3`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.environ['AWS_PROFILE'] = 'osn-rsignellbucket2'\n",
    "os.environ['AWS_ENDPOINT'] = 'https://renc.osn.xsede.org'\n",
    "\n",
    "%run ../AWS.ipynb  # handles credentials for us. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The zarr store\n",
    "\n",
    "Let's look at the zarr data store we wrote to object storage:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
"# OUTPUT Location: \n", | ||
"outdir = f's3://rsignellbucket2/testing/prism/PRISM2.zarr'\n", | ||
"# established in earlier notebooks.\n", | ||
"\n", | ||
"fsw = fsspec.filesystem('s3', \n", | ||
" anon=False, \n", | ||
" default_fill_cache=False, \n", | ||
" skip_instance_cache=True, \n", | ||
" client_kwargs={ 'endpoint_url': os.environ['AWS_S3_ENDPOINT'] },\n", | ||
")" | ||
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The zarr store is actually a folder/directory, with subfolders for variables, groups, etc. \n",
    "We can get a quick peek at that with a couple of zarr functions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "g = zarr.convenience.open_consolidated(fsw.get_mapper(outdir))  # read zarr metadata for named file.\n",
    "print(g.tree())"
   ]
  },
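  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since the store is just a directory tree, we can also list its top-level entries \n",
    "directly. This is a quick illustration using the `fsw` filesystem defined above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# One subdirectory per variable, plus consolidated-metadata objects:\n",
    "fsw.ls(outdir)"
   ]
  },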
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Sizing the chunks\n",
    "\n",
    "Using the filesystem utilities, build a dataset of file sizes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "flist = fsw.glob(f'{outdir}/tmx/*')\n",
    "fsize = [fsw.size(f) for f in flist]\n",
    "da = xr.DataArray(data=np.array(fsize)/1e6, name='size')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot a histogram of sizes for data files in this zarr store:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "da.hvplot.hist(title='Compressed object sizes (MB) for \"tmx\" variable', grid=True)"
   ]
  },
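  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a cross-check on the histogram, we can compute the nominal (uncompressed) \n",
    "chunk size directly from the zarr metadata. This is a quick sketch using the \n",
    "`g` group opened above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Nominal chunk size = elements per chunk * bytes per element.\n",
    "tmx = g['tmx']\n",
    "print(f\"chunk shape: {tmx.chunks}, dtype: {tmx.dtype}\")\n",
    "print(f\"nominal chunk size: {np.prod(tmx.chunks) * tmx.dtype.itemsize / 1e6:.1f} MB\")"
   ]
  },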
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that most of the individual chunks are just over 4MB in size. Compare this with \n",
    "the in-memory size of each chunk, as reported by `xarray` -- 34MB per chunk. \n",
    "\n",
    "This tells us that we get a compression ratio of nearly 9:1 on this particular data. \n",
    "\n",
    "Let's look at another variable in this dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "flist = fsw.glob(f'{outdir}/ppt/*')\n",
    "fsize = [fsw.size(f) for f in flist]\n",
    "da = xr.DataArray(data=np.array(fsize)/1e6, name='size')\n",
    "da.hvplot.hist(title='Compressed object sizes (MB) for \"ppt\" variable', grid=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This variable did not compress quite as well. In the worst case, several chunks are about\n",
    "14MB each. This is still a respectable 2.5:1 compression ratio for this variable. \n"
   ]
  },
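  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Rather than checking one variable at a time, we can loop over every array in the \n",
    "store and estimate its compression ratio: `nbytes` gives the uncompressed size from \n",
    "the zarr metadata, and the glob totals the bytes actually stored. This is a rough \n",
    "sketch -- the glob also picks up small metadata objects, which inflate the stored \n",
    "size only slightly:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for name in g.array_keys():\n",
    "    # Total bytes actually stored for this variable (chunks + metadata objects).\n",
    "    stored = sum(fsw.size(f) for f in fsw.glob(f'{outdir}/{name}/*'))\n",
    "    print(f\"{name:10s} {g[name].nbytes / stored:6.1f}:1\")"
   ]
  },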
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Total Size\n",
    "The total size in GB of this dataset as stored on disk (including all metadata) is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fsw.du(outdir)/1e9"
   ]
  },
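  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also break that total down by variable. A short sketch using the same \n",
    "filesystem utilities (`du` on each top-level entry of the store):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# On-disk GB for each top-level entry of the zarr store.\n",
    "for path in sorted(fsw.ls(outdir)):\n",
    "    print(f\"{path.split('/')[-1]:12s} {fsw.du(path)/1e9:8.3f} GB\")"
   ]
  },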
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we count up the in-memory sizes reported by `xarray`: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
"new_ds = xr.open_dataset(fsw.get_mapper(outdir), engine='zarr', chunks={})\n", | ||
"total=0\n", | ||
"for i in new_ds.variables:\n", | ||
" n = new_ds[i].size\n", | ||
" bytes = n * 4\n", | ||
" total += bytes\n", | ||
" print(f\"{i:10s}: {bytes: 12d}\")\n", | ||
"print(\"=\" * 24, f\"\\nTOTAL : {total: 12d}\")\n", | ||
"\n", | ||
"print(f\"In GB: {total/1e9}\")" | ||
] | ||
}, | ||
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So... our total compression ratio for the entire dataset, including file system overhead and metadata, is: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "total / fsw.du(outdir)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}