From 0fa6be5ee3dab33d25d5c946ea83254db0033d78 Mon Sep 17 00:00:00 2001 From: Keith Doore Date: Fri, 27 Dec 2024 15:01:53 -0600 Subject: [PATCH] Virtual Zarr notebook --- .gitignore | 3 +- 201/VirtualZarr.ipynb | 811 ++++++++++++++++++++++++++++++++++++++++++ _toc.yml | 2 +- back/Glossary.md | 3 + env.yml | 1 + 5 files changed, 818 insertions(+), 2 deletions(-) create mode 100644 201/VirtualZarr.ipynb diff --git a/.gitignore b/.gitignore index 9f74371..f138f50 100644 --- a/.gitignore +++ b/.gitignore @@ -8,7 +8,8 @@ scratch.ipynb # The built book -- never check into the repo. _build - +# Temporary files generated from notebooks +201/virtual_zarr/ # IPython & Jupyter profile_default/ diff --git a/201/VirtualZarr.ipynb b/201/VirtualZarr.ipynb new file mode 100644 index 0000000..de04f19 --- /dev/null +++ b/201/VirtualZarr.ipynb @@ -0,0 +1,811 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "f83aa4d9-364e-42b2-8de6-f31bf9034f1c", + "metadata": {}, + "source": [ + "# Generating a Virtual Zarr Store" + ] + }, + { + "cell_type": "markdown", + "id": "a670e170-eaf4-4158-bf02-a96b13f3f935", + "metadata": {}, + "source": [ + "::::{margin}\n", + ":::{note}\n", + "This notebook builds off the [Kerchunk](https://fsspec.github.io/kerchunk/index.html) and [VirtualiZarr](https://virtualizarr.readthedocs.io/en/stable/index.html) docs.\n", + ":::\n", + "::::" + ] + }, + { + "cell_type": "markdown", + "id": "afb02d98-a534-4cb4-9c37-0250aa2f78d9", + "metadata": {}, + "source": [ + "The objective of this notebook is to learn how to create a virtual Zarr store for a collection of NetCDF files that together make up a complete data set.\n", + "To do this, we will use [Kerchunk](https://fsspec.github.io/kerchunk/index.html) and [VirtualiZarr](https://virtualizarr.readthedocs.io/en/stable/index.html).\n", + "As these two packages can both create virtual Zarr stores but do it in different ways, we will utilize them both to show how they compare in combination with 
[Dask](https://www.dask.org/) for parallel execution." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a7cf6a00-e79f-400e-b9a6-e60858d99a3c", + "metadata": {}, + "outputs": [], + "source": [ + "import fsspec\n", + "import xarray as xr\n", + "import ujson\n", + "import time\n", + "import kerchunk.hdf\n", + "import kerchunk.combine\n", + "from virtualizarr import open_virtual_dataset\n", + "import dask.distributed\n", + "import logging" + ] + }, + { + "cell_type": "markdown", + "id": "db16a5a6-e56e-4907-baf4-da9c91b2aba0", + "metadata": {}, + "source": [ + "## Kerchunk vs VirtualiZarr\n", + "\n", + "To begin, let's explain what a virtual Zarr store even is.\n", + "A \"[**virtual Zarr store**](../back/Glossary.md#term-Virtual-Zarr-Store)\" is a virtual representation of a Zarr store generated by mapping any number of real datasets in individual files (e.g., NetCDF/HDF5, GRIB2, TIFF) together into a single, sliceable dataset via an interface layer.\n", + "This interface layer, which Kerchunk and VirtualiZarr generate, contains information about the original files (e.g., chunking, compression, data byte location, etc.) needed to efficiently access the data.\n", + "While this could be done with [`xarray.open_mfdataset`](https://docs.xarray.dev/en/stable/generated/xarray.open_mfdataset.html), we don't want to run this command every time we open the dataset as it can be a slow and expensive process.\n", + "The reason for this is that `xarray.open_mfdataset` performs many consistency checks as it runs, and it requires partially opening all of the datasets to get general metadata information on each of the individual files.\n", + "Therefore, for numerous files, this can have significant overhead, and it would be preferable to just cache these checks and metadata for more performant future reads.\n", + "This cache (specifically in Zarr format) is what a virtual Zarr store is. 
\n", + "Once we have the virtual Zarr store, we can open the combined xarray dataset using [`xarray.open_dataset`](https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html) for an almost instantaneous read.\n", + "\n", + "Now that we know what a virtual Zarr store is, let's discuss the differences between Kerchunk and VirtualiZarr and their virtual Zarr stores.\n", + "At a top level, VirtualiZarr provides almost all of the same features as Kerchunk.\n", + "The primary difference is that Kerchunk supports non-Zarr-like virtual formats, while VirtualiZarr is specifically focused on the Zarr format.\n", + "Additionally, Kerchunk creates the virtual Zarr store and represents it in memory using json formatting (the format used for Zarr metadata).\n", + "In contrast, VirtualiZarr represents the store as array-level abstractions (which can be converted to json format).\n", + "These abstractions can be cleanly wrapped by xarray for easy use of `xarray.concat` and `xarray.merge` commands to combine virtual Zarr stores.\n", + "A nice table comparing the two packages can be found in the [VirtualiZarr FAQs](https://virtualizarr.readthedocs.io/en/stable/faq.html#how-do-virtualizarr-and-kerchunk-compare), which shows how the two packages represent virtual Zarr stores and their comparative syntax." + ] + }, + { + "cell_type": "markdown", + "id": "6eb26405-82de-41b3-b552-55a34fd2b1be", + "metadata": {}, + "source": [ + "## Spin up Dask Cluster\n", + "\n", + "To run the virtual Zarr creation in parallel, we need to spin up a Dask cluster to schedule the various workers."
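To make the idea of cached reference metadata concrete before we start, here is a minimal, hand-written sketch of the kind of entries a virtual Zarr store holds. The structure mirrors the Kerchunk version-1 reference format, but every path, offset, and length below is made up for illustration; real references are generated by Kerchunk or VirtualiZarr, never written by hand.

```python
# Illustrative only: a minimal, hypothetical Kerchunk-style reference dict.
# The bucket name, offsets, and lengths are invented for this sketch.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        # Chunk key -> [file URL, byte offset, byte length]
        "pr/0.0.0": ["s3://bucket/pr_1980.nc", 8192, 49152],
    },
}

# Each chunk entry tells a reader where the bytes live in the original file,
# so no consistency checks or partial opens are needed on later reads.
path, offset, length = refs["refs"]["pr/0.0.0"]
print(f"read {length} bytes at offset {offset} from {path}")
```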
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "88de9acf-7cb6-4c0d-a918-003eaca707e1", + "metadata": {}, + "outputs": [], + "source": [ + "cluster = dask.distributed.LocalCluster(\n", + " n_workers=16,\n", + " threads_per_worker=1, \n", + " silence_logs=logging.ERROR\n", + ")\n", + "client = dask.distributed.Client(cluster)\n", + "client" + ] + }, + { + "cell_type": "markdown", + "id": "23951132-87c8-4f67-9cb0-2ab36bea7b7c", + "metadata": { + "editable": true, + "slideshow": { + "slide_type": "" + }, + "tags": [] + }, + "source": [ + "## Example Comparison\n", + "\n", + "With our Dask cluster ready, let's see how Kerchunk and VirtualiZarr can be utilized to generate a virtual Zarr store.\n", + "For this example, we will use the same daily gridMET NetCDF data as used in the [Writing Chunked File tutorial](../101/WriteChunkedFiles.ipynb).\n", + "Only this time, we will use all of the variables, not just precipitation.\n", + "These include:\n", + " - precipitation,\n", + " - maximum relative humidity,\n", + " - minimum relative humidity,\n", + " - specific humidity,\n", + " - downward shortwave radiation,\n", + " - minimum air temperature,\n", + " - maximum air temperature,\n", + " - wind direction, and\n", + " - wind speed.\n", + " \n", + "The data is currently hosted on the HyTEST OSN as a collection of NetCDF files.\n", + "To access the data with both Kerchunk and VirtualiZarr, we will use [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) to get the list of files that we want to combine into a virtual Zarr store.\n", + "\n", + "First, we need to create the file system for accessing the files, and a second one for outputting the virtual Zarr store.\n", + "\n", + "```{note}\n", + "We will exclude the year 2019 for now and use it later to show how to append virtual Zarr stores.\n", + "Also, we will not use 2020 as it is a partial year with different chunking than the other 40 years, which is currently incompatible with Kerchunk and 
VirtualiZarr.\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fcd0b92c-4943-419a-a85d-907a84764893", + "metadata": { + "editable": true, + "scrolled": true, + "slideshow": { + "slide_type": "" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "# These reader options will be needed for VirtualiZarr\n", + "# We created them here to show how they fold into fsspec\n", + "reader_options = {\n", + " 'storage_options': {\n", + " 'anon': True, \n", + " 'client_kwargs': {\n", + " 'endpoint_url': 'https://usgs.osn.mghpcc.org/'\n", + " }\n", + " }\n", + "}\n", + "\n", + "fs = fsspec.filesystem(\n", + " protocol='s3',\n", + " **reader_options['storage_options']\n", + ")\n", + "\n", + "fs_local = fsspec.filesystem('')\n", + "# Make directories to save the virtual zarr stores\n", + "fs_local.mkdir('virtual_zarr/kerchunk')\n", + "fs_local.mkdir('virtual_zarr/virtualizarr')\n", + "\n", + "file_glob = fs.glob('s3://mdmf/gdp/netcdf/gridmet/gridmet/*198*.nc')\n", + "file_glob = [file for file in file_glob if (('2020' not in file) and ('2019' not in file))]" + ] + }, + { + "cell_type": "markdown", + "id": "5c816bf0-6268-41e7-b159-e0b1e91c3ffa", + "metadata": {}, + "source": [ + "Now, we are ready to generate the virtual Zarr stores.\n", + "For both Kerchunk and VirtualiZarr ([for now](https://virtualizarr.readthedocs.io/en/stable/usage.html#opening-files-as-virtual-datasets)), this consists of two steps:\n", + "\n", + "1) Convert each original data file into an individual virtual Zarr store,\n", + "2) Combine the individual virtual Zarr stores into a single combined virtual Zarr store.\n", + "\n", + "We will show these two steps separately and how they are done for each package."
+ ] + }, + { + "cell_type": "markdown", + "id": "41f65cc2-6a43-455d-88fd-69e96d5ab020", + "metadata": {}, + "source": [ + "### Generate Individual Virtual Zarr Stores\n", + "\n", + "#### Kerchunk\n", + "\n", + "To generate the individual virtual Zarr stores with Kerchunk, we will use [`kerchunk.hdf.SingleHdf5ToZarr`](https://fsspec.github.io/kerchunk/reference.html#kerchunk.hdf.SingleHdf5ToZarr), which translates the content of one HDF5 file into Zarr metadata.\n", + "Other translators exist in Kerchunk that can convert GeoTIFFs and NetCDF3 files.\n", + "However, as we are looking at NetCDF4 files (a specific version of an HDF5 file), we will use the HDF5 translator.\n", + "As this only translates one file, we can make a collection of [`dask.delayed`](https://docs.dask.org/en/stable/delayed.html) objects that wrap the `SingleHdf5ToZarr` call to run it for all files in parallel." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2289fac1-96a3-424a-89f5-e7d52c5c2006", + "metadata": { + "editable": true, + "scrolled": true, + "slideshow": { + "slide_type": "" + }, + "tags": [ + "scroll-output" + ] + }, + "outputs": [], + "source": [ + "# Make a function to run in parallel with dask\n", + "@dask.delayed\n", + "def generate_single_virtual_zarr(file):\n", + " with fs.open(file) as hdf:\n", + " h5chunks = kerchunk.hdf.SingleHdf5ToZarr(hdf, file, inline_threshold=0)\n", + " return h5chunks.translate()\n", + "\n", + "# Time the duration for later comparison\n", + "t0 = time.time()\n", + "\n", + "# Generate Dask Delayed objects\n", + "tasks = [generate_single_virtual_zarr(file) for file in file_glob]\n", + "# Compute the delayed objects\n", + "single_virtual_zarrs = dask.compute(*tasks)\n", + "\n", + "kerchunk_time = time.time() - t0\n", + "\n", + "single_virtual_zarrs[0]" + ] + }, + { + "cell_type": "markdown", + "id": "e288b09f-ae7d-48d0-afff-e50ba278f086", + "metadata": {}, + "source": [ + "Notice that the output for a virtualization of a single NetCDF 
is a json-style dictionary, where the coordinate data is actually kept in the dictionary, while the data is stored as a file pointer and the byte range for each chunk." + ] + }, + { + "cell_type": "markdown", + "id": "b8bfcf7b-77ae-4fa9-8c8c-e6e6a4a4002c", + "metadata": {}, + "source": [ + "#### VirtualiZarr\n", + "\n", + "To generate the individual virtual Zarr stores with VirtualiZarr, we will use [`virtualizarr.open_virtual_dataset`](https://virtualizarr.readthedocs.io/en/stable/generated/virtualizarr.backend.open_virtual_dataset.html#virtualizarr-backend-open-virtual-dataset), which can infer what type of file we are reading instead of us having to specify.\n", + "Like Kerchunk, this only translates one file at a time.\n", + "So, we can make a collection of [`dask.delayed`](https://docs.dask.org/en/stable/delayed.html) objects that wrap `open_virtual_dataset` to run it for all files in parallel.\n", + "\n", + "```{important}\n", + "When reading in the individual files as virtual datasets, it is critical to include the `loadable_variables` keyword.\n", + "The keyword should be set to a list of the coordinate names.\n", + "By adding this keyword, the coordinates are read into memory rather than being loaded as virtual data.\n", + "This can make a massive difference in the next steps of (1) concatenation, as it gives the coordinates indexes, and (2) serialization of the virtual Zarr store, as it saves the in-memory coordinates directly to the store rather than a pointer.\n", + "Also, if this is not included, coordinates of different sizes will not be able to be concatenated due to potential chunking differences.\n", + "The only downside is that it can slightly increase the time it takes to initially read the virtual datasets.\n", + "However, this slowdown is more than worth the future convenience of having the coords in-memory when reading in the virtual Zarr store.\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": 
"0d998158-c484-41db-a476-2bd8ffd2c501", + "metadata": { + "editable": true, + "slideshow": { + "slide_type": "" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "t0 = time.time()\n", + "\n", + "tasks = [\n", + " dask.delayed(open_virtual_dataset)(\n", + " f's3://{file}',\n", + " indexes={},\n", + " loadable_variables=['day', 'lat', 'lon', 'crs'],\n", + " decode_times=True,\n", + " reader_options=reader_options\n", + " )\n", + " for file in file_glob\n", + "]\n", + "\n", + "virtual_datasets = dask.compute(*tasks)\n", + "\n", + "virtualizarr_time = time.time() - t0\n", + "\n", + "virtual_datasets[0]" + ] + }, + { + "cell_type": "markdown", + "id": "79eb52f0-aec7-4d65-bb02-918e22797483", + "metadata": {}, + "source": [ + "Notice that the output for a virtualization of a single NetCDF is now an `xarray.Dataset`, where the data is a [`ManifestArray`](https://virtualizarr.readthedocs.io/en/stable/generated/virtualizarr.manifests.ManifestArray.html) object.\n", + "This `ManifestArray` contains [`ChunkManifest`](https://virtualizarr.readthedocs.io/en/stable/generated/virtualizarr.manifests.ChunkManifest.html#virtualizarr.manifests.ChunkManifest) objects that hold the same info as the Kerchunk json format (i.e., a file pointer and the byte range for each chunk), but allow it to be nicely wrapped by xarray."
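As an illustration of what a `ChunkManifest` holds, the same per-chunk information can be sketched as a plain dict. Everything below (paths, offsets, lengths) is invented for the sketch; a real manifest is built by VirtualiZarr from the files themselves.

```python
# Illustrative only: the kind of per-chunk entries a chunk manifest holds,
# shown as a plain dict with made-up paths and byte ranges.
manifest = {
    "0.0.0": {"path": "s3://bucket/pr_1980.nc", "offset": 8192, "length": 49152},
    "1.0.0": {"path": "s3://bucket/pr_1980.nc", "offset": 57344, "length": 49152},
}

# Same information as the Kerchunk reference format, but organized per array,
# which is what lets xarray wrap each variable cleanly.
total_bytes = sum(entry["length"] for entry in manifest.values())
print(f"{len(manifest)} chunks, {total_bytes} bytes referenced")
```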
+ ] + }, + { + "cell_type": "markdown", + "id": "ba9eec47-691e-40b7-a255-d7684eb31cc2", + "metadata": {}, + "source": [ + "### Combine Individual Virtual Zarr Stores\n", + "\n", + "#### Kerchunk\n", + "\n", + "To combine the individual virtual Zarr stores into one virtual Zarr store with Kerchunk, we will use [`kerchunk.combine.MultiZarrToZarr`](https://fsspec.github.io/kerchunk/reference.html#kerchunk.combine.MultiZarrToZarr), which combines the content of multiple virtual Zarr stores into a single virtual Zarr store.\n", + "This call requires feeding `MultiZarrToZarr` the remote access info that we needed for our file system, along with the dimension we want to combine along." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4d4e92e9-3fe4-41c4-ab42-ea2133396f82", + "metadata": { + "editable": true, + "scrolled": true, + "slideshow": { + "slide_type": "" + }, + "tags": [ + "scroll-output" + ] + }, + "outputs": [], + "source": [ + "t0 = time.time()\n", + "\n", + "mzz = kerchunk.combine.MultiZarrToZarr(\n", + " single_virtual_zarrs,\n", + " remote_protocol='s3',\n", + " remote_options=reader_options['storage_options'],\n", + " concat_dims=[\"day\"]\n", + ")\n", + "\n", + "out = mzz.translate()\n", + "\n", + "# Save the virtual Zarr store, serialized as json\n", + "with fs_local.open('virtual_zarr/kerchunk/gridmet.json', 'wb') as f:\n", + " f.write(ujson.dumps(out).encode())\n", + "\n", + "kerchunk_time += time.time() - t0\n", + "\n", + "out" + ] + }, + { + "cell_type": "markdown", + "id": "cf64912b-be1f-42ae-bc77-a1c07b04ead2", + "metadata": {}, + "source": [ + "Again, notice the output type is in a json format with the coords in the dictionary and the data chunks having pointers, but this time all chunks are in one dictionary."
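To make the "all chunks in one dictionary" point concrete, here is a small sketch of inspecting a combined Kerchunk-style reference dict. The variable names echo the gridMET data, but the paths, byte ranges, and the exact key layout are simplified assumptions for illustration.

```python
from collections import Counter

# Illustrative only: a tiny combined reference dict with made-up entries.
# Chunk keys are "<variable>/<chunk index>"; dotted keys hold Zarr metadata.
combined = {
    "refs": {
        "pr/.zarray": "{}",
        "pr/0.0.0": ["s3://bucket/pr_1980.nc", 0, 100],
        "pr/1.0.0": ["s3://bucket/pr_1981.nc", 0, 100],
        "tmmx/0.0.0": ["s3://bucket/tmmx_1980.nc", 0, 100],
    }
}

# Count data chunks per variable, skipping .zarray/.zattrs metadata keys
chunks = Counter(
    key.split("/")[0]
    for key in combined["refs"]
    if not key.split("/")[-1].startswith(".")
)
print(chunks)
```

Note how chunks from different source files ("pr_1980.nc" and "pr_1981.nc") now sit side by side under the same variable in a single dictionary.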
+ ] + }, + { + "cell_type": "markdown", + "id": "a08f6432-3073-484a-9bc9-86d46d5b0324", + "metadata": {}, + "source": [ + "#### VirtualiZarr\n", + "\n", + "To combine the virtual datasets from VirtualiZarr, we can just use [`xarray.combine_by_coords`](https://docs.xarray.dev/en/stable/generated/xarray.combine_by_coords.html), which will auto-magically combine the virtual datasets together." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a1331f08-5029-4ae5-aed8-4a6ccebefe4a", + "metadata": {}, + "outputs": [], + "source": [ + "t0 = time.time()\n", + "\n", + "virtual_ds = xr.combine_by_coords(virtual_datasets, coords='minimal', compat='override', combine_attrs='override')\n", + "\n", + "# Save the virtual Zarr store, serialized as json\n", + "virtual_ds.virtualize.to_kerchunk('virtual_zarr/virtualizarr/gridmet.json', format='json')\n", + "\n", + "virtualizarr_time += time.time() - t0\n", + "\n", + "virtual_ds" + ] + }, + { + "cell_type": "markdown", + "id": "a9b603bd-f7ed-40a9-8f4d-97a8f917e0aa", + "metadata": {}, + "source": [ + "Notice that when we saved the virtual dataset, we converted it to the Kerchunk format."
+ ] + }, + { + "cell_type": "markdown", + "id": "bf70fafa-c21e-446a-b856-8d747f4f9da4", + "metadata": {}, + "source": [ + "### Opening the Virtual Zarr Stores\n", + "\n", + "To open the virtual Zarr stores, we can use the same method for both stores as we converted to Kerchunk format when saving from VirtualiZarr.\n", + "\n", + "#### Kerchunk" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "368b69da-fe1f-43a1-a46d-ea89ceadbe70", + "metadata": {}, + "outputs": [], + "source": [ + "t0 = time.time()\n", + "\n", + "ds = xr.open_dataset(\n", + " 'virtual_zarr/kerchunk/gridmet.json',\n", + " chunks={},\n", + " engine=\"kerchunk\",\n", + " backend_kwargs={\n", + " \"storage_options\": {\n", + " \"remote_protocol\": \"s3\",\n", + " \"remote_options\": reader_options['storage_options']\n", + " },\n", + " }\n", + ")\n", + "\n", + "kerchunk_read_time = time.time() - t0\n", + "\n", + "ds" + ] + }, + { + "cell_type": "markdown", + "id": "25d4365f-7b7d-4be4-a97b-3db4b6a17a38", + "metadata": {}, + "source": [ + "#### VirtualiZarr" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "26a9e7c9-6e58-4fd2-896a-d4b0dbbe2af1", + "metadata": {}, + "outputs": [], + "source": [ + "t0 = time.time()\n", + "\n", + "ds = xr.open_dataset(\n", + " 'virtual_zarr/virtualizarr/gridmet.json',\n", + " chunks={},\n", + " engine=\"kerchunk\",\n", + " backend_kwargs={\n", + " \"storage_options\": {\n", + " \"remote_protocol\": \"s3\",\n", + " \"remote_options\": reader_options['storage_options']\n", + " },\n", + " }\n", + ")\n", + "\n", + "virtualizarr_read_time = time.time() - t0\n", + "\n", + "ds" + ] + }, + { + "cell_type": "markdown", + "id": "cdd0e99a-f589-46be-ab84-9cf346aa81a1", + "metadata": {}, + "source": [ + "### Reading with `xarray.open_mfdataset`\n", + "\n", + "As a comparison of read times, let's also compile the dataset using [`xarray.open_mfdataset`](https://docs.xarray.dev/en/stable/generated/xarray.open_mfdataset.html) in parallel with 
Dask.\n", + "This way we can see if we will be saving time in the future by having the compiled virtual Zarr for faster reads." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fd10f6f4-f380-4f01-be6f-2b28bc1c9d4c", + "metadata": {}, + "outputs": [], + "source": [ + "t0 = time.time()\n", + "\n", + "ds = xr.open_mfdataset(\n", + " [fs.open(file) for file in file_glob],\n", + " chunks={},\n", + " parallel=True,\n", + " engine='h5netcdf'\n", + ")\n", + "\n", + "open_mfdataset_time = time.time() - t0\n", + "\n", + "ds" + ] + }, + { + "cell_type": "markdown", + "id": "0210ea1d-b59b-4466-b65e-1f7da4b698b2", + "metadata": {}, + "source": [ + "Now, let's compare the computational times!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9b414bd2-7ed9-4b2b-bcdb-403e1971a438", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"Kerchunk virtual Zarr creation time: \"\n", + " f\"{kerchunk_time:.0f}s ({kerchunk_time/60:.1f} min)\")\n", + "print(\"VirtualiZarr virtual Zarr creation time: \"\n", + " f\"{virtualizarr_time:.0f}s ({virtualizarr_time/60:.1f} min)\")\n", + "print(\"open_mfdataset dataset creation time: \"\n", + " f\"{open_mfdataset_time:.0f}s ({open_mfdataset_time/60:.1f} min)\")\n", + "print(f\"Time ratio: Kerchunk to open_mfdataset = {kerchunk_time/open_mfdataset_time}\\n\"\n", + " f\" VirtualiZarr to open_mfdataset = {virtualizarr_time/open_mfdataset_time}\\n\"\n", + " f\" Kerchunk to VirtualiZarr = {kerchunk_time/virtualizarr_time}\")" + ] + }, + { + "cell_type": "markdown", + "id": "948bea17-7fd2-44bb-aed8-2f0d40056603", + "metadata": {}, + "source": [ + "As we can see, Kerchunk is about 1.6x faster than VirtualiZarr and about the same speed as `open_mfdataset` for creating the `Dataset`.\n", + "Therefore, it is definitely worth creating a virtual Zarr store in this case.\n", + "Looking at read speed after the virtual Zarr store creation:" + ] + }, + { + "cell_type": "code", + "execution_count": null, 
"id": "082d983f-7528-4dfb-8a48-a8eeaffbd92c", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"Kerchunk virtual Zarr read time: \"\n", + " f\"{kerchunk_read_time:.2f}s\")\n", + "print(\"VirtualiZarr virtual Zarr read time: \"\n", + " f\"{virtualizarr_read_time:.2f}s\")\n", + "print(\"open_mfdataset dataset read/creation time: \"\n", + " f\"{open_mfdataset_time:.0f}s ({open_mfdataset_time/60:.1f} min)\")\n", + "print(f\"Time ratio: Kerchunk to open_mfdataset = {kerchunk_read_time/open_mfdataset_time}\\n\"\n", + " f\" VirtualiZarr to open_mfdataset = {virtualizarr_read_time/open_mfdataset_time}\\n\"\n", + " f\" Kerchunk to VirtualiZarr = {kerchunk_read_time/virtualizarr_read_time}\")" + ] + }, + { + "cell_type": "markdown", + "id": "e1071966-3d4b-43fa-9274-ce2be1933c63", + "metadata": {}, + "source": [ + "From this, it is very clear that performing more than one read using either the Kerchunk or VirtualiZarr virtual Zarr store is more efficient than reading with `open_mfdataset`.\n", + "Additionally, the differences in read times between Kerchunk and VirtualiZarr, while appearing drastic, are likely not going to be significant in any workflow." + ] + }, + { + "cell_type": "markdown", + "id": "b6c44f7d-9bca-4856-a0ea-7212a48a52a2", + "metadata": {}, + "source": [ + "## Appending to Existing Virtual Zarr Store\n", + "\n", + "As noted when [introducing the gridMET data](#Example-Comparison), we did not utilize the 2019 data in order to show how to append it to a virtual Zarr store.\n", + "The ability to append more data to the virtual Zarr store is highly convenient, as plenty of datasets are continuously updated as new data becomes available.\n", + "So, let's append some data to the virtual Zarr stores we just made.\n", + "\n", + "First, we create the 2019 file glob."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6b657a5f-752c-4236-a62d-809e79c7d4c2", + "metadata": {}, + "outputs": [], + "source": [ + "file_glob_2019 = fs.glob('s3://mdmf/gdp/netcdf/gridmet/gridmet/*_2019.nc')" + ] + }, + { + "cell_type": "markdown", + "id": "17b1fbd6-5038-4a54-8ec3-cd78a57b03de", + "metadata": {}, + "source": [ + "### Create New Virtual Zarr for New File\n", + "\n", + "Next, we need to get our 2019 NetCDFs into a virtual Zarr store.\n", + "\n", + "#### Kerchunk\n", + "\n", + "We will do this for Kerchunk the same way we did before, by using [`kerchunk.hdf.SingleHdf5ToZarr`](https://fsspec.github.io/kerchunk/reference.html#kerchunk.hdf.SingleHdf5ToZarr), which translates the content of one HDF5 (NetCDF4) file into Zarr metadata." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "04faebef-edba-43ff-9db4-dccbd3bc5a3e", + "metadata": {}, + "outputs": [], + "source": [ + "tasks = [generate_single_virtual_zarr(file) for file in file_glob_2019]\n", + "single_virtual_zarrs_2019 = dask.compute(*tasks)" + ] + }, + { + "cell_type": "markdown", + "id": "f3e76b73-4e61-4a67-9253-6a84e479d260", + "metadata": {}, + "source": [ + "#### VirtualiZarr\n", + "\n", + "And for VirtualiZarr, we will use [`virtualizarr.open_virtual_dataset`](https://virtualizarr.readthedocs.io/en/stable/generated/virtualizarr.backend.open_virtual_dataset.html#virtualizarr-backend-open-virtual-dataset)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "22940c8d-10ff-467c-922d-9d1777cfdda9", + "metadata": {}, + "outputs": [], + "source": [ + "tasks = [\n", + " dask.delayed(open_virtual_dataset)(\n", + " f's3://{file}',\n", + " indexes={},\n", + " loadable_variables=['day', 'lat', 'lon', 'crs'],\n", + " decode_times=True,\n", + " reader_options=reader_options\n", + " )\n", + " for file in file_glob_2019\n", + "]\n", + "\n", + "virtual_datasets_2019 = dask.compute(*tasks)" + ] + }, + { + "cell_type": "markdown", + "id": "d9b22cd8-f39f-4f52-9bbb-3ecbaa53682a", + "metadata": {}, + "source": [ + "### Append to Existing Store\n", + "\n", + "Now, we can append the virtualized NetCDFs to our existing stores.\n", + "\n", + "#### Kerchunk\n", + "\n", + "For Kerchunk, we will still use [`kerchunk.combine.MultiZarrToZarr`](https://fsspec.github.io/kerchunk/reference.html#kerchunk.combine.MultiZarrToZarr).\n", + "However, this time we will need to use the `append` method to append our new data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "349f7e19-0635-402c-98ca-8c85f5579b9a", + "metadata": {}, + "outputs": [], + "source": [ + "# Append to the existing reference file\n", + "mzz = kerchunk.combine.MultiZarrToZarr.append(\n", + " single_virtual_zarrs_2019,\n", + " original_refs=out,\n", + " concat_dims=[\"day\"],\n", + " remote_protocol='s3',\n", + " remote_options=reader_options['storage_options'],\n", + ")\n", + "\n", + "out_2019 = mzz.translate()\n", + "\n", + "# Save the virtual Zarr store, serialized as json\n", + "with fs_local.open('virtual_zarr/kerchunk/gridmet_appended.json', 'wb') as f:\n", + " f.write(ujson.dumps(out_2019).encode())" + ] + }, + { + "cell_type": "markdown", + "id": "0e9fba22-70f3-4192-a067-d431c96237a1", + "metadata": {}, + "source": [ + "#### VirtualiZarr\n", + "\n", + "For VirtualiZarr, we can just use `xarray.concat` and `xarray.merge` like we would to combine any `xarray.Dataset`."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "030add3c-eee2-42b9-9039-3282c9b49711", + "metadata": {}, + "outputs": [], + "source": [ + "virtual_ds_2019 = xr.merge(virtual_datasets_2019, compat='override', combine_attrs='override')\n", + "virtual_ds = xr.concat([virtual_ds, virtual_ds_2019], dim='day', coords='minimal', compat='override', combine_attrs='override')\n", + "virtual_ds" + ] + }, + { + "cell_type": "markdown", + "id": "a8259334-e55d-418b-b260-893a255bbec8", + "metadata": {}, + "source": [ + "This simple use of `xarray.merge` and `xarray.concat` is the major advantage of VirtualiZarr.\n", + "Rather than having to figure out Kerchunk's syntax and commands, we can keep using xarray as we already do.\n", + "Therefore, the increase in time to create the virtual Zarr store compared to Kerchunk is likely worth it due to its native compatibility with xarray." + ] + }, + { + "cell_type": "markdown", + "id": "aff1cd3e-2f55-41ab-9dae-64f1a67f51d9", + "metadata": {}, + "source": [ + "### Double Check New Stores\n", + "\n", + "Finally, let's read in the appended stores to make sure that we correctly appended the 2019 data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "30423f47-7a4d-4ff3-9e94-8cc1268cabd7", + "metadata": {}, + "outputs": [], + "source": [ + "ds = xr.open_dataset(\n", + " 'virtual_zarr/kerchunk/gridmet_appended.json',\n", + " engine=\"kerchunk\",\n", + " chunks={},\n", + " backend_kwargs={\n", + " \"storage_options\": {\n", + " \"remote_protocol\": \"s3\",\n", + " \"remote_options\": reader_options['storage_options']\n", + " },\n", + " }\n", + ")\n", + "ds" + ] + }, + { + "cell_type": "markdown", + "id": "e2f33304-1bc2-4c28-8a2a-37c217c6b44f", + "metadata": {}, + "source": [ + "Nice!\n", + "The 2019 data is now appended and shows up on the day coordinate."
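A quick way to reason about whether an append worked is to check that the time axis grew by exactly the expected amount and stayed monotonic. Here is a sketch of that check using synthetic day values rather than the real gridMET coordinate, since 2019 is not a leap year and should contribute exactly 365 new entries.

```python
import numpy as np

# Synthetic stand-ins for the original and appended day coordinates
original_days = np.arange("1980-01-01", "2019-01-01", dtype="datetime64[D]")
appended_days = np.arange("2019-01-01", "2020-01-01", dtype="datetime64[D]")
combined = np.concatenate([original_days, appended_days])

# The combined axis should be strictly increasing with no duplicates,
# and 2019 should add 365 days ending on 2019-12-31
assert (np.diff(combined).astype(int) > 0).all()
assert appended_days.size == 365
print(combined[-1])
```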
+ ] + }, + { + "cell_type": "markdown", + "id": "446b9e9e-cc6d-432c-b822-e3dac040c66c", + "metadata": {}, + "source": [ + "## Clean Up\n", + "\n", + "Rather than deleting the virtual Zarr stores that we created, we will actually keep them for use in future tutorials.\n", + "However, we do want to conform with best practices and close our Dask client and cluster." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "433b1321-8bc7-422a-a81c-d0bf2b87fef4", + "metadata": {}, + "outputs": [], + "source": [ + "client.close()\n", + "cluster.close()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.0" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/_toc.yml b/_toc.yml index db00d40..406c8a9 100755 --- a/_toc.yml +++ b/_toc.yml @@ -12,7 +12,7 @@ chapters: - file: 201/index sections: - file: 201/RechunkingwithDask - # - file: 201/VirtualZarr + - file: 201/VirtualZarr - file: 201/AddingCRStoZarr # - file: 201/ChunkingAuxilliaryCoords # - file: 201/OptimalChunkSelection diff --git a/back/Glossary.md b/back/Glossary.md index b305d9e..34a9a70 100644 --- a/back/Glossary.md +++ b/back/Glossary.md @@ -27,4 +27,7 @@ A glossary of common terms used throughout Jupyter Book. **Stored chunks** The chunks that are physically stored on disk. +**Virtual Zarr Store** + A virtual representation of a Zarr store generated by mapping any number of real datasets in individual files (e.g., NetCDF/HDF5, GRIB2, TIFF) together into a single, sliceable dataset via an interface layer, which contains information about the original files (e.g., chunking, compression, etc.). 
\ No newline at end of file diff --git a/env.yml b/env.yml index 79d5211..2f7c369 100644 --- a/env.yml +++ b/env.yml @@ -136,6 +136,7 @@ dependencies: - ghp-import - jsonschema-with-format-nongpl - webcolors +- virtualizarr - pip: - kerchunk - rechunker