From 0fa6be5ee3dab33d25d5c946ea83254db0033d78 Mon Sep 17 00:00:00 2001 From: Keith Doore Date: Fri, 27 Dec 2024 15:01:53 -0600 Subject: [PATCH] Virtual Zarr notebook --- .gitignore | 3 +- 201/VirtualZarr.ipynb | 811 ++++++++++++++++++++++++++++++++++++++++++ _toc.yml | 2 +- back/Glossary.md | 3 + env.yml | 1 + 5 files changed, 818 insertions(+), 2 deletions(-) create mode 100644 201/VirtualZarr.ipynb diff --git a/.gitignore b/.gitignore index 9f74371..f138f50 100644 --- a/.gitignore +++ b/.gitignore @@ -8,7 +8,8 @@ scratch.ipynb # The built book -- never check into the repo. _build - +# Temporary files generated from notebooks +201/virtual_zarr/ # IPython & Jupyter profile_default/ diff --git a/201/VirtualZarr.ipynb b/201/VirtualZarr.ipynb new file mode 100644 index 0000000..de04f19 --- /dev/null +++ b/201/VirtualZarr.ipynb @@ -0,0 +1,811 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "f83aa4d9-364e-42b2-8de6-f31bf9034f1c", + "metadata": {}, + "source": [ + "# Generating a Virtual Zarr Store" + ] + }, + { + "cell_type": "markdown", + "id": "a670e170-eaf4-4158-bf02-a96b13f3f935", + "metadata": {}, + "source": [ + "::::{margin}\n", + ":::{note}\n", + "This notebook builds off the [Kerchunk](https://fsspec.github.io/kerchunk/index.html) and [VirtualiZarr](https://virtualizarr.readthedocs.io/en/stable/index.html) docs.\n", + ":::\n", + "::::" + ] + }, + { + "cell_type": "markdown", + "id": "afb02d98-a534-4cb4-9c37-0250aa2f78d9", + "metadata": {}, + "source": [ + "The objective of this notebook is to learn how to create a virtual Zarr store for a collection of NetCDF files that together make up a complete data set.\n", + "To do this, we will use [Kerchunk](https://fsspec.github.io/kerchunk/index.html) and [VirtualiZarr](https://virtualizarr.readthedocs.io/en/stable/index.html).\n", + "As these two packages can both create virtual Zarr stores but do it in different ways, we will utilize them both to show how they compare in combination with 
[Dask](https://www.dask.org/) for parallel execution." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a7cf6a00-e79f-400e-b9a6-e60858d99a3c", + "metadata": {}, + "outputs": [], + "source": [ + "import fsspec\n", + "import xarray as xr\n", + "import ujson\n", + "import time\n", + "import kerchunk.hdf\n", + "import kerchunk.combine\n", + "from virtualizarr import open_virtual_dataset\n", + "import dask.distributed\n", + "import logging" + ] + }, + { + "cell_type": "markdown", + "id": "db16a5a6-e56e-4907-baf4-da9c91b2aba0", + "metadata": {}, + "source": [ + "## Kerchunk vs VirtualiZarr\n", + "\n", + "To begin, let's explain what a virtual Zarr store even is.\n", + "A \"[**virtual Zarr store**](../back/Glossary.md#term-Virtual-Zarr-Store)\" is a virtual representation of a Zarr store generated by mapping any number of real datasets in individual files (e.g., NetCDF/HDF5, GRIB2, TIFF) together into a single, sliceable dataset via an interface layer.\n", + "This interface layer, which Kerchunk and VirtualiZarr generate, contains information about the original files (e.g., chunking, compression, data byte location, etc.) needed to efficiently access the data.\n", + "While this could be done with [`xarray.open_mfdataset`](https://docs.xarray.dev/en/stable/generated/xarray.open_mfdataset.html), we don't want to run this command every time we open the dataset as it can be a slow and expensive process.\n", + "The reason for this is that `xarray.open_mfdataset` performs many consistency checks as it runs, and it requires partially opening all of the datasets to get general metadata information on each of the individual files.\n", + "Therefore, for numerous files, this can have significant overhead, and it would be preferable to just cache these checks and metadata for more performant future reads.\n", + "This cache (specifically in Zarr format) is what a virtual Zarr store is. 
\n", + "Once we have the virtual Zarr store, we can open the combined xarray dataset using [`xarray.open_dataset`](https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html) for an almost instantaneous read.\n", + "\n", + "Now that we know what a virtual Zarr store is, let's discuss the differences between Kerchunk and VirtualiZarr and their virtual Zarr stores.\n", + "At a top level, VirtualiZarr provides almost all of the same features as Kerchunk.\n", + "The primary difference is that Kerchunk supports non-Zarr-like virtual formats, while VirtualiZarr is specifically focused on the Zarr format.\n", + "Additionally, Kerchunk creates the virtual Zarr store and represents it in memory using json formatting (the format used for Zarr metadata).\n", + "In contrast, VirtualiZarr represents the store as array-level abstractions (which can be converted to json format).\n", + "These abstractions can be cleanly wrapped by xarray for easy use of `xarray.concat` and `xarray.merge` commands to combine virtual Zarr stores.\n", + "A nice table comparing the two packages can be found in the [VirtualiZarr FAQs](https://virtualizarr.readthedocs.io/en/stable/faq.html#how-do-virtualizarr-and-kerchunk-compare), which shows how the two packages represent virtual Zarr stores and their comparative syntax." + ] + }, + { + "cell_type": "markdown", + "id": "6eb26405-82de-41b3-b552-55a34fd2b1be", + "metadata": {}, + "source": [ + "## Spin up Dask Cluster\n", + "\n", + "To run the virtual Zarr creation in parallel, we need to spin up a Dask cluster to schedule the various workers."
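To make the idea of cached reference metadata concrete before we start, here is a minimal, hand-written sketch of the kind of entries a virtual Zarr store holds. The structure mirrors the Kerchunk version-1 reference format, but every path, offset, and length below is made up for illustration; real references are generated by Kerchunk or VirtualiZarr, never written by hand.

```python
# Illustrative only: a minimal, hypothetical Kerchunk-style reference dict.
# The bucket name, offsets, and lengths are invented for this sketch.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        # Chunk key -> [file URL, byte offset, byte length]
        "pr/0.0.0": ["s3://bucket/pr_1980.nc", 8192, 49152],
    },
}

# Each chunk entry tells a reader where the bytes live in the original file,
# so no consistency checks or partial opens are needed on later reads.
path, offset, length = refs["refs"]["pr/0.0.0"]
print(f"read {length} bytes at offset {offset} from {path}")
```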
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "88de9acf-7cb6-4c0d-a918-003eaca707e1", + "metadata": {}, + "outputs": [], + "source": [ + "cluster = dask.distributed.LocalCluster(\n", + " n_workers=16,\n", + " threads_per_worker=1, \n", + " silence_logs=logging.ERROR\n", + ")\n", + "client = dask.distributed.Client(cluster)\n", + "client" + ] + }, + { + "cell_type": "markdown", + "id": "23951132-87c8-4f67-9cb0-2ab36bea7b7c", + "metadata": { + "editable": true, + "slideshow": { + "slide_type": "" + }, + "tags": [] + }, + "source": [ + "## Example Comparison\n", + "\n", + "With our Dask cluster ready, let's see how Kerchunk and VirtualiZarr can be utilized to generate a virtual Zarr store.\n", + "For this example, we will use the same daily gridMET NetCDF data as used in the [Writing Chunked File tutorial](../101/WriteChunkedFiles.ipynb).\n", + "Only this time, we will use all of the variables, not just precipitation.\n", + "These include:\n", + " - precipitation,\n", + " - maximum relative humidity,\n", + " - minimum relative humidity,\n", + " - specific humidity,\n", + " - downward shortwave radiation,\n", + " - minimum air temperature,\n", + " - maximum air temperature,\n", + " - wind direction, and\n", + " - wind speed.\n", + " \n", + "The data is currently hosted on the HyTEST OSN as a collection of NetCDF files.\n", + "To access the data with both Kerchunk and VirtualiZarr, we will use [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) to get the list of files that we want to combine into a virtual Zarr store.\n", + "\n", + "First, we need to create the file system for accessing the files, and a second one for outputting the virtual Zarr store.\n", + "\n", + "```{note}\n", + "We will exclude the year 2019 for now and use it later to show how to append virtual Zarr stores.\n", + "Also, we will not use 2020 as it is a partial year with different chunking than the other 40 years, which is currently incompatible with Kerchunk and 
VirtualiZarr.\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fcd0b92c-4943-419a-a85d-907a84764893", + "metadata": { + "editable": true, + "scrolled": true, + "slideshow": { + "slide_type": "" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "# These reader options will be needed for VirtualiZarr\n", + "# We created them here to show how they fold into fsspec\n", + "reader_options = {\n", + " 'storage_options': {\n", + " 'anon': True, \n", + " 'client_kwargs': {\n", + " 'endpoint_url': 'https://usgs.osn.mghpcc.org/'\n", + " }\n", + " }\n", + "}\n", + "\n", + "fs = fsspec.filesystem(\n", + " protocol='s3',\n", + " **reader_options['storage_options']\n", + ")\n", + "\n", + "fs_local = fsspec.filesystem('')\n", + "# Make directories to save the virtual zarr stores\n", + "fs_local.mkdir('virtual_zarr/kerchunk')\n", + "fs_local.mkdir('virtual_zarr/virtualizarr')\n", + "\n", + "file_glob = fs.glob('s3://mdmf/gdp/netcdf/gridmet/gridmet/*198*.nc')\n", + "file_glob = [file for file in file_glob if (('2020' not in file) and ('2019' not in file))]" + ] + }, + { + "cell_type": "markdown", + "id": "5c816bf0-6268-41e7-b159-e0b1e91c3ffa", + "metadata": {}, + "source": [ + "Now, we are ready to generate the virtual Zarr stores.\n", + "For both Kerchunk and VirtualiZarr ([for now](https://virtualizarr.readthedocs.io/en/stable/usage.html#opening-files-as-virtual-datasets)), this consists of two steps:\n", + "\n", + "1) Convert each original data file into an individual virtual Zarr store,\n", + "2) Combine the individual virtual Zarr stores into a single combined virtual Zarr store.\n", + "\n", + "We will show these two steps separately and how they are done for each package."
+ ] + }, + { + "cell_type": "markdown", + "id": "41f65cc2-6a43-455d-88fd-69e96d5ab020", + "metadata": {}, + "source": [ + "### Generate Individual Virtual Zarr Stores\n", + "\n", + "#### Kerchunk\n", + "\n", + "To generate the individual virtual Zarr stores with Kerchunk, we will use [`kerchunk.hdf.SingleHdf5ToZarr`](https://fsspec.github.io/kerchunk/reference.html#kerchunk.hdf.SingleHdf5ToZarr), which translates the content of one HDF5 file into Zarr metadata.\n", + "Other translators exist in Kerchunk that can convert GeoTIFFs and NetCDF3 files.\n", + "However, as we are looking at NetCDF4 files (a specific version of an HDF5 file), we will use the HDF5 translator.\n", + "As this only translates one file, we can make a collection of [`dask.delayed`](https://docs.dask.org/en/stable/delayed.html) objects that wrap the `SingleHdf5ToZarr` call to run it for all files in parallel." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2289fac1-96a3-424a-89f5-e7d52c5c2006", + "metadata": { + "editable": true, + "scrolled": true, + "slideshow": { + "slide_type": "" + }, + "tags": [ + "scroll-output" + ] + }, + "outputs": [], + "source": [ + "# Make a function to run in parallel with dask\n", + "@dask.delayed\n", + "def generate_single_virtual_zarr(file):\n", + " with fs.open(file) as hdf:\n", + " h5chunks = kerchunk.hdf.SingleHdf5ToZarr(hdf, file, inline_threshold=0)\n", + " return h5chunks.translate()\n", + "\n", + "# Time the duration for later comparison\n", + "t0 = time.time()\n", + "\n", + "# Generate Dask Delayed objects\n", + "tasks = [generate_single_virtual_zarr(file) for file in file_glob]\n", + "# Compute the delayed objects\n", + "single_virtual_zarrs = dask.compute(*tasks)\n", + "\n", + "kerchunk_time = time.time() - t0\n", + "\n", + "single_virtual_zarrs[0]" + ] + }, + { + "cell_type": "markdown", + "id": "e288b09f-ae7d-48d0-afff-e50ba278f086", + "metadata": {}, + "source": [ + "Notice that the output for a virtualization of a single NetCDF 
is a json-style dictionary, where the coordinate data is actually kept in the dictionary, while the data is stored as a file pointer and the byte range for each chunk." + ] + }, + { + "cell_type": "markdown", + "id": "b8bfcf7b-77ae-4fa9-8c8c-e6e6a4a4002c", + "metadata": {}, + "source": [ + "#### VirtualiZarr\n", + "\n", + "To generate the individual virtual Zarr stores with VirtualiZarr, we will use [`virtualizarr.open_virtual_dataset`](https://virtualizarr.readthedocs.io/en/stable/generated/virtualizarr.backend.open_virtual_dataset.html#virtualizarr-backend-open-virtual-dataset), which can infer what type of file we are reading instead of us having to specify.\n", + "Like Kerchunk, this only translates one file at a time.\n", + "So, we can make a collection of [`dask.delayed`](https://docs.dask.org/en/stable/delayed.html) objects that wrap `open_virtual_dataset` to run it for all files in parallel.\n", + "\n", + "```{important}\n", + "When reading in the individual files as virtual datasets, it is critical to include the `loadable_variables` keyword.\n", + "The keyword should be set to a list of the coordinate names.\n", + "By adding this keyword, the coordinates are read into memory rather than being loaded as virtual data.\n", + "This can make a massive difference in the next steps of (1) concatenation, as it gives the coordinates indexes, and (2) serialization of the virtual Zarr store, as it saves the in-memory coordinates directly to the store rather than a pointer.\n", + "Also, if this is not included, coordinates of different sizes will not be able to be concatenated due to potential chunking differences.\n", + "The only downside is that it can slightly increase the time it takes to initially read the virtual datasets.\n", + "However, this slowdown is more than worth the future convenience of having the coords in-memory when reading in the virtual Zarr store.\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": 
"0d998158-c484-41db-a476-2bd8ffd2c501", + "metadata": { + "editable": true, + "slideshow": { + "slide_type": "" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "t0 = time.time()\n", + "\n", + "tasks = [\n", + " dask.delayed(open_virtual_dataset)(\n", + " f's3://{file}',\n", + " indexes={},\n", + " loadable_variables=['day', 'lat', 'lon', 'crs'],\n", + " decode_times=True,\n", + " reader_options=reader_options\n", + " )\n", + " for file in file_glob\n", + "]\n", + "\n", + "virtual_datasets = dask.compute(*tasks)\n", + "\n", + "virtualizarr_time = time.time() - t0\n", + "\n", + "virtual_datasets[0]" + ] + }, + { + "cell_type": "markdown", + "id": "79eb52f0-aec7-4d65-bb02-918e22797483", + "metadata": {}, + "source": [ + "Notice that the output for a virtualization of a single NetCDF is now an `xarray.Dataset`, where the data is a [`ManifestArray`](https://virtualizarr.readthedocs.io/en/stable/generated/virtualizarr.manifests.ManifestArray.html) object.\n", + "This `ManifestArray` contains [`ChunkManifest`](https://virtualizarr.readthedocs.io/en/stable/generated/virtualizarr.manifests.ChunkManifest.html#virtualizarr.manifests.ChunkManifest) objects that hold the same info as the Kerchunk json format (i.e., a file pointer and the byte range for each chunk), but allow it to be nicely wrapped by xarray."
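As an illustration of what a `ChunkManifest` holds, the same per-chunk information can be sketched as a plain dict. Everything below (paths, offsets, lengths) is invented for the sketch; a real manifest is built by VirtualiZarr from the files themselves.

```python
# Illustrative only: the kind of per-chunk entries a chunk manifest holds,
# shown as a plain dict with made-up paths and byte ranges.
manifest = {
    "0.0.0": {"path": "s3://bucket/pr_1980.nc", "offset": 8192, "length": 49152},
    "1.0.0": {"path": "s3://bucket/pr_1980.nc", "offset": 57344, "length": 49152},
}

# Same information as the Kerchunk reference format, but organized per array,
# which is what lets xarray wrap each variable cleanly.
total_bytes = sum(entry["length"] for entry in manifest.values())
print(f"{len(manifest)} chunks, {total_bytes} bytes referenced")
```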
+ ] + }, + { + "cell_type": "markdown", + "id": "ba9eec47-691e-40b7-a255-d7684eb31cc2", + "metadata": {}, + "source": [ + "### Combine Individual Virtual Zarr Stores\n", + "\n", + "#### Kerchunk\n", + "\n", + "To combine the individual virtual Zarr stores into one virtual Zarr store with Kerchunk, we will use [`kerchunk.combine.MultiZarrToZarr`](https://fsspec.github.io/kerchunk/reference.html#kerchunk.combine.MultiZarrToZarr), which combines the content of multiple virtual Zarr stores into a single virtual Zarr store.\n", + "This call requires feeding `MultiZarrToZarr` the remote access info that we needed for our file system, along with the dimension we want to combine along." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4d4e92e9-3fe4-41c4-ab42-ea2133396f82", + "metadata": { + "editable": true, + "scrolled": true, + "slideshow": { + "slide_type": "" + }, + "tags": [ + "scroll-output" + ] + }, + "outputs": [], + "source": [ + "t0 = time.time()\n", + "\n", + "mzz = kerchunk.combine.MultiZarrToZarr(\n", + " single_virtual_zarrs,\n", + " remote_protocol='s3',\n", + " remote_options=reader_options['storage_options'],\n", + " concat_dims=[\"day\"]\n", + ")\n", + "\n", + "out = mzz.translate()\n", + "\n", + "# Save the virtual Zarr store, serialized as json\n", + "with fs_local.open('virtual_zarr/kerchunk/gridmet.json', 'wb') as f:\n", + " f.write(ujson.dumps(out).encode())\n", + "\n", + "kerchunk_time += time.time() - t0\n", + "\n", + "out" + ] + }, + { + "cell_type": "markdown", + "id": "cf64912b-be1f-42ae-bc77-a1c07b04ead2", + "metadata": {}, + "source": [ + "Again, notice the output type is in a json format with the coords in the dictionary and the data chunks having pointers, but this time all chunks are in one dictionary."
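To make the "all chunks in one dictionary" point concrete, here is a small sketch of inspecting a combined Kerchunk-style reference dict. The variable names echo the gridMET data, but the paths, byte ranges, and the exact key layout are simplified assumptions for illustration.

```python
from collections import Counter

# Illustrative only: a tiny combined reference dict with made-up entries.
# Chunk keys are "<variable>/<chunk index>"; dotted keys hold Zarr metadata.
combined = {
    "refs": {
        "pr/.zarray": "{}",
        "pr/0.0.0": ["s3://bucket/pr_1980.nc", 0, 100],
        "pr/1.0.0": ["s3://bucket/pr_1981.nc", 0, 100],
        "tmmx/0.0.0": ["s3://bucket/tmmx_1980.nc", 0, 100],
    }
}

# Count data chunks per variable, skipping .zarray/.zattrs metadata keys
chunks = Counter(
    key.split("/")[0]
    for key in combined["refs"]
    if not key.split("/")[-1].startswith(".")
)
print(chunks)
```

Note how chunks from different source files ("pr_1980.nc" and "pr_1981.nc") now sit side by side under the same variable in a single dictionary.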
+ ] + }, + { + "cell_type": "markdown", + "id": "a08f6432-3073-484a-9bc9-86d46d5b0324", + "metadata": {}, + "source": [ + "#### VirtualiZarr\n", + "\n", + "To combine the virtual datasets from VirtualiZarr, we can just use [`xarray.combine_by_coords`](https://docs.xarray.dev/en/stable/generated/xarray.combine_by_coords.html), which will auto-magically combine the virtual datasets together." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a1331f08-5029-4ae5-aed8-4a6ccebefe4a", + "metadata": {}, + "outputs": [], + "source": [ + "t0 = time.time()\n", + "\n", + "virtual_ds = xr.combine_by_coords(virtual_datasets, coords='minimal', compat='override', combine_attrs='override')\n", + "\n", + "# Save the virtual Zarr store, serialized as json\n", + "virtual_ds.virtualize.to_kerchunk('virtual_zarr/virtualizarr/gridmet.json', format='json')\n", + "\n", + "virtualizarr_time += time.time() - t0\n", + "\n", + "virtual_ds" + ] + }, + { + "cell_type": "markdown", + "id": "a9b603bd-f7ed-40a9-8f4d-97a8f917e0aa", + "metadata": {}, + "source": [ + "Notice that when we saved the virtual dataset, we converted it to the Kerchunk format."
+ ] + }, + { + "cell_type": "markdown", + "id": "bf70fafa-c21e-446a-b856-8d747f4f9da4", + "metadata": {}, + "source": [ + "### Opening the Virtual Zarr Stores\n", + "\n", + "To open the virtual Zarr stores, we can use the same method for both stores as we converted to Kerchunk format when saving from VirtualiZarr.\n", + "\n", + "#### Kerchunk" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "368b69da-fe1f-43a1-a46d-ea89ceadbe70", + "metadata": {}, + "outputs": [], + "source": [ + "t0 = time.time()\n", + "\n", + "ds = xr.open_dataset(\n", + " 'virtual_zarr/kerchunk/gridmet.json',\n", + " chunks={},\n", + " engine=\"kerchunk\",\n", + " backend_kwargs={\n", + " \"storage_options\": {\n", + " \"remote_protocol\": \"s3\",\n", + " \"remote_options\": reader_options['storage_options']\n", + " },\n", + " }\n", + ")\n", + "\n", + "kerchunk_read_time = time.time() - t0\n", + "\n", + "ds" + ] + }, + { + "cell_type": "markdown", + "id": "25d4365f-7b7d-4be4-a97b-3db4b6a17a38", + "metadata": {}, + "source": [ + "#### VirtualiZarr" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "26a9e7c9-6e58-4fd2-896a-d4b0dbbe2af1", + "metadata": {}, + "outputs": [], + "source": [ + "t0 = time.time()\n", + "\n", + "ds = xr.open_dataset(\n", + " 'virtual_zarr/virtualizarr/gridmet.json',\n", + " chunks={},\n", + " engine=\"kerchunk\",\n", + " backend_kwargs={\n", + " \"storage_options\": {\n", + " \"remote_protocol\": \"s3\",\n", + " \"remote_options\": reader_options['storage_options']\n", + " },\n", + " }\n", + ")\n", + "\n", + "virtualizarr_read_time = time.time() - t0\n", + "\n", + "ds" + ] + }, + { + "cell_type": "markdown", + "id": "cdd0e99a-f589-46be-ab84-9cf346aa81a1", + "metadata": {}, + "source": [ + "### Reading with `xarray.open_mfdataset`\n", + "\n", + "As a comparison of read times, let's also compile the dataset using [`xarray.open_mfdataset`](https://docs.xarray.dev/en/stable/generated/xarray.open_mfdataset.html) in parallel with 
Dask.\n", + "This way we can see if we will be saving time in the future by having the compiled virtual Zarr for faster reads." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fd10f6f4-f380-4f01-be6f-2b28bc1c9d4c", + "metadata": {}, + "outputs": [], + "source": [ + "t0 = time.time()\n", + "\n", + "ds = xr.open_mfdataset(\n", + " [fs.open(file) for file in file_glob],\n", + " chunks={},\n", + " parallel=True,\n", + " engine='h5netcdf'\n", + ")\n", + "\n", + "open_mfdataset_time = time.time() - t0\n", + "\n", + "ds" + ] + }, + { + "cell_type": "markdown", + "id": "0210ea1d-b59b-4466-b65e-1f7da4b698b2", + "metadata": {}, + "source": [ + "Now, let's compare the computational times!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9b414bd2-7ed9-4b2b-bcdb-403e1971a438", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"Kerchunk virtual Zarr creation time: \"\n", + " f\"{kerchunk_time:.0f}s ({kerchunk_time/60:.1f} min)\")\n", + "print(\"VirtualiZarr virtual Zarr creation time: \"\n", + " f\"{virtualizarr_time:.0f}s ({virtualizarr_time/60:.1f} min)\")\n", + "print(\"open_mfdataset dataset creation time: \"\n", + " f\"{open_mfdataset_time:.0f}s ({open_mfdataset_time/60:.1f} min)\")\n", + "print(f\"Time ratio: Kerchunk to open_mfdataset = {kerchunk_time/open_mfdataset_time}\\n\"\n", + " f\" VirtualiZarr to open_mfdataset = {virtualizarr_time/open_mfdataset_time}\\n\"\n", + " f\" Kerchunk to VirtualiZarr = {kerchunk_time/virtualizarr_time}\")" + ] + }, + { + "cell_type": "markdown", + "id": "948bea17-7fd2-44bb-aed8-2f0d40056603", + "metadata": {}, + "source": [ + "As we can see, Kerchunk is about 1.6x faster than VirtualiZarr and about the same speed as `open_mfdataset` for creating the `Dataset`.\n", + "Therefore, it is definitely worth creating a virtual Zarr store in this case.\n", + "Looking at read speed after the virtual Zarr store creation:" + ] + }, + { + "cell_type": "code", + "execution_count": null, 
"id": "082d983f-7528-4dfb-8a48-a8eeaffbd92c", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"Kerchunk virtual Zarr read time: \"\n", + " f\"{kerchunk_read_time:.2f}s\")\n", + "print(\"VirtualiZarr virtual Zarr read time: \"\n", + " f\"{virtualizarr_read_time:.2f}s\")\n", + "print(\"open_mfdataset dataset read/creation time: \"\n", + " f\"{open_mfdataset_time:.0f}s ({open_mfdataset_time/60:.1f} min)\")\n", + "print(f\"Time ratio: Kerchunk to open_mfdataset = {kerchunk_read_time/open_mfdataset_time}\\n\"\n", + " f\" VirtualiZarr to open_mfdataset = {virtualizarr_read_time/open_mfdataset_time}\\n\"\n", + " f\" Kerchunk to VirtualiZarr = {kerchunk_read_time/virtualizarr_read_time}\")" + ] + }, + { + "cell_type": "markdown", + "id": "e1071966-3d4b-43fa-9274-ce2be1933c63", + "metadata": {}, + "source": [ + "From this, it is very clear that performing more than one read using either the Kerchunk or VirtualiZarr virtual Zarr store is more efficient than reading with `open_mfdataset`.\n", + "Additionally, the differences in read times between Kerchunk and VirtualiZarr, while appearing drastic, are likely not going to be significant in any workflow." + ] + }, + { + "cell_type": "markdown", + "id": "b6c44f7d-9bca-4856-a0ea-7212a48a52a2", + "metadata": {}, + "source": [ + "## Appending to Existing Virtual Zarr Store\n", + "\n", + "As noted when [introducing the gridMET data](#Example-Comparison), we did not utilize the 2019 data in order to show how to append it to a virtual Zarr store.\n", + "The ability to append more data to the virtual Zarr store is highly convenient, as plenty of datasets are continuously updated as new data becomes available.\n", + "So, let's append some data to the virtual Zarr stores we just made.\n", + "\n", + "First, we create the 2019 file glob."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6b657a5f-752c-4236-a62d-809e79c7d4c2", + "metadata": {}, + "outputs": [], + "source": [ + "file_glob_2019 = fs.glob('s3://mdmf/gdp/netcdf/gridmet/gridmet/*_2019.nc')" + ] + }, + { + "cell_type": "markdown", + "id": "17b1fbd6-5038-4a54-8ec3-cd78a57b03de", + "metadata": {}, + "source": [ + "### Create New Virtual Zarr for New File\n", + "\n", + "Next, we need to get our 2019 NetCDFs into a virtual Zarr store.\n", + "\n", + "#### Kerchunk\n", + "\n", + "We will do this for Kerchunk the same way we did before, by using [`kerchunk.hdf.SingleHdf5ToZarr`](https://fsspec.github.io/kerchunk/reference.html#kerchunk.hdf.SingleHdf5ToZarr), which translates the content of one HDF5 (NetCDF4) file into Zarr metadata." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "04faebef-edba-43ff-9db4-dccbd3bc5a3e", + "metadata": {}, + "outputs": [], + "source": [ + "tasks = [generate_single_virtual_zarr(file) for file in file_glob_2019]\n", + "single_virtual_zarrs_2019 = dask.compute(*tasks)" + ] + }, + { + "cell_type": "markdown", + "id": "f3e76b73-4e61-4a67-9253-6a84e479d260", + "metadata": {}, + "source": [ + "#### VirtualiZarr\n", + "\n", + "And for VirtualiZarr, we will use [`virtualizarr.open_virtual_dataset`](https://virtualizarr.readthedocs.io/en/stable/generated/virtualizarr.backend.open_virtual_dataset.html#virtualizarr-backend-open-virtual-dataset)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "22940c8d-10ff-467c-922d-9d1777cfdda9", + "metadata": {}, + "outputs": [], + "source": [ + "tasks = [\n", + " dask.delayed(open_virtual_dataset)(\n", + " f's3://{file}',\n", + " indexes={},\n", + " loadable_variables=['day', 'lat', 'lon', 'crs'],\n", + " decode_times=True,\n", + " reader_options=reader_options\n", + " )\n", + " for file in file_glob_2019\n", + "]\n", + "\n", + "virtual_datasets_2019 = dask.compute(*tasks)" + ] + }, + { + "cell_type": "markdown", + "id": "d9b22cd8-f39f-4f52-9bbb-3ecbaa53682a", + "metadata": {}, + "source": [ + "### Append to Existing Store\n", + "\n", + "Now, we can append the virtualized NetCDFs to our existing stores.\n", + "\n", + "#### Kerchunk\n", + "\n", + "For Kerchunk, we will still use [`kerchunk.combine.MultiZarrToZarr`](https://fsspec.github.io/kerchunk/reference.html#kerchunk.combine.MultiZarrToZarr).\n", + "However, this time we will need to use the `append` method to append our new data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "349f7e19-0635-402c-98ca-8c85f5579b9a", + "metadata": {}, + "outputs": [], + "source": [ + "# Append to the existing reference file\n", + "mzz = kerchunk.combine.MultiZarrToZarr.append(\n", + " single_virtual_zarrs_2019,\n", + " original_refs=out,\n", + " concat_dims=[\"day\"],\n", + " remote_protocol='s3',\n", + " remote_options=reader_options['storage_options'],\n", + ")\n", + "\n", + "out_2019 = mzz.translate()\n", + "\n", + "# Save the virtual Zarr store, serialized as json\n", + "with fs_local.open('virtual_zarr/kerchunk/gridmet_appended.json', 'wb') as f:\n", + " f.write(ujson.dumps(out_2019).encode())" + ] + }, + { + "cell_type": "markdown", + "id": "0e9fba22-70f3-4192-a067-d431c96237a1", + "metadata": {}, + "source": [ + "#### VirtualiZarr\n", + "\n", + "For VirtualiZarr, we can just use `xarray.concat` and `xarray.merge` like we would to combine any `xarray.Dataset`."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "030add3c-eee2-42b9-9039-3282c9b49711", + "metadata": {}, + "outputs": [], + "source": [ + "virtual_ds_2019 = xr.merge(virtual_datasets_2019, compat='override', combine_attrs='override')\n", + "virtual_ds = xr.concat([virtual_ds, virtual_ds_2019], dim='day', coords='minimal', compat='override', combine_attrs='override')\n", + "virtual_ds" + ] + }, + { + "cell_type": "markdown", + "id": "a8259334-e55d-418b-b260-893a255bbec8", + "metadata": {}, + "source": [ + "This simple use of `xarray.merge` and `xarray.concat` is the major advantage of VirtualiZarr.\n", + "Rather than having to figure out Kerchunk's syntax and commands, we can keep using xarray as we already do.\n", + "Therefore, the increase in time to create the virtual Zarr store compared to Kerchunk is likely worth it due to its native compatibility with xarray." + ] + }, + { + "cell_type": "markdown", + "id": "aff1cd3e-2f55-41ab-9dae-64f1a67f51d9", + "metadata": {}, + "source": [ + "### Double Check New Stores\n", + "\n", + "Finally, let's read in the appended stores to make sure that we correctly appended the 2019 data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "30423f47-7a4d-4ff3-9e94-8cc1268cabd7", + "metadata": {}, + "outputs": [], + "source": [ + "ds = xr.open_dataset(\n", + " 'virtual_zarr/kerchunk/gridmet_appended.json',\n", + " engine=\"kerchunk\",\n", + " chunks={},\n", + " backend_kwargs={\n", + " \"storage_options\": {\n", + " \"remote_protocol\": \"s3\",\n", + " \"remote_options\": reader_options['storage_options']\n", + " },\n", + " }\n", + ")\n", + "ds" + ] + }, + { + "cell_type": "markdown", + "id": "e2f33304-1bc2-4c28-8a2a-37c217c6b44f", + "metadata": {}, + "source": [ + "Nice!\n", + "The 2019 data is now appended and shows up on the day coordinate."
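A quick way to reason about whether an append worked is to check that the time axis grew by exactly the expected amount and stayed monotonic. Here is a sketch of that check using synthetic day values rather than the real gridMET coordinate, since 2019 is not a leap year and should contribute exactly 365 new entries.

```python
import numpy as np

# Synthetic stand-ins for the original and appended day coordinates
original_days = np.arange("1980-01-01", "2019-01-01", dtype="datetime64[D]")
appended_days = np.arange("2019-01-01", "2020-01-01", dtype="datetime64[D]")
combined = np.concatenate([original_days, appended_days])

# The combined axis should be strictly increasing with no duplicates,
# and 2019 should add 365 days ending on 2019-12-31
assert (np.diff(combined).astype(int) > 0).all()
assert appended_days.size == 365
print(combined[-1])
```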
+ ] + }, + { + "cell_type": "markdown", + "id": "446b9e9e-cc6d-432c-b822-e3dac040c66c", + "metadata": {}, + "source": [ + "## Clean Up\n", + "\n", + "Rather than deleting the virtual Zarr stores that we created, we will actually keep them for use in future tutorials.\n", + "However, we do want to conform with best practices and close our Dask client and cluster." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "433b1321-8bc7-422a-a81c-d0bf2b87fef4", + "metadata": {}, + "outputs": [], + "source": [ + "client.close()\n", + "cluster.close()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.0" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/_toc.yml b/_toc.yml index db00d40..406c8a9 100755 --- a/_toc.yml +++ b/_toc.yml @@ -12,7 +12,7 @@ chapters: - file: 201/index sections: - file: 201/RechunkingwithDask - # - file: 201/VirtualZarr + - file: 201/VirtualZarr - file: 201/AddingCRStoZarr # - file: 201/ChunkingAuxilliaryCoords # - file: 201/OptimalChunkSelection diff --git a/back/Glossary.md b/back/Glossary.md index b305d9e..34a9a70 100644 --- a/back/Glossary.md +++ b/back/Glossary.md @@ -27,4 +27,7 @@ A glossary of common terms used throughout Jupyter Book. **Stored chunks** The chunks that are physically stored on disk. +**Virtual Zarr Store** + A virtual representation of a Zarr store generated by mapping any number of real datasets in individual files (e.g., NetCDF/HDF5, GRIB2, TIFF) together into a single, sliceable dataset via an interface layer, which contains information about the original files (e.g., chunking, compression, etc.). 
\ No newline at end of file diff --git a/env.yml b/env.yml index 79d5211..2f7c369 100644 --- a/env.yml +++ b/env.yml @@ -136,6 +136,7 @@ dependencies: - ghp-import - jsonschema-with-format-nongpl - webcolors +- virtualizarr - pip: - kerchunk - rechunker