Commit

[#7]: describes helpers a little better
g.trantham committed Apr 7, 2023
1 parent 1051c6e commit 769c51b
Showing 4 changed files with 137 additions and 17 deletions.
19 changes: 17 additions & 2 deletions AWS.ipynb
@@ -8,7 +8,14 @@
"# AWS Credentials Helper\n",
"\n",
"This notebook helps set AWS credentials based on already-specified \n",
"environment variables for profile and S3 endpoint. "
"environment variables for profile and S3 endpoint. \n",
"\n",
"Before this notebook is called, you can specify a particular profile and \n",
"endpoint you'd like to use. Do this by setting the appropriate environment\n",
"variables: `AWS_PROFILE` and `AWS_S3_ENDPOINT`. \n",
"\n",
"If these environment variables are not set, defaults will be used (as specified\n",
"in the code block below). "
]
},
{
@@ -39,10 +46,18 @@
" logging.error(\"Problem parsing the AWS credentials file. \")"
]
},
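The profile/endpoint defaults described in the markdown cell above follow a common pattern; a minimal sketch, assuming hypothetical default values (the real defaults live in the notebook's code block):

```python
import os

# Hypothetical defaults -- the actual values are defined in the notebook's
# code cell. setdefault() only writes when the variable is not already set,
# so a user-specified profile or endpoint always wins.
os.environ.setdefault("AWS_PROFILE", "default")
os.environ.setdefault("AWS_S3_ENDPOINT", "s3.amazonaws.com")

profile = os.environ["AWS_PROFILE"]
endpoint = os.environ["AWS_S3_ENDPOINT"]
```

Because `setdefault` never overwrites an existing value, setting the variables before `%run`-ing this notebook is all a user needs to do to select a different profile or endpoint.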
{
"cell_type": "markdown",
"id": "4f7f0646-2a0a-4a1d-aae2-9045dd50622e",
"metadata": {},
"source": [
"It is extremely important that you **never** set any of the access keys or secrets directly -- we never want to include any of those values as string literals in any code. This code is committed to a public repository, so doing this would essentially publish those secrets. **ALWAYS** parse the config file as demonstrated above in order to obtain the access key and the secret access key. "
]
},
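The safe pattern the cell above insists on -- reading keys from the credentials file rather than embedding them -- can be sketched with the standard library's `configparser`. The profile name and key values below are made up for illustration; real code would read `~/.aws/credentials` instead of an inline string:

```python
import configparser

# Stand-in for ~/.aws/credentials; real code would call
# cfg.read(os.path.expanduser("~/.aws/credentials")) instead.
sample = """
[example-profile]
aws_access_key_id = AKIAEXAMPLE
aws_secret_access_key = wJalrEXAMPLESECRET
"""

cfg = configparser.ConfigParser()
cfg.read_string(sample)

profile = "example-profile"  # hypothetical profile name
access_key = cfg[profile]["aws_access_key_id"]
secret_key = cfg[profile]["aws_secret_access_key"]
```

The secrets never appear in the notebook's source; only the parsing logic does.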
{
"cell_type": "code",
"execution_count": null,
"id": "e446e487-dd85-4d7d-a8b5-10d63a23dde2",
"id": "f473e813-dfff-4e91-a87c-bf7f74a414e8",
"metadata": {},
"outputs": [],
"source": []
99 changes: 91 additions & 8 deletions StartNebariCluster.ipynb
@@ -14,23 +14,72 @@
{
"cell_type": "code",
"execution_count": null,
"id": "71c2ef84-bed9-4dd6-9919-8cab74075035",
"id": "2d30b864-d9d5-4bbc-a679-18796516712d",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"import logging \n",
"try:\n",
" from dask_gateway import Gateway\n",
"except ImportError:\n",
" logging.error(\"Unable to import Dask Gateway. Are you running in a cloud compute environment?\\n\")\n",
" raise\n",
"os.environ['DASK_DISTRIBUTED__SCHEDULER__WORKER_SATURATION'] = \"1.0\"\n",
" raise\n"
]
},
{
"cell_type": "markdown",
"id": "c7b4e8d1-6e81-4592-8331-b61d4ef12cfd",
"metadata": {},
"source": [
"## Dask Gateway Options\n",
"\n",
"The cluster scheduler on nebari makes use of a `Gateway`. This handles the \n",
"instantiation of clusters of workers, and gives us a way to monitor their\n",
"progress. Gateways are not used on all clustered systems (KubeCluster is\n",
"one alternative you might find on other cloud platforms -- such as `pangeo.chs.usgs.gov`). "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6b6c6f4e-a945-43f6-a56c-15d6d1b02a7a",
"metadata": {},
"outputs": [],
"source": [
"gateway = Gateway()\n",
"os.environ['DASK_DISTRIBUTED__SCHEDULER__WORKER_SATURATION'] = \"1.0\"\n",
"_options = gateway.cluster_options()\n",
"_options.conda_environment='users/users-pangeo' ##<< this is the conda environment we use on nebari.\n",
"_options.profile = 'Medium Worker'\n",
"_options.profile = 'Medium Worker'"
]
},
{
"cell_type": "markdown",
"id": "a895e027-cf9a-4d52-907b-6891339f48c4",
"metadata": {
"tags": []
},
"source": [
"## AWS Environment Variables\n",
"By default, the cluster does not hand the entire set of environment variables to\n",
"each of the workers. This is an important default to override in the case of the\n",
"AWS configuration parameters. \n",
"\n",
"Because individual workers in the cluster do not have access to the standard file\n",
"system (where `~/.aws/credentials` is), the workers do not have a way to obtain\n",
"their AWS credentials unless we hand them over as environment variables. So... we\n",
"have to establish key variables in the environment, and explicitly pass those to\n",
"the cluster workers at the time the cluster is started: "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "47b9fc40-c749-4d12-b9fe-a7eccd6b11fa",
"metadata": {},
"outputs": [],
"source": [
"_env_to_add={}\n",
"aws_env_vars=['AWS_ACCESS_KEY_ID',\n",
" 'AWS_SECRET_ACCESS_KEY',\n",
@@ -40,11 +89,45 @@
"for _e in aws_env_vars:\n",
" if _e in os.environ:\n",
" _env_to_add[_e] = os.environ[_e]\n",
"_options.environment_vars = _env_to_add \n",
"_options.environment_vars = _env_to_add "
]
},
{
"cell_type": "markdown",
"id": "ea7a9269-501b-40a7-9478-e42792600549",
"metadata": {},
"source": [
"## Cluster Start"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0a59614f-774e-4943-886b-251f628fb042",
"metadata": {},
"outputs": [],
"source": [
"cluster = gateway.new_cluster(_options) ##<< create cluster via the dask gateway\n",
"cluster.adapt(minimum=10, maximum=30) ##<< Sets scaling parameters. \n",
"client = cluster.get_client()\n",
"\n",
"client = cluster.get_client()"
]
},
{
"cell_type": "markdown",
"id": "80bf775c-c8b5-49d8-b31f-bdf734297426",
"metadata": {},
"source": [
"## Notify\n",
"Give the user the link by which they can monitor the cluster workers' progress and status. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "394a8ff6-a154-4679-b217-22366c24ea51",
"metadata": {},
"outputs": [],
"source": [
"print(\"The 'cluster' object can be used to adjust cluster behavior. i.e. 'cluster.adapt(minimum=10)'\")\n",
"print(\"The 'client' object can be used to directly interact with the cluster. i.e. 'client.submit(func)' \")\n",
"print(f\"The link to view the client dashboard is:\\n> {client.dashboard_link}\")"
17 changes: 15 additions & 2 deletions helpers.md
@@ -2,7 +2,7 @@

This is a collection of helpers/demo scripts to show how we like to
perform common tasks that might be replicated within any notebook
in this repository:
in this repository.

```{tableofcontents}
```
@@ -11,7 +11,7 @@ in this repository:

Supposing that you want to start the dask cluster on the `nebari` cloud
hosting environment. You could do that by hand using a dask `Gateway()`,
or you could just:
or you could use our boilerplate:

```
%run StartNebariCluster.ipynb
@@ -20,3 +20,16 @@ or you could just:
That cell "magic" runs the named notebook within the context of the
notebook calling it. It is similar to module loading in Python, but
allows us to document helper operations in their own notebooks.
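Outside of IPython, the namespace-merging behavior of `%run` can be approximated with the standard library's `runpy`; a rough sketch, in which the helper file is fabricated purely for the demonstration:

```python
import os
import runpy
import tempfile

# Write a throwaway "helper" script standing in for a notebook such as
# StartNebariCluster.ipynb.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("cluster_name = 'demo-cluster'\n")
    helper_path = f.name

# Execute the helper and merge its top-level names into our own
# namespace -- roughly what %run does inside IPython.
ns = runpy.run_path(helper_path)
globals().update(ns)
os.unlink(helper_path)

print(cluster_name)  # defined by the helper, now usable here
```

This is why the calling notebook can refer to `cluster` and `client` immediately after the `%run` line, with no explicit import.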

This mechanism allows us to start the cluster in exactly the same way every
time, with only one line of code in each notebook that needs a cluster.
See {doc}`StartNebariCluster` itself for how you might modify its
behavior without having to rewrite or replicate it.

## Why?

This mechanism is similar to a typical Python module import. We prefer
running external notebooks because it lets us write jupyter-book
friendly explanations of each helper and its use. It also keeps
notebooks from having to modify `sys.path` in order to import modules
via the standard mechanism.
19 changes: 14 additions & 5 deletions utils.ipynb
@@ -5,7 +5,16 @@
"id": "5d838fc2-b153-4511-b7a5-b7ce360f72ba",
"metadata": {},
"source": [
"# Utility Functions\n"
"# Utility Functions\n",
"\n",
"This notebook contains a collection of minor utility functions that can help\n",
"with common tasks in other notebooks. To use these helper functions: \n",
"\n",
"```python\n",
"%run utils.ipynb\n",
"```\n",
"Then you will have access to the functions defined below. You may call them \n",
"as if they were defined in the notebook which \"ran\" this notebook. "
]
},
{
@@ -26,16 +35,16 @@
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"from importlib.metadata import version as _v\n",
"from sys import version as _sysver\n",
"from importlib.metadata import version as _ver\n",
"\n",
"def _versions(mlist=None):\n",
" print(\"Python :\", sys.version.replace('\\n', '')) \n",
" print(\"Python :\", _sysver.replace('\\n', '')) \n",
" if not mlist:\n",
" mlist = ['dask', 'xarray', 'fsspec', 'zarr', 's3fs' ]\n",
" for m in sorted(mlist):\n",
" try:\n",
" print(f\"{m:10s} : {_v(m)}\")\n",
" print(f\"{m:10s} : {_ver(m)}\")\n",
" except ModuleNotFoundError:\n",
" print(f\"{m:10s} : --\")"
]
