Local JupyterLab connecting to Databricks via SSH

This package allows you to connect to a remote Databricks cluster from a locally running JupyterLab.

                                  ______________________________________
                            _____|                                      |_____
                            \    |    NEW MAJOR RELEASE V2 (May 2020)   |    /
                             )   |______________________________________|   (
                            /______)                                  (______\

1 New features

- Input of a Personal Access Token (PAT) in Jupyter is no longer necessary (demo, how it works)
- Native Windows 10 support (demo)
- Docker support on macOS, Linux and Windows (demo)
- Browsers for Databricks entities
  - DBFS browser with file preview (demo)
  - Database browser with schema and data preview (demo)
  - MLflow experiments as Pandas dataframes linked with Managed Tracking Server (demo: Intro, Keras, MLlib)
- dbutils support
  - Support for dbutils.secrets (demo)
  - Support for dbutils.notebook (demo)
- Support for kernels without Spark, e.g. for Deep Learning (demo)
- Support for Databricks Runtimes 6.4 and higher (incl. 7.0)
- JupyterLab 2.1 is now the default
- Experimental features
  - Scala magic (%%scala) support (demo)
  - DBFS file system (%fs) support (demo)

2 Overview

3 Prerequisites

Operating System

JupyterLab Integration runs on the following operating systems:
- macOS
- Linux
- Windows 10 (with OpenSSH)

Anaconda

JupyterLab Integration is based on Anaconda and supports:
- A recent version of Anaconda with Python >= 3.6
- The tool conda must be newer than 4.7.5; tests were executed with 4.8.x.

Since JupyterLab Integration creates a separate conda environment, Miniconda is sufficient to start.
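
The conda requirement above ("newer than 4.7.5") can be checked with a tuple comparison. The following is only a sketch: the helper names are hypothetical, and it assumes the version string looks like the output of `conda --version` (for example `conda 4.8.3`).

```python
import re

# Assumption taken from the prerequisite above: conda must be strictly newer than 4.7.5.
MIN_EXCLUSIVE = (4, 7, 5)


def parse_version(text):
    """Extract the first dotted numeric version from text such as 'conda 4.8.3'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def conda_is_new_enough(version_output):
    """Return True if the reported version is strictly newer than MIN_EXCLUSIVE."""
    return parse_version(version_output) > MIN_EXCLUSIVE


print(conda_is_new_enough("conda 4.8.3"))
print(conda_is_new_enough("conda 4.7.5"))
```

Comparing integer tuples rather than strings avoids ordering mistakes such as treating 4.10 as older than 4.7.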

Python

JupyterLab Integration only works with Python 3 and supports Python 3.6 and Python 3.7 both on the remote cluster and locally.
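
A quick way to confirm the Python requirement is to compare `sys.version_info` against the two supported minor versions. This is a minimal sketch with a hypothetical helper name; running the same cell locally and on the cluster would cover both sides.

```python
import sys

# Versions named in this README as supported (locally and on the remote cluster).
SUPPORTED = {(3, 6), (3, 7)}


def is_supported_python(version_info=sys.version_info):
    """Return True if the major.minor version is one of the supported ones."""
    return tuple(version_info[:2]) in SUPPORTED


print(tuple(sys.version_info[:2]), is_supported_python())
```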

Databricks CLI

JupyterLab Integration requires a recent version of the Databricks CLI. To install the Databricks CLI and to configure profiles for your clusters, refer to the Databricks CLI documentation for AWS / Azure.

Note:
- JupyterLab Integration does not support Databricks CLI profiles with username and password. Only Personal Access Tokens are supported.
- Whenever $PROFILE is used in this documentation, it refers to a valid Databricks CLI profile name, stored in a shell environment variable.
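
To check whether a profile is token based, the Databricks CLI configuration file can be inspected with the standard library. The sketch below is an illustration, not part of this package: it assumes the INI layout commonly used by `~/.databrickscfg` (keys `host`, `token`, `username`, `password`), and the function name and sample values are made up.

```python
import configparser

# Sample content only; host and credentials are placeholders.
SAMPLE_CFG = """
[demo]
host = https://example.cloud.databricks.com
token = dapi-EXAMPLE

[legacy]
host = https://example.cloud.databricks.com
username = user@example.com
password = not-a-real-password
"""


def profile_uses_token(cfg_text, profile):
    """Return True if the profile defines a token and no password."""
    # A non-standard default_section keeps [DEFAULT] from leaking keys into other
    # profiles; interpolation is disabled so '%' in values is read literally.
    parser = configparser.ConfigParser(default_section="__unused__", interpolation=None)
    parser.read_string(cfg_text)
    if not parser.has_section(profile):
        raise KeyError(f"profile {profile!r} not found")
    section = parser[profile]
    return "token" in section and "password" not in section


print(profile_uses_token(SAMPLE_CFG, "demo"))
print(profile_uses_token(SAMPLE_CFG, "legacy"))
```

To inspect a real file, the text could be read from `~/.databrickscfg` and passed in together with the value of $PROFILE.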

SSH access to Databricks clusters

Configure your Databricks clusters to allow SSH access; see Configure SSH access.

Note:
- Only clusters with a valid SSH configuration are visible to databrickslabs_jupyterlab.

Databricks Runtime

JupyterLab Integration works with the following Databricks runtimes on AWS and Azure:
- '5.5 ML LTS'
- '6.3' and '6.3 ML'
- '6.4' and '6.4 ML'
- '6.5' and '6.5 ML'
- '7.0 BETA' and '7.0 ML BETA'

Supported only on AWS:
- '5.5 LTS'

4 Running with docker

A Docker image ready for working with JupyterLab Integration is available from Docker Hub. It is recommended to prepare your environment by pulling the image: docker pull bwalter42/databrickslabs_jupyterlab:2.0.0

There are two scripts in the folder docker:

for Windows: dk-dj.bat and dk-jupyter.bat
for macOS/Linux: dk-dj and dk-jupyter

Alternatively, under macOS and Linux one can use the following bash functions:

databrickslabs-jupyterlab for docker:

This is the Jupyterlab Integration configuration utility using the docker image:

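function dk-dj {
    # Mounts kernels, SSH keys, the Databricks CLI config and the notebooks into the container.
    # "$@" is quoted so that arguments containing spaces are passed through unchanged.
    docker run -it --rm -p 8888:8888 \
        -v $(pwd)/kernels:/home/dbuser/.local/share/jupyter/kernels/ \
        -v $HOME/.ssh/:/home/dbuser/.ssh  \
        -v $HOME/.databrickscfg:/home/dbuser/.databrickscfg \
        -v $(pwd):/home/dbuser/notebooks \
        bwalter42/databrickslabs_jupyterlab:2.0.0 /opt/conda/bin/databrickslabs-jupyterlab "$@"
}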

jupyter for docker:

Runs jupyter commands using the docker image:

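function dk-jupyter {
    # Same mounts as dk-dj, but the container runs jupyter instead of databrickslabs-jupyterlab.
    # "$@" is quoted so that arguments containing spaces are passed through unchanged.
    docker run -it --rm -p 8888:8888 \
        -v $(pwd)/kernels:/home/dbuser/.local/share/jupyter/kernels/ \
        -v $HOME/.ssh/:/home/dbuser/.ssh  \
        -v $HOME/.databrickscfg:/home/dbuser/.databrickscfg \
        -v $(pwd):/home/dbuser/notebooks \
        bwalter42/databrickslabs_jupyterlab:2.0.0 /opt/conda/bin/jupyter "$@"
}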

The two scripts assume that notebooks will be in the current folder and kernels will be in the kernels subfolder of the current folder:

$PWD  <= Start jupyterLab from here
 |_ kernels
 |  |_ <Jupyterlab Integration kernel spec>
 |  |_ ... 
 |_ project
 |  |_ notebook.ipynb
 |_ notebook.ipynb
 |_ ...

Note: the scripts dk-dj / dk-dj.bat will modify your ~/.ssh/config and ~/.ssh/known_hosts! If you do not want this to happen, you can, for example, extend the folder structure to

$PWD  <= Start jupyterLab from here
|_ .ssh                      <= new
|  |_ config                 <= new
|  |_ id_$PROFILE            <= new
|  |_ id_$PROFILE.pub        <= new
|_ kernels
|  |_ <Jupyterlab Integration kernel spec>
|  |_ ... 
|_ project
|  |_ notebook.ipynb
|_ notebook.ipynb
|_ ...

and create the necessary public/private key pair in $(pwd)/.ssh and change the parameter -v $HOME/.ssh/:/home/dbuser/.ssh to -v $(pwd)/.ssh/:/home/dbuser/.ssh in both commands.
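
The difference between the two layouts is only the host side of the .ssh mount. As an illustration, the following sketch builds the `-v` arguments for either variant; the function name and the example paths are hypothetical, while the container paths are the ones used in the bash functions above.

```python
from pathlib import PurePosixPath

CONTAINER_HOME = PurePosixPath("/home/dbuser")


def volume_args(cwd, home, local_ssh=False):
    """Build the docker '-v' arguments; local_ssh=True mounts <cwd>/.ssh instead of <home>/.ssh."""
    cwd, home = PurePosixPath(cwd), PurePosixPath(home)
    ssh_dir = cwd / ".ssh" if local_ssh else home / ".ssh"
    mounts = [
        (cwd / "kernels", CONTAINER_HOME / ".local/share/jupyter/kernels"),
        (ssh_dir, CONTAINER_HOME / ".ssh"),
        (home / ".databrickscfg", CONTAINER_HOME / ".databrickscfg"),
        (cwd, CONTAINER_HOME / "notebooks"),
    ]
    args = []
    for host_path, container_path in mounts:
        args += ["-v", f"{host_path}:{container_path}"]
    return args


# Example paths are placeholders for $(pwd) and $HOME.
print(volume_args("/work/project", "/home/me", local_ssh=True))
```

With `local_ssh=True`, the SSH configuration and keys stay inside the project folder, so the user's own ~/.ssh is not mounted into the container.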

5 Local installation

Install Jupyterlab Integration

Create a new conda environment and install databrickslabs_jupyterlab with the following commands:
```
(base)$ conda create -n db-jlab python=3.7
(base)$ conda activate db-jlab
(db-jlab)$ pip install --upgrade databrickslabs-jupyterlab==2.0.0
```
The prefix (db-jlab)$ in all command examples in this document assumes that the conda environment db-jlab is activated.

The tool databrickslabs-jupyterlab / dj

The tool comes with a batch file dj.bat for Windows. On macOS or Linux both dj and databrickslabs-jupyterlab exist.

Bootstrap Jupyterlab Integration

Bootstrap the environment for JupyterLab Integration with the following command (which shows the usage after successfully configuring JupyterLab Integration):
```
(db-jlab)$ dj -b
```

6 Getting started with local installation or docker

Ensure that SSH access is correctly configured; see Configure SSH access.

6.1 Starting JupyterLab

Create a kernel specification

In the terminal, create a jupyter kernel specification for a Databricks CLI profile $PROFILE with the following command:
- Local installation
```
(db-jlab)$ dj $PROFILE -k
```
- With docker
```
(db-jlab)$ dk-dj $PROFILE -k
```
A new kernel is then available in the kernel change menu (see here for an explanation of the kernel name structure).

Start JupyterLab

- Local installation
```
(db-jlab)$ dj $PROFILE -l      # or 'jupyter lab'
```
- With docker
```
(db-jlab)$ dk-dj $PROFILE -l   # or 'dk-jupyter lab'
```
The command with -l is a safe version of the standard command to start JupyterLab (jupyter lab): it ensures that the kernel specification is updated.

6.2 Using Spark in the Notebook

Check whether the notebook is properly connected

When the notebook has connected successfully to the cluster, the status bar at the bottom of JupyterLab should show

[Image: status bar for a kernel with Spark]

if you use a kernel with Spark, else just

[Image: status bar for a kernel without Spark]

If this is not the case, see Troubleshooting.

Test the Spark access

To check the remote Spark connection, enter the following lines into a notebook cell:
```
import socket

from databrickslabs_jupyterlab import is_remote

result = sc.range(10000).repartition(100).map(lambda x: x).sum()
print(socket.gethostname(), is_remote())
print(result)
```
The output shows the hostname of the driver and whether the kernel is actually running remotely, followed by the result of a quick Spark smoke test.
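
The expected value of result can be verified without Spark. The job sums the integers 0 to 9999 (repartition and the identity map do not change the values), so a plain Python check, given here as a small sketch, yields the number to compare against:

```python
n = 10000

# Closed form for 0 + 1 + ... + (n - 1)
expected = n * (n - 1) // 2

# Cross-check against a direct local sum
assert expected == sum(range(n))
print(expected)  # 49995000
```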

Success: Your local JupyterLab is successfully connected to the remote Databricks cluster.

7 Advanced topics

7.1 Switching kernels and restart after cluster auto-termination

7.2 Creating a mirror of a remote Databricks cluster

7.3 Detailed databrickslabs_jupyterlab command overview

7.4 How it works

7.5 Troubleshooting

8 Project Support

Please note that all projects in the /databrickslabs github account are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.

Any issues discovered through the use of this project should be filed as GitHub Issues on the Repo. They will be reviewed as time permits, but there are no formal SLAs for support.

9 Test notebooks

To work with the test notebooks in ./examples the remote cluster needs to have the following libraries installed:

mlflow==1.x
spark-sklearn
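
Whether these libraries are available can be checked from a notebook cell running on the remote kernel. The sketch below uses a hypothetical helper and assumes the import names `mlflow` and `spark_sklearn` (the latter being the usual import name of the spark-sklearn package); it only tests importability, not versions.

```python
import importlib.util

# Import names are assumptions derived from the library list above.
REQUIRED_MODULES = ["mlflow", "spark_sklearn"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print(missing_modules(REQUIRED_MODULES))
```

An empty list means both modules can be imported by the kernel that runs the cell.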
