GitHub - baljit92/spark-k8s: Repo for running Spark + Jupyter notebook on a K8 cluster

We will bring up a Jupyter notebook in Kubernetes and run a Spark application in client mode. We will also use the sparkmonitor widget for visualization.

Jupyter installation:

Our setup contains two images:

Spark image — used for spinning up Spark executors.
Jupyter notebook image — used for Jupyter notebook and Spark driver.

Setup instructions:

Create a cluster (AKS, in our case). The command below is only an example and can be customized to your needs. The important option to include is --enable-cluster-autoscaler, which makes node scaling automatic rather than manual.

# Now create the AKS cluster and enable the cluster autoscaler
az aks create \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --node-count 1 \
  --vm-set-type VirtualMachineScaleSets \
  --load-balancer-sku standard \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 3

Next, create a dedicated namespace for Spark, install the relevant Kubernetes resources, and expose Jupyter’s port:

kubectl create ns spark
kubectl apply -n spark -f jupyter.yaml
kubectl port-forward -n spark service/jupyter 8888:8888
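Before opening the browser, it can help to confirm that the port-forward is actually listening. The following is a minimal sketch using only the Python standard library; the helper name port_open is hypothetical, and the host and port simply mirror the port-forward command above.

```python
import socket

def port_open(host="127.0.0.1", port=8888, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection raises OSError if nothing is listening.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print("Jupyter reachable:", port_open())
```

If this prints False, the kubectl port-forward command is probably not running in another terminal, or the Jupyter pod is not ready yet.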

Note that a dedicated namespace has several benefits:

Security: Spark requires permissions to create and delete pods, so it is better to limit those permissions to a specific namespace.
Observability: Spark might spawn a lot of executor pods, which are easier to track when they are isolated in a separate namespace. Conversely, pods of other applications do not get lost among all of those executor pods.
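To illustrate the security point, namespace-limited permissions are usually expressed as a Kubernetes Role, which only applies inside one namespace. The sketch below builds such a manifest as JSON, which kubectl apply -f also accepts. The role name spark-driver and the exact list of resources and verbs are assumptions for illustration; they may differ from what jupyter.yaml actually defines.

```python
import json

def spark_role(namespace="spark", name="spark-driver"):
    """Build a namespaced Role that lets the holder manage executor pods.

    The name and the resource/verb lists are hypothetical examples.
    """
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",  # a Role is namespaced, unlike a ClusterRole
        "metadata": {"name": name, "namespace": namespace},
        "rules": [{
            "apiGroups": [""],  # "" is the core API group
            "resources": ["pods", "services", "configmaps"],
            "verbs": ["create", "get", "list", "watch", "delete"],
        }],
    }

role = spark_role()
print(json.dumps(role, indent=2))
```

A Role like this takes effect only once a RoleBinding in the same namespace ties it to the service account used by the Jupyter pod.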

That’s it! Now open your browser, go to http://127.0.0.1:8888, and run your first Spark application. You can use the included notebooks as examples.

spark_application.ipynb contains the Spark configuration used to launch the Spark session.
demo.ipynb reads a CSV file from S3 storage.
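For orientation, a client-mode configuration on Kubernetes typically sets properties like the ones below. This is a sketch under stated assumptions, not the contents of spark_application.ipynb: the image name, executor count, and driver host are hypothetical placeholders, while the property keys themselves are standard Spark settings.

```python
def client_mode_conf(namespace="spark",
                     executor_image="example.registry/spark:latest",
                     driver_host="jupyter",
                     executors=2):
    """Return Spark properties for a client-mode session on Kubernetes.

    All default values are hypothetical placeholders.
    """
    return {
        # In-cluster address of the Kubernetes API server.
        "spark.master": "k8s://https://kubernetes.default.svc",
        # Client mode: the driver runs inside the Jupyter pod.
        "spark.submit.deployMode": "client",
        "spark.kubernetes.namespace": namespace,
        # Image used to spin up the executor pods.
        "spark.kubernetes.container.image": executor_image,
        "spark.executor.instances": str(executors),
        # Host name executors use to connect back to the driver (assumed).
        "spark.driver.host": driver_host,
    }

conf = client_mode_conf()
for key, value in sorted(conf.items()):
    print(f"{key}={value}")
```

In a notebook with pyspark installed, each pair could be passed to SparkSession.builder.config(key, value) before calling getOrCreate().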

Note: To change the storageClassName, run kubectl get storageclass --all-namespaces, choose the type of disk you want, and enter its NAME value in the jupyter.yaml file.
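If you prefer to inspect the storage classes programmatically, the JSON form of the same listing (kubectl get storageclass -o json) can be parsed as sketched below. The sample data is entirely hypothetical and stands in for real cluster output; the default-class annotation key is the standard Kubernetes one.

```python
import json

# Hypothetical stand-in for `kubectl get storageclass -o json` output;
# the names and provisioner are illustrative only.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "standard-disk",
                      "annotations": {"storageclass.kubernetes.io/is-default-class": "true"}},
         "provisioner": "example.com/disk"},
        {"metadata": {"name": "premium-disk"},
         "provisioner": "example.com/disk"},
    ]
})

def storage_class_names(kubectl_json):
    """Return (name, is_default) pairs from a StorageClass list in JSON."""
    result = []
    for item in json.loads(kubectl_json)["items"]:
        meta = item["metadata"]
        annotations = meta.get("annotations") or {}
        is_default = annotations.get(
            "storageclass.kubernetes.io/is-default-class") == "true"
        result.append((meta["name"], is_default))
    return result

classes = storage_class_names(SAMPLE)
for name, is_default in classes:
    print(name, "(default)" if is_default else "")
```

The name printed for the disk type you want is the value to enter as storageClassName in jupyter.yaml.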

Reference

Interesting repo for optimizing Spark on EKS: https://github.com/aws-samples/eks-spark-benchmark
