From 0bf2c3661ade5e5b6d2d144ea57f7f6b61284c5b Mon Sep 17 00:00:00 2001 From: Jacob Celestine Date: Wed, 19 Feb 2020 16:38:43 +0530 Subject: [PATCH] Added new sections --- pyspark/Colab and PySpark.ipynb | 60 ++++++++++++++++++++++++++++++++- 1 file changed, 59 insertions(+), 1 deletion(-) diff --git a/pyspark/Colab and PySpark.ipynb b/pyspark/Colab and PySpark.ipynb index a52cfa0..7fc0558 100644 --- a/pyspark/Colab and PySpark.ipynb +++ b/pyspark/Colab and PySpark.ipynb @@ -51,6 +51,8 @@ "
  • Working with joins
  • \n", "
  • Hands-on again!
  • \n", "
  • RDD
  • \n", + "
  • User-Defined Functions (UDF)
  • \n", + "
  • Common Questions
  • \n", "" ] }, @@ -2939,7 +2941,8 @@ }, "colab_type": "code", "id": "6ZcRIX3mMquF", - "outputId": "319981cb-4fae-4945-dc43-100b248144a5" + "outputId": "319981cb-4fae-4945-dc43-100b248144a5", + "scrolled": true }, "outputs": [ { @@ -2964,6 +2967,61 @@ " line.split(\",\")[2],\n", " line.split(\",\")[5])).collect())" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "# User-Defined Functions (UDF)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "PySpark User-Defined Functions (UDFs) help you convert your Python code into a scalable version of itself. They come in handy more often than you might imagine, but beware: they perform worse than the built-in PySpark functions. You can view examples of how UDFs work [here](https://docs.databricks.com/spark/latest/spark-sql/udf-python.html). What I will give in this section is some theory on how they work, and why they are slower.\n", + "\n", + "When you run a UDF in PySpark, each executor creates a Python process. Data is serialised and deserialised between each executor and Python. This adds a lot of overhead to Spark jobs, making UDFs less efficient than the built-in Spark DataFrame functions. Apart from this, you might sometimes have memory issues while using UDFs. The Python worker consumes a lot of off-heap memory, so it often exceeds the memoryOverhead limit, thereby failing your job. Keeping these in mind, I wouldn't recommend using them, but at the end of the day, it's your choice." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "# Common Questions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Recommended IDE\n", + "I personally prefer [PyCharm](https://www.jetbrains.com/pycharm/) while coding in Python/PySpark. It's based on IntelliJ IDEA so it has a lot of features! And the main advantage I have felt is the ease of installing pyspark and other packages. 
You can customize it with themes and plugins, and it lets you enhance productivity while coding by providing some features like suggestions, Local VCS etc." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Submitting a spark job\n", + "The python syntax for running jobs is: `python .py ...`
    \n", + "But when you submit a spark job you have to use `spark-submit` to run the application. \n", + "\n", + "Here is a simple example of a spark-submit command:
    \n", + "`spark-submit filename.py --named_argument 'argument value'`
    \n", + "Here, named_argument is an argument that you are reading from inside your script. \n", + "\n", + "There are other options you can pass in the command, like:
    \n", + "`--py-files` which helps you pass a python file to read in your file,
    \n", + "`--files` which helps pass other files like txt or config,
    \n", + "`--deploy-mode` which tells whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client)\n", + "`--conf` which helps pass different configurations, like memoryOverhead, dynamicAllocation etc.\n", + "\n", + "\n", + "There is an [entire page](https://spark.apache.org/docs/latest/submitting-applications.html) in spark documentation dedicated to this. I highly recommend you go through it once." + ] + } ], "metadata": {