Commit

Added new sections
jacobceles authored Feb 19, 2020
1 parent 50bb1ba commit 0bf2c36
Showing 1 changed file with 59 additions and 1 deletion.
60 changes: 59 additions & 1 deletion pyspark/Colab and PySpark.ipynb
@@ -51,6 +51,8 @@
"<li><a href=\"#working-with-joins\">Working with joins</a></li>\n",
"<li><a href=\"#hands-on-again\">Hands-on again!</a></li>\n",
"<li><a href=\"#rdd\">RDD</a></li>\n",
"<li><a href=\"#user-defined-functions-udf\">User-Defined Functions (UDF)</a></li>\n",
"<li><a href=\"#common-questions\">Common Questions</a></li>\n",
"</ol>"
]
},
@@ -2939,7 +2941,8 @@
},
"colab_type": "code",
"id": "6ZcRIX3mMquF",
"outputId": "319981cb-4fae-4945-dc43-100b248144a5"
"outputId": "319981cb-4fae-4945-dc43-100b248144a5",
"scrolled": true
},
"outputs": [
{
@@ -2964,6 +2967,61 @@
" line.split(\",\")[2],\n",
" line.split(\",\")[5])).collect())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='user-defined-functions-udf'></a>\n",
"# User-Defined Functions (UDF)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"PySpark User-Defined Functions (UDFs) help you convert your python code into a scalable version of itself. It comes in handy more than you can imagine, but beware, as the performance is less when you compare it with pyspark functions. You can view examples of how UDF works [here](https://docs.databricks.com/spark/latest/spark-sql/udf-python.html). What I will give in this section is some theory on how it works, and why it is slower.\n",
"\n",
"When you try to run a UDF in PySpark, each executor creates a python process. Data will be serialised and deserialised between each executor and python. This leads to lots of performance impact and overhead on spark jobs, making it less efficent than using spark dataframes. Apart from this, sometimes you might have memory issues while using UDFs. The Python worker consumes huge off-heap memory and so it often leads to memoryOverhead, thereby failing your job. Keeping these in mind, I wouldn't recommend using them, but at the end of the day, your choice."
]
},
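{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a minimal sketch of the UDF pattern, assuming a SparkSession named `spark` is already available; the tiny DataFrame is made up purely for illustration and is not from the notebook's dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pyspark.sql.functions import udf\n",
"from pyspark.sql.types import StringType\n",
"\n",
"# A plain Python function wrapped as a UDF; every row is shipped to a Python worker,\n",
"# serialised and deserialised, which is where the overhead described above comes from.\n",
"capitalize = udf(lambda s: s.capitalize(), StringType())\n",
"\n",
"# Hypothetical two-row DataFrame, just to demonstrate the call.\n",
"df = spark.createDataFrame([(\"alice\",), (\"bob\",)], [\"name\"])\n",
"df.withColumn(\"capitalized\", capitalize(df[\"name\"])).show()"
]
},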
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='common-questions'></a>\n",
"# Common Questions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Recommeded IDE\n",
"I personally prefer [PyCharm](https://www.jetbrains.com/pycharm/) while coding in Python/PySpark. It's based on IntelliJ IDEA so it has a lot of features! And the main advantage I have felt is the ease of installing pyspark and other packages. You can customize it with themes and plugins, and it lets you enhance productivity while coding by providing some features like suggestions, Local VCS etc."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submitting a spark job\n",
"The python syntax for running jobs is: `python <file_name>.py <arg1> <arg2> ...`<br>\n",
"But when you submit a spark job you have to use `spark-submit` to run the application. \n",
"\n",
"Here is a simple example of a spark-submit command:<br>\n",
"`spark-submit filename.py --named_argument 'arguemnt value'`<br>\n",
"Here, named_argument is an arguemnt that you are reading from inside your script. \n",
"\n",
"There are other options you can pass in the command, like:<br>\n",
"`--py-files` which helps you pass a python file to read in your file,<br>\n",
"`--files` which helps pass other files like txt or config,<br>\n",
"`--deploy-mode` which tells wether to deploy your worker node on cluster or locally\n",
"`--conf` which helps pass different configurations, like memoryOverhead, dynamicAllocation etc.\n",
"\n",
"\n",
"There is an [entire page](https://spark.apache.org/docs/latest/submitting-applications.html) in spark documentation dedicated to this. I highly recommend you go through it once."
]
},
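{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the pattern above: this hypothetical `filename.py` reads `--named_argument` with argparse (one reasonable choice; nothing here prescribes a particular parsing library), so it could be launched as `spark-submit filename.py --named_argument 'argument value'`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch of a script for spark-submit; argparse is an assumption,\n",
"# not something this notebook prescribes.\n",
"import argparse\n",
"\n",
"from pyspark.sql import SparkSession\n",
"\n",
"parser = argparse.ArgumentParser()\n",
"parser.add_argument(\"--named_argument\", help=\"value passed after the script name\")\n",
"args = parser.parse_args()\n",
"\n",
"# spark-submit supplies the cluster configuration; the script just requests a session.\n",
"spark = SparkSession.builder.appName(\"submit-example\").getOrCreate()\n",
"print(\"Received:\", args.named_argument)\n",
"spark.stop()"
]
}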
],
"metadata": {
