diff --git a/pyspark/Colab and PySpark.ipynb b/pyspark/Colab and PySpark.ipynb
index 82e3cb8..a52cfa0 100644
--- a/pyspark/Colab and PySpark.ipynb
+++ b/pyspark/Colab and PySpark.ipynb
@@ -1,28 +1,196 @@
{
"cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "qJoeN3e8_Gzk"
+ },
+ "source": [
+ "# Introduction to Google Colab and PySpark"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Table Of Contents:\n",
+ "\n",
+ "- Objective\n",
+ "- Prerequisite\n",
+ "- Notes from the author\n",
+ "- Big data, PySpark and Colaboratory\n",
+ "    - Big data\n",
+ "    - PySpark\n",
+ "    - Colaboratory\n",
+ "- Jupyter notebook basics\n",
+ "    - Code cells\n",
+ "    - Text cells\n",
+ "    - Access to the shell\n",
+ "    - Install Spark\n",
+ "- Loading Dataset\n",
+ "- Working with the DataFrame API\n",
+ "    - Viewing Dataframe\n",
+ "    - Schema of a DataFrame\n",
+ "    - Working with columns\n",
+ "    - Working with Rows\n",
+ "- Hands-on Questions\n",
+ "- Functions\n",
+ "    - String Functions\n",
+ "    - Numeric functions\n",
+ "    - Date\n",
+ "- Working with Dates\n",
+ "- Working with joins\n",
+ "- Hands-on again!\n",
+ "- RDD"
+ ]
+ },
{
"cell_type": "markdown",
"metadata": {},
"source": [
+ "\n",
+ "## Objective\n",
+ "The objective of this notebook is to:\n",
+ ">Give a working understanding of the different PySpark functions available. \n",
+ ">Give a short introduction to Google Colab, as that is the platform on which this notebook is written. \n",
+ "\n",
+ "Once you complete this notebook, you should be able to write PySpark programs in an efficient way. The ideal way to use this is by going through the examples given and then trying them on Colab. At the end there are a few hands-on questions which you can use to evaluate yourself."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Prerequisite\n",
+ ">Although some theory about PySpark and big data is given in this notebook, I recommend that everyone read more about it and build a deeper understanding of how the functions get executed and of the relevance of big data today.\n",
+ ">A good understanding of Python will be an added bonus."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### Notes from the author\n",
+ "\n",
"This tutorial was made using Google Colab so the code you see here is meant to run on a colab notebook.
\n",
"It goes through basic [PySpark Functions](https://spark.apache.org/docs/latest/api/python/index.html) and a short introduction on how to use [Colab](https://colab.research.google.com/notebooks/basic_features_overview.ipynb).
\n",
- "The reason why I used colab is because of its shareability and free GPU. Yeah you read that right. A FREE GPU! In the words of Google:
\n",
- "`Colaboratory, or “Colab” for short, is a product from Google Research. Colab allows anybody to write and execute arbitrary python code through the browser, and is especially well suited to machine learning, data analysis and education. More technically, Colab is a hosted Jupyter notebook service that requires no setup to use, while providing free access to computing resources including GPUs.`
\n",
- "If you have more questions about colab, [REFER THIS LINK](https://research.google.com/colaboratory/faq.html)
\n",
+ "If you want to view my Colab notebook for this tutorial, you can find it [here](https://colab.research.google.com/drive/1G894WS7ltIUTusWWmsCnF_zQhQqZCDOc). The viewing experience and readability are much better there.\n",
+ "If you want to try out things with this notebook as a base, feel free to download it from my repo [here](https://github.com/jacobceles/knowledge-repo/blob/master/pyspark/Colab%20and%20PySpark.ipynb) and then use it with Jupyter Notebook."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Big data, PySpark and Colaboratory"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### Big data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Big data usually means data of such huge volume that normal data storage solutions cannot efficiently store and process it. In this era, data is being generated at an absurd rate. Data is collected for each movement a person makes. The bulk of big data comes from three primary sources: \n",
+ "\n",
+ " - Social data\n",
+ " - Machine data\n",
+ " - Transactional data\n",
+ "\n",
+ "Some common examples of the sources of such data include internet searches, Facebook posts, doorbell cams, smartwatches, online shopping history etc. Every action creates data; it is just a matter of whether there is a way to collect it or not. But what's interesting is that, by some estimates, not even 5% of all this collected data is being used fully. There is a huge demand for big data professionals in the industry. Even though the number of graduates with a specialization in big data is rising, the problem is that they often lack practical knowledge of big data scenarios, which leads to bad architectures and inefficient methods of processing data.\n",
"\n",
- "All you need is an internet connection to keep a session alive. If you lose the connection you will have to download the datasets again.
\n",
- "If you want to view my colab notebook you can do it [HERE](https://colab.research.google.com/drive/1G894WS7ltIUTusWWmsCnF_zQhQqZCDOc). The viewing experience and readability is much better there.
\n",
- "If you want to try out things with this notebook as a base, feel free to download it from my repo [HERE](https://github.com/jacobceles/knowledge-repo/blob/master/pyspark/Colab%20and%20PySpark.ipynb) and then use it with jupyter notebook.\n"
+ ">If you are interested in learning more about the landscape and technologies involved, here is [an article](https://hostingtribunal.com/blog/big-data-stats/) which I found really interesting!"
]
},
{
"cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "qJoeN3e8_Gzk"
- },
+ "metadata": {},
"source": [
- "# Introduction to Google Colab and PySpark"
+ "\n",
+ "### PySpark"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If you are working in the field of big data, you have almost certainly heard of Spark. If you look at the [Apache Spark](https://spark.apache.org/) website, you will see that it is described as a `Lightning-fast unified analytics engine`. PySpark is a flavour of Spark used for processing and analysing massive volumes of data. If you are familiar with Python and have tried it on huge datasets, you will know that the execution time can get ridiculous. Enter PySpark!\n",
+ "\n",
+ "Imagine your data resides in a distributed manner at different places. If you try bringing your data to one point and executing your code there, not only would that be inefficient, but it would also cause memory issues. Now let's say your code goes to the data rather than the data coming to where your code is. This avoids unnecessary data movement, which in turn decreases the running time. \n",
+ "\n",
+ "PySpark is the Python API of Spark, which means it can do almost all the things Python can: machine learning (ML) pipelines, exploratory data analysis (at scale), ETLs for data platforms, and much more! And all of them in a distributed manner. One of the best parts of PySpark is that if you are already familiar with Python, it's really easy to learn.\n",
+ "\n",
+ "Apart from PySpark, there is another language called Scala used for big data processing. Scala is frequently quoted as being over 10 times faster than Python for some workloads, and it is native to Hadoop as it is based on the JVM. But PySpark is being adopted at a fast rate because of its ease of use, gentler learning curve and ML capabilities.\n",
+ "\n",
+ "I will briefly explain how a PySpark job works, but I strongly recommend you read more about the [architecture](https://data-flair.training/blogs/how-apache-spark-works/) and how everything works. Now, before I get into it, let me cover some basic jargon first:\n",
+ "\n",
+ "Cluster is a set of loosely or tightly connected computers that work together so that they can be viewed as a single system.\n",
+ "\n",
+ "Hadoop is an open source, scalable, and fault-tolerant framework written in Java. It efficiently processes large volumes of data on a cluster of commodity hardware. Hadoop is not only a storage system but a platform for large-scale data storage as well as processing.\n",
+ "\n",
+ "HDFS (Hadoop Distributed File System) is the file system of Hadoop. It is designed for reliably storing very large files on a cluster of commodity hardware.\n",
+ "\n",
+ "MapReduce is a data processing framework which has 2 phases: Map and Reduce. The map procedure performs filtering and sorting, and the reduce procedure performs a summary operation. It usually runs on a Hadoop cluster.\n",
+ "\n",
+ "Transformations are operations applied to a dataset to create a new dataset. Filter, groupBy and map are examples of transformations.\n",
+ "\n",
+ "Actions are operations which instruct Spark to perform the computation and send the result back to the driver. `count` and `collect` are examples of actions.\n",
+ "\n",
+ "Alright! Now that that's out of the way, let me explain how a Spark job runs. In simple terms, each time you submit a PySpark job, the code gets internally converted into a plan of map- and reduce-style stages that Spark executes in the Java virtual machine. Now one of the thoughts that might be popping into your mind will probably be: `So the code ends up as map and reduce steps anyway. Wouldn't that mean plain MapReduce would be at least as fast as PySpark?` Well, the answer is a big NO. This is what makes Spark jobs special. Spark is capable of handling a massive amount of data at a time in its distributed environment. It does this through in-memory processing, which can make it up to 100 times faster than Hadoop MapReduce. Another factor which makes it fast is lazy evaluation. Spark delays its evaluation as much as it can. Each time you submit a job, Spark creates an action plan for how to execute the code, and then does nothing. Finally, when you ask for the result (i.e., call an action), it executes the plan, which is basically all the transformations you have mentioned in your code. That's basically the gist of it. \n",
+ "\n",
+ "Now lastly, I want to talk about one more thing. Spark mainly consists of 4 modules:\n",
+ "\n",
+ "\n",
+ " - Spark SQL - helps to write Spark programs using SQL-like queries.\n",
+ " - Spark Streaming - is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. It is used heavily in processing social media data.\n",
+ " - Spark MLlib - is the machine learning component of Spark. It helps train ML models on massive datasets with very high efficiency.\n",
+ " - Spark GraphX - is the graph processing component of Spark. It enables users to view data both as graphs and as collections without data movement or duplication.\n",
+ "\n",
+ "Hopefully this image gives a better idea of what I am talking about:\n",
+ "\n",
+ "Source: Datanami\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### Colaboratory"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In the words of Google:\n",
+ "`Colaboratory, or “Colab” for short, is a product from Google Research. Colab allows anybody to write and execute arbitrary python code through the browser, and is especially well suited to machine learning, data analysis and education. More technically, Colab is a hosted Jupyter notebook service that requires no setup to use, while providing free access to computing resources including GPUs.`\n",
+ "\n",
+ "The reason why I used Colab is its shareability and free GPU. Yeah, you read that right. A FREE GPU! Additionally, it makes it convenient to use other Google services: it saves to Google Drive, and the services are closely integrated. I recommend you go through the official [overview documentation](https://colab.research.google.com/notebooks/basic_features_overview.ipynb) if you want to know more about it.\n",
+ "If you have more questions about Colab, please [refer to this link](https://research.google.com/colaboratory/faq.html).\n",
+ "\n",
+ ">While using a colab notebook, you will need an active internet connection to keep a session alive. If you lose the connection you will have to download the datasets again."
]
},
{
@@ -32,6 +200,7 @@
"id": "_N5-lspH_N8B"
},
"source": [
+ "\n",
"## Jupyter notebook basics"
]
},
@@ -42,6 +211,7 @@
"id": "6Ul54hAYyHyd"
},
"source": [
+ "\n",
"### Code cells"
]
},
@@ -118,9 +288,17 @@
"id": "VOqLNkRKyUIS"
},
"source": [
+ "\n",
"### Text cells"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Hello world!"
+ ]
+ },
{
"cell_type": "markdown",
"metadata": {
@@ -128,6 +306,7 @@
"id": "X6zdrH15_CCW"
},
"source": [
+ "\n",
"## Access to the shell"
]
},
@@ -194,6 +373,7 @@
"id": "Dd6t0uFzuR4X"
},
"source": [
+ "\n",
"## Install Spark"
]
},
@@ -375,6 +555,7 @@
"id": "hmIqq6xPK7m7"
},
"source": [
+ "\n",
"# Loading Dataset"
]
},
@@ -497,7 +678,8 @@
"id": "HgwoX-pfNqQI"
},
"source": [
- "# Working with the DataFrame API"
+ "\n",
+ "# Working with the DataFrame API"
]
},
{
@@ -507,6 +689,7 @@
"id": "_QwZtWxZRCBn"
},
"source": [
+ "\n",
"## Viewing Dataframe"
]
},
@@ -534,6 +717,7 @@
"id": "eFoagdqARKb8"
},
"source": [
+ "\n",
"## Schema of a DataFrame"
]
},
@@ -918,7 +1102,8 @@
"id": "rsD48rckdHPe"
},
"source": [
- "# Wokring with columns"
+ "\n",
+ "# Working with columns"
]
},
{
@@ -1310,7 +1495,8 @@
"id": "WbKK5iHwmIoV"
},
"source": [
- "## Wokring with Rows"
+ "\n",
+ "## Working with Rows"
]
},
{
@@ -1569,7 +1755,8 @@
"id": "xOQPOt19q_he"
},
"source": [
- "# Hands-on Question 🤚 !"
+ "\n",
+ "# Hands-on Questions 🤚 !"
]
},
{
@@ -1625,6 +1812,7 @@
"id": "aHjILb1DriuX"
},
"source": [
+ "\n",
"# Functions"
]
},
@@ -1662,6 +1850,7 @@
"id": "PIKigra7A34e"
},
"source": [
+ "\n",
"## String Functions"
]
},
@@ -1749,6 +1938,7 @@
"id": "ldtA0wk9BMkT"
},
"source": [
+ "\n",
"## Numeric functions"
]
},
@@ -1800,6 +1990,7 @@
"id": "KQ6Ul9HGCwC3"
},
"source": [
+ "\n",
"## Date"
]
},
@@ -1851,6 +2042,7 @@
"id": "sY6PstyLDp6P"
},
"source": [
+ "\n",
"# Working with Dates"
]
},
@@ -2009,6 +2201,7 @@
"id": "7OZElEvcGOD1"
},
"source": [
+ "\n",
"# Working with joins"
]
},
@@ -2359,6 +2552,7 @@
"id": "EEEB2TVqL4Ie"
},
"source": [
+ "\n",
"# Hands-on again!"
]
},
@@ -2588,6 +2782,7 @@
"id": "x62BiCgBMOtq"
},
"source": [
+ "\n",
"# RDD"
]
},
@@ -2600,7 +2795,13 @@
"source": [
"> With map, you define a function and then apply it record by record. Flatmap returns a new RDD by first applying a function to all of the elements in RDDs and then flattening the result. Filter, returns a new RDD. Meaning only the elements that satisfy a condition. With reduce, we are taking neighboring elements and producing a single combined result.\n",
"For example, let's say you have a set of numbers. You can reduce this to its sum by providing a function that takes as input two values and reduces them to one. \n",
- "\n"
+ "\n",
+ "Some of the reasons you would use a DataFrame over an RDD are:\n",
+ "\n",
+ " - Its ability to represent data as rows and columns. But this also means it can only hold structured and semi-structured data.\n",
+ " - It allows processing data in different formats (Avro, CSV, JSON) and from different storage systems (HDFS, Hive tables, MySQL).\n",
+ " - Its superior job optimization capability.\n",
+ " - The DataFrame API is very easy to use."
]
},
{
@@ -2763,14 +2964,6 @@
" line.split(\",\")[2],\n",
" line.split(\",\")[5])).collect())"
]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "3wn2zXe7TbI3"
- },
- "source": []
}
],
"metadata": {