In this project, we analyze the well-known Chicago City Taxi Trips dataset using Spark Structured Streaming and build a dashboard that reports the metrics using Flask and HTML. The dataset is open for public use and holds records going back to 2013, amounting to about 200 million rows, with each record being around 1024 bytes in size.
Here are some features of this implementation:
- Retrieves data from a folder and processes it in a streaming fashion.
- Provides a full-fledged dashboard built with Flask and Chart.js to visualize the metrics.
- Uses Spark to support big data analysis, tested on more than 100 million records.
- Uses Structured Streaming and can therefore be ported easily to any similar use case.
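
The folder-based streaming described in the features above could look roughly like the sketch below. It is not the project's actual `streaming.py`: the column names and their order, the function name `start_company_counts`, and the console sink are all assumptions made for illustration, and the schema should be checked against the header of `source/1.csv`.

```python
# Assumed column names and order; verify against the header of source/1.csv.
TRIP_SCHEMA_DDL = ", ".join([
    "trip_id STRING", "taxi_id STRING",
    "trip_start_timestamp STRING", "trip_end_timestamp STRING",
    "trip_seconds INT", "trip_miles DOUBLE",
    "pickup_census_tract STRING", "dropoff_census_tract STRING",
    "pickup_community_area STRING", "dropoff_community_area STRING",
    "fare DOUBLE", "tips DOUBLE", "tolls DOUBLE", "extras DOUBLE",
    "trip_total DOUBLE", "payment_type STRING", "company STRING",
])


def start_company_counts(spark, source_dir="source"):
    """Stream CSV files from source_dir and keep a running trip count per company."""
    trips = (
        spark.readStream
        .schema(TRIP_SCHEMA_DDL)          # streaming file sources need an explicit schema
        .option("header", "true")         # skip the header row of each file
        .option("maxFilesPerTrigger", 1)  # pick up one new file per micro-batch
        .csv(source_dir)
    )
    counts = trips.groupBy("company").count()
    return (
        counts.writeStream
        .outputMode("complete")  # re-emit the full aggregate on every trigger
        .format("console")       # hypothetical sink, chosen only for the sketch
        .start()
    )


def main():
    # Imported here so the definitions above can be loaded without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("taxi-trips-sketch").getOrCreate()
    start_company_counts(spark).awaitTermination()
```

Calling `main()` starts the query and blocks until it is stopped. Any CSV file dropped into the folder afterwards is picked up in a later micro-batch, which is what makes the same code reusable for other file-based feeds.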
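The analysis answers the following questions:
- How has the rate of tipping changed over the years?
- Which days are the most popular for taxi trips?
- How has the mode of payment changed over the years?
- How many total miles were travelled each year?
- How much total time was spent travelling each year?
- Which company makes the most trips per year?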
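
As a rough illustration of how some of these questions map onto Spark aggregations, the sketch below computes per-year totals for miles and time plus a tipping rate. The column names, the timestamp pattern, and the definition of "rate of tipping" as the share of trips with a non-zero tip are assumptions, not something taken from the project's code.

```python
# Hypothetical column names; adjust them to match the actual CSV header.
YEARLY_AGGREGATES = {
    "trip_miles": "sum",    # total miles travelled per year
    "trip_seconds": "sum",  # total time travelled per year, in seconds
    "tipped": "avg",        # average of a 0/1 flag = share of trips with a tip
}


def yearly_metrics(trips):
    """Aggregate a trips DataFrame (batch or streaming) by start year.

    Expects string timestamps such as '01/31/2015 10:15:00 PM'; that
    pattern is an assumption about the input files.
    """
    with_year = trips.selectExpr(
        "year(to_timestamp(trip_start_timestamp, 'MM/dd/yyyy hh:mm:ss a')) AS trip_year",
        "trip_miles",
        "trip_seconds",
        "CAST(tips > 0 AS INT) AS tipped",
    )
    # Dict-style agg maps each column name to an aggregate function name.
    return with_year.groupBy("trip_year").agg(YEARLY_AGGREGATES)
```

On a streaming DataFrame, the result has to be written with the `complete` or `update` output mode, since it is an aggregation without a watermark.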
├── Documents # Holds info about the project
├── dashboard # Code for the Flask application
│ ├── static # Holds static files associated with the dashboard
│ │ ├── css # Holds styling files for the dashboard
│ │ └── js # Holds JavaScript files used for the dashboard
│ ├── templates # Holds the HTML template used for the dashboard UI
│ └── app.py # Flask application which defines and serves the endpoints; entry point for the dashboard
├── source # Source folder for the streaming application
│ └── 1.csv # A sample source file
├── README.md # Read this first
└── streaming.py # Streaming application which reads the files and executes operations in PySpark using Structured Streaming
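
To give an idea of what an endpoint in `dashboard/app.py` might look like, here is a minimal Flask sketch that serves one metric as JSON for a Chart.js chart to plot. The route `/api/metrics/<name>`, the `LATEST_METRICS` store, and the `labels`/`values` payload shape are hypothetical; the project's real endpoints and the way they obtain data from the streaming job may differ.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Placeholder store with no real data; an actual dashboard would fill this
# from the aggregates produced by the streaming application.
LATEST_METRICS = {
    "trips_per_year": {"labels": [], "values": []},
}


@app.route("/api/metrics/<name>")
def metric(name):
    """Return one metric as JSON, or a 404 if the name is unknown."""
    data = LATEST_METRICS.get(name)
    if data is None:
        return jsonify({"error": f"unknown metric: {name}"}), 404
    return jsonify(data)
```

The front-end script could then fetch `/api/metrics/trips_per_year` and pass `labels` and `values` to a chart. Running `flask --app app run` from the `dashboard` folder would serve the sketch locally.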
A full demo, including a code walkthrough and a live run of the application, can be found here (redirects to YouTube).