Analysis and Visualisation of Yelp Dataset using Apache Spark | Elastic Search | Kibana

RepakaRamateja/YelpDataSet-Analysis-Visualisation

YelpDataSet-Analysis-Visualisation

Analysis and Visualisation of Yelp Dataset using Apache Spark | Elastic Search | Kibana


Introduction

Most businesses seek reviews of their goods and services in one way or another. Reviews are one of the most basic ways for a business to improve its efficiency and, in turn, its bottom line. Collecting reviews is only part of the problem: the ability to extract and visualize analytics from review data is critical to business success.

In this Apache Spark project, we use the Yelp review dataset to analyze businesses and reviews over a period of time. The analysis may reveal gaps in service delivery or show how businesses thrive in different scenarios.

Beyond processing this data, we ingest the final output of the data processing into Elasticsearch and use Kibana, the visualization tool in the ELK stack, to build various kinds of ad-hoc reports from the data.


Goal

The goal of this Spark project is to analyze business reviews from the Yelp dataset and ingest the final output of the data processing into Elasticsearch. The visualization tool in the ELK stack (Kibana) is then used to build various kinds of ad-hoc reports from the data.


Architecture

[Architecture diagrams]

• Import data from the Yelp dataset into a relational database (MySQL)

See the file get_yelp_in_mysql_databaset.txt for more information.

• Ingest data from the relational database (MySQL) into Hadoop HDFS using Sqoop

See the file yelp_mysql_sqoop_commands.txt for more information.

• Ingest data from the relational database directly into Spark

• Process the relational data in Spark

• Ingest the processed data into Elasticsearch

• Visualize review analytics using Kibana
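
The "directly into Spark" and "into Elasticsearch" steps can be sketched as follows. This is a minimal illustration, not the project's actual code: the host, port, database name, credentials, table name and the `yelp/business` target are all placeholder assumptions, and the helper names (`MySqlSource`, `PipelineSketch`) are hypothetical. It also assumes the MySQL Connector/J 5.x driver jar is on the Spark classpath.

```scala
// Hypothetical connection settings; every value passed in below is a placeholder.
final case class MySqlSource(host: String, port: Int, database: String,
                             user: String, password: String) {
  // Standard MySQL JDBC URL for the given host, port and database.
  def jdbcUrl: String = s"jdbc:mysql://$host:$port/$database"

  // Option map in the shape accepted by Spark's "jdbc" data source.
  def readOptions(table: String): Map[String, String] = Map(
    "url"      -> jdbcUrl,
    "dbtable"  -> table,
    "user"     -> user,
    "password" -> password,
    "driver"   -> "com.mysql.jdbc.Driver" // Connector/J 5.x driver class
  )
}

object PipelineSketch {
  // Assumed local MySQL instance holding the imported Yelp tables.
  val source = MySqlSource("localhost", 3306, "yelp", "spark_user", "change-me")
  val businessOptions: Map[String, String] = source.readOptions("business")

  // Inside spark-shell (Spark 1.x), the remaining steps would look roughly like:
  //   val business = sqlContext.read.format("jdbc")
  //     .options(PipelineSketch.businessOptions).load()
  //   import org.elasticsearch.spark.sql._
  //   business.saveToEs("yelp/business") // "index/type" target, name assumed
}
```

Keeping the connection settings in one small value makes it easy to reuse the same options for each table that is read (for example `review` or `category`).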

Technology stack

[Technology stack diagram]


| Area | Technology |
|---|---|
| Dataset | Yelp |
| Relational Database | MySQL |
| Big Data Ingestion Tool | Hadoop (Sqoop) |
| Distributed File System | Hadoop (HDFS) |
| Cluster Computing Framework | Apache Spark (Scala) |
| Search and Analytics Engine | Elasticsearch |

Yelp schema

[Yelp schema diagram]

Of everything in the schema, we focus on the following:

Business

category

hours

Review


Use Cases considered for Visualisation


Top 10 Business Categories

Yelp Business Map

Business distribution by state

Average rating of business over time

Top rated businesses

User sign up trend
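
Two of these use cases, "Top 10 Business Categories" and "Average rating of business over time", come down to simple group-and-aggregate logic. The sketch below shows that logic on plain Scala collections so it can be run without a cluster; in Spark the same shape would typically be expressed with `groupBy` and an aggregate on a DataFrame. The record types `CategoryRow` and `ReviewRow` and their fields are simplified, hypothetical stand-ins for the real Yelp tables.

```scala
// Hypothetical, simplified record shapes; the real Yelp tables have more columns.
final case class CategoryRow(businessId: String, category: String)
final case class ReviewRow(businessId: String, stars: Int, year: Int)

object YelpUseCases {
  // Count distinct businesses per category and keep the n largest.
  // Ties are broken alphabetically so the result is deterministic.
  def topCategories(rows: Seq[CategoryRow], n: Int): Seq[(String, Int)] =
    rows.groupBy(_.category)
      .map { case (category, group) =>
        (category, group.map(_.businessId).distinct.size)
      }
      .toSeq
      .sortBy { case (category, count) => (-count, category) }
      .take(n)

  // Mean star rating per year, ordered by year.
  def averageRatingByYear(rows: Seq[ReviewRow]): Seq[(Int, Double)] =
    rows.groupBy(_.year)
      .map { case (year, group) =>
        (year, group.map(_.stars).sum.toDouble / group.size)
      }
      .toSeq
      .sortBy(_._1)
}
```

Calling `YelpUseCases.topCategories(rows, 10)` would give the data behind the "Top 10 Business Categories" chart, and `averageRatingByYear` gives one point per year for the rating trend.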


Configuring the Environment


Install the Cloudera QuickStart VM.

Install the ELK stack.

Then configure the Scala runtime on the Cloudera QuickStart VM.

Watch the video below for more information:

https://www.youtube.com/watch?v=SFJsuo2XISs


Execution Instructions

Launch the Spark Shell

spark-shell --packages org.elasticsearch:elasticsearch-spark-13_2.10:6.1.1 --conf spark.es.index.auto.create=true --conf spark.es.nodes=<elasticsearch-ip>:<port>

Replace <elasticsearch-ip>:<port> with the IP address and port of your Elasticsearch node.


Visualisation Screenshots

After ingesting processed data into Elasticsearch

[screenshot]

Yelp user sign up trend

[two screenshots]

Business distribution by state

[screenshot]

Yelp review

[screenshot]

Top 10 Business Categories

[two screenshots]

Dashboard

[screenshot]
