
Log-File-Processing-Data-Pipeline

Log file processing data pipeline built using Lambda architecture | Flume | Apache Spark | Spark Streaming | Apache Kafka | HDFS | HBase | Hive | Impala | Oozie


Introduction

• Storing, processing, and mining data from web server logs has become mainstream for a lot of companies today. Industry giants have used this engineering, and the accompanying science of machine learning, to extract information that has helped in ad targeting, improved search, application optimization, and general improvement of the application's user experience.

• Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods.


Goal

• In this Hadoop project, we use a sample application log file from an application server to demonstrate a scaled-down server-log processing pipeline. Going from ingestion to insight usually requires Hadoop-ecosystem tools such as Flume, Pig, Spark, Hive/Impala, Kafka, and Oozie, with HDFS for storage, and this is what we will look at, both holistically and specifically at each stage of the pipeline.

• That processing, however, is batch processing, which covers the batch and serving layers; the Lambda architecture (batch, speed, serving) also has a speed layer. We therefore go one step further by bringing processing to the speed layer, which opens up more capabilities: monitoring application performance in real time, measuring user experience with the application in real time, or raising real-time alerts in case of a security breach. These abilities and functionalities will be explored using Spark Streaming in a streaming architecture.


Architecture

High Level Architecture



Batch layer


• Using Flume to ingest log data

• Using Spark to process the data (see the sketch after this list)

• Integrating Kafka for complex-event alerting

• Using Impala for low-latency queries of the processed log data

• Coordinating the data processing pipeline with Oozie
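
As a rough illustration of the Spark processing step, the sketch below parses Common Log Format lines (the format used by the NASA-HTTP dataset) and builds a simple batch view. The object name, HDFS paths, and the choice of metric are assumptions for illustration, not the project's exact code.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BatchLogProcessor {
  // Common Log Format: host ident authuser [timestamp] "request" status bytes
  private val LogLine =
    """^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)""".r

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BatchLogProcessor"))

    // Raw log files landed by Flume (path is an assumption for illustration).
    val raw = sc.textFile("hdfs:///user/cloudera/logs/raw")

    // Keep only lines that match the expected format.
    val parsed = raw.flatMap {
      case LogLine(host, timestamp, request, status, bytes) =>
        Some((host, timestamp, request, status.toInt))
      case _ => None
    }

    // One possible batch view: request counts per HTTP status code.
    val statusCounts = parsed
      .map { case (_, _, _, status) => (status, 1L) }
      .reduceByKey(_ + _)

    // Processed results land back in HDFS for Impala/Hive to query.
    statusCounts.saveAsTextFile("hdfs:///user/cloudera/logs/status_counts")
    sc.stop()
  }
}
```

In the actual pipeline, a step like this is what Oozie schedules, and events of interest can additionally be published to Kafka for alerting.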


Speed layer


• Collecting logs in real time using Flume Log4j appenders

• Custom Flume configuration for Spark Streaming (see the sketch after this list)

• Storing log events as time-series datasets in HBase

• Integrating Hive and HBase for data retrieval using queries
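
The following minimal Spark Streaming sketch shows the shape of the speed-layer job, assuming a Flume agent configured with the Spark polling sink (org.apache.spark.streaming.flume.sink.SparkSink). The host, port, and the ERROR-count metric are illustrative assumptions.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object StreamingLogProcessor {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingLogProcessor")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Pull events from a Flume agent running the Spark polling sink.
    // Host and port are assumptions and must match the Flume configuration.
    val events = FlumeUtils.createPollingStream(ssc, "quickstart.cloudera", 9988)

    // Each Flume event body is the raw log line.
    val lines = events.map(e => new String(e.event.getBody.array()))

    // One possible real-time view: ERROR occurrences per 10-second batch.
    lines.filter(_.contains("ERROR")).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

In the real job, each event would also be written to HBase (for example, inside foreachRDD) with a row key built from the host and timestamp, so that events form a time series.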

Data Ingestion

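As a rough illustration of the ingestion configuration, a batch-path Flume agent could be set up along these lines. The agent name, directories, and capacities are assumptions; the project's actual configuration files appear in the screenshots below.

```
# Hypothetical agent that moves application logs into HDFS (batch path).
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# Spooling-directory source picks up completed log files.
agent1.sources.src1.type     = spooldir
agent1.sources.src1.spoolDir = /var/log/app/spool
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type     = memory
agent1.channels.ch1.capacity = 10000

# HDFS sink lands raw events for the batch layer.
agent1.sinks.sink1.type          = hdfs
agent1.sinks.sink1.hdfs.path     = hdfs:///user/cloudera/logs/raw
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.channel       = ch1
```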

Oozie Workflow

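For orientation, a minimal Oozie workflow that runs the batch Spark job might look roughly like this; the workflow name, class, and jar path are hypothetical placeholders.

```xml
<workflow-app xmlns="uri:oozie:workflow:0.4" name="log-batch-wf">
  <start to="process-logs"/>
  <action name="process-logs">
    <spark xmlns="uri:oozie:spark-action:0.1">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <master>yarn-cluster</master>
      <name>BatchLogProcessor</name>
      <class>BatchLogProcessor</class>
      <jar>${nameNode}/user/cloudera/apps/batch-log-pipeline.jar</jar>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Batch log processing failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
```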


Technology stack



Area                                   Technology
Data Ingestion Tool                    Apache Flume
Cluster Computing Framework            Apache Spark and Spark Streaming (Scala)
Message Broker                         Apache Kafka
Non-Relational Distributed Database    Apache HBase
Query Engine                           Hive and Impala
Orchestration System for Batch Layer   Apache Oozie
Distributed File System                HDFS

Use Cases

Web Server Log Processing Use Case

• Application Health Monitoring

• Fraud and Security

• User Patterns (sessionizing a clickstream)

• User Experience

• Support Triage

• Metric Data Collection


Configuring Environment


Download the web server log dataset available from http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html. We must make analysis, event processing, and retrieval of log data of varying kinds possible.

Data from the logs must be available either through a low-latency querying tool or through real-time event reporting and analysis.

Install the Cloudera QuickStart VM 5.7 or 5.8.

Then configure the Scala runtime on the Cloudera QuickStart VM.

Watch the video below for more information:

https://www.youtube.com/watch?v=SFJsuo2XISs



Execution Instructions

Batch layer

Please go through the files in batch/commands:

flume_commands.txt

hdfs_command.txt

kafka_setup_run_command.txt
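
For reference, submitting the batch Spark job generally takes a form like the one below; the class name, jar, and paths are hypothetical placeholders, and the exact commands are in the files above.

```
spark-submit \
  --class BatchLogProcessor \
  --master yarn \
  --deploy-mode client \
  batch-log-pipeline.jar
```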

Speed layer

Please go through the file commands.txt in the realtime folder.

Also execute start-logging-app.sh to start the real-time Flume Log4j appenders.

The source code is in the spark streaming folder.
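
For the speed layer's query access, Hive is typically mapped onto the HBase table with an external-table definition along these lines; the table name and column-family mapping here are assumptions rather than the project's exact schema.

```sql
-- Map an existing HBase time-series table into Hive for SQL access.
CREATE EXTERNAL TABLE log_events (
  rowkey  STRING,   -- e.g. host plus timestamp
  message STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,events:message"
)
TBLPROPERTIES ("hbase.table.name" = "log_events");
```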

Screen shots

The repository includes screenshots of the following steps:

• Flume config files

• Flume agent 1

• Flume agent 2

• Flume server

• Batch Spark job submit

• After processing

• Launching Impala and executing queries

• Hive queries

• After initialization of the batch Oozie workflow

• Launching the real-time Spark submit job

• Launching the HBase shell to see the new tables

• Launching Hive to see the logs stored in tables
