Log file processing data pipeline built using Lambda architecture | Flume | Apache Spark | Spark Streaming | Apache Kafka | HDFS | HBase | Hive | Impala | Oozie
• Storing, processing, and mining data from web server logs has become mainstream for a lot of companies today. Industry giants have used this engineering, and the accompanying science of machine learning, to extract information that has helped in ad targeting, improved search, application optimization, and general improvement of the application's user experience.
• Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods.
• In this Hadoop project, we will be using a sample application log file from an application server to demonstrate a scaled-down server log processing pipeline. Going from ingestion to insight usually requires Hadoop-ecosystem tools like Flume, Pig, Spark, Hive/Impala, Kafka, and Oozie, with HDFS for storage, and this is what we will be looking at, both holistically and at each specific stage of the pipeline.
• That processing, however, is batch processing, which operates in the batch and serving layers; in the Lambda architecture (batch, speed, serving) we also have a speed layer. Here we go one step further by bringing processing to the speed layer, which opens up more capabilities: the ability to monitor application performance in real time, to measure the real-time user experience of the application, or to raise real-time alerts in case of a security breach. These abilities and functionalities will be explored using Spark Streaming in a streaming architecture.
• Using Flume to ingest log data
• Using Spark to process data
• Integrating Kafka for complex event alerting
• Using Impala for low-latency querying of processed log data
• Coordinating the data processing pipeline with Oozie.
• Getting logs in real time using the Flume Log4J appender
• Writing a custom Flume configuration for Spark Streaming (see the sketch after this list)
• Storing log events as time-series datasets in HBase
• Integrating Hive and HBase for query-based data retrieval
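
The speed-layer pieces above (Flume Log4J appender, custom Flume configuration, Spark Streaming, Kafka alerting) fit together roughly as in the minimal sketch below. This is an illustration only, not the project's actual source: the Flume SparkSink host and port, the Kafka broker address, the alert topic name, and the error threshold are all assumptions to adapt to your own setup.

```scala
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LogAlertStream {
  // Hypothetical endpoints -- point these at your own Flume SparkSink and Kafka broker.
  val FlumeSinkHost  = "quickstart.cloudera"
  val FlumeSinkPort  = 9988
  val KafkaBrokers   = "quickstart.cloudera:9092"
  val AlertTopic     = "log-alerts"
  val ErrorThreshold = 50L

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("LogAlertStream"), Seconds(10))

    // Poll events from the Flume SparkSink configured for Spark Streaming
    // (the pull-based approach, rather than having Flume push to a receiver).
    val events = FlumeUtils.createPollingStream(ssc, FlumeSinkHost, FlumeSinkPort)

    // Decode each Flume event body into a plain log line.
    val lines = events.map { e =>
      val body  = e.event.getBody
      val bytes = new Array[Byte](body.remaining())
      body.get(bytes)
      new String(bytes, "UTF-8")
    }

    // Count 5xx responses per 10-second batch; in Common Log Format the status
    // code is the second-to-last whitespace-separated field.
    val serverErrors = lines
      .map(_.split(" "))
      .filter(f => f.length >= 2 && f(f.length - 2).startsWith("5"))
      .count()

    // Publish an alert to Kafka whenever the error count crosses the threshold.
    serverErrors.foreachRDD { rdd =>
      val errors = rdd.take(1).headOption.getOrElse(0L)
      if (errors > ErrorThreshold) {
        val props = new Properties()
        props.put("bootstrap.servers", KafkaBrokers)
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        producer.send(new ProducerRecord[String, String](
          AlertTopic, s"High 5xx rate: $errors server errors in the last batch"))
        producer.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```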
| Area | Technology |
| --- | --- |
| Data Ingestion Tool | Apache Flume |
| Cluster Computing Framework | Apache Spark and Spark Streaming (Scala) |
| Message Broker | Apache Kafka |
| Non-Relational Distributed Database | HBase |
| Query Engine | Hive and Impala |
| Orchestration System for Batch Layer | Oozie |
| Distributed File System | HDFS |
Web Server Log Processing Use Cases
• Application Health Monitoring
• Fraud and Security
• User Patterns (sessionizing a click stream; see the sketch after this list)
• User Experience
• Support Triage
• Metric Data Collection
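
As an illustration of the sessionization use case, a batch-layer Spark job can group hits by host, order them by time, and start a new session whenever two consecutive hits are more than 30 minutes apart. The sketch below is hypothetical: the input path and the tab-separated (host, epoch-seconds) record layout are assumptions, not the project's actual files.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object Sessionize {
  val SessionGapSeconds = 30 * 60 // assumed 30-minute inactivity timeout

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Sessionize"))

    // (host, timestamp) pairs; assumed to be tab-separated parsed logs on HDFS.
    val hits = sc.textFile("/user/cloudera/logs/parsed")
      .map(_.split("\t"))
      .map(f => (f(0), f(1).toLong))

    // Group hits per host, sort by time, and count sessions as gaps + 1.
    // groupByKey is fine for a sketch; very active hosts would need a secondary sort.
    val sessionCounts = hits
      .groupByKey()
      .mapValues { timestamps =>
        val sorted = timestamps.toSeq.sorted
        val breaks = sorted.sliding(2).count {
          case Seq(a, b) => b - a > SessionGapSeconds
          case _         => false
        }
        breaks + 1
      }

    sessionCounts.saveAsTextFile("/user/cloudera/logs/sessions")
    sc.stop()
  }
}
```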
Download the web server log dataset available from http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html. We must make analysis, event processing, and retrieval of log data of varying kinds possible.
Data from the logs must be available either through a low-latency querying tool or through real-time event reporting and analysis.
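
The NASA-HTTP dataset is in Common Log Format (for example: `199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245`). Before the data can be queried or reported on, each line has to be parsed into structured fields; a minimal Scala parser sketch (not the project's exact code) could look like this:

```scala
import java.text.SimpleDateFormat
import java.util.Locale

// One structured record per log line, ready to be written to HDFS for
// Hive/Impala or keyed into HBase as a time series.
case class LogEvent(host: String, timestamp: Long, method: String,
                    resource: String, status: Int, bytes: Long)

object LogParser {
  // host ident authuser [timestamp] "method resource protocol" status bytes
  private val LogPattern =
    """^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+).*" (\d{3}) (\S+)""".r

  private def dateFormat =
    new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH)

  def parse(line: String): Option[LogEvent] =
    LogPattern.findFirstMatchIn(line).map { m =>
      LogEvent(
        host      = m.group(1),
        timestamp = dateFormat.parse(m.group(2)).getTime,
        method    = m.group(3),
        resource  = m.group(4),
        status    = m.group(5).toInt,
        // the byte count is "-" when the response had no body
        bytes     = if (m.group(6) == "-") 0L else m.group(6).toLong)
    }
}
```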
Install the Cloudera QuickStart VM 5.7 or 5.8.
Then configure the Scala runtime on the Cloudera QuickStart VM.
Watch the video below for more information:
https://www.youtube.com/watch?v=SFJsuo2XISs
Please go through the files in batch/commands:
flume_commands.txt
hdfs_command.txt
kafka_setup_run_command.txt
Please go through the file commands.txt in the realtime folder.
Also execute start-logging-app.sh to start the real-time Flume Log4J appender application (a sketch of such an app follows below).
The source code is in the spark streaming folder.
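
For reference, start-logging-app.sh launches an application whose Log4J output is shipped to Flume by the Flume Log4J appender. The sketch below shows what such a logging app might look like; the class name, the generated messages, and the host/port shown in the commented log4j.properties lines are assumptions, not the project's actual configuration.

```scala
import org.apache.log4j.Logger

import scala.util.Random

// A tiny log generator: every log.info() call becomes a Flume event when the
// Log4J appender is wired to a Flume Avro source in log4j.properties, roughly:
//   log4j.rootLogger=INFO, flume
//   log4j.appender.flume=org.apache.flume.clients.log4jappender.Log4jAppender
//   log4j.appender.flume.Hostname=quickstart.cloudera   (assumed Flume host)
//   log4j.appender.flume.Port=41414                      (assumed Avro source port)
object LoggingApp {
  private val log   = Logger.getLogger(getClass)
  private val pages = Seq("/home", "/search", "/checkout", "/api/status")

  def main(args: Array[String]): Unit = {
    while (true) {
      val page   = pages(Random.nextInt(pages.length))
      val status = if (Random.nextInt(100) < 5) 500 else 200 // ~5% simulated errors
      log.info(s"GET $page $status")
      Thread.sleep(500)
    }
  }
}
```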