Assumptions
- Git is installed
- JDK (1.8+) is installed
- Scala (2.10+) is installed
- Maven is installed
- SBT (the Scala build tool) is installed
- A Spark cluster (with 1 master and at least 1 worker) is up and running and accessible
- The SPARK_HOME environment variable is set, e.g. in ~/.bashrc
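A minimal sketch of the ~/.bashrc entry (the install path shown is an assumption; substitute your own):
# assumed install location; adjust to your system
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin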
Steps to set up the project on your local system
- Clone the avenir, beymani, chombo, and hoidla repositories with git (see the example below).
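A minimal sketch, assuming the repositories live under the pranab account on GitHub (adjust the URLs if you use a fork):
git clone https://github.com/pranab/avenir.git
git clone https://github.com/pranab/beymani.git
git clone https://github.com/pranab/chombo.git
git clone https://github.com/pranab/hoidla.git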
- Navigate to the folder named hoidla and execute the commands below:
mvn clean install
sbt publishLocal
- Navigate to the folder named chombo and follow the sequence below.
First, build chombo in the master branch:
mvn clean install
sbt publishLocal
Then build chombo-spark in the chombo/spark directory:
sbt clean package
sbt publishLocal
- Navigate to the folder named avenir and execute the command below:
mvn clean install
- Navigate to the folder named beymani and execute the commands below:
mvn clean install
sbt publishLocal
- Build beymani-spark in the beymani/spark directory:
sbt clean package
sbt publishLocal
- Navigate to the beymani/resource folder and execute:
ant -f beymani_spark.xml
- Navigate to the chombo/resource folder and execute the command below:
ant -f chombo_spark.xml
- Navigate to the beymani/resource folder and edit the and_spark.sh file to reflect the paths on your local system:
Set the project home path (PROJECT_HOME), the Spark home path (SPARK_HOME), and the Spark master URL (MASTER), as in the sketch below.
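A minimal sketch of those settings (all values are assumptions; substitute your own):
# assumed values; adjust to your system
PROJECT_HOME=/home/me/projects
SPARK_HOME=/opt/spark
MASTER=spark://localhost:7077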
Now you are ready to run and_spark.sh. The steps below show the parameters to use for each run:
Step 1 : Create base normal data
./and_spark.sh crInput <num_of_days> <reading_interval> <num_servers> <output_file>
where num_of_days = number of days, e.g. 10
reading_interval = reading interval in seconds, e.g. 300
num_servers = number of servers, e.g. 40
output_file = output file; we will use c.txt from now on
./and_spark.sh crInput 10 300 40 c.txt
Step 2 : Copy modeling data
- Insert outliers:
./and_spark.sh insOutliers <normal_data_file> <with_outlier_data_file>
where
normal_data_file = normal data file (c.txt)
with_outlier_data_file = data file with outliers (cusage.txt)
./and_spark.sh insOutliers c.txt cusage.txt
- Copy:
./and_spark.sh cpModData <with_outlier_data_file>
where
with_outlier_data_file = data file with outliers (cusage.txt)
./and_spark.sh cpModData cusage.txt
Step 3 : Run Spark job for stats
./and_spark.sh numStat
Step 4 : Copy and consolidate stats file
./and_spark.sh crStatsFile
Step 5 : Run Spark job to detect outliers
In the configuration, set output.outliers = true and rem.outliers = true (see the sketch after this step).
./and_spark.sh olPred
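A minimal sketch of those entries as they might appear in and.conf (the exact location within the file is an assumption; the property names are from the step above):
# enable outlier output and outlier removal
output.outliers = true
rem.outliers = true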
Step 6 : Copy and consolidate clean file
./and_spark.sh crCleanFile
Step 7 : Copy test data
- Insert outliers:
./and_spark.sh insOutliers <normal_data_file> <with_outlier_data_file>
where
normal_data_file = normal data file (c.txt)
with_outlier_data_file = data file with outliers (cusage.txt)
./and_spark.sh insOutliers c.txt cusage.txt
- Copy:
./and_spark.sh cpTestData <with_outlier_data_file>
where
with_outlier_data_file = data file with outliers (cusage.txt)
./and_spark.sh cpTestData cusage.txt
Step 8 : Run Spark job for stats again with clean data
./and_spark.sh numStat
Step 9 : Copy and consolidate stats file
./and_spark.sh crStatsFile
Step 10 : Run Spark job to detect outliers
As in Step 5, set output.outliers = true and rem.outliers = true in the configuration.
./and_spark.sh olPred
Configuration
Configuration is in and.conf and and1.conf.
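Putting it all together, a condensed end-to-end run looks roughly like this (same commands and example values as in the steps above; edit the configuration before each olPred run as noted in Steps 5 and 10):
./and_spark.sh crInput 10 300 40 c.txt      # step 1: create base normal data
./and_spark.sh insOutliers c.txt cusage.txt # step 2: insert outliers
./and_spark.sh cpModData cusage.txt         #         copy modeling data
./and_spark.sh numStat                      # step 3: run Spark job for stats
./and_spark.sh crStatsFile                  # step 4: consolidate stats file
./and_spark.sh olPred                       # step 5: detect outliers
./and_spark.sh crCleanFile                  # step 6: consolidate clean file
./and_spark.sh insOutliers c.txt cusage.txt # step 7: insert outliers
./and_spark.sh cpTestData cusage.txt        #         copy test data
./and_spark.sh numStat                      # step 8: stats on clean data
./and_spark.sh crStatsFile                  # step 9: consolidate stats file
./and_spark.sh olPred                       # step 10: detect outliers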