Developing With EclairJS Client

Doron Rosenberg edited this page Nov 7, 2016 · 5 revisions

EclairJS Client provides the Apache Spark API with some minor differences. It's a work in progress, so please check our API docs to see which APIs are currently implemented.

Let's look at the following example of a simple Apache Spark program in which we parallelize an array of numbers, double the values, and then collect them.

var eclairjs = require('eclairjs');

var spark = new eclairjs();

var sc = new spark.SparkContext("local[*]", "Simple Spark Program");

// load in some data
var data = sc.parallelize([1.10, 2.2, 3.3, 4.4]);

// double the values using the map operator
var doubleddata = data.map(function(num) {
  return num * 2;
});

doubleddata.collect().then(function(results) {
  console.log("results: ", results);
  sc.stop();
}).catch(function(err) {
  console.error(err);
  sc.stop();
});

Using the EclairJS module

The EclairJS class is returned when you require the eclairjs module. Creating a new instance of that class creates a standalone Apache Spark session that exposes the Apache Spark API.

var eclairjs = require('eclairjs');

var spark = new eclairjs();

var sc = new spark.SparkContext("local[*]", "Simple Spark Program");

EclairJS allows multiple Spark jobs to run concurrently in one Node.js process: simply create multiple instances of the EclairJS class.

Method Calls

Methods that return a single Spark object return their result immediately. For example, parallelize returns the RDD object directly.

var data = sc.parallelize([1.10, 2.2, 3.3, 4.4]);

If a method is an action that returns an array of Spark objects (such as RDD.randomSplit) or if the method returns a native JavaScript object (like collect), it will return a Promise.

doubleddata.collect().then(function(results) {
   ...
}).catch(function(err) {
   ...
});

This is done because these calls can take a while to execute, and blocking on them would stall Node.js's single-threaded event loop.
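Because these methods return standard Promises, several results can be awaited together with Promise.all. The following is a minimal sketch of that pattern, not EclairJS code: fakeRdd is a hypothetical stand-in whose collect() resolves asynchronously, so the snippet runs without a Spark backend.

```javascript
// Hypothetical stand-in for an RDD: collect() resolves asynchronously,
// mimicking the Promise-returning behavior described above.
function fakeRdd(values) {
  return {
    collect: function () {
      return new Promise(function (resolve) {
        setTimeout(function () { resolve(values.slice()); }, 10);
      });
    }
  };
}

var original = fakeRdd([1.1, 2.2, 3.3, 4.4]);
var doubled = fakeRdd([2.2, 4.4, 6.6, 8.8]);

// Wait for both collects without blocking the event loop.
var summary = Promise.all([original.collect(), doubled.collect()])
  .then(function (results) {
    return { original: results[0], doubled: results[1] };
  });

summary.then(function (s) {
  console.log('original:', s.original);
  console.log('doubled: ', s.doubled);
}).catch(function (err) {
  console.error(err);
});
```

With a real SparkContext, the two fakeRdd objects would be replaced by actual RDDs, and sc.stop() would be called in both the then and catch branches.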

Anonymous Functions

Many Spark operators, such as map, take an anonymous function (often also called a lambda or UDF) as an argument:

// double the values using the map operator
var doubleddata = data.map(function(num) {
  return num * 2;
});

The anonymous function is applied to the data stored in Spark, which means it is evaluated in the Spark worker processes and not in Node.js. Because of this, an anonymous function cannot reference variables or classes defined outside of it, and currently it must be pure JavaScript: no Node.js features (such as require) are supported. EclairJS does allow classes and values to be bound to anonymous functions; this is described in a separate page about anonymous functions.
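The restriction on outside variables can be illustrated in plain JavaScript. The sketch below assumes, for illustration only, that just the function's source text reaches the remote side; how EclairJS actually transports functions is not described on this page. All names (makeDoubler, rebuilt, and so on) are hypothetical.

```javascript
// Build a function that closes over a local variable.
function makeDoubler() {
  var factor = 2; // exists only in this Node.js scope
  return function (num) { return num * factor; };
}

var doubler = makeDoubler();
console.log(doubler(3)); // works locally, where factor is in scope

// Simulate shipping the function elsewhere: only its source text travels.
var source = doubler.toString();
var rebuilt = new Function('return (' + source + ');')();

var outcome;
try {
  outcome = rebuilt(3);
} catch (err) {
  // factor was never shipped, so the rebuilt function cannot find it.
  outcome = err.name;
}
console.log(outcome);

// A self-contained function survives the same round trip.
var pure = function (num) { return num * 2; };
var rebuiltPure = new Function('return (' + pure.toString() + ');')();
console.log(rebuiltPure(3));
```

The rebuilt closure fails with a ReferenceError because its enclosing scope was left behind, while the self-contained version behaves the same before and after the round trip.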

Stopping the SparkContext

It is important to stop the SparkContext when your application is done. The same applies to StreamingContext.

rdd7.take(10).then(function(val) {
  ...
  sc.stop();
}).catch(function(err) {
  ...
  sc.stop();
});

This is because the actual Apache Spark program runs remotely and should always be cleaned up. Also, EclairJS will not stop the SparkContext when your Node.js application exits, so consider handling that yourself, especially if you have a long-running Apache Spark program that uses Streaming.

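function exit() {
  process.exit(0);
}

// Stop the SparkContext (if one exists), then exit the process.
function stop(e) {
  if (e) {
    console.log('Error:', e);
  }

  if (sc) {
    sc.stop().then(exit).catch(exit);
  } else {
    exit();
  }
}

// Wrap the handlers so the signal name is not passed to stop() as an error.
process.on('SIGTERM', function() { stop(); });
process.on('SIGINT', function() { stop(); });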
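A signal can arrive more than once (for example, when Ctrl+C is pressed twice), so it may help to make the shutdown idempotent. The following sketch shows one way to do that; stopOnce and fakeSc are hypothetical names, and fakeSc stands in for a real SparkContext whose stop() returns a Promise, as in the handler above.

```javascript
// Wrap a Promise-returning stop function so repeated calls share one shutdown.
function stopOnce(stopFn) {
  var pending = null;
  return function () {
    if (!pending) {
      pending = Promise.resolve().then(stopFn);
    }
    return pending;
  };
}

// Hypothetical stand-in for a SparkContext; it only counts stop() calls.
var stopCalls = 0;
var fakeSc = {
  stop: function () {
    stopCalls += 1;
    return Promise.resolve();
  }
};

var shutdown = stopOnce(function () { return fakeSc.stop(); });

// Two shutdown requests in quick succession trigger a single stop() call.
var done = Promise.all([shutdown(), shutdown()]).then(function () {
  console.log('stop() calls:', stopCalls);
  return stopCalls;
});
```

In an application, shutdown could be registered as the body of the SIGTERM and SIGINT handlers, followed by process.exit once the returned Promise settles.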

Error Handling and Debugging

We have a separate page devoted to debugging.