Developing With EclairJS Client
EclairJS Client provides the Apache Spark API with some minor differences. It's a work in progress, so please check our API docs to see which APIs are currently implemented.
Let's look at the following example of a simple Apache Spark program where we parallelize an array of numbers, double the values, and then collect them.
var eclairjs = require('eclairjs');
var spark = new eclairjs();
var sc = new spark.SparkContext("local[*]", "Simple Spark Program");
// load in some data
var data = sc.parallelize([1.10, 2.2, 3.3, 4.4]);
// double the values using the map operator
var doubleddata = data.map(function(num) {
  return num * 2;
});
doubleddata.collect().then(function(results) {
  console.log("results: ", results);
  sc.stop();
}).catch(function(err) {
  console.error(err);
  sc.stop();
});
The EclairJS class is returned when you require the eclairjs module. Creating a new instance of that class creates a standalone Apache Spark session that exposes the Apache Spark API.
var eclairjs = require('eclairjs');
var spark = new eclairjs();
var sc = new spark.SparkContext("local[*]", "Simple Spark Program");
EclairJS allows multiple Spark jobs to be run concurrently in one Node.js process: simply create multiple instances of the eclairjs class.
Methods that return a single Spark object return their result immediately. For parallelize, the DataSet object is returned directly.
var data = sc.parallelize([1.10, 2.2, 3.3, 4.4]);
If a method is an action that returns an array of Spark objects (such as RDD.randomSplit) or if the method returns a native JavaScript object (like collect), it will return a Promise.
doubleddata.collect().then(function(results) {
  ...
}).catch(function(err) {
  ...
});
This is done because these calls can take a while to execute, and blocking while waiting for them would stall the Node.js event loop and everything else the application is doing.
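The non-blocking behaviour can be sketched without a Spark cluster. In the snippet below, fakeCollect is a hypothetical stand-in for an action such as collect (it is not part of EclairJS); it resolves after a short delay, and the code after the call keeps running before the results arrive.

```javascript
// Hypothetical stand-in for a Promise-returning action such as collect().
// It is NOT an EclairJS API; it only imitates a slow remote call.
function fakeCollect(values) {
  return new Promise(function (resolve) {
    setTimeout(function () {
      resolve(values);
    }, 10);
  });
}

var order = [];

// The call returns a Promise right away; the callback runs later.
var pending = fakeCollect([2.2, 4.4, 6.6, 8.8]).then(function (results) {
  order.push('received ' + results.length + ' results');
  return results;
});

// This line runs before the callback above, because nothing was blocked.
order.push('still responsive');
```

Immediately after this code runs, order contains only the 'still responsive' entry; the 'received' entry is appended once the Promise resolves.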
Many Spark operators take an anonymous function (often also called a lambda or UDF) as an argument, such as the map operator:
// double the values using the map operator
var doubleddata = data.map(function(num) {
  return num * 2;
});
The anonymous function gets applied to the data stored in Spark, which means it is evaluated in Spark in the worker processes and not in Node.js. Because of this, anonymous functions cannot reference variables or classes defined outside of themselves, and currently they must be pure JavaScript: no Node.js features (such as require) are supported. EclairJS does allow classes and values to be bound to anonymous functions; this is described in a separate page about anonymous functions.
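To see why outside variables are unavailable, here is a minimal sketch that assumes the function reaches the worker as source text and is rebuilt there. The helper shipAndRebuild is hypothetical and only imitates that round trip inside a single Node.js process; it is not how EclairJS is necessarily implemented.

```javascript
// Hypothetical helper (not an EclairJS API): imitates sending a function to
// another process by keeping only its source text and rebuilding it.
function shipAndRebuild(fn) {
  var source = fn.toString();
  // Functions created with new Function see only the global scope, so the
  // local variables that surrounded the original function are gone.
  return new Function('return (' + source + ');')();
}

function buildFunctions() {
  var factor = 2; // local variable, lives only in this Node.js scope
  return {
    // Self-contained: uses nothing but its argument.
    pure: function (num) { return num * 2; },
    // Depends on the outer variable "factor".
    closing: function (num) { return num * factor; }
  };
}

var fns = buildFunctions();
var rebuiltPure = shipAndRebuild(fns.pure);
var rebuiltClosing = shipAndRebuild(fns.closing);

var pureResult = rebuiltPure(21);

var closingError = null;
try {
  rebuiltClosing(21);
} catch (err) {
  closingError = err; // "factor" cannot be resolved after the round trip
}
```

The self-contained function still works after the round trip, while the one that depends on factor fails with a ReferenceError, which is the situation the binding mechanism mentioned above is meant to address.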
It is important to stop the SparkContext when your application is done. The same applies to StreamingContext.
rdd7.take(10).then(function(val) {
  ...
  sc.stop();
}).catch(function(err) {
  ...
  sc.stop();
});
This is because the actual Apache Spark program is running remotely and you always want to clean up after yourself. Also, EclairJS will not stop the SparkContext if your Node application exits, so consider handling that yourself, especially if you have a long-running Apache Spark program using Streaming.
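function exit() {
  process.exit(0);
}
function stop(e) {
  // Signal handlers are called with the signal name, so only log real errors.
  if (e instanceof Error) {
    console.log('Error:', e);
  }
  if (sc) {
    sc.stop().then(exit).catch(exit);
  } else {
    // No context was created, so there is nothing to clean up.
    exit();
  }
}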
process.on('SIGTERM', stop);
process.on('SIGINT', stop);
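Because a signal can arrive while the program is already shutting down, the stop handler may run more than once. One possible guard is sketched below; makeStopOnce and fakeContext are hypothetical names used for illustration, and the sketch assumes only that stop() returns a Promise, as in the handler above.

```javascript
// Hypothetical wrapper (not an EclairJS API): repeated stop requests share
// the Promise from the first stop() call instead of calling stop() again.
function makeStopOnce(context) {
  var pending = null;
  return function stopOnce() {
    if (!pending) {
      pending = context.stop();
    }
    return pending;
  };
}

// Stand-in for a SparkContext so the sketch runs without a Spark cluster.
var fakeContext = {
  stopCalls: 0,
  stop: function () {
    this.stopCalls += 1;
    return Promise.resolve();
  }
};

var stopOnce = makeStopOnce(fakeContext);
var first = stopOnce();  // triggers the real stop()
var second = stopOnce(); // reuses the same Promise
```

In a real program the wrapper would be created around sc, and the signal handlers would call stopOnce() in place of sc.stop().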
We have a separate page devoted to debugging.