Spark SQL datasource for GitHub PR API.
The package allows querying GitHub API v3 to fetch pull request information. It launches the first
requests on the driver to list available pull requests, then creates tasks that fetch the details of
each pull request. PRs are cached in the directory set by the `cacheDir` option to conserve the rate
limit. It is recommended to provide a token to lift the 60 requests/hour constraint. The package also
supports loading pull requests using Structured Streaming (experimental, Spark 2.x only); see the
usage example below.
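The Scala snippets below assume a DataFrame `df` loaded through this datasource; a minimal sketch
(the same load call appears in the examples later in this README):

```scala
// Load pull requests for the default apache/spark repository
val df = sqlContext.read.format("com.github.lightcopy.spark.pr").load()
```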
Most JSON keys are supported (see the schema). Here is example output for a subset of columns:
```
scala> df.select("number", "title", "state", "base.repo.full_name", "user.login",
  "commits", "additions", "deletions").show()
+------+--------------------+-----+------------+------------+-------+---------+---------+
|number|               title|state|   full_name|       login|commits|additions|deletions|
+------+--------------------+-----+------------+------------+-------+---------+---------+
| 15599|[SPARK-18022][SQL...| open|apache/spark|      srowen|      1|        1|        1|
| 15598|[SPARK-18027][YAR...| open|apache/spark|      srowen|      1|        2|        0|
| 15597|[SPARK-18063][SQL...| open|apache/spark| jiangxb1987|      2|       16|        6|
| 15596|[SQL] Remove shuf...| open|apache/spark|      viirya|      1|       13|       12|
+------+--------------------+-----+------------+------------+-------+---------+---------+
```
Spark version | spark-github-pr latest version |
---|---|
1.6.x | 1.2.0 |
2.x.x | 1.3.0 |
The spark-github-pr package can be added to Spark by using the `--packages` command line option.
For example, run this to include it when starting the spark shell:

```
$SPARK_HOME/bin/spark-shell --packages lightcopy:spark-github-pr:1.3.0-s_2.10
```

Change to `lightcopy:spark-github-pr:1.3.0-s_2.11` for Scala 2.11.x.
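The same coordinate also works with `spark-submit` when running a packaged application; a minimal
sketch, where `your-app.jar` is a placeholder for your application JAR:

```
$SPARK_HOME/bin/spark-submit \
  --packages lightcopy:spark-github-pr:1.3.0-s_2.11 \
  your-app.jar
```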
Currently supported options:
Name | Since | Example | Description |
---|---|---|---|
`user` | 1.0.0 | apache | GitHub username or organization, default is `apache` |
`repo` | 1.0.0 | spark | GitHub repository name for the provided user, default is `spark` |
`batch` | 1.0.0 | 100 | number of pull requests to fetch, default is 25, must be >= 1 and <= 1000 |
`token` | 1.0.0 | auth_token | authentication token to increase the rate limit from 60 to 5000 requests/hour, see GitHub Auth for more info |
`cacheDir` | 1.0.0 | file:/tmp/.spark-github-pr | directory to store cached pull request information; currently required to be a shared folder on the local file system or a directory on HDFS |
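For example, the `token` and `cacheDir` options can be passed like any other option; a minimal
sketch, where `YOUR_AUTH_TOKEN` is a placeholder for a real GitHub token:

```scala
// Authenticate to raise the rate limit and cache PRs in a custom directory
val df = sqlContext.read.format("com.github.lightcopy.spark.pr").
  option("user", "apache").option("repo", "spark").
  option("token", "YOUR_AUTH_TOKEN"). // placeholder, substitute your own token
  option("cacheDir", "file:/tmp/.spark-github-pr").
  load()
```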
```scala
// Load default number of pull requests from apache/spark
val df = sqlContext.read.format("com.github.lightcopy.spark.pr").load().
  select("number", "title", "user.login")

// Load pull requests for the specified user/organization and repository
val df = sqlContext.read.format("com.github.lightcopy.spark.pr").
  option("user", "apache").option("repo", "spark").load().
  select("number", "title", "state", "base.repo.full_name", "user.login", "commits")
```
```scala
// You can also specify batch size for the number of pull requests to fetch
val df = sqlContext.read.format("com.github.lightcopy.spark.pr").
  option("user", "apache").option("repo", "spark").option("batch", "52").load()

// Standard DataFrame operations apply, e.g. filtering on PR columns
val df = sqlContext.read.format("com.github.lightcopy.spark.pr").
  option("user", "apache").option("repo", "spark").load()
val res = df.where("commits > 10")
```
The datasource can also be registered and queried with SQL:

```sql
CREATE TEMPORARY TABLE prs
USING com.github.lightcopy.spark.pr
OPTIONS (user "apache", repo "spark");

SELECT number, title FROM prs LIMIT 10;
```
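The same statements can be run from Scala through `sqlContext.sql`; a minimal sketch:

```scala
// Register the datasource as a temporary table, then query it
sqlContext.sql("""
  CREATE TEMPORARY TABLE prs
  USING com.github.lightcopy.spark.pr
  OPTIONS (user "apache", repo "spark")
""")
sqlContext.sql("SELECT number, title FROM prs LIMIT 10").show()
```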
Structured Streaming example (experimental, Spark 2.x only):

```scala
val df = spark.readStream.format("com.github.lightcopy.spark.pr").load()
val query = df.select("number", "title", "user.login").
  writeStream.format("console").option("checkpointLocation", "./checkpoint").start()
```
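To keep a streaming query alive in an application (standard Structured Streaming API, not specific
to this package):

```scala
// Block until the query terminates, e.g. on failure or an explicit query.stop()
query.awaitTermination()
```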
This library is built using sbt. To build a JAR file, simply run `sbt package` from the project
root.

To run tests, run `sbt test` from the project root. CI runs against Spark 2.0 only.