Spark SQL datasource for GitHub PR API.
The package allows querying GitHub API v3 to fetch pull request information. It launches the first
requests on the driver to list available pull requests, then creates tasks that fetch the details of
each pull request. PRs are cached in the directory set by the `cacheDir` option to conserve the rate
limit. It is recommended to provide a token to lift the 60 requests/hour constraint. The package also
supports loading pull requests using Structured Streaming (experimental, Spark 2.x only); see the
usage example below.
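The Scala snippets below assume a DataFrame `df` loaded through this datasource; a minimal sketch
(the same load call appears in the examples later in this README):

```scala
// Load pull requests for the default apache/spark repository
val df = sqlContext.read.format("com.github.lightcopy.spark.pr").load()
```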
Most JSON keys are supported (see the schema). Here is example output for a subset of columns:
```
scala> df.select("number", "title", "state", "base.repo.full_name", "user.login",
  "commits", "additions", "deletions").show()
+------+--------------------+-----+------------+------------+-------+---------+---------+
|number|               title|state|   full_name|       login|commits|additions|deletions|
+------+--------------------+-----+------------+------------+-------+---------+---------+
| 15599|[SPARK-18022][SQL...| open|apache/spark|      srowen|      1|        1|        1|
| 15598|[SPARK-18027][YAR...| open|apache/spark|      srowen|      1|        2|        0|
| 15597|[SPARK-18063][SQL...| open|apache/spark| jiangxb1987|      2|       16|        6|
| 15596|[SQL] Remove shuf...| open|apache/spark|      viirya|      1|       13|       12|
+------+--------------------+-----+------------+------------+-------+---------+---------+
```
Spark version | spark-github-pr latest version |
---|---|
1.6.x | 1.2.0 |
2.x.x | 1.3.0 |
The spark-github-pr package can be added to Spark by using the `--packages` command line option.
For example, run this to include it when starting the spark shell:

```
$SPARK_HOME/bin/spark-shell --packages lightcopy:spark-github-pr:1.3.0-s_2.10
```

Change to `lightcopy:spark-github-pr:1.3.0-s_2.11` for Scala 2.11.x.
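The same coordinate also works with `spark-submit` when running a packaged application; a minimal
sketch, where `your-app.jar` is a placeholder for your application JAR:

```
$SPARK_HOME/bin/spark-submit \
  --packages lightcopy:spark-github-pr:1.3.0-s_2.11 \
  your-app.jar
```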
Currently supported options:
Name | Since | Example | Description |
---|---|---|---|
`user` | 1.0.0 | apache | GitHub username or organization, default is `apache` |
`repo` | 1.0.0 | spark | GitHub repository name for the provided user, default is `spark` |
`batch` | 1.0.0 | 100 | number of pull requests to fetch, default is 25, must be >= 1 and <= 1000 |
`token` | 1.0.0 | auth_token | authentication token to increase the rate limit from 60 to 5000 requests/hour, see GitHub Auth for more info |
`cacheDir` | 1.0.0 | file:/tmp/.spark-github-pr | directory to store cached pull request information; currently required to be a shared folder on the local file system or a directory on HDFS |
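For example, the `token` and `cacheDir` options can be passed like any other option; a minimal
sketch, where `YOUR_AUTH_TOKEN` is a placeholder for a real GitHub token:

```scala
// Authenticate to raise the rate limit and cache PRs in a custom directory
val df = sqlContext.read.format("com.github.lightcopy.spark.pr").
  option("user", "apache").option("repo", "spark").
  option("token", "YOUR_AUTH_TOKEN"). // placeholder, substitute your own token
  option("cacheDir", "file:/tmp/.spark-github-pr").
  load()
```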
```scala
// Load default number of pull requests from apache/spark
val df = sqlContext.read.format("com.github.lightcopy.spark.pr").load().
  select("number", "title", "user.login")

// Load pull requests for the specified user/organization and repository
val df = sqlContext.read.format("com.github.lightcopy.spark.pr").
  option("user", "apache").option("repo", "spark").load().
  select("number", "title", "state", "base.repo.full_name", "user.login", "commits")
```
```scala
// You can also specify batch size for the number of pull requests to fetch
val df = sqlContext.read.format("com.github.lightcopy.spark.pr").
  option("user", "apache").option("repo", "spark").option("batch", "52").load()

// Standard DataFrame operations apply, e.g. filtering on PR columns
val df = sqlContext.read.format("com.github.lightcopy.spark.pr").
  option("user", "apache").option("repo", "spark").load()
val res = df.where("commits > 10")
```
The datasource can also be registered and queried with SQL:

```sql
CREATE TEMPORARY TABLE prs
USING com.github.lightcopy.spark.pr
OPTIONS (user "apache", repo "spark");

SELECT number, title FROM prs LIMIT 10;
```
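The same statements can be run from Scala through `sqlContext.sql`; a minimal sketch:

```scala
// Register the datasource as a temporary table, then query it
sqlContext.sql("""
  CREATE TEMPORARY TABLE prs
  USING com.github.lightcopy.spark.pr
  OPTIONS (user "apache", repo "spark")
""")
sqlContext.sql("SELECT number, title FROM prs LIMIT 10").show()
```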
Structured Streaming example (experimental, Spark 2.x only):

```scala
val df = spark.readStream.format("com.github.lightcopy.spark.pr").load()
val query = df.select("number", "title", "user.login").
  writeStream.format("console").option("checkpointLocation", "./checkpoint").start()
```
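To keep a streaming query alive in an application (standard Structured Streaming API, not specific
to this package):

```scala
// Block until the query terminates, e.g. on failure or an explicit query.stop()
query.awaitTermination()
```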
This library is built using sbt. To build a JAR file, simply run `sbt package` from the project
root.

To run tests, run `sbt test` from the project root. CI runs against Spark 2.0 only.