Commit fb439c5: Experiments for SaGe backends

JulienDavat committed Jun 6, 2021 (0 parents)

Showing 21 changed files with 550 additions and 0 deletions.

105 changes: 105 additions & 0 deletions README.md
# SaGe: A Preemptive SPARQL Server for Online Knowledge Graphs

**Authors:** Julien Aimonier-Davat (LS2N), Hala Skaf-Molli (LS2N), Pascal Molli (LS2N) and Thomas Minier

**Abstract**
In order to provide stable and responsive SPARQL endpoints to the community, public SPARQL endpoints enforce fair use policies. Unfortunately, long-running SPARQL queries cannot complete under fair use policy restrictions and deliver only partial results. In this paper, we present SaGe, a SPARQL server based on the web preemption principle. Instead of stopping a query after a quota of time, SaGe suspends the query and returns it to the user. The user is then free to resume execution from the point where the query was stopped, simply by returning the suspended query to the server. In this paper, we describe the current state of the SaGe server, including the latest advances in the expressiveness of the server and its ability to support updates.

# Experimental results

## Dataset and Queries

In our experiments, we re-use the RDF dataset and the
SPARQL queries from the [BrTPF](https://doi.org/10.1007/978-3-319-48472-3_48)
experimental study. The dataset contains 10M triples, and we randomly picked 60
queries such that each query completes within 30 minutes.

## Machine configuration

We run all our experiments on a `MacBook Pro` with a `2.3 GHz Intel Core i7`
processor and a `1 TB SSD`.

## Plots

**Plot 1**: Execution time of the query `?s ?p ?o` using the different backends.

![](figures/spo_execution_times.png?raw=true)

**Plot 2**: Suspend/Resume time for the different backends and triple-pattern shapes.

![](figures/suspend_resume_times.png?raw=true)

**Plot 3**: The execution time of the different backends on the `WatDiv` queries.

![](figures/execution_times.png?raw=true)

# Experimental study

## Dependencies

To run our experiments, the following software and packages must be installed on your system.
* [Python3.7](https://www.python.org) with development headers
* [Virtualenv](https://pypi.org/project/virtualenv)
* [sage-engine](https://github.com/sage-org/sage-engine)
* [PostgreSQL](https://www.postgresql.org)
* [HBase](https://hbase.apache.org)

## Installation

Once all dependencies have been installed, clone this repository and install the project.

```bash
# clone the project repository
git clone https://github.com/JulienDavat/sage-backends-experiments.git
cd sage-backends-experiments
# create a virtual environment to isolate project dependencies
virtualenv sage-env
# activate the virtual environment
source sage-env/bin/activate
# install the main dependencies
pip install -r requirements.txt
```

## Preparation

```bash
# download datasets into the graphs directory
mkdir graphs && cd graphs
wget nas.jadserver.fr/thesis/projects/sage/datasets/watdiv10M.hdt
wget nas.jadserver.fr/thesis/projects/sage/datasets/watdiv10M.nt
cd ..
# download queries into the workloads directory
cd workloads
wget nas.jadserver.fr/thesis/projects/sage/queries/watdiv_workloads.gz
cd ..
# insert data into PostgreSQL
sage-postgres-init --no-index configs/sage/backends.yaml sage_psql
sage-postgres-put graphs/watdiv10M.nt configs/sage/backends.yaml sage_psql
sage-postgres-index configs/sage/backends.yaml sage_psql
# insert data into SQLite
sage-sqlite-init --no-index configs/sage/backends.yaml sage_sqlite
sage-sqlite-put graphs/watdiv10M.nt configs/sage/backends.yaml sage_sqlite
sage-sqlite-index configs/sage/backends.yaml sage_sqlite
# insert data into HBase
sage-hbase-init --no-index configs/sage/backends.yaml sage_hbase
sage-hbase-put graphs/watdiv10M.nt configs/sage/backends.yaml sage_hbase
sage-hbase-index configs/sage/backends.yaml sage_hbase
# run the SaGe server
sage configs/sage/backends.yaml -w 1 -p 8080
```

## Running the experiments

Our experimental study is powered by **Snakemake**. The main commands used in our
experimental study are given below:

```bash
# Plot backends execution times for the ?s ?p ?o query
snakemake --cores 1 figures/spo_execution_times.png

# Plot backends suspend/resume times
snakemake --cores 1 figures/suspend_resume_times.png

# Plot backends execution times for a given WatDiv workload
snakemake --cores 1 figures/[workload directory]/execution_times.png
```
2 changes: 2 additions & 0 deletions Snakefile
include: "rules/exec.smk"
include: "rules/plot.smk"
37 changes: 37 additions & 0 deletions configs/sage/backends.yaml
name: SaGe experimental server
quota: 60000
max_results: 10000
graphs:
  - name: sage_psql
    uri: http://localhost:8080/sparql/sage_psql
    backend: postgres
    dbname: sage
    user: sage
    password: 'sage'
  - name: sage_psql_catalog
    uri: http://localhost:8080/sparql/sage_psql_catalog
    backend: postgres-catalog
    dbname: sage
    user: sage
    password: 'sage'
  - name: sage_sqlite
    uri: http://localhost:8080/sparql/sage_sqlite
    backend: sqlite
    database: graphs/sage-sqlite.db
  - name: sage_sqlite_catalog
    uri: http://localhost:8080/sparql/sage_sqlite_catalog
    backend: sqlite-catalog
    database: graphs/sage-sqlite-catalog.db
  - name: sage_hdt
    uri: http://localhost:8080/sparql/sage_hdt
    backend: hdt-file
    file: graphs/watdiv.10M.hdt
  - name: sage_hbase
    uri: http://localhost:8080/sparql/sage_hbase
    backend: hbase
    thrift_host: localhost
9 changes: 9 additions & 0 deletions configs/sage/sage-hdt.yaml
name: SaGe experimental server
quota: 60000
max_results: 10000
graphs:
  - name: sage_hdt
    uri: http://localhost:8080/sparql/sage_hdt
    backend: hdt-file
    file: graphs/watdiv.100M.hdt
17 changes: 17 additions & 0 deletions configs/sage/sage-psql.yaml
name: SaGe experimental server
quota: 60000
max_results: 10000
graphs:
  - name: sage_psql
    uri: http://localhost:8080/sparql/sage_psql
    backend: postgres
    dbname: sage
    user: sage
    password: 'sage'
  - name: sage_psql_catalog
    uri: http://localhost:8080/sparql/sage_psql_catalog
    backend: postgres-catalog
    dbname: sage
    user: sage
    password: 'sage'
9 changes: 9 additions & 0 deletions configs/sage/sage-sqlite.yaml
name: SaGe experimental server
quota: 60000
max_results: 10000
graphs:
  - name: sage_sqlite_100M
    uri: http://localhost:8080/sparql/sage_sqlite
    backend: sqlite
    database: graphs/sage-sqlite-100M.db
Binary file added figures/execution_times.png
Binary file added figures/spo_execution_times.png
Binary file added figures/suspend_resume_times.png
6 changes: 6 additions & 0 deletions requirements.txt
sparqlwrapper
snakemake
seaborn
matplotlib
coloredlogs
click
26 changes: 26 additions & 0 deletions rules/exec.smk
rule run_sage:
    input:
        ancient("workloads/{workload}/{query}.rq")
    output:
        result="output/{workload,[^/]+}/{backend,sage_[^/]+}/{query,[^/]+}.json",
        stats="output/{workload,[^/]+}/{backend,sage_[^/]+}/{query,[^/]+}.csv",
    params:
        endpoint="http://localhost:8080/sparql",
    shell:
        "python scripts/query_sage.py {input} \
        http://localhost:8080/sparql http://localhost:8080/sparql/{wildcards.backend} \
        --output {output.result} --measures {output.stats}"


rule run_virtuoso:
    input:
        ancient("workloads/{workload}/{query}.rq")
    output:
        result="output/{workload,[^/]+}/virtuoso/{query,[^/]+}.json",
        stats="output/{workload,[^/]+}/virtuoso/{query,[^/]+}.csv",
    params:
        endpoint="http://localhost:8890/sparql",
    shell:
        "python scripts/query_virtuoso.py {input} \
        http://localhost:8890/sparql http://example.org/datasets/watdiv10M \
        --output {output.result} --measures {output.stats}"
70 changes: 70 additions & 0 deletions rules/plot.smk
from scripts.utils import list_files, query_name

def list_workload_queries(wildcards):
    return [query_name(q) for q in list_files(f"workloads/{wildcards.workload}", "rq")]

def list_hbase_queries(wildcards):
    return [query_name(q) for q in list_files(f"output/{wildcards.workload}/sage_hbase", "csv")]

rule prepare_backend_data:
    input:
        "output/{workload}/{backend}/{query}.csv"
    output:
        "output/{workload,[^/]+}/{backend,[^/]+}/{query,[^/]+}-prepared.csv"
    shell:
        "touch {output}; "
        "echo 'backend,query,execution_time,nb_calls,nb_results,loading_time,resume_time' > {output}; "
        "echo -n '{wildcards.backend},{wildcards.query},' >> {output}; "
        "cat {input} >> {output}"
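Outside Snakemake, the shell part of this rule can be sketched as follows. The file names and measure values below are invented for illustration; in the real pipeline the raw measures come from `query_sage.py`. `printf '%s'` stands in for `echo -n` for portability.

```shell
# Pretend query_sage.py wrote the raw measures for one query (values are made up)
echo '12.5,3,100,0.4,0.2' > q1.csv

# Reproduce the rule's shell commands for backend=sage_psql, query=q1
out=q1-prepared.csv
echo 'backend,query,execution_time,nb_calls,nb_results,loading_time,resume_time' > "$out"
printf '%s' 'sage_psql,q1,' >> "$out"
cat q1.csv >> "$out"

cat "$out"
# second line reads: sage_psql,q1,12.5,3,100,0.4,0.2
```

The rule thus turns each per-query stats file into a self-describing CSV, which the merge rules below can concatenate.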


rule merge_backend_data:
    input:
        lambda wildcards: expand("output/{{workload}}/{{backend}}/{query}-prepared.csv", query=list_workload_queries(wildcards))
    output:
        "output/{workload,[^/]+}/{backend,[^/]+}/execution_times.csv"
    shell:
        "bash scripts/merge_csv.sh {input} > {output}"


rule merge_backends_data:
    input:
        sage_psql=ancient("output/{workload}/sage_psql/execution_times.csv"),
        sage_psql_catalog=ancient("output/{workload}/sage_psql_catalog/execution_times.csv"),
        sage_sqlite=ancient("output/{workload}/sage_sqlite/execution_times.csv"),
        sage_sqlite_catalog=ancient("output/{workload}/sage_sqlite_catalog/execution_times.csv"),
        sage_hdt=ancient("output/{workload}/sage_hdt/execution_times.csv"),
        sage_hbase=ancient("output/{workload}/sage_hbase/execution_times.csv"),
    output:
        "output/{workload,[^/]+}/execution_times.csv"
    shell:
        "bash scripts/merge_csv.sh {input.sage_psql} {input.sage_psql_catalog} \
        {input.sage_sqlite} {input.sage_sqlite_catalog} \
        {input.sage_hdt} {input.sage_hbase} > {output}"


rule plot_execution_times:
    input:
        ancient("output/{workload}/execution_times.csv")
    output:
        "figures/{workload,[^/]+}/execution_times.png"
    shell:
        "python scripts/plots.py execution-times {input} {output}"


rule plot_suspend_resume_times:
    input:
        ancient("output/indexes/execution_times.csv")
    output:
        "figures/suspend_resume_times.png"
    shell:
        "python scripts/plots.py suspend-resume-times {input} {output}"


rule spo_execution_times:
    input:
        ancient("output/spo/execution_times.csv")
    output:
        "figures/spo_execution_times.png"
    shell:
        "python scripts/plots.py spo-execution-times {input} {output}"
3 changes: 3 additions & 0 deletions scripts/merge_csv.sh
#!/bin/bash
# Concatenate the given CSV files, keeping only the first file's header
# and removing blank lines.
awk 'FNR==1 && NR!=1 {next} {print}' "$@" | sed '/^[[:space:]]*$/d'
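As a usage sketch on two invented CSV fragments: awk's `FNR==1 && NR!=1` guard skips the first line (the header) of every file after the first, and the sed pass drops blank lines.

```shell
# Two CSV fragments that share a header (contents are made up)
printf 'backend,query\nsage_psql,q1\n' > a.csv
printf 'backend,query\nsage_hdt,q2\n\n' > b.csv

# Merge: keep only the first header, drop repeated headers and blank lines
awk 'FNR==1 && NR!=1 {next} {print}' a.csv b.csv | sed '/^[[:space:]]*$/d'
# prints:
# backend,query
# sage_psql,q1
# sage_hdt,q2
```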