Bruno Vieira edited this page Sep 18, 2015 · 71 revisions

Software

Development of tools and applications for Linked Open Data

  • Quality assessment of SPARQL endpoints, RDF data and triple stores (cf. yummydata https://github.com/dbcls/bh14/wiki/Yummydata ) [Atsuko(interested)]
    • Automation of utilization, documentation and visualization of RDF data [Atsuko(interested), Kouji(Interested), Yas]
  • SPARQL Builder (Kouji, Yasunori, Atsuko) (see also YASGUI, yasgui.org)
  • Schema Salad (Peter)
  • Federated query via SPARQL endpoints (Hongyan, Atsuko, Jin-Dong, Kouji) (Kieron for Ensembl + PubMed)
  • Text search with triple store
    • Embedding Elasticsearch functions in an RDF store to enhance query capabilities and performance, for example autocompletion (G. Fu)
  • the BioVirtuoso Docker data containers (HiroMishima, https://github.com/misshie/bio-virtuoso )[T Nakazato(interested)]
    • Bio2RDF - MichelD
    • OrphaData, better RDFized HPO Annotation
  • Common Workflow Language, portable workflows, container standards (Peter, Colin, Tazro)
    • CWL tutorial
    • wrapping tools and writing workflows using CWL
    • containers, runtime configuration
    • writing CWL implementations (JS, Ruby, Java?)
    • tool & workflow registries, workflow metadata ontologies (Dublin core, EDAM, DOAP)
    • visualisation
    • software discovery
    • RDF/SPARQL explorer and visualizer (Naoki)
  • SMART API - semantic annotation of web APIs http://smartapi.info - (MichelD) [Nick interested]
    • develop documentation
    • annotate biohack15 APIs
  • Semantic Wetlab (Erick Antezana, Alexander Garcia, Tazro Ohta, Jean-Luc Perret)
    • Ontologies for representing investigations [OBI?][SIO?][EDAM][ISA]
      • experimental design
      • rdf specification for workflows
      • tools for designing, planning and running experiments (what a good tool should look like; use cases)
    • Report BH 2015
  • Reproducible Software and data deployment (Pjotr Prins)
    • GUIX for databases (Jerven Bolleman, Raoul)
    • Software discovery (Pjotr, Raoul)
    • Ruby biogem support (Pjotr, Raoul, Naohisa)
    • GUIX for UniProt RDF releases (Jerven)
    • GUIX UniProt virtuoso local builds
      • With minimal auto tuning for memory
    • Reproducible software and data with Dat, hyperos.io (Bruno, Tazro)
      • Bionode pipeline inside, visualization with BioJS, nyaplot, D3
  • Visualisation
    • D3 visualisation work group (Toshiaki, Naoki, Pjotr, Peter, Bruno, MichelD)
  • Open-Bio
  • Crick-chan - a question answering system (Kazuharu Arakawa, Kotone Itaya)

Day 1

Bio-virtuoso

Participants: Hiro Mishima, Jeremy Nguyen Xuan, Tudor Grosa

See BioVirtuoso

Day 2

project

Participants: ...

  • CWL
    • Tutorial given by Peter Amstutz. Attended by Benedict Paten, Bruno Vieira, Alex Garcia, Tazro Ohta, others++
    • Discussion of how to combine CWL, Docker Hub, Elixir Tool Registry to provide a central repository for bioinformatics tools that can be directly downloaded and executed in workflows (no installation needed)
    • Annotating CWL files with metadata
    • Ways of running Docker when the IT staff doesn't want to run Docker (solution: run Docker inside VM)
    • Rebasing CWL draft 3 on Schema Salad to support linked data annotations

SPARQL Builder

Participants: Atsuko, Kouji, Yasunori ...

  • Redesign of SPARQL Builder Metadata (SBM)
  • Set up a SPARQL endpoint for SBM, etc.

Day 3

SPARQL Builder

Participants: Atsuko, Kouji, Yasunori ...

  • A new version of the SPARQL Builder Metadata (SBM) specification was released: http://www.sparqlbuilder.org/doc/sbm_2015sep/
  • Thanks to Arto, Dydra can automatically generate the metadata.
  • Crawling of the LSDB archive RDF
  • Because problems were found, we asked the author of the crawler to fix them.

CWL

RDF::VCF

Day 4

SPARQL Builder

Participants: Atsuko, Kouji, Yasunori ...

  • Because any Dydra user can generate SB metadata for their database, we started developing an interface for uploading SB metadata.
  • Alpha version of SPARQL Builder for the LSDB archive

Day 5

BioRuby

Day 6

Embedding Elasticsearch into SPARQL and Cypher

Participants: Fu Gang, Jeremy

Inspired by Aber-owl, which embeds DL queries into SPARQL (http://aber-owl.net/aber-owl/sparql/). Elasticsearch supports synonym search (doc: https://www.elastic.co/guide/en/elasticsearch/guide/current/using-synonyms.html), spelling correction (doc: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-term.html), and phrase matching (doc: https://www.elastic.co/guide/en/elasticsearch/guide/current/phrase-matching.html).

Experiment setup:

  1. Elasticsearch index of PubMed titles (data file 2.4 GB, index 1.7 GB)

  2. Search phrases: "drug treat disease" returned 415 230 records; "gene mutation cause disease" returned 598 057 records.

  3. Embed the results into a SPARQL query using the VALUES keyword. Challenge: the query cannot accommodate this many records.
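One way to work around the record limit in step 3 is to split the Elasticsearch hits into batches and issue one SPARQL query per batch. The following is only a minimal sketch of that idea, not the code used at the hackathon: the function names, the example IRIs, the title predicate, and the batch size are all assumptions made for illustration, and a suitable batch size depends on the triple store.

```python
from typing import Iterable, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def values_query(iris: Iterable[str], title_predicate: str) -> str:
    """Build one SPARQL query restricted to the given IRIs via VALUES.

    The query shape (?article, ?title) is a hypothetical example.
    """
    values = " ".join(f"<{iri}>" for iri in iris)
    return (
        "SELECT ?article ?title WHERE {\n"
        f"  VALUES ?article {{ {values} }}\n"
        f"  ?article <{title_predicate}> ?title .\n"
        "}"
    )


def batched_queries(
    iris: Iterable[str], title_predicate: str, batch_size: int = 1000
) -> List[str]:
    """Return one query per batch so no single VALUES clause grows too large."""
    return [
        values_query(batch, title_predicate)
        for batch in chunked(list(iris), batch_size)
    ]


if __name__ == "__main__":
    # Hypothetical IRIs standing in for Elasticsearch hits.
    hits = [f"http://example.org/pubmed/{i}" for i in range(5)]
    for query in batched_queries(hits, "http://purl.org/dc/terms/title", 2):
        print(query)
        print()
```

Each generated query can be sent to the endpoint separately and the result sets concatenated on the client side; this trades one very large request for several smaller ones.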

Improve performance of a query

Typical query performed in our group:

    START disease = node:node_auto_index(iri={disease_id})
    MATCH path = (disease)<-[:subClassOf*0..]-(diseaseSubclass)-[:hasPhenotype]->(phenotype)
    RETURN distinct phenotype

This is expensive to run on large or deep ontologies.

Idea: index the subclasses and superclasses of every node in Elasticsearch, and delegate the expensive part of the query to Elasticsearch instead of performing it all in Cypher.

Experiment setup:

  • Subset of ontologies used in the Monarch Initiative (2.1 GB, 74k nodes)
  • Query runtime was halved.
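
The precomputation behind this idea can be sketched as follows: walk the subClassOf edges once, record for every class the set of its transitive subclasses (including itself, matching the `*0..` in the Cypher pattern), and emit one document per class that could then be indexed. This is an illustrative sketch under assumptions, not the implementation used in the experiment: the document fields `iri` and `subclasses` and the function names are hypothetical, and no Elasticsearch client calls are shown.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def descendants_index(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each class to itself plus all of its transitive subclasses.

    `edges` holds (subclass, superclass) pairs.
    """
    children: Dict[str, Set[str]] = defaultdict(set)
    nodes: Set[str] = set()
    for sub, sup in edges:
        children[sup].add(sub)
        nodes.update((sub, sup))

    index: Dict[str, Set[str]] = {}
    for node in nodes:
        seen = {node}          # zero-length path: a class matches itself
        stack = [node]
        while stack:           # iterative depth-first walk, safe on cycles
            current = stack.pop()
            for child in children.get(current, ()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        index[node] = seen
    return index


def to_documents(index: Dict[str, Set[str]]) -> List[dict]:
    """Turn the closure into flat documents (hypothetical field names)."""
    return [
        {"iri": iri, "subclasses": sorted(subs)}
        for iri, subs in sorted(index.items())
    ]


if __name__ == "__main__":
    # Toy hierarchy: B and D are subclasses of A, C is a subclass of B.
    toy_edges = [("B", "A"), ("C", "B"), ("D", "A")]
    for doc in to_documents(descendants_index(toy_edges)):
        print(doc)
```

With such documents available, the variable-length subClassOf traversal becomes a single lookup by `iri`, and only the hasPhenotype step remains for the graph database.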

Reproducible and distributable software and data

  • Use Bionode for streamable workflow (Bruno)
    • Reason: Node.js events and Streams are very flexible for scalable pipelines and workflows
  • Use Guix for package management
    • Reason: Reproducible software installation, dependency management, toolchain independent from OS.
    • Docker container with Guix (Bruno, Pjotr, Raoul)
  • Use CWL for tools integration
    • Reason: Integration between several bioinformatics tools, JSON stdin and stdout (easy to integrate with Bionode).
    • Streamable CWL (Peter, Bruno)
    • CWL Guix package (Pjotr, Bruno)
  • Run Docker tarball with Dat/Hyperos (Bruno)
    • Reason: Run tarball in non-Docker environments (e.g., HPC)
  • Discussion around standards and distribution of containers for bioinformatics (Benedict, Peter, Bruno)
  • Implemented CWL support within Toil (pip install toil), a scalable workflow execution engine (Peter, Benedict)