gaou edited this page Mar 29, 2016 · 18 revisions

Crick-chan - a question answering system

  • Participants: Kazuharu Arakawa and Kotone Itaya (Institute for Advanced Biosciences, Keio University)

Introduction

The success of IBM Watson on the quiz show Jeopardy! highlighted the potential of state-of-the-art cognitive computing for answering natural language questions. IBM Watson, however, relies less on semantics or machine learning than on queries over unstructured data, combined with statistical identification of the answer domain (Lexical Answer Type).

The IBM Watson software (DeepQA) is a system that answers a natural language quiz question with a matching "word", searching through millions of pages of documents, including the entire text of Wikipedia. Scientific facts and knowledge are almost always written as natural language text in manuscripts, a resource that remains relatively unexplored in the semantic web context. We therefore aimed to develop a software system mimicking DeepQA that finds the most relevant "sentence" (as opposed to a "word" in Watson) from millions of scientific documents. Since the software deals with biological knowledge as a counterpart of "Watson", we named it "Crick-chan". "-chan" is a suffix added to children's names in Japanese, chosen because our software is still quite immature compared to Watson.

Architecture

Below is the architecture of DeepQA.

[Figure: DeepQA Architecture]

Crick-chan mimics the architecture, and works as follows:

  1. The question text first undergoes morphological analysis using Enju to extract objective nouns and key verbs. Proper nouns are identified by dictionary search.

  2. Queries are extended using Bing search. At the same time, the question is checked to determine whether it belongs to the biology domain.

  3. Full-text searches are performed across the entire OMIM, PubMed, PubMedCentral, NCBI Bookshelf, Wikipedia, and the entire WWW, via queries to NCBI EUtils and Bing searches.

  4. Relevant sentences are extracted from the most relevant articles.

  5. Extracted sentences (the answer hypotheses) are checked for grammatical completeness and scored according to the key words.

  6. Answer confidence is scored according to the data source and how completely the sentence contains the key terms.

  7. The resulting "answer" is presented in a UI suited to such an intelligible process.
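Steps 4 to 6 can be illustrated with a minimal Python sketch. This is not the Crick-chan implementation (the actual core API is built on the G-language Genome Analysis Environment); the function names, the `SOURCE_WEIGHT` values, and the crude completeness check are all assumptions made for illustration only.

```python
import re

# Hypothetical per-source weights; the real system's values are not documented here.
SOURCE_WEIGHT = {"OMIM": 1.0, "PubMed": 0.9, "Wikipedia": 0.6, "WWW": 0.3}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def looks_complete(sentence):
    """Crude stand-in for a grammatical-completeness check (an assumption)."""
    return (
        sentence[0].isupper()
        and sentence[-1] in ".!?"
        and len(sentence.split()) >= 4
    )


def keyword_coverage(sentence, key_terms):
    """Fraction of key terms that occur in the sentence (case-insensitive)."""
    lowered = sentence.lower()
    hits = sum(1 for term in key_terms if term.lower() in lowered)
    return hits / len(key_terms) if key_terms else 0.0


def rank_answers(documents, key_terms):
    """documents is a list of (source_name, text) pairs.

    Returns (confidence, source, sentence) tuples, best candidate first.
    Confidence here is simply source weight times key-term coverage.
    """
    candidates = []
    for source, text in documents:
        for sentence in split_sentences(text):
            if not looks_complete(sentence):
                continue
            coverage = keyword_coverage(sentence, key_terms)
            if coverage == 0.0:
                continue
            confidence = SOURCE_WEIGHT.get(source, 0.1) * coverage
            candidates.append((confidence, source, sentence))
    return sorted(candidates, key=lambda c: c[0], reverse=True)
```

Called with a list of retrieved `(source, text)` pairs and the key terms from step 1, `rank_answers` returns the candidate sentences ordered by this toy confidence. Multiplying source weight by coverage is only one simple way to combine the two signals named in step 6.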

UPDATED (2015.09.18 21:35JST) v.0.3

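  • Crick-chan now embeds ALICE-bot v.1 (ProgramV 0.09) for general chatting, used when the question satisfies (keyterm < 2 && isBio != 1), that is, when it has fewer than two key terms and is not flagged as a biology question.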
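The v.0.3 routing condition can be written out as a small Python sketch. The function name `route_v03` and the return labels are hypothetical; only the condition itself (keyterm < 2 && isBio != 1) comes from the note above.

```python
def route_v03(keyterm_count, is_bio):
    """Pick a handler using the v.0.3 condition (keyterm < 2 && isBio != 1).

    keyterm_count: number of key terms extracted from the question.
    is_bio: 1 if the question was judged to be in the biology domain.
    Returns "chat" for the chat bot, otherwise "qa" for the main engine.
    """
    if keyterm_count < 2 and is_bio != 1:
        return "chat"
    return "qa"
```

Under this reading, a question with a single key term still reaches the question answering engine as long as it is flagged as biology.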

UPDATED (2015.09.20 01:34JST) v.0.4

UPDATED (2016.03.29 18:54JST) v.0.5

  • Crick-chan now first categorizes the query as 1. personal question, 2. keyword question, or 3. natural language question. Personal questions are handled by AIML, keyword questions are redirected to Wikipedia, and other questions are handled by the existing Crick-chan engine.
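A rough Python sketch of this three-way categorization follows. The heuristics (second-person pronouns mark a personal question; a query that neither starts with a question word nor ends with "?" is a keyword query) are assumptions for illustration, since the actual rules used by Crick-chan are not described here. The names `categorize`, `PERSONAL_WORDS`, `QUESTION_WORDS`, and `HANDLERS` are likewise hypothetical.

```python
import re

# Hypothetical word lists used only by this sketch.
PERSONAL_WORDS = {"you", "your", "yourself"}
QUESTION_WORDS = {
    "what", "who", "whom", "when", "where", "why", "how", "which",
    "is", "are", "does", "do", "can",
}

# Handler for each category, as described in the v.0.5 note.
HANDLERS = {
    "personal": "AIML",
    "keyword": "Wikipedia",
    "natural_language": "Crick-chan engine",
}


def categorize(query):
    """Classify a query as 'personal', 'keyword', or 'natural_language'."""
    tokens = re.findall(r"[a-z0-9'-]+", query.lower())
    # Assumed rule: a second-person pronoun means the user is addressing the character.
    if any(t in PERSONAL_WORDS for t in tokens):
        return "personal"
    # Assumed rule: no leading question word and no trailing '?' means a bare keyword.
    if tokens and tokens[0] not in QUESTION_WORDS and not query.strip().endswith("?"):
        return "keyword"
    return "natural_language"
```

For instance, `HANDLERS[categorize(query)]` gives the component that would receive the query under these assumed rules.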

UPDATED (2016.03.29 19:51JST) v.0.6

Crick-chan software

Crick-chan is accessible at: http://link.g-language.org/crick-chan/

Source code for the Crick-chan core API is here: https://github.com/gaou/crick-chan (requires the G-language Genome Analysis Environment).

Examples

[Figure: Crick-chan interface]

  1. What genes are associated with Alzheimer disease?
  2. What is G-language Genome Analysis Environment?
  3. Who is Luke Skywalker married to?
  4. How do semantic web technologies facilitate life science?

Presentation

https://www.dropbox.com/s/152blkr7yhwythl/biohackathon2015.pdf?dl=0

Acknowledgements

This software uses or is derived from Enju 2.4.2 for CentOS 5.5 for x86_64 software, Enju 2.4.2 for CentOS 5.5 for x86_64 modules, and/or Enju 2.4.2 for CentOS 5.5 for x86_64 itineraries, developed at the Tsujii Laboratory, the University of Tokyo. (c) Copyright 2011 the University of Tokyo

The character "Crick-chan" is created and owned by artist Paperu (ぱぺる).

Images courtesy of: freepictureweb (server room image) and Gfycat (background animation).

The background music is "Poppin' Shower" by P*Light.