Crick-chan
- Participants: Kazuharu Arakawa and Kotone Itaya (Institute for Advanced Biosciences, Keio University)
The success of IBM Watson on the quiz show Jeopardy! highlighted the potential of state-of-the-art cognitive computing for answering natural language questions. IBM Watson, however, relies less on semantics or machine learning than on queries over unstructured data, combined with statistical identification of the answer domain (Lexical Answer Type).
The IBM Watson software (DeepQA) answers a natural language quiz question with a matching "word", searching through millions of pages of documents, including the entire text of Wikipedia. Scientific facts and knowledge are almost always written as natural language text in manuscripts, a resource that remains relatively unexplored in the semantic web context. We therefore aimed to develop a software system, modeled on DeepQA, that finds the most relevant "sentence" (as opposed to a "word" in Watson) from millions of scientific documents. Since the software deals with biological knowledge as a counterpart to "Watson", we named it "Crick-chan". "-chan" is a Japanese suffix added to children's names, reflecting that our software is still quite immature compared to Watson.
Below is the architecture of DeepQA.
Crick-chan mimics this architecture and works as follows:
- The question text first undergoes morphological analysis using Enju to extract objective nouns and key verbs. Proper nouns are identified by dictionary search.
- Queries are expanded using Bing search. At the same time, the question is checked for whether it belongs to the biology domain.
- Full-text searches are performed over the entire OMIM, PubMed, PubMed Central, NCBI Bookshelf, Wikipedia, and the wider WWW, via queries to NCBI EUtils and Bing searches.
- Relevant sentences are extracted from the most relevant articles.
- Extracted sentences (these are the answer hypotheses) are checked for grammatical completeness and scored according to the key words.
- Answer confidence is scored according to the data source and how completely the sentence contains the key terms.
- The resulting "answer" is presented in a UI suited to such an intelligible process.
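The scoring steps above (completeness check, key-term coverage, source-dependent confidence) can be illustrated with a minimal sketch. This is not the Crick-chan implementation: the source weights, the regex-based completeness check, and all names (`Hypothesis`, `SOURCE_WEIGHT`, `confidence`, `rank`) are assumptions made for illustration only.

```python
import re
from dataclasses import dataclass
from typing import List, Sequence

# Hypothetical weights: the text only says that confidence depends on the
# data source, not which values are used.
SOURCE_WEIGHT = {"OMIM": 1.0, "PubMed": 0.9, "Wikipedia": 0.7, "WWW": 0.5}


@dataclass
class Hypothesis:
    sentence: str  # candidate answer sentence
    source: str    # where the sentence was found


def key_term_coverage(sentence: str, key_terms: Sequence[str]) -> float:
    """Fraction of key terms that occur in the sentence (case-insensitive)."""
    if not key_terms:
        return 0.0
    text = sentence.lower()
    hits = sum(1 for term in key_terms if term.lower() in text)
    return hits / len(key_terms)


def looks_complete(sentence: str) -> bool:
    """Crude stand-in for a grammatical completeness check (assumption)."""
    s = sentence.strip()
    return bool(re.match(r"^[A-Z].*[.!?]$", s)) and len(s.split()) >= 3


def confidence(hyp: Hypothesis, key_terms: Sequence[str]) -> float:
    """Source weight times key-term coverage; incomplete sentences score 0."""
    if not looks_complete(hyp.sentence):
        return 0.0
    weight = SOURCE_WEIGHT.get(hyp.source, 0.5)  # unknown sources get 0.5
    return weight * key_term_coverage(hyp.sentence, key_terms)


def rank(hyps: Sequence[Hypothesis], key_terms: Sequence[str]) -> List[Hypothesis]:
    """Order answer hypotheses from most to least confident."""
    return sorted(hyps, key=lambda h: confidence(h, key_terms), reverse=True)
```

Calling `rank(hypotheses, ["APOE", "Alzheimer"])` would place a complete OMIM sentence containing both terms ahead of a web sentence containing only one of them.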
UPDATED (2015.09.18 21:35JST) v.0.3
- Crick-chan now embeds ALICE-bot v.1 (ProgramV 0.09) for general chatting, used when the question has (keyterm < 2 && isBio != 1)
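The chat fallback condition can be written out as a small predicate. This is a sketch under assumptions: `keyterm` is read here as the number of key terms extracted from the question, and `isBio` as a 0/1 flag from the biology-domain check; the function name `should_chat` is hypothetical.

```python
from typing import Sequence


def should_chat(key_terms: Sequence[str], is_bio: int) -> bool:
    """Mirror of the stated rule: keyterm < 2 && isBio != 1.

    Assumes keyterm is the count of extracted key terms and is_bio is 1
    when the question was judged to belong to the biology domain.
    """
    return len(key_terms) < 2 and is_bio != 1
```

Under this reading, a greeting with no key terms goes to the chat bot, while any question flagged as biology stays with the main engine regardless of how few key terms it has.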
UPDATED (2015.09.20 01:34JST) v.0.4
- Crick-chan now embeds G-language Genie (https://github.com/dbcls/bh12/wiki/G-language-project) for gene-centric information retrieval and genome analysis methods.
UPDATED (2016.03.29 18:54JST) v.0.5
- Crick-chan now first categorizes the query as (1) a personal question, (2) a keyword question, or (3) a natural language question. Personal questions are handled by AIML, keyword questions are redirected to Wikipedia, and other questions are handled by the existing Crick-chan engine.
UPDATED (2016.03.29 19:51JST) v.0.6
- Crick-chan now handles generic questions first using Evi (http://www.evi.com). When the question is considered a math question, it is redirected to the WolframAlpha API (https://www.wolframalpha.com).
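The query routing introduced in v.0.5 and v.0.6 can be sketched as a simple dispatcher. The page does not say how queries are actually classified, so every heuristic below (the arithmetic regex, the second-person word list, the three-token keyword limit) is a hypothetical placeholder, as are the handler labels. The Evi step is left out because the criterion for a "generic" question is not described.

```python
import re

# Hypothetical heuristics; the real classification rules are not documented.
MATH_PATTERN = re.compile(r"\d\s*[-+*/^]\s*\d")        # e.g. "12 * 7"
SECOND_PERSON = {"you", "your", "yours", "yourself"}   # hints at a personal question
QUESTION_WORDS = {"what", "who", "whom", "whose", "which",
                  "when", "where", "why", "how"}


def route(query: str) -> str:
    """Return a label for the backend that would handle the query."""
    tokens = set(re.findall(r"[a-z']+", query.lower()))
    if MATH_PATTERN.search(query):
        return "wolframalpha"   # math question
    if SECOND_PERSON & tokens:
        return "aiml"           # personal question
    if len(tokens) <= 3 and not (QUESTION_WORDS & tokens):
        return "wikipedia"      # bare keyword query
    return "crick-chan"         # natural language question


for q in ("What is your name?", "Alzheimer disease",
          "What genes are associated with Alzheimer disease?"):
    print(q, "->", route(q))
```

Checking the cheapest, most specific patterns first keeps the full search pipeline reserved for questions that actually need it.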
Crick-chan is accessible at: http://link.g-language.org/crick-chan/
Source code for the Crick-chan core API is here: https://github.com/gaou/crick-chan (requires the G-language Genome Analysis Environment).
- What genes are associated with Alzheimer disease?
- What is G-language Genome Analysis Environment?
- Who is Luke Skywalker married to?
- How do semantic web technologies facilitate life science?
https://www.dropbox.com/s/152blkr7yhwythl/biohackathon2015.pdf?dl=0
This software uses or is derived from Enju 2.4.2 for CentOS 5.5 for x86_64 software, Enju 2.4.2 for CentOS 5.5 for x86_64 modules, and/or Enju 2.4.2 for CentOS 5.5 for x86_64 itineraries, developed at the Tsujii Laboratory, the University of Tokyo. (c) Copyright 2011 the University of Tokyo
The character "Crick-chan" is created and owned by artist Paperu (ぱぺる).
Images courtesy of: freepictureweb (server room image) and Gfycat (background animation).
The background music is "Poppin' Shower" by P*Light.