Update project with pipelines and sample data.
Our code for the submission of AMIA 2020.
chlor committed Mar 27, 2020
1 parent 9e034ad commit 49ec439
Showing 10 changed files with 82 additions and 1 deletion.
3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@
.project
jcore-pipelines/detectStopWords/*
jcore-pipelines/detectUMLSentries/*
24 changes: 23 additions & 1 deletion README.md
@@ -11,6 +11,28 @@

# JuFiT: filtered dictionaries from UMLS

* Download JuFiT from https://github.com/JULIELab/jufit and build the jar file with Maven
* run `java -jar JenaUmlsFilter-1.1-jar-with-dependencies.jar MRCONSO.RRF MRSTY.RRF GER --grounded > UMLS_dict.txt`
* run the script request-jufit.sh for dictionaries of the different semantic groups
* run the script createDics.py to create one large dictionary (adapt the paths before running)


# Dictionary Format

* We use the following format in our dictionaries:
* one line per entry
* fields separated by tabs
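
A minimal sketch of reading a file in this format, assuming two tab-separated fields per line (the entry followed by its label, which is what createDics.py writes). The function name `read_dictionary` and the sample rows are hypothetical and only serve as illustration.

```python
import csv
import io


def read_dictionary(lines):
    """Parse tab-separated dictionary lines into (term, label) pairs.

    Assumes exactly two fields per line; blank lines are skipped.
    """
    reader = csv.reader(lines, delimiter='\t', quoting=csv.QUOTE_NONE)
    entries = []
    for row in reader:
        if not row:
            continue
        if len(row) != 2:
            raise ValueError(f'expected 2 tab-separated fields, got {len(row)}: {row!r}')
        entries.append((row[0], row[1]))
    return entries


# Illustrative sample data, not taken from the real dictionaries.
sample = io.StringIO('Herz\tUMLS-ANAT\nFieber\tUMLS-DISO\n')
entries = read_dictionary(sample)
print(entries)
```

Raising on a malformed line rather than silently skipping it makes a wrong delimiter visible early.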

# JCoRe Pipeline
* unpack the *.zip files in jcore-pipelines; there are two pipelines: detectUMLSentries and detectStopWords
* put the UMLS dictionary file into jcore-pipelines/detectUMLSentries/resources
* put the text data to be analyzed into data/files (subdirectories are not read; be careful with *.tar files)
* adapt the file names of the dictionary to be filtered and of the stopword dictionary in the following files:
* desc/GazetteerAnnotator Template Descriptor with Configurable External Resource.xml
* descAll/GazetteerAnnotator, Template Descriptor with Configurable External Resource.xml
* open a terminal and change into one of the pipeline directories
* start the pipeline with `java -jar ../jcore-pipeline-runner-base-0.4.1-SNAPSHOT-cli-assembly.jar run.xml`
* then have a look at the output in:
* offsets.tsv
* data/outData/output-xmi
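
Because subdirectories of data/files are not read and *.tar files need care, it may help to check the input directory before starting a run. The following is a minimal sketch under those assumptions; `list_input_files` is a hypothetical helper, not part of the JCoRe pipelines, and the demonstration uses a throwaway directory with made-up file names.

```python
from pathlib import Path
import tempfile


def list_input_files(data_dir):
    """Return (usable, suspicious) top-level entries of data_dir.

    Subdirectories are reported as suspicious because the pipeline does not
    read them; *.tar files are flagged so they can be checked by hand.
    """
    usable, suspicious = [], []
    for entry in sorted(Path(data_dir).iterdir()):
        if entry.is_dir() or entry.suffix == '.tar':
            suspicious.append(entry.name)
        else:
            usable.append(entry.name)
    return usable, suspicious


# Demonstration on a temporary directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'doc1.txt').write_text('Beispieltext')
    (root / 'archive.tar').write_bytes(b'')
    (root / 'nested').mkdir()
    usable, suspicious = list_input_files(root)

print('usable:', usable)
print('check by hand:', suspicious)
```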
30 changes: 30 additions & 0 deletions extended_script_dictionaries/createDics.py
@@ -0,0 +1,30 @@
import csv
import glob

print('merge different UMLS dics')


def create_big_dic(dic_path, delim):
    """Merge the first column of every file in dic_path into 'term<TAB>label' lines."""
    dics = glob.glob(dic_path + '/*')
    big_dic = ''
    for dic in dics:
        # The label is the file name without directory, extension, release and language,
        # e.g. 'UMLS-2019AB-ACTI-GER.txt' becomes 'UMLS-ACTI'.
        name = dic.replace(dic_path + '/', '').replace('.txt', '').replace('.dict', '').replace('2019AB-', '').replace('-GER', '')
        with open(dic) as tsvfile:
            reader = csv.reader(tsvfile, delimiter=delim)
            for row in reader:
                if not row:
                    continue  # skip blank lines instead of failing on row[0]
                big_dic += row[0] + '\t' + name + '\n'
    return big_dic


# Adapt this path before running.
path = '/the/name/of/the/path/with/dictionary/files'

dic_path_umls = path + '/UMLS-semantic-group'
big_dic_umls = create_big_dic(dic_path_umls, '|')

dic_path_gene = path + '/gene'
big_dic_gene = create_big_dic(dic_path_gene, '\t')

# Note: a 'redlist' dictionary is not built anywhere in this script,
# so only the UMLS and gene dictionaries are written.
with open('bic_dic.txt', 'w') as big_dic_file:
    big_dic_file.write(big_dic_umls)
    big_dic_file.write(big_dic_gene)
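
To illustrate how the label in the second column follows from a dictionary file name, here is a minimal sketch. `label_from_filename` is a hypothetical helper that mirrors the chain of `replace` calls in `create_big_dic`; as an assumption of this sketch, it uses `os.path.basename` instead of stripping the directory prefix by string replacement.

```python
import os


def label_from_filename(path):
    """Derive the dictionary label from a file path.

    Removes the directory, the file extension, the release tag '2019AB-'
    and the language suffix '-GER', in that order.
    """
    name = os.path.basename(path)
    for fragment in ('.txt', '.dict', '2019AB-', '-GER'):
        name = name.replace(fragment, '')
    return name


print(label_from_filename('dic/UMLS-2019AB-ACTI-GER.txt'))  # UMLS-ACTI
print(label_from_filename('dic/UMLS-2019AB-GER.txt'))       # UMLS
```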
24 changes: 24 additions & 0 deletions extended_script_dictionaries/request-jufit.sh
@@ -0,0 +1,24 @@
java -jar JenaUmlsFilter-1.1-jar-with-dependencies.jar MRCONSO.RRF MRSTY.RRF GER --grounded --semanticGroup=ACTI > dic/UMLS-2019AB-ACTI-GER.txt
java -jar JenaUmlsFilter-1.1-jar-with-dependencies.jar MRCONSO.RRF MRSTY.RRF GER --grounded --semanticGroup=ANAT > dic/UMLS-2019AB-ANAT-GER.txt
java -jar JenaUmlsFilter-1.1-jar-with-dependencies.jar MRCONSO.RRF MRSTY.RRF GER --grounded --semanticGroup=CHEM > dic/UMLS-2019AB-CHEM-GER.txt
java -jar JenaUmlsFilter-1.1-jar-with-dependencies.jar MRCONSO.RRF MRSTY.RRF GER --grounded --semanticGroup=CONC > dic/UMLS-2019AB-CONC-GER.txt
java -jar JenaUmlsFilter-1.1-jar-with-dependencies.jar MRCONSO.RRF MRSTY.RRF GER --grounded --semanticGroup=DEVI > dic/UMLS-2019AB-DEVI-GER.txt

java -jar JenaUmlsFilter-1.1-jar-with-dependencies.jar MRCONSO.RRF MRSTY.RRF GER --grounded --semanticGroup=DISO > dic/UMLS-2019AB-DISO-GER.txt
java -jar JenaUmlsFilter-1.1-jar-with-dependencies.jar MRCONSO.RRF MRSTY.RRF GER --grounded --semanticGroup=GENE > dic/UMLS-2019AB-GENE-GER.txt
java -jar JenaUmlsFilter-1.1-jar-with-dependencies.jar MRCONSO.RRF MRSTY.RRF GER --grounded --semanticGroup=GEOG > dic/UMLS-2019AB-GEOG-GER.txt
java -jar JenaUmlsFilter-1.1-jar-with-dependencies.jar MRCONSO.RRF MRSTY.RRF GER --grounded --semanticGroup=LIVB > dic/UMLS-2019AB-LIVB-GER.txt
java -jar JenaUmlsFilter-1.1-jar-with-dependencies.jar MRCONSO.RRF MRSTY.RRF GER --grounded --semanticGroup=OBJC > dic/UMLS-2019AB-OBJC-GER.txt

java -jar JenaUmlsFilter-1.1-jar-with-dependencies.jar MRCONSO.RRF MRSTY.RRF GER --grounded --semanticGroup=OCCU > dic/UMLS-2019AB-OCCU-GER.txt
java -jar JenaUmlsFilter-1.1-jar-with-dependencies.jar MRCONSO.RRF MRSTY.RRF GER --grounded --semanticGroup=ORGA > dic/UMLS-2019AB-ORGA-GER.txt
java -jar JenaUmlsFilter-1.1-jar-with-dependencies.jar MRCONSO.RRF MRSTY.RRF GER --grounded --semanticGroup=PHEN > dic/UMLS-2019AB-PHEN-GER.txt
java -jar JenaUmlsFilter-1.1-jar-with-dependencies.jar MRCONSO.RRF MRSTY.RRF GER --grounded --semanticGroup=PHYS > dic/UMLS-2019AB-PHYS-GER.txt
java -jar JenaUmlsFilter-1.1-jar-with-dependencies.jar MRCONSO.RRF MRSTY.RRF GER --grounded --semanticGroup=PROC > dic/UMLS-2019AB-PROC-GER.txt

java -jar JenaUmlsFilter-1.1-jar-with-dependencies.jar MRCONSO.RRF MRSTY.RRF GER --grounded > dic/UMLS-2019AB-GER.txt

#Only the following semantic group names are supported:
#ACTI, ANAT, CHEM, CONC, DEVI,
#DISO, GENE, GEOG, LIVB, OBJC,
#OCCU, ORGA, PHEN, PHYS, PROC
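
The repeated invocations above differ only in the semantic group, so they could also be generated from the list of supported group names. A minimal sketch, assuming the same jar name, input files and output naming as in the script; `build_commands` is a hypothetical helper that only prints the command lines and does not run Java.

```python
SEMANTIC_GROUPS = [
    'ACTI', 'ANAT', 'CHEM', 'CONC', 'DEVI',
    'DISO', 'GENE', 'GEOG', 'LIVB', 'OBJC',
    'OCCU', 'ORGA', 'PHEN', 'PHYS', 'PROC',
]

JAR = 'JenaUmlsFilter-1.1-jar-with-dependencies.jar'


def build_commands(groups=SEMANTIC_GROUPS, release='2019AB', language='GER'):
    """Return (argv, output_path) pairs matching the lines of request-jufit.sh."""
    base = ['java', '-jar', JAR, 'MRCONSO.RRF', 'MRSTY.RRF', language, '--grounded']
    commands = [
        (base + [f'--semanticGroup={group}'],
         f'dic/UMLS-{release}-{group}-{language}.txt')
        for group in groups
    ]
    # Final run without a semantic group filter.
    commands.append((base, f'dic/UMLS-{release}-{language}.txt'))
    return commands


for argv, out_path in build_commands():
    print(' '.join(argv), '>', out_path)
```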
2 changes: 2 additions & 0 deletions jcore-pipelines/.gitignore
@@ -0,0 +1,2 @@
/detectStopWords/
/detectUMLSentries/
Binary file added jcore-pipelines/detectStopWords.zip
Binary file not shown.
Binary file added jcore-pipelines/detectUMLSentries.zip
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
