This is an EHR mapping design template containing the folders and scripts needed to configure and run the EHR DataFactory pipeline on a hospital node. Please refer to DataFactory's User Guide for further details about DataFactory and its mapping-task design principles.
- docker (v18)
- docker-compose (v1.22 - v1.8)
- python3 (v3.5 or greater), python3-pip, python3-tk
- MIPMAP
The DataFactory user must be in the "docker" user group so that the scripts can run without the "sudo" command. To do that, follow the instructions below. Add the docker group if it doesn't already exist:
sudo groupadd docker
Add the connected user "$USER" to the docker group. Change the user name to match your preferred user if you do not want to use your current user:
sudo gpasswd -a $USER docker
Either run newgrp docker or log out and back in to activate the group changes.
You can run
docker run hello-world
to check that you can run docker without sudo.
- 45432: port for the demo_postgres PostgreSQL container. This is the default port; edit the config.sys file if you want to change it.
- Clone this repo.
- Edit the config.sys file if you want to change the default settings.
- Execute the build_postgres.sh script to create a docker container named demo_postgres with the 3 databases which are needed by the EHR DataFactory pipeline. These databases are:
- mipmap
- i2b2_capture
- i2b2_harmonized
sh build_postgres.sh
Caution! If the demo_postgres container already exists, the script will drop the 3 databases and create new ones. Also make sure that the demo_postgres container is running, otherwise the script will fail. If the container exists but is stopped, you must restart it.
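To verify that the three databases have been created, you can list them from inside the running container; a minimal check, assuming the default postgres superuser of the PostgreSQL image (adjust the user if your config.sys defines a different one):
docker exec demo_postgres psql -U postgres -l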
- Place the hospital csv files into the source folder. For privacy reasons, it is good practice not to use the original hospital csv files containing all the data, but to work with copies of those files that have the same filenames and contain a limited number of rows (e.g. 15-40).
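For example, a minimal sketch for creating such limited copies (the /path/to/original_csvs location is only illustrative); head -n 41 keeps the header line plus the first 40 data rows:
for f in /path/to/original_csvs/*.csv; do
  head -n 41 "$f" > source/"$(basename "$f")"
done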
Create the preprocessing configuration files in the preprocess_step folder by using the preprocess_tool GUI located in the preprocess_tool folder.
Launch the tool by running in a terminal:
cd preprocess_tool
python3 prepro_gui.py
In the Patient Mapping Configuration section we declare the csv file (located in the source folder, see step 3 of the sample data preparation section) that contains the unique patient ID codes of the incoming EHR batch, and we select the column that contains this code.
In the Encounter Mapping Configuration section we declare the csv file (located in the source folder, see step 3 of the sample data preparation section) that contains the unique visit (encounter) ID codes of the incoming EHR batch, and we select the column that contains this code as well as the column that holds the unique patient ID code.
In the Unpivoting Configuration section we first select which csv files (located in the source folder, see step 3 of the sample data preparation section) we want to unpivot. Unpivoting a csv file places the values of some columns of each row into multiple rows that share the same values of the "selected" columns. For example, let's say we have:
Column1 | Column2 | Column3 | Column4 |
---|---|---|---|
Value1 | Value2 | Value3 | Value4 |
The unpivoted version with the columns 1 and 2 as "selected" columns will be:
Column1 | Column2 | unPivoted | Value |
---|---|---|---|
Value1 | Value2 | Column3 | Value3 |
Value1 | Value2 | Column4 | Value4 |
Moving on, for each file we choose which columns act as the "selected" (non-unpivoted) columns.
And finally we select the output folder where we want to save the configuration files (for example preprocess_step).
The final configuration files, located in the preprocess_step folder, will be the following:
- encountermapping.properties
- patientmapping.properties
- run.sh
- selected<input_csv 1>.txt (columns that will not be unpivoted in the 1st input csv)
- unpivoted<input_csv 1>.txt (columns that will be unpivoted in the 1st input csv)
- ...
- selected<input_csv N>.txt
- unpivoted<input_csv N>.txt
The total number of configuration files will be 3 + 2N, where N is the number of input csv files that need to be unpivoted.
Before continuing to the next step (Capture step configuration files), which is the design of the "capture" mapping task, we must create the auxiliary csv files (Data Factory STEP_2B):
- EncounterMapping.csv
- PatientMapping.csv
- and the unpivoted csv files
To do that, we run in the main mapping design folder:
sh ingestdata.sh preprocess
Check the source folder to verify that the auxiliary files have been created. Open each file and check that it is not empty.
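For example, a quick sanity check from the main mapping design folder (the find command prints any csv file that is still empty):
ls -l source/EncounterMapping.csv source/PatientMapping.csv
find source -type f -name '*.csv' -empty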
If everything is ok, we are ready to go to the next step.
- Update (or replace) the following metadata csv files in the source folder:
  - hospital_metadata.csv (metadata about the hospital's raw input csv files)
  - cde_metadata.csv (metadata about the pathology data model)
- Ensure that the auxiliary files created in the previous step are located in the source folder.
- Design the mapping task by using MIPMAP and save it in the main folder with the name map.xml
- Then run:
sh templator.sh capture
The last script creates a template xml file map.xml.tmpl in the capture_step folder.
The final configuration files are located in the capture_step folder and are the following:
- map.xml.tmpl
- run.sh
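To quickly verify that the capture template was generated, you can list the expected files:
ls -l capture_step/map.xml.tmpl capture_step/run.sh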
- Design the mapping task by using MIPMAP and save it in the main folder with the name mapHarmonize.xml
- Then run:
sh templator.sh harmonize
The last script creates a template xml file mapHarmonize.xml.tmpl in the harmonize_step folder.
The final configuration files are located in the harmonize_step folder and are the following:
- mapHarmonize.xml.tmpl
- run.sh
The steps are numbered according to the MIP DATA FACTORY Data Processing User Guide document
In the main folder we run:
sh ingestdata.sh preprocess
Check that the auxiliary files (EncounterMapping.csv, PatientMapping.csv and the unpivoted csv files) are created in the source folder.
Caution! The auxiliary files must first be created by the preprocessing step.
In the main folder we run:
sh ingestdata.sh capture
Check that the i2b2_capture database in the postgres container has been populated with data.
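A minimal sketch of such a check, assuming the default postgres superuser and the standard i2b2 observation_fact table (the user and the table/schema names may differ in your installation):
docker exec demo_postgres psql -U postgres -d i2b2_capture -c "SELECT count(*) FROM observation_fact;"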
In the main folder we run:
sh ingestdata.sh harmonize
Check that the i2b2_harmonized database in the postgres container has been populated with data. We could also run the export_step and check that the flattened csv file is correct.
If all of the above steps have run successfully on our local machine, we are ready to upload the EHR mapping configuration files to the actual DataFactory installation on the hospital node.
By inspecting the flattened csv file created in this step, we can gain more confidence in the validity of our EHR mapping configuration files (the mapping task xml files). We suggest running the export step multiple times using different flattening methods.
sh ingestdata.sh export <flattening method>
With <flattening method> we declare the csv flattening method. The choices are the following:
- 'mindate': For each patient, export one row with all information related to her first visit.
- 'maxdate': For each patient, export one row with all information related to her last visit.
- '6months': For each patient, export one row with the information according to the 6-month window selection strategy defined by CHUV for clinical data related to MRIs. The criteria in detail:
- For a patient to have a row in the output CSV she has to have an MRI and a VALID Diagnosis (etiology1 != "diagnostic en attente" and etiology1 != "NA") within a 6-month window of the MRI. If there is more than one MRI, choose the first one. If there is more than one Diagnosis, choose the VALID one closest to the MRI.
- The age and the visit date selected are the ones of the Diagnosis.
- Having information about MMSE and MoCA is optional; if present, it has to be within a 6-month window of the Diagnosis date.
- 'longitude': For each patient, export all available information.
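For example, to generate and compare flattened csv files using the different methods:
sh ingestdata.sh export mindate
sh ingestdata.sh export maxdate
sh ingestdata.sh export 6months
sh ingestdata.sh export longitude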