Skip to content

🏥 This is my attempt at the selection tests for the GSOC 2019 project, "Human history and data visualization" under the Canadian Center for Computational Genomics

Notifications You must be signed in to change notification settings

kelvindecosta/genomic-data-visualization-pipeline

Repository files navigation

Genomic Data Visualization Pipeline

This is my attempt at the selection tests for the GSoC 2019 project, "Human history and data visualization" under the Canadian Center for Computational Genomics

Usage

usage: main.py [-h] [--log_dir] [--plot_dir] [--coords_dir] [--genedata]
               [--labels] [--pops] [--seed] [--downsample_n] [--prune_size]
               [--prune_step] [--prune_thresh] [--prune_iter] [--pca_n]
               [--pca_scaler] [--plot_limit_components] [--title TITLE]

Applying Dimensionality Reduction on Genomic Data

optional arguments:
  -h, --help            show this help message and exit
  --log_dir             log directory
  --plot_dir            plot directory
  --coords_dir          coordinates directory
  --genedata            genomic data file
  --labels              sample population mappings file
  --pops                population description file
  --seed                seed for prng
  --downsample_n        number of SNPs to randomly choose
  --prune_size          size for pruning
  --prune_step          step for pruning
  --prune_thresh        threshold for pruning
  --prune_iter          iterations for pruning
  --pca_n               number of principal components
  --pca_scaler          scaler for pca
  --plot_limit_components 
                        number of components to consider plotting, [None]
                        means plot all
  --title TITLE         process title

Example

With the dataset in a folder data in the repository directory, run the following command :

python main.py --genedata=data/ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped.vcf --labels=data/affy_samples.20141118.panel --pops=data/20131219.populations.tsv --title="Chromosome 1 Assembly B37" --plot_limit_components=4

After running command, the project structure changes to :

.
├── coords
│   └── [1554763502304] Chromosome 1 Assembly B37
│       ├── [1554763502304] Chromosome 1 Assembly B37 - [PCA] coords.txt
│       ├── [1554763502304] Chromosome 1 Assembly B37 - [UMAP] coords.txt
│       └── [1554763502304] Chromosome 1 Assembly B37 - [UMAP_PCA] coords.txt
├── logs
│   └── [1554763502304] Chromosome 1 Assembly B37.log
├── main.py
├── plots
│   └── [1554763502304] Chromosome 1 Assembly B37 plots
│       ├── [1554763502304] Chromosome 1 Assembly B37 [PCA]
│       │   ├── [1554763502304] Chromosome 1 Assembly B37 [PCA] - PC[1][2].png
│       │   ├── [1554763502304] Chromosome 1 Assembly B37 [PCA] - PC[1][3].png
│       │   ├── [1554763502304] Chromosome 1 Assembly B37 [PCA] - PC[1][4].png
│       │   ├── [1554763502304] Chromosome 1 Assembly B37 [PCA] - PC[2][3].png
│       │   ├── [1554763502304] Chromosome 1 Assembly B37 [PCA] - PC[2][4].png
│       │   └── [1554763502304] Chromosome 1 Assembly B37 [PCA] - PC[3][4].png
│       ├── [1554763502304] Chromosome 1 Assembly B37 [UMAP]
│       │   └── [1554763502304] Chromosome 1 Assembly B37 [UMAP] - PC[1][2].png
│       └── [1554763502304] Chromosome 1 Assembly B37 [UMAP_PCA]
│           └── [1554763502304] Chromosome 1 Assembly B37 [UMAP_PCA] - PC[1][2].png
└── README.md

The coords directory contains the lower dimensional coordinates written into text files.

The logs directory contains the log reports.

The plots directory has the plots of the components with one another.

PCA

alt text

UMAP

alt text

UMAP + PCA

alt text

About

🏥 This is my attempt at the selection tests for the GSOC 2019 project, "Human history and data visualization" under the Canadian Center for Computational Genomics

Topics

Resources

Stars

Watchers

Forks

Languages