
Commit

complete snakemake workflow
nargesr committed Aug 14, 2024
1 parent 6a282e6 commit 4251f89
Showing 8 changed files with 111 additions and 71 deletions.
4 changes: 2 additions & 2 deletions setup.py
@@ -3,15 +3,15 @@
setup(
name='Topyfic', # the name of your package
packages=['Topyfic'], # same as above
version='v0.4.13', # version number
version='v0.4.15', # version number
license='MIT', # license type
description='Topyfic is a Python package designed to identify reproducible latent Dirichlet allocation (LDA) '
            'using Leiden clustering and Harmony for single-cell epigenomics data',
# short description
author='Narges Rezaie', # your name
author_email='[email protected]', # your email
url='https://github.com/mortazavilab/Topyfic', # url to your git repo
download_url='https://github.com/mortazavilab/Topyfic/archive/refs/tags/v0.4.13.tar.gz', # link to the tar.gz file associated with this release
download_url='https://github.com/mortazavilab/Topyfic/archive/refs/tags/v0.4.15.tar.gz', # link to the tar.gz file associated with this release
keywords=['Cellular Programs', 'Latent Dirichlet allocation', 'single-cell multiome', 'single-cell RNA-seq',
'gene regulatory network', 'Topic Modeling', 'single-nucleus RNA-seq'], #
python_requires='>=3.9',
44 changes: 27 additions & 17 deletions workflow/snakemake/README.md
@@ -2,10 +2,11 @@

This directory contains a Snakemake pipeline for running Topyfic automatically.

The Snakemake workflow will run training and model building (topModel).
The Snakemake workflow will run training (Train) and model building (topModel, Analysis).

**Note**: Please make sure to install the necessary packages and set up Snakemake appropriately.
**Note**: the pipeline is tested for Snakemake >= 8.*

**Note**: the pipeline is tested with Snakemake >= 8.X ([more info](https://snakemake.readthedocs.io/en/stable/index.html))

## Getting started

@@ -31,55 +32,64 @@ Modify the [config file](config/config.yaml) or create a new one with the same s

3. **n_topics**
- Contains the list of initial topic counts you wish to train models on
- list of int: [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
- list of int: `[5, 10, 15, 20, 25, 30, 35, 40, 45, 50]`

4. **organism**
- Indicate the species which will be used for downstream analysis
- Example: human or mouse

5. **train**
5. **workdir**
- Directory to put the outputs
- Make sure to have write access.
- It will create one folder per dataset.

6. **train**
- most of these items are inputs to `train_model()`
- n_runs: number of runs used to define the rLDA model (default: 100)
- random_states: list of random states used to run the LDA models (default: range(n_runs))
- workdir: directory of the train outputs; make sure to have write access.

6. **top_model**
- workdir: directory of the topModel outputs; make sure to have write access.
7. **top_model**
- n_top_genes (int): Number of highly-variable genes to keep (default: 50)
- resolution (int): A parameter value controlling the coarseness of the clustering. Higher values lead to more clusters. (default: 1)
- max_iter_harmony (int): Number of iteration for running harmony (default: 10)
- min_cell_participation (float): Minimum cell participation required for each topic to be kept; when `None`, topics with cell participation above 1% of the number of cells (#cells / 100) are kept

7. **merging**
- workdir: directory of the merged outputs; make sure to have write access.
- Only needed if you have multiple adata inputs
8. **merge**
- Indicates whether you also want a model for all data combined.
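The `min_cell_participation` fallback described above (1% of the number of cells when the value is `None`) can be sketched in plain Python; the helper name below is illustrative, not part of Topyfic:

```python
def resolve_min_cell_participation(value, n_cells):
    """Return the effective minimum cell participation threshold.

    When the configured value is None (or the YAML string "None"),
    fall back to 1% of the total cell count, as described above.
    """
    if value is None or value == "None":
        return n_cells / 100
    return float(value)

# With 5,000 cells and no explicit value, a topic must reach a
# participation of at least 50 cells to be kept.
print(resolve_min_cell_participation("None", 5000))  # 50.0
print(resolve_min_cell_participation(0.25, 5000))    # 0.25
```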


### 3. Run snakemake

First run it with -n to make sure the steps that it plans to run are reasonable.
After it finishes, run the same command without the -n option.
First run it with `-n` to make sure the steps that it plans to run are reasonable.
After it finishes, run the same command without the `-n` option.

`snakemake -n`

for slurm:
For SLURM:

```
snakemake \
-j 200 \
--latency-wait 120 \
-j 1000 \
--latency-wait 300 \
--use-conda \
--rerun-triggers mtime \
--executor cluster-generic \
--cluster-generic-submit-cmd \
"sbatch -A model-ad_lab \
--partition=standard \
--partition=highmem \
--cpus-per-task 16 \
[email protected] \
--mail-type=START,END,FAIL \
--time=72:00:00" \
-n
-n \
-p \
--verbose
```

Development hint: if you run into any errors, `-p --verbose` will give you more detail about each run and help you debug your code.


### 4. Further downstream analysis

10 changes: 4 additions & 6 deletions workflow/snakemake/config/config.yaml
@@ -1,6 +1,6 @@
n_jobs: 8 # Threads

names: [parse, 10x]
names: ['parse', '10x']

count_adata:
parse:
@@ -12,6 +12,8 @@ n_topics: [5, 10] #, 15, 20, 25, 30, 35, 40, 45, 50]

organism: mouse # human or mouse

workdir: /Users/nargesrezaie/Documents/MortazaviLab/Topyfic/workflow/snakemake/results/

train:
n_runs: 100
random_states: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
@@ -24,15 +26,11 @@ train:
70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
workdir: /Users/nargesrezaie/Documents/MortazaviLab/Topyfic/workflow/snakemake/results/{name}/{n_topic}/train

top_model:
workdir: /Users/nargesrezaie/Documents/MortazaviLab/Topyfic/workflow/snakemake/results/{name}/{n_topic}/topmodel
n_top_genes: 50
resolution: 1
max_iter_harmony: 10
min_cell_participation: None

# only if you have multiple adata input
merging:
workdir: /Users/nargesrezaie/Documents/MortazaviLab/Topyfic/workflow/snakemake/results
merge: True
23 changes: 19 additions & 4 deletions workflow/snakemake/workflow/Snakefile
@@ -9,19 +9,34 @@ n_topics = config['n_topics']
include: "rules/train.smk"
include: "rules/topModel.smk"


# ##### make necessary directories #####
# for name in config['names']:
# for n_topic in config['n_topics']:
# # Check if the file exists
# if not os.path.exists(f"{config['workdir']}/{name}/{n_topic}/train"):
# os.makedirs(f"{config['workdir']}/{name}/{n_topic}/train", mode=0o777)
#
# if not os.path.exists(f"{config['workdir']}/{name}/{n_topic}/topmodel"):
# os.makedirs(f"{config['workdir']}/{name}/{n_topic}/topmodel", mode=0o777)

#### check if merging is necessary ###
merging = []
if 'merging' in config:
merging = f"{config['merging']['workdir']}/topModel_{'_'.join(config['names'])}.p"
if 'merge' in config:
merging = f"{config['workdir']}/topModel_{'_'.join(config['names'])}.p"


##### target rules #####
rule all:
input:
expand(f"{config['train']['workdir']}/train_{{name}}_{{n_topic}}.p",
expand(f"{config['workdir']}/{{name}}/{{n_topic}}/train/train_{{name}}_{{n_topic}}_{{random_state}}.p",
name=config["names"],
n_topic=config["n_topics"],
random_state=config["train"]["random_states"]),
expand(f"{config['workdir']}/{{name}}/{{n_topic}}/train/train_{{name}}_{{n_topic}}.p",
name=config['names'],
n_topic=config["n_topics"]),
expand(f"{config['top_model']['workdir']}/topModel_{{name}}_{{n_topic}}.p",
expand(f"{config['workdir']}/{{name}}/{{n_topic}}/topmodel/topModel_{{name}}_{{n_topic}}.p",
name=config['names'],
n_topic=config["n_topics"]),
merging,
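The merging target added above is a single pickle whose filename joins all dataset names; a minimal standalone sketch of that path construction (the `workdir` value is a placeholder):

```python
# Placeholder config mirroring the keys used by the Snakefile logic above
config = {
    "workdir": "results",          # illustrative path
    "names": ["parse", "10x"],
    "merge": True,
}

merging = []
if "merge" in config:
    # One combined topModel pickle covering every dataset
    merging = f"{config['workdir']}/topModel_{'_'.join(config['names'])}.p"

print(merging)  # results/topModel_parse_10x.p
```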
59 changes: 34 additions & 25 deletions workflow/snakemake/workflow/rules/topModel.smk
@@ -1,54 +1,61 @@
import pandas as pd
import Topyfic
from scripts import make_topmodel

configfile: 'config/config.yaml'

# Rule to run topModel for one input adata
rule run_top_model:
input:
expand(f"{config['train']['workdir']}/train_{{name}}_{{n_topic}}.p",
expand(f"{config['workdir']}/{{name}}/{{n_topic}}/train/train_{{name}}_{{n_topic}}.p",
name=config["names"],
n_topic=config["n_topics"])
output:
f"{config['top_model']['workdir']}/topModel_{{name}}_{{n_topic}}.p",
f"{config['workdir']}/{{name}}/{{n_topic}}/topmodel/topModel_{{name}}_{{n_topic}}.p",
params:
name=lambda wildcards: config['names'],
n_topic=lambda wildcards: config['n_topics']
name=lambda wildcards: wildcards.name,
n_topic=lambda wildcards: wildcards.n_topic,
run:
# df, res = make_topmodel.find_best_n_topic(n_topics=config["n_topics"],
# names=params.name,
# topmodel_outputs=config['top_model']['workdir'])
#
# df.to_csv(f"{config['top_model']['workdir']}/k_N_{params.name}.csv",index=False)
train = Topyfic.read_train(f"{config['workdir']}/{params.name}/{params.n_topic}/train/train_{params.name}_{params.n_topic}.p")

make_topmodel.make_top_model(trains={input},
make_topmodel.make_top_model(trains=[train],
adata_paths=config['count_adata'][params.name],
n_top_genes=config['top_model']['n_top_genes'],
resolution=config['top_model']['resolution'],
max_iter_harmony=config['top_model']['max_iter_harmony'],
min_cell_participation=config['top_model']['min_cell_participation'],
topmodel_output=config['top_model']['workdir'])
n_top_genes=int(config['top_model']['n_top_genes']),
resolution=float(config['top_model']['resolution']),
max_iter_harmony=int(config['top_model']['max_iter_harmony']),
min_cell_participation=None if config['top_model']['min_cell_participation'] == "None" else float(
config['top_model']['min_cell_participation']),
topmodel_output=f"{config['workdir']}/{params.name}/{params.n_topic}/topmodel/")

# Rule to run topModel for merging multiple input adatas
if 'merging' in config:
if 'merge' in config:
rule run_multiple_top_model:
input:
expand(f"{config['train']['workdir']}/train_{{name}}_{{n_topic}}.p",
expand(f"{config['workdir']}/{{name}}/{{n_topic}}/train/train_{{name}}_{{n_topic}}.p",
name=config["names"],
n_topic=config["n_topics"])
n_topic=config["n_topics"]),
expand(f"{config['workdir']}/{{name}}/{{n_topic}}/topmodel/topModel_{{name}}_{{n_topic}}.p",
name=config["names"],
n_topic=config["n_topics"]),
output:
f"{config['merging']['workdir']}/topModel_{'_'.join(config['names'])}.p",
f"{config['merging']['workdir']}/analysis_{'_'.join(config['names'])}.p",
f"{config['workdir']}/topModel_{'_'.join(config['names'])}.p",
f"{config['workdir']}/analysis_{'_'.join(config['names'])}.p",
params:
name=lambda wildcards: config['names'],
n_topic=lambda wildcards: config['n_topics']
name=lambda wildcards: wildcards.name,
n_topic=lambda wildcards: wildcards.n_topic,
run:
df, res = make_topmodel.find_best_n_topic(n_topics=config["n_topics"],
names=config["names"],
topmodel_outputs=config['merging']['workdir'])

df.to_csv(f"{config['merging']['workdir']}/k_N.csv",index=False)
df.to_csv(f"{config['workdir']}/k_N.csv",index=False)

pd.DataFrame.from_dict(res, orient='index').to_csv(f"{config['merging']['workdir']}/best_k.csv")
pd.DataFrame.from_dict(res, orient='index').to_csv(f"{config['workdir']}/best_k.csv")

adata_paths = []
for name in config['names']:
@@ -57,12 +64,14 @@ if 'merging' in config:
trains = []
for name in res.keys():
n_topic = res[name]
trains.append(f"{config['train']['workdir']}/train_{name}_{n_topic}.p")
train = Topyfic.read_train(f"{config['workdir']}/{name}/{n_topic}/train/train_{name}_{n_topic}.p")
trains.append(train)

make_topmodel.make_top_model(trains=trains,
adata_paths=adata_paths,
n_top_genes=config['top_model']['n_top_genes'],
resolution=config['top_model']['resolution'],
max_iter_harmony=config['top_model']['max_iter_harmony'],
min_cell_participation=config['top_model']['min_cell_participation'],
topmodel_output=None)
n_top_genes=int(config['top_model']['n_top_genes']),
resolution=float(config['top_model']['resolution']),
max_iter_harmony=int(config['top_model']['max_iter_harmony']),
min_cell_participation=None if config['top_model']['min_cell_participation'] == "None" else float(
config['top_model']['min_cell_participation']),
topmodel_output=f"{config['workdir']}")
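The repeated `int(...)`/`float(...)` casts and the `"None"`-string check in the rules above exist because YAML config values may arrive as strings; a small hypothetical helper (not part of the workflow) could centralize that coercion:

```python
def coerce(value, cast):
    """Cast a raw config value, mapping the YAML string "None" to Python None."""
    if value is None or value == "None":
        return None
    return cast(value)

# Values as they might arrive from config.yaml
top_model = {"n_top_genes": "50", "resolution": 1, "min_cell_participation": "None"}

print(coerce(top_model["n_top_genes"], int))               # 50
print(coerce(top_model["resolution"], float))              # 1.0
print(coerce(top_model["min_cell_participation"], float))  # None
```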
28 changes: 15 additions & 13 deletions workflow/snakemake/workflow/rules/train.smk
@@ -1,36 +1,38 @@
from scripts import make_train

configfile: 'config/config.yaml'

# Rule to run make_single_train_model
rule run_single_train:
output:
f"{config['train']['workdir']}/train_{{name}}_{{n_topic}}_{{random_state}}.p",
f"{config['workdir']}/{{name}}/{{n_topic}}/train/train_{{name}}_{{n_topic}}_{{random_state}}.p",
params:
name=lambda wildcards: config['names'],
n_topic=lambda wildcards: config['n_topics'],
random_state=lambda wildcards: config["train"]["random_states"]
name=lambda wildcards: wildcards.name,
n_topic=lambda wildcards: wildcards.n_topic,
random_state=lambda wildcards: wildcards.random_state
run:
make_train.make_single_train_model(name=params.name,
adata_path=config['count_adata'][params.name],
k=params.n_topic,
random_state=params.random_state,
train_output=config['train']['workdir'])
k=int(params.n_topic),
random_state=int(params.random_state),
train_output=f"{config['workdir']}/{params.name}/{params.n_topic}/train/")

# Rule to run make_train_model
rule run_train_model:
input:
expand(f"{config['train']['workdir']}/train_{{name}}_{{n_topic}}_{{random_state}}.p",
expand(f"{config['workdir']}/{{name}}/{{n_topic}}/train/train_{{name}}_{{n_topic}}_{{random_state}}.p",
name=config["names"],
n_topic=config["n_topics"],
random_state=config["train"]["random_states"])
output:
f"{config['train']['workdir']}/train_{{name}}_{{n_topic}}.p",
f"{config['workdir']}/{{name}}/{{n_topic}}/train/train_{{name}}_{{n_topic}}.p",
params:
name=lambda wildcards: config['names'],
n_topic=lambda wildcards: config['n_topics'],
name=lambda wildcards: wildcards.name,
n_topic=lambda wildcards: wildcards.n_topic,
run:
make_train.make_train_model(name=params.name,
adata_path=config['count_adata'][params.name],
k=params.n_topic,
k=int(params.n_topic),
n_runs=config['train']['n_runs'],
random_state=config['train']['random_states'],
train_output=config['train']['workdir'])
train_output=f"{config['workdir']}/{params.name}/{params.n_topic}/train/")
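The params fix in the rules above swaps lambdas that returned whole config lists for lambdas that return the per-job wildcard values; the difference can be sketched outside Snakemake with a stand-in wildcards object:

```python
from types import SimpleNamespace

config = {"names": ["parse", "10x"], "n_topics": [5, 10]}

# Stand-in for the wildcards object Snakemake passes to params lambdas;
# wildcard values are strings, hence the int() casts in the rules above.
wildcards = SimpleNamespace(name="parse", n_topic="10")

# Before the fix: every job received the full lists from the config
old_name = (lambda wildcards: config["names"])(wildcards)

# After the fix: each job receives only its own wildcard values
new_name = (lambda wildcards: wildcards.name)(wildcards)
new_k = (lambda wildcards: int(wildcards.n_topic))(wildcards)

print(old_name)         # ['parse', '10x']
print(new_name, new_k)  # parse 10
```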
4 changes: 3 additions & 1 deletion workflow/snakemake/workflow/scripts/make_topmodel.py
@@ -33,9 +33,10 @@ def make_top_model(trains, adata_paths, n_top_genes=50, resolution=1, max_iter_h

# Check if the file exists
if not os.path.exists(topmodel_output):
os.makedirs(topmodel_output)
os.makedirs(topmodel_output, mode=0o777)

adata = None
print(adata_paths)
if isinstance(adata_paths, str):
adata = sc.read_h5ad(adata_paths)
else:
@@ -53,6 +54,7 @@ def make_top_model(trains, adata_paths, n_top_genes=50, resolution=1, max_iter_h
max_iter_harmony=max_iter_harmony,
min_cell_participation=min_cell_participation)

print(top_model.name, topmodel_output)
top_model.save_topModel(save_path=topmodel_output)

adata_topmodel.write_h5ad(f"{topmodel_output}/topic_weight_umap.h5ad")
10 changes: 7 additions & 3 deletions workflow/snakemake/workflow/scripts/make_train.py
@@ -25,7 +25,7 @@ def make_single_train_model(name, adata_path, k, random_state, train_output):

# Check if the file exists
if not os.path.exists(train_output):
os.makedirs(train_output)
os.makedirs(train_output, mode=0o777)

adata = sc.read_h5ad(adata_path)

@@ -61,7 +61,7 @@ def make_train_model(name, adata_path, k, n_runs, random_state, train_output):

# Check if the file exists
if not os.path.exists(train_output):
os.makedirs(train_output)
os.makedirs(train_output, mode=0o777)

adata = sc.read_h5ad(adata_path)

@@ -71,9 +71,13 @@ def make_train_model(name, adata_path, k, n_runs, random_state, train_output):
random_state_range=random_state)
trains = []
for i in random_state:
train = Topyfic.read_train(f"{name}_{k}_{i}.p")
print(f"{train_output}train_{name}_{k}_{i}.p")
train = Topyfic.read_train(f"{train_output}train_{name}_{k}_{i}.p")
print(train.random_state_range)
trains.append(train)

print(main_train.random_state_range)
main_train.combine_LDA_models(adata, single_trains=trains)
print(f"{train_output}{main_train.name}")
main_train.save_train(save_path=train_output)
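The bug fixed above read each single-train pickle from the working directory instead of `train_output`; a sketch of the corrected path, with illustrative values and an `os.path.join` variant that is also robust if the trailing slash is ever dropped:

```python
import os

# Illustrative values; train_output ends with "/" as in the config above
train_output = "results/parse/10/train/"
name, k, i = "parse", 10, 0

# Corrected path, matching the diff: the directory prefix was missing before
path = f"{train_output}train_{name}_{k}_{i}.p"

# Equivalent here, and safer if train_output lacks a trailing slash
safe_path = os.path.join(train_output, f"train_{name}_{k}_{i}.p")

print(path == safe_path)  # True
```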
