This directory contains a Snakemake pipeline for running Topyfic automatically.
The workflow runs model training (Train) and model building and analysis (TopModel, Analysis).
Note: Please make sure to install the necessary packages and set up your Snakemake appropriately.
Note: The pipeline is tested with Snakemake >= 8.X.
Build your environment and install the necessary packages.
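For example, a minimal conda environment might look like the following. Package names and versions here are assumptions for illustration; check Topyfic's own installation instructions for the authoritative list:

```yaml
# environment.yml -- illustrative sketch; pin versions per Topyfic's docs
name: topyfic
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.10
  - snakemake-minimal>=8        # pipeline is tested with Snakemake >= 8
  - pip
  - pip:
      - Topyfic
      # needed for the --executor cluster-generic option used below
      - snakemake-executor-plugin-cluster-generic
```

Create and activate it with `conda env create -f environment.yml` followed by `conda activate topyfic`.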
Modify the config file or create a new one with the same structure.
- `names`
  - Contains the name(s) of the input dataset(s).
  - Each name will be used as the name of the corresponding Train and TopModel models.
  - If there are multiple names, Topyfic will normalize the models across names using Harmony.
  - list of names: `[parse, 10x]`
- `count_data`
  - Contains the path to each input dataset.
  - The name of each path should match the corresponding entry in `names`.
  - Full paths are recommended over relative paths.
- `n_topics`
  - Contains the initial numbers of topics to train models with.
  - list of int: `[5, 10, 15, 20, 25, 30, 35, 40, 45, 50]`
- `organism`
  - Indicates the species used for downstream analysis.
  - Example: human or mouse
- `workdir`
  - Directory where outputs will be written.
  - Make sure you have write access.
  - One folder will be created per dataset.
- `train`
  - Most items are inputs to `train_model()`.
  - `n_runs`: number of runs used to define the rLDA model (default: 100)
  - `random_states`: list of random states used to run the LDA models (default: `range(n_runs)`)
- `top_model`
  - `n_top_genes` (int): number of highly variable genes to keep (default: 50)
  - `resolution` (int): controls the coarseness of the clustering; higher values lead to more clusters (default: 1)
  - `max_iter_harmony` (int): number of iterations for running Harmony (default: 10)
  - `min_cell_participation` (float): minimum cell participation required to keep each topic; when `None`, topics with cell participation above 1% of the number of cells (#cells / 100) are kept
- `merge`
  - Indicates whether to also build a model for all datasets together.
  - Make sure you have write access.
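Putting the fields above together, a config file might look like this. The exact key layout is an assumption based on the descriptions above, and all paths and values are illustrative; compare against the config file shipped with the pipeline:

```yaml
# config.yml -- illustrative sketch, not a verbatim copy of the shipped config
names:
  - parse
  - 10x

count_data:
  parse: /full/path/to/parse_counts.h5ad
  10x: /full/path/to/10x_counts.h5ad

n_topics: [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]

organism: human

workdir: /full/path/to/outputs   # one folder per dataset is created here

train:
  n_runs: 100
  random_states: null            # defaults to range(n_runs)

top_model:
  n_top_genes: 50
  resolution: 1
  max_iter_harmony: 10
  min_cell_participation: null   # defaults to #cells / 100

merge: true
```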
First, run the workflow with `-n` (dry run) to make sure the steps it plans to run are reasonable. After it finishes, run the same command without the `-n` option.

```
snakemake -n
```
For SLURM:

```
snakemake \
    -j 1000 \
    --latency-wait 300 \
    --use-conda \
    --rerun-triggers mtime \
    --executor cluster-generic \
    --cluster-generic-submit-cmd \
    "sbatch -A model-ad_lab \
    --partition=highmem \
    --cpus-per-task 16 \
    [email protected] \
    --mail-type=BEGIN,END,FAIL \
    --time=72:00:00" \
    -n \
    -p \
    --verbose
```
Development hints: If you run into an error, `-p --verbose` will give you more detail about each run and help you debug your code.
Once you have all three main objects (Train, TopModel, Analysis), I would recommend using this notebook for in-depth downstream analysis. **Section 4 is still under construction.**