Do not require copying/linking/renaming input data #1
Thanks, Daniel! You are correct that the workflows currently require input files to be located in a specific directory and to follow particular naming conventions. We can definitely review the example you provided and think further about how we could redesign the workflows to be more flexible. Thanks for sharing that!
These particular workflows also involve the execution of singularity containers, and the singularity bind path is currently set up in such a way that all of the input files are expected to be located within a specific directory on the host file system. So to make things more flexible, we would need to make some changes along those lines too.
As a potential user of these workflows, maybe you could answer a couple of questions to help our developers better understand the possible use cases? Others are welcome to chime in on these questions too.
Would a user always want to start by inputting raw reads in the read filtering workflow, or are there times when they would want to input their own filtered reads or assembled contigs for analysis? There may also be other entry points into the workflows that I'm not considering, so please feel free to bring up any scenario where you'd have your own input file.
Currently, all of the output files are placed in the same directory as the input files. This was a straightforward way to set things up because the workflows build on each other (i.e., outputs of one singularity container become the inputs of a subsequent one), and the singularity bind path is currently set up to look in one directory to find all imaginable input files. Is it fine to continue outputting all files to that same directory?
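On the bind-path point, one possible direction is to derive the bind mounts from wherever the input files actually live, rather than assuming a single host directory. The following is only a rough sketch under stated assumptions: the helper name bind_args is hypothetical, it is not taken from these workflows, and it assumes each input directory is mounted at the same path inside the container using singularity's comma-separated --bind syntax.

```python
from pathlib import Path


def bind_args(input_files):
    """Build a singularity --bind argument from the parent directories of the inputs.

    Hypothetical helper for illustration: each directory is bound to the same
    path inside the container. Paths containing commas or colons would need
    extra handling, because --bind uses those characters as separators.
    """
    # Collect the unique parent directory of every input file, as absolute paths.
    dirs = sorted({str(Path(f).absolute().parent) for f in input_files})
    if not dirs:
        return []
    return ["--bind", ",".join(dirs)]
```

The returned list could be spliced into a command such as singularity exec, so that inputs spread across several host directories are all visible inside the container.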
For what it's worth, I can imagine users wanting to start with any of the workflows, using either data previously generated by SigSci workflows or (compatible) data generated some other way. So as much flexibility as you'd like to support in terms of "entry points" into the workflows will, I think, be useful.
Expected behavior
The user should be able to invoke the metagenomics workflow from any arbitrary directory, provided the correct absolute or relative path to the relevant Snakefile(s) is given. The user should not be required to copy data into a specific directory within the metagenomics workflow code distribution, nor should they be required to rename input files.
Actual behavior
Currently, some (or all) of the workflows require the user to copy data into a particular directory in the code distribution. It appears that input data files are expected to adhere to particular file naming conventions. Both requirements introduce an unnecessary logistical burden on the user.
I recently encountered a similar issue in a Snakemake workflow I was implementing. Two requirements seemed to be at odds:
My first thought was to restrict input filenames to specific patterns as well, but after a bit of work I was able to come up with a different approach that requires a lot less of the user.
config.json configfile. There is infinite flexibility here: you can enable an arbitrary number of input samples, and an arbitrary number of input files per sample. There is no need to require that the filenames have a particular extension.
--directory flag). The symlink files are named in a standardized fashion that I, as the workflow developer, decide. (While I generally prefer to implement Snakemake rules as shell commands, I implemented this rule in Python so I could more easily handle the input configuration dynamically.)
The Snakefile is here in case you're interested, with the corresponding config template here. In this example config, all the input BAM files are in the same directory and have the same extension, but the way this workflow is implemented, it would still work if each one were in a different directory with a non-standard name.
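To make the symlink idea concrete, here is a minimal sketch of what such a step might look like in plain Python. It is not the linked Snakefile: the function name link_inputs, the "inputs" subdirectory, and the sample.N.input naming pattern are all assumptions chosen for illustration, and the samples mapping stands in for whatever structure the config file provides.

```python
import os
from pathlib import Path


def link_inputs(samples, workdir):
    """Symlink each sample's input files into workdir under standardized names.

    `samples` maps a sample name to a list of input file paths, which may live
    in any directory and carry any extension. The naming scheme below is a
    hypothetical choice made by the workflow developer, not by the user.
    """
    link_dir = Path(workdir) / "inputs"
    link_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for sample, files in samples.items():
        for i, src in enumerate(files, start=1):
            link = link_dir / f"{sample}.{i}.input"
            # Replace a stale link left over from a previous run.
            if link.is_symlink() or link.exists():
                link.unlink()
            # Point the link at an absolute path so it stays valid
            # regardless of the directory the workflow is run from.
            os.symlink(Path(src).absolute(), link)
            links.append(link)
    return links
```

Downstream rules can then refer only to the standardized link names, so the original locations and filenames never need to follow a convention.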