Do not require copying/linking/renaming input data #1

standage · 2019-01-14T15:26:23Z

Expected behavior

The user should be able to invoke the metagenomics workflow from any arbitrary directory, assuming the correct absolute or relative path to the relevant Snakefile(s) is indicated. The user should not be required to copy data into a specific directory within the metagenomics workflow code distribution, nor should they be required to rename input files.

Actual behavior

Currently, some (or all) of the workflows require the user to copy data into a particular directory in the code distribution. It appears that input data files are expected to adhere to particular file naming conventions. Both requirements introduce an unnecessary logistical burden on the user.

I recently encountered a similar issue in a Snakemake workflow I was implementing. Two requirements seemed to be at odds:

- ease-of-use for end users (not making them jump through too many hoops to run the workflow)
- setting up sufficient constraints on input data to enable processing with a Snakemake workflow

My first thought was to restrict input filenames to specific patterns as well, but after a bit of work I was able to come up with a different approach that requires a lot less of the user.

1. The user specifies the input files with the config.json configfile. There is infinite flexibility here: you can enable an arbitrary number of input samples, and an arbitrary number of input files per sample. There is no need to require that the filenames have a particular extension.
2. The first rule implemented by the Snakemake workflow is to create symlinks to the input files in the workflow working directory (configurable with snakemake's --directory flag). The symlink files are named in a standardized fashion that I as the workflow developer decide. (While I generally prefer to implement Snakemake rules as shell commands, I implemented this rule in Python so I could more easily handle the input configuration dynamically.)
3. All subsequent steps in the workflow point to these symlinks instead of the user-specified input files.
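
For anyone who wants the gist without opening the linked Snakefile, the steps above can be sketched in plain Python roughly as follows. This is a minimal sketch under stated assumptions, not the code from the linked workflow: the function name `link_inputs`, the shape of the `samples` mapping, the `input/` subdirectory, and the `{sample}.{index}.{suffix}` naming scheme are all hypothetical and chosen only for illustration.

```python
from pathlib import Path


def link_inputs(samples, workdir, suffix="bam"):
    """Create standardized symlinks for user-specified input files.

    `samples` maps a sample name to a list of input file paths, similar to
    what a config.json could hold (hypothetical layout). Returns the list of
    symlink paths that downstream steps would use as their inputs.
    """
    linkdir = Path(workdir) / "input"  # hypothetical location for the links
    linkdir.mkdir(parents=True, exist_ok=True)
    links = []
    for sample, paths in sorted(samples.items()):
        for index, path in enumerate(paths, start=1):
            source = Path(path).resolve()
            if not source.is_file():
                raise FileNotFoundError(f"input file not found: {path}")
            # The developer, not the user, decides the standardized name.
            link = linkdir / f"{sample}.{index}.{suffix}"
            if link.is_symlink():
                link.unlink()  # replace a stale link from a previous run
            link.symlink_to(source)
            links.append(link)
    return links
```

In a Snakefile, a function like this could be called from the `run:` block of the first rule, with `samples` read from the `config` dictionary; every later rule would then refer only to the standardized link names.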

The Snakefile is here in case you're interested, with the corresponding config template here. In this example config, all the input BAM files are in the same directory and have the same extension, but the way this workflow is implemented, it would still work if each file were in a different directory and had a non-standard name.

kternus · 2019-01-16T23:54:47Z

Thanks, Daniel! You are correct, in that the workflows currently require input files to be located in a specific directory and follow particular naming conventions. We can definitely review the example you provided and think further about how we could redesign the workflows to be more flexible. Thanks for sharing that!

These particular workflows also involve the execution of singularity containers, and the singularity bind path is currently set up in such a way that all of the input files are expected to be located within a specific directory on the host file system. So to make things more flexible, we would need to make some changes along those lines too. As a potential user of these workflows, maybe you could answer a couple of questions to help our developers better understand the possible use cases? Others are welcome to chime in on these questions too.

Would a user always want to start by inputting raw reads in the read filtering workflow, or are there times when they would want to input their own filtered reads or assembled contigs for analysis? There may also be other entry points into the workflows that I’m not considering, so please feel free to bring up any scenario where you’d have your own input file.

Currently, all of the output files are placed in the same directory as the input files. This was a straightforward way to set things up because the workflows build on each other (i.e., outputs of one singularity container become the inputs of a subsequent one), and the singularity bind path is currently set up to look in one directory to find all imaginable input files. Is it fine to continue outputting all files to that same directory?
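
On the bind path constraint described above: one possible direction would be to derive the bind directories from the input files themselves rather than fixing a single directory. The following is only a rough sketch, not how these workflows currently behave. The helper name `bind_args` is hypothetical; the only Singularity feature it relies on is the `--bind` option, which accepts a host directory and may be given more than once. It does not handle paths containing commas or colons, which `--bind` treats specially.

```python
from pathlib import Path


def bind_args(input_paths):
    """Build Singularity --bind arguments covering each input's directory.

    Each input path is resolved to an absolute path, and its parent
    directory is bound once, so inputs may live in different directories.
    """
    dirs = sorted({str(Path(p).resolve().parent) for p in input_paths})
    args = []
    for directory in dirs:
        args.extend(["--bind", directory])
    return args
```

The resulting list could be joined into a string and handed to the container invocation, for example through Snakemake's `--singularity-args` option, assuming the Snakemake version in use provides it.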

adambazinet · 2019-01-17T19:03:59Z

For what it's worth, I can imagine users wanting to start with any of the workflows, using either data previously generated by SigSci workflows or (compatible) data generated some other way. So I think whatever flexibility you're willing to support in terms of "entry points" into the workflows will be useful.

lovettse mentioned this issue Mar 27, 2019

Data directory is disorganized #8

Open

kternus assigned cgrahlm Apr 1, 2019
