Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error in passthrough argument feature: Invalid config definition: Config entry must start with a valid identifier. #109

Open
cmkobel opened this issue Jul 25, 2024 · 4 comments
Assignees
Labels
bug Something isn't working

Comments

@cmkobel
Copy link
Owner

cmkobel commented Jul 25, 2024

When using the passthrough argument feature (e.g. by setting prokka's --kingdom argument to "archaea") on a fresh installation of CompareM2, the following error message might show:

$ comparem2 --dry-run --config set_prokka--kingdom=archaea --dry-run
Using profile /glittertind/home/carl/miniforge3/envs/comparem2/share/comparem2-2.9.1-1/profile/apptainer/default for setting default command line arguments.
Invalid config definition: Config entry must start with a valid identifier.

This is a known bug which is a product of CompareM2 using Snakemake in a way it wasn't originally intended to. Because Snakemake does not allow any other characters than regex: [\w_] in the --config strings, the passthrough argument feature can not support dashes (-). Unfortunately, these dashes are necessary to be able to specify different types of command line option keys often using a combination of dashes and underlines (e.g. "--some-option" or "--other_option").

@cmkobel cmkobel added the bug Something isn't working label Jul 25, 2024
@cmkobel cmkobel self-assigned this Jul 25, 2024
@cmkobel
Copy link
Owner Author

cmkobel commented Jul 25, 2024

Temporary solution:
To enable Snakemake to use be able to parse dashes in these --config strings, you can run this command which is built into the conda environment.

conda activate comparem2
enable_passthrough_parameters_comparem2

This will modify the Snakemake code so that it accepts the dashes.
https://github.com/cmkobel/CompareM2/blob/480fb26fb7e6e96819e061b83948862911e18c88/resources/enable_passthrough_parameters.sh#L6C1-L6C168

This is explained in more detail in the documentation: https://comparem2.readthedocs.io/en/latest/20%20usage/#passthrough-arguments

This issue will be kept open until a better and more permanent solution is made: snakemake/snakemake#2986

Best, Carl

@eternal-flame-AD
Copy link

eternal-flame-AD commented Jul 26, 2024

@cmkobel Thanks for the explanation on snakemake/snakemake#2986 (comment)! Yeah I totally agree with you snakemake don't have to be that strict with config names, I was just saying that might be where that specific regex came from. It certainly is a limitation that there is no very good way to refer to the name of a rule with - in config.

Also thanks for the pointer to this repo, I read it. Great project I might actually need to use it :). I looked at the snakefile, from what I understand you did:

  • Find all config that is prefixed by the set_${rule.name}, for each of the config:
    • Strip the prefix off, split the rest into KEY=VALUE format (snakemake does this for you).
    • Concatenate into ${KEY} ${VALUE}
  • Join all, separated by space.
  • Interpolate the resulting string into tool invocation shell without escaping, as a raw string.

If this is all the features you need, why not just --config args_prokka='--kingdom archaea --verbose --whatever=value'? It does everything your current code does but has the following advantages:

  • Writes just like you are writing in the shell directly to the tool, no need to force it into the "${KEY}=${VALUE}" format where effectively every odd argument can't have "=" in it without some hack.

  • In your grammar sometimes things can get very confusing when one of these happens:

    • The same option gets repeated.

    • The program accepts "=" or whatever separator we choose as argument.

      for example: If I want to run awk -vFS='\t' -vOFS=' ' or awk -v FS='\t' -v OFS=' ' or anything equivalent, I can only think of these hacks to do it in your grammar, none of which are optimal:

    • --config set_awk-vFS='\t'' -> incorrect, '=' is replaced to space

    • --config "set_awk=-vFS='\t'" "set_awk=-vOFS=' '" -> incorrrent, duplicate config key

    • --config "set_awk-v=FS='\t'" "set_awk=-vOFS=' '" -> works barely, doesn't work if you need to define 3 variables.

    • --config "set_awk =-vFS='\t'" "set_awk =-vOFS=' '" works if snakemake allows whitespace in config key... Also not fun to write

    The same thing it is trivial to do it with the simple solution: --config "args_awk=-vFS='\t' -vOFS=' '". I also can't think of a case where your syntax may be readable or writable than just doing this.

  • One can easily implement a way to define a specific delimiter for each rule if that is necessary:

    No matter which delimiter you choose you effectively banned that sequence from appearing as a whole in one argument, you could write an escape mechanism but now there will be a need for triple escaping: from user to snakemake your unpacker within snakefile your snakefile to the final interpolated command. Even if you take consideration into every tool and make sure that doesn't happen, it is not fun to document it and ask user to remember specifics on a per-rule basis. On the other hand, if the user is advanced enough to want to manually pass extra options, they probably are already proficient enough in the tool itself that we don't need to stand in the way and change how they work.

Can you please elaborate?

If the above is all you need I take that back you don't need that solution at all. The GCC-type solution is when you need to escape the arguments for the user and/or manipulate/filter the arguments one-by-one. If you need that level of feature It's easier to use a completely unambiguous grammar. For example features like these:

  • De-duplication.
  • Smartly figure out if user wants to override a param you intend to pass automatically and replace your param with the user-passed ones.
  • Smartly determine additional file dependencies to pass to snakemake that only appear on the custom arguments.

I can elaborate more if this feature is what you need!

@cmkobel
Copy link
Owner Author

cmkobel commented Aug 4, 2024

Hi @eternal-flame-AD

Thanks for considering my challenge of passing command line arguments using the config system of the Snakemake framework. It is not an easy problem to solve and I'm happy to collaborate with you on finding the best solution.

Your interpretation of my implementation is correct.

I like your idea of using --config args_prokka='--kingdom archaea --verbose --whatever=value', but the problem I see here is that when the user goes ahead and specifies a another option for a tool on the command line, then the default arguments already set in config.yaml will be totally overwritten, and only the option that the user has set on the command line is alone given to the target program. The only workaround I can see for this, is that the argument string is added as an extension to the arguments already given in config.yaml or that there is a parser which is exactly what we have now (albeit with some limitations). So really, it is all about finding the best tradeoff.

The solution to running awk -vFS='\t' -vOFS=' ' is to set the delimiter character in the unpacking function to "=" for the rule that runs awk, and then run the pipeline with --config set_awkrule-vFS='\\t' set_awkrule-vOFS=" ". For your example with two identical option keys (awk -v FS='\t' -v OFS=' ') you are right that there is no sane way of passing this through the current system. But wouldn't it be possible to write it as awk -vFS='\t' -vOFS=' ' ?
You are right in how it is very hard to escape weird characters. But I think character escaping is a bit of a corner case anyway.

I also added a "separator" argument to the unpacking function so that it is also possible to write comma separated options like option=value,other=string etc.

I think we need to accept some limitations in this system. The point is not to make something that is extremely flexible, but that the user can specify common arguments like the ones given in the config.yaml, on the fly.

@cmkobel
Copy link
Owner Author

cmkobel commented Aug 13, 2024

Snakemake has now been modified to allow hyphens in CLI-config strings from version 8.17.0

snakemake/snakemake#2998

To do: CompareM2 will be updated to use this version of snakemake (or newer) asap.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants