pipelines: clarify whether it's possible to have more than one _pipeline file_ for a DVC project in the docs #2170
Comments
We already start the dvc.yaml guide with a pretty clear indication about this:
"pipelines files define" is plural as well. It would be good to review most mentions of dvc.yaml throughout docs to double check that at least we don't speak about it in singular terms (except in specific examples where it is just one) but otherwise I'm not sure if there's need to further clarify this. At some point we may specifically explain a project structure pattern with multiple dvc.yaml files (perhaps in a future Best Practices section, see #72).
Actually commands that work with stages accept |
I think we need to change I'll be more careful about this. Thanks. |
For example in https://raw.githubusercontent.com/iterative/dvc.org/master/content/blog/2020-06-22-dvc-1-0-release.md
makes me think there is a single global |
Evolution of my understanding can be here :) a039900?branch=a039900ed35d31380e9bd6f7a5ef3738ce24d595&diff=split 💡 |
@iesahin looks good to me! let's make a PR for this? |
I'm not sure. That language is too technical and terminal-like. I think it's OK to speak about it in singular like "change field xyz in your dvc.yaml" — it doesn't imply there's one only ever. I don't remember many users having this confusion based on support cases either. Same with For now someone could definitely try to review all dvc.yaml mentions (there's a lot though) to make sure the context doesn't imply that constraint.
Good point but that's rarely mentioned and people used to Git and many other tools probably assume there's just one such hidden dir per project. And in its docs we clearly state there's just one.
I agree it's a strange term. Maybe it should be "pipeline file(s)" but each dvc.yaml can contain multiple pipelines so that plural is correct. We discussed this back in the day and couldn't come up with a better answer, open to suggestions. |
There are over 900 references to |
I would have greatly benefited from examples of project structure with multiple |
Added a checkbox in #72 🙂 |
I am assuming that it's not possible to re-use certain stages or that it's possible to use some sort of composition. Let's say the only difference between pipeline A and pipeline B is how the dataset gets generated. Let's just say B is more complicated and therefore has more stages than A. However, the way a model gets trained and evaluated is the same. Only the last stage, where we evaluate the model, is different again. Now, I think it's okay to duplicate the code and have two YAML files, one for each pipeline. What I do not really like is the fact that I have to separate those YAML files on a file-system level as explained here: Option 1
Option 2
I can't say that this is ideal and I wonder why we can't simply point Both options provided by the docs feel rather awkward. Is there a better way to execute different pipelines on the same code-base within the same repository or am I overlooking something? |
Just a few notes:
You can consider A and B separate pipelines, but if they share a stage definition, technically it's a single DAG. A single
That answer is not complete. You can define multiple pipelines in a single dvc.yaml file too. But they must be actually disconnected, i.e. stage names and output file names must be different (even if some of them run the same commands), e.g.

stages:
  # pipeline 1
  1-echo-data:
    cmd: echo data > data1
    outs:
      - data1
  1-print:
    cmd: cat data1
    deps:
      - data1
  # pipeline 2
  2-wget-data:
    # -O sets the downloaded file name (lowercase -o would write wget's log there instead)
    cmd: wget <some-url> -O data2
    outs:
      - data2
  2-print:
    cmd: cat data2
    deps:
      - data2
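To make the "actually disconnected" condition concrete, here is a minimal Python sketch that groups stages into pipelines by linking a stage to whichever stage produces one of its deps. It is only an illustration of the idea, not how DVC itself builds its graph: the `stages` dict is a hand-copied, simplified version of the example above (no `cmd`), and `split_pipelines` is a made-up helper name.

```python
# Simplified copy of the dvc.yaml example above (cmd fields omitted).
stages = {
    "1-echo-data": {"deps": [], "outs": ["data1"]},
    "1-print": {"deps": ["data1"], "outs": []},
    "2-wget-data": {"deps": [], "outs": ["data2"]},
    "2-print": {"deps": ["data2"], "outs": []},
}


def split_pipelines(stages):
    """Group stage names into connected components, one list per pipeline."""
    # Which stage writes each output path.
    producer = {
        out: name for name, spec in stages.items() for out in spec["outs"]
    }

    # Link a stage to the stage that produces one of its deps (undirected).
    neighbours = {name: set() for name in stages}
    for name, spec in stages.items():
        for dep in spec["deps"]:
            upstream = producer.get(dep)
            if upstream is not None and upstream != name:
                neighbours[name].add(upstream)
                neighbours[upstream].add(name)

    # Walk the graph; each unvisited stage starts a new pipeline.
    seen, pipelines = set(), []
    for start in stages:
        if start in seen:
            continue
        seen.add(start)
        component, todo = [], [start]
        while todo:
            current = todo.pop()
            component.append(current)
            for nxt in neighbours[current] - seen:
                seen.add(nxt)
                todo.append(nxt)
        pipelines.append(sorted(component))
    return pipelines


pipelines = split_pipelines(stages)
print(len(pipelines), "pipeline(s):", pipelines)
```

With the four stages above this reports two groups. Adding a hypothetical stage whose deps are both data1 and data2 collapses them into one group, which matches the earlier note that pipelines sharing a stage are technically a single DAG.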
Creating subdirectories is the easiest way rn. Is that a major complication for you?
Not a bad idea. Feel free to propose the feature at https://github.com/iterative/dvc/issues/new/choose ! (The repo we're in is for the DVC website and docs.) |
@jorgeorpinel thanks for the response! As it turns out, separating those pipelines isn't as problematic as I thought. The only thing I saw was that Putting all in one file is an option as you described it as two disconnected DAGs, but in some cases it could be beneficial to separate the code for readability. That being said, having the option to point |
@jorgeorpinel Just realized that the
and I run
the params file cannot be loaded:
am I missing something here? |
I think params.yaml is expected in the cwd. Please reach out to support at https://dvc.org/chat |
please provide proper documentation with examples about multiple pipelines and multiple dvc.yaml files, otherwise it's heavily misleading. in that case stick to one pipeline. that's my perspective as a user who is trying to like DVC. |
We have plans to cover this in more detail (see #2883). For now you can use the more technical spec of dvc.yaml to get the details.
|
@walternat1ve what's your take, where in the docs a note / comment /quick sample would be enough for you?
could you clarify please, do you see specific parts of the docs that make you think that it's always one file / one pipeline? (asking these questions since it's good to see unbiased fresh perspective and I hope we can do some quick fixes vs waiting for another iteration on the pipelines since this issue clearly comes up again and again) |
I can empathize with the need for examples: I probably could have saved a day's worth of work if examples were provided demonstrating multiple pipelines defined by multiple dvc.yaml files (e.g. 3 yaml files: pipeline A depends on pipelines B and C having run first). |
@shcheklein all examples point to one file, one pipeline. even though there is a note in one sentence somewhere, there are no examples for these cases. |
@nxorable it's a tricky subject. For example, what you describe is a single DAG (one pipeline) spread among 3 dvc.yaml files. These are the things we have to figure out a way to properly explain (unsure whether it's a major topic TBH) and it hasn't been a high priority so far, but it's good to get this activity so that we get back to #2883 so thanks.
Thanks for the feedback @walternat1ve . Use Case pages will probably be extracted from the documentation (will look different) soon-ish, to prevent this confusion. |
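As a rough illustration of why the three-file case above counts as a single DAG, the Python sketch below models one stage per dvc.yaml file and derives a run order from outs-to-deps links. Everything in it is an assumption made for the example: the file layout, stage names and data paths are hypothetical, paths are written relative to the repo root for simplicity, and `execution_order` is not a DVC function.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical layout: three dvc.yaml files with one stage each.
# Keys are "<dvc.yaml path>:<stage name>"; all names and paths are made up.
stages = {
    "a/dvc.yaml:train": {
        "deps": ["b/clean.csv", "c/clean.csv"],
        "outs": ["a/model.pkl"],
    },
    "b/dvc.yaml:prepare_b": {"deps": ["raw_b.csv"], "outs": ["b/clean.csv"]},
    "c/dvc.yaml:prepare_c": {"deps": ["raw_c.csv"], "outs": ["c/clean.csv"]},
}


def execution_order(stages):
    """Return stage names so that every producer precedes its consumers."""
    # Which stage writes each output path.
    producer = {
        out: name for name, spec in stages.items() for out in spec["outs"]
    }
    # Map each stage to the stages it depends on; deps that no stage
    # produces (plain source files) add no edge.
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(stages)
print(order)
```

The stage in a/dvc.yaml always comes last because it consumes outputs from the other two files, so the three files together describe one connected graph rather than three independent pipelines.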
i just add a |
There are some places, like content/docs/user-guide/project-structure/pipelines-files.md and content/docs/user-guide/project-structure/index.md, that talk about pipeline files while describing dvc.yaml, e.g.
AFAICT there is a single file with multiple pipelines, and commands don't receive a pipeline file argument to replace the default dvc.yaml. In these descriptions we may need to clarify that there is a single file where we describe all pipelines. WDYT @jorgeorpinel?