pipelines: clarify whether it's possible to have more than one _pipeline file_ for a DVC project in the docs #2170

Closed
iesahin opened this issue Feb 9, 2021 · 21 comments
Assignees
Labels
A: docs Area: user documentation (gatsby-theme-iterative) type: enhancement Something is not clear, small updates, improvement suggestions

Comments

@iesahin
Contributor

iesahin commented Feb 9, 2021

There are some places like content/docs/user-guide/project-structure/pipelines-files.md and content/docs/user-guide/project-structure/index.md that talk about pipeline files while describing dvc.yaml, e.g.

dvc.yaml pipelines files define stages that form the pipeline(s) of a
project. All stage-based features such as dvc params, dvc metrics, and
dvc plots are specified here.

AFAICT there is a single file with multiple pipelines and commands don't receive a pipeline file argument to replace the default dvc.yaml. In these descriptions we may need to clarify that there is a single file where we describe all pipelines. WDYT @jorgeorpinel?

@iesahin iesahin self-assigned this Feb 9, 2021
@jorgeorpinel jorgeorpinel added A: docs Area: user documentation (gatsby-theme-iterative) type: enhancement Something is not clear, small updates, improvement suggestions labels Feb 9, 2021
@jorgeorpinel
Contributor

like content/docs/user-guide/project-structure/pipelines-files.md

We already start the dvc.yaml guide with a pretty clear indication about this:

image

and content/docs/user-guide/project-structure/index.md

"pipelines files define" is plural as well.

It would be good to review most mentions of dvc.yaml throughout docs to double check that at least we don't speak about it in singular terms (except in specific examples where it is just one) but otherwise I'm not sure if there's need to further clarify this. At some point we may specifically explain a project structure pattern with multiple dvc.yaml files (perhaps in a future Best Practices section, see #72).

commands don't receive a pipeline file argument to replace the default dvc.yaml

Actually, commands that work with stages accept targets, which can be a path to another dvc.yaml file. See for example https://dvc.org/doc/command-reference/repro#options, but that's the only ref. where we treat the cmd argument (targets) as an "Option" and explain it in detail. Maybe we should focus on applying that to other refs instead, for this issue? (Please change the title and desc here if you agree.)

@shcheklein shcheklein changed the title Clarify whether it's possible to have more than one _pipeline file_ for a DVC project in the docs. pipelines: clarify whether it's possible to have more than one _pipeline file_ for a DVC project in the docs Feb 9, 2021
@iesahin
Contributor Author

iesahin commented Feb 10, 2021

It would be good to review most mentions of dvc.yaml throughout docs to double check that at least we don't speak about it in singular terms (except in specific examples where it is just one) but otherwise I'm not sure if there's need to further clarify this.

I think we need to change dvc.yaml mentions to */dvc.yaml to show that we are talking about possibly multiple files in different dirs where necessary. When a document mentions a particular filename I tend to think it's about a single file for the whole repository, e.g. .dvc is a single dir for the repository and even if you say pipelines files my mind sees a typo there, instead of multiple files in different dirs. 🤦🏼 😄

I'll be more careful about this. Thanks.

@iesahin
Contributor Author

iesahin commented Feb 10, 2021

For example in https://raw.githubusercontent.com/iterative/dvc.org/master/content/blog/2020-06-22-dvc-1-0-release.md

In DVC 1.0, the DVC metafile format was changed in three big ways. First,
instead of multiple DVC "stage files" (*.dvc), each project has a single
dvc.yaml file. By default, all stages go in this single YAML file.

makes me think there is a single global dvc.yaml file for each project and since project is the directory where .dvc/ is present, I think this is a valid interpretation.

@iesahin
Contributor Author

iesahin commented Feb 10, 2021

Evolution of my understanding can be seen here :) a039900?branch=a039900ed35d31380e9bd6f7a5ef3738ce24d595&diff=split

💡

@shcheklein
Member

@iesahin looks good to me! let's make a PR for this?

@jorgeorpinel
Contributor

change dvc.yaml mentions to */dvc.yaml

I'm not sure. That language is too technical and terminal-like. dvc.yaml is a schema, not a specific file.

I think it's OK to speak about it in singular like "change field xyz in your dvc.yaml"; it doesn't imply there's only ever one. I don't remember many users having this confusion based on support cases either. Same with *.dvc although that one I like a bit more.

For now someone could definitely try to review all dvc.yaml mentions (there's a lot though) to make sure the context doesn't imply that constraint.

.dvc is a single dir for the repository

Good point but that's rarely mentioned and people used to Git and many other tools probably assume there's just one such hidden dir per project. And in its docs we clearly state there's just one.

if you say pipelines files my mind sees a typo

I agree it's a strange term. Maybe it should be "pipeline file(s)" but each dvc.yaml can contain multiple pipelines so that plural is correct. We discussed this back in the day and couldn't come up with a better answer, open to suggestions.

@iesahin
Contributor Author

iesahin commented Feb 20, 2021

There are over 900 references to dvc.yaml and I read maybe the first few hundred. There is specifically an answered question for this in the blog. I think I can close this as no major reference to a single dvc.yaml seems to be in the docs. Thank you @jorgeorpinel

@iesahin iesahin closed this as completed Feb 20, 2021
@nxorable

nxorable commented Jul 8, 2021

perhaps in a future Best Practices section

I would have greatly benefited from examples of project structure with multiple dvc.yaml files in a best practices section. This thread was what I had to find to get a clue I was going in the right direction, and then I eventually figured out something that worked through trial and error.

@jorgeorpinel
Contributor

Added a checkbox in #72 🙂

@stefan-falk

stefan-falk commented Oct 4, 2022

I am assuming that it's not possible to re-use certain stages or to use some sort of composition. Let's say the only difference between pipeline A and pipeline B is how the dataset gets generated. Let's just say, B is more complicated and has therefore more stages than A. However, the way a model gets trained and evaluated is the same. Only the last stage, where we evaluate the model, is again different.

Now.. I think it's okay to duplicate the code and have two YAML files, one for each pipeline. What I do not really like is the fact that I have to separate those YAML files on a file-system level as explained here:


Option 1

.
├── main_data_pipeline
│   └── dvc.yaml
└── secondary_data_pipeline
    └── dvc.yaml

Option 2

.
└── main_data_pipeline
    ├── dvc.yaml
    └── secondary_data_pipeline
        └── dvc.yaml

I can't say that this is ideal and I wonder why we can't simply point dvc via file path or maybe even pipe the config to the dvc CLI?

Both options provided by the docs feel rather awkward. Is there a better way to execute different pipelines on the same code-base within the same repository or am I overlooking something?

@jorgeorpinel
Contributor

jorgeorpinel commented Oct 4, 2022

Just a few notes:

B is more complicated and has therefore more stages than A. However, the way a model gets trained and evaluated is the same

You can consider A and B separate pipelines, but if they share a stage definition, technically it's a single DAG. A single dvc repro command would run them both, for example.
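
To illustrate the single-DAG point, here is a minimal sketch in which two data-preparation stages feed one shared training stage. The stage names, script names, and paths are hypothetical and not taken from this thread:

```yaml
stages:
  # Data step of "pipeline A" (hypothetical script and output path)
  prepare-a:
    cmd: python prepare_a.py
    deps:
      - prepare_a.py
    outs:
      - data/a

  # Data step of "pipeline B" (hypothetical script and output path)
  prepare-b:
    cmd: python prepare_b.py
    deps:
      - prepare_b.py
    outs:
      - data/b

  # Shared stage: depending on both outputs joins everything into one DAG
  train:
    cmd: python train.py data/a data/b
    deps:
      - train.py
      - data/a
      - data/b
    outs:
      - model.pkl
```

Because `train` depends on both `data/a` and `data/b`, a plain `dvc repro` would be expected to run all three stages as one graph.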

I have to separate those YAML files on a file-system level as explained here

That answer is not complete. You can define multiple pipelines in a single dvc.yaml file too. But they must actually be disconnected, i.e. stage names and output file names must be different (even if some of them run the same commands), e.g.

stages:
  # pipeline 1
  1-echo-data:
    cmd: echo data > data1
    outs:
    - data1
  1-print:
    cmd: cat data1
    deps:
    - data1

  # pipeline 2
  2-wget-data:
    cmd: wget <some-url> -O data2
    outs:
    - data2
  2-print:
    cmd: cat data2
    deps:
    - data2

Is there a better way to execute different pipelines on the same code-base?

Creating subdirectories is the easiest way rn. Is that a major complication for you?

we can't simply point dvc via file path or maybe even pipe the config

Not a bad idea. Feel free to propose the feature at https://github.com/iterative/dvc/issues/new/choose ! (The repo we're in is for the DVC website and docs.)

@stefan-falk

@jorgeorpinel thanks for the response!

As it turns out, separating those pipelines isn't as problematic as I thought. The only thing I saw was that dvc dag e.g. is still treating this as a single dvc.yaml - not an issue, just something I observed.

Putting everything in one file as two disconnected DAGs, as you described, is an option, but in some cases it could be beneficial to separate the code for readability.

That being said, having the option to point dvc to a particular dvc.yaml, or even pipe the content to it, probably doesn't provide a lot of additional functionality, although it could be a convenience to some.

@stefan-falk

@jorgeorpinel Just realized that the params.yaml file cannot be found if I create sub-directories. E.g. if I have

.dvc/
pipelines/  
  pipeline-1/
    dvc.yaml
    params.yaml
  pipeline-2/
    dvc.yaml
    params.yaml
src/
  main.py

and I run

(my-env) pipelines/pipeline-1 $ PYTHONPATH=../../src dvc repro

the params file cannot be loaded:

Traceback (most recent call last):                                    
  File "/Users/sfalk/workspaces/git/mnist/pipelines/mnist/../../src/ml/pipeline/train.py", line 120, in <module>
    main(None)
  File "/Users/sfalk/workspaces/git/mnist/pipelines/mnist/../../src/ml/pipeline/train.py", line 115, in main
    train_model(data_dir=args.data_dir, out_dir=args.out_dir, hparam_set=args.hparam_set)
  File "/Users/sfalk/workspaces/git/mnist/pipelines/mnist/../../src/ml/pipeline/train.py", line 52, in train_model
    params = dvc.api.params_show()
  File "/Users/sfalk/miniconda3/envs/mnist/lib/python3.9/site-packages/dvc/api/params.py", line 278, in params_show
    return _postprocess(params)
  File "/Users/sfalk/miniconda3/envs/mnist/lib/python3.9/site-packages/dvc/api/params.py", line 265, in _postprocess
    raise DvcException("No params found")
dvc.exceptions.DvcException: No params found

am I missing something here?

@jorgeorpinel
Contributor

I think params.yaml is expected in the cwd. Please reach out to support at https://dvc.org/chat

@walternat1ve

please provide proper documentation with examples about multiple pipelines and multiple dvc.yaml files, otherwise it's heavily misleading. in that case stick to one pipeline. that's my perspective as a user who is trying to like DVC.

@jorgeorpinel
Contributor

We have plans to cover this in more detail (see #2883). For now you can use the more technical spec of dvc.yaml to get the details.

Note that multiplicity is a general feature of DVC's flexibility and doesn't just apply to pipelines and dvc.yaml files (and stages), but also certain fields like cmd (stage commands), params, metrics, plots...

@shcheklein
Member

@walternat1ve what's your take: where in the docs would a note / comment / quick sample be enough for you?

otherwise it's heavily misleading

could you clarify please, do you see specific parts of the docs that make you think that it's always one file / one pipeline?

(asking these questions since it's good to see an unbiased, fresh perspective and I hope we can do some quick fixes vs waiting for another iteration on the pipelines, since this issue clearly comes up again and again)

@nxorable

I can empathize with the need for examples: I probably could have saved a day's worth of work if examples had been provided demonstrating multiple pipelines defined by multiple dvc.yaml files (e.g. 3 yaml files: pipeline A depends on pipelines B and C having run first).

@walternat1ve

walternat1ve commented Dec 30, 2022

could you clarify please, do you see specific parts of the docs that make you think that it's always one file / one pipeline?

@shcheklein all examples point to one file, one pipeline. even though there is a note in one sentence somewhere, there are no examples for these cases.
additionally, a bit off topic: it's good documentation but i often had to jump from "user guide" to "use cases" for various topics. maybe there is a way to unify things.

@jorgeorpinel
Contributor

jorgeorpinel commented Dec 30, 2022

e.g. 3 yaml files: pipeline A depends on pipelines B and C having run first

@nxorable it's a tricky subject. For example, what you describe is a single DAG (one pipeline) spread among 3 dvc.yaml files. These are the things we have to figure out a way to properly explain (unsure whether it's a major topic TBH) and it hasn't been a high priority so far, but it's good to get this activity so that we get back to #2883 so thanks.
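
As a rough sketch of that situation, the three files below form one DAG because the stage in `a/dvc.yaml` lists outputs of the other two files as dependencies. The directory names, stage names, and commands are hypothetical; the `---` separators only mark where one file ends and the next begins and would not appear in the real files:

```yaml
# b/dvc.yaml (hypothetical)
stages:
  build-b:
    cmd: echo b > b.txt
    outs:
      - b.txt
---
# c/dvc.yaml (hypothetical)
stages:
  build-c:
    cmd: echo c > c.txt
    outs:
      - c.txt
---
# a/dvc.yaml (hypothetical); paths are relative to this file's directory
stages:
  combine:
    cmd: cat ../b/b.txt ../c/c.txt > a.txt
    deps:
      - ../b/b.txt
      - ../c/c.txt
    outs:
      - a.txt
```

With a layout like this, `dvc repro a/dvc.yaml` would be expected to run `build-b` and `build-c` first, since `combine` depends on their outputs.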

i often had to jump from "user guide" to "use cases" for various topics.

Thanks for the feedback @walternat1ve . Use Case pages will probably be extracted from the documentation (will look different) soon-ish, to prevent this confusion.

@majidaldo

majidaldo commented Mar 28, 2023

i just add a dvc repro as a stage; not great but good enough.
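
A rough sketch of what such a stage might look like, assuming a nested pipeline in a hypothetical `secondary/` directory. The comment above gives no details, so the stage name and path are illustrative only:

```yaml
stages:
  run-secondary:
    # Delegate to another pipeline file by passing it as a repro target
    cmd: dvc repro secondary/dvc.yaml
    # No deps or outs are declared here, so mark the stage to rerun every time
    always_changed: true
```

One trade-off of this approach is that the outer DAG has no knowledge of the inner stages' dependencies and outputs.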


7 participants