guide: external data updates #1735
@@ -99,7 +99,7 @@ Relevant notes:

- [external dependencies](/doc/user-guide/external-dependencies) and
  [external outputs](/doc/user-guide/managing-external-data) (outside of the
  <abbr>workspace</abbr>) are also supported (except metrics and plots).

> Review comment: This is part of an important note added to the external
> outputs doc.

- Outputs are deleted from the workspace before executing the command (including
  at `dvc repro`) if their paths are found as existing files/directories. This
@@ -1,36 +1,38 @@

# External Dependencies

There are cases when data is so large, or its processing is organized in such a
way that you would like to avoid moving it out of its external/remote location.
For example from a network attached storage (NAS), processing data on HDFS,
running [Dask](https://dask.org/) via SSH, or having a script that streams data
from S3 to process it.

External <abbr>dependencies</abbr> and
[external outputs](/doc/user-guide/managing-external-data) provide ways to track
data outside of the <abbr>project</abbr>.

## How it works

You can specify external files or directories as dependencies for your pipeline
stages. DVC will track changes in them and reflect this in the output of
`dvc status`.

Currently, the following types (protocols) of external dependencies are
supported:
- Amazon S3
- Microsoft Azure Blob Storage
- Google Cloud Storage
- SSH
- HDFS
- HTTP
- Local files and directories outside the <abbr>workspace</abbr>

> Note that these are a subset of the remote storage types supported by
> `dvc remote`.
In order to specify an external <abbr>dependency</abbr> for your stage, use the
usual `-d` option in `dvc run` with the external path or URL to your desired
file or directory.
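For instance, a minimal sketch of a stage that downloads an external dependency
(using a hypothetical S3 bucket and file name, not from the guide itself):

```dvc
# Track an external S3 object as a dependency of a download stage
$ dvc run -d s3://mybucket/data.txt \
          -o data.txt \
          aws s3 cp s3://mybucket/data.txt data.txt
```

`dvc status` would then report when the S3 object has changed since the stage
last ran.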
## Examples
@@ -149,12 +151,12 @@

```dvc
$ dvc import-url https://data.dvc.org/get-started/data.xml
Importing 'https://data.dvc.org/get-started/data.xml' -> 'data.xml'
```

The command above creates the import `.dvc` file `data.xml.dvc`, which contains
an external dependency (in this case an HTTPS URL).
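Because the import `.dvc` file records the external data source, the download
can later be refreshed when the source changes (shown here on the file created
above):

```dvc
$ dvc update data.xml.dvc
```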
<details>

### Expand to see resulting `.dvc` file

```yaml
# ...
```
@@ -180,7 +182,7 @@ determine whether the source has changed and we need to download the file again.

`dvc import` can download a <abbr>data artifact</abbr> from any <abbr>DVC
project</abbr> or Git repository. It also creates an external dependency in its
import `.dvc` file.

```dvc
$ dvc import [email protected]:iterative/example-get-started model.pkl
```
@@ -193,7 +195,7 @@ specified (with the `repo` field).

<details>

### Expand to see resulting `.dvc` file

```yaml
# ...
```
@@ -1,83 +1,94 @@

# Managing External Data

There are cases when data is so large, or its processing is organized in such a
way that it's preferable to avoid moving it from its external/remote location.
For example data on a network attached storage (NAS), processing data on HDFS,
running [Dask](https://dask.org/) via SSH, or having a script that streams data
from S3 to process it.

External <abbr>outputs</abbr> and
[external dependencies](/doc/user-guide/external-dependencies) provide ways to
track data outside of the <abbr>project</abbr>.

## How external outputs work

DVC can track existing files or directories on an external location with
`dvc add` (`out` field). It can also create external files or directories as
outputs for `dvc.yaml` files (only the `outs` field, not metrics or plots).

External outputs are considered part of the (extended) DVC project: DVC will
track changes in them, and reflect this in `dvc status` reports, for example.

For cached external outputs (e.g. `dvc add`, `dvc run -o`), you will need to
[set up an external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache)
in the same external/remote file system first.
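As a minimal sketch (assuming an S3 location, with a hypothetical bucket name),
that setup consists of a remote pointing at a cache path plus the matching
`cache.<type>` config option; the per-protocol examples below spell this out:

```dvc
# Remote pointing at the cache location on the same file system as the data
$ dvc remote add s3cache s3://mybucket/cache
# Use that remote as the S3 cache
$ dvc config cache.s3 s3cache
```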
Currently, the following types (protocols) of external outputs (and
<abbr>cache</abbr>) are supported:

- Amazon S3
- Microsoft Azure Blob Storage
- Google Cloud Storage
- SSH
- HDFS
- Local files and directories outside the <abbr>workspace</abbr>

> Note that these are a subset of the remote storage types supported by
> `dvc remote`.
In order to specify an external output for a stage, use the usual `-o` or `-O`
options of `dvc run`, but with the external path or URL to the file in question.

> Avoid using the same [DVC remote](/doc/command-reference/remote) (used for
> `dvc push`, `dvc pull`, etc.) for external outputs, because it may cause file
> hash overlaps: the hash of an external output could collide with a hash
> generated locally for another file with different content.
## Examples

For the examples, let's take a look at:

1. Adding a `dvc remote` to use as cache for data in the external location, and
   configuring it as external <abbr>cache</abbr> with `dvc config`.
2. Tracking existing data on an external location with `dvc add` (this doesn't
   download it). This produces a `.dvc` file with an external output.
3. Creating a simple [stage](/doc/command-reference/run) that moves a local file
   to the external location. This produces a stage with another external output
   in `dvc.yaml`.
### Amazon S3

```dvc
# Add S3 remote to be used as cache location for S3 files
$ dvc remote add s3cache s3://mybucket/cache

# Tell DVC to use the 's3cache' remote as S3 cache location
$ dvc config cache.s3 s3cache

# Add data on S3 directly
$ dvc add --external s3://mybucket/existing-data

# Create the stage with an external S3 output
$ dvc run -d data.txt \
          --external \
          -o s3://mybucket/data.txt \
          aws s3 cp data.txt s3://mybucket/data.txt
```
### Microsoft Azure Blob Storage

```dvc
$ dvc remote add azurecache azure://mycontainer/cache
$ dvc config cache.azure azurecache

$ dvc add --external azure://mycontainer/existing-data

$ dvc run -d data.txt \
          --external \
          -o azure://mycontainer/data.txt \
          az storage blob upload -f data.txt -c mycontainer -n data.txt
```
### Google Cloud Storage

```dvc
# Add GS remote to be used as cache location for GS files
$ dvc remote add gscache gs://mybucket/cache

# Tell DVC to use the 'gscache' remote as GS cache location
$ dvc config cache.gs gscache

# Add data on GS directly
$ dvc add --external gs://mybucket/existing-data

# Create the stage with an external GS output
$ dvc run -d data.txt \
          --external \
          -o gs://mybucket/data.txt \
```
@@ -87,22 +98,22 @@ $ dvc run -d data.txt \

### SSH

```dvc
# Add SSH remote to be used as cache location for SSH files
$ dvc remote add sshcache ssh://[email protected]/cache

# Tell DVC to use the 'sshcache' remote as SSH cache location
$ dvc config cache.ssh sshcache

# Add data on SSH directly
$ dvc add --external ssh://[email protected]/existing-data

# Create the stage with an external SSH output
$ dvc run -d data.txt \
          --external \
          -o ssh://[email protected]/data.txt \
          scp data.txt [email protected]:/data.txt
```

> Please note that to use password authentication, it's necessary to set the
> `password` or `ask_password` SSH remote options first (see
> `dvc remote modify`), and use a special `remote://` URL in step 2:
> `dvc add --external remote://sshcache/existing-data`.

⚠️ DVC requires both SSH and SFTP access to work with remote SSH locations.
Please check that you are able to connect both ways with tools like `ssh` and
`sftp` (GNU/Linux).
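That check can be as simple as opening each kind of session manually (using a
hypothetical user and host, not from the guide itself):

```dvc
# Both of these should connect without errors
$ ssh user@myserver
$ sftp user@myserver
```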
@@ -112,16 +123,11 @@ Please check that you are able to connect both ways with tools like `ssh` and

### HDFS

```dvc
# Add HDFS remote to be used as cache location for HDFS files
$ dvc remote add hdfscache hdfs://[email protected]/cache

# Tell DVC to use the 'hdfscache' remote as HDFS cache location
$ dvc config cache.hdfs hdfscache

# Add data on HDFS directly
$ dvc add --external hdfs://[email protected]/existing-data

# Create the stage with an external HDFS output
$ dvc run -d data.txt \
          --external \
          -o hdfs://[email protected]/data.txt \
```

> Review comment: Should we put some summary of these comments above or after
> the code blocks?
>
> Reply: It's all under the [...] Maybe we should make all these H3s into
> expandable details sections so that you don't have to scroll that much
> between the numbered list and the actual example?
@@ -135,14 +141,18 @@ it. So systems like Hadoop, Hive, and HBase are supported!

### Local file system path

The default <abbr>cache</abbr> is in `.dvc/cache`, so there is no need to set a
custom cache location for local paths outside of your project.

> Except for external data on different storage devices or partitions mounted on
> the same file system (e.g. `/mnt/raid/data`). In that case, please set up an
> external cache on that same drive to enable
> [file links](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
> and avoid copying data.
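Under that exception, a sketch of the setup (with a hypothetical mount point;
`cache.local` names a local remote to use as cache) could look like:

```dvc
# Remote on the same mounted drive as the external data
$ dvc remote add raidcache /mnt/raid/cache
# Use it as the cache for local external outputs
$ dvc config cache.local raidcache
```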
```dvc
# Add data on an external location directly
$ dvc add --external /home/shared/existing-data

# Create the stage with an external location output
$ dvc run -d data.txt \
          --external \
          -o /home/shared/data.txt \
```
> Review comment: This file was just reordered to match the standard sorting of
> remote types (implemented in recent PRs) — but these `cache.{type}` options
> are directly related to external cache setup, and the docs link to each other.