Merge pull request #1735 from iterative/jorge
guide: external data updates
Showing 4 changed files with 95 additions and 84 deletions.

@@ -1,36 +1,38 @@
# External Dependencies

There are cases when data is so large, or its processing is organized in a way
such that you would like to avoid moving it out of its external/remote location.
For example, data on a network attached storage (NAS), processing data on HDFS,
running [Dask](https://dask.org/) via SSH, or having a script that streams data
from S3 to process it.

## Description

External <abbr>dependencies</abbr> and
[external outputs](/doc/user-guide/managing-external-data) provide ways to track
data outside of the <abbr>project</abbr>.

## How it works

You can specify external files or directories as dependencies for your pipeline
stages. DVC will track changes in them and reflect this in the output of
`dvc status`.

Currently, the following types (protocols) of external dependencies are
supported:

- Amazon S3
- Microsoft Azure Blob Storage
- Google Cloud Storage
- SSH
- HDFS
- HTTP
- Local files and directories outside the <abbr>workspace</abbr>

> Note that these are a subset of the remote storage types supported by
> `dvc remote`.

To specify an external <abbr>dependency</abbr> for your stage, use the usual
`-d` option in `dvc run` with the external path or URL to the desired file or
directory.
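
For instance, a stage could declare an object in S3 as a dependency (a minimal
sketch; the bucket, file, and script names are hypothetical):

```dvc
# The S3 object is tracked as an external dependency; the processed result is a
# regular (local) output.
$ dvc run -d s3://mybucket/raw-data.csv \
          -o processed.csv \
          'python process.py s3://mybucket/raw-data.csv processed.csv'
```

`dvc repro` would then re-execute this stage whenever the S3 object changes.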

## Examples

@@ -149,8 +151,8 @@ $ dvc import-url https://data.dvc.org/get-started/data.xml
Importing 'https://data.dvc.org/get-started/data.xml' -> 'data.xml'
```

The command above creates the import `.dvc` file `data.xml.dvc`, which contains
an external dependency (in this case an HTTPS URL).

<details>

@@ -180,7 +182,7 @@ determine whether the source has changed and we need to download the file again.
`dvc import` can download a <abbr>data artifact</abbr> from any <abbr>DVC
project</abbr> or Git repository. It also creates an external dependency in its
import `.dvc` file.

```dvc
$ dvc import git@github.com:iterative/example-get-started model.pkl
```

@@ -1,83 +1,94 @@
# Managing External Data

There are cases when data is so large, or its processing is organized in a way
such that it's preferable to avoid moving it from its external/remote location.
For example, data on a network attached storage (NAS), processing data on HDFS,
running [Dask](https://dask.org/) via SSH, or having a script that streams data
from S3 to process it.

## Description

External <abbr>outputs</abbr> and
[external dependencies](/doc/user-guide/external-dependencies) provide ways to
track data outside of the <abbr>project</abbr>.

## How external outputs work

DVC can track existing files or directories on an external location with
`dvc add` (`outs` field). It can also create external files or directories as
outputs for `dvc.yaml` files (only the `outs` field, not metrics or plots).
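
For reference, a stage with an external output might be represented roughly like
this in `dvc.yaml` (a sketch; the stage name, command, and bucket are
hypothetical, and the full per-protocol examples follow below):

```dvc
$ cat dvc.yaml
stages:
  upload_data:
    cmd: aws s3 cp data.txt s3://mybucket/data.txt
    deps:
    - data.txt
    outs:
    - s3://mybucket/data.txt
```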

External outputs are considered part of the (extended) DVC project: DVC will
track changes in them and reflect this, for example, in `dvc status` reports.
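
When an external output has changed since DVC last saved it, `dvc status` would
report something along these lines (illustrative output for a hypothetical
tracked S3 object):

```dvc
$ dvc status
existing-data.dvc:
    changed outs:
        modified:           s3://mybucket/existing-data
```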

For cached external outputs (e.g. `dvc add`, `dvc run -o`), you will need to
[set up an external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache)
in the same external/remote file system first.

Currently, the following types (protocols) of external outputs (and
<abbr>cache</abbr>) are supported:

- Amazon S3
- Microsoft Azure Blob Storage
- Google Cloud Storage
- SSH
- HDFS
- Local files and directories outside the <abbr>workspace</abbr>

> Note that these are a subset of the remote storage types supported by
> `dvc remote`.

> Avoid using the same [DVC remote](/doc/command-reference/remote) (used for
> `dvc push`, `dvc pull`, etc.) for external outputs or as the external cache,
> because it may cause file hash overlaps: the hash of an external output could
> collide with a hash generated locally for another file with different content.
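
In practice this means the external cache gets its own location, separate from
the remote storage used by `dvc push`/`dvc pull`. A minimal sketch (the bucket,
prefixes, and remote names are hypothetical):

```dvc
# Location used only as the external cache for S3 outputs
$ dvc remote add s3cache s3://mybucket/cache
$ dvc config cache.s3 s3cache
# Separate location used as the regular remote for dvc push/pull
$ dvc remote add -d storage s3://mybucket/dvcstore
```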

## Examples

For the examples, let's take a look at:

1. Adding a `dvc remote` to use as cache for data in the external location, and
   configuring it as the external <abbr>cache</abbr> with `dvc config`.
2. Tracking existing data on an external location with `dvc add` (this doesn't
   download it). This produces a `.dvc` file with an external output.
3. Creating a simple [stage](/doc/command-reference/run) that moves a local file
   to the external location. This produces a stage with another external output
   in `dvc.yaml`.

### Amazon S3

```dvc
# Add an S3 remote to be used as the cache location for S3 files
$ dvc remote add s3cache s3://mybucket/cache
# Tell DVC to use the 's3cache' remote as the S3 cache location
$ dvc config cache.s3 s3cache
# Track data that already exists on S3 (this doesn't download it)
$ dvc add --external s3://mybucket/existing-data
# Create the stage with an external S3 output
$ dvc run -d data.txt \
          --external \
          -o s3://mybucket/data.txt \
          aws s3 cp data.txt s3://mybucket/data.txt
```

### Microsoft Azure Blob Storage

```dvc
# Add an Azure remote to be used as the cache location for Azure Blob files
$ dvc remote add azurecache azure://mycontainer/cache
# Tell DVC to use the 'azurecache' remote as the Azure cache location
$ dvc config cache.azure azurecache
# Track data that already exists on Azure Blob Storage (this doesn't download it)
$ dvc add --external azure://mycontainer/existing-data
# Create the stage with an external Azure output
$ dvc run -d data.txt \
          --external \
          -o azure://mycontainer/data.txt \
          az storage blob upload -f data.txt -c mycontainer -n data.txt
```

### Google Cloud Storage

```dvc
# Add a GS remote to be used as the cache location for GS files
$ dvc remote add gscache gs://mybucket/cache
# Tell DVC to use the 'gscache' remote as the GS cache location
$ dvc config cache.gs gscache
# Track data that already exists on GS (this doesn't download it)
$ dvc add --external gs://mybucket/existing-data
# Create the stage with an external GS output
$ dvc run -d data.txt \
          --external \
          -o gs://mybucket/data.txt \
```

@@ -87,22 +98,22 @@ $ dvc run -d data.txt \
### SSH

```dvc
# Add an SSH remote to be used as the cache location for SSH files
$ dvc remote add sshcache ssh://user@example.com/cache
# Tell DVC to use the 'sshcache' remote as the SSH cache location
$ dvc config cache.ssh sshcache
# Track data that already exists on the SSH server (this doesn't download it)
$ dvc add --external ssh://user@example.com/existing-data
# Create the stage with an external SSH output
$ dvc run -d data.txt \
          --external \
          -o ssh://user@example.com/data.txt \
          scp data.txt user@example.com:/data.txt
```

> Please note that to use password authentication, it's necessary to set the
> `password` or `ask_password` SSH remote options first (see
> `dvc remote modify`), and use a special `remote://` URL in step 2:
> `dvc add --external remote://sshcache/existing-data`.
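
For example, assuming the `sshcache` remote defined above, password-based access
could be configured like this (the password value is a placeholder):

```dvc
# Option 1: store the password locally (kept out of Git)
$ dvc remote modify --local sshcache password mypassword
# Option 2: have DVC ask for the password interactively
$ dvc remote modify sshcache ask_password true
# Then reference the location through the remote:// URL as noted above
$ dvc add --external remote://sshcache/existing-data
```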

⚠️ DVC requires both SSH and SFTP access to work with remote SSH locations.
Please check that you are able to connect both ways with tools like `ssh` and
`sftp` (GNU/Linux).

@@ -112,16 +123,11 @@ Please check that you are able to connect both ways with tools like `ssh` and
### HDFS

```dvc
# Add an HDFS remote to be used as the cache location for HDFS files
$ dvc remote add hdfscache hdfs://user@example.com/cache
# Tell DVC to use the 'hdfscache' remote as the HDFS cache location
$ dvc config cache.hdfs hdfscache
# Track data that already exists on HDFS (this doesn't download it)
$ dvc add --external hdfs://user@example.com/existing-data
# Create the stage with an external HDFS output
$ dvc run -d data.txt \
          --external \
          -o hdfs://user@example.com/data.txt \
```

@@ -135,14 +141,18 @@ it. So systems like Hadoop, Hive, and HBase are supported!
### Local file system path

The default <abbr>cache</abbr> is in `.dvc/cache`, so there is no need to set a
custom cache location for local paths outside of your project.

> Except for external data on different storage devices or partitions mounted
> on the same file system (e.g. `/mnt/raid/data`). In that case, please set up
> an external cache in that same drive to enable
> [file links](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
> and avoid copying data.
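
For instance, if the data lives on a separate mount, the project cache could be
moved to that same mount (the path below is hypothetical; see the shared cache
guide linked above for details):

```dvc
# Keep the cache on the same drive as the external data to enable file links
$ dvc cache dir /mnt/raid/dvc-cache
```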

```dvc
# Track data that already exists in the external location (this doesn't copy it)
$ dvc add --external /home/shared/existing-data
# Create the stage with an external location output
$ dvc run -d data.txt \
          --external \
          -o /home/shared/data.txt \
```