Merge pull request #1735 from iterative/jorge
guide: external data updates
jorgeorpinel authored Sep 1, 2020
2 parents c6406e0 + 23be152 commit 527ff87
Showing 4 changed files with 95 additions and 84 deletions.
23 changes: 11 additions & 12 deletions content/docs/command-reference/config.md
@@ -156,27 +156,26 @@ for more details.) This section contains the following options:
`dvc remote` for more information on "local remotes".) This will overwrite the
value provided to `dvc config cache.dir` or `dvc cache dir`.

- `cache.s3` - name of an
  [Amazon S3 remote to use as external cache](/doc/user-guide/managing-external-data#amazon-s-3).

- `cache.azure` - name of a Microsoft Azure Blob Storage remote to use as
  [external cache](/doc/user-guide/managing-external-data).

- `cache.gs` - name of a
  [Google Cloud Storage remote to use as external cache](/doc/user-guide/managing-external-data#google-cloud-storage).

- `cache.ssh` - name of an SSH remote to use
  [as external cache](/doc/user-guide/managing-external-data#ssh).

  > Avoid using the same [DVC remote](/doc/command-reference/remote) (used for
  > `dvc push`, `dvc pull`, etc.) as external cache, because it may cause file
  > hash overlaps: the hash of an external <abbr>output</abbr> could collide
  > with a hash generated locally for another file with different content.

- `cache.hdfs` - name of an
  [HDFS remote to use as external cache](/doc/user-guide/managing-external-data#hdfs).

### state

See
2 changes: 1 addition & 1 deletion content/docs/command-reference/run.md
@@ -99,7 +99,7 @@ Relevant notes:

- [external dependencies](/doc/user-guide/external-dependencies) and
[external outputs](/doc/user-guide/managing-external-data) (outside of the
<abbr>workspace</abbr>) are also supported (except metrics and plots).

- Outputs are deleted from the workspace before executing the command (including
at `dvc repro`) if their paths are found as existing files/directories. This
32 changes: 17 additions & 15 deletions content/docs/user-guide/external-dependencies.md
@@ -1,36 +1,38 @@
# External Dependencies

There are cases when data is so large, or its processing is organized in a way
such that you would like to avoid moving it out of its external/remote location.
For example, from a network attached storage (NAS), processing data on HDFS,
running [Dask](https://dask.org/) via SSH, or having a script that streams data
from S3 to process it.

External <abbr>dependencies</abbr> and
[external outputs](/doc/user-guide/managing-external-data) provide ways to track
data outside of the <abbr>project</abbr>.

## How it works

You can specify external files or directories as dependencies for your pipeline
stages. DVC will track changes in them and reflect this in the output of
`dvc status`.

Currently, the following types (protocols) of external dependencies are
supported:

- Amazon S3
- Microsoft Azure Blob Storage
- Google Cloud Storage
- SSH
- HDFS
- HTTP
- Local files and directories outside the <abbr>workspace</abbr>

> Note that these are a subset of the remote storage types supported by
> `dvc remote`.

In order to specify an external <abbr>dependency</abbr> for your stage, use the
usual `-d` option in `dvc run` with the external path or URL to your desired
file or directory.
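
For example, a stage could read directly from an external S3 location (a
minimal sketch; the bucket path is hypothetical):

```dvc
$ dvc run -d s3://mybucket/existing-data \
          -o data.txt \
          aws s3 cp s3://mybucket/existing-data data.txt
```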

## Examples

@@ -149,8 +151,8 @@ $ dvc import-url https://data.dvc.org/get-started/data.xml
Importing 'https://data.dvc.org/get-started/data.xml' -> 'data.xml'
```

The command above creates the import `.dvc` file `data.xml.dvc`, which contains
an external dependency (in this case an HTTPS URL).

<details>

@@ -180,7 +182,7 @@ determine whether the source has changed and we need to download the file again.
`dvc import` can download a <abbr>data artifact</abbr> from any <abbr>DVC
project</abbr> or Git repository. It also creates an external dependency in its
import `.dvc` file.

```dvc
$ dvc import git@github.com:iterative/example-get-started model.pkl
122 changes: 66 additions & 56 deletions content/docs/user-guide/managing-external-data.md
@@ -1,83 +1,94 @@
# Managing External Data

There are cases when data is so large, or its processing is organized in a way
such that it's preferable to avoid moving it from its external/remote location.
For example, data on a network attached storage (NAS), processing data on HDFS,
running [Dask](https://dask.org/) via SSH, or having a script that streams data
from S3 to process it.

External <abbr>outputs</abbr> and
[external dependencies](/doc/user-guide/external-dependencies) provide ways to
track data outside of the <abbr>project</abbr>.

## How external outputs work

DVC can track existing files or directories on an external location with
`dvc add` (`out` field). It can also create external files or directories as
outputs for `dvc.yaml` files (only `outs` field, not metrics or plots).

External outputs are considered part of the (extended) DVC project: DVC will
track changes in them, and reflect this in `dvc status` reports, for example.

For cached external outputs (e.g. `dvc add`, `dvc run -o`), you will need to
[set up an external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache)
in the same external/remote file system first.

Currently, the following types (protocols) of external outputs (and
<abbr>cache</abbr>) are supported:

- Amazon S3
- Microsoft Azure Blob Storage
- Google Cloud Storage
- SSH
- HDFS
- Local files and directories outside the <abbr>workspace</abbr>

> Note that these are a subset of the remote storage types supported by
> `dvc remote`.

> Avoid using the same [DVC remote](/doc/command-reference/remote) (used for
> `dvc push`, `dvc pull`, etc.) for external outputs, because it may cause file
> hash overlaps: the hash of an external output could collide with a hash
> generated locally for another file with different content.
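
For example, a setup that keeps the push/pull remote and the external cache in
separate locations (a sketch; names and paths are hypothetical):

```dvc
$ dvc remote add -d storage s3://mybucket/dvc-storage  # used by dvc push/pull
$ dvc remote add s3cache s3://mybucket/cache           # external cache only
$ dvc config cache.s3 s3cache
```
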
## Examples

For the examples, let's take a look at:

1. Adding a `dvc remote` to use as cache for data in the external location, and
   configuring it as external <abbr>cache</abbr> with `dvc config`.
2. Tracking existing data on an external location with `dvc add` (this doesn't
   download it). This produces a `.dvc` file with an external output.
3. Creating a simple [stage](/doc/command-reference/run) that moves a local file
   to the external location. This produces a stage with another external output
   in `dvc.yaml`.

### Amazon S3

```dvc
$ dvc remote add s3cache s3://mybucket/cache
$ dvc config cache.s3 s3cache
$ dvc add --external s3://mybucket/existing-data
$ dvc run -d data.txt \
          --external \
          -o s3://mybucket/data.txt \
          aws s3 cp data.txt s3://mybucket/data.txt
```

### Microsoft Azure Blob Storage

```dvc
$ dvc remote add azurecache azure://mycontainer/cache
$ dvc config cache.azure azurecache
$ dvc add --external azure://mycontainer/existing-data
$ dvc run -d data.txt \
          --external \
          -o azure://mycontainer/data.txt \
          az storage blob upload -f data.txt -c mycontainer -n data.txt
```

### Google Cloud Storage

```dvc
$ dvc remote add gscache gs://mybucket/cache
$ dvc config cache.gs gscache
$ dvc add --external gs://mybucket/existing-data
$ dvc run -d data.txt \
          --external \
          -o gs://mybucket/data.txt \
@@ -87,22 +98,22 @@
### SSH

```dvc
$ dvc remote add sshcache ssh://user@example.com/cache
$ dvc config cache.ssh sshcache
$ dvc add --external ssh://user@example.com/existing-data
$ dvc run -d data.txt \
          --external \
          -o ssh://user@example.com/data.txt \
          scp data.txt user@example.com:/data.txt
```

> Please note that to use password authentication, it's necessary to set the
> `password` or `ask_password` SSH remote options first (see
> `dvc remote modify`), and use a special `remote://` URL in step 2:
> `dvc add --external remote://sshcache/existing-data`.
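
For instance, a minimal sketch using the `sshcache` remote from the example
above (`ask_password` makes DVC prompt for the password interactively):

```dvc
$ dvc remote modify sshcache ask_password true
$ dvc add --external remote://sshcache/existing-data
```
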
⚠️ DVC requires both SSH and SFTP access to work with remote SSH locations.
Please check that you are able to connect both ways with tools like `ssh` and
`sftp` (GNU/Linux).
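
For example, connectivity can be verified with the standard clients (the
hostname is hypothetical):

```dvc
$ ssh user@example.com
$ sftp user@example.com
```
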
@@ -112,16 +123,11 @@
### HDFS

```dvc
$ dvc remote add hdfscache hdfs://user@example.com/cache
$ dvc config cache.hdfs hdfscache
$ dvc add --external hdfs://user@example.com/existing-data
$ dvc run -d data.txt \
          --external \
          -o hdfs://user@example.com/data.txt \
@@ -135,14 +141,18 @@ it. So systems like Hadoop, Hive, and HBase are supported!

### Local file system path

The default <abbr>cache</abbr> is in `.dvc/cache`, so there is no need to set a
custom cache location for local paths outside of your project.

> Except for external data on different storage devices or partitions mounted on
> the same file system (e.g. `/mnt/raid/data`). In that case, please set up an
> external cache on that same drive to enable
> [file links](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
> and avoid copying data.
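
An external cache on such a mounted drive could be configured like this (a
minimal sketch; the `/mnt/raid` paths are hypothetical):

```dvc
$ dvc remote add raidcache /mnt/raid/dvc-cache
$ dvc config cache.local raidcache
```
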
```dvc
$ dvc add --external /home/shared/existing-data
$ dvc run -d data.txt \
          --external \
          -o /home/shared/data.txt \
