diff --git a/content/docs/command-reference/config.md b/content/docs/command-reference/config.md
index 411be369fb..43388f70f8 100644
--- a/content/docs/command-reference/config.md
+++ b/content/docs/command-reference/config.md
@@ -156,27 +156,26 @@ for more details.) This section contains the following options:
`dvc remote` for more information on "local remotes".) This will overwrite the
value provided to `dvc config cache.dir` or `dvc cache dir`.
-- `cache.ssh` - name of an
- [SSH remote to use as external cache](/doc/user-guide/managing-external-data#ssh).
-
- > Avoid using the same remote location that you are using for `dvc push`,
- > `dvc pull`, `dvc fetch` as external cache for your external outputs, because
- > it may cause possible file hash overlaps: the hash of a data file in
- > external storage could collide with a hash generated locally for another
- > file with a different content.
-
- `cache.s3` - name of an
[Amazon S3 remote to use as external cache](/doc/user-guide/managing-external-data#amazon-s-3).
+- `cache.azure` - name of a Microsoft Azure Blob Storage remote to use as
+  [external cache](/doc/user-guide/managing-external-data#microsoft-azure-blob-storage).
+
- `cache.gs` - name of a
[Google Cloud Storage remote to use as external cache](/doc/user-guide/managing-external-data#google-cloud-storage).
+- `cache.ssh` - name of an SSH remote to use
+ [as external cache](/doc/user-guide/managing-external-data#ssh).
+
+ > Avoid using the same [DVC remote](/doc/command-reference/remote) (used for
+ > `dvc push`, `dvc pull`, etc.) as external cache, because it may cause file
+ > hash overlaps: the hash of an external output could collide
+ > with a hash generated locally for another file with different content.
+
- `cache.hdfs` - name of an
[HDFS remote to use as external cache](/doc/user-guide/managing-external-data#hdfs).
-- `cache.azure` - name of a Microsoft Azure Blob Storage remote to use as
- [external cache](/doc/user-guide/managing-external-data).
-
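+
+For example, a minimal sketch of using an Amazon S3 location as external cache
+via the `cache.s3` option above (the remote name and bucket are placeholders):
+
+```dvc
+$ dvc remote add s3cache s3://mybucket/cache
+$ dvc config cache.s3 s3cache
+```
+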
### state
See
diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md
index 7746a9cfdf..1c75275c68 100644
--- a/content/docs/command-reference/run.md
+++ b/content/docs/command-reference/run.md
@@ -99,7 +99,7 @@ Relevant notes:
- [external dependencies](/doc/user-guide/external-dependencies) and
[external outputs](/doc/user-guide/managing-external-data) (outside of the
- workspace) are also supported.
+  workspace) are also supported (except for metrics and plots).
- Outputs are deleted from the workspace before executing the command (including
at `dvc repro`) if their paths are found as existing files/directories. This
diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md
index 0e46718520..e3c25c3b06 100644
--- a/content/docs/user-guide/external-dependencies.md
+++ b/content/docs/user-guide/external-dependencies.md
@@ -1,36 +1,38 @@
# External Dependencies
There are cases when data is so large, or its processing is organized in a way
-that you would like to avoid moving it out of its external/remote location. For
-example from a network attached storage (NAS) drive, processing data on HDFS,
+that you would like to avoid moving it out of its external/remote location.
+For example, data on a network attached storage (NAS), processing data on HDFS,
running [Dask](https://dask.org/) via SSH, or having a script that streams data
-from S3 to process it. A mechanism for external dependencies and
-[external outputs](/doc/user-guide/managing-external-data) provides a way for
-DVC to control data externally.
+from S3 to process it.
-## Description
+External dependencies and
+[external outputs](/doc/user-guide/managing-external-data) provide ways to track
+data outside of the project.
-With DVC, you can specify external files as dependencies for your pipeline
+## How it works
+
+You can specify external files or directories as dependencies for your pipeline
stages. DVC will track changes in them and reflect this in the output of
`dvc status`.
Currently, the following types (protocols) of external dependencies are
supported:
-- Local files and directories outside of your workspace
-- SSH
- Amazon S3
- Microsoft Azure Blob Storage
- Google Cloud Storage
+- SSH
- HDFS
- HTTP
+- Local files and directories outside the workspace
> Note that these are a subset of the remote storage types supported by
> `dvc remote`.
-In order to specify an external dependency for your stage, use the usual `-d`
-option in `dvc run` with the external path or URL to your desired file or
-directory.
+In order to specify an external dependency for your stage, use the
+usual `-d` option in `dvc run` with the external path or URL to your desired
+file or directory.
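+
+For instance, a quick sketch of the idea (the URL here is just a placeholder;
+the examples below cover specific protocols in more detail):
+
+```dvc
+$ dvc run -d https://example.com/data.txt \
+          -o data.txt \
+          wget https://example.com/data.txt -O data.txt
+```
+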
## Examples
@@ -149,8 +151,8 @@ $ dvc import-url https://data.dvc.org/get-started/data.xml
Importing 'https://data.dvc.org/get-started/data.xml' -> 'data.xml'
```
-The command above creates the import stage (DVC-file)
-`data.xml.dvc`, that uses an external dependency (in this case an HTTPs URL).
+The command above creates the import `.dvc` file `data.xml.dvc`, which contains
+an external dependency (in this case an HTTPS URL).
@@ -180,7 +182,7 @@ determine whether the source has changed and we need to download the file again.
`dvc import` can download a data artifact from any DVC
project or Git repository. It also creates an external dependency in its
-import stage (DVC-file).
+import `.dvc` file.
```dvc
$ dvc import git@github.com:iterative/example-get-started model.pkl
diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md
index 2348ee977a..649d73b12a 100644
--- a/content/docs/user-guide/managing-external-data.md
+++ b/content/docs/user-guide/managing-external-data.md
@@ -1,83 +1,94 @@
# Managing External Data
There are cases when data is so large, or its processing is organized in a way
-that you would like to avoid moving it out of its external/remote location. For
-example from a network attached storage (NAS) drive, processing data on HDFS,
+that it's preferable to avoid moving it from its external/remote location.
+For example, data on a network attached storage (NAS), processing data on HDFS,
running [Dask](https://dask.org/) via SSH, or having a script that streams data
-from S3 to process it. External outputs and
-[external dependencies](/doc/user-guide/external-dependencies) provide a way for
-DVC to control data outside of the project directory.
+from S3 to process it.
-## Description
+External outputs and
+[external dependencies](/doc/user-guide/external-dependencies) provide ways to
+track data outside of the project.
-DVC can track files on an external storage with `dvc add` or specify external
-files as outputs for
-[DVC-files](/doc/user-guide/dvc-files-and-directories) created by `dvc run`
-(stage files). External outputs are considered part of the DVC project. DVC will
-track changes in them and reflect this in the output of `dvc status`.
+## How external outputs work
+
+DVC can track existing files or directories on an external location with
+`dvc add`, which records them in the `outs` field of a `.dvc` file. It can
+also create external files or directories as outputs of stages defined in
+`dvc.yaml` (only the `outs` field, not metrics or plots).
+
+External outputs are considered part of the (extended) DVC project: DVC will
+track changes in them and reflect this in `dvc status` output, for example.
+
+For cached external outputs (e.g. `dvc add`, `dvc run -o`), you will need to
+[set up an external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache)
+in the same external/remote file system first.
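+
+For example, after adding an S3 remote and setting it as external cache (as in
+the Amazon S3 example below), the project's `.dvc/config` should end up with
+something along these lines:
+
+```ini
+['remote "s3cache"']
+    url = s3://mybucket/cache
+[cache]
+    s3 = s3cache
+```
+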
Currently, the following types (protocols) of external outputs (and
cache) are supported:
-- Local files and directories outside of your workspace
-- SSH
- Amazon S3
+- Microsoft Azure Blob Storage
- Google Cloud Storage
+- SSH
- HDFS
+- Local files and directories outside the workspace
> Note that these are a subset of the remote storage types supported by
> `dvc remote`.
-In order to specify an external output for a stage file, use the usual `-o` or
-`-O` options of `dvc run`, but with the external path or URL to the file in
-question. For cached external outputs (`-o`) you will need to
-[setup an external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache)
-in the same external/remote file system first.
-
-> Avoid using the same location of the
-> [remote storage](/doc/command-reference/remote) that you have for `dvc push`
-> and `dvc pull` for external outputs or as external cache, because it may cause
-> file hash overlaps: The hash value of a data file in external storage could
-> collide with the one generated locally for another file.
+> Avoid using the same [DVC remote](/doc/command-reference/remote) (used for
+> `dvc push`, `dvc pull`, etc.) for external outputs, because it may cause file
+> hash overlaps: the hash of an external output could collide with a hash
+> generated locally for another file with different content.
## Examples
-For the examples, let's take a look at a [stage](/doc/command-reference/run)
-that simply moves local file to an external location, producing a `data.txt.dvc`
-DVC-file.
+For the examples, let's take a look at the following steps:
+
+1. Adding a `dvc remote` to use as cache for data in the external location, and
+   configuring it as external cache with `dvc config`.
+2. Tracking existing data on an external location with `dvc add` (this doesn't
+ download it). This produces a `.dvc` file with an external output.
+3. Creating a simple [stage](/doc/command-reference/run) that moves a local file
+ to the external location. This produces a stage with another external output
+ in `dvc.yaml`.
### Amazon S3
```dvc
-# Add S3 remote to be used as cache location for S3 files
$ dvc remote add s3cache s3://mybucket/cache
-
-# Tell DVC to use the 's3cache' remote as S3 cache location
$ dvc config cache.s3 s3cache
-# Add data on S3 directly
-$ dvc add --external s3://mybucket/mydata
+$ dvc add --external s3://mybucket/existing-data
-# Create the stage with an external S3 output
$ dvc run -d data.txt \
--external \
-o s3://mybucket/data.txt \
aws s3 cp data.txt s3://mybucket/data.txt
```
+### Microsoft Azure Blob Storage
+
+```dvc
+$ dvc remote add azurecache azure://mycontainer/cache
+$ dvc config cache.azure azurecache
+
+$ dvc add --external azure://mycontainer/existing-data
+
+$ dvc run -d data.txt \
+ --external \
+ -o azure://mycontainer/data.txt \
+ az storage blob upload -f data.txt -c mycontainer -n data.txt
+```
+
### Google Cloud Storage
```dvc
-# Add GS remote to be used as cache location for GS files
$ dvc remote add gscache gs://mybucket/cache
-
-# Tell DVC to use the 'gscache' remote as GS cache location
$ dvc config cache.gs gscache
-# Add data on GS directly
-$ dvc add --external gs://mybucket/mydata
+$ dvc add --external gs://mybucket/existing-data
-# Create the stage with an external GS output
$ dvc run -d data.txt \
--external \
-o gs://mybucket/data.txt \
@@ -87,22 +98,22 @@ $ dvc run -d data.txt \
### SSH
```dvc
-# Add SSH remote to be used as cache location for SSH files
$ dvc remote add sshcache ssh://user@example.com/cache
-
-# Tell DVC to use the 'sshcache' remote as SSH cache location
$ dvc config cache.ssh sshcache
-# Add data on SSH directly
-$ dvc add --external ssh://user@example.com/mydata
+$ dvc add --external ssh://user@example.com/existing-data
-# Create the stage with an external SSH output
$ dvc run -d data.txt \
--external \
-o ssh://user@example.com/data.txt \
scp data.txt user@example.com:/data.txt
```
+> Please note that to use password authentication, it's necessary to set the
+> `password` or `ask_password` SSH remote options first (see
+> `dvc remote modify`), and use a special `remote://` URL in step 2:
+> `dvc add --external remote://sshcache/existing-data`.
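+>
+> For example, a rough sketch that enables an interactive password prompt for
+> the `sshcache` remote from above, and then tracks the existing data through it:
+>
+> ```dvc
+> $ dvc remote modify sshcache ask_password true
+> $ dvc add --external remote://sshcache/existing-data
+> ```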
+
⚠️ DVC requires both SSH and SFTP access to work with remote SSH locations.
Please check that you are able to connect both ways with tools like `ssh` and
`sftp` (GNU/Linux).
@@ -112,16 +123,11 @@ Please check that you are able to connect both ways with tools like `ssh` and
### HDFS
```dvc
-# Add HDFS remote to be used as cache location for HDFS files
$ dvc remote add hdfscache hdfs://user@example.com/cache
-
-# Tell DVC to use the 'hdfscache' remote as HDFS cache location
$ dvc config cache.hdfs hdfscache
-# Add data on HDFS directly
-$ dvc add --external hdfs://user@example.com/mydata
+$ dvc add --external hdfs://user@example.com/existing-data
-# Create the stage with an external HDFS output
$ dvc run -d data.txt \
--external \
-o hdfs://user@example.com/data.txt \
@@ -135,14 +141,18 @@ it. So systems like Hadoop, Hive, and HBase are supported!
### Local file system path
-The default cache location is `.dvc/cache`, so there is no need to move it for
-local paths outside of your project.
+The default cache is in `.dvc/cache`, so there is no need to set a
+custom cache location for local paths outside of your project.
+
+> The exception is data on a different storage device or partition mounted in
+> the same file system (e.g. `/mnt/raid/data`). In that case, please set up an
+> external cache on that same drive to enable
+> [file links](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
+> and avoid copying data.
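+>
+> For example, a minimal sketch (the remote name and cache path are just
+> placeholders):
+>
+> ```dvc
+> $ dvc remote add raidcache /mnt/raid/dvc-cache
+> $ dvc config cache.local raidcache
+> ```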
```dvc
-# Add data on an external location directly
-$ dvc add --external /home/shared/mydata
+$ dvc add --external /home/shared/existing-data
-# Create the stage with an external location output
$ dvc run -d data.txt \
--external \
-o /home/shared/data.txt \