docs: various updates (#4025)
* buffer explanation (with diagrams)
* clone.md refactor, move architectural intros into clone.md
* refactor related content into "consistency exceptions"
timfeirg authored Aug 30, 2023
1 parent 234dd95 commit 94e7a1b
Showing 14 changed files with 181 additions and 91 deletions.
70 changes: 59 additions & 11 deletions docs/en/guide/cache_management.md
@@ -18,7 +18,7 @@ Meanwhile, cache improve performance only when application needs to repeatedly r

Distributed systems often need to make tradeoffs between cache and consistency. Due to the decoupled architecture of JuiceFS, consistency should be considered separately for metadata, file data (in object storage), and the local cache of file data.

For [metadata](#metadata-cache), the default configuration offers a "close-to-open" consistency guarantee, i.e. after a client modifies and closes a file, other clients will see its latest state when they open the file again. Also, the default mount options use 1 second of kernel metadata cache, which provides decent performance for common cases. If your scenario demands a higher level of cache performance, learn how to tune cache settings in the sections below.
For [metadata](#metadata-cache), the default configuration offers a "close-to-open" consistency guarantee, i.e. after a client modifies and closes a file, other clients will see its latest state when they open the file again. Also, the default mount options use 1 second of kernel metadata cache, which provides decent performance for common cases. If your scenario demands a higher level of cache performance, learn how to tune cache settings in the sections below. In particular, the client (mount point) initiating file modifications enjoys stronger consistency; read [consistency exceptions](#consistency-exceptions) for more.

As for object storage, JuiceFS clients split files into data blocks (4MiB by default), each assigned a unique ID and uploaded to object storage. Subsequent modifications to a file are carried out on new data blocks, and the original blocks remain unchanged. This guarantees consistency of the object storage data: once a file is modified, clients will read from the new data blocks, while the stale ones are deleted through [Trash](../security/trash.md) or compaction.

@@ -45,16 +45,6 @@ JuiceFS Client controls these kinds of metadata as kernel cache: attribute (file

Caching this metadata in the kernel for 1 second significantly speeds up `lookup` and `getattr` calls.

Do note that the `entry` cache is built gradually as files are accessed and may not contain a complete list of the files under a directory, so `readdir` calls or the `ls` command cannot utilize this cache; that's why `entry` cache only improves `lookup` performance. The meaning of `direntry` here is different from the [kernel directory entry](https://www.kernel.org/doc/html/latest/filesystems/ext4/directory.html): `direntry` does not tell you the files under a directory; rather, it's the same concept as `entry`, just distinguished by whether the item is a directory.

Real-world scenarios scarcely require setting different values for `--entry-cache` and `--dir-entry-cache`; these options exist for theoretical possibilities, such as directories that seldom change while files change a lot. In that situation, you can use a higher `--dir-entry-cache` than `--entry-cache`, as sketched below.
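
For illustration only, a mount command along these lines tunes the kernel metadata cache durations; `<META-URL>`, the mount point `/jfs`, and the cache durations are placeholder assumptions to be adjusted for your workload:

```shell
# Sketch: cache directory entries longer than file entries, on the assumption
# that directories rarely change while files change frequently.
# <META-URL> and /jfs are placeholders; durations are in seconds.
juicefs mount --entry-cache=1 --dir-entry-cache=10 <META-URL> /jfs
```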
@@ -71,6 +61,64 @@ With `--open-cache` enabled, JuiceFS no longer operates under close-to-open cons

In comparison, JuiceFS Enterprise Edition provides richer functionalities around memory metadata cache (supports active invalidation). Read [Enterprise Edition documentation](https://juicefs.com/docs/cloud/guide/cache/#client-memory-metadata-cache) for more.

### Consistency exceptions {#consistency-exceptions}

The metadata cache behavior discussed above really only pertains to multi-client situations and can be deemed a "minimum consistency guarantee". For the client initiating file changes, it's not hard to imagine that, since the changes happen "locally", this client enjoys a higher level of consistency:

* For the mount point initiating changes, kernel cache is automatically invalidated upon modification. But when different mount points access and modify the same file, active kernel cache invalidation is only effective on the client initiating the modifications; other clients can only wait for their cache to expire.
* When a `write` call completes, the mount point itself will immediately see the resulting file length change (e.g. use `ls -al` to verify that the file size is growing). But this doesn't mean the changes have been committed: before `flush` finishes, these modifications are not reflected in the object storage, and other mount points cannot see the latest writes. Calling `fsync`, `fdatasync`, or `close` all results in a `flush` that persists the file changes and makes them visible to other clients.
* As an extreme case of the previous situation, if `write` succeeds and the file size change is observed on the current mount point, but the eventual `flush` fails for some reason (for example, file system usage exceeds the global quota), the previously growing file size will suddenly shrink, say from 10M to 0. This often leads to the misunderstanding that JuiceFS just emptied your files, while in reality the data was never successfully written in the first place; the file size change that's only visible on the current mount point is merely a preview, not an actually committed state.
* The mount point initiating changes has access to file change events, and can use tools like [`fswatch`](https://emcrisostomo.github.io/fswatch/) or [`Watchdog`](https://python-watchdog.readthedocs.io/en/stable). But the scope is limited to the files changed within that mount point itself, i.e. files modified by mount point A cannot be monitored from mount point B.
* Because FUSE doesn't yet support the inotify API, if you'd like to monitor file change events using libraries like [Watchdog](https://python-watchdog.readthedocs.io/en/stable), you can only achieve this via polling (e.g. [`PollingObserver`](https://python-watchdog.readthedocs.io/en/stable/_modules/watchdog/observers/polling.html#PollingObserver)); see the polling sketch after this list.
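
As a minimal sketch of such polling (the path `/jfs/data` is a hypothetical mount path, not from the original docs), `fswatch` can be forced onto its polling monitor so that changes made by other mount points are still detected:

```shell
# Sketch: -m poll_monitor makes fswatch poll instead of relying on inotify,
# which FUSE does not support; -o prints one event count per batch.
# /jfs/data is a placeholder path.
fswatch -m poll_monitor -o /jfs/data | while read count; do
  echo "detected ${count} change(s)"
done
```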

## Read/Write Buffer {#buffer-size}

The read/write buffer is a memory space allocated to the JuiceFS Client, with its size controlled by [`--buffer-size`](../reference/command_reference.md#mount), which defaults to 300 (in MiB). All read and write data passes through this buffer, making it crucial for all I/O operations; that's why under large-scale scenarios, increasing the buffer size is often the first step of optimization.

### Readahead and Prefetch {#readahead-prefetch}

:::tip
To accurately describe the internal mechanisms of the JuiceFS Client, we use the terms "readahead" and "prefetch" to refer to two different behaviors that both download data ahead of time to increase read performance.
:::

When a file is read sequentially, the JuiceFS Client performs "readahead", which downloads data ahead of the current read offset. In fact, similar behavior is already built into the [Linux kernel](https://www.halolinux.us/kernel-architecture/page-cache-readahead.html): when reading files, the kernel dynamically settles on a readahead window based on the actual read behavior, and loads the file into the page cache. Since JuiceFS is a network file system, the classic kernel readahead mechanism alone is not enough to bring the desired performance increase, so JuiceFS performs its own readahead on top of the kernel's, using an algorithm to "guess" the size of the readahead window (more aggressively than the kernel), and then downloads the object storage data in advance.

![readahead](../images/buffer-readahead.svg)

Readahead only helps sequential reads, which is why there's another similar mechanism called "prefetch": when a block is randomly read within a small offset range, the whole block is scheduled for download asynchronously.

![prefetch](../images/buffer-prefetch.svg)

This mechanism assumes that if a file is randomly read at a given range, its adjacent content is also likely to be read momentarily. This isn't necessarily true for all types of applications: for example, an application may read a huge file in a very sparse fashion, i.e. with read offsets far from each other. In such cases, prefetch isn't really useful and can cause serious read amplification, so if you are already familiar with the file system access pattern of your application and have concluded that prefetch isn't needed, you can disable it using [`--prefetch=0`](../reference/command_reference.md#mount-data-cache-options).
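
For example (a sketch only; `<META-URL>` and the mount point `/jfs` are placeholders), prefetch can be turned off at mount time:

```shell
# Sketch: disable prefetch for sparse random-read workloads to avoid
# read amplification; <META-URL> and /jfs are placeholders.
juicefs mount --prefetch=0 <META-URL> /jfs
```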

Readahead and prefetch effectively increase sequential read and random read performance respectively, but they also come with read amplification; read ["Read amplification"](../administration/troubleshooting.md#read-amplification) for more information.

### Write {#buffer-write}

A successful `write` does not mean data is persisted: that's actually `flush`'s job. This is true both for local file systems and for JuiceFS. In JuiceFS, `write` only commits changes to the buffer; from the writing mount point's point of view, you may notice that the file size is changing, but do not mistake this for persistence (this behavior is also covered in detail in [consistency exceptions](#consistency-exceptions)). To sum up, before `flush` actually finishes, changes are only kept inside the client buffer. Applications may explicitly invoke `flush`, but even without this, `flush` is triggered automatically when a pending slice's size exceeds its chunk boundary, or when it has waited in the buffer for a certain amount of time.
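
As a minimal illustration (the output path below is hypothetical), a writer can force persistence before it exits by requesting an `fsync`, which triggers `flush` and makes the data visible to other mount points:

```shell
# Sketch: conv=fsync makes dd call fsync before exiting, so the data is
# flushed rather than merely sitting in the client buffer.
# /jfs/dump.bin is a placeholder path.
dd if=/dev/zero of=/jfs/dump.bin bs=1M count=100 conv=fsync
```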

Together with the previously introduced readahead mechanism, the buffer's function can be described by the diagram below:

![read write buffer](../images/buffer-read-write.svg)

The buffer is shared by reads and writes, with writes obviously treated with higher priority. This implies the possibility of writes getting in the way of reads: for instance, if object storage bandwidth isn't enough to support the write load, there will be congestion:

![buffer congestion](../images/buffer-congestion.svg)

As illustrated above, a high write load puts too many pending slices inside the buffer, leaving little buffer space for readahead, so file reads slow down. Due to the low upload speed, writes may also fail due to `flush` timeouts.

### Observation and Optimization

The buffer is crucial to both reads and writes, as introduced in the sections above, making `--buffer-size` the first optimization target when faced with large-scale scenarios. But simply increasing the buffer size is not enough and might cause other problems (like the buffer congestion illustrated in the above section); the buffer size should be decided together with other performance options.

Before making any adjustments, we recommend running the [`juicefs stats`](../administration/fault_diagnosis_and_analysis.md#stats) command to check current buffer usage, and reading the content below to guide your tuning.
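
For example (assuming the file system is mounted at `/jfs`), the following prints live per-second client metrics, with the current buffer usage shown in the `buf` column:

```shell
# Sketch: watch real-time client metrics; /jfs is a placeholder mount point.
juicefs stats /jfs
```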

If you wish to improve write speed and have already increased [`--max-uploads`](../reference/command_reference.md#mount-data-cache-options) for more upload concurrency, yet see no noticeable increase in upload traffic, consider also increasing `--buffer-size` so that concurrent threads can more easily allocate memory for data uploads. This also works in the opposite direction: if tuning up `--buffer-size` didn't bring an increase in upload traffic, you should probably increase `--max-uploads` as well.
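
A hedged sketch of tuning both options together; `<META-URL>`, the mount point, and the values are placeholders to be adjusted against `juicefs stats` output:

```shell
# Sketch: raise upload concurrency and buffer size together for a
# write-heavy workload; values are illustrative only.
juicefs mount --max-uploads=50 --buffer-size=1024 <META-URL> /jfs
```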

`--buffer-size` also dictates the data upload size for each `flush` operation, which means that for clients working in a low-bandwidth environment, you may need to use a lower `--buffer-size` to avoid `flush` timeouts. Refer to ["Connection problems with object storage"](../administration/troubleshooting.md#io-error-object-storage) for troubleshooting under low internet speed.

If you wish to improve sequential read speed, use a larger `--buffer-size` to expand the readahead window; all data blocks within the window are fetched concurrently from object storage. Also keep in mind that reading a single large file will never consume the full buffer: the space reserved for readahead is between 1/4 and 1/2 of the total buffer size. So if you notice that `juicefs stats` shows `buf` already half full while performing a sequential read on a single large file, it's time to increase `--buffer-size` for a larger readahead window.

## Data cache {#data-cache}

To improve performance, JuiceFS also provides various caching mechanisms for data, including the page cache in the kernel, the local file system cache on the client host, and the read/write buffer in the client process itself. Read requests try the kernel page cache, the client process buffer, and the local disk cache in turn. If the requested data is not found at any level of the cache, it is read from object storage, and also written into every level of the cache asynchronously to improve the performance of the next access.
28 changes: 8 additions & 20 deletions docs/en/guide/clone.md
@@ -3,41 +3,29 @@ title: Clone Files or Directories
sidebar_position: 6
---

## Basic usage of `clone` <VersionAdd>1.1</VersionAdd> {#basic-usage-of-clone}
This command makes a clone of your data by creating a mere metadata copy, without creating any new data in the object storage; thus cloning is very fast regardless of the size of the target file or directory. Under JuiceFS, this command is a better alternative to `cp`; moreover, for Linux clients using kernels with [`copy_file_range`](https://man7.org/linux/man-pages/man2/copy_file_range.2.html) support, the `cp` command achieves the same result as `juicefs clone`.

The JuiceFS client provides the `clone` command to quickly clone directories or files within a single JuiceFS mount point. The cloning process involves copying only the metadata without copying the data blocks, making it extremely fast.
![clone](../images/juicefs-clone.svg)

The command format for `clone` is as follows:
The clone result is a metadata copy: all the files still reference the same object storage blocks, which is why a clone behaves the same in every way as its original. When either of them goes through actual file data modification, the affected data blocks are copied on write and become new blocks, while the unchanged part of the files remains the same, still referencing the original blocks.

**Clones take up both file system storage space and metadata engine storage space**; pay special attention when cloning large directories.

```shell
juicefs clone <SRC PATH> <DST PATH>
juicefs clone SRC DST

# Clone a file
juicefs clone /mnt/jfs/file1 /mnt/jfs/file2

# Clone a directory
juicefs clone /mnt/jfs/dir1 /mnt/jfs/dir2

# Clone with preserving the UID, GID, and mode of the file
juicefs clone -p /mnt/jfs/file1 /mnt/jfs/file2
```

- `<SRC PATH>`: The source path, which can be a file or directory.
- `<DST PATH>`: The destination path, which can be a file or directory.

## Preserve source's UID, GID, and mode {#preserve-source-uid-gid-mode}

The `--preserve, -p` option is provided to preserve the UID, GID, and mode attributes of the source during cloning. By default, the current user's UID and GID are used. The mode is recalculated based on the current user's umask.

## Consistency guarantee of the `clone` {#consistency-guarantee-of-clone}
## Consistency {#consistency}

The `clone` subcommand provides consistency guarantees as follows:

- For files: The `clone` command ensures atomicity, meaning that the cloned file will always be in a correct and consistent state.
- For directories: The `clone` command does not guarantee atomicity for directories. In other words, if the source directory changes during the cloning process, the target directory may differ from the source directory.

## Other considerations for `clone` {#other-considerations-for-clone}

1. The destination directory is not visible until the `clone` command is completed.
2. If metadata redundancy occurs due to a failed `clone` command, it can be cleaned up using the `juicefs gc --delete` command.
3. Both the source and destination of the `clone` command must be located under the same JuiceFS mount point.
However, the destination directory is not visible until the `clone` command is completed; if the command didn't manage to finish properly, there could be a metadata leak, which you should clean up using the [`juicefs gc --delete`](../reference/command_reference.md#gc) command.
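
A sketch of that cleanup; `<META-URL>` is a placeholder for your metadata engine address:

```shell
# Sketch: scan for leaked objects and delete them after a failed clone;
# <META-URL> is a placeholder.
juicefs gc --delete <META-URL>
```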
