From e2fabcc1c63bad2b2854a7642b671f1f16b41c4d Mon Sep 17 00:00:00 2001
From: timfeirg
Date: Wed, 29 Mar 2023 16:31:17 +0800
Subject: [PATCH] consolidate content regarding profile & stats (#3392)

* move knowledge to fault_diagnosis_and_analysis.md, which contains all
troubleshooting methods and guides.

* add docs on how to interpret stats output

---------

Co-authored-by: Changjian Gao
---
 .pre-commit-config.yaml                       |   2 +-
 README.md                                     |  11 +-
 README_CN.md                                  |  11 +-
 cmd/profile.go                                |   8 +-
 cmd/stats.go                                  |   2 +-
 .../fault_diagnosis_and_analysis.md           | 118 ++++++++++++++----
 docs/en/administration/monitoring.md          |  10 +-
 docs/en/benchmark/benchmark.md                |  11 +-
 docs/en/benchmark/operations_profiling.md     |  56 ---------
 .../benchmark/performance_evaluation_guide.md |  63 ++++------
 docs/en/benchmark/stats_watcher.md            |  36 ------
 docs/en/reference/command_reference.md        |   8 +-
 .../fault_diagnosis_and_analysis.md           |  98 ++++++++++++---
 docs/zh_cn/administration/monitoring.md       |  10 +-
 docs/zh_cn/benchmark/benchmark.md             |  14 +--
 docs/zh_cn/benchmark/operations_profiling.md  |  60 ---------
 .../benchmark/performance_evaluation_guide.md |  65 ++++------
 docs/zh_cn/benchmark/stats_watcher.md         |  36 ------
 docs/zh_cn/faq.md                             |   2 +-
 docs/zh_cn/reference/command_reference.md     |   8 +-
 20 files changed, 252 insertions(+), 377 deletions(-)
 delete mode 100644 docs/en/benchmark/operations_profiling.md
 delete mode 100644 docs/en/benchmark/stats_watcher.md
 delete mode 100644 docs/zh_cn/benchmark/operations_profiling.md
 delete mode 100644 docs/zh_cn/benchmark/stats_watcher.md

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index d64bddeacd74..8fb117ae1143 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -7,6 +7,6 @@ repos:
   - id: end-of-file-fixer
   - id: trailing-whitespace
 - repo: https://github.com/golangci/golangci-lint
-  rev: v1.33.0
+  rev: v1.52.2
   hooks:
   - id: golangci-lint

diff --git a/README.md b/README.md
index ed3ba94b890d..d85e08029255 100644
--- a/README.md
+++ b/README.md
@@ -142,16 +142,7 @@ The result shows that JuiceFS can provide significantly more metadata IOPS than

### Analyze performance

-There is a virtual file called `.accesslog` in the root of JuiceFS to show all the details of file system operations and the time they take, for example:
-
-```bash
-$ cat /jfs/.accesslog
-2021.01.15 08:26:11.003330 [uid:0,gid:0,pid:4403] write (17669,8666,4993160): OK <0.000010>
-2021.01.15 08:26:11.003473 [uid:0,gid:0,pid:4403] write (17675,198,997439): OK <0.000014>
-2021.01.15 08:26:11.003616 [uid:0,gid:0,pid:4403] write (17666,390,951582): OK <0.000006>
-```
-
-The last number on each line is the time (in seconds) that the current operation takes. You can directly use this to debug and analyze performance issues, or try `juicefs profile /jfs` to monitor real time statistics. Please run `juicefs profile -h` or refer to [here](https://juicefs.com/docs/community/operations_profiling) to learn more about this subcommand.
+See [Real-Time Performance Monitoring](https://juicefs.com/docs/community/fault_diagnosis_and_analysis#performance-monitor) if you encounter performance issues.
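For a quick first look, the profiler can also be pointed directly at the mount point; `/jfs` below is the mount path used in the examples above, adjust it to your own:

```bash
# Show live, aggregated statistics of recent file system operations
juicefs profile /jfs
```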
## Supported Object Storage diff --git a/README_CN.md b/README_CN.md index 6b8a2646c822..81fb4e28b7fb 100644 --- a/README_CN.md +++ b/README_CN.md @@ -144,16 +144,7 @@ JuiceFS 提供一个性能测试的子命令来帮助你了解它在你的环境 ### 性能分析 -在文件系统的根目录有一个叫做 `.accesslog` 的虚拟文件,它提供了所有文件系统操作的细节,以及所消耗的时间,比如: - -```bash -$ cat /jfs/.accesslog -2021.01.15 08:26:11.003330 [uid:0,gid:0,pid:4403] write (17669,8666,4993160): OK <0.000010> -2021.01.15 08:26:11.003473 [uid:0,gid:0,pid:4403] write (17675,198,997439): OK <0.000014> -2021.01.15 08:26:11.003616 [uid:0,gid:0,pid:4403] write (17666,390,951582): OK <0.000006> -``` - -每一行的最后一个数字是该操作所消耗的时间,单位是秒。你可以直接利用它来分析各种性能问题,或者尝试 `juicefs profile /jfs` 命令实时监控统计信息。欲进一步了解此子命令请运行 `juicefs profile -h` 或参阅[这里](https://juicefs.com/docs/zh/community/operations_profiling)。 +如遇性能问题,查看[「实时性能监控」](https://juicefs.com/docs/zh/community/fault_diagnosis_and_analysis#performance-monitor)。 ## 支持的对象存储 diff --git a/cmd/profile.go b/cmd/profile.go index 3f74a0143c46..857c21b6e1cd 100644 --- a/cmd/profile.go +++ b/cmd/profile.go @@ -47,14 +47,14 @@ Examples: $ juicefs profile /mnt/jfs # Replay an access log -$ cat /mnt/jfs/.accesslog > /tmp/jfs.alog +$ cat /mnt/jfs/.accesslog > /tmp/juicefs.accesslog # Press Ctrl-C to stop the "cat" command after some time -$ juicefs profile /tmp/jfs.alog +$ juicefs profile /tmp/juicefs.accesslog # Analyze an access log and print the total statistics immediately -$ juicefs profile /tmp/jfs.alog --interval 0 +$ juicefs profile /tmp/juicefs.accesslog --interval 0 -Details: https://juicefs.com/docs/community/operations_profiling`, +Details: https://juicefs.com/docs/community/fault_diagnosis_and_analysis#profile`, Flags: []cli.Flag{ &cli.StringFlag{ Name: "uid", diff --git a/cmd/stats.go b/cmd/stats.go index 38d9161f2087..965a7b718463 100644 --- a/cmd/stats.go +++ b/cmd/stats.go @@ -45,7 +45,7 @@ $ juicefs stats /mnt/jfs # More metrics $ juicefs stats /mnt/jfs -l 1 -Details: https://juicefs.com/docs/community/stats_watcher`, +Details: https://juicefs.com/docs/community/fault_diagnosis_and_analysis#stats`, Flags: []cli.Flag{ &cli.StringFlag{ Name: "schema", diff --git a/docs/en/administration/fault_diagnosis_and_analysis.md b/docs/en/administration/fault_diagnosis_and_analysis.md index ff53a5be5a3b..2a6eae14f7a2 100644 --- a/docs/en/administration/fault_diagnosis_and_analysis.md +++ b/docs/en/administration/fault_diagnosis_and_analysis.md @@ -2,13 +2,13 @@ title: Troubleshooting Methods sidebar_position: 5 slug: /fault_diagnosis_and_analysis -description: This article describes how to view and interpret logs in various operating systems for JuiceFS FUSE, CSI Driver, Hadoop Java SDK S3 gateway, S3 gateway clients. +description: This article introduces troubleshooting methods for JuiceFS mount point, CSI Driver, Hadoop Java SDK, S3 Gateway, and other clients. --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -## Client log +## Client log {#client-log} JuiceFS client will output logs for troubleshooting while running. The level of logs in terms of fatality follows DEBUG < INFO < WARNING < ERROR < FATAL. Since DEBUG logs are not printed by default, you need to explicitly enable it if needed, e.g. by adding the `--debug` option when running the JuiceFS client. @@ -90,7 +90,7 @@ The meaning of each column is: - `OK`: Indicate the current operation is successful or not. If it is unsuccessful, specific failure information will be output. - `<0.000010>`: The time (in seconds) that the current operation takes. 
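Since every entry ends with the elapsed time in angle brackets, a rough ad-hoc aggregation is also possible with standard shell tools before reaching for `juicefs profile`. A minimal sketch, assuming a log captured to `/tmp/juicefs.accesslog` in the format shown above:

```bash
# Count operations and average their latency from a captured access log;
# $4 is the operation name, the last field is the elapsed time in <seconds>.
awk '{op=$4; t=$NF; gsub(/[<>]/, "", t); sum[op]+=t; cnt[op]++}
     END {for (o in cnt) printf "%-10s %8d ops  %.6f s avg\n", o, cnt[o], sum[o]/cnt[o]}' /tmp/juicefs.accesslog
```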
-You can debug and analyze performance issues with access log, or try using `juicefs profile ` to see real-time statistics. Run `juicefs profile -h` or refer to [Operations Profiling](../benchmark/operations_profiling.md) for details.
+Access logs tend to get very large and are difficult for humans to process directly; use [`juicefs profile`](#profile) to quickly visualize performance data based on these logs.

Different JuiceFS clients obtain the access log in different ways, which are described below.

@@ -114,7 +114,7 @@ Please refer to [CSI Driver documentation](https://juicefs.com/docs/csi/troubles

```bash
kubectl -n kube-system exec juicefs-chaos-k8s-002-pvc-d4b8fb4f-2c0b-48e8-a2dc-530799435373 -- cat /jfs/pvc-d4b8fb4f-2c0b-48e8-a2dc-530799435373/.accesslog
-````
+```

### S3 Gateway

You need to add the [`--access-log` option](../reference/command_reference.md#juicefs-gateway) when starting the S3 gateway to specify the path of the access log output, and the access log is not output by default.

### Hadoop Java SDK

You need to add the `juicefs.access-log` configuration item in the [client configurations](../deployment/hadoop_java_sdk.md#other-configurations) of the JuiceFS Hadoop Java SDK to specify the path of the access log output, and the access log is not output by default.

-## Runtime information
+## Real-time performance monitoring {#performance-monitor}
+
+JuiceFS provides the `profile` and `stats` subcommands to visualize real-time performance data: the `profile` command is based on the [file system access log](#access-log), while the `stats` command reads the client's [real-time monitoring data](../administration/monitoring.md).
+
+### `juicefs profile` {#profile}
+
+[`juicefs profile`](../reference/command_reference.md#profile) collects data from the [file system access log](#access-log). Run the `juicefs profile MOUNTPOINT` command to see real-time statistics of each file system operation, aggregated from the latest access log:
+
+![](../images/juicefs-profiling.gif)
+
+Apart from real-time mode, this command also provides a play-back mode, which performs the same visualization on existing access log files:
+
+```shell
+# Collect access logs in advance
+cat /jfs/.accesslog > /tmp/juicefs.accesslog
+
+# After the performance issue is reproduced, re-play this log file to find the system bottleneck
+juicefs profile /tmp/juicefs.accesslog
+```
+
+If the replay is too fast, press Enter/Return to pause it, and press again to continue. If it is too slow, use `--interval 0` to replay the whole log file as fast as possible and directly show the final result.
+
+If you're only interested in a certain user or process, you can set filters:
+
+```bash
+juicefs profile /tmp/juicefs.accesslog --uid 12345
+```
+
+### `juicefs stats` {#stats}
+
+The [`juicefs stats`](../reference/command_reference.md#stats) command reads JuiceFS Client internal metrics data, and outputs performance data in a format similar to `dstat`:
+
+![](../images/juicefs_stats_watcher.png)
+
+Metrics description:
+
+#### `usage`
+
+- `cpu`: CPU usage of the process.
+- `mem`: Physical memory used by the process.
+- `buf`: Current [buffer size](../guide/cache_management.md#buffer-size). If this value constantly approaches (or even exceeds) the configured [`--buffer-size`](../reference/command_reference.md#mount), you should increase the buffer size or decrease the application workload.
+- `cache`: Internal metric, ignore this.
+
+#### `fuse`
+
+- `ops`/`lat`: Operations processed by FUSE per second, and their average latency (in milliseconds).
+- `read`/`write`: Read/write bandwidth usage of FUSE.
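Each of the output sections (`usage` and `fuse` above, `meta`, `blockcache` and `object` below) maps to one letter of the `--schema` option of `juicefs stats`, so the display can be narrowed to the parts you care about, and `--verbosity 1` adds more detailed columns; see `juicefs stats -h` for the exact flags. For example:

```bash
# Watch only the usage, FUSE and object storage sections, with extra detail
juicefs stats /mnt/jfs --schema ufo --verbosity 1
```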
+
+#### `meta`
+
+- `ops`/`lat`: Metadata operations processed per second, and their average latency (in milliseconds). Note that operations served directly from cache are not counted, in order to show the actual latency of clients interacting with the metadata engine.
+- `txn`/`lat`: Write transactions per second processed by the metadata engine and their average latency (in milliseconds). Read-only requests such as `getattr` are only counted as `ops` but not `txn`.
+- `retry`: Write transactions per second that the metadata engine retries.
+
+#### `blockcache`
+
+The `blockcache` section stands for the local data cache. If read requests are already served by the kernel page cache, they won't be counted in the `blockcache` read metric. Therefore, if there is consistent `blockcache` read traffic while you are repeatedly reading the same file, it means the reads never enter the page cache, and you should troubleshoot in this direction (e.g. not enough memory).
+
+- `read`/`write`: Read/write bandwidth of the client's local data cache.
+
+#### `object`
+
+The `object` section stands for object storage related metrics. When cache is enabled, reads that penetrate to object storage significantly hurt read performance; use these metrics to check whether data has been fully cached. On the other hand, you can also compare `object.get` and `fuse.read` traffic to get a rough idea of the current [read amplification](./troubleshooting.md#read-amplification) status.
+
+- `get`/`get_c`/`lat`: Bandwidth, requests per second, and their average latency (in milliseconds) for object storage processing read requests.
+- `put`/`put_c`/`lat`: Bandwidth, requests per second, and their average latency (in milliseconds) for object storage processing write requests.
+- `del_c`/`lat`: Delete requests per second the object storage can process, and the average latency (in milliseconds).
+
+## Get runtime information using pprof {#runtime-information}

By default, JuiceFS clients will listen to a TCP port locally via [pprof](https://pkg.go.dev/net/http/pprof) to get runtime information such as Goroutine stack information, CPU performance statistics, memory allocation statistics. You can see the specific port number that the current JuiceFS client is listening on by using the system command (e.g. `lsof`):

-:::note
+:::tip
If you mount JuiceFS as the root user, you need to add `sudo` before the `lsof` command.
:::

```bash
lsof -i -nP | grep LISTEN | grep juicefs
```

-```output
-juicefs 32666 user 8u IPv4 0x44992f0610d9870b 0t0 TCP 127.0.0.1:6061 (LISTEN)
-juicefs 32666 user 9u IPv4 0x44992f0619bf91cb 0t0 TCP 127.0.0.1:6071 (LISTEN)
-juicefs 32666 user 15u IPv4 0x44992f062886fc5b 0t0 TCP 127.0.0.1:9567 (LISTEN)
+```shell
+# pprof listen port
+juicefs 19371 user 6u IPv4 0xa2f1748ad05b5427 0t0 TCP 127.0.0.1:6061 (LISTEN)
+
+# Prometheus API listen port
+juicefs 19371 user 11u IPv4 0xa2f1748ad05cbde7 0t0 TCP 127.0.0.1:9567 (LISTEN)
```

By default, pprof listens on port numbers ranging from 6060 to 6099.
That's why the actual port number in the above example is 6061. Once you get the listening port number, you can view all the available runtime information by accessing `http://localhost:<port>/debug/pprof`, and some important runtime information will be shown as follows:

- Goroutine stack information: `http://localhost:<port>/debug/pprof/goroutine?debug=1`
- CPU performance statistics: `http://localhost:<port>/debug/pprof/profile?seconds=30`
- Memory allocation statistics: `http://localhost:<port>/debug/pprof/heap`
-
-:::tip
-You can also use the debug command to automatically collect these runtime information and save it locally. By default, it is saved to the debug directory under the current directory, for example:
+To make it easier to analyze this runtime information, you can save it locally, e.g.:

```bash
-juicefs debug /mnt/jfs
+curl 'http://localhost:<port>/debug/pprof/goroutine?debug=1' > juicefs.goroutine.txt
```

-For more information about the debug command, see [command reference](https://juicefs.com/docs/community/command_reference#juicefs-debug)
-:::
-
-To make it easier to analyze this runtime information, you can save it locally, e.g.:

```bash
-curl 'http://localhost:<port>/debug/pprof/goroutine?debug=1' > juicefs.goroutine.txt
+curl 'http://localhost:<port>/debug/pprof/profile?seconds=30' > juicefs.cpu.pb.gz
```

```bash
-$ curl 'http://localhost:<port>/debug/pprof/profile?seconds=30' > juicefs.cpu.pb.gz
+curl 'http://localhost:<port>/debug/pprof/heap' > juicefs.heap.pb.gz
```

```bash
-$ curl 'http://localhost:<port>/debug/pprof/heap' > juicefs.heap.pb.gz
+```
+
+:::tip
+You can also use the `juicefs debug` command to automatically collect these runtime information and save it locally. By default, it is saved to the `debug` directory under the current directory, for example:

```bash
juicefs debug /mnt/jfs
```

+For more information about the `juicefs debug` command, see [command reference](../reference/command_reference.md#debug).
+:::
+
If you have the `go` command installed, you can analyze it directly with the `go tool pprof` command. For example to analyze CPU performance statistics:

```bash
go tool pprof 'http://localhost:<port>/debug/pprof/profile'
```

@@ -209,9 +279,9 @@ The export to visual chart function relies on [Graphviz](https://graphviz.org),

```bash
go tool pprof -pdf 'http://localhost:<port>/debug/pprof/heap' > juicefs.heap.pdf
```

-For more information about pprof, please see the [official documentation](https://github.com/google/pprof/blob/master/doc/README.md).
+For more information about pprof, please see the [official documentation](https://github.com/google/pprof/blob/main/doc/README.md).

-### Profiling with the Pyroscope
+### Profiling with the Pyroscope {#use-pyroscope}

![Pyroscope](../images/pyroscope.png)

diff --git a/docs/en/administration/monitoring.md b/docs/en/administration/monitoring.md
index d4bb9de0a352..6a7004a21fd7 100644
--- a/docs/en/administration/monitoring.md
+++ b/docs/en/administration/monitoring.md
@@ -6,7 +6,7 @@ description: This article describes how to visualize JuiceFS status monitoring w

As a distributed file system hosting massive data storage, it is important for users to directly view the status changes of the entire system in terms of capacity, files, CPU load, disk IO, cache, etc. JuiceFS provides real-time status data externally through the Prometheus-oriented API to achieve the visualization of JuiceFS monitoring with ease, and you only need to expose it to your own Prometheus Server to visualize time series data with tools like Grafana.
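Before wiring anything into Prometheus, it helps to confirm that a mounted client is serving metrics at all. The endpoint returns plain Prometheus text; 9567 is the default port for mount points and can be changed with the `--metrics` option described below:

```bash
# Spot-check a few metric values from a locally mounted client
curl -s http://localhost:9567/metrics | grep -v '^#' | head
```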
-## Get started +## Getting started {#getting-started} It is assumed here that Prometheus Server, Grafana and JuiceFS clients are all running on the same host, in which @@ -69,11 +69,11 @@ Then, create a dashboard using [`grafana_template.json`](https://github.com/juic ![](../images/grafana-dashboard.jpg) -## Collecting monitoring metrics +## Collecting monitoring metrics {#collecting-metrics} There are different ways to collect monitoring metrics depending on how JuiceFS is deployed, which are described below. -### Mount point +### Mount point {#mount-point} When the JuiceFS file system is mounted via the [`juicefs mount`](../reference/command_reference.md#mount) command, you can collect monitoring metrics via the address `http://localhost:9567/metrics`, or you can customize it via the `--metrics` option. For example: @@ -274,7 +274,7 @@ For each instance registered to Consul, its `serviceName` is `juicefs`, and the The meta of each instance contains two aspects: `hostname` and `mountpoint`. When `mountpoint` is `s3gateway`, it means that the instance is an S3 gateway. -## Visualize monitoring metrics +## Visualize monitoring metrics {#visualize-metrics} ### Grafana dashboard template @@ -289,6 +289,6 @@ A sample Grafana dashboard looks like this: ![JuiceFS Grafana dashboard](../images/grafana_dashboard.png) -## Monitoring metrics reference +## Monitoring metrics reference {#metrics-reference} Please refer to the ["JuiceFS Metrics"](../reference/p8s_metrics.md) document. diff --git a/docs/en/benchmark/benchmark.md b/docs/en/benchmark/benchmark.md index c7788bf270bb..8e2eab1142fc 100644 --- a/docs/en/benchmark/benchmark.md +++ b/docs/en/benchmark/benchmark.md @@ -31,13 +31,4 @@ It shows JuiceFS can provide significantly more metadata IOPS than the other two ## Analyze performance -There is a virtual file called `.accesslog` in the root of JuiceFS to show all the operations and the time they takes, for example: - -``` -$ cat /jfs/.accesslog -2021.01.15 08:26:11.003330 [uid:0,gid:0,pid:4403] write (17669,8666,4993160): OK <0.000010> -2021.01.15 08:26:11.003473 [uid:0,gid:0,pid:4403] write (17675,198,997439): OK <0.000014> -2021.01.15 08:26:11.003616 [uid:0,gid:0,pid:4403] write (17666,390,951582): OK <0.000006> -``` - -The last number in each line is the time (in seconds) the current operation takes. You can use this directly to debug and analyze performance issues, or try `./juicefs profile /jfs` to monitor real time statistics. Please run `./juicefs profile -h` or refer [here](../benchmark/operations_profiling.md) to learn more about this subcommand. +See [Real-Time Performance Monitoring](../administration/fault_diagnosis_and_analysis.md#performance-monitor) if you encounter performance issues. diff --git a/docs/en/benchmark/operations_profiling.md b/docs/en/benchmark/operations_profiling.md deleted file mode 100644 index 9b11c3f6a961..000000000000 --- a/docs/en/benchmark/operations_profiling.md +++ /dev/null @@ -1,56 +0,0 @@ ---- -title: Operations Profiling -sidebar_position: 3 -slug: /operations_profiling -description: JuiceFS profile is to aggregate all logs in the past interval and display statistics periodically, includes real time and replay modes. ---- - -## Introduction - -JuiceFS has a special virtual file named [`.accesslog`](../administration/fault_diagnosis_and_analysis.md#access-log) to track every operation occurred within its client. 
This file may generate thousands of log entries per second when under pressure, making it hard to find out what is actually going on at a certain time. Thus, we made a simple tool called [`juicefs profile`](../reference/command_reference.md#juicefs-profile) to show an overview of recently completed operations. The basic idea is to aggregate all logs in the past interval and display statistics periodically, like:
-
-![JuiceFS-profiling](../images/juicefs-profiling.gif)
-
-## Profiling Modes
-
-For now there are 2 modes of profiling: real time and replay.
-
-### Real Time Mode
-
-By executing the following command you can watch real time operations under the mount point:
-
-```bash
-juicefs profile MOUNTPOINT
-```
-
-> **Tip**: The result is sorted in a descending order by total time.
-
-### Replay Mode
-
-Running the `profile` command on an existing log file enables the **replay mode**:
-
-```bash
-juicefs profile LOGFILE
-```
-
-When debugging or analyzing performance issues, it is usually more practical to record access log first and then replay it (multiple times). For example:
-
-```bash
-cat /jfs/.accesslog > /tmp/jfs-oplog
-# later
-juicefs profile /tmp/jfs-oplog
-```
-
-> **Tip 1**: The replay could be paused anytime by Enter/Return, and continued by pressing it again.
->
-> **Tip 2**: Setting `--interval 0` will replay the whole log file as fast as possible, and show the result as if it was within one interval.
-
-## Filter
-
-Sometimes we are only interested in a certain user or process, then we can filter others out by specifying IDs, e.g:
-
-```bash
-juicefs profile /tmp/jfs-oplog --uid 12345
-```
-
-For more information, please run `juicefs profile -h`.
diff --git a/docs/en/benchmark/performance_evaluation_guide.md b/docs/en/benchmark/performance_evaluation_guide.md
index 03f5ad42798c..302a533ee262 100644
--- a/docs/en/benchmark/performance_evaluation_guide.md
+++ b/docs/en/benchmark/performance_evaluation_guide.md
@@ -32,9 +32,9 @@ An example of the basic usage of the JuiceFS built-in `bench` tool is shown belo

JuiceFS v1.0+ has Trash enabled by default, which means the benchmark tools will create and delete temporary files in the file system. These files will eventually be dumped to the `.trash` folder which consumes storage space. To avoid this, you can disable the Trash before benchmarking by running `juicefs config META-URL --trash-days 0`. See [trash](../security/trash.md) for details.

-### JuiceFS Bench
+### `juicefs bench`

-The JuiceFS [`bench`](../reference/command_reference.md#juicefs-bench) command can help you do a quick performance test on a standalone machine. With the test results, it is easy to evaluate if your environment configuration and JuiceFS performance are normal. Assuming you have mounted JuiceFS to `/mnt/jfs` on your server, execute the following command for this test (the `-p` option is recommended to set to the number of CPU cores on the server). If you need help with initializing or mounting JuiceFS, please refer to the [Quick Start Guide](../getting-started/README.md))
+The [`juicefs bench`](../reference/command_reference.md#bench) command can help you do a quick performance test on a standalone machine. With the test results, it is easy to evaluate if your environment configuration and JuiceFS performance are normal. Assuming you have mounted JuiceFS to `/mnt/jfs` on your server, execute the following command for this test (it is recommended to set the `-p` option to the number of CPU cores on the server).
If you need help with initializing or mounting JuiceFS, please refer to the [Quick Start Guide](../getting-started/README.md).

```bash
juicefs bench /mnt/jfs -p 4
```

The test results will show each performance indicator in green, yellow or red. I

![bench](../images/bench-guide-bench.png)

-The detailed JuiceFS `bench` performance test flows are shown below (The logic behind is very simple. Please take a look at the [source code](https://github.com/juicedata/juicefs/blob/main/cmd/bench.go) if you are interested).
+The detailed `juicefs bench` performance test flows are shown below (the logic behind it is very simple; take a look at the [source code](https://github.com/juicedata/juicefs/blob/main/cmd/bench.go) if you are interested).

1. N concurrent `write`, each to a large file of 1 GiB with IO size of 1 MiB
2. N concurrent `read`, each from the large file of 1 GiB previously written, with IO size of 1 MiB
@@ -75,9 +75,9 @@ Prices refer to [AWS US East, Ohio Region](https://aws.amazon.com/ebs/pricing/?n

The data above is from [AWS official documentation](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html), and the performance metrics are their maximum values. The actual performance of EBS is related to its volume capacity and instance type of mounted EC2. In general, the larger the volume and the higher the specification of EC2, the better the EBS performance will be, but not exceeding the maximum value mentioned above.
:::

-### JuiceFS Objbench
+### `juicefs objbench`

-JuiceFS provides the [`objbench`](../reference/command_reference.md#juicefs-objbench) subcommand to run some tests on object storage to evaluate how well it performs as a backend storage for JuiceFS. Take testing Amazon S3 as an example:
+The [`juicefs objbench`](../reference/command_reference.md#objbench) command can run some tests on object storage to evaluate how well it performs as backend storage for JuiceFS. Take testing Amazon S3 as an example:

```bash
juicefs objbench \
@@ -130,9 +130,9 @@ Finally clean up the test files.

The next two performance observation and analysis tools are essential tools for testing, using, and tuning JuiceFS.

-### JuiceFS Stats
+### `juicefs stats`

-JuiceFS `stats` is a tool for real-time statistics of JuiceFS performance metrics, similar to the `dstat` command on Linux systems. It can display changes of metrics for JuiceFS clients in real-time (see [documentation](stats_watcher.md) for details). For this, create a new session and execute the following command when the command `juicefs bench` is running,
+The [`juicefs stats`](../administration/fault_diagnosis_and_analysis.md#stats) command is a tool for real-time statistics of JuiceFS performance metrics, similar to the `dstat` command on Linux systems. It can display changes of metrics for JuiceFS clients in real time. To observe this, create a new session and execute the following command while `juicefs bench` is running:

```bash
juicefs stats /mnt/jfs --verbosity 1
```

The results are shown below, which would be easier to understand when combined with the `bench` performance test flows described above.
-![stats](../images/bench-guide-stats.png)
-
-The meaning of indicators is as follows:
-
-- `usage`
-  - `cpu`: the CPU usage of JuiceFS process
-  - `mem`: the physical memory usage of JuiceFS process
-  - `buf`: internal read/write buffer size of JuiceFS process, limited by mount option `--buffer-size`
-  - `cache`: internal metric, can be simply ignored
-- `fuse`
-  - `ops`/`lat`: requests per second processed by the FUSE interface and their average latency (in milliseconds)
-  - `read`/`write`: bandwidth of the FUSE interface to handle read and write requests per second
-- `meta`
-  - `ops`/`lat`: requests per second processed by the metadata engine and their average latency (in milliseconds). Please note that some requests that can be processed directly in the cache are not included in the statistics, in order to better reflect the time spent by the client interacting with the metadata engine.
-  - `txn`/`lat`: **write transactions** per second processed by the metadata engine and their average latency (in milliseconds). Read-only requests such as `getattr` are only counted as ops but not txn.
-  - `retry`: **write transactions** per second that the metadata engine retries
-- `blockcache`
-  - `read`/`write`: read/write traffic per second for the local data cache of the client
-- `object`
-  - `get`/`get_c`/`lat`: bandwidth, requests per second, and their average latency (in milliseconds) for object storage processing **read requests**
-  - `put`/`put_c`/`lat`: bandwidth, requests per second, and their average latency (in milliseconds) for object storage processing **write requests**
-  - `del_c`/`lat`: **delete requests** per second the object storage can process, and the average latency (in milliseconds)
-
-### JuiceFS Profile
-
-JuiceFS `profile` is used to output all access logs of the JuiceFS client in real time, including information about each request. It can also be used to play back and count JuiceFS access logs, and visualize the JuiceFS running status (see [documentation](operations_profiling.md) for details). To run the JuiceFS profile, execute the following command in another session while the `juicefs bench` command is running.
+![](../images/bench-guide-stats.png)
+
+Learn the meaning of indicators in [`juicefs stats`](../administration/fault_diagnosis_and_analysis.md#stats).
+
+### `juicefs profile`
+
+The [`juicefs profile`](../administration/fault_diagnosis_and_analysis.md#profile) command is used to output all [access logs](../administration/fault_diagnosis_and_analysis.md#access-log) of the JuiceFS client in real time, including information about each request. It can also be used to play back and count JuiceFS access logs, and visualize the JuiceFS running status. To run the JuiceFS profile, execute the following command in another session while the `juicefs bench` command is running.

```bash
-cat /mnt/jfs/.accesslog > access.log
+cat /mnt/jfs/.accesslog > juicefs.accesslog
```

-`.accessslog` is a virtual file for JuiceFS access logs. It does not produce any data until it is read (e.g. by executing `cat`). Press Ctrl-C to terminate the `cat` command and run the following one.
+The `.accesslog` is a virtual file for JuiceFS access logs. It does not produce any data until it is read (e.g. by executing `cat`). Press Ctrl + C to terminate the `cat` command and run the following one.

```bash
-juicefs profile access.log --interval 0
+juicefs profile juicefs.accesslog --interval 0
```

The `--interval` parameter sets the sampling interval for accessing the log.
0 means quickly replay the log file to generate statistics, as shown in the following figure. ![profile](../images/bench-guide-profile.png) -Based on the bench performance test flows as described above, a total of (1 + 100) * 4 = 404 files were created during this test, and each file went through the process of "Create → Write → Close → Open → Read → Close → Delete". So there are a total of: +Based on the bench performance test flows as described above, a total of `(1 + 100) * 4 = 404` files were created during this test, and each file went through the process of "Create → Write → Close → Open → Read → Close → Delete". So there are a total of: -- 404 create, open and unlink requests -- 808 flush requests: flush is automatically invoked whenever a file is closed -- 33168 write/read requests: each large file takes 1024 1 MiB IOs on write, while the maximum size of a request at the FUSE level is 128 KiB by default. It means that each application IO is split into 8 FUSE requests, so there are `(1024 * 8 + 100) * 4 = 33168` requests. The read IOs work in a similar way, and so does its counting. +- 404 `create`, `open` and `unlink` requests +- 808 `flush` requests: `flush` is automatically invoked whenever a file is closed +- 33168 `write`/`read` requests: each large file takes 1024 1 MiB IOs on write, while the maximum size of a request at the FUSE level is 128 KiB by default. It means that each application IO is split into 8 FUSE requests, so there are `(1024 * 8 + 100) * 4 = 33168` requests. The read IOs work in a similar way, and so does its counting. -All these values correspond exactly to the results of `profile`. In addition, the test result shows that the average latency for the `write` operations is extremely low (45 μs). This is because JuiceFS `write` writes to a memory buffer first by default and then calls flush to upload data to the object storage when the file is closed, as expected. +All these values correspond exactly to the results of `profile`. In addition, the test result shows that the average latency for the `write` operations is extremely low (45 μs). This is because JuiceFS `write` writes to a memory buffer first by default and then calls `flush` to upload data to the object storage when the file is closed, as expected. ## Other Test Tool Configuration Examples diff --git a/docs/en/benchmark/stats_watcher.md b/docs/en/benchmark/stats_watcher.md deleted file mode 100644 index 1a58cd8ae6d4..000000000000 --- a/docs/en/benchmark/stats_watcher.md +++ /dev/null @@ -1,36 +0,0 @@ ---- -title: Performance Statistics Monitor -sidebar_position: 4 -slug: /stats_watcher ---- - -JuiceFS exposes a lot of [Prometheus metrics](../administration/monitoring.md) for monitoring system internal performance. However, when diagnosing performance issues in practice, users may need a more real-time monitoring tool to know what is actually going on within a certain time range. Thus, we provide a command `stats` to print metrics every second, just like what the Linux command `dstat` does. The output is like: - -![stats_watcher](../images/juicefs_stats_watcher.png) - -By default, this command will print the following metrics of the JuiceFS process corresponding to the given mount point. 
- -#### `usage` - -- `cpu`: CPU usage of the process -- `mem`: physical memory used by the process -- `buf`: current buffer size of JuiceFS, limited by mount option `--buffer-size` - -#### `fuse` - -- `ops`/`lat`: operations processed by FUSE per second, and their average latency (in milliseconds) -- `read`/`write`: read/write bandwidth usage of FUSE - -#### `meta` - -- `ops`/`lat`: metadata operations processed per second, and their average latency (in milliseconds). Please note that, operations returned directly from cache are not counted in, in order to show a more accurate latency of clients actually interacting with metadata engine. - -#### `blockcache` - -- `read`/`write`: read/write bandwidth of client local data cache - -#### `object` - -- `get`/`put`: Get/Put bandwidth between client and object storage - -Moreover, users can acquire verbose statistics (like read/write ops and the average latency) by setting `--verbosity 1`, or customize displayed metrics by changing `--schema`. For more information, please check `juicefs stats -h`. diff --git a/docs/en/reference/command_reference.md b/docs/en/reference/command_reference.md index 99804cff9dfb..669e300cfa0a 100644 --- a/docs/en/reference/command_reference.md +++ b/docs/en/reference/command_reference.md @@ -769,7 +769,7 @@ $ cd /mnt/jfs $ juicefs info -i 100 ``` -### `juicefs bench` +### `juicefs bench` {#bench} Run benchmark, including read/write/stat for big and small files. @@ -808,7 +808,7 @@ $ juicefs bench /mnt/jfs -p 4 $ juicefs bench /mnt/jfs --big-file-size 0 ``` -### `juicefs objbench` +### `juicefs objbench` {#objbench} Run basic benchmarks on the target object storage to test if it works as expected. @@ -906,7 +906,7 @@ juicefs fsck [command options] META-URL juicefs fsck redis://localhost ``` -### `juicefs profile` +### `juicefs profile` {#profile} Analyze [access log](../administration/fault_diagnosis_and_analysis.md#access-log). 
@@ -1172,7 +1172,7 @@ skip sanity check and force destroy the volume (default: false) juicefs destroy redis://localhost e94d66a8-2339-4abd-b8d8-6812df737892 ``` -### `juicefs debug` +### `juicefs debug` {#debug} It collects and displays information from multiple dimensions such as the operating environment and system logs to help better locate errors diff --git a/docs/zh_cn/administration/fault_diagnosis_and_analysis.md b/docs/zh_cn/administration/fault_diagnosis_and_analysis.md index 9ebbea21c8b3..c9ec85de41d4 100644 --- a/docs/zh_cn/administration/fault_diagnosis_and_analysis.md +++ b/docs/zh_cn/administration/fault_diagnosis_and_analysis.md @@ -2,13 +2,13 @@ title: 问题排查方法 sidebar_position: 5 slug: /fault_diagnosis_and_analysis -description: 本文介绍 JuiceFS FUSE、CSI Driver、Hadoop Java SDK S3 gateway、S3 gateway 等客户端在各类操作系统中的日志获取和解读方法。 +description: 本文介绍 JuiceFS 挂载点、CSI 驱动、Hadoop Java SDK、S3 网关等客户端的问题排查方法。 --- import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -## 客户端日志 +## 客户端日志 {#client-log} JuiceFS 客户端在运行过程中会输出日志用于故障诊断,日志等级从低到高分别是:DEBUG、INFO、WARNING、ERROR、FATAL,默认只输出 INFO 级别以上的日志。如果需要输出 DEBUG 级别的日志,需要在运行 JuiceFS 客户端时显式开启,如加上 `--debug` 选项。 @@ -69,11 +69,11 @@ S3 网关仅支持在前台运行,因此客户端日志会直接输出到终 使用 JuiceFS Hadoop Java SDK 的应用进程(如 Spark executor)的日志中会包含 JuiceFS 客户端日志,因为和应用自身产生的日志混杂在一起,需要通过特定关键词来过滤筛选(如 `juicefs`,注意这里忽略了大小写)。 -## 访问日志 {#access-log} +## 文件系统访问日志 {#access-log} 每个 JuiceFS 客户端都有一个访问日志,其中详细记录了文件系统上的所有操作,如操作类型、用户 ID、用户组 ID、文件 inode 及其花费的时间。访问日志可以有多种用途,如性能分析、审计、故障诊断。 -### 访问日志格式 +### 日志格式 访问日志的示例格式如下: @@ -90,7 +90,7 @@ S3 网关仅支持在前台运行,因此客户端日志会直接输出到终 - `OK`:当前操作是否成功,如果不成功会输出具体的失败信息。 - `<0.000010>`:当前操作花费的时间(以秒为单位) -你可以通过访问日志调试和分析性能问题,或者尝试使用 `juicefs profile ` 查看实时统计信息。运行 `juicefs profile -h` 或[点此](../benchmark/operations_profiling.md)了解该子命令的更多信息。 +访问日志量很大,直接阅读难以把握系统性能情况,推荐使用 [`juicefs profile`](#profile) 直接基于日志进行性能可视化分析。 不同 JuiceFS 客户端获取访问日志的方式不同,以下分别介绍。 @@ -124,11 +124,79 @@ kubectl -n kube-system exec juicefs-1.2.3.4-pvc-d4b8fb4f-2c0b-48e8-a2dc-53079943 需要在 JuiceFS Hadoop Java SDK 的[客户端配置](../deployment/hadoop_java_sdk.md#其它配置)中新增 `juicefs.access-log` 配置项,指定访问日志输出的路径,默认不输出访问日志。 -## 运行时信息 +## 实时性能监控 {#performance-monitor} + +JuiceFS 客户端提供 `profile` 和 `stats` 两个子命令来对性能数据进行可视化呈现。其中,`profile` 命令通过读取[「文件系统请求日志」](#access-log)进行汇总输出,而 `stats` 则依赖[客户端监控数据](../administration/monitoring.md)。 + +### `juicefs profile` {#profile} + +[`juicefs profile`](../reference/command_reference.md#profile) 会对[「文件系统访问日志」](#access-log)进行汇总,运行 `juicefs profile MOUNTPOINT` 命令,便能看到根据最新访问日志获取的各个文件系统操作的实时统计信息: + +![](../images/juicefs-profiling.gif) + +除了对挂载点进行实时分析,该命令还提供回放模式,可以对预先收集的日志进行回放分析: + +```shell +# 预先收集日志 +cat /jfs/.accesslog > /tmp/juicefs.accesslog + +# 性能问题复现后,重放日志,分析各调用耗时,找出性能瓶颈 +juicefs profile /tmp/juicefs.accesslog +``` + +如果认为回放日志的速度太快,可以用 Enter/Return 暂停/继续回放。如果太慢,则设置 `--interval 0` 来立即回放整个日志文件并直接显示统计结果。 + +如果只对某个用户或进程感兴趣,可以通过指定其 ID 来过滤掉其他用户或进程。例如: + +```bash +juicefs profile /tmp/juicefs.accesslog --uid 12345 +``` + +### `juicefs stats` {#stats} + +[`juicefs stats`](../reference/command_reference.md#stats) 命令通过读取 JuiceFS 客户端的监控数据,以类似 Linux `dstat` 工具的形式实时打印各个指标的每秒变化情况: + +![](../images/juicefs_stats_watcher.png) + +各个板块指标介绍: + +#### `usage` + +- `cpu`:进程的 CPU 使用率。 +- `mem`:进程的物理内存使用量。 +- `buf`:进程已使用的[读写缓冲区](../guide/cache_management.md#buffer-size)大小,如果该数值逼近甚至超过客户端所设置的 [`--buffer-size`](../reference/command_reference.md#mount),说明读写缓冲区空间不足,需要视情况扩大,或者降低应用读写负载。 +- `cache`:内部指标,无需关注。 + +#### `fuse` + +- `ops`/`lat`:通过 FUSE 接口处理的每秒请求数及其平均时延,单位为毫秒。 +- `read`/`write`:通过 FUSE 接口处理的读写带宽。 + 
+#### `meta` + +- `ops`/`lat`:每秒处理的元数据请求数和平均时延,单位为毫秒。注意部分能在缓存中直接处理的元数据请求未列入统计,以更好地体现客户端与元数据引擎交互的耗时。 +- `txn`/`lat`:元数据引擎每秒处理的写事务个数及其平均时延,单位为毫秒。只读请求如 `getattr` 只会计入 `ops` 而不会计入 `txn`。 +- `retry`:元数据引擎每秒重试写事务的次数。 + +#### `blockcache` + +`blockcache` 代表本地数据缓存,如果读请求已经被内核缓存,那么流量将不会体现在 `blockcache` 相关指标下。因此如果反复读取相同文件,却发现持续产生 `blockcache` 流量,说明文件始终未能被内核页缓存收录,考虑往该方向排查(比如内存吃紧,不足以缓存更多文件)。 + +- `read`/`write`:客户端本地数据缓存的每秒读写流量。 + +#### `object` + +`object` 代表与对象存储相关指标,在缓存场景下,读请求穿透到对象存储,将会明显降低读性能,可以用该指标来断定数据是否完整缓存。另一方面,通过对比 GET 请求流量和 FUSE 读流量的关系,也能初步判断[读放大](./troubleshooting.md#read-amplification)的情况。 + +- `get`/`get_c`/`lat`:对象存储每秒处理读请求的带宽值,请求个数及其平均时延(单位为毫秒)。 +- `put`/`put_c`/`lat`:对象存储每秒处理写请求的带宽值,请求个数及其平均时延(单位为毫秒)。 +- `del_c`/`lat`:对象存储每秒处理删除请求的个数和平均时延(单位为毫秒)。 + +## 用 pprof 获取运行时信息 {#runtime-information} JuiceFS 客户端默认会通过 [pprof](https://pkg.go.dev/net/http/pprof) 在本地监听一个 TCP 端口用以获取运行时信息,如 Goroutine 堆栈信息、CPU 性能统计、内存分配统计。你可以通过系统命令(如 `lsof`)查看当前 JuiceFS 客户端监听的具体端口号: -:::note 注意 +:::tip 提示 如果 JuiceFS 是通过 root 用户挂载,那么需要在 `lsof` 命令前加上 `sudo`。 ::: @@ -136,10 +204,12 @@ JuiceFS 客户端默认会通过 [pprof](https://pkg.go.dev/net/http/pprof) 在 lsof -i -nP | grep LISTEN | grep juicefs ``` -```output -juicefs 32666 user 8u IPv4 0x44992f0610d9870b 0t0 TCP 127.0.0.1:6061 (LISTEN) -juicefs 32666 user 9u IPv4 0x44992f0619bf91cb 0t0 TCP 127.0.0.1:6071 (LISTEN) -juicefs 32666 user 15u IPv4 0x44992f062886fc5b 0t0 TCP 127.0.0.1:9567 (LISTEN) +```shell +# pprof 监听端口 +juicefs 19371 user 6u IPv4 0xa2f1748ad05b5427 0t0 TCP 127.0.0.1:6061 (LISTEN) + +# Prometheus API 监听端口 +juicefs 19371 user 11u IPv4 0xa2f1748ad05cbde7 0t0 TCP 127.0.0.1:9567 (LISTEN) ``` 默认 pprof 监听的端口号范围是从 6060 开始至 6099 结束,因此上面示例中对应的实际端口号是 6061。在获取到监听端口号以后就可以通过 `http://localhost:/debug/pprof` 地址查看所有可供查询的运行时信息,一些重要的运行时信息如下: @@ -169,7 +239,7 @@ curl 'http://localhost:/debug/pprof/heap' > juicefs.heap.pb.gz juicefs debug /mnt/jfs ``` -关于 `juicefs debug` 命令的更多信息,请查看[命令参考](https://juicefs.com/docs/zh/community/command_reference#juicefs-debug)。 +关于 `juicefs debug` 命令的更多信息,请查看[命令参考](../reference/command_reference.md#debug)。 ::: 如果你安装了 `go` 命令,那么可以通过 `go tool pprof` 命令直接分析,例如分析 CPU 性能统计: @@ -209,9 +279,9 @@ Showing top 10 nodes out of 192 go tool pprof -pdf 'http://localhost:/debug/pprof/heap' > juicefs.heap.pdf ``` -关于 pprof 的更多信息,请查看[官方文档](https://github.com/google/pprof/blob/master/doc/README.md)。 +关于 pprof 的更多信息,请查看[官方文档](https://github.com/google/pprof/blob/main/doc/README.md)。 -### 使用 Pyroscope 进行性能剖析 +### 使用 Pyroscope 进行性能剖析 {#use-pyroscope} ![Pyroscope](../images/pyroscope.png) diff --git a/docs/zh_cn/administration/monitoring.md b/docs/zh_cn/administration/monitoring.md index 54761e8830b0..dc7116bfff0a 100644 --- a/docs/zh_cn/administration/monitoring.md +++ b/docs/zh_cn/administration/monitoring.md @@ -6,7 +6,7 @@ description: 本文介绍如何搭配 Prometheus、Grafana 等第三方工具可 作为承载海量数据存储的分布式文件系统,用户通常需要直观地了解整个系统的容量、文件数量、CPU 负载、磁盘 IO、缓存等指标的变化。JuiceFS 通过 Prometheus 兼容的 API 对外提供实时的状态数据,只需将其添加到用户自建的 Prometheus Server 建立时序数据,然后通过 Grafana 等工具即可轻松实现 JuiceFS 文件系统的可视化监控。 -## 快速上手 +## 快速上手 {#getting-started} 这里假设你搭建的 Prometheus Server、Grafana 与 JuiceFS 客户端都运行在相同的主机上。其中: @@ -69,11 +69,11 @@ scrape_configs: ![](../images/grafana-dashboard.jpg) -## 收集监控指标 +## 收集监控指标 {#collecting-metrics} 根据部署 JuiceFS 的方式不同可以有不同的收集监控指标的方法,下面分别介绍。 -### 挂载点 +### 挂载点 {#mount-point} 当通过 [`juicefs mount`](../reference/command_reference.md#mount) 命令挂载 JuiceFS 文件系统后,可以通过 `http://localhost:9567/metrics` 这个地址收集监控指标,你也可以通过 `--metrics` 选项自定义。如: @@ -274,7 +274,7 @@ juicefs mount --consul 1.2.3.4:8500 ... 
每个 instance 的 meta 都包含了 `hostname` 与 `mountpoint` 两个维度,其中 `mountpoint` 为 `s3gateway` 代表该实例为 S3 网关。 -## 可视化监控指标 +## 可视化监控指标 {#visualize-metrics} ### Grafana 仪表盘模板 @@ -289,6 +289,6 @@ Grafana 仪表盘示例效果如下图: ![JuiceFS Grafana dashboard](../images/grafana_dashboard.png) -## 监控指标索引 +## 监控指标索引 {#metrics-reference} 请参考[「JuiceFS 监控指标」](../reference/p8s_metrics.md)文档 diff --git a/docs/zh_cn/benchmark/benchmark.md b/docs/zh_cn/benchmark/benchmark.md index 097862590cfe..1f1dcf57b046 100644 --- a/docs/zh_cn/benchmark/benchmark.md +++ b/docs/zh_cn/benchmark/benchmark.md @@ -31,16 +31,4 @@ JuiceFS 提供了 `bench` 子命令来运行一些基本的基准测试,用 ### 分析测试结果 -假定在 JuiceFS 的根目录下有一个名为 `.accesslog` 的文件,它保存了所有操作对应的时间,例如: - -```shell -cat /jfs/.accesslog -``` - -```output -2021.01.15 08:26:11.003330 [uid:0,gid:0,pid:4403] write (17669,8666,4993160): OK <0.000010> -2021.01.15 08:26:11.003473 [uid:0,gid:0,pid:4403] write (17675,198,997439): OK <0.000014> -2021.01.15 08:26:11.003616 [uid:0,gid:0,pid:4403] write (17666,390,951582): OK <0.000006> -``` - -每行最后一个数表示当前操作所消耗的时间(单位:秒)。你可以直接参考这些数值来调试和分析性能问题,也可以试试 `./juicefs profile /jfs` 命令来实时监测性能统计数据。你也可以运行 `./juicefs profile -h` 或者参考[这里](../benchmark/operations_profiling.md)了解这个子命令。 +如遇性能问题,阅读[「实时性能监控」](../administration/fault_diagnosis_and_analysis.md#performance-monitor)了解如何排查。 diff --git a/docs/zh_cn/benchmark/operations_profiling.md b/docs/zh_cn/benchmark/operations_profiling.md deleted file mode 100644 index d849c925c62f..000000000000 --- a/docs/zh_cn/benchmark/operations_profiling.md +++ /dev/null @@ -1,60 +0,0 @@ ---- -title: 性能诊断 -sidebar_position: 3 -slug: /operations_profiling -description: JuiceFS 的 profile 命令主要用作汇总过去某个时间的所有日志并定期显示统计信息,包括实时模式和回放模式。 ---- - -## 介绍 - -JuiceFS 文件系统挂载以后,在文件系统的根目录中有一个名为 [`.accesslog`](../administration/fault_diagnosis_and_analysis.md#access-log) 的特殊虚拟文件,用于跟踪其客户端中发生的每个操作。在负载压力较大的情况下,此文件每秒可能会生成数千个日志记录,很难确定特定时间的实际情况。因此,我们制作了一个名为 [`juicefs profile`](../reference/command_reference.md#juicefs-profile) 的简单工具,可以显示最近完成操作的概述。目的是汇总过去某个时间的所有日志并定期显示统计信息,例如: - -![JuiceFS-profiling](../images/juicefs-profiling.gif) - -## 诊断模式 - -目前有两种诊断模式:`实时模式` 和 `回放模式`。 - -### 实时模式 - -通过执行以下命令,您可以观察挂载点上的实时操作: - -```bash -juicefs profile MOUNTPOINT -``` - -> **提示**:输出结果按总时间降序排列。 - -### 回放模式 - -在现有的日志文件上运行 `profile` 命令将启用「回放模式」: - -```bash -juicefs profile LOGFILE -``` - -在调试或分析性能问题时,更实用的做法通常是先记录访问日志,然后重放(多次)。例如: - -```bash -cat /jfs/.accesslog > /tmp/jfs-oplog -``` - -later - -```bash -juicefs profile /tmp/jfs-oplog -``` - -> **提示 1**:可以随时按键盘上的 Enter/Return 暂停/继续回放。 -> -> **提示 2**:如果设置 `--interval 0`,将立即回放完整个日志文件并显示整体统计结果。 - -## 过滤 - -有时我们只对某个用户或进程感兴趣,可以通过指定其 ID 来过滤掉其他用户或进程。例如: - -```bash -juicefs profile /tmp/jfs-oplog --uid 12345 -``` - -更多信息,请运行 `juicefs profile -h` 命令查看。 diff --git a/docs/zh_cn/benchmark/performance_evaluation_guide.md b/docs/zh_cn/benchmark/performance_evaluation_guide.md index 881a82603c4b..d29fe4067bc6 100644 --- a/docs/zh_cn/benchmark/performance_evaluation_guide.md +++ b/docs/zh_cn/benchmark/performance_evaluation_guide.md @@ -32,9 +32,9 @@ slug: /performance_evaluation_guide JuiceFS v1.0+ 默认启用了回收站,基准测试会在文件系统中创建和删除临时文件,这些文件最终会被转存到回收站 `.trash` 占用存储空间,为了避免这种情况,可以在基准测试之前关闭回收站 `juicefs config META-URL --trash-days 0`,详情参考[回收站](../security/trash.md)。 -### JuiceFS Bench +### `juicefs bench` -JuiceFS [`bench`](../reference/command_reference.md#juicefs-bench) 命令可以帮助你快速完成单机性能测试,通过测试结果判断环境配置和性能表现是否正常。假设你已经把 JuiceFS 挂载到了测试机器的 `/mnt/jfs` 位置(如果在 JuiceFS 初始化、挂载方面需要帮助,请参考[快速上手指南](../getting-started/README.md)),执行以下命令即可(推荐 `-p` 参数设置为测试机器的 CPU 核数): +[`juicefs 
bench`](../reference/command_reference.md#bench) 命令可以帮助你快速完成单机性能测试,通过测试结果判断环境配置和性能表现是否正常。假设你已经把 JuiceFS 挂载到了测试机器的 `/mnt/jfs` 位置(如果在 JuiceFS 初始化、挂载方面需要帮助,请参考[快速上手指南](../getting-started/README.md)),执行以下命令即可(推荐 `-p` 参数设置为测试机器的 CPU 核数): ```bash juicefs bench /mnt/jfs -p 4 @@ -44,7 +44,7 @@ juicefs bench /mnt/jfs -p 4 ![bench](../images/bench-guide-bench.png) -JuiceFS `bench` 基准性能测试的具体流程如下(它的实现逻辑非常简单,有兴趣了解细节的可以直接看[源码](https://github.com/juicedata/juicefs/blob/main/cmd/bench.go)): +`juicefs bench` 基准性能测试的具体流程如下(它的实现逻辑非常简单,有兴趣了解细节的可以直接看[源码](https://github.com/juicedata/juicefs/blob/main/cmd/bench.go)): 1. N 并发各写 1 个 1 GiB 的大文件,IO 大小为 1 MiB 2. N 并发各读 1 个之前写的 1 GiB 的大文件,IO 大小为 1 MiB @@ -75,9 +75,9 @@ Amazon EFS 的性能与容量线性相关([参考官方文档](https://docs.aw 以上数据来自 [AWS 官方文档](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html),性能指标为最大值,EBS 的实际性能与卷容量和挂载 EC2 实例类型相关,总的来说是越大容量,搭配约高配置的 EC2,得到的 EBS 性能越好,但不超过上面提到的最大值。 ::: -### JuiceFS Objbench +### `juicefs objbench` -JuiceFS 提供了 [`objbench`](../reference/command_reference.md#juicefs-objbench) 子命令来运行一些关于对象存储的测试,用以评估其作为 JuiceFS 的后端存储时的运行情况。以测试 Amazon S3 为例: +[`juicefs objbench`](../reference/command_reference.md#objbench) 命令可以运行一些关于对象存储的测试,用以评估其作为 JuiceFS 的后端存储时的运行情况。以测试 Amazon S3 为例: ```bash juicefs objbench \ @@ -130,9 +130,9 @@ juicefs objbench \ 接下来介绍两个性能观测和分析工具,是 JuiceFS 测试、使用、调优过程中必备的利器。 -### JuiceFS Stats +### `juicefs stats` -JuiceFS `stats` 是一个实时统计 JuiceFS 性能指标的工具,类似 Linux 系统的 `dstat` 命令,可以实时显示 JuiceFS 客户端的指标变化(详细说明和使用方法见[文档](./stats_watcher.md))。执行 `juicefs bench` 时,在另一个会话中执行以下命令: +[`juicefs stats`](../administration/fault_diagnosis_and_analysis.md#stats) 命令是一个实时统计 JuiceFS 性能指标的工具,类似 Linux 系统的 `dstat` 命令,可以实时显示 JuiceFS 客户端的指标变化。执行 `juicefs bench` 时,在另一个会话中执行以下命令: ```bash juicefs stats /mnt/jfs --verbosity 1 @@ -140,54 +140,35 @@ juicefs stats /mnt/jfs --verbosity 1 结果如下,可以将其与上述基准测试流程对照来看,更易理解: -![stats](../images/bench-guide-stats.png) - -其中各项指标具体含义如下: - -- `usage` - - `cpu`: JuiceFS 进程消耗的 CPU - - `mem`: JuiceFS 进程占用的物理内存 - - `buf`: JuiceFS 进程内部的读写 buffer 大小,受挂载选项 `--buffer-size` 限制 - - `cache`: 内部指标,可不关注 -- `fuse` - - `ops`/`lat`: FUSE 接口每秒处理的请求个数及其平均时延(单位为毫秒) - - `read`/`write`: FUSE 接口每秒处理读写请求的带宽值 -- `meta` - - `ops`/`lat`: 元数据引擎每秒处理的请求个数及其平均时延(单位为毫秒)。请注意部分能在缓存中直接处理的请求未列入统计,以更好地体现客户端与元数据引擎交互的耗时。 - - `txn`/`lat`: 元数据引擎每秒处理的**写事务**个数及其平均时延(单位为毫秒)。只读请求如 `getattr` 只会计入 ops 而不会计入 txn。 - - `retry`: 元数据引擎每秒重试**写事务**的次数 -- `blockcache` - - `read`/`write`: 客户端本地数据缓存的每秒读写流量 -- `object` - - `get`/`get_c`/`lat`: 对象存储每秒处理**读请求**的带宽值,请求个数及其平均时延(单位为毫秒) - - `put`/`put_c`/`lat`: 对象存储每秒处理**写请求**的带宽值,请求个数及其平均时延(单位为毫秒) - - `del_c`/`lat`: 对象存储每秒处理**删除请求**的个数和平均时延(单位为毫秒) - -### JuiceFS Profile - -JuiceFS `profile` 一方面用来实时输出 JuiceFS 客户端的所有访问日志,包含每个请求的信息。同时,它也可以用来回放、统计 JuiceFS 访问日志,方便用户直观了解 JuiceFS 的运行情况(详细的说明和使用方法见[文档](./operations_profiling.md))。执行 `juicefs bench` 时,在另一个会话中执行以下命令: +![](../images/bench-guide-stats.png) + +其中各项指标具体含义参考 [`juicefs stats`](../administration/fault_diagnosis_and_analysis.md#stats)。 + +### `juicefs profile` + +[`juicefs profile`](../administration/fault_diagnosis_and_analysis.md#profile) 命令可以基于[访问日志](../administration/fault_diagnosis_and_analysis.md#access-log)进行性能数据统计,来直观了解 JuiceFS 的运行情况。执行 `juicefs bench` 时,在另一个会话中执行以下命令: ```bash -cat /mnt/jfs/.accesslog > access.log +cat /mnt/jfs/.accesslog > juicefs.accesslog ``` -其中 `.accesslog` 是一个虚拟文件,它平时不会产生任何数据,只有在读取(如执行 `cat`)时才会有 JuiceFS 的访问日志输出。结束后使用 Ctrl-C 结束 `cat` 命令,并运行: +其中 `.accesslog` 是一个虚拟文件,它平时不会产生任何数据,只有在读取(如执行 `cat`)时才会有 JuiceFS 的访问日志输出。结束后使用 Ctrl + C 结束 `cat` 命令,并运行: 
```bash -juicefs profile access.log --interval 0 +juicefs profile juicefs.accesslog --interval 0 ``` 其中 `--interval` 参数设置访问日志的采样间隔,设为 0 时用于快速重放一个指定的日志文件,生成统计信息,如下图所示: -![profile](../images/bench-guide-profile.png) +![](../images/bench-guide-profile.png) -从之前基准测试流程描述可知,本次测试过程一共创建了 (1 + 100) * 4 = 404 个文件,每个文件都经历了「创建 → 写入 → 关闭 → 打开 → 读取 → 关闭 → 删除」的过程,因此一共有: +从之前基准测试流程描述可知,本次测试过程一共创建了 `(1 + 100) * 4 = 404` 个文件,每个文件都经历了「创建 → 写入 → 关闭 → 打开 → 读取 → 关闭 → 删除」的过程,因此一共有: -- 404 次 create,open 和 unlink 请求 -- 808 次 flush 请求:每当文件关闭时会自动调用一次 flush -- 33168 次 write/read 请求:每个大文件写入了 1024 个 1 MiB IO,而在 FUSE 层请求的默认最大值为 128 KiB,也就是说每个应用 IO 会被拆分成 8 个 FUSE 请求,因此一共有 (1024 *8 + 100)* 4 = 33168 个请求。读 IO 与之类似,计数也相同。 +- 404 次 `create`,`open` 和 `unlink` 请求 +- 808 次 `flush` 请求:每当文件关闭时会自动调用一次 `flush` +- 33168 次 `write`/`read` 请求:每个大文件写入了 1024 个 1 MiB IO,而在 FUSE 层请求的默认最大值为 128 KiB,也就是说每个应用 IO 会被拆分成 8 个 FUSE 请求,因此一共有 `(1024 * 8 + 100) * 4 = 33168` 个请求。读 IO 与之类似,计数也相同。 -以上这些值均能与 `profile` 的结果完全对应上。另外,结果中还显示 write 的平均时延非常小(45 微秒),而主要耗时点在 flush。这是因为 JuiceFS 的 write 默认先写入内存缓冲区,在文件关闭时再调用 flush 上传数据到对象存储,与预期吻合。 +以上这些值均能与 `profile` 的结果完全对应上。另外,结果中还显示 `write` 的平均时延非常小(45 微秒),而主要耗时点在 `flush`。这是因为 JuiceFS 的 `write` 默认先写入内存缓冲区,在文件关闭时再调用 `flush` 上传数据到对象存储,与预期吻合。 ## 其他测试工具配置示例 diff --git a/docs/zh_cn/benchmark/stats_watcher.md b/docs/zh_cn/benchmark/stats_watcher.md deleted file mode 100644 index a789cfbf1a92..000000000000 --- a/docs/zh_cn/benchmark/stats_watcher.md +++ /dev/null @@ -1,36 +0,0 @@ ---- -title: 性能统计监控 -sidebar_position: 4 -slug: /stats_watcher ---- - -JuiceFS 预定义了许多监控指标来监测系统运行时的内部性能情况,并通过 Prometheus API [暴露对外接口](../administration/monitoring.md)。然而,在分析一些实际问题时,用户往往需要更实时的性能统计监控。因此,我们开发了 `stats` 命令,以类似 Linux `dstat` 工具的形式实时打印各个指标的每秒变化情况,如下图所示: - -![stats_watcher](../images/juicefs_stats_watcher.png) - -默认参数下,此命令会监控指定挂载点对应的 JuiceFS 进程的以下几个指标: - -#### usage - -- CPU:进程的 CPU 使用率 -- mem:进程的物理内存使用量 -- buf:进程已使用的 Buffer 大小;此值受限于挂载选项 `--buffer-size` - -#### FUSE - -- ops/lat:通过 FUSE 接口处理的每秒请求数及其平均时延(单位为毫秒) -- read/write:通过 FUSE 接口处理的读写带宽 - -#### meta - -- ops/lat:每秒处理的元数据请求数和平均时延(单位为毫秒)。注意部分能在缓存中直接处理的元数据请求未列入统计,以更好地体现客户端与元数据引擎交互的耗时。 - -#### blockcache - -- read/write:客户端本地数据缓存的每秒读写流量 - -#### object - -- get/put:客户端与对象存储交互的 Get/Put 每秒流量 - -此外,可以通过设置 `--verbosity 1` 来获取更详细的统计信息(如读写请求的个数和平均时延统计等),也可以通过修改 `--schema` 来自定义监控内容与格式。更多的命令信息请通过执行 `juicefs stats -h` 查看。 diff --git a/docs/zh_cn/faq.md b/docs/zh_cn/faq.md index a4c6efdf2ae9..66c0a2d9f3f3 100644 --- a/docs/zh_cn/faq.md +++ b/docs/zh_cn/faq.md @@ -19,7 +19,7 @@ slug: /faq ### JuiceFS 的日志在哪里? -不同类型的 JuiceFS 客户端获取日志的方式也不同,详情请参考[「客户端日志」](administration/fault_diagnosis_and_analysis.md#客户端日志)文档。 +不同类型的 JuiceFS 客户端获取日志的方式也不同,详情请参考[「客户端日志」](administration/fault_diagnosis_and_analysis.md#client-log)文档。 ### JuiceFS 是否可以直接读取对象存储中已有的文件? 
diff --git a/docs/zh_cn/reference/command_reference.md b/docs/zh_cn/reference/command_reference.md index 35c0273d735e..5617767a412b 100644 --- a/docs/zh_cn/reference/command_reference.md +++ b/docs/zh_cn/reference/command_reference.md @@ -769,7 +769,7 @@ $ cd /mnt/jfs $ juicefs info -i 100 ``` -### `juicefs bench` +### `juicefs bench` {#bench} 对指定的路径做基准测试,包括对大文件和小文件的读/写/获取属性操作。 @@ -808,7 +808,7 @@ $ juicefs bench /mnt/jfs -p 4 $ juicefs bench /mnt/jfs --big-file-size 0 ``` -### `juicefs objbench` +### `juicefs objbench` {#objbench} 测试对象存储接口的正确性与基本性能 @@ -906,7 +906,7 @@ juicefs fsck [command options] META-URL juicefs fsck redis://localhost ``` -### `juicefs profile` +### `juicefs profile` {#profile} 分析[访问日志](../administration/fault_diagnosis_and_analysis.md#access-log)。 @@ -1172,7 +1172,7 @@ juicefs destroy [command options] META-URL UUID juicefs destroy redis://localhost e94d66a8-2339-4abd-b8d8-6812df737892 ``` -### `juicefs debug` +### `juicefs debug` {#debug} 从运行环境、系统日志等多个维度收集和展示信息,帮助更好地定位错误