consolidate content regarding profile & stats (#3392)
* move knowledge to fault_diagnosis_and_analysis.md, which contains all troubleshooting methods and guides.
* add docs on how to interpret stats output

---------

Co-authored-by: Changjian Gao <[email protected]>
timfeirg and xiaogaozi authored Mar 29, 2023
1 parent bc95ca0 commit e2fabcc
Showing 20 changed files with 252 additions and 377 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -7,6 +7,6 @@ repos:
- id: end-of-file-fixer
- id: trailing-whitespace
- repo: https://github.com/golangci/golangci-lint
rev: v1.33.0
rev: v1.52.2
hooks:
- id: golangci-lint
11 changes: 1 addition & 10 deletions README.md
@@ -142,16 +142,7 @@ The result shows that JuiceFS can provide significantly more metadata IOPS than

### Analyze performance

There is a virtual file called `.accesslog` in the root of JuiceFS to show all the details of file system operations and the time they take, for example:

```bash
$ cat /jfs/.accesslog
2021.01.15 08:26:11.003330 [uid:0,gid:0,pid:4403] write (17669,8666,4993160): OK <0.000010>
2021.01.15 08:26:11.003473 [uid:0,gid:0,pid:4403] write (17675,198,997439): OK <0.000014>
2021.01.15 08:26:11.003616 [uid:0,gid:0,pid:4403] write (17666,390,951582): OK <0.000006>
```

The last number on each line is the time (in seconds) that the current operation takes. You can directly use this to debug and analyze performance issues, or try `juicefs profile /jfs` to monitor real time statistics. Please run `juicefs profile -h` or refer to [here](https://juicefs.com/docs/community/operations_profiling) to learn more about this subcommand.
See [Real-Time Performance Monitoring](https://juicefs.com/docs/community/fault_diagnosis_and_analysis#performance-monitor) if you encounter performance issues.

## Supported Object Storage

11 changes: 1 addition & 10 deletions README_CN.md
@@ -144,16 +144,7 @@ JuiceFS provides a benchmarking subcommand to help you understand how it performs in your environment

### Performance analysis

There is a virtual file named `.accesslog` in the root directory of the file system. It shows the details of all file system operations and the time each one takes, for example:

```bash
$ cat /jfs/.accesslog
2021.01.15 08:26:11.003330 [uid:0,gid:0,pid:4403] write (17669,8666,4993160): OK <0.000010>
2021.01.15 08:26:11.003473 [uid:0,gid:0,pid:4403] write (17675,198,997439): OK <0.000014>
2021.01.15 08:26:11.003616 [uid:0,gid:0,pid:4403] write (17666,390,951582): OK <0.000006>
```

The last number on each line is the time the operation took, in seconds. You can use it directly to debug and analyze performance issues, or try the `juicefs profile /jfs` command to monitor statistics in real time. To learn more about this subcommand, run `juicefs profile -h` or see [here](https://juicefs.com/docs/zh/community/operations_profiling).
If you encounter performance issues, see [Real-Time Performance Monitoring](https://juicefs.com/docs/zh/community/fault_diagnosis_and_analysis#performance-monitor).

## Supported object storage

8 changes: 4 additions & 4 deletions cmd/profile.go
@@ -47,14 +47,14 @@ Examples:
$ juicefs profile /mnt/jfs
# Replay an access log
$ cat /mnt/jfs/.accesslog > /tmp/jfs.alog
$ cat /mnt/jfs/.accesslog > /tmp/juicefs.accesslog
# Press Ctrl-C to stop the "cat" command after some time
$ juicefs profile /tmp/jfs.alog
$ juicefs profile /tmp/juicefs.accesslog
# Analyze an access log and print the total statistics immediately
$ juicefs profile /tmp/jfs.alog --interval 0
$ juicefs profile /tmp/juicefs.accesslog --interval 0
Details: https://juicefs.com/docs/community/operations_profiling`,
Details: https://juicefs.com/docs/community/fault_diagnosis_and_analysis#profile`,
Flags: []cli.Flag{
&cli.StringFlag{
Name: "uid",
2 changes: 1 addition & 1 deletion cmd/stats.go
@@ -45,7 +45,7 @@ $ juicefs stats /mnt/jfs
# More metrics
$ juicefs stats /mnt/jfs -l 1
Details: https://juicefs.com/docs/community/stats_watcher`,
Details: https://juicefs.com/docs/community/fault_diagnosis_and_analysis#stats`,
Flags: []cli.Flag{
&cli.StringFlag{
Name: "schema",
118 changes: 94 additions & 24 deletions docs/en/administration/fault_diagnosis_and_analysis.md
@@ -2,13 +2,13 @@
title: Troubleshooting Methods
sidebar_position: 5
slug: /fault_diagnosis_and_analysis
description: This article describes how to view and interpret logs in various operating systems for JuiceFS FUSE, CSI Driver, Hadoop Java SDK S3 gateway, S3 gateway clients.
description: This article introduces troubleshooting methods for JuiceFS mount point, CSI Driver, Hadoop Java SDK, S3 Gateway, and other clients.
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

## Client log
## Client log {#client-log}

JuiceFS clients output logs for troubleshooting while running. Log levels, in order of increasing severity, are DEBUG < INFO < WARNING < ERROR < FATAL. Since DEBUG logs are not printed by default, you need to enable them explicitly if needed, e.g. by adding the `--debug` option when running the JuiceFS client.
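
For instance, a mount command with debug logging enabled might look like the following sketch; the metadata URL and mount point are illustrative, adjust them for your environment:

```shell
# Mount with DEBUG logs enabled (example metadata engine and mount point)
juicefs mount --debug redis://localhost/1 /mnt/jfs
```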

@@ -90,7 +90,7 @@ The meaning of each column is:
- `OK`: Indicate the current operation is successful or not. If it is unsuccessful, specific failure information will be output.
- `<0.000010>`: The time (in seconds) that the current operation takes.

You can debug and analyze performance issues with access log, or try using `juicefs profile <mount-point>` to see real-time statistics. Run `juicefs profile -h` or refer to [Operations Profiling](../benchmark/operations_profiling.md) for details.
Access logs tend to get very large and are difficult for humans to process directly; use [`juicefs profile`](#profile) to quickly visualize performance data based on these logs.

Different JuiceFS clients obtain access log in different ways, which are described below.

@@ -114,7 +114,7 @@ Please refer to [CSI Driver documentation](https://juicefs.com/docs/csi/troubles

```bash
kubectl -n kube-system exec juicefs-chaos-k8s-002-pvc-d4b8fb4f-2c0b-48e8-a2dc-530799435373 -- cat /jfs/pvc-d4b8fb4f-2c0b-48e8-a2dc-530799435373/.accesslog
````
```

### S3 Gateway

@@ -124,54 +124,124 @@ You need to add the [`--access-log` option](../reference/command_reference.md#ju

You need to add the `juicefs.access-log` configuration item to the [client configurations](../deployment/hadoop_java_sdk.md#other-configurations) of the JuiceFS Hadoop Java SDK to specify the output path of the access log; the access log is not output by default.

## Runtime information
## Real-time performance monitoring {#performance-monitor}

JuiceFS provides the `profile` and `stats` subcommands to visualize real-time performance data: the `profile` command is based on the [file system access log](#access-log), while the `stats` command uses [real-time statistics](../administration/monitoring.md).

### `juicefs profile` {#profile}

[`juicefs profile`](../reference/command_reference.md#profile) collects data from the [file system access log](#access-log). Run the `juicefs profile MOUNTPOINT` command to see real-time statistics of each file system operation, based on the latest access log:

![](../images/juicefs-profiling.gif)
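
For example, to watch a mounted file system in real time (the mount point path `/mnt/jfs` is only an example):

```shell
juicefs profile /mnt/jfs
```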

Apart from real-time mode, this command also provides a play-back mode, which performs the same visualization on existing access log files:

```shell
# Collect access logs in advance
cat /jfs/.accesslog > /tmp/juicefs.accesslog

# After performance issue is reproduced, re-play this log file to find system bottleneck
juicefs profile -f /tmp/juicefs.accesslog
```

If the replay is too fast, press <kbd>Enter/Return</kbd> to pause at any time, and press it again to continue. If it is too slow, use `--interval 0` to replay the whole log file as fast as possible and show the final result directly.
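
For instance, to process the log file collected above in a single pass and print only the final summary:

```shell
juicefs profile /tmp/juicefs.accesslog --interval 0
```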

If you're only interested in a certain user or process, you can set filters:

```bash
juicefs profile /tmp/juicefs.accesslog --uid 12345
```

### `juicefs stats` {#stats}

The [`juicefs stats`](../reference/command_reference.md#stats) command reads the JuiceFS client's internal metrics and outputs performance data in a format similar to `dstat`:

![](../images/juicefs_stats_watcher.png)
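
A typical invocation looks like the following (the mount point path is an example; `-l 1` shows more metrics):

```shell
# Watch the mount point with the default set of metrics
juicefs stats /mnt/jfs

# Show a more detailed set of metrics
juicefs stats /mnt/jfs -l 1
```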

Metrics description:

#### `usage`

- `cpu`: CPU usage of the process.
- `mem`: Physical memory used by the process.
- `buf`: Current [buffer size](../guide/cache_management.md#buffer-size). If this value is constantly close to (or even exceeds) the configured [`--buffer-size`](../reference/command_reference.md#mount), you should increase the buffer size or decrease the application workload, as shown in the sketch after this list.
- `cache`: Internal metric, ignore this.
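
If `buf` stays saturated, a larger buffer can be configured at mount time. A minimal sketch, assuming a Redis metadata engine on localhost and a 1024 MiB buffer (both values are illustrative):

```shell
# Remount with a larger read/write buffer (--buffer-size is in MiB)
juicefs mount --buffer-size 1024 redis://localhost/1 /mnt/jfs
```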

#### `fuse`

- `ops`/`lat`: Operations processed by FUSE per second, and their average latency (in milliseconds).
- `read`/`write`: Read/write bandwidth usage of FUSE.

#### `meta`

- `ops`/`lat`: Metadata operations processed per second, and their average latency (in milliseconds). Note that operations served directly from the cache are not counted, in order to show the actual latency of clients interacting with the metadata engine.
- `txn`/`lat`: Write transactions per second processed by the metadata engine and their average latency (in milliseconds). Read-only requests such as `getattr` are only counted as `ops` but not `txn`.
- `retry`: Write transactions per second that the metadata engine retries.

#### `blockcache`

`blockcache` stands for the local data cache. If read requests are already served by the kernel page cache, they are not counted in the `blockcache` read metric. If there is consistent `blockcache` read traffic while you repeatedly read a fixed file, it means the reads never enter the page cache, and you should troubleshoot in this direction (e.g. not enough memory).

- `read`/`write`: Read/write bandwidth of client local data cache

#### `object`

`object` stands for object storage related metrics. When cache is enabled, requests that penetrate to object storage significantly hinder read performance, so use these metrics to check whether data has been fully cached. You can also compare `object.get` and `fuse.read` traffic to get a rough idea of the current [read amplification](./troubleshooting.md#read-amplification) status.

- `get`/`get_c`/`lat`: Bandwidth, requests per second, and their average latency (in milliseconds) for object storage processing read requests.
- `put`/`put_c`/`lat`: Bandwidth, requests per second, and their average latency (in milliseconds) for object storage processing write requests.
- `del_c`/`lat`: Delete requests per second the object storage can process, and the average latency (in milliseconds).

## Get runtime information using pprof {#runtime-information}

By default, JuiceFS clients listen on a local TCP port via [pprof](https://pkg.go.dev/net/http/pprof) to expose runtime information such as Goroutine stacks, CPU performance statistics, and memory allocation statistics. You can find the specific port the current JuiceFS client is listening on with a system command such as `lsof`:

:::note
:::tip
If you mount JuiceFS as the root user, you need to add `sudo` before the `lsof` command.
:::

```bash
lsof -i -nP | grep LISTEN | grep juicefs
```

```output
juicefs 32666 user 8u IPv4 0x44992f0610d9870b 0t0 TCP 127.0.0.1:6061 (LISTEN)
juicefs 32666 user 9u IPv4 0x44992f0619bf91cb 0t0 TCP 127.0.0.1:6071 (LISTEN)
juicefs 32666 user 15u IPv4 0x44992f062886fc5b 0t0 TCP 127.0.0.1:9567 (LISTEN)
```shell
# pprof listen port
juicefs 19371 user 6u IPv4 0xa2f1748ad05b5427 0t0 TCP 127.0.0.1:6061 (LISTEN)

# Prometheus API listen port
juicefs 19371 user 11u IPv4 0xa2f1748ad05cbde7 0t0 TCP 127.0.0.1:9567 (LISTEN)
```

By default, pprof listens on port numbers ranging from 6060 to 6099, which is why the actual port number in the above example is 6061. Once you have the listening port number, you can view all the available runtime information at `http://localhost:<port>/debug/pprof`; some important runtime information is listed below:

- Goroutine stack information: `http://localhost:<port>/debug/pprof/goroutine?debug=1`
- CPU performance statistics: `http://localhost:<port>/debug/pprof/profile?seconds=30`
- Memory allocation statistics: `http://localhost:<port>/debug/pprof/heap`

:::tip
You can also use the debug command to automatically collect these runtime information and save it locally. By default, it is saved to the debug directory under the current directory, for example:
To make it easier to analyze this runtime information, you can save it locally, e.g.:

```bash
juicefs debug /mnt/jfs
curl 'http://localhost:<port>/debug/pprof/goroutine?debug=1' > juicefs.goroutine.txt
```

For more information about the debug command, see [command reference](https://juicefs.com/docs/community/command_reference#juicefs-debug)
:::
To make it easier to analyze this runtime information, you can save it locally, e.g.:
```bash
curl 'http://localhost:<port>/debug/pprof/goroutine?debug=1' > juicefs.goroutine.txt
curl 'http://localhost:<port>/debug/pprof/profile?seconds=30' > juicefs.cpu.pb.gz
```

```bash
$ curl 'http://localhost:<port>/debug/pprof/profile?seconds=30' > juicefs.cpu.pb.gz
curl 'http://localhost:<port>/debug/pprof/heap' > juicefs.heap.pb.gz
```

:::tip
You can also use the `juicefs debug` command to automatically collect this runtime information and save it locally. By default, it is saved to the `debug` directory under the current directory, for example:

```bash
$ curl 'http://localhost:<port>/debug/pprof/heap' > juicefs.heap.pb.gz
juicefs debug /mnt/jfs
```

For more information about the `juicefs debug` command, see [command reference](../reference/command_reference.md#debug).
:::

If you have the `go` command installed, you can analyze the data directly with the `go tool pprof` command. For example, to analyze CPU performance statistics:

```bash
@@ -209,9 +279,9 @@ The export to visual chart function relies on [Graphviz](https://graphviz.org),
go tool pprof -pdf 'http://localhost:<port>/debug/pprof/heap' > juicefs.heap.pdf
```

For more information about pprof, please see the [official documentation](https://github.com/google/pprof/blob/master/doc/README.md).
For more information about pprof, please see the [official documentation](https://github.com/google/pprof/blob/main/doc/README.md).

### Profiling with the Pyroscope
### Profiling with the Pyroscope {#use-pyroscope}

![Pyroscope](../images/pyroscope.png)

10 changes: 5 additions & 5 deletions docs/en/administration/monitoring.md
@@ -6,7 +6,7 @@ description: This article describes how to visualize JuiceFS status monitoring w

As a distributed file system hosting massive amounts of data, it is important for users to directly observe the status of the whole system in terms of capacity, files, CPU load, disk IO, cache, etc. JuiceFS exposes real-time status data through a Prometheus-compatible API, so you only need to point your own Prometheus Server at it and then visualize the time series data with tools like Grafana.
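
As a quick sanity check before wiring up Prometheus, you can fetch the raw metrics directly; this sketch assumes a locally mounted file system using the default metrics port 9567:

```shell
# Print the first few exposed Prometheus metrics of a local mount
curl http://localhost:9567/metrics | head
```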

## Get started
## Getting started {#getting-started}

It is assumed here that Prometheus Server, Grafana and JuiceFS clients are all running on the same host, in which

@@ -69,11 +69,11 @@ Then, create a dashboard using [`grafana_template.json`](https://github.com/juic

![](../images/grafana-dashboard.jpg)

## Collecting monitoring metrics
## Collecting monitoring metrics {#collecting-metrics}

There are different ways to collect monitoring metrics depending on how JuiceFS is deployed, which are described below.

### Mount point
### Mount point {#mount-point}

When the JuiceFS file system is mounted via the [`juicefs mount`](../reference/command_reference.md#mount) command, you can collect monitoring metrics via the address `http://localhost:9567/metrics`, or you can customize it via the `--metrics` option. For example:

@@ -274,7 +274,7 @@ For each instance registered to Consul, its `serviceName` is `juicefs`, and the

The meta of each instance contains two aspects: `hostname` and `mountpoint`. When `mountpoint` is `s3gateway`, it means that the instance is an S3 gateway.

## Visualize monitoring metrics
## Visualize monitoring metrics {#visualize-metrics}

### Grafana dashboard template

@@ -289,6 +289,6 @@ A sample Grafana dashboard looks like this:

![JuiceFS Grafana dashboard](../images/grafana_dashboard.png)

## Monitoring metrics reference
## Monitoring metrics reference {#metrics-reference}

Please refer to the ["JuiceFS Metrics"](../reference/p8s_metrics.md) document.
11 changes: 1 addition & 10 deletions docs/en/benchmark/benchmark.md
@@ -31,13 +31,4 @@ It shows JuiceFS can provide significantly more metadata IOPS than the other two

## Analyze performance

There is a virtual file called `.accesslog` in the root of JuiceFS to show all the operations and the time they take, for example:

```
$ cat /jfs/.accesslog
2021.01.15 08:26:11.003330 [uid:0,gid:0,pid:4403] write (17669,8666,4993160): OK <0.000010>
2021.01.15 08:26:11.003473 [uid:0,gid:0,pid:4403] write (17675,198,997439): OK <0.000014>
2021.01.15 08:26:11.003616 [uid:0,gid:0,pid:4403] write (17666,390,951582): OK <0.000006>
```

The last number in each line is the time (in seconds) the current operation takes. You can use this directly to debug and analyze performance issues, or try `./juicefs profile /jfs` to monitor real time statistics. Please run `./juicefs profile -h` or refer [here](../benchmark/operations_profiling.md) to learn more about this subcommand.
See [Real-Time Performance Monitoring](../administration/fault_diagnosis_and_analysis.md#performance-monitor) if you encounter performance issues.
56 changes: 0 additions & 56 deletions docs/en/benchmark/operations_profiling.md

This file was deleted.
