feat: add autoware_node_death_monitor package for monitoring node crashes #1786

Draft
wants to merge 4 commits into
base: tier4/main
18 changes: 18 additions & 0 deletions system/autoware_process_alive_monitor/CMakeLists.txt
@@ -0,0 +1,18 @@
cmake_minimum_required(VERSION 3.14)
project(autoware_process_alive_monitor)

find_package(autoware_cmake REQUIRED)
autoware_package()

ament_auto_add_library(${PROJECT_NAME} SHARED
  src/autoware_process_alive_monitor.cpp
)

rclcpp_components_register_node(${PROJECT_NAME}
  PLUGIN "autoware::process_alive_monitor::ProcessAliveMonitor"
  EXECUTABLE ${PROJECT_NAME}_node)

ament_auto_package(INSTALL_TO_SHARE
  config
  launch
)
85 changes: 85 additions & 0 deletions system/autoware_process_alive_monitor/README.md
@@ -0,0 +1,85 @@
# autoware_process_alive_monitor

This package provides a monitoring node that detects ROS 2 node crashes by analyzing `launch.log` files, rather than subscribing to `/rosout` logs.

---

## Overview

- **Node name**: `autoware_process_alive_monitor`
- **Monitored file**: `launch.log`
- **Detected event**: Looks for lines containing the substring `"process has died"` and extracts the node name and exit code.

When a crash or unexpected shutdown occurs, `ros2 launch` typically outputs a line in `launch.log` such as:

```text
[ERROR] [node_name-1]: process has died [pid 12345, exit code 139, cmd '...']
```

The `autoware_process_alive_monitor` node continuously reads the latest `launch.log` file, detects these messages, and logs a warning or marks the node as "dead."
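
As a rough illustration of the detection step, the sketch below extracts the node name and exit code from a line in the format shown above. The `ProcessDeath` struct, the `parse_death_line` function, and the regular expression are assumptions made for this example; they are not taken from the package source, which may parse the line differently.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical result type for this sketch; not part of the package.
struct ProcessDeath
{
  std::string node_name;  // launch-style name, e.g. "node_name-1"
  int exit_code;
};

// Returns the node name and exit code if the line reports a process death,
// or std::nullopt for any other line.
std::optional<ProcessDeath> parse_death_line(const std::string & line)
{
  // Cheap substring check first, so the regex only runs on candidate lines.
  if (line.find("process has died") == std::string::npos) {
    return std::nullopt;
  }
  // Assumed layout: "[<name>]: process has died [pid <n>, exit code <n>, ..."
  static const std::regex pattern(
    R"(\[([^\]]+)\]:\s*process has died \[pid \d+, exit code (-?\d+))");
  std::smatch match;
  if (!std::regex_search(line, match, pattern)) {
    return std::nullopt;
  }
  return ProcessDeath{match[1].str(), std::stoi(match[2].str())};
}
```

Using `std::regex_search` rather than `std::regex_match` lets the pattern tolerate a timestamp or severity prefix before the bracketed node name.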

---

## How it Works

1. **Find `launch.log`**:
   - First, checks the `ROS_LOG_DIR` environment variable.
   - If it is not set, falls back to `~/.ros/log`.
   - Identifies the latest log directory based on modification time.
2. **Monitor `launch.log`**:
   - Reads the file from the last known position to detect new log entries.
   - Looks for lines containing `"process has died"`.
   - Extracts the node name and exit code.
3. **Filtering**:
   - **Ignored node names**: Nodes matching patterns in `ignore_node_names` are skipped.
   - **Ignored exit codes**: Process deaths whose exit code is listed in `ignore_exit_codes` are not flagged as errors.
4. **Regular Updates**:
   - A timer periodically reads new entries from `launch.log`.
   - Dead nodes are currently reported through log output; publishing them as diagnostics is planned.
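
Steps 1 and 2 can be sketched with the C++17 standard library alone. The helper names (`resolve_log_base_dir`, `find_latest_launch_log`, `read_new_lines`) and the use of `std::streamoff` for the read position are assumptions for this example, not the package's actual implementation. The sketch also assumes `launch.log` sits directly inside each per-run log directory.

```cpp
#include <cstdlib>
#include <filesystem>
#include <fstream>
#include <optional>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Base log directory: ROS_LOG_DIR if set and non-empty, else ~/.ros/log.
fs::path resolve_log_base_dir()
{
  if (const char * dir = std::getenv("ROS_LOG_DIR"); dir != nullptr && *dir != '\0') {
    return fs::path(dir);
  }
  const char * home = std::getenv("HOME");
  return fs::path(home != nullptr ? home : ".") / ".ros" / "log";
}

// launch.log inside the most recently modified subdirectory of base_dir.
std::optional<fs::path> find_latest_launch_log(const fs::path & base_dir)
{
  std::error_code ec;
  std::optional<fs::path> latest;
  fs::file_time_type latest_time{};
  for (const auto & entry : fs::directory_iterator(base_dir, ec)) {
    std::error_code entry_ec;
    if (!entry.is_directory(entry_ec)) continue;
    const fs::path candidate = entry.path() / "launch.log";
    if (!fs::exists(candidate, entry_ec)) continue;
    const auto mtime = fs::last_write_time(entry.path(), entry_ec);
    if (entry_ec) continue;
    if (!latest || mtime > latest_time) {
      latest = candidate;
      latest_time = mtime;
    }
  }
  return latest;
}

// Returns complete lines appended since `pos` and advances `pos` past them.
std::vector<std::string> read_new_lines(const fs::path & file, std::streamoff & pos)
{
  std::vector<std::string> lines;
  std::ifstream in(file, std::ios::binary);
  if (!in) return lines;
  in.seekg(0, std::ios::end);
  const std::streamoff size = in.tellg();
  if (size < pos) pos = 0;  // file was truncated or replaced: start over
  in.seekg(pos, std::ios::beg);
  std::string line;
  while (std::getline(in, line)) {
    if (in.eof()) break;  // unterminated last line: pick it up on the next call
    lines.push_back(line);
    pos = in.tellg();
  }
  return lines;
}
```

Advancing the position only after a complete line means a line that `ros2 launch` is still writing is read whole on a later timer tick rather than split in two.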

---

## Parameters

| Parameter Name | Type | Default | Description |
| ------------------- | ---------- | ----------------- | ---------------------------------------------------------- |
| `ignore_node_names` | `string[]` | `[]` (empty list) | Node name patterns to ignore. E.g., `['rviz2']`. |
| `ignore_exit_codes` | `int[]` | `[]` (empty list) | Exit codes to ignore, e.g., `0` (clean exit) or `130` (exit after Ctrl+C). |
| `check_interval` | `double` | `1.0` | Timer interval (seconds) for scanning the log file. |
| `enable_debug` | `bool` | `false` | Enables debug logging for detailed output. |

Example **`autoware_process_alive_monitor.param.yaml`**:

```yaml
autoware_process_alive_monitor:
  ros__parameters:
    ignore_node_names:
      - rviz2
      - teleop_twist_joy
    ignore_exit_codes:
      - 0
      - 130
    check_interval: 1.0
    enable_debug: false
```
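
With parameters like these, the filtering step might look like the sketch below. The `should_ignore` helper is hypothetical, and substring matching on the node name is an assumption: the README only says "patterns", so the actual node may match differently (for example, against the full `node_name-#` form).

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: true if a reported process death should be suppressed.
bool should_ignore(
  const std::string & node_name, std::int64_t exit_code,
  const std::vector<std::string> & ignore_node_names,
  const std::vector<std::int64_t> & ignore_exit_codes)
{
  // Substring match (assumed), so "rviz2" also matches the launch-style "rviz2-3".
  const bool name_ignored = std::any_of(
    ignore_node_names.begin(), ignore_node_names.end(),
    [&](const std::string & pattern) {
      return !pattern.empty() && node_name.find(pattern) != std::string::npos;
    });
  // Exit codes are compared exactly.
  const bool code_ignored =
    std::find(ignore_exit_codes.begin(), ignore_exit_codes.end(), exit_code) !=
    ignore_exit_codes.end();
  return name_ignored || code_ignored;
}
```

Note that substring matching is permissive: a pattern such as `rviz` would also suppress reports for any other node whose name contains it.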

---

## Unimplemented Features

1. **Heartbeat Monitoring**:

   - Will publish a heartbeat topic that can be monitored by `topic_state_monitor`.
   - The `topic_state_monitor` will check the topic's publishing frequency to confirm the node is operational.

2. **Diagnostic Information**:
   - When a process death is detected, the node will publish to the `/diagnostics` topic.
   - This feature is planned but not yet implemented.

---

## Limitations

- **To be written**: TBD.

---
@@ -0,0 +1,18 @@
/**:
  ros__parameters:
    # Node names to exclude from monitoring (Note: be careful with the "[node_name-#]" format)
    # Example: Do not issue a warning if rviz2 crashes.
    ignore_node_names:
      - rviz2

    # Exit codes to exclude from monitoring (e.g., Ctrl+C)
    # Example: 0, 130 are considered normal exits and not treated as errors.
    ignore_exit_codes:
      - 0
      - 130

    # Check interval (seconds)
    check_interval: 1.0

    # Enable/disable debug output
    enable_debug: false
@@ -0,0 +1,72 @@
// Copyright 2025 Tier IV, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

#ifndef AUTOWARE_PROCESS_ALIVE_MONITOR__AUTOWARE_PROCESS_ALIVE_MONITOR_HPP_
#define AUTOWARE_PROCESS_ALIVE_MONITOR__AUTOWARE_PROCESS_ALIVE_MONITOR_HPP_

#include "rclcpp/rclcpp.hpp"

#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <string>
#include <unordered_map>
#include <vector>

namespace autoware::process_alive_monitor
{

class ProcessAliveMonitor : public rclcpp::Node
{
public:
  /**
   * @brief Constructor for ProcessAliveMonitor
   * @param options Node options for configuration
   */
  explicit ProcessAliveMonitor(const rclcpp::NodeOptions & options);

private:
  /**
   * @brief Read and process new content appended to launch.log
   */
  void read_launch_log_diff();

  /**
   * @brief Parse a single line from the log for process death information
   * @param line The log line to parse
   */
  void parse_log_line(const std::string & line);

  /**
   * @brief Timer callback to report and manage dead node list
   */
  void on_timer();

  // Map to track dead nodes: [node_name-#] -> true
  std::unordered_map<std::string, bool> dead_nodes_;

  rclcpp::TimerBase::SharedPtr timer_;

  // Launch log file path and read position
  std::filesystem::path launch_log_path_;
  size_t last_file_pos_{static_cast<size_t>(-1)};

  // Parameters
  std::vector<std::string> ignore_node_names_;  // Node names to exclude from monitoring
  std::vector<int64_t> ignore_exit_codes_;      // Exit codes to ignore (e.g., normal termination)
  double check_interval_{1.0};                  // Check interval in seconds
  bool enable_debug_{false};                    // Enable debug output
};

} // namespace autoware::process_alive_monitor

#endif // AUTOWARE_PROCESS_ALIVE_MONITOR__AUTOWARE_PROCESS_ALIVE_MONITOR_HPP_
@@ -0,0 +1,12 @@
<launch>
  <!-- Parameter -->
  <arg name="config_file" default="$(find-pkg-share autoware_process_alive_monitor)/config/autoware_process_alive_monitor.param.yaml"/>

  <!-- Set log level -->
  <arg name="log_level" default="info"/>

  <node pkg="autoware_process_alive_monitor" exec="autoware_process_alive_monitor_node" name="process_alive_monitor" output="screen" args="--ros-args --log-level $(var log_level)">
    <!-- Parameter -->
    <param from="$(var config_file)"/>
  </node>
</launch>
23 changes: 23 additions & 0 deletions system/autoware_process_alive_monitor/package.xml
@@ -0,0 +1,23 @@
<?xml version="1.0"?>
<package format="3">
  <name>autoware_process_alive_monitor</name>
  <version>0.0.1</version>
  <description>The process_alive_monitor package</description>

  <maintainer email="[email protected]">Kyoichi Sugahara</maintainer>
  <license>Apache License 2.0</license>

  <buildtool_depend>ament_cmake_auto</buildtool_depend>
  <buildtool_depend>autoware_cmake</buildtool_depend>

  <depend>rcl_interfaces</depend>
  <depend>rclcpp</depend>
  <depend>rclcpp_components</depend>

  <test_depend>ament_cmake_gtest</test_depend>
  <test_depend>ament_lint_auto</test_depend>

  <export>
    <build_type>ament_cmake</build_type>
  </export>
</package>