Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(system_monitor): check UDP network errors #9538

Merged
merged 17 commits into from
Dec 20, 2024

Conversation

iwatake2222
Copy link
Contributor

@iwatake2222 iwatake2222 commented Dec 2, 2024

Description

Why this PR is needed

  • UDP packet drops make Autoware unstable. Therefore, this is an important indicator for determining whether Autoware is functioning correctly
  • There are several causes of UDP packet drops, but the primary reason is drops in the send/receive buffer. This is particularly true in the case of the loopback network

What this PR changs

  • Monitor UDP recv buffer errors and UDP send buffer errors

Related links

None

How was this PR tested?

  • total UDP rcv buf errors and total UDP snd buf errors are monitored, and they are retrieved from /proc/net/snmp
  • When errors per unit time is greater than threshold (>=1), status becomes warning
  • The CPU usage of system_monitor doesn't change
    • Without this PR: 13.6% / 1600%
    • With this PR: 12.6% / 1600%
    • (the CPU usage of system_monitor in 16-core ECU)

image

How to cause UDP buf errors

Run the following scripts

import socket
import time

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# To cause snd buf error, enable the following line and run the following command
# $> sudo tc qdisc add dev lo root netem delay 100ms
# $> sudo tc qdisc del dev lo root    # fix setting
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4)

target_ip = "127.0.0.1"
target_port = 12345

while True:
    sock.sendto(b"a" * 1024, (target_ip, target_port))
    time.sleep(0.00001)
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1)
sock.bind(("0.0.0.0", 12345))

while True:
    try:
        data, addr = sock.recvfrom(1024)
    except Exception as e:
        print(f"Error: {e}")

Notes for reviewers

None.

Interface changes

Topic changes

Additions and removals

None

Modifications

A new member named net_monitor: UDP Buf Errors is added into /diagnostics topic

ROS Parameter Changes

Additions and removals

system/system_monitor/config/net_monitor.param.yaml

Version Parameter Name Type Default Value Description
New udp_buf_errors_check_duration int 1 UDP buf errors check duration
New udp_buf_errors_check_count int 1 Generates warning when count of UDP buf errors during udp_buf_errors_check_duration reaches a specified value or higher

Modifications

None

Effects on system behavior

None.

@github-actions github-actions bot added type:documentation Creating or refining documentation. (auto-assigned) component:system System design and integration. (auto-assigned) labels Dec 2, 2024
Copy link

github-actions bot commented Dec 2, 2024

Thank you for contributing to the Autoware project!

🚧 If your pull request is in progress, switch it to draft mode.

Please ensure:

@TetsuKawa TetsuKawa added run:build-and-test-differential Mark to enable build-and-test-differential workflow. (used-by-ci) tag:require-cuda-build-and-test labels Dec 2, 2024
@iwatake2222 iwatake2222 marked this pull request as ready for review December 3, 2024 11:33
@iwatake2222
Copy link
Contributor Author

@ito-san @TetsuKawa
Could you review this PR?
Also, please re-add run-build-and-test-differential label to run worflows (some workflows are stuck for some reasons...)

Signed-off-by: takeshi.iwanari <[email protected]>
@TetsuKawa TetsuKawa added run:build-and-test-differential Mark to enable build-and-test-differential workflow. (used-by-ci) and removed run:build-and-test-differential Mark to enable build-and-test-differential workflow. (used-by-ci) labels Dec 5, 2024
Copy link

codecov bot commented Dec 5, 2024

Codecov Report

Attention: Patch coverage is 0% with 121 lines in your changes missing coverage. Please review.

Project coverage is 29.77%. Comparing base (eef298b) to head (b5e2f63).
Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
...tem/system_monitor/src/net_monitor/net_monitor.cpp 0.00% 121 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##             main    #9538    +/-   ##
========================================
  Coverage   29.77%   29.77%            
========================================
  Files        1442     1442            
  Lines      108570   108622    +52     
  Branches    42621    42645    +24     
========================================
+ Hits        32324    32344    +20     
- Misses      72940    73100   +160     
+ Partials     3306     3178   -128     
Flag Coverage Δ *Carryforward flag
differential 0.00% <0.00%> (?)
total 29.81% <ø> (+0.03%) ⬆️ Carriedforward from 3730113

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@iwatake2222 iwatake2222 requested a review from ito-san December 5, 2024 14:38
@iwatake2222
Copy link
Contributor Author

@ito-san @TetsuKawa
Thank you for reviewing it. I updated the code as you suggested.
Could you review it again please?

@TetsuKawa
Copy link
Contributor

LGTM
@ito-san
I leave the rest to your judgment.

@ito-san
Copy link
Contributor

ito-san commented Dec 9, 2024

LGTM @ito-san I leave the rest to your judgment.

plz let me test on my side.

@iwatake2222
Copy link
Contributor Author

@ito-san
Friendly reminder...

@ito-san
Copy link
Contributor

ito-san commented Dec 19, 2024

@ito-san Friendly reminder...

Sorry, the codebase was different between this branch and my project, so my Autoware was broken after I merged this branch.
I've updated the branch, so please wait a little longer.

@ito-san
Copy link
Contributor

ito-san commented Dec 19, 2024

Verified that no errors occur during startup
image

@ito-san
Copy link
Contributor

ito-san commented Dec 19, 2024

It seems to be working well
image

Copy link
Contributor

@ito-san ito-san left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ito-san ito-san added run:build-and-test-differential Mark to enable build-and-test-differential workflow. (used-by-ci) and removed run:build-and-test-differential Mark to enable build-and-test-differential workflow. (used-by-ci) labels Dec 19, 2024
@ito-san
Copy link
Contributor

ito-san commented Dec 19, 2024

@iwatake2222
Please fix this.
/w/autoware.universe/autoware.universe/system/system_monitor/src/net_monitor/net_monitor.cpp:814:26: error: folding type 'unsigned int' into type 'int' might result in loss of precision [bugprone-fold-init-type,-warnings-as-errors]
value_per_unit_time
= std::accumulate(queue
.begin(), queue_.end(), 0);
^

Signed-off-by: takeshi.iwanari <[email protected]>
Signed-off-by: takeshi.iwanari <[email protected]>
@iwatake2222
Copy link
Contributor Author

@ito-san
Could you add label to run workflow?

@iwatake2222
Copy link
Contributor Author

@ito-san
Thank you for reviewing it !!
I fixed the code and all required checks are passed (pre-commit-optional fails due to link error, but it's not related to this PR)

Could you merge this PR? I'm asking because Only those with write access to this repository can merge pull requests.

@ito-san ito-san merged commit e5243e4 into autowarefoundation:main Dec 20, 2024
30 of 33 checks passed
@iwatake2222 iwatake2222 deleted the feat_add_udp_errors branch December 20, 2024 08:06
kyoichi-sugahara pushed a commit to kyoichi-sugahara/autoware.universe that referenced this pull request Dec 25, 2024
* feat(system_monitor): generalize logic for /proc/net/snmp

Signed-off-by: takeshi.iwanari <[email protected]>

* feat(system_monitor): add UDP buf errors check

Signed-off-by: takeshi.iwanari <[email protected]>

* fix calculation for errors per unit time at the first time

Signed-off-by: takeshi.iwanari <[email protected]>

* style(pre-commit): autofix

* organize code

Signed-off-by: takeshi.iwanari <[email protected]>

* style(pre-commit): autofix

* fix warnings

Signed-off-by: takeshi.iwanari <[email protected]>

* remove unnecessary fmt::format

Signed-off-by: takeshi.iwanari <[email protected]>

* organize code for metrics from /proc/net/snmp

Signed-off-by: takeshi.iwanari <[email protected]>

* style(pre-commit): autofix

* separate ROS 2 parameters from constructor

Signed-off-by: takeshi.iwanari <[email protected]>

* suppress log

Signed-off-by: takeshi.iwanari <[email protected]>

* style(pre-commit): autofix

* fix bugprone-fold-init-type

Signed-off-by: takeshi.iwanari <[email protected]>

* dummy commit to kick workflows

Signed-off-by: takeshi.iwanari <[email protected]>

---------

Signed-off-by: takeshi.iwanari <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: ito-san <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component:system System design and integration. (auto-assigned) run:build-and-test-differential Mark to enable build-and-test-differential workflow. (used-by-ci) type:documentation Creating or refining documentation. (auto-assigned)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants