Add field for bulk chunk size in flex counter #950

stephenxs · 2024-11-11T04:04:29Z

Why I did it
Added a new field in FLEX_COUNTER_TABLE to represent the bulk chunk size and bulk chunk size per counter ID for bulk counter polling.
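As a rough illustration of what the two settings could look like on a counter group, the sketch below models a group entry as a plain Python dict and parses a per-counter-ID setting. The field names `BULK_CHUNK_SIZE` and `BULK_CHUNK_SIZE_PER_PREFIX`, the `prefix:size;prefix:size` value format, and the helper `parse_chunk_sizes` are assumptions made for this example. They are not taken from this PR's diff.

```python
# Hypothetical field names and value format, for illustration only.
def parse_chunk_sizes(entry):
    """Return (default_chunk_size, {counter_id_prefix: chunk_size})."""
    # In this sketch, 0 means "no chunking": poll everything in one bulk call.
    default_size = int(entry.get("BULK_CHUNK_SIZE", 0))
    per_prefix = {}
    raw = entry.get("BULK_CHUNK_SIZE_PER_PREFIX", "")
    # Skip empty items so an unset or empty field yields an empty mapping.
    for item in filter(None, raw.split(";")):
        prefix, _, size = item.partition(":")
        per_prefix[prefix.strip()] = int(size)
    return default_size, per_prefix


# Example entry for a port counter group (values are made up).
port_group = {
    "POLL_INTERVAL": "1000",
    "BULK_CHUNK_SIZE": "64",
    "BULK_CHUNK_SIZE_PER_PREFIX": "SAI_PORT_STAT_IF_OUT_QLEN:0;SAI_PORT_STAT_IF_IN_FEC:32",
}
print(parse_chunk_sizes(port_group))
```

The per-prefix mapping lets a subset of counter IDs use a different chunk size from the group default, which is the "bulk chunk size per counter ID" idea described above.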

liuh-80 · 2024-11-18T02:04:05Z

@stephenxs, can you add a PR description and check the failed test?

liuh-80 · 2024-11-18T02:04:15Z

/azpw run Azure.sonic-swss-common

qiluo-msft · 2024-12-02T06:16:54Z

Could you link to the HLD PR?

stephenxs · 2024-12-10T23:28:07Z

> Could you link to the HLD PR?

sonic-net/SONiC#1864

mssonicbld · 2024-12-13T06:34:31Z

/azp run

azure-pipelines · 2024-12-13T06:34:43Z

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Stephen Sun <[email protected]>

mssonicbld · 2024-12-24T03:20:03Z

/azp run

azure-pipelines · 2024-12-24T03:20:17Z

Azure Pipelines successfully started running 1 pipeline(s).

mssonicbld · 2025-01-13T18:40:44Z

Cherry-pick PR to 202411: #968

**What I did**

Optimize counter-polling performance in terms of polling interval accuracy.

1. Enable bulk counter polling to run at a smaller chunk size.

   There is one counter-polling thread for each counter group. All such threads can compete for the critical sections at the vendor SAI level: a counter-polling thread has to wait for a critical section while another thread is in it, which introduces latency for the waiting counter group.

   An example is the competition between the PFC watchdog and the port counter groups. The port counter group contains many counters and is polled in bulk mode, which takes a relatively long time. The PFC watchdog counter group contains only a few counters but is polled at a much shorter interval. Sometimes the PFC watchdog counters must wait before polling, which makes the polling interval inaccurate and prevents a PFC storm from being detected in time.

   To resolve this issue, we can reduce the chunk size of the port counter group. By default, the port counter group polls the counters of all ports in a single bulk operation. With a smaller chunk size, it polls the counters in several bulk operations, each polling the counters of a subset (whose size = `chunk size`) of all ports. Furthermore, we support setting the chunk size on a per-counter-ID basis.

   By doing so, the port counter group stays in the critical section for a shorter time, and the PFC watchdog is more likely to be scheduled to poll counters and detect the PFC storm in time.

2. Collect the timestamp immediately after the vendor SAI API returns.

   Currently, many counter groups require a Lua plugin that executes at each polling interval to calculate rates, detect certain events, and so on, e.g., for the PFC watchdog counter group to detect a PFC storm. In this case, the polling interval is calculated from the difference between the timestamps of the `current` and `last` polls, to avoid deviation due to scheduling latency.

   However, the timestamp is collected in the Lua plugin, which runs several steps after the SAI API returns and is executed in a different context (redis-server). Both introduce even larger deviations. To overcome this, we collect the timestamp immediately after the SAI API returns.

Depends on

1. sonic-net/sonic-swss-common#950
2. sonic-net/sonic-sairedis#1519

**Why I did it**

**How I verified it**

Ran the regression tests and observed counter-polling performance. A comparison test shows very good results when any or all of the above optimizations are applied.

**Details if related**

For 2, each counter group contains more than one counter context, based on the type of objects. A counter context is mapped from (group, object type). However, the counters fetched by different counter groups are pushed into the same entry for the same object. For example:

- The PFC_WD group contains counters of ports and queues.
- The PORT group contains counters of ports.
- The QUEUE_STAT group contains counters of queues.

Both the PFC_WD and PORT groups push counter data into the item representing a port, but each counter group has its own polling interval, which means counter IDs polled by different counter groups can be polled with different timestamps.

We use the name of a counter group to identify the timestamp of that counter group. For example, in a port counter entry, PORT_timestamp represents the last time the port counter group polled the counters, and PFC_WD_timestamp represents the last time the PFC watchdog counter group polled the counters.
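The two optimizations described above can be sketched together as follows. This is a minimal Python model under stated assumptions, not the actual syncd implementation: `fake_bulk_get_stats` stands in for the vendor SAI bulk call, and `chunked` and `poll_group` are hypothetical names. The sketch splits the ports into chunks of `chunk_size`, issues one bulk call per chunk, and records a timestamp right after each call returns, stored under a `<group>_timestamp` key as in the details above. The use of a monotonic clock is a choice made for this sketch only.

```python
import time


def chunked(items, chunk_size):
    """Yield consecutive slices of at most chunk_size items.

    A chunk_size of 0 or less means a single bulk operation (assumption).
    """
    if chunk_size <= 0:
        chunk_size = max(len(items), 1)
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def fake_bulk_get_stats(ports):
    """Stand-in for a vendor SAI bulk call (hypothetical)."""
    return {port: {"SAI_PORT_STAT_IF_IN_OCTETS": 0} for port in ports}


def poll_group(group_name, ports, chunk_size, bulk_get=fake_bulk_get_stats):
    """Poll all ports in chunks; return (per-port results, number of bulk calls)."""
    results = {}
    calls = 0
    for chunk in chunked(ports, chunk_size):
        stats = bulk_get(chunk)
        # Take the timestamp right after the bulk call returns,
        # before any further processing of the results.
        timestamp = time.monotonic_ns()
        calls += 1
        for port, counters in stats.items():
            entry = dict(counters)
            entry[f"{group_name}_timestamp"] = timestamp
            results[port] = entry
    return results, calls


ports = [f"Ethernet{i}" for i in range(10)]
results, calls = poll_group("PORT", ports, chunk_size=4)
print(calls)  # 3 bulk calls: 4 + 4 + 2 ports
```

Between two chunks the polling thread is outside the bulk call, which is the window in which another group such as the PFC watchdog can get its turn.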

stephenxs mentioned this pull request Nov 11, 2024

Optimize counter polling interval by making it more accurate sonic-net/sonic-sairedis#1457

Merged

liuh-80 previously approved these changes Nov 18, 2024

stephenxs mentioned this pull request Nov 25, 2024

Optimize counter polling interval by making it more accurate sonic-net/sonic-swss#3391

Merged

stephenxs force-pushed the bulk-chunk-size branch from 9356bc2 to 7d6765a Compare November 25, 2024 09:55

qiluo-msft previously approved these changes Dec 2, 2024

stephenxs force-pushed the bulk-chunk-size branch from 7d6765a to 3ebd251 Compare December 4, 2024 13:37

FengPan-Frank previously approved these changes Dec 9, 2024

dprital added the Request for 202411 Branch label Dec 10, 2024

stephenxs dismissed stale reviews from FengPan-Frank, qiluo-msft, and liuh-80 via f014c72 December 13, 2024 06:34

stephenxs force-pushed the bulk-chunk-size branch from f642875 to f014c72 Compare December 13, 2024 06:34

stephenxs added the Request for 202405 Branch label Dec 13, 2024

stephenxs mentioned this pull request Dec 13, 2024

Enhance bulk counter poll HLD and implementation for better accuracy and performance sonic-net/SONiC#1864

Merged

stephenxs added 2 commits December 24, 2024 11:19

bulk chunk size

2d5615d

Signed-off-by: Stephen Sun <[email protected]>

Support bulk chunk size per counter ID subset

a8f7b74

Signed-off-by: Stephen Sun <[email protected]>

stephenxs force-pushed the bulk-chunk-size branch from f014c72 to a8f7b74 Compare December 24, 2024 03:19

liat-grozovik removed the Request for 202405 Branch label Dec 24, 2024

kcudnik approved these changes Dec 24, 2024

kcudnik merged commit c872f42 into sonic-net:master Dec 24, 2024
15 checks passed

stephenxs deleted the bulk-chunk-size branch December 24, 2024 14:07

r12f added the Request for msft-202412 Branch label Dec 27, 2024

kperumalbfn added the Approved for 202411 Branch label Jan 13, 2025

mssonicbld mentioned this pull request Jan 13, 2025

[action] [PR:950] Add field for bulk chunk size in flex counter #968

Merged

mssonicbld added the Created PR to 202411 Branch label Jan 13, 2025

mssonicbld added Included in 202411 Branch and removed Created PR to 202411 Branch labels Jan 13, 2025

r12f added Approved for msft-202412 Branch and removed Request for msft-202412 Branch labels Jan 23, 2025

mssonicbld mentioned this pull request Feb 6, 2025

[action] [PR:3391] Optimize counter polling interval by making it more accurate sonic-net/sonic-swss#3500

Merged
