Add field for bulk chunk size in flex counter #950
Merged
liuh-80
previously approved these changes
Nov 18, 2024
@stephenxs, can you add a PR description and check the failed test?
/azpw run Azure.sonic-swss-common
9356bc2
to
7d6765a
Compare
qiluo-msft
previously approved these changes
Dec 2, 2024
Could you link to the HLD PR?
7d6765a
to
3ebd251
Compare
FengPan-Frank
previously approved these changes
Dec 9, 2024
f014c72
f642875
to
f014c72
Compare
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Signed-off-by: Stephen Sun <[email protected]>
Signed-off-by: Stephen Sun <[email protected]>
f014c72
to
a8f7b74
Compare
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
kcudnik
approved these changes
Dec 24, 2024
Cherry-pick PR to 202411: #968
mssonicbld
added a commit
to mssonicbld/sonic-swss
that referenced
this pull request
Feb 6, 2025
**What I did**

Optimize the counter-polling performance in terms of polling interval accuracy.

1. Enable bulk counter-polling to run at a smaller chunk size.

   There is one counter-polling thread for each counter group. All such threads compete for the critical sections at the vendor SAI level, so a counter-polling thread may have to wait for a critical section while another thread holds it, which introduces latency for the waiting counter group.

   An example is the competition between the PFC watchdog and the port counter groups. The port counter group contains many counters and is polled in bulk mode, which takes a relatively long time. The PFC watchdog counter group contains only a few counters and is polled quickly. Sometimes the PFC watchdog counters must wait before polling, which makes the polling interval inaccurate and prevents a PFC storm from being detected in time.

   To resolve this issue, we can reduce the chunk size of the port counter group. By default, the port counter group polls the counters of all ports in a single bulk operation. With a smaller chunk size, it polls the counters in several bulk operations, each of which polls the counters of a subset (of size `chunk size`) of all ports. Furthermore, we support setting the chunk size on a per-counter-ID basis.

   By doing so, the port counter group stays in the critical section for a shorter time, and the PFC watchdog is more likely to be scheduled to poll counters and detect the PFC storm in time.

2. Collect the timestamp immediately after the vendor SAI API returns.

   Currently, many counter groups require a Lua plugin that runs at each polling interval to calculate rates, detect certain events, and so on; for example, the PFC watchdog counter group uses one to detect a PFC storm. In this case, the polling interval is calculated from the difference between the timestamps of the `current` and `last` polls, to avoid deviation due to scheduling latency. However, the timestamp is collected in the Lua plugin, which runs several steps after the SAI API returns and executes in a different context (redis-server). Both introduce even larger deviations. To overcome this, we collect the timestamp immediately after the SAI API returns.

Depends on

1. sonic-net/sonic-swss-common#950
2. sonic-net/sonic-sairedis#1519

**Why I did it**

**How I verified it**

Ran the regression tests and observed counter-polling performance. A comparison test shows very good results when any or all of the above optimizations are applied.

**Details if related**

For 2, each counter group contains more than one counter context, based on the type of objects. A counter context is identified by the pair (group, object type). However, the counters fetched by different counter groups are pushed into the same entry for the same object. For example:

- The PFC_WD group contains counters of ports and queues.
- The PORT group contains counters of ports.
- The QUEUE_STAT group contains counters of queues.

Both the PFC_WD and PORT groups push counter data into the item representing a port, but each counter group has its own polling interval, which means counter IDs polled by different counter groups can carry different timestamps. We use the name of a counter group to identify its timestamp. For example, in a port counter entry, PORT_timestamp represents the last time the port counter group polled the counters, and PFC_WD_timestamp represents the last time the PFC watchdog counter group polled the counters.
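The two optimizations described in this commit message can be illustrated with a minimal Python sketch. It is not the sairedis implementation: `chunked`, `poll_group`, `bulk_get`, and `store` are hypothetical names invented for the illustration, and taking one timestamp per chunk (rather than one per group poll) is an assumption. Only the `<group>_timestamp` field naming comes from the description above.

```python
import time
from typing import Callable, Dict, Iterator, Sequence


def chunked(items: Sequence[str], chunk_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most chunk_size items."""
    if chunk_size <= 0:
        # Assumption: a non-positive size means "no chunking", i.e. one bulk call.
        chunk_size = max(len(items), 1)
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def poll_group(
    group: str,
    ports: Sequence[str],
    chunk_size: int,
    bulk_get: Callable[[Sequence[str]], Dict[str, Dict[str, int]]],
    store: Dict[str, Dict[str, int]],
    clock: Callable[[], int] = time.monotonic_ns,
) -> int:
    """Poll one counter group in chunks and return the number of bulk calls made.

    bulk_get stands in for the vendor bulk API (hypothetical signature): it
    takes a list of ports and returns {port: {counter_name: value}}.
    """
    calls = 0
    for chunk in chunked(ports, chunk_size):
        values = bulk_get(chunk)  # the critical section is held for this chunk only
        stamp = clock()           # timestamp taken right after the call returns
        for port in chunk:
            entry = store.setdefault(port, {})
            entry.update(values[port])
            # Each group writes its own timestamp field into the shared entry.
            entry[f"{group}_timestamp"] = stamp
        calls += 1
    return calls
```

With five ports and a chunk size of 2, this makes three bulk calls instead of one, so another group's thread gets a chance to enter the critical section between calls. Because the timestamp field is keyed by group name, a later PFC_WD poll of the same port does not overwrite PORT_timestamp.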
mssonicbld
added a commit
to sonic-net/sonic-swss
that referenced
this pull request
Feb 7, 2025
Why I did it
Added a new field in `FLEX_COUNTER_TABLE` to represent the bulk chunk size and the per-counter-ID bulk chunk size for bulk counter polling.
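As a rough illustration of how a group-wide chunk size and per-counter-ID chunk sizes might be combined, here is a small Python sketch. The `COUNTER_ID:SIZE;COUNTER_ID:SIZE` string format and the helper names `parse_overrides` and `chunk_size_for` are assumptions made for this example; the PR text shown here does not state the actual field names or encoding used in `FLEX_COUNTER_TABLE`.

```python
from typing import Dict


def parse_overrides(spec: str) -> Dict[str, int]:
    """Parse 'COUNTER_ID:SIZE;COUNTER_ID:SIZE' into a dict (hypothetical format)."""
    overrides: Dict[str, int] = {}
    for part in spec.split(";"):
        item = part.strip()
        if not item:
            continue  # tolerate empty segments such as a trailing ';'
        counter_id, sep, size = item.rpartition(":")
        if not sep or not counter_id:
            raise ValueError(f"expected COUNTER_ID:SIZE, got {item!r}")
        overrides[counter_id] = int(size)
    return overrides


def chunk_size_for(counter_id: str, default_size: int, overrides: Dict[str, int]) -> int:
    """Return the per-counter-ID chunk size if one is set, else the group default."""
    return overrides.get(counter_id, default_size)
```

A counter ID with an override uses its own chunk size; every other counter ID in the group falls back to the group-wide value.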