rocm_smi: Initial event count and event table initialization event count upper bound mismatch & handling unsupported events #318
+10
−6
Pull Request Description
Fix Event Table Initialization Bug
Tested using utility functions on:
Example Test Commands:
papi_command_line max_xgmi_internode_bw:device=3:target=0
papi_component_avail
Issue and Fix
In `papi/src/components/rocm_smi/rocs.c`, the function pairs `handle_derived_events_count` / `handle_derived_events` and `handle_xgmi_events_count` / `handle_xgmi_events` use differing upper bounds for the number of events.

Problem: The loop iterates from `i = 0` to `ROCS_PCI_BW_VARIANT__LANE_IDX - ROCS_PCI_BW_VARIANT__CURRENT + 1`, which causes it to run one extra iteration beyond the intended range.

Fix: Removing the `+1` ensures that the loop iterates exactly `ROCS_PCI_BW_VARIANT__LANE_IDX - ROCS_PCI_BW_VARIANT__CURRENT` times, correcting the off-by-one error.

Similarly, in `handle_xgmi_events_count` and `handle_xgmi_events`, there is a mismatch in the upper bounds for the number of events: `RSMI_EVNT_XGMI_LAST - RSMI_EVNT_XGMI_FIRST` and `RSMI_EVNT_XGMI_DATA_OUT_LAST - RSMI_EVNT_XGMI_DATA_OUT_FIRST` are used in `handle_xgmi_events_count`, while the `for` loop in `handle_xgmi_events` goes beyond the expected step, leading to an inconsistency.

Additional Fix: Handle Unsupported PCIe Bandwidth Status
Handle `RSMI_STATUS_NOT_SUPPORTED` in PCIe bandwidth retrieval.

Author Checklist
Why this PR exists. Reference all relevant information, including background, issues, test failures, etc
Commits are self contained and only do one thing
Commits have a header of the form: `module: short description`
Commits have a body (whenever relevant) containing a detailed description of the addressed problem and its solution
The PR needs to pass all the tests
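
Illustration of the count/fill mismatch described under Issue and Fix. This is a minimal sketch, not the actual `rocs.c` code. The enum values and the `VARIANT_COUNT`, `count_events`, and `fill_events` names are hypothetical stand-ins chosen for this example. It shows one way to keep a counting pass and a table-filling pass in agreement by deriving both from a single bound:

```c
/* Hypothetical stand-ins for a variant enum; NOT the real rocs.c values. */
enum { VARIANT_CURRENT = 0, VARIANT_A, VARIANT_B, VARIANT_LANE_IDX };

/* Single definition of the per-device variant count, shared by both passes. */
#define VARIANT_COUNT (VARIANT_LANE_IDX - VARIANT_CURRENT)

/* Counting pass: how many table slots the fill pass will need. */
static int count_events(int num_devices)
{
    return num_devices * VARIANT_COUNT;
}

/* Fill pass: writes one entry per (device, variant).
 * Returns the number of entries written, or -1 if the table is too small. */
static int fill_events(int num_devices, int *table, int capacity)
{
    int n = 0;
    for (int d = 0; d < num_devices; ++d) {
        /* Same bound as count_events; a "+ 1" here would overrun the table. */
        for (int i = 0; i < VARIANT_COUNT; ++i) {
            if (n >= capacity) {
                return -1;
            }
            table[n++] = d * 100 + i; /* placeholder event id */
        }
    }
    return n;
}
```

With these stand-in values, `count_events(4)` and `fill_events(4, table, 12)` both yield 12. If the fill loop ran to `VARIANT_COUNT + 1` instead, it would hit the capacity check and return -1, which is the kind of disagreement between the two passes that this PR addresses.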