Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rocm_smi: Initial event count and event table initialization event count upper bound mismatch & handling unsupported events #318

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

djwoun
Copy link
Contributor

@djwoun djwoun commented Feb 12, 2025

Pull Request Description

Fix Event Table Initialization Bug

Tested using utility functions on:

  • Dopamine (5.7.3 & 6.3.2)
  • Odyssey (6.3.0)
  • Histamine0 (5.7.3)

Example Test Commands:

  • papi_command_line max_xgmi_internode_bw:device=3:target=0
  • papi_component_avail

Issue and Fix

In papi/src/components/rocm_smi/rocs.c, the functions handle_derived_events_count and handle_derived_events & handle_xgmi_events_count and handle_xgmi_events have differing upper bounds for the number of events.

Problem:

for (i = 0; i < ROCS_PCI_BW_VARIANT__LANE_IDX - ROCS_PCI_BW_VARIANT__CURRENT + 1; ++i)

The loop iterates from i = 0 to ROCS_PCI_BW_VARIANT__LANE_IDX - ROCS_PCI_BW_VARIANT__CURRENT + 1, which causes it to run one extra iteration beyond the intended range.

Fix:

Removing the +1 ensures that the loop iterates exactly ROCS_PCI_BW_VARIANT__LANE_IDX - ROCS_PCI_BW_VARIANT__CURRENT times, correcting the off-by-one error.

Similarly, in handle_xgmi_events_count and handle_xgmi_events, there is a mismatch in the upper bounds for the number of events:

  • RSMI_EVNT_XGMI_LAST - RSMI_EVNT_XGMI_FIRST and RSMI_EVNT_XGMI_DATA_OUT_LAST - RSMI_EVNT_XGMI_DATA_OUT_FIRST are used in handle_xgmi_events_count.
  • However, the for loop in handle_xgmi_events goes beyond the expected step, leading to an inconsistency.

Additional Fix: Handle Unsupported PCIe Bandwidth Status

  • Added check for RSMI_STATUS_NOT_SUPPORTED in PCIe bandwidth retrieval.
  • Updated comment for future reference.

Author Checklist

  • Description
    Why this PR exists. Reference all relevant information, including background, issues, test failures, etc
  • Commits
    Commits are self contained and only do one thing
    Commits have a header of the form: module: short description
    Commits have a body (whenever relevant) containing a detailed description of the addressed problem and its solution
  • Tests
    The PR needs to pass all the tests

@djwoun djwoun marked this pull request as ready for review February 12, 2025 16:21
@djwoun djwoun marked this pull request as draft February 12, 2025 16:45
@djwoun djwoun marked this pull request as ready for review February 12, 2025 16:51
@Treece-Burgess Treece-Burgess linked an issue Feb 12, 2025 that may be closed by this pull request
@Treece-Burgess Treece-Burgess added component-rocm_smi PRs and Issues related to the rocm_smi component type-bug Issues discussing bugs or PRs fixing bugs status-ready-for-review PR is ready to be reviewed labels Feb 13, 2025
@djwoun djwoun changed the title rocm_smi: Fix Event Table Initialization Bug rocm_smi: Initial event count and event table initialization event count upper bound mismatch & handling unsupported events Feb 14, 2025
@Treece-Burgess
Copy link
Contributor

Treece-Burgess commented Feb 14, 2025

Testing results shown below:

Note: PAPI utilities used were papi_component_avail, papi_native_avail, and papi_command_line.

System Device(s) ROCm Version Passing PAPI Utilities
Histamine0 2 * Radeon VII 5.7.3
Dopamine 2 * MI210's 5.7.3
Frontier 2 * MI250X's 5.7.1
Odyssey 4 * MI300A's 6.3.0

@@ -1022,7 +1022,12 @@ init_device_table(void)

for (i = 0; i < device_count; ++i) {
status = rsmi_dev_pci_bandwidth_get_p(i, &pcie_table[i]);
if (status != RSMI_STATUS_SUCCESS && status != RSMI_STATUS_NOT_YET_IMPLEMENTED) {
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good, what do you think? @adanalis

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree.

src/components/rocm_smi/rocs.c Outdated Show resolved Hide resolved
src/components/rocm_smi/rocs.c Outdated Show resolved Hide resolved
…unt upper bound mismatch & handling unsupported events
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component-rocm_smi PRs and Issues related to the rocm_smi component status-ready-for-review PR is ready to be reviewed type-bug Issues discussing bugs or PRs fixing bugs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rocm smi component disabled on a machine with 2 * MI210's
3 participants