Cuda component init error handling #144

Merged: 4 commits, Dec 20, 2023

Conversation

gcongiu
Contributor

@gcongiu gcongiu commented Dec 19, 2023

Pull Request Description

The cuda component init function now returns the proper CUDA error message in the disabled_reason string. This lets PAPI users pinpoint more precisely what went wrong during component initialization.
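
As a rough illustration of the behavior described above, the sketch below copies an error string into a fixed-size disabled_reason buffer when initialization fails. Everything in it is a hypothetical stand-in: `fake_cuda_error_t`, `fake_get_error_string`, `component_info_t`, `component_init`, and `DISABLED_REASON_LEN` are invented for this example so that it compiles without the CUDA toolkit. In the real component the string would come from the CUDA runtime, and the actual PAPI structures differ.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DISABLED_REASON_LEN 128  /* hypothetical buffer size */

/* Hypothetical stand-ins for CUDA runtime error codes. */
typedef enum {
    FAKE_CUDA_SUCCESS = 0,
    FAKE_CUDA_SYSTEM_NOT_READY = 1
} fake_cuda_error_t;

/* Hypothetical stand-in for an error-string lookup provided by the runtime. */
const char *fake_get_error_string(fake_cuda_error_t err)
{
    switch (err) {
    case FAKE_CUDA_SUCCESS:          return "no error";
    case FAKE_CUDA_SYSTEM_NOT_READY: return "system not yet initialized";
    default:                         return "unknown error";
    }
}

/* Hypothetical, simplified component descriptor. */
typedef struct {
    int  disabled;
    char disabled_reason[DISABLED_REASON_LEN];
} component_info_t;

/* Returns 0 on success and -1 on failure. On failure, the reason is
 * recorded in info->disabled_reason so a tool can print it later. */
int component_init(component_info_t *info, fake_cuda_error_t probe_result)
{
    assert(info != NULL);

    if (probe_result != FAKE_CUDA_SUCCESS) {
        info->disabled = 1;
        /* snprintf always NUL-terminates and truncates if needed. */
        snprintf(info->disabled_reason, sizeof info->disabled_reason,
                 "%s", fake_get_error_string(probe_result));
        return -1;
    }

    info->disabled = 0;
    info->disabled_reason[0] = '\0';
    return 0;
}
```

A caller would pass the result of its first runtime probe to `component_init` and, on a nonzero return, print `disabled_reason` next to the component name.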

Example of a system that is not properly configured:

$ utils/papi_component_avail
Available components and hardware information.
--------------------------------------------------------------------------------
PAPI version             : 7.0.1.0
Operating system         : Linux 6.1.62-1.el9.elrepo.x86_64
Vendor string and code   : AuthenticAMD (2, 0x2)
Model string and code    : AMD EPYC 7742 64-Core Processor (49, 0x31)
CPU revision             : 0.000000
CPUID                    : Family/Model/Stepping 23/49/0, 0x17/0x31/0x00
CPU Max MHz              : 2250
CPU Min MHz              : 1500
Total cores              : 256
SMT threads per core     : 2
Cores per socket         : 64
Sockets                  : 2
Cores per NUMA region    : 32
NUMA regions             : 8
Running in a VM          : no
Number Hardware Counters : 5
Max Multiplex Counters   : 384
Fast counter read (rdpmc): yes
--------------------------------------------------------------------------------

Compiled-in components:
Name:   perf_event              Linux perf_event CPU counters
Name:   perf_event_uncore       Linux perf_event CPU uncore and northbridge
Name:   cuda                    CUDA profiling via NVIDIA CuPTI interfaces
   \-> Disabled: system not yet initialized
Name:   sysdetect               System info detection component

Active components:
Name:   perf_event              Linux perf_event CPU counters
                                Native: 141, Preset: 17, Counters: 5
                                PMUs supported: perf, perf_raw, amd64_fam17h_zen2

Name:   perf_event_uncore       Linux perf_event CPU uncore and northbridge
                                Native: 1, Preset: 0, Counters: 3
                                PMUs supported: amd64_rapl

Name:   sysdetect               System info detection component
                                Native: 0, Preset: 0, Counters: 0


--------------------------------------------------------------------------------

Author Checklist

  • Description
    Why this PR exists. Reference all relevant information, including background, issues, test failures, etc
  • Commits
    Commits are self contained and only do one thing
    Commits have a header of the form: module: short description
    Commits have a body (whenever relevant) containing a detailed description of the addressed problem and its solution
  • Tests
    The PR needs to pass all the tests

With the exception of trivial functions (i.e., functions that cannot
fail), every function should return an error code for proper error
handling. cuptic_device_get_count does not handle the case in which
a CUDA call fails.

With the exception of trivial functions (i.e., functions that cannot
fail), every function should return an error code for proper error
handling. util_gpu_collection_kind does not handle the case in which
a CUDA call fails.

With the exception of trivial functions (i.e., functions that cannot
fail), every function should return an error code for proper error
handling. get_gpu_compute_capability does not handle the case in which
a CUDA call fails.

cudaGetErrorString is used to return the proper disabled_reason message
to users whenever a CUDA-related problem occurs during initialization.
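
The pattern these commit messages describe, returning an error code and passing the result through an output parameter, might look like the sketch below. The names `query_device_count`, `device_get_count_old`, `device_get_count`, and the `ERR_*` codes are hypothetical and do not reproduce the actual PAPI functions; the failing query is simulated so the example runs without a GPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical error codes for this sketch. */
enum { ERR_OK = 0, ERR_QUERY_FAILED = -1 };

/* Simulated low-level query that fails when `simulate_failure` is nonzero.
 * It stands in for a runtime call that may fail. */
int query_device_count(int simulate_failure, int *count)
{
    if (simulate_failure) {
        return ERR_QUERY_FAILED;
    }
    *count = 4;  /* arbitrary value for illustration */
    return ERR_OK;
}

/* Before: the count is the return value, so a failed query cannot be
 * distinguished from a valid result of zero devices. */
int device_get_count_old(int simulate_failure)
{
    int count = 0;
    query_device_count(simulate_failure, &count);  /* error silently dropped */
    return count;
}

/* After: the return value is an error code and the count travels through
 * an output parameter, so callers can react to failures. */
int device_get_count(int simulate_failure, int *count)
{
    assert(count != NULL);

    int err = query_device_count(simulate_failure, count);
    if (err != ERR_OK) {
        *count = 0;   /* leave the output in a defined state */
        return err;
    }
    return ERR_OK;
}
```

With the second form, an init routine can stop at the first failing call and report the reason, instead of continuing with a count it cannot trust.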
@gcongiu gcongiu requested a review from jagode December 20, 2023 08:15
@gcongiu gcongiu merged commit f1d5857 into icl-utk-edu:master Dec 20, 2023
24 of 26 checks passed