[vulkan] Inconsistent segfault at shutdown on NVIDIA hardware #8497

Open
derek-gerstmann opened this issue Dec 6, 2024 · 2 comments
@derek-gerstmann
Contributor

derek-gerstmann commented Dec 6, 2024

This happens intermittently on the Linux worker build-bots, but does not reproduce locally on nearly identical drivers and hardware.

It shows up as a segfault at process exit in the correctness tests, after the tests themselves have run. When it happens, the Vulkan ICD function pointer chain is invalid, and any call to a Vulkan API function segfaults. If we don't clean up, the driver itself crashes. The same symptoms appear under both JIT and AOT.

System details:

Ubuntu 22.04
Vulkan Loader v1.3.296
Vulkan API v1.3.280
NVIDIA Driver v560.35.5.0
NVIDIA GeForce RTX 3070

It appears to be a bug in Vulkan, the NVIDIA driver, or both. Running under the validation layers and the crash detection layers doesn't reveal anything, and we never receive a device-lost error, which makes the failure difficult to detect or handle.

@derek-gerstmann
Contributor Author

It appears llama.cpp may be reporting the same thing:
ggerganov/llama.cpp#10528

derek-gerstmann self-assigned this Dec 6, 2024
@abadams
Member

abadams commented Dec 6, 2024

I can reproduce the same jump to a bad address during NVIDIA driver finalization using llama.cpp by running multiple instances at once (only offloading a few layers to the GPU, so that the multiple instances all fit). Our dev meeting conclusion was to just move Vulkan testing off these bots and onto a Raspberry Pi 5 (partially because #8494 shows us that this is a more useful platform to be testing on anyway).
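
For anyone who wants to try the "multiple instances at once" reproduction, here is a minimal sketch of a POSIX harness. It is not code from Halide or llama.cpp: the function name `count_segfaults` is hypothetical, and the command to run is whatever GPU-using binary you supply. It launches several concurrent copies of one command and counts how many were terminated by SIGSEGV, which is how a crash at process exit would show up to the parent.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>
#include <csignal>
#include <vector>

// Hypothetical helper (not part of Halide or llama.cpp).
// Starts `count` concurrent copies of the command in `argv` (a
// null-terminated argument vector, as expected by execvp) and returns
// how many of them were killed by SIGSEGV.
int count_segfaults(char *const argv[], int count) {
    assert(argv != nullptr && argv[0] != nullptr);

    // Start every instance before waiting on any of them, so that
    // their lifetimes (and their shutdown paths) overlap.
    std::vector<pid_t> children;
    for (int i = 0; i < count; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            execvp(argv[0], argv);
            _exit(127);  // only reached if exec itself failed
        }
        if (pid > 0) {
            children.push_back(pid);
        }
    }

    // Reap each child and inspect how it terminated. A segfault during
    // driver finalization is reported as death by signal, even if the
    // program's own work completed successfully beforehand.
    int segfaults = 0;
    for (pid_t pid : children) {
        int status = 0;
        if (waitpid(pid, &status, 0) == pid &&
            WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV) {
            segfaults++;
        }
    }
    return segfaults;
}
```

Because the failure is intermittent, a single run returning zero proves little; calling the helper in a loop and summing the results gives a rough crash rate to compare across machines.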


2 participants