Skip to content

fix call DestroyInstance after loader_platform_close_library#1805

Merged
charles-lunarg merged 1 commit into
KhronosGroup:mainfrom
Kkaiwz:main
Nov 16, 2025
Merged

fix call DestroyInstance after loader_platform_close_library#1805
charles-lunarg merged 1 commit into
KhronosGroup:mainfrom
Kkaiwz:main

Conversation

@Kkaiwz

@Kkaiwz Kkaiwz commented Nov 14, 2025

Copy link
Copy Markdown

Found that vkcts crash after "Get real path to layer & driver binaries : 77ccbe4" merged into vullan-loader.

The root cause of crash is icd.so lib has been closed before DestroyInstance called in loader free path when icd instance has been created and oom in layer.

eg. fpCreateInstance success but table->populate oom in wsi. TRY_LOG(fpCreateInstance(&modified_info, pAllocator, pInstance) , "Failed to create the instance");
TRY_LOG_CALL(table->populate(*pInstance, fpGetInstanceProcAddr, api_version));

So we need to defer close icd lib until DestroyInstance done.

Fix case: dEQP-VK.api.device_init.create_instance_device_ intentional_alloc_fail.basic

Found that vkcts crash after "Get real path to layer & driver
binaries : 77ccbe4" merged
into vullan-loader.

The root cause of crash is icd.so lib has been closed before
DestroyInstance called in loader free path when icd instance
has been created and oom in layer.

eg. fpCreateInstance success but table->populate oom in wsi.
TRY_LOG(fpCreateInstance(&modified_info, pAllocator, pInstance)
, "Failed to create the instance");
TRY_LOG_CALL(table->populate(*pInstance, fpGetInstanceProcAddr,
api_version));

So we need to defer close icd lib until DestroyInstance done.

Fix case: dEQP-VK.api.device_init.create_instance_device_
intentional_alloc_fail.basic

Signed-off-by: Ryan Zhang <ryan.zhang@nxp.com>
@ci-tester-lunarg

Copy link
Copy Markdown

Author Kkaiwz not on autobuild list. Waiting for curator authorization before starting CI build.

1 similar comment
@ci-tester-lunarg

Copy link
Copy Markdown

Author Kkaiwz not on autobuild list. Waiting for curator authorization before starting CI build.

@CLAassistant

CLAassistant commented Nov 14, 2025

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

@ci-tester-lunarg

Copy link
Copy Markdown

CI Vulkan-Loader build queued with queue ID 578605.

@ci-tester-lunarg

Copy link
Copy Markdown

CI Vulkan-Loader build # 3268 running.

@ci-tester-lunarg

Copy link
Copy Markdown

CI Vulkan-Loader build # 3268 passed.

@charles-lunarg charles-lunarg left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am content with this change - although I am unable to reproduce this crash locally.

I will need you to sign the CLA before I can accept this PR.

@Kkaiwz

Kkaiwz commented Nov 15, 2025

Copy link
Copy Markdown
Author

Hi @charles-lunarg , thanks for your review. I have signed in the CLA currently.

@wangshuo-lalala wangshuo-lalala left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

@Kkaiwz

Kkaiwz commented Nov 15, 2025

Copy link
Copy Markdown
Author

BTW, As a technical question: Why did this commit expose a long-standing bug, and do you have any thoughts on this? I noticed that the change in fixup_library_binary_path within loader_scanned_icd_add altered certain behaviors, which seems to have triggered the bug, but I’m currently unable to connect the dots.
The behavior in vkcts is quite straightforward: it passes a custom allocator to layers and ICDs, then forces an allocation failure at the specified failIndex. This is meant to test whether the loader and ICD correctly handle the cleanup path after a failure. (So it’s not surprising you can’t reproduce the issue—since we’re using different layers and ICDs, the allocation at failIndex occurs in different places for each setup.)
What I don’t understand is why changing the library path resolution would cause the allocation at failIndex to occur in a previously unseen call site, ultimately exposing this crash. If you have any ideas, I’d greatly appreciate your input.

@charles-lunarg

Copy link
Copy Markdown
Collaborator

The root cause of crash is icd.so lib has been closed before DestroyInstance called in loader free path when icd instance has been created and oom in layer.

The reason it is tricky to reproduce likely comes from the loader's icd preloading that finds driver binaries, loads them, and keeps them around until the very end of vkDestroyInstance to prevent unnecessary load/unload of binaries. This makes it possible to 'close the lib' first and still be able to call functions into the driver binary as the binary hasn't unloaded, only its ref count value has decreased.

I was able to reproduce #1804 and was able to show that that PR fixed the crash. Considering that the crash came from the same loader change in the same test, its good to be clear that you are seeing a new, different crash, right?

@charles-lunarg

Copy link
Copy Markdown
Collaborator

Hmm it seems the CLA check still isn't happy. Perhaps it is because there are two authors on the git commits? @ryanKkaiii is the missing signature I believe.

@Kkaiwz

Kkaiwz commented Nov 16, 2025

Copy link
Copy Markdown
Author

Thanks for your remind. Solved it now.

@charles-lunarg charles-lunarg merged commit 8b5249c into KhronosGroup:main Nov 16, 2025
45 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants