Frequent MVKCmdWaitEvents crashes #2319

Open
tycho opened this issue Aug 27, 2024 · 4 comments

tycho commented Aug 27, 2024

This is happening on an Apple M3 Max with MoltenVK rev dad3851, ANGLE rev d51c251604, macOS 15.0 Beta (24A5320a), Xcode 16.0 beta 6 (16A5230g).

A few different crashes happening in MVKCmdWaitEvents due to a null pointer dereference somewhere, somehow:

https://sentry.uplinklabs.net/share/issue/23d442788de24281a469520d0007b504/
https://sentry.uplinklabs.net/share/issue/6c38cdefc0994c4dbb83bcc3d1d9bb22/
https://sentry.uplinklabs.net/share/issue/3d30f207a90340a18b2ac9c771be4c48/

Also happens on an earlier game build with the same MVK revision on an Apple M2 Max:

https://sentry.uplinklabs.net/share/issue/791bf76a3dc64c93a90aa7e219d57c39/

(In the above stack traces click the "Show N more frames" link to see where it died within MVK.)

It seems pretty easy to repro: it happens on game startup. I'd be happy to share a game binary for developers to debug the issue, but I would need to share the build privately. The game demo builds that I could share publicly do not have the "bootloaders" that the retail/debug builds have, and those bootloaders are what trigger the issue.

Running with MTL_DEBUG_LAYER=1 set in the environment yields one of these two crashes:

-[IOGPUMetalCommandBuffer encodeWaitForEvent:value:timeout:]:494: failed assertion `encodeWaitForEvent:value: with uncommitted encoder'
-[MTLDebugRenderCommandEncoder setVertexBufferOffset:attributeStride:atIndex:]:1875: failed assertion `Set Vertex Buffer Offset Validation
index(28) must have an existing buffer.

This didn't happen around a month ago when I last built MoltenVK + ANGLE. Not sure where the blame lies exactly but it has been stable on other platforms using ANGLE's Vulkan backend, so I suspect it's MVK breaking in some way.

tycho commented Aug 30, 2024

The second validation error I mentioned,

-[MTLDebugRenderCommandEncoder setVertexBufferOffset:attributeStride:atIndex:]:1875: failed assertion `Set Vertex Buffer Offset Validation
index(28) must have an existing buffer.

occurs when MVKMTLBufferBinding::justOffset is true, and this dodges that particular validation error:

diff --git a/MoltenVK/MoltenVK/Commands/MVKMTLResourceBindings.h b/MoltenVK/MoltenVK/Commands/MVKMTLResourceBindings.h
index 97ddf4ac..942b393b 100644
--- a/MoltenVK/MoltenVK/Commands/MVKMTLResourceBindings.h
+++ b/MoltenVK/MoltenVK/Commands/MVKMTLResourceBindings.h
@@ -89,7 +89,7 @@ typedef struct MVKMTLBufferBinding {
         } else if (offset != other.offset || stride != other.stride) {
             offset = other.offset;
                        stride = other.stride;
-            justOffset = !isOverridden && (!isDirty || justOffset);
+            justOffset = false; //!isOverridden && (!isDirty || justOffset);
                        isOverridden = false;
             isDirty = true;
         }

Obviously not the right solution, but maybe someone can figure out why _mtlRenderEncoder setVertexBuffer hasn't been called for that index before it tries to do _mtlRenderEncoder setVertexBufferOffset. I'm not clear on how it does the state tracking for what's already bound at the appropriate indices.
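
To make the question above concrete, here is a minimal toy model of the offset-only tracking. It is a sketch under stated assumptions, not MoltenVK's code: `FakeEncoder`, `ToyBinding`, `flush`, and both helper functions are hypothetical names, and the first branch of `update` (a full rebind when the buffer changes) is assumed, since the diff above only shows the `else if` branch. It shows one way an offset-only call could reach an index with no buffer: the binding remembers a full bind that was made on an earlier encoder. Whether MoltenVK actually gets into this state is not established in this thread.

```cpp
#include <cstdint>
#include <set>

// Hypothetical stand-in for a Metal render encoder. It only remembers which
// indices have received a full buffer bind.
struct FakeEncoder {
    std::set<uint32_t> boundIndices;

    bool setBuffer(uint32_t index) {
        boundIndices.insert(index);
        return true;
    }

    // Mirrors the rule in the validation message: an offset-only update
    // needs an existing buffer at that index. Returns false where the
    // Metal debug layer would assert.
    bool setBufferOffset(uint32_t index) const {
        return boundIndices.count(index) != 0;
    }
};

// Toy version of the binding state. Field names follow the diff above.
struct ToyBinding {
    const void* buffer = nullptr;
    uint64_t offset = 0;
    uint32_t stride = 0;
    uint32_t index = 0;
    bool justOffset = false;
    bool isOverridden = false;
    bool isDirty = false;

    void update(const void* newBuffer, uint64_t newOffset, uint32_t newStride) {
        if (buffer != newBuffer) {
            // ASSUMED branch: a different buffer forces a full rebind.
            buffer = newBuffer;
            offset = newOffset;
            stride = newStride;
            justOffset = false;
            isOverridden = false;
            isDirty = true;
        } else if (offset != newOffset || stride != newStride) {
            // Branch shown in the diff above.
            offset = newOffset;
            stride = newStride;
            justOffset = !isOverridden && (!isDirty || justOffset);
            isOverridden = false;
            isDirty = true;
        }
    }

    // Hypothetical flush step: emit either a full bind or an offset-only call.
    bool flush(FakeEncoder& enc) {
        if (!isDirty) return true;
        bool ok = justOffset ? enc.setBufferOffset(index) : enc.setBuffer(index);
        isDirty = false;
        return ok;
    }
};

// Control case: the full bind and the offset update go to the same encoder.
bool offsetUpdateOnSameEncoder() {
    FakeEncoder enc;
    ToyBinding binding;
    binding.index = 28;
    int storage = 0;
    binding.update(&storage, 0, 16);
    if (!binding.flush(enc)) return false;
    binding.update(&storage, 64, 16);
    return binding.flush(enc);
}

// Stale case: the full bind went to an earlier encoder, and the tracking
// state was not reset before the offset update is flushed to a fresh one.
bool offsetUpdateOnFreshEncoder() {
    FakeEncoder first;
    FakeEncoder second;
    ToyBinding binding;
    binding.index = 28;
    int storage = 0;
    binding.update(&storage, 0, 16);
    binding.flush(first);
    binding.update(&storage, 64, 16);
    return binding.flush(second);
}
```

In this model `offsetUpdateOnSameEncoder()` returns true and `offsetUpdateOnFreshEncoder()` returns false. Forcing `justOffset = false`, as in the diff above, makes the second case pass by always doing a full bind, which is why it hides the symptom without explaining it.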

Still digging into the first validation error, but it's harder to hit in debug builds.

tycho commented Aug 30, 2024

For the encodeWaitForEvent crash, this seems to get around it but I think there's got to be more to it:

diff --git a/MoltenVK/MoltenVK/Commands/MVKCmdPipeline.mm b/MoltenVK/MoltenVK/Commands/MVKCmdPipeline.mm
index 697cca24..d750dcd6 100644
--- a/MoltenVK/MoltenVK/Commands/MVKCmdPipeline.mm
+++ b/MoltenVK/MoltenVK/Commands/MVKCmdPipeline.mm
@@ -617,6 +617,7 @@ VkResult MVKCmdWaitEvents<N>::setContent(MVKCommandBuffer* cmdBuff,

 template <size_t N>
 void MVKCmdWaitEvents<N>::encode(MVKCommandEncoder* cmdEncoder) {
+       cmdEncoder->endCurrentMetalEncoding();
        for (MVKEvent* mvkEvt : _mvkEvents) {
                mvkEvt->encodeWait(cmdEncoder->_mtlCmdBuffer);
        }

billhollings commented

> A few different crashes happening in MVKCmdWaitEvents due to a null pointer dereference somewhere, somehow:

Unfortunately, I'm not able to open the first two links above. Can you regenerate and repost them, please? The 3rd & 4th links are working.

[Screenshot taken 2024-09-09 at 8:12:49 PM]

tycho commented Sep 16, 2024

@billhollings Instead of fixing the links, here are some stack traces:

M2 Ultra:

Thread 320996 Crashed:
0   libsystem_kernel.dylib          0x193902a60         <unknown>
1   libsystem_c.dylib               0x193847a30         <unknown>
2   libsystem_c.dylib               0x193846d20         <unknown>
3   Metal                           0x19ddf9194         <unknown>
4   Metal                           0x19ddd5db0         <unknown>
5   IOGPU                           0x1b29b5b2c         <unknown>
6   AGXMetalG14X                    0x1ebc60de0         <unknown>
7   AGXMetalG14X                    0x1ebc60e38         <unknown>
8   libMoltenVK.dylib               0x107d509a0         MVKCmdWaitEvents<T>::encode (MVKCmdPipeline.mm:621)
9   libMoltenVK.dylib               0x107cccf7c         [inlined] MVKCommandEncoder::encodeCommandsImpl (MVKCommandBuffer.mm:379)
10  libMoltenVK.dylib               0x107cccf7c         [inlined] MVKCommandEncoder::encodeCommands (MVKCommandBuffer.mm:372)
11  libMoltenVK.dylib               0x107cccf7c         MVKCommandEncoder::encode (MVKCommandBuffer.mm:346)
12  libMoltenVK.dylib               0x107ccd3bc         MVKCommandBuffer::submit (MVKCommandBuffer.mm:240)
13  libMoltenVK.dylib               0x107d22b3c         MVKQueueFullCommandBufferSubmission<T>::submitCommandBuffers (MVKQueue.mm:644)
14  libMoltenVK.dylib               0x107d212d0         MVKQueueCommandBufferSubmission::execute (MVKQueue.mm:464)
15  libMoltenVK.dylib               0x107d1f970         [inlined] execute (MVKQueue.mm:72)
16  libMoltenVK.dylib               0x107d1f970         [inlined] MVKQueue::submit (MVKQueue.mm:90)
17  libMoltenVK.dylib               0x107d1f970         MVKQueue::submit<T> (MVKQueue.mm:129)
18  libMoltenVK.dylib               0x107c89108         vkQueueSubmit (vulkan.mm:434)
19  libGLESv2.dylib                 0x10b969a0c         rx::vk::CommandQueue::queueSubmit (CommandProcessor.cpp:1551)
20  libGLESv2.dylib                 0x10b9663f4         rx::vk::CommandQueue::submitCommands (CommandProcessor.cpp:1457)
21  libGLESv2.dylib                 0x10ba203c0         rx::vk::Renderer::submitCommands (vk_renderer.cpp:5866)
22  libGLESv2.dylib                 0x10b9774a4         rx::ContextVk::submitCommands (ContextVk.cpp:3663)
23  libGLESv2.dylib                 0x10b973e44         rx::ContextVk::flushImpl (ContextVk.cpp:7674)
24  libGLESv2.dylib                 0x10b96f09c         [inlined] rx::ContextVk::flushCommandsAndEndRenderPass (ContextVk.cpp:8079)
25  libGLESv2.dylib                 0x10b96f09c         [inlined] rx::ContextVk::flushDirtyGraphicsRenderPass (ContextVk.cpp:8090)
26  libGLESv2.dylib                 0x10b96f09c         rx::ContextVk::handleDirtyGraphicsRenderPass (ContextVk.cpp:2394)
27  libGLESv2.dylib                 0x10b974404         rx::ContextVk::setupDraw (ContextVk.cpp:1598)
28  libGLESv2.dylib                 0x10b978180         rx::ContextVk::drawArrays (ContextVk.cpp:4072)
29  libGLESv2.dylib                 0x10b93fb9c         [inlined] gl::Context::drawArrays (Context.inl.h:152)
30  libGLESv2.dylib                 0x10b93fb9c         GL_DrawArrays (entry_points_gles_2_0_autogen.cpp:1260)

M3 Max:

Thread 24094048 Crashed:
0   libsystem_kernel.dylib          0x18d502600         <unknown>
1   libsystem_c.dylib               0x18d447908         <unknown>
2   libsystem_c.dylib               0x18d446c1c         <unknown>
3   Metal                           0x19846d908         <unknown>
4   Metal                           0x198449230         <unknown>
5   IOGPU                           0x1addc9338         <unknown>
6   AGXMetalG15X_M1                 0x10825be9c         <unknown>
7   AGXMetalG15X_M1                 0x10825bef4         <unknown>
8   MetalTools                      0x18ddedfc4         <unknown>
9   libMoltenVK.dylib               0x10469fb80         MVKCmdWaitEvents<T>::encode (MVKCmdPipeline.mm:621)
10  libMoltenVK.dylib               0x10461f95c         [inlined] MVKCommandEncoder::encodeCommandsImpl (MVKCommandBuffer.mm:379)
11  libMoltenVK.dylib               0x10461f95c         [inlined] MVKCommandEncoder::encodeCommands (MVKCommandBuffer.mm:372)
12  libMoltenVK.dylib               0x10461f95c         MVKCommandEncoder::encode (MVKCommandBuffer.mm:346)
13  libMoltenVK.dylib               0x10461fd9c         MVKCommandBuffer::submit (MVKCommandBuffer.mm:240)
14  libMoltenVK.dylib               0x104671dfc         MVKQueueFullCommandBufferSubmission<T>::submitCommandBuffers (MVKQueue.mm:659)
15  libMoltenVK.dylib               0x1046706ac         MVKQueueCommandBufferSubmission::execute (MVKQueue.mm:479)
16  libMoltenVK.dylib               0x10466e9d4         [inlined] execute (MVKQueue.mm:72)
17  libMoltenVK.dylib               0x10466e9d4         MVKQueue::submit (MVKQueue.mm:99)
18  libMoltenVK.dylib               0x10466ede4         MVKQueue::submit<T> (MVKQueue.mm:138)
19  libMoltenVK.dylib               0x1045dbb7c         vkQueueSubmit (vulkan.mm:434)
20  libGLESv2.dylib                 0x10737dcb8         rx::vk::CommandQueue::queueSubmit (CommandProcessor.cpp:1551)
21  libGLESv2.dylib                 0x10737a6a4         rx::vk::CommandQueue::submitCommands (CommandProcessor.cpp:1457)
22  libGLESv2.dylib                 0x107433db8         rx::vk::Renderer::submitCommands (vk_renderer.cpp:5929)
23  libGLESv2.dylib                 0x10738b02c         rx::ContextVk::submitCommands (ContextVk.cpp:3665)
24  libGLESv2.dylib                 0x107387f58         rx::ContextVk::flushImpl (ContextVk.cpp:7700)
25  libGLESv2.dylib                 0x107383348         [inlined] rx::ContextVk::flushCommandsAndEndRenderPass (ContextVk.cpp:8098)
26  libGLESv2.dylib                 0x107383348         [inlined] rx::ContextVk::flushDirtyGraphicsRenderPass (ContextVk.cpp:8109)
27  libGLESv2.dylib                 0x107383348         rx::ContextVk::handleDirtyGraphicsRenderPass (ContextVk.cpp:2395)
28  libGLESv2.dylib                 0x107388518         rx::ContextVk::setupDraw (ContextVk.cpp:1599)
29  libGLESv2.dylib                 0x10738bd04         rx::ContextVk::drawArrays (ContextVk.cpp:4074)
30  libGLESv2.dylib                 0x107353d68         [inlined] gl::Context::drawArrays (Context.inl.h:152)
31  libGLESv2.dylib                 0x107353d68         GL_DrawArrays (entry_points_gles_2_0_autogen.cpp:1260)

Basically the previous encoder must be ended before an encodeWait can be used, which lines up with what validation claims...
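
The ordering rule described here can be illustrated with a small toy model. This is only a sketch: `ToyCommandBuffer` and its methods are hypothetical stand-ins, not Metal or MoltenVK APIs, and the model assumes nothing beyond what the validation message states, namely that a wait cannot be encoded while an encoder is still open on the command buffer.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a command buffer with at most one open encoder.
struct ToyCommandBuffer {
    bool encoderOpen = false;
    std::vector<std::string> log;

    void beginEncoder() {
        encoderOpen = true;
        log.push_back("begin encoder");
    }

    // Plays the role that endCurrentMetalEncoding() has in the patch above.
    void endEncoder() {
        if (encoderOpen) {
            encoderOpen = false;
            log.push_back("end encoder");
        }
    }

    // Returns false where the Metal debug layer would assert with
    // "encodeWaitForEvent:value: with uncommitted encoder".
    bool encodeWaitForEvent() {
        if (encoderOpen) return false;
        log.push_back("wait for event");
        return true;
    }
};

// Wait encoded while a render or compute encoder is still open.
bool waitWithOpenEncoder() {
    ToyCommandBuffer cmdBuffer;
    cmdBuffer.beginEncoder();
    return cmdBuffer.encodeWaitForEvent();
}

// Wait encoded only after the open encoder has been ended.
bool waitAfterEndingEncoder() {
    ToyCommandBuffer cmdBuffer;
    cmdBuffer.beginEncoder();
    cmdBuffer.endEncoder();
    return cmdBuffer.encodeWaitForEvent();
}
```

In this model `waitWithOpenEncoder()` returns false and `waitAfterEndingEncoder()` returns true, which matches the effect of the one-line patch above. It says nothing about why the encoder is still open at that point in the real code path.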
