Linux 6.2 issue #7
I've tested it on rk3399 NanoPC-T2 with Let me know if this 6.x tree can be fixed, I am interested to follow.
index 710ebac..7cdf1a9 100644
--- a/r8p0/drivers/gpu/arm/midgard/mali_kbase_mem_linux.c
+++ b/r8p0/drivers/gpu/arm/midgard/mali_kbase_mem_linux.c
@@ -2514,7 +2514,7 @@ KBASE_EXPORT_TEST_API(kbase_vunmap);
 #if (LINUX_VERSION_CODE >= KERNEL_VERSION(5, 5, 0))
 static void mali_add_mm_counter(struct mm_struct *mm, int member, long value)
 {
-	atomic_long_add(value, &mm->rss_stat.count[MM_FILEPAGES]);
+	percpu_counter_add(&mm->rss_stat[MM_FILEPAGES], value);
 }
 #else
 static void mali_add_mm_counter(struct mm_struct *mm, int member, long value)
|
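One caveat: the patch changes the line inside the existing `KERNEL_VERSION(5, 5, 0)` guard, but kernels from 5.5 up to 6.1 still expose `rss_stat.count[]` as atomic counters, so a three-way guard may be needed to keep older kernels building. The following is a user-space sketch only and does not touch kernel headers. `KVER` and `rss_api_for` are hypothetical names, `KVER` mirrors the arithmetic of the classic `KERNEL_VERSION` macro, and the 6.2 cutoff is assumed from this issue's title rather than taken from a specific commit.

```c
#include <assert.h>

/* Same arithmetic as the classic KERNEL_VERSION(a, b, c) macro. */
#define KVER(a, b, c) (((a) << 16) + ((b) << 8) + (c))

enum rss_api {
	RSS_API_LEGACY,         /* whatever the driver's existing #else branch does */
	RSS_API_ATOMIC_LONG,    /* atomic_long_add(value, &mm->rss_stat.count[i]) */
	RSS_API_PERCPU_COUNTER  /* percpu_counter_add(&mm->rss_stat[i], value) */
};

/* Hypothetical helper: which accessor a three-way #if chain would select. */
static enum rss_api rss_api_for(unsigned int version_code)
{
	assert(version_code > 0);

	if (version_code >= KVER(6, 2, 0))  /* assumed cutoff, see lead-in */
		return RSS_API_PERCPU_COUNTER;
	if (version_code >= KVER(5, 5, 0))  /* guard already present in the driver */
		return RSS_API_ATOMIC_LONG;
	return RSS_API_LEGACY;
}
```

In the driver itself the same ordering would be written as `#if` / `#elif` / `#else` on `LINUX_VERSION_CODE`, with the newest kernel checked first.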
@cbalint13 is the board really the NanoPC-T2? That one has the Samsung S5P4418 SoC, so I think you're running on a NanoPC-T4. Anyway, the patch you've proposed looks correct to me. |
I've just ordered an RK3399 board that is already supported by Buildroot so I can check and debug it. |
|
|
Thanks a lot. For the moment I’ve built Buildroot, but I need to wait for the board to arrive; I ordered it 2 hours ago :-) |
@cbalint13 can you please point me to the URL of the mali blob you're using, so I can set up my Buildroot system correctly? Thanks in advance for helping! |
In short this is exposed to system:
Excerpt from build receipt:
Let me know if you need more details for reproducibility. |
Double checked,
|
Triple checked,
Failed test with r18 library:
Passed test with r14 library:
|
@cbalint13 thanks a lot for all the tests. But I’m a bit confused. Thanks a lot! |
|
@giuliobenetti I can confirm this still exists on 6.5, been working to resolve and have tried a few different variations of the driver and patches
I'm in a 32-bit userland with a 64-bit kernel, which is different than the issue author, but have the same error. Seems an upstream change in 6.2 has triggered this memory incompatibility. I'm happy to test anything you send! |
@mrfixit2001 @cbalint13 I'm very sorry I still haven't found time to fix this issue. @mrfixit2001 I agree with you, it seems like a memory incompatibility and it looks the same as what @cbalint13 pointed out above. @mrfixit2001 Are you using the OpenCL or OpenGL userspace blobs? |
@giuliobenetti appreciate the quick reply! I'm testing GLES / GBM using a RK3399 Midgard |
Ok, so this is a common problem between both OpenGL and OpenCL. |
@mrfixit2001 @cbalint13 could you please try branch https://github.com/giuliobenetti/mali-driver/tree/test/fix-6.2%2B and see if that fixes the runtime failure? Thanks a lot! |
@giuliobenetti unfortunately that patch does not resolve it. Same failure output. |
Ok, thanks for testing. That patch is needed for consistency in any case so I will commit it later. @mrfixit2001 would it be possible for you to capture an ftrace on modprobe? I will do my best to bring up a board to debug this bug. |
@giuliobenetti I've been staring at this code a few days now... Could this possibly be due to the reimplementation of kbase_unmapped_area_topdown? It coincidentally changed right around that same time to use a maple tree instead of rbtree. |
Attached is a function-graph trace of attempting to start my application. Please let me know what other debug detail you require. And thanks again for your time and involvement! |
@giuliobenetti
|
In case the function-graph isn't what you wanted, here is a new function trace instead |
Thank you, but this is normal behavior; the driver works even without devfreq. |
Yes, this is close to what I need, but I'd need the stack frame on the segfault, including the last mali driver calls. Thank you! |
FYI - I just tested with the bleeding edge commit from torvalds, same error. Here's the full GDB backtrace:
|
@mrfixit2001 Thank you for the effort! |
@mrfixit2001 Ok, I finally have a Rockpro64-V2 up and running, which has the RK3399 with Mali-T860. |
@mrfixit2001 I've reproduced the error with the same board you have. I've straced glmark2-es2-drm and it dies here:
Need to dig. I will find a simpler test program so I have fewer function calls to follow. |
Thank you for keeping us updated!! I’m excited you’re able to reproduce and am hopeful you’ll find a fix soon. I don’t mind patching DRM instead of mali if that’s needed. Looking forward to your reply.
|
@giuliobenetti any progress? Would you mind sharing the full trace/dump so any of us interested can try and help? |
I still haven't had time to start debugging, so I don't have the full trace either. I have to get my hands on this soon. |
Thanks for the update, looking forward to hearing back. |
I was able to get around the shown error by fixing the way the vma flags are cleared in kbase_mmap. The << 4 is no longer correct. Now there's a DMA fence issue:
```
[ 60.770210] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000010
[ 60.771002] Mem abort info:
[ 60.771248] ESR = 0x0000000096000007
[ 60.771578] EC = 0x25: DABT (current EL), IL = 32 bits
[ 60.772044] SET = 0, FnV = 0
[ 60.772335] EA = 0, S1PTW = 0
[ 60.772612] FSC = 0x07: level 3 translation fault
[ 60.773040] Data abort info:
[ 60.773293] ISV = 0, ISS = 0x00000007, ISS2 = 0x00000000
[ 60.773773] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 60.774216] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 60.774682] user pgtable: 4k pages, 48-bit VAs, pgdp=00000000f33be000
[ 60.775244] [0000000000000010] pgd=0800000006c09003, p4d=0800000006c09003, pud=080000001160a003, pmd=0800000012943003, pte=0000000000000000
[ 60.776356] Internal error: Oops: 0000000096000007 [#1] SMP
[ 60.776847] Modules linked in: 8021q btsdio hci_uart btqca btusb btrtl btbcm btintel bluetooth ecdh_generic ecc ir_rcmm_decoder ir_imon_decoder ir_xmp_decoder ir_mce_kbd_decoder ir_sharp_decoder ir_sanyo_decoder ir_sony_decoder ir_jvc_decoder ir_rc6_decoder ir_nec_decoder ir_rc5_decoder fusb302 tcpm rk_crypto pwm_fan spi_rockchip rk3399_dmc crypto_engine
[ 60.779635] CPU: 0 PID: 2691 Comm: mali-cmar-backe Not tainted 6.5.0 #14
[ 60.780223] Hardware name: Pine64 RockPro64 v2.1 (DT)
[ 60.780666] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 60.781277] pc : dma_resv_add_fence+0x7c/0x21c
[ 60.781679] lr : kbase_dma_fence_wait+0x170/0x3d4
[ 60.782097] sp : ffff800084af3ab0
[ 60.782388] x29: ffff800084af3ab0 x28: ffff800084651000 x27: ffff000007e33600
[ 60.783018] x26: 0000000103115001 x25: 0000000000000000 x24: 0000000000000000
[ 60.783647] x23: 0000000000000001 x22: ffff0000032c4300 x21: 0000000000000000
[ 60.784276] x20: 0000000000000000 x19: ffff0000061b3100 x18: 0000000000000000
[ 60.784904] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000002
[ 60.785533] x14: 0000000000000001 x13: 00000000000da51e x12: 0000000000000048
[ 60.786161] x11: 00000000000007e8 x10: ffff800084bedaa8 x9 : ffff0000025bbf00
[ 60.786789] x8 : ffff0000032c4340 x7 : 0000000000000000 x6 : 0000000000000000
[ 60.787418] x5 : ffff0000032c4310 x4 : 0000000000000001 x3 : 0000000000000000
[ 60.788047] x2 : ffff80008105f5e0 x1 : ffff80008106ce30 x0 : ffff80008106ce80
[ 60.788677] Call trace:
[ 60.788894] dma_resv_add_fence+0x7c/0x21c
[ 60.789256] kbase_dma_fence_wait+0x170/0x3d4
[ 60.789640] jd_submit_atom+0x888/0x9a4
[ 60.789981] kbase_jd_submit+0x214/0x348
[ 60.790328] kbase_ioctl+0xb6c/0x157c
[ 60.790655] __arm64_compat_sys_ioctl+0x140/0x160
[ 60.791074] invoke_syscall+0x44/0x108
[ 60.791411] el0_svc_common.constprop.0+0x40/0xd8
[ 60.791827] do_el0_svc_compat+0x18/0x38
[ 60.792175] el0_svc_compat+0x14/0x48
[ 60.792505] el0t_32_sync_handler+0x88/0x114
[ 60.792881] el0t_32_sync+0x150/0x154
[ 60.793210] Code: fa401044 54000041 d4210000 f9401678 (b9401319)
[ 60.793744] ---[ end trace 0000000000000000 ]---
```
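The `Code:` line of the oops marks the faulting instruction in parentheses, `b9401319`. As a rough cross-check of the log, the sketch below decodes that word by hand, assuming it is an A64 `LDR Wt, [Xn, #imm]` in its unsigned-offset form. `decode_ldr_w_uoff` and `struct ldr_w` are hypothetical helper names written for this illustration, not part of any kernel tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ldr_w {
	unsigned rt;      /* destination register Wt */
	unsigned rn;      /* base register Xn */
	unsigned offset;  /* byte offset added to Xn */
};

/*
 * Decode A64 "LDR Wt, [Xn, #imm]" (32-bit load, unsigned scaled offset).
 * Returns 0 on success, -1 if the word does not have that encoding.
 */
static int decode_ldr_w_uoff(uint32_t insn, struct ldr_w *out)
{
	assert(out != NULL);

	/* Bits 31..22 must be 1011100101 for this instruction form. */
	if ((insn & 0xFFC00000u) != 0xB9400000u)
		return -1;

	out->rt = insn & 0x1Fu;                      /* bits 4..0 */
	out->rn = (insn >> 5) & 0x1Fu;               /* bits 9..5 */
	out->offset = ((insn >> 10) & 0xFFFu) * 4u;  /* imm12, scaled by 4 */
	return 0;
}
```

For `0xb9401319` this gives `rt = 25`, `rn = 24` and `offset = 16`, that is `ldr w25, [x24, #16]`. The register dump shows `x24` as zero, so the load targets address `0x10`, which matches the faulting virtual address reported at the top of the oops and is consistent with a field being read through a NULL pointer.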
On Mon, Oct 23, 2023 at 2:21 PM Giulio Benetti wrote:
Side note - is the KDS DMA patch included in this repo required for proper midgard functionality in the modern kernel?
What do you mean by the KDS DMA patch? Can you elaborate?
|
Hi @mrfixit2001,
can you elaborate more? Is there a Linux commit that requires that shift to be changed? If yes please open a PR documenting the reason. |
Fix issue reported here: bootlin#7
@giuliobenetti I've created the PR for you. Feel free to edit if you'd like ofc. The change I've made should also be more future-proof without the shift. I would appreciate any insight you may have into the new error I'm getting, the NULL pointer dereference |
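To illustrate what "without the shift" can look like, here is a user-space sketch. It is not the change from the PR, which is not quoted in this thread. The flag values are local copies assumed to mirror the kernel's `VM_READ`, `VM_WRITE`, `VM_MAYREAD` and `VM_MAYWRITE` bits, `may_flags_to_clear` and `strip_may_flags` are hypothetical helpers, and the policy shown (drop each `VM_MAY*` bit whose access was not requested) is an assumption about the intent.

```c
#include <assert.h>

/* Local stand-ins, assumed to match the kernel's bit values. */
#define TOY_VM_READ     0x01UL
#define TOY_VM_WRITE    0x02UL
#define TOY_VM_MAYREAD  0x10UL
#define TOY_VM_MAYWRITE 0x20UL

/*
 * Hypothetical helper: compute which VM_MAY* bits to clear by naming them
 * explicitly, instead of relying on VM_MAYx == VM_x << 4.
 */
static unsigned long may_flags_to_clear(unsigned long vm_flags)
{
	unsigned long clear = 0;

	if (!(vm_flags & TOY_VM_READ))
		clear |= TOY_VM_MAYREAD;
	if (!(vm_flags & TOY_VM_WRITE))
		clear |= TOY_VM_MAYWRITE;
	return clear;
}

/* Apply the mask to a plain value; only VM_MAY* bits may be touched. */
static unsigned long strip_may_flags(unsigned long vm_flags)
{
	unsigned long clear = may_flags_to_clear(vm_flags);

	assert((clear & ~(TOY_VM_MAYREAD | TOY_VM_MAYWRITE)) == 0);
	return vm_flags & ~clear;
}
```

On kernels that provide the `vm_flags_clear()` helper, a mask computed this way would be passed to that helper rather than written to `vma->vm_flags` directly.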
@giuliobenetti FYI - the new null reference error is being thrown because dma_resv_fences_list is returning NULL... which ultimately means __rcu_dereference_check is returning null. When I add a check to dma-resv that checks for this NULL I can bypass the error but then either the card locks up without throwing a panic or the mali driver throws errors about job hard stops, failures, and faults. |
@mrfixit2001 Can you please quickly try the 2 changes below and let me know the result? |
@giuliobenetti On a completely different topic, this will also need to be adjusted for newer kernels: After this commit: |
Ok, it was only a quick try.
Can you open a PR for that? Can you also write a commit log like I've done for previous commits? Thank you! |
I have refactored and adjusted dma fence to always call dma_resv_reserve_fences for both read and write dma reservations, and I've updated kbase_dma_fence_lock_reservations to use dma resv locks instead of ww mutexes (this mirrors changes done to the DRM drivers). This resolves the null error... but now I'm still stuck with iommu errors. And if the board doesn't lock up afterwards then I'm given a bunch of mali errors. Here's the next error to be investigated...
|
Additional detail - the above iommu error is thrown after attempting to exit ffplay, which uses SDL2. The audio plays but the video is blank and then errors out after. I get a completely different error when I try to start KODI, which does NOT use SDL2 at all; it sends all its display output directly to drm / gbm. The below basically repeats over and over:
Any insight or ideas is welcome. |
I'll give you one more error to reference as well... Let me know if you have any ideas based on any of this... I tried increasing KBASE_AS_INACTIVE_MAX_LOOPS so it wouldn't think the gpu was stuck, but that either had no effect or caused the board to simply lock up without throwing an error... But if I first trigger the iommu error with ffplay, and THEN try and start kodi, I get yet another completely different error :)
|
I am also wondering if there is anything we need to change to integrate with this in the upstream kernel: |
@giuliobenetti The above-mentioned errors were kind of misleading and ultimately the issue was actually related to a change in the kernel power management driver - not having to do with mali at all. In any event - the initially defined error in this issue was resolved by my PR. I'll leave the rest of the error outputs for future reference in case anyone is googling and this comes up. When you have time, I encourage you to refactor and adjust the dma fence code to always call dma_resv_reserve_fences for both read and write dma reservations, as well as update kbase_dma_fence_lock_reservations to use dma resv locks instead of ww mutex - because this is what other upstream DRM drivers are doing. Thanks again for all your help and time! |
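The reserve-before-add rule recommended above can be pictured with a small user-space model. This is a toy under stated assumptions, not kernel code: `toy_resv`, `toy_reserve` and `toy_add_fence` are hypothetical stand-ins for `dma_resv`, `dma_resv_reserve_fences` and `dma_resv_add_fence`, and the toy returns an error in the situation where, as described earlier in this thread, the kernel ended up dereferencing a NULL fence list.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_MAX_FENCES 8

struct toy_fence_list {
	unsigned num;              /* fences currently stored */
	unsigned max;              /* slots reserved so far */
	int ids[TOY_MAX_FENCES];
};

struct toy_resv {
	struct toy_fence_list *list;  /* NULL until the first reservation */
};

/* Make room for n more fences; allocates the list on first use. */
static int toy_reserve(struct toy_resv *r, unsigned n)
{
	assert(r != NULL);

	if (!r->list) {
		r->list = calloc(1, sizeof(*r->list));
		if (!r->list)
			return -1;
	}
	if (r->list->num + n > TOY_MAX_FENCES)
		return -1;
	if (r->list->max < r->list->num + n)
		r->list->max = r->list->num + n;
	return 0;
}

/* Add a fence; fails when no slot was reserved beforehand. */
static int toy_add_fence(struct toy_resv *r, int fence_id)
{
	assert(r != NULL);

	if (!r->list || r->list->num >= r->list->max)
		return -1;
	r->list->ids[r->list->num++] = fence_id;
	return 0;
}
```

In this model, calling `toy_add_fence` on a fresh `toy_resv` fails, while calling `toy_reserve` first makes the same call succeed. That is the ordering the comment above asks the driver to follow for both read and write reservations.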
Very well! Can you please point me to the Linux patches you're referring to?
Ok, good to know.
Thanks a lot for providing this fix.
Can you please point me to some examples here? Or, if you have them, it would be great if you could contribute patches, since you've already faced the problem.
Thank you too :-) |
Refer to #6 for further information.