Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MTL][SDW] rmmod stuck on kmod-load-unload tests #5015

Closed
plbossart opened this issue May 24, 2024 · 8 comments
Closed

[MTL][SDW] rmmod stuck on kmod-load-unload tests #5015

plbossart opened this issue May 24, 2024 · 8 comments
Labels
MTL Applies to Meteor Lake platform. SDW Applies to SoundWire bus for codec connection

Comments

@plbossart
Copy link
Member

I've seen repeated issues with the kmod-load-unload not completing on MTL Dell SKU 0CC7

Updating the firmware didn't help. I can still log-in remotely but the device has to be rebooted for audio tests

 starting test_kmod-load-unload 
+ test_kmod-load-unload
+ /root/sof-test/test-case/check-kmod-load-unload.sh -l 1
2024-05-24 17:26:54 UTC Sub-Test: [INFO] ===== Starting iteration 1 of 1 =====
2024-05-24 17:26:54 UTC Sub-Test: [INFO] wait dsp power status to become suspended
2024-05-24 17:26:55 UTC Sub-Test: [INFO] run kmod/sof-kmod-remove.sh

WARNING: running as root is not supported

Specified filename /sys/kernel/debug/sof/trace does not exist.
SKIP	snd_usb_audio  	not loaded
SKIP	snd_hda_intel  	not loaded
SKIP	snd_sof_pci_intel_tng  	not loaded
SKIP	snd_sof_pci_intel_skl  	not loaded
SKIP	snd_sof_pci_intel_apl  	not loaded
SKIP	snd_sof_pci_intel_tgl  	not loaded
SKIP	snd_sof_pci_intel_icl  	not loaded
SKIP	snd_sof_pci_intel_cnl  	not loaded
SKIP	snd_sof_pci_intel_lnl  	not loaded
RMMOD	snd_sof_pci_intel_mtl
[  427.318607] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx      : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[  427.422240] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx reply: 0x67000000|0x0: MOD_SET_DX
[  427.422258] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx done : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[  427.422264] Message payload: 00000000: 00000001 00000000
[  614.731740] INFO: task rmmod:7293 blocked for more than 122 seconds.
[  614.731760]       Not tainted 6.9.0-rc5-test-07960-ga15f31d16f50 #205
[  614.731764] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  614.731767] task:rmmod           state:D stack:13384 pid:7293  tgid:7293  ppid:7292   flags:0x00004002
[  614.731780] Call Trace:
[  614.731784]  <TASK>
[  614.731792]  __schedule+0x385/0xb00
[  614.731812]  schedule+0x3a/0x130
[  614.731817]  schedule_timeout+0x131/0x140
[  614.731828]  __wait_for_common+0xb9/0x1d0
[  614.731833]  ? __pfx_schedule_timeout+0x10/0x10
[  614.731841]  remove_one+0xe8/0x110
[  614.731850]  simple_recursive_removal+0x1e0/0x280
[  614.731856]  ? __pfx_remove_one+0x10/0x10
[  614.731862]  debugfs_remove+0x44/0x70
[  614.731872]  snd_sof_device_remove+0xfb/0x180 [snd_sof]
[  614.731904]  sof_pci_remove+0x1d/0x50 [snd_sof_pci]
[  614.731913]  pci_device_remove+0x3f/0xb0
[  614.731922]  device_release_driver_internal+0x1a9/0x210
[  614.731931]  driver_detach+0x4b/0x90
[  614.731936]  bus_remove_driver+0x70/0xf0
[  614.731941]  pci_unregister_driver+0x3f/0x90
[  614.731948]  __do_sys_delete_module+0x1e0/0x2c0
[  614.731961]  do_syscall_64+0x9d/0x1a0
[  614.731967]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[  614.731975] RIP: 0033:0x7f7c360577cb
[  614.731980] RSP: 002b:00007ffc0a379158 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[  614.731985] RAX: ffffffffffffffda RBX: 000055a9a1e8f700 RCX: 00007f7c360577cb
[  614.731988] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055a9a1e8f768
[  614.731990] RBP: 00007ffc0a379180 R08: 1999999999999999 R09: 0000000000000000
[  614.731992] R10: 00007f7c360c8ac0 R11: 0000000000000206 R12: 0000000000000000
[  614.731994] R13: 00007ffc0a3793e0 R14: 000055a9a1e8f700 R15: 0000000000000000
[  614.732004]  </TASK>
[  614.732009] 
               Showing all locks held in the system:
[  614.732022] 1 lock held by khungtaskd/154:
[  614.732025]  #0: ffffffff89d6a0c0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x36/0x120
[  614.732101] 1 lock held by dmesg/3479:
[  614.732104]  #0: ffff93f70174a0e0 (&user->lock){....}-{3:3}, at: devkmsg_read+0x6c/0x230
[  614.732117] 2 locks held by rmmod/7293:
[  614.732120]  #0: ffff93f7029251b0 (&dev->mutex){....}-{3:3}, at: device_release_driver_internal+0x3d/0x210
[  614.732130]  #1: ffff93f70655beb8 (&sb->s_type->i_mutex_key#2){....}-{3:3}, at: simple_recursive_removal+0x15f/0x280

[  614.732144] =============================================

[  737.608739] INFO: task rmmod:7293 blocked for more than 245 seconds.
[  737.608753]       Not tainted 6.9.0-rc5-test-07960-ga15f31d16f50 #205
[  737.608757] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  737.608760] task:rmmod           state:D stack:13384 pid:7293  tgid:7293  ppid:7292   flags:0x00004002
[  737.608770] Call Trace:
[  737.608773]  <TASK>
[  737.608779]  __schedule+0x385/0xb00
[  737.608795]  schedule+0x3a/0x130
[  737.608800]  schedule_timeout+0x131/0x140
[  737.608811]  __wait_for_common+0xb9/0x1d0
[  737.608816]  ? __pfx_schedule_timeout+0x10/0x10
[  737.608824]  remove_one+0xe8/0x110
[  737.608831]  simple_recursive_removal+0x1e0/0x280
[  737.608836]  ? __pfx_remove_one+0x10/0x10
[  737.608843]  debugfs_remove+0x44/0x70
[  737.608850]  snd_sof_device_remove+0xfb/0x180 [snd_sof]
[  737.608878]  sof_pci_remove+0x1d/0x50 [snd_sof_pci]
[  737.608886]  pci_device_remove+0x3f/0xb0
[  737.608894]  device_release_driver_internal+0x1a9/0x210
[  737.608902]  driver_detach+0x4b/0x90
[  737.608907]  bus_remove_driver+0x70/0xf0
[  737.608912]  pci_unregister_driver+0x3f/0x90
[  737.608918]  __do_sys_delete_module+0x1e0/0x2c0
[  737.608930]  do_syscall_64+0x9d/0x1a0
[  737.608936]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[  737.608943] RIP: 0033:0x7f7c360577cb
[  737.608947] RSP: 002b:00007ffc0a379158 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[  737.608952] RAX: ffffffffffffffda RBX: 000055a9a1e8f700 RCX: 00007f7c360577cb
[  737.608954] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055a9a1e8f768
[  737.608957] RBP: 00007ffc0a379180 R08: 1999999999999999 R09: 0000000000000000
[  737.608959] R10: 00007f7c360c8ac0 R11: 0000000000000206 R12: 0000000000000000
[  737.608961] R13: 00007ffc0a3793e0 R14: 000055a9a1e8f700 R15: 0000000000000000
[  737.608970]  </TASK>
[  737.608975] 
               Showing all locks held in the system:
[  737.608986] 1 lock held by khungtaskd/154:
[  737.608988]  #0: ffffffff89d6a0c0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x36/0x120
[  737.609019] 1 lock held by dmesg/3479:
[  737.609021]  #0: ffff93f70174a0e0 (&user->lock){....}-{3:3}, at: devkmsg_read+0x6c/0x230
[  737.609032] 2 locks held by rmmod/7293:
[  737.609035]  #0: ffff93f7029251b0 (&dev->mutex){....}-{3:3}, at: device_release_driver_internal+0x3d/0x210
[  737.609045]  #1: ffff93f70655beb8 (&sb->s_type->i_mutex_key#2){....}-{3:3}, at: simple_recursive_removal+0x15f/0x280

[  737.609059] =============================================


@plbossart
Copy link
Member Author

reproduced with
kernel topic/sof-dev d99d9a0 ALSA: hda: intel-sdw-acpi: use acpi_get_local_u64_address()
SOF main : 737d4d41fbaa app: add winconsole_overlay.conf

TPLG=/lib/firmware/intel/sof-ipc4-tplg/sof-mtl-rt713-l0-rt1316-l12.tplg ~/sof-test/test-case/run-all-tests.sh -l1

@plbossart plbossart added MTL Applies to Meteor Lake platform. SDW Applies to SoundWire bus for codec connection labels May 24, 2024
@plbossart
Copy link
Member Author

The minimal sequence to make the problem appear is this in test-case/run-all-tests.sh

testlist="
multiple-pause-resume
kmod-load-unload
"

Without the multiple-pause-resume no errors are reported.

@plbossart
Copy link
Member Author

this also happens when I remove everything except 'Jack Out"

 starting test_multiple-pause-resume 
+ test_multiple-pause-resume
+ /root/sof-test/test-case/multiple-pause-resume.sh -l 1 -r 25
2024-05-24 20:18:56 UTC Sub-Test: [INFO] /root/sof-test/test-case/multiple-pause-resume.sh will use topology /usr/lib/firmware/intel/sof-ipc4-tplg/sof-mtl-rt713-l0-rt1316-l12.tplg to run the test case
2024-05-24 20:18:56 UTC Sub-Test: [INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital,HDMI1,HDMI2,HDMI3,Deepbuffer Jack Out,Speaker,Amp feedback,Jack In' in test case
2024-05-24 20:18:56 UTC Sub-Test: [INFO] Run command to get pipeline parameters
2024-05-24 20:18:56 UTC Sub-Test: [COMMAND] sof-tplgreader.py /usr/lib/firmware/intel/sof-ipc4-tplg/sof-mtl-rt713-l0-rt1316-l12.tplg -f 'type:any & ~pcm:Amplifier Reference' -b ' pcm:HDA Digital,HDMI1,HDMI2,HDMI3,Deepbuffer Jack Out,Speaker,Amp feedback,Jack In' -s 0 -e
2024-05-24 20:18:56 UTC Sub-Test: [INFO] Starting /usr/local/bin/mtrace-reader.py >& /root/sof-test/logs/multiple-pause-resume/2024-05-24-16:18:56-14946/mtrace.txt &
2024-05-24 20:18:57 UTC Sub-Test: [WARNING] pipeline count is 1, don't need to run this case
 ---------- 
 ---------- 
 starting test_kmod-load-unload 
+ test_kmod-load-unload
+ /root/sof-test/test-case/check-kmod-load-unload.sh -l 1
2024-05-24 20:19:00 UTC Sub-Test: [INFO] ===== Starting iteration 1 of 1 =====
2024-05-24 20:19:00 UTC Sub-Test: [INFO] wait dsp power status to become suspended
2024-05-24 20:19:00 UTC Sub-Test: [INFO] run kmod/sof-kmod-remove.sh

WARNING: running as root is not supported

Specified filename /sys/kernel/debug/sof/trace does not exist.
SKIP	snd_usb_audio  	not loaded
SKIP	snd_hda_intel  	not loaded
SKIP	snd_sof_pci_intel_tng  	not loaded
SKIP	snd_sof_pci_intel_skl  	not loaded
SKIP	snd_sof_pci_intel_apl  	not loaded
SKIP	snd_sof_pci_intel_tgl  	not loaded
SKIP	snd_sof_pci_intel_icl  	not loaded
SKIP	snd_sof_pci_intel_cnl  	not loaded
SKIP	snd_sof_pci_intel_lnl  	not loaded
RMMOD	snd_sof_pci_intel_mtl

@plbossart
Copy link
Member Author

looks like a test script problem maybe. I can't reproduce this with the simpler sequence

sudo TPLG=/lib/firmware/intel/sof-ipc4-tplg/sof-mtl-rt713-l0-rt1316-l12.tplg ~/sof-test/test-case/multiple-pause-resume.sh -l 1 -r 25
~/sof-test/test-case/check-kmod-load-unload.sh -l 2

maybe it's the debugfs stuff that we read now that causes a problem while removing debugfs. Or we have an open debugfs file.

Looks like no one is using run-all-tests.sh?

@plbossart
Copy link
Member Author

Just saw this as well on LNL

starting test_kmod-load-unload 
+ test_kmod-load-unload
+ /home/ubuntu/sof-test/test-case/check-kmod-load-unload.sh -l 1
2024-06-19 08:24:57 UTC Sub-Test: [INFO] ===== Starting iteration 1 of 1 =====
2024-06-19 08:24:57 UTC Sub-Test: [INFO] wait dsp power status to become suspended
2024-06-19 08:24:57 UTC Sub-Test: [INFO] run kmod/sof-kmod-remove.sh
Specified filename /sys/kernel/debug/sof/trace does not exist.
RMMOD	snd_usb_audio
SKIP	snd_hda_intel  	not loaded
SKIP	snd_sof_pci_intel_tng  	not loaded
SKIP	snd_sof_pci_intel_skl  	not loaded
SKIP	snd_sof_pci_intel_apl  	not loaded
SKIP	snd_sof_pci_intel_tgl  	not loaded
SKIP	snd_sof_pci_intel_icl  	not loaded
SKIP	snd_sof_pci_intel_cnl  	not loaded
RMMOD	snd_sof_pci_intel_lnl

[  277.387525] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[  277.539501] snd_soc_rt711_sdca:rt711_sdca_calibration: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_calibration calibration complete, ret=0
[  277.541983] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x67000000|0x0: MOD_SET_DX
[  277.542024] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[  277.542037] Message payload: 00000000: 00000001 00000000
[  277.547478] snd_soc_rt711_sdca:rt711_sdca_jack_init: rt711-sdca sdw:0:0:025d:0711:01: in rt711_sdca_jack_init enable
[  277.547489] snd_soc_rt711_sdca:rt711_sdca_io_init: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_io_init hw_init complete
[  277.547493] soundwire_bus:sdw_handle_slave_status: rt711-sdca sdw:0:0:025d:0711:01: signaling initialization completion for Slave 6
[  277.552306] soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x4000000
[  277.565812] snd_soc_rt711_sdca:rt711_sdca_interrupt_callback: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_interrupt_callback control_port_stat=4, sdca_cascade=1
[  277.568162] soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x2000000
[  277.669458] snd_soc_rt711_sdca:rt711_sdca_interrupt_callback: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_interrupt_callback control_port_stat=4, sdca_cascade=1
[  280.639869] soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-3: clock stop prepare done slave:15
[  280.639910] soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-1: clock stop prepare done slave:15
[  280.640037] soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-2: clock stop prepare done slave:15
[  281.306958] soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-0: clock stop prepare done slave:15

@marc-hb
Copy link
Collaborator

marc-hb commented Jul 17, 2024

The pause-resume tests had many implementation issues, I just rewrote it in thesofproject/sof-test#1218 Can you try again?

It's not surprising that kmod-load-unload could hang if the pause-resume test left something bad behind. Examples:

CI is generally more robust that run-all-tests.sh so it's less likely to leave something bad behind. Not impossible but less likely in CI.

@marc-hb
Copy link
Collaborator

marc-hb commented Jul 29, 2024

@plbossart can we close after thesofproject/sof-test#1218 ?

@plbossart
Copy link
Member Author

I haven't seen this problem in a very very long time, closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
MTL Applies to Meteor Lake platform. SDW Applies to SoundWire bus for codec connection
Projects
None yet
Development

No branches or pull requests

2 participants