[BUG][TGL][Chrome] DSP Panic with large config payload (was plugging/unplugging a headset while using "Online Voice Recorder") #4769
Comments
@bkokoszx @mwasko @abonislawski FYI. unfortunately another one. |
@johnylin76 do you have chance to capture the FW log? |
Prior to the crash we have a lot of IPC errors and attempts at setting a large config to component 57. What is component 57 ?
|
oh it appears the oops is after the set large config to comp 57 (see time delay)
The other crashes show similar delays with previous IPC errors. Can we reproduce with kernel IPC logs ON and FW logs ON ? |
This issue is reproduced on the Chrome platform at the SHZ1 site. [core dump]
|
Here are the logs with kernel IPC logs ON and FW logs ON: |
The logs show 2 things
But the delay is consistent, meaning we are likely scheduling agent in a timely manner but perhaps with the wrong clock or timebase.
We are not doing anything else at this time so could be a race when we are copying config data. |
Hi @1994lwz, may I know which device you use? Delbin? Drobit? Do you know the phase (Proto/DVT/EVT) and the SKU of your device? |
the device is Delbin EVT SKU2. |
@lgirdwood Is there any way to reproduce this issue from the command line? |
Unfortunately I still cannot make a script to successfully reproduce the issue with "cras_test_client --plug ::<0|1>" commands. It seems like a complicated problem that cannot be created by the simple combination of commands. However, I can still reproduce the issue on my Drobit device by the manual steps, although it is also not easy to make it happen. My strategy is rebooting the device and trying again when I failed to reproduce the issue after 30~40 times of headset plug-and-unplug iterations. (I rebooted 3 times and finally hit it...) Attached logs: The log from sof-logger looks weird and unfinished, but it is indeed the whole log I could retrieve after hitting the issue. |
@johnylin76 |
@bkokoszx |
@johnylin76 |
here's a tentative fix for this issue: #4833 Still being tested |
Where would the crash happen here? I agree it is a good fix, but I am having trouble reasoning out the race that would leave the lib with either freed memory or an incorrect size. |
@cujomalainey I'm not certain what could cause the crash given that we're likely sending the same blob. So, theoretically the size should remain the same. But from the logs we see that there are 2 blobs (or the same blob sent twice) sent during headset removal/plug-in. Do you know what those 2 blobs are for? |
@ranj063 we know they are the hotword models, but something is being weird in how they are sent. Previously on 012 this was not an issue. FYI i now have a script you can run to repro this. Repro times range from 5s to 10min
|
@cujomalainey do you mind trying my PR to see if the panic still happens on your device? |
I don't have the latest IGO lib. @johnylin76 can test. He will be coming online soon. |
with my PR i'm able to hit the issue much sooner than without |
@ranj063
Update: |
According to @bzhg, who was checking the crash, the actual crash is inside the microspeech lib on a load, but he can't tell what it is loading. I am betting we are pulling memory out from under its feet and causing it to jump into oblivion. As to how, I am not sure. |
@cujomalainey do you think that instead of realloc'ing the blob_handler->data pointer to a new pointer every time, allocating data with a max size and only updating it would help solve the issue? |
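A minimal sketch of the "allocate once at max size, then only update" idea from the comment above. `struct blob_store` and the `blob_store_*` helpers are hypothetical stand-ins for illustration, not the SOF blob handler API, and plain `calloc`/`free` stand in for the firmware allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a component's config blob storage. */
struct blob_store {
	uint8_t *data;   /* allocated once, never reallocated */
	size_t capacity; /* fixed maximum payload size */
	size_t size;     /* number of valid bytes currently stored */
};

/* Allocate the backing buffer once, at the largest size we accept. */
static int blob_store_init(struct blob_store *bs, size_t max_size)
{
	assert(bs != NULL);
	bs->data = calloc(1, max_size);
	if (!bs->data)
		return -1;
	bs->capacity = max_size;
	bs->size = 0;
	return 0;
}

/* Overwrite the contents in place; the data pointer never changes. */
static int blob_store_update(struct blob_store *bs, const void *src, size_t len)
{
	assert(bs != NULL);
	if (!src || len > bs->capacity)
		return -1; /* reject oversized payloads, keep the old contents */
	memcpy(bs->data, src, len);
	bs->size = len;
	return 0;
}

/* Release the buffer and reset the bookkeeping. */
static void blob_store_free(struct blob_store *bs)
{
	assert(bs != NULL);
	free(bs->data);
	bs->data = NULL;
	bs->capacity = 0;
	bs->size = 0;
}
```

With this layout a consumer that cached `data` never ends up holding a freed pointer. It does not by itself stop a reader from seeing a half-written blob, so the update would still need to happen at a point where the consumer is not running.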
If you would like to check the value of a specific variable, you can do it using e.g. trace_points:
diff --git a/src/audio/google_hotword_detect.c b/src/audio/google_hotword_detect.c
index 8c46eaa785c8..1f3da320ebe3 100644
--- a/src/audio/google_hotword_detect.c
+++ b/src/audio/google_hotword_detect.c
@@ -343,6 +343,8 @@ static void ghd_detect(struct comp_dev *dev,
if (cd->detected)
return;
+ mailbox_sw_reg_write(SRAM_REG_DEBUG_WIN_1, bytes);
+
/* Assuming 1 channel, verified in ghd_params.
*
* TODO Make the logic multi channel safe when new hotword library can
diff --git a/src/include/kernel/mailbox.h b/src/include/kernel/mailbox.h
index c9fad0885b69..f6e707c8be02 100644
--- a/src/include/kernel/mailbox.h
+++ b/src/include/kernel/mailbox.h
@@ -36,7 +36,8 @@
#define SRAM_REG_R0_STATE_TRACE SRAM_REG_R_STATE_TRACE_BASE
#define SRAM_REG_R1_STATE_TRACE 0x38
#define SRAM_REG_R2_STATE_TRACE 0x40
-#define SRAM_REG_FW_END (SRAM_REG_R2_STATE_TRACE + 0x8)
+#define SRAM_REG_DEBUG_WIN_1 0x48
+#define SRAM_REG_FW_END (SRAM_REG_R2_STATE_TRACE + 0x4)
/** @}*/
And dump it by command:
You can define bigger or several debug "windows" and dump/check a bigger part of memory. |
@sathya-nujella @johnylin76 @cujomalainey @bkokoszx I have a one-line change which fixes quite a few cache coherency issues when debugging multi-core + dynamic pipelines. Can you give it a quick try to see if it helps: |
@bkokoszx thanks for the helpful tip. Will definitely remember to try that out. @johnylin76 can you test @keyonjie's change? |
Unfortunately, with @keyonjie's change applied it is still easy to reproduce the DSP panic by script (in less than 1 minute of running). |
Thanks for trying it out @johnylin76 . |
When I tried I also included PR#4833. |
Sorry, I didn't notice this, so let's focus on @bkokoszx's suggestion and debugging then. |
I can also reproduce the panic by script with @keyonjie's change. |
@cujomalainey @johnylin76 |
@bkokoszx I have not. I am sheriff for my team this week so my time is thin; @johnylin76 might have. I still think this is likely a data corruption issue caused by stressing malloc, given that the memory map works fine when not constantly being swapped. It might be worth re-enabling the CRC and seeing if we catch some CRC errors. |
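To illustrate the CRC suggestion above: a sketch of checking a received blob against an expected CRC-32 before handing it to the library. This uses the common reflected CRC-32 (polynomial 0xEDB88320) computed bit by bit; whether the firmware's own CRC uses the same parameters is an assumption, and `blob_crc_ok` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320, init and final XOR 0xFFFFFFFF). */
static uint32_t crc32_ieee(const uint8_t *buf, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	assert(buf != NULL || len == 0);
	for (size_t i = 0; i < len; i++) {
		crc ^= buf[i];
		for (int bit = 0; bit < 8; bit++) {
			/* mask is all ones when the low bit is set, else zero */
			uint32_t mask = 0u - (crc & 1u);

			crc = (crc >> 1) ^ (0xEDB88320u & mask);
		}
	}
	return ~crc;
}

/* Hypothetical check: returns 1 when the blob matches the expected CRC, else 0. */
static int blob_crc_ok(const uint8_t *blob, size_t len, uint32_t expected)
{
	return crc32_ieee(blob, len) == expected;
}
```

Running such a check both when the blob arrives and again just before the library consumes it would help tell apart corruption in transit from corruption after the copy.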
@cujomalainey |
We are testing the CRAS workaround right now, but the underlying error still exists |
So our workaround prevents the crash, but we have a new crash that is blocking the release |
@cujomalainey are both crashes solved now? I've not seen any updates since the DMA fixes were merged so I'm assuming we are good? |
@lgirdwood i believe this crash can still be replicated with the script. We simply modified CRAS so it won't hit it. It might be worth testing with another component that has a large config payload. (Note this payload is about 60k) |
Tracking for 2.1 - @cujomalainey @sathya-nujella pls append the script here if you have it. Thanks. |
Usually takes a couple of minutes |
@marc-hb @greg-intel @aiChaoSONG @XiaoyunWu6666 fyi - we need a stress/CI test that sends a sequence of large config messages to our WOV dummy component as we have a crash in the large message handling whilst doing a playback and capture. |
Can someone who worked on this bug and understands enough about it please file a more detailed test description in a new https://github.com/thesofproject/sof-test/issues?
@marc-hb The aim is to not test for this exact bug, but for the use case where it occurs. The active streams and large config IPC run at different contexts with xtos so we may have a race. Being able to do a test with running streams and sending large config IPCs would be helpful. Feature request added here thesofproject/sof-test#847 |
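One way to picture the race described above, and a common way to avoid it, is to let the IPC context only fill a staging buffer and let the stream context adopt it at a safe point. This is a sketch under assumptions: `struct cfg_slot`, `cfg_receive` and `cfg_commit` are hypothetical names, the fragment/offset scheme is illustrative rather than the actual SOF IPC layout, and real firmware would need proper synchronization around the `ready` flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CFG_MAX 64 /* tiny for illustration; the payload in this thread is about 60k */

struct cfg_slot {
	uint8_t active[CFG_MAX];  /* what the processing code reads */
	uint8_t staging[CFG_MAX]; /* what the IPC handler writes */
	size_t active_len;
	size_t staged_len;        /* bytes received so far */
	size_t expected_len;      /* total size announced by the first fragment */
	int ready;                /* staging is complete and waiting for commit */
};

/* IPC context: append one fragment to staging. Returns 0 on success, -1 on a bad fragment. */
static int cfg_receive(struct cfg_slot *s, const uint8_t *frag, size_t len,
		       size_t offset, size_t total)
{
	assert(s != NULL && frag != NULL);
	if (total > CFG_MAX || len > total || offset > total - len)
		return -1;
	if (offset == 0) { /* first fragment restarts the transfer */
		s->staged_len = 0;
		s->expected_len = total;
		s->ready = 0;
	}
	if (offset != s->staged_len || total != s->expected_len)
		return -1; /* out-of-order fragment or inconsistent total */
	memcpy(s->staging + offset, frag, len);
	s->staged_len += len;
	if (s->staged_len == total)
		s->ready = 1;
	return 0;
}

/* Stream context: adopt a complete config between periods. Returns 1 if a new config was taken. */
static int cfg_commit(struct cfg_slot *s)
{
	assert(s != NULL);
	if (!s->ready)
		return 0;
	memcpy(s->active, s->staging, s->staged_len);
	s->active_len = s->staged_len;
	s->ready = 0;
	return 1;
}
```

A stress test along the lines requested here would call the equivalent of `cfg_receive` repeatedly while streams are running, which is exactly the window where a partially written config could otherwise be observed.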
@cujomalainey @lgirdwood I removed the v2.1 milestone as that ship has sailed. I think we should use bug priorities if this needs more attention. The thesofproject/sof-test#847 test issue is also tracked (we could use e.g. EQ to mimic a similar test in upstream). |
@johnylin76 is this still relevant? |
I think this issue can be closed; it should be fixed (or substantially alleviated) by the following module_handler improvements. I barely see this issue on ADL projects. |
Describe the bug
With IGO Noise Reduction enabled and releases/tgl/v13.0-hotfix3 applied to the firmware, a DSP panic was found when plugging/unplugging a headset during recording with "Online Voice Recorder".
To Reproduce
Note that the issue is not reproducible by script.
Reproduction Rate
~1/20
Expected behavior
The recording is always valid.
Impact
A DSP panic occurs and the user needs to reboot to recover.
Environment
Google private tgl-13 branch with the checkout of https://chrome-internal-review.googlesource.com/c/chromeos/third_party/sound-open-firmware-private/+/4116509/1
And with the latest IGO proprietary library
Screenshots or console output
sof-audio-pci 0000:00:1f.3: error : DSP panic!
sof-audio-pci 0000:00:1f.3: status: fw entered - code 00000005
sof-audio-pci 0000:00:1f.3: error: runtime exception
sof-audio-pci 0000:00:1f.3: error: trace point 00004000
sof-audio-pci 0000:00:1f.3: error: panic at :0
sof-audio-pci 0000:00:1f.3: error: DSP Firmware Oops
sof-audio-pci 0000:00:1f.3: error: Exception Cause: LoadStorePIFDataErrorCause, Synchronous PIF data error during LoadStore access
sof-audio-pci 0000:00:1f.3: EXCCAUSE 0x0000000d EXCVADDR 0x0000bb88 PS 0x00060b25 SAR 0x00000000
sof-audio-pci 0000:00:1f.3: EPC1 0xbe02e437 EPC2 0xbe02d7e6 EPC3 0x00000000 EPC4 0x00000000
sof-audio-pci 0000:00:1f.3: EPC5 0x00000000 EPC6 0x00000000 EPC7 0x00000000 DEPC 0x00000000
sof-audio-pci 0000:00:1f.3: EPS2 0x00060120 EPS3 0x00000000 EPS4 0x00000000 EPS5 0x00000000
sof-audio-pci 0000:00:1f.3: EPS6 0x00000000 EPS7 0x00000000 INTENABL 0x00000000 INTERRU 0x00000262
sof-audio-pci 0000:00:1f.3: stack dump from 0xbe143d40
sof-audio-pci 0000:00:1f.3: 0xbe143d40: 00000000 3ff00000 3fe1feb3 41b00000
sof-audio-pci 0000:00:1f.3: 0xbe143d44: 00000000 00000000 00000000 00000000
sof-audio-pci 0000:00:1f.3: 0xbe143d48: 00000000 00000000 00000000 00000000
sof-audio-pci 0000:00:1f.3: 0xbe143d4c: f66ba030 ffff9667 01cebdc0 ffffaed4
sof-audio-pci 0000:00:1f.3: 0xbe143d50: 00000000 00000000 b641b700 ffffffff
sof-audio-pci 0000:00:1f.3: 0xbe143d54: b05d7828 ffff9667 000c0800 00000000
sof-audio-pci 0000:00:1f.3: 0xbe143d58: 00000000 00000000 01cebe50 ffffaed4
sof-audio-pci 0000:00:1f.3: 0xbe143d5c: 01cebe78 ffffaed4 c05ac157 ffffffff
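For reading the register dump above: EXCCAUSE 0x0000000d is the Xtensa cause code that the kernel prints as LoadStorePIFDataErrorCause, and EXCVADDR holds the address of the faulting access. The helper below is a small sketch that maps a few of the standard Xtensa EXCCAUSE codes to names; the table is deliberately partial and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Partial map of Xtensa EXCCAUSE codes to names; unlisted codes return "unknown". */
static const char *xtensa_exccause_name(uint32_t exccause)
{
	switch (exccause) {
	case 0:  return "IllegalInstruction";
	case 3:  return "LoadStoreError";
	case 6:  return "IntegerDivideByZero";
	case 9:  return "LoadStoreAlignment";
	case 12: return "InstrPIFDataError";
	case 13: return "LoadStorePIFDataError";
	case 14: return "InstrPIFAddrError";
	case 15: return "LoadStorePIFAddrError";
	default: return "unknown";
	}
}

/* Returns 1 when the cause is raised by a load/store, so EXCVADDR points at the data access. */
static int xtensa_exccause_is_load_store(uint32_t exccause)
{
	return exccause == 3 || exccause == 9 || exccause == 13 || exccause == 15;
}
```

For the dump in this report, `xtensa_exccause_name(0x0000000d)` yields "LoadStorePIFDataError", which matches the load inside the library mentioned in the comments.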
messages_panic_at_intl_mic.txt
messages_panic_at_headset.txt