-
Notifications
You must be signed in to change notification settings - Fork 131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add Device tests for close and threading crashes #809
base: develop
Are you sure you want to change the base?
Conversation
5493e90
to
319a647
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Just wondering on rpcClient - when is that now destructed - such that streams and other resources are destructed also (and that we bring still mitigate the initial issue that kept the device in an odd state).
Otherwise 👍 on this, targeting to bring in for next release
My current approach is
I can not (yet) find any sideaffect that rpcClient would cause to keep "the device in an odd state". @szabi-luxonis did the work on 4b3c687. I would like to understand their evidence or thinking on that. My suspicion is that it was other work (like the device firmware) that made a difference -or- there is a bug in the For example, in
If anyone were to call any XLink apis like I get suspicious off the following with a PoE device:
Step 3 was one of my concerns in #520 when my approach of |
2a3bc20
to
5668249
Compare
I think at the end of the day it boiled down to whether that rpcStream was destructed, in turn the underlying XLinkStream was closed. Though I cannot confirm this to be the case. Would have to be retested prehaps with original post change vs just closing the XLinkStream (instead of "nulling out the rpcClient). In that way the rpcClient could remain valid and issues with interfacing with post close would not be required.
This waits till USB device is back online or times out - it helps with doing device down/up events as otherwise issues could occur (afaik while device was still (on usb level) destructing, it'd reconnect). The TODO is there as this does not happen for PoE devices (as the "bootDevice" is never true in that case)
I agree that /w atomic this would make these simpler - I am up for changing that back, but IMO the crux here still lies somewhere around the XLinkStream being closed or not. |
I'm not arguing 🙂 just trying to understand the thinking/logic as it isn't aligning with the code I see. Anytime
The call to That's my reasoning of the code...which I definitely can be wrong 😅. I can't align my reasoning with why you want the underlying XLinkStream of rpcClient to be closed...and why you have a belief it puts "the device in an odd state". I don't have visibility into the OAK device firmware. The firmware must support any disconnect type at any time regardless of the state of XLinkStream's. Meaning the firmware should support: tcpip/ether failures, usb cable failures, sudden |
I've isolated an unexpected grouping of XLink failures with rapid restarts. When this PR's test "Multiple devices created and destroyed in parallel" is run with 2 OAK-D-Lite and 1 OAK-D attached, I get failures. Might have to run the test more than once, but it should eventually fail. On my laptop, it fails almost every run. Fails in
The rpc to |
Probably found bug95% sure I found a somewhat relevant bug here depthai-core/src/xlink/XLinkConnection.cpp Line 342 in 125feb8
OAK devices are not closing cleanlyFor the below test run...
I am assuming that
In that single run, I see many unexpected things
I have also run the test with
|
Thanks a bunch for all the exploration @diablodale (sorry for the delay)
This seems to be what resolved the issue with the initial commit - though it might not be the actual culprit. More testing required - I'm not actually sure either
Most likely the case - the RPCs themselfs likely aren't the culprit.
Yes - we we can remove this outright.
Good catch - Yes, that would in fact exit out too soon. Shall we add it here: https://github.com/luxonis/depthai-core/pull/809/files#diff-17c7352c03ae1d3bb07f3e7a3c1b6f98312f549a440e186f450e3a8f563e3121L338-L340 with
That would be X_LINK_UNBOOTED state. X_LINK_BOOTLOADER would be for cases where device is presented as Luxonis Bootloader (the former is ROM BL/1st stage, the latter is NOR flashed custom BL (2nd stage))
I think this relates to device being in a state that comms aren't possible. Though not 100% sure what is the underlying cause for this error message from libusb on Windows |
Unfortunately, I don't think I can make progress on the connection issue(s) because I don't have visibiity into the OAK device and its firmware. There is not enough public knowledge on its state changes, internal logic, etc.
If you want, I can cleanup, or close an open a new PR, to include the one bug I found and all the cleanup work. It doesn't fix the connection problems -- instead forward movement to have cleaner build, enables more analyzers/warnings, etc. |
de21468
to
6ea13e0
Compare
@themarpe I've cleaned up the commits in this PR. It has some fixes that are outside the main topic. If wanted, I can break them out into several separate PRs. [EDIT] And the last commit will be dropped before merge. It is additional debugs and warns and a supporting API. (which itself could be included in production code and the APIs dropped if the Meanwhile, below is my first attempt mermaid of the XLINK states and their base transitions. Are there any corrections that need to be made? In this 1st diagram, I haven't added reboots, fails, etc. flowchart TD
DEVICE_TYPE{USB device?}
STANDALONE{Standalone<br/>flashed device?}
X_LINK_UNBOOTED[[X_LINK_UNBOOTED<br/>'Movidius MyriadX' 0x03E7, 0x2485]]
X_LINK_BOOTLOADER[[X_LINK_BOOTLOADER<br/>'Luxonis Bootloader' 0x03E7, 0xf63c]]
X_LINK_BOOTED[[X_LINK_BOOTED<br/>'Luxonis Device' 0x03E7, 0xf63b]]
X_LINK_FLASH_BOOTED[[X_LINK_FLASH_BOOTED<br/>0x03E7, 0xf63d]]
HOST_CREATE_DEVICE["host app: Device(pipeline)"]
HOST_BOOTLOADER["XLinkBootMemory(firmwareBin)"]
HOST_RPC_READ["rpc(readPipelineCodeFromXLink)"]
HOST_RPC_PIPELINE["rpc(build) + rpc(startPipeline)"]
STANDALONE_BOOTLOADER["device boots its bootloader from flash"]
STANDALONE_READ["device reads its pipeline from flash"]
STANDALONE_PIPELINE["device builds+starts its pipeline"]
DEVICE_TYPE -->|yes| X_LINK_UNBOOTED
X_LINK_UNBOOTED --> STANDALONE
STANDALONE -->|no| HOST_CREATE_DEVICE
STANDALONE -.->|yes| STANDALONE_BOOTLOADER
STANDALONE_BOOTLOADER -.-> X_LINK_BOOTLOADER
X_LINK_BOOTLOADER -.-> STANDALONE_READ
STANDALONE_READ -.-> STANDALONE_PIPELINE
STANDALONE_PIPELINE -.-> X_LINK_FLASH_BOOTED
HOST_CREATE_DEVICE --> HOST_BOOTLOADER
HOST_BOOTLOADER --> X_LINK_BOOTLOADER
X_LINK_BOOTLOADER -->HOST_RPC_READ
HOST_RPC_READ --> HOST_RPC_PIPELINE
HOST_RPC_PIPELINE --> X_LINK_BOOTED
subgraph "host sdk code: Device(pipeline)"
HOST_BOOTLOADER
HOST_RPC_READ
HOST_RPC_PIPELINE
end
subgraph "Automatic standalone flash behavior"
STANDALONE_BOOTLOADER
STANDALONE_READ
STANDALONE_PIPELINE
end
|
3a404ea
to
5b8a566
Compare
@diablodale Each of transitions is a USB down->up event as reenumeration is done. One extra case is if you call DeviceBootloader, then it boots into X_LINK_BOOTLOADER state if it was UNBOOTED before. Otherwise it just connects to running Bootloader (and neat thing, this mermaid flowchart:) ) flowchart TD
DEVICE_TYPE{USB device}
X_LINK_UNBOOTED[[X_LINK_UNBOOTED<br/>'Movidius MyriadX' 0x03E7, 0x2485]]
X_LINK_BOOTLOADER[[X_LINK_BOOTLOADER<br/>'Luxonis Bootloader' 0x03E7, 0xf63c]]
X_LINK_BOOTED[[X_LINK_BOOTED<br/>'Luxonis Device' 0x03E7, 0xf63b]]
X_LINK_FLASH_BOOTED[[X_LINK_FLASH_BOOTED<br/>0x03E7, 0xf63d]]
HOST_XLINK_BOOT["XLinkBootMemory(firmwareBin)"]
HOST_RPC_READ["rpc(readPipelineCodeFromXLink)"]
HOST_RPC_PIPELINE["rpc(build) + rpc(startPipeline)"]
HOST_BOOTLOADER_BOOT["DeviceBootloader::bootMemory(firmwareBin)"]
STANDALONE_READ["device reads its pipeline from flash"]
STANDALONE_PIPELINE["device builds+starts its pipeline"]
DEVICE_TYPE -->|no BL or BOOT btn is pressed| X_LINK_UNBOOTED
DEVICE_TYPE -->|has BL| X_LINK_BOOTLOADER
X_LINK_UNBOOTED --> HOST_XLINK_BOOT
X_LINK_BOOTLOADER -->|standalone flashed| X_LINK_FLASH_BOOTED
X_LINK_FLASH_BOOTED -.-> STANDALONE_READ
STANDALONE_READ -.-> STANDALONE_PIPELINE
X_LINK_BOOTLOADER -->|normal| HOST_BOOTLOADER_BOOT
HOST_BOOTLOADER_BOOT --> X_LINK_BOOTED
HOST_XLINK_BOOT --> X_LINK_BOOTED
X_LINK_BOOTED --> HOST_RPC_READ
HOST_RPC_READ --> HOST_RPC_PIPELINE
subgraph "host app code: Device(pipeline) [after boot part]"
X_LINK_BOOTED
HOST_RPC_READ
HOST_RPC_PIPELINE
end
subgraph "Automatic standalone flash behavior"
X_LINK_FLASH_BOOTED
STANDALONE_READ
STANDALONE_PIPELINE
end
|
You may include it here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes LGTM at current state
First CI issue is that only with c++17 is [EDIT] I rebased the fix into the [EDIT2] I was unable to find any flag for MSVC so it would fail to compile like GNU does. I tried both |
- revert more closed locking behavior luxonis#520
5b8a566
to
80d3593
Compare
I don't see any "code" faults. Woohoo 😃 The CI thinks there are failures because:
I need to know what are next steps / answers to questions.
|
IMO best if we tag it (like PoE/USB) and skip that tag by default for now (as we don't have HIL In this repo that handles this scenario ATM - we'll add it afterwards)
LGMT, thanks! |
- multiple devices test case is tagged so that it will not run with the default test list using the [.] syntax
Force-pushed. I removed the last commit that had the extra debug in it -- replaced it with a small commit that clarified some comments and debug log messages. Major concernThe majority 50+% of OAK usb devices do not reboot fast enough with the current depthai-core codebase on Windows. depthai-core/src/xlink/XLinkConnection.cpp Line 316 in 5f75ae7
This PR's new test case with multiple devices --- more than half the USB device reboots do not complete within the BOOTUP_SEARCH 5 seconds. This easily cascades to failed connections when depthai-core attempts to boot and upload firmware to the device. When depthai-core attempts this after-reboot activity, some part of the USB device accepts a connection (myraid firmware?) but that connection seems to be brain-dead or not functioning correctly. This series of states causes various levels of depthai-core errors...some may cause exceptions to be thrown. The app will crash if these exceptions are not caught with a try/catch around the Device() constructor. I extended BOOTUP_SEARCH to 10 seconds. Some reboots still did not complete within 10 seconds. I estimate 25-33% of them failed. Eventually, all 3 of my USB devices reboot, yet the time needed to reboot can be many seconds beyond 10. I recommend BOOTUP_SEARCH be extended or new reboot logic for depthai-core to accommodate the time needed to reboot. While new firmware/bootloader may help, it doesn't help all those devices which are not burned with the new firmware. It is not useful for depthai-core to start uploading firmware to a device when it hasn't completely rebooted. The existing codebase does this. The code might have been written to accommodate the scenario of a device being unplugged after/during a reboot....and not wanting that user to wait more than 5 seconds. However, this choice leads to the majority of USB reboots failing. That doesn't seem a wise choice given the failure scenarios I describe above. Separately, is the concern of a 10+ second reboot for a usb device. This is a topic to be explored by the Luxonis team who has access to the firmware and multiple host OS+hardware. |
80d3593
to
7b17b29
Compare
Updated depth colorization logic
Not ready to merge as it is Windows-only as I refine new tests.Device::close()
#805 and RPC serialization and its mutex locks and/or shared_ptr use are unclear and likely crash #807.Test
Multiple devices created and destroyed in parallel
will usualy crash when 3 USB devices are attached. If it doesn't run the test again until it crashes.Test
Device APIs after Device::close()
will always crash, SEG fault, etc.