Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add Device tests for close and threading crashes #809

Open
wants to merge 9 commits into
base: develop
Choose a base branch
from

Conversation

diablodale
Copy link
Contributor

@diablodale diablodale commented Apr 13, 2023

Not ready to merge as it is Windows-only as I refine new tests.

Test Multiple devices created and destroyed in parallel will usualy crash when 3 USB devices are attached. If it doesn't run the test again until it crashes.

Test Device APIs after Device::close() will always crash, SEG fault, etc.

Copy link
Collaborator

@themarpe themarpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Just wondering on rpcClient - when is that now destructed - such that streams and other resources are destructed also (and that we bring still mitigate the initial issue that kept the device in an odd state).

Otherwise 👍 on this, targeting to bring in for next release

@diablodale
Copy link
Contributor Author

diablodale commented Apr 24, 2023

Just wondering on rpcClient - when is that now destructed - such that streams and other resources are destructed also (and that we bring still mitigate the initial issue that kept the device in an odd state).

My current approach is pimpl->rpcClient is never explicitely destructed. Instead, it is destucted when pimpl is destructed as part of Device destruct

  • depthai-core rules/contract require we have a callable rpcClient to live past close()
  • rpcClient has a member shared_ptr<stream> which contains a member shared_ptr<xlink connection>
  • the xlink connection is closed
    connection->close();
    at the start of the Device() close process. Assuming XLink is thread-safe, and XLinkConnection::close() is well written, the close of the Xlink connection can happen anytime and API calls to anyone will nicely fail/throw.
  • when xlink connection is closed, then xlink calls made by rpcClient (or anyone else) should fail with xlink errors or throw similar to\ "Device already closed"

I can not (yet) find any sideaffect that rpcClient would cause to keep "the device in an odd state". @szabi-luxonis did the work on 4b3c687. I would like to understand their evidence or thinking on that. My suspicion is that it was other work (like the device firmware) that made a difference -or- there is a bug in the XLinkConnection::close() codepath causing an issue. I don't think rpcClient itself is the problem.

For example, in XLinkConnection::close() is a code comment you wrote // TODO(themarpe) - revisit for TCPIP protocol. That seems relevant to PoE devices. 🤔 What is that about?
xlinkconnection::close() does...

  1. locks a mutex
  2. calls XLink with long multi-seconds of work
  3. sets deviceLinkId = -1
  4. calls XLink again with long multi-seconds of work
  5. finally sets closed = true;
  6. releases mutex

If anyone were to call any XLink apis like Queue or a Device api that uses rpcClient, then since XLink is thread-safe, then the apis should nicely return an XLink error which is caught and rethrown. The customer that opened the triggering issue was using PoE. That makes be think again this is a timing/race issue because PoE has the long multi-second start/stop/reboot times.

I get suspicious off the following with a PoE device:

  1. Someone calls Device::closeimpl()
  2. connection->close() which starts the blocking multi-second work above
  3. in parallel the Device is still running watchdogThread and monitorThread. If they attempt their tests during xlinkconnection::close() I think it is possible in certain scenarios for connection->close(); to be called by the monitor. Which could block due to the mutex still being held by step 2. Perhaps this cascade of blocks causes a problem or "odd state".

Step 3 was one of my concerns in #520 when my approach of atomic<bool> connection::closed was replaced with a blocking mutex. The scenarios are so many and varied it is somewhat difficult to think through them all. Which the atomic approach supports as it is non-blocking. This may not be the specific cause of this specific PoE problem. Yet, I think it is problematic to use a mutex and maybe it contributes to these types of race/timing fails. Personally, I don't like a public ::close() on any of these objects. They seem to cause more problems than the limited utility.

@diablodale diablodale force-pushed the device-tests-close-threads branch from 2a3bc20 to 5668249 Compare April 24, 2023 15:53
@themarpe
Copy link
Collaborator

I can not (yet) find any sideaffect that rpcClient would cause to keep "the device in an odd state". @szabi-luxonis did the work on 4b3c687. I would like to understand their evidence or thinking on that. My suspicion is that it was other work (like the device firmware) that made a difference -or- there is a bug in the XLinkConnection::close() codepath causing an issue. I don't think rpcClient itself is the problem.

I think at the end of the day it boiled down to whether that rpcStream was destructed, in turn the underlying XLinkStream was closed.

Though I cannot confirm this to be the case. Would have to be retested prehaps with original post change vs just closing the XLinkStream (instead of "nulling out the rpcClient). In that way the rpcClient could remain valid and issues with interfacing with post close would not be required.

For example, in XLinkConnection::close() is a code comment you wrote // TODO(themarpe) - revisit for TCPIP protocol. That seems relevant to PoE devices. thinking What is that about?

This waits till USB device is back online or times out - it helps with doing device down/up events as otherwise issues could occur (afaik while device was still (on usb level) destructing, it'd reconnect). The TODO is there as this does not happen for PoE devices (as the "bootDevice" is never true in that case)

Which the atomic approach supports as it is non-blocking. This may not be the specific cause of this specific PoE problem. Yet, I think it is problematic to use a mutex and maybe it contributes to these types of race/timing fails.

I agree that /w atomic this would make these simpler - I am up for changing that back, but IMO the crux here still lies somewhere around the XLinkStream being closed or not.

@diablodale
Copy link
Contributor Author

I'm not arguing 🙂 just trying to understand the thinking/logic as it isn't aligning with the code I see.
Why does it matter if the underlying XLinkStream of rpcClient was closed?

Anytime Device::close() or Device::~Device is called then the first meanful line of code in DeviceBase::closeImpl() is XLinkConnection::close(). It blocks doing work like XLinkResetRemoteTimeout(). The monitor thread with failing pings also calls XLinkConnection::close(). Both of their calls to XLinkConnection::close() call XLinkResetRemoteTimeout().

XLinkResetRemoteTimeout() causes hardware "down" aka getXLinkState(link) != XLINK_UP. All XLink* apis calls check to see if the stream or link given as a param is XLINK_UP and return XLINK errors when false. And these XLINK errors are caught in depthai-core code as return values to those XLink* apis. This keeps the host in a well-defined state.

The call to XLinkResetRemoteTimeout() also sends a XLINK_RESET_REQ message to the remote OAK hardware for it to down itself and may force it with DispatcherDeviceFdDown(). Finally, it XLink_sem_wait(&link->dispatcherClosedSem) waits for the dispatcher to be closed. None of that codepath needs any XLinkStream to be closed.

That's my reasoning of the code...which I definitely can be wrong 😅. I can't align my reasoning with why you want the underlying XLinkStream of rpcClient to be closed...and why you have a belief it puts "the device in an odd state".

I don't have visibility into the OAK device firmware. The firmware must support any disconnect type at any time regardless of the state of XLinkStream's. Meaning the firmware should support: tcpip/ether failures, usb cable failures, sudden DispatcherDeviceFdDown(), sudden XLINK_RESET_REQ, etc. All of them should be handled cleanly and independently by the firmware itself. There is nothing for the depthai-core host code to do nicely (e.g. always close rpcClient's XLinkStream). Why? Because there are an unbounded number of disconnect scenarios and the OAK firmware needs to handle them by itself in isolation of any host.

@diablodale
Copy link
Contributor Author

diablodale commented Apr 25, 2023

I've isolated an unexpected grouping of XLink failures with rapid restarts. When this PR's test "Multiple devices created and destroyed in parallel" is run with 2 OAK-D-Lite and 1 OAK-D attached, I get failures. Might have to run the test more than once, but it should eventually fail. On my laptop, it fails almost every run.

Fails in DeviceBase::startPipelineImpl when it makes rpc calls. Error is dai::XLinkWriteError and it occurs on pimpl->rpcClient->call("printAssets") or pimpl->rpcClient->call("startPipeline"). Those two call locations are additionally curious to me because:

  1. when both of those calls occur, there have been previous rpcClient calls that didn't throw. That suggests to me that the underlying XLink dispatcher for this connection has transitioned from a good state to a bad state that can't write -or- there is something in common with the two calls and unique from the earlier calls.
  2. Both calls have no rpc parameters. Both calls have no .as<> for a return value. That makes these two unique in the whole function as earlier calls do have params or return values. This might be a false lead, but so far...no other rpcs are failing.

The rpc to printAssets...can we remove that? Or make it debug only?

@diablodale
Copy link
Contributor Author

diablodale commented Apr 26, 2023

Probably found bug

95% sure I found a somewhat relevant bug here

} while(!found && steady_clock::now() - t1 < BOOTUP_SEARCH);

!found should not be there. It errantly causes early exit of the while loop when anything in any state is found.

OAK devices are not closing cleanly

For the below test run...

  1. Changed the if test in the above loop to only exit on X_LINK_BOOTLOADER
  2. Changed BOOTUP_SEARCH to 10s.
  3. added more logging throughout XLinkConnection::close()
  4. set XLINK_LEVEL=fatal

I am assuming that X_LINK_BOOTLOADER is the "at rest" state of a USB device like OAK-D and OAK-D-Lite after it is plugged into a computer's USB port and letting the OAK device power on and load its internally-flashed bootloader. It appears in the OS as Movidius MyriadX and USB\VID_03E7&PID_2485&REV_0001

Found device count: 3
MXID: 1844301031490A1300
MXID: 18443010C17F0C1300
[14442C10C1A1C6D200] [3.1] [1.729] [system] [critical] Error executing RPC call: [json.exception.parse_error.110] parse error at byte 2: syntax error while parsing MessagePack value: expected end of input; last byte: 0x00
[2023-04-26 12:31:26.317] [warning] RPC mxid(14442C10C1A1C6D200) error: Couldn't read data from stream: '__rpc_main' (X_LINK_ERROR)
[2023-04-26 12:31:26.320] [warning] XLinkResetRemoteTimeout(1) mxid(14442C10C1A1C6D200) returned: X_LINK_ERROR
[2023-04-26 12:31:36.181] [warning] Reboot after XLinkResetRemoteTimeout(1) mxid(18443010C17F0C1300) failed to find device in 10s
[2023-04-26 12:31:36.208] [warning] Reboot after XLinkResetRemoteTimeout(0) mxid(1844301031490A1300) failed to find device in 10s
[2023-04-26 12:31:36.356] [warning] Reboot after XLinkResetRemoteTimeout(1) mxid(14442C10C1A1C6D200) failed to find device in 10s

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
multiple_devices_test.exe is a Catch2 v3.2.1 host application.
Run with -? for options

-------------------------------------------------------------------------------
Multiple devices created and destroyed in parallel
-------------------------------------------------------------------------------
..\tests\src\multiple_devices_test.cpp(106)
...............................................................................

..\tests\src\multiple_devices_test.cpp(173): FAILED:
explicitly with message:
  Fail construct Device() with DeviceInfo(name=3.1, mxid=14442C10C1A1C6D200,
  X_LINK_UNBOOTED, X_LINK_USB_VSC, X_LINK_MYRIAD_X, X_LINK_SUCCESS) Device
  already closed or disconnected: io error

MXID: 18443010C17F0C1300
MXID: 1844301031490A1300
MXID: 14442C10C1A1C6D200
[2023-04-26 12:31:48.160] [warning] Reboot after XLinkResetRemoteTimeout(2) mxid(1844301031490A1300) failed to find device in 10s
[2023-04-26 12:31:48.352] [warning] Reboot after XLinkResetRemoteTimeout(3) mxid(14442C10C1A1C6D200) failed to find device in 10s
[2023-04-26 12:31:48.434] [warning] Reboot after XLinkResetRemoteTimeout(1) mxid(18443010C17F0C1300) failed to find device in 10s
MXID: 1844301031490A1300
MXID: 18443010C17F0C1300
MXID: 14442C10C1A1C6D200
[2023-04-26 12:32:00.295] [warning] Reboot after XLinkResetRemoteTimeout(3) mxid(1844301031490A1300) failed to find device in 10s
[2023-04-26 12:32:00.314] [warning] Reboot after XLinkResetRemoteTimeout(4) mxid(18443010C17F0C1300) failed to find device in 10s
[2023-04-26 12:32:00.788] [warning] Reboot after XLinkResetRemoteTimeout(4) mxid(14442C10C1A1C6D200) failed to find device in 10s
===============================================================================
test cases:  1 |  0 passed | 1 failed
assertions: 11 | 10 passed | 1 failed

In that single run, I see many unexpected things

  • 9 times the devices were rebooted yet couldn't be found in 10 seconds. Why are these devices taking so long? I also set it to 15s and I still get fails to find.
  • 1 time the OAK hardware failed to execute an RPC call. I can't find any host code that has the text Error executing RPC call so I think the OAK devices received a message on the RPC named stream, tried to unpack it and it was invalid. This should not happen; it suggests to me a bug somewhere between nanorpc on the host and nanorpc on the device.

I have also run the test with XLINK_LEVEL=debug and see often libusb errors about permission problems. Which on my Windows setup is unexpected as I remember permissions being related to UDEV

...
D: [global] [         0] [ThreadN] getLibusbDeviceMxId:326      libusb_open: Access denied (insufficient permissions) 
D: [global] [         0] [ThreadN] getUSBDevices:171    getLibusbDeviceMxId returned: Access denied (insufficient permissions)

...

@themarpe
Copy link
Collaborator

themarpe commented May 2, 2023

Thanks a bunch for all the exploration @diablodale (sorry for the delay)

That's my reasoning of the code...which I definitely can be wrong sweat_smile. I can't align my reasoning with why you want the underlying XLinkStream of rpcClient to be closed...and why you have a belief it puts "the device in an odd state".

This seems to be what resolved the issue with the initial commit - though it might not be the actual culprit. More testing required - I'm not actually sure either

XLink dispatcher for this connection has transitioned from a good state to a bad state that can't write

Most likely the case - the RPCs themselfs likely aren't the culprit.

The rpc to printAssets...can we remove that? Or make it debug only?

Yes - we we can remove this outright.

95% sure I found a somewhat relevant bug here
!found should not be there. It errantly causes early exit of the while loop when anything in any state is found.

Good catch - Yes, that would in fact exit out too soon. Shall we add it here: https://github.com/luxonis/depthai-core/pull/809/files#diff-17c7352c03ae1d3bb07f3e7a3c1b6f98312f549a440e186f450e3a8f563e3121L338-L340 with else found = false?

I am assuming that X_LINK_BOOTLOADER is the "at rest" state of a USB device like OAK-D and OAK-D-Lite after it is plugged into a computer's USB port and letting the OAK device power on and load its internally-flashed bootloader. It appears in the OS as Movidius MyriadX and USB\VID_03E7&PID_2485&REV_0001

That would be X_LINK_UNBOOTED state. X_LINK_BOOTLOADER would be for cases where device is presented as Luxonis Bootloader (the former is ROM BL/1st stage, the latter is NOR flashed custom BL (2nd stage))

I have also run the test with XLINK_LEVEL=debug and see often libusb errors about permission problems. Which on my Windows setup is unexpected as I remember permissions being related to UDEV

I think this relates to device being in a state that comms aren't possible. Though not 100% sure what is the underlying cause for this error message from libusb on Windows

@diablodale
Copy link
Contributor Author

Unfortunately, I don't think I can make progress on the connection issue(s) because I don't have visibiity into the OAK device and its firmware. There is not enough public knowledge on its state changes, internal logic, etc.

  • All my debugging to date indicates to me hardware problems in state transitions. Either libusb code failing to transition states timely, or the OAK device failing to transition states promptly. This is backed by my repeated test failure evidence and logs showing that even after 15 seconds OAK devices can not be seen by libusb-xlink code (meaning libusb-xlink, WIndows, or OAK have failed to work as expected). If it doesn't work on USB, it is probably even worse with PoE. I recomemnd focus first on USB and get it working. My expanded test case in this PR is an easy repro to expose the bug.
  • To isolate, I attempted to use XLink from depthai-core v2.15.5 (the last xlink with winusb). I failed because the XLink interface had large changes with winusb->libusb and broke both the xlink API and its behavior. Getting current depthai-core to work with the winusb-XLink was more than a half-day of changes (I stopped then).
  • Do you have any docs and flow charts that document the state changes for OAK devices? I can find none in code or in the online depthai docs. At a mimum it would have circles for each XLinkDeviceState_t and arrows showing all possible transitions and how they occur. For example, all circles except the first would have an arrow for kill -9 on app that goes back to the beginning. I suspect some of those arrows are not coded correctly in today's depthai+xlink+firmware. Such a state diagram would be used for all XLink coding and test case creation.

If you want, I can cleanup, or close an open a new PR, to include the one bug I found and all the cleanup work. It doesn't fix the connection problems -- instead forward movement to have cleaner build, enables more analyzers/warnings, etc.

@diablodale diablodale force-pushed the device-tests-close-threads branch 2 times, most recently from de21468 to 6ea13e0 Compare May 11, 2023 18:17
@diablodale
Copy link
Contributor Author

diablodale commented May 11, 2023

@themarpe I've cleaned up the commits in this PR. It has some fixes that are outside the main topic. If wanted, I can break them out into several separate PRs.

[EDIT] And the last commit will be dropped before merge. It is additional debugs and warns and a supporting API. (which itself could be included in production code and the APIs dropped if the Logger class can always show the mxid.

Meanwhile, below is my first attempt mermaid of the XLINK states and their base transitions. Are there any corrections that need to be made? In this 1st diagram, I haven't added reboots, fails, etc.

flowchart TD

DEVICE_TYPE{USB device?}
STANDALONE{Standalone<br/>flashed device?}

X_LINK_UNBOOTED[[X_LINK_UNBOOTED<br/>'Movidius MyriadX' 0x03E7, 0x2485]]
X_LINK_BOOTLOADER[[X_LINK_BOOTLOADER<br/>'Luxonis Bootloader' 0x03E7, 0xf63c]]
X_LINK_BOOTED[[X_LINK_BOOTED<br/>'Luxonis Device' 0x03E7, 0xf63b]]
X_LINK_FLASH_BOOTED[[X_LINK_FLASH_BOOTED<br/>0x03E7, 0xf63d]]

HOST_CREATE_DEVICE["host app: Device(pipeline)"]
HOST_BOOTLOADER["XLinkBootMemory(firmwareBin)"]
HOST_RPC_READ["rpc(readPipelineCodeFromXLink)"]
HOST_RPC_PIPELINE["rpc(build) + rpc(startPipeline)"]

STANDALONE_BOOTLOADER["device boots its bootloader from flash"]
STANDALONE_READ["device reads its pipeline from flash"]
STANDALONE_PIPELINE["device builds+starts its pipeline"]

DEVICE_TYPE -->|yes| X_LINK_UNBOOTED
X_LINK_UNBOOTED --> STANDALONE

STANDALONE -->|no| HOST_CREATE_DEVICE
STANDALONE -.->|yes| STANDALONE_BOOTLOADER

STANDALONE_BOOTLOADER -.-> X_LINK_BOOTLOADER
X_LINK_BOOTLOADER -.-> STANDALONE_READ
STANDALONE_READ -.-> STANDALONE_PIPELINE
STANDALONE_PIPELINE -.-> X_LINK_FLASH_BOOTED

HOST_CREATE_DEVICE --> HOST_BOOTLOADER
HOST_BOOTLOADER --> X_LINK_BOOTLOADER
X_LINK_BOOTLOADER -->HOST_RPC_READ
HOST_RPC_READ --> HOST_RPC_PIPELINE
HOST_RPC_PIPELINE --> X_LINK_BOOTED

    subgraph "host sdk code: Device(pipeline)"
        HOST_BOOTLOADER
        HOST_RPC_READ
        HOST_RPC_PIPELINE
    end

    subgraph "Automatic standalone flash behavior"
        STANDALONE_BOOTLOADER
        STANDALONE_READ
        STANDALONE_PIPELINE
    end

Loading

@diablodale diablodale marked this pull request as ready for review May 11, 2023 18:25
@diablodale diablodale force-pushed the device-tests-close-threads branch from 3a404ea to 5b8a566 Compare May 11, 2023 20:30
@themarpe
Copy link
Collaborator

@diablodale
The following diagram is how the boot process flows. The ROM BL decides (based on HW + NOR contents) on whether or not it boots BL (if its present, some devices don't have NOR, and can only ever boot directly from UNBOOTED->BOOTED).

Each of transitions is a USB down->up event as reenumeration is done.

One extra case is if you call DeviceBootloader, then it boots into X_LINK_BOOTLOADER state if it was UNBOOTED before. Otherwise it just connects to running Bootloader

(and neat thing, this mermaid flowchart:) )

flowchart TD

DEVICE_TYPE{USB device}

X_LINK_UNBOOTED[[X_LINK_UNBOOTED<br/>'Movidius MyriadX' 0x03E7, 0x2485]]
X_LINK_BOOTLOADER[[X_LINK_BOOTLOADER<br/>'Luxonis Bootloader' 0x03E7, 0xf63c]]
X_LINK_BOOTED[[X_LINK_BOOTED<br/>'Luxonis Device' 0x03E7, 0xf63b]]
X_LINK_FLASH_BOOTED[[X_LINK_FLASH_BOOTED<br/>0x03E7, 0xf63d]]

HOST_XLINK_BOOT["XLinkBootMemory(firmwareBin)"]
HOST_RPC_READ["rpc(readPipelineCodeFromXLink)"]
HOST_RPC_PIPELINE["rpc(build) + rpc(startPipeline)"]

HOST_BOOTLOADER_BOOT["DeviceBootloader::bootMemory(firmwareBin)"]

STANDALONE_READ["device reads its pipeline from flash"]
STANDALONE_PIPELINE["device builds+starts its pipeline"]


DEVICE_TYPE -->|no BL or BOOT btn is pressed| X_LINK_UNBOOTED
DEVICE_TYPE -->|has BL| X_LINK_BOOTLOADER

X_LINK_UNBOOTED --> HOST_XLINK_BOOT

X_LINK_BOOTLOADER -->|standalone flashed| X_LINK_FLASH_BOOTED
X_LINK_FLASH_BOOTED -.-> STANDALONE_READ
STANDALONE_READ -.-> STANDALONE_PIPELINE
X_LINK_BOOTLOADER -->|normal| HOST_BOOTLOADER_BOOT
HOST_BOOTLOADER_BOOT --> X_LINK_BOOTED

HOST_XLINK_BOOT --> X_LINK_BOOTED

X_LINK_BOOTED --> HOST_RPC_READ
HOST_RPC_READ --> HOST_RPC_PIPELINE

subgraph "host app code: Device(pipeline) [after boot part]"
    X_LINK_BOOTED
    HOST_RPC_READ
    HOST_RPC_PIPELINE
end

subgraph "Automatic standalone flash behavior"
    X_LINK_FLASH_BOOTED
    STANDALONE_READ
    STANDALONE_PIPELINE
end

Loading

@themarpe
Copy link
Collaborator

If you want, I can cleanup, or close an open a new PR, to include the one bug I found and all the cleanup work. It doesn't fix the connection problems -- instead forward movement to have cleaner build, enables more analyzers/warnings, etc.

You may include it here

Copy link
Collaborator

@themarpe themarpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM at current state

src/xlink/XLinkConnection.cpp Show resolved Hide resolved
@diablodale
Copy link
Contributor Author

diablodale commented May 16, 2023

First CI issue is that only with c++17 is inline implicit with constexpr so those definitions I removed need to be put back. I'll fix that and push

[EDIT] I rebased the fix into the code cleanup commit and force pushed

[EDIT2] I was unable to find any flag for MSVC so it would fail to compile like GNU does. I tried both /permissive- and /Zc:externConstexpr. MSVC with VS2019 continued to compile the non-compilant c++14 code. 🤷‍♀️

@diablodale diablodale force-pushed the device-tests-close-threads branch from 5b8a566 to 80d3593 Compare May 16, 2023 16:41
@diablodale
Copy link
Contributor Author

I don't see any "code" faults. Woohoo 😃

The CI thinks there are failures because:

  1. the extra debug has the word "warning" in it which is then detected by the ctest regex and fails the test
  2. the multiple_devices_test case 2 REQUIRES 2+ hardware devices to be attached. And it fails otherwise. The only machine that ran in CI was a windows machine which appears to have only one device attached.

I need to know what are next steps / answers to questions.

  1. What do you want to do about only 1 OAK on the test machine? Should that case be marked/filtered using flags? Or what do you want?
  2. Is your team good with this PR's state so that I can remove the extra debug and push?

@themarpe
Copy link
Collaborator

themarpe commented May 26, 2023

What do you want to do about only 1 OAK on the test machine? Should that case be marked/filtered using flags? Or what do you want?

IMO best if we tag it (like PoE/USB) and skip that tag by default for now (as we don't have HIL In this repo that handles this scenario ATM - we'll add it afterwards)

Is your team good with this PR's state so that I can remove the extra debug and push?

LGMT, thanks!

- multiple devices test case is tagged so that it
  will not run with the default test list using
  the [.] syntax
@diablodale
Copy link
Contributor Author

Force-pushed. I removed the last commit that had the extra debug in it -- replaced it with a small commit that clarified some comments and debug log messages.

Major concern

The majority 50+% of OAK usb devices do not reboot fast enough with the current depthai-core codebase on Windows.
PoE is unknown. Other OS are unknown.

constexpr auto BOOTUP_SEARCH = seconds(5);

This PR's new test case with multiple devices --- more than half the USB device reboots do not complete within the BOOTUP_SEARCH 5 seconds. This easily cascades to failed connections when depthai-core attempts to boot and upload firmware to the device. When depthai-core attempts this after-reboot activity, some part of the USB device accepts a connection (myraid firmware?) but that connection seems to be brain-dead or not functioning correctly. This series of states causes various levels of depthai-core errors...some may cause exceptions to be thrown. The app will crash if these exceptions are not caught with a try/catch around the Device() constructor.

I extended BOOTUP_SEARCH to 10 seconds. Some reboots still did not complete within 10 seconds. I estimate 25-33% of them failed. Eventually, all 3 of my USB devices reboot, yet the time needed to reboot can be many seconds beyond 10.

I recommend BOOTUP_SEARCH be extended or new reboot logic for depthai-core to accommodate the time needed to reboot. While new firmware/bootloader may help, it doesn't help all those devices which are not burned with the new firmware.

It is not useful for depthai-core to start uploading firmware to a device when it hasn't completely rebooted. The existing codebase does this. The code might have been written to accommodate the scenario of a device being unplugged after/during a reboot....and not wanting that user to wait more than 5 seconds. However, this choice leads to the majority of USB reboots failing. That doesn't seem a wise choice given the failure scenarios I describe above.

Separately, is the concern of a 10+ second reboot for a usb device. This is a topic to be explored by the Luxonis team who has access to the firmware and multiple host OS+hardware.

@diablodale diablodale force-pushed the device-tests-close-threads branch from 80d3593 to 7b17b29 Compare June 7, 2023 16:42
jakgra pushed a commit that referenced this pull request Mar 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants