-
-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improved canbus diagnostics #6784
base: master
Are you sure you want to change the base?
Conversation
f14c1ff
to
6ab5ee6
Compare
Signed-off-by: Kevin O'Connor <[email protected]>
Signed-off-by: Kevin O'Connor <[email protected]>
Signed-off-by: Kevin O'Connor <[email protected]>
Signed-off-by: Kevin O'Connor <[email protected]>
Signed-off-by: Kevin O'Connor <[email protected]>
Add support or a new get_canbus_status command to canserial.c . Add new canbus_stats.py module that will periodically query canbus mcus for connection status information. Signed-off-by: Kevin O'Connor <[email protected]>
If the low-level canbus stops working then it could become impossible to send messages to and from the canbus bridge node itself. This can make it difficult to diagnose canbus problems. Change the canbus bridge code to detect if message transmits become stalled for 50+ milliseconds and go into a "discarding" state. In this discarding state, messages destined for the canbus will be discarded until the canbus becomes active again. In this discarding state it will therefore be possible to transmit messages to and from the canbus bridge node. Signed-off-by: Kevin O'Connor <[email protected]>
6ab5ee6
to
1d298ce
Compare
Thanks for testing. It is odd that the tx_error increased on an rp2040 node without an rx_error increase. It might be a bug in the new code here. Could you attach the full klipper log here? That'll help me investigate.
That's normal as the "bytes retransmitted" stat shows the number of times the software performed a retransmit. If the canbus itself does a retransmission then it isn't shown there. Cheers, |
Sure...here's the klippy.log. |
Thanks. So, on rp2040 the rx_error field actually tracks errors while the tx_error field actually tracks the number of transmit retries. (A transmit may be retried due to bus contention, which is not considered an "error" on the stm32 chips.) On stm32 chips, the rx_error and tx_error report errors and they have no way to report the number of non-error transmit retries. This is confusing, so I'll have to rework the code on this PR to make the new statistics more clear. In the interim, you can treat the rx_error counter as the more serious indicator for the rp2040. It is surprising that you're getting so many transmit retries though - I'll see if I can reproduce that locally. For what it is worth, you may want to also reflash your canbus bridge node ( Cheers, |
Yep...that was the plan this evening. Just needed to figure out the right steps to flash the usb2can device (it's a Manta M8P) and get it set up in ukam. Now all but the cartographer have been flashed. These are the stats I have at startup (note: i've restarted and firmware restarted a few times and that mmu rx_error count has stayed the same):
Will start my test print and see where the stats go during the print. Typically this test print (the Box Turtle Cereal print) has been giving me TTC anywhere from half hour to an hour into the print. |
Here's my klippy.log after getting the print failed with "Timer Too Close". Here's the new stats just after the stacktrace
Though the next reported stats after that does show 1 rx_error on toolhead, but perhaps that was related to the shutdown sequence? |
Okay, thanks for testing. For what it is worth, it doesn't look like your issue is canbus related to me. You've got too many modifications to the code for me to provide direct help with the "timer too close". However, if you want a guess, I'd look at "MMU_CHANGE_TOOL TOOL=0 QUIET=1". Your system is doing some things the mainline code should not do - it's running into garbage collection ( Cheers, |
Sounds good, I appreciate your feedback. It is helpful at least to get further evidence that my wiring seems fine. |
Any chance you can point me to a code example of how to buffer the moves if there is a common pattern already used for doing that? I can look at incorporating it into Happy Hare. I've done some work in there in the past. |
I don't know enough about what's going on to give much advice. From a quick look at your logs, it seems like the stepper scheduling was put into "drip mode" (ie, homing mode):
In "drip mode" the buffering is explicitly disabled so that the host can rapidly schedule moves after an endstop event. In this mode, the Klipper host must be able to maintain low-latency interaction with the mcus (<100ms). That's much more demanding than normal mode, where the Klipper host will buffer 2+ seconds of data into the mcus. When mainline Klipper goes into drip mode, it basically does nothing else but homing - however, your logs indicate that lots of things seem to be going on at the same time - in particular you have a whole bunch of neopixel commands being sent during the event. You also had python garbage collection occurring during the same event. I'd guess it'll be very difficult to get the required low latency timing needed for "drip mode" if many other events are occurring simultaneous. Again, I'm not sure what the code is attempting to do, so take my comments with "a grain of salt". I think we're getting a bit off-topic here though. -Kevin |
Ah...okay...that makes sense! Thank you!
Agreed! I'll continue running this branch of klipper and keep an eye on my canstats regardless. I do think it will be useful to get the can stats from the MCU's perspective. |
This PR adds support for reporting the micro-controller canbus receive and transmit error status.
The low-level canbus hardware will generally automatically retransmit messages that don't transmit correctly (eg, due to line noise). Unfortunately, this system may hide wiring problems - wiring problems that could cause hard to debug failures in the middle of a print.
The new code here will report the low-level interface statistics in the log statistics messages. For example, one might see messages like:
Stats 815.9: ... canstat_mcu: rx_error=0 tx_error=0 bus_state=active ... canstat_corexy_rear_left: rx_error=0 tx_error=0 bus_state=active
In addition to the additional logging, the "usb to canbus bridge mode" has also been extended to detect canbus stalls. Previously, if the canbus were to stall, it would likely become impossible to communicate with the canbus bridge node itself. That complicates debugging (as it is not obvious if the failure is due to something occurring on the bridge node or the canbus). The new code here changes the "usb to canbus bridge" code to detect canbus stalls (no ability to transmit for 50+ms). When a stall is detected canbus transmit messages are discarded, but at least messages to and from the bridge node can continue.
Unfortunately, I don't have a good way to test this code. It runs fine on my local canbus wired printer, but I'm not getting any interface errors. Testing results from others would be appreciated.
-Kevin