Improved canbus diagnostics #6784

KevinOConnor · 2025-01-15T04:13:38Z

This PR adds support for reporting the micro-controller canbus receive and transmit error status.

The low-level canbus hardware will generally automatically retransmit messages that don't transmit correctly (eg, due to line noise). Unfortunately, this system may hide wiring problems - wiring problems that could cause hard to debug failures in the middle of a print.

The new code here will report the low-level interface statistics in the log statistics messages. For example, one might see messages like: Stats 815.9: ... canstat_mcu: rx_error=0 tx_error=0 bus_state=active ... canstat_corexy_rear_left: rx_error=0 tx_error=0 bus_state=active

In addition to the additional logging, the "usb to canbus bridge mode" has also been extended to detect canbus stalls. Previously, if the canbus were to stall, it would likely become impossible to communicate with the canbus bridge node itself. That complicates debugging (as it is not obvious if the failure is due to something occurring on the bridge node or the canbus). The new code here changes the "usb to canbus bridge" code to detect canbus stalls (no ability to transmit for 50+ms). When a stall is detected canbus transmit messages are discarded, but at least messages to and from the bridge node can continue.

Unfortunately, I don't have a good way to test this code. It runs fine on my local canbus wired printer, but I'm not getting any interface errors. Testing results from others would be appreciated.

-Kevin

Signed-off-by: Kevin O'Connor <[email protected]>

Add support or a new get_canbus_status command to canserial.c . Add new canbus_stats.py module that will periodically query canbus mcus for connection status information. Signed-off-by: Kevin O'Connor <[email protected]>

If the low-level canbus stops working then it could become impossible to send messages to and from the canbus bridge node itself. This can make it difficult to diagnose canbus problems. Change the canbus bridge code to detect if message transmits become stalled for 50+ milliseconds and go into a "discarding" state. In this discarding state, messages destined for the canbus will be discarded until the canbus becomes active again. In this discarding state it will therefore be possible to transmit messages to and from the canbus bridge node. Signed-off-by: Kevin O'Connor <[email protected]>

ammmze · 2025-01-16T18:06:35Z

I've got a printer that struggles with Timer Too Close errors, so thought I'd give this update a try. It clearly shows something is up with my toolhead board (or the wiring to it). Though curious that it's on got errors for tx, not rx. Are those errors transmit from the context of the toolhead or host (i.e. toolhead sending messages that are erroring or host sending messages that are erroring)? Based on my understanding when reviewing the code, those would toolhead transmit errors.

printer['canbus_stats mcu'].rx_error = None
printer['canbus_stats mcu'].tx_error = None
printer['canbus_stats mcu'].bus_state = None
printer['canbus_stats scanner'].rx_error = None
printer['canbus_stats scanner'].tx_error = None
printer['canbus_stats scanner'].bus_state = None
printer['canbus_stats mmu'].rx_error = 0
printer['canbus_stats mmu'].tx_error = 0
printer['canbus_stats mmu'].bus_state = active
printer['canbus_stats toolhead'].rx_error = 0
printer['canbus_stats toolhead'].tx_error = 1666
printer['canbus_stats toolhead'].bus_state = active

Now despite these tx errors, the bytes retransmitted stats is still showing 0. But I haven't actually tried printing. Typically when i've gone to actually print stuff I do see a few hundred bytes retransmitted.

Also if it's useful, here's a screenshot of the system loads. Which also shows that the 2 mcu's that I did flash are STM and RP2040. So I can at least confirm that both of those boards are at least reporting to the host.

KevinOConnor · 2025-01-16T18:29:24Z

Thanks for testing. It is odd that the tx_error increased on an rp2040 node without an rx_error increase. It might be a bug in the new code here. Could you attach the full klipper log here? That'll help me investigate.

bytes retransmitted stats is still showing 0

That's normal as the "bytes retransmitted" stat shows the number of times the software performed a retransmit. If the canbus itself does a retransmission then it isn't shown there.

Cheers,
-Kevin

ammmze · 2025-01-16T20:30:57Z

Could you attach the full klipper log here?

Sure...here's the klippy.log.

KevinOConnor · 2025-01-17T00:36:17Z

Sure...here's the klippy.log

Thanks. So, on rp2040 the rx_error field actually tracks errors while the tx_error field actually tracks the number of transmit retries. (A transmit may be retried due to bus contention, which is not considered an "error" on the stm32 chips.) On stm32 chips, the rx_error and tx_error report errors and they have no way to report the number of non-error transmit retries.

This is confusing, so I'll have to rework the code on this PR to make the new statistics more clear. In the interim, you can treat the rx_error counter as the more serious indicator for the rp2040.

It is surprising that you're getting so many transmit retries though - I'll see if I can reproduce that locally.

For what it is worth, you may want to also reflash your canbus bridge node ([mcu]) as that'll also provide useful canbus debugging.

Cheers,
-Kevin

ammmze · 2025-01-17T02:38:35Z

For what it is worth, you may want to also reflash your canbus bridge node ([mcu]) as that'll also provide useful canbus debugging.

Yep...that was the plan this evening. Just needed to figure out the right steps to flash the usb2can device (it's a Manta M8P) and get it set up in ukam. Now all but the cartographer have been flashed.

These are the stats I have at startup (note: i've restarted and firmware restarted a few times and that mmu rx_error count has stayed the same):

printer['canbus_stats mcu'].rx_error = 0
printer['canbus_stats mcu'].tx_error = 0
printer['canbus_stats mcu'].bus_state = active
printer['canbus_stats scanner'].rx_error = None
printer['canbus_stats scanner'].tx_error = None
printer['canbus_stats scanner'].bus_state = None
printer['canbus_stats mmu'].rx_error = 45
printer['canbus_stats mmu'].tx_error = 0
printer['canbus_stats mmu'].bus_state = active
printer['canbus_stats toolhead'].rx_error = 0
printer['canbus_stats toolhead'].tx_error = 48
printer['canbus_stats toolhead'].bus_state = active

Will start my test print and see where the stats go during the print. Typically this test print (the Box Turtle Cereal print) has been giving me TTC anywhere from half hour to an hour into the print.

ammmze · 2025-01-17T03:23:00Z

Here's my klippy.log after getting the print failed with "Timer Too Close".

Here's the new stats just after the stacktrace

canstat_mcu: bus_state=unknown rx_error=0 tx_error=0
canstat_mmu: bus_state=unknown rx_error=45 tx_error=0 
canstat_toolhead: bus_state=unknown rx_error=0 tx_error=59700

Though the next reported stats after that does show 1 rx_error on toolhead, but perhaps that was related to the shutdown sequence?

KevinOConnor · 2025-01-17T14:56:54Z

after getting the print failed with "Timer Too Close".

Okay, thanks for testing. For what it is worth, it doesn't look like your issue is canbus related to me. You've got too many modifications to the code for me to provide direct help with the "timer too close". However, if you want a guess, I'd look at "MMU_CHANGE_TOOL TOOL=0 QUIET=1". Your system is doing some things the mainline code should not do - it's running into garbage collection (Reactor garbage collection: (3156.352535264, 3129.683607084, 3158.637710431)) and is driving one of the steppers at a very rast rate without buffering moves (mcu 'toolhead': Sent 99 3158.932379 3158.932193 12: seq: 10(19072), queue_step oid=1 interval=945 count=635 add=0).

Cheers,
-Kevin

ammmze · 2025-01-17T15:08:43Z

For what it is worth, it doesn't look like your issue is canbus related to me. You've got too many modifications to the code for me to provide direct help with the "timer too close".

Sounds good, I appreciate your feedback. It is helpful at least to get further evidence that my wiring seems fine.

ammmze · 2025-01-17T15:17:14Z

is driving one of the steppers at a very rast rate without buffering moves (mcu 'toolhead': Sent 99 3158.932379 3158.932193 12: seq: 10(19072), queue_step oid=1 interval=945 count=635 add=0).

Any chance you can point me to a code example of how to buffer the moves if there is a common pattern already used for doing that? I can look at incorporating it into Happy Hare. I've done some work in there in the past.

KevinOConnor · 2025-01-17T16:29:04Z

Any chance you can point me to a code example of how to buffer the moves if there is a common pattern already used for doing that?

I don't know enough about what's going on to give much advice. From a quick look at your logs, it seems like the stepper scheduling was put into "drip mode" (ie, homing mode):

extruder queue_step 0: t=27976619521(3158.809660) p=958218 i=945 c=-635 a=0
extruder queue_step 1: t=27977219596(3158.859665) p=957583 i=945 c=-635 a=0
extruder queue_step 2: t=27977819671(3158.909670) p=956948 i=945 c=-635 a=0
...

In "drip mode" the buffering is explicitly disabled so that the host can rapidly schedule moves after an endstop event. In this mode, the Klipper host must be able to maintain low-latency interaction with the mcus (<100ms). That's much more demanding than normal mode, where the Klipper host will buffer 2+ seconds of data into the mcus. When mainline Klipper goes into drip mode, it basically does nothing else but homing - however, your logs indicate that lots of things seem to be going on at the same time - in particular you have a whole bunch of neopixel commands being sent during the event. You also had python garbage collection occurring during the same event. I'd guess it'll be very difficult to get the required low latency timing needed for "drip mode" if many other events are occurring simultaneous. Again, I'm not sure what the code is attempting to do, so take my comments with "a grain of salt".

I think we're getting a bit off-topic here though.

-Kevin

ammmze · 2025-01-17T16:44:07Z

In "drip mode" the buffering is explicitly disabled so that the host can rapidly schedule moves after an endstop event. In this mode, the Klipper host must be able to maintain low-latency interaction with the mcus (<100ms).

Ah...okay...that makes sense! Thank you!

I think we're getting a bit off-topic here though.

Agreed! I'll continue running this branch of klipper and keep an eye on my canstats regardless. I do think it will be useful to get the can stats from the MCU's perspective.

KevinOConnor force-pushed the work-canstats-20250114 branch from f14c1ff to 6ab5ee6 Compare January 15, 2025 16:13

KevinOConnor added 7 commits January 15, 2025 11:34

stm32: Add support for reporting canbus state from can.c

51766a7

Signed-off-by: Kevin O'Connor <[email protected]>

stm32: Add support for reporting canbus state from fdcan.c

3e3d30d

Signed-off-by: Kevin O'Connor <[email protected]>

rp2040: Add support for reporting canbus state

3cf5f32

Signed-off-by: Kevin O'Connor <[email protected]>

atsamd: Add support for reporting canbus state

03efa63

Signed-off-by: Kevin O'Connor <[email protected]>

atsam: Add support for reporting canbus state

08859dd

Signed-off-by: Kevin O'Connor <[email protected]>

canbus_stats: Periodically report canbus interface statistics

778c48c

Add support or a new get_canbus_status command to canserial.c . Add new canbus_stats.py module that will periodically query canbus mcus for connection status information. Signed-off-by: Kevin O'Connor <[email protected]>

KevinOConnor force-pushed the work-canstats-20250114 branch from 6ab5ee6 to 1d298ce Compare January 15, 2025 17:28

ammmze mentioned this pull request Jan 25, 2025

fix(gc): try to not run gc when there is not enough time before the next timer #6794

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improved canbus diagnostics #6784

Improved canbus diagnostics #6784

KevinOConnor commented Jan 15, 2025

ammmze commented Jan 16, 2025

KevinOConnor commented Jan 16, 2025

ammmze commented Jan 16, 2025

KevinOConnor commented Jan 17, 2025

ammmze commented Jan 17, 2025 •

edited

Loading

ammmze commented Jan 17, 2025

KevinOConnor commented Jan 17, 2025

ammmze commented Jan 17, 2025

ammmze commented Jan 17, 2025

KevinOConnor commented Jan 17, 2025

ammmze commented Jan 17, 2025

Improved canbus diagnostics #6784

Are you sure you want to change the base?

Improved canbus diagnostics #6784

Conversation

KevinOConnor commented Jan 15, 2025

ammmze commented Jan 16, 2025

KevinOConnor commented Jan 16, 2025

ammmze commented Jan 16, 2025

KevinOConnor commented Jan 17, 2025

ammmze commented Jan 17, 2025 • edited Loading

ammmze commented Jan 17, 2025

KevinOConnor commented Jan 17, 2025

ammmze commented Jan 17, 2025

ammmze commented Jan 17, 2025

KevinOConnor commented Jan 17, 2025

ammmze commented Jan 17, 2025

ammmze commented Jan 17, 2025 •

edited

Loading