Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved canbus diagnostics #6784

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open

Conversation

KevinOConnor
Copy link
Collaborator

This PR adds support for reporting the micro-controller canbus receive and transmit error status.

The low-level canbus hardware will generally automatically retransmit messages that don't transmit correctly (eg, due to line noise). Unfortunately, this system may hide wiring problems - wiring problems that could cause hard to debug failures in the middle of a print.

The new code here will report the low-level interface statistics in the log statistics messages. For example, one might see messages like: Stats 815.9: ... canstat_mcu: rx_error=0 tx_error=0 bus_state=active ... canstat_corexy_rear_left: rx_error=0 tx_error=0 bus_state=active

In addition to the additional logging, the "usb to canbus bridge mode" has also been extended to detect canbus stalls. Previously, if the canbus were to stall, it would likely become impossible to communicate with the canbus bridge node itself. That complicates debugging (as it is not obvious if the failure is due to something occurring on the bridge node or the canbus). The new code here changes the "usb to canbus bridge" code to detect canbus stalls (no ability to transmit for 50+ms). When a stall is detected canbus transmit messages are discarded, but at least messages to and from the bridge node can continue.

Unfortunately, I don't have a good way to test this code. It runs fine on my local canbus wired printer, but I'm not getting any interface errors. Testing results from others would be appreciated.

-Kevin

@KevinOConnor KevinOConnor force-pushed the work-canstats-20250114 branch from f14c1ff to 6ab5ee6 Compare January 15, 2025 16:13
Add support or a new get_canbus_status command to canserial.c .

Add new canbus_stats.py module that will periodically query canbus
mcus for connection status information.

Signed-off-by: Kevin O'Connor <[email protected]>
If the low-level canbus stops working then it could become impossible
to send messages to and from the canbus bridge node itself.  This can
make it difficult to diagnose canbus problems.

Change the canbus bridge code to detect if message transmits become
stalled for 50+ milliseconds and go into a "discarding" state.  In
this discarding state, messages destined for the canbus will be
discarded until the canbus becomes active again.  In this discarding
state it will therefore be possible to transmit messages to and from
the canbus bridge node.

Signed-off-by: Kevin O'Connor <[email protected]>
@KevinOConnor KevinOConnor force-pushed the work-canstats-20250114 branch from 6ab5ee6 to 1d298ce Compare January 15, 2025 17:28
@ammmze
Copy link

ammmze commented Jan 16, 2025

I've got a printer that struggles with Timer Too Close errors, so thought I'd give this update a try. It clearly shows something is up with my toolhead board (or the wiring to it). Though curious that it's on got errors for tx, not rx. Are those errors transmit from the context of the toolhead or host (i.e. toolhead sending messages that are erroring or host sending messages that are erroring)? Based on my understanding when reviewing the code, those would toolhead transmit errors.

printer['canbus_stats mcu'].rx_error = None
printer['canbus_stats mcu'].tx_error = None
printer['canbus_stats mcu'].bus_state = None
printer['canbus_stats scanner'].rx_error = None
printer['canbus_stats scanner'].tx_error = None
printer['canbus_stats scanner'].bus_state = None
printer['canbus_stats mmu'].rx_error = 0
printer['canbus_stats mmu'].tx_error = 0
printer['canbus_stats mmu'].bus_state = active
printer['canbus_stats toolhead'].rx_error = 0
printer['canbus_stats toolhead'].tx_error = 1666
printer['canbus_stats toolhead'].bus_state = active

Now despite these tx errors, the bytes retransmitted stats is still showing 0. But I haven't actually tried printing. Typically when i've gone to actually print stuff I do see a few hundred bytes retransmitted.

Also if it's useful, here's a screenshot of the system loads. Which also shows that the 2 mcu's that I did flash are STM and RP2040. So I can at least confirm that both of those boards are at least reporting to the host.

Screenshot 2025-01-16 at 11 03 28 AM

@KevinOConnor
Copy link
Collaborator Author

Thanks for testing. It is odd that the tx_error increased on an rp2040 node without an rx_error increase. It might be a bug in the new code here. Could you attach the full klipper log here? That'll help me investigate.

bytes retransmitted stats is still showing 0

That's normal as the "bytes retransmitted" stat shows the number of times the software performed a retransmit. If the canbus itself does a retransmission then it isn't shown there.

Cheers,
-Kevin

@ammmze
Copy link

ammmze commented Jan 16, 2025

Could you attach the full klipper log here?

Sure...here's the klippy.log.

@KevinOConnor
Copy link
Collaborator Author

Sure...here's the klippy.log

Thanks. So, on rp2040 the rx_error field actually tracks errors while the tx_error field actually tracks the number of transmit retries. (A transmit may be retried due to bus contention, which is not considered an "error" on the stm32 chips.) On stm32 chips, the rx_error and tx_error report errors and they have no way to report the number of non-error transmit retries.

This is confusing, so I'll have to rework the code on this PR to make the new statistics more clear. In the interim, you can treat the rx_error counter as the more serious indicator for the rp2040.

It is surprising that you're getting so many transmit retries though - I'll see if I can reproduce that locally.

For what it is worth, you may want to also reflash your canbus bridge node ([mcu]) as that'll also provide useful canbus debugging.

Cheers,
-Kevin

@ammmze
Copy link

ammmze commented Jan 17, 2025

For what it is worth, you may want to also reflash your canbus bridge node ([mcu]) as that'll also provide useful canbus debugging.

Yep...that was the plan this evening. Just needed to figure out the right steps to flash the usb2can device (it's a Manta M8P) and get it set up in ukam. Now all but the cartographer have been flashed.

These are the stats I have at startup (note: i've restarted and firmware restarted a few times and that mmu rx_error count has stayed the same):

printer['canbus_stats mcu'].rx_error = 0
printer['canbus_stats mcu'].tx_error = 0
printer['canbus_stats mcu'].bus_state = active
printer['canbus_stats scanner'].rx_error = None
printer['canbus_stats scanner'].tx_error = None
printer['canbus_stats scanner'].bus_state = None
printer['canbus_stats mmu'].rx_error = 45
printer['canbus_stats mmu'].tx_error = 0
printer['canbus_stats mmu'].bus_state = active
printer['canbus_stats toolhead'].rx_error = 0
printer['canbus_stats toolhead'].tx_error = 48
printer['canbus_stats toolhead'].bus_state = active

Will start my test print and see where the stats go during the print. Typically this test print (the Box Turtle Cereal print) has been giving me TTC anywhere from half hour to an hour into the print.

@ammmze
Copy link

ammmze commented Jan 17, 2025

Here's my klippy.log after getting the print failed with "Timer Too Close".

Here's the new stats just after the stacktrace

canstat_mcu: bus_state=unknown rx_error=0 tx_error=0
canstat_mmu: bus_state=unknown rx_error=45 tx_error=0 
canstat_toolhead: bus_state=unknown rx_error=0 tx_error=59700

Though the next reported stats after that does show 1 rx_error on toolhead, but perhaps that was related to the shutdown sequence?

@KevinOConnor
Copy link
Collaborator Author

after getting the print failed with "Timer Too Close".

Okay, thanks for testing. For what it is worth, it doesn't look like your issue is canbus related to me. You've got too many modifications to the code for me to provide direct help with the "timer too close". However, if you want a guess, I'd look at "MMU_CHANGE_TOOL TOOL=0 QUIET=1". Your system is doing some things the mainline code should not do - it's running into garbage collection (Reactor garbage collection: (3156.352535264, 3129.683607084, 3158.637710431)) and is driving one of the steppers at a very rast rate without buffering moves (mcu 'toolhead': Sent 99 3158.932379 3158.932193 12: seq: 10(19072), queue_step oid=1 interval=945 count=635 add=0).

Cheers,
-Kevin

@ammmze
Copy link

ammmze commented Jan 17, 2025

For what it is worth, it doesn't look like your issue is canbus related to me. You've got too many modifications to the code for me to provide direct help with the "timer too close".

Sounds good, I appreciate your feedback. It is helpful at least to get further evidence that my wiring seems fine.

@ammmze
Copy link

ammmze commented Jan 17, 2025

is driving one of the steppers at a very rast rate without buffering moves (mcu 'toolhead': Sent 99 3158.932379 3158.932193 12: seq: 10(19072), queue_step oid=1 interval=945 count=635 add=0).

Any chance you can point me to a code example of how to buffer the moves if there is a common pattern already used for doing that? I can look at incorporating it into Happy Hare. I've done some work in there in the past.

@KevinOConnor
Copy link
Collaborator Author

Any chance you can point me to a code example of how to buffer the moves if there is a common pattern already used for doing that?

I don't know enough about what's going on to give much advice. From a quick look at your logs, it seems like the stepper scheduling was put into "drip mode" (ie, homing mode):

extruder queue_step 0: t=27976619521(3158.809660) p=958218 i=945 c=-635 a=0
extruder queue_step 1: t=27977219596(3158.859665) p=957583 i=945 c=-635 a=0
extruder queue_step 2: t=27977819671(3158.909670) p=956948 i=945 c=-635 a=0
...

In "drip mode" the buffering is explicitly disabled so that the host can rapidly schedule moves after an endstop event. In this mode, the Klipper host must be able to maintain low-latency interaction with the mcus (<100ms). That's much more demanding than normal mode, where the Klipper host will buffer 2+ seconds of data into the mcus. When mainline Klipper goes into drip mode, it basically does nothing else but homing - however, your logs indicate that lots of things seem to be going on at the same time - in particular you have a whole bunch of neopixel commands being sent during the event. You also had python garbage collection occurring during the same event. I'd guess it'll be very difficult to get the required low latency timing needed for "drip mode" if many other events are occurring simultaneous. Again, I'm not sure what the code is attempting to do, so take my comments with "a grain of salt".

I think we're getting a bit off-topic here though.

-Kevin

@ammmze
Copy link

ammmze commented Jan 17, 2025

In "drip mode" the buffering is explicitly disabled so that the host can rapidly schedule moves after an endstop event. In this mode, the Klipper host must be able to maintain low-latency interaction with the mcus (<100ms).

Ah...okay...that makes sense! Thank you!

I think we're getting a bit off-topic here though.

Agreed! I'll continue running this branch of klipper and keep an eye on my canstats regardless. I do think it will be useful to get the can stats from the MCU's perspective.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants