Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some ActionClients being left waiting forever when server sends results to two or more clients in a very short span of time #144

Open
glpuga opened this issue Jul 8, 2019 · 5 comments

Comments

@glpuga
Copy link

glpuga commented Jul 8, 2019

I found an issue when the cpp Action Server sends results to two or more clients during a very short span of time, which causes all but one of the clients to be left hanging in most cases.

I tracked the problem to being very likely related to the default value of the queue length (only one message long) used for the feedback and result subscribers in the ActionClient, see action_client.h:224.

This is related to the changes made in PR #55

Since all clients share the same feedback and result topics, when the server sends multiple messages addressed to different clients in quick succession, all of the clients subscriber queues become full after the first message and drop the following messages.

As a result, in most cases the clients that were meant to receive the second, third or later messages become hanged waiting for results that were lost and will not be resent.

A similar problem may cause feedback messages to be silently dropped.

While the queue length can be configured using the parameter server, a higher default value may reduce the chances of this problem arising. This would be specially useful since it's hard to track the symptoms to the cause when clients are left waiting indefinitely.

This issue was noted in ROS Kinetic, but it's probably present in other versions as well.

@glpuga
Copy link
Author

glpuga commented Jul 8, 2019

Poking @mikaelarguedas who knows this code.

@glpuga glpuga changed the title ActionClient being left hanged when server sends results to two or more clients in a very short span of time Some ActionClients being left waiting forever when server sends results to two or more clients in a very short span of time Jul 8, 2019
@fujitatomoya
Copy link
Contributor

@glpuga

i do agree with you, it is so unlikely to happen but eventually there is a possibility. since it is configurable using parameter actionlib_client_pub_queue_size/actionlib_client_sub_queue_size, it is user responsibility what to set.

@StefanFabian
Copy link

@fujitatomoya I have to strongly disagree.
You're right that most users may not ever encounter this limitation but it's not as unlikely as you think it is.

As far as I understand it, the idea of the ActionClient is explicitly to support multiple goals whereas the SimpleActionClient simplifies that to tracking a single goal.
As it is, the ActionClient does not support multiple goals fully.
A simple example would be sending 2 or more goals to an ActionServer and canceling those goals using cancelAllGoals(). Depending on the implementation of the server, this will result in the cancel messages from the server to be sent faster than the client-side ROS queue will be handled and all except for one goal will be in a broken state stuck forever in waiting for a cancel ACK.

This is a critical bug and no sane user would expect the queue size to be set to such a low value by default as there is no reason to do so.
Also, the python implementation of the action client does not contain this bug (I assume setting the pub_queue_size to None does the same as setting it to 0, i.e., disables the queue size limit).

I would suggest to at least change this:

n_.param("actionlib_client_pub_queue_size", pub_queue_size, 10);
n_.param("actionlib_client_sub_queue_size", sub_queue_size, 1);
if (pub_queue_size < 0) {pub_queue_size = 10;}
if (sub_queue_size < 0) {sub_queue_size = 1;}

to a sub_queue_size of 100 or better yet 0.
Not as critical but still noteworthy, the pub_queue_size could also be increased significantly. That limitation may not usually be hit but there's also not really reason I could think of to set it that low.
Especially, given that most uncritical queue sizes in other software stacks are usually set to sizes in the hundreds or thousands.

I can make a PR if you want, though it may be a bit overkill for at most 10 line changes.

@fujitatomoya
Copy link
Contributor

fujitatomoya commented Mar 13, 2020

@StefanFabian

SimpleActionClient simplifies that to tracking a single goal.
As it is, the ActionClient does not support multiple goals fully.

correct,

A simple example would be sending 2 or more goals to an ActionServer and canceling those goals using cancelAllGoals(). Depending on the implementation of the server, this will result in the cancel messages from the server to be sent faster than the client-side ROS queue will be handled and all except for one goal will be in a broken state stuck forever in waiting for a cancel ACK.

i do understand, and yes logically that is possible or probable.

Also, the python implementation of the action client does not contain this bug (I assume setting the pub_queue_size to None does the same as setting it to 0, i.e., disables the queue size limit).

None means infinite(default), agreed on this. at least this should be consistent as ActionClient python. (i cannot explain why there is a difference between cpp and python.)

@fujitatomoya
Copy link
Contributor

@StefanFabian @glpuga

#162, that is an easy fix, but could you check just in case?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants