Network issues with the Baxter Research Robot #177

Open
alecive opened this issue Jan 27, 2017 · 9 comments

@alecive

alecive commented Jan 27, 2017

Hello, for the past three days we have been experiencing some difficult-to-reproduce, difficult-to-track network issues with our Baxter.

What happens in practice is that after a while some topics stop being published, and the robot is not responding to even the simplest command (e.g. tuck). Also, rebooting both the Baxter and the machine connected to it does not change much: after 5 minutes the issue reappears.

After hours of trying to understand the reason for this, we discovered that the issue might be related to rosmaster not closing sockets properly and leaving them in CLOSE_WAIT state. Also, when this happens, rosmaster starts to use more than 100% of one core on the Baxter machine, which seems strange.
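One way to put numbers on the socket states described above is to read the kernel's TCP table directly. The following is a minimal sketch, assuming a Linux machine where `/proc/net/tcp` is readable; the function name `count_tcp_states` is made up for illustration and is not part of ROS or the Baxter SDK.

```python
from collections import Counter

# Hex state codes used in the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state, given the text of /proc/net/tcp."""
    counts = Counter()
    # The first line is a column header, so skip it.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated lines
        code = fields[3].upper()
        # Unknown codes are kept as-is rather than dropped.
        counts[TCP_STATES.get(code, code)] += 1
    return counts
```

To use it, pass in the file contents, e.g. `count_tcp_states(open("/proc/net/tcp").read())`, and look at the `CLOSE_WAIT` and `TIME_WAIT` entries. Note that IPv6 sockets are listed separately in `/proc/net/tcp6`, which has the same column layout.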

The issue seems to be well-known to the ROS community, but although I have seen many issues on that, I don't really understand how to fix the problem.

Some useful links:

Please tell me if I can help you debug it in any way.

@rethink-imcmahon sorry for tagging you directly, but I saw your comment in one of the issues I linked above, and I thought you might have a quick way to debug the issue.

@IanTheEngineer
Contributor

No worries on tagging me. I have not dug into any of the issues you've called out here, but I have helped many, many Baxter customers properly network their computers to their robots. I would first make sure the networking layer is rock solid - make sure to follow our Networking tutorial, with key points being:

  • Disable Wireless networking on your Ubuntu workstation
  • Disable the Ubuntu Firewall: $ sudo ufw disable
  • Note whether you have two network cards on your Workstation, and how that may affect the ROS traffic
    (you'll likely want to make sure everything is working in the simplest configuration first, before adding network complexity)
  • The best configuration by far is having a router you control connect to the Internet, and put your computer & robot behind that router. This is much easier to debug, with fewer unknowns than most networking configurations:
    [Image: router configuration diagram]
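Before debugging anything at the ROS level, it can help to confirm that the workstation can resolve the robot's hostname and open a TCP connection to the ROS master port. The sketch below is only an illustration of that check: `check_master_reachable` is a hypothetical helper, 11311 is the master port that appears in the `baxter.local:11311` URI later in this thread, and the hostname to test is whatever your own `baxter.sh` is configured with.

```python
import socket


def check_master_reachable(host, port=11311, timeout=2.0):
    """Return (ok, detail): does `host` resolve and accept TCP on `port`?"""
    try:
        address = socket.gethostbyname(host)
    except OSError as exc:
        return False, "cannot resolve {}: {}".format(host, exc)
    try:
        # Open and immediately close a TCP connection to the master port.
        with socket.create_connection((address, port), timeout=timeout):
            pass
    except OSError as exc:
        return False, "cannot connect to {}:{}: {}".format(address, port, exc)
    return True, "{} ({}) accepts connections on port {}".format(host, address, port)
```

For example, `check_master_reachable("baxter.local")` returns a boolean plus a short message saying which step failed. A failure at the resolution step points at naming or DNS, while a failure at the connection step points at routing, firewalls, or the master not running.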

@alecive
Author

alecive commented Jan 27, 2017

Thank you @IanTheEngineer for being so quick in your reply. So what we basically had before was a direct connection between the development workstation and the Baxter robot through the second network card the workstation is equipped with.

We proceeded to unplug the Baxter from the workstation and plug it into the router, changed the baxter.sh params and rebooted both the Baxter and the workstation.

It seems that the issue has disappeared, at least from our quick testing. The problem we have now is that the incoming network speed is capped at 11MB/s, which renders most of our code unusable. When we were connected directly to the robot, the speed was always higher than 50MB/s. Whilst I am aware that a better/faster/newer router would increase the network speed in such a configuration, this does not explain (at least to me) the reason for the problem, or why it appeared only now. Where do you think the issue originates? From here, it seems that rebooting the machines helps anyway because it clears the sockets in CLOSE_WAIT state.
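As a rough sanity check on the 11MB/s figure, nominal Ethernet link rates can be converted to bytes per second. This is plain arithmetic, not a measurement from this setup, and the suggestion that the cap comes from a 100 Mbit/s port or cable somewhere in the path is an assumption the thread does not confirm.

```python
def link_rate_to_mbytes_per_s(mbit_per_s):
    """Convert a nominal link rate in Mbit/s to MB/s (8 bits per byte)."""
    return mbit_per_s / 8.0


# Upper bounds before protocol overhead, for two common link speeds.
for name, rate in [("Fast Ethernet", 100), ("Gigabit Ethernet", 1000)]:
    print(name, link_rate_to_mbytes_per_s(rate), "MB/s upper bound")
```

A 100 Mbit/s link tops out at 12.5 MB/s before overhead, so an observed ceiling of about 11MB/s would be consistent with such a link, whereas the 50MB/s seen on the direct connection is only possible on a faster one.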

Anyway, we will try to keep debugging the issue in both configurations (with or without the router in between the two machines), and we'll let you know. It takes time to reproduce the issue and I am still not sure if the "router fix" helped or not.

@alecive
Author

alecive commented Jan 27, 2017

Further investigation: we went back to the "direct connection" configuration.

Again, rosmaster CPU usage climbs back above 100%, and the issue shows up again even though the number of sockets in CLOSE_WAIT state seems low (about 10). We do see a large number of sockets in TIME_WAIT state, though.

After closing our launch files, what happens is that rosmaster usage stays high for ~5 minutes, until all these sockets exit from their TIME_WAIT state. When this happens, rosmaster usage goes back down to 0.3%, and we regain control over the network and the Baxter.

@alecive
Author

alecive commented Jan 27, 2017

The number of sockets in TIME_WAIT state keeps increasing over time after launching our launch file. After 30 seconds of use, it fluctuates around 3000 and stays there. Closing the launch files starts reducing those sockets until they go back to 0 after ~5 minutes.
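To see whether those TIME_WAIT sockets are actually connections to the ROS master, they can be grouped by port. The sketch below is an assumption-laden illustration: it reads the Linux `/proc/net/tcp` format, treats 11311 (the port in the `baxter.local:11311` master URI shown later in this thread) as the master port, and `time_wait_by_port` is a hypothetical helper name.

```python
from collections import Counter

TIME_WAIT = "06"  # hex state code for TIME_WAIT in /proc/net/tcp


def port_of(hex_endpoint):
    """Extract the port from an 'ADDR:PORT' field where both parts are hex."""
    return int(hex_endpoint.rsplit(":", 1)[1], 16)


def time_wait_by_port(proc_net_tcp_text, watched_ports=(11311,)):
    """Count TIME_WAIT sockets per watched port; the rest go under 'other'."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 4 or fields[3].upper() != TIME_WAIT:
            continue
        ports = (port_of(fields[1]), port_of(fields[2]))
        matched = [p for p in watched_ports if p in ports]
        if matched:
            for p in matched:
                counts[p] += 1
        else:
            counts["other"] += 1
    return counts
```

Calling `time_wait_by_port(open("/proc/net/tcp").read())` every few seconds while the launch file runs would show whether the roughly 3000 sockets are dominated by master traffic or by node-to-node connections, which is the kind of detail that is useful in an upstream bug report.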

@IanTheEngineer
Contributor

It is entirely possible that this CLOSE_WAIT issue is affecting Baxter's roscore. This is really useful debugging info you're collecting here, and I'd recommend adding it to the ticket you've linked so that the ros_comm maintainers have more context for the bug. In the meantime, I'd recommend getting a solid router for around $50 to mitigate the issue.

@alecive
Author

alecive commented Jan 27, 2017

We'll do that. In the meantime, I am wondering whether it's worth upgrading the whole system to kinetic. It seems that ROS support for older versions is not that great, and an upgrade might help.

I am following the issue here: when do you think the QA team will be able to test the Baxter with kinetic? Is there an ETA for that? I would like to stick with the official channels for the baxter robot.

@alecive
Author

alecive commented Feb 6, 2017

@IanTheEngineer do you have any suggestions about the best router we could buy to satisfy our bandwidth hunger? I can obviously look for a router by myself, but maybe Rethink has a list of suggested/recommended hardware in this regard.

@alecive
Author

alecive commented Mar 9, 2017

@IanTheEngineer we finally bought a new router for our setup.

After quick testing, the maximum read/write speeds allowed by our system now hover around 100MB/s, which is much more than we need and, importantly, much better than the 11MB/s allowed by our previous router (now the bottleneck is probably the hard drive).

Above all, the problem seems to be gone now, so I am going to close the issue. We will keep testing the new setup in the following weeks, and we'll re-open this issue if needed.

For future reference, here is a link for purchasing the exact model we bought: https://www.amazon.com/dp/B00QGOQ2BA/ref=psdc_300189_t1_B00HEX851C?th=1

Thank you for the support!
Cc @omangin

@alecive alecive closed this as completed Mar 9, 2017
@alecive
Author

alecive commented Jun 16, 2017

@IanTheEngineer reopening because what we believe are network issues are still present. We updated the router, I fixed a bug in ros_comm that was causing some issues (see here), and now I don't get any ROS error that I could use to understand what is going on.

The behavior we have right now is that everything is fine, until at some point one of the following happens:

  • It is no longer possible to tuck the robot
  • I cannot see images coming from either of the two hand cameras
  • One of the two cameras shuts down, and I am not able to turn it back on again with the provided rosrun baxter_tools camera_control.py -o

The only way I have found to fix any of these is to reboot the robot altogether. FYI:

[baxter - http://baxter.local:11311] scazlab@baxterserver:~/ros_devel_ws$ rosparam get /rethink/software_version 
1.2.0.57
[baxter - http://baxter.local:11311] scazlab@baxterserver:~/ros_baxter_ws/src/baxter (master)$ git describe
v1.2.0

Also, I just reinstalled the Baxter workstation from scratch with indigo, and my network setup is the recommended one, i.e. this one:
[Image: network setup diagram]

@alecive alecive reopened this Jun 16, 2017