kindling_tcp_connect_total does not accurately reflect whether TCP connections between containers failed #548

Open
xuchuan-666 opened this issue Jul 17, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@xuchuan-666

Describe the bug
PromQL: increase(kindling_tcp_connect_total{success="false"}[2m])
This query always returns non-zero values for service-to-service traffic.
How to reproduce?
Deploy a Kubernetes cluster that uses Calico in IPIP overlay network mode, then deploy any Java programs that call each other. The issue reproduces.
What did you expect to see?
increase(kindling_tcp_connect_total{success="false"}[2m]) should accurately reflect whether TCP connections between two pods failed, so the data needs to be more accurate.
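
One way to narrow this down is to break the failures out by error code. The sketch below builds such a query for the Prometheus HTTP API (`/api/v1/query`). It assumes the metric carries an `errno` label (a later comment in this thread reports one). The helper names and the base URL are hypothetical.

```python
from urllib.parse import urlencode


def failed_connects_by_errno(window="2m"):
    """Build a PromQL query that groups failed connects by the errno label.

    Assumes kindling_tcp_connect_total exposes an `errno` label.
    """
    return (
        'sum by (errno) (increase('
        'kindling_tcp_connect_total{success="false"}[%s]))' % window
    )


def instant_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})


if __name__ == "__main__":
    # "http://prometheus:9090" is a placeholder address, not taken from this issue.
    print(instant_query_url("http://prometheus:9090", failed_connects_by_errno()))
```

If a single errno value accounts for most of the reported failures, that points at one class of event rather than at real connection failures between pods.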

What did you see instead?

[Screenshot: 1689561557677]

All of the data points inside the boxes are false positives.

Screenshots

What config did you use?

kindlingproject/kindling-agent:latesttest
kindlingproject/kindling-grafana:latesttest

Logs

Environment (please complete the following information)

  • Kindling agent version
  • Kindling-falcon-lib version
  • Node OS version
  • Node Kernel version
  • Kubernetes version
  • Prometheus version
  • Grafana version

Additional context

@xuchuan-666 xuchuan-666 added the bug Something isn't working label Jul 17, 2023
@dxsup
Member

dxsup commented Jul 18, 2023

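How did you determine that these data points are "false positives"? Do these calls not exist at all, or do the calls exist but no connection failure actually occurred?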

@xuchuan-666
Author

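How did you determine that these data points are "false positives"? Do these calls not exist at all, or do the calls exist but no connection failure actually occurred?

The calls exist, but no connection failure occurred. Our service calls and logs show no anomalies at all, yet the data collected by Kindling intermittently reports TCP connection failures.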

@xuchuan-666
Author

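Our use case is fairly simple. Whether the calls are between services inside the cluster or from cluster services to middleware outside the cluster, TCP connection failures are reported intermittently. We checked the application logs and found no error output at all, and more than one service is affected, so we suspect the collected data is wrong.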

@dxsup
Member

dxsup commented Jul 20, 2023

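Please enable debug logging and post the logs so that I can look at the data received by tcpconnectanalyzer.

To do this, set observability.console_level to debug in the configuration file and add tcpconnectanalyzer to observability.debug_selector. Then use kubectl logs to redirect the logs to a file and attach the file here.

Please capture about 5 minutes of logs, and make sure the "false-positive connection failure" metric appears during that window.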

@xuchuan-666
Author

2.txt
0358a44100bd16129b5a8c2d7fb371d
58fab7587abaede58896aab485a035b

@xuchuan-666
Author

In the collected data, kindling_tcp_connect_total{errno="-2",success="false"} has an errno value of -2. This error occurs with Unix domain sockets. Sockets of type AF_UNIX should be filtered out, because they are not TCP.
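
For context, errno 2 is ENOENT on Linux, which connect() returns for a Unix domain socket when the target path does not exist. The sketch below reproduces that error and shows the kind of address-family filter suggested above. `ConnectEvent` and `keep_tcp_events` are hypothetical names for illustration, not Kindling's actual types, and the negative return-code convention is an assumption based on the errno="-2" label.

```python
import errno
import socket
from dataclasses import dataclass


@dataclass
class ConnectEvent:
    """Hypothetical connect event; not Kindling's real data model."""
    family: int  # socket address family, e.g. socket.AF_INET
    ret: int     # assumed raw return code: 0 on success, -errno on failure


def unix_connect_errno(path):
    """Try to connect an AF_UNIX stream socket; return the errno (0 if connected)."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.connect(path)
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        sock.close()


def keep_tcp_events(events):
    """Keep only events whose address family can carry TCP (IPv4 or IPv6)."""
    ip_families = (socket.AF_INET, socket.AF_INET6)
    return [ev for ev in events if ev.family in ip_families]


if __name__ == "__main__":
    # Connecting to a socket path that does not exist fails with ENOENT.
    print(unix_connect_errno("/nonexistent/kindling-demo.sock") == errno.ENOENT)
    events = [
        ConnectEvent(family=socket.AF_UNIX, ret=-errno.ENOENT),
        ConnectEvent(family=socket.AF_INET, ret=-errno.ECONNREFUSED),
    ]
    print(keep_tcp_events(events))
```

Filtering on the address family rather than on the errno value keeps genuine TCP failures visible while dropping Unix domain socket events regardless of which error they carry.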
