Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MeshAccessLog - add support for timeout and retry for TCP backend #8348

Open
Constantin07 opened this issue Nov 13, 2023 · 22 comments
Open

MeshAccessLog - add support for timeout and retry for TCP backend #8348

Constantin07 opened this issue Nov 13, 2023 · 22 comments
Labels
kind/feature New feature triage/accepted The issue was reviewed and is complete enough to start working on it

Comments

@Constantin07
Copy link

Description

Hi,
When sending Kuma logs via DataDog service in K8s we get sporadically error messages from kuma-sidekar with failures to send logs.
It would be nice to add support for configuring timeout & nr. of retries in case of TCP connection failure.

@Constantin07 Constantin07 added kind/feature New feature triage/pending This issue will be looked at on the next triage meeting labels Nov 13, 2023
@cbugneac-nex
Copy link

Here is an example of exception raised:

2023-11-10T14:47:39.554Z	ERROR	access-log-streamer	could not send the log to TCP log destination. Dropping the log	{"address": "datadog-kuma-tcp.datadog.svc.cluster.local:10518", "address": "datadog-kuma-tcp.datadog.svc.cluster.local:10518", "error": "failed to send a log entry to a TCP logging backend: datadog-kuma-tcp.datadog.svc.cluster.local:10518: write tcp 10.255.43.220:44174->172.20.167.212:10518: write: connection timed out", "errorVerbose": "write tcp 10.255.43.220:44174->172.20.167.212:10518: write: connection timed out
failed to send a log entry to a TCP logging backend: datadog-kuma-tcp.datadog.svc.cluster.local:10518
github.com/kumahq/kuma/app/kuma-dp/pkg/dataplane/accesslogs.(*logSender).send
\tgithub.com/kumahq/kuma/app/kuma-dp/pkg/dataplane/accesslogs/sender.go:33
github.com/kumahq/kuma/app/kuma-dp/pkg/dataplane/accesslogs.(*accessLogStreamer).streamAccessLogs
\tgithub.com/kumahq/kuma/app/kuma-dp/pkg/dataplane/accesslogs/streamer.go:163
github.com/kumahq/kuma/app/kuma-dp/pkg/dataplane/accesslogs.(*accessLogStreamer).Start.func2
\tgithub.com/kumahq/kuma/app/kuma-dp/pkg/dataplane/accesslogs/streamer.go:72
runtime.goexit
\truntime/asm_amd64.s:1650"}

@jakubdyszkiewicz jakubdyszkiewicz added triage/accepted The issue was reviewed and is complete enough to start working on it and removed triage/pending This issue will be looked at on the next triage meeting labels Nov 20, 2023
@jakubdyszkiewicz
Copy link
Contributor

Triage: did you consider using OpenTelemetry? It might support this already (on the agent side)

@samm-git
Copy link

@jakubdyszkiewicz but opentelemetry are for metrics, and accessLogStreamer is for logs, isnt it?

@jakubdyszkiewicz
Copy link
Contributor

OpenTelemetry can also be used for logs.

Here we support this in MeshAccessLog https://kuma.io/docs/2.5.x/policies/meshaccesslog/#opentelemetry

And it seems that Datadog agent now has a builtin support for it
https://www.datadoghq.com/blog/agent-otlp-log-ingestion/

@samm-git
Copy link

@jakubdyszkiewicz thank you for the hint. I will check proposed configuration and will let you know if it works. I was using datadog tcp collector as i think it was recommended in the docs, but i would explore solution proposed above.

@cbugneac-nex
Copy link

cbugneac-nex commented Nov 27, 2023

Thanks @jakubdyszkiewicz for the suggestion, not sure how OpenTelemetry would help here.
Does the OpenTelemetry client have support for retry/timeout settings ?
Reason being is that if the backend on DataDog is not ready (temporarily) for any reason to accept connections - it would fail similar to TCP backend.

@samm-git
Copy link

@jakubdyszkiewicz i see that if oltp logs are specified kuma just configures envoy to send the logs in openmetrics format. But i did not found any retry logic/settings/documentation in envoy about it. Do you know any details?

@samm-git
Copy link

samm-git commented Nov 28, 2023

@jakubdyszkiewicz I was able to get OTLP logs working, however, it also has own problems:

  1. It is not documented that openmetric logs are using GRPC, was able to find it from the source code and rendered envoy config. This should be mentioned in the doc, as OTLP supports both.
  2. Field openTelemetry.attributes fields do not accept %KUMA_*% placeholders. Moreover, it renders bad config for envoy, so it stops accepting any changes.
  3. openTelemetry.body field accepts %KUMA_*% placeholders, but it does escaping automatically, so json is getting escaped and not parsed by DD correctly
  4. I also tried to put to openTelemetry.body object instead of string (just using test: "value" but seems that such config is silently ignored by Kuma and it reverts to default text line

Could you please help with that?

@samm-git
Copy link

@michaelbeaumont
Copy link
Contributor

michaelbeaumont commented Nov 28, 2023

  1. openTelemetry.body field accepts %KUMA_*% placeholders, but it does escaping automatically, so json is getting escaped and not parsed by DD correctly

@samm-git could you copy/paste an example of what you mean?

@samm-git
Copy link

@michaelbeaumont sure.

openTelemetry:
  body: '{"test":"value"}'

would render as escaped

"{\"test\": \"value\"}"
openTelemetry:
  body:
    test: "value"

gets ignored and default is in use (why?)

openTelemetry:
  body:
    - test: value

Actually gets rendered to the array

[{"test":"value"}]

what again prevents datadog from creating an index.

@samm-git
Copy link

Also, i did an experiment - i tried to kill Datadog daemonset to see how errors are handled.

From what i see - envoy silently dropped all messages when datadog agent was down. No errors were reported, nothing. So it's even worse compared to the original behavior, at least we had some reference that data is incomplete.
@jakubdyszkiewicz ^^

@samm-git
Copy link

From what i see in Envoy rendered config - MeshAccessLogs does not configure any retry settings

{
  "log_name": ...,
  "grpc_service": {...},
  "transport_api_version": ...,
  "buffer_flush_interval": {...},
  "buffer_size_bytes": {...},
  "filter_state_objects_to_log": [],
  "grpc_stream_retry_policy": {...},
  "custom_tags": []
}

Most of the things related to retry and log flush are configured via buffer_size_bytes and grpc_stream_retry_policy. As we have only default values which are very small, logs likely will be lost.

@jakubdyszkiewicz
Copy link
Contributor

Hey, I did some investigation on the "body" and our docs are lacking, or rather examples.

Our docs says

Body is a raw string or an OTLP any value as described at https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/data-model.md#field-body It can contain placeholders available on https://www.envoyproxy.io/docs/envoy/latest/configuration/observability/accesslog/usage#command-operators_

This body field is field from OTEL and it's quite complex. If you want to do key value you can do this

apiVersion: kuma.io/v1alpha1
kind: MeshAccessLog
metadata:
  name: default
  namespace: kuma-system
spec:
  targetRef:
    kind: Mesh
  from:
    - targetRef:
        kind: Mesh
      default:
        backends:
          - type: OpenTelemetry
            openTelemetry:
              endpoint: otel-collector.observability.svc:4317
              body:
                kvlistValue:
                  values:
                  - key: "mesh"
                    value:
                      stringValue: "%KUMA_MESH%"

KUMA_MESH is then interpolated just fine.

We should also do a better job at failing when the format is not right instead of fallback to a default. I agree that it's confusing.

When it comes to retries. I think grpc_stream_retry_policy could help. Let my try to compose MeshProxyPatch to configure this. If this becomes useful we could incorporate this into MeshAccessLog policy

@samm-git
Copy link

samm-git commented Nov 29, 2023

@jakubdyszkiewicz thank you. May i kindly ask to add interpolation to headers as well? As in fact, we don't need a body at all, these all could be injected as OTLP attributes. Plus it is very confusing to have different rules for headers and for the body

@samm-git
Copy link

samm-git commented Nov 29, 2023

@jakubdyszkiewicz JFYI - example

      default:
        backends:
          - type: OpenTelemetry
            openTelemetry:
              endpoint: otel-collector.observability.svc:4317
              body:
                kvlistValue:
                  values:
                  - key: "mesh"
                    value:
                      stringValue: "%KUMA_MESH%"

works as expected and DataDog is happy with it and indexing the mesh field. I think it would be good to update the documentation with it as it is very confusing atm.

@jakubdyszkiewicz
Copy link
Contributor

Ok, so to test retry you can check this proxypatch

apiVersion: kuma.io/v1alpha1
kind: MeshProxyPatch
metadata:
  name: custom-template-1
  namespace: kuma-system
spec:
  targetRef:
    kind: Mesh # use MeshService or MeshSubset if you want to apply this only on specific proxies
  default:
    appendModifications:
      - networkFilter:
          operation: Patch
          match:
            name: envoy.filters.network.http_connection_manager
            # origin: inbound # optional to if you want to add this on incoming listeners
            # listenerName: "inbound:10.42.0.9:3001" # optional if you want to add this to specific listener
          jsonPatches:
            - op: add
              path: /accessLog/0/typedConfig/commonConfig/grpcStreamRetryPolicy
              value:
                numRetries: 5
                retryBackOff:
                  baseInterval: "1s"
                  maxInterval: "10s"

I agree on docs. Created issue kumahq/kuma-website#1538
Interpolation of attributes makes sense #8492

Let me know how this retry goes.

@bartsmykla
Copy link
Contributor

I added above example to OpenAPI schema of MeshAccessLog policy: https://github.com/kumahq/kuma/pull/8533/files#diff-ed8e7e0fddb480be4d601ae35bb953cddd53659fb3b9d306b43996a8f322cb38R82

@lukidzi lukidzi added this to the 2.7.x milestone Feb 21, 2024
@lahabana lahabana removed this from the 2.7.x milestone Mar 25, 2024
@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Jun 24, 2024
Copy link
Contributor

This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed.
If you think this issue is still relevant, please comment on it or attend the next triage meeting.

@slonka slonka removed the triage/stale Inactive for some time. It will be triaged again label Jun 24, 2024
@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Sep 23, 2024
Copy link
Contributor

This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed.
If you think this issue is still relevant, please comment on it or attend the next triage meeting.

@lobkovilya lobkovilya added triage/pending This issue will be looked at on the next triage meeting and removed triage/accepted The issue was reviewed and is complete enough to start working on it triage/stale Inactive for some time. It will be triaged again labels Sep 30, 2024
@lobkovilya
Copy link
Contributor

Triage: it should be configurable somewhere (without MeshProxyPatch). We should support it only for OTEL backend.

@lobkovilya lobkovilya added triage/accepted The issue was reviewed and is complete enough to start working on it and removed triage/pending This issue will be looked at on the next triage meeting labels Sep 30, 2024
@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Dec 31, 2024
Copy link
Contributor

This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed.
If you think this issue is still relevant, please comment on it or attend the next triage meeting.

@lukidzi lukidzi removed the triage/stale Inactive for some time. It will be triaged again label Jan 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature New feature triage/accepted The issue was reviewed and is complete enough to start working on it
Projects
None yet
Development

No branches or pull requests

10 participants