-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat (spike): Package opentelemetry forwarding to xray #115
base: main
Are you sure you want to change the base?
Conversation
export( | ||
spans: ReadableSpan[], | ||
resultCallback: (result: ExportResult) => void | ||
): void { | ||
console.log("Exporting!!!", spans); | ||
const sym = Symbol(); | ||
const promise = client | ||
.send( | ||
new PutLogEventsCommand({ | ||
logGroupName: this.logGroupName, | ||
logStreamName: this.logStreamName, | ||
logEvents: spans.map((s) => ({ | ||
message: this.serializeSpan(s), | ||
timestamp: new Date().getTime(), | ||
})), | ||
}) | ||
) | ||
.then(() => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this run in a side car or within the lambda execution?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Within the lambda, but asynchronously.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should do this. It couples the critical path to telemetry. Should we instead run it as a side car ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could, it'd add some complexity and a bit of overhead though. What's the concern with it being on the critical path? Could just wrap the whole thing in a try/catch if failure is a worry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The request won't complete until the (slow) put logs request succeeds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree on not polluting logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, but wont the lambda execution be incomplete until the side car has completed sending to cloudwatch, and terminated, anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or are their lifetimes independent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, having issues with sequencing the logs to cloudwatch anyway, with this model. Did some reading up, I see lambda extensions get independent lifetimes from the function, ie the function can return back to caller, while the extension gets to clean up. So yeah I'll try making a cut down build of the collector that just exports to cloudwatch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's https://github.com/open-telemetry/opentelemetry-collector/tree/main/cmd/builder for making custom builds of the collector
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have any sample of what value this change brings?
this.workflows.orchestrator, | ||
"orchestrator" | ||
); | ||
this.telemetry.attachToFunction(this.scheduler.forwarder, "forwarder"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure the scheduler forwarded needs this? What is it logging? If it does need it, then the scheuler.handler
also needs it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure can take it off.
const logStream = new LogStream(this, `LogStream${componentName}`, { | ||
logGroup: this.logGroup, | ||
logStreamName: componentName, | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this what we want? A log steam per function for all time? Or is this just an experiment?
There is a limit to the number of writes to a log stream.
5 requests per second per log stream. Additional requests are throttled. This quota can't be changed.
The orchestrator, for all workflow executions, would be limited to 5TPS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrm ok didnt realise there was a throttle. I originally had it creating a new log stream every execution, but reliased without static streams it would be difficult to attach events listeners to the streams, to forward logs to the real collector. With static streams we can just set that up in cdk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An option I can think of instead, actually, is just skip the logging to cloudwatch part. Instead the extension just sends the data to the otel collector running in a different lambda, over http
const tracer = trace.getTracer(executionId, "0.0.0"); | ||
await tracer.startActiveSpan( | ||
"startWorkflow", | ||
{ | ||
attributes: { workflowName, input }, | ||
kind: SpanKind.PRODUCER, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to trace in the client or should we trace in the orchestrator (aka: those who call the client). Not all of the callers of the client will have tracing on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah orchestrator probably makes more sense.
provider.addSpanProcessor( | ||
new BatchSpanProcessor(new OTLPTraceExporter({ hostname: "127.0.0.1" })) | ||
); | ||
provider.addSpanProcessor(new SimpleSpanProcessor(new ConsoleSpanExporter())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
|
||
/** | ||
* Creates an entrypoint function for orchestrating a workflow | ||
* from within an AWS Lambda Function attached to a SQS FIFO queue. | ||
*/ | ||
const traceProvider = registerTelemetryApi(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these functions compile to MJS, would it help to make the AWS clients and telemetry start in parallel (top level await)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially tried that somewhere else, but the functions also compile to cjs, and breaks in that form. Its our lowest common denominator. That being said, registerTelemetryApi
isnt an async function, I dont think top level await's going to help us here.
await tracer.startActiveSpan( | ||
"createActivityWorker", | ||
{ attributes: { command: request.command.name } }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why have a span for createActivityWorker, but not createOrchestrator? Does this one time operation need a span?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've only created a couple of spans for testing purposes. I'll leave it up to you guys to create more once everything's in place.
const orchestrateSpan = tracer.startSpan("orchestrate"); | ||
const ret = await orchestrateExecution(workflow, executionId, records); | ||
orchestrateSpan.end(); | ||
return ret; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would the scope function be better here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be, for some reason that one isn't working right now though. Probably missing something in the setup of the sdk.
Aside from that, I'm suspecting that the traces being sent to eg. grafana could do a good job at visualising the workflow runs, and with a bunch of customisation ability, better than our own visualiser will be able to provide for some time. As for a sample, I'll put one together once everything works. |
❌ Deploy Preview for preeminent-heliotrope-380b2a failed.
|
Packages AWS opentelemetry distro for lambda into the functions, enabling capturing of opentelemetry instrumentation, and currently forwarding data to x-ray. An experiment on the impact on cold start times. Currently clocking in at 1.5s.
Have attempted to use own build of collector, enabling region-agnostic access, and enabling us to trim unnecessary modules from it, However so far haven't been able to get my build to forward to x-ray.
Next step will add some light instrumentation around activities and events to see how it turns out.