Observability: Add pre-built dashboards for basic infrastructure and model metrics #634

Closed
1 of 2 tasks
rddefauw opened this issue Aug 15, 2024 · 4 comments · Fixed by #651
Labels: backlog, enhancement

Comments

@rddefauw

Describe the feature

When using a model from Bedrock or hosted in SageMaker, provide a pre-built CloudWatch dashboard that shows the essential infrastructure and model metrics. For Bedrock, this would include things like invocations, latency, and token count. For SageMaker we would have invocations and latency, and perhaps in future provide an extension to track token count. For multi-tenant use cases, we could also track per-tenant dimensions to help with cost allocation, to the extent that the services support this.

Other components like vector databases would support both infrastructure metrics (e.g., CPU overhead on an OpenSearch collection) and metrics like number of retrieval requests.

Use Case

As with any workload, observability is key to understanding workload behavior. In the case of GenAI, surfacing the basic metrics via CloudWatch dashboards and alarms helps the operator understand performance problems (e.g., unusually high invocation latency) and more general trends (average number of model requests).

Proposed Solution

Provide L2 constructs for dashboards for major components, like models hosted in Bedrock or SageMaker, or vector databases in AOSS. The simplest usage might be:

new BedrockModelDashboard(this, "ModelDashboard", {
  modelId: "anthropic.claude-3-haiku-20240307-v1:0",
});
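To make the idea concrete, here is a rough sketch of the dashboard body such a construct might assemble for the Bedrock metrics mentioned above (invocations, latency, token count). It is not the reference implementation or the code in #651. The function name `bedrockDashboardBody`, the widget layout, the statistics, and the 5-minute period are all assumptions for illustration. The `AWS/Bedrock` namespace, the metric names, and the `ModelId` dimension are the ones Bedrock publishes to CloudWatch.

```typescript
// Hypothetical sketch: these names are illustrative and do not come from the issue or PR #651.

// One metric row in CloudWatch dashboard JSON: [namespace, metricName, dimensionName, dimensionValue].
type MetricRow = [string, string, string, string];

interface MetricWidget {
  type: "metric";
  x: number;
  y: number;
  width: number;
  height: number;
  properties: {
    title: string;
    region: string;
    stat: string;
    period: number;
    metrics: MetricRow[];
  };
}

interface DashboardBody {
  widgets: MetricWidget[];
}

const BEDROCK_NAMESPACE = "AWS/Bedrock";
const WIDGET_WIDTH = 8; // the dashboard grid is 24 units wide, so three widgets fit per row

function bedrockDashboardBody(modelId: string, region: string): DashboardBody {
  // Assumed panel selection: one widget per metric group named in the issue.
  const panels: { title: string; stat: string; metricNames: string[] }[] = [
    { title: "Invocations", stat: "Sum", metricNames: ["Invocations"] },
    { title: "Invocation latency (ms)", stat: "Average", metricNames: ["InvocationLatency"] },
    { title: "Token count", stat: "Sum", metricNames: ["InputTokenCount", "OutputTokenCount"] },
  ];

  const widgets = panels.map((panel, index): MetricWidget => ({
    type: "metric",
    x: index * WIDGET_WIDTH,
    y: 0,
    width: WIDGET_WIDTH,
    height: 6,
    properties: {
      title: panel.title,
      region,
      stat: panel.stat,
      period: 300, // seconds
      // Every metric is filtered by the only dimension Bedrock offers here: ModelId.
      metrics: panel.metricNames.map(
        (name): MetricRow => [BEDROCK_NAMESPACE, name, "ModelId", modelId],
      ),
    },
  }));

  return { widgets };
}
```

The result of `JSON.stringify(bedrockDashboardBody(modelId, region))` has the shape CloudWatch expects for a dashboard body, so a construct could pass it to the `dashboardBody` property of `CfnDashboard`, or build the equivalent widgets with the higher-level CloudWatch constructs.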

One potential drawback is that the native metrics for Bedrock don't support dimensions beyond model ID. If a single account is hosting multiple workloads in the same region, the Bedrock metrics would be aggregated across all workloads. If in future Bedrock offers more dimensions, we can take advantage of them. We can also consider more advanced constructs that provide their own way to meter things like token count at a more granular level.
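As an illustration of what "their own way to meter" could look like, here is a minimal sketch that records per-tenant token counts as custom metrics using the CloudWatch Embedded Metric Format (EMF). EMF is my assumption here; the issue does not name a mechanism. The function name `tokenUsageEmfRecord`, the `GenAI/Custom` namespace, and the `TenantId` dimension are hypothetical.

```typescript
// Hypothetical sketch: the namespace, dimension names, and function name are illustrative assumptions.

interface TenantTokenUsage {
  tenantId: string;
  modelId: string;
  inputTokens: number;
  outputTokens: number;
}

// Builds one EMF log line. When such a line reaches CloudWatch Logs (for example,
// written to stdout from a Lambda function), CloudWatch extracts the listed metrics.
function tokenUsageEmfRecord(usage: TenantTokenUsage, timestampMs: number = Date.now()): string {
  const record = {
    _aws: {
      Timestamp: timestampMs, // milliseconds since the Unix epoch
      CloudWatchMetrics: [
        {
          Namespace: "GenAI/Custom",
          // One dimension set: metrics are keyed by tenant and model together.
          Dimensions: [["TenantId", "ModelId"]],
          Metrics: [
            { Name: "InputTokenCount", Unit: "Count" },
            { Name: "OutputTokenCount", Unit: "Count" },
          ],
        },
      ],
    },
    // Dimension values and metric values live at the root of the record.
    TenantId: usage.tenantId,
    ModelId: usage.modelId,
    InputTokenCount: usage.inputTokens,
    OutputTokenCount: usage.outputTokens,
  };
  return JSON.stringify(record);
}
```

The application would call this after each model invocation, taking the token counts from the model response, and log the returned string. Because the tenant is a dimension of the custom metric, a dashboard can then break token usage down per tenant, which the native Bedrock metrics cannot do.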

Other Information

I have a reference implementation available in a private repo, and plan to publish it soon. It includes pre-baked CloudWatch dashboards for Bedrock and AOSS.

Acknowledgements

  • I may be able to implement this feature request
  • This feature might incur a breaking change
@rddefauw rddefauw added the needs-triage This issue or PR still needs to be triaged. label Aug 15, 2024
@krokoko krokoko added the enhancement New feature or request label Aug 15, 2024
@krokoko
Collaborator

krokoko commented Aug 15, 2024

Hi @rddefauw, thank you for this feature request! This might be a good fit for https://github.com/cdklabs/cdk-monitoring-constructs, what do you think? I see there is an open feature request to support AOSS.

@rddefauw
Author

I think the problem is not supporting a specific service like AOSS, but knowing which metrics are relevant for a GenAI workload. I think that fits better here.

There are also some more interesting metrics, like collecting human feedback, that I can see coming as a follow-up, and that would require deeper modifications to the application stack.

@krokoko
Collaborator

krokoko commented Aug 22, 2024

Hi @rddefauw, thanks again for this feature request! Could you please have a look at the draft PR mentioned above (#651)?
For a first iteration, we can publish the Bedrock dashboard, and I'll create separate tickets for the other component-level dashboards (AOSS, SageMaker endpoint), if that works.

@rddefauw
Author

Bedrock dashboard looks good. Agree on creating tickets for other components.

2 participants