-
Notifications
You must be signed in to change notification settings - Fork 56
Releases
Dan Bason edited this page Jun 22, 2023
·
78 revisions
Additional features we're working on:
- Cortex Alert Manager :
- scalability & HA consistency improvements for opni-alerting
- Gateway/Agent
- Add ping endpoint to gateway
- Add rate limiting to streaming connections
- Auto update of agent
- Logging
- Bug fixes to controlplane logs and events
- Metrics
- Use OTEL collector for metrics collection (greatly reduced agent footprint)
- AIOps enhancements/fixes
- Alerting enhancements/fixes
- Monitoring fixes:
- Fixed an issue where an agent reconnecting could cause errors scraping gateway metrics (#1295)
- CLI fixes:
- Fixed bash completion not working correctly (#1299)
- AIOps enhancements/fixes
- Update AI Services to now use Nats Jetstream KV storage (#1146)
- Remove "Enable GPU Services" button from Opni Admin Dashboard. (#1147)
- Update AIOps gateway plugin to update Nats Jetstream kv when model is submitted for training. (#1148)
- Update AI Services to launch model training upon startup of service (#1233)
-
Alerting enhancements
- Metrics based alarm incidents are now tracked in the alerting timeline
-
Monitoring enhancements
- Opni internal metrics containing cluster names are now stored per-tenant instead of in the local cluster's tenant
- Upstream Opni local agent identification
-
Logging enhancements
- Move to using open-telemetry-collector to collect logs
- Send logs to the central cluster using gRPC
- Store cluster id to name mapping for better visibility
-
AIOps enhancements/fixes
-
UI
-
AIOps Enhancements/Fixes
-
UI Enhancements/Fixes
- Fix an issue where the agents page crashed when an agent no longer exists for a role that was applied to the agent ( #1109 )
- Fixed Monitoring capability metrics not showing up ( #1111 )
- Changing how we add agents via the UI to improve reliability ( #965 )
- Switched to using the new alarm API. ( pr118 )
- Monitoring enhancements
- metrics admin CLI can list / filter & aggregate cortex rules
- includes listing & querying stateful information about alerting rules
- metrics admin CLI can create,read, update or delete standard prometheus rule group files to clusters
- An initial cluster name can be set during bootstrap of new agents by setting the
opni-agent.friendlyName
helm value. When a cluster name is set in the UI, this flag will be added to the copyable helm install command.
- metrics admin CLI can list / filter & aggregate cortex rules
- Minor update to ingest plugin to prepare for Opni preprocessing service
- Notable dependency updates
- go 1.20
- cortex to v1.14.1
- go-plugin to v1.4.8
- nats-server to v2.9.12
- alertmanager to v0.25.0
- prometheus to v0.41.0
- opentelemetry to v1.12.0
- grpc to v1.52.1
- kubernetes sdk to v0.26.1
- controller-runtime to v0.14.4
- cert-manager to v1.10.2
- gin to v1.8.2
- UI Enhancements/Fixes
- Updating the tooltip in the alerting overview page to improve accuracy (#1006)
- Fixed a bug where we didn't show all log templates in the Opensearch plugin. Also added a count to the top of the same table. (pr 106)
- Fixed a bug where you couldn't update the s3 endpoint in the monitoring config (#1009)
- Fixed a bug where you couldn't edit PagerDuty alarm endpoints (#958)
- Improved the consistency of Cluster/Agent name and id labeling (#1048)
- Improved chart labeling on the insights page of the opensearch dashboard (#1046)
- Only show relevant storage options when selecting the 'Highly Available' Monitoring option
- Compensate for an overzealous 'no changes to apply' error the Alerting backend emits when installing Alerting.
- AIOps Enhancements/Fixes
- Improved pre-trained DRAIN models with more ground truth (pr16)
- Added test coverage to 64% to the deep-learning model service (pr38)
- Fixed the NotFoundError during downloading training data from Opensearch(pr39)
- Updated Opensearch query to filter out anomaly keywords in training data (pr16)
- Bump PyTorch version to latest stable version (pr1035)
- Alerting enhancements
- scalability optimizations; reduced memory footprint for the Alerting backend
- Improved fault tolerance for the Alerting backend notification system
Bugfix release:
- Add retry for fetching client certs
- Fix agent crash on startup
Introduction of Opni AI workload training module:
- Pre-req: NVIDIA GPU is enabled in cluster Opni is installed in
- Users can select deployments are important to them in the Opni Admin UI -> a model will be trained for the user and logs belonging to those deployments in the future will be given a "Normal" or "Anomalous" label
- Users can ingest these insights in the Opni plugin in their Opensearch Dashboards (enabled when Opni Logging backend is setup)
Alerting enhancements
- New PagerDuty endpoint for receiving notifications
- Clone operation for cloning alarms to any target cluster(s)
- Alarms for when downstream agent capabilities are unhealthy
- Alarms for tracking the health of the opni monitoring backend health
- Alarms based on prometheus queries
- Alarms for tracking the states of kubernetes objects
Opensearch changes:
- Opensearch version updated to 2.4.0
- Dataprepper version updated to 2.0.1
- Reconcilers now user TLC client cert to authenticate to the Opensearch API
Introduction of Opni Alerting:
- Slack, Email Endpoint for receiving notifications
- Alarms on agent connection status
- Overview for breached alarms
Architecture
- Backends
- Core Components