Skip to content

Releases

Dan Bason edited this page Jun 22, 2023 · 78 revisions

Additional features we're working on:

  • Cortex Alert Manager :
    • scalability & HA consistency improvements for opni-alerting

v0.10.1 (TBA)

  • Gateway/Agent
    • Add ping endpoint to gateway
    • Add rate limiting to streaming connections
    • Auto update of agent
  • Logging
    • Bug fixes to controlplane logs and events

v0.10.0 (June 6, 2023)

  • Metrics
    • Use OTEL collector for metrics collection (greatly reduced agent footprint)

v0.9.2 (April 26, 2023)

  • AIOps enhancements/fixes
    • Update AI services to now give clearer statuses of the Deep Learning model when training. (#1272)
    • Fix bug where updating a watchlist by simply removing a deployment would cause training status to hang. (#1344)
  • Alerting enhancements/fixes
    • Resolve internal data race with internal alerting conditions (#1174)
    • Further improved performance of alarm status listing (#1022)
  • Monitoring fixes:
    • Fixed an issue where an agent reconnecting could cause errors scraping gateway metrics (#1295)
  • CLI fixes:
    • Fixed bash completion not working correctly (#1299)

v0.9.1 (April 6, 2023)

  • AIOps enhancements/fixes
    • Update AI Services to now use Nats Jetstream KV storage (#1146)
    • Remove "Enable GPU Services" button from Opni Admin Dashboard. (#1147)
    • Update AIOps gateway plugin to update Nats Jetstream kv when model is submitted for training. (#1148)
    • Update AI Services to launch model training upon startup of service (#1233)

v0.9.0 (March 16, 2023)

  • Alerting enhancements

    • Metrics based alarm incidents are now tracked in the alerting timeline
  • Monitoring enhancements

  • Logging enhancements

    • Move to using open-telemetry-collector to collect logs
    • Send logs to the central cluster using gRPC
    • Store cluster id to name mapping for better visibility
  • AIOps enhancements/fixes

    • Update Opensearch query to get log count per deployment based on updated field names from open-telemetry collector ( #1181 )
    • Update Opensearch query to use updated field names from open-telemetry-collector to fetch logs from Opensearch for training Deep Learning model. ( #19 )
  • UI

    • Fixed a bug that was preventing the Alert timeline to fully render
    • Added a friendly name field to the Manual Installation section of Add Agent to coincide with a chart update. (#1177)
    • Tagging Agents in the table with an icon which indicates if the agent is local or not. (#1166)

v0.8.3 (March 2, 2023)

  • AIOps Enhancements/Fixes

    • Optimize query for filtering out anomalous keywords within training controller. ( #18 )
    • implement streaming data-loader for log anomaly detection model training to reduce memory pressure from large training dataset. ( pr40, pr17)
  • UI Enhancements/Fixes

    • Fix an issue where the agents page crashed when an agent no longer exists for a role that was applied to the agent ( #1109 )
    • Fixed Monitoring capability metrics not showing up ( #1111 )
    • Changing how we add agents via the UI to improve reliability ( #965 )
    • Switched to using the new alarm API. ( pr118 )

v0.8.2 (February 16, 2023)

  • Monitoring enhancements
    • metrics admin CLI can list / filter & aggregate cortex rules
      • includes listing & querying stateful information about alerting rules
    • metrics admin CLI can create,read, update or delete standard prometheus rule group files to clusters
    • An initial cluster name can be set during bootstrap of new agents by setting the opni-agent.friendlyName helm value. When a cluster name is set in the UI, this flag will be added to the copyable helm install command.
  • Minor update to ingest plugin to prepare for Opni preprocessing service
  • Notable dependency updates
    • go 1.20
    • cortex to v1.14.1
    • go-plugin to v1.4.8
    • nats-server to v2.9.12
    • alertmanager to v0.25.0
    • prometheus to v0.41.0
    • opentelemetry to v1.12.0
    • grpc to v1.52.1
    • kubernetes sdk to v0.26.1
    • controller-runtime to v0.14.4
    • cert-manager to v1.10.2
    • gin to v1.8.2
  • UI Enhancements/Fixes
    • Updating the tooltip in the alerting overview page to improve accuracy (#1006)
    • Fixed a bug where we didn't show all log templates in the Opensearch plugin. Also added a count to the top of the same table. (pr 106)
    • Fixed a bug where you couldn't update the s3 endpoint in the monitoring config (#1009)
    • Fixed a bug where you couldn't edit PagerDuty alarm endpoints (#958)
    • Improved the consistency of Cluster/Agent name and id labeling (#1048)
    • Improved chart labeling on the insights page of the opensearch dashboard (#1046)
    • Only show relevant storage options when selecting the 'Highly Available' Monitoring option
    • Compensate for an overzealous 'no changes to apply' error the Alerting backend emits when installing Alerting.
  • AIOps Enhancements/Fixes
    • Improved pre-trained DRAIN models with more ground truth (pr16)
    • Added test coverage to 64% to the deep-learning model service (pr38)
    • Fixed the NotFoundError during downloading training data from Opensearch(pr39)
    • Updated Opensearch query to filter out anomaly keywords in training data (pr16)
    • Bump PyTorch version to latest stable version (pr1035)
  • Alerting enhancements
    • scalability optimizations; reduced memory footprint for the Alerting backend
    • Improved fault tolerance for the Alerting backend notification system

v0.8.1 (December 18, 2022)

Bugfix release:

  • Add retry for fetching client certs
  • Fix agent crash on startup

v0.8.0 (December 16, 2022)

Introduction of Opni AI workload training module:

  • Pre-req: NVIDIA GPU is enabled in cluster Opni is installed in
  • Users can select deployments are important to them in the Opni Admin UI -> a model will be trained for the user and logs belonging to those deployments in the future will be given a "Normal" or "Anomalous" label
  • Users can ingest these insights in the Opni plugin in their Opensearch Dashboards (enabled when Opni Logging backend is setup)

Alerting enhancements

  • New PagerDuty endpoint for receiving notifications
  • Clone operation for cloning alarms to any target cluster(s)
  • Alarms for when downstream agent capabilities are unhealthy
  • Alarms for tracking the health of the opni monitoring backend health
  • Alarms based on prometheus queries
  • Alarms for tracking the states of kubernetes objects

Opensearch changes:

  • Opensearch version updated to 2.4.0
  • Dataprepper version updated to 2.0.1
  • Reconcilers now user TLC client cert to authenticate to the Opensearch API

v0.7.0 (November 23, 2022)

Introduction of Opni Alerting:

  • Slack, Email Endpoint for receiving notifications
  • Alarms on agent connection status
  • Overview for breached alarms
Clone this wiki locally