Data Pruning for Anomaly Detection #1193

Open

aayush-se wants to merge 5 commits into main from anomaly-detection/data-pruning

Conversation

@aayush-se (Member) commented Sep 24, 2024

  • Create Celery task to delete data that is over 28 days old and update matrix profiles for the remaining points (a rough sketch follows after this list)
  • Update DbDynamicAlert to include data_purge_flag and queued_at columns to reflect Celery task status
  • Write unit tests for new accessor methods and cleanup task
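
As a rough sketch only: the task described in the first bullet could look roughly like the following. DbDynamicAlert, data_purge_flag, queued_at, and the status strings come from this PR; DbDynamicAlertTimeSeries, the task name, and the accessor details are assumptions for illustration, not the PR's actual code.

# Illustrative sketch of the cleanup task; DbDynamicAlertTimeSeries, its column
# names, and the matrix-profile recompute step are assumed, not taken from the PR.
from datetime import datetime, timedelta, timezone

from celery import shared_task
from sqlalchemy import delete, update

from seer.db import DbDynamicAlert, DbDynamicAlertTimeSeries, Session  # assumed imports


@shared_task
def cleanup_timeseries(alert_id: int) -> None:
    """Drop points older than 28 days and refresh the stored matrix profile."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=28)
    with Session() as session:
        # Mark the alert as being processed so a concurrent run can skip it.
        session.execute(
            update(DbDynamicAlert)
            .where(DbDynamicAlert.id == alert_id)
            .values(data_purge_flag="processing")
        )
        # Delete the stale timeseries points.
        session.execute(
            delete(DbDynamicAlertTimeSeries).where(
                DbDynamicAlertTimeSeries.dynamic_alert_id == alert_id,
                DbDynamicAlertTimeSeries.timestamp < cutoff,
            )
        )
        # Recompute the matrix profile over the remaining points (omitted here).
        # Reset the flag so the alert can be queued again on the next beat run.
        session.execute(
            update(DbDynamicAlert)
            .where(DbDynamicAlert.id == alert_id)
            .values(data_purge_flag="not_queued", queued_at=None)
        )
        session.commit()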

@aayush-se aayush-se requested a review from a team September 24, 2024 18:09
@aayush-se aayush-se changed the title Anomaly detection/data pruning Data Pruning for Anomaly Detection Sep 24, 2024
@corps (Member) commented Sep 24, 2024

Before we turn this on in production with ENABLE_CELERY_WORKERs, let's get #1192 in and move this beat task behind a conditional for anomaly detection.
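
For illustration only, gating the beat entry could look something like the sketch below. The flag, the Celery app location, and the task path are assumptions here, not what #1192 actually adds.

# Illustrative only: register the cleanup beat task only when anomaly detection
# is enabled. The flag name, app location, and task path are assumptions.
import os

from celery.schedules import crontab

from seer.celery_app import celery_app  # assumed location of the Celery app

if os.environ.get("ANOMALY_DETECTION_ENABLED") == "1":  # placeholder for the real flag
    celery_app.conf.beat_schedule["cleanup-old-timeseries"] = {
        "task": "seer.anomaly_detection.tasks.cleanup_timeseries",
        "schedule": crontab(minute=0, hour="*"),  # hourly, purely for illustration
    }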

src/migrations/versions/480ca2916d86_migration.py (outdated)
src/migrations/versions/560f663c1cfd_migration.py (outdated)
src/seer/anomaly_detection/accessors.py
src/seer/anomaly_detection/models/external.py (outdated)
src/seer/anomaly_detection/anomaly_detection.py (outdated)
src/seer/anomaly_detection/models/external.py (outdated)
@ram-senth ram-senth force-pushed the anomaly-detection/data-pruning branch 3 times, most recently from dcb45bb to d36cbde Compare September 25, 2024 22:17
@ram-senth (Member) commented:

We also need some metrics published from the Celery task so that we can keep track of:

  • number of cleanups done in a time window
  • time taken for each cleanup, so we can monitor p90 or p99
  • count of times a cleanup leaves the alert with less than 7 days of data
  • number of times the cleanup task encountered an error

These are in addition to the regular queue-related metrics like queue size, time spent waiting, etc. We should check with Jenn if we need to do anything specific for those to work. (A rough sketch of where these could be emitted follows below.)
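
To make the list above concrete, here is a hedged sketch of where each metric could be emitted inside the task. record_metric and run_cleanup are hypothetical stand-ins, not existing helpers in this repo, and the metric names are made up for illustration.

# Hypothetical sketch only: record_metric stands in for whatever metrics backend
# is chosen, and run_cleanup stands in for the real cleanup logic.
import time


def record_metric(name: str, value: float, tags: dict | None = None) -> None:
    """Stand-in for a real metrics client (statsd, Datadog, etc.)."""


def run_cleanup(alert_id: int) -> int:
    """Stand-in for the real cleanup; assumed to return days of data remaining."""
    return 28


def cleanup_with_metrics(alert_id: int) -> None:
    start = time.monotonic()
    try:
        remaining_days = run_cleanup(alert_id)
        record_metric("cleanup.completed", 1, tags={"alert_id": alert_id})
        if remaining_days < 7:
            # Count cleanups that leave the alert with less than 7 days of data.
            record_metric("cleanup.short_history", 1, tags={"alert_id": alert_id})
    except Exception:
        record_metric("cleanup.errors", 1, tags={"alert_id": alert_id})
        raise
    finally:
        # Duration per cleanup, aggregated later into p90/p99.
        record_metric("cleanup.duration_ms", (time.monotonic() - start) * 1000.0)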

@aayush-se (Member, Author) replied:

I did include Sentry tracing for the cleanup task, so I think everything aside from the third point (cleanups that leave less than 7 days of data) should be viewable. I'm unsure about the other queue-related metrics, though, since Sentry no longer supports direct metrics (https://docs.sentry.io/product/explore/metrics/).

src/seer/db.py (outdated):
@@ -58,6 +59,8 @@ class Base(DeclarativeBase):
migrate = Migrate(directory="src/migrations")
Session = sessionmaker(autoflush=False, expire_on_commit=False)

TaskStatus = Literal["not_queued", "processing", "queued"]

Review comment (Member):

We can convert this to a string enum and use it in both the business logic and the data layer.
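
For example, a minimal sketch of that suggestion, with member names taken from the TaskStatus literal above; how it gets wired into the model column is omitted and would depend on the migration.

# Sketch of the suggested string enum shared by business logic and the data layer.
import enum


class TaskStatus(str, enum.Enum):
    NOT_QUEUED = "not_queued"
    PROCESSING = "processing"
    QUEUED = "queued"


# Data layer: the existing column keeps storing the plain string value, e.g.
#   alert.data_purge_flag = TaskStatus.PROCESSING.value
# Business logic: compare against the enum instead of raw string literals, e.g.
#   if alert.data_purge_flag == TaskStatus.QUEUED.value: ...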

src/seer/anomaly_detection/models/external.py (outdated)
@ram-senth (Member) replied:

Wondering if we should add the alert id as a tag so that we can look at metrics at the alert level. Let's ask Zach and Jenn about the Celery metrics in the next standup. The rest looks good.
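
Attaching the alert id to the existing Sentry tracing could be as small as the sketch below; the span op and description are illustrative, not what the PR currently does.

# Illustrative: tag the cleanup task's Sentry events and transactions with the
# alert id so traces can be filtered per alert. Span names are made up here.
import sentry_sdk


def traced_cleanup(alert_id: int) -> None:
    sentry_sdk.set_tag("alert_id", str(alert_id))
    with sentry_sdk.start_span(op="anomaly_detection.cleanup", description=f"alert {alert_id}"):
        ...  # existing cleanup work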
