New Celery queue to debug CPU/memory issues #2693

Open
mfocko opened this issue Jan 9, 2025 · 0 comments
Labels
area/general: Related to whole service, not a specific part/integration.
complexity/single-task: Regular task, should be done within days.
kind/internal: Doesn't affect users directly, may be e.g. infrastructure, DB related.

mfocko commented Jan 9, 2025

Related to #2522

Context

Based on the discussion at the arch meeting regarding the CPU and memory issues: the problem is tied to the production deployment (higher load) and the short-running workers (where tasks run concurrently).

As for the memory issues, the best “guess” is failing cleanup, so let's set up a new queue that runs the same way as the short-running one, but with only a subset of tasks, to try to pinpoint the specific handlers that are causing issues.

TODO

  • Create a new queue

  • Pick a task (e.g., process_message, since there's a high volume of those) or a subset of tasks (e.g., less frequent tasks that could be filtered further) to run in that queue; a routing sketch follows this list

  • (optionally) Unify the way tasks are split between the queues (a unification sketch also follows this list); currently we declare the queue both in the decorator:

    bind=True, name=TaskName.downstream_koji_scratch_build, base=TaskWithRetry, queue="long-running"

    and also in the global Celery config:

    task_routes = {
        "task.babysit_vm_image_build": "long-running",
        "task.babysit_copr_build": "long-running",
        "packit_service.service.tasks.get_past_usage_data": "long-running",
        "packit_service.service.tasks.get_usage_interval_data": "long-running",
    }

  • (optionally) Improve the docs on which tasks are supposed to run where; currently everything runs in the short-running queue by default unless specified otherwise, and some tasks stand out, e.g., the VM Image build being triggered from the short-running queue

  • Depending on the time spent on the previous points, either watch the OpenShift/Celery metrics or create a follow-up card
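
A minimal sketch of what the first two points could look like, assuming the new queue is named "debug" and process_message is the task routed to it; the queue name, the fully qualified task name, and the worker module path below are illustrative assumptions, not decisions made in this issue:

    # Sketch only: the "debug" queue name and the task name are assumptions.
    from celery import Celery

    celery_app = Celery("packit-service")  # stand-in for the real app instance

    celery_app.conf.task_routes = {
        # the suspected task(s) go to the dedicated debug queue
        "task.process_message": "debug",
        # everything else keeps its current routing
        "task.babysit_vm_image_build": "long-running",
        "task.babysit_copr_build": "long-running",
    }

    # A separate worker deployment would then consume only the new queue,
    # mirroring how the short-running workers are deployed, e.g.
    # (the --app module path is an assumption):
    #   celery --app=packit_service.worker.tasks worker --queues=debug --concurrency=1
    # Keeping the concurrency low makes per-handler CPU/memory usage easier to
    # attribute in the OpenShift/Celery metrics.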
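
A sketch of one possible direction for the unification point, assuming task_routes becomes the single source of truth (the opposite, decorator-only, direction would work just as well); the task and handler names are simplified stand-ins for the real TaskName/TaskWithRetry-based declarations:

    # Sketch only: the queue assignment lives in the config, not in the decorator.
    from celery import Celery

    celery_app = Celery("packit-service")  # stand-in for the real app instance

    # the decorator no longer carries queue="long-running"...
    @celery_app.task(bind=True, name="task.downstream_koji_scratch_build")
    def run_downstream_koji_scratch_build(self, event: dict):
        ...

    # ...because the queue is declared exactly once, in the global config:
    celery_app.conf.task_routes = {
        "task.downstream_koji_scratch_build": "long-running",
        "packit_service.service.tasks.get_past_usage_data": "long-running",
        "packit_service.service.tasks.get_usage_interval_data": "long-running",
    }
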

@mfocko mfocko added the complexity/single-task, area/general and kind/internal labels Jan 9, 2025
@lbarcziova lbarcziova moved this from new to priority-backlog in Packit Kanban Board Jan 9, 2025
@lbarcziova lbarcziova moved this from priority-backlog to refined in Packit Kanban Board Jan 9, 2025