New Celery queue to debug CPU/memory issues #2693

Open
mfocko opened this issue Jan 9, 2025 · 0 comments
Labels
area/general: Related to whole service, not a specific part/integration.
complexity/single-task: Regular task, should be done within days.
kind/internal: Doesn't affect users directly, may be e.g. infrastructure, DB related.

mfocko commented Jan 9, 2025

Related to #2522

Context

Based on the discussion at the arch meeting regarding the CPU and memory issues: the problem is tied to the production deployment (higher load) and the short-running workers (where tasks run concurrently).

As for the memory issues, the best “guess” is failing cleanup, so let's set up a new queue that runs the same way as the short-running one, but with only a subset of tasks, to try to pinpoint the specific handlers that are causing issues.

TODO

  • Create a new queue

  • Pick a task (e.g., process_message, since there's a high volume of those) or a subset of tasks (e.g., less frequent tasks that could be filtered further) to run in that queue; a routing sketch follows this list

  • (optionally) Unify the way tasks are split between the queues (a unification sketch also follows this list); currently we declare the queue both in the decorator:

    bind=True, name=TaskName.downstream_koji_scratch_build, base=TaskWithRetry, queue="long-running"

    and also in the global Celery config:

    task_routes = {
        "task.babysit_vm_image_build": "long-running",
        "task.babysit_copr_build": "long-running",
        "packit_service.service.tasks.get_past_usage_data": "long-running",
        "packit_service.service.tasks.get_usage_interval_data": "long-running",
    }

  • (optionally) Improve the docs on which tasks are supposed to run where; currently everything runs in the short-running queue by default unless specified otherwise, and some tasks stand out, e.g., the VM Image build being triggered from the short-running queue

  • Depending on the time spent on the previous points, either watch the OpenShift/Celery metrics or create a follow-up card
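
A minimal sketch of what the first two points could look like, assuming the new queue is named "debug" and process_message is the task routed to it; the queue name, the fully qualified task name, and the worker module path below are illustrative assumptions, not decisions made in this issue:

    # Sketch only: the "debug" queue name and the task name are assumptions.
    from celery import Celery

    celery_app = Celery("packit-service")  # stand-in for the real app instance

    celery_app.conf.task_routes = {
        # the suspected task(s) go to the dedicated debug queue
        "task.process_message": "debug",
        # everything else keeps its current routing
        "task.babysit_vm_image_build": "long-running",
        "task.babysit_copr_build": "long-running",
    }

    # A separate worker deployment would then consume only the new queue,
    # mirroring how the short-running workers are deployed, e.g.
    # (the --app module path is an assumption):
    #   celery --app=packit_service.worker.tasks worker --queues=debug --concurrency=1
    # Keeping the concurrency low makes per-handler CPU/memory usage easier to
    # attribute in the OpenShift/Celery metrics.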
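
A sketch of one possible direction for the unification point, assuming task_routes becomes the single source of truth (the opposite, decorator-only, direction would work just as well); the task and handler names are simplified stand-ins for the real TaskName/TaskWithRetry-based declarations:

    # Sketch only: the queue assignment lives in the config, not in the decorator.
    from celery import Celery

    celery_app = Celery("packit-service")  # stand-in for the real app instance

    # the decorator no longer carries queue="long-running"...
    @celery_app.task(bind=True, name="task.downstream_koji_scratch_build")
    def run_downstream_koji_scratch_build(self, event: dict):
        ...

    # ...because the queue is declared exactly once, in the global config:
    celery_app.conf.task_routes = {
        "task.downstream_koji_scratch_build": "long-running",
        "packit_service.service.tasks.get_past_usage_data": "long-running",
        "packit_service.service.tasks.get_usage_interval_data": "long-running",
    }
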

@mfocko mfocko added the complexity/single-task, area/general and kind/internal labels Jan 9, 2025
@lbarcziova lbarcziova moved this from new to priority-backlog in Packit Kanban Board Jan 9, 2025
@lbarcziova lbarcziova moved this from priority-backlog to refined in Packit Kanban Board Jan 9, 2025