Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Salt parallel execution has no min/max control for state parallel support #49301

Open
ferringb opened this issue Aug 23, 2018 · 23 comments · May be fixed by #66957
Open

Salt parallel execution has no min/max control for state parallel support #49301

ferringb opened this issue Aug 23, 2018 · 23 comments · May be fixed by #66957
Labels
Pending-Discussion The issue or pull request needs more discussion before it can be closed or merged
Milestone

Comments

@ferringb
Copy link

Description of Issue/Question

There doesn't seem to be a way to properly limit parallel execution- neither in checking docs, nor in reading through the various calls to state's call_parallel.

This in turn means that if the chunk graph is correct, and you have quite a few parallels in use you can easily exhaust the box.

I'll provide a test repro below, but for a real world example; in my setup I tried marking file.managed written to /etc/nginx/sites-enabled/* with parallel:true; for the renders I'm working with, that's 500+ files being written. Which salt then tried to write all at once, beating the hell out of my dev system in the process.

This ticket likely relates to #38746 and/or the process_count_max setting in 2018.8.x: https://docs.saltstack.com/en/latest/ref/configuration/minion.html#process-count-max ; that said, I've not found any reference in the state code that uses that value, so I'm guessing it wasn't wired.

Setup + Steps to Reproduce Issue

consider the following bit of test code to run 200 sleeps, in parallel.

{%- for x in range(200) %}
sleepy-{{x}}:
  cmd.run:
    - parallel: true
    - name: sleep 20
{%- endfor %}

Now either we can scrape from the logs how many it started in parallel, or we can just do a simple ps grep; point is, we see 200 running in parallel.

200

Versions Report

Tested for 2017.7.2 and 2018.8.3; from what I know of the code,

@Ch3LL
Copy link
Contributor

Ch3LL commented Aug 24, 2018

ping @saltstack/team-core any ideas here on how we can limit this?

@Ch3LL Ch3LL added the Pending-Discussion The issue or pull request needs more discussion before it can be closed or merged label Aug 24, 2018
@Ch3LL Ch3LL added this to the Blocked milestone Aug 24, 2018
@Oloremo
Copy link
Contributor

Oloremo commented Sep 12, 2019

Actually a very important feature to have. The Salt position itself as all about the scale and performance and this is very much needed.

@stale

This comment has been minimized.

@stale stale bot added the stale label Jan 7, 2020
@Oloremo

This comment has been minimized.

@stale

This comment has been minimized.

@stale stale bot removed the stale label Jan 7, 2020
@stale

This comment has been minimized.

@stale stale bot added the stale label Feb 6, 2020
@Oloremo

This comment has been minimized.

@stale

This comment has been minimized.

@stale stale bot removed the stale label Feb 6, 2020
@stale

This comment has been minimized.

@stale stale bot added the stale label Mar 8, 2020
@Oloremo

This comment has been minimized.

@stale

This comment has been minimized.

@stale stale bot removed the stale label Mar 8, 2020
@stale

This comment has been minimized.

@stale stale bot added the stale label Apr 7, 2020
@Oloremo

This comment has been minimized.

@stale

This comment has been minimized.

@stale stale bot removed the stale label Apr 7, 2020
@sagetherage sagetherage removed the Pending-Discussion The issue or pull request needs more discussion before it can be closed or merged label May 18, 2021
@sagetherage sagetherage removed this from the Blocked milestone May 18, 2021
@sagetherage
Copy link
Contributor

putting this back into triage as I think there may be some other issues that are similar and assigned to @garethgreenaway but I may be wrong.

@garethgreenaway
Copy link
Contributor

There is the process_count_max which limits the maximum number of processes or threads created by salt-minion when running commands. By default it is set to be unlimited. This might be something to try.

@sagetherage sagetherage added the Pending-Discussion The issue or pull request needs more discussion before it can be closed or merged label Jun 1, 2021
@sagetherage sagetherage self-assigned this Jun 1, 2021
@garethgreenaway garethgreenaway added this to the Blocked milestone Jun 2, 2021
@sagetherage
Copy link
Contributor

@Oloremo would you be able to try Gareth's suggestion above on process_count_max and give some feedback?

@sagetherage sagetherage removed their assignment Jul 13, 2021
@sagetherage
Copy link
Contributor

@Oloremo friendly reminder to see if Gareth's suggestion was helpful or not

@sagetherage
Copy link
Contributor

@Oloremo would you be able to try Gareth's suggestion above on process_count_max and give some feedback?

@Oloremo
Copy link
Contributor

Oloremo commented Aug 9, 2021

@sagetherage Hi!

I'm on vacation right now plus I think there are good steps to reproduce provided by the original author.
So the solution could be confirmed by Salt folks as well.

@lkubb
Copy link
Contributor

lkubb commented Oct 4, 2023

This would be a very useful feature to have.

Sadly, process_count_max will not work since it only limits the number of jobs* running in parallel when receiving new publications via _handle_decoded_payload:

salt/salt/minion.py

Lines 1755 to 1763 in 6e64117

process_count_max = self.opts.get("process_count_max")
if process_count_max > 0:
process_count = len(salt.utils.minion.running(self.opts))
while process_count >= process_count_max:
log.warning(
"Maximum number of processes reached while executing jid %s,"
" waiting...",
data["jid"],
)

* From my testing, states run with parallel: true do not create new files in the proc cache dir, which is what is actually counted by that function.

@lomeroe
Copy link
Contributor

lomeroe commented Oct 12, 2023

I recently ran into needing this and agree that it would be nice to have a global option to set a limit. I was able to use some "creative state writing" to enforce a maximum number of tasks happening in parallel within my state run by inserting a test.nop state into the mix and using requisites to group them together.

suppose you have 100 tasks and you want to run a maximum of 25 parallel

my_state.sls:

{% set max_concurrent = 25 %}
{# have to use a dict so we can update it inside the for loop w/do statement #}
{% set parallel_group = {'val': 0} %}
{% for x in range(1, 100) %}
    {% if loop.index0 is divisibleby(max_concurrent) %}
        {% do parallel_group.update({'val': loop.index0}) %}
parallel_group_{{ parallel_group['val'] }}:
  test.nop:
    - name: 'parallel for tasks {{ parallel_group['val'] }} through {{ parallel_group['val'] + max_concurrent }}'
    {% endif %}
parallel_task_{{ x }}:
  cmd.run:
    - name: 'sleep 5'
    - parallel: True
    - require_in:
        - test: parallel_group_{{ parallel_group['val'] }}
  
{% endfor %}

outputs (some state returns removed for brevity):

#salt-call state.apply my_state
local:
----------
          ID: parallel_task_1
    Function: cmd.run
        Name: sleep 5
      Result: True
     Comment: Command "sleep 5" run
     Started: 15:12:01.360235
    Duration: 5012.159 ms
     Changes:
              ----------
              pid:
                  1368695
              retcode:
                  0
              stderr:
              stdout:
----------
          ID: parallel_task_2
    Function: cmd.run
        Name: sleep 5
      Result: True
     Comment: Command "sleep 5" run
     Started: 15:12:01.365125
    Duration: 5010.894 ms
     Changes:
              ----------
              pid:
                  1368697
              retcode:
                  0
              stderr:
              stdout:
----------
<STATE OUTPUT 3-23 SNIPPED FOR BREVITY>
----------
          ID: parallel_task_24
    Function: cmd.run
        Name: sleep 5
      Result: True
     Comment: Command "sleep 5" run
     Started: 15:12:01.482979
    Duration: 5081.1 ms
     Changes:
              ----------
              pid:
                  1368744
              retcode:
                  0
              stderr:
              stdout:
----------
          ID: parallel_task_25
    Function: cmd.run
        Name: sleep 5
      Result: True
     Comment: Command "sleep 5" run
     Started: 15:12:01.487615
    Duration: 5078.515 ms
     Changes:
              ----------
              pid:
                  1368745
              retcode:
                  0
              stderr:
              stdout:
----------
          ID: parallel_group_0
    Function: test.nop
        Name: parallel for tasks 0 through 25
      Result: True
     Comment: Success!
     Started: 15:12:06.584734
    Duration: 55.972 ms
     Changes:
----------
          ID: parallel_task_26
    Function: cmd.run
        Name: sleep 5
      Result: True
     Comment: Command "sleep 5" run
     Started: 15:12:06.651578
    Duration: 5012.242 ms
     Changes:
              ----------
              pid:
                  1368813
              retcode:
                  0
              stderr:
              stdout:
----------
          ID: parallel_task_27
    Function: cmd.run
        Name: sleep 5
      Result: True
     Comment: Command "sleep 5" run
     Started: 15:12:06.656087
    Duration: 5011.327 ms
     Changes:
              ----------
              pid:
                  1368816
              retcode:
                  0
              stderr:
              stdout:
----------
<STATE OUTPUT 28-47 SNIPPED FOR BREVITY>
----------
          ID: parallel_task_48
    Function: cmd.run
        Name: sleep 5
      Result: True
     Comment: Command "sleep 5" run
     Started: 15:12:06.755338
    Duration: 5012.221 ms
     Changes:
              ----------
              pid:
                  1368858
              retcode:
                  0
              stderr:
              stdout:
----------
          ID: parallel_task_49
    Function: cmd.run
        Name: sleep 5
      Result: True
     Comment: Command "sleep 5" run
     Started: 15:12:06.759880
    Duration: 5010.475 ms
     Changes:
              ----------
              pid:
                  1368860
              retcode:
                  0
              stderr:
              stdout:
----------
          ID: parallel_task_50
    Function: cmd.run
        Name: sleep 5
      Result: True
     Comment: Command "sleep 5" run
     Started: 15:12:06.764747
    Duration: 5010.464 ms
     Changes:
              ----------
              pid:
                  1368861
              retcode:
                  0
              stderr:
              stdout:
----------
          ID: parallel_group_25
    Function: test.nop
        Name: parallel for tasks 25 through 50
      Result: True
     Comment: Success!
     Started: 15:12:11.789038
    Duration: 0.974 ms
     Changes:
----------
          ID: parallel_task_51
    Function: cmd.run
        Name: sleep 5
      Result: True
     Comment: Command "sleep 5" run
     Started: 15:12:11.800138
    Duration: 5012.249 ms
     Changes:
              ----------
              pid:
                  1368960
              retcode:
                  0
              stderr:
              stdout:
----------
          ID: parallel_task_52
    Function: cmd.run
        Name: sleep 5
      Result: True
     Comment: Command "sleep 5" run
     Started: 15:12:11.804196
    Duration: 5011.743 ms
     Changes:
              ----------
              pid:
                  1368962
              retcode:
                  0
              stderr:
              stdout:
----------
<STATE OUTPUT 53-73 SNIPPED FOR BREVITY>
----------
          ID: parallel_task_74
    Function: cmd.run
        Name: sleep 5
      Result: True
     Comment: Command "sleep 5" run
     Started: 15:12:11.902587
    Duration: 5011.342 ms
     Changes:
              ----------
              pid:
                  1369010
              retcode:
                  0
              stderr:
              stdout:
----------
          ID: parallel_task_75
    Function: cmd.run
        Name: sleep 5
      Result: True
     Comment: Command "sleep 5" run
     Started: 15:12:11.908248
    Duration: 5011.967 ms
     Changes:
              ----------
              pid:
                  1369011
              retcode:
                  0
              stderr:
              stdout:
----------
          ID: parallel_group_50
    Function: test.nop
        Name: parallel for tasks 50 through 75
      Result: True
     Comment: Success!
     Started: 15:12:16.930839
    Duration: 0.927 ms
     Changes:
----------
          ID: parallel_task_76
    Function: cmd.run
        Name: sleep 5
      Result: True
     Comment: Command "sleep 5" run
     Started: 15:12:16.941630
    Duration: 5011.569 ms
     Changes:
              ----------
              pid:
                  1369147
              retcode:
                  0
              stderr:
              stdout:
----------
          ID: parallel_task_77
    Function: cmd.run
        Name: sleep 5
      Result: True
     Comment: Command "sleep 5" run
     Started: 15:12:16.945098
    Duration: 5010.945 ms
     Changes:
              ----------
              pid:
                  1369148
              retcode:
                  0
              stderr:
              stdout:
<STATE OUTPUT 78-97 SNIPPED FOR BREVITY>
----------
          ID: parallel_task_98
    Function: cmd.run
        Name: sleep 5
      Result: True
     Comment: Command "sleep 5" run
     Started: 15:12:17.065731
    Duration: 5010.658 ms
     Changes:
              ----------
              pid:
                  1369192
              retcode:
                  0
              stderr:
              stdout:
----------
          ID: parallel_task_99
    Function: cmd.run
        Name: sleep 5
      Result: True
     Comment: Command "sleep 5" run
     Started: 15:12:17.070838
    Duration: 5010.106 ms
     Changes:
              ----------
              pid:
                  1369193
              retcode:
                  0
              stderr:
              stdout:
----------
          ID: parallel_group_75
    Function: test.nop
        Name: parallel for tasks 75 through 100
      Result: True
     Comment: Success!
     Started: 15:12:22.096673
    Duration: 1.272 ms
     Changes:

Summary for local
--------------
Succeeded: 103 (changed=99)
Failed:      0
--------------
Total states run:     103
Total run time:   496.425 s

@bdrx312
Copy link
Contributor

bdrx312 commented Apr 24, 2024

Currently as soon as call_parallel is called, the subprocess is started with proc.start(). Instead of immediately starting the process it could first check that the number of procs in the running dict is less than a configured max size; if it is less then start the proc; if not less then don't start the proc and return the result with the proc in it the same as it does now. Finally reconcile_procs would be updated to start the unstarted procs when it removes any procs that it detects finished. The current implementation is not ideal since it iterates through the whole running dict (not the greatest name because it is a dict that stores the return value dict of each state that is called not just the currently running ones) every time reconcile_procs is called. The code could also be optimized to store a data structure for quicker look up of just the running procs rather than having to filter through all the completed states each time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Pending-Discussion The issue or pull request needs more discussion before it can be closed or merged
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants