
Reservoir coupling: Spawn slave processes from master process #5453

Draft · wants to merge 11 commits into base: master
Conversation

hakonhagland (Contributor)

Builds on #5443. Depends on upstream OPM/opm-common#4123.

This is work in progress, so I am putting this in draft mode for now.

@hakonhagland marked this pull request as draft June 28, 2024 07:29
@blattms (Member) left a comment

Didn't even know that we can spawn additional processes. Cool. So we could run flow in parallel without mpirun if the data file tells us to.

There are a few things that I am wondering about:

  • What happens if I start Master/Slave with mpirun -np N flow and N>1?
  • Can I compile this without MPI?
  • On a cluster with a queuing system (e.g. the NORCE cluster?), will the admin allow spawning at all, or is there a limit for the number of processes? Usually you have allocated some nodes and they have some hardware limits. So there might even be a natural limit (by RAM) for the number of processes. Or things might slow down if too many of those land on one node.

Having said that, I think this is a beautiful approach, and in any case we will use communicators and intercommunicators like here in the end. Let's see whether we spawn or use some other means in the end. I think that can be changed easily later.

// If first command line argument is "--slave-log-file=<filename>",
// then redirect stdout and stderr to the specified file.
if (this->argc_ >= 2) {
std::string_view arg = this->argv_[1];
Member:

I think this can be handled later, but the assumption that a parameter is at a special position seems like a really strong/unrealistic one.

Contributor Author:

The assumption that this is a reservoir coupling slave if the first argument is --slave-log-file seems quite safe to me, since a reservoir coupling slave can only be started by a master process. If this is not a slave process, the user may still put --slave-log-file as the first argument by accident, but that should not happen very often.
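A minimal sketch of the check being discussed; the helper name is hypothetical (the PR does this inline), but the prefix matching is the same idea:

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: if the first command-line argument is
// "--slave-log-file=<filename>", return the file name that stdout and
// stderr should be redirected to; otherwise return std::nullopt.
std::optional<std::string> slaveLogFile(int argc, char** argv)
{
    constexpr std::string_view prefix = "--slave-log-file=";
    if (argc >= 2) {
        const std::string_view arg = argv[1];
        // substr() never reads past the end; a shorter arg simply fails the compare.
        if (arg.substr(0, prefix.size()) == prefix) {
            return std::string(arg.substr(prefix.size()));
        }
    }
    return std::nullopt;
}
```

The caller would then reopen stdout and stderr onto the returned file name, e.g. via std::freopen().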

const auto& rescoup = this->schedule_[0].rescoup();
char *flow_program_name = argv[0];
for (const auto& [slave_name, slave] : rescoup.slaves()) {
auto master_slave_comm = MPI_Comm_Ptr(new MPI_Comm(MPI_COMM_NULL));
Member:

We need to be able to compile flow without MPI, too. Hence I think we will need some #if HAVE_MPI somewhere.

Contributor Author:

Good point. I have added #if HAVE_MPI preprocessor directives in SimulatorFullyImplicitBlackoil.hpp that should prevent inclusion of the reservoir coupling code if we compile with -DUSE_MPI=NO.
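The shape of such a guard, as a hedged sketch (the function name is illustrative; HAVE_MPI is assumed to come from the build system, e.g. -DHAVE_MPI=1, not from this snippet):

```cpp
// Illustrative sketch of guarding MPI-only code. When HAVE_MPI is not
// defined by the build system, the preprocessor treats it as 0 and the
// reservoir coupling code is compiled out.
#if HAVE_MPI
#include <mpi.h>
#endif

// Returns true when this build can support reservoir coupling (i.e. MPI).
bool reservoirCouplingAvailable()
{
#if HAVE_MPI
    return true;
#else
    return false;
#endif
}
```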

}
}
slave_argv[argc] = const_cast<char *>("--slave=true");
slave_argv[argc+1] = nullptr;
Member:

I am curious: What is this for?

Contributor Author:

The argument --slave=true is passed to flow so that it knows it is a slave. It is used, for example, in SimulatorFullyImplicit.hpp like this: auto slave_mode = Parameters::Get<Parameters::Slave>();

{
"SLAVES",
{
{1,{true, [](const std::string& val){ return val.size()<=8;}, "SLAVES(SLAVE_RESERVOIR): Only names of slave reservoirs of up to 8 characters are supported."}},
Member:

We should check with the engineers whether they really only use 8 characters, or rely on truncation and use the rest for further documentation. I think the commercial simulator just truncates.

Contributor Author:

This discussion has been moved to #5436 (comment)

Comment on lines +241 to +255
{
OPM_THROW(std::logic_error,
fmt::format("Inconsistent SCHEDULE section: {}", message));
}

void checkScheduleKeywordConsistency(const Opm::Schedule& schedule)
{
const auto& final_state = schedule.back();
const auto& rescoup = final_state.rescoup();
if (rescoup.slaveCount() > 0 && rescoup.masterGroupCount() == 0) {
inconsistentScheduleError("SLAVES keyword without GRUPMAST keyword");
}
if (rescoup.slaveCount() == 0 && rescoup.masterGroupCount() > 0) {
inconsistentScheduleError("GRUPMAST keyword without SLAVES keyword");
}
Member:

Shouldn't we rather do this check during parsing in opm-common? Then we might even be able to provide file locations.

I have to admit that I do not understand 100% what we are checking. What happens if we have SLAVES and GRUPMAST but at different dates/steps?

@hakonhagland (Contributor Author):

What happens if I start Master/Slave with mpirun -np N flow and N>1?

@blattms I think that should work as usual. mpirun starts the master as N processes, but only one of those will spawn the slaves.
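For context: MPI_Comm_spawn() is collective over the parent communicator, so every rank of the master job makes the call, but the command and argument list are only significant on the root rank; this is consistent with running the master under mpirun -np N and still getting a single set of slaves. A hedged sketch under that assumption (names are illustrative, not the PR's actual code):

```cpp
#include <mpi.h>

// Illustrative sketch, not the PR's actual code: spawn one slave "flow"
// process. All ranks of `parent` must call this collectively; the
// command and argv arguments are only read on the root rank (0 here).
MPI_Comm spawnSlave(const char* flow_program, char** slave_argv, MPI_Comm parent)
{
    MPI_Comm master_slave_comm = MPI_COMM_NULL;
    MPI_Comm_spawn(flow_program,
                   slave_argv,            // null-terminated, excludes the program name
                   /*maxprocs=*/1,
                   MPI_INFO_NULL,
                   /*root=*/0,
                   parent,
                   &master_slave_comm,    // intercommunicator to the spawned slave
                   MPI_ERRCODES_IGNORE);
    return master_slave_comm;
}
```

The returned intercommunicator is what the master and slave would then use to exchange reservoir coupling data.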

@hakonhagland (Contributor Author):

On a cluster with a queuing system (e.g. the NORCE cluster?), will the admin allow spawning at all or is there a limit for the number of processes?

@blattms Good point. I do not have access to the NORCE cluster yet; I will try to test this next week. Currently, I have only tested on my own computer.

@hakonhagland (Contributor Author):

Rebased

  • Do not specify slave program name twice when launching slave process. Open MPI does not support output redirection for spawned child processes.
  • Copy command line parameters from master to slave command line. Also replace data file name in master argv with data file name of the slave.
  • Remove debug code that was introduced by mistake in the previous commit.
  • Create one log file for each slave subprocess. Redirect both stdout and stderr to this file.
  • Exclude the reservoir coupling stuff if MPI is not enabled.
@hakonhagland (Contributor Author):

On a cluster with a queuing system (e.g. the NORCE cluster?), will the admin allow spawning at all or is there a limit for the number of processes?

I do not have access to the NORCE cluster yet, I will try to test this next week.

@blattms Unfortunately, the NORCE compute machines did not have a job scheduler that I could test this with. However, the documentation for e.g. SLURM seems to indicate that it should be possible to allocate resources for MPI_Comm_spawn() in advance, though this should of course be tested in practice on a real cluster.
