
Reservoir coupling: Spawn slave processes from master process #5453

Draft · wants to merge 11 commits into base: master
Conversation

hakonhagland (Contributor)

Builds on #5443. Depends on upstream OPM/opm-common#4123.

This is work in progress, so I am putting this in draft mode for now.

@hakonhagland marked this pull request as draft June 28, 2024 07:29
@blattms (Member) left a comment

Didn't even know that we can spawn additional processes. Cool. So we could run flow in parallel without mpirun if the data file tells us to.

There are a few things that I am wondering about:

  • What happens if I start Master/Slave with mpirun -np N flow and N>1?
  • Can I compile this without MPI?
  • On a cluster with a queuing system (e.g. the NORCE cluster?), will the admin allow spawning at all, or is there a limit for the number of processes? Usually you have allocated some nodes and they have some hardware limits. So there might even be a natural limit (by RAM) for the number of processes. Or things might slow down if too many of those land on one node.

Having said that, I think this is a beautiful approach, and in any case we will use communicators and intercommunicators like here in the end. Let's see whether we spawn or use some other means in the end. I think that can be changed easily later.

// If first command line argument is "--slave-log-file=<filename>",
// then redirect stdout and stderr to the specified file.
if (this->argc_ >= 2) {
std::string_view arg = this->argv_[1];
Member:

I think this can be handled later, but the assumption that a parameter is at a special position seems like a really strong/unrealistic one.

Contributor Author:

The assumption that this is a reservoir coupling slave if the first argument is --slave-log-file seems quite safe to me, since a reservoir coupling slave can only be started by a master process. If this is not a slave process, the user may still put --slave-log-file as the first argument by accident, but that should not happen very often.
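A minimal sketch of the check being discussed; the helper name is hypothetical (the PR does this inline), but the prefix matching is the same idea:

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: if the first command-line argument is
// "--slave-log-file=<filename>", return the file name that stdout and
// stderr should be redirected to; otherwise return std::nullopt.
std::optional<std::string> slaveLogFile(int argc, char** argv)
{
    constexpr std::string_view prefix = "--slave-log-file=";
    if (argc >= 2) {
        const std::string_view arg = argv[1];
        // substr() never reads past the end; a shorter arg simply fails the compare.
        if (arg.substr(0, prefix.size()) == prefix) {
            return std::string(arg.substr(prefix.size()));
        }
    }
    return std::nullopt;
}
```

The caller would then reopen stdout and stderr onto the returned file name, e.g. via std::freopen().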

const auto& rescoup = this->schedule_[0].rescoup();
char *flow_program_name = argv[0];
for (const auto& [slave_name, slave] : rescoup.slaves()) {
auto master_slave_comm = MPI_Comm_Ptr(new MPI_Comm(MPI_COMM_NULL));
Member:

We need to be able to compile flow without MPI, too. Hence I think we will need some #if HAVE_MPI somewhere.

Contributor Author:

Good point. I have added #if HAVE_MPI preprocessor directives in SimulatorFullyImplicitBlackoil.hpp that should prevent inclusion of the reservoir coupling code if we compile with -DUSE_MPI=NO.
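The shape of such a guard, as a hedged sketch (the function name is illustrative; HAVE_MPI is assumed to come from the build system, e.g. -DHAVE_MPI=1, not from this snippet):

```cpp
// Illustrative sketch of guarding MPI-only code. When HAVE_MPI is not
// defined by the build system, the preprocessor treats it as 0 and the
// reservoir coupling code is compiled out.
#if HAVE_MPI
#include <mpi.h>
#endif

// Returns true when this build can support reservoir coupling (i.e. MPI).
bool reservoirCouplingAvailable()
{
#if HAVE_MPI
    return true;
#else
    return false;
#endif
}
```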

}
}
slave_argv[argc] = const_cast<char *>("--slave=true");
slave_argv[argc+1] = nullptr;
Member:

I am curious: What is this for?

Contributor Author:

The argument --slave=true is passed to flow so that it knows it is a slave. It is used, for example, in SimulatorFullyImplicit.hpp like this: auto slave_mode = Parameters::Get<Parameters::Slave>();

{
"SLAVES",
{
{1,{true, [](const std::string& val){ return val.size()<=8;}, "SLAVES(SLAVE_RESERVOIR): Only names of slave reservoirs of up to 8 characters are supported."}},
Member:

We should check with the engineers whether they really only use 8 characters, or rely on truncation and use the rest for further documentation. I think the commercial simulator just truncates.

Contributor Author:

This discussion has been moved to #5436 (comment)

Comment on lines +241 to +255
{
OPM_THROW(std::logic_error,
fmt::format("Inconsistent SCHEDULE section: {}", message));
}

void checkScheduleKeywordConsistency(const Opm::Schedule& schedule)
{
const auto& final_state = schedule.back();
const auto& rescoup = final_state.rescoup();
if (rescoup.slaveCount() > 0 && rescoup.masterGroupCount() == 0) {
inconsistentScheduleError("SLAVES keyword without GRUPMAST keyword");
}
if (rescoup.slaveCount() == 0 && rescoup.masterGroupCount() > 0) {
inconsistentScheduleError("GRUPMAST keyword without SLAVES keyword");
}
Member:

Shouldn't we rather do this check during parsing in opm-common? Then we might even be able to provide file locations.

I have to admit that I do not understand 100% what we are checking. What happens if we have SLAVES and GRUPMAST but at different dates/steps?

@hakonhagland (Contributor Author):

What happens if I start Master/Slave with mpirun -np N flow and N>1?

@blattms I think that should work as usual. mpirun starts the master as N processes, but only one of those will spawn the slaves.
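For context: MPI_Comm_spawn() is collective over the parent communicator, so every rank of the master job makes the call, but the command and argument list are only significant on the root rank; this is consistent with running the master under mpirun -np N and still getting a single set of slaves. A hedged sketch under that assumption (names are illustrative, not the PR's actual code):

```cpp
#include <mpi.h>

// Illustrative sketch, not the PR's actual code: spawn one slave "flow"
// process. All ranks of `parent` must call this collectively; the
// command and argv arguments are only read on the root rank (0 here).
MPI_Comm spawnSlave(const char* flow_program, char** slave_argv, MPI_Comm parent)
{
    MPI_Comm master_slave_comm = MPI_COMM_NULL;
    MPI_Comm_spawn(flow_program,
                   slave_argv,            // null-terminated, excludes the program name
                   /*maxprocs=*/1,
                   MPI_INFO_NULL,
                   /*root=*/0,
                   parent,
                   &master_slave_comm,    // intercommunicator to the spawned slave
                   MPI_ERRCODES_IGNORE);
    return master_slave_comm;
}
```

The returned intercommunicator is what the master and slave would then use to exchange reservoir coupling data.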

@hakonhagland (Contributor Author):

On a cluster with a queuing system (e.g. the NORCE cluster?), will the admin allow spawning at all or is there a limit for the number of processes?

@blattms Good point. I do not have access to the NORCE cluster yet; I will try to test this next week. Currently, I have only tested on my own computer.

@hakonhagland (Contributor Author):

Rebased

  • Do not specify slave program name twice when launching slave process. Open MPI does not support output redirection for spawned child processes.
  • Copy command line parameters from master to slave command line. Also replace data file name in master argv with data file name of the slave.
  • Remove debug code that was introduced by mistake in the previous commit.
  • Create one log file for each slave subprocess. Redirect both stdout and stderr to this file.
  • Exclude the reservoir coupling stuff if MPI is not enabled.
@hakonhagland (Contributor Author):

On a cluster with a queuing system (e.g. the NORCE cluster?), will the admin allow spawning at all or is there a limit for the number of processes?

I do not have access to the NORCE cluster yet, I will try to test this next week.

@blattms Unfortunately, the NORCE compute machines did not have a job scheduler that I could test this with. However, the documentation for e.g. SLURM seems to indicate that it should be possible to allocate resources for MPI_Comm_spawn() in advance, though this should of course be tested in practice on a real cluster.
