Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarification: canonical way to define custom global settings #82

Closed
albertz opened this issue Apr 19, 2022 · 13 comments
Closed

Clarification: canonical way to define custom global settings #82

albertz opened this issue Apr 19, 2022 · 13 comments
Assignees

Comments

@albertz
Copy link
Member

albertz commented Apr 19, 2022

@michelwi wrote in rwth-i6/i6_core#32:

Actually RETURNN_PYTHON_EXE is not a setting that is defined in sisyphus.global_settings so we are here in fact abusing the settings mechanics of sisyphus and should not have set this to begin with.

I want to clarify: Are we abusing this? If yes, how should we do it instead? I.e. where/how should we define some global settings (which we intentionally do not want as arguments to jobs)?

Let's not argue too much about RETURNN_PYTHON_EXE specifically here. This is just an example. This is actually an example where some people prefer it one way (having it as an explicit job arg) and others the other way (use global default, which is just a recent version), so it stays an argument where the default None would fallback to the custom global setting. Maybe we will often have it like this that both variants make sense. Maybe there are also other cases where we really want to have custom global settings which are never an argument.

My take here is that we can simply extend the scope of sisyphus.global_settings to also support such use cases. I don't think we need another separate mechanism for this. So basically no change needed then, just maybe some clarification in the doc that this is a valid use case.

@critias
Copy link
Contributor

critias commented Apr 19, 2022

sisyphus.global_settings was not meant to be used inside recipes. It's there to define sisyphus specific settings, e.g. timeouts, which engines to use or which environment should be set during job execution.

Workflow related things should be defined inside the recipes or config directory. If you want to define all external paths in one place you could create a file either inside the recipe folder or, especially if you don't want to cleanly share it across multiple recipes, inside your config folder. This file can then be imported in all relevant places. It would be possible to define the used path inside a tk.Path object there, or use the output of a job which sets everything up.

@albertz
Copy link
Member Author

albertz commented Apr 20, 2022

Maybe you have a more specific example how it look like? Don't you also have sth similar in your recipes? In i6_core, there is for example RETURNN_PYTHON_EXE as mentioned, and some others. So there needs to be some well defined place where these are specified. Basically very much like global_settings. Maybe sth like i6_core_global_settings? Does it really make sense to have that file inside config, while global_settings is in the root? This seems a bit inconsistent to me. Also, maybe it should use a very similar mechanism like global_settings, that there are some defaults (defined in i6_core) and then it additionally loads a user provided file which potentially overwrites these defaults? How do you handle that in your recipes?

@critias
Copy link
Contributor

critias commented Apr 20, 2022

Why doesn't it make sense to have it inside the config or recipe folder? It should be doing something totally different to settings.py. The default values of you scripts/binaries is directly related to you config or recipes, settings.py is something for sisyphus internal settings.

If anything settings.py should be move to something like $USER/.sisyphus_settings.py. I'm currently basically copying the same file around for all my setups.

We currently have these path inside our recipes, but our recipes are not split across multiple repositories. So in your case I would do something like:

config/default.py

from sisyphus import *
# I use hash_overwrite to simply change the folder without changing all hash values
RETURNN_ROOT = tk.Path('~/src/returnn', hash_overwrite='RETURNN')

recipe/returnn_stuff.py

from config.default import RETURNN_ROOT
...

@albertz
Copy link
Member Author

albertz commented Apr 20, 2022

Why doesn't it make sense to have it inside the config or recipe folder? It should be doing something totally different to settings.py.

I thought the config folder is specifically to define the job graph to compute, and/or the final outputs to compute and nothing else. But then my understanding of that was simply wrong. Ok then.

So, in i6_core recipes, we would have some file like i6_core/settings.py (similar as sisyphus.global_settings), and then we would use sth like update_global_settings_from_text (right?) to load config/i6_core_settings.py or so. Right?

@critias
Copy link
Contributor

critias commented Apr 20, 2022

The idea is to keep everything that is mainly static inside the recipes like job definitions or static workflows. The parameters to these more static stuff should than be placed inside the config directory. I must admit the line between these two is fairly blurry in practice, especially for the workflows...

Please don't use global_settings or any of it's supporting functions, it's not meant to be used outside of sisyphus internal functions and might change at some point. I probably should have called it __global_settings__...

Overwriting stuff automatically if config.default is available can be done simply like this:

i6_core/default.py

from sisyphus import *

SOMEPATH = tk.Path('/path/to/somewhere')

try:
    from config.default import *
except ImportError:
    pass

This way you can have some default values in i6_core/default.py which would be import where ever you need it. If you want to change any of these values just add a config/default.py file with the new values e.g.:

config/default.py

from sisyphus import *
SOMEPATH = tk.Path('/path/to/somewhere/else')

@albertz
Copy link
Member Author

albertz commented Apr 20, 2022

I assume config.i6_core_settings or so makes more sense instead of config.default?

Also, instead of:

try:
    from config.i6_core_settings import *
except ImportError:
    pass

sth like update_global_settings_from_file makes more sense, or not? Otherwise you potentially hide ImportErrors inside the users config.i6_core_settings.

@critias
Copy link
Contributor

critias commented Apr 20, 2022

You can name it what ever you like default was just an example, i6_core_settings would work as well.

You are right, ImportError is too broad, it would be better to use ModuleNotFoundError. Beside that I don't see an advantage of putting these four lines inside an extra function.

@albertz
Copy link
Member Author

albertz commented Apr 20, 2022

You can name it what ever you like default was just an example, i6_core_settings would work as well.

But default does not really make sense in this example, right? The user might want to mix different recipes. So the name for this config file should somehow be reserved for i6_core.

You are right, ImportError is too broad, it would be better to use ModuleNotFoundError. Beside that I don't see an advantage of putting these four lines inside an extra function.

But this would still have the same problem and hide other ModuleNotFoundError inside the config then. You also have update_global_settings_from_file where you explicitly handle the case the module is not found, so there you don't have this problem. Why should we do it differently here and introduce a potential source of hard to find bugs?

@critias
Copy link
Contributor

critias commented Apr 20, 2022

I don't know if you like want to use an extra settings file per recipe folder or use a share one for all recipes. A shared one is simpler, separated files is more flexible, but this should be discussed over at i6_core.

Feel free to write a specific function for it if you like, but using update_global_settings_from_file directly is a bad idea since everything loaded will be added explicitly to sisyphus.global_settings, which is not what you want at this point. Feel free to copy update_global_settings_from_file and adjust it to your needs, but this belongs to the recipe code base and not to sisyphus.

Why don't you just check if the exception was raised for the right file?

try:
    from config.i6_core_settings import *
except ModuleNotFoundError as e:
    if e.name != 'config.i6_core_settings':
        raise

@albertz
Copy link
Member Author

albertz commented Apr 20, 2022

but using update_global_settings_from_file directly

Sure, I just meant to do it conceptually in the same way.

Why don't you just check if the exception was raised for the right file?

try:
    from config.i6_core_settings import *
except ModuleNotFoundError as e:
    if e.name != 'config.i6_core_settings':
        raise

Is this (e.name) stable (well defined by Python) or an implementation detail of CPython, so it might change? If it is stable, then yes, this would work.

But why do you do it differently in update_global_settings_from_file? The code in update_global_settings_from_file looks a bit more explicit (and thus maybe safer) to me.

@critias
Copy link
Contributor

critias commented Apr 20, 2022

The code is mainly historically grown and checking some special cases. Also given that settings.py is not inside the normal search space for modules from settings import * wouldn't work.
I still don't see any real advantage in making this more complicated, but it's your decision how to implement it. If you like to use something like update_global_settings_from_file go ahead.

@michelwi
Copy link
Contributor

I think @albertz is profoundly missing the point. i6_core does not need a settings file.

The sisyphus recipes contain:

  • code for atomic jobs
  • parameters for each job (some of which have defaults)
  • predefined building blocks or groups of jobs

The configs contain:

  • Instanciation of specific jobs
  • specific parameters for the jobs
  • links between the jobs outputs and other jobs inputs

As an example let's look at the RetrunnTrainingJob. It does contain some code to setup a training, but does not contain the whole code or Returnn to run it. Instead it has a parameter returnn_root that points to the returnn repository. It also has the parameters horovod_num_processes to control multi-GPU training and returnn_config that contains the training config.
The returnn_config is mandatory. horovod_num_processes defaults to single-GPU training and returnn_root currently defaults to gs.RETURNN_ROOT which is - as I pointed out and @critias confirmed - abusing the sisyphus global settings mechanism.

Instead we should simply use a different mechanism to provide a default value to returnn_root or make it a mandatory parameter.
I propose that we do provide a default. And I propose the default should be a clean checkout of the head of the returnn repository. e.g. we can have i6_core.tools.default_software

import i6_core.tools.git as git
RETURNN_ROOT = git.CloneGitRepositoryJob("https://github.com/rwth-i6/returnn").out_repository

Then, if anyone needs or wants a different version of returnn, they can override this default by either the output of a different CloneGitRepositoryJob or a tk.Path object pointing to a filesystem location. But this would then be a specific parameter of a jobinstance and should be set in the config in whichever way the user wants.
This is similar to the case if somebody wanted to always do dual-GPU training, they would set horovod_num_processes=2 in every job (probably via some default_config_params in their config) instead of changing the RetrunnTrainingJob default parameter

@albertz
Copy link
Member Author

albertz commented Apr 20, 2022

i6_core does not need a settings file.
...
Instead we should simply use a different mechanism to provide a default value to returnn_root ... e.g. we can have i6_core.tools.default_software

Isn't this a contradiction, or just rephrasing "settings file" as "default software" file?

Or do you say that it should not be allowed to overwrite the default? But this is maybe debatable. E.g. why should you be able to overwrite the default env (via Sis global settings) but not this setting? The env probably has a much bigger effect on differences in the results than this setting (different RETURNN versions should never produce different results; but for example different TensorFlow versions, or different FFmpeg versions, or different CUDA versions, or whatever likely will produce different results). I don't think the default you proposed is a good one. I would want to have a specific custom path (e.g. os.path.dirname(__file__) + "/returnn").

(Your horovod_num_processes example is also not really relevant here, as this obviously has an impact on the result, just like learning rate and many other config settings.)

When we now discuss this more specifically about i6_core, and more specifically about RETURNN_ROOT and co, this discussion belongs to rwth-i6/i6_core#254 and not so much here in this issue, where we discussed how to define such custom global settings in general.

Btw, other example are also gs.RASR_ROOT, gs.RASR_ARCH, gs.RASR_PYTHON_HOME.
Or gs.G2P_PYTHON, gs.SCTK_PATH, gs.KALDI_PATH, gs.WARNING_NO_SENTENCEPIECE, gs.SUBWORD_NMT_PATH.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants