
Parallel numba rime #186

Draft · wants to merge 30 commits into master
Conversation

sjperkins (Member)

  • Tests added / passed

    $ py.test -v -s africanus

If the pep8 tests fail, the quickest way to correct
    this is to run autopep8, then fix the remaining issues
    reported by flake8 and pycodestyle.

    $ pip install -U autopep8 flake8 pycodestyle
    $ autopep8 -r -i africanus
    $ flake8 africanus
    $ pycodestyle africanus
    
  • Fully documented, including HISTORY.rst for all changes
    and one of the docs/*-api.rst files for new API

    To build the docs locally:

    $ pip install -r requirements.readthedocs.txt
    $ cd docs
    $ READTHEDOCS=True make html
    

@sjperkins sjperkins marked this pull request as draft May 7, 2020 16:22
@sjperkins (Member, Author)

@JSKenyon Could you take a quick look at this? I'd appreciate general feedback on the approach.

Briefly: using the donfig configuration package, the user can instruct codex-africanus to create parallel implementations of the numba RIME functions.

{ 'rime.feed_rotation.parallel': True }

# or

{
    'rime.feed_rotation.parallel': {
        # number of threads
        'threads': 2,
        # prange axes
        'axes': ['source', 'row'],
    }
}

# or

from africanus.config import config

with config.set({"rime.feed_rotation.parallel": True}):
    from africanus.rime import feed_rotation
    ...

One can place the above config in a YAML file, or even set options via the command line:

$ AFRICANUS_RIME__FEED_ROTATION__PARALLEL="{'threads':2}" python script.py

One issue is brittleness around imports (the third Python approach above): the config import and config.set need to occur before the feed_rotation import, otherwise the module-level import code may not see the appropriate config.
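The import-order brittleness can be illustrated with a small self-contained sketch (the names below are stand-ins, not the actual africanus internals): a function whose definition captures the config value current at "import" time never sees a later config change.

```python
# Hypothetical stand-in for an import-time config read: the "module"
# captures the setting when it is imported, so config changes made
# afterwards have no effect on the already-imported function.
SETTINGS = {"rime.feed_rotation.parallel": False}

def import_feed_rotation():
    # Read once, at "import" time.
    parallel = SETTINGS["rime.feed_rotation.parallel"]

    def feed_rotation(x):
        return ("parallel" if parallel else "serial", x)

    return feed_rotation

SETTINGS["rime.feed_rotation.parallel"] = True  # set *before* the import...
feed_rotation = import_feed_rotation()          # ...so the kernel sees it
```

Had the two final lines been swapped, `feed_rotation` would be permanently stuck on the serial path, which is exactly the hazard with `config.set` after the africanus import.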

@JSKenyon (Collaborator)

JSKenyon commented May 8, 2020

That all looks really good/cool to me. Are you intending to nest pranges? Just wondering based on the ability to specify two axes in the above example.

One thing which might be useful is the ability to set this in a more global fashion if required. What I mean is that it might get tedious to specify this manually per term. Having some sensible default for the axes and the ability to just throw nthreads at the problem would be handy.

@sjperkins (Member, Author)

> That all looks really good/cool to me. Are you intending to nest pranges? Just wondering based on the ability to specify two axes in the above example.

I thought I'd try to support it, although it may not be possible in all cases. I know that in OpenMP two loops can only be collapsed when there is no code between them, and the same may apply to numba's prange.

> One thing which might be useful is the ability to set this in a more global fashion if required. What I mean is that it might get tedious to specify this manually per term. Having some sensible default for the axes and the ability to just throw nthreads at the problem would be handy.

Yeah, I guess a rime.parallel option, or even something more global, might be appropriate. I don't want this to be in the public interface at first, though.
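A per-term lookup with a global fallback could be sketched as follows (a plain dict stands in for donfig here, and the rime.parallel key is hypothetical, not something this PR implements):

```python
# Hypothetical fallback logic: a per-term setting such as
# 'rime.feed_rotation.parallel' wins; otherwise fall back to a
# global 'rime.parallel' setting, then to a hard default.
DEFAULT_PARALLEL = False

def parallel_setting(cfg, term):
    per_term = cfg.get(f"rime.{term}.parallel")
    if per_term is not None:
        return per_term
    return cfg.get("rime.parallel", DEFAULT_PARALLEL)
```

With donfig, the same shape of lookup would go through config.get with a default argument.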

Any thoughts on the use of get_num_threads and set_num_threads within the code block? Numba 0.49.0 disables caching in the parallel case, which means the number of threads won't be baked into the code on the first compile.

@JSKenyon (Collaborator)

I am not sure if I have understood it in sufficient detail, but I believe numba.config.NUMBA_NUM_THREADS defaults to the number of available cores. Do you intend users to override this by setting the environment variable? I might just be misunderstanding how things get set using donfig. In principle I think what you have done seems sensible, and I suspect that simply spinning up a huge pool of threads will work wonders.

@sjperkins (Member, Author)

@JSKenyon In general I've set up the RIME functions' generated_jit decorators as follows:

@generated_jit(nopython=True, nogil=True, cache=not parallel, parallel=parallel)
def fn(...):
    pass

There's something a bit off in the beam_cube_dde function. I've set nopython=False because if I set it to True I run into these recursion errors:


africanus/rime/tests/test_fast_beams.py:77: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../../venv/afr/lib/python3.6/site-packages/numba/core/dispatcher.py:420: in _compile_for_args
    raise e
../../../../venv/afr/lib/python3.6/site-packages/numba/core/dispatcher.py:353: in _compile_for_args
    return self.compile(tuple(argtypes))
../../../../venv/afr/lib/python3.6/site-packages/numba/core/compiler_lock.py:32: in _acquire_compile_lock
    return func(*args, **kwargs)
../../../../venv/afr/lib/python3.6/site-packages/numba/core/dispatcher.py:794: in compile
    cres = self._compiler.compile(args, return_type)
../../../../venv/afr/lib/python3.6/site-packages/numba/core/dispatcher.py:77: in compile
    status, retval = self._compile_cached(args, return_type)
../../../../venv/afr/lib/python3.6/site-packages/numba/core/dispatcher.py:91: in _compile_cached
    retval = self._compile_core(args, return_type)
../../../../venv/afr/lib/python3.6/site-packages/numba/core/dispatcher.py:109: in _compile_core
    pipeline_class=self.pipeline_class)
../../../../venv/afr/lib/python3.6/site-packages/numba/core/compiler.py:568: in compile_extra
    return pipeline.compile_extra(func)
../../../../venv/afr/lib/python3.6/site-packages/numba/core/compiler.py:339: in compile_extra
    return self._compile_bytecode()
../../../../venv/afr/lib/python3.6/site-packages/numba/core/compiler.py:401: in _compile_bytecode
    return self._compile_core()
../../../../venv/afr/lib/python3.6/site-packages/numba/core/compiler.py:381: in _compile_core
    raise e
../../../../venv/afr/lib/python3.6/site-packages/numba/core/compiler.py:372: in _compile_core
    pm.run(self.state)
../../../../venv/afr/lib/python3.6/site-packages/numba/core/compiler_machinery.py:341: in run
    raise patched_exception
../../../../venv/afr/lib/python3.6/site-packages/numba/core/compiler_machinery.py:332: in run
    self._runPass(idx, pass_inst, state)
../../../../venv/afr/lib/python3.6/site-packages/numba/core/compiler_lock.py:32: in _acquire_compile_lock
    return func(*args, **kwargs)
../../../../venv/afr/lib/python3.6/site-packages/numba/core/compiler_machinery.py:291: in _runPass
    mutated |= check(pss.run_pass, internal_state)
../../../../venv/afr/lib/python3.6/site-packages/numba/core/compiler_machinery.py:264: in check
    mangled = func(compiler_state)
../../../../venv/afr/lib/python3.6/site-packages/numba/core/typed_passes.py:288: in run_pass
    parfor_pass.run()
../../../../venv/afr/lib/python3.6/site-packages/numba/parfors/parfor.py:2694: in run
    get_parfor_reductions(self.func_ir, p, p.params, self.calltypes)
../../../../venv/afr/lib/python3.6/site-packages/numba/parfors/parfor.py:3318: in get_parfor_reductions
    reduce_nodes = get_reduce_nodes(param, param_nodes[param], func_ir)
../../../../venv/afr/lib/python3.6/site-packages/numba/parfors/parfor.py:3399: in get_reduce_nodes
    rhs = lookup(rhs)
../../../../venv/afr/lib/python3.6/site-packages/numba/parfors/parfor.py:3389: in lookup
    return lookup(val)
../../../../venv/afr/lib/python3.6/site-packages/numba/parfors/parfor.py:3389: in lookup
    return lookup(val)
../../../../venv/afr/lib/python3.6/site-packages/numba/parfors/parfor.py:3389: in lookup
    return lookup(val)
E   RecursionError: Failed in nopython mode pipeline (step: convert to parfors)
E   maximum recursion depth exceeded
!!! Recursion detected (same locals & position)
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
========================================================== 1 failed, 1 skipped, 281 deselected in 15.41s ===========================================================

There are a couple of scratch numpy arrays allocated outside the prange but used inside it. Moving them inside doesn't fix the recursion error above. Stepping through with pdb, it looks like it's getting very confused by the ncorrs variable.

@sjperkins (Member, Author)

To invoke the parallelism in this PR, the following structure can be used:

from africanus.config import config

cfg = {
    'rime.feed_rotation.parallel': True,
    'rime.predict_vis.parallel': True,
    'rime.phase_delay.parallel': True,
    'model.spectral_model.parallel': True,
    'model.shape.gaussian.parallel': True,
}

with config.set(cfg):
    # All imports **must** happen after the config.set
    from africanus.rime.dask import phase_delay, feed_rotation, predict_vis
    from africanus.model.dask import spectral_model
    from africanus.model.shape import gaussian

I've typed the above from memory, so some of it may be incorrect, but the general idea holds.

@JSKenyon (Collaborator)

I have played with this PR briefly and the components I have used seem to work. It is worth noting that the parallelism is still bound by certain non-Numba tasks such as einsum. Of course, this will become even more powerful with a prange accelerated monolithic RIME, but this is already a useful addition for cases where we cannot afford to run a huge number of predicts in parallel. For anyone else wanting to make use of this functionality, it is worth familiarizing yourself with numba -s. This provides a summary of some useful Numba properties. The important one is whether or not tbb is the active threading layer. For tasks involving nested parallelism, it is VERY important that this is the case.

@sjperkins (Member, Author)

> I have played with this PR briefly and the components I have used seem to work.

Thanks for trying this out.

> It is worth noting that the parallelism is still bound by certain non-Numba tasks such as einsum.

I take it this is in the nested parallelism case where there are a mix of dask and numba threads?

> Of course, this will become even more powerful with a prange accelerated monolithic RIME, but this is already a useful addition for cases where we cannot afford to run a huge number of predicts in parallel.

This PR still needs some work:

  1. It's still difficult to get a parallel nopython version of the beam working.
  2. Re-jitting the numba kernels for each test case parametrization has doubled the test suite run time, or worse.

I hope to find time for this later this week.
