Convert data science notebooks with poor modularity to fully modular notebooks that are automatically exported as python modules.
In data science it is common to develop experimentally and quickly, based on notebooks, with little regard for software engineering practices and modularity. It can become challenging to start working on someone else's notebooks when there is no modularity in terms of separate functions, and a great deal of code is duplicated across notebooks. This makes it difficult to understand the logic in terms of semantically separate units, to see the commonalities and differences between the notebooks, and to extend, generalize, and configure the current solution.
nbmodular is a library conceived with the objective of helping convert the cells of a notebook into separate functions with clear dependencies in terms of inputs and outputs. This is done through a combination of tools that semi-automatically understand the data flow in the code, based on mild assumptions about its structure. It also helps test the current logic and compare it against the modularized solution, to make sure that the refactored code is equivalent to the original one.
- Convert cells to functions.
- The logic of a single function can be written across multiple cells.
- Functions can be either regular functions or unit test functions.
- Functions and tests are exported to separate python modules.
- TODO: use nbdev to sync the exported python module with the notebook code, so that changes to the module are reflected back in the notebook.
- Processed cells can continue to operate as cells or be only used as functions.
- A pipeline function is automatically created and updated. This pipeline provides the data-flow from the first to the last function call in the notebook.
- Functions act as nodes in a dependency graph. These nodes can optionally hold the values of local variables for inspection outside of the function. This is similar to having a single global scope, which is the original situation. Since this is memory-consuming, storing local variables is optional.
- Local variables are persisted in disk, so that we may decide to reuse previous results without running the whole notebook.
- TODO: Once we are able to construct a dependency graph, we may be able to draw it or show it as text, and pass it to DAG processors that can run functions sequentially or in parallel.
- TODO: if we have the dependency graph and persisted inputs / outputs, we may decide to only run those cells that are predecessors of the current one, i.e., the ones that provide the inputs needed by the current cell.
- TODO: if we associate a hash code to input data, we may only run the cells when the input data changes. Similarly, if we associate a hash code with AST-converted function code, we may only run those cells whose code has been updated.
- TODO: the output of a test cell can be used for assertions, where we require that the current output is the same as the original one.
- TODO: Compare the result of the pipeline with the result of running the original notebook.
- TODO: Currently, AST processing is used to assess whether variables are modified in the cell or just read. This only gives an estimate. We may want to compare the values of existing variables before and after running the code in the cell. We may also use a type checker such as mypy to assess whether a variable is left untouched in the cell (e.g., mark the variable as Final and see if mypy complains); see the sketch right after this list.
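As a rough illustration of that last idea (this is not something nbmodular does today, just a sketch of how mypy could flag a variable that is rebound in a cell):

from typing import Final

a: Final = 2
d = 10
a = a + d  # mypy: error: Cannot assign to final name "a" -> evidence that "a" is modified in this cell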
pip install nbmodular
Load ipython extension
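In a notebook, the extension is loaded through IPython's %load_ext mechanism; a minimal sketch (the exact extension module name is an assumption here; check the package documentation if it differs):

%load_ext nbmodular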
This allows us to use the following magic commands, among others:
- function <name_of_function_to_define>
- print <name_of_previous_function>
- function_info <name_of_previous_function>
- print_pipeline
Let's go one by one.
Using the magic command function allows us to:
- Run the code in the cell normally, and at the same time detect its input and output dependencies and define a function with those inputs and outputs:
a = 2
b = 3
c = a+b
print (a+b)
5
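In the notebook, the cell above is headed by the function magic; a sketch of that header line, assuming the function name is passed as the first argument (as in the list of magics above):

%%function get_initial_values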
The code in the previous cell runs as it normally would and, at the same time, defines a function named get_initial_values, which we can show with the magic command print:
def get_initial_values(test=False):
    a = 2
    b = 3
    c = a+b
    print (a+b)
This function is defined in the notebook space, so we can invoke it directly.
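For instance (the printed value follows from the function body shown above):

get_initial_values()   # prints 5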
The inputs and outputs of the function change dynamically every time we add a new function cell. For example, if we add a new function get_d:
d = 10
def get_d():
    d = 10
And then a function add_all that depends on the previous two functions:
a = a + d
b = b + d
c = c + d
f = %function_info add_all
print(f.code)
def add_all(d, b, c, a):
    a = a + d
    b = b + d
    c = c + d
The exported code at this point also includes a pipeline function and a corresponding test, along the following lines:
from sklearn.utils import Bunch
from pathlib import Path
import joblib
import pandas as pd
import numpy as np
def test_index_pipeline (test=True, prev_result=None, result_file_name="index_pipeline"):
    result = index_pipeline (test=test, load=True, save=True, result_file_name=result_file_name)
    if prev_result is None:
        prev_result = index_pipeline (test=test, load=True, save=True, result_file_name=f"test_{result_file_name}")
    for k in prev_result:
        assert k in result
        if type(prev_result[k]) is pd.DataFrame:
            pd.testing.assert_frame_equal (result[k], prev_result[k])
        elif type(prev_result[k]) is np.array:
            np.testing.assert_array_equal (result[k], prev_result[k])
        else:
            assert result[k]==prev_result[k]

def index_pipeline (test=False, load=True, save=True, result_file_name="index_pipeline"):
    # load result
    result_file_name += '.pk'
    path_variables = Path ("index") / result_file_name
    if load and path_variables.exists():
        result = joblib.load (path_variables)
        return result

    b, c, a = get_initial_values (test=test)
    d = get_d ()
    add_all (d, b, c, a)

    # save result
    result = Bunch (b=b,c=c,a=a,d=d)
    if save:
        path_variables.parent.mkdir (parents=True, exist_ok=True)
        joblib.dump (result, path_variables)
    return result

def add_all(d, b, c, a):
    a = a + d
    b = b + d
    c = c + d
We can see that the outputs from get_initial_values and get_d change as needed. We can look at all the functions defined so far by using print all:
def get_initial_values(test=False):
    a = 2
    b = 3
    c = a+b
    print (a+b)
    return b,c,a

def get_d():
    d = 10
    return d

def add_all(d, b, c, a):
    a = a + d
    b = b + d
    c = c + d
Similarly, the outputs from the last function add_all change after we add other functions that depend on it:
print (a, b, c, d)
12 13 15 10
We can see each of the defined functions with print my_function, and list all of them with print all:
def get_initial_values(test=False):
    a = 2
    b = 3
    c = a+b
    print (a+b)
    return b,c,a

def get_d():
    d = 10
    return d

def add_all(d, b, c, a):
    a = a + d
    b = b + d
    c = c + d
    return b,c,a

def print_all(b, d, a, c):
    print (a, b, c, d)
As we add functions to the notebook, a pipeline function is defined. We can print this pipeline with the magic print_pipeline:
def index_pipeline (test=False, load=True, save=True, result_file_name="index_pipeline"):
    # load result
    result_file_name += '.pk'
    path_variables = Path ("index") / result_file_name
    if load and path_variables.exists():
        result = joblib.load (path_variables)
        return result

    b, c, a = get_initial_values (test=test)
    d = get_d ()
    b, c, a = add_all (d, b, c, a)
    print_all (b, d, a, c)

    # save result
    result = Bunch (b=b,d=d,c=c,a=a)
    if save:
        path_variables.parent.mkdir (parents=True, exist_ok=True)
        joblib.dump (result, path_variables)
    return result
This shows the data flow in terms of inputs and outputs. Before running the pipeline, we can inspect the functions defined so far through the cell_processor magic:
self = %cell_processor
self.function_list
[FunctionProcessor with name get_initial_values, and fields: dict_keys(['original_code', 'name', 'call', 'tab_size', 'arguments', 'return_values', 'unknown_input', 'unknown_output', 'test', 'data', 'defined', 'permanent', 'signature', 'norun', 'created_variables', 'loaded_names', 'previous_variables', 'argument_variables', 'read_only_variables', 'posterior_variables', 'all_variables', 'idx', 'previous_values', 'current_values', 'all_values', 'code'])
Arguments: []
Output: ['b', 'c', 'a']
Locals: dict_keys(['a', 'b', 'c']),
FunctionProcessor with name get_d, and fields: dict_keys(['original_code', 'name', 'call', 'tab_size', 'arguments', 'return_values', 'unknown_input', 'unknown_output', 'test', 'data', 'defined', 'permanent', 'signature', 'norun', 'created_variables', 'loaded_names', 'previous_variables', 'argument_variables', 'read_only_variables', 'posterior_variables', 'all_variables', 'idx', 'previous_values', 'current_values', 'all_values', 'code'])
Arguments: []
Output: ['d']
Locals: dict_keys(['d']),
FunctionProcessor with name add_all, and fields: dict_keys(['original_code', 'name', 'call', 'tab_size', 'arguments', 'return_values', 'unknown_input', 'unknown_output', 'test', 'data', 'defined', 'permanent', 'signature', 'norun', 'created_variables', 'loaded_names', 'previous_variables', 'argument_variables', 'read_only_variables', 'posterior_variables', 'all_variables', 'idx', 'previous_values', 'current_values', 'all_values', 'code'])
Arguments: ['d', 'b', 'c', 'a']
Output: ['b', 'c', 'a']
Locals: dict_keys(['a', 'b', 'c']),
FunctionProcessor with name print_all, and fields: dict_keys(['original_code', 'name', 'call', 'tab_size', 'arguments', 'return_values', 'unknown_input', 'unknown_output', 'test', 'data', 'defined', 'permanent', 'signature', 'norun', 'created_variables', 'loaded_names', 'previous_variables', 'argument_variables', 'read_only_variables', 'posterior_variables', 'all_variables', 'idx', 'previous_values', 'current_values', 'all_values', 'code'])
Arguments: ['b', 'd', 'a', 'c']
Output: []
Locals: dict_keys([])]
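function_list is a plain python list, so we can, for instance, iterate over it and print just the function names (name is one of the fields listed in the repr above):

for f in self.function_list:
    print (f.name)   # get_initial_values, get_d, add_all, print_all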
And run it:

index_pipeline()
{'d': 10, 'b': 13, 'a': 12, 'c': 15}
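Since the pipeline persists its result with joblib (see the load/save logic above), a second call simply loads the stored Bunch from index/index_pipeline.pk; passing load=False forces the functions to run again:

result = index_pipeline (load=False)   # recompute instead of loading the stored result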
We can get access to many of the details of each of the defined functions by calling function_info on a given function name:
get_initial_values_info = %function_info get_initial_values
This allows us to see:
- The name and value (at the time of running) of the local variables, arguments and results from the function:
get_initial_values_info.arguments
[]
get_initial_values_info.current_values
{'a': 2, 'b': 3, 'c': 5}
get_initial_values_info.return_values
['b', 'c', 'a']
We can also inspect the original code written in the cell…
print (get_initial_values_info.original_code)
a = 2
b = 3
c = a+b
print (a+b)
… the code of the defined function:
print (get_initial_values_info.code)
def get_initial_values(test=False):
    a = 2
    b = 3
    c = a+b
    print (a+b)
    return b,c,a
… and the AST trees:
print (get_initial_values_info.get_ast (code=get_initial_values_info.original_code))
Module(
  body=[
    Assign(
      targets=[
        Name(id='a', ctx=Store())],
      value=Constant(value=2)),
    Assign(
      targets=[
        Name(id='b', ctx=Store())],
      value=Constant(value=3)),
    Assign(
      targets=[
        Name(id='c', ctx=Store())],
      value=BinOp(
        left=Name(id='a', ctx=Load()),
        op=Add(),
        right=Name(id='b', ctx=Load()))),
    Expr(
      value=Call(
        func=Name(id='print', ctx=Load()),
        args=[
          BinOp(
            left=Name(id='a', ctx=Load()),
            op=Add(),
            right=Name(id='b', ctx=Load()))],
        keywords=[]))],
  type_ignores=[])
None
print (get_initial_values_info.get_ast (code=get_initial_values_info.code))
Module(
  body=[
    FunctionDef(
      name='get_initial_values',
      args=arguments(
        posonlyargs=[],
        args=[
          arg(arg='test')],
        kwonlyargs=[],
        kw_defaults=[],
        defaults=[
          Constant(value=False)]),
      body=[
        Assign(
          targets=[
            Name(id='a', ctx=Store())],
          value=Constant(value=2)),
        Assign(
          targets=[
            Name(id='b', ctx=Store())],
          value=Constant(value=3)),
        Assign(
          targets=[
            Name(id='c', ctx=Store())],
          value=BinOp(
            left=Name(id='a', ctx=Load()),
            op=Add(),
            right=Name(id='b', ctx=Load()))),
        Expr(
          value=Call(
            func=Name(id='print', ctx=Load()),
            args=[
              BinOp(
                left=Name(id='a', ctx=Load()),
                op=Add(),
                right=Name(id='b', ctx=Load()))],
            keywords=[])),
        Return(
          value=Tuple(
            elts=[
              Name(id='b', ctx=Load()),
              Name(id='c', ctx=Load()),
              Name(id='a', ctx=Load())],
            ctx=Load()))],
      decorator_list=[])],
  type_ignores=[])
None
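These trees are what the dependency analysis works on. As a rough illustration (not nbmodular's actual implementation), the names created and read in a cell can be extracted from such a tree with the standard ast module:

import ast

code = "a = 2\nb = 3\nc = a+b\nprint (a+b)"
tree = ast.parse (code)
stored = {n.id for n in ast.walk (tree) if isinstance (n, ast.Name) and isinstance (n.ctx, ast.Store)}
loaded = {n.id for n in ast.walk (tree) if isinstance (n, ast.Name) and isinstance (n.ctx, ast.Load)}
print (sorted (stored))           # ['a', 'b', 'c'] -> variables created in the cell
print (sorted (loaded - stored))  # ['print'] -> names read but never assigned here (builtins would still need filtering)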
We can keep defining further functions in cells that use variables from the previous ones. There is also a cell_processor magic that gives us access to the CellProcessor class managing the logic behind the magic commands above, which can come in handy:
cell_processor = %cell_processor
In order to explore intermediate results, it is convenient to split the code of a function across different cells. This can be done by passing the flag --merge True:
x = [1, 2, 3]
y = [100, 200, 300]
z = [u+v for u,v in zip(x,y)]
z
[101, 202, 303]
def analyze():
    x = [1, 2, 3]
    y = [100, 200, 300]
    z = [u+v for u,v in zip(x,y)]
    product = [u*v for u, v in zip(x,y)]
By passing the flag --test we can indicate that the logic in the cell is dedicated to testing other functions in the notebook. The test function is defined with the well-known pytest library in mind as the test engine.
This has the following consequences:
- The analysis of dependencies is not associated with variables found in other cells.
- Test functions do not appear in the overall pipeline.
- The data variables used by the test function can be defined in separate test data cells, which are in turn converted to functions. These functions are called at the beginning of the test cell.
Let’s see an example
a = 5
b = 3
c = 6
d = 7
add_all(d, a, b, c)
(12, 10, 13)
# test function add_all
assert add_all(d, a, b, c)==(12, 10, 13)
def test_add_all():
    b,c,a,d = test_input_add_all()
    # test function add_all
    assert add_all(d, a, b, c)==(12, 10, 13)
def test_input_add_all(test=False):
    a = 5
    b = 3
    c = 6
    d = 7
    return b,c,a,d
Test functions are written to a separate test module, with prefix test_:
!ls ../tests
index.ipynb test_example.py
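Since these are plain pytest-style test functions, the exported test module can be run with pytest as usual, for example:

!pytest ../tests/test_example.py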
In order to include libraries in our python module, we can use the imports magic. These will be written at the beginning of the module:
import pandas as pd
Imports can be indicated separately for the test module by passing the flag --test:
import matplotlib.pyplot as plt
Functions can also be written already defined, with an explicit signature and return values. The only caveat is that, if we want the function to be executed, the variables in the argument list need to be created outside of the function. Otherwise, we need to pass the flag --norun to avoid errors:
def myfunc (x, y, a=1, b=3):
    print ('hello', a, b)
    c = a+b
    return c
Although the internal code of the function is not executed, it is still parsed into an AST. This allows nbmodular to provide very tentative warnings about names that do not appear in the argument list:
def other_func (x, y):
    print ('hello', a, b)
    c = a+b
    return c
Detected the following previous variables that are not in the argument list: ['b', 'a']
Let's do the same, but this time running the function:
a=1
b=3
def myfunc (x, y, a=1, b=3):
    print ('hello', a, b)
    c = a+b
    return c
hello 1 3
myfunc (10, 20)
hello 1 3
4
myfunc_info = %function_info myfunc
myfunc_info
FunctionProcessor with name myfunc, and fields: dict_keys(['original_code', 'name', 'call', 'tab_size', 'arguments', 'return_values', 'unknown_input', 'unknown_output', 'test', 'data', 'defined', 'permanent', 'signature', 'norun', 'created_variables', 'loaded_names', 'previous_variables', 'argument_variables', 'read_only_variables', 'posterior_variables', 'all_variables', 'idx', 'previous_values', 'current_values', 'all_values', 'code'])
Arguments: ['x', 'y', 'a', 'b']
Output: ['c']
Locals: dict_keys(['c'])
myfunc_info.c
4
By default, when we run a cell function, its local variables are stored in a dictionary called current_values:
my_new_local = 3
my_other_new_local = 4
The stored variables can be accessed by calling the magic function_info:
my_new_function_info = %function_info my_new_function
my_new_function_info.current_values
{'my_new_local': 3, 'my_other_new_local': 4}
This default behaviour can be overridden by passing the flag --not-store:
my_second_variable = 100
my_second_other_variable = 200
my_second_new_function_info = %function_info my_second_new_function
my_second_new_function_info.current_values
{}
from sklearn.utils import Bunch
x = Bunch (a=1, b=2)
c = 3
a = 4
def bunch_processor(x, day):
    a = x["a"]
    b = x["b"]
    c = 3
    a = 4
    x["a"] = a
    x["c"] = c
    x["day"] = day
    return x
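A quick usage sketch of the generated function (the day value here is just an illustrative argument):

processed = bunch_processor (x, day="Monday")   # Bunch with a=4, b=2, c=3 and day="Monday"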
df = pd.DataFrame (dict(Year=[1,2,3], Month=[1,2,3], Day=[1,2,3]))
fy = '2023'
def days (df, fy, x=1, /, y=3, *, n=4):
    df_group = df.groupby(['Year','Month']).agg({'Day': lambda x: len (x)})
    df_group = df.reset_index()
    print ('other args: fy', fy, 'x', x, 'y', y)
    return df_group
other args: fy 2023 x 1 y 3
Stored the following local variables in the days current_values dictionary: ['df_group']
Detected the following previous variables that are not in the argument list: ['x', 'df', 'fy']
An info object with name <function_name>_info is created in memory, and can be used to get access to local variables
days_info.df_group
|   | index | Year | Month | Day |
|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 |
| 1 | 1 | 2 | 2 | 2 |
| 2 | 2 | 3 | 3 | 3 |
There is more information in this object: previous variables, code, etc.
days_info.current_values
{'df_group': index Year Month Day
0 0 1 1 1
1 1 2 2 2
2 2 3 3 3}
days_info
FunctionProcessor with name days, and fields: dict_keys(['original_code', 'name', 'call', 'tab_size', 'arguments', 'return_values', 'unknown_input', 'unknown_output', 'test', 'data', 'defined', 'permanent', 'signature', 'not_run', 'previous_values', 'current_values', 'returns_dict', 'returns_bunch', 'unpack_bunch', 'include_input', 'exclude_input', 'include_output', 'exclude_output', 'store_locals_in_disk', 'created_variables', 'loaded_names', 'previous_variables', 'argument_variables', 'read_only_variables', 'posterior_variables', 'all_variables', 'idx'])
Arguments: ['df', 'fy', 'x', 'y']
Output: ['df_group']
Locals: dict_keys(['df_group'])
The function can also be called directly:
days (df*100, 100, x=4)
other args: fy 100 x 4 y 3
|   | index | Year | Month | Day |
|---|---|---|---|---|
| 0 | 0 | 100 | 100 | 100 |
| 1 | 1 | 200 | 200 | 200 |
| 2 | 2 | 300 | 300 | 300 |