Sor pull request #153

Open
wants to merge 22 commits into
base: master

@jaglima jaglima commented Sep 21, 2023

Summer of Reproducibility results - 2023 - PR

Intro:

This PR stems from the project proposal submitted to the Summer of Reproducibility in 2023. The project concerns "the noWorkflow capabilities on experiments in Data Science and Machine Learning" and ran under the mentorship of Juliana Freire and Joao Pimentel.

For more details, please refer to the initial proposal, the mid-term technical report, and the final report.

Features description:

The following features were implemented to enable a Jupyter Notebook user to:

  1. tag variables
  2. retrieve the operations and variables that contributed to the final value of a tagged variable
  3. compare the differences between two trials
  4. retrieve the values of scalar tagged variables from a set of trials.
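
To make the second feature concrete, here is a minimal, self-contained sketch of what "backward dependencies" means. It does not use noWorkflow: the `DEPS` graph, the variable names, and the `backward_dependencies` helper are all hypothetical and exist only for illustration. Given a map from each variable to the variables it was computed from, the helper walks back from the tagged variable and collects everything that contributed to it.

```python
# Toy illustration only: this is NOT noWorkflow's implementation.
# DEPS maps each variable to the variables it was directly computed from.
DEPS = {
    "accuracy": ["model", "X_test", "y_test"],
    "model": ["X_train", "y_train", "n_estimators"],
    "X_train": ["df"],
    "X_test": ["df"],
    "y_train": ["df"],
    "y_test": ["df"],
    "df": [],
    "n_estimators": [],
    "unused_plot": ["df"],  # never feeds into "accuracy"
}


def backward_dependencies(target, deps):
    """Return every variable that directly or indirectly feeds `target`."""
    seen = set()
    stack = [target]
    while stack:
        current = stack.pop()
        for parent in deps.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return sorted(seen)


contributors = backward_dependencies("accuracy", DEPS)
print(contributors)
```

In this toy graph, `unused_plot` is left out of the result because nothing on the path to `accuracy` depends on it, which is the kind of filtering a tagged-variable query is meant to provide.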

Here are the changes and inclusions implementing these features:

  • A new table was added to the database to store the names and values of tagged variables. This implementation consists of the commits in e902999

  • All implemented methods/classes related to tagged variables were bundled in the file capture/noworkflow/now/tagging/var_tagging.py. They are:

    • class NotebookQuerierOptions(QuerierOptions): extends QuerierOptions
      • visit_arrow(): navigation method through the dependencies
      • back_deps(): creates a readable list of backward dependencies
      • global_back_deps(): same as back_deps(), but for all notebook dependencies
    • now_cells('cell_tag') : Tag a given cell
    • now_variable('target_var', value): tag a variable
    • backward_deps('target_var', glanularity_level): retrieve the backward dependencies of a variable
    • global_backward_deps('target_var', granularity_level): retrieves the global backward dependencies, that is, all references to the variable throughout the notebook
    • store_operations(trial_id, dict_ops): saves the list of dependencies of a tagged variable
    • dict_to_text: converts a dictionary to plain text
    • resume_trials(): shows all stored lists of operations
    • trial_intersection_diff(trial_id_a, trial_id_b): shows the differences within the intersection of operations in two trials
    • trial_diff(trial_id_a, trial_id_b): displays a diff visualization between two trials
    • var_tag_plot('target_var_name'): plots the values of a tagged variable across all trials in the database
    • var_tag_values('target_var_name'): returns a pandas dataframe with all values of a variable in the database
  • A tutorial/usecase directory was added in 33d0ae9. The directory contains:

    • A friendly tutorial (in README.md)
    • Five Notebooks walking the user through an ML example with noWorkflow
    • A basic_operations Notebook demonstrating all the functionality implemented in this PR
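
As a rough illustration of the kind of comparison that `trial_diff` and `trial_intersection_diff` are described as providing, the sketch below compares two hypothetical trials using only the Python standard library. It does not call noWorkflow: the trial contents, the operation names, and the helpers `intersection_diff` and `text_diff` are invented for this example, and the output format of the functions in this PR may well differ.

```python
import difflib

# Hypothetical operation logs for two trials (invented for illustration).
trial_a = {
    "n_estimators": "100",
    "max_depth": "5",
    "model": "RandomForestClassifier(...)",
    "accuracy": "0.91",
}
trial_b = {
    "n_estimators": "200",
    "max_depth": "5",
    "scaler": "StandardScaler()",
    "model": "RandomForestClassifier(...)",
    "accuracy": "0.93",
}


def intersection_diff(ops_a, ops_b):
    """For operations present in both trials, report those whose values differ."""
    shared = ops_a.keys() & ops_b.keys()
    return {
        name: (ops_a[name], ops_b[name])
        for name in sorted(shared)
        if ops_a[name] != ops_b[name]
    }


def text_diff(ops_a, ops_b):
    """Line-based unified diff over a plain-text rendering of each trial."""
    lines_a = [f"{name} = {value}" for name, value in ops_a.items()]
    lines_b = [f"{name} = {value}" for name, value in ops_b.items()]
    return list(
        difflib.unified_diff(
            lines_a, lines_b, fromfile="trial_a", tofile="trial_b", lineterm=""
        )
    )


changed = intersection_diff(trial_a, trial_b)
print(changed)
print("\n".join(text_diff(trial_a, trial_b)))
```

The first helper only looks at operations the two trials share, so the added `scaler` step is ignored there; the second shows every added, removed, or changed line, which is closer to a conventional diff view.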

Bug Fixing:

During the project, we came across a minor bug affecting Numpy vectors/matrices and Pandas DataFrames when the noWorkflow kernel was active. In these cases, simply instantiating such objects resulted in an error. A fix was first proposed in aedf476 and improved in c6be1ab.

@jaglima jaglima marked this pull request as ready for review September 21, 2023 19:31