Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement the codespell pre-commit hook #403

Conversation

westford14
Copy link

@westford14 westford14 commented Sep 2, 2024

Closes #401 -- I ignored all the jupyter notebooks as there's a lot of custom variable definition in there that is getting flagged as incorrect words. I felt like trying to maintain a list of all these words, long term, as a dictionary would be a bit too tricky / annoying.


📚 Documentation preview 📚: https://causalpy--403.org.readthedocs.build/en/403/

@westford14 westford14 force-pushed the westford14/codespell-implementation branch from 13f540d to e2e4bfe Compare September 2, 2024 16:50
@westford14 westford14 force-pushed the westford14/codespell-implementation branch from e2e4bfe to 446ae05 Compare September 2, 2024 17:00
@westford14
Copy link
Author

@drbenvincent -- hey just hoping to see if this is the right direction for this PR. codespell is fine if not a little finicky

@drbenvincent
Copy link
Collaborator

Hi @westford14 thanks for the PR! I'll be able to take a look and do a review later today/tomorrow 🙂

Copy link
Collaborator

@drbenvincent drbenvincent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @westford14. This looks pretty good to me. We'll definitely merge, but I've got a suggested edit...

If you are running into loads of issues with the notebooks, then that's fine. Though it's a bit of a shame because there are likely a number of errors in notebooks. What about converting the notebooks into markdown files and running codespell on those markdown files? ChatGPT suggests a way:

python script to convert notebooks to markdown

import nbformat
from nbconvert import MarkdownExporter
import sys
import os

def notebook_to_markdown(notebook_path, output_dir):
    with open(notebook_path, 'r', encoding='utf-8') as f:
        nb = nbformat.read(f, as_version=4)
    
    markdown_exporter = MarkdownExporter()
    (body, resources) = markdown_exporter.from_notebook_node(nb)
    
    # Ensure the output directory exists
    os.makedirs(output_dir, exist_ok=True)

    # Create output file name based on the notebook name
    output_file = os.path.join(output_dir, os.path.splitext(os.path.basename(notebook_path))[0] + ".md")
    
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(body)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python notebook_to_markdown.py <notebook.ipynb> <output_dir>")
    else:
        notebook_path = sys.argv[1]
        output_dir = sys.argv[2]
        notebook_to_markdown(notebook_path, output_dir)

We'd also have to update the optional docs dependencies for those extra imports.

Add that temp folder to .gitignore

# Ignore the temporary markdown files directory
tmp_markdown/

Run the script in the pre-commit checks

- repo: local
  hooks:
    - id: convert-notebooks
      name: Convert Notebooks to Markdown
      entry: python notebook_to_markdown.py
      language: python
      files: \.ipynb$
      pass_filenames: false
      args: ['{}', 'tmp_markdown']

    - id: codespell
      name: Spell Check Markdown Files
      entry: codespell
      args: [
          "-w",
          "--ignore-words=.codespell-whitelist.txt",
        ]
      language: python
      types: [markdown]
      files: tmp_markdown/.*\.md$

It's clear that that is not exactly right because it's not got the ignore args etc. But something along these lines?

Tagging you @OriolAbril because I know you've thought about these with the pymc-examples repo. Here I'm not suggesting that we keep or commit the converted notebooks, just use them purely locally for spellchecking.

The only added complexity might be ensuring that the pre-commit checks also run in the remote tests. But should all be doable I hope!

.codespell-whitelist.txt Outdated Show resolved Hide resolved
docs/source/knowledgebase/glossary.rst Show resolved Hide resolved
Copy link

codecov bot commented Sep 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.35%. Comparing base (4b0fcef) to head (4d26e6d).
Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #403   +/-   ##
=======================================
  Coverage   94.35%   94.35%           
=======================================
  Files          30       30           
  Lines        1967     1967           
=======================================
  Hits         1856     1856           
  Misses        111      111           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@westford14
Copy link
Author

Thanks for the PR @westford14. This looks pretty good to me. We'll definitely merge, but I've got a suggested edit...

If you are running into loads of issues with the notebooks, then that's fine. Though it's a bit of a shame because there are likely a number of errors in notebooks. What about converting the notebooks into markdown files and running codespell on those markdown files? ChatGPT suggests a way:

python script to convert notebooks to markdown

import nbformat
from nbconvert import MarkdownExporter
import sys
import os

def notebook_to_markdown(notebook_path, output_dir):
    with open(notebook_path, 'r', encoding='utf-8') as f:
        nb = nbformat.read(f, as_version=4)
    
    markdown_exporter = MarkdownExporter()
    (body, resources) = markdown_exporter.from_notebook_node(nb)
    
    # Ensure the output directory exists
    os.makedirs(output_dir, exist_ok=True)

    # Create output file name based on the notebook name
    output_file = os.path.join(output_dir, os.path.splitext(os.path.basename(notebook_path))[0] + ".md")
    
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(body)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python notebook_to_markdown.py <notebook.ipynb> <output_dir>")
    else:
        notebook_path = sys.argv[1]
        output_dir = sys.argv[2]
        notebook_to_markdown(notebook_path, output_dir)

We'd also have to update the optional docs dependencies for those extra imports.

Add that temp folder to .gitignore

# Ignore the temporary markdown files directory
tmp_markdown/

Run the script in the pre-commit checks

- repo: local
  hooks:
    - id: convert-notebooks
      name: Convert Notebooks to Markdown
      entry: python notebook_to_markdown.py
      language: python
      files: \.ipynb$
      pass_filenames: false
      args: ['{}', 'tmp_markdown']

    - id: codespell
      name: Spell Check Markdown Files
      entry: codespell
      args: [
          "-w",
          "--ignore-words=.codespell-whitelist.txt",
        ]
      language: python
      types: [markdown]
      files: tmp_markdown/.*\.md$

It's clear that that is not exactly right because it's not got the ignore args etc. But something along these lines?

Tagging you @OriolAbril because I know you've thought about these with the pymc-examples repo. Here I'm not suggesting that we keep or commit the converted notebooks, just use them purely locally for spellchecking.

The only added complexity might be ensuring that the pre-commit checks also run in the remote tests. But should all be doable I hope!

Awesome thanks for direction here! I'll give this a go and ping you once it's ready for re-review. Though the diff might be kind of a lot to parse through 😬

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@westford14
Copy link
Author

hey @drbenvincent -- let me know what you think of this, i slightly changed the chatgpt script and also added a final pre-commit to remove the created temporary directory for the notebook to markdown conversions

again sorry for the large diff in this PR since, you were right, there were a number of spelling issues in the notebooks themselves

@drbenvincent
Copy link
Collaborator

Thanks @westford14. I hope to be able to take a proper look tomorrow or Monday

Copy link
Collaborator

@drbenvincent drbenvincent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty cool! Just 2 minor things:

My only minor gripe is the addition of the .codespell directory in the root of the repo. Would it be possible to move that into ~/docs/ or ~/docs/source/?

Thinking about long term stability and maintainability, I'm thinking that it could be good to add some test coverage to the notebook_to_markdown function. I'm in 2 minds whether this is overkill or not - In a way this new code becomes quite important as part of the pre-commit checks. But at the same time, it's important in the developer workflow and isn't a user-facing thing that will upset a lot of people if it breaks. But I am leaning towards adding some basic test coverage. What do you think?

@westford14
Copy link
Author

This looks pretty cool! Just 2 minor things:

My only minor gripe is the addition of the .codespell directory in the root of the repo. Would it be possible to move that into ~/docs/ or ~/docs/source/?

Thinking about long term stability and maintainability, I'm thinking that it could be good to add some test coverage to the notebook_to_markdown function. I'm in 2 minds whether this is overkill or not - In a way this new code becomes quite important as part of the pre-commit checks. But at the same time, it's important in the developer workflow and isn't a user-facing thing that will upset a lot of people if it breaks. But I am leaning towards adding some basic test coverage. What do you think?

  • yea abosolutely can move this into the ~/docs/source/ directory

  • for your second point, i agree that adding some basic test coverage would be good here -- i'll give it a first pass and then re-request a review

thanks for your guidance on this!

@westford14
Copy link
Author

@drbenvincent took a cut at the tests and I wasn't totally sure where the test should go, didn't really make sense to put it into the causalpy/tests dir so i put it into the docs/source/.codespell dir and created another ci step

let me know if this makes sense

Copy link
Collaborator

@drbenvincent drbenvincent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Happy to merge once the conflicts with .pre-commit-config.yaml are resolved.

I'm still a little new to GitHub Actions - but as far as I understand non of this stuff is not being triggered to run remotely when there's a PR or push to main and that the .pre-commit-config.yaml is only being run locally when a user pushes to their branch? I think that's probably what we want.

EDIT: Well it seems that these checks are being run remotely, and they are failing. Quick look suggests it's just a matter of adding in some of the new optional dependencies into the pyproject.toml?

@drbenvincent
Copy link
Collaborator

I think nbformat and nbconvert might also need to be added to the optional test dependencies.

@@ -41,3 +41,39 @@ repos:
# needed to make excludes in pyproject.toml work
# see here https://github.com/econchick/interrogate/issues/60#issuecomment-735436566
pass_filenames: false
- repo: local
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, design choices! (i.e. no right or wrong here, but would love to double-check ideas)

May I ask, codespell appears to behave more like a software test than a linter with the setup (convert-notebooks) and teardown (remove-temp-directory-notebooks), so would it be better for this to be implemented inside CI/CD instead of in pre-commit hooks?

Not suggesting that we do so, but I just wanted to see whether there's a strong(er) rationale for leaving it in a pre-commit hook than within a GitHub action independently. Is the intent for it to be run locally? Also, might there be a more compact way of configuring this?

To be clear, definitely not suggesting that we move away from what's implemented. Just asking these questions to make sure the rationale is strong.

The only ask I'd have here is to document this design choice in the documentation directory. (My criteria for documentation is that if a topic has been asked and the answers are not in the docs already, then it should be documented.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is probably more of a philosophical question for you (@ericmjl ) and @drbenvincent

initially this was just checking spelling outside of the jupyter notebooks, which to me definitely feels like a pre-commit check, but now the check is including the notebook to markdown conversion to find spelling mistakes in these notebooks. maybe this highlights a bit of scope creep for this one PR / issue because purely as a pre-commit check I think it makes sense to just look at the .py and .md files and then the notebook spelling check is more of a CI check

so, it's up to you two but i'm happy to keep plugging away at this PR with the updated doc changes and rationale changes but maybe it makes sense to prune back this PR to just the base codespell checks then cherry-pick out the commits to a new PR for an actual CI check for jupyter notebooks since those should be more static and don't need to be checked on every commit -- thoughts?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @westford14. Yeah, I was thinking the same thing - happy for you to do this.

rev: v2.3.0
hooks:
- id: codespell
args: [
Copy link

@ericmjl ericmjl Sep 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noticing that codespell can be configured by pyproject.toml. I think it's worth standardizing on pyproject.toml as the place for configuration. Establishing the pattern will be good for the long-term health of the package. @drbenvincent what are your thoughts here?

I'm mostly thinking of the -S flags btw, just to see if we can compact down the args list.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree - pyproject.toml should do as much of the project config work as possible.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea that's an easy fix

@maresb maresb force-pushed the westford14/codespell-implementation branch from 442302b to df16e4b Compare September 10, 2024 16:40
@maresb
Copy link
Contributor

maresb commented Sep 10, 2024

Sorry about the noise, I was doing a test and didn't mean to commit to this branch. I reset the branch to the previous commit.

@maresb
Copy link
Contributor

maresb commented Sep 10, 2024

Regarding the aforementioned test, I was suspicious that it might not work on Windows.

I'm not sure how representative a GitHub runner is, but the hook ran successfully on the Windows GH runner.

Copy link
Collaborator

@drbenvincent drbenvincent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, let's go with your suggestion here https://github.com/pymc-labs/CausalPy/pull/403/files#r1753839292

That way, we can keep this as a single pre-commit hook to check the .py and .md files.

@westford14
Copy link
Author

i'm going to close this PR, but reference this in new PRs that are forthcoming -- thanks for all the back and forth with me on this!

@westford14 westford14 mentioned this pull request Sep 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add auto spell check as part of pre-commit checks
4 participants