add great_tables renderer #2846

Merged: 2 commits, Oct 22, 2024

@cosmicBboy cosmicBboy commented Oct 21, 2024

Tracking issue

flyteorg/flyte#5882

Why are the changes needed?

The current deck renderer for the flytekit pandera plugin is very basic, see here. We could provide more value to users by showing a more visually rich report that highlights schema-level and data-level errors.

What changes were proposed in this pull request?

  1. Adds great_tables as a dependency to the pandera flytekit plugin
  2. Updates the pandera deck renderer to show a prettier data validation report:
     [two screenshots of the new validation report]
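
To make the intent concrete, here is a minimal, dependency-free sketch of the kind of report described above: failure cases are split into schema-level and data-level groups and each group is rendered as an HTML table. This is not the plugin's implementation (the PR delegates table rendering to great_tables). The record fields (`kind`, `column`, `check`, `failure_case`) and the function names `render_table` and `render_report` are assumptions made for illustration only.

```python
from html import escape

# Hypothetical failure records; the field names are illustrative only.
FAILURES = [
    {"kind": "schema", "column": "worker_id", "check": "column_in_dataframe", "failure_case": "worker_id"},
    {"kind": "data", "column": "hourly_pay", "check": "greater_than_or_equal_to(7)", "failure_case": 1.0},
    {"kind": "data", "column": "hours_worked", "check": "greater_than_or_equal_to(10)", "failure_case": -1.0},
]

FIELDS = ("column", "check", "failure_case")


def render_table(title, rows):
    """Render one group of failures as an HTML table with a caption."""
    header = "<tr>" + "".join(f"<th>{escape(name)}</th>" for name in FIELDS) + "</tr>"
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(row[name]))}</td>" for name in FIELDS) + "</tr>"
        for row in rows
    )
    return f"<table><caption>{escape(title)}</caption>{header}{body}</table>"


def render_report(failures):
    """Group failures by kind and emit one table per non-empty group."""
    groups = (("Schema-level errors", "schema"), ("Data-level errors", "data"))
    parts = []
    for title, kind in groups:
        rows = [f for f in failures if f["kind"] == kind]
        if rows:
            parts.append(render_table(title, rows))
    return "\n".join(parts) or "<p>No validation errors.</p>"


html_report = render_report(FAILURES)
```

Escaping every cell matters here because failure cases are raw user data and end up inside the deck's HTML.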

How was this patch tested?

Run the following workflow:

import typing
from typing import Annotated

import json
import pandas as pd
import pandera as pa
from flytekit import ImageSpec, task, workflow
from flytekitplugins.pandera import ValidationConfig
from pandera.typing import DataFrame, Series



custom_image = ImageSpec(
    apt_packages=["git"],
    packages=[
        "git+https://github.com/flyteorg/flytekit@eed6b0f#subdirectory=plugins/flytekit-pandera",
        "scikit-learn",
        "pyarrow",
    ],
)


class InSchema(pa.DataFrameModel):
    hourly_pay: float = pa.Field(ge=7)
    hours_worked: float = pa.Field(ge=10)
    number: int
    another_number: int

    @pa.check("hourly_pay", "hours_worked")
    def check_numbers_are_positive(cls, series: Series) -> Series[bool]:
        """Defines a column-level custom check."""
        return series > 0

    class Config:
        coerce = True


class IntermediateSchema(InSchema):
    total_pay: float

    @pa.dataframe_check
    def check_total_pay(cls, df: DataFrame) -> Series[bool]:
        """Defines a dataframe-level custom check."""
        return df["total_pay"] == df["hourly_pay"] * df["hours_worked"]


class OutSchema(IntermediateSchema):
    worker_id: Series[str] = pa.Field()


config = ValidationConfig(on_error="warn")

def fn(df: DataFrame[InSchema]):
    ...


@task(container_image=custom_image, enable_deck=True, deck_fields=[])
def json_to_dataframe(data: str) -> Annotated[DataFrame[InSchema], config]:
    """Helper task to convert a dictionary input to a dataframe."""
    data = json.loads(data)
    return pd.DataFrame(data)


@task(container_image=custom_image, enable_deck=True, deck_fields=[])
def total_pay(df: Annotated[DataFrame[InSchema], config]) -> Annotated[DataFrame[IntermediateSchema], config]:  # noqa : F811
    return df.assign(total_pay=df.hourly_pay * df.hours_worked)


@task(container_image=custom_image, enable_deck=True, deck_fields=[])
def add_ids(
    df: Annotated[DataFrame[IntermediateSchema], config],
    worker_ids: typing.List[str],
) -> Annotated[DataFrame[OutSchema], config]:
    return df
    # return df.assign(worker_id=worker_ids)


@workflow
def process_data(  # noqa : F811
    data: str = '{"hourly_pay": [1, 8, 9], "hours_worked": [30.5, 40.0, -1.0], "number": ["a", "b", "c"], "another_number": ["d", "e", "f"]}',
    worker_ids: typing.List[str] = ["a", "b", "c"],
) -> Annotated[DataFrame[OutSchema], config]:
    return add_ids(df=total_pay(df=json_to_dataframe(data=data)), worker_ids=worker_ids)

Run with:

pyflyte -vvv run foo.py process_data

You should see a path to the local deck in the output:

16:08:04.898630 INFO     utils.py:340 - Translate the output to literals. [Time: 0.050881s]
16:08:04.899190 INFO     utils.py:340 - dispatch execute. [Time: 0.051538s]
16:08:04.900302 INFO     deck.py:184 - total_pay task creates flyte deck html to
                         file:///var/folders/4q/frdnh9l10h53gggw1m59gr9m0000gp/T/flyte-nclyneri/sandbox/local_flytekit/7711bf0b5fab93febc393bb2ddf43dde/deck.html

Open the file and inspect the Flyte deck.
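
Because the logger may wrap a long path across terminal lines, copying it by hand is error-prone. Below is a small sketch that pulls `file://` deck URIs out of captured log text and undoes the wrapping. It assumes the path contains no spaces and ends in `deck.html`; `find_deck_uris` and `open_deck` are hypothetical helpers, not part of flytekit.

```python
import re
import webbrowser
from urllib.parse import urlparse

# Sample log text in the wrapped form a terminal may produce.
SAMPLE_LOG = """\
16:08:04.900302 INFO     deck.py:184 - total_pay task creates flyte deck html to
                         file:///var/folders/4q/frdnh9l10h53gggw1m59gr9m0000gp/T/flyte-nclyneri/sandbox/local_flytek
                         it/7711bf0b5fab93febc393bb2ddf43dde/deck.html
"""

# Shortest span from "file://" to "deck.html", allowed to cross line breaks.
DECK_URI = re.compile(r"file://.*?deck\.html", re.DOTALL)


def find_deck_uris(log_text):
    """Return deck URIs found in the log, with wrap-induced whitespace removed."""
    return [re.sub(r"\s+", "", match.group(0)) for match in DECK_URI.finditer(log_text)]


def open_deck(uri):
    """Open a local deck in the default browser (not called in this sketch)."""
    if urlparse(uri).scheme != "file":
        raise ValueError(f"expected a file:// URI, got {uri!r}")
    return webbrowser.open(uri)


deck_uris = find_deck_uris(SAMPLE_LOG)
print(deck_uris[0])
```

Calling `open_deck(deck_uris[0])` would then open the report in the default browser.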

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

Signed-off-by: Niels Bantilan <[email protected]>

codecov bot commented Oct 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 38.25%. Comparing base (61b5896) to head (7d9640c).
Report is 3 commits behind head on master.

❗ There is a different number of reports uploaded between BASE (61b5896) and HEAD (7d9640c).

HEAD has 2 uploads less than BASE

Flag    BASE (61b5896)    HEAD (7d9640c)
        3                 1

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #2846       +/-   ##
===========================================
- Coverage   79.15%   38.25%   -40.90%     
===========================================
  Files         196      196               
  Lines       20403    20367       -36     
  Branches     2632     2631        -1     
===========================================
- Hits        16149     7791     -8358     
- Misses       3526    12370     +8844     
+ Partials      728      206      -522     
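
The percentages in the diff can be recomputed from the counts, assuming coverage is hits divided by total lines, with lines = hits + misses + partials (the counts in the table are consistent with that). A quick check using only the numbers above; `coverage_pct` is a hypothetical helper:

```python
def coverage_pct(hits, misses, partials):
    """Line coverage as a percentage of all tracked lines."""
    lines = hits + misses + partials
    return 100 * hits / lines


base = coverage_pct(hits=16149, misses=3526, partials=728)   # BASE (61b5896)
head = coverage_pct(hits=7791, misses=12370, partials=206)   # HEAD (7d9640c)

print(f"base={base:.2f}% head={head:.2f}% delta={head - base:+.2f}%")
# prints: base=79.15% head=38.25% delta=-40.90%
```

The total line count barely changes (20403 vs 20367) while hits fall by more than half, which is consistent with the bot's warning that HEAD has fewer uploaded reports than BASE.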


Signed-off-by: Niels Bantilan <[email protected]>

@eapolinario eapolinario left a comment

Pretty cool.

@eapolinario eapolinario merged commit da8436e into master Oct 22, 2024
105 of 106 checks passed

2 participants