SNOW-1805840: Augment telemetry with method_call_count #2804
…emetry Signed-off-by: Labanya Mukhopadhyay <[email protected]>
src/snowflake/snowpark/modin/plugin/compiler/snowflake_query_compiler.py
@@ -295,6 +318,13 @@ def _telemetry_helper(
need_to_restore_args0_api_calls = True
session = args[0]._query_compiler._modin_frame.ordered_dataframe.session
class_prefix = args[0].__class__.__name__
args[0]._query_compiler._method_call_counts[func.__qualname__] += 1
Can we generate the function name with _gen_func_name, as we do for the API call list? https://github.com/snowflakedb/snowpark-python/blob/2247ae5b14ef83c75b5ea6707dd6e53499c197f1/src/snowflake/snowpark/modin/plugin/_internal/telemetry.py#L316C17-L316C31
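To make the counting in the hunk above concrete, here is a minimal sketch of a per-instance call counter backed by `defaultdict(int)`. Everything in it (`ToyQueryCompiler`, `ToyFrame`, `count_calls`, and the key format) is hypothetical and only illustrates the pattern; it does not reproduce `_telemetry_helper` or `_gen_func_name`.

```python
from collections import defaultdict
import functools


class ToyQueryCompiler:
    """Hypothetical stand-in for a query compiler; not the Snowpark class."""

    def __init__(self):
        # Missing keys default to 0, so `+= 1` works on the first call.
        self._method_call_counts = defaultdict(int)


def count_calls(func):
    """Hypothetical decorator: bump a counter on the receiver's query compiler."""

    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        # The key format is an assumption; the real helper may build a different name.
        key = f"{type(self).__name__}.{func.__name__}"
        self._query_compiler._method_call_counts[key] += 1
        return func(self, *args, **kwargs)

    return wrapper


class ToyFrame:
    """Hypothetical frontend object that owns one query compiler."""

    def __init__(self):
        self._query_compiler = ToyQueryCompiler()

    @count_calls
    def head(self):
        return self

    @count_calls
    def tail(self):
        return self


frame = ToyFrame()
frame.head()
frame.head()
frame.tail()
counts = dict(frame._query_compiler._method_call_counts)
```

In this toy, building the key from the class name rather than `__qualname__` is one way to keep counter keys aligned with whatever naming the API call list uses, which is the concern raised in the comment.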
LGTM! Thanks
Can you add a test where 2 different methods are called? Also, how does this interact with query compiler methods that call other query compiler methods?
I've added a test that calls two different methods on the same QC.
Trying out align, it returns telemetry under its own func name, for example:
[{'func_name': 'DataFrame.align', 'category': 'snowpark_pandas', 'error_msg': None, 'call_count': 1, 'api_calls': [{'name': 'DataFrame.align'}]}]
Did you have another specific method in mind?
As an example, SnowflakeQueryCompiler.any with axis=0 calls SnowflakeQueryCompiler._bool_reduce_helper, which in turn calls SnowflakeQueryCompiler.agg. I was wondering how this would be reflected in telemetry.
In this case, _query_compiler._method_call_counts would include the func_name count 'DataFrame.BasePandasDataset.any' = 1, as well as the other attributes tracked with telemetry, such as 'DataFrame.__repr__' and 'DataFrame.property.iloc_get'.
I see. Would it be accurate to say that _query_compiler._method_call_counts only tracks methods called on this particular instance of the query compiler (the example call chain I gave returns a new query compiler instance every time), and that these counts are only used in telemetry for certain frontend methods, like __dataframe__ and __repr__, that we specify explicitly?
Yes, exactly. The goal is to track call counts on the same query compiler instance. More info is in the design doc here: https://docs.google.com/document/d/1EfqQwejVbF5_36hnOP-ap0t3NaCWmDz62iAcR0PtX20/edit?tab=t.0#heading=h.4uu48icmuq7z
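The per-instance behavior discussed in this thread can be illustrated with a small toy model. It assumes only what the comments state: counts live on one query compiler instance, and an operation that returns a new instance starts with empty counts. The class and method names (`ToyQueryCompiler`, `export`, `transform`) are hypothetical and are not part of Snowpark.

```python
from collections import defaultdict


class ToyQueryCompiler:
    """Hypothetical stand-in; not the Snowpark query compiler."""

    def __init__(self):
        self._method_call_counts = defaultdict(int)

    def export(self):
        # Does not create a new compiler, so repeated calls accumulate
        # on this instance. Returns the running count for illustration.
        self._method_call_counts["export"] += 1
        return self._method_call_counts["export"]

    def transform(self):
        # Returns a brand-new compiler whose counts start empty.
        self._method_call_counts["transform"] += 1
        return ToyQueryCompiler()


original = ToyQueryCompiler()
first = original.export()
second = original.export()
derived = original.transform()
derived_first = derived.export()
```

Here `second` is 2 because both calls hit the same instance, while `derived_first` is 1 because `transform` handed back a fresh instance with its own counter.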
The PR description talks about introducing interchange_call_count, which I don't think is accurate; there is no such variable/field.
Overall looks good to me. Just left a few nits/questions.
assert len(telemetry_data) == 6
# s calls __dataframe__() for the first time.
assert telemetry_data[0]["call_count"] == 1
# s calls __dataframe__() for the second time.
Is this second call served from some cache or recomputed?
I believe the telemetry call data is recomputed every time; @sfc-gh-azhan might be able to confirm.
…com:snowflakedb/snowpark-python into lmukhopadhyay-SNOW-1805840-telem-call-count
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes SNOW-1805840
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
Adding method_call_count, which is the number of times a pandas API method has been called, and interchange_call_count, which is the number of times the __dataframe__ method has been called. See more info in the interchange protocol design doc here: https://docs.google.com/document/d/1EfqQwejVbF5_36hnOP-ap0t3NaCWmDz62iAcR0PtX20/edit?tab=t.0#heading=h.4uu48icmuq7z