-
Notifications
You must be signed in to change notification settings - Fork 908
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix all null list column with missing child column in JSON reader #17348
base: branch-24.12
Are you sure you want to change the base?
Conversation
Co-authored-by: Nghia Truong <[email protected]>
rmm::device_buffer{}, | ||
rmm::device_buffer{}, | ||
0, | ||
std::move(child_columns)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this change needed?
Could this utility be used instead?
cudf/cpp/include/cudf/lists/detail/lists_column_factories.hpp
Lines 50 to 52 in 9cc9071
std::unique_ptr<column> make_empty_lists_column(data_type child_type, | |
rmm::cuda_stream_view stream, | |
rmm::device_async_resource_ref mr); |
rmm::device_buffer{}, | ||
rmm::device_buffer{}, | ||
0, | ||
std::move(child_columns)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this do different than the factory call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discovered through profiling that make_lists_column
and make_structs_column
introduce significant overhead by unnecessarily calling to purge_non_empty_nulls
and superimpose_nulls_no_sanitize
. These sanitization may be needed in general, however, here we know that the child(ren) of JSON output column cannot have non-null rows if the parent is null (due to invalid input) thus we want to avoid such overhead by calling to the column constructor directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@karthikeyann Please add comments to clarify the reason why we have this change:
// Do not use `cudf::make_lists_column` since we do not need to call `purge_nonempty_nulls`
// on the child column as it does not have non-empty nulls.
and
// Do not use `cudf::make_structs_column` since we do not need to call `superimpose_nulls`
// on the children columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've filed a related issue for this: #17356
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems we could fix the make_structs_column
and make_lists_column
to not call these utilities for empty columns.
Or the utilities could be fixed to no-op for empty columns perhaps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either fix is good.
cpp/tests/io/json/json_test.cpp
Outdated
{{"a", "c2"}}}; | ||
in_options.set_dtypes(dtype_schema); | ||
cudf::io::table_with_metadata result = cudf::io::read_json(in_options); | ||
// Make sure we have column "a":[float] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Make sure we have column "a":[float] | |
// Make sure we have column "a":[int64_t] |
ASSERT_EQ(result.metadata.schema_info[0].children.size(), 2); | ||
EXPECT_EQ(result.metadata.schema_info[0].children[0].name, "offsets"); | ||
EXPECT_EQ(result.metadata.schema_info[0].children[1].name, "element"); | ||
// Make sure we have all null list "c2": [{"d": ""}] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test fails because the output for this column is still not all-null:
cudf::list_view<cudf::struct_view<cudf::string_view,>>:
Length : 2
Offsets : 0, 0, 0
Null count: 1
01
cudf::struct_view<cudf::string_view,>:
Length : 0:
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## branch-24.12 #17348 +/- ##
===============================================
Coverage ? 83.84%
===============================================
Files ? 225
Lines ? 32018
Branches ? 0
===============================================
Hits ? 26844
Misses ? 5174
Partials ? 0 ☔ View full report in Codecov by Sentry. 🚨 Try these New Features:
|
Fixes #17341
Description
Checklist