Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix all null list column with missing child column in JSON reader #17348

Open
wants to merge 10 commits into
base: branch-24.12
Choose a base branch
from

Conversation

karthikeyann
Copy link
Contributor

@karthikeyann karthikeyann commented Nov 16, 2024

Fixes #17341

  • expose make_empty_column
  • fix all null list with empty nested child on missing list child column when prune columns is enabled
  • add unit test

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@karthikeyann karthikeyann requested a review from a team as a code owner November 16, 2024 05:36
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Nov 16, 2024
@karthikeyann karthikeyann added bug Something isn't working 3 - Ready for Review Ready for review by team cuIO cuIO issue non-breaking Non-breaking change labels Nov 16, 2024
rmm::device_buffer{},
rmm::device_buffer{},
0,
std::move(child_columns));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change needed?
Could this utility be used instead?

std::unique_ptr<column> make_empty_lists_column(data_type child_type,
rmm::cuda_stream_view stream,
rmm::device_async_resource_ref mr);

rmm::device_buffer{},
rmm::device_buffer{},
0,
std::move(child_columns));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this do different than the factory call?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discovered through profiling that make_lists_column and make_structs_column introduce significant overhead by unnecessarily calling to purge_non_empty_nulls and superimpose_nulls_no_sanitize. These sanitization may be needed in general, however, here we know that the child(ren) of JSON output column cannot have non-null rows if the parent is null (due to invalid input) thus we want to avoid such overhead by calling to the column constructor directly.

Copy link
Contributor

@ttnghia ttnghia Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karthikeyann Please add comments to clarify the reason why we have this change:

// Do not use `cudf::make_lists_column` since we do not need to call `purge_nonempty_nulls`
// on the child column as it does not have non-empty nulls.

and

// Do not use `cudf::make_structs_column` since we do not need to call `superimpose_nulls`
// on the children columns.

Copy link
Contributor

@ttnghia ttnghia Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've filed a related issue for this: #17356

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems we could fix the make_structs_column and make_lists_column to not call these utilities for empty columns.
Or the utilities could be fixed to no-op for empty columns perhaps?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either fix is good.

{{"a", "c2"}}};
in_options.set_dtypes(dtype_schema);
cudf::io::table_with_metadata result = cudf::io::read_json(in_options);
// Make sure we have column "a":[float]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Make sure we have column "a":[float]
// Make sure we have column "a":[int64_t]

ASSERT_EQ(result.metadata.schema_info[0].children.size(), 2);
EXPECT_EQ(result.metadata.schema_info[0].children[0].name, "offsets");
EXPECT_EQ(result.metadata.schema_info[0].children[1].name, "element");
// Make sure we have all null list "c2": [{"d": ""}]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test fails because the output for this column is still not all-null:

cudf::list_view<cudf::struct_view<cudf::string_view,>>:
Length : 2
Offsets : 0, 0, 0
Null count: 1
01
   cudf::struct_view<cudf::string_view,>:
   Length : 0:

Copy link

codecov bot commented Nov 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (branch-24.12@302e625). Learn more about missing BASE report.

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-24.12   #17348   +/-   ##
===============================================
  Coverage                ?   83.84%           
===============================================
  Files                   ?      225           
  Lines                   ?    32018           
  Branches                ?        0           
===============================================
  Hits                    ?    26844           
  Misses                  ?     5174           
  Partials                ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Status: In Progress
Status: Burndown
Development

Successfully merging this pull request may close these issues.

[FEA] read_json should output all-nulls columns for the schema columns that do not match with the input JSON
3 participants