[Feature] support iceberg metadata table #33186

stephen-shelby · 2023-10-19T17:57:46Z

Fixes #issue
support following sql:
select * from table_name$snapshots
select * from table_name$manifests
select * from table_name$files

select * from orders_test$snapshots order by committed_at desc limit 1 \G
*************************** 1. row ***************************
committed_at: 2023-10-20 01:42:23
snapshot_id: 4924636944729424676
parent_id: 7219695656029995314
operation: append
manifest_list: hdfs://e/orders_test-f70fff62928f4257860d7/metadata/snap-4924636944729424676-1.avro
summary: {"added-data-files":"2406","added-records":"1500000","added-files-size":"39169690","changed-partition-count":"2406","total-records":"1500000","total-files-size":"39169690","total-data-files":"2406","total-delete-files":"0","total-position-deletes":"0","total-equality-deletes":"0"}
select * from orders_test$manifests\G
*************************** 1. row ***************************
path: hdfs://ss/metadata/5d6e9209-07dc-435d-bb28-67e058c101a6-m0.avro
length: 252203
partition_spec_id: 0
added_snapshot_id: 4924636944729424676
added_data_files_count: 2406
added_rows_count: 1500000
existing_data_files_count: 0
existing_rows_count: 0
deleted_data_files_count: 0
deleted_rows_count: 0
partition_summaries: [{"contains_null":false,"contains_nan":false,"lower_bound":"1992-01-01","upper_bound":"1998-08-02"},{"contains_null":false,"contains_nan":false,"lower_bound":"0","upper_bound":"0"}]
select * from orders_test$files limit 1\G
*************************** 1. row ***************************
content: 0
file_path: hdfs://xx/data/o_orderdate=1992-12-15/o_shippriority=0/xxxxx.orc
file_format: ORC
spec_id: 0
record_count: 628
file_size_in_bytes: 16650
column_sizes:
value_counts: {"1":628,"2":628,"3":628,"4":628,"5":628,"6":628,"7":628,"8":628,"9":628}
null_value_counts: {"1":0,"2":0,"3":0,"4":0,"5":0,"6":0,"7":0,"8":0,"9":0}
nan_value_counts:
lower_bounds: {"1":"11552","2":"386","3":"F","4":"981","5":"1992-12-15","6":"1- URGENT","7":"Clerk#000000002","8":"0","9":" Tiresias lose b"}
upper_bounds: {"1":"5962919","2":"149743","3":"F","4":"460642","5":"1992-12-15","6":"5-LOW","7":"Clerk#000000999","8":"0","9":"yly. pending, ew"}
split_offsets:
equality_ids:
find max value of partition column
select max(get_json_string(partition_summaries, '$[0].upper_bound')) from orders_test$manifests;
+---------------------------------------------------------------+
| max(get_json_string(partition_summaries, '$[0].upper_bound')) |
+---------------------------------------------------------------+
| 1998-08-02 |
+---------------------------------------------------------------+
find min value of non-partition column
select min(get_json_string(lower_bounds, '$.2')) from orders_test$files;
+-------------------------------------------+
| min(get_json_string(lower_bounds, '$.2')) |
+-------------------------------------------+
| 1 |
+-------------------------------------------+

What type of PR is this:

Does this PR entail a change in behavior?

Yes, this PR will result in a change in behavior.
No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

Interface/UI changes: syntax, type conversion, expression evaluation, display information
Parameter changes: default values, similar parameters but with different default values
Policy changes: use new policy to replace old one, functionality automatically enabled
Feature removed
Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

I have added test cases for my bug fix or my new feature
This pr needs user documentation (for new or modified features or behaviors)
- I have added documentation for my new feature or new function

Bugfix cherry-pick branch check:

Signed-off-by: stephen <[email protected]>

starrocks-cr · 2023-10-19T17:58:50Z

fe/fe-core/src/main/java/com/starrocks/connector/iceberg/FilesTable.java

+                                Conversions.fromByteBuffer(idToTypeMapping.get(entry.getKey()), entry.getValue()))));
+        return GsonUtils.GSON.toJson(values);
+    }
+}


Here are some suggestions for the given code:

Error Handling: In the getFiles method, the CloseableIterator<FileScanTask> fileScanTaskIterator is closed inside a try-catch with a catch block that ignores any IOException. Consider at least logging the error message or almost certainly handling it in some way since failing to close resources can lead to memory leaks.

Conciseness and Readability: Some methods (e.g., getLowerBounds and getUpperBounds) seem almost duplicated; you could consider refactoring them into one generic method to make the code more readable and reduce redundancy.

Testing: From an overall perspective, this class and its methods seem complex. Depending on your testing strategy, many edge cases may need to be handled, so ensure appropriate unit tests are created wherever required.

Commenting: Although the naming conventions you used provide some context about what your code is supposed to do, adding comments for non-trivial sections of the code would improve readability.

Null Checking: Check for null values where relevant before invoking methods. An instance is where fileScanTaskIterator.next() is called. If the iterator does not have a next element, it could throw a NoSuchElementException. You're using fileScanTasks.hasNext() which should prevent this but the next line uses fileScanTaskIterator.next().

Please note that these suggestions might depend on the bigger context and coding guidelines you follow in your project.

starrocks-cr · 2023-10-19T17:58:53Z

be/src/exec/schema_scanner/schema_iceberg_files_scanner.cpp

+    return fill_chunk(chunk);
+}
+
+} // namespace starrocks


The code seems generally well-structured and follows good practices. I do have a few suggestions, however:

Error messages could be more descriptive: In functions like get_next() or start(), instead of just mentioning 'Used before initialized' or 'input pointer is null' you can provide the function name as well to make debugging easier.

Avoid magic numbers: You might consider defining constants for each case in fill_chunk(ChunkPtr* chunk). Magic numbers like 1, 2, 3...14 are not immediately clear on what they signify.

Repeated Code: Inside your fill_chunk function's switch statement, the cases where you're dealing with string fields look repetitive. Consider creating a helper method to handle varchar types where you just pass the file attribute.

Addressing possible empty strings: For varchar fields, it's a common practice to first check if the string is not empty before proceeding with other operations. Consider adding this logic to further bullet-proof your code.

Variable Naming, Explicitness: In the start() function, _files_index = 0; would make better sense just after _files_res has been loaded. Also, using meaningful naming (like current_file_index) could improve readability.

Exception Handling: Consider adding exception handling or checking where RETURN_IF_ERROR macro is used. You may want to handle exceptions depending on the seriousness of the business impact.

Again, these are slight improvements that can be made over already good quality code.

starrocks-cr · 2023-10-19T17:59:07Z

be/src/exec/schema_scanner/schema_iceberg_manifests_scanner.cpp

+    return fill_chunk(chunk);
+}
+
+} // namespace starrocks


The code you've presented seems to be well-structured and formatted, which is good. Here are a few suggestions that may enhance the quality and readablility of the code:

Avoid using namespace in headers: If any headers [implicitly] used have a using namespace statement, it's better to avoid or remove them, since it can lead to name conflicts.

Put magic numbers into a constant: In the method fill_chunk(ChunkPtr* chunk), the switch case statements use numerical values from 1 to 11. These could be replaced with descriptive enumeration or constants for better readability.

Redundant code blocks: There are redundant brackets {...} in the cases of the switch statement in the method fill_chunk(). The extra scope created by these brackets is not necessary and can therefore be removed, enhancing overall readability.

Error message context: When returning an error status such as Status::InternalError("Used before initialized."), you might want to add context to help with debugging, for example, including the variable name or function where the error originated.

Commenting: Although your methods likely have self-explanatory names, adding comments explaining what they do can improve readability.

Use safer C++ casts: Instead of using C-style casting (void*) , use static_cast<void*>. They're safer because C++ style casts allow the compiler to check type information when the cast is performed.

Check for null pointers: Where applicable, you could check for nullity before dereferencing pointers. This could prevent potential segmentation faults.

Safe division: Be aware when performing the division when initializing SchemaScanner. Ensure the denominator isn't zero.

Input validation: Validate inputs where reasonable, throw or return meaningful exceptions or errors back to the user. For example, in the start() method before accessing _param->catalog, validate if _param is a valid pointer.

Potential memory leak: If you've allocated memory during certain operations but didn't release it properly when you're done, a memory leak could occur. It's hard to tell with the given code, so make sure you handle that correctly if you've reserved memory on the heap somewhere outside of the provided code.

Note: Some recommendations might seem not critical because the context or the overall project structure is unknown. The code you provided only contains part of a class implementation; suggestions might change depending on the missing parts or the overall project conventions and standards.
Review the suggestions considering the specificity of your software project.

starrocks-cr · 2023-10-19T17:59:27Z

be/src/exec/schema_scanner/schema_iceberg_snapshots_scanner.cpp

+    return fill_chunk(chunk);
+}
+
+} // namespace starrocks


Overall, the code is quite clear and easy to understand. However, there are a few improvements and suggestions that can be made:

Magic Numbers: Instead of using magic numbers directly in the fill_chunk method (e.g., case 1, case 2, etc.), you can use named constants. This can increase readability and maintainability.

Error Messages: The error messages could be more descriptive. For instance, "IP or port doesn't exist" can specify which one doesn't exist.

Null Check: In the start function, there is an extensive usage of nullptr checks before setting the request parameters. If _param is really valid, consider refactoring the null check into a separate validation method, checking all fields at once.

Comments: While the current comments provide some information (such as for committed_at), it would be helpful to have comments for other parts of the code as well, especially for complex/important calculations or logic.

Code Duplication: There's some repeated code involving creating a Slice object and filling the column with its data. You might be able to make this a single function call with the necessary parameters passed in.

Default Function Statements: The declaration SchemaIcebergSnapshotsScanner::~SchemaIcebergSnapshotsScanner() = default; could be removed if it's not doing anything beyond the compiler-generated destructor.

Namespace Usage: Consider using explicit namespace (starrocks::TypeName) instead of using clause (using namespace starrocks). It better communicates where identifiers come from and also prevents symbol clashing.

Consistent Naming Convention: Ensure that the naming convention is consistent across the entire project. For example, some function names are 'snake_case' while others are 'camelCase'.

The possibility of implementing these changes depends on the wider context of your code base and the standards of your specific project.

starrocks-cr · 2023-10-19T17:59:37Z

fe/fe-core/src/main/java/com/starrocks/connector/iceberg/ManifestsTable.java

+        }
+        return res.toString();
+    }
+}


Overall, this code is well-written with clear responsibilities divided into methods. However, there are a few potential improvements you could make:

Avoid Magic Numbers: In the getPartitionSummaries method, consider avoiding the use of the magic number i in your loop and opt instead for a more descriptive variable name tailored to its function.

for (int partitionIndex = 0; partitionIndex < summaries.size(); partitionIndex++) {..}

Use Optional efficiently:

Instead of having an explicit condition check if (!icebergTable.getSnapshot().isPresent()) { return res; }, you could benefit from using Optional.ifPresent.

icebergTable.getSnapshot().ifPresent(snapshot -> { List<ManifestFile> manifestFiles = snapshot.allManifests(icebergTable.getNativeTable().io()); // Rest of logic here... });

Ensure Thread Safety: If multiple threads can access or modify ConnectorTableId.CONNECTOR_ID_GENERATOR, ensure that it’s thread-safe. Undefined behavior or data races could happen if one thread modifies the CONNECTOR_ID_GENERATOR while another uses it at the same time.

Error Handling: How does your code handle exceptions? Currently there's no try-catch blocks or documentations to indicate what exceptions may be thrown and how they are handled. Pay particular attention to possible null values, missing fields, I/O operations, file not found errors, etc.

Code Comments: While the code is generally clean, providing comments giving a brief rundown on what certain sections of the code do would increase maintainability and readability. This is especially important in the getManifests() and getPartitionSummaries() methods which contain some potentially complex logic.

Immutability: tIcebergManifests in getManifests() can be made final to enhance code safety.

final List<TIcebergManifest> tIcebergManifests = new ArrayList<>();

Do remember, these are suggestions and practices which might enhance the maintainability, readability, and quality of your code in general. The current code you provided is organized pretty well and does not seem to have any major issues at first glance.

sonarcloud · 2023-10-19T18:06:56Z

SonarCloud Quality Gate failed.

1 Bug
0 Vulnerabilities
0 Security Hotspots
49 Code Smells

0.0% Coverage
2.5% Duplication

Catch issues before they fail your Quality Gate with our IDE extension SonarLint

wanpengfei-git · 2023-10-19T19:50:52Z

[FE Incremental Coverage Report]

😞 fail : 30 / 239 (12.55%)

file detail

	path	covered_line	new_line	coverage	not_covered_line_detail
🔵	com/starrocks/service/FrontendServiceImpl.java	0	19	00.00%	[2710, 2711, 2712, 2713, 2714, 2716, 2721, 2722, 2723, 2724, 2725, 2727, 2732, 2733, 2734, 2735, 2736, 2737, 2739]
🔵	com/starrocks/server/MetadataMgr.java	0	2	00.00%	[300, 301]
🔵	com/starrocks/connector/CatalogConnectorMetadata.java	0	2	00.00%	[116, 117]
🔵	com/starrocks/connector/iceberg/FilesTable.java	0	73	00.00%	[52, 70, 71, 72, 76, 77, 80, 84, 88, 89, 90, 91, 92, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 112, 113, 115, 116, 118, 119, 121, 122, 124, 125, 127, 128, 130, 131, 133, 134, 137, 138, 140, 142, 146, 147, 148, 149, 150, 154, 155, 156, 157, 159, 162, 163, 164, 166, 167, 168, 169, 173, 174, 175, 177, 178, 179, 180]
🔵	com/starrocks/connector/iceberg/SnapshotsTable.java	0	26	00.00%	[38, 48, 49, 50, 54, 55, 58, 62, 66, 67, 68, 69, 70, 71, 72, 73, 75, 76, 78, 79, 81, 82, 84, 85, 86, 87]
🔵	com/starrocks/connector/iceberg/ManifestsTable.java	0	46	00.00%	[42, 57, 58, 59, 63, 64, 67, 71, 75, 76, 77, 78, 79, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 114, 115, 116, 119, 121]
🔵	com/starrocks/connector/ExternalMetadataTable.java	0	2	00.00%	[19, 23]
🔵	com/starrocks/connector/ConnectorMetadata.java	0	1	00.00%	[116]
🔵	com/starrocks/sql/plan/PlanFragmentBuilder.java	1	4	25.00%	[1334, 1335, 1336]
🔵	com/starrocks/planner/SchemaScanNode.java	4	12	33.33%	[143, 147, 148, 151, 155, 156, 263, 266]
🔵	com/starrocks/connector/iceberg/IcebergMetadata.java	7	19	36.84%	[208, 227, 236, 237, 238, 239, 242, 243, 245, 247, 249, 251]
🔵	com/starrocks/connector/iceberg/IcebergTableName.java	13	28	46.43%	[39, 47, 52, 56, 57, 59, 60, 61, 62, 64, 70, 79, 80, 81, 82]
🔵	com/starrocks/connector/iceberg/IcebergTableType.java	5	5	100.00%	[]

wanpengfei-git · 2023-10-19T19:51:07Z

[BE Incremental Coverage Report]

😞 fail : 0 / 241 (00.00%)

file detail

	path	new_line	coverage	not_covered_line_detail
🔵	src/exec/schema_scanner/schema_helper.cpp	10	00.00%	[216, 220, 221, 222, 225, 229, 230, 231, 234, 238]
🔵	src/exec/schema_scanner/schema_iceberg_manifests_scanner.cpp	72	00.00%	[37, 39, 41, 43, 44, 45, 46, 47, 49, 50, 52, 53, 55, 56, 58, 60, 61, 64, 65, 66, 67, 68, 69, 70, 73, 74, 75, 79, 81, 82, 84, 86, 87, 89, 91, 92, 94, 96, 97, 99, 101, 102, 104, 106, 107, 109, 111, 112, 114, 116, 117, 119, 121, 122, 124, 127, 128, 129, 133, 134, 137, 138, 141, 142, 143, 145, 146, 147, 149, 150, 152, 153]
🔵	src/exec/schema_scanner.cpp	6	00.00%	[175, 176, 177, 178, 179, 180]
🔵	src/exec/schema_scanner/schema_iceberg_snapshots_scanner.cpp	59	00.00%	[32, 34, 36, 38, 39, 40, 41, 42, 44, 45, 47, 48, 50, 51, 53, 55, 56, 59, 60, 61, 62, 63, 64, 65, 68, 69, 70, 74, 76, 77, 79, 81, 82, 84, 87, 88, 89, 93, 96, 97, 98, 102, 105, 106, 107, 111, 112, 115, 116, 119, 120, 121, 123, 124, 125, 127, 128, 130, 131]
🔵	src/exec/pipeline/scan/schema_scan_context.cpp	4	00.00%	[118, 119, 121, 122]
🔵	src/exec/schema_scanner/schema_iceberg_files_scanner.cpp	90	00.00%	[40, 41, 43, 45, 46, 47, 48, 49, 51, 52, 54, 55, 57, 58, 59, 61, 63, 64, 67, 68, 69, 70, 71, 72, 73, 75, 76, 78, 81, 82, 83, 87, 90, 91, 92, 96, 98, 99, 101, 103, 104, 106, 108, 109, 111, 114, 115, 116, 120, 123, 124, 125, 129, 132, 133, 134, 138, 141, 142, 143, 147, 150, 151, 152, 156, 159, 160, 161, 165, 168, 169, 170, 174, 177, 178, 179, 183, 184, 187, 188, 191, 192, 193, 195, 196, 197, 199, 200, 202, 203]

imay · 2023-10-19T20:23:28Z

@stephen-shelby
I think we need to parse the meta file from Backend directly other than reading it from Frontends.

[Feature] support iceberg metadata table

aa25ece

Signed-off-by: stephen <[email protected]>

wanpengfei-git added the PROTO-REVIEW label Oct 19, 2023

mergify bot assigned stephen-shelby Oct 19, 2023

wanpengfei-git requested a review from a team October 19, 2023 17:58

starrocks-cr bot reviewed Oct 19, 2023

View reviewed changes

stephen-shelby mentioned this pull request Jul 25, 2024

[WIP] iceberg metadata table #48884

Closed

24 tasks

stephen-shelby closed this Oct 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature] support iceberg metadata table #33186

[Feature] support iceberg metadata table #33186

stephen-shelby commented Oct 19, 2023 •

edited

Loading

starrocks-cr bot Oct 19, 2023

starrocks-cr bot Oct 19, 2023

starrocks-cr bot Oct 19, 2023

starrocks-cr bot Oct 19, 2023

starrocks-cr bot Oct 19, 2023

sonarcloud bot commented Oct 19, 2023

wanpengfei-git commented Oct 19, 2023

wanpengfei-git commented Oct 19, 2023

imay commented Oct 19, 2023

[Feature] support iceberg metadata table #33186

[Feature] support iceberg metadata table #33186

Conversation

stephen-shelby commented Oct 19, 2023 • edited Loading

What type of PR is this:

Checklist:

Bugfix cherry-pick branch check:

starrocks-cr bot Oct 19, 2023

Choose a reason for hiding this comment

starrocks-cr bot Oct 19, 2023

Choose a reason for hiding this comment

starrocks-cr bot Oct 19, 2023

Choose a reason for hiding this comment

starrocks-cr bot Oct 19, 2023

Choose a reason for hiding this comment

starrocks-cr bot Oct 19, 2023

Choose a reason for hiding this comment

sonarcloud bot commented Oct 19, 2023

wanpengfei-git commented Oct 19, 2023

[FE Incremental Coverage Report]

file detail

wanpengfei-git commented Oct 19, 2023

[BE Incremental Coverage Report]

file detail

imay commented Oct 19, 2023

stephen-shelby commented Oct 19, 2023 •

edited

Loading