Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] support iceberg metadata table #33186

Closed

Conversation

stephen-shelby
Copy link
Contributor

@stephen-shelby stephen-shelby commented Oct 19, 2023

Fixes #issue
support following sql:
select * from table_name$snapshots
select * from table_name$manifests
select * from table_name$files

  1. select * from orders_test$snapshots order by committed_at desc limit 1 \G
    *************************** 1. row ***************************
    committed_at: 2023-10-20 01:42:23
    snapshot_id: 4924636944729424676
    parent_id: 7219695656029995314
    operation: append
    manifest_list: hdfs://e/orders_test-f70fff62928f4257860d7/metadata/snap-4924636944729424676-1.avro
    summary: {"added-data-files":"2406","added-records":"1500000","added-files-size":"39169690","changed-partition-count":"2406","total-records":"1500000","total-files-size":"39169690","total-data-files":"2406","total-delete-files":"0","total-position-deletes":"0","total-equality-deletes":"0"}

  2. select * from orders_test$manifests\G
    *************************** 1. row ***************************
    path: hdfs://ss/metadata/5d6e9209-07dc-435d-bb28-67e058c101a6-m0.avro
    length: 252203
    partition_spec_id: 0
    added_snapshot_id: 4924636944729424676
    added_data_files_count: 2406
    added_rows_count: 1500000
    existing_data_files_count: 0
    existing_rows_count: 0
    deleted_data_files_count: 0
    deleted_rows_count: 0
    partition_summaries: [{"contains_null":false,"contains_nan":false,"lower_bound":"1992-01-01","upper_bound":"1998-08-02"},{"contains_null":false,"contains_nan":false,"lower_bound":"0","upper_bound":"0"}]

  3. select * from orders_test$files limit 1\G
    *************************** 1. row ***************************
    content: 0
    file_path: hdfs://xx/data/o_orderdate=1992-12-15/o_shippriority=0/xxxxx.orc
    file_format: ORC
    spec_id: 0
    record_count: 628
    file_size_in_bytes: 16650
    column_sizes:
    value_counts: {"1":628,"2":628,"3":628,"4":628,"5":628,"6":628,"7":628,"8":628,"9":628}
    null_value_counts: {"1":0,"2":0,"3":0,"4":0,"5":0,"6":0,"7":0,"8":0,"9":0}
    nan_value_counts:
    lower_bounds: {"1":"11552","2":"386","3":"F","4":"981","5":"1992-12-15","6":"1- URGENT","7":"Clerk#000000002","8":"0","9":" Tiresias lose b"}
    upper_bounds: {"1":"5962919","2":"149743","3":"F","4":"460642","5":"1992-12-15","6":"5-LOW","7":"Clerk#000000999","8":"0","9":"yly. pending, ew"}
    split_offsets:
    equality_ids:

  4. find max value of partition column
    select max(get_json_string(partition_summaries, '$[0].upper_bound')) from orders_test$manifests;
    +---------------------------------------------------------------+
    | max(get_json_string(partition_summaries, '$[0].upper_bound')) |
    +---------------------------------------------------------------+
    | 1998-08-02 |
    +---------------------------------------------------------------+

  5. find min value of non-partition column
    select min(get_json_string(lower_bounds, '$.2')) from orders_test$files;
    +-------------------------------------------+
    | min(get_json_string(lower_bounds, '$.2')) |
    +-------------------------------------------+
    | 1 |
    +-------------------------------------------+

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.2
    • 3.1
    • 3.0
    • 2.5

Conversions.fromByteBuffer(idToTypeMapping.get(entry.getKey()), entry.getValue()))));
return GsonUtils.GSON.toJson(values);
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are some suggestions for the given code:

  1. Error Handling: In the getFiles method, the CloseableIterator<FileScanTask> fileScanTaskIterator is closed inside a try-catch with a catch block that ignores any IOException. Consider at least logging the error message or almost certainly handling it in some way since failing to close resources can lead to memory leaks.

  2. Conciseness and Readability: Some methods (e.g., getLowerBounds and getUpperBounds) seem almost duplicated; you could consider refactoring them into one generic method to make the code more readable and reduce redundancy.

  3. Testing: From an overall perspective, this class and its methods seem complex. Depending on your testing strategy, many edge cases may need to be handled, so ensure appropriate unit tests are created wherever required.

  4. Commenting: Although the naming conventions you used provide some context about what your code is supposed to do, adding comments for non-trivial sections of the code would improve readability.

  5. Null Checking: Check for null values where relevant before invoking methods. An instance is where fileScanTaskIterator.next() is called. If the iterator does not have a next element, it could throw a NoSuchElementException. You're using fileScanTasks.hasNext() which should prevent this but the next line uses fileScanTaskIterator.next().

Please note that these suggestions might depend on the bigger context and coding guidelines you follow in your project.

return fill_chunk(chunk);
}

} // namespace starrocks
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code seems generally well-structured and follows good practices. I do have a few suggestions, however:

  1. Error messages could be more descriptive: In functions like get_next() or start(), instead of just mentioning 'Used before initialized' or 'input pointer is null' you can provide the function name as well to make debugging easier.

  2. Avoid magic numbers: You might consider defining constants for each case in fill_chunk(ChunkPtr* chunk). Magic numbers like 1, 2, 3...14 are not immediately clear on what they signify.

  3. Repeated Code: Inside your fill_chunk function's switch statement, the cases where you're dealing with string fields look repetitive. Consider creating a helper method to handle varchar types where you just pass the file attribute.

  4. Addressing possible empty strings: For varchar fields, it's a common practice to first check if the string is not empty before proceeding with other operations. Consider adding this logic to further bullet-proof your code.

  5. Variable Naming, Explicitness: In the start() function, _files_index = 0; would make better sense just after _files_res has been loaded. Also, using meaningful naming (like current_file_index) could improve readability.

  6. Exception Handling: Consider adding exception handling or checking where RETURN_IF_ERROR macro is used. You may want to handle exceptions depending on the seriousness of the business impact.

Again, these are slight improvements that can be made over already good quality code.

return fill_chunk(chunk);
}

} // namespace starrocks
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code you've presented seems to be well-structured and formatted, which is good. Here are a few suggestions that may enhance the quality and readablility of the code:

  1. Avoid using namespace in headers: If any headers [implicitly] used have a using namespace statement, it's better to avoid or remove them, since it can lead to name conflicts.

  2. Put magic numbers into a constant: In the method fill_chunk(ChunkPtr* chunk), the switch case statements use numerical values from 1 to 11. These could be replaced with descriptive enumeration or constants for better readability.

  3. Redundant code blocks: There are redundant brackets {...} in the cases of the switch statement in the method fill_chunk(). The extra scope created by these brackets is not necessary and can therefore be removed, enhancing overall readability.

  4. Error message context: When returning an error status such as Status::InternalError("Used before initialized."), you might want to add context to help with debugging, for example, including the variable name or function where the error originated.

  5. Commenting: Although your methods likely have self-explanatory names, adding comments explaining what they do can improve readability.

  6. Use safer C++ casts: Instead of using C-style casting (void*) , use static_cast<void*>. They're safer because C++ style casts allow the compiler to check type information when the cast is performed.

  7. Check for null pointers: Where applicable, you could check for nullity before dereferencing pointers. This could prevent potential segmentation faults.

  8. Safe division: Be aware when performing the division when initializing SchemaScanner. Ensure the denominator isn't zero.

  9. Input validation: Validate inputs where reasonable, throw or return meaningful exceptions or errors back to the user. For example, in the start() method before accessing _param->catalog, validate if _param is a valid pointer.

  10. Potential memory leak: If you've allocated memory during certain operations but didn't release it properly when you're done, a memory leak could occur. It's hard to tell with the given code, so make sure you handle that correctly if you've reserved memory on the heap somewhere outside of the provided code.

Note: Some recommendations might seem not critical because the context or the overall project structure is unknown. The code you provided only contains part of a class implementation; suggestions might change depending on the missing parts or the overall project conventions and standards.
Review the suggestions considering the specificity of your software project.

return fill_chunk(chunk);
}

} // namespace starrocks
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, the code is quite clear and easy to understand. However, there are a few improvements and suggestions that can be made:

  1. Magic Numbers: Instead of using magic numbers directly in the fill_chunk method (e.g., case 1, case 2, etc.), you can use named constants. This can increase readability and maintainability.

  2. Error Messages: The error messages could be more descriptive. For instance, "IP or port doesn't exist" can specify which one doesn't exist.

  3. Null Check: In the start function, there is an extensive usage of nullptr checks before setting the request parameters. If _param is really valid, consider refactoring the null check into a separate validation method, checking all fields at once.

  4. Comments: While the current comments provide some information (such as for committed_at), it would be helpful to have comments for other parts of the code as well, especially for complex/important calculations or logic.

  5. Code Duplication: There's some repeated code involving creating a Slice object and filling the column with its data. You might be able to make this a single function call with the necessary parameters passed in.

  6. Default Function Statements: The declaration SchemaIcebergSnapshotsScanner::~SchemaIcebergSnapshotsScanner() = default; could be removed if it's not doing anything beyond the compiler-generated destructor.

  7. Namespace Usage: Consider using explicit namespace (starrocks::TypeName) instead of using clause (using namespace starrocks). It better communicates where identifiers come from and also prevents symbol clashing.

  8. Consistent Naming Convention: Ensure that the naming convention is consistent across the entire project. For example, some function names are 'snake_case' while others are 'camelCase'.

The possibility of implementing these changes depends on the wider context of your code base and the standards of your specific project.

}
return res.toString();
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this code is well-written with clear responsibilities divided into methods. However, there are a few potential improvements you could make:

  1. Avoid Magic Numbers: In the getPartitionSummaries method, consider avoiding the use of the magic number i in your loop and opt instead for a more descriptive variable name tailored to its function.

    for (int partitionIndex = 0; partitionIndex < summaries.size(); partitionIndex++) {..}
  2. Use Optional efficiently:

    • Instead of having an explicit condition check if (!icebergTable.getSnapshot().isPresent()) { return res; }, you could benefit from using Optional.ifPresent.
        icebergTable.getSnapshot().ifPresent(snapshot -> {
            List<ManifestFile> manifestFiles = snapshot.allManifests(icebergTable.getNativeTable().io());
            // Rest of logic here...
        });
  3. Ensure Thread Safety: If multiple threads can access or modify ConnectorTableId.CONNECTOR_ID_GENERATOR, ensure that it’s thread-safe. Undefined behavior or data races could happen if one thread modifies the CONNECTOR_ID_GENERATOR while another uses it at the same time.

  4. Error Handling: How does your code handle exceptions? Currently there's no try-catch blocks or documentations to indicate what exceptions may be thrown and how they are handled. Pay particular attention to possible null values, missing fields, I/O operations, file not found errors, etc.

  5. Code Comments: While the code is generally clean, providing comments giving a brief rundown on what certain sections of the code do would increase maintainability and readability. This is especially important in the getManifests() and getPartitionSummaries() methods which contain some potentially complex logic.

  6. Immutability: tIcebergManifests in getManifests() can be made final to enhance code safety.

     final List<TIcebergManifest> tIcebergManifests = new ArrayList<>();

Do remember, these are suggestions and practices which might enhance the maintainability, readability, and quality of your code in general. The current code you provided is organized pretty well and does not seem to have any major issues at first glance.

@sonarcloud
Copy link

sonarcloud bot commented Oct 19, 2023

SonarCloud Quality Gate failed.    Quality Gate failed

Bug C 1 Bug
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 49 Code Smells

0.0% 0.0% Coverage
2.5% 2.5% Duplication

idea Catch issues before they fail your Quality Gate with our IDE extension sonarlint SonarLint

@wanpengfei-git
Copy link
Collaborator

[FE Incremental Coverage Report]

😞 fail : 30 / 239 (12.55%)

file detail

path covered_line new_line coverage not_covered_line_detail
🔵 com/starrocks/service/FrontendServiceImpl.java 0 19 00.00% [2710, 2711, 2712, 2713, 2714, 2716, 2721, 2722, 2723, 2724, 2725, 2727, 2732, 2733, 2734, 2735, 2736, 2737, 2739]
🔵 com/starrocks/server/MetadataMgr.java 0 2 00.00% [300, 301]
🔵 com/starrocks/connector/CatalogConnectorMetadata.java 0 2 00.00% [116, 117]
🔵 com/starrocks/connector/iceberg/FilesTable.java 0 73 00.00% [52, 70, 71, 72, 76, 77, 80, 84, 88, 89, 90, 91, 92, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 112, 113, 115, 116, 118, 119, 121, 122, 124, 125, 127, 128, 130, 131, 133, 134, 137, 138, 140, 142, 146, 147, 148, 149, 150, 154, 155, 156, 157, 159, 162, 163, 164, 166, 167, 168, 169, 173, 174, 175, 177, 178, 179, 180]
🔵 com/starrocks/connector/iceberg/SnapshotsTable.java 0 26 00.00% [38, 48, 49, 50, 54, 55, 58, 62, 66, 67, 68, 69, 70, 71, 72, 73, 75, 76, 78, 79, 81, 82, 84, 85, 86, 87]
🔵 com/starrocks/connector/iceberg/ManifestsTable.java 0 46 00.00% [42, 57, 58, 59, 63, 64, 67, 71, 75, 76, 77, 78, 79, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 114, 115, 116, 119, 121]
🔵 com/starrocks/connector/ExternalMetadataTable.java 0 2 00.00% [19, 23]
🔵 com/starrocks/connector/ConnectorMetadata.java 0 1 00.00% [116]
🔵 com/starrocks/sql/plan/PlanFragmentBuilder.java 1 4 25.00% [1334, 1335, 1336]
🔵 com/starrocks/planner/SchemaScanNode.java 4 12 33.33% [143, 147, 148, 151, 155, 156, 263, 266]
🔵 com/starrocks/connector/iceberg/IcebergMetadata.java 7 19 36.84% [208, 227, 236, 237, 238, 239, 242, 243, 245, 247, 249, 251]
🔵 com/starrocks/connector/iceberg/IcebergTableName.java 13 28 46.43% [39, 47, 52, 56, 57, 59, 60, 61, 62, 64, 70, 79, 80, 81, 82]
🔵 com/starrocks/connector/iceberg/IcebergTableType.java 5 5 100.00% []

@wanpengfei-git
Copy link
Collaborator

[BE Incremental Coverage Report]

😞 fail : 0 / 241 (00.00%)

file detail

path covered_line new_line coverage not_covered_line_detail
🔵 src/exec/schema_scanner/schema_helper.cpp 0 10 00.00% [216, 220, 221, 222, 225, 229, 230, 231, 234, 238]
🔵 src/exec/schema_scanner/schema_iceberg_manifests_scanner.cpp 0 72 00.00% [37, 39, 41, 43, 44, 45, 46, 47, 49, 50, 52, 53, 55, 56, 58, 60, 61, 64, 65, 66, 67, 68, 69, 70, 73, 74, 75, 79, 81, 82, 84, 86, 87, 89, 91, 92, 94, 96, 97, 99, 101, 102, 104, 106, 107, 109, 111, 112, 114, 116, 117, 119, 121, 122, 124, 127, 128, 129, 133, 134, 137, 138, 141, 142, 143, 145, 146, 147, 149, 150, 152, 153]
🔵 src/exec/schema_scanner.cpp 0 6 00.00% [175, 176, 177, 178, 179, 180]
🔵 src/exec/schema_scanner/schema_iceberg_snapshots_scanner.cpp 0 59 00.00% [32, 34, 36, 38, 39, 40, 41, 42, 44, 45, 47, 48, 50, 51, 53, 55, 56, 59, 60, 61, 62, 63, 64, 65, 68, 69, 70, 74, 76, 77, 79, 81, 82, 84, 87, 88, 89, 93, 96, 97, 98, 102, 105, 106, 107, 111, 112, 115, 116, 119, 120, 121, 123, 124, 125, 127, 128, 130, 131]
🔵 src/exec/pipeline/scan/schema_scan_context.cpp 0 4 00.00% [118, 119, 121, 122]
🔵 src/exec/schema_scanner/schema_iceberg_files_scanner.cpp 0 90 00.00% [40, 41, 43, 45, 46, 47, 48, 49, 51, 52, 54, 55, 57, 58, 59, 61, 63, 64, 67, 68, 69, 70, 71, 72, 73, 75, 76, 78, 81, 82, 83, 87, 90, 91, 92, 96, 98, 99, 101, 103, 104, 106, 108, 109, 111, 114, 115, 116, 120, 123, 124, 125, 129, 132, 133, 134, 138, 141, 142, 143, 147, 150, 151, 152, 156, 159, 160, 161, 165, 168, 169, 170, 174, 177, 178, 179, 183, 184, 187, 188, 191, 192, 193, 195, 196, 197, 199, 200, 202, 203]

@imay
Copy link
Contributor

imay commented Oct 19, 2023

@stephen-shelby
I think we need to parse the meta file from Backend directly other than reading it from Frontends.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants