Always use an auto-generated doc values as a back-up for Avro doc-related metadata retrieval. #377

rulle-io · 2021-09-01T20:20:44Z

This PR is meant to be a solution for issue #579.

It also makes the schema generation process less dependent on a user-provided schema and more fault-tolerant.

Current implementation

dbeam always generates doc-related properties for an Avro schema based on input parameters and ResultSet values.
Optionally, a user can provide a custom "handwritten" schema.
A user-provided schema is only used for Avro doc values.
Thus, field names, types and type lengths are taken from the auto-generated schema.

Drawback(s)

One drawback of this behaviour is that when a new field appears in a DB table, and consequently in the source SQL ResultSet (e.g. when SELECT * is used), and the user-provided schema doesn't contain this field, the process throws an error.

Solution

dbeam's auto-generated schema is always used as a back-up when the user-provided schema doesn't contain the field in question.
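
To illustrate the fallback, here is a minimal sketch in plain Java. It is not dbeam's actual code: the class `DocFallbackSketch`, the methods `generatedDoc` and `resolveDoc`, the wording of the generated doc string, and the use of a `Map` as a stand-in for the docs parsed from a user-provided schema are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DocFallbackSketch {

  // Hypothetical stand-in for the doc that is derived from ResultSet metadata.
  // The exact wording dbeam generates is not reproduced here.
  static String generatedDoc(String columnName, String sqlTypeName) {
    return "From SQL column " + columnName + " of type " + sqlTypeName;
  }

  // Prefer the user-provided doc; fall back to the generated doc when the
  // user-provided schema has no entry for the field.
  static String resolveDoc(Map<String, String> userDocs, String columnName, String sqlTypeName) {
    String userDoc = userDocs.get(columnName);
    return userDoc != null ? userDoc : generatedDoc(columnName, sqlTypeName);
  }

  public static void main(String[] args) {
    // Docs taken from a (hypothetical) user-provided schema: only "id" is described.
    Map<String, String> userDocs = Map.of("id", "Primary key of the customer");

    // Columns as they might come back from SELECT *; "created_at" is a new column
    // that the user-provided schema does not know about.
    Map<String, String> columns = new LinkedHashMap<>();
    columns.put("id", "BIGINT");
    columns.put("created_at", "TIMESTAMP");

    columns.forEach((name, type) ->
        System.out.println(name + ": " + resolveDoc(userDocs, name, type)));
  }
}
```

In this sketch the new column does not cause a failure; it simply receives the generated description, while fields that the user did describe keep their handwritten docs.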

Additional use-case

An unplanned positive side-effect is that one can use a user-provided schema as a dictionary of descriptions (docs) for various fields, so one schema file can be used for multiple tables. We are going to use this side-effect.

"Unit tests are included"

Checklist for PR author(s)

Changes are covered by unit tests (no major decrease in code coverage %) and/or integration tests.
Ensure code formatting (use mvn com.coveo:fmt-maven-plugin:format org.codehaus.mojo:license-maven-plugin:update-file-header)
Document any relevant additions/changes in the appropriate spot in javadocs/docs/README.

codecov · 2021-09-01T20:22:00Z

Codecov Report

Merging #377 (7bd0191) into master (2646c35) will increase coverage by 0.42%.
The diff coverage is 92.75%.

@@             Coverage Diff              @@
##             master     #377      +/-   ##
============================================
+ Coverage     91.47%   91.90%   +0.42%     
- Complexity      243      258      +15     
============================================
  Files            26       27       +1     
  Lines           927      963      +36     
  Branches         67       71       +4     
============================================
+ Hits            848      885      +37     
+ Misses           52       50       -2     
- Partials         27       28       +1

rulle-io · 2021-09-16T21:48:04Z

@labianchin

labianchin · 2022-03-07T14:11:40Z

Hi. Sorry it took me a while to get here, as I am putting little time into this project...

Is this PR still relevant? It has some conflicts with the just merged #380.

If so, can you elaborate a bit further on the need for these changes? Specifically: what do we mean by "more fault-tolerant"? And what problem does "less dependant on a user-supplied schema" solve?

…en supplied schema (expected data format) and an actual data format, returned by a SQL query. Reorganize some code to make locations more logical. Always use generated Avro schema. Optional user provided schema used for `doc` fields retrieval.

rulle-io · 2022-03-13T14:36:15Z

Updated the description.

rulle-io changed the title ~~Add initial supplied schema validation to prevent inconsistency betwe…~~ Change to use user-provided for doc-related data only. Sep 16, 2021

rulle-io changed the title ~~Change to use user-provided for doc-related data only.~~ Change to use user-provided scheam for doc-related info retrieval only. Sep 16, 2021

rulle-io changed the title ~~Change to use user-provided scheam for doc-related info retrieval only.~~ Change to use user-provided schema for doc-related info retrieval only. Sep 16, 2021

rulle-io changed the title ~~Change to use user-provided schema for doc-related info retrieval only.~~ Use user-provided schema for doc-related info retrieval only. Sep 18, 2021

rulle-io force-pushed the supplied_schema_validation branch from d232488 to bcd97b7 Compare February 11, 2022 16:55

Ruslan Altynnikov added 2 commits March 12, 2022 11:18

Changes after rebase.

12989f1

rulle-io force-pushed the supplied_schema_validation branch from bcd97b7 to 12989f1 Compare March 12, 2022 10:47

rulle-io changed the title ~~Use user-provided schema for doc-related info retrieval only.~~ Always use an auto-generated schema as a back-up for Avro doc-related metadata retrieval. Jan 18, 2023

rulle-io changed the title ~~Always use an auto-generated schema as a back-up for Avro doc-related metadata retrieval.~~ Always use an auto-generated doc values as a back-up for Avro doc-related metadata retrieval. Jan 18, 2023

Ruslan Altynnikov added 4 commits January 19, 2023 00:45

Add a better (full-range) test.

057b4ab

Fix a couple of code-style warnings.

9b11910

Update avro doc field precedence rules.

b4e5a38

Add more tests and updte docs.

Add printout of SQL ResultSet metadata (raw fields descriptions retur…

7bd0191

…n by a SQL result).

rulle-io linked an issue Feb 6, 2023 that may be closed by this pull request

Incorrect user-supplied Avro schema (--avroSchemaFilePath) causes dbeam to produce invalid avro files. #579

Open

rulle-io requested a review from labianchin May 16, 2023 21:43
