-
Notifications
You must be signed in to change notification settings - Fork 105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CSV validation behaviour is suprising and buggy #401
Comments
I probably face the same issue as @zachary-povey but see much more verbose output (see below). Any pointers how to tackle this is appreciated as this basically renders the CSV functionality of the datacontract-cli kinda useless as in: isn't the whole point of the contract test to catch missing columns or wrongly typed columns (as in if column is an unquoted number To reproduceThe version I run: datacontract --version
0.10.16 My test data: cat test_data.csv
A,B
1,2 The contract test (reducted for brevity): cat datacontract.yaml
servers:
local:
type: local
path: test_data.csv
format: csv
...
models:
my_model:
description: Test Model
type: table
fields:
A:
type: text
description: A Test
required: true
B:
type: text
description: B Test
required: true
C:
type: text
description: B Test
required: true Running the test fails loudly with exceptions: datacontract test --server local datacontract.yaml
Testing datacontract.yaml
ERROR:root:Exception occurred
Traceback (most recent call last):
File ".venv/lib/python3.13/site-packages/datacontract/data_contract.py", line 202, in test
check_soda_execute(run, data_contract, server, self._spark, tmp_dir)
~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".venv/lib/python3.13/site-packages/datacontract/engines/soda/check_soda_execute.py", line 28, in check_soda_execute
con = get_duckdb_connection(data_contract, server, run)
File ".venv/lib/python3.13/site-packages/datacontract/engines/soda/connections/duckdb.py", line 50, in get_duckdb_connection
con.sql(
~~~~~~~^
f"""CREATE VIEW "{model_name}" AS SELECT * FROM read_csv('{model_path}', hive_partitioning=1, columns={columns});"""
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
duckdb.duckdb.InvalidInputException: Invalid Input Error: Error when sniffing file "test_data.csv".
It was not possible to automatically detect the CSV Parsing dialect/types
The search space used was:
Delimiter Candidates: ',', '|', ';', ' '
Quote/Escape Candidates: ['"','"'],['"','\0'],['"','''],['"','\'],[''','\'],['\0','\0']
Comment Candidates: '#', '\0'
Possible fixes:
* Set delimiter (e.g., delim=',')
* Set quote (e.g., quote='"')
* Set escape (e.g., escape='"')
* Set comment (e.g., comment='#')
* Set skip (skip=${n}) to skip ${n} lines at the top of the file
* Enable ignore errors (ignore_errors=true) to ignore potential errors
* Enable null padding (null_padding=true) to pad missing columns with NULL values
* Check you are using the correct file compression, otherwise set it (e.g., compression = 'zstd')
ERROR:root:Invalid Input Error: Error when sniffing file "test_data.csv".
It was not possible to automatically detect the CSV Parsing dialect/types
The search space used was:
Delimiter Candidates: ',', '|', ';', ' '
Quote/Escape Candidates: ['"','"'],['"','\0'],['"','''],['"','\'],[''','\'],['\0','\0']
Comment Candidates: '#', '\0'
Possible fixes:
* Set delimiter (e.g., delim=',')
* Set quote (e.g., quote='"')
* Set escape (e.g., escape='"')
* Set comment (e.g., comment='#')
* Set skip (skip=${n}) to skip ${n} lines at the top of the file
* Enable ignore errors (ignore_errors=true) to ignore potential errors
* Enable null padding (null_padding=true) to pad missing columns with NULL values
* Check you are using the correct file compression, otherwise set it (e.g., compression = 'zstd')
╭────────┬────────────────────┬───────┬────────────────────────────────────────────────────╮
│ Result │ Check │ Field │ Details │
├────────┼────────────────────┼───────┼────────────────────────────────────────────────────┤
│ error │ Test Data Contract │ │ Invalid Input Error: Error when sniffing file │
│ │ │ │ "test_data.csv". │
│ │ │ │ It was not possible to automatically detect the │
│ │ │ │ CSV Parsing dialect/types │
│ │ │ │ The search space used was: │
│ │ │ │ Delimiter Candidates: ',', '|', ';', ' ' │
│ │ │ │ Quote/Escape Candidates: │
│ │ │ │ ['"','"'],['"','\0'],['"','''],['"','\'],[''','\'… │
│ │ │ │ Comment Candidates: '#', '\0' │
│ │ │ │ Possible fixes: │
│ │ │ │ *Set delimiter (e.g., delim=',') │
│ │ │ │* Set quote (e.g., quote='"') │
│ │ │ │ *Set escape (e.g., escape='"') │
│ │ │ │* Set comment (e.g., comment='#') │
│ │ │ │ *Set skip (skip=${n}) to skip ${n} lines at the │
│ │ │ │ top of the file │
│ │ │ │* Enable ignore errors (ignore_errors=true) to │
│ │ │ │ ignore potential errors │
│ │ │ │ *Enable null padding (null_padding=true) to pad │
│ │ │ │ missing columns with NULL values │
│ │ │ │* Check you are using the correct file │
│ │ │ │ compression, otherwise set it (e.g., compression = │
│ │ │ │ 'zstd') │
│ │ │ │ │
╰────────┴────────────────────┴───────┴────────────────────────────────────────────────────╯
🔴 data contract is invalid, found the following errors:
1) Test Data Contract: Invalid Input Error: Error when sniffing file "test_data.csv".
It was not possible to automatically detect the CSV Parsing dialect/types
The search space used was:
Delimiter Candidates: ',', '|', ';', ' '
Quote/Escape Candidates: ['"','"'],['"','\0'],['"','''],['"','\'],[''','\'],['\0','\0']
Comment Candidates: '#', '\0'
Possible fixes:
* Set delimiter (e.g., delim=',')
* Set quote (e.g., quote='"')
* Set escape (e.g., escape='"')
* Set comment (e.g., comment='#')
* Set skip (skip=${n}) to skip ${n} lines at the top of the file
* Enable ignore errors (ignore_errors=true) to ignore potential errors
* Enable null padding (null_padding=true) to pad missing columns with NULL values
* Check you are using the correct file compression, otherwise set it (e.g., compression = 'zstd') Expected behaviourI was expecting something like this: Testing datacontract.yaml
╭────────┬────────────────────────────────────────────────┬───────┬─────────╮
│ Result │ Check │ Field │ Details │
├────────┼────────────────────────────────────────────────┼───────┼─────────┤
│ passed │ Check that field A is present │ │ │
│ passed │ Check that field B is present │ │ │
│ error │ Check that field C is present │ │ │
│ passed │ Check that required field A has no null values │ A │ │
│ passed │ Check that required field B has no null values │ B │ │
╰────────┴────────────────────────────────────────────────┴───────┴─────────╯
🔴 data contract is invalid, found the following errors:
Field C is missing The same happens when the CSV file contains columns that are not covered by the contract. To me it looks like DuckDB is rejecting the data before the data hits the contract tests, or am I wrong? Any pointers where this can be tackled in the code (I'm very new to the project). Thanks! |
Overview
Hi, so I've come to datacontract-cli with the hope of replacing our current piecemeal data import validation and as a way to formally define our team's expectations regarding data we receive (I realise that, in theory, contracts should be owned by data producers, but that's not the reality at my org).
One of our use cases, that the docs seem to suggest is supported, doesn't really work with this cli as far as I can tell: validating an example CSV file against the model schema.
I understand that this is probably because in reality, csv files have no proper types, so validating them against the schema is not straightforward. However, this kind of validation is a valid use case* and from the docs, it looks like something the cli would do: it says it supports csv and that it validates the schema as part of the test command.
* Something we currently do using frictionless.
Detailed Examples
I've attached a zip file containing a datacontract yaml and some example files to demonstrate my point, with an equivalent example for the same data in a json format, highlighting how the behaviour is different and therefore surprising given the difference is not documented.
datacontract_csv_validation_github_issue.zip
Bad Data Type
In this example I have simply set the
integer_field
to be equal to a string value. Running the test against thebad_data_type_json
server gives exactly the result we would expect:However when we run test for
bad_data_type_csv
we get this:On inspection, it appears that for csv files the cli basically just checks all the fields are there during a test, but does not actually check the values.
Missing Field
Given the above result, I thought I'd check out if the validation would at least therefore pick up missing fields in my CSV so I set up servers for json/csv files where the
integer_field
was missing completely.Again when running against
missing_field_json
i got what I was expecting:But when running against
missing_field_csv
, I got what appears to be an unhandled error:(I think there is also a slight subtlety here with nulls vs actually missing keys/columns but I'll avoid getting into that for brevity).
Summary
My basic question is: is schema validation of csv data ever likely to be in scope of this tool? If not, I can understand to some extent taking the line that csv validation against a tech-agnostic schema can't really go much further than checking columns exist without getting messy, but that does limit the usefulness of it as a tool and I think needs making more explicit in the documentation.
The text was updated successfully, but these errors were encountered: