Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warn when extra columns present in dummy data file #2257

Open
evansd opened this issue Nov 28, 2024 · 0 comments
Open

Warn when extra columns present in dummy data file #2257

evansd opened this issue Nov 28, 2024 · 0 comments

Comments

@evansd
Copy link
Contributor

evansd commented Nov 28, 2024

ehrQL lets you supply your own --dummy-data-file (distinct from --dummy-tables). It ingests this, checks that it meets minimum syntactic constraints and then writes it out in the required output format.

If the dummy data file contains extra columns which aren't in the dataset definition these are ignored. I think this is the desired behaviour, but it would also be helpful to warn the user that these extra columns are being ignored.

This came up in @wjchulme's dummy data workshop.

Implementation notes

I think we'd need to supply some new argument to read_rows() e.g. warn_on_extra_columns=True:

ehrql/ehrql/main.py

Lines 127 to 130 in 95197de

if dummy_data_file:
log.info(f"Reading dummy data from {dummy_data_file}")
reader = read_rows(dummy_data_file, column_specs)
results = iter(reader)

This would need to be threaded through to the reader constructor:

def read_rows(filename, column_specs, allow_missing_columns=False):
extension = get_file_extension(filename)
if extension not in FILE_FORMATS:
raise FileValidationError(f"Unsupported file type: {extension}")
if not filename.is_file():
raise FileValidationError(f"Missing file: {filename}")
reader = FILE_FORMATS[extension][1]
return reader(filename, column_specs, allow_missing_columns=allow_missing_columns)

I think we'd probably want to refactor things slightly here so that the validate_columns() function:

def validate_columns(columns, column_specs, allow_missing_columns=False):
if allow_missing_columns:
required_columns = [
name for name, spec in column_specs.items() if not spec.nullable
]
else:
required_columns = column_specs.keys()
missing = [c for c in required_columns if c not in columns]
if missing:
raise FileValidationError(f"Missing columns: {', '.join(missing)}")

Becomes a _validate_columns(column_names) method on BaseRowsReader which subclasses can then invoke to do the right thing:

class BaseRowsReader:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant