Warn when extra columns present in dummy data file #2257

evansd · 2024-11-28T10:21:15Z

ehrQL lets you supply your own --dummy-data-file (distinct from --dummy-tables). It ingests this, checks that it meets minimum syntactic constraints and then writes it out in the required output format.

If the dummy data file contains extra columns that aren't in the dataset definition, these are ignored. I think this is the desired behaviour, but it would also be helpful to warn the user that these extra columns are being ignored.

This came up in @wjchulme's dummy data workshop.

Implementation notes

I think we'd need to supply some new argument to read_rows() e.g. warn_on_extra_columns=True:

ehrql/ehrql/main.py

Lines 127 to 130 in 95197de

    
    if dummy_data_file:
        log.info(f"Reading dummy data from {dummy_data_file}")
        reader = read_rows(dummy_data_file, column_specs)
        results = iter(reader)

This would need to be threaded through to the reader constructor:

ehrql/ehrql/file_formats/__init__.py

Lines 33 to 40 in 95197de

    
    def read_rows(filename, column_specs, allow_missing_columns=False):
        extension = get_file_extension(filename)
        if extension not in FILE_FORMATS:
            raise FileValidationError(f"Unsupported file type: {extension}")
        if not filename.is_file():
            raise FileValidationError(f"Missing file: {filename}")
        reader = FILE_FORMATS[extension][1]
        return reader(filename, column_specs, allow_missing_columns=allow_missing_columns)

I think we'd probably want to refactor things slightly here so that the validate_columns() function:

ehrql/ehrql/file_formats/base.py

Lines 72 to 81 in 95197de

    
    def validate_columns(columns, column_specs, allow_missing_columns=False):
        if allow_missing_columns:
            required_columns = [
                name for name, spec in column_specs.items() if not spec.nullable
            ]
        else:
            required_columns = column_specs.keys()
        missing = [c for c in required_columns if c not in columns]
        if missing:
            raise FileValidationError(f"Missing columns: {', '.join(missing)}")

becomes a _validate_columns(column_names) method on BaseRowsReader, which subclasses can then invoke to do the right thing:

ehrql/ehrql/file_formats/base.py

Line 8 in 95197de

class BaseRowsReader:
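As a rough illustration of the refactor described above, here is a minimal, self-contained sketch. It is not the actual ehrQL code: `FileValidationError` and `ColumnSpec` are simplified stand-ins, the constructor signature is assumed, `warn_on_extra_columns` is the hypothetical new argument, and returning the list of extra columns is just a convenience for testing.

```python
import logging
from dataclasses import dataclass

log = logging.getLogger(__name__)


class FileValidationError(Exception):
    """Stand-in for ehrQL's exception of the same name."""


@dataclass
class ColumnSpec:
    # Simplified stand-in: only the attribute the validation logic needs
    nullable: bool = True


class BaseRowsReader:
    def __init__(
        self,
        filename,
        column_specs,
        allow_missing_columns=False,
        warn_on_extra_columns=False,  # hypothetical new argument
    ):
        self.filename = filename
        self.column_specs = column_specs
        self.allow_missing_columns = allow_missing_columns
        self.warn_on_extra_columns = warn_on_extra_columns

    def _validate_columns(self, column_names):
        # Same missing-column check as the existing validate_columns()
        if self.allow_missing_columns:
            required_columns = [
                name for name, spec in self.column_specs.items() if not spec.nullable
            ]
        else:
            required_columns = list(self.column_specs.keys())
        missing = [c for c in required_columns if c not in column_names]
        if missing:
            raise FileValidationError(f"Missing columns: {', '.join(missing)}")

        # New behaviour: report, but do not reject, columns outside the spec
        extra = [c for c in column_names if c not in self.column_specs]
        if extra and self.warn_on_extra_columns:
            log.warning(
                "Ignoring extra columns in %s: %s", self.filename, ", ".join(extra)
            )
        return extra
```

Each subclass would call `self._validate_columns(...)` once it has read the column names from its file, so the warning logic lives in one place rather than in every reader.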
