- Purpose
- Structure
- Dependencies
- Generating Clean Files
- Generating Test Files
- Generating Large Files
- Generating Edit Reports
- Data Generation Notes
This repository contains code used to generate synthetic Loan/Application Register (LAR) and Transmittal Sheet (TS) files. It supports file creation for the 2018 and 2019 collection years.
Two types of files can be created: clean files and test files. Clean files pass all edit checks in the FIG for the relevant year, while each test file fails the edit named in its file name. Test files may also fail some additional edits; this is known behavior.
Each year listed in the parent directory contains its own codebase for creating test files. Each year corresponds to a HMDA collection year. Test files are year-specific due to changes in the HMDA FIG.
- 2018/python and 2019/python contain the notebooks and Python scripts used to generate both clean and test files.
- 2018/schemas and 2019/schemas contain schemas for the LAR and TS in JSON format. These schemas are taken from the 2018 HMDA FIG and the 2019 HMDA FIG.
- 2018/dependencies and 2019/dependencies contain supplemental data files used in the generation of clean and test files:
  - Relevant FFIEC Census data (see this repo for more information)
  - A file containing a list of US ZIP codes
- Python 3.5 or greater
- Jupyter Notebooks:
pip3 install jupyter
- Pandas:
pip3 install pandas
- Other required Python libraries can be installed with
pip3 install -r requirements.txt
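Before running the generators, it can help to confirm that the environment matches the list above. The following is a minimal sketch, not part of the repository: `python_ok` and `missing_modules` are hypothetical helper names, and treating `notebook` as the import name behind Jupyter Notebooks is an assumption that should be checked against requirements.txt.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 5)  # minimum version stated in the dependency list
# Import names to probe. "notebook" as the module behind Jupyter Notebooks
# is an assumption; adjust the list to match requirements.txt.
REQUIRED_MODULES = ["pandas", "notebook"]


def python_ok(version_info=None, minimum=MIN_PYTHON):
    """Return True when the interpreter version meets the minimum."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    if not python_ok():
        print("Python {}.{} or greater is required".format(*MIN_PYTHON))
    for name in missing_modules(REQUIRED_MODULES):
        print("Missing module: {}".format(name))
```

The sketch uses `str.format` rather than f-strings so that it still runs on Python 3.5.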
Clean files are used as the base for generating files that will fail edits. Running the following scripts will create the edits_files directory and a data file that will pass the HMDA edit checks. The number of rows in the file is set in the YAML clean file configuration for each directory. Other variables, such as data ranges, can also be set in the configuration files.
Configuration values for clean files can be changed using the following:
- 2021 Clean File Configuration
- 2020 Clean File Configuration
- 2019 Clean File Configuration
- 2018 Clean File Configuration
Additional configuration options are available in the configuration folders by year.
For 2019, 2020, and 2021:
- Navigate to the <year>/python directory.
- Run:
python3 generate_clean_files.py
- The clean test file will be created with the following path:
{year}/edits_files/{bank name}/clean_files/{Bank Name}_clean_{row count}.txt
For 2018:
- Navigate to the 2018/python directory.
- Run:
python3 generate_2018_clean_files.py
- The clean test file will be created in a new edits_files directory under 2018/edits_files/clean_files/{Bank Name}/ with the filename clean_file_{Number of Rows}_{Bank Name}.txt
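The two clean file path templates above differ between 2018 and later years. The sketch below shows how they could be filled in. `clean_file_path` is a hypothetical helper, not a function from the repository, and it assumes that `{bank name}` and `{Bank Name}` refer to the same value.

```python
import posixpath


def clean_file_path(year, bank_name, row_count):
    """Build the expected clean file path for a collection year.

    Assumes the directory and file name placeholders for the bank
    take the same value.
    """
    year = str(year)
    if year == "2018":
        # 2018 layout: edits_files/clean_files/{Bank Name}/clean_file_{rows}_{Bank Name}.txt
        filename = "clean_file_{}_{}.txt".format(row_count, bank_name)
        return posixpath.join("2018", "edits_files", "clean_files", bank_name, filename)
    # Later years: edits_files/{bank name}/clean_files/{Bank Name}_clean_{rows}.txt
    filename = "{}_clean_{}.txt".format(bank_name, row_count)
    return posixpath.join(year, "edits_files", bank_name, "clean_files", filename)


print(clean_file_path(2019, "Bank1", 1000))
print(clean_file_path(2018, "Bank1", 1000))
```

`posixpath` is used so the output matches the forward-slash notation of the templates on any platform.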
The generation of edit test files requires a clean data file to be present. The steps above outline the process to create the clean data files.
Test files will be created using a clean file of the length specified in the file_length
value of the clean file configuration.
Test files will be written to subdirectories based on the type of edit they fail:
edits_files/{bank name}/test_files/{edit type}/{bank name}_{edit name}_{row count}.txt
Existing test files of the same length will be overwritten. These filepaths can be changed in the test filepaths configuration.
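As an illustration of the rules above, the following sketch builds a test file path from the template and checks that the required clean file exists before any test files are generated. Both function names are hypothetical, and the edit name `S300` and edit type `syntax` are example values only.

```python
import os


def build_test_file_path(bank_name, edit_type, edit_name, row_count, root="edits_files"):
    """Fill in the test file path template for one edit."""
    filename = "{}_{}_{}.txt".format(bank_name, edit_name, row_count)
    return os.path.join(root, bank_name, "test_files", edit_type, filename)


def require_clean_file(path):
    """Raise if the clean file that test files are derived from is missing."""
    if not os.path.isfile(path):
        raise FileNotFoundError(
            "Clean file not found: {}. Generate clean files first.".format(path)
        )
    return path


print(build_test_file_path("Bank1", "syntax", "S300", 1000))
```

Because the file name depends only on the bank name, edit name, and row count, regenerating with the same values produces the same path, which is why existing files of the same length are overwritten.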
To create test files for 2019, 2020, and 2021:
- Navigate to the <year>/python directory.
- Run:
python3 generate_error_files.py
To create test files for 2018:
- Navigate to the 2018/python directory.
- Run:
python3 generate_2018_error_files.py
The syntax, validity, and quality edit test files will be created in the following directories:
- Syntax: {year}/edits_files/test_files/{Bank Name}/syntax
- Validity: {year}/edits_files/test_files/{Bank Name}/validity
- Quality: {year}/edits_files/test_files/{Bank Name}/quality
- Quality (Adjusted to pass syntax and validity): {year}/edits_files/test_files/{Bank Name}/quality_pass_s_v
Due to the code design and the edit rules for LAR data, generating large synthetic data files was time prohibitive. The large file generation script takes a different approach: it starts from a clean base file and copies rows until the desired file size is reached.
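The row-copying idea can be sketched in a few lines. This is not the repository's script: `expand_rows` is a hypothetical helper that treats each row as an opaque string, and it does not show any rewriting of row identifiers that a real large file might need.

```python
from itertools import cycle, islice


def expand_rows(base_rows, target_count):
    """Repeat base_rows in order until exactly target_count rows exist."""
    if not base_rows:
        raise ValueError("base_rows must contain at least one row")
    if target_count < 0:
        raise ValueError("target_count must not be negative")
    # cycle() loops over the base rows endlessly; islice() stops at the target.
    return list(islice(cycle(base_rows), target_count))


base = ["row-1", "row-2", "row-3"]  # stand-in for rows read from a clean file
large = expand_rows(base, 7)
print(large)
```

Copying rows avoids re-running the edit-aware generation logic for every record, so the cost grows with the target size rather than with the complexity of the edit rules.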
To generate large files for 2019, 2020, and 2021:
- Navigate to the <year>/python directory.
- Run:
python3 generate_large_files.py
- To set the large file size for 2019, edit the large_file_write_length value in the clean configuration. To set the base file used to create large files, edit the large_file_base_length value in the clean configuration.
- To set the large file size for 2020 and 2021, edit the large_file_write_length value in the 2020 large configuration or 2021 large configuration. To set the base file used to create large files, edit the large_file_base_length value in the 2020 large configuration or 2021 large configuration.
  - For 2020 and 2021, the large_file_base_length value in large_file_config.yaml should correspond with the file_length value in bank1_config.yaml, because the generated clean file is the base for generating the large file and the filenames correspond with record numbers.
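The consistency requirement between the two configuration files for 2020 and 2021 can be checked before running the script. The sketch below uses plain dictionaries as stand-ins for the parsed YAML files; it assumes both keys sit at the top level of their files, which may not match the real layout. `check_base_length` is a hypothetical helper.

```python
def check_base_length(large_config, bank_config):
    """Confirm large_file_base_length matches the clean file's file_length.

    Both arguments are dictionaries standing in for parsed YAML; the
    top-level key layout is an assumption.
    """
    base_length = large_config["large_file_base_length"]
    file_length = bank_config["file_length"]
    if base_length != file_length:
        raise ValueError(
            "large_file_base_length ({}) does not match file_length ({})".format(
                base_length, file_length
            )
        )
    return base_length


# Stand-ins for large_file_config.yaml and bank1_config.yaml (values are examples).
large_config = {"large_file_base_length": 1000, "large_file_write_length": 50000}
bank_config = {"file_length": 1000}
print(check_base_length(large_config, bank_config))
```

If PyYAML is installed, the dictionaries could instead come from `yaml.safe_load` applied to each opened file.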
Note: the 2018 process is different from the 2019 process. To generate large files for 2018:
- Navigate to the 2018/python directory.
- Adjust the 2018 File Large File Script Configuration to specify bank name, LEI, tax ID, row count, output filepath, and output filename.
- Run the following to produce the large file:
python3 large_test_files_script.py
Edit reports provide a summary of the syntax, validity, or quality edits passed or failed in a test submission file. The edit report contains the following fields.
- edit name
- status (pass/fail)
- number of rows failed
- ULIs/NULIs of rows that failed (as a list).
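To make the report fields concrete, the following sketch assembles one report entry per edit from a mapping of edit names to failing row identifiers. The dictionary keys, the function names, and the sample identifiers are illustrative assumptions; the repository's actual report format may differ.

```python
def summarize_edit(edit_name, failed_ids):
    """Build one report entry with the four fields listed above."""
    failed = list(failed_ids)
    return {
        "edit_name": edit_name,
        "status": "fail" if failed else "pass",
        "rows_failed": len(failed),
        "failed_ids": failed,  # ULIs/NULIs of the rows that failed
    }


def build_report(results):
    """results maps an edit name to the ULIs/NULIs of rows failing that edit."""
    return [summarize_edit(name, ids) for name, ids in sorted(results.items())]


# Sample input with made-up identifiers.
results = {"V600": ["EXAMPLE-ULI-1", "EXAMPLE-ULI-2"], "S300": []}
for entry in build_report(results):
    print(entry)
```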
Edit reports can be generated for any synthetic submission file. Configuration options, with their default values, are set in the Edit Report Configuration.
To generate edit reports for 2019 and 2020:
- Navigate to the <year>/python directory.
- Adjust the Edit Report Configuration to specify output.
- Run the following to produce the edit report in the directory specified by the configuration file:
python3 generate_edit_report.py
To generate edit reports for 2018:
- Navigate to the 2018/python directory.
- Adjust the 2018 Edit Report Configuration to specify output.
- Run the following to produce the edit report in the directory specified by the configuration file:
python3 generate_edit_report.py
The default values for Bank0 are listed below.
- Name: Bank0
- LEI: B90YWS6AFX2LGWOXJ1LD
- Tax ID: 01-0123456
The default values for Bank1 are listed below.
- Name: Bank1
- LEI: BANK1LEIFORTEST12345
- Tax ID: 02-1234567
Other test bank LEIs:
- BANK3LEIFORTEST12345
- BANK4LEIFORTEST12345
- 999999LE3ZOZXUS7W648
- 28133080042813308004
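The identifiers listed above share a simple shape: each LEI is 20 uppercase alphanumeric characters and each tax ID is two digits, a hyphen, and seven digits. The sketch below checks that shape only, for example when adding a new test bank to a configuration. The helper names are hypothetical, and no check-digit validation is attempted.

```python
import re

# Shape checks only: no check-digit validation is performed.
LEI_PATTERN = re.compile(r"[A-Z0-9]{20}")
TAX_ID_PATTERN = re.compile(r"\d{2}-\d{7}")

TEST_LEIS = [
    "B90YWS6AFX2LGWOXJ1LD",
    "BANK1LEIFORTEST12345",
    "BANK3LEIFORTEST12345",
    "BANK4LEIFORTEST12345",
    "999999LE3ZOZXUS7W648",
    "28133080042813308004",
]


def is_lei_format(value):
    """Return True when value is 20 uppercase alphanumeric characters."""
    return LEI_PATTERN.fullmatch(value) is not None


def is_tax_id_format(value):
    """Return True when value looks like NN-NNNNNNN."""
    return TAX_ID_PATTERN.fullmatch(value) is not None


for lei in TEST_LEIS:
    print(lei, is_lei_format(lei))
```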