Synthetic Data FeatureCloud App

Description

A FeatureCloud app that generates synthetic data with the Synthetic Data Vault (SDV) Python library [1].

Input

data.txt containing the original dataset (columns: features; rows: samples)

Output

synthetic_data.csv containing the synthetic dataset generated with the given parameters.

Workflow

Can be combined with the following apps:

Post:
- Preprocessing apps (e.g. Cross-validation, Normalization ...)
- Various analysis apps (e.g. Logistic Regression, Linear Regression ...)

Config

Use the config file to set the parameters for synthetic data generation. Upload it together with the data to be synthesized.

fc_synthetic_data: 
  local_dataset:
    data: data.txt
    sep: ","
  synthetic_data_vault:
    model: GaussianCopula
    number_of_rows: 300
    synthetize_fields:
      - age
      - workclass
      - education
      - education-num 
      - marital-status
      - occupation 
      - relationship
      - race
      - sex
      - capital-gain
      - capital-loss
      - hours-per-week
      - native-country
      - prediction
    categorical_fields:
      - workclass
      - education
      - education-num 
      - marital-status
      - relationship
      - race
      - sex
      - native-country
      - prediction
    anonymize_fields:
      - occupation : job 
  result:
    file: synthetic_data.csv

The config file lets you specify the following:

- The model used to generate synthetic data. The options are GaussianCopula, CTGAN, TVAE, and CopulaGAN; the default is GaussianCopula.
- The number of rows to generate. If not specified, it defaults to the number of rows in the original dataset.
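The two defaults described above can be sketched as a small helper. This is only an illustration, not the app's actual code: the function name `resolve_sdv_settings` and the validation rules are assumptions, and the input is taken to be the already parsed `synthetic_data_vault` mapping from the config.

```python
# Model names listed in this README; GaussianCopula is the documented default.
SUPPORTED_MODELS = ("GaussianCopula", "CTGAN", "TVAE", "CopulaGAN")


def resolve_sdv_settings(sdv_config, n_original_rows):
    """Return (model_name, number_of_rows) with the documented defaults applied.

    Hypothetical helper: `sdv_config` is the parsed `synthetic_data_vault`
    mapping and `n_original_rows` is the row count of the original dataset.
    """
    sdv_config = sdv_config or {}

    # Fall back to GaussianCopula when no model is given.
    model = sdv_config.get("model") or "GaussianCopula"
    if model not in SUPPORTED_MODELS:
        raise ValueError(f"Unknown model {model!r}; expected one of {SUPPORTED_MODELS}")

    # Fall back to the size of the original dataset when no row count is given.
    rows = sdv_config.get("number_of_rows")
    if rows is None:
        rows = n_original_rows
    rows = int(rows)
    if rows <= 0:
        raise ValueError("number_of_rows must be a positive integer")

    return model, rows


print(resolve_sdv_settings({"model": "CTGAN", "number_of_rows": 300}, 1000))
print(resolve_sdv_settings({}, 1000))
```

With an empty mapping, the helper returns the default model and the original row count, which matches the behaviour described in the list.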

Similarly, under the option synthetize_fields (spelled as in the example config above), the user can specify the columns to be synthesized, and under the option categorical_fields, the user can specify which columns are categorical. The data types of the other fields are inferred automatically.
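As a rough sketch of how these two options could be turned into a column selection and a type mapping, consider the helper below. Everything in it is an assumption made for illustration: the name `build_field_spec`, the fallback to all columns when no field list is given, and the `{"type": "categorical"}` shape (chosen to resemble the `field_types` argument of older SDV tabular models). Columns that are not listed as categorical are left out of the mapping so that their types can be inferred.

```python
def build_field_spec(all_columns, sdv_config):
    """Return (field_names, field_types) derived from the config.

    Hypothetical helper. Assumption: if no field list is configured,
    every column of the dataset is synthesized.
    """
    selected = [str(f).strip() for f in sdv_config.get("synthetize_fields") or all_columns]
    categorical = [str(f).strip() for f in sdv_config.get("categorical_fields") or []]

    # Fail early if the config names a column that the dataset does not have.
    missing = [f for f in selected + categorical if f not in all_columns]
    if missing:
        raise KeyError(f"Fields not found in the dataset: {missing}")

    # Only explicitly listed columns get a type; the rest are left to inference.
    field_types = {name: {"type": "categorical"} for name in categorical if name in selected}
    return selected, field_types


columns = ["age", "workclass", "education", "hours-per-week"]
config = {
    "synthetize_fields": ["age", "workclass", "education"],
    "categorical_fields": ["workclass", "education"],
}
names, types = build_field_spec(columns, config)
print(names)
print(types)
```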

Furthermore, under the option anonymize_fields, the user can generate fake data, with the same statistical properties, for fields that contain Personally Identifiable Information (PII). To do this, give the name of the field and its category, as shown in the configuration example. For the available categories, see the Python Faker documentation.
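In the configuration example, each anonymize_fields entry is a `field : category` pair, which a YAML parser reads as a list of single-key mappings. The sketch below flattens such a list into one dictionary. The helper name `parse_anonymize_fields` is hypothetical, and how the app hands the result to SDV is not shown here; `job` is the Faker category used in the example config.

```python
def parse_anonymize_fields(entries):
    """Flatten a list of {field: category} mappings into a single dict.

    Hypothetical helper: `entries` is the parsed `anonymize_fields` list,
    e.g. [{"occupation": "job"}] for the example config.
    """
    mapping = {}
    for entry in entries or []:
        if not isinstance(entry, dict):
            raise TypeError(f"Expected a 'field: category' pair, got {entry!r}")
        for field, category in entry.items():
            # Strip stray whitespace around names, as in "occupation : job".
            mapping[str(field).strip()] = str(category).strip()
    return mapping


print(parse_anonymize_fields([{"occupation": "job"}]))
```

A missing or empty anonymize_fields option yields an empty dictionary, so no field is anonymized.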

For more information, see the SDV documentation.

Resources

[1] N. Patki, R. Wedge, and K. Veeramachaneni, "The Synthetic Data Vault," IEEE International Conference on Data Science and Advanced Analytics (DSAA), 2016, pp. 399-410, doi: 10.1109/DSAA.2016.49.
