diff --git a/doc/2.0/fate/components/feature_binning.md b/doc/2.0/fate/components/feature_binning.md
index 47e7379ebd..1c06e58bd1 100644
--- a/doc/2.0/fate/components/feature_binning.md
+++ b/doc/2.0/fate/components/feature_binning.md
@@ -32,17 +32,19 @@ Principle](../../images/multiple_host_binning.png)
 
 1. Support Quantile Binning based on quantile summary algorithm.
 2. Support Bucket Binning.
-3. Support calculating woe and iv values.
-4. Support transforming data into bin indexes or woe value(guest only).
-5. Support multiple-host binning.
-6. Support asymmetric binning methods on Host & Guest sides.
+3. Support manual binning based on user-defined binning points.
+4. Support calculating woe and iv values.
+5. Support transforming data into bin indexes or woe value(guest only).
+6. Support multiple-host binning.
+7. Support asymmetric binning methods on Host & Guest sides.
 
 Below lists supported features with links to examples:
 
-| Cases | Scenario |
-|--------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| Input Data with Categorical Features | [bucket binning](../../../../examples/pipeline/hetero_feature_binning/test_feature_binning_bucket.py)<br>[quantile binning](../../../../examples/pipeline/hetero_feature_binning/test_feature_binning_quantile.py) |
-| Output Data Transformed | [bin index](../../../../examples/pipeline/hetero_feature_binning/test_feature_binning_asymmetric.py)<br>[woe value(guest-only)](../../../../examples/pipeline/hetero_feature_binning/test_feature_binning_asymmetric.py) |
-| Skip Metrics Calculation | [multi_host](../../../../examples/pipeline/hetero_feature_binning/test_feature_binning_multi_host.py) |
+| Cases | Scenario |
+|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Input Data with Categorical Features | [bucket binning](../../../../examples/pipeline/hetero_feature_binning/test_feature_binning_bucket.py)<br>[quantile binning](../../../../examples/pipeline/hetero_feature_binning/test_feature_binning_quantile.py) |
+| Binning with User-defined split points | [manual binning](../../../../examples/pipeline/hetero_feature_binning/test_feature_binning_asymmetric.py) |
+| Output Data Transformed | [bin index](../../../../examples/pipeline/hetero_feature_binning/test_feature_binning_asymmetric.py)<br>[woe value(guest-only)](../../../../examples/pipeline/hetero_feature_binning/test_feature_binning_asymmetric.py) |
+| Skip Metrics Calculation | [multi_host](../../../../examples/pipeline/hetero_feature_binning/test_feature_binning_multi_host.py) |
diff --git a/doc/2.0/fate/components/feature_selection.md b/doc/2.0/fate/components/feature_selection.md
index 113db782d7..fbdb26af1e 100644
--- a/doc/2.0/fate/components/feature_selection.md
+++ b/doc/2.0/fate/components/feature_selection.md
@@ -28,10 +28,10 @@ Below lists their acceptable parameter values.
 | IV Filter | filter_param | "iv" | "threshold", "top_k", "top_percentile" | True |
 | Statistic Filter | statistic_param | "max", "min", "mean", "median", "std", "var", "coefficient_of_variance", "skewness", "kurtosis", "missing_count", "missing_ratio", quantile(e.g."95%") | "threshold", "top_k", "top_percentile" | True/False |
 
-1.
-    - iv\_filter: Use iv as criterion to selection features. Support
-      three mode: threshold value, top-k and top-percentile.
+## Filter Configuration
+1. iv\_filter: Use iv as the criterion to select features.
+    - filter_type: Support three modes: threshold value, top-k and top-percentile.
       - threshold value: Filter those columns whose iv is smaller
         than threshold. You can also set different threshold for
         each party.
@@ -39,13 +39,41 @@ Below lists their acceptable parameter values.
         k features in the sorted result.
       - top-percentile. Sort features from larger to smaller and
         take top percentile.
-
+    - select_federated: If set to True, the feature selection will be
+      performed in a federated manner. The feature selection will be
+      performed on the guest side, and the anonymously selected features will be
+      sent to the host side. The host side will then filter the
+      features based on the selected features from the guest side. This param is available in iv\_filter only.
+    - threshold: The threshold value for feature selection.
+    - take_high: If set to True, the filter will select features with
+      higher iv values. If set to False, the filter will select
+      features with lower iv values.
+    - host_filter_type: The filter type for host features. It can be
+      "threshold", "top_k", "top_percentile". This param is available in iv\_filter only.
+    - host_threshold: The threshold value for feature selection on host
+      features. This param is available in iv\_filter only.
+    - host_top_k: The top k value for feature selection on host features.
+      This param is available in iv\_filter only.
 2. statistic\_filter: Use statistic values calculate from DataStatistic
    component. Support coefficient of variance, missing value,
    percentile value etc. You can pick the columns with higher statistic
    values or smaller values as you need.
+    - filter_type: Support three modes: threshold value, top-k and top-percentile.
+      - threshold value: Filter those columns whose statistic metric is smaller
+        than threshold. You can also set different threshold for
+        each party.
+      - top-k: Sort features from larger statistic metric to smaller and take top
+        k features in the sorted result.
+      - top-percentile. Sort features from larger to smaller and
+        take top percentile.
+    - threshold: The threshold value for feature selection.
+    - take_high: If set to True, the filter will select features with
+      higher metric values. If set to False, the filter will select
+      features with lower metric values.
 3. manually: Indicate features that need to be filtered or kept.
+    - keep_col: The columns that need to be kept.
+    - filter_out_col: The columns that need to be dropped.
 
 Besides, we support multi-host federated feature selection for iv filters.
 Starting in ver 2.0.0-beta, all data sets will obtain anonymous header
diff --git a/doc/2.0/fate/model_building_quick_start.md b/doc/2.0/fate/model_building_quick_start.md
new file mode 100644
index 0000000000..c826dbd07c
--- /dev/null
+++ b/doc/2.0/fate/model_building_quick_start.md
@@ -0,0 +1,163 @@
+## Quick Start: A Model Building Demo
+
+1. install `fate_client` with extra packages `fate` and `fate_flow`
+
+```sh
+python -m pip install -U pip && python -m pip install fate_client[fate,fate_flow]==2.2.0
+```
+after installing packages successfully, initialize fate_flow service and fate_client
+
+```sh
+mkdir fate_workspace
+fate_flow init --ip 127.0.0.1 --port 9380 --home $(pwd)/fate_workspace
+pipeline init --ip 127.0.0.1 --port 9380
+
+fate_flow start
+fate_flow status # make sure fate_flow service is started
+```
+
+2. download example data
+
+```sh
+wget https://raw.githubusercontent.com/wiki/FederatedAI/FATE/example/data/breast_hetero_guest.csv && \
+wget https://raw.githubusercontent.com/wiki/FederatedAI/FATE/example/data/breast_hetero_host.csv
+```
+
+3. transform example data into the dataframe format used in FATE
+
+```python
+import os
+from fate_client.pipeline import FateFlowPipeline
+
+base_path = os.path.abspath(os.path.join(__file__, os.path.pardir))
+guest_data_path = os.path.join(base_path, "breast_hetero_guest.csv")
+host_data_path = os.path.join(base_path, "breast_hetero_host.csv")
+
+data_pipeline = FateFlowPipeline().set_parties(local="0")
+guest_meta = {
+    "delimiter": ",", "dtype": "float64", "label_type": "int64", "label_name": "y", "match_id_name": "id"
+}
+host_meta = {
+    "delimiter": ",", "input_format": "dense", "match_id_name": "id"
+}
+data_pipeline.transform_local_file_to_dataframe(file=guest_data_path, namespace="experiment", name="breast_hetero_guest",
+                                                meta=guest_meta, head=True, extend_sid=True)
+data_pipeline.transform_local_file_to_dataframe(file=host_data_path, namespace="experiment", name="breast_hetero_host",
+                                                meta=host_meta, head=True, extend_sid=True)
+```
+4. run training example and save pipeline
+
+```python
+from fate_client.pipeline.components.fate import (
+    Reader,
+    PSI,
+    HeteroFeatureBinning,
+    HeteroFeatureSelection,
+    DataSplit,
+    Statistics,
+    FeatureScale,
+    SSHELR,
+    Evaluation
+)
+from fate_client.pipeline import FateFlowPipeline
+
+
+# create pipeline for training
+pipeline = FateFlowPipeline().set_parties(guest="9999", host="10000")
+
+# create reader task_desc
+reader_0 = Reader("reader_0")
+reader_0.guest.task_parameters(namespace="experiment", name="breast_hetero_guest")
+reader_0.hosts[0].task_parameters(namespace="experiment", name="breast_hetero_host")
+
+# create psi component_desc
+psi_0 = PSI("psi_0", input_data=reader_0.outputs["output_data"])
+
+data_split_0 = DataSplit("data_split_0", input_data=psi_0.outputs["output_data"],
+                         train_size=0.7, validate_size=0.3, test_size=None, stratified=True)
+
+# compute metrics for selection
+binning_0 = HeteroFeatureBinning("binning_0", train_data=data_split_0.outputs["train_output_data"],
+                                 method="bucket", n_bins=10)
+statistics_0 = Statistics("statistics_0", input_data=data_split_0.outputs["train_output_data"],
+                          metrics=["min", "max", "25%", "mean", "median"])
+
+# run feature selection
+selection_0 = HeteroFeatureSelection("selection_0",
+                                     method=["iv", "statistics", "manual"],
+                                     train_data=data_split_0.outputs["train_output_data"],
+                                     input_models=[binning_0.outputs["output_model"],
+                                                   statistics_0.outputs["output_model"]],
+                                     iv_param={"metrics": "iv", "filter_type": "top_k", "threshold": 6,
+                                               "select_federated": True},
+                                     statistic_param={"metrics": ["max", "mean"],
+                                                      "filter_type": "top_k", "threshold": 5, "take_high": False},
+                                     manual_param={"keep_col": ["x0", "x1"]})
+selection_1 = HeteroFeatureSelection("selection_1",
+                                     test_data=data_split_0.outputs["validate_output_data"],
+                                     input_model=selection_0.outputs["train_output_model"])
+
+# scale data
+scale_0 = FeatureScale("scale_0", train_data=selection_0.outputs["train_output_data"], method="min_max")
+scale_1 = FeatureScale("scale_1", test_data=selection_1.outputs["test_output_data"],
+                       input_model=scale_0.outputs["output_model"])
+
+# train with sshe lr on the scaled training data, validate on the scaled validation data
+sshe_lr_0 = SSHELR("sshe_lr_0", train_data=scale_0.outputs["train_output_data"],
+                   validate_data=scale_1.outputs["test_output_data"], epochs=3)
+
+# evaluate the model's output on training data
+evaluation_0 = Evaluation("evaluation_0", input_datas=[sshe_lr_0.outputs["train_output_data"]],
+                          default_eval_setting="binary",
+                          runtime_parties=dict(guest="9999"))
+
+# compose training pipeline
+pipeline.add_tasks([reader_0, psi_0, data_split_0,
+                    binning_0, statistics_0, selection_0, selection_1,
+                    scale_0, scale_1, sshe_lr_0, evaluation_0])
+
+# compile and train
+pipeline.compile()
+pipeline.fit()
+
+# print metric and model info
+print(pipeline.get_task_info("sshe_lr_0").get_output_model())
+print(pipeline.get_task_info("evaluation_0").get_output_metric())
+
+# save pipeline for later usage
+pipeline.dump_model("./pipeline.pkl")
+
+```
+
+5. reload trained pipeline and run prediction
+
+```python
+from fate_client.pipeline import FateFlowPipeline
+from fate_client.pipeline.components.fate import Reader
+
+# create pipeline for predicting
+predict_pipeline = FateFlowPipeline()
+
+# reload trained pipeline
+pipeline = FateFlowPipeline.load_model("./pipeline.pkl")
+
+# deploy task for inference
+pipeline.deploy([pipeline.psi_0, pipeline.selection_0, pipeline.scale_0, pipeline.sshe_lr_0])
+
+# add input to deployed_pipeline
+deployed_pipeline = pipeline.get_deployed_pipeline()
+reader_1 = Reader("reader_1")
+reader_1.guest.task_parameters(namespace="experiment", name="breast_hetero_guest")
+reader_1.hosts[0].task_parameters(namespace="experiment", name="breast_hetero_host")
+deployed_pipeline.psi_0.input_data = reader_1.outputs["output_data"]
+
+# add task to predict pipeline
+predict_pipeline.add_tasks([reader_1, deployed_pipeline])
+
+# compile and predict
+predict_pipeline.compile()
+predict_pipeline.predict()
+```
+
+6. More tutorials
+More pipeline API guides can be found in this [link](https://github.com/FederatedAI/FATE-Client/blob/main/doc/pipeline.md)
diff --git a/doc/2.0/fate/psi_quick_start.md b/doc/2.0/fate/psi_quick_start.md
new file mode 100644
index 0000000000..b63b39e357
--- /dev/null
+++ b/doc/2.0/fate/psi_quick_start.md
@@ -0,0 +1,80 @@
+## PSI Quick Start
+
+1. install `fate_client` with extra packages `fate` and `fate_flow`
+
+```sh
+python -m pip install -U pip && python -m pip install fate_client[fate,fate_flow]==2.2.0
+```
+after installing packages successfully, initialize fate_flow service and fate_client
+
+```sh
+mkdir fate_workspace
+fate_flow init --ip 127.0.0.1 --port 9380 --home $(pwd)/fate_workspace
+pipeline init --ip 127.0.0.1 --port 9380
+
+fate_flow start
+fate_flow status # make sure fate_flow service is started
+```
+
+
+2. download example data
+
+```sh
+wget https://raw.githubusercontent.com/wiki/FederatedAI/FATE/example/data/breast_hetero_guest.csv && \
+wget https://raw.githubusercontent.com/wiki/FederatedAI/FATE/example/data/breast_hetero_host.csv
+```
+
+3. transform example data into the dataframe format used in FATE
+```python
+import os
+from fate_client.pipeline import FateFlowPipeline
+
+
+base_path = os.path.abspath(os.path.join(__file__, os.path.pardir))
+guest_data_path = os.path.join(base_path, "breast_hetero_guest.csv")
+host_data_path = os.path.join(base_path, "breast_hetero_host.csv")
+
+data_pipeline = FateFlowPipeline().set_parties(local="0")
+guest_meta = {
+    "delimiter": ",", "dtype": "float64", "label_type": "int64", "label_name": "y", "match_id_name": "id"
+}
+host_meta = {
+    "delimiter": ",", "input_format": "dense", "match_id_name": "id"
+}
+data_pipeline.transform_local_file_to_dataframe(file=guest_data_path, namespace="experiment", name="breast_hetero_guest",
+                                                meta=guest_meta, head=True, extend_sid=True)
+data_pipeline.transform_local_file_to_dataframe(file=host_data_path, namespace="experiment", name="breast_hetero_host",
+                                                meta=host_meta, head=True, extend_sid=True)
+```
+4. run psi
+
+```python
+from fate_client.pipeline.components.fate import (
+    Reader,
+    PSI
+)
+from fate_client.pipeline import FateFlowPipeline
+
+
+# create pipeline for running PSI
+pipeline = FateFlowPipeline().set_parties(guest="9999", host="10000")
+
+# create reader task_desc
+reader_0 = Reader("reader_0")
+reader_0.guest.task_parameters(namespace="experiment", name="breast_hetero_guest")
+reader_0.hosts[0].task_parameters(namespace="experiment", name="breast_hetero_host")
+
+# create psi component_desc
+psi_0 = PSI("psi_0", input_data=reader_0.outputs["output_data"])
+
+# add reader and psi tasks
+pipeline.add_tasks([reader_0, psi_0])
+
+# compile and run
+pipeline.compile()
+pipeline.fit()
+
+```
+
+5. More tutorials
+More pipeline API guides can be found in this [link](https://github.com/FederatedAI/FATE-Client/blob/main/doc/pipeline.md)
diff --git a/doc/README.md b/doc/README.md
index f2f084d65e..6005a1b283 100644
--- a/doc/README.md
+++ b/doc/README.md
@@ -3,7 +3,9 @@
 ### Tutorial
 
 - [Quick Start](./2.0/fate/quick_start.md): Train & predict with FATE HeteroSecureBoost using FATE-Pipeline
+- [Running PSI](./2.0/fate/psi_quick_start.md): Run PSI only using FATE-PipeLine
 - [Quick Start with Homo NN](./2.0/fate/homo_quick_start.md): Train & predict with FATE HomoNN using FATE-PipeLine
+- [Building Models with Hetero Components](./2.0/fate/model_building_quick_start.md): Model-building tutorial with Hetero components, including reading data, feature engineering, and training & evaluating models
 
 ### FATE Design
 - [Architecture](./architecture/README.md): Building unified and standardized API for heterogeneous computing engines interconnection
diff --git a/examples/pipeline/hetero_feature_binning/test_feature_binning_asymmetric.py b/examples/pipeline/hetero_feature_binning/test_feature_binning_asymmetric.py
index 12d398123e..69a36364f3 100644
--- a/examples/pipeline/hetero_feature_binning/test_feature_binning_asymmetric.py
+++ b/examples/pipeline/hetero_feature_binning/test_feature_binning_asymmetric.py
@@ -44,8 +44,8 @@ def main(config="../config.yaml", namespace=""):
     psi_0 = PSI("psi_0", input_data=reader_0.outputs["output_data"])
 
     binning_0 = HeteroFeatureBinning("binning_0",
-                                     method="quantile",
-                                     n_bins=10,
+                                     method="manual",
+                                     split_pt_dict={"x0": [0.1, 0.3, 0.5]},
                                      train_data=psi_0.outputs["output_data"],
                                      local_only=True
                                      )
diff --git a/examples/pipeline/hetero_feature_selection/test_feature_selection_binning.py b/examples/pipeline/hetero_feature_selection/test_feature_selection_binning.py
index 22ceca3024..2833ce9b7a 100644
--- a/examples/pipeline/hetero_feature_selection/test_feature_selection_binning.py
+++ b/examples/pipeline/hetero_feature_selection/test_feature_selection_binning.py
@@ -53,7 +53,8 @@ def main(config=".../config.yaml", namespace=""):
                                          method=["iv"],
                                          train_data=psi_0.outputs["output_data"],
                                          input_models=[binning_0.outputs["output_model"]],
-                                         iv_param={"metrics": "iv", "filter_type": "threshold", "threshold": 0.1})
+                                         iv_param={"metrics": "iv", "filter_type": "threshold", "threshold": 0.1,
+                                                   "select_federated": True})
 
     pipeline.add_tasks([reader_0, psi_0, binning_0, selection_0])
 
diff --git a/examples/pipeline/hetero_feature_selection/test_feature_selection_binning_lr.py b/examples/pipeline/hetero_feature_selection/test_feature_selection_binning_lr.py
index 8b0db4df80..b8452aa0cc 100644
--- a/examples/pipeline/hetero_feature_selection/test_feature_selection_binning_lr.py
+++ b/examples/pipeline/hetero_feature_selection/test_feature_selection_binning_lr.py
@@ -55,7 +55,8 @@ def main(config=".../config.yaml", namespace=""):
                                          method=["iv"],
                                          train_data=psi_0.outputs["output_data"],
                                          input_models=[binning_0.outputs["output_model"]],
-                                         iv_param={"metrics": "iv", "filter_type": "threshold", "threshold": 0.1})
+                                         iv_param={"metrics": "iv", "filter_type": "threshold", "threshold": 0.1,
+                                                   "select_federated": True})
 
     lr_0 = SSHELR("lr_0",
                   learning_rate=0.05,
diff --git a/examples/pipeline/hetero_feature_selection/test_feature_selection_multi_host.py b/examples/pipeline/hetero_feature_selection/test_feature_selection_multi_host.py
index 468e14423e..3036292adf 100644
--- a/examples/pipeline/hetero_feature_selection/test_feature_selection_multi_host.py
+++ b/examples/pipeline/hetero_feature_selection/test_feature_selection_multi_host.py
@@ -54,7 +54,8 @@ def main(config=".../config.yaml", namespace=""):
                                          train_data=psi_0.outputs["output_data"],
                                          input_models=[binning_0.outputs["output_model"],
                                                        statistics_0.outputs["output_model"]],
-                                         iv_param={"metrics": "iv", "filter_type": "top_percentile", "threshold": 0.8},
+                                         iv_param={"metrics": "iv", "filter_type": "top_percentile", "threshold": 0.8,
+                                                   "select_federated": True},
                                          statistic_param={"metrics": ["max", "mean"],
                                                           "filter_type": "top_k", "threshold": 5},
                                          manual_param={"keep_col": ["x0", "x1"]}
diff --git a/examples/pipeline/hetero_feature_selection/test_feature_selection_multi_model.py b/examples/pipeline/hetero_feature_selection/test_feature_selection_multi_model.py
index 7249a7e642..649093ebd4 100644
--- a/examples/pipeline/hetero_feature_selection/test_feature_selection_multi_model.py
+++ b/examples/pipeline/hetero_feature_selection/test_feature_selection_multi_model.py
@@ -54,7 +54,8 @@ def main(config=".../config.yaml", namespace=""):
                                          train_data=psi_0.outputs["output_data"],
                                          input_models=[binning_0.outputs["output_model"],
                                                        statistics_0.outputs["output_model"]],
-                                         iv_param={"metrics": "iv", "filter_type": "top_k", "threshold": 6},
+                                         iv_param={"metrics": "iv", "filter_type": "top_k", "threshold": 6,
+                                                   "select_federated": True},
                                          statistic_param={"metrics": ["max", "mean"],
                                                           "filter_type": "top_k", "threshold": 5, "take_high": False},
                                          manual_param={"keep_col": ["x0", "x1"]}