Proposal: Porting real-world ML model(s) into ZK #39

Open
only4sim opened this issue Feb 28, 2024 · 9 comments

only4sim commented Feb 28, 2024

Open Task RFP for Porting real-world ML model(s) into ZK

Project: Porting real-world ML model(s) into ZK

Executive Summary

  • Project Overview: This project aims to integrate machine learning (ML) models that predict hourly rainfall distributions from polarimetric radar data with blockchain oracles through zero-knowledge (ZK) proofs. The objective is to securely extend oracle data on the blockchain by providing verifiable, privacy-preserving proofs of ML inference results. This integration will demonstrate the practical application of ZK proofs in enhancing the trustworthiness and security of ML computations used in critical decision-making processes, such as agricultural planning and resource management.

Project Details

  • Motivation: Oracles serve as bridges between blockchain systems and external data sources, playing a pivotal role in smart contracts and decentralized applications. However, the amount of authenticated data an oracle can provide is often limited, and the data it does provide may not meet a dApp's needs. For example, dApps that rely on rainfall data are constrained by the distribution of weather stations, and many developing countries lack the appropriate infrastructure. To address this, researchers have proposed using radar signals to assess rainfall over large areas and then using a limited number of weather stations to correct the radar estimates. ML has demonstrated extremely high accuracy in this area (the Continuous Ranked Probability Score, which measures the evaluation error, is less than 0.01). By porting real-world ML models into zkML, I aim to establish a novel methodology for verifying the accuracy and integrity of inference results without revealing sensitive data. This project is motivated by the potential to significantly enhance the functionality and security of blockchain oracles, making them more applicable to sectors like agriculture where decision-making is heavily data-dependent.

  • Data Set: The AMS-AI 2015-2016 Contest, "Probabilistic estimate of hourly rainfall from radar", held at the 13th Conference on Artificial Intelligence. All the data is stored in CSV format. The training data consists of NEXRAD and MADIS data collected over midwestern corn-growing states during the first 8 days of Apr to Nov 2013. Time and location information have been censored, and the data have been shuffled so that they are not ordered by time or place. The test data consists of data from the same radars and gauges over the same months but in 2014. The training set has 1,048,575 samples and the test set has 630,452 samples. Each sample provides polarimetric radar values and derived quantities at a location over a period of one hour, and the task is to produce a probabilistic distribution of the hourly rain gauge total.

  • Related Work:

    1. Regarding the use of polarimetric radar values to predict rainfall distributions, Devin Anzelmo proposed in the competition to treat each prediction as a linear combination of CDFs estimated from the training set. Each component CDF is weighted by the class probability that the classification algorithm assigns to the label associated with that CDF.
    2. The use of radar and rain gauges to predict rainfall has been widely studied. For example, Yan et al. proposed "Short time precipitation estimation using weather radar and surface observations: with rainfall displacement information integrated in a stochastic manner", and Ochoa-Rodriguez reviewed the area in "A Review of Radar-Rain Gauge Data Merging Methods and Their Potential for Urban Hydrological Applications".
    3. Regarding the use of zero-knowledge proof techniques for proving machine learning inferences, there is some representative work such as circomlib-ml, EZKL, and ZKML by Daniel Kang et al.
  • Method: The rainfall will be evaluated by decision tree, random forest or XGBoost, and proved by ZK-DTP/circomlib-ml/EZKL.

  • Scope of Work:

    • Model Selection and Development: Identify and develop an ML model, suitable for ZK implementation, that uses polarimetric radar values and derived quantities to predict the probability distribution of rainfall. Correct this prediction using available meteorological base station data.

    • ZK Proof System Integration: Implement the selected ML model within a ZK proof system, ensuring that inference results can be verified on the blockchain without compromising data privacy.

    • Performance and Security Analysis: Evaluate the system's performance in terms of accuracy, computational efficiency, and security. Compare the ZK-enabled model's inference results with traditional approaches to demonstrate improvements.

    • Documentation and Publication: Document the development process, challenges, solutions, and performance analysis. Prepare comprehensive materials for scientific publication, highlighting the project's contribution to blockchain and ML fields.

  • Expected Outcomes:

    • A machine learning model capable of producing probabilistic rainfall estimates from polarimetric radar data.

    • Implementation of the model in a ZK environment, demonstrating the application of privacy-preserving techniques in agricultural data analysis.

    • A comprehensive documentation of the project process, findings, and a scientific article ready for publication.
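The CDF-mixture idea described under Related Work can be sketched in a few lines. This is only an illustration of "a linear combination of CDFs weighted by class probabilities", not Devin Anzelmo's actual code: the three class names, the toy gauge totals, the threshold grid, and the classifier output are all made-up assumptions.

```python
from typing import List, Sequence


def empirical_cdf(samples: Sequence[float], thresholds: Sequence[float]) -> List[float]:
    """Fraction of samples less than or equal to each threshold."""
    n = len(samples)
    return [sum(1 for s in samples if s <= t) / n for t in thresholds]


def mixture_cdf(class_probs: Sequence[float],
                class_cdfs: Sequence[Sequence[float]]) -> List[float]:
    """Weight each per-class CDF by its predicted class probability and sum."""
    if abs(sum(class_probs) - 1.0) > 1e-9:
        raise ValueError("class probabilities must sum to 1")
    n_thresholds = len(class_cdfs[0])
    return [
        sum(p * cdf[i] for p, cdf in zip(class_probs, class_cdfs))
        for i in range(n_thresholds)
    ]


# Hypothetical toy data: gauge totals (mm) grouped by an assumed 3-class label.
thresholds = [0, 1, 2, 3, 4]
train_by_class = {
    "dry": [0.0, 0.0, 0.0, 0.5],
    "light": [0.5, 1.0, 1.5, 2.0],
    "heavy": [2.5, 3.0, 4.0, 6.0],
}
class_cdfs = [empirical_cdf(v, thresholds) for v in train_by_class.values()]

# Assumed classifier output for a single radar sample.
probs = [0.7, 0.2, 0.1]
predicted = mixture_cdf(probs, class_cdfs)
print(predicted)
```

Because every component is a valid CDF and the weights are non-negative and sum to 1, the mixture is itself non-decreasing and bounded by 1, so no post-processing is needed to make it a valid distribution.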

Preliminary results

[Figure: Continuous Ranked Probability Score of pruned models.]

During the preparation of this proposal, I used my vacation time to conduct preliminary experiments and analysis, pruning Devin Anzelmo's solution, a multi-class XGBoost model with soft labels that can estimate rainfall amounts accurately. The original model is an aggregate of five XGBoost models, each with 10,000 decision trees, for a total of 50,000 decision trees. I tried Bonsai's hardware-accelerated zero-knowledge proving: the proof limit is about 6 million, and the proof time is in the range of 7 to 10 minutes. The complexity of the original model makes it extremely difficult to generate the proof on a personal laptop. To address this problem, I pruned the model, reducing the number of decision trees in each single model to 300, for a total of 1,500 decision trees. As shown in the figure, the loss in Continuous Ranked Probability Score was only 0.00002, less than 1/300.
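For readers unfamiliar with the metric, here is a minimal sketch of a discrete Continuous Ranked Probability Score. The proposal does not spell out the formula, so the threshold grid of 0 to 69 mm is an assumption based on how the Kaggle competition is usually described, and the function name `crps` is hypothetical.

```python
from typing import Iterable, Sequence


def crps(pred_cdfs: Sequence[Sequence[float]],
         observed: Sequence[float],
         thresholds: Iterable[float] = range(70)) -> float:
    """Mean squared gap between predicted CDFs and the observed step function.

    pred_cdfs[i][k] is the predicted P(rain <= thresholds[k]) for sample i.
    The 0..69 mm grid is an assumption, not something stated in the proposal.
    """
    grid = list(thresholds)
    total = 0.0
    for cdf, y in zip(pred_cdfs, observed):
        for p, z in zip(cdf, grid):
            step = 1.0 if z >= y else 0.0  # Heaviside step at the observed total
            total += (p - step) ** 2
    return total / (len(grid) * len(observed))


# A perfect forecast for an observed 2 mm total scores 0.
perfect = [0.0, 0.0] + [1.0] * 68
print(crps([perfect], [2.0]))

# Always predicting "no rain at all" is wrong at thresholds 0 and 1 only.
always_dry = [1.0] * 70
print(crps([always_dry], [2.0]))
```

Lower is better, and a score of 0 means the predicted CDF equals the observed step function at every threshold.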

Qualifications

  • Proposer: Li (@only4sim)
  • Email: [email protected]
  • Telegram Handle: @sing4cat
  • Discord Handle: li.quan
  • GitHub: https://github.com/only4sim
  • Skills Required: Python (Experienced), XGBoost, PyTorch, and scikit-learn (Familiar), EZKL (Experienced).
  • Preferred Qualifications:
    • Creator of the ZK-DTP and Snarky-ML libraries (First zkML libraries in o1js).
    • Grantee of Mina Protocol Innovation Grant (delivered beyond application expectations).
    • First place in the RISC Zero AI Challenge in ZK Hack Istanbul
    • Fourth Place, Top Prize in the RISC Zero Challenge in ZK Hack Lisbon

Administrative Details

  • Estimated Project Duration: 120 hours, accommodating the project's multifaceted nature and the integration of cutting-edge technologies.
  • Project Complexity: Medium. This project requires a cross-disciplinary approach, combining ML, ZK proofs, and blockchain technology to achieve its objectives. Since I have zkML related experience, I have a clear grasp of the difficulty and implementation of the project.

Development Roadmap

Overview

  • Total Estimated Duration: 120 hours
  • Full-time equivalent (FTE): 0.5 FTE
  • Starting date of the whole proposal: 2024-06-17

Milestone 1: Data Analysis and Model Selection

  • Estimated Duration: 2 weeks (30 hours)
  • FTE: 0.5
  • Expected delivery date: 2024-07-01
  • Deliverables and Specifications:
    • Source Code / Documentation: Detailed documentation on the analysis of polarimetric radar data and selection of the machine learning model best suited for predicting probabilistic rainfall distributions, taking into account both model performance and its adaptability for ZK implementation.
    • Functionality: Establishment of a data analysis pipeline, criteria for model selection based on accuracy, efficiency, and ZK compatibility.

Milestone 2: ZK Model Implementation

  • Estimated Duration: 3 weeks (45 hours)
  • FTE: 0.5
  • Expected delivery date: 2024-07-22
  • Deliverables and Specifications:
    • Source Code / Documentation: Comprehensive source code and documentation outlining the conversion of the selected ML model into a ZK framework using tools like ZK-DTP, Circom or EZKL, focusing on proof generation efficiency and computational resource optimization.
    • Functionality: A fully functional proof generation and verification pipeline, ensuring the model's efficient operation within ZK constraints.
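One step that conversions of this kind typically need is replacing floating-point comparisons with integer ones, since ZK circuits work over integers or field elements. The sketch below is an assumption about what that step could look like, not the method used by ZK-DTP, Circom, or EZKL: the `Node` class, the scale factor, and the toy tree are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

SCALE = 1 << 16  # assumed fixed-point scale (16 fractional bits)


@dataclass
class Node:
    feature: int = -1
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: float = 0.0  # leaf output, used only when left is None


def to_fixed(x: float) -> int:
    """Round a float to the nearest multiple of 1/SCALE, stored as an int."""
    return int(round(x * SCALE))


def eval_float(node: Node, x: Sequence[float]) -> float:
    """Reference evaluation: go left when the feature is below the threshold."""
    while node.left is not None:
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.value


def eval_fixed(node: Node, x_fixed: Sequence[int]) -> int:
    """Same traversal using integer comparisons only."""
    while node.left is not None:
        go_left = x_fixed[node.feature] < to_fixed(node.threshold)
        node = node.left if go_left else node.right
    return to_fixed(node.value)


# Hypothetical two-level tree and input.
tree = Node(feature=0, threshold=0.5,
            left=Node(value=0.1),
            right=Node(feature=1, threshold=2.25,
                       left=Node(value=0.4), right=Node(value=0.9)))
x = [0.75, 3.0]

print(eval_float(tree, x))
print(eval_fixed(tree, [to_fixed(v) for v in x]) / SCALE)
```

The integer path can disagree with the float path when a feature lies within about 1/SCALE of a threshold, so comparing both evaluations over the test set is a reasonable sanity check before generating proofs.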

Milestone 3: Integration with Blockchain Oracles

  • Estimated Duration: 2 weeks (30 hours)
  • FTE: 0.5
  • Expected delivery date: 2024-08-05
  • Deliverables and Specifications:
    • Source Code / Documentation: Clear instructions and codebase for the integration of the ZK-secured ML model with blockchain oracles, including detailed steps for extending oracle data with verified inference results in a privacy-preserving manner.
    • Functionality: Operational demonstration of using the ML model's inference results in smart contracts through oracles, focusing on privacy and data integrity.

Milestone 4: Performance Evaluation and Final Reporting

  • Estimated Duration: 1 week (15 hours)
  • FTE: 0.5
  • Expected delivery date: 2024-08-12
  • Deliverables and Specifications:
    • Final Report: An evaluative report on the project's outcomes, comparing model performance pre and post ZK implementation, its effect on enhancing oracle data reliability, and an assessment of overall system efficiency.
    • Functionality: Comprehensive benchmarks detailing the model's accuracy, proof generation, and verification complexity, with a focus on time and memory efficiency.
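The roadmap figures can be cross-checked with a little date arithmetic. The sketch below assumes 15 working hours per week, which is inferred from the "2 weeks (30 hours)" style of the milestone durations rather than stated explicitly; the `schedule` helper is hypothetical.

```python
from datetime import date, timedelta
from typing import List, Tuple

HOURS_PER_WEEK = 15  # inferred from "2 weeks (30 hours)"; an assumption
START = date(2024, 6, 17)

# (milestone name, duration in weeks) as listed in the roadmap.
MILESTONES = [
    ("Data Analysis and Model Selection", 2),
    ("ZK Model Implementation", 3),
    ("Integration with Blockchain Oracles", 2),
    ("Performance Evaluation and Final Reporting", 1),
]


def schedule(start: date,
             milestones: List[Tuple[str, int]],
             hours_per_week: int) -> List[Tuple[str, int, date]]:
    """Return (name, hours, delivery date) for back-to-back milestones."""
    rows = []
    cursor = start
    for name, weeks in milestones:
        cursor += timedelta(weeks=weeks)
        rows.append((name, weeks * hours_per_week, cursor))
    return rows


rows = schedule(START, MILESTONES, HOURS_PER_WEEK)
for name, hours, due in rows:
    print(f"{due.isoformat()}  {hours:>3} h  {name}")
print("total hours:", sum(hours for _, hours, _ in rows))
```

Under that assumption the milestone hours add up to the 120-hour total and the computed dates match the expected delivery dates listed for each milestone.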

References

[1] Wendy Kan, Alex Kleeman, and Lakshmanan V. 2015. How Much Did It Rain? https://kaggle.com/competitions/how-much-did-it-rain
[2] Devin Anzelmo. 2015. First place code. https://www.kaggle.com/c/how-much-did-it-rain/discussion/16260. Accessed: 2023-11-20.
[3] Eli Ben-Sasson, Alessandro Chiesa, Daniel Genkin, Eran Tromer, and Madars Virza. 2013. SNARKs for C: Verifying Program Executions Succinctly and in Zero Knowledge. IACR Cryptol. ePrint Arch. 2013 (2013), 507.
[4] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016).

@only4sim
Author

Hey @socathie and @NOOMA-42 ! I've submitted my proposal. Are you available to review it? Looking forward to your feedback and suggestions :)

@NOOMA-42 NOOMA-42 added the Application Proposal Proposal submitted by applicants label Feb 29, 2024
@NOOMA-42
Collaborator

NOOMA-42 commented Apr 1, 2024

@only4sim
Update: Cathie will be able to review around mid-April; she's been quite busy these past 2 months.

@only4sim
Author

only4sim commented Apr 2, 2024

Hey @NOOMA-42,

Thanks for the update! I know Dr. Cathie has been very busy recently, and I am also looking forward to seeing their new work. Mid-April would be fine for me. Please take your time.

Best wishes,
Li

@socathie

Hi @only4sim, sorry for the delayed response. This paper came out shortly after your proposal: https://github.com/Modulus-Labs/Papers/blob/master/remainder-paper.pdf

I think there is no doubt that decision tree/forest is an important type of model to bring onchain. However, Remainder seems to have cracked Decision Forest with GKR, so I'm not too sure if it is worth the effort to rewrite it in other DSLs.

That being said, Remainder is not currently open-source, so there is still space to port an actual real-world model along with oracle data, into zkML - Just wondering if this is something you want to take into account and maybe modify your proposal in light of recent developments.

@NOOMA-42
Collaborator

@only4sim Do you have any update?

@only4sim
Author

Hey @socathie and @NOOMA-42,

Thank you very much for sharing! The paper looks very interesting, and it also shows that this direction is getting more attention. Sorry, it took me some time to understand what they do, and of course my understanding is still incomplete. I feel that the solutions they propose are interesting, especially in terms of complexity optimization and parallelism, but the lack of open source code makes it quite difficult for us to use them.

Here's what I'm thinking so far:

  1. One of our main tasks lies in finding scenarios, and corresponding solutions, where ZKML is suitable for on-chain use. Extending and enhancing rainfall data through ZKML is a representative example.
  2. In addition, we want to be able to provide the community with new tools, rather than just solving a specific problem. We can approach this in two steps. First, a generalized converter can convert decision forests into different DSLs. Since decision forests do not rely on a large number of nonlinear operations as NNs do, the converter can be adapted relatively well to different DSLs, even if some of the DSLs only offer simple functions. In the second step, after Modulus-Labs' work is open-sourced, we can try to use their tools to prove our decision forests for rainfall evaluation, so that we can not only further validate Modulus-Labs' work but also get a baseline.
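A rough sketch of what the first step could look like: flatten each tree into parallel arrays as a DSL-neutral intermediate form, then plug in one small emitter per target. Everything here is hypothetical, including the `Node` class, the table layout, and the conditional-expression text produced by `emit_expression`, which is not the syntax of any particular DSL.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class Node:
    feature: int = -1
    threshold: int = 0          # assumed to be already quantized to an integer
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: int = 0              # leaf output, used only when left is None


def flatten(root: Node) -> Dict[str, List[int]]:
    """Turn a tree into parallel arrays indexed by node id (pre-order)."""
    table: Dict[str, List[int]] = {
        "feature": [], "threshold": [], "left": [], "right": [], "value": []
    }

    def visit(node: Node) -> int:
        idx = len(table["feature"])
        is_leaf = node.left is None
        table["feature"].append(-1 if is_leaf else node.feature)
        table["threshold"].append(0 if is_leaf else node.threshold)
        table["value"].append(node.value if is_leaf else 0)
        table["left"].append(-1)
        table["right"].append(-1)
        if not is_leaf:
            table["left"][idx] = visit(node.left)
            table["right"][idx] = visit(node.right)
        return idx

    visit(root)
    return table


def emit_expression(table: Dict[str, List[int]], idx: int = 0) -> str:
    """One toy backend: render the tree as a nested conditional expression."""
    if table["left"][idx] == -1:
        return str(table["value"][idx])
    cond = f"x[{table['feature'][idx]}] < {table['threshold'][idx]}"
    left = emit_expression(table, table["left"][idx])
    right = emit_expression(table, table["right"][idx])
    return f"({cond} ? {left} : {right})"


def evaluate(table: Dict[str, List[int]], x: Sequence[int]) -> int:
    """Reference interpreter over the flat table, useful for testing backends."""
    idx = 0
    while table["left"][idx] != -1:
        go_left = x[table["feature"][idx]] < table["threshold"][idx]
        idx = table["left"][idx] if go_left else table["right"][idx]
    return table["value"][idx]


tree = Node(0, 5, Node(value=10), Node(1, 3, Node(value=20), Node(value=30)))
table = flatten(tree)
print(table)
print(emit_expression(table))
print(evaluate(table, [7, 1]))
```

Keeping a reference interpreter next to the flat table gives every new backend something to be tested against, which matters more as the number of target DSLs grows.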

Of course, these are just my initial thoughts at the moment. I'm looking forward to your input as an expert in the field.

Best wishes :)

@socathie

Love the second point. Perhaps an adapter across multiple DSLs would be a good OS contribution

@only4sim
Author

Hey @socathie! I agree. Thanks for your suggestion. I think it would be great to put more effort into adapters for forest-like models across different DSLs, which looks exciting.

Looking forward to your guidance and suggestions in the future.

@NOOMA-42 NOOMA-42 added the Grant Work in Progress Passed review and work in progress label May 28, 2024
@IvanAnishchuk

Regarding the use of zero-knowledge proof techniques for proving machine learning inferences, there is some representative work such as circomlib-ml, EZKL, and ZKML by Daniel Kang et al.

Well, there are issues with them all:

  1. EZKL changed their license, and while the sources are open, there's no legal right to use anything without permission.
  2. ZKML is unsupported and limited in terms of what operations it supports (also, TensorFlow is bad on its own). There is a plan to replace it with TensorPlonk, but that's not finished and not open source yet.
  3. circomlib-ml... well, I think this is theoretically the best one given how useful circom is for other ZK things, but in practice it doesn't really support many operations, and the best way to circuitize any model is to approximate everything heavily (my previous team eventually decided it's unsuitable even for a simple LSTM model; the approximations required rendered the output too imprecise to be usable).

Remainder seems to have cracked Decision Forest with GKR,

Modulus Labs hasn't published any news for like half a year; I doubt there will be anything open source from them soon, if at all. One more team I know (and prefer not to give them credit by mentioning), originally interested in GKRs etc., decided not to do core research after all.

This paper was fun to read but it's not being worked on anymore https://github.com/jvhs0706/zkllm-ccs2024

I'm afraid to say core tech is just not ready to do real-world ML with ZKML. Normally, an ML engineer has a data set and tries to see how different models fit, so solutions that require several man-weeks to circuitize a new model are hardly practical. For ML engineers to come onboard, there must be a way to take a random weird, say, linear model with some special normalization and construct a circuit with weight commitments within minutes (even if the actual proof of inference is not too fast, that's not always that big a deal). So ideally there would be native support for anything in PyTorch or ONNX within a reasonable size limit.
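To make "weight commitments" concrete, here is a minimal sketch of committing to a quantized weight vector with a salted hash. It only illustrates the commit-then-check pattern outside any circuit: SHA-256 is used because it ships with Python, whereas zkML systems commonly use circuit-friendly hashes such as Poseidon inside the proof, and the scale factor and function names are assumptions.

```python
import hashlib
import secrets
import struct
from typing import List, Sequence

SCALE = 1 << 16  # assumed fixed-point scale for the weights


def quantize(weights: Sequence[float]) -> List[int]:
    """Map float weights to integers so the encoding is unambiguous."""
    return [int(round(w * SCALE)) for w in weights]


def commit(q_weights: Sequence[int], salt: bytes) -> str:
    """Salted SHA-256 over the weights, each packed as signed 64-bit big-endian."""
    h = hashlib.sha256()
    h.update(salt)
    for w in q_weights:
        h.update(struct.pack(">q", w))
    return h.hexdigest()


def verify(q_weights: Sequence[int], salt: bytes, commitment: str) -> bool:
    """Recompute the commitment and compare it with the published one."""
    return commit(q_weights, salt) == commitment


# Hypothetical linear-model weights.
weights = quantize([0.5, -1.25, 2.0])
salt = secrets.token_bytes(16)  # kept private by the model owner
published = commit(weights, salt)

print(published)
print(verify(weights, salt, published))
```

In a real pipeline the proof of inference would also show, inside the circuit, that the weights it used hash to the published commitment; this sketch covers only the commitment itself.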

There is this idea that lookup/GKR based protocol will be a huge performance improvement for ZKML (way beyond what plonkish systems can manage, Jolt had some good results with GKR/lookup proofs for ZKVMs too) and there was some fresh research on optimal MatMul proofs but I feel I'm running out of my own resources here and will probably stop researching this in order to find some web2 gig and be able to pay rent. Ping me if any brave soul ever wants to look into lookup/GKR/sumcheck proofs for ZK-ML. My username is the same across most social and messaging apps.
