Proposal: Porting real-world ML model(s) into ZK #39
Comments
Hey @NOOMA-42, Thanks for updating! I know Dr. Cathie has been very busy recently, and I am also looking forward to seeing their new work. Mid-April would be fine for me. Please take your time. Best wishes,
Hi @only4sim, sorry for the delayed response. This paper came out shortly after your proposal: https://github.com/Modulus-Labs/Papers/blob/master/remainder-paper.pdf I think there is no doubt that decision trees/forests are an important type of model to bring onchain. However, Remainder seems to have cracked decision forests with GKR, so I'm not too sure it is worth the effort to rewrite it in other DSLs. That being said, Remainder is not currently open source, so there is still space to port an actual real-world model, along with oracle data, into zkML. Just wondering if this is something you want to take into account, and perhaps modify your proposal in light of recent developments.
@only4sim Do you have any updates?
Thank you very much for sharing! The paper looks very interesting, and it also shows that this direction is getting more attention. Sorry it took me some time to understand what they do, and of course my understanding is not complete. I feel that the solutions they propose are interesting, especially in terms of complexity optimization and parallelism, but the lack of open source makes it quite difficult for us to use them. Here's what I'm thinking so far:
Of course, these are just my initial thoughts at the moment. I'm looking forward to your input as an expert in the field. Best wishes :)
Love the second point. Perhaps an adapter across multiple DSLs would be a good open-source contribution.
Hey @socathie! I agree. Thanks for your suggestion; I think it would be great to put more effort into forest-model adapters across different DSLs, which looks exciting. Looking forward to your guidance and suggestions in the future.
Well, there are issues with all of them. 1) EZKL changed their license, and while the sources are open, there is no legal right to use anything without permission. 2) ZKML is unsupported and limited in the operations it supports (and TensorFlow is bad on its own); there is a plan to replace it with TensorPlonk, but that is not finished and not open source yet. 3) circomlib-ml... well, I think this is theoretically the best one, given how useful circom is for other ZK things, but in practice it doesn't support many operations, and the best way to circuitize any model is to approximate everything heavily (my previous team eventually decided it was unsuitable even for a simple LSTM model; the required approximations rendered the output too imprecise to be usable).
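To make the approximation concern concrete, here is a minimal, self-contained Python sketch of what circuitization-style approximation can do to a toy layer. The fixed-point scale, the polynomial replacement for tanh, and the toy weights are all hypothetical and are not circomlib-ml's actual encoding.

```python
# Hypothetical illustration: quantize weights/inputs to fixed point and replace
# a nonlinearity with a low-degree polynomial, as circuit DSLs often force you
# to do, then measure the output error against the exact computation.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))        # toy weight matrix
x = rng.normal(size=16)             # toy input vector

def quantize(a: np.ndarray, bits: int = 8) -> np.ndarray:
    """Round to a fixed-point grid with 2**bits fractional steps."""
    scale = 2 ** bits
    return np.round(a * scale) / scale

def approx_tanh(z: np.ndarray) -> np.ndarray:
    """Degree-3 Taylor truncation of tanh; only accurate near zero."""
    return z - z ** 3 / 3

exact = np.tanh(W.T @ x)
approx = approx_tanh(quantize(W).T @ quantize(x))

print(np.max(np.abs(exact - approx)))   # error blows up wherever |W.T @ x| > 1
```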
Modulus Labs hasn't published any news for about half a year, so I doubt there will be anything open source from them soon, if at all. One more team I know of (and prefer not to give credit by mentioning), originally interested in GKRs and the like, decided not to do core research after all. This paper was fun to read, but it's not being worked on anymore: https://github.com/jvhs0706/zkllm-ccs2024

I'm afraid the core tech is just not ready to do real-world ML with ZKML. Normally, an ML engineer has a data set and tries to see how different models fit, so solutions that require several person-weeks to circuitize a new model are hardly practical. For ML engineers to come onboard, there must be a way to take some random, weird, say, linear model with special normalization and construct a circuit with weight commitments within minutes (even if the actual proof of inference is not that fast, which is not always a big deal). So ideally, native support for anything in PyTorch or ONNX within a reasonable size limit.

There is an idea that a lookup/GKR-based protocol would be a huge performance improvement for ZKML (well beyond what plonkish systems can manage; Jolt had some good results with GKR/lookup proofs for ZKVMs too), and there was some fresh research on optimal MatMul proofs, but I feel I'm running out of my own resources here and will probably stop researching this in order to find a web2 gig and be able to pay rent. Ping me if any brave soul ever wants to look into lookup/GKR/sumcheck proofs for ZKML. My username is the same across most social and messaging apps.
Open Task RFP for Porting real-world ML model(s) into ZK
Project: Porting real-world ML model(s) into ZK
Executive Summary
Project Details
Motivation: Oracles serve as bridges between blockchain systems and external data sources, playing a pivotal role in smart contracts and decentralized applications. There is often a limit to the amount of authenticated data an oracle can provide, and the data it does provide may not meet a dApp's needs. For example, dApps are often limited in their use of rainfall data by the distribution of weather stations, and many developing countries lack the appropriate infrastructure. To address this concern, researchers have proposed using radar signals to estimate rainfall over large areas and then using a limited number of weather stations to correct the radar estimates. ML has demonstrated extremely high accuracy in this area (the Continuous Ranked Probability Score, which represents the evaluation error, is less than 0.01). By porting real-world ML models into zkML, I aim to establish a novel methodology for verifying the accuracy and integrity of inference results without revealing sensitive data. This project is motivated by the potential to significantly enhance the functionality and security of blockchain oracles, making them more applicable to sectors like agriculture, where decision-making is heavily data-dependent.
Data Set: The AMS-AI 2015-2016 Contest, "Probabilistic Estimate of Hourly Rainfall from Radar," from the 13th Conference on Artificial Intelligence. All data is stored in CSV format. The training data consists of NEXRAD and MADIS data collected over midwestern corn-growing states during the first 8 days of April to November 2013. Time and location information have been censored, and the data have been shuffled so that they are not ordered by time or place. The test data consists of data from the same radars and gauges over the same months, but in 2014. The training set has 1,048,575 samples and the test set has 630,452 samples. We are given polarimetric radar values and derived quantities at a location over the period of one hour, and we need to produce a probability distribution of the hourly rain gauge total.
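A minimal loading sketch in Python, assuming the contest CSVs have been downloaded locally; the file names below are placeholders, not the exact contest schema.

```python
# Sketch only: file names are assumptions; adjust to the actual contest download.
import pandas as pd

train = pd.read_csv("train.csv")   # ~1,048,575 samples (2013)
test = pd.read_csv("test.csv")     # ~630,452 samples (2014)

print(train.shape, test.shape)
print(train.columns.tolist())      # polarimetric radar values and derived quantities
```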
Related Work:
Method: Rainfall will be estimated with a decision tree, random forest, or XGBoost model, and the inference will be proved with ZK-DTP, circomlib-ml, or EZKL.
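As a rough illustration of the XGBoost option, here is a minimal multi-class sketch on placeholder data. The rainfall bin edges, hyperparameters, and synthetic features are illustrative only and do not reproduce Anzelmo's soft-label setup.

```python
# Sketch: discretize hourly rainfall into classes and train a multi-class
# XGBoost model whose class probabilities can be turned into a CDF for CRPS.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 20))                 # placeholder radar features
y_mm = rng.gamma(shape=1.0, scale=2.0, size=1000)     # placeholder gauge totals (mm)

edges = np.array([0.5, 1.0, 2.5, 5.0])                # illustrative bin edges (mm)
y_class = np.digitize(y_mm, edges)                    # soft labels would refine this

model = xgb.XGBClassifier(
    objective="multi:softprob",                       # one probability per rainfall class
    n_estimators=300,                                 # kept small with circuit size in mind
    max_depth=6,
    learning_rate=0.1,
)
model.fit(X_train, y_class)

proba = model.predict_proba(X_train[:5])              # per-class probabilities
cdf = np.cumsum(proba, axis=1)                        # cumulative distribution for CRPS
```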
Scope of Work:
Model Selection and Development: Identify and develop an ML model capable of generating probabilistic rainfall distributions from polarimetric radar data and suitable for ZK implementation. Use polarimetric radar values and derived quantities to predict the probability distribution of rainfall, and correct this prediction using available meteorological base-station data.
ZK Proof System Integration: Implement the selected ML model within a ZK proof system, ensuring that inference results can be verified on the blockchain without compromising data privacy.
Performance and Security Analysis: Evaluate the system's performance in terms of accuracy, computational efficiency, and security. Compare the ZK-enabled model's inference results with traditional approaches to demonstrate improvements (a CRPS evaluation sketch follows this list).
Documentation and Publication: Document the development process, challenges, solutions, and performance analysis. Prepare comprehensive materials for scientific publication, highlighting the project's contribution to blockchain and ML fields.
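For the accuracy comparison, here is a minimal sketch of a discretized CRPS: the mean squared difference between the predicted CDF and the observed step function. The 70-threshold integer-millimeter grid is an assumption modeled on the contest's scoring.

```python
# Sketch of a contest-style discretized CRPS.
import numpy as np

def crps(pred_cdf: np.ndarray, y_true_mm: np.ndarray) -> float:
    """pred_cdf: (n_samples, n_thresholds) cumulative probabilities P(rain <= t mm).
    y_true_mm: (n_samples,) observed hourly gauge totals in mm."""
    thresholds = np.arange(pred_cdf.shape[1])               # 0, 1, ..., 69 mm
    heaviside = (y_true_mm[:, None] <= thresholds).astype(float)
    return float(np.mean((pred_cdf - heaviside) ** 2))

# Example: a sharp forecast putting all mass between 2 and 3 mm, scored
# against observations near 2 mm, yields a small CRPS.
cdf = np.tile(np.concatenate([np.zeros(3), np.ones(67)]), (4, 1))
print(crps(cdf, np.array([2.0, 2.5, 2.0, 1.5])))            # ~0.01
```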
Expected Outcomes:
A machine learning model capable of producing probabilistic rainfall estimates from polarimetric radar data.
Implementation of the model in a ZK environment, demonstrating the application of privacy-preserving techniques in agricultural data analysis.
A comprehensive documentation of the project process, findings, and a scientific article ready for publication.
Preliminary results
I used my vacation time during the preparation of this proposal to conduct preliminary experiments and analysis, pruning Devin Anzelmo's solution, a multi-class XGBoost model with soft labels that estimates rainfall amounts accurately. However, the original model was aggregated from five XGBoost models, each containing 10,000 decision trees, for a total of 50,000 decision trees. I tried using Bonsai's zero-knowledge proving hardware acceleration; the proof limit is about 6 million, and the proof time is in the range of 7 to 10 minutes. The complexity of this model makes it extremely difficult to produce the proof on a personal laptop. To address this problem, I pruned the model, reducing the number of decision trees in a single model to 300, for a total of 1,500 decision trees. As shown in the figure, the loss in Continuous Ranked Probability Score was only 0.00002, less than 1/300.
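A minimal sketch of how such a pruning comparison could be run with xgboost's iteration_range, assuming a large trained XGBClassifier (big_model), a held-out set (X_valid, y_valid), the bin edges, and the crps() helper from the sketches above; all of these names are placeholders, and the actual pruning procedure behind the preliminary numbers may differ.

```python
# Sketch: score the full ensemble vs. only the first 300 boosting rounds.
import numpy as np

def cdf_from_proba(proba: np.ndarray, edges: np.ndarray, n_thresholds: int = 70) -> np.ndarray:
    """Turn per-class probabilities into a CDF over integer-mm thresholds."""
    upper = np.append(edges, np.inf)                  # upper edge of each rainfall bin
    cdf = np.zeros((proba.shape[0], n_thresholds))
    for t in range(n_thresholds):
        cdf[:, t] = proba[:, upper <= t].sum(axis=1)  # mass of bins entirely below t mm
    return cdf

full = big_model.predict_proba(X_valid)                               # all trees
pruned = big_model.predict_proba(X_valid, iteration_range=(0, 300))   # first 300 rounds

delta = crps(cdf_from_proba(pruned, edges), y_valid) - crps(cdf_from_proba(full, edges), y_valid)
print(delta)   # the preliminary experiment reports a degradation of roughly 2e-5
```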
Qualifications
Administrative Details
Development Roadmap
Overview
Milestone 1: Data Analysis and Model Selection
Milestone 2: ZK Model Implementation
Milestone 3: Integration with Blockchain Oracles
Milestone 4: Performance Evaluation and Final Reporting
Reference
[1] Alex Kleeman, V. Lakshmanan, and Wendy Kan. 2015. How Much Did It Rain? Kaggle. https://kaggle.com/competitions/how-much-did-it-rain
[2] Devin Anzelmo. 2015. First place code. https://www.kaggle.com/c/how-much-did-it-rain/discussion/16260. Accessed: 2023-11-20.
[3] Eli Ben-Sasson, Alessandro Chiesa, Daniel Genkin, Eran Tromer, and Madars Virza. 2013. SNARKs for C: Verifying Program Executions Succinctly and in Zero Knowledge. IACR Cryptol. ePrint Arch. 2013 (2013), 507.
[4] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016).