Because there is no trust between data owners, proprietary data is wasted in silos. Combined in a trustless manner using ZKML, this data can massively benefit both data providers and consumers.
A basket of datasets is priced by training a foundation AI model on it and computing ZK-verified accuracy on a public test set. This accuracy is a proxy for data value, so access to the model can be bought and sold.
This is the total amount allocated to CODI: Community-Owned Data Insights; Powered by Cardano.
N/A
No dependencies
The algorithm that uses heuristics to establish which potential baskets of data sources are promising combinations to test and price. This makes the product faster and more accurate over time as it can learn from previous experience.
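A minimal sketch of how such a basket-selection heuristic might work. The scoring rule, the function names, and the small diversity bonus are all illustrative assumptions, not the production algorithm; the point is only that scores informed by previously evaluated baskets let the system avoid training a model for every possible combination.

```python
from itertools import combinations

def score_basket(basket, history):
    """Heuristic score for a candidate basket of dataset ids.
    `history` maps frozensets of dataset ids to measured accuracies
    from past pricing runs; baskets overlapping high-accuracy past
    baskets get a higher prior. (Illustrative rule, not the real one.)"""
    overlaps = [acc for past, acc in history.items() if past & basket]
    prior = sum(overlaps) / len(overlaps) if overlaps else 0.5
    return prior + 0.01 * len(basket)  # mild preference for larger baskets

def promising_baskets(datasets, history, k=2, top_n=3):
    """Rank all k-element baskets by heuristic score; only the top few
    are actually trained and priced, saving compute over time."""
    baskets = [frozenset(c) for c in combinations(datasets, k)]
    ranked = sorted(baskets, key=lambda b: score_basket(b, history), reverse=True)
    return ranked[:top_n]
```

As the `history` of priced baskets grows, the ranking sharpens, which is how the product gets faster and more accurate with experience.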
The product itself is a three-way marketplace between data producers, compute owners, and data consumers. It solves the key problems as follows:
Our product uses ZKML to train a foundation model on multiple data sources without any source needing access to the others. The model's accuracy on a test set can be publicly verified through ZK proofs, and this accuracy serves as a proxy for the combined value of the data sources. The collection can then be bought or sold on the marketplace. ZK proofs are ideal for this use case because we require verification that a public test set has produced certain results when run through a private model.
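To illustrate the commit-and-verify shape described above without implementing actual ZK circuits, here is a hedged Python sketch. The hash commitment stands in for a real ZK commitment, and `price_from_accuracy` with its 0.5 random-guess floor is a toy pricing rule of our own invention (it assumes a balanced binary task); the real system would replace both with circuit-backed proofs and a learned pricing model.

```python
import hashlib

def commit(model_params: bytes) -> str:
    """Publish a hash commitment to the private model without revealing it.
    (Stand-in for a ZK commitment scheme.)"""
    return hashlib.sha256(model_params).hexdigest()

def attested_accuracy(predictions, public_labels) -> float:
    """Accuracy on the public test set. In the real system this value would
    be accompanied by a ZK proof that the committed model actually produced
    `predictions` on the public inputs."""
    correct = sum(p == y for p, y in zip(predictions, public_labels))
    return correct / len(public_labels)

def price_from_accuracy(accuracy: float, base_price: float = 1000.0) -> float:
    """Toy pricing rule: value scales with accuracy above a random-guess
    floor of 0.5 (illustrative assumption for a balanced binary task)."""
    return base_price * max(0.0, accuracy - 0.5) * 2
```

A buyer can recompute the accuracy from the published predictions and labels, and the ZK proof ties those predictions to the committed model, so the attested price proxy is checkable without revealing the model itself.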
This product is aimed at anyone who has highly specific data they want to sell (e.g. research labs selling manually classified microscope images of bacteria in their country), and people who would benefit from more general proprietary data models (e.g. startups looking to use bacteria classification models to detect the concentration of certain antibiotic-resistant strains in various areas).
That the product works is easy to verify: simply test that data bought on the marketplace truly produces a model at the attested accuracy.
This solution is vital to the Cardano developer ecosystem. As more tools and software integrate real-world data into Cardano, it is important to be able to evaluate the value of incoming data and, correspondingly, the value of data oracles. As described, this data could be used to build rate-limited AI models that allow certain parties access to insights in a fair, decentralized way. Beyond AI models, this data can be used to produce any form of insight that is valuable to the whole community, distributed fairly to contributors of the insight generation, and, most importantly, co-owned by those contributors. The insights from data combined in a secure way will be more valuable than anything an individual party could derive from its own proprietary data. At the same time, each party can achieve this without revealing all of its data.
All the outputs of the product will be immediately open-sourced with the exception of the algorithm that uses heuristics to establish which potential baskets of data sources are promising combinations to test and price. This algorithm will be open-sourced once it is sufficiently performant because, otherwise, the project could be instantly ported onto any other chain.
The project will be open to any customer. However, data providers will have to verify that they are proposing to sell data of a sufficient quantity so that the system doesn't get overloaded with a very high number of low-quantity sell orders.
This product can be used in future development to allow parties to price other kinds of data, such as streaming data and multi-modal data. These are all areas for future research.
I am an ML engineer with smart contract and Web3 development experience. I believe this unique combination makes me the perfect fit for this project. I am very enthusiastic about democratizing data and levelling the playing field in AI. This means breaking the data moat established by monolithic companies, which requires a method of pooling together smaller data sources in a decentralized, fair, and trustless manner.
I have excelled in many previous endeavours: my academics, graduating top of my cohort at Imperial College London and currently studying at Oxford; my research, currently writing a workshop paper on causal AI, with previous work on the Solana consensus mechanism; and my building, where I've won many hackathons (e.g. IC Hack 2022, UN Privacy Hack 2022, EduDAO Hack 2023).
I can learn fast, research deeply, and engineer solutions precisely, which makes me the perfect fit for a project like this.
Proposed goals: To build a system for pricing data sources and proposing a combined collection of data that is optimally valuable for a particular project at a particular budget. The system will also facilitate queries to this model under conditions predefined by the project proposer.
(1) Collate several public datasets of varying sizes and domains [0.5 months]
(2) Develop a pricing algorithm for datasets [2 months]
(3) Develop a matching algorithm and frontend to allow customers to buy dataset collections at a particular budget [0.5 months]
(4) Establish a way of hosting the foundation model such that only contributors can access it at the agreed rate-limits [1 month]
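As one possible shape for milestone (3), a budget-constrained matcher could be sketched as a greedy value-per-cost selection. This is a simplifying assumption for illustration: the function name and tuple layout are hypothetical, and the real matcher would likely combine this with the basket-scoring heuristic rather than treating prices as independent.

```python
def match_at_budget(priced_datasets, budget):
    """Greedy matching: pick datasets with the best value-per-cost ratio
    until the budget is exhausted. `priced_datasets` is a list of
    (name, cost, value) tuples, where value comes from the pricing model."""
    chosen, spent = [], 0.0
    ranked = sorted(priced_datasets, key=lambda d: d[2] / d[1], reverse=True)
    for name, cost, value in ranked:
        if spent + cost <= budget:  # skip anything that would bust the budget
            chosen.append(name)
            spent += cost
    return chosen, spent
```

Greedy selection by density is fast enough for interactive frontend use; an exact knapsack solver could replace it later if basket values turn out to be strongly non-additive.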
(1) A collection of datasets
(2) An accurate pricing model where prices correlate well with the accuracy of the corresponding foundation models
(3) Clean UX with fast matching experience
(4) Reliable and robust system with relatively low-latency access and low downtime
All costs are listed in Ada.
71000 - Developer hire costs
2000 - Compute resources
1000 - Proprietary data for testing
1000 - Publicity and community engagement
Any further funds required for developer hire will be accrued through further grants or investments.
The industry rate for developers experienced in both Web3 and ML in the UK is £50 (~$65) per hour. This grant would pay for ~300 hours of developer time. Overall, this would allow us to develop a sufficient prototype / proof of concept to secure further grants or even VC funding.
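A quick sanity check of the arithmetic. The implied ADA/USD rate is derived from the figures above (300 hours at $65 against the 71,000 Ada developer-hire line item), not a rate stated in the proposal:

```python
hours = 300
usd_per_hour = 65                      # £50 ~= $65, per the rate quoted above
dev_cost_usd = hours * usd_per_hour    # total developer cost in USD
dev_budget_ada = 71_000                # developer-hire budget line, in Ada
implied_usd_per_ada = dev_cost_usd / dev_budget_ada  # rate implied by the two figures
```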
Even a simple first version of this project will allow clients to price their data which is immensely valuable in itself, even if the transaction between buyers and sellers is conducted in a centralized way. The revenue from this first version will be used to fund the later stages which involve hosting the foundation model and providing secure access to the relevant parties on the agreed terms.
Manuj Mishra: Founder and Chief Developer
Manuj has extensive ML experience and has built multiple ZK projects, most recently winning three prizes at the EduDAO Hack with a ZKML verifier. His ML background begins with his BEng in Maths and CS from Imperial College London, where he graduated top of his cohort, and his MSc in Advanced CS from Oxford, where he focused on cutting-edge ML research. He has applied his ML expertise in industry at Fortune 500 companies (e.g. American Express) and unicorn startups (e.g. Rebellion Defense). He is currently undertaking a founders' residency at HomeDAO, which has already spun out several multi-million-dollar crypto projects in its first year of operation.
https://www.linkedin.com/in/manuj-mishra/
https://github.com/manuj-mishra
https://manuj-mishra.github.io/resume
Future plans: I intend to work with an appropriate co-founder on this project to speed up development. There are many potential options for this within HomeDAO and through personal connections. I have reached out to potential candidates already, especially since I live with most of them.