Because there is no trust between data owners, proprietary data is wasted in silos. Combined in a trustless manner using ZKML, this data can massively benefit both data providers and consumers.
A basket of datasets is priced by training a foundation AI model on it and computing ZK-verified accuracy on a public test set. This accuracy is a proxy for data value, so access to the model can be bought and sold.
This is the total amount allocated to CODI: Community-Owned Data Insights; Powered by Cardano.
N/A
No dependencies
The algorithm that uses heuristics to establish which potential baskets of data sources are promising combinations to test and price. This makes the product faster and more accurate over time as it can learn from previous experience.
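A minimal sketch of how such a basket-selection heuristic might work. The scoring rule, the function names, and the small diversity bonus are all illustrative assumptions, not the production algorithm; the point is only that scores informed by previously evaluated baskets let the system avoid training a model for every possible combination.

```python
from itertools import combinations

def score_basket(basket, history):
    """Heuristic score for a candidate basket of dataset ids.
    `history` maps frozensets of dataset ids to measured accuracies
    from past pricing runs; baskets overlapping high-accuracy past
    baskets get a higher prior. (Illustrative rule, not the real one.)"""
    overlaps = [acc for past, acc in history.items() if past & basket]
    prior = sum(overlaps) / len(overlaps) if overlaps else 0.5
    return prior + 0.01 * len(basket)  # mild preference for larger baskets

def promising_baskets(datasets, history, k=2, top_n=3):
    """Rank all k-element baskets by heuristic score; only the top few
    are actually trained and priced, saving compute over time."""
    baskets = [frozenset(c) for c in combinations(datasets, k)]
    ranked = sorted(baskets, key=lambda b: score_basket(b, history), reverse=True)
    return ranked[:top_n]
```

As the `history` of priced baskets grows, the ranking sharpens, which is how the product gets faster and more accurate with experience.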
The product itself is a three-way marketplace between data producers, compute owners, and data consumers. It solves the key problems as follows:
Our product uses ZKML to train a foundation model on multiple data sources without any source needing access to the others. The model's accuracy on a test set can be publicly verified through ZK proofs, and this accuracy serves as a proxy for the combined value of the data sources. The collection can then be bought or sold on the marketplace. ZK proofs are ideal for this use case because we require verification that a public test set has produced certain results when run through a private model.
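To illustrate the commit-and-verify shape described above without implementing actual ZK circuits, here is a hedged Python sketch. The hash commitment stands in for a real ZK commitment, and `price_from_accuracy` with its 0.5 random-guess floor is a toy pricing rule of our own invention (it assumes a balanced binary task); the real system would replace both with circuit-backed proofs and a learned pricing model.

```python
import hashlib

def commit(model_params: bytes) -> str:
    """Publish a hash commitment to the private model without revealing it.
    (Stand-in for a ZK commitment scheme.)"""
    return hashlib.sha256(model_params).hexdigest()

def attested_accuracy(predictions, public_labels) -> float:
    """Accuracy on the public test set. In the real system this value would
    be accompanied by a ZK proof that the committed model actually produced
    `predictions` on the public inputs."""
    correct = sum(p == y for p, y in zip(predictions, public_labels))
    return correct / len(public_labels)

def price_from_accuracy(accuracy: float, base_price: float = 1000.0) -> float:
    """Toy pricing rule: value scales with accuracy above a random-guess
    floor of 0.5 (illustrative assumption for a balanced binary task)."""
    return base_price * max(0.0, accuracy - 0.5) * 2
```

A buyer can recompute the accuracy from the published predictions and labels, and the ZK proof ties those predictions to the committed model, so the attested price proxy is checkable without revealing the model itself.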
This product is aimed at anyone who has highly specific data they want to sell (e.g. research labs selling manually classified microscope images of bacteria in their country), and people who would benefit from more general proprietary data models (e.g. startups looking to use bacteria classification models to detect the concentration of certain antibiotic-resistant strains in various areas).
That the product works is easy to verify: simply test that data bought on the marketplace truly produces a model at the attested accuracy.
This solution is vital to the Cardano developer ecosystem. As more tools and software integrate real-world data into Cardano, it is important to be able to evaluate the value of incoming data and, correspondingly, the value of data oracles. As described, this data could be used to build rate-limited AI models that allow certain parties access to insights in a fair, decentralized way. Beyond AI models, this data can be used to produce any form of insight that is valuable to the whole community, distributed fairly to contributors of the insight generation, and, most importantly, co-owned by those contributors. The insights from data combined in a secure way will be more valuable than anything an individual party could derive from its own proprietary data. At the same time, each party can achieve this without revealing all of its data.
All the outputs of the product will be immediately open-sourced with the exception of the algorithm that uses heuristics to establish which potential baskets of data sources are promising combinations to test and price. This algorithm will be open-sourced once it is sufficiently performant because, otherwise, the project could be instantly ported onto any other chain.
The project will be open to any customer. However, data providers will have to verify that they are proposing to sell data of a sufficient quantity so that the system doesn't get overloaded with a very high number of low-quantity sell orders.
This product can be used in future development to allow parties to price other kinds of data, such as streaming data and multi-modal data. These are all areas for future research.
I am an ML engineer with smart contract and Web3 development experience. I believe this unique combination makes me the perfect fit for this project. I am very enthusiastic about democratizing data and levelling the playing field in AI. This means breaking the data moat established by monolithic companies, which requires a method of pooling together smaller data sources in a decentralized, fair, and trustless manner.
I have excelled in many previous endeavours: my academics, graduating top of my cohort at Imperial College London and currently studying at Oxford; my research, currently writing a workshop paper on causal AI, with previous work on the Solana consensus mechanism; and my building, where I've won many hackathons (e.g. IC Hack 2022, UN Privacy Hack 2022, EduDAO Hack 2023).
I can learn fast, research deeply, and engineer solutions precisely, which makes me the perfect fit for a project like this.
Proposed goals: To build a system for pricing data sources and proposing a combined collection of data that is optimally valuable for a particular project at a particular budget. The system will also facilitate queries to this model under conditions predefined by the project proposer.
(1) Collate several public datasets of varying sizes and domains [0.5 months]
(2) Develop a pricing algorithm for datasets [2 months]
(3) Develop a matching algorithm and frontend to allow customers to buy dataset collections at a particular budget [0.5 months]
(4) Establish a way of hosting the foundation model such that only contributors can access it at the agreed rate-limits [1 month]
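As one possible shape for milestone (3), a budget-constrained matcher could be sketched as a greedy value-per-cost selection. This is a simplifying assumption for illustration: the function name and tuple layout are hypothetical, and the real matcher would likely combine this with the basket-scoring heuristic rather than treating prices as independent.

```python
def match_at_budget(priced_datasets, budget):
    """Greedy matching: pick datasets with the best value-per-cost ratio
    until the budget is exhausted. `priced_datasets` is a list of
    (name, cost, value) tuples, where value comes from the pricing model."""
    chosen, spent = [], 0.0
    ranked = sorted(priced_datasets, key=lambda d: d[2] / d[1], reverse=True)
    for name, cost, value in ranked:
        if spent + cost <= budget:  # skip anything that would bust the budget
            chosen.append(name)
            spent += cost
    return chosen, spent
```

Greedy selection by density is fast enough for interactive frontend use; an exact knapsack solver could replace it later if basket values turn out to be strongly non-additive.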
(1) A collection of datasets
(2) An accurate pricing model where prices correlate well with the accuracy of the corresponding foundation models
(3) Clean UX with fast matching experience
(4) Reliable and robust system with relatively low-latency access and low downtime
All costs are listed in Ada.
71000 - Developer hire costs
2000 - Compute resources
1000 - Proprietary data for testing
1000 - Publicity and community engagement
Any further funds required for developer hire will be accrued through further grants or investments.
The industry rate for developers experienced in both Web3 and ML in the UK is £50 (~$65) per hour. This grant would pay for ~300 hours of developer time. Overall, this would allow us to develop a sufficient prototype / proof of concept to secure further grants or even VC funding.
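A quick sanity check of the arithmetic. The implied ADA/USD rate is derived from the figures above (300 hours at $65 against the 71,000 Ada developer-hire line item), not a rate stated in the proposal:

```python
hours = 300
usd_per_hour = 65                      # £50 ~= $65, per the rate quoted above
dev_cost_usd = hours * usd_per_hour    # total developer cost in USD
dev_budget_ada = 71_000                # developer-hire budget line, in Ada
implied_usd_per_ada = dev_cost_usd / dev_budget_ada  # rate implied by the two figures
```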
Even a simple first version of this project will allow clients to price their data which is immensely valuable in itself, even if the transaction between buyers and sellers is conducted in a centralized way. The revenue from this first version will be used to fund the later stages which involve hosting the foundation model and providing secure access to the relevant parties on the agreed terms.
Manuj Mishra: Founder and Chief Developer
Manuj has extensive ML experience and has built multiple ZK projects, most recently winning three prizes at the EduDAO Hack with a ZKML verifier. His ML background begins with his BEng in Maths and CS from Imperial College London, where he graduated top of his cohort, and his MSc in Advanced CS from Oxford, where he focused on cutting-edge ML research. He has applied his ML expertise in industry at Fortune 500 companies (e.g. American Express) and unicorn startups (e.g. Rebellion Defense). He is currently undertaking a founders' residency at HomeDAO, which has already spun out several multi-million-dollar crypto projects in its first year of operation.
https://www.linkedin.com/in/manuj-mishra/
https://github.com/manuj-mishra
https://manuj-mishra.github.io/resume
Future plans: I intend to work with an appropriate co-founder on this project to speed up development. There are many potential options for this within HomeDAO and through personal connections. I have reached out to potential candidates already, especially since I live with most of them.