The absence of standardized LLM benchmarks for Cardano’s smart contract languages hinders the advancement of a robust and efficient AI-assisted development ecosystem.
Wolfram will build an open-source framework to benchmark LLMs on interpreting and generating Cardano smart contracts, measuring correctness, syntax compliance, vulnerability detection, and explainability.
Please provide your proposal title
Wolfram: AI Benchmarks for Cardano
Enter the amount of funding you are requesting in ADA
50000
Please specify how many months you expect your project to last
5
Please indicate if your proposal has been auto-translated
No
Original Language
en
What is the problem you want to solve?
The absence of standardized LLM benchmarks for Cardano’s smart contract languages hinders the advancement of a robust and efficient AI-assisted development ecosystem.
Supporting links
Does your project have any dependencies on other organizations, technical or otherwise?
No
Describe any dependencies or write 'No dependencies'
No dependencies
Will your project's outputs be fully open source?
Yes
License and Additional Information
The project will be open-sourced under the Apache 2.0 license for maximum permissiveness and enterprise adoption. All source code will be published on GitHub from project start. Documentation, API references, and example use cases will be included. Community contributions will be encouraged through public issues and PRs.
Please choose the most relevant theme and tag related to the outcomes of your proposal.
AI
Describe what makes your idea innovative compared to what has been previously funded (whether by you or others).
Unlike prior AI projects in blockchain or Cardano, this initiative builds the first open-source framework to benchmark LLM proficiency in Cardano’s smart contract languages. It combines blockchain-specific validation (syntax, deployment, vulnerability checks) with AI evaluation methods (semantic similarity, explainability). By testing both commercial APIs and open-weight models, it enables fair, extensible comparisons, giving developers actionable insights to adopt AI tools more effectively.
Describe what your prototype or MVP will demonstrate, and where it can be accessed.
The MVP will be a working benchmarking framework, accessible as a public GitHub repository under the Apache 2.0 license from project start. It will include:
A curated dataset of Plutus smart contracts with reference code and test cases.
An LLM execution pipeline capable of running prompts across multiple commercial and local models.
An evaluation module for correctness, syntax compliance, semantic similarity, and vulnerability detection.
A final benchmark report comparing well-known commercial LLMs and open-weight models.
Describe realistic measures of success, ideally with on-chain metrics.
50+ clones of the GitHub repository within 2 months of release.
Benchmark dataset includes 50+ unique contract tasks of varying complexity.
Identification of actionable improvements for Plutus LLM support.
Pull requests contributed by the community.
Smart contracts generated during benchmarking successfully deployed to the Cardano testnet.
Please describe your proposed solution and how it addresses the problem
With Cardano's smart contract ecosystem evolving, its range of smart contract languages, such as Plutus and Aiken, provides developers with flexibility and tailored capabilities. Yet this variety also brings differing levels of complexity, unique learning requirements, and distinct workflows, which can pose entry barriers for newcomers and slow productivity in general. Large Language Models (LLMs) have emerged as transformative tools in software engineering, offering capabilities such as accelerating code generation, explaining complex logic, optimizing performance, and identifying security vulnerabilities.
Within the Cardano ecosystem, LLMs hold significant potential to enhance smart contract development. However, the extent to which these models truly understand Cardano’s smart contract languages remains unclear. We propose the development of a systematic benchmarking framework to evaluate LLM proficiency in interpreting, generating, and explaining Cardano smart contract code. The framework will identify strengths, weaknesses, and language-specific challenges, providing actionable insights to guide developers toward the most LLM-compatible tools and workflows.
The proposed solution aims to:
Benchmark LLM proficiency in interpreting, generating, and explaining Cardano smart contract code.
Identify strengths, weaknesses, and language-specific challenges for each model tested.
Provide actionable insights that guide developers toward the most LLM-compatible tools and workflows.
Diagram for LLM Benchmarking: https://amoeba.wolfram.com/index.php/s/5F2fczGM6tWH8wi
The evaluation framework will assess LLM performance across key dimensions including correctness, syntax compliance, vulnerability detection, and explainability. It will ingest a curated collection of prompt-response pairs, and each prompt will be executed through an LLM Execution Engine capable of interfacing with both commercial LLM APIs and open-weight local models. The generated responses will be processed by an Evaluation Engine, which will compare them to ground truth using techniques such as rule-based correctness checks, syntax compliance validation, embedding-based semantic similarity scoring, and vulnerability detection.
A Performance Metrics Module will output structured evaluation data, visual summaries, and annotated response logs, enabling the Cardano community to benchmark LLM behavior and identify opportunities for improving language toolchains, documentation, and ecosystem readiness for AI-assisted development. The framework will be built using open-source technologies, fully documented, and designed for extensibility so it can adapt to additional Cardano smart contract languages, new evaluation methods, and emerging LLM capabilities.
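As a rough illustration of how these three stages fit together, the sketch below wires a stubbed Execution Engine, a minimal Evaluation Engine, and a metrics aggregator into one loop. All function names and the sample task are hypothetical; the real Execution Engine would call commercial APIs or local models, and the Evaluation Engine would apply the richer checks described above rather than a simple string-similarity ratio.

```python
import difflib

def query_model(prompt: str) -> str:
    """Stub for the LLM Execution Engine: in the real framework this would
    call a commercial API or run a local open-weight model."""
    return "mkValidator :: Datum -> Redeemer -> ScriptContext -> Bool"

def evaluate(response: str, reference: str) -> dict:
    """Stub for the Evaluation Engine: compare a response to ground truth.
    A string-similarity ratio stands in for correctness, syntax, and
    semantic checks."""
    similarity = difflib.SequenceMatcher(None, response, reference).ratio()
    return {"similarity": round(similarity, 3),
            "exact_match": response == reference}

def run_benchmark(tasks: list) -> list:
    """Performance Metrics Module: aggregate per-task scores into
    structured evaluation records."""
    results = []
    for task in tasks:
        response = query_model(task["prompt"])
        scores = evaluate(response, task["reference"])
        results.append({"task_id": task["id"], **scores})
    return results

tasks = [{"id": "plutus-001",
          "prompt": "Write the type signature of a Plutus validator.",
          "reference": "mkValidator :: Datum -> Redeemer -> ScriptContext -> Bool"}]
print(run_benchmark(tasks))
```

In the full framework each record would also carry deployment and vulnerability results, and the module would emit visual summaries and annotated response logs alongside the structured data.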
Please define the positive impact your project will have on the wider Cardano community
This project will deliver immediate and long-term benefits by systematically evaluating how effectively Large Language Models (LLMs) understand and work with Cardano’s smart contract languages. The results will empower developers, researchers, and ecosystem leaders with actionable insights to guide AI integration strategies for the Cardano ecosystem.
The project will:
Give developers evidence-based guidance on which LLMs best support Cardano smart contract development.
Reduce duplicated evaluation effort through a shared, open-source benchmarking framework.
Strengthen documentation and the community knowledge base for AI-assisted development on Cardano.
What is your capability to deliver your project with high levels of trust and accountability? How do you intend to validate if your approach is feasible?
Our team has extensive expertise in Large Language Model (LLM) evaluation, stemming from our work on large-scale public benchmarks and custom internal projects. We have a strong track record in developing comprehensive LLM benchmarks, including the Wolfram Benchmarking Project: https://www.wolfram.com/llm-benchmarking-project/. This work provides us with deep insights into LLM capabilities by rigorously evaluating their performance on complex tasks like code generation.
Our experience is further enriched by the creation and use of a suite of internal benchmarks for specialized AI applications, such as AI tutors for mathematics and biology, and for evaluating tool-assisted mathematical problem-solving. We are proficient in a wide array of evaluation methodologies, from designing robust frameworks and metrics to performing granular error analysis. This allows us to develop a nuanced, actionable understanding of model performance that goes far beyond simple accuracy scores. Wolfram brings its proven expertise in AI, blockchain, and data-driven applications, ensuring the successful delivery of this comprehensive LLM benchmarking project.
Organizational Strengths
Our team has over three decades of leadership in computational science, data science, and AI. We have a multidisciplinary team specializing in AI, blockchain development, data science, and community engagement. Our company has always prided itself on its global talent and knowledge. Along with our innovative tech stack, we have an in-house consulting team that has helped organizations solve their most difficult problems for decades.
Milestone Title
Project Setup & Dataset Curation
Milestone Outputs
A public GitHub repository is initialized under the Apache 2.0 license with clear base documentation, including README and license files. A curated dataset of Plutus smart contract code snippets is prepared, organized into folders, and accompanied by prompt templates with examples that demonstrate intended use.
Acceptance Criteria
The GitHub repository must be publicly accessible and include an initial commit with README, Apache 2.0 license, dataset folder, and template documentation. Example prompts are tested and confirmed to run correctly, ensuring that contributors and community members can reproduce the setup and understand usage.
Evidence of Completion
Public GitHub repository link displaying the dataset folder, documentation, and prompt templates. Repository history confirms an initial commit containing README, license, and curated dataset files.
Delivery Month
1
Cost
10000
Progress
20 %
Milestone Title
Benchmarking Framework Architecture & LLM Integration
Milestone Outputs
The architecture for the LLM execution and code validation pipeline is fully defined and documented, with diagrams and specifications uploaded to GitHub. A working integration layer is implemented that connects with at least one commercial API and one local open-weight model. A collection module for prompts and responses is also established and tested.
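One plausible shape for such an integration layer is a common provider interface behind which commercial APIs and local models are interchangeable. The sketch below is illustrative only; the class and method names are assumptions, not the framework's actual API, and the stubbed `complete` methods stand in for real HTTP calls and local inference.

```python
from abc import ABC, abstractmethod

class LLMProvider(ABC):
    """Common interface so commercial APIs and local open-weight models
    plug into the same pipeline."""
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class CommercialAPIProvider(LLMProvider):
    def __init__(self, model_name: str):
        self.model_name = model_name
    def complete(self, prompt: str) -> str:
        # A real implementation would call the vendor's HTTP API here.
        return f"[{self.model_name}] response to: {prompt}"

class LocalModelProvider(LLMProvider):
    def __init__(self, weights_path: str):
        self.weights_path = weights_path
    def complete(self, prompt: str) -> str:
        # A real implementation would run local inference (e.g. via vLLM).
        return f"[local:{self.weights_path}] response to: {prompt}"

def collect(providers: list, prompt: str) -> list:
    """Prompt–response collection module: record pairs for reproducibility."""
    return [{"provider": type(p).__name__,
             "prompt": prompt,
             "response": p.complete(prompt)} for p in providers]
```

Storing every prompt–response pair alongside the provider identity is what makes later benchmark runs reproducible and auditable.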
Acceptance Criteria
The GitHub repository contains architecture diagrams and technical specifications of the benchmarking pipeline. The system must demonstrate successful retrieval of outputs from both a commercial API and a local model, with prompt–response pairs collected and stored for reproducibility, proving integration works as intended.
Evidence of Completion
Repository updates include architecture diagrams, integration code for API and local models, and a record of successful prompt–response retrievals to show that both sources are functioning in the pipeline.
Delivery Month
2
Cost
10000
Progress
40 %
Milestone Title
Evaluation Engine & Metrics Implementation
Milestone Outputs
A comprehensive evaluation engine is implemented with modules for rule-based correctness and syntax compliance of Plutus contracts. Validators check blockchain-specific criteria including deployment success, execution cost efficiency, and vulnerability detection. A semantic similarity module using embedding scoring is integrated, with unit tests confirming accuracy.
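The embedding-based semantic similarity scoring mentioned above typically reduces to a cosine similarity between two vectors. The toy sketch below uses bag-of-words counts in place of a neural embedding model (an assumption made purely to keep the example self-contained); the real module would embed texts with a learned model and apply the same cosine computation.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; the real module would use a neural
    embedding model instead."""
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def semantic_score(response: str, reference: str) -> float:
    """Score a model response against the reference answer."""
    return cosine_similarity(embed(response), embed(reference))
```

Identical texts score 1.0, disjoint texts score 0.0, and partial overlap falls in between, which is exactly the behavior unit tests for the module would pin down.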
Acceptance Criteria
All evaluation modules, including syntax, correctness, and semantic similarity, must pass unit tests on a curated set of sample inputs. Preliminary benchmarking results comparing multiple LLMs must be generated and published in a structured report, demonstrating that the evaluation engine produces consistent, actionable outputs.
Evidence of Completion
GitHub repository updated with evaluation engine source code and test cases, plus an initial benchmarking report comparing LLMs. Both PDF and Markdown formats are available for review by the community.
Delivery Month
3
Cost
10000
Progress
60 %
Milestone Title
Security Assessment & Continuous Benchmarking Setup
Milestone Outputs
A security and vulnerability assessment module is created and integrated into the benchmarking pipeline. This module checks for known vulnerability patterns in Plutus smart contracts. Documentation is written explaining how to extend the framework to additional Cardano languages, providing step-by-step guidance for community developers.
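A pattern-based scan of this kind can be sketched as a table of named patterns matched against contract source. The patterns below are illustrative placeholders, not real Plutus vulnerability signatures; the actual module would encode known vulnerability classes and likely combine pattern matching with deeper static checks.

```python
import re

# Illustrative patterns only — stand-ins for real Plutus vulnerability
# signatures the module would ship with.
VULNERABILITY_PATTERNS = {
    "hardcoded-key": re.compile(r'PubKeyHash\s+"[0-9a-f]+"'),
    "debug-trace-left-in": re.compile(r"\btraceError\b|\btrace\b"),
}

def scan_contract(source: str) -> list:
    """Return the names of all vulnerability patterns found in the source."""
    return [name for name, pattern in VULNERABILITY_PATTERNS.items()
            if pattern.search(source)]
```

Keeping the patterns in a data table rather than hard-coded logic is what makes the module easy to extend to additional Cardano languages, as the accompanying documentation would describe.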
Acceptance Criteria
The security assessment module must successfully detect known vulnerabilities across a curated set of sample contracts. Documentation for extending the benchmark to other Cardano languages is reviewed by community developers and validated for clarity and completeness, ensuring long-term extensibility.
Evidence of Completion
GitHub repository includes the security module source code and examples of detected vulnerabilities, as well as extension documentation that has been publicly shared for review and feedback.
Delivery Month
4
Cost
10000
Progress
80 %
Milestone Title
Final Benchmark Report & Open-Source Release
Milestone Outputs
The complete benchmarking framework is released publicly under Apache 2.0 with full source code, datasets, and detailed documentation. A comprehensive final report compares all tested LLMs across correctness, syntax, vulnerability, and explainability. Supporting materials include video tutorials, user guides, developer documentation, and a community presentation summarizing findings.
Acceptance Criteria
The GitHub repository contains the final benchmarking framework with all supporting materials. A final benchmark report is published in PDF and Markdown formats, alongside tutorials, walkthrough videos, and clear documentation. Feedback from community testing is incorporated to ensure the release is robust and accessible.
Evidence of Completion
Public GitHub repository includes final code, datasets, and documentation. A recorded walkthrough or demo session is uploaded alongside the benchmark report and tutorials to provide evidence of completion and usability.
Delivery Month
5
Cost
10000
Progress
100 %
Please provide a cost breakdown of the proposed work and resources
The total project budget is 50,000 ADA, allocated to ensure each development stage is fully resourced and tied to measurable outputs.
Project Setup & Dataset Curation – 10,000 ADA covers initial repository creation under Apache 2.0 license, dataset collection of Plutus smart contract snippets, and definition of prompt templates.
Benchmarking Framework Architecture & LLM Integration – 10,000 ADA funds the design and documentation of the LLM Execution & Code Validation Pipeline, integration with both commercial LLM APIs and local models, and development of a structured prompt–response collection system.
Evaluation Engine & Preliminary Benchmarking – 10,000 ADA supports the implementation of rule-based correctness checks, syntax compliance validation, semantic similarity scoring, and blockchain-specific deployment tests, culminating in the first benchmarking report across multiple LLMs.
Security Assessment & Continuous Benchmarking Setup – 10,000 ADA enables integration of vulnerability detection into the evaluation pipeline and preparation of documentation for extending benchmarks to other Cardano smart contract languages.
Final Benchmark Report & Open-Source Release – 10,000 ADA delivers the complete open-source benchmarking framework, final comparative benchmark report, developer and user documentation, video tutorials, community presentation, and incorporation of community feedback into the final release.
How does the cost of the project represent value for the Cardano ecosystem?
This project aims to deliver lasting value without unnecessary cost. By creating an open-source benchmarking framework, we reduce duplication of effort and make future model evaluations more efficient. Developers gain actionable insights into LLM performance, saving time otherwise spent on trial-and-error. The framework also strengthens Cardano’s developer community through improved documentation and a shared knowledge base. Released under Apache 2.0, it can be maintained and extended at low cost, ensuring scalability and continued ecosystem benefit.
Terms and Conditions:
Yes
Jon Woodard, CEO
Jon Woodard is the CEO of Wolfram Blockchain Labs, where he coordinates the decentralized projects that connect the Wolfram Technology ecosystem to different DLT ecosystems. Previously, at Wolfram Research, Jon worked on projects at the direction of CEO Stephen Wolfram, and prior to that he was a member of the team that worked on monetization strategies and execution for Wolfram|Alpha. Jon has a background in economics and computational neuroscience. He enjoys cycling in his spare time.
Steph Macurdy, Head of Research and Education
Steph Macurdy has a background in economics, with a focus on complex systems. He attended the Real World Risk Institute in 2019, led by Nassim Taleb, and has been investing in the crypto asset space since 2015. He previously worked for Tesla as an energy advisor and Cambridge Associates as an investment analyst. Steph is a youth soccer coach in the Philadelphia area and is interested in permaculture.
Gaurav Vishal, Manager
Gaurav Vishal is a Manager in Wolfram Research’s Technical Consulting team with over 6 years of experience in designing and delivering computational applications and enterprise AI solutions. He specializes in full-stack development, data analysis, and machine learning, with expertise in microservices architecture and distributed data systems. Since joining Wolfram in 2019, he has led projects ranging from AI-powered tutoring platforms and large-scale data integration pipelines to secure on-premise analytics systems for enterprise and government clients.
Gaurav holds a B.Tech degree from IIT Bhubaneswar, where he earned the Institute’s Silver Medal for academic excellence. He also received the TSIL Research Partner Award for his research on heat transfer in coal-fired Sponge Iron Rotary Kilns. Known for his focus on computational efficiency, system reliability, and client satisfaction, Gaurav has contributed to high-impact technical solutions for Fortune 500 companies, academic institutions, and public sector organizations.
Subesh Sonthalia, Application Developer
Subesh Sonthalia is an Application Developer at Wolfram, specializing in scalable software architecture and AI-driven systems. With a Master’s in Mechanical Engineering focused on Microfabrication and Simulation Optimization, he combines engineering precision with computational innovation to build solutions that reflect Wolfram’s vision. Outside of work, he enjoys adventure sports.
Sanjeet Patra, Application Developer
Sanjeet Patra is an Application Developer at Wolfram Technology Consulting, building products aligned with Wolfram’s mission and technological vision. He previously worked as an internal combustion engine engineer, gaining deep expertise in automotive systems. His career journey reflects a versatile skill set, spanning core mechanical engineering to modern software development. Outside of work, he enjoys playing cricket and soccer.
Gabriela Guerra Galan, Project Manager
Gabriela has 15+ years of experience leading projects. She is a certified PMP and Product Owner with a bachelor's degree in Mechatronics Engineering, complemented by a master's degree in Automotive Engineering. As the co-founder of Bloinx, a startup that secured funding from the UNICEF Innovation Fund, she has demonstrated a passion for driving innovation and social impact.