SolidityBench by IQ has recently been introduced as the first leaderboard designed to evaluate large language models (LLMs) on Solidity code generation. Now available on Hugging Face, it comprises two benchmarks, NaïveJudge and HumanEval for Solidity, tailored to assess and rank how proficiently AI models generate smart contract code.
Developed by BrainDAO, a subsidiary of IQ, SolidityBench is part of the upcoming IQ Code suite, which aims to provide AI models specialized in generating and auditing smart contract code, addressing the growing demand for secure and efficient blockchain applications.
According to a report by CryptoSlate, NaïveJudge introduces a novel approach by challenging LLMs to implement smart contracts based on detailed specifications derived from audited OpenZeppelin contracts. These contracts serve as a benchmark for correctness and efficiency. The generated code is then evaluated against a reference implementation using criteria such as functional completeness, adherence to Solidity best practices and security standards, as well as optimization efficiency.
The evaluation process involves advanced LLMs, including versions of OpenAI’s GPT-4 and Anthropic’s Claude 3.5 Sonnet, acting as impartial code reviewers. They evaluate the code against strict criteria, including the implementation of key functionalities, handling of edge cases, error management, proper syntax usage, and overall code structure and maintainability.
Additionally, optimization considerations such as gas efficiency and storage management are taken into account. Scores range from 0 to 100, providing a comprehensive assessment across functionality, security, and efficiency that mirrors the complexities of professional smart contract development.
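To make the LLM-as-judge idea concrete, here is a minimal sketch of how a generated contract could be scored against an audited reference along the axes described above, using the openai Node package. The rubric text, choice of judge model, and equal weighting of the three axes are illustrative assumptions, not the actual NaïveJudge implementation.

```typescript
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Hypothetical rubric mirroring the axes described above; the real NaïveJudge
// prompt and scoring weights are not reproduced here.
const RUBRIC = `Score the candidate Solidity contract against the reference
implementation, 0-100 per axis:
- functionality: key features, edge cases, error handling
- security: Solidity best practices, known-vulnerability avoidance
- efficiency: gas usage and storage layout
Return JSON: {"functionality": n, "security": n, "efficiency": n}`;

async function judge(reference: string, candidate: string): Promise<number> {
  const completion = await client.chat.completions.create({
    model: "gpt-4o", // one of several judge models could be used
    messages: [
      { role: "system", content: RUBRIC },
      { role: "user", content: `Reference:\n${reference}\n\nCandidate:\n${candidate}` },
    ],
    response_format: { type: "json_object" },
  });
  const scores = JSON.parse(completion.choices[0].message.content ?? "{}");
  // Equal weighting is an assumption; the leaderboard's aggregation may differ.
  return (scores.functionality + scores.security + scores.efficiency) / 3;
}
```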
In the benchmarking results, OpenAI’s GPT-4o model emerged as the top performer with an overall score of 80.05. It achieved a NaïveJudge score of 72.18 and HumanEval for Solidity pass rates of 80% at pass@1 and 92% at pass@3. Other models, including OpenAI’s o1-preview and o1-mini as well as models from Anthropic and xAI, also demonstrated competitive performance.
HumanEval for Solidity, adapted from OpenAI’s HumanEval benchmark, comprises 25 Solidity tasks of varying difficulty. Each task ships with tests compatible with Hardhat, so generated code can be compiled and tested automatically. The evaluation metrics, pass@1 and pass@3, measure how often a model’s first attempt passes the tests and how often at least one of three attempts does, respectively.
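pass@k is conventionally computed with the unbiased estimator from the original HumanEval paper (Chen et al., 2021); the sketch below assumes that convention, and the per-task sample counts are hypothetical rather than taken from the leaderboard.

```typescript
/**
 * Unbiased pass@k estimator: pass@k = 1 - C(n - c, k) / C(n, k),
 * where n is the number of samples generated for a task and c is the
 * number of samples that compile and pass all of the task's tests.
 */
function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1.0; // every k-sample subset contains a passing solution
  let prob = 1.0;
  for (let i = n - c + 1; i <= n; i++) {
    prob *= 1 - k / i; // running product equals C(n - c, k) / C(n, k)
  }
  return 1 - prob;
}

// The benchmark score is the mean of the per-task estimates.
const passingCountsPerTask = [5, 3, 0, 4, 1]; // hypothetical: passes out of n samples
const n = 5;
for (const k of [1, 3]) {
  const mean =
    passingCountsPerTask.reduce((sum, c) => sum + passAtK(n, c, k), 0) /
    passingCountsPerTask.length;
  console.log(`pass@${k} ≈ ${(mean * 100).toFixed(1)}%`);
}
```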
The introduction of SolidityBench aims to advance AI-assisted smart contract development by encouraging the creation of more sophisticated and reliable models. It also offers insight into AI’s current capabilities and limitations in Solidity, setting a new benchmark standard for the blockchain ecosystem.
Developers, researchers, and AI enthusiasts are encouraged to explore and contribute to SolidityBench to drive the continuous refinement of AI models, promote best practices, and advance decentralized applications. Visit the SolidityBench leaderboard on Hugging Face to learn more and start benchmarking Solidity generation models.