Matchbox DAO

Posted on Oct 30, 2023

Project Agent 0xTitans Dev Notes #1

This will be an experimental series of notes reflecting our R&D process. The notes are highly raw and represent conclusions that are still being battle tested; some of them may have inaccuracies and might change as we move forward. We believe our insights and process documentation will be highly valuable for other builders looking to develop LLM agents in general and LLM applications in Web3 in particular.

Intro

0xTitans is a fully on-chain game platform. One of the main gaming experiences on the platform is 0xMonaco V2, a strategy racing code game in which three smart contract-based cars compete against each other in a highly adversarial, resource-limited game. For more information about the game mechanics, please refer to this previous article and this tweet from one of the top players.

Up until now, 0xTitans was an exclusive experience limited to the best companies and technical teams in web3, competing against each other and showcasing their technical skills and strategic prowess. Polygon, Uniswap, Kyber Network, YGG, and Optimism are just a handful of the more than 100 teams that have participated so far in the esports events. But now, for the first time, we are looking to harness the power of LLMs to unlock a no-code game experience for everyone.


"AI Agent behind the wheel"

Strategy Crafting Components

Crafting a strategy for the game comprises three distinct phases:

  1. Meta-reasoning: the core guiding principles, framework, or meta-heuristic that shape the design process of a strategy. One example of such a principle is applied game theory.

  2. Strategy Specification: This phase is dedicated to the detailed elaboration of the strategy's components, which can be illustrated through pseudocode or a higher-level spec (see the sketch after this list).

  3. Implementation: This involves transforming the pseudocode or the strategy specification into actionable Solidity code. It's crucial that this code is syntactically correct and aligns seamlessly with the game contract and associated interfaces.
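To make the distinction between phases 2 and 3 concrete, here is a hypothetical strategy specification sketched in Python. The names (CarState, should_buy_shell, the thresholds) are invented for illustration and do not reflect the actual 0xTitans interfaces; the point is that phase 2 captures the decision logic without touching Solidity.

```python
# Hypothetical phase-2 specification. CarState, should_buy_shell and the
# thresholds are invented for this sketch and do not match the real
# 0xTitans interfaces.

from dataclasses import dataclass

@dataclass
class CarState:
    balance: int     # remaining in-game coins
    speed: int       # current speed
    y: int           # distance travelled
    shell_cost: int  # current VRGDA price of a shell
    accel_cost: int  # current VRGDA price of one unit of acceleration

def should_buy_shell(me: CarState, leader: CarState) -> bool:
    """Fire a shell only when the leader nears the finish and the price is sane."""
    return leader.y > 800 and me.shell_cost < me.balance // 4

def acceleration_to_buy(me: CarState) -> int:
    """Spend aggressively early, conserve late (simple banded heuristic)."""
    budget = me.balance // 3 if me.y < 300 else me.balance // 6
    return budget // max(me.accel_cost, 1)
```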

While several advanced models offer high performance, many lack a comprehensive understanding of Solidity and the web3 context. Our aim was to assess these models, particularly GPT-4 (given its renowned reasoning capabilities), on strategy specification (2) in isolation from Solidity generation (3). To facilitate this evaluation, we employed a Python version of our game contracts. The outcomes were promising, though a recurring challenge with chat-based LLMs was a loss of accuracy, especially in mathematical computations. This matters in our specific scenario because VRGDA pricing is an integral component of our strategy crafting process and game mechanics.
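As an example of the arithmetic involved, here is a minimal sketch of a linear VRGDA price calculation in Python. The parameter values are placeholders and the in-game implementation may use a different schedule and fixed-point math, but this is exactly the kind of computation where chat-based LLMs tended to slip.

```python
# Linear VRGDA price sketch. Parameter values are placeholders; the game may
# use a different schedule and fixed-point math on-chain.

def linear_vrgda_price(target_price: float,
                       decay_percent: float,
                       units_per_turn: float,
                       turns_elapsed: float,
                       units_sold: int) -> float:
    """p(t) = target_price * (1 - decay_percent) ** (t - n / r)

    t: turns since the race started, n: units already sold,
    r: target issuance rate in units per turn."""
    schedule_offset = units_sold / units_per_turn  # f^-1(n) for a linear schedule
    return target_price * (1.0 - decay_percent) ** (turns_elapsed - schedule_offset)

# Sales lagging the schedule push the price below target; sales running ahead
# of it push the price above target.
print(linear_vrgda_price(10.0, 0.2, 2.0, turns_elapsed=5, units_sold=6))  # 6.4
```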


Deciphering Code Model Evaluation: Trade-offs and Differences

We are rigorously evaluating a range of models, both general-purpose and specialized in code-related tasks:

  • CodeLlama

  • StarCoder

  • CodeGen2.5-7B-multi

  • Solidity Generator

  • MistralAI

  • Wizard-coder

When evaluating these models, several considerations emerge:

  • Pre-evaluation Benchmarks: Prior to diving into our own evaluations, one needs to consider the existing benchmarks that measure these models' code-generation capabilities, such as HumanEval, EvalPlus, and MultiPL-E.

  • Code Evaluation Complexity: Assessing the code capabilities of LLMs is not straightforward. With many evaluation frameworks available, there are disparities in the programming languages they emphasize, the nature and complexity of the tasks they entail, and so on (a short sketch of the underlying pass@k metric follows this list).
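For reference, most of these benchmarks report pass@k: generate n candidate solutions per problem, count the c that pass the unit tests, and estimate the probability that at least one of k sampled candidates passes. A minimal sketch of the unbiased estimator popularized by the Codex paper:

```python
# pass@k as used by HumanEval-style benchmarks: from n generated samples per
# problem, c of which pass the unit tests, estimate the probability that at
# least one of k randomly drawn samples passes.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator: pass@k = 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 37 of them passing.
print(pass_at_k(200, 37, 1))   # ~0.185
print(pass_at_k(200, 37, 10))  # ~0.88
```

Because the metric depends entirely on which problems and unit tests are in the suite, scores are only comparable within a single benchmark variant.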

Our research has led us to multiple attempts aimed at standardizing the industry's approach to code evaluation. One such noteworthy evaluation can be viewed here:

https://huggingface.co/blog/codellama#evaluation

Models sorted by avg score

Moreover, this leaderboard offers more insight.

https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard

From these evaluations, some models like Phind-CodeLlama-34B-v2 have showcased exemplary performance, even surpassing GPT-4 on HumanEval. Another model worth highlighting is WizardCoder-15B, which demonstrated impressive results on the Python eval, matching or even exceeding larger models in efficiency (due to its smaller size and architecture it delivers almost 3x the performance of the bigger Llama models).

However, biases in these evaluations do exist. For instance, given the Python-centric nature of the eval, models specifically fine-tuned for Python can undeniably excel. This is evident in models such as Phind-CodeLlama-34B-Python-v1 and WizardCoder-Python-13B-V1.0. It seems that specific adaptations in training or fine-tuning can lead to much higher results in certain variations of these evaluations; for example, fine-tuning or training on a higher-quality Java dataset can lead to a higher score on the Java eval.

Yet these variances make it challenging to derive overarching conclusions, especially for specialized tasks such as debugging a Solidity function.


Assessment of Solidity-specific Training and Performance

Incorporating both open-source and proprietary LLMs into Web3 has been a challenge due to the limited Solidity training in existing models and the scarcity of high-quality datasets. Although Solidity shares foundational elements with other languages, it has unique attributes, especially concerning security, design patterns, and resource optimization.

During our research, we discovered that the BigCode pack includes a significant portion of Solidity in its dataset. However, not all the models disclosed the exact dataset they used for their training or fine-tuning. Models that incorporate the BigCode dataset include StarCoder and Wizard-coder. The latter is based on a pre-trained StarCoder using the Evol-Instruct method, which emphasizes training with longer and more complex code and evolving prompts.
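As a rough illustration of what "evolving prompts" means in that setting, the sketch below repeatedly asks an LLM to rewrite a seed coding task into a harder variant; the evolved task/solution pairs then become fine-tuning data. The llm() function is a stub, and the evolution templates are invented for illustration rather than taken from the actual WizardCoder recipe.

```python
# Toy illustration of prompt evolution. llm() is a stub for a chat-completion
# call, and the templates are invented for illustration, not the actual
# WizardCoder recipe.

import random

EVOLUTION_TEMPLATES = [
    "Rewrite this coding task so it must handle an extra edge case:\n{task}",
    "Rewrite this coding task with an added gas or memory constraint:\n{task}",
    "Rewrite this coding task so it needs a more complex data structure:\n{task}",
]

def llm(prompt: str) -> str:
    """Stub for the model call; a real implementation would return the rewrite."""
    return prompt.splitlines()[-1]

def evolve(seed_task: str, rounds: int = 3) -> list[str]:
    """Produce a chain of progressively harder variants of one seed task."""
    tasks = [seed_task]
    for _ in range(rounds):
        template = random.choice(EVOLUTION_TEMPLATES)
        tasks.append(llm(template.format(task=tasks[-1])))
    return tasks

for task in evolve("Write a Solidity function returning the larger of two uints"):
    print(task)
```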

We are currently evaluating all the models to gain a firsthand understanding of their ability to produce clear and high-quality Solidity code, assuming the correct prompt is provided.
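A minimal harness of the kind we use for these spot checks is sketched below: prompt an open code model and pass the completion through solc as a cheap syntactic gate. The model choice, prompt, and solc invocation are illustrative assumptions rather than our exact setup.

```python
# Spot-check harness: prompt an open code model for Solidity and gate the
# output with solc. Model id, prompt, and the solc call are illustrative
# assumptions, not our exact evaluation setup.

import subprocess
import tempfile

from transformers import pipeline

generator = pipeline("text-generation", model="bigcode/starcoder", device_map="auto")

PROMPT = (
    "// SPDX-License-Identifier: MIT\n"
    "pragma solidity ^0.8.13;\n\n"
    "// Return the larger of two unsigned integers.\n"
    "function max(uint256 a, uint256 b) pure returns (uint256) {\n"
)

completion = generator(PROMPT, max_new_tokens=128, do_sample=False)[0]["generated_text"]

# Naive validity gate: does the completed file at least compile?
with tempfile.NamedTemporaryFile("w", suffix=".sol", delete=False) as f:
    f.write(completion)
    path = f.name

result = subprocess.run(["solc", "--bin", path], capture_output=True, text=True)
print("compiles" if result.returncode == 0 else result.stderr)
```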


Model Adaptations and Hypotheses Under Examination

Our exploration is multifaceted:

  • Literature Survey: We're delving into the current landscape of specialized LLMs, studying potential adaptations to boost performance in niche domains, such as Solidity and strategy creation for 0xTitans.

  • Comparative Analysis: This involves juxtaposing diverse methodologies, keeping in view parameters like cost, complexity, potential returns, and their adaptability to additional web3 applications.

Possible enhancement strategies could encompass modifying training procedures in existing models or employing fine-tuning techniques enriched with frequently updated embeddings.
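One way to read "frequently updated embeddings" is retrieval-augmented prompting: keep a small, regularly refreshed index of Solidity snippets and game documentation, and prepend the closest matches to the prompt at generation time. A minimal sketch, assuming the sentence-transformers library and a toy corpus invented for illustration:

```python
# Retrieval-augmented prompting sketch: embed a small, regularly refreshed
# corpus and prepend the closest snippets to the user request. The corpus
# entries and model name are illustrative.

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "The car contract exposes a single turn function that receives the full race state.",
    "Shell prices follow a VRGDA curve and rise when purchases outpace the target rate.",
    "Unchecked arithmetic blocks save gas but must be shown not to overflow.",
]
corpus_emb = embedder.encode(corpus, convert_to_tensor=True)

def build_prompt(user_request: str, top_k: int = 2) -> str:
    """Prepend the top_k most relevant corpus snippets to the request."""
    query_emb = embedder.encode(user_request, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=top_k)[0]
    context = "\n".join(corpus[hit["corpus_id"]] for hit in hits)
    return f"Context:\n{context}\n\nTask:\n{user_request}"

print(build_prompt("Write the turn logic for an aggressive early-spender car"))
```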

Furthermore, we are keen on:

  • Investigating how different models respond to identical adaptations. For example, a general-purpose Mistral fine-tuned for web3/Solidity might outperform a code-centric LLM undergoing the same fine-tuning.

  • Evaluating the impact of varying dataset sizes and their quality on LLM performance. A recent example of this is the Collama Wikipedia-style community LLM, which leverages a meticulously curated web3 dataset.


High-level thoughts on designing an autonomous agent system

Designing an autonomous agent system necessitates a strategic and layered approach. Each agent is tasked with producing inputs for the subsequent agent, ensuring there is a validation mechanism at every juncture. Opting for a multi-step approval system with multiple LLMs is viewed as more effective than leaning solely on a single model, regardless of its prowess. This methodology ensures a collective effort of specialized models working in harmony rather than overextending a single model's capabilities, providing a level of robustness and versatility for diverse applications. The ideal system is visualized as comprising agent components, each excelling in its own domain, facilitating a modular and adaptable approach to agent-system construction.

In the context of the Titans-system, the workflow for an autonomous agent system could be visualized as follows:

  1. The user provides an input encompassing both intent and preferences.

  2. A chat LLM interprets this user input, transforming it into a structured and actionable plan.

  3. This central "Orchestrator" LLM has the capability to communicate with specialized expert agents, such as Math-agents with superior expertise in specific domains, or it can directly initiate API calls. The outcome of this stage would be a comprehensive specification or pseudo-code.

  4. This detailed specification then becomes the directive for the code/Solidity LLM, which is responsible for crafting the strategy contract. To ensure the integrity and validity of the generated code, additional validation mechanisms and safeguards are implemented. If the code fails these checks, the system will prompt a re-generation.

  5. Once validated, the code undergoes testing in a simulated environment. Based on the results and insights derived from these tests, further iterations and refinements can be made, leveraging the fresh data acquired from the testing phase.

System Chart
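A structural sketch of that five-step loop is below. Every function here is a stub standing in for an LLM or tool invocation (chat model, expert agents, code model, compiler checks, race simulator); none of these names exist in our codebase, and the retry/refinement policy is simplified to a fixed number of rounds.

```python
# Skeleton of the five-step workflow. Every function below is a stub standing
# in for an LLM or tool call; the names are invented for this sketch.

from dataclasses import dataclass

@dataclass
class StrategyResult:
    code: str
    win_rate: float

def call_orchestrator(user_intent: str) -> str:
    """Steps 1-3: chat LLM plus expert agents turn intent into a spec (stubbed)."""
    return f"SPEC for: {user_intent}"

def call_solidity_model(spec: str, feedback: str = "") -> str:
    """Step 4: code LLM drafts the strategy contract from the spec (stubbed)."""
    return f"// contract drafted from {spec!r}; feedback: {feedback!r}"

def validate(code: str) -> bool:
    """Step 4: compile, lint, and run static safety checks (stubbed)."""
    return len(code) > 0

def simulate(code: str) -> float:
    """Step 5: race the contract against reference cars, return a win rate (stubbed)."""
    return 0.5

def build_strategy(user_intent: str, max_rounds: int = 3) -> StrategyResult:
    spec = call_orchestrator(user_intent)
    best = StrategyResult(code="", win_rate=0.0)
    feedback = ""
    for _ in range(max_rounds):
        code = call_solidity_model(spec, feedback)
        if not validate(code):
            feedback = "previous draft failed validation; regenerate"
            continue
        win_rate = simulate(code)
        if win_rate > best.win_rate:
            best = StrategyResult(code, win_rate)
        feedback = f"simulated win rate {win_rate:.0%}; refine the weakest turns"
    return best

print(build_strategy("aggressive early spender that saves one shell for the endgame"))
```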

If you are working on LLM applications, want to learn more, or have an interesting use case for Web3, please reach out to us via MatchboxDAO.

Nate, Product Lead and Research
Nick, Tech Lead
