Agent Token Protocol

Posted on May 3, 2023

Moving Beyond Naive Chatbots

LLMs and Wallet Messaging

The world already seems changed since ChatGPT's November 30th launch, with AI becoming a powerful force and a major policy topic, as it rightly should be.

Relay was founded on the trend-line of increasing benefits to communicating as your web3 identity. With blockchain-aware LLM agents, we can turn complex DeFi flows into clear conversations, turn confused devs flooding Discord channels into transparent DevRel learning sessions, and turn helpful onboarding chatbots into user-wallet-aware upselling agents of commerce.

Looking relatively far into the future for this nascent space, to roughly Q1 2024: we're developing LLM agents able to safely self-custody their crypto, communicate with other bots, and transact with them. The Relay interfaces and communication rails through ENS and wallet chat can become the de facto standard as the easiest way for bots to (1) control their crypto, (2) communicate, and (3) transact.

The Relay Robot

The Relay Robot is a system for foundational models that significantly improves on baseline LLM solutions by implementing what we call an LLM trajectory framework, an idea Relay adapted from the well-known ReAct paper. It's useful to think of a "trajectory" as an LLM's train of thought or its "trajectory through reasoning-space".

We instrument an LLM with external tools, show the LLM how to use the tools, and then observe the LLM's trajectories as it responds to both a training environment and real user queries. By tracing, modeling, and analyzing these trajectories (sometimes in real-time), we can massively improve upon baseline LLM behavior and deliver production-ready solutions.

In this post we walk through the problems with the product you get by following the first (and great!) OpenAI data-retrieval tutorials, and then we go over our solutions, both practical and theoretical, that improve a web3-enabled chatbot used for education and transaction assistance.

Lastly we discuss Relay’s vision for how the AI agents that we and others build will use our LLM framework for oversight and safety, plus Relay’s wallet-messaging rails, to:

  1. self-custody and have plausible independent ownership of their crypto

  2. communicate with other user-wallets (both human and AI)

  3. intelligently transact crypto and NFTs with humans and other AIs. Trading, buying digital goods, buying services: it's all possible.

The AI x blockchain industry is growing fast; we also just published our landscape overview post here.

The Baseline Solution

Advances in language model technology have made it straightforward to build a proof-of-concept chatbot with access to an external knowledge base. These default early solutions all follow roughly the same core architecture:

  1. Download a bunch of data from the internet

  2. Chunk up the data into LLM-manageable pieces

  3. Convert the text data into embeddings using an embeddings model (an embedding is a quantitative representation of the semantics of a piece of text)

  4. Upload the embeddings into a database

  5. When a user asks a question, convert the question into an embedding and use it to search the embeddings database for related documents

  6. Pass the user’s question and results of the search to a language model

  7. Return the language model’s answer to the user
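
To make the pattern concrete, here's a minimal sketch of the whole pipeline, assuming the pre-1.0 openai Python package, the text-embedding-ada-002 embeddings model, and a plain in-memory list standing in for the embeddings database (the document chunks are placeholders):

```python
# A minimal sketch of the baseline pipeline, not production code.
# Assumes the pre-1.0 `openai` Python package and an in-memory list
# standing in for the embeddings database.
import numpy as np
import openai

EMBED_MODEL = "text-embedding-ada-002"

def embed(text: str) -> np.ndarray:
    resp = openai.Embedding.create(model=EMBED_MODEL, input=text)
    return np.array(resp["data"][0]["embedding"])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Steps 1-4: chunked documents (placeholders) and their embeddings.
chunks = ["Relay lets you message any wallet.", "ENS names map to addresses."]
index = [(chunk, embed(chunk)) for chunk in chunks]

def answer(question: str) -> str:
    # Step 5: embed the question and retrieve the most similar chunks.
    q = embed(question)
    top = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)[:3]
    context = "\n".join(chunk for chunk, _ in top)
    # Steps 6-7: pass the question plus retrieved context to the LLM.
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp["choices"][0]["message"]["content"]
```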

Building a chatbot in this way feels somewhat magical; it's pretty incredible the extent to which these bots "just work" without running headfirst into additional complexity. That said, most of them aren't suitable for production use-cases and face some major challenges, especially at scale.

Problems with the Baseline

💰 Expensive

The state-of-the-art, publicly-available-for-production LLM (GPT-4 from OpenAI) costs $0.06 per 1,000 tokens (1-2 pages of English text) generated by the LLM. A single request to GPT-4 can cost up to $0.48. For language models to behave in a way that feels natural to users, they require entire conversation histories to be passed in with every request, so it is quite common to run into this upper limit. Even at very modest scale this is quite expensive (10,000 requests at 2,000 tokens per request would cost ~$1,000).
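
For a back-of-envelope check of these numbers, assuming OpenAI's May 2023 list prices of $0.03 per 1,000 prompt tokens and $0.06 per 1,000 completion tokens:

```python
# Rough GPT-4 cost model (assumed May 2023 list prices, USD per 1K tokens).
PROMPT_RATE, COMPLETION_RATE = 0.03, 0.06

def request_cost(prompt_tokens: int, completion_tokens: int) -> float:
    return (prompt_tokens / 1000) * PROMPT_RATE \
         + (completion_tokens / 1000) * COMPLETION_RATE

# 10,000 requests at ~2,000 tokens each, split evenly between
# prompt (conversation history) and completion:
print(10_000 * request_cost(1_000, 1_000))  # 900.0, in line with ~$1,000 above
```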

🦥 Slow

A single request to GPT-4 can take anywhere between 3 and 30 seconds depending on server load. In a conversational setting, this amount of latency is unacceptable to users and leads to "conversation abandonment". Latency is decreasing as OpenAI scales its servers, but caching and UI/UX patterns still have to account for these atypically long response times.

🔐 Vendor Lock-In

Every language model (even different "flavors" of the same underlying model) displays unique behaviors and reacts differently to identical inputs. Migrating a system from one model to another is more of an art than a science.

💥 Unreliable

Commercialized language models are at once extremely novel, extremely resource-intensive, and in extremely high demand. These factors produce a level of unreliability that is mostly unheard of in modern web applications, so potential API downtime must be designed around.

☢️ Unstable

Language models are somewhat unpredictable. They actually aren't even deterministic: if you pass a language model the same input 100 times, you'll end up with 3-5 seemingly random answers. This can lead to confusing and heterogeneous user experiences and will undermine users' trust in the platform.

💔 Brittle

Language models feel extremely powerful, and they are, but their limits become obvious with non-trivial use. If all you want is for a language model to literally answer a question, with no constraints on the kind of answer, they work well, but that's about it. The default LLM experience will grow stale, leaving users bored and unenthusiastic about your brand.

⬜️ Generic

Language models don’t adhere to any particular brand. If you go to example.com and talk to their GPT-4-powered chatbot and then go uniswap.org and talk to its chatbot, they will be indistinguishable from each other.

🤦‍♂️ Impersonal

Language models are not only generic, they're also completely impersonal. They don't know "who" they're talking to, "why" they're talking to them, or what the user wants. Talking to a language model can become frustrating for users, especially when the conversations happen within a specific and complicated context.

🕛 Static

Language models don’t learn from their conversations and they don’t improve over time. They don’t acquire new “experience”. Once deployed, a baseline language model solution will never improve.

🐪 Non-Autonomous

Language models don’t have the capacity to interface with anything other than their prompts and the data they’ve been trained on. Users expect new features, they expect applications to grow with them. A baseline language model cannot support users’ expectations.

The Advanced Relay Robot

Each of the following sections describes a category of solutions that improve on the baseline. Each section describes the category in general, provides a specific example of what the very first version could look like, and goes on to describe a longer-term vision. One thing to keep in mind is that the space of possible features is quite large and the field quite young, so the following sections are not meant to be prescriptive or exhaustive. The goal is to provide a primer, a map of what's possible.

Agent Tooling

Augment an LLM by providing a natural-language interface for calling external tools. A first version of agent tooling could include:

  • a way for the LLM to search white-listed, well-known web sites for info

  • access to a pre-populated, indexed knowledge base

  • access to on-chain data such as token prices and transaction history for the current user

Agent tooling can be used to improve reliability, stability, and dynamism (decrease brittleness).
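
As one concrete illustration, here's a minimal sketch of ReAct-style tool dispatch; the tool names, stub bodies, and prompt format are assumptions for this example, not Relay's actual implementation:

```python
# A minimal sketch of giving an LLM a natural-language tool interface.
# Tool names and stub bodies are illustrative, not Relay's real tools.
from typing import Callable

def search_docs(query: str) -> str:
    # Stand-in for search over a pre-populated, indexed knowledge base.
    return f"Top document for '{query}': ..."

def token_price(symbol: str) -> str:
    # Stand-in for an on-chain data lookup for the current user.
    return f"{symbol}: $1,800.00 (placeholder value)"

TOOLS: dict[str, Callable[[str], str]] = {
    "search_docs": search_docs,
    "token_price": token_price,
}

TOOL_PROMPT = (
    "To use a tool, reply with a single line of the form\n"
    "Action: <tool_name>[<input>]\n"
    f"Available tools: {', '.join(TOOLS)}."
)

def dispatch(llm_line: str) -> str:
    """Execute one 'Action: tool[input]' line emitted by the LLM."""
    name, _, rest = llm_line.removeprefix("Action: ").partition("[")
    tool = TOOLS.get(name.strip())
    if tool is None:
        return f"Observation: unknown tool '{name.strip()}'"
    return f"Observation: {tool(rest.rstrip(']'))}"
```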

Semantic Tracing

Every LLM trajectory leaves a “trace” in “semantic space”: each step in the trajectory can be defined by a question in natural-language that the LLM is trying to answer. By embedding and indexing these questions, we generate a dataset that can be used as an input into many of the other solutions we develop. A first version of tracing could include:

  • trace the LLM input

  • trace the LLM output

  • trace each call to agent tooling

  • trace each “cache” miss (failed calls to agent tooling)

Semantic tracing can be used to improve costs, latency, reliability, observability, analytics, and more.
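
A minimal sketch of what a trace record and a nearest-neighbor lookup could look like, reusing the embed() and cosine() helpers from the baseline sketch above (the record schema is an assumption):

```python
# A minimal sketch of semantic tracing. embed() and cosine() are the
# helpers from the baseline sketch; the record schema is illustrative.
import time

trace_index: list[dict] = []

def trace(kind: str, question: str, payload: str) -> None:
    """Record one step: 'input', 'output', 'tool', or 'miss' (cache miss)."""
    trace_index.append({
        "kind": kind,
        "question": question,
        "embedding": embed(question),  # position in "semantic space"
        "payload": payload,
        "ts": time.time(),
    })

def nearby(question: str, k: int = 5) -> list[dict]:
    """The k traced steps closest to `question` in semantic space."""
    q = embed(question)
    return sorted(trace_index,
                  key=lambda t: cosine(q, t["embedding"]),
                  reverse=True)[:k]
```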

Real-time Human in the Loop

Intercept, pause, cancel, and divert trajectories. The behavior of when to intercept, when to cancel, when to loop in a human can be pre-determined or based on dynamic data (such as data generated from semantic tracing). A first version of real-time human in the loop could include:

  • basic observability (pipe all trajectories through Discord)

  • escalation (react to a trajectory to jump into the conversation in lieu of the robot)

  • trajectory labeling (similar to, and the beginnings of, a reinforcement learning framework)

Real-time human in the loop features can be used to improve stability, observability, and branding (reduce genericness).
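
For the observability piece, a minimal sketch that mirrors trajectory steps into Discord via a standard webhook; the webhook URL placeholder and the escalation rule are assumptions:

```python
# A minimal sketch of piping trajectories through Discord and letting a
# human take over. The webhook URL is a placeholder; the escalation
# mechanism is an illustrative assumption.
import requests

WEBHOOK_URL = "https://discord.com/api/webhooks/<id>/<token>"  # placeholder
escalated: set[str] = set()  # conversation ids a human has taken over

def publish_step(conversation_id: str, step: str) -> None:
    """Mirror one trajectory step into a Discord channel."""
    requests.post(WEBHOOK_URL,
                  json={"content": f"[{conversation_id}] {step}"},
                  timeout=5)

def should_pause(conversation_id: str) -> bool:
    """The robot yields to a human once the conversation is escalated."""
    return conversation_id in escalated
```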

Real-time Analytics

Provide dashboards and alerting based on predefined criteria and semantic tracing data. A first version of this could include:

  • topic-based alerting: trigger alerts or notifications when users ask about specific topics

  • sentiment-based alerting: trigger notifications for conversations based on “how the conversation is going”

  • a semantic heatmap with clustering: show clusters of user requests and responses

Real-time analytics can be used to improve reliability and stability, and to surface useful insights into the product.
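
Topic-based alerting, for example, could be a thin layer over the tracing data; the watched topics and threshold below are assumptions, with embed() and cosine() reused from the baseline sketch:

```python
# A minimal sketch of topic-based alerting: fire when a user question
# lands near a watched topic in embedding space. Topics and threshold
# are illustrative; embed()/cosine() come from the baseline sketch.
WATCHED_TOPICS = ["bridging funds", "wallet drained or possible scam"]
TOPIC_EMBEDDINGS = {t: embed(t) for t in WATCHED_TOPICS}
ALERT_THRESHOLD = 0.85  # cosine similarity; tune empirically

def check_alerts(question: str) -> list[str]:
    q = embed(question)
    return [topic for topic, e in TOPIC_EMBEDDINGS.items()
            if cosine(q, e) >= ALERT_THRESHOLD]
```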

Semantic Caching

Very similar to semantic tracing, we can also introduce semantic caching. When a trajectory is very close to another in semantic space we can choose to halt the trajectory and return the result from the previously-executed trajectory. One way to think of this is that we allow the LLM to learn. A first version of this solution could include:

  • pre-trajectory short-circuiting (if the user request is a match, don’t even call the LLM)

  • mid-trajectory short-circuiting (if the LLM-generated question for the trajectory step is a match, fetch the previous answer and return immediately)

Semantic caching can be used to improve cost, latency, and reliability.
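
Pre-trajectory short-circuiting could look like the following sketch (embed() and cosine() as above; the similarity threshold is an assumed tuning parameter):

```python
# A minimal sketch of pre-trajectory short-circuiting: if a new request
# is close enough to one we've answered before, skip the LLM entirely.
CACHE: list[tuple] = []  # (question embedding, cached answer)
CACHE_THRESHOLD = 0.92   # assumed; tune against real traffic

def cached_answer(question: str) -> str | None:
    q = embed(question)
    for e, answer in CACHE:
        if cosine(q, e) >= CACHE_THRESHOLD:
            return answer  # cache hit: no LLM call
    return None            # cache miss: run the trajectory, then remember()

def remember(question: str, answer: str) -> None:
    CACHE.append((embed(question), answer))
```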

Replay

Because we’re tracing all of the important details about each trajectory, it should be trivial to “replay” the trajectory. A first version of this could include:

  • replay a trajectory

  • edit a trajectory and replay the edited version

Replay can be used to improve observability and to reduce vendor lock-in.
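
A sketch of what this could look like over persisted trace records; the on-disk format and the edits parameter are assumptions:

```python
# A minimal sketch of trajectory replay. The JSON record format is an
# illustrative assumption; `llm_call` is any prompt -> completion function.
import json
from typing import Callable

def save_trajectory(path: str, steps: list[dict]) -> None:
    with open(path, "w") as f:
        json.dump(steps, f, indent=2)

def replay(path: str, llm_call: Callable[[str], str],
           edits: dict[int, str] | None = None) -> list[str]:
    """Re-run every step; `edits` maps step index -> replacement prompt,
    e.g. to test the same trajectory against a different vendor's model."""
    with open(path) as f:
        steps = json.load(f)
    return [llm_call((edits or {}).get(i, step["prompt"]))
            for i, step in enumerate(steps)]
```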

Guardrails

Every LLM output will pass through a “guardrails” evaluation framework. The framework will help catch obvious errors but can also evaluate more subtle details like whether the robot’s output stayed “on brand”. A first version of this solution could include:

  • factual error detection (potentially using a “side channel” LLM)

  • transaction assurance: for any on-chain action recommended (or generated) by the chatbot, we use SOTA tools to increase the user's understanding of the action and their confidence that the action does what they expect. For an example of what's possible, see Stelo Labs' tx simulation.

Guardrails can be used to improve stability, reliability, and personalization (reduce genericness).
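
One way to implement the "side channel" check, sketched against the assumed pre-1.0 openai API (the YES/NO verdict convention is ours):

```python
# A minimal sketch of a "side channel" guardrail: a second LLM call that
# checks a draft answer against its sources before the user sees it.
import openai

def passes_guardrails(question: str, answer: str, sources: str) -> bool:
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Sources:\n{sources}\n\n"
                f"Proposed answer to '{question}':\n{answer}\n\n"
                "Does the answer make claims not supported by the sources? "
                "Reply with exactly YES or NO."
            ),
        }],
    )
    verdict = resp["choices"][0]["message"]["content"].strip().upper()
    return verdict.startswith("NO")
```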

Learning and Recency Bias

The best way to think about the knowledge base is as a pre-computed cache of LLM trajectories. The knowledge base is "what the robot knows". Recent data can be given a more significant weight in the vector database. For any dataset that we want the chatbot to understand, we trigger a process that automates the generation of a large number of trajectories over that knowledge base. A first version of ingestion could include:

  • scraping a public or private website, generating trajectories over the data, and saving the results

  • providing a patching API that allows incremental updates to the knowledge base

Learning can be used to improve cost, latency, reliability, observability, and many other aspects of the platform.
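
One simple way to attach recency weight at retrieval time is to decay similarity scores by age; the exponential decay and 30-day half-life below are assumptions:

```python
# A minimal sketch of recency bias: decay each match's similarity score
# by its age so newer knowledge outranks stale entries. The half-life
# is an assumed tuning parameter.
import time

HALF_LIFE_DAYS = 30.0

def recency_weighted(similarity: float, created_ts: float) -> float:
    age_days = (time.time() - created_ts) / 86_400
    return similarity * 0.5 ** (age_days / HALF_LIFE_DAYS)
```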

Invalidation

Knowledge invalidation is a semantic search + delete over the robot’s knowledge base. A first version could include:

  • semantic search + delete conditioned on some external criteria (like data source)

Invalidation is a useful tool to improve the flexibility of the robot, allowing it to evolve unencumbered by stale understanding.
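
A sketch of semantic search + delete over an in-memory knowledge base, reusing embed() and cosine() from above (the entry schema and threshold are assumptions):

```python
# A minimal sketch of knowledge invalidation: drop entries semantically
# close to a query, optionally filtered by an external criterion such
# as the data source. Entry schema and threshold are illustrative.
def invalidate(kb: list[dict], query: str, source: str | None = None,
               threshold: float = 0.9) -> list[dict]:
    q = embed(query)
    def stale(entry: dict) -> bool:
        close = cosine(q, entry["embedding"]) >= threshold
        from_source = source is None or entry.get("source") == source
        return close and from_source
    return [entry for entry in kb if not stale(entry)]
```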

Curation and Analytics

The robot’s knowledge base, and the tracing data around it, can be used to curate the platform’s user-facing content. A first version could include:

  • A report that shows which subjects are most likely to be requested by users

  • A report that shows which subjects the robot doesn't know about are most likely to be requested

  • A report that shows which subjects require the longest/shortest trajectories

Curation and analytics are useful for improving essentially all pieces of the platform. In particular, insights from the chatbot can lead to improvements in user-facing documentation and materials.
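
For instance, the "most requested subjects" report could cluster the traced user questions from the trace_index sketch above; k-means and k=10 are illustrative choices:

```python
# A minimal sketch of a "most requested subjects" report: cluster traced
# user questions and rank clusters by size. KMeans and k are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def top_subjects(trace_index: list[dict], k: int = 10) -> list[tuple[str, int]]:
    inputs = [t for t in trace_index if t["kind"] == "input"]
    X = np.stack([t["embedding"] for t in inputs])
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    report = []
    for c in range(k):
        members = [t["question"] for t, l in zip(inputs, labels) if l == c]
        if members:
            report.append((members[0], len(members)))  # exemplar + count
    return sorted(report, key=lambda r: -r[1])
```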

Further R&D in the Industry

Query Routing

Route each question to the index best suited to answer it: some questions are answered by the docs index, some by the list index, etc.

https://gpt-index.readthedocs.io/en/latest/guides/tutorials/graph.html

https://twitter.com/jerryjliu0/status/1653789212620230658

Caching

Similarity caching so that frequent questions skip a trip to OpenAI https://python.langchain.com/en/latest/modules/models/llms/examples/llm_caching.html#gptcache

Hallucination rate

Compare the query, response, and source to see if they all match; this leads to near-zero hallucinations. (Currently implementing.)

https://twitter.com/jerryjliu0/status/1645451894637367298

LLM Evaluation

On retrieval, how do certain variables affect the result? Variables include: split_method, chunk_chars, overlap, embeddings, retriever_type, num_neighbors.

https://github.com/PineappleExpress808/auto-evaluator

Guard Rails

These strictly check the output of the LLM and ensure it conforms to certain patterns or text. NVIDIA has an option, shown in the link below, and there is a separate project that came out first, at getguardrails.ai.

https://twitter.com/NVIDIAAIDev/status/1650887287494901763

Back to the lab! (Appendix links below)

Research and Theory

Projects and Tools