banteg

Posted on Jul 26, 2022

100x faster EVM traces with Python

An introduction to evm-trace, a fast and correct Python library to work with Ethereum Virtual Machine traces.

There are two common trace formats, which we’ll refer to as Geth-style and Parity-style traces. Geth traces provide opcode-level resolution, while the most commonly used Parity trace format provides a call tree.

Our goal today is to make opcode-resolution traces 100x faster by building them from a little-known Parity trace format. We’ll also make them compatible with Geth traces so they can serve as a drop-in replacement.

debug_traceTransaction

Geth traces return frames containing the full state of the virtual machine at each step of the execution, including the program counter, opcode, remaining gas, gas cost of the operation, call depth, items on the stack, a snapshot of the memory, and a storage diff.

Each subcall starts with its own fresh memory, stack, and storage changes.

It’s a powerful but somewhat unwieldy format, since you need to do extra work to figure out things like which contract is currently being called or what the calldata is.
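For reference, a single Geth frame (a structLog) looks roughly like this; the field names come from the debug API, while the values here are purely illustrative:

# shape of one Geth-style frame (values illustrative, not from a real trace)
frame = {
    "pc": 1080,                      # program counter
    "op": "MSTORE",                  # opcode being executed
    "gas": 978988,                   # remaining gas
    "gasCost": 6,                    # gas cost of this operation
    "depth": 2,                      # current call depth
    "stack": ["0x40", "0x80"],       # full snapshot of the stack
    "memory": ["0x" + "00" * 32],    # full snapshot of memory, one 32-byte word per item
    "storage": {"0x01": "0x02"},     # storage diff for the current contract
}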

I started digging into this when I stumbled upon several transactions that crash Erigon if you request a debug trace. It appeared that they generated very large traces, and soon enough I was able to confirm it when Erigon re-enabled streaming for heavy RPC endpoints.

In the Geth format, each frame contains a full copy of the EVM memory, which can lead to traces that don’t fit in system memory. A contract expanding memory to 100 KB could generate traces as large as 70 GB.

This can be addressed by streaming the response and decoding it iteratively as it arrives from the Ethereum client. This is what ape does by default with Geth traces when you call receipt.get_transaction_trace().
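Here is a minimal sketch of that idea, streaming frames with requests and ijson (my choice of libraries for the example, not necessarily what ape uses internally):

import ijson      # incremental JSON parser
import requests

def stream_geth_frames(rpc_url: str, tx_hash: str):
    """Yield Geth structLogs one at a time without buffering the whole response."""
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "debug_traceTransaction",
        "params": [tx_hash],
    }
    with requests.post(rpc_url, json=payload, stream=True) as resp:
        resp.raise_for_status()
        resp.raw.decode_content = True
        # parse "result.structLogs" items as they arrive on the wire
        yield from ijson.items(resp.raw, "result.structLogs.item")

# process each frame and drop it, so memory use stays flat
# for frame in stream_geth_frames("http://localhost:8545", "0x..."):
#     ...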

If we look up the memory expansion gas cost formulas at evm.codes, it becomes apparent that EVM memory is not that expensive and we can possibly construct or find transactions with even larger traces.

memory_cost = (memory_size_word ** 2) / 512 + (3 * memory_size_word)
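Plugging the formula into Python shows just how cheap a huge trace is to produce; expanding memory to roughly 100 KB costs under 30,000 gas:

def memory_expansion_cost(size_bytes: int) -> int:
    """Total gas charged for expanding EVM memory to size_bytes, per the formula above."""
    words = (size_bytes + 31) // 32
    return words ** 2 // 512 + 3 * words

print(memory_expansion_cost(100 * 1024))  # 29600 gas for ~100 KB of memory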

After sampling peak memory usage over the 10,000 most recent blocks, we find even larger traces. For one of them, Erigon hangs up on us after 30 minutes, having returned 460 GB and 898,836 frames.

We can patch Erigon to increase the write timeout and learn that the full trace consists of 960,198 frames, takes up 524 GB, and takes 34m 7s to fetch, before any deserialization or processing. That works out to 275 MB/s and 469 frames/s.

Keep that number in mind, because with the method presented below we can recompute the same trace from 137 MB of data (3,916x less), which we can fetch in 5.6 s, deserialize in 7.1 s, and replay in 4.6 s, with full processing taking 117x less time than merely fetching the Geth trace.

contracts with peak memory usage

trace_replayTransaction

Parity traces are quite different from Geth’s. The ad hoc traces come in three flavors: trace, vmTrace, and stateDiff.

Parity traces are supported by Erigon, Nethermind, Besu, and Foundry, but not by Geth, Hardhat, or Ganache. Anything beyond the trace type requires an archive node.

trace

The trace is the most widely used mode: it returns a call tree, similar to what you can see on ethtx.info or when calling receipt.show_trace() in ape. If you need a raw call tree, you can get it from chain.provider.get_call_tree(tx). You can also provide a custom display class and only show the parts you care about.

Lido.executeVote as raw and decorated call tree
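In code, getting the raw call tree looks roughly like this (transaction hash hypothetical; your ape console must be connected to a node that serves Parity traces):

from ape import chain

tx = "0x..."  # hypothetical transaction hash

# raw call tree, as mentioned above; feed it to a custom display class
# if you only want to render specific parts of it
call_tree = chain.provider.get_call_tree(tx)
print(call_tree)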

stateDiff

The stateDiff mode shows the changes of balance, code, nonce, and storage grouped by account.

a raw stateDiff
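To pull a raw stateDiff yourself, you can call the RPC endpoint directly; here is one way with web3.py against an archive node (transaction hash hypothetical):

from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))  # an Erigon/Nethermind archive node
tx = "0x..."  # hypothetical transaction hash

response = w3.provider.make_request("trace_replayTransaction", [tx, ["stateDiff"]])
# {address: {"balance": ..., "code": ..., "nonce": ..., "storage": {...}}, ...}
state_diff = response["result"]["stateDiff"]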

A more interesting problem here is how to map storage keys to readable names like balanceOf[account]. I’ve seen it solved in Tenderly, which is a proprietary tool.

Maybe you, the reader, could be the one to tackle this problem and release it as an open source tool for everyone to use.

a decoded stateDiff in Tenderly

vmTrace

The vmTrace mode is one of the most enigmatic and rarely used, mainly because it was never really documented. This mode will be the main focus of this article and I’ll show you how you can make a drop-in replacement for Geth traces that requires 1000x less data and works 100x faster.

vmTrace returns a nested structure of incremental updates which we can use to replay the transaction while tracking the stack, memory, and storage at each step of the execution. Of course, the incremental format doesn’t come for free and requires some legwork.

vmTrace struct

Here is what we get in a vmTrace response:

  • VMTrace represents a call and contains all subcalls
    • code EVM bytecode to be executed
    • ops list of VMOperation to be executed
  • VMOperation represents a step of the execution
    • pc program counter
    • cost gas cost of the operation
    • ex the result of the execution, could be null if the operation has reverted
    • sub list of VMTrace subcalls
    • op opcode name
    • idx index in the call tree
  • VMExecutedOperation represents side effects of executing the operation
    • used incorrectly named, shows the remaining gas
    • push the items to be placed on the stack
    • mem the memory delta
    • store the storage delta
  • MemoryDiff represents a memory update
    • off memory offset where to write the data
    • data the bytes to write starting at the offset
  • StorageDiff represents a storage write
    • key storage key to write to
    • val value to write

I defined this structure using msgspec to get fast deserialization and type validation.
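A rough sketch of what those declarations might look like; the field names follow the list above, while the exact types (and whether sub holds one nested trace or several) are my assumptions rather than the library’s actual code:

from typing import List, Optional
import msgspec

class StorageDiff(msgspec.Struct):
    key: str    # storage key to write to
    val: str    # value to write

class MemoryDiff(msgspec.Struct):
    off: int    # memory offset where to write the data
    data: str   # bytes to write, hex-encoded

class VMExecutedOperation(msgspec.Struct):
    used: int                      # despite the name, the remaining gas
    push: List[str]                # items placed on the stack
    mem: Optional[MemoryDiff]      # memory delta, if any
    store: Optional[StorageDiff]   # storage delta, if any

class VMOperation(msgspec.Struct):
    pc: int                            # program counter
    cost: int                          # gas cost of the operation
    ex: Optional[VMExecutedOperation]  # null if the operation reverted
    sub: Optional["VMTrace"]           # nested subcall trace, if any (my assumption)
    op: str                            # opcode name
    idx: str                           # index in the call tree

class VMTrace(msgspec.Struct):
    code: str                # EVM bytecode being executed, hex-encoded
    ops: List[VMOperation]   # operations to be executed

class ReplayResult(msgspec.Struct):
    vmTrace: VMTrace

class RPCResponse(msgspec.Struct):
    result: ReplayResult

# decode a raw trace_replayTransaction(tx, ["vmTrace"]) response in one call:
# trace = msgspec.json.decode(raw_bytes, type=RPCResponse).result.vmTrace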

Missing pieces

Now we need to replay the transaction, applying the incremental updates and untangling the nested structure into a flat execution order.

I borrowed the real Stack and Memory implementations from py-evm, which is a great EVM reference itself.
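To give a feel for the replay, here is a heavily simplified sketch with a plain list and bytearray standing in for py-evm’s Stack and Memory, reusing the VMTrace sketch above (pop counts per opcode are covered in the Popcodes section below); the real implementation in evm-trace is more careful than this:

from typing import Dict, Iterator, List

def replay(trace: "VMTrace", pops: Dict[str, int], depth: int = 1) -> Iterator[dict]:
    """Walk a VMTrace depth-first, yielding a Geth-like frame for every operation."""
    stack: List[str] = []
    memory = bytearray()

    for op in trace.ops:
        # Geth emits its frame after memory expansion but before `ex` is applied;
        # this sketch simply yields before applying any deltas, glossing over
        # the expansion-before-yield subtlety described below
        yield {"pc": op.pc, "op": op.op, "depth": depth,
               "stack": list(stack), "memory": bytes(memory)}

        if op.ex is not None:
            # drop the items this opcode consumed, then add the ones it pushed
            n = pops.get(op.op, 0)
            if n:
                del stack[-n:]
            stack.extend(op.ex.push)  # items arrive as hex strings

            # apply the memory delta at the given offset
            if op.ex.mem is not None:
                data = bytes.fromhex(op.ex.mem.data[2:])
                end = op.ex.mem.off + len(data)
                if len(memory) < end:
                    memory.extend(bytes(end - len(memory)))
                memory[op.ex.mem.off:end] = data

        # recurse into the subcall, which starts with its own fresh stack and memory
        if op.sub is not None:
            yield from replay(op.sub, pops, depth + 1)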

I wrote a small harness to do a pairwise comparison of each frame of my synthetic trace against the Geth trace frames, exiting after the first divergence in stack or memory. I learned that Geth emits frames after memory expansion but before anything from ex has been applied.
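The harness itself boiled down to something like this, with both sides normalized to comparable stack and memory representations beforehand (a sketch, not the actual test code):

from itertools import zip_longest

def first_divergence(synthetic_frames, geth_frames):
    """Return the index of the first frame where stack or memory differ, else None."""
    for i, (mine, geth) in enumerate(zip_longest(synthetic_frames, geth_frames)):
        if mine is None or geth is None:
            return i  # one trace ended early
        if mine["stack"] != geth["stack"] or mine["memory"] != geth["memory"]:
            return i
    return None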

Memory was no problem, but the stack was missing a few important pieces. We had push, but no info about the number of items popped from the stack by each opcode.

Popcodes

I’ve assembled this list by hand, using this file from ethereumjs as a reference; a combined mapping is sketched in code after the list. These opcodes pop a fixed number of items:

  1. EXTCODEHASH, ISZERO, NOT, BALANCE, CALLDATALOAD, EXTCODESIZE, BLOCKHASH, POP, MLOAD, SLOAD, JUMP, SELFDESTRUCT
  2. SHL, SHR, SAR, REVERT, ADD, MUL, SUB, DIV, SDIV, MOD, SMOD, EXP, SIGNEXTEND, LT, GT, SLT, SGT, EQ, AND, XOR, OR, BYTE, SHA3, MSTORE, MSTORE8, SSTORE, JUMPI, RETURN
  3. RETURNDATACOPY, ADDMOD, MULMOD, CALLDATACOPY, CODECOPY, CREATE
  4. CREATE2, EXTCODECOPY
  5. STATICCALL, DELEGATECALL
  6. CALL, CALLCODE

These opcodes can be expressed as a formula:

  • LOG0-LOG4 pops 2-6 items
  • SWAP1-SWAP16 pops¹ 2-17 items
  • DUP1-DUP16 pops¹ 1-16 items

¹ not really, this is only true for how the vmTrace format is served
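Putting the lists and formulas together, the mapping promised above looks like this:

# number of stack items popped by each opcode, per the lists above
POPS = {
    1: ["EXTCODEHASH", "ISZERO", "NOT", "BALANCE", "CALLDATALOAD", "EXTCODESIZE",
        "BLOCKHASH", "POP", "MLOAD", "SLOAD", "JUMP", "SELFDESTRUCT"],
    2: ["SHL", "SHR", "SAR", "REVERT", "ADD", "MUL", "SUB", "DIV", "SDIV", "MOD",
        "SMOD", "EXP", "SIGNEXTEND", "LT", "GT", "SLT", "SGT", "EQ", "AND", "XOR",
        "OR", "BYTE", "SHA3", "MSTORE", "MSTORE8", "SSTORE", "JUMPI", "RETURN"],
    3: ["RETURNDATACOPY", "ADDMOD", "MULMOD", "CALLDATACOPY", "CODECOPY", "CREATE"],
    4: ["CREATE2", "EXTCODECOPY"],
    5: ["STATICCALL", "DELEGATECALL"],
    6: ["CALL", "CALLCODE"],
}
POPCODES = {op: n for n, ops in POPS.items() for op in ops}
POPCODES.update({f"LOG{n}": n + 2 for n in range(5)})        # LOG0-LOG4 pop 2-6 items
POPCODES.update({f"SWAP{n}": n + 1 for n in range(1, 17)})   # as served by vmTrace
POPCODES.update({f"DUP{n}": n for n in range(1, 17)})        # as served by vmTrace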

Opcode incognita

Some transactions reconciled, but the stack still diverged, especially for large transactions. Using Foundry’s debugger, I found that the incremental updates were missing occasional pushes for several opcodes, namely SMOD, CHAINID, and COINBASE. I fixed this in Erigon, and the patch has already made it into the 2022.07.03-alpha release.

This finding made me question a lot of things. Most people fully trust the data they get from the node, but Ethereum is a very complex system and bugs happen. It’s great to have so many clients, references, and different approaches to check on each other.

Performance

I’ve added full-block tracing with opcode-level resolution, as well as an option to return a memoryview of the memory bytearray instead of copying it for each frame. It works like a pointer and comes with a caveat: you must process each frame immediately after receiving it from the iterator, otherwise the value may be mutated. This should work for most streaming workflows with traces, since you are usually scanning for specific frames and don’t look back.
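Usage ends up looking roughly like the sketch below. The function names and the copy_memory flag here are written from memory and should be treated as assumptions; check the evm-trace README for the actual entry points:

# names below are assumptions for illustration, not a documented API reference
from evm_trace import vmtrace

raw = b"..."  # raw bytes of a trace_replayTransaction(tx, ["vmTrace"]) response

trace = vmtrace.from_rpc_response(raw)  # msgspec-backed deserialization

for frame in vmtrace.to_trace_frames(trace, copy_memory=False):
    # with copying disabled, the frame's memory is a memoryview into a shared
    # bytearray: inspect it right away, it will be mutated as the replay advances
    if frame.op == "SSTORE":
        ...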

I’ve benchmarked evm-trace against the largest traces I could identify. Most of them replayed in under 3 seconds, with no significant slowdown even when a lot of memory is allocated. I’ve also benchmarked it against the 1,000 most recent Ethereum mainnet blocks. It was pleasant to see the needle move from seconds per trace to traces per second.

evm-trace performance benchmark

When dealing with full blocks, you can expect to fetch a median of 27.4 transactions per second from the node and replay 108.8 transactions per second at around 252,000 frames per second. When requesting a trace of an individual transaction, your client replays all transactions in the block up to the requested transaction index to reach the correct state and discards all the other traces. To give you an intuition, here is how many frames you would typically encounter in a block:

Give it a try

This lightning-fast drop-in replacement for Geth-style traces is available in the latest evm-trace and will soon be integrated into ape, your new favorite Ethereum development and data science framework.
