Let’s Write An Ethereum Virtual Machine Disassembler
In this article we’ll discuss writing a disassembler for the Ethereum Virtual Machine (EVM). While writing disassemblers doesn’t necessarily require much knowledge of the machine the instructions are for, it’s actually quite a fun way to get your feet wet.
During this tutorial we’ll be using the D programming language. Why D? Honestly there’s no great reason. It’s mostly because I think it’s clearer and simpler than the alternatives for this kind of low-level code. Let’s get started.
Introduction
When writing a disassembler, the first place I like to start is with a high-level description of the machine. This helps me to get my bearings and hopefully answers any major architectural questions I might have during development. So let’s elaborate the EVM a bit.
EVM Basics
Machine Word and Stack
The EVM is a big-endian, stack-based machine architecture. The machine uses 256-bit words in order to facilitate SHA3-256 hash values and elliptic curve computations.
The machine stack has a maximum depth of 1024 items (each item is a 256-bit word).
Memory Model
Memory for the EVM is word rather than byte addressable. This means that each sequential address stores a whole 256-bit value. Machines that are byte-addressable would use 32 sequential addresses to store the same 256-bit value as each address only contains 8-bits (1 byte) of the total value. All storage and memory in the EVM is initialized to zero.
Programs are not stored in memory, but instead can be considered to be stored in a separate virtual read-only memory (ROM) that is accessible only through a special instruction – CODECOPY – which copies the program into main memory.
Compiled Contracts
When Ethereum contracts are compiled down to byte-code the convention is not to represent them as binary, but instead to represent them as a single long hex string like the following:
6060604052600a8060106000396000f360606040526008565b00
This is the compiled byte-code for the following solidity contract:
contract Test { }
There could literally not be less going on here, but it provides us a good place to start. To see what that byte-code is doing let’s use Etherscan’s disassembler to decompile the hex string into the 25 opcodes that compose it:
PUSH1 0x60 PUSH1 0x40 MSTORE PUSH1 0x0a DUP1 PUSH1 0x10 PUSH1 0x00 CODECOPY PUSH1 0x00 RETURN PUSH1 0x60 PUSH1 0x40 MSTORE PUSH1 0x08 JUMP JUMPDEST STOP
Disassembler
Hex Strings To Binary
This is actually more information than we probably need, but it’s interesting and provides a nice high-level summary. To begin using our disassembler let’s first write a program to convert the hexadecimal string of a compiled contract into binary. The binary format is a useful first step as it allows us to do numeric comparisons and operations, which will come in handy later.
After playing around with it for a little bit we can come up with some D code to do the conversion for us.
import std.conv; import std.range; import std.stdio; /** * Converts a long hexadecimal string into an array of bytes. * * Params: * hexString = A string containing hexadecimal bytes. * * Returns: An array of bytes in the same order they appear in the string. */ ubyte[] hexStringToByteArray(string hexString) { ubyte[] binaryBytecode; foreach (i; iota(0, hexString.length, 2)) { string nextByte = hexString[i..(i + 2)]; binaryBytecode ~= parse!ubyte(nextByte, 16); } return binaryBytecode; } int main() { string contractHex = "6060604052600a8060106000396000f360606040526008565b00"; ubyte[] binaryBytecode = hexStringToByteArray(contractHex); writeln("[ORIGINAL]"); writeln(contractHex); writeln("[BINARY]"); foreach (opcodeByte; binaryBytecode) { writef("%02x", opcodeByte); } write("\n"); return 0; }
Collecting The Opcodes
Now that we’ve got our data in the format we want, it’s time to approach writing the disassembler. The next step is assembling a list of all the possible opcodes and their binary values. Once we have this list we can then recognize and handle the opcodes on a case-by-base basis. The easiest place to get this list of opcodes is from the original Ethereum Yellow Paper, namely you can find the hex values in the appendix.
Completing The Disassembler
Now it’s time to complete our disassembler. One simple way to write a disassembler is simply to check for every single opcode one-by-one and then display its name. At 129 opcodes, this is still doable for the EVM but tedious.
In addition to thinking about displaying the opcode names, we also need to display any arguments they take. For some architectures this is a very complicated affair involving various argument encoding schemes. Luckily for us, Ethereum is stack based, so the only arguments to opcodes are provided to the PUSH opcodes, all 32 of ’em.
This means if we can match names to opcodes, the only thing we have to deal with on a case-by-case basis are the PUSH opcodes. The parameters to these opcodes are values between 8 and 256-bits, and these values are encoded in big-endian format. Most modern CPUs are little endian, and most modern programming languages do not natively support 256-bit integer values. This means we have to do a bit of work to construct our 256-bit integer values. Specifically, we consolidate them into a big integer byte-by-byte as we encounter them. With this information we can writing a disassembler routine as follows, where EvmOpcodes is our enumeration of all possible opcodes:
pure string[ubyte] generateOpcodeToNameMap() { string[ubyte] results; foreach (immutable opcode; [EnumMembers!EvmOpcodes]) { results[opcode] = std.conv.to!string(opcode); } return results; } string[] disassembleBytecode(ubyte[] bytecode) { string[] results; auto nameForOpcode = generateOpcodeToNameMap(); for (size_t pc = 0; pc < bytecode.length; pc++) { auto opcode = bytecode[pc]; switch(opcode) { case EvmOpcodes.PUSH1: case EvmOpcodes.PUSH2: case EvmOpcodes.PUSH3: case EvmOpcodes.PUSH4: case EvmOpcodes.PUSH5: case EvmOpcodes.PUSH6: case EvmOpcodes.PUSH7: case EvmOpcodes.PUSH8: case EvmOpcodes.PUSH9: case EvmOpcodes.PUSH10: case EvmOpcodes.PUSH11: case EvmOpcodes.PUSH12: case EvmOpcodes.PUSH13: case EvmOpcodes.PUSH14: case EvmOpcodes.PUSH15: case EvmOpcodes.PUSH16: case EvmOpcodes.PUSH17: case EvmOpcodes.PUSH18: case EvmOpcodes.PUSH19: case EvmOpcodes.PUSH20: case EvmOpcodes.PUSH21: case EvmOpcodes.PUSH22: case EvmOpcodes.PUSH23: case EvmOpcodes.PUSH24: case EvmOpcodes.PUSH25: case EvmOpcodes.PUSH26: case EvmOpcodes.PUSH27: case EvmOpcodes.PUSH28: case EvmOpcodes.PUSH29: case EvmOpcodes.PUSH30: case EvmOpcodes.PUSH31: case EvmOpcodes.PUSH32: size_t numBytes = (opcode - EvmOpcodes.PUSH1 + 1); // Consolidate the bytes of the PUSH argument into an integer. BigInt arg = BigInt(0); for (ubyte i = 1; i <= numBytes; i++) { arg <<= 8; if (pc + i < bytecode.length) { arg |= bytecode[pc + i]; } } results ~= format("%s 0x%032x", nameForOpcode[opcode], arg); pc += numBytes; break; default: if (opcode in nameForOpcode) { results ~= nameForOpcode[opcode]; } break; } } return results; }
Testing It Out
Now we can test it out, let’s use it on our original example:
PUSH1 0x00000000000000000000000000000060 PUSH1 0x00000000000000000000000000000040 MSTORE PUSH1 0x0000000000000000000000000000000a DUP1 PUSH1 0x00000000000000000000000000000010 PUSH1 0x00000000000000000000000000000000 CODECOPY PUSH1 0x00000000000000000000000000000000 RETURN PUSH1 0x00000000000000000000000000000060 PUSH1 0x00000000000000000000000000000040 MSTORE PUSH1 0x00000000000000000000000000000008 JUMP JUMPDEST STOP
The output is slightly different, but everything seems to match, and with that we’ve completed our EVM disassembler.
Conclusion
Thanks for following along. You can find the link to the project in GitHub below. For extra points you might consider trying the following:
- Format the output to match the output from etherscan’s opcode tool.
- Write an assembler that takes this code as input and outputs bytecode.
Links: