Wasm Introduction (Part 1): Binary Format
Written by the CoinEx Chain lab, this article introduces the Wasm binary format and is the first one of a series. CoinEx Chain is the world’s first public chain exclusively designed for DEX, and will also include a Smart Chain supporting smart contracts and a Privacy Chain protecting users’ privacy.
There are already many introductions to WebAssembly (hereinafter referred to as Wasm), so this article skips the background and focuses on the Wasm binary format. We will compile the simplest Rust program (yes, it’s the Hello, World! program) into the Wasm binary format, and then analyze the format using Go pseudo code combined with the output of the xxd command. The following is the complete code of this Rust program (if you don't know how to compile Rust to Wasm, please see this article):
Module
Wasm’s top-level structure is a module, and each Wasm binary file corresponds to one module. The module starts with a 4-byte magic number, followed by a 4-byte version number, and the rest is the data of the module. The module data is divided by kind and placed in different sections, each identified by a section ID. Except for the custom section (described later), every section can appear at most once, and sections must appear in order of increasing section ID. The following pseudo code (using Go language syntax, the same below) gives the overall structure of the module:
Inspecting the hw.wasm file with the xxd command shows that the magic number of the Wasm binary format is \0asm and the current version number is 1 (the integer is stored in little-endian byte order):
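The header check described above can be sketched in a few lines of Go. This is only an illustration: the function name readHeader and the error messages are assumptions of this sketch, and the 8 input bytes are the magic number and version stated in the text, not a copy of the xxd output.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

// wasmMagic is the 4-byte magic number "\0asm".
var wasmMagic = []byte{0x00, 0x61, 0x73, 0x6D}

// readHeader checks the magic number and returns the version, which is a
// fixed-width little-endian uint32 (it is not LEB128-encoded).
func readHeader(data []byte) (uint32, error) {
	if len(data) < 8 {
		return 0, errors.New("module shorter than 8 bytes")
	}
	if !bytes.Equal(data[:4], wasmMagic) {
		return 0, errors.New("bad magic number")
	}
	return binary.LittleEndian.Uint32(data[4:8]), nil
}

func main() {
	header := []byte{0x00, 0x61, 0x73, 0x6D, 0x01, 0x00, 0x00, 0x00}
	version, err := readHeader(header)
	fmt.Println(version, err)
}
```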
Section
The Wasm specification defines a total of 12 sections with IDs from 0 to 11. See the following definition of Go constants:
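One possible way to write the 12 section IDs as Go constants is shown below. The numeric values follow the Wasm 1.0 binary format; the identifier names are chosen for this sketch and are not taken from the original listing.

```go
package main

import "fmt"

// Section IDs of the Wasm 1.0 binary format.
// The Go identifiers are this sketch's own naming.
const (
	SecCustomID byte = iota // 0
	SecTypeID               // 1
	SecImportID             // 2
	SecFuncID               // 3
	SecTableID              // 4
	SecMemID                // 5
	SecGlobalID             // 6
	SecExportID             // 7
	SecStartID              // 8
	SecElemID               // 9
	SecCodeID               // 10
	SecDataID               // 11
)

func main() {
	fmt.Println(SecCustomID, SecCodeID, SecDataID)
}
```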
In the Wasm binary file, each section starts with a section ID. Except for custom sections, the structure of every section is defined by the specification. Because the content of a custom section may be unknown to an implementation, the size of the section content in bytes is stored right after the section ID, so that a Wasm implementation can skip custom sections that it does not recognize. To make Wasm binary files more compact, integers are encoded in LEB128 format before being stored. The following pseudo code shows the abstract structure of a section (the varU32 type represents a 32-bit unsigned integer encoded with LEB128):
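To make the LEB128 encoding concrete, here is a minimal decoder sketch for unsigned 32-bit values. The name decodeVarU32 is an assumption of this sketch, and the decoder is deliberately lenient: a strict implementation would also reject a fifth byte whose upper bits overflow 32 bits. The sample inputs are section sizes quoted later in this article.

```go
package main

import (
	"errors"
	"fmt"
)

// decodeVarU32 decodes an unsigned LEB128 integer from the start of data.
// It returns the value and the number of bytes consumed.
// Each byte contributes its low 7 bits; the high bit means "more bytes follow".
func decodeVarU32(data []byte) (uint32, int, error) {
	var result uint32
	for i := 0; i < len(data) && i < 5; i++ {
		b := data[i]
		result |= uint32(b&0x7F) << (7 * uint(i))
		if b&0x80 == 0 {
			return result, i + 1, nil
		}
	}
	return 0, 0, errors.New("invalid or truncated LEB128 value")
}

func main() {
	// 0x0F is a one-byte value; 0xDD 0x01 and 0xDE 0x01 take two bytes.
	for _, enc := range [][]byte{{0x0F}, {0xDD, 0x01}, {0xDE, 0x01}} {
		v, n, err := decodeVarU32(enc)
		fmt.Println(v, n, err)
	}
}
```

For example, 0xDD 0x01 decodes to 0x5D + (0x01 << 7) = 93 + 128 = 221.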
Type Section
In the Wasm binary file, function information is distributed across three sections (let’s ignore imported functions and debugging information for now). The Type section stores the type information (or signature) of functions. The Code section stores the local variable information and bytecode of functions. The Function section associates the code section with the type section. Because there can be multiple function types, the number of function types is stored first, followed by the types themselves. This is the standard way of storing vector data in the Wasm binary format. For the sake of simplicity, we omit the count in the following pseudo code and directly use Go’s slice type to represent vec. The section ID of the type section is 1, and its structure is given by the following pseudo code:
Function types begin with 0x60, followed by parameter types and return value types:
From the output of the xxd command, we can see that the first section that appears is indeed a type section with 15 (0x0F) bytes and 3 function types:
Let’s look at the first function type. We can see that it really starts with 0x60. There are two parameters, both of which are of type i32 (0x7F), and there is no return value:
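The layout of a single function type can be sketched as follows. The type FuncType and the function decodeFuncType are names invented for this sketch, and vector lengths are assumed to fit in a single LEB128 byte (less than 128). The input bytes are rebuilt from the description above (0x60, two i32 parameters, no results) rather than copied from the xxd output.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	funcTypeTag = 0x60 // every function type starts with this byte
	valTypeI32  = 0x7F // encoding of the i32 value type
)

// FuncType holds the encoded parameter and result value types.
type FuncType struct {
	Params  []byte
	Results []byte
}

// decodeFuncType decodes one function type: tag, vec(params), vec(results).
// Assumption of this sketch: each vector length fits in one byte (< 128).
func decodeFuncType(data []byte) (FuncType, error) {
	if len(data) == 0 || data[0] != funcTypeTag {
		return FuncType{}, errors.New("function type must start with 0x60")
	}
	pos := 1
	readVec := func() ([]byte, error) {
		if pos >= len(data) || data[pos] >= 0x80 {
			return nil, errors.New("truncated or multi-byte vector length")
		}
		n := int(data[pos])
		pos++
		if pos+n > len(data) {
			return nil, errors.New("truncated vector")
		}
		vec := data[pos : pos+n]
		pos += n
		return vec, nil
	}
	params, err := readVec()
	if err != nil {
		return FuncType{}, err
	}
	results, err := readVec()
	if err != nil {
		return FuncType{}, err
	}
	return FuncType{Params: params, Results: results}, nil
}

func main() {
	ft, err := decodeFuncType([]byte{0x60, 0x02, 0x7F, 0x7F, 0x00})
	fmt.Println(len(ft.Params), len(ft.Results), err)
}
```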
Import Section
The import section stores import information, and the section ID is 2. The import information includes the import module name, import name, and import description. Wasm supports four types of imports: functions, tables, memory, and global variables. In order to distinguish the specific import, the import description needs to start with a one-byte tag, and the specific description information varies depending on the tag. The following pseudo code describes the overall structure of the import section:
For a function import, Desc holds the index of the function’s type information in the type section. Other import types are not detailed in this article; please refer to the Wasm specification. From the output of the xxd command, we can see that the import section follows the type section, has 17 (0x11) bytes, and contains only 1 import:
In this single import, the module name is env (which is generated by the Rust compiler by default), and the import name is print_str:
Note that in the Wasm binary file, a character string is also stored as a vector (length + content), and the content is an array of UTF-8 encoded bytes. Looking further, you can see that the tag is 0x00, which means that this is a function import. The type information of the function is stored at position 0 of the type section:
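A small sketch of decoding one function import is given below. The names Import, readName and decodeFuncImport are hypothetical, only the function tag (0x00) is handled, and lengths and indexes are assumed to fit in one byte. The entry bytes are rebuilt from the description above (env, print_str, tag 0x00, type index 0), not copied from the xxd output.

```go
package main

import (
	"errors"
	"fmt"
)

// Import describes one function import.
type Import struct {
	Module  string
	Name    string
	Tag     byte // 0x00 means function import
	TypeIdx byte // index into the type section
}

// readName reads a length-prefixed UTF-8 string and returns it together
// with the remaining bytes. Assumes the length fits in one byte (< 128).
func readName(data []byte) (string, []byte, error) {
	if len(data) == 0 || data[0] >= 0x80 || len(data) < 1+int(data[0]) {
		return "", nil, errors.New("bad or truncated name")
	}
	n := int(data[0])
	return string(data[1 : 1+n]), data[1+n:], nil
}

// decodeFuncImport decodes module name, import name, tag and type index.
func decodeFuncImport(data []byte) (Import, error) {
	module, rest, err := readName(data)
	if err != nil {
		return Import{}, err
	}
	name, rest, err := readName(rest)
	if err != nil {
		return Import{}, err
	}
	if len(rest) < 2 || rest[0] != 0x00 {
		return Import{}, errors.New("only function imports are handled here")
	}
	return Import{Module: module, Name: name, Tag: rest[0], TypeIdx: rest[1]}, nil
}

func main() {
	entry := []byte{
		0x03, 'e', 'n', 'v', // module name
		0x09, 'p', 'r', 'i', 'n', 't', '_', 's', 't', 'r', // import name
		0x00, // tag: function
		0x00, // type index
	}
	imp, err := decodeFuncImport(entry)
	fmt.Println(imp.Module, imp.Name, imp.Tag, imp.TypeIdx, err)
}
```

These 16 bytes plus the 1-byte import count add up to the 17 bytes reported for the section.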
Function Section
The function section is relatively simple. It stores, in order, the index into the type section of the signature of each internal function. For example, if a module has 5 internal functions, the function section holds 5 type indexes, and two table lookups give the third function’s signature: TypeSec.Types[FuncSec.Types[2]]. The ID of the function section is 3, and the overall structure is given by the pseudo code below:
From the output of the xxd command, we can see that the function section follows the import section with 5 bytes and 4 function type indexes (1, 1, 2, 1):
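The two-step lookup can be sketched like this. The function decodeFuncSection is a name made up for this sketch and assumes every count and index fits in one byte. The 5 payload bytes follow the description above (count 4, then 1, 1, 2, 1); the entries for types 1 and 2 are placeholders, because the text only spells out the first function type.

```go
package main

import (
	"errors"
	"fmt"
)

// decodeFuncSection decodes the function section payload: a vector of
// type indexes, one per internal function.
// Assumption of this sketch: the count and every index fit in one byte.
func decodeFuncSection(payload []byte) ([]uint32, error) {
	if len(payload) == 0 || payload[0] >= 0x80 {
		return nil, errors.New("empty payload or multi-byte count")
	}
	count := int(payload[0])
	if len(payload) != 1+count {
		return nil, errors.New("payload length does not match count")
	}
	idxs := make([]uint32, count)
	for i := 0; i < count; i++ {
		if payload[1+i] >= 0x80 {
			return nil, errors.New("multi-byte index not handled")
		}
		idxs[i] = uint32(payload[1+i])
	}
	return idxs, nil
}

func main() {
	typeIdxs, err := decodeFuncSection([]byte{0x04, 0x01, 0x01, 0x02, 0x01})
	if err != nil {
		panic(err)
	}
	// Placeholder type table: only type 0 is described in the text.
	types := []string{"(i32, i32) -> ()", "type 1 (placeholder)", "type 2 (placeholder)"}
	// Signature of the third internal function: types[typeIdxs[2]].
	fmt.Println(typeIdxs, types[typeIdxs[2]])
}
```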
Table & Element Section
The table section (ID is 4) and the element section (ID is 9) are related to the call_indirect instruction. They are not covered in this article; please refer to the Wasm specification.
Memory Section
The memory section ID is 5, and the section stores memory information. The memory information gives the lower limit (which must be specified) and the upper limit (optional) of the number of pages required for running the module (a page is 64 KiB). Although the memory section can hold multiple memories, the Wasm 1.0 specification stipulates that a module can have at most one memory; this restriction may be lifted in subsequent versions. The following pseudo code describes the overall structure of the memory section:
From the output of the xxd command, we can see that the memory section contains 3 bytes and 1 memory entry. This entry starts with the flag 0, so only the lower limit of the page count (0x11, which is 17) is specified, with no upper limit:
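The limits encoding can be sketched as below. Limits, decodeLimits and decodeVarU32 are names chosen for this sketch. The two input bytes (flag 0x00, minimum 0x11) follow the description above, and the size in bytes is simply the page count multiplied by 64 KiB.

```go
package main

import (
	"errors"
	"fmt"
)

const pageSize = 64 * 1024 // bytes per Wasm page

// Limits holds the page limits of a memory.
type Limits struct {
	Min    uint32
	Max    uint32
	HasMax bool
}

// decodeVarU32 decodes an unsigned LEB128 value and its encoded length.
func decodeVarU32(data []byte) (uint32, int, error) {
	var result uint32
	for i := 0; i < len(data) && i < 5; i++ {
		result |= uint32(data[i]&0x7F) << (7 * uint(i))
		if data[i]&0x80 == 0 {
			return result, i + 1, nil
		}
	}
	return 0, 0, errors.New("invalid or truncated LEB128 value")
}

// decodeLimits decodes a flag byte (0: min only, 1: min and max)
// followed by one or two LEB128 page counts.
func decodeLimits(data []byte) (Limits, error) {
	if len(data) < 2 || data[0] > 1 {
		return Limits{}, errors.New("truncated limits or unknown flag")
	}
	lo, n, err := decodeVarU32(data[1:])
	if err != nil {
		return Limits{}, err
	}
	lim := Limits{Min: lo}
	if data[0] == 1 {
		hi, _, err := decodeVarU32(data[1+n:])
		if err != nil {
			return Limits{}, err
		}
		lim.Max, lim.HasMax = hi, true
	}
	return lim, nil
}

func main() {
	lim, err := decodeLimits([]byte{0x00, 0x11})
	fmt.Println(lim.Min, lim.HasMax, uint64(lim.Min)*pageSize, err)
}
```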
Global Section
The global variable section stores internal (non-imported) global variable information. The section ID is 6. Global variable information includes the type of the variable, whether it is read-only, and the bytecode used to initialize the global variable. The following pseudo code describes the overall structure of the global variables section:
From the output of the xxd command, we can see that the global variable section follows the memory section with 25 (0x19) bytes and 3 global variables:
The first global variable has type i32 (0x7F), is mutable (0x01), and its initialization bytecode has a total of 6 bytes (instruction encoding is not covered in this article; please refer to the Wasm specification):
Export Section
The export section is the counterpart of the import section, and its ID is 7. Like import information, export information is divided into four types: functions, tables, memory, and global variables. Export information is a bit simpler because it does not include a module name, and no matter what is exported, only the corresponding index needs to be given. The following pseudo code describes the overall structure of the export section:
From the output of the xxd command, we can see that the export section follows the global variable section with 44 (0x2C) bytes and 4 exports:
The fourth exported name is main, the type is function (0x00), and the index is 3 (the imported functions and the internal functions together form the function index space):
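As an illustration, one export entry can be decoded as follows. Export and decodeExport are hypothetical names, and the name length and index are assumed to fit in one byte. The bytes are rebuilt from the description above (main, tag 0x00, index 3). Because imported functions come first in the function index space, subtracting the number of imported functions (1 in this module) gives the position among the internal functions.

```go
package main

import (
	"errors"
	"fmt"
)

// Export describes one export entry: name, kind tag and index.
type Export struct {
	Name  string
	Tag   byte // 0x00 means function export
	Index uint32
}

// decodeExport decodes vec(name bytes), a tag byte and an index.
// Assumption of this sketch: name length and index fit in one byte.
func decodeExport(data []byte) (Export, error) {
	if len(data) == 0 || data[0] >= 0x80 {
		return Export{}, errors.New("empty entry or multi-byte name length")
	}
	n := int(data[0])
	if len(data) < 1+n+2 {
		return Export{}, errors.New("truncated export entry")
	}
	idx := data[1+n+1]
	if idx >= 0x80 {
		return Export{}, errors.New("multi-byte index not handled")
	}
	return Export{Name: string(data[1 : 1+n]), Tag: data[1+n], Index: uint32(idx)}, nil
}

func main() {
	exp, err := decodeExport([]byte{0x04, 'm', 'a', 'i', 'n', 0x00, 0x03})
	if err != nil {
		panic(err)
	}
	// Imported functions occupy the first slots of the function index space.
	numImportedFuncs := uint32(1)
	fmt.Println(exp.Name, exp.Tag, exp.Index, exp.Index-numImportedFuncs)
}
```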
Start Section
Unlike other sections, the start section contains only one item, the index of the start function. The specified function is roughly equivalent to the main function in C/C++/Java and other languages. After a Wasm implementation has instantiated the module, it is required to execute this function. The start section ID is 8. The structure can be described by the following pseudo code:
As the Rust compiler does not generate a start section for this module, it cannot be observed in the hw.wasm file.
Code Section
As mentioned earlier, the local variables’ information and bytecode of the internal function are stored in the code section, and the section ID is 10. For the sake of parallel processing (such as verification, analysis, AOT compilation, etc.), each item in the code section comes with a number of bytes. In this way, the local variable information and bytecode data of each internal function can be extracted first, and then multiple functions can be processed simultaneously. The following pseudo code describes the overall structure of the code section:
From the output of the xxd command, we can see that the code section follows the export section with 221 (0xDD, 0x01) bytes and 4 items:
The first item has 93 (0x5D) bytes of data. It has one local variable entry, indicating that the function requires 8 local variables of type i32 (0x7F). The remaining 90 bytes are the bytecode of the function (instruction encoding is not covered in this article; please refer to the Wasm specification):
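The size prefix on each item is what makes it possible to slice the code section into independent pieces before looking at any bytecode. The sketch below shows that splitting step only. The name splitCodeEntries and the tiny two-item payload are made up for illustration (the real hw.wasm bytecode is not reproduced here); the second item mimics the layout described above, with one local entry declaring 8 locals of type i32.

```go
package main

import (
	"errors"
	"fmt"
)

// decodeVarU32 decodes an unsigned LEB128 value and its encoded length.
func decodeVarU32(data []byte) (uint32, int, error) {
	var result uint32
	for i := 0; i < len(data) && i < 5; i++ {
		result |= uint32(data[i]&0x7F) << (7 * uint(i))
		if data[i]&0x80 == 0 {
			return result, i + 1, nil
		}
	}
	return 0, 0, errors.New("invalid or truncated LEB128 value")
}

// splitCodeEntries cuts a code section payload into its items using the
// per-item byte count, without decoding locals or instructions.
func splitCodeEntries(payload []byte) ([][]byte, error) {
	count, n, err := decodeVarU32(payload)
	if err != nil {
		return nil, err
	}
	pos := n
	var entries [][]byte
	for i := uint32(0); i < count; i++ {
		size, n, err := decodeVarU32(payload[pos:])
		if err != nil {
			return nil, err
		}
		pos += n
		end := pos + int(size)
		if end > len(payload) {
			return nil, errors.New("truncated code entry")
		}
		entries = append(entries, payload[pos:end])
		pos = end
	}
	return entries, nil
}

func main() {
	// Hypothetical payload with 2 items:
	//   item 0: size 2 -> no locals (0x00), end opcode (0x0B)
	//   item 1: size 4 -> 1 local entry: 8 x i32 (0x08 0x7F), end opcode
	payload := []byte{0x02, 0x02, 0x00, 0x0B, 0x04, 0x01, 0x08, 0x7F, 0x0B}
	entries, err := splitCodeEntries(payload)
	fmt.Println(len(entries), err)
	for i, e := range entries {
		fmt.Println(i, len(e), e)
	}
}
```

Each returned slice could then be handed to a separate goroutine for validation or compilation.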
Data Section
The data section stores the memory initialization information, and the ID is 11. Each piece of memory initialization information includes the memory index (which can only be 0 currently), the starting position (which is the bytecode, and must be a constant expression), and initial data. The following pseudo code describes the overall structure of the data section:
From the output of the xxd command, we can see that the data section follows the code section with 23 (0x17) bytes, including 1 piece of memory initialization information:
The memory index of this memory initialization information is 0, and the starting position in memory is determined by an i32.const instruction (the opcode is 0x41, and the parameter is 0x100000). It can be calculated from the instruction that the starting position is at the beginning of the 17th page (0x100000 / (64 * 1024) = 16, counting pages from 0):
The initialization data is 14 (0x0E) bytes, and the content is the string literal Hello, World!\n in the Rust code:
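The pieces of this data segment can be checked with a short sketch. DataSegment, decodeVarS32 and decodeDataSegment are names invented here, only the form "i32.const offset; end" is handled, and the data length is assumed to fit in one byte. The segment bytes are rebuilt from the description above: the operand of i32.const is a signed LEB128 value, so 0x100000 is encoded as 0x80 0x80 0xC0 0x00, which is consistent with the 23-byte section size (1 count byte plus 22 bytes of segment).

```go
package main

import (
	"errors"
	"fmt"
)

// DataSegment is one piece of memory initialization information.
type DataSegment struct {
	MemIdx byte
	Offset int32
	Init   []byte
}

// decodeVarS32 decodes a signed LEB128 value and its encoded length.
func decodeVarS32(data []byte) (int32, int, error) {
	var result int32
	var shift uint
	for i := 0; i < len(data) && i < 5; i++ {
		b := data[i]
		result |= int32(b&0x7F) << shift
		shift += 7
		if b&0x80 == 0 {
			// Sign-extend when the sign bit (0x40) of the last byte is set.
			if shift < 32 && b&0x40 != 0 {
				result |= int32(-1) << shift
			}
			return result, i + 1, nil
		}
	}
	return 0, 0, errors.New("invalid or truncated LEB128 value")
}

// decodeDataSegment handles only: memidx 0, i32.const <offset>, end, vec(bytes).
func decodeDataSegment(data []byte) (DataSegment, error) {
	if len(data) < 2 || data[0] != 0x00 || data[1] != 0x41 {
		return DataSegment{}, errors.New("expected memory index 0 and i32.const")
	}
	offset, n, err := decodeVarS32(data[2:])
	if err != nil {
		return DataSegment{}, err
	}
	pos := 2 + n
	if pos+1 >= len(data) || data[pos] != 0x0B {
		return DataSegment{}, errors.New("expected end opcode and data length")
	}
	size := int(data[pos+1])
	pos += 2
	if size >= 0x80 || pos+size > len(data) {
		return DataSegment{}, errors.New("multi-byte length or truncated data")
	}
	return DataSegment{MemIdx: 0, Offset: offset, Init: data[pos : pos+size]}, nil
}

func main() {
	seg := append([]byte{0x00, 0x41, 0x80, 0x80, 0xC0, 0x00, 0x0B, 0x0E}, "Hello, World!\n"...)
	ds, err := decodeDataSegment(seg)
	if err != nil {
		panic(err)
	}
	fmt.Printf("offset=%#x page=%d data=%q\n", ds.Offset, ds.Offset/(64*1024), ds.Init)
}
```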
Custom Section
Custom sections can store debugging information, third-party extension information, etc. This information is not necessary for running the module and can be safely ignored. In addition, unlike other sections (which must appear in strict order and at most once), custom sections can appear freely before and after other sections, and can appear multiple times consecutively. The content of a custom section starts with a string, the section name, which distinguishes one kind of custom section from another. The following pseudo code describes the abstract structure of the custom section:
The Wasm specification defines only one kind of custom section, whose name is the string name. The specific format of the name section is not covered in this article; please refer to the Wasm specification. From the output of the xxd command, we can see that a custom section follows the data section, contains 222 (0xDE, 0x01) bytes, and its name is exactly name (0x6E, 0x61, 0x6D, 0x65):
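Putting the section framing together, the sketch below walks the sections of a module and reads the name of a custom section. Section, walkSections and customName are names made up for this sketch, the header is skipped without validation, and the 18-byte module is a hypothetical minimal one (an empty type section followed by an empty name section), not hw.wasm.

```go
package main

import (
	"errors"
	"fmt"
)

// Section is a raw section: its ID and its undecoded content.
type Section struct {
	ID      byte
	Payload []byte
}

// decodeVarU32 decodes an unsigned LEB128 value and its encoded length.
func decodeVarU32(data []byte) (uint32, int, error) {
	var result uint32
	for i := 0; i < len(data) && i < 5; i++ {
		result |= uint32(data[i]&0x7F) << (7 * uint(i))
		if data[i]&0x80 == 0 {
			return result, i + 1, nil
		}
	}
	return 0, 0, errors.New("invalid or truncated LEB128 value")
}

// walkSections splits a module into sections using the ID + size framing.
// The 8-byte header (magic + version) is skipped, not validated.
func walkSections(module []byte) ([]Section, error) {
	if len(module) < 8 {
		return nil, errors.New("module shorter than its header")
	}
	pos := 8
	var secs []Section
	for pos < len(module) {
		id := module[pos]
		pos++
		size, n, err := decodeVarU32(module[pos:])
		if err != nil {
			return nil, err
		}
		pos += n
		end := pos + int(size)
		if end > len(module) {
			return nil, errors.New("truncated section")
		}
		secs = append(secs, Section{ID: id, Payload: module[pos:end]})
		pos = end
	}
	return secs, nil
}

// customName reads the name string at the start of a custom section payload.
func customName(payload []byte) (string, error) {
	n, used, err := decodeVarU32(payload)
	if err != nil {
		return "", err
	}
	if used+int(n) > len(payload) {
		return "", errors.New("truncated custom section name")
	}
	return string(payload[used : used+int(n)]), nil
}

func main() {
	module := []byte{
		0x00, 0x61, 0x73, 0x6D, 0x01, 0x00, 0x00, 0x00, // magic + version
		0x01, 0x01, 0x00, // type section: size 1, zero types
		0x00, 0x05, 0x04, 0x6E, 0x61, 0x6D, 0x65, // custom section named "name"
	}
	secs, err := walkSections(module)
	if err != nil {
		panic(err)
	}
	for _, s := range secs {
		if s.ID == 0 {
			name, err := customName(s.Payload)
			fmt.Println("custom section:", name, err)
			continue
		}
		fmt.Println("section", s.ID, "with", len(s.Payload), "bytes")
	}
}
```

An implementation that does not understand a given custom section can simply continue past it, because the size is known before the content is read.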