luo.box

Assembly and Executable Files

┌──────────────────┐   ┌──────────────────┐  High-level language
│int main() {      │   │int func(int a) { │         program     
│  return func(10);│   │  return a * 133; │  (C, C++, Rust, ...)
│}                 │   │}                 │       (text file)   
└────────┬─────────┘   └────────┬─────────┘                     
         │                      │                               
 Compiler(e.g. gcc)     Compiler(e.g. gcc)                      
         │                      │                               
┌────────▼─────────┐   ┌────────▼─────────┐                     
│main:             │   │func:             │  Assembly language  
│  li a0, 10       │   │  li a1, 133      │       program       
│  jal func        │   │  mul a0, a0, a1  │     (text file)     
│  ret             │   │  ret             │                     
└────────┬─────────┘   └────────┬─────────┘                     
         │                      │                               
 Assembler(e.g. as)     Assembler(e.g. as)                      
         │                      │                               
┌────────▼─────────┐   ┌────────▼─────────┐                     
│    0101010101    │   │    0101010101    │  Machine language   
│    1010001010    │   │    1010001010    │      program        
│    1100101101    │   │    1100101101    │    object file      
└────────┬─────────┘   └────────┬─────────┘      (binary)       
         │                      │                               
         └───►Linker(e.g. ld)◄──┘                               
                     │                                          
            ┌────────▼─────────┐                                
            │    0101010101    │             Machine language   
            │    1010001010    │                 program        
            │    1100101101    │             executable file    
            └────────┬─────────┘                 (binary)       
                     ▼                                          
             Loader(e.g. Linux)                                

Run a simple assembly program

Install toolchain (macOS)

brew install riscv/riscv/riscv-tools

This installs the RISC-V toolchain, including gcc (with newlib), the simulator (spike), and the proxy kernel (pk).

(Optional) Compile from a higher-level language

int func(int a);  /* defined in another source file */

int main() {
  int r = func(10);
  return r + 1;
}
fn main() {
}

A compiler is a tool that translates a program from one language to another.

cargo rustc --target riscv-unknown-elf --release -- --emit asm

The assembly will be written to target/<target>/release/deps/*.s, where <target> is the triple passed to --target.

  .text
  .align      2
main:
  addi  sp, sp, -16
  li    a0, 10
  sw    ra, 12(sp)
  jal   func
  lw    ra, 12(sp)
  addi  a0, a0, 1
  addi  sp, sp, 16
  ret

Assemble assembly code

An assembler is a tool that translates a program in assembly language into a program in machine language.

riscv64-none-elf-as -march=rv64imac main.s -o main.o

{% fold(title="What is that -march=rv64imac doing in there?") %} The default -march for our toolchain is rv64gc (or, more specifically, rv64imafdc). Here we pass rv64imac instead, which leaves out the F and D floating-point extensions and keeps the C extension, the one that indicates support for compressed instructions. {% end %}

The assembler produces object files that are encoded in binary and contain code in machine language along with other information, such as the list of symbols (e.g., global variables and functions) defined in the file. Several well-known file formats are used to encode object files. The Executable and Linking Format, or ELF, is frequently used on Linux-based systems, while the Portable Executable format, or PE, is used on Windows-based systems.

Though the object file produced by the assembler is in machine language, it is usually incomplete in the sense that it may still need to be relocated or linked with other object files to compose the whole program.

A linker is a tool that “links” together one or more object files and produces an executable file.

riscv64-none-elf-ld main.o -o main

Now run the program in the simulator:

spike $(which pk) main

Labels, symbols, references, and relocation

Labels and symbols

Labels are “markers” that represent program locations. A label is usually defined by a name followed by a colon (:) and can be inserted into an assembly program to “mark” a program position so that it can be referred to by assembly instructions or by other assembly commands, such as directives.

Global variables and program routines are program elements that are stored in the computer's main memory.

Program symbols are “names” that are associated with numerical values, and the “symbol table” is a data structure that maps each program symbol to its value. The assembler automatically converts each label into a program symbol and associates it with a numerical value that represents the label's position in the program, which is a memory address. The assembler adds all symbols to the program’s “symbol table”, which is also stored in the object file.

We can inspect the symbol table of an object file using the GNU nm tool:

$ riscv64-none-elf-nm  sum10.o
00000004 t .L0
00000004 t sum10
00000000 t x

You may also explicitly define symbols by using the .set directive.

.set answer, 42
get_answer:
  li a0, answer
  ret

To check:

$ riscv64-none-elf-as -march=rv32im get_answer.s -o get_answer.o
$ riscv64-none-elf-nm  get_answer.o
0000002a a answer
00000000 t get_answer

Notice that the answer symbol is an absolute symbol (type a), i.e., its value is not changed during the linking process. The get_answer symbol represents a location in the .text section (type t) and may have its value, which is an address, changed during the relocation process.
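To make the nm listing above easier to read, here is a minimal Python sketch that turns it into a dictionary. The function name parse_nm is hypothetical, and the sketch assumes every line has exactly three fields (value, type, name), which is true for the listing above but not for undefined symbols, which nm prints without a value.

```python
def parse_nm(output):
    """Parse nm lines of the form '<hex value> <type> <name>' (assumed format)."""
    symbols = {}
    for line in output.strip().splitlines():
        value, kind, name = line.split()
        symbols[name] = (int(value, 16), kind)
    return symbols


# Listing copied from the nm output shown above.
nm_output = """
0000002a a answer
00000000 t get_answer
"""

symbols = parse_nm(nm_output)
for name, (value, kind) in symbols.items():
    scope = "absolute" if kind.lower() == "a" else "relative to a section"
    print(f"{name}: value={value} type={kind} ({scope})")
```

The hexadecimal value 0000002a printed by nm is the decimal 42 that was passed to .set.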

References to labels and relocation

Each reference to a label must be replaced by an address during the assembling and linking processes.

To illustrate, consider the following RV32I assembly program, which contains four instructions and two labels.

trunk42:
  li t1, 42
  bge t1, a0, done
  mv a0, t1
done:
  ret

When assembling it, the assembler translates each assembly instruction into a 32-bit machine instruction. Each instruction therefore occupies four bytes, and the program occupies 16 bytes (four words) in total. The first instruction is mapped to address 0, the second to address 4, and so on. In this context, the trunk42 label, which marks the beginning of the program, is associated with address 0, and the done label, which marks the position of the ret instruction, is associated with address 0xc.
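The address assignment described above can be sketched in a few lines of Python. This is only an illustration, not how the GNU assembler is implemented: the function name assign_addresses is hypothetical, and the sketch assumes every non-label line becomes exactly one 4-byte instruction (true for this program, but not for pseudo-instructions that expand to several instructions).

```python
def assign_addresses(source, insn_size=4):
    """Map each label to the address of the instruction that follows it."""
    symbols = {}
    address = 0
    for line in source.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith(":"):
            # A label takes no space; it names the current address.
            symbols[line[:-1]] = address
        else:
            # Assumption: one line is one fixed-size instruction.
            address += insn_size
    return symbols, address


program = """
trunk42:
  li t1, 42
  bge t1, a0, done
  mv a0, t1
done:
  ret
"""

symbols, size = assign_addresses(program)
print({name: hex(addr) for name, addr in symbols.items()}, size)
# prints {'trunk42': '0x0', 'done': '0xc'} 16
```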

The GNU objdump tool can inspect several parts of the object file.

$ riscv64-none-elf-objdump -D trunk.o
trunk.o:     file format elf32-littleriscv
Disassembly of section .text:
00000000 <trunk42>:
   0: 02a00313
   4: 00a35463
   8: 00030513
0000000c <done>:
   c: 00008067
...

For each instruction, objdump shows its address and its encoding in hexadecimal, followed by a disassembled form that resembles assembly code (omitted from the listing above).
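To connect the hexadecimal encodings with the source, the following Python sketch decodes the word at address 4 (00a35463) using the standard RISC-V B-type field layout. The helper names sign_extend and decode_branch are hypothetical; only the encoding and the addresses come from the listing above.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def decode_branch(word):
    """Split a 32-bit B-type (conditional branch) instruction into fields."""
    assert word & 0x7F == 0x63, "not a conditional branch opcode"
    imm = (
        ((word >> 31) & 0x1) << 12    # imm[12]
        | ((word >> 7) & 0x1) << 11   # imm[11]
        | ((word >> 25) & 0x3F) << 5  # imm[10:5]
        | ((word >> 8) & 0xF) << 1    # imm[4:1]
    )
    return {
        "funct3": (word >> 12) & 0x7,   # 5 selects bge
        "rs1": (word >> 15) & 0x1F,     # register x6 is t1
        "rs2": (word >> 20) & 0x1F,     # register x10 is a0
        "offset": sign_extend(imm, 13),
    }


fields = decode_branch(0x00A35463)
target = 0x4 + fields["offset"]  # branches are relative to their own address
print(fields, hex(target))
```

The decoded offset is 8, so the branch at address 4 targets address 0xc, which is where the done label sits.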

The trunk42 function starts at address 0. However, when linking this object file (trunk.o) with others, the linker may need to move the code (assign it new addresses) so that the contents of different object files do not occupy the same addresses.

Relocation is the process in which code and data are assigned new memory addresses. The relocation table is a data structure that describes how the program's instructions and data need to be modified to reflect the address reassignment. Each object file contains a relocation table, and the linker uses this information to adjust the code when performing the relocation process.

$ riscv64-unknown-elf-objdump -r trunk.o
trunk.o:     file format elf32-littleriscv
RELOCATION RECORDS FOR [.text]:
OFFSET   TYPE              VALUE
00000004 R_RISCV_BRANCH    done
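As a rough illustration of what the linker does with this record, the sketch below applies an R_RISCV_BRANCH relocation by hand. It assumes the usual calculation for this relocation type, S + A - P (symbol value plus addend minus the address being patched), with an addend of 0 since the record shows none. The function name encode_branch_offset is hypothetical.

```python
def encode_branch_offset(word, offset):
    """Return word with its B-type immediate field replaced by offset."""
    assert offset % 2 == 0 and -4096 <= offset < 4096
    imm = offset & 0x1FFF
    # Clear bits 31:25 and 11:7, which hold the branch immediate.
    cleared = word & ~0xFE000F80 & 0xFFFFFFFF
    return (
        cleared
        | ((imm >> 12) & 0x1) << 31   # imm[12]
        | ((imm >> 5) & 0x3F) << 25   # imm[10:5]
        | ((imm >> 1) & 0xF) << 8     # imm[4:1]
        | ((imm >> 11) & 0x1) << 7    # imm[11]
    )


symbol_value = 0xC   # S: address of done in trunk.o
addend = 0           # A: assumed, no addend appears in the record
place = 0x4          # P: OFFSET column of the relocation record

offset = symbol_value + addend - place
patched = encode_branch_offset(0x00A35463, offset)
print(offset, f"{patched:08x}")
```

With the addresses from trunk.o the result is the same word that objdump reported at address 4, which suggests the assembler had already filled in the offset and the record exists so the linker can redo the calculation if addresses change.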

To check the binary after it’s linked:

$ riscv64-unknown-elf-ld -m elf32lriscv trunk.o -o trunk.x
$ riscv64-unknown-elf-objdump -D trunk.x

trunk.x:     file format elf32-littleriscv
Disassembly of section .text:
00010054 <trunk42>:
   10054: 02a00313
   10058: 00a35463
   1005c: 00030513
00010060 <done>:
   10060: 00008067
   ...

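Notice that the code was relocated: trunk42 moved from address 0 to 0x10054 and done moved from 0xc to 0x10060, while the instruction encodings stayed the same.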
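One way to see why the branch encoding 00a35463 is identical in both dumps is to compare the distance between the branch and its target before and after linking. The Python sketch below uses only addresses taken from the two objdump listings; the variable names are hypothetical.

```python
BASE = 0x10054  # address of trunk42 in trunk.x, from the listing above

# Symbol addresses in the object file (trunk.o).
object_symbols = {"trunk42": 0x0, "done": 0xC}

# Relocation shifts every symbol in the section by the same amount.
linked_symbols = {name: BASE + addr for name, addr in object_symbols.items()}

branch_in_object = 0x4
branch_in_executable = BASE + 0x4

offset_before = object_symbols["done"] - branch_in_object
offset_after = linked_symbols["done"] - branch_in_executable

print({name: hex(addr) for name, addr in linked_symbols.items()})
print(offset_before, offset_after)
```

Both offsets are 8 because the branch and its target moved together, so a PC-relative branch within the same section needs no change in its encoding.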

Undefined references

Global vs local symbols

The program entry point

Every program has an entry point, i.e., the point from which the CPU must start executing the program. The entry point is defined by an address, which is the address of the first instruction that must be executed.

The executable file has a header that contains several pieces of information about the program, and one of the header fields stores the entry point address. Once the operating system loads the program into main memory, it sets the PC to the entry point address so that the program starts executing.
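For ELF files, the entry point field (e_entry) sits at byte offset 24 of the header and is 4 bytes wide in 32-bit files or 8 bytes wide in 64-bit files. The following Python sketch reads it. The function name read_entry_point is hypothetical, and the header it parses is a synthetic one built in the example, with 0x10054 chosen purely for illustration rather than read from a real file.

```python
import struct


def read_entry_point(header):
    """Return the e_entry field from the first bytes of an ELF file."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is_64_bit = header[4] == 2                  # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if header[5] == 1 else ">"     # EI_DATA: 1 = little-endian
    fmt = endian + ("Q" if is_64_bit else "I")
    (entry,) = struct.unpack_from(fmt, header, 24)
    return entry


# Synthetic 32-bit little-endian header (illustrative values only):
# e_ident (16 bytes), e_type = 2 (executable), e_machine = 243 (RISC-V),
# e_version = 1, e_entry = 0x10054.
e_ident = b"\x7fELF" + bytes([1, 1, 1]) + bytes(9)
fake_header = e_ident + struct.pack("<HHII", 2, 243, 1, 0x10054)

print(hex(read_entry_point(fake_header)))
```

To try it on a real executable, the first 64 bytes of the file can be passed instead, for example the result of open(path, "rb").read(64).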

The linker is responsible for setting the entry point field in the executable file. To do so, it looks for a symbol named _start. If it finds it, it sets the entry point field to the value of the _start symbol. Otherwise, it sets the entry point to a default value, usually the address of the first instruction of the program.

The _start label must be registered as a global symbol for the linker to recognize it as the entry point.

.globl _start
_start:
  li a0, 10
  li a1, 20
  jal exit

Program sections

Executable files, object files, and assembly programs are usually organized in sections. The contents of each section are mapped to a set of consecutive main memory addresses. Sections such as .text, .data, .rodata, and .bss are often present in executable files generated for Linux-based systems.

When linking multiple object files, the linker groups information from sections with the same name and places them together into a single section on the executable file. The layout looks like this:

               Executable File                   
                    (ELF)                        
                ┌──────────┐                     
                │ 11101011 │                     
                │    ...   │ ELF header          
                │ 10101010 │                     
                ├──────────┤                     
           8011 │ 11010110 │                     
           8010 │ 10010101 │                     
           800f │ 10011100 │ .data section       
           800e │ 11101011 │                     
           800d │ 10101010 │                     
                ├──────────┤                     
           800b │ 10010101 │                     
Addresses  800a │ 10011100 │ .rodata section     
           8009 │ 11101011 │                     
           8008 │ 10101010 │                     
                ├──────────┤                     
           8007 │ 11010110 │                     
           8006 │ 10010101 │                     
           8005 │ 10011100 │                     
           8004 │ 11101011 │ .text section       
           8003 │ 10101010 │                     
           8002 │ 10011100 │                     
           8001 │ 10010101 │                     
           8000 │ 11010110 │                     
                ├──────────┤                     
                │ 11101011 │                     
                │    ...   │ Section header table
                │ 10101010 │                     
                └──────────┘                     
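The grouping of same-named sections described above can be sketched in Python as follows. This is a simplification under stated assumptions: the object file names and section sizes are made up, the function name merge_sections is hypothetical, and a real linker also handles alignment, ordering rules, and relocation.

```python
def merge_sections(object_files):
    """Concatenate same-named sections and record where each input landed."""
    merged = {}
    placement = {}
    for obj_name, sections in object_files.items():
        for sec_name, data in sections.items():
            buf = merged.setdefault(sec_name, bytearray())
            # Offset of this input section inside the merged output section.
            placement[(obj_name, sec_name)] = len(buf)
            buf.extend(data)
    return merged, placement


# Hypothetical inputs: section contents are zero bytes of arbitrary size.
objects = {
    "main.o": {".text": bytes(16), ".data": bytes(4)},
    "func.o": {".text": bytes(12), ".rodata": bytes(8)},
}

merged, placement = merge_sections(objects)
for name, data in merged.items():
    print(name, len(data))
print(placement[("func.o", ".text")])
```

In this sketch the .text of func.o starts at offset 16 of the merged .text section, right after the 16 bytes contributed by main.o, which is why its symbols need new addresses.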

Executable vs object files

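ELF is used by several Linux-based operating systems to encode both object and executable files. The two differ mainly in that the addresses in an object file are not final and the file carries a relocation table so that the linker can move its code and data, whereas an executable file has final addresses and an entry point.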

Assembly language

Assembly programs are encoded as plain text files and contain four main elements: comments, labels, instructions, and directives.

Directives

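Directive                        Description
.text or .section .text          Place subsequent content in the text section (machine code).
.data or .section .data          Place subsequent content in the data section (global variables).
.bss or .section .bss            Place subsequent content in the bss section (global variables initialized to 0).
.section .foo                    Place subsequent content in the section named .foo.
.align n                         Align subsequent data to a 2^n-byte boundary.
.balign n                        Align subsequent data to an n-byte boundary.
.globl sym                       Declare sym as a global symbol that can be referenced from other files.
.string "str" or .asciz "str"    Store the string str in memory, terminated by a null character.
.ascii "str"                     Store the string str in memory, without a terminating null character.
.byte b1,..., bn                 Store n 8-bit values consecutively in memory.
.half b1,..., bn                 Store n 16-bit values consecutively in memory.
.word b1,..., bn                 Store n 32-bit values consecutively in memory.
.dword b1,..., bn                Store n 64-bit values consecutively in memory.
.float b1,..., bn                Store n single-precision floating-point values consecutively in memory.
.double b1,..., bn               Store n double-precision floating-point values consecutively in memory.
.option rvc                      Compress subsequent instructions.
.option norvc                    Do not compress subsequent instructions.
.option relax                    Allow the linker to relax subsequent instructions.
.option norelax                  Do not allow the linker to relax subsequent instructions.
.option pic                      Treat subsequent instructions as position-independent code.
.option nopic                    Treat subsequent instructions as position-dependent code.
.option push                     Push all current .option settings onto a stack so that a later .option pop can restore them.
.option pop                      Pop the option stack, restoring all .option settings to those saved by the last .option push.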
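The difference between .align n and .balign n in the table is easy to mix up, so here is a small Python sketch of the arithmetic. The function names align and balign simply mirror the directive names and are not part of any library. The sketch follows the table's description, where .align takes a power-of-two exponent and .balign takes a byte count.

```python
def balign(address, n):
    """Round address up to the next multiple of n bytes, as .balign n does."""
    return (address + n - 1) // n * n


def align(address, n):
    """Round address up to the next multiple of 2**n bytes, as .align n does."""
    return balign(address, 2 ** n)


for addr in (4, 5, 9):
    print(addr, align(addr, 2), balign(addr, 4), align(addr, 3))
```

Under these definitions, .align 2 and .balign 4 give the same result, since both align to a 4-byte boundary.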