                      ┌──────────────────┐   ┌──────────────────┐
High-level language   │int main() {      │   │int func(int a) { │
program               │  return func(10);│   │  return a * 133; │
(C, C++, Rust, ...)   │}                 │   │}                 │
(text file)           └────────┬─────────┘   └────────┬─────────┘
                               │                      │
                      Compiler(e.g. gcc)     Compiler(e.g. gcc)
                               │                      │
                      ┌────────▼─────────┐   ┌────────▼─────────┐
                      │main:             │   │func:             │
Assembly language     │  li a0, 10       │   │  li a1, 133      │
program               │  jal func        │   │  mul a0, a0, a1  │
(text file)           │  ret             │   │  ret             │
                      └────────┬─────────┘   └────────┬─────────┘
                               │                      │
                      Assembler(e.g. as)     Assembler(e.g. as)
                               │                      │
                      ┌────────▼─────────┐   ┌────────▼─────────┐
                      │ 0101010101       │   │ 0101010101       │
Machine language      │ 1010001010       │   │ 1010001010       │
program               │ 1100101101       │   │ 1100101101       │
object file           └────────┬─────────┘   └────────┬─────────┘
(binary)                       │                      │
                               └───►Linker(e.g. ld)◄──┘
                                          │
                                 ┌────────▼─────────┐
                                 │ 0101010101       │
Machine language                 │ 1010001010       │
program                          │ 1100101101       │
executable file                  └────────┬─────────┘
(binary)                                  ▼
                                  Loader(e.g. Linux)
Run a simple assembly program
Install toolchain (macOS)
brew install riscv/riscv/riscv-tools
This installs the RISC-V toolchain, including gcc (with newlib), the simulator (spike), and pk:
- riscv64-none-elf-gdb: the “GNU Debugger”.
- riscv64-none-elf-ld: the “GNU Linker”.
- riscv64-none-elf-as: the “GNU Assembler”, an assembler capable of translating programs written in several assembly languages into machine language for their respective ISAs.
- riscv64-none-elf-nm: the tool used to inspect the “symbol table” of an object file.
- riscv64-none-elf-ar: the GNU archive tool, which creates, modifies, and extracts from archives (for example, static libraries).
- riscv64-none-elf-objdump: a tool that can be used to inspect several parts of an object file.
- riscv64-none-elf-objcopy: copies and translates object files.
- riscv64-none-elf-strings: prints the sequences of printable characters in files.
- riscv64-none-elf-size: lists section sizes and the total size of binary files.
- riscv64-none-elf-strip: discards symbols and other data from object files.
- riscv64-none-elf-elfedit: updates the ELF header and program properties of ELF files.
- riscv64-none-elf-readelf: displays information about ELF files.
- spike: the RISC-V ISA simulator, which implements a functional model of one or more RISC-V harts.
- pk: the RISC-V proxy kernel, a lightweight application execution environment that can host statically linked RISC-V ELF binaries.
(Optional) Compile from a higher-level language
int func(int a);  /* defined in another file */

int main() {
    int r = func(10);
    return r + 1;
}
fn main() {
}
A compiler is a tool that translates a program from one language to another.
cargo rustc --target riscv64gc-unknown-none-elf --release -- --emit asm
The assembly will be in target/riscv64gc-unknown-none-elf/release/deps/*.s.
    .text
    .align 2
main:
    addi sp, sp, -16
    li a0, 10
    sw ra, 12(sp)
    jal func
    lw ra, 12(sp)
    addi a0, a0, 1
    addi sp, sp, 16
    ret
Assemble assembly code
An assembler is a tool that translates a program in assembly language into a program in machine language.
riscv64-none-elf-as -march=rv64g main.s -o main.o
{% fold(title="What is that -march=rv64g doing in there?") %}
The default -march for our toolchain is rv64gc (or, more specifically, rv64imafdc). Passing rv64g drops the C extension, which provides compressed 16-bit instructions, so every instruction is encoded in 32 bits.
{% end %}
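To make the effect of the C extension more concrete, here is a minimal Python sketch (not part of the toolchain) that tells a compressed instruction from a standard one by looking at its two lowest bits. The helper name `instruction_length` is made up for this illustration, and encodings longer than 32 bits are ignored.

```python
def instruction_length(first_halfword: int) -> int:
    """Return the instruction length in bytes given its first 16 bits.

    Standard 32-bit instructions have their two lowest bits set to 0b11;
    any other value marks a 16-bit compressed (C extension) instruction.
    Encodings longer than 32 bits are not handled in this sketch.
    """
    return 4 if (first_halfword & 0b11) == 0b11 else 2


# 0x02a00313 encodes `li t1, 42`; its low halfword is 0x0313.
print(instruction_length(0x02A00313 & 0xFFFF))
# 0x8082 is the compressed encoding of `ret` (c.jr ra).
print(instruction_length(0x8082))
```

The first call reports 4 bytes and the second reports 2, which is why code assembled without the C extension can be laid out at addresses that are multiples of four.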
The assembler produces object files that are encoded in binary and contain code in machine language along with other information, such as the list of symbols (e.g., global variables and functions) defined in the file. Several well-known file formats are used to encode object files. The Executable and Linking Format, or ELF, is frequently used on Linux-based systems, while the Portable Executable format, or PE, is used on Windows-based systems.
Though the object file produced by the assembler is in machine language, it is usually incomplete in the sense that it may still need to be relocated or linked with other object files to compose the whole program.
A linker is a tool that “links” together one or more object files and produces an executable file.
riscv64-none-elf-ld main.o -o main
Now run the program using the simulator:
spike $(which pk) main
Labels, symbols, references, and relocation
Labels and symbols
Labels are “markers” that represent program locations. They are usually defined by a name followed by a colon (:) and can be inserted into an assembly program to “mark” a program position so that it can be referred to by assembly instructions or other assembly commands, such as directives.
Global variables and program routines are program elements that are stored in the computer’s main memory.
Program symbols are “names” that are associated with numerical values, and the “symbol table” is a data structure that maps each program symbol to its value. The assembler automatically converts each label into a program symbol and associates it with a numerical value that represents its position in the program, which is a memory address. The assembler adds all symbols to the program’s “symbol table”, which is also stored in the object file.
We can inspect the contents of the object file using the GNU nm
tool:
$ riscv64-none-elf-nm sum10.o
00000004 t .L0
00000004 t sum10
00000000 t x
You may also explicitly define symbols by using the .set
directive.
    .set answer, 42
get_answer:
    li a0, answer
    ret
To check:
$ riscv64-none-elf-as -march=rv32im get_answer.s -o get_answer.o
$ riscv64-none-elf-nm get_answer.o
0000002a a answer
00000000 t get_answer
Notice that the answer symbol is an absolute symbol (type a in the nm output), i.e., its value is not changed during the linking process. The get_answer symbol represents a location in the .text section (type t) and may have its value (which is an address) changed during the relocation process.
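The way labels and .set definitions end up in a symbol table can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not how the GNU assembler is implemented: every instruction is assumed to occupy 4 bytes, everything is assumed to live in .text, and the function name `build_symbol_table` is hypothetical.

```python
def build_symbol_table(lines):
    """Collect symbols from a tiny assembly listing.

    Returns {name: (value, kind)} where kind mimics nm's letters:
    'a' for absolute symbols created by .set, 't' for labels in .text.
    Assumes 4-byte instructions and a single .text section.
    """
    symbols = {}
    address = 0
    for raw in lines:
        line = raw.split("#", 1)[0].strip()      # drop comments and whitespace
        if not line:
            continue
        if line.startswith(".set"):
            name, value = line[len(".set"):].split(",")
            symbols[name.strip()] = (int(value.strip(), 0), "a")
        elif line.endswith(":"):
            symbols[line[:-1]] = (address, "t")  # label -> current position
        elif line.startswith("."):
            continue                             # other directives are ignored here
        else:
            address += 4                         # one 32-bit instruction
    return symbols


program = [
    ".set answer, 42",
    "get_answer:",
    "  li a0, answer",
    "  ret",
]
for name, (value, kind) in sorted(build_symbol_table(program).items()):
    print(f"{value:08x} {kind} {name}")
```

For the get_answer program this prints `0000002a a answer` and `00000000 t get_answer`, the same two entries that nm reports.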
References to labels and relocation
Each reference to a label must be replaced by an address during the assembling and linking processes.
To illustrate, consider the following RV32I assembly program, which contains four instructions and two labels.
trunk42:
    li t1, 42
    bge t1, a0, done
    mv a0, t1
done:
    ret
When assembling it, the assembler translates each assembly instruction into a 32-bit machine instruction. At four bytes per instruction, the program occupies a total of 16 bytes. The first instruction is mapped to address 0, the second to address 4, and so on. In this context, the trunk42 label, which marks the beginning of the program, is associated with address 0, and the done label, which marks the position of the ret instruction, is associated with address c (hexadecimal for 12).
The GNU objdump
tool can inspect several parts of the object file.
$ riscv64-none-elf-objdump -D trunk.o
trunk.o: file format elf32-littleriscv
Disassembly of section .text:
00000000 <trunk42>:
0: 02a00313
4: 00a35463
8: 00030513
0000000c <done>:
c: 00008067
...
For each instruction, it shows its address, its encoding in hexadecimal, and a text that resembles assembly code.
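The hexadecimal encodings in the listing can be reproduced by hand. The following Python sketch packs the four instructions of trunk42 using the RV32I I-type and B-type layouts; the helper names (`encode_i`, `encode_b`, `addi`) are invented for this example, and only the registers the program uses are listed.

```python
REGS = {"zero": 0, "ra": 1, "t1": 6, "a0": 10}  # only the registers used here


def encode_i(opcode, funct3, rd, rs1, imm):
    """I-type layout: imm[11:0] | rs1 | funct3 | rd | opcode."""
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


def encode_b(opcode, funct3, rs1, rs2, offset):
    """B-type layout: the branch offset bits are scattered across the word."""
    imm = offset & 0x1FFF
    return ((((imm >> 12) & 0x1) << 31) | (((imm >> 5) & 0x3F) << 25)
            | (rs2 << 20) | (rs1 << 15) | (funct3 << 12)
            | (((imm >> 1) & 0xF) << 8) | (((imm >> 11) & 0x1) << 7) | opcode)


def addi(rd, rs1, imm):
    return encode_i(0x13, 0b000, REGS[rd], REGS[rs1], imm)


program = [
    addi("t1", "zero", 42),                                    # li t1, 42
    encode_b(0x63, 0b101, REGS["t1"], REGS["a0"], 0xC - 0x4),  # bge t1, a0, done
    addi("a0", "t1", 0),                                       # mv a0, t1
    encode_i(0x67, 0b000, REGS["zero"], REGS["ra"], 0),        # ret = jalr zero, 0(ra)
]
for index, word in enumerate(program):
    print(f"{index * 4:x}: {word:08x}")
```

The output matches the objdump listing: 02a00313, 00a35463, 00030513, and 00008067 at addresses 0, 4, 8, and c. Note that the branch stores the distance to done (8 bytes) rather than its absolute address.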
The trunk42 function starts at address 0. However, when linking this object file (trunk.o) with others, the linker may need to move the code (assign it new addresses) so that the object files do not occupy the same addresses.
Relocation is the process in which code and data are assigned new memory addresses. The relocation table is a data structure that describes how the program instructions and data need to be modified to reflect the address reassignment. Each object file contains a relocation table, and the linker uses this information to adjust the code when performing the relocation process.
$ riscv64-unknown-elf-objdump -r trunk.o
trunk.o: file format elf32-littleriscv
RELOCATION RECORDS FOR [.text]:
OFFSET TYPE VALUE
00000004 R_RISCV_BRANCH done
To check the binary after it’s linked:
$ riscv64-unknown-elf-ld -m elf32lriscv trunk.o -o trunk.x
$ riscv64-unknown-elf-objdump -D trunk.x
trunk.x: file format elf32-littleriscv
Disassembly of section .text:
00010054 <trunk42>:
10054: 02a00313
10058: 00a35463
1005c: 00030513
00010060 <done>:
10060: 00008067
...
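Notice that the code was relocated: trunk42 moved from address 0 to 10054 and done moved from c to 10060. The encoding of the branch (00a35463) is unchanged because the branch target is stored as an offset relative to the branch instruction itself, and both moved by the same amount.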
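As a rough model of what the linker does with the relocation record, the Python sketch below moves the four words of trunk.o to the base address seen in trunk.x and re-applies the R_RISCV_BRANCH record. It is a simplification under stated assumptions: a single .text section, no addends, and only this one relocation type; `relocate` and `patch_branch` are hypothetical names.

```python
def patch_branch(word, offset):
    """Rewrite the 13-bit offset field of a B-type instruction."""
    imm = offset & 0x1FFF
    cleared = word & 0x01FFF07F          # keep rs2, rs1, funct3 and opcode
    return (cleared
            | (((imm >> 12) & 0x1) << 31) | (((imm >> 5) & 0x3F) << 25)
            | (((imm >> 1) & 0xF) << 8) | (((imm >> 11) & 0x1) << 7))


def relocate(code, symbols, relocations, base):
    """Move a .text blob to `base` and apply its relocation records.

    code:        list of 32-bit words, originally placed at address 0
    symbols:     {name: offset within the section}
    relocations: list of (offset, type, symbol name)
    """
    new_symbols = {name: base + off for name, off in symbols.items()}
    new_code = list(code)
    for offset, rtype, name in relocations:
        if rtype != "R_RISCV_BRANCH":
            raise NotImplementedError(rtype)
        pc = base + offset               # final address of the branch itself
        new_code[offset // 4] = patch_branch(code[offset // 4],
                                             new_symbols[name] - pc)
    return new_code, new_symbols


code = [0x02A00313, 0x00A35463, 0x00030513, 0x00008067]
symbols = {"trunk42": 0x0, "done": 0xC}
relocations = [(0x4, "R_RISCV_BRANCH", "done")]

new_code, new_symbols = relocate(code, symbols, relocations, base=0x10054)
for name, address in new_symbols.items():
    print(f"{address:08x} <{name}>")
print(f"{new_code[1]:08x}")
```

The symbols land at 00010054 and 00010060, and the patched branch word is still 00a35463, since the distance between the branch and done does not change when both move together.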
Undefined references
Global vs local symbols
The program entry point
Every program has an entry point, i.e., the point from which the CPU must start executing the program. The entry point is defined by an address, which is the address of the first instruction that must be executed.
The executable file has a header that contains several pieces of information about the program, and one of the header fields stores the entry point address. Once the operating system loads the program into main memory, it sets the PC to the entry point address so that the program starts executing.
The linker is responsible for setting the entry point field in the executable file. To do so, it looks for a symbol named _start. If it finds it, it sets the entry point field to the value of the _start symbol. Otherwise, it sets the entry point to a default value, usually the address of the first instruction of the program.
The _start label must be registered as a global symbol for the linker to recognize it as the entry point.
    .globl _start
_start:
    li a0, 10
    li a1, 20
    jal exit
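The entry point field mentioned above lives in the ELF header and can be read with a few lines of Python. The sketch below assumes a little-endian ELF32 file (as in the elf32-littleriscv listings) and builds a fake header by hand so that it runs without the toolchain; the entry address 0x10054 and the function name `read_entry_point` are illustrative assumptions.

```python
import struct

# ELF32 header layout (little-endian), 52 bytes in total.
ELF32_HEADER = struct.Struct("<16sHHIIIIIHHHHHH")


def read_entry_point(header_bytes):
    """Return the e_entry field of a little-endian ELF32 header."""
    fields = ELF32_HEADER.unpack_from(header_bytes)
    if fields[0][:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    return fields[4]   # order: e_ident, e_type, e_machine, e_version, e_entry


# Hand-built header for illustration only; 0x10054 is an assumed entry address.
ident = b"\x7fELF" + bytes([1, 1, 1]) + bytes(9)   # 32-bit, little-endian, v1
fake_header = ELF32_HEADER.pack(
    ident, 2, 243, 1,        # e_type=ET_EXEC, e_machine=EM_RISCV, e_version
    0x10054, 52, 0, 0,       # e_entry, e_phoff, e_shoff, e_flags
    52, 32, 0, 40, 0, 0)     # header/entry sizes and counts
print(hex(read_entry_point(fake_header)))
```

On a real ELF32 executable, the same function could be fed the first 52 bytes of the file; readelf reports the same value as the entry point address.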
Program sections
Executable files, object files, and assembly programs are usually organized in sections, and the contents of each section are mapped to a set of consecutive main memory addresses. The following sections are often present in executable files generated for Linux-based systems.
- .text: a section dedicated to storing the program instructions;
- .data: a section dedicated to storing initialized global variables, i.e., variables whose values need to be initialized before the program starts executing;
- .bss: a.k.a. the block starting symbol, a section dedicated to storing uninitialized global variables;
- .rodata: a section dedicated to storing constants, i.e., values that are read by the program but not modified during execution.
When linking multiple object files, the linker groups information from sections with the same name and places them together into a single section on the executable file. The layout looks like this:
             Executable File (ELF)
                 ┌──────────┐
                 │ 11101011 │
                 │ ...      │  ELF header
                 │ 10101010 │
                 ├──────────┤
            8011 │ 11010110 │
            8010 │ 10010101 │
            800f │ 10011100 │  .data section
            800e │ 11101011 │
            800d │ 10101010 │
                 ├──────────┤
            800b │ 10010101 │
Addresses   800a │ 10011100 │  .rodata section
            8009 │ 11101011 │
            8008 │ 10101010 │
                 ├──────────┤
            8007 │ 11010110 │
            8006 │ 10010101 │
            8005 │ 10011100 │
            8004 │ 11101011 │  .text section
            8003 │ 10101010 │
            8002 │ 10011100 │
            8001 │ 10010101 │
            8000 │ 11010110 │
                 ├──────────┤
                 │ 11101011 │
                 │ ...      │  Section header table
                 │ 10101010 │
                 └──────────┘
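The grouping of same-named sections can be modeled in Python as concatenating byte strings and handing out consecutive addresses. This is a minimal sketch, not the algorithm used by ld: alignment, padding, and linker scripts are ignored, the two object files and their contents are placeholders, and `link_sections` is a hypothetical name.

```python
def link_sections(object_files, base, order=(".text", ".rodata", ".data", ".bss")):
    """Concatenate same-named sections and assign them consecutive addresses.

    object_files: list of {section name: bytes}
    Returns {section name: (start address, merged bytes)}.
    Alignment and padding are ignored in this sketch.
    """
    layout = {}
    address = base
    for name in order:
        merged = b"".join(obj.get(name, b"") for obj in object_files)
        if merged:
            layout[name] = (address, merged)
            address += len(merged)
    return layout


# Two hypothetical object files; the byte values are placeholders.
main_o = {".text": bytes(4), ".data": bytes(2)}
func_o = {".text": bytes(4), ".rodata": bytes(4), ".data": bytes(3)}

for name, (start, data) in link_sections([main_o, func_o], base=0x8000).items():
    print(f"{name:8} {start:#06x}..{start + len(data) - 1:#06x}")
```

With these inputs the two .text sections are placed back to back starting at 0x8000, followed by .rodata and then the merged .data, which mirrors the ordering in the diagram.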
Executable vs object files
ELF is used by several Linux-based operating systems to encode both object and executable files. The two kinds of file differ in the following aspects:
- Addresses on object files are not final and elements from different sections may be assigned the same addresses;
- Object files usually contain several references to undefined symbols, which are expected to be resolved by the linker;
- Object files contain a relocation table so that instructions and data on object files can be relocated on linking. Addresses on executable files are usually final;
- Object files do not have an entry point.
Assembly language
Assembly programs are encoded as plain text files and contain four main elements:
- Comments: textual notes that the assembler ignores.
- Labels: program location markers.
- Assembly instructions: actual ISA instructions.
- Assembly directives: commands used to coordinate the assembling process.
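To illustrate the four kinds of elements, here is a small Python sketch that classifies each line of an assembly file. It is deliberately naive: it assumes at most one element per line and only `#` comments, and the `classify` function is a made-up helper, not part of any assembler.

```python
def classify(line):
    """Classify one line of assembly as comment, label, directive, or instruction.

    Simplified: assumes at most one element per line and '#' comments only.
    """
    text = line.strip()
    if not text:
        return "blank"
    if text.startswith("#"):
        return "comment"
    if text.endswith(":"):
        return "label"          # checked before directives so `.L0:` is a label
    if text.startswith("."):
        return "directive"
    return "instruction"


source = """\
# return the constant 42
.text
.globl get_answer
get_answer:
    li a0, 42
    ret
"""
for line in source.splitlines():
    print(f"{classify(line):12} {line}")
```

A real assembler tokenizes more carefully (a label and an instruction may share a line, for example), but the same four categories apply.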
Directives
Directive | Description |
---|---|
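.text or .section .text | Place the following content in the text section (machine code). |
.data or .section .data | Place the following content in the data section (global variables). |
.bss or .section .bss | Place the following content in the bss section (global variables initialized to 0). |
.section .foo | Place the following content in the section named .foo. |
.align n | Align the following data on a 2^n-byte boundary. |
.balign n | Align the following data on an n-byte boundary. |
.globl sym | Declare sym as a global symbol that can be referenced from other files. |
.string "str" or .asciz "str" | Store the string str in memory, terminated by a null character. |
.ascii "str" | Store the string str in memory, without a null terminator. |
.byte b1,..., bn | Store n 8-bit values consecutively in memory. |
.half b1,..., bn | Store n 16-bit values consecutively in memory. |
.word b1,..., bn | Store n 32-bit values consecutively in memory. |
.dword b1,..., bn | Store n 64-bit values consecutively in memory. |
.float b1,..., bn | Store n single-precision floating-point numbers consecutively in memory. |
.double b1,..., bn | Store n double-precision floating-point numbers consecutively in memory. |
.option rvc | Compress the following instructions. |
.option norvc | Do not compress the following instructions. |
.option relax | Allow the linker to relax the following instructions. |
.option norelax | Do not allow the linker to relax the following instructions. |
.option pic | Treat the following instructions as position-independent code. |
.option nopic | Treat the following instructions as position-dependent code. |
.option push | Push the current settings of all .option options onto a stack so that a later .option pop can restore them. |
.option pop | Pop the option stack, restoring all .option settings to those saved by the last .option push. |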
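The data directives in the table can be pictured as functions that append bytes to the current section. The Python sketch below mimics a few of them for a little-endian target; it handles unsigned integer values only, and the helper names `emit` and `balign` are assumptions made for this example rather than anything provided by the assembler.

```python
import struct


def emit(directive, *values):
    """Return the bytes a data directive would place in memory (little-endian)."""
    formats = {".byte": "<B", ".half": "<H", ".word": "<I", ".dword": "<Q",
               ".float": "<f", ".double": "<d"}
    if directive in formats:
        return b"".join(struct.pack(formats[directive], v) for v in values)
    if directive == ".ascii":
        return values[0].encode("ascii")
    if directive in (".asciz", ".string"):
        return values[0].encode("ascii") + b"\x00"   # add the null terminator
    raise ValueError(f"unsupported directive: {directive}")


def balign(data, n):
    """Pad `data` with zero bytes up to the next multiple of n (.balign n)."""
    return data + bytes(-len(data) % n)


section = emit(".asciz", "hi")          # 3 bytes: 'h', 'i', 0
section = balign(section, 4)            # pad to a 4-byte boundary
section += emit(".word", 0x12345678)    # stored least-significant byte first
print(section.hex(" "))
```

This prints `68 69 00 00 78 56 34 12`: the string and its terminator, one padding byte, and then the word in little-endian order. An `.align n` directive would correspond to `balign(data, 2 ** n)` in this model.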