Have a look at previous articles if you need to catch up. In this article we will cover updates done to the Linux kernel, GCC (done long ago) and GDB for debugging.
Supporting hardware floating point operations in Linux user space applications means adding the ability for the Linux kernel to store and restore FPU state upon context switches. This allows multiple programs to use the FPU at the same time as if each program had its own floating point hardware. The kernel allows programs to multiplex usage of the FPU transparently. This is similar to how the kernel allows user programs to share other hardware like the CPU and memory.
On OpenRISC this only requires adding one additional register, the floating point control and status register (FPCSR), to context switches. The FPCSR contains status bits pertaining to rounding mode and exceptions.
We will cover three places where Linux needs to store FPU state:
In order for the kernel to be able to support FPU usage in user space programs it needs to be able to save and restore the FPU state during context switches. Let’s look at the different kinds of context switches that happen in the Linux Kernel to understand where FPU state needs to be stored and restored.
For our discussion purposes we will define a context switch as being when one CPU state (context) is saved from CPU hardware (registers and program counter) to memory and then another CPU state is loaded from memory to CPU hardware. This can happen in a few ways.
Furthermore, exceptions may be categorized into one of two cases: Interrupts and System calls. For each of these a different amount of CPU state needs to be saved.
Below are outlines of the sequence of operations that take place when transitioning from Interrupts and System calls into kernel code. They highlight at what point state is saved with the INT, kINT and FPU labels.
The monospace labels below correspond to the actual assembly labels in entry.S, the part of the OpenRISC Linux kernel that handles entry into kernel code.
Interrupts (slow path)
EXCEPTION_ENTRY - all INT state is saved
_resume_userspace - Check thread_info for work pending
_work_pending - Call do_work_pending
  _switch which saves/restores kINT and FPU state
  do_signal which saves/restores FPU state
RESTORE_ALL - all INT state is restored and return to user space
System calls (fast path)
_sys_call_handler - callee saved registers are saved
_syscall_check_work - Check thread_info for work pending
_work_pending - Call do_work_pending
  _switch which saves/restores kINT and FPU state
  do_signal which saves/restores FPU state
RESTORE_ALL - all INT state is restored and return to user space
_syscall_resume_userspace - restore callee saved registers and return to user space.
Some key points to note on the above:
v6.8 saves and restores both INT and FPU state; what is shown above is a more optimized mechanism of only saving FPU state when needed. Further optimizations could still be made to only save FPU state for user space, and not save/restore if it is already done.
With these principles in mind we can now look at how the mechanics of context switching work.
Upon a Mode Switch from user mode to kernel mode the process thread stack switches from using a user space stack to the associated kernel space stack. The required state is stored to a stack frame in a pt_regs structure.
The pt_regs structure (originally ptrace registers) represents the CPU registers and program counter context that needs to be saved.
Below we can see how the kernel stack and user space stack relate.
      kernel space

 +--------------+       +--------------+
 | Kernel Stack |       | Kernel stack |
 |     |        |       |     |        |
 |     v        |       |     |        |
 |   pt_regs' -------\  |     v        |
 |     v        |    |  |  pt_regs' --------\
 |   pt_regs''  |<+  |  |     v        |<+  |
 |              | |  |  |              | |  |
 +--------------+ |  |  +--------------+ |  |
 | thread_info  | |  |  | thread_info  | |  |
 | *task        | |  |  | *task        | |  |
 |  ksp ----------/  |  |  ksp-----------/  |
 +--------------+    |  +--------------+    |
                     |                      |
 0xc0000000          |                      |
---------------------|----------------------|-----
                     |                      |
   process a         |   process b          |
 +--------------+    |  +------------+      |
 | User Stack   |    |  | User Stack |      |
 |     |        |    |  |     |      |      |
 |     V        |<---+  |     v      |<-----+
 |              |       |            |
 |              |       |            |
 | text         |       | text       |
 | heap         |       | heap       |
 +--------------+       +------------+

 0x00000000
      user space
process a
In the above diagram notice how there are two sets of pt_regs for process a. The pt_regs’ structure represents the user space registers (INT) that are saved during the switch from user mode to kernel mode. Notice how the pt_regs’ structure has an arrow pointing to the user space stack, that’s the saved stack pointer. The second pt_regs’’ structure represents the frozen kernel state (kINT) that was saved before a task switch was performed.
process b
Also in the diagram above we can see process b has only a pt_regs’ (INT) structure saved on the stack and does not currently have a pt_regs’’ (kINT) structure saved. This indicates that this process is currently running in kernel space and is not yet frozen.
As we can see here, for OpenRISC there are two places to store state.
The pt_regs structure on the kernel stack, represented by pt_regs’. At this point only integer registers need to be saved. This represents the user process state.
The task structure or the thread_info structure. This context may store all of the extra registers, including FPU and Vector registers.
The kernel stack and thread_info live in the same memory space. This is a source of security issues and many architectures have moved to support virtually mapped kernel stacks. OpenRISC does not yet support this and it would be a good opportunity for improvement.
The structure of the pt_regs used by OpenRISC is as per below:
struct pt_regs {
    long gpr[32];
    long pc;
    /* For restarting system calls:
     * Set to syscall number for syscall exceptions,
     * -1 for all other exceptions.
     */
    long orig_gpr11; /* For restarting system calls */
    long dummy; /* Cheap alignment fix */
    long dummy2; /* Cheap alignment fix */
};
The structure of thread_struct now used by OpenRISC to store the user specific FPU state is as per below:
struct thread_struct {
    long fpcsr; /* Floating point control status register. */
};
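The save and restore step can be pictured with a small user space model. This is only a sketch under stated assumptions: the hardware register is simulated with a plain variable, and the names hw_fpcsr, struct task, switch_fpu_state and demo_round_trip are made up for illustration. In the real kernel this work happens as part of _switch, as outlined earlier.

```c
/* Stand-in for the FPCSR special purpose register (simulated, not hardware). */
static unsigned long hw_fpcsr;

/* Mirrors the thread_struct shown above. */
struct thread_struct {
    long fpcsr; /* Floating point control status register. */
};

/* Hypothetical, minimal task: just a name and its thread state. */
struct task {
    const char *name;
    struct thread_struct thread;
};

/* Save the outgoing task's FPCSR, then load the incoming task's FPCSR. */
void switch_fpu_state(struct task *prev, struct task *next)
{
    prev->thread.fpcsr = (long)hw_fpcsr;
    hw_fpcsr = (unsigned long)next->thread.fpcsr;
}

/* Two tasks pick different FPCSR values; each gets its own value back. */
int demo_round_trip(void)
{
    struct task a = { "a", { 0 } };
    struct task b = { "b", { 0 } };

    hw_fpcsr = 0x2;             /* task a is running and changes FPCSR */
    switch_fpu_state(&a, &b);   /* a -> b: a's value is saved, b's loaded */
    hw_fpcsr = 0x4;             /* task b changes FPCSR */
    switch_fpu_state(&b, &a);   /* b -> a: a sees 0x2 again */

    return hw_fpcsr == 0x2 && b.thread.fpcsr == 0x4;
}
```

Each task only ever observes the value it wrote itself, which is what lets programs share the one hardware register transparently.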
The patches to OpenRISC added to support saving and restoring FPU state during context switches are below:
Signal frames are another place that we want the FPU’s state, namely FPCSR, to be available.
When a user process receives a signal it executes a signal handler in the process space on a stack slightly outside its current stack. This is set up with setup_rt_frame.
As we saw above signals are received after syscalls or exceptions, during the do_work_pending phase of the entry code. This means FPU state will need to be saved and restored.
Again, we can look at the stack frames to paint a picture of how this works.
      kernel space

 +--------------+
 | Kernel Stack |
 |     |        |
 |     v        |
 |    pt_regs' --------------------------\
 |              |<+                      |
 |              | |                      |
 |              | |                      |
 +--------------+ |                      |
 | thread_info  | |                      |
 | *task        | |                      |
 |  ksp ----------/                      |
 +--------------+                        |
                                         |
 0xc0000000                              |
-------------------                      |
                                         |
   process a                             |
 +--------------+                        |
 | User Stack   |                        |
 |     |        |                        |
 |     V        |                        |
 |xxxxxxxxxxxxxx|> STACK_FRAME_OVERHEAD  |
 | siginfo      |\                       |
 | ucontext     | >- sigframe            |
 | retcode[]    |/                       |
 |              |<-----------------------/
 |              |
 | text         |
 | heap         |
 +--------------+

 0x00000000
      user space
Here we can see that when we enter a signal handler, we get a bunch of stuff pushed onto the stack in a sigframe structure. This includes the ucontext, or user context, which points to the original state of the program, registers and all. It also includes a bit of code, retcode, which is a trampoline to bring us back into the kernel after the signal handler finishes.
The user pt_regs (as we called pt_regs’) is updated before returning to user space to execute the signal handler code by updating the registers as follows:
sp: stack pointer updated to point to a new user stack area below sigframe
pc: program counter: sa_handler(
r3: argument 1: signo,
r4: argument 2: &siginfo
r5: argument 3: &ucontext)
r9: link register: retcode[]
Now, when we return from the kernel to user space, user space will resume in the signal handler, which runs within the user process context.
After the signal handler completes it will execute the retcode block which is set up to call the special system call rt_sigreturn.
sa_restorer
.
The rt_sigreturn system call will restore the ucontext registers (which may have been updated by the signal handler) to the user pt_regs on the kernel stack. This allows us to either restore the user context before the signal was received or return to a new context set up by the signal handler.
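From the user space side, the same three arguments show up in a handler registered with SA_SIGINFO. The sketch below is plain POSIX C and not OpenRISC specific; the names on_signal and deliver_sigusr1 are made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Values recorded by the handler so the caller can inspect them. */
static volatile sig_atomic_t seen_signo;
static volatile sig_atomic_t seen_context;

/* Three argument handler: signo, &siginfo and &ucontext, as listed above. */
static void on_signal(int signo, siginfo_t *info, void *ucontext)
{
    (void)info;
    seen_signo = signo;
    seen_context = (ucontext != NULL);
}

/* Install the handler, raise SIGUSR1 and report what the handler saw. */
int deliver_sigusr1(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_signal;
    sa.sa_flags = SA_SIGINFO;   /* ask for the siginfo and ucontext arguments */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    raise(SIGUSR1);             /* the handler runs before raise() returns */
    return seen_context ? seen_signo : 0;
}
```

The third argument is the ucontext that the kernel placed in the signal frame, so anything stored there, FPCSR included, is visible to the handler and is what rt_sigreturn restores from.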
We need to provide and restore the FPU FPCSR during signals via ucontext but also not break user space ABI. The ABI is important because kernel and user space programs may be built at different times. This means the layout of existing fields in ucontext cannot change. As we can see below by comparing the ucontext definitions from Linux, glibc and musl, each project maintains its own separate header file.
In Linux we cannot add fields to uc_mcontext (the struct sigcontext) as it would move uc_sigmask, which follows it, making it unable to be read by existing programs. Fortunately we had a bit of space in sigcontext in the unused oldmask field which we could repurpose for FPCSR.
The structure used by Linux to populate the signal frame is:
From: uapi/asm-generic/ucontext.h
struct ucontext {
    unsigned long uc_flags;
    struct ucontext *uc_link;
    stack_t uc_stack;
    struct sigcontext uc_mcontext;
    sigset_t uc_sigmask;
};
From: uapi/asm/sigcontext.h
struct sigcontext {
    struct user_regs_struct regs; /* needs to be first */
    union {
        unsigned long fpcsr;
        unsigned long oldmask; /* unused */
    };
};
From: uapi/asm/ptrace.h
struct user_regs_struct {
    unsigned long gpr[32];
    unsigned long pc;
    unsigned long sr;
};
The structure that glibc expects is:
/* Context to describe whole processor state. */
typedef struct
{
    unsigned long int __gprs[__NGREG];
    unsigned long int __pc;
    unsigned long int __sr;
} mcontext_t;
/* Userlevel context. */
typedef struct ucontext_t
{
    unsigned long int __uc_flags;
    struct ucontext_t *uc_link;
    stack_t uc_stack;
    mcontext_t uc_mcontext;
    sigset_t uc_sigmask;
} ucontext_t;
Note that mcontext_t in glibc is missing the space for oldmask.
The structure used by musl is:
typedef struct sigcontext {
    struct {
        unsigned long gpr[32];
        unsigned long pc;
        unsigned long sr;
    } regs;
    unsigned long oldmask;
} mcontext_t;
typedef struct __ucontext {
    unsigned long uc_flags;
    struct __ucontext *uc_link;
    stack_t uc_stack;
    mcontext_t uc_mcontext;
    sigset_t uc_sigmask;
} ucontext_t;
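To see why reusing oldmask is safe while a missing or added slot is not, we can compare where uc_sigmask lands in each variant. This is a sketch with simplified stand-in types (demo_stack_t, demo_sigset_t and the mc_* and uc_* names are made up here, they are not the real kernel or libc types), so only the relative offsets are meaningful.

```c
#include <stddef.h>

/* Simplified stand-ins so the example is self-contained (not the real types). */
typedef struct { void *ss_sp; int ss_flags; size_t ss_size; } demo_stack_t;
typedef struct { unsigned long sig[2]; } demo_sigset_t;

struct demo_regs { unsigned long gpr[32]; unsigned long pc; unsigned long sr; };

/* Before the change: regs followed by the unused oldmask. */
struct mc_old { struct demo_regs regs; unsigned long oldmask; };

/* After the change: fpcsr shares the slot with oldmask. */
struct mc_new {
    struct demo_regs regs;
    union { unsigned long fpcsr; unsigned long oldmask; };
};

/* A C library view that leaves out the oldmask slot. */
struct mc_short { struct demo_regs regs; };

#define DEMO_UCONTEXT(name, mctype)     \
    struct name {                       \
        unsigned long uc_flags;         \
        struct name *uc_link;           \
        demo_stack_t uc_stack;          \
        struct mctype uc_mcontext;      \
        demo_sigset_t uc_sigmask;       \
    }

DEMO_UCONTEXT(uc_old, mc_old);
DEMO_UCONTEXT(uc_new, mc_new);
DEMO_UCONTEXT(uc_short, mc_short);

/* Offset of uc_sigmask in each variant. */
size_t sigmask_off_old(void)   { return offsetof(struct uc_old, uc_sigmask); }
size_t sigmask_off_new(void)   { return offsetof(struct uc_new, uc_sigmask); }
size_t sigmask_off_short(void) { return offsetof(struct uc_short, uc_sigmask); }
```

The old and new variants agree on the offset of uc_sigmask, while the short variant is off by one unsigned long. A program built against the short layout would read the signal mask from the wrong place.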
Below are the patches to the OpenRISC kernel to add floating point state to the signal API. This originally caused some ABI breakage and was fixed in the second patch.
Register sets provide debuggers the ability to read and save the state of registers in other processes. This is done via the ptrace PTRACE_GETREGSET and PTRACE_SETREGSET requests.
Regsets also define what is dumped to core dumps when a process crashes.
In OpenRISC we added the ability to get and set the FPCSR register with the following patches:
I ported GCC to the OpenRISC FPU back in 2019. This entailed defining new instructions in the RTL machine description, for example:
(define_insn "addsf3"
  [(set (match_operand:SF 0 "register_operand" "=r")
        (plus:SF (match_operand:SF 1 "register_operand" "r")
                 (match_operand:SF 2 "register_operand" "r")))]
  "TARGET_HARD_FLOAT"
  "lf.add.s\t%d0, %d1, %d2"
  [(set_attr "type" "fpu")])
(define_insn "subsf3"
  [(set (match_operand:SF 0 "register_operand" "=r")
        (minus:SF (match_operand:SF 1 "register_operand" "r")
                  (match_operand:SF 2 "register_operand" "r")))]
  "TARGET_HARD_FLOAT"
  "lf.sub.s\t%d0, %d1, %d2"
  [(set_attr "type" "fpu")])
The above is a simplified example of GCC Register Transfer Language (RTL) lisp expressions. Note, the real expression actually uses mode iterators and is a bit harder to understand, hence the simplified version above. These expressions are used for translating the GCC compiler RTL from its abstract syntax tree form to actual machine instructions.
Notice how the above expressions are in the format (define_insn INSN_NAME RTL_PATTERN CONDITION MACHINE_INSN ...). If we break it down we see:
INSN_NAME - this is a unique name given to the instruction.
RTL_PATTERN - this is a pattern we look for in the RTL tree. Notice how the lisp represents 3 registers connected by the instruction node.
CONDITION - this is used to enable the instruction, in our case we use TARGET_HARD_FLOAT. This means if the GCC hardware floating point option is enabled this expression will be enabled.
MACHINE_INSN - this represents the actual OpenRISC assembly instruction that will be output.
In order for glibc to properly support floating point operations GCC needs to do a bit more than just support outputting floating point instructions. Another component of GCC is software floating point emulation. When there are operations not supported by hardware GCC needs to fall back to using software emulation. With the way GCC and GLIBC weave software math routines and floating point instructions together we can think of the FPU as a math accelerator. For example, the floating point square root operation is not provided by OpenRISC hardware.
When operations like square root are not available in hardware glibc will inject software routines to handle the operation. The emitted square root routine may use the hardware multiply lf.mul.s and divide lf.div.s operations to accelerate the emulation.
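As an illustration of a software routine built on top of hardware add, multiply and divide, here is a Newton iteration for square root. This is only a sketch: it is not the algorithm glibc uses, the name demo_sqrtf is made up, and it ignores NaN, infinity and correct rounding.

```c
/* Illustrative only: glibc has its own, more careful sqrt implementation. */
float demo_sqrtf(float x)
{
    float guess;
    int i;

    if (x <= 0.0f)
        return 0.0f;    /* keep the sketch simple: no negative/NaN handling */

    guess = x > 1.0f ? x : 1.0f;
    for (i = 0; i < 100; i++) {
        /* One add, one multiply and one divide per step; with a hard float
         * toolchain each of these can be a single FPU instruction. */
        guess = 0.5f * (guess + x / guess);
    }
    return guess;
}
```

Built with -mhard-float the loop body would be expected to use the FPU for each operation, while a soft float build would turn each one into a library call.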
In order for this to work correctly the rounding mode and exception state of the FPU and libgcc emulation need to be in sync. Notably, we had one patch to fix an issue with exceptions not being in sync which was found when running glibc tests.
The libc math routines include:
float sinf (float x)
float logf (float x)
float ccoshf (complex float z)
The above just names a few, but as you can imagine the floating point acceleration provided by the FPU is essential for performant scientific applications.
FPU debugging allows a user to inspect the FPU specific registers. This includes FPU state registers and flags as well as view the floating point values in each general purpose register. This is not yet implemented on OpenRISC.
This will be something others can take up. The work required is to map Linux FPU register sets to GDB.
In summary, adding floating point support to Linux revolved around adding one more register, the FPCSR, to context switches and a few other places.
GCC fixes were needed to make sure hardware and software floating point routines could work together.
There are still improvements that can be done for the Linux port as noted above. In the next article we will wrap things up by showing the glibc port.
In this entry we will cover updating Simulators and CPU implementations to support the architecture changes which are called for as per the previous article.
The simulators used for testing OpenRISC software without hardware are QEMU and or1ksim. They both needed to be updated to conform to the specification updates discussed above.
The OpenRISC architecture simulator or1ksim has been updated with the single patch: cpu: Allow FPCSR to be read/written in user mode.
The softfloat FPU implementation was already configured to detect tininess before rounding.
If you are interested you can download and run the simulator and test this out with a docker image pulled from docker hub using the following:
# using podman instead of docker, you can use docker here too
podman pull stffrdhrn/or1k-sim-env:latest
podman run -it --rm stffrdhrn/or1k-sim-env:latest
root@9a4a52eec8ee:/tmp# or1k-elf-sim -version
Seeding random generator with value 0x4a3c2bbd
OpenRISC 1000 Architectural Simulator, version 2023-08-20
This starts up an environment which has access to the OpenRISC architecture simulator and a GNU compiler toolchain. While still in the container we can run a quick test using the FPU as follows:
# Create a test program using OpenRISC FPU
cat > fpee.c <<EOF
#include <float.h>
#include <stdio.h>
#include <or1k-sprs.h>
#include <or1k-support.h>

static void enter_user_mode() {
    int32_t sr = or1k_mfspr(OR1K_SPR_SYS_SR_ADDR);
    sr &= ~OR1K_SPR_SYS_SR_SM_MASK;
    or1k_mtspr(OR1K_SPR_SYS_SR_ADDR, sr);
}

static void enable_fpu_exceptions() {
    unsigned long fpcsr = OR1K_SPR_SYS_FPCSR_FPEE_MASK;
    or1k_mtspr(OR1K_SPR_SYS_FPCSR_ADDR, fpcsr);
}

static void fpe_handler() {
    printf("Got FPU Exception, PC: 0x%lx\n", or1k_mfspr(OR1K_SPR_SYS_EPCR_BASE));
}

int main() {
    float result;

    or1k_exception_handler_add(0xd, fpe_handler);
#ifdef USER_MODE
    /* Note, printf here also allocates some memory allowing user mode runtime to
       work. */
    printf("Enabling user mode\n");
    enter_user_mode();
#endif
    enable_fpu_exceptions();
    printf("Exceptions enabled, now DIV 3.14 / 0!\n");
    result = 3.14f / 0.0f;
    /* Verify we see infinity. */
    printf("Result: %f\n", result);
    /* Verify we see DZF set. */
    printf("FPCSR: %x\n", or1k_mfspr(OR1K_SPR_SYS_FPCSR_ADDR));
#ifdef USER_MODE
    asm volatile("l.movhi r3, 0; l.nop 1"); /* Exit sim, now */
#endif
    return 0;
}
EOF
# Compile the program
or1k-elf-gcc -g -O2 -mhard-float fpee.c -o fpee
or1k-elf-sim -f /opt/or1k/sim.cfg ./fpee
# Expected results
# Program Header: PT_LOAD, vaddr: 0x00000000, paddr: 0x0 offset: 0x00002000, filesz: 0x000065ab, memsz: 0x000065ab
# Program Header: PT_LOAD, vaddr: 0x000085ac, paddr: 0x85ac offset: 0x000085ac, filesz: 0x000000c8, memsz: 0x0000046c
# WARNING: sim_init: Debug module not enabled, cannot start remote service to GDB
# Exceptions enabled, now DIV 3.14 / 0!
# Got FPU Exception, PC: 0x2068
# Result: f
# FPCSR: 801
# Compile the program to run in USER_MODE
or1k-elf-gcc -g -O2 -mhard-float -DUSER_MODE fpee.c -o fpee
or1k-elf-sim -f /opt/or1k/sim.cfg ./fpee
# Expected results with USER_MODE
# Program Header: PT_LOAD, vaddr: 0x00000000, paddr: 0x0 offset: 0x00002000, filesz: 0x000065ab, memsz: 0x000065ab
# Program Header: PT_LOAD, vaddr: 0x000085ac, paddr: 0x85ac offset: 0x000085ac, filesz: 0x000000c8, memsz: 0x0000046c
# WARNING: sim_init: Debug module not enabled, cannot start remote service to GDB
# Enabling user mode
# Exceptions enabled, now DIV 3.14 / 0!
# Got FPU Exception, PC: 0x2068
# Result: f
# FPCSR: 801
# exit(0)
In the above we can see how to compile a simple FPU test program and run it on or1ksim. The program sets up an FPU exception handler, enables exceptions then does a divide by zero to produce an exception. This program uses the OpenRISC newlib (baremetal) toolchain to compile a program that can run directly on the simulator, as opposed to a program running in an OS on a simulator or hardware.
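The FPCSR value 801 printed by the test is easier to read once it is split into flags. The helper below is a small sketch for doing that on any machine. The mask values are my understanding of the OpenRISC 1000 FPCSR layout and should be checked against the architecture manual; the names flag_names and fpcsr_describe are made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* FPCSR flag bits (assumed layout; the rounding mode in bits 2:1 is skipped). */
struct flag_name { unsigned long mask; const char *name; };

static const struct flag_name flag_names[] = {
    { 0x001, "FPEE" }, { 0x008, "OVF" }, { 0x010, "UNF" }, { 0x020, "SNF" },
    { 0x040, "QNF" },  { 0x080, "ZF" },  { 0x100, "IXF" }, { 0x200, "IVF" },
    { 0x400, "INF" },  { 0x800, "DZF" },
};

/* Write the names of the set flags into buf, separated by spaces. */
char *fpcsr_describe(unsigned long fpcsr, char *buf, size_t len)
{
    size_t i;

    if (len == 0)
        return buf;
    buf[0] = '\0';
    for (i = 0; i < sizeof(flag_names) / sizeof(flag_names[0]); i++) {
        const char *name = flag_names[i].name;

        if (!(fpcsr & flag_names[i].mask))
            continue;
        /* Room for an optional separator, the name and the terminator? */
        if (strlen(buf) + strlen(name) + 2 > len)
            break;
        if (buf[0] != '\0')
            strcat(buf, " ");
        strcat(buf, name);
    }
    return buf;
}
```

Passing 0x801 gives "FPEE DZF": exceptions are enabled and the divide by zero flag is set, which matches what the test program set out to verify.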
Note, that normally newlib programs expect to run in supervisor mode, when our program switches to user mode we need to take some precautions to ensure it can run correctly. As noted in the comments, usually when allocating and exiting the newlib runtime will do things like disabling/enabling interrupts which will fail when running in user mode.
The QEMU update was done in my OpenRISC user space FPCSR qemu patch series. The series was merged for the qemu 8.1 release.
The updates were split into three changes:
The first patch to allow FPCSR access in user mode was trivial, but required some code structure changes making the patch look bigger than it really was.
The next patch to properly set the exception PC address fixed a long existing bug where the EPCR was not properly updated after FPU exceptions. Up until now OpenRISC userspace did not support FPU instructions and this code path had not been tested.
To explain why this fix is important let us look at the EPCR and what it is used for in a bit more detail.
In general, when an exception occurs an OpenRISC CPU will store the program counter (PC) of the instruction that caused the exception into the exception program counter register (EPCR). Floating point exceptions are a special case in that the EPCR is actually set to the next instruction to be executed, this is to avoid looping.
When the Linux kernel handles a floating point exception it follows the path 0xd00 > _fpe_trap_handler > do_fpe_trap. This will set up a signal to be delivered to the user process.
The Linux OS uses the EPCR to report the exception instruction address to userspace via a signal, which we can see being done in do_fpe_trap below:
asmlinkage void do_fpe_trap(struct pt_regs *regs, unsigned long address)
{
    int code = FPE_FLTUNK;
    unsigned long fpcsr = regs->fpcsr;

    if (fpcsr & SPR_FPCSR_IVF)
        code = FPE_FLTINV;
    else if (fpcsr & SPR_FPCSR_OVF)
        code = FPE_FLTOVF;
    else if (fpcsr & SPR_FPCSR_UNF)
        code = FPE_FLTUND;
    else if (fpcsr & SPR_FPCSR_DZF)
        code = FPE_FLTDIV;
    else if (fpcsr & SPR_FPCSR_IXF)
        code = FPE_FLTRES;

    /* Clear all flags */
    regs->fpcsr &= ~SPR_FPCSR_ALLF;

    force_sig_fault(SIGFPE, code, (void __user *)regs->pc);
}
Here we see the exception becomes a SIGFPE signal and the exception address in regs->pc is passed to force_sig_fault. The PC will be used to set the si_addr field of the siginfo_t structure.
Next, upon return from kernel space to user space the path is do_fpe_trap > _fpe_trap_handler > ret_from_exception > resume_userspace > work_pending > do_work_pending > restore_all.
Inside of do_work_pending is where the signal handling is done. I explain a bit about this in the article Unwinding a Bug - How C++ Exceptions Work.
In restore_all we see EPCR is returned to when exception handling is complete. A snippet of this code is shown below:
#define RESTORE_ALL \
DISABLE_INTERRUPTS(r3,r4) ;\
l.lwz r3,PT_PC(r1) ;\
l.mtspr r0,r3,SPR_EPCR_BASE ;\
l.lwz r3,PT_SR(r1) ;\
l.mtspr r0,r3,SPR_ESR_BASE ;\
l.lwz r3,PT_FPCSR(r1) ;\
l.mtspr r0,r3,SPR_FPCSR ;\
l.lwz r2,PT_GPR2(r1) ;\
l.lwz r3,PT_GPR3(r1) ;\
l.lwz r4,PT_GPR4(r1) ;\
l.lwz r5,PT_GPR5(r1) ;\
l.lwz r6,PT_GPR6(r1) ;\
l.lwz r7,PT_GPR7(r1) ;\
l.lwz r8,PT_GPR8(r1) ;\
l.lwz r9,PT_GPR9(r1) ;\
l.lwz r10,PT_GPR10(r1) ;\
l.lwz r11,PT_GPR11(r1) ;\
l.lwz r12,PT_GPR12(r1) ;\
l.lwz r13,PT_GPR13(r1) ;\
l.lwz r14,PT_GPR14(r1) ;\
l.lwz r15,PT_GPR15(r1) ;\
l.lwz r16,PT_GPR16(r1) ;\
l.lwz r17,PT_GPR17(r1) ;\
l.lwz r18,PT_GPR18(r1) ;\
l.lwz r19,PT_GPR19(r1) ;\
l.lwz r20,PT_GPR20(r1) ;\
l.lwz r21,PT_GPR21(r1) ;\
l.lwz r22,PT_GPR22(r1) ;\
l.lwz r23,PT_GPR23(r1) ;\
l.lwz r24,PT_GPR24(r1) ;\
l.lwz r25,PT_GPR25(r1) ;\
l.lwz r26,PT_GPR26(r1) ;\
l.lwz r27,PT_GPR27(r1) ;\
l.lwz r28,PT_GPR28(r1) ;\
l.lwz r29,PT_GPR29(r1) ;\
l.lwz r30,PT_GPR30(r1) ;\
l.lwz r31,PT_GPR31(r1) ;\
l.lwz r1,PT_SP(r1) ;\
l.rfe
Here we can see how l.mtspr r0,r3,SPR_EPCR_BASE restores the EPCR to the pc address stored in pt_regs when we entered the exception handler. All other registers are restored and finally the l.rfe instruction is issued to return from the exception, which effectively jumps to EPCR.
The reason QEMU was not setting the correct exception address is due to the way QEMU is implemented, which optimizes for performance. QEMU executes target code basic blocks that are translated to host native instructions; during runtime all PC addresses are those of the host, for example x86-64 64-bit addresses. When an exception occurs, updating the target PC address from the host PC needs to be explicitly requested.
The next patch to implement tininess before rounding was also trivial but brought up a conversation about default NaN payloads.
Wait, there is more. While writing this article I realized that if QEMU was setting the EPCR to the FPU instruction causing the exception then we would end up in an endless loop.
Luckily the architecture anticipated this, calling for FPU exceptions to set the EPCR to the next instruction to be executed. QEMU was missing this logic.
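The looping problem is easy to see in a toy model. The sketch below is not QEMU or kernel code: it models a three instruction program where the middle instruction always raises an FPU exception, and the function name count_fpu_traps and its parameters are made up for this example.

```c
/*
 * Toy model: instruction 1 always raises an FPU exception. Returns how many
 * times the trap handler ran before the program finished, or max_traps if
 * the program never got past the faulting instruction.
 */
int count_fpu_traps(int epcr_is_next_insn, int max_traps)
{
    const int faulting_insn = 1;
    const int last_insn = 2;
    int pc = 0;
    int traps = 0;

    while (pc <= last_insn && traps < max_traps) {
        if (pc == faulting_insn) {
            /* Hardware records where to resume, the handler runs, and the
             * return from exception jumps back to EPCR. */
            int epcr = epcr_is_next_insn ? pc + 1 : pc;

            traps++;
            pc = epcr;
            continue;
        }
        pc++;   /* ordinary instruction */
    }
    return traps;
}
```

With EPCR pointing at the next instruction the handler runs once and the program completes. With EPCR pointing at the faulting instruction the same instruction traps again on every return, until the model gives up at max_traps.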
The patch target/openrisc: Set EPCR to next PC on FPE exceptions fixes this up.
Updating the actual verilog RTL CPU implementations also needed to be done. Updates have been made to both the mor1kx and the or1k_marocchino implementations.
Updates to the mor1kx to support user mode reads and writes to the FPCSR were done in the patch: Make FPCSR is R/W accessible for both user- and supervisor- modes.
The full patch is:
@@ -618,7 +618,7 @@ module mor1kx_ctrl_cappuccino
spr_fpcsr[`OR1K_FPCSR_FPEE] <= 1'b0;
end
else if ((spr_we & spr_access[`OR1K_SPR_SYS_BASE] &
- (spr_sr[`OR1K_SPR_SR_SM] & padv_ctrl | du_access)) &&
+ (padv_ctrl | du_access)) &&
`SPR_OFFSET(spr_addr)==`SPR_OFFSET(`OR1K_SPR_FPCSR_ADDR)) begin
spr_fpcsr <= spr_write_dat[`OR1K_FPCSR_WIDTH-1:0]; // update all fields
`ifdef OR1K_FPCSR_MASK_FLAGS
The change to the Verilog shows that before, when writing (spr_we) to the FPCSR (OR1K_SPR_FPCSR_ADDR) register, we used to check that the supervisor mode bit (OR1K_SPR_SR_SM) of the SR SPR (spr_sr) is set. That check enforced supervisor mode only write access; removing it allows user space to write to the register.
Updating mor1kx to support tininess checking before rounding was done in the change Refactoring and implementation tininess detection before rounding. I will not go into the details of these patches as I don’t understand them so much.
Updates to the or1k_marocchino to support user mode reads and writes to the FPCSR were done in the patch: Make FPCSR is R/W accessible for both user- and supervisor- modes.
The full patch is:
@@ -714,7 +714,7 @@ module or1k_marocchino_ctrl
assign except_fpu_enable_o = spr_fpcsr[`OR1K_FPCSR_FPEE];
wire spr_fpcsr_we = (`SPR_OFFSET(({1'b0, spr_sys_group_wadr_r})) == `SPR_OFFSET(`OR1K_SPR_FPCSR_ADDR)) &
- spr_sys_group_we & spr_sr[`OR1K_SPR_SR_SM];
+ spr_sys_group_we; // FPCSR is R/W for both user- and supervisor- modes
`ifdef OR1K_FPCSR_MASK_FLAGS
reg [`OR1K_FPCSR_ALLF_SIZE-1:0] ctrl_fpu_mask_flags_r;
Updating the marocchino to support detecting tininess before rounding was done in the patch: Refactoring FPU Implementation for tininess detection BEFORE ROUNDING. I will not go into details of the patch as I didn’t write it. In general it is a medium size refactoring of the floating point unit.
We discussed updates to the architecture simulators and verilog CPU implementations to allow supporting user mode floating point programs. These updates will now allow us to port Linux and glibc to the OpenRISC floating point unit.
The upstreamed OpenRISC glibc support is missing support for leveraging the OpenRISC floating-point unit (FPU). Adding OpenRISC glibc FPU support requires a cross-cutting effort across the architecture’s full stack from:
In this blog entry I will cover how the OpenRISC architecture specification was updated to support user space floating point applications. But first, what is FPU porting?
The FPUs in modern CPUs allow the processor to perform IEEE 754 floating point math like addition, subtraction and multiplication. When used in a user application the FPU’s function becomes more of a math accelerator, speeding up math operations including trigonometric and complex functions such as sin, sinf and cexpf. Not all FPUs provide the same set of FPU operations nor do they have to. When enabled, the compiler will insert floating point instructions where they can be used.
OpenRISC FPU support was added to the GCC compiler a while back. We can see how this works with a simple example using the bare-metal newlib toolchain.
C code example addf.c:
float addf(float a, float b) {
    return a + b;
}
To compile this C function we can do:
$ or1k-elf-gcc -O2 addf.c -c -o addf-sf.o
$ or1k-elf-gcc -O2 -mhard-float addf.c -c -o addf-hf.o
The assembly output of addf-sf.o contains the default software floating point implementation. We can see below that a call to __addsf3 was added to perform our floating point operation. The function __addsf3 is provided by libgcc as a software implementation of the single precision floating point (sf) add operation.
$ or1k-elf-objdump -dr addf-sf.o
Disassembly of section .text:
00000000 <addf>:
0: 9c 21 ff fc l.addi r1,r1,-4
4: d4 01 48 00 l.sw 0(r1),r9
8: 04 00 00 00 l.jal 8 <addf+0x8>
8: R_OR1K_INSN_REL_26 __addsf3
c: 15 00 00 00 l.nop 0x0
10: 85 21 00 00 l.lwz r9,0(r1)
14: 44 00 48 00 l.jr r9
18: 9c 21 00 04 l.addi r1,r1,4
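To get a feel for why the software route costs more, remember that a routine like __addsf3 has to work on the raw bit fields of the float using integer instructions. The sketch below only does the first step, unpacking the fields; it is not libgcc's code, and the names sf_fields and sf_unpack are made up for this example.

```c
#include <stdint.h>
#include <string.h>

struct sf_fields {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, biased by 127 */
    uint32_t fraction;  /* 23 bits, without the implicit leading 1 */
};

/* Split an IEEE 754 single precision value into its three bit fields. */
struct sf_fields sf_unpack(float value)
{
    struct sf_fields f;
    uint32_t bits;

    memcpy(&bits, &value, sizeof(bits));    /* reinterpret the bits safely */
    f.sign = bits >> 31;
    f.exponent = (bits >> 23) & 0xffu;
    f.fraction = bits & 0x7fffffu;
    return f;
}
```

A software add then has to align the two fractions, add them, renormalize, round and repack, all with integer operations, where the hardware version is a single lf.add.s.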
The disassembly of addf-hf.o below shows that the FPU (hardware) instruction lf.add.s is used to perform the addition; this is because the snippet was compiled using the -mhard-float argument. Where the hardware supports it, this single instruction should be more efficient than the call into the software implementation.
$ or1k-elf-objdump -dr addf-hf.o
Disassembly of section .text:
00000000 <addf>:
0: c9 63 20 00 lf.add.s r11,r3,r4
4: 44 00 48 00 l.jr r9
8: 15 00 00 00 l.nop 0x0
So if the OpenRISC toolchain already has support for FPU instructions what else needs to be done? When we add FPU support to glibc we are adding FPU support to the OpenRISC POSIX runtime and creating a toolchain that can compile and link binaries to run on this runtime.
Below we can see examples of two application runtimes: Application A runs with software floating point, the other, Application B, runs with full hardware floating point.
Both Application A and Application B can run on the same system, but Application B requires a libc and kernel that support the floating point runtime. As we can see:
Another aspect is that supporting hardware floating point in the OS means that multiple user land programs can transparently use the FPU. To do all of this we need to update the kernel and the C runtime libraries to:
In order to compile applications like Application B a separate compiler toolchain is needed. For highly configurable embedded system CPUs like ARM and RISC-V there are multiple toolchains available for building software for the different CPU configurations. Usually there will be one toolchain for soft float and one for hard float support, see the below example from the arm toolchain download page.
As we started to work on the floating point support we found two issues:
The GLIBC OpenRISC FPU port, or any port for that matter, starts by looking at what other architectures have done. For GLIBC FPU support we can look at what MIPS, ARM, RISC-V etc. have implemented. Most ports have a file called sysdeps/{arch}/fpu_control.h. I noticed one thing right away as I went through this; we can look at ARM or MIPS for example:
sysdeps/mips/fpu_control.h: Excerpt from the MIPS port showing the definition of _FPU_GETCW and _FPU_SETCW
#else
# define _FPU_GETCW(cw) __asm__ volatile ("cfc1 %0,$31" : "=r" (cw))
# define _FPU_SETCW(cw) __asm__ volatile ("ctc1 %0,$31" : : "r" (cw))
#endif
sysdeps/arm/fpu_control.h: Excerpt from the ARM port showing the definition of _FPU_GETCW and _FPU_SETCW
# define _FPU_GETCW(cw) \
__asm__ __volatile__ ("vmrs %0, fpscr" : "=r" (cw))
# define _FPU_SETCW(cw) \
__asm__ __volatile__ ("vmsr fpscr, %0" : : "r" (cw))
#endif
What we see here is a macro that defines how to read or write the floating point control word for each architecture. The macros are implemented using a single assembly instruction.
In OpenRISC we have similar instructions for reading and writing the floating point control register (FPCSR); writing, for example, is: l.mtspr r0,%0,20. However, on OpenRISC the FPCSR is read-only when running in user-space, this is a problem.
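Following the MIPS and ARM pattern, an OpenRISC version of these macros could look something like the sketch below. This is not copied from the glibc port: the DEMO_ names are made up, the l.mfspr form for reading and the __or1k__ compiler macro are assumptions on my part, and on any other machine a plain variable stands in for the register so the example can still be run.

```c
#if defined(__or1k__)
/* On OpenRISC: FPCSR is special purpose register number 20. */
# define DEMO_FPU_GETCW(cw) __asm__ volatile ("l.mfspr %0,r0,20" : "=r" (cw))
# define DEMO_FPU_SETCW(cw) __asm__ volatile ("l.mtspr r0,%0,20" : : "r" (cw))
#else
/* Elsewhere: a plain variable stands in for the register. */
static unsigned int demo_fpcsr;
# define DEMO_FPU_GETCW(cw) ((cw) = demo_fpcsr)
# define DEMO_FPU_SETCW(cw) (demo_fpcsr = (cw))
#endif

/* Replace the rounding mode bits (bits 2:1) and return the old control word. */
unsigned int demo_set_rounding(unsigned int rm_bits)
{
    unsigned int old, cw;

    DEMO_FPU_GETCW(old);
    cw = (old & ~0x6u) | (rm_bits & 0x6u);
    DEMO_FPU_SETCW(cw);
    return old;
}
```

A function like fesetround boils down to this kind of read, modify and write of the control word, which is why the write has to be allowed from user mode.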
If we remember from our operating system studies, user applications run in user-mode as opposed to the privileged kernel-mode. The user floating point environment is defined by POSIX in the ISO C Standard. The C library provides functions to set rounding modes and clear exceptions using, for example, fesetround for setting FPU rounding modes and feholdexcept for clearing exceptions. If userspace applications need to be able to control the floating point unit then having architecture support for this is integral.
Originally OpenRISC architecture specification specified the floating point control and status registers (FPCSR) as being read only when executing in user mode, again this is a problem and needs to be addressed.
Other architectures define the floating point control register as being writable in user-mode. For example, ARM has the FPCR and FPSR, and RISC-V has the FCSR all of which are writable in user-mode.
I am skipping ahead a bit here, once the OpenRISC GLIBC port was working we noticed many problematic math test failures. This turned out to be inconsistencies between the tininess detection settings in the toolchain. Tininess detection must be selected by an FPU implementation as being done before or after rounding. In the toolchain this is configured by:
TININESS_AFTER_ROUNDING - macro used by the test suite to control expectations.
_FP_TININESS_AFTER_ROUNDING - macro used to control the softfloat implementation in GLIBC.
_FP_TININESS_AFTER_ROUNDING - macro used to control the softfloat implementation in GCC libgcc.
Writing to FPCSR from user-mode could be worked around in OpenRISC by introducing a syscall, but we decided to just change the architecture specification for this. Updating the spec keeps it similar to all other architectures out there.
In OpenRISC we have defined tininess detection to be done before rounding, as this matches what existing FPU implementations have done.
As of architecture specification revision 1.4 the FPCSR is defined as being writable in user-mode and we have documented tininess detection to be before rounding.
We’ve gone through an overview of how the FPU accelerates math in an application runtime. We then looked at how the OpenRISC architecture specification needed to be updated to support the floating point POSIX runtime.
In the next entry we shall look into patches to get QEMU and CPU implementations updated to support the new spec changes.
My first upstreaming attempt was completely tested on the QEMU simulator. I have since added an FPGA LiteX SoC to my test platform options. LiteX runs Linux on the OpenRISC mor1kx softcore and tests are loaded over an SSH session. The SoC eliminates an issue I was seeing on the simulator where under heavy load it appears the MMU starves the kernel from getting any work done.
To get to where I am now this required:
Adding GDB Linux debugging support is great because it allows debugging of multithreaded processes and signal handling, which we are going to need.
Our story starts when I was trying to fix a failing GLIBC NPTL test case. The test case involves C++ exceptions and POSIX threads. The issue is that the catch block of a try/catch block is not being called. Where do we even start?
My plan for approaching test case failures is:
2.
Let’s have a try.
The GLIBC test case is nptl/tst-cancel24.cc.
The test starts in the do_test function and it will create a child thread with pthread_create. The child thread executes function tf which waits on a semaphore until the parent thread cancels it. It is expected that the child thread, when cancelled, will call its catch block.
The failure is that the catch block is not getting run, as evidenced by the except_caught variable not being set to true.
Below is an excerpt from the test showing the tf function.
static void *
tf (void *arg) {
  sem_t *s = static_cast<sem_t *> (arg);

  try {
    monitor m;

    pthread_barrier_wait (&b);

    while (1)
      sem_wait (s);
  } catch (...) {
    except_caught = true;
    throw;
  }
  return NULL;
}
So the catch block is not being run. Simple, but where do we start to debug that? Let’s move onto the next step.
This one is a bit tricky as it seems C++ try/catch blocks are broken. Here, I am working on GLIBC testing, what does that have to do with C++?
To get a better idea of where the problem is I tried to modify the test to test some simple ideas. First, maybe there is a problem with catching exceptions thrown from thread child functions.
static void do_throw() { throw 99; }
static void * tf () {
try {
monitor m;
while (1) do_throw();
} catch (...) {
except_caught = true;
}
return NULL;
}
No, this works correctly. So try/catch is working.
Hypothesis: There is a problem handling exceptions while in a syscall. There may be something broken with OpenRISC related to how we setup stack frames for syscalls that makes the unwinder fail.
How does that work? Let’s move onto the next step.
To find this bug we need to understand how C++ exceptions work. Also, we need to know what happens when a thread is cancelled in a multithreaded (pthread) glibc environment.
There are a few contributors to pthread cancellation and C++ exceptions, which are:
.eh_frame ELF section
libgcc_s.so - handles unwinding by reading program DWARF metadata and doing the frame decoding
libstdc++.so.6 - provides the C++ personality routine which identifies and prepares catch blocks for execution
ELF binaries provide debugging information in a data format called DWARF. The name was chosen to maintain a fantasy theme. Lately the Linux community has a new debug format called ORC.
Though DWARF is a debugging format and is usually stored in the .debug_frame, .debug_info, etc. sections, a stripped down version of it is used for exception handling.
Each ELF binary that supports unwinding contains the .eh_frame section to provide unwinding information. This can be seen with the readelf program.
$ readelf -S sysroot/lib/libc.so.6
There are 70 section headers, starting at offset 0xaa00b8:
Section Headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[ 0] NULL 00000000 000000 000000 00 0 0 0
[ 1] .note.ABI-tag NOTE 00000174 000174 000020 00 A 0 0 4
[ 2] .gnu.hash GNU_HASH 00000194 000194 00380c 04 A 3 0 4
[ 3] .dynsym DYNSYM 000039a0 0039a0 008280 10 A 4 15 4
[ 4] .dynstr STRTAB 0000bc20 00bc20 0054d4 00 A 0 0 1
[ 5] .gnu.version VERSYM 000110f4 0110f4 001050 02 A 3 0 2
[ 6] .gnu.version_d VERDEF 00012144 012144 000080 00 A 4 4 4
[ 7] .gnu.version_r VERNEED 000121c4 0121c4 000030 00 A 4 1 4
[ 8] .rela.dyn RELA 000121f4 0121f4 00378c 0c A 3 0 4
[ 9] .rela.plt RELA 00015980 015980 000090 0c AI 3 28 4
[10] .plt PROGBITS 00015a10 015a10 0000d0 04 AX 0 0 4
[11] .text PROGBITS 00015ae0 015ae0 155b78 00 AX 0 0 4
[12] __libc_freeres_fn PROGBITS 0016b658 16b658 001980 00 AX 0 0 4
[13] .rodata PROGBITS 0016cfd8 16cfd8 0192b4 00 A 0 0 4
[14] .interp PROGBITS 0018628c 18628c 000018 00 A 0 0 1
[15] .eh_frame_hdr PROGBITS 001862a4 1862a4 001a44 00 A 0 0 4
[16] .eh_frame PROGBITS 00187ce8 187ce8 007cf4 00 A 0 0 4
[17] .gcc_except_table PROGBITS 0018f9dc 18f9dc 000341 00 A 0 0 1
...
We can decode the metadata using readelf as well, using the --debug-dump=frames-interp and --debug-dump=frames arguments.
The frames dump provides a raw output of the DWARF metadata for each frame. This is not usually as useful as frames-interp, but it shows how the DWARF format is actually a bytecode. The DWARF interpreter needs to execute these operations to understand how to derive the values of registers based on the current PC.
There is an interesting talk in Exploiting the hard-working DWARF.pdf.
An example of the frames dump:
$ readelf --debug-dump=frames sysroot/lib/libc.so.6
...
00016788 0000000c ffffffff CIE
Version: 1
Augmentation: ""
Code alignment factor: 4
Data alignment factor: -4
Return address column: 9
DW_CFA_def_cfa_register: r1
DW_CFA_nop
00016798 00000028 00016788 FDE cie=00016788 pc=0016b584..0016b658
DW_CFA_advance_loc: 4 to 0016b588
DW_CFA_def_cfa_offset: 4
DW_CFA_advance_loc: 8 to 0016b590
DW_CFA_offset: r9 at cfa-4
DW_CFA_advance_loc: 68 to 0016b5d4
DW_CFA_remember_state
DW_CFA_def_cfa_offset: 0
DW_CFA_restore: r9
DW_CFA_restore_state
DW_CFA_advance_loc: 56 to 0016b60c
DW_CFA_remember_state
DW_CFA_def_cfa_offset: 0
DW_CFA_restore: r9
DW_CFA_restore_state
DW_CFA_advance_loc: 36 to 0016b630
DW_CFA_remember_state
DW_CFA_def_cfa_offset: 0
DW_CFA_restore: r9
DW_CFA_restore_state
DW_CFA_advance_loc: 40 to 0016b658
DW_CFA_def_cfa_offset: 0
DW_CFA_restore: r9
The frames-interp argument is a bit clearer as it shows the interpreted output of the bytecode. Below we see two types of entries:
CIE - Common Information Entry
FDE - Frame Description Entry
The CIE provides starting point information for each child FDE entry. Some things to point out: ra=9 indicates the return address is stored in register r9, CFA r1+0 indicates the canonical frame address is computed from register r1, and we see the stack frame size is 4 bytes.
An example of the frames-interp dump:
$ readelf --debug-dump=frames-interp sysroot/lib/libc.so.6
...
00016788 0000000c ffffffff CIE "" cf=4 df=-4 ra=9
LOC CFA
00000000 r1+0
00016798 00000028 00016788 FDE cie=00016788 pc=0016b584..0016b658
LOC CFA ra
0016b584 r1+0 u
0016b588 r1+4 u
0016b590 r1+4 c-4
0016b5d4 r1+4 c-4
0016b60c r1+4 c-4
0016b630 r1+4 c-4
0016b658 r1+0 u
GLIBC provides pthreads which, when used with C++, needs to support exception handling. The main place exceptions are used with pthreads is when cancelling threads. When using pthread_cancel a cancel signal is sent to the target thread using tgkill, which causes an exception.
This is implemented with the below APIs.
__do_cancel, which calls __pthread_unwind.
pd->cancel_jmp_buf. It calls glibc’s __Unwind_ForcedUnwind.
libgcc_s.so version of _Unwind_ForcedUnwind and calls it with parameters:
exc - the exception context
unwind_stop - the stop callback to GLIBC, called for each frame of the unwind, with the stop argument ibuf
ibuf - the jmp_buf, created by setjmp (self->cancel_jmp_buf) in start_thread
cancel_jmp_buf if we are at the end of stack. When the cancel_jmp_buf is called the thread exits.
Let’s look at pd->cancel_jmp_buf in more detail. The cancel_jmp_buf is set up during pthread_create after clone in start_thread. It uses the setjmp and longjmp non-local goto mechanism.
Let’s look at some diagrams.
The above diagram shows a pthread that exits normally. During the Start phase of the thread setjmp will create the cancel_jmp_buf. After the thread routine exits it returns to the start_thread routine to do cleanup. The cancel_jmp_buf is not used.
The above diagram shows a pthread that is cancelled. When the thread is created setjmp will create the cancel_jmp_buf. In this case, while the thread routine is running it is cancelled; the unwinder runs and at the end it calls unwind_stop, which calls longjmp. After the longjmp the thread is returned to start_thread to do cleanup.
A highly redacted version of our start_thread and unwind_stop functions is shown below.
start_thread()
{
struct pthread *pd = START_THREAD_SELF;
...
struct pthread_unwind_buf unwind_buf;
int not_first_call;
not_first_call = setjmp ((struct __jmp_buf_tag *) unwind_buf.cancel_jmp_buf);
...
if (__glibc_likely (! not_first_call))
{
/* Store the new cleanup handler info. */
THREAD_SETMEM (pd, cleanup_jmp_buf, &unwind_buf);
...
/* Run the user provided thread routine */
ret = pd->start_routine (pd->arg);
THREAD_SETMEM (pd, result, ret);
}
... free resources ...
__exit_thread ();
}
unwind_stop (_Unwind_Action actions,
struct _Unwind_Context *context, void *stop_parameter)
{
struct pthread_unwind_buf *buf = stop_parameter;
struct pthread *self = THREAD_SELF;
int do_longjump = 0;
...
if ((actions & _UA_END_OF_STACK)
|| ... )
do_longjump = 1;
...
/* If we are at the end, go back start_thread for cleanup */
if (do_longjump)
__libc_unwind_longjmp ((struct __jmp_buf_tag *) buf->cancel_jmp_buf, 1);
return _URC_NO_REASON;
}
GCC provides the exception handling and unwinding capabilities to the C++ runtime. They are provided in the libgcc_s.so and libstdc++.so.6 libraries.
The libgcc_s.so library implements the IA-64 Itanium Exception Handling ABI. It’s interesting that the now defunct Itanium architecture introduced this ABI, which is now the standard for all processor exception handling. The two main entry points for the unwinder are:
_Unwind_ForcedUnwind - for forced unwinding
_Unwind_RaiseException - for raising normal exceptions
There are also two data structures to be aware of:
The _Unwind_Context important parts:
struct _Unwind_Context {
_Unwind_Context_Reg_Val reg[__LIBGCC_DWARF_FRAME_REGISTERS__+1];
void *cfa;
void *ra;
struct dwarf_eh_bases bases;
_Unwind_Word flags;
};
The _Unwind_FrameState important parts:
typedef struct {
struct frame_state_reg_info { ... } regs;
void *pc;
/* The information we care about from the CIE/FDE. */
_Unwind_Personality_Fn personality;
_Unwind_Sword data_align;
_Unwind_Word code_align;
_Unwind_Word retaddr_column;
unsigned char fde_encoding;
unsigned char signal_frame;
void *eh_ptr;
} _Unwind_FrameState;
These two data structures are very similar. The _Unwind_FrameState is for internal use and closely ties to the DWARF definitions of the frame. The _Unwind_Context struct is more generic and is used as an opaque structure in the public unwind API.
Forced Unwinds
Exceptions that are raised for thread cancellation use a single phase forced unwind. Code execution will not resume, but catch blocks will be run. This is why cancel exceptions must be rethrown.
Forced unwinds use the unwind_stop handler which GLIBC provides, as explained in the GLIBC section above.
_Unwind_ForcedUnwind_Phase2 - loops forever doing:
stop - callback to GLIBC to stop the unwind if needed
FS.personality - the C++ personality routine, see below, called with _UA_FORCE_UNWIND | _UA_CLEANUP_PHASE
Normal Exceptions
For exceptions raised programmatically unwinding is very similar to the forced unwind, but there is no stop function and exception unwinding is 2 phase.
uw_init_context - load details of the current frame from cpu/stack into CONTEXT
uw_frame_state_for - populate FS for the frame one frame above CONTEXT, searching DWARF using CONTEXT->ra
FS.personality - the C++ personality routine, see below, called with _UA_SEARCH_PHASE
uw_advance_context)
uw_install_context - exit unwinder jumping to selected frame
_Unwind_RaiseException_Phase2 - do phase 2, loops forever doing:
uw_frame_state_for - populate FS for the frame one frame above CONTEXT, searching DWARF using CONTEXT->ra
FS.personality - the C++ personality routine, called with _UA_CLEANUP_PHASE
uw_update_context - advance CONTEXT by populating it from FS
The libstdc++.so.6 library provides the C++ standard library which includes the C++ personality routine __gxx_personality_v0.
The personality routine is the interface between the unwind routines and the C++ (or other language) runtime, which handles the exception handling logic for that language.
As we saw above the personality routine is executed for each stack frame. The function checks if there is a catch block that matches the exception being thrown. If there is a match, it will update the context to prepare it to jump into the catch routine and return _URC_INSTALL_CONTEXT. If there is no catch block matching it returns _URC_CONTINUE_UNWIND.
In the case of _URC_INSTALL_CONTEXT the _Unwind_ForcedUnwind_Phase2 loop breaks and calls uw_install_context.
When the GCC unwinder is looping through frames the uw_frame_state_for function will search DWARF information. The DWARF lookup will fail for signal frames and a fallback mechanism is provided for each architecture to handle this. For OpenRISC Linux this is handled by or1k_fallback_frame_state. To understand how this works let’s look into the Linux kernel a bit.
A process must be context switched to kernel by either a system call, timer or other interrupt in order to receive a signal.
The diagram above shows what a process stack looks like after the kernel takes over. An interrupt frame is pushed to the top of the stack and the pt_regs structure is filled out containing the processor state before the interrupt.
This second diagram shows what happens when a signal handler is invoked. A new special signal frame is pushed onto the stack and when the process is resumed it resumes in the signal handler. In OpenRISC the signal frame is set up by the setup_rt_frame function: do_signal calls handle_signal, which calls setup_rt_frame.
After the signal handler routine runs we return to a special bit of code called the Trampoline. The trampoline code lives on the stack and runs sigreturn.
Now back to or1k_fallback_frame_state.
The or1k_fallback_frame_state function checks if the current frame is a signal frame by confirming the return address points to a Trampoline. If it is a trampoline it looks into the kernel saved ucontext and pt_regs to find the previous user frame. Unwinding can then continue as normal.
Now with a good background in how unwinding works we can start to debug our test case. We can recall our hypothesis:
Hypothesis: There is a problem handling exceptions while in a syscall. There may be something broken with OpenRISC related to how we setup stack frames for syscalls that makes the unwinder fail.
With GDB we can start to debug exception handling. We can trace right to the start of the exception handling logic by setting our breakpoint at _Unwind_ForcedUnwind.
This is the stack trace we see:
#0 _Unwind_ForcedUnwind_Phase2 (exc=0x30caf658, context=0x30caeb6c, frames_p=0x30caea90) at ../../../libgcc/unwind.inc:192
#1 0x30303858 in _Unwind_ForcedUnwind (exc=0x30caf658, stop=0x30321dcc <unwind_stop>, stop_argument=0x30caeea4) at ../../../libgcc/unwind.inc:217
#2 0x30321fc0 in __GI___pthread_unwind (buf=<optimized out>) at unwind.c:121
#3 0x30312388 in __do_cancel () at pthreadP.h:313
#4 sigcancel_handler (sig=32, si=0x30caec98, ctx=<optimized out>) at nptl-init.c:162
#5 sigcancel_handler (sig=<optimized out>, si=0x30caec98, ctx=<optimized out>) at nptl-init.c:127
#6 <signal handler called>
#7 0x303266d0 in __futex_abstimed_wait_cancelable64 (futex_word=0x7ffffd78, expected=1, clockid=<optimized out>, abstime=0x0, private=<optimized out>)
at ../sysdeps/nptl/futex-internal.c:66
#8 0x303210f8 in __new_sem_wait_slow64 (sem=0x7ffffd78, abstime=0x0, clockid=0) at sem_waitcommon.c:285
#9 0x00002884 in tf (arg=0x7ffffd78) at throw-pthread-sem.cc:35
#10 0x30314548 in start_thread (arg=<optimized out>) at pthread_create.c:463
#11 0x3043638c in __or1k_clone () from /lib/libc.so.6
Backtrace stopped: frame did not save the PC
(gdb)
In the GDB backtrace we can see it unwinds through the signal frame and sem_wait, all the way to our thread routine tf. It appears everything is working fine. But we need to remember the backtrace we see above is from GDB’s unwinder, not GCC’s; also it uses the .debug_info DWARF data, not .eh_frame.
To really ensure the GCC unwinder is working as expected we need to debug it walking the stack. Debugging when we unwind a signal frame can be done by placing a breakpoint on or1k_fallback_frame_state.
Debugging this code as well shows it works correctly.
#0 or1k_fallback_frame_state (context=<optimized out>, context=<optimized out>, fs=<optimized out>) at ../../../libgcc/unwind-dw2.c:1271
#1 uw_frame_state_for (context=0x30caeb6c, fs=0x30cae914) at ../../../libgcc/unwind-dw2.c:1271
#2 0x30303200 in _Unwind_ForcedUnwind_Phase2 (exc=0x30caf658, context=0x30caeb6c, frames_p=0x30caea90) at ../../../libgcc/unwind.inc:162
#3 0x30303858 in _Unwind_ForcedUnwind (exc=0x30caf658, stop=0x30321dcc <unwind_stop>, stop_argument=0x30caeea4) at ../../../libgcc/unwind.inc:217
#4 0x30321fc0 in __GI___pthread_unwind (buf=<optimized out>) at unwind.c:121
#5 0x30312388 in __do_cancel () at pthreadP.h:313
#6 sigcancel_handler (sig=32, si=0x30caec98, ctx=<optimized out>) at nptl-init.c:162
#7 sigcancel_handler (sig=<optimized out>, si=0x30caec98, ctx=<optimized out>) at nptl-init.c:127
#8 <signal handler called>
#9 0x303266d0 in __futex_abstimed_wait_cancelable64 (futex_word=0x7ffffd78, expected=1, clockid=<optimized out>, abstime=0x0, private=<optimized out>) at ../sysdeps/nptl/futex-internal.c:66
#10 0x303210f8 in __new_sem_wait_slow64 (sem=0x7ffffd78, abstime=0x0, clockid=0) at sem_waitcommon.c:285
#11 0x00002884 in tf (arg=0x7ffffd78) at throw-pthread-sem.cc:35
Debugging when the unwinding stops can be done by setting a breakpoint on the unwind_stop function.
When debugging I was able to see that the unwinder failed when looking for the __futex_abstimed_wait_cancelable64 frame. So, this is not an issue with unwinding signal frames.
Debugging showed that the unwinder is working correctly, and it can properly unwind through our signal frames. However, the unwinder is bailing out early before it gets to the tf frame which has the catch block we need to execute.
Hypothesis 2: There is something wrong finding DWARF info for __futex_abstimed_wait_cancelable64.
Looking at libpthread.so with readelf, this function was missing completely from the .eh_frame metadata. Now we found something.
What creates the .eh_frame anyway? GCC or Binutils (Assembler)? If we run GCC with the -S argument we can see GCC will output inline .cfi directives. These .cfi annotations are what gets compiled into the .eh_frame. GCC creates the .cfi directives and the Assembler puts them into the .eh_frame section.
An example of gcc -S:
.file "unwind.c"
.section .text
.align 4
.type unwind_stop, @function
unwind_stop:
.LFB83:
.cfi_startproc
l.addi r1, r1, -28
.cfi_def_cfa_offset 28
l.sw 0(r1), r16
l.sw 4(r1), r18
l.sw 8(r1), r20
l.sw 12(r1), r22
l.sw 16(r1), r24
l.sw 20(r1), r26
l.sw 24(r1), r9
.cfi_offset 16, -28
.cfi_offset 18, -24
.cfi_offset 20, -20
.cfi_offset 22, -16
.cfi_offset 24, -12
.cfi_offset 26, -8
.cfi_offset 9, -4
l.or r24, r8, r8
l.or r22, r10, r10
l.lwz r18, -1172(r10)
l.lwz r20, -692(r10)
l.lwz r17, -688(r10)
l.add r20, r20, r17
l.andi r16, r4, 16
l.sfnei r16, 0
When looking at the glibc build I noticed the .eh_frame data for __futex_abstimed_wait_cancelable64 is missing from futex-internal.o. Looking at the assembly for this file, the one where unwinding is failing, we find it is completely missing .cfi directives.
Why is GCC not generating .cfi directives for this file?
.file "futex-internal.c"
.section .text
.section .rodata.str1.1,"aMS",@progbits,1
.LC0:
.string "The futex facility returned an unexpected error code.\n"
.section .text
.align 4
.global __futex_abstimed_wait_cancelable64
.type __futex_abstimed_wait_cancelable64, @function
__futex_abstimed_wait_cancelable64:
l.addi r1, r1, -20
l.sw 0(r1), r16
l.sw 4(r1), r18
l.sw 8(r1), r20
l.sw 12(r1), r22
l.sw 16(r1), r9
l.or r22, r3, r3
l.or r20, r4, r4
l.or r16, r6, r6
l.sfnei r6, 0
l.ori r17, r0, 1
l.cmov r17, r17, r0
l.sfeqi r17, 0
l.bnf .L14
l.nop
Looking closer at the build lines of these 2 files I see the build of futex-internal.c is missing -fexceptions.
This flag is needed to enable the eh_frame section, which is what powers C++ exceptions. The flag is needed when we are building C code which needs to support C++ exceptions.
So why is it not enabled? Is this a problem with the GLIBC build?
Looking at GLIBC, the nptl/Makefile sets -fexceptions explicitly for each C file that needs it. For example:
# The following are cancellation points. Some of the functions can
# block and therefore temporarily enable asynchronous cancellation.
# Those must be compiled asynchronous unwind tables.
CFLAGS-pthread_testcancel.c += -fexceptions
CFLAGS-pthread_join.c += -fexceptions -fasynchronous-unwind-tables
CFLAGS-pthread_timedjoin.c += -fexceptions -fasynchronous-unwind-tables
CFLAGS-pthread_clockjoin.c += -fexceptions -fasynchronous-unwind-tables
CFLAGS-pthread_once.c += $(uses-callbacks) -fexceptions \
-fasynchronous-unwind-tables
CFLAGS-pthread_cond_wait.c += -fexceptions -fasynchronous-unwind-tables
CFLAGS-sem_wait.c += -fexceptions -fasynchronous-unwind-tables
CFLAGS-sem_timedwait.c += -fexceptions -fasynchronous-unwind-tables
CFLAGS-sem_clockwait.c = -fexceptions -fasynchronous-unwind-tables
It is missing such a line for futex-internal.c. The following patch and a libpthread rebuild fixes the issue!
--- a/nptl/Makefile
+++ b/nptl/Makefile
@@ -220,6 +220,7 @@ CFLAGS-pthread_cond_wait.c += -fexceptions -fasynchronous-unwind-tables
CFLAGS-sem_wait.c += -fexceptions -fasynchronous-unwind-tables
CFLAGS-sem_timedwait.c += -fexceptions -fasynchronous-unwind-tables
CFLAGS-sem_clockwait.c = -fexceptions -fasynchronous-unwind-tables
+CFLAGS-futex-internal.c += -fexceptions -fasynchronous-unwind-tables
# These are the function wrappers we have to duplicate here.
CFLAGS-fcntl.c += -fexceptions -fasynchronous-unwind-tables
I submitted this patch to GLIBC but it turns out it was already fixed upstream a few weeks before. Doh.
I hope the investigation into debugging this C++ exception test case proved interesting. We can learn a lot about the deep internals of our tools when we have to fix bugs in them. Like most elusive bugs, in the end this was a trivial fix but required some key background knowledge.
This is the third part in an illustrated 3 part series covering:
In the last article we covered how Thread Local Storage (TLS) works at runtime, but how do we get there? How does the compiler and linker create the memory structures and code fragments described in the previous article?
In this article we will discuss how TLS relocations are implemented. Our outline:
As before, the examples in this article can be found in my tls-examples project. Please check it out.
I will assume here that most people understand what a compiler and assembler basically do, in the sense that a compiler will compile routines written in C or something similar to assembly language. It is then up to the assembler to turn that assembly code into machine code to run on a CPU.
That is a big part of what a toolchain does, and it’s pretty much that simple if we have a single file of source code. But usually we don’t have a single file; we have multiple files, the C runtime, crt0 and other libraries like libc. These all need to be put together into our final program, and that is where the complexities of the linker come in.
In this article I will cover how variables in our source code (symbols) traverse the toolchain from code to the memory in our final running program. A picture that looks something like this:
First we start off with how relocations are created and emitted in the compiler.
As I work primarily on the GNU toolchain with its GCC compiler we will look at that. Let’s get started.
To start we define a symbol as a named address in memory. This address can be a program variable where data is stored or a function reference to where a subroutine starts.
In GCC we have TARGET_LEGITIMIZE_ADDRESS, the OpenRISC implementation being or1k_legitimize_address(). It takes a symbol (memory address) and makes it usable in our CPU by generating RTX sequences that are possible on our CPU to load that address into a register.
RTX represents a tree node in GCC’s register transfer language (RTL). The RTL Expression is used to express our algorithm as a series of register transfers. This is used as register transfer is basically what a CPU does.
A snippet from the or1k_legitimize_address() function is below. The argument x represents our input symbol (memory address) that we need to make usable by our CPU. This code uses GCC internal APIs to emit RTX code sequences.
static rtx
or1k_legitimize_address (rtx x, rtx /* unused */, machine_mode /* unused */)
...
case TLS_MODEL_NONE:
t1 = can_create_pseudo_p () ? gen_reg_rtx (Pmode) : scratch;
if (!flag_pic)
{
emit_insn (gen_rtx_SET (t1, gen_rtx_HIGH (Pmode, x)));
return gen_rtx_LO_SUM (Pmode, t1, x);
}
else if (is_local)
{
crtl->uses_pic_offset_table = 1;
t2 = gen_sym_unspec (x, UNSPEC_GOTOFF);
emit_insn (gen_rtx_SET (t1, gen_rtx_HIGH (Pmode, t2)));
emit_insn (gen_add3_insn (t1, t1, pic_offset_table_rtx));
return gen_rtx_LO_SUM (Pmode, t1, copy_rtx (t2));
}
else
{
...
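We can read the code snippet above as follows:
This is for the non TLS case as we see TLS_MODEL_NONE.
We allocate a temporary register t1.
If not using position-independent code (flag_pic) we do:
    Emit an instruction to put the high bits of x into our temporary register t1.
    Return the sum of t1 and the low bits of x.
Otherwise, if the symbol is local (is_local) we do:
    Set uses_pic_offset_table.
    Create a GOT offset reference t2.
    Emit an instruction to put the high bits of t2 (the GOT offset) into our temporary register t1.
    Emit an instruction to put the sum of t1 (high bits of t2) and the GOT into t1.
    Return the sum of t1 and the low bits of t2.
You may have noticed that the local symbol still used the global offset table (GOT). This is because position-independent code requires using the GOT to reference symbols.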
An example, from nontls.c:
static int x;
int *get_x_addr() {
return &x;
}
For the non-PIC case above, when we look at the assembly code generated by GCC we can see the following:
.file "nontls.c"
.section .text
.local x
.comm x,4,4
.align 4
.global get_x_addr
.type get_x_addr, @function
get_x_addr:
l.addi r1, r1, -8 # \
l.sw 0(r1), r2 # | function prologue
l.addi r2, r1, 8 # |
l.sw 4(r1), r9 # /
l.movhi r17, ha(x) # \__ legitimize address of x into r17
l.addi r17, r17, lo(x) # /
l.or r11, r17, r17 # } place result in return register r11
l.lwz r2, 0(r1) # \
l.lwz r9, 4(r1) # | function epilogue
l.addi r1, r1, 8 # |
l.jr r9 # |
l.nop # /
.size get_x_addr, .-get_x_addr
.ident "GCC: (GNU) 9.0.1 20190409 (experimental)"
For the local PIC case above, the same code compiled with the -fPIC GCC option looks like the following:
.file "nontls.c"
.section .text
.local x
.comm x,4,4
.align 4
.global get_x_addr
.type get_x_addr, @function
get_x_addr:
l.addi r1, r1, -8 # \
l.sw 0(r1), r2 # | function prologue
l.addi r2, r1, 8 # |
l.sw 4(r1), r9 # /
l.jal 8 # \
l.movhi r19, gotpchi(_GLOBAL_OFFSET_TABLE_-4) # | PC relative, put
l.ori r19, r19, gotpclo(_GLOBAL_OFFSET_TABLE_+0) # | GOT into r19
l.add r19, r19, r9 # /
l.movhi r17, gotoffha(x) # \
l.add r17, r17, r19 # | legitimize address of x into r17
l.addi r17, r17, gotofflo(x) # /
l.or r11, r17, r17 # } place result in return register r11
l.lwz r2, 0(r1) # \
l.lwz r9, 4(r1) # | function epilogue
l.addi r1, r1, 8 # |
l.jr r9 # |
l.nop # /
.size get_x_addr, .-get_x_addr
.ident "GCC: (GNU) 9.0.1 20190409 (experimental)"
TLS and Addend cases are also handled by or1k_legitimize_address().
Once RTX is generated by legitimize address and GCC passes run all of their optimizations, the RTX needs to be printed out as assembly code. During this process relocations are printed by the GCC macros TARGET_PRINT_OPERAND_ADDRESS and TARGET_PRINT_OPERAND. In OpenRISC these are defined by or1k_print_operand_address() and or1k_print_operand().
Let us have a look at or1k_print_operand_address().
/* Worker for TARGET_PRINT_OPERAND_ADDRESS.
Prints the argument ADDR, an address RTX, to the file FILE. The output is
formed as expected by the OpenRISC assembler. Examples:
RTX OUTPUT
(reg:SI 3) 0(r3)
(plus:SI (reg:SI 3) (const_int 4)) 0x4(r3)
(lo_sum:SI (reg:SI 3) (symbol_ref:SI ("x"))) lo(x)(r3) */
static void
or1k_print_operand_address (FILE *file, machine_mode, rtx addr)
{
rtx offset;
switch (GET_CODE (addr))
{
case REG:
fputc ('0', file);
break;
case ...
case LO_SUM:
offset = XEXP (addr, 1);
addr = XEXP (addr, 0);
print_reloc (file, offset, 0, RKIND_LO);
break;
default: ...
}
fprintf (file, "(%s)", reg_names[REGNO (addr)]);
}
The above code snippet can be read as we explain below, but let’s first make some notes:
The addr argument for TARGET_PRINT_OPERAND_ADDRESS will usually contain a register and an offset; typically this is used for LOAD and STORE operations.
An RTX such as addr can be thought of as a node in an AST.
REG and SYMBOL_REF are always leaf nodes.
With that, and if we use the or1k_print_operand_address() C comments above as examples of some RTX addr input, we will have:
RTX | (reg:SI 3) (lo_sum:SI (reg:SI 3) (symbol_ref:SI("x")))
-----------+--------------------------------------------------------------------
TREE |
(code) | (code:REG regno:3) (code:LO_SUM)
/ \ | / \
(0) (1) | (code:REG regno:3) (code:SYMBOL_REF "x")
We can now read the above snippet as:
First we look at the CODE of the RTX.
    If the CODE is REG (a register) then our offset can be 0.
    If the CODE is LO_SUM (an addition operation) then we need to break it down to:
        Operand 0 is our new addr RTX (which we assume is a register)
        Operand 1 is an offset (which we then print with print_reloc())
Finally we print the register in addr, i.e. “r3”.
The code of or1k_print_operand() is similar and the reader may be inclined to read more details. With that we can move on to the assembler.
TLS cases are also handled inside of the print_reloc() function.
In the GNU Toolchain our assembler is GAS, part of binutils.
The code that handles relocations is found in the function parse_reloc() in opcodes/or1k-asm.c. The function parse_reloc() is the direct counterpart of GCC’s print_reloc() discussed above. This is actually part of or1k_cgen_parse_operand(), which is wired into our assembler generator CGEN and used for parsing operands.
If we are parsing a relocation like the one from above, lo(x), then we can isolate the code that processes that relocation.
static const bfd_reloc_code_real_type or1k_imm16_relocs[][6] = {
  { BFD_RELOC_LO16,
    BFD_RELOC_OR1K_SLO16,
...
    BFD_RELOC_OR1K_TLS_LE_AHI16 },
};
static int
parse_reloc (const char **strp)
{
  const char *str = *strp;
  enum or1k_rclass cls = RCLASS_DIRECT;
  enum or1k_rtype typ;
...
  else if (strncasecmp (str, "lo(", 3) == 0)
    {
      str += 3;
      typ = RTYPE_LO;
    }
...
  *strp = str;
  return (cls << RCLASS_SHIFT) | typ;
}
This uses strncasecmp to match our "lo(" string pattern. The returned result is a relocation type and relocation class which are used to look up the relocation BFD_RELOC_LO16 in the or1k_imm16_relocs[][] table, which is indexed by relocation class and relocation type.
The assembler will encode that into the ELF binary. For TLS relocations the exact same pattern is used.
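To make the prefix matching and table lookup concrete, here is a minimal sketch of the same idea. It is not the binutils code: the enum values, the DEMO_RCLASS_SHIFT width, the way the gotpc prefix is handled and the names stored in the table are all assumptions chosen for illustration.

```c
#include <stddef.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Hypothetical relocation classes and types. */
enum demo_rclass { DEMO_RCLASS_DIRECT, DEMO_RCLASS_GOTPC, DEMO_RCLASS_COUNT };
enum demo_rtype  { DEMO_RTYPE_LO, DEMO_RTYPE_HI, DEMO_RTYPE_COUNT };

#define DEMO_RCLASS_SHIFT 3
#define DEMO_RTYPE_MASK   ((1 << DEMO_RCLASS_SHIFT) - 1)

/* Table indexed by [class][type]; strings stand in for BFD reloc codes. */
static const char *const demo_relocs[DEMO_RCLASS_COUNT][DEMO_RTYPE_COUNT] = {
  { "DEMO_RELOC_LO16",       "DEMO_RELOC_HI16" },
  { "DEMO_RELOC_GOTPC_LO16", "DEMO_RELOC_GOTPC_HI16" },
};

/* Match an optional class prefix followed by "lo(" or "hi(".
   On success advance *strp past the prefix and return class and type
   packed into one int; on failure return -1 and leave *strp alone. */
int demo_parse_reloc(const char **strp)
{
  const char *str = *strp;
  enum demo_rclass cls = DEMO_RCLASS_DIRECT;
  enum demo_rtype typ;

  if (strncasecmp(str, "gotpc", 5) == 0) {
    str += 5;
    cls = DEMO_RCLASS_GOTPC;
  }
  if (strncasecmp(str, "lo(", 3) == 0) {
    str += 3;
    typ = DEMO_RTYPE_LO;
  } else if (strncasecmp(str, "hi(", 3) == 0) {
    str += 3;
    typ = DEMO_RTYPE_HI;
  } else {
    return -1;
  }
  *strp = str;
  return ((int) cls << DEMO_RCLASS_SHIFT) | (int) typ;
}

/* Unpack class and type again and index the table. */
const char *demo_lookup_reloc(int parsed)
{
  int cls = parsed >> DEMO_RCLASS_SHIFT;
  int typ = parsed & DEMO_RTYPE_MASK;

  if (parsed < 0 || cls >= DEMO_RCLASS_COUNT || typ >= DEMO_RTYPE_COUNT)
    return NULL;
  return demo_relocs[cls][typ];
}
```

Packing class and type into one integer keeps the parser's return value simple while still letting the caller index a two dimensional table.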
In the GNU Toolchain our object linker is the GNU linker LD, also part of the binutils project.
The GNU linker uses the framework BFD or Binary File Descriptor which is a beast. It is not only used in the linker but also used in GDB, the GNU Simulator and the objdump tool.
What makes this possible is a rather complex API.
The BFD API is a generic binary file access API. It has been designed to support multiple file formats and architectures via an object oriented, polymorphic API all written in C. It supports file formats including a.out, COFF and ELF as well as unexpected file formats like Verilog hex memory dumps.
Here we will concentrate on the BFD ELF implementation.
The API definition is split across multiple files which include:
- bfd_hash_table
- bfd and asection
- bfd_link_info and bfd_link_hash_table
- elf_link_hash_table
- bfd/elf{wordsize}-{architecture}.c - architecture specific implementations
For each architecture, implementations are defined in bfd/elf{wordsize}-{architecture}.c. For example, for OpenRISC we have bfd/elf32-or1k.c.
Throughout the linker code we see access to the BFD Linker and ELF APIs. Some key symbols to watch out for include:
- info - A reference to bfd_link_info, the top level reference to all linker state.
- htab - A pointer to elf_or1k_link_hash_table from or1k_elf_hash_table (info), a hash table on steroids which stores generic link state and arch specific state. It is also a hash table of all global symbols by name. It contains:
  - htab->root.splt - the output .plt section
  - htab->root.sgot - the output .got section
  - htab->root.srelgot - the output .relgot section (relocations against the got)
  - htab->root.sgotplt - the output .gotplt section
  - htab->root.dynobj - a special bfd to which sections are added (created in or1k_elf_check_relocs)
- sym_hashes - From elf_sym_hashes (abfd), a list of global symbols in a bfd indexed by the relocation index ELF32_R_SYM (rel->r_info).
- h - A pointer to a struct elf_link_hash_entry, represents link state of a global symbol. It contains:
  - h->got - A union of different attributes with different roles based on link phase.
  - h->got.refcount - used during phase 1 to count the symbol .got section references
  - h->got.offset - used during phase 2 to record the symbol .got section offset
  - h->plt - A union with the same function as h->got but used for the .plt section.
  - h->root.root.string - The symbol name
- local_got - an array of unsigned long from elf_local_got_refcounts (ibfd) with the same function as h->got but for local symbols. The function of the unsigned long changes based on the link phase. Ideally this should also be a union.
- tls_type - Retrieved by ((struct elf_or1k_link_hash_entry *) h)->tls_type, used to store the tls_type of a global symbol.
- local_tls_type - Retrieved by elf_or1k_local_tls_type(abfd), an entry to store tls_type for local symbols, when h is NULL.
- root - The struct field root is used in subclasses to represent the parent class, similar to how super is used in other languages.
Putting it all together we have a diagram like the following:
Now that we have a bit of understanding of the data structures we can look to the link algorithm.
The link process in the GNU Linker can be thought of in phases.
The or1k_elf_check_relocs() function is called during the first phase to do bookkeeping on relocations. The function signature looks like:
static bfd_boolean
or1k_elf_check_relocs (bfd *abfd,
                       struct bfd_link_info *info,
                       asection *sec,
                       const Elf_Internal_Rela *relocs)
#define elf_backend_check_relocs or1k_elf_check_relocs
The arguments being:
- abfd - The current elf object file we are working on
- info - The BFD API
- sec - The current elf section we are working on
- relocs - The relocations from the current section
It does the bookkeeping by looping over relocations for the provided section and updating the local and global symbol properties.
For local symbols:
...
  else
    {
      unsigned char *local_tls_type;
      /* This is a TLS type record for a local symbol. */
      local_tls_type = (unsigned char *) elf_or1k_local_tls_type (abfd);
      if (local_tls_type == NULL)
        {
          bfd_size_type size;
          size = symtab_hdr->sh_info;
          local_tls_type = bfd_zalloc (abfd, size);
          if (local_tls_type == NULL)
            return FALSE;
          elf_or1k_local_tls_type (abfd) = local_tls_type;
        }
      local_tls_type[r_symndx] |= tls_type;
    }
...
  else
    {
      bfd_signed_vma *local_got_refcounts;
      /* This is a global offset table entry for a local symbol. */
      local_got_refcounts = elf_local_got_refcounts (abfd);
      if (local_got_refcounts == NULL)
        {
          bfd_size_type size;
          size = symtab_hdr->sh_info;
          size *= sizeof (bfd_signed_vma);
          local_got_refcounts = bfd_zalloc (abfd, size);
          if (local_got_refcounts == NULL)
            return FALSE;
          elf_local_got_refcounts (abfd) = local_got_refcounts;
        }
      local_got_refcounts[r_symndx] += 1;
    }
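The above is pretty straightforward and we can read it as:
- The first part stores the local symbol TLS type information:
  - If the local_tls_type array is not initialized:
    - Allocate it, with 1 entry for each local symbol
  - Record the TLS type in local_tls_type for the current symbol
- The second part records .got section references:
  - If the local_got_refcounts array is not initialized:
    - Allocate it, with 1 entry for each local symbol
  - Record a reference by incrementing local_got_refcounts for the current symbol
For global symbols it's much easier, we see: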
...
if (h != NULL)
((struct elf_or1k_link_hash_entry *) h)->tls_type |= tls_type;
else
...
if (h != NULL)
h->got.refcount += 1;
else
...
As the tls_type and refcount fields are available directly on each hash_entry, handling global symbols is much easier:
- For TLS type information:
  - Record the TLS type in tls_type for the current hash_entry
- For .got section references:
  - Record a reference by incrementing got.refcount for the hash_entry
The above is repeated for all relocations and all input sections. A few other things are also done, including accounting for .plt entries.
The or1k_elf_size_dynamic_sections() function iterates over all input object files to calculate the size required for output sections. The _bfd_elf_create_dynamic_sections() function does the actual section allocation; we use the generic version.
Setting up the sizes of the .got section (global offset table) and .plt section (procedure linkage table) is done here.
The definition is as below:
static bfd_boolean
or1k_elf_size_dynamic_sections (bfd *output_bfd ATTRIBUTE_UNUSED,
                                struct bfd_link_info *info)
#define elf_backend_size_dynamic_sections or1k_elf_size_dynamic_sections
#define elf_backend_create_dynamic_sections _bfd_elf_create_dynamic_sections
The arguments to or1k_elf_size_dynamic_sections() being:
- output_bfd - Unused, the output elf object
- info - the BFD API which provides access to everything we need
Internally the function uses:
- htab - from or1k_elf_hash_table (info)
  - htab->root.dynamic_sections_created - true if sections like .interp have been created by the linker
- ibfd - a bfd pointer from info->input_bfds, represents an input object when iterating.
- s->size - represents the output .got section size, which we will be incrementing.
- srel->size - represents the output .got.rela section size, which will contain relocations against the .got section
During the first part of phase 2 we set .got and .got.rela section sizes for local symbols with this code:
  /* Set up .got offsets for local syms, and space for local dynamic
     relocs. */
  for (ibfd = info->input_bfds; ibfd != NULL; ibfd = ibfd->link.next)
    {
...
      local_got = elf_local_got_refcounts (ibfd);
      if (!local_got)
        continue;
      symtab_hdr = &elf_tdata (ibfd)->symtab_hdr;
      locsymcount = symtab_hdr->sh_info;
      end_local_got = local_got + locsymcount;
      s = htab->root.sgot;
      srel = htab->root.srelgot;
      local_tls_type = (unsigned char *) elf_or1k_local_tls_type (ibfd);
      for (; local_got < end_local_got; ++local_got)
        {
          if (*local_got > 0)
            {
              unsigned char tls_type = (local_tls_type == NULL)
                                        ? TLS_UNKNOWN
                                        : *local_tls_type;
              *local_got = s->size;
              or1k_set_got_and_rela_sizes (tls_type, bfd_link_pic (info),
                                           &s->size, &srel->size);
            }
          else
            *local_got = (bfd_vma) -1;
          if (local_tls_type)
            ++local_tls_type;
        }
    }
Here, for example, we can see we iterate over each input elf object ibfd and, for each local symbol (local_got), we try to update s->size and srel->size to account for the required size.
The above can be read as:
- For each local_got entry:
  - If the local symbol is used in the .got section:
    - Get the tls_type byte stored in the local_tls_type array
    - Set the offset local_got to the section offset s->size, that is used in phase 3 to tell us where we need to write the symbol into the .got section.
    - Update s->size and srel->size using or1k_set_got_and_rela_sizes()
  - If the local symbol is not used in the .got section:
    - Set the offset local_got to -1, to indicate not used
In the next part of phase 2 we allocate space for all global symbols by iterating through symbols in htab with the allocate_dynrelocs iterator. To do that we call:
elf_link_hash_traverse (&htab->root, allocate_dynrelocs, info);
Inside allocate_dynrelocs() we record the space used for relocations and the .got and .plt sections. Example:
  if (h->got.refcount > 0)
    {
      asection *sgot;
      bfd_boolean dyn;
      unsigned char tls_type;
...
      sgot = htab->root.sgot;
      h->got.offset = sgot->size;
      tls_type = ((struct elf_or1k_link_hash_entry *) h)->tls_type;
      dyn = htab->root.dynamic_sections_created;
      dyn = WILL_CALL_FINISH_DYNAMIC_SYMBOL (dyn, bfd_link_pic (info), h);
      or1k_set_got_and_rela_sizes (tls_type, dyn,
                                   &sgot->size, &htab->root.srelgot->size);
    }
  else
    h->got.offset = (bfd_vma) -1;
The above, with h being our global symbol, a pointer to struct elf_link_hash_entry, can be read as:
- If the symbol will be in the .got section:
  - Get the global reference to the .got section and put it in sgot
  - Set the got location h->got.offset for the symbol to the current got section size (the size of htab->root.sgot).
  - Set dyn to true if we will be doing a dynamic link.
  - Call or1k_set_got_and_rela_sizes() to update the sizes for the .got and .got.rela sections.
- If the symbol is not going to be in the .got section:
  - Set the got location h->got.offset to -1
The function or1k_set_got_and_rela_sizes() used above increments the .got and .rela section sizes, accounting for whether these are TLS symbols, which need additional entries and relocations.
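The way one field changes role between phase 1 and phase 2 can be sketched in a few lines. This is a toy model, not the BFD code: struct demo_sym, the DEMO_TLS_* flags and the sizing rule (8 bytes for a General Dynamic entry, 4 bytes otherwise, .rela sizes ignored) are assumptions made for illustration.

```c
#include <stddef.h>

#define DEMO_TLS_NONE 0
#define DEMO_TLS_GD   1   /* assumed to need two 4 byte .got words */
#define DEMO_TLS_IE   2   /* assumed to need one 4 byte .got word */

struct demo_sym {
  union {
    long refcount;   /* phase 1: number of .got references */
    long offset;     /* phase 2: offset into .got, or -1 if unused */
  } got;
  unsigned char tls_type;
};

/* Phase 1: called once per relocation that needs a .got entry. */
void demo_check_reloc(struct demo_sym *sym, unsigned char tls_type)
{
  sym->got.refcount += 1;
  sym->tls_type |= tls_type;
}

/* Grow the .got size for one symbol, loosely in the spirit of
   or1k_set_got_and_rela_sizes(). */
static void demo_set_got_size(unsigned char tls_type, size_t *got_size)
{
  if (tls_type & DEMO_TLS_GD)
    *got_size += 8;
  if (tls_type & DEMO_TLS_IE)
    *got_size += 4;
  if ((tls_type & (DEMO_TLS_GD | DEMO_TLS_IE)) == 0)
    *got_size += 4;
}

/* Phase 2: turn every refcount into an offset and return the total
   .got size.  After this call got.refcount must no longer be read. */
size_t demo_size_got(struct demo_sym *syms, size_t count)
{
  size_t got_size = 0;

  for (size_t i = 0; i < count; i++) {
    if (syms[i].got.refcount > 0) {
      syms[i].got.offset = (long) got_size;
      demo_set_got_size(syms[i].tls_type, &got_size);
    } else {
      syms[i].got.offset = -1;
    }
  }
  return got_size;
}
```

Reusing the same storage works because a reference count is only needed until the offsets are assigned; the cost is that the reader has to know which phase the linker is in to interpret the field.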
The or1k_elf_relocate_section() function is called to fill in the relocation holes in the output binary .text section. It does this by looping over relocations and writing to the .text section the correct symbol value (memory address). It also updates other output binary sections like the .got section. Also, for dynamic executables and libraries, new relocations may be written to .rela sections.
The function signature looks as follows:
static bfd_boolean
or1k_elf_relocate_section (bfd *output_bfd,
                           struct bfd_link_info *info,
                           bfd *input_bfd,
                           asection *input_section,
                           bfd_byte *contents,
                           Elf_Internal_Rela *relocs,
                           Elf_Internal_Sym *local_syms,
                           asection **local_sections)
#define elf_backend_relocate_section or1k_elf_relocate_section
The arguments to or1k_elf_relocate_section() being:
- output_bfd - the output elf object we will be writing to
- info - the BFD API which provides access to everything we need
- input_bfd - the current input elf object being iterated over
- input_section - the current .text section in the input elf object being iterated over. From here we get .text section output details for pc relative relocations:
  - input_section->output_section->vma - the location of the output section.
  - input_section->output_offset - the output offset
- contents - the output file buffer we will write to
- relocs - relocations from the current input section
- local_syms - an array of local symbols used to get the relocation value for local symbols
- local_sections - an array of input sections for local symbols, used to get the relocation value for local symbols
Internally the function uses:
- An array of howto structs indexed by relocation enum. The howto struct expresses the algorithm required to update the relocation.
- relocation - a bfd_vma, the value of the relocation symbol (memory address) to be written to the output file.
- The location in the output file that needs to be updated for the relocation.
- value - the value that needs to be written to the relocation location.
During the first part of relocate_section we see:
  if (r_symndx < symtab_hdr->sh_info)
    {
      sym = local_syms + r_symndx;
      sec = local_sections[r_symndx];
      relocation = _bfd_elf_rela_local_sym (output_bfd, sym, &sec, rel);
      name = bfd_elf_string_from_elf_section
        (input_bfd, symtab_hdr->sh_link, sym->st_name);
      name = name == NULL ? bfd_section_name (sec) : name;
    }
  else
    {
      bfd_boolean unresolved_reloc, warned, ignored;
      RELOC_FOR_GLOBAL_SYMBOL (info, input_bfd, input_section, rel,
                               r_symndx, symtab_hdr, sym_hashes,
                               h, sec, relocation,
                               unresolved_reloc, warned, ignored);
      name = h->root.root.string;
    }
This can be read as:
- If the current symbol is a local symbol:
  - We initialize relocation to the local symbol value using _bfd_elf_rela_local_sym().
- Otherwise the current symbol is global:
  - We use the RELOC_FOR_GLOBAL_SYMBOL() macro to initialize relocation.
During the next part we use the howto information to update the relocation value, and also add relocations to the output file. For example:
case R_OR1K_TLS_GD_HI16:
case R_OR1K_TLS_GD_LO16:
case R_OR1K_TLS_GD_PG21:
case R_OR1K_TLS_GD_LO13:
case R_OR1K_TLS_IE_HI16:
case R_OR1K_TLS_IE_LO16:
case R_OR1K_TLS_IE_PG21:
case R_OR1K_TLS_IE_LO13:
case R_OR1K_TLS_IE_AHI16:
  {
    bfd_vma gotoff;
    Elf_Internal_Rela rela;
    asection *srelgot;
    bfd_byte *loc;
    bfd_boolean dynamic;
    int indx = 0;
    unsigned char tls_type;
    srelgot = htab->root.srelgot;
    /* Mark as TLS related GOT entry by setting
       bit 2 to indicate TLS and bit 1 to indicate GOT. */
    if (h != NULL)
      {
        gotoff = h->got.offset;
        tls_type = ((struct elf_or1k_link_hash_entry *) h)->tls_type;
        h->got.offset |= 3;
      }
    else
      {
        unsigned char *local_tls_type;
        gotoff = local_got_offsets[r_symndx];
        local_tls_type = (unsigned char *) elf_or1k_local_tls_type (input_bfd);
        tls_type = local_tls_type == NULL ? TLS_NONE
                                          : local_tls_type[r_symndx];
        local_got_offsets[r_symndx] |= 3;
      }
    /* Only process the relocation once. */
    if ((gotoff & 1) != 0)
      {
        gotoff += or1k_initial_exec_offset (howto, tls_type);
        /* The PG21 and LO13 relocs are pc-relative, while the
           rest are GOT relative. */
        relocation = got_base + (gotoff & ~3);
        if (!(r_type == R_OR1K_TLS_GD_PG21
              || r_type == R_OR1K_TLS_GD_LO13
              || r_type == R_OR1K_TLS_IE_PG21
              || r_type == R_OR1K_TLS_IE_LO13))
          relocation -= got_sym_value;
        break;
      }
...
    /* Static GD. */
    else if ((tls_type & TLS_GD) != 0)
      {
        bfd_put_32 (output_bfd, 1, sgot->contents + gotoff);
        bfd_put_32 (output_bfd, tpoff (info, relocation, dynamic),
                    sgot->contents + gotoff + 4);
      }
    gotoff += or1k_initial_exec_offset (howto, tls_type);
...
    /* Static IE. */
    else if ((tls_type & TLS_IE) != 0)
      bfd_put_32 (output_bfd, tpoff (info, relocation, dynamic),
                  sgot->contents + gotoff);
    /* The PG21 and LO13 relocs are pc-relative, while the
       rest are GOT relative. */
    relocation = got_base + gotoff;
    if (!(r_type == R_OR1K_TLS_GD_PG21
          || r_type == R_OR1K_TLS_GD_LO13
          || r_type == R_OR1K_TLS_IE_PG21
          || r_type == R_OR1K_TLS_IE_LO13))
      relocation -= got_sym_value;
  }
  break;
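Here we process the relocation for TLS General Dynamic and Initial Exec relocations. I have trimmed out the shared cases to save space.
This can be read as:
- Get a reference to the output relocation section srelgot.
- Get the got offset which we set up during phase 2, for global or local symbols, and mark the symbol as using a TLS got entry. The offset |= 3 trick is possible because on 32-bit machines we have 2 lower bits free. This is used during phase 4.
- If we have already processed this symbol once:
  - Set relocation to the location in the output .got section and break; we only need to create .got entries 1 time.
- Otherwise populate the .got section entries:
  - For General Dynamic, put 2 entries into the output .got section, a literal 1 and the thread pointer offset
  - For Initial Exec, put 1 entry into the output .got section, the thread pointer offset
- Finally set relocation to the location in the output .got section
In the last part of the loop we write the relocation value to the output .text section. This is done with the or1k_final_link_relocate() function.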
r = or1k_final_link_relocate (howto, input_bfd, input_section, contents,
rel->r_offset, relocation + rel->r_addend);
With this the .text section is complete.
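The offset |= 3 trick used in the TLS case above deserves a small standalone illustration. The sketch below is not the BFD code; the function names are made up, and it assumes .got offsets are always multiples of 4 so that the two low bits never carry address information.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true the first time it is called for a given offset slot,
   then tags the slot by setting the two low bits so that later calls
   return false. */
bool demo_got_entry_needs_write(uint32_t *offset_slot)
{
  bool first_time = (*offset_slot & 1) == 0;

  *offset_slot |= 3;
  return first_time;
}

/* Strip the tag bits to recover the real .got offset. */
uint32_t demo_got_entry_offset(uint32_t tagged_offset)
{
  return tagged_offset & ~(uint32_t) 3;
}
```

A symbol is often referenced by several relocations, so a check like this lets the loop fill the .got entry on the first reference and only recompute the address on later ones.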
During phase 3 above we wrote the .text section out to file. During the final finishing up phase we need to write the remaining sections. This includes the .plt section and more writes to the .got section.
This also includes the .plt.rela and .got.rela sections which contain dynamic relocation entries.
Writing of the data sections is handled by or1k_elf_finish_dynamic_sections() and writing of the relocation sections is handled by or1k_elf_finish_dynamic_symbol(). These are defined as below.
static bfd_boolean
or1k_elf_finish_dynamic_sections (bfd *output_bfd,
                                  struct bfd_link_info *info)
static bfd_boolean
or1k_elf_finish_dynamic_symbol (bfd *output_bfd,
                                struct bfd_link_info *info,
                                struct elf_link_hash_entry *h,
                                Elf_Internal_Sym *sym)
#define elf_backend_finish_dynamic_sections or1k_elf_finish_dynamic_sections
#define elf_backend_finish_dynamic_symbol or1k_elf_finish_dynamic_symbol
A snippet from or1k_elf_finish_dynamic_sections() shows how, when writing to the .plt section, assembly code needs to be injected. This is where the first entry in the .plt section is written.
  else if (bfd_link_pic (info))
    {
      plt0 = OR1K_LWZ(15, 16) | 8; /* .got+8 */
      plt1 = OR1K_LWZ(12, 16) | 4; /* .got+4 */
      plt2 = OR1K_NOP;
    }
  else
    {
      unsigned ha = ((got_addr + 0x8000) >> 16) & 0xffff;
      unsigned lo = got_addr & 0xffff;
      plt0 = OR1K_MOVHI(12) | ha;
      plt1 = OR1K_LWZ(15,12) | (lo + 8);
      plt2 = OR1K_LWZ(12,12) | (lo + 4);
    }
  or1k_write_plt_entry (output_bfd, splt->contents,
                        plt0, plt1, plt2, OR1K_JR(15));
  elf_section_data (splt->output_section)->this_hdr.sh_entsize = 4;
Here we see a write to output_bfd, which represents the output object file we are writing to. The argument splt->contents represents the object file offset to write to for the .plt section. Next we see the line elf_section_data (splt->output_section)->this_hdr.sh_entsize = 4; this allows the linker to calculate the size of the section.
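The ha and lo values computed in the snippet above split a 32-bit address into two 16-bit halves. The + 0x8000 is there because the low half is later treated as a signed displacement, so the high half has to be rounded up whenever bit 15 of the address is set. A minimal sketch of that arithmetic, with made-up helper names and assuming the load instruction sign extends its 16-bit immediate:

```c
#include <stdint.h>

/* High half, adjusted for a sign extended low half. */
uint32_t demo_ha(uint32_t addr)
{
  return ((addr + 0x8000u) >> 16) & 0xffffu;
}

/* Low half, taken as is. */
uint32_t demo_lo(uint32_t addr)
{
  return addr & 0xffffu;
}

/* Rebuild the address the way "move high, then load with a signed
   16-bit displacement" would compute it. */
uint32_t demo_rebuild(uint32_t ha, uint32_t lo)
{
  int32_t disp = (lo & 0x8000u) ? (int32_t) lo - 0x10000 : (int32_t) lo;

  return (ha << 16) + (uint32_t) disp;
}
```

For example, with the address 0x12348abc the low half is 0x8abc, which is negative once sign extended, so the high half becomes 0x1235 rather than 0x1234 and the two pieces still add up to the original address.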
A snippet from the or1k_elf_finish_dynamic_symbol() function shows where we write out the code and dynamic relocation entries for each symbol to the .plt section.
  splt = htab->root.splt;
  sgot = htab->root.sgotplt;
  srela = htab->root.srelplt;
...
  else
    {
      unsigned ha = ((got_addr + 0x8000) >> 16) & 0xffff;
      unsigned lo = got_addr & 0xffff;
      plt0 = OR1K_MOVHI(12) | ha;
      plt1 = OR1K_LWZ(12,12) | lo;
      plt2 = OR1K_ORI0(11) | plt_reloc;
    }
  or1k_write_plt_entry (output_bfd, splt->contents + h->plt.offset,
                        plt0, plt1, plt2, OR1K_JR(12));
  /* Fill in the entry in the global offset table. We initialize it to
     point to the top of the plt. This is done to lazy lookup the actual
     symbol as the first plt entry will be setup by libc to call the
     runtime dynamic linker. */
  bfd_put_32 (output_bfd, plt_base_addr, sgot->contents + got_offset);
  /* Fill in the entry in the .rela.plt section. */
  rela.r_offset = got_addr;
  rela.r_info = ELF32_R_INFO (h->dynindx, R_OR1K_JMP_SLOT);
  rela.r_addend = 0;
  loc = srela->contents;
  loc += plt_index * sizeof (Elf32_External_Rela);
  bfd_elf32_swap_reloca_out (output_bfd, &rela, loc);
Here we can see we write 3 things to output_bfd for the single .plt entry. We write:
- The assembly code to the .plt section.
- The plt_base_addr (the first entry in the .plt for runtime lookup) to the .got section.
- A dynamic relocation for our symbol to the .plt.rela section.
With that we have written all of the sections out to our final elf object, and it's ready to be used.
The runtime linker, also referred to as the dynamic linker, will do the final linking as we load our program and shared libraries into memory. It can process a limited set of relocation entries that were set up above during phase 4 of linking.
The runtime linker implementation is found mostly in the elf/dl-* GLIBC source files. Dynamic relocation processing is handled by the _dl_relocate_object() function in the elf/dl-reloc.c file. The back end macro used for relocation, ELF_DYNAMIC_RELOCATE, is defined across several files including elf/dynamic-link.h and elf/do-rel.h. Architecture specific relocations are handled by the function elf_machine_rela(), the implementation for OpenRISC being in sysdeps/or1k/dl-machine.h.
In summary from top down:
- dl_main() - the top level entry for the dynamic linker.
- dl_open_worker() calls _dl_relocate_object(); you may also recognize this from dlopen(3).
- _dl_relocate_object() calls ELF_DYNAMIC_RELOCATE
- elf/dynamic-link.h - the macro ELF_DYNAMIC_RELOCATE, defined here, calls elf_dynamic_do_Rel() via several macros
- elf/do-rel.h - function elf_dynamic_do_Rel() calls elf_machine_rela()
- sysdeps/or1k/dl-machine.h - architecture specific function elf_machine_rela() implements dynamic relocation handling
It supports relocations for:
- R_OR1K_NONE - do nothing
- R_OR1K_COPY - used to copy initial values from shared objects to process memory.
- R_OR1K_32 - a 32-bit value
- R_OR1K_GLOB_DAT - aligned 32-bit values for GOT entries
- R_OR1K_JMP_SLOT - aligned 32-bit values for PLT entries
- R_OR1K_TLS_DTPMOD/R_OR1K_TLS_DTPOFF - for shared TLS GD GOT entries
- R_OR1K_TLS_TPOFF - for shared TLS IE GOT entries
A snippet of the OpenRISC implementation of elf_machine_rela() can be seen below. It is pretty straightforward.
/* Perform the relocation specified by RELOC and SYM (which is fully resolved).
   MAP is the object containing the reloc. */
auto inline void
__attribute ((always_inline))
elf_machine_rela (struct link_map *map, const Elf32_Rela *reloc,
                  const Elf32_Sym *sym, const struct r_found_version *version,
                  void *const reloc_addr_arg, int skip_ifunc)
{
  struct link_map *sym_map = RESOLVE_MAP (&sym, version, r_type);
  Elf32_Addr value = SYMBOL_ADDRESS (sym_map, sym, true);
...
  switch (r_type)
    {
...
      case R_OR1K_32:
        /* Support relocations on mis-aligned offsets. */
        value += reloc->r_addend;
        memcpy (reloc_addr_arg, &value, 4);
        break;
      case R_OR1K_GLOB_DAT:
      case R_OR1K_JMP_SLOT:
        *reloc_addr = value + reloc->r_addend;
        break;
...
    }
}
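The R_OR1K_32 case uses memcpy() rather than a pointer store because the relocation target is not guaranteed to be 4 byte aligned. The sketch below isolates that one idea; the function names are made up and it is not the glibc code.

```c
#include <stdint.h>
#include <string.h>

/* Store value + addend at reloc_addr, which may be mis-aligned.
   Copying byte wise avoids an unaligned 32-bit store, which some
   CPUs trap on. */
void demo_apply_reloc_32(void *reloc_addr, uint32_t value, uint32_t addend)
{
  value += addend;
  memcpy(reloc_addr, &value, sizeof value);
}

/* Read a 32-bit value back from a possibly mis-aligned address. */
uint32_t demo_read_32(const void *addr)
{
  uint32_t value;

  memcpy(&value, addr, sizeof value);
  return value;
}
```

The GOT and PLT slot cases can use a direct store because those entries are always aligned 32-bit words, as the list of supported relocations above notes.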
The complicated part of the runtime linker is how it handles TLS variables. This is done in the following files and functions.
- elf/rtld.c - implements init_tls() which initializes the TLS data structures.
The reader can read through the initialization code which is pretty straightforward, except for the macros. Like most GNU code, the code relies heavily on untyped macros. These macros are defined in the architecture specific implementation files. For OpenRISC this is:
From the previous article on TLS we have the TLS data structure that looks as follows:
dtv[] [ dtv[0], dtv[1], dtv[2], .... ]
counter ^ | \
----/ / \________
/ V V
/------TCB-------\/----TLS[1]----\ /----TLS[2]----\
| pthread tcbhead | tbss tdata | | tbss tdata |
\----------------/\--------------/ \--------------/
^
|
TP-----/
The symbols and macros defined in sysdeps/or1k/nptl/tls.h are:
- __thread_self - a symbol always representing the current thread
- TLS_DTV_AT_TP - used throughout the TLS code to adjust offsets
- TLS_TCB_AT_TP - used throughout the TLS code to adjust offsets
- TLS_TCB_SIZE - used during init_tls() to allocate memory for TLS
- TLS_PRE_TCB_SIZE - used during init_tls() to allocate space for the pthread struct
- INSTALL_DTV - used during initialization to install a new dtv pointer into the given tcb
- GET_DTV - gets dtv via the provided tcb pointer
- INSTALL_NEW_DTV - used during resizing to install the dtv into the current runtime __thread_self
- TLS_INIT_TP - sets __thread_self; this is the final step in init_tls()
- THREAD_DTV - gets dtv via __thread_self
- THREAD_SELF - gets the pthread pointer via __thread_self
Implementations for OpenRISC are:
register tcbhead_t *__thread_self __asm__("r10");
#define TLS_DTV_AT_TP 1
#define TLS_TCB_AT_TP 0
#define TLS_TCB_SIZE sizeof (tcbhead_t)
#define TLS_PRE_TCB_SIZE sizeof (struct pthread)
#define INSTALL_DTV(tcbp, dtvp) (((tcbhead_t *) (tcbp))->dtv = (dtvp) + 1)
#define GET_DTV(tcbp) (((tcbhead_t *) (tcbp))->dtv)
#define TLS_INIT_TP(tcbp) ({__thread_self = ((tcbhead_t *)tcbp + 1); NULL;})
#define THREAD_DTV() ((((tcbhead_t *)__thread_self)-1)->dtv)
#define INSTALL_NEW_DTV(dtv) (THREAD_DTV() = (dtv))
#define THREAD_SELF \
((struct pthread *) ((char *) __thread_self - TLS_INIT_TCB_SIZE \
- TLS_PRE_TCB_SIZE))
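The pointer arithmetic in these macros is easier to follow with a toy layout. The sketch below is only a model: struct demo_pthread and demo_tcbhead are tiny stand-ins for the real glibc types, and the whole thread block is assumed to be one contiguous struct with no padding between its members.

```c
#include <stdint.h>

/* Stand-ins for struct pthread and tcbhead_t.  A pointer sized field
   keeps the two structs contiguous without padding. */
struct demo_pthread { intptr_t tid; };
typedef struct { void *dtv; } demo_tcbhead;

/* One contiguous block: [ pthread | tcbhead | TLS data ... ] */
struct demo_thread_block {
  struct demo_pthread pthread;
  demo_tcbhead tcb;
  char tls[64];
};

#define DEMO_TCB_SIZE     sizeof (demo_tcbhead)
#define DEMO_PRE_TCB_SIZE sizeof (struct demo_pthread)

/* Like TLS_INIT_TP: the thread pointer is set just past the tcbhead. */
void *demo_init_tp(demo_tcbhead *tcbp)
{
  return tcbp + 1;
}

/* Like the lookup inside THREAD_DTV: step back one tcbhead from TP. */
demo_tcbhead *demo_tcb(void *tp)
{
  return (demo_tcbhead *) tp - 1;
}

/* Like THREAD_SELF: step back over the tcbhead and the pthread struct. */
struct demo_pthread *demo_thread_self(void *tp)
{
  return (struct demo_pthread *)
    ((char *) tp - DEMO_TCB_SIZE - DEMO_PRE_TCB_SIZE);
}
```

In this model the thread pointer ends up at the first byte of the TLS data, which is why variables in the main program can be reached with small positive offsets from it while the bookkeeping structs sit at negative offsets.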
We have looked at how symbols move from the Compiler, to Assembler, to Linker to Runtime linker.
This has ended up being a long article to explain a rather complicated subject. Let’s hope it helps provide a good reference for others who want to work on the GNU toolchain in the future.
This is the second part in an illustrated 3 part series covering:
In the last article we covered ELF Binary internals and how relocation entries are used during link time to allow our programs to access symbols (variables). However, what if we want a different variable instance for each thread? This is where thread local storage (TLS) comes in.
In this article we will discuss how TLS works. Our outline:
As before, the examples in this article can be found in my tls-examples project. Please check it out.
Did you know that in C you can prefix variables with __thread to create thread local variables?
__thread int i;
A thread local variable is a variable that will have a unique instance per thread. Each time a new thread is created, the space required to store the thread local variables is allocated.
TLS variables are stored in dynamic TLS sections.
In the previous article we saw how variables were stored in the .data and .bss sections. These are initialized once per program or library.
When we get to binaries that use TLS we will additionally have .tdata and .tbss sections.
- .tdata - static and non static initialized thread local variables
- .tbss - static and non static non-initialized thread local variables
These exist in a special TLS segment which is loaded per thread. In the next article we will discuss more about how this loading works.
As we recall, to access data in .data and .bss sections simple code sequences with relocation entries are used. These sequences set and add registers to build pointers to our data. For example, the below sequence uses 2 relocations to compose a .bss section address into register r11.
Addr. Machine Code Assembly Relocations
0000000c <get_x_addr>:
c: 19 60 [00 00] l.movhi r11,[0] # c R_OR1K_AHI16 .bss
10: 44 00 48 00 l.jr r9
14: 9d 6b [00 00] l.addi r11,r11,[0] # 14 R_OR1K_LO_16_IN_INSN .bss
With TLS the code sequences to access our data will also build pointers to our data, but they need to traverse the TLS data structures.
As the code sequence is read only and will be the same for each thread, another level of indirection is needed; this is provided by the Thread Pointer (TP).
The Thread Pointer points into a data structure that allows us to locate TLS data sections. The TLS data structure includes:
These are illustrated as below:
dtv[] [ dtv[0], dtv[1], dtv[2], .... ]
counter ^ | \
----/ / \________
/ V V
/------TCB-------\/----TLS[1]----\ /----TLS[2]----\
| pthread tcbhead | tbss tdata | | tbss tdata |
\----------------/\--------------/ \--------------/
^
|
TP-----/
The TP is unique to each thread. It provides the starting point to the TLS data structure.
- On OpenRISC the TP is stored in register r10
- On x86_64 the TP is stored in $fs
- This is the *tls pointer passed to the clone() system call when using CLONE_SETTLS.
The TCB is the head of the TLS data structure. The TCB consists of:
- pthread - the pthread struct for the current thread, contains tid etc. Located by TP - TCB size - Pthread size
- tcbhead - the tcbhead_t struct, machine dependent, contains pointer to DTV. Located by TP - TCB size.
For OpenRISC tcbhead_t is defined in sysdeps/or1k/nptl/tls.h as:
typedef struct {
  dtv_t *dtv;
} tcbhead_t;
- dtv - is a pointer to the dtv array, points to entry dtv[1]
For x86_64 the tcbhead_t is defined in sysdeps/x86_64/nptl/tls.h as:
typedef struct
{
  void *tcb;            /* Pointer to the TCB. Not necessarily the
                           thread descriptor used by libpthread. */
  dtv_t *dtv;
  void *self;           /* Pointer to the thread descriptor. */
  int multiple_threads;
  int gscope_flag;
  uintptr_t sysinfo;
  uintptr_t stack_guard;
  uintptr_t pointer_guard;
  unsigned long int vgetcpu_cache[2];
  /* Bit 0: X86_FEATURE_1_IBT.
     Bit 1: X86_FEATURE_1_SHSTK.
   */
  unsigned int feature_1;
  int __glibc_unused1;
  /* Reservation of some values for the TM ABI. */
  void *__private_tm[4];
  /* GCC split stack support. */
  void *__private_ss;
  /* The lowest address of shadow stack, */
  unsigned long long int ssp_base;
  /* Must be kept even if it is no longer used by glibc since programs,
     like AddressSanitizer, depend on the size of tcbhead_t. */
  __128bits __glibc_unused2[8][4] __attribute__ ((aligned (32)));
  void *__padding[8];
} tcbhead_t;
The x86_64 implementation includes many more fields including:
- gscope_flag - Global Scope lock flags used by the runtime linker, for OpenRISC this is stored in pthread.
- stack_guard - The stack guard canary stored in the thread local area. For OpenRISC a global stack guard is stored in .bss.
- pointer_guard - The pointer guard stored in the thread local area. For OpenRISC a global pointer guard is stored in .bss.
The DTV is an array of pointers to each TLS data section. The first entry in the DTV array contains the generation counter. The generation counter is really just the array size. The DTV can be dynamically resized as more TLS modules are loaded.
The dtv_t type is a union as defined below:
typedef struct {
  void *val;       // Aligned pointer to data/bss
  void *to_free;   // Unaligned pointer for free()
} dtv_pointer;
typedef union {
  int counter;          // for entry 0
  dtv_pointer pointer;  // for all other entries
} dtv_t;
Each dtv_t entry can be either a counter or a pointer. By convention the first entry, dtv[0], is a counter and the rest are pointers.
The initial set of TLS data sections is allocated contiguous with the TCB. Additional TLS data blocks will be allocated dynamically. There will be one entry for each loaded module, the first module being the current program. For dynamic libraries it is lazily initialized per thread.
- TLS[1]:
  - tbss - the .tbss section for the current thread from the current process's ELF binary.
  - tdata - the .tdata section for the current thread from the current process's ELF binary.
- TLS[2]:
  - tbss - the .tbss section for variables defined in the first shared library loaded by the current process
  - tdata - the .tdata section for variables defined in the first shared library loaded by the current process
The __tls_get_addr() function can be used at any time to traverse the TLS data structure and return a variable's address. The function is given a pointer to an architecture specific argument tls_index.
- The module index - 0 for the current process, 1 for the first loaded shared library etc.
- The offset of the variable in the TLS data section
- Internally __tls_get_addr uses TP to locate the TLS data structure
For static builds the implementation is architecture dependent and defined for OpenRISC in sysdeps/or1k/libc-tls.c as:
void *
__tls_get_addr (tls_index *ti)
{
  dtv_t *dtv = THREAD_DTV ();
  return (char *) dtv[1].pointer.val + ti->ti_offset;
}
Note that for static builds the module index can be hard coded to 1 as there will always be only one module.
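To see the DTV walk without the glibc macros in the way, here is a minimal sketch that takes the dtv as an explicit argument instead of finding it through the thread pointer. The demo_ types mirror the simplified definitions shown earlier and are assumptions for illustration, not the glibc declarations.

```c
#include <stddef.h>

typedef struct {
  void *val;       /* pointer to the module's TLS data */
  void *to_free;   /* pointer to hand to free(), unused here */
} demo_dtv_pointer;

typedef union {
  size_t counter;            /* entry 0 */
  demo_dtv_pointer pointer;  /* all other entries */
} demo_dtv_t;

typedef struct {
  unsigned long ti_module;   /* index into the dtv */
  unsigned long ti_offset;   /* offset of the variable in the module */
} demo_tls_index;

/* Pick the module's TLS block from the dtv, then add the variable's
   offset.  DTV points at entry 0, the counter. */
void *demo_tls_lookup(demo_dtv_t *dtv, const demo_tls_index *ti)
{
  return (char *) dtv[ti->ti_module].pointer.val + ti->ti_offset;
}
```

The static implementation above is this lookup with ti_module fixed at 1.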
For dynamically linked programs the implementation is defined as part of the runtime dynamic linker in elf/dl-tls.c as:
void *
__tls_get_addr (GET_ADDR_ARGS)
{
  dtv_t *dtv = THREAD_DTV ();
  if (__glibc_unlikely (dtv[0].counter != GL(dl_tls_generation)))
    return update_get_addr (GET_ADDR_PARAM);
  void *p = dtv[GET_ADDR_MODULE].pointer.val;
  if (__glibc_unlikely (p == TLS_DTV_UNALLOCATED))
    return tls_get_addr_tail (GET_ADDR_PARAM, dtv, NULL);
  return (char *) p + GET_ADDR_OFFSET;
}
Here several macros are used so it's a bit hard to follow, but they are:
- THREAD_DTV - uses TP to get the pointer to the DTV array.
- GET_ADDR_ARGS - short for tls_index* ti
- GET_ADDR_PARAM - short for ti
- GET_ADDR_MODULE - short for ti->ti_module
- GET_ADDR_OFFSET - short for ti->ti_offset
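The slow path for an unallocated module is what makes TLS in shared libraries lazy. The sketch below models only that part and leaves out the generation check entirely. Everything in it is hypothetical: the lazy_ types, the LAZY_UNALLOCATED marker and the lazy_module table are stand-ins, and real glibc does considerably more work in tls_get_addr_tail().

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define LAZY_UNALLOCATED ((void *) (intptr_t) -1)

typedef struct { void *val; void *to_free; } lazy_dtv_pointer;
typedef union { size_t counter; lazy_dtv_pointer pointer; } lazy_dtv_t;
typedef struct { unsigned long ti_module; unsigned long ti_offset; } lazy_tls_index;

/* Per module template: the initialized data image followed by zeros. */
struct lazy_module {
  const void *init_image;   /* initial .tdata contents */
  size_t init_size;         /* size of the .tdata image */
  size_t block_size;        /* .tdata plus .tbss size */
};

/* Return the address of a TLS variable, allocating and initializing
   the module's block for this thread on first use. */
void *lazy_tls_get_addr(lazy_dtv_t *dtv, const struct lazy_module *mods,
                        const lazy_tls_index *ti)
{
  void *p = dtv[ti->ti_module].pointer.val;

  if (p == LAZY_UNALLOCATED) {
    const struct lazy_module *m = &mods[ti->ti_module];

    p = calloc(1, m->block_size);          /* zero fill covers .tbss */
    if (p == NULL)
      return NULL;
    if (m->init_size > 0)
      memcpy(p, m->init_image, m->init_size);
    dtv[ti->ti_module].pointer.val = p;
    dtv[ti->ti_module].pointer.to_free = p;
  }
  return (char *) p + ti->ti_offset;
}
```

The first access from a thread pays for the allocation and copy; every later access for the same module takes the fast path and is just an array index plus an add.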
As one can imagine, traversing the TLS data structures when accessing each variable could be slow. For this reason there are different TLS access models that the compiler can choose to minimize variable access overhead.
The Global Dynamic (GD), sometimes called General Dynamic, access model is the slowest access model which will traverse the entire TLS data structure for each variable access. It is used for accessing variables in dynamic shared libraries.
Not counting relocations for the PLT and GOT entries, before linking the .text contains 1 placeholder for a GOT offset. This GOT entry will contain the arguments to __tls_get_addr.
After linking there will be 2 relocation entries in the GOT to be resolved by the dynamic linker. These are R_TLS_DTPMOD, the TLS module index, and R_TLS_DTPOFF, the offset of the variable into the TLS module.
File: tls-gd.c
extern __thread int x;
int* get_x_addr() {
return &x;
}
tls-gd.o: file format elf32-or1k
Disassembly of section .text:
0000004c <get_x_addr>:
4c: 18 60 [00 00] l.movhi r3,[0] # 4c: R_OR1K_TLS_GD_HI16 x
50: 9c 21 ff f8 l.addi r1,r1,-8
54: a8 63 [00 00] l.ori r3,r3,[0] # 54: R_OR1K_TLS_GD_LO16 x
58: d4 01 80 00 l.sw 0(r1),r16
5c: d4 01 48 04 l.sw 4(r1),r9
60: 04 00 00 02 l.jal 68 <get_x_addr+0x1c>
64: 1a 00 [00 00] l.movhi r16,[0] # 64: R_OR1K_GOTPC_HI16 _GLOBAL_OFFSET_TABLE_-0x4
68: aa 10 [00 00] l.ori r16,r16,[0] # 68: R_OR1K_GOTPC_LO16 _GLOBAL_OFFSET_TABLE_
6c: e2 10 48 00 l.add r16,r16,r9
70: 04 00 [00 00] l.jal [0] # 70: R_OR1K_PLT26 __tls_get_addr
74: e0 63 80 00 l.add r3,r3,r16
78: 85 21 00 04 l.lwz r9,4(r1)
7c: 86 01 00 00 l.lwz r16,0(r1)
80: 44 00 48 00 l.jr r9
84: 9c 21 00 08 l.addi r1,r1,8
tls-gd.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000020 <get_x_addr>:
20: 48 83 ec 08 sub $0x8,%rsp
24: 66 48 8d 3d [00 00 00 00] lea [0](%rip),%rdi # 28 R_X86_64_TLSGD x-0x4
2c: 66 66 48 e8 [00 00 00 00] callq [0] # 30 R_X86_64_PLT32 __tls_get_addr-0x4
34: 48 83 c4 08 add $0x8,%rsp
38: c3 retq
The Local Dynamic (LD) access model is an optimization for Global Dynamic where
multiple variables may be accessed from the same TLS module. Instead of
traversing the TLS data structure for each variable, the TLS data section address
is loaded once by calling __tls_get_addr
with an offset of 0
. Next, variables
can be accessed with individual offsets.
Local Dynamic is not supported on OpenRISC yet.
Not counting relocations for the PLT and GOT entries; before linking the .text
contains 1 placeholder for a GOT offset and 2 placeholders for the TLS offsets.
This GOT entry will contain the arguments to __tls_get_addr
.
The TLS offsets will be the offsets to our variables in the TLS data section.
After linking there will be 1 relocation entry in the GOT to be resolved by
the dynamic linker. This is R_TLS_DTPMOD
, the TLS module index, the offset
will be 0x0
.
File: tls-ld.c
static __thread int x;
static __thread int y;
int sum() {
return x + y;
}
tls-ld.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000030 <sum>:
30: 48 83 ec 08 sub $0x8,%rsp
34: 48 8d 3d [00 00 00 00] lea [0](%rip),%rdi # 37 R_X86_64_TLSLD x-0x4
3b: e8 [00 00 00 00] callq [0] # 3c R_X86_64_PLT32 __tls_get_addr-0x4
40: 8b 90 [00 00 00 00] mov [0](%rax),%edx # 42 R_X86_64_DTPOFF32 x
46: 03 90 [00 00 00 00] add [0](%rax),%edx # 48 R_X86_64_DTPOFF32 y
4c: 48 83 c4 08 add $0x8,%rsp
50: 89 d0 mov %edx,%eax
52: c3 retq
The Initial Exec (IE) access model does not require traversing the TLS data structure. It requires that the compiler knows that offset from the TP to the variable can be computed during link time.
As Initial Exec does not require calling __tls_get_addr it is more efficient than the GD and LD access models.
Text contains a placeholder for the got address of the offset. Not counting
relocation entry for the GOT; before linking the .text
contains 1 placeholder
for a GOT offset. This GOT entry will contain the TP offset to the variable.
After linking there will be no remaining relocation entries. The .text
section
contains the actual GOT offset and the GOT entry will contain the TP offset
to the variable.
File: tls-ie.c
Initial exec C code will be the same as global dynamic, however IE access will be chosen when static compiling.
extern __thread int x;
int* get_x_addr() {
return &x;
}
00000038 <get_x_addr>:
38: 9c 21 ff fc l.addi r1,r1,-4
3c: 1a 20 [00 00] l.movhi r17,[0x0] # 3c: R_OR1K_TLS_IE_AHI16 x
40: d4 01 48 00 l.sw 0(r1),r9
44: 04 00 00 02 l.jal 4c <get_x_addr+0x14>
48: 1a 60 [00 00] l.movhi r19,[0x0] # 48: R_OR1K_GOTPC_HI16 _GLOBAL_OFFSET_TABLE_-0x4
4c: aa 73 [00 00] l.ori r19,r19,[0x0] # 4c: R_OR1K_GOTPC_LO16 _GLOBAL_OFFSET_TABLE_
50: e2 73 48 00 l.add r19,r19,r9
54: e2 31 98 00 l.add r17,r17,r19
58: 85 71 [00 00] l.lwz r11,[0](r17) # 58: R_OR1K_TLS_IE_LO16 x
5c: 85 21 00 00 l.lwz r9,0(r1)
60: e1 6b 50 00 l.add r11,r11,r10
64: 44 00 48 00 l.jr r9
68: 9c 21 00 04 l.addi r1,r1,4
0000000000000010 <get_x_addr>:
10: 48 8b 05 [00 00 00 00] mov 0x0(%rip),%rax # 13: R_X86_64_GOTTPOFF x-0x4
17: 64 48 03 04 25 00 00 00 00 add %fs:0x0,%rax
20: c3 retq
The Local Exec (LE) access model does not require traversing the TLS data structure or a GOT entry. It is chosen by the compiler when accessing file local variables in the current program.
The Local Exec access model is the most efficient.
Before linking the .text
section contains one relocation entry for a TP
offset.
After linking the .text
section contains the value of the TP offset.
File: tls-le.c
In the Local Exec example the variable x
is local, it is not extern
.
static __thread int x;
int * get_x_addr() {
return &x;
}
00000010 <get_x_addr>:
10: 19 60 [00 00] l.movhi r11,[0x0] # 10: R_OR1K_TLS_LE_AHI16 .LANCHOR0
14: e1 6b 50 00 l.add r11,r11,r10
18: 44 00 48 00 l.jr r9
1c: 9d 6b [00 00] l.addi r11,r11,[0] # 1c: R_OR1K_TLS_LE_LO16 .LANCHOR0
0000000000000010 <get_x_addr>:
10: 64 48 8b 04 25 00 00 00 00 mov %fs:0x0,%rax
19: 48 05 [00 00 00 00] add $0x0,%rax # 1b: R_X86_64_TPOFF32 x
1f: c3 retq
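The difference between the two exec models can be modelled in a few lines of C. This is only a sketch of the arithmetic the listings above perform, with hypothetical function names. `tp` stands for the thread pointer (r10 in the OpenRISC listings, `%fs:0` on x86-64) and `got` for the global offset table.

```c
#include <stdint.h>

/* Initial Exec: the TP offset is read from a GOT slot at run time. */
static uintptr_t ie_address(uintptr_t tp, const intptr_t *got, unsigned slot)
{
    return tp + (uintptr_t) got[slot];
}

/* Local Exec: the TP offset is a link-time constant in the instruction. */
static uintptr_t le_address(uintptr_t tp, intptr_t tp_offset)
{
    return tp + (uintptr_t) tp_offset;
}
```

Both return the same address for the same offset. Initial Exec pays for one extra memory load, which is why Local Exec is the cheaper of the two.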
As some TLS access methods are more efficient than others we would like to choose the best method for each variable access. However, we sometimes don’t know where a variable will come from until link time.
On some architectures the linker will rewrite the TLS access code sequence to change to a more efficient access model, this is called relaxation.
One type of relaxation performed by the linker is GD to IE relaxation. During compile
time GD relocation may be chosen for extern
variables. However, during link time
the variable may be found in the same module i.e. not a shared object which would require
GD access. In this case the access model can be changed to IE.
That’s pretty cool.
OpenRISC, the architecture I work on, does not support any of this yet; it requires changes to the compiler and linker. The compiler needs to be updated to mark sections of the output .text that can be rewritten (often with added NOP codes). The linker needs to be updated to know how to identify the relaxation opportunity and perform it.
In this article we have covered how TLS variables are accessed per thread via the TLS data structure. Also, we saw how different TLS access models provide varying levels of efficiency.
In the next article we will look more into how this is implemented in GCC, the linker and the GLIBC runtime dynamic linker.
In order to fix issues with tests I had to learn more than I did before about ELF Relocations, Thread Local Storage and the binutils linker implementation in BFD. There is a lot of documentation available, but it’s a bit hard to follow as it assumes certain knowledge; for example, have a look at the Solaris Linker and Libraries section on relocations. In this article I will try to fill in those gaps.
This will be an illustrated 3 part series covering
All of the examples in this article can be found in my tls-examples project. Please check it out.
On Linux, you can download it and make
it with your favorite toolchain.
By default it will cross compile using an openrisc toolchain.
This can be overridden with the CROSS_COMPILE
variable.
For example, to build for your current host.
$ git clone git@github.com:stffrdhrn/tls-examples.git
$ make CROSS_COMPILE=
gcc -fpic -c -o tls-gd-dynamic.o tls-gd.c -Wall -O2 -g
gcc -fpic -c -o nontls-dynamic.o nontls.c -Wall -O2 -g
...
objdump -dr x-static.o > x-static.S
objdump -dr xy-static.o > xy-static.S
Now we can get started.
Before we can talk about relocations we need to talk a bit about what makes up ELF binaries. This is a prerequisite as relocations and TLS are part of ELF binaries. There are a few basic ELF binary types:
Object files (.o) - produced by a compiler, contain a collection of sections, also called relocatable files.
Shared objects (.so) - a program library, contains sections grouped into segments.
Here we will discuss Object Files and Program Files.
The compiler generates object files, these contain sections of binary data and these are not executable.
The object file produced by gcc
generally contains .rela.text
, .text
, .data
and .bss
sections.
.rela.text - a list of relocations against the .text section
.text - contains compiled program machine code
.data - static and non static initialized variable values
.bss - static and non static non-initialized variables
ELF binaries are made of sections and segments.
A segment contains a group of sections and the segment defines how the data should be loaded into memory for program execution.
Each segment is mapped to program memory by the kernel when a process is created. Program files contain most of the same sections as objects but there are some differences.
.text - contains executable program code, there is no .rela.text section
.got - the global offset table used to access variables, created during link time. May be populated during runtime.
The readelf tool can help inspect ELF binaries.
Some examples:
Using the -S
option we can read sections from an elf file.
As we can see below we have the .text
, .rela.text
, .bss
and many other
sections.
$ readelf -S tls-le-static.o
There are 20 section headers, starting at offset 0x604:
Section Headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[ 0] NULL 00000000 000000 000000 00 0 0 0
[ 1] .text PROGBITS 00000000 000034 000020 00 AX 0 0 4
[ 2] .rela.text RELA 00000000 0003f8 000030 0c I 17 1 4
[ 3] .data PROGBITS 00000000 000054 000000 00 WA 0 0 1
[ 4] .bss NOBITS 00000000 000054 000000 00 WA 0 0 1
[ 5] .tbss NOBITS 00000000 000054 000004 00 WAT 0 0 4
[ 6] .debug_info PROGBITS 00000000 000054 000074 00 0 0 1
[ 7] .rela.debug_info RELA 00000000 000428 000084 0c I 17 6 4
[ 8] .debug_abbrev PROGBITS 00000000 0000c8 00007c 00 0 0 1
[ 9] .debug_aranges PROGBITS 00000000 000144 000020 00 0 0 1
[10] .rela.debug_arang RELA 00000000 0004ac 000018 0c I 17 9 4
[11] .debug_line PROGBITS 00000000 000164 000087 00 0 0 1
[12] .rela.debug_line RELA 00000000 0004c4 00006c 0c I 17 11 4
[13] .debug_str PROGBITS 00000000 0001eb 00007a 01 MS 0 0 1
[14] .comment PROGBITS 00000000 000265 00002b 01 MS 0 0 1
[15] .debug_frame PROGBITS 00000000 000290 000030 00 0 0 4
[16] .rela.debug_frame RELA 00000000 000530 000030 0c I 17 15 4
[17] .symtab SYMTAB 00000000 0002c0 000110 10 18 15 4
[18] .strtab STRTAB 00000000 0003d0 000025 00 0 0 1
[19] .shstrtab STRTAB 00000000 000560 0000a1 00 0 0 1
Using the -S option on a program file we can also read the sections. The file type does not matter; as long as it is an ELF file we can read the sections.
As we can see below there is no longer a .rela.text section, but we have others including the .got section.
$ readelf -S tls-le-static
There are 31 section headers, starting at offset 0x32e8fc:
Section Headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[ 0] NULL 00000000 000000 000000 00 0 0 0
[ 1] .text PROGBITS 000020d4 0000d4 080304 00 AX 0 0 4
[ 2] __libc_freeres_fn PROGBITS 000823d8 0803d8 001118 00 AX 0 0 4
[ 3] .rodata PROGBITS 000834f0 0814f0 01544c 00 A 0 0 4
[ 4] __libc_subfreeres PROGBITS 0009893c 09693c 000024 00 A 0 0 4
[ 5] __libc_IO_vtables PROGBITS 00098960 096960 0002f4 00 A 0 0 4
[ 6] __libc_atexit PROGBITS 00098c54 096c54 000004 00 A 0 0 4
[ 7] .eh_frame PROGBITS 00098c58 096c58 0027a8 00 A 0 0 4
[ 8] .gcc_except_table PROGBITS 0009b400 099400 000089 00 A 0 0 1
[ 9] .note.ABI-tag NOTE 0009b48c 09948c 000020 00 A 0 0 4
[10] .tdata PROGBITS 0009dc28 099c28 000010 00 WAT 0 0 4
[11] .tbss NOBITS 0009dc38 099c38 000024 00 WAT 0 0 4
[12] .init_array INIT_ARRAY 0009dc38 099c38 000004 04 WA 0 0 4
[13] .fini_array FINI_ARRAY 0009dc3c 099c3c 000008 04 WA 0 0 4
[14] .data.rel.ro PROGBITS 0009dc44 099c44 0003bc 00 WA 0 0 4
[15] .data PROGBITS 0009e000 09a000 000de0 00 WA 0 0 4
[16] .got PROGBITS 0009ede0 09ade0 000064 04 WA 0 0 4
[17] .bss NOBITS 0009ee44 09ae44 000bec 00 WA 0 0 4
[18] __libc_freeres_pt NOBITS 0009fa30 09ae44 000014 00 WA 0 0 4
[19] .comment PROGBITS 00000000 09ae44 00002a 01 MS 0 0 1
[20] .debug_aranges PROGBITS 00000000 09ae6e 002300 00 0 0 1
[21] .debug_info PROGBITS 00000000 09d16e 0fd048 00 0 0 1
[22] .debug_abbrev PROGBITS 00000000 19a1b6 0270ca 00 0 0 1
[23] .debug_line PROGBITS 00000000 1c1280 0ce95c 00 0 0 1
[24] .debug_frame PROGBITS 00000000 28fbdc 0063bc 00 0 0 4
[25] .debug_str PROGBITS 00000000 295f98 011e35 01 MS 0 0 1
[26] .debug_loc PROGBITS 00000000 2a7dcd 06c437 00 0 0 1
[27] .debug_ranges PROGBITS 00000000 314204 00c900 00 0 0 1
[28] .symtab SYMTAB 00000000 320b04 0075d0 10 29 926 4
[29] .strtab STRTAB 00000000 3280d4 0066ca 00 0 0 1
[30] .shstrtab STRTAB 00000000 32e79e 00015c 00 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
p (processor specific)
Using the -l
option on a program file we can read the segments.
Notice how segments map from file offsets to memory offsets and alignment.
The two different LOAD
type segments are segregated by read only/execute and read/write.
Each section is also mapped to a segment here. As we can see, .text is in the first LOAD segment, which is executable as expected.
$ readelf -l tls-le-static
Elf file type is EXEC (Executable file)
Entry point 0x2104
There are 5 program headers, starting at offset 52
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
LOAD 0x000000 0x00002000 0x00002000 0x994ac 0x994ac R E 0x2000
LOAD 0x099c28 0x0009dc28 0x0009dc28 0x0121c 0x01e1c RW 0x2000
NOTE 0x09948c 0x0009b48c 0x0009b48c 0x00020 0x00020 R 0x4
TLS 0x099c28 0x0009dc28 0x0009dc28 0x00010 0x00034 R 0x4
GNU_RELRO 0x099c28 0x0009dc28 0x0009dc28 0x003d8 0x003d8 R 0x1
Section to Segment mapping:
Segment Sections...
00 .text __libc_freeres_fn .rodata __libc_subfreeres __libc_IO_vtables __libc_atexit .eh_frame .gcc_except_table .note.ABI-tag
01 .tdata .init_array .fini_array .data.rel.ro .data .got .bss __libc_freeres_ptrs
02 .note.ABI-tag
03 .tdata .tbss
04 .tdata .init_array .fini_array .data.rel.ro
Using the -l
option with an object file does not work as we can see below.
$ readelf -l tls-le-static.o
There are no program headers in this file.
As mentioned an object file by itself is not executable. The main reason is that
there are no program headers as we just saw. Another reason is that
the .text
section still contains relocation entries (or placeholders) for the
addresses of variables located in the .data
and .bss
sections.
These placeholders will just be 0
in the machine code. So, if we tried to run
the machine code in an object file we would end up with Segmentation faults (SEGV).
A relocation entry is a placeholder that is added by the compiler or linker when
producing ELF binaries.
The relocation entries are to be filled in with addresses pointing to data.
Relocation entries can be made in code such as the .text
section or in data
sections like the .got
section. For example:
The diagram above shows relocation entries as white circles. Relocation entries may be filled or resolved at link-time or dynamically during execution.
Link time relocations
- .text sections
Dynamic relocations
- .got and .plt sections which link to shared objects.
Note: Statically built binaries do not have any dynamic relocations and are not loaded with the dynamic linker.
In general link time relocations are used to fill in relocation entries in code. Dynamic relocations fill in relocation entries in data sections.
A list of relocations in an ELF binary can be printed using readelf with the -r option.
Output of readelf -r tls-gd-dynamic.o
Relocation section '.rela.text' at offset 0x530 contains 10 entries:
Offset Info Type Sym.Value Sym. Name + Addend
00000000 00000f16 R_OR1K_TLS_GD_HI1 00000000 x + 0
00000008 00000f17 R_OR1K_TLS_GD_LO1 00000000 x + 0
00000020 0000100c R_OR1K_GOTPC_HI16 00000000 _GLOBAL_OFFSET_TABLE_ - 4
00000024 0000100d R_OR1K_GOTPC_LO16 00000000 _GLOBAL_OFFSET_TABLE_ + 0
0000002c 00000d0f R_OR1K_PLT26 00000000 __tls_get_addr + 0
...
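Each Info value in this listing packs two numbers: the symbol table index in the upper bits and the relocation type in the low byte. The helper names below are my own; on Linux, `<elf.h>` provides the equivalent `ELF32_R_SYM` and `ELF32_R_TYPE` macros. A small sketch of the decoding for 32-bit ELF:

```c
#include <stdint.h>

/* ELF32 packs the symbol table index and relocation type into r_info. */
static uint32_t rel32_sym(uint32_t info)  { return info >> 8; }
static uint32_t rel32_type(uint32_t info) { return info & 0xff; }
```

For the first entry, `0x00000f16` splits into symbol index 15 and type `0x16`, which readelf prints as the (truncated) name `R_OR1K_TLS_GD_HI1`.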
The relocation entry list explains how and where to apply each relocation entry. It contains:
Offset - the location in the binary that needs to be updated
Info - the encoded value containing the Type and Sym (the Addend is stored in its own field), which is broken down to:
Type - the type of relocation (the formula for what is to be performed is defined in the linker)
Sym. Value - the address value (if known) of the symbol.
Sym. Name - the name of the symbol (variable name) that this relocation needs to find during link time.
Addend - a value that needs to be added to the derived symbol address. This is used with arrays (i.e. for a relocation referencing a[14] we would have Sym. Name a and an Addend of the element size of a times 14)
File: nontls.c
In the example below we have a simple variable and a function to access its address.
static int x;
int* get_x_addr() {
return &x;
}
Let’s see what happens when we compile this source.
The steps to compile and link can be found in the tls-examples project hosting the source examples.
The diagram above shows relocations in the resulting object file as white circles.
In the actual output below we can see that access to the variable x
is
referenced by a literal 0
in each instruction. These are highlighted with
square brackets []
below for clarity.
These empty parts of the .text
section are relocation entries.
Addr. Machine Code Assembly Relocations
0000000c <get_x_addr>:
c: 19 60 [00 00] l.movhi r11,[0] # c R_OR1K_AHI16 .bss
10: 44 00 48 00 l.jr r9
14: 9d 6b [00 00] l.addi r11,r11,[0] # 14 R_OR1K_LO_16_IN_INSN .bss
The function get_x_addr
will return the address of variable x
.
We can look at the assembly instructions to understand how this is done. Some background on the OpenRISC ABI:
The function return value is placed in register r11.
The function return address (link register) is r9.
Now, let’s break down the assembly:
l.movhi - move the value [0] into the high bits of register r11, clearing the lower bits.
l.addi - add the value in register r11 to the value [0] and store the result in r11.
l.jr - jump to the address in r9
This constructs a 32-bit value out of 2 16-bit values. The l.addi after l.jr sits in the branch delay slot, so it still executes before the jump takes effect.
The diagram above shows the relocations have been replaced with actual values.
As we can see from the linker output, the places in the machine code that had relocation placeholders are now replaced with values. For example 19 60 00 00 has become 19 60 00 0a.
00002298 <get_x_addr>:
2298: 19 60 00 0a l.movhi r11,0xa
229c: 44 00 48 00 l.jr r9
22a0: 9d 6b ee 60 l.addi r11,r11,-4512
If we calculate (0xa << 16) + -4512, where -4512 is the sign-extended 16-bit value 0xee60, we get 0009ee60. That is the same location as x within our binary. We can check this with readelf -s, which lists all symbols.
$ readelf -s nontls-static | grep ' x'
42: 0009ee60 4 OBJECT LOCAL DEFAULT 17 x
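The same arithmetic can be written out in C to double check it. The function name `compose_address` is hypothetical; it only mimics what the `l.movhi` and `l.addi` pair does with its two immediates, including the sign extension of the low half.

```c
#include <stdint.h>

/* Mimic l.movhi followed by l.addi: the low half is sign-extended. */
static uint32_t compose_address(uint16_t hi, uint16_t lo)
{
    uint32_t high = (uint32_t) hi << 16; /* l.movhi r11,hi */
    int32_t  low  = (int16_t) lo;        /* simm16 is sign-extended */
    return high + (uint32_t) low;        /* l.addi r11,r11,lo */
}
```

Calling `compose_address(0xa, 0xee60)` gives `0x0009ee60`, the address of `x` reported by readelf.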
As we saw above, a simple program resulted in 2 different relocation entries just to compose the address of 1 variable. We saw:
R_OR1K_AHI16
R_OR1K_LO_16_IN_INSN
The need for different relocation types comes from the different requirements for the relocation. Processing a relocation usually involves a very simple transform, and each relocation type defines a different transform. The components of the relocation definition are:
To be more specific about the above relocations we have:
Relocation Type | Bit-Field | Formula |
---|---|---|
R_OR1K_AHI16 | simm16 | (S + 0x8000) >> 16 |
R_OR1K_LO_16_IN_INSN | simm16 | S & 0xffff |
The Bit-Field described above is simm16
which means update the lower 16-bits
of the 32-bit value at the output offset and do not disturb the upper 16-bits.
+----------+----------+
| | simm16 |
| 31 16 | 15 0 |
+----------+----------+
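As a sketch of what the linker does with these two relocations, the C below patches only the simm16 field of an instruction word. The function names are hypothetical and this is not the BFD code. I am assuming the high half is the "adjusted" one, meaning 0x8000 is added before shifting so that the sign-extended low half lands back on `S`. That assumption matches the `0xa` seen in the linked output earlier.

```c
#include <stdint.h>

/* Replace only the simm16 field (bits 15..0), keeping the opcode bits. */
static uint32_t patch_simm16(uint32_t insn, uint32_t value)
{
    return (insn & 0xffff0000u) | (value & 0xffffu);
}

/* Adjusted high half: add 0x8000 so the sign-extended low half lands on S. */
static uint32_t reloc_ahi16(uint32_t s) { return (s + 0x8000u) >> 16; }

/* Low half as stored in the instruction. */
static uint32_t reloc_lo16(uint32_t s)  { return s & 0xffffu; }
```

Applied to the object file words `0x19600000` and `0x9d6b0000` with `S = 0x0009ee60`, this produces `0x1960000a` and `0x9d6bee60`, the same bytes as the linked `get_x_addr`.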
There are many other Relocation Types with different Bit-Fields and Formulas. These use different methods based on what each instruction does, and where each instruction encodes its immediate value.
For full listings refer to architecture manuals.
Take a look and see if you can understand how to read these now.
In this article we have discussed what ELF binaries are and how they can be read. We have talked about how from compilation to linking to runtime, relocation entries are used to communicate which parts of a program remain to be resolved. We then discussed how relocation types provide a formula and bit-mask for updating the places in ELF binaries that need to be filled in.
In the next article we will discuss how Thread Local Storage works; both link-time and runtime relocation entries play a big part in how TLS works.
In the last article, Marocchino Instruction Pipeline, we discussed the architecture of the CPU. In this article let’s look at how Marocchino achieves out-of-order execution using the Tomasulo algorithm.
In a traditional pipelined CPU the goal is to retire one instruction per clock cycle. Any pipeline stall means an execution clock cycle will be lost. One method for reducing the effect of pipeline stalls is instruction parallelization. In 1993 the Intel Pentium processor was one of the first consumer CPUs to achieve this with its dual U and V integer pipelines. The Pentium U and V pipelines require certain coding techniques to take full advantage. Achieving more parallelism requires more sophisticated data hazard detection and instruction scheduling. Introduced with the IBM System/360 in the 60’s by Robert Tomasulo, the Tomasulo algorithm provides the building blocks to allow for multiple instruction execution parallelism. Generally speaking, no special programming is needed to take advantage of instruction parallelism on a processor implementing the Tomasulo algorithm.
Though the technique of out-of-order CPU execution with Tomasulo’s algorithm had been designed in the 60’s, it did not make its way into popular consumer hardware until the Pentium Pro in 1995. Further Pentium revisions such as the Pentium III, Pentium 4 and Core architectures are based on this same architecture. Understanding this architecture is a key to understanding modern CPUs.
In this article we will point out comparisons between the Marocchino and the Pentium Pro, whose architecture can be seen in the diagram below.
The Marocchino implements the Tomasulo algorithm in a CPU that can be synthesized and run on an FPGA. Let’s dive into the implementation by breaking down the building blocks used in Tomasulo’s algorithm and how they have been implemented in Marocchino.
Besides the basic CPU modules like Instruction Fetch, Decode and Register File, the building blocks that are used in the Tomasulo algorithm are as follows:
extadr.
The diagram below shows how these components are arranged in the Marocchino processor.
As mentioned above, the goal of a pipelined architecture is to retire one instruction per clock cycle. Pipelining helps achieve this by splitting an instruction into pipeline stages, i.e. Fetch, Decode, Execute, Load/Store and Register Write Back. If one instruction depends on the results produced by a previous instruction there will be a problem, as register write back of the previous instruction may not complete before registers are read during the Decode phase of the dependent instruction. This and other types of dependencies between pipeline stages are called hazards, and they must be avoided.
The Tomasulo algorithm, with its Reservation Stations, Register Allocation Tables and other building blocks, tries to avoid hazards causing pipeline stalls. Let’s look at a simple example to see how this is done.
b = a * 2
c = a + b
y = x / y
Here we can see that instruction 2
depends on instruction 1
as the addition
of a + b
cannot be performed until b
is produced by instruction 1
.
Let’s assume that instruction 1
is currently executing on the MULTIPLY
unit.
The CPU decodes instruction 2
, instead of detecting a data hazard and stalling
the pipeline instruction 2
will be placed in the reservation station of the
ADD
execution unit. The RAT indicates that b
is busy and being produced by
instruction 1
. This means instruction 2
cannot execute right away. Next, we
can look at instruction 3
and place it onto the reservation station of the
DIVIDE
execution unit. As instruction 3
has no hazards for x
and y
it
can proceed directly to execution, even before instruction 2
is ready for
execution.
Note, if a required reservation station is full the pipeline will stall.
As mentioned above, execution units will present their output onto the common
data bus wrbk_result
and the data will be written into reservation stations.
Writing the register to the reservation station may occur before writing
back to the register file. This is what register renaming is, as the
register input does not come directly from the register file.
When an instruction is issued it may be registered in the RAT, OCB and Reservation
Station. It is assigned an Instruction Id for tracking purposes. In Marocchino
this is called the extadr
and is 3
bits wide. It is generated by the simple
instruction ID generation logic.
This is implemented in or1k_marocchino_oman.v with the following counter
logic which generates a new extadr
every time an instruction is decoded.
// extension to DEST, FLAG or CARRY
// Zero value is reserved as "not used"
localparam [DEST_EXTADR_WIDTH-1:0] EXTADR_MAX = ((1 << DEST_EXTADR_WIDTH) - 1);
localparam [DEST_EXTADR_WIDTH-1:0] EXTADR_MIN = 1;
// ---
reg [DEST_EXTADR_WIDTH-1:0] dcod_extadr_r;
wire [DEST_EXTADR_WIDTH-1:0] extadr_adder;
// ---
assign extadr_adder = (dcod_extadr_r == EXTADR_MAX) ? EXTADR_MIN : (dcod_extadr_r + 1'b1);
// ---
always @(posedge cpu_clk) begin
if (pipeline_flush_i)
dcod_extadr_r <= {DEST_EXTADR_WIDTH{1'b0}};
else if (padv_dcod_i)
dcod_extadr_r <= fetch_valid_i ? extadr_adder : dcod_extadr_r;
end // @clock
// support in-1clk-unit forwarding
assign dcod_extadr_o = dcod_extadr_r;
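For readers less used to Verilog, the wrap-around in `extadr_adder` can be modelled in C. This is only a software sketch with hypothetical `TOY_` names. It covers the "next ID" computation and not the flush or pipeline advance conditions.

```c
#include <stdint.h>

#define TOY_EXTADR_WIDTH 3
#define TOY_EXTADR_MAX   ((1u << TOY_EXTADR_WIDTH) - 1u) /* 7 */
#define TOY_EXTADR_MIN   1u                              /* 0 means "not used" */

/* Next ID after `cur`: count up and wrap from MAX back to MIN, skipping 0. */
static uint8_t toy_next_extadr(uint8_t cur)
{
    return (cur == TOY_EXTADR_MAX) ? TOY_EXTADR_MIN : (uint8_t) (cur + 1u);
}
```

With a 3-bit width this yields seven usable IDs, 1 through 7, since 0 is reserved.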
Every instruction that is queued by the order manager is designated
an extadr
. This allows components like the reservation station and RAT tables
to track when an instruction starts and completes executing.
The interactions between the extadr and other components are as follows.
During decode:
- The order manager generates a new extadr by incrementing a counter.
- The extadr is registered along with other decoded instruction details.
- The RAT outputs the extadr for the decoded instruction to indicate which instruction will resolve a hazard.
During execution:
- The OCB outputs the extadr of the oldest instruction registered in a FIFO fashion. This is to indicate which instruction is to be retired and ensures instructions are retired in order.
- The RAT holds the extadr indicating which queued instruction will produce a register.
- The RAT receives the extadr from the OCB output to clear allocation flags.
- The reservation station compares the extadr with hazards to track when instructions have finished and results are available.
The register allocation table (RAT), sometimes called register alias table, keeps track of which registers are currently being produced by pending instructions. This is used to derive and resolve hazards.
The outputs of the RAT cell are:
rat_rd_extadr_o - indicates which extadr instruction has been allocated to generate this register. This will be updated with decod_extadr_i when padv_exec_i goes high.
rat_rd_alloc_o - indicates that this register is currently allocated to an instruction which is not yet complete. This will be set when padv_exec_i goes high, decod_rfd_we_i is high, and dcod_rfd_adr_i is equal to GPR_ADR.
The RAT table is made of 32 rat_cell modules; one cell per register. The register which the cell is allocated to is stored within GPR_ADR in the rat cell.
Outputs of the RAT are registered to reservation stations. The hazards are derived with the following logic in or1k_marocchino_oman.v.
The omn2dec_hazard_d1a1_o
hazard means that the argument a
of the decoded
instruction will be resolved when the instruction with extadr
in omn2dec_extadr_dxa1_o
is
retired. The 2
in d2
, a2
and b2
represent the 2nd register used in 64-bit
FPU instructions.
// # relative operand A1
assign omn2dec_hazard_d1a1_o = rat_rd1_alloc[dcod_rfa1_adr_i] & dcod_rfa1_req_i;
assign omn2dec_hazard_d2a1_o = rat_rd2_alloc[dcod_rfa1_adr_i] & dcod_rfa1_req_i;
assign omn2dec_extadr_dxa1_o = rat_extadr[dcod_rfa1_adr_i];
// # relative operand B1
assign omn2dec_hazard_d1b1_o = rat_rd1_alloc[dcod_rfb1_adr_i] & dcod_rfb1_req_i;
assign omn2dec_hazard_d2b1_o = rat_rd2_alloc[dcod_rfb1_adr_i] & dcod_rfb1_req_i;
assign omn2dec_extadr_dxb1_o = rat_extadr[dcod_rfb1_adr_i];
// # relative operand A2
assign omn2dec_hazard_d1a2_o = rat_rd1_alloc[dcod_rfa2_adr_i] & dcod_rfa2_req_i;
assign omn2dec_hazard_d2a2_o = rat_rd2_alloc[dcod_rfa2_adr_i] & dcod_rfa2_req_i;
assign omn2dec_extadr_dxa2_o = rat_extadr[dcod_rfa2_adr_i];
// # relative operand B2
assign omn2dec_hazard_d1b2_o = rat_rd1_alloc[dcod_rfb2_adr_i] & dcod_rfb2_req_i;
assign omn2dec_hazard_d2b2_o = rat_rd2_alloc[dcod_rfb2_adr_i] & dcod_rfb2_req_i;
assign omn2dec_extadr_dxb2_o = rat_extadr[dcod_rfb2_adr_i];
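The hazard logic above is easier to experiment with as a small software model. The C below is a sketch with hypothetical `toy_` names. It tracks a single destination per instruction, ignoring the `d2` case used by 64-bit FPU results, and it simplifies the clearing of allocation flags to a scan at retire time.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_NUM_GPRS 32

/* One RAT cell per register: is it pending, and which instruction ID owns it. */
typedef struct {
    bool    alloc[TOY_NUM_GPRS];
    uint8_t extadr[TOY_NUM_GPRS];
} toy_rat;

/* Decode of an instruction writing `rd`: mark the register as pending. */
static void toy_rat_allocate(toy_rat *rat, unsigned rd, uint8_t extadr)
{
    rat->alloc[rd]  = true;
    rat->extadr[rd] = extadr;
}

/* Write-back of instruction `extadr`: free registers still owned by it. */
static void toy_rat_retire(toy_rat *rat, uint8_t extadr)
{
    for (unsigned r = 0; r < TOY_NUM_GPRS; r++)
        if (rat->alloc[r] && rat->extadr[r] == extadr)
            rat->alloc[r] = false;
}

/* Same shape as omn2dec_hazard_d1a1_o: allocated AND operand requested. */
static bool toy_rat_hazard(const toy_rat *rat, unsigned rs, bool requested)
{
    return rat->alloc[rs] && requested;
}
```

A hazard is reported only while the source register is allocated and the decoded instruction actually reads it, mirroring the `& dcod_rfa1_req_i` term in the Verilog.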
The reservation station receives an instruction from the decode stage and queues it until all hazards are resolved and the execution unit is free.
Each reservation station has one busy slot and one execution slot. The Pentium Pro had 20 reservation station slots; the Marocchino has 5, or 10 if you count the execution slots.
Reservation stations are populated when the pipeline advance signal padv_rsrvs_i is asserted.
An instruction may be forwarded directly to execution if there are no hazards
and the execution unit is free.
busy_extadr_dxa_r - is populated with data from omn2dec_hazards_addrs_i. The busy_extadr_dxa_r register represents the extadr to look for which will resolve the A register hazard.
busy_extadr_dxb_r - same as ‘A’ but indicates which extadr will produce the B register.
busy_hazard_dxa_r - is populated with data from omn2dec_hazards_flags_i. The busy_hazard_dxa_r register represents that there is an instruction executing that will produce register A which has not yet completed.
busy_hazard_dxb_r - same as ‘A’ but indicates that ‘B’ is not available yet.
busy_op_any_r - populated with 1 when padv_rsrvs_i goes high; indicates that there is an operation queued.
busy_op_r - populated with dcod_op_i. Represents the operation pending in the queue.
busy_rfa_r - populated with data from dcod_rfxx_i. Represents the value of operand A pending in the queue.
busy_rfb_r - populated with data from dcod_rfxx_i. Represents the value of operand B pending in the queue.
The reservation station resolves hazards by watching and comparing wrbk_extadr_i with the busy_extadr_dxa_r and busy_extadr_dxb_r registers. If the two match it means that the instruction producing register A or B has finished writing back its results and the hazard can be cleared.
Writeback forwarding is handled via the following verilog multiplexer and register logic.
The first block registers the decoded values dcod_rf* from the register file when padv_rsrvs_i is asserted; otherwise it latches the output of the forwarding logic. If there is a pending hazard, results are forwarded from the common data bus; otherwise the held values are maintained.
// BUSY stage operands A1 & B1
always @(posedge cpu_clk) begin
if (padv_rsrvs_i) begin
busy_rfa1_r <= dcod_rfa1;
busy_rfb1_r <= dcod_rfb1;
end
else begin
busy_rfa1_r <= busy_rfa1;
busy_rfb1_r <= busy_rfb1;
end
end // @clock
// Forwarding
// operand A1
assign busy_rfa1 = busy_hazard_d1a1_r ? wrbk_result1_i :
(busy_hazard_d2a1_r ? wrbk_result2_i : busy_rfa1_r);
// operand B1
assign busy_rfb1 = busy_hazard_d1b1_r ? wrbk_result1_i :
(busy_hazard_d2b1_r ? wrbk_result2_i : busy_rfb1_r);
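The forwarding multiplexer is a priority selection and reads naturally as a C function. This is a sketch with a hypothetical name, modelling one operand for one clock cycle.

```c
#include <stdbool.h>
#include <stdint.h>

/* Pick one operand: forward a write-back result if its hazard flag is set,
   otherwise keep the value already held in the busy slot. */
static uint32_t toy_forward_operand(bool hazard_d1, bool hazard_d2,
                                    uint32_t wrbk_result1, uint32_t wrbk_result2,
                                    uint32_t held_value)
{
    if (hazard_d1)
        return wrbk_result1;
    if (hazard_d2)
        return wrbk_result2;
    return held_value;
}
```

As in the Verilog ternary chain, the `d1` hazard takes priority over `d2`.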
When all hazard flags are cleared the contents of busy_op_r
, busy_rfa_r
and
busy_rfb_r
will be transferred to exec_op_any_r
, exec_op_r
, etc. They
are presented on the outputs and the execution unit can take them and start processing.
The unit_free_o
output signals the control unit that the reservation station
is free and can be issued another instruction. The signal goes high when all hazards
are cleared and the busy state transfers to exec.
In Marocchino the execution units (also referred to as functional units) execute instructions which they receive from the reservation stations.
The execution units in Marocchino are:
or1k_marocchino_int_1clk - handles integer instructions which can complete in 1 clock cycle. This includes SHIFT, ADD, AND, OR etc.
or1k_marocchino_int_div - handles integer DIVIDE operations.
or1k_marocchino_int_mul - handles integer MULTIPLY operations.
or1k_marocchino_lsu - handles memory load store operations. It interfaces with the data cache, MMU and memory bus.
pfpu_marocchino_top - handles floating point operations. These include ADD, MULTIPLY, CMP, I2F etc.
Handshake signals between the reservation station and execution units are used to issue operations to execution units.
The taking_op_i signal comes from the execution unit and indicates that it has received the op. The reservation station then clears all of its exec_*_o output signals.
In the Marocchino the Order Control Buffer (OCB) is the in order retirement unit. It can retire a single instruction at a time. The implementation is a 7 entry FIFO queue. This is much less than the Pentium Pro which contains 40 slots. The OCB receives a single instruction at a time from the decoder and broadcasts the oldest instruction for other components to see. Instructions are retired after execution write back is complete.
If the OCB output indicates a branch instruction or an exception, branch logic is invoked. Instead of waiting for write back to a register the write back logic in the Marocchino will perform the branch operations. This may include flushing the OCB. Special care is taken to handle branch delay slot instruction execution.
The OCB is different from a traditional Tomasulo Reorder Buffer (ROB) in that it does not store any execution write back results.
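The FIFO behaviour described above can be sketched in C as a small ring buffer. This is only an illustration: the entry fields and function names are hypothetical and far simpler than the real OCB entry, but it shows the two properties mentioned, a fixed depth of 7 and entries that carry no execution results.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define OCB_DEPTH 7 /* the OCB is described as a 7 entry FIFO */

/* Hypothetical, heavily reduced OCB entry. Note there is no result field. */
typedef struct {
    uint8_t unit;    /* which execution unit handles the instruction */
    uint8_t rfd_adr; /* destination register address */
} ocb_entry_t;

typedef struct {
    ocb_entry_t slot[OCB_DEPTH];
    size_t head;  /* index of the oldest entry */
    size_t count; /* number of valid entries */
} ocb_t;

/* Enqueue a decoded instruction; returns false when the OCB is full. */
bool ocb_push(ocb_t *q, ocb_entry_t e)
{
    assert(q != NULL);
    if (q->count == OCB_DEPTH)
        return false;
    q->slot[(q->head + q->count) % OCB_DEPTH] = e;
    q->count++;
    return true;
}

/* The oldest non retired instruction, or NULL when the OCB is empty. */
const ocb_entry_t *ocb_oldest(const ocb_t *q)
{
    assert(q != NULL);
    return q->count ? &q->slot[q->head] : NULL;
}

/* Retire (dequeue) the oldest instruction; returns false when empty. */
bool ocb_retire(ocb_t *q)
{
    assert(q != NULL);
    if (q->count == 0)
        return false;
    q->head = (q->head + 1) % OCB_DEPTH;
    q->count--;
    return true;
}
```

Because only the head can be retired, instructions leave the queue in exactly the order they were decoded, regardless of when their execution units finish.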
Each OCB entry stores:
extaddr
This can be seen as defined by the ocbi and ocbo wire buses in or1k_marocchino_oman.v.
// --- OCB-Controls input ---
wire [OCBT_MSB:0] ocbi;
assign ocbi =
{
// --- pipeline [C]ontrol flags ---
dcod_extadr_r, // OCB-Controls entrance
dcod_op_ls_i, // OCB-Controls entrance
dcod_op_fpxx_cmp_i, // OCB-Controls entrance
dcod_op_fpxx_arith_i, // OCB-Controls entrance
dcod_op_mul_i, // OCB-Controls entrance
dcod_op_div_i, // OCB-Controls entrance
dcod_op_1clk_i, // OCB-Controls entrance
dcod_op_jb_r, // OCB-Controls entrance
dcod_op_push_wrbk_i, // OCB-Controls entrance
// --- instruction [A]ttributes ---
pc_decode_i, // OCB-Attributes entrance
dcod_rfd2_adr_i, // OCB-Attributes entrance
dcod_rfd2_we_i, // OCB-Attributes entrance
dcod_rfd1_adr_i, // OCB-Attributes entrance
dcod_rfd1_we_i, // OCB-Attributes entrance
dcod_delay_slot_i, // OCB-Attributes entrance
dcod_op_rfe_i, // OCB-Attributes entrance
// Flag that istruction is restartable
interrupts_en, // OCB-Attributes entrance
// Combined IFETCH/DECODE an exception flag
dcod_an_except_fd_i, // OCB-Attributes entrance
// FETCH & DECODE exceptions
dcod_fetch_except_ibus_err_r, // OCB-Attributes entrance
dcod_fetch_except_ipagefault_r, // OCB-Attributes entrance
dcod_fetch_except_itlb_miss_r, // OCB-Attributes entrance
dcod_except_illegal_i, // OCB-Attributes entrance
dcod_except_syscall_i, // OCB-Attributes entrance
dcod_except_trap_i // OCB-Attributes entrance
};
// --- INSN OCB input ---
wire [OCBT_MSB:0] ocbo;
As discussed above the common data bus collects write back results from execution units and routes them for write back.
This can be seen in or1k_marocchino_cpu.v as below.
// --- regular ---
always @(wrbk_1clk_result or wrbk_div_result or wrbk_mul_result or
wrbk_fpxx_arith_res_hi or wrbk_lsu_result or wrbk_mfspr_result)
begin
wrbk_result1 = wrbk_1clk_result | wrbk_div_result | wrbk_mul_result |
wrbk_fpxx_arith_res_hi | wrbk_lsu_result | wrbk_mfspr_result;
end
// --- FPU64 extention ---
assign wrbk_result2 = wrbk_fpxx_arith_res_lo;
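Combining results with a plain OR only works if at most one unit drives a non-zero value at a time. The C sketch below illustrates that idea under an assumption that is not shown in the excerpt above: each unit presents zero on its result output unless it has been granted write back. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical identifiers for the units feeding wrbk_result1. */
enum {
    UNIT_1CLK, UNIT_DIV, UNIT_MUL, UNIT_FPXX_ARITH, UNIT_LSU, UNIT_MFSPR,
    UNIT_COUNT
};

/* ASSUMPTION: a unit that is not granted write back drives zero. */
uint32_t unit_output(uint32_t result, int granted)
{
    return granted ? result : 0u;
}

/* OR all unit outputs together, as the always block above does. */
uint32_t wrbk_result1(const uint32_t results[UNIT_COUNT], int granted_unit)
{
    assert(results != NULL);
    uint32_t bus = 0;
    for (int i = 0; i < UNIT_COUNT; i++)
        bus |= unit_output(results[i], i == granted_unit);
    return bus;
}
```

Under that assumption the OR behaves like a multiplexer selecting the granted unit, without needing a separate select signal on the bus.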
Tomasulo’s algorithm is still relevant today and used in many processors. Marocchino provides an accessible implementation. Marocchino is, however, not super-scalar: while the Pentium Pro can decode up to 4 instructions at a time, the Marocchino can only decode 1 at a time.
Furthermore, many improvements can be made to Marocchino to increase performance, including:
However, these come with a cost of size on the FPGA. If you are interested in helping out please feel free to contribute.
If anything in this article could be improved, such as more timing diagrams, typo fixes or fixes for diagrams, please send me a message on twitter.
In the last article, Marocchino in Action, we discussed the history of the CPU and how to set up a development environment for it. In this article let’s look a bit deeper into how the Marocchino CPU works.
We will look at how an instruction flows through the Marocchino pipeline.
The Marocchino source code is available on github and is easy to navigate. We have these directories:
rtl/verilog
- the core verilog code, with toplevel modules
or1k_marocchino_top.v
- top level module, connects CPU to wishbone bus
or1k_marocchino_cpu.v
- CPU module, connects CPU pipeline
rtl/verilog/pfpu_marocchino
- the FPU implementation
pfpu_marocchino_top.v
- FPU module, wires together FPU components
bench
- test bench harness monitor modules
doc
- design documentation
At first glance of the code the Marocchino may look like a traditional 5 stage RISC pipeline. It has fetch, decode, execution, load/store and register write back modules which you might picture in your head as follows:
PIPELINE CTRL - progress/stall the pipeline
(or1k_marocchino_ctrl.v)
INSTRUCTION PIPELINE - process an instruction
|
V
/ FETCH \
\ (or1k_marocchino_fetch.v) /
|
V
/ DECODE \
\ (or1k_marocchino_decode.v) /
|
V
/ EXECUTE \
| (or1k_marocchino_int_1clk.v) ALU |
| (or1k_marocchino_int_div.v) DIVISION |
\ (or1k_marocchino_int_mul.v) MULTIPLICATION /
|
V
/ LOAD STORE \
\ (or1k_marocchino_lsu.v) TO/FROM RAM /
|
V
/ WRITE BACK \
\ (or1k_marocchino_rf.v) /
However, once you look a bit closer you notice some things that are different. The top-level module or1k_marocchino_cpu connects the modules and shows:
What this CPU actually is, is an out-of-order instruction pipeline with in order instruction retirement implementing the Tomasulo algorithm.
A simplified view of the CPU’s internal module layout is as per the below diagram.
The Marocchino has two modules for coordinating pipeline stage instruction propagation: the control unit and the order manager.
The control unit of the CPU is in charge of watching over the pipeline stages
and signalling when operations can transfer from one stage to the next. The
Marocchino does this with a series of pipeline advance (padv_*
) signals. In
general for the best efficiency all padv_*
wires should be high at all times
allowing instructions to progress on every clock cycle. But as we will see in
reality, this is difficult to achieve due to pipeline stall scenarios like cache
misses and branch prediction misses. The padv_*
signals include:
The padv_fetch_o
signal instructs the instruction fetch unit to progress.
Internally the fetch unit has 3 stages. The instruction fetch unit interacts
with the instruction cache and instruction memory management unit (MMU).
The padv_fetch_o signal goes low and the pipeline stalls when the decode module is busy (dcod_empty_i is low). The signal dcod_empty_i comes from the Decode module and indicates that an instruction can be accepted by the decode stage.
This is represented by this assign
in or1k_marocchino_ctrl.v:
Note The stepping
and pstep[]
signals are related to debug single stepping, and can be
ignored for our purposes.
// Advance IFETCH
// Stepping condition is close to the one for DECODE
assign padv_fetch_o = padv_all & ((~stepping) | (dcod_empty_i & pstep[0])); // ADV. IFETCH
The padv_dcod_o signal instructs the instruction decode stage to output decoded operands. The decode unit is one stage; if padv_dcod_o is high, it will decode the instruction input every cycle.
The padv_dcod_o
signal goes low if the destination reservation station for the
operands cannot accept an instruction.
This is represented by this assign
in or1k_marocchino_ctrl.v:
// Advance DECODE
assign padv_dcod_o = padv_all & (~wrbk_rfdx_we_i) & // ADV. DECODE
(((~stepping) & dcod_free_i & (dcod_empty_i | ena_dcod)) | // ADV. DECODE
(stepping & dcod_empty_i & pstep[0])); // ADV. DECODE
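For readers less used to Verilog, the DECODE advance equation can be restated as a C predicate. This is just a direct transcription of the assign above into boolean logic; the struct that bundles the inputs is hypothetical and only exists for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bundle of the inputs used by the DECODE advance equation. */
typedef struct {
    bool padv_all;
    bool wrbk_rfdx_we_i; /* second register write back in progress */
    bool stepping;       /* debug single stepping active */
    bool dcod_free_i;
    bool dcod_empty_i;
    bool ena_dcod;
    bool pstep0;         /* stands for pstep[0] */
} dcod_ctrl_in_t;

/* Returns the value padv_dcod_o would take for the given inputs. */
bool padv_dcod(const dcod_ctrl_in_t *in)
{
    assert(in != NULL);
    /* Normal operation: not stepping and decode can hand off its output. */
    bool normal = !in->stepping && in->dcod_free_i &&
                  (in->dcod_empty_i || in->ena_dcod);
    /* Debug single stepping path. */
    bool single_step = in->stepping && in->dcod_empty_i && in->pstep0;
    return in->padv_all && !in->wrbk_rfdx_we_i && (normal || single_step);
}
```

Written this way it is easier to see that a pending second register write back (wrbk_rfdx_we_i) blocks decode regardless of the other inputs.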
The padv_exec_o signal to the order manager enqueues decoded ops into the Order Control Buffer (OCB). The OCB is a FIFO queue which keeps track of the order in which instructions have been decoded.
The padv_*_rsrvs_o signal wired to one of the reservation stations enables registering of an instruction into that reservation station. There is one padv_*_rsrvs_o signal and reservation station per execution unit. They are:
padv_1clk_rsrvs_o
- to the reservation station for single clock ALU operations
padv_muldiv_rsrvs_o
- to the reservation station for multiply and divide operations. Divide operations take 32 clock cycles. Multiply operations execute in 2 clock cycles.
padv_fpxx_rsrvs_o
- to the reservation station for the floating point unit (FPU). There are multiple FPU operations including multiply, divide, add, subtract, comparison and conversion between integer and floating point.
padv_lsu_rsrvs_o
- to the reservation station for the load store unit. The load store unit will load data from memory to registers or store data from registers to memory. It interacts with the data cache and data MMU.
Both padv_exec_o and padv_*_rsrvs_o are dependent on the execution units being ready and both signals will go high or low at the same time.
This is represented by the assign
in or1k_marocchino_ctrl.v:
// Advance EXECUTE (push OCB & clean up DECODE)
assign padv_exec_o = ena_exec & padv_an_exec_unit;
// Per execution unit (or reservation station) advance
assign padv_1clk_rsrvs_o = ena_1clk_rsrvs & padv_an_exec_unit;
assign padv_muldiv_rsrvs_o = ena_muldiv_rsrvs & padv_an_exec_unit;
assign padv_fpxx_rsrvs_o = ena_fpxx_rsrvs & padv_an_exec_unit;
assign padv_lsu_rsrvs_o = ena_lsu_rsrvs & padv_an_exec_unit;
The padv_wrbk_o
signal to the execution units will go active when exec_valid_i
is active
and will finalize writing back the execution results. The padv_wrbk_o
signal to
the order manager will retire the oldest instruction from the OCB.
This is represented by this assign
in or1k_marocchino_ctrl.v:
// Advance Write Back latches
wire exec_valid_l = exec_valid_i | op_mXspr_valid;
assign padv_wrbk_o = exec_valid_l & padv_all & (~wrbk_rfdx_we_i) & ((~stepping) | pstep[2]);
An astute reader would notice that there are no pipeline advance (padv_*
)
signals to each of the execution units. This is where the order manager comes
in.
The order manager ensures that instructions are retired in the same order that they are decoded. It contains a register allocation table (RAT) for hazard resolution and the OCB. We will go into more depth on the RAT in the next article, but for now let’s look at how the order manager interacts with the instruction pipeline flow.
As the OCB is a FIFO queue the output port presents the oldest non retired
instruction to the order manager. The exec_valid_o
signal to the control unit
will go active when the *_valid_i
signal from the execution unit and the OCB
output instruction match.
This is represented by this assign
in or1k_marocchino_oman.v:
assign exec_valid_o =
(op_1clk_valid_l & ~ocbo[OCBTC_JUMP_OR_BRANCH_POS]) | // EXEC VALID: but wait attributes for l.jal/ljalr
(exec_jb_attr_valid & ocbo[OCBTC_JUMP_OR_BRANCH_POS]) | // EXEC VALID
(div_valid_i & ocbo[OCBTC_OP_DIV_POS]) | // EXEC VALID
(mul_valid_i & ocbo[OCBTC_OP_MUL_POS]) | // EXEC VALID
(fpxx_arith_valid_i & ocbo[OCBTC_OP_FPXX_ARITH_POS]) | // EXEC VALID
(fpxx_cmp_valid_i & ocbo[OCBTC_OP_FPXX_CMP_POS]) | // EXEC VALID
(lsu_valid_i & ocbo[OCBTC_OP_LS_POS]) | // EXEC VALID
ocbo[OCBTC_OP_PUSH_WRBK_POS]; // EXEC VALID
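The key point of this expression is that a unit reporting "valid" is not enough; it must be the unit the oldest instruction is waiting on. A small C restatement makes this easy to experiment with. The two structs are hypothetical groupings of the signals in the assign above, with one boolean per ocbo[...] bit or *_valid input.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of the control flags of the oldest OCB entry. */
typedef struct {
    bool jump_or_branch;
    bool op_div;
    bool op_mul;
    bool op_fpxx_arith;
    bool op_fpxx_cmp;
    bool op_ls;
    bool op_push_wrbk;
} ocbo_flags_t;

/* Hypothetical bundle of the "result is ready" inputs. */
typedef struct {
    bool op_1clk_valid_l;
    bool exec_jb_attr_valid;
    bool div_valid;
    bool mul_valid;
    bool fpxx_arith_valid;
    bool fpxx_cmp_valid;
    bool lsu_valid;
} unit_valid_t;

/* Transcription of the exec_valid_o assign into boolean logic. */
bool exec_valid(const ocbo_flags_t *ocbo, const unit_valid_t *v)
{
    assert(ocbo != NULL && v != NULL);
    return (v->op_1clk_valid_l && !ocbo->jump_or_branch) ||
           (v->exec_jb_attr_valid && ocbo->jump_or_branch) ||
           (v->div_valid && ocbo->op_div) ||
           (v->mul_valid && ocbo->op_mul) ||
           (v->fpxx_arith_valid && ocbo->op_fpxx_arith) ||
           (v->fpxx_cmp_valid && ocbo->op_fpxx_cmp) ||
           (v->lsu_valid && ocbo->op_ls) ||
           ocbo->op_push_wrbk;
}
```

For example, if the oldest instruction is a divide and only the multiplier has finished, the function returns false; the younger multiply has to wait until the divide completes.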
The OCB helps the order manager ensure that instructions are retired in the same order that they are decoded.
The grant_wrbk_*_o
signal to the execution units will go active depending on
the OCB output port instruction.
This is represented by this assign
in
or1k_marocchino_oman.v:
// Grant Write-Back-access to units
assign grant_wrbk_to_1clk_o = ocbo[OCBTC_OP_1CLK_POS];
assign grant_wrbk_to_div_o = ocbo[OCBTC_OP_DIV_POS];
assign grant_wrbk_to_mul_o = ocbo[OCBTC_OP_MUL_POS];
assign grant_wrbk_to_fpxx_arith_o = ocbo[OCBTC_OP_FPXX_ARITH_POS];
assign grant_wrbk_to_lsu_o = ocbo[OCBTC_OP_LS_POS];
assign grant_wrbk_to_fpxx_cmp_o = ocbo[OCBTC_OP_FPXX_CMP_POS];
The grant_wrbk_*_o signal along with the padv_wrbk_o signal tells an execution unit that it can write back its result to the register file / RAT / reservation station.
The wrbk_rfd1_we_o
and wrbk_rfd2_we_o
signals enable writeback
to the register file. There are 2 signals because some 64-bit FPU instructions
require writing results to 2 registers. When there is just a single register to write
only signal wrbk_rfd1_we_o
is used. When there are two results, writing happens
in 2-stages, first wrbk_rfd1_we_o
signals the write back to register 1 then in
the next cycle wrbk_rfd2_we_o
signals the write back to register 2.
The wrbk_rfdx_we_o
signal to the control unit stalls the pipeline to allow
the second write to complete.
This is represented by this logic in or1k_marocchino_oman.v:
// instuction requests write-back
wire exec_rfd1_we = ocbo[OCBTA_RFD1_WRBK_POS];
wire exec_rfd2_we = ocbo[OCBTA_RFD2_WRBK_POS];
...
// 1-clock Write-Back-pulses
// # for D1
always @(posedge cpu_clk) begin
if (padv_wrbk_i)
wrbk_rfd1_we_o <= exec_rfd1_we;
else
wrbk_rfd1_we_o <= 1'b0;
end // @clock
// # for D2 we delay WriteBack for 1-clock
// to split write into RF from D1
always @(posedge cpu_clk) begin
if (cpu_rst) begin
wrbk_rfdx_we_o <= 1'b0; // flush
wrbk_rfd2_we_o <= 1'b0; // flush
end
else if (wrbk_rfd2_we_o) begin
wrbk_rfdx_we_o <= 1'b0; // D2 write done
wrbk_rfd2_we_o <= 1'b0; // D2 write done
end
else if (wrbk_rfdx_we_o)
wrbk_rfd2_we_o <= 1'b1; // do D2 write
else if (padv_wrbk_i)
wrbk_rfdx_we_o <= exec_rfd2_we;
end // @clock
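To follow the two stage write back cycle by cycle, the two always blocks can be modelled in C as a function that computes the next state on each clock edge. This is a sketch for experimentation only; the struct and function names are hypothetical. Next state values are computed from the current state, which mimics Verilog's non-blocking assignments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the three write back flags. */
typedef struct {
    bool wrbk_rfd1_we; /* write enable pulse for destination register 1 */
    bool wrbk_rfd2_we; /* write enable pulse for destination register 2 */
    bool wrbk_rfdx_we; /* "second write pending" flag, stalls the pipeline */
} wrbk_state_t;

/* One rising clock edge of the write back pulse logic. */
void wrbk_clock(wrbk_state_t *s, bool cpu_rst, bool padv_wrbk,
                bool exec_rfd1_we, bool exec_rfd2_we)
{
    assert(s != NULL);
    wrbk_state_t next = *s;

    /* D1: a one clock pulse whenever write back advances. */
    next.wrbk_rfd1_we = padv_wrbk ? exec_rfd1_we : false;

    /* D2: delayed by one clock relative to D1. */
    if (cpu_rst) {
        next.wrbk_rfdx_we = false;
        next.wrbk_rfd2_we = false;
    } else if (s->wrbk_rfd2_we) {
        next.wrbk_rfdx_we = false; /* D2 write done */
        next.wrbk_rfd2_we = false;
    } else if (s->wrbk_rfdx_we) {
        next.wrbk_rfd2_we = true;  /* do D2 write */
    } else if (padv_wrbk) {
        next.wrbk_rfdx_we = exec_rfd2_we;
    }

    *s = next;
}
```

Stepping this model for an instruction that writes two registers gives a D1 pulse after the first edge, a D2 pulse after the second edge, and all flags low again after the third, with wrbk_rfdx_we held high for the two cycles in between.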
The padv_wrbk_i signal from the control unit to the order manager also takes care of dequeuing the oldest instruction from the OCB. With that and the writebacks completed the instruction is said to be retired.
The Marocchino instruction pipeline is not very complicated while still being full featured, including caches, MMU and FPU. We have mentioned a few structures such as the reservation station and RAT which we haven’t gone into much detail on. These help implement out-of-order execution using Tomasulo’s algorithm. In the next article we will go into more detail on these components and how Tomasulo works.
The Marocchino is an advanced new OpenRISC soft core CPU. However, not many have heard about it. Let’s try to change this.
In this series of posts I would like to take the reader over some key parts of the Marocchino and its architecture.
In the beginning of 2019 I had finished the OpenRISC GCC port and was working on building up toolchain test and verification support using the mor1kx soft core. Part of the mor1kx’s feature set is the ability to swap out different pipeline arrangements to configure the CPU for performance or resource usage. Each pipeline is named after an Italian coffee, we have Cappuccino, Espresso and Pronto-Espresso. One of these pipelines which has been under development but never integrated into the main branch was the Marocchino. I had never paid much attention to the Marocchino pipeline.
Around the same time the author of Marocchino sent a mail mentioning he could not use the new GCC port as it was missing Single and Double precision FPU support. Using the new verification pipeline I set out to start working on adding single and double precision floating point support to the OpenRISC gcc port. My verification target would be the Marocchino pipeline.
After some initial investigation I found this CPU was much more than a new pipeline for the mor1kx with a special FPU. The Marocchino has morphed into a complete re-implementation of the OpenRISC 1000 spec. Seeing this we split the Marocchino out to its own repository where it could grow on its own, of course maintaining the Italian coffee name.
With features like out-of-order execution using Tomasulo’s algorithm, 64-bit FPU operations using register pairs, MMU, Instruction caches, Data caches, Multicore support and a clean verilog code base the Marocchino is advanced to say the least.
I would claim Marocchino is one of the most advanced open source CPU cores implementing out-of-order execution. One of its friendly rivals is the BOOM core, a 64-bit RISC-V implementation written in Chisel. In contrast, Marocchino has a similar feature set but is a 32-bit OpenRISC core written in verilog, making it approachable. If you know of more out-of-order execution open source cores I would love to know, and I can update this list.
Let’s dive in.
We can quickly get started with Marocchino as we use FuseSoC, which makes bringing together and running verilog libraries, or cores, a snap.
The Marocchino development environment requires Linux, you can use a VM, docker or your own machine. I personally use fedora and maintain several Debian docker images for continuous integration and package builds.
The environment we install allows for simulating verilog using
icarus or
verilator. It also allows synthesis
and programming to an FPGA using EDA tools. Here we will cover only simulation.
For details on programming an SoC to an FPGA using FuseSoC see the build
and
pgm
commands in FuseSoC documentation.
Note Below we use /tmp/openrisc
to install software and work on code, but
you can use any path you like.
To get started let’s setup FuseSoC and install the required cores into the FuseSoC library.
Here we clone the git repositories used for Marocchino development into /tmp/openrisc/src. Feel free to have a look and, if you feel adventurous, make some changes. The repos include:
sudo pip install fusesoc
mkdir -p /tmp/openrisc/src
cd /tmp/openrisc/src
git clone https://github.com/stffrdhrn/mor1kx-generic.git
git clone https://github.com/openrisc/or1k_marocchino.git
# As yourself
fusesoc init -y
fusesoc library add intgen https://github.com/stffrdhrn/intgen.git
fusesoc library add elf-loader https://github.com/fusesoc/elf-loader.git
fusesoc library add mor1kx-generic /tmp/openrisc/src/mor1kx-generic
fusesoc library add or1k_marocchino /tmp/openrisc/src/or1k_marocchino
Next we will need to install our verilog compiler/simulator Icarus Verilog (iverilog).
mkdir -p /tmp/openrisc/iverilog
cd /tmp/openrisc/iverilog
git clone https://github.com/steveicarus/iverilog.git .
sh autoconf.sh
./configure --prefix=/tmp/openrisc/local
make
make install
export PATH=/tmp/openrisc/local/bin:$PATH
If you want to get started even faster we can use the librecores-ci docker image, which includes iverilog, verilator and fusesoc.
This allows us to skip the Setting up Icarus Verilog and part of the Setting up FuseSoC step above.
This can be done with the following.
docker pull librecores/librecores-ci
docker run -it --rm docker.io/librecores/librecores-ci
Next we install the GCC toolchain which is used for compiling C and OpenRISC assembly programs. The produced elf binaries can be loaded and run on the CPU core. Pull the latest toolchain from my gcc releases page. Here we use the newlib (baremetal) toolchain which allows compiling programs which run directly on the processor. For details on other toolchains available see the toolchain summary on the OpenRISC homepage.
mkdir -p /tmp/openrisc
cd /tmp/openrisc
wget https://github.com/stffrdhrn/gcc/releases/download/or1k-9.1.1-20190507/or1k-elf-9.1.1-20190507.tar.xz
tar -xf or1k-elf-9.1.1-20190507.tar.xz
export PATH=/tmp/openrisc/or1k-elf/bin:$PATH
The development environment should now be set up.
To check everything works you should be able to run the following commands.
To ensure the toolchain is installed and working we can run the following:
$ or1k-elf-gcc --version
or1k-elf-gcc (GCC) 9.1.1 20190503
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
To ensure FuseSoC and the required cores are installed we can run this:
$ fusesoc core-info mor1kx-generic
CORE INFO
Name: ::mor1kx-generic:1.1
Core root: /root/.local/share/fusesoc/mor1kx-generic
Targets:
marocchino_tb
mor1kx_tb
$ fusesoc list-cores
...
::intgen:0 : local
::or1k_marocchino:5.0-r3 : local
...
The most simple program you can run on OpenRISC is a simple assembly program. When running, everything in the below program is loaded into the OpenRISC memory, nothing more nothing less.
To compile, run and trace a simple assembly program we do the following.
Create a source file openrisc-asm.s as follows:
/* Exception vectors section. */
.section .vectors, "ax"
/* 0x100: OpenRISC RESET reset vector. */
.org 0x100
/* Jump to program initialisation code */
.global _main
l.movhi r4, hi(_main)
l.ori r4, r4, lo(_main)
l.jr r4
l.nop
/* Main executable section. */
.section .text
.global _main
_main:
l.addi r1,r0,0x1
l.addi r2,r1,0x2
l.addi r3,r2,0x4
l.addi r4,r3,0x8
l.addi r5,r4,0x10
l.addi r6,r5,0x20
l.addi r7,r6,0x40
l.addi r8,r7,0x80
l.addi r9,r8,0x100
l.addi r10,r9,0x200
l.addi r11,r10,0x400
l.addi r12,r11,0x800
l.addi r13,r12,0x1000
l.addi r14,r13,0x2000
l.addi r15,r14,0x4000
l.addi r16,r15,0x8000
l.sub r31,r0,r1
l.sub r30,r31,r2
l.sub r29,r30,r3
l.sub r28,r29,r4
l.sub r27,r28,r5
l.sub r26,r27,r6
l.sub r25,r26,r7
l.sub r24,r25,r8
l.sub r23,r24,r9
l.sub r22,r23,r10
l.sub r21,r22,r11
l.sub r20,r21,r12
l.sub r19,r20,r13
l.sub r18,r19,r14
l.sub r17,r18,r15
l.sub r16,r17,r16
/* Set sim return code to 0 - meaning OK. */
l.movhi r3, 0x0
l.nop 0x1 /* Exit simulation */
l.nop
l.nop
To compile this we use or1k-elf-gcc
. Note the -nostartfiles
option, this is
useful for compiling assembly when we don’t need
newlib/libgloss
provided “startup” sections linked into the binary as we provide them ourselves.
mkdir -p /tmp/openrisc/src
cd /tmp/openrisc/src
vim openrisc-asm.s
or1k-elf-gcc -nostartfiles openrisc-asm.s -o openrisc-asm
Finally, to run the program on the Marocchino we run fusesoc with the below options.
run
- specifies that we want to run a simulation.
--target
- is a FuseSoC option for run specifying which of the mor1kx-generic targets we want to run, here we specify marocchino_tb, the Marocchino test bench.
--tool
- is a sub option for --target specifying that we want to run the marocchino_tb target using icarus.
::mor1kx-generic:1.1
- specifies which system we want to run. Systems represent an SoC that can be simulated or synthesized. You can see a list of systems using list-cores.
--elf_load
- is a mor1kx-generic specific option which specifies an elf binary that will be loaded into memory before the simulation starts.
--trace_enable
- is a mor1kx-generic specific option enabling tracing. When specified the simulator will output a trace file to {fusesoc-builds}/mor1kx-generic_1.1/marocchino_tb-icarus/marocchino-trace.log (see log).
--trace_to_screen
- is a mor1kx-generic specific option enabling tracing of instruction execution to the console as we can see below.
--vcd
- is a mor1kx-generic option instructing icarus to output a vcd file, a trace file which can be loaded with gtkwave.
fusesoc run --target marocchino_tb --tool icarus ::mor1kx-generic:1.1 \
--elf_load ./openrisc-asm --trace_enable --trace_to_screen --vcd
VCD info: dumpfile testlog.vcd opened for output.
Program header 0: addr 0x00000000, size 0x000001A0
elf-loader: /tmp/openrisc/src/openrisc-asm was loaded
Loading 104 words
0 : Illegal Wishbone B3 cycle type (xxx)
S 00000100: 18800000 l.movhi r4,0x0000 r4 = 00000000 flag: 0
S 00000104: a8840110 l.ori r4,r4,0x0110 r4 = 00000110 flag: 0
S 00000108: 44002000 l.jr r4 flag: 0
S 0000010c: 15000000 l.nop 0x0000 flag: 0
S 00000110: 9c200001 l.addi r1,r0,0x0001 r1 = 00000001 flag: 0
S 00000114: 9c410002 l.addi r2,r1,0x0002 r2 = 00000003 flag: 0
S 00000118: 9c620004 l.addi r3,r2,0x0004 r3 = 00000007 flag: 0
S 0000011c: 9c830008 l.addi r4,r3,0x0008 r4 = 0000000f flag: 0
S 00000120: 9ca40010 l.addi r5,r4,0x0010 r5 = 0000001f flag: 0
S 00000124: 9cc50020 l.addi r6,r5,0x0020 r6 = 0000003f flag: 0
S 00000128: 9ce60040 l.addi r7,r6,0x0040 r7 = 0000007f flag: 0
S 0000012c: 9d070080 l.addi r8,r7,0x0080 r8 = 000000ff flag: 0
S 00000130: 9d280100 l.addi r9,r8,0x0100 r9 = 000001ff flag: 0
S 00000134: 9d490200 l.addi r10,r9,0x0200 r10 = 000003ff flag: 0
S 00000138: 9d6a0400 l.addi r11,r10,0x0400 r11 = 000007ff flag: 0
S 0000013c: 9d8b0800 l.addi r12,r11,0x0800 r12 = 00000fff flag: 0
S 00000140: 9dac1000 l.addi r13,r12,0x1000 r13 = 00001fff flag: 0
S 00000144: 9dcd2000 l.addi r14,r13,0x2000 r14 = 00003fff flag: 0
S 00000148: 9dee4000 l.addi r15,r14,0x4000 r15 = 00007fff flag: 0
S 0000014c: 9e0f8000 l.addi r16,r15,0x8000 r16 = ffffffff flag: 0
S 00000150: e3e00802 l.sub r31,r0,r1 r31 = ffffffff flag: 0
S 00000154: e3df1002 l.sub r30,r31,r2 r30 = fffffffc flag: 0
S 00000158: e3be1802 l.sub r29,r30,r3 r29 = fffffff5 flag: 0
S 0000015c: e39d2002 l.sub r28,r29,r4 r28 = ffffffe6 flag: 0
S 00000160: e37c2802 l.sub r27,r28,r5 r27 = ffffffc7 flag: 0
S 00000164: e35b3002 l.sub r26,r27,r6 r26 = ffffff88 flag: 0
S 00000168: e33a3802 l.sub r25,r26,r7 r25 = ffffff09 flag: 0
S 0000016c: e3194002 l.sub r24,r25,r8 r24 = fffffe0a flag: 0
S 00000170: e2f84802 l.sub r23,r24,r9 r23 = fffffc0b flag: 0
S 00000174: e2d75002 l.sub r22,r23,r10 r22 = fffff80c flag: 0
S 00000178: e2b65802 l.sub r21,r22,r11 r21 = fffff00d flag: 0
S 0000017c: e2956002 l.sub r20,r21,r12 r20 = ffffe00e flag: 0
S 00000180: e2746802 l.sub r19,r20,r13 r19 = ffffc00f flag: 0
S 00000184: e2537002 l.sub r18,r19,r14 r18 = ffff8010 flag: 0
S 00000188: e2327802 l.sub r17,r18,r15 r17 = ffff0011 flag: 0
S 0000018c: e2118002 l.sub r16,r17,r16 r16 = ffff0012 flag: 0
S 00000190: 18600000 l.movhi r3,0x0000 r3 = 00000000 flag: 0
S 00000194: 15000001 l.nop 0x0001 flag: 0
exit(0x00000000);
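The register values in the trace can be checked by hand, or with a few lines of C run on the host. The sketch below recomputes the expected register file for the assembly program above. It assumes r0 reads as zero, as the program itself does, and models the 16-bit immediate of l.addi as sign extended, which is why r16 ends up as ffffffff after adding 0x8000 to 00007fff. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sign extend a 16-bit immediate to 32 bits, as l.addi does. */
uint32_t sext16(uint16_t imm)
{
    return (imm & 0x8000u) ? ((uint32_t)imm | 0xffff0000u) : (uint32_t)imm;
}

/* Fill r[0..31] with the values expected after the program has run. */
void expected_regs(uint32_t r[32])
{
    assert(r != NULL);
    for (int i = 0; i < 32; i++)
        r[i] = 0;

    /* l.addi r1,r0,0x1 ... l.addi r16,r15,0x8000 */
    for (int n = 1; n <= 16; n++)
        r[n] = r[n - 1] + sext16((uint16_t)(1u << (n - 1)));

    /* l.sub r31,r0,r1 ... l.sub r16,r17,r16 (unsigned wrap around) */
    for (int n = 1; n <= 16; n++)
        r[32 - n] = r[(33 - n) % 32] - r[n];
}
```

After calling expected_regs the array matches the trace: for example r[15] is 0x00007fff, r[31] is 0xffffffff and the final r[16] is 0xffff0012.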
If we look at the VCD trace file in gtkwave we can see the below trace. The trace file is also helpful for navigating through the various SoC components as it captures all wire transitions.
Take note that we are not seeing very good performance, this is because caching is not enabled and the CPU takes several cycles to read an instruction from memory. This means we are not seeing one instruction executed per cycle. Enabling caches would fix this.
When we compile a C program there is a lot more happening behind the scenes. The linker will link in an entire runtime that, along with the standard libc functions, will on OpenRISC set up interrupt vectors, enable caches, set up memory sections for variables, run static initializers and finally run our program.
The program:
/* Simple c program, doing some math. */
#include <stdio.h>
int a [] = { 1, 2, 3, 4, 5 };
int madd(int a, int b, int c) {
return a * b + c;
}
int main() {
int res;
for (int i = 0; i < 5; i++) {
res = madd(0 , a[1], a[i]);
res = madd(res, a[2], a[i]);
res = madd(res, a[2], a[i]);
res = madd(res, a[3], a[i]);
res = madd(res, a[4], a[i]);
}
printf("Result is = %d\n", res);
return 0;
}
To compile we use the below or1k-elf-gcc
command. Notice, we do not specify
-nostartfiles
here as we do want newlib to link in all the start routines to
provide a full c runtime. We do specify the following arguments to tell GCC a
bit about our OpenRISC cpu. If these -m
options are not specified GCC will
link in code using the libgcc library to emulate these instructions.
-mhard-mul
- indicates that our cpu target supports multiply instructions.
-mhard-div
- indicates that our cpu target supports divide instructions.
-mhard-float
- indicates that our cpu target supports FPU instructions.
-mdouble-float
- indicates that our cpu target supports the new double precision floating point instructions using register pairs (orfpx64a32).
To see a full list of options for OpenRISC read the GCC manual or see the output of or1k-elf-gcc --target-help.
or1k-elf-gcc -Wall -O2 -mhard-mul -mhard-div -mhard-float -mdouble-float -mror \
openrisc-c.c -o openrisc-c
If we want to inspect the assembly to ensure we did generate multiply instructions
we can use the trusty objdump
utility. As per below, yes, we can see multiply
instructions.
or1k-elf-objdump -d openrisc-c | grep -A10 main
00002000 <main>:
2000: 1a a0 00 01 l.movhi r21,0x1
2004: 9c 21 ff f8 l.addi r1,r1,-8
2008: 9e b5 40 3c l.addi r21,r21,16444
200c: 86 75 00 10 l.lwz r19,16(r21)
2010: 86 f5 00 08 l.lwz r23,8(r21)
2014: e2 37 9b 06 l.mul r17,r23,r19
2018: e2 31 98 00 l.add r17,r17,r19
201c: e2 31 bb 06 l.mul r17,r17,r23
2020: e2 31 98 00 l.add r17,r17,r19
2024: 86 b5 00 0c l.lwz r21,12(r21)
...
Similar to running the assembly example we can run this with fusesoc
as follows.
fusesoc run --target marocchino_tb --tool icarus ::mor1kx-generic:1.1 \
--elf_load ./openrisc-c --trace_enable --vcd
...
Result is = 1330
Now, if we look at the VCD trace file we can see the below trace. Notice that with the C program we can observe better pipelining where an instruction can be executed every clock cycle. This is because caches have been initialized as part of the newlib C runtime initialization, great!
In this article we went through a quick introduction to the Marocchino development environment. The development environment would actually be similar when developing any OpenRISC core.
This environment will allow the reader to follow along in future Marocchino articles where we go deeper into the architecture. In this environment you can now:
In the next article we will look more into how the above programs actually flow through the Marocchino pipeline. Stay tuned.