22 Aug 2023
In the last article we introduced the
OpenRISC glibc FPU port and the effort required to get user space FPU support
into OpenRISC linux user applications. We explained how the FPU port is a
fullstack project covering:
- Architecture Specification
- Simulators and CPU implementations
- Linux Kernel support
- GCC Instructions and Soft FPU
- Binutils/GDB Debugging Support
- glibc support
In this entry we will cover updating simulators and CPU implementations to support
the architecture changes called for in the previous article:
- Allowing usermode programs to update the FPCSR register
- Detecting tininess before rounding
Simulator Updates
The simulators used for testing OpenRISC software without hardware are QEMU
and or1ksim. Both needed to be updated to conform to the specification
updates discussed above.
Or1ksim Updates
The OpenRISC architecture simulator or1ksim has been updated with the single patch:
cpu: Allow FPCSR to be read/written in user mode.
The softfloat FPU implementation was already configured to detect tininess before
rounding.
If you are interested you can download and run the simulator and test this out
with a docker image pulled from docker hub using the following:
# using podman instead of docker, you can use docker here too
podman pull stffrdhrn/or1k-sim-env:latest
podman run -it --rm stffrdhrn/or1k-sim-env:latest
root@9a4a52eec8ee:/tmp# or1k-elf-sim -version
Seeding random generator with value 0x4a3c2bbd
OpenRISC 1000 Architectural Simulator, version 2023-08-20
This starts up an environment which has access to the OpenRISC architecture
simulator and a GNU compiler toolchain. While still in the container we can run a
quick test using the FPU as follows:
# Create a test program using OpenRISC FPU
cat > fpee.c <<EOF
#include <float.h>
#include <stdio.h>
#include <or1k-sprs.h>
#include <or1k-support.h>
static void enter_user_mode() {
int32_t sr = or1k_mfspr(OR1K_SPR_SYS_SR_ADDR);
sr &= ~OR1K_SPR_SYS_SR_SM_MASK;
or1k_mtspr(OR1K_SPR_SYS_SR_ADDR, sr);
}
static void enable_fpu_exceptions() {
unsigned long fpcsr = OR1K_SPR_SYS_FPCSR_FPEE_MASK;
or1k_mtspr(OR1K_SPR_SYS_FPCSR_ADDR, fpcsr);
}
static void fpe_handler() {
printf("Got FPU Exception, PC: 0x%lx\n", or1k_mfspr(OR1K_SPR_SYS_EPCR_BASE));
}
int main() {
float result;
or1k_exception_handler_add(0xd, fpe_handler);
#ifdef USER_MODE
/* Note, printf here also allocates some memory allowing user mode runtime to
work. */
printf("Enabling user mode\n");
enter_user_mode();
#endif
enable_fpu_exceptions();
printf("Exceptions enabled, now DIV 3.14 / 0!\n");
result = 3.14f / 0.0f;
/* Verify we see infinity. */
printf("Result: %f\n", result);
/* Verify we see DZF set. */
printf("FPCSR: %x\n", or1k_mfspr(OR1K_SPR_SYS_FPCSR_ADDR));
#ifdef USER_MODE
asm volatile("l.movhi r3, 0; l.nop 1"); /* Exit sim, now */
#endif
return 0;
}
EOF
# Compile the program
or1k-elf-gcc -g -O2 -mhard-float fpee.c -o fpee
or1k-elf-sim -f /opt/or1k/sim.cfg ./fpee
# Expected results
# Program Header: PT_LOAD, vaddr: 0x00000000, paddr: 0x0 offset: 0x00002000, filesz: 0x000065ab, memsz: 0x000065ab
# Program Header: PT_LOAD, vaddr: 0x000085ac, paddr: 0x85ac offset: 0x000085ac, filesz: 0x000000c8, memsz: 0x0000046c
# WARNING: sim_init: Debug module not enabled, cannot start remote service to GDB
# Exceptions enabled, now DIV 3.14 / 0!
# Got FPU Exception, PC: 0x2068
# Result: f
# FPCSR: 801
# Compile the program to run in USER_MODE
or1k-elf-gcc -g -O2 -mhard-float -DUSER_MODE fpee.c -o fpee
or1k-elf-sim -f /opt/or1k/sim.cfg ./fpee
# Expected results with USER_MODE
# Program Header: PT_LOAD, vaddr: 0x00000000, paddr: 0x0 offset: 0x00002000, filesz: 0x000065ab, memsz: 0x000065ab
# Program Header: PT_LOAD, vaddr: 0x000085ac, paddr: 0x85ac offset: 0x000085ac, filesz: 0x000000c8, memsz: 0x0000046c
# WARNING: sim_init: Debug module not enabled, cannot start remote service to GDB
# Enabling user mode
# Exceptions enabled, now DIV 3.14 / 0!
# Got FPU Exception, PC: 0x2068
# Result: f
# FPCSR: 801
# exit(0)
In the above we can see how to compile a simple FPU test program and run
it on or1ksim. The program sets up an FPU exception handler, enables exceptions,
then does a divide by zero to produce an exception. This program uses the
OpenRISC newlib (baremetal) toolchain to
compile a program that can run directly on the simulator, as opposed to a
program running in an OS on a simulator or hardware.
Note that newlib programs normally expect to run in supervisor mode. When
our program switches to user mode we need to take some precautions to ensure it
can run correctly. As noted in the comments, when allocating memory and exiting,
the newlib runtime will usually do things like disabling/enabling interrupts, which
will fail when running in user mode.
QEMU Updates
The QEMU update was done in my
OpenRISC user space FPCSR
qemu patch series. The series was merged for the
qemu 8.1 release.
The updates were split into three changes:
- Allowing FPCSR access in user mode.
- Properly set the exception PC address on floating point exceptions.
- Configuring the QEMU softfloat implementation to perform tininess check
before rounding.
QEMU Patch 1
The first patch to allow FPCSR access in user mode was trivial, but required some
code structure changes making the patch look bigger than it really was.
QEMU Patch 2
The next patch to properly set the exception PC address fixed a long existing
bug where the EPCR was not properly updated after FPU exceptions. Up until now
OpenRISC userspace did not support FPU instructions and this code path had not
been tested.
To explain why this fix is important let us look at the EPCR and what it is used for
in a bit more detail.
In general, when an exception occurs an OpenRISC CPU will store the program counter (PC)
of the instruction that caused the exception into the exception program counter register
(EPCR). Floating point exceptions are a special case in that the EPCR is
actually set to the next instruction to be executed. This is to avoid looping.
When the linux kernel handles a floating point exception it follows the path
0xd00 > fpe_trap_handler > do_fpe_trap. This will setup a
signal to be delivered to the user process.
The Linux OS uses the EPCR to report the exception instruction address to
userspace via a signal. We can see this being done in do_fpe_trap, shown
below:
asmlinkage void do_fpe_trap(struct pt_regs *regs, unsigned long address)
{
int code = FPE_FLTUNK;
unsigned long fpcsr = regs->fpcsr;
if (fpcsr & SPR_FPCSR_IVF)
code = FPE_FLTINV;
else if (fpcsr & SPR_FPCSR_OVF)
code = FPE_FLTOVF;
else if (fpcsr & SPR_FPCSR_UNF)
code = FPE_FLTUND;
else if (fpcsr & SPR_FPCSR_DZF)
code = FPE_FLTDIV;
else if (fpcsr & SPR_FPCSR_IXF)
code = FPE_FLTRES;
/* Clear all flags */
regs->fpcsr &= ~SPR_FPCSR_ALLF;
force_sig_fault(SIGFPE, code, (void __user *)regs->pc);
}
Here we see the exception becomes a SIGFPE signal and the exception address in
regs->pc is passed to force_sig_fault. The PC will be used to set the
si_addr field of the siginfo_t structure.
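From user space, the si_code and si_addr values prepared by do_fpe_trap can be read by a handler installed with SA_SIGINFO. The following is a minimal sketch using only standard POSIX calls; the function and variable names (fpe_code_name, on_sigfpe, install_sigfpe_handler, last_fpe_code, last_fpe_addr) are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* State recorded by the handler. A real handler should restrict itself to
   async-signal-safe operations; storing to these variables is a sketch. */
static volatile sig_atomic_t last_fpe_code;
static void *volatile last_fpe_addr;

/* Translate the si_code values used by do_fpe_trap into readable text. */
static const char *fpe_code_name(int si_code)
{
    switch (si_code) {
    case FPE_FLTINV: return "invalid operation";
    case FPE_FLTOVF: return "overflow";
    case FPE_FLTUND: return "underflow";
    case FPE_FLTDIV: return "divide by zero";
    case FPE_FLTRES: return "inexact result";
    default:         return "unknown";
    }
}

/* SA_SIGINFO style handler: si_addr carries the PC passed to force_sig_fault. */
static void on_sigfpe(int sig, siginfo_t *info, void *ucontext)
{
    (void) sig;
    (void) ucontext;
    last_fpe_code = info->si_code;
    last_fpe_addr = info->si_addr;
}

/* Install the handler; returns 0 on success, -1 on failure like sigaction. */
static int install_sigfpe_handler(void)
{
    struct sigaction sa;

    sa.sa_sigaction = on_sigfpe;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGFPE, &sa, NULL);
}
```

After install_sigfpe_handler() succeeds, a program with FPU exceptions enabled would find the decoded reason in fpe_code_name(last_fpe_code) once the signal has been delivered.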
Next upon return from kernel space to user space the path is do_fpe_trap >
_fpe_trap_handler > ret_from_exception > resume_userspace >
work_pending > do_work_pending > restore_all.
Inside of do_work_pending is where the signal handling is done. I explain a bit
about this in the article Unwinding a Bug - How C++ Exceptions Work.
In restore_all we see EPCR is returned to when exception handling is
complete. A snippet of this code is shown below:
#define RESTORE_ALL \
DISABLE_INTERRUPTS(r3,r4) ;\
l.lwz r3,PT_PC(r1) ;\
l.mtspr r0,r3,SPR_EPCR_BASE ;\
l.lwz r3,PT_SR(r1) ;\
l.mtspr r0,r3,SPR_ESR_BASE ;\
l.lwz r3,PT_FPCSR(r1) ;\
l.mtspr r0,r3,SPR_FPCSR ;\
l.lwz r2,PT_GPR2(r1) ;\
l.lwz r3,PT_GPR3(r1) ;\
l.lwz r4,PT_GPR4(r1) ;\
l.lwz r5,PT_GPR5(r1) ;\
l.lwz r6,PT_GPR6(r1) ;\
l.lwz r7,PT_GPR7(r1) ;\
l.lwz r8,PT_GPR8(r1) ;\
l.lwz r9,PT_GPR9(r1) ;\
l.lwz r10,PT_GPR10(r1) ;\
l.lwz r11,PT_GPR11(r1) ;\
l.lwz r12,PT_GPR12(r1) ;\
l.lwz r13,PT_GPR13(r1) ;\
l.lwz r14,PT_GPR14(r1) ;\
l.lwz r15,PT_GPR15(r1) ;\
l.lwz r16,PT_GPR16(r1) ;\
l.lwz r17,PT_GPR17(r1) ;\
l.lwz r18,PT_GPR18(r1) ;\
l.lwz r19,PT_GPR19(r1) ;\
l.lwz r20,PT_GPR20(r1) ;\
l.lwz r21,PT_GPR21(r1) ;\
l.lwz r22,PT_GPR22(r1) ;\
l.lwz r23,PT_GPR23(r1) ;\
l.lwz r24,PT_GPR24(r1) ;\
l.lwz r25,PT_GPR25(r1) ;\
l.lwz r26,PT_GPR26(r1) ;\
l.lwz r27,PT_GPR27(r1) ;\
l.lwz r28,PT_GPR28(r1) ;\
l.lwz r29,PT_GPR29(r1) ;\
l.lwz r30,PT_GPR30(r1) ;\
l.lwz r31,PT_GPR31(r1) ;\
l.lwz r1,PT_SP(r1) ;\
l.rfe
Here we can see how l.mtspr r0,r3,SPR_EPCR_BASE restores the EPCR to the pc
address stored in pt_regs when we entered the exception handler. All
other registers are restored and finally the l.rfe instruction is issued to
return from the exception, which effectively jumps to EPCR.
The reason QEMU was not setting the correct exception address is due to the way
QEMU is implemented to optimize performance. QEMU executes target code in
basic blocks that are translated to host native instructions. During runtime
all PC addresses are those of the host, for example x86-64 64-bit
addresses. When an exception occurs, updating the target PC address from the host PC
needs to be explicitly requested.
QEMU Patch 3
The next patch to implement tininess before rounding was also trivial but
brought up a conversation about default NaN payloads.
QEMU Patch 4
Wait, there is more. While writing this article I realized that if QEMU
was setting the EPCR to the FPU instruction causing the exception then
we would end up in an endless loop.
Luckily the architecture anticipated this, calling for FPU exceptions to set the EPCR to the next
instruction to be executed. QEMU was missing this logic.
The patch target/openrisc: Set EPCR to next PC on FPE exceptions
fixes this up.
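To see why the choice of EPCR matters, here is a toy model in C. It is not QEMU code; the function name and parameters are invented for illustration. The model assumes a straight-line program of 4-byte instructions in which the instruction at fault_pc raises an FPU exception every time it executes, and a handler that simply returns with l.rfe, which jumps to EPCR.

```c
#include <stdbool.h>
#include <stdint.h>

/* Run the toy program from PC 0 until end_pc and count FPU exceptions.
   max_exceptions bounds the run so that the looping case terminates. */
static int count_fpu_exceptions(uint32_t fault_pc, uint32_t end_pc,
                                bool epcr_is_next_pc, int max_exceptions)
{
    uint32_t pc = 0;
    int exceptions = 0;

    while (pc < end_pc && exceptions < max_exceptions) {
        if (pc == fault_pc) {
            /* The CPU records where to resume after the handler. */
            uint32_t epcr = epcr_is_next_pc ? pc + 4 : pc;

            exceptions++;
            /* The handler returns with l.rfe, which jumps to EPCR. */
            pc = epcr;
        } else {
            pc += 4;
        }
    }
    return exceptions;
}
```

With epcr_is_next_pc set the program takes one exception and moves on. With it cleared the faulting instruction is re-executed after every return, so the count runs up to max_exceptions.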
RTL Updates
Updating the actual verilog RTL CPU implementations also needed to be done.
Updates have been made to both the mor1kx
and the
or1k_marocchino
implementations.
mor1kx Updates
Updates to the mor1kx to support user mode reads and writes to the FPCSR were done in the patch:
Make FPCSR is R/W accessible for both user- and supervisor- modes.
The full patch is:
@@ -618,7 +618,7 @@ module mor1kx_ctrl_cappuccino
spr_fpcsr[`OR1K_FPCSR_FPEE] <= 1'b0;
end
else if ((spr_we & spr_access[`OR1K_SPR_SYS_BASE] &
- (spr_sr[`OR1K_SPR_SR_SM] & padv_ctrl | du_access)) &&
+ (padv_ctrl | du_access)) &&
`SPR_OFFSET(spr_addr)==`SPR_OFFSET(`OR1K_SPR_FPCSR_ADDR)) begin
spr_fpcsr <= spr_write_dat[`OR1K_FPCSR_WIDTH-1:0]; // update all fields
`ifdef OR1K_FPCSR_MASK_FLAGS
The change to the verilog shows that previously, when writing (spr_we) to the FPCSR (OR1K_SPR_FPCSR_ADDR) register,
we checked that the supervisor bit (OR1K_SPR_SR_SM) of the sr spr (spr_sr) was set. That check
enforced supervisor mode only write access; removing it allows user space to write to the register.
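The effect of the one line change can be modelled as a pair of boolean functions. This is a C restatement of just the term that changed in the diff, with hypothetical function names; the remaining conditions (spr_we, the system group select and the address match) are unchanged by the patch and left out here.

```c
#include <stdbool.h>

/* Before the patch: (spr_sr[SM] & padv_ctrl | du_access).
   In Verilog & binds tighter than |, so this is (SM & padv_ctrl) | du_access. */
static bool fpcsr_write_allowed_old(bool supervisor, bool padv_ctrl, bool du_access)
{
    return (supervisor && padv_ctrl) || du_access;
}

/* After the patch: (padv_ctrl | du_access), the supervisor bit is ignored. */
static bool fpcsr_write_allowed_new(bool supervisor, bool padv_ctrl, bool du_access)
{
    (void) supervisor;
    return padv_ctrl || du_access;
}
```

The only inputs where the two differ are user mode writes (supervisor false) while the pipeline advances, which is exactly the case user space needs.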
Updating mor1kx to support tininess checking before rounding was done in the
change Refactoring and implementation tininess detection before
rounding.
I will not go into the details of these patches as I don’t understand them so
much.
Marocchino Updates
Updates to the or1k_marocchino to support user mode reads and writes to the FPCSR were done in the patch:
Make FPCSR is R/W accessible for both user- and supervisor- modes.
The full patch is:
@@ -714,7 +714,7 @@ module or1k_marocchino_ctrl
assign except_fpu_enable_o = spr_fpcsr[`OR1K_FPCSR_FPEE];
wire spr_fpcsr_we = (`SPR_OFFSET(({1'b0, spr_sys_group_wadr_r})) == `SPR_OFFSET(`OR1K_SPR_FPCSR_ADDR)) &
- spr_sys_group_we & spr_sr[`OR1K_SPR_SR_SM];
+ spr_sys_group_we; // FPCSR is R/W for both user- and supervisor- modes
`ifdef OR1K_FPCSR_MASK_FLAGS
reg [`OR1K_FPCSR_ALLF_SIZE-1:0] ctrl_fpu_mask_flags_r;
Updating the marocchino to support detecting tininess before rounding was done in the
patch:
Refactoring FPU Implementation for tininess detection BEFORE ROUNDING.
I will not go into details of the patch as I didn’t write it. In
general it is a medium size refactoring of the floating point unit.
Summary
We discussed updates to the architecture simulators and verilog CPU implementations
to allow supporting user mode floating point programs. These updates will now allow us to
port Linux and glibc to the OpenRISC floating point unit.
25 Apr 2023
Last year (2022) the big milestone for OpenRISC was getting the glibc port upstream.
Though there is libc support for
OpenRISC already with musl and uClibc,
the glibc port provides an extensive testsuite which has proved useful in shaking out toolchain
and OS bugs.
The upstreamed OpenRISC glibc support is missing support for leveraging the
OpenRISC floating-point unit (FPU).
Adding OpenRISC glibc FPU support requires a cross cutting effort across the
architecture’s full stack:
- Architecture Specification
- Simulators and CPU implementations
- Linux Kernel support
- GCC Instructions and Soft FPU
- Binutils/GDB Debugging Support
- glibc support
In this blog entry I will cover how the OpenRISC architecture specification
was updated to support user space floating point applications. But first, what
is FPU porting?
What is FPU Porting?
The FPU in modern CPUs allows the processor to perform IEEE 754
floating point math like addition, subtraction, multiplication. When used in a
user application the FPU’s function becomes more of a math accelerator, speeding
up math operations including
trigonometric and
complex functions such as sin,
sinf and cexpf. Not all FPUs provide the same
set of FPU operations nor do they have to. When enabled, the compiler will
insert floating point instructions where they can be used.
OpenRISC FPU support was added to the GCC compiler a while back.
We can see how this works with a simple example using the bare-metal newlib toolchain.
C code example addf.c:
float addf(float a, float b) {
return a + b;
}
To compile this C function we can do:
$ or1k-elf-gcc -O2 addf.c -c -o addf-sf.o
$ or1k-elf-gcc -O2 -mhard-float addf.c -c -o addf-hf.o
Assembly output of addf-sf.o contains the default software floating point
implementation. We can see below that a call to __addsf3 was
added to perform our floating point operation. The function __addsf3
is provided
by libgcc as a software implementation of the single precision
floating point (sf) add operation.
$ or1k-elf-objdump -dr addf-sf.o
Disassembly of section .text:
00000000 <addf>:
0: 9c 21 ff fc l.addi r1,r1,-4
4: d4 01 48 00 l.sw 0(r1),r9
8: 04 00 00 00 l.jal 8 <addf+0x8>
8: R_OR1K_INSN_REL_26 __addsf3
c: 15 00 00 00 l.nop 0x0
10: 85 21 00 00 l.lwz r9,0(r1)
14: 44 00 48 00 l.jr r9
18: 9c 21 00 04 l.addi r1,r1,4
The disassembly of addf-hf.o below shows that the hardware FPU instruction
lf.add.s is used to perform addition. This is because the snippet
was compiled using the -mhard-float argument. One could imagine that, if this is
supported, it would be more efficient than the software implementation.
$ or1k-elf-objdump -dr addf-hf.o
Disassembly of section .text:
00000000 <addf>:
0: c9 63 20 00 lf.add.s r11,r3,r4
4: 44 00 48 00 l.jr r9
8: 15 00 00 00 l.nop 0x0
So if the OpenRISC toolchain already has support for FPU instructions what else
needs to be done? When we add FPU support to glibc we are adding FPU support
to the OpenRISC POSIX runtime and creating a toolchain that can compile and link
binaries to run on this runtime.
The Runtime
Below we can see examples of two application runtimes: Application A runs
with software floating point, while Application B runs with full hardware
floating point.

Both Application A and Application B can run on the same system, but
Application B requires a libc and kernel that support the floating point
runtime. As we can see:
- Application B leverages floating point instructions as noted in the
blue box. These should be implemented in the
CPU, and are produced by the GCC compiler.
- The math routines in the C Library used by Application B are accelerated by
the FPU as per the blue box. The math routines can also set up rounding of the FPU hardware to
be in line with rounding of the software routines. The math routines can
also detect exceptions by checking the FPU state. The rounding and exception
handling in the purple boxes is what is
implemented in GLIBC.
- The kernel must be able to save and restore the FPU state when switching
between processes. The OS also has support for signalling the process if
enabled. This is indicated in the purple
box.
Another aspect is that supporting hardware floating point in the OS means that
multiple user land programs can transparently use the FPU. To do all of this we
need to update the kernel and the C runtime libraries to:
- Make the kernel save and restore process FPU state during context switches
- Make the kernel handle FPU exceptions and deliver signals to user land
- Teach GLIBC how to setup FPU rounding mode
- Teach GLIBC how to translate FPU exceptions
- Tell GCC and GLIBC soft float about our FPU quirks
In order to compile applications like Application B a separate compiler
toolchain is needed. For highly configurable embedded system CPUs like ARM and RISC-V there
are multiple toolchains available for building software for the different CPU
configurations. Usually there will be one toolchain for soft float and one for hard float support, see the below example
from the arm toolchain download page.

Fixing Architecture Issues
As we started to work on the floating point support we found two issues:
- The OpenRISC floating point control and status register (FPCSR) is accessible only in
supervisor mode.
- We have not defined how the FPU should perform tininess detection.
FPCSR Access
The GLIBC OpenRISC FPU port, or any port for that matter, starts
by looking at what other architectures have done. For GLIBC FPU support we can
look at what MIPS, ARM, RISC-V etc. have implemented. Most ports have a file
called sysdeps/{arch}/fpu_control.h. I noticed one thing right away as I went
through these. We can look at ARM or MIPS for example:
sysdeps/mips/fpu_control.h:
Excerpt from the MIPS port showing the definition of _FPU_GETCW and _FPU_SETCW
#else
# define _FPU_GETCW(cw) __asm__ volatile ("cfc1 %0,$31" : "=r" (cw))
# define _FPU_SETCW(cw) __asm__ volatile ("ctc1 %0,$31" : : "r" (cw))
#endif
sysdeps/arm/fpu_control.h:
Excerpt from the ARM port showing the definition of _FPU_GETCW and _FPU_SETCW
# define _FPU_GETCW(cw) \
__asm__ __volatile__ ("vmrs %0, fpscr" : "=r" (cw))
# define _FPU_SETCW(cw) \
__asm__ __volatile__ ("vmsr fpscr, %0" : : "r" (cw))
#endif
What we see here is a macro that defines how to read or write the floating point
control word for each architecture. The macros are implemented using a single
assembly instruction.
In OpenRISC we have similar instructions for reading and writing the floating
point control register (FPCSR), writing for example is: l.mtspr r0,%0,20. However,
on OpenRISC the FPCSR is read-only when running in user-space, which is a
problem.
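For comparison with the MIPS and ARM macros, an OpenRISC pair might look like the sketch below. The macro names with the _SKETCH suffix, the fake_fpcsr host fallback and the helper function are hypothetical. SPR number 20 comes from the l.mtspr example above; the __or1k__ compiler macro and the position of the rounding mode field (bits 2:1) are assumptions that should be checked against the toolchain and the architecture manual.

```c
#include <stdint.h>

#if defined(__or1k__)
/* On OpenRISC: read and write the FPCSR with a single instruction each. */
# define FPU_GETCW_SKETCH(cw) __asm__ volatile ("l.mfspr %0,r0,20" : "=r" (cw))
# define FPU_SETCW_SKETCH(cw) __asm__ volatile ("l.mtspr r0,%0,20" : : "r" (cw))
#else
/* Host fallback so that the sketch can be compiled and tried elsewhere. */
static uint32_t fake_fpcsr;
# define FPU_GETCW_SKETCH(cw) ((cw) = fake_fpcsr)
# define FPU_SETCW_SKETCH(cw) (fake_fpcsr = (cw))
#endif

/* Assumed layout: rounding mode field in FPCSR bits 2:1. */
#define FPCSR_RM_SHIFT 1
#define FPCSR_RM_MASK  (UINT32_C(3) << FPCSR_RM_SHIFT)

/* Read-modify-write of the rounding mode field; returns the new control word. */
static uint32_t fpcsr_set_rounding_mode(uint32_t rm)
{
    uint32_t cw;

    FPU_GETCW_SKETCH(cw);
    cw = (cw & ~FPCSR_RM_MASK) | ((rm << FPCSR_RM_SHIFT) & FPCSR_RM_MASK);
    FPU_SETCW_SKETCH(cw);
    return cw;
}
```

On hardware where the FPCSR is read-only in user mode, the write half of this helper would not take effect, which is the problem described above.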
If we remember from our operating system studies, user applications run in
user-mode as
opposed to the privileged kernel-mode.
The user floating point environment
is defined by the ISO C Standard and adopted by POSIX. The C library provides functions to
set rounding modes and clear exceptions using for example
fesetround
for setting FPU rounding modes and
feholdexcept for clearing exceptions.
If userspace applications need to be able to control the floating point unit
then having architecture support for this is integral.
Originally the OpenRISC architecture specification
specified the floating point control and status register (FPCSR) as being
read only when executing in user mode. Again, this is a problem and needs to
be addressed.

Other architectures define the floating point control register as being writable in user-mode.
For example, ARM has the
FPCR and FPSR,
and RISC-V has the
FCSR
all of which are writable in user-mode.

Tininess Detection
I am skipping ahead a bit here. Once the OpenRISC GLIBC port was working we noticed
many problematic math test failures. These turned out to be caused by inconsistencies
between the tininess detection [pdf]
settings in the toolchain. Tininess detection must be selected by an FPU
implementation as being done before or after rounding.
In the toolchain this is configured by:
- GLIBC
TININESS_AFTER_ROUNDING - macro used by test suite to control
expectations
- GLIBC
_FP_TININESS_AFTER_ROUNDING - macro used to control softfloat
implementation in GLIBC.
- GCC libgcc
_FP_TININESS_AFTER_ROUNDING - macro used to control softfloat
implementation in GCC libgcc.
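The difference between the two settings only shows up for results that sit just below the smallest normal number and round up to it. The following integer model is a sketch, not the softfloat code; all names are hypothetical. It represents an exact result as sig * 2^exp, and a format by its precision p and minimum normal exponent emin (p = 24 and emin = -126 for single precision).

```c
#include <stdbool.h>
#include <stdint.h>

/* Number of significant bits in sig (0 for sig == 0). */
static int bit_length(uint64_t sig)
{
    int n = 0;

    while (sig != 0) {
        n++;
        sig >>= 1;
    }
    return n;
}

/* Before rounding: is the exact value sig * 2^exp non-zero and below 2^emin? */
static bool tiny_before_rounding(uint64_t sig, int exp, int emin)
{
    if (sig == 0)
        return false;
    return bit_length(sig) - 1 + exp < emin;
}

/* After rounding: first round to p bits (nearest, ties to even) as if the
   exponent range were unbounded, then compare against 2^emin. */
static bool tiny_after_rounding(uint64_t sig, int exp, int p, int emin)
{
    int n = bit_length(sig);

    if (sig == 0)
        return false;
    if (n > p) {
        int shift = n - p;
        uint64_t kept = sig >> shift;
        uint64_t dropped = sig & ((UINT64_C(1) << shift) - 1);
        uint64_t half = UINT64_C(1) << (shift - 1);

        if (dropped > half || (dropped == half && (kept & 1)))
            kept++;                /* may carry into a (p+1)-th bit */
        sig = kept;
        exp += shift;
    }
    return bit_length(sig) - 1 + exp < emin;
}
```

Take the exact value (2^25 - 1) * 2^-151, which is 2^-126 - 2^-151 and so just below the smallest normal float. It is tiny before rounding, but rounding to 24 bits carries up to exactly 2^-126, so it is not tiny after rounding. A test suite configured for one setting will see unexpected underflow results from an implementation using the other.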
Updating the Spec
Writing to FPCSR from user-mode could be worked around in OpenRISC by
introducing a syscall, but we decided to just change the architecture
specification for this. Updating the spec keeps it similar to all other
architectures out there.
In OpenRISC we have defined tininess detection to be done before rounding as
this matches what existing FPU implementations have done.
As of architecture specification revision
1.4 the FPCSR is defined as being writable
in user-mode and we have documented tininess detection to be before rounding.
Summary
We’ve gone through an overview of how the FPU accelerates math in
an application runtime. We then looked at how the OpenRISC architecture specification needed
to be updated to support the floating point POSIX runtime.
In the next entry we shall look into patches to get QEMU and CPU
implementations updated to support the new spec changes.
13 Dec 2020
I have been working on porting GLIBC to the OpenRISC architecture. This has
taken longer than I expected as with GLIBC upstreaming we must get every
test to pass. This was different compared to GDB and GCC which were a
bit more lenient.
My first upstreaming attempt
was completely tested on the QEMU simulator. I have
since added an FPGA LiteX SoC
to my test platform options. LiteX runs Linux on the OpenRISC mor1kx softcore
and tests are loaded over an SSH session. The SoC eliminates an issue I was
seeing on the simulator where under heavy load it appears the MMU starves the kernel
from getting any work done.
To get to where I am now this required:
Adding GDB Linux debugging support is great because it allows debugging of
multithreaded processes and signal handling; which we are going to need.
A Bug
Our story starts when I was trying to fix a failing GLIBC NPTL
test case. The test case involves C++ exceptions and POSIX threads.
The issue is that the catch block of a try/catch block is not
being called. Where do we even start?
My plan for approaching test case failures is:
1. Understand what the test case is trying to test and where it’s failing
2. Create a hypothesis about where the problem is
3. Understand how the failing APIs work internally
4. Debug until we find the issue
5. If we get stuck go back to 2.
Let’s have a try.
Understanding the Test case
The GLIBC test case is nptl/tst-cancel24.cc.
The test starts in the do_test function and it will create a child thread with pthread_create.
The child thread executes function tf which waits on a semaphore until the parent thread cancels it. It
is expected that the child thread, when cancelled, will call its catch block.
The failure is that the catch block is not getting run as evidenced by the except_caught variable
not being set to true.
Below is an excerpt from the test showing the tf function.
static void *
tf (void *arg) {
sem_t *s = static_cast<sem_t *> (arg);
try {
monitor m;
pthread_barrier_wait (&b);
while (1)
sem_wait (s);
} catch (...) {
except_caught = true;
throw;
}
return NULL;
}
So the catch block is not being run. Simple, but where do we start to
debug that? Let’s move onto the next step.
Creating a Hypothesis
This one is a bit tricky as it seems C++ try/catch blocks are broken. Here, I am
working on GLIBC testing, what does that have to do with C++?
To get a better idea of where the problem is I tried to modify the test to test
some simple ideas. First, maybe there is a problem with catching exceptions
thrown from thread child functions.
static void do_throw() { throw 99; }
static void * tf () {
try {
monitor m;
while (1) do_throw();
} catch (...) {
except_caught = true;
}
return NULL;
}
No, this works correctly. So try/catch is working.
Hypothesis: There is a problem handling exceptions while in a syscall.
There may be something broken with OpenRISC related to how we setup stack
frames for syscalls that makes the unwinder fail.
How does that work? Let’s move onto the next step.
Understanding the Internals
To find this bug we need to understand how C++ exceptions work. Also, we need to know
what happens when a thread is cancelled in a multithreaded
(pthread) glibc environment.
There are a few contributors to pthread cancellation and C++ exceptions which are:
- DWARF - provided by our program and libraries in the .eh_frame ELF section
- GLIBC - provides the pthread runtime and cleanup callbacks to the GCC unwinder code
- GCC - provides libraries for dealing with exceptions
  - libgcc_s.so - handles unwinding by reading program DWARF metadata and doing the frame decoding
  - libstdc++.so.6 - provides the C++ personality routine which
    identifies and prepares catch blocks for execution
DWARF
ELF binaries provide debugging information in a data format called
DWARF. The name was chosen to maintain a
fantasy theme. Lately the Linux community has a new debug format called
ORC.
Though DWARF is a debugging format and usually stored in .debug_frame,
.debug_info, etc sections, a stripped down version of it is used for exception
handling.
Each ELF binary that supports unwinding contains the .eh_frame section to
provide unwinding information. This can be seen with the readelf program.
$ readelf -S sysroot/lib/libc.so.6
There are 70 section headers, starting at offset 0xaa00b8:
Section Headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[ 0] NULL 00000000 000000 000000 00 0 0 0
[ 1] .note.ABI-tag NOTE 00000174 000174 000020 00 A 0 0 4
[ 2] .gnu.hash GNU_HASH 00000194 000194 00380c 04 A 3 0 4
[ 3] .dynsym DYNSYM 000039a0 0039a0 008280 10 A 4 15 4
[ 4] .dynstr STRTAB 0000bc20 00bc20 0054d4 00 A 0 0 1
[ 5] .gnu.version VERSYM 000110f4 0110f4 001050 02 A 3 0 2
[ 6] .gnu.version_d VERDEF 00012144 012144 000080 00 A 4 4 4
[ 7] .gnu.version_r VERNEED 000121c4 0121c4 000030 00 A 4 1 4
[ 8] .rela.dyn RELA 000121f4 0121f4 00378c 0c A 3 0 4
[ 9] .rela.plt RELA 00015980 015980 000090 0c AI 3 28 4
[10] .plt PROGBITS 00015a10 015a10 0000d0 04 AX 0 0 4
[11] .text PROGBITS 00015ae0 015ae0 155b78 00 AX 0 0 4
[12] __libc_freeres_fn PROGBITS 0016b658 16b658 001980 00 AX 0 0 4
[13] .rodata PROGBITS 0016cfd8 16cfd8 0192b4 00 A 0 0 4
[14] .interp PROGBITS 0018628c 18628c 000018 00 A 0 0 1
[15] .eh_frame_hdr PROGBITS 001862a4 1862a4 001a44 00 A 0 0 4
[16] .eh_frame PROGBITS 00187ce8 187ce8 007cf4 00 A 0 0 4
[17] .gcc_except_table PROGBITS 0018f9dc 18f9dc 000341 00 A 0 0 1
...
We can decode the metadata using readelf as well using the
--debug-dump=frames-interp and --debug-dump=frames arguments.
The frames dump provides a raw output of the DWARF metadata for each frame.
This is not usually as useful as frames-interp, but it shows how the DWARF
format is actually a bytecode. The DWARF interpreter needs to execute these
operations to understand how to derive the values of registers based on the current PC.
There is an interesting talk in Exploiting the hard-working
DWARF.pdf.
An example of the frames dump:
$ readelf --debug-dump=frames sysroot/lib/libc.so.6
...
00016788 0000000c ffffffff CIE
Version: 1
Augmentation: ""
Code alignment factor: 4
Data alignment factor: -4
Return address column: 9
DW_CFA_def_cfa_register: r1
DW_CFA_nop
00016798 00000028 00016788 FDE cie=00016788 pc=0016b584..0016b658
DW_CFA_advance_loc: 4 to 0016b588
DW_CFA_def_cfa_offset: 4
DW_CFA_advance_loc: 8 to 0016b590
DW_CFA_offset: r9 at cfa-4
DW_CFA_advance_loc: 68 to 0016b5d4
DW_CFA_remember_state
DW_CFA_def_cfa_offset: 0
DW_CFA_restore: r9
DW_CFA_restore_state
DW_CFA_advance_loc: 56 to 0016b60c
DW_CFA_remember_state
DW_CFA_def_cfa_offset: 0
DW_CFA_restore: r9
DW_CFA_restore_state
DW_CFA_advance_loc: 36 to 0016b630
DW_CFA_remember_state
DW_CFA_def_cfa_offset: 0
DW_CFA_restore: r9
DW_CFA_restore_state
DW_CFA_advance_loc: 40 to 0016b658
DW_CFA_def_cfa_offset: 0
DW_CFA_restore: r9
The frames-interp argument is a bit more clear as it shows the interpreted output
of the bytecode. Below we see two types of entries:
CIE - Common Information Entry
FDE - Frame Description Entry
The CIE provides starting point information for each child FDE entry. Some
things to point out: we see ra=9 indicates the return address is stored in
register r9, we see CFA r1+0 indicates the canonical frame address is held in
register r1 and we see the stack frame size is 4 bytes.
An example of the frames-interp dump:
$ readelf --debug-dump=frames-interp sysroot/lib/libc.so.6
...
00016788 0000000c ffffffff CIE "" cf=4 df=-4 ra=9
LOC CFA
00000000 r1+0
00016798 00000028 00016788 FDE cie=00016788 pc=0016b584..0016b658
LOC CFA ra
0016b584 r1+0 u
0016b588 r1+4 u
0016b590 r1+4 c-4
0016b5d4 r1+4 c-4
0016b60c r1+4 c-4
0016b630 r1+4 c-4
0016b658 r1+0 u
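To make the table concrete, here is a sketch of how an unwinder could use the interpreted rows of this FDE. It is not the libgcc implementation; the struct and function names are invented. Each row says how to compute the CFA from r1 and whether the return address is still in r9 (u) or has been saved at an offset from the CFA (c-4).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct cfa_row {
    uint32_t loc;        /* first PC this row applies to */
    int32_t cfa_offset;  /* CFA = r1 + cfa_offset */
    bool ra_saved;       /* false for 'u': return address still in r9 */
    int32_t ra_offset;   /* when saved: stored at CFA + ra_offset */
};

/* Rows copied from the frames-interp dump above. */
static const struct cfa_row fde_rows[] = {
    { 0x0016b584, 0, false,  0 },
    { 0x0016b588, 4, false,  0 },
    { 0x0016b590, 4, true,  -4 },
    { 0x0016b5d4, 4, true,  -4 },
    { 0x0016b60c, 4, true,  -4 },
    { 0x0016b630, 4, true,  -4 },
    { 0x0016b658, 0, false,  0 },
};

/* Find the last row whose LOC is at or below pc, or NULL if there is none. */
static const struct cfa_row *find_row(uint32_t pc)
{
    const struct cfa_row *match = NULL;

    for (size_t i = 0; i < sizeof fde_rows / sizeof fde_rows[0]; i++)
        if (fde_rows[i].loc <= pc)
            match = &fde_rows[i];
    return match;
}

/* CFA for the frame, given the current value of r1. */
static uint32_t compute_cfa(const struct cfa_row *row, uint32_t r1)
{
    return r1 + (uint32_t) row->cfa_offset;
}

/* Memory address holding the saved return address; only meaningful
   when row->ra_saved is true. */
static uint32_t saved_ra_address(const struct cfa_row *row, uint32_t r1)
{
    return compute_cfa(row, r1) + (uint32_t) row->ra_offset;
}
```

For a PC in the middle of the function, such as 0x0016b5a0, the row found is r1+4 with c-4, so the return address is read from the word at r1.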
GLIBC
GLIBC provides pthreads which when used with C++ needs to support exception
handling. The main place exceptions are used with pthreads is when cancelling
threads. When using pthread_cancel a cancel signal is sent to the target thread using tgkill
which causes an exception.
This is implemented with the below APIs.
- sigcancel_handler -
Setup during the pthread runtime initialization, it handles cancellation,
which calls
__do_cancel, which calls __pthread_unwind.
- __pthread_unwind -
Is called with
pd->cancel_jmp_buf. It calls glibc’s _Unwind_ForcedUnwind.
- _Unwind_ForcedUnwind -
Loads GCC’s
libgcc_s.so version of _Unwind_ForcedUnwind
and calls it with parameters:
exc - the exception context
unwind_stop - the stop callback to GLIBC, called for each frame of the unwind, with
the stop argument ibuf
ibuf - the jmp_buf, created by setjmp (self->cancel_jmp_buf) in start_thread
- unwind_stop -
Checks the current state of unwind and jumps to the
cancel_jmp_buf if
we are at the end of stack. When the cancel_jmp_buf is jumped to the thread
exits.
Let’s look at pd->cancel_jmp_buf in more detail. The cancel_jmp_buf is
set up during pthread_create after clone in start_thread.
It uses the setjmp and
longjmp non-local goto mechanism.
Let’s look at some diagrams.

The above diagram shows a pthread that exits normally. During the Start phase
of the thread setjmp will create the cancel_jmp_buf. After the thread
routine exits it returns to the start_thread routine to do cleanup.
The cancel_jmp_buf is not used.

The above diagram shows a pthread that is cancelled. When the
thread is created setjmp will create the cancel_jmp_buf. In this case
while the thread routine is running it is cancelled, the unwinder runs
and at the end it calls unwind_stop which calls longjmp. After the
longjmp the thread is returned to start_thread to do cleanup.
A highly redacted version of our start_thread and unwind_stop functions is
shown below.
start_thread()
{
struct pthread *pd = START_THREAD_SELF;
...
struct pthread_unwind_buf unwind_buf;
int not_first_call;
not_first_call = setjmp ((struct __jmp_buf_tag *) unwind_buf.cancel_jmp_buf);
...
if (__glibc_likely (! not_first_call))
{
/* Store the new cleanup handler info. */
THREAD_SETMEM (pd, cleanup_jmp_buf, &unwind_buf);
...
/* Run the user provided thread routine */
ret = pd->start_routine (pd->arg);
THREAD_SETMEM (pd, result, ret);
}
... free resources ...
__exit_thread ();
}
unwind_stop (_Unwind_Action actions,
struct _Unwind_Context *context, void *stop_parameter)
{
struct pthread_unwind_buf *buf = stop_parameter;
struct pthread *self = THREAD_SELF;
int do_longjump = 0;
...
if ((actions & _UA_END_OF_STACK)
|| ... )
do_longjump = 1;
...
/* If we are at the end, go back to start_thread for cleanup */
if (do_longjump)
__libc_unwind_longjmp ((struct __jmp_buf_tag *) buf->cancel_jmp_buf, 1);
return _URC_NO_REASON;
}
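The setjmp and longjmp pattern used by start_thread and unwind_stop can be tried in isolation. The sketch below is a single threaded stand-in, not glibc code: the names start_thread_sketch, normal_routine and cancelled_routine are hypothetical, and the longjmp inside cancelled_routine plays the part that unwind_stop plays at the end of the stack.

```c
#include <setjmp.h>
#include <stddef.h>

/* Stands in for the per-thread cancel_jmp_buf. */
static jmp_buf cancel_jmp_buf;

typedef void *(*thread_routine)(void *);

/* A routine that runs to completion and returns normally. */
static void *normal_routine(void *arg)
{
    return arg;
}

/* A routine that is "cancelled": control jumps straight back to the
   setjmp call in start_thread_sketch, as unwind_stop would arrange. */
static void *cancelled_routine(void *arg)
{
    (void) arg;
    longjmp(cancel_jmp_buf, 1);
    return NULL; /* not reached */
}

/* Returns 0 when the routine exits normally, 1 when it was cancelled. */
static int start_thread_sketch(thread_routine routine, void *arg)
{
    if (setjmp(cancel_jmp_buf) == 0) {
        /* First return from setjmp: run the user provided routine. */
        routine(arg);
        return 0;
    }
    /* Second return, via longjmp: the routine never returned to us. */
    return 1;
}
```

In both cases control ends up back in start_thread_sketch, which is where the real start_thread frees resources and exits the thread.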
GCC
GCC provides the exception handling and unwinding capabilities
to the C++ runtime. They are provided in the libgcc_s.so and libstdc++.so.6 libraries.
The libgcc_s.so library implements the IA-64 Itanium Exception Handling ABI.
It’s interesting that the now defunct Itanium
architecture introduced this ABI, which is now the standard for exception
handling on all processors. The two main entry points for the unwinder are:
_Unwind_ForcedUnwind - for forced unwinding
_Unwind_RaiseException - for raising normal exceptions
There are also two data structures to be aware of:
- _Unwind_Context - register and unwind state for a frame, below referenced as CONTEXT
- _Unwind_FrameState - register and unwind state from DWARF, below referenced as FS
The important parts of _Unwind_Context:
struct _Unwind_Context {
_Unwind_Context_Reg_Val reg[__LIBGCC_DWARF_FRAME_REGISTERS__+1];
void *cfa;
void *ra;
struct dwarf_eh_bases bases;
_Unwind_Word flags;
};
The important parts of _Unwind_FrameState:
typedef struct {
struct frame_state_reg_info { ... } regs;
void *pc;
/* The information we care about from the CIE/FDE. */
_Unwind_Personality_Fn personality;
_Unwind_Sword data_align;
_Unwind_Word code_align;
_Unwind_Word retaddr_column;
unsigned char fde_encoding;
unsigned char signal_frame;
void *eh_ptr;
} _Unwind_FrameState;
These two data structures are very similar. The _Unwind_FrameState is for internal
use and is closely tied to the DWARF definitions of the frame. The _Unwind_Context
struct is more generic and is used as an opaque structure in the public unwind API.
Forced Unwinds
Exceptions that are raised for thread cancellation use a single phase forced unwind.
Code execution will not resume, but catch blocks will be run. This is why
cancel exceptions must be rethrown.
Forced unwinds use the unwind_stop handler which GLIBC provides as explained in
the GLIBC section above.
- _Unwind_ForcedUnwind - calls:
  - _Unwind_ForcedUnwind_Phase2 - loops forever doing:
    - uw_frame_state_for - populate FS for the frame one frame above CONTEXT, searching DWARF using CONTEXT->ra
    - stop - callback to GLIBC to stop the unwind if needed
    - FS.personality - the C++ personality routine, see below, called with _UA_FORCE_UNWIND | _UA_CLEANUP_PHASE
    - uw_advance_context - advance CONTEXT by populating it from FS
Normal Exceptions
For exceptions raised programmatically, unwinding is very similar to the forced unwind, but
there is no stop function and exception unwinding is two-phase.
- _Unwind_RaiseException - calls:
  - uw_init_context - load details of the current frame from cpu/stack into CONTEXT
  - Do phase 1 loop:
    - uw_frame_state_for - populate FS for the frame one frame above CONTEXT, searching DWARF using CONTEXT->ra
    - FS.personality - the C++ personality routine, see below, called with _UA_SEARCH_PHASE
    - uw_update_context - advance CONTEXT by populating it from FS (same as uw_advance_context)
  - _Unwind_RaiseException_Phase2 - do the frame iterations
  - uw_install_context - exit unwinder jumping to selected frame
- _Unwind_RaiseException_Phase2 - do phase 2, loops forever doing:
  - uw_frame_state_for - populate FS for the frame one frame above CONTEXT, searching DWARF using CONTEXT->ra
  - FS.personality - the C++ personality routine, called with _UA_CLEANUP_PHASE
  - uw_update_context - advance CONTEXT by populating it from FS
The libstdc++.so.6 library provides the C++ standard library
which includes the C++ personality routine __gxx_personality_v0.
The personality routine is the interface between the unwind routines and the C++
(or other language) runtime, which handles the exception handling logic for that
language.
As we saw above the personality routine is executed for each stack frame. The
function checks if there is a catch block that matches the exception being
thrown. If there is a match, it will update the context to prepare it to jump
into the catch routine and return _URC_INSTALL_CONTEXT. If there is no catch
block matching it returns _URC_CONTINUE_UNWIND.
In the case of _URC_INSTALL_CONTEXT the _Unwind_ForcedUnwind_Phase2
loop breaks and uw_install_context is called.
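The two-phase flow and the role of the personality routine can be sketched with a toy model. Everything here is hypothetical and simplified for illustration: frames are entries in an array with index 0 being the innermost frame, exception types are non-zero integers, and the toy_ names are not libgcc APIs. In the real unwinder the frame walk is driven by DWARF data rather than an array.

```c
#include <stddef.h>

enum toy_action { TOY_SEARCH_PHASE = 1, TOY_CLEANUP_PHASE = 2 };
enum toy_reason { TOY_CONTINUE_UNWIND, TOY_HANDLER_FOUND, TOY_INSTALL_CONTEXT };

struct toy_frame {
    const char *name;
    int catches;      /* exception type caught by this frame, 0 for none */
    int cleanups_run; /* how many times cleanup code ran in this frame */
};

/* Toy personality routine, called once per frame in each phase. */
enum toy_reason toy_personality(struct toy_frame *f, int actions, int exc_type)
{
    if (f->catches == exc_type)
        return (actions & TOY_SEARCH_PHASE) ? TOY_HANDLER_FOUND
                                            : TOY_INSTALL_CONTEXT;
    /* No matching catch block: only phase 2 runs cleanups. */
    if (actions & TOY_CLEANUP_PHASE)
        f->cleanups_run++;
    return TOY_CONTINUE_UNWIND;
}

/* Toy version of raising an exception. Returns the index of the frame
   whose catch block would be entered, or -1 if nothing catches it. */
int toy_raise_exception(struct toy_frame *stack, int depth, int exc_type)
{
    int handler = -1;

    /* Phase 1: search for a handler, do not modify any frame. */
    for (int i = 0; i < depth; i++) {
        if (toy_personality(&stack[i], TOY_SEARCH_PHASE, exc_type)
            == TOY_HANDLER_FOUND) {
            handler = i;
            break;
        }
    }
    if (handler < 0)
        return -1;

    /* Phase 2: walk again up to the handler frame, running cleanups. */
    for (int i = 0; i <= handler; i++) {
        if (toy_personality(&stack[i], TOY_CLEANUP_PHASE, exc_type)
            == TOY_INSTALL_CONTEXT)
            return i; /* the real unwinder calls uw_install_context here */
    }
    return -1;
}
```

With a stack of sem_wait, tf and start_thread where only tf catches type 7, raising type 7 selects frame 1 and runs the cleanup in frame 0 exactly once. Raising a type nobody catches stops after phase 1 without running any cleanups.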
Unwinding through a Signal Frame
When the GCC unwinder is looping through frames the uw_frame_state_for
function will search DWARF information. The DWARF lookup will fail for signal
frames and a fallback mechanism is provided for each architecture to handle
this. For OpenRISC Linux this is handled by
or1k_fallback_frame_state.
To understand how this works let’s look into the Linux kernel a bit.
A process must be context switched to the kernel by either a system call, timer or other
interrupt in order to receive a signal.

The diagram above shows what a process stack looks like after the kernel takes over.
An interrupt frame is pushed to the top of the stack and the pt_regs structure
is filled out containing the processor state before the interrupt.

This second diagram shows what happens when a signal handler is invoked. A new
special signal frame is pushed onto the stack and when the process is resumed
it resumes in the signal handler. In OpenRISC the signal frame is set up by the setup_rt_frame
function, which is reached from do_signal, which calls handle_signal,
which calls setup_rt_frame.
After the signal handler routine runs we return to a special bit of code called
the Trampoline. The trampoline code lives on the stack and runs
sigreturn.
Now back to or1k_fallback_frame_state.
The or1k_fallback_frame_state function checks if the current frame is a
signal frame by confirming the return address points to a Trampoline. If
it is a trampoline it looks into the kernel saved ucontext and pt_regs to find
the previous user frame. Unwinding can then continue as normal.
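The trampoline check boils down to comparing the instruction words at the return address against a known pattern. Below is a minimal sketch of that idea. The function name is hypothetical, and the two instruction encodings are my assumption, worked out from the OpenRISC instruction format for l.ori r11, r0, 139 followed by l.sys 1, with 139 assumed to be the rt_sigreturn syscall number. Check the libgcc and kernel sources before relying on them.

```c
#include <stdint.h>

/* Assumed encodings, see the note above. */
#define TOY_INSN_ORI_R11_SIGRETURN 0xa960008bu /* l.ori r11, r0, 139 */
#define TOY_INSN_SYS_1             0x20000001u /* l.sys 1 */

/* Returns 1 if the code at the return address looks like the signal
   return trampoline, 0 otherwise. ra must point to at least two
   readable instruction words. */
int toy_is_sigreturn_trampoline(const uint32_t *ra)
{
    return ra[0] == TOY_INSN_ORI_R11_SIGRETURN
        && ra[1] == TOY_INSN_SYS_1;
}
```

When the check succeeds a fallback routine can locate the saved registers relative to the frame, instead of relying on DWARF data that does not exist for code written to the stack at runtime.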
Debugging the Issue
Now with a good background in how unwinding works we can start to debug our test
case. We can recall our hypothesis:
Hypothesis: There is a problem handling exceptions while in a syscall.
There may be something broken with OpenRISC related to how we setup stack
frames for syscalls that makes the unwinder fail.
With GDB we can start to debug exception handling. We can trace right to the
start of the exception handling logic by setting our breakpoint at
_Unwind_ForcedUnwind.
This is the stack trace we see:
#0 _Unwind_ForcedUnwind_Phase2 (exc=0x30caf658, context=0x30caeb6c, frames_p=0x30caea90) at ../../../libgcc/unwind.inc:192
#1 0x30303858 in _Unwind_ForcedUnwind (exc=0x30caf658, stop=0x30321dcc <unwind_stop>, stop_argument=0x30caeea4) at ../../../libgcc/unwind.inc:217
#2 0x30321fc0 in __GI___pthread_unwind (buf=<optimized out>) at unwind.c:121
#3 0x30312388 in __do_cancel () at pthreadP.h:313
#4 sigcancel_handler (sig=32, si=0x30caec98, ctx=<optimized out>) at nptl-init.c:162
#5 sigcancel_handler (sig=<optimized out>, si=0x30caec98, ctx=<optimized out>) at nptl-init.c:127
#6 <signal handler called>
#7 0x303266d0 in __futex_abstimed_wait_cancelable64 (futex_word=0x7ffffd78, expected=1, clockid=<optimized out>, abstime=0x0, private=<optimized out>)
at ../sysdeps/nptl/futex-internal.c:66
#8 0x303210f8 in __new_sem_wait_slow64 (sem=0x7ffffd78, abstime=0x0, clockid=0) at sem_waitcommon.c:285
#9 0x00002884 in tf (arg=0x7ffffd78) at throw-pthread-sem.cc:35
#10 0x30314548 in start_thread (arg=<optimized out>) at pthread_create.c:463
#11 0x3043638c in __or1k_clone () from /lib/libc.so.6
Backtrace stopped: frame did not save the PC
(gdb)
In the GDB backtrace we can see it unwinds through the signal frame and sem_wait
all the way to our thread routine tf. It appears everything is working fine.
But we need to remember the backtrace we see above is from GDB’s unwinder, not
GCC’s. It also uses the .debug_frame DWARF data, not .eh_frame.
To really ensure the GCC unwinder is working as expected we need to debug it
walking the stack. Debugging when we unwind a signal frame can be done by
placing a breakpoint on or1k_fallback_frame_state.
Debugging this code as well shows it works correctly.
#0 or1k_fallback_frame_state (context=<optimized out>, context=<optimized out>, fs=<optimized out>) at ../../../libgcc/unwind-dw2.c:1271
#1 uw_frame_state_for (context=0x30caeb6c, fs=0x30cae914) at ../../../libgcc/unwind-dw2.c:1271
#2 0x30303200 in _Unwind_ForcedUnwind_Phase2 (exc=0x30caf658, context=0x30caeb6c, frames_p=0x30caea90) at ../../../libgcc/unwind.inc:162
#3 0x30303858 in _Unwind_ForcedUnwind (exc=0x30caf658, stop=0x30321dcc <unwind_stop>, stop_argument=0x30caeea4) at ../../../libgcc/unwind.inc:217
#4 0x30321fc0 in __GI___pthread_unwind (buf=<optimized out>) at unwind.c:121
#5 0x30312388 in __do_cancel () at pthreadP.h:313
#6 sigcancel_handler (sig=32, si=0x30caec98, ctx=<optimized out>) at nptl-init.c:162
#7 sigcancel_handler (sig=<optimized out>, si=0x30caec98, ctx=<optimized out>) at nptl-init.c:127
#8 <signal handler called>
#9 0x303266d0 in __futex_abstimed_wait_cancelable64 (futex_word=0x7ffffd78, expected=1, clockid=<optimized out>, abstime=0x0, private=<optimized out>) at ../sysdeps/nptl/futex-internal.c:66
#10 0x303210f8 in __new_sem_wait_slow64 (sem=0x7ffffd78, abstime=0x0, clockid=0) at sem_waitcommon.c:285
#11 0x00002884 in tf (arg=0x7ffffd78) at throw-pthread-sem.cc:35
Debugging when the unwinding stops can be done by setting a breakpoint
on the unwind_stop function.
When debugging I was able to see that the unwinder failed when looking for
the __futex_abstimed_wait_cancelable64 frame. So, this is not an issue
with unwinding signal frames.
A Second Hypothesis
Debugging showed that the unwinder is working correctly, and it can properly
unwind through our signal frames. However, the unwinder is bailing out early
before it gets to the tf frame which has the catch block we need to execute.
Hypothesis 2: There is something wrong finding DWARF info for __futex_abstimed_wait_cancelable64.
Looking at libpthread.so with readelf this function was missing completely from the .eh_frame
metadata. Now we found something.
What creates the .eh_frame anyway, GCC or Binutils (the Assembler)? If we run GCC
with the -S argument we can see GCC will output inline .cfi directives.
These .cfi annotations are what gets compiled into the .eh_frame. GCC
creates the .cfi directives and the Assembler puts them into the .eh_frame
section.
An example of gcc -S:
.file "unwind.c"
.section .text
.align 4
.type unwind_stop, @function
unwind_stop:
.LFB83:
.cfi_startproc
l.addi r1, r1, -28
.cfi_def_cfa_offset 28
l.sw 0(r1), r16
l.sw 4(r1), r18
l.sw 8(r1), r20
l.sw 12(r1), r22
l.sw 16(r1), r24
l.sw 20(r1), r26
l.sw 24(r1), r9
.cfi_offset 16, -28
.cfi_offset 18, -24
.cfi_offset 20, -20
.cfi_offset 22, -16
.cfi_offset 24, -12
.cfi_offset 26, -8
.cfi_offset 9, -4
l.or r24, r8, r8
l.or r22, r10, r10
l.lwz r18, -1172(r10)
l.lwz r20, -692(r10)
l.lwz r17, -688(r10)
l.add r20, r20, r17
l.andi r16, r4, 16
l.sfnei r16, 0
When looking at the glibc build I noticed the .eh_frame data for
__futex_abstimed_wait_cancelable64 is missing from futex-internal.o. Looking at
the assembly for this file, the one where unwinding is failing, we find it is completely missing .cfi directives.
Why is GCC not generating .cfi directives for this file?
.file "futex-internal.c"
.section .text
.section .rodata.str1.1,"aMS",@progbits,1
.LC0:
.string "The futex facility returned an unexpected error code.\n"
.section .text
.align 4
.global __futex_abstimed_wait_cancelable64
.type __futex_abstimed_wait_cancelable64, @function
__futex_abstimed_wait_cancelable64:
l.addi r1, r1, -20
l.sw 0(r1), r16
l.sw 4(r1), r18
l.sw 8(r1), r20
l.sw 12(r1), r22
l.sw 16(r1), r9
l.or r22, r3, r3
l.or r20, r4, r4
l.or r16, r6, r6
l.sfnei r6, 0
l.ori r17, r0, 1
l.cmov r17, r17, r0
l.sfeqi r17, 0
l.bnf .L14
l.nop
Looking closer at the build lines of these 2 files I see the build of futex-internal.c
is missing -fexceptions.
This flag is needed to enable the eh_frame section, which is what powers C++
exceptions. The flag is needed when we are building C code which needs to
support C++ exceptions.
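As an illustration of C code that takes part in unwinding, here is a small sketch using the GCC cleanup variable attribute. The function names are hypothetical. On a normal scope exit the cleanup always runs. According to the GCC documentation, when the file is built with -fexceptions the cleanup also runs while an exception unwinds through the frame, and that requires the frame to have unwind data.

```c
/* Counts how many times the cleanup handler ran. */
static int toy_cleanups_run;

/* Cleanup handler, receives a pointer to the variable going out of scope. */
void toy_release(int *resource)
{
    (void)resource;
    toy_cleanups_run++;
}

int toy_cleanup_count(void)
{
    return toy_cleanups_run;
}

/* A C function holding a resource across what could be a blocking,
   cancellable call. */
int toy_wait_with_cleanup(void)
{
    int resource __attribute__((cleanup(toy_release))) = 1;

    /* ... a cancellation point such as sem_wait would sit here ... */

    return resource; /* toy_release runs as the scope is left */
}
```

Comparing the output of gcc -S for this file with and without -fexceptions is an easy way to see what the flag changes on a given target.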
So why is it not enabled? Is this a problem with the GLIBC build?
Looking at GLIBC, the nptl/Makefile sets -fexceptions explicitly for each
C file that needs it. For example:
# The following are cancellation points. Some of the functions can
# block and therefore temporarily enable asynchronous cancellation.
# Those must be compiled asynchronous unwind tables.
CFLAGS-pthread_testcancel.c += -fexceptions
CFLAGS-pthread_join.c += -fexceptions -fasynchronous-unwind-tables
CFLAGS-pthread_timedjoin.c += -fexceptions -fasynchronous-unwind-tables
CFLAGS-pthread_clockjoin.c += -fexceptions -fasynchronous-unwind-tables
CFLAGS-pthread_once.c += $(uses-callbacks) -fexceptions \
-fasynchronous-unwind-tables
CFLAGS-pthread_cond_wait.c += -fexceptions -fasynchronous-unwind-tables
CFLAGS-sem_wait.c += -fexceptions -fasynchronous-unwind-tables
CFLAGS-sem_timedwait.c += -fexceptions -fasynchronous-unwind-tables
CFLAGS-sem_clockwait.c = -fexceptions -fasynchronous-unwind-tables
It is missing such a line for futex-internal.c. The following patch and a
libpthread rebuild fixes the issue!
--- a/nptl/Makefile
+++ b/nptl/Makefile
@@ -220,6 +220,7 @@ CFLAGS-pthread_cond_wait.c += -fexceptions -fasynchronous-unwind-tables
CFLAGS-sem_wait.c += -fexceptions -fasynchronous-unwind-tables
CFLAGS-sem_timedwait.c += -fexceptions -fasynchronous-unwind-tables
CFLAGS-sem_clockwait.c = -fexceptions -fasynchronous-unwind-tables
+CFLAGS-futex-internal.c += -fexceptions -fasynchronous-unwind-tables
# These are the function wrappers we have to duplicate here.
CFLAGS-fcntl.c += -fexceptions -fasynchronous-unwind-tables
I submitted this patch
to GLIBC but it turns out it was already fixed upstream
a few weeks before. Doh.
Summary
I hope the investigation into debugging this C++ exception test case proved interesting.
We can learn a lot about the deep internals of our tools when we have to fix bugs in them.
Like most elusive bugs, in the end this was a trivial fix but required some
key background knowledge.
Additional Reading
21 Jul 2020
This is an ongoing series of posts on ELF Binary Relocations and Thread
Local Storage. This article covers only Thread Local Storage and assumes
the reader has had a primer in ELF Relocations, if not please start with
my previous article ELF Binaries and Relocation Entries.
This is the third part in an illustrated 3 part series covering:
In the last article we covered how Thread Local Storage (TLS) works at runtime,
but how do we get there? How does the compiler and linker create the memory
structures and code fragments described in the previous article?
In this article we will discuss how TLS relocations are implemented. Our
outline:
As before, the examples in this article can be found in my tls-examples
project. Please check it out.
I will assume here that most people understand what a compiler and assembler
basically do, in the sense that a compiler will compile routines
written in C code or something similar to assembly language. It is then up to the
assembler to turn that assembly code into machine code to run on a CPU.
That is a big part of what a toolchain does, and it’s pretty much that simple if
we have a single file of source code. But usually we don’t have a single file,
we have multiple files, the C runtime,
crt0 and other libraries like
libc. These all need to be
put together into our final program, and that is where the complexities of the
linker come in.
In this article I will cover how variables in our source code (symbols) traverse
the toolchain from code to the memory in our final running program. A picture that looks
something like this:

The Compiler
First we start off with how relocations are created and emitted in the compiler.
As I work primarily on the GNU toolchain with
its GCC compiler we will look at that. Let’s get started.
GCC Legitimize Address
To start we define a symbol as a named address in memory. This address can be
a program variable where data is stored or a function reference to where a
subroutine starts.
In GCC we have TARGET_LEGITIMIZE_ADDRESS, the OpenRISC implementation
being or1k_legitimize_address().
It takes a symbol (memory address) and makes it usable in our CPU by generating RTX
sequences that are possible on our CPU to load that address into a register.
RTX represents a tree node in GCC’s register transfer language (RTL). The RTL
Expression is used to express our algorithm as a series of register transfers.
This is used as register transfer is basically what a CPU does.
A snippet from the or1k_legitimize_address() function is below. The argument x
represents our input symbol (memory address) that we need to make usable by our
CPU. This code uses GCC internal APIs to emit RTX code sequences.
static rtx
or1k_legitimize_address (rtx x, rtx /* unused */, machine_mode /* unused */)
...
case TLS_MODEL_NONE:
t1 = can_create_pseudo_p () ? gen_reg_rtx (Pmode) : scratch;
if (!flag_pic)
{
emit_insn (gen_rtx_SET (t1, gen_rtx_HIGH (Pmode, x)));
return gen_rtx_LO_SUM (Pmode, t1, x);
}
else if (is_local)
{
crtl->uses_pic_offset_table = 1;
t2 = gen_sym_unspec (x, UNSPEC_GOTOFF);
emit_insn (gen_rtx_SET (t1, gen_rtx_HIGH (Pmode, t2)));
emit_insn (gen_add3_insn (t1, t1, pic_offset_table_rtx));
return gen_rtx_LO_SUM (Pmode, t1, copy_rtx (t2));
}
else
{
...
We can read the code snippet above as follows:
- This is for the non-TLS case, as we see TLS_MODEL_NONE.
- We reserve a temporary register t1.
- If not using position-independent code (flag_pic) we do:
  - Emit an instruction to put the high bits of x into our temporary register t1.
  - Return the sum of t1 and the low bits of x.
- Otherwise, if the symbol is static (is_local) we do:
  - Mark in the global state that this object file uses the PIC offset table (uses_pic_offset_table).
  - Create a Global Offset Table offset variable t2.
  - Emit an instruction to put the high bits of t2 (the GOT offset) into our temporary register t1.
  - Emit an instruction to put the sum of t1 (high bits of t2) and the GOT into t1.
  - Return the sum of t1 and the low bits of t2.
You may have noticed that the local symbol still used the global offset
table (GOT). This is
because position-independent code requires using the GOT to reference symbols.
An example, from nontls.c:
static int x;
int *get_x_addr() {
return &x;
}
For the non-PIC case above, when we look at the assembly code generated by GCC
we can see the following:
.file "nontls.c"
.section .text
.local x
.comm x,4,4
.align 4
.global get_x_addr
.type get_x_addr, @function
get_x_addr:
l.addi r1, r1, -8 # \
l.sw 0(r1), r2 # | function prologue
l.addi r2, r1, 8 # |
l.sw 4(r1), r9 # /
l.movhi r17, ha(x) # \__ legitimize address of x into r17
l.addi r17, r17, lo(x) # /
l.or r11, r17, r17 # } place result in return register r11
l.lwz r2, 0(r1) # \
l.lwz r9, 4(r1) # | function epilogue
l.addi r1, r1, 8 # |
l.jr r9 # |
l.nop # /
.size get_x_addr, .-get_x_addr
.ident "GCC: (GNU) 9.0.1 20190409 (experimental)"
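The l.movhi and l.addi pair splits a 32-bit address into a high and a low 16-bit part. Because l.addi sign extends its immediate, the high part has to be adjusted by adding 0x8000 before shifting. The sketch below works through that arithmetic. The toy_ function names are hypothetical, and the formulas are my reading of how ha() and lo() behave rather than code taken from the toolchain.

```c
#include <stdint.h>

/* High 16 bits, adjusted to compensate for the sign extended low part. */
uint32_t toy_ha(uint32_t addr)
{
    return ((addr + 0x8000u) >> 16) & 0xffffu;
}

/* Low 16 bits interpreted as a signed 16-bit immediate. */
int32_t toy_lo(uint32_t addr)
{
    uint32_t low = addr & 0xffffu;
    return low >= 0x8000u ? (int32_t)low - 0x10000 : (int32_t)low;
}

/* Models:  l.movhi r17, ha(x)
            l.addi  r17, r17, lo(x)  */
uint32_t toy_movhi_addi(uint32_t addr)
{
    uint32_t r17 = toy_ha(addr) << 16; /* l.movhi */
    r17 += (uint32_t)toy_lo(addr);     /* l.addi, wraps modulo 2^32 */
    return r17;
}
```

For example with the address 0x12348000 the low part is -32768, so the high part becomes 0x1235 rather than 0x1234, and adding the two gives back 0x12348000.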
For the local PIC case above, the same code compiled with the -fPIC GCC option
looks like the following:
.file "nontls.c"
.section .text
.local x
.comm x,4,4
.align 4
.global get_x_addr
.type get_x_addr, @function
get_x_addr:
l.addi r1, r1, -8 # \
l.sw 0(r1), r2 # | function prologue
l.addi r2, r1, 8 # |
l.sw 4(r1), r9 # /
l.jal 8 # \
l.movhi r19, gotpchi(_GLOBAL_OFFSET_TABLE_-4) # | PC relative, put
l.ori r19, r19, gotpclo(_GLOBAL_OFFSET_TABLE_+0) # | GOT into r19
l.add r19, r19, r9 # /
l.movhi r17, gotoffha(x) # \
l.add r17, r17, r19 # | legitimize address of x into r17
l.addi r17, r17, gotofflo(x) # /
l.or r11, r17, r17 # } place result in return register r11
l.lwz r2, 0(r1) # \
l.lwz r9, 4(r1) # | function epilogue
l.addi r1, r1, 8 # |
l.jr r9 # |
l.nop # /
.size get_x_addr, .-get_x_addr
.ident "GCC: (GNU) 9.0.1 20190409 (experimental)"
TLS and Addend cases are also handled by or1k_legitimize_address().
GCC Print Operand
Once RTX is generated by legitimize address and the GCC passes
have run all of their optimizations, the RTX needs to be printed out as assembly code. During
this process relocations are printed by the GCC macros TARGET_PRINT_OPERAND_ADDRESS
and TARGET_PRINT_OPERAND. In OpenRISC these are defined
by or1k_print_operand_address()
and or1k_print_operand().
Let us have a look at or1k_print_operand_address().
/* Worker for TARGET_PRINT_OPERAND_ADDRESS.
Prints the argument ADDR, an address RTX, to the file FILE. The output is
formed as expected by the OpenRISC assembler. Examples:
RTX OUTPUT
(reg:SI 3) 0(r3)
(plus:SI (reg:SI 3) (const_int 4)) 0x4(r3)
(lo_sum:SI (reg:SI 3) (symbol_ref:SI ("x"))) lo(x)(r3) */
static void
or1k_print_operand_address (FILE *file, machine_mode, rtx addr)
{
rtx offset;
switch (GET_CODE (addr))
{
case REG:
fputc ('0', file);
break;
case ...
case LO_SUM:
offset = XEXP (addr, 1);
addr = XEXP (addr, 0);
print_reloc (file, offset, 0, RKIND_LO);
break;
default: ...
}
fprintf (file, "(%s)", reg_names[REGNO (addr)]);
}
The above code snippet can be read as we explain below, but let’s first
make some notes:
- The input RTX addr for TARGET_PRINT_OPERAND_ADDRESS will usually contain a register and an offset. Typically this is used for LOAD and STORE operations.
- Think of the RTX addr as a node in an AST.
- RTX nodes with code REG or SYMBOL_REF are always leaf nodes.
With that, and if we use the or1k_print_operand_address() C comments above as examples
of some RTX addr input, we will have:
   RTX     |     (reg:SI 3)          (lo_sum:SI (reg:SI 3) (symbol_ref:SI("x")))
-----------+--------------------------------------------------------------------
   TREE    |
  (code)   |  (code:REG regno:3)                  (code:LO_SUM)
   /  \    |                                       /         \
 (0)  (1)  |                        (code:REG regno:3)  (code:SYMBOL_REF "x")
We can now read the above snippet as:
- First get the CODE of the RTX.
  - If CODE is REG (a register) then our offset can be 0.
  - If CODE is LO_SUM (an addition operation) then we need to break it down to:
    - Arg 0 is our new addr RTX (which we assume is a register)
    - Arg 1 is an offset (which we then print with print_reloc())
- Second print out the register name now in addr i.e. “r3”.
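The steps above can be sketched with a toy tree type. This is not GCC's rtx type or API. The struct toy_rtx layout and all toy_ names are hypothetical, and the output goes to a string buffer instead of a FILE so it is easy to inspect.

```c
#include <stddef.h>
#include <stdio.h>

enum toy_code { TOY_REG, TOY_SYMBOL_REF, TOY_LO_SUM };

/* A tiny stand-in for an RTX node. */
struct toy_rtx {
    enum toy_code code;
    int regno;                   /* used when code is TOY_REG */
    const char *symbol;          /* used when code is TOY_SYMBOL_REF */
    const struct toy_rtx *op[2]; /* used when code is TOY_LO_SUM */
};

/* Prints addr in "offset(register)" form. Returns the number of
   characters written, or -1 for an address form we do not handle. */
int toy_print_operand_address(char *out, size_t len, const struct toy_rtx *addr)
{
    char offset[64];

    switch (addr->code) {
    case TOY_REG:
        /* A bare register, the offset is simply 0. */
        snprintf(offset, sizeof offset, "0");
        break;
    case TOY_LO_SUM:
        /* Arg 1 is the symbol offset, arg 0 becomes the new addr. */
        snprintf(offset, sizeof offset, "lo(%s)", addr->op[1]->symbol);
        addr = addr->op[0];
        break;
    default:
        return -1;
    }
    /* Here addr is assumed to be a register node. */
    return snprintf(out, len, "%s(r%d)", offset, addr->regno);
}
```

Feeding it a register node for r3 produces 0(r3), and a LO_SUM of r3 and the symbol x produces lo(x)(r3), matching the examples in the comment of the real function.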
The code of or1k_print_operand() is similar and the reader may be inclined to
read more details. With that we can move on to the assembler.
TLS cases are also handled inside of the print_reloc() function.
The Assembler
In the GNU Toolchain our assembler is GAS, part of binutils.
The code that handles relocations is found in the function
parse_reloc()
found in opcodes/or1k-asm.c. The function parse_reloc() is the direct counterpart of GCC’s print_reloc()
discussed above. This is actually part of or1k_cgen_parse_operand()
which is wired into our assembler generator CGEN used for parsing operands.
If we are parsing a relocation like the one from above lo(x) then we can
isolate the code that processes that relocation.
static const bfd_reloc_code_real_type or1k_imm16_relocs[][6] = {
{ BFD_RELOC_LO16,
BFD_RELOC_OR1K_SLO16,
...
BFD_RELOC_OR1K_TLS_LE_AHI16 },
};
static int
parse_reloc (const char **strp)
{
const char *str = *strp;
enum or1k_rclass cls = RCLASS_DIRECT;
enum or1k_rtype typ;
...
else if (strncasecmp (str, "lo(", 3) == 0)
{
str += 3;
typ = RTYPE_LO;
}
...
*strp = str;
return (cls << RCLASS_SHIFT) | typ;
}
This uses strncasecmp to match
our "lo(" string pattern. The returned result is a relocation type and relocation class
which are used to look up the relocation BFD_RELOC_LO16 in the or1k_imm16_relocs[][] table,
which is indexed by relocation class and relocation type.
The assembler will encode that into the ELF binary. For TLS relocations the exact same
pattern is used.
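Here is a toy version of the same parse and lookup pattern. The enum values, the table contents and the toy_ names are all hypothetical. Only the overall shape, a prefix match with strncasecmp followed by a two dimensional table lookup, follows the description above.

```c
#include <stddef.h>
#include <strings.h> /* strncasecmp (POSIX) */

enum toy_rclass { TOY_RCLASS_DIRECT, TOY_RCLASS_GOTOFF, TOY_RCLASS_COUNT };
enum toy_rtype  { TOY_RTYPE_LO, TOY_RTYPE_HA, TOY_RTYPE_COUNT };

#define TOY_RCLASS_SHIFT 3
#define TOY_RTYPE_MASK   ((1 << TOY_RCLASS_SHIFT) - 1)

/* Relocation names indexed by [class][type]. */
static const char *const toy_relocs[TOY_RCLASS_COUNT][TOY_RTYPE_COUNT] = {
    { "TOY_RELOC_LO16",        "TOY_RELOC_AHI16" },
    { "TOY_RELOC_GOTOFF_LO16", "TOY_RELOC_GOTOFF_AHI16" },
};

/* Parses an optional class prefix and a type such as "lo(" at *strp.
   On success advances *strp past the match and returns the class and
   type packed into one int. Returns -1 if no relocation is found. */
int toy_parse_reloc(const char **strp)
{
    const char *str = *strp;
    enum toy_rclass cls = TOY_RCLASS_DIRECT;
    enum toy_rtype typ;

    if (strncasecmp(str, "gotoff", 6) == 0) {
        cls = TOY_RCLASS_GOTOFF;
        str += 6;
    }
    if (strncasecmp(str, "lo(", 3) == 0)
        typ = TOY_RTYPE_LO;
    else if (strncasecmp(str, "ha(", 3) == 0)
        typ = TOY_RTYPE_HA;
    else
        return -1;
    str += 3;

    *strp = str;
    return ((int)cls << TOY_RCLASS_SHIFT) | (int)typ;
}

/* Maps the packed result back to a relocation name, NULL on error. */
const char *toy_lookup_reloc(int parsed)
{
    if (parsed < 0)
        return NULL;
    return toy_relocs[parsed >> TOY_RCLASS_SHIFT][parsed & TOY_RTYPE_MASK];
}
```

Parsing lo(x) leaves the string pointer at x) and looks up TOY_RELOC_LO16, while GOTOFFHA(x) selects the second row of the table. The comparison is case insensitive, just like the real parser.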
The Linker
In the GNU Toolchain our object linker is the GNU linker LD, also part of the
binutils project.
The GNU linker uses the framework
BFD or Binary File
Descriptor which
is a beast. It is not only used in the linker but also used in GDB, the GNU
Simulator and the objdump tool.
What makes this possible is a rather complex API.
BFD Linker API
The BFD API is a generic binary file access API. It has been designed to support multiple
file formats and architectures via an object oriented, polymorphic API all written in C. It supports file formats
including a.out, COFF and ELF as well as unexpected file formats like verilog hex memory dumps.
Here we will concentrate on the BFD ELF implementation.
The API definition is split across multiple files which include:
- bfd/bfd-in.h - top level generic APIs including bfd_hash_table
- bfd/bfd-in2.h - top level binary file APIs including bfd and asection
- include/bfdlink.h - generic bfd linker APIs including bfd_link_info and bfd_link_hash_table
- bfd/elf-bfd.h - extensions to the APIs for ELF binaries including elf_link_hash_table
- bfd/elf{wordsize}-{architecture}.c - architecture specific implementations
For each architecture implementations are defined in bfd/elf{wordsize}-{architecture}.c. For
example for OpenRISC we have
bfd/elf32-or1k.c.
Throughout the linker code we see access to the BFD Linker and ELF APIs. Some key symbols to watch out for include:
- info - A reference to bfd_link_info, the top level reference to all linker state.
- htab - A pointer to elf_or1k_link_hash_table from or1k_elf_hash_table (info), a hash table on steroids which stores generic link state and arch specific state. It is also a hash table of all global symbols by name. It contains:
  - htab->root.splt - the output .plt section
  - htab->root.sgot - the output .got section
  - htab->root.srelgot - the output .relgot section (relocations against the got)
  - htab->root.sgotplt - the output .gotplt section
  - htab->root.dynobj - a special bfd to which sections are added (created in or1k_elf_check_relocs)
- sym_hashes - From elf_sym_hashes (abfd), a list of hash entries for global symbols in a bfd, indexed by the relocation index ELF32_R_SYM (rel->r_info).
- h - A pointer to a struct elf_link_hash_entry, represents link state of a global symbol. It contains:
  - h->got - A union of different attributes with different roles based on link phase.
    - h->got.refcount - used during phase 1 to count the symbol .got section references
    - h->got.offset - used during phase 2 to record the symbol .got section offset
  - h->plt - A union with the same function as h->got but used for the .plt section.
  - h->root.root.string - The symbol name
- local_got - an array of unsigned long from elf_local_got_refcounts (ibfd) with the same function as h->got but for local symbols. The meaning of the unsigned long changes based on the link phase. Ideally this should also be a union.
- tls_type - Retrieved by ((struct elf_or1k_link_hash_entry *) h)->tls_type, used to store the tls_type of a global symbol.
- local_tls_type - Retrieved by elf_or1k_local_tls_type(abfd), an entry to store tls_type for local symbols, when h is NULL.
- root - The struct field root is used in subclasses to represent the parent class, similar to how super is used in other languages.
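The root convention is ordinary C struct embedding: each "subclass" places its parent as the first member, so a pointer to the parent can be cast to the child and back. The sketch below uses hypothetical toy_ types that only mimic the nesting, they are not the BFD definitions.

```c
#include <stddef.h>

/* Base "class": a generic hash table entry keyed by name. */
struct toy_hash_entry {
    const char *string;
};

/* Generic linker entry, "extends" toy_hash_entry. */
struct toy_link_hash_entry {
    struct toy_hash_entry root;
};

/* ELF linker entry, "extends" toy_link_hash_entry. */
struct toy_elf_link_hash_entry {
    struct toy_link_hash_entry root;
    long got_refcount;
};

/* Architecture specific entry, "extends" toy_elf_link_hash_entry. */
struct toy_or1k_link_hash_entry {
    struct toy_elf_link_hash_entry root;
    unsigned char tls_type;
};

/* Walk up through the parents to reach the symbol name. */
const char *toy_symbol_name(const struct toy_elf_link_hash_entry *h)
{
    return h->root.root.string;
}

/* Downcast to the arch specific child. This is valid because root is
   the first member, so both structs start at the same address. The
   caller must know that h really is embedded in the larger struct. */
unsigned char toy_tls_type(struct toy_elf_link_hash_entry *h)
{
    return ((struct toy_or1k_link_hash_entry *) h)->tls_type;
}
```

This is why expressions like h->root.root.string and the cast to struct elf_or1k_link_hash_entry both show up in the linker code: the first walks up the hierarchy and the second walks down.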
Putting it all together we have a diagram like the following:

Now that we have a bit of understanding of the data structures
we can look to the link algorithm.
The link process in the GNU Linker can be thought of in phases.
Phase 1 - Book Keeping (check_relocs)
The or1k_elf_check_relocs() function is called during the first phase to
do book keeping on relocations. The function signature looks like:
static bfd_boolean
or1k_elf_check_relocs (bfd *abfd,
struct bfd_link_info *info,
asection *sec,
const Elf_Internal_Rela *relocs)
#define elf_backend_check_relocs or1k_elf_check_relocs
The arguments being:
- abfd - The current elf object file we are working on
- info - The BFD API
- sec - The current elf section we are working on
- relocs - The relocations from the current section
It does the book keeping by looping over relocations for the provided section
and updating the local and global symbol properties.
For local symbols:
...
else
{
unsigned char *local_tls_type;
/* This is a TLS type record for a local symbol. */
local_tls_type = (unsigned char *) elf_or1k_local_tls_type (abfd);
if (local_tls_type == NULL)
{
bfd_size_type size;
size = symtab_hdr->sh_info;
local_tls_type = bfd_zalloc (abfd, size);
if (local_tls_type == NULL)
return FALSE;
elf_or1k_local_tls_type (abfd) = local_tls_type;
}
local_tls_type[r_symndx] |= tls_type;
}
...
else
{
bfd_signed_vma *local_got_refcounts;
/* This is a global offset table entry for a local symbol. */
local_got_refcounts = elf_local_got_refcounts (abfd);
if (local_got_refcounts == NULL)
{
bfd_size_type size;
size = symtab_hdr->sh_info;
size *= sizeof (bfd_signed_vma);
local_got_refcounts = bfd_zalloc (abfd, size);
if (local_got_refcounts == NULL)
return FALSE;
elf_local_got_refcounts (abfd) = local_got_refcounts;
}
local_got_refcounts[r_symndx] += 1;
}
The above is pretty straightforward and can be read as:
- The first part stores local symbol TLS type information:
  - If the local_tls_type array is not initialized:
    - Allocate it, 1 entry for each local variable
  - Record the TLS type in local_tls_type for the current symbol
- The second part records .got section references:
  - If the local_got_refcounts array is not initialized:
    - Allocate it, 1 entry for each local variable
  - Record a reference by incrementing local_got_refcounts for the current symbol
For global symbols it is much easier, we see:
...
if (h != NULL)
((struct elf_or1k_link_hash_entry *) h)->tls_type |= tls_type;
else
...
if (h != NULL)
h->got.refcount += 1;
else
...
As the tls_type and refcount fields are available directly on each
hash_entry, handling global symbols is much easier.
- The first part stores TLS type information:
  - Record the TLS type in tls_type for the current hash_entry
- The second part records .got section references:
  - Record a reference by incrementing got.refcount for the hash_entry
The above is repeated for all relocations and all input sections. A few other
things are also done including accounting for .plt entries.
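The local symbol book keeping can be sketched as below. The struct toy_object type and function names are hypothetical. A real bfd keeps these arrays behind accessor macros and allocates them with bfd_zalloc, here plain calloc is used instead.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the per object file state. */
struct toy_object {
    size_t local_sym_count;        /* like symtab_hdr->sh_info */
    long *local_got_refcounts;     /* allocated lazily on first use */
    unsigned char *local_tls_type; /* allocated lazily on first use */
};

/* Records one .got reference of the given TLS type against a local
   symbol. Returns 1 on success, 0 on a bad index or allocation failure. */
int toy_record_local_got_ref(struct toy_object *obj, size_t r_symndx,
                             unsigned char tls_type)
{
    if (r_symndx >= obj->local_sym_count)
        return 0;

    if (obj->local_tls_type == NULL) {
        /* One zeroed byte per local symbol. */
        obj->local_tls_type = calloc(obj->local_sym_count, 1);
        if (obj->local_tls_type == NULL)
            return 0;
    }
    obj->local_tls_type[r_symndx] |= tls_type;

    if (obj->local_got_refcounts == NULL) {
        /* One zeroed counter per local symbol. */
        obj->local_got_refcounts =
            calloc(obj->local_sym_count, sizeof *obj->local_got_refcounts);
        if (obj->local_got_refcounts == NULL)
            return 0;
    }
    obj->local_got_refcounts[r_symndx] += 1;
    return 1;
}

/* Releases the lazily allocated arrays. */
void toy_object_free(struct toy_object *obj)
{
    free(obj->local_got_refcounts);
    free(obj->local_tls_type);
    obj->local_got_refcounts = NULL;
    obj->local_tls_type = NULL;
}
```

Two relocations against the same local symbol with different TLS types end up as a refcount of 2 and the two type bits ORed together, which is the information phase 2 needs to size the sections.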
Phase 2 - creating space (size_dynamic_sections + _bfd_elf_create_dynamic_sections)
The or1k_elf_size_dynamic_sections()
function iterates over all input object files to calculate the size required for
output sections. The _bfd_elf_create_dynamic_sections() function does the
actual section allocation, we use the generic version.
Setting up the sizes of the .got section (global offset table) and .plt
section (procedure linkage table) is done here.
The definition is as below:
static bfd_boolean
or1k_elf_size_dynamic_sections (bfd *output_bfd ATTRIBUTE_UNUSED,
struct bfd_link_info *info)
#define elf_backend_size_dynamic_sections or1k_elf_size_dynamic_sections
#define elf_backend_create_dynamic_sections _bfd_elf_create_dynamic_sections
The arguments to or1k_elf_size_dynamic_sections() being:
- output_bfd - Unused, the output elf object
- info - the BFD API which provides access to everything we need
Internally the function uses:
- htab - from or1k_elf_hash_table (info)
  - htab->root.dynamic_sections_created - true if sections like .interp have been created by the linker
- ibfd - a bfd pointer from info->input_bfds, represents an input object when iterating.
- s->size - represents the output .got section size, which we will be incrementing.
- srel->size - represents the output .got.rela section size, which will contain relocations against the .got section
During the first part of phase 2 we set .got and .got.rela section sizes
for local symbols with this code:
/* Set up .got offsets for local syms, and space for local dynamic
relocs. */
for (ibfd = info->input_bfds; ibfd != NULL; ibfd = ibfd->link.next)
{
...
local_got = elf_local_got_refcounts (ibfd);
if (!local_got)
continue;
symtab_hdr = &elf_tdata (ibfd)->symtab_hdr;
locsymcount = symtab_hdr->sh_info;
end_local_got = local_got + locsymcount;
s = htab->root.sgot;
srel = htab->root.srelgot;
local_tls_type = (unsigned char *) elf_or1k_local_tls_type (ibfd);
for (; local_got < end_local_got; ++local_got)
{
if (*local_got > 0)
{
unsigned char tls_type = (local_tls_type == NULL)
? TLS_UNKNOWN
: *local_tls_type;
*local_got = s->size;
or1k_set_got_and_rela_sizes (tls_type, bfd_link_pic (info),
&s->size, &srel->size);
}
else
*local_got = (bfd_vma) -1;
if (local_tls_type)
++local_tls_type;
}
}
Here, for example, we can see that we iterate over each input elf object ibfd and,
for each local symbol (local_got), we update s->size and srel->size to
account for the required size.
The above can be read as:
- For each local_got entry:
  - If the local symbol is used in the .got section:
    - Get the tls_type byte stored in the local_tls_type array
    - Set the offset local_got to the section offset s->size. This is used in phase 3 to tell us where we need to write the symbol into the .got section.
    - Update s->size and srel->size using or1k_set_got_and_rela_sizes()
  - If the local symbol is not used in the .got section:
    - Set the offset local_got to -1, to indicate not used
In the next part of phase 2 we allocate space for all global symbols by
iterating through symbols in htab with the allocate_dynrelocs iterator. To
do that we call:
elf_link_hash_traverse (&htab->root, allocate_dynrelocs, info);
Inside allocate_dynrelocs() we record the space used for relocations and
the .got and .plt sections. Example:
if (h->got.refcount > 0)
{
asection *sgot;
bfd_boolean dyn;
unsigned char tls_type;
...
sgot = htab->root.sgot;
h->got.offset = sgot->size;
tls_type = ((struct elf_or1k_link_hash_entry *) h)->tls_type;
dyn = htab->root.dynamic_sections_created;
dyn = WILL_CALL_FINISH_DYNAMIC_SYMBOL (dyn, bfd_link_pic (info), h);
or1k_set_got_and_rela_sizes (tls_type, dyn,
&sgot->size, &htab->root.srelgot->size);
}
else
h->got.offset = (bfd_vma) -1;
The above, with h being our global symbol, a pointer to struct elf_link_hash_entry,
can be read as:
- If the symbol will be in the .got section:
  - Get the global reference to the .got section and put it in sgot
  - Set the got location h->got.offset for the symbol to the current got section size sgot->size.
  - Set dyn to true if we will be doing a dynamic link.
  - Call or1k_set_got_and_rela_sizes() to update the sizes for the .got and .got.rela sections.
- If the symbol is not going to be in the .got section:
  - Set the got location h->got.offset to -1
The function or1k_set_got_and_rela_sizes() used above increments the
.got and .rela section sizes, accounting for whether these are TLS symbols, which
need additional entries and relocations.
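The size accounting described above can be sketched in isolation. This is a simplified model, not the binutils code: the `sk_` names, the flag values and the entry sizes (4-byte GOT words, 12-byte RELA records) are assumptions made for illustration, and the real function may differ in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the TLS type flags used by the backend.  */
enum { SK_TLS_NONE = 0, SK_TLS_GD = 1, SK_TLS_IE = 2 };

/* Assumed sizes: a 32-bit GOT word and a 12-byte Elf32 RELA record.  */
enum { SK_GOT_ENTRY = 4, SK_RELA_ENTRY = 12 };

/* Grow the running .got and relocation section sizes for one symbol.  */
static void
sk_set_got_and_rela_sizes (unsigned tls_type, bool dynamic,
                           uint32_t *got_size, uint32_t *rela_size)
{
  assert (got_size != NULL && rela_size != NULL);

  /* General Dynamic needs two GOT words, everything else needs one.  */
  unsigned entries = 0;
  if (tls_type & SK_TLS_GD)
    entries += 2;
  if (tls_type & SK_TLS_IE)
    entries += 1;
  if (entries == 0)
    entries = 1;

  *got_size += entries * SK_GOT_ENTRY;
  /* Dynamic links also need one relocation per GOT word.  */
  if (dynamic)
    *rela_size += entries * SK_RELA_ENTRY;
}

/* Turn each used reference count into a .got offset, as in phase 2.  */
static void
sk_assign_local_got (uint32_t *local_got, const unsigned char *tls_types,
                     size_t count, bool pic,
                     uint32_t *got_size, uint32_t *rela_size)
{
  for (size_t i = 0; i < count; i++)
    {
      if (local_got[i] > 0)
        {
          local_got[i] = *got_size;   /* where phase 3 will write */
          sk_set_got_and_rela_sizes (tls_types[i], pic, got_size, rela_size);
        }
      else
        local_got[i] = UINT32_MAX;    /* the -1 "not used" marker */
    }
}
```

Note how the same array is reused: it starts out holding reference counts and ends up holding section offsets, which is why the offset must be stored before the size is incremented.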
Phase 3 - linking (relocate_section)
The or1k_elf_relocate_section()
function is called to fill in the relocation holes in the output binary .text
section. It does this by looping over relocations and writing to the .text
section the correct symbol value (memory address). It also updates other output
binary sections like the .got section. Also, for dynamic executables and
libraries new relocations may be written to .rela sections.
The function signature looks as follows:
static bfd_boolean
or1k_elf_relocate_section (bfd *output_bfd,
struct bfd_link_info *info,
bfd *input_bfd,
asection *input_section,
bfd_byte *contents,
Elf_Internal_Rela *relocs,
Elf_Internal_Sym *local_syms,
asection **local_sections)
#define elf_backend_relocate_section or1k_elf_relocate_section
The arguments to or1k_elf_relocate_section() being:
output_bfd - the output elf object we will be writing to
info - the BFD API which provides access to everything we need
input_bfd - the current input elf object being iterated over
input_section - the current .text section in the input elf object being iterated
over. From here we get .text section output details for pc relative relocations:
input_section->output_section->vma - the location of the output section.
input_section->output_offset - the output offset
contents - the output file buffer we will write to
relocs - relocations from the current input section
local_syms - an array of local symbols used to get the relocation value for local symbols
local_sections - an array of input sections for local symbols, used to get the relocation value for local symbols
Internally the function uses:
- or1k_elf_howto_table - not mentioned until now, but an array of howto structs indexed by relocation enum. The howto struct expresses the algorithm required to update the relocation.
- relocation - a bfd_vma, the value of the relocation symbol (memory address) to be written to the location in the output file that needs to be updated for the relocation.
- value - the value that needs to be written to the relocation location.
During the first part of relocate_section we see:
if (r_symndx < symtab_hdr->sh_info)
{
sym = local_syms + r_symndx;
sec = local_sections[r_symndx];
relocation = _bfd_elf_rela_local_sym (output_bfd, sym, &sec, rel);
name = bfd_elf_string_from_elf_section
(input_bfd, symtab_hdr->sh_link, sym->st_name);
name = name == NULL ? bfd_section_name (sec) : name;
}
else
{
bfd_boolean unresolved_reloc, warned, ignored;
RELOC_FOR_GLOBAL_SYMBOL (info, input_bfd, input_section, rel,
r_symndx, symtab_hdr, sym_hashes,
h, sec, relocation,
unresolved_reloc, warned, ignored);
name = h->root.root.string;
}
This can be read as:
- If the current symbol is a local symbol:
  - We initialize relocation to the local symbol value using _bfd_elf_rela_local_sym().
- Otherwise the current symbol is global:
  - We use the RELOC_FOR_GLOBAL_SYMBOL() macro to initialize relocation.
During the next part we use the howto information to update the relocation value, and also
add relocations to the output file. For example:
case R_OR1K_TLS_GD_HI16:
case R_OR1K_TLS_GD_LO16:
case R_OR1K_TLS_GD_PG21:
case R_OR1K_TLS_GD_LO13:
case R_OR1K_TLS_IE_HI16:
case R_OR1K_TLS_IE_LO16:
case R_OR1K_TLS_IE_PG21:
case R_OR1K_TLS_IE_LO13:
case R_OR1K_TLS_IE_AHI16:
{
bfd_vma gotoff;
Elf_Internal_Rela rela;
asection *srelgot;
bfd_byte *loc;
bfd_boolean dynamic;
int indx = 0;
unsigned char tls_type;
srelgot = htab->root.srelgot;
/* Mark as TLS related GOT entry by setting
bit 2 to indicate TLS and bit 1 to indicate GOT. */
if (h != NULL)
{
gotoff = h->got.offset;
tls_type = ((struct elf_or1k_link_hash_entry *) h)->tls_type;
h->got.offset |= 3;
}
else
{
unsigned char *local_tls_type;
gotoff = local_got_offsets[r_symndx];
local_tls_type = (unsigned char *) elf_or1k_local_tls_type (input_bfd);
tls_type = local_tls_type == NULL ? TLS_NONE
: local_tls_type[r_symndx];
local_got_offsets[r_symndx] |= 3;
}
/* Only process the relocation once. */
if ((gotoff & 1) != 0)
{
gotoff += or1k_initial_exec_offset (howto, tls_type);
/* The PG21 and LO13 relocs are pc-relative, while the
rest are GOT relative. */
relocation = got_base + (gotoff & ~3);
if (!(r_type == R_OR1K_TLS_GD_PG21
|| r_type == R_OR1K_TLS_GD_LO13
|| r_type == R_OR1K_TLS_IE_PG21
|| r_type == R_OR1K_TLS_IE_LO13))
relocation -= got_sym_value;
break;
}
...
/* Static GD. */
else if ((tls_type & TLS_GD) != 0)
{
bfd_put_32 (output_bfd, 1, sgot->contents + gotoff);
bfd_put_32 (output_bfd, tpoff (info, relocation, dynamic),
sgot->contents + gotoff + 4);
}
gotoff += or1k_initial_exec_offset (howto, tls_type);
...
/* Static IE. */
else if ((tls_type & TLS_IE) != 0)
bfd_put_32 (output_bfd, tpoff (info, relocation, dynamic),
sgot->contents + gotoff);
/* The PG21 and LO13 relocs are pc-relative, while the
rest are GOT relative. */
relocation = got_base + gotoff;
if (!(r_type == R_OR1K_TLS_GD_PG21
|| r_type == R_OR1K_TLS_GD_LO13
|| r_type == R_OR1K_TLS_IE_PG21
|| r_type == R_OR1K_TLS_IE_LO13))
relocation -= got_sym_value;
}
break;
Here we process the relocation for TLS General Dynamic and Initial Exec relocations. I have trimmed
out the shared cases to save space.
This can be read as:
- Get a reference to the output relocation section srelgot.
- Get the got offset which we set up during phase 2 for global or local symbols.
- Mark the symbol as using a TLS got entry. This offset |= 3 trick is possible because on 32-bit machines the 2 lower bits of a word aligned offset are free. This is used during phase 4.
- If we have already processed this symbol once:
  - Update relocation to the location in the output .got section and break; we only need to create .got entries once.
- Otherwise populate .got section entries:
  - For General Dynamic
    - Put 2 entries into the output elf object .got section, a literal 1 and the thread pointer offset
  - For Initial Exec
    - Put 1 entry into the output elf object .got section, the thread pointer offset
  - Finally update the relocation to the location in the output .got section
In the last part of the loop we write the relocation value to the output
.text section. This is done with the or1k_final_link_relocate()
function.
r = or1k_final_link_relocate (howto, input_bfd, input_section, contents,
rel->r_offset, relocation + rel->r_addend);
With this the .text section is complete.
Phase 4 - finishing up (finish_dynamic_symbol + finish_dynamic_sections)
During phase 3 above we wrote the .text section out to file. During the
final finishing up phase we need to write the remaining sections. This
includes the .plt section and more writes to the .got section.
This also includes the .plt.rela and .got.rela sections which contain
dynamic relocation entries.
Writing of the data sections is handled by
or1k_elf_finish_dynamic_sections()
and writing of the relocation sections is handled by
or1k_elf_finish_dynamic_symbol(). These are defined as below.
static bfd_boolean
or1k_elf_finish_dynamic_sections (bfd *output_bfd,
struct bfd_link_info *info)
static bfd_boolean
or1k_elf_finish_dynamic_symbol (bfd *output_bfd,
struct bfd_link_info *info,
struct elf_link_hash_entry *h,
Elf_Internal_Sym *sym)
#define elf_backend_finish_dynamic_sections or1k_elf_finish_dynamic_sections
#define elf_backend_finish_dynamic_symbol or1k_elf_finish_dynamic_symbol
A snippet for the or1k_elf_finish_dynamic_sections() shows how when writing to
the .plt section assembly code needs to be injected. This is where the first
entry in the .plt section is written.
else if (bfd_link_pic (info))
{
plt0 = OR1K_LWZ(15, 16) | 8; /* .got+8 */
plt1 = OR1K_LWZ(12, 16) | 4; /* .got+4 */
plt2 = OR1K_NOP;
}
else
{
unsigned ha = ((got_addr + 0x8000) >> 16) & 0xffff;
unsigned lo = got_addr & 0xffff;
plt0 = OR1K_MOVHI(12) | ha;
plt1 = OR1K_LWZ(15,12) | (lo + 8);
plt2 = OR1K_LWZ(12,12) | (lo + 4);
}
or1k_write_plt_entry (output_bfd, splt->contents,
plt0, plt1, plt2, OR1K_JR(15));
elf_section_data (splt->output_section)->this_hdr.sh_entsize = 4;
Here we see a write to output_bfd, which represents the output object file
we are writing to. The argument splt->contents is the buffer holding the
contents of the output .plt section. Next we see the line
elf_section_data (splt->output_section)->this_hdr.sh_entsize = 4
which sets the entry size recorded in the section header of the output .plt section.
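The ha and lo values computed in the non-PIC branch above deserve a closer look. The l.lwz immediate is sign extended, so when the low half has its top bit set the high half must be rounded up, which is what adding 0x8000 does. A minimal sketch of that arithmetic, with hypothetical `sk_` helper names:

```c
#include <assert.h>
#include <stdint.h>

typedef struct
{
  uint16_t ha;   /* high half, adjusted for a signed low half */
  uint16_t lo;   /* low half, later used as a signed immediate */
} sk_hilo;

/* Split an address the same way the PLT code above does.  */
static sk_hilo
sk_split (uint32_t addr)
{
  sk_hilo parts;
  parts.ha = (uint16_t) (((addr + 0x8000u) >> 16) & 0xffffu);
  parts.lo = (uint16_t) (addr & 0xffffu);
  return parts;
}

/* Model l.movhi followed by a load with a sign extended immediate.  */
static uint32_t
sk_join (sk_hilo parts)
{
  uint32_t high = (uint32_t) parts.ha << 16;
  int32_t low = parts.lo >= 0x8000 ? (int32_t) parts.lo - 0x10000
                                   : (int32_t) parts.lo;
  return high + (uint32_t) low;
}

/* Self-check: splitting then joining must reproduce the address.  */
static void
sk_check (uint32_t addr)
{
  assert (sk_join (sk_split (addr)) == addr);
}
```

For example 0x12348004 splits into ha 0x1235 and lo 0x8004; the low half is read back as a negative number, which cancels the rounding of the high half.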
A snippet from the or1k_elf_finish_dynamic_symbol() function shows where
we write out the code and dynamic relocation entries for each symbol to
the .plt section.
splt = htab->root.splt;
sgot = htab->root.sgotplt;
srela = htab->root.srelplt;
...
else
{
unsigned ha = ((got_addr + 0x8000) >> 16) & 0xffff;
unsigned lo = got_addr & 0xffff;
plt0 = OR1K_MOVHI(12) | ha;
plt1 = OR1K_LWZ(12,12) | lo;
plt2 = OR1K_ORI0(11) | plt_reloc;
}
or1k_write_plt_entry (output_bfd, splt->contents + h->plt.offset,
plt0, plt1, plt2, OR1K_JR(12));
/* Fill in the entry in the global offset table. We initialize it to
point to the top of the plt. This is done to lazy lookup the actual
symbol as the first plt entry will be setup by libc to call the
runtime dynamic linker. */
bfd_put_32 (output_bfd, plt_base_addr, sgot->contents + got_offset);
/* Fill in the entry in the .rela.plt section. */
rela.r_offset = got_addr;
rela.r_info = ELF32_R_INFO (h->dynindx, R_OR1K_JMP_SLOT);
rela.r_addend = 0;
loc = srela->contents;
loc += plt_index * sizeof (Elf32_External_Rela);
bfd_elf32_swap_reloca_out (output_bfd, &rela, loc);
Here we can see we write 3 things to output_bfd for the single .plt entry.
We write:
- The assembly code to the .plt section.
- The plt_base_addr (the first entry in the .plt for runtime lookup) to the .got section.
- And finally a dynamic relocation for our symbol to the .plt.rela.
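To make the relocation record from the snippet above concrete, here is a sketch of how the r_info field packs the symbol index and relocation type, and how the record position follows from the plt index. The `SK_` macros are local copies of the standard ELF32 packing; the value 20 for the jump slot type is an assumption that should be checked against include/elf/or1k.h.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the standard ELF32 r_info packing macros.  */
#define SK_ELF32_R_INFO(sym, type) (((uint32_t) (sym) << 8) + ((type) & 0xffu))
#define SK_ELF32_R_SYM(info)  ((info) >> 8)
#define SK_ELF32_R_TYPE(info) ((info) & 0xffu)

/* Assumed to mirror R_OR1K_JMP_SLOT; verify against the real header.  */
#define SK_R_JMP_SLOT 20u

/* Same layout as an ELF32 RELA record: 12 bytes.  */
typedef struct
{
  uint32_t r_offset;
  uint32_t r_info;
  int32_t r_addend;
} sk_rela;

/* Build the record asking the runtime linker to patch one GOT slot.  */
static sk_rela
sk_make_jmp_slot (uint32_t got_slot_addr, uint32_t dynindx)
{
  /* The symbol index must fit in the upper 24 bits of r_info.  */
  assert (dynindx < (1u << 24));

  sk_rela rela = { got_slot_addr,
                   SK_ELF32_R_INFO (dynindx, SK_R_JMP_SLOT), 0 };
  return rela;
}

/* Byte offset of the record for a given plt index.  */
static uint32_t
sk_rela_offset (uint32_t plt_index)
{
  return plt_index * (uint32_t) sizeof (sk_rela);
}
```

The real code additionally byte swaps the record for the target with bfd_elf32_swap_reloca_out(), which this sketch leaves out.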
With that we have written all of the sections out to our final elf object, and it’s ready
to be used.
GLIBC Runtime Linker
The runtime linker, also referred to as the dynamic linker, will do the final
linking as we load our program and shared libraries into memory. It can process
a limited set of relocation entries that were setup above during phase 4 of
linking.
The runtime linker implementation is found mostly in the
elf/dl-* GLIBC source files. Dynamic relocation processing is handled by
the _dl_relocate_object()
function in the elf/dl-reloc.c file. The back end macro used for relocation,
ELF_DYNAMIC_RELOCATE,
is defined across several files including elf/dynamic-link.h
and elf/do-rel.h.
Architecture specific relocations are handled by the function elf_machine_rela(), the implementation
for OpenRISC being in sysdeps/or1k/dl-machine.h.
In summary from top down:
- elf/rtld.c - implements dl_main() the top level entry for the dynamic linker.
- elf/dl-open.c - function dl_open_worker() calls _dl_relocate_object(), you may also recognize this from dlopen(3).
- elf/dl-reloc.c - function _dl_relocate_object() calls ELF_DYNAMIC_RELOCATE
- elf/dynamic-link.h - defined macro ELF_DYNAMIC_RELOCATE calls elf_dynamic_do_Rel() via several macros
- elf/do-rel.h - function elf_dynamic_do_Rel() calls elf_machine_rela()
- sysdeps/or1k/dl-machine.h - architecture specific function elf_machine_rela() implements dynamic relocation handling
It supports relocations for:
R_OR1K_NONE - do nothing
R_OR1K_COPY - used to copy initial values from shared objects to process memory.
R_OR1K_32 - a 32-bit value
R_OR1K_GLOB_DAT - aligned 32-bit values for GOT entries
R_OR1K_JMP_SLOT - aligned 32-bit values for PLT entries
R_OR1K_TLS_DTPMOD/R_OR1K_TLS_DTPOFF - for shared TLS GD GOT entries
R_OR1K_TLS_TPOFF - for shared TLS IE GOT entries
A snippet of the OpenRISC implementation of elf_machine_rela() can be seen
below. It is pretty straightforward.
/* Perform the relocation specified by RELOC and SYM (which is fully resolved).
MAP is the object containing the reloc. */
auto inline void
__attribute ((always_inline))
elf_machine_rela (struct link_map *map, const Elf32_Rela *reloc,
const Elf32_Sym *sym, const struct r_found_version *version,
void *const reloc_addr_arg, int skip_ifunc)
{
struct link_map *sym_map = RESOLVE_MAP (&sym, version, r_type);
Elf32_Addr value = SYMBOL_ADDRESS (sym_map, sym, true);
...
switch (r_type)
{
...
case R_OR1K_32:
/* Support relocations on mis-aligned offsets. */
value += reloc->r_addend;
memcpy (reloc_addr_arg, &value, 4);
break;
case R_OR1K_GLOB_DAT:
case R_OR1K_JMP_SLOT:
*reloc_addr = value + reloc->r_addend;
break;
...
}
}
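To show the shape of what elf_machine_rela() does without the glibc scaffolding, here is a toy relocation applier working on a byte buffer. All names are hypothetical. Unlike the real code, which stores directly through an aligned pointer for GOT and PLT slots, this sketch uses memcpy for every case to stay strictly portable.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical relocation kinds, loosely following the list above.  */
enum sk_reloc_type { SK_REL_32, SK_REL_GLOB_DAT, SK_REL_JMP_SLOT };

/* Patch one 32-bit location inside a loaded image.  */
static void
sk_apply_rela (unsigned char *image, uint32_t r_offset,
               enum sk_reloc_type type, uint32_t sym_value, int32_t addend)
{
  uint32_t value = sym_value + (uint32_t) addend;
  unsigned char *where = image + r_offset;

  switch (type)
    {
    case SK_REL_32:
      /* Plain data relocations may sit at misaligned offsets.  */
      memcpy (where, &value, sizeof value);
      break;

    case SK_REL_GLOB_DAT:
    case SK_REL_JMP_SLOT:
      /* GOT slots are word aligned; the real code stores directly.  */
      assert (r_offset % 4 == 0);
      memcpy (where, &value, sizeof value);
      break;
    }
}
```

The symbol value is assumed to be already resolved, which is what RESOLVE_MAP and SYMBOL_ADDRESS take care of in the real function.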
Handling TLS
The complicated part of the runtime linker is how it handles TLS variables.
This is done in the following files and functions.
The reader can read through the initialization code which is pretty straightforward, except for the
macros. Like most GNU code the code relies heavily on untyped macros. These macros are defined
in the architecture specific implementation files. For OpenRISC this is sysdeps/or1k/nptl/tls.h.
From the previous article on TLS we have the
TLS data structure that looks as follows:
  dtv[]   [ dtv[0], dtv[1], dtv[2], .... ]
            counter ^ |       \
               ----/ /         \________
              /     V                   V
/------TCB-------\/----TLS[1]----\   /----TLS[2]----\
| pthread tcbhead | tbss   tdata |   | tbss   tdata |
\----------------/\--------------/   \--------------/
                 ^
                 |
          TP-----/
The symbols and macros defined in sysdeps/or1k/nptl/tls.h are:
__thread_self - a symbol representing the current thread always
TLS_DTV_AT_TP - used throughout the TLS code to adjust offsets
TLS_TCB_AT_TP - used throughout the TLS code to adjust offsets
TLS_TCB_SIZE - used during init_tls() to allocate memory for TLS
TLS_PRE_TCB_SIZE - used during init_tls() to allocate space for the pthread struct
INSTALL_DTV - used during initialization to update a new dtv pointer into the given tcb
GET_DTV - gets dtv via the provided tcb pointer
INSTALL_NEW_DTV - used during resizing to update the dtv into the current runtime __thread_self
TLS_INIT_TP - sets __thread_self this is the final step in init_tls()
THREAD_DTV - gets dtv via __thread_self
THREAD_SELF - get the pthread pointer via __thread_self
Implementations for OpenRISC are:
register tcbhead_t *__thread_self __asm__("r10");
#define TLS_DTV_AT_TP 1
#define TLS_TCB_AT_TP 0
#define TLS_TCB_SIZE sizeof (tcbhead_t)
#define TLS_PRE_TCB_SIZE sizeof (struct pthread)
#define INSTALL_DTV(tcbp, dtvp) (((tcbhead_t *) (tcbp))->dtv = (dtvp) + 1)
#define GET_DTV(tcbp) (((tcbhead_t *) (tcbp))->dtv)
#define TLS_INIT_TP(tcbp) ({__thread_self = ((tcbhead_t *)tcbp + 1); NULL;})
#define THREAD_DTV() ((((tcbhead_t *)__thread_self)-1)->dtv)
#define INSTALL_NEW_DTV(dtv) (THREAD_DTV() = (dtv))
#define THREAD_SELF \
((struct pthread *) ((char *) __thread_self - TLS_INIT_TCB_SIZE \
- TLS_PRE_TCB_SIZE))
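The pointer arithmetic in these macros is easier to follow against a concrete layout. The sketch below models one thread's block as pthread, then tcbhead, then the first TLS area, with the thread pointer sitting just after tcbhead. The `sk_` types are simplified stand-ins (the real struct pthread is much larger), so treat this as an illustration of the offsets only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { void *dtv; } sk_tcbhead;                   /* like tcbhead_t */
typedef struct { void *stack; uintptr_t tid; } sk_pthread;  /* stand-in */

/* One thread's block: pthread, tcbhead, then the first TLS area.  */
typedef struct
{
  sk_pthread pthread;
  sk_tcbhead head;
  char tls[16];
} sk_thread_block;

/* Like TLS_INIT_TP: the thread pointer is one tcbhead past the TCB.  */
static char *
sk_init_tp (sk_thread_block *block)
{
  /* The arithmetic below relies on there being no padding.  */
  assert (offsetof (sk_thread_block, tls)
          == sizeof (sk_pthread) + sizeof (sk_tcbhead));
  return (char *) (&block->head + 1);
}

/* Like THREAD_DTV: step back one tcbhead from the thread pointer.  */
static void **
sk_thread_dtv (char *tp)
{
  return &((sk_tcbhead *) tp - 1)->dtv;
}

/* Like THREAD_SELF: step back over tcbhead and pthread.  */
static sk_pthread *
sk_thread_self (char *tp)
{
  return (sk_pthread *) (tp - sizeof (sk_tcbhead) - sizeof (sk_pthread));
}
```

With this layout the thread pointer equals the address of the first TLS area, so TLS data is reached with positive offsets and the TCB with negative ones.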
Summary
We have looked at how symbols move from the Compiler, to Assembler, to Linker to
Runtime linker.
This has ended up being a long article to explain a rather complicated subject.
Let’s hope it helps provide a good reference for others who want to work on the
GNU toolchain in the future.
Further Reading
- GCC Passes - My blog entry on GCC passes
- bfdint - The BFD developer’s manual
- ldint - The LD developer’s manual
- LD and BFD Gist - Dump of notes I collected while working on this article.
19 Jan 2020
This is an ongoing series of posts on ELF Binary Relocations and Thread
Local Storage. This article covers only Thread Local Storage and assumes
the reader has had a primer in ELF Relocations; if not, please start with
my previous article ELF Binaries and Relocation Entries.
This is the second part in an illustrated 3 part series covering:
In the last article we covered ELF Binary internals and how relocation entries
are used during link time to allow our programs to access symbols
(variables). However, what if we want a different variable instance for each
thread? This is where thread local storage (TLS) comes in.
In this article we will discuss how TLS works. Our outline:
As before, the examples in this article can be found in my tls-examples
project. Please check it out.
Thread Local Storage
Did you know that in C you can prefix variables with __thread to create
thread local variables?
Example
A thread local variable is a variable that will have a unique instance per thread.
Each time a new thread is created, the space required to store the thread local
variables is allocated.
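A minimal sketch of this behaviour, assuming a POSIX threads environment (older toolchains may need -pthread when compiling). The function and variable names are made up for illustration; each thread increments what looks like one variable, yet every thread ends up with its own value.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Each thread that touches this gets its own zero initialized copy.  */
static __thread int counter;

/* Thread body: bump this thread's private counter, then report it.  */
static void *
worker (void *arg)
{
  int *slot = arg;
  for (int i = 0; i < *slot; i++)
    counter++;
  *slot = counter;
  return NULL;
}

/* Run two workers; returns the calling thread's own counter value.  */
static int
run_two_threads (int results[2])
{
  pthread_t threads[2];

  results[0] = 3;   /* first worker increments 3 times */
  results[1] = 5;   /* second worker increments 5 times */
  counter = 100;    /* only changes the calling thread's copy */

  for (int i = 0; i < 2; i++)
    {
      int rc = pthread_create (&threads[i], NULL, worker, &results[i]);
      assert (rc == 0);
    }
  for (int i = 0; i < 2; i++)
    pthread_join (threads[i], NULL);

  return counter;
}
```

If counter were an ordinary global, the workers would start from 100 and race with each other; with __thread each worker starts from 0 and the caller still sees 100.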
TLS variables are stored in dynamic TLS sections.
TLS Sections
In the previous article we saw how variables were stored in the .data and
.bss sections. These are initialized once per program or library.
When we get to binaries that use TLS we will additionally have .tdata and
.tbss sections.
.tdata - static and non static initialized thread local variables
.tbss - static and non static non-initialized thread local variables
These exist in a special TLS segment which
is loaded per thread. In the next article we will discuss more about how this
loading works.
TLS Data Structures
As we recall, to access data in .data and .bss sections simple code
sequences with relocation entries are used. These sequences set and add
registers to build pointers to our data. For example, the below sequence uses 2
relocations to compose a .bss section address into register r11.
Addr. Machine Code Assembly Relocations
0000000c <get_x_addr>:
c: 19 60 [00 00] l.movhi r11,[0] # c R_OR1K_AHI16 .bss
10: 44 00 48 00 l.jr r9
14: 9d 6b [00 00] l.addi r11,r11,[0] # 14 R_OR1K_LO_16_IN_INSN .bss
With TLS the code sequences to access our data will also build pointers to our
data, but they need to traverse the TLS data structures.
As the code sequence is read only and will be the same for each thread, another
level of indirection is needed. This is provided by the Thread Pointer (TP).
The Thread Pointer points into a data structure that allows us to locate TLS
data sections. The TLS data structure includes:
- Thread Control Block (TCB)
- Dynamic Thread Vector (DTV)
- TLS Data Sections
These are illustrated as below:
  dtv[]   [ dtv[0], dtv[1], dtv[2], .... ]
            counter ^ |       \
               ----/ /         \________
              /     V                   V
/------TCB-------\/----TLS[1]----\   /----TLS[2]----\
| pthread tcbhead | tbss   tdata |   | tbss   tdata |
\----------------/\--------------/   \--------------/
                 ^
                 |
          TP-----/
Thread Pointer (TP)
The TP is unique to each thread. It provides the starting point to the TLS data
structure.
- The TP points to the Thread Control Block
- On OpenRISC the TP is stored in r10
- On x86_64 the TP is stored in $fs
- This is the *tls pointer passed to the clone() system call when using CLONE_SETTLS.
Thread Control Block (TCB)
The TCB is the head of the TLS data structure. The TCB consists of:
pthread - the pthread struct for the current thread, contains tid etc. Located by TP - TCB size - Pthread size
tcbhead - the tcbhead_t struct, machine dependent, contains pointer to DTV. Located by TP - TCB size.
For OpenRISC tcbhead_t is defined in
sysdeps/or1k/nptl/tls.h as:
typedef struct {
dtv_t *dtv;
} tcbhead_t;
dtv - is a pointer to the dtv array, points to entry dtv[1]
For x86_64 the tcbhead_t is defined in
sysdeps/x86_64/nptl/tls.h
as:
typedef struct
{
void *tcb; /* Pointer to the TCB. Not necessarily the
thread descriptor used by libpthread. */
dtv_t *dtv;
void *self; /* Pointer to the thread descriptor. */
int multiple_threads;
int gscope_flag;
uintptr_t sysinfo;
uintptr_t stack_guard;
uintptr_t pointer_guard;
unsigned long int vgetcpu_cache[2];
/* Bit 0: X86_FEATURE_1_IBT.
Bit 1: X86_FEATURE_1_SHSTK.
*/
unsigned int feature_1;
int __glibc_unused1;
/* Reservation of some values for the TM ABI. */
void *__private_tm[4];
/* GCC split stack support. */
void *__private_ss;
/* The lowest address of shadow stack, */
unsigned long long int ssp_base;
/* Must be kept even if it is no longer used by glibc since programs,
like AddressSanitizer, depend on the size of tcbhead_t. */
__128bits __glibc_unused2[8][4] __attribute__ ((aligned (32)));
void *__padding[8];
} tcbhead_t;
The x86_64 implementation includes many more fields including:
gscope_flag - Global Scope lock flags used by the runtime linker, for OpenRISC this is stored in pthread.
stack_guard - The stack guard canary stored in the thread local area. For OpenRISC a global stack guard is stored in .bss.
pointer_guard - The pointer guard stored in the thread local area. For OpenRISC a global pointer guard is stored in .bss.
Dynamic Thread Vector (DTV)
The DTV is an array of pointers to each TLS data section. The first entry in
the DTV array contains the generation counter. The generation counter is really
just the array size. The DTV can be dynamically resized as more TLS modules are loaded.
The dtv_t type is a union as defined below:
typedef struct {
void *val; // Aligned pointer to data/bss
void *to_free; // Unaligned pointer for free()
} dtv_pointer;
typedef union {
int counter; // for entry 0
dtv_pointer pointer; // for all other entries
} dtv_t;
Each dtv_t entry can be either a counter or a pointer. By convention the
first entry, dtv[0] is a counter and the rest are pointers.
Thread Local Storage (TLS)
The initial set of TLS data sections is allocated contiguous with the TCB. Additional TLS
data blocks will be allocated dynamically. There will be one entry for each
loaded module, the first module being the current program. For dynamic
libraries it is lazily initialized per thread.
Local (or TLS[1])
tbss - the .tbss section for the current thread from the current
process's ELF binary.
tdata - the .tdata section for the current thread from the current
process's ELF binary.
TLS[2]
tbss - the .tbss section for variables defined in the first shared library loaded by the current process
tdata - the .tdata section for variables defined in the first shared library loaded by the current process
The __tls_get_addr() function
The __tls_get_addr() function can be used at any time to traverse the TLS data
structure and return a variable’s address. The function is given a pointer to
an architecture specific argument tls_index.
- The argument contains 2 pieces of data:
  - The module index - 1 for the current process, 2 for the first loaded shared library etc.
  - The data offset - the offset of the variable in the TLS data section
- Internally __tls_get_addr uses TP to locate the TLS data structure
- The function returns the address of the variable we want to access
For static builds the implementation is architecture dependent and, for
OpenRISC, is defined in
sysdeps/or1k/libc-tls.c
as:
void *
__tls_get_addr (tls_index *ti)
{
dtv_t *dtv = THREAD_DTV ();
return (char *) dtv[1].pointer.val + ti->ti_offset;
}
Note that for static builds the module index can be hard coded to 1 as there
will always be only one module.
For dynamically linked programs the implementation is defined as part of the
runtime dynamic linker in
elf/dl-tls.c
as:
void *
__tls_get_addr (GET_ADDR_ARGS)
{
dtv_t *dtv = THREAD_DTV ();
if (__glibc_unlikely (dtv[0].counter != GL(dl_tls_generation)))
return update_get_addr (GET_ADDR_PARAM);
void *p = dtv[GET_ADDR_MODULE].pointer.val;
if (__glibc_unlikely (p == TLS_DTV_UNALLOCATED))
return tls_get_addr_tail (GET_ADDR_PARAM, dtv, NULL);
return (char *) p + GET_ADDR_OFFSET;
}
Here several macros are used so it’s a bit hard to follow, but they are:
THREAD_DTV - uses TP to get the pointer to the DTV array.
GET_ADDR_ARGS - short for tls_index* ti
GET_ADDR_PARAM - short for ti
GET_ADDR_MODULE - short for ti->ti_module
GET_ADDR_OFFSET - short for ti->ti_offset
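With the macros expanded, the fast path of the lookup is just an array index plus an offset. The sketch below models that path on a mock DTV. The `sk_` types are simplified assumptions, dtv[0].counter is treated as the number of module slots as described in this article, and the generation check and lazy allocation of the real function are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for tls_index and dtv_t.  */
typedef struct { size_t ti_module; size_t ti_offset; } sk_tls_index;
typedef union { size_t counter; void *val; } sk_dtv;

/* Fast path only: pick the module's TLS block, then add the offset.  */
static void *
sk_tls_get_addr (sk_dtv *dtv, const sk_tls_index *ti)
{
  /* Module indexes start at 1; slot 0 holds the counter.  */
  assert (ti->ti_module >= 1 && ti->ti_module <= dtv[0].counter);
  return (char *) dtv[ti->ti_module].val + ti->ti_offset;
}
```

A caller would pass the thread's DTV and a tls_index built from the two GOT words, for example module 2 and offset 4 to reach the second int in the first shared library's TLS block.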
TLS Access Models
As one can imagine, traversing the TLS data structures when accessing each variable
could be slow. For this reason there are different TLS access models that the
compiler can choose to minimize variable access overhead.
Global Dynamic
The Global Dynamic (GD), sometimes called General Dynamic, access model is the
slowest access model which will traverse the entire TLS data structure for each
variable access. It is used for accessing variables in dynamic shared
libraries.
Before Linking

Not counting relocations for the PLT and GOT entries; before linking the .text
contains 1 placeholder for a GOT offset. This GOT entry will contain the
arguments to __tls_get_addr.
After Linking

After linking there will be 2 relocation entries in the GOT to be resolved by
the dynamic linker. These are R_TLS_DTPMOD, the TLS module index, and
R_TLS_DTPOFF, the offset of the variable into the TLS module.
Example
File: tls-gd.c
extern __thread int x;
int* get_x_addr() {
return &x;
}
Code Sequence (OpenRISC)
tls-gd.o: file format elf32-or1k
Disassembly of section .text:
0000004c <get_x_addr>:
4c: 18 60 [00 00] l.movhi r3,[0] # 4c: R_OR1K_TLS_GD_HI16 x
50: 9c 21 ff f8 l.addi r1,r1,-8
54: a8 63 [00 00] l.ori r3,r3,[0] # 54: R_OR1K_TLS_GD_LO16 x
58: d4 01 80 00 l.sw 0(r1),r16
5c: d4 01 48 04 l.sw 4(r1),r9
60: 04 00 00 02 l.jal 68 <get_x_addr+0x1c>
64: 1a 00 [00 00] l.movhi r16,[0] # 64: R_OR1K_GOTPC_HI16 _GLOBAL_OFFSET_TABLE_-0x4
68: aa 10 [00 00] l.ori r16,r16,[0] # 68: R_OR1K_GOTPC_LO16 _GLOBAL_OFFSET_TABLE_
6c: e2 10 48 00 l.add r16,r16,r9
70: 04 00 [00 00] l.jal [0] # 70: R_OR1K_PLT26 __tls_get_addr
74: e0 63 80 00 l.add r3,r3,r16
78: 85 21 00 04 l.lwz r9,4(r1)
7c: 86 01 00 00 l.lwz r16,0(r1)
80: 44 00 48 00 l.jr r9
84: 9c 21 00 08 l.addi r1,r1,8
Code Sequence (x86_64)
tls-gd.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000020 <get_x_addr>:
20: 48 83 ec 08 sub $0x8,%rsp
24: 66 48 8d 3d [00 00 00 00] lea [0](%rip),%rdi # 28 R_X86_64_TLSGD x-0x4
2c: 66 66 48 e8 [00 00 00 00] callq [0] # 30 R_X86_64_PLT32 __tls_get_addr-0x4
34: 48 83 c4 08 add $0x8,%rsp
38: c3 retq
Local Dynamic
The Local Dynamic (LD) access model is an optimization for Global Dynamic where
multiple variables may be accessed from the same TLS module. Instead of
traversing the TLS data structure for each variable, the TLS data section address
is loaded once by calling __tls_get_addr with an offset of 0. Next, variables
can be accessed with individual offsets.
Local Dynamic is not supported on OpenRISC yet.
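The saving that Local Dynamic offers can be sketched with a mock lookup function that counts its calls. Everything here is hypothetical: `sk_tls_get_addr` stands in for __tls_get_addr with a single module, and the offsets play the role of the DTPOFF values the linker would fill in.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct { size_t ti_module; size_t ti_offset; } sk_tls_index;

static int sk_lookup_calls;       /* how often the slow path ran */
static char sk_module_tls[16];    /* mock TLS block for module 1 */

/* Hypothetical stand-in for __tls_get_addr with one module.  */
static void *
sk_tls_get_addr (const sk_tls_index *ti)
{
  assert (ti->ti_module == 1);
  sk_lookup_calls++;
  return sk_module_tls + ti->ti_offset;
}

/* Local Dynamic style: one lookup for the base, then plain offsets.  */
static int
sk_sum_local_dynamic (size_t off_x, size_t off_y)
{
  sk_tls_index base_index = { 1, 0 };
  char *base = sk_tls_get_addr (&base_index);
  int x, y;

  memcpy (&x, base + off_x, sizeof x);
  memcpy (&y, base + off_y, sizeof y);
  return x + y;
}
```

With Global Dynamic the same sum would need two lookups, one per variable; here the call counter stays at one however many variables of the module are read.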
Before Linking

Not counting relocations for the PLT and GOT entries; before linking the .text
contains 1 placeholder for a GOT offset and 2 placeholders for the TLS offsets.
This GOT entry will contain the arguments to __tls_get_addr.
The TLS offsets will be the offsets to our variables in the TLS data section.
After Linking

After linking there will be 1 relocation entry in the GOT to be resolved by
the dynamic linker. This is R_TLS_DTPMOD, the TLS module index, the offset
will be 0x0.
Example
File: tls-ld.c
static __thread int x;
static __thread int y;
int sum() {
return x + y;
}
Code Sequence (x86_64)
tls-ld.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000030 <sum>:
30: 48 83 ec 08 sub $0x8,%rsp
34: 48 8d 3d [00 00 00 00] lea [0](%rip),%rdi # 37 R_X86_64_TLSLD x-0x4
3b: e8 [00 00 00 00] callq [0] # 3c R_X86_64_PLT32 __tls_get_addr-0x4
40: 8b 90 [00 00 00 00] mov [0](%rax),%edx # 42 R_X86_64_DTPOFF32 x
46: 03 90 [00 00 00 00] add [0](%rax),%edx # 48 R_X86_64_DTPOFF32 y
4c: 48 83 c4 08 add $0x8,%rsp
50: 89 d0 mov %edx,%eax
52: c3 retq
Initial Exec
The Initial Exec (IE) access model does not require traversing the TLS data
structure. It requires that the offset from the TP to the variable can be
computed at link time.
As Initial Exec does not require calling __tls_get_addr it is more efficient
than the GD and LD access models.
Before Linking

Not counting the relocation entry for the GOT; before linking the .text
contains 1 placeholder for a GOT offset. This GOT entry will contain the TP
offset to the variable.
After Linking

After linking there will be no remaining relocation entries. The .text section
contains the actual GOT offset and the GOT entry will contain the TP offset
to the variable.
Example
File: tls-ie.c
Initial exec C code will be the same as global dynamic, however IE access will
be chosen when static compiling.
extern __thread int x;
int* get_x_addr() {
return &x;
}
Code Sequence (OpenRISC)
00000038 <get_x_addr>:
38: 9c 21 ff fc l.addi r1,r1,-4
3c: 1a 20 [00 00] l.movhi r17,[0x0] # 3c: R_OR1K_TLS_IE_AHI16 x
40: d4 01 48 00 l.sw 0(r1),r9
44: 04 00 00 02 l.jal 4c <get_x_addr+0x14>
48: 1a 60 [00 00] l.movhi r19,[0x0] # 48: R_OR1K_GOTPC_HI16 _GLOBAL_OFFSET_TABLE_-0x4
4c: aa 73 [00 00] l.ori r19,r19,[0x0] # 4c: R_OR1K_GOTPC_LO16 _GLOBAL_OFFSET_TABLE_
50: e2 73 48 00 l.add r19,r19,r9
54: e2 31 98 00 l.add r17,r17,r19
58: 85 71 [00 00] l.lwz r11,[0](r17) # 58: R_OR1K_TLS_IE_LO16 x
5c: 85 21 00 00 l.lwz r9,0(r1)
60: e1 6b 50 00 l.add r11,r11,r10
64: 44 00 48 00 l.jr r9
68: 9c 21 00 04 l.addi r1,r1,4
Code Sequence (x86_64)
0000000000000010 <get_x_addr>:
10: 48 8b 05 [00 00 00 00] mov 0x0(%rip),%rax # 13: R_X86_64_GOTTPOFF x-0x4
17: 64 48 03 04 25 00 00 00 00 add %fs:0x0,%rax
20: c3 retq
Local Exec
The Local Exec (LE) access model does not require traversing the TLS data
structure or a GOT entry. It is chosen by the compiler when accessing file
local variables in the current program.
The Local Exec access model is the most efficient.
Before Linking

Before linking the .text section contains one relocation entry for a TP
offset.
After Linking

After linking the .text section contains the value of the TP offset.
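The difference between Initial Exec and Local Exec comes down to where the TP offset lives. A small sketch of both address computations, using hypothetical helper names and a plain array standing in for the GOT:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Initial Exec: the TP offset is loaded from a GOT slot, then added
   to the thread pointer.  */
static char *
sk_addr_initial_exec (char *tp, const int32_t *got, unsigned slot)
{
  assert (tp != NULL && got != NULL);
  return tp + got[slot];
}

/* Local Exec: the TP offset is a link time constant embedded in the
   instructions, so no memory load is needed.  */
static char *
sk_addr_local_exec (char *tp, int32_t tp_offset)
{
  assert (tp != NULL);
  return tp + tp_offset;
}
```

Both produce the same address for the same offset; Local Exec simply skips the GOT load, which matches the shorter code sequences it generates.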
Example
File: tls-le.c
In the Local Exec example the variable x is local, it is not extern.
static __thread int x;
int * get_x_addr() {
return &x;
}
Code Sequence (OpenRISC)
00000010 <get_x_addr>:
10: 19 60 [00 00] l.movhi r11,[0x0] # 10: R_OR1K_TLS_LE_AHI16 .LANCHOR0
14: e1 6b 50 00 l.add r11,r11,r10
18: 44 00 48 00 l.jr r9
1c: 9d 6b [00 00] l.addi r11,r11,[0] # 1c: R_OR1K_TLS_LE_LO16 .LANCHOR0
Code Sequence (x86_64)
0000000000000010 <get_x_addr>:
10: 64 48 8b 04 25 00 00 00 00 mov %fs:0x0,%rax
19: 48 05 [00 00 00 00] add $0x0,%rax # 1b: R_X86_64_TPOFF32 x
1f: c3 retq
Linker Relaxation
As some TLS access methods are more efficient than others we would like to
choose the best method for each variable access. However, we sometimes don’t know
where a variable will come from until link time.
On some architectures the linker will rewrite the TLS access code sequence to
change to a more efficient access model, this is called relaxation.
One type of relaxation performed by the linker is GD to IE relaxation. During compile
time GD relocation may be chosen for extern variables. However, during link time
the variable may be found in the same module i.e. not a shared object which would require
GD access. In this case the access model can be changed to IE.
That’s pretty cool.
OpenRISC, the architecture I work on, does not support any
of this yet; it requires changes to the compiler and linker. The compiler needs
to be updated to mark sections of the output .text that can be rewritten
(often with added NOP codes). The linker needs to be updated to know how to
identify the relaxation opportunity and perform it.
Summary
In this article we have covered how TLS variables are accessed per thread via
the TLS data structure. Also, we saw how different TLS access models provide
varying levels of efficiency.
In the next article we will look more into how this is implemented in GCC, the
linker and the GLIBC runtime dynamic linker.
Further Reading