<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="http://stffrdhrn.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="http://stffrdhrn.github.io/" rel="alternate" type="text/html" /><updated>2026-02-21T07:59:04+00:00</updated><id>http://stffrdhrn.github.io/feed.xml</id><title type="html">shorne in japan</title><subtitle>Stafford&apos;s blog about engineering and software</subtitle><author><name>Stafford Horne</name></author><entry><title type="html">OpenRISC Multicore - SMP Linux soft lockups</title><link href="http://stffrdhrn.github.io/hardware/linux/embedded/openrisc/2026/01/13/or1k-linux-smp-softlockup-bug.html" rel="alternate" type="text/html" title="OpenRISC Multicore - SMP Linux soft lockups" /><published>2026-01-13T18:11:00+00:00</published><updated>2026-01-13T18:11:00+00:00</updated><id>http://stffrdhrn.github.io/hardware/linux/embedded/openrisc/2026/01/13/or1k-linux-smp-softlockup-bug</id><content type="html" xml:base="http://stffrdhrn.github.io/hardware/linux/embedded/openrisc/2026/01/13/or1k-linux-smp-softlockup-bug.html"><![CDATA[<p>This is the story of figuring out why OpenRISC Linux was no longer booting on FPGA boards.</p>

<p>In July 2025 we received a bug report: <a href="https://github.com/openrisc/mor1kx/issues/168">#168 mor1kx pipeline is stuck in dualcore iverilog RTL simulation</a>.</p>

<p>The report showed a hang on the second CPU of a custom
<a href="https://en.wikipedia.org/wiki/Multi-core_processor">multicore</a> platform.  The
CPU cores that we use in FPGA based
<a href="https://en.wikipedia.org/wiki/System_on_a_chip">SoCs</a> are highly configurable:
we can change cache sizes, MMU set sizes, memory synchronization strategies and
other settings.  Our first step was to ensure that these settings were
correct.  After some initial discussions and adjustments the user was able to
make progress, but Linux booted and then hung with an error.  The following is a
snippet of the boot log:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[ 72.530000] Run /init as init process
[ 95.470000] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 95.470000] rcu: (detected by 0, t=2102 jiffies, g=-1063, q=3 ncpus=2)
[ 95.470000] rcu: All QSes seen, last rcu_sched kthread activity 2088 (-20453--22541), jiffies_till_next_fqs=1, root -&gt;qsmask 0x0
[ 95.470000] rcu: rcu_sched kthread timer wakeup didn't happen for 2087 jiffies! g-1063 f0x2 RCU_GP_WAIT_FQS(5) -&gt;state=0x200
[ 95.470000] rcu: Possible timer handling issue on cpu=1 timer-softirq=194
[ 95.470000] rcu: rcu_sched kthread starved for 2088 jiffies! g-1063 f0x2 RCU_GP_WAIT_FQS(5) -&gt;state=0x200 -&gt;cpu=1
[ 95.470000] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 95.470000] rcu: RCU grace-period kthread stack dump:
[ 95.470000] task:rcu_sched state:R stack:0 pid:12 tgid:12 ppid:2 task_flags:0x208040 flags:0x00000000
[ 95.470000] Call trace:
[ 95.470000] [&lt;(ptrval)&gt;] 0xc00509a4
[ 95.470000] [&lt;(ptrval)&gt;] 0xc0050a04
</code></pre></div></div>

<p>In the above log we see an <a href="https://docs.kernel.org/RCU/stallwarn.html">RCU</a>
stall warning indicating that CPU 1 is running but not making progress and is
likely stuck in a tight loop.  We can also see that the CPUs are both running
but hanging.  It took until December 2025, 5 months, to locate and fix the bug.
In this article we will discuss how we debugged and solved this issue.</p>
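<p>As a rough check on the numbers in the stall warning, the jiffies counts can be
converted to wall-clock time.  The sketch below assumes <code class="language-plaintext highlighter-rouge">HZ=100</code>, which is only
inferred from the 10 ms steps in the log timestamps and is not stated in the log itself.
The helper name <code class="language-plaintext highlighter-rouge">stall_jiffies_to_ms</code> is hypothetical.</p>

```c
#include <stdint.h>

/* Assumption: the kernel tick rate is 100 Hz (inferred from the log timestamps). */
#define ASSUMED_HZ 100u

/* Convert a jiffies count from the stall report into milliseconds. */
uint32_t stall_jiffies_to_ms(uint32_t jiffies)
{
    return jiffies * (1000u / ASSUMED_HZ);
}
```

<p>Under that assumption <code class="language-plaintext highlighter-rouge">t=2102 jiffies</code> is about 21 seconds, which would be consistent
with the stall being reported roughly 23 seconds after <code class="language-plaintext highlighter-rouge">/init</code> was started.</p>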

<h1 id="reproducing-the-issue">Reproducing the issue</h1>

<p>The software that the user uses is the standard OpenRISC kernel and runtime.  It has
been stable for some time running on the QEMU simulator that we use for the bulk
of our software development and testing.</p>

<p>To be honest I hadn’t run the OpenRISC multicore platform on a physical FPGA
development board for a few years, so just setting up the environment was going
to be a significant undertaking.</p>

<p>For the past few years I have been concentrating on OpenRISC software,
which meant using QEMU as it is much more convenient.</p>

<p>To get the environment running we need a bunch of stuff:</p>

<ul>
  <li>De0 Nano Cyclone IV FPGA dev board with assortment of USB and serial device cables</li>
  <li><a href="https://fusesoc.readthedocs.io/en/stable/">fusesoc</a> 2.4.3 - Tool for RTL
package management, building and device programming.</li>
  <li><a href="https://www.altera.com/products/development-tools/quartus">Quartus Prime Design Software</a> 24.1 - for verilog synthesis and place and route</li>
  <li>The fusesoc OpenRISC multicore SoC - https://github.com/stffrdhrn/de0_nano-multicore</li>
  <li><a href="https://openocd.org">OpenOCD</a> 0.11.0 for debugging and loading software onto the board</li>
  <li>The OpenRISC <a href="https://github.com/stffrdhrn/or1k-toolchain-build/releases">toolchain</a> 15.1.0</li>
  <li>Linux kernel source code</li>
  <li>Old kernel patches to get OpenRISC running on the de0 nano</li>
  <li>A busybox <a href="https://github.com/stffrdhrn/or1k-rootfs-build/releases">rootfs</a> for userspace utilities</li>
</ul>

<p>There is a lot of information about how to get your FPGA board working with
openrisc in our <a href="https://openrisc.io/tutorials/de0_nano/">De0 Nano tutorials</a>.
Please refer to the tutorials if you would like to follow up.</p>

<p>Some notes about what I had to figure out when getting the De0 Nano development environment up again:</p>

<ul>
  <li>OpenOCD versions after 0.11 no longer work with OpenRISC and its
adv_debug_sys debug interface.
Newer versions of OpenOCD will connect over the USB Blaster JTAG connection but
requests to write and read fail with CDC failures.</li>
  <li>While debugging the OpenOCD issues I verified our simulated
JTAG connectivity which uses OpenOCD to connect over
<a href="https://github.com/fjullien/jtag_vpi">jtag_vpi</a> does still work.</li>
  <li>Fusesoc is continuously evolving and the <a href="https://github.com/olofk/de0_nano/commits/master/">de0_nano</a> and
<code class="language-plaintext highlighter-rouge">de0_nano-multicore</code>
projects needed to be updated to get them working again.</li>
</ul>

<p>Once the development board was loaded and running a simple hello world program
as per the tutorial I could move on to running Linux.</p>

<h2 id="building-the-linux-kernel">Building the Linux Kernel</h2>

<p>Building and loading the Linux kernel requires the kernel source, a kernel config and a
<a href="https://www.devicetree.org/">devicetree</a>
(DTS) file for our De0 Nano multicore board.  At the time of this writing we didn’t have one available
in the upstream kernel source tree, so we need to create one.  This means we
need to patch and configure the Linux kernel for De0 Nano support.</p>

<h3 id="patching-the-kernel">Patching the Kernel</h3>

<p>We can start with the existing OpenRISC multicore kernel config then make some
adjustments.  To get started we can configure the kernel with <code class="language-plaintext highlighter-rouge">simple_smp_defconfig</code>
as follows.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make <span class="nv">ARCH</span><span class="o">=</span>openrisc <span class="nv">CROSS_COMPILE</span><span class="o">=</span>or1k-linux- simple_smp_defconfig
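make <span class="nv">ARCH</span><span class="o">=</span>openrisc <span class="nv">CROSS_COMPILE</span><span class="o">=</span>or1k-linux-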
</code></pre></div></div>

<p>This gives us a good baseline.
We then need to create a device tree that works for the De0 Nano.  It is
almost the same as the simple SMP board but:</p>

<ul>
  <li>The FPGA SoC runs at 50 MHz instead of 20 MHz</li>
  <li>The FPGA SoC has no ethernet</li>
</ul>

<p>Starting with the existing <code class="language-plaintext highlighter-rouge">simple_smp.dts</code> I modified it, creating
<a href="/content/2026/de0nano-smp.dts">de0nano-smp.dts</a>, and placed it in
the <code class="language-plaintext highlighter-rouge">arch/openrisc/boot/dts</code> directory.</p>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gd">--- arch/openrisc/boot/dts/simple_smp.dts       2026-02-11 20:15:20.244628708 +0000
</span><span class="gi">+++ arch/openrisc/boot/dts/de0nano-smp.dts      2026-02-12 17:24:15.959947375 +0000
</span><span class="p">@@ -25,12 +25,12 @@</span>
                cpu@0 {
                        compatible = "opencores,or1200-rtlsvn481";
                        reg = &lt;0&gt;;
<span class="gd">-                       clock-frequency = &lt;20000000&gt;;
</span><span class="gi">+                       clock-frequency = &lt;50000000&gt;;
</span>                };
                cpu@1 {
                        compatible = "opencores,or1200-rtlsvn481";
                        reg = &lt;1&gt;;
<span class="gd">-                       clock-frequency = &lt;20000000&gt;;
</span><span class="gi">+                       clock-frequency = &lt;50000000&gt;;
</span>                };
        };
 
<span class="p">@@ -57,13 +57,6 @@</span>
                compatible = "opencores,uart16550-rtlsvn105", "ns16550a";
                reg = &lt;0x90000000 0x100&gt;;
                interrupts = &lt;2&gt;;
<span class="gd">-               clock-frequency = &lt;20000000&gt;;
-       };
-
-       enet0: ethoc@92000000 {
-               compatible = "opencores,ethoc";
-               reg = &lt;0x92000000 0x800&gt;;
-               interrupts = &lt;4&gt;;
-               big-endian;
</span><span class="gi">+               clock-frequency = &lt;50000000&gt;;
</span>        };
 };
</code></pre></div></div>
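<p>To illustrate why the UART <code class="language-plaintext highlighter-rouge">clock-frequency</code> has to follow the SoC clock, here is a
small sketch of the usual 16550 divisor arithmetic (clock divided by 16 times the baud rate,
rounded to nearest).  The baud rate of 115200 used in the note below is an assumption for
illustration, and the function names are hypothetical; this is not the serial driver’s code.</p>

```c
#include <stdint.h>

/* Divisor latch value for a 16550-style UART, rounded to the nearest integer. */
uint32_t uart_divisor(uint32_t clock_hz, uint32_t baud)
{
    uint32_t denom = 16u * baud;
    return (clock_hz + denom / 2u) / denom;
}

/* Baud rate the hardware really produces for a given input clock and divisor. */
uint32_t uart_actual_baud(uint32_t clock_hz, uint32_t divisor)
{
    return clock_hz / (16u * divisor);
}
```

<p>With these helpers, a 50 MHz clock and 115200 baud give a divisor of 27.  If the device
tree still claimed 20 MHz the computed divisor would be 11, and on a real 50 MHz clock that
would produce roughly 284090 baud, so the console output would likely be garbled.</p>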

<h3 id="configuring-the-kernel">Configuring the kernel</h3>

<p>The default smp config does not have debugging configured.  Run <code class="language-plaintext highlighter-rouge">make ARCH=openrisc menuconfig</code>
and enable the following.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Kernel hacking  ---&gt;
   printk and dmesg options  ---&gt;
    [*] Show timing information on printks              CONFIG_PRINTK_TIME=y
   Compile-time checks and compiler options  ---&gt;
        Debug information (Disable debug information)   CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y
    [*] Provide GDB scripts for kernel debugging        CONFIG_GDB_SCRIPTS=y

 General setup  ---&gt;
   [*] Configure standard kernel features (expert users)  ---&gt;
     [*]   Load all symbols for debugging/ksymoops      CONFIG_KALLSYMS=y
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">CONFIG_KALLSYMS</code> option seems unremarkable, but it is one of the most important config switches
to enable.  It lets our stack traces show symbol information, which makes it easier to understand
where our crashes happen.</p>
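<p>Conceptually, symbolizing a stack trace is a lookup from an address to the symbol that
contains it.  The following is a minimal sketch of that idea, not the kernel’s kallsyms
implementation.  The two table entries are illustrative: their addresses and sizes are taken
or derived from the GDB, objdump and trace output elsewhere in this post, and the names
<code class="language-plaintext highlighter-rouge">struct sym</code>, <code class="language-plaintext highlighter-rouge">symtab</code> and <code class="language-plaintext highlighter-rouge">symbolize</code> are hypothetical.</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct sym {
    uint32_t addr;      /* start address of the symbol */
    uint32_t size;      /* size in bytes */
    const char *name;
};

/* Illustrative two-entry table; a real kernel table holds every symbol. */
static const struct sym symtab[] = {
    { 0xc0013ca8u, 0x024u, "ipi_icache_page_inv" },
    { 0xc00e9f60u, 0x5b0u, "smp_call_function_many_cond" },
};

/*
 * Format addr as "name+0xoffset/0xsize" into buf.
 * Returns 0 on success, or -1 if no symbol in the table covers addr.
 */
int symbolize(uint32_t addr, char *buf, size_t len)
{
    for (size_t i = 0; i < sizeof(symtab) / sizeof(symtab[0]); i++) {
        const struct sym *s = &symtab[i];
        if (addr >= s->addr && addr - s->addr < s->size) {
            snprintf(buf, len, "%s+0x%x/0x%x", s->name,
                     (unsigned)(addr - s->addr), (unsigned)s->size);
            return 0;
        }
    }
    return -1;
}
```

<p>Without such a table all we get are raw addresses like <code class="language-plaintext highlighter-rouge">0xc00509a4</code>, as in the
first boot log above.</p>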

<p>With all of that configured we can build the kernel.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make <span class="nt">-j12</span> <span class="se">\</span>
  <span class="nv">ARCH</span><span class="o">=</span>openrisc <span class="se">\</span>
  <span class="nv">CROSS_COMPILE</span><span class="o">=</span>or1k-linux- <span class="se">\</span>
  <span class="nv">CONFIG_INITRAMFS_SOURCE</span><span class="o">=</span><span class="s2">"</span><span class="nv">$HOME</span><span class="s2">/work/openrisc/busybox-rootfs/initramfs </span><span class="nv">$HOME</span><span class="s2">/work/openrisc/busybox-rootfs/initramfs.devnodes"</span> <span class="se">\</span>
  <span class="nv">CONFIG_BUILTIN_DTB_NAME</span><span class="o">=</span><span class="s2">"de0nano-smp"</span>
</code></pre></div></div>

<p>When the kernel build is complete we should see our <code class="language-plaintext highlighter-rouge">vmlinux</code> image as follows.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">ls</span> <span class="nt">-ltr</span> | <span class="nb">tail</span> <span class="nt">-n5</span>
<span class="nt">-rwxr-xr-x</span><span class="nb">.</span>   1 shorne shorne  104863360 Jan 30 13:30 vmlinux.unstripped
<span class="nt">-rw-r--r--</span><span class="nb">.</span>   1 shorne shorne     971587 Jan 30 13:30 System.map
<span class="nt">-rwxr-xr-x</span><span class="nb">.</span>   1 shorne shorne      11975 Jan 30 13:30 modules.builtin.modinfo
<span class="nt">-rw-r--r--</span><span class="nb">.</span>   1 shorne shorne       1047 Jan 30 13:30 modules.builtin
<span class="nt">-rwxr-xr-x</span><span class="nb">.</span>   1 shorne shorne  104763212 Jan 30 13:30 vmlinux
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">vmlinux</code> image is an <a href="https://en.wikipedia.org/wiki/Executable_and_Linkable_Format">ELF</a>
binary ready to load onto our
board.  I have also uploaded a <a href="https://github.com/stffrdhrn/linux/commit/47d4f4ce21ddb1a99e72016f130377a265ec3622">patch for adding the device tree file and a defconfig</a>
to GitHub for easy reproduction.</p>

<h2 id="booting-the-image">Booting the Image</h2>

<p>After loading the kernel onto our FPGA board using the GDB and OpenOCD commands from the
tutorial, the system boots.</p>

<p>The system runs for a while and we may be able to execute commands.  Two CPUs are
reported online, but after some time we get the following lockup and the system
stops.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[  410.790000] rcu: INFO: rcu_sched self-detected stall on CPU
[  410.790000] rcu:     0-...!: (2099 ticks this GP) idle=4f64/1/0x40000002 softirq=438/438 fqs=277
[  410.790000] rcu:     (t=2100 jiffies g=-387 q=1845 ncpus=2)
[  410.790000] rcu: rcu_sched kthread starved for 1544 jiffies! g-387 f0x0 RCU_GP_WAIT_FQS(5) -&gt;state=0x0 -&gt;cpu=1
[  410.790000] rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[  410.790000] rcu: RCU grace-period kthread stack dump:
[  410.790000] task:rcu_sched       state:R  running task     stack:0     pid:13    tgid:13    ppid:2      task_flags:0x208040 flags:0x00000000
...
[  411.000000] rcu: Stack dump where RCU GP kthread last ran:
[  411.000000] Task dump for CPU 1:
[  411.000000] task:kcompactd0      state:R  running task     stack:0     pid:29    tgid:29    ppid:2      task_flags:0x218040 flags:0x00000008
[  411.000000] Call trace:
[  411.050000] [&lt;(ptrval)&gt;] sched_show_task.part.0+0x104/0x138
[  411.050000] [&lt;(ptrval)&gt;] dump_cpu_task+0xd8/0xe0
[  411.050000] [&lt;(ptrval)&gt;] rcu_check_gp_kthread_starvation+0x1bc/0x1e4
[  411.050000] [&lt;(ptrval)&gt;] rcu_sched_clock_irq+0xd00/0xe9c
[  411.050000] [&lt;(ptrval)&gt;] ? ipi_icache_page_inv+0x0/0x24
[  411.050000] [&lt;(ptrval)&gt;] update_process_times+0xa8/0x128
[  411.050000] [&lt;(ptrval)&gt;] tick_nohz_handler+0xd8/0x264
[  411.050000] [&lt;(ptrval)&gt;] ? tick_program_event+0x78/0x100
[  411.100000] [&lt;(ptrval)&gt;] tick_nohz_lowres_handler+0x54/0x80
[  411.100000] [&lt;(ptrval)&gt;] timer_interrupt+0x88/0xc8
[  411.100000] [&lt;(ptrval)&gt;] _timer_handler+0x84/0x8c
[  411.100000] [&lt;(ptrval)&gt;] ? smp_call_function_many_cond+0x4d4/0x5b0
[  411.100000] [&lt;(ptrval)&gt;] ? ipi_icache_page_inv+0x0/0x24
[  411.100000] [&lt;(ptrval)&gt;] ? smp_call_function_many_cond+0x1bc/0x5b0
[  411.100000] [&lt;(ptrval)&gt;] ? __alloc_frozen_pages_noprof+0x118/0xde8
[  411.150000] [&lt;(ptrval)&gt;] ? ipi_icache_page_inv+0x14/0x24
[  411.150000] [&lt;(ptrval)&gt;] ? smp_call_function_many_cond+0x4d4/0x5b0
[  411.150000] [&lt;(ptrval)&gt;] on_each_cpu_cond_mask+0x28/0x38
[  411.150000] [&lt;(ptrval)&gt;] smp_icache_page_inv+0x30/0x40
[  411.150000] [&lt;(ptrval)&gt;] update_cache+0x12c/0x160
[  411.150000] [&lt;(ptrval)&gt;] handle_mm_fault+0xc48/0x1cc0
[  411.150000] [&lt;(ptrval)&gt;] ? _raw_spin_unlock_irqrestore+0x28/0x38
[  411.150000] [&lt;(ptrval)&gt;] do_page_fault+0x1d0/0x4b4
[  411.200000] [&lt;(ptrval)&gt;] ? sys_setpgid+0xe4/0x1f8
[  411.200000] [&lt;(ptrval)&gt;] ? _data_page_fault_handler+0x104/0x10c
[  411.200000] CPU: 0 UID: 0 PID: 61 Comm: sh Not tainted 6.19.0-rc5-simple-smp-00005-g4c0503f58a74 #339 NONE
[  411.200000] CPU #: 0
[  411.200000]    PC: c00e9dc4    SR: 0000807f    SP: c1235da4
[  411.200000] GPR00: 00000000 GPR01: c1235da4 GPR02: c1235e00 GPR03: 00000006
[  411.200000] GPR04: c1fe3ae0 GPR05: c1fe3ae0 GPR06: 00000000 GPR07: 00000000
[  411.200000] GPR08: 00000002 GPR09: c00ea0dc GPR10: c1234000 GPR11: 00000006
[  411.200000] GPR12: ffffffff GPR13: 00000002 GPR14: 300ef234 GPR15: c09b7b20
[  411.200000] GPR16: c1fc1b30 GPR17: 00000001 GPR18: c1fe3ae0 GPR19: c1fcffe0
[  411.200000] GPR20: 00000001 GPR21: ffffffff GPR22: 00000001 GPR23: 00000002
[  411.200000] GPR24: c0013950 GPR25: 00000000 GPR26: 00000001 GPR27: 00000000
[  411.200000] GPR28: 01616000 GPR29: 0000000b GPR30: 00000001 GPR31: 00000002
[  411.200000]   RES: 00000006 oGPR11: ffffffff
[  411.200000] Process sh (pid: 61, stackpage=c12457c0)
[  411.200000]
[  411.200000] Stack:
[  411.200000] Call trace:
[  411.200000] [&lt;(ptrval)&gt;] smp_call_function_many_cond+0x4d4/0x5b0
[  411.200000] [&lt;(ptrval)&gt;] on_each_cpu_cond_mask+0x28/0x38
[  411.200000] [&lt;(ptrval)&gt;] smp_icache_page_inv+0x30/0x40
[  411.200000] [&lt;(ptrval)&gt;] update_cache+0x12c/0x160
[  411.200000] [&lt;(ptrval)&gt;] handle_mm_fault+0xc48/0x1cc0
[  411.200000] [&lt;(ptrval)&gt;] ? _raw_spin_unlock_irqrestore+0x28/0x38
[  411.200000] [&lt;(ptrval)&gt;] do_page_fault+0x1d0/0x4b4
[  411.200000] [&lt;(ptrval)&gt;] ? sys_setpgid+0xe4/0x1f8
[  411.200000] [&lt;(ptrval)&gt;] ? _data_page_fault_handler+0x104/0x10c
[  411.200000]
[  411.200000]  c1235d84:       0000001c
[  411.200000]  c1235d88:       00000074
[  411.200000]  c1235d8c:       c1fc0008
[  411.200000]  c1235d90:       00000000
[  411.200000]  c1235d94:       c1235da4
[  411.200000]  c1235d98:       c0013964
[  411.200000]  c1235d9c:       c1235e00
[  411.200000]  c1235da0:       c00ea0dc
[  411.200000] (c1235da4:)      00000006

</code></pre></div></div>

<p>From the trace we can see both CPUs are in similar code locations.</p>

<ul>
  <li>CPU0 : is in <code class="language-plaintext highlighter-rouge">smp_icache_page_inv -&gt; on_each_cpu_cond_mask -&gt; smp_call_function_many_cond</code></li>
  <li>CPU1 : is in <code class="language-plaintext highlighter-rouge">smp_icache_page_inv -&gt; on_each_cpu_cond_mask</code></li>
</ul>

<p>CPU1 is additionally handling a timer interrupt which is reporting the RCU stall.  We can
ignore those bits of the stack: they report the problem for us but are not
the root cause.</p>

<p>Let’s try to understand what is happening.  The <code class="language-plaintext highlighter-rouge">smp_icache_page_inv</code> function
is called to invalidate an icache page.  It forces all CPUs to invalidate a
cache entry by scheduling each CPU to call a cache invalidation function.  This
is scheduled with the <code class="language-plaintext highlighter-rouge">smp_call_function_many_cond</code> call.</p>

<p>On <code class="language-plaintext highlighter-rouge">CPU0</code> and <code class="language-plaintext highlighter-rouge">CPU1</code> this is being initiated by a page fault, as we see
<code class="language-plaintext highlighter-rouge">do_page_fault</code> at the bottom of the stack.  The do_page_fault function will be
called when the CPU handles a TLB miss exception or if there was a page fault.
This must mean that an executable page was not available in memory and access to
that page caused a fault.  Once the page is mapped the icache needs to be
invalidated; this is done via the kernel’s inter-processor interrupt
(<a href="https://en.wikipedia.org/wiki/Inter-processor_interrupt">IPI</a>) mechanism.</p>

<p>The IPI allows one CPU to request work to be done on other CPUs.  This is done
using the <code class="language-plaintext highlighter-rouge">on_each_cpu_cond_mask</code> function call.</p>
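<p>To make the request and acknowledge pattern concrete, here is a minimal single-threaded
model of it.  This is a sketch, not the kernel’s implementation: the names
<code class="language-plaintext highlighter-rouge">csd_model</code>, <code class="language-plaintext highlighter-rouge">model_send</code>, <code class="language-plaintext highlighter-rouge">model_handle_ipi</code>, <code class="language-plaintext highlighter-rouge">model_is_locked</code> and
<code class="language-plaintext highlighter-rouge">model_count_call</code> are hypothetical, and only <code class="language-plaintext highlighter-rouge">CSD_FLAG_LOCK</code> is a name borrowed from the
kernel.</p>

```c
#include <stdbool.h>
#include <stdint.h>

#define CSD_FLAG_LOCK 0x01u

/* Simplified stand-in for the kernel's per-request call data. */
struct csd_model {
    uint32_t flags;
    void (*func)(void *info);
    void *info;
    bool queued;    /* request is waiting in the target CPU's queue */
};

/* Sender side: mark the request locked and queue it (hardware would raise the IPI here). */
void model_send(struct csd_model *csd, void (*func)(void *), void *info)
{
    csd->flags |= CSD_FLAG_LOCK;
    csd->func = func;
    csd->info = info;
    csd->queued = true;
}

/* Receiver side: what the target CPU does when it takes the IPI. */
void model_handle_ipi(struct csd_model *csd)
{
    if (!csd->queued)
        return;
    csd->queued = false;
    csd->func(csd->info);
    csd->flags &= ~CSD_FLAG_LOCK;   /* unlock so the sender can stop waiting */
}

/* What the sender polls: true while the request is still outstanding. */
bool model_is_locked(const struct csd_model *csd)
{
    return (csd->flags & CSD_FLAG_LOCK) != 0;
}

/* Example callback: counts invocations through the info pointer. */
void model_count_call(void *info)
{
    (*(int *)info)++;
}
```

<p>In this model, if <code class="language-plaintext highlighter-rouge">model_handle_ipi</code> never runs on the target, <code class="language-plaintext highlighter-rouge">model_is_locked</code>
stays true forever and a sender that waits on it would never return.</p>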

<p>If we open up the debugger we can see that we are stuck in <code class="language-plaintext highlighter-rouge">csd_lock_wait</code> here:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ or1k-elf-gdb "$HOME/work/linux/vmlinux" -ex 'target remote :3333'
GNU gdb (GDB) 17.0.50.20250614-git
This GDB was configured as "--host=x86_64-pc-linux-gnu --target=or1k-elf".

#0  0xc00ea11c in csd_lock_wait (csd=0xc1fd0000) at kernel/smp.c:351
351             smp_cond_load_acquire(&amp;csd-&gt;node.u_flags, !(VAL &amp; CSD_FLAG_LOCK));
</code></pre></div></div>

<p>Checking the backtrace we see <code class="language-plaintext highlighter-rouge">csd_lock_wait</code> is indeed inside the IPI framework
function <code class="language-plaintext highlighter-rouge">smp_call_function_many_cond</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) bt
#0  0xc00ea11c in csd_lock_wait (csd=0xc1fd0000) at kernel/smp.c:351
#1  smp_call_function_many_cond (mask=&lt;optimized out&gt;, func=0xc0013ca8 &lt;ipi_icache_page_inv&gt;, info=0xc1ff8920, scf_flags=&lt;optimized out&gt;, 
    cond_func=&lt;optimized out&gt;) at kernel/smp.c:877
#2  0x0000002e in ?? ()
</code></pre></div></div>

<p>Here csd stands for Call Single Data, which is part of the IPI framework’s remote
function call API.  The <code class="language-plaintext highlighter-rouge">csd_lock_wait</code> function calls <code class="language-plaintext highlighter-rouge">smp_cond_load_acquire</code>
which we can see below:</p>

<h3 id="kernelsmpc">kernel/smp.c</h3>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">static</span> <span class="n">__always_inline</span> <span class="kt">void</span> <span class="nf">csd_lock_wait</span><span class="p">(</span><span class="n">call_single_data_t</span> <span class="o">*</span><span class="n">csd</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="n">smp_cond_load_acquire</span><span class="p">(</span><span class="o">&amp;</span><span class="n">csd</span><span class="o">-&gt;</span><span class="n">node</span><span class="p">.</span><span class="n">u_flags</span><span class="p">,</span> <span class="o">!</span><span class="p">(</span><span class="n">VAL</span> <span class="o">&amp;</span> <span class="n">CSD_FLAG_LOCK</span><span class="p">));</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">CSD_FLAG_LOCK</code> flag is defined as seen here:</p>

<h3 id="includelinuxsmp_typesh">include/linux/smp_types.h</h3>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">enum</span> <span class="p">{</span>
        <span class="n">CSD_FLAG_LOCK</span>           <span class="o">=</span> <span class="mh">0x01</span><span class="p">,</span>
        <span class="p">...</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">smp_cond_load_acquire</code> macro is just a loop waiting for the 1 bit
<code class="language-plaintext highlighter-rouge">CSD_FLAG_LOCK</code> in <code class="language-plaintext highlighter-rouge">&amp;csd-&gt;node.u_flags</code> to be cleared.</p>
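<p>A userspace approximation of that wait, written with C11 atomics instead of the kernel
macro, might look like the sketch below.  The <code class="language-plaintext highlighter-rouge">max_spins</code> bound and the function name
<code class="language-plaintext highlighter-rouge">wait_for_unlock</code> are hypothetical additions so the example terminates; the kernel loop has
no such limit.</p>

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define CSD_FLAG_LOCK 0x01u

/*
 * Reload the flags word with acquire ordering until the lock bit clears.
 * Returns true if the lock was released within max_spins reloads,
 * false if it was still held when the bound ran out.
 */
bool wait_for_unlock(_Atomic uint32_t *flags, unsigned long max_spins)
{
    for (unsigned long i = 0; i < max_spins; i++) {
        uint32_t val = atomic_load_explicit(flags, memory_order_acquire);
        if (!(val & CSD_FLAG_LOCK))
            return true;
    }
    return false;
}
```

<p>The important property is that every iteration performs a fresh load from memory, so the
loop can only exit once another CPU has actually cleared the bit.</p>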

<p>If we check the value of the <code class="language-plaintext highlighter-rouge">u_flags</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) p/x csd-&gt;node.u_flags
$14 = 0x86330004
</code></pre></div></div>

<p>What is this we see?  The value is <code class="language-plaintext highlighter-rouge">0x86330004</code>, which means the <code class="language-plaintext highlighter-rouge">0x1</code> bit is <em>not</em> set.
It should be exiting the loop.  As the RCU stall warning predicted, our CPU is
stuck in a tight loop.  In this case the loop is in <code class="language-plaintext highlighter-rouge">csd_lock_wait</code>.</p>

<p>The value in memory does not match the value the CPU is reading.  Is this a
memory synchronization issue?  Does the CPU cache incorrectly have the locked
flag?</p>

<h1 id="its-a-hardware-issue">It’s a Hardware Issue</h1>

<p>As this software works fine in QEMU, I first suspected this was a hardware
issue.  Perhaps there is an issue with cache coherency.</p>

<p>Luckily on OpenRISC we can disable caches.  I built the CPU with the caches
disabled.  This is done by changing the following module parameters from
<code class="language-plaintext highlighter-rouge">ENABLED</code> to <code class="language-plaintext highlighter-rouge">NONE</code> as below, then re-synthesizing.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ grep -r FEATURE.*CACHE ../de0_nano-multicore/
../de0_nano-multicore/rtl/verilog/orpsoc_top.v: .FEATURE_INSTRUCTIONCACHE       ("ENABLED"),
../de0_nano-multicore/rtl/verilog/orpsoc_top.v: .FEATURE_DATACACHE              ("ENABLED"),
../de0_nano-multicore/rtl/verilog/orpsoc_top.v: .FEATURE_INSTRUCTIONCACHE       ("ENABLED"),
../de0_nano-multicore/rtl/verilog/orpsoc_top.v: .FEATURE_DATACACHE              ("ENABLED"),
</code></pre></div></div>

<p>After this the system booted very slowly, but we still had hangs.  I was stumped.</p>

<h2 id="gemini-to-the-rescue">Gemini to the Rescue</h2>

<p>I thought I would try out Gemini AI to help debug the issue.  I was able to paste
in the kernel crash dumps and the AI came to the same conclusion
I did.  It thought that it may be a memory synchronization issue.</p>

<p>But Gemini was not able to help; it kept chasing red herrings.</p>

<ul>
  <li>At first it suggested looking at the memory barriers in the <code class="language-plaintext highlighter-rouge">csd_lock_wait</code> code.
I uploaded some of the OpenRISC kernel source code.  It was certain it had found the
issue; I was not convinced but humored it.  It suggested kernel patches,
which I applied, and I confirmed they didn’t help.</li>
  <li>I asked if it could be a hardware bug, and Gemini thought this was a great idea.
If there was an issue with the CPU’s Load Store Unit (LSU) not flushing or
losing writes it could be the cause of the lock not being released.  I uploaded
some of the OpenRISC CPU verilog source code.  It was certain it had found the bug
in the LSU; again, looking at its patches, I was not convinced.
The patches did not improve anything.</li>
</ul>

<p>We went through several iterations of this, none of the suggestions were correct.
I humored the patches but they did not work.</p>

<p>Gemini does help with discussing the issues, highlighting details, and the process
of elimination, but it’s not able to think much beyond the evidence I provide.
It seems to lack the ability to reason beyond its current context.</p>

<p>We will need to figure this out on our own.</p>

<h2 id="using-a-hardware-debugger">Using a Hardware Debugger</h2>

<p>I had some doubt that the values I was seeing in the GDB debug session were
correct.  As a last ditch effort I brought up SignalTap, an FPGA virtual logic
analyzer.  In other words this is a hardware debugger.</p>

<p>What should we look for in SignalTap? We want to confirm what is really in memory
when the CPU is reading the flags variable from memory in the lock loop.</p>

<p>From our GDB session above we recall the <code class="language-plaintext highlighter-rouge">csd_lock_wait</code> lock loop was around PC address <code class="language-plaintext highlighter-rouge">0xc00ea11c</code>.
If we dump this area of the Linux binary we see the following:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ or1k-elf-objdump -d vmlinux | grep -C5 c00ea11c
c00ea108:       0c 00 00 07     l.bnf c00ea124 &lt;smp_call_function_many_cond+0x1c4&gt;
c00ea10c:       15 00 00 00     l.nop 0x0
c00ea110:       86 33 00 04     l.lwz r17,4(r19)  &lt;---------------------------------+
c00ea114:       a6 31 00 01     l.andi r17,r17,0x1                                  |
c00ea118:       bc 11 00 00     l.sfeqi r17,0                                       |
c00ea11c:&lt;--    0f ff ff fd     l.bnf c00ea110 &lt;smp_call_function_many_cond+0x1b0&gt; -+
c00ea120:       15 00 00 00      l.nop 0x0
c00ea124:       22 00 00 00     l.msync
c00ea128:       bc 17 00 02     l.sfeqi r23,2
c00ea12c:       0f ff ff e1     l.bnf c00ea0b0 &lt;smp_call_function_many_cond+0x150&gt;
c00ea130:       aa 20 00 01     l.ori r17,r0,0x1
</code></pre></div></div>

<p>We can see the <code class="language-plaintext highlighter-rouge">l.lwz</code> instruction is used to read in the flags value from
memory.  The <code class="language-plaintext highlighter-rouge">l.lwz</code> instruction instructs the CPU to load data at an address in
memory to a CPU register.  The CPU module that handles memory access is called
the Load Store Unit (LSU).  Let’s set up the logic analyzer to capture the LSU
signals.</p>
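<p>As an aside on how the <code class="language-plaintext highlighter-rouge">l.lwz r17,4(r19)</code> line in the dump maps to its bytes
<code class="language-plaintext highlighter-rouge">86 33 00 04</code>, the sketch below splits a load instruction word into its fields.  The field
layout (6 bit opcode, 5 bit destination register, 5 bit base register, 16 bit immediate) is
my reading of the OpenRISC 1000 architecture manual and should be treated as an assumption;
the names <code class="language-plaintext highlighter-rouge">or1k_load</code> and <code class="language-plaintext highlighter-rouge">or1k_decode_load</code> are hypothetical.</p>

```c
#include <stdint.h>

/* Fields of an OpenRISC load-type instruction word. */
struct or1k_load {
    uint8_t opcode;   /* bits 31..26 */
    uint8_t rd;       /* bits 25..21, destination register */
    uint8_t ra;       /* bits 20..16, base address register */
    int16_t imm;      /* bits 15..0, signed byte offset */
};

/* Split a 32-bit instruction word into its load-format fields. */
struct or1k_load or1k_decode_load(uint32_t insn)
{
    struct or1k_load d;
    d.opcode = (uint8_t)(insn >> 26);
    d.rd = (uint8_t)((insn >> 21) & 0x1f);
    d.ra = (uint8_t)((insn >> 16) & 0x1f);
    d.imm = (int16_t)(insn & 0xffff);
    return d;
}
```

<p>Decoding <code class="language-plaintext highlighter-rouge">0x86330004</code> this way gives destination register 17, base register 19 and
offset 4, which agrees with the disassembly.  The load therefore reads the word at
<code class="language-plaintext highlighter-rouge">r19 + 4</code>.</p>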

<p>In CPU core 0’s module <code class="language-plaintext highlighter-rouge">mor1kx_lsu_cappuccino</code> select signals:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">pc_execute_i</code> - The PC for the execute stage, this lets us know which instruction is waiting to execute</li>
  <li><code class="language-plaintext highlighter-rouge">exec_op_lsu_load_i</code> - Signal that is asserted when the LSU is being asked to perform a load</li>
  <li><code class="language-plaintext highlighter-rouge">dbus_adr_o</code> - The address being communicated from the LSU to the memory bus for the data load</li>
  <li><code class="language-plaintext highlighter-rouge">dbus_dat_i</code> - The data being communicated from the memory bus back to the LSU</li>
  <li><code class="language-plaintext highlighter-rouge">lsu_result_o</code> - The data captured by the LSU to be written to the register file</li>
</ul>

<p><em>Note 1</em> During this build, we disable the data cache to make sure loads are not
cached.  Otherwise our load would go out to the memory bus one time and be hard
to capture in the logic analyzer.</p>

<p><em>Note 2</em> We select only signals on CPU 0, as the <code class="language-plaintext highlighter-rouge">csd_lock_wait</code> lock loop is occurring on both CPUs.</p>

<p><em>Note 3</em> I found that if I added too many signals to SignalTap, Linux would fail to boot
as the CPU would get stuck with bus errors. So be aware.</p>

<p>In SignalTap our setup looks like the following:</p>

<p><img src="/content/2026/2026-signaltap-lsu-load-setup.png" alt="SignalTap Selecting Signals" /></p>

<p>After the setup we can try to boot the kernel and observe the lockup.  If we capture
data when the lockup occurs we see the following:</p>

<p><img src="/content/2026/2026-signaltap-lsu-load.png" alt="SignalTap Reading Data" /></p>

<p>I have annotated the transitions in the trace:</p>

<ol>
  <li>Moments after the <code class="language-plaintext highlighter-rouge">exec_op_lsu_load_i</code> signal is asserted, the <code class="language-plaintext highlighter-rouge">dbus_adr_o</code> is set to
<code class="language-plaintext highlighter-rouge">0x011cc47c</code>.  This is the memory address to be read.</li>
  <li>Next we see <code class="language-plaintext highlighter-rouge">0x11</code> on <code class="language-plaintext highlighter-rouge">dbus_dat_i</code>.  This is the value read from memory.</li>
  <li>After this the value <code class="language-plaintext highlighter-rouge">0x11</code> is output on <code class="language-plaintext highlighter-rouge">lsu_result_o</code>, confirming this is the value read.</li>
  <li>Finally after a few instructions the loop continues again and <code class="language-plaintext highlighter-rouge">exec_op_lsu_load_i</code> is asserted.</li>
</ol>

<p>Here we have confirmed the CPU is properly reading <code class="language-plaintext highlighter-rouge">0x11</code>, so the lock is still held.  What does this mean?
Does it mean that CPU 1 (the secondary CPU) did not handle the IPI and release the lock?</p>

<p>It does mean that our GDB analysis was wrong.  There is no memory synchronization issue
and the hardware is behaving as expected.  We need another idea.</p>

<h1 id="actually-its-a-kernel-issue">Actually, it’s a Kernel Issue</h1>

<p>Since it seems that some IPIs are not getting handled properly,
it would be good to know how many IPIs are getting sent and lost.</p>

<p>I added a <a href="https://github.com/stffrdhrn/linux/commit/a7fc4d4778a70461fb28fb2e3216d3a85513fd62">patch to capture and dump IPI stats</a> 
when OpenRISC crashes.  What we see below is that CPU 1 is receiving no IPIs while
CPU 0 has received all IPIs sent by CPU 1.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[  648.180000] CPU: 0 UID: 0 PID: 1 Comm: init Tainted: G             L      6.19.0-rc4-de0nano-smp-00002-ga7fc4d
[  648.180000] Tainted: [L]=SOFTLOCKUP
[  648.180000] CPU #: 0
[  648.180000]    PC: c00ea100    SR: 0000807f    SP: c1031cf8
[  648.180000] GPR00: 00000000 GPR01: c1031cf8 GPR02: c1031d54 GPR03: 00000006
[  648.180000] GPR04: c1fe4ae0 GPR05: c1fe4ae0 GPR06: 00000000 GPR07: 00000000
[  648.180000] GPR08: 00000002 GPR09: c00ea420 GPR10: c1030000 GPR11: 00000006
[  648.180000] GPR12: 00000029 GPR13: 00000002 GPR14: c1fe4ae0 GPR15: 0000000b
[  648.180000] GPR16: c1fc1b60 GPR17: 00000011 GPR18: c1fe4ae0 GPR19: c1fd0010
[  648.180000] GPR20: 00000001 GPR21: ffffffff GPR22: 00000001 GPR23: 00000002
[  648.180000] GPR24: c0013c98 GPR25: 00000000 GPR26: 00000001 GPR27: c09cd7b0
[  648.180000] GPR28: 01608000 GPR29: c09c4524 GPR30: 00000006 GPR31: 00000000
[  648.180000]   RES: 00000006 oGPR11: ffffffff
...
[  648.180000] IPI stats:
[  648.180000] Wakeup IPIs                sent:        1 recv:        0
[  648.180000] Rescheduling IPIs          sent:        8 recv:        0
[  648.180000] Function call IPIs         sent:        0 recv:        0
[  648.180000] Function single call IPIs  sent:       41 recv:       46
...
[  660.260000] CPU: 1 UID: 0 PID: 29 Comm: kcompactd0 Tainted: G             L      6.19.0-rc4-de0nano-smp-00002-
[  660.260000] Tainted: [L]=SOFTLOCKUP
[  660.260000] CPU #: 1
[  660.260000]    PC: c053ca40    SR: 0000827f    SP: c1095b58
[  660.260000] GPR00: 00000000 GPR01: c1095b58 GPR02: c1095b60 GPR03: c11f003c
[  660.260000] GPR04: c11f20c0 GPR05: 3002e000 GPR06: c1095b64 GPR07: c1095b60
[  660.260000] GPR08: 00000000 GPR09: c0145c00 GPR10: c1094000 GPR11: c11fc05c
[  660.260000] GPR12: 00000000 GPR13: 0002003d GPR14: c1095d2c GPR15: 00000000
[  660.260000] GPR16: c1095b98 GPR17: 0000001d GPR18: c11f0000 GPR19: 0000001e
[  660.260000] GPR20: 30030000 GPR21: 001f001d GPR22: c1fe401c GPR23: c09cd7b0
[  660.260000] GPR24: ff000000 GPR25: 00000001 GPR26: 01000000 GPR27: c1ff21a4
[  660.260000] GPR28: 00000000 GPR29: c1095dd8 GPR30: 3002e000 GPR31: 00000002
[  660.260000]   RES: c11fc05c oGPR11: ffffffff
...
[  660.300000] IPI stats:
[  660.300000] Wakeup IPIs                sent:        0 recv:        0
[  660.310000] Rescheduling IPIs          sent:        0 recv:        0
[  660.310000] Function call IPIs         sent:        0 recv:        0
[  660.310000] Function single call IPIs  sent:       46 recv:        0
</code></pre></div></div>

<p>Why is this?  With some extra debugging I found that the programmable interrupt
controller mask register (<code class="language-plaintext highlighter-rouge">PICMR</code>) was <code class="language-plaintext highlighter-rouge">0x0</code> on CPU 1.  This means that all interrupts
on CPU 1 are masked and CPU 1 will never receive any interrupts.</p>

<p>After a <a href="https://github.com/stffrdhrn/linux/commit/d2533084299085b9b602b8b78d6827a2411ef05b">quick patch to unmask IPIs</a>
on secondary CPUs the system became stable.  This is the simple patch:</p>

<div class="language-patch highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gh">diff --git a/arch/openrisc/kernel/smp.c b/arch/openrisc/kernel/smp.c
index 86da4bc5ee0b..db3f6ff0b54a 100644
</span><span class="gd">--- a/arch/openrisc/kernel/smp.c
</span><span class="gi">+++ b/arch/openrisc/kernel/smp.c
</span><span class="p">@@ -138,6 +138,9 @@</span> asmlinkage __init void secondary_start_kernel(void)
        synchronise_count_slave(cpu);
        set_cpu_online(cpu, true);
 
<span class="gi">+       // Enable IPIs, hack
+       mtspr(SPR_PICMR, mfspr(SPR_PICMR) | 0x2);
+
</span>        local_irq_enable();
        /*
         * OK, it's off to the idle thread for us
</code></pre></div></div>
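
<p>To make the effect of that mask value concrete, here is a minimal user space sketch in C (it is not kernel code).
It assumes, as the hack above does, that the IPI sits on PIC line 1 (mask <code class="language-plaintext highlighter-rouge">0x2</code>) and that a set bit in
<code class="language-plaintext highlighter-rouge">PICMR</code> means the line is unmasked.  The helper names <code class="language-plaintext highlighter-rouge">picmr_line_unmasked</code> and
<code class="language-plaintext highlighter-rouge">picmr_unmask_line</code> are hypothetical and exist only for this illustration.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Assumption taken from the hack patch above: the IPI uses PIC line 1 (mask 0x2). */
#define IPI_IRQ_LINE 1u

/* Returns 1 if the given PIC line is unmasked (enabled) in a PICMR value, else 0. */
int picmr_line_unmasked(uint32_t picmr, unsigned int line)
{
	assert(line < 32);
	return (int)((picmr >> line) & 1u);
}

/* Returns the PICMR value with the given line unmasked; other bits are untouched. */
uint32_t picmr_unmask_line(uint32_t picmr, unsigned int line)
{
	assert(line < 32);
	return picmr | (UINT32_C(1) << line);
}
```

<p>With <code class="language-plaintext highlighter-rouge">PICMR</code> at <code class="language-plaintext highlighter-rouge">0x0</code> the first helper reports the IPI line as masked, which matches
the zero receive counts on CPU 1 in the stats above.</p>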

<h1 id="fixing-the-bug-upstream">Fixing the Bug Upstream</h1>

<p>Simply unmasking the interrupts in Linux as I did above in the hack would not be accepted upstream.
There are irqchip APIs that handle interrupt unmasking.</p>

<p>The <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/openrisc?id=eea1a28f93c8c78b961aca2012dedfd5c528fcac">OpenRISC IPI patch</a>
for the Linux 6.20/7.0 release updates the IPI interrupt driver to
register a percpu_irq which allows us to unmask the irq handler on each CPU.</p>

<p>In the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a67594c977234b0ad6887202740e9e8b9821473a">patch series</a>
I also added De0 Nano single core and multicore board
configurations to allow for easier board bring up.</p>

<h1 id="what-went-wrong-with-gdb">What went wrong with GDB?</h1>

<p>Why did GDB return the incorrect values when we were debugging initially?</p>

<p>GDB is not broken, but it could be improved when debugging kernel code.
Let’s look again at the GDB session and look at the addresses of our variables.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) l
346     {
347     }
348
349     static __always_inline void csd_lock_wait(call_single_data_t *csd)
350     {
351             smp_cond_load_acquire(&amp;csd-&gt;node.u_flags, !(VAL &amp; CSD_FLAG_LOCK));
352     }
353     #endif

(gdb) p/x csd-&gt;node.u_flags
$14 = 0x86330004

(gdb) p/x &amp;csd-&gt;node.u_flags
$15 = 0xc1fd0004
</code></pre></div></div>

<p>Here we see the value GDB reads is <code class="language-plaintext highlighter-rouge">0x86330004</code>, but the address of the variable is
<code class="language-plaintext highlighter-rouge">0xc1fd0004</code>.  This is a kernel address, as we can see from the <code class="language-plaintext highlighter-rouge">0xc0000000</code> address offset.</p>

<p>Let’s inspect the assembly code that is running.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) p/x $npc
$11 = 0xc00ea11c

(gdb) x/12i $npc-0xc000000c
   0xea110:     l.lwz r17,4(r19)
   0xea114:     l.andi r17,r17,0x1
   0xea118:     l.sfeqi r17,0
--&gt;0xea11c:     l.bnf 0xea110
   0xea120:     l.nop 0x0
   0xea124:     l.msync
   0xea128:     l.sfeqi r23,2
   0xea12c:     l.bnf 0xea0b0
   0xea130:     l.ori r17,r0,0x1
   0xea134:     l.lwz r16,56(r1)
   0xea138:     l.lwz r18,60(r1)
   0xea13c:     l.lwz r20,64(r1)
</code></pre></div></div>

<p>Here we see the familiar loop.  The register <code class="language-plaintext highlighter-rouge">r19</code> stores the address of
<code class="language-plaintext highlighter-rouge">csd-&gt;node</code> and <code class="language-plaintext highlighter-rouge">u_flags</code> is at a 4 byte offset, hence <code class="language-plaintext highlighter-rouge">l.lwz r17,4(r19)</code>.</p>
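
<p>The same loop can be sketched in portable C.  This is only a model: the names
<code class="language-plaintext highlighter-rouge">csd_is_locked</code> and <code class="language-plaintext highlighter-rouge">csd_lock_wait_bounded</code> are hypothetical, the spin limit
does not exist in the kernel, and the acquire barrier (the <code class="language-plaintext highlighter-rouge">l.msync</code> after the loop) is left out.</p>

```c
#include <stdint.h>

/* Lock bit tested by the loop; the disassembly masks the loaded word with 0x1. */
#define CSD_FLAG_LOCK_SKETCH 0x01u

/* Mirrors l.lwz + l.andi: load the flags word and keep only the lock bit. */
int csd_is_locked(const volatile uint32_t *u_flags)
{
	return (*u_flags & CSD_FLAG_LOCK_SKETCH) != 0;
}

/*
 * Bounded stand-in for the wait loop: re-read the flags while the lock bit
 * is set, giving up after max_spins reads.
 * Returns 1 if the lock was seen released, 0 if we gave up.
 */
int csd_lock_wait_bounded(const volatile uint32_t *u_flags, unsigned int max_spins)
{
	for (unsigned int i = 0; i < max_spins; i++) {
		if (!csd_is_locked(u_flags))
			return 1;
	}
	return 0;
}
```

<p>With a flags word of <code class="language-plaintext highlighter-rouge">0x11</code>, as captured in the SignalTap trace, the bounded wait gives up, which is
the soft lockup in miniature.</p>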

<p>The register <code class="language-plaintext highlighter-rouge">r17</code> stores the value read from memory, then masked with <code class="language-plaintext highlighter-rouge">0x1</code>.
We can see this below.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) p/x $r17
$4 = 0x1
(gdb) p/x $r19
$5 = 0xc1fd0000

(gdb) x/12x $r19
0xc1fd0000:     0x862a0008      0x862a0008      0x862a0008      0x862a0008
0xc1fd0010:     0x862a0008      0x862a0008      0x862a0008      0x862a0008
0xc1fd0020:     0x862a0008      0x862a0008      0x862a0008      0x862a0008
</code></pre></div></div>

<p>Here we see <code class="language-plaintext highlighter-rouge">r19</code> is <code class="language-plaintext highlighter-rouge">0xc1fd0000</code> and if we inspect the memory at this location
we see values like <code class="language-plaintext highlighter-rouge">0x862a0008</code>, which is strange.</p>

<p>Above we discussed these are kernel addresses, offset by <code class="language-plaintext highlighter-rouge">0xc0000000</code>.
When the kernel does memory reads these will be mapped by the MMU to a physical address, in this case
<code class="language-plaintext highlighter-rouge">0x01fd0004</code>.</p>

<p>We can apply the offset ourselves and inspect memory as follows.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(gdb) x/12x $r19-0xc0000000
0x1fd0000:      0x00000000      0x00000011      0xc0013ca8      0xc1ff8920
0x1fd0010:      0x0000fe00      0x00000000      0x00000400      0x00000000
0x1fd0020:      0x01fd01fd      0x00000005      0x0000002e      0x0000002e
</code></pre></div></div>

<p>Bingo, this shows that we have <code class="language-plaintext highlighter-rouge">0x11</code>, the lock value, at <code class="language-plaintext highlighter-rouge">0x1fd0004</code>.  Memory does
indeed contain <code class="language-plaintext highlighter-rouge">0x11</code>, the same as the value read by the CPU.</p>

<p>When GDB does memory reads the debug interface issues reads directly to the
memory bus.   The CPU and MMU are not involved.  This means, at the moment, we
need to be careful when inspecting memory and be sure to perform the offsets
ourselves.</p>
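
<p>A small helper makes the offset explicit.  This sketch assumes the simple linear mapping seen in this session,
where a kernel virtual address is the physical address plus <code class="language-plaintext highlighter-rouge">0xc0000000</code>.  The name
<code class="language-plaintext highlighter-rouge">kernel_virt_to_phys</code> is hypothetical, and the helper does not apply to user space addresses or to
any other kind of mapping.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Kernel virtual address offset seen in the GDB session above. */
#define KERNEL_VIRT_OFFSET UINT32_C(0xc0000000)

/*
 * Convert a linearly mapped kernel virtual address into the physical
 * address that the debug interface puts on the memory bus.
 * Only valid for addresses at or above the offset.
 */
uint32_t kernel_virt_to_phys(uint32_t vaddr)
{
	assert(vaddr >= KERNEL_VIRT_OFFSET);
	return vaddr - KERNEL_VIRT_OFFSET;
}
```

<p>For the lock word this maps <code class="language-plaintext highlighter-rouge">0xc1fd0004</code> to <code class="language-plaintext highlighter-rouge">0x01fd0004</code>, the address we inspected by hand.</p>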

<h1 id="conclusion">Conclusion</h1>

<p>Debugging across the hardware-software boundary requires a bit of experience
and a whole lot of patience.  We initially thought that this was a hardware
issue, but then eventually found a trivial issue in the OpenRISC multicore support
drivers.  It took time and a few different tools to convince ourselves that the
hardware was fine.  After refocusing on the kernel and building some tools
(the IPI stats report) we found the clue we needed.</p>

<p>This highlights the importance in embedded systems of knowing your tools and
architecture.  Or, in our case, remembering that the MMU does not translate memory
reads over the JTAG interface.  With the bug fixed and De0 Nano support merged upstream,
Linux on the OpenRISC platform is now more accessible and stable than before.</p>

<h1 id="followups">Followups</h1>

<p>Working on this issue highlighted that there are a few things to improve
in OpenRISC including:</p>

<ul>
  <li>Tutorials and Upstreaming patches</li>
  <li>OpenOCD is currently broken for OpenRISC</li>
  <li>OpenOCD doesn’t support multicore (<a href="https://github.com/stffrdhrn/openocd/commits/or1k-multicore/">multicore patches</a> never upstreamed)</li>
  <li>OpenOCD / GDB bugs</li>
</ul>]]></content><author><name>Stafford Horne</name></author><category term="hardware" /><category term="linux" /><category term="embedded" /><category term="openrisc" /><summary type="html"><![CDATA[This is the story of figuring out why OpenRISC Linux was no longer booting on FPGA boards.]]></summary></entry><entry><title type="html">OpenRISC FPU Port - Updating Linux, Compilers and Debuggers</title><link href="http://stffrdhrn.github.io/hardware/embedded/openrisc/2023/08/24/or1k-fpu-linux-and-compilers.html" rel="alternate" type="text/html" title="OpenRISC FPU Port - Updating Linux, Compilers and Debuggers" /><published>2023-08-24T06:49:00+01:00</published><updated>2023-08-24T06:49:00+01:00</updated><id>http://stffrdhrn.github.io/hardware/embedded/openrisc/2023/08/24/or1k-fpu-linux-and-compilers</id><content type="html" xml:base="http://stffrdhrn.github.io/hardware/embedded/openrisc/2023/08/24/or1k-fpu-linux-and-compilers.html"><![CDATA[<p>In this series we introduce the <a href="https://openrisc.io/">OpenRISC</a>
<a href="https://sourceware.org/glibc/">glibc</a> <a href="https://en.wikipedia.org/wiki/Floating-point_unit">FPU</a>
port and the effort required to get user space FPU support into OpenRISC Linux.
Adding FPU support to <a href="https://en.wikipedia.org/wiki/User_space_and_kernel_space">user space</a>
applications is a full stack project covering:</p>

<ul>
  <li><a href="/hardware/embedded/openrisc/2023/04/25/or1k-fpu-port.html">Architecture Specification</a></li>
  <li><a href="/hardware/embedded/openrisc/2023/08/22/or1k-fpu-hw.html">Simulators and CPU implementations</a></li>
  <li>Linux Kernel support</li>
  <li>GCC Instructions and Soft FPU</li>
  <li>Binutils/GDB Debugging Support</li>
  <li>glibc support</li>
</ul>

<p>Have a look at previous articles if you need to catch up.  In this article we will cover
updates done to the Linux kernel, GCC (done long ago) and GDB for debugging.</p>

<h1 id="porting-linux-to-an-fpu">Porting Linux to an FPU</h1>

<p>Supporting hardware floating point operations in Linux user space applications
means adding the ability for the Linux kernel to store and restore FPU state
upon context switches.  This allows multiple programs to use the FPU at the same
time as if each program has its own floating point hardware.  The kernel allows
programs to multiplex usage of the FPU transparently.  This is similar to how
the kernel allows user programs to share other hardware like the <strong>CPU</strong> and
<strong>Memory</strong>.</p>

<p>On OpenRISC this only requires adding one additional register, the floating point
control and status register (<code class="language-plaintext highlighter-rouge">FPCSR</code>), to context switches.  The <code class="language-plaintext highlighter-rouge">FPCSR</code>
contains status bits pertaining to <a href="https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html">rounding mode and exceptions</a>.</p>

<p>We will cover three places where Linux needs to store FPU state:</p>

<ul>
  <li>Context Switching</li>
  <li>Exception Context (Signal Frames)</li>
  <li>Register Sets</li>
</ul>

<h2 id="context-switching">Context Switching</h2>

<p>In order for the kernel to be able to support FPU usage in user space programs
it needs to be able to save and restore the FPU state during context switches.
Let’s look at the different kinds of context switches that happen in the Linux
Kernel to understand where FPU state needs to be stored and restored.</p>

<p>For our discussion purposes we will define a context switch as being
when one CPU state (context) is saved from CPU hardware (registers and program
counter) to memory and then another CPU state is loaded from memory to CPU
hardware.  This can happen in a few ways.</p>

<ol>
  <li>When handling exceptions such as interrupts</li>
  <li>When the scheduler wants to swap out one process for another.</li>
</ol>

<p>Furthermore, exceptions may be categorized into one of two cases: Interrupts and
<a href="https://en.wikipedia.org/wiki/System_call">System calls</a>.  For each of these a different amount of CPU state needs to be saved.</p>

<ol>
  <li><strong>Interrupts</strong> - for timers, hardware interrupts, illegal instructions etc.
For this case the <strong>full</strong> context will be saved, as the switch is not
anticipated by the currently running program.</li>
  <li><strong>System calls</strong> - for system calls the user space program knows that it is making a function
call and therefore will save required state as per <a href="https://openrisc.io/or1k.html#__RefHeading__504887_595890882">OpenRISC calling conventions</a>.
In this case the kernel needs to save only <strong>callee</strong> saved registers.</li>
</ol>

<p>Below are outlines of the sequence of operations that take place when
transitioning from Interrupts and System calls into kernel code.  It highlights
at what point state is saved with the <span style="color:white;background-color:#69f;">INT</span>,
<span style="color:white;background-color:black;">kINT</span> and <span style="color:white;background-color:#c6f;">FPU</span> labels.</p>

<p>The <code class="language-plaintext highlighter-rouge">monospace</code> labels below correspond to the actual assembly labels in <a href="https://elixir.bootlin.com/linux/latest/source/arch/openrisc/kernel/entry.S">entry.S</a>,
the part of the OpenRISC Linux kernel that handles entry into kernel code.</p>

<p><strong>Interrupts (slow path)</strong></p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">EXCEPTION_ENTRY</code> - all <span style="color:white;background-color:#69f;">INT</span> state is saved</li>
  <li>Handle <em>exception</em> in kernel code</li>
  <li><code class="language-plaintext highlighter-rouge">_resume_userspace</code> - Check <code class="language-plaintext highlighter-rouge">thread_info</code> for work pending</li>
  <li>If work pending
    <ul>
      <li><code class="language-plaintext highlighter-rouge">_work_pending</code> - Call <a href="https://elixir.bootlin.com/linux/latest/source/arch/openrisc/kernel/signal.c#L298">do_work_pending</a>
        <ul>
          <li>Check if reschedule needed
            <ul>
              <li>If so, performs <code class="language-plaintext highlighter-rouge">_switch</code> which saves/restores <span style="color:white;background-color:black;">kINT</span>
and <span style="color:white;background-color:#c6f;">FPU</span> state</li>
            </ul>
          </li>
          <li>Check for pending signals
            <ul>
              <li>If so, performs <code class="language-plaintext highlighter-rouge">do_signal</code> which saves/restores <span style="color:white;background-color:#c6f;">FPU</span> state</li>
            </ul>
          </li>
        </ul>
      </li>
    </ul>
  </li>
  <li><code class="language-plaintext highlighter-rouge">RESTORE_ALL</code> - all <span style="color:white;background-color:#69f;">INT</span> state is restored and return to user space</li>
</ol>

<p><strong>System calls (fast path)</strong></p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">_sys_call_handler</code> - callee saved registers are saved</li>
  <li>Handle <em>syscall</em> in kernel code</li>
  <li><code class="language-plaintext highlighter-rouge">_syscall_check_work</code> - Check <code class="language-plaintext highlighter-rouge">thread_info</code> for work pending</li>
  <li>If work pending
    <ul>
      <li>Save additional <span style="color:white;background-color:#69f;">INT</span> state</li>
      <li><code class="language-plaintext highlighter-rouge">_work_pending</code> - Call <a href="https://elixir.bootlin.com/linux/latest/source/arch/openrisc/kernel/signal.c#L298">do_work_pending</a>
        <ul>
          <li>Check if reschedule needed
            <ul>
              <li>If so, performs <code class="language-plaintext highlighter-rouge">_switch</code> which saves/restores <span style="color:white;background-color:black;">kINT</span>
and <span style="color:white;background-color:#c6f;">FPU</span> state</li>
            </ul>
          </li>
          <li>Check for pending signals
            <ul>
              <li>If so, performs <code class="language-plaintext highlighter-rouge">do_signal</code> which saves/restores <span style="color:white;background-color:#c6f;">FPU</span> state</li>
            </ul>
          </li>
        </ul>
      </li>
      <li><code class="language-plaintext highlighter-rouge">RESTORE_ALL</code> - all <span style="color:white;background-color:#69f;">INT</span> state is restored and return to user space</li>
    </ul>
  </li>
  <li><code class="language-plaintext highlighter-rouge">_syscall_resume_userspace</code> - restore callee saved registers and return to user space.</li>
</ol>

<p>Some key points to note on the above:</p>

<ul>
  <li><span style="color:white;background-color:#69f;">INT</span> denotes the interrupted program’s register state.</li>
  <li><span style="color:white;background-color:black;">kINT</span> denotes the kernel’s register state before/after a context switch.</li>
  <li><span style="color:white;background-color:#c6f;">FPU</span> denotes FPU state.</li>
  <li>In both cases step <strong>4</strong> checks for work pending, which may cause task
rescheduling in which case a <strong>Context Switch</strong> (or <em>task switch</em>) will
be performed.
If rescheduling is not performed then after the sequence is complete processing
will resume where it left off.</li>
  <li>Step <strong>1</strong>, when switching from <em>user mode</em> to <em>kernel mode</em>, is called a <strong>Mode Switch</strong></li>
  <li>Interrupts may happen in <em>user mode</em> or <em>kernel mode</em></li>
  <li>System calls will only happen in <em>user mode</em></li>
  <li><span style="color:white;background-color:#c6f;">FPU</span> state only needs to be saved and restored for <em>user mode</em> programs,
because <em>kernel mode</em> programs, in general, do not use the FPU.</li>
  <li>The current version of the OpenRISC port, as of <code class="language-plaintext highlighter-rouge">v6.8</code>, saves and restores both
<span style="color:white;background-color:#69f;">INT</span> and <span style="color:white;background-color:#c6f;">FPU</span> state.
What is shown above is a more optimized mechanism that only saves FPU state when needed.  Further optimizations could
still be made to only save FPU state for user space, and to skip the save/restore if it has already been done.</li>
</ul>

<p>With these principles in mind we can now look at how the mechanics of context
switching work.</p>

<p>Upon a <strong>Mode Switch</strong> from <em>user mode</em> to <em>kernel mode</em> the process thread
stack switches from using a user space stack to the associated kernel space
stack.  The required state is stored to a stack frame in a <code class="language-plaintext highlighter-rouge">pt_regs</code> structure.</p>

<p>The <code class="language-plaintext highlighter-rouge">pt_regs</code> structure (originally ptrace registers) represents the CPU
<em>registers</em> and <em>program counter</em> context that needs to be saved.</p>

<p>Below we can see how the kernel stack and user space stack relate.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>               kernel space

 +--------------+       +--------------+
 | Kernel Stack |       | Kernel stack |
 |     |        |       |    |         |
 |     v        |       |    |         |
 |   pt_regs' -------\  |    v         |
 |     v        |    |  |  pt_regs' --------\
 |   pt_regs''  |&lt;+  |  |    v         |&lt;+  |
 |              | |  |  |              | |  |
 +--------------+ |  |  +--------------+ |  |
 | thread_info  | |  |  | thread_info  | |  |
 | *task        | |  |  | *task        | |  |
 |  ksp ----------/  |  |  ksp-----------/  |
 +--------------+    |  +--------------+    |
                     |                      |
0xc0000000           |                      |
---------------------|----------------------|-----
                     |                      |
  process a          |   process b          |
 +--------------+    |  +------------+      |
 | User Stack   |    |  | User Stack |      |
 |     |        |    |  |    |       |      |
 |     V        |&lt;---+  |    v       |&lt;-----+
 |              |       |            |
 |              |       |            |
 | text         |       | text       |
 | heap         |       | heap       |
 +--------------+       +------------+


0x00000000
               user space
</code></pre></div></div>

<p><em>process a</em></p>

<p>In the above diagram notice how there are two sets of <code class="language-plaintext highlighter-rouge">pt_regs</code> for process a.
The <em>pt_regs’</em> structure represents the user space registers (<span style="color:white;background-color:#69f;">INT</span>)
that are saved during the switch from <em>user mode</em> to <em>kernel mode</em>.  Notice how
the <em>pt_regs’</em> structure has an arrow pointing to the user space stack; that’s the
saved stack pointer.  The second <em>pt_regs’’</em> structure represents the frozen
kernel state (<span style="color:white;background-color:black;">kINT</span>)
that was saved before a task <em>switch</em> was performed.</p>

<p><em>process b</em></p>

<p>Also in the diagram above we can see process b has only a <em>pt_regs’</em> (<span style="color:white;background-color:#69f;">INT</span>)
structure saved on the stack and does not currently have a <em>pt_regs’’</em> (<span style="color:white;background-color:black;">kINT</span>)
structure saved.  This indicates that this process is currently running in
kernel space and is not yet frozen.</p>

<p>As we can see here, for OpenRISC there are two places to store state.</p>
<ul>
  <li>The <em>mode switch</em> context is saved to a <code class="language-plaintext highlighter-rouge">pt_regs</code> structure on the kernel
stack, represented by <em>pt_regs’</em>.  At this point only integer registers need to
be saved.  This represents the user process state.</li>
  <li>The <em>context switch</em> context is stored by OpenRISC again on the stack,
represented by <em>pt_regs’’</em>.  This represents the kernel’s state before a
task switch.  All state that the kernel needs to resume later is stored.  In
other architectures this state is not stored on the stack but to the <code class="language-plaintext highlighter-rouge">task</code>
structure or to the <code class="language-plaintext highlighter-rouge">thread_info</code> structure.  This context may store all the
extra registers including FPU and Vector registers.</li>
</ul>

<div class="note">
<b>Note</b> In the above diagram we can see the kernel stack and <code>thread_info</code> live in
the same memory space.  This is a source of security issues and many
architectures have moved to support <a href="https://docs.kernel.org/mm/vmalloced-kernel-stacks.html">virtually mapped kernel stacks</a>.
OpenRISC does not yet support this, and it would be a good opportunity
for improvement.
</div>

<p>The structure of the <code class="language-plaintext highlighter-rouge">pt_regs</code> used by OpenRISC is as per below:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">pt_regs</span> <span class="p">{</span>
	<span class="kt">long</span> <span class="n">gpr</span><span class="p">[</span><span class="mi">32</span><span class="p">];</span>
	<span class="kt">long</span>  <span class="n">pc</span><span class="p">;</span>
	<span class="cm">/* For restarting system calls:
	 * Set to syscall number for syscall exceptions,
	 * -1 for all other exceptions.
	 */</span>
	<span class="kt">long</span> <span class="n">orig_gpr11</span><span class="p">;</span>	<span class="cm">/* For restarting system calls */</span>
	<span class="kt">long</span> <span class="n">dummy</span><span class="p">;</span>             <span class="cm">/* Cheap alignment fix */</span>
	<span class="kt">long</span> <span class="n">dummy2</span><span class="p">;</span>		<span class="cm">/* Cheap alignment fix */</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The structure of <code class="language-plaintext highlighter-rouge">thread_struct</code> now used by OpenRISC to store the
user specific FPU state is as per below:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">thread_struct</span> <span class="p">{</span>
       <span class="kt">long</span> <span class="n">fpcsr</span><span class="p">;</span>      <span class="cm">/* Floating point control status register. */</span>
<span class="p">};</span>
</code></pre></div></div>
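
<p>To show how a per-task <code class="language-plaintext highlighter-rouge">fpcsr</code> field lets tasks share one hardware register, here is a small user space
simulation in C.  Everything in it is an assumption made for illustration: <code class="language-plaintext highlighter-rouge">sketch_hw_fpcsr</code> is a plain variable
standing in for the real special purpose register, and <code class="language-plaintext highlighter-rouge">sketch_save_fpu</code>, <code class="language-plaintext highlighter-rouge">sketch_restore_fpu</code> and
<code class="language-plaintext highlighter-rouge">sketch_switch</code> are hypothetical names, not the kernel’s actual functions.</p>

```c
/* Same shape as the thread_struct shown above. */
struct thread_struct_sketch {
	long fpcsr;	/* Saved copy of the floating point control status register. */
};

/* Stand-in for the hardware FPCSR; the kernel would access an SPR instead. */
long sketch_hw_fpcsr;

/* Copy the live FPCSR value into the outgoing task's saved state. */
void sketch_save_fpu(struct thread_struct_sketch *t)
{
	t->fpcsr = sketch_hw_fpcsr;
}

/* Load the incoming task's saved FPCSR value into the live register. */
void sketch_restore_fpu(const struct thread_struct_sketch *t)
{
	sketch_hw_fpcsr = t->fpcsr;
}

/* Task switch: save state for prev, then restore state for next. */
void sketch_switch(struct thread_struct_sketch *prev,
		   const struct thread_struct_sketch *next)
{
	sketch_save_fpu(prev);
	sketch_restore_fpu(next);
}
```

<p>Each task sees its own rounding mode and exception flags again whenever it is switched back in, even though there is
only one <code class="language-plaintext highlighter-rouge">FPCSR</code> in hardware.</p>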

<p>The patches to OpenRISC added to support saving and restoring FPU state
during context switches are below:</p>

<ul>
  <li>2023-04-26 <a href="https://github.com/stffrdhrn/linux/commit/63d7f9f11e5e81de2ce8f1c7a8aaed5b0288eddf">63d7f9f11e5e</a> Stafford Horne   openrisc: Support storing and restoring fpu state</li>
  <li>2024-03-14 <a href="https://github.com/stffrdhrn/linux/commit/ead6248f25e1">ead6248f25e1</a> Stafford Horne   openrisc: Move FPU state out of pt_regs</li>
</ul>

<h2 id="signals-and-signal-frames">Signals and Signal Frames.</h2>

<p>Signal frames are another place that we want FPU’s state, namely <code class="language-plaintext highlighter-rouge">FPCSR</code>, to be available.</p>

<p>When a user process receives a signal it executes a signal handler in the
process space on a stack slightly outside its current stack.  This is set up
with <a href="https://elixir.bootlin.com/linux/latest/source/arch/openrisc/kernel/signal.c#L156">setup_rt_frame</a>.</p>

<p>As we saw above signals are received after syscalls or exceptions, during the
<code class="language-plaintext highlighter-rouge">do_work_pending</code> phase of the entry code.  This means FPU state will need
to be saved and restored.</p>

<p>Again, we can look at the stack frames to paint a picture of how this works.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>               kernel space

 +--------------+
 | Kernel Stack |
 |     |        |
 |     v        |
 |   pt_regs' --------------------------\
 |              |&lt;+                      |
 |              | |                      |
 |              | |                      |
 +--------------+ |                      |
 | thread_info  | |                      |
 | *task        | |                      |
 |  ksp ----------/                      |
 +--------------+                        |
                                         |
0xc0000000                               |
-------------------                      |
                                         |
  process a                              |
 +--------------+                        |
 | User Stack   |                        |
 |     |        |                        |
 |     V        |                        |
 |xxxxxxxxxxxxxx|&gt; STACK_FRAME_OVERHEAD  |
 | siginfo      |\                       |
 | ucontext     | &gt;- sigframe            |
 | retcode[]    |/                       |
 |              |&lt;-----------------------/
 |              |
 | text         |
 | heap         |
 +--------------+

0x00000000
               user space
</code></pre></div></div>

<p>Here we can see that when we enter a signal handler, we can get a bunch of stuff
stuffed in the stack in a <code class="language-plaintext highlighter-rouge">sigframe</code> structure.  This includes the <code class="language-plaintext highlighter-rouge">ucontext</code>,
or user context which points to the original state of the program, registers and
all.  It also includes a bit of code, <code class="language-plaintext highlighter-rouge">retcode</code>, which is a trampoline to bring us
back into the kernel after the signal handler finishes.</p>

<div class="note">
<b>Note</b> we could also set up an alternate <a href="https://man7.org/linux/man-pages/man2/sigaltstack.2.html">signal stack</a>
to use instead of stuffing stuff onto the main user stack. The
above example is the default behaviour.
</div>

<p>The user <code class="language-plaintext highlighter-rouge">pt_regs</code> (as we called <em>pt_regs’</em>) is updated before returning to user
space to execute the signal handler code by updating the registers as follows:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  sp:    stack pointer updated to point to a new user stack area below sigframe
  pc:    program counter:  sa_handler(
  r3:    argument 1:                   signo,
  r4:    argument 2:                  &amp;siginfo
  r5:    argument 3:                  &amp;ucontext)
  r9:    link register:    retcode[]
</code></pre></div></div>

<p>Now, when we return from the kernel to user space, user space will resume in the
signal handler, which runs within the user process context.</p>
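<p>From the user space side, the three arguments set up above are what a handler
installed with <code>SA_SIGINFO</code> receives.  The following is a small sketch; the
function names and the use of <code>SIGUSR1</code> are assumptions for illustration
only.</p>

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Values recorded by the handler so the caller can inspect them. */
static volatile sig_atomic_t seen_signo;
static volatile sig_atomic_t seen_info_signo;
static volatile sig_atomic_t seen_ucontext;

/* Three argument handler: signo, &siginfo and &ucontext.  On OpenRISC these
   are the values the kernel places in r3, r4 and r5. */
static void handler(int signo, siginfo_t *info, void *ucontext)
{
    seen_signo = signo;
    seen_info_signo = info->si_signo;
    seen_ucontext = (ucontext != NULL);
}

/* Install the handler, send SIGUSR1 to ourselves and return 0 on success. */
static int deliver_sigusr1(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = handler;
    sa.sa_flags = SA_SIGINFO;      /* request the siginfo and ucontext arguments */
    sigemptyset(&sa.sa_mask);

    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;
    if (raise(SIGUSR1) != 0)
        return -1;

    /* In a single threaded program the handler has run before raise returns. */
    assert(seen_signo == SIGUSR1);
    return 0;
}
```

<p>The <code>ucontext</code> pointer is passed as <code>void *</code>; a handler that
wants to look at saved registers casts it to <code>ucontext_t *</code>.</p>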

<p>After the signal handler completes it will execute the <code class="language-plaintext highlighter-rouge">retcode</code>
block which is set up to call the special system call <a href="https://man7.org/linux/man-pages/man2/sigreturn.2.html">rt_sigreturn</a>.</p>

<div class="note">
<b>Note</b> for OpenRISC this means the stack has to be executable, which is
a <a href="https://en.wikipedia.org/wiki/Stack_buffer_overflow">major security vulnerability</a>.
Modern architectures do not have executable stacks; instead the trampoline lives in the <a href="https://man7.org/linux/man-pages/man7/vdso.7.html">vdso</a>
or is provided by libc in <code>sa_restorer</code>.
</div>

<p>The <code class="language-plaintext highlighter-rouge">rt_sigreturn</code> system call will restore the <code class="language-plaintext highlighter-rouge">ucontext</code> registers (which
may have been updated by the signal handler) to the user <code class="language-plaintext highlighter-rouge">pt_regs</code> on the
kernel stack.  This allows us to either restore the user context before the
signal was received or return to a new context setup by the signal handler.</p>

<h3 id="a-note-on-user-space-abi-compatibility-for-signals">A note on user space ABI compatibility for signals.</h3>

<p>We need to provide and restore the FPU <code class="language-plaintext highlighter-rouge">FPCSR</code> during signals via
<code class="language-plaintext highlighter-rouge">ucontext</code> but also not break user space ABI.  The ABI is important because
kernel and user space programs may be built at different times.  This means the
layout of existing fields in <code class="language-plaintext highlighter-rouge">ucontext</code> cannot change.  As we can see below by
comparing the <code class="language-plaintext highlighter-rouge">ucontext</code> definitions from Linux, glibc and musl, each project
maintains its own separate header file.</p>

<p>In Linux we cannot add fields to <code class="language-plaintext highlighter-rouge">uc_mcontext</code> as it would shift <code class="language-plaintext highlighter-rouge">uc_sigmask</code>,
and user space built against the old layout could no longer read it.  Fortunately we had a bit of space in <code class="language-plaintext highlighter-rouge">sigcontext</code> in the
unused <code class="language-plaintext highlighter-rouge">oldmask</code> field which we could repurpose for <code class="language-plaintext highlighter-rouge">FPCSR</code>.</p>

<p>The structure used by Linux to populate the signal frame is:</p>

<p>From: <a href="https://elixir.bootlin.com/linux/v6.8/source/include/uapi/asm-generic/ucontext.h#L5">uapi/asm-generic/ucontext.h</a></p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">ucontext</span> <span class="p">{</span>
        <span class="kt">unsigned</span> <span class="kt">long</span>     <span class="n">uc_flags</span><span class="p">;</span>
        <span class="k">struct</span> <span class="n">ucontext</span>  <span class="o">*</span><span class="n">uc_link</span><span class="p">;</span>
        <span class="n">stack_t</span>           <span class="n">uc_stack</span><span class="p">;</span>
        <span class="k">struct</span> <span class="n">sigcontext</span> <span class="n">uc_mcontext</span><span class="p">;</span>
        <span class="n">sigset_t</span>          <span class="n">uc_sigmask</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
<p>From: <a href="https://elixir.bootlin.com/linux/v6.8/source/arch/openrisc/include/uapi/asm/sigcontext.h">uapi/asm/sigcontext.h</a></p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">sigcontext</span> <span class="p">{</span>
        <span class="k">struct</span> <span class="n">user_regs_struct</span> <span class="n">regs</span><span class="p">;</span>  <span class="cm">/* needs to be first */</span>
        <span class="k">union</span> <span class="p">{</span>
                <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">fpcsr</span><span class="p">;</span>
                <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">oldmask</span><span class="p">;</span>  <span class="cm">/* unused */</span>
        <span class="p">};</span>
<span class="p">};</span>
</code></pre></div></div>

<div class="note">
<b>Note</b> Originally <code>sigcontext</code> did not use a <code>union</code>,
which <a href="https://lore.kernel.org/linux-mm/ZL2V77V8xCWTKVR+@antec/T/">caused ABI breakage</a>; this was soon fixed.
</div>
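<p>The layout argument can be checked with <code>offsetof</code>.  The sketch below uses
simplified stand-in structures, not the real kernel headers: the <code>_old</code>,
<code>_grown</code> and <code>_union</code> names are invented here, <code>uc_stack</code> is
left out and <code>uc_sigmask</code> is reduced to a single word.  The
<code>_grown</code> variant is a hypothetical layout that appends a field, shown only to
illustrate why growing the structure moves <code>uc_sigmask</code>.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for user_regs_struct. */
struct regs_sketch {
    unsigned long gpr[32];
    unsigned long pc;
    unsigned long sr;
};

/* Layout with only the unused oldmask field. */
struct sigcontext_old {
    struct regs_sketch regs;
    unsigned long oldmask;
};

/* Hypothetical layout that appends fpcsr as a new field. */
struct sigcontext_grown {
    struct regs_sketch regs;
    unsigned long oldmask;
    unsigned long fpcsr;
};

/* Layout that shares the oldmask storage with fpcsr. */
struct sigcontext_union {
    struct regs_sketch regs;
    union {
        unsigned long fpcsr;
        unsigned long oldmask;
    };
};

/* Simplified ucontext wrappers around each sigcontext variant. */
struct ucontext_old {
    unsigned long uc_flags;
    void *uc_link;
    struct sigcontext_old uc_mcontext;
    unsigned long uc_sigmask;
};

struct ucontext_grown {
    unsigned long uc_flags;
    void *uc_link;
    struct sigcontext_grown uc_mcontext;
    unsigned long uc_sigmask;
};

struct ucontext_union {
    unsigned long uc_flags;
    void *uc_link;
    struct sigcontext_union uc_mcontext;
    unsigned long uc_sigmask;
};

/* The union must not change the size of sigcontext. */
static_assert(sizeof(struct sigcontext_union) == sizeof(struct sigcontext_old),
              "union variant keeps the sigcontext size");

static size_t sigmask_offset_old(void)
{
    return offsetof(struct ucontext_old, uc_sigmask);
}

static size_t sigmask_offset_grown(void)
{
    return offsetof(struct ucontext_grown, uc_sigmask);
}

static size_t sigmask_offset_union(void)
{
    return offsetof(struct ucontext_union, uc_sigmask);
}
```

<p>With the union, <code>uc_sigmask</code> stays where old binaries expect it; with the
appended field it moves by one word.</p>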

<p>From: <a href="https://elixir.bootlin.com/linux/v6.8/source/arch/openrisc/include/uapi/asm/ptrace.h">uapi/asm/ptrace.h</a></p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">user_regs_struct</span> <span class="p">{</span>
        <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">gpr</span><span class="p">[</span><span class="mi">32</span><span class="p">];</span>
        <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">pc</span><span class="p">;</span>
        <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">sr</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The structure that <a href="https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/or1k/sys/ucontext.h;h=b17e91915461b5f2095682efd174e7612d2ec119;hb=HEAD">glibc</a> expects is:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Context to describe whole processor state.  */</span>
<span class="k">typedef</span> <span class="k">struct</span>
  <span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="kt">int</span> <span class="n">__gprs</span><span class="p">[</span><span class="n">__NGREG</span><span class="p">];</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="kt">int</span> <span class="n">__pc</span><span class="p">;</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="kt">int</span> <span class="n">__sr</span><span class="p">;</span>
  <span class="p">}</span> <span class="n">mcontext_t</span><span class="p">;</span>

<span class="cm">/* Userlevel context.  */</span>
<span class="k">typedef</span> <span class="k">struct</span> <span class="n">ucontext_t</span>
  <span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="kt">int</span> <span class="n">__uc_flags</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">ucontext_t</span> <span class="o">*</span><span class="n">uc_link</span><span class="p">;</span>
    <span class="n">stack_t</span> <span class="n">uc_stack</span><span class="p">;</span>
    <span class="n">mcontext_t</span> <span class="n">uc_mcontext</span><span class="p">;</span>
    <span class="n">sigset_t</span> <span class="n">uc_sigmask</span><span class="p">;</span>
  <span class="p">}</span> <span class="n">ucontext_t</span><span class="p">;</span>
</code></pre></div></div>

<div class="note">
<b>Note</b> This is broken: the <code>mcontext_t</code> struct in glibc is
missing the space for <code>oldmask</code>.
</div>

<p>The structure used by <a href="https://git.musl-libc.org/cgit/musl/tree/arch/or1k/bits/signal.h">musl</a> is:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="n">sigcontext</span> <span class="p">{</span>
	<span class="k">struct</span> <span class="p">{</span>
		<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">gpr</span><span class="p">[</span><span class="mi">32</span><span class="p">];</span>
		<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">pc</span><span class="p">;</span>
		<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">sr</span><span class="p">;</span>
	<span class="p">}</span> <span class="n">regs</span><span class="p">;</span>
	<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">oldmask</span><span class="p">;</span>
<span class="p">}</span> <span class="n">mcontext_t</span><span class="p">;</span>

<span class="k">typedef</span> <span class="k">struct</span> <span class="n">__ucontext</span> <span class="p">{</span>
	<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">uc_flags</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">__ucontext</span> <span class="o">*</span><span class="n">uc_link</span><span class="p">;</span>
	<span class="n">stack_t</span> <span class="n">uc_stack</span><span class="p">;</span>
	<span class="n">mcontext_t</span> <span class="n">uc_mcontext</span><span class="p">;</span>
	<span class="n">sigset_t</span> <span class="n">uc_sigmask</span><span class="p">;</span>
<span class="p">}</span> <span class="n">ucontext_t</span><span class="p">;</span>

</code></pre></div></div>

<p>Below are the patches to the OpenRISC kernel to add floating point state to the
signal API.  The first patch caused some ABI breakage, which was fixed in the second patch.</p>

<ul>
  <li>2023-04-26 <a href="https://github.com/stffrdhrn/linux/commit/27267655c531">27267655c531</a> Stafford Horne   openrisc: Support floating point user api</li>
  <li>2023-07-10 <a href="https://github.com/stffrdhrn/linux/commit/dceaafd66881">dceaafd66881</a> Stafford Horne   openrisc: Union fpcsr and oldmask in sigcontext to unbreak userspace ABI</li>
</ul>

<h2 id="register-sets">Register sets</h2>

<p>Register sets provide debuggers the ability to read and save the state of
registers in other processes.  This is done via the <a href="https://man7.org/linux/man-pages/man2/ptrace.2.html">ptrace</a>
<code class="language-plaintext highlighter-rouge">PTRACE_GETREGSET</code> and <code class="language-plaintext highlighter-rouge">PTRACE_SETREGSET</code> requests.</p>

<p>Regsets also define what is dumped to <a href="https://en.wikipedia.org/wiki/Core_dump">core dumps</a> when a process crashes.</p>
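<p>As a rough illustration of how a debugger uses this interface, the sketch below
forks a child, lets it stop itself, and reads its general purpose regset with
<code>PTRACE_GETREGSET</code> and <code>NT_PRSTATUS</code>.  The function name and the
buffer size are assumptions for this sketch.  It also assumes the environment
permits <code>ptrace</code>.  Reading a floating point regset would presumably use a
different note type, such as <code>NT_PRFPREG</code>, but that is not shown here.</p>

```c
#define _GNU_SOURCE
#include <elf.h>          /* NT_PRSTATUS */
#include <signal.h>
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>      /* struct iovec */
#include <sys/wait.h>
#include <unistd.h>

/* Read the general purpose regset of a freshly forked, stopped child.
   Returns the number of bytes the kernel wrote, or -1 on error. */
static long read_child_gpr_regset(void)
{
    unsigned long buf[512];   /* generously sized; the real size is per architecture */
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
    long result = -1;
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: ask to be traced, then stop so the parent can inspect us. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) != 0)
            _exit(1);
        raise(SIGSTOP);
        _exit(0);
    }

    /* Parent: wait for the child to stop.  If it exited instead, tracing
       was probably not permitted. */
    if (waitpid(pid, &status, 0) != pid || !WIFSTOPPED(status))
        return -1;

    /* On success the kernel shrinks iov_len to the size of the regset. */
    if (ptrace(PTRACE_GETREGSET, pid, (void *)(uintptr_t)NT_PRSTATUS, &iov) == 0)
        result = (long)iov.iov_len;

    /* Clean up the stopped child. */
    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);
    return result;
}
```

<p>The returned size is architecture specific, which is why the buffer is simply
made larger than any regset we expect to see.</p>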

<p>In OpenRISC we added the ability to get and set the <code class="language-plaintext highlighter-rouge">FPCSR</code> register
with the following patches:</p>

<ul>
  <li>2023-04-26 <a href="https://github.com/stffrdhrn/linux/commit/c91b4a07655d">c91b4a07655d</a> Stafford Horne   openrisc: Add floating point regset</li>
  <li>2024-03-14 <a href="https://github.com/stffrdhrn/linux/commit/14f89b18c1173fb6664bb338db850f5ad0484b93#diff-0c4ba219cbf5887111a27c6234092536a513f07927c418c14bb227a8ac85eaae">14f89b18c117</a> Stafford Horne   openrisc: Move FPU state out of pt_regs</li>
</ul>

<h1 id="porting-gcc-to-an-fpu">Porting GCC to an FPU</h1>

<h2 id="supporting-fpu-instructions">Supporting FPU Instructions</h2>

<p>I ported GCC to the OpenRISC FPU back in <a href="https://gcc.gnu.org/git/?p=gcc.git;a=commit;f=gcc/config/or1k/or1k.md;h=44080af98edf7d8a59a94dd803f60cf0505fba34">2019</a>.
This entailed defining new instructions in the RTL machine description, for
example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> (define_insn "plussf3"
   [(set (match_operand:SF 0 "register_operand" "=r")
         (plus:SF (match_operand:SF 1 "register_operand" "r")
                  (match_operand:SF 2 "register_operand" "r")))]
   "TARGET_HARD_FLOAT"
   "lf.add.s\t%d0, %d1, %d2"
   [(set_attr "type" "fpu")])

 (define_insn "minussf3"
   [(set (match_operand:SF 0 "register_operand" "=r")
         (minus:SF (match_operand:SF 1 "register_operand" "r")
                   (match_operand:SF 2 "register_operand" "r")))]
   "TARGET_HARD_FLOAT"
   "lf.sub.s\t%d0, %d1, %d2"
   [(set_attr "type" "fpu")])
</code></pre></div></div>

<p>The above is a simplified example of <a href="https://gcc.gnu.org/onlinedocs/gccint/Arithmetic.html">GCC Register Transfer Language (RTL)</a>
lisp expressions.  Note, the real expression actually uses mode iterators and is a bit harder to understand, hence the simplified version above.
These expressions are used for translating the GCC compiler RTL from its <a href="https://en.wikipedia.org/wiki/Abstract_syntax_tree">abstract syntax tree</a>
form to actual machine instructions.</p>

<p>Notice how the above expressions are in the format <code class="language-plaintext highlighter-rouge">(define_insn INSN_NAME RTL_PATTERN CONDITION MACHINE_INSN ...)</code>. If
we break it down we see:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">INSN_NAME</code> - this is a unique name given to the instruction.</li>
  <li><code class="language-plaintext highlighter-rouge">RTL_PATTERN</code> - this is a pattern we look for in the RTL tree. Notice how the lisp represents 3 registers connected by the instruction node.</li>
  <li><code class="language-plaintext highlighter-rouge">CONDITION</code> - this is used to enable the instruction, in our case we use <code class="language-plaintext highlighter-rouge">TARGET_HARD_FLOAT</code>.  This means if the GCC hardware floating point
              option is enabled this expression will be enabled.</li>
  <li><code class="language-plaintext highlighter-rouge">MACHINE_INSN</code> - this represents the actual OpenRISC assembly instruction that will be output.</li>
</ul>
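<p>To connect the patterns to source code, here is the kind of C that they are
meant to match.  The function names are made up for this sketch.  Compiling it with
something like <code>or1k-elf-gcc -O2 -mhard-float -S</code> should let you check
whether <code>lf.add.s</code> and <code>lf.sub.s</code> show up in the generated
assembly; I have not reproduced the output here.</p>

```c
/* Single precision add: the (plus:SF ...) shape from the simplified
   pattern above, which is expected to map to lf.add.s when
   TARGET_HARD_FLOAT is enabled. */
float add_floats(float a, float b)
{
    return a + b;
}

/* Single precision subtract: the (minus:SF ...) shape, expected to map
   to lf.sub.s. */
float sub_floats(float a, float b)
{
    return a - b;
}
```

<p>Without the hardware floating point option the <code>CONDITION</code> is false, so
the compiler has to fall back to software routines for the same source.</p>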

<h2 id="supporting-glibc-math">Supporting Glibc Math</h2>

<p>In order for glibc to properly support floating point operations GCC needs to do
a bit more than just support outputting floating point instructions.  Another
component of GCC is software floating point emulation.  When there are
operations not supported by hardware GCC needs to fall back to using software
emulation.  With the way GCC and glibc weave software math routines and floating
point instructions we can think of the FPU as a math accelerator.  For example,
the floating point square root operation is not provided by OpenRISC hardware.</p>

<p>When operations like square root are not available in hardware glibc will inject
software routines to handle the operation.  The emitted square root routine
may use hardware multiply <code class="language-plaintext highlighter-rouge">lf.mul.s</code> and divide <code class="language-plaintext highlighter-rouge">lf.div.s</code> operations to
accelerate the emulation.</p>
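<p>To give a feel for how a software routine can lean on hardware arithmetic,
here is a toy Newton iteration for square root.  This is not glibc's
implementation; <code>soft_sqrtf</code> is a hypothetical helper that ignores NaN,
infinity and negative inputs.  Each step uses only add, multiply and divide, which
a hard float build can issue as FPU instructions.</p>

```c
/* Toy square root using Newton's method: next = (guess + x / guess) / 2.
   Hypothetical helper for illustration, not the glibc routine. */
static float soft_sqrtf(float x)
{
    float guess;
    int i;

    if (x <= 0.0f)
        return 0.0f;                   /* sketch only: no NaN or negative handling */

    /* Start at or above the true root so the iteration decreases. */
    guess = x > 1.0f ? x : 1.0f;

    for (i = 0; i < 100; i++) {
        float next = 0.5f * (guess + x / guess);

        if (next >= guess)             /* no further progress: converged */
            break;
        guess = next;
    }
    return guess;
}
```

<p>The loop body is where the "math accelerator" view applies: the control flow is
software, while the arithmetic inside it can run on the FPU.</p>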

<p>In order for this to work correctly the rounding mode and exception state of the
FPU and libgcc emulation need to be in sync. Notably, we had one patch to fix an
issue with exceptions not being in sync which was found when running glibc
tests.</p>

<ul>
  <li>2023-03-19 <a href="https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=33fb1625992ba8180b42988e714460bcab08ca0f">33fb1625992</a> or1k: Do not clear existing FPU exceptions before updating</li>
</ul>

<p>The libc <a href="https://www.gnu.org/software/libc/manual/html_node/Mathematics.html">math</a> routines include:</p>

<ul>
  <li>Trigonometry functions - For example <code class="language-plaintext highlighter-rouge">float sinf (float x)</code></li>
  <li>Logarithm functions - For example <code class="language-plaintext highlighter-rouge">float logf (float x)</code></li>
  <li>Hyperbolic functions - For example <code class="language-plaintext highlighter-rouge">complex float ccoshf (complex float z)</code></li>
</ul>

<p>The above just names a few, but as you can imagine the floating point acceleration
provided by the <code class="language-plaintext highlighter-rouge">FPU</code> is essential for performant scientific applications.</p>

<h1 id="adding-debugging-capabilities-for-an-fpu">Adding debugging capabilities for an FPU</h1>

<p>FPU debugging allows a user to inspect the FPU specific registers.
This includes viewing FPU state registers and flags as well as the floating point
values in each general purpose register.  This is not yet implemented on
OpenRISC.</p>

<p>This will be something others can take up.  The work required is to map
Linux FPU register sets to GDB.</p>

<h1 id="summary">Summary</h1>

<p>In summary, adding floating point support to Linux revolved around adding one more register, the <code class="language-plaintext highlighter-rouge">FPCSR</code>,
to context switches and a few other places.</p>

<p>GCC fixes were needed to make sure hardware and software floating point routines could work together.</p>

<p>There are still improvements that can be made to the Linux port as noted above.  In the next
article we will wrap things up by showing the glibc port.</p>

<h1 id="further-reading">Further Reading</h1>

<ul>
  <li><a href="https://www.linfo.org/context_switch.html">Context Switch</a> - good definition of context switches.</li>
  <li><a href="https://www.maizure.org/projects/evolution_x86_context_switch_linux/">Evolution of the x86 context switch in Linux</a> - A great history of the linux context switch code.</li>
  <li><a href="http://liujunming.top/2022/01/08/Notes-about-FPU-implementation-in-Linux-kernel/">Notes about FPU implementation in Linux kernel</a> - Next step in x86 FPU context switch optimizations, smart loading</li>
  <li><a href="https://www.baeldung.com/linux/kernel-stack-and-user-space-stack#:~:text=User%20and%20Kernel%20Stacks,part%20of%20the%20kernel%20space.">Kernel Stack and User Stack</a> - explanation
of kernel and user space stacks.</li>
  <li><a href="https://docs.kernel.org/mm/vmalloced-kernel-stacks.html">Virtually Mapped Stacks</a> - details about relocating kernel stack for security</li>
  <li><a href="https://lpc.events/event/17/contributions/1462/">pt_regs the good the bad and the ugly</a> - on the history of <code class="language-plaintext highlighter-rouge">pt_regs</code></li>
</ul>]]></content><author><name>Stafford Horne</name></author><category term="hardware" /><category term="embedded" /><category term="openrisc" /><summary type="html"><![CDATA[In this series we introduce the OpenRISC glibc FPU port and the effort required to get user space FPU support into OpenRISC Linux. Adding FPU support to user space applications is a full stack project covering:]]></summary></entry><entry><title type="html">OpenRISC FPU Port - Fixing Hardware</title><link href="http://stffrdhrn.github.io/hardware/embedded/openrisc/2023/08/22/or1k-fpu-hw.html" rel="alternate" type="text/html" title="OpenRISC FPU Port - Fixing Hardware" /><published>2023-08-22T19:45:00+01:00</published><updated>2023-08-22T19:45:00+01:00</updated><id>http://stffrdhrn.github.io/hardware/embedded/openrisc/2023/08/22/or1k-fpu-hw</id><content type="html" xml:base="http://stffrdhrn.github.io/hardware/embedded/openrisc/2023/08/22/or1k-fpu-hw.html"><![CDATA[<p>In the <a href="/hardware/embedded/openrisc/2023/04/25/or1k-fpu-port.html">last article</a> we introduced the
OpenRISC glibc FPU port and the effort required to get user space FPU support
into OpenRISC Linux user applications.  We explained how the FPU port is a
full stack project covering:</p>

<ul>
  <li><a href="/hardware/embedded/openrisc/2023/04/25/or1k-fpu-port.html">Architecture Specification</a></li>
  <li>Simulators and CPU implementations</li>
  <li>Linux Kernel support</li>
  <li>GCC Instructions and Soft FPU</li>
  <li>Binutils/GDB Debugging Support</li>
  <li>glibc support</li>
</ul>

<p>In this entry we will cover updating <em>Simulators and CPU implementations</em> to support
the architecture changes called for in the previous article:</p>

<ul>
  <li>Allowing usermode programs to update the FPCSR register</li>
  <li>Detecting tininess before rounding</li>
</ul>

<h1 id="simulator-updates">Simulator Updates</h1>

<p>The simulators used for testing OpenRISC software without hardware are QEMU
and or1ksim.  They both needed to be updated to conform to the specification
updates discussed above.</p>

<h2 id="or1ksim-updates">Or1ksim Updates</h2>

<p>The OpenRISC architecture simulator or1ksim has been updated with the single patch:
<a href="https://github.com/openrisc/or1ksim/commit/7a03376c1bd0fc0d5008d3222c623601e7080b43">cpu: Allow FPCSR to be read/written in user mode</a>.</p>

<p>The softfloat FPU implementation was already configured to detect tininess before
rounding.</p>

<p>If you are interested you can download and run the simulator and test this out
with a docker image pulled from docker hub using the following:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># using podman instead of docker, you can use docker here too</span>
podman pull stffrdhrn/or1k-sim-env:latest
podman run <span class="nt">-it</span> <span class="nt">--rm</span> stffrdhrn/or1k-sim-env:latest

root@9a4a52eec8ee:/tmp# or1k-elf-sim <span class="nt">-version</span>
Seeding random generator with value 0x4a3c2bbd
OpenRISC 1000 Architectural Simulator, version 2023-08-20
</code></pre></div></div>

<p>This starts up an environment which has access to the OpenRISC architecture
simulator and a GNU compiler toolchain.  While still in the container we can run a
quick test using the FPU as follows:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Create a test program using OpenRISC FPU</span>
<span class="nb">cat</span> <span class="o">&gt;</span> fpee.c <span class="o">&lt;&lt;</span><span class="no">EOF</span><span class="sh">
#include &lt;float.h&gt;
#include &lt;stdio.h&gt;
#include &lt;or1k-sprs.h&gt;
#include &lt;or1k-support.h&gt;

static void enter_user_mode() {
  int32_t sr = or1k_mfspr(OR1K_SPR_SYS_SR_ADDR);
  sr &amp;= ~OR1K_SPR_SYS_SR_SM_MASK;
  or1k_mtspr(OR1K_SPR_SYS_SR_ADDR, sr);
}
static void enable_fpu_exceptions() {
  unsigned long fpcsr = OR1K_SPR_SYS_FPCSR_FPEE_MASK;
  or1k_mtspr(OR1K_SPR_SYS_FPCSR_ADDR, fpcsr);
}
static void fpe_handler() {
  printf("Got FPU Exception, PC: 0x%lx</span><span class="se">\n</span><span class="sh">", or1k_mfspr(OR1K_SPR_SYS_EPCR_BASE));
}
int main() {
  float result;

  or1k_exception_handler_add(0xd, fpe_handler);
#ifdef USER_MODE
  /* Note, printf here also allocates some memory allowing user mode runtime to
     work.  */
  printf("Enabling user mode</span><span class="se">\n</span><span class="sh">");
  enter_user_mode();
#endif
  enable_fpu_exceptions();

  printf("Exceptions enabled, now DIV 3.14 / 0!</span><span class="se">\n</span><span class="sh">");
  result = 3.14f / 0.0f;

  /* Verify we see infinity.  */
  printf("Result: %f</span><span class="se">\n</span><span class="sh">", result);
  /* Verify we see DZF set.  */
  printf("FPCSR: %x</span><span class="se">\n</span><span class="sh">", or1k_mfspr(OR1K_SPR_SYS_FPCSR_ADDR));

#ifdef USER_MODE
  asm volatile("l.movhi r3, 0; l.nop 1"); /* Exit sim, now */
#endif
  return 0;
}
</span><span class="no">EOF

</span><span class="c"># Compile the program</span>
or1k-elf-gcc <span class="nt">-g</span> <span class="nt">-O2</span> <span class="nt">-mhard-float</span> fpee.c <span class="nt">-o</span> fpee
or1k-elf-sim <span class="nt">-f</span> /opt/or1k/sim.cfg ./fpee

<span class="c"># Expected results</span>
<span class="c"># Program Header: PT_LOAD, vaddr: 0x00000000, paddr: 0x0 offset: 0x00002000, filesz: 0x000065ab, memsz: 0x000065ab</span>
<span class="c"># Program Header: PT_LOAD, vaddr: 0x000085ac, paddr: 0x85ac offset: 0x000085ac, filesz: 0x000000c8, memsz: 0x0000046c</span>
<span class="c"># WARNING: sim_init: Debug module not enabled, cannot start remote service to GDB</span>
<span class="c"># Exceptions enabled, now DIV 3.14 / 0!</span>
<span class="c"># Got FPU Exception, PC: 0x2068</span>
<span class="c"># Result: f</span>
<span class="c"># FPCSR: 801</span>

<span class="c"># Compile the program to run in USER_MODE</span>
or1k-elf-gcc <span class="nt">-g</span> <span class="nt">-O2</span> <span class="nt">-mhard-float</span> <span class="nt">-DUSER_MODE</span> fpee.c <span class="nt">-o</span> fpee
or1k-elf-sim <span class="nt">-f</span> /opt/or1k/sim.cfg ./fpee

<span class="c"># Expected results with USER_MODE</span>
<span class="c"># Program Header: PT_LOAD, vaddr: 0x00000000, paddr: 0x0 offset: 0x00002000, filesz: 0x000065ab, memsz: 0x000065ab</span>
<span class="c"># Program Header: PT_LOAD, vaddr: 0x000085ac, paddr: 0x85ac offset: 0x000085ac, filesz: 0x000000c8, memsz: 0x0000046c</span>
<span class="c"># WARNING: sim_init: Debug module not enabled, cannot start remote service to GDB</span>
<span class="c"># Enabling user mode</span>
<span class="c"># Exceptions enabled, now DIV 3.14 / 0!</span>
<span class="c"># Got FPU Exception, PC: 0x2068</span>
<span class="c"># Result: f</span>
<span class="c"># FPCSR: 801</span>
<span class="c"># exit(0)</span>
</code></pre></div></div>

<p>In the above we can see how to compile a simple FPU test program and run
it on or1ksim.  The program sets up an FPU exception handler, enables exceptions
then does a divide by zero to produce an exception. This program uses the
OpenRISC <a href="https://sourceware.org/newlib/">newlib</a> (baremetal) toolchain to
compile a program that can run directly on the simulator, as opposed to a
program running in an OS on a simulator or hardware.</p>

<p>Note that normally <em>newlib</em> programs expect to run in supervisor mode, so when
our program switches to user mode we need to take some precautions to ensure it
can run correctly.  As noted in the comments, when allocating memory and exiting,
the <em>newlib</em> runtime will usually do things like disabling/enabling interrupts, which
will fail when running in user mode.</p>

<h2 id="qemu-updates">QEMU Updates</h2>

<p>The QEMU update was done in my
<a href="https://lore.kernel.org/qemu-devel/20230511151000.381911-1-shorne@gmail.com">OpenRISC user space FPCSR</a>
qemu patch series.  The series was merged for the
<a href="https://wiki.qemu.org/ChangeLog/8.1">qemu 8.1</a> release.</p>

<p>The updates were split into three changes:</p>
<ul>
  <li>Allowing FPCSR access in user mode.</li>
  <li>Properly setting the exception PC address on floating point exceptions.</li>
  <li>Configuring the QEMU softfloat implementation to perform tininess check
before rounding.</li>
</ul>

<h3 id="qemu-patch-1">QEMU Patch 1</h3>

<p>The first patch to <em>allow FPCSR access in user mode</em> was trivial, but required some
code structure changes making the patch look bigger than it really was.</p>

<h3 id="qemu-patch-2">QEMU Patch 2</h3>

<p>The next patch to <em>properly set the exception PC address</em> fixed a long-standing
bug where the <code class="language-plaintext highlighter-rouge">EPCR</code> was not properly updated after FPU exceptions.  Up until now
OpenRISC userspace did not support FPU instructions and this code path had not
been tested.</p>

<p>To explain why this fix is important let us look at the <code class="language-plaintext highlighter-rouge">EPCR</code> and what it is used for
in a bit more detail.
In general, when an exception occurs an OpenRISC CPU will store the program counter (<code class="language-plaintext highlighter-rouge">PC</code>)
of the instruction that caused the exception into the exception program counter register
(<code class="language-plaintext highlighter-rouge">EPCR</code>).  Floating point exceptions are a special case in that the <code class="language-plaintext highlighter-rouge">EPCR</code> is
actually set to the next instruction to be executed; this is to avoid looping.</p>

<p>When the Linux kernel handles a floating point exception it follows the path
<a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/openrisc/kernel/head.S?h=v6.4-rc7#n429">0xd00</a> &gt; <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/openrisc/kernel/entry.S?h=v6.4-rc7#n853">fpe_trap_handler</a> &gt; <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/openrisc/kernel/traps.c?h=v6.4-rc7#n246"> do_fpe_trap</a>.  This will setup a
<a href="https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_04" title="Learn about Signals">signal</a> to be delivered to the user process.
The Linux kernel uses the <code class="language-plaintext highlighter-rouge">EPCR</code> to report the exception instruction address to
userspace via a <code class="language-plaintext highlighter-rouge">signal</code>, which we can see being done in <code class="language-plaintext highlighter-rouge">do_fpe_trap</code>
below:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">asmlinkage</span> <span class="kt">void</span> <span class="nf">do_fpe_trap</span><span class="p">(</span><span class="k">struct</span> <span class="n">pt_regs</span> <span class="o">*</span><span class="n">regs</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">address</span><span class="p">)</span>
<span class="p">{</span>
	<span class="kt">int</span> <span class="n">code</span> <span class="o">=</span> <span class="n">FPE_FLTUNK</span><span class="p">;</span>
	<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">fpcsr</span> <span class="o">=</span> <span class="n">regs</span><span class="o">-&gt;</span><span class="n">fpcsr</span><span class="p">;</span>

	<span class="k">if</span> <span class="p">(</span><span class="n">fpcsr</span> <span class="o">&amp;</span> <span class="n">SPR_FPCSR_IVF</span><span class="p">)</span>
		<span class="n">code</span> <span class="o">=</span> <span class="n">FPE_FLTINV</span><span class="p">;</span>
	<span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">fpcsr</span> <span class="o">&amp;</span> <span class="n">SPR_FPCSR_OVF</span><span class="p">)</span>
		<span class="n">code</span> <span class="o">=</span> <span class="n">FPE_FLTOVF</span><span class="p">;</span>
	<span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">fpcsr</span> <span class="o">&amp;</span> <span class="n">SPR_FPCSR_UNF</span><span class="p">)</span>
		<span class="n">code</span> <span class="o">=</span> <span class="n">FPE_FLTUND</span><span class="p">;</span>
	<span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">fpcsr</span> <span class="o">&amp;</span> <span class="n">SPR_FPCSR_DZF</span><span class="p">)</span>
		<span class="n">code</span> <span class="o">=</span> <span class="n">FPE_FLTDIV</span><span class="p">;</span>
	<span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">fpcsr</span> <span class="o">&amp;</span> <span class="n">SPR_FPCSR_IXF</span><span class="p">)</span>
		<span class="n">code</span> <span class="o">=</span> <span class="n">FPE_FLTRES</span><span class="p">;</span>

	<span class="cm">/* Clear all flags */</span>
	<span class="n">regs</span><span class="o">-&gt;</span><span class="n">fpcsr</span> <span class="o">&amp;=</span> <span class="o">~</span><span class="n">SPR_FPCSR_ALLF</span><span class="p">;</span>

	<span class="n">force_sig_fault</span><span class="p">(</span><span class="n">SIGFPE</span><span class="p">,</span> <span class="n">code</span><span class="p">,</span> <span class="p">(</span><span class="kt">void</span> <span class="n">__user</span> <span class="o">*</span><span class="p">)</span><span class="n">regs</span><span class="o">-&gt;</span><span class="n">pc</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Here we see the exception becomes a <code class="language-plaintext highlighter-rouge">SIGFPE</code> signal and the exception address in
<code class="language-plaintext highlighter-rouge">regs-&gt;pc</code> is passed to <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/signal.c?h=v6.4-rc7#n1705">force_sig_fault</a>.  The <code class="language-plaintext highlighter-rouge">PC</code> will be used to set the
<code class="language-plaintext highlighter-rouge">si_addr</code> field of the <code class="language-plaintext highlighter-rouge">siginfo_t</code> structure.</p>
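
<p>To make the user space side of this concrete, here is a minimal sketch of how a program
could receive the signal and read those fields.  This is my own illustration and not code
from the kernel or glibc; the names <code class="language-plaintext highlighter-rouge">fpe_code_name</code>, <code class="language-plaintext highlighter-rouge">on_sigfpe</code> and
<code class="language-plaintext highlighter-rouge">install_fpe_handler</code> are hypothetical.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define _POSIX_C_SOURCE 200809L
#include &lt;signal.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;unistd.h&gt;

/* Map the si_code delivered with SIGFPE back to the name the kernel used. */
const char *fpe_code_name(int si_code)
{
    switch (si_code) {
    case FPE_FLTINV: return "FPE_FLTINV";
    case FPE_FLTOVF: return "FPE_FLTOVF";
    case FPE_FLTUND: return "FPE_FLTUND";
    case FPE_FLTDIV: return "FPE_FLTDIV";
    case FPE_FLTRES: return "FPE_FLTRES";
    default:         return "unknown";
    }
}

/* si_addr carries the PC that the kernel passed to force_sig_fault. */
static void on_sigfpe(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)ucontext;
    /* fprintf is not async-signal-safe; tolerable only in a demo. */
    fprintf(stderr, "SIGFPE at %p: %s\n", info-&gt;si_addr,
            fpe_code_name(info-&gt;si_code));
    _exit(1);
}

/* Install the handler; returns 0 on success, like sigaction. */
int install_fpe_handler(void)
{
    struct sigaction sa;

    memset(&amp;sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigfpe;
    sa.sa_flags = SA_SIGINFO;   /* ask for the siginfo_t argument */
    sigemptyset(&amp;sa.sa_mask);
    return sigaction(SIGFPE, &amp;sa, NULL);
}
</code></pre></div></div>

<p>A program would call <code class="language-plaintext highlighter-rouge">install_fpe_handler()</code> early in <code class="language-plaintext highlighter-rouge">main</code>.  The handler exits rather
than returning, which keeps the sketch independent of where the saved <code class="language-plaintext highlighter-rouge">PC</code> points.</p>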

<p>Next, upon return from kernel space to user space, the path is <code class="language-plaintext highlighter-rouge">do_fpe_trap</code> &gt;
<code class="language-plaintext highlighter-rouge">_fpe_trap_handler</code> &gt; <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/openrisc/kernel/entry.S?h=v6.4-rc7#n998">ret_from_exception</a> &gt; <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/openrisc/kernel/entry.S?h=v6.4-rc7#n943" title="Resume Userspace">resume_userspace</a> &gt;
<a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/openrisc/kernel/entry.S?h=v6.4-rc7#n952" title="Drop into Work Pending">work_pending</a> &gt; <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/openrisc/kernel/signal.c?h=v6.4-rc7#n293">do_work_pending</a> &gt; <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/openrisc/kernel/entry.S?h=v6.4-rc7#n983">restore_all</a>.</p>

<p>Inside of <code class="language-plaintext highlighter-rouge">do_work_pending</code> is where the signal handling is done.  I explain a bit
about this in the article <a href="/software/toolchain/openrisc/2020/12/13/cxx-exception-unwinding.html">Unwinding a Bug - How C++ Exceptions Work</a>.
In <code class="language-plaintext highlighter-rouge">restore_all</code> we see that <code class="language-plaintext highlighter-rouge">EPCR</code> is the address returned to when exception handling is
complete. A snippet of this code is shown below:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define RESTORE_ALL                     \
    DISABLE_INTERRUPTS(r3,r4)               ;\
    l.lwz   r3,PT_PC(r1)                    ;\
    l.mtspr r0,r3,SPR_EPCR_BASE             ;\
    l.lwz   r3,PT_SR(r1)                    ;\
    l.mtspr r0,r3,SPR_ESR_BASE              ;\
    l.lwz   r3,PT_FPCSR(r1)                 ;\
    l.mtspr r0,r3,SPR_FPCSR                 ;\
    l.lwz   r2,PT_GPR2(r1)                  ;\
    l.lwz   r3,PT_GPR3(r1)                  ;\
    l.lwz   r4,PT_GPR4(r1)                  ;\
    l.lwz   r5,PT_GPR5(r1)                  ;\
    l.lwz   r6,PT_GPR6(r1)                  ;\
    l.lwz   r7,PT_GPR7(r1)                  ;\
    l.lwz   r8,PT_GPR8(r1)                  ;\
    l.lwz   r9,PT_GPR9(r1)                  ;\
    l.lwz   r10,PT_GPR10(r1)                    ;\
    l.lwz   r11,PT_GPR11(r1)                    ;\
    l.lwz   r12,PT_GPR12(r1)                    ;\
    l.lwz   r13,PT_GPR13(r1)                    ;\
    l.lwz   r14,PT_GPR14(r1)                    ;\
    l.lwz   r15,PT_GPR15(r1)                    ;\
    l.lwz   r16,PT_GPR16(r1)                    ;\
    l.lwz   r17,PT_GPR17(r1)                    ;\
    l.lwz   r18,PT_GPR18(r1)                    ;\
    l.lwz   r19,PT_GPR19(r1)                    ;\
    l.lwz   r20,PT_GPR20(r1)                    ;\
    l.lwz   r21,PT_GPR21(r1)                    ;\
    l.lwz   r22,PT_GPR22(r1)                    ;\
    l.lwz   r23,PT_GPR23(r1)                    ;\
    l.lwz   r24,PT_GPR24(r1)                    ;\
    l.lwz   r25,PT_GPR25(r1)                    ;\
    l.lwz   r26,PT_GPR26(r1)                    ;\
    l.lwz   r27,PT_GPR27(r1)                    ;\
    l.lwz   r28,PT_GPR28(r1)                    ;\
    l.lwz   r29,PT_GPR29(r1)                    ;\
    l.lwz   r30,PT_GPR30(r1)                    ;\
    l.lwz   r31,PT_GPR31(r1)                    ;\
    l.lwz   r1,PT_SP(r1)                    ;\
    l.rfe
</code></pre></div></div>

<p>Here we can see how <code class="language-plaintext highlighter-rouge">l.mtspr r0,r3,SPR_EPCR_BASE</code> restores the <code class="language-plaintext highlighter-rouge">EPCR</code> to the pc
address stored in <code class="language-plaintext highlighter-rouge">pt_regs</code> when we entered the exception handler.  All
other registers are restored and finally the <code class="language-plaintext highlighter-rouge">l.rfe</code> instruction is issued to
return from the exception, which effectively jumps to <code class="language-plaintext highlighter-rouge">EPCR</code>.</p>

<p>The reason QEMU was not setting the correct exception address is due to the way
QEMU is implemented to optimize performance.  QEMU translates basic blocks of target
code to host native instructions and executes those, so during runtime
all <code class="language-plaintext highlighter-rouge">PC</code> addresses are those of the host, for example x86-64 64-bit
addresses.  When an exception occurs, updating the target <code class="language-plaintext highlighter-rouge">PC</code> address from the host <code class="language-plaintext highlighter-rouge">PC</code>
needs to be explicitly requested.</p>

<h3 id="qemu-patch-3">QEMU Patch 3</h3>

<p>The next patch to implement <em>tininess before rounding</em> was also trivial but
brought up a conversation about default NaN payloads.</p>

<h3 id="qemu-patch-4">QEMU Patch 4</h3>

<p>Wait, there is more.  While writing this article I realized that if QEMU
was setting the <code class="language-plaintext highlighter-rouge">EPCR</code> to the FPU instruction causing the exception then
we would end up in an endless loop.</p>

<p>Luckily the architecture anticipated this, calling for FPU exceptions to set <code class="language-plaintext highlighter-rouge">EPCR</code>
to the next instruction to be executed.  QEMU was missing this logic.</p>

<p>The patch <a href="https://lore.kernel.org/qemu-devel/20230731210301.3360723-1-shorne@gmail.com/">target/openrisc: Set EPCR to next PC on FPE exceptions</a>
fixes this up.</p>
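
<p>The endless loop is easy to see in a toy model.  The sketch below is neither QEMU nor kernel
code, and every name in it is hypothetical.  It steps a <code class="language-plaintext highlighter-rouge">PC</code> through a tiny program in which
one instruction always raises an FPU exception, returns to <code class="language-plaintext highlighter-rouge">EPCR</code> the way <code class="language-plaintext highlighter-rouge">l.rfe</code> does, and
ignores details such as delay slots.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

/* Toy model only: fixed 4 byte instructions and a small step budget. */
enum { INSN_SIZE = 4, MAX_STEPS = 16 };

/*
 * Run a program of n_insns instructions where the instruction at fault_pc
 * raises an FPU exception every time it executes.  Returns the number of
 * steps needed to reach the end of the program, or -1 if the step budget
 * runs out because the PC never moves past fault_pc.
 */
int run_until_done(uint32_t n_insns, uint32_t fault_pc, int epcr_is_next_pc)
{
    uint32_t pc = 0;
    uint32_t end = n_insns * INSN_SIZE;

    for (int steps = 0; steps &lt; MAX_STEPS; steps++) {
        if (pc &gt;= end)
            return steps;
        if (pc == fault_pc) {
            /* Exception entry saves EPCR; the handler returns to it. */
            uint32_t epcr = epcr_is_next_pc ? pc + INSN_SIZE : pc;
            pc = epcr;
            continue;
        }
        pc += INSN_SIZE;
    }
    return -1;
}
</code></pre></div></div>

<p>With <code class="language-plaintext highlighter-rouge">epcr_is_next_pc</code> set to 0 the model re-executes the faulting instruction until the
step budget is exhausted; with it set to 1 the program runs to completion.</p>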

<h1 id="rtl-updates">RTL Updates</h1>

<p>Updating the actual verilog RTL CPU implementations also needed to be done.
Updates have been made to both the <a href="https://github.com/openrisc/mor1kx">mor1kx</a>
and the
<a href="https://github.com/openrisc/or1k_marocchino/tree/master">or1k_marocchino</a>
implementations.</p>

<h2 id="mor1kx-updates">mor1kx Updates</h2>

<p>Updates to the <em>mor1kx</em> to support user mode reads and writes to the <code class="language-plaintext highlighter-rouge">FPCSR</code> were done in the patch:
<a href="https://github.com/openrisc/mor1kx/commit/6b1beaa871c02ccd570d8e6ad80f99bc4133aa26">Make FPCSR is R/W accessible for both user- and supervisor- modes</a>.</p>

<p>The full patch is:</p>

<div class="language-patch highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">@@ -618,7 +618,7 @@</span> module mor1kx_ctrl_cappuccino
            spr_fpcsr[`OR1K_FPCSR_FPEE] &lt;= 1'b0;
          end  
          else if ((spr_we &amp; spr_access[`OR1K_SPR_SYS_BASE] &amp;
<span class="gd">-                  (spr_sr[`OR1K_SPR_SR_SM] &amp; padv_ctrl | du_access)) &amp;&amp;
</span><span class="gi">+                  (padv_ctrl | du_access)) &amp;&amp;
</span>                   `SPR_OFFSET(spr_addr)==`SPR_OFFSET(`OR1K_SPR_FPCSR_ADDR)) begin
            spr_fpcsr &lt;= spr_write_dat[`OR1K_FPCSR_WIDTH-1:0]; // update all fields
           `ifdef OR1K_FPCSR_MASK_FLAGS
</code></pre></div></div>

<p>The change to the Verilog shows that before, when writing (<code class="language-plaintext highlighter-rouge">spr_we</code>) to the FPCSR (<code class="language-plaintext highlighter-rouge">OR1K_SPR_FPCSR_ADDR</code>) register,
we used to check that the supervisor mode bit (<code class="language-plaintext highlighter-rouge">OR1K_SPR_SR_SM</code>) of the SR SPR (<code class="language-plaintext highlighter-rouge">spr_sr</code>) was set.  That check
enforced supervisor mode only write access; removing it allows user space to write to the register.</p>
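
<p>For readers more at home in C than Verilog, the condition before and after the patch can be
modelled as two predicates.  This is only a sketch of the boolean logic in the diff above; the
struct and function names are hypothetical, and the field names simply mirror the signals.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdbool.h&gt;

/* One SPR write request as seen by the control stage (modelled, not RTL). */
struct fpcsr_write_req {
    bool spr_we;        /* an SPR write is requested */
    bool sys_group;     /* spr_access[`OR1K_SPR_SYS_BASE] */
    bool addr_is_fpcsr; /* the address offset matches the FPCSR */
    bool supervisor;    /* spr_sr[`OR1K_SPR_SR_SM] */
    bool padv_ctrl;     /* control stage is advancing */
    bool du_access;     /* access comes from the debug unit */
};

/* Before the patch: a normal write also needs the supervisor bit. */
bool fpcsr_we_old(const struct fpcsr_write_req *r)
{
    return r-&gt;spr_we &amp;&amp; r-&gt;sys_group &amp;&amp;
           ((r-&gt;supervisor &amp;&amp; r-&gt;padv_ctrl) || r-&gt;du_access) &amp;&amp;
           r-&gt;addr_is_fpcsr;
}

/* After the patch: the supervisor bit no longer takes part. */
bool fpcsr_we_new(const struct fpcsr_write_req *r)
{
    return r-&gt;spr_we &amp;&amp; r-&gt;sys_group &amp;&amp;
           (r-&gt;padv_ctrl || r-&gt;du_access) &amp;&amp;
           r-&gt;addr_is_fpcsr;
}
</code></pre></div></div>

<p>For a request with <code class="language-plaintext highlighter-rouge">supervisor</code> false and every other input except <code class="language-plaintext highlighter-rouge">du_access</code> true, the old
predicate rejects the write and the new one accepts it.</p>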

<p>Updating <em>mor1kx</em> to support tininess checking before rounding was done in the
change <a href="https://github.com/openrisc/mor1kx/commit/f2a78cc5d98123e63af4b23296795d95ffdfd854">Refactoring and implementation tininess detection before
rounding</a>.
I will not go into the details of this patch as I don’t understand it that
well.</p>

<h2 id="marocchino-updates">Marocchino Updates</h2>

<p>Updates to the <em>or1k_marocchino</em> to support user mode reads and writes to the <code class="language-plaintext highlighter-rouge">FPCSR</code> were done in the patch:
<a href="https://github.com/openrisc/or1k_marocchino/commit/cd9e2bd977f489892ea632142078e4bca8976576">Make FPCSR is R/W accessible for both user- and supervisor- modes</a>.</p>

<p>The full patch is:</p>

<div class="language-patch highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">@@ -714,7 +714,7 @@</span> module or1k_marocchino_ctrl
  assign except_fpu_enable_o = spr_fpcsr[`OR1K_FPCSR_FPEE];

  wire spr_fpcsr_we = (`SPR_OFFSET(({1'b0, spr_sys_group_wadr_r})) == `SPR_OFFSET(`OR1K_SPR_FPCSR_ADDR)) &amp;
<span class="gd">-                      spr_sys_group_we &amp;  spr_sr[`OR1K_SPR_SR_SM];
</span><span class="gi">+                      spr_sys_group_we; // FPCSR is R/W for both user- and supervisor- modes
</span>
 `ifdef OR1K_FPCSR_MASK_FLAGS
  reg [`OR1K_FPCSR_ALLF_SIZE-1:0] ctrl_fpu_mask_flags_r;
</code></pre></div></div>

<p>Updating the <em>marocchino</em> to support detecting tininess before rounding was done in the
patch:
<a href="https://github.com/openrisc/or1k_marocchino/commit/8be054f0bef95bd94238509ced79ef5ec7a57417">Refactoring FPU Implementation for tininess detection BEFORE ROUNDING</a>.
I will not go into details of the patch as I didn’t write it.  In
general it is a medium size refactoring of the floating point unit.</p>

<h1 id="summary">Summary</h1>

<p>We discussed updates to the architecture simulators and verilog CPU implementations
to allow supporting user mode floating point programs.  These updates will now allow us to
port Linux and glibc to the OpenRISC floating point unit.</p>

<h1 id="further-reading">Further Reading</h1>

<ul>
  <li><a href="https://qemu.readthedocs.io/en/latest/devel/tcg.html">QEMU Translation Internals</a> - details on how QEMU works</li>
</ul>]]></content><author><name>Stafford Horne</name></author><category term="hardware" /><category term="embedded" /><category term="openrisc" /><summary type="html"><![CDATA[In the last article we introduced the OpenRISC glibc FPU port and the effort required to get user space FPU support into OpenRISC linux user applications. We explained how the FPU port is a fullstack project covering:]]></summary></entry><entry><title type="html">OpenRISC FPU Port - What is It?</title><link href="http://stffrdhrn.github.io/hardware/embedded/openrisc/2023/04/25/or1k-fpu-port.html" rel="alternate" type="text/html" title="OpenRISC FPU Port - What is It?" /><published>2023-04-25T12:10:00+01:00</published><updated>2023-04-25T12:10:00+01:00</updated><id>http://stffrdhrn.github.io/hardware/embedded/openrisc/2023/04/25/or1k-fpu-port</id><content type="html" xml:base="http://stffrdhrn.github.io/hardware/embedded/openrisc/2023/04/25/or1k-fpu-port.html"><![CDATA[<p>Last year (2022) the big milestone for OpenRISC was getting the <a href="https://openrisc.io/toolchain/2022/02/19/glibc-upstream">glibc port upstream</a>.
Though there is <a href="https://en.wikipedia.org/wiki/C_standard_library">libc</a> support for
OpenRISC already with <a href="https://www.musl-libc.org">musl</a> and <a href="https://uclibc-ng.org">uClibc</a>,
the glibc port provides an extensive test suite which has proved useful in shaking out toolchain
and OS bugs.</p>

<p>The upstreamed OpenRISC glibc support is missing support for leveraging the
OpenRISC <a href="https://en.wikipedia.org/wiki/Floating-point_unit">floating-point unit (FPU)</a>.
Adding OpenRISC glibc FPU support requires a cross cutting effort across the
architecture’s fullstack from:</p>

<ul>
  <li>Architecture Specification</li>
  <li>Simulators and CPU implementations</li>
  <li>Linux Kernel support</li>
  <li>GCC Instructions and Soft FPU</li>
  <li>Binutils/GDB Debugging Support</li>
  <li>glibc support</li>
</ul>

<p>In this blog entry I will cover how the OpenRISC architecture specification
was updated to support user space floating point applications.  But first, what
is FPU porting?</p>

<h2 id="what-is-fpu-porting">What is FPU Porting?</h2>

<p>The FPU in modern CPUs allows the processor to perform <a href="https://en.wikipedia.org/wiki/IEEE_754">IEEE 754</a>
floating point math like addition, subtraction and multiplication.  When used in a
user application the FPU’s function becomes more of a math accelerator, speeding
up math operations including
<a href="https://en.wikipedia.org/wiki/Trigonometry">trigonometric</a> and
<a href="https://en.wikipedia.org/wiki/Complex_number">complex</a> functions such as <code class="language-plaintext highlighter-rouge">sin</code>,
<code class="language-plaintext highlighter-rouge">sinf</code> and <code class="language-plaintext highlighter-rouge">cexpf</code>.  Not all FPUs provide the same
set of FPU operations nor do they have to.  When enabled, the compiler will
insert floating point instructions where they can be used.</p>

<p>OpenRISC FPU support was added to the GCC compiler <a href="https://www.phoronix.com/news/GCC-10-OpenRISC-FPU">a while back</a>.
We can see how this works with a simple example using the bare-metal <a href="https://sourceware.org/newlib/">newlib</a> toolchain.</p>

<p>C code example <code class="language-plaintext highlighter-rouge">addf.c</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>float addf(float a, float b) {
    return a + b;
}
</code></pre></div></div>

<p>To compile this C function we can do:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ or1k-elf-gcc -O2 addf.c -c -o addf-sf.o
$ or1k-elf-gcc -O2 -mhard-float addf.c -c -o addf-hf.o
</code></pre></div></div>

<p>Assembly output of <code class="language-plaintext highlighter-rouge">addf-sf.o</code> contains the default software floating point
implementation.  We can see below that a call to <code class="language-plaintext highlighter-rouge">__addsf3</code> was
added to perform our floating point operation.  The function <code class="language-plaintext highlighter-rouge">__addsf3</code>
is <a href="https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html">provided</a>
by <code class="language-plaintext highlighter-rouge">libgcc</code> as a software implementation of the single precision
floating point (<code class="language-plaintext highlighter-rouge">sf</code>) add operation.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ or1k-elf-objdump -dr addf-sf.o 

Disassembly of section .text:

00000000 &lt;addf&gt;:
   0:   9c 21 ff fc     l.addi r1,r1,-4
   4:   d4 01 48 00     l.sw 0(r1),r9
   8:   04 00 00 00     l.jal 8 &lt;addf+0x8&gt;
                        8: R_OR1K_INSN_REL_26   __addsf3
   c:   15 00 00 00     l.nop 0x0
  10:   85 21 00 00     l.lwz r9,0(r1)
  14:   44 00 48 00     l.jr r9
  18:   9c 21 00 04     l.addi r1,r1,4
</code></pre></div></div>

<p>The disassembly of the <code class="language-plaintext highlighter-rouge">addf-hf.o</code> below shows that the FPU instruction
(hardware) <code class="language-plaintext highlighter-rouge">lf.add.s</code> is used to perform the addition; this is because the snippet
was compiled using the <code class="language-plaintext highlighter-rouge">-mhard-float</code> argument.  As one could imagine, when the hardware
supports this it is more efficient than the software implementation.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ or1k-elf-objdump -dr addf-hf.o 

Disassembly of section .text:

00000000 &lt;addf&gt;:
   0:   c9 63 20 00     lf.add.s r11,r3,r4
   4:   44 00 48 00     l.jr r9
   8:   15 00 00 00     l.nop 0x0
</code></pre></div></div>

<p>So if the OpenRISC toolchain already has support for FPU instructions what else
needs to be done?  When we add FPU support to glibc we are adding FPU support
to the OpenRISC POSIX <em>runtime</em> and creating a <em>toolchain</em> that can compile and link
binaries to run on this runtime.</p>

<h3 id="the-runtime">The Runtime</h3>

<p>Below we can see examples of two application runtimes, one <em>Application A</em> runs
with software floating point, the other <em>Application B</em> runs with full hardware
floating point.</p>

<p><img src="/content/2023/2023-04-24-floating-point-runtime.png" alt="OpenRISC Floating Point Runtime" /></p>

<p>Both <em>Application A</em> and <em>Application B</em> can run on the same system, but
<em>Application B</em> requires a libc and kernel that support the floating point
runtime.  As we can see:</p>

<ul>
  <li><em>Application B</em> leverages floating point instructions as noted in the
<span style="color:blue">blue</span> box. These must be implemented in the
CPU, and are produced by the GCC compiler.</li>
  <li>The math routines in the C Library used by <em>Application B</em> are accelerated by
the FPU as per the <span style="color:blue">blue</span> box.  The math routines can also set up rounding of the FPU hardware to
be in line with rounding of the software routines.  The math routines can
also detect exceptions by checking the FPU state.  The rounding and exception
handling in the <span style="color:purple">purple</span> boxes is what is
implemented in GLIBC.</li>
  <li>The kernel must be able to save and restore the FPU state when switching
between processes.  The OS also has support for signalling the process if
enabled.  This is indicated in the <span style="color:purple">purple</span>
box.</li>
</ul>

<p>Another aspect is that supporting hardware floating point in the OS means that
multiple user land programs can transparently use the FPU.  To do all of this we
need to update the kernel and the C runtime libraries to:</p>

<ul>
  <li>Make the kernel save and restore process FPU state during context switches</li>
  <li>Make the kernel handle FPU exceptions and deliver signals to user land</li>
  <li>Teach GLIBC how to set up the FPU rounding mode</li>
  <li>Teach GLIBC how to translate FPU exceptions</li>
  <li>Tell GCC and GLIBC soft float about our FPU quirks</li>
</ul>

<h3 id="the-toolchain">The Toolchain</h3>

<p>In order to compile applications like <em>Application B</em> a separate compiler
toolchain is needed.  For highly configurable embedded system CPUs like ARM and RISC-V there
are multiple toolchains available for building software for the different CPU
configurations.  Usually there will be one toolchain for soft float and one for hard float support, see the below example
from the <a href="https://developer.arm.com/downloads/-/arm-gnu-toolchain-downloads">arm toolchain download</a> page.</p>

<p><img src="/content/2023/2023-04-25-arm-toolchains.png" alt="Floating Point Toolchains" /></p>

<h2 id="fixing-architecture-issues">Fixing Architecture Issues</h2>

<p>As we started to work on the floating point support we found two issues:</p>

<ul>
  <li>The OpenRISC floating point control and status register (FPCSR) is accessible only in
supervisor mode.</li>
  <li>We have not defined how the FPU should perform tininess detection.</li>
</ul>

<h3 id="fpcsr-access">FPCSR Access</h3>

<p>The GLIBC OpenRISC FPU port, or any port for that matter, starts
by looking at what other architectures have done.  For GLIBC FPU support we can
look at what MIPS, ARM, RISC-V etc. have implemented.  Most ports have a file
called <code class="language-plaintext highlighter-rouge">sysdeps/{arch}/fpu_control.h</code>.  I noticed one thing right away as I went
through these; we can look at ARM or MIPS for example:</p>

<p><a href="https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/mips/fpu_control.h;h=d9ab3195bbef0159bf663c720485f8a3bdfbd136;hb=HEAD#l124">sysdeps/mips/fpu_control.h</a>:
<em>Excerpt from the MIPS port showing the definition of _FPU_GETCW and _FPU_SETCW</em></p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="cp">#else
</span> <span class="cp"># define _FPU_GETCW(cw) __asm__ volatile ("cfc1 %0,$31" : "=r" (cw))
</span> <span class="cp"># define _FPU_SETCW(cw) __asm__ volatile ("ctc1 %0,$31" : : "r" (cw))
</span> <span class="cp">#endif
</span></code></pre></div></div>

<p><a href="https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/arm/fpu_control.h;h=cadbe927b3df20d06f6f4cf159c94e865a595885;hb=HEAD#l67">sysdeps/arm/fpu_control.h</a>:
<em>Excerpt from the ARM port showing the definition of _FPU_GETCW and _FPU_SETCW</em></p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="cp"># define _FPU_GETCW(cw) \
   __asm__ __volatile__ ("vmrs %0, fpscr" : "=r" (cw))
</span> <span class="cp"># define _FPU_SETCW(cw) \
   __asm__ __volatile__ ("vmsr fpscr, %0" : : "r" (cw))
</span> <span class="cp">#endif
</span></code></pre></div></div>

<p>What we see here is a macro that defines how to read or write the floating point
control word for each architecture.  The macros are implemented using a single
assembly instruction.</p>

<p>In OpenRISC we have similar instructions for reading and writing the floating
point control register (FPCSR), writing for example is: <code class="language-plaintext highlighter-rouge">l.mtspr r0,%0,20</code>.  However,
<strong>on OpenRISC the FPCSR is read-only when running in user-space</strong>, this is a
problem.</p>
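
<p>To show what the equivalent OpenRISC macros could look like, here is a minimal sketch.  It
assumes GCC predefines <code class="language-plaintext highlighter-rouge">__or1k__</code> on OpenRISC targets, that <code class="language-plaintext highlighter-rouge">l.mfspr</code> reads the same SPR
number 20 used above, and that the rounding mode field sits in FPCSR bits 2:1.  The names
<code class="language-plaintext highlighter-rouge">FPU_GETCW</code>, <code class="language-plaintext highlighter-rouge">FPU_SETCW</code>, <code class="language-plaintext highlighter-rouge">fake_fpcsr</code> and <code class="language-plaintext highlighter-rouge">set_rounding_mode</code> are hypothetical and
not copied from glibc.  On any other host a plain variable stands in for the register so the
bit handling can still be tried out.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

#if defined(__or1k__)
/* SPR 20 is the FPCSR, matching the l.mtspr example in the text. */
# define FPU_GETCW(cw) __asm__ volatile ("l.mfspr %0,r0,20" : "=r" (cw))
# define FPU_SETCW(cw) __asm__ volatile ("l.mtspr r0,%0,20" : : "r" (cw))
#else
/* Stand-in register for hosts that are not OpenRISC (assumption). */
static uint32_t fake_fpcsr;
# define FPU_GETCW(cw) ((cw) = fake_fpcsr)
# define FPU_SETCW(cw) (fake_fpcsr = (cw))
#endif

/* Assumed layout: rounding mode field in bits 2:1 of the FPCSR. */
#define FPCSR_RM_SHIFT 1
#define FPCSR_RM_MASK  0x6u

/* Replace only the rounding mode bits and return the new control word. */
uint32_t set_rounding_mode(uint32_t rm)
{
    uint32_t cw;

    FPU_GETCW(cw);
    cw = (cw &amp; ~FPCSR_RM_MASK) | ((rm &lt;&lt; FPCSR_RM_SHIFT) &amp; FPCSR_RM_MASK);
    FPU_SETCW(cw);
    return cw;
}
</code></pre></div></div>

<p>The read-modify-write in <code class="language-plaintext highlighter-rouge">set_rounding_mode</code> is the kind of sequence a C library needs to run
from user space, which is why the write has to be permitted there.</p>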

<p>If we remember from our operating system studies, user applications run in
<a href="https://en.wikipedia.org/wiki/User_space_and_kernel_space">user-mode</a> as
opposed to the privileged kernel-mode.
The user <a href="https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/fenv.h.html">floating point environment</a>
is defined by the ISO C Standard and adopted by POSIX.  The C library provides functions to
set rounding modes and clear exceptions using for example
<a href="https://pubs.opengroup.org/onlinepubs/9699919799/functions/fesetround.html">fesetround</a>
for setting FPU rounding modes and
<a href="https://pubs.opengroup.org/onlinepubs/9699919799/functions/feholdexcept.html">feholdexcept</a> for clearing exceptions.
If userspace applications need to be able to control the floating point unit
then having architecture support for this is integral.</p>
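
<p>As a small illustration of what user space expects to be able to do, the sketch below uses
only standard <code class="language-plaintext highlighter-rouge">fenv.h</code> calls.  The helper names <code class="language-plaintext highlighter-rouge">div_rounded</code> and <code class="language-plaintext highlighter-rouge">div_raises</code> are
hypothetical, and the <code class="language-plaintext highlighter-rouge">volatile</code> variables are there to stop the compiler from folding the
division at build time.  On glibc this needs to be linked with <code class="language-plaintext highlighter-rouge">-lm</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;fenv.h&gt;

/* Divide under the given rounding mode, then restore the previous mode. */
float div_rounded(float a, float b, int mode)
{
    volatile float x = a, y = b, q;
    int old = fegetround();

    fesetround(mode);
    q = x / y;
    fesetround(old);
    return q;
}

/* Return nonzero if a / b raised any of the given exception flags. */
int div_raises(float a, float b, int excepts)
{
    volatile float x = a, y = b, q;

    feclearexcept(FE_ALL_EXCEPT);
    q = x / y;
    (void)q;
    return fetestexcept(excepts) != 0;
}
</code></pre></div></div>

<p>On a hard float target each of these library calls ends up reading or writing the FPU control
and status register, so every one of them depends on that register being accessible from
user mode.</p>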

<p>Originally the <a href="https://openrisc.io/architecture">OpenRISC architecture specification</a>
specified the floating point control and status register (FPCSR) as being
read only when executing in user mode, again <strong>this is a problem</strong> and needs to
be addressed.</p>

<p><img src="/content/2023/2023-04-22-or1k-fpcsr.png" alt="OpenRISC FPCSR Privileges" /></p>

<p>Other architectures define the floating point control register as being writable in user-mode.
For example, ARM has the
<a href="https://developer.arm.com/documentation/ddi0502/g/programmers-model/aarch64-register-descriptions/floating-point-control-register">FPCR and FPSR</a>,
and RISC-V has the
<a href="https://riscv.org/wp-content/uploads/2017/05/riscv-privileged-v1.10.pdf">FCSR</a>
all of which are writable in user-mode.</p>

<p><img src="/content/2023/2023-04-22-riscv-csr-fcsr.png" alt="RISC-V FCSR Privileges" /></p>

<h3 id="tininess-detection">Tininess Detection</h3>

<p>I am skipping ahead a bit here: once the OpenRISC GLIBC port was working we noticed
many problematic math test failures.  These turned out to be caused by inconsistencies
between the tininess detection <a href="https://ntrs.nasa.gov/api/citations/19960008463/downloads/19960008463.pdf">[pdf]</a>
settings in the toolchain.  Tininess detection must be selected by an FPU
implementation as being done before or after rounding.
In the toolchain this is configured by:</p>

<ul>
  <li>GLIBC <code class="language-plaintext highlighter-rouge">TININESS_AFTER_ROUNDING</code> - macro used by test suite to control
expectations</li>
  <li>GLIBC <code class="language-plaintext highlighter-rouge">_FP_TININESS_AFTER_ROUNDING</code> - macro used to control softfloat
implementation in GLIBC.</li>
  <li>GCC libgcc <code class="language-plaintext highlighter-rouge">_FP_TININESS_AFTER_ROUNDING</code> - macro used to control softfloat
implementation in GCC libgcc.</li>
</ul>
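
<p>The difference between the two choices only shows up for results that sit just below the
smallest normal number.  The sketch below is my own probe and not code from GLIBC; the name
<code class="language-plaintext highlighter-rouge">tininess_after_rounding</code> is hypothetical.  It multiplies two floats whose exact product is
2<sup>-126</sup> &times; (1 - 2<sup>-46</sup>).  That value is below <code class="language-plaintext highlighter-rouge">FLT_MIN</code> before rounding, but
in the default round to nearest mode it rounds up to exactly <code class="language-plaintext highlighter-rouge">FLT_MIN</code>.  An FPU that checks
before rounding raises underflow, while one that checks after rounding does not.  On glibc this
needs to be linked with <code class="language-plaintext highlighter-rouge">-lm</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;fenv.h&gt;

/*
 * Probe the FPU we are running on.  Returns 1 if tininess appears to be
 * detected after rounding, 0 if it appears to be detected before rounding.
 * Assumes the default round to nearest mode is active.
 */
int tininess_after_rounding(void)
{
    volatile float a = 0x1.000002p-63f; /* (1 + 2^-23) * 2^-63 */
    volatile float b = 0x1.fffffcp-64f; /* (1 - 2^-23) * 2^-63 */
    volatile float r;

    feclearexcept(FE_ALL_EXCEPT);
    r = a * b;  /* inexact, rounds to FLT_MIN */
    (void)r;

    /* Underflow is only flagged here by before-rounding detection. */
    return fetestexcept(FE_UNDERFLOW) == 0;
}
</code></pre></div></div>

<p>The result depends on the hardware the probe runs on.  When a soft float library, a test suite
and the hardware disagree about the answer, a result like this one is exactly where their
exception flags diverge.</p>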

<h3 id="updating-the-spec">Updating the Spec</h3>

<p>Writing to FPCSR from user-mode could be worked around in OpenRISC by
introducing a syscall, but we decided to just change the architecture
specification for this.  Updating the spec keeps it similar to all other
architectures out there.</p>

<p>In OpenRISC we have defined tininess detection to be done before rounding as
this matches what existing FPU implementations have done.</p>

<p>As of architecture specification <a href="https://openrisc.io/revisions/r1.4">revision
1.4</a> the FPCSR is defined as being writable
in user-mode and we have documented tininess detection to be before rounding.</p>

<h2 id="summary">Summary</h2>

<p>We’ve gone through an overview of how the FPU accelerates math in
an application runtime.  We then looked at how the OpenRISC architecture specification needed
to be updated to support the floating point POSIX runtime.</p>

<p>In the next entry we shall look into patches to get QEMU and CPU
implementations updated to support the new spec changes.</p>]]></content><author><name>Stafford Horne</name></author><category term="hardware" /><category term="embedded" /><category term="openrisc" /><summary type="html"><![CDATA[Last year (2022) the big milestone for OpenRISC was getting the glibc port upstream. Though there is libc support for OpenRISC already with musl and ucLibc the glibc port provides a extensive testsuite which has proved useful in shaking out toolchain and OS bugs.]]></summary></entry><entry><title type="html">Unwinding a Bug - How C++ Exceptions Work</title><link href="http://stffrdhrn.github.io/software/toolchain/openrisc/2020/12/13/cxx-exception-unwinding.html" rel="alternate" type="text/html" title="Unwinding a Bug - How C++ Exceptions Work" /><published>2020-12-13T04:25:00+00:00</published><updated>2020-12-13T04:25:00+00:00</updated><id>http://stffrdhrn.github.io/software/toolchain/openrisc/2020/12/13/cxx-exception-unwinding</id><content type="html" xml:base="http://stffrdhrn.github.io/software/toolchain/openrisc/2020/12/13/cxx-exception-unwinding.html"><![CDATA[<p>I have been working on porting GLIBC to the OpenRISC architecture.  This has
taken longer than I expected as with GLIBC upstreaming we must get every
test to pass.  This was different compared to GDB and GCC which were a
bit more lenient.</p>

<p>My <a href="https://lists.librecores.org/pipermail/openrisc/2020-May/002602.html">first upstreaming attempt</a>
was completely tested on the <a href="https://www.qemu.org">QEMU</a> simulator.  I have
since added an FPGA <a href="https://github.com/enjoy-digital/litex/blob/master/README.md">LiteX SoC</a>
to my test platform options.  LiteX runs Linux on the OpenRISC mor1kx softcore
and tests are loaded over an SSH session.  The SoC eliminates an issue I was
seeing on the simulator where under heavy load it appears the <a href="https://github.com/openrisc/linux/issues/12">MMU starves the kernel</a>
from getting any work done.</p>

<p>To get to where I am now this required:</p>

<ul>
  <li>Fixing a buggy <a href="https://github.com/litex-hub/linux/commit/78969c54328e35b360d9452c7602f21107a13d22">LiteETH network driver</a>
in LiteX.</li>
  <li>Updating the OpenRISC GDB port to <a href="https://github.com/stffrdhrn/binutils-gdb/commit/9d0d2e9bef5c84caa7f05cc7ddba1e092e2b5120">support gdbserver</a>
and <a href="https://github.com/stffrdhrn/binutils-gdb/commit/82e99d5df56be3b18c63e613d00e2367fb5a78b7">native debugging</a></li>
  <li>Fixing bugs in the <a href="https://github.com/stffrdhrn/linux/commit/28b852b1dc351efc6525234c5adfd5bc2ad6d6e1">Linux Kernel</a> and
<a href="https://github.com/stffrdhrn/or1k-glibc/commit/75ddf155968299042e4d2b492e3b547c86d4672e">GLIBC</a> to get gdbserver and native support working</li>
</ul>

<p>Adding GDB Linux debugging support is great because it allows debugging of
multithreaded processes and signal handling, which we are going to need.</p>

<h2 id="a-bug">A Bug</h2>

<p>Our story starts when I was trying to fix a failing GLIBC <a href="https://en.wikipedia.org/wiki/Native_POSIX_Thread_Library">NPTL</a>
test case.  The test case involves C++ exceptions and POSIX threads.
The issue is that the <code class="language-plaintext highlighter-rouge">catch</code> block of a <code class="language-plaintext highlighter-rouge">try/catch</code> block is not
being called.  Where do we even start?</p>

<p>My plan for approaching test case failures is:</p>
<ol>
  <li>Understand what the test case is trying to test and where it’s failing</li>
  <li>Create a hypothesis about where the problem is</li>
  <li>Understand how the failing APIs work internally</li>
  <li>Debug until we find the issue</li>
  <li>If we get stuck go back to <code class="language-plaintext highlighter-rouge">2.</code></li>
</ol>

<p>Let’s have a try.</p>

<h3 id="understanding-the-test-case">Understanding the Test case</h3>

<p>The GLIBC test case is <a href="https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/tst-cancel24.cc;h=1af709a8cab1d422ef4401e2b2d178df86f863c5;hb=HEAD">nptl/tst-cancel24.cc</a>.
The test starts in the <code class="language-plaintext highlighter-rouge">do_test</code> function and it will create a child thread with <code class="language-plaintext highlighter-rouge">pthread_create</code>.
The child thread executes function <code class="language-plaintext highlighter-rouge">tf</code> which waits on a semaphore until the parent thread cancels it.  It
is expected that the child thread, when cancelled, will call its catch block.</p>

<p>The failure is that the <code class="language-plaintext highlighter-rouge">catch</code> block is not getting run as evidenced by the <code class="language-plaintext highlighter-rouge">except_caught</code> variable
not being set to <code class="language-plaintext highlighter-rouge">true</code>.</p>

<p>Below is an excerpt from the test showing the <code class="language-plaintext highlighter-rouge">tf</code> function.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="o">*</span>
<span class="nf">tf</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span> <span class="p">{</span>
  <span class="n">sem_t</span> <span class="o">*</span><span class="n">s</span> <span class="o">=</span> <span class="n">static_cast</span><span class="o">&lt;</span><span class="n">sem_t</span> <span class="o">*&gt;</span> <span class="p">(</span><span class="n">arg</span><span class="p">);</span>

  <span class="n">try</span> <span class="p">{</span>
      <span class="n">monitor</span> <span class="n">m</span><span class="p">;</span>

      <span class="n">pthread_barrier_wait</span> <span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="p">);</span>

      <span class="k">while</span> <span class="p">(</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">sem_wait</span> <span class="p">(</span><span class="n">s</span><span class="p">);</span>
  <span class="p">}</span> <span class="n">catch</span> <span class="p">(...)</span> <span class="p">{</span>
      <span class="n">except_caught</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
      <span class="n">throw</span><span class="p">;</span>
  <span class="p">}</span>
  <span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>So the <code class="language-plaintext highlighter-rouge">catch</code> block is not being run.  Simple, but where do we start to
debug that?  Let’s move onto the next step.</p>

<h3 id="creating-a-hypothesis">Creating a Hypothesis</h3>

<p>This one is a bit tricky as it seems C++ <code class="language-plaintext highlighter-rouge">try/catch</code> blocks are broken.  But I am
working on GLIBC testing, so what does that have to do with C++?</p>

<p>To get a better idea of where the problem is I modified the test to try out
some simple ideas. First, maybe there is a problem with catching exceptions
thrown from functions called by the child thread.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">do_throw</span><span class="p">()</span> <span class="p">{</span> <span class="n">throw</span> <span class="mi">99</span><span class="p">;</span> <span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span> <span class="o">*</span> <span class="nf">tf</span> <span class="p">()</span> <span class="p">{</span>
  <span class="n">try</span> <span class="p">{</span>
      <span class="n">monitor</span> <span class="n">m</span><span class="p">;</span>
      <span class="k">while</span> <span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="n">do_throw</span><span class="p">();</span>
  <span class="p">}</span> <span class="n">catch</span> <span class="p">(...)</span> <span class="p">{</span>
      <span class="n">except_caught</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
  <span class="p">}</span>
  <span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>No, this works correctly.  So <code class="language-plaintext highlighter-rouge">try/catch</code> is working.</p>

<blockquote>
  <p><strong>Hypothesis</strong>: There is a problem handling exceptions while in a syscall.
There may be something broken with OpenRISC related to how we set up stack
frames for syscalls that makes the unwinder fail.</p>
</blockquote>

<p>How does that work?  Let’s move onto the next step.</p>

<h3 id="understanding-the-internals">Understanding the Internals</h3>

<p>To find this bug we need to understand how C++ exceptions work.  Also, we need to know
what happens when a thread is cancelled in a multithreaded
(<a href="https://en.wikipedia.org/wiki/POSIX_Threads">pthread</a>) glibc environment.</p>

<p>There are a few contributors to pthread cancellation and C++ exception handling:</p>

<ul>
  <li><strong>DWARF</strong> - provided by our program and libraries in the <code class="language-plaintext highlighter-rouge">.eh_frame</code> ELF
section</li>
  <li><strong>GLIBC</strong> - provides the pthread runtime and cleanup callbacks to the GCC unwinder code</li>
  <li><strong>GCC</strong> - provides libraries for dealing with exceptions
    <ul>
      <li><code class="language-plaintext highlighter-rouge">libgcc_s.so</code> - handles unwinding by reading program <strong>DWARF</strong> metadata and doing the frame decoding</li>
      <li><code class="language-plaintext highlighter-rouge">libstdc++.so.6</code> - provides the C++ personality routine which
identifies and prepares <code class="language-plaintext highlighter-rouge">catch</code> blocks for execution</li>
    </ul>
  </li>
</ul>

<h4 id="dwarf">DWARF</h4>
<p>ELF binaries provide debugging information in a data format called
<a href="https://en.wikipedia.org/wiki/DWARF">DWARF</a>.  The name was chosen to maintain a
fantasy theme alongside ELF.  More recently the Linux kernel community introduced its own unwind format called
<a href="https://lwn.net/Articles/728339/">ORC</a>.</p>

<p>Though DWARF is a debugging format and is usually stored in the <code class="language-plaintext highlighter-rouge">.debug_frame</code>,
<code class="language-plaintext highlighter-rouge">.debug_info</code>, etc. sections, a stripped-down version of it is used for exception
handling.</p>

<p>Each ELF binary that supports unwinding contains the <code class="language-plaintext highlighter-rouge">.eh_frame</code> section to
provide unwinding information.  This can be seen with the <code class="language-plaintext highlighter-rouge">readelf</code> program.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -S sysroot/lib/libc.so.6
There are 70 section headers, starting at offset 0xaa00b8:

Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] .note.ABI-tag     NOTE            00000174 000174 000020 00   A  0   0  4
  [ 2] .gnu.hash         GNU_HASH        00000194 000194 00380c 04   A  3   0  4
  [ 3] .dynsym           DYNSYM          000039a0 0039a0 008280 10   A  4  15  4
  [ 4] .dynstr           STRTAB          0000bc20 00bc20 0054d4 00   A  0   0  1
  [ 5] .gnu.version      VERSYM          000110f4 0110f4 001050 02   A  3   0  2
  [ 6] .gnu.version_d    VERDEF          00012144 012144 000080 00   A  4   4  4
  [ 7] .gnu.version_r    VERNEED         000121c4 0121c4 000030 00   A  4   1  4
  [ 8] .rela.dyn         RELA            000121f4 0121f4 00378c 0c   A  3   0  4
  [ 9] .rela.plt         RELA            00015980 015980 000090 0c  AI  3  28  4
  [10] .plt              PROGBITS        00015a10 015a10 0000d0 04  AX  0   0  4
  [11] .text             PROGBITS        00015ae0 015ae0 155b78 00  AX  0   0  4
  [12] __libc_freeres_fn PROGBITS        0016b658 16b658 001980 00  AX  0   0  4
  [13] .rodata           PROGBITS        0016cfd8 16cfd8 0192b4 00   A  0   0  4
  [14] .interp           PROGBITS        0018628c 18628c 000018 00   A  0   0  1
  [15] .eh_frame_hdr     PROGBITS        001862a4 1862a4 001a44 00   A  0   0  4
  [16] .eh_frame         PROGBITS        00187ce8 187ce8 007cf4 00   A  0   0  4
  [17] .gcc_except_table PROGBITS        0018f9dc 18f9dc 000341 00   A  0   0  1
...
</code></pre></div></div>

<p>We can also decode the metadata with <code class="language-plaintext highlighter-rouge">readelf</code>, using the
<code class="language-plaintext highlighter-rouge">--debug-dump=frames-interp</code> and <code class="language-plaintext highlighter-rouge">--debug-dump=frames</code> arguments.</p>

<p>The <code class="language-plaintext highlighter-rouge">frames</code> dump provides a raw output of the DWARF metadata for each frame.
This is not usually as useful as <code class="language-plaintext highlighter-rouge">frames-interp</code>, but it shows how the DWARF
format is actually a bytecode.  The DWARF interpreter needs to execute these
operations to work out how to derive the values of registers based on the current PC.</p>

<p>For an interesting talk on this bytecode, see <a href="https://www.cs.dartmouth.edu/~sergey/battleaxe/hackito_2011_oakley_bratus.pdf">Exploiting the hard-working
DWARF</a>.</p>

<p>An example of the <code class="language-plaintext highlighter-rouge">frames</code> dump:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf --debug-dump=frames sysroot/lib/libc.so.6

...
00016788 0000000c ffffffff CIE
  Version:               1
  Augmentation:          ""
  Code alignment factor: 4
  Data alignment factor: -4
  Return address column: 9

  DW_CFA_def_cfa_register: r1
  DW_CFA_nop

00016798 00000028 00016788 FDE cie=00016788 pc=0016b584..0016b658
  DW_CFA_advance_loc: 4 to 0016b588
  DW_CFA_def_cfa_offset: 4
  DW_CFA_advance_loc: 8 to 0016b590
  DW_CFA_offset: r9 at cfa-4
  DW_CFA_advance_loc: 68 to 0016b5d4
  DW_CFA_remember_state
  DW_CFA_def_cfa_offset: 0
  DW_CFA_restore: r9
  DW_CFA_restore_state
  DW_CFA_advance_loc: 56 to 0016b60c
  DW_CFA_remember_state
  DW_CFA_def_cfa_offset: 0
  DW_CFA_restore: r9
  DW_CFA_restore_state
  DW_CFA_advance_loc: 36 to 0016b630
  DW_CFA_remember_state
  DW_CFA_def_cfa_offset: 0
  DW_CFA_restore: r9
  DW_CFA_restore_state
  DW_CFA_advance_loc: 40 to 0016b658
  DW_CFA_def_cfa_offset: 0
  DW_CFA_restore: r9
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">frames-interp</code> argument is a bit clearer as it shows the interpreted output
of the bytecode.  Below we see two types of entries:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">CIE</code> - Common Information Entry</li>
  <li><code class="language-plaintext highlighter-rouge">FDE</code> - Frame Description Entry</li>
</ul>

<p>The <code class="language-plaintext highlighter-rouge">CIE</code> provides starting point information for each child <code class="language-plaintext highlighter-rouge">FDE</code> entry.  Some
things to point out: <code class="language-plaintext highlighter-rouge">ra=9</code> indicates the return address is stored in
register <code class="language-plaintext highlighter-rouge">r9</code>, a CFA of <code class="language-plaintext highlighter-rouge">r1+0</code> indicates the canonical frame address is taken from
register <code class="language-plaintext highlighter-rouge">r1</code>, and the change to <code class="language-plaintext highlighter-rouge">r1+4</code> shows the stack frame size is <code class="language-plaintext highlighter-rouge">4</code> bytes.</p>

<p>An example of the <code class="language-plaintext highlighter-rouge">frames-interp</code> dump:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf --debug-dump=frames-interp sysroot/lib/libc.so.6

...
00016788 0000000c ffffffff CIE "" cf=4 df=-4 ra=9
   LOC   CFA
00000000 r1+0

00016798 00000028 00016788 FDE cie=00016788 pc=0016b584..0016b658
   LOC   CFA      ra
0016b584 r1+0     u
0016b588 r1+4     u
0016b590 r1+4     c-4
0016b5d4 r1+4     c-4
0016b60c r1+4     c-4
0016b630 r1+4     c-4
0016b658 r1+0     u
</code></pre></div></div>
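
<p>To make the bytecode idea concrete, here is a minimal sketch of an interpreter for just the
opcodes in the FDE above.  It is not how <code class="language-plaintext highlighter-rouge">readelf</code> or
<code class="language-plaintext highlighter-rouge">libgcc</code> is written; the names <code class="language-plaintext highlighter-rouge">Op</code>,
<code class="language-plaintext highlighter-rouge">Insn</code>, <code class="language-plaintext highlighter-rouge">Row</code>,
<code class="language-plaintext highlighter-rouge">interpret_fde</code> and <code class="language-plaintext highlighter-rouge">example_rows</code> are made up
for illustration, and it tracks only <code class="language-plaintext highlighter-rouge">r9</code> with the CFA assumed to be based on
<code class="language-plaintext highlighter-rouge">r1</code>.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model: only the opcodes seen in the FDE above, r9 is the one tracked register.
enum class Op { AdvanceLoc, DefCfaOffset, OffsetR9, RestoreR9, RememberState, RestoreState };

struct Insn { Op op; std::int32_t arg; };

// One table row: at `loc`, CFA = r1 + cfa_offset and r9 is either
// undefined ("u") or saved at CFA + ra_offset (for example "c-4").
struct Row {
  std::uint32_t loc;
  std::int32_t cfa_offset;
  bool ra_saved;
  std::int32_t ra_offset;
};

std::vector<Row> interpret_fde(std::uint32_t start_pc, const std::vector<Insn> &prog) {
  Row cur{start_pc, 0, false, 0};  // CIE initial state: CFA = r1+0, r9 undefined
  std::vector<Row> rows, saved;
  for (const Insn &insn : prog) {
    switch (insn.op) {
    case Op::AdvanceLoc:  // finish the current row, start a new one at the new PC
      rows.push_back(cur);
      cur.loc += static_cast<std::uint32_t>(insn.arg);
      break;
    case Op::DefCfaOffset: cur.cfa_offset = insn.arg; break;
    case Op::OffsetR9: cur.ra_saved = true; cur.ra_offset = insn.arg; break;
    case Op::RestoreR9:  // back to the CIE rule, which is "undefined" here
      cur.ra_saved = false;
      cur.ra_offset = 0;
      break;
    case Op::RememberState: saved.push_back(cur); break;
    case Op::RestoreState: {
      assert(!saved.empty());
      const std::uint32_t loc = cur.loc;  // the PC is not part of the saved state
      cur = saved.back();
      saved.pop_back();
      cur.loc = loc;
      break;
    }
    }
  }
  rows.push_back(cur);  // the last row runs to the end of the FDE
  return rows;
}

// The FDE at 00016798 from the dump above, transcribed by hand.
std::vector<Row> example_rows() {
  std::vector<Insn> prog = {
      {Op::AdvanceLoc, 4}, {Op::DefCfaOffset, 4},
      {Op::AdvanceLoc, 8}, {Op::OffsetR9, -4},
  };
  for (std::int32_t delta : {68, 56, 36}) {  // the three early-return sequences
    prog.push_back({Op::AdvanceLoc, delta});
    prog.push_back({Op::RememberState, 0});
    prog.push_back({Op::DefCfaOffset, 0});
    prog.push_back({Op::RestoreR9, 0});
    prog.push_back({Op::RestoreState, 0});
  }
  prog.push_back({Op::AdvanceLoc, 40});
  prog.push_back({Op::DefCfaOffset, 0});
  prog.push_back({Op::RestoreR9, 0});
  return interpret_fde(0x0016b584, prog);
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">example_rows()</code> yields one row per line of the
<code class="language-plaintext highlighter-rouge">frames-interp</code> table, which is a handy way to convince yourself that the
two dumps describe the same thing.</p>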

<h4 id="glibc">GLIBC</h4>

<p>GLIBC provides <code class="language-plaintext highlighter-rouge">pthreads</code> which when used with C++ needs to support exception
handling.  The main place exceptions are used with <code class="language-plaintext highlighter-rouge">pthreads</code> is when cancelling
threads.  When using <code class="language-plaintext highlighter-rouge">pthread_cancel</code> a cancel signal is sent to the target thread using <a href="https://man7.org/linux/man-pages/man2/tgkill.2.html">tgkill</a>
which causes an exception.</p>

<p>This is implemented with the below APIs.</p>

<ul>
  <li><a href="https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/nptl-init.c;h=53b817715d58192857ed14450052e16dc34bc01b;hb=HEAD#l126">sigcancel_handler</a> -
Set up during the pthread runtime initialization, it handles cancellation
by calling <code class="language-plaintext highlighter-rouge">__do_cancel</code>, which calls <code class="language-plaintext highlighter-rouge">__pthread_unwind</code>.</li>
  <li><a href="https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/unwind.c;h=8f157e49f4a088ac64722e85ff24514fff7f3c71;hb=HEAD#l121">__pthread_unwind</a> -
Is called with <code class="language-plaintext highlighter-rouge">pd-&gt;cancel_jmp_buf</code>.  It calls glibc’s <code class="language-plaintext highlighter-rouge">_Unwind_ForcedUnwind</code> wrapper.</li>
  <li><a href="https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/nptl/unwind-forcedunwind.c;h=50a089282bc236aa644f40feafd0dacdafe3a4e7;hb=HEAD#l122">_Unwind_ForcedUnwind</a> -
Loads GCC’s <code class="language-plaintext highlighter-rouge">libgcc_s.so</code> version of <code class="language-plaintext highlighter-rouge">_Unwind_ForcedUnwind</code>
and calls it with parameters:
    <ul>
      <li><code class="language-plaintext highlighter-rouge">exc</code> - the exception context</li>
      <li><code class="language-plaintext highlighter-rouge">unwind_stop</code> - the stop callback to GLIBC, called for each frame of the unwind, with
the stop argument <code class="language-plaintext highlighter-rouge">ibuf</code></li>
      <li><code class="language-plaintext highlighter-rouge">ibuf</code> - the <code class="language-plaintext highlighter-rouge">jmp_buf</code>, created by <code class="language-plaintext highlighter-rouge">setjmp</code> (<code class="language-plaintext highlighter-rouge">self-&gt;cancel_jmp_buf</code>) in <code class="language-plaintext highlighter-rouge">start_thread</code></li>
    </ul>
  </li>
  <li><a href="https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/unwind.c;h=8f157e49f4a088ac64722e85ff24514fff7f3c71;hb=HEAD#l39">unwind_stop</a> -
Checks the current state of the unwind and calls <code class="language-plaintext highlighter-rouge">longjmp</code> with the <code class="language-plaintext highlighter-rouge">cancel_jmp_buf</code> if
we are at the end of the stack.  After the jump to the <code class="language-plaintext highlighter-rouge">cancel_jmp_buf</code> the thread
exits.</li>
</ul>

<p>Let’s look at <code class="language-plaintext highlighter-rouge">pd-&gt;cancel_jmp_buf</code> in more detail.  The <code class="language-plaintext highlighter-rouge">cancel_jmp_buf</code> is
set up during <code class="language-plaintext highlighter-rouge">pthread_create</code> after clone in <a href="https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/pthread_create.c;h=bad4e57a845bd3148ad634acaaccbea08b04dbbd;hb=HEAD#l406">start_thread</a>.
It uses the <a href="https://www.man7.org/linux/man-pages/man3/setjmp.3.html">setjmp</a> and
<a href="https://man7.org/linux/man-pages/man3/longjmp.3.html">longjmp</a> non-local goto mechanism.</p>

<p>Let’s look at some diagrams.</p>

<p><img src="/content/2020/pthread-normal-seq.png" alt="Pthread Normal" /></p>

<p>The above diagram shows a pthread that exits normally.  During the <em>Start</em> phase
of the thread <code class="language-plaintext highlighter-rouge">setjmp</code> will create the <code class="language-plaintext highlighter-rouge">cancel_jmp_buf</code>. After the thread
routine exits it returns to the <code class="language-plaintext highlighter-rouge">start_thread</code> routine to do cleanup.
The <code class="language-plaintext highlighter-rouge">cancel_jmp_buf</code> is not used.</p>

<p><img src="/content/2020/pthread-signalled-seq.png" alt="Pthread Signalled" /></p>

<p>The above diagram shows a pthread that is cancelled.  When the
thread is created <code class="language-plaintext highlighter-rouge">setjmp</code> will create the <code class="language-plaintext highlighter-rouge">cancel_jmp_buf</code>. In this case
while the thread routine is running it is cancelled, the unwinder runs
and at the end it calls <code class="language-plaintext highlighter-rouge">unwind_stop</code> which calls <code class="language-plaintext highlighter-rouge">longjmp</code>.  After the
<code class="language-plaintext highlighter-rouge">longjmp</code> the thread is returned to <code class="language-plaintext highlighter-rouge">start_thread</code> to do cleanup.</p>

<p>A highly redacted version of our <code class="language-plaintext highlighter-rouge">start_thread</code> and <code class="language-plaintext highlighter-rouge">unwind_stop</code> functions is
shown below.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">start_thread</span><span class="p">()</span>
<span class="p">{</span>
  <span class="k">struct</span> <span class="n">pthread</span> <span class="o">*</span><span class="n">pd</span> <span class="o">=</span> <span class="n">START_THREAD_SELF</span><span class="p">;</span>
  <span class="p">...</span>
  <span class="k">struct</span> <span class="n">pthread_unwind_buf</span> <span class="n">unwind_buf</span><span class="p">;</span>

  <span class="kt">int</span> <span class="n">not_first_call</span><span class="p">;</span>
  <span class="n">not_first_call</span> <span class="o">=</span> <span class="n">setjmp</span> <span class="p">((</span><span class="k">struct</span> <span class="n">__jmp_buf_tag</span> <span class="o">*</span><span class="p">)</span> <span class="n">unwind_buf</span><span class="p">.</span><span class="n">cancel_jmp_buf</span><span class="p">);</span>
  <span class="p">...</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">__glibc_likely</span> <span class="p">(</span><span class="o">!</span> <span class="n">not_first_call</span><span class="p">))</span>
    <span class="p">{</span>
      <span class="cm">/* Store the new cleanup handler info.  */</span>
      <span class="n">THREAD_SETMEM</span> <span class="p">(</span><span class="n">pd</span><span class="p">,</span> <span class="n">cleanup_jmp_buf</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">unwind_buf</span><span class="p">);</span>
      <span class="p">...</span>

      <span class="cm">/* Run the user provided thread routine */</span>
      <span class="n">ret</span> <span class="o">=</span> <span class="n">pd</span><span class="o">-&gt;</span><span class="n">start_routine</span> <span class="p">(</span><span class="n">pd</span><span class="o">-&gt;</span><span class="n">arg</span><span class="p">);</span>
      <span class="n">THREAD_SETMEM</span> <span class="p">(</span><span class="n">pd</span><span class="p">,</span> <span class="n">result</span><span class="p">,</span> <span class="n">ret</span><span class="p">);</span>
    <span class="p">}</span>
  <span class="p">...</span> <span class="n">free</span> <span class="n">resources</span> <span class="p">...</span>
  <span class="n">__exit_thread</span> <span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">unwind_stop</span> <span class="p">(</span><span class="n">_Unwind_Action</span> <span class="n">actions</span><span class="p">,</span>
	     <span class="k">struct</span> <span class="n">_Unwind_Context</span> <span class="o">*</span><span class="n">context</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">stop_parameter</span><span class="p">)</span>
<span class="p">{</span>
  <span class="k">struct</span> <span class="n">pthread_unwind_buf</span> <span class="o">*</span><span class="n">buf</span> <span class="o">=</span> <span class="n">stop_parameter</span><span class="p">;</span>
  <span class="k">struct</span> <span class="n">pthread</span> <span class="o">*</span><span class="n">self</span> <span class="o">=</span> <span class="n">THREAD_SELF</span><span class="p">;</span>
  <span class="kt">int</span> <span class="n">do_longjump</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
  <span class="p">...</span>

  <span class="k">if</span> <span class="p">((</span><span class="n">actions</span> <span class="o">&amp;</span> <span class="n">_UA_END_OF_STACK</span><span class="p">)</span>
      <span class="o">||</span> <span class="p">...</span> <span class="p">)</span>
    <span class="n">do_longjump</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>

  <span class="p">...</span>
  <span class="cm">/* If we are at the end, go back to start_thread for cleanup */</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">do_longjump</span><span class="p">)</span>
    <span class="n">__libc_unwind_longjmp</span> <span class="p">((</span><span class="k">struct</span> <span class="n">__jmp_buf_tag</span> <span class="o">*</span><span class="p">)</span> <span class="n">buf</span><span class="o">-&gt;</span><span class="n">cancel_jmp_buf</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>

  <span class="k">return</span> <span class="n">_URC_NO_REASON</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
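
<p>The control flow between these two functions can be reproduced in miniature without any threads.
The sketch below is only an illustration: <code class="language-plaintext highlighter-rouge">run_thread</code>,
<code class="language-plaintext highlighter-rouge">routine</code> and <code class="language-plaintext highlighter-rouge">ThreadExit</code> are
hypothetical stand-ins for <code class="language-plaintext highlighter-rouge">start_thread</code>, the user routine and the two exit paths,
and a plain <code class="language-plaintext highlighter-rouge">longjmp</code> plays the part of <code class="language-plaintext highlighter-rouge">unwind_stop</code>.</p>

```cpp
#include <cassert>
#include <csetjmp>

// Hypothetical stand-in for the per-thread cancel_jmp_buf.
static std::jmp_buf cancel_jmp_buf;

enum class ThreadExit { Returned, Cancelled };

// Stand-in for the user routine.  When "cancelled" it never returns;
// it jumps straight back to the setjmp call, as unwind_stop does.
static void routine(bool cancel) {
  if (cancel)
    std::longjmp(cancel_jmp_buf, 1);
}

// Stand-in for start_thread: setjmp returns 0 on the first call and
// non-zero when we arrive here again via longjmp.
ThreadExit run_thread(bool cancel) {
  if (setjmp(cancel_jmp_buf) == 0) {
    routine(cancel);
    return ThreadExit::Returned;   // normal exit, jmp_buf never used
  }
  return ThreadExit::Cancelled;    // reached only through longjmp
}

// Both paths end up back in run_thread, where cleanup would happen.
void check_both_paths() {
  assert(run_thread(false) == ThreadExit::Returned);
  assert(run_thread(true) == ThreadExit::Cancelled);
}
```

<p>Note that this toy keeps no objects with destructors alive across the jump; running those
destructors and <code class="language-plaintext highlighter-rouge">catch</code> blocks is exactly the job the unwinder does before
<code class="language-plaintext highlighter-rouge">unwind_stop</code> finally jumps.</p>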

<h4 id="gcc">GCC</h4>

<p>GCC provides the exception handling and unwinding capabilities
to the C++ runtime.  They are provided in the <code class="language-plaintext highlighter-rouge">libgcc_s.so</code> and <code class="language-plaintext highlighter-rouge">libstdc++.so.6</code> libraries.</p>

<p>The <code class="language-plaintext highlighter-rouge">libgcc_s.so</code> library implements the <a href="https://itanium-cxx-abi.github.io/cxx-abi/abi-eh.html">IA-64 Itanium Exception Handling ABI</a>.
It’s interesting that the now defunct <a href="https://en.wikipedia.org/wiki/Itanium#Itanium_9700_(Kittson):_2017">Itanium</a>
architecture introduced this ABI, which is now the de facto standard for exception
handling on most architectures.  The two main entry points for the unwinder are:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">_Unwind_ForcedUnwind</code> - for forced unwinding</li>
  <li><code class="language-plaintext highlighter-rouge">_Unwind_RaiseException</code> - for raising normal exceptions</li>
</ul>

<p>There are also two data structures to be aware of:</p>

<ul>
  <li><a href="https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libgcc/unwind-dw2.c;h=fe896565d2ec5c43ac683f2c6ed6d5e49fd8242e;hb=HEAD#l12">_Unwind_Context</a> - register and unwind state for a frame, below referenced as CONTEXT</li>
  <li><a href="https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libgcc/unwind-dw2.h;h=2b8c1fd49dbc1bf2c816015ea2fed125774a8ef3;hb=HEAD#l25">_Unwind_FrameState</a> - register and unwind state from DWARF, below referenced as FS</li>
</ul>

<p>The important parts of <code class="language-plaintext highlighter-rouge">_Unwind_Context</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">_Unwind_Context</span> <span class="p">{</span>
  <span class="n">_Unwind_Context_Reg_Val</span> <span class="n">reg</span><span class="p">[</span><span class="n">__LIBGCC_DWARF_FRAME_REGISTERS__</span><span class="o">+</span><span class="mi">1</span><span class="p">];</span>
  <span class="kt">void</span> <span class="o">*</span><span class="n">cfa</span><span class="p">;</span>
  <span class="kt">void</span> <span class="o">*</span><span class="n">ra</span><span class="p">;</span>
  <span class="k">struct</span> <span class="n">dwarf_eh_bases</span> <span class="n">bases</span><span class="p">;</span>
  <span class="n">_Unwind_Word</span> <span class="n">flags</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The important parts of <code class="language-plaintext highlighter-rouge">_Unwind_FrameState</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
  <span class="k">struct</span> <span class="n">frame_state_reg_info</span> <span class="p">{</span> <span class="p">...</span> <span class="p">}</span> <span class="n">regs</span><span class="p">;</span>
  <span class="kt">void</span> <span class="o">*</span><span class="n">pc</span><span class="p">;</span>

  <span class="cm">/* The information we care about from the CIE/FDE.  */</span>
  <span class="n">_Unwind_Personality_Fn</span> <span class="n">personality</span><span class="p">;</span>
  <span class="n">_Unwind_Sword</span> <span class="n">data_align</span><span class="p">;</span>
  <span class="n">_Unwind_Word</span> <span class="n">code_align</span><span class="p">;</span>
  <span class="n">_Unwind_Word</span> <span class="n">retaddr_column</span><span class="p">;</span>
  <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">fde_encoding</span><span class="p">;</span>
  <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">signal_frame</span><span class="p">;</span>
  <span class="kt">void</span> <span class="o">*</span><span class="n">eh_ptr</span><span class="p">;</span>
<span class="p">}</span> <span class="n">_Unwind_FrameState</span><span class="p">;</span>
</code></pre></div></div>

<p>These two data structures are very similar.  The <code class="language-plaintext highlighter-rouge">_Unwind_FrameState</code> is for internal
use and is closely tied to the DWARF definitions of the frame.  The <code class="language-plaintext highlighter-rouge">_Unwind_Context</code>
struct is more generic and is used as an opaque structure in the public unwind API.</p>

<p><em>Forced Unwinds</em></p>

<p>Exceptions that are raised for thread cancellation use a single phase forced unwind.
Code execution will not resume, but catch blocks will be run.  This is why
<a href="https://udrepper.livejournal.com/21541.html">cancel exceptions must be rethrown</a>.</p>
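
<p>As a rough sketch of what a working system should do, the program below mirrors the shape of the
failing test: a worker blocks in <code class="language-plaintext highlighter-rouge">sem_wait</code>, gets cancelled, and its
<code class="language-plaintext highlighter-rouge">catch (...)</code> block runs and rethrows.  The names
<code class="language-plaintext highlighter-rouge">worker</code> and <code class="language-plaintext highlighter-rouge">cancel_runs_catch_block</code> are
mine, not from the GLIBC test, and the behaviour assumes glibc with GCC’s unwinder on Linux.</p>

```cpp
#include <cassert>
#include <pthread.h>
#include <semaphore.h>

static bool except_caught = false;
static pthread_barrier_t barrier;
static sem_t sem;

static void *worker(void *) {
  try {
    pthread_barrier_wait(&barrier);  // tell the parent we are inside the try block
    while (true)
      sem_wait(&sem);                // cancellation point, never posted
  } catch (...) {
    except_caught = true;            // run by the forced unwind
    throw;                           // cancellation must be rethrown
  }
  return nullptr;
}

// Returns true when the catch block ran and the thread ended as cancelled.
bool cancel_runs_catch_block() {
  except_caught = false;
  int rc = pthread_barrier_init(&barrier, nullptr, 2);
  assert(rc == 0);
  rc = sem_init(&sem, 0, 0);
  assert(rc == 0);

  pthread_t th;
  if (pthread_create(&th, nullptr, worker, nullptr) != 0)
    return false;

  pthread_barrier_wait(&barrier);    // worker is now in (or about to enter) sem_wait
  pthread_cancel(th);

  void *result = nullptr;
  pthread_join(th, &result);         // join also orders the read of except_caught

  sem_destroy(&sem);
  pthread_barrier_destroy(&barrier);
  return except_caught && result == PTHREAD_CANCELED;
}
```

<p>On a platform where unwinding through the cancel signal works this returns
<code class="language-plaintext highlighter-rouge">true</code>.  If the <code class="language-plaintext highlighter-rouge">throw;</code> is left out,
glibc aborts the process rather than let the cancellation be swallowed.</p>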

<p>Forced unwinds use the <code class="language-plaintext highlighter-rouge">unwind_stop</code> handler which GLIBC provides as explained in
the <strong>GLIBC</strong> section above.</p>

<ul>
  <li><a href="https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libgcc/unwind.inc;h=9acead33ffc01e892d6feda2aaeffd9d04e56e74;hb=HEAD#l201">_Unwind_ForcedUnwind</a> - calls:
    <ul>
      <li><a href="https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libgcc/unwind-dw2.c;h=fe896565d2ec5c43ac683f2c6ed6d5e49fd8242e;hb=HEAD#l1558">uw_init_context</a> - load details of the current frame from cpu/stack into CONTEXT</li>
      <li><a href="https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libgcc/unwind.inc;h=9acead33ffc01e892d6feda2aaeffd9d04e56e74;hb=HEAD#l144">_Unwind_ForcedUnwind_Phase2</a> - do the frame iterations</li>
      <li><a href="https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libgcc/unwind-dw2.c;h=fe896565d2ec5c43ac683f2c6ed6d5e49fd8242e;hb=HEAD#l1641">uw_install_context</a> - exit unwinder jumping into the selected frame</li>
    </ul>
  </li>
  <li><code class="language-plaintext highlighter-rouge">_Unwind_ForcedUnwind_Phase2</code> - loops forever doing:
    <ul>
      <li><a href="https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libgcc/unwind-dw2.c;h=fe896565d2ec5c43ac683f2c6ed6d5e49fd8242e;hb=HEAD#l1244">uw_frame_state_for</a> - populate FS for the frame one frame above CONTEXT, searching DWARF using CONTEXT-&gt;ra</li>
      <li><code class="language-plaintext highlighter-rouge">stop</code> - callback to GLIBC to stop the unwind if needed</li>
      <li><code class="language-plaintext highlighter-rouge">FS.personality</code> - the C++ personality routine, see below, called with <code class="language-plaintext highlighter-rouge">_UA_FORCE_UNWIND | _UA_CLEANUP_PHASE</code></li>
      <li><a href="https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libgcc/unwind-dw2.c;h=fe896565d2ec5c43ac683f2c6ed6d5e49fd8242e;hb=HEAD#l1552">uw_advance_context</a> - advance CONTEXT by populating it from FS</li>
    </ul>
  </li>
</ul>
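
<p>The same frame-by-frame walk, with a callback invoked for each frame, can be seen through the
public <code class="language-plaintext highlighter-rouge">_Unwind_Backtrace</code> entry point declared in
<code class="language-plaintext highlighter-rouge">unwind.h</code>.  This is a different entry point from
<code class="language-plaintext highlighter-rouge">_Unwind_ForcedUnwind</code> and it never installs a context, so treat the sketch below
only as a way to watch the unwinder step through frames; <code class="language-plaintext highlighter-rouge">Trace</code>,
<code class="language-plaintext highlighter-rouge">on_frame</code> and <code class="language-plaintext highlighter-rouge">count_frames</code> are
illustrative names.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unwind.h>

// Fixed-size storage so the callback never allocates or throws.
struct Trace {
  static constexpr std::size_t kMax = 64;
  std::uintptr_t pc[kMax];
  std::size_t depth = 0;
};

// Called by the unwinder once per frame, innermost frame first.
static _Unwind_Reason_Code on_frame(struct _Unwind_Context *context, void *arg) {
  Trace *trace = static_cast<Trace *>(arg);
  if (trace->depth == Trace::kMax)
    return _URC_END_OF_STACK;  // any value other than _URC_NO_REASON stops the walk
  trace->pc[trace->depth++] = static_cast<std::uintptr_t>(_Unwind_GetIP(context));
  return _URC_NO_REASON;       // keep going to the next frame
}

// Returns how many frames the unwinder could step through from here.
__attribute__((noinline)) std::size_t count_frames() {
  Trace trace;
  _Unwind_Backtrace(on_frame, &trace);
  assert(trace.depth <= Trace::kMax);
  return trace.depth;
}
```

<p>If the unwind information for some frame is missing or wrong, the walk simply stops early, which
is roughly the symptom we are chasing.</p>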

<p><em>Normal Exceptions</em></p>

<p>For exceptions raised programmatically, unwinding is very similar to the forced unwind, but
there is no <code class="language-plaintext highlighter-rouge">stop</code> function and exception unwinding is two-phase.</p>

<ul>
  <li><a href="https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libgcc/unwind.inc;h=9acead33ffc01e892d6feda2aaeffd9d04e56e74;hb=HEAD#l83">_Unwind_RaiseException</a> - calls:
    <ul>
      <li><code class="language-plaintext highlighter-rouge">uw_init_context</code> - load details of the current frame from cpu/stack into CONTEXT</li>
      <li>Do phase 1 loop:
        <ul>
          <li><code class="language-plaintext highlighter-rouge">uw_frame_state_for</code> - populate FS for the frame one frame above CONTEXT, searching DWARF using CONTEXT-&gt;ra</li>
          <li><code class="language-plaintext highlighter-rouge">FS.personality</code> - the C++ personality routine, see below, called with <code class="language-plaintext highlighter-rouge">_UA_SEARCH_PHASE</code></li>
          <li><a href="https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libgcc/unwind-dw2.c;h=fe896565d2ec5c43ac683f2c6ed6d5e49fd8242e;hb=HEAD#l1516">uw_update_context</a> - advance CONTEXT by populating it from FS (same as <code class="language-plaintext highlighter-rouge">uw_advance_context</code>)</li>
        </ul>
      </li>
      <li><a href="https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libgcc/unwind.inc;h=9acead33ffc01e892d6feda2aaeffd9d04e56e74;hb=HEAD#l30">_Unwind_RaiseException_Phase2</a> - do the frame iterations</li>
      <li><code class="language-plaintext highlighter-rouge">uw_install_context</code> - exit unwinder jumping to selected frame</li>
    </ul>
  </li>
  <li><code class="language-plaintext highlighter-rouge">_Unwind_RaiseException_Phase2</code> - do phase 2, loops forever doing:
    <ul>
      <li><code class="language-plaintext highlighter-rouge">uw_frame_state_for</code> - populate FS for the frame one frame above CONTEXT, searching DWARF using CONTEXT-&gt;ra</li>
      <li><code class="language-plaintext highlighter-rouge">FS.personality</code> - the C++ personality routine, called with <code class="language-plaintext highlighter-rouge">_UA_CLEANUP_PHASE</code></li>
      <li><code class="language-plaintext highlighter-rouge">uw_update_context</code> - advance CONTEXT by populating it from FS</li>
    </ul>
  </li>
</ul>

<p>The <code class="language-plaintext highlighter-rouge">libstdc++.so.6</code> library provides the <a href="https://gcc.gnu.org/onlinedocs/gcc-10.2.0/libstdc++/manual/">C++ standard library</a>
which includes the C++ personality routine <a href="https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/libsupc%2B%2B/eh_personality.cc;h=fd7cd6fc79886bf17aea6bc713d2a3840aa31326;hb=HEAD#l336">__gxx_personality_v0</a>.
The personality routine is the interface between the unwind routines and the C++
(or other language) runtime, which handles the exception handling logic for that
language.</p>

<p>As we saw above the personality routine is executed for each stack frame.  The
function checks if there is a <code class="language-plaintext highlighter-rouge">catch</code> block that matches the exception being
thrown.  If there is a match, it will update the context to prepare it to jump
into the catch routine and return <code class="language-plaintext highlighter-rouge">_URC_INSTALL_CONTEXT</code>.  If there is no catch
block matching it returns <code class="language-plaintext highlighter-rouge">_URC_CONTINUE_UNWIND</code>.</p>

<p>When the personality routine returns <code class="language-plaintext highlighter-rouge">_URC_INSTALL_CONTEXT</code>, the <code class="language-plaintext highlighter-rouge">_Unwind_ForcedUnwind_Phase2</code>
loop breaks and calls <code class="language-plaintext highlighter-rouge">uw_install_context</code>.</p>
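
<p>To tie the two phases together, here is a toy model of a normal (non-forced) throw.  It is a
simplification and not the <code class="language-plaintext highlighter-rouge">libgcc</code> code: the
<code class="language-plaintext highlighter-rouge">Frame</code> flags stand in for what the personality routine would
discover from the exception tables, and the model glosses over the fact that a real cleanup landing pad resumes
the unwind itself once it has run.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of one stack frame, innermost frame first.
struct Frame {
  std::string name;
  bool has_cleanup;  // destructors to run (personality acts in the cleanup phase)
  bool has_handler;  // a matching catch block (found in the search phase)
};

struct UnwindResult {
  bool handled;
  std::string landing_frame;
  std::vector<std::string> cleanups_run;
};

UnwindResult raise_exception(const std::vector<Frame> &stack) {
  // Phase 1 (_UA_SEARCH_PHASE): look for a handler, change nothing.
  std::size_t handler = stack.size();
  for (std::size_t i = 0; i < stack.size(); ++i) {
    if (stack[i].has_handler) {
      handler = i;
      break;
    }
  }
  if (handler == stack.size())
    return {false, "", {}};  // end of stack: no cleanups have been run

  // Phase 2 (_UA_CLEANUP_PHASE): walk again, running cleanups up to the handler.
  UnwindResult result{true, stack[handler].name, {}};
  for (std::size_t i = 0; i < handler; ++i) {
    if (stack[i].has_cleanup)
      result.cleanups_run.push_back(stack[i].name);
  }
  assert(result.cleanups_run.size() <= handler);
  return result;
}
```

<p>A forced unwind skips the first loop entirely, which is why it is described as single phase:
there is no search for a handler, just cleanups and <code class="language-plaintext highlighter-rouge">catch</code> blocks run on the
way out.</p>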

<h4 id="unwinding-through-a-signal-frame">Unwinding through a Signal Frame</h4>

<p>When the GCC unwinder is looping through frames the <code class="language-plaintext highlighter-rouge">uw_frame_state_for</code>
function will search DWARF information.  The DWARF lookup will fail for signal
frames and a fallback mechanism is provided for each architecture to handle
this.  For OpenRISC Linux this is handled by
<a href="https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libgcc/config/or1k/linux-unwind.h;h=c7ed043d3a89f2db205fd78fcb5db21f6fb561b2;hb=HEAD">or1k_fallback_frame_state</a>.
To understand how this works let’s look into the Linux kernel a bit.</p>

<p>A process must be context switched to kernel by either a system call, timer or other
interrupt in order to receive a signal.</p>

<p><img src="/content/2020/stack-frame-int.png" alt="The Stack Frame after an Interrupt" /></p>

<p>The diagram above shows what a process stack looks like after the kernel takes over.
An <em>interrupt frame</em> is pushed onto the top of the stack and the <code class="language-plaintext highlighter-rouge">pt_regs</code> structure
is filled out containing the processor state before the interrupt.</p>

<p><img src="/content/2020/stack-frame-in-handler.png" alt="The Stack Frame in a Sig Handler" /></p>

<p>This second diagram shows what happens when a signal handler is invoked.  A new
special <em>signal frame</em> is pushed onto the stack and when the process is resumed
it resumes in the signal handler.  In OpenRISC the signal frame is set up by the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/openrisc/kernel/signal.c?h=v5.10-rc7#n144">setup_rt_frame</a>
function.  It is reached from <code class="language-plaintext highlighter-rouge">do_signal</code>, which calls <code class="language-plaintext highlighter-rouge">handle_signal</code>,
which in turn calls <code class="language-plaintext highlighter-rouge">setup_rt_frame</code>.</p>

<p>After the signal handler routine runs we return to a special bit of code called
the <strong>Trampoline</strong>.  The trampoline code lives on the stack and runs
<a href="https://man7.org/linux/man-pages/man2/sigreturn.2.html">sigreturn</a>.</p>

<p>Now back to <code class="language-plaintext highlighter-rouge">or1k_fallback_frame_state</code>.</p>

<p>The <code class="language-plaintext highlighter-rouge">or1k_fallback_frame_state</code> function checks if the current frame is a
<em>signal frame</em> by confirming the return address points to a <strong>Trampoline</strong>.  If
it is a trampoline it looks into the kernel saved <code class="language-plaintext highlighter-rouge">ucontext</code> and <code class="language-plaintext highlighter-rouge">pt_regs</code> to find
the previous user frame.  Unwinding can then continue as normal.</p>
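
<p>The trampoline check can be sketched in C as follows.  This is a simplified illustration, not the code from libgcc: the helper name is hypothetical, and the instruction encodings are assumptions based on the OpenRISC instruction set (an <code class="language-plaintext highlighter-rouge">l.ori r11, r0, imm</code> that loads a syscall number, followed by <code class="language-plaintext highlighter-rouge">l.sys 1</code>).  The second half of the real function, reading <code class="language-plaintext highlighter-rouge">ucontext</code> and <code class="language-plaintext highlighter-rouge">pt_regs</code>, is not modeled here.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdbool.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

/* Assumed encodings: "l.ori r11, r0, imm16" with the immediate masked off,
   and "l.sys 1".  Illustrative only, not copied from libgcc. */
#define L_ORI_R11_R0_MASKED 0xa9600000u
#define L_SYS_1             0x20000001u

/* Hypothetical helper: does the code at the return address look like the
   two instruction sigreturn trampoline? */
static bool looks_like_trampoline(const uint32_t *ra)
{
    return (ra[0] &amp; 0xffff0000u) == L_ORI_R11_R0_MASKED
           &amp;&amp; ra[1] == L_SYS_1;
}

int main(void)
{
    /* 0x8b in the immediate is an example syscall number (an assumption). */
    const uint32_t trampoline[] = { 0xa960008bu, 0x20000001u, 0x15000000u };
    /* Arbitrary words that do not match the pattern. */
    const uint32_t ordinary[]   = { 0x9c21ffe4u, 0x12345678u, 0x15000000u };

    printf("trampoline: %d\n", looks_like_trampoline(trampoline));
    printf("ordinary:   %d\n", looks_like_trampoline(ordinary));
    return 0;
}
</code></pre></div></div>

<p>Matching on instruction words works because the kernel writes the same short code sequence into every signal frame, so the return address of a signal handler always points at a recognizable pattern.</p>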

<h3 id="debugging-the-issue">Debugging the Issue</h3>

<p>Now with a good background in how unwinding works we can start to debug our test
case.  We can recall our hypothesis:</p>

<blockquote>
  <p><strong>Hypothesis</strong>: There is a problem handling exceptions while in a syscall.
There may be something broken with OpenRISC related to how we setup stack
frames for syscalls that makes the unwinder fail.</p>
</blockquote>

<p>With GDB we can start to debug exception handling.  We can trace right to the
start of the exception handling logic by setting our breakpoint at
<code class="language-plaintext highlighter-rouge">_Unwind_ForcedUnwind</code>.</p>

<p>This is the stack trace we see:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#0  _Unwind_ForcedUnwind_Phase2 (exc=0x30caf658, context=0x30caeb6c, frames_p=0x30caea90) at ../../../libgcc/unwind.inc:192
#1  0x30303858 in _Unwind_ForcedUnwind (exc=0x30caf658, stop=0x30321dcc &lt;unwind_stop&gt;, stop_argument=0x30caeea4) at ../../../libgcc/unwind.inc:217
#2  0x30321fc0 in __GI___pthread_unwind (buf=&lt;optimized out&gt;) at unwind.c:121
#3  0x30312388 in __do_cancel () at pthreadP.h:313
#4  sigcancel_handler (sig=32, si=0x30caec98, ctx=&lt;optimized out&gt;) at nptl-init.c:162
#5  sigcancel_handler (sig=&lt;optimized out&gt;, si=0x30caec98, ctx=&lt;optimized out&gt;) at nptl-init.c:127
#6  &lt;signal handler called&gt;
#7  0x303266d0 in __futex_abstimed_wait_cancelable64 (futex_word=0x7ffffd78, expected=1, clockid=&lt;optimized out&gt;, abstime=0x0, private=&lt;optimized out&gt;)
    at ../sysdeps/nptl/futex-internal.c:66
#8  0x303210f8 in __new_sem_wait_slow64 (sem=0x7ffffd78, abstime=0x0, clockid=0) at sem_waitcommon.c:285
#9  0x00002884 in tf (arg=0x7ffffd78) at throw-pthread-sem.cc:35
#10 0x30314548 in start_thread (arg=&lt;optimized out&gt;) at pthread_create.c:463
#11 0x3043638c in __or1k_clone () from /lib/libc.so.6
Backtrace stopped: frame did not save the PC
(gdb)
</code></pre></div></div>

<p>In the GDB backtrace we can see it unwinds through the signal frame and <code class="language-plaintext highlighter-rouge">sem_wait</code>
all the way to our thread routine <code class="language-plaintext highlighter-rouge">tf</code>.  It appears everything is working fine.
But we need to remember the backtrace we see above is from GDB’s unwinder, not
GCC’s.  Also, it uses the <code class="language-plaintext highlighter-rouge">.debug_frame</code> DWARF data, not <code class="language-plaintext highlighter-rouge">.eh_frame</code>.</p>

<p>To really ensure the GCC unwinder is working as expected we need to debug it
walking the stack.  Debugging when we unwind a signal frame can be done by
placing a breakpoint on <code class="language-plaintext highlighter-rouge">or1k_fallback_frame_state</code>.</p>

<p>Debugging this code as well shows it works correctly.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#0  or1k_fallback_frame_state (context=&lt;optimized out&gt;, context=&lt;optimized out&gt;, fs=&lt;optimized out&gt;) at ../../../libgcc/unwind-dw2.c:1271
#1  uw_frame_state_for (context=0x30caeb6c, fs=0x30cae914) at ../../../libgcc/unwind-dw2.c:1271
#2  0x30303200 in _Unwind_ForcedUnwind_Phase2 (exc=0x30caf658, context=0x30caeb6c, frames_p=0x30caea90) at ../../../libgcc/unwind.inc:162
#3  0x30303858 in _Unwind_ForcedUnwind (exc=0x30caf658, stop=0x30321dcc &lt;unwind_stop&gt;, stop_argument=0x30caeea4) at ../../../libgcc/unwind.inc:217
#4  0x30321fc0 in __GI___pthread_unwind (buf=&lt;optimized out&gt;) at unwind.c:121
#5  0x30312388 in __do_cancel () at pthreadP.h:313
#6  sigcancel_handler (sig=32, si=0x30caec98, ctx=&lt;optimized out&gt;) at nptl-init.c:162
#7  sigcancel_handler (sig=&lt;optimized out&gt;, si=0x30caec98, ctx=&lt;optimized out&gt;) at nptl-init.c:127
#8  &lt;signal handler called&gt;
#9  0x303266d0 in __futex_abstimed_wait_cancelable64 (futex_word=0x7ffffd78,  expected=1, clockid=&lt;optimized out&gt;, abstime=0x0, private=&lt;optimized out&gt;) at ../sysdeps/nptl/futex-internal.c:66
#10 0x303210f8 in __new_sem_wait_slow64 (sem=0x7ffffd78, abstime=0x0, clockid=0) at sem_waitcommon.c:285
#11 0x00002884 in tf (arg=0x7ffffd78) at throw-pthread-sem.cc:35
</code></pre></div></div>

<p>Debugging when the unwinding stops can be done by setting a breakpoint
on the <code class="language-plaintext highlighter-rouge">unwind_stop</code> function.</p>

<p>When debugging I was able to see that the unwinder failed when looking for
the <code class="language-plaintext highlighter-rouge">__futex_abstimed_wait_cancelable64</code> frame.  So, this is not an issue
with unwinding signal frames.</p>

<h3 id="a-second-hypothosis">A Second Hypothesis</h3>

<p>Debugging showed that the unwinder is working correctly, and it can properly
unwind through our signal frames.  However, the unwinder is bailing out early
before it gets to the <code class="language-plaintext highlighter-rouge">tf</code> frame which has the catch block we need to execute.</p>

<blockquote>
  <p><strong>Hypothesis 2</strong>: There is something wrong finding DWARF info for <code class="language-plaintext highlighter-rouge">__futex_abstimed_wait_cancelable64</code>.</p>
</blockquote>

<p>Looking at <code class="language-plaintext highlighter-rouge">libpthread.so</code> with <code class="language-plaintext highlighter-rouge">readelf</code> this function was missing completely from the <code class="language-plaintext highlighter-rouge">.eh_frame</code>
metadata.  Now we found something.</p>

<p>What creates the <code class="language-plaintext highlighter-rouge">.eh_frame</code> anyway?  GCC or Binutils (the assembler)?  If we run GCC
with the <code class="language-plaintext highlighter-rouge">-S</code> argument we can see GCC will output inline <code class="language-plaintext highlighter-rouge">.cfi</code> directives.
These <code class="language-plaintext highlighter-rouge">.cfi</code> annotations are what get assembled into the <code class="language-plaintext highlighter-rouge">.eh_frame</code>.  GCC
creates the <code class="language-plaintext highlighter-rouge">.cfi</code> directives and the assembler puts them into the <code class="language-plaintext highlighter-rouge">.eh_frame</code>
section.</p>

<p>An example of <code class="language-plaintext highlighter-rouge">gcc -S</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>         <span class="p">.</span><span class="n">file</span>   <span class="s">"unwind.c"</span>
        <span class="p">.</span><span class="n">section</span>        <span class="p">.</span><span class="n">text</span>
        <span class="p">.</span><span class="n">align</span> <span class="mi">4</span>
        <span class="p">.</span><span class="n">type</span>   <span class="n">unwind_stop</span><span class="p">,</span> <span class="err">@</span><span class="n">function</span>
<span class="n">unwind_stop</span><span class="o">:</span>
<span class="p">.</span><span class="n">LFB83</span><span class="o">:</span>
        <span class="p">.</span><span class="n">cfi_startproc</span>
        <span class="n">l</span><span class="p">.</span><span class="n">addi</span>  <span class="n">r1</span><span class="p">,</span> <span class="n">r1</span><span class="p">,</span> <span class="o">-</span><span class="mi">28</span>
        <span class="p">.</span><span class="n">cfi_def_cfa_offset</span> <span class="mi">28</span>
        <span class="n">l</span><span class="p">.</span><span class="n">sw</span>    <span class="mi">0</span><span class="p">(</span><span class="n">r1</span><span class="p">),</span> <span class="n">r16</span>
        <span class="n">l</span><span class="p">.</span><span class="n">sw</span>    <span class="mi">4</span><span class="p">(</span><span class="n">r1</span><span class="p">),</span> <span class="n">r18</span>
        <span class="n">l</span><span class="p">.</span><span class="n">sw</span>    <span class="mi">8</span><span class="p">(</span><span class="n">r1</span><span class="p">),</span> <span class="n">r20</span>
        <span class="n">l</span><span class="p">.</span><span class="n">sw</span>    <span class="mi">12</span><span class="p">(</span><span class="n">r1</span><span class="p">),</span> <span class="n">r22</span>
        <span class="n">l</span><span class="p">.</span><span class="n">sw</span>    <span class="mi">16</span><span class="p">(</span><span class="n">r1</span><span class="p">),</span> <span class="n">r24</span>
        <span class="n">l</span><span class="p">.</span><span class="n">sw</span>    <span class="mi">20</span><span class="p">(</span><span class="n">r1</span><span class="p">),</span> <span class="n">r26</span>
        <span class="n">l</span><span class="p">.</span><span class="n">sw</span>    <span class="mi">24</span><span class="p">(</span><span class="n">r1</span><span class="p">),</span> <span class="n">r9</span>
        <span class="p">.</span><span class="n">cfi_offset</span> <span class="mi">16</span><span class="p">,</span> <span class="o">-</span><span class="mi">28</span>
        <span class="p">.</span><span class="n">cfi_offset</span> <span class="mi">18</span><span class="p">,</span> <span class="o">-</span><span class="mi">24</span>
        <span class="p">.</span><span class="n">cfi_offset</span> <span class="mi">20</span><span class="p">,</span> <span class="o">-</span><span class="mi">20</span>
        <span class="p">.</span><span class="n">cfi_offset</span> <span class="mi">22</span><span class="p">,</span> <span class="o">-</span><span class="mi">16</span>
        <span class="p">.</span><span class="n">cfi_offset</span> <span class="mi">24</span><span class="p">,</span> <span class="o">-</span><span class="mi">12</span>
        <span class="p">.</span><span class="n">cfi_offset</span> <span class="mi">26</span><span class="p">,</span> <span class="o">-</span><span class="mi">8</span>
        <span class="p">.</span><span class="n">cfi_offset</span> <span class="mi">9</span><span class="p">,</span> <span class="o">-</span><span class="mi">4</span>
        <span class="n">l</span><span class="p">.</span><span class="n">or</span>    <span class="n">r24</span><span class="p">,</span> <span class="n">r8</span><span class="p">,</span> <span class="n">r8</span>
        <span class="n">l</span><span class="p">.</span><span class="n">or</span>    <span class="n">r22</span><span class="p">,</span> <span class="n">r10</span><span class="p">,</span> <span class="n">r10</span>
        <span class="n">l</span><span class="p">.</span><span class="n">lwz</span>   <span class="n">r18</span><span class="p">,</span> <span class="o">-</span><span class="mi">1172</span><span class="p">(</span><span class="n">r10</span><span class="p">)</span>
        <span class="n">l</span><span class="p">.</span><span class="n">lwz</span>   <span class="n">r20</span><span class="p">,</span> <span class="o">-</span><span class="mi">692</span><span class="p">(</span><span class="n">r10</span><span class="p">)</span>
        <span class="n">l</span><span class="p">.</span><span class="n">lwz</span>   <span class="n">r17</span><span class="p">,</span> <span class="o">-</span><span class="mi">688</span><span class="p">(</span><span class="n">r10</span><span class="p">)</span>
        <span class="n">l</span><span class="p">.</span><span class="n">add</span>   <span class="n">r20</span><span class="p">,</span> <span class="n">r20</span><span class="p">,</span> <span class="n">r17</span>
        <span class="n">l</span><span class="p">.</span><span class="n">andi</span>  <span class="n">r16</span><span class="p">,</span> <span class="n">r4</span><span class="p">,</span> <span class="mi">16</span>
        <span class="n">l</span><span class="p">.</span><span class="n">sfnei</span> <span class="n">r16</span><span class="p">,</span> <span class="mi">0</span>
</code></pre></div></div>
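
<p>The directives in the listing above can be read as simple arithmetic.  Assuming the usual OpenRISC convention that <code class="language-plaintext highlighter-rouge">r1</code> is the stack pointer, <code class="language-plaintext highlighter-rouge">.cfi_def_cfa_offset 28</code> says the canonical frame address (CFA) is <code class="language-plaintext highlighter-rouge">r1 + 28</code>, and each <code class="language-plaintext highlighter-rouge">.cfi_offset REG, N</code> says that register was saved at <code class="language-plaintext highlighter-rouge">CFA + N</code>.  The small C sketch below, with hypothetical type and helper names, recomputes the stack slot of each saved register so it can be compared with the <code class="language-plaintext highlighter-rouge">l.sw</code> instructions.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdio.h&gt;

/* One ".cfi_offset REG, OFFSET" rule: REG is saved at CFA + OFFSET. */
struct cfi_rule {
    int regno;
    int offset_from_cfa;
};

/* Hypothetical helper: turn a CFA relative offset into an offset from the
   current stack pointer, given ".cfi_def_cfa_offset N" (CFA = sp + N). */
static int sp_relative(int cfa_offset, int offset_from_cfa)
{
    return cfa_offset + offset_from_cfa;
}

int main(void)
{
    const int cfa_offset = 28;  /* from .cfi_def_cfa_offset 28 */
    const struct cfi_rule rules[] = {
        { 16, -28 }, { 18, -24 }, { 20, -20 }, { 22, -16 },
        { 24, -12 }, { 26, -8 }, { 9, -4 },
    };
    const int count = (int) (sizeof rules / sizeof rules[0]);

    for (int i = 0; i &lt; count; i++)
        printf("r%d is saved at %d(r1)\n", rules[i].regno,
               sp_relative(cfa_offset, rules[i].offset_from_cfa));
    return 0;
}
</code></pre></div></div>

<p>For example, <code class="language-plaintext highlighter-rouge">.cfi_offset 9, -4</code> works out to <code class="language-plaintext highlighter-rouge">28 - 4 = 24</code>, which matches the <code class="language-plaintext highlighter-rouge">l.sw 24(r1), r9</code> store in the listing.  This is the information the unwinder needs to recover the caller’s registers for a frame.</p>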

<p>When looking at the glibc build I noticed the <code class="language-plaintext highlighter-rouge">.eh_frame</code> data for
<code class="language-plaintext highlighter-rouge">__futex_abstimed_wait_cancelable64</code> is missing from <code class="language-plaintext highlighter-rouge">futex-internal.o</code>.  In the assembly
output for this file, the one where unwinding fails, the <code class="language-plaintext highlighter-rouge">.cfi</code> directives are completely missing.
Why is GCC not generating <code class="language-plaintext highlighter-rouge">.cfi</code> directives for this file?</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        <span class="p">.</span><span class="n">file</span>   <span class="s">"futex-internal.c"</span>
        <span class="p">.</span><span class="n">section</span>        <span class="p">.</span><span class="n">text</span>
        <span class="p">.</span><span class="n">section</span>        <span class="p">.</span><span class="n">rodata</span><span class="p">.</span><span class="n">str1</span><span class="p">.</span><span class="mi">1</span><span class="p">,</span><span class="s">"aMS"</span><span class="p">,</span><span class="err">@</span><span class="n">progbits</span><span class="p">,</span><span class="mi">1</span>
<span class="p">.</span><span class="n">LC0</span><span class="o">:</span>
        <span class="p">.</span><span class="n">string</span> <span class="s">"The futex facility returned an unexpected error code.</span><span class="se">\n</span><span class="s">"</span>
        <span class="p">.</span><span class="n">section</span>        <span class="p">.</span><span class="n">text</span>
        <span class="p">.</span><span class="n">align</span> <span class="mi">4</span>
        <span class="p">.</span><span class="n">global</span> <span class="n">__futex_abstimed_wait_cancelable64</span>
        <span class="p">.</span><span class="n">type</span>   <span class="n">__futex_abstimed_wait_cancelable64</span><span class="p">,</span> <span class="err">@</span><span class="n">function</span>
<span class="n">__futex_abstimed_wait_cancelable64</span><span class="o">:</span>
        <span class="n">l</span><span class="p">.</span><span class="n">addi</span>  <span class="n">r1</span><span class="p">,</span> <span class="n">r1</span><span class="p">,</span> <span class="o">-</span><span class="mi">20</span>
        <span class="n">l</span><span class="p">.</span><span class="n">sw</span>    <span class="mi">0</span><span class="p">(</span><span class="n">r1</span><span class="p">),</span> <span class="n">r16</span>
        <span class="n">l</span><span class="p">.</span><span class="n">sw</span>    <span class="mi">4</span><span class="p">(</span><span class="n">r1</span><span class="p">),</span> <span class="n">r18</span>
        <span class="n">l</span><span class="p">.</span><span class="n">sw</span>    <span class="mi">8</span><span class="p">(</span><span class="n">r1</span><span class="p">),</span> <span class="n">r20</span>
        <span class="n">l</span><span class="p">.</span><span class="n">sw</span>    <span class="mi">12</span><span class="p">(</span><span class="n">r1</span><span class="p">),</span> <span class="n">r22</span>
        <span class="n">l</span><span class="p">.</span><span class="n">sw</span>    <span class="mi">16</span><span class="p">(</span><span class="n">r1</span><span class="p">),</span> <span class="n">r9</span>
        <span class="n">l</span><span class="p">.</span><span class="n">or</span>    <span class="n">r22</span><span class="p">,</span> <span class="n">r3</span><span class="p">,</span> <span class="n">r3</span>
        <span class="n">l</span><span class="p">.</span><span class="n">or</span>    <span class="n">r20</span><span class="p">,</span> <span class="n">r4</span><span class="p">,</span> <span class="n">r4</span>
        <span class="n">l</span><span class="p">.</span><span class="n">or</span>    <span class="n">r16</span><span class="p">,</span> <span class="n">r6</span><span class="p">,</span> <span class="n">r6</span>
        <span class="n">l</span><span class="p">.</span><span class="n">sfnei</span> <span class="n">r6</span><span class="p">,</span> <span class="mi">0</span>
        <span class="n">l</span><span class="p">.</span><span class="n">ori</span>   <span class="n">r17</span><span class="p">,</span> <span class="n">r0</span><span class="p">,</span> <span class="mi">1</span>
        <span class="n">l</span><span class="p">.</span><span class="n">cmov</span>  <span class="n">r17</span><span class="p">,</span> <span class="n">r17</span><span class="p">,</span> <span class="n">r0</span>
        <span class="n">l</span><span class="p">.</span><span class="n">sfeqi</span> <span class="n">r17</span><span class="p">,</span> <span class="mi">0</span>
        <span class="n">l</span><span class="p">.</span><span class="n">bnf</span>   <span class="p">.</span><span class="n">L14</span>
         <span class="n">l</span><span class="p">.</span><span class="n">nop</span>
</code></pre></div></div>

<p>Looking closer at the build line of these 2 files I see the build of <code class="language-plaintext highlighter-rouge">futex-internal.c</code>
is missing <code class="language-plaintext highlighter-rouge">-fexceptions</code>.</p>

<p>This flag is needed to enable the <code class="language-plaintext highlighter-rouge">.eh_frame</code> section, which is what powers C++
exceptions.  The flag is needed when we are building C code which needs to
support C++ exceptions.</p>

<p>So why is it not enabled?  Is this a problem with the GLIBC build?</p>

<p>Looking at GLIBC, the <code class="language-plaintext highlighter-rouge">nptl/Makefile</code> sets <code class="language-plaintext highlighter-rouge">-fexceptions</code> explicitly for each
C file that needs it.  For example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># The following are cancellation points.  Some of the functions can
# block and therefore temporarily enable asynchronous cancellation.
# Those must be compiled asynchronous unwind tables.
CFLAGS-pthread_testcancel.c += -fexceptions
CFLAGS-pthread_join.c += -fexceptions -fasynchronous-unwind-tables
CFLAGS-pthread_timedjoin.c += -fexceptions -fasynchronous-unwind-tables
CFLAGS-pthread_clockjoin.c += -fexceptions -fasynchronous-unwind-tables
CFLAGS-pthread_once.c += $(uses-callbacks) -fexceptions \
                        -fasynchronous-unwind-tables
CFLAGS-pthread_cond_wait.c += -fexceptions -fasynchronous-unwind-tables
CFLAGS-sem_wait.c += -fexceptions -fasynchronous-unwind-tables
CFLAGS-sem_timedwait.c += -fexceptions -fasynchronous-unwind-tables
CFLAGS-sem_clockwait.c = -fexceptions -fasynchronous-unwind-tables
</code></pre></div></div>

<p>It is missing such a line for <code class="language-plaintext highlighter-rouge">futex-internal.c</code>.  The following patch and a
libpthread rebuild fixes the issue!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>--- a/nptl/Makefile
+++ b/nptl/Makefile
@@ -220,6 +220,7 @@ CFLAGS-pthread_cond_wait.c += -fexceptions -fasynchronous-unwind-tables
 CFLAGS-sem_wait.c += -fexceptions -fasynchronous-unwind-tables
 CFLAGS-sem_timedwait.c += -fexceptions -fasynchronous-unwind-tables
 CFLAGS-sem_clockwait.c = -fexceptions -fasynchronous-unwind-tables
+CFLAGS-futex-internal.c += -fexceptions -fasynchronous-unwind-tables

 # These are the function wrappers we have to duplicate here.
 CFLAGS-fcntl.c += -fexceptions -fasynchronous-unwind-tables
</code></pre></div></div>

<p>I <a href="http://sourceware-org.1504.n7.nabble.com/PATCH-nptl-Fix-issue-unwinding-through-sem-wait-futex-tt653730.html#a653833">submitted this patch</a>
to GLIBC but it turns out it was already <a href="https://sourceware.org/git/?p=glibc.git;a=commit;h=a04689ee7a2600a1466354096123c57ccd1e1dc7">fixed upstream</a>
a few weeks before. Doh.</p>

<h2 id="summary">Summary</h2>

<p>I hope the investigation into debugging this C++ exception test case proved interesting.
We can learn a lot about the deep internals of our tools when we have to fix bugs in them.
Like most elusive bugs, in the end this was a trivial fix but required some
key background knowledge.</p>

<h2 id="additional-reading">Additional Reading</h2>
<ul>
  <li><a href="https://itanium-cxx-abi.github.io/cxx-abi/">The IA64 Unwinder ABI</a> - The core unwinder APIs</li>
  <li><a href="https://fzn.fr/projects/frdwarf/dwarf-oopsla19-slides.pdf">Reliable DWARF Unwinding</a> - a good presentation</li>
  <li><a href="https://refspecs.linuxfoundation.org/LSB_3.0.0/LSB-PDA/LSB-PDA/ehframechpt.html">Exception Frames</a> - DWARF documentation</li>
</ul>]]></content><author><name>Stafford Horne</name></author><category term="software" /><category term="toolchain" /><category term="openrisc" /><summary type="html"><![CDATA[I have been working on porting GLIBC to the OpenRISC architecture. This has taken longer than I expected as with GLIBC upstreaming we must get every test to pass. This was different compared to GDB and GCC which were a bit more lenient.]]></summary></entry><entry><title type="html">How Relocations and Thread Local Store are Implemented</title><link href="http://stffrdhrn.github.io/software/toolchain/openrisc/2020/07/21/relocs_tls_impl.html" rel="alternate" type="text/html" title="How Relocations and Thread Local Store are Implemented" /><published>2020-07-21T06:47:00+01:00</published><updated>2020-07-21T06:47:00+01:00</updated><id>http://stffrdhrn.github.io/software/toolchain/openrisc/2020/07/21/relocs_tls_impl</id><content type="html" xml:base="http://stffrdhrn.github.io/software/toolchain/openrisc/2020/07/21/relocs_tls_impl.html"><![CDATA[<p>This is an ongoing series of posts on ELF Binary Relocations and Thread
Local Storage.  This article covers only Thread Local Storage and assumes
the reader has had a primer in ELF Relocations.  If not, please start with
my previous article <em>ELF Binaries and Relocation Entries</em>.</p>

<p>This is the third part in an illustrated 3 part series covering:</p>
<ul>
  <li><a href="/hardware/embedded/openrisc/2019/11/29/relocs.html">ELF Binaries and Relocation Entries</a></li>
  <li><a href="/hardware/embedded/openrisc/2020/01/19/tls.html">Thread Local Storage</a></li>
  <li>How Relocations and Thread Local Store are implemented</li>
</ul>

<p>In the last article we covered how Thread Local Storage (TLS) works at runtime,
but how do we get there?  How do the compiler and linker create the memory
structures and code fragments described in the previous article?</p>

<p>In this article we will discuss how TLS relocations are implemented.  Our
outline:</p>

<ul>
  <li><a href="#the-compiler">The Compiler</a>
    <ul>
      <li><a href="#gcc-legitimize-address">GCC Legitimize Address</a></li>
      <li><a href="#gcc-print-operand">GCC Print Operand</a></li>
    </ul>
  </li>
  <li><a href="#the-assembler">The Assembler</a></li>
  <li><a href="#the-linker">The Linker</a>
    <ul>
      <li><a href="#phase-1---book-keeping-check_relocs">Phase 1 - Book Keeping</a></li>
      <li><a href="#phase-2---creating-space-size_dynamic_sections--_bfd_elf_create_dynamic_sections">Phase 2 - Creating Space</a></li>
      <li><a href="#phase-3---linking-relocate_section">Phase 3 - Linking</a></li>
      <li><a href="#phase-4---finishing-up-finish_dynamic_symbol--finish_dynamic_sections">Phase 4 - Finishing Up</a></li>
    </ul>
  </li>
  <li><a href="#glibc-runtime-linker">GLIBC Runtime Linker</a>
    <ul>
      <li><a href="#handling-tls">Handling TLS</a></li>
    </ul>
  </li>
</ul>

<p>As before, the examples in this article can be found in my <a href="https://github.com/stffrdhrn/tls-examples">tls-examples</a>
project.  Please check it out.</p>

<h1 id="the-gnu-toolchain">The GNU toolchain</h1>

<p>I will assume here that most people understand what a compiler and assembler
basically do.  That is, the compiler will compile routines
written in C or a similar language into assembly language.  It is then up to the
assembler to turn that assembly code into machine code to run on a CPU.</p>

<p>That is a big part of what a toolchain does, and it’s pretty much that simple if
we have a single file of source code.  But usually we don’t have a single file,
we have multiple files, the <a href="https://en.wikipedia.org/wiki/Runtime_library">C runtime</a>,
<a href="https://en.wikipedia.org/wiki/Crt0">crt0</a> and other libraries like
<a href="https://en.wikipedia.org/wiki/C_standard_library">libc</a>.  These all need to be
put together into our final program, and that is where the complexities of the
linker come in.</p>

<p>In this article I will cover how variables in our source code (<a href="https://en.wikipedia.org/wiki/Symbol_table">symbols</a>) traverse
the toolchain from code to the memory in our final running program.  A picture that looks
something like this:</p>

<p><img src="/content/2020/relocs-gccf2b.png" alt="GCC and Linker" /></p>

<h2 id="the-compiler">The Compiler</h2>

<p>First we start off with how relocations are created and emitted in the compiler.</p>

<p>As I work primarily on the GNU <a href="https://elinux.org/Toolchains">toolchain</a> with
its GCC compiler, we will look at that.  Let’s get started.</p>

<h3 id="gcc-legitimize-address">GCC Legitimize Address</h3>

<p>To start we define a <strong>symbol</strong> as a named address in memory.  This address can be
a program variable where data is stored or a function reference to where a
subroutine starts.</p>

<p>In GCC we have <code class="language-plaintext highlighter-rouge">TARGET_LEGITIMIZE_ADDRESS</code>, the OpenRISC implementation
being <a href="https://github.com/gcc-mirror/gcc/blob/releases/gcc-10.1.0/gcc/config/or1k/or1k.c#L841">or1k_legitimize_address()</a>.
It takes a symbol (memory address) and makes it usable in our CPU by generating RTX
sequences that are possible on our CPU to load that address into a register.</p>

<p>RTX represents a tree node in GCC’s register transfer language (RTL).  The RTL
Expression is used to express our algorithm as a series of register transfers.
This is used as register transfer is basically what a CPU does.</p>

<p>A snippet from the <code class="language-plaintext highlighter-rouge">legitimize_address()</code> function is below.  The argument <code class="language-plaintext highlighter-rouge">x</code>
represents our input symbol (memory address) that we need to make usable by our
CPU.  This code uses GCC internal APIs to emit RTX code sequences.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">rtx</span>
<span class="n">or1k_legitimize_address</span> <span class="p">(</span><span class="n">rtx</span> <span class="n">x</span><span class="p">,</span> <span class="n">rtx</span> <span class="cm">/* unused */</span><span class="p">,</span> <span class="n">machine_mode</span> <span class="cm">/* unused */</span><span class="p">)</span>

    <span class="p">...</span>

	<span class="k">case</span> <span class="n">TLS_MODEL_NONE</span><span class="p">:</span>
	  <span class="n">t1</span> <span class="o">=</span> <span class="n">can_create_pseudo_p</span> <span class="p">()</span> <span class="o">?</span> <span class="n">gen_reg_rtx</span> <span class="p">(</span><span class="n">Pmode</span><span class="p">)</span> <span class="o">:</span> <span class="n">scratch</span><span class="p">;</span>
	  <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">flag_pic</span><span class="p">)</span>
	    <span class="p">{</span>
	      <span class="n">emit_insn</span> <span class="p">(</span><span class="n">gen_rtx_SET</span> <span class="p">(</span><span class="n">t1</span><span class="p">,</span> <span class="n">gen_rtx_HIGH</span> <span class="p">(</span><span class="n">Pmode</span><span class="p">,</span> <span class="n">x</span><span class="p">)));</span>
	      <span class="k">return</span> <span class="n">gen_rtx_LO_SUM</span> <span class="p">(</span><span class="n">Pmode</span><span class="p">,</span> <span class="n">t1</span><span class="p">,</span> <span class="n">x</span><span class="p">);</span>
	    <span class="p">}</span>
	  <span class="k">else</span> <span class="nf">if</span> <span class="p">(</span><span class="n">is_local</span><span class="p">)</span>
	    <span class="p">{</span>
	      <span class="n">crtl</span><span class="o">-&gt;</span><span class="n">uses_pic_offset_table</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
	      <span class="n">t2</span> <span class="o">=</span> <span class="n">gen_sym_unspec</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">UNSPEC_GOTOFF</span><span class="p">);</span>
	      <span class="n">emit_insn</span> <span class="p">(</span><span class="n">gen_rtx_SET</span> <span class="p">(</span><span class="n">t1</span><span class="p">,</span> <span class="n">gen_rtx_HIGH</span> <span class="p">(</span><span class="n">Pmode</span><span class="p">,</span> <span class="n">t2</span><span class="p">)));</span>
	      <span class="n">emit_insn</span> <span class="p">(</span><span class="n">gen_add3_insn</span> <span class="p">(</span><span class="n">t1</span><span class="p">,</span> <span class="n">t1</span><span class="p">,</span> <span class="n">pic_offset_table_rtx</span><span class="p">));</span>
	      <span class="k">return</span> <span class="n">gen_rtx_LO_SUM</span> <span class="p">(</span><span class="n">Pmode</span><span class="p">,</span> <span class="n">t1</span><span class="p">,</span> <span class="n">copy_rtx</span> <span class="p">(</span><span class="n">t2</span><span class="p">));</span>
	    <span class="p">}</span>
	  <span class="k">else</span>
	    <span class="p">{</span>
              <span class="p">...</span>
</code></pre></div></div>

<p>We can read the code snippet above as follows:</p>

<ul>
  <li>This is for the non <code class="language-plaintext highlighter-rouge">TLS</code> case as we see <code class="language-plaintext highlighter-rouge">TLS_MODEL_NONE</code>.</li>
  <li>We reserve a temporary register <code class="language-plaintext highlighter-rouge">t1</code>.</li>
  <li>If not using <a href="https://en.wikipedia.org/wiki/Position-independent_code">Position-independent code</a> (<code class="language-plaintext highlighter-rouge">flag_pic</code>) we do:
    <ul>
      <li>Emit an instruction to put the <strong>high</strong> bits of <code class="language-plaintext highlighter-rouge">x</code> into our temporary register <code class="language-plaintext highlighter-rouge">t1</code>.</li>
      <li>Return the sum of <code class="language-plaintext highlighter-rouge">t1</code> and the <strong>low</strong> bits of <code class="language-plaintext highlighter-rouge">x</code>.</li>
    </ul>
  </li>
  <li>Otherwise if the symbol is static (<code class="language-plaintext highlighter-rouge">is_local</code>) we do:
    <ul>
      <li>Record in the global state that this code uses the PIC offset table by setting <code class="language-plaintext highlighter-rouge">uses_pic_offset_table</code>.</li>
      <li>Create a <a href="https://en.wikipedia.org/wiki/Global_Offset_Table">Global Offset Table</a> offset variable <code class="language-plaintext highlighter-rouge">t2</code>.</li>
      <li>Emit an instruction to put the <strong>high</strong> bits of <code class="language-plaintext highlighter-rouge">t2</code> (the GOT offset) into our temporary register <code class="language-plaintext highlighter-rouge">t1</code>.</li>
      <li>Emit an instruction to put the sum of <code class="language-plaintext highlighter-rouge">t1</code> (<strong>high</strong> bits of <code class="language-plaintext highlighter-rouge">t2</code>) and the GOT into <code class="language-plaintext highlighter-rouge">t1</code>.</li>
      <li>Return the sum of <code class="language-plaintext highlighter-rouge">t1</code> and the <strong>low</strong> bits of <code class="language-plaintext highlighter-rouge">t2</code>.</li>
    </ul>
  </li>
</ul>

<p>You may have noticed that the <strong>local</strong> symbol still used the <strong>global</strong> offset
table (<a href="http://bottomupcs.sourceforge.net/csbu/x3824.htm">GOT</a>).  This is
because position-independent code requires using the GOT to reference symbols.</p>

<p>An example, from <a href="https://github.com/stffrdhrn/tls-examples/blob/master/nontls.c">nontls.c</a>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="n">x</span><span class="p">;</span>

<span class="kt">int</span> <span class="o">*</span><span class="nf">get_x_addr</span><span class="p">()</span> <span class="p">{</span>
   <span class="k">return</span> <span class="o">&amp;</span><span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>For the <em>non pic case</em> above, the assembly code generated by GCC
looks like the following:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>	.file	<span class="s2">"nontls.c"</span>
	.section	.text
	.local	x
	.comm	x,4,4
	.align 4
	.global	get_x_addr
	.type	get_x_addr, @function
get_x_addr:
	l.addi	r1, r1, <span class="nt">-8</span>       <span class="c"># \</span>
	l.sw	0<span class="o">(</span>r1<span class="o">)</span>, r2        <span class="c"># | function prologue</span>
	l.addi	r2, r1, 8        <span class="c"># |</span>
	l.sw	4<span class="o">(</span>r1<span class="o">)</span>, r9        <span class="c"># /</span>
	l.movhi	r17, ha<span class="o">(</span>x<span class="o">)</span>       <span class="c"># \__ legitimize address of x into r17</span>
	l.addi	r17, r17, lo<span class="o">(</span>x<span class="o">)</span>  <span class="c"># /</span>
	l.or	r11, r17, r17    <span class="c"># } place result in return register r11</span>
	l.lwz	r2, 0<span class="o">(</span>r1<span class="o">)</span>        <span class="c"># \</span>
	l.lwz	r9, 4<span class="o">(</span>r1<span class="o">)</span>        <span class="c"># | function epilogue</span>
	l.addi	r1, r1, 8        <span class="c"># |</span>
	l.jr	r9               <span class="c"># |</span>
	 l.nop                   <span class="c"># /</span>

	.size	get_x_addr, .-get_x_addr
	.ident	<span class="s2">"GCC: (GNU) 9.0.1 20190409 (experimental)"</span>
</code></pre></div></div>
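<p>The <code class="language-plaintext highlighter-rouge">ha(x)</code> and <code class="language-plaintext highlighter-rouge">lo(x)</code> pair above can be sketched in plain C.  This is a minimal sketch, assuming <code class="language-plaintext highlighter-rouge">ha()</code> is the usual "high adjusted" half, which adds <code class="language-plaintext highlighter-rouge">0x8000</code> before shifting to compensate for <code class="language-plaintext highlighter-rouge">l.addi</code> sign extending its 16-bit immediate.  The function names are hypothetical and are not part of GCC or binutils.</p>

```c
#include <stdint.h>

/* Hypothetical helper: high 16 bits of addr, adjusted so that adding the
   sign-extended low half gives back the original address.  */
static uint32_t reloc_ha(uint32_t addr)
{
    return (addr + 0x8000u) >> 16;
}

/* Hypothetical helper: low 16 bits of addr as the signed immediate that
   l.addi would add.  */
static int32_t reloc_lo(uint32_t addr)
{
    int32_t lo = (int32_t)(addr & 0xffffu);

    if (lo >= 0x8000)
        lo -= 0x10000;          /* sign extend the 16-bit value */
    return lo;
}

/* Mimics the two instruction sequence:
     l.movhi r17, ha(x)
     l.addi  r17, r17, lo(x)  */
static uint32_t rebuild_address(uint32_t addr)
{
    uint32_t r17 = reloc_ha(addr) << 16;    /* l.movhi */

    r17 += (uint32_t)reloc_lo(addr);        /* l.addi, wraps modulo 2^32 */
    return r17;
}
```

<p>For an address such as <code class="language-plaintext highlighter-rouge">0x12348000</code> the low half is negative, so the high half is bumped by one and the sum still lands on the original address.</p>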

<p>For the <em>local pic case</em> above, the same code compiled with the <code class="language-plaintext highlighter-rouge">-fPIC</code> GCC option
looks like the following:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>	.file	<span class="s2">"nontls.c"</span>
	.section	.text
	.local	x
	.comm	x,4,4
	.align 4
	.global	get_x_addr
	.type	get_x_addr, @function
get_x_addr:
	l.addi	r1, r1, <span class="nt">-8</span>      <span class="c"># \</span>
	l.sw	0<span class="o">(</span>r1<span class="o">)</span>, r2       <span class="c"># | function prologue</span>
	l.addi	r2, r1, 8       <span class="c"># |</span>
	l.sw	4<span class="o">(</span>r1<span class="o">)</span>, r9       <span class="c"># /</span>
	l.jal	8                                             <span class="c"># \</span>
	 l.movhi	r19, gotpchi<span class="o">(</span>_GLOBAL_OFFSET_TABLE_-4<span class="o">)</span> <span class="c"># | PC relative, put</span>
	l.ori	r19, r19, gotpclo<span class="o">(</span>_GLOBAL_OFFSET_TABLE_+0<span class="o">)</span>    <span class="c"># | GOT into r19</span>
	l.add	r19, r19, r9                                  <span class="c"># /</span>
	l.movhi	r17, gotoffha<span class="o">(</span>x<span class="o">)</span>        <span class="c"># \</span>
	l.add	r17, r17, r19           <span class="c"># | legitimize address of x into r17</span>
	l.addi	r17, r17, gotofflo<span class="o">(</span>x<span class="o">)</span>   <span class="c"># /</span>
	l.or	r11, r17, r17   <span class="c"># } place result in return register r11</span>
	l.lwz	r2, 0<span class="o">(</span>r1<span class="o">)</span>       <span class="c"># \</span>
	l.lwz	r9, 4<span class="o">(</span>r1<span class="o">)</span>       <span class="c"># | function epilogue</span>
	l.addi	r1, r1, 8       <span class="c"># |</span>
	l.jr	r9              <span class="c"># |</span>
	 l.nop                  <span class="c"># /</span>

	.size	get_x_addr, .-get_x_addr
	.ident	<span class="s2">"GCC: (GNU) 9.0.1 20190409 (experimental)"</span>
</code></pre></div></div>

<p>TLS and Addend cases are also handled by <code class="language-plaintext highlighter-rouge">or1k_legitimize_address()</code>.</p>

<h3 id="gcc-print-operand">GCC Print Operand</h3>

<p>Once RTX is generated by legitimize address and the <a href="/software/embedded/openrisc/2018/06/03/gcc_passes.html">GCC passes</a>
have run all of their optimizations, the RTX needs to be printed out as assembly code.  During
this process relocations are printed by the GCC macros <code class="language-plaintext highlighter-rouge">TARGET_PRINT_OPERAND_ADDRESS</code>
and <code class="language-plaintext highlighter-rouge">TARGET_PRINT_OPERAND</code>.  In OpenRISC these are defined
by <a href="https://github.com/gcc-mirror/gcc/blob/releases/gcc-10/gcc/config/or1k/or1k.c#L1139">or1k_print_operand_address()</a>
and <a href="https://github.com/gcc-mirror/gcc/blob/releases/gcc-10/gcc/config/or1k/or1k.c#L1193">or1k_print_operand()</a>.</p>

<p>Let us have a look at <code class="language-plaintext highlighter-rouge">or1k_print_operand_address()</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Worker for TARGET_PRINT_OPERAND_ADDRESS.
   Prints the argument ADDR, an address RTX, to the file FILE.  The output is
   formed as expected by the OpenRISC assembler.  Examples:
     RTX							      OUTPUT
     (reg:SI 3)							       0(r3)
     (plus:SI (reg:SI 3) (const_int 4))				     0x4(r3)
     (lo_sum:SI (reg:SI 3) (symbol_ref:SI ("x")))		   lo(x)(r3)  */</span>

<span class="k">static</span> <span class="kt">void</span>
<span class="nf">or1k_print_operand_address</span> <span class="p">(</span><span class="kt">FILE</span> <span class="o">*</span><span class="n">file</span><span class="p">,</span> <span class="n">machine_mode</span><span class="p">,</span> <span class="n">rtx</span> <span class="n">addr</span><span class="p">)</span>
<span class="p">{</span>
  <span class="n">rtx</span> <span class="n">offset</span><span class="p">;</span>

  <span class="k">switch</span> <span class="p">(</span><span class="n">GET_CODE</span> <span class="p">(</span><span class="n">addr</span><span class="p">))</span>
    <span class="p">{</span>
    <span class="k">case</span> <span class="n">REG</span><span class="p">:</span>
      <span class="n">fputc</span> <span class="p">(</span><span class="sc">'0'</span><span class="p">,</span> <span class="n">file</span><span class="p">);</span>
      <span class="k">break</span><span class="p">;</span>
    <span class="k">case</span> <span class="p">...</span>
    <span class="k">case</span> <span class="n">LO_SUM</span><span class="p">:</span>
      <span class="n">offset</span> <span class="o">=</span> <span class="n">XEXP</span> <span class="p">(</span><span class="n">addr</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
      <span class="n">addr</span> <span class="o">=</span> <span class="n">XEXP</span> <span class="p">(</span><span class="n">addr</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
      <span class="n">print_reloc</span> <span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="n">offset</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">RKIND_LO</span><span class="p">);</span>
      <span class="k">break</span><span class="p">;</span>
    <span class="nl">default:</span> <span class="p">...</span>
    <span class="err">}</span>

  <span class="n">fprintf</span> <span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="s">"(%s)"</span><span class="p">,</span> <span class="n">reg_names</span><span class="p">[</span><span class="n">REGNO</span> <span class="p">(</span><span class="n">addr</span><span class="p">)]);</span>
<span class="err">}</span>
</code></pre></div></div>

<p>The above code snippet can be read as we explain below, but let’s first
make some notes:</p>

<ul>
  <li>The input RTX <code class="language-plaintext highlighter-rouge">addr</code> for <code class="language-plaintext highlighter-rouge">TARGET_PRINT_OPERAND_ADDRESS</code> will usually contain
a register and an offset.  Typically this is used for <strong>LOAD</strong> and <strong>STORE</strong>
operations.</li>
  <li>Think of the RTX <code class="language-plaintext highlighter-rouge">addr</code> as a node in an <a href="https://en.wikipedia.org/wiki/Abstract_syntax_tree">AST</a>.</li>
  <li>RTX nodes with code <code class="language-plaintext highlighter-rouge">REG</code> or <code class="language-plaintext highlighter-rouge">SYMBOL_REF</code> are always leaf nodes.</li>
</ul>

<p>With that, if we take the examples from the C comment above <code class="language-plaintext highlighter-rouge">or1k_print_operand_address()</code>
as RTX <code class="language-plaintext highlighter-rouge">addr</code> inputs, we have:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    RTX     |    (reg:SI 3)          (lo_sum:SI (reg:SI 3) (symbol_ref:SI("x")))
 -----------+--------------------------------------------------------------------
    TREE    |
   (code)   |  (code:REG regno:3)                (code:LO_SUM)
   /    \   |                                      /        \
  (0)   (1) |                         (code:REG regno:3)  (code:SYMBOL_REF "x")
</code></pre></div></div>

<p>We can now read the above snippet as:</p>

<ul>
  <li><strong>First</strong> get the <code class="language-plaintext highlighter-rouge">CODE</code> of the RTX.
    <ul>
      <li>If <code class="language-plaintext highlighter-rouge">CODE</code> is <code class="language-plaintext highlighter-rouge">REG</code> (a register) then our offset can be <code class="language-plaintext highlighter-rouge">0</code>.</li>
      <li>If <code class="language-plaintext highlighter-rouge">CODE</code> is <code class="language-plaintext highlighter-rouge">LO_SUM</code> (an addition operation) then we need to break it down into:
        <ul>
          <li>Arg <code class="language-plaintext highlighter-rouge">0</code> is our new <code class="language-plaintext highlighter-rouge">addr</code> RTX (which we assume is a register)</li>
          <li>Arg <code class="language-plaintext highlighter-rouge">1</code> is an offset (which we then print with <a href="https://github.com/gcc-mirror/gcc/blob/releases/gcc-10/gcc/config/or1k/or1k.c#L1085">print_reloc()</a>)</li>
        </ul>
      </li>
    </ul>
  </li>
  <li><strong>Second</strong> print out the register name now in <code class="language-plaintext highlighter-rouge">addr</code> i.e. “r3”.</li>
</ul>
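<p>The walk described above can be sketched with a toy tree type.  This is only an illustration: <code class="language-plaintext highlighter-rouge">struct toy_rtx</code>, <code class="language-plaintext highlighter-rouge">enum toy_code</code> and <code class="language-plaintext highlighter-rouge">toy_print_operand_address()</code> are hypothetical stand-ins, not GCC's real <code class="language-plaintext highlighter-rouge">rtx</code> API, and the sketch writes to a buffer rather than a <code class="language-plaintext highlighter-rouge">FILE</code> so the result is easy to inspect.</p>

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for GCC's RTX codes.  */
enum toy_code { TOY_REG, TOY_SYMBOL_REF, TOY_LO_SUM };

/* Hypothetical, simplified stand-in for an RTX node.  */
struct toy_rtx {
    enum toy_code code;
    int regno;                      /* used when code is TOY_REG */
    const char *name;               /* used when code is TOY_SYMBOL_REF */
    const struct toy_rtx *op[2];    /* children, used when code is TOY_LO_SUM */
};

/* Format ADDR as an assembler operand into BUF.
   Returns 0 on success and -1 for shapes this sketch does not handle.  */
static int toy_print_operand_address(char *buf, size_t len,
                                     const struct toy_rtx *addr)
{
    const struct toy_rtx *offset;
    int n;

    switch (addr->code) {
    case TOY_REG:
        /* A bare register: the offset is 0.  */
        n = snprintf(buf, len, "0(r%d)", addr->regno);
        break;
    case TOY_LO_SUM:
        /* Arg 1 is the offset, arg 0 becomes the new addr (a register).  */
        offset = addr->op[1];
        addr = addr->op[0];
        if (addr->code != TOY_REG || offset->code != TOY_SYMBOL_REF)
            return -1;
        n = snprintf(buf, len, "lo(%s)(r%d)", offset->name, addr->regno);
        break;
    default:
        return -1;
    }
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

<p>Building the two trees from the diagram and passing them in gives the same shapes of output that the C comment lists, a zero offset for the bare register and a <code class="language-plaintext highlighter-rouge">lo()</code> relocation for the <code class="language-plaintext highlighter-rouge">LO_SUM</code>.</p>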

<p>The code of <code class="language-plaintext highlighter-rouge">or1k_print_operand()</code> is similar and the reader is encouraged to
read it for more details.  With that we can move on to the assembler.</p>

<p>TLS cases are also handled inside of the <code class="language-plaintext highlighter-rouge">print_reloc()</code> function.</p>

<h2 id="the-assembler">The Assembler</h2>

<p>In the GNU Toolchain our assembler is GAS, part of <a href="https://www.gnu.org/software/binutils/">binutils</a>.</p>

<p>The code that handles relocations is the function
<a href="https://github.com/bminor/binutils-gdb/blob/binutils-2_34/opcodes/or1k-asm.c#L226">parse_reloc()</a>
in <code class="language-plaintext highlighter-rouge">opcodes/or1k-asm.c</code>.  The function <code class="language-plaintext highlighter-rouge">parse_reloc()</code> is the direct counterpart of GCC’s <code class="language-plaintext highlighter-rouge">print_reloc()</code>
discussed above.  It is used as part of <a href="https://github.com/bminor/binutils-gdb/blob/binutils-2_34/opcodes/or1k-asm.c#L491">or1k_cgen_parse_operand()</a>
which is wired into our CGEN generated assembler to parse operands.</p>

<p>If we are parsing a relocation like the one from above <code class="language-plaintext highlighter-rouge">lo(x)</code> then we can
isolate the code that processes that relocation.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">const</span> <span class="n">bfd_reloc_code_real_type</span> <span class="n">or1k_imm16_relocs</span><span class="p">[][</span><span class="mi">6</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
  <span class="p">{</span> <span class="n">BFD_RELOC_LO16</span><span class="p">,</span>
    <span class="n">BFD_RELOC_OR1K_SLO16</span><span class="p">,</span>
<span class="p">...</span>
    <span class="n">BFD_RELOC_OR1K_TLS_LE_AHI16</span> <span class="p">},</span>
<span class="p">};</span>

<span class="k">static</span> <span class="kt">int</span>
<span class="nf">parse_reloc</span> <span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">**</span><span class="n">strp</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">str</span> <span class="o">=</span> <span class="o">*</span><span class="n">strp</span><span class="p">;</span>
    <span class="k">enum</span> <span class="n">or1k_rclass</span> <span class="n">cls</span> <span class="o">=</span> <span class="n">RCLASS_DIRECT</span><span class="p">;</span>
    <span class="k">enum</span> <span class="n">or1k_rtype</span> <span class="n">typ</span><span class="p">;</span>

    <span class="p">...</span>
    <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">strncasecmp</span> <span class="p">(</span><span class="n">str</span><span class="p">,</span> <span class="s">"lo("</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
      <span class="p">{</span>
	<span class="n">str</span> <span class="o">+=</span> <span class="mi">3</span><span class="p">;</span>
	<span class="n">typ</span> <span class="o">=</span> <span class="n">RTYPE_LO</span><span class="p">;</span>
      <span class="p">}</span>
    <span class="p">...</span>

    <span class="o">*</span><span class="n">strp</span> <span class="o">=</span> <span class="n">str</span><span class="p">;</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">cls</span> <span class="o">&lt;&lt;</span> <span class="n">RCLASS_SHIFT</span><span class="p">)</span> <span class="o">|</span> <span class="n">typ</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This uses <a href="https://man7.org/linux/man-pages/man3/strncasecmp.3.html">strncasecmp</a> to match
our <code class="language-plaintext highlighter-rouge">"lo("</code> string pattern.  The returned result packs a relocation class and a relocation type,
which are used to look up the relocation <code class="language-plaintext highlighter-rouge">BFD_RELOC_LO16</code> in the <code class="language-plaintext highlighter-rouge">or1k_imm16_relocs[][]</code> table
which is indexed by <em>relocation class</em> and <em>relocation type</em>.</p>
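<p>The pack and lookup steps can be sketched as below.  This is a minimal sketch under stated assumptions: every <code class="language-plaintext highlighter-rouge">toy_</code> and <code class="language-plaintext highlighter-rouge">TOY_</code> name, the shift width and the two entry table are hypothetical and much smaller than the real binutils definitions.  It also uses the case sensitive <code class="language-plaintext highlighter-rouge">strncmp</code> to stay within standard C.</p>

```c
#include <string.h>

/* Hypothetical, cut down versions of the class, type and relocation enums.  */
enum toy_rclass { TOY_RCLASS_DIRECT, TOY_RCLASS_COUNT };
enum toy_rtype  { TOY_RTYPE_LO, TOY_RTYPE_HI, TOY_RTYPE_COUNT };
enum toy_reloc  { TOY_RELOC_NONE, TOY_RELOC_LO16, TOY_RELOC_HI16 };

#define TOY_RCLASS_SHIFT 3      /* assumed: type lives in the low 3 bits */
#define TOY_RTYPE_MASK   7

/* Table indexed by [relocation class][relocation type].  */
static const enum toy_reloc toy_relocs[TOY_RCLASS_COUNT][TOY_RTYPE_COUNT] = {
    { TOY_RELOC_LO16, TOY_RELOC_HI16 },
};

/* Match a relocation prefix at *STRP, advance past it and return the
   packed class and type, or -1 if nothing matched.  */
static int toy_parse_reloc(const char **strp)
{
    const char *str = *strp;
    enum toy_rclass cls = TOY_RCLASS_DIRECT;
    enum toy_rtype typ;

    if (strncmp(str, "lo(", 3) == 0) {
        str += 3;
        typ = TOY_RTYPE_LO;
    } else if (strncmp(str, "hi(", 3) == 0) {
        str += 3;
        typ = TOY_RTYPE_HI;
    } else {
        return -1;
    }

    *strp = str;
    return ((int)cls << TOY_RCLASS_SHIFT) | (int)typ;
}

/* Unpack the class and type and use them to index the table.  */
static enum toy_reloc toy_lookup_reloc(int packed)
{
    int cls, typ;

    if (packed < 0)
        return TOY_RELOC_NONE;
    cls = packed >> TOY_RCLASS_SHIFT;
    typ = packed & TOY_RTYPE_MASK;
    if (cls >= TOY_RCLASS_COUNT || typ >= TOY_RTYPE_COUNT)
        return TOY_RELOC_NONE;
    return toy_relocs[cls][typ];
}
```

<p>Parsing <code class="language-plaintext highlighter-rouge">lo(x)</code> leaves the string pointer at <code class="language-plaintext highlighter-rouge">x)</code> and the packed value selects the low 16-bit relocation from the table.</p>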

<p>The assembler will encode that into the ELF binary.  For TLS relocations the exact same
pattern is used.</p>

<h2 id="the-linker">The Linker</h2>

<p>In the GNU Toolchain our object linker is the GNU linker LD, also part of the
<em>binutils</em> project.</p>

<p>The GNU linker uses the framework
<a href="https://sourceware.org/binutils/docs/bfd/">BFD</a> or <a href="https://en.wikipedia.org/wiki/Binary_File_Descriptor_library">Binary File
Descriptor</a> which
is a beast.  It is not only used in the linker but also used in GDB, the GNU
Simulator and the objdump tool.</p>

<p>What makes this possible is a rather complex API.</p>

<h3 id="bfd-linker-api">BFD Linker API</h3>

<p>The BFD API is a generic binary file access API.  It has been designed to support multiple
file formats and architectures via an object oriented, polymorphic API all written in C.  It supports file formats
including <a href="https://en.wikipedia.org/wiki/A.out">a.out</a>,
<a href="https://en.wikipedia.org/wiki/COFF">COFF</a> and
<a href="https://en.wikipedia.org/wiki/Executable_and_Linkable_Format">ELF</a> as well as
unexpected file formats like
<a href="https://binutils.sourceware.narkive.com/aoJ9J4Xk/patch-verilog-hex-memory-dump-backend-for-bfd">verilog hex memory dumps</a>.</p>

<p>Here we will concentrate on the BFD ELF implementation.</p>

<p>The API definition is split across multiple files which include:</p>

<ul>
  <li><a href="https://github.com/bminor/binutils-gdb/blob/binutils-2_34/bfd/bfd-in.h">bfd/bfd-in.h</a> - top level generic APIs including <code class="language-plaintext highlighter-rouge">bfd_hash_table</code></li>
  <li><a href="https://github.com/bminor/binutils-gdb/blob/binutils-2_34/bfd/bfd-in2.h">bfd/bfd-in2.h</a> - top level binary file APIs including <code class="language-plaintext highlighter-rouge">bfd</code> and <code class="language-plaintext highlighter-rouge">asection</code></li>
  <li><a href="https://github.com/bminor/binutils-gdb/blob/binutils-2_34/include/bfdlink.h">include/bfdlink.h</a> - generic bfd linker APIs including <code class="language-plaintext highlighter-rouge">bfd_link_info</code> and <code class="language-plaintext highlighter-rouge">bfd_link_hash_table</code></li>
  <li><a href="https://github.com/bminor/binutils-gdb/blob/binutils-2_34/bfd/bfd-elf.h">bfd/elf-bfd.h</a> - extensions to the APIs for ELF binaries including <code class="language-plaintext highlighter-rouge">elf_link_hash_table</code></li>
  <li><code class="language-plaintext highlighter-rouge">bfd/elf{wordsize}-{architecture}.c</code> - architecture specific implementations</li>
</ul>

<p>For each architecture implementations are defined in <code class="language-plaintext highlighter-rouge">bfd/elf{wordsize}-{architecture}.c</code>.  For
example for OpenRISC we have
<a href="https://github.com/bminor/binutils-gdb/blob/binutils-2_34/bfd/elf32-or1k.c">bfd/elf32-or1k.c</a>.</p>

<p>Throughout the linker code we see access to the BFD Linker and ELF APIs. Some key symbols to watch out for include:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">info</code> - A pointer to <code class="language-plaintext highlighter-rouge">bfd_link_info</code>, the top level reference to all linker
state.</li>
  <li><code class="language-plaintext highlighter-rouge">htab</code> - A pointer to <code class="language-plaintext highlighter-rouge">elf_or1k_link_hash_table</code> from <code class="language-plaintext highlighter-rouge">or1k_elf_hash_table (info)</code>, a hash
table on steroids which stores generic link state and arch specific state.  It is also a hash
table of all global symbols by name.  It contains:
    <ul>
      <li><code class="language-plaintext highlighter-rouge">htab-&gt;root.splt</code> - the output <code class="language-plaintext highlighter-rouge">.plt</code> section</li>
      <li><code class="language-plaintext highlighter-rouge">htab-&gt;root.sgot</code> -  the output <code class="language-plaintext highlighter-rouge">.got</code> section</li>
      <li><code class="language-plaintext highlighter-rouge">htab-&gt;root.srelgot</code> - the output <code class="language-plaintext highlighter-rouge">.rela.got</code> section (relocations against the got)</li>
      <li><code class="language-plaintext highlighter-rouge">htab-&gt;root.sgotplt</code> - the output <code class="language-plaintext highlighter-rouge">.got.plt</code> section</li>
      <li><code class="language-plaintext highlighter-rouge">htab-&gt;root.dynobj</code> - a special <code class="language-plaintext highlighter-rouge">bfd</code> to which sections are added (created in <code class="language-plaintext highlighter-rouge">or1k_elf_check_relocs</code>)</li>
    </ul>
  </li>
  <li><code class="language-plaintext highlighter-rouge">sym_hashes</code> - From <code class="language-plaintext highlighter-rouge">elf_sym_hashes (abfd)</code>, a list of hash entries for the global symbols
in a <code class="language-plaintext highlighter-rouge">bfd</code>, indexed by the relocation symbol index <code class="language-plaintext highlighter-rouge">ELF32_R_SYM (rel-&gt;r_info)</code>.</li>
  <li><code class="language-plaintext highlighter-rouge">h</code> - A pointer to a <code class="language-plaintext highlighter-rouge">struct elf_link_hash_entry</code>, represents link state
of a global symbol, contains:
    <ul>
      <li><code class="language-plaintext highlighter-rouge">h-&gt;got</code> - A union of different attributes with different roles based on link phase.</li>
      <li><code class="language-plaintext highlighter-rouge">h-&gt;got.refcount</code> - used during phase 1 to count the symbol <code class="language-plaintext highlighter-rouge">.got</code> section references</li>
      <li><code class="language-plaintext highlighter-rouge">h-&gt;got.offset</code> - used during phase 2 to record the symbol <code class="language-plaintext highlighter-rouge">.got</code> section offset</li>
      <li><code class="language-plaintext highlighter-rouge">h-&gt;plt</code> - A union with the same function as <code class="language-plaintext highlighter-rouge">h-&gt;got</code> but used for the <code class="language-plaintext highlighter-rouge">.plt</code> section.</li>
      <li><code class="language-plaintext highlighter-rouge">h-&gt;root.root.string</code> - The symbol name</li>
    </ul>
  </li>
  <li><code class="language-plaintext highlighter-rouge">local_got</code> - An array of <code class="language-plaintext highlighter-rouge">unsigned long</code> from <code class="language-plaintext highlighter-rouge">elf_local_got_refcounts (ibfd)</code> with the same
function as <code class="language-plaintext highlighter-rouge">h-&gt;got</code> but for local symbols.  The meaning of each <code class="language-plaintext highlighter-rouge">unsigned long</code> changes based
on the link phase.  Ideally this should also be a union.</li>
  <li><code class="language-plaintext highlighter-rouge">tls_type</code> - Retrieved by <code class="language-plaintext highlighter-rouge">((struct elf_or1k_link_hash_entry *) h)-&gt;tls_type</code> used to store the
<code class="language-plaintext highlighter-rouge">tls_type</code> of a global symbol.</li>
  <li><code class="language-plaintext highlighter-rouge">local_tls_type</code> - Retrieved by <code class="language-plaintext highlighter-rouge">elf_or1k_local_tls_type(abfd)</code> entry to store <code class="language-plaintext highlighter-rouge">tls_type</code> for local
symbols, when <code class="language-plaintext highlighter-rouge">h</code> is <code class="language-plaintext highlighter-rouge">NULL</code>.</li>
  <li><code class="language-plaintext highlighter-rouge">root</code> - The struct field <code class="language-plaintext highlighter-rouge">root</code> is used in subclasses to represent the parent class, similar to how <code class="language-plaintext highlighter-rouge">super</code> is used
  in other languages.</li>
</ul>
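<p>The <code class="language-plaintext highlighter-rouge">root</code> convention can be sketched with a few nested structs.  This is a minimal sketch, assuming the usual C idiom of placing the parent struct as the first member so a pointer to the child can be converted to a pointer to the parent and back.  The <code class="language-plaintext highlighter-rouge">toy_</code> types are hypothetical and carry far fewer fields than the real BFD ones.</p>

```c
#include <stddef.h>

/* Hypothetical base class: a generic hash entry keyed by name.  */
struct toy_hash_entry {
    const char *string;
};

/* Hypothetical generic linker entry, "extends" toy_hash_entry.  */
struct toy_link_entry {
    struct toy_hash_entry root;     /* parent must be the first member */
    int type;
};

/* Hypothetical ELF entry, "extends" toy_link_entry.  */
struct toy_elf_entry {
    struct toy_link_entry root;
    long got_refcount;
};

/* Hypothetical architecture entry, "extends" toy_elf_entry.  */
struct toy_or1k_entry {
    struct toy_elf_entry root;
    unsigned char tls_type;
};

/* Walk up two parents to reach the symbol name, like h->root.root.string.  */
static const char *toy_symbol_name(const struct toy_elf_entry *h)
{
    return h->root.root.string;
}

/* Downcast to the architecture specific subclass.  This is only valid
   when H really is the first member of a struct toy_or1k_entry.  */
static unsigned char toy_tls_type(const struct toy_elf_entry *h)
{
    return ((const struct toy_or1k_entry *) h)->tls_type;
}
```

<p>The downcast in <code class="language-plaintext highlighter-rouge">toy_tls_type()</code> has the same shape as the <code class="language-plaintext highlighter-rouge">((struct elf_or1k_link_hash_entry *) h)-&gt;tls_type</code> expression listed above.</p>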

<p>Putting it all together we have a diagram like the following:</p>

<p><img src="/content/2020/relocs-bfddatastructs.png" alt="The BFD API" /></p>

<p>Now that we have a bit of understanding of the <a href="https://lwn.net/Articles/193245/">data structures</a>
we can look to the link algorithm.</p>

<p>The link process in the GNU Linker can be thought of in phases.</p>

<h3 id="phase-1---book-keeping-check_relocs">Phase 1 - Book Keeping (check_relocs)</h3>

<p>The <code class="language-plaintext highlighter-rouge">or1k_elf_check_relocs()</code> function is called during the first phase to
do book keeping on relocations.  The function signature looks like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">bfd_boolean</span>
<span class="n">or1k_elf_check_relocs</span> <span class="p">(</span><span class="n">bfd</span> <span class="o">*</span><span class="n">abfd</span><span class="p">,</span>
                       <span class="k">struct</span> <span class="n">bfd_link_info</span> <span class="o">*</span><span class="n">info</span><span class="p">,</span>
                       <span class="n">asection</span> <span class="o">*</span><span class="n">sec</span><span class="p">,</span>
                       <span class="k">const</span> <span class="n">Elf_Internal_Rela</span> <span class="o">*</span><span class="n">relocs</span><span class="p">)</span>

<span class="cp">#define elf_backend_check_relocs        or1k_elf_check_relocs
</span></code></pre></div></div>

<p>The arguments being:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">abfd</code>   - The current elf object file we are working on</li>
  <li><code class="language-plaintext highlighter-rouge">info</code>   - The linker state, our entry point into the BFD linker API</li>
  <li><code class="language-plaintext highlighter-rouge">sec</code>    - The current elf section we are working on</li>
  <li><code class="language-plaintext highlighter-rouge">relocs</code> - The relocations from the current section</li>
</ul>

<p>It does the book keeping by looping over relocations for the provided section
and updating the local and global symbol properties.</p>

<p>For local symbols:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>      <span class="p">...</span>
      <span class="k">else</span>
	<span class="p">{</span>
	  <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">local_tls_type</span><span class="p">;</span>

	  <span class="cm">/* This is a TLS type record for a local symbol.  */</span>
	  <span class="n">local_tls_type</span> <span class="o">=</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="p">)</span> <span class="n">elf_or1k_local_tls_type</span> <span class="p">(</span><span class="n">abfd</span><span class="p">);</span>
	  <span class="k">if</span> <span class="p">(</span><span class="n">local_tls_type</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
	    <span class="p">{</span>
	      <span class="n">bfd_size_type</span> <span class="n">size</span><span class="p">;</span>

	      <span class="n">size</span> <span class="o">=</span> <span class="n">symtab_hdr</span><span class="o">-&gt;</span><span class="n">sh_info</span><span class="p">;</span>
	      <span class="n">local_tls_type</span> <span class="o">=</span> <span class="n">bfd_zalloc</span> <span class="p">(</span><span class="n">abfd</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span>
	      <span class="k">if</span> <span class="p">(</span><span class="n">local_tls_type</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
		<span class="k">return</span> <span class="n">FALSE</span><span class="p">;</span>
	      <span class="n">elf_or1k_local_tls_type</span> <span class="p">(</span><span class="n">abfd</span><span class="p">)</span> <span class="o">=</span> <span class="n">local_tls_type</span><span class="p">;</span>
	    <span class="p">}</span>
	  <span class="n">local_tls_type</span><span class="p">[</span><span class="n">r_symndx</span><span class="p">]</span> <span class="o">|=</span> <span class="n">tls_type</span><span class="p">;</span>
	<span class="p">}</span>

	      <span class="p">...</span>
	      <span class="k">else</span>
		<span class="p">{</span>
		  <span class="n">bfd_signed_vma</span> <span class="o">*</span><span class="n">local_got_refcounts</span><span class="p">;</span>

		  <span class="cm">/* This is a global offset table entry for a local symbol.  */</span>
		  <span class="n">local_got_refcounts</span> <span class="o">=</span> <span class="n">elf_local_got_refcounts</span> <span class="p">(</span><span class="n">abfd</span><span class="p">);</span>
		  <span class="k">if</span> <span class="p">(</span><span class="n">local_got_refcounts</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
		    <span class="p">{</span>
		      <span class="n">bfd_size_type</span> <span class="n">size</span><span class="p">;</span>

		      <span class="n">size</span> <span class="o">=</span> <span class="n">symtab_hdr</span><span class="o">-&gt;</span><span class="n">sh_info</span><span class="p">;</span>
		      <span class="n">size</span> <span class="o">*=</span> <span class="k">sizeof</span> <span class="p">(</span><span class="n">bfd_signed_vma</span><span class="p">);</span>
		      <span class="n">local_got_refcounts</span> <span class="o">=</span> <span class="n">bfd_zalloc</span> <span class="p">(</span><span class="n">abfd</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span>
		      <span class="k">if</span> <span class="p">(</span><span class="n">local_got_refcounts</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
			<span class="k">return</span> <span class="n">FALSE</span><span class="p">;</span>
		      <span class="n">elf_local_got_refcounts</span> <span class="p">(</span><span class="n">abfd</span><span class="p">)</span> <span class="o">=</span> <span class="n">local_got_refcounts</span><span class="p">;</span>
		    <span class="p">}</span>
		  <span class="n">local_got_refcounts</span><span class="p">[</span><span class="n">r_symndx</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
		<span class="p">}</span>
</code></pre></div></div>

<p>The above is pretty straightforward and can be read as:</p>

<ul>
  <li>First part is for storing local symbol <code class="language-plaintext highlighter-rouge">TLS</code> type information:
    <ul>
      <li>If the <code class="language-plaintext highlighter-rouge">local_tls_type</code> array is not initialized:
        <ul>
          <li>Allocate it, 1 entry for each local symbol</li>
        </ul>
      </li>
      <li>Record the TLS type in <code class="language-plaintext highlighter-rouge">local_tls_type</code> for the current symbol</li>
    </ul>
  </li>
  <li>Second part is for recording <code class="language-plaintext highlighter-rouge">.got</code> section references:
    <ul>
      <li>If the <code class="language-plaintext highlighter-rouge">local_got_refcounts</code> array is not initialized:
        <ul>
          <li>Allocate it, 1 entry for each local symbol</li>
        </ul>
      </li>
      <li>Record a reference by incrementing <code class="language-plaintext highlighter-rouge">local_got_refcounts</code> for the current symbol</li>
    </ul>
  </li>
</ul>
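<p>The "allocate on first use, then count" pattern described above can be sketched in plain C.  This is a minimal illustration and not the BFD code: <code class="language-plaintext highlighter-rouge">struct input_file</code> and <code class="language-plaintext highlighter-rouge">record_local_got_ref()</code> are hypothetical names, and <code class="language-plaintext highlighter-rouge">calloc()</code> stands in for <code class="language-plaintext highlighter-rouge">bfd_zalloc()</code>.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-input-file bookkeeping; the real
   code keeps this array in the ELF target data of the bfd.  */
struct input_file
{
  size_t local_sym_count;     /* plays the role of symtab_hdr->sh_info */
  long *local_got_refcounts;  /* NULL until the first .got reference */
};

/* Record one .got reference against local symbol SYMNDX.
   Returns 0 on success, -1 if the allocation fails.  */
static int
record_local_got_ref (struct input_file *f, size_t symndx)
{
  assert (symndx < f->local_sym_count);

  if (f->local_got_refcounts == NULL)
    {
      /* Zero-filled array with one entry per local symbol.  */
      f->local_got_refcounts = calloc (f->local_sym_count,
                                       sizeof *f->local_got_refcounts);
      if (f->local_got_refcounts == NULL)
        return -1;
    }

  f->local_got_refcounts[symndx] += 1;
  return 0;
}
```

<p>Because the array starts out zeroed, any local symbol that is never referenced through the <code class="language-plaintext highlighter-rouge">.got</code> keeps a count of <code class="language-plaintext highlighter-rouge">0</code>, which is what phase 2 checks for.</p>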

<p>For global symbols it’s much easier. We see:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>      <span class="p">...</span>
      <span class="k">if</span> <span class="p">(</span><span class="n">h</span> <span class="o">!=</span> <span class="nb">NULL</span><span class="p">)</span>
	  <span class="p">((</span><span class="k">struct</span> <span class="n">elf_or1k_link_hash_entry</span> <span class="o">*</span><span class="p">)</span> <span class="n">h</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">tls_type</span> <span class="o">|=</span> <span class="n">tls_type</span><span class="p">;</span>
      <span class="k">else</span>
      <span class="p">...</span>

	      <span class="k">if</span> <span class="p">(</span><span class="n">h</span> <span class="o">!=</span> <span class="nb">NULL</span><span class="p">)</span>
		<span class="n">h</span><span class="o">-&gt;</span><span class="n">got</span><span class="p">.</span><span class="n">refcount</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
	      <span class="k">else</span>
		<span class="p">...</span>
</code></pre></div></div>

<p>As the <code class="language-plaintext highlighter-rouge">tls_type</code> and <code class="language-plaintext highlighter-rouge">got.refcount</code> fields are available directly on each
<code class="language-plaintext highlighter-rouge">hash_entry</code>, handling global symbols is much easier.</p>

<ul>
  <li>First part is for storing <code class="language-plaintext highlighter-rouge">TLS</code> type information:
    <ul>
      <li>Record the TLS type in <code class="language-plaintext highlighter-rouge">tls_type</code> for the current <code class="language-plaintext highlighter-rouge">hash_entry</code></li>
    </ul>
  </li>
  <li>Second part is for recording <code class="language-plaintext highlighter-rouge">.got</code> section references:
    <ul>
      <li>Record a reference by incrementing <code class="language-plaintext highlighter-rouge">got.refcount</code> for the <code class="language-plaintext highlighter-rouge">hash_entry</code></li>
    </ul>
  </li>
</ul>

<p>The above is repeated for all relocations and all input sections.  A few other
things are also done including accounting for <code class="language-plaintext highlighter-rouge">.plt</code> entries.</p>

<h3 id="phase-2---creating-space-size_dynamic_sections--_bfd_elf_create_dynamic_sections">Phase 2 - creating space (size_dynamic_sections + _bfd_elf_create_dynamic_sections)</h3>

<p>The <a href="https://github.com/bminor/binutils-gdb/blob/binutils-2_34/bfd/elf32-or1k.c#L2870">or1k_elf_size_dynamic_sections()</a>
function iterates over all input object files to calculate the size required for
output sections.  The <code class="language-plaintext highlighter-rouge">_bfd_elf_create_dynamic_sections()</code> function does the
actual section allocation; for this we use the generic version.</p>

<p>Setting up the sizes of the <code class="language-plaintext highlighter-rouge">.got</code> section (global offset table) and <code class="language-plaintext highlighter-rouge">.plt</code>
section (procedure linkage table) is done here.</p>

<p>The definition is as below:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">bfd_boolean</span>
<span class="n">or1k_elf_size_dynamic_sections</span> <span class="p">(</span><span class="n">bfd</span> <span class="o">*</span><span class="n">output_bfd</span> <span class="n">ATTRIBUTE_UNUSED</span><span class="p">,</span>
                                <span class="k">struct</span> <span class="n">bfd_link_info</span> <span class="o">*</span><span class="n">info</span><span class="p">)</span>

<span class="cp">#define elf_backend_size_dynamic_sections       or1k_elf_size_dynamic_sections
#define elf_backend_create_dynamic_sections     _bfd_elf_create_dynamic_sections
</span></code></pre></div></div>

<p>The arguments to <code class="language-plaintext highlighter-rouge">or1k_elf_size_dynamic_sections()</code> being:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">output_bfd</code> - <strong>Unused</strong>, the output elf object</li>
  <li><code class="language-plaintext highlighter-rouge">info</code> - the BFD API which provides access to everything we need</li>
</ul>

<p>Internally the function uses:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">htab</code> - from <code class="language-plaintext highlighter-rouge">or1k_elf_hash_table (info)</code>
    <ul>
      <li><code class="language-plaintext highlighter-rouge">htab-&gt;root.dynamic_sections_created</code> - <code class="language-plaintext highlighter-rouge">true</code> if sections like <code class="language-plaintext highlighter-rouge">.interp</code> have been created by the linker</li>
    </ul>
  </li>
  <li><code class="language-plaintext highlighter-rouge">ibfd</code> - a <code class="language-plaintext highlighter-rouge">bfd</code> pointer from <code class="language-plaintext highlighter-rouge">info-&gt;input_bfds</code>, represents an input object when iterating.</li>
  <li><code class="language-plaintext highlighter-rouge">s-&gt;size</code> - represents the output <code class="language-plaintext highlighter-rouge">.got</code> section size, which we will be
incrementing.</li>
  <li><code class="language-plaintext highlighter-rouge">srel-&gt;size</code> - represents the output <code class="language-plaintext highlighter-rouge">.got.rela</code> section size, which will
contain relocations against the <code class="language-plaintext highlighter-rouge">.got</code> section</li>
</ul>

<p>During the first part of phase 2 we set <code class="language-plaintext highlighter-rouge">.got</code> and <code class="language-plaintext highlighter-rouge">.rela.got</code> section sizes
for <strong>local</strong> symbols with this code:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="cm">/* Set up .got offsets for local syms, and space for local dynamic
     relocs.  */</span>
  <span class="k">for</span> <span class="p">(</span><span class="n">ibfd</span> <span class="o">=</span> <span class="n">info</span><span class="o">-&gt;</span><span class="n">input_bfds</span><span class="p">;</span> <span class="n">ibfd</span> <span class="o">!=</span> <span class="nb">NULL</span><span class="p">;</span> <span class="n">ibfd</span> <span class="o">=</span> <span class="n">ibfd</span><span class="o">-&gt;</span><span class="n">link</span><span class="p">.</span><span class="n">next</span><span class="p">)</span>
    <span class="p">{</span>
      <span class="p">...</span>
      <span class="n">local_got</span> <span class="o">=</span> <span class="n">elf_local_got_refcounts</span> <span class="p">(</span><span class="n">ibfd</span><span class="p">);</span>
      <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">local_got</span><span class="p">)</span>
	<span class="k">continue</span><span class="p">;</span>

      <span class="n">symtab_hdr</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">elf_tdata</span> <span class="p">(</span><span class="n">ibfd</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">symtab_hdr</span><span class="p">;</span>
      <span class="n">locsymcount</span> <span class="o">=</span> <span class="n">symtab_hdr</span><span class="o">-&gt;</span><span class="n">sh_info</span><span class="p">;</span>
      <span class="n">end_local_got</span> <span class="o">=</span> <span class="n">local_got</span> <span class="o">+</span> <span class="n">locsymcount</span><span class="p">;</span>
      <span class="n">s</span> <span class="o">=</span> <span class="n">htab</span><span class="o">-&gt;</span><span class="n">root</span><span class="p">.</span><span class="n">sgot</span><span class="p">;</span>
      <span class="n">srel</span> <span class="o">=</span> <span class="n">htab</span><span class="o">-&gt;</span><span class="n">root</span><span class="p">.</span><span class="n">srelgot</span><span class="p">;</span>
      <span class="n">local_tls_type</span> <span class="o">=</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="p">)</span> <span class="n">elf_or1k_local_tls_type</span> <span class="p">(</span><span class="n">ibfd</span><span class="p">);</span>
      <span class="k">for</span> <span class="p">(;</span> <span class="n">local_got</span> <span class="o">&lt;</span> <span class="n">end_local_got</span><span class="p">;</span> <span class="o">++</span><span class="n">local_got</span><span class="p">)</span>
	<span class="p">{</span>
	  <span class="k">if</span> <span class="p">(</span><span class="o">*</span><span class="n">local_got</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span>
	    <span class="p">{</span>
	      <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">tls_type</span> <span class="o">=</span> <span class="p">(</span><span class="n">local_tls_type</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
					<span class="o">?</span> <span class="n">TLS_UNKNOWN</span>
					<span class="o">:</span> <span class="o">*</span><span class="n">local_tls_type</span><span class="p">;</span>

	      <span class="o">*</span><span class="n">local_got</span> <span class="o">=</span> <span class="n">s</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">;</span>
	      <span class="n">or1k_set_got_and_rela_sizes</span> <span class="p">(</span><span class="n">tls_type</span><span class="p">,</span> <span class="n">bfd_link_pic</span> <span class="p">(</span><span class="n">info</span><span class="p">),</span>
					   <span class="o">&amp;</span><span class="n">s</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">srel</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">);</span>
	    <span class="p">}</span>
	  <span class="k">else</span>
	    <span class="o">*</span><span class="n">local_got</span> <span class="o">=</span> <span class="p">(</span><span class="n">bfd_vma</span><span class="p">)</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>

	  <span class="k">if</span> <span class="p">(</span><span class="n">local_tls_type</span><span class="p">)</span>
	    <span class="o">++</span><span class="n">local_tls_type</span><span class="p">;</span>
	<span class="p">}</span>
    <span class="p">}</span>

</code></pre></div></div>

<p>Here we iterate over each input elf object <code class="language-plaintext highlighter-rouge">ibfd</code> and, for
each local symbol (<code class="language-plaintext highlighter-rouge">local_got</code>), update <code class="language-plaintext highlighter-rouge">s-&gt;size</code> and <code class="language-plaintext highlighter-rouge">srel-&gt;size</code> to
account for the required size.</p>

<p>The above can be read as:</p>

<ul>
  <li>For each <code class="language-plaintext highlighter-rouge">local_got</code> entry:
    <ul>
      <li>If the local symbol is used in the <code class="language-plaintext highlighter-rouge">.got</code> section:
        <ul>
          <li>Get the <code class="language-plaintext highlighter-rouge">tls_type</code> byte stored in the <code class="language-plaintext highlighter-rouge">local_tls_type</code> array</li>
          <li>Set the offset <code class="language-plaintext highlighter-rouge">local_got</code> to the section offset <code class="language-plaintext highlighter-rouge">s-&gt;size</code>, which is used
in phase 3 to tell us where we need to write the symbol into the <code class="language-plaintext highlighter-rouge">.got</code>
section.</li>
          <li>Update <code class="language-plaintext highlighter-rouge">s-&gt;size</code> and <code class="language-plaintext highlighter-rouge">srel-&gt;size</code> using <code class="language-plaintext highlighter-rouge">or1k_set_got_and_rela_sizes()</code></li>
        </ul>
      </li>
      <li>If the local symbol is not used in the <code class="language-plaintext highlighter-rouge">.got</code> section:
        <ul>
          <li>Set the offset <code class="language-plaintext highlighter-rouge">local_got</code> to <code class="language-plaintext highlighter-rouge">-1</code>, to indicate not used</li>
        </ul>
      </li>
    </ul>
  </li>
</ul>
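<p>The reuse of the reference count array as an offset array can be shown with a small standalone sketch.  It is a simplification that assumes every entry takes one 4-byte <code class="language-plaintext highlighter-rouge">.got</code> word and ignores TLS and relocation sizes; <code class="language-plaintext highlighter-rouge">assign_local_got_offsets()</code>, <code class="language-plaintext highlighter-rouge">GOT_ENTRY_SIZE</code> and <code class="language-plaintext highlighter-rouge">GOT_UNUSED</code> are hypothetical names.</p>

```c
#include <assert.h>
#include <stddef.h>

#define GOT_ENTRY_SIZE 4          /* assumed: one 32-bit word per entry */
#define GOT_UNUSED ((long) -1)    /* marker for "no .got entry" */

/* Overwrite each reference count in SLOTS with a .got offset, or with
   GOT_UNUSED when the count is zero.  GOT_SIZE is the section size on
   entry; the updated section size is returned.  */
static long
assign_local_got_offsets (long *slots, size_t count, long got_size)
{
  assert (slots != NULL || count == 0);

  for (size_t i = 0; i < count; i++)
    {
      if (slots[i] > 0)
        {
          slots[i] = got_size;         /* where phase 3 will write */
          got_size += GOT_ENTRY_SIZE;  /* grow the section */
        }
      else
        slots[i] = GOT_UNUSED;
    }

  return got_size;
}
```

<p>After this pass the array no longer holds counts.  Each slot is either a byte offset into the <code class="language-plaintext highlighter-rouge">.got</code> section or the unused marker.</p>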

<p>In the next part of phase 2 we allocate space for all <strong>global</strong> symbols by
iterating through symbols in <code class="language-plaintext highlighter-rouge">htab</code> with the <code class="language-plaintext highlighter-rouge">allocate_dynrelocs</code> iterator.  To
do that we call:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="n">elf_link_hash_traverse</span> <span class="p">(</span><span class="o">&amp;</span><span class="n">htab</span><span class="o">-&gt;</span><span class="n">root</span><span class="p">,</span> <span class="n">allocate_dynrelocs</span><span class="p">,</span> <span class="n">info</span><span class="p">);</span>
</code></pre></div></div>

<p>Inside <code class="language-plaintext highlighter-rouge">allocate_dynrelocs()</code> we record the space used for relocations and
the <code class="language-plaintext highlighter-rouge">.got</code> and <code class="language-plaintext highlighter-rouge">.plt</code> sections.  Example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="k">if</span> <span class="p">(</span><span class="n">h</span><span class="o">-&gt;</span><span class="n">got</span><span class="p">.</span><span class="n">refcount</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span>
    <span class="p">{</span>
      <span class="n">asection</span> <span class="o">*</span><span class="n">sgot</span><span class="p">;</span>
      <span class="n">bfd_boolean</span> <span class="n">dyn</span><span class="p">;</span>
      <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">tls_type</span><span class="p">;</span>

      <span class="p">...</span>
      <span class="n">sgot</span> <span class="o">=</span> <span class="n">htab</span><span class="o">-&gt;</span><span class="n">root</span><span class="p">.</span><span class="n">sgot</span><span class="p">;</span>

      <span class="n">h</span><span class="o">-&gt;</span><span class="n">got</span><span class="p">.</span><span class="n">offset</span> <span class="o">=</span> <span class="n">sgot</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">;</span>

      <span class="n">tls_type</span> <span class="o">=</span> <span class="p">((</span><span class="k">struct</span> <span class="n">elf_or1k_link_hash_entry</span> <span class="o">*</span><span class="p">)</span> <span class="n">h</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">tls_type</span><span class="p">;</span>

      <span class="n">dyn</span> <span class="o">=</span> <span class="n">htab</span><span class="o">-&gt;</span><span class="n">root</span><span class="p">.</span><span class="n">dynamic_sections_created</span><span class="p">;</span>
      <span class="n">dyn</span> <span class="o">=</span> <span class="n">WILL_CALL_FINISH_DYNAMIC_SYMBOL</span> <span class="p">(</span><span class="n">dyn</span><span class="p">,</span> <span class="n">bfd_link_pic</span> <span class="p">(</span><span class="n">info</span><span class="p">),</span> <span class="n">h</span><span class="p">);</span>
      <span class="n">or1k_set_got_and_rela_sizes</span> <span class="p">(</span><span class="n">tls_type</span><span class="p">,</span> <span class="n">dyn</span><span class="p">,</span>
				   <span class="o">&amp;</span><span class="n">sgot</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">htab</span><span class="o">-&gt;</span><span class="n">root</span><span class="p">.</span><span class="n">srelgot</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">);</span>
    <span class="p">}</span>
  <span class="k">else</span>
    <span class="n">h</span><span class="o">-&gt;</span><span class="n">got</span><span class="p">.</span><span class="n">offset</span> <span class="o">=</span> <span class="p">(</span><span class="n">bfd_vma</span><span class="p">)</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
</code></pre></div></div>

<p>The above, with <code class="language-plaintext highlighter-rouge">h</code> being our global symbol, a pointer to <code class="language-plaintext highlighter-rouge">struct elf_link_hash_entry</code>,
can be read as:</p>

<ul>
  <li>If the symbol will be in the <code class="language-plaintext highlighter-rouge">.got</code> section:
    <ul>
      <li>Get the global reference to the <code class="language-plaintext highlighter-rouge">.got</code> section and put it in <code class="language-plaintext highlighter-rouge">sgot</code></li>
      <li>Set the got location <code class="language-plaintext highlighter-rouge">h-&gt;got.offset</code> for the symbol to the current got
section size <code class="language-plaintext highlighter-rouge">sgot-&gt;size</code>.</li>
      <li>Set <code class="language-plaintext highlighter-rouge">dyn</code> to <code class="language-plaintext highlighter-rouge">true</code> if we will be doing a dynamic link.</li>
      <li>Call <code class="language-plaintext highlighter-rouge">or1k_set_got_and_rela_sizes()</code> to update the sizes for the <code class="language-plaintext highlighter-rouge">.got</code>
and <code class="language-plaintext highlighter-rouge">.rela.got</code> sections.</li>
    </ul>
  </li>
  <li>If the symbol is not going to be in the <code class="language-plaintext highlighter-rouge">.got</code> section:
    <ul>
      <li>Set the got location <code class="language-plaintext highlighter-rouge">h-&gt;got.offset</code> to <code class="language-plaintext highlighter-rouge">-1</code></li>
    </ul>
  </li>
</ul>

<p>The <code class="language-plaintext highlighter-rouge">or1k_set_got_and_rela_sizes()</code> function used above increments the
<code class="language-plaintext highlighter-rouge">.got</code> and <code class="language-plaintext highlighter-rouge">.rela</code> section sizes, taking into account whether the symbol is a TLS symbol, which
needs additional entries and relocations.</p>
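<p>To make the size accounting concrete, here is a rough model of what such a helper might do.  It is a sketch under stated assumptions rather than a copy of the binutils function: the flag values, the 4-byte <code class="language-plaintext highlighter-rouge">.got</code> word, the 12-byte RELA record, and the idea that a general dynamic entry takes two slots while an initial exec entry takes one are all assumptions made for illustration, and every <code class="language-plaintext highlighter-rouge">SK_</code> name is hypothetical.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative values only; not the constants used by binutils.  */
#define SK_TLS_GD    1u   /* general dynamic: module index + offset */
#define SK_TLS_IE    2u   /* initial exec: a single offset */
#define SK_GOT_WORD  4u   /* assumed bytes per .got entry */
#define SK_RELA_SIZE 12u  /* assumed bytes per 32-bit RELA record */

/* Grow *GOT_SIZE, and when DYNAMIC also *RELA_SIZE, by the space one
   symbol of the given TLS_TYPE needs.  */
static void
sketch_set_got_and_rela_sizes (unsigned tls_type, bool dynamic,
                               size_t *got_size, size_t *rela_size)
{
  size_t entries = 0;

  assert (got_size != NULL && rela_size != NULL);

  if (tls_type & SK_TLS_GD)
    entries += 2;
  if (tls_type & SK_TLS_IE)
    entries += 1;
  if (entries == 0)
    entries = 1;   /* ordinary, non-TLS symbol */

  *got_size += entries * SK_GOT_WORD;
  if (dynamic)
    *rela_size += entries * SK_RELA_SIZE;
}
```

<p>Passing both sizes by pointer lets the local symbol loop and <code class="language-plaintext highlighter-rouge">allocate_dynrelocs()</code> share the same accounting.</p>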

<h3 id="phase-3---linking-relocate_section">Phase 3 - linking (relocate_section)</h3>

<p>The <a href="https://github.com/stffrdhrn/binutils-gdb/blob/or1k-glibc-1/bfd/elf32-or1k.c#L1252">or1k_elf_relocate_section()</a>
function is called to fill in the relocation holes in the output binary <code class="language-plaintext highlighter-rouge">.text</code>
section.  It does this by looping over relocations and writing to the <code class="language-plaintext highlighter-rouge">.text</code>
section the correct symbol value (memory address).  It also updates other output
binary sections like the <code class="language-plaintext highlighter-rouge">.got</code> section.  Also, for dynamic executables and
libraries new relocations may be written to <code class="language-plaintext highlighter-rouge">.rela</code> sections.</p>

<p>The function signature looks as follows:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">bfd_boolean</span>
<span class="n">or1k_elf_relocate_section</span> <span class="p">(</span><span class="n">bfd</span> <span class="o">*</span><span class="n">output_bfd</span><span class="p">,</span>
                           <span class="k">struct</span> <span class="n">bfd_link_info</span> <span class="o">*</span><span class="n">info</span><span class="p">,</span>
                           <span class="n">bfd</span> <span class="o">*</span><span class="n">input_bfd</span><span class="p">,</span>
                           <span class="n">asection</span> <span class="o">*</span><span class="n">input_section</span><span class="p">,</span>
                           <span class="n">bfd_byte</span> <span class="o">*</span><span class="n">contents</span><span class="p">,</span>
                           <span class="n">Elf_Internal_Rela</span> <span class="o">*</span><span class="n">relocs</span><span class="p">,</span>
                           <span class="n">Elf_Internal_Sym</span> <span class="o">*</span><span class="n">local_syms</span><span class="p">,</span>
                           <span class="n">asection</span> <span class="o">**</span><span class="n">local_sections</span><span class="p">)</span>

<span class="cp">#define elf_backend_relocate_section    or1k_elf_relocate_section
</span></code></pre></div></div>

<p>The arguments to <code class="language-plaintext highlighter-rouge">or1k_elf_relocate_section()</code> being:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">output_bfd</code> - the output elf object we will be writing to</li>
  <li><code class="language-plaintext highlighter-rouge">info</code> - the BFD API which provides access to everything we need</li>
  <li><code class="language-plaintext highlighter-rouge">input_bfd</code> - the current input elf object being iterated over</li>
  <li><code class="language-plaintext highlighter-rouge">input_section</code> - the current <code class="language-plaintext highlighter-rouge">.text</code> section in the input elf object being iterated
over.  From here we get <code class="language-plaintext highlighter-rouge">.text</code> section output details for pc relative relocations:
    <ul>
      <li><code class="language-plaintext highlighter-rouge">input_section-&gt;output_section-&gt;vma</code> - the location of the output section.</li>
      <li><code class="language-plaintext highlighter-rouge">input_section-&gt;output_offset</code> - the output offset</li>
    </ul>
  </li>
  <li><code class="language-plaintext highlighter-rouge">contents</code> - the output file buffer we will write to</li>
  <li><code class="language-plaintext highlighter-rouge">relocs</code> - relocations from the current input section</li>
  <li><code class="language-plaintext highlighter-rouge">local_syms</code> - an array of local symbols used to get the <code class="language-plaintext highlighter-rouge">relocation</code> value for local symbols</li>
  <li><code class="language-plaintext highlighter-rouge">local_sections</code> - an array of input sections for local symbols, used to get the <code class="language-plaintext highlighter-rouge">relocation</code> value for local symbols</li>
</ul>

<p>Internally the function uses:</p>
<ul>
  <li><a href="https://github.com/bminor/binutils-gdb/blob/binutils-2_34/bfd/elf32-or1k.c#L43">or1k_elf_howto_table</a> - not
mentioned until now, but an array of <code class="language-plaintext highlighter-rouge">howto</code> structs indexed by relocation enum.
The <code class="language-plaintext highlighter-rouge">howto</code> struct expresses the algorithm required to update the relocation.</li>
  <li><code class="language-plaintext highlighter-rouge">relocation</code> - a <code class="language-plaintext highlighter-rouge">bfd_vma</code>, the value of the relocation symbol (memory address)
to be written to the output file.</li>
  <li><code class="language-plaintext highlighter-rouge">value</code> - the value that needs to be written to the relocation location.</li>
</ul>

<p>During the first part of <code class="language-plaintext highlighter-rouge">relocate_section</code> we see:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>      <span class="k">if</span> <span class="p">(</span><span class="n">r_symndx</span> <span class="o">&lt;</span> <span class="n">symtab_hdr</span><span class="o">-&gt;</span><span class="n">sh_info</span><span class="p">)</span>
	<span class="p">{</span>
	  <span class="n">sym</span> <span class="o">=</span> <span class="n">local_syms</span> <span class="o">+</span> <span class="n">r_symndx</span><span class="p">;</span>
	  <span class="n">sec</span> <span class="o">=</span> <span class="n">local_sections</span><span class="p">[</span><span class="n">r_symndx</span><span class="p">];</span>
	  <span class="n">relocation</span> <span class="o">=</span> <span class="n">_bfd_elf_rela_local_sym</span> <span class="p">(</span><span class="n">output_bfd</span><span class="p">,</span> <span class="n">sym</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">sec</span><span class="p">,</span> <span class="n">rel</span><span class="p">);</span>

	  <span class="n">name</span> <span class="o">=</span> <span class="n">bfd_elf_string_from_elf_section</span>
	    <span class="p">(</span><span class="n">input_bfd</span><span class="p">,</span> <span class="n">symtab_hdr</span><span class="o">-&gt;</span><span class="n">sh_link</span><span class="p">,</span> <span class="n">sym</span><span class="o">-&gt;</span><span class="n">st_name</span><span class="p">);</span>
	  <span class="n">name</span> <span class="o">=</span> <span class="n">name</span> <span class="o">==</span> <span class="nb">NULL</span> <span class="o">?</span> <span class="n">bfd_section_name</span> <span class="p">(</span><span class="n">sec</span><span class="p">)</span> <span class="o">:</span> <span class="n">name</span><span class="p">;</span>
	<span class="p">}</span>
      <span class="k">else</span>
	<span class="p">{</span>
	  <span class="n">bfd_boolean</span> <span class="n">unresolved_reloc</span><span class="p">,</span> <span class="n">warned</span><span class="p">,</span> <span class="n">ignored</span><span class="p">;</span>

	  <span class="n">RELOC_FOR_GLOBAL_SYMBOL</span> <span class="p">(</span><span class="n">info</span><span class="p">,</span> <span class="n">input_bfd</span><span class="p">,</span> <span class="n">input_section</span><span class="p">,</span> <span class="n">rel</span><span class="p">,</span>
				   <span class="n">r_symndx</span><span class="p">,</span> <span class="n">symtab_hdr</span><span class="p">,</span> <span class="n">sym_hashes</span><span class="p">,</span>
				   <span class="n">h</span><span class="p">,</span> <span class="n">sec</span><span class="p">,</span> <span class="n">relocation</span><span class="p">,</span>
				   <span class="n">unresolved_reloc</span><span class="p">,</span> <span class="n">warned</span><span class="p">,</span> <span class="n">ignored</span><span class="p">);</span>
	  <span class="n">name</span> <span class="o">=</span> <span class="n">h</span><span class="o">-&gt;</span><span class="n">root</span><span class="p">.</span><span class="n">root</span><span class="p">.</span><span class="n">string</span><span class="p">;</span>
	<span class="p">}</span>
</code></pre></div></div>

<p>This can be read as:</p>

<ul>
  <li>If the current symbol is a local symbol:
    <ul>
      <li>We initialize <code class="language-plaintext highlighter-rouge">relocation</code> to the local symbol value using <code class="language-plaintext highlighter-rouge">_bfd_elf_rela_local_sym()</code>.</li>
    </ul>
  </li>
  <li>Otherwise the current symbol is global:
    <ul>
      <li>We use the <code class="language-plaintext highlighter-rouge">RELOC_FOR_GLOBAL_SYMBOL()</code> macro to initialize <code class="language-plaintext highlighter-rouge">relocation</code>.</li>
    </ul>
  </li>
</ul>
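<p>The local versus global test relies on the ELF symbol table layout: local symbols come first, and <code class="language-plaintext highlighter-rouge">sh_info</code> holds the index of the first non-local symbol.  A minimal sketch of that split follows, using the hypothetical helpers <code class="language-plaintext highlighter-rouge">symbol_is_local()</code> and <code class="language-plaintext highlighter-rouge">global_hash_index()</code>.  The second helper assumes the global hash entries are stored in an array that starts at the first global symbol.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True when relocation symbol index R_SYMNDX names a local symbol,
   given SH_INFO from the symbol table section header.  */
static bool
symbol_is_local (size_t r_symndx, size_t sh_info)
{
  return r_symndx < sh_info;
}

/* Position of a global symbol in an array of hash entries that
   starts at the first global symbol (an assumption of this sketch).  */
static size_t
global_hash_index (size_t r_symndx, size_t sh_info)
{
  assert (!symbol_is_local (r_symndx, sh_info));
  return r_symndx - sh_info;
}
```

<p>For an object with five local symbols, index 3 is handled by the local branch while index 7 maps to the third global hash entry.</p>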

<p>During the next part we use the <code class="language-plaintext highlighter-rouge">howto</code> information to update the <code class="language-plaintext highlighter-rouge">relocation</code> value, and also
add relocations to the output file.  For example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>	<span class="k">case</span> <span class="n">R_OR1K_TLS_GD_HI16</span><span class="p">:</span>
	<span class="k">case</span> <span class="n">R_OR1K_TLS_GD_LO16</span><span class="p">:</span>
	<span class="k">case</span> <span class="n">R_OR1K_TLS_GD_PG21</span><span class="p">:</span>
	<span class="k">case</span> <span class="n">R_OR1K_TLS_GD_LO13</span><span class="p">:</span>
	<span class="k">case</span> <span class="n">R_OR1K_TLS_IE_HI16</span><span class="p">:</span>
	<span class="k">case</span> <span class="n">R_OR1K_TLS_IE_LO16</span><span class="p">:</span>
	<span class="k">case</span> <span class="n">R_OR1K_TLS_IE_PG21</span><span class="p">:</span>
	<span class="k">case</span> <span class="n">R_OR1K_TLS_IE_LO13</span><span class="p">:</span>
	<span class="k">case</span> <span class="n">R_OR1K_TLS_IE_AHI16</span><span class="p">:</span>
	  <span class="p">{</span>
	    <span class="n">bfd_vma</span> <span class="n">gotoff</span><span class="p">;</span>
	    <span class="n">Elf_Internal_Rela</span> <span class="n">rela</span><span class="p">;</span>
	    <span class="n">asection</span> <span class="o">*</span><span class="n">srelgot</span><span class="p">;</span>
	    <span class="n">bfd_byte</span> <span class="o">*</span><span class="n">loc</span><span class="p">;</span>
	    <span class="n">bfd_boolean</span> <span class="n">dynamic</span><span class="p">;</span>
	    <span class="kt">int</span> <span class="n">indx</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
	    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">tls_type</span><span class="p">;</span>

	    <span class="n">srelgot</span> <span class="o">=</span> <span class="n">htab</span><span class="o">-&gt;</span><span class="n">root</span><span class="p">.</span><span class="n">srelgot</span><span class="p">;</span>

	    <span class="cm">/* Mark as TLS related GOT entry by setting
	       bit 2 to indcate TLS and bit 1 to indicate GOT.  */</span>
	    <span class="k">if</span> <span class="p">(</span><span class="n">h</span> <span class="o">!=</span> <span class="nb">NULL</span><span class="p">)</span>
	      <span class="p">{</span>
		<span class="n">gotoff</span> <span class="o">=</span> <span class="n">h</span><span class="o">-&gt;</span><span class="n">got</span><span class="p">.</span><span class="n">offset</span><span class="p">;</span>
		<span class="n">tls_type</span> <span class="o">=</span> <span class="p">((</span><span class="k">struct</span> <span class="n">elf_or1k_link_hash_entry</span> <span class="o">*</span><span class="p">)</span> <span class="n">h</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">tls_type</span><span class="p">;</span>
		<span class="n">h</span><span class="o">-&gt;</span><span class="n">got</span><span class="p">.</span><span class="n">offset</span> <span class="o">|=</span> <span class="mi">3</span><span class="p">;</span>
	      <span class="p">}</span>
	    <span class="k">else</span>
	      <span class="p">{</span>
		<span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">local_tls_type</span><span class="p">;</span>

		<span class="n">gotoff</span> <span class="o">=</span> <span class="n">local_got_offsets</span><span class="p">[</span><span class="n">r_symndx</span><span class="p">];</span>
		<span class="n">local_tls_type</span> <span class="o">=</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="p">)</span> <span class="n">elf_or1k_local_tls_type</span> <span class="p">(</span><span class="n">input_bfd</span><span class="p">);</span>
		<span class="n">tls_type</span> <span class="o">=</span> <span class="n">local_tls_type</span> <span class="o">==</span> <span class="nb">NULL</span> <span class="o">?</span> <span class="n">TLS_NONE</span>
						  <span class="o">:</span> <span class="n">local_tls_type</span><span class="p">[</span><span class="n">r_symndx</span><span class="p">];</span>
		<span class="n">local_got_offsets</span><span class="p">[</span><span class="n">r_symndx</span><span class="p">]</span> <span class="o">|=</span> <span class="mi">3</span><span class="p">;</span>
	      <span class="p">}</span>

	    <span class="cm">/* Only process the relocation once.  */</span>
	    <span class="k">if</span> <span class="p">((</span><span class="n">gotoff</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span>
	      <span class="p">{</span>
		<span class="n">gotoff</span> <span class="o">+=</span> <span class="n">or1k_initial_exec_offset</span> <span class="p">(</span><span class="n">howto</span><span class="p">,</span> <span class="n">tls_type</span><span class="p">);</span>

		<span class="cm">/* The PG21 and LO13 relocs are pc-relative, while the
		   rest are GOT relative.  */</span>
		<span class="n">relocation</span> <span class="o">=</span> <span class="n">got_base</span> <span class="o">+</span> <span class="p">(</span><span class="n">gotoff</span> <span class="o">&amp;</span> <span class="o">~</span><span class="mi">3</span><span class="p">);</span>
		<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">r_type</span> <span class="o">==</span> <span class="n">R_OR1K_TLS_GD_PG21</span>
		    <span class="o">||</span> <span class="n">r_type</span> <span class="o">==</span> <span class="n">R_OR1K_TLS_GD_LO13</span>
		    <span class="o">||</span> <span class="n">r_type</span> <span class="o">==</span> <span class="n">R_OR1K_TLS_IE_PG21</span>
		    <span class="o">||</span> <span class="n">r_type</span> <span class="o">==</span> <span class="n">R_OR1K_TLS_IE_LO13</span><span class="p">))</span>
		  <span class="n">relocation</span> <span class="o">-=</span> <span class="n">got_sym_value</span><span class="p">;</span>
		<span class="k">break</span><span class="p">;</span>
	      <span class="p">}</span>

           <span class="p">...</span>

	    <span class="cm">/* Static GD.  */</span>
	    <span class="k">else</span> <span class="k">if</span> <span class="p">((</span><span class="n">tls_type</span> <span class="o">&amp;</span> <span class="n">TLS_GD</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span>
	      <span class="p">{</span>
		<span class="n">bfd_put_32</span> <span class="p">(</span><span class="n">output_bfd</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">sgot</span><span class="o">-&gt;</span><span class="n">contents</span> <span class="o">+</span> <span class="n">gotoff</span><span class="p">);</span>
		<span class="n">bfd_put_32</span> <span class="p">(</span><span class="n">output_bfd</span><span class="p">,</span> <span class="n">tpoff</span> <span class="p">(</span><span class="n">info</span><span class="p">,</span> <span class="n">relocation</span><span class="p">,</span> <span class="n">dynamic</span><span class="p">),</span>
		    <span class="n">sgot</span><span class="o">-&gt;</span><span class="n">contents</span> <span class="o">+</span> <span class="n">gotoff</span> <span class="o">+</span> <span class="mi">4</span><span class="p">);</span>
	      <span class="p">}</span>

	    <span class="n">gotoff</span> <span class="o">+=</span> <span class="n">or1k_initial_exec_offset</span> <span class="p">(</span><span class="n">howto</span><span class="p">,</span> <span class="n">tls_type</span><span class="p">);</span>

	    <span class="p">...</span>

	    <span class="cm">/* Static IE.  */</span>
	    <span class="k">else</span> <span class="k">if</span> <span class="p">((</span><span class="n">tls_type</span> <span class="o">&amp;</span> <span class="n">TLS_IE</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span>
	      <span class="n">bfd_put_32</span> <span class="p">(</span><span class="n">output_bfd</span><span class="p">,</span> <span class="n">tpoff</span> <span class="p">(</span><span class="n">info</span><span class="p">,</span> <span class="n">relocation</span><span class="p">,</span> <span class="n">dynamic</span><span class="p">),</span>
			  <span class="n">sgot</span><span class="o">-&gt;</span><span class="n">contents</span> <span class="o">+</span> <span class="n">gotoff</span><span class="p">);</span>

	    <span class="cm">/* The PG21 and LO13 relocs are pc-relative, while the
	       rest are GOT relative.  */</span>
	    <span class="n">relocation</span> <span class="o">=</span> <span class="n">got_base</span> <span class="o">+</span> <span class="n">gotoff</span><span class="p">;</span>
	    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">r_type</span> <span class="o">==</span> <span class="n">R_OR1K_TLS_GD_PG21</span>
		  <span class="o">||</span> <span class="n">r_type</span> <span class="o">==</span> <span class="n">R_OR1K_TLS_GD_LO13</span>
		  <span class="o">||</span> <span class="n">r_type</span> <span class="o">==</span> <span class="n">R_OR1K_TLS_IE_PG21</span>
		  <span class="o">||</span> <span class="n">r_type</span> <span class="o">==</span> <span class="n">R_OR1K_TLS_IE_LO13</span><span class="p">))</span>
	      <span class="n">relocation</span> <span class="o">-=</span> <span class="n">got_sym_value</span><span class="p">;</span>
	  <span class="p">}</span>
	  <span class="k">break</span><span class="p">;</span>

</code></pre></div></div>

<p>Here we process the TLS General Dynamic and Initial Exec relocations.  I have trimmed
out the <strong>shared</strong> cases to save space.</p>

<p>This can be read as:</p>
<ul>
  <li>Get a reference to the output relocation section <code class="language-plaintext highlighter-rouge">srelgot</code>.</li>
  <li>Get the GOT offset which we set up during phase 3 for global or local symbols.</li>
  <li>Mark the symbol as using a TLS GOT entry.  This <code class="language-plaintext highlighter-rouge">offset |= 3</code> trick is
possible because GOT offsets on 32-bit machines are 4-byte aligned, which leaves the 2 lower bits free.  This
marker is used during phase 4.</li>
  <li>If we have already processed this symbol once:
    <ul>
      <li>Update <code class="language-plaintext highlighter-rouge">relocation</code> to the location in the output <code class="language-plaintext highlighter-rouge">.got</code> section and <strong>break</strong>; we only need to create <code class="language-plaintext highlighter-rouge">.got</code> entries once.</li>
    </ul>
  </li>
  <li>Otherwise populate <code class="language-plaintext highlighter-rouge">.got</code> section entries
    <ul>
      <li>For General Dynamic
        <ul>
          <li>Put 2 entries into the output ELF object <code class="language-plaintext highlighter-rouge">.got</code> section: a literal <code class="language-plaintext highlighter-rouge">1</code> and the thread pointer offset</li>
        </ul>
      </li>
      <li>For Initial Exec
        <ul>
          <li>Put 1 entry into the output ELF object <code class="language-plaintext highlighter-rouge">.got</code> section: the thread pointer offset</li>
        </ul>
      </li>
    </ul>
  </li>
  <li>Finally update the <code class="language-plaintext highlighter-rouge">relocation</code> to the location in the output <code class="language-plaintext highlighter-rouge">.got</code> section</li>
</ul>
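
<p>The low-bit marking described above can be sketched in isolation.  This is a minimal sketch and not the
binutils code: the helper names and the <code class="language-plaintext highlighter-rouge">GOT_TAG_MASK</code> constant are made up for
illustration, and it assumes that untagged offsets are always multiples of 4.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name; elf32-or1k.c simply ORs the literal 3 into the offset.  */
#define GOT_TAG_MASK 3u

/* Nonzero if the entry has already been processed (bit 0 set).  */
int got_entry_done (uint32_t tagged)
{
  return (tagged & 1u) != 0;
}

/* Strip the tag bits to recover the real, 4-byte aligned offset.  */
uint32_t got_entry_offset (uint32_t tagged)
{
  return tagged & ~GOT_TAG_MASK;
}

/* Tag the slot and report whether this was the first visit, i.e. whether
   the caller still has to fill in the .got entry.  */
int got_entry_claim (uint32_t *slot)
{
  int first = !got_entry_done (*slot);

  /* An untagged offset must be aligned, otherwise the low bits are not free.  */
  assert (!first || (*slot & GOT_TAG_MASK) == 0);
  *slot |= GOT_TAG_MASK;
  return first;
}
```

<p>A caller would fill in the <code class="language-plaintext highlighter-rouge">.got</code> entry only when
<code class="language-plaintext highlighter-rouge">got_entry_claim()</code> returns true, and would always mask the tag bits off
with <code class="language-plaintext highlighter-rouge">got_entry_offset()</code> before using the offset as an address.</p>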

<p>In the last part of the loop we write the <code class="language-plaintext highlighter-rouge">relocation</code> value to the output
<code class="language-plaintext highlighter-rouge">.text</code> section.  This is done with the <a href="https://github.com/stffrdhrn/binutils-gdb/blob/or1k-glibc-1/bfd/elf32-or1k.c#L1101">or1k_final_link_relocate()</a>
function.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>      <span class="n">r</span> <span class="o">=</span> <span class="n">or1k_final_link_relocate</span> <span class="p">(</span><span class="n">howto</span><span class="p">,</span> <span class="n">input_bfd</span><span class="p">,</span> <span class="n">input_section</span><span class="p">,</span> <span class="n">contents</span><span class="p">,</span>
				    <span class="n">rel</span><span class="o">-&gt;</span><span class="n">r_offset</span><span class="p">,</span> <span class="n">relocation</span> <span class="o">+</span> <span class="n">rel</span><span class="o">-&gt;</span><span class="n">r_addend</span><span class="p">);</span>
</code></pre></div></div>

<p>With this the <code class="language-plaintext highlighter-rouge">.text</code> section is complete.</p>

<h3 id="phase-4---finishing-up-finish_dynamic_symbol--finish_dynamic_sections">Phase 4 - finishing up (finish_dynamic_symbol + finish_dynamic_sections)</h3>

<p>During phase 3 above we wrote the <code class="language-plaintext highlighter-rouge">.text</code> section out to file.  During the
final finishing up phase we need to write the remaining sections.  This
includes the <code class="language-plaintext highlighter-rouge">.plt</code> section and more writes to the <code class="language-plaintext highlighter-rouge">.got</code> section.</p>

<p>This also includes the <code class="language-plaintext highlighter-rouge">.rela.plt</code> and <code class="language-plaintext highlighter-rouge">.rela.got</code> sections which contain
dynamic relocation entries.</p>

<p>Writing of the data sections is handled by
<a href="https://github.com/bminor/binutils-gdb/blob/binutils-2_34/bfd/elf32-or1k.c#L2178">or1k_elf_finish_dynamic_sections()</a>
and writing of the relocation sections is handled by
<a href="https://github.com/bminor/binutils-gdb/blob/binutils-2_34/bfd/elf32-or1k.c#L2299">or1k_elf_finish_dynamic_symbol()</a>.  These are defined as below.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">bfd_boolean</span>
<span class="n">or1k_elf_finish_dynamic_sections</span> <span class="p">(</span><span class="n">bfd</span> <span class="o">*</span><span class="n">output_bfd</span><span class="p">,</span>
                                  <span class="k">struct</span> <span class="n">bfd_link_info</span> <span class="o">*</span><span class="n">info</span><span class="p">)</span>

<span class="k">static</span> <span class="n">bfd_boolean</span>
<span class="n">or1k_elf_finish_dynamic_symbol</span> <span class="p">(</span><span class="n">bfd</span> <span class="o">*</span><span class="n">output_bfd</span><span class="p">,</span>
                                <span class="k">struct</span> <span class="n">bfd_link_info</span> <span class="o">*</span><span class="n">info</span><span class="p">,</span>
                                <span class="k">struct</span> <span class="n">elf_link_hash_entry</span> <span class="o">*</span><span class="n">h</span><span class="p">,</span>
                                <span class="n">Elf_Internal_Sym</span> <span class="o">*</span><span class="n">sym</span><span class="p">)</span>

<span class="cp">#define elf_backend_finish_dynamic_sections     or1k_elf_finish_dynamic_sections
#define elf_backend_finish_dynamic_symbol       or1k_elf_finish_dynamic_symbol
</span></code></pre></div></div>

<p>A snippet from <code class="language-plaintext highlighter-rouge">or1k_elf_finish_dynamic_sections()</code> shows that assembly code
needs to be injected when writing to the <code class="language-plaintext highlighter-rouge">.plt</code> section.  This is where the first
entry in the <code class="language-plaintext highlighter-rouge">.plt</code> section is written.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>          <span class="k">else</span> <span class="nf">if</span> <span class="p">(</span><span class="n">bfd_link_pic</span> <span class="p">(</span><span class="n">info</span><span class="p">))</span>
            <span class="p">{</span>
              <span class="n">plt0</span> <span class="o">=</span> <span class="n">OR1K_LWZ</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span> <span class="mi">16</span><span class="p">)</span> <span class="o">|</span> <span class="mi">8</span><span class="p">;</span>      <span class="cm">/* .got+8 */</span>
              <span class="n">plt1</span> <span class="o">=</span> <span class="n">OR1K_LWZ</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">16</span><span class="p">)</span> <span class="o">|</span> <span class="mi">4</span><span class="p">;</span>      <span class="cm">/* .got+4 */</span>
              <span class="n">plt2</span> <span class="o">=</span> <span class="n">OR1K_NOP</span><span class="p">;</span>
            <span class="p">}</span>
          <span class="k">else</span>
            <span class="p">{</span>
              <span class="kt">unsigned</span> <span class="n">ha</span> <span class="o">=</span> <span class="p">((</span><span class="n">got_addr</span> <span class="o">+</span> <span class="mh">0x8000</span><span class="p">)</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xffff</span><span class="p">;</span>
              <span class="kt">unsigned</span> <span class="n">lo</span> <span class="o">=</span> <span class="n">got_addr</span> <span class="o">&amp;</span> <span class="mh">0xffff</span><span class="p">;</span>
              <span class="n">plt0</span> <span class="o">=</span> <span class="n">OR1K_MOVHI</span><span class="p">(</span><span class="mi">12</span><span class="p">)</span> <span class="o">|</span> <span class="n">ha</span><span class="p">;</span>
              <span class="n">plt1</span> <span class="o">=</span> <span class="n">OR1K_LWZ</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span><span class="mi">12</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">lo</span> <span class="o">+</span> <span class="mi">8</span><span class="p">);</span>
              <span class="n">plt2</span> <span class="o">=</span> <span class="n">OR1K_LWZ</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span><span class="mi">12</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">lo</span> <span class="o">+</span> <span class="mi">4</span><span class="p">);</span>
            <span class="p">}</span>

          <span class="n">or1k_write_plt_entry</span> <span class="p">(</span><span class="n">output_bfd</span><span class="p">,</span> <span class="n">splt</span><span class="o">-&gt;</span><span class="n">contents</span><span class="p">,</span>
                                <span class="n">plt0</span><span class="p">,</span> <span class="n">plt1</span><span class="p">,</span> <span class="n">plt2</span><span class="p">,</span> <span class="n">OR1K_JR</span><span class="p">(</span><span class="mi">15</span><span class="p">));</span>

          <span class="n">elf_section_data</span> <span class="p">(</span><span class="n">splt</span><span class="o">-&gt;</span><span class="n">output_section</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">this_hdr</span><span class="p">.</span><span class="n">sh_entsize</span> <span class="o">=</span> <span class="mi">4</span><span class="p">;</span>
</code></pre></div></div>

<p>Here we see a write to <code class="language-plaintext highlighter-rouge">output_bfd</code>, which represents the output object file
that we are writing to.  The argument <code class="language-plaintext highlighter-rouge">splt-&gt;contents</code> points to the buffer
holding the contents of the <code class="language-plaintext highlighter-rouge">.plt</code> section, so it is where the entry is written. Next we see the line
<code class="language-plaintext highlighter-rouge">elf_section_data (splt-&gt;output_section)-&gt;this_hdr.sh_entsize = 4</code>;
this allows the linker to calculate the size of the section.</p>
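
<p>The <code class="language-plaintext highlighter-rouge">ha</code> and <code class="language-plaintext highlighter-rouge">lo</code> values in the snippet split
a 32-bit address across two instructions.  The following is a minimal sketch of why <code class="language-plaintext highlighter-rouge">0x8000</code>
is added, assuming the load instruction sign extends its 16-bit immediate while the move-high instruction shifts
its immediate left by 16.  The function names are hypothetical and not part of binutils.</p>

```c
#include <assert.h>
#include <stdint.h>

/* High half, adjusted by 0x8000 to compensate for the sign extension
   of the low half.  Same expression as in the snippet.  */
uint32_t demo_ha (uint32_t addr)
{
  return ((addr + 0x8000u) >> 16) & 0xffffu;
}

/* Low half, taken as-is.  */
uint32_t demo_lo (uint32_t addr)
{
  return addr & 0xffffu;
}

/* Model what the two instructions compute together:
   (ha << 16) + sign_extend16 (lo), modulo 2^32.  */
uint32_t demo_rebuild (uint32_t ha, uint32_t lo)
{
  uint32_t sext;

  assert (ha <= 0xffffu && lo <= 0xffffu);
  sext = (lo & 0x8000u) ? (lo | 0xffff0000u) : lo;
  return (ha << 16) + sext;
}
```

<p>When bit 15 of the address is set the low half is treated as negative, so the high half has to be one larger
for the sum to land back on the original address.  Without the adjustment the result would be off by
<code class="language-plaintext highlighter-rouge">0x10000</code>.</p>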

<p>A snippet from the <code class="language-plaintext highlighter-rouge">or1k_elf_finish_dynamic_symbol()</code> function shows where
we write out the code and dynamic relocation entries for each symbol to
the <code class="language-plaintext highlighter-rouge">.plt</code> section.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>      <span class="n">splt</span> <span class="o">=</span> <span class="n">htab</span><span class="o">-&gt;</span><span class="n">root</span><span class="p">.</span><span class="n">splt</span><span class="p">;</span>
      <span class="n">sgot</span> <span class="o">=</span> <span class="n">htab</span><span class="o">-&gt;</span><span class="n">root</span><span class="p">.</span><span class="n">sgotplt</span><span class="p">;</span>
      <span class="n">srela</span> <span class="o">=</span> <span class="n">htab</span><span class="o">-&gt;</span><span class="n">root</span><span class="p">.</span><span class="n">srelplt</span><span class="p">;</span>
      <span class="p">...</span>

      <span class="k">else</span>
        <span class="p">{</span>
          <span class="kt">unsigned</span> <span class="n">ha</span> <span class="o">=</span> <span class="p">((</span><span class="n">got_addr</span> <span class="o">+</span> <span class="mh">0x8000</span><span class="p">)</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xffff</span><span class="p">;</span>
          <span class="kt">unsigned</span> <span class="n">lo</span> <span class="o">=</span> <span class="n">got_addr</span> <span class="o">&amp;</span> <span class="mh">0xffff</span><span class="p">;</span>
          <span class="n">plt0</span> <span class="o">=</span> <span class="n">OR1K_MOVHI</span><span class="p">(</span><span class="mi">12</span><span class="p">)</span> <span class="o">|</span> <span class="n">ha</span><span class="p">;</span>
          <span class="n">plt1</span> <span class="o">=</span> <span class="n">OR1K_LWZ</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span><span class="mi">12</span><span class="p">)</span> <span class="o">|</span> <span class="n">lo</span><span class="p">;</span>
          <span class="n">plt2</span> <span class="o">=</span> <span class="n">OR1K_ORI0</span><span class="p">(</span><span class="mi">11</span><span class="p">)</span> <span class="o">|</span> <span class="n">plt_reloc</span><span class="p">;</span>
        <span class="p">}</span>

      <span class="n">or1k_write_plt_entry</span> <span class="p">(</span><span class="n">output_bfd</span><span class="p">,</span> <span class="n">splt</span><span class="o">-&gt;</span><span class="n">contents</span> <span class="o">+</span> <span class="n">h</span><span class="o">-&gt;</span><span class="n">plt</span><span class="p">.</span><span class="n">offset</span><span class="p">,</span>
                            <span class="n">plt0</span><span class="p">,</span> <span class="n">plt1</span><span class="p">,</span> <span class="n">plt2</span><span class="p">,</span> <span class="n">OR1K_JR</span><span class="p">(</span><span class="mi">12</span><span class="p">));</span>

      <span class="cm">/* Fill in the entry in the global offset table.  We initialize it to
         point to the top of the plt.  This is done to lazy lookup the actual
         symbol as the first plt entry will be setup by libc to call the
         runtime dynamic linker.  */</span>
      <span class="n">bfd_put_32</span> <span class="p">(</span><span class="n">output_bfd</span><span class="p">,</span> <span class="n">plt_base_addr</span><span class="p">,</span> <span class="n">sgot</span><span class="o">-&gt;</span><span class="n">contents</span> <span class="o">+</span> <span class="n">got_offset</span><span class="p">);</span>

      <span class="cm">/* Fill in the entry in the .rela.plt section.  */</span>
      <span class="n">rela</span><span class="p">.</span><span class="n">r_offset</span> <span class="o">=</span> <span class="n">got_addr</span><span class="p">;</span>
      <span class="n">rela</span><span class="p">.</span><span class="n">r_info</span> <span class="o">=</span> <span class="n">ELF32_R_INFO</span> <span class="p">(</span><span class="n">h</span><span class="o">-&gt;</span><span class="n">dynindx</span><span class="p">,</span> <span class="n">R_OR1K_JMP_SLOT</span><span class="p">);</span>
      <span class="n">rela</span><span class="p">.</span><span class="n">r_addend</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
      <span class="n">loc</span> <span class="o">=</span> <span class="n">srela</span><span class="o">-&gt;</span><span class="n">contents</span><span class="p">;</span>
      <span class="n">loc</span> <span class="o">+=</span> <span class="n">plt_index</span> <span class="o">*</span> <span class="nf">sizeof</span> <span class="p">(</span><span class="n">Elf32_External_Rela</span><span class="p">);</span>
      <span class="n">bfd_elf32_swap_reloca_out</span> <span class="p">(</span><span class="n">output_bfd</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">rela</span><span class="p">,</span> <span class="n">loc</span><span class="p">);</span>
</code></pre></div></div>

<p>Here we can see we write 3 things to <code class="language-plaintext highlighter-rouge">output_bfd</code> for the single <code class="language-plaintext highlighter-rouge">.plt</code> entry.
We write:</p>
<ul>
  <li>The assembly code to the <code class="language-plaintext highlighter-rouge">.plt</code> section.</li>
  <li>The <code class="language-plaintext highlighter-rouge">plt_base_addr</code> (the first entry in the <code class="language-plaintext highlighter-rouge">.plt</code> for runtime lookup) to the <code class="language-plaintext highlighter-rouge">.got</code> section.</li>
  <li>And finally a dynamic relocation for our symbol to the <code class="language-plaintext highlighter-rouge">.rela.plt</code>.</li>
</ul>
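
<p>The relocation entry written in the last step packs the symbol index and relocation type into a single
<code class="language-plaintext highlighter-rouge">r_info</code> word, and each entry occupies a fixed size slot in the section.  A minimal
sketch of that arithmetic follows.  The <code class="language-plaintext highlighter-rouge">demo_*</code> names are hypothetical; the packing
mirrors the standard <code class="language-plaintext highlighter-rouge">ELF32_R_INFO</code> definition, and the 12 byte entry size assumes
three 32-bit fields per entry.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Pack a symbol index and relocation type the way ELF32_R_INFO does.  */
uint32_t demo_r_info (uint32_t sym_index, uint32_t type)
{
  assert (sym_index < (1u << 24));
  return (sym_index << 8) + (type & 0xffu);
}

/* Recover the symbol index from a packed r_info word.  */
uint32_t demo_r_sym (uint32_t info)
{
  return info >> 8;
}

/* Recover the relocation type from a packed r_info word.  */
uint32_t demo_r_type (uint32_t info)
{
  return info & 0xffu;
}

/* Byte offset of the entry for a given plt index: each 32-bit rela
   entry holds r_offset, r_info and r_addend, 4 bytes each.  */
uint32_t demo_rela_offset (uint32_t plt_index)
{
  return plt_index * 12u;
}
```

<p>The runtime linker later reverses this packing to find which symbol to look up and which relocation to apply.</p>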

<p>With that we have written all of the sections out to our final elf object, and it’s ready
to be used.</p>

<h2 id="glibc-runtime-linker">GLIBC Runtime Linker</h2>

<p>The runtime linker, also referred to as the dynamic linker, will do the final
linking as we load our program and shared libraries into memory.  It can process
a limited set of relocation entries that were set up above during phase 4 of
linking.</p>

<p>The runtime linker implementation is found mostly in the
<code class="language-plaintext highlighter-rouge">elf/dl-*</code> GLIBC source files.  Dynamic relocation processing is handled by
the <a href="https://github.com/stffrdhrn/or1k-glibc/blob/or1k-port-1/elf/dl-reloc.c#L146">_dl_relocate_object()</a>
function in the <code class="language-plaintext highlighter-rouge">elf/dl-reloc.c</code> file.  The back end macro used for relocation
<a href="https://github.com/stffrdhrn/or1k-glibc/blob/or1k-port-1/elf/dynamic-link.h#L194">ELF_DYNAMIC_RELOCATE</a>
is defined across several files including <code class="language-plaintext highlighter-rouge">elf/dynamic-link.h</code>
and <a href="https://github.com/stffrdhrn/or1k-glibc/blob/or1k-port-1/elf/do-rel.h">elf/do-rel.h</a>.
Architecture specific relocations are handled by the function <code class="language-plaintext highlighter-rouge">elf_machine_rela()</code>, the implementation
for OpenRISC being in <a href="https://github.com/stffrdhrn/or1k-glibc/blob/or1k-port-1/sysdeps/or1k/dl-machine.h#L210">sysdeps/or1k/dl-machine.h</a>.</p>

<p>In summary from top down:</p>
<ul>
  <li><a href="https://github.com/stffrdhrn/or1k-glibc/blob/or1k-port-1/elf/rtld.c">elf/rtld.c</a> - implements <code class="language-plaintext highlighter-rouge">dl_main()</code> the top level entry for the dynamic linker.</li>
  <li><a href="https://github.com/stffrdhrn/or1k-glibc/blob/or1k-port-1/elf/dl-open.c">elf/dl-open.c</a> - function <code class="language-plaintext highlighter-rouge">dl_open_worker()</code> calls <code class="language-plaintext highlighter-rouge">_dl_relocate_object()</code>, you may also recognize this from <a href="https://man7.org/linux/man-pages/man3/dlopen.3.html">dlopen(3)</a>.</li>
  <li><a href="https://github.com/stffrdhrn/or1k-glibc/blob/or1k-port-1/elf/dl-reloc.c">elf/dl-reloc.c</a> - function <code class="language-plaintext highlighter-rouge">_dl_relocate_object</code> calls <code class="language-plaintext highlighter-rouge">ELF_DYNAMIC_RELOCATE</code></li>
  <li><code class="language-plaintext highlighter-rouge">elf/dynamic-link.h</code> - the macro <code class="language-plaintext highlighter-rouge">ELF_DYNAMIC_RELOCATE</code> calls <code class="language-plaintext highlighter-rouge">elf_dynamic_do_Rel()</code> via several macros</li>
  <li><code class="language-plaintext highlighter-rouge">elf/do-rel.h</code> - function <code class="language-plaintext highlighter-rouge">elf_dynamic_do_Rel()</code> calls <code class="language-plaintext highlighter-rouge">elf_machine_rela()</code></li>
  <li><code class="language-plaintext highlighter-rouge">sysdeps/or1k/dl-machine.h</code> - architecture specific function <code class="language-plaintext highlighter-rouge">elf_machine_rela()</code> implements dynamic relocation handling</li>
</ul>

<p>It supports relocations for:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">R_OR1K_NONE</code> - do nothing</li>
  <li><code class="language-plaintext highlighter-rouge">R_OR1K_COPY</code> - used to copy initial values from shared objects to process memory.</li>
  <li><code class="language-plaintext highlighter-rouge">R_OR1K_32</code> - a <code class="language-plaintext highlighter-rouge">32-bit</code> value</li>
  <li><code class="language-plaintext highlighter-rouge">R_OR1K_GLOB_DAT</code> - aligned <code class="language-plaintext highlighter-rouge">32-bit</code> values for <code class="language-plaintext highlighter-rouge">GOT</code> entries</li>
  <li><code class="language-plaintext highlighter-rouge">R_OR1K_JMP_SLOT</code> - aligned <code class="language-plaintext highlighter-rouge">32-bit</code> values for <code class="language-plaintext highlighter-rouge">PLT</code> entries</li>
  <li><code class="language-plaintext highlighter-rouge">R_OR1K_TLS_DTPMOD/R_OR1K_TLS_DTPOFF</code> - for shared TLS GD <code class="language-plaintext highlighter-rouge">GOT</code> entries</li>
  <li><code class="language-plaintext highlighter-rouge">R_OR1K_TLS_TPOFF</code> - for shared TLS IE <code class="language-plaintext highlighter-rouge">GOT</code> entries</li>
</ul>

<p>A snippet of the OpenRISC implementation of <code class="language-plaintext highlighter-rouge">elf_machine_rela()</code> can be seen
below.  It is pretty straightforward.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Perform the relocation specified by RELOC and SYM (which is fully resolved).
   MAP is the object containing the reloc.  */</span>

<span class="k">auto</span> <span class="kr">inline</span> <span class="kt">void</span>
<span class="nf">__attribute</span> <span class="p">((</span><span class="n">always_inline</span><span class="p">))</span>
<span class="n">elf_machine_rela</span> <span class="p">(</span><span class="k">struct</span> <span class="n">link_map</span> <span class="o">*</span><span class="n">map</span><span class="p">,</span> <span class="k">const</span> <span class="n">Elf32_Rela</span> <span class="o">*</span><span class="n">reloc</span><span class="p">,</span>
                  <span class="k">const</span> <span class="n">Elf32_Sym</span> <span class="o">*</span><span class="n">sym</span><span class="p">,</span> <span class="k">const</span> <span class="k">struct</span> <span class="n">r_found_version</span> <span class="o">*</span><span class="n">version</span><span class="p">,</span>
                  <span class="kt">void</span> <span class="o">*</span><span class="k">const</span> <span class="n">reloc_addr_arg</span><span class="p">,</span> <span class="kt">int</span> <span class="n">skip_ifunc</span><span class="p">)</span>
<span class="p">{</span>

      <span class="k">struct</span> <span class="n">link_map</span> <span class="o">*</span><span class="n">sym_map</span> <span class="o">=</span> <span class="n">RESOLVE_MAP</span> <span class="p">(</span><span class="o">&amp;</span><span class="n">sym</span><span class="p">,</span> <span class="n">version</span><span class="p">,</span> <span class="n">r_type</span><span class="p">);</span>
      <span class="n">Elf32_Addr</span> <span class="n">value</span> <span class="o">=</span> <span class="n">SYMBOL_ADDRESS</span> <span class="p">(</span><span class="n">sym_map</span><span class="p">,</span> <span class="n">sym</span><span class="p">,</span> <span class="nb">true</span><span class="p">);</span>

     <span class="p">...</span>
      <span class="k">switch</span> <span class="p">(</span><span class="n">r_type</span><span class="p">)</span>
        <span class="p">{</span>
          <span class="p">...</span>
          <span class="k">case</span> <span class="n">R_OR1K_32</span><span class="p">:</span>
            <span class="cm">/* Support relocations on mis-aligned offsets.  */</span>
            <span class="n">value</span> <span class="o">+=</span> <span class="n">reloc</span><span class="o">-&gt;</span><span class="n">r_addend</span><span class="p">;</span>
            <span class="n">memcpy</span> <span class="p">(</span><span class="n">reloc_addr_arg</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">value</span><span class="p">,</span> <span class="mi">4</span><span class="p">);</span>
            <span class="k">break</span><span class="p">;</span>
          <span class="k">case</span> <span class="n">R_OR1K_GLOB_DAT</span><span class="p">:</span>
          <span class="k">case</span> <span class="n">R_OR1K_JMP_SLOT</span><span class="p">:</span>
            <span class="o">*</span><span class="n">reloc_addr</span> <span class="o">=</span> <span class="n">value</span> <span class="o">+</span> <span class="n">reloc</span><span class="o">-&gt;</span><span class="n">r_addend</span><span class="p">;</span>
            <span class="k">break</span><span class="p">;</span>
          <span class="p">...</span>
        <span class="p">}</span>
<span class="p">}</span>

</code></pre></div></div>

<h3 id="handling-tls">Handling TLS</h3>

<p>The complicated part of the runtime linker is how it handles TLS variables.</p>

<p>This is done in the following files and functions.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">elf/rtld.c</code> - implements
<a href="https://github.com/stffrdhrn/or1k-glibc/blob/or1k-port-1/elf/rtld.c#L739">init_tls()</a>
which initializes the TLS  data structures.</li>
  <li><a href="https://github.com/stffrdhrn/or1k-glibc/blob/or1k-port-1/elf/dl-tls.c">elf/dl-tls.c</a> - the runtime
linker TLS implementation, containing the top-level initialization code including
<a href="https://github.com/stffrdhrn/or1k-glibc/blob/or1k-port-1/elf/dl-tls.c#L331">_dl_allocate_tls_storage()</a> and
<a href="https://github.com/stffrdhrn/or1k-glibc/blob/or1k-port-1/elf/dl-tls.c#L436">_dl_allocate_tls_init()</a>.</li>
</ul>

<p>The initialization code is fairly straightforward to read through, except for the
macros.  Like most GNU code, it relies heavily on untyped macros, which are defined
in the architecture-specific implementation files.  For OpenRISC this is:</p>

<ul>
  <li><a href="https://github.com/stffrdhrn/or1k-glibc/blob/or1k-port-1/sysdeps/or1k/nptl/tls.h">sysdeps/or1k/nptl/tls.h</a> - contains the definition
of the TLS structures used for OpenRISC.</li>
</ul>

<p>From the <a href="/hardware/embedded/openrisc/2020/01/19/tls.html">previous article</a> on TLS we have the
TLS data structure that looks as follows:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  dtv[]   [ dtv[0], dtv[1], dtv[2], .... ]
            counter ^ |      \
               ----/ /        \________
              /     V                  V
/------TCB-------\/----TLS[1]----\   /----TLS[2]----\
| pthread tcbhead | tbss   tdata |   | tbss   tdata |
\----------------/\--------------/   \--------------/
          ^
          |
   TP-----/
</code></pre></div></div>

<p>The symbols and macros defined in <code class="language-plaintext highlighter-rouge">sysdeps/or1k/nptl/tls.h</code> are:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">__thread_self</code> - a symbol that always represents the current thread</li>
  <li><code class="language-plaintext highlighter-rouge">TLS_DTV_AT_TP</code> - used throughout the TLS code to adjust offsets</li>
  <li><code class="language-plaintext highlighter-rouge">TLS_TCB_AT_TP</code> - used throughout the TLS code to adjust offsets</li>
  <li><code class="language-plaintext highlighter-rouge">TLS_TCB_SIZE</code> - used during <code class="language-plaintext highlighter-rouge">init_tls()</code> to allocate memory for TLS</li>
  <li><code class="language-plaintext highlighter-rouge">TLS_PRE_TCB_SIZE</code> - used during <code class="language-plaintext highlighter-rouge">init_tls()</code> to allocate space for the <code class="language-plaintext highlighter-rouge">pthread</code> struct</li>
  <li><code class="language-plaintext highlighter-rouge">INSTALL_DTV</code> - used during initialization to update a new dtv pointer into the given tcb</li>
  <li><code class="language-plaintext highlighter-rouge">GET_DTV</code> - gets dtv via the provided tcb pointer</li>
  <li><code class="language-plaintext highlighter-rouge">INSTALL_NEW_DTV</code> - used during resizing to update the dtv into the current runtime <code class="language-plaintext highlighter-rouge">__thread_self</code></li>
  <li><code class="language-plaintext highlighter-rouge">TLS_INIT_TP</code> - sets <code class="language-plaintext highlighter-rouge">__thread_self</code>; this is the final step in <code class="language-plaintext highlighter-rouge">init_tls()</code></li>
  <li><code class="language-plaintext highlighter-rouge">THREAD_DTV</code> - gets dtv via <code class="language-plaintext highlighter-rouge">__thread_self</code></li>
  <li><code class="language-plaintext highlighter-rouge">THREAD_SELF</code> - get the pthread pointer via <code class="language-plaintext highlighter-rouge">__thread_self</code></li>
</ul>

<p>Implementations for OpenRISC are:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">register</span> <span class="n">tcbhead_t</span> <span class="o">*</span><span class="n">__thread_self</span> <span class="nf">__asm__</span><span class="p">(</span><span class="s">"r10"</span><span class="p">);</span>

<span class="cp">#define TLS_DTV_AT_TP  1
#define TLS_TCB_AT_TP  0
</span>
<span class="cp">#define TLS_TCB_SIZE             sizeof (tcbhead_t)
#define TLS_PRE_TCB_SIZE         sizeof (struct pthread)
</span>
<span class="cp">#define INSTALL_DTV(tcbp, dtvp)  (((tcbhead_t *) (tcbp))-&gt;dtv = (dtvp) + 1)
#define GET_DTV(tcbp)            (((tcbhead_t *) (tcbp))-&gt;dtv)
</span>
<span class="cp">#define TLS_INIT_TP(tcbp)        ({__thread_self = ((tcbhead_t *)tcbp + 1); NULL;})
</span>
<span class="cp">#define THREAD_DTV()             ((((tcbhead_t *)__thread_self)-1)-&gt;dtv)
#define INSTALL_NEW_DTV(dtv)     (THREAD_DTV() = (dtv))
</span>
<span class="cp">#define THREAD_SELF \
  ((struct pthread *) ((char *) __thread_self - TLS_INIT_TCB_SIZE \
    - TLS_PRE_TCB_SIZE))
</span></code></pre></div></div>
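
<p>The pointer arithmetic behind these macros can be sketched in isolation.  The
following is a minimal sketch, not the glibc code: all <code class="language-plaintext highlighter-rouge">demo_</code> names and types are
hypothetical stand-ins, and the thread pointer is passed as an ordinary argument
instead of living in <code class="language-plaintext highlighter-rouge">r10</code>.  It assumes the layout drawn above, where the
<code class="language-plaintext highlighter-rouge">pthread</code> struct and <code class="language-plaintext highlighter-rouge">tcbhead_t</code> sit directly below the thread pointer.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the glibc types; not the real definitions. */
typedef union { size_t counter; void *val; } demo_dtv_t;
typedef struct { demo_dtv_t *dtv; } demo_tcbhead_t;
struct demo_pthread { void *reserved; int tid; };

#define DEMO_TCB_SIZE      sizeof (demo_tcbhead_t)
#define DEMO_PRE_TCB_SIZE  sizeof (struct demo_pthread)

/* Allocate [ pthread | tcbhead ] contiguously and return the thread pointer,
   which points just past the tcbhead, like TLS_INIT_TP's tcbp + 1. */
static void *demo_alloc_tp (int tid, demo_dtv_t *dtvp)
{
  char *block = calloc (1, DEMO_PRE_TCB_SIZE + DEMO_TCB_SIZE);
  assert (block != NULL);
  ((struct demo_pthread *) block)->tid = tid;
  demo_tcbhead_t *tcb = (demo_tcbhead_t *) (block + DEMO_PRE_TCB_SIZE);
  tcb->dtv = dtvp + 1;   /* mirrors INSTALL_DTV, which stores dtvp + 1 */
  return tcb + 1;
}

/* Mirrors THREAD_DTV: step back one tcbhead from the thread pointer. */
static demo_dtv_t *demo_thread_dtv (void *tp)
{
  return (((demo_tcbhead_t *) tp) - 1)->dtv;
}

/* Mirrors THREAD_SELF: step back over the tcbhead and the pthread struct. */
static struct demo_pthread *demo_thread_self (void *tp)
{
  return (struct demo_pthread *) ((char *) tp - DEMO_TCB_SIZE - DEMO_PRE_TCB_SIZE);
}
```

<p>Given a thread pointer returned by <code class="language-plaintext highlighter-rouge">demo_alloc_tp()</code>, both lookups are a fixed
subtraction from that pointer, which is why keeping it in a dedicated register makes
these accesses cheap.</p>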

<h2 id="summary">Summary</h2>

<p>We have looked at how symbols move from the compiler, to the assembler, to the
linker and finally to the runtime linker.</p>

<p>This has ended up being a long article to explain a rather complicated subject.
Let’s hope it helps provide a good reference for others who want to work on the
GNU toolchain in the future.</p>

<h2 id="further-reading">Further Reading</h2>
<ul>
  <li><a href="/software/embedded/openrisc/2018/06/03/gcc_passes.html">GCC Passes</a> - My blog entry on GCC passes</li>
  <li><a href="http://cahirwpz.users.sourceforge.net/binutils-2.26/bfd-internal.html/index.html#SEC_Contents">bfdint</a> - The BFD developer’s manual</li>
  <li><a href="http://home.elka.pw.edu.pl/~macewicz/dokumentacja/gnu/ld/ldint_2.html">ldint</a> - The LD developer’s manual</li>
  <li><a href="https://gist.github.com/stffrdhrn/d59e1d082430a48643b301c13f6f4d24">LD and BFD Gist</a> - Dump of notes I collected while working on this article.</li>
</ul>]]></content><author><name>Stafford Horne</name></author><category term="software" /><category term="toolchain" /><category term="openrisc" /><summary type="html"><![CDATA[This is an ongoing series of posts on ELF Binary Relocations and Thread Local Storage. This article covers only Thread Local Storage and assumes the reader has had a primer in ELF Relocations, if not please start with my previous article ELF Binaries and Relocation Entries.]]></summary></entry><entry><title type="html">Thread Local Storage</title><link href="http://stffrdhrn.github.io/hardware/embedded/openrisc/2020/01/19/tls.html" rel="alternate" type="text/html" title="Thread Local Storage" /><published>2020-01-19T12:05:00+00:00</published><updated>2020-01-19T12:05:00+00:00</updated><id>http://stffrdhrn.github.io/hardware/embedded/openrisc/2020/01/19/tls</id><content type="html" xml:base="http://stffrdhrn.github.io/hardware/embedded/openrisc/2020/01/19/tls.html"><![CDATA[<p><em>This is an ongoing series of posts on ELF Binary Relocations and Thread
Local Storage.  This article covers only Thread Local Storage and assumes
the reader has had a primer in ELF Relocations.  If not, please start with
my previous article, ELF Binaries and Relocation Entries</em>.</p>

<p>This is the second part in an illustrated 3 part series covering:</p>
<ul>
  <li><a href="/hardware/embedded/openrisc/2019/11/29/relocs.html">ELF Binaries and Relocation Entries</a></li>
  <li>Thread Local Storage</li>
  <li><a href="/software/toolchain/openrisc/2020/07/21/relocs_tls_impl.html">How Relocations and Thread Local Store are Implemented</a></li>
</ul>

<p>In the last article we covered ELF Binary internals and how relocation entries
are used during link time to allow our programs to access symbols
(variables).  However, what if we want a different variable instance for each
thread?  This is where thread local storage (TLS) comes in.</p>

<p>In this article we will discuss how TLS works.  Our outline:</p>

<ul>
  <li><a href="#tls-sections">TLS Sections</a></li>
  <li><a href="#tls-data-structures">TLS data structures</a></li>
  <li><a href="#tls-access-models">TLS access models</a>
    <ul>
      <li><a href="#global-dynamic">Global Dynamic</a></li>
      <li><a href="#local-dynamic">Local Dynamic</a></li>
      <li><a href="#initial-exec">Initial Exec</a></li>
      <li><a href="#local-exec">Local Exec</a></li>
    </ul>
  </li>
  <li><a href="#linker-relaxation">Linker Relaxation</a></li>
</ul>

<p>As before, the examples in this article can be found in my <a href="https://github.com/stffrdhrn/tls-examples">tls-examples</a>
project.  Please check it out.</p>

<h2 id="thread-local-storage">Thread Local Storage</h2>

<p>Did you know that in C you can prefix variables with <code class="language-plaintext highlighter-rouge">__thread</code> to create
<a href="https://gcc.gnu.org/onlinedocs/gcc/Thread-Local.html">thread local</a> variables?</p>

<h3 id="example">Example</h3>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">__thread</span> <span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
</code></pre></div></div>

<p>A thread local variable is a variable that will have a unique instance per thread.
Each time a new thread is created, the space required to store the thread local
variables is allocated.</p>

<p>TLS variables are stored in dynamic TLS sections.</p>

<h2 id="tls-sections">TLS Sections</h2>

<p>In the previous article we saw how variables were stored in the <code class="language-plaintext highlighter-rouge">.data</code> and
<code class="language-plaintext highlighter-rouge">.bss</code> sections.  These are initialized once per program or library.</p>

<p>When we get to binaries that use TLS we will additionally have <code class="language-plaintext highlighter-rouge">.tdata</code> and
<code class="language-plaintext highlighter-rouge">.tbss</code> sections.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">.tdata</code> - static and non-static initialized thread local variables</li>
  <li><code class="language-plaintext highlighter-rouge">.tbss</code>  - static and non-static uninitialized thread local variables</li>
</ul>

<p>These exist in a special <code class="language-plaintext highlighter-rouge">TLS</code> <a href="https://lwn.net/Articles/531148/">segment</a> which
is loaded per thread.  In the next article we will discuss more about how this
loading works.</p>

<h2 id="tls-data-structures">TLS Data Structures</h2>

<p>As we recall, to access data in <code class="language-plaintext highlighter-rouge">.data</code> and <code class="language-plaintext highlighter-rouge">.bss</code> sections simple code
sequences with relocation entries are used.  These sequences set and add
registers to build pointers to our data.  For example, the below sequence uses 2
relocations to compose a <code class="language-plaintext highlighter-rouge">.bss</code> section address into register <code class="language-plaintext highlighter-rouge">r11</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Addr.   Machine Code    Assembly             Relocations
0000000c &lt;get_x_addr&gt;:
   c:   19 60 [00 00]   l.movhi r11,[0]      # c  R_OR1K_AHI16 .bss
  10:   44 00  48 00    l.jr r9
  14:   9d 6b [00 00]    l.addi r11,r11,[0]  # 14 R_OR1K_LO_16_IN_INSN .bss
</code></pre></div></div>

<p>With TLS the code sequences to access our data will also build pointers to our
data, but they need to traverse the TLS data structures.</p>

<p>As the code sequence is read only and will be the same for each thread, another
level of indirection is needed.  This is provided by the Thread Pointer (TP).</p>

<p>The Thread Pointer points into a data structure that allows us to locate TLS
data sections.  The TLS data structure includes:</p>

<ul>
  <li>Thread Control Block (TCB)</li>
  <li>Dynamic Thread Vector (DTV)</li>
  <li>TLS Data Sections</li>
</ul>

<p>These are illustrated below:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  dtv[]   [ dtv[0], dtv[1], dtv[2], .... ]
            counter ^ |      \
               ----/ /        \________
              /     V                  V
/------TCB-------\/----TLS[1]----\   /----TLS[2]----\
| pthread tcbhead | tbss   tdata |   | tbss   tdata |
\----------------/\--------------/   \--------------/
          ^
          |
   TP-----/
</code></pre></div></div>

<h3 id="thread-pointer-tp">Thread Pointer (TP)</h3>

<p>The TP is unique to each thread.  It provides the starting point to the TLS data
structure.</p>

<ul>
  <li>The TP points to the Thread Control Block</li>
  <li>On OpenRISC the TP is stored in <code class="language-plaintext highlighter-rouge">r10</code></li>
  <li>On x86_64 the TP is stored in <code class="language-plaintext highlighter-rouge">$fs</code></li>
  <li>This is the <code class="language-plaintext highlighter-rouge">*tls</code> pointer passed to the
<a href="http://man7.org/linux/man-pages/man2/clone.2.html">clone()</a> system call when
using <code class="language-plaintext highlighter-rouge">CLONE_SETTLS</code>.</li>
</ul>

<h3 id="thread-control-block-tcb">Thread Control Block (TCB)</h3>

<p>The TCB is the head of the TLS data structure.  The TCB consists of:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">pthread</code> - the <a href="http://man7.org/linux/man-pages/man7/pthreads.7.html">pthread</a>
struct for the current thread, contains <code class="language-plaintext highlighter-rouge">tid</code> etc. Located by <code class="language-plaintext highlighter-rouge">TP - TCB size - Pthread size</code></li>
  <li><code class="language-plaintext highlighter-rouge">tcbhead</code> - the <code class="language-plaintext highlighter-rouge">tcbhead_t</code> struct, machine dependent, contains pointer to DTV.  Located by <code class="language-plaintext highlighter-rouge">TP - TCB size</code>.</li>
</ul>

<p>For OpenRISC <code class="language-plaintext highlighter-rouge">tcbhead_t</code> is defined in
<a href="https://github.com/openrisc/or1k-glibc/blob/or1k-port/sysdeps/or1k/nptl/tls.h#L30">sysdeps/or1k/nptl/tls.h</a> as:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
  <span class="n">dtv_t</span> <span class="o">*</span><span class="n">dtv</span><span class="p">;</span>
<span class="p">}</span> <span class="n">tcbhead_t</span><span class="p">;</span>
</code></pre></div></div>

<ul>
  <li><code class="language-plaintext highlighter-rouge">dtv</code> - is a pointer to the dtv array, points to entry <code class="language-plaintext highlighter-rouge">dtv[1]</code></li>
</ul>

<p>For x86_64 the <code class="language-plaintext highlighter-rouge">tcbhead_t</code> is defined in
<a href="https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/nptl/tls.h;h=e7c1416eec4a490312ed56cc51a03a33eaa8e222;hb=HEAD#l42">sysdeps/x86_64/nptl/tls.h</a>
as:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span>
<span class="p">{</span>
  <span class="kt">void</span> <span class="o">*</span><span class="n">tcb</span><span class="p">;</span>            <span class="cm">/* Pointer to the TCB.  Not necessarily the
                           thread descriptor used by libpthread.  */</span>
  <span class="n">dtv_t</span> <span class="o">*</span><span class="n">dtv</span><span class="p">;</span>
  <span class="kt">void</span> <span class="o">*</span><span class="n">self</span><span class="p">;</span>           <span class="cm">/* Pointer to the thread descriptor.  */</span>
  <span class="kt">int</span> <span class="n">multiple_threads</span><span class="p">;</span>
  <span class="kt">int</span> <span class="n">gscope_flag</span><span class="p">;</span>
  <span class="kt">uintptr_t</span> <span class="n">sysinfo</span><span class="p">;</span>
  <span class="kt">uintptr_t</span> <span class="n">stack_guard</span><span class="p">;</span>
  <span class="kt">uintptr_t</span> <span class="n">pointer_guard</span><span class="p">;</span>
  <span class="kt">unsigned</span> <span class="kt">long</span> <span class="kt">int</span> <span class="n">vgetcpu_cache</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
  <span class="cm">/* Bit 0: X86_FEATURE_1_IBT.
     Bit 1: X86_FEATURE_1_SHSTK.
   */</span>
  <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">feature_1</span><span class="p">;</span>
  <span class="kt">int</span> <span class="n">__glibc_unused1</span><span class="p">;</span>
  <span class="cm">/* Reservation of some values for the TM ABI.  */</span>
  <span class="kt">void</span> <span class="o">*</span><span class="n">__private_tm</span><span class="p">[</span><span class="mi">4</span><span class="p">];</span>
  <span class="cm">/* GCC split stack support.  */</span>
  <span class="kt">void</span> <span class="o">*</span><span class="n">__private_ss</span><span class="p">;</span>
  <span class="cm">/* The lowest address of shadow stack,  */</span>
  <span class="kt">unsigned</span> <span class="kt">long</span> <span class="kt">long</span> <span class="kt">int</span> <span class="n">ssp_base</span><span class="p">;</span>
  <span class="cm">/* Must be kept even if it is no longer used by glibc since programs,
     like AddressSanitizer, depend on the size of tcbhead_t.  */</span>
  <span class="n">__128bits</span> <span class="n">__glibc_unused2</span><span class="p">[</span><span class="mi">8</span><span class="p">][</span><span class="mi">4</span><span class="p">]</span> <span class="n">__attribute__</span> <span class="p">((</span><span class="n">aligned</span> <span class="p">(</span><span class="mi">32</span><span class="p">)));</span>

  <span class="kt">void</span> <span class="o">*</span><span class="n">__padding</span><span class="p">[</span><span class="mi">8</span><span class="p">];</span>
<span class="p">}</span> <span class="n">tcbhead_t</span><span class="p">;</span>
</code></pre></div></div>

<p>The x86_64 implementation has many more fields, including:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">gscope_flag</code> - Global Scope lock flags used by the runtime linker, for OpenRISC this is stored in <code class="language-plaintext highlighter-rouge">pthread</code>.</li>
  <li><code class="language-plaintext highlighter-rouge">stack_guard</code> - The <a href="https://access.redhat.com/blogs/766093/posts/3548631">stack
guard</a> canary stored in
the thread local area.  For OpenRISC a global stack guard is stored in <code class="language-plaintext highlighter-rouge">.bss</code>.</li>
  <li><code class="language-plaintext highlighter-rouge">pointer_guard</code> - The <a href="http://hmarco.org/bugs/glibc_ptr_mangle_weakness.html">pointer
guard</a> stored in the
thread local area.  For OpenRISC a global pointer guard is stored in <code class="language-plaintext highlighter-rouge">.bss</code>.</li>
</ul>

<h3 id="dynamic-thread-vector-dtv">Dynamic Thread Vector (DTV)</h3>

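<p>The DTV is an array of pointers to each TLS data section.  The first entry in
the DTV array contains the generation counter.  The generation counter records which
version of the loaded module list the DTV reflects; in <code class="language-plaintext highlighter-rouge">__tls_get_addr()</code> it is
compared with <code class="language-plaintext highlighter-rouge">dl_tls_generation</code> to detect a stale DTV.  The DTV can be dynamically
resized as more TLS modules are loaded.</p>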

<p>The <code class="language-plaintext highlighter-rouge">dtv_t</code> type is a union as defined below:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span>
  <span class="kt">void</span> <span class="o">*</span><span class="n">val</span><span class="p">;</span>     <span class="c1">// Aligned pointer to data/bss</span>
  <span class="kt">void</span> <span class="o">*</span><span class="n">to_free</span><span class="p">;</span> <span class="c1">// Unaligned pointer for free()</span>
<span class="p">}</span> <span class="n">dtv_pointer</span><span class="p">;</span>

<span class="k">typedef</span> <span class="k">union</span> <span class="p">{</span>
  <span class="kt">int</span> <span class="n">counter</span><span class="p">;</span>          <span class="c1">// for entry 0</span>
  <span class="n">dtv_pointer</span> <span class="n">pointer</span><span class="p">;</span>  <span class="c1">// for all other entries</span>
<span class="p">}</span> <span class="n">dtv_t</span><span class="p">;</span>
</code></pre></div></div>

<p>Each <code class="language-plaintext highlighter-rouge">dtv_t</code> entry can be either a counter or a pointer.  By convention the
first entry, <code class="language-plaintext highlighter-rouge">dtv[0]</code> is a counter and the rest are pointers.</p>
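
<p>To make the convention concrete, here is a minimal sketch of filling such an array.
The <code class="language-plaintext highlighter-rouge">demo_</code> types and the <code class="language-plaintext highlighter-rouge">demo_dtv_fill()</code> helper are hypothetical and only
mirror the simplified union shown above; glibc allocates and fills the real DTV
in <code class="language-plaintext highlighter-rouge">elf/dl-tls.c</code>.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Simplified copies of the types above, renamed to mark them as a demo. */
typedef struct { void *val; void *to_free; } demo_dtv_pointer;
typedef union { int counter; demo_dtv_pointer pointer; } demo_dtv_t;

/* Fill slot 0 with the counter and slots 1..n with TLS block addresses.
   The caller must provide room for n + 1 entries. */
static void demo_dtv_fill (demo_dtv_t *dtv, int counter, void **blocks, size_t n)
{
  assert (dtv != NULL && blocks != NULL);
  dtv[0].counter = counter;
  for (size_t i = 0; i < n; i++)
    {
      dtv[i + 1].pointer.val = blocks[i];
      dtv[i + 1].pointer.to_free = NULL;   /* nothing to free in this demo */
    }
}
```

<p>Because every slot is the same union type, the array can be indexed uniformly.
Only the convention decides which member of each slot is valid.</p>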

<h3 id="thread-local-storage-tls">Thread Local Storage (TLS)</h3>

<p>The initial set of TLS data sections is allocated contiguously with the TCB.  Additional TLS
data blocks are allocated dynamically.  There is one entry for each
loaded module, the first module being the current program.  For dynamic
libraries the TLS data is lazily initialized per thread.</p>

<h4 id="local-or-tls1">Local (or TLS[1])</h4>
<ul>
  <li><code class="language-plaintext highlighter-rouge">tbss</code> - the <code class="language-plaintext highlighter-rouge">.tbss</code> section for the current thread from the current
process’s ELF binary.</li>
  <li><code class="language-plaintext highlighter-rouge">tdata</code> - the <code class="language-plaintext highlighter-rouge">.tdata</code> section for the current thread from the current
process’s ELF binary.</li>
</ul>

<h4 id="tls2">TLS[2]</h4>
<ul>
  <li><code class="language-plaintext highlighter-rouge">tbss</code> - the <code class="language-plaintext highlighter-rouge">.tbss</code> section for variables defined in the first shared library loaded by the current process</li>
  <li><code class="language-plaintext highlighter-rouge">tdata</code> - the <code class="language-plaintext highlighter-rouge">.tdata</code> section for variables defined in the first shared library loaded by the current process</li>
</ul>

<h3 id="the-__tls_get_addr-function">The __tls_get_addr() function</h3>

<p>The <code class="language-plaintext highlighter-rouge">__tls_get_addr()</code> function can be used at any time to traverse the TLS data
structure and return a variable’s address.  The function is given a pointer to
an architecture specific argument <code class="language-plaintext highlighter-rouge">tls_index</code>.</p>

<ul>
  <li>The argument contains 2 pieces of data:
    <ul>
      <li>The module index - <code class="language-plaintext highlighter-rouge">1</code> for the current process, <code class="language-plaintext highlighter-rouge">2</code> for the first loaded shared
library etc.</li>
      <li>The data offset - the offset of the variable in the <code class="language-plaintext highlighter-rouge">TLS</code> data section</li>
    </ul>
  </li>
  <li>Internally <code class="language-plaintext highlighter-rouge">__tls_get_addr</code> uses TP to locate the TLS data structure</li>
  <li>The function returns the address of the variable we want to access</li>
</ul>

<p>For static builds the implementation is architecture dependent.  For OpenRISC
it is defined in
<a href="https://github.com/openrisc/or1k-glibc/blob/or1k-port/sysdeps/or1k/libc-tls.c#L28">sysdeps/or1k/libc-tls.c</a>
as:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span>
<span class="nf">__tls_get_addr</span> <span class="p">(</span><span class="n">tls_index</span> <span class="o">*</span><span class="n">ti</span><span class="p">)</span>
<span class="p">{</span>
  <span class="n">dtv_t</span> <span class="o">*</span><span class="n">dtv</span> <span class="o">=</span> <span class="n">THREAD_DTV</span> <span class="p">();</span>
  <span class="k">return</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span> <span class="n">dtv</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">pointer</span><span class="p">.</span><span class="n">val</span> <span class="o">+</span> <span class="n">ti</span><span class="o">-&gt;</span><span class="n">ti_offset</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Note that for static builds the module index can be hard coded to <code class="language-plaintext highlighter-rouge">1</code> as there
will always be only one module.</p>

<p>For dynamically linked programs the implementation is defined as part of the
runtime dynamic linker in
<a href="https://sourceware.org/git/?p=glibc.git;a=blob;f=elf/dl-tls.c;hb=9f8b135f76ac7943d1e108b7f6e816f526b2208c#l824">elf/dl-tls.c</a>
as:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="o">*</span>
<span class="nf">__tls_get_addr</span> <span class="p">(</span><span class="n">GET_ADDR_ARGS</span><span class="p">)</span>
<span class="p">{</span>
  <span class="n">dtv_t</span> <span class="o">*</span><span class="n">dtv</span> <span class="o">=</span> <span class="n">THREAD_DTV</span> <span class="p">();</span>

  <span class="k">if</span> <span class="p">(</span><span class="n">__glibc_unlikely</span> <span class="p">(</span><span class="n">dtv</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">counter</span> <span class="o">!=</span> <span class="n">GL</span><span class="p">(</span><span class="n">dl_tls_generation</span><span class="p">)))</span>
    <span class="k">return</span> <span class="n">update_get_addr</span> <span class="p">(</span><span class="n">GET_ADDR_PARAM</span><span class="p">);</span>

  <span class="kt">void</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">dtv</span><span class="p">[</span><span class="n">GET_ADDR_MODULE</span><span class="p">].</span><span class="n">pointer</span><span class="p">.</span><span class="n">val</span><span class="p">;</span>

  <span class="k">if</span> <span class="p">(</span><span class="n">__glibc_unlikely</span> <span class="p">(</span><span class="n">p</span> <span class="o">==</span> <span class="n">TLS_DTV_UNALLOCATED</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">tls_get_addr_tail</span> <span class="p">(</span><span class="n">GET_ADDR_PARAM</span><span class="p">,</span> <span class="n">dtv</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>

  <span class="k">return</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)</span> <span class="n">p</span> <span class="o">+</span> <span class="n">GET_ADDR_OFFSET</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Several macros are used here, which makes it a bit hard to follow.  They are:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">THREAD_DTV</code> - uses TP to get the pointer to the DTV array.</li>
  <li><code class="language-plaintext highlighter-rouge">GET_ADDR_ARGS</code> - short for <code class="language-plaintext highlighter-rouge">tls_index* ti</code></li>
  <li><code class="language-plaintext highlighter-rouge">GET_ADDR_PARAM</code> - short for <code class="language-plaintext highlighter-rouge">ti</code></li>
  <li><code class="language-plaintext highlighter-rouge">GET_ADDR_MODULE</code> - short for <code class="language-plaintext highlighter-rouge">ti-&gt;ti_module</code></li>
  <li><code class="language-plaintext highlighter-rouge">GET_ADDR_OFFSET</code> - short for <code class="language-plaintext highlighter-rouge">ti-&gt;ti_offset</code></li>
</ul>
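
<p>With the macros expanded, the fast path of the lookup reduces to an array index
plus an addition.  The sketch below shows only that fast path under simplifying
assumptions: the <code class="language-plaintext highlighter-rouge">demo_</code> types are hypothetical, the DTV is passed in explicitly rather
than found through TP, and the generation check and lazy allocation that glibc
performs are left out.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for tls_index and dtv_t. */
typedef struct { unsigned long ti_module; unsigned long ti_offset; } demo_tls_index;
typedef struct { void *val; void *to_free; } demo_dtv_pointer;
typedef union { size_t counter; demo_dtv_pointer pointer; } demo_dtv_t;

/* Fast path only: index the DTV by module, then add the variable offset. */
static void *demo_tls_get_addr (demo_dtv_t *dtv, const demo_tls_index *ti)
{
  void *p = dtv[ti->ti_module].pointer.val;
  /* glibc takes a slow path here when the block is not yet allocated;
     this sketch simply requires the block to exist. */
  assert (p != NULL);
  return (char *) p + ti->ti_offset;
}
```

<p>For example, with a DTV whose slot 2 points at a module's TLS block, an index of
<code class="language-plaintext highlighter-rouge">{ 2, 8 }</code> yields the address 8 bytes into that block.</p>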

<h2 id="tls-access-models">TLS Access Models</h2>

<p>As one can imagine, traversing the TLS data structures when accessing each variable
could be slow.  For this reason there are different TLS access models that the
compiler can choose to minimize variable access overhead.</p>

<h3 id="global-dynamic">Global Dynamic</h3>

<p>The Global Dynamic (GD), sometimes called General Dynamic, access model is the
slowest access model which will traverse the entire TLS data structure for each
variable access.  It is used for accessing variables in dynamic shared
libraries.</p>

<h4 id="before-linking">Before Linking</h4>

<p><img src="/content/2019/tls-gd-obj.png" alt="Global Dynamic Object" /></p>

<p>Not counting relocations for the PLT and GOT entries, before linking the <code class="language-plaintext highlighter-rouge">.text</code>
section contains 1 placeholder for a GOT offset.  This GOT entry will contain the
arguments to <code class="language-plaintext highlighter-rouge">__tls_get_addr</code>.</p>

<h4 id="after-linking">After Linking</h4>

<p><img src="/content/2019/tls-gd-exe.png" alt="Global Dynamic Program" /></p>

<p>After linking there will be 2 relocation entries in the GOT to be resolved by
the dynamic linker.  These are <code class="language-plaintext highlighter-rouge">R_TLS_DTPMOD</code>, the TLS module index, and
<code class="language-plaintext highlighter-rouge">R_TLS_DTPOFF</code>, the offset of the variable into the TLS module.</p>

<h4 id="example-1">Example</h4>

<p>File: <a href="https://github.com/stffrdhrn/tls-examples/blob/master/tls-gd.c">tls-gd.c</a></p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">extern</span> <span class="kr">__thread</span> <span class="kt">int</span> <span class="n">x</span><span class="p">;</span>

<span class="kt">int</span><span class="o">*</span> <span class="nf">get_x_addr</span><span class="p">()</span> <span class="p">{</span>
  <span class="k">return</span> <span class="o">&amp;</span><span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<h4 id="code-sequence-openrisc">Code Sequence (OpenRISC)</h4>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tls-gd.o:     file format elf32-or1k

Disassembly of section .text:

0000004c &lt;get_x_addr&gt;:
  4c:	18 60 [00 00] 	l.movhi r3,[0]          # 4c: R_OR1K_TLS_GD_HI16	x
  50:	9c 21 ff f8 	l.addi r1,r1,-8
  54:	a8 63 [00 00] 	l.ori r3,r3,[0]         # 54: R_OR1K_TLS_GD_LO16	x
  58:	d4 01 80 00 	l.sw 0(r1),r16
  5c:	d4 01 48 04 	l.sw 4(r1),r9
  60:	04 00 00 02 	l.jal 68 &lt;get_x_addr+0x1c&gt;
  64:	1a 00 [00 00] 	 l.movhi r16,[0]        # 64: R_OR1K_GOTPC_HI16	_GLOBAL_OFFSET_TABLE_-0x4
  68:	aa 10 [00 00] 	l.ori r16,r16,[0]       # 68: R_OR1K_GOTPC_LO16	_GLOBAL_OFFSET_TABLE_
  6c:	e2 10 48 00 	l.add r16,r16,r9
  70:	04 00 [00 00] 	l.jal [0]               # 70: R_OR1K_PLT26	__tls_get_addr
  74:	e0 63 80 00 	 l.add r3,r3,r16
  78:	85 21 00 04 	l.lwz r9,4(r1)
  7c:	86 01 00 00 	l.lwz r16,0(r1)
  80:	44 00 48 00 	l.jr r9
  84:	9c 21 00 08 	 l.addi r1,r1,8
</code></pre></div></div>

<h4 id="code-sequence-x86_64">Code Sequence (x86_64)</h4>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tls-gd.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000020 &lt;get_x_addr&gt;:
  20:	48 83 ec 08          	  sub    $0x8,%rsp
  24:	66 48 8d 3d [00 00 00 00] lea    [0](%rip),%rdi  # 28 R_X86_64_TLSGD	x-0x4
  2c:	66 66 48 e8 [00 00 00 00] callq  [0]             # 30 R_X86_64_PLT32	__tls_get_addr-0x4
  34:	48 83 c4 08          	  add    $0x8,%rsp
  38:	c3                   	  retq
</code></pre></div></div>

<h3 id="local-dynamic">Local Dynamic</h3>

<p>The Local Dynamic (LD) access model is an optimization for Global Dynamic where
multiple variables may be accessed from the same TLS module.  Instead of
traversing the TLS data structure for each variable, the TLS data section address
is loaded once by calling <code class="language-plaintext highlighter-rouge">__tls_get_addr</code> with an offset of <code class="language-plaintext highlighter-rouge">0</code>.  Next, variables
can be accessed with individual offsets.</p>
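<p>To make the difference from Global Dynamic concrete, here is a small toy model in C.  It is only a sketch:
the names <code class="language-plaintext highlighter-rouge">toy_tls_index</code>, <code class="language-plaintext highlighter-rouge">toy_tls_get_addr</code>, <code class="language-plaintext highlighter-rouge">toy_sum_gd</code> and <code class="language-plaintext highlighter-rouge">toy_sum_ld</code> are made up for illustration,
and the flat per-thread array of TLS blocks is an assumption, not how any real C library lays things out.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: the real __tls_get_addr is provided by the C library. */
struct toy_tls_index {
  size_t module;  /* TLS module index, resolved by the dynamic linker */
  size_t offset;  /* offset of the variable inside that module's TLS block */
};

/* dtv[m] is assumed to point at the current thread's TLS block for module m. */
void *toy_tls_get_addr(uint8_t **dtv, const struct toy_tls_index *ti) {
  assert(dtv[ti->module] != NULL);
  return dtv[ti->module] + ti->offset;
}

/* Global Dynamic style: one lookup per variable. */
int toy_sum_gd(uint8_t **dtv, size_t module, size_t x_off, size_t y_off) {
  struct toy_tls_index tx = { module, x_off };
  struct toy_tls_index ty = { module, y_off };
  int *x = toy_tls_get_addr(dtv, &tx);
  int *y = toy_tls_get_addr(dtv, &ty);
  return *x + *y;
}

/* Local Dynamic style: one lookup with offset 0, then plain additions. */
int toy_sum_ld(uint8_t **dtv, size_t module, size_t x_off, size_t y_off) {
  struct toy_tls_index base_index = { module, 0 };
  uint8_t *base = toy_tls_get_addr(dtv, &base_index);
  return *(int *)(base + x_off) + *(int *)(base + y_off);
}
```

<p>Both functions return the same sum; the Local Dynamic version simply makes one lookup call instead of two.</p>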

<p>Local Dynamic is not supported on OpenRISC yet.</p>

<h4 id="before-linking-1">Before Linking</h4>

<p><img src="/content/2019/tls-ld-obj.png" alt="Local Dynamic Object" /></p>

<p>Not counting relocations for the PLT and GOT entries, before linking the <code class="language-plaintext highlighter-rouge">.text</code>
section contains 1 placeholder for a GOT offset and 2 placeholders for the TLS offsets.
This GOT entry will contain the arguments to <code class="language-plaintext highlighter-rouge">__tls_get_addr</code>.
The TLS offsets will be the offsets to our variables in the TLS data section.</p>

<h4 id="after-linking-1">After Linking</h4>

<p><img src="/content/2019/tls-ld-exe.png" alt="Local Dynamic Program" /></p>

<p>After linking there will be 1 relocation entry in the GOT to be resolved by
the dynamic linker.  This is <code class="language-plaintext highlighter-rouge">R_TLS_DTPMOD</code>, the TLS module index; the offset
will be <code class="language-plaintext highlighter-rouge">0x0</code>.</p>

<h5 id="example-2">Example</h5>

<p>File: <a href="https://github.com/stffrdhrn/tls-examples/blob/master/tls-ld.c">tls-ld.c</a></p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">__thread</span> <span class="kt">int</span> <span class="n">x</span><span class="p">;</span>
<span class="k">static</span> <span class="kr">__thread</span> <span class="kt">int</span> <span class="n">y</span><span class="p">;</span>

<span class="kt">int</span> <span class="nf">sum</span><span class="p">()</span> <span class="p">{</span>
  <span class="k">return</span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<h4 id="code-sequence-x86_64-1">Code Sequence (x86_64)</h4>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tls-ld.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000030 &lt;sum&gt;:
  30:	48 83 ec 08          	sub    $0x8,%rsp
  34:	48 8d 3d [00 00 00 00] 	lea    [0](%rip),%rdi   # 37 R_X86_64_TLSLD	x-0x4
  3b:	e8 [00 00 00 00]       	callq  [0]              # 3c R_X86_64_PLT32	__tls_get_addr-0x4
  40:	8b 90 [00 00 00 00]    	mov    [0](%rax),%edx   # 42 R_X86_64_DTPOFF32	x
  46:	03 90 [00 00 00 00]    	add    [0](%rax),%edx   # 48 R_X86_64_DTPOFF32	y
  4c:	48 83 c4 08          	add    $0x8,%rsp
  50:	89 d0                	mov    %edx,%eax
  52:	c3                   	retq
</code></pre></div></div>

<h3 id="initial-exec">Initial Exec</h3>

<p>The Initial Exec (IE) access model does not require traversing the TLS data
structure.  It requires that the compiler knows that the offset from the TP to the
variable can be computed during link time.</p>

<p>As Initial Exec does not require calling <code class="language-plaintext highlighter-rouge">__tls_get_addr</code>, it is more efficient
than GD and LD access.</p>

<h4 id="before-linking-2">Before Linking</h4>

<p><img src="/content/2019/tls-ie-obj.png" alt="Initial Exec Object" /></p>

<p>Not counting the relocation entry for the GOT, before linking the <code class="language-plaintext highlighter-rouge">.text</code> section
contains 1 placeholder for a GOT offset.  This GOT entry will contain the TP offset to the variable.</p>

<h4 id="after-linking-2">After Linking</h4>

<p><img src="/content/2019/tls-ie-exe.png" alt="Initial Exec Program" /></p>

<p>After linking there will be no remaining relocation entries. The <code class="language-plaintext highlighter-rouge">.text</code> section
contains the actual GOT offset and the GOT entry will contain the TP offset
to the variable.</p>

<h4 id="example-3">Example</h4>

<p>File: <a href="https://github.com/stffrdhrn/tls-examples/blob/master/tls-ie.c">tls-ie.c</a></p>

<p>The Initial Exec C code is the same as for Global Dynamic; however, IE access will
be chosen when compiling statically.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">extern</span> <span class="kr">__thread</span> <span class="kt">int</span> <span class="n">x</span><span class="p">;</span>

<span class="kt">int</span><span class="o">*</span> <span class="nf">get_x_addr</span><span class="p">()</span> <span class="p">{</span>
  <span class="k">return</span> <span class="o">&amp;</span><span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<h4 id="code-sequence-openrisc-1">Code Sequence (OpenRISC)</h4>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>00000038 &lt;get_x_addr&gt;:
  38:	9c 21 ff fc 	l.addi r1,r1,-4
  3c:	1a 20 [00 00] 	l.movhi r17,[0x0]   # 3c: R_OR1K_TLS_IE_AHI16	x
  40:	d4 01 48 00 	l.sw 0(r1),r9
  44:	04 00 00 02 	l.jal 4c &lt;get_x_addr+0x14&gt;
  48:	1a 60 [00 00] 	 l.movhi r19,[0x0]  # 48: R_OR1K_GOTPC_HI16	_GLOBAL_OFFSET_TABLE_-0x4
  4c:	aa 73 [00 00] 	l.ori r19,r19,[0x0] # 4c: R_OR1K_GOTPC_LO16	_GLOBAL_OFFSET_TABLE_
  50:	e2 73 48 00 	l.add r19,r19,r9
  54:	e2 31 98 00 	l.add r17,r17,r19
  58:	85 71 [00 00] 	l.lwz r11,[0](r17)  # 58: R_OR1K_TLS_IE_LO16	x
  5c:	85 21 00 00 	l.lwz r9,0(r1)
  60:	e1 6b 50 00 	l.add r11,r11,r10
  64:	44 00 48 00 	l.jr r9
  68:	9c 21 00 04 	 l.addi r1,r1,4
</code></pre></div></div>

<h4 id="code-sequence-x86_64-2">Code Sequence (x86_64)</h4>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0000000000000010 &lt;get_x_addr&gt;:
  10:	48 8b 05 [00 00 00 00] 	   mov    0x0(%rip),%rax   # 13: R_X86_64_GOTTPOFF	x-0x4
  17:	64 48 03 04 25 00 00 00 00 add    %fs:0x0,%rax
  20:	c3                         retq
</code></pre></div></div>

<h3 id="local-exec">Local Exec</h3>

<p>The Local Exec (LE) access model does not require traversing the TLS data
structure or a GOT entry.  It is chosen by the compiler when accessing file
local variables in the current program.</p>

<p>The Local Exec access model is the most efficient.</p>

<h4 id="before-linking-3">Before Linking</h4>

<p><img src="/content/2019/tls-le-obj.png" alt="Local Exec Object" /></p>

<p>Before linking the <code class="language-plaintext highlighter-rouge">.text</code> section contains one relocation entry for a TP
offset.</p>

<h4 id="after-linking-3">After Linking</h4>

<p><img src="/content/2019/tls-le-exe.png" alt="Local Exec Program" /></p>

<p>After linking the <code class="language-plaintext highlighter-rouge">.text</code> section contains the value of the TP offset.</p>
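<p>The only difference from Initial Exec is where the TP offset comes from.  A rough sketch in C, where
<code class="language-plaintext highlighter-rouge">toy_addr_initial_exec</code> and <code class="language-plaintext highlighter-rouge">toy_addr_local_exec</code> are hypothetical helpers and the thread pointer
is passed in as an ordinary argument (an assumption to keep the model simple):</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Initial Exec: the TP offset is read from a GOT slot at run time. */
int *toy_addr_initial_exec(uint8_t *tp, const ptrdiff_t *got, size_t slot) {
  assert(tp != NULL && got != NULL);
  return (int *)(tp + got[slot]);
}

/* Local Exec: the TP offset is a constant the linker patched into the code. */
int *toy_addr_local_exec(uint8_t *tp, ptrdiff_t tp_offset) {
  assert(tp != NULL);
  return (int *)(tp + tp_offset);
}
```

<p>Local Exec saves the memory load from the GOT, which matches the shorter code sequences shown for it below.</p>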

<h4 id="example-4">Example</h4>

<p>File: <a href="https://github.com/stffrdhrn/tls-examples/blob/master/tls-le.c">tls-le.c</a></p>

<p>In the Local Exec example the variable <code class="language-plaintext highlighter-rouge">x</code> is local; it is not <code class="language-plaintext highlighter-rouge">extern</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">__thread</span> <span class="kt">int</span> <span class="n">x</span><span class="p">;</span>

<span class="kt">int</span> <span class="o">*</span> <span class="nf">get_x_addr</span><span class="p">()</span> <span class="p">{</span>
  <span class="k">return</span> <span class="o">&amp;</span><span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<h4 id="code-sequence-openrisc-2">Code Sequence (OpenRISC)</h4>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>00000010 &lt;get_x_addr&gt;:
  10:	19 60 [00 00] 	l.movhi r11,[0x0]    # 10: R_OR1K_TLS_LE_AHI16	.LANCHOR0
  14:	e1 6b 50 00 	l.add r11,r11,r10
  18:	44 00 48 00 	l.jr r9
  1c:	9d 6b [00 00] 	 l.addi r11,r11,[0]  # 1c: R_OR1K_TLS_LE_LO16	.LANCHOR0
</code></pre></div></div>

<h4 id="code-sequence-x86_64-3">Code Sequence (x86_64)</h4>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0000000000000010 &lt;get_x_addr&gt;:
  10:	64 48 8b 04 25 00 00 00 00  mov    %fs:0x0,%rax
  19:	48 05 [00 00 00 00]    	    add    $0x0,%rax  # 1b: R_X86_64_TPOFF32	x
  1f:	c3                   	    retq
</code></pre></div></div>

<h2 id="linker-relaxation">Linker Relaxation</h2>

<p>As some TLS access methods are more efficient than others we would like to
choose the best method for each variable access.  However, we sometimes don’t know
where a variable will come from until link time.</p>

<p>On some architectures the linker will rewrite the TLS access code sequence to
change to a more efficient access model; this is called relaxation.</p>

<p>One type of relaxation performed by the linker is GD to IE relaxation.  At compile
time the GD access model may be chosen for <code class="language-plaintext highlighter-rouge">extern</code> variables.  However, at link time
the variable may be found in the same module, i.e. not in a shared object, which would require
GD access.  In this case the access model can be changed to IE.</p>

<p>That’s pretty cool.</p>

<p>The architecture I work on, <a href="https://openrisc.io/">OpenRISC</a>, does not support any
of this yet; it requires changes to the compiler and linker.  The compiler needs
to be updated to mark sections of the output <code class="language-plaintext highlighter-rouge">.text</code> that can be rewritten
(often with added <code class="language-plaintext highlighter-rouge">NOP</code> codes).  The linker needs to be updated to know how to
identify the relaxation opportunity and perform it.</p>
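<p>When we already know where a variable lives we do not have to wait for the linker.  GCC accepts a
<code class="language-plaintext highlighter-rouge">tls_model</code> variable attribute, and a <code class="language-plaintext highlighter-rouge">-ftls-model=</code> option, to request an access model explicitly.
A minimal sketch, with made-up variable and function names, assuming a GCC-compatible compiler and that
the code ends up in the main executable:</p>

```c
#include <assert.h>

/* Default: the compiler picks a model from its flags and the variable's linkage. */
static __thread int counter_default;

/* Explicit request for the Initial Exec model on this one variable. */
static __thread int counter_ie __attribute__((tls_model("initial-exec")));

/* Bumps both per-thread counters and returns their sum. */
int bump_counters(int step) {
  assert(step > 0);
  counter_default += step;
  counter_ie += 2 * step;
  return counter_default + counter_ie;
}
```

<p>Compiling this with <code class="language-plaintext highlighter-rouge">-S</code> and comparing the two accesses is a quick way to see which model was
actually used on a given target.</p>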

<h2 id="summary">Summary</h2>

<p>In this article we have covered how TLS variables are accessed per thread via
the TLS data structure.  Also, we saw how different TLS access models provide
varying levels of efficiency.</p>

<p>In the next article we will look more into how this is implemented in GCC, the
linker and the GLIBC runtime dynamic linker.</p>

<h2 id="further-reading">Further Reading</h2>
<ul>
  <li><a href="https://fuchsia.dev/fuchsia-src/development/threads/tls">Fuchsia TLS</a> - A very good overview of TLS internals</li>
  <li><a href="https://android.googlesource.com/platform/bionic/+/HEAD/docs/elf-tls.md">Android TLS</a> - Another good overview of TLS internals</li>
  <li><a href="https://docs.oracle.com/cd/E19683-01/817-3677/chapter8-20/index.html">Sun/Oracle TLS Access Models</a> - An example of how bad the old documentation was</li>
  <li><a href="https://www.akkadia.org/drepper/tls.pdf">Drepper TLS</a> (pdf) - The original TLS documentation from the GLIBC maintainer Ulrich Drepper</li>
  <li><a href="https://chao-tic.github.io/blog/2018/12/25/tls">TLS Deep Dive</a> - Another individual’s in depth notes on TLS</li>
</ul>]]></content><author><name>Stafford Horne</name></author><category term="hardware" /><category term="embedded" /><category term="openrisc" /><summary type="html"><![CDATA[This is an ongoing series of posts on ELF Binary Relocations and Thread Local Storage. This article covers only Thread Local Storage and assumes the reader has had a primer in ELF Relocations; if not, please start with my previous article, ELF Binaries and Relocation Entries.]]></summary></entry><entry><title type="html">ELF Binaries and Relocation Entries</title><link href="http://stffrdhrn.github.io/hardware/embedded/openrisc/2019/11/29/relocs.html" rel="alternate" type="text/html" title="ELF Binaries and Relocation Entries" /><published>2019-11-29T06:47:00+00:00</published><updated>2019-11-29T06:47:00+00:00</updated><id>http://stffrdhrn.github.io/hardware/embedded/openrisc/2019/11/29/relocs</id><content type="html" xml:base="http://stffrdhrn.github.io/hardware/embedded/openrisc/2019/11/29/relocs.html"><![CDATA[<p>Recently I have been working on getting the <a href="https://github.com/openrisc/or1k-glibc">OpenRISC glibc</a>
port ready for upstreaming.  Part of this work has been to run the glibc
testsuite and get the tests to pass.  The <a href="https://sourceware.org/glibc/wiki/Testing/Testsuite">glibc testsuite</a>
has a comprehensive set of linker and runtime relocation tests.</p>

<p>In order to fix issues with tests I had to learn more than I knew before about ELF Relocations,
Thread Local Storage and the binutils linker implementation in BFD.  There is a lot of
documentation available, but it’s a bit hard to follow as it assumes certain
knowledge.  For example, have a look at the Solaris <a href="https://docs.oracle.com/cd/E23824_01/html/819-0690/chapter6-54839.html">Linker and Libraries</a>
section on relocations.  In this article I will try to fill in those gaps.</p>

<p>This will be an illustrated 3 part series covering:</p>
<ul>
  <li>ELF Binaries and Relocation Entries</li>
  <li><a href="/hardware/embedded/openrisc/2020/01/19/tls.html">Thread Local Storage</a></li>
  <li><a href="/software/toolchain/openrisc/2020/07/21/relocs_tls_impl.html">How Relocations and Thread Local Store are Implemented</a></li>
</ul>

<p>All of the examples in this article can be found in my <a href="https://github.com/stffrdhrn/tls-examples">tls-examples</a>
project.  Please check it out.</p>

<p>On Linux, you can download it and <code class="language-plaintext highlighter-rouge">make</code> it with your favorite toolchain.
By default it will cross compile using an <a href="https://openrisc.io/software">openrisc toolchain</a>.
This can be overridden with the <code class="language-plaintext highlighter-rouge">CROSS_COMPILE</code> variable.
For example, to build for your current host:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git clone git@github.com:stffrdhrn/tls-examples.git
$ cd tls-examples
$ make CROSS_COMPILE=
gcc -fpic -c -o tls-gd-dynamic.o tls-gd.c -Wall -O2 -g
gcc -fpic -c -o nontls-dynamic.o nontls.c -Wall -O2 -g
...
objdump -dr x-static.o &gt; x-static.S
objdump -dr xy-static.o &gt; xy-static.S
</code></pre></div></div>

<p>Now we can get started.</p>

<h2 id="elf-segments-and-sections">ELF Segments and Sections</h2>

<p>Before we can talk about relocations we need to talk a bit about what makes up
<a href="https://en.wikipedia.org/wiki/Executable_and_Linkable_Format">ELF</a> binaries.
This is a prerequisite as relocations and TLS are part of ELF binaries.  There
are a few basic ELF binary types:</p>

<ul>
  <li>Objects (<code class="language-plaintext highlighter-rouge">.o</code>) - produced by a compiler, contains a collection of sections, also called relocatable files.</li>
  <li>Program - an executable program, contains sections grouped into segments.</li>
  <li>Shared Objects (<code class="language-plaintext highlighter-rouge">.so</code>) - a program library, contains sections grouped into segments.</li>
  <li>Core Files - core dump of program memory, these are also ELF binaries</li>
</ul>

<p>Here we will discuss Object Files and Program Files.</p>

<h3 id="an-elf-object">An ELF Object</h3>

<p><img src="/content/2019/elf-obj.png" alt="ELF Object" /></p>

<p>The compiler generates object files.  These contain sections of binary data and
are not executable.</p>

<p>The object file produced by <a href="https://gcc.gnu.org/onlinedocs/gcc-9.2.0/gcc/Overall-Options.html#index-c">gcc</a>
generally contains <code class="language-plaintext highlighter-rouge">.rela.text</code>, <code class="language-plaintext highlighter-rouge">.text</code>, <code class="language-plaintext highlighter-rouge">.data</code> and <code class="language-plaintext highlighter-rouge">.bss</code> sections.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">.rela.text</code> - a list of relocations against the <code class="language-plaintext highlighter-rouge">.text</code> section</li>
  <li><code class="language-plaintext highlighter-rouge">.text</code> - contains compiled program machine code</li>
  <li><code class="language-plaintext highlighter-rouge">.data</code> - static and non-static initialized variable values</li>
  <li><code class="language-plaintext highlighter-rouge">.bss</code>  - static and non-static uninitialized variables</li>
</ul>
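<p>As a rough illustration of what lands where, the snippet below annotates a few declarations with the
section I would typically expect from GCC.  The variable names are made up, and the exact placement is an
assumption that can vary with compiler version and flags, so it is worth confirming with <code class="language-plaintext highlighter-rouge">readelf -S</code>.</p>

```c
#include <assert.h>

int initialized_global = 42;        /* initialized, non-static: typically .data */
static int initialized_static = 7;  /* initialized, static: typically .data */
int zero_global;                    /* uninitialized: typically .bss */
static int zero_static;             /* uninitialized, static: typically .bss */
static __thread int tls_zero;       /* uninitialized TLS: typically .tbss */
static __thread int tls_init = 1;   /* initialized TLS: typically .tdata */

/* The machine code for this function goes into .text. */
int sum_all(void) {
  /* Uninitialized variables start out as zero; no bytes are stored for them in the file. */
  assert(zero_global == 0 && zero_static == 0 && tls_zero == 0);
  return initialized_global + initialized_static + tls_init;
}
```

<p>The <code class="language-plaintext highlighter-rouge">.tdata</code> and <code class="language-plaintext highlighter-rouge">.tbss</code> sections are the thread local counterparts of <code class="language-plaintext highlighter-rouge">.data</code> and <code class="language-plaintext highlighter-rouge">.bss</code>.</p>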

<h3 id="an-elf-program">An ELF Program</h3>

<p><img src="/content/2019/elf-program.png" alt="ELF Program" /></p>

<p>ELF binaries are made of <a href="https://en.wikipedia.org/wiki/Data_segment">sections</a> and segments.</p>

<p>A segment contains a group of sections and the segment defines how the data should
be loaded into memory for program execution.</p>

<p>Each segment is mapped to program memory by the kernel when a process is created.  Program files contain
most of the same sections as objects but there are some differences.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">.text</code> - contains executable program code, there is no <code class="language-plaintext highlighter-rouge">.rela.text</code> section</li>
  <li><code class="language-plaintext highlighter-rouge">.got</code>  - the <a href="https://en.wikipedia.org/wiki/Global_Offset_Table">global offset table</a> used to access variables, created during link time.  May be populated during runtime.</li>
</ul>

<h3 id="looking-at-elf-binaries-readelf">Looking at ELF binaries (<code class="language-plaintext highlighter-rouge">readelf</code>)</h3>

<p>The <code class="language-plaintext highlighter-rouge">readelf</code> tool can help inspect ELF binaries.</p>

<p>Some examples:</p>

<h4 id="reading-sections-of-an-object-file">Reading Sections of an Object File</h4>

<p>Using the <code class="language-plaintext highlighter-rouge">-S</code> option we can read sections from an ELF file.
As we can see below, we have the <code class="language-plaintext highlighter-rouge">.text</code>, <code class="language-plaintext highlighter-rouge">.rela.text</code>, <code class="language-plaintext highlighter-rouge">.bss</code> and many other
sections.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -S tls-le-static.o
There are 20 section headers, starting at offset 0x604:

Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] .text             PROGBITS        00000000 000034 000020 00  AX  0   0  4
  [ 2] .rela.text        RELA            00000000 0003f8 000030 0c   I 17   1  4
  [ 3] .data             PROGBITS        00000000 000054 000000 00  WA  0   0  1
  [ 4] .bss              NOBITS          00000000 000054 000000 00  WA  0   0  1
  [ 5] .tbss             NOBITS          00000000 000054 000004 00 WAT  0   0  4
  [ 6] .debug_info       PROGBITS        00000000 000054 000074 00      0   0  1
  [ 7] .rela.debug_info  RELA            00000000 000428 000084 0c   I 17   6  4
  [ 8] .debug_abbrev     PROGBITS        00000000 0000c8 00007c 00      0   0  1
  [ 9] .debug_aranges    PROGBITS        00000000 000144 000020 00      0   0  1
  [10] .rela.debug_arang RELA            00000000 0004ac 000018 0c   I 17   9  4
  [11] .debug_line       PROGBITS        00000000 000164 000087 00      0   0  1
  [12] .rela.debug_line  RELA            00000000 0004c4 00006c 0c   I 17  11  4
  [13] .debug_str        PROGBITS        00000000 0001eb 00007a 01  MS  0   0  1
  [14] .comment          PROGBITS        00000000 000265 00002b 01  MS  0   0  1
  [15] .debug_frame      PROGBITS        00000000 000290 000030 00      0   0  4
  [16] .rela.debug_frame RELA            00000000 000530 000030 0c   I 17  15  4
  [17] .symtab           SYMTAB          00000000 0002c0 000110 10     18  15  4
  [18] .strtab           STRTAB          00000000 0003d0 000025 00      0   0  1
  [19] .shstrtab         STRTAB          00000000 000560 0000a1 00      0   0  1
</code></pre></div></div>

<h4 id="reading-sections-of-a-program-file">Reading Sections of a Program File</h4>

<p>Using the <code class="language-plaintext highlighter-rouge">-S</code> option on a program file we can also read the sections.  The file
type does not matter; as long as it is an ELF file we can read the sections.
As we can see below, there is no longer a <code class="language-plaintext highlighter-rouge">.rela.text</code> section, but we have others
including the <code class="language-plaintext highlighter-rouge">.got</code> section.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -S tls-le-static
There are 31 section headers, starting at offset 0x32e8fc:

Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] .text             PROGBITS        000020d4 0000d4 080304 00  AX  0   0  4
  [ 2] __libc_freeres_fn PROGBITS        000823d8 0803d8 001118 00  AX  0   0  4
  [ 3] .rodata           PROGBITS        000834f0 0814f0 01544c 00   A  0   0  4
  [ 4] __libc_subfreeres PROGBITS        0009893c 09693c 000024 00   A  0   0  4
  [ 5] __libc_IO_vtables PROGBITS        00098960 096960 0002f4 00   A  0   0  4
  [ 6] __libc_atexit     PROGBITS        00098c54 096c54 000004 00   A  0   0  4
  [ 7] .eh_frame         PROGBITS        00098c58 096c58 0027a8 00   A  0   0  4
  [ 8] .gcc_except_table PROGBITS        0009b400 099400 000089 00   A  0   0  1
  [ 9] .note.ABI-tag     NOTE            0009b48c 09948c 000020 00   A  0   0  4
  [10] .tdata            PROGBITS        0009dc28 099c28 000010 00 WAT  0   0  4
  [11] .tbss             NOBITS          0009dc38 099c38 000024 00 WAT  0   0  4
  [12] .init_array       INIT_ARRAY      0009dc38 099c38 000004 04  WA  0   0  4
  [13] .fini_array       FINI_ARRAY      0009dc3c 099c3c 000008 04  WA  0   0  4
  [14] .data.rel.ro      PROGBITS        0009dc44 099c44 0003bc 00  WA  0   0  4
  [15] .data             PROGBITS        0009e000 09a000 000de0 00  WA  0   0  4
  [16] .got              PROGBITS        0009ede0 09ade0 000064 04  WA  0   0  4
  [17] .bss              NOBITS          0009ee44 09ae44 000bec 00  WA  0   0  4
  [18] __libc_freeres_pt NOBITS          0009fa30 09ae44 000014 00  WA  0   0  4
  [19] .comment          PROGBITS        00000000 09ae44 00002a 01  MS  0   0  1
  [20] .debug_aranges    PROGBITS        00000000 09ae6e 002300 00      0   0  1
  [21] .debug_info       PROGBITS        00000000 09d16e 0fd048 00      0   0  1
  [22] .debug_abbrev     PROGBITS        00000000 19a1b6 0270ca 00      0   0  1
  [23] .debug_line       PROGBITS        00000000 1c1280 0ce95c 00      0   0  1
  [24] .debug_frame      PROGBITS        00000000 28fbdc 0063bc 00      0   0  4
  [25] .debug_str        PROGBITS        00000000 295f98 011e35 01  MS  0   0  1
  [26] .debug_loc        PROGBITS        00000000 2a7dcd 06c437 00      0   0  1
  [27] .debug_ranges     PROGBITS        00000000 314204 00c900 00      0   0  1
  [28] .symtab           SYMTAB          00000000 320b04 0075d0 10     29 926  4
  [29] .strtab           STRTAB          00000000 3280d4 0066ca 00      0   0  1
  [30] .shstrtab         STRTAB          00000000 32e79e 00015c 00      0   0  1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  p (processor specific)
</code></pre></div></div>

<h4 id="reading-segments-from-a-program-file">Reading Segments from a Program File</h4>

<p>Using the <code class="language-plaintext highlighter-rouge">-l</code> option on a program file we can read the segments.
Notice how each segment maps a file offset to a memory address with a given alignment.
The two different <code class="language-plaintext highlighter-rouge">LOAD</code> type segments are segregated by read only/execute and read/write.
Each section is also mapped to a segment here.  As we can see, <code class="language-plaintext highlighter-rouge">.text</code> is in the first <code class="language-plaintext highlighter-rouge">LOAD</code> segment,
which is executable as expected.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -l tls-le-static

Elf file type is EXEC (Executable file)
Entry point 0x2104
There are 5 program headers, starting at offset 52

Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  LOAD           0x000000 0x00002000 0x00002000 0x994ac 0x994ac R E 0x2000
  LOAD           0x099c28 0x0009dc28 0x0009dc28 0x0121c 0x01e1c RW  0x2000
  NOTE           0x09948c 0x0009b48c 0x0009b48c 0x00020 0x00020 R   0x4
  TLS            0x099c28 0x0009dc28 0x0009dc28 0x00010 0x00034 R   0x4
  GNU_RELRO      0x099c28 0x0009dc28 0x0009dc28 0x003d8 0x003d8 R   0x1

 Section to Segment mapping:
  Segment Sections...
   00     .text __libc_freeres_fn .rodata __libc_subfreeres __libc_IO_vtables __libc_atexit .eh_frame .gcc_except_table .note.ABI-tag 
   01     .tdata .init_array .fini_array .data.rel.ro .data .got .bss __libc_freeres_ptrs 
   02     .note.ABI-tag 
   03     .tdata .tbss 
   04     .tdata .init_array .fini_array .data.rel.ro 
</code></pre></div></div>

<h4 id="reading-segments-from-an-object-file">Reading Segments from an Object File</h4>

<p>Using the <code class="language-plaintext highlighter-rouge">-l</code> option with an object file does not work as we can see below.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -l tls-le-static.o

There are no program headers in this file.
</code></pre></div></div>

<h2 id="relocation-entries">Relocation entries</h2>

<p>As mentioned, an object file by itself is not executable.  The main reason is that
there are no program headers, as we just saw.  Another reason is that
the <code class="language-plaintext highlighter-rouge">.text</code> section still contains relocation entries (or placeholders) for the
addresses of variables located in the <code class="language-plaintext highlighter-rouge">.data</code> and <code class="language-plaintext highlighter-rouge">.bss</code> sections.
These placeholders will just be <code class="language-plaintext highlighter-rouge">0</code> in the machine code.  So, if we tried to run
the machine code in an object file we would end up with segmentation faults (<a href="https://en.wikipedia.org/wiki/Segmentation_fault">SEGV</a>).</p>

<p>A relocation entry is a placeholder that is added by the compiler or linker when
producing ELF binaries.
The relocation entries are to be filled in with addresses pointing to data.
Relocation entries can be made in code such as the <code class="language-plaintext highlighter-rouge">.text</code> section or in data
sections like the <code class="language-plaintext highlighter-rouge">.got</code> section.</p>

<h2 id="resolving-relocations">Resolving Relocations</h2>

<p><img src="/content/2019/gcc-obj-ld.png" alt="GCC and Linker" /></p>

<p>The diagram above shows relocation entries as white circles.
Relocation entries may be filled or resolved at link-time or dynamically during execution.</p>

<p>Link time relocations</p>
<ul>
  <li>Placeholders are filled in when ELF object files are linked by the linker to create executables or libraries</li>
  <li>For example, relocation entries in <code class="language-plaintext highlighter-rouge">.text</code> sections</li>
</ul>

<p>Dynamic relocations</p>
<ul>
  <li>Placeholders are filled in at runtime by the dynamic linker, e.g. for the Procedure Linkage Table</li>
  <li>For example, relocation entries added to <code class="language-plaintext highlighter-rouge">.got</code> and <code class="language-plaintext highlighter-rouge">.plt</code> sections which link
to shared objects.</li>
</ul>

<p><em>Note: Statically built binaries do not have any dynamic relocations and are not
loaded with the dynamic linker.</em></p>

<p>In general link time relocations are used to fill in relocation entries in code.
Dynamic relocations fill in relocation entries in data sections.</p>
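<p>Filling in a link time relocation boils down to writing part of an address into an instruction word.
Here is a minimal sketch in C for a high/low 16-bit pair, modelled on a <code class="language-plaintext highlighter-rouge">l.movhi</code> plus <code class="language-plaintext highlighter-rouge">l.ori</code> style sequence.
The helper names are hypothetical, and it assumes the low half is zero-extended and that the
placeholder bits are still <code class="language-plaintext highlighter-rouge">0</code>; a real linker handles many more cases.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Fill the low 16 bits of an instruction word, assumed to be 0 before linking. */
uint32_t toy_patch_imm16(uint32_t insn, uint16_t value) {
  assert((insn & 0xffffu) == 0);  /* the placeholder must still be empty */
  return insn | value;
}

/* Resolve a hypothetical HI16/LO16 pair against the final address: symbol + addend. */
void toy_resolve_hi_lo(uint32_t *hi_insn, uint32_t *lo_insn,
                       uint32_t sym_value, int32_t addend) {
  uint32_t target = sym_value + (uint32_t)addend;
  *hi_insn = toy_patch_imm16(*hi_insn, (uint16_t)(target >> 16));
  *lo_insn = toy_patch_imm16(*lo_insn, (uint16_t)(target & 0xffffu));
}
```

<p>Dynamic relocations work on the same idea, except that the dynamic linker writes whole words into data
sections such as the <code class="language-plaintext highlighter-rouge">.got</code> instead of patching instruction fields.</p>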

<h3 id="listing-relocation-entries">Listing Relocation Entries</h3>

<p>A list of relocations in an ELF binary can be printed using <code class="language-plaintext highlighter-rouge">readelf</code> with
the <code class="language-plaintext highlighter-rouge">-r</code> option.</p>

<p>Output of <code class="language-plaintext highlighter-rouge">readelf -r tls-gd-dynamic.o</code></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Relocation section '.rela.text' at offset 0x530 contains 10 entries:
 Offset     Info    Type            Sym.Value  Sym. Name + Addend
00000000  00000f16 R_OR1K_TLS_GD_HI1 00000000   x + 0
00000008  00000f17 R_OR1K_TLS_GD_LO1 00000000   x + 0
00000020  0000100c R_OR1K_GOTPC_HI16 00000000   _GLOBAL_OFFSET_TABLE_ - 4
00000024  0000100d R_OR1K_GOTPC_LO16 00000000   _GLOBAL_OFFSET_TABLE_ + 0
0000002c  00000d0f R_OR1K_PLT26      00000000   __tls_get_addr + 0
...
</code></pre></div></div>

<p>The relocation entry list explains how and where to apply each relocation entry.
It contains:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">Offset</code> - the location in the binary that needs to be updated</li>
  <li><code class="language-plaintext highlighter-rouge">Info</code> - the encoded value containing the symbol index and the relocation <code class="language-plaintext highlighter-rouge">Type</code>, which
  <code class="language-plaintext highlighter-rouge">readelf</code> breaks down to:
    <ul>
      <li><code class="language-plaintext highlighter-rouge">Type</code> - the type of relocation (the formula for what is to be performed is defined in the
linker)</li>
      <li><code class="language-plaintext highlighter-rouge">Sym. Value</code> - the address value (if known) of the symbol.</li>
      <li><code class="language-plaintext highlighter-rouge">Sym. Name</code> - the name of the symbol (variable name) that this relocation needs to find
during link time.</li>
    </ul>
  </li>
  <li><code class="language-plaintext highlighter-rouge">Addend</code> - a value that needs to be added to the derived symbol address.
This is used with arrays (e.g. for a relocation referencing <code class="language-plaintext highlighter-rouge">a[14]</code> we would have <strong>Sym. Name</strong> <code class="language-plaintext highlighter-rouge">a</code> and an <strong>Addend</strong> of the data size of <code class="language-plaintext highlighter-rouge">a</code> times <code class="language-plaintext highlighter-rouge">14</code>)</li>
</ul>
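<p>The <code class="language-plaintext highlighter-rouge">Info</code> column can be decoded by hand.  For 32-bit ELF the symbol index is in the upper 24 bits and
the relocation type in the lower 8 bits, which is what the <code class="language-plaintext highlighter-rouge">ELF32_R_SYM</code> and <code class="language-plaintext highlighter-rouge">ELF32_R_TYPE</code> macros in
<code class="language-plaintext highlighter-rouge">&lt;elf.h&gt;</code> compute.  The sketch below uses its own hypothetical <code class="language-plaintext highlighter-rouge">toy_</code> names so it does not depend on that header.</p>

```c
#include <assert.h>
#include <stdint.h>

/* One row of the table above, in the 32-bit RELA layout (12 bytes per entry). */
struct toy_rela32 {
  uint32_t offset;  /* where to patch */
  uint32_t info;    /* symbol index and relocation type, packed together */
  int32_t addend;   /* constant added to the symbol value */
};

/* Upper 24 bits: index into the symbol table. */
uint32_t toy_rela32_sym(uint32_t info) { return info >> 8; }

/* Lower 8 bits: relocation type number. */
uint32_t toy_rela32_type(uint32_t info) { return info & 0xffu; }

/* Addend for element `index` of an array whose elements are `elem_size` bytes. */
int32_t toy_array_addend(uint32_t index, uint32_t elem_size) {
  assert(elem_size > 0);
  return (int32_t)(index * elem_size);
}
```

<p>For the first row above, <code class="language-plaintext highlighter-rouge">0x00000f16</code> splits into symbol index <code class="language-plaintext highlighter-rouge">0x0f</code> and type <code class="language-plaintext highlighter-rouge">0x16</code>.  For the array case,
<code class="language-plaintext highlighter-rouge">a[14]</code> with 4 byte elements gives an addend of 56.</p>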

<h3 id="example">Example</h3>

<p>File: <a href="https://github.com/stffrdhrn/tls-examples/blob/master/nontls.c">nontls.c</a></p>

<p>In the example below we have a simple variable and a function to access its
address.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="n">x</span><span class="p">;</span>

<span class="kt">int</span><span class="o">*</span> <span class="nf">get_x_addr</span><span class="p">()</span> <span class="p">{</span>
  <span class="k">return</span> <span class="o">&amp;</span><span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Let’s see what happens when we compile this source.</p>

<p><em>The steps to compile and link can be found in the <a href="https://github.com/stffrdhrn/tls-examples">tls-examples</a> project hosting
the source examples.</em></p>

<h3 id="before-linking">Before Linking</h3>

<p><img src="/content/2019/nontls-obj.png" alt="Non TLS Object" /></p>

<p>The diagram above shows relocations in the resulting object file as white circles.</p>

<p>In the actual output below we can see that access to the variable <code class="language-plaintext highlighter-rouge">x</code> is
referenced by a literal <code class="language-plaintext highlighter-rouge">0</code> in each instruction.  These are highlighted with
square brackets <code class="language-plaintext highlighter-rouge">[]</code> below for clarity.</p>

<p>These empty parts of the <code class="language-plaintext highlighter-rouge">.text</code> section are relocation entries.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Addr.   Machine Code    Assembly             Relocations
0000000c &lt;get_x_addr&gt;:
   c:   19 60 [00 00]   l.movhi r11,[0]      # c  R_OR1K_AHI16 .bss
  10:   44 00  48 00    l.jr r9
  14:   9d 6b [00 00]    l.addi r11,r11,[0]  # 14 R_OR1K_LO_16_IN_INSN        .bss
</code></pre></div></div>

<p>The function <code class="language-plaintext highlighter-rouge">get_x_addr</code> will return the address of variable <code class="language-plaintext highlighter-rouge">x</code>.
We can look at the assembly instructions to understand how this is done.  First, some background
on the OpenRISC ABI:</p>

<ul>
  <li>Registers are 32-bit.</li>
  <li>Function return values are placed in register <code class="language-plaintext highlighter-rouge">r11</code>.</li>
  <li>To return from a function we jump to the address in the link register <code class="language-plaintext highlighter-rouge">r9</code>.</li>
  <li>OpenRISC has a <a href="https://en.wikipedia.org/wiki/Delay_slot">branch delay slot</a>, meaning the instruction after a branch is executed
before the branch is taken.</li>
</ul>

<p>Now, let’s break down the assembly:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">l.movhi</code> - move the value <code class="language-plaintext highlighter-rouge">[0]</code> into high bits of register <code class="language-plaintext highlighter-rouge">r11</code>, clearing the lower bits.</li>
  <li><code class="language-plaintext highlighter-rouge">l.addi</code> - add the value in register <code class="language-plaintext highlighter-rouge">r11</code> to the value <code class="language-plaintext highlighter-rouge">[0]</code> and store the results in <code class="language-plaintext highlighter-rouge">r11</code>.</li>
  <li><code class="language-plaintext highlighter-rouge">l.jr</code> - jump to the address in <code class="language-plaintext highlighter-rouge">r9</code></li>
</ul>

<p>This constructs a 32-bit value out of 2 16-bit values.</p>

<h3 id="after-linking">After Linking</h3>

<p><img src="/content/2019/nontls-exe.png" alt="Non TLS Object" /></p>

<p>The diagram above shows the relocations have been replaced with actual values.</p>

<p>As we can see from the linker output, the places in the machine code that had relocation placeholders
are now replaced with values.  For example <code class="language-plaintext highlighter-rouge">19 60 00 00</code> has become <code class="language-plaintext highlighter-rouge">19 60 00 0a</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>00002298 &lt;get_x_addr&gt;:
    2298:	19 60 00 0a 	l.movhi r11,0xa
    229c:	44 00 48 00 	l.jr r9
    22a0:	9d 6b ee 60 	l.addi r11,r11,-4512
</code></pre></div></div>

<p>If we calculate <code class="language-plaintext highlighter-rouge">(0xa &lt;&lt; 16) + -4512</code>, where <code class="language-plaintext highlighter-rouge">-4512</code> is the sign-extended 16-bit value <code class="language-plaintext highlighter-rouge">0xee60</code>, we get <code class="language-plaintext highlighter-rouge">0009ee60</code>.  That is the
location of <code class="language-plaintext highlighter-rouge">x</code> within our binary.  We can check this with <code class="language-plaintext highlighter-rouge">readelf -s</code>,
which lists all symbols.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -s nontls-static | grep ' x'
    42: 0009ee60     4 OBJECT  LOCAL  DEFAULT   17 x
</code></pre></div></div>
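<p>The same calculation can be written out as a small C sketch.  The function names below are
invented for this example and only model the two instructions as the article describes them:
<code class="language-plaintext highlighter-rouge">l.movhi</code> fills the upper half and <code class="language-plaintext highlighter-rouge">l.addi</code> adds a sign-extended 16-bit immediate.</p>

```c
#include <assert.h>
#include <stdint.h>

/* l.movhi rD,K: place the 16-bit immediate in the upper half, clear the lower half. */
uint32_t l_movhi(uint16_t imm) {
  return (uint32_t)imm << 16;
}

/* l.addi rD,rA,I: add the sign-extended 16-bit immediate to rA. */
uint32_t l_addi(uint32_t ra, uint16_t imm) {
  /* Portable sign extension: 0xee60 becomes -4512. */
  int32_t simm = (int32_t)(imm ^ 0x8000u) - 0x8000;
  return ra + (uint32_t)simm;
}

/* The value returned by the linked get_x_addr: movhi 0xa, then addi 0xee60. */
uint32_t linked_x_addr(void) {
  return l_addi(l_movhi(0x000a), 0xee60);
}

/* Sanity check against the address reported by `readelf -s` above. */
void check_x_addr(void) {
  assert(linked_x_addr() == 0x0009ee60u);
}
```

<p>Because the low half is sign-extended, the high half has to be <code class="language-plaintext highlighter-rouge">0xa</code> rather than <code class="language-plaintext highlighter-rouge">0x9</code> for the sum to land on <code class="language-plaintext highlighter-rouge">0009ee60</code>.</p>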

<h2 id="types-of-relocations">Types of Relocations</h2>

<p>As we saw above, a simple program resulted in 2 different relocation entries just to compose the address of 1 variable.
We saw:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">R_OR1K_AHI16</code></li>
  <li><code class="language-plaintext highlighter-rouge">R_OR1K_LO_16_IN_INSN</code></li>
</ul>

<p>The need for different relocation types comes from the different requirements of each
relocation.  Processing a relocation usually involves a very simple transform,
and each relocation type defines a different transform.  The components of the relocation
definition are:</p>

<ul>
  <li><strong>Input</strong> The input of a relocation formula is always the <strong>Symbol Address</strong> whose absolute value is unknown at compile time. But
there may also be other input variables to the formula including:
    <ul>
      <li><strong>Program Counter</strong> The absolute address of the machine code address being updated</li>
      <li><strong>Addend</strong> The addend from the relocation entry discussed above in the <em>Listing Relocation Entries</em> section</li>
    </ul>
  </li>
  <li><strong>Formula</strong> How the input is manipulated to derive the output value.  For example shift right 16 bits.</li>
  <li><strong>Bit-Field</strong> Specifies which bits at the output address need to be updated.</li>
</ul>

<p>To be more specific about the above relocations we have:</p>

<table>
  <thead>
    <tr>
      <th>Relocation Type</th>
      <th>Bit-Field</th>
      <th>Formula</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">R_OR1K_AHI16</code></td>
      <td><code class="language-plaintext highlighter-rouge">simm16</code></td>
      <td><code class="language-plaintext highlighter-rouge">(S + 0x8000) &gt;&gt; 16</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">R_OR1K_LO_16_IN_INSN</code></td>
      <td><code class="language-plaintext highlighter-rouge">simm16</code></td>
      <td><code class="language-plaintext highlighter-rouge">S &amp; 0xffff</code></td>
    </tr>
  </tbody>
</table>

<p>The Bit-Field described above is <code class="language-plaintext highlighter-rouge">simm16</code> which means update the lower 16-bits
of the 32-bit value at the output offset and do not disturb the upper 16-bits.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> +----------+----------+
 |          |  simm16  |
 | 31    16 | 15     0 |
 +----------+----------+
</code></pre></div></div>
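<p>As a rough sketch, applying these two relocations to the instruction words from the example could look
like the code below.  The function names are hypothetical, addends are ignored, and the
<code class="language-plaintext highlighter-rouge">0x8000</code> adjustment for <code class="language-plaintext highlighter-rouge">R_OR1K_AHI16</code> reflects my understanding of the "adjusted high"
relocation: it compensates for the low half being sign-extended, which is why the linked code
above contains <code class="language-plaintext highlighter-rouge">0xa</code> even though <code class="language-plaintext highlighter-rouge">0x9ee60 &gt;&gt; 16</code> is <code class="language-plaintext highlighter-rouge">0x9</code>.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Update only the simm16 bit-field (bits 15..0) of a 32-bit instruction word,
 * leaving the opcode and register fields in bits 31..16 untouched. */
uint32_t patch_simm16(uint32_t insn, uint32_t value) {
  /* In the object file listing the field is still the placeholder 0. */
  assert((insn & 0x0000ffffu) == 0);
  return (insn & 0xffff0000u) | (value & 0x0000ffffu);
}

/* R_OR1K_AHI16: upper 16 bits of the symbol address S, adjusted by 0x8000
 * because the matching low half is sign-extended when it is added back. */
uint32_t reloc_ahi16(uint32_t insn, uint32_t s) {
  return patch_simm16(insn, (s + 0x8000u) >> 16);
}

/* R_OR1K_LO_16_IN_INSN: lower 16 bits of the symbol address S. */
uint32_t reloc_lo16(uint32_t insn, uint32_t s) {
  return patch_simm16(insn, s & 0xffffu);
}
```

<p>Feeding in the unlinked words <code class="language-plaintext highlighter-rouge">19 60 00 00</code> and <code class="language-plaintext highlighter-rouge">9d 6b 00 00</code> with <code class="language-plaintext highlighter-rouge">S = 0x0009ee60</code>
reproduces the linked words <code class="language-plaintext highlighter-rouge">19 60 00 0a</code> and <code class="language-plaintext highlighter-rouge">9d 6b ee 60</code>.</p>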

<p>There are many other Relocation Types with different Bit-Fields and Formulas.
These use different methods based on what each instruction does, and where each instruction
encodes its immediate value.</p>

<p>For full listings refer to architecture manuals.</p>

<ul>
  <li><a href="https://docs.oracle.com/cd/E19683-01/817-3677/chapter6-54839/index.html">Linkers and Libraries</a> - Oracle’s documentation on Intel and Sparc relocations</li>
  <li><a href="https://sourceware.org/binutils/docs-2.33.1/as/OpenRISC_002dRelocs.html">Binutils OpenRISC Relocs</a> - Binutils manual containing details on OpenRISC relocations</li>
  <li><a href="https://static.docs.arm.com/ihi0044/e/IHI0044E_aaelf.pdf">ELF for ARM</a>[pdf] - ARM Relocation Types table on page 25</li>
</ul>

<p>Take a look and see if you can understand how to read these now.</p>

<h2 id="summary">Summary</h2>

<p>In this article we have discussed what ELF binaries are and how they can be read.
We have talked about how from compilation to linking to runtime, relocation entries
are used to communicate which parts of a program remain to be resolved.  We
then discussed how relocation types provide a formula and bit-mask for updating
the places in ELF binaries that need to be filled in.</p>

<p>In the next article we will discuss how Thread Local Storage works.  Both link-time
and runtime relocation entries play a big part in how TLS works.</p>

<h2 id="further-reading">Further Reading</h2>
<ul>
  <li><a href="http://bottomupcs.sourceforge.net/csbu/x3735.htm">Bottom Up - Dynamic Linker</a> - Details on the Dynamic Linker, Relocations and Position Independent Code</li>
  <li><a href="https://www.technovelty.org/linux/plt-and-got-the-key-to-code-sharing-and-dynamic-libraries.html">GOT and PLT Key to Code Sharing</a> - Good overview of the <code class="language-plaintext highlighter-rouge">.got</code> and <code class="language-plaintext highlighter-rouge">.plt</code> sections</li>
</ul>]]></content><author><name>Stafford Horne</name></author><category term="hardware" /><category term="embedded" /><category term="openrisc" /><summary type="html"><![CDATA[Recently I have been working on getting the OpenRISC glibc port ready for upstreaming. Part of this work has been to run the glibc testsuite and get the tests to pass. The glibc testsuite has a comprehensive set of linker and runtime relocation tests.]]></summary></entry><entry><title type="html">OR1K Marocchino A Tomasulo Implementation</title><link href="http://stffrdhrn.github.io/hardware/embedded/openrisc/2019/10/21/or1k_marocchino_tomasulo.html" rel="alternate" type="text/html" title="OR1K Marocchino A Tomasulo Implementation" /><published>2019-10-21T15:37:00+01:00</published><updated>2019-10-21T15:37:00+01:00</updated><id>http://stffrdhrn.github.io/hardware/embedded/openrisc/2019/10/21/or1k_marocchino_tomasulo</id><content type="html" xml:base="http://stffrdhrn.github.io/hardware/embedded/openrisc/2019/10/21/or1k_marocchino_tomasulo.html"><![CDATA[<p><em>This is an ongoing series of posts on the Marocchino CPU, an open source out-of-order
<a href="https://openrisc.io">OpenRISC</a> CPU.  In this series we are reviewing the
Marocchino and its architecture.  If you haven’t already I suggest you start off
by reading the intro in <a href="/hardware/embedded/openrisc/2019/06/11/or1k_marocchino.html">Marocchino in Action</a>.</em></p>

<p>In the last article, <em>Marocchino Instruction Pipeline</em>, we discussed the
architecture of the CPU.  In this article let’s look at how Marocchino achieves
out-of-order execution using the <a href="https://en.wikipedia.org/wiki/Tomasulo_algorithm">Tomasulo algorithm</a>.</p>

<h2 id="achieving-out-of-order-execution">Achieving Out-of-Order Execution</h2>

<p>In a traditional pipelined CPU the goal is to retire one instruction
per clock cycle.  Any pipeline stall means an execution clock cycle will be lost.
One method for reducing the effect of pipeline stalls is instruction parallelization.  In 1993
the Intel <a href="https://en.wikipedia.org/wiki/P5_(microarchitecture)">Pentium</a>
processor was one of the first consumer CPUs to achieve this with its <a href="https://arstechnica.com/features/2004/07/pentium-1/">dual U
and V integer pipelines</a>.
The Pentium U and V pipelines require <a href="http://oldhome.schmorp.de/doc/opt-pairing.html">certain coding
techniques</a> to take full
advantage.  Achieving more parallelism requires more sophisticated data hazard
detection and instruction scheduling.  Introduced with the IBM System/360 Model 91 in the
60’s by Robert Tomasulo, the <em>Tomasulo algorithm</em> provides the building blocks to
allow multiple instructions to execute in parallel.  Generally speaking no special programming is needed to
take advantage of instruction parallelism on a processor implementing Tomasulo’s
algorithm.</p>

<p><img src="/content/2019/Algorithme_de_Tomasulo.png" alt="Tomasulo's algorithm" /></p>

<p>Though the technique of out-of-order CPU execution with Tomasulo’s algorithm had
been designed in the 60’s it did not make its way into popular consumer hardware
until the <a href="https://en.wikipedia.org/wiki/Pentium_Pro">Pentium Pro</a> in 1995.
Further Pentium revisions such as the Pentium III, Pentium 4 and Core
architectures build on this same out-of-order approach.  Understanding
this architecture is key to understanding modern CPUs.</p>

<p>In this article we will point out comparisons between the Marocchino and the Pentium Pro,
whose architecture can be seen in the diagram below.</p>

<p><img src="/content/2019/pentium-pro.png" alt="pentium pro diagram" /></p>

<p>The Marocchino implements the Tomasulo algorithm in a CPU that can be synthesized
and run on an FPGA.  Let’s dive into the implementation by breaking down the
building blocks used in Tomasulo’s algorithm and how they have been implemented in
Marocchino.</p>

<h2 id="tomasulo-building-blocks">Tomasulo Building blocks</h2>

<p>Besides the basic CPU modules like Instruction Fetch, Decode and Register File,
the building blocks that are used in the Tomasulo algorithm are as follows:</p>

<ul>
  <li><a href="https://en.wikipedia.org/wiki/Reservation_station">Reservation Station</a> - A
queue where decoded instructions are placed before they can be
executed.  Instructions are placed in the queue with their decoded operation
and available arguments.  If any arguments are not available the reservation
station will wait until the arguments are available before executing.</li>
  <li>Execution Units - The execution units, including the Arithmetic
Logic Unit (ALU), Memory Load/Store Unit and FPU, are responsible for performing
the instruction operation.</li>
  <li><a href="https://en.wikipedia.org/wiki/Re-order_buffer">Re-order Buffer</a> (ROB) - A ring
buffer which manages the order in which instructions are retired.  In Marocchino
the implementation is slightly simplified and called the Order Control Buffer (OCB).</li>
  <li>Instruction Ids - As an instruction is queued into the ROB (the OCB in Marocchino)
it is assigned an Instruction Id which is used to track the instruction in different
components.  In the Marocchino code this is called the <code class="language-plaintext highlighter-rouge">extadr</code>.</li>
  <li>Register Allocation Table (RAT) - A table used for data hazard resolution.
The RAT table has one cell per OpenRISC general purpose register, 32 entries.
Each RAT cell indicates if a register is busy being produced by a queued
instruction and which instruction will produce it.</li>
  <li>Common Data Bus - Execution units present their result to all reservation stations
along with the register file.  Writing to the reservation station provides
immediate resolution of data hazards.  The link between execution units,
reservation stations and register file is referred to as the common data bus.</li>
</ul>

<p>The below diagram shows how these components are arranged in the Marocchino processor.</p>

<p><img src="/content/2019/marocchino-pipeline-tomasulo.png" alt="marocchino pipeline diagram" /></p>

<h3 id="resolving-data-hazards">Resolving Data Hazards</h3>

<p>As mentioned above the goal of a pipelined architecture is to retire one
instruction per clock cycle.
<a href="https://en.wikipedia.org/wiki/Instruction_pipelining">Pipelining</a> helps achieve
this by splitting an instruction into pipeline stages i.e. Fetch, Decode,
Execute, Load/Store and Register Write Back.  If one instruction depends on the
results produced by a previous instruction there will be a problem, as register
write back of the previous instruction may not complete before registers are
read during the Decode phase of the dependent instruction.  This and other types of dependencies
between pipeline stages are called
<a href="https://en.wikipedia.org/wiki/Hazard_(computer_architecture)">hazards</a>, and
they must be avoided.</p>

<p>The Tomasulo algorithm, with its Reservation Stations, Register Allocation Table
and other building blocks, tries to avoid hazards causing pipeline stalls.  Let’s look
at a simple example to see how this is done.</p>

<ul>
  <li>instruction 1 - <code class="language-plaintext highlighter-rouge">b = a * 2</code></li>
  <li>instruction 2 - <code class="language-plaintext highlighter-rouge">x = a + b</code></li>
  <li>instruction 3 - <code class="language-plaintext highlighter-rouge">z = c / d</code></li>
</ul>

<p>Here we can see that <code class="language-plaintext highlighter-rouge">instruction 2</code> depends on <code class="language-plaintext highlighter-rouge">instruction 1</code> as the addition
of <code class="language-plaintext highlighter-rouge">a + b</code> cannot be performed until <code class="language-plaintext highlighter-rouge">b</code> is produced by <code class="language-plaintext highlighter-rouge">instruction 1</code>.
In contrast, <code class="language-plaintext highlighter-rouge">instruction 3</code> reads only <code class="language-plaintext highlighter-rouge">c</code> and <code class="language-plaintext highlighter-rouge">d</code>, which neither of the earlier instructions produces.</p>

<p>Let’s assume that <code class="language-plaintext highlighter-rouge">instruction 1</code> is currently executing on the <code class="language-plaintext highlighter-rouge">MULTIPLY</code> unit.
The CPU decodes <code class="language-plaintext highlighter-rouge">instruction 2</code>.  Instead of detecting a data hazard and stalling
the pipeline, <code class="language-plaintext highlighter-rouge">instruction 2</code> will be placed in the reservation station of the
<code class="language-plaintext highlighter-rouge">ADD</code> execution unit.  The RAT indicates that <code class="language-plaintext highlighter-rouge">b</code> is busy and being produced by
<code class="language-plaintext highlighter-rouge">instruction 1</code>.  This means <code class="language-plaintext highlighter-rouge">instruction 2</code> cannot execute right away.  Next, we
can look at <code class="language-plaintext highlighter-rouge">instruction 3</code> and place it onto the reservation station of the
<code class="language-plaintext highlighter-rouge">DIVIDE</code> execution unit.  As <code class="language-plaintext highlighter-rouge">instruction 3</code> has no hazards for <code class="language-plaintext highlighter-rouge">c</code> and <code class="language-plaintext highlighter-rouge">d</code> it
can proceed directly to execution, even before <code class="language-plaintext highlighter-rouge">instruction 2</code> is ready for
execution.</p>

<p>Note, if a required reservation station is full the pipeline will stall.</p>
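<p>The bookkeeping behind this can be sketched in a few lines of C.  This is a simplified software model,
not the Marocchino RTL: the struct and function names are invented for the example, and it only tracks
whether a register is still being produced and by which instruction id.</p>

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_REGS 32

/* One RAT cell per register: is a pending instruction going to write it,
 * and if so which instruction id.  Id 0 means "not used". */
struct rat_cell {
  bool     alloc;
  unsigned producer_id;
};

struct rat {
  struct rat_cell cell[NUM_REGS];
};

/* True if either source register is still being produced by a pending instruction. */
bool has_hazard(const struct rat *rat, unsigned src_a, unsigned src_b) {
  assert(src_a < NUM_REGS && src_b < NUM_REGS);
  return rat->cell[src_a].alloc || rat->cell[src_b].alloc;
}

/* At decode: record that instruction `id` will produce register `dest`. */
void allocate(struct rat *rat, unsigned dest, unsigned id) {
  assert(dest < NUM_REGS && id != 0);
  rat->cell[dest].alloc = true;
  rat->cell[dest].producer_id = id;
}

/* At write back: clear the allocation, unless a newer instruction
 * has claimed the same register in the meantime. */
void retire(struct rat *rat, unsigned dest, unsigned id) {
  assert(dest < NUM_REGS);
  if (rat->cell[dest].alloc && rat->cell[dest].producer_id == id) {
    rat->cell[dest].alloc = false;
    rat->cell[dest].producer_id = 0;
  }
}
```

<p>After allocating the destination of a multiply, an add that reads that destination reports a hazard and has to wait,
while an instruction whose sources are not allocated can go straight to its execution unit.</p>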

<h3 id="register-renaming">Register Renaming</h3>

<p>As mentioned above, execution units will present their output onto the common
data bus <code class="language-plaintext highlighter-rouge">wrbk_result</code> and the data will be written into reservation stations.
Writing the register to the reservation station may occur before writing
back to the register file.  This is register renaming in effect: the
register input comes from the producing instruction rather than directly from the register file.</p>
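<p>A minimal sketch of this idea in C is shown below.  The struct fields and function names are hypothetical;
the point is only that a waiting operand captures its value from the broadcast result when the instruction id matches,
without reading the register file.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One operand of a queued instruction (field names are hypothetical). */
struct operand {
  bool     hazard;       /* value is not available yet */
  unsigned wait_extadr;  /* id of the instruction that will produce it */
  uint32_t value;        /* valid once hazard is false */
};

/* Called for every result broadcast on the common data bus.  If the
 * broadcast id matches the one we wait for, take the value from the bus. */
void snoop_writeback(struct operand *op, unsigned wrbk_extadr, uint32_t wrbk_result) {
  assert(wrbk_extadr != 0);  /* id 0 is reserved as "not used" */
  if (op->hazard && op->wait_extadr == wrbk_extadr) {
    op->value  = wrbk_result;
    op->hazard = false;
  }
}

/* The queued instruction may execute once both operands are available. */
bool ready(const struct operand *a, const struct operand *b) {
  return !a->hazard && !b->hazard;
}
```

<p>Results with a different id are ignored, so an operand waiting on instruction 3 stays pending while instruction 2 writes back.</p>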

<h3 id="instruction-id">Instruction Id</h3>

<p>When an instruction is issued it may be registered in the RAT, OCB and Reservation
Station.  It is assigned an Instruction Id for tracking purposes.  In Marocchino
this is called the <code class="language-plaintext highlighter-rouge">extadr</code> and is <code class="language-plaintext highlighter-rouge">3</code> bits wide. It is generated by the simple
instruction ID generation logic.</p>

<p>This is implemented in <a href="https://github.com/openrisc/or1k_marocchino/blob/master/rtl/verilog/or1k_marocchino_oman.v#L291">or1k_marocchino_oman.v</a> with the following counter
logic which generates a new <code class="language-plaintext highlighter-rouge">extadr</code> every time an instruction is decoded.</p>

<div class="language-verilog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="c1">// extension to DEST, FLAG or CARRY</span>
  <span class="c1">// Zero value is reserved as "not used"</span>
  <span class="k">localparam</span> <span class="p">[</span><span class="n">DEST_EXTADR_WIDTH</span><span class="o">-</span><span class="mi">1</span><span class="o">:</span><span class="mi">0</span><span class="p">]</span> <span class="n">EXTADR_MAX</span> <span class="o">=</span> <span class="p">((</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="n">DEST_EXTADR_WIDTH</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
  <span class="k">localparam</span> <span class="p">[</span><span class="n">DEST_EXTADR_WIDTH</span><span class="o">-</span><span class="mi">1</span><span class="o">:</span><span class="mi">0</span><span class="p">]</span> <span class="n">EXTADR_MIN</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
  <span class="c1">// ---</span>
  <span class="kt">reg</span>  <span class="p">[</span><span class="n">DEST_EXTADR_WIDTH</span><span class="o">-</span><span class="mi">1</span><span class="o">:</span><span class="mi">0</span><span class="p">]</span> <span class="n">dcod_extadr_r</span><span class="p">;</span>
  <span class="kt">wire</span> <span class="p">[</span><span class="n">DEST_EXTADR_WIDTH</span><span class="o">-</span><span class="mi">1</span><span class="o">:</span><span class="mi">0</span><span class="p">]</span> <span class="n">extadr_adder</span><span class="p">;</span>
  <span class="c1">// ---</span>
  <span class="k">assign</span> <span class="n">extadr_adder</span> <span class="o">=</span> <span class="p">(</span><span class="n">dcod_extadr_r</span> <span class="o">==</span> <span class="n">EXTADR_MAX</span><span class="p">)</span> <span class="o">?</span> <span class="n">EXTADR_MIN</span> <span class="o">:</span> <span class="p">(</span><span class="n">dcod_extadr_r</span> <span class="o">+</span> <span class="mb">1'b1</span><span class="p">);</span>
  <span class="c1">// ---</span>
  <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">cpu_clk</span><span class="p">)</span> <span class="k">begin</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">pipeline_flush_i</span><span class="p">)</span>
      <span class="n">dcod_extadr_r</span> <span class="o">&lt;=</span> <span class="o">{</span><span class="n">DEST_EXTADR_WIDTH</span><span class="o">{</span><span class="mb">1'b0</span><span class="o">}}</span><span class="p">;</span>
    <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">padv_dcod_i</span><span class="p">)</span>
      <span class="n">dcod_extadr_r</span> <span class="o">&lt;=</span> <span class="n">fetch_valid_i</span> <span class="o">?</span> <span class="n">extadr_adder</span> <span class="o">:</span> <span class="n">dcod_extadr_r</span><span class="p">;</span>
  <span class="k">end</span> <span class="c1">// @clock</span>
  <span class="c1">// support in-1clk-unit forwarding</span>
  <span class="k">assign</span> <span class="n">dcod_extadr_o</span> <span class="o">=</span> <span class="n">dcod_extadr_r</span><span class="p">;</span>
</code></pre></div></div>
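<p>For readers less familiar with Verilog, the counter can be modelled in C roughly as follows.  This is a
behavioural sketch of the snippet above, assuming a width of 3 bits as mentioned earlier; the function names
<code class="language-plaintext highlighter-rouge">extadr_next</code> and <code class="language-plaintext highlighter-rouge">extadr_clock</code> are made up for this example.</p>

```c
#include <assert.h>

#define DEST_EXTADR_WIDTH 3u
#define EXTADR_MAX ((1u << DEST_EXTADR_WIDTH) - 1u)  /* 7 */
#define EXTADR_MIN 1u                                /* 0 is reserved as "not used" */

/* Combinational part: count up and wrap from MAX back to MIN, never to 0. */
unsigned extadr_next(unsigned current) {
  assert(current <= EXTADR_MAX);
  return (current == EXTADR_MAX) ? EXTADR_MIN : current + 1u;
}

/* One rising clock edge of the dcod_extadr_r register. */
unsigned extadr_clock(unsigned current, int pipeline_flush,
                      int padv_dcod, int fetch_valid) {
  if (pipeline_flush)
    return 0u;                    /* flush resets the id to "not used" */
  if (padv_dcod && fetch_valid)
    return extadr_next(current);  /* a new instruction was decoded */
  return current;                 /* otherwise hold the value */
}
```

<p>Starting from a flush, successive decoded instructions receive ids 1 through 7 and then wrap back to 1.</p>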

<p>Every instruction that is queued by the order manager is designated
an <code class="language-plaintext highlighter-rouge">extadr</code>.  This allows components like the reservation station and RAT tables
to track when an instruction starts and completes executing.</p>

<p>The interactions between the <code class="language-plaintext highlighter-rouge">extadr</code> and other components are as follows.</p>

<p>During decode:</p>

<ul>
  <li>the ID generator generates the <code class="language-plaintext highlighter-rouge">extadr</code> by incrementing a counter.</li>
  <li>the OCB registers the <code class="language-plaintext highlighter-rouge">extadr</code> along with other decoded instruction details.</li>
  <li>the RAT registers an <code class="language-plaintext highlighter-rouge">extadr</code> for the decoded instruction to indicate which
instruction will resolve a hazard.</li>
</ul>

<p>During execution:</p>

<ul>
  <li>the OCB broadcasts the <code class="language-plaintext highlighter-rouge">extadr</code> of the oldest instruction registered in a FIFO
fashion.  This is to indicate which instruction is to be retired and ensures
instructions are retired in order.</li>
  <li>the RAT outputs the <code class="language-plaintext highlighter-rouge">extadr</code> indicating which queued instruction will produce a register.</li>
  <li>the RAT receives an <code class="language-plaintext highlighter-rouge">extadr</code> from the OCB output to clear allocation flags.</li>
  <li>the Reservation Station receives the <code class="language-plaintext highlighter-rouge">extadr</code> with hazards to track when
instructions have finished and results are available.</li>
</ul>
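<p>The in-order retirement performed by the OCB can be pictured as a small ring buffer.  The sketch below is a
generic reorder-queue model under my own assumptions, not a translation of the Marocchino RTL: the depth of 8,
the <code class="language-plaintext highlighter-rouge">done</code> flag and all function names are hypothetical.</p>

```c
#include <assert.h>
#include <stdbool.h>

#define OCB_DEPTH 8u  /* hypothetical depth, chosen for the example */

struct ocb_entry {
  unsigned extadr;  /* instruction id */
  bool     done;    /* the execution unit has produced the result */
};

struct ocb {
  struct ocb_entry slot[OCB_DEPTH];
  unsigned head;    /* index of the oldest instruction */
  unsigned count;   /* number of queued instructions */
};

/* Decode: queue a new instruction at the tail. */
void ocb_push(struct ocb *q, unsigned extadr) {
  assert(q->count < OCB_DEPTH && extadr != 0);
  unsigned tail = (q->head + q->count) % OCB_DEPTH;
  q->slot[tail].extadr = extadr;
  q->slot[tail].done = false;
  q->count++;
}

/* Execution: instructions may finish in any order. */
void ocb_mark_done(struct ocb *q, unsigned extadr) {
  for (unsigned i = 0; i < q->count; i++) {
    struct ocb_entry *e = &q->slot[(q->head + i) % OCB_DEPTH];
    if (e->extadr == extadr)
      e->done = true;
  }
}

/* Retire: only the oldest instruction may leave the queue.
 * Returns its extadr, or 0 if nothing can retire yet. */
unsigned ocb_retire(struct ocb *q) {
  if (q->count == 0 || !q->slot[q->head].done)
    return 0;
  unsigned extadr = q->slot[q->head].extadr;
  q->head = (q->head + 1) % OCB_DEPTH;
  q->count--;
  return extadr;
}
```

<p>Even if the instruction with id 3 finishes first, nothing retires until id 1 at the head of the queue is done,
which keeps the architectural state updated in program order.</p>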

<h3 id="register-allocation-table">Register Allocation Table</h3>

<p>The register allocation table (RAT), sometimes called register alias table, keeps
track of which registers are currently in progress of being generated by pending
instructions.  This is used to derive and resolve hazards.</p>

<p>The outputs of the RAT cell are:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">rat_rd_extadr_o</code> - indicates which <code class="language-plaintext highlighter-rouge">extadr</code> instruction has been allocated to
                    generate this register.
                    This will be updated with <code class="language-plaintext highlighter-rouge">dcod_extadr_i</code> when <code class="language-plaintext highlighter-rouge">padv_exec_i</code> goes high.</li>
  <li><code class="language-plaintext highlighter-rouge">rat_rd_alloc_o</code> - indicates that this register is currently allocated to an
                   instruction which is not yet complete.
                   This will be <strong>set</strong> when <code class="language-plaintext highlighter-rouge">padv_exec_i</code> goes high, <code class="language-plaintext highlighter-rouge">dcod_rfd_we_i</code> is high,
                   and <code class="language-plaintext highlighter-rouge">dcod_rfd_adr_i</code> is equal to <code class="language-plaintext highlighter-rouge">GPR_ADR</code>.</li>
</ul>

<p><img src="/content/2019/marocchino-ratcell.png" alt="marocchino RAT Cell diagram" /></p>

<p>The RAT table is made of 32 <code class="language-plaintext highlighter-rouge">rat_cell</code> modules;  one cell per register.  The
register which the cell is allocated to is stored within <code class="language-plaintext highlighter-rouge">GPR_ADR</code> in the rat
cell.</p>

<p><img src="/content/2019/marocchino-rat.png" alt="marocchino RAT diagram" /></p>

<p>Outputs of the RAT are registered to reservation stations.  The hazards are
derived with the following logic in <a href="https://github.com/openrisc/or1k_marocchino/blob/master/rtl/verilog/or1k_marocchino_oman.v#L471">or1k_marocchino_oman.v</a>.</p>

<p>The <code class="language-plaintext highlighter-rouge">omn2dec_hazard_d1a1_o</code> hazard means that the argument <code class="language-plaintext highlighter-rouge">a</code> of the decoded
instruction will be resolved when the instruction with <code class="language-plaintext highlighter-rouge">extadr</code> in <code class="language-plaintext highlighter-rouge">omn2dec_extadr_dxa1_o</code> is
retired.  The <code class="language-plaintext highlighter-rouge">2</code> in <code class="language-plaintext highlighter-rouge">d2</code>, <code class="language-plaintext highlighter-rouge">a2</code> and <code class="language-plaintext highlighter-rouge">b2</code> represents the 2nd register used in 64-bit
FPU instructions.</p>

<div class="language-verilog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
  <span class="c1">//  # relative operand A1</span>
  <span class="k">assign</span> <span class="n">omn2dec_hazard_d1a1_o</span> <span class="o">=</span> <span class="n">rat_rd1_alloc</span><span class="p">[</span><span class="n">dcod_rfa1_adr_i</span><span class="p">]</span> <span class="o">&amp;</span> <span class="n">dcod_rfa1_req_i</span><span class="p">;</span>
  <span class="k">assign</span> <span class="n">omn2dec_hazard_d2a1_o</span> <span class="o">=</span> <span class="n">rat_rd2_alloc</span><span class="p">[</span><span class="n">dcod_rfa1_adr_i</span><span class="p">]</span> <span class="o">&amp;</span> <span class="n">dcod_rfa1_req_i</span><span class="p">;</span>
  <span class="k">assign</span> <span class="n">omn2dec_extadr_dxa1_o</span> <span class="o">=</span> <span class="n">rat_extadr</span><span class="p">[</span><span class="n">dcod_rfa1_adr_i</span><span class="p">];</span>
  <span class="c1">//  # relative operand B1</span>
  <span class="k">assign</span> <span class="n">omn2dec_hazard_d1b1_o</span> <span class="o">=</span> <span class="n">rat_rd1_alloc</span><span class="p">[</span><span class="n">dcod_rfb1_adr_i</span><span class="p">]</span> <span class="o">&amp;</span> <span class="n">dcod_rfb1_req_i</span><span class="p">;</span>
  <span class="k">assign</span> <span class="n">omn2dec_hazard_d2b1_o</span> <span class="o">=</span> <span class="n">rat_rd2_alloc</span><span class="p">[</span><span class="n">dcod_rfb1_adr_i</span><span class="p">]</span> <span class="o">&amp;</span> <span class="n">dcod_rfb1_req_i</span><span class="p">;</span>
  <span class="k">assign</span> <span class="n">omn2dec_extadr_dxb1_o</span> <span class="o">=</span> <span class="n">rat_extadr</span><span class="p">[</span><span class="n">dcod_rfb1_adr_i</span><span class="p">];</span>
  <span class="c1">//  # relative operand A2</span>
  <span class="k">assign</span> <span class="n">omn2dec_hazard_d1a2_o</span> <span class="o">=</span> <span class="n">rat_rd1_alloc</span><span class="p">[</span><span class="n">dcod_rfa2_adr_i</span><span class="p">]</span> <span class="o">&amp;</span> <span class="n">dcod_rfa2_req_i</span><span class="p">;</span>
  <span class="k">assign</span> <span class="n">omn2dec_hazard_d2a2_o</span> <span class="o">=</span> <span class="n">rat_rd2_alloc</span><span class="p">[</span><span class="n">dcod_rfa2_adr_i</span><span class="p">]</span> <span class="o">&amp;</span> <span class="n">dcod_rfa2_req_i</span><span class="p">;</span>
  <span class="k">assign</span> <span class="n">omn2dec_extadr_dxa2_o</span> <span class="o">=</span> <span class="n">rat_extadr</span><span class="p">[</span><span class="n">dcod_rfa2_adr_i</span><span class="p">];</span>
  <span class="c1">//  # relative operand B2</span>
  <span class="k">assign</span> <span class="n">omn2dec_hazard_d1b2_o</span> <span class="o">=</span> <span class="n">rat_rd1_alloc</span><span class="p">[</span><span class="n">dcod_rfb2_adr_i</span><span class="p">]</span> <span class="o">&amp;</span> <span class="n">dcod_rfb2_req_i</span><span class="p">;</span>
  <span class="k">assign</span> <span class="n">omn2dec_hazard_d2b2_o</span> <span class="o">=</span> <span class="n">rat_rd2_alloc</span><span class="p">[</span><span class="n">dcod_rfb2_adr_i</span><span class="p">]</span> <span class="o">&amp;</span> <span class="n">dcod_rfb2_req_i</span><span class="p">;</span>
  <span class="k">assign</span> <span class="n">omn2dec_extadr_dxb2_o</span> <span class="o">=</span> <span class="n">rat_extadr</span><span class="p">[</span><span class="n">dcod_rfb2_adr_i</span><span class="p">];</span>
</code></pre></div></div>

<h3 id="reservation-stations">Reservation Stations</h3>

<p><img src="/content/2019/marocchino-rsrvs.png" alt="marocchino reservation station diagram" /></p>

<p>The reservation station receives an instruction from the decode stage and queues
it until all hazards are resolved and the execution unit is free.</p>

<p>Each reservation station has one busy slot and one execution slot.  The
Pentium Pro had 20 reservation station slots; the Marocchino has 5, or 10
if you count the execution slots.</p>

<p>Reservation stations are populated when the pipeline advance signal <code class="language-plaintext highlighter-rouge">padv_rsrvs_i</code> is asserted.
An instruction may be forwarded directly to execution if there are no hazards
and the execution unit is free.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">busy_extadr_dxa_r</code> - is populated with data from <code class="language-plaintext highlighter-rouge">omn2dec_hazards_addrs_i</code>.  The <code class="language-plaintext highlighter-rouge">busy_extadr_dxa_r</code>
register represents the <code class="language-plaintext highlighter-rouge">extadr</code> to look for which will resolve the A register hazard.</li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">busy_extadr_dxb_r</code> - same as ‘A’ but indicates which <code class="language-plaintext highlighter-rouge">extadr</code> will produce the B register.</p>
  </li>
  <li><code class="language-plaintext highlighter-rouge">busy_hazard_dxa_r</code> - is populated with data from <code class="language-plaintext highlighter-rouge">omn2dec_hazards_flags_i</code>.  The <code class="language-plaintext highlighter-rouge">busy_hazard_dxa_r</code>
register represents that there is an instruction executing that will produce register A
which has not yet completed.</li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">busy_hazard_dxb_r</code> - same as ‘A’ but indicates that ‘B’ is not available yet.</p>
  </li>
  <li><code class="language-plaintext highlighter-rouge">busy_op_any_r</code> - set to <code class="language-plaintext highlighter-rouge">1</code> when <code class="language-plaintext highlighter-rouge">padv_rsrvs_i</code> goes high, indicating that
there is an operation queued.</li>
  <li><code class="language-plaintext highlighter-rouge">busy_op_r</code> - populated with <code class="language-plaintext highlighter-rouge">dcod_op_i</code>. Represents the operation pending in the queue.</li>
  <li><code class="language-plaintext highlighter-rouge">busy_rfa_r</code> - populated with data from <code class="language-plaintext highlighter-rouge">dcod_rfxx_i</code>. Represents the value of operand A pending in the queue.</li>
  <li><code class="language-plaintext highlighter-rouge">busy_rfb_r</code> - populated with data from <code class="language-plaintext highlighter-rouge">dcod_rfxx_i</code>. Represents the value of operand B pending in the queue.</li>
</ul>

<p>The reservation station resolves hazards by watching and comparing <code class="language-plaintext highlighter-rouge">wrbk_extadr_i</code>
with the <code class="language-plaintext highlighter-rouge">busy_extadr_dxa_r</code> and <code class="language-plaintext highlighter-rouge">busy_extadr_dxb_r</code> registers.  If the two match
it means that the instruction producing register A or B has finished writing back
its results and the hazard can be cleared.</p>
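
<p>The comparison described above can be sketched in a few lines of Verilog.  This is a
simplified illustration rather than the actual reservation station code: it tracks a single
operand and a single write back channel, and the module name, the <code class="language-plaintext highlighter-rouge">cpu_rst</code> port,
the <code class="language-plaintext highlighter-rouge">EXTADR_WIDTH</code> parameter and the narrowed <code class="language-plaintext highlighter-rouge">omn2dec_*</code> inputs are
assumptions made for the example.</p>

<div class="language-verilog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  // Sketch only: hazard tracking for one operand of one reservation station.
  // EXTADR_WIDTH, cpu_rst and the simplified port names are assumptions.
  module rsrvs_hazard_sketch
  #(
    parameter EXTADR_WIDTH = 3
  )
  (
    input  wire                    cpu_clk,
    input  wire                    cpu_rst,
    input  wire                    padv_rsrvs_i,         // new instruction arrives from decode
    input  wire                    omn2dec_hazard_dxa_i, // operand A is still being produced
    input  wire [EXTADR_WIDTH-1:0] omn2dec_extadr_dxa_i, // ID of the instruction producing A
    input  wire [EXTADR_WIDTH-1:0] wrbk_extadr_i,        // ID of the instruction writing back now
    output reg                     busy_hazard_dxa_r     // 1 while operand A is unresolved
  );

    reg [EXTADR_WIDTH-1:0] busy_extadr_dxa_r;

    // The hazard resolves when the awaited producer reaches write back
    wire hazard_resolved = busy_hazard_dxa_r &amp; (busy_extadr_dxa_r == wrbk_extadr_i);

    always @(posedge cpu_clk) begin
      if (cpu_rst) begin
        busy_hazard_dxa_r &lt;= 1'b0;
        busy_extadr_dxa_r &lt;= {EXTADR_WIDTH{1'b0}};
      end
      else if (padv_rsrvs_i) begin
        // register the hazard flag and the ID to wait for
        busy_hazard_dxa_r &lt;= omn2dec_hazard_dxa_i;
        busy_extadr_dxa_r &lt;= omn2dec_extadr_dxa_i;
      end
      else if (hazard_resolved) begin
        // the producer has written back, clear the hazard
        busy_hazard_dxa_r &lt;= 1'b0;
      end
    end
  endmodule
</code></pre></div></div>

<p>In this sketch the flag clears on the clock edge after the match is seen.  The case where the
producer writes back in the same cycle that the instruction is registered is deliberately
left out to keep the example short.</p>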

<p>Writeback forwarding is handled via the following Verilog multiplexer and register logic.
The first branch of the <code class="language-plaintext highlighter-rouge">always</code> block registers the decoded values <code class="language-plaintext highlighter-rouge">dcod_rf*</code> from the register
file when <code class="language-plaintext highlighter-rouge">padv_rsrvs_i</code> is high; otherwise we watch for inputs from the forwarding logic.  If there is a pending hazard,
results are forwarded from the common data bus; otherwise the current values are maintained.</p>

<div class="language-verilog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="c1">// BUSY stage operands A1 &amp; B1</span>
  <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">cpu_clk</span><span class="p">)</span> <span class="k">begin</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">padv_rsrvs_i</span><span class="p">)</span> <span class="k">begin</span>
      <span class="n">busy_rfa1_r</span> <span class="o">&lt;=</span> <span class="n">dcod_rfa1</span><span class="p">;</span>
      <span class="n">busy_rfb1_r</span> <span class="o">&lt;=</span> <span class="n">dcod_rfb1</span><span class="p">;</span>
    <span class="k">end</span>
    <span class="k">else</span> <span class="k">begin</span>
      <span class="n">busy_rfa1_r</span> <span class="o">&lt;=</span> <span class="n">busy_rfa1</span><span class="p">;</span>
      <span class="n">busy_rfb1_r</span> <span class="o">&lt;=</span> <span class="n">busy_rfb1</span><span class="p">;</span>
    <span class="k">end</span>
  <span class="k">end</span> <span class="c1">// @clock</span>

  <span class="c1">// Forwarding</span>
  <span class="c1">//  operand A1</span>
  <span class="k">assign</span> <span class="n">busy_rfa1</span> <span class="o">=</span>  <span class="n">busy_hazard_d1a1_r</span> <span class="o">?</span> <span class="n">wrbk_result1_i</span> <span class="o">:</span>
                     <span class="p">(</span><span class="n">busy_hazard_d2a1_r</span> <span class="o">?</span> <span class="n">wrbk_result2_i</span> <span class="o">:</span> <span class="n">busy_rfa1_r</span><span class="p">);</span>
  <span class="c1">//  operand B1</span>
  <span class="k">assign</span> <span class="n">busy_rfb1</span> <span class="o">=</span>  <span class="n">busy_hazard_d1b1_r</span> <span class="o">?</span> <span class="n">wrbk_result1_i</span> <span class="o">:</span>
                     <span class="p">(</span><span class="n">busy_hazard_d2b1_r</span> <span class="o">?</span> <span class="n">wrbk_result2_i</span> <span class="o">:</span> <span class="n">busy_rfb1_r</span><span class="p">);</span>
</code></pre></div></div>

<p>When all hazard flags are cleared the contents of <code class="language-plaintext highlighter-rouge">busy_op_r</code>, <code class="language-plaintext highlighter-rouge">busy_rfa_r</code> and
<code class="language-plaintext highlighter-rouge">busy_rfb_r</code> are transferred to the execution slot registers (<code class="language-plaintext highlighter-rouge">exec_op_any_r</code>, <code class="language-plaintext highlighter-rouge">exec_op_r</code>, etc.).  They
are presented on the outputs and the execution unit can take them and start processing.</p>

<p>The <code class="language-plaintext highlighter-rouge">unit_free_o</code> output signals the control unit that the reservation station
is free and can be issued another instruction.  The signal goes high when all hazards
are cleared and the busy state transfers to exec.</p>

<h3 id="execution-units">Execution Units</h3>

<p>In Marocchino the execution units (also referred to as functional units)
execute instructions which they receive from the reservation stations.</p>

<p>The execution units in Marocchino are:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">or1k_marocchino_int_1clk</code> - handles integer instructions which can complete
in 1 clock cycle. This includes <code class="language-plaintext highlighter-rouge">SHIFT</code>, <code class="language-plaintext highlighter-rouge">ADD</code>, <code class="language-plaintext highlighter-rouge">AND</code>, <code class="language-plaintext highlighter-rouge">OR</code> etc.</li>
  <li><code class="language-plaintext highlighter-rouge">or1k_marocchino_int_div</code> - handles integer <code class="language-plaintext highlighter-rouge">DIVIDE</code> operations.</li>
  <li><code class="language-plaintext highlighter-rouge">or1k_marocchino_int_mul</code> - handles integer <code class="language-plaintext highlighter-rouge">MULTIPLY</code> operations.</li>
  <li><code class="language-plaintext highlighter-rouge">or1k_marocchino_lsu</code> - handles memory load store operations.  It interfaces
with the data cache, MMU and memory bus.</li>
  <li><code class="language-plaintext highlighter-rouge">pfpu_marocchino_top</code> - handles floating point operations.  These include
<code class="language-plaintext highlighter-rouge">ADD</code>, <code class="language-plaintext highlighter-rouge">MULTIPLY</code>, <code class="language-plaintext highlighter-rouge">CMP</code>, <code class="language-plaintext highlighter-rouge">I2F</code> etc.</li>
</ul>

<p>Handshake signals between the reservation station and execution units are used
to issue operations to execution units.</p>

<p><img src="/content/2019/marocchino-handshake.png" alt="marocchino execution unit handshake diagram" /></p>

<p>The <code class="language-plaintext highlighter-rouge">taking_op_i</code> signal comes from the execution unit and indicates that it has
received the op.  The reservation station then clears all <code class="language-plaintext highlighter-rouge">exec_*_o</code> output
signals.</p>
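
<p>To make the handshake concrete, here is a rough model of the busy and execution slots that
tracks only the valid bits and leaves out the operation and operand payload.  It is a sketch
under assumptions, not the code from the repository: <code class="language-plaintext highlighter-rouge">cpu_rst</code>,
<code class="language-plaintext highlighter-rouge">dcod_hazard_i</code> and <code class="language-plaintext highlighter-rouge">hazard_clr_i</code> are hypothetical ports, and it assumes
the control unit only pushes a new operation while <code class="language-plaintext highlighter-rouge">unit_free_o</code> is high.</p>

<div class="language-verilog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  // Sketch only: valid-bit model of one reservation station.
  // dcod_hazard_i, hazard_clr_i and cpu_rst are hypothetical ports.
  module rsrvs_slots_sketch
  (
    input  wire cpu_clk,
    input  wire cpu_rst,
    input  wire padv_rsrvs_i,  // decode pushes a new operation
    input  wire dcod_hazard_i, // the pushed operation has an unresolved hazard
    input  wire hazard_clr_i,  // the queued operation's hazard resolves this cycle
    input  wire taking_op_i,   // execution unit accepts the op in the exec slot
    output wire unit_free_o,   // station can be issued another operation
    output reg  exec_op_any_r  // an operation is presented to the execution unit
  );

    reg busy_op_any_r; // an operation is waiting in the busy slot
    reg busy_hazard_r; // ... and it still has an unresolved hazard

    wire exec_free    = ~exec_op_any_r | taking_op_i;
    wire busy_ready   = busy_op_any_r &amp; (~busy_hazard_r | hazard_clr_i);
    wire busy_to_exec = busy_ready &amp; exec_free;
    // bypass the busy slot when nothing is queued and there are no hazards
    wire dcod_to_exec = padv_rsrvs_i &amp; ~busy_op_any_r &amp; ~dcod_hazard_i &amp; exec_free;

    assign unit_free_o = ~busy_op_any_r | busy_to_exec;

    always @(posedge cpu_clk) begin
      if (cpu_rst) begin
        busy_op_any_r &lt;= 1'b0;
        busy_hazard_r &lt;= 1'b0;
        exec_op_any_r &lt;= 1'b0;
      end
      else begin
        // busy slot
        if (padv_rsrvs_i &amp; ~dcod_to_exec) begin
          busy_op_any_r &lt;= 1'b1;
          busy_hazard_r &lt;= dcod_hazard_i;
        end
        else if (busy_to_exec) begin
          busy_op_any_r &lt;= 1'b0;
          busy_hazard_r &lt;= 1'b0;
        end
        else if (hazard_clr_i) begin
          busy_hazard_r &lt;= 1'b0;
        end
        // exec slot
        if (busy_to_exec | dcod_to_exec)
          exec_op_any_r &lt;= 1'b1;
        else if (taking_op_i)
          exec_op_any_r &lt;= 1'b0;
      end
    end
  endmodule
</code></pre></div></div>

<p>With both slots empty, a hazard free operation goes straight to the execution slot.  An
operation with a hazard waits in the busy slot and keeps <code class="language-plaintext highlighter-rouge">unit_free_o</code> low until it can
move on.</p>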

<h3 id="order-control-buffer">Order Control Buffer</h3>

<p><img src="/content/2019/marocchino-ocb.png" alt="marocchino order control diagram" /></p>

<p>In the Marocchino the Order Control Buffer (OCB) is the in-order retirement
unit.  It can retire a single instruction at a time.  The implementation is a
7-entry FIFO queue.  This is much less than the Pentium Pro which contains 40
slots.  The OCB receives a single instruction at a time from the decoder and
broadcasts the oldest instruction for other components to see.  Instructions are
retired after execution write back is complete.</p>
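
<p>The FIFO behaviour described above can be illustrated with a generic pointer-based queue.
This is only a sketch of the idea and makes no claim about how the real OCB is structured
internally; the entry width, the <code class="language-plaintext highlighter-rouge">cpu_rst</code> and <code class="language-plaintext highlighter-rouge">flush_i</code> ports and all other
names are hypothetical.</p>

<div class="language-verilog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  // Sketch only: a generic FIFO, not the OCB code from the repository.
  // DATA_WIDTH, cpu_rst, flush_i and all port names are hypothetical.
  module ocb_fifo_sketch
  #(
    parameter NUM_TAPS   = 7, // number of entries, 7 as described above
    parameter DATA_WIDTH = 8  // width of one entry, arbitrary for the sketch
  )
  (
    input  wire                  cpu_clk,
    input  wire                  cpu_rst,
    input  wire                  flush_i, // drop all entries
    input  wire                  write_i, // enqueue a decoded instruction
    input  wire                  read_i,  // retire the oldest instruction
    input  wire [DATA_WIDTH-1:0] ocbi_i,  // entry to enqueue
    output wire [DATA_WIDTH-1:0] ocbo_o,  // oldest entry, zeros when empty
    output wire                  empty_o,
    output wire                  full_o
  );

    reg [DATA_WIDTH-1:0] entries [0:NUM_TAPS-1];
    reg [3:0] rd_ptr; // 4 bits are enough for up to 15 entries
    reg [3:0] wr_ptr;
    reg [3:0] count;

    assign empty_o = (count == 4'd0);
    assign full_o  = (count == NUM_TAPS);
    assign ocbo_o  = empty_o ? {DATA_WIDTH{1'b0}} : entries[rd_ptr];

    wire do_write = write_i &amp; ~full_o;
    wire do_read  = read_i  &amp; ~empty_o;

    always @(posedge cpu_clk) begin
      if (cpu_rst | flush_i) begin
        rd_ptr &lt;= 4'd0;
        wr_ptr &lt;= 4'd0;
        count  &lt;= 4'd0;
      end
      else begin
        if (do_write) begin
          entries[wr_ptr] &lt;= ocbi_i;
          wr_ptr &lt;= (wr_ptr == NUM_TAPS-1) ? 4'd0 : wr_ptr + 4'd1;
        end
        if (do_read)
          rd_ptr &lt;= (rd_ptr == NUM_TAPS-1) ? 4'd0 : rd_ptr + 4'd1;
        // occupancy changes only when exactly one of write/read happens
        case ({do_write, do_read})
          2'b10:   count &lt;= count + 4'd1;
          2'b01:   count &lt;= count - 4'd1;
          default: count &lt;= count;
        endcase
      end
    end
  endmodule
</code></pre></div></div>

<p>Driving zeros on <code class="language-plaintext highlighter-rouge">ocbo_o</code> when the queue is empty is a choice made for this sketch so
that no instruction type flag appears set when nothing is waiting to retire.  For simplicity a
write to a full queue is dropped, even if a read happens in the same cycle.</p>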

<p>If the OCB output indicates a branch instruction or an exception, branch logic
is invoked.  Instead of waiting for write back to a register the write back logic
in the Marocchino will perform the branch operations.  This may include flushing
the OCB.  Special care is taken to handle branch delay slot instruction execution.</p>

<p>The OCB is different from a traditional Tomasulo Reorder Buffer (ROB) in that it
does not store any execution write back results.</p>

<p>Each OCB entry stores:</p>

<ul>
  <li>The Instruction ID <code class="language-plaintext highlighter-rouge">extadr</code></li>
  <li>The type of instruction</li>
  <li>The register destination addresses used for write back</li>
  <li>Any Fetch and Decode exceptions</li>
</ul>

<p>This can be seen as defined by the <code class="language-plaintext highlighter-rouge">ocbi</code> and <code class="language-plaintext highlighter-rouge">ocbo</code> wire buses in
<a href="https://github.com/openrisc/or1k_marocchino/blob/master/rtl/verilog/or1k_marocchino_oman.v#L1007">or1k_marocchino_oman.v</a>.</p>

<div class="language-verilog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="c1">// --- OCB-Controls input ---</span>
  <span class="kt">wire</span>  <span class="p">[</span><span class="n">OCBT_MSB</span><span class="o">:</span><span class="mi">0</span><span class="p">]</span> <span class="n">ocbi</span><span class="p">;</span>
  <span class="k">assign</span> <span class="n">ocbi</span> <span class="o">=</span>
    <span class="o">{</span>
      <span class="c1">// --- pipeline [C]ontrol flags ---</span>
      <span class="n">dcod_extadr_r</span><span class="p">,</span> <span class="c1">// OCB-Controls entrance</span>
      <span class="n">dcod_op_ls_i</span><span class="p">,</span> <span class="c1">// OCB-Controls entrance</span>
      <span class="n">dcod_op_fpxx_cmp_i</span><span class="p">,</span> <span class="c1">// OCB-Controls entrance</span>
      <span class="n">dcod_op_fpxx_arith_i</span><span class="p">,</span> <span class="c1">// OCB-Controls entrance</span>
      <span class="n">dcod_op_mul_i</span><span class="p">,</span> <span class="c1">// OCB-Controls entrance</span>
      <span class="n">dcod_op_div_i</span><span class="p">,</span> <span class="c1">// OCB-Controls entrance</span>
      <span class="n">dcod_op_1clk_i</span><span class="p">,</span> <span class="c1">// OCB-Controls entrance</span>
      <span class="n">dcod_op_jb_r</span><span class="p">,</span> <span class="c1">// OCB-Controls entrance</span>
      <span class="n">dcod_op_push_wrbk_i</span><span class="p">,</span> <span class="c1">// OCB-Controls entrance</span>
      <span class="c1">// --- instruction [A]ttributes ---</span>
      <span class="n">pc_decode_i</span><span class="p">,</span> <span class="c1">// OCB-Attributes entrance</span>
      <span class="n">dcod_rfd2_adr_i</span><span class="p">,</span> <span class="c1">// OCB-Attributes entrance</span>
      <span class="n">dcod_rfd2_we_i</span><span class="p">,</span> <span class="c1">// OCB-Attributes entrance</span>
      <span class="n">dcod_rfd1_adr_i</span><span class="p">,</span> <span class="c1">// OCB-Attributes entrance</span>
      <span class="n">dcod_rfd1_we_i</span><span class="p">,</span> <span class="c1">// OCB-Attributes entrance</span>
      <span class="n">dcod_delay_slot_i</span><span class="p">,</span> <span class="c1">// OCB-Attributes entrance</span>
      <span class="n">dcod_op_rfe_i</span><span class="p">,</span> <span class="c1">// OCB-Attributes entrance</span>
      <span class="c1">// Flag that istruction is restartable</span>
      <span class="n">interrupts_en</span><span class="p">,</span> <span class="c1">// OCB-Attributes entrance</span>
      <span class="c1">// Combined IFETCH/DECODE an exception flag</span>
      <span class="n">dcod_an_except_fd_i</span><span class="p">,</span> <span class="c1">// OCB-Attributes entrance</span>
      <span class="c1">// FETCH &amp; DECODE exceptions</span>
      <span class="n">dcod_fetch_except_ibus_err_r</span><span class="p">,</span> <span class="c1">// OCB-Attributes entrance</span>
      <span class="n">dcod_fetch_except_ipagefault_r</span><span class="p">,</span> <span class="c1">// OCB-Attributes entrance</span>
      <span class="n">dcod_fetch_except_itlb_miss_r</span><span class="p">,</span> <span class="c1">// OCB-Attributes entrance</span>
      <span class="n">dcod_except_illegal_i</span><span class="p">,</span> <span class="c1">// OCB-Attributes entrance</span>
      <span class="n">dcod_except_syscall_i</span><span class="p">,</span> <span class="c1">// OCB-Attributes entrance</span>
      <span class="n">dcod_except_trap_i</span> <span class="c1">// OCB-Attributes entrance</span>
    <span class="o">}</span><span class="p">;</span>

  <span class="c1">// --- INSN OCB input ---</span>
  <span class="kt">wire</span> <span class="p">[</span><span class="n">OCBT_MSB</span><span class="o">:</span><span class="mi">0</span><span class="p">]</span> <span class="n">ocbo</span><span class="p">;</span>
</code></pre></div></div>

<h3 id="common-data-bus">Common Data Bus</h3>

<p>As discussed above, the common data bus collects write back results from execution units
and routes them for write back.</p>

<p>This can be seen in the <a href="https://github.com/openrisc/or1k_marocchino/blob/master/rtl/verilog/or1k_marocchino_cpu.v#L1933">or1k_marocchino_cpu.v</a>
as below.</p>

<div class="language-verilog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="c1">// --- regular ---</span>
  <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="n">wrbk_1clk_result</span>       <span class="kt">or</span> <span class="n">wrbk_div_result</span> <span class="kt">or</span> <span class="n">wrbk_mul_result</span> <span class="kt">or</span>
           <span class="n">wrbk_fpxx_arith_res_hi</span> <span class="kt">or</span> <span class="n">wrbk_lsu_result</span> <span class="kt">or</span> <span class="n">wrbk_mfspr_result</span><span class="p">)</span>
  <span class="k">begin</span>
    <span class="n">wrbk_result1</span> <span class="o">=</span> <span class="n">wrbk_1clk_result</span>       <span class="o">|</span> <span class="n">wrbk_div_result</span> <span class="o">|</span> <span class="n">wrbk_mul_result</span> <span class="o">|</span>
                   <span class="n">wrbk_fpxx_arith_res_hi</span> <span class="o">|</span> <span class="n">wrbk_lsu_result</span> <span class="o">|</span> <span class="n">wrbk_mfspr_result</span><span class="p">;</span>
  <span class="k">end</span>

  <span class="c1">// --- FPU64 extention ---</span>
  <span class="k">assign</span> <span class="n">wrbk_result2</span> <span class="o">=</span> <span class="n">wrbk_fpxx_arith_res_lo</span><span class="p">;</span>

</code></pre></div></div>
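
<p>A plain OR only behaves like a multiplexer if at most one unit drives a non-zero value at any
time.  The sketch below illustrates that idea with two units, assuming that a unit's write
back result is forced to zero unless the order manager has granted it write back.  The module
and its port names are hypothetical and the masking is done combinationally for brevity.</p>

<div class="language-verilog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  // Sketch only: OR-based result bus for two hypothetical units.
  // Assumes at most one grant is active at a time.
  module cdb_or_mux_sketch
  #(
    parameter WIDTH = 32
  )
  (
    input  wire             grant_wrbk_to_a_i, // unit A owns the oldest instruction
    input  wire             grant_wrbk_to_b_i, // unit B owns the oldest instruction
    input  wire [WIDTH-1:0] result_a_i,
    input  wire [WIDTH-1:0] result_b_i,
    output wire [WIDTH-1:0] wrbk_result_o
  );

    // Each unit presents zeros unless it has been granted write back
    wire [WIDTH-1:0] wrbk_a_result = {WIDTH{grant_wrbk_to_a_i}} &amp; result_a_i;
    wire [WIDTH-1:0] wrbk_b_result = {WIDTH{grant_wrbk_to_b_i}} &amp; result_b_i;

    // With only one non-zero input the OR simply passes it through
    assign wrbk_result_o = wrbk_a_result | wrbk_b_result;
  endmodule
</code></pre></div></div>

<p>Compared to a select-based multiplexer this needs no central select logic, as each unit
decides locally whether to drive its result.</p>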

<h3 id="conclusion">Conclusion</h3>

<p>Tomasulo’s algorithm is still relevant today and used in many processors.
Marocchino provides an accessible implementation.  Marocchino is, however, not
super-scalar: while the Pentium Pro can decode up to 3 instructions at a time, the Marocchino
can only decode 1 at a time.</p>

<p>Furthermore, many improvements could be made to Marocchino to increase performance, including:</p>

<ul>
  <li>Full featured reorder buffer</li>
  <li>Parallel instruction decoding</li>
  <li>Speculative execution; <a href="https://meltdownattack.com/">or should we?</a></li>
  <li>More reservation station slots</li>
</ul>

<p>However, these come at the cost of additional area on the FPGA.  If you are interested in helping
out please feel free to contribute.</p>

<p>If anything in this article could be improved, such as more timing diagrams, typo fixes or corrections
to diagrams, please send <a href="https://twitter.com/stffrdhrn">me a message on twitter</a>.</p>

<h2 id="further-reading-and-sources">Further Reading and Sources</h2>
<ul>
  <li>Intel architecture manuals
    <ul>
      <li><a href="/content/2019/vol1ref24319002.pdf">Intel Architecture Software Developers Manual PDF</a> see
        <ul>
          <li>section 2.1 brief history of the intel architecture</li>
          <li>section 2.4 introduction to the P6 microarchitecture</li>
        </ul>
      </li>
      <li><a href="/content/2019/Intel_PentiumPro.pdf">Pentium Pro Datasheet PDF</a> see
        <ul>
          <li>section 2.2 The Pentium Pro Processor Pipeline</li>
        </ul>
      </li>
    </ul>
  </li>
  <li><a href="https://courses.cs.washington.edu/courses/csep548/06au/lectures.html">University of Washington, Computer Architecture</a>
    <ul>
      <li><a href="https://courses.cs.washington.edu/courses/cse548/06wi/slides/reorderBuf.pdf">Re-order buffer</a> - source of Pentium Pro diagram</li>
    </ul>
  </li>
  <li><a href="https://cseweb.ucsd.edu/classes/wi13/cse240a/">UCSD, Graduate Computer Architecture</a></li>
  <li><a href="https://en.wikipedia.org/wiki/Intel_Core_(microarchitecture)">Intel Core 2</a></li>
  <li><a href="https://en.wikipedia.org/wiki/Pentium_Pro">Intel Pentium Pro</a></li>
</ul>]]></content><author><name>Stafford Horne</name></author><category term="hardware" /><category term="embedded" /><category term="openrisc" /><summary type="html"><![CDATA[This is an ongoing series of posts on the Marocchino CPU, an open source out-of-order OpenRISC CPU. In this series we are reviewing the Marocchino and its architecture. If you haven’t already I suggest you start off by reading the intro in Marocchino in Action.]]></summary></entry><entry><title type="html">OR1K Marocchino Instruction Pipeline</title><link href="http://stffrdhrn.github.io/hardware/embedded/openrisc/2019/07/18/or1k_marocchino_instruction_pipeline.html" rel="alternate" type="text/html" title="OR1K Marocchino Instruction Pipeline" /><published>2019-07-18T06:43:00+01:00</published><updated>2019-07-18T06:43:00+01:00</updated><id>http://stffrdhrn.github.io/hardware/embedded/openrisc/2019/07/18/or1k_marocchino_instruction_pipeline</id><content type="html" xml:base="http://stffrdhrn.github.io/hardware/embedded/openrisc/2019/07/18/or1k_marocchino_instruction_pipeline.html"><![CDATA[<p><em>This is an ongoing series of posts on the Marocchino CPU, an open source out-of-order
<a href="https://openrisc.io">OpenRISC</a> CPU.  In this series we will review the Marocchino and its architecture.
If you haven’t already I suggest you start off by reading the intro in <a href="/hardware/embedded/openrisc/2019/06/11/or1k_marocchino.html">Marocchino in Action</a>.</em></p>

<p>In the last article, <em>Marocchino in Action</em>, we discussed the history of
the CPU and how to set up a development environment for it.  In this
article let’s look a bit deeper into how the Marocchino CPU works.</p>

<p>We will look at how an instruction flows through the Marocchino <a href="https://en.wikipedia.org/wiki/Instruction_pipelining">pipeline</a>.</p>

<h2 id="marocchino-architecture">Marocchino Architecture</h2>

<p>The Marocchino source code is available on
<a href="https://github.com/openrisc/or1k_marocchino/tree/master/rtl/verilog">github</a>
and is easy to navigate.  We have these directories:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">rtl/verilog</code> - the core verilog code, with toplevel modules
    <ul>
      <li><code class="language-plaintext highlighter-rouge">or1k_marocchino_top.v</code> - top level module, connects CPU to wishbone bus</li>
      <li><code class="language-plaintext highlighter-rouge">or1k_marocchino_cpu.v</code> - CPU module, connects CPU pipeline</li>
    </ul>
  </li>
  <li><code class="language-plaintext highlighter-rouge">rtl/verilog/pfpu_marocchino</code> - the FPU implementation
    <ul>
      <li><code class="language-plaintext highlighter-rouge">pfpu_marocchino_top.v</code> - FPU module, wires together FPU components</li>
    </ul>
  </li>
  <li><code class="language-plaintext highlighter-rouge">bench</code> - test bench harness monitor modules</li>
  <li><code class="language-plaintext highlighter-rouge">doc</code> - design documentation</li>
</ul>

<p><img src="/content/2019/marocchino-github.png" alt="marocchino github website screenshot" /></p>

<p>At first glance the Marocchino code may look like a <a href="https://en.wikipedia.org/wiki/Classic_RISC_pipeline">traditional 5 stage
RISC pipeline</a>.  It has
fetch, decode, execution, load/store and register write back modules which you
might picture in your head as follows:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  PIPELINE CTRL - progress/stall the pipeline
  (or1k_marocchino_ctrl.v)

  INSTRUCTION PIPELINE - process an instruction

                        |
                        V
/                     FETCH                     \
\ (or1k_marocchino_fetch.v)                     /
                        |
                        V
/                     DECODE                    \
\ (or1k_marocchino_decode.v)                    /
                        |
                        V
/                     EXECUTE                   \
| (or1k_marocchino_int_1clk.v) ALU              |
| (or1k_marocchino_int_div.v)  DIVISION         |
\ (or1k_marocchino_int_mul.v)  MULTIPLICATION   /
                        |
                        V
/                   LOAD STORE                  \
\ (or1k_marocchino_lsu.v)      TO/FROM RAM      /
                        |
                        V
/                   WRITE BACK                  \
\ (or1k_marocchino_rf.v)                        /

</code></pre></div></div>

<p>However, once you look a bit closer you notice some things that are different.
The top-level module
<a href="https://github.com/openrisc/or1k_marocchino/blob/master/rtl/verilog/or1k_marocchino_cpu.v">or1k_marocchino_cpu</a>
connects the modules and shows:</p>

<ul>
  <li>Between the decode and execution units there are reservation stations.</li>
  <li>Along with the control unit, there is an order manager module which provides control signals.</li>
  <li>Load/store is handled by an execution unit within the execution stage rather than by a separate pipeline stage.</li>
</ul>

<p>What this CPU actually implements is an out-of-order execution pipeline with in-order instruction
retirement, based on the <a href="https://en.wikipedia.org/wiki/Tomasulo_algorithm">Tomasulo algorithm</a>.</p>

<p>A simplified view of the CPU’s internal module layout is as per the below diagram.</p>

<p><img src="/content/2019/marocchino-pipeline.png" alt="marocchino pipeline diagram" /></p>

<h2 id="pipeline-controls">Pipeline Controls</h2>

<p>The Marocchino has two modules for coordinating how instructions propagate through the pipeline stages: the
control unit and the order manager.</p>

<h3 id="control-unit">Control Unit</h3>

<p>The control unit of the CPU is in charge of watching over the pipeline stages
and signalling when operations can transfer from one stage to the next.  The
Marocchino does this with a series of pipeline advance (<code class="language-plaintext highlighter-rouge">padv_*</code>) signals.  In
general for the best efficiency all <code class="language-plaintext highlighter-rouge">padv_*</code> wires should be high at all times
allowing instructions to progress on every clock cycle.  But as we will see in
reality, this is difficult to achieve due to pipeline stall scenarios like cache
misses and branch prediction misses.  The <code class="language-plaintext highlighter-rouge">padv_*</code> signals include:</p>

<h4 id="padv_fetch_o">padv_fetch_o</h4>

<p>The <code class="language-plaintext highlighter-rouge">padv_fetch_o</code> signal instructs the instruction fetch unit to progress.
Internally the fetch unit has 3 stages.  The instruction fetch unit interacts
with the instruction cache and instruction <a href="https://en.wikipedia.org/wiki/Memory_management_unit">memory management unit</a> (MMU).
The <code class="language-plaintext highlighter-rouge">padv_fetch_o</code> signal goes low and the pipeline stalls when the decode
module is busy (<code class="language-plaintext highlighter-rouge">dcod_empty_i</code> is low).  The signal <code class="language-plaintext highlighter-rouge">dcod_empty_i</code> comes from
the Decode module and indicates that an instruction can be accepted by the
decode stage.</p>

<p><em>Note:</em> the <code class="language-plaintext highlighter-rouge">stepping</code> and <code class="language-plaintext highlighter-rouge">pstep[]</code> signals in the code below are related to debug single stepping, and can be
ignored for our purposes.</p>

<p>This is represented by this <code class="language-plaintext highlighter-rouge">assign</code> in <a href="https://github.com/openrisc/or1k_marocchino/blob/master/rtl/verilog/or1k_marocchino_ctrl.v#L662">or1k_marocchino_ctrl.v</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  // Advance IFETCH
  // Stepping condition is close to the one for DECODE
  assign padv_fetch_o = padv_all &amp; ((~stepping) | (dcod_empty_i &amp; pstep[0])); // ADV. IFETCH

</code></pre></div></div>

<h4 id="padv_dcod_o">padv_dcod_o</h4>

<p>The <code class="language-plaintext highlighter-rouge">padv_dcod_o</code> signal instructs the instruction decode stage to output
decoded operands.  The decode unit is a single stage; if <code class="language-plaintext highlighter-rouge">padv_dcod_o</code> is high, it
will decode the instruction input every cycle.
The <code class="language-plaintext highlighter-rouge">padv_dcod_o</code> signal goes low if the destination reservation station for the
operands cannot accept an instruction.</p>

<p>This is represented by this <code class="language-plaintext highlighter-rouge">assign</code> in <a href="https://github.com/openrisc/or1k_marocchino/blob/master/rtl/verilog/or1k_marocchino_ctrl.v#L676">or1k_marocchino_ctrl.v</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  // Advance DECODE
  assign padv_dcod_o = padv_all &amp; (~wrbk_rfdx_we_i) &amp; // ADV. DECODE
    (((~stepping) &amp; dcod_free_i &amp; (dcod_empty_i | ena_dcod)) | // ADV. DECODE
       (stepping  &amp; dcod_empty_i &amp; pstep[0])); // ADV. DECODE
</code></pre></div></div>
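
<p>To see what a stage does with such a signal, here is a minimal sketch of a pipeline stage
register gated by a pipeline advance input.  It is a generic illustration and not taken from
the decode module; the module name, ports and <code class="language-plaintext highlighter-rouge">WIDTH</code> parameter are hypothetical.</p>

<div class="language-verilog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  // Sketch only: a generic stage register gated by a pipeline advance signal.
  // All names here are hypothetical.
  module padv_stage_sketch
  #(
    parameter WIDTH = 32
  )
  (
    input  wire             cpu_clk,
    input  wire             cpu_rst,
    input  wire             padv_i,    // pipeline advance from the control unit
    input  wire [WIDTH-1:0] stage_d_i, // value produced by the previous stage
    output reg  [WIDTH-1:0] stage_q_o  // value presented to the next stage
  );

    always @(posedge cpu_clk) begin
      if (cpu_rst)
        stage_q_o &lt;= {WIDTH{1'b0}};
      else if (padv_i)
        stage_q_o &lt;= stage_d_i; // advance: capture the new value
      // otherwise stall: the register keeps its previous value
    end
  endmodule
</code></pre></div></div>

<p>When the advance input is low the stage holds its output, which is what stalls everything
upstream of it.</p>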

<h4 id="padv_exec_o-and-padv__rsrvs_o">padv_exec_o and padv_*_rsrvs_o</h4>

<p>The <code class="language-plaintext highlighter-rouge">padv_exec_o</code> signal to the order manager enqueues decoded ops into the Order Control
Buffer (OCB).  The OCB is a <a href="https://en.wikipedia.org/wiki/FIFO_(computing_and_electronics)">FIFO</a>
queue which keeps track of the order in which instructions have been decoded.</p>

<p>The <code class="language-plaintext highlighter-rouge">padv_*_rsrvs_o</code> signal wired to one of the reservation stations
enables registering of an instruction into that reservation station.  There is one
<code class="language-plaintext highlighter-rouge">padv_*_rsrvs_o</code> signal and reservation station per execution unit. They are:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">padv_1clk_rsrvs_o</code> - to the reservation station for single clock ALU operations</li>
  <li><code class="language-plaintext highlighter-rouge">padv_muldiv_rsrvs_o</code> - to the reservation station for multiply and divide
operations.  Divide operations take 32 clock cycles.  Multiply operations
execute in 2 clock cycles.</li>
  <li><code class="language-plaintext highlighter-rouge">padv_fpxx_rsrvs_o</code> - to the reservation station for the <a href="https://en.wikipedia.org/wiki/Floating-point_arithmetic">floating point unit</a>
(FPU).  There are multiple FPU operations including multiply, divide, add,
subtract, comparison and conversion between integer and floating point.</li>
  <li><code class="language-plaintext highlighter-rouge">padv_lsu_rsrvs_o</code> - to the reservation station for the load store unit.  The
load store unit will load data from <a href="https://en.wikipedia.org/wiki/Random-access_memory">memory</a>
to registers or store data from registers to memory.  It interacts with the
data cache and data MMU.</li>
</ul>

<p>Both <code class="language-plaintext highlighter-rouge">padv_exec_o</code> and <code class="language-plaintext highlighter-rouge">padv_*_rsrvs_o</code> are dependent on the execution units being
ready and both signals will go high or low at the same time.</p>

<p>This is represented by the <code class="language-plaintext highlighter-rouge">assign</code> in <a href="https://github.com/openrisc/or1k_marocchino/blob/master/rtl/verilog/or1k_marocchino_ctrl.v#L690">or1k_marocchino_ctrl.v</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  // Advance EXECUTE (push OCB &amp; clean up  DECODE)
  assign padv_exec_o         = ena_exec         &amp; padv_an_exec_unit;
  // Per execution unit (or reservation station) advance
  assign padv_1clk_rsrvs_o   = ena_1clk_rsrvs   &amp; padv_an_exec_unit;
  assign padv_muldiv_rsrvs_o = ena_muldiv_rsrvs &amp; padv_an_exec_unit;
  assign padv_fpxx_rsrvs_o   = ena_fpxx_rsrvs   &amp; padv_an_exec_unit;
  assign padv_lsu_rsrvs_o    = ena_lsu_rsrvs    &amp; padv_an_exec_unit;
</code></pre></div></div>

<h4 id="padv_wrbk_o">padv_wrbk_o</h4>

<p>The <code class="language-plaintext highlighter-rouge">padv_wrbk_o</code> signal to the execution units will go active when <code class="language-plaintext highlighter-rouge">exec_valid_i</code> is active
and will finalize writing back the execution results.  The <code class="language-plaintext highlighter-rouge">padv_wrbk_o</code> signal to
the order manager will retire the oldest instruction from the OCB.</p>

<p>This is represented by this <code class="language-plaintext highlighter-rouge">assign</code> in <a href="https://github.com/openrisc/or1k_marocchino/blob/master/rtl/verilog/or1k_marocchino_ctrl.v#L703">or1k_marocchino_ctrl.v</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  // Advance Write Back latches
  wire   exec_valid_l = exec_valid_i | op_mXspr_valid;
  assign padv_wrbk_o  = exec_valid_l &amp; padv_all &amp; (~wrbk_rfdx_we_i) &amp; ((~stepping) | pstep[2]);
</code></pre></div></div>

<p>An astute reader would notice that there are no pipeline advance (<code class="language-plaintext highlighter-rouge">padv_*</code>)
signals to each of the execution units.  This is where the order manager comes
in.</p>

<h3 id="order-manager">Order Manager</h3>

<p>The order manager ensures that instructions are retired in the same order that they are decoded.
It contains a register allocation table (RAT) for hazard resolution and the OCB.
We will go into more depth on the RAT in the next article, but for now let’s look
at how the order manager interacts with the instruction pipeline flow.</p>

<h4 id="exec_valid_o">exec_valid_o</h4>

<p>As the OCB is a FIFO queue, its output port presents the oldest non-retired
instruction to the order manager.  The <code class="language-plaintext highlighter-rouge">exec_valid_o</code> signal to the control unit
goes active when the <code class="language-plaintext highlighter-rouge">*_valid_i</code> signal from an execution unit matches the type of the
instruction at the OCB output.</p>

<p>This is represented by this <code class="language-plaintext highlighter-rouge">assign</code> in <a href="https://github.com/openrisc/or1k_marocchino/blob/master/rtl/verilog/or1k_marocchino_oman.v#L505">or1k_marocchino_oman.v</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  assign exec_valid_o =
    (op_1clk_valid_l          &amp; ~ocbo[OCBTC_JUMP_OR_BRANCH_POS]) | // EXEC VALID: but wait attributes for l.jal/ljalr
    (exec_jb_attr_valid       &amp;  ocbo[OCBTC_JUMP_OR_BRANCH_POS]) | // EXEC VALID
    (div_valid_i              &amp;  ocbo[OCBTC_OP_DIV_POS])         | // EXEC VALID
    (mul_valid_i              &amp;  ocbo[OCBTC_OP_MUL_POS])         | // EXEC VALID
    (fpxx_arith_valid_i       &amp;  ocbo[OCBTC_OP_FPXX_ARITH_POS])  | // EXEC VALID
    (fpxx_cmp_valid_i         &amp;  ocbo[OCBTC_OP_FPXX_CMP_POS])    | // EXEC VALID
    (lsu_valid_i              &amp;  ocbo[OCBTC_OP_LS_POS])          | // EXEC VALID
                                 ocbo[OCBTC_OP_PUSH_WRBK_POS];     // EXEC VALID
</code></pre></div></div>

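<p>Because each <code class="language-plaintext highlighter-rouge">*_valid_i</code> term is gated by the OCB output, a unit that finishes
early cannot raise <code class="language-plaintext highlighter-rouge">exec_valid_o</code> until its instruction is the oldest one in the
OCB.  This is how the OCB helps the order manager ensure that instructions are
retired in the same order that they are decoded.</p>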
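
<p>The effect of gating on the OCB output can be sketched with a small Python model.  This is a simplification under stated assumptions, not the Marocchino implementation: the function <code class="language-plaintext highlighter-rouge">retire_order</code>, the instruction names and the finish cycles are all made up for illustration, and the model assumes at most one instruction retires per cycle.</p>

```python
from collections import deque


def retire_order(decoded, finish_cycle):
    """Return (cycle, name) pairs in the order instructions retire.

    decoded      -- instruction names in decode order (hypothetical OCB contents)
    finish_cycle -- maps each name to the cycle its unit raises *_valid_i
    """
    ocb = deque(decoded)   # FIFO: the left end models the OCB output port
    finished = set()       # units currently holding a valid result
    retired = []
    cycle = 0
    while ocb:
        # Units whose result becomes valid in this cycle.
        finished.update(n for n in decoded if finish_cycle[n] == cycle)
        # exec_valid_o: only the unit owning the OCB head is considered.
        if ocb[0] in finished:
            retired.append((cycle, ocb.popleft()))
        cycle += 1
    return retired


# The add and mul finish before the div, but must wait behind it.
print(retire_order(["div", "add", "mul"], {"div": 5, "add": 1, "mul": 3}))
```

<p>With these made-up latencies the model retires <code class="language-plaintext highlighter-rouge">div</code> in cycle 5, <code class="language-plaintext highlighter-rouge">add</code> in cycle 6 and <code class="language-plaintext highlighter-rouge">mul</code> in cycle 7, even though <code class="language-plaintext highlighter-rouge">add</code> and <code class="language-plaintext highlighter-rouge">mul</code> had their results ready earlier.</p>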

<h4 id="grant_wrbk__o">grant_wrbk_*_o</h4>

<p>The <code class="language-plaintext highlighter-rouge">grant_wrbk_*_o</code> signals to the execution units go active depending on
the type of the instruction at the OCB output port.</p>

<p>This is represented by this <code class="language-plaintext highlighter-rouge">assign</code> in
<a href="https://github.com/openrisc/or1k_marocchino/blob/master/rtl/verilog/or1k_marocchino_oman.v#L402">or1k_marocchino_oman.v</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  // Grant Write-Back-access to units
  assign grant_wrbk_to_1clk_o        = ocbo[OCBTC_OP_1CLK_POS];
  assign grant_wrbk_to_div_o         = ocbo[OCBTC_OP_DIV_POS];
  assign grant_wrbk_to_mul_o         = ocbo[OCBTC_OP_MUL_POS];
  assign grant_wrbk_to_fpxx_arith_o  = ocbo[OCBTC_OP_FPXX_ARITH_POS];
  assign grant_wrbk_to_lsu_o         = ocbo[OCBTC_OP_LS_POS];
  assign grant_wrbk_to_fpxx_cmp_o    = ocbo[OCBTC_OP_FPXX_CMP_POS];
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">grant_wrbk_*_o</code> signals, together with the <code class="language-plaintext highlighter-rouge">padv_wrbk_o</code> signal, tell an
execution unit that it can write back its result to the register file / RAT /
reservation station.</p>

<h4 id="wrbk_rfd1_we_o-wrbk_rfd2_we_o-and-wrbk_rfdx_we_o">wrbk_rfd1_we_o, wrbk_rfd2_we_o and wrbk_rfdx_we_o</h4>

<p>The <code class="language-plaintext highlighter-rouge">wrbk_rfd1_we_o</code> and <code class="language-plaintext highlighter-rouge">wrbk_rfd2_we_o</code> signals enable writeback
to the register file.  There are two signals because some 64-bit FPU instructions
require writing results to two registers.  When there is just a single register to write,
only <code class="language-plaintext highlighter-rouge">wrbk_rfd1_we_o</code> is used.  When there are two results, writing happens
in two stages: first <code class="language-plaintext highlighter-rouge">wrbk_rfd1_we_o</code> signals the write back to register 1, then in
the next cycle <code class="language-plaintext highlighter-rouge">wrbk_rfd2_we_o</code> signals the write back to register 2.</p>

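<p>The <code class="language-plaintext highlighter-rouge">wrbk_rfdx_we_o</code> signal to the control unit stalls the pipeline to allow
the second write to complete.  In the control unit it arrives as the
<code class="language-plaintext highlighter-rouge">wrbk_rfdx_we_i</code> term that gates <code class="language-plaintext highlighter-rouge">padv_wrbk_o</code>.</p>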

<p>This is represented by this logic in
<a href="https://github.com/openrisc/or1k_marocchino/blob/master/rtl/verilog/or1k_marocchino_oman.v#L1007">or1k_marocchino_oman.v</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  // instuction requests write-back
  wire exec_rfd1_we = ocbo[OCBTA_RFD1_WRBK_POS];
  wire exec_rfd2_we = ocbo[OCBTA_RFD2_WRBK_POS];

...

  // 1-clock Write-Back-pulses
  //  # for D1
  always @(posedge cpu_clk) begin
    if (padv_wrbk_i)
      wrbk_rfd1_we_o &lt;= exec_rfd1_we;
    else
      wrbk_rfd1_we_o &lt;= 1'b0;
  end // @clock

  //  # for D2 we delay WriteBack for 1-clock
  //    to split write into RF from D1
  always @(posedge cpu_clk) begin
    if (cpu_rst) begin
      wrbk_rfdx_we_o &lt;= 1'b0; // flush
      wrbk_rfd2_we_o &lt;= 1'b0; // flush
    end
    else if (wrbk_rfd2_we_o) begin
      wrbk_rfdx_we_o &lt;= 1'b0; // D2 write done
      wrbk_rfd2_we_o &lt;= 1'b0; // D2 write done
    end
    else if (wrbk_rfdx_we_o)
      wrbk_rfd2_we_o &lt;= 1'b1; // do D2 write
    else if (padv_wrbk_i)
      wrbk_rfdx_we_o &lt;= exec_rfd2_we;
  end // @clock

</code></pre></div></div>
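
<p>To make the timing of the two <code class="language-plaintext highlighter-rouge">always</code> blocks easier to follow, here is a minimal cycle-level sketch in Python.  It is a model written for this explanation, not generated from the RTL: the class <code class="language-plaintext highlighter-rouge">WrbkModel</code> and the helper <code class="language-plaintext highlighter-rouge">trace_double_write</code> are hypothetical names, and the helper assumes a single instruction that writes two registers is offered in cycle 0.</p>

```python
class WrbkModel:
    """Cycle-level model of the D1/D2 write-enable registers (sketch only)."""

    def __init__(self):
        self.rfd1_we = False   # models wrbk_rfd1_we_o
        self.rfd2_we = False   # models wrbk_rfd2_we_o
        self.rfdx_we = False   # models wrbk_rfdx_we_o

    def clock(self, padv_wrbk, exec_rfd1_we, exec_rfd2_we, rst=False):
        # Compute next values from the current ones, as non-blocking
        # assignments do, then commit them together.
        rfd1 = exec_rfd1_we if padv_wrbk else False
        rfd2, rfdx = self.rfd2_we, self.rfdx_we
        if rst or self.rfd2_we:
            rfd2, rfdx = False, False   # flush, or D2 write done
        elif self.rfdx_we:
            rfd2 = True                 # do D2 write
        elif padv_wrbk:
            rfdx = exec_rfd2_we
        self.rfd1_we, self.rfd2_we, self.rfdx_we = rfd1, rfd2, rfdx
        return self.rfd1_we, self.rfd2_we, self.rfdx_we


def trace_double_write(cycles=4):
    """Offer one two-register instruction in cycle 0 and record the outputs."""
    model = WrbkModel()
    # Each entry is (rfd1_we, rfd2_we, rfdx_we) after the clock edge.
    return [model.clock(n == 0, True, True) for n in range(cycles)]


for state in trace_double_write():
    print(state)
```

<p>In this model the first edge sets the D1 enable together with <code class="language-plaintext highlighter-rouge">rfdx_we</code>, the second edge sets the D2 enable while <code class="language-plaintext highlighter-rouge">rfdx_we</code> stays high, and the third edge clears both.  The stall signal is therefore active for two cycles per two-register write.</p>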

<p>The <code class="language-plaintext highlighter-rouge">padv_wrbk_i</code> signal from the control unit to the order manager also takes
care of dequeuing the oldest instruction from the OCB.  Once it has been dequeued and
its writebacks have completed, the instruction is said to be retired.</p>

<h2 id="conclusion">Conclusion</h2>

<p>The Marocchino instruction pipeline is not very complicated, yet it is
full featured, including caches, MMU and FPU.  We have mentioned
a few structures, such as the reservation station and RAT, which we haven’t gone into
much detail on.  These help implement out-of-order superscalar execution using
Tomasulo’s algorithm.  In the <a href="/hardware/embedded/openrisc/2019/10/21/or1k_marocchino_tomasulo.html">next article</a> we will go into more detail on these
components and how Tomasulo works.</p>]]></content><author><name>Stafford Horne</name></author><category term="hardware" /><category term="embedded" /><category term="openrisc" /><summary type="html"><![CDATA[This is an ongoing series of posts on the Marocchino CPU, an open source out-of-order OpenRISC cpu. In this series we will review the Marocchino and its architecture. If you haven’t already I suggest you start off by reading the intro in Marocchino in Action.]]></summary></entry></feed>