In the last article we introduced the OpenRISC glibc FPU port and the effort required to get user space FPU support into OpenRISC Linux user applications. We explained how the FPU port is a full-stack project covering:

Architecture Specification
Simulators and CPU implementations
Linux Kernel support
GCC Instructions and Soft FPU
Binutils/GDB Debugging Support
glibc support

In this entry we will cover updating the simulators and CPU implementations to support the architecture changes called for in the previous article:

Allowing usermode programs to update the FPCSR register
Detecting tininess before rounding

Simulator Updates

The simulators used for testing OpenRISC software without hardware are QEMU and or1ksim. They both needed to be updated to conform to the specification updates discussed above.

Or1ksim Updates

The OpenRISC architecture simulator or1ksim has been updated with a single patch: cpu: Allow FPCSR to be read/written in user mode.

The softfloat FPU implementation was already configured to detect tininess before rounding.

If you are interested you can download and run the simulator and test this out with a docker image pulled from docker hub using the following:

# using podman instead of docker, you can use docker here too
podman pull stffrdhrn/or1k-sim-env:latest
podman run -it --rm stffrdhrn/or1k-sim-env:latest

root@9a4a52eec8ee:/tmp# or1k-elf-sim -version
Seeding random generator with value 0x4a3c2bbd
OpenRISC 1000 Architectural Simulator, version 2023-08-20

This starts up an environment which has access to the OpenRISC architecture simulator and a GNU compiler toolchain. While still in the container we can run a quick test using the FPU as follows:

# Create a test program using OpenRISC FPU
cat > fpee.c <<EOF
#include <float.h>
#include <stdio.h>
#include <or1k-sprs.h>
#include <or1k-support.h>

static void enter_user_mode() {
  int32_t sr = or1k_mfspr(OR1K_SPR_SYS_SR_ADDR);
  sr &= ~OR1K_SPR_SYS_SR_SM_MASK;
  or1k_mtspr(OR1K_SPR_SYS_SR_ADDR, sr);
}
static void enable_fpu_exceptions() {
  unsigned long fpcsr = OR1K_SPR_SYS_FPCSR_FPEE_MASK;
  or1k_mtspr(OR1K_SPR_SYS_FPCSR_ADDR, fpcsr);
}
static void fpe_handler() {
  printf("Got FPU Exception, PC: 0x%lx\n", or1k_mfspr(OR1K_SPR_SYS_EPCR_BASE));
}
int main() {
  float result;

  or1k_exception_handler_add(0xd, fpe_handler);
#ifdef USER_MODE
  /* Note, printf here also allocates some memory allowing user mode runtime to
     work.  */
  printf("Enabling user mode\n");
  enter_user_mode();
#endif
  enable_fpu_exceptions();

  printf("Exceptions enabled, now DIV 3.14 / 0!\n");
  result = 3.14f / 0.0f;

  /* Verify we see infinity.  */
  printf("Result: %f\n", result);
  /* Verify we see DZF set.  */
  printf("FPCSR: %x\n", or1k_mfspr(OR1K_SPR_SYS_FPCSR_ADDR));

#ifdef USER_MODE
  asm volatile("l.movhi r3, 0; l.nop 1"); /* Exit sim, now */
#endif
  return 0;
}
EOF

# Compile and run the program
or1k-elf-gcc -g -O2 -mhard-float fpee.c -o fpee
or1k-elf-sim -f /opt/or1k/sim.cfg ./fpee

# Expected results
# Program Header: PT_LOAD, vaddr: 0x00000000, paddr: 0x0 offset: 0x00002000, filesz: 0x000065ab, memsz: 0x000065ab
# Program Header: PT_LOAD, vaddr: 0x000085ac, paddr: 0x85ac offset: 0x000085ac, filesz: 0x000000c8, memsz: 0x0000046c
# WARNING: sim_init: Debug module not enabled, cannot start remote service to GDB
# Exceptions enabled, now DIV 3.14 / 0!
# Got FPU Exception, PC: 0x2068
# Result: f
# FPCSR: 801

# Compile and run the program in USER_MODE
or1k-elf-gcc -g -O2 -mhard-float -DUSER_MODE fpee.c -o fpee
or1k-elf-sim -f /opt/or1k/sim.cfg ./fpee

# Expected results with USER_MODE
# Program Header: PT_LOAD, vaddr: 0x00000000, paddr: 0x0 offset: 0x00002000, filesz: 0x000065ab, memsz: 0x000065ab
# Program Header: PT_LOAD, vaddr: 0x000085ac, paddr: 0x85ac offset: 0x000085ac, filesz: 0x000000c8, memsz: 0x0000046c
# WARNING: sim_init: Debug module not enabled, cannot start remote service to GDB
# Enabling user mode
# Exceptions enabled, now DIV 3.14 / 0!
# Got FPU Exception, PC: 0x2068
# Result: f
# FPCSR: 801
# exit(0)

In the above we can see how to compile a simple FPU test program and run it on or1ksim. The program sets up an FPU exception handler, enables exceptions, then does a divide by zero to produce an exception. This program uses the OpenRISC newlib (baremetal) toolchain to compile a program that can run directly on the simulator, as opposed to a program running in an OS on a simulator or hardware.

Note that newlib programs normally expect to run in supervisor mode, so when our program switches to user mode we need to take some precautions to ensure it can run correctly. As noted in the comments, the newlib runtime will usually do things like disabling and enabling interrupts when allocating memory and when exiting, and these will fail when running in user mode.

QEMU Updates

The QEMU update was done in my OpenRISC user space FPCSR qemu patch series. The series was merged for the QEMU 8.1 release.

The updates were split into three changes:

Allowing FPCSR access in user mode.
Properly setting the exception PC address on floating point exceptions.
Configuring the QEMU softfloat implementation to perform tininess check before rounding.

QEMU Patch 1

The first patch to allow FPCSR access in user mode was trivial, but required some code structure changes making the patch look bigger than it really was.

QEMU Patch 2

The next patch, to properly set the exception PC address, fixed a long-standing bug where the EPCR was not properly updated after FPU exceptions. Up until now OpenRISC userspace did not support FPU instructions and this code path had not been tested.

To explain why this fix is important let us look at the EPCR and what it is used for in a bit more detail. In general, when an exception occurs an OpenRISC CPU will store the program counter (PC) of the instruction that caused the exception into the exception program counter register (EPCR). Floating point exceptions are a special case: the EPCR is set to the next instruction to be executed, which avoids looping on the faulting instruction.

When the Linux kernel handles a floating point exception it follows the path 0xd00 > fpe_trap_handler > do_fpe_trap. This will set up a signal to be delivered to the user process. The kernel uses the EPCR to report the exception instruction address to userspace via a signal, as we can see in do_fpe_trap below:

asmlinkage void do_fpe_trap(struct pt_regs *regs, unsigned long address)
{
	int code = FPE_FLTUNK;
	unsigned long fpcsr = regs->fpcsr;

	if (fpcsr & SPR_FPCSR_IVF)
		code = FPE_FLTINV;
	else if (fpcsr & SPR_FPCSR_OVF)
		code = FPE_FLTOVF;
	else if (fpcsr & SPR_FPCSR_UNF)
		code = FPE_FLTUND;
	else if (fpcsr & SPR_FPCSR_DZF)
		code = FPE_FLTDIV;
	else if (fpcsr & SPR_FPCSR_IXF)
		code = FPE_FLTRES;

	/* Clear all flags */
	regs->fpcsr &= ~SPR_FPCSR_ALLF;

	force_sig_fault(SIGFPE, code, (void __user *)regs->pc);
}

Here we see the exception becomes a SIGFPE signal and the exception address in regs->pc is passed to force_sig_fault. The PC will be used to set the si_addr field of the siginfo_t structure.

Next upon return from kernel space to user space the path is do_fpe_trap > _fpe_trap_handler > ret_from_exception > resume_userspace > work_pending > do_work_pending > restore_all.

Inside of do_work_pending is where the signal handling is done. I explain a bit about this in the article Unwinding a Bug - How C++ Exceptions Work. In restore_all we see that EPCR is where we return to when exception handling is complete. A snippet of this code is shown below:

#define RESTORE_ALL                     \
    DISABLE_INTERRUPTS(r3,r4)               ;\
    l.lwz   r3,PT_PC(r1)                    ;\
    l.mtspr r0,r3,SPR_EPCR_BASE             ;\
    l.lwz   r3,PT_SR(r1)                    ;\
    l.mtspr r0,r3,SPR_ESR_BASE              ;\
    l.lwz   r3,PT_FPCSR(r1)                 ;\
    l.mtspr r0,r3,SPR_FPCSR                 ;\
    l.lwz   r2,PT_GPR2(r1)                  ;\
    l.lwz   r3,PT_GPR3(r1)                  ;\
    l.lwz   r4,PT_GPR4(r1)                  ;\
    l.lwz   r5,PT_GPR5(r1)                  ;\
    l.lwz   r6,PT_GPR6(r1)                  ;\
    l.lwz   r7,PT_GPR7(r1)                  ;\
    l.lwz   r8,PT_GPR8(r1)                  ;\
    l.lwz   r9,PT_GPR9(r1)                  ;\
    l.lwz   r10,PT_GPR10(r1)                    ;\
    l.lwz   r11,PT_GPR11(r1)                    ;\
    l.lwz   r12,PT_GPR12(r1)                    ;\
    l.lwz   r13,PT_GPR13(r1)                    ;\
    l.lwz   r14,PT_GPR14(r1)                    ;\
    l.lwz   r15,PT_GPR15(r1)                    ;\
    l.lwz   r16,PT_GPR16(r1)                    ;\
    l.lwz   r17,PT_GPR17(r1)                    ;\
    l.lwz   r18,PT_GPR18(r1)                    ;\
    l.lwz   r19,PT_GPR19(r1)                    ;\
    l.lwz   r20,PT_GPR20(r1)                    ;\
    l.lwz   r21,PT_GPR21(r1)                    ;\
    l.lwz   r22,PT_GPR22(r1)                    ;\
    l.lwz   r23,PT_GPR23(r1)                    ;\
    l.lwz   r24,PT_GPR24(r1)                    ;\
    l.lwz   r25,PT_GPR25(r1)                    ;\
    l.lwz   r26,PT_GPR26(r1)                    ;\
    l.lwz   r27,PT_GPR27(r1)                    ;\
    l.lwz   r28,PT_GPR28(r1)                    ;\
    l.lwz   r29,PT_GPR29(r1)                    ;\
    l.lwz   r30,PT_GPR30(r1)                    ;\
    l.lwz   r31,PT_GPR31(r1)                    ;\
    l.lwz   r1,PT_SP(r1)                    ;\
    l.rfe

Here we can see how l.mtspr r0,r3,SPR_EPCR_BASE restores the EPCR to the pc address stored in pt_regs when we entered the exception handler. All other registers are restored and finally the l.rfe instruction is issued to return from the exception, which effectively jumps to EPCR.

The reason QEMU was not setting the correct exception address is due to the way QEMU is implemented, which optimizes for performance. QEMU executes target code as basic blocks that are translated to host native instructions. During runtime all PC addresses are those of the host, for example x86-64 64-bit addresses. When an exception occurs, updating the target PC address from the host PC needs to be explicitly requested.

QEMU Patch 3

The next patch, to implement tininess detection before rounding, was also trivial but brought up a conversation about default NaN payloads.

QEMU Patch 4

Wait, there is more. While writing this article I realized that if QEMU was setting the EPCR to the FPU instruction causing the exception then we would end up in an endless loop.

Luckily the architecture anticipated this, calling for FPU exceptions to set the EPCR to the next instruction to be executed. QEMU was missing this logic.

The patch target/openrisc: Set EPCR to next PC on FPE exceptions fixes this up.

RTL Updates

The actual Verilog RTL CPU implementations also needed to be updated. Updates have been made to both the mor1kx and the or1k_marocchino implementations.

mor1kx Updates

Updates to the mor1kx to support user mode reads and writes to the FPCSR were done in the patch: Make FPCSR is R/W accessible for both user- and supervisor- modes.

The full patch is:

@@ -618,7 +618,7 @@ module mor1kx_ctrl_cappuccino
            spr_fpcsr[`OR1K_FPCSR_FPEE] <= 1'b0;
          end  
          else if ((spr_we & spr_access[`OR1K_SPR_SYS_BASE] &
-                  (spr_sr[`OR1K_SPR_SR_SM] & padv_ctrl | du_access)) &&
+                  (padv_ctrl | du_access)) &&
                   `SPR_OFFSET(spr_addr)==`SPR_OFFSET(`OR1K_SPR_FPCSR_ADDR)) begin
            spr_fpcsr <= spr_write_dat[`OR1K_FPCSR_WIDTH-1:0]; // update all fields
           `ifdef OR1K_FPCSR_MASK_FLAGS

The change to the Verilog shows that before, when writing (spr_we) to the FPCSR (OR1K_SPR_FPCSR_ADDR) register, we used to check that the supervisor mode bit (OR1K_SPR_SR_SM) of the SR SPR (spr_sr) is set. That check enforced supervisor mode only write access. Removing it allows user space to write to the register.

Updating mor1kx to support tininess checking before rounding was done in the change Refactoring and implementation tininess detection before rounding. I will not go into the details of this change as I do not understand it well.

Marocchino Updates

Updates to the or1k_marocchino to support user mode reads and writes to the FPCSR were done in the patch: Make FPCSR is R/W accessible for both user- and supervisor- modes.

The full patch is:

@@ -714,7 +714,7 @@ module or1k_marocchino_ctrl
  assign except_fpu_enable_o = spr_fpcsr[`OR1K_FPCSR_FPEE];

  wire spr_fpcsr_we = (`SPR_OFFSET(({1'b0, spr_sys_group_wadr_r})) == `SPR_OFFSET(`OR1K_SPR_FPCSR_ADDR)) &
-                      spr_sys_group_we &  spr_sr[`OR1K_SPR_SR_SM];
+                      spr_sys_group_we; // FPCSR is R/W for both user- and supervisor- modes

 `ifdef OR1K_FPCSR_MASK_FLAGS
  reg [`OR1K_FPCSR_ALLF_SIZE-1:0] ctrl_fpu_mask_flags_r;

Updating the marocchino to support detecting tininess before rounding was done in the patch: Refactoring FPU Implementation for tininess detection BEFORE ROUNDING. I will not go into the details of the patch as I didn’t write it. In general it is a medium-sized refactoring of the floating point unit.

Summary

We discussed updates to the architecture simulators and Verilog CPU implementations to support user mode floating point programs. These updates will now allow us to port Linux and glibc to the OpenRISC floating point unit.

shorne in japan

OpenRISC FPU Port - Fixing Hardware