shorne in japan

blog archive about resume

OR1K Marocchino Instruction Pipeline

18 Jul 2019

This is an ongoing series of posts on the Marocchino CPU, an open source out-of-order OpenRISC cpu. In this series we will review the Marocchino and it’s architecture. If you haven’t already I suggest you start of by reading the intro in Marocchino in Action.

In the last article, Marocchino in Action we discussed the history of the CPU and how to setup setup a development environment for it. In this article let’s look a bit deeper into how the Marocchino CPU works.

We will look at how an instruction flows through the Marocchino pipeline.

Marocchino Architecture

The Marocchino source code is available on github and is easy to navigate. We have these directories:

  • rtl/verilog - the core verilog code, with toplevel modules
    • or1k_marocchino_top.v - top level module, connects CPU to wishbone bus
    • or1k_marocchino_cpu.v - CPU module, connects CPU pipeline
  • rtl/verilog/pfpu_marocchino - the FPU implementation
    • pfpu_marocchino_top.v - FPU module, wires together FPU components
  • bench - test bench harness monitor modules
  • doc - design documentation

marocchino github website screenshot

At first glance of the code the Marocchino may look like a traditional 5 stage RISC pipeline. It has fetch, decode, execution, load/store and register write back modules which you might picture in your head as follows:

  PIPELINE CRTL - progress/stall the pipeline
  (or1k_marocchino_ctrl.v)

  INSTRUCTION PIPELINE - process an instruction

                        |
                        V
/                     FETCH                     \
\ (or1k_marocchino_fetch.v)                     /
                        |
                        V
/                     DECODE                    \
\ (or1k_marocchino_decode.v)                    /
                        |
                        V
/                     EXECUTE                   \
| (or1k_marocchino_int_1clk.v) ALU              |
| (or1k_marocchino_int_div.v)  DIVISION         |
\ (or1k_marocchino_int_mul.v)  MULTIPLICATION   /
                        |
                        V
/                   LOAD STORE                  \
\ (or1k_marocchino_lsu.v)      TO/FROM RAM      /
                        |
                        V
/                   WRITE BACK                  \
\ (or1k_marocchino_rf.v)                        /

However, once you look a bit closer you notice some things that are different. The top-level module or1k_marocchino_cpu connects the modules and shows:

  • Between the decode and execution units there are reservation stations.
  • Along with the control unit, there is an order manager module which provides control signals.
  • The load/store execution unit is done as part of the execution stage.

What this CPU is is a super-scalar instruction pipeline with in order instruction retirement implementing the Tomasulo algorithm.

A simplified view of the CPU’s internal module layout is as per the below diagram.

marocchino pipeline diagram

Pipeline Controls

The marocchino has two modules for coordinating pipeline stage instruction propagation. The control unit and the order manager.

Control Unit

The control unit of the CPU is in charge of watching over the pipeline stages and signalling when operations can transfer from one stage to the next. The Marocchino does this with a series of pipeline advance (padv_*) signals. In general for the best efficiency all padv_* wires should be high at all times allowing instructions to progress on every clock cycle. But as we will see in reality, this is difficult to achieve due to pipeline stall scenarios like cache misses and branch prediction misses. The padv_* signals include:

padv_fetch_o

The padv_fetch_o signal instructs the instruction fetch unit to progress. Internally the fetch unit has 3 stages. The instruction fetch unit interacts with the instruction cache and instruction memory management unit (MMU). The padv_fetch_o signal goes low and the pipeline stalls when the decode module is busy (dcod_emtpy_i is low). The signal dcod_empty_i comes from the Decode module and indicates that an instruction can be accepted by the decode stage.

This is represented by this assign in or1k_marocchino_ctrl.v:

Note The stepping and pstep[] signals are related to debug single stepping, and can be ignored for our purposes.

  // Advance IFETCH
  // Stepping condition is close to the one for DECODE
  assign padv_fetch_o = padv_all & ((~stepping) | (dcod_empty_i & pstep[0])); // ADV. IFETCH

padv_dcod_o

The padv_dcod_o signal instructs the instruction decode stage to output decoded operands. The decode unit is one stage, if padv_dcod_o is high, it will decode the instruction input every cycle. The padv_dcod_o signal goes low if the destination reservation station for the operands cannot accept an instruction.

This is represented by this assign in or1k_marocchino_ctrl.v:

  // Advance DECODE
  assign padv_dcod_o = padv_all & (~wrbk_rfdx_we_i) & // ADV. DECODE
    (((~stepping) & dcod_free_i & (dcod_empty_i | ena_dcod)) | // ADV. DECODE
       (stepping  & dcod_empty_i & pstep[0])); // ADV. DECODE

padv_exec_o and padv_*_rsrvs_o

The padv_exec_o signal to order manager enqueues decoded ops into the Order Control Buffer (OCB). The OCB is a FIFO queue which keeps track of the order instructions have been decoded.

The padv_*_rsrvs_o signal wired one of the reservation stations enables registering of an instruction into a reservation station. There is one padv_*_rsrvs_o signal and reservation station per execution unit. They are:

  • padv_1clk_rsrvs_o - to the reservation station for single clock ALU operations
  • padv_muldiv_rsrvs_o - to the reservation station for multiply and divide operations. Divide operations take 32 clock cycles. Multiply operations execute with 2 clock cycles.
  • padv_fpxx_rsrvs_o - to the reservation station for the floating point unit (FPU). There are multiple FPU operations including multiply, divide, add, subtract, comparison and conversion between integer and floating point.
  • padv_lsu_rsrvs_o - to the reservation station for the load store unit. The load store unit will load data from memory to registers or store data from registers to memory. It interacts with the data cache and data MMU.

Both padv_exec_o and padv_*_rsrvs_o are dependent on the execution units being ready and both signals will go high or low at the same time.

This is represented by the assign in or1k_marocchino_ctrl.v:

  // Advance EXECUTE (push OCB & clean up  DECODE)
  assign padv_exec_o         = ena_exec         & padv_an_exec_unit;
  // Per execution unit (or reservation station) advance
  assign padv_1clk_rsrvs_o   = ena_1clk_rsrvs   & padv_an_exec_unit;
  assign padv_muldiv_rsrvs_o = ena_muldiv_rsrvs & padv_an_exec_unit;
  assign padv_fpxx_rsrvs_o   = ena_fpxx_rsrvs   & padv_an_exec_unit;
  assign padv_lsu_rsrvs_o    = ena_lsu_rsrvs    & padv_an_exec_unit;

padv_wrbk_o

The padv_wrbk_o signal to the execution units will go active when exec_valid_i is active and will finalize writing back the execution results. The padv_wrbk_o signal to the order manager will retire the oldest instruction from the OCB.

This is represented by this assign in or1k_marocchino_ctrl.v:

  // Advance Write Back latches
  wire   exec_valid_l = exec_valid_i | op_mXspr_valid;
  assign padv_wrbk_o  = exec_valid_l & padv_all & (~wrbk_rfdx_we_i) & ((~stepping) | pstep[2]);

An astute reader would notice that there are no pipeline advance (padv_*) signals to each of the execution units. This is where the order manager comes in.

Order Manager

The order manager ensures that instructions are retired in the same order that they are decoded. It contains a register allocation table (RAT) for hazard resolution and the OCB. We will go into more depth on the RAT in the next article, but for now let’s look at how the order manager interacts with the instruction pipeline flow.

exec_valid_o

As the OCB is a FIFO queue the output port presents the oldest non retired instruction to the order manager. The exec_valid_o signal to the control unit will go active when the *_valid_i signal from the execution unit and the OCB output instruction match.

This is represented by this assign in or1k_marocchino_oman.v:

  assign exec_valid_o =
    (op_1clk_valid_l          & ~ocbo[OCBTC_JUMP_OR_BRANCH_POS]) | // EXEC VALID: but wait attributes for l.jal/ljalr
    (exec_jb_attr_valid       &  ocbo[OCBTC_JUMP_OR_BRANCH_POS]) | // EXEC VALID
    (div_valid_i              &  ocbo[OCBTC_OP_DIV_POS])         | // EXEC VALID
    (mul_valid_i              &  ocbo[OCBTC_OP_MUL_POS])         | // EXEC VALID
    (fpxx_arith_valid_i       &  ocbo[OCBTC_OP_FPXX_ARITH_POS])  | // EXEC VALID
    (fpxx_cmp_valid_i         &  ocbo[OCBTC_OP_FPXX_CMP_POS])    | // EXEC VALID
    (lsu_valid_i              &  ocbo[OCBTC_OP_LS_POS])          | // EXEC VALID
                                 ocbo[OCBTC_OP_PUSH_WRBK_POS];     // EXEC VALID

The OCB helps the order manager ensure that instructions are retired in the same order that they are decoded.

grant_wrbk_*_o

The grant_wrbk_*_o signal to the execution units will go active depending on the OCB output port instruction.

This is represented by this assign in or1k_marocchino_oman.v:

  // Grant Write-Back-access to units
  assign grant_wrbk_to_1clk_o        = ocbo[OCBTC_OP_1CLK_POS];
  assign grant_wrbk_to_div_o         = ocbo[OCBTC_OP_DIV_POS];
  assign grant_wrbk_to_mul_o         = ocbo[OCBTC_OP_MUL_POS];
  assign grant_wrbk_to_fpxx_arith_o  = ocbo[OCBTC_OP_FPXX_ARITH_POS];
  assign grant_wrbk_to_lsu_o         = ocbo[OCBTC_OP_LS_POS];
  assign grant_wrbk_to_fpxx_cmp_o    = ocbo[OCBTC_OP_FPXX_CMP_POS];

The grant_wrbk_*_o signal along with the padb_wrbk_o signal signal an execution unit that it can write back its result to the register file / RAT / reservation station.

wrbk_rfd1_we_o, wrbk_rfd2_we_o and wrbk_rfdx_we_o

The wrbk_rfd1_we_o and wrbk_rfd2_we_o signals enable writeback to the register file. There are 2 signals because some 64-bit FPU instructions require writing results to 2 registers. When there is just a single register to write only signal wrbk_rfd1_we_o is used. When there are two results, writing happens in 2-stages, first wrbk_rfd1_we_o signals the write back to register 1 then in the next cycle wrbk_rfd2_we_o signals the write back to register 2.

The wrbk_rfdx_we_o signal to the control unit stalls the pipeline to allow the second write to complete.

This is represented by this logic in or1k_marocchino_oman.v:

  // instuction requests write-back
  wire exec_rfd1_we = ocbo[OCBTA_RFD1_WRBK_POS];
  wire exec_rfd2_we = ocbo[OCBTA_RFD2_WRBK_POS];

...

  // 1-clock Write-Back-pulses
  //  # for D1
  always @(posedge cpu_clk) begin
    if (padv_wrbk_i)
      wrbk_rfd1_we_o <= exec_rfd1_we;
    else
      wrbk_rfd1_we_o <= 1'b0;
  end // @clock

  //  # for D2 we delay WriteBack for 1-clock
  //    to split write into RF from D1
  always @(posedge cpu_clk) begin
    if (cpu_rst) begin
      wrbk_rfdx_we_o <= 1'b0; // flush
      wrbk_rfd2_we_o <= 1'b0; // flush
    end
    else if (wrbk_rfd2_we_o) begin
      wrbk_rfdx_we_o <= 1'b0; // D2 write done
      wrbk_rfd2_we_o <= 1'b0; // D2 write done
    end
    else if (wrbk_rfdx_we_o)
      wrbk_rfd2_we_o <= 1'b1; // do D2 write
    else if (padv_wrbk_i)
      wrbk_rfdx_we_o <= exec_rfd2_we;
  end // @clock

The padv_wrbk_i signal from the control unit to the order manager also takes care of dequeuing the last instruction from the OCB. With that and the writebacks completed the instruction is said to be retired.

Conclusion

The Marocchino instruction pipeline is not very complicated while still being full featured including Caches, MMU and FPU. We have mentioned a few structures such as Reservation Station and RAT which we haven’t gone into much details on. These help implement out-of-order superscalar execution using Tomasulo’s algorithm. In the next article we will go into more details on these components and how Tomasulo works.


OR1K Marocchino in Action

11 Jun 2019

Update 2021: There are a few fusesoc command's below. They have been updated to work with the latest fusesoc and core versions.

The Marocchino is an advanced new OpenRISC soft core CPU. However, not many have heard about it. Let’s try to change this.

In this series of posts I would like to take the reader over some key parts of the Marocchino and it’s architecture.

  • Marocchino in Action - (this article) A light intro and a guide to getting started with Marocchino
  • Instruction Pipeline - An deep dive into how the Marocchino pipeline is structured
  • A Tomasulo Implementation - How the Marocchino achieves Out-of-Order execution

Introducing Marocchino

In the beginning of 2019 I had finished the OpenRISC GCC port and was working on building up toolchain test and verification support using the mor1kx soft core. Part of the mor1kx’s feature set is the ability to swap out different pipeline arrangements to configure the CPU for performance or resource usage. Each pipeline is named after an Italian coffee, we have Cappuccino, Espresso and Pronto-Espresso. One of these pipelines which has been under development but never integrated into the main branch was the Marocchino. I had never paid much attention to the Marocchino pipeline.

Around the same time the author of Marocchino sent a mail mentioning he could not use the new GCC port as it was missing Single and Double precision FPU support. Using the new verification pipeline I set out to start working on adding single and double precision floating point support to the OpenRISC gcc port. My verification target would be the Marocchino pipeline.

After some initial investigation I found this CPU was much more than a new pipeline for the mor1kx with a special FPU. The marocchino has morphed into a complete re-implementation of the OpenRISC 1000 spec. Seeing this we split the marocchino out to it’s own repository where it could grow on it’s own. Of course maintaining the Italian coffee name.

With features like out-of-order execution using Tomasulo’s algorithm, 64-bit FPU operations using register pairs, MMU, Instruction caches, Data caches, Multicore support and a clean verilog code base the Marocchino is advanced to say the least.

I would claim Marocchino is one of the most advanced implementations of a out-of-order execution open source CPU cores. One of it’s friendly rivals is the BOOM core a 64-bit risc-v implementation written in Chisel. To contrast Marocchino has a similar feature set but is 32-bit OpenRISC written in verilog making it approachable. If you know more out-of-order execution open source cores I would love to know, and I can update this list.

Let’s dive in.

Getting started with Marocchino

We can quickly get started with Marocchino as we use FuseSoC. Which makes bringing together an running verilog libraries, or cores, a snap.

Environment Setup

The Marocchino development environment requires Linux, you can use a VM, docker or your own machine. I personally use fedora and maintain several Debian docker images for continuous integration and package builds.

The environment we install allows for simulating verilog using icarus or verilator. It also allows synthesis and programming to an FPGA using EDA tools. Here we will cover only simulation. For details on programming an SoC to an FPGA using FuseSoC see the build and pgm commands in FuseSoC documentation.

Note Below we use /tmp/openrisc to install software and work on code, but you can use any path you like.

Setting up FuseSoC

fusesoc logo

To get started let’s setup FuseSoC and install the required cores into the FuseSoC library.

Here we clone the git repositories used for Marocchino development into /tmp/openrisc/src feel free to have a look, If you feel adventurous make some changes. The repos include:

  • mor1kx-generic - the SoC system configuration which wires together the CPU, Memory, System bus, UART and a simple interrupt peripheral.
  • or1k_marocchino - the Marocchino CPU core.
sudo pip install fusesoc

mkdir -p /tmp/openrisc/src
cd /tmp/openrisc/src
git clone https://github.com/stffrdhrn/mor1kx-generic.git
git clone https://github.com/openrisc/or1k_marocchino.git

# As yourself
fusesoc init -y
fusesoc library add intgen https://github.com/stffrdhrn/intgen.git
fusesoc library add elf-loader https://github.com/fusesoc/elf-loader.git
fusesoc library add mor1kx-generic /tmp/openrisc/src/mor1kx-generic
fusesoc library add or1k_marocchino /tmp/openrisc/src/or1k_marocchino

Setting up Icarus Verilog

Next we will need to install our verilog compiler/simulator Icarus Verilog (iverilog).

mkdir -p /tmp/openrisc/iverilog
cd /tmp/openrisc/iverilog
git clone https://github.com/steveicarus/iverilog.git .

sh autoconf.sh
./configure --prefix=/tmp/openrisc/local
make
make install
export PATH=/tmp/openrisc/local/bin:$PATH

Using Docker

Update 2021: The docker images are out of date and no longer work well, please use the manual setup method.

If you want to get started very quickly faster we can use the librecores-ci docker image. Which includes iverilog, verilator and fusesoc.

This allows us to skip the Setting up Icarus Verilog and part of the Setting up FuseSoC step above.

This can be done with the following.

docker pull librecores/librecores-ci
docker run -it --rm docker.io/librecores/librecores-ci

Setting up GCC

Next we install the GCC toolchain which is used for compiling C and OpenRISC assembly programs. The produced elf binaries can be loaded and run on the CPU core. Pull the latest toolchain from my gcc releases page. Here we use the newlib (baremetal) toolchain which allows compiling programs which run directly on the processor. For details on other toolchains available see the toolchain summary on the OpenRISC homepage.

mkdir -p /tmp/openrisc
cd /tmp/openrisc
wget https://github.com/stffrdhrn/gcc/releases/download/or1k-9.1.1-20190507/or1k-elf-9.1.1-20190507.tar.xz

tar -xf or1k-elf-9.1.1-20190507.tar.xz
export PATH=/tmp/openrisc/or1k-elf/bin:$PATH

The development environment should now be set up.

To check everything works you should be able to run the following commands.

Setup Verification

To ensure the toolchain is installed and working we can run the following:

$ or1k-elf-gcc --version
or1k-elf-gcc (GCC) 9.1.1 20190503
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

To ensure FuseSoC and the required cores are installed we can run this:

$ fusesoc core-info mor1kx-generic 
CORE INFO
Name:      ::mor1kx-generic:1.1
Core root: /root/.local/share/fusesoc/mor1kx-generic

Targets:
marocchino_tb
mor1kx_tb

$ fusesoc list-cores
...
::intgen:0                     : local
::or1k_marocchino:5.0-r3       : local
...

Running an Assembly Program

The most simple program you can run on OpenRISC is a simple assembly program. When running, everything in the below program is loaded into the OpenRISC memory, nothing more nothing less.

To compile, run and trace a simple assembly program we do the following.

Create a source file asm-openrisc.s as follows:

/* Exception vectors section.  */
	.section .vectors, "ax"

/* 0x100: OpenRISC RESET reset vector. */
        .org 0x100

	/* Jump to program initialisation code */
	.global _main
	l.movhi r4, hi(_main)
	l.ori r4, r4, lo(_main)
	l.jr    r4
	 l.nop

/* Main executable section.  */
	.section .text
	
	.global _main	
_main:
	l.addi	r1,r0,0x1
	l.addi	r2,r1,0x2
	l.addi	r3,r2,0x4
	l.addi	r4,r3,0x8
	l.addi	r5,r4,0x10
	l.addi	r6,r5,0x20
	l.addi	r7,r6,0x40
	l.addi	r8,r7,0x80
	l.addi	r9,r8,0x100
	l.addi	r10,r9,0x200
	l.addi	r11,r10,0x400
	l.addi	r12,r11,0x800
	l.addi	r13,r12,0x1000
	l.addi	r14,r13,0x2000
	l.addi	r15,r14,0x4000
	l.addi	r16,r15,0x8000

	l.sub	r31,r0,r1
	l.sub	r30,r31,r2
	l.sub	r29,r30,r3
	l.sub	r28,r29,r4
	l.sub	r27,r28,r5
	l.sub	r26,r27,r6
	l.sub	r25,r26,r7
	l.sub	r24,r25,r8
	l.sub	r23,r24,r9
	l.sub	r22,r23,r10
	l.sub	r21,r22,r11
	l.sub	r20,r21,r12
	l.sub	r19,r20,r13
	l.sub	r18,r19,r14
	l.sub	r17,r18,r15
	l.sub	r16,r17,r16

	/* Set sim return code to 0 - meaning OK. */
	l.movhi	r3, 0x0
	l.nop 	0x1 /* Exit simulation */
	l.nop
	l.nop

To compile this we use or1k-elf-gcc. Note the -nostartfiles option, this is useful for compiling assembly when we don’t need newlib/libgloss provided “startup” sections linked into the binary as we provide them ourselves.

mkdir /tmp/openrisc/src
cd /tmp/openrisc/src
vim openrisc-asm.s

or1k-elf-gcc -nostartfiles openrisc-asm.s -o openrisc-asm

Finally, to run the program on the Marocchino we run fusesoc with the below options.

  • run - specifies that we want to run a simulation.
  • --target - is a FuseSoC option for run specifying which of the mor1kx-generic targets we want to run, here we specify marochino_tb, the Marocchino test bench.
  • --tool - is a sub option for --target specifying that we want to run the marocchino_tb target using icarus.
  • ::mor1kx-generic:1.1 - specifies which system we want to run. System’s represent an SoC that can be simulated or synthesized. You can see a list of system using list-cores.
  • --elf_load - is mor1kx-generic specific option which specifies an elf binary that will be loaded into memory before the simulation starts.
  • --trace_enable - is a mor1kx-generic specific option enabling tracing. When specified the simulator will output a trace file to {fusesoc-builds}/mor1kx-generic_1.1/marocchino_tb-icarus/marocchino-trace.log see log.
  • --trace_to_screen - is a mor1kx-generic specific option enabling tracing instruction execution to the console as we can see below.
  • --vcd - is a mor1kx-generic option instruction icarus to output a vcd file which creates a trace file which can be loaded with gtkwave.
fusesoc run --target marocchino_tb --tool icarus ::mor1kx-generic:1.1 \
  --elf_load ./openrisc-asm --trace_enable --trace_to_screen --vcd

VCD info: dumpfile testlog.vcd opened for output.
Program header 0: addr 0x00000000, size 0x000001A0
elf-loader: /tmp/openrisc/src/openrisc-asm was loaded
Loading         104 words
                   0 : Illegal Wishbone B3 cycle type (xxx)
S 00000100: 18800000 l.movhi r4,0x0000       r4         = 00000000  flag: 0
S 00000104: a8840110 l.ori   r4,r4,0x0110    r4         = 00000110  flag: 0
S 00000108: 44002000 l.jr    r4                                     flag: 0
S 0000010c: 15000000 l.nop   0x0000                                 flag: 0
S 00000110: 9c200001 l.addi  r1,r0,0x0001    r1         = 00000001  flag: 0
S 00000114: 9c410002 l.addi  r2,r1,0x0002    r2         = 00000003  flag: 0
S 00000118: 9c620004 l.addi  r3,r2,0x0004    r3         = 00000007  flag: 0
S 0000011c: 9c830008 l.addi  r4,r3,0x0008    r4         = 0000000f  flag: 0
S 00000120: 9ca40010 l.addi  r5,r4,0x0010    r5         = 0000001f  flag: 0
S 00000124: 9cc50020 l.addi  r6,r5,0x0020    r6         = 0000003f  flag: 0
S 00000128: 9ce60040 l.addi  r7,r6,0x0040    r7         = 0000007f  flag: 0
S 0000012c: 9d070080 l.addi  r8,r7,0x0080    r8         = 000000ff  flag: 0
S 00000130: 9d280100 l.addi  r9,r8,0x0100    r9         = 000001ff  flag: 0
S 00000134: 9d490200 l.addi  r10,r9,0x0200   r10        = 000003ff  flag: 0
S 00000138: 9d6a0400 l.addi  r11,r10,0x0400  r11        = 000007ff  flag: 0
S 0000013c: 9d8b0800 l.addi  r12,r11,0x0800  r12        = 00000fff  flag: 0
S 00000140: 9dac1000 l.addi  r13,r12,0x1000  r13        = 00001fff  flag: 0
S 00000144: 9dcd2000 l.addi  r14,r13,0x2000  r14        = 00003fff  flag: 0
S 00000148: 9dee4000 l.addi  r15,r14,0x4000  r15        = 00007fff  flag: 0
S 0000014c: 9e0f8000 l.addi  r16,r15,0x8000  r16        = ffffffff  flag: 0
S 00000150: e3e00802 l.sub   r31,r0,r1       r31        = ffffffff  flag: 0
S 00000154: e3df1002 l.sub   r30,r31,r2      r30        = fffffffc  flag: 0
S 00000158: e3be1802 l.sub   r29,r30,r3      r29        = fffffff5  flag: 0
S 0000015c: e39d2002 l.sub   r28,r29,r4      r28        = ffffffe6  flag: 0
S 00000160: e37c2802 l.sub   r27,r28,r5      r27        = ffffffc7  flag: 0
S 00000164: e35b3002 l.sub   r26,r27,r6      r26        = ffffff88  flag: 0
S 00000168: e33a3802 l.sub   r25,r26,r7      r25        = ffffff09  flag: 0
S 0000016c: e3194002 l.sub   r24,r25,r8      r24        = fffffe0a  flag: 0
S 00000170: e2f84802 l.sub   r23,r24,r9      r23        = fffffc0b  flag: 0
S 00000174: e2d75002 l.sub   r22,r23,r10     r22        = fffff80c  flag: 0
S 00000178: e2b65802 l.sub   r21,r22,r11     r21        = fffff00d  flag: 0
S 0000017c: e2956002 l.sub   r20,r21,r12     r20        = ffffe00e  flag: 0
S 00000180: e2746802 l.sub   r19,r20,r13     r19        = ffffc00f  flag: 0
S 00000184: e2537002 l.sub   r18,r19,r14     r18        = ffff8010  flag: 0
S 00000188: e2327802 l.sub   r17,r18,r15     r17        = ffff0011  flag: 0
S 0000018c: e2118002 l.sub   r16,r17,r16     r16        = ffff0012  flag: 0
S 00000190: 18600000 l.movhi r3,0x0000       r3         = 00000000  flag: 0
S 00000194: 15000001 l.nop   0x0001                                 flag: 0
exit(0x00000000);

If we look at the VCD trace file in gtkwave we can see the below trace. The trace file is also helpful for navigating through the various SoC components as it captures all wire transitions.

Take note that we are not seeing very good performance, this is because caching is not enabled and the CPU takes several cycles to read an instruction from memory. This means we are not seeing one instruction executed per cycle. Enabling caches would fix this.

asm tracefile vcd

Running a C program

When we compile a C program there is a lot more happening behind the scenes. The linker will link in an entire runtime that along with standard libc functions on OpenRISC will setup interrupt vectors, enable caches, setup memory sections for variables, run static initializers and finally run our program.

The program:

/* Simple c program, doing some math.  */

#include <stdio.h>

int a [] = { 1, 2, 3, 4, 5 };

int madd(int a, int b, int c) {
  return a * b + c;
}

int main() {
  int res;

  for (int i = 0; i < 5; i++) {
    res = madd(0  , a[1], a[i]);
    res = madd(res, a[2], a[i]);
    res = madd(res, a[2], a[i]);
    res = madd(res, a[3], a[i]);
    res = madd(res, a[4], a[i]);
  }

  printf("Result is = %d\n", res);

  return 0;
}

To compile we use the below or1k-elf-gcc command. Notice, we do not specify -nostartfiles here as we do want newlib to link in all the start routines to provide a full c runtime. We do specify the following arguments to tell GCC a bit about our OpenRISC cpu. If these -m options are not specified GCC will link in code using the libgcc library to emulate these instructions.

  • -mhard-mul - indicates that our cpu target supports multiply instructions.
  • -mhard-div - indicates that our cpu target supports divide instructions.
  • -mhard-float - indicates that our cpu target supports FPU instructions.
  • -mdouble-float - indicates that our cpu target supports the new double precision floating point instructions using register pairs (orfpx64a32).

To see a full list of options for OpenRISC read the GCC manual or see the output of or1k-elf-gcc --target-help.

or1k-elf-gcc -Wall -O2 -mhard-mul -mhard-div -mhard-float -mdouble-float -mror \
  openrisc-c.c -o openrisc-c

If we want to inspect the assembly to ensure we did generate multiply instructions we can use the trusty objdump utility. As per below, yes, we can see multiply instructions.

or1k-elf-objdump -d openrisc-c | grep -A10 main
00002000 <main>:
    2000:       1a a0 00 01     l.movhi r21,0x1
    2004:       9c 21 ff f8     l.addi r1,r1,-8
    2008:       9e b5 40 3c     l.addi r21,r21,16444
    200c:       86 75 00 10     l.lwz r19,16(r21)
    2010:       86 f5 00 08     l.lwz r23,8(r21)
    2014:       e2 37 9b 06     l.mul r17,r23,r19
    2018:       e2 31 98 00     l.add r17,r17,r19
    201c:       e2 31 bb 06     l.mul r17,r17,r23
    2020:       e2 31 98 00     l.add r17,r17,r19
    2024:       86 b5 00 0c     l.lwz r21,12(r21)
...

Similar to running the assembly example we can run this with fusesoc as follows.

fusesoc run --target marocchino_tb --tool icarus ::mor1kx-generic:1.1 \
  --elf_load ./openrisc-c --trace_enable --vcd
...

Result is = 1330

Now, if we look at the VCD trace file we can see the below trace. Notice that with the c program we can observe better pipelining where an instruction can be executed every clock cycle. This is because caches have been initialized as part of the newlib c-runtime initialization, great!

c tracefile vcd

Conclusion

In this article we went through a quick introduction to the Marocchino development environment. The development environment would actually be similar when developing any OpenRISC core.

This environment will allow the reader to following in future Marocchino articles where we go deeper into the architecture. In this environment you can now:

  • Develop and test verilog code for the Marocchino processor
  • Develop assembly programs and test them on the Marocchino or other OpenRISC processors
  • Develop c programs and test them on the Marocchino or other OpenRISC processors

In the next article we will look more into how the above programs actually flow through the Marocchino pipeline. Stay tuned.


GCC Stack Frames

08 Jun 2018

What I learned from doing the OpenRISC GCC port, defining the stack frame

This is a continuation on my notes of things I learned while working on the OpenRISC GCC backend port. The stack frame layout is very important to get right when implementing an architecture’s calling conventions. If not we may have a compiler that works fine with code it compiles but cannot interoperate with libraries produced by another compiler.

For me I figured this would be most difficult as I am horrible with off by one bugs. However, after learning the whole picture I was able to get it working.

In this post we will go over the two main stack frame concepts:

  • Registers - The GCC internal and hard register numbers pointing into the stack
  • Stack Layout - How memory layout of the stack is defined

Or1k Stack Frame

Registers

Stack registers are cpu registers dedicated to point to different locations in the stack. The content of these registers is updated during function epilogue and prologue. In the above diagram we can see the pointed out as AP, HFP, FP and SP.

Virtual Registers

GCC’s first glimpse of the stack.

These are created during the expand and eliminated during vreg pass. By looking at these we cat understand the whole picture: Offsets, outgoing arguments, incoming arguments etc.

The virtual registers are GCC’s canonical view of the stack frame. During the vregs pass they will be replaced with architecture specific registers. See details on this in my discussion on GCC important passes.

Macro GCC OpenRISC
VIRTUAL_INCOMING_ARGS_REGNUM Points to incoming arguments. ARG_POINTER_REGNUM + FIRST_PARM_OFFSET. default
VIRTUAL_STACK_VARS_REGNUM Points to local variables. FRAME_POINTER_REGNUM + TARGET_STARTING_FRAME_OFFSET. default
VIRTUAL_STACK_DYNAMIC_REGNUM STACK_POINTER_REGNUM + STACK_DYNAMIC_OFFSET. default
VIRTUAL_OUTGOING_ARGS_REGNUM Points to outgoing arguments. STACK_POINTER_REGNUM + STACK_POINTER_OFFSET. default

Real Registers (Sometimes)

The stack pointer will pretty much always be a real register that shows up in the final assembly. Other registers will be like virtuals and eliminated during some pass.

Macro GCC OpenRISC
STACK_POINTER_REGNUM The hard stack pointer register, not defined where it should point Points to the last data on the current stack frame. i.e. 0(r1) points next function arg[0]
FRAME_POINTER_REGNUM (FP) Points to automatic/local variable storage Points to the first local variable. i.e. 0(FP) points to local variable[0].
HARD_FRAME_POINTER_REGNUM The hard frame pointer, not defined where it should point Points to the same location as the previous functions SP. i.e. 0(r2) points to current function arg[0]
ARG_POINTER_REGNUM Points to current function incoming arguments For OpenRISC this is the same as HARD_FRAME_POINTER_REGNUM.

Stack Layout

Stack layout defines how the stack frame is placed in memory.

Eliminations

Eliminations provide the rules for which registers can be eliminated by replacing them with another register and a calculated offset. The offset is calculated by looking at data collected by the TARGET_COMPUTE_FRAME_LAYOUT macro function.

On OpenRISC we have defined these below. We allow the frame pointer and argument pointer to be eliminated. They will be replaced with either the stack pointer register or the hard frame pointer. In OpenRISC there is no argument pointer so it will always need to be eliminated. Also, the frame pointer is a placeholder, when elimination is done it will be eliminated.

Note GCC knows that at some optimization levels the hard frame pointer will be omitted. In these cases HARD_FRAME_POINTER_REGNUM will not selected as the elimination target register. We don’t need to define any hard frame pointer eliminations.

Macro GCC OpenRISC
ELIMINABLE_REGS Sets of registers from, to which we can eliminate by calculating the difference between them. We eliminate Argument Pointer and Frame Pointer.
INITIAL_ELIMINATION_OFFSET Function to compute the difference between eliminable registers. See implementation below

Example

or1k.h

#define ELIMINABLE_REGS					\
{ { FRAME_POINTER_REGNUM, STACK_POINTER_REGNUM },	\
  { FRAME_POINTER_REGNUM, HARD_FRAME_POINTER_REGNUM },	\
  { ARG_POINTER_REGNUM,	 STACK_POINTER_REGNUM },	\
  { ARG_POINTER_REGNUM,   HARD_FRAME_POINTER_REGNUM } }

#define INITIAL_ELIMINATION_OFFSET(FROM, TO, OFFSET) \
  do {							\
    (OFFSET) = or1k_initial_elimination_offset ((FROM), (TO)); \
  } while (0)

or1k.c

HOST_WIDE_INT
or1k_initial_elimination_offset (int from, int to)
{
  HOST_WIDE_INT offset;

  /* Set OFFSET to the offset from the stack pointer.  */
  switch (from)
    {
    /* Incoming args are all the way up at the previous frame.  */
    case ARG_POINTER_REGNUM:
      offset = cfun->machine->total_size;
      break;

    /* Local args grow downward from the saved registers.  */
    case FRAME_POINTER_REGNUM:
      offset = cfun->machine->args_size + cfun->machine->local_vars_size;
      break;

    default:
      gcc_unreachable ();
    }

  if (to == HARD_FRAME_POINTER_REGNUM)
    offset -= cfun->machine->total_size;

  return offset;
}

Stack Section Growth

Some sections of the stack frame may contain multiple variables, for example we may have multiple outgoing arguments or local variables. The order in which these are stored in memory is defined by these macros.

Note On OpenRISC the local variables definition changed during implementation from upwards to downwards. These are local only to the current function so does not impact calling conventions.

For a new port is recommended to define FRAME_GROWS_DOWNWARD as 1 as it is usually not critical to the target calling conventions and defining it also enables the Stack Protector feature. The stack protector can be turned on in gcc using -fstack-protector, during build ensure to --enable-libssp which is enabled by default.

Macro GCC OpenRISC
STACK_GROWS_DOWNWARD Define true if new stack frames decrease towards memory address 0x0. 1
FRAME_GROWS_DOWNWARD Define true if increasing local variables are at negative offset from FP. Define this to enable the GCC stack protector feature. 1
ARGS_GROW_DOWNWARD Define true if increasing function arguments are at negative offset from AP for incoming args and SP for outgoing args. 0 (default)

Stack Section Offsets

Offsets may be required if an architecture has extra offsets between the different register pointers and the actual variable data. In OpenRISC we have no such offsets.

Macro GCC OpenRISC
STACK_POINTER_OFFSET See VIRTUAL_OUTGOING_ARGS_REGNUM 0
FIRST_PARM_OFFSET See VIRTUAL_INCOMING_ARGS_REGNUM 0
STACK_DYNAMIC_OFFSET See VIRTUAL_STACK_DYNAMIC_REGNUM 0
TARGET_STARTING_FRAME_OFFSET See VIRTUAL_OUTGOING_ARGS_REGNUM 0

Outgoing Arguments

When a function calls another function sometimes the arguments to that function will need to be stored to the stack before making the function call. For OpenRISC this is when we have more arguments than fit in argument registers or when we have variadic arguments. The outgoing arguments for all child functions need to be accounted for and the space will be allocated on the stack.

On some architectures outgoing arguments are pushed onto and popped off the stack. For OpenRISC we do not do this we simply, allocate the required memory in the prologue.

Macro GCC OpenRISC
ACCUMULATE_OUTGOING_ARGS If defined, don’t push args just store in crtl->outgoing_args_size. Our prologue should allocate this space relative to the SP (as per ARGS_GROW_DOWNWARD). 1
CUMULATIVE_ARGS A C type used for tracking args in the TARGET_FUNCTION_ARG_* macros. int
INIT_CUMULATIVE_ARGS Initializes a newly created CUMULALTIVE_ARGS type. Sets the int variable to 0
TARGET_FUNCTION_ARG Return a reg RTX or Zero to indicate when to start to pass outgoing args on the stack. See implementation
FUNCTION_ARG_REGNO_P Returns true of the given register number is used for passing outgoing function arguments. r3 to r8 are OK for arguments
TARGET_FUNCTION_ARG_ADVANCE This is called during iterating through outgoing function args to account for the next function arg size. See implementation

Further Reading

These references were very helpful in getting our calling conventions right:


GCC Important Passes

03 Jun 2018

What I learned from doing the OpenRISC GCC port, a deep dive into passes

When starting the OpenRISC gcc port I had a good idea of how the compiler worked and what would be involved in the port. Those main things being

  1. define a new machine description file in gcc’s RTL
  2. define a bunch of description macros and helper functions in a .c and .h file.

I realized early on that trouble shooting issues requires understanding the purpose of some important compiler passes. It was difficult to understand what all of the compiler passes were. There are so many, 200+, but after some time I found there are a few key passes to be concerned about; lets jump in.

Quick Tips

  • When debugging compiler problems use the -fdump-rtl-all-all and -fdump-tree-all-all flags to debug where things go wrong.
  • To understand which passes are run for different -On optimization levels you can use -fdump-passes.
  • The numbers in the dump output files indicate the order in which passes were run. For example between test.c.235r.vregs and test.c.234r.expand the expand pass is run before vregs, and there were no passes run inbetween.
  • The debug options -S -dp are also helpful for tying RTL up with the output assembly. The -S option tells the compiler to dump the assembler output, and -dp enables annotation comments showing the RTL instruction id, name and other useful statistics.

Glossary Terms

  • We may see cfg thoughout the gcc source, this is not configuration, but control flow graph.
  • Spilling is performed when there are not enough registers available during register allocation to store all scope variables, one variable in a register is chosen and spilled by saving it to memory; thus freeing up a register for allocation.
  • IL is a GCC intermediate language i.e. GIMPLE or RTL. During porting we are mainly concerned with RTL.
  • Lowering are operations done by passes to take higher level language and graph representations and make them more simple/lower level in preparation for machine assembly conversion.
  • Predicates part of the RTL these are used to facilitate instruction matching the. Having these more specific reduces the work that reload needs to do and generates better code.
  • Constraints part of the RTL and used during reload, these are associated with assembly instructions used to resolved the target instruction.

Passes

Passes are the core of the compiler. To start, there are basically two types of compiler passes in gcc:

  • Tree - Passes working on GIMPLE.
  • RTL - Passes working on Register Transfer Language.

GCC Passes

There are also Interprocedural analysis passes (IPA) which we will not get into, as I don’t really know what they are. You can find a list of all passes in gcc/passes.def.

In this post we will concentrate on the RTL passes as this is what most of our backend port influences. The passes interesting for our port are:

  • expand
  • vregs
  • split
  • combine
  • ira
  • LRA/reload

An Example

In order to illustrate how the passes work we have the following example C snippet of code. We will compile it and inspect the output of each stage.

int func (int a, int b) {
  return 2 * a + b;
}

When compiled with or1k-elf-gcc -O0 -c ../func.c the output is:

$ or1k-elf-objdump -dr func.o

func.o:     file format elf32-or1k

Disassembly of section .text:

00000000 <func>:
   0:   9c 21 ff f0     l.addi r1,r1,-16      ; Adjust stack pointer
   4:   d4 01 10 08     l.sw 8(r1),r2         ; Save old frame pointer
   8:   9c 41 00 10     l.addi r2,r1,16       ; Adjust frame pointer
   c:   d4 01 48 0c     l.sw 12(r1),r9        ; Save link register
  10:   d7 e2 1f f0     l.sw -16(r2),r3       ; Store arg[0]
  14:   d7 e2 27 f4     l.sw -12(r2),r4       ; Store arg[1]
  18:   86 22 ff f0     l.lwz r17,-16(r2)     ; Load arg[1]
  1c:   e2 31 88 00     l.add r17,r17,r17
  20:   e2 71 88 04     l.or r19,r17,r17
  24:   86 22 ff f4     l.lwz r17,-12(r2)     ; Load arg[0]
  28:   e2 33 88 00     l.add r17,r19,r17
  2c:   e1 71 88 04     l.or r11,r17,r17
  30:   84 41 00 08     l.lwz r2,8(r1)        ; Restore old frame pointer
  34:   85 21 00 0c     l.lwz r9,12(r1)       ; Restore link register
  38:   9c 21 00 10     l.addi r1,r1,16       ; Restore old stack pointer
  3c:   44 00 48 00     l.jr r9               ; Return
  40:   15 00 00 00     l.nop 0x0

Lets walk though some of the RTL passes to understand how we arrived at the above.

The Expand Pass

During passes there is a sudden change from GIMPLE to RTL, this change happens during expand/rtl generation pass.

There are about 55,000 lines of code used to handle expand.

   1094 gcc/stmt.c     - expand_label, expand_case
   5929 gcc/calls.c    - expand_call
  12054 gcc/expr.c     - expand_assignment, expand_expr_addr_expr ...
   2270 gcc/explow.c
   6168 gcc/expmed.c   - expand_shift, expand_mult, expand_and ...
   6817 gcc/function.c - expand_function_start, expand_function_end
   7327 gcc/optabs.c   - expand_binop, expand_doubleword_shift, expand_float, expand_atomic_load ...
   6641 gcc/emit-rtl.c
   6631 gcc/cfgexpand.c - pass and entry poiint is defined herei expand_gimple_stmt
  54931 total

The expand pass is defined in gcc/cfgexpand.c. It will take the instruction names like addsi3 and movsi and expand them to RTL instructions which will be refined by further passes.

Expand Input

Before RTL generation we have GIMPLE. Below is the content of func.c.232t.optimized the last of the tree passes before RTL conversion. An important tree pass is Static Single Assignment (SSA) I don’t go into it here, but it is what makes us have so many variables, note that each variable will be assigned only once, this helps simplify the tree for analysis and later RTL steps like register allocation.

func (intD.1 aD.1448, intD.1 bD.1449)
{
  intD.1 a_2(D) = aD.1448;
  intD.1 b_3(D) = bD.1449;
  intD.1 _1;
  intD.1 _4;

  _1 = a_2(D) * 2;
  _4 = _1 + b_3(D);
  return _4;
}

Expand Output

After expand we can first see the RTL. Each statement of the gimple above will be represented by 1 or more RTL expressions. I have simplified the RTL a bit and included the GIMPLE inline for clarity.

Tip Reading RTL. RTL is a lisp dialect. Each statement has the form (type id prev next n (statement)).

(insn 2 5 3 2 (set (reg/v:SI 44) (reg:SI 3 r3)) (nil))

For the instruction:

  • insn is the expression type
  • 2 is the instruction unique id
  • 5 is the instruction before it
  • 3 is the next instruction
  • 2 I am not sure what this is
  • (set (reg/v:SI 44) (reg:SI 3 r3)) (nil) - is the expression

Back to our example, this is with -O0 to allow the virtual-stack-vars to not be elimated for verbosity:

This is the contents of func.c.234r.expand.

;; func (intD.1 aD.1448, intD.1 bD.1449)
;; {
;;   Note: First we save the arguments
;;   intD.1 a_2(D) = aD.1448;
(insn 2 5 3 2 (set (mem/c:SI (reg/f:SI 36 virtual-stack-vars) [1 a+0 S4 A32])
        (reg:SI 3 r3 [ a ])) "../func.c":1 -1
     (nil))

;;   intD.1 b_3(D) = bD.1449;
(insn 3 2 4 2 (set (mem/c:SI (plus:SI (reg/f:SI 36 virtual-stack-vars)
                (const_int 4 [0x4])) [1 b+0 S4 A32])
        (reg:SI 4 r4 [ b ])) "../func.c":1 -1
     (nil))

;;   Note: this was optimized from x 2 to n + n.
;;   _1 = a_2(D) * 2;
;;    This is expanded to:
;;     1. Load a_2(D)
;;     2. Add a_2(D) + a_2(D) store result to temporary
;;     3. Store results to _1
(insn 7 4 8 2 (set (reg:SI 45)
        (mem/c:SI (reg/f:SI 36 virtual-stack-vars) [1 a+0 S4 A32])) "../func.c":2 -1
     (nil))
(insn 8 7 9 2 (set (reg:SI 46)
        (plus:SI (reg:SI 45)
            (reg:SI 45))) "../func.c":2 -1
     (nil))
(insn 9 8 10 2 (set (reg:SI 42 [ _1 ])
        (reg:SI 46)) "../func.c":2 -1
     (nil))a

;;  _4 = _1 + b_3(D);
;;   This is expanded to:
;;    1. Load b_3(D)
;;    2. Do the Add and store to _4
(insn 10 9 11 2 (set (reg:SI 47)
        (mem/c:SI (plus:SI (reg/f:SI 36 virtual-stack-vars)
                (const_int 4 [0x4])) [1 b+0 S4 A32])) "../func.c":2 -1
     (nil))
(insn 11 10 14 2 (set (reg:SI 43 [ _4 ])
        (plus:SI (reg:SI 42 [ _1 ])
            (reg:SI 47))) "../func.c":2 -1
     (nil))

;; return _4;
;;  We put _4 into r11 the openrisc return value register
(insn 14 11 18 2 (set (reg:SI 44 [ <retval> ])
        (reg:SI 43 [ _4 ])) "../func.c":2 -1
     (nil))
(insn 18 14 19 2 (set (reg/i:SI 11 r11)
        (reg:SI 44 [ <retval> ])) "../func.c":3 -1
     (nil))
(insn 19 18 0 2 (use (reg/i:SI 11 r11)) "../func.c":3 -1
     (nil))

The Virtual Register Pass

The virtual register pass is part of gcc/function.c file which has a few different passes in it.

$ grep -n 'pass_data ' gcc/function*

gcc/function.c:1995:const pass_data pass_data_instantiate_virtual_regs =
gcc/function.c:6486:const pass_data pass_data_leaf_regs =
gcc/function.c:6553:const pass_data pass_data_thread_prologue_and_epilogue =
gcc/function.c:6747:const pass_data pass_data_match_asm_constraints =

Virtual Register Output

Here we can see that the previously seen variables stored to the frame at virtual-stack-vars memory locations are now being stored to memory offsets of an architecture specifc register. After the Virtual Registers pass all of the virtual-* registers will be eliminated.

For OpenRISC we see ?fp, a fake register which we defined with macro FRAME_POINTER_REGNUM. We use this as a placeholder as OpenRISC’s frame pointer does not point to stack variables (it points to the function incoming arguments). The placeholder is needed by GCC but it will be eliminated later. On some arechitecture this will be a real register at this point.

;; Here we see virtual-stack-vars replaced with ?fp.
(insn 2 5 3 2 (set (mem/c:SI (reg/f:SI 33 ?fp) [1 a+0 S4 A32])
        (reg:SI 3 r3 [ a ])) "../func.c":1 16 {*movsi_internal}
     (nil))
(insn 3 2 4 2 (set (mem/c:SI (plus:SI (reg/f:SI 33 ?fp)
                (const_int 4 [0x4])) [1 b+0 S4 A32])
        (reg:SI 4 r4 [ b ])) "../func.c":1 16 {*movsi_internal}
     (nil))
(insn 7 4 8 2 (set (reg:SI 45)
        (mem/c:SI (reg/f:SI 33 ?fp) [1 a+0 S4 A32])) "../func.c":2 16 {*movsi_internal}
     (nil))
(insn 8 7 9 2 (set (reg:SI 46)
        (plus:SI (reg:SI 45)
            (reg:SI 45))) "../func.c":2 2 {addsi3}
     (nil))
(insn 9 8 10 2 (set (reg:SI 42 [ _1 ])
        (reg:SI 46)) "../func.c":2 16 {*movsi_internal}
     (nil))
(insn 10 9 11 2 (set (reg:SI 47)
        (mem/c:SI (plus:SI (reg/f:SI 33 ?fp)
                (const_int 4 [0x4])) [1 b+0 S4 A32])) "../func.c":2 16 {*movsi_internal}
     (nil))
;; ...

The Split and Combine Passes

The Split passes use define_split definitions to look for RTL expressions which cannot be handled by a single instruction on the target architecture. These instructions are split into multiple RTL instructions. Splits patterns are defined in our machine description file.

The Combine pass does the opposite. It looks for instructions that can be combined into a signle instruction. Having tightly defined predicates will ensure incorrect combines don’t happen.

The combine pass code is about 15,000 lines of code.

14950 gcc/combine.c

The IRA Pass

The IRA and LRA passes are some of the most complicated passes, they are responsible to turning the psuedo register allocations which have been used up to this point and assigning real registers.

The Register Allocation problem they solve is NP-complete.

The IRA pass code is around 22,000 lines of code.

  3514 gcc/ira-build.c
  5661 gcc/ira.c
  4956 gcc/ira-color.c
   824 gcc/ira-conflicts.c
  2399 gcc/ira-costs.c
  1323 gcc/ira-emit.c
   224 gcc/ira.h
  1511 gcc/ira-int.h
  1595 gcc/ira-lives.c
 22007 total

IRA Pass Output

We do not see many changes during the IRA pass in this example but it has prepared us for the next step, LRA/reload.

(insn 21 5 2 2 (set (reg:SI 41)
        (unspec_volatile:SI [
                (const_int 0 [0])
            ] UNSPECV_SET_GOT)) 46 {set_got_tmp}
     (expr_list:REG_UNUSED (reg:SI 41)
        (nil)))
(insn 2 21 3 2 (set (mem/c:SI (reg/f:SI 33 ?fp) [1 a+0 S4 A32])
        (reg:SI 3 r3 [ a ])) "../func.c":1 16 {*movsi_internal}
     (expr_list:REG_DEAD (reg:SI 3 r3 [ a ])
        (nil)))
(insn 3 2 4 2 (set (mem/c:SI (plus:SI (reg/f:SI 33 ?fp)
                (const_int 4 [0x4])) [1 b+0 S4 A32])
        (reg:SI 4 r4 [ b ])) "../func.c":1 16 {*movsi_internal}
     (expr_list:REG_DEAD (reg:SI 4 r4 [ b ])
        (nil)))

(insn 7 4 8 2 (set (reg:SI 45)
        (mem/c:SI (reg/f:SI 33 ?fp) [1 a+0 S4 A32])) "../func.c":2 16 {*movsi_internal}
     (nil))
(insn 8 7 9 2 (set (reg:SI 46)
        (plus:SI (reg:SI 45)
            (reg:SI 45))) "../func.c":2 2 {addsi3}
     (expr_list:REG_DEAD (reg:SI 45)
        (nil)))
(insn 9 8 10 2 (set (reg:SI 42 [ _1 ])
        (reg:SI 46)) "../func.c":2 16 {*movsi_internal}
     (expr_list:REG_DEAD (reg:SI 46)
        (nil)))
(insn 10 9 11 2 (set (reg:SI 47)
        (mem/c:SI (plus:SI (reg/f:SI 33 ?fp)
                (const_int 4 [0x4])) [1 b+0 S4 A32])) "../func.c":2 16 {*movsi_internal}
     (nil))
;; ...

The LRA Pass (Reload)

The Local Register Allocator pass replaced the reload pass which is still used by some targets. OpenRISC and other modern ports use only LRA. The purpose of LRA/reload is to make sure each RTL instruction has real registers and a real instruction to use for output. If the criteria for an instruction is not met LRA/reload has some tricks to change and instruction and “reload” it in order to get it to match the criteria.

The LRA pass is about 17,000 lines of code.

  1816 gcc/lra-assigns.c
  2608 gcc/lra.c
   362 gcc/lra-coalesce.c
  7072 gcc/lra-constraints.c
  1465 gcc/lra-eliminations.c
    44 gcc/lra.h
   534 gcc/lra-int.h
  1450 gcc/lra-lives.c
  1347 gcc/lra-remat.c
   822 gcc/lra-spills.c
 17520 total

During LRA/reload constraints are used to match the real target inscrutions, i.e. "r" or "m" or target speciic ones like "O".

Before and after LRA/reload predicates are used to match RTL expressions, i.e general_operand or target specific ones like reg_or_s16_operand.

If we look at a test.c.278r.reload dump file we will a few sections.

  • Local
  • Pseudo live ranges
  • Inheritance
  • Assignment
  • Repeat
********** Local #1: **********
...
            0 Non-pseudo reload: reject+=2
            0 Non input pseudo reload: reject++
            Cycle danger: overall += LRA_MAX_REJECT
          alt=0,overall=609,losers=1,rld_nregs=1
            0 Non-pseudo reload: reject+=2
            0 Non input pseudo reload: reject++
            alt=1: Bad operand -- refuse
            0 Non-pseudo reload: reject+=2
            0 Non input pseudo reload: reject++
            alt=2: Bad operand -- refuse
            0 Non-pseudo reload: reject+=2
            0 Non input pseudo reload: reject++
            alt=3: Bad operand -- refuse
          alt=4,overall=0,losers=0,rld_nregs=0
         Choosing alt 4 in insn 2:  (0) m  (1) rO {*movsi_internal}
...

The above snippet of the Local phase of the LRA/reload pass shows the contraints matching loop for RTL insn 2.

To understand what is going on we should look at what is insn 2, from our input. This is a set instruction having a destination of memory and a source of register type, or "m,r".

(insn 2 21 3 2 (set (mem/c:SI (reg/f:SI 33 ?fp) [1 a+0 S4 A32])
        (reg:SI 3 r3 [ a ])) "../func.c":1 16 {*movsi_internal}
     (expr_list:REG_DEAD (reg:SI 3 r3 [ a ])
        (nil)))

RTL from .md file of our *movsi_internal instruction. The alternatives are the constraints, i.e. "=r,r,r,r, m,r".

(define_insn "*mov<I:mode>_internal"
  [(set (match_operand:I 0 "nonimmediate_operand" "=r,r,r,r, m,r")
        (match_operand:I 1 "input_operand"        " r,M,K,I,rO,m"))]
  "register_operand (operands[0], <I:MODE>mode)
   || reg_or_0_operand (operands[1], <I:MODE>mode)"
  "@
   l.or\t%0, %1, %1
   l.movhi\t%0, hi(%1)
   l.ori\t%0, r0, %1
   l.xori\t%0, r0, %1
   l.s<I:ldst>\t%0, %r1
   l.l<I:ldst>z\t%0, %1"
  [(set_attr "type" "alu,alu,alu,alu,st,ld")])

The constraints matching interates over the alternatives. As we remember from above we are trying to match "m,r". We can see:

  • alt=0 - this shows 1 loser because alt 0 r,r vs m,r has one match and one mismatch.
  • alt=1 - is indented and says Bad operand meaning there is no match at all with r,M vs m,r
  • alt=2 - is indented and says Bad operand meaning there is no match at all with r,K vs m,r
  • alt=3 - is indented and says Bad operand meaning there is no match at all with r,I vs m,r
  • alt=4 - is as win as we match m,rO vs m,r

After this we know exactly which target instructions for each RTL expression is neded.

End of Reload (LRA)

Finally we can see here at the end of LRA/reload all registers are real. The output at this point is pretty much ready for assembly output.

(insn 21 5 2 2 (set (reg:SI 16 r17 [41])
        (unspec_volatile:SI [
                (const_int 0 [0])
            ] UNSPECV_SET_GOT)) 46 {set_got_tmp}
     (nil))
(insn 2 21 3 2 (set (mem/c:SI (plus:SI (reg/f:SI 2 r2)
                (const_int -16 [0xfffffffffffffff0])) [1 a+0 S4 A32])
        (reg:SI 3 r3 [ a ])) "../func.c":1 16 {*movsi_internal}
     (nil))
(insn 3 2 4 2 (set (mem/c:SI (plus:SI (reg/f:SI 2 r2)
                (const_int -12 [0xfffffffffffffff4])) [1 b+0 S4 A32])
        (reg:SI 4 r4 [ b ])) "../func.c":1 16 {*movsi_internal}
     (nil))
(note 4 3 7 2 NOTE_INSN_FUNCTION_BEG)
(insn 7 4 8 2 (set (reg:SI 16 r17 [45])
        (mem/c:SI (plus:SI (reg/f:SI 2 r2)
                (const_int -16 [0xfffffffffffffff0])) [1 a+0 S4 A32])) "../func.c":2 16 {*movsi_internal}
     (nil))
(insn 8 7 9 2 (set (reg:SI 16 r17 [46])
        (plus:SI (reg:SI 16 r17 [45])
            (reg:SI 16 r17 [45]))) "../func.c":2 2 {addsi3}
     (nil))
(insn 9 8 10 2 (set (reg:SI 17 r19 [orig:42 _1 ] [42])
        (reg:SI 16 r17 [46])) "../func.c":2 16 {*movsi_internal}
     (nil))
(insn 10 9 11 2 (set (reg:SI 16 r17 [47])
        (mem/c:SI (plus:SI (reg/f:SI 2 r2)
                (const_int -12 [0xfffffffffffffff4])) [1 b+0 S4 A32])) "../func.c":2 16 {*movsi_internal}
     (nil))
;; ...

Conclusion

We have walked some of the passes of GCC to better understand how it works. During porting most of the problems will show up around expand, vregs and reload passes. Its good to have a general idea of what these do and how to read the dump files when troubleshooting. I hope the above helps.

Further Reading


OpenRISC GCC Status Update

16 May 2018

News flash, the OpenRISC GCC port now can run “Hello World”

After about 4 months of development on the OpenRISC GCC port rewrite I have hit my first major milestone, the “Hello World” program is working. Over those 4 months I spent about 2 months working on my from scratch dummy SMH port then 2 months to get the OpenRISC port to this stage.

Next Steps

There are still many todo items before this will be ready for general use, including:

  • Milestone 2 items
    • Investigate and Fix test suite failures, see below
    • Write OpenRISC specific test cases
    • Ensure all memory layout and calling conventions are within spec
    • Ensure sign extending, overflow, and carry flag arithmetic is correct
    • Fix issues with GDB debugging target remote is working OK, target sim is having issues.
    • Implement stubbed todo items, see below
    • Support for C++, I haven’t even tried to compile it yet
  • Milestone 3 items
    • Support for position independent code (PIC)
    • Support for thread local storage (TLS)
    • Support for floating point instructions (FPU)
    • Support for Atomic Builtins

Somewhere between milestone 2 and 3 I will start to work on getting the port reviewed on the GCC and OpenRISC mailing lists. If anyone wants to review right now please feel free to send feedback.

Test Suite Results

Running the gcc testsuite right now shows the following results. Many of these look to be related to internal compiler errors.

                === gcc Summary ===

# of expected passes            84301
# of unexpected failures        5096
# of unexpected successes       3
# of expected failures          211
# of unresolved testcases       2821
# of unsupported tests          2630
/home/shorne/work/gnu-toolchain/build-gcc/gcc/xgcc  version 9.0.0 20180426 (experimental) (GCC)

Stubbed TODO Items

Some of the stubbed todo items include:

Trampoline Handling

In gcc/config/or1k/or1k.h implement trampoline hooks for nested functions.

#define TRAMPOLINE_SIZE 12
#define TRAMPOLINE_ALIGNMENT (abort (), 0)

Profiler Hooks

In gcc/config/or1k/or1k.h implement profiling hooks.

#define FUNCTION_PROFILER(FILE,LABELNO) (abort (), 0)

Exception Handler Hooks

In gcc/config/or1k/or1k.c ensure what I am doing is right, on other targets they copy the address onto the stack before returning.

/* TODO, do we need to just set to r9? or should we put it to where r9
   is stored on the stack?  */
void
or1k_expand_eh_return (rtx eh_addr)
{
  emit_move_insn (gen_rtx_REG (Pmode, LR_REGNUM), eh_addr);
}