shorne in japan

blog archive about resume

OR1K Marocchino A Tomasulo Implementation

21 Oct 2019

This is an ongoing series of posts on the Marocchino CPU, an open source out-of-order OpenRISC cpu. In this series we are reviewing the Marocchino and it’s architecture. If you haven’t already I suggest you start of by reading the intro in Marocchino in Action.

In the last article, Marocchino Instruction Pipeline we discussed the architecture of the CPU. In this article let’s look at how Marocchino achieves out-of-order execution using the Tomasulo algorithm.

Achieving Out-of-Order Execution

In a traditional pipelined CPU the goal is retire one instruction per clock cycle. Any pipeline stall means an execution clock cycle will be lost. One method for reducing the affect of pipeline stalls is instruction parallelization. In 1993 the Intel Pentium processor was one of the first consumer CPUs to achieve this with it’s dual U and V integer pipelines. The pentium U and V pipelines require certain coding techniques to take full advantage. Achieving more parallelism requires more sophisticated data hazard detection and instruction scheduling. Introduced with the IBM System/360 in the 60’s by Robert Tomasulo, the Tomosulo Algorithm provides the building blocks to allow for multiple instruction execution parallelism. Generally speaking no special programming is needed to take advantage of instruction parallelism on a processor implementing Tomasulo algorithm.

Tomasulo's algorithm

Though the technique of out-of-order CPU execution with Tomasulo’s algorithm had been designed in the 60’s it did not make its way into popular consumer hardware until the Pentium Pro in the 1995. Further Pentium revisions such as the Pentium III, Pentium 4 and Core architectures are based on this same architecture. Understanding this architecture is a key to understanding modern CPUs.

In this article we will point out comparisons between the Marocchino and Pentium pro who’s architecture can be seen in the below diagram.

pentium pro diagram

The Marocchino implements the Tomasulo algorithm in a CPU that can be synthesized and run on an FPGA. Let’s dive into the implementation by breaking down the building blocks used in Tomasulo’s algorithm and how they have been implemented in Marocchino.

Tomasulo Building blocks

Besides the basic CPU modules like Instruction Fetch, Decode and Register File, the building blocks that are used in the Tomasulo algorithm are as follows:

The below diagram shows how these components are arranged in the Marocchino processor.

marocchino pipeline diagram

Resolving Data Hazards

As mentioned above the goal of a pipelined architecture is to retire one instruction per clock cycle. Pipelining helps achieve this by splitting an instruction into pipeline stages i.e. Fetch, Decode, Execute, Load/Store and Register Write Back. If one instruction depends on the results produced by a previous instruction will be a problem as register write back of the previous instruction may not complete before registers are read during the Decode phase of a instruction. This and other types of dependencies between pipeline stages are called hazards, and they must be avoided.

The Tomasulo algorithm with its Reservation Stations, Register Allocation Tables and other building blocks try to avoid hazards causing pipeline stalls. Let’s look at a simple example to see how this is done.

Here we can see that instruction 2 depends on instruction 1 as the addition of a + b cannot be performed until b is produced by instruction 1.

Let’s assume that instruction 1 is currently executing on the MULTIPLY unit. The CPU decodes instruction 2, instead of detecting a data hazard and stalling the pipeline instruction 2 will be placed in the reservation station of the ADD execution unit. The RAT indicates that b is busy and being produced by insruction 1. This means instruction 2 cannot execute right away. Next, we can look at instruction 3 and place it onto the reservation station of the DIVIDE execution unit. As instruction 3 has no hazards for x and y it can proceed directly to execution, even before instruction 2 is ready for execution.

Note, if a required reservation station is full the pipeline will stall.

Register Renaming

As mentioned above, execution units will present their output onto the common data bus wrbk_result and the data will be written into reservation stations. Writing the register to the reservation station may occur before writing back to the register file. This is what register renaming is, as the register input does not come directly from the register file.

Instruction Id

When an instruction is issued it may be registered in the RAT, OCB and Reservation Station. It is assigned an Instruction Id for tracking purposes. In Marocchino this is called the extadr and is 3 bits wide. It is generated by the simple instruction ID generation logic.

The is implemented in or1k_marocchino_oman.v with the following counter logic which generates a new extadr every time an instruction is decoded.

  // extension to DEST, FLAG or CARRY
  // Zero value is reserved as "not used"
  localparam [DEST_EXTADR_WIDTH-1:0] EXTADR_MAX = ((1 << DEST_EXTADR_WIDTH) - 1);
  localparam [DEST_EXTADR_WIDTH-1:0] EXTADR_MIN = 1;
  // ---
  reg  [DEST_EXTADR_WIDTH-1:0] dcod_extadr_r;
  wire [DEST_EXTADR_WIDTH-1:0] extadr_adder;
  // ---
  assign extadr_adder = (dcod_extadr_r == EXTADR_MAX) ? EXTADR_MIN : (dcod_extadr_r + 1'b1);
  // ---
  always @(posedge cpu_clk) begin
    if (pipeline_flush_i)
      dcod_extadr_r <= {DEST_EXTADR_WIDTH{1'b0}};
    else if (padv_dcod_i)
      dcod_extadr_r <= fetch_valid_i ? extadr_adder : dcod_extadr_r;
  end // @clock
  // support in-1clk-unit forwarding
  assign dcod_extadr_o = dcod_extadr_r;

Every instruction that is queued by the order manager is designated an extadr. This allows components like the reservation station and RAT tables to track when an instruction starts and completes executing.

The interactions between the extadr and other components are as follows.

During decode:

During execution:

Register Allocation Table

The register allocation table (RAT), sometimes called register alias table, keeps track of which registers are currently in progress of being generated by pending instructions. This is used to derive and resolve hazards.

The outputs of the RAT cell are:

marocchino RAT Cell diagram

The RAT table is made of 32 rat_cell modules; one cell per register. The register which the cell is allocated to is stored within GPR_ADR in the rat cell.

marocchino RAT diagram

Outputs of the RAT are registered to reservation stations. The hazards are derived with the following logic in or1k_marocchino_oman.v.

The omn2dec_hazard_d1a1_o hazard means that the argument a of the decoded instruction will be resolved when the instruction with extadr in omn2dec_extadr_dxa1_o is retired. The 2 in d2, a2 and b2 represent the 2nd register used in 64-bit FPU instructions.


  //  # relative operand A1
  assign omn2dec_hazard_d1a1_o = rat_rd1_alloc[dcod_rfa1_adr_i] & dcod_rfa1_req_i;
  assign omn2dec_hazard_d2a1_o = rat_rd2_alloc[dcod_rfa1_adr_i] & dcod_rfa1_req_i;
  assign omn2dec_extadr_dxa1_o = rat_extadr[dcod_rfa1_adr_i];
  //  # relative operand B1
  assign omn2dec_hazard_d1b1_o = rat_rd1_alloc[dcod_rfb1_adr_i] & dcod_rfb1_req_i;
  assign omn2dec_hazard_d2b1_o = rat_rd2_alloc[dcod_rfb1_adr_i] & dcod_rfb1_req_i;
  assign omn2dec_extadr_dxb1_o = rat_extadr[dcod_rfb1_adr_i];
  //  # relative operand A2
  assign omn2dec_hazard_d1a2_o = rat_rd1_alloc[dcod_rfa2_adr_i] & dcod_rfa2_req_i;
  assign omn2dec_hazard_d2a2_o = rat_rd2_alloc[dcod_rfa2_adr_i] & dcod_rfa2_req_i;
  assign omn2dec_extadr_dxa2_o = rat_extadr[dcod_rfa2_adr_i];
  //  # relative operand B2
  assign omn2dec_hazard_d1b2_o = rat_rd1_alloc[dcod_rfb2_adr_i] & dcod_rfb2_req_i;
  assign omn2dec_hazard_d2b2_o = rat_rd2_alloc[dcod_rfb2_adr_i] & dcod_rfb2_req_i;
  assign omn2dec_extadr_dxb2_o = rat_extadr[dcod_rfb2_adr_i];

Reservation Stations

marocchino reservation station diagram

The reservation station receives an instruction from the decode stage and queues it until all hazards are resolved and the execution unit is free.

Each reservation station has one busy slot and one execution slot. In the Pentium Pro there were 20 reservation station slots, the Marocchino has 5 or 10 depending if you count the execution slots.

Reservation stations are populated when the pipeline advance padv_rsrvs_i signal comes. An instruction may be forwarded directly to execution if there are no hazards and the execution unit is free.

The reservation station resolves hazards by watching and comparing wrbk_extadr_i with the busy_extadr_dxa_r and busy_extadr_dxb_r registers. If the two match it means that the instruction producing register A or B has finished writing back its results and the hazard can be cleared.

Writeback forwarding is handled via the following verilog multiplexer and register logic. The first bit is used to register the decoded values dcod_rf* from the register file otherwise we watch for inputs from the forwarding logic. If there is a pending hazard results are forwarded from the common data bus, otherwise results are maintained.

  // BUSY stage operands A1 & B1
  always @(posedge cpu_clk) begin
    if (padv_rsrvs_i) begin
      busy_rfa1_r <= dcod_rfa1;
      busy_rfb1_r <= dcod_rfb1;
    end
    else begin
      busy_rfa1_r <= busy_rfa1;
      busy_rfb1_r <= busy_rfb1;
    end
  end // @clock

  // Forwarding
  //  operand A1
  assign busy_rfa1 =  busy_hazard_d1a1_r ? wrbk_result1_i :
                     (busy_hazard_d2a1_r ? wrbk_result2_i : busy_rfa1_r);
  //  operand B1
  assign busy_rfb1 =  busy_hazard_d1b1_r ? wrbk_result1_i :
                     (busy_hazard_d2b1_r ? wrbk_result2_i : busy_rfb1_r);

When all hazard flags are cleared the contents of busy_op_r , busy_rfa_r and busy_rfb_r will be transferred to exec_op_any_r, exec_op_r, etc. They are presented on the outputs and the execution unit can take them and start processing.

The unit_free_o output signals the control unit that the reservation station is free and can be issued another instruction. The signal goes high when all hazards are cleared and the busy state transfers to exec.

Execution Units

In Marocchino the execution units (also referred to as functional units) execute instructions which it receives from the reservation stations.

The execution units in Marocchino are:

Handshake signals between the reservation station and execution units are used to issue operations to execution units.

marocchino execution unit handshake diagram

The taking_op_i is the signal from the execution unit signalling it has received the op and the reservation station will clear all exec_*_o output signals.

Order Control Buffer

marocchino order control diagram

In the Marocchino the Order Control Buffer (OCB) is the in order retirement unit. It can retire a single instruction at a time. The implementation is a 7 entry FIFO queue. This is much less than the Pentium Pro which contains 40 slots. The OCB receives a single instruction at time from the decoder and broadcasts the oldest instruction for other components to see. Instructions are retired after execution write back is complete.

If the OCB output indicates a branch instruction or an exception, branch logic is invoked. Instead of waiting for write back to a register the write back logic in the Marocchino will perform the branch operations. This may include flushing the OCB. Special care is taken to handle branch delay slot instruction execution.

The OCB is different from a traditional Tomasulo Reorder Buffer (ROB) in that it does not store any execution write back results.

Each OCB entry stores:

This can be seen as defined by the ocbi and ocbi wire buses in or1k_marocchino_oman.v.

  // --- OCB-Controls input ---
  wire  [OCBT_MSB:0] ocbi;
  assign ocbi =
    {
      // --- pipeline [C]ontrol flags ---
      dcod_extadr_r, // OCB-Controls entrance
      dcod_op_ls_i, // OCB-Controls entrance
      dcod_op_fpxx_cmp_i, // OCB-Controls entrance
      dcod_op_fpxx_arith_i, // OCB-Controls entrance
      dcod_op_mul_i, // OCB-Controls entrance
      dcod_op_div_i, // OCB-Controls entrance
      dcod_op_1clk_i, // OCB-Controls entrance
      dcod_op_jb_r, // OCB-Controls entrance
      dcod_op_push_wrbk_i, // OCB-Controls entrance
      // --- instruction [A]ttributes ---
      pc_decode_i, // OCB-Attributes entrance
      dcod_rfd2_adr_i, // OCB-Attributes entrance
      dcod_rfd2_we_i, // OCB-Attributes entrance
      dcod_rfd1_adr_i, // OCB-Attributes entrance
      dcod_rfd1_we_i, // OCB-Attributes entrance
      dcod_delay_slot_i, // OCB-Attributes entrance
      dcod_op_rfe_i, // OCB-Attributes entrance
      // Flag that istruction is restartable
      interrupts_en, // OCB-Attributes entrance
      // Combined IFETCH/DECODE an exception flag
      dcod_an_except_fd_i, // OCB-Attributes entrance
      // FETCH & DECODE exceptions
      dcod_fetch_except_ibus_err_r, // OCB-Attributes entrance
      dcod_fetch_except_ipagefault_r, // OCB-Attributes entrance
      dcod_fetch_except_itlb_miss_r, // OCB-Attributes entrance
      dcod_except_illegal_i, // OCB-Attributes entrance
      dcod_except_syscall_i, // OCB-Attributes entrance
      dcod_except_trap_i // OCB-Attributes entrance
    };

  // --- INSN OCB input ---
  wire [OCBT_MSB:0] ocbo;

Common Data Bus

As discussed above the common data collects write back results from execution units and routes them for write back.

This can be seen in the or1k_marocchino_cpu.v as below.

  // --- regular ---
  always @(wrbk_1clk_result       or wrbk_div_result or wrbk_mul_result or
           wrbk_fpxx_arith_res_hi or wrbk_lsu_result or wrbk_mfspr_result)
  begin
    wrbk_result1 = wrbk_1clk_result       | wrbk_div_result | wrbk_mul_result |
                   wrbk_fpxx_arith_res_hi | wrbk_lsu_result | wrbk_mfspr_result;
  end

  // --- FPU64 extention ---
  assign wrbk_result2 = wrbk_fpxx_arith_res_lo;

Conclusion

Tomasulo’s algorithm is still relevant today and used in many processors. Marocchino provides an accessible implementation. Marocchino is however, not super-scalar, while Pentium Pro can decode up to 4 instructions at a time the Marocchino can only decode 1 at a time.

Furthermore many improvements can be made to Marocchino to increase performance. Including:

However, these come with a cost of size on the FPGA. If you are interested in helping out please feel free to contribute.

If anything in this article could be improved, more timing diagrams, typos or fixes for diagrams please send me a message on twitter.

Further Reading and Sources