shorne in japan


One reservation station per execution unit

There are four reservation stations:

Reservation unit stages

Each reservation station consists of two stages: RSRVS-BUSY and RSRVS-EXECUTE. The RSRVS-EXECUTE stage is just a set of registers that provide operands to the appropriate execution module once all hazards are resolved.

case: No Hazards, Decode to Execution Bypass busy

If:

case: Hazards, Decode to Busy Enqueue

If:

case: Resolve Hazard, Busy to Execute

If:

case: Stall

If RSRVS-BUSY is occupied by an instruction (with or without hazards), the unit is busy. If the instruction in PIPE-DECODE needs to be pushed into such a reservation station, PIPE-DECODE and PIPE-FETCH are stalled until RSRVS-BUSY becomes empty.
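The stall rule can be sketched as a small behavioural model in Python. This is only an illustration under assumptions: the class and method names are hypothetical, and the conditions for the bypass and enqueue cases are deliberately left out because they are not listed here.

```python
class ReservationStation:
    """Toy model of the RSRVS-BUSY slot of one reservation station."""

    def __init__(self):
        self.busy = None  # instruction held in RSRVS-BUSY, or None if empty

    def decode_must_stall(self, decode_targets_this_station):
        # PIPE-DECODE (and PIPE-FETCH behind it) stalls only when it needs
        # this station and RSRVS-BUSY is already occupied.
        return decode_targets_this_station and self.busy is not None

    def enqueue(self, insn):
        # Hypothetical helper: place an instruction into RSRVS-BUSY.
        if self.busy is not None:
            raise RuntimeError("RSRVS-BUSY occupied: decode should have stalled")
        self.busy = insn

    def release(self):
        # Hypothetical helper: RSRVS-BUSY becomes empty again.
        insn, self.busy = self.busy, None
        return insn
```

Note that an instruction aimed at a different, free station is not stalled by this one, which is why the check takes the decode target into account.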

Only the RSRVS-BUSY stage performs hazard resolution. I also tried to implement hazard resolution in the RSRVS-EXECUTE stage to increase the effective RSRVS depth, but it increased LUT consumption noticeably without any performance improvement.

The RSRVS-BUSY stages also act as decoupling buffers that prevent long propagation paths for the padv_* signals. You can find a similar technique at the output of PIPE-IFETCH and of all execution units. Moreover, there are several such decoupling buffers inside the FPU pipes.

Optimizations

remove padv signals

I had an idea to remove the padv_* signals from the design completely and replace them with inter-stage ready/take signals. But I haven’t yet found a way to handle requests from the DU and to process l.mfspr / l.mtspr in that case.

A few notes on l.mfspr / l.mtspr processing are in order here. These instructions are processed in a special way. If l.mfspr/l.mtspr is in the PIPE-DECODE stage, PIPE-IFETCH and PIPE-DECODE are stalled. There is no reservation station for l.mfspr/l.mtspr. These instructions are processed by dedicated logic in the CTRL module, and only after the OCB becomes empty, which means all hazards are resolved. While CTRL processes l.mfspr / l.mtspr, the pipe is stalled.

OCB and writeback

As you correctly discovered, the OCB restores instruction order by granting access to the “common write-back bus” in exactly the order in which instructions were issued to the execution units. The “common write-back bus” is distributed among the units: each of them has wrbk_* registers as its output. If padv_wrbk is high and a unit is granted write-back access, it puts its result into its wrbk_* registers. At the same time, all other units put zero into their wrbk_* registers. wrbk_result1 is simply the OR of the wrbk_* registers of all execution units. wrbk_result2 is just the least significant word of the double-precision FPU result (the single-precision result, or the most significant word of the double-precision result, goes to wrbk_result1). And padv_wrbk is raised when the execution unit that has been granted access is ready.
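A minimal Python sketch of this ordering and OR-ing scheme follows. It is a behavioural toy, not the RTL: the unit names, the `OCB` class and `writeback_cycle` are hypothetical, only wrbk_result1 is modelled, and a cycle in which padv_wrbk stays low is assumed to leave zero on the bus.

```python
from collections import deque

# Hypothetical unit names; the real design has its own set of execution units.
UNITS = ("unit_a", "unit_b", "unit_c")


class OCB:
    """Order control buffer modelled as a FIFO of unit names in issue order."""

    def __init__(self):
        self.order = deque()

    def issue(self, unit):
        self.order.append(unit)


def writeback_cycle(ocb, ready):
    """ready maps a unit name to its finished result; returns (padv_wrbk, wrbk_result1)."""
    granted = ocb.order[0] if ocb.order else None
    # padv_wrbk is raised only if the unit that is granted access is ready.
    padv_wrbk = granted is not None and granted in ready
    # The granted unit drives its result; every other unit drives zero.
    wrbk = {u: (ready[u] if padv_wrbk and u == granted else 0) for u in UNITS}
    # The shared result is the OR of all per-unit write-back registers.
    wrbk_result1 = 0
    for value in wrbk.values():
        wrbk_result1 |= value
    if padv_wrbk:
        ocb.order.popleft()
        del ready[granted]
    return padv_wrbk, wrbk_result1
```

For example, if `unit_a` is issued before `unit_b` but `unit_b` finishes first, the model holds write-back until `unit_a` is ready and only then lets `unit_b` through. Because all non-granted units drive zero, a plain OR is enough and no multiplexer is needed.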

padv_exec can be treated as the OR of the padv_*_rsrvs signals plus an implicit padv_op_mXspr (for pushing l.mfspr/l.mtspr from PIPE-DECODE to CTRL). If a RSRVS or l.mfspr/l.mtspr needs to be pushed, then the OCB is pushed with padv_exec as well. For my money, my implementation is more elegant than just an OR :-).
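Logically, this view of padv_exec amounts to the following Python sketch. It shows only the equivalent OR form, not the actual implementation; the function name and the dictionary of per-station signals are assumptions made for illustration.

```python
def padv_exec(padv_rsrvs, padv_op_mxspr):
    """padv_rsrvs maps a (hypothetical) station name to its padv_*_rsrvs bit."""
    # Advance if any reservation station is being pushed, or if an
    # l.mfspr / l.mtspr is being pushed from PIPE-DECODE to CTRL.
    return any(padv_rsrvs.values()) or padv_op_mxspr
```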

Other ideas

I had a lot of ideas:

However, after a preliminary analysis I concluded that, on the one hand, all of these techniques are very costly in terms of implementation time and LUT consumption, and on the other hand, MAROCCHINO is already huge while the potential performance improvement is not obviously high. That’s why I postponed them.