shorne in japan

blog about contact

Thread Local Storage

19 Jan 2020

This is an ongoing series of posts on ELF Binary Relocations and Thread Local Storage. This article covers only Thread Local Storage and assumes the reader has had a primer in ELF Relocations, if not please start with my previous article ELF Binaries and Relocation Entries

In the last article we covered ELF Binary internals and how relocation entries are used to during link time to allow our programs to access symbols (variables). However, what if we want a different variable instance for each thread? This is where thread local storage (TLS) comes in.

In this article we will discuss how TLS works. Our outline:

As before, the examples in this article can be found in my tls-examples project. Please check it out.

Thread Local Storage

Did you know that in C you can prefix variables with __thread to create thread local variables?

Example

__thread int i;

A thread local variable is a variable that will have a unique instance per thread. Each time a new thread is created, the space required to store the thread local variables is allocated.

TLS variables are stored in dynamic TLS sections.

TLS Sections

In the previous article we saw how variables were stored in the .data and .bss sections. These are initialized once per program or library.

When we get to binaries that use TLS we will additionally have .tdata and .tbss sections.

  • .tdata - static and non static initialized thread local variables
  • .tbss - static and non static non-initialized thread local variables

These exist in a special TLS segment which is loaded per thread. In the next article we will discuss more about how this loading works.

TLS Data Structures

As we recall, to access data in .data and .bss sections simple code sequences with relocation entries are used. These sequences set and add registers to build pointers to our data. For example, the below sequence uses 2 relocations to compose a .bss section address into register r11.

Addr.   Machine Code    Assembly             Relocations
0000000c <get_x_addr>:
   c:   19 60 [00 00]   l.movhi r11,[0]      # c  R_OR1K_AHI16 .bss
  10:   44 00  48 00    l.jr r9
  14:   9d 6b [00 00]    l.addi r11,r11,[0]  # 14 R_OR1K_LO_16_IN_INSN .bss

With TLS the code sequences to access our data will also build pointers to our data, but they need to traverse the TLS data structures.

As the code sequence is read only and will be the same for each thread another level of indirection is needed, this is provided by the Thread Pointer (TP).

The Thread Pointer points into a data structure that allows us to locate TLS data sections. The TLS data structure includes:

  • Thread Control Block (TCB)
  • Dynamic Thread Vector (DTV)
  • TLS Data Sections

These are illustrated as below:

  dtv[]   [ dtv[0], dtv[1], dtv[2], .... ]
            counter ^ |      \
               ----/ /        \________
              /     V                  V
/------TCB-------\/----TLS[1]----\   /----TLS[2]----\
| pthread tcbhead | tbss   tdata |   | tbss   tdata |
\----------------/\--------------/   \--------------/
          ^
          |
   TP-----/

Thread Pointer (TP)

The TP is unique to each thread. It provides the starting point to the TLS data structure.

  • The TP points to the Thread Control Block
  • On OpenRISC the TP is stored in r10
  • On x86_64 the TP is stored in $fs
  • This is the *tls pointer passed to the clone() system call when using CLONE_SETTLS.

Thread Control Block (TCB)

The TCB is the head of the TLS data structure. The TCB consists of:

  • pthread - the pthread struct for the current thread, contains tid etc. Located by TP - TCB size - Pthread size
  • tcbhead - the tcbhead_t struct, machine dependent, contains pointer to DTV. Located by TP - TCB size.

For OpenRISC tcbhead_t is defined in sysdeps/or1k/nptl/tls.h as:

typedef struct {
  dtv_t *dtv;
} tcbhead_t
  • dtv - is a pointer to the dtv array, points to entry dtv[1]

For x86_64 the tcbhead_t is defined in sysdeps/x86_64/nptl/tls.h as:

typedef struct
{
  void *tcb;            /* Pointer to the TCB.  Not necessarily the
                           thread descriptor used by libpthread.  */
  dtv_t *dtv;
  void *self;           /* Pointer to the thread descriptor.  */
  int multiple_threads;
  int gscope_flag;
  uintptr_t sysinfo;
  uintptr_t stack_guard;
  uintptr_t pointer_guard;
  unsigned long int vgetcpu_cache[2];
  /* Bit 0: X86_FEATURE_1_IBT.
     Bit 1: X86_FEATURE_1_SHSTK.
   */
  unsigned int feature_1;
  int __glibc_unused1;
  /* Reservation of some values for the TM ABI.  */
  void *__private_tm[4];
  /* GCC split stack support.  */
  void *__private_ss;
  /* The lowest address of shadow stack,  */
  unsigned long long int ssp_base;
  /* Must be kept even if it is no longer used by glibc since programs,
     like AddressSanitizer, depend on the size of tcbhead_t.  */
  __128bits __glibc_unused2[8][4] __attribute__ ((aligned (32)));

  void *__padding[8];
} tcbhead_t;

The x86_64 implementation includes many more fields including:

  • gscope_flag - Global Scope lock flags used by the runtime linker, for OpenRISC this is stored in pthread.
  • stack_guard - The stack guard canary stored in the thread local area. For OpenRISC a global stack guard is stored in .bss.
  • pointer_guard - The pointer guard stored in the thread local area. For OpenRISC a global pointer guard is stored in .bss.

Dynamic Thread Vector (DTV)

The DTV is an array of pointers to each TLS data section. The first entry in the DTV array contains the generation counter. The generation counter is really just the array size. The DTV can be dynamically resized as more TLS modules are loaded.

The dtv_t type is a union as defined below:

typedef struct {
  void *val;     // Aligned pointer to data/bss
  void *to_free; // Unaligned pointer for free()
} dtv_pointer

typedef union {
  int counter;          // for entry 0
  dtv_pointer pointer;  // for all other entries
} dtv_t

Each dtv_t entry can be either a counter or a pointer. By convention the first entry, dtv[0] is a counter and the rest are pointers.

Thread Local Storage (TLS)

The initial set of TLS data sections is allocated contiguous with the TCB. Additional TLS data blocks will be allocated dynamically. There will be one entry for each loaded module, the first module being the current program. For dynamic libraries it is lazily initialized per thread.

Local (or TLS[1])

  • tbss - the .tbss section for the current thread from the current processes ELF binary.
  • tdata - the .tdata section for the current thread from the current processes ELF binary.

TLS[2]

  • tbss - the .tbss section for variables defined in the first shared library loaded by the current process
  • tdata - the .tdata section for variables defined in the first shared library loaded by the current process

The __tls_get_addr() function

The __tls_get_addr() function can be used at any time to traverse the TLS data structure and return a variable’s address. The function is given a pointer to an architecture specific argument tls_index.

  • The argument contains 2 pieces of data:
    • The module index - 0 for the current process, 1 for the first loaded shared library etc.
    • The data offset - the offset of the variable in the TLS data section
  • Internally __tls_get_addr uses TP to located the TLS data structure
  • The function returns the address of the variable we want to access

For static builds the implementation is architecture dependant and defined in OpenRISC sysdeps/or1k/libc-tls.c as:

__tls_get_addr (tls_index *ti)
{
  dtv_t *dtv = THREAD_DTV ();
  return (char *) dtv[1].pointer.val + ti->ti_offset;
}

Note for for static builds the module index can be hard coded to 1 as there will always be only one module.

For dynamically linked programs the implementation is defined as part of the runtime dynamic linker in elf/dl-tls.c as:

void *
__tls_get_addr (GET_ADDR_ARGS)
{
  dtv_t *dtv = THREAD_DTV ();

  if (__glibc_unlikely (dtv[0].counter != GL(dl_tls_generation)))
    return update_get_addr (GET_ADDR_PARAM);

  void *p = dtv[GET_ADDR_MODULE].pointer.val;

  if (__glibc_unlikely (p == TLS_DTV_UNALLOCATED))
    return tls_get_addr_tail (GET_ADDR_PARAM, dtv, NULL);

  return (char *) p + GET_ADDR_OFFSET;
}

Here several macros are used so it’s a bit hard to follow but there are:

  • THREAD_DTV - uses TP to get the pointer to the DTV array.
  • GET_ADDR_ARGS - short for tls_index* ti
  • GET_ADDR_PARAM - short for ti
  • GET_ADDR_MODULE - short for ti->ti_module
  • GET_ADDR_OFFSET - short for ti->ti_offset

TLS Access Models

As one can imagine, traversing the TLS data structures when accessing each variable could be slow. For this reason there are different TLS access models that the compiler can choose to minimize variable access overhead.

Global Dynamic

The Global Dynamic (GD), sometimes called General Dynamic, access model is the slowest access model which will traverse the entire TLS data structure for each variable access. It is used for accessing variables in dynamic shared libraries.

Before Linking

Global Dynamic Object

Not counting relocations for the PLT and GOT entries; before linking the .text contains 1 placeholder for a GOT offset. This GOT entry will contain the arguments to __tls_get_addr.

After Linking

Global Dynamic Program

After linking there will be 2 relocation entries in the GOT to be resolved by the dynamic linker. These are R_TLS_DTPMOD, the TLS module index, and R_TLS_DTPOFF, the offset of the variable into the TLS module.

Example

File: tls-gd.c

extern __thread int x;

int* get_x_addr() {
  return &x;
}

Code Sequence (OpenRISC)

tls-gd.o:     file format elf32-or1k

Disassembly of section .text:

0000004c <get_x_addr>:
  4c:	18 60 [00 00] 	l.movhi r3,[0]          # 4c: R_OR1K_TLS_GD_HI16	x
  50:	9c 21 ff f8 	l.addi r1,r1,-8
  54:	a8 63 [00 00] 	l.ori r3,r3,[0]         # 54: R_OR1K_TLS_GD_LO16	x
  58:	d4 01 80 00 	l.sw 0(r1),r16
  5c:	d4 01 48 04 	l.sw 4(r1),r9
  60:	04 00 00 02 	l.jal 68 <get_x_addr+0x1c>
  64:	1a 00 [00 00] 	 l.movhi r16,[0]        # 64: R_OR1K_GOTPC_HI16	_GLOBAL_OFFSET_TABLE_-0x4
  68:	aa 10 [00 00] 	l.ori r16,r16,[0]       # 68: R_OR1K_GOTPC_LO16	_GLOBAL_OFFSET_TABLE_
  6c:	e2 10 48 00 	l.add r16,r16,r9
  70:	04 00 [00 00] 	l.jal [0]               # 70: R_OR1K_PLT26	__tls_get_addr
  74:	e0 63 80 00 	 l.add r3,r3,r16
  78:	85 21 00 04 	l.lwz r9,4(r1)
  7c:	86 01 00 00 	l.lwz r16,0(r1)
  80:	44 00 48 00 	l.jr r9
  84:	9c 21 00 08 	 l.addi r1,r1,8

Code Sequence (x86_64)

tls-gd.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000020 <get_x_addr>:
  20:	48 83 ec 08          	  sub    $0x8,%rsp
  24:	66 48 8d 3d [00 00 00 00] lea    [0](%rip),%rdi  # 28 R_X86_64_TLSGD	x-0x4
  2c:	66 66 48 e8 [00 00 00 00] callq  [0]             # 30 R_X86_64_PLT32	__tls_get_addr-0x4
  34:	48 83 c4 08          	  add    $0x8,%rsp
  38:	c3                   	  retq   

Local Dynamic

The Local Dynamic (LD) access model is an optimization for Global Dynamic where multiple variables may be accessed from the same TLS module. Instead of traversing the TLS data structure for each variable, the TLS data section address is loaded once by calling __tls_get_addr with an offset of 0. Next, variables can be accessed with individual offsets.

Local Dynamic is not supported on OpenRISC yet.

Before Linking

Local Dynamic Object

Not counting relocations for the PLT and GOT entries; before linking the .text contains 1 placeholder for a GOT offset and 2 placeholders for the TLS offsets. This GOT entry will contain the arguments to __tls_get_addr. The TLD offsets will be the offsets to our variables in the TLD data section.

After Linking

Local Dynamic Program

After linking there will be 1 relocation entry in the GOT to be resolved by the dynamic linker. This is R_TLS_DTPMOD, the TLS module index, the offset will be 0x0.

Example

File: tls-ld.c

static __thread int x;
static __thread int y;

int sum() {
  return x + y;
}

Code Sequence (x86_64)

tls-ld.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000030 <sum>:
  30:	48 83 ec 08          	sub    $0x8,%rsp
  34:	48 8d 3d [00 00 00 00] 	lea    [0](%rip),%rdi   # 37 R_X86_64_TLSLD	x-0x4
  3b:	e8 [00 00 00 00]       	callq  [0]              # 3c R_X86_64_PLT32	__tls_get_addr-0x4
  40:	8b 90 [00 00 00 00]    	mov    [0](%rax),%edx   # 42 R_X86_64_DTPOFF32	x
  46:	03 90 [00 00 00 00]    	add    [0](%rax),%edx   # 48 R_X86_64_DTPOFF32	y
  4c:	48 83 c4 08          	add    $0x8,%rsp
  50:	89 d0                	mov    %edx,%eax
  52:	c3                   	retq   

Initial Exec

The Initial Exec (IE) access model does not require traversing the TLS data structure. It requires that the compiler knows that offset from the TP to the variable can be computed during link time.

As Initial Exec does not require calling __tls_get_addr is is more efficient compared the GD and LD access.

Before Linking

Initial Exec Object

Text contains a placeholder for the got address of the offset. Not counting relocation entry for the GOT; before linking the .text contains 1 placeholder for a GOT offset. This GOT entry will contain the TP offset to the variable.

After Linking

Initial Exec Program

After linking there will be no remaining relocation entries. The .text section contains the actual GOT offset and the GOT entry will contain the TP offset to the variable.

Example

File: tls-ie.c

Initial exec C code will be the same as global dynamic, however IE access will be chosen when static compiling.

extern __thread int x;

int* get_x_addr() {
  return &x;
}

Code Sequence (OpenRISC)

00000038 <get_x_addr>:
  38:	9c 21 ff fc 	l.addi r1,r1,-4
  3c:	1a 20 [00 00] 	l.movhi r17,[0x0]   # 3c: R_OR1K_TLS_IE_AHI16	x
  40:	d4 01 48 00 	l.sw 0(r1),r9
  44:	04 00 00 02 	l.jal 4c <get_x_addr+0x14>
  48:	1a 60 [00 00] 	 l.movhi r19,[0x0]  # 48: R_OR1K_GOTPC_HI16	_GLOBAL_OFFSET_TABLE_-0x4
  4c:	aa 73 [00 00] 	l.ori r19,r19,[0x0] # 4c: R_OR1K_GOTPC_LO16	_GLOBAL_OFFSET_TABLE_
  50:	e2 73 48 00 	l.add r19,r19,r9
  54:	e2 31 98 00 	l.add r17,r17,r19
  58:	85 71 [00 00] 	l.lwz r11,[0](r17)  # 58: R_OR1K_TLS_IE_LO16	x
  5c:	85 21 00 00 	l.lwz r9,0(r1)
  60:	e1 6b 50 00 	l.add r11,r11,r10
  64:	44 00 48 00 	l.jr r9
  68:	9c 21 00 04 	 l.addi r1,r1,4

Code Sequence (x86_64)

0000000000000010 <get_x_addr>:
  10:	48 8b 05 [00 00 00 00] 	   mov    0x0(%rip),%rax   # 13: R_X86_64_GOTTPOFF	x-0x4
  17:	64 48 03 04 25 00 00 00 00 add    %fs:0x0,%rax
  20:	c3                         retq   

Local Exec

The Local Exec (LD) access model does not require traversing the TLS data structure or a GOT entry. It is chosen by the compiler when accessing file local variables in the current program.

The Local Exec access model is the most efficient.

Before Linking

Local Exec Object

Before linking the .text section contains one relocation entry for a TP offset.

After Linking

Local Exec Program

After linking the .text section contains the value of the TP offset.

Example

File: tls-le.c

In the Local Exec example the variable x is local, it is not extern.

static __thread int x;

int * get_x_addr() {
  return &x;
}

Code Sequence (OpenRISC)

00000010 <get_x_addr>:
  10:	19 60 [00 00] 	l.movhi r11,[0x0]    # 10: R_OR1K_TLS_LE_AHI16	.LANCHOR0
  14:	e1 6b 50 00 	l.add r11,r11,r10
  18:	44 00 48 00 	l.jr r9
  1c:	9d 6b [00 00] 	 l.addi r11,r11,[0]  # 1c: R_OR1K_TLS_LE_LO16	.LANCHOR0

Code Sequence (x86_64)

0000000000000010 <get_x_addr>:
  10:	64 48 8b 04 25 00 00 00 00  mov    %fs:0x0,%rax
  19:	48 05 [00 00 00 00]    	    add    $0x0,%rax  # 1b: R_X86_64_TPOFF32	x
  1f:	c3                   	    retq   

Linker Relaxation

As some TLS access methods are more efficient than others we would like to choose the best method for each variable access. However, we sometimes don’t know where a variable will come from until link time.

On some architectures the linker will rewrite the TLS access code sequence to change to a more efficient access model, this is called relaxation.

One type of relaxation performed by the linker is GD to IE relaxation. During compile time GD relocation may be chosen for extern variables. However, during link time the variable may be found in the same module i.e. not a shared object which would require GD access. In this case the access model can be changed to IE.

That’s pretty cool.

Summary

In this article we have covered how TLS variables are accessed per thread via the TLS data structure. Also, we saw how different TLS access models provide varying levels of efficiency.

In the next article we will look more into how this is implemented in GCC, the linker and the GLIBC runtime dynamic linker.

Further Reading


ELF Binaries and Relocation Entries

29 Nov 2019

Recently I have been working on getting the OpenRISC glibc port ready for upstreaming. Part of this work has been to run the glibc testsuite and get the tests to pass. The glibc testsuite has a comprehensive set of linker and runtime relocation tests.

In order to fix issues with tests I had to learn more than I did before about ELF Relocations , Thread Local Storage and the binutils linker implementation in BFD. There is a lot of documentation available, but it’s a bit hard to follow as it assumes certain knowledge, for example have a look at the Solaris Linker and Libraries section on relocations. In this article I will try to fill in those gaps.

This will be an illustrated 3 part series covering

  • ELF Binaries and Relocation Entries
  • Thread Local Storage
  • How Relocations and Thread Local Store are implemented

All of the examples in this article can be found in my tls-examples project. Please check it out.

On Linux, you can download it and make it with your favorite toolchain. By default it will cross compile using an openrisc toolchain. This can be overridden with the CROSS_COMPILE variable. For example, to build for your current host.

$ git clone git@github.com:stffrdhrn/tls-examples.git
$ make CROSS_COMPILE=
gcc -fpic -c -o tls-gd-dynamic.o tls-gd.c -Wall -O2 -g
gcc -fpic -c -o nontls-dynamic.o nontls.c -Wall -O2 -g
...
objdump -dr x-static.o > x-static.S
objdump -dr xy-static.o > xy-static.S

Now we can get started.

ELF Segments and Sections

Before we can talk about relocations we need to talk a bit about what makes up ELF binaries. This is a prerequisite as relocations and TLS are part of ELF binaries. There are a few basic ELF binary types:

  • Objects (.o) - produced by a compiler, contains a collection of sections, also call relocatable files.
  • Program - an executable program, contains sections grouped into segments.
  • Shared Objects (.so) - a program library, contains sections grouped into segments.
  • Core Files - core dump of program memory, these are also ELF binaries

Here we will discuss Object Files and Program Files.

An ELF Object

ELF Object

The compiler generates object files, these contain sections of binary data and these are not executable.

The object file produced by gcc generally contains .rela.text, .text, .data and .bss sections.

  • .rela.text - a list of relocations against the .text section
  • .text - contains compiled program machine code
  • .data - static and non static initialized variable values
  • .bss - static and non static non-initialized variables

An ELF Program

ELF Program

ELF binaries are made of sections and segments.

A segment contains a group of sections and the segment defines how the data should be loaded into memory for program execution.

Each segment is mapped to program memory by the kernel when a process is created. Program files contain most of the same sections as objects but there are some differences.

  • .text - contains executable program code, there is no .rela.text section
  • .got - the global offset table used to access variables, created during link time. May be populated during runtime.

Looking at ELF binaries (readelf)

The readelf tool can help inspect elf binaries.

Some examples:

Reading Sections of an Object File

Using the -S option we can read sections from an elf file. As we can see below we have the .text, .rela.text, .bss and many other sections.

$ readelf -S tls-le-static.o
There are 20 section headers, starting at offset 0x604:

Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] .text             PROGBITS        00000000 000034 000020 00  AX  0   0  4
  [ 2] .rela.text        RELA            00000000 0003f8 000030 0c   I 17   1  4
  [ 3] .data             PROGBITS        00000000 000054 000000 00  WA  0   0  1
  [ 4] .bss              NOBITS          00000000 000054 000000 00  WA  0   0  1
  [ 5] .tbss             NOBITS          00000000 000054 000004 00 WAT  0   0  4
  [ 6] .debug_info       PROGBITS        00000000 000054 000074 00      0   0  1
  [ 7] .rela.debug_info  RELA            00000000 000428 000084 0c   I 17   6  4
  [ 8] .debug_abbrev     PROGBITS        00000000 0000c8 00007c 00      0   0  1
  [ 9] .debug_aranges    PROGBITS        00000000 000144 000020 00      0   0  1
  [10] .rela.debug_arang RELA            00000000 0004ac 000018 0c   I 17   9  4
  [11] .debug_line       PROGBITS        00000000 000164 000087 00      0   0  1
  [12] .rela.debug_line  RELA            00000000 0004c4 00006c 0c   I 17  11  4
  [13] .debug_str        PROGBITS        00000000 0001eb 00007a 01  MS  0   0  1
  [14] .comment          PROGBITS        00000000 000265 00002b 01  MS  0   0  1
  [15] .debug_frame      PROGBITS        00000000 000290 000030 00      0   0  4
  [16] .rela.debug_frame RELA            00000000 000530 000030 0c   I 17  15  4
  [17] .symtab           SYMTAB          00000000 0002c0 000110 10     18  15  4
  [18] .strtab           STRTAB          00000000 0003d0 000025 00      0   0  1
  [19] .shstrtab         STRTAB          00000000 000560 0000a1 00      0   0  1

Reading Sections of a Program File

Using the -S option on a program file we can also read the sections. The file type does not matter as long as it is an ELF we can read the sections. As we can see below there is no longer a rela.text section, but we have others including the .got section.

$ readelf -S tls-le-static
There are 31 section headers, starting at offset 0x32e8fc:

Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] .text             PROGBITS        000020d4 0000d4 080304 00  AX  0   0  4
  [ 2] __libc_freeres_fn PROGBITS        000823d8 0803d8 001118 00  AX  0   0  4
  [ 3] .rodata           PROGBITS        000834f0 0814f0 01544c 00   A  0   0  4
  [ 4] __libc_subfreeres PROGBITS        0009893c 09693c 000024 00   A  0   0  4
  [ 5] __libc_IO_vtables PROGBITS        00098960 096960 0002f4 00   A  0   0  4
  [ 6] __libc_atexit     PROGBITS        00098c54 096c54 000004 00   A  0   0  4
  [ 7] .eh_frame         PROGBITS        00098c58 096c58 0027a8 00   A  0   0  4
  [ 8] .gcc_except_table PROGBITS        0009b400 099400 000089 00   A  0   0  1
  [ 9] .note.ABI-tag     NOTE            0009b48c 09948c 000020 00   A  0   0  4
  [10] .tdata            PROGBITS        0009dc28 099c28 000010 00 WAT  0   0  4
  [11] .tbss             NOBITS          0009dc38 099c38 000024 00 WAT  0   0  4
  [12] .init_array       INIT_ARRAY      0009dc38 099c38 000004 04  WA  0   0  4
  [13] .fini_array       FINI_ARRAY      0009dc3c 099c3c 000008 04  WA  0   0  4
  [14] .data.rel.ro      PROGBITS        0009dc44 099c44 0003bc 00  WA  0   0  4
  [15] .data             PROGBITS        0009e000 09a000 000de0 00  WA  0   0  4
  [16] .got              PROGBITS        0009ede0 09ade0 000064 04  WA  0   0  4
  [17] .bss              NOBITS          0009ee44 09ae44 000bec 00  WA  0   0  4
  [18] __libc_freeres_pt NOBITS          0009fa30 09ae44 000014 00  WA  0   0  4
  [19] .comment          PROGBITS        00000000 09ae44 00002a 01  MS  0   0  1
  [20] .debug_aranges    PROGBITS        00000000 09ae6e 002300 00      0   0  1
  [21] .debug_info       PROGBITS        00000000 09d16e 0fd048 00      0   0  1
  [22] .debug_abbrev     PROGBITS        00000000 19a1b6 0270ca 00      0   0  1
  [23] .debug_line       PROGBITS        00000000 1c1280 0ce95c 00      0   0  1
  [24] .debug_frame      PROGBITS        00000000 28fbdc 0063bc 00      0   0  4
  [25] .debug_str        PROGBITS        00000000 295f98 011e35 01  MS  0   0  1
  [26] .debug_loc        PROGBITS        00000000 2a7dcd 06c437 00      0   0  1
  [27] .debug_ranges     PROGBITS        00000000 314204 00c900 00      0   0  1
  [28] .symtab           SYMTAB          00000000 320b04 0075d0 10     29 926  4
  [29] .strtab           STRTAB          00000000 3280d4 0066ca 00      0   0  1
  [30] .shstrtab         STRTAB          00000000 32e79e 00015c 00      0   0  1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  p (processor specific)

Reading Segments from a Program File

Using the -l option on a program file we can read the segments. Notice how segments map from file offsets to memory offsets and alignment. The two different LOAD type segments are segregated by read only/execute and read/write. Each section is also mapped to a segment here. As we can see .text is in the first LOAD` segment which is executable as expected.

$ readelf -l tls-le-static

Elf file type is EXEC (Executable file)
Entry point 0x2104
There are 5 program headers, starting at offset 52

Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  LOAD           0x000000 0x00002000 0x00002000 0x994ac 0x994ac R E 0x2000
  LOAD           0x099c28 0x0009dc28 0x0009dc28 0x0121c 0x01e1c RW  0x2000
  NOTE           0x09948c 0x0009b48c 0x0009b48c 0x00020 0x00020 R   0x4
  TLS            0x099c28 0x0009dc28 0x0009dc28 0x00010 0x00034 R   0x4
  GNU_RELRO      0x099c28 0x0009dc28 0x0009dc28 0x003d8 0x003d8 R   0x1

 Section to Segment mapping:
  Segment Sections...
   00     .text __libc_freeres_fn .rodata __libc_subfreeres __libc_IO_vtables __libc_atexit .eh_frame .gcc_except_table .note.ABI-tag 
   01     .tdata .init_array .fini_array .data.rel.ro .data .got .bss __libc_freeres_ptrs 
   02     .note.ABI-tag 
   03     .tdata .tbss 
   04     .tdata .init_array .fini_array .data.rel.ro 

Reading Segments from an Object File

Using the -l option with an object file does not work as we can see below.

readelf -l tls-le-static.o

There are no program headers in this file.

Relocation entries

As mentioned an object file by itself is not executable. The main reason is that there are no program headers as we just saw. Another reason is that the .text section still contains relocation entries (or placeholders) for the addresses of variables located in the .data and .bss sections. These placeholders will just be 0 in the machine code. So, if we tried to run the machine code in an object file we would end up with Segmentation faults (SEGV).

A relocation entry is a placeholder that is added by the compiler or linker when producing ELF binaries. The relocation entries are to be filled in with addresses pointing to data. Relocation entries can be made in code such as the .text section or in data sections like the .got section. For example:

Resolving Relocations

GCC and Linker

The diagram above shows relocation entries as white circles. Relocation entries may be filled or resolved at link-time or dynamically during execution.

Link time relocations

  • Place holders are filled in when ELF object files are linked by the linker to create executables or libraries
  • For example, relocation entries in .text sections

Dynamic relocations

  • Place holders is filled during runtime by the dynamic linker. i.e. Procedure Link Table
  • For example, relocation entries added to .got and .plt sections which link to shared objects.

Note: Statically built binaries do not have any dynamic relocations and are not loaded with the dynamic linker.

In general link time relocations are used to fill in relocation entries in code. Dynamic relocations fill in relocation entries in data sections.

Listing Relocation Entries

A list of relocations in a ELF binary can printed using readelf with the -r options.

Output of readelf -r tls-gd-dynamic.o

Relocation section '.rela.text' at offset 0x530 contains 10 entries:
 Offset     Info    Type            Sym.Value  Sym. Name + Addend
00000000  00000f16 R_OR1K_TLS_GD_HI1 00000000   x + 0
00000008  00000f17 R_OR1K_TLS_GD_LO1 00000000   x + 0
00000020  0000100c R_OR1K_GOTPC_HI16 00000000   _GLOBAL_OFFSET_TABLE_ - 4
00000024  0000100d R_OR1K_GOTPC_LO16 00000000   _GLOBAL_OFFSET_TABLE_ + 0
0000002c  00000d0f R_OR1K_PLT26      00000000   __tls_get_addr + 0
...

The relocation entry list explains how to and where to apply the relocation entry. It contains:

  • Offset - the location in the binary that needs to be updated
  • Info - the encoded value containing the Type, Sym and Addend, which is broken down to:
    • Type - the type of relocation (the formula for what is to be performed is defined in the linker)
    • Sym. Value - the address value (if known) of the symbol.
    • Sym. Name - the name of the symbol (variable name) that this relocation needs to find during link time.
  • Addend - a value that needs to be added to the derived symbol address. This is used to with arrays (i.e. for a relocation referencing a[14] we would have Sym. Name a and an Addend of the data size of a times 14)

Example

File: nontls.c

In the example below we have a simple variable and a function to access it’s address.

static int x;

int* get_x_addr() {
  return &x;
}

Let’s see what happens when we compile this source.

The steps to compile and link can be found in the tls-examples project hosting the source examples.

Before Linking

Non TLS Object

The diagram above shows relocations in the resulting object file as white circles.

In the actual output below we can see that access to the variable x is referenced by a literal 0 in each instruction. These are highlighted with square brackets [] below for clarity.

These empty parts of the .text section are relocation entries.

Addr.   Machine Code    Assembly             Relocations
0000000c <get_x_addr>:
   c:   19 60 [00 00]   l.movhi r11,[0]      # c  R_OR1K_AHI16 .bss
  10:   44 00  48 00    l.jr r9
  14:   9d 6b [00 00]    l.addi r11,r11,[0]  # 14 R_OR1K_LO_16_IN_INSN        .bss

The function get_x_addr will return the address of variable x. We can look at the assembly instruction to understand how this is done. Some background of the OpenRISC ABI.

  • Registers are 32-bit.
  • Function return values are placed in register r11.
  • To return from a function we jump to the address in the link register r9.
  • OpenRISC has a branch delay slot, meaning the address after a branch it executed before the branch is taken.

Now, lets break down the assembly:

  • l.movhi - move the value [0] into high bits of register r11, clearing the lower bits.
  • l.addi - add the value in register r11 to the value [0] and store the results in r11.
  • l.jr - jump to the address in r9

This constructs a 32-bit value out of 2 16-bit values.

After Linking

Non TLS Object

The diagram above shows the relocations have been replaced with actual values.

As we can see from the linker output the places in the machine code that had relocation place holders are now replaced with values. For example 1a 20 00 00 has become 1a 20 00 0a.

00002298 <get_x_addr>:
    2298:	19 60 00 0a 	l.movhi r11,0xa
    229c:	44 00 48 00 	l.jr r9
    22a0:	9d 6b ee 60 	l.addi r11,r11,-4512

If we calculate 0xa << 16 + -4512 (fee60) we see get 0009ee60. That is the same location of x within our binary. This we can check with readelf -s which lists all symbols.

$ readelf -s nontls-static | grep ' x'
    42: 0009ee60     4 OBJECT  LOCAL  DEFAULT   17 x

Types of Relocations

As we saw above, a simple program resulted in 2 different relocation entries just to compose the address of 1 variable. We saw:

  • R_OR1K_AHI16
  • R_OR1K_LO_16_IN_INSN

The need for different relation types comes from the different requirements for the relocation. Processing of a relocation involves usually a very simple transform , each relocation defines a different transform. The components of the relocation definition are:

  • Input The input of a relocation formula is always the Symbol Address who’s absolute value is unknown at compile time. But there may also be other input variables to the formula including:
    • Program Counter The absolute address of the machine code address being updated
    • Addend The addend from the relocation entry discussed above in the Listing Relocation Entries section
  • Formula How the input is manipulated to derive the output value. For example shift right 16 bits.
  • Bit-Field Specifies which bits at the output address need to be updated.

To be more specific about the above relocations we have:

Relocation Type Bit-Field Formula
R_OR1K_AHI16 simm16 S >> 16
R_OR1K_LO_16_IN_INSN simm16 S && 0xffff

The Bit-Field described above is simm16 which means update the lower 16-bits of the 32-bit value at the output offset and do not disturb the upper 16-bits.

 +----------+----------+
 |          |  simm16  |
 | 31    16 | 15     0 |
 +----------+----------+

There are many other Relocation Types with difference Bit-Fields and Formulas. These use different methods based on what each instruction does, and where each instruction encodes its immediate value.

For full listings refer to architecture manuals.

Take a look and see if you can understand how to read these now.

Summary

In this article we have discussed what ELF binaries are and how they can be read. We have talked about how from compilation to linking to runtime, relocation entries are used to communicate which parts of a program remain to be resolved. We then discussed how relocation types provide a formula and bit-mask for updating the places in ELF binaries that need to be filled in.

In the next article we will discuss how Thread Local Storage works, both link-time and runtime relocation entries play big part in how TLS works.

Further Reading


OR1K Marocchino A Tomasulo Implementation

21 Oct 2019

This is an ongoing series of posts on the Marocchino CPU, an open source out-of-order OpenRISC cpu. In this series we are reviewing the Marocchino and it’s architecture. If you haven’t already I suggest you start of by reading the intro in Marocchino in Action.

In the last article, Marocchino Instruction Pipeline we discussed the architecture of the CPU. In this article let’s look at how Marocchino achieves out-of-order execution using the Tomasulo algorithm.

Achieving Out-of-Order Execution

In a traditional pipelined CPU the goal is retire one instruction per clock cycle. Any pipeline stall means an execution clock cycle will be lost. One method for reducing the affect of pipeline stalls is instruction parallelization. In 1993 the Intel Pentium processor was one of the first consumer CPUs to achieve this with it’s dual U and V integer pipelines. The pentium U and V pipelines require certain coding techniques to take full advantage. Achieving more parallelism requires more sophisticated data hazard detection and instruction scheduling. Introduced with the IBM System/360 in the 60’s by Robert Tomasulo, the Tomosulo Algorithm provides the building blocks to allow for multiple instruction execution parallelism. Generally speaking no special programming is needed to take advantage of instruction parallelism on a processor implementing Tomasulo algorithm.

Tomasulo's algorithm

Though the technique of out-of-order CPU execution with Tomasulo’s algorithm had been designed in the 60’s it did not make its way into popular consumer hardware until the Pentium Pro in the 1995. Further Pentium revisions such as the Pentium III, Pentium 4 and Core architectures are based on this same architecture. Understanding this architecture is a key to understanding modern CPUs.

In this article we will point out comparisons between the Marocchino and Pentium pro who’s architecture can be seen in the below diagram.

pentium pro diagram

The Marocchino implements the Tomasulo algorithm in a CPU that can be synthesized and run on an FPGA. Let’s dive into the implementation by breaking down the building blocks used in Tomasulo’s algorithm and how they have been implemented in Marocchino.

Tomasulo Building blocks

Besides the basic CPU modules like Instruction Fetch, Decode and Register File, the building blocks that are used in the Tomasulo algorithm are as follows:

  • Reservation Station - A queue where decoded instructions are placed before they can be executed. Instructions are placed in the queue with their decoded operation and available arguments. If any arguments are not available the reservation station will wait until the arguments are available before executing.
  • Execution Units - The execution units include the Arithmetic Logic Unit (ALU), Memory Load/Store Unit or FPU is responsible for performing the instruction operation.
  • Re-order Buffer (ROB) - A ring buffer which manages the order in which instructions are retired. In Marocchino the implementation is slightly simplified and called the Order Control Buffer (OCB).
  • Instruction Ids - As an instruction is queued into the ROB, or OCB in Marocchino it is assigned an Instruction Id which is used to track the instruction in different components in Marocchino code this is called the extaddr.
  • Register Allocation Table (RAT) - A table used for data hazard resolution. The RAT table has one cell per OpenRISC general purpose register, 32 entries. Each RAT cell indicates if a register is busy being produced by a queued instruction and which instruction will produce it.
  • Common Data Bus - Execution units present their result to all reservation stations along with the register file. Writing to the reservation station provides immediate resolution of data hazards. The link between execution units, reservation stations and register file is referred to as the common data bus.

The below diagram shows how these components are arranged in the Marocchino processor.

marocchino pipeline diagram

Resolving Data Hazards

As mentioned above the goal of a pipelined architecture is to retire one instruction per clock cycle. Pipelining helps achieve this by splitting an instruction into pipeline stages i.e. Fetch, Decode, Execute, Load/Store and Register Write Back. If one instruction depends on the results produced by a previous instruction will be a problem as register write back of the previous instruction may not complete before registers are read during the Decode phase of a instruction. This and other types of dependencies between pipeline stages are called hazards, and they must be avoided.

The Tomasulo algorithm with its Reservation Stations, Register Allocation Tables and other building blocks try to avoid hazards causing pipeline stalls. Let’s look at a simple example to see how this is done.

  • instruction 1 - b = a * 2
  • instruction 2 - x = a + b
  • instruction 3 - y = x / y

Here we can see that instruction 2 depends on instruction 1 as the addition of a + b cannot be performed until b is produced by instruction 1.

Let’s assume that instruction 1 is currently executing on the MULTIPLY unit. The CPU decodes instruction 2, instead of detecting a data hazard and stalling the pipeline instruction 2 will be placed in the reservation station of the ADD execution unit. The RAT indicates that b is busy and being produced by insruction 1. This means instruction 2 cannot execute right away. Next, we can look at instruction 3 and place it onto the reservation station of the DIVIDE execution unit. As instruction 3 has no hazards for x and y it can proceed directly to execution, even before instruction 2 is ready for execution.

Note, if a required reservation station is full the pipeline will stall.

Register Renaming

As mentioned above, execution units will present their output onto the common data bus wrbk_result and the data will be written into reservation stations. Writing the register to the reservation station may occur before writing back to the register file. This is what register renaming is, as the register input does not come directly from the register file.

Instruction Id

When an instruction is issued it may be registered in the RAT, OCB and Reservation Station. It is assigned an Instruction Id for tracking purposes. In Marocchino this is called the extadr and is 3 bits wide. It is generated by the simple instruction ID generation logic.

The is implemented in or1k_marocchino_oman.v with the following counter logic which generates a new extadr every time an instruction is decoded.

  // extension to DEST, FLAG or CARRY
  // Zero value is reserved as "not used"
  localparam [DEST_EXTADR_WIDTH-1:0] EXTADR_MAX = ((1 << DEST_EXTADR_WIDTH) - 1);
  localparam [DEST_EXTADR_WIDTH-1:0] EXTADR_MIN = 1;
  // ---
  reg  [DEST_EXTADR_WIDTH-1:0] dcod_extadr_r;
  wire [DEST_EXTADR_WIDTH-1:0] extadr_adder;
  // ---
  assign extadr_adder = (dcod_extadr_r == EXTADR_MAX) ? EXTADR_MIN : (dcod_extadr_r + 1'b1);
  // ---
  always @(posedge cpu_clk) begin
    if (pipeline_flush_i)
      dcod_extadr_r <= {DEST_EXTADR_WIDTH{1'b0}};
    else if (padv_dcod_i)
      dcod_extadr_r <= fetch_valid_i ? extadr_adder : dcod_extadr_r;
  end // @clock
  // support in-1clk-unit forwarding
  assign dcod_extadr_o = dcod_extadr_r;

Every instruction that is queued by the order manager is designated an extadr. This allows components like the reservation station and RAT tables to track when an instruction starts and completes executing.

The interactions between the extadr and other components are as follows.

During decode:

  • the ID generator generates the extaddr by incrementing a counter.
  • the OCB registers the extaddr along with other decoded instruction details
  • the RAT registers an extaddr for the decoded instruction to indicate which instruction will resolve a hazard.

During execution:

  • the OCB broadcasts the extaddr of the oldest instruction registered in a FIFO fashion. This is to indicate which instruction is to be retired and ensures instructions are retired in order.
  • the RAT outputs the extaddr indicating which queued instruction will produce a register
  • the RAT receives an extaddr from the OCB output to clear allocation flags
  • the Reservation Station receives the extaddr with hazards to track when instructions have finished and results are available.

Register Allocation Table

The register allocation table (RAT), sometimes called register alias table, keeps track of which registers are currently in progress of being generated by pending instructions. This is used to derive and resolve hazards.

The outputs of the RAT cell are:

  • rat_rd_extadr_o - indicates which extadr instruction has been allocated to generate this register. This will be updated with decod_extadr_i when padv_exec_i goes high.
  • rat_rd_alloc_o - indicates that this register is currently allocated to an instruction which is not yet complete. This will be set when padv_exec_i goes high, decod_rfd_we_i is high, and dcod_rfd_adr_i is equal to GPR_ADR.

marocchino RAT Cell diagram

The RAT table is made of 32 rat_cell modules; one cell per register. The register which the cell is allocated to is stored within GPR_ADR in the rat cell.

marocchino RAT diagram

Outputs of the RAT are registered to reservation stations. The hazards are derived with the following logic in or1k_marocchino_oman.v.

The omn2dec_hazard_d1a1_o hazard means that the argument a of the decoded instruction will be resolved when the instruction with extadr in omn2dec_extadr_dxa1_o is retired. The 2 in d2, a2 and b2 represent the 2nd register used in 64-bit FPU instructions.


  //  # relative operand A1
  assign omn2dec_hazard_d1a1_o = rat_rd1_alloc[dcod_rfa1_adr_i] & dcod_rfa1_req_i;
  assign omn2dec_hazard_d2a1_o = rat_rd2_alloc[dcod_rfa1_adr_i] & dcod_rfa1_req_i;
  assign omn2dec_extadr_dxa1_o = rat_extadr[dcod_rfa1_adr_i];
  //  # relative operand B1
  assign omn2dec_hazard_d1b1_o = rat_rd1_alloc[dcod_rfb1_adr_i] & dcod_rfb1_req_i;
  assign omn2dec_hazard_d2b1_o = rat_rd2_alloc[dcod_rfb1_adr_i] & dcod_rfb1_req_i;
  assign omn2dec_extadr_dxb1_o = rat_extadr[dcod_rfb1_adr_i];
  //  # relative operand A2
  assign omn2dec_hazard_d1a2_o = rat_rd1_alloc[dcod_rfa2_adr_i] & dcod_rfa2_req_i;
  assign omn2dec_hazard_d2a2_o = rat_rd2_alloc[dcod_rfa2_adr_i] & dcod_rfa2_req_i;
  assign omn2dec_extadr_dxa2_o = rat_extadr[dcod_rfa2_adr_i];
  //  # relative operand B2
  assign omn2dec_hazard_d1b2_o = rat_rd1_alloc[dcod_rfb2_adr_i] & dcod_rfb2_req_i;
  assign omn2dec_hazard_d2b2_o = rat_rd2_alloc[dcod_rfb2_adr_i] & dcod_rfb2_req_i;
  assign omn2dec_extadr_dxb2_o = rat_extadr[dcod_rfb2_adr_i];

Reservation Stations

marocchino reservation station diagram

The reservation station receives an instruction from the decode stage and queues it until all hazards are resolved and the execution unit is free.

Each reservation station has one busy slot and one execution slot. In the Pentium Pro there were 20 reservation station slots, the Marocchino has 5 or 10 depending if you count the execution slots.

Reservation stations are populated when the pipeline advance padv_rsrvs_i signal comes. An instruction may be forwarded directly to execution if there are no hazards and the execution unit is free.

  • busy_extadr_dxa_r - is populated with data from omn2dec_hazards_addrs_i. The busy_extadr_dxa_r register represents the extadr to look for which will resolve the A register hazard.
  • busy_extadr_dxb_r - same as ‘A’ but indicates which extadr will produce the B register.

  • busy_hazard_dxa_r - is populated with data from omn2dec_hazards_flags_i. The busy_hazard_dxa_r register represents that there is an instruction executing that will produce register A which has not yet completed.
  • busy_hazard_dxb_r - same as ‘A’ but indicates that ‘B’ is not available yet.

  • busy_op_any_r - populated with 1 when padv_rsrvs_i goes high indicates that there is an operation queued.
  • busy_op_r - populated with dcod_op_i. Represents the operation pending in the queue.
  • busy_rfa_r - populated with data from dcod_rfxx_i. Represents the value of operand A pending in the queue.
  • busy_rfb_r - populated with data from dcod_rfxx_i. Represents the value of operand B pending in the queue.

The reservation station resolves hazards by watching and comparing wrbk_extadr_i with the busy_extadr_dxa_r and busy_extadr_dxb_r registers. If the two match it means that the instruction producing register A or B has finished writing back its results and the hazard can be cleared.

Writeback forwarding is handled via the following verilog multiplexer and register logic. The first bit is used to register the decoded values dcod_rf* from the register file otherwise we watch for inputs from the forwarding logic. If there is a pending hazard results are forwarded from the common data bus, otherwise results are maintained.

  // BUSY stage operands A1 & B1
  always @(posedge cpu_clk) begin
    if (padv_rsrvs_i) begin
      busy_rfa1_r <= dcod_rfa1;
      busy_rfb1_r <= dcod_rfb1;
    end
    else begin
      busy_rfa1_r <= busy_rfa1;
      busy_rfb1_r <= busy_rfb1;
    end
  end // @clock

  // Forwarding
  //  operand A1
  assign busy_rfa1 =  busy_hazard_d1a1_r ? wrbk_result1_i :
                     (busy_hazard_d2a1_r ? wrbk_result2_i : busy_rfa1_r);
  //  operand B1
  assign busy_rfb1 =  busy_hazard_d1b1_r ? wrbk_result1_i :
                     (busy_hazard_d2b1_r ? wrbk_result2_i : busy_rfb1_r);

When all hazard flags are cleared the contents of busy_op_r , busy_rfa_r and busy_rfb_r will be transferred to exec_op_any_r, exec_op_r, etc. They are presented on the outputs and the execution unit can take them and start processing.

The unit_free_o output signals the control unit that the reservation station is free and can be issued another instruction. The signal goes high when all hazards are cleared and the busy state transfers to exec.

Execution Units

In Marocchino the execution units (also referred to as functional units) execute instructions which it receives from the reservation stations.

The execution units in Marocchino are:

  • or1k_marocchino_int_1clk - handles integer instructions which can complete in 1 clock cycle. This includes SHIFT, ADD, AND, OR etc.
  • or1k_marocchino_int_div - handles integer DIVIDE operations.
  • or1k_marocchino_int_mul - handles integer MULTIPLY operations.
  • or1k_marocchino_lsu - handles memory load store operations. It interfaces with the data cache, MMU and memory bus.
  • pfpu_marocchino_top - handles floating point operations. These include ADD, MULTIPLY, CMP, I2F etc.

Handshake signals between the reservation station and execution units are used to issue operations to execution units.

marocchino execution unit handshake diagram

The taking_op_i is the signal from the execution unit signalling it has received the op and the reservation station will clear all exec_*_o output signals.

Order Control Buffer

marocchino order control diagram

In the Marocchino the Order Control Buffer (OCB) is the in order retirement unit. It can retire a single instruction at a time. The implementation is a 7 entry FIFO queue. This is much less than the Pentium Pro which contains 40 slots. The OCB receives a single instruction at time from the decoder and broadcasts the oldest instruction for other components to see. Instructions are retired after execution write back is complete.

If the OCB output indicates a branch instruction or an exception, branch logic is invoked. Instead of waiting for write back to a register the write back logic in the Marocchino will perform the branch operations. This may include flushing the OCB. Special care is taken to handle branch delay slot instruction execution.

The OCB is different from a traditional Tomasulo Reorder Buffer (ROB) in that it does not store any execution write back results.

Each OCB entry stores:

  • The Instruction ID extaddr
  • The type of instruction
  • The register destination addresses used for write back
  • Any Fetch and Decode exceptions

This can be seen as defined by the ocbi and ocbi wire buses in or1k_marocchino_oman.v.

  // --- OCB-Controls input ---
  wire  [OCBT_MSB:0] ocbi;
  assign ocbi =
    {
      // --- pipeline [C]ontrol flags ---
      dcod_extadr_r, // OCB-Controls entrance
      dcod_op_ls_i, // OCB-Controls entrance
      dcod_op_fpxx_cmp_i, // OCB-Controls entrance
      dcod_op_fpxx_arith_i, // OCB-Controls entrance
      dcod_op_mul_i, // OCB-Controls entrance
      dcod_op_div_i, // OCB-Controls entrance
      dcod_op_1clk_i, // OCB-Controls entrance
      dcod_op_jb_r, // OCB-Controls entrance
      dcod_op_push_wrbk_i, // OCB-Controls entrance
      // --- instruction [A]ttributes ---
      pc_decode_i, // OCB-Attributes entrance
      dcod_rfd2_adr_i, // OCB-Attributes entrance
      dcod_rfd2_we_i, // OCB-Attributes entrance
      dcod_rfd1_adr_i, // OCB-Attributes entrance
      dcod_rfd1_we_i, // OCB-Attributes entrance
      dcod_delay_slot_i, // OCB-Attributes entrance
      dcod_op_rfe_i, // OCB-Attributes entrance
      // Flag that istruction is restartable
      interrupts_en, // OCB-Attributes entrance
      // Combined IFETCH/DECODE an exception flag
      dcod_an_except_fd_i, // OCB-Attributes entrance
      // FETCH & DECODE exceptions
      dcod_fetch_except_ibus_err_r, // OCB-Attributes entrance
      dcod_fetch_except_ipagefault_r, // OCB-Attributes entrance
      dcod_fetch_except_itlb_miss_r, // OCB-Attributes entrance
      dcod_except_illegal_i, // OCB-Attributes entrance
      dcod_except_syscall_i, // OCB-Attributes entrance
      dcod_except_trap_i // OCB-Attributes entrance
    };

  // --- INSN OCB input ---
  wire [OCBT_MSB:0] ocbo;

Common Data Bus

As discussed above the common data collects write back results from execution units and routes them for write back.

This can be seen in the or1k_marocchino_cpu.v as below.

  // --- regular ---
  always @(wrbk_1clk_result       or wrbk_div_result or wrbk_mul_result or
           wrbk_fpxx_arith_res_hi or wrbk_lsu_result or wrbk_mfspr_result)
  begin
    wrbk_result1 = wrbk_1clk_result       | wrbk_div_result | wrbk_mul_result |
                   wrbk_fpxx_arith_res_hi | wrbk_lsu_result | wrbk_mfspr_result;
  end

  // --- FPU64 extention ---
  assign wrbk_result2 = wrbk_fpxx_arith_res_lo;

Conclusion

Tomasulo’s algorithm is still relevant today and used in many processors. Marocchino provides an accessible implementation. Marocchino is however, not super-scalar, while Pentium Pro can decode up to 4 instructions at a time the Marocchino can only decode 1 at a time.

Furthermore many improvements can be made to Marocchino to increase performance. Including:

  • Full featured reorder buffer
  • Parallel instruction decoding
  • Speculative execution; or should we?
  • More reservation station slots

However, these come with a cost of size on the FPGA. If you are interested in helping out please feel free to contribute.

If anything in this article could be improved, more timing diagrams, typos or fixes for diagrams please send me a message on twitter.

Further Reading and Sources


OR1K Marocchino Instruction Pipeline

18 Jul 2019

This is an ongoing series of posts on the Marocchino CPU, an open source out-of-order OpenRISC cpu. In this series we will review the Marocchino and it’s architecture. If you haven’t already I suggest you start of by reading the intro in Marocchino in Action.

In the last article, Marocchino in Action we discussed the history of the CPU and how to setup setup a development environment for it. In this article let’s look a bit deeper into how the Marocchino CPU works.

We will look at how an instruction flows through the Marocchino pipeline.

Marocchino Architecture

The Marocchino source code is available on github and is easy to navigate. We have these directories:

  • rtl/verilog - the core verilog code, with toplevel modules
    • or1k_marocchino_top.v - top level module, connects CPU to wishbone bus
    • or1k_marocchino_cpu.v - CPU module, connects CPU pipeline
  • rtl/verilog/pfpu_marocchino - the FPU implementation
    • pfpu_marocchino_top.v - FPU module, wires together FPU components
  • bench - test bench harness monitor modules
  • doc - design documentation

marocchino github website screenshot

At first glance of the code the Marocchino may look like a traditional 5 stage RISC pipeline. It has fetch, decode, execution, load/store and register write back modules which you might picture in your head as follows:

  PIPELINE CRTL - progress/stall the pipeline
  (or1k_marocchino_ctrl.v)

  INSTRUCTION PIPELINE - process an instruction

                        |
                        V
/                     FETCH                     \
\ (or1k_marocchino_fetch.v)                     /
                        |
                        V
/                     DECODE                    \
\ (or1k_marocchino_decode.v)                    /
                        |
                        V
/                     EXECUTE                   \
| (or1k_marocchino_int_1clk.v) ALU              |
| (or1k_marocchino_int_div.v)  DIVISION         |
\ (or1k_marocchino_int_mul.v)  MULTIPLICATION   /
                        |
                        V
/                   LOAD STORE                  \
\ (or1k_marocchino_lsu.v)      TO/FROM RAM      /
                        |
                        V
/                   WRITE BACK                  \
\ (or1k_marocchino_rf.v)                        /

However, once you look a bit closer you notice some things that are different. The top-level module or1k_marocchino_cpu connects the modules and shows:

  • Between the decode and execution units there are reservation stations.
  • Along with the control unit, there is an order manager module which provides control signals.
  • The load/store execution unit is done as part of the execution stage.

What this CPU is is a super-scalar instruction pipeline with in order instruction retirement implementing the Tomasulo algorithm.

A simplified view of the CPU’s internal module layout is as per the below diagram.

marocchino pipeline diagram

Pipeline Controls

The marocchino has two modules for coordinating pipeline stage instruction propagation. The control unit and the order manager.

Control Unit

The control unit of the CPU is in charge of watching over the pipeline stages and signalling when operations can transfer from one stage to the next. The Marocchino does this with a series of pipeline advance (padv_*) signals. In general for the best efficiency all padv_* wires should be high at all times allowing instructions to progress on every clock cycle. But as we will see in reality, this is difficult to achieve due to pipeline stall scenarios like cache misses and branch prediction misses. The padv_* signals include:

padv_fetch_o

The padv_fetch_o signal instructs the instruction fetch unit to progress. Internally the fetch unit has 3 stages. The instruction fetch unit interacts with the instruction cache and instruction memory management unit (MMU). The padv_fetch_o signal goes low and the pipeline stalls when the decode module is busy (dcod_emtpy_i is low). The signal dcod_empty_i comes from the Decode module and indicates that an instruction can be accepted by the decode stage.

This is represented by this assign in or1k_marocchino_ctrl.v:

Note The stepping and pstep[] signals are related to debug single stepping, and can be ignored for our purposes.

  // Advance IFETCH
  // Stepping condition is close to the one for DECODE
  assign padv_fetch_o = padv_all & ((~stepping) | (dcod_empty_i & pstep[0])); // ADV. IFETCH

padv_dcod_o

The padv_dcod_o signal instructs the instruction decode stage to output decoded operands. The decode unit is one stage, if padv_dcod_o is high, it will decode the instruction input every cycle. The padv_dcod_o signal goes low if the destination reservation station for the operands cannot accept an instruction.

This is represented by this assign in or1k_marocchino_ctrl.v:

  // Advance DECODE
  assign padv_dcod_o = padv_all & (~wrbk_rfdx_we_i) & // ADV. DECODE
    (((~stepping) & dcod_free_i & (dcod_empty_i | ena_dcod)) | // ADV. DECODE
       (stepping  & dcod_empty_i & pstep[0])); // ADV. DECODE

padv_exec_o and padv_*_rsrvs_o

The padv_exec_o signal to order manager enqueues decoded ops into the Order Control Buffer (OCB). The OCB is a FIFO queue which keeps track of the order instructions have been decoded.

The padv_*_rsrvs_o signal wired one of the reservation stations enables registering of an instruction into a reservation station. There is one padv_*_rsrvs_o signal and reservation station per execution unit. They are:

  • padv_1clk_rsrvs_o - to the reservation station for single clock ALU operations
  • padv_muldiv_rsrvs_o - to the reservation station for multiply and divide operations. Divide operations take 32 clock cycles. Multiply operations execute with 2 clock cycles.
  • padv_fpxx_rsrvs_o - to the reservation station for the floating point unit (FPU). There are multiple FPU operations including multiply, divide, add, subtract, comparison and conversion between integer and floating point.
  • padv_lsu_rsrvs_o - to the reservation station for the load store unit. The load store unit will load data from memory to registers or store data from registers to memory. It interacts with the data cache and data MMU.

Both padv_exec_o and padv_*_rsrvs_o are dependent on the execution units being ready and both signals will go high or low at the same time.

This is represented by the assign in or1k_marocchino_ctrl.v:

  // Advance EXECUTE (push OCB & clean up  DECODE)
  assign padv_exec_o         = ena_exec         & padv_an_exec_unit;
  // Per execution unit (or reservation station) advance
  assign padv_1clk_rsrvs_o   = ena_1clk_rsrvs   & padv_an_exec_unit;
  assign padv_muldiv_rsrvs_o = ena_muldiv_rsrvs & padv_an_exec_unit;
  assign padv_fpxx_rsrvs_o   = ena_fpxx_rsrvs   & padv_an_exec_unit;
  assign padv_lsu_rsrvs_o    = ena_lsu_rsrvs    & padv_an_exec_unit;

padv_wrbk_o

The padv_wrbk_o signal to the execution units will go active when exec_valid_i is active and will finalize writing back the execution results. The padv_wrbk_o signal to the order manager will retire the oldest instruction from the OCB.

This is represented by this assign in or1k_marocchino_ctrl.v:

  // Advance Write Back latches
  wire   exec_valid_l = exec_valid_i | op_mXspr_valid;
  assign padv_wrbk_o  = exec_valid_l & padv_all & (~wrbk_rfdx_we_i) & ((~stepping) | pstep[2]);

An astute reader would notice that there are no pipeline advance (padv_*) signals to each of the execution units. This is where the order manager comes in.

Order Manager

The order manager ensures that instructions are retired in the same order that they are decoded. It contains a register allocation table (RAT) for hazard resolution and the OCB. We will go into more depth on the RAT in the next article, but for now let’s look at how the order manager interacts with the instruction pipeline flow.

exec_valid_o

As the OCB is a FIFO queue the output port presents the oldest non retired instruction to the order manager. The exec_valid_o signal to the control unit will go active when the *_valid_i signal from the execution unit and the OCB output instruction match.

This is represented by this assign in or1k_marocchino_oman.v:

  assign exec_valid_o =
    (op_1clk_valid_l          & ~ocbo[OCBTC_JUMP_OR_BRANCH_POS]) | // EXEC VALID: but wait attributes for l.jal/ljalr
    (exec_jb_attr_valid       &  ocbo[OCBTC_JUMP_OR_BRANCH_POS]) | // EXEC VALID
    (div_valid_i              &  ocbo[OCBTC_OP_DIV_POS])         | // EXEC VALID
    (mul_valid_i              &  ocbo[OCBTC_OP_MUL_POS])         | // EXEC VALID
    (fpxx_arith_valid_i       &  ocbo[OCBTC_OP_FPXX_ARITH_POS])  | // EXEC VALID
    (fpxx_cmp_valid_i         &  ocbo[OCBTC_OP_FPXX_CMP_POS])    | // EXEC VALID
    (lsu_valid_i              &  ocbo[OCBTC_OP_LS_POS])          | // EXEC VALID
                                 ocbo[OCBTC_OP_PUSH_WRBK_POS];     // EXEC VALID

The OCB helps the order manager ensure that instructions are retired in the same order that they are decoded.

grant_wrbk_*_o

The grant_wrbk_*_o signal to the execution units will go active depending on the OCB output port instruction.

This is represented by this assign in or1k_marocchino_oman.v:

  // Grant Write-Back-access to units
  assign grant_wrbk_to_1clk_o        = ocbo[OCBTC_OP_1CLK_POS];
  assign grant_wrbk_to_div_o         = ocbo[OCBTC_OP_DIV_POS];
  assign grant_wrbk_to_mul_o         = ocbo[OCBTC_OP_MUL_POS];
  assign grant_wrbk_to_fpxx_arith_o  = ocbo[OCBTC_OP_FPXX_ARITH_POS];
  assign grant_wrbk_to_lsu_o         = ocbo[OCBTC_OP_LS_POS];
  assign grant_wrbk_to_fpxx_cmp_o    = ocbo[OCBTC_OP_FPXX_CMP_POS];

The grant_wrbk_*_o signal along with the padb_wrbk_o signal signal an execution unit that it can write back its result to the register file / RAT / reservation station.

wrbk_rfd1_we_o, wrbk_rfd2_we_o and wrbk_rfdx_we_o

The wrbk_rfd1_we_o and wrbk_rfd2_we_o signals enable writeback to the register file. There are 2 signals because some 64-bit FPU instructions require writing results to 2 registers. When there is just a single register to write only signal wrbk_rfd1_we_o is used. When there are two results, writing happens in 2-stages, first wrbk_rfd1_we_o signals the write back to register 1 then in the next cycle wrbk_rfd2_we_o signals the write back to register 2.

The wrbk_rfdx_we_o signal to the control unit stalls the pipeline to allow the second write to complete.

This is represented by this logic in or1k_marocchino_oman.v:

  // instuction requests write-back
  wire exec_rfd1_we = ocbo[OCBTA_RFD1_WRBK_POS];
  wire exec_rfd2_we = ocbo[OCBTA_RFD2_WRBK_POS];

...

  // 1-clock Write-Back-pulses
  //  # for D1
  always @(posedge cpu_clk) begin
    if (padv_wrbk_i)
      wrbk_rfd1_we_o <= exec_rfd1_we;
    else
      wrbk_rfd1_we_o <= 1'b0;
  end // @clock

  //  # for D2 we delay WriteBack for 1-clock
  //    to split write into RF from D1
  always @(posedge cpu_clk) begin
    if (cpu_rst) begin
      wrbk_rfdx_we_o <= 1'b0; // flush
      wrbk_rfd2_we_o <= 1'b0; // flush
    end
    else if (wrbk_rfd2_we_o) begin
      wrbk_rfdx_we_o <= 1'b0; // D2 write done
      wrbk_rfd2_we_o <= 1'b0; // D2 write done
    end
    else if (wrbk_rfdx_we_o)
      wrbk_rfd2_we_o <= 1'b1; // do D2 write
    else if (padv_wrbk_i)
      wrbk_rfdx_we_o <= exec_rfd2_we;
  end // @clock

The padv_wrbk_i signal from the control unit to the order manager also takes care of dequeuing the last instruction from the OCB. With that and the writebacks completed the instruction is said to be retired.

Conclusion

The Marocchino instruction pipeline is not very complicated while still being full featured including Caches, MMU and FPU. We have mentioned a few structures such as Reservation Station and RAT which we haven’t gone into much details on. These help implement out-of-order superscalar execution using Tomasulo’s algorithm. In the next article we will go into more details on these components and how Tomasulo works.


OR1K Marocchino in Action

11 Jun 2019

The Marocchino is an advanced new OpenRISC soft core CPU. However, not many have heard about it. Let’s try to change this.

In this series of posts I would like to take the reader over some key parts of the Marocchino and it’s architecture.

  • Marocchino in Action - (this article) A light intro and a guide to getting started with Marocchino
  • Instruction Pipeline - An deep dive into how the Marocchino pipeline is structured
  • A Tomasulo Implementation - How the Marocchino achieves Out-of-Order execution

Introducing Marocchino

In the beginning of 2019 I had finished the OpenRISC GCC port and was working on building up toolchain test and verification support using the mor1kx soft core. Part of the mor1kx’s feature set is the ability to swap out different pipeline arrangements to configure the CPU for performance or resource usage. Each pipeline is named after an Italian coffee, we have Cappuccino, Espresso and Pronto-Espresso. One of these pipelines which has been under development but never integrated into the main branch was the Marocchino. I had never paid much attention to the Marocchino pipeline.

Around the same time the author of Marocchino sent a mail mentioning he could not used the new GCC port as is was missing Single and Double precision FPU support. Using the new verification pipeline I set out to start working on adding single and double precision floating point support to the OpenRISC gcc port. My verification target would be the Marocchino pipeline.

After some initial investigation I found this CPU was much more than a new pipeline for the mor1kx with a special FPU. The marocchino has morphed into a complete re-implementation of the OpenRISC 1000 spec. Seeing this we split the marocchino out to it’s own repository where it could grow on it’s own. Of course maintaining the Italian coffee name.

With features like out-of-order execution using Tomasulo’s algorithm, 64-bit FPU operations using register pairs, MMU, Instruction caches, Data caches, Multicore support and a clean verilog code base the Marocchino is advanced to say the least.

I would claim Marocchino is one of the most advanced implementations of a out-of-order execution open source CPU cores. One of it’s friendly rivals is the BOOM core a 64-bit risc-v implementation written in Chisel. To contrast Marocchino has a similar feature set but is 32-bit OpenRISC written in verilog making it approachable. If you know more out-of-order execution open source cores I would love to know, and I can update this list.

Let’s dive in.

Getting started with Marocchino

We can quickly get started with Marocchino as we use FuseSoC. Which makes bringing together an running verilog libraries, or cores, a snap.

Environment Setup

The Marocchino development environment requires Linux, you can use a VM, docker or your own machine. I personally use fedora and maintain several Debian docker images for continuous integration and package builds.

The environment we install allows for simulating verilog using icarus or verilator. It also allows synthesis and programming to an FPGA using EDA tools. Here we will cover only simulation. For details on programming an SoC to an FPGA using FuseSoC see the build and pgm commands in FuseSoC documentation.

Note Below we use /tmp/openrisc to install software and work on code, but you can use any path you like.

Setting up FuseSoC

fusesoc logo

To get started let’s setup FuseSoC and install the required cores into the FuseSoC library.

Here we clone the git repositories used for Marocchino development into /tmp/openrisc/src feel free to have a look, If you feel adventurous make some changes. The repos include:

  • mor1kx-generic - the SoC system configuration which wires together the CPU, Memory, System bus, UART and a simple interrupt peripheral.
  • or1k_marocchino - the Marocchino CPU core.
sudo pip install fusesoc

mkdir -p /tmp/openrisc/src
cd /tmp/openrisc/src
git clone https://github.com/stffrdhrn/mor1kx-generic.git
git clone https://github.com/openrisc/or1k_marocchino.git

# As yourself
fusesoc init -y
fusesoc library add intgen https://github.com/stffrdhrn/intgen.git
fusesoc library add mor1kx-generic /tmp/openrisc/src/mor1kx-generic
fusesoc library add or1k_marocchino /tmp/openrisc/src/or1k_marocchino

Setting up Icarus Verilog

Next we will need to install our verilog compiler/simulator Icarus Verilog (iverilog).

mkdir -p /tmp/openrisc/iverilog
cd /tmp/openrisc/iverilog
git clone https://github.com/steveicarus/iverilog.git .

sh autoconf.sh
./configure --prefix=/tmp/openrisc/local
make
make install
export PATH=/tmp/openrisc/local/bin:$PATH

Using Docker

If you want to get started very quickly faster we can use the librecores-ci docker image. Which includes iverilog, verilator and fusesoc.

This allows us to skip the Setting up Icarus Verilog and part of the Setting up FuseSoC step above.

This can be done with the following.

docker pull librecores/librecores-ci
docker run -it --rm docker.io/librecores/librecores-ci

Setting up GCC

Next we install the GCC toolchain which is used for compiling C and OpenRISC assembly programs. The produced elf binaries can be loaded and run on the CPU core. Pull the latest toolchain from my gcc releases page. Here we use the newlib (baremetal) toolchain which allows compiling programs which run directly on the processor. For details on other toolchains available see the toolchain summary on the OpenRISC homepage.

mkdir -p /tmp/openrisc
cd /tmp/openrisc
wget https://github.com/stffrdhrn/gcc/releases/download/or1k-9.1.1-20190507/or1k-elf-9.1.1-20190507.tar.xz

tar -xf or1k-elf-9.1.1-20190507.tar.xz
export PATH=/tmp/openrisc/or1k-elf/bin:$PATH

The development environment should now be set up.

To check everything works you should be able to run the following commands.

Setup Verification

To ensure the toolchain is installed and working we can run the following:

$ or1k-elf-gcc --version
or1k-elf-gcc (GCC) 9.1.1 20190503
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

To ensure FuseSoC and the required cores are installed we can run this:

$ fusesoc core-info mor1kx-generic 
CORE INFO
Name:      ::mor1kx-generic:1.1
Core root: /root/.local/share/fusesoc/mor1kx-generic

Targets:
marocchino_tb
mor1kx_tb

$ fusesoc list-cores
...
::intgen:0                     : local
::or1k_marocchino:5.0-r3       : local
...

Running an Assembly Program

The most simple program you can run on OpenRISC is a simple assembly program. When running, everything in the below program is loaded into the OpenRISC memory, nothing more nothing less.

To compile, run and trace a simple assembly program we do the following.

Create a source file asm-openrisc.s as follows:

/* Exception vectors section.  */
	.section .vectors, "ax"

/* 0x100: OpenRISC RESET reset vector. */
        .org 0x100 	

	/* Jump to program initialisation code */
	.global _main
	l.movhi r4, hi(_main)
	l.ori r4, r4, lo(_main)
	l.jr    r4
	 l.nop

/* Main executable section.  */
	.section .text
	
	.global _main	
_main:
	l.addi	r1,r0,0x1
	l.addi	r2,r1,0x2
	l.addi	r3,r2,0x4
	l.addi	r4,r3,0x8
	l.addi	r5,r4,0x10
	l.addi	r6,r5,0x20
	l.addi	r7,r6,0x40
	l.addi	r8,r7,0x80
	l.addi	r9,r8,0x100
	l.addi	r10,r9,0x200
	l.addi	r11,r10,0x400
	l.addi	r12,r11,0x800
	l.addi	r13,r12,0x1000
	l.addi	r14,r13,0x2000
	l.addi	r15,r14,0x4000
	l.addi	r16,r15,0x8000

	l.sub	r31,r0,r1
	l.sub	r30,r31,r2
	l.sub	r29,r30,r3
	l.sub	r28,r29,r4
	l.sub	r27,r28,r5
	l.sub	r26,r27,r6
	l.sub	r25,r26,r7
	l.sub	r24,r25,r8
	l.sub	r23,r24,r9
	l.sub	r22,r23,r10
	l.sub	r21,r22,r11
	l.sub	r20,r21,r12
	l.sub	r19,r20,r13
	l.sub	r18,r19,r14
	l.sub	r17,r18,r15
	l.sub	r16,r17,r16

	/* Set sim return code to 0 - meaning OK. */
	l.movhi	r3, 0x0
	l.nop 	0x1 /* Exit simulation */
	l.nop
	l.nop

To compile this we use or1k-elf-gcc. Note the -nostartfiles option, this is useful for compiling assembly when we don’t need newlib/libgloss provided “startup” sections linked into the binary as we provide them ourselves.

mkdir /tmp/openrisc/src
cd /tmp/openrisc/src
vim openrisc-asm.s

or1k-elf-gcc -nostartfiles openrisc-asm.s -o openrisc-asm

Finally, to run the program on the Marocchino we run fusesoc with the below options.

  • run - specifies that we want to run a simulation.
  • --target - is a FuseSoC option for run specifying which of the mor1kx-generic targets we want to run, here we specify marochino_tb, the Marocchino test bench.
  • --tool - is a sub option for --target specifying that we want to run the marocchino_tb target using icarus.
  • ::mor1kx-generic:1.1 - specifies which system we want to run. System’s represent an SoC that can be simulated or synthesized. You can see a list of system using list-cores.
  • --elf-load - is mor1kx-generic specific option which specifies an elf binary that will be loaded into memory before the simulation starts.
  • --trace_enable - is a mor1kx-generic specific option enabling tracing. When specified the simulator will output a trace file to {fusesoc-builds}/mor1kx-generic_1.1/marocchino_tb-icarus/marocchino-trace.log see log.
  • --trace_to_screen - is a mor1kx-generic specific option enabling tracing instruction execution to the console as we can see below.
  • --vcd - is a mor1kx-generic option instruction icarus to output a vcd file which creates a trace file which can be loaded with gtkwave.
fusesoc run --target marocchino_tb --tool icarus ::mor1kx-generic:1.1 \
  --elf-load ./openrisc-asm --trace_enable --trace_to_screen --vcd

VCD info: dumpfile testlog.vcd opened for output.
Program header 0: addr 0x00000000, size 0x000001A0
elf-loader: /tmp/openrisc/src/openrisc-asm was loaded
Loading         104 words
                   0 : Illegal Wishbone B3 cycle type (xxx)
S 00000100: 18800000 l.movhi r4,0x0000       r4         = 00000000  flag: 0
S 00000104: a8840110 l.ori   r4,r4,0x0110    r4         = 00000110  flag: 0
S 00000108: 44002000 l.jr    r4                                     flag: 0
S 0000010c: 15000000 l.nop   0x0000                                 flag: 0
S 00000110: 9c200001 l.addi  r1,r0,0x0001    r1         = 00000001  flag: 0
S 00000114: 9c410002 l.addi  r2,r1,0x0002    r2         = 00000003  flag: 0
S 00000118: 9c620004 l.addi  r3,r2,0x0004    r3         = 00000007  flag: 0
S 0000011c: 9c830008 l.addi  r4,r3,0x0008    r4         = 0000000f  flag: 0
S 00000120: 9ca40010 l.addi  r5,r4,0x0010    r5         = 0000001f  flag: 0
S 00000124: 9cc50020 l.addi  r6,r5,0x0020    r6         = 0000003f  flag: 0
S 00000128: 9ce60040 l.addi  r7,r6,0x0040    r7         = 0000007f  flag: 0
S 0000012c: 9d070080 l.addi  r8,r7,0x0080    r8         = 000000ff  flag: 0
S 00000130: 9d280100 l.addi  r9,r8,0x0100    r9         = 000001ff  flag: 0
S 00000134: 9d490200 l.addi  r10,r9,0x0200   r10        = 000003ff  flag: 0
S 00000138: 9d6a0400 l.addi  r11,r10,0x0400  r11        = 000007ff  flag: 0
S 0000013c: 9d8b0800 l.addi  r12,r11,0x0800  r12        = 00000fff  flag: 0
S 00000140: 9dac1000 l.addi  r13,r12,0x1000  r13        = 00001fff  flag: 0
S 00000144: 9dcd2000 l.addi  r14,r13,0x2000  r14        = 00003fff  flag: 0
S 00000148: 9dee4000 l.addi  r15,r14,0x4000  r15        = 00007fff  flag: 0
S 0000014c: 9e0f8000 l.addi  r16,r15,0x8000  r16        = ffffffff  flag: 0
S 00000150: e3e00802 l.sub   r31,r0,r1       r31        = ffffffff  flag: 0
S 00000154: e3df1002 l.sub   r30,r31,r2      r30        = fffffffc  flag: 0
S 00000158: e3be1802 l.sub   r29,r30,r3      r29        = fffffff5  flag: 0
S 0000015c: e39d2002 l.sub   r28,r29,r4      r28        = ffffffe6  flag: 0
S 00000160: e37c2802 l.sub   r27,r28,r5      r27        = ffffffc7  flag: 0
S 00000164: e35b3002 l.sub   r26,r27,r6      r26        = ffffff88  flag: 0
S 00000168: e33a3802 l.sub   r25,r26,r7      r25        = ffffff09  flag: 0
S 0000016c: e3194002 l.sub   r24,r25,r8      r24        = fffffe0a  flag: 0
S 00000170: e2f84802 l.sub   r23,r24,r9      r23        = fffffc0b  flag: 0
S 00000174: e2d75002 l.sub   r22,r23,r10     r22        = fffff80c  flag: 0
S 00000178: e2b65802 l.sub   r21,r22,r11     r21        = fffff00d  flag: 0
S 0000017c: e2956002 l.sub   r20,r21,r12     r20        = ffffe00e  flag: 0
S 00000180: e2746802 l.sub   r19,r20,r13     r19        = ffffc00f  flag: 0
S 00000184: e2537002 l.sub   r18,r19,r14     r18        = ffff8010  flag: 0
S 00000188: e2327802 l.sub   r17,r18,r15     r17        = ffff0011  flag: 0
S 0000018c: e2118002 l.sub   r16,r17,r16     r16        = ffff0012  flag: 0
S 00000190: 18600000 l.movhi r3,0x0000       r3         = 00000000  flag: 0
S 00000194: 15000001 l.nop   0x0001                                 flag: 0
exit(0x00000000);

If we look at the VCD trace file in gtkwave we can see the below trace. The trace file is also helpful for navigating through the various SoC components as it captures all wire transitions.

Take note that we are not seeing very good performance, this is because caching is not enabled and the CPU takes several cycles to read an instruction from memory. This means we are not seeing one instruction executed per cycle. Enabling caches would fix this.

asm tracefile vcd

Running a C program

When we compile a C program there is a lot more happening behind the scenes. The linker will link in an entire runtime that along with standard libc functions on OpenRISC will setup interrupt vectors, enable caches, setup memory sections for variables, run static initializers and finally run our program.

The program:

/* Simple c program, doing some math.  */

#include <stdio.h>

int a [] = { 1, 2, 3, 4, 5 };

int madd(int a, int b, int c) {
  return a * b + c;
}

int main() {
  int res;

  for (int i = 0; i < 5; i++) {
    res = madd(0  , a[1], a[i]);
    res = madd(res, a[2], a[i]);
    res = madd(res, a[2], a[i]);
    res = madd(res, a[3], a[i]);
    res = madd(res, a[4], a[i]);
  }

  printf("Result is = %d\n", res);

  return 0;
}

To compile we use the below or1k-elf-gcc command. Notice, we do not specify -nostartfiles here as we do want newlib to link in all the start routines to provide a full c runtime. We do specify the following arguments to tell GCC a bit about our OpenRISC cpu. If these -m options are not specified GCC will link in code using the libgcc library to emulate these instructions.

  • -mhard-mul - indicates that our cpu target supports multiply instructions.
  • -mhard-div - indicates that our cpu target supports divide instructions.
  • -mhard-float - indicates that our cpu target supports FPU instructions.
  • -mdouble-float - indicates that our cpu target supports the new double precision floating point instructions using register pairs (orfpx64a32).

To see a full list of options for OpenRISC read the GCC manual or see the output of or1k-elf-gcc --target-help.

or1k-elf-gcc -Wall -O2 -mhard-mul -mhard-div -mhard-float -mdouble-float -mror \
  openrisc-c.c -o openrisc-c

If we want to inspect the assembly to ensure we did generate multiply instructions we can use the trusty objdump utility. As per below, yes, we can see multiply instructions.

or1k-elf-objdump -d openrisc-c | grep -A10 main
00002000 <main>:
    2000:       1a a0 00 01     l.movhi r21,0x1
    2004:       9c 21 ff f8     l.addi r1,r1,-8
    2008:       9e b5 40 3c     l.addi r21,r21,16444
    200c:       86 75 00 10     l.lwz r19,16(r21)
    2010:       86 f5 00 08     l.lwz r23,8(r21)
    2014:       e2 37 9b 06     l.mul r17,r23,r19
    2018:       e2 31 98 00     l.add r17,r17,r19
    201c:       e2 31 bb 06     l.mul r17,r17,r23
    2020:       e2 31 98 00     l.add r17,r17,r19
    2024:       86 b5 00 0c     l.lwz r21,12(r21)
...

Similar to running the assembly example we can run this with fusesoc as follows.

fusesoc run --target marocchino_tb --tool icarus ::mor1kx-generic:1.1 \
  --elf-load ./openrisc-c --trace_enable --vcd
...

Result is = 1330

Now, if we look at the VCD trace file we can see the below trace. Notice that with the c program we can observe better pipelining where an instruction can be executed every clock cycle. This is because caches have been initialized as part of the newlib c-runtime initialization, Great!

c tracefile vcd

Conclusion

In this article we went through a quick introduction to the Marocchino development environment. The development environment would actually be similar when developing any OpenRISC core.

This environment will allow the reader to following in future Marocchino articles where we go deeper into the architecture. In this environment you can now:

  • Develop and test verilog code for the Marocchino processor
  • Develop assembly programs and test them on the Marocchino or other OpenRISC processors
  • Develop c programs and test them on the Marocchino or other OpenRISC processors

In the next article we will look more into how the above programs actually flow through the Marocchino pipeline. Stay tuned.