Chapter: Embedded Systems Design : Emulation and debugging techniques

Optimisation problems in Emulation techniques

The difficulties do not stop with hardware mechanical problems. Software debugging can be confused or hampered by optimisation techniques used by the compiler to improve the efficiency of the code.

Optimisation problems

The difficulties do not stop with hardware mechanical problems. Software debugging can be confused or hampered by optimisation techniques used by the compiler to improve the efficiency of the code. Usually set by options from the command line, the optimisation routines examine the code and change it to improve its efficiency, while retaining its logical design and context. Many different techniques are used but they fall into two main types: those that remove code and those that add code or change it. A compiler may remove variables or routines that are never used or do not return any function. Small loops may be unrolled into straight line code to remove branching delays at the expense of a slightly larger program. Floating point routines may be replaced by inline floating point instructions. The net result is code that is different from the assembler listing produced by the compiler. In addition, the generated symbol table may be radically different from that expected from the source code.

These optimisation techniques can be ruthless; I have known whole routines to be removed and in one case a com-plete program was reduced to a single NOP instruction! The program was a set of functions that performed benchmark routines but did not use any global information or return any values. The optimiser saw this and decided that as no data was passed to it and it did not modify or return any global data, it effectively did nothing and replaced it with a NOP. When benchmarked, it gave a pretty impressive performance of zero seconds to execute several million calculations.

/* sieve.c — Eratosthenes Sieve prime number calculation */

/* scaled down with MAX_PRIME set to 17 instead of 8091 */

#define MAX_ITER 1

#define MAX_PRIME 17

char flags[MAX_PRIME];

main ()

{

for (iter = 1;iter<=MAX_ITER;iter++)

{

count = 0;

/* redundant code added here */

for(l = 0; l < 200; l++ ); for(m = 128; l > 1; m— );

/* redundant code ends here */

for(i = 0; i<MAX_PRIME; i++) flags[i] = 1;

for(i = 0; i<MAX_PRIME; i++) if(flags[i])

{

prime = i + i + 3; k = i + prime;

while (k < MAX_PRIME)

{

flags[k] = 0; k += prime;

}

count++;

printf(“ prime %d =

%d\n”, count, prime);

}

printf(“\n%d primes\n”,count);

}

Source listing for optimisation example

file "ctm1AAAa00360" file "ctm1AAAa00355"

def aut1.,32 def aut1.,32

def arg1.,64 def arg1.,56

text text

global _main global _main

_main: _main:

subu r31,r31,arg1. subu r31,r31,arg1.

st r1,r31,arg1.-4 st r1,r31,arg1.-4

st r19,r31,aut1.+0 st.d r20,r31,aut1.+0

st r20,r31,aut1.+4 st.d r22,r31,aut1.+8

st r21,r31,aut1.+8 st r25,r31,aut1.+16

st r22,r31,aut1.+12 or r20,r0,1

st r23,r31,aut1.+16 @L26:

st r24,r31,aut1.+20 or r21,r0,r0

st r25,r31,aut1.+24 or r25,r0,r0

or r19,r0,1 @L7:

br @L25 addu r25,r25,1

@L26: cmp r13,r25,200

or r20,r0,r0 bb1 lt,r13,@L7

or r23,r0,r0 br.n @L28

br @L6 or r2,r0,128

@L7: @L11:

addu r23,r23,1 subu r2,r2,1

@L6: @L28:

cmp r13,r23,200 cmp r13,r25,1

bb1 lt,r13,@L7 bb1 gt,r13,@L11

or r22,r0,128 or r25,r0,r0

br @L10 or.u r22,r0,hi16(_flags)

@L11: or r22,r22,lo16(_flags)

subu r22,r22,1 @L15:

@L10: or r13,r0,1

cmp r13,r23,1 st.b r13,r22,r25

bb1 gt,r13,@L11 addu r25,r25,1

or r25,r0,r0 cmp r12,r25,17

br @L14 bb1 lt,r12,@L15

@L15: or r25,r0,r0

or.u r13,r0,hi16(_flags) @L24:

or r13,r13,lo16(_flags) ld.b r12,r22,r25

or r12,r0,1 bcnd eq0,r12,@L17

st.b r12,r13,r25 addu r12,r25,r25

addu r25,r25,1 addu r23,r12,3

@L14: addu r2,r25,r23

cmp r13,r25,17 cmp r12,r2,17

bb1 lt,r13,@L15 bb1 ge,r12,@L18

or r25,r0,r0 @L20:

br @L23 st.b r0,r22,r2

@L24: addu r2,r2,r23

or.u r13,r0,hi16(_flags) cmp r13,r2,17

or r13,r13,lo16(_flags) bb1 lt,r13,@L20

ld.b r13,r13,r25 @L18:

bcnd eq0,r13,@L17 addu r21,r21,1

addu r13,r25,r25 or.u r2,r0,hi16(@L21)

addu r21,r13,3 or r2,r2,lo16(@L21)

addu r24,r25,r21 or r3,r0,r21

br @L19 bsr.n _printf

@L20: or r4,r0,r23

or.u r13,r0,hi16(_flags) @L17:

or r13,r13,lo16(_flags) addu r25,r25,1

st.b r0,r13,r24 cmp r13,r25,17

addu r24,r24,r21 bb1 lt,r13,@L24

@L19: addu r20,r20,1

cmp r13,r24,17 cmp r13,r20,1

bb1 lt,r13,@L20 bb1 le,r13,@L26

addu r20,r20,1 or.u r2,r0,hi16(@L27)

or.u r2,r0,hi16(@L21) or r2,r2,lo16(@L27)

or r2,r2,lo16(@L21) bsr.n _printf

or r3,r0,r20 or r3,r0,r21

or r4,r0,r21 ld.d r20,r31,aut1.+0

bsr _printf ld r1,r31,arg1.-4

@L17: ld.d r22,r31,aut1.+8

addu r25,r25,1 ld r25,r31,aut1.+16

@L23: jmp.n r1

cmp r13,r25,17 addu r31,r31,arg1.

bb1 lt,r13,@L24

addu r19,r19,1

@L25:

cmp r13,r19,1

bb1 le,r13,@L26

or.u r2,r0,hi16(@L27)

or r2,r2,lo16(@L27)

or r3,r0,r20

bsr _printf

ld r19,r31,aut1.+0

ld r20,r31,aut1.+4

ld r21,r31,aut1.+8

ld r22,r31,aut1.+12

ld r23,r31,aut1.+16

ld r24,r31,aut1.+20

ld r25,r31,aut1.+24

ld r1,r31,arg1.-4

addu r31,r31,arg1.

jmp r1

No optimisation Full optimisation

Assembler listings for optimised and non-optimised compilation

To highlight how optimisation can dramatically change the generated code structure, look at the C source listing for the Eratosthenes Sieve program and the resulting M88000 assem-bler listings that were generated by using the default non-optimised setting and the full optimisation option. The imme-diate difference is in the greatly reduced size of the code and the use of the .n suffix with jump and branch instructions to make use of the delay slot. This is a technique used on many RISC processors to prevent a pipeline stall when changing the program flow. If the instruction has a .n suffix, the instruction immediately after it is effectively executed with the branch and not after it, as it might appear from the listing!

In addition, the looping structures have been reorgan-ised to make them more efficient, although the redundant code loops could be encoded simply as a loop with a single branch. If the optimiser is that good, why has it not done this? The reason is that the compiler expects loops to be inserted for a reason and usually some form of work is done within the loop which may change the loop variables. Thus the compiler will take the general case and use that rather than completely remove it or rewrite it. If the loop had been present in a dead code area — within a conditional statement where the condi-tions would never be met — the compiler would remove the structure completely.

The initialisation routine _main is different in that not all the variables are initialised using a store instruction and fetching their values from a stack. The optimised version uses the faster ‘or’ instruction to set some of the variables to zero.

These and other changes highlight several problems with optimisation. The obvious one is with debugging the code. With the changes to the code, the assembler listing and symbol tables do not match. Where the symbols have been preserved, the code may have dramatically changed. Where the routines have been removed, the symbols and references may not be present. There are several solutions to this. The first is to debug the code with optimisation switched off. This preserves the symbol references but the code will not run at the same speed as the optimised version, and this can lead to some timing problems. A second solution is becoming available from compiler and debugger suppliers, where the optimisation techniques preserve as much of the symbolic information as possible so that function addresses and so on are not lost.

The second issue is concerned with the effect optimisation may have on memory mapped I/O. Unless the optimiser can recognise that a function is dealing with memory mapped I/O, it may not realise that the function is doing some work after all and remove it — with disastrous results. This may require declaring the I/O addresses as a global variable, returning a value at the function’s completion or even passing the address to the function itself, so that the optimiser can recognise its true role. A third complication can arise with optimisations such as unrolling loops and software timing. It is not uncommon to use instruction sequences to delay certain accesses or functions. A peripheral may require a certain number of clock cycles to respond to a command. This delay can be accomplished by executing other instructions, such as a loop or a divide instruc-tion. The optimiser may remove or unroll such loops and replace the inefficient divide instruction with a logical shift. While this does increase the performance, that is not what was required and the delayed peripheral access may not be long enough — again with disastrous results.

Such software timing should be discouraged not only for this but also for portability reasons. The timing will assume certain characteristics about the processor in terms of process-ing speed and performance which may not be consistent with other faster board designs or different processor versions.

Study Material, Lecturing Notes, Assignment, Reference, Wiki description explanation, brief detail

Embedded Systems Design : Emulation and debugging techniques : Optimisation problems in Emulation techniques |