Home | | Embedded Systems Design | Optimisation problems in Emulation techniques

Chapter: Embedded Systems Design : Emulation and debugging techniques

Optimisation problems in Emulation techniques

The difficulties do not stop with hardware mechanical problems. Software debugging can be confused or hampered by optimisation techniques used by the compiler to improve the efficiency of the code.

Optimisation problems

The difficulties do not stop with hardware mechanical problems. Software debugging can be confused or hampered by optimisation techniques used by the compiler to improve the efficiency of the code. Usually set by options from the command line, the optimisation routines examine the code and change it to improve its efficiency, while retaining its logical design and context. Many different techniques are used but they fall into two main types: those that remove code and those that add code or change it. A compiler may remove variables or routines that are never used or do not return any function. Small loops may be unrolled into straight line code to remove branching delays at the expense of a slightly larger program. Floating point routines may be replaced by inline floating point instructions. The net result is code that is different from the assembler listing produced by the compiler. In addition, the generated symbol table may be radically different from that expected from the source code.

 

These optimisation techniques can be ruthless; I have known whole routines to be removed and in one case a com-plete program was reduced to a single NOP instruction! The program was a set of functions that performed benchmark routines but did not use any global information or return any values. The optimiser saw this and decided that as no data was passed to it and it did not modify or return any global data, it effectively did nothing and replaced it with a NOP. When benchmarked, it gave a pretty impressive performance of zero seconds to execute several million calculations.

 

 

 

/* sieve.c — Eratosthenes Sieve prime number calculation */

 

/* scaled down with MAX_PRIME set to 17 instead of 8091 */

 

#define MAX_ITER      1

#define MAX_PRIME   17

char   flags[MAX_PRIME];

 

main ()

 

{

 

register int i,k,l,m; int prime,count,iter;

 

for (iter = 1;iter<=MAX_ITER;iter++)

 

{

 

count = 0;

 

/* redundant code added here */

 

for(l = 0; l < 200; l++ ); for(m = 128; l > 1; m— );

 

/* redundant code ends here */

 

for(i = 0; i<MAX_PRIME; i++) flags[i] = 1;

 

for(i = 0; i<MAX_PRIME; i++) if(flags[i])

 

{

 

prime = i + i + 3; k = i + prime;

 

while (k < MAX_PRIME)

 

{

 

flags[k] = 0; k += prime;

}

 

count++;

 

printf(“ prime %d =

 

%d\n”, count, prime);

 

}

 

}

 

printf(“\n%d primes\n”,count);

 

}

Source listing for optimisation example

 

file     "ctm1AAAa00360"       file     "ctm1AAAa00355"

def     aut1.,32      def     aut1.,32

def     arg1.,64      def     arg1.,56

text             text   

global          _main         global          _main

_main:                  _main:       

subu  r31,r31,arg1.        subu  r31,r31,arg1.

st       r1,r31,arg1.-4       st       r1,r31,arg1.-4

st       r19,r31,aut1.+0    st.d    r20,r31,aut1.+0

st       r20,r31,aut1.+4    st.d    r22,r31,aut1.+8

st       r21,r31,aut1.+8    st       r25,r31,aut1.+16

st       r22,r31,aut1.+12  or      r20,r0,1

st       r23,r31,aut1.+16  @L26:        

st       r24,r31,aut1.+20  or      r21,r0,r0

st       r25,r31,aut1.+24  or      r25,r0,r0

or      r19,r0,1      @L7:

br      @L25          addu  r25,r25,1

@L26:                  cmp   r13,r25,200

or      r20,r0,r0     bb1    lt,r13,@L7

or      r23,r0,r0     br.n   @L28

br      @L6  or      r2,r0,128

@L7:           @L11:        

addu  r23,r23,1    subu  r2,r2,1

@L6:           @L28:        

cmp   r13,r23,200 cmp   r13,r25,1

bb1    lt,r13,@L7  bb1    gt,r13,@L11

or      r22,r0,128  or      r25,r0,r0

br      @L10          or.u   r22,r0,hi16(_flags)

@L11:                  or      r22,r22,lo16(_flags)

subu  r22,r22,1    @L15:        

@L10:                  or      r13,r0,1

cmp   r13,r23,1    st.b    r13,r22,r25

bb1    gt,r13,@L11         addu  r25,r25,1

or      r25,r0,r0     cmp   r12,r25,17

br      @L14          bb1    lt,r12,@L15

@L15:                  or      r25,r0,r0

or.u   r13,r0,hi16(_flags)         @L24:        

or      r13,r13,lo16(_flags)       ld.b    r12,r22,r25

or      r12,r0,1      bcnd  eq0,r12,@L17

st.b    r12,r13,r25 addu  r12,r25,r25

addu  r25,r25,1    addu  r23,r12,3

@L14:                  addu  r2,r25,r23

cmp   r13,r25,17  cmp   r12,r2,17

bb1    lt,r13,@L15          bb1    ge,r12,@L18

or      r25,r0,r0     @L20:        

br      @L23          st.b    r0,r22,r2

@L24:                  addu  r2,r2,r23

or.u   r13,r0,hi16(_flags)         cmp   r13,r2,17

or      r13,r13,lo16(_flags)       bb1    lt,r13,@L20

ld.b    r13,r13,r25 @L18:        

bcnd  eq0,r13,@L17      addu  r21,r21,1

addu  r13,r25,r25 or.u   r2,r0,hi16(@L21)

addu  r21,r13,3    or      r2,r2,lo16(@L21)

addu  r24,r25,r21 or      r3,r0,r21

br      @L19          bsr.n  _printf

@L20:                  or      r4,r0,r23

or.u   r13,r0,hi16(_flags)         @L17:        

or      r13,r13,lo16(_flags)       addu  r25,r25,1

st.b    r0,r13,r24   cmp   r13,r25,17

addu  r24,r24,r21 bb1    lt,r13,@L24

@L19:                  addu  r20,r20,1

cmp   r13,r24,17  cmp   r13,r20,1

bb1    lt,r13,@L20          bb1    le,r13,@L26

addu  r20,r20,1    or.u   r2,r0,hi16(@L27)

or.u   r2,r0,hi16(@L21) or      r2,r2,lo16(@L27)

or      r2,r2,lo16(@L21) bsr.n  _printf

or      r3,r0,r20     or      r3,r0,r21

or      r4,r0,r21     ld.d    r20,r31,aut1.+0

bsr     _printf        ld       r1,r31,arg1.-4

@L17:                  ld.d    r22,r31,aut1.+8

addu  r25,r25,1    ld       r25,r31,aut1.+16

@L23:                  jmp.n r1

cmp   r13,r25,17  addu  r31,r31,arg1.

bb1    lt,r13,@L24                   

addu  r19,r19,1             

@L25:                           

cmp   r13,r19,1             

bb1    le,r13,@L26                  

or.u   r2,r0,hi16(@L27)          

or      r2,r2,lo16(@L27)          

or      r3,r0,r20              

bsr     _printf                 

ld       r19,r31,aut1.+0            

ld       r20,r31,aut1.+4            

ld       r21,r31,aut1.+8            

ld       r22,r31,aut1.+12          

ld       r23,r31,aut1.+16          

ld       r24,r31,aut1.+20          

ld       r25,r31,aut1.+24          

ld       r1,r31,arg1.-4               

addu  r31,r31,arg1.                 

jmp   r1               

No optimisation             Full optimisation

                  

 

 

Assembler listings for optimised and non-optimised compilation

 

To highlight how optimisation can dramatically change the generated code structure, look at the C source listing for the Eratosthenes Sieve program and the resulting M88000 assem-bler listings that were generated by using the default non-optimised setting and the full optimisation option. The imme-diate difference is in the greatly reduced size of the code and the use of the .n suffix with jump and branch instructions to make use of the delay slot. This is a technique used on many RISC processors to prevent a pipeline stall when changing the program flow. If the instruction has a .n suffix, the instruction immediately after it is effectively executed with the branch and not after it, as it might appear from the listing!

 

In addition, the looping structures have been reorgan-ised to make them more efficient, although the redundant code loops could be encoded simply as a loop with a single branch. If the optimiser is that good, why has it not done this? The reason is that the compiler expects loops to be inserted for a reason and usually some form of work is done within the loop which may change the loop variables. Thus the compiler will take the general case and use that rather than completely remove it or rewrite it. If the loop had been present in a dead code area — within a conditional statement where the condi-tions would never be met — the compiler would remove the structure completely.

 

The initialisation routine _main is different in that not all the variables are initialised using a store instruction and fetching their values from a stack. The optimised version uses the faster ‘or’ instruction to set some of the variables to zero.

 

These and other changes highlight several problems with optimisation. The obvious one is with debugging the code. With the changes to the code, the assembler listing and symbol tables do not match. Where the symbols have been preserved, the code may have dramatically changed. Where the routines have been removed, the symbols and references may not be present. There are several solutions to this. The first is to debug the code with optimisation switched off. This preserves the symbol references but the code will not run at the same speed as the optimised version, and this can lead to some timing problems. A second solution is becoming available from compiler and debugger suppliers, where the optimisation techniques preserve as much of the symbolic information as possible so that function addresses and so on are not lost.

 

The second issue is concerned with the effect optimisation may have on memory mapped I/O. Unless the optimiser can recognise that a function is dealing with memory mapped I/O, it may not realise that the function is doing some work after all and remove it — with disastrous results. This may require declaring the I/O addresses as a global variable, returning a value at the function’s completion or even passing the address to the function itself, so that the optimiser can recognise its true role. A third complication can arise with optimisations such as unrolling loops and software timing. It is not uncommon to use instruction sequences to delay certain accesses or functions. A peripheral may require a certain number of clock cycles to respond to a command. This delay can be accomplished by executing other instructions, such as a loop or a divide instruc-tion. The optimiser may remove or unroll such loops and replace the inefficient divide instruction with a logical shift. While this does increase the performance, that is not what was required and the delayed peripheral access may not be long enough — again with disastrous results.

 

Such software timing should be discouraged not only for this but also for portability reasons. The timing will assume certain characteristics about the processor in terms of process-ing speed and performance which may not be consistent with other faster board designs or different processor versions.

 

Study Material, Lecturing Notes, Assignment, Reference, Wiki description explanation, brief detail
Embedded Systems Design : Emulation and debugging techniques : Optimisation problems in Emulation techniques |


Privacy Policy, Terms and Conditions, DMCA Policy and Compliant

Copyright © 2018-2024 BrainKart.com; All Rights Reserved. Developed by Therithal info, Chennai.