Optimisation problems
The difficulties do not stop with hardware mechanical problems. Software
debugging can be confused or hampered by optimisation techniques used by the
compiler to improve the efficiency of the code. Usually set by options from the
command line, the optimisation routines examine the code and change it to
improve its efficiency, while retaining its logical design and context. Many
different techniques are used but they fall into two main types: those that
remove code and those that add code or change it. A compiler may remove
variables or routines that are never used or do not return any function. Small
loops may be unrolled into straight line code to remove branching delays at the
expense of a slightly larger program. Floating point routines may be replaced
by inline floating point instructions. The net result is code that is different
from the assembler listing produced by the compiler. In addition, the generated
symbol table may be radically different from that expected from the source
code.
These optimisation techniques can be ruthless; I have known whole
routines to be removed and in one case a com-plete program was reduced to a
single NOP instruction! The program was a set of functions that performed
benchmark routines but did not use any global information or return any values.
The optimiser saw this and decided that as no data was passed to it and it did
not modify or return any global data, it effectively did nothing and replaced
it with a NOP. When benchmarked, it gave a pretty impressive performance of
zero seconds to execute several million calculations.
/* sieve.c — Eratosthenes Sieve prime number calculation */
/* scaled down with MAX_PRIME set to 17 instead of 8091 */
#define MAX_ITER 1
#define MAX_PRIME 17
char flags[MAX_PRIME];
main ()
{
register int i,k,l,m; int prime,count,iter;
for (iter = 1;iter<=MAX_ITER;iter++)
{
count = 0;
/* redundant code added here */
for(l = 0; l < 200; l++ ); for(m = 128; l > 1; m— );
/* redundant code ends here */
for(i = 0; i<MAX_PRIME; i++) flags[i] = 1;
for(i = 0; i<MAX_PRIME; i++) if(flags[i])
{
prime = i + i + 3; k = i + prime;
while (k < MAX_PRIME)
{
flags[k] = 0; k += prime;
}
count++;
printf(“ prime %d =
%d\n”, count, prime);
}
}
printf(“\n%d primes\n”,count);
}
Source listing for optimisation example
file "ctm1AAAa00360" file "ctm1AAAa00355"
def aut1.,32 def aut1.,32
def arg1.,64 def arg1.,56
text text
global _main global _main
_main: _main:
subu r31,r31,arg1. subu r31,r31,arg1.
st r1,r31,arg1.-4 st r1,r31,arg1.-4
st r19,r31,aut1.+0 st.d r20,r31,aut1.+0
st r20,r31,aut1.+4 st.d r22,r31,aut1.+8
st r21,r31,aut1.+8 st r25,r31,aut1.+16
st r22,r31,aut1.+12 or r20,r0,1
st r23,r31,aut1.+16 @L26:
st r24,r31,aut1.+20 or r21,r0,r0
st r25,r31,aut1.+24 or r25,r0,r0
or r19,r0,1 @L7:
br @L25 addu r25,r25,1
@L26: cmp r13,r25,200
or r20,r0,r0 bb1 lt,r13,@L7
or r23,r0,r0 br.n @L28
br @L6 or r2,r0,128
@L7: @L11:
addu r23,r23,1 subu r2,r2,1
@L6: @L28:
cmp r13,r23,200 cmp r13,r25,1
bb1 lt,r13,@L7 bb1 gt,r13,@L11
or r22,r0,128 or r25,r0,r0
br @L10 or.u r22,r0,hi16(_flags)
@L11: or r22,r22,lo16(_flags)
subu r22,r22,1 @L15:
@L10: or r13,r0,1
cmp r13,r23,1 st.b r13,r22,r25
bb1 gt,r13,@L11 addu r25,r25,1
or r25,r0,r0 cmp r12,r25,17
br @L14 bb1 lt,r12,@L15
@L15: or r25,r0,r0
or.u r13,r0,hi16(_flags) @L24:
or r13,r13,lo16(_flags) ld.b r12,r22,r25
or r12,r0,1 bcnd eq0,r12,@L17
st.b r12,r13,r25 addu r12,r25,r25
addu r25,r25,1 addu r23,r12,3
@L14: addu r2,r25,r23
cmp r13,r25,17 cmp r12,r2,17
bb1 lt,r13,@L15 bb1 ge,r12,@L18
or r25,r0,r0 @L20:
br @L23 st.b r0,r22,r2
@L24: addu r2,r2,r23
or.u r13,r0,hi16(_flags) cmp r13,r2,17
or r13,r13,lo16(_flags) bb1 lt,r13,@L20
ld.b r13,r13,r25 @L18:
bcnd eq0,r13,@L17 addu r21,r21,1
addu r13,r25,r25 or.u r2,r0,hi16(@L21)
addu r21,r13,3 or r2,r2,lo16(@L21)
addu r24,r25,r21 or r3,r0,r21
br @L19 bsr.n _printf
@L20: or r4,r0,r23
or.u r13,r0,hi16(_flags) @L17:
or r13,r13,lo16(_flags) addu r25,r25,1
st.b r0,r13,r24 cmp r13,r25,17
addu r24,r24,r21 bb1 lt,r13,@L24
@L19: addu r20,r20,1
cmp r13,r24,17 cmp r13,r20,1
bb1 lt,r13,@L20 bb1 le,r13,@L26
addu r20,r20,1 or.u r2,r0,hi16(@L27)
or.u r2,r0,hi16(@L21) or r2,r2,lo16(@L27)
or r2,r2,lo16(@L21) bsr.n _printf
or r3,r0,r20 or r3,r0,r21
or r4,r0,r21 ld.d r20,r31,aut1.+0
bsr _printf ld r1,r31,arg1.-4
@L17: ld.d r22,r31,aut1.+8
addu r25,r25,1 ld r25,r31,aut1.+16
@L23: jmp.n r1
cmp r13,r25,17 addu r31,r31,arg1.
bb1 lt,r13,@L24
addu r19,r19,1
@L25:
cmp r13,r19,1
bb1 le,r13,@L26
or.u r2,r0,hi16(@L27)
or r2,r2,lo16(@L27)
or r3,r0,r20
bsr _printf
ld r19,r31,aut1.+0
ld r20,r31,aut1.+4
ld r21,r31,aut1.+8
ld r22,r31,aut1.+12
ld r23,r31,aut1.+16
ld r24,r31,aut1.+20
ld r25,r31,aut1.+24
ld r1,r31,arg1.-4
addu r31,r31,arg1.
jmp r1
No optimisation Full
optimisation
Assembler listings for optimised and non-optimised compilation
To highlight how optimisation can dramatically change the generated code
structure, look at the C source listing for the Eratosthenes Sieve program and
the resulting M88000 assem-bler listings that were generated by using the
default non-optimised setting and the full optimisation option. The imme-diate
difference is in the greatly reduced size of the code and the use of the .n
suffix with jump and branch instructions to make use of the delay slot. This is
a technique used on many RISC processors to prevent a pipeline stall when
changing the program flow. If the instruction has a .n suffix, the instruction
immediately after it is effectively executed with the branch and not after it,
as it might appear from the listing!
In addition, the looping structures have been reorgan-ised to make them
more efficient, although the redundant code loops could be encoded simply as a
loop with a single branch. If the optimiser is that good, why has it not done
this? The reason is that the compiler expects loops to be inserted for a reason
and usually some form of work is done within the loop which may change the loop
variables. Thus the compiler will take the general case and use that rather
than completely remove it or rewrite it. If the loop had been present in a dead
code area — within a conditional statement where the condi-tions would never be
met — the compiler would remove the structure completely.
The initialisation routine _main is different in that not all the variables are initialised using a
store instruction and fetching their values from a stack. The optimised version
uses the faster ‘or’ instruction to set some of the variables to zero.
These and other changes highlight several problems with optimisation.
The obvious one is with debugging the code. With the changes to the code, the
assembler listing and symbol tables do not match. Where the symbols have been
preserved, the code may have dramatically changed. Where the routines have been
removed, the symbols and references may not be present. There are several
solutions to this. The first is to debug the code with optimisation switched
off. This preserves the symbol references but the code will not run at the same
speed as the optimised version, and this can lead to some timing problems. A
second solution is becoming available from compiler and debugger suppliers,
where the optimisation techniques preserve as much of the symbolic information
as possible so that function addresses and so on are not lost.
The second issue is concerned with the effect optimisation may have on
memory mapped I/O. Unless the optimiser can recognise that a function is
dealing with memory mapped I/O, it may not realise that the function is doing
some work after all and remove it — with disastrous results. This may require
declaring the I/O addresses as a global variable, returning a value at the
function’s completion or even passing the address to the function itself, so
that the optimiser can recognise its true role. A third complication can arise
with optimisations such as unrolling loops and software timing. It is not
uncommon to use instruction sequences to delay certain accesses or functions. A
peripheral may require a certain number of clock cycles to respond to a
command. This delay can be accomplished by executing other instructions, such
as a loop or a divide instruc-tion. The optimiser may remove or unroll such
loops and replace the inefficient divide instruction with a logical shift.
While this does increase the performance, that is not what was required and the
delayed peripheral access may not be long enough — again with disastrous
results.
Such software timing should be discouraged not only for this but also
for portability reasons. The timing will assume certain characteristics about
the processor in terms of process-ing speed and performance which may not be
consistent with other faster board designs or different processor versions.
Related Topics
Privacy Policy, Terms and Conditions, DMCA Policy and Compliant
Copyright © 2018-2023 BrainKart.com; All Rights Reserved. Developed by Therithal info, Chennai.