Scenario 2 — Reducing the cost of
The preceding scenario shows the delays that can be caused by accessing
external memory. If the data is accessible from a local register the delay and
thus the performance loss is greatly reduced and may be zero. If the data is in
local on-chip memory or in an on-chip cache, the delay may only be a single
cycle. If it is external DRAM, the delay may be nine or ten cycles. This
demonstrates that the location of data can have a dramatic effect on any access
delay and the resultant performance loss.
A good way of tackling this problem is to create a table listing each
storage location, its capacity and its access time in clock cycles, and then
to develop techniques to move data between the various locations so that it is
available when the processor needs it. For example, moving the data into
registers instead of manipulating it directly in external memory can reduce
the number of cycles needed, even when the saving and restoring of the register
contents to free up the storage is taken into account.
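Such a table can be held directly in code and used to estimate the cost of a
data placement. The sketch below is illustrative only. The cycle counts follow
the figures quoted above (zero for a register, one for on-chip memory or cache,
about ten for external DRAM). The names storage_level, access_cycles,
total_access_cycles and register_copy_cycles are hypothetical, and the
capacities that a real table would also list are omitted because they are
device specific. The register save and restore is assumed to go to external
memory.

```c
#include <assert.h>

/* Storage locations, ordered from fastest to slowest. */
enum storage_level {
    STORE_REGISTER,
    STORE_ON_CHIP,
    STORE_EXTERNAL_DRAM,
    STORE_LEVELS
};

/* Assumed access delay in clock cycles for each location (illustrative
 * figures taken from the text, not from any particular device). */
static const unsigned access_cycles[STORE_LEVELS] = { 0u, 1u, 10u };

/* Total delay cycles for `accesses` accesses to data held at `level`. */
unsigned long total_access_cycles(enum storage_level level,
                                  unsigned long accesses)
{
    assert(level < STORE_LEVELS);
    return (unsigned long)access_cycles[level] * accesses;
}

/* Delay cycles when the value is worked on in a register instead:
 * one load from and one store to external memory, plus one save and one
 * restore of the register that was freed up (assumed to go to external
 * memory), with every intermediate access costing register time. */
unsigned long register_copy_cycles(unsigned long accesses)
{
    unsigned long load_store   = 2ul * access_cycles[STORE_EXTERNAL_DRAM];
    unsigned long save_restore = 2ul * access_cycles[STORE_EXTERNAL_DRAM];

    return load_store + save_restore
         + total_access_cycles(STORE_REGISTER, accesses);
}
```

With these assumed figures, 100 accesses made directly to external DRAM cost
1000 delay cycles, whereas the register copy costs 40 including the save and
restore.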
Registers are the fastest storage resource available to the
processor and are the smallest in size. As a result they are an extremely
scarce resource which has to be used and managed carefully. The main problem is
in deciding what information to store and when. This dilemma comes from the
fact that there is frequently insufficient register space to store all the data
all of the time. As a result, registers are used to store temporary values
before updating main memory with a final result, to hold counter values for
loop constructions and so on, and to hold other key values. There are several
approaches to doing this and exploiting their speed:
Let the compiler do the work
Many compilers will take care of register management automatically for
you when they are told to use optimisation techniques. For many applications that
are written in a high level language such as C, this is often a good choice.
Load and store to provide faster
In this case, variables stored in external memory or on a stack are
copied to an internal register, processed and then the result is written back
out. With RISC processors that do not support direct memory manipulation, this
is the only method of manipulating data. With CISC processors, such as the
M68000 family, there is a choice. By writing code so that data is brought in to
be manipulated, instead of using addressing modes that operate directly on
external memory, the impact of slow memory access can be minimised.
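In C, the same idea can be expressed by copying the value into a local
variable, which a compiler will normally keep in a register, and writing it
back once. This is a minimal sketch, assuming a hypothetical running total held
in memory behind the pointer total. Whether the first version really touches
memory on every iteration depends on the compiler and its optimisation
settings.

```c
#include <assert.h>
#include <stddef.h>

/* Direct manipulation: the total in memory is read and written on every
 * pass through the loop (unless the compiler can prove it safe to defer). */
void sum_in_memory(const int *data, size_t n, int *total)
{
    assert(data != NULL && total != NULL);
    for (size_t i = 0; i < n; i++) {
        *total += data[i];
    }
}

/* Load and store: one read of the total, all the work done on a local
 * copy, and one write of the final result. */
void sum_in_register(const int *data, size_t n, int *total)
{
    assert(data != NULL && total != NULL);
    int acc = *total;                  /* load */
    for (size_t i = 0; i < n; i++) {
        acc += data[i];                /* process the local copy */
    }
    *total = acc;                      /* store */
}
```

Both routines produce the same result. The second simply makes the load,
process and store steps explicit instead of relying on the compiler to find
them.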
By assigning a variable to a register during its initial declaration,
the physical access to the variable will be quicker. This can be done
explicitly or implicitly. Explicit declarations use special attributes that
the programmer adds to the declaration, e.g. reg. An implicit declaration is
where the compiler takes a standard declaration such as global or static
and implicitly uses this to allocate a register if possible.
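In standard C the explicit form is spelled register. The attribute name varies
between languages and compilers. The sketch below uses a hypothetical checksum
routine. Note that register is only a request: the compiler is free to ignore
it, and many optimising compilers make the same allocation implicitly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical byte checksum with explicitly register-declared variables. */
unsigned checksum(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    register unsigned sum = 0;   /* explicit request for a register */
    register size_t i;           /* loop counter, also requested */

    for (i = 0; i < len; i++) {
        sum += buf[i];
    }
    return sum;
}
```

One visible consequence of the explicit form is that the address of a register
variable cannot be taken, so it cannot be passed by pointer to another routine.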
Context save and restore
If more variables are assigned to registers than there are registers
within the processor, the software may need to perform a full save and restore
of the register set before using or accessing a register to ensure that these
variables are not corrupted. This is an accepted process when multitasking so
that the data of one task that resides in the processor registers is not
corrupted. This procedure may need to be used at a far lower level in the
software to prevent inadvertent register-based data corruption.
Data and instruction caches work on the principle that both data and
code are accessed more than once. The cache memory will store the information
as it is fetched from the main memory so that any subsequent access is from the
faster cache memory. This assumption is important because straight line code
without branches, loops and so on will not benefit from a cache.
The size and organisation of the cache is important because it
determines the efficiency of the overall system. If the program loops will fit
into the cache memory, the fastest performance will be realised and the whole
of the external bus bandwidth will be available for data movement. If the data
cache can hold all the data structures, then the data access will be the
fastest possible. In practice, the overall performance gain is less than ideal
because inevitably access to the external memory will be needed either to fetch
the code and data the first time around, when the cache is not big enough to
contain all the instructions or data, or when the external bus must be used to
access an I/O port where data cannot be cached. Interrupts and other
asynchronous events will also compete for the cache and can cause instructions
and data that have been cached prior to the event to be removed, thus forcing
external memory accesses when the original program flow is continued.
One trick that can be used with caches is to preload them so that a
cache miss is never encountered. With normal operation, a cache miss will force
an external memory access and while this is in progress, the processor is
stalled waiting for the information (data or instruction) to be returned. In many
code sequences, this is more likely to happen with data, where the first time
that the cache and external bus are used to access the data is when it is
needed. As described earlier with scenario 1, this delay occurs at an important
point in the sequence and the delay prevents the processor from continuing.
By using the same technique as used in scenario 1, the data cache can be
preloaded with information for the next processing iteration before the current
one has completed. The PowerPC architecture provides special instructions that
allow this to be performed. In this way, the slow data access is overlapped
with the processing and data access from the cache and does not cause any
processor stalls. In other words, it ensures that the cache always holds the
data that the instruction stream is about to need.
By the very nature of this technique, it is one that is normally
performed by hand and is not automatically available through the optimisation
techniques supplied with high level language compilers.
It is very useful with large amounts of data that would not fit into a
register. However, if the time taken to fetch the data is greater than the time
taken to process the previous block, then the processing will stall.
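A minimal sketch of this block-ahead preloading in C follows. It assumes a GCC
or Clang compiler, whose __builtin_prefetch built-in issues a cache-touch hint
where the target supports one. The block size and the process_blocks routine
are hypothetical. The prefetch is only a hint and does not change the result.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed block size, chosen here to match one cache line of 64 bytes with
 * 4-byte ints. A larger block would need one prefetch per cache line. */
#define BLOCK_WORDS 16u

/* Sum `nblocks` blocks of data, requesting block k+1 while block k is
 * being processed so that the slow access overlaps with useful work. */
long process_blocks(const int *data, size_t nblocks)
{
    assert(data != NULL || nblocks == 0);
    long sum = 0;

    for (size_t k = 0; k < nblocks; k++) {
        const int *block = data + k * BLOCK_WORDS;

        if (k + 1 < nblocks) {
            /* Hint only: read access (0), low temporal locality (1). */
            __builtin_prefetch(block + BLOCK_WORDS, 0, 1);
        }
        for (size_t i = 0; i < BLOCK_WORDS; i++) {
            sum += block[i];   /* current block, ideally already cached */
        }
    }
    return sum;
}
```

On PowerPC targets the compiler can map this hint onto the data cache block
touch (dcbt) instruction. The gain depends on the processing of one block
taking at least as long as the fetch of the next.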
Caches also have one other interesting characteristic in that they can
make it very difficult to predict how long a particular operation will take to
execute. If everything is in the cache, the time will be short. If not then it
will be long. In practice, it will be somewhere in between. The problem is that
the actual time will depend on the number of cache hits and misses which will
depend in turn on the software that has run before which will have overwritten
some of the entries. As a result, the actual timing becomes more statistical in
nature and in many cases the worst case timing has to be assumed, even though
statistically the routine will execute faster 99.999% of the time!
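The spread between the best and worst cases can be bounded with simple
arithmetic. The sketch below assumes that a hit costs one cycle and a miss ten
(the illustrative figures used earlier in this section). The names
timing_bounds and memory_timing are hypothetical.

```c
#include <assert.h>

#define HIT_CYCLES   1ul    /* assumed cache hit cost */
#define MISS_CYCLES 10ul    /* assumed external memory access cost */

struct timing_bounds {
    unsigned long best;      /* every access hits the cache */
    unsigned long worst;     /* every access misses the cache */
    unsigned long observed;  /* for the given number of misses */
};

/* Memory access time, in cycles, for a routine that makes `accesses`
 * memory accesses, of which `misses` miss the cache. */
struct timing_bounds memory_timing(unsigned long accesses,
                                   unsigned long misses)
{
    assert(misses <= accesses);

    struct timing_bounds t;
    t.best     = accesses * HIT_CYCLES;
    t.worst    = accesses * MISS_CYCLES;
    t.observed = (accesses - misses) * HIT_CYCLES + misses * MISS_CYCLES;
    return t;
}
```

For 1000 accesses with 20 misses this gives 1180 cycles, against a best case of
1000 and a worst case of 10000. It is the worst case figure that a hard
real-time budget has to use.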
Using on-chip memory
Some microcontrollers and DSP chips have local memory which can be used
to store data or instructions and thus offers fast local storage. Any access to
it will be fast and thus data and code structures will always gain some benefit
if they are located here. The problem is that to gain the best benefit, both
the code and data structures must fit in the on-chip memory so that no external
accesses are necessary. This may not be possible for some programs and
therefore decisions have to be made on which parts of the code and data
structures are allocated this resource. With a real-time operating system,
local on-chip memory is often used to gain the best context switching time.
This memory requirement now has to compete with algorithms that need on-chip
storage to meet the performance requirements by minimising any processor
One good thing about using on-chip memory is that it makes performance
calculations easier as the memory accesses will have a consistent access time.
Some microcontrollers and DSPs have on-chip DMA controllers which can
be used in conjunction with local memory to create a sort of crude but
efficient cache. In reality, it is more like a buffering technique with an
intelligent agent filling up and controlling the buffers in parallel with the
The basic technique divides the local memory into two or more buffers,
and programs the DMA controller to transfer data from the external memory to
the local on-chip memory buffer while the data in the other buffer is
processed. The overlapping of the DMA data transfer and the processing means
that the data processing has local access to its data instead of having to wait
for far slower memory access.
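The double-buffer scheme can be sketched as follows. No real DMA controller is
programmed here: dma_start and dma_wait are hypothetical stand-ins, implemented
with memcpy so that the example runs on a host. On real hardware the transfer
would proceed in the background while process runs. The buffer size is also an
assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_WORDS 8u                     /* assumed buffer size */

static int local_buf[2][BUF_WORDS];      /* stands in for on-chip memory */

/* Hypothetical DMA interface. Here the copy completes immediately. */
static void dma_start(int *dst, const int *src, size_t words)
{
    memcpy(dst, src, words * sizeof *dst);
}

static void dma_wait(void)
{
    /* Real code would poll a status flag or wait for an interrupt. */
}

/* The work done on one buffer of local data. */
static long process(const int *buf, size_t words)
{
    long sum = 0;
    for (size_t i = 0; i < words; i++) {
        sum += buf[i];
    }
    return sum;
}

/* Process `nbufs` buffers' worth of external data, filling one local
 * buffer while the other is being processed. */
long process_stream(const int *external, size_t nbufs)
{
    assert(external != NULL || nbufs == 0);
    long total = 0;
    unsigned cur = 0;

    if (nbufs > 0) {
        dma_start(local_buf[cur], external, BUF_WORDS);  /* prime buffer 0 */
    }
    for (size_t k = 0; k < nbufs; k++) {
        dma_wait();                          /* buffer `cur` is now full */
        if (k + 1 < nbufs) {                 /* start filling the other one */
            dma_start(local_buf[cur ^ 1u],
                      external + (k + 1) * BUF_WORDS, BUF_WORDS);
        }
        total += process(local_buf[cur], BUF_WORDS);
        cur ^= 1u;                           /* swap buffers */
    }
    return total;
}
```

The processing loop only ever reads from local_buf, so its access time stays
constant provided that each transfer finishes before the buffer is needed.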
The buffering technique can be made more sophisticated by incorporating
additional DMA transfers to move data out of the local memory back to the
external memory. This may require the use of many more smaller buffers with
different DMA characteristics. Constants could be put into one buffer which
is read in but never written back out. Variables can be stored in another where the
information is written out to external memory.
Making the right decisions
The main problem faced by designers with these techniques is
knowing which one(s) should be used, because the choice requires a high
degree of knowledge about the processor and the system characteristics. While a
certain amount of information can be obtained from a documentation-based
analysis, the use of simulation tools to run through code sequences and provide
information concerning cache hit ratios, processor stalls and so on is a vital
part of obtaining the optimum solution. Because of this, many cycle level
processor simulation tools are becoming available which help provide this level