Chapter: Embedded Systems Design : Memory and performance trade-offs

Reducing the cost of memory access

Scenario 2 — Reducing the cost of memory access

The preceding scenario shows the delays that can be caused by accessing external memory. If the data is accessible from a local register the delay and thus the performance loss is greatly reduced and may be zero. If the data is in local on-chip memory or in an on-chip cache, the delay may only be a single cycle. If it is external DRAM, the delay may be nine or ten cycles. This demonstrates that the location of data can have a dramatic effect on any access delay and the resultant performance loss.

A good way of tackling this problem is to create a table listing each storage location, its capacity and its access speed in clock cycles, and then to develop techniques for moving data between the locations so that it is available when the processor needs it. For example, moving data into registers rather than manipulating it directly in external memory can reduce the number of cycles needed, even when the cost of saving and restoring the register contents to free up the registers is taken into account.
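The table described above might be sketched in C as follows. The cycle counts follow the figures quoted in the text (zero for a register, one for on-chip memory or cache, about ten for external DRAM). The capacities, the names storage_table, cost_in_place and cost_via_register, and the spill_cycles parameter are illustrative assumptions, not values for any particular processor.

```c
#include <assert.h>
#include <stddef.h>

/* One row per storage location: name, capacity and access delay in cycles. */
struct storage_level {
    const char *name;
    size_t      capacity_bytes;  /* assumed sizes, not from a data sheet */
    unsigned    access_cycles;   /* delay per access, as quoted in the text */
};

enum { REGISTER = 0, ON_CHIP = 1, EXTERNAL = 2, LEVEL_COUNT = 3 };

static const struct storage_level storage_table[LEVEL_COUNT] = {
    { "register",       64u,                 0u },
    { "on-chip memory", 32u * 1024u,         1u },
    { "external DRAM",  16u * 1024u * 1024u, 10u },
};

/* Delay cycles when a value is manipulated in place at the given level. */
unsigned long cost_in_place(int level, unsigned long accesses)
{
    assert(level >= 0 && level < LEVEL_COUNT);
    return accesses * storage_table[level].access_cycles;
}

/* Delay cycles when the value is loaded into a register once, worked on
 * there, and stored back once. spill_cycles is the assumed cost of saving
 * and restoring whatever the register held before. */
unsigned long cost_via_register(int level, unsigned long accesses,
                                unsigned long spill_cycles)
{
    assert(level >= 0 && level < LEVEL_COUNT);
    unsigned long load_and_store = 2ul * storage_table[level].access_cycles;
    return load_and_store + spill_cycles
         + accesses * storage_table[REGISTER].access_cycles;
}
```

With these assumed figures, 100 accesses made directly to external DRAM cost 1000 delay cycles, while loading the value into a register first costs 40 cycles even with a 20-cycle save and restore.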

Using registers

Registers are the fastest storage resource available to the processor and also the smallest. As a result they are an extremely scarce resource that has to be used and managed carefully. The main problem is deciding what information to store in them and when, because there is frequently not enough register space to hold all the data all of the time. Registers are therefore used to hold temporary values before main memory is updated with a final result, to hold loop counters, and to hold other frequently used key values. There are several approaches to managing registers and exploiting their speed:

• Let the compiler do the work

Many compilers will take care of register management automatically when they are told to use optimisation techniques. For many applications written in a high-level language such as C, this is often a good choice.

• Load and store to provide faster local storage

In this case, variables stored in external memory or on a stack are copied to an internal register, processed and then the result is written back out. With RISC processors that do not support direct memory manipulation, this is the only method of manipulating data. With CISC processors, such as the M68000 family, there is a choice. By writing code so that data is brought in to be manipulated, instead of using addressing modes that operate directly on external memory, the impact of slow memory access can be minimised.

• Declaring register-based variables

Assigning a variable to a register in its initial declaration makes physical access to the variable quicker. This can be done explicitly or implicitly. Explicit declarations use special attributes that the programmer adds to the declaration, e.g. reg. With an implicit declaration, the compiler takes a standard declaration such as global or static and uses it to allocate a register if possible.

• Context save and restore

If more variables are assigned to registers than there are registers within the processor, the software may need to perform a full save and restore of the register set before using or accessing a register to ensure that these variables are not corrupted. This is an accepted process when multitasking so that the data of one task that resides in the processor registers is not corrupted. This procedure may need to be used at a far lower level in the software to prevent inadvertent register-based data corruption.
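The load and store approach and the explicit register declaration described above can be sketched together in C. This is a minimal illustration, assuming that volatile stands in for a location in slow external memory that must be accessed every time it is named. The function names are hypothetical, and the standard C register keyword is only a hint that the compiler is free to ignore.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Direct manipulation: every loop iteration reads and writes *total,
 * so each one pays the memory access delay twice. */
void sum_in_memory(volatile uint32_t *total, const uint32_t *data, size_t n)
{
    assert(total != NULL);
    for (size_t i = 0; i < n; i++)
        *total += data[i];
}

/* Load, process, store: the running total lives in a local variable that
 * the compiler is asked to keep in a register. Memory is touched once to
 * load the start value and once to write back the final result. */
void sum_via_register(volatile uint32_t *total, const uint32_t *data, size_t n)
{
    assert(total != NULL);
    register uint32_t acc = *total;   /* one load from slow memory */
    for (size_t i = 0; i < n; i++)
        acc += data[i];               /* work is done in the register */
    *total = acc;                     /* one store back to slow memory */
}
```

Both functions produce the same result; the difference lies only in how often the slow location is accessed.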

Using caches

Data and instruction caches work on the principle that both data and code are accessed more than once. The cache memory will store the information as it is fetched from the main memory so that any subsequent access is from the faster cache memory. This assumption is important because straight line code without branches, loops and so on will not benefit from a cache.

The size and organisation of the cache are important because they determine the efficiency of the overall system. If the program loops fit into the cache memory, the fastest performance will be realised and the whole of the external bus bandwidth will be available for data movement. If the data cache can hold all the data structures, then data access will be the fastest possible. In practice, the overall performance gain is less than ideal because access to external memory is inevitably needed in three cases: to fetch the code and data the first time around, when the cache is not big enough to contain all the instructions or data, and when the external bus must be used to access an I/O port whose data cannot be cached. Interrupts and other asynchronous events also compete for the cache and can cause instructions and data that were cached before the event to be removed, forcing external memory accesses when the original program flow resumes.

Preloading caches

One trick that can be used with caches is to preload them so that a cache miss is never encountered. In normal operation, a cache miss forces an external memory access, and while this is in progress the processor is stalled waiting for the information (data or instruction) to be returned. In many code sequences this is more likely to happen with data, because the first access to the data through the cache and external bus is made at the moment the data is needed. As described earlier in scenario 1, this delay occurs at an important point in the sequence and prevents the processor from continuing.

By using the same technique as in scenario 1, the data cache can be preloaded with the information for the next processing iteration before the current one has completed. The PowerPC architecture provides special instructions that allow this to be performed. In this way, the slow data access is overlapped with the processing and with data accesses that hit in the cache, and does not cause any processor stalls. In other words, it ensures that the cache already holds the data by the time the instruction stream needs it.

By the very nature of this technique, it is one that is normally performed by hand and not automatically available through the optimisation techniques supplied with high level language compilers.

It is very useful with large amounts of data that would not fit into a register. However, if the time taken to fetch the data is greater than the time taken to process the previous block, then the processing will stall.
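A hand-written preload of this kind might look like the sketch below. It does not use the PowerPC instructions mentioned in the text. Instead it assumes a GCC or Clang compiler and uses their __builtin_prefetch built-in, falling back to a no-op elsewhere. The PRELOAD macro, BLOCK_WORDS and the function names are hypothetical, and a single prefetch typically fetches only one cache line, so a real block may need several.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
/* Arguments: address, 0 = prefetch for reading, 3 = keep in cache. */
#define PRELOAD(addr) __builtin_prefetch((addr), 0, 3)
#else
#define PRELOAD(addr) ((void)(addr))   /* no prefetch support assumed */
#endif

#define BLOCK_WORDS 16u   /* assumed block size for illustration */

/* Stand-in for the real processing: sums one block of data. */
static uint32_t process_block(const uint32_t *block, size_t words)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < words; i++)
        sum += block[i];
    return sum;
}

/* Processes the blocks in order, requesting the next block from memory
 * before working on the current one so that the fetch overlaps with the
 * processing. */
uint32_t process_all(const uint32_t *data, size_t blocks)
{
    uint32_t total = 0;
    assert(data != NULL || blocks == 0);
    for (size_t b = 0; b < blocks; b++) {
        if (b + 1 < blocks)
            PRELOAD(&data[(b + 1) * BLOCK_WORDS]);  /* start next fetch */
        total += process_block(&data[b * BLOCK_WORDS], BLOCK_WORDS);
    }
    return total;
}
```

The prefetch is only a hint: the result is the same with or without it, and the benefit depends on the processing of one block taking at least as long as the fetch of the next.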

Caches also have one other interesting characteristic in that they can make it very difficult to predict how long a particular operation will take to execute. If everything is in the cache, the time will be short. If not then it will be long. In practice, it will be somewhere in between. The problem is that the actual time will depend on the number of cache hits and misses which will depend in turn on the software that has run before which will have overwritten some of the entries. As a result, the actual timing becomes more statistical in nature and in many cases the worst case timing has to be assumed, even though statistically the routine will execute faster 99.999% of the time!
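The spread between best, actual and worst case timing can be illustrated with a small calculation. The sketch below assumes the figures quoted earlier (one cycle for a cache hit, ten for an external memory access) and counts only memory access cycles. The function names are hypothetical.

```c
#include <assert.h>

#define HIT_CYCLES   1u   /* access served from the cache */
#define MISS_CYCLES 10u   /* access that goes to external memory */

/* Every access hits the cache. */
unsigned long best_case_cycles(unsigned long accesses)
{
    return accesses * HIT_CYCLES;
}

/* Every access misses: the figure a hard real-time design must assume. */
unsigned long worst_case_cycles(unsigned long accesses)
{
    return accesses * MISS_CYCLES;
}

/* Timing for one particular run with a known number of misses. */
unsigned long actual_cycles(unsigned long accesses, unsigned long misses)
{
    assert(misses <= accesses);
    return (accesses - misses) * HIT_CYCLES + misses * MISS_CYCLES;
}
```

For 1000 accesses the bounds are 1000 and 10000 cycles; a run with 50 misses takes 1450 cycles, much closer to the best case than to the worst case that has to be budgeted for.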

Using on-chip memory

Some microcontrollers and DSP chips have local memory which can be used to store data or instructions and thus offers fast local storage. Any access to it will be fast and thus data and code structures will always gain some benefit if they are located here. The problem is that to gain the best benefit, both the code and data structures must fit in the on-chip memory so that no external accesses are necessary. This may not be possible for some programs and therefore decisions have to be made on which parts of the code and data structures are allocated this resource. With a real-time operating system, local on-chip memory is often used to gain the best context switching time. This memory requirement now has to compete with algorithms that need on-chip storage to meet the performance requirements by minimising any processor stalls.

One good thing about using on-chip memory is that it makes performance calculations easier as the memory accesses will have a consistent access time.

Using DMA

Some microcontrollers and DSPs have on-chip DMA controllers which can be used in conjunction with local memory to create a sort of crude but efficient cache. In reality, it is more like a buffering technique with an intelligent agent filling up and controlling the buffers in parallel with the processing.

The basic technique divides the local memory into two or more buffers and programs the DMA controller to transfer data from the external memory into one local on-chip buffer while the data in the other buffer is processed. Overlapping the DMA transfer with the processing means that the processing always has local access to its data instead of having to wait for far slower external memory.
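The two-buffer scheme can be sketched in C as shown below. No real DMA controller is involved: dma_start and dma_wait are hypothetical stand-ins, simulated here with memcpy so that the sketch runs on a host machine. On real hardware dma_start would program the controller and return at once, and dma_wait would wait for the transfer-complete indication. BUF_WORDS and the other names are also assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_WORDS 8u   /* assumed buffer size for illustration */

/* Two buffers that would live in local on-chip memory. */
static uint32_t local_buf[2][BUF_WORDS];

/* Hypothetical DMA interface, simulated with a synchronous copy. */
static void dma_start(uint32_t *dst, const uint32_t *src, size_t words)
{
    memcpy(dst, src, words * sizeof *dst);
}

static void dma_wait(void)
{
    /* Simulation only: the copy above has already finished. */
}

/* Stand-in for the real processing: sums one buffer. */
static uint32_t process(const uint32_t *buf, size_t words)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < words; i++)
        sum += buf[i];
    return sum;
}

/* Processes 'blocks' blocks of external data. While one buffer is being
 * processed, the other is being filled with the next block. */
uint32_t process_stream(const uint32_t *external, size_t blocks)
{
    uint32_t total = 0;
    unsigned active = 0;

    if (blocks == 0)
        return 0;
    assert(external != NULL);

    dma_start(local_buf[active], external, BUF_WORDS);  /* prime buffer 0 */
    dma_wait();

    for (size_t b = 0; b < blocks; b++) {
        unsigned next = active ^ 1u;
        if (b + 1 < blocks)   /* start filling the idle buffer */
            dma_start(local_buf[next],
                      external + (b + 1) * BUF_WORDS, BUF_WORDS);
        total += process(local_buf[active], BUF_WORDS);
        if (b + 1 < blocks)
            dma_wait();       /* next block must be complete before swap */
        active = next;
    }
    return total;
}
```

The processing code only ever touches local_buf, so in a real system its access time would be that of the on-chip memory provided each transfer finishes before the previous block has been processed.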

The buffering technique can be made more sophisticated by incorporating additional DMA transfers to move data out of the local memory back to the external memory. This may require the use of many more, smaller buffers with different DMA characteristics. Constants could be put into one buffer that is read in but never written back. Variables could be stored in another whose contents are written out to external memory.

Making the right decisions

The main problem faced by designers with these techniques is knowing which one(s) should be used. Making the choice requires a high degree of knowledge about the processor and the system characteristics. While a certain amount of information can be obtained from a documentation-based analysis, the use of simulation tools to run through code sequences and report cache hit ratios, processor stalls and so on is a vital part of obtaining the optimum solution. Because of this, many cycle-level processor simulation tools are becoming available which help provide this level of information.
