Data Reuse
1 Types of Reuse
2 Self Reuse
3 Self-Spatial Reuse
4 Group Reuse
5 Exercises for Section 11.5
From array access functions we derive two kinds of information
useful for locality optimization and parallelization:
1.
Data reuse: for locality optimization, we wish to identify
sets of iterations that access the same data or the same cache line.
2.
Data
dependence: for correctness of
parallelization and locality loop trans-formations, we wish to identify all
the data dependences in the code. Re-call that two (not necessarily distinct)
accesses have a data dependence if instances of the accesses may refer to the
same memory location, and at least one of them is a write.
In many cases, whenever we identify iterations that reuse the same
data, there are data dependences between them.
Whenever there is a data dependence, obviously the same data is
reused. For example, in matrix multiplication, the same element in the output
array is written 0(n) times. The write operations must be executed in
the original execution order;3 there is reuse because we can allocate the same
element to a register.
However, not all
reuse can be exploited in locality optimizations; here is an example
illustrating this issue.
We observe that the
loop writes to a different location at each iteration, so there are no reuses
or dependences on the different write operations. The loop, how-ever, reads
locations 5 , 8 , 1 1 , 1 4 , 1 7 , . . . , and writes locations 3 , 1 0 , 1 7
, 2 4 , — The read and write iterations access the same elements 17, 38, and 59
and so on. That is, the integers of the form 17 + 21 j for j = 0,1, 2 ,
. . . are all those integers that can be written both as 7ii = 3 and as 3i2 + 5, for some integers i1 and i 2 . However, this reuse occurs rarely, and cannot
be exploited easily if at all.
Data dependence is different from reuse analysis in that one of
the accesses sharing a data dependence must be a write access. More
importantly, data de-pendence needs to be both correct and precise. It needs to
find all dependences for correctness, and it should not find spurious dependences
because they can cause unnecessary serialization.
With data reuse, we only need to find where most of the
exploitable reuses are. This problem is much simpler, so we take up this topic
here in this section and tackle data dependences in the next. We simplify reuse
analysis by ignoring loop bounds, because they seldom change the shape of the
reuse. Much of the reuse exploitable by affine partitioning resides among
instances of the same array accesses, and accesses that share the same coefficient
matrix (what we have typically called F in the affine index
function). As shown above, access patterns like 7% + 3 and 3i + 5
have no reuse of interest.
1. Types of Reuse
We first start with Example 11.20 to illustrate the different
kinds of data reuses. In the following, we need to distinguish between the
access as an instruction in a program, e.g., x
= Z [ i , j ] , from the execution of this instruction many times, as we
execute the loop nest. For emphasis, we
may refer to the statement itself as a static
access, while the various
iterations of the statement as we execute its loop nest are called dynamic accesses.
Reuses can be classified as self versus group. If
iterations reusing the same data come from the same static access, we refer to
the reuse as self reuse; if they come from different accesses, we refer to it
as group reuse. The reuse is temporal if the same exact location is
referenced; it is spatial if the same cache line is referenced.
Example 11.20 : Consider
the following loop nest:
Accesses Z[j], Z[j + 1], and Z[j + 2] each have
self-spatial reuse because con-secutive iterations of the same access refer to
contiguous array elements. Pre-sumably contiguous elements are very likely to
reside on the same cache line. In addition, they all have self-temporal reuse,
since the exact elements are used over and over again in each iteration in the
outer loop. In addition, they all have the same coefficient matrix, and thus
have group reuse. There is group reuse, both temporal and spatial, between the
different accesses. Although there are 4
n 2 accesses in this code, if the reuse can be exploited, we only need to bring
in about n / c cache lines into the cache, where c is the number of words in a
cache line. We drop a factor of n due to
self-spatial reuse, a factor of c to due to spatial locality, and finally a
factor of 4 due to group reuse.
In the following, we show how we can use linear algebra to extract
the reuse information from affine array accesses. We are interested in not just
finding how much potential savings there are, but also which iterations are
reusing the data so that we can try to move them close together to exploit the
reuse.
2. Self Reuse
There can be substantial savings in memory accesses by exploiting
self reuse. If the data referenced by a static access has k dimensions
and the access is nested in a loop d deep, for some d > k,
then the same data can be reused nd~k times, where n is the number of iterations
in each loop. For example, if a 3-deep loop nest accesses one column of an
array, then there is a potential savings factor of n2 accesses. It turns out that the
dimensionality of an access corresponds to the concept of the rank of the coefficient matrix in the access,
and we can find which iterations refer to the same location by finding the null
space of the matrix, as explained below.
Rank of a Matrix
The rank of a matrix F is the largest number of columns (or
equivalently, rows) of F that are linearly independent. A set of vectors
is linearly independent if none of the vectors can be written as a
linear combination of finitely many other vectors in the set.
Notice that the second row is the sum of the first and third rows,
while the fourth row is the third row minus twice the first row. However, the
first and third rows are linearly independent; neither is a multiple of the
other. Thus, the rank of the matrix is 2.
We could also draw this conclusion by examining the columns. The
third column is twice the second column minus the first column. On the other
hand, any two columns are linearly independent. Again, we conclude that the
rank is 2.
Example 1 1 . 2 2 : Let us look at the array accesses in
Fig. 11.18. The first access, X[i — 1], has dimension 1, because the
rank of the matrix [1 0] is 1. That is, the one row is linearly independent, as
is the first column.
has two independent rows (and therefore two independent columns,
of course). The third access, Y[j, j + 1], is of dimension 1, because
the matrix
has rank 1. Note that the two rows are identical, so only one is
linearly in-dependent. Equivalently, the first column is 0 times the second
column, so the columns are not independent. Intuitively, in a large, square
array Y, the only elements accessed lie along a one-dimensional line,
just above the main diagonal.
The fourth access, Y[l,2] has dimension 0, because a matrix
of all 0's has rank 0. Note that for such a matrix, we cannot find a linear sum
of even one row that is nonzero. Finally, the last access, Z[i,i,2 * i + j],
has dimension 2. Note that in the matrix for this access
the last two rows are linearly independent; neither is a multiple
of the other.
However, the first row is a linear "sum" of the other
two rows, with both coefficients 0.
Null Space of a Matrix
A reference in a d-deep loop nest with rank r accesses 0(nr) data
elements in 0(nd) iterations, so on average, 0(nd~r) iterations must refer to
the same array element. Which iterations access the same data? Suppose an
access in this loop nest is represented by matrix-vector combination F and f.
Let i and i' be two iterations that refer to the same array element. Then Fi +
f = Fi' + f.
Rearranging terms, we get
There is a well-known concept from linear algebra that
characterizes when i and i' satisfy the above equation. The set of all
solutions to the equation Fv = 0 is called the null space of F. Thus, two
iterations refer to the same array element if the difference of their
loop-index vectors belongs to the null space of matrix F.
It is easy to see that the null vector, v = 0, always satisfies Fv
= 0. That is, two iterations surely refer to the same array element if their
difference is 0; in other words, if they are really the same iteration. Also, the
null space is truly a vector space. That is, if F v i = 0 and F v 2 = 0, then
F(vi + v 2 ) = 0 and F(cvi) = 0.
If the matrix F is fully ranked, that is, its rank
is d, then the null space of F consists of only the null vector.
In that case, iterations in a loop nest all refer to different data. In
general, the dimension of the null space, also known as the nullity, is
d — r. If d > r, then for each element there is a (d —
r)-dimensional space of iterations that access that element.
The null space can be represented by its basis vectors. A
^-dimensional null space is represented by k independent vectors; any
vector that can be expressed as a linear combination of the basis vectors
belongs to the null space.
Example 11 . 23 : Let us reconsider the
matrix of Example 11.21:
We determined in that example that the rank of the matrix is 2;
thus the nullity is 3 - 2 = 1. To find a basis for the null space, which in
this case must be a single nonzero vector of length 3, we may suppose a vector
in the null space to be [x,y,z] and try to solve the equation
If we multiply the first two rows by the vector of unknowns, we
get the two equations
We could write the equations that come from the third and fourth
rows as well, but because there are no three linearly independent rows, we know
that the additional equations add no new constraints on x, y, and z. For
instance, the equation we get from the third row, Ax + 5y + Qz = 0 can be
obtained by subtracting the first equation from the second.
We must eliminate as many variables as we can from the above
equations. Start by using the first equation to solve for x; that is, x — —2y —
3z. Then substitute for x in the second equation, to get — 3y = 6z, or y = —2z.
Since x = —2y — 3z, and y = —2z: it
follows that x = z. Thus, the vector [x,y,z] is really [z, —2z,z]. We may pick
any nonzero value of z to form the one and only basis vector for the null
space. For example, we may choose z = 1 and use [1, —2,1] as the basis of the
null space.
Example 1
1 . 2 4 : The rank, nullity, and null space for each of the references in
Example 11.17 are shown in Fig. 11.19. Observe that the sum of the rank and
nullity in all the cases is the depth of the loop nest, 2. Since the accesses
and Z[l,
i, 2 * i + j] have a rank of 2, all iterations refer to different
locations. Accesses X[i — 1] and Y[j, j + both have rank-1 matrices, so 0(n)
iterations refer to the same location.
In the former case, entire rows in the iteration space refer to the same
location. In other words, iterations
that differ only in the j dimension share the same location, which is
succinctly represented by the basis of the null space, [0,1]. For Y[j,j + 1], entire columns in the iteration space refer
to the same location, and this fact is succinctly represented by the basis of
the null space, [1,0].
Finally,
the access F[l,2] refers to the same location in all the iterations. The null
space corresponding has 2 basis vectors, [0,1], [1,0], meaning that all pairs
of iterations in the loop nest refer to exactly the same location. •
3. Self-Spatial
Reuse
The
analysis
of spatial reuse depends on the data layout of the matrix. C matrices are laid
out in row-major order and Fortran matrices are laid out in column-major order.
In other words, array elements X[iJ] and X[i,j + 1] are
contiguous in C and X[i, j] and
X[i + 1, j] are contiguous in Fortran. Without loss of generality, in the rest
of the chapter, we shall adopt the C (row-major) array layout.
As a first approximation, we
consider two array elements to share the same cache line if and only if they
share the same row in a two-dimensional array. More generally, in an array of d
dimensions, we take array elements to share a cache line if they differ only in
the last dimension. Since for a typical array and cache, many array elements
can fit in one cache line, there is significant speedup to be had by accessing
an entire row in order, even though, strictly speaking, we occasionally have to
wait to load a new cache line.
The trick to discovering and
taking advantage of self-spatial reuse is to drop the last row from the
coefficient matrix F. If the resulting truncated matrix has rank
that is less than the depth of the loop nest, then we can assure spatial
locality by making sure that the innermost loop varies only the last coordinate
of the array.
Example 1 1 . 2 5 :
Consider the last access, Z[l,
i,2* i+ j], in Fig. 1 1 . 1 9 . If we delete the
last row, we are left with the truncated matrix
The rank of this
matrix is evidently 1, and since the loop nest has depth 2, there is the
opportunity for spatial reuse. In this case, since j is the inner-loop
index, the inner loop visits contiguous elements of the array Z stored
in row-major order. Making i the inner-loop index will not yield spatial
locality, since as i changes, both the second and third dimensions
change. •
The general rule for
determining whether there is self-spatial reuse is as follows. As always, we
assume that the loop indexes correspond to columns of the coefficient matrix in
order, with the outermost loop first, and the innermost loop last. Then in
order for there to be spatial reuse, the vector [ 0 , 0 , . . . , 0,1] must be
in the null space of the truncated matrix. The reason is that if this vector is
in the null space, then when we fix all loop indexes but the innermost one, we
know that all dynamic accesses during one run through the inner loop vary in
only the last array index. If the array is stored in row-major order, then
these elements are all near one another, perhaps in the same cache line.
Example 11 . 26 : Note that [0,1] (transposed as a column vector) is in the null space of the truncated matrix of Example 11.25. Thus, as mentioned there, we expect that with j as the inner-loop index, there will be spatial locality. On the other hand, if we reverse the order of the loops, so i is the inner loop, then the coefficient matrix becomes
Now, [0,1] is
not in the null space of this matrix. Rather, the null space is generated by
the basis vector [1,0]. Thus, as we suggested in Example 11.25, we do not
expect spatial locality if i is the inner loop.
We should observe,
however, that the test for [ 0 , 0 , . . . , 0,1] being in the null space is
not quite sufficient to assure spatial locality. For instance, suppose the access
were not Z[l,i,2*i
+ j] but Z[l,i,2*i
+ 50*j]. Then, only every fiftieth element of Z would be accessed
during one run of the inner loop, and we would not reuse a cache line unless it
were long enough to hold more than 50 elements.
4. Group Reuse
We compute group reuse only among accesses in a loop sharing the
same coef-ficient matrix. Given two dynamic accesses Fii + fi and F i 2 + f2, reuse of the
same data requires that
Suppose v is one solution to this equation. Then if w is any vector in the null space of F i , w + v is also a solution, and in fact those are all the solutions to
the equation.
has two array accesses, Z[i,j] and Z[i — l,j].
Observe that these two accesses are both characterized by the coefficient
matrix
like the second access, Y"[i, j] in Fig. 11.19. This
matrix has rank 2, so there is no self-temporal reuse.
However, each access exhibits self-spatial reuse. As described in
Section 11.5.3, when we delete the bottom row of the matrix, we are left
with only the top row, [1,0], which has rank 1. Since [0,1]
is in the null space of this truncated matrix, we expect spatial reuse. As each
incrementation of inner-loop index j increases the second array index by
one, we in fact do access adjacent array elements, and will make maximum use of
each cache line.
Although there is no self-temporal reuse for either access,
observe that the two references Z[i,j] and Z[i — l,j] access almost
the same set of array elements. That is, there is group-temporal reuse because
the data read by access Z[i — l,j) is the same as the data written by
access Z[i,j], except for the case i = 1. This simple
pattern applies to the entire iteration space and can be exploited to improve
data locality in the code. Formally, discounting the loop bounds, the two
accesses Z[i,j] and
Z[i- l,j] refer to the same
location in iterations (k,ji) and (%2, j
2 ) , respectively, provided
That is, ji = J2 and i2 =
k + 1.
Notice that the reuse occurs along the i-axis the of the iteration
space. That is, inner iteration (kih) occurs n iterations (of the loop) after
the iteration
Thus, many iterations are executed before the data written is
reused.
This data may or may not still be in the cache. If the cache manages to hold two consecutive
rows of matrix Z, then access Z[i — does not miss in the cache, and the total
number of cache misses for the entire loop nest is n 2 / c , where c is the
number of elements per cache line. Otherwise, there will be twice as many
misses, since both static accesses require a new cache line for each c dynamic
accesses.
Example 11 . 28 : Suppose
there are two accesses
in a 3-deep loop nest, with indexes i, j, and k,
from the outer to the inner loop. Then two accesses ii = [i1,ji,k1] and
i2 = [^2,^2,^2] reuse the same element whenever
One solution to this equation for a vector v = [i1 — i 2 , ji — j
2 , &i — &2] is v = [1, —1,0]; that is, i1 22 + 1, Ji = ji — 1, and k1
= &2 .4 However, the null space of the matrix
is generated by the basis vector [0,0,1]; that is, the third loop
index, k, can be arbitrary. Thus, v, the solution to the above equation, is any
vector [1, — l , m ] for some m. Put
another way, a dynamic access to A[i,j,i + j], in a loop nest with indexes i,
j, and is reused not only by other dynamic accesses A[i, j, i+j] with the same
values of i and j and a different value of k, but also by dynamic accesses A[i
+ 1, j — 1, i + j] with loop index values i + 1, j — 1, and any value of k.
Although
we shall not do so here, we can reason about group-spatial reuse analogously.
As per the discussion of self-spatial reuse, we simply drop the last dimension
from consideration.
The
extent of reuse is different for the different categories of reuse.
Self-temporal reuse gives the most benefit: a reference with a ^-dimensional
null space reuses the same data 0(nk) times. The extent of
self-spatial reuse is limited by the length of the cache line. Finally, the
extent of group reuse is limited by the number of references in a group sharing
the reuse.
5. Exercises for Section 11.5
Exercise 1 1 . 5 . 1 : Compute the ranks of each of the matrices in Fig. 11.20.
Give both a maximal set of linearly dependent columns and a
maximal set of linearly dependent rows.
Exercise 11.5.2 : Find a basis
for the null space of each matrix in Fig. 11.20.
Exercise 11 . 5 . 3: Assume that
the iteration space has dimensions (variables) i, j, and k. For each of the
accesses below, describe the subspaces that refer to the following
single elements of the array:
Exercise 11 . 5 . 4: Suppose array
A is stored in
row-major order and accessed inside the following
loop nest:
Indicate for each of
the following accesses whether it is possible to rewrite the loops so that the
access to A exhibits self-spatial reuse; that is, entire cache lines are
used consecutively. Show how to rewrite the loops, if so. Note: the rewriting
of the loops may involve both reordering and introduction of new loop indexes.
However, you may not change the layout of the array, e.g., by changing it to
column-major order. Also note: in general, reordering of loop indexes may be
legal or illegal, depending on criteria we develop in the next section.
However, in this case, where the effect of each access is simply to set an
array element to 0, you do not
have to worry about the effect of reordering loops as far as the semantics of
the program is concerned.
Exercise 1 1 . 5 . 5
: In Section 11.5.3 we commented that we get spatial locality if the innermost loop varies only as the last
coordinate of an array access. However,
that assertion depended on our assumption that the array was stored in row-major order. What condition would
assure spatial locality if the array
were stored in column-major order?
Exercise 1 1 . 5 . 6 : In Example 11.28 we
observed that the the existence of reuse
between two similar accesses depended heavily on the particular
expressions for the coordinates of the
array. Generalize our observation there to determine for which functions f(i,j) there is reuse
between the accesses A[i,j,i + j] and
A[i + l,j-l,f(iJ)}.
Exercise 1 1 . 5 . 7 : In Example 11.27 we
suggested that there will be more cache
misses than necessary if rows of the matrix Z are so long that they do
not fit in the cache. If that is the case, how could you rewrite the loop nest
in order to Guarantee group-spatial reuse?
Related Topics
Privacy Policy, Terms and Conditions, DMCA Policy and Compliant
Copyright © 2018-2023 BrainKart.com; All Rights Reserved. Developed by Therithal info, Chennai.