EURASIP Journal on Applied Signal Processing 2002:9, 926-935. © 2002 Hindawi Publishing Corporation

Partitioning and Scheduling DSP Applications with Maximal Memory Access Hiding

This paper presents an iteration space partitioning scheme to reduce the CPU idle time due to long memory access latency. We take into consideration the accesses of both intermediate and initial data. An algorithm is proposed to find the largest overlap for initial data to reduce the entire memory traffic. In order to efficiently hide the memory latency, another algorithm is developed to balance the ALU and memory schedules. Experiments on DSP benchmarks show that the algorithms significantly outperform known existing methods.


INTRODUCTION
Contemporary DSP and embedded systems typically contain a memory hierarchy, which can be categorized into on-chip and off-chip memories. In general, the on-chip memory is fast but restricted in size, while the off-chip memory is much slower but larger. For the CPU's computation, data need to be loaded from the off-chip to the on-chip memory. Thus, system performance is degraded by this long off-chip access latency. How to tolerate the memory latency within a memory hierarchy is becoming a more and more important problem [1]. The on-chip and off-chip memories are abstracted as the first and second level memories, respectively, in this paper.
Prefetching [1, 2, 3, 4, 5] is a technique to fetch data from memory in advance of the corresponding computations. It can be used to hide the memory latency. On the other hand, software pipelining [6] and modulo scheduling [7, 8] are scheduling techniques used to exploit the parallelism in a loop. Both prefetching and scheduling techniques can be used to accelerate execution. However, these traditional techniques have weaknesses [9] and cannot efficiently solve the problem mentioned in the first paragraph. This paper combines the software pipelining technique with the data prefetching approach. Multiple memory units, attached to the first level memory, perform operations to prefetch data from the second to the first level memory. These memory units are in charge of preparing, in the first level memory and in advance of computation, all data required by the computation. Multiple ALU units exist in the processor for the computation. The ALU schedule is optimized by using the software pipelining technique under the resource constraints. The operations in the ALU units and memory units execute simultaneously. Therefore, the long memory access latency is tolerated by overlapping the data fetching operations with the ALU operations. Although using computation to hide memory latency has been studied extensively, trying to balance the computation and the memory loading has, to the authors' knowledge, never been researched thoroughly. This paper presents an approach to balance the ALU and memory schedules to achieve an optimal overall schedule length.
The data to be prefetched can be classified into two groups, intermediate and initial data. Intermediate data can serve as both left and right operands in the equations; their values vary during the computation. On the contrary, initial data can only serve as right operands in the equations; they maintain their values during the computation. Taking the equations in (1) as an example, the arrays B and C can be regarded as intermediate data and A as initial data. The influence of both kinds of data should be considered in order to obtain an optimal overall schedule.
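The distinction can be illustrated with a small nested loop. The loop below is an illustrative sketch in our own notation, not the paper's equation (1): B and C are intermediate data (they appear on both sides of assignments and change during the computation), while A is initial data (read-only throughout).

```python
# Illustrative nested loop (not the paper's equation (1)): B and C are
# intermediate data (written and read), A is initial data (read-only).
def run(A, n):
    B = [[0] * n for _ in range(n)]
    C = [[0] * n for _ in range(n)]
    for i in range(1, n):
        for j in range(1, n):
            # B appears on both sides: intermediate data whose values change.
            B[i][j] = B[i - 1][j] + A[i][j]
            # C is also intermediate; A only ever appears on the right.
            C[i][j] = B[i][j - 1] * A[i - 1][j]
    return B, C
```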
To make full use of the data locality, the entire iteration space can be divided into small blocks named partitions. Much work has been done on partitioning techniques. Loop tiling [10, 11] is a technique used to group basic computations so as to increase computation granularity and thereby reduce communication time. Generally, these methods have no detailed schedule of ALU and memory operations as our method does. Moreover, only intermediate data are taken into consideration. Agarwal and Kranz [12] make an extensive study of data partitioning. They use an approximation method to find a good partition that minimizes the data transfer among different processors. Affine reference indices are considered in their work. However, they mainly concentrate on the initial data and give little consideration to the intermediate data.
The approaches in [9, 13] are among the few that consider a detailed schedule under a memory hierarchy. Nevertheless, their memory references consider only the intermediate data and ignore the initial data, which are an important factor in performance. From the experimental results in Section 5, we can see that this deficiency leads to an unbalanced, and hence worse, schedule.
In our approach, both the intermediate and initial data are considered. For the intermediate data, we restrict our study to nested loops with uniform data dependencies. The study of uniform loop nests is justified by the fact that most general linear recurrence equations can be transformed into a uniform form. This transformation (uniformization [14]) greatly reduces the complexity of the problem. On the other hand, it is difficult to apply uniformization to the initial data; therefore, affine reference indices are considered for them. The concept of a footprint [12] is used to denote the initial data needed for the computation of the ALU units in one partition. Given a partition shape, this paper presents an algorithm to find a partition size which gives rise to the maximum overlap between adjacent overall footprints, such that the number of memory operations is reduced to the largest extent.
When considering the schedule of the loop, we propose detailed ALU and memory schedules. Each memory and ALU operation is assigned to an available hardware unit and time slot. Therefore, it is very convenient to apply our technique in a compiler. The memory schedule is balanced against the ALU schedule such that the overall schedule is close to the lower bound, which is determined by the ALU schedule. Our method gives the algorithm to determine the partition shape and size in order to achieve balanced ALU and memory schedules. Finally, the memory requirement of our technique is also presented.
The new algorithm in this paper significantly exceeds the performance of existing algorithms [9, 13] because it optimizes both the ALU and memory schedules and considers the influence of initial data. Taking the wave digital filter as an example, in a standard system with 4 ALU units and 4 memory units, and assuming 3 initial data references in each iteration, our algorithm obtains an average schedule length of 4.018 CPU clock cycles, which is very close to the theoretical lower bound of 4 clock cycles. Traditional list scheduling needs 22 clock cycles. Hardware prefetching costs 10 clock cycles. While the PSP algorithm in [13] achieves some improvement, it still needs 8 clock cycles. Without the memory constraint, the algorithm in [9] has the same performance, 8 clock cycles. Our algorithm improves on all the previous approaches.
It is worthwhile to mention that some work has been done on data layout techniques [15, 16], which are used to maintain cache coherency and reduce conflict traffic. Our work should be regarded as a different layer that can be built upon the data layout layer to obtain better performance.
The remainder of this paper is organized as follows. Section 2 introduces the terms and basic concepts used in the paper. Section 3 presents the theory on initial data. Section 4 describes the algorithm to find the detailed schedule. Section 5 contains the experimental comparison of this technique with a number of existing approaches. We conclude in Section 6.

BACKGROUND
We can represent the operations in a loop by a multidimensional data flow graph (MDFG) [6]. Each node in the MDFG represents a computation. Each edge denotes the data dependence between two computations, with its weight as the distance vector. The benefit of using an MDFG instead of the general data dependence graph (DDG) or statement dependence graph (SDG) is that the MDFG is a finer-grained description of data dependences. Each node of an MDFG corresponds to one ALU computation. In contrast, a node in a DDG or SDG corresponds to a statement, which consumes an uncertain amount of ALU computation time depending on the complexity of the statement. It is more convenient to schedule the ALU operations with an MDFG. Moreover, many DSP applications, such as DSP filters, can be directly mapped into MDFGs [17].
The execution of all nodes in an MDFG one time is an iteration; it corresponds to executing the loop body once under a certain loop index. Iterations are identified by a vector i, equivalent to a multidimensional index.
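As a concrete, minimal sketch (the data layout and names are ours, not a published API), an MDFG can be held as a set of nodes and a set of edges annotated with delay vectors:

```python
# A minimal sketch of a multi-dimensional data flow graph: nodes are ALU
# computations, edges carry multi-dimensional delay (distance) vectors.
mdfg = {
    "nodes": ["mul", "add"],
    # Edge (u, v, d): v uses the value produced by u, d iterations earlier;
    # d = (0, 0) means a dependence inside the same iteration.
    "edges": [
        ("mul", "add", (0, 0)),   # intra-iteration dependence
        ("add", "mul", (1, 0)),   # inter-iteration dependence along x
        ("add", "add", (1, 1)),   # diagonal inter-iteration dependence
    ],
}

# Delay vectors with a nonzero component are the inter-iteration
# dependences that constrain legal partition shapes.
inter_iteration = [d for (_, _, d) in mdfg["edges"] if d != (0, 0)]
```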
In this paper, we illustrate our ideas with two-dimensional loops. It is not difficult to extend the approach to loops with more than two dimensions by using the same ideas.

Architecture model
The technique in this paper is designed for a system with one or more processors. These processors share a common memory hierarchy, as shown in Figure 1. There are multiple ALU and memory units in the system. The access time for the first level memory is significantly less than for the second level memory, as in current systems. During a program's execution, if an instruction requires data that are not in the first level memory, the processor has to fetch the data from the second level memory, which costs much more time. Thus, prefetching data into the first level memory before their explicit use can minimize the overall execution time. Two types of memory operations, prefetch and keep, are supported by the memory units. The prefetch operation prefetches data from the second level to the first level memory; the keep operation keeps data in the first level memory for the execution of one partition. Both are issued to guarantee that data referenced in the near future appear in the first level memory before their references. It is important to note that the first level memory in this model cannot be regarded as a pure cache, because we do not consider cache associativity. In other words, it can be thought of as a fully associative cache.

Partitioning the iteration space
Regular execution of nested loops proceeds in either a row-wise or column-wise manner until the boundary of the iteration space is reached. However, this mode of execution does not take full advantage of either the locality of reference or the available parallelism. The execution of such structures can be made more efficient by dividing the entire iteration space into regions called partitions that better exploit spatial locality.
Provided that the total iteration space is divided into partitions of iterations, the execution sequence is determined partition by partition. Call the partition in which the loop is currently executing the current partition. The next partition is the partition adjacent on the right side of the current partition along the x-axis. The other partitions are all partitions except these two. Based on this classification, different memory operations are assigned to different data in a partition. For a delay dependency that goes into the next partition, a keep memory operation is used to keep the data in the first level memory for one partition, since the data will be reused immediately in the next partition. Delay dependencies that go into other partitions result in prefetch memory operations to fetch the data in advance.
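This classification can be sketched as a small decision function. The sketch below assumes, for simplicity, rectangular partitions with Px = (1, 0) and Py = (0, 1) of size (fx, fy); the function name and signature are ours:

```python
# A simplified sketch: given a delay vector d leaving iteration (x, y)
# inside the current partition, decide which memory operation its datum
# needs. Assumes rectangular partitions, Px = (1, 0), Py = (0, 1).
def memory_op(x, y, d, fx, fy):
    """(x, y): the iteration's offset inside the current partition,
    0 <= x < fx and 0 <= y < fy; d = (dx, dy): the delay vector."""
    px, py = (x + d[0]) // fx, (y + d[1]) // fy   # partition displacement
    if (px, py) == (0, 0):
        return "none"        # stays inside the current partition
    if (px, py) == (1, 0):
        return "keep"        # reused immediately in the next partition
    return "prefetch"        # reused in some other, later partition
```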
A partition is determined by its partition shape and partition size. We use two basic vectors (in a basic vector, each element is an integer and all the elements have no common factor other than 1), P_x and P_y, to identify a parallelogram as the partition shape. These two basic vectors are called partition vectors. Assume, without loss of generality, that the angle between P_x and P_y is less than 180° and that P_x is clockwise of P_y. The partition size is determined by the vector S = (f_x, f_y), where f_x and f_y are the multiples of the partition size over the partition vectors P_x and P_y, respectively. Thus, the partition is delimited by the two vectors f_x P_x and f_y P_y.
How to find the optimal partition size will be discussed in Section 4. Due to the dependencies between the iterations, P_x and P_y cannot be chosen arbitrarily. The following property gives the condition for a legal partition shape [9].

Property 1. A pair of partition vectors is legal if, for each delay vector d_e, the following cross-product relations hold: d_e × P_x ≤ 0 and d_e × P_y ≥ 0.
Because nested loops must follow the lexicographical order, we can choose (1, 0) as our P_x vector and use the normalized leftmost vector of all delay dependencies as our P_y. The partition shape is determined by these two vectors.
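The legality condition of Property 1 is straightforward to check. A minimal sketch (function names are ours) for two-dimensional integer vectors:

```python
# Legality check for a partition shape (Property 1) on 2-D integer vectors.
def cross(a, b):
    """z component of the 2-D cross product a x b."""
    return a[0] * b[1] - a[1] * b[0]

def legal_partition(delays, Px, Py):
    """A pair of partition vectors (Px, Py) is legal iff, for every delay
    vector d, d x Px <= 0 and d x Py >= 0."""
    return all(cross(d, Px) <= 0 and cross(d, Py) >= 0 for d in delays)
```

For instance, with delay vectors {(1, 0), (1, 1), (2, 1)}, the pair Px = (1, 0) and Py = (1, 1) (the leftmost delay vector) satisfies both inequalities.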
An overall schedule consists of two parts, an ALU part and a memory part, as seen in Figure 2. The ALU part schedules the ALU computation. We know that the computation in a loop can be represented by an MDFG; the ALU part is a schedule of these MDFG nodes. The memory part schedules the memory operations, prefetch and keep, so that the data for the computation can always be found in the first level memory.

THE THEORY ABOUT INITIAL DATA
The overall footprint of one partition consists of all the initial data needed by that partition's computation. Provided the execution follows the partition sequence, the initial data needed by the current partition's computation have been prefetched into the first level memory during the previous partition's execution. Likewise, the initial data needed by the next partition's execution are prefetched by the memory units during the current partition's execution. The data in the overlap between the overall footprints of the current and next partitions are already in the first level memory, so their prefetch operations can be spared. Thus, the major concern for the initial data is how to maximize the overlap between the overall footprints of two consecutively executed partitions, to reduce the memory traffic.
As mentioned in Section 1, we consider affine references for the initial data. Given a loop index vector i, an affine reference index can be expressed as g(i) = iG + a, where G is a 2 × 2 matrix and a is the offset vector. The footprint with respect to a reference A[g_1(i)] is the set of all data elements A[g_1(i)] of A, for i an element of the partition. The overall footprint is the union of the footprints with respect to all different references. For example, in Figure 3, the partition is a rectangle of size 3 × 4. The initial data references are B(i + j, i − j) and B(2i + j, i − 2j). Their corresponding footprints are denoted by the integer points marked × and •, respectively. The overall footprint is the union of these two footprints.
In [12], Agarwal presents the concept of uniformly generated references. Two references A[g_1(i)] and A[g_2(i)] are said to be uniformly generated if they share the same linear part, that is, g_1(i) = iG + a_1 and g_2(i) = iG + a_2 for the same matrix G. If two references B_1 and B_2 are not uniformly generated, the overlap between the footprint with respect to B_1 of the current partition and that with respect to B_2 of the next partition can be ignored, because the overlap, if it exists, diminishes rapidly. Therefore, we need only consider the overlap between footprints with respect to uniformly generated references of two consecutive partitions. Moreover, the offset vector a should satisfy a = mG_1 + nG_2, where m and n are integer constants and G_1 and G_2 are the rows of G. Otherwise, no overlap between the footprints of consecutive partitions will exist even for uniformly generated references.
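Both conditions are easy to test mechanically. Below is a hedged sketch (our own helper names; a reference is modeled as a (G, a) pair): two references are uniformly generated when their G matrices coincide, and an offset lies on the lattice spanned by the rows of G exactly when the 2 × 2 integer system a = mG_1 + nG_2 has an integer solution:

```python
from fractions import Fraction

# Sketch: a reference is a (G, a) pair, G a 2x2 matrix (list of two rows),
# a an offset vector.
def uniformly_generated(ref1, ref2):
    """Uniformly generated: identical linear part G."""
    return ref1[0] == ref2[0]

def offset_in_lattice(G, a):
    """Check a = m*G[0] + n*G[1] for integer m, n (G assumed nonsingular),
    i.e., the offset lies on the lattice spanned by the rows of G."""
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    # Cramer's rule on the transposed system (m, n) G = a.
    m = Fraction(a[0] * G[1][1] - a[1] * G[1][0], det)
    n = Fraction(a[1] * G[0][0] - a[0] * G[0][1], det)
    return m.denominator == 1 and n.denominator == 1
```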
The memory requirement should be taken into account when trying to maximize the overlap. The partition size cannot be enlarged arbitrarily only for the sake of increasing overlap: a larger partition means a larger overall footprint, that is, much more memory space consumed. Therefore, given a partition shape and a set of uniformly generated references, we derive conditions on the partition size which should be met to achieve a reasonable maximal overlap. For convenience of description, we introduce the following notation.

Definition 1. (1) Assuming the partition size is S, f(a, S) is the footprint with respect to the reference with offset vector a of the current partition, and f(a', S) is the footprint with respect to the reference with offset a' of the next partition.
(2) Given a set of uniformly generated references, R = {a_1, a_2, ..., a_n} is the set of their offset vectors. Assuming the partition size is S, F(R, S) is the overall footprint of the current partition and F(R', S) is the overall footprint of the next partition.
The one-dimensional case can be regarded as a simplification of the two-dimensional problem in which f_y is always set to zero. It provides the theoretical foundation for the two-dimensional problem. In one dimension, a partition reduces to a line segment and all vectors reduce to integers. The partition size can be thought of as the length of the line segment. We use an example to demonstrate the problem we are tackling. In Figure 4, there are three different offset vectors: 1, 2, 7. The solid lines represent the overall footprint of the current partition, and the dotted lines denote that of the next partition. We need to find the condition on the partition size, that is, the length of the line segment, to achieve a maximal overlap. The figure shows the case when the length equals 5, which is the minimum length that obtains the maximum overlap between the overall footprints.
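The one-dimensional example can be checked by brute force. In the sketch below (our own helper names), the overall footprint of a partition of size S starting at 0 is the union of the segments [a, a + S) over all offsets a, and the next partition's footprint is the same union shifted by S:

```python
# 1-D version of the overlap problem: with sorted offsets r_1 <= ... <= r_n,
# the minimum partition size that maximizes the overlap between consecutive
# overall footprints is the largest gap between adjacent offsets.
def min_size_1d(offsets):
    r = sorted(offsets)
    return max(b - a for a, b in zip(r, r[1:]))

def overlap_1d(offsets, S):
    """Brute-force overlap between the overall footprints of the current
    partition [0, S) and the next partition [S, 2S)."""
    cur = {a + i for a in offsets for i in range(S)}
    nxt = {a + S + i for a in offsets for i in range(S)}
    return len(cur & nxt)
```

For the offsets {1, 2, 7} of Figure 4, min_size_1d gives 5; increasing S beyond 5 does not grow the overlap, while S = 4 yields a strictly smaller one.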
In order to derive the theorem on the minimum value of S which generates the maximum overlap, we first give the following lemmas. They consider the overlap of two footprints of the consecutive partitions, as shown in Figure 5. The solid line is the footprint of the current partition and the dotted line is the footprint of the next partition.
Lemma 1. The minimum S that maximizes the intersection between f(a_1, S) and f(a_2, S), where a_2 ≥ a_1, is S = a_2 − a_1.
Proof. According to the relation between (a_1 + S) and a_2, there are two different cases.

Case 1. As shown in Figure 5a, a_1 + S ≤ a_2, that is, S ≤ a_2 − a_1. The intersection grows as S increases, reaching its maximum when S = a_2 − a_1.

Case 2. As shown in Figure 5b, a_1 + S > a_2, that is, S > a_2 − a_1. The intersection of the two segments is (a_1 + S, a_2 + S − 1); its size, a_2 − a_1, has no relation to S. This means the size of the intersection will not increase in spite of the increment of S.

Lemma 2. The intersection between f(a_1, S) and f(a_2, S), where a_2 ≥ a_1, stays constant, irrespective of the value of S, as long as S ≥ a_2 − a_1.

According to Definition 1, F(R, S) and F(R', S) can be expressed as the unions F(R, S) = f(a_1, S) ∪ · · · ∪ f(a_n, S) and F(R', S) = f(a'_1, S) ∪ · · · ∪ f(a'_n, S). The following lemma gives the expression of their intersection.

Lemma 3. Let C_m be the intersection f(a_m, S) ∩ f(a'_{m−1}, S). Then the intersection of F(R, S) and F(R', S) is the union of C_m for m = 2, ..., n, where n is the number of integers in R.
Proof. Let A_m denote f(r_m, S) and B_m denote f(r'_m, S).

Basis step. Let n = 2. Then F(R, S) = A_1 ∪ A_2 and F(R', S) = B_1 ∪ B_2. The ending point of A_1 is less than the starting points of B_1 and B_2, and the starting point of B_2 is greater than the ending points of A_1 and A_2. Thus, the only possible intersection is A_2 ∩ B_1 = C_2.

Induction hypothesis. Assume that, for some n ≥ 2, the intersection of F(R, S) and F(R', S) is the union of C_m for m = 2, ..., n. There are two different cases.
Theorem 1. Let r_1 ≤ r_2 ≤ · · · ≤ r_n be the offsets in R sorted in increasing order. The minimum S that maximizes the overlap between F(R, S) and F(R', S) is S = max over 2 ≤ m ≤ n of (r_m − r_{m−1}).

Proof. Consider two adjacent segments C_m and C_{m−1}. There is no common element between B_{m−1} and A_{m−1}, and hence none between C_m and C_{m−1}. According to Lemmas 1 and 2, any value S ≥ r_m − r_{m−1} makes segment C_m largest. Moreover, the segments C_m do not intersect each other. Therefore, the theorem is correct.
From Theorem 1 and Lemma 2, we can directly derive the following theorem.

Theorem 2. For the overall footprints F(R, S) and F(R', S), their overlap remains constant if the value of S continues to increase from the S value obtained by Theorem 1.
To maximize the overlap between F(R, S) and F(R', S) in two-dimensional space, we find that the f_y element of the partition size is not as important as the f_x element, since the intersection always increases when f_y is enlarged. We will determine the value of f_y based on other conditions. Therefore, the key question is: what is the minimum value of f_x that makes the intersection maximum, given a certain f_y?
Next, we discuss the situation where G is the two-dimensional identity matrix. If G is not an identity matrix, the same idea can be applied as long as a = mG_1 + nG_2; the only difference is that the original XY-space is transformed to a new space by the matrix G. An augmented set R* can be obtained from a given partition size S and the set R as follows: a*_i = a_i, and a*_{i+n} = a_i + (0, f_y P_y·y), where n is the size of the set R and P_y = (P_y·x, P_y·y). Arranging all the points in the set R* in increasing order of the y element, the overall footprint of one partition can be divided into a series of stripes. Each stripe is bounded by the two horizontal lines that pass through two adjacent sorted points of R*. For instance, in Figure 6, the set R is {(0, 0), (6, 1), (3, 2), (1, 3)}. Assume the value of f_y P_y·y is 5; then the augmented set R* is {(0, 0), (0, 5), (6, 1), (6, 6), (3, 2), (3, 7), (1, 3), (1, 8)}. After sorting, it becomes {(0, 0), (6, 1), (3, 2), (1, 3), (0, 5), (6, 6), (3, 7), (1, 8)}. The overall footprint consists of 7 stripes, as indicated in Figure 6.
In each stripe, a horizontal line intersects the left bounds of some footprints f(a, S). Thus, the two-dimensional intersection problem of this stripe can be reduced to the one-dimensional problem, which can be solved using Theorem 1. Applying this idea to each stripe, we can solve the two-dimensional overlap problem, as demonstrated in Algorithm 1. The algorithm is obviously a polynomial-time algorithm, whose time complexity is O(n^2).
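A per-stripe application of Theorem 1 can be sketched as follows. The sketch assumes G is the identity matrix and Py = (0, 1), so the footprint of offset a spans y in [a.y, a.y + fy); the function name and the stripe bookkeeping are ours:

```python
# Sketch of the stripe idea: event points on the y-axis split the footprint
# into stripes; in each stripe, the active footprints reduce to the 1-D
# problem, where the minimum size is the largest gap between adjacent
# x-offsets (Theorem 1). Assumes G = identity, Py = (0, 1).
def min_fx(R, fy):
    events = sorted({a[1] for a in R} | {a[1] + fy for a in R})
    need = 0
    for lo, hi in zip(events, events[1:]):
        # footprints whose y-range covers this whole stripe
        xs = sorted(a[0] for a in R if a[1] <= lo and hi <= a[1] + fy)
        if len(xs) > 1:
            need = max(need, max(b - a for a, b in zip(xs, xs[1:])))
    return need
```

For the set R = {(0, 0), (6, 1), (3, 2), (1, 3)} of Figure 6 with fy = 5, the binding stripe is the one containing only the offsets (0, 0) and (6, 1), whose x-gap is 6.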

THE OVERALL SCHEDULE
The overall schedule can be divided into two parts, the ALU and memory schedules.

Algorithm 1: Calculating the minimum f_x to make the overlap maximum.
Input: the set R and the shape of the partition.
Output: the f_x that makes the overlap maximum under a certain f_y.
(2) Based on the set R and the partition shape, choose an f_y such that the product f_y (P_y·y) is larger than the difference between the largest and smallest y element over all vectors in the set R.
(3) Using the f_y above, generate the augmented set R*.
(4) Sort all the values in R* in increasing order of the y element and keep them in an event list.
(5) Use a horizontal line to sweep the whole iteration space. When an event point is met, insert the corresponding footprint f(a, S) into a visiting list if the event point is the lower bound of the footprint; otherwise, delete the corresponding f(a, S) from the list.
(6) Calculate the intersection points of this line with the left bound and right bound of each footprint in the visiting list, respectively. Use Theorem 1 to derive an f_x value that makes the intersection in the current stripe maximal.

For the ALU schedule, the multidimensional rotation scheduling algorithm [6] is used to generate a
static schedule for one iteration. Then the entire ALU schedule can be formed by simply replicating this schedule for each iteration in the partition. The schedule obtained in this way is the most compact schedule, since it considers only the ALU hardware resource constraints; the overall schedule length must be longer. Thus, this ALU schedule provides a lower bound for the overall schedule. This lower bound can be calculated as len_iteration × #nodes, where len_iteration represents the schedule length obtained by the multidimensional rotation scheduling algorithm for one iteration, and #nodes denotes the number of iteration nodes in one partition. Our objective is to find a partition whose overall schedule length is very close to this lower bound.

Balanced overall schedule
Different from the ALU schedule, the memory schedule is considered as an integrated whole for the entire partition. It consists of two parts: memory operations for initial data and memory operations for intermediate data. Each part consists of the prefetch and keep operations for the corresponding data. Because the prefetch operations have no relation to the current computation, they can be arranged from the beginning of the memory schedule. On the contrary, a keep operation for intermediate data can only be issued after the corresponding computation has finished. The keep operations for initial data can be issued as soon as the data have been prefetched. The memory schedule length is the sum of these two parts' lengths.
For the intermediate data, the calculation of the number of prefetch and keep operations can be found in [13]. The initial data can be prefetched in blocks: a block prefetch fetches several data at one time and costs only slightly more than a general prefetch operation. To calculate the number of such operations, we first make the following observation.
Property 2. As long as f_y P_y G_2, the projection of the footprint size along the direction G_2, is larger than the maximum difference of aG_2 over all offset vectors a in a uniformly generated set, the overall footprint increases at a constant rate with the increment of f_y, and so does the number of prefetch operations for initial data.
Note the requirement in the above property guarantees that the partition is large enough, such that the footprint with respect to an offset vector can intersect with the footprint with respect to all other offset vectors belonging to the same uniformly generated set.
Suppose that a two-dimensional vector is written as a = (a·x, a·y). Given a certain f_x, the number of prefetch operations for initial data, for any f_y satisfying the condition in the above property, is Pre_base_ini + (f_y − f_y0) × Pre_incr_ini, where f_y0 = ⌈y_0/((P_y G)·y)⌉, y_0 is the maximum difference of (aG)·y over all offset vectors, Pre_base_ini denotes the number of such operations for a partition of size f_x × f_y0, and Pre_incr_ini represents the increase in the number of prefetch operations when f_y is increased by one.
The keep operations for the initial data can be issued after the data have been prefetched. The number of such keep operations is Keep_base_ini + (f_y − f_y0) × Keep_incr_ini, where y_0 and f_y0 have the same meaning as above, Keep_base_ini denotes the number of keep operations for a partition of size f_x × f_y0, and Keep_incr_ini represents the increase in the number of keep operations when f_y is increased by one.
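The two counting formulas above are linear in f_y once f_y exceeds f_y0. A minimal sketch (the function and parameter names are ours, standing in for the Pre/Keep base and increment quantities of the text):

```python
# Linear counting model from the text: once fy exceeds fy0, the numbers of
# prefetch and keep operations for initial data grow linearly in fy.
def num_prefetch(fy, fy0, pre_base, pre_incr):
    return pre_base + (fy - fy0) * pre_incr

def num_keep(fy, fy0, keep_base, keep_incr):
    return keep_base + (fy - fy0) * keep_incr
```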
In order to understand what a good partition size is, we first need the definition of a balanced overall schedule; it also gives the balanced overall schedule requirement.

Definition 2. A balanced overall schedule is a schedule in which the memory schedule is at most one keep-operation time unit longer than the ALU schedule.
To reduce the computational complexity and simplify the analysis, we add a restriction: the partition size must be large enough that no data dependence spans more than two partitions.
(1) There is no delay dependency that spans more than two partitions along the y coordinate direction, that is, f_y > max{d_y/(P_y·y)}. (2) There is no delay dependency that spans more than two partitions along the x coordinate direction, that is, f_x > max{d_x − d_y (P_y·y/(P_y·x))}.
As long as these constraints on the minimal partition size are satisfied, the length of the prefetch and keep parts for intermediate data in the memory schedule increases more slowly than the ALU schedule length when the partition size is enlarged. If, at this point, no partition size can be found that meets the balanced overall schedule requirement, the length of the block prefetch part for initial data is increasing too fast. Due to the property of block prefetch, increasing f_x increases the number of block prefetches only by a small number, while increasing the ALU part by a relatively large length. Therefore, a partition size which satisfies the balanced overall schedule requirement can always be found. Algorithm 2 determines the partition size that obtains the balanced overall schedule.
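The enlarge-until-balanced search can be sketched as below. This is our own simplified sketch, not the paper's Algorithm 2: alu_len and mem_len are assumed cost models supplied by the caller, and the search simply grows the partition until the memory schedule is at most one keep-operation time longer than the ALU schedule:

```python
# Sketch of the balanced-schedule search: grow the partition until the
# memory schedule is at most one keep operation longer than the ALU
# schedule (the balanced overall schedule requirement).
def find_balanced_size(fx0, fy0, alu_len, mem_len, keep_time=1, limit=64):
    fx = fx0
    while fx <= limit:
        for fy in range(fy0, limit + 1):
            if mem_len(fx, fy) <= alu_len(fx, fy) + keep_time:
                return fx, fy
        fx += 1          # no feasible fy: enlarge fx and retry
    return None          # no balanced size within the search limit
```

Because the ALU length grows with the partition area while the memory length grows only in the partition's perimeter-like terms, such a size exists under the constraints above.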
After the optimal partition size is determined, the operations in the ALU and memory schedules can easily be arranged. The ALU part is the duplication of the schedule for one iteration. For the memory part, the memory operations for initial data are allocated first, followed by the memory operations for intermediate data, as discussed above.
The memory requirement for a partition consists of four parts: the memory for the calculation of in-partition data, the memory for prefetch operations of intermediate data, the memory for keep operations of intermediate data, and the memory for those operations of initial data. The memory consumption for in-partition data can be calculated as in [9]. The other parts can be computed simply by multiplying the number of operations by the memory requirement of each operation.
The memory requirement for a prefetch operation is 2 locations: one stores the data prefetched by the previous partition and consumed in the current partition; the other stores the data prefetched by the current partition and consumed in the next partition. By the same rule, a keep operation also takes 2 memory locations. A block prefetch operation takes 2 × block_size memory locations.
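Putting the four parts together gives a simple additive model. A sketch (our own function name; in_partition stands for the in-partition consumption computed as in [9]):

```python
# Memory requirement model from the text: each prefetch or keep operation
# double-buffers its datum (2 locations); a block prefetch takes
# 2 * block_size locations; in_partition is computed as in [9].
def memory_requirement(in_partition, n_prefetch, n_keep,
                       n_block_prefetch, block_size):
    return (in_partition
            + 2 * n_prefetch
            + 2 * n_keep
            + 2 * block_size * n_block_prefetch)
```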

EXPERIMENT
In this section, we use several DSP benchmarks to illustrate the effectiveness of our new algorithm. They are WDF, IIR, DPCM, 2D, and Floyd, as indicated in Tables 1 and 2, which stand for the wave digital filter, infinite impulse response filter, differential pulse-code modulation device, two-dimensional filter, and Floyd-Steinberg algorithm, respectively.
These are DSP filters in common use in real DSP applications. We applied five different algorithms to these benchmarks: list scheduling, a hardware prefetching scheme, the partitioning algorithms in [9, 13], and our new partition algorithm (since it has been shown in [9] that the loop tiling technique cannot outperform partitioning algorithms, we do not compare against loop tiling in this section). In list scheduling, the same architecture model is used; however, the ALU part uses the traditional list scheduling algorithm, and the iteration space is not partitioned. In hardware prefetching scheduling, we use the model presented in [18]: whenever a block is accessed, the next block is also loaded. The partitioning algorithms in [9, 13] assume the same architecture model as ours. They partition the iteration space and execute the entire loop along the partition sequence. However, they do not take into account the influence of the initial data.
In the experiment, we assume that an ALU computation and a keep operation each take one clock cycle, a prefetch takes 10 CPU clock cycles, and a block prefetch takes 16 CPU clock cycles, which is reasonable considering the big performance gap between the CPU and the main memory. Table 1 presents results with only one initial data reference, with offset vector (1, 1); Table 2 presents results with three initial data references, with offset vector set {(1, 1), (2, −2), (0, 3)}. Note that all three initial data references are uniformly generated. From the discussion in Section 4, the overall footprint is simply the sum of the footprints with respect to the different uniformly generated reference sets. In Tables 1 and 2, the par vector column determines the partition shape. The list column lists the schedule length for list scheduling and the improvement ratio of our algorithm relative to it. The hardware column lists the schedule length for hardware prefetching and our algorithm's relative improvement ratio. Since the algorithm in [13] gets the same result as the algorithm in [9] when there is no memory size constraint, we merge their results into one column, partition algo. In the partition algo and new algo columns, the size column gives the size of the partition as multiples of the partition vectors, the m r column gives the corresponding memory requirement, and the len column gives the average schedule length for the corresponding algorithms. The ratio column is the improvement our new algorithm obtains relative to the corresponding algorithms.
The list scheduling and hardware prefetching schedule the operations based on the iteration, which will result in the

Figure 1: Architecture model with multiple function units and a memory hierarchy.

Figure 5: Two different relations between a_1 and a_2.

Figure 6: The stripe division of a footprint.

Figure 7: The tendency of the intersection with f_x and f_y.

Table 1: Experimental results with only one initial data reference.

Algorithm 2: Determining the partition size for a balanced overall schedule.
Input: the ALU schedule for one iteration, the partition shape P_x × P_y, and the initial data offset vector set R.
Output: the partition size (f_x, f_y).
(2) Using the two above conditions on the partition size, calculate another pair of minimum f'_x and f'_y.
(3) Get a new pair f_x = max(f_x, f'_x) and f_y = max(f_y, f'_y).
(4) Using this pair (f_x, f_y), calculate the number of prefetch operations, block prefetch operations, and keep operations.
(5) Calculate the ALU schedule length to see whether the balanced overall schedule requirement is satisfied.
(6) If it is satisfied, this pair (f_x, f_y) is the partition size. Otherwise, increase f_x by one and use the balanced overall schedule requirement to find the minimum f_y. If no such f_y exists, continue increasing f_x until a feasible f_y is found.