General Tuning Guidelines
There are a number of important steps to optimizing an application and tune for the memory hierarchy. Some cache use issues are discussed below. Remember to always maximize cache reuse.
- Minimize stride length. For the best-case scenario, stride length 1 is optimal for most systems and in particularly the vector systems. If that is not possible, then the low-stride access should be the goal. That increases cache efficiency, as well as sets up hardware and software prefetching. Stride lengths of powers of two is typically the worst case scenario leading to cache misses.
The following snippets of code illustrates the correct way to access contiguous elements i.e. stride 1, for a matrix in both C and Fortran.
Fortran Example: C Example: Real*8 :: a(m,n), b(m,n), c(m,n) ... do i=1,n do j=1,m a(j,i)=b(j,i)+c(j,i) end do end do
Double a[m][n], b[m][n], c[m][n]; ... for (i=0;i < m;i++){ for (j=0;j < n;j++){ a[i][j]=b[i][j]+c[i][j]; } } - Another approach is data reuse in cache by cache blocking. The idea is to load chunks of the data so it fits maximally in the different levels of cache while in use. Otherwise, the data has to be loaded into cache from memory every time it becomes necessary, since its not in cache. This phenomenon is commonly known as cache miss . This is costly from the computational standpoint, as the latency for loading data from memory is a few orders higher than from cache. The goal is to keep as much of the data in cache while it is in use and to minimizing loading it from memory.
This concept is illustrated in the following matrix-matrix multiply example where the indices for the i, j, k loops are set up in such a way so as to fit the greatest possible sizes of the different sub-matrices in cache, while the computation is on-going.
Example: Matrix multiplication
Real*8 a(n,n), b(n,n), c(n,n) do ii=1,n,nb ! <- nb is blocking factor do jj=1,n,nb do kk=1,n,nb do i=ii,min(n,ii+nb-1) do j=jj,min(n,jj+nb-1) do k=kk,min(n,kk+nb-1) c(i,j)=c(i,j)+a(j,k)*b(k,i) end do end do end do end do end do end do - Another standard issue is the dimension of arrays when they are stored and it is always best to avoid leading dimensions that are a multiple of a high power of two. More specifically, be aware of the cache line and associativity. Performance degrades when the stride is a multiple of the cache line size.
Example: Consider an L1 cache that is 16 K in size and 4-way set associative, with a cache line of 64 Bytes.
Real*8 :: a(1024,50) ... do i=1,n a(1,i)=0.50*a(1,i) end do
A 16K 4-way set assoc cache has 4 sets of 4K each (4096). If each cache line is 64 bytes, then there are 64 cache lines per set. This effectively reduces L1 from 256 cache lines to only 4! The program ends up with a 256 byte cache from the original 16K, due to non-optimal choice of leading dimension. Solution: Change leading dimension to 1028 (1024 + 1/2 cache line)
Other Tuning Techniques
In addition to the general guidelines above, employ the following techniques to make the best use of physical memory, floating point operations, and input/output subsystem, and to avoid Fortran90 performance issues whenever possible.
Memory Tuning
Use Data Prefetching
Encourage data prefetching to hide memory latency. Prefetching is the ability to predict the next cache line to be accessed and start bringing it in from memory. If data is requested far enough in advance, the latency to memory can be hidden. Compiler inserts prefetch instructions into loop -- instructions that move data from main memory into cache in advance of their use. Prefetching may also be specified by the user using directives.
Example: Using Data Prefetching
In the following dot-product example, the number of streams that are prefetched are increased from 2 to 4 to 6 for the same functionality. However, care must be taken, since prefetching an increasing number of streams does not necessarily translate into increased performance. There is a threshold value beyond which prefetching a higher number of streams can be counterproductive.
| 2 streams | 4 streams | 6 streams |
do i=1,n s=s+x(i)*y(i) end do dotp=s |
do i=1,n/2 s0=s0+x(i)*y(i) s1=s1+x(i+n/2)*y(i+n/2) end do s0=s0+(n-2*(n/2))*x(n)*y(n) dotp=s0+s1 |
do i=1,n/3 s0=s0+x(i)*y(i) s1=s1+x(i+n/3)*y(i+n/3) s2=s2+x(i+2*(n/3))*y(i+2*(n/3)) end do do i=3*(n/3)+1,n s0=s0+x(i)*y(i) end do dotp=s0+s1+s2 |
Work within available physical memory
Fit the problem size to physical memory. Working in virtual memory leads to performance degradation and should be avoided at all costs. In addition swapping causes problems on some Linux systems.
Floating-Point Tuning
Unroll Inner Loops to Hide FP Latency
In the following dot-product example, two points are illustrated. If the inner loop indices are small then the inner loop overhead makes it optimal to unroll the inner loop instead. In addition, unrolling inner loops hides floating point latency. A more advanced notion of micro level optimization is the measure of the relative rate of operations and the number of data access in a compute step. More precisely it is rate of Floating Multiply Add to data access ratio in a compute step. The higher this rate, the better.
... do i=1,n,k s1 = s1 + x(i)*y(i) s2 = s2 + x(i+1)*y(i+1) s3 = s3 + x(i+2)*y(i+2) s4 = s4 + x(i+3)*y(i+3) ... sk = sk + x(i+k)*y(i+k) end do ... dotp = s1 + s2 + s3 + s4 + ... + sk |
Avoid Divide Operations
The following example illustrates a very common step, since a floating point divide is more expensive than a multiply. If the divide step is inside of a loop it is better to substitute that step by a multiply outside of the loop, provided no dependencies exist. Another alternative is to replace the loop by an optimized vector intrinsic library call, if available.
|
|
I/O Subsystem Tuning
Some of the more common sense approach entails using what's provided by the vendor. In other words, taking advantage of the hardware. On Linux systems for example, this would mean using Lustre for Linux-based clusters. On IBM systems, it would imply using the fast Global Parallel Filesystem (GPFS) provided by IBM.
Other common sensical approaches to optimizing I/O are to be aware of the existence and the locations of the filesystems. For instance, whether the file systems are locally mounted or accessed through a remote file system. The former is much faster than the latter, due to limitations of network bandwidth, disk speed, and overhead caused by accessing the filesystem over the network and should always be the goal at the design level.
The other approaches including considering the best software options available. Some of them are enumerated below:
- Read or write as much data as possible with a single READ/WRITE/PRINT. Avoid performing multiple writes of small records.
- Use binary instead of ASCII format because of overhead in converting from internal representation of real numbers to character string. In addition, ASCII files are larger than the corresponding binary file.
- In Fortran, prefer direct access to sequential access. Direct or random access files do not have record length indicators at the beginning and end of each record.
- If available, use asynchronous I/O to overlap reads/writes with computation. This capability is available on the Cray, IBM, and SGI systems.
Fortran90 Performance Pitfalls
There are a couple of design issues specific to Fortran90 applications.
For the first case, the F90 Array syntax of the two dimensional arrays impacts performance. Consider the two code snippets given below:
| Case 1: |
do j = js,je
do k = ks,ke
do i = is,ie
rt(i,k,j) = rt(i,k,j) - smdiv*(rt(i,k,j) - rtold(i,k,j))
enddo
enddo
enddo
|
| Case 2: |
rt(is:ie,ks:ke,js:je)=rt(is:ie,ks:ke,js:je) - &
smdiv * rt(is:ie,ks:ke,js:je) – rtold(is:ie,ks:ke,js:je))
|
The array syntax in the computation step of the second approach leads to significant performance penalty over explicit loops on cache-based systems, although it is more elegant. Vector systems tend to prefer this array syntax from the performance standpoint. More importantly, the array syntax generates larger temporary arrays on program stack.
The way the arrays are declared also impacts performance. For example, the negative performance impact is significantly higher for the F90 assumed shape arrays in the second case. There is almost a ten-fold increase in compile time. (This timing test was done on the IBM Power3 for CLimate application).
| Case 1: |
REAL, DIMENSION( ims:ime ,kms:kme ,jms:jme ) :: r, rt, rw, rtold
Results in F77-style assumed-size arrays
Compile time: 46 seconds
Run time: .064 seconds / call
|
| Case 2: |
REAL, DIMENSION( ims: ,kms: ,jms: ) :: r, rt, rw, rtold Results in F90-style assumed-shape arrays Compile time: 3120 seconds!! Run time: .083 seconds / call |
Another issue which arises from the F90 assumed shape arrays occurs when it is a parameter in a subroutine. Using assumed shape arrays as a parameter in a subroutine may result in the subroutine being passed a copy rather than the beginning address of the array, as normally occurs when arrays are passed as parameters. This F90 copy-in/copy-out overhead is very inefficient and can also cause errors while calling external libraries, like MPI.


