TACC User Guides

Memory Subsystem Tuning

General Tuning Guidelines

There are a number of important steps to optimizing an application and tune for the memory hierarchy. Some cache use issues are discussed below. Remember to always maximize cache reuse.

  1. Minimize stride length. For the best-case scenario, stride length 1 is optimal for most systems and in particularly the vector systems. If that is not possible, then the low-stride access should be the goal. That increases cache efficiency, as well as sets up hardware and software prefetching. Stride lengths of powers of two is typically the worst case scenario leading to cache misses.

    The following snippets of code illustrates the correct way to access contiguous elements i.e. stride 1, for a matrix in both C and Fortran.

    Fortran Example: C Example:
    Real*8 :: a(m,n), b(m,n), c(m,n)
    ...
    do i=1,n
      do j=1,m
       a(j,i)=b(j,i)+c(j,i)
      end do
    end do
    Double a[m][n], b[m][n], c[m][n];
    ...
    for (i=0;i < m;i++){
      for (j=0;j < n;j++){
        a[i][j]=b[i][j]+c[i][j];
      }
    }

     

  2. Another approach is data reuse in cache by cache blocking. The idea is to load chunks of the data so it fits maximally in the different levels of cache while in use. Otherwise, the data has to be loaded into cache from memory every time it becomes necessary, since its not in cache. This phenomenon is commonly known as cache miss . This is costly from the computational standpoint, as the latency for loading data from memory is a few orders higher than from cache. The goal is to keep as much of the data in cache while it is in use and to minimizing loading it from memory.

    This concept is illustrated in the following matrix-matrix multiply example where the indices for the i, j, k loops are set up in such a way so as to fit the greatest possible sizes of the different sub-matrices in cache, while the computation is on-going.

    Example: Matrix multiplication

    Real*8 a(n,n), b(n,n), c(n,n)
    do ii=1,n,nb  ! <- nb is blocking factor
      do jj=1,n,nb
        do kk=1,n,nb
          do i=ii,min(n,ii+nb-1)
            do j=jj,min(n,jj+nb-1)
              do k=kk,min(n,kk+nb-1)
                c(i,j)=c(i,j)+a(j,k)*b(k,i)
              end do
            end do
          end do
        end do
      end do
    end do

     

  3. Another standard issue is the dimension of arrays when they are stored and it is always best to avoid leading dimensions that are a multiple of a high power of two. More specifically, be aware of the cache line and associativity. Performance degrades when the stride is a multiple of the cache line size.

    Example: Consider an L1 cache that is 16 K in size and 4-way set associative, with a cache line of 64 Bytes.

    Real*8 :: a(1024,50)
    ...
    do i=1,n
      a(1,i)=0.50*a(1,i)
    end do
    A 16K 4-way set assoc cache has 4 sets of 4K each (4096). If each cache line is 64 bytes, then there are 64 cache lines per set. This effectively reduces L1 from 256 cache lines to only 4! The program ends up with a 256 byte cache from the original 16K, due to non-optimal choice of leading dimension.

     

    Solution: Change leading dimension to 1028 (1024 + 1/2 cache line)

Other Tuning Techniques

In addition to the general guidelines above, employ the following techniques to make the best use of physical memory, floating point operations, and input/output subsystem, and to avoid Fortran90 performance issues whenever possible.

Memory Tuning

Use Data Prefetching

Encourage data prefetching to hide memory latency. Prefetching is the ability to predict the next cache line to be accessed and start bringing it in from memory. If data is requested far enough in advance, the latency to memory can be hidden. Compiler inserts prefetch instructions into loop -- instructions that move data from main memory into cache in advance of their use. Prefetching may also be specified by the user using directives.

Example: Using Data Prefetching

In the following dot-product example, the number of streams that are prefetched are increased from 2 to 4 to 6 for the same functionality. However, care must be taken, since prefetching an increasing number of streams does not necessarily translate into increased performance. There is a threshold value beyond which prefetching a higher number of streams can be counterproductive.

2 streams 4 streams 6 streams
do i=1,n
  s=s+x(i)*y(i)
end do
dotp=s
do i=1,n/2
  s0=s0+x(i)*y(i)
  s1=s1+x(i+n/2)*y(i+n/2)
end do
s0=s0+(n-2*(n/2))*x(n)*y(n)
dotp=s0+s1
do i=1,n/3
  s0=s0+x(i)*y(i)
  s1=s1+x(i+n/3)*y(i+n/3)
  s2=s2+x(i+2*(n/3))*y(i+2*(n/3))
end do
do i=3*(n/3)+1,n
  s0=s0+x(i)*y(i)
end do
dotp=s0+s1+s2

Work within available physical memory

Fit the problem size to physical memory. Working in virtual memory leads to performance degradation and should be avoided at all costs. In addition swapping causes problems on some Linux systems.

Floating-Point Tuning

Unroll Inner Loops to Hide FP Latency

In the following dot-product example, two points are illustrated. If the inner loop indices are small then the inner loop overhead makes it optimal to unroll the inner loop instead. In addition, unrolling inner loops hides floating point latency. A more advanced notion of micro level optimization is the measure of the relative rate of operations and the number of data access in a compute step. More precisely it is rate of Floating Multiply Add to data access ratio in a compute step. The higher this rate, the better.

...
do i=1,n,k
  s1 = s1 + x(i)*y(i)
  s2 = s2 + x(i+1)*y(i+1)
  s3 = s3 + x(i+2)*y(i+2)
  s4 = s4 + x(i+3)*y(i+3)
  ...
  sk = sk + x(i+k)*y(i+k)
end do
...
dotp = s1 + s2 + s3 + s4 + ... + sk

Avoid Divide Operations

The following example illustrates a very common step, since a floating point divide is more expensive than a multiply. If the divide step is inside of a loop it is better to substitute that step by a multiply outside of the loop, provided no dependencies exist. Another alternative is to replace the loop by an optimized vector intrinsic library call, if available.

a=...  
do i=1,n 
 x(i)=x(i)/a 
end do
	
arrow
a=...
ainv=1.0/a
do i=1,n
 x(i)=x(i)*ainv 
end do

I/O Subsystem Tuning

Some of the more common sense approach entails using what's provided by the vendor. In other words, taking advantage of the hardware. On Linux systems for example, this would mean using Lustre for Linux-based clusters. On IBM systems, it would imply using the fast Global Parallel Filesystem (GPFS) provided by IBM.

Other common sensical approaches to optimizing I/O are to be aware of the existence and the locations of the filesystems. For instance, whether the file systems are locally mounted or accessed through a remote file system. The former is much faster than the latter, due to limitations of network bandwidth, disk speed, and overhead caused by accessing the filesystem over the network and should always be the goal at the design level.

The other approaches including considering the best software options available. Some of them are enumerated below:

  1. Read or write as much data as possible with a single READ/WRITE/PRINT. Avoid performing multiple writes of small records.
  2. Use binary instead of ASCII format because of overhead in converting from internal representation of real numbers to character string. In addition, ASCII files are larger than the corresponding binary file.
  3. In Fortran, prefer direct access to sequential access. Direct or random access files do not have record length indicators at the beginning and end of each record.
  4. If available, use asynchronous I/O to overlap reads/writes with computation. This capability is available on the Cray, IBM, and SGI systems.

Fortran90 Performance Pitfalls

There are a couple of design issues specific to Fortran90 applications.

For the first case, the F90 Array syntax of the two dimensional arrays impacts performance. Consider the two code snippets given below:

Case 1:
do j = js,je
  do k = ks,ke
    do i = is,ie
      rt(i,k,j) = rt(i,k,j) - smdiv*(rt(i,k,j) - rtold(i,k,j))
    enddo
  enddo
enddo
Case 2:
rt(is:ie,ks:ke,js:je)=rt(is:ie,ks:ke,js:je) - &
    smdiv * rt(is:ie,ks:ke,js:je) – rtold(is:ie,ks:ke,js:je))
		

 

The array syntax in the computation step of the second approach leads to significant performance penalty over explicit loops on cache-based systems, although it is more elegant. Vector systems tend to prefer this array syntax from the performance standpoint. More importantly, the array syntax generates larger temporary arrays on program stack.

The way the arrays are declared also impacts performance. For example, the negative performance impact is significantly higher for the F90 assumed shape arrays in the second case. There is almost a ten-fold increase in compile time. (This timing test was done on the IBM Power3 for CLimate application).

Case 1:
REAL, DIMENSION( ims:ime ,kms:kme ,jms:jme ) :: r, rt, rw, rtold
Results in F77-style assumed-size arrays

  Compile time: 46 seconds
  Run time:    .064 seconds / call
    
Case 2:
REAL, DIMENSION( ims: ,kms: ,jms: ) :: r, rt, rw, rtold
Results in F90-style assumed-shape arrays

  Compile time: 3120 seconds!!
  Run time:    .083 seconds / call
		

 

Another issue which arises from the F90 assumed shape arrays occurs when it is a parameter in a subroutine. Using assumed shape arrays as a parameter in a subroutine may result in the subroutine being passed a copy rather than the beginning address of the array, as normally occurs when arrays are passed as parameters. This F90 copy-in/copy-out overhead is very inefficient and can also cause errors while calling external libraries, like MPI.