TACC User Guides
Lonestar User Guide

The TACC Lonestar cluster is one of the largest academic computational resources in the nation. It serves as a computational resource in the NSF TeraGrid partnership, the Texas-wide Computational Grid (HiPCAT), the UT campus grid and the UT research community.

Figure 1. Lonestar System: Front row of frames.
Figure 1.2. Dell 1955 Blade motherboard.
Figure 1.3. InfiniBand Switch Topology (400 nodes).

System Overview

The Lonestar Linux Cluster consists of 1300 nodes, each with two dual-core processors, for a total of 5,200 cores. It is configured with 10.4 TB of total memory and 95 TB of local disk space. The rated peak performance is 55 TFLOPS. The system provides 68 TB of global, parallel file storage managed by the Lustre file system. Nodes are interconnected with InfiniBand technology in a fat-tree topology with a point-to-point bandwidth of 1 GB/s. A 10 PB archival system is available for long-term storage and backups.

 

Architecture

The Lonestar compute and login nodes run a Linux OS and are managed by the Rocks 4.1 cluster toolkit. The configuration and features of the compute nodes, interconnect, and I/O systems are described below and summarized in Tables 1 through 1.5.

Compute Nodes

A node consists of a Dell PowerEdge 1955 blade running a 2.6 x86_64 Linux kernel from kernel.org. Each node contains two 64-bit Intel Xeon dual-core processors (4 cores in all) on a single board, as an SMP unit. The core frequency is 2.66GHz, and each core supports 4 floating-point operations per clock period, for a peak performance of 10.6 GFLOPS/core or 42.6 GFLOPS/node. Each node contains 8GB of memory. The memory subsystem has a 1333 MHz Front Side Bus and dual channels with 533 MHz Fully Buffered DIMMs. Both processors share access to the memory controllers in the memory controller hub (MCH, or North Bridge).
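The peak figures quoted above follow from the clock rate and the core counts. As a quick check, the sketch below reproduces that arithmetic with awk. It assumes only the numbers stated in this guide (2.66 GHz, 4 floating-point results per clock, 4 cores and 8GB per node, 1300 nodes); nothing in it is measured.

```shell
#!/bin/sh
# Reproduce the peak-performance arithmetic from figures quoted in this guide.
# All inputs are taken from the text; this is a sanity check, not a benchmark.
peak_summary() {
    awk 'BEGIN {
        ghz = 2.66; flops_per_clock = 4; cores_per_node = 4
        nodes = 1300; mem_per_node_gb = 8
        core = ghz * flops_per_clock          # GFLOPS per core
        node = core * cores_per_node          # GFLOPS per node
        printf "core_gflops=%.2f\n", core
        printf "node_gflops=%.2f\n", node
        printf "system_tflops=%.1f\n", node * nodes / 1000
        printf "memory_tb=%.1f\n", nodes * mem_per_node_gb / 1000
    }'
}
peak_summary
```

The per-core and per-node values match the figures given above, and the system total lands close to the rated peak quoted in the overview.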

Interconnect

The interconnect topology is a fat tree with an oversubscription of 2. The eighty nodes of each pair of frames are connected to 5 TopSpin 24-port "120" switches (leaves). From each TopSpin 120 switch there are 4 uplink connections to 4 TopSpin 96-port "270" switches (cores). Over 15 other pairs of frames are connected in similar fashion to the cores, supporting 1300 nodes. Figure 1.4 illustrates the topology connection from the leaf switches to the core switches.

Filesystems

Lonestar storage includes a 73GB SATA drive (60GB usable by the user) on each node. Home directories are NFS mounted on all nodes and limited by quota to 200MB per user. The Work file system, also accessible from all nodes, is a parallel file system supported by Lustre and 68TB of DataDirect storage. Archival storage is not directly available from the login node, but is accessible through scp.

Figure 1.4. InfiniBand Switch Topology

Table 1. System Configuration & Performance

Component                         Technology                            Performance/Size
Peak Floating Point Operations                                          55.5 TFLOPS
Nodes (blades)                    Two Dual-Core Xeon 5100 processors    1300 Nodes / 5,200 Cores
Memory                            Distributed                           10.4TB (Aggregate)
Shared Disk                       Lustre, parallel File System          68TB
Local Disk                        SATA                                  95TB (Aggregate)
Interconnect                      InfiniBand Switch                     1 GB/s P-2-P Bandwidth

Table 1.2. PowerEdge 1955 Compute Node

Component                            Technology
Sockets per Node/Cores per Socket    2/2
Motherboard                          Intel 5000P Chipset (Bensley Platform)
Memory Per Node                      8GB PC2-4300 FB-DIMM memory
System Bus                           1333 MHz Processor Front Side Bus (FSB): 10.7GB/s
Memory Bus & Configuration           quad channel, 533MHz, 8x1GB DDR2: 8.5GB/s/Socket
                                     (Separate Front Side Buses, DIB)
PCI Express                          8 lane
73GB Disk                            10K RPM SAS-SATA

Table 1.3. PowerEdge 2950 Login Nodes

Component                            Technology
2 login nodes                        lonestar.tacc.utexas.edu
                                     (lslogin2.tacc.utexas.edu)
Sockets per Node/Cores per Socket    2/2
Clock Speed                          2.66GHz
Motherboard                          Intel 5000X Chipset
Memory Per Node                      16GB PC2-4300 FB-DIMM memory
404GB HOME Disk                      Dell 220S PowerVault
                                     200MB quota

Table 1.4. Intel Woodcrest Processor

Technology                 64-bit (Intel EM64T)
Clock Speed                2.66GHz
FP Results/Clock Period    4
Peak Performance/core      10.64GFLOPS/core
L2 Cache                   4MB (Smart)
L1 Cache                   32KB

Table 1.5. Storage Systems

Storage Class           Size         Architecture                        Features
Local                   73GB/node    SAS-SATA                            mounted on /tmp
Parallel                68TB         Lustre, DataDirect S2A9500          16 Dell 1850 I/O data servers, Brocade switch;
                                                                         user striping, MPI-IO, mnt on /work
HOME                    400GB        NFS, Raid-5, 200MB/user             Dell 2850 Server, automounted
Ranch (Tape Storage)    10PB         SAM-FS (Storage Archive Manager)    10GB/s connection through 4 GridFTP Servers

Intel Extended Memory 64-bit Technology (EM64T) Features

  • 64-bit virtual address space, 1TB physical address space
  • 64-bit pointers
  • 64-bit general purpose registers
  • 64-bit integer registers and arithmetic units

Users must recompile their applications when migrating from a 32-bit system to Lonestar because of the difference in processor architectures. Lonestar nodes have 64-bit Intel EM64T processors, a 64-bit OS, and support only 64-bit libraries.

The user environment on Lonestar functions nearly identically to the other TACC Linux systems. Except for a difference in the compiler architecture option, the sizes of the long and pointer types (in C), and directory references to 64-bit libraries, the commands for compiling, loading, and running applications are the same. To check for potential porting problems between 32-bit and 64-bit modes, include the -Wp64 option when compiling C or C++ codes with Intel's icc compiler. Additional information is available in the compilation section.
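When porting, it can help to confirm the word size of the system you are building on before comparing results between machines. The following is a minimal sketch that uses the standard getconf and uname utilities; the messages it prints are illustrative only.

```shell
#!/bin/sh
# Report whether the current userland is 32- or 64-bit.
# getconf LONG_BIT prints the width of a C long in bits.
long_bits=$(getconf LONG_BIT)
machine=$(uname -m)
echo "machine=$machine long_bits=$long_bits"
if [ "$long_bits" -eq 64 ]; then
    echo "long is 8 bytes here; code built on a 32-bit system should be recompiled"
else
    echo "long is 4 bytes here"
fi
```

On a 64-bit Linux system the long type and pointers are both 8 bytes, which is the difference the -Wp64 check is meant to expose.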

 

User Environment

 

System Access

SSH

To ensure a secure login session, users must connect to machines using the secure shell program, ssh. Telnet is no longer allowed because of the security vulnerabilities associated with it. The "r" commands rlogin, rsh, and rcp, as well as ftp, are also disabled on this machine for similar reasons. These commands are replaced by the more secure alternatives included in SSH: ssh, scp, and sftp.

Before any login sessions can be initiated using ssh, a working SSH client needs to be present in the local machine. Wikipedia is a good source of information on SSH in general and provides information on the various clients available for your particular operating system.

Do not run the optional ssh-keygen command to set up Public-key authentication. This option sets up a passphrase that will interfere with submitting job scripts to the batch queues. If you have already done this, remove the .ssh directory (and the files under it) from your home directory. Log out and log back in to test.

To initiate an ssh connection to a Lonestar login node from a UNIX or Linux system with ssh already installed, execute the following command:

ssh userid@lonestar.tacc.utexas.edu

Note: userid is needed only if the user name on the local machine and the TACC machine differ.

Password changes should comply with practices presented in the TACC Password Guide.

 

Login Information

Login Shell

The most important component of a user's environment is the login shell that interprets text on each interactive command line and statements in shell scripts. Each login has a line entry in the /etc/passwd file, and the last field contains the shell launched at login. To determine your login shell, use:

lslogin2% echo $SHELL
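To illustrate where that value comes from, the sketch below extracts the last (seventh) colon-separated field from a passwd-style entry. The entry shown is made up for illustration and does not describe a real account.

```shell
#!/bin/sh
# Extract the login shell (7th colon-separated field) from a passwd-style line.
login_shell_of() {
    printf '%s\n' "$1" | cut -d: -f7
}

# Example entry; "jdoe" and its fields are hypothetical.
entry='jdoe:x:1000:1000:Jane Doe:/home/utexas/jdoe:/bin/tcsh'
login_shell_of "$entry"
```

On a live system the same function can be applied to your own entry, for example to the output of getent passwd "$USER" where getent is available.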

You can use the chsh command to change your login shell. Full instructions are in the chsh man page. The available shells are listed, with their full paths, in the /etc/shells file.

To display the list of available shells with chsh and change your login shell to bash, execute the following:

lslogin2% chsh -l
lslogin2% chsh -s /bin/bash

User Environment

The next most important component of a user's environment is the set of environment variables. Many of the UNIX commands and tools, such as the compilers, debuggers, profilers, editors, and just about all applications that have GUIs (Graphical User Interfaces), look in the environment for variables that specify information they may need to access. To see the variables in your environment execute the command:

lslogin2% env

The variables are listed as keyword/value pairs separated by an equal (=) sign, as illustrated below by the $HOME and $PATH variables.

HOME=/home/utexas/username
PATH=/bin:/usr/bin:/usr/local/apps:/opt/intel/bin

Notice that the $PATH environment variable consists of a colon (:) separated list of directories. Variables set in the environment (with setenv for C shells and export for Bourne shells) are "carried" to the environment of shell scripts and new shell invocations, while normal "shell" variables (created with the set command) are useful only in the present shell. Only environment variables are displayed by the env (or printenv) command. Execute set to see the (normal) shell variables.
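The difference between environment variables and normal shell variables is easiest to see from a child process. The following sketch uses Bourne-shell syntax (C shells would use setenv and set instead); LOCAL_ONLY and SHARED are hypothetical names chosen for the demonstration.

```shell
#!/bin/sh
# Contrast a plain shell variable with an exported (environment) variable.
# LOCAL_ONLY and SHARED are hypothetical names used only for this demo.
LOCAL_ONLY="visible in this shell only"
SHARED="visible in child processes"
export SHARED

# A child shell inherits exported variables but not plain shell variables.
child_view=$(sh -c 'echo "LOCAL_ONLY=[${LOCAL_ONLY}] SHARED=[${SHARED}]"')
echo "$child_view"
```

The child shell sees an empty LOCAL_ONLY but the full value of SHARED, which is why settings needed by scripts and batch jobs must be exported.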

Startup Scripts

All UNIX systems set up a default environment and provide administrators and users with the ability to execute additional UNIX commands to alter the environment. These commands are "sourced". That is, they are executed by your login shell, and the variables (both normal and environmental), as well as aliases and functions, are included in the present environment.

Forgetting to unload superseded modules or putting module commands in the wrong shell startup file are some of the most common environment set up mistakes. With the C based shells, use the .login_user file instead of .cshrc_user. The .login_user file is executed at login, but only after .cshrc_user, and therefore after the rest of the environment is set up. For Bourne based shells, use the .profile_user file for module commands.

Basic site environment variables and aliases are set in the following files:

/etc/csh.cshrc	{C-type shells, non-login specific}
/etc/csh.login	{C-type shells, specific to login}
/etc/profile	{Bourne-type shells}

TACC coordinates the environments on several systems. In order to efficiently maintain and create a common environment among these systems, TACC uses its own startup files in /usr/local/etc. A corresponding file in this etc directory is sourced by each of the startup script files that reside in your home directory. (Please do not edit these files or the sourcing commands in them, even if you are a UNIX guru.) Any commands you put in your .login_user, .cshrc_user, or .profile_user file are sourced (if the file exists) at the end of the corresponding /usr/local/etc command files. If you accidentally remove your .login, .cshrc, or .profile, you can copy new ones from /usr/local/etc/start-up, or execute:

lslogin2% /usr/local/bin/install_ut_startups

This utility renames the old shell startup scripts with an "_old" suffix and recreates the standard ones.

For historical reasons, the C based shells (csh, tcsh, etc.) source two types of files. The .cshrc type files are sourced first (/etc/csh.cshrc then $HOME/.cshrc then /usr/local/etc/cshrc then $HOME/.cshrc_user). These files are used to set up the execution environment used by all scripts and for access to the machine without an interactive login. For example, the following commands execute only the .cshrc type files on the remote machine:

scp data lonestar.tacc.utexas.edu
ssh lonestar.tacc.utexas.edu date

The .login type files set up environment variables that accounts commonly use in an interactive session. They are sourced after the .cshrc type files (/etc/csh.login then $HOME/.login then /usr/local/etc/login then $HOME/.login_user).

Similarly, if your login shell is a Bourne based shell (bash, sh, ksh, etc.), the profile files are sourced (/etc/profile then $HOME/.profile then /usr/local/etc/profile then $HOME/.profile_user).

The commands in the /etc files above set up operating system interaction and the initial PATH, ulimit, umask, and environment variables such as HOSTNAME. They also source command scripts in /etc/profile.d: /etc/csh.cshrc sources files ending in .csh, and /etc/profile sources files ending in .sh. Many site administrators use these scripts to set up the environments for common user tools (vim, less, etc.) and system utilities (ganglia, modules, Globus, LSF, etc.).

 

File Systems

The TACC HPC platforms have several different file systems with distinct storage characteristics. There are predefined, user-owned directories in these file systems for users to store their data. Of course, these file systems are shared with other users, so they are managed by either a quota limit, a purge policy (time-residency) limit, or a migration policy.

To determine the amount of disk space used in a file system, cd to the directory of interest and execute the df -k . command, including the "dot", which represents the current directory. Without the "dot", all file systems are reported.

In the command output below, the file system name appears on the left (IP number, followed by the file system name), and the used and available space (-k, in units of 1 KB) appear in the middle columns, followed by the percent used and the mount point:

lslogin2% df -k .
File System			1k-blocks	Used		Available	Use%	Mounted on
129.114.64.4:/export/home	423366168	355672040	46188336	89%	/home

To determine the amount of space occupied in a user-owned directory, cd to the directory and execute the du command with the -sb option (s=summary, b=units in bytes):

lslogin2% du -sb

To determine quota limits and usage on $HOME, execute the quota command without any options (from any directory):

lslogin2% quota
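To see how close a directory is to a quota, the byte count reported by du can be turned into a percentage. The helper below is a sketch: quota_percent is a hypothetical function name, and it assumes that 1 MB means 1024*1024 bytes, which may not match how the quota system counts.

```shell
#!/bin/sh
# Hypothetical helper: percentage of a quota consumed (integer arithmetic).
#   $1 = bytes used, $2 = quota in MB (1 MB taken as 1024*1024 bytes here)
quota_percent() {
    echo $(( $1 * 100 / ($2 * 1024 * 1024) ))
}

# Example: 150 MB used against the 200 MB $HOME quota quoted in this guide.
quota_percent $((150 * 1024 * 1024)) 200
```

In practice the first argument could come from du, for example quota_percent "$(du -sb "$HOME" | cut -f1)" 200.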

The four major file systems available on Lonestar are:

HOME

At login, the system automatically sets the current working directory to your home directory.
Store your source code and build your executables here.
This directory has a quota limit of 200 MB.
This file system is backed up.
The frontend nodes and any compute node can access this directory.
Use $HOME to reference your home directory in scripts.
Use cd to change to $HOME.

WORK

Store large files here.
Change to this directory in your batch scripts and run jobs in this file system.
The work file system is approximately 30 TB.
This file system is not backed up.
The frontend nodes and any compute node can access this directory.
Purge Policy: Files that have not been accessed for more than 10 days are purged.
Use $WORK to reference this directory in scripts.
Use cdw to change to $WORK.

NOTE: TACC staff may delete files from work if the file system becomes full, even if files are less than 10 days old. A full file system inhibits use of the file system for everyone. The use of programs or scripts to actively circumvent the file purge policy will not be tolerated.
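To find files that should be moved to archival storage before they are purged, you can list files by access time. The sketch below uses find with -atime; list_purge_candidates is a hypothetical helper, and the test it applies only approximates the purge policy, whose exact criteria are set by TACC. The demonstration assumes GNU touch for the -d option.

```shell
#!/bin/sh
# List regular files under a directory whose last access is older than N days.
#   $1 = directory, $2 = age in days (defaults to the 10-day window quoted above)
# Note: find discards fractional days, so "+10" matches files last accessed
# at least 11 full days ago.
list_purge_candidates() {
    find "$1" -type f -atime +"${2:-10}" -print
}

# Self-contained demo in a temporary directory (GNU touch -d syntax assumed).
demo=$(mktemp -d)
touch "$demo/fresh.dat"
touch -a -d '20 days ago' "$demo/stale.dat"
list_purge_candidates "$demo" 10
rm -rf "$demo"
```

On Lonestar the function would be called as list_purge_candidates "$WORK". Listing files this way only reads metadata and does not change their access times.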

More on $WORK -- How to do parallel I/O in the Lustre File System

SCRATCH

This is a directory in a local disk on each node where you can store files and perform local I/O for the duration of a batch job.
It is often more efficient to use and store files directly in $WORK (to avoid moving files from scratch at the end of a batch job).
The scratch file system is approximately 60 GB.
Files stored in the scratch directory on each node are removed immediately after the job terminates.
Use $SCRATCH to reference this file system in scripts.

ARCHIVE

Store permanent files here for archival storage.
This file system is NOT NFS mounted (directly accessible) on any node.
Use the $ARCHIVE file system only for long-term file storage to the $ARCHIVER system; it is not appropriate to use it as a staging area.
Use the rcp command to transfer data to this system. For example:

lslogin2% rcp ${ARCHIVER}:$ARCHIVE/myfile $WORK

Use the rsh command to login to the $ARCHIVER system from any TACC machine. For example:

lslogin2% rsh $ARCHIVER

See the Ranch User Guide for more on archiving.

 

Development

 

Software on TACC Resources

Modules

TACC continually updates application packages, compilers, communications libraries, tools, and math libraries. To facilitate this task and to provide a uniform mechanism for accessing different revisions of software, TACC uses the modules utility.

At login, modules commands set up a basic environment for the default compilers, tools, and libraries. This environment includes the $PATH, $MANPATH, and $LIBPATH environment variables, directory locations ($WORK, $HOME, etc.), aliases (cdw, cdh, etc.), and license paths. Therefore, there is no need for you to set or update them when system and application software is updated.

Users that require 3rd party applications, special libraries, and tools for their projects can quickly tailor their environment with only the applications and tools they need. Using modules to define a specific application environment allows you to keep your environment free from the clutter of all the application environments you don't need.

Each of the major TACC applications has a modulefile that sets, unsets, appends to, or prepends to environment variables such as $PATH, $LD_LIBRARY_PATH, $INCLUDE_PATH, $MANPATH for the specific application. Each modulefile also sets functions or aliases for use with the application. You need only to invoke a single command to configure the application/programming environment properly. The general format of this command is:

module load name

where name is the name of the module to load. If you often need an application environment, place the module commands required in your .login_user and/or .profile_user shell startup file.
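As an illustration, a .profile_user file for a Bourne-type shell might wrap the load in a small guard so that the file is harmless in shells where the module command is not defined. This is only a sketch: load_module_if_available is a hypothetical helper, not something TACC provides, and fftw3 is used simply because it is the example package in this section.

```shell
# ~/.profile_user (Bourne-type shells): sketch only.
# load_module_if_available is a hypothetical helper, not a TACC-provided command.
load_module_if_available() {
    if command -v module >/dev/null 2>&1; then
        module load "$1"
    else
        echo "module command not found; skipping $1" >&2
    fi
}

load_module_if_available fftw3
```

C-shell users would place the equivalent module load line in .login_user instead.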

Most of the package directories are in /opt/apps ($APPS) and are named after the package. In each package directory there are subdirectories that contain the specific versions of the package.

As an example, the fftw3 package requires several environment variables that point to its home, libraries, include files, and documentation. These can be set up by loading the fftw3 module:

lslogin2% module load fftw3

To see a list of available modules, a synopsis of a particular modulefile's operations (in this case, fftw3), and a list of currently loaded modules, execute the following commands:

lslogin2% module avail
lslogin2% module help fftw3
lslogin2% module list

During upgrades, new modulefiles are created to reflect the changes made to the environment variables. TACC will always announce upgrades and module changes in advance.

 

Programming Models

There are two distinct memory models for computing: distributed-memory and shared-memory. In the former, the message passing interface (MPI) is employed in programs to communicate between processors that use their own memory address space. In the latter, open multiprocessing (OMP) programming techniques are employed for multiple threads (light weight processes) to access memory in a common address space.

For distributed memory systems, single-program multiple-data (SPMD) and multiple-program multiple-data (MPMD) programming paradigms are used. In the SPMD paradigm, each processor core loads the same program image and executes and operates on data in its own address space (different data). This is illustrated in Figure 2. It is the usual mechanism for MPI code: a single executable (a.out in the figure) is available on each node (through a globally accessible file system such as $WORK or $HOME), and launched on each node (through the batch MPI launch command, "ibrun a.out").

In the MPMD paradigm, each processor core loads and executes a different program image and operates on different data sets, as illustrated in Figure 2. This paradigm is often used by researchers who are investigating the parameter space (parameter sweeps) of certain models, and need to launch tens or hundreds of single-processor executions on different data. (This is a special case of MPMD in which the same executable is used, and there is NO MPI communication.) The executables are launched through the same mechanism as SPMD jobs, but a UNIX script is used to assign input parameters for the execution command (through the batch MPI launcher, "ibrun script_command"). Details of the batch mechanism for parameter sweeps are described in the help information for the launcher module:

lslogin2% module help launcher
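To give a feel for what such a script does, the sketch below maps a task index to one value from a list of parameters. It is not the launcher module's actual mechanism: TASK_ID is a hypothetical variable standing in for whatever task index the launcher provides, and the --param flag of a.out is invented for the example.

```shell
#!/bin/sh
# Sketch of a per-task wrapper for a parameter sweep.
# Assumption: the launcher exposes a 0-based task index; TASK_ID is a
# hypothetical name, not a variable documented in this guide.
param_for_task() {
    # $1 = task index; remaining arguments = list of parameter values
    idx=$1; shift
    i=0
    for p in "$@"; do
        if [ "$i" -eq "$idx" ]; then
            echo "$p"
            return 0
        fi
        i=$((i + 1))
    done
    return 1   # index beyond the end of the list
}

TASK_ID=${TASK_ID:-2}
value=$(param_for_task "$TASK_ID" 0.1 0.2 0.5 1.0) || exit 1
# Print the command instead of running it; --param is a made-up option.
echo "task $TASK_ID would run: ./a.out --param $value"
```

Each task runs the same script but selects a different input, which is the special case of MPMD described above.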
Figure 2. Distributed Memory Paradigm: Single/Multiple-Program Multiple-Data.

The shared-memory programming model is used on Symmetric Multi-Processor (SMP) nodes. Each node on this system contains 4 cores with a single 8GB memory subsystem.

The programming paradigm for this memory model is called Parallel Vector Processing (PVP) or Shared-Memory Parallel Programming (SMPP). The former name is derived from the fact that vectorizable loops are often employed as the primary structure for parallelization. The main point of SMPP computing is that all of the processors in the same node share data in a single memory subsystem, as shown in Figure 2.1. There is no need for explicit messaging between processors as with MPI coding.

Figure 2.1 Shared-Memory Parallel Processing.

In the SMPP paradigm, either compiler directives (as pragmas in C, and special comments in Fortran) or explicit threading calls (e.g. with Pthreads) are employed. The majority of science codes now use OpenMP directives that are understood by most vendor compilers, as well as the GNU compilers.

In cluster systems that have SMP nodes and a high speed interconnect between them, programmers often treat all CPUs within the cluster as having their own local memory. On a node an MPI executable is launched on each CPU and runs within a separate address space. In this way, all CPUs appear as a set of distributed memory machines, even though each node has CPUs that share a single memory subsystem.

In clusters with SMPs, hybrid programming is sometimes employed to take advantage of higher performance at the node-level for certain algorithms that use SMPP (OMP) parallel coding techniques. In hybrid programming, OMP code is executed on the node as a single process with multiple threads (or an OMP library routine is called), while MPI programming is used at the cluster-level for exchanging data between the distributed memories of the nodes.

The number of applications that benefit from hybrid programming on dual-processor nodes (e.g. on Lonestar) is very small. The programming and support of hybrid codes is complicated by the need for compiler and platform support of both paradigms. However, with the new multi-core, multi-socket commodity systems on the horizon, there may be a resurgence in hybrid programming if these systems provide better performance with SMPP (OMP) algorithms.

For further information, please see the Manuals section of this document.

 

Compiling Code

The Lonestar programming environment uses Intel C++ and Intel Fortran compilers by default. This section highlights the important HPC aspects of using the Intel compilers. The Intel compiler commands can be used for both compiling (making ".o" object files) and linking (making an executable from ".o" object files).

The Intel Compiler Suite

The latest Intel compiler available is loaded as the default at login with the intel module. (The previous version of the compiler is available for special porting needs. Use the 'module avail' command to list all installed modules, including versions and default information where applicable.) The gcc compiler and module are also available (use 'gcc --version' to display version information), but we recommend using the Intel suite whenever possible. The Intel suite is installed with the EM64T 64-bit standard libraries and compiles programs as 64-bit applications by default. Any programs compiled on 32-bit systems need to be recompiled to run natively on Lonestar. Any pre-compiled packages should be EM64T (x86-64) compiled, or errors may occur. Since only 64-bit versions of the MPI libraries have been built on Lonestar, programs compiled in 32-bit mode will not execute MPI code.

The Intel Fortran compiler command is ifort (use 'ifort -V' for current version information).

Web accessible Intel manuals are available: Intel C++ Compiler Documentation and Intel Fortran Compiler Documentation.

Compiling Serial Programs

The table below lists the syntax for serial program compilation.

Compiler    Language    File Extension          Example
icc         C           .c                      icc [compiler_options] prog.c
icc         C++         .C, .cc, .cpp, .cxx     icc [compiler_options] prog.cpp
ifort       F77         .f, .for, .ftn          ifort [compiler_options] prog.f
ifort       F90         .f90, .fpp              ifort [compiler_options] prog.f90

Appropriate file name extensions are required for each compiler. By default, the executable name is a.out; it may be renamed with the -o option. To compile without the link step, use the -c option. The following examples illustrate renaming an executable and the use of two important compiler optimization options.

A C program example:
lslogin2% icc -o flamec.exe -O3 -xT prog.c
A Fortran program example:
lslogin2% ifort -o flamef.exe -O3 -xT prog.f90

Commonly used options may be placed in an icc.cfg or ifc.cfg file for compiling C and Fortran code, respectively.

For additional information, execute the compiler command with the -help option to display all compiler options, their syntax, and a brief explanation, or display the man page, as follows:

lslogin2% icc -help
lslogin2% ifort -help
lslogin2% man icc
lslogin2% man ifort

Some of the more important options are listed in the Basic Optimization section of this guide. Additional documentation, references, and a number of user guides (pdf, html) are available in the Fortran and C++ compiler home directories ($IFC_DOC and $ICC_DOC).

OpenMP Programs

Since each of the PowerEdge blades (nodes) of the Lonestar cluster is a Xeon dual-processor system, applications can use the shared memory programming paradigm "on node". However, because of the limited number of processors in each node, there are rarely any significant performance benefits to using a shared-memory model on the node.

The OpenMP compiler options are listed in the Basic Optimization section of this guide, for those who need SMP support on the nodes. For hybrid programming, use the mpi-compiler commands, and include the openmp options.

MPI Programs

The "mpicmds" commands support the compilation and execution of parallel MPI programs for specific interconnects and compilers. At login, the MVAPICH MPI (mvapich) and Intel compiler (intel) modules are loaded to produce the default environment, which provides the location of the corresponding mpicmds.

Compiling Parallel Programs with MPI

The mpicc, mpiCC, mpif77, and mpif90 compiler scripts (wrappers) compile MPI code and automatically link startup and message passing libraries into the executable. The following table lists the compiler wrappers for each language:

Compiler    Language    File Extension          Example
mpicc       C           .c                      mpicc [compiler_options] prog.c
mpiCC       C++         .cc, .C, .cpp, .cxx     mpiCC [compiler_options] prog.cc
mpif77      F77         .f, .for, .ftn          mpif77 [compiler_options] prog.f
mpif90      F90         .f90, .fpp              mpif90 [compiler_options] prog.f90

Appropriate file name extensions are required for each wrapper. By default, the executable name is a.out. You may rename it using the -o option. To compile without the link step, use the -c option. The following examples illustrate renaming an executable and the use of two important compiler optimization options.

A C program example:
lslogin2% mpicc -o prog.exe -O3 -xT prog.c
A Fortran program example:
lslogin2% mpif90 -o prog.exe -O3 -xT prog.f90

Include linker options, such as library paths and library names, after the program module names, as explained in the Loading Libraries section below. The Running Code section explains how to execute MPI executables in batch scripts and "interactive batch" runs on compute nodes.

We recommend that you use the Intel compiler for optimal code performance. TACC does not support the use of the gcc compiler for production code on the Lonestar system. For those rare cases when gcc is required, for either a module or the main program, you can specify the gcc compiler with the -cc=gcc option for the modules that require it. (Since gcc- and Intel-compiled code are binary compatible, you should compile all other modules that don't require gcc with the Intel compiler.) When gcc is used to compile the main program, an additional Intel library is required. The examples below show how to invoke the gcc compiler for the two cases:

A module (suba.c) that requires gcc:
lslogin2% mpicc -O3 -xT -c -cc=gcc suba.c
lslogin2% mpicc -O3 -xT mymain.c suba.o

A main program (mymain.c) that requires gcc:
lslogin2% mpicc -O3 -xT -c suba.c
lslogin2% mpicc -O3 -xT -cc=gcc -L$ICC_LIB -lirc mymain.c suba.o
 

Compiler Options

Compiler options must be used to achieve optimal performance of any application. Generally, the highest impact can be achieved by selecting an appropriate optimization level, by targeting the architecture of the computer (CPU, cache, memory system), and by allowing for interprocedural analysis (inlining, etc.). There is no set of options that gives the highest speed-up for all applications. Consequently, different combinations have to be explored.

The most basic optimization control that the compiler offers is the -On option, explained below.

Optimization Level: -On

Level       Description
n = 0       Fast compilation, full debugging support; equivalent to -g
n = 1,2     Low to moderate optimization, partial debugging support:
              • instruction rescheduling
              • copy propagation
              • software pipelining
              • common subexpression elimination
              • prefetching, loop transformations
n = 3+      Aggressive optimization - compile time/space intensive and/or marginal effectiveness; may change code semantics and results (sometimes even breaks code!):
              • enables -O2
              • more aggressive prefetching, loop transformations

The following table lists some of the more important compiler options that affect application performance, based on the target architecture, application behavior, loading, and debugging.

Option                   Description
-c                       For compilation of source file only.
-O3                      Aggressive optimization (-O2 is default).
-xT                      Generates code with streaming SIMD extensions SSE2/3/4 for EM64T architecture.
-axT                     Same as -xT, but also generates generic code.
-g                       Debugging information, generates symbol table.
-mp                      Maintain floating point precision (disables some optimizations).
-mp1                     Improve floating-point precision (speed impact is less than -mp).
-ip                      Enable single-file interprocedural (IP) optimizations (within files).
-ipo                     Enable multi-file IP optimizations (between files).
-prefetch                Enables data prefetching (requires -O3).
-openmp                  Enable the parallelizer to generate multi-threaded code based on the OpenMP directives.
-openmp_report[0|1|2]    Controls the OpenMP parallelizer diagnostic level.
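Because no single set of options is best for every application, it is common to build the same source several ways and compare timings. The sketch below only prints the compile commands (a dry run) so that it can be inspected before anything is built. The flags all come from the tables above; print_build_matrix, prog.c, and the executable naming scheme are placeholders chosen for illustration.

```shell
#!/bin/sh
# Dry run: print one compile command per candidate option set.
# Every flag below comes from the tables above; prog.c is a placeholder name.
print_build_matrix() {
    src=$1
    for opts in "-O2" "-O3" "-O3 -xT" "-O3 -xT -ip"; do
        # Build a distinct executable name from the options, e.g. prog_O3_xT.exe
        tag=$(printf '%s\n' "$opts" | tr -d '-' | tr ' ' '_')
        echo "icc $opts -o prog_${tag}.exe $src"
    done
}

print_build_matrix prog.c
```

Once the list looks right, the printed commands can be run and each executable timed on a representative input; results should also be compared for correctness, since aggressive optimization may change them.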
 

Loading Libraries

Some of the more useful load flags/options are listed below. For a more comprehensive list, consult the ld man page.

  • Use the -l loader option to link in a library at load time, e.g.

    lslogin2% ifort prog.f90 -l<name>

    This links in either the shared library libname.so (default) or the static library libname.a, provided it can be found in ld's library search path or the LD_LIBRARY_PATH environment variable paths.

  • To explicitly include a library directory, use the -L option, e.g.

    lslogin2% ifort prog.f -L/mydirectory/lib -l<name>

In the above example, the user's libname.a library is not in the default search path, so the "-L" option is specified to point to the libname.a directory.
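The way a -l<name> option is resolved against a list of directories can be pictured with a short script. The sketch below is a simplified model of the behaviour described above, not the linker's real logic: in each directory, in order, it looks for lib<name>.so first and then lib<name>.a. The function name find_library and the library name mylib are made up for the example.

```shell
#!/bin/sh
# Simplified model of how "-L<dir> -l<name>" resolves to a file:
# in each directory, in order, look for lib<name>.so first, then lib<name>.a.
# This mimics the behaviour described above; it is not the linker's real code.
find_library() {
    name=$1; shift
    for dir in "$@"; do
        for ext in so a; do
            if [ -f "$dir/lib$name.$ext" ]; then
                echo "$dir/lib$name.$ext"
                return 0
            fi
        done
    done
    return 1
}

# Demo with a throwaway directory; "mylib" is a made-up library name.
libdir=$(mktemp -d)
: > "$libdir/libmylib.a"
find_library mylib /nonexistent "$libdir"
```

If the function prints nothing and returns a non-zero status, the library is in none of the listed directories, which corresponds to the case where an additional -L option is needed.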

Many of the modules for applications and libraries, such as the mkl library module, provide environment variables for compiling and linking commands. Execute module help module_name for a description, listing, and use cases for the assigned environment variables. The following example illustrates their use for the mkl library:

lslogin2% mpicc	-Wl,-rpath,$TACC_MKL_LIB -I$TACC_MKL_INC mkl_test.c \
 		-L$TACC_MKL_LIB -lmkl_em64t

Here, the module-supplied variables TACC_MKL_LIB and TACC_MKL_INC contain the MKL library and header file directory paths, respectively. The loader option -Wl,-rpath specifies that the $TACC_MKL_LIB directory should be recorded in the binary executable. This allows the run-time dynamic loader to determine the location of shared libraries directly from the executable instead of from LD_LIBRARY_PATH or the ld.so dynamic cache of bindings between shared libraries and directory paths. (This avoids having to set LD_LIBRARY_PATH, "manually" or through a module command, before running the executable.)

Previously, Fortran programs that used utilities such as getarg were required to include the compatibility library libPEPCF90.a with the "-Vaxlib" option, when using the Intel compiler. As of version 9.1, this option is no longer necessary.

 

Performance Libraries

ISVs (Independent Software Vendors) and HPC vendors provide high performance math libraries that are tuned for specific architectures. Many applications depend on these libraries for optimal performance. Intel has developed performance libraries for most of the common math functions and routines (linear algebra, transforms, transcendental functions, sorting, etc.) for the em64t architectures. Details of the Intel libraries and specific loader/linker options are given below.

MKL library

The "Math Kernel Library" consists of functions with Fortran, C, and C++ interfaces for the following computational areas:

  • BLAS (vector-vector, matrix-vector, matrix-matrix operations) and extended BLAS for sparse computations
  • LAPACK for linear algebraic equation solvers and eigensystem analysis
  • Fast Fourier Transforms
  • Transcendental Functions

In addition, MKL also offers a set of functions collectively known as VML -- the "Vector Math Library". VML is a set of vectorized transcendental functions which offer both high performance and excellent accuracy compared to the libm functions (for most of the Intel architectures). The vectorized functions are considerably faster than standard library functions for vectors longer than a few elements.

To use MKL and VML, first load the MKL module using the command module load mkl. This will set the TACC_MKL_LIB, TACC_MKL_INC, and TACC_MKL_DOC environment variables to the directories containing the MKL libraries, the MKL header files, and the MKL documentation, respectively. Below is an example command for compiling and linking a program that contains calls to BLAS functions (in MKL). Note that the library is for use within a single node, and hence can be used with either the serial compilers or the MPI wrapper scripts.

mpicc -O3 -Wl,-rpath,$TACC_MKL_LIB -I$TACC_MKL_INC foo.c -L$TACC_MKL_LIB -lmkl_em64t

For additional documentation and reference on MKL, in both pdf and html form, please look in the directory specified by the TACC_MKL_DOC environment variable.

 

Code Tuning

Memory Subsystem Tuning

There are a number of techniques for optimizing application code and tuning the memory hierarchy.

Maximize cache reuse

  1. Always minimize stride length. Stride length 1 is optimal for most systems, and in particular for vector systems. If that is not possible, then low-stride access should be the goal. Low strides increase cache efficiency and set up hardware and software prefetching. Strides that are powers of two are typically the worst case, leading to cache misses.

    The following code snippets illustrate the correct way to access contiguous elements, i.e. stride 1, for a matrix in both Fortran and C.

    Fortran Example:
    Real*8 :: a(m,n), b(m,n), c(m,n)
    ...
    do i=1,n
      do j=1,m
        a(j,i)=b(j,i)+c(j,i)
      end do
    end do

    C Example:
    double a[m][n], b[m][n], c[m][n];
    ...
    for (i=0;i < m;i++){
      for (j=0;j < n;j++){
        a[i][j]=b[i][j]+c[i][j];
      }
    }
    
  2. Another approach is data reuse in cache by cache blocking. The idea is to load data in chunks sized to fit in the different levels of cache while they are in use. Otherwise the data has to be reloaded from memory every time it is needed, because it is no longer in cache. This is commonly known as a cache miss. It is costly from the computational standpoint, since the latency for loading data from memory is much higher than from cache. The goal is to keep as much of the data as possible in cache while it is in use and to minimize loading it from memory.

    This concept is illustrated in the following matrix-matrix multiply example where the indices for the i, j, k loops are set up in such a way so as to fit the greatest possible sizes of the different submatrices in cache while the computation is on-going.

    Example: Matrix multiplication
    Real*8 a(n,n), b(n,n), c(n,n)
    do ii=1,n,nb  ! <- nb is blocking factor
      do jj=1,n,nb
        do kk=1,n,nb
          do i=ii,min(n,ii+nb-1)
            do j=jj,min(n,jj+nb-1)
              do k=kk,min(n,kk+nb-1)
                c(i,j)=c(i,j)+a(i,k)*b(k,j)
              end do
            end do
          end do
        end do
      end do
    end do
  3. Array dimensions also matter: avoid leading dimensions that are a multiple of a high power of two. More particularly, users should be aware of the cache line size and associativity. Performance degrades when the stride between successive accesses is a multiple of the size of one cache way, because every access then maps to the same set.

    Example: Consider an L1 cache that is 16 KB in size and 4-way set associative, with a cache line of 64 bytes.

    Problem: A 16 KB 4-way set associative cache has 4 ways of 4 KB (4096 bytes) each. With 64-byte cache lines, each way holds 64 lines, for 256 lines in total. In the loop below, successive elements a(1,i) are 1024 x 8 = 8192 bytes apart, a multiple of 4096, so they all compete for the 4 lines of a single set. This effectively reduces L1 from 256 cache lines to only 4, i.e. a 256-byte cache instead of the original 16 KB, due to the non-optimal choice of leading dimension.

    Real*8 :: a(1024,50)
    ...
    do i=1,n
      a(1,i)=0.50*a(1,i)
    end do
    

    Solution: Change leading dimension to 1028 (1024 + 1/2 cache line)
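
    The effect of the leading dimension can be checked with a little arithmetic. The following C sketch counts how many distinct cache sets the accesses a(1,1), a(1,2), ... fall into. It assumes the cache geometry of the example above, 8-byte reals, and an array that starts on a cache-line boundary; the function name distinct_sets is illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Cache geometry from the example: 16 KB, 4-way, 64-byte lines. */
enum {
    CACHE_BYTES = 16384,
    WAYS        = 4,
    LINE_BYTES  = 64,
    NUM_SETS    = CACHE_BYTES / (WAYS * LINE_BYTES)   /* 64 sets */
};

/* Count the distinct cache sets touched by a(1,1), a(1,2), ..., a(1,ncols)
   for a column-major array of 8-byte reals with leading dimension ld.
   Assumes (hypothetically) that a(1,1) sits on a cache-line boundary. */
int distinct_sets(size_t ld, size_t ncols)
{
    int seen[NUM_SETS] = {0};
    int count = 0;

    assert(ld > 0);
    for (size_t col = 0; col < ncols; col++) {
        size_t byte_offset = col * ld * sizeof(double);
        size_t set = (byte_offset / LINE_BYTES) % NUM_SETS;
        if (!seen[set]) {
            seen[set] = 1;
            count++;
        }
    }
    return count;
}
```

    With a leading dimension of 1024, all 50 columns land in one set, which holds only 4 lines. With 1028, the same accesses spread over 25 sets, two lines per set, so they all fit.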

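The cache blocking idea from item 2 above carries over directly to C. The sketch below is one possible arrangement for row-major storage, not a tuned implementation: the function names are illustrative, and the blocking factor nb is an assumed tuning parameter that would have to be chosen so that the tiles in use fit in cache.

```c
#include <assert.h>
#include <stddef.h>

/* Smaller of two sizes; clips a tile at the matrix edge. */
static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Reference version: c += a * b for n x n row-major matrices. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++)
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Blocked version: the same arithmetic, visited one nb x nb tile at a time
   so that the tiles of a, b and c in use can stay in cache. */
void matmul_blocked(size_t n, size_t nb,
                    const double *a, const double *b, double *c)
{
    assert(nb > 0);
    for (size_t ii = 0; ii < n; ii += nb)
        for (size_t kk = 0; kk < n; kk += nb)
            for (size_t jj = 0; jj < n; jj += nb)
                for (size_t i = ii; i < min_size(n, ii + nb); i++)
                    for (size_t k = kk; k < min_size(n, kk + nb); k++) {
                        double aik = a[i * n + k];
                        /* innermost loop runs at stride 1 in b and c */
                        for (size_t j = jj; j < min_size(n, jj + nb); j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}
```

Both routines accumulate into c, so c must be zeroed first if a plain product is wanted. The innermost loop is kept at stride 1, in line with item 1 above.
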
Encourage Data Prefetching to Hide Memory Latency

Prefetching is the ability to predict the next cache line to be accessed and start bringing it in from memory. If data is requested far enough in advance, the latency to memory can be hidden. The compiler inserts prefetch instructions into loops -- instructions that move data from main memory into cache in advance of their use. Prefetching may also be specified by the user using directives.

Example: In the following dot-product example, the number of streams prefetched is increased from 2, to 4, to 6, for the same functionality. However, just prefetching a larger number of streams does not necessarily translate into increased performance. There is a threshold value beyond which prefetching more streams can be counterproductive.

2 streams:
do i=1,n
  s=s+x(i)*y(i)
end do
dotp=s

4 streams:
do i=1,n/2
  s0=s0+x(i)*y(i)
  s1=s1+x(i+n/2)*y(i+n/2)
end do
if (mod(n,2) /= 0) s0=s0+x(n)*y(n)
dotp=s0+s1

6 streams:
do i=1,n/3
  s0=s0+x(i)*y(i)
  s1=s1+x(i+n/3)*y(i+n/3)
  s2=s2+x(i+2*(n/3))*y(i+2*(n/3))
end do
do i=3*(n/3)+1,n
  s0=s0+x(i)*y(i)
end do
dotp=s0+s1+s2

Work within available physical memory

Make sure to fit the problem size to memory. Working in virtual memory leads to performance degradation and should be avoided. In addition, swapping causes problems on some Linux systems.
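
One way to guard against this is to compare the planned array sizes with the physical memory of the node before allocating. The following C sketch assumes a Linux system: _SC_PHYS_PAGES is a glibc extension rather than part of POSIX, and the function names and the choice of fraction are illustrative only.

```c
#include <assert.h>
#include <unistd.h>

/* Physical memory of this node in bytes, or 0 if it cannot be determined.
   _SC_PHYS_PAGES is a Linux/glibc extension (an assumption of this sketch). */
unsigned long long physical_memory_bytes(void)
{
    long pages = sysconf(_SC_PHYS_PAGES);
    long page_size = sysconf(_SC_PAGESIZE);

    if (pages <= 0 || page_size <= 0)
        return 0;
    return (unsigned long long)pages * (unsigned long long)page_size;
}

/* Nonzero if n_doubles 8-byte values fit within the given fraction
   (0 < fraction <= 1) of physical memory. */
int fits_in_memory(unsigned long long n_doubles, double fraction)
{
    unsigned long long total = physical_memory_bytes();

    assert(fraction > 0.0 && fraction <= 1.0);
    if (total == 0)
        return 0;
    return (double)n_doubles * sizeof(double) <= fraction * (double)total;
}
```

A program could call fits_in_memory with the total element count of its arrays and a fraction that leaves room for the operating system and other tasks on the node, and stop early with a clear message if the check fails.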

Floating-Point Tuning

Unroll Inner Loops to Hide FP Latency

The following dot-product example illustrates two points. If the inner loop trip count is small, the loop overhead is significant, and it pays to unroll the inner loop. In addition, unrolling inner loops hides floating-point latency. A more advanced measure for micro-level optimization is the ratio of floating-point multiply-add operations to data accesses in a compute step. The higher this ratio, the better.

...
do i=1,n,k
  s1 = s1 + x(i)*y(i)
  s2 = s2 + x(i+1)*y(i+1)
  s3 = s3 + x(i+2)*y(i+2)
  s4 = s4 + x(i+3)*y(i+3)
  ...
  sk = sk + x(i+k-1)*y(i+k-1)
end do
...
dotp = s1 + s2 + s3 + s4 + ... + sk
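
For a concrete case, the C sketch below unrolls the dot product by four and adds a cleanup loop for lengths that are not a multiple of four, which the schematic loop above leaves out. The function name and the unroll factor are illustrative choices, not a recommendation for any particular processor.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product unrolled by four. The four partial sums are independent,
   so consecutive multiply-adds do not have to wait on one another. */
double dot_unrolled4(size_t n, const double *x, const double *y)
{
    double s1 = 0.0, s2 = 0.0, s3 = 0.0, s4 = 0.0;
    size_t i = 0;

    assert(n == 0 || (x != NULL && y != NULL));

    /* main loop: four elements per iteration */
    for (; i + 4 <= n; i += 4) {
        s1 += x[i]     * y[i];
        s2 += x[i + 1] * y[i + 1];
        s3 += x[i + 2] * y[i + 2];
        s4 += x[i + 3] * y[i + 3];
    }
    /* cleanup loop: the remaining 0 to 3 elements */
    for (; i < n; i++)
        s1 += x[i] * y[i];

    return (s1 + s2) + (s3 + s4);
}
```

Because the partial sums are added in a different order than in the simple loop, the result can differ from it in the last few bits for general floating-point data.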

Avoid Divide Operations

A floating-point divide is more expensive than a multiply, so the following transformation is very common. If a divide by a loop-invariant value is inside a loop, it is better to compute the reciprocal once outside the loop and multiply by it inside the loop, provided no dependencies exist. Another alternative is to replace the loop with a call to an optimized vector intrinsics library, if available.

a=...  
do i=1,n 
x(i)=x(i)/a 
end do
becomes:
 a=...
 ainv=1.0/a
 do i=1,n
   x(i)=x(i)*ainv 
 end do

I/O Subsystem Tuning

A common-sense first step is to use what the vendor provides, i.e. to take advantage of the hardware. On Linux-based clusters, for example, this would mean using the Parallel Virtual Filesystem (PVFS). On IBM systems, it would mean using the fast General Parallel File System (GPFS) provided by IBM.

Another sensible approach to optimizing I/O is to be aware of the existence and locations of the filesystems, i.e. whether a filesystem is locally mounted or accessed remotely. The former is much faster than the latter, because remote access is limited by network bandwidth, disk speed, and the overhead of accessing the filesystem over the network. Using local filesystems where possible should be a goal at the design level.

Other approaches consider the best software options available. Some of them are enumerated below:

  1. Read or write as much data as possible with a single READ/WRITE/PRINT. Avoid performing multiple writes of small records.

  2. Use binary instead of ASCII format because of the overhead incurred converting from the internal representation of real numbers to a character string. In addition, ASCII files are larger than the corresponding binary file.

  3. In Fortran, prefer direct access to sequential access. Direct or random access files do not have record length indicators at the beginning and end of each record.

  4. If available, use asynchronous I/O to overlap reads/writes with computation.
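
Items 1 and 2 can be combined in C by writing a whole array as one unformatted block. The sketch below is a minimal illustration with hypothetical function names; it leaves opening and closing the file to the caller.

```c
#include <assert.h>
#include <stdio.h>

/* Write n doubles to an open binary stream with a single call.
   Returns 0 on success, -1 if fewer than n values were written. */
int write_block(FILE *fp, const double *data, size_t n)
{
    assert(fp != NULL && data != NULL);
    return fwrite(data, sizeof(double), n, fp) == n ? 0 : -1;
}

/* Read n doubles back with a single call.
   Returns 0 on success, -1 if fewer than n values were read. */
int read_block(FILE *fp, double *data, size_t n)
{
    assert(fp != NULL && data != NULL);
    return fread(data, sizeof(double), n, fp) == n ? 0 : -1;
}
```

The file should be opened in binary mode ("wb" or "rb"). A file written this way holds the machine's internal representation, so it is compact and exact but not portable between systems with different byte order.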

Fortran90 Performance Pitfalls

Several coding issues impact the performance of Fortran90 applications. For example, consider the following two ways of coding the same update of three-dimensional arrays, with explicit loops and with F90 array syntax:

Case 1:
do j = js,je
  do k = ks,ke
    do i = is,ie
      rt(i,k,j) = rt(i,k,j) - smdiv*(rt(i,k,j) - rtold(i,k,j))
    enddo
  enddo
enddo
Case 2:
rt(is:ie,ks:ke,js:je)=rt(is:ie,ks:ke,js:je) - &
    smdiv * (rt(is:ie,ks:ke,js:je) - rtold(is:ie,ks:ke,js:je))

The array syntax in the computation step of the second approach leads to a significant performance penalty over using explicit loops on cache-based systems, although it is more elegant. Vector systems tend to prefer this array syntax from a performance standpoint. More importantly, the array syntax generates larger temporary arrays on the program stack.

The way the arrays are declared also impacts performance. The following example compares two declarations of the same arrays: with explicit bounds (Case 1) and as F90 assumed-shape arrays (Case 2). In the second case, the negative performance impact is significant: compile time rises from 46 to 3120 seconds (almost seventy-fold), and run time per call rises by about 30%.

Case 1:
REAL, DIMENSION( ims:ime , kms:kme , jms:jme ) :: r, rt, rw, rtold
Results in F77-style assumed-size arrays

  Compile time:  46 seconds
  Run time:     .064 seconds / call
Case 2:
REAL, DIMENSION( ims:    , kms:    , jms:    ) :: r, rt, rw, rtold
Results in F90-style assumed-shape arrays
  Compile time:  3120 seconds!!
  Run time:     .083 seconds / call

Another issue with F90 assumed-shape arrays arises when they are used as subroutine parameters. Passing an assumed-shape array to a subroutine may result in the subroutine being passed a copy, rather than the address of the array itself. This F90 copy-in/copy-out overhead is not only inefficient, but may cause errors when calling external libraries.

 

Usage

 

Running Code

Runtime Environment

Bindings to the most recent shared libraries are configured in the file /etc/ld.so.conf (and cached in the /etc/ld.so.cache file). Use cat /etc/ld.so.conf to see the TACC configured directories, or execute the following command to see a list of directories and candidate libraries:

lslogin2% /sbin/ldconfig -p

Use the -Wl,-rpath loader option or the LD_LIBRARY_PATH environment variable to override the default runtime bindings.

The Intel compiler and MKL math libraries are located in the /opt/intel directory, and application libraries are located in /usr/local/apps ($APPS). The GOTO libraries are located in $TACC_GOTOBLAS_LIB. Use the module help libname command to display instructions and examples on loading libraries.

The LSF Batch System

Batch facilities such as LoadLeveler, LSF, and SGE differ in their user interface as well as the implementation of the batch environment. Common to all, however, is the availability of tools and commands to perform the most important operations in batch processing: job submission, job monitoring, and job control (hold, delete, resource request modification). The following paragraphs list the basic batch operations and their options, explain how to use the LSF batch environment, and describe the queue structure.

The references at the end of this section contain links to the LSF manuals. New users should bookmark or print the LSF Quick Reference. To help those migrating from other systems, a comparison of the IBM LoadLeveler, LSF, and SGE batch options and commands is offered in a separate document.

Step 1: Job Submission

Use the LSF bsub command to submit a batch job with the following syntax:

lslogin2% bsub < job_script

where job_script is the name of a UNIX file containing job script commands, which is used as input to the bsub command, as indicated by the "<" redirection symbol. This "job script" file may contain both shell commands and special commented statements that include bsub options and resource specifications. The most common of these options are listed in Table 2.

Table 2. List of the Most Common bsub Options
Option Argument Function
-q queue_name Submits to queue designated by queue_name.
-J job_name Names the job job_name.
-L shell Use shell as login shell for the batch session. Specify using absolute path.
-I   Submits interactive batch job (i.e. stdin and stdout are redirected to the terminal).
-u emailaddress Email address to use for -B and -N options.
-B   Sends email at job start.
-N   Sends job output by mail when job finishes. If used with -o, job output is sent by mail and saved in output file.
-i input_file Reads stdin from input_file.
-o output_file Direct job output to output_file.
-e error_file Direct job error to error_file.
-c [hours:]minutes Limits job CPU time to that specified.
-W [hours:]minutes Limits job wall clock time to that specified.
-F file_limit Set a per-process (soft) file space limit (in KB).
-M memory Set the per-process memory limit (in KB).
-n min_proc[, max_proc] Request min_proc-max_proc number of processor cores.
-P project_name Charges run to project_name. Used only for multi-project logins. Project names and accounting reports are displayed at login.
-f "local_file operator [remote_file]" Transfers files between the local host and the remote host. See Table 2.1 for a list of operator values and their meanings.

Table 2.1 List of Operators for the bsub -f Option
Operator Action
> Copies file from local to remote host before the job starts.
< Copies file from remote to local host after job completes.
<< Appends remote file to local file at job completion.
<>, >< Copies local file to remote file before job starts (overwrites if remote file exists) and then copies remote file to local file at job completion (overwrites if local file exists).

You can pass bsub options from the command line, or specify them in the job script file. The latter approach is preferable. It is easier to store commonly used bsub options in a script file that will be reused several times than to retype them for every batch request. In addition, it is easier to maintain a consistent batch environment across runs if the same options are stored in a reusable job script.

Batch scripts contain two types of statements: special comments and shell commands. Special comment lines begin with #BSUB and are followed by bsub options. UNIX shell commands are executed by the shell specified with the -L option, or by the UNIX "magic" first-line shell descriptor (if the -L option is not specified). The example job script below requests four processor cores and 1.5 hours of run time:

#!/bin/tcsh
# the first line specifies the shell
#BSUB -J jobname	#name the job "jobname"
#BSUB -o out.o%J	#output -> out.o<jobid>
#BSUB -e err.o%J	#error -> err.o<jobid>
#BSUB -n 4 -W 1:30	#4 CPU cores and 1hr+30min
#BSUB -q normal	#Use normal queue
set echo	#Echo all commands
cd $LS_SUBCWD	#change to submission directory
ibrun ./a.out	#use ibrun for "pam -g 1 mvapich_wrapper"
 		#CPUs are specified above in -n option

The job output and error are sent to files named out.o and err.o followed by the job id, respectively. LSF provides several "%" macros for the #BSUB option lines that are evaluated at submission time. The %J string above is substituted with the job id (you cannot use environment variables in the #BSUB statements). The job name (set with -J) and the job id are assigned to the environment variables LSB_JOBNAME and LSB_JOBID. Both are often used within the UNIX commands. The memory limit per task on a node is automatically adjusted to the maximum memory available to a user application (for serial and parallel codes).

The ptile option has been modified to accommodate dual-core processing. The general syntax is:

#BSUB -R 'span[ptile=X]'

The ptile value, X={1, 2, 3, or 4}, defines the number of MPI tasks allocated per node. It also sets the maximum amount of memory per task on the node. The values are listed below. If the ptile option is not specified, the default value, ptile=4, is used (this allows 4 tasks to be launched on the 4 processor cores of each node).

If you need to run an application that requires more than 1.92GB of memory per task, then run with fewer tasks per node. The memory limits and processor core count for each of the four options are:

#BSUB -R 'span[ptile=4]' 4 MPI tasks per node 1.92 GB memory/task
#BSUB -R 'span[ptile=3]' 3 MPI tasks per node 2.56 GB memory/task
#BSUB -R 'span[ptile=2]' 2 MPI tasks per node 3.85 GB memory/task
#BSUB -R 'span[ptile=1]' 1 MPI task per node 7.70 GB memory/task

SU charges are based on the number of nodes allocated (not the number of cores used per node), because the node is dedicated to the user's job, and is not shared with any other job.

Consequently, a job using ptile=1 to obtain the maximum memory per MPI task, will incur an SU charge four times larger than a default run using ptile=4 (and requesting the same number of tasks). To illustrate the determination of the number of nodes assigned, the task occupation on a node, and the SU cost, consider the following job script which uses ptile=3 and requests 32 MPI tasks for an hour:

#!/bin/tcsh
# the first line specifies the shell
#BSUB -J mysimulation
#BSUB -q normal
#BSUB -P myaccount
#BSUB -o output.%J.out
#BSUB -W 1:00
#BSUB -n 32
#BSUB -R 'span[ptile=3]'
ibrun ./a.out

This job will execute on 11 nodes. This is determined by dividing the tasks requested (-n option) by the number of tasks per node (ptile option), and rounding up to the nearest integer:

nodes used = (MPI tasks requested) / (MPI tasks per node), rounded up
           = 32 / 3 = 10.67 => 11 nodes

 

Three tasks are assigned to each node, with the last node filled by the residuals (2 in this case). Since the SU charge is based on the number of allocated nodes, the cost for 1 hr usage will be:

SU = nodes used * 4 cores/node * 1SU/1-core-hr * hrs used = 11 * 4 * 1 = 44 SUs
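
The node count and SU charge can be worked out for any combination of -n and ptile values. The C sketch below follows the two formulas above; the function names are illustrative, and the queue charge rate is taken to be 1.0.

```c
#include <assert.h>

enum { CORES_PER_NODE = 4 };   /* each Lonestar node has four cores */

/* Nodes allocated for ntasks MPI tasks at ptile tasks per node:
   ntasks / ptile, rounded up to the nearest integer. */
int nodes_needed(int ntasks, int ptile)
{
    assert(ntasks > 0 && ptile >= 1 && ptile <= CORES_PER_NODE);
    return (ntasks + ptile - 1) / ptile;
}

/* SUs charged at 1 SU per core-hour. All four cores of every allocated
   node are billed, whether or not a task runs on them. */
double su_charge(int ntasks, int ptile, double hours)
{
    return nodes_needed(ntasks, ptile) * CORES_PER_NODE * hours;
}
```

For the job script above, nodes_needed(32, 3) gives 11 and su_charge(32, 3, 1.0) gives 44. The same 32 tasks cost 32 SUs per hour with ptile=4 and 128 SUs per hour with ptile=1.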

Accounts with multiple projects should specify the project name to charge, according to the syntax in the example below for the A-abc project:

#BSUB -P A-abc

The accounting report displayed at login lists the account's project information.

Step 2: Batch query

After job submission, users can monitor the status of their jobs with the bjobs command. Table 2.2 lists the bjobs options:

Table 2.2 List of bjobs Options
Option Action
-a Show all jobs.
-r Show running jobs only.
-p Displays pending jobs with reasons for pending state.
-lp Displays pending jobs with reasons and host names.
-s Show suspended jobs and reasons for suspension.
-l Shows "long" list of job details.

The bjobs command output includes a list of jobs and the following fields for each job:

Table 2.3 Some of the Fields in the bjobs Command Output
Field Description
JOBID job id assigned to the job
USER user that owns the job
STAT current job status. Includes, but is not limited to the status codes described in Table 2.4
QUEUE queue job was submitted to
SUBMIT_TIME time at which job was submitted

Table 2.4 Some bjobs Command Status Codes
Status Code Description
PEND job hasn't started yet
PSUSP job suspended while pending
RUN job is running
USUSP job suspended while running
SSUSP job suspended by LSF
DONE job completed with status of 0
EXIT job terminated with non-zero status

For convenience, TACC has created an additional job monitoring utility which summarizes all jobs in the batch system in a manner similar to the "showq" utility from PBS. For example:

lslogin2% showq

will summarize all running, idle, and pending jobs, along with any advanced reservations scheduled within the next week. The showq -u command displays information only on jobs associated with your userid (use showq --help to obtain more information on the available options).

Step 3: Job control

Control of job behavior takes many forms:

  1. Job modification while in the pending/run state

    Users can reset the bsub options of a pending job with the bmod command, using the following syntax:

    bmod < options > < job_id >

    where options refers only to the following bsub resource options (also described in Table 2):

    -c per-job cpu time
    -W per-job wall clock time
    -o output file
    -e error file
    -r re-runnable jobs

    In addition, a running job resource value may be decreased relative to the old value. Increasing the resource limits of a running job is not allowed.

  2. Job deletion
    The bkill command is used to remove pending and running jobs from the queue. For running jobs, bkill sends the SIGINT, SIGTERM, and SIGKILL signals, with a 10 second interval between each signal, to the processes of the job. The following table explains the different bkill invocations:
    bkill < job_id > Removes pending or running job.
    bkill -s 9 < job_id > Sends SIGKILL immediately to running job.
    bkill -r < job_id > ** Removes job from LSF without waiting for processes to terminate.
    ** Use this only if you are running a parallel code under pam control, AND immediately
    report the job number to TACC staff through the portal consulting system.
    (This may leave hung processes that can interfere with the next job.)
  3. Job suspension/resumption
    The bstop and bresume commands allow users to stop and resume jobs, respectively. The syntax is:
    bstop < job_id >
    bresume < job_id >

    The bstop command may be used to stop serial or parallel jobs and can be invoked by a user or a person with LSF system admin privileges. A user cannot resume a job that was suspended by a system admin, nor resume a job owned by another user.

The LSF Batch Environment

In addition to the environment variables inherited by the job from the interactive login environment, LSF sets several other variables in every batch session. The following table lists some of the important LSF variables:

Table 2.5 LSF Batch Environment Variables
Environment Variable Contains
LSB_ERRORFILE name of the error file
LSB_JOBID batch job id
LS_JOBPID process id of the job
LSB_HOSTS list of hosts assigned to the job. Multi-cpu hosts will appear more than once
LSB_QUEUE batch queue to which job was submitted
LSB_JOBNAME name user assigned to the job
LS_SUBCWD directory of submission, set to $cwd when the job is submitted
LSB_INTERACTIVE set to 'y' when the -I option is used with bsub
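
As an illustration of how a program might use these variables, the C sketch below counts the entries in an LSB_HOSTS-style list. It assumes the host names are separated by whitespace, and the function name is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Count the whitespace-separated entries in an LSB_HOSTS-style string.
   A host name appears once per assigned core, so a node running four
   tasks contributes four entries. */
int count_slots(const char *hosts)
{
    int count = 0;

    assert(hosts != NULL);
    while (*hosts != '\0') {
        hosts += strspn(hosts, " \t\n");            /* skip separators */
        size_t len = strcspn(hosts, " \t\n");       /* length of next name */
        if (len == 0)
            break;
        count++;
        hosts += len;
    }
    return count;
}
```

Inside a batch job, the string would typically come from getenv("LSB_HOSTS"), which returns NULL when the variable is not set, so the caller should check for that before calling count_slots.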

Lonestar Queue Structure

Below is a table of queue names and the characteristics (wall-clock and processor limits and default values; priority charge factor; and purpose) for the Lonestar queues. The systest and support queues are for TACC system and HPC group testing and consulting support, respectively.

Table 2.6 Lonestar LSF Batch Queues
Queue Name Max Runtime (default) Max Processors SU Charge Rate Purpose
serial 12 hrs. 1 1.0 Normal usage for uni-processor jobs
normal 48 hrs. 512 1.0 Normal Priority
high 48 hrs. 512 1.8 High priority
hero (varies) hrs. >512 1.0 Large node count, has user control list
development 30 min. 16 1.0 Debugging and Development, also runs Interactive jobs
systest -- -- -- TACC Staff only, debugging & benchmarking
request -- -- -- Special request configuration
spruce -- -- -- Debugging and development, Special PRiority & Urgent Comp. Env.

Interactive Jobs

Use the bsub command with the -I option on the login node to launch an interactive job. A job is considered interactive if it receives input from stdin and sends output to stdout. When the job starts, it inherits the present execution environment. LSF sets environment variables related to the batch session, such as LSB_HOSTS, LS_SUBCWD, and LSB_JOBID, while preserving variables such as PATH, LD_LIBRARY_PATH, etc.

Output is not sent to the terminal until after the job is completed.

There are three ways to run an interactive job in LSF:

  1. Enter a single command as the final argument to bsub. The following example runs an MPI executable on four processors:
    bsub -I -n 4 -W 0:05 -q development ibrun ./a.out
  2. Enter a series of commands at the bsub interactive prompt after entering the LSF interface, and end the input with a control-D (^D). Note that only bsub options appear on the bsub command line. Use this mode for executing multiple commands. The example below runs three MPI jobs, each on four processors:
    lslogin2%	bsub -I -n 4 -W 0:05 -q development
    bsub>	pwd
    bsub>	ibrun ./a.out 4 1
    bsub>	ibrun ./a.out 1 4
    bsub>	ibrun ./a.out 2 2
    bsub>	^D
    Job < number > is submitted to queue
    < ... other output from executable >
  3. Enter commands by redirecting an input file through stdin. This is recommended for executing multiple commands and testing job scripts.
    #BSUB -n 4
    #BSUB -W 0:05
    #BSUB -q development
    ibrun ./a.out 2 2
    ibrun ./a.out 1 4
    ibrun ./a.out 4 1
     
    lslogin2% bsub -I < job
    
    Job < number > is submitted to queue
    < ... other output from executable >

Submit all interactive jobs to the development queue. This queue has a maximum cpu limit of 16 and a time limit of 30 minutes. If the requested number of cpus is greater than the number available, the interactive job will wait until enough cpus are free to schedule the job. Although the interactive bsub command appears to hang, this is normal and indicates that the job is waiting to execute.

NOTE: TACC usage policy does not allow users to run interactive or serial programs on the login nodes of the HPC systems. All such executions must be submitted directly to an appropriate queue of the system's batch utility. On the Lonestar system, use an LSF job script with the number of processors set to 1 (#BSUB -n 1), and submit the job to the serial queue (#BSUB -q serial).

 

Basic Optimization

Basic Optimization for Serial and Parallel Programming using OpenMP and MPI

The MPI compiler wrappers use the same compilers that are invoked for serial code compilation. So, any of the compiler flags used with the icc command can also be used with mpicc; likewise for ifort and mpif90; and iCC and mpiCC. Below are some of the common serial compiler options with descriptions.

Compiler Options Description
-O3 performs some compile time and memory intensive optimizations in addition to those executed with -O2, but may not improve performance for all programs.
-vec_report[0|...|5] controls amount of vectorizer diagnostic information.
-xT includes specialized code for the SSSE3 instruction set.
-fast DO NOT USE - static load not allowed.
-g -fp generates debugging information, disables using EBP as general purpose register.
-openmp enables the parallelizer to generate multi-threaded code based on the OpenMP directives.
-openmp_report[0|1|2] controls the OpenMP parallelizer diagnostic level.
-help lists options.

 

Developers often experiment with the following options: -pad, -align, -ip, -no-rec-div and -no-rec-sqrt. In some codes performance may decrease. Please see the Intel compiler manual (below) for a full description of each option.

Use the -help option with the MPI commands for additional information:

lslogin2% mpicc -help
lslogin2% mpif90 -help
lslogin2% mpirun -help

Use the options listed for mpirun with the ibrun command in your job script. For details on the MPI standard, go to: www.mcs.anl.gov/mpi.

 

Tools

Program Timers and Performance Tools

Measuring the performance of a program should be an integral part of code development. It provides benchmarks to gauge the effectiveness of performance modifications and can be used to evaluate the scalability of the whole package and/or specific routines. There are quite a few tools for measuring performance, ranging from simple timers to hardware counters. Reporting methods vary too, from simple ASCII text to X-Window graphs of time series.

The most accurate way to evaluate changes in overall performance is to measure the wall-clock (real) time when an executable is running in a dedicated environment. On Symmetric Multi-Processor (SMP) machines, where resources are shared (e.g., the TACC IBM Power4 P690 nodes), user time plus sys time is a reasonable metric; but the values will not be as consistent as when running without any other user processes on the system. The user and sys times are the amount of time a user's application executes the code's instructions and the amount of time the kernel spends executing system calls on behalf of the user, respectively.

Package Timers

The time command is available on most UNIX systems. In some shells there is a built-in time command, but it doesn't have the functionality of the command found in /usr/bin. Therefore you might have to use the full pathname to access the time command in /usr/bin. To measure a program's time, run the executable with time using the syntax:

/usr/bin/time -p ./a.out

The -p option selects the portable (POSIX) output format, with times reported in seconds. See the time man page for additional information.

To use time with an MPI task, use:

/usr/bin/time -p mpirun -np 4 ./a.out

This example provides timing information only for the rank 0 task on the master node (the node that executes the job script); however, the time output labeled "real" is applicable to all tasks since MPI tasks terminate together. The user and sys times may vary markedly from task to task if they do not perform the same amount of computational work (not load balanced).

Code Section Timers

"Section" timing is another popular mechanism for obtaining timing information. Use these to measure the performance of individual routines or blocks of code by inserting the timer calls before and after the regions of interest. Several of the more common timers and their characteristics are listed below.

Code Section Timers
Routine Type Resolution (usec) OS/Compiler
times user/sys 1000 Linux/AIX/IRIX/UNICOS
getrusage wall/user/sys 1000 Linux/AIX/IRIX
gettimeofday wall clock 1 Linux/AIX/IRIX/UNICOS
rdtsc wall clock 0.1 Linux
read_real_time wall clock 0.001 AIX
system_clock wall clock system dependent Fortran90 Intrinsic
MPI_Wtime wall clock system dependent MPI Library (C & Fortran)

For general purpose or coarse-grain timings, precision is not important; therefore, the millisecond and MPI/Fortran timers should be sufficient. These timers are available on many systems, and hence can also be used when portability is important. For benchmarking loops, it is best to use the most accurate timer (and time as many loop iterations as possible to obtain a time duration of at least an order of magnitude larger than the timer resolution). The times, getrusage, gettimeofday, rdtsc, and read_real_time timers have been packaged into a group of C wrapper routines (also callable from Fortran). The routines are function calls that return double (precision) floating point numbers with units in seconds. All of these TACC wrapper timers (x_timer) can be accessed in the same way:

     Fortran                            C
     external   x_timer                 double x_timer(void);
     real*8  :: x_timer                 ...
     real*8  :: sec0, sec1, tseconds    double sec0, sec1, tseconds;
     ...                                ...
     sec0     = x_timer()               sec0     = x_timer();
     ...Fortran Code                    ...C Code
     sec1     = x_timer()               sec1     = x_timer();
     tseconds = sec1-sec0               tseconds = sec1-sec0;

Standard Profilers

The gprof profiling tool provides a convenient mechanism to obtain timing information for an entire program or package. It reports a basic profile of how much time is spent in each subroutine and can direct developers to the most time-consuming routines, the "hotspots", where optimization is most likely to be beneficial. As with all profiling tools, the code must be instrumented to collect the timing data and then executed to create a raw-data report file. Finally, the data file must be read and translated into an ASCII report or a graphic display. The instrumentation is accomplished by simply recompiling the code using the -qp (Intel compiler) option. The compilation, execution, and profiler commands for gprof are shown below with a sample Fortran program:

Profiling Serial Executables
ifort -qp prog.f90               Instruments code
a.out                            Produces gmon.out trace file
gprof                            Reads gmon.out (default args: a.out gmon.out)
                                 (report sent to STDOUT)

Profiling Parallel Executables
mpif90 -qp prog.f90              Instruments code
setenv GMON_OUT_PREFIX gout.*    Forces each task to produce a gout file
mpirun -np < # > a.out           Produces the gout trace files
gprof -s gout.*                  Combines gout files into gmon.sum
gprof a.out gmon.sum             Reads executable (a.out) & gmon.sum
                                 (report sent to STDOUT)

Detailed documentation is available at www.gnu.org.

Timing Tools

Most of the advanced timing tools access hardware counters and can provide performance characteristics about floating-point and integer operations, as well as memory access, cache misses and hits, and instruction counts. Some tools can provide statistics for an entire executable with little or no instrumentation, while others require source code modification.

Debugging with DDT

DDT is a symbolic, parallel debugger that allows graphical debugging of MPI applications. For information on how to perform parallel debugging using DDT on Ranger, please see the DDT Debugging Guide.

 

Resources

 

Manuals

The following manuals and other reference documents were used to gather information for this User Guide and may contain additional information of use.