TACC User Guides
Ranger User Guide

System Overview

The Sun Constellation Linux Cluster, named Ranger, is one of the largest computational resources in the world. Ranger was made possible by a grant awarded by the National Science Foundation in September 2006 to TACC and its partners, including Sun Microsystems, Arizona State University, and Cornell University. Ranger entered formal production on February 4, 2008 and supports high-end computational science for NSF TeraGrid researchers throughout the United States, academic institutions within Texas, and components of the University of Texas System.

The Ranger system comprises 3,936 16-way SMP compute nodes providing 15,744 AMD Opteron™ processors, for a total of 62,976 compute cores, 123 TB of total memory, and 1.7 PB of raw global disk space. It has a theoretical peak performance of 579 TFLOPS. All Ranger nodes are interconnected using InfiniBand technology in a full-CLOS topology providing 1 GB/sec point-to-point bandwidth. A 10 PB capacity archival system is available for long-term storage and backups. Example pictures highlighting various components of the system are shown in Figures 1-3.

 

Figure 1. One of six Ranger rows: Management/IO Racks (front, black), Compute Rack (silver), and In-row Heat Exchanger (black).
Figure 2. SunBlade x6420 motherboard (compute blade).
Figure 3. InfiniBand Constellation Core Switch (partially wired).

 

 

Architecture

The Ranger compute and login nodes run a Linux OS and are managed by the Rocks 4.1 cluster toolkit. Two 3,456-port Constellation switches provide dual-plane access between NEMs (Network Element Modules) of each 12-blade chassis. Several global, parallel Lustre file systems have been configured to target different storage needs. The configuration and features for the compute nodes, interconnect, and I/O systems are described below, and summarized in Tables 1.1 through 1.6.

  • Ranger is a blade-based system. Each node is a SunBlade x6420 blade running a 2.6.18.8 Linux kernel. Each node contains four AMD Opteron Quad-Core 64-bit processors (16 cores in all) on a single board, as an SMP unit. The core frequency is 2.3 GHz, and each core supports 4 floating-point operations per clock period, for a peak performance of 9.2 GFLOPS/core or 147.2 GFLOPS/node (16 x 9.2).

  • Each node contains 32 GB of memory. The memory subsystem has a 1.0 GHz HyperTransport system bus and 2 channels with 667 MHz DDR2 DIMMs. Each socket possesses an independent memory controller connected directly to an L3 cache.

  • The interconnect topology is a 7-stage, full-CLOS fat tree with two large Sun InfiniBand Datacenter switches at the core of the fabric (each switch supports up to 3,456 SDR InfiniBand ports). Each of the 328 compute chassis is connected directly to the 2 core switches. Twelve additional frames are also connected directly to the core switches and provide file system, administration, and login capabilities.

  • File systems: Ranger's file systems are hosted on 72 Sun x4500 disk servers, each containing 48 SATA drives, and six Sun x4600 metadata servers. From this aggregate space of 1.7 PB, three global file systems are available for all users (see Tables 1.5 and 1.6).
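The headline numbers follow from the per-core figures by simple multiplication. As a quick check, the sketch below recomputes them with awk from the values quoted in this section (clock speed, floating-point results per clock, cores per node, node count, memory per node). The variable and function names are illustrative only, and memory is converted with 1 TB = 1024 GB, which is an assumption about the units behind the 123 TB figure.

```shell
#!/bin/sh
# Recompute Ranger's headline figures from the per-core numbers in the text.
# All names here are illustrative; the inputs are the values quoted above.
CLOCK_GHZ=2.3        # core frequency
FLOPS_PER_CLOCK=4    # floating-point results per clock period
CORES_PER_NODE=16    # four quad-core sockets
NODES=3936
MEM_PER_NODE_GB=32

peak_summary() {
  awk -v f="$CLOCK_GHZ" -v w="$FLOPS_PER_CLOCK" -v c="$CORES_PER_NODE" \
      -v n="$NODES" -v m="$MEM_PER_NODE_GB" 'BEGIN {
    core = f * w                                   # GFLOPS per core
    node = core * c                                # GFLOPS per node
    printf "cores=%d\n", n * c
    printf "core_gflops=%.1f\n", core
    printf "node_gflops=%.1f\n", node
    printf "system_tflops=%.0f\n", node * n / 1000
    printf "memory_tb=%.0f\n", n * m / 1024        # assumes 1 TB = 1024 GB
  }'
}

peak_summary
```

The system-level result should match the 579 TFLOPS and 62,976 cores listed in Table 1.1.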

Table 1.1 System Configuration and Performance
Component | Technology | Performance/Size
Peak Floating Point Operations | - | 579 TFLOPS (Peak)
Nodes (blades) | Four Quad-Core AMD Opteron processors | 3,936 Nodes / 62,976 Cores
Memory | Distributed | 123 TB Aggregate
Shared Disk | Lustre, parallel File System | 1.7 PB Raw
Interconnect | InfiniBand Switch | 1 GB/sec unidirectional point-to-point bandwidth; 2.3 micro-sec max MPI latency

Table 1.2 SunBlade x6420 Compute Node
Component | Technology
Sockets per Node/Cores per Socket | 4/4 (Barcelona)
Clock Speed | 2.3 GHz
Memory Per Node | 32 GB
System Bus | HyperTransport, 6.4 GB bidirectional
Memory | 2 GB DDR2/667, PC2-5300 ECC-registered DIMMs
PCI Express | x8
Compact Flash | 8 GB

Table 1.3 Sun x4600 Login Nodes
Component | Technology
Login nodes | 2 login nodes, login3.tacc.utexas.edu and login4.tacc.utexas.edu (specific node selected when accessing ranger.tacc.utexas.edu)
Sockets per Node/Cores per Socket | 4/4 (Barcelona)
Clock Speed | 2.2 GHz
Memory Per Node | 32 GB

Table 1.4 AMD Barcelona Processor
Component | Technology
Technology | 64-bit
Clock Speed | 2.3 GHz
FP Results/Clock Period | 4
Peak Performance/core | 9.2 GFLOPS
L3 Cache | 2 MB on-die (shared)
L2 Cache | 4 x 512 KB
L1 Cache | 64 KB

Table 1.5. Storage Systems
Storage Class | Size | Architecture | Features
Local | 8 GB/node | Compact Flash | Not available to users (O/S only)
Parallel | 1.7 PB | Lustre, Sun x4500 disk servers | 72 Sun x4500 I/O data servers, 6 Sun x4600 metadata servers
Ranch (Tape Storage) | 2.8 PB | SAM-FS (Storage Archive Manager) | 10 GB/s connection through 4 GridFTP Servers

Table 1.6. Parallel File Systems
Storage Class | Size | Quota (per User) | Retention Policy
$HOME | ~100 TB | 6 GB | Backed up nightly; Not purged
$WORK | ~200 TB | 350 GB | Not backed up; Not purged
$SCRATCH | ~800 TB | 400 TB | Not backed up; Purged every 10 days

 

 

User Environment

 

System Access

ssh

To ensure a secure login session, users must connect to machines using the secure shell (ssh) program. UNIX-based systems generally have an ssh client available locally. Free clients are also available for other platforms; a popular choice for Windows users is PuTTY.

To initiate an ssh connection to a Ranger login node from a UNIX or Linux system with an ssh client already installed, execute the following command:

ssh userid@ranger.tacc.utexas.edu

where userid is replaced with the Ranger user name assigned to you during the allocation process. The userid specification is only required if the user name on the local machine differs from the one on the TACC machine. You can also log in to a specific login node. For example, to log in to Ranger's login4 node, execute the following:

ssh userid@login4.ranger.tacc.utexas.edu

Important Note: Do not run the optional ssh-keygen command on Ranger to set up public-key authentication. This option sets up a passphrase that will interfere with applications running in the batch queues. If you have already done this, remove the .ssh directory (and the files under it) from your top-level home directory. Then log out and log back in to generate a new .ssh directory with TACC-generated keys.

 

Login Information

Login Shell

The most important component of a user's environment is the login shell that interprets text on each interactive command line and statements in shell scripts. Each login has a line entry in the /etc/passwd file, and the last field contains the shell launched at login. To determine your login shell, use:

login4% echo $SHELL

You can use the chsh command to change your login shell. Full instructions are in the chsh man page. Available shells are listed, with their full paths, in the /etc/shells file.

To display the list of available shells with chsh and change your login shell to bash, execute the following:

login4% chsh -l
login4% chsh -s /bin/bash
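As a minimal sketch of the /etc/passwd lookup described above, the following function prints the last colon-separated field of a user's entry. The function name shell_from_passwd and its optional file argument are illustrative assumptions, not TACC utilities. On systems where accounts come from a directory service, the local file may hold no entry for you, in which case nothing is printed and echo $SHELL remains the simpler check.

```shell
#!/bin/sh
# shell_from_passwd USER [FILE]
# Print the login shell (last field) of USER's entry in a passwd-format file.
# Hypothetical helper for illustration; FILE defaults to /etc/passwd.
shell_from_passwd() {
  awk -F: -v user="$1" '$1 == user { print $NF; exit }' "${2:-/etc/passwd}"
}

# Look up the current user; prints nothing if there is no local entry.
shell_from_passwd "$(id -un)"
```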

User Environment

The next most important component of a user's environment is the set of environment variables. Many of the UNIX commands and tools, such as the compilers, debuggers, profilers, editors, and just about all applications that have GUIs (Graphical User Interfaces), look in the environment for variables that specify information they may need to access. To see the variables in your environment execute the command:

login4% env

The variables are listed as keyword/value pairs separated by an equals sign (=), as illustrated below by the $HOME and $PATH variables.

HOME=/home/utexas/staff/jones
PATH=/bin:/usr/bin:/usr/local/apps:/opt/intel/bin

Notice that the $PATH environment variable consists of a colon (:) separated list of directories. Variables set in the environment (with setenv for C shells and export for Bourne shells) are "carried" to the environment of shell scripts and new shell invocations, while normal "shell" variables (created with an assignment) are useful only in the present shell. Only environment variables are displayed by the env (or printenv) command. Execute set to see the (normal) shell variables.
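The difference between environment variables and normal shell variables is easiest to see by asking a child shell what it inherited. The sketch below uses Bourne-shell syntax; the variable names MY_LOCAL and MY_EXPORTED are made up for the demonstration. In a C shell, the exported case would use setenv instead of export.

```shell
#!/bin/sh
# MY_LOCAL is a normal shell variable; MY_EXPORTED is placed in the environment.
# Both names are hypothetical and exist only for this demonstration.
MY_LOCAL="only in this shell"
MY_EXPORTED="visible to children"
export MY_EXPORTED

# A child shell inherits only the exported variable, so MY_LOCAL expands to nothing.
child_view=$(sh -c 'echo "local=[${MY_LOCAL}] exported=[${MY_EXPORTED}]"')
echo "$child_view"
```

Running env in the parent shell would list MY_EXPORTED but not MY_LOCAL, while set would show both.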

Startup Scripts

All UNIX systems set up a default environment and provide administrators and users with the ability to execute additional UNIX commands to alter the environment. These commands are "sourced". That is, they are executed by your login shell, and the variables (both normal and environmental), as well as aliases and functions, are included in the present environment.

Forgetting to unload superseded modules and putting module commands in the wrong shell startup file are among the most common environment setup mistakes. With the C-based shells, put module commands in the .login_user file instead of .cshrc_user. The .login_user file is executed at login, but only after .cshrc_user, and therefore after the rest of the environment is set up. For Bourne-based shells, use the .profile_user file for module commands.

Basic site environment variables and aliases are set in the following files:

/etc/csh.cshrc {C-type shells, non-login specific}
/etc/csh.login {C-type shells, specific to login}
/etc/profile {Bourne-type shells}

TACC coordinates the environments on several systems. In order to efficiently maintain and create a common environment among these systems, TACC uses its own startup files in /usr/local/etc. A corresponding file in this etc directory is sourced by the startup script files that reside in your home directory. (Please do not remove these files and the sourcing commands in them, even if you are a UNIX guru.) Any commands you put in your .login_user, .cshrc_user, or .profile_user file are sourced (if the file exists) at the end of the corresponding /usr/local/etc command files. If you accidentally remove your .login, .cshrc, or .profile, you can copy new ones from /usr/local/etc/start-up, or execute:

login4% /usr/local/bin/install_ut_startups

This utility renames the old shell startup scripts with a date suffix and recreates the standard ones.

For historical reasons, the C based shells (csh, tcsh, etc.) source two types of files. The .cshrc type files are sourced first (/etc/csh.cshrc then $HOME/.cshrc then /usr/local/etc/cshrc then $HOME/.cshrc_user). These files are used to set up the execution environment used by all scripts and for access to the machine without an interactive login. For example, the following commands execute only the .cshrc type files on the remote machine:

scp data ranger.tacc.utexas.edu:
ssh ranger.tacc.utexas.edu date

The .login type files set up environment variables that accounts commonly use in an interactive session. They are sourced after the .cshrc type files (/etc/csh.login then $HOME/.login then /usr/local/etc/login then $HOME/.login_user).

Similarly, if your login shell is a Bourne based shell (bash, sh, ksh, etc.), the profile files are sourced (/etc/profile then $HOME/.profile then /usr/local/etc/profile then $HOME/.profile_user).

The commands in the /etc files above set operating system interaction and the initial PATH, ulimit, umask, and environment variables such as HOSTNAME. They also source command scripts in /etc/profile.d: /etc/csh.cshrc sources files ending in .csh, and /etc/profile sources files ending in .sh. Many site administrators use these scripts to set up the environments for common user tools (vim, less, etc.) and system utilities (ganglia, modules, Globus, LSF, etc.).

 

File Systems

Ranger has several different file systems with distinct storage characteristics. There are predefined directories in these file systems for you to store your data. Since these file systems are shared with others, they are managed either by a quota limit or a purge policy.

Three Lustre file systems are available: $HOME, $WORK, and $SCRATCH. The $HOME directory has a 6 GB quota. $WORK on Ranger is a non-purged, non-backed-up file system with a 350 GB quota. $SCRATCH is periodically purged and not backed up, and has a very large 400 TB quota. All file systems also impose an inode limit, which affects the number of files allowed.

To determine the amount of disk space used in a file system, cd to the directory of interest and execute the df -k . command, including the dot, which represents the current directory. Without the dot, all file systems are reported.

In the command output below, the file system name appears on the left, and the used and available space (-k, in units of 1 KB) appear in the middle columns, followed by the percent used and the mount point:

login4% df -k .
File System 1k-blocks Used Available Use% Mounted on
129.114.97.1@o2ib:/share 103836108384 290290196 103545814376 1% /share

To determine the amount of space occupied in a user-owned directory, cd to the directory and execute the du command with the -sm option (s=summary, m=units of megabytes):

login4% du -sm *

To determine quota limits and usage, execute the lfs quota command with your login account (userid) and the directory of interest:

login4% lfs quota -u <userid> $HOME
login4% lfs quota -u <userid> $WORK
login4% lfs quota -u <userid> $SCRATCH

To determine quota limits and usage for yourself and your group:

login4% lfs quota -u hinder $HOME
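For a rough picture of how close a directory tree is to a quota, the du total can be compared against the limit. The sketch below is only an approximation under stated assumptions: usage_vs_quota is a hypothetical helper, the quota value is passed in by hand (6 for $HOME, 350 for $WORK), and du reports allocated blocks rather than what Lustre accounts against you, so lfs quota remains the authoritative source.

```shell
#!/bin/sh
# usage_vs_quota DIR QUOTA_GB
# Summarize the space used under DIR (via du -sk) relative to a quota given in GB.
# Hypothetical helper; the quota is supplied by the caller, not read from Lustre.
usage_vs_quota() {
  used_kb=$(du -sk "$1" | awk '{ print $1 }')
  awk -v u="$used_kb" -v q="$2" 'BEGIN {
    printf "%.2f GB used of %d GB (%.1f%%)\n", u / 1048576, q, 100 * u / (q * 1048576)
  }'
}

# Example invocation (may take a while on a large tree):
#   usage_vs_quota "$HOME" 6
```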

Ranger has three major file systems available:

$HOME directory

At login, the system automatically sets the current working directory to your home directory.

Store your source code and build your executables here.

This directory has a quota limit of 6 GB.

The frontend nodes and any compute node can access this directory.

Use $HOME to reference your home directory in scripts.

Use cd to change to $HOME.

 

$WORK directory

Store large files here.

Change to this directory in your batch scripts and run jobs in this file system.

This directory has a quota limit of 350 GB.

The frontend nodes and any compute node can access this directory.

Currently, there is no purging policy in effect on this file system.

This file system is not backed up.

Use $WORK to reference this directory in scripts.

Use cdw to change to $WORK.

 

$SCRATCH

This is NOT a local disk file system on each node.

This is a global Lustre file system for storing temporary files.

The quota on this system is 400 TB.

Files on this system may be purged when a file's access time exceeds 10 days.

If possible, have your job scripts use and store files directly in $WORK (to avoid moving files from $SCRATCH later, before they are purged).

Use $SCRATCH to reference this file system in scripts.

Use cds to change to $SCRATCH.

 

NOTE: TACC staff may delete files from scratch if the scratch file system becomes full, even if files are less than 10 days old. A full file system inhibits use of the file system for everyone. The use of programs or scripts to actively circumvent the file purge policy will not be tolerated.
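To find results that should be copied to $WORK or archived before they age out, it can help to list files by access time. The sketch below wraps find for that purpose; stale_files is a hypothetical helper name. Note that find's -atime +N test matches files whose access age, counted in whole days, is greater than N, so a smaller threshold than 10 gives some warning before the purge window is reached.

```shell
#!/bin/sh
# stale_files DIR [DAYS]
# List regular files under DIR whose last access was more than DAYS whole days ago
# (DAYS defaults to 10). Hypothetical helper; it only lists files and changes nothing.
stale_files() {
  find "$1" -type f -atime +"${2:-10}" -print
}

# Example invocation, listing files not accessed for more than a week:
#   stale_files "$SCRATCH" 7
```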

Additionally, the $ARCHIVE variable is set on Ranger to point to the user's directory on the tape archival system, Ranch. 

This directory belongs to the tape archival system ($ARCHIVER).

Use this directory when copying files to Ranch. For example:

scp filename ${ARCHIVER}:${ARCHIVE}

The bbcp command is another, potentially faster, solution for transferring files to Ranch.

More on Archiving files -- rls and sinc not available.

 

Development

 

High Speed File Transfers

Two utilities, bbcp and globus-url-copy, can achieve higher performance than the rcp and scp programs when transferring large files between TACC clusters and the TACC archive (Ranch). During production, the scp and rcp speeds between Ranger and Ranch average about 15 MB/s, while bbcp and globus-url-copy speeds are about 125 MB/s. These values vary with I/O and network traffic.

The bbcp utility works much the same way as scp and includes an option to copy subdirectories; it is only available on TACC machines. For each transfer command it is necessary to provide your ssh passphrase or login password. You can use the ssh-agent and ssh-add commands to automatically supply passphrases for ssh commands (including bbcp) during your login session. The general bbcp syntax is:

bbcp [options] <file or directory> <to machine>:<relative path>/<file or directory>

For example, the following command transfers <data> to $ARCHIVER as <data>:

login4% bbcp <data> ${ARCHIVER}:$ARCHIVE

To transfer <data> to $ARCHIVER and force replacement, use:

login4% bbcp -f <data> ${ARCHIVER}:$ARCHIVE/<data>

To transfer directory <dir1> and subdirectories, use:

login4% bbcp -r <dir1> ${ARCHIVER}:$ARCHIVE

 

By default, bbcp does not overwrite files; use the -f option to force replacement. The -r option transfers directory contents recursively. You can also use bbcp to transfer files between Lonestar and Ranger. However, do NOT use ranger.tacc.utexas.edu or lonestar.tacc.utexas.edu as hostnames with bbcp. Use one of the local hostnames (lslogin1.ls.tacc.utexas.edu or lslogin2.ls.tacc.utexas.edu for Lonestar; and login3.ranger.tacc.utexas.edu or login4.ranger.tacc.utexas.edu for Ranger), as shown in these examples:

lslogin1% bbcp -f <data> login3.ranger.tacc.utexas.edu:
lslogin1% bbcp -f <data> login4.ranger.tacc.utexas.edu:
login3% bbcp -f <data> lslogin1.ls.tacc.utexas.edu:
login3% bbcp -f <data> lslogin2.ls.tacc.utexas.edu:

 

To transfer data between TeraGrid (TG) sites, use globus-url-copy. This command requires the use of a TeraGrid certificate to create a proxy for password-less transfers. It has a complex syntax, but provides high-speed access to other TeraGrid machines that support GridFTP services (the protocol for globus-url-copy). High-speed transfers of a file or directory occur between the different FTP servers at the TG sites. The GridFTP servers mount the file systems of the compute machines, thereby providing access to your files or directories. Third-party transfers (transfers between two machines, initiated from a third machine) are possible. For a list of TG GridFTP servers and mounted directory names for the TG, see the transfer-location page.

Use grid-proxy-init to obtain a proxy certificate. For example:

login3% grid-proxy-init

 

This command will prompt for the certificate password. The proxy is valid for 12 hours for all logins on the local machine. With globus-url-copy, you must include the name of the server and a full path to the file. The general syntax looks like:

globus-url-copy <options> gsiftp://<gridftp_server1>/<directory>|<file> \
  gsiftp://<gridftp_server2>/<directory>|<file>

 

An example file transfer might look like this:

login3% globus-url-copy -stripe -tcp-bs 11M -vb \
  gsiftp://gridftp.ranger.tacc.teragrid.org/`pwd`/file1 \
  gsiftp://tg-gridftp.ncsa.teragrid.org/home/ncsa/johndoe/file2

 

An example directory transfer might look like this:

login3% globus-url-copy -stripe -tcp-bs 11M -vb \
  gsiftp://gridftp.ranger.tacc.teragrid.org/`pwd`/directory1/ \
  gsiftp://gridftp.ranch.tacc.teragrid.org/home/00000/johndoe/directory2/

 

Use the stripe and buffer size options (-stripe to use multiple service nodes, -tcp-bs 11M to set ftp data channel buffer size). Otherwise, the speed will be about 20 times slower! When transferring directories, the directory path must end with a slash (/). The -rp option (not shown) allows paths relative to the user's "starting" directory of the file system mounted on the server. Without this option, you must specify the full path to the file or directory.
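Because a missing trailing slash on a directory path is easy to overlook, the URLs can be assembled by a small helper before the transfer is launched. The sketch below is a dry run that only prints the command; gsiftp_url is a hypothetical function, the server names and paths are the ones from the examples above, and the options are the same -stripe, -tcp-bs 11M, and -vb settings.

```shell
#!/bin/sh
# gsiftp_url SERVER ABSOLUTE_PATH [dir]
# Build a gsiftp:// URL; with "dir" as the third argument, guarantee a trailing slash.
# Hypothetical helper for illustration only.
gsiftp_url() {
  url="gsiftp://$1$2"
  if [ "${3:-}" = "dir" ]; then
    case "$url" in
      */) ;;                 # already ends with a slash
      *)  url="$url/" ;;
    esac
  fi
  printf '%s\n' "$url"
}

src=$(gsiftp_url gridftp.ranger.tacc.teragrid.org "$PWD/directory1" dir)
dst=$(gsiftp_url gridftp.ranch.tacc.teragrid.org /home/00000/johndoe/directory2 dir)

# Dry run: print the command instead of executing it.
echo globus-url-copy -stripe -tcp-bs 11M -vb "$src" "$dst"
```

Removing the echo would run the transfer, provided a valid proxy from grid-proxy-init is in place.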

 

Software on TACC Resources

A collection of program libraries and software packages is supported on Ranger across diverse disciplines. For a comprehensive list of software packages available, visit the TACC Software page. These software products for the supercomputing environment have been selected on the basis of quality, history of performance, system compatibility, and benefit to the scientific community. If you need a particular software package for your work, let us know via the Consulting Form. An organized and customizable listing of all packages (name, version, etc.) as well as execution/loading information is available in the Software and Tools Table. The same information for individual packages can be found in the module files on the machine by executing the module help command (module help <module_name>).

Modules

TACC continually updates application packages, compilers, communications libraries, tools, and math libraries. To facilitate this task and to provide a uniform mechanism for accessing different revisions of software, TACC uses the modules utility.

At login, modules commands set up a basic environment for the default compilers, tools, and libraries: for example, the $PATH, $MANPATH, and $LIBPATH environment variables, directory locations ($WORK, $HOME, etc.), aliases (cdw, cdh, etc.), and license paths. Therefore, there is no need for you to set them or update them when updates are made to system and application software.

Users that require 3rd party applications, special libraries, and tools for their projects can quickly tailor their environment with only the applications and tools they need. Using modules to define a specific application environment allows you to keep your environment free from the clutter of all the application environments you don't need.

Each of the major TACC applications has a modulefile that sets, unsets, appends to, or prepends to environment variables such as $PATH, $LD_LIBRARY_PATH, $INCLUDE_PATH, $MANPATH for the specific application. Each modulefile also sets functions or aliases for use with the application. You need only to invoke a single command to configure the application/programming environment properly. The general format of this command is:

module load <module_name>

where <module_name> is the name of the module to load. If you often need an application environment, place the module commands required in your .login_user and/or .profile_user shell startup file.

Most of the package directories are in /opt/apps ($APPS) and are named after the package. In each package directory there are subdirectories that contain the specific versions of the package.

As an example, the fftw3 package requires several environment variables that point to its home, libraries, include files, and documentation. These can be set up by loading the fftw3 module:

login4% module load fftw3

To see a list of available modules, a synopsis of a particular modulefile's operations (in this case, fftw3), and a list of currently loaded modules, execute the following commands:

login4% module avail
login4% module help fftw3
login4% module list

During upgrades, new modulefiles are created to reflect the changes made to the environment variables. TACC will always announce upgrades and module changes in advance.
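As an illustration of placing module commands in a startup file, the fragment below sketches what a $HOME/.profile_user for a Bourne-type shell might contain. The load_if_available wrapper is a hypothetical convenience, not part of the modules utility; it simply skips the load when the module command is not defined in the current shell, so the file can be sourced safely in other contexts. The fftw3 module is the example package used above.

```shell
# Sketch of a $HOME/.profile_user for Bourne-type shells.
# load_if_available is a hypothetical helper, not a TACC or modules command.
load_if_available() {
  if command -v module >/dev/null 2>&1; then
    module load "$1"
  else
    return 0    # no module command in this shell; skip quietly
  fi
}

load_if_available fftw3
```

C-shell users would put the equivalent module load line in .login_user instead.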

 

Programming Models

There are two distinct memory models for computing: distributed-memory and shared-memory. In the former, the message passing interface (MPI) is employed in programs to communicate between processors that use their own memory address space. In the latter, OpenMP (OMP) programming techniques are employed for multiple threads (lightweight processes) to access memory in a common address space.

For distributed memory systems, single-program multiple-data (SPMD) and multiple-program multiple-data (MPMD) programming paradigms are used. In the SPMD paradigm, each processor loads the same program image and executes and operates on data in its own address space (different data). This is illustrated in Figure 1. It is the usual mechanism for MPI code: a single executable (a.out in the figure) is available on each node (through a globally accessible file system such as $WORK or $HOME), and launched on each node (through the batch MPI launch command, "ibrun a.out").

In the MPMD paradigm, each processor loads and executes a different program image and operates on different data sets, as illustrated in Figure 1. This paradigm is often used by researchers who are investigating the parameter space (parameter sweeps) of certain models and need to launch tens or hundreds of single-processor executions on different data. (This is a special case of MPMD in which the same executable is used, and there is NO MPI communication.) The executables are launched through the same mechanism as SPMD jobs, but a UNIX script is used to assign input parameters for the execution command (through the batch MPI launcher, "ibrun script_command"). Details of the batch mechanism for parameter sweeps are described in the help information for the launcher module:

login4% module help launcher

Figure 1. Distributed Memory Paradigm: Single/Multiple-Program Multiple-Data.

 

The shared-memory programming model is used on Symmetric Multi-Processor (SMP) nodes, like the TACC Champion Power5 System (8 CPUs, 16GB memory per node) or the TACC Ranger Cluster (16 cores, 32GB memory per node).

The programming paradigm for this memory model is called Parallel Vector Processing (PVP) or Shared-Memory Parallel Programming (SMPP). The former name is derived from the fact that vectorizable loops are often employed as the primary structure for parallelization. The main point of SMPP computing is that all of the processors in the same node share data in a single memory subsystem, as shown in Figure 2. There is no need for explicit messaging between processors as with MPI coding.

Figure 2. Shared-Memory Parallel Processing.

 

In the SMPP paradigm, either compiler directives (as pragmas in C, and special comments in Fortran) or explicit threading calls (e.g. with Pthreads) are employed. The majority of science codes now use OpenMP directives that are understood by most vendor compilers, as well as the GNU compilers.

In cluster systems that have SMP nodes and a high speed interconnect between them, programmers often treat all CPUs within the cluster as having their own local memory. On a node an MPI executable is launched on each CPU and runs within a separate address space. In this way, all CPUs appear as a set of distributed memory machines, even though each node has CPUs that share a single memory subsystem.

In clusters with SMPs, hybrid programming is sometimes employed to take advantage of higher performance at the node-level for certain algorithms that use SMPP (OMP) parallel coding techniques. In hybrid programming, OMP code is executed on the node as a single process with multiple threads (or an OMP library routine is called), while MPI programming is used at the cluster-level for exchanging data between the distributed memories of the nodes.
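In a hybrid run, the MPI tasks placed on a node and the OpenMP threads started by each task together should cover the node's 16 cores. The sketch below computes the thread count for a chosen number of tasks per node and exports it through OMP_NUM_THREADS, the standard OpenMP environment variable. The helper name threads_per_task and the choice of 4 tasks per node are illustrative assumptions, and how the task count is requested from the batch system is not covered here.

```shell
#!/bin/sh
# Split a 16-core Ranger node between MPI tasks and OpenMP threads.
CORES_PER_NODE=16

# threads_per_task MPI_TASKS_PER_NODE
# Print the number of OpenMP threads each MPI task should run so that
# tasks x threads equals the cores on a node. Hypothetical helper.
threads_per_task() {
  tasks=$1
  if [ "$tasks" -lt 1 ] || [ $(( $CORES_PER_NODE % $tasks )) -ne 0 ]; then
    echo "tasks per node must evenly divide $CORES_PER_NODE" >&2
    return 1
  fi
  echo $(( $CORES_PER_NODE / $tasks ))
}

# Example: 4 MPI tasks per node, one per socket (an illustrative choice).
OMP_NUM_THREADS=$(threads_per_task 4)
export OMP_NUM_THREADS
echo "4 MPI tasks per node x $OMP_NUM_THREADS OpenMP threads per task"
```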

 

Compiling Code

The following sections present the compiler invocation for serial and MPI executions, followed by a section on options. All compiler commands can be used for just compiling with the -c option (to create just the ".o" object files) or for compiling and linking (to create executables). To use a different (non-default) compiler, first unload the MPI environment (mvapich2), swap the compiler environment, and then reload the MPI environment.
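The unload, swap, and reload sequence can be wrapped in a small shell function so the three steps always run in order. This is a sketch only: switch_compiler is a hypothetical helper, it relies on the module unload, module swap, and module load subcommands of the modules utility, and the compiler module names in the usage comment (pgi, intel) are assumptions that should be checked against the output of module list and module avail.

```shell
# switch_compiler FROM TO [MPI_MODULE]
# Unload the MPI module, swap compiler module FROM for TO, then reload MPI.
# Hypothetical helper; MPI_MODULE defaults to mvapich2 as named in the text.
switch_compiler() {
  mpi="${3:-mvapich2}"
  module unload "$mpi" &&
  module swap "$1" "$2" &&
  module load "$mpi"
}

# Example (assumes the pgi module is currently loaded):
#   switch_compiler pgi intel
```

Chaining the steps with && stops the sequence if any module command fails, so the MPI module is not reloaded against a compiler that was never swapped in.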

 

Compiling Serial Programs

The compiler invocation commands for the supported vendor compiler systems are tabulated below.

Vendor | Compiler | Language | File Extension
intel | icc | C | .c
intel | icc | C++ | .C, .cc, .cpp, .cxx
intel | ifort | F77/F90/F95 | .f, .for, .ftn, .f90, .fpp
pgi | pgcc | C | .c
pgi | pgcpp | C++ | .C, .cc
pgi | pgf95 | F77/F90/F95 | .f, .F, .FOR, .f90, .f95, .hpf
gnu | gcc | C | .c
sun | sun_cc | C | .c
sun | sun_CC | C++ | .C, .cc, .cpp, .cxx
sun | sunf90 | F77/F90/F95 | .f, .F, .FOR, .f90, .hpf
sun | sunf95 | F95 | .f, .F, .FOR, .f90, .f95, .hpf

 

Note: pgf90 is an alias for pgf95.

 

Appropriate source code file extensions are required for each compiler. By default, the executable file name is a.out. It may be renamed with the -o option. To compile without the link step, use the -c option. The following examples illustrate renaming an executable and the use of two important compiler optimization options.

An Intel C program example:

login4% icc -o flamec.exe -O2 -xW prog.c

 

A PGI Fortran program example:

login4% pgf95 -o flamef.exe -fast -tp barcelona-64 prog.f90

 

A gnu C program example:

login4% gcc -o flamec.exe -mtune=barcelona -march=barcelona prog.c

 

A Sun Fortran program example:

login4% sunf90 -o flamef.exe -xarch=sse2 prog.f90

 

To see a list of compiler options, their syntax, and a terse explanation, execute the compiler command with the -help or --help option. Also, man pages are available.

 

Compiling Parallel Programs with MPI

The "mpicmds" commands support the compilation and execution of parallel MPI programs for specific interconnects and compilers. At login, the MPI MVAPICH (mvapich2) and PGI compiler (pgi) modules are loaded to produce the default environment, which provides the location of the corresponding mpicmds. Compiler scripts (wrappers) compile MPI code and automatically link startup and message passing libraries into the executable. The compiler and MVAPICH library are selected according to the modules that have been loaded. The following table lists the compiler wrappers for each language:

Compiler | Program | File Extension
mpicc | C | .c
mpicxx | C++ | .cc, .C, .cpp, .cxx
mpif90 | F77/F90 | .f, .for, .ftn, .f90, .f95, .fpp

 

Appropriate source code file extensions are required for each wrapper. By default, the executable name is a.out. It may be renamed with the -o option. To compile without the link step, use the -c option. The following examples illustrate renaming an executable and the use of two important compiler optimization options:

An Intel Fortran example:

login4% mpif90 -o prog.exe -O2 -xW prog.f90

 

A PGI C example:

login4% mpicc -o prog.exe -fast -tp barcelona-64 prog.c

 

Include linker options such as library paths and library names after the program module names, as explained in the Loading Libraries Section below. The Running Code Section explains how to execute MPI executables in batch scripts and "interactive batch" runs on compute nodes.

We recommend that you use either the Intel or the PGI compiler for optimal code performance. TACC does not support the use of the gcc compiler for production codes on the Ranger system. For those rare cases when gcc is required, for either a module or the main program, you can specify the gcc compiler with the -cc=gcc option. (Since gcc- and Intel-compiled code are binary compatible, you should compile all other modules that don't require gcc with the Intel compiler.) When gcc is used to compile the main program, an additional Intel library is required. The examples below show how to invoke the gcc compiler in combination with the Intel compiler for the two cases:

login4% mpicc -O2 -xW -c -cc=gcc suba.c
login4% mpicc -O2 -xW mymain.c suba.o
login4%
login4% mpicc -O2 -xW -c suba.c
login4% mpicc -O2 -xW -cc=gcc -L$ICC_LIB -lirc mymain.c suba.o

 

Compiling OpenMP Programs

Since each of the blades (nodes) of the Ranger cluster is an AMD Opteron quad-processor quad-core system, applications can use the shared memory programming paradigm "on node". With a total number of 16 cores per node, we encourage the use of a shared-memory model on the node.

The OpenMP compiler options are listed below for those who do need SMP support on the nodes. For hybrid programming, use the mpi-compiler commands, and include the openmp options.

With the Intel C compiler, for example:

login4% mpicc -O2 -xW -openmp -o prog.exe prog.c

 

Using the PGI Fortran compiler:

login4% mpif90 -fast -tp barcelona-64 -mp -o prog.exe prog.f90

 

Basic Optimization

The MPI compiler wrappers use the same compilers that are invoked for serial code compilation. So, any of the compiler flags used with the icc command can also be used with mpicc; likewise for ifort and mpif90, and for icpc and mpicxx. Below are some of the common serial compiler options with descriptions.

Intel Compiler Option | Description
-O3 | Performs some compile time and memory intensive optimizations in addition to those executed with -O2, but may not improve performance for all programs.
-ipo | Creates inter-procedural optimizations.
-vec_report[0|...|5] | Controls the amount of vectorizer diagnostic information.
-xW | Includes specialized code for SSE and SSE2 instructions (recommended).
-xO | Includes specialized code for SSE, SSE2 and SSE3 instructions. Use if code benefits from SSE3 instructions.
-fast | Includes: -ipo, -O2, -static. DO NOT USE -- static load not allowed.
-g -fp | Debugging information produced; disables using EBP as a general purpose register.
-openmp | Enables the parallelizer to generate multi-threaded code based on the OpenMP directives.
-openmp_report[0|1|2] | Controls the OpenMP parallelizer diagnostic level.
-help | Lists options.

 

Developers often experiment with the following options: -pad, -align, -ip, -no-prec-div and -no-prec-sqrt. In some codes performance may decrease. Please see the Intel compiler manual (below) for a full description of each option.

Use the -help option with the MPI commands (mpicc, mpif90, mpirun) for additional information, for example:

login3% mpicc -help
login3% mpif90 -help
login3% mpirun -help

 

Use the options listed in the mpirun man page with the ibrun command.

PGI Compiler Options Description
-O3 Performs some compile time and memory intensive optimizations in addition to those executed with -O2, but may not improve performance for all programs.
-Mipa=fast,inline Enables inter-procedural optimization. There is a loader problem with this option. Please include the following path in the LD_LIBRARY_PATH environment variable:
For Bourne-based shells:
export LD_LIBRARY_PATH=/share/apps/binutils-amd/070220/lib64:${LD_LIBRARY_PATH}
For C-based shells:
setenv LD_LIBRARY_PATH  /share/apps/binutils-amd/070220/lib64:${LD_LIBRARY_PATH}
-tp barcelona-64 Includes specialized code for the Barcelona chip.
-fast Includes: -O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mvect=sse -Mscalarsse -Mcache_align -Mflushz
-g, -gopt Produces debugging information.
-mp Enables the parallelizer to generate multi-threaded code based on the OpenMP directives.
-Minfo=mp,ipa Provides information about OpenMP, and inter-procedural optimization.
-help Lists options.
-help -fast Lists options for the -fast option.
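
The LD_LIBRARY_PATH workaround listed for -Mipa prepends one directory to whatever the variable already holds. As a minimal sketch for Bourne-type shells, the same step can be wrapped in a small function that skips directories that are already present and avoids a stray colon when the variable starts out empty (the dynamic loader treats an empty entry as the current directory). The function name prepend_ld_path is hypothetical and is not a TACC utility; the directory is the one quoted in the table.

```shell
# Hypothetical helper: prepend a directory to LD_LIBRARY_PATH (Bourne-type shells).
prepend_ld_path() {
    dir=$1
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$dir:"*) ;;   # already listed: leave the variable unchanged
        *) LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
    export LD_LIBRARY_PATH
}

# Directory named in the PGI options table.
prepend_ld_path /share/apps/binutils-amd/070220/lib64
echo "$LD_LIBRARY_PATH"
```

Calling the function twice with the same directory leaves the variable unchanged, so the line is safe to keep in a startup file.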

 

For details on the MPI standard go to the URL: www.mcs.anl.gov/mpi.

Compiler Usage Guidelines

The AMD Compiler Usage Guidelines document provides the "best-known" peak switches for various compilers tailored to their Opteron products. Developers and installers should read Chapter 5 of this document before experimenting with PGI and Intel compiler options.

See Chapter 5 of the AMD Compiler Usage Guidelines.

The Intel Compiler Suite

The Intel compilers are not the default compilers. You must use the module commands to load the Intel compiler environment. (The 10.1 suite is the most current installed, but the 9.1 compilers are available for special porting needs.) The gcc 3.4.6 compiler and module are also available. We recommend using the Intel (or the PGI) suite whenever possible. The 10.1 suite is installed with the 64-bit standard libraries and will compile programs as 64-bit applications (as the default compiler mode).

Web accessible Intel manuals are available: Intel C++ Compiler Documentation and Intel Fortran Compiler Documentation.

The PGI 7.1 Compiler Suite

The PGI 7.1 compilers are loaded as the default compilers at login with the pgi module. At this time, we recommend using the PGI suite whenever possible. The 7.1 suite is installed with the 64-bit standard libraries and will compile programs as 64-bit applications (as the default compiler mode).

A PDF version of the PGI User's Guide is available.

Selecting Compiler and MPI Environments

The Ranger programming environment supports several compilers (Intel 9/10 and PGI 7) and several MPI stacks (mvapich, mvapich2 and openmpi) for MPI programs. Each MPI stack must have a library compiled for each of the compilers, so that applications compiled with compiler x can load the x-compiled MPI libraries. Hence, the MPI environment is dependent upon the compiler environment you select. The generic way to change these two environments is to: unload the MPI environment, change the compiler environment (or not if you continue to use the present compiler environment), and then load the MPI environment you will need. The possible scenarios and examples are:

Change Only MPI stack
module unload < MPI_old >
module load < MPI_new >

 

Change Only Compiler (COMP)
module unload < MPI_orig >
module swap < COMP_old > < COMP_new >
module load < MPI_orig >

 

Change Compiler (COMP) and MPI stack
module unload < MPI_old >
module swap < COMP_old > < COMP_new >
module load < MPI_new >

Example of changing only the MPI stack:

login4% module unload mvapich2
login4% module load mvapich/1.0

Example of changing the compiler:

login4% module unload mvapich2
login4% module swap pgi intel
login4% module load mvapich2

Example of changing the compiler and the MPI stack (the version is specified in this example):

login4% module unload mvapich2
login4% module swap pgi intel/10.1
login4% module load openmpi
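
The three scenarios share one pattern: unload the MPI module, optionally swap the compiler, then load an MPI module. A small sketch of that pattern as a Bourne-shell function follows. The name switch_env is hypothetical, and the function only prints the module commands so that they can be reviewed before being run or copied into a startup file.

```shell
# Hypothetical helper: print the module commands for switching environments.
# Usage: switch_env MPI_OLD MPI_NEW [COMP_OLD COMP_NEW]
switch_env() {
    mpi_old=$1 mpi_new=$2 comp_old=${3:-} comp_new=${4:-}
    echo "module unload $mpi_old"
    # Swap compilers only when both names are given and they differ.
    if [ -n "$comp_old" ] && [ -n "$comp_new" ] && [ "$comp_old" != "$comp_new" ]; then
        echo "module swap $comp_old $comp_new"
    fi
    echo "module load $mpi_new"
}

# Compiler and MPI stack change, as in the last example above.
switch_env mvapich2 openmpi pgi intel/10.1
```

With only two arguments the function prints the unload/load pair for an MPI-only change.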

By default, the pgi compiler and mvapich environments are set up at login. Execute the module avail command to determine the modulefile names for all the available compilers; they have the syntax compiler/version. Note that only certain MPI stacks and other compiler-dependent libraries are seen for each compiler environment. The above commands can be placed in your .login_user (C shells) or .profile_user (Bourne shells) file to automatically set an alternate default compiler and MPI stack in your environment at login. The matrix below shows the available combinations of compilers and MPI stacks.

MPI Family Compiler Support MPI-1 Full MPI-2 Notes
mvapich/1.0 pgi, intel/9.1, intel/10.1 Yes No This is the current recommended stack for large scale analysis on Ranger. It has been used to run applications with O(32K) MPI tasks.
mvapich2/1.0 pgi, intel/9.1, intel/10.1 Yes Yes This supports full MPI-2 functionality with a job-startup mechanism that is recommended for job sizes in the range from 16-2048 tasks.
openmpi/1.2.4 pgi, intel/9.1, intel/10.1 Yes Yes OpenMPI also supports MPI-2 semantics and is the successor to the LAM/MPI project.

Available Compiler and MPI Stack combinations, and MPI-1 and MPI-2 compliance.

 

Compiler Options

Compiler options must be used to achieve optimal performance of any application. Generally, the highest impact can be achieved by selecting an appropriate optimization level, by targeting the architecture of the computer (CPU, cache, memory system), and by allowing for interprocedural analysis (inlining, etc.). There is no set of options that gives the highest speed-up for all applications. Consequently, different combinations have to be explored.

PGI compiler
-O[n] Optimization level, n=0, 1, 2 or 3
-tp barcelona-64 Targeting the architecture
-Mipa[=option] Interprocedural analysis, option=fast,inline

 

Intel compiler
-O[n] Optimization level, n=0, 1, 2 or 3
-x[p] Targeting the architecture, p=W or O
-ip, -ipo Interprocedural analysis

 

See the Development section for the use of these options.

 

Loading Libraries

Some of the more useful load flags/options are listed below. For a more comprehensive list, consult the ld man page.

Use the -l loader option to link in a library at load time. For example:

compiler prog.f90 -lname

 

This links in either the shared library libname.so (default) or the static library libname.a, provided that the library can be found in the loader's default search path or in one of the directories listed in the LD_LIBRARY_PATH environment variable.

To explicitly include a library directory, use the -L option:

compiler prog.f -L/mydirectory/lib -lname

 

In this example, the libname.a library is not in the default search path, so the "-L" option is specified to point to the libname.a directory /mydirectory/lib.

MKL

Many of the modules for applications and libraries, such as the mkl library module, provide environment variables for compiling and linking commands. Execute module help module_name for a description, listing, and use cases for the assigned environment variables.

The following example illustrates their use for the mkl library (after loading the mkl module with the command module load mkl):

mpicc prog.c \
-I$TACC_MKL_INC \
-Wl,-rpath,$TACC_MKL_LIB \
-L$TACC_MKL_LIB -lmkl -lguide
mpif90 prog.f90 -Wl,-rpath,$TACC_MKL_LIB \
-L$TACC_MKL_LIB -lmkl -lguide

 

Here, the module-supplied variables TACC_MKL_LIB and TACC_MKL_INC contain the MKL library and header directory paths, respectively. The loader option -Wl,-rpath specifies that the $TACC_MKL_LIB directory should be recorded in the binary executable. This allows the run-time dynamic loader to determine the location of shared libraries directly from the executable instead of from LD_LIBRARY_PATH or the ld.so cache of bindings between shared libraries and directory paths. (This avoids having to set LD_LIBRARY_PATH, either manually or through a module command, before running the executable.)

gotoBLAS

The gotoBLAS library contains statically compiled routines and does not require the -Wl,-rpath option to locate the library directory at run time. The multi-threaded library (for hybrid computing) has a "_mp" suffix. Also, codes that use long long int (C) or integer kind=8 (Fortran) should use the libraries differentiated with the "_ilp64" moniker. Otherwise, for normal, single-core executions within MPI or serial code, use the libgoto_lp64.a library, as shown below (after loading the gotoBLAS module with the command module load gotoblas).

mpicc prog.c \
-I$TACC_GOTOBLAS_LIB \
-L$TACC_GOTOBLAS_LIB -lgoto_lp64
mpif90 prog.f90 -L$TACC_GOTOBLAS_LIB -lgoto_lp64

If you are using the gotoBLAS in conjunction with other libraries that include the BLAS routines, make sure to include the gotoBLAS reference (-L$TACC_GOTOBLAS_LIB -lgoto_lp64) before any other library specification. A load map, which shows the library module for each static routine, can be used to validate that gotoBLAS libraries are being called from your program. Use the following loader option to create a map:

Compiler independent load map option -Wl,-Map,mymap
   

ScaLAPACK

Since the Intel LAPACK routines are highly optimized for em64t-type architectures, TACC recommends loading the Intel MKL library to satisfy the LAPACK references within ScaLAPACK routines. In this case it is important to specify the ScaLAPACK library options BEFORE MKL and other libraries on the loader/compiler command line. Because ScaLAPACK uses two static libraries in the APIs for accessing the BLACS routines, it may be necessary to explicitly reference them more than once, since the loader makes only a single pass through the library list while resolving references. If you need help determining a library sequence, please contact TACC staff through the portal consulting system or the TeraGrid HelpDesk. The examples below illustrate the normal library specification to try for C codes, and the loading sequence found to work for Fortran codes (don't forget to load the scalapack and mkl modules with the command module load scalapack mkl):

mpicc \
-I$TACC_SCALAPACK_INC prog.c \
-Wl,-rpath,$TACC_MKL_LIB \
-L$TACC_SCALAPACK_LIB -lscalapack \
$TACC_SCALAPACK_LIB/blacsCinit_MPI-LINUX-0.a \
$TACC_SCALAPACK_LIB/blacs_MPI-LINUX-0.a \
-L$TACC_MKL_LIB -lmkl -lguide -L$IFC_LIB -lifcore

mpif90 prog.f90 -Wl,-rpath,$TACC_MKL_LIB \
-L$TACC_SCALAPACK_LIB -lscalapack \
$TACC_SCALAPACK_LIB/blacsF77init_MPI-LINUX-0.a \
$TACC_SCALAPACK_LIB/blacs_MPI-LINUX-0.a \
$TACC_SCALAPACK_LIB/blacsF77init_MPI-LINUX-0.a \
-L$TACC_MKL_LIB -lmkl -lguide

Note that the C library list includes the Fortran libifcore.a to satisfy miscellaneous Intel run-time routines; other situations may require the portability or POSIX libraries (libifport.a and libifposix.a).

 

Performance Libraries

TACC provides several ISP (Independent Software Providers) and HPC vendor math libraries that can be used in many applications. These libraries provide highly optimized math packages and functions for the Ranger system.

The ACML library (AMD Core Math Library) contains many common math functions and routines (linear algebra, transformations, transcendental, sorting, etc.) specifically optimized for the AMD Barcelona processor. The ACML library also supports multi-processor (threading) FFT and BLAS routines. The Intel MKL (Math Kernel Library) has a similar set of math packages. Also, the GotoBLAS libraries contain the fastest set of BLAS routines for this machine.

The default compiler representation for Ranger consists of 32-bit ints, and 64-bit longs and pointers. Likewise for Fortran, integers are 32-bit, and the pointers are 64-bit. This is called the LP64 mode (Long and Pointers are 64-bit, ints and integers are 32-bit). Libraries with 64-bit integers are often suffixed with ILP64.

ACML and MKL libraries

The "AMD Core Math Library" and "Math Kernel Library" consist of functions with Fortran, C, and C++ interfaces for the following computational areas:

  • BLAS (vector-vector, matrix-vector, matrix-matrix operations) and extended BLAS for sparse computations
  • LAPACK for linear algebraic equation solvers and eigensystem analysis
  • Fast Fourier Transforms
  • Transcendental Functions

Note, Intel provides performance libraries for most of the common math functions and routines (linear algebra, transformations, transcendental, sorting, etc.) for their em64t and Core-2 systems. These routines also work well on the AMD Opteron microarchitecture. In addition, MKL also offers a set of functions collectively known as VML -- the "Vector Math Library". VML is a set of vectorized transcendental functions which offer both high performance and excellent accuracy compared to the libm, for vectors longer than a few elements.

GotoBLAS library

The "GotoBLAS Library" (pronounced "goat-toe") contains highly optimized BLAS routines for the Barcelona microarchitecture. The library has been compiled for use with the PGI and Intel Fortran compilers on Ranger. The Goto BLAS routines are also supported on other architectures, and the source code is available free for academic use (download). We recommend the GotoBLAS libraries for performing the following linear algebra operations and solving matrix equations:

  • BLAS (vector-vector, matrix-vector, matrix-matrix operations)
  • LAPACK for linear algebraic equation solvers and eigensystem analysis
 

Code Tuning

Additional performance can be obtained with these techniques:

  • Memory Subsystem Tuning: Optimize access to memory by minimizing the stride length and/or employing cache blocking techniques.
  • Floating-Point Tuning: Unroll inner loops to hide FP latencies, and avoid costly operations like division and exponentiation.
  • I/O Tuning: Use direct-access binary files to improve I/O performance.

These techniques are explained in further detail, with examples, in a separate Memory Subsystem Tuning document.

 

Usage

 

Running Code

Runtime Environment

Bindings to the most recent shared libraries are configured in the file /etc/ld.so.conf (and cached in the /etc/ld.so.cache file). Run cat /etc/ld.so.conf to see the TACC-configured directories, or execute:

login3% /sbin/ldconfig -p

to see a list of directories and candidate libraries. Use the -Wl,-rpath loader option or the LD_LIBRARY_PATH environment variable to override the default runtime bindings.

The Intel compiler, MKL math libraries, GOTO libraries, and other application libraries are located in directories specified by their respective environment variables, which are set when the module is loaded. Use the following command:

module help < modulefile >

to display instructions and examples on loading libraries for a particular modulefile. 

The SGE Batch System

Batch facilities such as LoadLeveler, LSF, and SGE differ in their user interface as well as the implementation of the batch environment. Common to all, however, is the availability of tools and commands to perform the most important operations in batch processing: job submission, job monitoring, and job control (hold, delete, resource request modification). The following paragraphs list the basic batch operations and their options, explain how to use the SGE batch environment, and describe the queue structure.

New users should visit the SGE wiki and read the first chapter of the Introduction to the N1 Grid Engine 6 Software document. To help those migrating from other systems, a comparison of the IBM LoadLeveler, LSF, and SGE batch options and commands is offered in a Batch Systems Comparison guide.

Step 1: Job submission

SGE provides the qsub command for submitting batch jobs:

qsub < job_script >

where < job_script > is the name of a UNIX format text file containing job script commands. This file should contain both shell commands and special statements that include qsub options and resource specifications. Some of the most common options are described in Table 2. Details on using these options and examples of job scripts follow.

Table 2 List of the Most Common qsub Options
Option Argument Function
-q < queue_name > Submits to queue designated by < queue_name >.
-pe < TpN >way < NoN x 16 > Executes the job using the specified number of tasks (cores to use) per node ("wayness") and the number of nodes times 16 (total number of cores). (See example script below.)
-N < job_name > Names the job < job_name >.
-S < shell_absolute_path > Use the command shell specified for the batch session.(See example script below.)
-M < email_address > Specify the email address to use for notifications.
-m {b|e|a|s|n} Specify when user notifications are to be sent.
-V   Use current environment setting in batch job.
-cwd   Use current directory as the job's working directory.
-j y Join stderr output with the file specified by the -o option. (Don't also use the -e option.)
-o < output_file > Direct job output to < output_file >.
-e < error_file > Direct job error to < error_file >. (Don't also use the -j option.)
-A < project_account_name > Charges run to < project_account_name >. Used only for multi-project logins. Account names and reports are displayed at login.
-l < resource >=< value > Specify resource limits. (See qsub man page.)

Options can be passed to qsub on the command line or specified in the job script file. The latter approach is preferable. It is easier to store commonly used qsub commands in a script file that will be reused several times rather than retyping the qsub commands at every batch request. In addition, it is easier to maintain a consistent batch environment across runs if the same options are stored in a reusable job script.

Batch scripts contain two types of statements: special comments and shell commands. Special comment lines begin with #$ and are followed by qsub options. The SGE shell_start_mode has been set to unix_behavior, which means that the UNIX shell commands are interpreted by the shell named after the #! sentinel on the first line of the script; if no shell is named there, the Bourne shell (/bin/sh) is used. The job file below requests an MPI job with 32 cores and 1.5 hours of run time:

#!/bin/bash  
#$ -V # Inherit the submission environment
#$ -cwd # Start job in submission directory
#$ -N myMPI # Job Name
#$ -j y # Combine stderr and stdout
#$ -o $JOB_NAME.o$JOB_ID # Name of the output file (eg. myMPI.oJobID)
#$ -pe 16way 32 # Requests 16 tasks/node, 32 cores total
#$ -q normal # Queue name "normal"
#$ -l h_rt=01:30:00 # Run time (hh:mm:ss) - 1.5 hours
#$ -M # Use email notification address
#$ -m be # Email at Begin and End of job
set -x # Echo commands, use "set echo" with csh
ibrun ./a.out # Run the MPI executable named "a.out"

If you don't want stderr and stdout directed to the same file, replace the -j option line with a -e option line that names a separate output file for stderr (but don't use both). By default, stdout and stderr are sent to out.o and err.o, respectively.

Example job scripts are available online in /share/doc/sge.   They include details for launching large jobs, running multiple executables with different MPI stacks, executing hybrid applications, and other operations.

MPI Environment for Scalable Code

The MVAPICH-1 and MVAPICH-2 MPI packages provide runtime environments that can be tuned for scalable code. For packages with short messages, there is a "FAST_PATH" option that can reduce communication costs, as well as a mechanism to "Share Receive Queues". Also, there is a "Hot-Spot Congestion Avoidance" option for quelling communication patterns that produce hot spots in the switch. See Chapter 9, "Scalable features for Large Scale Clusters and Performance Tuning" and Chapter 10, "MVAPICH2 Parameters" of the MVAPICH2 User Guide for more information. The User Guides are available in PDF format at: MVAPICH User Guides

The SGE Parallel Environment

Each Ranger node (of 16 cores) can be assigned to only one user at a time; hence, a complete node is dedicated to a user's job and accrues wall-clock time for 16 cores whether they are used or not. The SGE parallel environment option "-pe" sets the number of MPI Tasks per Node (TpN), and the Number of Nodes (NoN). The syntax is:

-pe < TpN >way < NoN x 16 >

where < TpN > is the number of Tasks per Node, and < NoN x 16 > is the total number of cores requested (Number of Nodes times 16). For example:

#$ -pe 16way 64

Regardless of the value of < TpN >, the second argument is always 16 times the number of nodes that you are requesting.
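
The arithmetic can be captured in a one-line helper. This is only a sketch, and pe_line is a hypothetical name; it prints the -pe directive for a chosen wayness and node count, using the 16 cores per node stated above.

```shell
# Hypothetical helper: print the "#$ -pe" line for a wayness and a node count.
# The second -pe argument is always nodes x 16, whatever the wayness.
pe_line() {
    tpn=$1 nodes=$2
    echo "#\$ -pe ${tpn}way $((nodes * 16))"
}

pe_line 16 4    # prints: #$ -pe 16way 64
pe_line 8 4     # prints: #$ -pe 8way 64
```

Both calls request the same four nodes; only the number of tasks started on each node differs.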

Using a multiple of 16 cores per node

For "pure" MPI applications, the most cost-efficient choices are: 16 tasks per node (16way) and a total number of tasks that is a multiple of 16, as described above. This will ensure that each core on all the nodes is assigned one task.

Using a large number of cores

For core counts above 4,000, cache the executable in each node’s memory immediately before the ibrun statement with the following command:

cache_binary $PWD ./a.out
ibrun...

In this example, ./a.out is the executable, $PWD is evaluated to the present working directory, and cache_binary is a perl script that caches the executable on each node.

Using fewer than 16 cores per node

When you want to use fewer than 16 MPI tasks per node, the choice of tasks per node is limited to the set of numbers {1, 2, 4, 8, 12, 15}. When the number of tasks you need is equal to "Number of Tasks per Node x Number of Nodes", then use the following option:

#$ -pe < TpN >way < NoN x 16 >

where < TpN > is a number in the set {1, 2, 4, 8, 12, 15}.

If the total number of tasks that you need is less than "Number of Tasks per Node x Number of Nodes", then set the MY_NSLOTS environment variable to the total number of tasks < TnoT > needed. In a job script, use the following -pe option and environment variable statement:

#$ -pe < TpN >way < NoN x 16 >  
export MY_NSLOTS=< TnoT > # For Bourne shells
or  
setenv MY_NSLOTS < TnoT > # For C shells

where < TpN > is a number in the set {1, 2, 4, 8, 12, 15}. For example, using a Bourne shell:

#$ -pe 8way 64 # Use 8 Tasks per Node, 4 Nodes requested
export MY_NSLOTS=31 # 31 tasks are launched
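
Working backwards from a desired task count follows the same rules. The sketch below uses the hypothetical function name pe_request: it rounds the task count up to whole nodes for the chosen wayness, prints the -pe line, and adds the MY_NSLOTS statement only when the last node would be partly filled.

```shell
# Hypothetical helper: derive the -pe line (and MY_NSLOTS, if needed) from
# the tasks per node (TpN) and the total number of tasks wanted.
pe_request() {
    tpn=$1 tasks=$2
    nodes=$(( (tasks + tpn - 1) / tpn ))     # round up to whole nodes
    echo "#\$ -pe ${tpn}way $((nodes * 16))"
    if [ "$tasks" -lt $((tpn * nodes)) ]; then
        echo "export MY_NSLOTS=$tasks"       # Bourne-shell form
    fi
}

pe_request 8 31
```

For 31 tasks at 8 tasks per node this reproduces the two lines of the example above; for 32 tasks only the -pe line is printed.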

Program Environment for Serial Programs

For serial batch executions, use the 1-way program environment (#$ -pe 1way 16), don't use the ibrun command to launch the executable (just use "./my_executable my_arguments"), and submit your job to the serial queue (#$ -q serial).  The serial queue has a 12-hour runtime limit, 12 nodes reserved within its pool, and allows up to 6 simultaneous runs per user.

#$ -pe 1way 16 # 1 execution  on node of 16 cores
#$ -q serial # run in serial queue
./my_executable # execute your applications (no ibrun)

Program Environment for Hybrid Programs

For hybrid jobs, specify the MPI Tasks per Node through the first -pe option (1/2/4/8/15/16way) and the Number of Nodes in the second -pe argument (as the number of assigned cores = Number of Nodes x 16). Then, use the OMP_NUM_THREADS environment variable to set the number of threads per task. (Make sure that "Tasks per Node x Number of Nodes" is less than or equal to the number of assigned cores, the second argument of the -pe option.) The hybrid Bourne shell example below illustrates the use of these parameters to run a hybrid job. It requests 4 tasks per node, 4 threads per task, and a total of 32 cores (2 nodes x 16 cores).

#$ -pe 4way 32 # 4 tasks/node, 32 cores total
export OMP_NUM_THREADS=4 # 4 threads/task
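
A quick sanity check on a hybrid layout can be scripted before submission. The sketch below assumes that at most one thread per core is wanted, which is a common choice rather than a rule stated in this guide; hybrid_layout is a hypothetical name.

```shell
# Hypothetical helper: check that tasks per node x threads per task fits in
# the 16 cores of a node, then print the matching job-script lines.
hybrid_layout() {
    tpn=$1 threads=$2 nodes=$3
    if [ $((tpn * threads)) -gt 16 ]; then
        echo "error: $tpn tasks x $threads threads exceeds 16 cores per node" >&2
        return 1
    fi
    echo "#\$ -pe ${tpn}way $((nodes * 16))"
    echo "export OMP_NUM_THREADS=$threads"
}

hybrid_layout 4 4 2
```

The call reproduces the two lines of the example above; a layout such as 8 tasks with 4 threads each is rejected with a non-zero return status.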

SGE provides several environment variables for the #$ options lines that are evaluated at submission time. The above $JOB_ID string is substituted with the job id. The job name (set with -N) is assigned to the environment variable JOB_NAME. The memory limit per task on a node is automatically adjusted to the maximum memory available to a user application (for serial and parallel codes).

Step 2: Batch query

After job submission, users can monitor the status of their jobs with the qstat command. Table 2.1 lists the qstat options:

Table 2.1 List of qstat Options
Option Result
-t Show additional information about subtasks
-r Shows resource requirements of jobs
-ext Displays extended information about jobs
-j Displays information for specified job
-qs {a|c|d|o|s|u|A|C|D|E|S} Shows jobs in the specified state(s)
-f Shows "full" list of queue/job details

The qstat command output includes a listing of jobs and the following fields for each job:

Table 2.2 Some of the fields in the qstat command output
Field Description
JOBID job id assigned to the job
USER user that owns the job
STATE current job status, including, but not limited to:
  w(aiting)
  s(uspended)
  r(unning)
  h(old)
  E(rror)
  d(eletion)

For convenience, TACC has created an additional job monitoring utility which summarizes jobs in the batch system in a manner similar to the "showq" utility from PBS. Execute:

login4% showq

to summarize running, idle, and pending jobs, along with any advanced reservations scheduled within the next week. The command showq -u will show jobs associated with your userid only (issue showq --help to obtain more information on available options).

The latest queue information can be determined using the following commands:

Command Comment
qconf -sql Lists the available queues.
qconf -sq < queue_name > The s_rt and h_rt values are the soft and hard wall-clock limits for the queue.
cat /share/sge/default/tacc/sge_esub_control The first value after max_cores_per_job_< queue_name > is the queue core limit.

Step 3: Job control

Control of job behavior takes many forms:

  1. Job modification while in the pending/run state
    Users can reset the qsub options of a pending job with the qalter command, using the following syntax:

     

    qalter < options >

     

    where < options > refers only to the following qsub resource options (also described in Table 2):

    -l h_rt= per-job wall clock time
    -o output file
    -e error file

     

  2. Job deletion
    The qdel command is used to remove pending and running jobs from the queue. The following table explains the different qdel invocations:

    qdel Removes pending or running job.
    qdel -f Force immediate dequeuing of running job.

    Use forced job deletion (with the -f option) only if all else fails, and immediately report the IDs of any force-deleted jobs to TACC staff through the portal consulting system or the TeraGrid HelpDesk, as this may leave hung processes that can interfere with the next job.

     

  3. Job suspension/resumption
    The qhold command allows users to prevent jobs from running. This command may be used to stop serial or parallel jobs and can be invoked by a user. A user cannot resume a job that was suspended by a system administrator, nor can a user suspend or resume a job owned by another user.

    Jobs that have been placed on hold by qhold can be resumed by using the qalter -h U ${JOB_ID}  command. (Note the character spacing is mandatory.)

SGE Batch Environment

In addition to the environment variables inherited by the job from the interactive login environment, SGE sets additional variables in every batch session. The following table lists some of the important SGE variables:

SGE Batch Environment Variables
Environment Variable Contains
JOB_ID batch job id
TASK_ID task id of an array job sub-task
JOB_NAME name user assigned to the job

 

Ranger Queue Structure

The Ranger production queues and their characteristics (wall-clock and processor limits; priority charge factor; and purpose) are listed in the table below. Queues that don’t appear in the list (such as systest, sysdebug, and clean) are non-production queues for system and HPC group testing and special support. From Ranger, users can submit jobs to the Spur nodes through the vis queue, using their Spur project name in the job account option (#$ -A). All Ranger projects have a Spur account, and at least a complimentary allocation for experimenting with the visualization tools. More details are available in the Spur User Guide.

SGE Batch Environment Queues
Queue Name Max Runtime Max Procs SU Charge Rate Purpose
normal 24 hrs 4096 1 Normal Priority
long 48 hrs 1024 1 Long Times
large 24 hrs 16384 1 Large Core Count
development 2 hrs 256 1 Development
serial 12 hrs 16 1 Serial, Large Memory
request -- -- 1 Special Requests
vis 24 hrs 32 1 Visualization
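
A request can be checked against these limits before it is submitted. The sketch below hard-codes the values from the table, treats Max Procs as a core count, and compares whole hours only; the function names are hypothetical, and because the limits may change, the qconf commands listed earlier remain the authoritative source.

```shell
# Limits copied from the queue table above, printed as "max_hours max_cores".
queue_limits() {
    case $1 in
        normal)      echo "24 4096" ;;
        long)        echo "48 1024" ;;
        large)       echo "24 16384" ;;
        development) echo "2 256" ;;
        serial)      echo "12 16" ;;
        vis)         echo "24 32" ;;
        *)           return 1 ;;     # unknown queue or no fixed limits listed
    esac
}

# Usage: fits_queue QUEUE CORES HOURS (succeeds when the request fits).
fits_queue() {
    limits=$(queue_limits "$1") || return 1
    set -- "$2" "$3" $limits     # now $1=cores $2=hours $3=max_hours $4=max_cores
    [ "$1" -le "$4" ] && [ "$2" -le "$3" ]
}

fits_queue normal 1024 12 && echo "fits normal"
fits_queue development 512 1 || echo "too large for development"
```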

 

Launching MPI Applications with ibrun

For all codes compiled with any MPI library, use the ibrun command to launch the executable within the job script.  The syntax is:

Simple Syntax ibrun   ./my_exec  < code_options >
Example ibrun  ./a.out 2000

The ibrun command now supports options for advanced host selection.  A subset of the processors from the list of all hosts can be selected to run an executable.  An offset must be applied. This offset can also be used to run two different executables on two different subsets, simultaneously.  The option syntax is:

Advanced Syntax ibrun -n < # of cores > -o < hostlist offset > my_exec  < code_options >

For the following advanced example, 64 cores were requested in the job script.

ibrun  -n 32  -o   0  ./a.out  &
ibrun  -n 32  -o 32  ./a.out  &
wait

The first call launches a 32-core run on the first 32 hosts in the hostfile, while the second call launches a 32-core run on the second 32 hosts in the  hostfile, concurrently (by terminating each command with &).  The wait command (required) waits for all processes to finish before the shell continues.  The wait command works in all shells.  Note, the -n and -o options must be used together.
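
When more than two subsets are needed, the offsets follow a simple progression. The sketch below is a dry run that only prints the ibrun lines for equal-sized chunks; ibrun_chunks is a hypothetical name, and the -n and -o options are used exactly as described above.

```shell
# Hypothetical helper: print ibrun lines that split TOTAL cores into chunks
# of SIZE cores each. Nothing is launched; the lines are only echoed.
ibrun_chunks() {
    total=$1 size=$2 exe=$3
    offset=0
    while [ $((offset + size)) -le "$total" ]; do
        echo "ibrun -n $size -o $offset $exe &"
        offset=$((offset + size))
    done
    echo "wait"
}

ibrun_chunks 64 16 ./a.out
```

For 64 cores in chunks of 16 this prints four ibrun lines with offsets 0, 16, 32 and 48, followed by wait.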

 

 

Affinity Policies

Controlling Process Affinity and Memory Locality

While many applications will run optimally using 16 MPI tasks on each node (a single task on each core), certain applications will run more efficiently with fewer than 16 cores per node and/or with a non-default memory policy. In these cases, consider using the techniques described in this section for a specific arrangement of tasks (process affinity) and memory allocation (memory policy). In HPC batch systems an MPI task is synonymous with a process. This section uses these two terms interchangeably. The "wayness" of a job, (specified by the -pe option), determines the number of tasks, or processes, launched on a node. For example, the SGE batch option "#$ -pe 16way 128" will launch 16 tasks on each of the nodes assembled from a set of 128 cores.

Since each MPI task is launched on a node as a separate process, the process affinity and memory policy can be specified for each task as it is launched with the numactl wrapper command. When a task is launched it can be bound to a socket or a specific core; likewise, its memory allocation can be bound to any socket. The assignment of tasks to sockets (and cores) and memory to socket memories are specified as options of the numactl wrapper command. There are two forms that can be used for numa control when launching a batch executable (a.out). They are:

ibrun numactl options ./a.out #numactl command; options apply to all a.out's
ibrun my_affinity ./a.out #my_affinity shell script contains "numactl ./a.out"

 

In the first command, ibrun executes "numactl options a.out" for all tasks (a.out's) using the same options. Because the rank of each a.out launch is not known to the job script, it is impossible on this command line to tailor numactl options for each individual task. In the second command, ibrun launches an executable UNIX script (here named my_affinity) for each task on a node and provides the rank and wayness to this script in environment variables. The script can therefore use these two variables to derive numactl option arguments for each task and apply them in its own "numactl options a.out" command. Additional details on numactl are given in the numactl man page and help information:

login3% man numactl
login3% numactl --help

 

In order to use numactl, it is necessary to understand the node configuration and the mapping between the logical core IDs, also called CPU Numbers, and the socket IDs, also called the Socket Numbers, and their physical location. The figure below shows the basic layout of the memory, sockets and cores of a Ranger node.

Figure 3. Node Diagram.

 

Both the CPU and Socket Numbers start at 0. Also, note the asymmetry in the HyperTransport (HT) connections: an HT port on each of sockets 0 and 3 is used to connect to a PCI-express bus; socket 3’s PCI-e connects to the InfiniBand Host Channel Adapter (HCA) and out to the InfiniBand network fabric.

Both memory bandwidth-limited and cache-friendly applications are affected by process and memory locality, and numa control can be used to modify affinity and memory policy for the specifics of the Ranger node architecture. The following figure shows two extreme cases of core scalability on a node. The upper curve shows the performance of a cache-friendly DGEMM matrix-matrix multiply from the GotoBLAS library. It scales to 15.2 for a 16-core run. The lower curve shows the scaling for the STREAM benchmark, which is limited by the memory bandwidth to each socket. Efficiently parallelized codes should exhibit scaling between these curves. Applications which are memory-bandwidth limited are often more sensitive to memory policy and affinity than CPU-limited applications.

 

Figure 4. Application scaling boundaries for CPU-limited and memory bandwidth-limited applications.
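Reading the 15.2 figure quoted for DGEMM as a speedup relative to a single core (an interpretation, since the text does not define it), the corresponding parallel efficiency can be worked out as in the sketch below. The helper name parallel_efficiency is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: parallel efficiency (%) = 100 * speedup / cores
parallel_efficiency() {
    awk -v s="$1" -v n="$2" 'BEGIN { printf "%.1f%%\n", 100 * s / n }'
}

parallel_efficiency 15.2 16   # prints 95.0%
```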

 

Numactl Command and Options

The table below lists important options for assigning processes and memory allocation to sockets, and assigning processes to specific cores. The “no fallback” condition implies a process will abort if no more allocation space is available.

Control Type     Command  Option        Arguments                                  Description
Socket Affinity  numactl  -N            {0,1,2,3}                                  Only execute process on cores of this (these) socket(s).
Memory Policy    numactl  -l            none                                       Allocate only on socket where process runs; fallback to another if full.
Memory Policy    numactl  -i            {0,1,2,3}                                  Strictly allocate round robin (interleave) on these (comma separated list) sockets; no fallback.
Memory Policy    numactl  --preferred=  {0,1,2,3} (select only one)                Allocate on this (only one) socket; fallback to another if full.
Memory Policy    numactl  -m            {0,1,2,3}                                  Strictly allocate on this (these, comma separated list) socket(s); no fallback.
Core Affinity    numactl  -C            {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}    Only execute process on this (these, comma separated list) core(s).
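A few command lines combining the options in the table above are sketched below. The show wrapper is hypothetical and only prints each command, so the sketch can be read (and run) without numactl installed; drop it to execute the commands for real. The core-to-socket pairing in the last line assumes cores 4-7 sit on socket 1.

```shell
#!/bin/bash
# Hypothetical dry-run wrapper: print the command line instead of running it.
show() { echo "$*"; }

show numactl -N 0 -m 0 ./a.out          # run on socket 0, memory strictly on socket 0
show numactl -N 1 -l ./a.out            # run on socket 1, allocate locally with fallback
show numactl -i 0,1,2,3 ./a.out         # interleave pages across all four sockets
show numactl --preferred=2 ./a.out      # prefer socket 2 memory, fall back if full
show numactl -C 4,5,6,7 -m 1 ./a.out    # pin to cores 4-7, memory strictly on socket 1
```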

 

In threaded applications the same numactl command can be used, but it applies globally to ALL threads. In an executable launched with the numactl command, any forked process or thread inherits the process affinity and memory policy of the parent. Hence, in OpenMP codes, all threads at a parallel region take on the affinity and policy of the master process (thread). Even though multiple threads are forked (or spawned) at run time from a single process, you can still control (migrate) each thread after it is spawned through a numa API (application program interface). The basic underlying API utilities are sched_getaffinity/sched_setaffinity for binding processes (MPI tasks) and threads (OpenMP/Pthreads threads), and the numa library memory functions for memory policy.

There are three levels of numa control that can be used when submitting batch jobs. These are:

  1. Global (usually with 16 tasks);
  2. Socket level (usually with 4, 8, and 12 tasks); and
  3. Core level (usually with an odd number of tasks)

Launch commands will be illustrated for each. The control of numactl can be directed either at a global level on the ibrun command line (usually for memory allocation only), or within a script (which accesses MPI variables) to specify the socket (-N) or core (-C) mapping to MPI ranks assigned to the execution. (Note, the numactl man page refers to sockets as "nodes". In HPC systems a node is a blade; to avoid confusion, we will only use "node" when we refer to a blade).

The default memory policy is a local allocation; that is, allocation preferentially occurs on the socket where the process is executing. Note that the (physical) memory allocation occurs when variables/arrays are first touched (assigned a value) in a program, not when the memory storage is malloc'd (C) or allocated (F90). Task assignments are set by numa control within MPI (and should not be of concern for 16way MPI and 1way pure OMP runs); nevertheless some applications can benefit from their own numa control. The syntax and script templates for the three levels of control are presented below.

Numa Control in Batch Scripts

  • GLOBAL CONTROL:

Global control will affect every execution. This is often only used to control the layout of memory for 16-way MPI executions and SMP executions of 16 threads on a node. Since allocated I/O buffers may remain on a node from a previous job, TACC provides a script, tacc_affinity, to enforce a strict local memory allocation to the socket (-m memory policy), thereby removing I/O buffers (the default local memory policy does not evict the I/O memory allocation, but simply overflows to another socket’s memory and incurs a higher overhead for memory access). The tacc_affinity script also distributes 4-, 8-, and 12-way executions evenly across sockets. Use this script as a template for implementing your own control. The examples below illustrate the syntax.

ibrun ./a.out # -pe 16way 16 ; uses local memory; MPI does core affinity
ibrun numactl -i all ./a.out # -pe 1way 16 ; interleaved; possibly use for OMP
ibrun tacc_affinity ./a.out # -pe 4/8/12/16way ... ; assigns MPI tasks round robin to sockets,
                            # mandatory memory allocation to socket

 

  • SOCKET CONTROL:

Often socket level affinity is used with hybrid applications (such as 4 tasks and 4 threads/task on a node = 4x4), and when the memory per process must be two to four times the default (~2GB/task). In this scenario, it is important to distribute 3, 2 or 1 tasks per socket. In the 4x4 hybrid model, it is critically important to launch a single task per socket and control 4 threads on the socket. For the usual cases of control TACC provides a tacc_affinity script for recommended behavior.

An example numa control script and related job commands are shown to illustrate how a task (rank) is mapped to a socket and how process memory policy is assigned. The numa.sh script below (based on TACC’s tacc_affinity script) is executed once for each task on the node. This script captures the assigned rank (from the particular MPI stack, one of MPIRUN_RANK, PMI_ID, etc.) and wayness from ibrun and uses them to assign a socket to the task. From the list of the nodes, ranks are assigned sequentially in block sizes determined by the "wayness" of the Program Environment (assigned in the PE environment variable). So, for example, in the job below, a 32 core, 4way job (#$ -pe 4way 32) will have ranks {0,1,2,3} assigned to the first node, {4,5,6,7} assigned to the next node, etc. The “export OMP_NUM_THREADS=4” setting specifies 4 threads per task (for hybrid executions). For the given job script parameters, the numa.sh script assigns one task to each of the sockets 0, 1, 2 and 3 on each of the two nodes. The script also works for 8way, 12way and 16way jobs.

Bourne shell based job script
...
#$ -pe 4way 32
...
export OMP_NUM_THREADS=4
ibrun numa.sh

 

numa.sh
#!/bin/bash

# Disable the MPI library's own affinity settings
export MV2_USE_AFFINITY=0
export MV2_ENABLE_AFFINITY=0
export VIADEV_USE_AFFINITY=0
export VIADEV_ENABLE_AFFINITY=0

# Get rank from appropriate MPI API variable
[ "x$MPIRUN_RANK" != "x" ] && myrank=$MPIRUN_RANK
[ "x$PMI_ID" != "x" ] && myrank=$PMI_ID
[ "x$OMPI_COMM_WORLD_RANK" != "x" ] && myrank=$OMPI_COMM_WORLD_RANK
[ "x$OMPI_MCA_ns_nds_vpid" != "x" ] && myrank=$OMPI_MCA_ns_nds_vpid

# TasksPerNode (wayness), taken from the PE variable, e.g. "4way" -> 4
TPN=`echo $PE | sed 's/way//'`
[ -z "$TPN" ] && echo "TPN NOT defined!" && exit 1

# local_rank = 0 ... wayness-1 = rank modulo wayness

local_rank=$(( $myrank % $TPN ))

# Assign sockets in groups of 1, 2, 3 and 4 tasks for
# 4-, 8-, 12-, and 16-wayness, respectively.

socketID=$(( $local_rank / ( $TPN / 4 ) ))

# Bind the task and its memory to the chosen socket
exec numactl -N $socketID -m $socketID ./a.out
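To see what the socket arithmetic in the script above produces, the sketch below tabulates the socket assigned to each local rank for the four supported wayness values. The function name socket_for is hypothetical; the formula is the one used in the script.

```shell
#!/bin/bash
# Same arithmetic as the script above: socket = local_rank / (wayness / 4)
socket_for() {
    local local_rank=$1 tpn=$2
    echo $(( local_rank / ( tpn / 4 ) ))
}

# Print the socket of every local rank for each wayness
for tpn in 4 8 12 16; do
    line="${tpn}way:"
    for (( r = 0; r < tpn; r++ )); do
        line="$line $(socket_for "$r" "$tpn")"
    done
    echo "$line"
done
```

The integer division only yields sockets 0 through 3 when the wayness is a multiple of 4; a wayness below 4 would divide by zero, so other values need a different mapping.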

 

  • CORE CONTROL:

When there is no simple arithmetic algorithm (such as using the modulo function above) to map the ranks, lists may be used. The template below illustrates the use of a list for mapping tasks and memory allocation onto sockets.

We use the numactl -C option for assigning tasks to cores. Each element of the task2core array holds the core assigned to a task, ordered by local rank. Using localrank as the index of task2core maps the task (for the local rank) onto a core ID. Similarly, the 0-15 local rank values are used as indexes into the task2socket array to provide socket IDs for the -m memory argument. In this case, the Program Environment is 16way and the set of local ranks {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15} are mapped onto the cores {15,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14} and sockets {3,0,0,0,0,1,1,1,1,2,2,2,2,3,3,3}, respectively.

Task 0, which is often the root rank in MPI broadcasts, is assigned to core 15, with its memory allocated on socket 3, which is connected directly through the HyperTransport via the PCI-e bridge to the HCA (network card). Such a mapping may actually degrade performance for other reasons; hence, custom mappings should be thoroughly tested for their benefits before being used on a production application. Nevertheless, the approach is quite general and provides a mechanism for list-directed control over any desired mapping.

NOTE: When the number of cores is not a multiple of 16 (e.g. 28 in this case), the user must set the environment variable MY_NSLOTS to the number of cores within the job script, as shown below. In addition, the second argument in the -pe option (32 below) must be equal to the value of MY_NSLOTS rounded up to the nearest multiple of 16.

Bourne shell based job script
...
#$ -pe 14way 32
...
export MY_NSLOTS=28
ibrun numa.sh
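The rounding rule for the second -pe argument can be sketched as follows. The helper name round_up_16 is hypothetical.

```shell
#!/bin/bash
# Round a core count up to the next multiple of 16 (cores per node).
round_up_16() {
    echo $(( ( $1 + 15 ) / 16 * 16 ))
}

MY_NSLOTS=28
echo "#\$ -pe 14way $(round_up_16 "$MY_NSLOTS")"   # prints: #$ -pe 14way 32
```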

 

numa.sh
#!/bin/bash

# Disable the MPI library's own affinity settings
export MV2_USE_AFFINITY=0
export MV2_ENABLE_AFFINITY=0
export VIADEV_USE_AFFINITY=0
export VIADEV_ENABLE_AFFINITY=0

# Get rank from appropriate MPI API variable
[ "x$MPIRUN_RANK" != "x" ] && myrank=$MPIRUN_RANK
[ "x$PMI_ID" != "x" ] && myrank=$PMI_ID
[ "x$OMPI_COMM_WORLD_RANK" != "x" ] && myrank=$OMPI_COMM_WORLD_RANK
[ "x$OMPI_MCA_ns_nds_vpid" != "x" ] && myrank=$OMPI_MCA_ns_nds_vpid

# TasksPerNode (wayness), taken from the PE variable, e.g. "16way" -> 16
TPN=`echo $PE | sed 's/way//'`
[ -z "$TPN" ] && echo "TPN NOT defined!" && exit 1

# Mapping lists, indexed by local rank. Only the 16way lists are
# defined here; add lists for any other wayness you intend to use.
if [ "$TPN" = 16 ]; then
    task2socket=( 3 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 )
    task2core=( 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 )
else
    echo "No task mapping defined for ${TPN}way" && exit 1
fi

# local rank = 0 ... wayness-1 = rank modulo wayness
localrank=$(( $myrank % $TPN ))

# bash arrays: 1st element has index 0
socket=${task2socket[$localrank]}
core=${task2core[$localrank]}

# Bind the task to its core and its memory to the matching socket
exec numactl -C $core -m $socket ./a.out
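Hand-written mapping lists are easy to get wrong, so it can help to check them before submitting a job. The sketch below verifies that each core in task2core lies on the socket given in task2socket for the same local rank. It assumes cores 4*s through 4*s+3 belong to socket s, which is what the lists above imply; the function name check_mapping is hypothetical.

```shell
#!/bin/bash
task2socket=( 3 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 )
task2core=( 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 )

# Returns 0 if every core lies on the socket listed for the same local rank,
# assuming cores 4*s .. 4*s+3 belong to socket s.
check_mapping() {
    local r
    for r in "${!task2core[@]}"; do
        if [ $(( task2core[r] / 4 )) -ne "${task2socket[r]}" ]; then
            echo "local rank $r: core ${task2core[r]} is not on socket ${task2socket[r]}"
            return 1
        fi
    done
    echo "mapping is consistent"
}

check_mapping
```

A mismatch would mean a task runs on one socket while its memory is strictly allocated on another, which is rarely what is intended.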

 

In hybrid codes, a single MPI task (process) is launched and becomes the “master thread”. It uses any numactl options specified on the launch command. When a parallel region forks the slave threads, the slaves inherit the affinity and memory policy of the parent, the master thread (launch process). The usual case is for an application to launch a single task on a node without numactl process binding, but to provide a memory policy for the task (and hence all of the threads).

For instance, "ibrun numactl -i all ./a.out" would be used to assign interleave as the memory policy. Another hybrid scenario is to assign a single task on each socket (a 4x4 hybrid). In this case, a socket binding and some form of local memory policy should be employed, as in the example above.

In the first case, a socket could have been assigned to the single task (socket 0 is the usual default). In both cases, the memory policy could have been omitted, and omitting it may even be the most appropriate action. Without an explicit memory policy, a local allocation policy, sometimes called a "first touch" mechanism, is used. For any shared array, the first time an element within a block of memory (block = a page of 4096 bytes) is accessed, the page is assigned to the memory of the socket on which the accessing thread is running, regardless of the location of the master thread that may have performed the allocation statement. (For special cases, you can force touching at allocation time with the --touch option; but that should not be used for a shared array, because it assigns all the memory to a single socket.)

 

Basic Optimization

The most practical approach to enhance the performance of applications is to use advanced compiler options, employ high performance libraries for common mathematical algorithms and scientific methods, and tune the code to take advantage of the architecture. Compiler options and libraries can provide a large benefit for a minimal amount of work. Always profile the entire application to ensure that optimization efforts are focused on the areas with the greatest return.

"Hot spots" and performance bottlenecks can be discovered with basic profiling tools like prof and gprof. Observe the relative changes in performance among the routines when experimenting with compiler options. Sometimes it might be advantageous to break out routines and compile them separately with different options than those used for the rest of the package. Also, review routines for "hand-coded" math algorithms that can be replaced by standard (optimized) library routines. You should also be familiar with general code tuning methods and restructure statements and code blocks so that the compiler can take advantage of the microarchitecture.

Code should:

  • be clear and comprehensible
  • provide flexible compiler support
  • be portable

Avoid too many architecture-specific code constructs. Use language features and restructure code so the compiler can discover how to optimize code for the architecture. That is, expose optimization when possible for the compiler, but don't rewrite the code specifically for the architecture.

Some best practices:

  • Avoid excessive program modularization (i.e. too many functions/subroutines)
    • write routines that can be inlined
    • use macros and parameters whenever possible
  • Minimize the use of pointers
  • Avoid casts or type conversions, implicit or explicit
  • Avoid branches, function calls, and I/O inside loops
    • structure loops to eliminate conditionals
    • move a loop that surrounds a subroutine call into the subroutine itself
 

Tools

Program Timers and Performance Tools

Measuring the performance of a program should be an integral part of code development. It provides benchmarks to gauge the effectiveness of performance modifications and can be used to evaluate the scalability of the whole package and/or specific routines. There are quite a few tools for measuring performance, ranging from simple timers to hardware counters. Reporting methods vary too, from simple ASCII text to X-Window graphs of time series.

The most accurate way to evaluate changes in overall performance is to measure the wall-clock (real) time when an executable is running in a dedicated environment. On Symmetric Multi-Processor (SMP) machines, where resources are shared (e.g., the TACC IBM Power4 P690 nodes), user time plus sys time is a reasonable metric; but the values will not be as consistent as when running without any other user processes on the system. The user and sys times are the amount of time a user's application executes the code's instructions and the amount of time the kernel spends executing system calls on behalf of the user, respectively.

Package Timers

The time command is available on most UNIX systems. In some shells there is a built-in time command, but it doesn't have the functionality of the command found in /usr/bin. Therefore, you should use the full pathname to access the time command in /usr/bin. To measure a program's time, run the executable with time using the syntax "/usr/bin/time -p" (-p selects the portable POSIX output format, with units in seconds):

/usr/bin/time -p ./a.out #Time for a.out execution
real 1.54 #Output (in seconds)
user 0.5  
sys 0  
/usr/bin/time -p ibrun -np 4 ./a.out #Time for rank 0 task

 

The MPI example above provides only the timing information for the rank 0 task on the master node (the node that executes the job script); however, the real time is applicable to all tasks since MPI tasks terminate together. The user and sys times may vary markedly from task to task if they do not perform the same amount of computational work (are not load balanced).
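When user plus sys time is the metric of interest, the two values can be summed directly from the "time -p" output. The sketch below is one way to do this; the function name cpu_seconds is hypothetical, and the sample input is the output shown above.

```shell
#!/bin/bash
# Read "/usr/bin/time -p" style output on stdin and print user + sys seconds.
cpu_seconds() {
    awk '$1 == "user" || $1 == "sys" { total += $2 } END { printf "%.2f\n", total }'
}

# Sample taken from the timing output shown above
printf 'real 1.54\nuser 0.5\nsys 0\n' | cpu_seconds   # prints 0.50
```

The time command writes its report to standard error, so in practice the report can be captured with a redirection such as "/usr/bin/time -p ./a.out 2> time.out" and then passed to the function with "cpu_seconds < time.out". Anything the program itself writes to standard error ends up in the same file.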

Code Section Timers

"Section" timing is another popular mechanism for obtaining timing information. The performance of individual routines or blocks of code can be measured with section timers by inserting the timer calls before and after the regions of interest. Several of the more common timers and their characteristics are listed below:

Code Section Timers
Routine        Type            Resolution (usec)   OS/Compiler
times          user/sys        1000                Linux
getrusage      wall/user/sys   1000                Linux
gettimeofday   wall clock      1                   Linux
rdtsc          wall clock      0.1                 Linux
system_clock   wall clock      system dependent    Fortran 90 Intrinsic
MPI_Wtime      wall clock      system dependent    MPI Library (C & Fortran)

 

For general purpose or coarse-grain timings, precision is not important; therefore, the millisecond and MPI/Fortran timers should be sufficient. These timers are available on many systems and hence can also be used when portability is important. For benchmarking loops, it is best to use the most accurate timer (and time as many loop iterations as possible to obtain a time duration of at least an order of magnitude larger than the timer resolution). The times, getrusage, gettimeofday, rdtsc, and read_real_time timers have been packaged into a group of C wrapper routines (also callable from Fortran). The routines are function calls that return double (precision) floating point numbers with units in seconds. All of these TACC wrapper timers (x_timer) can be accessed in the same way:

 

Fortran:
external x_timer
real*8 :: x_timer
real*8 :: sec0, sec1, tseconds
...
sec0 = x_timer()
...Fortran Code
sec1 = x_timer()
tseconds = sec1-sec0

C:
double x_timer(void);
...
double sec0, sec1, tseconds;
...
sec0 = x_timer();
...C Code
sec1 = x_timer();
tseconds = sec1-sec0;
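The guideline above, that a timed loop should run at least an order of magnitude longer than the timer resolution, translates into a minimum iteration count. The sketch below works in integer nanoseconds; the function name min_iterations and the 2 usec per-iteration cost are hypothetical values chosen for illustration.

```shell
#!/bin/bash
# Smallest iteration count whose total duration is at least 10x the
# timer resolution. Both arguments are in nanoseconds.
min_iterations() {
    local resolution_ns=$1 per_iter_ns=$2
    echo $(( ( 10 * resolution_ns + per_iter_ns - 1 ) / per_iter_ns ))
}

min_iterations 1000000 2000   # times (1000 usec resolution): prints 5000
min_iterations 1000 2000      # gettimeofday (1 usec resolution): prints 5
```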

 

Standard Profilers

The gprof profiling tool provides a convenient mechanism to obtain timing information for an entire program or package. Gprof reports a basic profile of how much time is spent in each subroutine and can direct developers to the most time-consuming routines, the "hotspots", where optimization is likely to be most beneficial. As with all profiling tools, the code must be instrumented to collect the timing data and then executed to create a raw-data report file. Finally, the data file must be read and translated into an ASCII report or a graphic display. The instrumentation is accomplished by simply recompiling the code using the -qp (Intel compiler) option. The compilation, execution, and profiler commands for gprof are shown below for a Fortran program:

Profiling Serial Executables
ifort -qp prog.f90    Instruments code
a.out                 Produces gmon.out trace file
gprof                 Reads gmon.out (default args: a.out gmon.out); report sent to STDOUT

 

Profiling Parallel Executables
mpif90 -qp prog.f90              Instruments code
setenv GMON_OUT_PREFIX gout.*    Forces each task to produce its own gout.* file
ibrun -np < # > a.out            Produces the gout.* trace files
gprof -s a.out gout.*            Combines gout files into gmon.sum
gprof a.out gmon.sum             Reads executable (a.out) & gmon.sum; report sent to STDOUT

 

Detailed documentation is available at www.gnu.org.

Timing Tools

Most of the advanced timing tools access hardware counters and can provide performance characteristics about floating point/integer operations, as well as memory access, cache misses/hits, and instruction counts. Some tools can provide statistics for an entire executable with little or no instrumentation, while others require source code modification.

Debugging with DDT

DDT is a symbolic, parallel debugger that allows graphical debugging of MPI applications. For information on how to perform parallel debugging using DDT on Ranger, please see the DDT Debugging Guide.

 

Resources

 

Manuals

The following manuals and other reference documents were used to gather information for this User Guide and may contain additional information of use.

AMD

Portland Group Compiler

Intel