Stampede User Guide
System Overview
The following table offers a high level description of the Stampede system configuration:
| Overview of the Stampede System |
| Host Name | stampede.tacc.utexas.edu |
| Login Nodes | slogin1.tacc.utexas.edu; slogin2.tacc.utexas.edu (may not be available) |
| Operating System | Linux |
| Number of Processors | 1744 (compute cores) |
| Total Memory | 1800 GB |
| Peak Performance | 16 TFLOPS |
| Total Disk | 520 GB (local); 536 GB (shared); 68 TB (global, shared) |
Architecture
The Stampede system consists of 218 compute nodes and 2 login nodes. The nodes are interconnected using Gigabit Ethernet technology. Each compute node has 2 quad-core Intel Clovertown processors. At the current time, there are three configurations for the compute nodes:
- 204 compute nodes have 8 GB of memory and 600 GB of local disk space of which 520 GB is available to the user.
- 7 compute nodes have 8 GB of memory and 70 GB of local disk space (53 GB available to the user).
- 7 compute nodes have 16 GB of memory with 70 GB of local disk space (53 GB available to the user).
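The core and memory totals quoted in the overview table follow from these node counts. As a quick arithmetic check (a sketch that uses only the figures listed above; the variable names are illustrative):

```shell
# Node counts and per-node figures taken from the list above.
nodes=$((204 + 7 + 7))                      # all three node configurations
cores=$((nodes * 2 * 4))                    # 2 quad-core processors per node
memory_gb=$((204 * 8 + 7 * 8 + 7 * 16))     # GB of memory summed over all nodes

echo "compute nodes: $nodes"
echo "compute cores: $cores"
echo "total memory:  $memory_gb GB"
```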
Stampede users share a 562 GB home file system that is NFS mounted to the login nodes and the compute nodes. Stampede can also access 68 TB of parallel file storage that is managed by the Lustre file system and shared with the TACC Lonestar system (/work). Also, a 2.8 PB archive system and 5 TB SAN network storage system are available through the login/development nodes.
System Access
ssh
To ensure a secure login session, users must connect to machines using the secure shell (ssh) program. Telnet is not allowed because of the security vulnerabilities associated with it. The "r" commands rlogin, rsh, and rcp, as well as ftp, are also disabled on this machine for similar reasons. These commands are replaced by the more secure alternatives included in SSH: ssh, scp, and sftp.
Before any login sessions can be initiated using ssh, a working SSH client needs to be present on the local machine. Wikipedia is a good source of information on SSH in general and provides information on the various clients available for your particular operating system.
To initiate an ssh connection to a Stampede login node from a UNIX or Linux system with ssh already installed, execute the following command:
| ssh < login-name >@stampede.tacc.utexas.edu |
Note: < login-name > is needed only if the user name on the local machine and the TACC machine differ.
Passwords are now changed using the TACC user portal. The passwd command is not available. Password changes should comply with practices presented in the TACC Password Guide.
Login Information
Login Shell
The most important component of a user's environment is the login shell that interprets text on each interactive command line and statements in shell scripts. Each login has a line entry in the /etc/passwd file, and the last field contains the shell launched at login. To determine your login shell, execute:
| slogin1% echo $SHELL |
You can use the chsh command to change your login shell. Full instructions are in the chsh man page. Available shells are listed, with their full paths, in the /etc/shells file.
To display the list of available shells with chsh and change your login shell to bash, execute the following:
| slogin1% chsh -l |
| slogin1% chsh -s /bin/bash |
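The "last field" convention of the passwd entry can be illustrated with a small sketch. The helper name passwd_shell and the sample entry are hypothetical, and getent is assumed to be available, as it is on most Linux systems:

```shell
# Print the login shell (7th colon-separated field) of a passwd-format entry.
passwd_shell() {
    printf '%s\n' "$1" | cut -d: -f7
}

# Hypothetical entry, for illustration only.
sample='smith:x:42:100:J. Smith:/home/00042/smith:/bin/tcsh'
passwd_shell "$sample"

# The same field for the current account, if getent can find the entry.
entry=$(getent passwd "$(id -un 2>/dev/null)" 2>/dev/null || true)
if [ -n "$entry" ]; then
    passwd_shell "$entry"
fi
```

Going through getent rather than reading /etc/passwd directly also covers accounts that are served from a directory service instead of the local file.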
User Environment
The next most important component of a user's environment is the set of environment variables. Many of the UNIX commands and tools, such as the compilers, debuggers, profilers, editors, and just about all applications that have GUIs (Graphical User Interfaces), look in the environment for variables that specify information they may need to access. To see the variables in your environment, execute the command:
| slogin1% env |
The variables are listed as keyword/value pairs separated by an equal (=) sign, as illustrated below by the $HOME and $PATH variables.
| HOME=/home/00042/smith |
| PATH=/bin:/usr/bin:/usr/local/apps:/opt/intel/bin |
Notice that the $PATH environment variable consists of a colon (:) separated list of directories. Variables set in the environment (with setenv for C shells and export for Bourne shells) are "carried" to the environment of shell scripts and new shell invocations, while normal "shell" variables (created with the set command) are useful only in the present shell. Only environment variables are displayed by the env (or printenv) command. Execute set to see the (normal) shell variables.
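The difference between exported and ordinary shell variables can be seen with a short sketch for Bourne-type shells. The variable names LOCAL_ONLY and CARRIED are made up for the illustration:

```shell
# Bourne-type shell (sh, bash, ksh). Variable names are illustrative.
LOCAL_ONLY="shell variable"             # ordinary shell variable, not exported
export CARRIED="environment variable"   # exported, so child processes inherit it

# A new shell invocation sees only the exported variable.
child_view=$(sh -c 'echo "${LOCAL_ONLY:-unset} / ${CARRIED:-unset}"')
echo "$child_view"

# Show the directories in $PATH one per line.
echo "$PATH" | tr ':' '\n'
```

In a C-type shell the equivalent of the export line would use setenv, as noted above.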
Startup Scripts
All UNIX systems set up a default environment and provide administrators and users with the ability to execute additional UNIX commands to alter the environment. These commands are "sourced". That is, they are executed by your login shell, and the variables (both normal and environmental), as well as aliases and functions, are included in the present environment.
Forgetting to unload superseded modules or putting module commands in the wrong shell startup file are some of the most common environment set up mistakes. With the C based shells, use the .login_user file instead of .cshrc_user. The .login_user file is executed at login, but only after .cshrc_user, and therefore after the rest of the environment is set up. For Bourne based shells, use the .profile_user file for module commands.
Basic site environment variables and aliases are set in the following files:
| /usr/local/etc/cshrc | {C-type shells, non-login specific} |
| /usr/local/etc/login | {C-type shells, specific to login} |
| /usr/local/etc/profile | {Bourne-type shells} |
For historical reasons, the C based shells (csh, tcsh, etc.) source two types of files. The .cshrc type files are sourced first (/etc/csh.cshrc then $HOME/.cshrc then /usr/local/etc/cshrc then $HOME/.cshrc_user). These files are used to set up the execution environment used by all scripts and for access to the machine without an interactive login. For example, the following commands execute only the .cshrc type files on the remote machine:
| scp data stampede.tacc.utexas.edu: | {only .cshrc sourced on stampede} |
| ssh stampede.tacc.utexas.edu date | {only .cshrc sourced on stampede} |
The .login type files set up environment variables that accounts commonly use in an interactive session. They are sourced after the .cshrc type files (/etc/csh.login then $HOME/.login then /usr/local/etc/login then $HOME/.login_user).
Similarly, if your login shell is a Bourne based shell (bash, sh, ksh, etc.), the profile files are sourced (/etc/profile then $HOME/.profile then /usr/local/etc/profile then $HOME/.profile_user).
The commands in the /etc files above are concerned with operating system behavior and set the initial PATH, ulimit, umask, and environment variables such as HOSTNAME. They also source command scripts in /etc/profile.d: /etc/csh.cshrc sources files ending in .csh, and /etc/profile sources files ending in .sh. Many site administrators use these scripts to set up the environments for common user tools (vim, less, etc.) and system utilities (ganglia, modules, Globus, LSF, etc.).
TACC coordinates the environments on several systems. In order to efficiently maintain and create a common environment among these systems, TACC uses its own startup files in /usr/local/etc. A corresponding file in this etc directory is sourced by the startup script files that reside in your home directory. (Please do not remove these files and the sourcing commands in them, even if you are a UNIX guru.) Any commands that you put in your .login_user, .cshrc_user, or .profile_user file are sourced (if the file exists) at the end of the corresponding /usr/local/etc command files. If you accidentally remove your .login, .cshrc, or .profile, you can copy new ones from /usr/local/etc/start-up.
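The "sourced if the file exists" step can be pictured with the following sketch for Bourne-type shells. It is not the content of TACC's actual startup files; the logic is assumed, and a temporary directory stands in for a real home directory so that no real files are touched:

```shell
# Stand-in for a home directory so the sketch does not touch real files.
demo_home=$(mktemp -d)

# A user customization file, as one might write in $HOME/.profile_user.
cat > "$demo_home/.profile_user" <<'EOF'
export MY_PROJECT_DIR="/tmp/my_project"   # hypothetical user setting
EOF

# What a site startup file might do at its end (assumed logic):
# source the user's file only if it exists.
if [ -f "$demo_home/.profile_user" ]; then
    . "$demo_home/.profile_user"
fi

echo "MY_PROJECT_DIR=$MY_PROJECT_DIR"
rm -rf "$demo_home"
```

Because the user file is read last, settings placed in it take effect after the site defaults are in place.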
Modules
TACC continually updates application packages, compilers, communications libraries, tools, and math libraries. To facilitate this task and to provide a uniform mechanism for accessing different revisions of software, TACC uses the modules utility.
At login, modules commands set up a basic environment for the default compilers, tools, and libraries. This includes, for example, the $PATH, $MANPATH, and $LIBPATH environment variables, directory locations ($WORK, $HOME, etc.), aliases (cdw, cdh, etc.), and license paths. Therefore, there is no need for you to set them or update them when updates are made to system and application software.
Users that require 3rd party applications, special libraries, and tools for their projects can quickly tailor their environment with only the applications and tools they need. Using modules to define a specific application environment allows you to keep your environment free from the clutter of all the application environments you don't need.
Each of the major TACC applications has a modulefile that sets, unsets, appends to, or prepends to environment variables such as $PATH, $LD_LIBRARY_PATH, $INCLUDE_PATH, $MANPATH for the specific application. Each modulefile also sets functions or aliases for use with the application. You need only to invoke a single command to configure the application/programming environment properly. The general format of this command is:
| module load < module_name > |
where < module_name > is the name of the module to load. If you often need an application environment, place the module commands required in your .login_user and/or .profile_user shell startup file.
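As an illustration, a .profile_user fragment for Bourne-type shells might look like the sketch below. The guard around the module command is an addition for the example (so the fragment is harmless in a shell where modules are not initialized), and fftw3 is simply the module used as an example in this guide:

```shell
# Possible $HOME/.profile_user fragment (Bourne-type shells).
# The availability check is an assumption added for illustration.
if type module >/dev/null 2>&1; then
    module load fftw3 || echo "could not load fftw3" >&2
    modules_checked=yes
else
    echo "module command not available in this shell" >&2
    modules_checked=no
fi
```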
Most of the package directories are in /opt/apps ($APPS) and are named after the package. In each package directory there are subdirectories that contain the specific versions of the package.
As an example, the fftw3 package requires several environment variables that point to its home, libraries, include files, and documentation. These can be set up by loading the fftw3 module:
| slogin1% module load fftw3 |
To see a list of available modules, a synopsis of a particular modulefile's operations (in this case, fftw3), and a list of currently loaded modules, execute the following commands:
| slogin1% module avail |
| slogin1% module help fftw3 |
| slogin1% module list |
During upgrades, new modulefiles are created to reflect the changes made to the environment variables. TACC will always announce upgrades and module changes in advance.
File Systems
The TACC HPC platforms have several different file systems with distinct storage characteristics. There are predefined, account-owned directories in these file systems for you to store your data. Of course, these file systems are shared with others, so they are managed either by a quota limit, a purge policy (time-residency), or a migration policy.
To determine the amount of disk space used in a file system, cd to the directory of interest and execute the df -k . command, including the "dot", which represents the current directory. Without the "dot" all file systems are reported.
In the command output below, the file system name appears on the left (IP number, "ib" protocol, using OFED gen2), and the used and available space (-k, in units of 1 KB) appear in the middle columns, followed by the percent used and the mount point:
| slogin1% df -k . |
| File System | 1k-blocks | Used | Available | Use% | Mounted on |
| slogin2:/home | 421492960 | 93863936 | 306218432 | 24% | /home |
To determine the amount of space occupied in a user-owned directory, cd to the directory and execute the du command with the -sb option (s=summary, b=units in bytes):
| slogin1% du -sb |
To determine quota limits and usage on $HOME, execute the quota command from any directory:
| slogin1% quota |
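The byte count reported by du -sb is easier to compare with a quota once it is converted to megabytes. A minimal sketch, assuming GNU du (the -b option is not available in every du implementation); the function name dir_size_mb is hypothetical:

```shell
# Report the space occupied by a directory, in bytes and in whole MB.
dir_size_mb() {
    bytes=$(du -sb "$1" | cut -f1)    # first tab-separated field is the byte count
    echo "$1: $bytes bytes ($((bytes / 1024 / 1024)) MB)"
}

# Size of the current directory.
dir_size_mb .

# Space used and available on the file system holding the current directory.
df -k .
```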
The five major file systems available on Stampede are:
- home directory
At login, the system automatically sets the current working directory to your home directory.
Store your source code and build your executables here.
This directory has a quota limit of 200 MB.
The frontend nodes and any compute node can access this directory.
Use $HOME to reference your home directory in scripts.
Use cd to change to $HOME.
- work directory
Store large files here. Your work directory is shared with the Lonestar system.
Change to this directory in your batch scripts and run jobs in this file system.
The frontend nodes and any compute node can access this directory.
The work file system is approximately 30 TB.
Purge Policy: Files that have not been accessed for more than 10 days are purged.
This file system is not backed up.
Use $WORK to reference this directory in scripts.
Use cdw to change to $WORK.
NOTE: TACC staff may delete files from work if the file system becomes full, even if files are less than 10 days old. A full file system inhibits use of the file system for everyone. The use of programs or scripts to actively circumvent the file purge policy will not be tolerated.
More on $WORK-- How to do parallel I/O in the Lustre File System
- scratch or temporary directory
This is a directory in a local disk on each node where you can store files and perform local I/O for the duration of a batch job.
It is often more efficient to use and store files directly in $WORK (to avoid moving files from scratch at the end of a batch job).
The size of the scratch file system varies on Stampede nodes but is typically 500GB.
Files stored in the scratch directory on each node are removed immediately after the job terminates.
Use $SCRATCH to reference this file system in scripts.
- archive
Store permanent files here for archival storage.
This file system is NOT NFS mounted (directly accessible) on any node.
Use the $ARCHIVE file system only for long-term file storage to the $ARCHIVER system; it is not appropriate to use it as a staging area.
Use the rcp command to transfer data to this system. For example:
| slogin1% rcp ${ARCHIVER}:$ARCHIVE/myfile $WORK |
Use the rsh command to login to the $ARCHIVER system from any TACC machine. For example:
| slogin1% rsh ${ARCHIVER} |
See the Ranch User Guide for more on archiving and using TACC Tools such as sinc.
- project directory
An additional file system (/proj) is available only to the members of the group that purchased it.
This file system is available on the compute nodes via an NFS mount.
There is 3.7 TB of storage available in a RAID 5 configuration.
This file system is not backed up, but RAID 5 provides redundancy in case of disk failures.
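Because the work file system is purged by access time, it can be useful to see which files are at risk so that anything worth keeping can be moved to the archive. A minimal sketch, assuming $WORK is set by the site environment and that access times are recorded on the file system; the function name purge_candidates is hypothetical:

```shell
# List regular files under a directory whose last access is more than 10 days old.
# find's "-atime +10" matches files last accessed more than 10 full days ago
# (find rounds the age down to whole days before comparing).
purge_candidates() {
    find "$1" -type f -atime +10
}

# Typical use, only if $WORK is defined and is a directory.
if [ -n "${WORK:-}" ] && [ -d "$WORK" ]; then
    purge_candidates "$WORK"
fi
```

This only reports files. It does not change access times, which would run against the purge policy described above.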
Programming Models
There are two distinct memory models for computing: distributed-memory and shared-memory. In the former, the message passing interface (MPI) is employed in programs to communicate between processors that use their own memory address space. In the latter, Open Multi-Processing (OpenMP, abbreviated OMP below) programming techniques are employed for multiple threads (lightweight processes) to access memory in a common address space.
For distributed memory systems, single-program multiple-data (SPMD) and multiple-program multiple-data (MPMD) programming paradigms are used. In the SPMD paradigm, each processor core loads the same program image and executes and operates on data in its own address space (different data). This is illustrated in Figure 2. It is the usual mechanism for MPI code: a single executable (a.out in the figure) is available on each node (through a globally accessible file system such as $WORK or $HOME), and launched on each node (through the batch MPI launch command, "mpirun").
In the MPMD paradigm, each processor core loads and executes a different program image and operates on different data sets, as illustrated in Figure 2. This paradigm is often used by researchers who are investigating the parameter space (parameter sweeps) of certain models, and need to launch tens or hundreds of single-processor executions on different data. (This is a special case of MPMD in which the same executable is used, and there is NO MPI communication.) On Ranger and Lonestar, these executables are launched through the same mechanism as SPMD jobs, but a UNIX script is used to assign input parameters for the execution command, via the launcher module. However, this module is not available on Stampede.
| Figure 2. Distributed Memory Paradigm: Single/Multiple-Program Multiple-Data. |
 |
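For the parameter-sweep case, in which the runs are independent and exchange no messages, one possible approach is to start the runs as background processes from a single shell script and wait for all of them. The sketch below is only an illustration: run_case is a hypothetical stand-in for a real serial executable, and the four parameter values are arbitrary. On Stampede such a script would run inside a batch job on a compute node, not on a login node.

```shell
# Stand-in for a serial executable; replace with the real program (hypothetical).
run_case() {
    echo "result for parameter $1"
}

outdir=$(mktemp -d)

# Start one independent run per parameter value; the runs do not communicate.
for p in 1 2 3 4; do
    run_case "$p" > "$outdir/case_$p.out" &
done

wait    # block until every background run has finished

cat "$outdir"/case_*.out
```

Keeping the number of simultaneous runs at or below the number of cores on the node (eight on this system) avoids oversubscribing the processors.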
The shared-memory programming model is used on Symmetric Multi-Processor (SMP) nodes. Each node on this system contains 8 cores that share a single memory subsystem (8 GB or 16 GB, depending on the node configuration).
The programming paradigm for this memory model is called Parallel Vector Processing (PVP) or Shared-Memory Parallel Programming (SMPP). The former name is derived from the fact that vectorizable loops are often employed as the primary structure for parallelization. The main point of SMPP computing is that all of the processors in the same node share data in a single memory subsystem, as shown in Figure 2.1. There is no need for explicit messaging between processors as with MPI coding.
| Figure 2.1 Shared-Memory Parallel Processing. |
 |
In the SMPP paradigm, either compiler directives (as pragmas in C, and special comments in Fortran) or explicit threading calls (e.g. with Pthreads) are employed. The majority of science codes now use OpenMP directives that are understood by most vendor compilers, as well as the GNU compilers.
In cluster systems that have SMP nodes and a high speed interconnect between them, programmers often treat all CPUs within the cluster as having their own local memory. On a node an MPI executable is launched on each CPU and runs within a separate address space. In this way, all CPUs appear as a set of distributed memory machines, even though each node has CPUs that share a single memory subsystem.
In clusters with SMPs, hybrid programming is sometimes employed to take advantage of higher performance at the node-level for certain algorithms that use SMPP (OMP) parallel coding techniques. In hybrid programming, OMP code is executed on the node as a single process with multiple threads (or an OMP library routine is called), while MPI programming is used at the cluster-level for exchanging data between the distributed memories of the nodes.
The number of applications that benefit from hybrid programming on dual-processor nodes (e.g. on Lonestar) is very small. The programming and support of hybrid codes is complicated by compiler and platform support of both paradigms. However, with the new multi-core, multi-socket commodity systems on the horizon, there may be a resurgence in hybrid programming if these systems provide better performance with SMPP (OMP) algorithms.
Compiling Code
The Stampede programming environment uses Intel C++ and Intel Fortran compilers by default. This section highlights the important HPC aspects of using the Intel compilers. The Intel compiler commands can be used for both compiling (making ".o" object files) and linking (making an executable from ".o" object files).
The Intel Compiler Suite
The latest Intel compiler available is loaded as the default at login with the intel module. (The previous version of the compiler is available for special porting needs. Use the 'module avail' command to list all modules installed, including versions and default information where applicable.) The Intel suite is installed with the EM64T 64-bit standard libraries and will compile programs as 64-bit applications (as the default compiler mode). Any programs compiled on 32-bit systems need to be recompiled. Any pre-compiled packages should be EM64T (x86-64) compiled or errors may occur. Since only 64-bit versions of the MPI libraries have been built, programs compiled in 32-bit mode will not execute MPI code.
The Intel Fortran compiler command is ifort (use 'ifort -V' for current version information). The ifc command is still accepted, but it displays a message about the name being obsolete.
Web accessible Intel manuals are available: Intel C++ Compiler Documentation and Intel Fortran Compiler Documentation.
Compiling Serial Programs
The table below lists the syntax for serial program compilation.
| Compiler | Language | File Extension | Example |
| icc | C | .c | icc [compiler_options] prog.c |
| icc | C++ | .C, .cc, .cpp, .cxx | icc [compiler_options] prog.cpp |
| ifort | F77 | .f, .for, .ftn | ifort [compiler_options] prog.f |
| ifort | F90 | .f90, .fpp | ifort [compiler_options] prog.f90 |
Appropriate file name extensions are required for each compiler. By default, the executable name is a.out; it may be renamed with the -o option. To compile without the link step, use the -c option. The following examples illustrate renaming an executable and the use of two important compiler optimization options.
A C program example:
| slogin1% icc -o flamec.exe -O3 -xW prog.c |
A Fortran program example:
| slogin1% ifort -o flamef.exe -O3 -xW prog.f90 |
Commonly used options may be placed in an icc.cfg or ifc.cfg file for compiling C and Fortran code, respectively.
For additional information, execute the compiler command with the -help option to display all compiler options, their syntax, and a brief explanation, or display the man page, as follows:
| slogin1% icc -help |
| slogin1% ifort -help |
| slogin1% man icc |
| slogin1% man ifort |
Some of the more important options are listed in the Basic Optimization section of this guide. Additional documentation, references, and a number of user guides (pdf, html) are available in the Fortran and C++ compiler home directories ($IFC_DOC and $ICC_DOC).
Compiling Parallel Programs
OpenMP
Since each of the PowerEdge nodes of the Stampede cluster has eight processing cores, applications can use the shared-memory programming paradigm "on node". Use the -openmp compiler option to create binaries that include OpenMP support only. For hybrid programming (programs that use both OpenMP and MPI), use the MPI compiler commands below, and include the -openmp option.
MPI Compilers
The Message Passing Interface (MPI) is a communication library used for writing parallel programs. MPI is available on Stampede, but we recommend that you only run tightly-coupled MPI programs on up to 8 processing cores (1 node). Stampede has a GigE interconnection network that has significantly lower performance than the Infiniband networks on the TACC Lonestar and Ranger systems. Therefore, you should run tightly-coupled parallel applications that use more than 8 processing cores on Lonestar or Ranger.
However, you can run loosely-coupled MPI programs on more than one Stampede node where MPI is used, for example, to coordinate independent tasks. The Stampede GigE network does have the performance to support such applications.
The OpenMPI implementation of MPI is installed; however, the default environment does not load it. Therefore, if you wish to compile MPI programs, use the module command to load the openmpi module. For example:
| slogin1% module load openmpi |
Once this module is loaded, you will be able to use the mpicc, mpiCC, mpif77, and mpif90 compiler scripts (wrappers) to compile MPI code and automatically link startup and message passing libraries into the executable. The following table lists the compiler wrappers for each language:
| Compiler | Language | File Extension | Example |
| mpicc | C | .c | mpicc [compiler_options] prog.c |
| mpiCC | C++ | .cc, .C, .cpp, .cxx | mpiCC [compiler_options] prog.cc |
| mpif77 | F77 | .f, .for, .ftn | mpif77 [compiler_options] prog.f |
| mpif90 | F90 | .f90, .fpp | mpif90 [compiler_options] prog.f90 |
Appropriate file name extensions are required for each wrapper. By default, the executable name is a.out; it may be renamed with the -o option. To compile without the link step, use the -c option. The following examples illustrate renaming an executable and the use of two important compiler optimization options.
A C program example:
| slogin1% mpicc -o prog.exe -O3 -xW prog.c |
A Fortran program example:
| slogin1% mpif90 -o prog.exe -O3 -xW prog.f90 |
Include linker options such as library paths and library names after the program module names, as explained in the Loading Libraries section below. The Running Code section of this guide explains how to execute MPI executables on the Stampede compute nodes.
We recommend that you use the Intel compiler for optimal code performance. TACC does not support the use of the gcc compiler for production code on the Stampede system. For those rare cases when gcc is required, for either a module or the main program, you can specify the gcc compiler with the -cc=gcc option for the modules requiring gcc. (Since gcc- and Intel-compiled code are binary compatible, you should compile all other modules that don't require gcc with the Intel compiler.) When gcc is used to compile the main program, an additional Intel library is required. The examples below show how to invoke the gcc compiler for these two cases, first for a module and then for the main program:
| slogin1% mpicc -O3 -xW -c -cc=gcc suba.c |
| slogin1% mpicc -O3 -xW mymain.c suba.o |
| |
| slogin1% mpicc -O3 -xW -c suba.c |
| slogin1% mpicc -O3 -xW -cc=gcc -L$ICC_LIB -lirc mymain.c suba.o |
Note: MPI programs compiled for Lonestar will not execute on Stampede, nor will MPI programs compiled for Stampede execute on Lonestar. The reason is that the systems use different MPI libraries with different communication support (Infiniband for Lonestar, TCP/IP for Stampede).
For details on using MPI, visit the Mathematics and Computer Science Division MPI Standard page.
Compiler Options
Compiler options must be used to achieve optimal performance of any application. Generally, the highest impact can be achieved by selecting an appropriate optimization level, by targeting the architecture of the computer (CPU, cache, memory system), and by allowing for interprocedural analysis (inlining, etc.). There is no set of options that gives the highest speed-up for all applications. Consequently, different combinations have to be explored.
The most basic level of optimization that the compiler can perform is selected with the -On options, explained below.
Optimization Level: -On
| Level | Description |
| n = 0 | Fast compilation, full debugging support; equivalent to -g |
| n = 1,2 | Low to moderate optimization, partial debugging support: instruction rescheduling; copy propagation; software pipelining; common subexpression elimination; prefetching, loop transformations |
| n = 3+ | Aggressive optimization - compile time/space intensive and/or marginal effectiveness; may change code semantics and results (sometimes even breaks code!): enables -O2; more aggressive prefetching, loop transformations |
The following table lists some of the more important compiler options that affect application performance, based on the target architecture, application behavior, loading, and debugging.
| Option | Description |
| -c | For compilation of source file only. |
| -O3 | Aggressive optimization (-O2 is default). |
| -xT | Generates code with streaming SIMD extensions SSE2/3/4 for EM64T architecture. |
| -axT | Same as -xT, but also generates generic code. |
| -g | Debugging information, generates symbol table. |
| -mp | Maintain floating point precision (disables some optimizations). |
| -mp1 | Improve floating-point precision (speed impact is less than -mp). |
| -ip | Enable single-file interprocedural (IP) optimizations (within files). |
| -ipo | Enable multi-file IP optimizations (between files). |
| -prefetch | Enables data prefetching (requires -O3). |
| -openmp | Enable the parallelizer to generate multi-threaded code based on the OpenMP directives. |
| -openmp_report[0|1|2] | Controls the OpenMP parallelizer diagnostic level. |
Loading Libraries
Some of the more useful load flags/options are listed below. For a more comprehensive list, consult the ld man page.
- Use the -l loader option to link in a library at load time. For example:
| slogin1% ifort prog.f90 -lname |
This links in either the shared library libname.so (default) or the static library libname.a, provided it can be found in ld's library search path or the LD_LIBRARY_PATH environment variable paths.
- To explicitly include a library directory, use the -L option. For example:
| slogin1% ifort prog.f -L/mydirectory/lib -lname |
In this example, the user's libname.a library is not in the default search path, so the "-L" option is specified to point to the libname.a directory.
Many modules for applications and libraries, such as the mkl library module provide environment variables for compiling and linking commands. Execute module help module_name for a description, listing, and use cases for the assigned environment variables. The following example illustrates their use for the mkl library:
| slogin1% mpicc -Wl,-rpath,$TACC_MKL_LIB -I$TACC_MKL_INC mkl_test.c \ |
| -L$TACC_MKL_LIB -lmkl_em64t |
Here, the module-supplied variables TACC_MKL_LIB and TACC_MKL_INC contain the MKL library and header file directory paths, respectively. The loader option -Wl,-rpath specifies that the $TACC_MKL_LIB directory should be recorded in the binary executable. This allows the run-time dynamic loader to determine the location of shared libraries directly from the executable instead of from the LD_LIBRARY_PATH variable or the dynamic loader's cache of bindings between shared libraries and directory paths. (This avoids having to set LD_LIBRARY_PATH, "manually" or through a module command, before running the executables.)
Note: Previously, FORTRAN programs that used utilities such as getarg needed to include the compatibility library libPEPCF90.a with the "-Vaxlib" option when using the Intel compiler. As of version 9.1, this option is no longer required.
Performance Libraries
ISPs (Independent Software Providers) and HPC vendors provide high performance math libraries that are tuned for specific architectures. Many applications depend on these libraries for optimal performance. Intel has developed performance libraries for most of the common math functions and routines (linear algebra, transformations, transcendental, sorting, etc.) for the em64t architectures. Details of the Intel libraries and specific loader/linker options are given below.
MKL library
The "Math Kernel Library" consists of functions with Fortran, C, and C++ interfaces for the following computational areas:
- BLAS (vector-vector, matrix-vector, matrix-matrix operations) and extended BLAS for sparse computations
- LAPACK for linear algebraic equation solvers and eigensystem analysis
- Fast Fourier Transforms
- Transcendental Functions
In addition, MKL offers a set of functions collectively known as VML, the "Vector Math Library". VML is a set of vectorized transcendental functions which offer both high performance and excellent accuracy compared to the libm functions (for most of the Intel architectures). The vectorized functions are considerably faster than standard library functions for vectors longer than a few elements.
To use MKL and VML, first load the MKL module using the command module load mkl. This will set the TACC_MKL_LIB, TACC_MKL_INC, and TACC_MKL_DOC environment variables to the directories containing the MKL libraries, the MKL header files, and the MKL documentation. Below is an example command for compiling and linking a program that contains calls to BLAS functions (in MKL). Note that the library is for use within a single node, and hence can be used with either the serial compilers or the MPI wrapper scripts.
| mpicc -O3 -Wl,-rpath,$TACC_MKL_LIB -I$TACC_MKL_INC foo.c -L$TACC_MKL_LIB -lmkl_p4n -lguide |
For additional documentation and reference on MKL, in both pdf and html form, please look in the directory specified by the TACC_MKL_DOC environment variable.
Running Code
Acceptable Use of Login Nodes
The Stampede login nodes should not be used for executing simulations, data analysis, or similar applications. The purpose of the login nodes is to prepare your applications for execution on the Stampede compute nodes via the SGE batch scheduling system, described in the next section.
Acceptable uses of the login nodes include transferring files to and from Stampede and between file systems on Stampede, compiling, SGE job submission and management, and executing low-processor-usage control programs. The login nodes should not be used to execute your applications, even if they are just tests that will run for only a few minutes. The execution of such programs on the login node can severely affect the usability of the login node.
Overview of Batch Scheduling
Execute applications on the compute nodes of Stampede via a batch scheduling system. When using such a system, you do not typically log in to a compute node to run the application. Instead, you use the batch scheduler to execute the application for you, and the scheduler runs your application on one or more compute nodes as nodes become available. The basic operations when interacting with any batch scheduler are preparing a description of the program you want to execute (typically called a job), submitting this job to the batch scheduler, watching the status of the job as it proceeds through the batch system (first waiting for nodes to be available, then running, then completing), and controlling the job (for example, pausing or canceling the job).
Stampede uses the Sun Grid Engine (SGE) batch scheduling system to manage jobs.
Interactive Access to Compute Nodes
Before we begin describing how to use the SGE batch system, there are a few ways that you can gain interactive access to Stampede compute nodes when running a batch job. Such access is useful for debugging and performing compute-intensive interactive work such as data analysis.
You can obtain interactive access to a compute node by using the qrsh command. The qrsh command is used to execute a command on one of the compute nodes and this command can be a shell. For example:
slogin1$ qrsh -V /bin/bash -i
This command will submit a job to the batch system and when that job begins to execute (potentially after a delay), your terminal will be logged in to one of the Stampede compute nodes. The -V option will propagate your environment on the login node to the compute node.
The qsh command will submit a job to SGE and when it begins to execute, an xterm will be displayed on your screen from the compute node the job executes on. You will need to have X Windows running on your client system and your DISPLAY environment variable set on the Stampede login node to point to your client system for qsh to work.
Finally, if you have a job already running on one or more compute nodes, you can access these compute nodes by simply using the ssh program from a Stampede login node to connect to the compute node. You can then observe the behavior of your application, perform debugging, or other tasks.
The SGE Batch System
The SGE batch system supports both serial and parallel jobs and has a variety of configuration options. This guide describes only basic usage; more documentation is available online, including the N1 Grid Engine 6 User's Guide, and the manual pages on the Stampede login nodes provide excellent information. The first man page to examine is sge_intro, which you can view with 'man sge_intro'.
Job Scripts
A job script describes your job to the SGE batch system and starts your job. It consists of a shell script with SGE directives as comments. A simple example for a serial program is:
#$ -N simple_test
#$ -cwd
#$ -V
#$ -o simple_test.out
#$ -e simple_test.err
#$ -l h_rt=0:05:00
#$ -pe serial 1
#$ -q normal
/bin/hostname
|
Lines that begin with #$ are directives to SGE. There are a variety of different directives that can be used. Those used in the example are described in the table below.
| Directive |
Description |
| -N |
name of the job (optional) |
| -cwd |
the job should run out of the current working directory |
| -V |
propagate submission environment (recommended) |
| -o |
send stdout of the program to the specified file (recommended) |
| -e |
send stderr of the program to the specified file (recommended) |
| -l |
limits for the job. The h_rt parameter specifies the run time limit |
| -pe |
the parallel environment to use and the number of processing cores to use |
| -q |
submit the job to the specified queue |
An example job script for an MPI job is:
#$ -N myjob
#$ -cwd
#$ -V
#$ -o myjob.out
#$ -e myjob.err
#$ -l h_rt=0:05:00
#$ -pe mpi 32
#$ -q parallel
mpirun -np $NSLOTS ./myjob
|
In this script, the -pe mpi 32 directive specifies the MPI parallel environment and 32 processing cores (4 or more nodes). The mpirun line passes the compiled MPI program to run as an argument. Specify the number of processes to use with the -np flag. $NSLOTS is an environment variable set by SGE with the number of processing cores allocated to the job. Use it as an argument to the -np option for convenience, instead of duplicating the number provided with the -pe mpi directive.
Note: To run MPI applications, you must load the openmpi module into your environment.
See the section on Software for additional information.
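As a rough illustration of the "4 or more nodes" remark above, the following sketch computes the smallest number of nodes that could hold a given slot count. It assumes 8 cores per node, as described in the Architecture section. The function name min_nodes_for_slots and the variable CORES_PER_NODE are made up for this example, and SGE may still spread a job over more nodes than this minimum.

```shell
# Assumption: 8 cores per node (two quad-core processors per compute node).
CORES_PER_NODE=8

# Hypothetical helper: smallest number of nodes that can hold the
# requested number of slots (integer division, rounded up).
min_nodes_for_slots() {
    echo $(( ($1 + CORES_PER_NODE - 1) / CORES_PER_NODE ))
}

# Inside a job script SGE sets NSLOTS; fall back to 32 so the sketch
# also runs outside of a job.
min_nodes_for_slots "${NSLOTS:-32}"
```

With the -pe mpi 32 request from the example script, this prints 4.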
Job Submission
Submit jobs to the SGE batch system using the qsub command. If your job is fully described in your job script, the use of qsub is very simple:
| slogin1$ qsub simple.sge |
| Your job 82 ("simple_test") has been submitted |
The only argument to qsub is the name of the file that contains your job script. If the job submission is successful, qsub prints out the identifier for your job (in this case, "82") that you can use to monitor and control your job. We recommend always using a job script so you have a reusable record of how the job was submitted.
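If you want to use the job identifier in a script, one possible approach is to pull it out of the confirmation message. The sketch below assumes the message has exactly the form shown above; the helper name job_id_from_qsub_output is hypothetical and not part of SGE.

```shell
# Hypothetical helper: print the job ID found in a qsub confirmation line.
# Assumes the format: Your job 82 ("simple_test") has been submitted
job_id_from_qsub_output() {
    # The ID is the third whitespace-separated field of the message.
    printf '%s\n' "$1" | awk '$1 == "Your" && $2 == "job" { print $3; exit }'
}

# Demo with the message from this guide. On a login node you could pass
# "$(qsub simple.sge)" as the argument instead.
job_id_from_qsub_output 'Your job 82 ("simple_test") has been submitted'
```

For the sample message this prints 82. If the message does not match the assumed format, the helper prints nothing.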
Job Monitoring
The qstat command can be used to monitor the status of your job. For example:
slogin1$ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
--------------------------------------------------------------------------------------------------
82 0.00000 simple_test wsmith qw 10/19/2007 00:54:36 1
|
This output shows that the job is queued and waiting. The job then starts to run:
slogin1$ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-------------------------------------------------------------------------------------------------
82 0.55500 simple_test wsmith r 10/19/2007 00:54:41 queuename 1
|
Now the node(s) in use by the job are shown in the qstat information displayed, with queuename representing the name of the queue and the specific node on which the job is running. When the job completes, information on the job no longer appears in the output.
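The state column can also be read from a script. The following is a minimal sketch that assumes the column layout of the listings above (job ID in the first field, state in the fifth); the helper name job_state is hypothetical.

```shell
# Hypothetical helper: print the state (for example qw or r) of one job,
# reading qstat output on standard input.
# Assumes job-ID is field 1 and state is field 5, as in the listings above.
job_state() {
    awk -v id="$1" '$1 == id { print $5; exit }'
}

# Demo using the sample listing from this guide.
# On a login node the equivalent would be: qstat | job_state 82
job_state 82 <<'EOF'
job-ID prior name user state submit/start at queue slots ja-task-ID
--------------------------------------------------------------------
82 0.00000 simple_test wsmith qw 10/19/2007 00:54:36 1
EOF
```

For the sample listing this prints qw. An empty result means the job is not in the listing, which matches the behavior described above for completed jobs.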
Job Control
There are a number of job control methods provided by SGE, but the most used one will be the qdel command. This command is used to delete a job that was previously submitted with qsub. The command is used by simply providing the job identifier as an argument to qdel:
slogin1$ qdel 82
wsmith has deleted job 82
|
Other job control commands include qalter, qhold, and qrls. Please see the manual pages of these commands for additional information.
Queue Structure
The qconf command can be used to learn about the SGE job queues. For example:
slogin1$ qconf -sql
clsb
development
high
normal
parallel
request
|
If you want detailed information about a queue, you can execute:
| slogin1$ qconf -sq < queuename > |
The output shows many details about the queue including which nodes jobs from the queue can use, who can use the queue, and limits on the jobs that can be submitted to the queue. To summarize:
| Queue |
Processing Cores |
Run Time |
Description |
| normal |
up to 8 |
up to 48 hours |
Serial, OpenMP, MPI, or custom multi-process jobs on a single node. |
| high |
up to 8 |
up to 48 hours |
Serial, OpenMP, MPI, or custom multi-process jobs on a single node. |
| parallel |
9 to 64 |
up to 48 hours |
Parallel jobs on multiple nodes. |
| development |
up to 8 |
up to 30 minutes |
Short serial, OpenMP, MPI, or custom multi-process jobs on a single node for development purposes. |
| clsb |
up to 400 |
up to 72 hours |
Restricted access for a group of Stampede contributors. |
| request |
on request |
on request |
Restricted access for special user requests (e.g. extended runtimes or additional nodes). |
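When choosing a queue, it can help to compare the run time you plan to request with -l h_rt against the limits in the table. The sketch below converts an h_rt value of the form H:MM:SS to seconds and checks it against the 30 minute limit of the development queue. The helper names and the DEV_LIMIT_SECONDS variable are made up for this example.

```shell
# Hypothetical helper: convert an h_rt value (H:MM:SS) to seconds.
hrt_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Development queue limit from the table above: 30 minutes.
DEV_LIMIT_SECONDS=1800

# Hypothetical check: succeeds if the run time fits the development queue.
fits_development_queue() {
    [ "$(hrt_to_seconds "$1")" -le "$DEV_LIMIT_SECONDS" ]
}

# Demo with the run time used in the example job scripts.
hrt_to_seconds 0:05:00
if fits_development_queue 0:05:00; then
    echo "fits in the development queue"
fi
```

The demo prints 300 followed by the confirmation message.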
Parallel Environments
You must specify a parallel environment using the -pe option for every SGE job. The available parallel environments are:
| Parallel Environment |
Processing Cores |
Available in Queues |
Description |
| serial |
1 |
all except parallel |
serial and threaded jobs |
| small_mpi |
1 - 8 |
all except parallel |
single node MPI or parallel jobs |
| mpi |
9 - 64 |
parallel, clsb, and request |
multi-node MPI or parallel jobs |
| large_mpi |
65-1632 |
clsb and request |
large multi-node MPI or parallel jobs |
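The table above can be read as a simple lookup from core count to parallel environment. The following sketch encodes it for MPI-style jobs; the function name pe_for_cores is hypothetical, and choosing serial rather than small_mpi for a single core is an assumption made here for illustration.

```shell
# Hypothetical helper: suggest a parallel environment for a core count,
# using the ranges from the table above.
pe_for_cores() {
    cores=$1
    if [ "$cores" -eq 1 ]; then
        echo serial          # assumption: prefer serial over small_mpi for 1 core
    elif [ "$cores" -ge 2 ] && [ "$cores" -le 8 ]; then
        echo small_mpi
    elif [ "$cores" -ge 9 ] && [ "$cores" -le 64 ]; then
        echo mpi
    elif [ "$cores" -ge 65 ] && [ "$cores" -le 1632 ]; then
        echo large_mpi
    else
        echo "unsupported core count: $cores" >&2
        return 1
    fi
}

# Demo with the core count from the MPI job script example.
pe_for_cores 32
```

For 32 cores this prints mpi, matching the -pe mpi 32 directive used earlier. Remember that the queue you submit to must also offer the chosen parallel environment.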
Basic Optimization
General Optimization Guidelines
The most practical approach to enhancing the performance of applications is to use advanced compiler options, employ high performance libraries for common mathematical algorithms and scientific methods, and tune the code to take advantage of the architecture. Compiler options and libraries can provide a large benefit for a minimal amount of work. Always profile the entire application to ensure that optimization efforts are focused on the areas with the greatest return.
"Hot spots" and performance bottlenecks can be discovered with basic profiling tools like gprof. Observe the relative changes in performance among the routines when experimenting with compiler options. Sometimes it might be advantageous to break out routines and compile them separately with different options than those used for the rest of the package. Also, review routines for "hand-coded" math algorithms that can be replaced by standard (optimized) library routines. You should also be familiar with general code tuning methods and restructure statements and code blocks so that the compiler can take advantage of the microarchitecture.
Code should:
- be clear and comprehensible
- provide flexible compiler support
- be portable
Avoid too many architecture-specific code constructs. Use language features and restructure code so that the compiler can discover how to optimize code for the architecture. That is, expose optimization when possible for the compiler, but don't rewrite the code specifically for the architecture.
For single processor optimization, the first step involves a moderate amount of hand-tuning. However, excessive hand-tuning can reduce clarity and limit compiler flexibility. Clean up code to avoid restricting the compiler and to expose optimization opportunities. Let the compiler do most of the optimization for you.
Some best practices:
- Avoid excessive program modularization (i.e. too many functions/subroutines)
  - Write routines that can be inlined
  - Use macros and parameters whenever possible
- Minimize the use of pointers
- Avoid casts or type conversions, implicit or explicit
- Avoid branches, function calls, and I/O inside loops
  - Structure loops to eliminate conditionals
  - Move loops around a subroutine into the subroutine
This usually takes care of the majority of changes necessary for moderate hand-tuning to obtain optimal code. After hand-tuning, the compiler options typically lead to the biggest improvement in performance. So, devote some time to understanding the meaning and significance of the recommended options. After these basic steps, use profiling to locate "hot spots" or performance bottlenecks. Make use of performance routines provided by vendor libraries, as opposed to writing your own version.
Tools
Program Timers and Performance Tools
Measuring the performance of a program should be an integral part of code development. It provides benchmarks to gauge the effectiveness of performance modifications and can be used to evaluate the scalability of the whole package and/or specific routines. There are quite a few tools for measuring performance, ranging from simple timers to hardware counters. Reporting methods vary too, from simple ASCII text to X-Window graphs of time series.
The most accurate way to evaluate changes in overall performance is to measure the wall-clock (real) time when an executable is running in a dedicated environment. On Symmetric Multi-Processor (SMP) machines, where resources are shared (e.g., the TACC IBM Power4 P690 nodes), user time plus sys time is a reasonable metric; but the values will not be as consistent as when running without any other user processes on the system. The user and sys times are the amount of time a user's application executes the code's instructions and the amount of time the kernel spends executing system calls on behalf of the user, respectively.
Package Timers
The time command is available on most UNIX systems. In some shells there is a built-in time command, but it doesn't have the functionality of the command found in /usr/bin. Therefore you might have to use the full pathname to access the time command in /usr/bin. To measure a program's time, run the executable with time using the syntax:
| /usr/bin/time -p ./a.out |
The -p option selects the portable (POSIX) output format, which reports the real, user, and sys times in seconds. See the time man page for additional information.
To use time with an MPI task, use:
| /usr/bin/time -p mpirun -np 4 ./a.out |
This example provides timing information only for the rank 0 task on the master node (the node that executes the job script); however, the time output labeled "real" is applicable to all tasks since MPI tasks terminate together. The user and sys times may vary markedly from task to task if they do not perform the same amount of computational work (not load balanced).
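Since user time plus sys time is mentioned above as a useful metric, the following sketch adds the two values from output in the time -p format. The helper name cpu_seconds is hypothetical, and the numbers in the demo are made up rather than measured on Stampede.

```shell
# Hypothetical helper: sum the "user" and "sys" lines of time -p output
# read on standard input, printing the total in seconds.
cpu_seconds() {
    awk '$1 == "user" || $1 == "sys" { total += $2 } END { printf "%.2f\n", total }'
}

# Demo with made-up timings in the time -p format.
printf 'real 2.50\nuser 1.25\nsys 0.25\n' | cpu_seconds
```

The demo prints 1.50. Note that time writes its report to standard error, so to feed real measurements to the helper you would redirect standard error to a file first, keeping in mind that the program's own error output ends up in the same file.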
Code Section Timers
"Section" timing is another popular mechanism for obtaining timing information. Use these to measure the performance of individual routines or blocks of code by inserting the timer calls before and after the regions of interest. Several of the more common timers and their characteristics are listed below.
| Code Section Timers |
| Routine |
Type |
Resolution (usec) |
OS/Compiler |
| times |
user/sys |
1000 |
Linux/AIX/IRIX/UNICOS |
| getrusage |
wall/user/sys |
1000 |
Linux/AIX/IRIX |
| gettimeofday |
wall clock |
1 |
Linux/AIX/IRIX/UNICOS |
| rdtsc |
wall clock |
0.1 |
Linux |
| read_real_time |
wall clock |
0.001 |
AIX |
| system_clock |
wall clock |
system dependent |
Fortran90 Intrinsic |
| MPI_Wtime |
wall clock |
system dependent |
MPI Library (C & Fortran) |
For general purpose or coarse-grain timings, precision is not important; therefore, the millisecond and MPI/Fortran timers should be sufficient. These timers are available on many systems and hence can also be used when portability is important. For benchmarking loops, it is best to use the most accurate timer (and time as many loop iterations as possible to obtain a time duration at least an order of magnitude larger than the timer resolution). The times, getrusage, gettimeofday, rdtsc, and read_real_time timers have been packaged into a group of C wrapper routines (also callable from Fortran). The routines are function calls that return double (precision) floating point numbers with units in seconds. All of these TACC wrapper timers (x_timer) can be accessed in the same way:
Fortran:
external x_timer
real*8 :: x_timer
real*8 :: sec0, sec1, tseconds
...
sec0 = x_timer()
...Fortran Code
sec1 = x_timer()
tseconds = sec1-sec0

C:
double x_timer(void);
...
double sec0, sec1, tseconds;
...
sec0 = x_timer();
...C Code
sec1 = x_timer();
tseconds = sec1-sec0;
|
Standard Profilers
The gprof profiling tool provides a convenient mechanism to obtain timing information for an entire program or package. Gprof reports a basic profile of how much time is spent in each subroutine and can direct developers to the most time-consuming routines, the "hotspots", where optimization is likely to be most beneficial. As with all profiling tools, the code must be instrumented to collect the timing data and then executed to create a raw-data report file. Finally, the data file must be read and translated into an ASCII report or a graphic display. The instrumentation is accomplished by simply recompiling the code using the -qp (Intel compiler) option. The compilation, execution, and profiler commands for gprof are shown below with a sample Fortran program:
| Profiling Serial Executables |
| ifort -qp prog.f90 |
Instruments code |
| a.out |
Produces gmon.out trace file |
| gprof |
Reads gmon.out (default args: a.out gmon.out)
(report sent to STDOUT) |
| Profiling Parallel Executables |
| mpif90 -qp prog.f90 |
Instruments code |
| setenv GMON_OUT_PREFIX gout.* |
Forces each task to produce a gout |
| mpirun -np <#> a.out |
Produces one gout.* trace file per task |
| gprof -s gout.* |
Combines gout files into gmon.sum |
| gprof a.out gmon.sum |
Reads executable (a.out) & gmon.sum
(report sent to STDOUT) |
Detailed documentation is available at www.gnu.org.
Timing Tools
Most of the advanced timing tools access hardware counters and can provide performance characteristics about floating point/integer operations, as well as memory access, cache misses/hits, and instruction counts. Some tools can provide statistics for an entire executable with little or no instrumentation, while others require source code modification.
Debugging with DDT
Users of Lonestar may be familiar with DDT, a symbolic, parallel debugger that allows graphical debugging of MPI applications. Unfortunately, neither DDT nor any other parallel debugger is available on Stampede. If you need to use a parallel debugger, we suggest you use DDT on Lonestar.
Two serial debuggers are available on Stampede. If you compile your code using the Intel compilers, as we recommend, then you can use the Intel debugger, idb. When debugging programs, you should run your code on one of the compute nodes instead of on the login node. To accomplish this, submit a request for interactive access to a compute node to SGE using the qrsh or qsh commands. Additional information about the Intel debugger is available via the manual page on Stampede by executing man idb.
If you compile your code using gcc, then you should use the GNU debugger, gdb. Information about gdb is available from the gdb manual page.
Manuals
The following manuals and other reference documents were used to gather information for this User Guide and may contain additional information of use.
|
|
|
|
|