Linux Clusters Overview
Table of Contents
1. Abstract
2. Background of Linux Clusters at LLNL
3. Hardware Overview
1. Intel IA32 Xeon Processor
2. AMD Opteron Processor
3. Nodes, Racks, and Configurations
4. Quadrics Interconnect
5. Infiniband Interconnect
4. LC Linux Cluster Systems
1. OCF Linux Clusters
2. SCF Linux Clusters
5. Software and Development Environment
6. Compilers
1. General Information
2. Intel Compilers
3. PGI Compilers
4. PathScale Compilers
7. MPI
1. Quadrics MPI
2. MVAPICH
3. MPICH
8. Running Jobs
1. Overview
2. Batch Versus Interactive
3. Starting Jobs
4. Terminating Jobs
5. Displaying Queue and Job Status Information
6. Optimizing CPU Usage
7. IA32 Memory Considerations
8. Opteron Memory Considerations
9. IA32 Performance Considerations
10. Opteron Performance Considerations
9. Debugging
10. Tools
11. References and More Information
12. Exercise
Abstract
This tutorial is intended to be an introduction to using LC's Linux clusters. It begins by providing a brief historical background of Linux clusters at LC, noting their success and adoption as a production, high performance computing platform. The primary hardware components of an LC Linux cluster are then presented, including the various types of nodes, processors and switch interconnects. The detailed hardware configuration for each of LC's production Linux clusters completes the hardware related information.
After covering the hardware related topics, software topics are discussed, including the LC development environment, compilers, and how to run both batch and interactive parallel jobs. Important issues in each of these areas are noted. Available debuggers and performance related tools/topics are briefly discussed; however, detailed usage is beyond the scope of this tutorial. A lab exercise using one of LC's Linux clusters follows the presentation.
Level/Prerequisites: Intended for those who are new to developing parallel programs in LC's Linux cluster environment. A basic understanding of parallel programming in C or Fortran is assumed. The material covered by EC3501 - Introduction to Livermore Computing Resources would also be useful.
Background of Linux Clusters at LLNL
The Linux Project:
* LLNL first began experimenting with Linux clusters in 1999-2000 in a partnership with Compaq and Quadrics to port Quadrics software to Alpha Linux.
* The Linux Project was started for several reasons:
o Cost: price-performance analysis demonstrated that near-commodity hardware in clusters running Linux could be more cost-effective than proprietary solutions;
o Focus: the decreasing importance of high-performance computing (HPC) relative to commodity purchases was making it more difficult to convince proprietary systems vendors to implement HPC specific solutions;
o Control: it was believed that by controlling the OS in-house, Livermore Computing could better support its customers;
o Community: the platform created could be leveraged by the general HPC community.
* The objective of this effort was to apply LC's scalable systems strategy (the "Livermore Model") to commodity hardware running the open source Linux OS:
o Based on SMP compute nodes attached to a high-speed, low-latency interconnect.
o Uses OpenMP to exploit SMP parallelism within a node and MPI to exploit parallelism between nodes.
o Provides a parallel file system with a POSIX interface.
o Application toolset: C, C++ and Fortran compilers, scalable MPI/OpenMP GUI debugger, performance analysis tools.
o System management toolset: parallel cluster management tools, resource management, job scheduling, near-real-time accounting.
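To make the programming model concrete, here is a minimal hybrid MPI+OpenMP "hello" sketch (illustrative only, not LC-specific code; build it with one of the MPI compiler scripts described later plus the compiler's OpenMP flag):

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

/* MPI exploits parallelism between nodes; OpenMP threads exploit
   the SMP parallelism within each node. */
int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    printf("MPI task %d, OpenMP thread %d\n", rank, omp_get_thread_num());

    MPI_Finalize();
    return 0;
}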
Livermore Model
Alpha Linux Clusters:
* The first Linux cluster implemented by LC was LX, a Compaq Alpha Linux system with no high-speed interconnect.
* The first production Alpha cluster targeted to implement the full Livermore Model was Furnace, a 64-node system with two EV68 CPUs per node and a QsNet interconnect. However...
o Compaq announced the eventual discontinuation of the Alpha server line
o The Intel Pentium 4, with favorable SPECfp performance, was released just as Furnace was delivered.
* This prompted Livermore to shift to an Intel IA32-based model for its Linux systems in July 2001.
* Furnace's interconnect was allocated to the IA32-based PCR clusters (below) instead. It then operated as a loosely coupled cluster until it was decommissioned in 10/03.
PCR Clusters:
* August 2001: The Parallel Capacity Resource (PCR) clusters were purchased from Silicon Graphics and Linux NetworX:
* Consisted of a 128-node production cluster (Adelie), an 88-node production cluster (Emperor), and a 26-node development cluster (Dev).
* Each PCR compute node had two 1.7-GHz Intel Pentium 4 CPUs and a QsNet Elan3 interconnect.
* Parallel file system was not implemented at that time - instead dedicated BlueArc NFS servers were used.
* SCF resource only
* In July, 2002, the 16-node Pengra cluster was procured for the OCF to provide a less restrictive development environment for PCR-related work.
* For more information see: http://www.llnl.gov/linux/ucrl-id-150021.html.
MCR Cluster...and More:
* The success of the PCR clusters was followed by the purchase of the Multiprogrammatic Capability Resource (MCR) cluster in July, 2002 from Linux NetworX.
* An 1152-node cluster of dual-processor, 2.4 GHz Intel Xeon nodes
* MCR's procurement was intended to significantly increase the resources available to Multiprogrammatic and Institutional Computing (M&IC) users.
* MCR's configuration included the first production implementation of the Lustre parallel file system, an integral part of the "Livermore Model".
* Debuted as #5 on the Top500 Supercomputers list in November, 2002, and then peaked at #3 in June, 2003.
* For more information see: http://www.llnl.gov/linux/mcr/background/mcr_background.html.
* Convinced of the success of this approach, LC implemented several other IA-32 Linux clusters simultaneously with, or after, the MCR Linux cluster:
ALC OCF 960 nodes
ILX OCF 67 nodes
PVC OCF 64 nodes
SPHERE OCF 84 nodes
LILAC SCF 768 nodes
ACE SCF 160 nodes
GVIZ SCF 64 nodes
Which Led To Thunder…
* In September, 2003 the RFP for LC's first IA-64 cluster was released. The proposal from California Digital Corporation, a small local company, was accepted.
* A 1024-node system of 4-CPU Itanium 2 "Madison Tiger4" nodes
* Thunder debuted as #2 on the Top500 Supercomputers list in June, 2004.
* Because Thunder is LC's only IA-64 production Linux cluster, it is not discussed in detail in this tutorial.
* For more information about Thunder see:
o "Using Thunder" tutorial - www.llnl.gov/computing/tutorials/linux_clusters/thunder.html
o Thunder homepage - www.llnl.gov/linux/thunder
Thunder Photo
And Now, the Peloton Systems:
* In early 2006, LC launched its most recent Linux cluster procurement with release of the Peloton RFP: www.llnl.gov/linux/peloton/rfp.
* Appro (www.appro.com) was awarded the contract in June, 2006.
* Peloton is comprised of six clusters representing a mix of resources:
o OCF and SCF
o ASC and M&IC
o Capability and capacity
o Infiniband interconnect and no high-speed interconnect
* Peloton clusters are built in 5.5 TFLOPS "scalable units" (SU) of ~144 nodes
* All Peloton clusters use AMD dual-core Socket F Opterons:
o 8 CPUs per node
o 2.4 GHz clock
o Upgradeable to 4-core Opteron "Deerhound" later
* The six clusters will be integrated and become generally available in late 2006 through early 2007.
Hardware Overview
Intel IA32 Xeon Processor
Basics:
* As its name implies, Intel's IA32 architecture is a 32-bit based processor design. This basic design is used in a wide range of Intel products.
* x86: The IA32 architecture is the result of an evolution that started in 1978 with Intel's 16-bit based 8086 processor, and spans through the 286, 386, 486, multiple Pentiums and Xeon architectures. For the curious, an interesting history of the IA32 is presented in the "IA-32 Intel Architecture Software Developer's Manual, Volume 1: Basic Architecture", which can be downloaded from Intel's website.
* For our purposes, the IA32 Xeon processor can be thought of as a 2-way SMP capable version of the Intel Pentium 4 processor. The basic CPU designs are very similar.
* All of LC's IA32 Linux clusters are based on 2-way Xeon processors.
* All IA32 machines are "little endian", which may pose some portability issues, not discussed here, when moving codes/data from "big endian" systems, such as IBMs.
* Important: Intel has evolved the Xeon into a 64-bit, multi-core architecture. However, since all of LC's IA32 Linux clusters use 32-bit Xeons, the remainder of the discussions here are limited to the IA32 architecture.
IA32 Xeon Design Facts/Features: (circa LC's clusters)
* High clock speeds - currently over 3 GHz. LC's systems range between 2-3 GHz.
* Addressing: a 32-bit based system may access up to 4 GB (2^32 bytes) of linear memory. However, Intel's physical address extension (PAE) mode allows addressing of virtual memory up to 64 GB (2^36 bytes). For details, consult Intel's documentation.
* SIMD vectorization capability (discussed below).
* Hyper-Threading Technology: allows a single CPU to look like 2 CPUs.
* Design features common to most modern processors:
o Enhanced branch prediction capabilities: employs advanced algorithms to better predict the outcome of branch operations.
o Speculative execution: able to execute instructions that lie beyond an unresolved conditional branch and apply the "correct" results to the original instruction stream.
o Superscalar: multiple execution units that can execute instructions simultaneously with each clock tick.
o Pipelined architecture: up to 126 instructions can be "in flight" at any given time.
* Caches:
o First level caches: Execution trace cache and 8 KB data cache, 4-way associative, 64-byte line size.
o L2 cache: up to 512 KB, unified, 8-way associative, 128-byte line size
o L3: optional, up to 4 MB, 8-way associative, 128-byte line size
* A block diagram of an IA32 Pentium 4/Xeon processor is shown below. It is actually a composite from several sources of information from Intel and represents a general idea on how the processor is organized.
Pentium 4/Xeon processor block diagram
Adapted from a similar diagram in the "Overview of Recent Supercomputers" whitepaper by Aad J. van der Steen, Utrecht University, Netherlands, 2003.
* Register set (diagram and more discussion below):
o General purpose registers
o Floating-point unit registers
o MMX vector registers
o SSE/SSE2 vector registers
Xeon Execution Environment
Floating Point Unit:
* The IA32 Xeon has a single x87 floating point unit, per CPU. A 2 CPU Xeon system therefore has two x87 FPUs.
* In an ideal multiply-add code, one FP operation can be issued every clock tick when using the x87 FPU.
* The x87 FPU uses a floating point stack with eight 80-bit register elements. These same 80-bit registers are used whether the operands are single, double, or extended-double precision.
* The use of a stack in the FPU can cause application code bugs that are harmless on other architectures to produce NaNs (Not-a-Numbers) or incorrect results on the x87 if the FP stack pointer gets out-of-sync. The Exercises contain a couple of examples of code bugs that cause the FP stack pointer to get out-of-sync.
* There are constructs of the x87 FPU that do not conform to IEEE 754 standard for binary floating point arithmetic. For example, the extra precision in the FPU registers and non-compliance of some instructions by default. To conform with IEEE 754 (so that results exactly match those on other architectures), special compile-time options must be used that incur a performance cost.
* FPU underflows and subsequent denormal operations require software assistance which is very costly. This overhead is incurred whenever numbers reach the following thresholds:
o Single precision: 1.18 x 10^-38
o Double precision: 2.23 x 10^-308
o Extended double precision: 3.37 x 10^-4932
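The cost of this software assist can be seen with a small timing sketch like the one below (not from the original tutorial; it assumes x87 code generation - SSE2 flush-to-zero modes can mask the effect):

#include <stdio.h>
#include <time.h>

/* Time repeated multiplies; with start = 1.0e-38f each product
   underflows into the denormal range and takes the assist path. */
static double time_loop(float start)
{
    volatile float x;
    int i;
    clock_t t0 = clock();
    for (i = 0; i < 10000000; i++) {
        x = start;
        x *= 0.5f;
    }
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    printf("normal operands:   %.2f s\n", time_loop(1.0f));
    printf("denormal results:  %.2f s\n", time_loop(1.0e-38f));
    return 0;
}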
SIMD Vector Units:
* SIMD = Single Instruction Multiple Data. Refers to the ability to perform the same operation on multiple operands at the same time, and therefore produce multiple results in one operation. The diagram below demonstrates the difference between a typical scalar operation (left) and a SIMD operation (right).
SIMD execution example
* The MMX registers provide for SIMD operations on 64-bit packed byte, word or doubleword integer data. Benefits media, graphics and communications applications.
* The XMM registers provide for SIMD operations on 128-bit packed single or double precision floating point data, and packed byte, word, doubleword or quadword integer data. For floating point data, the XMM registers can contain 2 packed double-precision floating point values or 4 single-precision values.
SSE2 data types
* Special instructions known as SSE2 (Streaming SIMD Extensions 2) are used to operate on floating-point values in the XMM registers. Operations exist to pack and unpack the registers with values, perform arithmetic or logical operations on all or parts of registers, and shift or shuffle register contents.
* The Intel compilers provide the only truly workable SSE2 vectorizing capability.
* In order to take advantage of the XMM registers, your code must have loops in it that a compiler can vectorize.
* For a loop to vectorize, several conditions must be met, including:
o A lack of data dependence across loop iterations
o Use of arithmetic operations that are capable of being vectorized by a corresponding SSE2 instruction. Not all operations have a matching SSE2 instruction.
o Loops that are "countable" and with a sufficient number of iterations
o Only inner loops are vectorized
* Because the XMM registers contain packed data, there are no extra bits available to perform IEEE 754 compliant rounding control. Therefore, SSE2 vectorized code cannot be IEEE 754 compliant. Also, results may differ between the vectorized and unvectorized versions of the same code.
* In an ideal multiply-add code, it is possible to perform one SSE2 FP operation every clock tick. Since each SSE2 operation can operate on 2 doubles/4 singles at once, this equates to 2/4 floating-point operations per clock tick. In practice, this theoretical limit is very difficult to achieve due to overhead incurred in packing and unpacking the vector registers.
* Because vectorization can sometimes degrade performance, users are advised to test performance both with and without vectorization.
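As a concrete illustration of the conditions listed above, the loop below is a good vectorization candidate: it is a countable inner loop with no cross-iteration dependences, and its multiply/add operations map directly to SSE2 instructions (a sketch; the C99 restrict qualifiers are one way to assure the compiler there is no aliasing):

/* Each SSE2 operation here can process 4 single-precision elements. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    int i;
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

Compiling with the Intel compiler's -xW option (see the Intel Compilers section) enables the auto-vectorizer; the -opt_report family of options can help confirm whether the loop actually vectorized.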
Hyper-Threading:
* Hyper-threading is Intel's term for their IA32 architecture specific implementation of simultaneous multi-threading (SMT).
* Hyper-threading characteristics:
o It appears to the user as if 2 CPUs exist when there is really only 1. For a 2-CPU Xeon, it would appear that there were 4 CPUs.
o Allows 2 user level threads to access the chip core without task switching overhead.
o Duplicates the chip's execution environment, including main registers and other chip architectural states.
* Best utilized by memory latency bound applications where there are "wasted cycles" waiting for slower operations to complete.
* Speedup benefits from hyper-threading can be > 25% for some latency-bound applications. In other cases speedup is less than 5%, and for some codes, such as those which are almost entirely memory-latency dependent, hyper-threading may decrease performance.
* More information: www.intel.com/products/ht/hyperthreading_more.htm
* Note: Hyper-threading is disabled on all of LC's IA32 machines because the current Linux kernel scheduling can put both hyper-threaded tasks on the same physical CPU even though the other CPU is not being used.
Chipsets, Memory and Performance:
* Processors form only part of a complete machine. Additional hardware is required such as memory, I/O devices, network adapters and other peripherals. IA32 processors are commodity, off-the-shelf components that can be used in a variety of ways.
* Intel and other vendors package IA32 processors in many different configurations. One of the more important distinctions between Xeon based systems is the "chipset" used to create the motherboard.
* Chipsets are packages in themselves, and serve to connect the Xeon processors to the rest of what forms a node in a Linux cluster.
* Chipsets vary and do not necessarily have to be manufactured by Intel, or work only with IA32 Xeon processors. They also differ in the types and amount of memory they support.
Example: Intel E7500 Chipset Xeon System
Intel E7500 chipset diagram
* Why is this important? Primarily from a performance perspective - it is not the processor clock speed alone that determines the overall performance of a machine. The chipset and memory components matter too.
* Although all of LC's IA32 Linux clusters use dual-cpu Xeon chips, they vary in clock speed, chipset, amount of memory and type of memory. The current ranges are:
o Clock speeds: 2.2 - 3.0 GHz
o Chipsets: E7500, E7501, E7505, ServerWorks
o Memory: 2 - 4 GB
o Memory types: DDR SDRAM, RDRAM
* In general, the sustained memory-cpu bandwidth on LC's IA32 systems ranges from 1800-2200 MB/s as measured with the STREAM benchmark (a minimal triad sketch follows the list below).
* More information:
o E7500 chipset: www.intel.com/products/chipsets/e7500
o E7501 chipset: www.intel.com/products/chipsets/e7501
o E7505 chipset: www.intel.com/products/chipsets/e7505
o STREAM benchmark: www.cs.virginia.edu/stream/ref.html
o Memory acronyms and definitions
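Bandwidth figures like the one above come from kernels such as the STREAM "triad". A minimal, self-timed sketch is shown below (illustrative only; the official benchmark at the URL above is the authoritative measurement):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N    2000000
#define REPS 10

/* STREAM-style triad: each pass reads b and c and writes a,
   moving 3 * N * sizeof(double) bytes through memory. */
int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    double scalar = 3.0;
    int i, r;

    for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    clock_t t0 = clock();
    for (r = 0; r < REPS; r++)
        for (i = 0; i < N; i++)
            a[i] = b[i] + scalar * c[i];
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("a[0] = %g, triad bandwidth ~ %.0f MB/s\n",
           a[0], (double)REPS * 3 * N * sizeof(double) / secs / 1.0e6);
    free(a); free(b); free(c);
    return 0;
}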
The table below describes LC's IA32 Linux cluster systems and how they vary with respect to chipsets and memory components.
System    CPU Speed (GHz)  Chipset          Memory (GB)  Memory Type
ILX       2.4 / 2.8        E7500 / E7501    4            DDR SDRAM
PENGRA    2.2              E7500            2            DDR SDRAM
MCR       2.4              E7500            4            DDR SDRAM
PVC       2.4              E7500            2            RDRAM
SPHERE    2.8              E7505            2            -
ALC       2.4              ServerWorks      4            DDR SDRAM
ADELIE    2.8              E7501            4            RDRAM
EMPEROR   2.8              E7501            4            RDRAM
LILAC     3.0 / 2.8        ServerWorks      4            -
GVIZ      2.8              E7505            2            -
ACE       2.8              E7501            4            RDRAM
Hardware Overview
AMD Opteron Processor
Basics:
* The AMD Opteron debuted in April, 2003 as the first 64-bit architecture compatible with the industry-standard x86 instruction set.
* Built on AMD64 technology (formerly code named "Hammer") which extends full backward compatibility for 32-bit x86 software. Also enables 64-bit computing for non-x86 applications.
* Initial offering was a single-core processor designed with multi-core in mind. Dual-core processor became available in April, 2005 and a quad-core processor is expected in 2007.
* Multi-core Opterons are used to build 2-, 4-, 8- and later, 16-way SMPs.
* Target market: multi-processor servers and workstations. AMD also manufactures multi-core 64-bit processors (i.e. Athlon) for other markets.
* All of LC's Opteron based clusters are dual-core, 2.4 GHz, 8-way SMPs.
* Because Opterons are x86 based, they are "little endian" like LC's Xeons.
Design Facts/Features: (circa LC's clusters)
* 64-bit architecture providing full 64-bit memory addressing
* Direct Connect Architecture:
o No front side bus
o Integrated (on-die) memory controller. CPU-memory bandwidth is 10.7 GB/s with DDR2-667 DIMMs (@5.3 GB/s per core).
o HyperTransport interconnects (3) directly connect CPUs to I/O subsystems, other chipsets, and other off-chip CPUs. Provide 8 GB/s bandwidth per link (4 GB/s each direction) for an aggregate bandwidth of 24 GB/s bandwidth per CPU.
* AMD Virtualization: disparate applications can coexist on same system. Each application's environment (operating system, middleware, communications, etc.) is represented as a virtual machine.
* Current dual-core Opterons will be easily upgradeable to quad-core later - only a BIOS upgrade is required.
* Full support for Intel's SIMD vectorization instructions (SSE, SSE2, SSE3…)
* Floating point units capable of producing 2 results (one add, one multiply) per clock cycle.
* Reduced power consumption:
o Energy efficient DDR2 memory
o AMD dynamic power management technology
* Design features common to most modern processors:
o Enhanced branch prediction capabilities
o Speculative / out-of-order execution
o Superscalar
o Pipelined architecture
* Caches:
o L1 Instruction: 64 KB, 64-byte line, 2-way associative
o L1 Data: 64 KB, 64-byte line, 2-way associative
o L2: 1 MB, 64-byte line, 16-way associative
Opteron Diagram
Opteron Detail
Chipsets/Motherboards:
* As with other vendor processors, AMD Opterons are commonly combined with other components to make a complete system.
* Motherboards for Opteron systems are manufactured by a number of vendors. LC's machines use the SuperMicro H8QM8-2 motherboard.
* Key components:
o Four AMD Opteron dual-core Socket F model 8216 processors
o 16 DIMM sockets supporting 2GB DDR2 667/533/400 MHz DIMMs. Currently populated with 16 GB of memory per node (8 DIMMs). Can be upgraded to maximum of 32 GB later.
o AMD-8132 Chipset (PCI-X bridge)
o nVidia MCP55 Pro Chipset (I/O Hub)
o Two PCI-X 100 MHz slots
o Two PCI-X 133/100 MHz slots
o One PCI-Express x16 slot
o One PCI-Express x8 slot
o Two 1 Gb Ethernet ports
o Remote server management hardware
o Ports/controllers for SCSI, SATA, IDE, USB, keyboard, mouse, serial, parallel, etc. (most of these not used by LC machines)
* Photo at right, block diagram below
SuperMicro H8QM8-2 motherboard photo
SuperMicro H8QM8-2 motherboard block diagram
Hardware Overview
Nodes, Racks, and Configurations
Nodes: Rack mounted nodes
* The basic building block of a Linux cluster is the node. A node is essentially an independent PC. However, some important features of cluster nodes distinguish them from typical desktop machines.
* Low form-factor - Cluster nodes are very thin in order to save space. For example, the Opteron cluster nodes have a form factor of approximately 1U (1U = 1.75").
* Rack Mounted - Nodes are mounted compactly in a drawer fashion to facilitate maintenance, reduced footprint, etc.
* Remote Management - There is no keyboard, mouse, monitor or other device typically used to interact with a desktop machine. All interaction with a cluster node takes place remotely over a network.
Racks:
* Racks are the "boxes" or physical frames that hold a cluster's nodes. They may also hold other hardware such as disk devices and network components.
* Racks vary in size/appearance between the different Linux clusters at LC.
* Power and console management - Racks include hardware and software that allow system administrators to perform most tasks remotely. LC clusters differ in specifics for this function.
Configurations:
* Clusters can be configured in a variety of ways and every LC Linux cluster is different.
* Nodes are typically configured into 4 types, according to their function:
o Compute nodes - Nodes that run user jobs. The majority of nodes in a cluster. Compute nodes are typically split into one of two partitions: batch or interactive/debug. However, several LC clusters have nodes that serve both purposes.
o Login nodes - One or more per cluster. This is where you login for access to the compute nodes. Login nodes are also used to build your applications and control cluster jobs.
o Gateway (I/O) nodes - These nodes are dedicated fileservers. They connect the compute nodes to essential file systems which are mounted on disk storage devices, such as Lustre OSTs. The number of these nodes varies per cluster.
o Administrative/management node - One per cluster, located remotely from actual cluster. Used by system administrators to manage the entire cluster. Not accessible to users.
* Additionally, racks contain switch components and management hardware.
* Racks containing a cluster's disk resources are separate from the cluster.
* One example (Zeus cluster) is shown below for illustrative purposes.
Example node/rack configuration
Hardware Overview
Quadrics Interconnect (IA32 clusters)
* The high-speed interconnect used for LC's IA32 clusters is Quadrics Elan3 QsNet.
* Quadrics QsNet is a high-bandwidth, low-latency, fault-tolerant, scalable, interconnect that enables microprocessor based systems running Linux to be networked into clusters.
Primary components:
* Elan Adapter Card: communications processor packaged on a PCI-based network adapter card. This card provides the interface between a node and a multi-stage network. Copper cabling connects each Elan card to a switch.
* Elite Switch: 8-way, bidirectional crossbar switch. Each port connects to either an Elan adapter card or to another switch. Elites are packaged on switch cards with various configurations.
Elan adapter card and 16-port switch card
Topology:
* Multi-stage, quaternary fat-tree. The building block is the 8-way Elite switch. The number of stages depends upon the number of nodes required in the network.
* Examples:
Example 16-port topology Example 128-port topology
* Federated Switch: for configurations of 256, 512 or 1024 nodes, stages 4-5 are required.
* There is some additional message passing latency on Federated configurations.
* 128-port switch images available here.
Features:
* Each Elan adapter card has 64 MB DMA memory that can be directly mapped into process virtual memory.
* Clock cards (128-port) and clock distribution box (federated network) provide a global time source.
* JTAG (Joint Test Action Group) interface provided for diagnostics and maintenance.
* Quadrics provides job scheduler software called Resource Management System (RMS). At LC however, SLURM has replaced RMS and LCRM is used to submit/monitor jobs.
* More than one Elan adapter card may be installed in a node, each linked to a separate QsNet network. This provides the means to build access to multiple networks (rails).
Performance:
* Ultra-low latency of ~5 us for MPI point-to-point communications on the Elan3 based IA32 Xeon clusters at LC.
* Peak observed point-to-point bidirectional MPI bandwidth of ~340 MB/s on LC's IA32 Linux clusters. May be less, depending upon the node's chipset, which is the limiting factor.
Hardware Overview
Infiniband Interconnect (Opteron clusters)
This section under development
LC Linux Cluster Systems
OCF Linux Clusters
ALC photo
ZEUS photo
ATLAS:
Purpose: M&IC Capability Resource
Nodes: 1152
CPUs/Node: 8
CPU Type: AMD dual-core Socket F Opteron @2.4 GHz
Peak Performance: 44.24 TFLOPS
Memory/Node: 16 GB
Memory Total: 18,432 GB
Cache: 64 KB L1 data; 64 KB L1 instruction; 1 MB L2
Chipset: SuperMicro H8QM8-2 motherboard
Interconnect: Mellanox Infiniband 4x DDR HCA
Parallel File System: Lustre
OS: CHAOS Linux
Notes: Estimated to be generally available (GA) in 4/07
ZEUS:
Purpose: M&IC Capacity Resource
Nodes: 288
CPUs/Node: 8
CPU Type: AMD dual-core Socket F Opteron @2.4 GHz
Peak Performance: 11.06 TFLOPS
Memory/Node: 16 GB
Memory Total: 4,608 GB
Cache: 64 KB L1 data; 64 KB L1 instruction; 1 MB L2
Chipset: SuperMicro H8QM8-2 motherboard
Interconnect: Mellanox Infiniband 4x DDR HCA
Parallel File System: Lustre
OS: CHAOS Linux
Notes:
MCR:
Purpose: Multiprogrammatic Capability Resource (M&IC)
Nodes: 1154
CPUs/Node: 2
CPU Type: Intel Xeon @2.4 GHz
Peak Performance: 11.2 TFLOPS
Memory/Node: 4 GB
Memory Total: 4,616 GB
Cache: 8 KB L1 data; 512 KB L2
Chipset: Intel E7500
Interconnect: Quadrics QsNet Elan3
Parallel File System: Lustre
OS: CHAOS Linux
Notes: To be retired in 2/07
MCR photo
MCR Configuration
ALC:
Purpose: ALC = "ASC Linux Cluster". An unclassified component of ASC Purple designated as an ASC Capacity Resource.
Nodes: 960
CPUs/Node: 2
CPU Type: Intel Xeon @2.4 GHz
Peak Performance: 9.2 TFLOPS
Memory/Node: 4 GB
Memory Total: 3,840 GB
Cache: 8 KB L1 data; 512 KB L2
Chipset: ServerWorks
Interconnect: Quadrics QsNet Elan3
Parallel File System: Lustre
OS: CHAOS Linux
Notes: Packaged as IBM "xSeries" boxes: IBM x335, IBM x345
YANA:
Purpose: M&IC serial/single-node computing resource
Nodes: 80
CPUs/Node: 8
CPU Type: AMD dual-core Socket F Opteron @2.4 GHz
Peak Performance: 3.07 TFLOPS
Memory/Node: 16 GB or 32 GB
Memory Total: 1,600 GB
Cache: 64 KB L1 data; 64 KB L1 instruction; 1 MB L2
Chipset: SuperMicro H8QM8-2 motherboard
Interconnect: None
Parallel File System: Lustre
OS: CHAOS Linux
Notes: Serial or single-node parallel computing only because there is no switch
Other OCF Systems:
The systems shown below are reserved for specific purposes and are not typically used by most LC users.
PRISM:
o Visualization cluster
o 128 nodes
o 2 CPUs/node
o 2.4 GHz AMD Opterons
o 1.23 TFLOPS system peak performance
o 16 GB memory per node
o 4096 GB total RAM
o 64 KB L1 data cache; 64 KB L1 instruction cache
o 1 MB L2 cache
o Infiniband 4x interconnect
o Lustre parallel file system
o CHAOS Linux OS
SPHERE:
o Visualization cluster
o 96 nodes
o 2 CPUs/node
o 2.8 GHz Intel Xeons
o 1.08 TFLOPS system peak performance
o 2 GB memory per node
o 8 KB L1 cache
o 512 KB L2 cache
o Intel E7505 Chipset
o Quadrics QsNet Elan3 interconnect
o Large parallel Lustre file system
o CHAOS Linux OS
PENGRA:
o Used for non-production testing, prototyping, workshops
o 16 nodes
o 2 CPUs/node
o 2.2 GHz Intel Xeons
o 0.14 TFLOPS system peak performance
o 2 GB memory per node
o 8 KB L1 cache
o 512 KB L2 cache
o Intel E7500 Chipset
o Quadrics QsNet Elan3 interconnect
o No parallel file system
o CHAOS Linux OS
SAN PLANS:
* LC is creating a Lustre-based Storage Area Network (SAN) that integrates massive Lustre storage shared by its Linux clusters, and high performance access to other resources, such as HPSS.
* A schematic for the OCF plans is shown below.
LC OCF SAN Configuration
LC Linux Cluster Systems
SCF Linux Clusters
MINOS:
Purpose: ASC Capacity Resource
Nodes: 864
CPUs/Node: 8
CPU Type: AMD dual-core Socket F Opteron @2.4 GHz
Peak Performance: 33.18 TFLOPS
Memory/Node: 16 GB
Memory Total: 13,824 GB
Cache: 64 KB L1 data; 64 KB L1 instruction; 1 MB L2
Chipset: SuperMicro H8QM8-2 motherboard
Interconnect: Mellanox Infiniband 4x DDR HCA
Parallel File System: Lustre
OS: CHAOS Linux
Notes: Estimated to be generally available (GA) in 8/07
RHEA:
Purpose: ASC Capacity Resource
Nodes: 576
CPUs/Node: 8
CPU Type: AMD dual-core Socket F Opteron @2.4 GHz
Peak Performance: 22.12 TFLOPS
Memory/Node: 16 GB
Memory Total: 9,216 GB
Cache: 64 KB L1 data; 64 KB L1 instruction; 1 MB L2
Chipset: SuperMicro H8QM8-2 motherboard
Interconnect: Mellanox Infiniband 4x DDR HCA
Parallel File System: Lustre
OS: CHAOS Linux
Notes:
LILAC:
Purpose: ASC Capacity Resource.
Nodes: 768
CPUs/Node: 2
CPU Type: Intel Xeon; compute nodes @3.0 GHz, other nodes @2.8 GHz
Peak Performance: 9.2 TFLOPS
Memory/Node: 4 GB
Memory Total: 3,072 GB
Cache: 8 KB L1 data; 512 KB L2
Chipset: ServerWorks
Interconnect: Quadrics QsNet Elan3
Parallel File System: Lustre
OS: CHAOS Linux
Notes: Packaged as IBM "xSeries" boxes: IBM x335, IBM x345
HOPI:
Purpose: ASC serial/single-node computing resource
Nodes: 76
CPUs/Node: 8
CPU Type: AMD dual-core Socket F Opteron @2.4 GHz
Peak Performance: 2.92 TFLOPS
Memory/Node: 16 GB or 32 GB
Memory Total: 1,408 GB
Cache: 64 KB L1 data; 64 KB L1 instruction; 1 MB L2
Chipset: SuperMicro H8QM8-2 motherboard
Interconnect: None
Parallel File System: Lustre
OS: CHAOS Linux
Notes: Serial or single-node parallel computing only because there is no switch
ACE:
Purpose: ASC serial/single-node computing resource.
Nodes: 176
CPUs/Node: 2
CPU Type: Intel Xeon @2.8 GHz
Peak Performance: 1.97 TFLOPS
Memory/Node: 4 GB
Memory Total: 704 GB
Cache: 8 KB L1 data; 512 KB L2
Chipset: Intel E7501
Interconnect: None
Parallel File System: None
OS: CHAOS Linux
Notes: Serial or single-node parallel computing only because there is no switch or parallel file system.
QUEEN:
Purpose: ASC serial/single-node computing resource.
Nodes: 64
CPUs/Node: 2
CPU Type: Intel Xeon @2.8 GHz
Peak Performance: 0.71 TFLOPS
Memory/Node: 4 GB
Memory Total: 256 GB
Cache: 8 KB L1 data; 512 KB L2
Chipset: Intel E7501
Interconnect: None
Parallel File System: None
OS: CHAOS Linux
Notes: Serial or single-node parallel computing only because there is no switch or parallel file system.
Other SCF Systems:
The systems shown below are reserved for specific purposes and are not typically used by most LC users.
GVIZ:
o Visualization resource
o 64 nodes
o 2 CPUs/node
o Intel Xeons @2.8 GHz
o 0.72 TFLOPS system peak performance
o 2 GB memory per node
o 128 GB total RAM
o 8 KB L1 Cache
o 512 KB L2 Cache
o Intel E7505 Chipset
o Quadrics QsNet Elan3 interconnect
o Lustre parallel file system
GAUSS:
o Visualization cluster
o 256 nodes
o 2 CPUs/node
o 2.4 GHz AMD Opterons
o 2.46 TFLOPS system peak performance
o 12 GB memory per node
o 1644 GB total RAM
o 64 KB L1 data cache; 64 KB L1 instruction cache
o 1 MB L2 cache
o Infiniband 4x interconnect
o Lustre parallel file system
o CHAOS Linux OS
Software and Development Environment
NOTE: The software and development environment for LC's Linux clusters is similar to what is described in the Introduction to LC Resources tutorial. Only a summary or items specific to the Linux clusters are discussed below.
CHAOS Operating System:
* All LC Linux clusters use the Clustered High Availability Operating System (CHAOS). CHAOS is a Livermore modified version of RedHat Linux, maintained by the Linux Project Team.
* CHAOS differs from Red Hat in the following areas:
o Modified kernel - to support high performance hardware, the Lustre file system, and other Livermore requirements.
o New packages - support added for cluster monitoring, system installation, power/console management, parallel job launch, resource management, compilers, etc.
o Modified packages - a few Red Hat packages are modified to implement timely bug fixes or enable them to work with CHAOS packages.
* To find out more about CHAOS, including release information, see www.llnl.gov/linux/chaos/chaos.html
Batch System:
* LCRM - LC's batch system. Covered in depth in the LCRM Tutorial. Integrated with SLURM on Linux clusters.
* SLURM
o SLURM, LC's Simple Linux Utility for Resource Management, is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters of thousands of nodes.
o SLURM replaces the native scheduling system, and is integrated with LCRM.
o The SLURM reference manual can be found at: http://www.llnl.gov/LCdocs/slurm
* MOAB - LCRM's replacement beginning in 2007. Tri-lab common scheduler.
File Systems:
* Home directories - globally mounted under /g/g#, backed up, 16 GB quota in effect, not purged, .snapshot convenient short-term backup also available.
* Lustre parallel file systems:
o Mounted under /p/lscratch#
o Very large
o Not backed up
o Subject to purge
o Shared by all users on a cluster or multiple clusters
o Available on most production Linux clusters.
o Lustre is discussed in the Parallel File Systems section of the Introduction to Livermore Computing Resources tutorial.
o Are usually mounted by multiple clusters.
* /nfs/tmp# - large globally mounted NFS file systems shared across all machines by all users, not backed up, purged, quotas in effect.
* /var/tmp, /usr/tmp, /tmp - different names for the same file system, local (non-NFS) mounted, moderate size, not backed up, purged, shared by all users on a given node.
* Archival HPSS storage - accessed by ftp to "storage". Virtually unlimited file space, not backed up or purged. More info: www.llnl.gov/computing/hpc/jobs/#sbp.
* /usr/gapps - globally mounted project workspace that may be group and/or world readable. More info: www.llnl.gov/computing/hpc/resources/usr_gapps.html.
Compilers:
* Compilers are discussed in detail in the Compilers section below.
Intel and AMD Math Libraries
Note: users are advised NOT to use the LAPACK or BLAS libraries in /usr/lib. These libraries are provided as part of the standard RedHat distribution, but they have very poor performance compared to their Intel and AMD math library equivalents.
MKL - Intel Math Kernel Library
* Available for both IA32 and Opteron clusters
* Threaded, optimized LAPACK, BLAS, FFT, vector math functions and sparse BLAS functions.
* Currently located in /usr/local/intel/mkl, which is a symbolic link to the default version.
* Documentation, examples and everything else you need to get going is included in the install directory.
* A few usage notes:
o Required include path: -I/usr/local/intel/mkl/include
o Can be used with Intel and non-Intel compilers.
o Linking - it's best to consult the mkluse.htm document in the install directory for details. Linking instructions vary depending upon what you want to do.
o Threads: MKL is an OpenMP threaded library. It can be used with Intel compiled OpenMP programs safely, but not with other types of threaded programs or compilers unless you explicitly set the environment variable MKL_SERIAL to YES or yes.
o Memory management: For performance reasons, MKL employs memory buffers for use by the library functions. This feature can be turned off by setting MKL_DISABLE_FAST_MM to yes, but performance will degrade.
o The mkluse.htm document mentions several performance related conditions that may be of interest for specific routines.
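As a minimal usage sketch, a C program can call MKL's optimized BLAS through the standard Fortran interface, as shown below. The dgemm_ symbol is standard BLAS; the exact link line varies, so consult mkluse.htm as noted above.

#include <stdio.h>

/* Fortran-interface BLAS routine; all arguments are passed by reference. */
extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);

int main(void)
{
    int n = 2;
    double alpha = 1.0, beta = 0.0;
    double a[4] = {1, 2, 3, 4};   /* 2x2 matrices, column-major order */
    double b[4] = {5, 6, 7, 8};
    double c[4];

    dgemm_("N", "N", &n, &n, &n, &alpha, a, &n, b, &n, &beta, c, &n);
    printf("c(1,1) = %g\n", c[0]);   /* expect 1*5 + 3*6 = 23 */
    return 0;
}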
ACML - AMD Core Math Library
* Opteron clusters only.
* Includes optimized BLAS, LAPACK, FFT, transcendental math routines and random number generators.
* ACML libraries built with the GNU, Intel, PathScale, and PGI compilers are available. They are found in /usr/local/lib/acml with the corresponding header files found in /usr/local/include/acml.
* When compiling, link with the library path and include path desired. For example, to use the PathScale version, include the following on your compile/link command:
-I/usr/local/include/acml/pathscale
-L/usr/local/lib/acml/pathscale
-lacml
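A complete compile/link line might then look like the following (myprog.f90 is a hypothetical source file):
pathf90 -I/usr/local/include/acml/pathscale myprog.f90 -L/usr/local/lib/acml/pathscale -lacml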
Debuggers and Performance Analysis Tools:
* LC's Development and Environment group maintains a number of debuggers and performance analysis tools that can be used on the IA32 Linux clusters.
* See the "Supported Software and Computing Tools" web page located at www.llnl.gov/computing/hpc/code/software_tools.html for more information.
* Some debugging details and a few words about performance analysis tools are covered later. However, covering these topics in depth is beyond the scope of this tutorial.
Man Pages:
* Linux man pages (and SLURM man pages) are stored in /usr/share/man.
* If you have problems bringing up the man page for a Linux command, check that your MANPATH variable includes /usr/share/man.
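For example (bash/ksh syntax):
echo $MANPATH
export MANPATH=$MANPATH:/usr/share/man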
Compilers
General Information
Available Compilers and Invocation Commands:
* The table below summarizes compiler availability and invocation commands on LC Linux clusters.
* Note that parallel compiler commands are actually LC scripts that ultimately invoke the corresponding serial compiler.
Compiler   Language  Serial Command  Parallel Command
Intel      C         icc             mpiicc
           C++       icpc            mpiicpc
           Fortran   ifort           mpiifort
PathScale  C         pathcc          mpipathcc
           C++       pathCC          mpipathCC
           Fortran   pathf90         mpipathf90
PGI        C         pgcc            mpipgcc
           C++       pgCC            mpipgCC
           Fortran   pgf77, pgf90    mpipgf77, mpipgf90
GNU        C         gcc             mpicc
           C++       g++             mpiCC
           Fortran   g77             mpif77, mpif90
Versions, Defaults and Paths:
* LC usually maintains multiple versions of each compiler on a cluster.
* To see the available compilers for all (or any) LC machines, consult the Compilers Currently Installed on LC Platforms web page.
* Each compiler has a default version. To determine what it is, issue the compiler invocation command with its "version" option. For example:
Compiler   Version Option(s)           Example
Intel      -V                          ifort -V
PathScale  -v, -version, -dumpversion  pathcc -version
PGI        -V                          pgf90 -V
GNU        -v, --version               g++ --version
* The GNU compilers are automatically in your path on all clusters.
* For IA32 clusters, the Intel compiler is the preferred compiler. Because of this, LC automatically includes the default Intel compilers in your path. The PathScale and PGI compiler paths must be explicitly added by the user.
* For Opteron clusters, there is no preferred compiler. Therefore, LC automatically includes all compilers in your path.
* To use a specific version of a compiler other than the default (except for GNU compilers), do the following:
1. Login to the machine where you will be working
2. Issue the use -l (lowercase "L") command. The use -l compilers command should also work and provide a more specific list.
3. See what is available for each compiler and select one, noting its package/dotkit name
4. Issue the command: use package-name
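For example (the package name shown is hypothetical - use a name reported in step 3):
use -l compilers
use pgi-6.1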
Floating-point Exceptions:
* By default, the IA32 and Opteron processors mask floating-point exceptions (FPE), such as invalid operation, denormalization, overflow, underflow and divide-by-zero. Programs that encounter FPEs will not terminate abnormally, but instead, will continue execution with the potential of producing wrong results.
* Compilers differ in their ability to handle FPEs:
o Intel's ifort offers a limited means to control FPE handling through the -fpe option.
o Intel's C/C++ compiler does not offer a way to unmask FPEs.
o PGI compilers offer the convenient -Ktrap option to unmask FPEs.
o GNU and PathScale compilers don't offer much for unmasking FPEs
* A locally developed method has been created for Intel compiler users to unmask FPEs. See the files located in /usr/local/fpu for details.
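For example, the following PGI compile line (source file name hypothetical) unmasks invalid operation, divide-by-zero, and overflow exceptions so they abort execution:
pgf90 -Ktrap=inv,divz,ovf myprog.f90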
* Unrelated to compilers, Linux treats integer divides-by-zero as floating-point exceptions. For example, the following two codes will behave differently. The integer divide by zero will abort execution, while the floating point divide by zero will not. This applies to both Fortran and C/C++.
Integer divide by zero (aborts):
#include <stdio.h>
int main()
{
    int i = 2;
    i /= 0;                 /* integer divide by zero: execution aborts */
    printf("i = %d\n", i);  /* never reached */
    return 0;
}
Floating-point divide by zero (continues):
#include <stdio.h>
int main()
{
    float i = 2;
    i /= 0;                 /* floating-point divide by zero: i becomes inf */
    printf("i = %f\n", i);  /* prints inf; execution continues */
    return 0;
}
Precision, Performance and IEEE 754 Compliance:
* Typically, most compilers do not guarantee IEEE 754 compliance for floating-point arithmetic unless it is explicitly specified by a compiler flag. This is because compiler optimizations are performed at the possible expense of precision.
* Unfortunately for most programs, adhering to IEEE floating-point arithmetic adversely affects performance.
* If you are not sure whether your application needs this, try compiling and running your program both with and without it to evaluate the effects on both performance and precision.
* See the relevant compiler documentation for details.
Compilers
Intel Compilers
General Information:
* The Intel compilers are specifically designed to take advantage of Intel chip architectures, such as the IA32 Xeon architecture.
* Have logged the best performance to date of any IA32 Linux compilers when building LC standard benchmarks such as sPPM and HPL (netlib High Performance Linpack).
* For Opteron clusters, LC supports the Intel compilers with the caveat that they have not been rigorously tested on this architecture.
* OpenMP version 2.0 compliant
* Can be used to compile for 32-bit or 64-bit architectures.
* GNU compatibility: Objects created with the Intel C/C++ compiler are binary compatible with GNU gcc. Also, the Intel C/C++ compiler supports many of the language extensions of gcc and g++. See the Intel C++ Compiler User's Guide for details.
Compiler Invocation Commands:
icc serial/OpenMP C
icpc serial/OpenMP C++
ifort serial/OpenMP Fortran 77 and 90
mpiicc script for C with MPI
mpiicpc script for C++ with MPI
mpiifort script for Fortran with MPI
* Note: The Intel C and C++ compilers are actually the same compiler - the Intel C++ compiler product. The invocation command (icc or icpc) determines how the compiler behaves. Be sure to use the appropriate invocation command for your source - don't use icc for C++ or icpc for C.
Common / Useful Options:
* A few useful compiler options are shown below. Interested users will definitely want to consult the relevant Intel documentation, including man pages and the Intel website: www.intel.com.
Option Description C/C++ Fortran
-ansi_alias
-no-ansi_alias Can help performance. Directs the compiler to assume the following:
o Arrays are not accessed out of bounds.
o Pointers are not cast to non-pointer types, and vice-versa.
o References to objects of two different scalar types cannot alias.
If your program does not satisfy one of the above conditions, this flag may lead the compiler to generate incorrect code. For Fortran, conformance is according to Fortran 95 Standard type aliasability rules.
C/C++ Default = -no-ansi_alias (off)
Fortran Default = -ansi_alias (on)
-assume keyword
-assume buffered_io Specifies assumptions made by the compiler. One option that may improve I/O performance is buffered_io, which causes sequential file I/O to be buffered rather than being written to disk immediately. See the ifort man page for details.
-auto
-automatic
-nosave Place variables, except those declared as SAVE, on the run-time stack.
-save
-noauto
-noautomatic Place variables, except those declared as AUTOMATIC, in static memory.
The default is -auto_scalar (local scalars of type INTEGER, REAL, COMPLEX, or LOGICAL are automatic). However, if you specify -recursive or -openmp, the default is -auto.
-autodouble Defines real variables to be REAL(KIND=8). Same as specifying -r8.
-convert keyword Specifies the format for unformatted files, such as big endian, little endian, IBM 370, Cray, etc.
-Dname[=value] Defines a macro name and associates it with a specified value. Equivalent to a #define preprocessor directive.
-fast Shorthand for several combined optimization options: -O3, -ipo, -static
-fpp
-cpp Invoke Fortran preprocessor. -fpp and -cpp are equivalent.
-fpe[n] Specifies the run-time floating-point exception handling behavior:
o 0 - Floating underflow results in zero; all other floating-point exceptions abort execution.
o 1 - Floating underflow results in zero; all other floating-point exceptions produce exceptional values (signed Infinities or NaNs) and execution continues.
o 3 - All floating-point exceptions produce exceptional values (signed Infinities, denormals, or NaNs) and execution continues. This is the default.
-g Build with debugging symbols. Note that -g does not imply -O0 in the Intel compilers; -O0 must be specified explicitly to turn all optimizations off.
-help Print compiler options summary
-ip Enable single-file interprocedural optimizations.
-ipo Enable multi-file interprocedural optimizations.
-mcpu=pentium4
-march=pentium4 Optimize for pentium 4 / Xeon processor (default)
-mp 'Maintain precision' - favor conformance to IEEE 754 standards for floating-point arithmetic.
-mp1 Improve floating-point precision - less speed impact than -mp.
-o name Create an object file called name.
-O0 Turn off optimizer - recommended if using -g for debugging.
-O, -O1, -O2, -O3 Optimization levels. (O,O1,O2 are essentially equivalent). -O3 is the most aggressive optimization level. Note that Intel compilers perform optimization by default.
-openmp Enable OpenMP.
-opt_report
-opt_report_file filename
-opt_report_level [min|med|max]
-openmp_report[0|1|2]
-par_report[0|1|2|3] Various reporting options on optimization, OpenMP, or auto-parallelization. See man pages for more information.
-p Enables function profiling with the gprof tool. Same as -qp
-parallel Enable auto-parallelizer to generate multi-threaded code for eligible loops.
-prof_gen
-prof_file
-prof_use Used for profile guided optimization.
-pthread, -lpthread Link with Pthreads library
-r8
-r16
-real_size 64
-real_size 128 Different ways to specify the default size of real and/or double-precision numbers.
-shared Create a shared object (.so)
-static Link statically; do not link against shared libraries (.so).
-tpp7 Optimize for pentium 4 / Xeon (default)
-V Display compiler version information
-w Disable all warning messages
-w[0|1|2] Increasing levels of warning message reporting. Default=1
-Wall (C/C++)
-warn (Fortran) Enable all warning messages
-xW Utilize streaming SIMD instructions - turns on auto-vectorizer
Caveats:
* Bypassing the Intel Fortran Run-time Error System
o The Intel Fortran run-time error system will intercept run-time signals such as SIGABRT and SIGSEGV by default. This prevents core files from being created.
o Therefore, a signal handler must be installed and called if you want corefiles.
o An example of how to do this appears below
C routine:
#include <signal.h>

/* Restore default SIGSEGV handling so core files can be generated.
   The trailing underscore makes this callable from Fortran. */
void clear_signal_()
{
    signal(SIGSEGV, SIG_DFL);
}
Fortran code:
program mycode
call clear_signal()
etc.....
Compilers
PGI Compilers
General Information:
* The PGI C, C++ and Fortran compilers as well as the PGI debugger and profiler tools are available on the IA32 and Opteron clusters.
* The PGI compilers are placed in your PATH and MANPATH environment variables by default on the Opteron clusters. On IA32 clusters, you will need to do this manually with the use command (below).
* There are several 32-bit and 64-bit versions of the PGI compilers available. To see what is currently available on the machine you are using, issue the use -l command. To select a particular version, use the use package-name command.
* Full PGI documentation is available in the doc install directory. Complete man pages are also available for every PGI tool after you load the version of choice as described above.
* OpenMP version 2.0 compliant
* Support SSE instructions
* The PGI compilers provide a compiler option for unmasking floating-point exceptions. See the -Ktrap option in the table below.
Compiler Invocation Commands:
pgcc serial/OpenMP C
pgCC serial/OpenMP C++
pgf77
pgf90 serial/OpenMP Fortran 77 and 90
mpipgcc script for C with MPI
mpipgCC script for C++ with MPI
mpipgf77
mpipgf90 scripts for Fortran 77 and Fortran 90 with MPI
Common / Useful Options:
* A few useful compiler options are shown below. Interested users will definitely want to consult the relevant documentation, including man pages and the PGI website: www.pgroup.com.
Option Description
-fast Turn on optimizations
-fpic Instructs the compiler to generate position independent code which can be used to create shared object files (dynamically linked libraries).
-help Display help
-Kieee Force IEEE 754 arithmetic
-Ktrap=fp,inv,denorm,divz,ovf,unf,inexact Unmask selected FPU exceptions
-lpthread Link with pthreads library.
-mp Turn on OpenMP
-Mvect=[prefetch,sse] Enable prefetch, SSE
-Mlist Create a listing file
-Mipa Enable interprocedural analysis. See man page for details.
-O, -O1, -O2, -O3/O4 Optimization levels. The default is -O1.
-pc32
-pc64
-pc80 Set precision of FPU significand to 32, 64, or 80 bits respectively
-pg Enable gprof-style sample-based profiling
-V Display version information
-v Verbose mode
-w Suppress warning messages
Compilers
PathScale Compilers
General Information:
* The PathScale C, C++ and Fortran compilers are available on the IA32 and Opteron clusters.
* The PathScale compilers are placed in your PATH and MANPATH environment variables by default on the Opteron clusters. On IA32 clusters, you will need to do this manually with the use command (below).
* There are several versions of the PathScale compilers available. To see what is currently available on the machine you are using, issue the use -l command. To select a particular version, use the use package-name command.
* The same compiler can be used to create 32-bit or 64-bit executables, as the compiler automatically determines the type of architecture being used. For example, the pathcc command on an IA32 cluster will produce a 32-bit executable, and on an Opteron cluster, will produce a 64-bit executable.
* OpenMP version 2.0 compliant
* Support SSE instructions
* Compatible with gcc and g77.
Compiler Invocation Commands:
pathcc serial/OpenMP C
pathCC serial/OpenMP C++
pathf90 serial/OpenMP Fortran 77/90/95
mpipathcc script for C with MPI
mpipathCC script for C++ with MPI
mpipathf90 scripts for Fortran 77/90/95 with MPI
Common / Useful Options:
* A few useful compiler options are shown below. Interested users will definitely want to consult the relevant documentation, including man pages and the PathScale website: www.pathscale.com.
Option Description
-ansi Adhere to ANSI rules.
-apo Automatically parallelize loops when it is safe and beneficial to do so.
-fb-create path Create an instrumented executable for profile-guided optimization.
-fb-opt path Use the instrumentation output (generated by compiling with -fb-create and running the program with a training input set) to perform profile-guided optimization.
-fno-exceptions Disables runtime exception handling. The default is to handle exceptions.
-fno-fast-math Conform to IEEE math rules - with a cost to performance.
-fPIC Generate position independent code, if possible. Useful for creating dynamically linked libraries. (OFF by default.)
-lacml Link with the AMD Core Math Library for optimized BLAS, LAPACK and FFT routines.
-help Display help
-fullwarn More extensive list of warnings than the default.
-H Print the name of each header file used.
-ipa Invoke interprocedural analysis.
-m32 Build a 32-bit object or executable.
-march=auto Optimize for the architecture that the code is being compiled on.
-mcmodel=medium Required if executable is greater than 2 GB. Uses 32-bit relocations for code and 64-bit relocations for static data.
-mp Enable OpenMP directives.
-O0, -O1, -O2, -O3 Levels of optimization. -O2 is the default.
-Ofast Same as -O3 -ipa -OPT:Ofast -fno-math-errno -ffast-math
-pg Generate information for pathprof (alternative to gprof).
-pthread Compile with pthreads support.
-show-defaults Display the default options and quit (no compilation performed)
-static Use static linking for shared libraries. Dynamic linking is the default.
-version Display compiler version information.
-w Suppress warning messages.
MPI
Quadrics MPI
General Info:
* Quadrics MPI is used on all IA32 clusters with a Quadrics interconnect.
* IA32 clusters lacking a Quadrics interconnect use MPICH.
* Quadrics MPI is based on MPICH:
o Full MPI-1 implementation
o One-sided communications (MPI-2) are supported
o MPI-IO (MPI-2) is supported through MPICH's ROMIO implementation
o Optimized collective communication operations
* Provides optimized communication protocols:
o Uses Shmem shared-memory communications for on-node tasks.
o Uses the Quadrics Elan libraries for off-node communications over the switch.
* Not thread-safe; all MPI function calls should be made from the master thread on a node.
* Debug versions of the elan libraries are available in /usr/lib/rms/lib/dbg/. These should only be used if you suspect a bug in Quadrics MPI or the elan libs. To use these instead of the standard elan libraries, modify LD_LIBRARY_PATH as follows:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/rms/lib/dbg
MPI Build Scripts:
* MPI compiler wrapper scripts are used to compile MPI programs - these should all be in your default $PATH unless you have changed it. The scripts mimic the familiar MPICH scripts in their functionality: they automatically include the appropriate MPI include files, link to the necessary MPI libraries, and pass switches to the underlying compiler.
* Available scripts are listed below:
Language Script Name Underlying Compiler
C mpicc gcc
mpiicc icc
mpipgcc pgcc
C++ mpiCC g++
mpiicpc icpc
mpipgCC pgCC
Fortran mpif77 g77
mpiifort ifort
mpipgf77 pgf77
mpipgf90 pgf90
* For additional information:
o See the man page (if it exists)
o Issue the script name with the -help option (almost useless)
o View the script yourself directly
* By default, the scripts point to the current version of their underlying compiler.
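For example, to build an MPI program with the Intel C compiler through its wrapper script (file names hypothetical):
mpiicc -O2 -o myapp myapp.c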
Static Linking:
* The above scripts do not provide for static linkage. If you need static linking, follow the hints below.
o Library path: -L/usr/lib/mpi/lib
o Include path: -I/usr/lib/mpi/include
o For all compilers use the following link/loader switches, in this order:
-lmpif (Fortran only) -lmpi -lelan -lelan3 -lrmscall
o For ifort 8 only, include this object file on the link line:
/usr/local/mpi/ifc8_farg.o
o For pgf77 only, include this object file on the link line:
/usr/local/mpi/pgf77_farg.o
o For pgf90 only, include this object file on the link line:
/usr/local/mpi/pgf90_farg.o
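Putting these hints together, a static link of a C MPI program with the Intel compiler might look like the following sketch (file name hypothetical):
icc -static -I/usr/lib/mpi/include myprog.c -L/usr/lib/mpi/lib -lmpi -lelan -lelan3 -lrmscall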
Libelan Environment Variables:
* The Quadrics MPI library is layered on top of the Quadrics Elan communication library, known as libelan.
* Libelan provides optimized message-passing, one-sided communication and collective functions that take advantage of the QsNet network hardware.
* A number of Libelan environment variables can be set to tune or change the behavior of MPI applications. These variables fall into 4 main areas:
o Work-Arounds - useful to try when an application hangs or crashes for unknown reasons.
o Performance - useful to tune communication performance of an application.
o Memory - useful for programs that require large amounts of message passing memory.
o Debugging - turns on debugging output in the MPI and Elan libraries.
* Generally, these environment variables are set to default values that should benefit most applications, however....
* The default settings may not be optimal for all applications. Furthermore, they have been known to cause problems with some apps.
* For a detailed discussion (recommended) see the "MPI and Elan Library Environment Variables" document located at: www.llnl.gov/computing/mpi/elan.html
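* For example, if you suspect a problem within the Elan library optimizations themselves (the debugging case above), they can be disabled for a run by setting MPI_USE_LIBELAN to 0 (this variable is described further in the IA32 Performance Considerations section of this tutorial):
setenv MPI_USE_LIBELAN 0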
Performance:
* As stated previously in the Quadrics switch section:
o Latency: Ultra-low latency of ~5 us for MPI point-to-point communications on LC's IA32 systems.
o Bandwidth: ~340 MB/s point-to-point MPI bandwidth, depending upon the node's chipset.
* Also see the Performance Considerations section of this tutorial.
MPI
MVAPICH
General Info:
* Mellanox/OSU MVAPICH MPI is used on all Opteron clusters with an Infiniband interconnect.
* Opteron clusters lacking an Infiniband interconnect use MPICH.
* MVAPICH is based upon MPICH.
* See Opteron Memory Considerations regarding MVAPICH's use of memory.
MPI Build Scripts:
* MPI compiler wrapper scripts are used to compile MPI programs; these should all be in your default $PATH unless you have changed it. The scripts mimic the familiar MPICH scripts in their functionality: they automatically include the appropriate MPI header files, link to the necessary MPI libraries, and pass other switches through to the underlying compiler.
* Available scripts are listed below:
Language Script Name Underlying Compiler
C mpicc gcc
mpiicc icc
mpipgcc pgcc
mpipathcc pathcc
C++ mpiCC g++
mpiicpc icpc
mpipgCC pgCC
mpipathCC pathCC
Fortran mpif77 g77
mpiifort ifort
mpipgf77 pgf77
mpipgf90 pgf90
mpipathf90 pathf90
* For additional information:
o See the man page (if it exists)
o Issue the script name with the -help option (almost useless)
o View the script yourself directly
* By default, the scripts point to the current version of their underlying compiler.
MPI
MPICH
General Info:
* For IA32 and Opteron clusters without a switch, MPICH may be used for single-node parallel MPI jobs in shared memory mode.
* The usual compiler scripts are available as discussed previously; however, they are set up to link to MPICH instead of Quadrics MPI or MVAPICH.
Language Script Name Underlying Compiler
C mpicc gcc
mpiicc icc
mpipgcc pgcc
C++ mpiCC g++
mpiicpc icpc
mpipgCC pgCC
Fortran mpif77 g77
mpiifort ifort
mpipgf77 pgf77
mpipgf90 pgf90
* Remember, only single-node parallelism is supported.
* Performance may be affected by scheduling policies that permit multiple users on a node, or if "oversubscription" is permitted - that is, running more MPI tasks than there are physical CPUs.
Running Jobs
Overview
Big Differences:
* LC's Linux clusters can be divided into two types: those having a high speed interconnect switch and those that don't. There are significant differences that arise from this distinction.
Clusters with an interconnect:
o Most Linux clusters
o Designated as a parallel resource
o Use SLURM as the native resource manager
o All nodes fall into clearly defined pools, such as pdebug and pbatch; additional pools may be present. Selection of individual nodes, or selection by node characteristics, is not applicable.
o Nodes are not shared with other users (except for the login nodes). When your job runs, the allocated nodes are dedicated to you.
Clusters without an interconnect:
o ACE, QUEEN, YANA, HOPI
o Primarily designated for serial and single-node parallel jobs
o Do not use SLURM (except QUEEN)
o Users can select nodes individually, or by characteristics such as the amount of memory on a node
o Nodes can be shared with other users
* Usage differences between these two different types of clusters will be noted as relevant in the remainder of this tutorial.
Job Limits:
* For all clusters, there are defined job limits which vary from cluster to cluster, and are subject to change. These limits define:
o How many nodes/CPUs a job may use
o How long a job may run
o How many jobs may be running at once per user, group, machine
o When a job may run
o How much memory a job may use
o Where interactive versus batch is permitted
* Job limits are documented in several locations:
o By using the news job.lim.machinename command on the cluster you are logged into.
o See the "Job Limits" section of the LC web pages: www.llnl.gov/computing/hpc/jobs/index.html#limits
o In the Batch Limits section of the LCRM tutorial.
Running Jobs
Batch Versus Interactive
Interactive Jobs:
* In general:
o Interactive work has lower limits: shorter time limits, fewer nodes available, lower memory limits (where applicable), etc.
o Jobs run via cron, at and nohup are considered interactive jobs.
* Clusters with an interconnect:
o Typically have a special partition called pdebug that is configured for interactive parallel use.
o Nodes in the pdebug partition are configured similarly
o When your job runs, the nodes are dedicated to you alone
o Parallel interactive jobs require launching via the srun -ppdebug command (covered later).
o You cannot use the pbatch partition for interactive work or debugging unless you already have a batch job running on pbatch partition nodes.
o Login nodes can be used for serial and non-MPI parallel (threads) jobs interactively, although this is not encouraged.
* Clusters without an interconnect:
o Specific nodes are designated for interactive use only and should not be used for batch jobs. See news job.lim.machinename after you login for details.
o Other users can run their jobs on the same node as you simultaneously.
o Nodes can be configured differently. For example, some nodes may have more memory than others.
Batch Jobs:
* On LC systems, LCRM is used for running batch jobs. LCRM is covered in detail in the LCRM Tutorial. This section only provides a quick summary of LCRM usage on the Linux clusters.
Note: LC is in the process of migrating to the Moab batch scheduler. Atlas will be the first cluster to use Moab, followed eventually by all other OCF machines and then SCF machines.
* On clusters with an interconnect, batch jobs are typically run in the larger pbatch partition.
* On clusters without an interconnect, batch jobs must be submitted to nodes that are configured to permit batch jobs. These nodes should also permit the memory and time resources the job requires.
* All batch jobs must be submitted in the form of a job control script with the psub command. For example:
psub myjobscript
* A sample job control script (for a Quadrics enabled machine) appears below.
# Sample LCRM script to be submitted with psub
#PSUB -c zeus # machine to run on
#PSUB -pool pbatch # pool to use
#PSUB -r t2d22 # sets job name
#PSUB -tM 1:00 # sets maximum total CPU time
#PSUB -b micphys # sets bank account
#PSUB -ln 2 # uses 2 nodes
#PSUB -x # export current env var settings
#PSUB -o ~/db/t2d22.log # sets output log name
#PSUB -e ~/db/t2d22.err # sets error log name
#PSUB -nr # do NOT rerun job after system reboot
#PSUB -mb # send email at execution start
#PSUB -me # send email at execution finish
#PSUB # no more psub commands
# job commands start here
# Display job information for possible diagnostic use
set echo
hostname
echo LCRM job id = $PSUB_JOBID
sinfo
squeue
# Run info
cd /p/lscratcha/db/t2d22
srun -n 4 ./my_mpiprog
echo 'ALL DONE'
* After successfully submitting a job, you may then check its progress and interact with it (hold, release, alter, kill) by means of other LCRM commands summarized below. All of these commands have man pages and are covered in other LC documentation such as the LCRM tutorial.
* Interactive debugging of batch jobs is possible (covered later).
Quick Summary of Common Batch Commands:
Command Description
psub Submits a job to LCRM
pstat LCRM job status command
prm Remove a running or queued job
phold Place a queued job on hold
prel Release a held job
palter Modify job attributes (limited subset)
lrmmgr Show host configuration information
pshare Queries the LCRM database for bank share allocations, usage statistics, and priorities.
defbank Set default bank for interactive sessions
newbank Change interactive session bank
Running Jobs
Starting Jobs
Clusters with an Interconnect:
* All of these clusters have a cluster login that will put you on one of the available login nodes. For example:
ssh zeus
ssh alc
ssh lilac
* Machines configured with an interconnect are designated as parallel resources. The compute nodes are configured very homogeneously within a given cluster. Additionally, compute nodes are not shared among users - when your job runs, the nodes it uses are dedicated to your job.
* The compute nodes on most of these clusters are typically configured into two partitions:
o pdebug - interactive work, small, short jobs
o pbatch - parallel batch production runs
* The SLURM srun command is required to launch all parallel jobs - both batch and interactive. The major differences between batch and interactive usage are:
o Interactive use requires the specification of the pdebug partition with the -ppdebug flag.
o Batch use: the srun command is part of the LCRM batch script and the pbatch partition is assumed.
o Interactive use - if there are no available nodes, your job will by default queue up and wait until there are enough free nodes to run it. In batch mode, the batch scheduler handles when to run your job regardless of the number of nodes available.
* Syntax:
srun [option list] [executable] [args]
Note that srun options must precede your executable.
* Examples:
srun -n8 -ppdebug my_app
8 process job run interactively in pdebug partition
srun -n2 -c2 my_threaded_app
2 process job with 2 CPUs (threads) per process. Assumes pbatch partition.
srun -N8 my_app
Request that 8 nodes be used for job. Assumes pbatch partition.
srun -n4 -o my_app.out my_app
4 process job that redirects stdout to file my_app.out. Assumes pbatch partition.
srun -n4 -ppdebug -i my.inp my_app
4 process interactive job; each process accepts input from a file called my.inp instead of stdin
* A short list of common srun options appears below. See the srun man page for more complete information on srun options.
Option Description
-c [#cpus/task]
The number of CPUs used by each process. Use this option if each process in your code spawns multiple POSIX or OpenMP threads.
--core=light
Specifies creation of lightweight core files. May be useful for very
large process jobs which are crashing and filling disk space with core
files. Note the double dashes before "core" in this option. The default
is --core=normal, which may actually be limited by your shell's
coredumpsize setting.
-d
Specify a debug level - integer value between 0 and 5
-i [file]
-o [file]
Redirect input/output to file specified
-I
Allocate CPUs immediately or fail. By default, srun blocks until resources become available.
-J
Specify a name for the job
-l
Label - prepend task number to lines of stdout/err
-m block|cyclic
Specifies whether to use block (the default) or cyclic distribution of processes over nodes
-n [#processes]
Number of processes that the job requires
-N [#nodes]
Number of nodes on which to run job
-O
Overcommit - srun will refuse to allocate more than one process per CPU unless this option is also specified
-p [partition]
Specify a partition on which to run job
-s
Print usage stats as job exits
-v -vv -vvv
Increasing levels of verbosity
-V
Display version information
* SLURM environment variables: some srun options may be set via environment variables. For example, SLURM_CPUS_PER_TASK behaves like the -c option. See the srun man page for details on the approximately two dozen SLURM environment variables.
Clusters Without an Interconnect:
o For interactive jobs, you can use one of the cluster login nodes, or, if you prefer, login directly to one of the designated interactive nodes and start your job there.
o For batch jobs, determine which node best suits the requirements of your job, recalling that:
+ These machines are for serial or single-node parallel jobs
+ Nodes may have different time limits
+ Nodes may have different memory configuration
+ Multiple users are permitted to run simultaneously on each node
+ In your LCRM job command script, you may need to specify an appropriate machine to run on. Then submit your job as usual via the psub command.
o SLURM commands (srun, squeue, sinfo, etc.) are not available (except on QUEEN).
o MPI jobs: because there is no interconnect, only MPICH jobs can be run on these nodes. You therefore need to start your job (both batch and interactive) with the mpirun command. The syntax is:
mpirun [option list] [executable] [args]
For example:
mpirun -np 2 mycode
would run a 2 process MPI job called "mycode". Note that the syntax is picky: there must be a space between -np and 2, and all mpirun options must appear before the executable name. Use the mpirun -h command for more information.
o MPICH uses the shared memory MPI implementation, and permits you to run with more processes than CPUs on a given node. However, if you use more MPI processes than there are CPUs, there will be a reduction in the parallelism achieved as processes will compete with each other for CPUs. They will also compete with any other users on the node.
o Pthreads and OpenMP jobs can be run as usual - keeping in mind the number of CPUs per node and that nodes may be shared with other users.
Running Jobs
Terminating Jobs
Clusters with an Interconnect:
o Interactive parallel jobs should normally be terminated with a SIGINT (CTRL-C):
+ The first CTRL-C will report the state of the tasks
+ A second CTRL-C within one second will terminate the tasks
+ If you started your job with srun's -q option, only one CTRL-C is required.
o The SLURM scancel command can be used to terminate any job (batch or interactive) that was started with the srun command. For example:
% squeue | grep jsmith
24688 pbatch test110 jsmith R 1:49:09 2 zeus[55,57]
68865 pbatch test1 jsmith R 0:11 1 zeus17
% scancel 68865
o The LCRM prm command can be used to terminate batch jobs that were submitted with the psub command. For example:
% pstat
25156 t1.cmd jsmith 000000 cs RUN zeus N
25157 t1.cmd jsmith 000000 cs STAGING zeus N
% prm 25156
remove running job 25156 (jsmith, 000000, cs)? [y/n] y
%
o WARNING: Do not use kill -9 to stop srun, as it will not stop the parallel processes it spawned. Because these processes are running on other nodes, ps -ef will not show them after you terminate the srun process, but they will still be there, using resources.
Clusters Without an Interconnect
o For interactive jobs:
+ Make sure you're logged into the node where your job is running - which is the usual case.
+ Using CTRL-C will kill your job, provided it is running in the foreground.
+ If your job is running in the background, then use the fg command to put it in the foreground. Then use CTRL-C.
+ If all else fails, use the ps command to obtain the process id(s) of your job, and then use the kill -9 pid command for each process. Note: if you're using this method for MPICH jobs, simply killing the mpirun task usually leaves the parallel tasks it started still running.
o For batch jobs submitted via LCRM, use the prm command as shown above. It will work with any job submitted via the psub command.
o If the prm doesn't work for your batch job, you can try to login directly to the node where it is running and kill it as described for interactive jobs above.
Running Jobs
Displaying Queue and Job Status Information
Clusters with an Interconnect:
o As with other LC machines, there are several commands that can be used to display queue configuration and job activity information. Examples of each are shown below. Hyperlinked command names will display the corresponding man page.
o ju - An LC command that shows partitions and job usage in a concise format. In the example below, some output has been truncated due to its width.
% ju
Partition total down used avail cap Jobs
pdebug 64 0 55 9 86% bbnliu-16, pmorris-1, rreed-8....
pbatch 1048 5 818 225 78% ggk-1, vo4-320, griinman-25, ...
o spjstat and spj - Two more LC commands that show partitions and job usage in more detail. spjstat shows only running jobs while spj shows both running and queued jobs.
% spjstat
Scheduling pool data:
--------------------------------------------------------
Pool Memory Cpus Nodes Usable Free
--------------------------------------------------------
pbatch 3300Mb 2 1048 1047 39
pdebug 3300Mb 2 64 64 48
Running job data:
-------------------------------------------------------
Job ID User Name Nodes Pool Status
-------------------------------------------------------
28309 kkio 360 pbatch Running
27132 nnning 32 pbatch Running
26937 wwwook 256 pbatch Running
28359 pyrota 8 pbatch Running
27515 sssuru 256 pbatch Running
28479 qupder 96 pbatch Running
66340 wickris 16 pdebug Running
o sinfo - SLURM command for displaying configuration and usage information. The "infinite" time limit for pbatch is set because LCRM is used to control the actual job time limit, which may vary.
% sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
pdebug up 30:00 2 alloc zeus[17-18]
pdebug up 30:00 10 idle zeus[8-16,19]
pbatch* up infinite 262 alloc zeus[20-85,87-104,106-283]
pbatch* up infinite 2 down zeus[86,105]
o squeue - SLURM command for displaying information about running jobs. In the example below, some output has been truncated due to its width.
% squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
31250 pbatch RSRMV_3. rfffdder R 9:16:52 32 zeus[173-181…
30677 pbatch relax_Va abebb2 R 9:16:38 40 zeus[20-27,58-65...
30977 pbatch psub_001 mrrre R 5:39:02 8 zeus[165-172]
29945 pbatch PossP10 retyic2 R 4:48:56 16 zeus[99-101,106-118]
30724 pbatch 6size.0. bccce R 4:29:33 64 zeus[28-38,87-98,…
29170 pbatch ttest5a. reauuh R 4:23:02 16 zeus[66-81]
o pstat - LCRM command for displaying information about running and queued jobs. When used without options, only your jobs' information will be displayed.
% pstat -m zeus
JID NAME USER ACCOUNT BANK STATUS EXEHOST CL
16346 do800 kers 000000 micphys *WCPU zeus N
17874 b4f lkwggner 000000 micphys *DEPEND zeus N
22678 valduc3d01 kbbbta 000000 cms *MULTIPLE zeus N
22684 ExpandingTube-3 jwen 529004 axcode RUN zeus N
22685 ExpandingTube-3 jwen 529004 axcode *DEPEND zeus N
22879 mo4.psub uyang 477530 squeeze *WCPU zeus N
22991 vlcc_8.10.16 m55rath5 000000 fph2o *DEPEND zeus N
24640 test.thunder lirin 530001 clchange RUN zeus N
24655 do70 kers 000000 micphys RUN zeus N
...
...
...
24656 amr100gp kgitsu 000000 chemd RUN zeus N
24839 rh315_100gp kgitsu 000000 micphys RUN zeus N
24840 origi htang 530001 lines *WCPU zeus N
24841 rh315_100gp kgitsu 000000 micphys *TOOLONG zeus N
24842 dpd ggee 000000 cms RUN zeus N
24873 methanol bmundy 000000 fph2o RUN zeus N
24880 amr100gp kgitsu 000000 chemd *TOOLONG zeus N
43344 sspex_test2.ksh qitera1 000000 folding *DEPEND zeus N
Clusters Without an Interconnect
o For interactive jobs, use the ps command to monitor your job's status, assuming you are logged into the machine where it is running.
o For batch jobs, you can use the pstat command as described above. You can also login to where the job is running and use the ps command.
Running Jobs
Optimizing CPU Usage
Clusters with an Interconnect:
o The Linux clusters differ from LC's IBM systems in the way you specify the number of tasks and number of nodes.
o For example, if you are on an Opteron cluster and you want an MPI job to use all 8 CPUs on each of 4 nodes (8 MPI tasks per node, 32 tasks in all), then you would do something like:
Interactive:
srun -n32 -ppdebug a.out
Batch:
#PSUB -ln 4
srun -n32 a.out
o If your MPI job uses POSIX or OpenMP threads within each node, you should allocate the number of tasks for your job and also specify the number of CPUs to use per task. For example, running on an Opteron cluster, a 4 task job where each task creates 8 OpenMP threads would be specified as:
Interactive:
srun -n4 -c8 -ppdebug a.out
Batch:
#PSUB -ln 4
srun -n4 -c8 a.out
o Don't forget that Quadrics MPI on the IA32 clusters is not thread-safe, so the master thread must perform all MPI calls.
o The default task distribution is block. For cyclic distribution, add the -m cyclic option to your script's srun command line. For example, on an IA32 cluster:
Task Block Cyclic
------ ----- ------
task 0 node0 node0
task 1 node0 node1
task 2 node1 node2
task 3 node1 node3
task 4 node2 node0
task 5 node2 node1
task 6 node3 node2
task 7 node3 node3
o You can include multiple srun commands within your LCRM job command script. For example, suppose that you were conducting a scalability run on an IA32 cluster. You could allocate the maximum number of nodes that you would use with #PSUB -ln and then have a series of srun commands that use varying numbers of nodes:
#PSUB -ln 20
srun -N10 -n20 myjob
srun -N11 -n22 myjob
srun -N12 -n24 myjob
....
srun -N20 -n40 myjob
Clusters Without an Interconnect:
o The issues for CPU utilization are different for clusters without a switch for several reasons:
+ Many jobs are serial
+ Nodes are shared with other users
+ Heavy utilization
+ Jobs are limited to one node
o On these systems, the more important issue is over-utilization of CPUs rather than under-utilization.
o Batch jobs: To assist LCRM with effectively scheduling CPU usage, users should include the -np (number of processors) option to specify how many CPUs their job will actually require. This is particularly true for threaded codes. For example, if a single process code on an Opteron node spawns 8 threads, then the job command script should contain:
#PSUB -np 8
This will prevent LCRM from over-allocating the node.
o If you are using the mpirun command to run an MPICH MPI job (batch or interactive), its -np flag specifies the number of tasks, and the assumption is that there will be one task per CPU...unless you oversubscribe the node by specifying more tasks than there are CPUs (not recommended).
o In cases where a node is being shared between multiple users, and a user "overutilizes" a node by using more CPUs than they specified, that user's job will be automatically "niced" to a lower priority.
Running Jobs
IA32 Memory Considerations
32-bit Architecture Memory Limit:
o All IA32 Linux clusters are built on the 32-bit Xeon architecture, which by definition is limited to an address space maximum of ~4 GB (2^32 bytes).
o Intel provides the Physical Address Extension (PAE) feature which can increase address space to ~64GB. There is ample discussion of the subject on the web, however, with mixed reviews.
o Even if the PAE feature were to be used, addressing beyond a node's actual physical memory would cause paging, with obvious consequences to performance.
o None of LC's IA32 Linux cluster machines have more than 4GB of memory at this time, so using PAE is a moot point here.
Compiler Data Size Limits:
o Compilers will have size limits for data structures. This can vary between compilers (Intel, PGI, PathScale, GNU). Compilers will also differ in how they handle things when limits are exceeded.
o Using the Intel C/C++ and Fortran compilers as an example, both set as a hard limit a maximum data size of 2,147,483,647 bytes for data structures. (This number, of course, is the upper bound for signed 4-byte integers.) Take the two simple codes below for example:
C/C++:
#include <stdio.h>
#define N 2147483647
int main(int argc, char *argv[]) {
  long long int i;
  static char A[N];    /* static array dimensioned at the compiler's size limit */
  for (i=0; i<N; i++)
    A[i] = 'x';
  printf("Sample result = %c \n", A[N-1]);
}
Fortran:
program testit
integer N
parameter(N=2147483647)
character A(N)
integer*8 i
do i=1, N
A(i) = "x"
enddo
write(*,*)'Sample result A= ', A(N-1)
end
o These will both compile OK, but if you try to create an array that is even 1 byte larger than the limit, you'll get a compile-time error message that resembles:
C/C++: error: array is too large
Fortran: A common block or variable may not exceed 2147483647 bytes
o However, if you have multiple arrays that are each dimensioned below the hard limit, yet when added up exceed the limit, the compiler won't complain, but the program will hang or seg fault at run time.
o Note also that your code may fail if you actually attempt to get too close to the compiler-declared limit.
o Bottom line: there appears to be a ~2 GB limit before problems begin to appear for programs that allocate data statically. This is aggregate data PER PROCESS.
Malloc Limits:
o A single malloc is limited to somewhat more than the magic number of 2147483647 bytes. In test cases on both 4 GB and 2 GB machines in batch mode, the largest single malloc permitted was ~2,247,000,000 bytes (a C sketch for probing this limit appears after this list).
o Smaller mallocs are permitted until the total amount of all mallocs reaches ~3,060,000,000 bytes (~3 GB). This is the same for both 4 GB and 2 GB machines, batch or interactive, and regardless of how much memory was really in use by the OS and other processes at the time.
o Obviously, the system will let you allocate more memory than can be physically delivered, which means that paging will take effect if you exceed what is really available.
o Bottom line: it appears you can use up to ~3 GB of address space, even if it means you could encounter lots of paging, particularly on 2 GB memory machines. This is PER PROCESS.
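o A minimal C sketch of the kind of probe used to find such limits appears below (the file name, starting size, and step size are illustrative; results will vary with system and load):
/* malloc_probe.c - find the approximate largest single malloc (sketch). */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
  size_t size = 4000000000UL;   /* start above the expected ~2.2 GB limit */
  void *p = NULL;

  /* Walk downward in ~1 MB steps until a malloc succeeds. */
  while ((p = malloc(size)) == NULL && size >= 1000000)
    size -= 1000000;

  if (p != NULL)
    printf("Largest single malloc: ~%lu bytes\n", (unsigned long)size);
  free(p);
  return 0;
}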
Shell Stacksize Limits:
o LC has recently made the shell stacksize limit "unlimited" on all of its IA32 Linux clusters. For example:
Shell Limits
csh/tcsh
cputime unlimited
filesize unlimited
datasize unlimited
stacksize unlimited
coredumpsize 16 kbytes
memoryuse unlimited
vmemoryuse unlimited
descriptors 1024
memorylocked 4 kbytes
maxproc 1024
bash/ksh/sh
time(cpu-seconds) unlimited
file(blocks) unlimited
coredump(blocks) 32
data(kbytes) unlimited
stack(kbytes) unlimited
lockedmem(kbytes) 4
memory(kbytes) unlimited
nofiles(descriptors) 1024
processes 1024
Program Stack Limits:
o When the shell stacksize is set to unlimited, a process can call routines that take up to ~2 GB of stack space. It can be a single routine, or a series of routines that call one another. In total though, the simultaneous stack size of all routines can't exceed ~2 GB. This is PER PROCESS.
o As before, if the compiler detects that you are declaring data objects that are larger than ~2GB, it will complain and cease compilation.
Pthreads Stack Limits:
o The POSIX threads API does not specify what the default stack size should be, nor what the maximum stack size could be. It is entirely implementation dependent and varies across all LC platforms and under various conditions on the same platform.
o For LC's IA32 Linux cluster compute nodes with a shell stacksize set to unlimited, the default thread stack size for pthreads is 2,097,152 bytes (2 MB).
o This can be increased by using the pthread_attr_setstacksize routine (a usage sketch appears after this list). The amount it can be increased depends upon how many threads you have. Some examples:
#Threads Approx. Max. Size
(MB)
2 1072
4 712
8 352
16 182
32 92
o This appeared to be the same on 4 GB and 2 GB machines, and the same in pdebug and pbatch.
o Pthreads programming is covered in LC's POSIX Threads tutorial.
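o A minimal sketch of the pthread_attr_setstacksize usage mentioned above follows (the file name and the 512 MB figure are illustrative; per the table, the size you can request shrinks as the thread count grows):
/* stack_demo.c - request an explicit pthread stack size (sketch).
   Compile with, e.g.: gcc -pthread stack_demo.c -o stack_demo */
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg) {
  printf("thread running\n");
  return NULL;
}

int main(void) {
  pthread_t tid;
  pthread_attr_t attr;
  size_t stacksize = 512UL * 1024 * 1024;   /* 512 MB - illustrative */

  pthread_attr_init(&attr);
  pthread_attr_setstacksize(&attr, stacksize);  /* request a larger stack */
  if (pthread_create(&tid, &attr, worker, NULL) != 0) {
    fprintf(stderr, "pthread_create failed - stack size too large?\n");
    return 1;
  }
  pthread_join(tid, NULL);
  pthread_attr_destroy(&attr);
  return 0;
}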
OpenMP Stack Limits:
o Like the POSIX threads API, the OpenMP APIs do not specify what the default stack size should be, nor what the maximum stack size could be. It is entirely implementation dependent.
o The discussion below pertains to the Intel compilers on LC IA32 clusters, but the same is probably close to being true with other compilers (not tested).
o For LC's IA32 Linux cluster compute nodes with a shell stacksize set to unlimited, the default thread stack size is approximately 2 MB, with slight variations between systems. The Intel documentation states that it should be 2 MB for IA32 systems, so we're close in any case.
o Users can increase the thread stacksize by setting the KMP_STACKSIZE environment variable to the number of bytes desired. For example:
setenv KMP_STACKSIZE 12000000
o There is an upper limit of 2147483647 bytes (~2 GB - once again, that magic number for signed 4-byte integers).
+ If you try to use a larger value, a warning will be issued and it will be set to 2147483647 bytes anyway.
+ Using 2 threads, your code will probably seg fault if you use the upper limit or anything close to it anyway.
o Like pthreads, which OpenMP is based upon, the maximum stacksize for a thread depends upon the number of threads you create. For example, a KMP_STACKSIZE of 2000000000 bytes (~2 GB) works with two threads, but will seg fault with four threads. So, if you intend to use more than two threads, you will have to find the upper limit that works with the number of threads you actually use. The table below is an approximation; note how closely it parallels the pthreads table above. A small demonstration program appears at the end of this subsection.
#Threads Approx. Max. Size
(MB)
2 2145
4 900
8 400
16 190
32 100
o This appeared to be the same on 4 GB and 2 GB machines, and the same in pdebug and pbatch.
o OpenMP programming is covered in LC's OpenMP tutorial.
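o As a demonstration of the thread stack limit (a sketch; the file name omp_stack.c and the array size are illustrative), the following program places an 8 MB automatic array on each thread's stack. With the default ~2 MB thread stack it will typically seg fault; raising KMP_STACKSIZE as described above (e.g., setenv KMP_STACKSIZE 16000000) should allow it to run:
/* omp_stack.c - each thread places a large array on its stack (sketch).
   Compile with the Intel compiler, e.g.: icc -openmp omp_stack.c */
#include <stdio.h>
#include <omp.h>

#define N 8000000   /* 8 MB per thread - exceeds the ~2 MB default stack */

int main(void) {
#pragma omp parallel
  {
    char local[N];        /* automatic storage: lives on the thread stack */
    local[0] = 'x';
    local[N-1] = 'y';
    printf("thread %d: %c%c\n", omp_get_thread_num(), local[0], local[N-1]);
  }
  return 0;
}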
Miscellaneous
o Most of LC's IA32 Linux cluster machines have 4 GB of memory, however users can't take advantage of this full amount. The Linux OS and other system related processes require some memory. This subtracts from the total amount of memory available to the user application (before it starts paging).
o Some of LC's IA32 clusters have only 2GB of memory per node. This means that a code that runs without paging on a 4 GB node, may encounter performance problems when running on nodes with less memory.
o Login nodes can behave differently than compute nodes, but then one shouldn't be running jobs there anyway.
In Conclusion:
o The upper limits for memory addressing, mallocs, stack sizes, pthreads, OpenMP, etc. can easily allow you to create programs that will end up paging and may or may not be portable.
o Other architectures (like the 64-bit IBMs, Opterons and Thunder) may have very different defaults and behavior than described here, which implies portability considerations.
o There will be other portability issues when you go from an IA 32-bit Linux architecture to 64-bit architectures.
o Keep in mind that you really don't have access to the full memory of the machine, and that machine configurations vary between 4 GB and 2 GB per node.
Running Jobs
Opteron Memory Considerations
64-bit Architecture Memory Limit:
o Because the Opteron is a 64-bit architecture, 16 exabytes of memory can be addressed - about 4 billion times more than the 4 GB limit of 32-bit architectures. By current standards, this is virtually unlimited memory.
o In reality, machines are usually configured with only GBs of memory, so any address access that exceeds physical memory will result (on most systems) in paging and degraded performance. LC machines are an exception to this - see below.
LC's Diskless Opterons:
o LC's Opteron clusters are configured with diskless compute nodes. This has very important implications for programs that exceed physical memory, which for most nodes is 16 GB.
o Because compute nodes don't have disks, there is no virtual (swap) memory, which means there is no paging. Programs that exceed physical memory will terminate with a segmentation fault or similar error.
NUMA and Infiniband Adapter Considerations:
o Each Opteron node has 4 dual-core processors. Each processor is directly coupled to its own local DIMMs with a bandwidth of 10.7 GB/s (about 5.3 GB/s per core).
o Shared memory access to the DIMMs of other processors occurs over the HyperTransport links at 8 GB/s (4 GB/s each direction).
o Only 1 processor is directly connected to the Infiniband adapter. This means that the communication traffic for all CPUs must pass through a single processor.
o LC testing has seen up to 50% variability in performance for some codes. It is possible that the NUMA and Infiniband architectures may have something to do with this, although there may be other causes unknown at this time. Investigation continues…
Opteron memory and Infiniband schematic
Compiler, Shell, Pthreads and OpenMP Limits:
o Much of what was covered above in the IA32 Memory Considerations section applies here as well. Specifically:
+ Compiler data structure limits are in effect, but may be handled differently by different compilers.
+ Shell stack limits are set to "unlimited" by default.
+ Pthreads stack limits apply.
+ OpenMP stack limits apply, and may differ between compilers.
MVAPICH MPI Memory Consumption:
o Currently (as of 2/07), MVAPICH MPI uses 300 MB of memory per on-node task. LC is working with MVAPICH developers to reduce this. Be aware and stay tuned.
Running Jobs
IA32 Performance Considerations
o In addition to the IA32 Memory Considerations issues cited previously, there are other IA32 specific performance considerations.
Environment Variables:
o Several Quadrics Elan environment variables meant to improve performance are defaulted into the LC user environment. It may be useful to know what these variables are in instances where you may wish to unset them or modify the default values.
Environment Variable Description
LIBELAN_WAITTYPE Default value is POLL. Sets wait type to polling for Elan communications; POLL should always be used for Quadrics MPI jobs.
MALLOC_MMAP_MAX_ Default value of 0. Forces malloc to use sbrk() rather than mmap() to allocate memory. Improves performance of MPI collectives because it prevents aggressive reclaiming of pages mapped on the Elan card DMA memory.
MALLOC_TRIM_THRESHOLD_ Default value of -1. Used in conjunction with MALLOC_MMAP_MAX_ (see above)
LIBELAN_GALLOC_EBASE Default value of 0xb0000000. With LIBELAN_GALLOC_MBASE and LIBELAN_GALLOC_SIZE, used to resize the Elan global memory heap for MPI collective operations. EBASE refers to a pointer to a base virtual address in Elan memory to be used for the global heap.
LIBELAN_GALLOC_MBASE Default value of 0xb0000000. Refers to a pointer to the main memory base for resizing the Elan global memory heap.
LIBELAN_GALLOC_SIZE Default value of 16777216. The size, in bytes, of the Elan global memory heap. Cannot exceed 32 MB due to amount of memory on Elan cards.
MPI_USE_LIBELAN By default, not set, which equates to a value of 1. If set to 0, turns off Elan library optimizations. Use only for debugging the Elan libraries if problems within these libraries are suspected.
o There are other Quadrics Elan environment variables that affect performance. For a detailed discussion (recommended) see the "MPI and Elan Library Environment Variables" document located at: www.llnl.gov/computing/mpi/elan.html
Compiler Hints:
o Both the C/C++ and Fortran compilers support compiler hints, which can enhance the compiler's ability to vectorize loops and perform other optimizations. Some C/C++ examples of these hints are shown below; a short compilable example appears at the end of this subsection.
#pragma ivdep - Indicates that there are no dependencies that would prevent vectorization.
#pragma vector aligned - Indicates that memory references in a vectorizable loop are aligned.
#pragma novector - Do not vectorize the loop that follows the directive.
#pragma unroll(n) - Unroll the subsequent for loop n times.
#pragma nounroll - Do not unroll the subsequent for loop.
#pragma loop count (n) - Indicates that the loop count for a given loop is likely to be n.
o Fortran offers similar hints through !dir$ directives.
o See the Intel compiler C/C++ and Fortran User's guides at intel.com/software/products/compilers for more information on compiler directives and hints.
o Note that both the C/C++ and Fortran User's Guides include quite a bit of information on optimization and tuning also.
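o As a small illustration of the C/C++ hints above (the function and file names are hypothetical), the ivdep pragma is placed immediately before the loop it describes:
/* ivdep_demo.c - compiler hint example (illustrative sketch).
   The pragma asserts to the Intel compiler that the loop carries no
   dependences that would prevent vectorization. */
void scale(float *a, const float *b, int n) {
  int i;
#pragma ivdep
  for (i = 0; i < n; i++)
    a[i] = 2.0f * b[i];   /* candidate loop for vectorization */
}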
Web Documentation:
o Visit the Intel web site intel.com and search on "Xeon performance" or similar topics. There's quite a lot there.
o There are numerous other reports and discussions available on the web via any search engine.
Running Jobs
Opteron Performance Considerations
This section under development
Debugging
Available Debuggers:
o There are several debuggers available on LC's Linux clusters. Note that not all debuggers are available on all clusters.
o This section only touches on selected highlights. For more information users will definitely need to consult the relevant documentation.
o Discussions about parallel MPI jobs and SLURM commands do not apply to clusters without an interconnect.
TotalView:
o TotalView is probably the most widely used debugger for parallel programs. It can be used with C/C++ and Fortran programs and supports all common forms of parallelism, including pthreads, OpenMP and MPI.
o There are multiple versions of TotalView installed on all LC production systems. The current version changes frequently. To see which versions are available, look in /usr/global/tools/tv/vsns.
o Starting TotalView for serial codes: simply issue the command:
totalview prog
o Starting TotalView for interactive parallel jobs:
+ Some special command line options are required to run a parallel job through TotalView under SLURM. You need to run srun under TotalView, and then specify the -a flag followed by 1) srun options, 2) your program, and 3) your program flags (in that order!). The general syntax is:
totalview srun -a -n processes -ppdebug prog [prog args]
+ To debug an already running interactive parallel job, simply issue the totalview command and then attach to the srun process that started the job.
o Documentation:
+ LC Tutorial: www.llnl.gov/computing/tutorials/totalview
+ Vendor website: www.etnus.com
+ Local copies of documentation under /usr/local/docs/totalview
DDT:
o DDT stands for "Distributed Debugging Tool", a product of Allinea Software Ltd, a spin-off from Streamline Computing.
o DDT is a comprehensive graphical debugger designed specifically for debugging complex parallel codes. It is supported on a variety of platforms for C/C++ and Fortran, and can be used to debug multi-process MPI programs and multi-threaded programs, including OpenMP. DDT is perhaps the first real competitor to TotalView in this domain.
o Currently, LC is evaluating DDT, and has a limited number of fixed and floating licenses for OCF Linux machines.
o The DDT executable should already be in your default OCF path.
o Starting DDT for serial jobs is simple:
ddt prog
o Starting DDT for interactive parallel jobs requires running under slurm in the pdebug partition:
srun -ppdebug -N4 ddt prog
o Once you start DDT, a "Session Control" dialog box will appear asking you for details about your job (below). For parallel jobs, you will need to supply at least the minimal run information requested, and then click the run button.
DDT input dialog box
o DDT will then attempt to start your job under slurm, and permit you to begin debugging.
o Documentation:
+ Vendor website: http://www.allinea.com
+ Local copies of documentation under /usr/local/docs/ddt
PGDBG:
o PGDBG is The Portland Group Compiler Technology symbolic debugger for F77, F90, C, C++ and assembly language programs.
o PGDBG is capable of debugging parallel programs on SMP Linux clusters, including threaded, MPI and hybrid MPI/threads.
o On Opteron clusters, the default version of PGDBG will already be in your path. On IA32 clusters, you will need to select a version of PGI to use. See the PGI Workstation section of this tutorial for details.
o By default, PGDBG will start up in its GUI mode, however it can also operate in several other modes, such as dbx. Some examples:
+ Open a.out with default GUI mode:
pgdbg a.out
+ Start up in command line dbx mode:
pgdbg -dbx -text a.out
+ Analyze corefile core.1234 that was produced by a.out:
pgdbg -core core.1234 a.out
o Parallel debugging with PGDBG: although PGDBG can be used to debug parallel MPI codes, at LC this is not practical because SLURM does not support using PGDBG at this time.
o Documentation:
+ Vendor website: www.pgroup.com
+ Local copies of documentation in the doc subdirectory of the install directory for the version you are using.
GDB:
o The GNU GDB debugger is a relatively powerful, command-line, single-process debugger. It is text-based, easy to learn, and suitable for quickly debugging serial programs, a single process of a parallel program, analyzing core files or attaching to a running process.
o C/C++ and Fortran programs are supported.
o Note that GDB is not a parallel debugger, and not particularly well-suited to debugging threaded codes.
o GDB can be used with the graphical DDD debugger (discussed next).
o Starting GDB - examples:
+ Open GDB on the executable a.out:
gdb a.out
+ Analyze corefile core.1234 that was produced by a.out:
gdb a.out core.1234
+ Attach to running a.out process that has PID 12345:
gdb a.out 12345
o GDB can be used to debug a single process in a parallel job by doing the following:
1. rsh to one of the nodes that is running your job;
2. Determine the PIDs of the running executable using ps;
3. Change directories to the directory containing your executable;
4. Attach to the running process using your executable name and PID, e.g., gdb a.out 12345.
o Once GDB starts up, you will be presented with a (gdb) command prompt. You can then begin debugging. Some representative gdb commands are shown in the table below.
Command Action
b,break N Set breakpoint at line N
b,break funcname|line Set breakpoint at function named funcname or at specified line
bt Print a stack backtrace
c,cont Continue after breakpoint
h,help Print list of help topics
i,info registers Show registers
i,info float Show floating-point registers
l,list N List N lines of code (default is 10)
n,next Execute next program line; step over function calls
q,quit Quit
r,run Run program
s,step Execute next program line; step into function
o Documentation:
+ GNU website: http://www.gnu.org/software/gdb
+ info gdb command
DDD:
o The GNU DDD debugger is a graphical front-end for command-line debuggers such as GDB, DBX, WDB, Ladebug, JDB, XDB, the Perl debugger, the bash debugger, or the Python debugger.
o For most C/C++ and Fortran codes at LC, DDD will use GDB as its underlying debugger. Other debuggers can be specified by a command flag.
o In addition to offering the features of the underlying debugger, DDD is known for its interactive graphical data display, where data structures are displayed as graphs.
o Starting DDD is similar to starting GDB:
+ Open DDD on the executable a.out:
ddd a.out
+ Analyze corefile core.1234 that was produced by a.out:
ddd a.out core.1234
+ Attach to running a.out process that has PID 12345:
ddd a.out 12345
o DDD is not really a parallel debugger, but like GDB, it can be used to debug a single process of a parallel program.
1. rsh to one of the nodes that is running your job;
2. Determine the PIDs of the running executable using ps;
3. Change directories to the directory containing your executable;
4. Attach to the running process using your executable name and PID, e.g., ddd a.out 12345.
o Documentation:
+ GNU website: http://www.gnu.org/software/ddd
+ info ddd command
IDB:
o IDB is Intel's debugger for C/C++ and Fortran codes. It provides both a command line and graphical user interface.
o IDB can operate in one of two modes, either DBX or GDB. The default mode is DBX - that is, its commands are like DBX commands.
o To select the GDB mode, start the debugger with the -gdb flag (see below). In this mode, IDB commands are like the GDB debugger's.
o Examples for starting IDB in GDB mode:
+ Open IDB on the executable a.out:
idb -gdb a.out
+ Analyze corefile core.1234 that was produced by a.out:
idb a.out core.1234
+ Attach to running a.out process that has PID 12345:
idb -gdb a.out -pid 12345
o At LC, IDB will most likely not be in your default path, so you will need to include its install directory in your PATH. For example:
set path = ($path /usr/local/intel/idb91/bin)
o Parallel debugging with IDB: although Intel's documentation states that IDB can be used to debug parallel MPI codes, at LC this is not practical because SLURM does not support using IDB at this time.
o Documentation:
+ Vendor website: intel.com. Search on "idb manual linux".
+ Local copies of documentation located in the install directory's doc subdirectory for the version you are using. Check under /usr/local/intel/idb* for starters.
Debugging in Batch: batchxterm:
o Debugging batch parallel jobs on LC production clusters is fairly straightforward. The main idea is that you need to submit a batch job that gets your partition allocated and running.
o Once you have your partition, you can login to any of the nodes within it, and then start running as though you're in the interactive pdebug partition.
o For convenience, LC has developed a utility called batchxterm which makes the process even easier. Note that batchxterm is not available on all machines.
o To use batchxterm:
1. Start X on your desktop machine
2. Make sure you enable X11 tunneling for your ssh session
3. ssh and login to your cluster
4. Issue the command as follows:
batchxterm display machine #nodes #minutes
Where:
display = your Xwindows display variable, typically desktopmachine:0
machine = where you want to run - for example, thunder
#nodes = number of nodes your job requires
#minutes = how long you need to keep your partition for debugging
5. This will submit a batch job for you that will open an xterm when it starts to run. Don't worry that the batch job will be named "TVdebugSession". The only thing batchxterm really does is get you the requested batch nodes and then open an xterm on your workstation.
6. After the xterm appears, cd to the directory with your source code and begin your debug session. For example:
cd ~/projects
totalview srun -a -n8 myprog
7. Look at the comments in the batchxterm executable (it's a script) for details and example usage.
A Few Additional Useful Debugging Hints:
o For clusters with an interconnect:
+ Add the sinfo and squeue commands to your PSUB scripts to assist in diagnosing problems. In particular, these commands will document which nodes your job is using.
+ Also add the -l option to your srun command so that output statements are prepended with the task number that created them.
+ Be sure to check the exit status of all I/O operations when reading or writing files in Lustre. This will allow you to detect any I/O problems with the underlying OST servers.
+ If you know/suspect that there are problems with particular nodes, you can use the srun -x option to skip these nodes. For example:
srun -N12 -x "alc40 alc41" -ppdebug myjob
o For interactive/pdebug work, it is quite likely that your shell's coredump size setting may limit the size of a corefile so that it is inadequate for debugging, especially with TotalView. For example, it is set to 16 kbytes (csh/tcsh) or 32 kbytes (ksh/bsh) by default. If you find your debugger complaining about the core file, try increasing the size limit:
csh/tcsh
limit coredumpsize 64
ksh/bsh
ulimit -c 64
o Note that for batch jobs, the default coredump size should be unlimited. Some users have complained that for many-process jobs, they actually only want small or lightweight core files because normal coredump files are filling up their disk space. To effect this, set the shell limits in your batch command file accordingly. Or, you can also use the srun --core=light flag to get smaller (and most likely less useful) coredump files.
Tools
We Need a Book!
o The subject of "Tools" for Linux cluster applications is far too broad and deep to cover here. Instead, a few pointers are being provided for those who are interested in further research.
o The first place to check is LC's "Supported Software and Computing Tools" web pages at: www.llnl.gov/computing/hpc/code/software_tools.html for what may be available here. Some example tools are listed below.
o Memory Correctness Tools
+ Valgrind - memory management suite of tools
+ TotalView (as a subset of its debugging features)
o Profiling, Tracing and Performance Analysis
+ gprof - call graph application profiler. Gprof does simple function profiling and requires that the code be built and linked with -pg. For parallel programs there are two ways to obtain individual gmon.out files for each process:
# Setting the undocumented environment variable GMON_OUT_PREFIX to some non-null string. For example:
setenv GMON_OUT_PREFIX 'gmon.out.'`/bin/uname -n`
# Place the program executable on local disk, and then gather the profile files from each local disk at the end of the run. Be careful how you gather these if they are all named gmon.out.
+ pgprof - PGI profiler
+ mpiP - for profiling MPI programs. Locally developed.
+ PAPI - for hardware performance counter profiling
+ TAU - complete performance analysis
+ VGV - complete performance analysis
o An LC whitepaper on the subject of "High Performance Tools and Technologies" describes a large number of tools and a number of performance-related topics applicable to LC developers. Find it at: www.llnl.gov/computing/tutorials/performance_tools/HighPerformanceToolsTechnologiesLC.pdf.
o Beyond LC, the web offers endless opportunities for discovering tools that aren't yet available here.
References and More Information
o Author: Blaise Barney, Livermore Computing.
o Numerous web pages at Intel's web site:
http://www.intel.com
intel.com/software/products/compilers
o Quadrics Home Page
http://www.quadrics.com
o AMD Home Page
http://www.amd.com
o Online help documents on every LC Linux cluster: /usr/local/docs/. In particular, linux.basics by Charles Shereda.
o SLURM Reference Manual
http://www.llnl.gov/LCdocs/slurm
o LLNL Linux Project home page
http://www.llnl.gov/linux
o Linux Project Report
http://www.llnl.gov/linux/ucrl-id-150021.pdf
o Photos/Graphics: Permission to use some of Quadrics' photos/graphics has been obtained by the author and is on file. Other photos/graphics have been created by the author Blaise Barney, created by other LLNL employees, obtained from non-copyrighted sources, or used with the permission of authors from other presentations and web pages.