Clusters
California Digital provides turnkey clusters running Linux or OS X on
Xeon 64, Itanium 2, Opteron, and G5 hardware. All California Digital
clusters are delivered as turnkey installations, including hardware
and software configuration and integration, cluster and networking
configuration and stabilization, and comprehensive support through
California Digital's Total Cluster Coverage™ support program.
Total Cluster Coverage™
The key differentiator of California Digital's clusters is our Total
Cluster Coverage™ support package. Total Cluster
Coverage supports a cluster as a cluster rather than as a
combination of individual servers. To best illustrate this concept,
consider the simpler case of a traditional hardware warranty on a
discrete server versus California Digital's Total Linux
Coverage™. While a traditional warranty would cover hardware
failures, Total Linux Coverage™ supports the system when used
as a Linux server, including Linux installation and
configuration, kernel issues, and Linux security and performance
issues.
Total Cluster Coverage™ similarly broadens support from the
individual server/operating system level to include an entire cluster
installation. Thus support now includes MPI libraries, cluster
software configuration, cluster networking, account creation and job
execution, power and cooling issues, job scheduling, and compilation
for the cluster environment.
The goal of Total Cluster Coverage™ is to provide cluster users
with a single point of contact for all cluster-related issues.
California Digital's focus on clustering and engineering expertise on
cluster software enables us to deliver this unparalleled class of
support.
Total Cluster Coverage™ is included with every California Digital
cluster.
Nodes
California Digital builds clusters from a variety of compute nodes.
These systems have been selected and designed specifically to deliver
reliable yet powerful and manageable computing in a clustered
environment. Popular node choices are as follows:
- 1520: This dual Xeon64 1U delivers about 13Gflops of
performance with 3.4GHz processors (presently the most attractive
price point); see the peak-performance note after this list. The 1520
includes one PCI-X slot and supports up to 16GB of RAM, though
configurations above 8GB are expensive due to nonlinear RAM pricing.
- 1720: This dual Opteron 1U is similar to the 1520 except
that it uses Opteron rather than Xeon64 processors. The biggest
technical difference vis-a-vis the 1520 (besides the processors) is
the architecture of the memory subsystem.
- 6440: The 6440 delivers over 20Gflops of performance with
quad Itanium 2 processors with up to 9MB of cache. With up to 32GB of
RAM, the 6440 rests in a class of its own in terms of sheer
horsepower. Price per node is relatively high, so most usage has been
for very cache-sensitive applications and those using expensive
low-latency interconnects (such as Quadrics) since quad-processor
nodes reduce the number of interconnect adapters and switches needed.
California Digital's Thunder cluster at Lawrence Livermore, the
world's fifth fastest at 19.94Tflops, used 6440 compute nodes.
- 6420: The 6420 includes dual Itanium2 processors and can
essentially be viewed as "half of a 6440." The 6420 does not minimize
interconnect infrastructure costs (since it is a dual processor
system) but it allows users to harness the power of the Itanium2's EPIC
architecture at a more compelling price point than the 6440. That is
because the dual-capable Itanium2 processors operating at 1.6GHz are
cheaper than the slower quad-capable processors. The 6420 is well
suited for Itanium2 clusters interconnected with infiniband or gigabit
ethernet.
- XServe: Apple's XServe Cluster Compute Node is an
interesting choice, as it balances the tremendous power of the PowerPC
970FX processor with the relative unfamiliarity of most cluster users
with OS X. The PowerPC 970FX (or PowerPC G5) processor owes its
heritage to IBM's Power architecture, and it has retained interesting
architectural features from a supercomputing point of view. A node
with dual G5 processors will outperform any of the listed systems
(except the quad Itanium2 6440) on LINPACK performance. California
Digital's CTO, Srinidhi Varadarajan, pioneered G5 clustering when
building System X at Virginia Tech, the world's seventh fastest
supercomputer with 12.25Tflops.
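Note on peak performance figures: a Xeon64 or Opteron processor can
retire two double-precision floating-point operations per clock, so a
dual 3.4GHz node peaks at 2 x 3.4GHz x 2 = 13.6Gflops, while the
Itanium 2 retires up to four floating-point operations per clock,
which is how a quad-processor 6440 exceeds 20Gflops. Sustained LINPACK
performance is somewhat lower than these peak figures.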
Interconnects
Most California Digital clusters utilize gigabit ethernet or
infiniband for node interconnects. The former caters toward
embarrassingly parallel problems (such as genomic sequencing, seismic
analysis, computational finance, or electronic design automation)
while the latter caters toward more tightly-coupled problems such as
engineering modeling, climate simulation, physics and computational
chemistry, and protein folding.
California Digital can provide other low-latency interconnects such as
Quadrics or Myrinet. Quadrics has traditionally been deployed with
some of the largest clusters in operation (such as California
Digital's Thunder system at Lawrence Livermore) while Myrinet was
the de facto standard for many users in the pre-infiniband era.
California Digital normally deploys clusters with full bisection
bandwidth. Some large designs might use a 2:1 ratio for half
bisection with a fat tree topology. We usually recommend installation
of a gigabit ethernet backbone in addition to a low-latency
interconnect to provide background management and interconnect
troubleshooting.
Pricing of interconnect equipment varies as the number of cluster
nodes scales. For relatively small clusters, infiniband costs a
little under $1000 per node. Gigabit ethernet comes in at well under
$100 per node, since all compute nodes include integrated gigabit
ethernet. Quadrics interconnects are the most expensive, while
Myrinet is roughly the same cost as infiniband.
All California Digital clusters normally include integration of
network switches and interconnect cables, software configuration and
testing, and network performance verification and stabilization.
Rack Infrastructure
California Digital clusters normally ship
installed in anywhere from one to 117 racks. Custom-designed shipping
crates allow uneventful delivery. Please be certain your receiving
dock has an appropriate loading bay and the path to the installation
site can accommodate 42U racks (plus the additional height for wheels).
One-rack clusters ship ready-to-activate, while multiple-rack clusters
can be interconnected by the customer or via California Digital's
on-site commissioning services provided by our professional services
group.
Racks typically occupy about 6 square feet (0.6 square meters) of floor
space and can weigh more than 2000 pounds (900 kg). Racks can be situated adjacent to
each other, but adequate space must be reserved for access and air
flow in front of and behind units. Multi-row configurations require
the use of alternating hot and cold aisles with appropriate air
circulation infrastructure.
The number of compute nodes per rack varies depending upon the amount
of switching equipment needed (more for larger clusters) as well as
the optimal number of nodes per switch. In general, racks house
between 32 and 40 compute nodes. Using a slightly lower density
per rack allows better airflow and minimizes the requirements for site
cooling infrastructure.
We typically install inter-rack cables underneath the floor if a
raised floor and access ports are provided. For smaller clusters,
cables can be strung at rack level.
Power distribution units are included with all racks. We typically
use quad 120V/30A circuits or quad 240V/20A circuits per rack with
L5-30 or L6-30 plugs. We recommend use of 240V power for large
installations.
Power and Cooling
We use a design requirement of 3A (@120V) for all servers except the
6440 (12A), 6420 (6A), and XServe compute nodes (2A). We assume a
need for 1500BTU/hour for all 1U nodes (except for XServe compute
nodes at 1000BTU/hour) with 6440 and 6420 systems dissipating
6000BTU/hour and 3000BTU/hour respectively.
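As an illustration, a rack of 36 1520 nodes draws roughly 36 x 3A =
108A at 120V (about 13kW) and dissipates roughly 36 x 1500BTU/hour =
54,000BTU/hour, which the site's cooling infrastructure must be able
to remove.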
Head Node
California Digital clusters normally include one head node (or two
head nodes for large clusters) that provides management and scheduling
for the cluster. The head node is typically connected only to the
gigabit ethernet network. For small clusters, it may be combined with
a local storage subsystem (such as the California
Digital 9316 or a similar device).
Storage
California Digital clusters typically include small IDE-based node
storage. Small clusters might include several TB of NAS storage
accessible via NFS. Larger storage subsystems can be provided based
on the 9316. California Digital does not
generally provide parallel file systems or high performance storage
subsystems, though we can integrate such products into cluster
solutions. If accessing storage via infiniband, do not forget to
scale the switch port count appropriately.
Operating System
California Digital clusters are pre-installed with GNU/Linux (except
that G5 clusters are installed with OS X). Debian, Scientific Linux,
and Red Hat Enterprise Linux are offered; the first two carry no
licensing costs. California Digital provides complete
software support for any of these OS options.
Cluster Software Stack
Clusters are provisioned through OSCAR, Rocks, YACI, or the Gluster
installer. Core components of the cluster software stack include C3,
Ganglia, LAM/MPI, the Maui Cluster Scheduler, MPICH, OpenSSH, PVM, the
System Installation Suite, and TORQUE.
MPICH2
MPICH2 is an all-new implementation of MPI, designed to support
research into high-performance implementations of MPI-1 and MPI-2
functionality. In addition to the features in MPICH, MPICH2 includes
support for one-sided communication, dynamic processes,
intercommunicator collective operations, and expanded MPI-IO
functionality. Clusters consisting of both single-processor and SMP
nodes are supported. With the exception of users requiring the
communication of heterogeneous data, we strongly encourage everyone to
consider switching to MPICH2. Researchers interested in using
MPICH as a base for their research into MPI implementations should
definitely use MPICH2.
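As an illustration of the one-sided communication mentioned above, the
following minimal C sketch (our own example, not part of the MPICH2
distribution; compile with mpicc and run with at least two processes
under mpiexec) has rank 0 deposit a value directly into a memory
window exposed by rank 1:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, buf = 0;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Every rank exposes one integer of its own memory as a window. */
        MPI_Win_create(&buf, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);              /* open the access epoch */
        if (rank == 0) {
            int value = 42;
            /* Write into rank 1's window; rank 1 posts no receive. */
            MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        }
        MPI_Win_fence(0, win);              /* close the access epoch */

        if (rank == 1)
            printf("rank 1 received %d via MPI_Put\n", buf);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }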
LAM/MPI (Local Area Multicomputer)
LAM/MPI is a high-quality open-source implementation of the Message
Passing Interface specification, including all of MPI-1.2 and much of
MPI-2. Intended for production as well as research use, LAM/MPI
includes a rich set of features for system administrators, parallel
programmers, application users, and parallel computing researchers.
With LAM/MPI, a dedicated cluster or an existing network computing
infrastructure can act as a single parallel computer. LAM/MPI is
considered to be "cluster friendly", in that it offers daemon-based
process startup/control as well as fast client-to-client message
passing protocols. LAM/MPI can use TCP/IP and/or shared memory for
message passing (currently, different RPMs are supplied for this --
see the main LAM web site for details).
Compliant applications are source code portable between LAM/MPI and
any other implementation of MPI. In addition to providing a
high-quality implementation of the MPI standard, LAM/MPI offers
extensive monitoring capabilities to support debugging. Monitoring
happens on two levels. First, LAM/MPI has the hooks to allow a
snapshot of process and message status to be taken at any time during
an application run. This snapshot includes all aspects of
synchronization plus datatype maps/signatures, communicator group
membership, and message contents (see the XMPI application on the main
LAM web site). On the second level, the MPI library is instrumented
to produce a cumulative record of communication, which can be
visualized either at runtime or post-mortem.
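Because compliant applications are source-portable, the same code
builds unchanged against LAM/MPI or MPICH. A minimal point-to-point
sketch (our own illustration; under LAM the typical workflow is to
boot the daemons with lamboot, compile with mpicc, and launch with
mpirun):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, token = 0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            token = 17;
            /* Rank 0 hands a token to rank 1. */
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank 1 of %d received token %d\n", size, token);
        }

        MPI_Finalize();
        return 0;
    }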
PVM
PVM is a software system that enables a collection of heterogeneous
computers to be used as a coherent and flexible concurrent
computational resource.
The individual computers may be shared- or local-memory
multiprocessors, vector supercomputers, specialized graphics engines,
or scalar workstations, interconnected by a variety of networks such
as Ethernet or FDDI.
User programs written in C, C++ or Fortran access PVM through library
routines.
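As a sketch of how those library routines fit together (our own
example; it assumes the PVM daemon is running and that the binary,
here called pvm_hello purely for illustration, is installed where PVM
can spawn it), a master task spawns a copy of itself and exchanges a
message with it:

    #include <stdio.h>
    #include "pvm3.h"

    int main(void)
    {
        int mytid = pvm_mytid();      /* enroll this process in PVM */
        int ptid  = pvm_parent();     /* task id of the spawner, if any */

        if (ptid == PvmNoParent) {
            /* Master: spawn one worker copy of this same executable. */
            int wtid, n = 99;
            if (pvm_spawn("pvm_hello", (char **)0, PvmTaskDefault,
                          "", 1, &wtid) != 1) {
                fprintf(stderr, "spawn failed\n");
                pvm_exit();
                return 1;
            }
            pvm_initsend(PvmDataDefault);     /* default (XDR) encoding */
            pvm_pkint(&n, 1, 1);
            pvm_send(wtid, 1);                /* message tag 1 */

            pvm_recv(wtid, 2);                /* wait for the reply, tag 2 */
            pvm_upkint(&n, 1, 1);
            printf("master %x got %d back from worker %x\n", mytid, n, wtid);
        } else {
            /* Worker: receive an integer, increment it, send it back. */
            int n;
            pvm_recv(ptid, 1);
            pvm_upkint(&n, 1, 1);
            n++;
            pvm_initsend(PvmDataDefault);
            pvm_pkint(&n, 1, 1);
            pvm_send(ptid, 2);
        }

        pvm_exit();                           /* leave the virtual machine */
        return 0;
    }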
TORQUE
TORQUE (Tera-scale Open-source Resource and QUEue manager) is a
resource manager providing control over batch jobs and distributed
compute nodes. TORQUE is based on OpenPBS version 2.3.12 and
incorporates scalability, fault tolerance, and feature extension
patches provided by NCSA, OSC, the U.S. Dept of Energy, Sandia, PNNL,
U of Buffalo, TeraGrid, and many other leading edge HPC
organizations. This version may be freely modified and redistributed
subject to the constraints of the included license.
TORQUE provides enhancements over standard OpenPBS in the following
areas:
* Fault Tolerance
o Additional failure conditions checked/handled
o Many, many bugs fixed
o Node health check script support
* Scheduling Interface
o Extended query interface providing the scheduler with
additional and more accurate information
o Extended control interface allowing the scheduler
increased control over job behavior and attributes
o Allows the collection of statistics for completed jobs
* Scalability
o Significantly improved server to MOM communication model
o Ability to handle larger clusters (over 15 TF/2,500
processors)
o Ability to handle larger jobs (over 2000 processors)
o Ability to support larger server messages
* Usability
o Extensive logging additions
o More human readable logging (ie no more 'error 15038 on
command 42')
OpenPBS
The Portable Batch System (PBS) is a flexible workload management
system. It operates on networked, multi-platform UNIX environments,
including heterogeneous clusters of workstations, supercomputers, and
massively parallel systems. It has been modified slightly to be
usable under the OSCAR cluster software system.
This base package includes a few core directories necessary for the
OpenPBS system. To make this package useful, you will need to install
one or more of the following related packages: openpbs-oscar-client
(the client commands and documentation), openpbs-oscar-gui (the GUI
interface to the client commands), openpbs-oscar-mom (the software
used to run jobs on computation nodes), and openpbs-oscar-server (the
software used to schedule jobs on the server node).
We highly recommend that existing OpenPBS users migrate to TORQUE.
SIS 3.3.2
System Installation Suite is a collection of open source software
projects designed to work together to automate the installation and
configuration of networked workstations. These software projects fit
around a modular design framework designed to be cross-platform,
operating system independent, and scalable from a single workstation
to a thousand-node collection of workstations. For more information
please read the SIS Application Stack page.
MAUI
Maui is an advanced job scheduler for use on clusters and
supercomputers. It is a highly optimized and configurable tool
capable of supporting a large array of scheduling policies, dynamic
priorities, extensive reservations, and fairshare and is acknowledged
by many as 'the most advanced scheduler in the world'. It is
currently in use at hundreds of leading government, academic, and
commercial sites throughout the world. It improves the manageability
and efficiency of machines ranging from clusters of a few processors
to multi-teraflop supercomputers.
Ganglia
Ganglia is a scalable distributed monitoring system for
high-performance computing systems such as clusters and Grids. It is
based on a hierarchical design targeted at federations of clusters. It
relies on a multicast-based listen/announce protocol to monitor state
within clusters and uses a tree of point-to-point connections amongst
representative cluster nodes to federate clusters and aggregate their
state. It leverages widely used technologies such as XML for data
representation, XDR for compact, portable data transport, and RRDtool
for data storage and visualization. It uses carefully engineered data
structures and algorithms to achieve very low per-node overheads and
high concurrency. The implementation is robust, has been ported to an
extensive set of operating systems and processor architectures, and is
currently in use on over 500 clusters around the world. It has been
used to link clusters across university campuses and around the world
and can scale to handle clusters with 2000 nodes.
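As a small illustration of the XML data representation, the sketch
below (our own example) connects to a gmond daemon, assumed to be
listening on its default TCP port 8649 on the local node, and dumps
the cluster state it publishes:

    /* Minimal sketch: dump the XML state published by a local gmond. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    int main(void)
    {
        struct sockaddr_in addr;
        char buf[4096];
        ssize_t n;
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port   = htons(8649);                 /* default gmond port */
        addr.sin_addr.s_addr = inet_addr("127.0.0.1"); /* query local node   */

        if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");
            return 1;
        }
        /* gmond sends the cluster state as XML, then closes the socket. */
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            fwrite(buf, 1, (size_t)n, stdout);
        close(fd);
        return 0;
    }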
C3
The Cluster Command and Control (C3) tool suite offers a command line
interface for system and user administration tasks on a cluster. C3
is a command line interface that may also be called within
programs. C2G provides a Python/TK based GUI, that among other
features, may invoke the C3 tools. This suite implements a number of
command line based tools that have been shown to increase system
manager scalability by reducing time and effort to operate and manage
the cluster.
The current release, v4.x, is multithreaded and designed to improve the
scalability of the C3 suite to larger clusters. C3 v4.x also has the
ability to administer multiple clusters from the command line. The
combination of the C2G interface and the C3 tools is a start toward an
end-to-end solution for the system administration and user
administration tasks on a cluster. V4.x also has a scalable mode that
greatly decreases execution time on large clusters.
Environment Switcher
The env-switcher package provides a convenient method for users to
switch between "similar" packages. System- and user-level defaults
are maintained in data files and are examined at shell invocation time
to determine how the user's environment should be set up.
The canonical example of where this is helpful is using multiple
implementations of the Message Passing Interface (MPI). This
typically requires that the user's "dot" files are set appropriately
on each machine that is used since rsh/ssh are typically used to
invoke commands on remote nodes.
The env-switcher package alleviates the need for users to manually
edit their dot files, and instead gives the user command-line control
to switch between multiple implementations of MPI.
While this package was specifically motivated by the use of multiple
MPI implementations on OSCAR clusters, there is nothing specific to
either OSCAR or MPI in env-switcher -- switching between multiple MPI
implementations is only used in this description as an example. As
such, it can be used in any environment for any "switching" kind of
purpose.
Opium
The OSCAR User Synchronization System keeps the users and groups
synchronized from the central OSCAR server out to the OSCAR
compute nodes. The users and groups on the OSCAR server are checked
and synchronized (if necessary) periodically (configurable).
HDF5
HDF5 is a general purpose library and file format for storing
scientific data. HDF5 can store two primary objects: datasets and
groups. A dataset is essentially a multidimensional array of data
elements, and a group is a structure for organizing objects in an HDF5
file. Using these two basic objects, one can create and store almost
any kind of scientific data structure, such as images, arrays of
vectors, and structured and unstructured grids. You can also mix and
match them in HDF5 files according to your needs.
HDF5 was created to address the data management needs of scientists
and engineers working in high performance, data intensive computing
environments. As a result, the HDF5 library and format emphasize
storage and I/O efficiency. For instance, the HDF5 format can
accommodate data in a variety of ways, such as compressed or
chunked. And the library is tuned and adapted to read and write data
efficiently on parallel computing systems.
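A minimal sketch of those two objects, written against the HDF5
1.6-era C API (the file name, group name, and dataset shape are purely
illustrative): create a file, a group to organize objects, and a small
two-dimensional integer dataset inside it.

    #include "hdf5.h"

    int main(void)
    {
        hsize_t dims[2] = {4, 6};            /* a 4 x 6 array of integers */
        int     data[4][6];
        int     i, j;
        hid_t   file, group, space, dset;

        for (i = 0; i < 4; i++)
            for (j = 0; j < 6; j++)
                data[i][j] = i * 6 + j;

        /* Create the file, a group, and a dataset inside the group. */
        file  = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        group = H5Gcreate(file, "/results", 0);
        space = H5Screate_simple(2, dims, NULL);
        dset  = H5Dcreate(file, "/results/grid", H5T_NATIVE_INT, space,
                          H5P_DEFAULT);

        /* Write the whole array in one call. */
        H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

        H5Dclose(dset);
        H5Sclose(space);
        H5Gclose(group);
        H5Fclose(file);
        return 0;
    }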
Kernel Picker
kernel_picker allows you to substitute a given kernel into your OSCAR
(SIS) image prior to building your nodes. If executed with no command
line options, you will be prompted for all required information. You
can also specify command line options for (mostly) non-interactive
execution. Any necessary information that you do not give via an
option will cause the program to prompt you for that information.
ntpconfig
OSCAR-specific NTP configurator. Sets up the OSCAR server to receive
NTP from well-known NTP servers, and then act as a relay server to all
the cluster hosts.
autoupdate
This package contains the script cluster_update, used on an OSCAR
cluster to update the server and clients without having to worry about
autoupdate syntax. Optionally, we can set up a Debian APT-like package
update system.
loghost
loghost configures syslog settings across the cluster.
nameserver
This package contains everything needed to set up your distribution's
caching nameserver on the OSCAR server.
service management
OSCAR scripts to disable unwanted services on client nodes.
California Digital's expertise with clusters includes MPI software,
mathematical libraries, compilers, schedulers, configuration,
performance optimization, benchmarking, networking drivers, and other
related items.
FreeIPMI
FreeIPMI is a free/open-source system
management software suite developed to provide cutting-edge system
management tools to some of California Digital's OEM customers and
refined for the deployment of Lawrence Livermore's Thunder cluster.
FreeIPMI runs on all of California Digital's GNU/Linux clusters,
providing a host of important cluster management capabilities.
BIOSConfig
BIOSConfig is a free/open-source BIOS
configuration utility. It provides an in-band, console-based interface
to edit, update, save/restore, and replicate settings across the
cluster nodes. As IPMI does not cover the BIOS configuration
interface, the BIOSConfig utility bridges that gap by providing a
generic token map interface to support many different motherboards.
Custom
The sections above describe the standard configurations for California
Digital clusters. We are happy to consider special requests to meet
specific customer needs. In general, our ability to
accommodate such requests scales with the number of nodes.
Cluster QA
California Digital's rigorous quality control procedures are designed
to weed out every weak component in a cluster. Individual nodes are
passed through the Cerberus Test Control System (CTCS) and an extreme
LINPACK burn-in for a minimum of 72 hours before they are integrated
into a cluster. Most hardware defects and kernel bugs are eliminated
during this testing. Additionally, a test script based on STREAM,
LINPACK, HPL, and tiobench detects slow and defective memory modules,
processors, hard drives, and network controllers. These tests also
trigger platform management events, which are monitored using the IPMI
system event log, sensor, PEF, and MCA interfaces. Finally, the entire
cluster is put under maximum CPU and memory load for 48 continuous
hours, and the engineering team validates the benchmark results.