Clusters
California Digital provides turnkey clusters running Linux or OS X on
Xeon 64, Itanium 2, Opteron, and G5 hardware. All California Digital
clusters are delivered as turnkey installations, including hardware
and software configuration and integration, cluster and networking
configuration and stabilization, and comprehensive support through
California Digital's Total Cluster Coverage™ support program.
Total Cluster Coverage™
The key differentiator of California Digital's clusters is our Total
Cluster Coverage™ support package. Total Cluster
Coverage supports a cluster as a cluster rather than as a
combination of individual servers. To best illustrate this concept,
consider the simpler case of a traditional hardware warranty on a
discrete server versus California Digital's Total Linux
Coverage™. While a traditional warranty would cover hardware
failures, Total Linux Coverage™ supports the system when used
as a Linux server, including Linux installation and
configuration, kernel issues, and Linux security and performance
issues.
Total Cluster Coverage™ similarly broadens support from the
individual server/operating system level to include an entire cluster
installation. Thus support now includes MPI libraries, cluster
software configuration, cluster networking, account creation and job
execution, power and cooling issues, job scheduling, and compilation
for the cluster environment.
The goal of Total Cluster Coverage™ is to provide cluster users
with a single point of contact for all cluster-related issues.
California Digital's focus on clustering and engineering expertise on
cluster software enables us to deliver this unparalleled class of
support.
Total Cluster Coverage™ is included with every California Digital
cluster.
Nodes
California Digital builds clusters from a variety of compute nodes.
These systems have been selected and designed specifically to deliver
reliable yet powerful and manageable computing in a clustered
environment. Popular node choices are as follows:
- 1520: This dual Xeon64 1U delivers about 13Gflops of
performance with 3.4GHz processors (presently the most attractive
price point); see the peak-performance note after this list. The 1520
includes one PCI-X slot and supports up to 16GB of RAM, though
configurations above 8GB are expensive due to nonlinear RAM pricing.
- 1720: This dual Opteron 1U is similar to the 1520 except
that it uses Opteron rather than Xeon64 processors. The biggest
technical difference vis-a-vis the 1520 (besides the processors) is
the architecture of the memory subsystem.
- 6440: The 6440 delivers over 20Gflops of performance with
quad Itanium 2 processors with up to 9MB of cache. With up to 32GB of
RAM, the 6440 rests in a class of its own in terms of sheer
horsepower. Price per node is relatively high, so most usage has been
for very cache-sensitive applications and those using expensive
low-latency interconnects (such as Quadrics) since quad-processor
nodes reduce the number of interconnect adapters and switches needed.
California Digital's Thunder cluster at Lawrence Livermore, the
world's fifth fastest at 19.94Tflops, used 6440 compute nodes.
- 6420: The 6420 includes dual Itanium2 processors and can
essentially be viewed as "half of a 6440." The 6420 does not minimize
interconnect infrastructure costs (since it is a dual processor
system) but it allows users to harness the power of the Itanium2's EPIC
architecture at a more compelling price point than the 6440. That is
because the dual-capable Itanium2 processors operating at 1.6GHz are
cheaper than the slower quad-capable processors. The 6420 is well
suited for Itanium2 clusters interconnected with infiniband or gigabit
ethernet.
- XServe: Apple's XServe Cluster Compute Node is an
interesting choice, as it balances the tremendous power of the PowerPC
970FX processor with the relative unfamiliarity of most cluster users
with OS X. The PowerPC 970FX (or PowerPC G5) processor owes its
heritage to IBM's Power architecture, and it has retained interesting
architectural features from a supercomputing point of view. A node
with dual G5 processors will outperform any of the listed systems
(except the quad Itanium2 6440) on LINPACK performance. California
Digital's CTO, Srinidhi Varadarajan, pioneered G5 clustering when
building System X at Virginia Tech, the world's seventh fastest
supercomputer with 12.25Tflops.
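Note on peak performance figures: a Xeon64 or Opteron processor can
retire two double-precision floating-point operations per clock, so a
dual 3.4GHz node peaks at 2 x 3.4GHz x 2 = 13.6Gflops, while the
Itanium 2 retires up to four floating-point operations per clock,
which is how a quad-processor 6440 exceeds 20Gflops. Sustained LINPACK
performance is somewhat lower than these peak figures.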
Interconnects
Most California Digital clusters utilize gigabit ethernet or
infiniband for node interconnects. The former caters toward
embarrassingly parallel problems (such as genomic sequencing, seismic
analysis, computational finance, or electronic design automation)
while the latter caters toward more tightly-coupled problems such as
engineering modeling, climate simulation, physics and computational
chemistry, and protein folding.
California Digital can provide other low-latency interconnects such as
Quadrics or Myrinet. Quadrics has traditionally been deployed with
some of the largest clusters in operation (such as California
Digital's Thunder system at Lawrence Livermore) while Myrinet was
the de facto standard for many users in the pre-infiniband era.
California Digital normally deploys clusters with full bisection
bandwidth. Some large designs might use a 2:1 ratio for half
bisection with a fat tree topology. We usually recommend installation
of a gigabit ethernet backbone in addition to a low-latency
interconnect to provide background management and interconnect
troubleshooting.
Pricing of interconnect equipment varies as the number of cluster
nodes scales. For relatively small clusters, infiniband costs a
little under $1000 per node. Gigabit ethernet comes in at well under
$100 per node, since all compute nodes include integrated gigabit
ethernet. Quadrics interconnects are the most expensive, while
Myrinet is roughly the same cost as infiniband.
All California Digital clusters normally include integration of
network switches and interconnect cables, software configuration and
testing, and network performance verification and stabilization.
Rack Infrastructure
California Digital clusters normally ship
installed in anywhere from one to 117 racks. Custom-designed shipping
crates allow uneventful delivery. Please be certain your receiving
dock has an appropriate loading bay and the path to the installation
site can accommodate 42U racks (plus the additional height for wheels).
One-rack clusters ship ready-to-activate, while multiple-rack clusters
can be interconnected by the customer or via California Digital's
on-site commissioning services provided by our professional services
group.
Racks typically occupy about 6 square feet (0.6 square meters) of floor
space and can weigh more than 2000 pounds (900 kg). Racks can be situated adjacent to
each other, but adequate space must be reserved for access and air
flow in front of and behind units. Multi-row configurations require
the use of alternating hot and cold aisles with appropriate air
circulation infrastructure.
The number of compute nodes per rack varies depending upon the amount
of switching equipment needed (more for larger clusters) as well as
the optimal number of nodes per switch. In general, racks house
between 32 and 40 compute nodes. Using a slightly lower density
per rack allows better airflow and minimizes the requirements for site
cooling infrastructure.
We typically install inter-rack cables underneath the floor if a
raised floor and access ports are provided. For smaller clusters,
cables can be strung at rack level.
Power distribution units are included with all racks. We typically
use quad 120V/30A circuits or quad 240V/20A circuits per rack with
L5-30 or L6-30 plugs. We recommend use of 240V power for large
installations.
Power and Cooling
We use a design requirement of 3A (@120V) for all servers except the
6440 (12A), 6420 (6A), and XServe compute nodes (2A). We assume a
need for 1500BTU/hour for all 1U nodes (except for XServe compute
nodes at 1000BTU/hour) with 6440 and 6420 systems dissipating
6000BTU/hour and 3000BTU/hour respectively.
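As an illustration, a rack of 36 1520 nodes draws roughly 36 x 3A =
108A at 120V (about 13kW) and dissipates roughly 36 x 1500BTU/hour =
54,000BTU/hour, which the site's cooling infrastructure must be able
to remove.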
Head Node
California Digital clusters normally include one head node (or two
head nodes for large clusters) that provides management and scheduling
for the cluster. The head node is typically connected only to the
gigabit ethernet network. For small clusters, it may be combined with
a local storage subsystem (such as the California
Digital 9316 or a similar device).
Storage
California Digital clusters typically include small IDE-based node
storage. Small clusters might include several TB of NAS storage
accessible via NFS. Larger storage subsystems can be provided based
on the 9316. California Digital does not
generally provide parallel file systems or high performance storage
subsystems, though we can integrate such products into cluster
solutions. If accessing storage via infiniband, do not forget to
scale the switch port count appropriately.
Operating System
California Digital clusters are pre-installed with GNU/Linux (except
that G5 clusters are installed with OS X). Debian, Scientific Linux,
and Red Hat Enterprise Linux are offered; the first two carry no
licensing costs. California Digital provides complete
software support for any of these OS options.
Cluster Software Stack
Clusters are provisioned through OSCAR, Rocks, YACI, or the Gluster
installer. Core components of the cluster software stack include C3,
Ganglia, LAM/MPI, the Maui Cluster Scheduler, MPICH, OpenSSH, PVM, the
System Installation Suite, and TORQUE.
MPICH2
MPICH2 is an all-new implementation of MPI, designed to support
research into high-performance implementations of MPI-1 and MPI-2
functionality. In addition to the features in MPICH, MPICH2 includes
support for one-sided communication, dynamic processes,
intercommunicator collective operations, and expanded MPI-IO
functionality. Clusters consisting of both single-processor and SMP
nodes are supported. With the exception of users requiring the
communication of heterogeneous data, we strongly encourage everyone to
consider switching to MPICH2. Researchers interested in using
MPICH as a base for their research into MPI implementations should
definitely use MPICH2.
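As an illustration of the one-sided communication mentioned above, the
following minimal C sketch (our own example, not part of the MPICH2
distribution; compile with mpicc and run with at least two processes
under mpiexec) has rank 0 deposit a value directly into a memory
window exposed by rank 1:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, buf = 0;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Every rank exposes one integer of its own memory as a window. */
        MPI_Win_create(&buf, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);              /* open the access epoch */
        if (rank == 0) {
            int value = 42;
            /* Write into rank 1's window; rank 1 posts no receive. */
            MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        }
        MPI_Win_fence(0, win);              /* close the access epoch */

        if (rank == 1)
            printf("rank 1 received %d via MPI_Put\n", buf);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }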
LAM/MPI (Local Area Multicomputer)
LAM/MPI is a high-quality open-source implementation of the Message
Passing Interface specification, including all of MPI-1.2 and much of
MPI-2. Intended for production as well as research use, LAM/MPI
includes a rich set of features for system administrators, parallel
programmers, application users, and parallel computing researchers.
With LAM/MPI, a dedicated cluster or an existing network computing
infrastructure can act as a single parallel computer. LAM/MPI is
considered to be "cluster friendly", in that it offers daemon-based
process startup/control as well as fast client-to-client message
passing protocols. LAM/MPI can use TCP/IP and/or shared memory for
message passing (currently, different RPMs are supplied for this --
see the main LAM web site for details).
Compliant applications are source code portable between LAM/MPI and
any other implementation of MPI. In addition to providing a
high-quality implementation of the MPI standard, LAM/MPI offers
extensive monitoring capabilities to support debugging. Monitoring
happens on two levels. First, LAM/MPI has the hooks to allow a
snapshot of process and message status to be taken at any time during
an application run. This snapshot includes all aspects of
synchronization plus datatype maps/signatures, communicator group
membership, and message contents (see the XMPI application on the main
LAM web site). On the second level, the MPI library is instrumented
to produce a cumulative record of communication, which can be
visualized either at runtime or post-mortem.
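Because compliant applications are source-portable, the same code
builds unchanged against LAM/MPI or MPICH. A minimal point-to-point
sketch (our own illustration; under LAM the typical workflow is to
boot the daemons with lamboot, compile with mpicc, and launch with
mpirun):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, token = 0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            token = 17;
            /* Rank 0 hands a token to rank 1. */
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank 1 of %d received token %d\n", size, token);
        }

        MPI_Finalize();
        return 0;
    }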
PVM
PVM is a software system that enables a collection of heterogeneous
computers to be used as a coherent and flexible concurrent
computational resource.
The individual computers may be shared- or local-memory
multiprocessors, vector supercomputers, specialized graphics engines,
or scalar workstations, interconnected by a variety of networks such
as Ethernet or FDDI.
User programs written in C, C++ or Fortran access PVM through library
routines.
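As a sketch of how those library routines fit together (our own
example; it assumes the PVM daemon is running and that the binary,
here called pvm_hello purely for illustration, is installed where PVM
can spawn it), a master task spawns a copy of itself and exchanges a
message with it:

    #include <stdio.h>
    #include "pvm3.h"

    int main(void)
    {
        int mytid = pvm_mytid();      /* enroll this process in PVM */
        int ptid  = pvm_parent();     /* task id of the spawner, if any */

        if (ptid == PvmNoParent) {
            /* Master: spawn one worker copy of this same executable. */
            int wtid, n = 99;
            if (pvm_spawn("pvm_hello", (char **)0, PvmTaskDefault,
                          "", 1, &wtid) != 1) {
                fprintf(stderr, "spawn failed\n");
                pvm_exit();
                return 1;
            }
            pvm_initsend(PvmDataDefault);     /* default (XDR) encoding */
            pvm_pkint(&n, 1, 1);
            pvm_send(wtid, 1);                /* message tag 1 */

            pvm_recv(wtid, 2);                /* wait for the reply, tag 2 */
            pvm_upkint(&n, 1, 1);
            printf("master %x got %d back from worker %x\n", mytid, n, wtid);
        } else {
            /* Worker: receive an integer, increment it, send it back. */
            int n;
            pvm_recv(ptid, 1);
            pvm_upkint(&n, 1, 1);
            n++;
            pvm_initsend(PvmDataDefault);
            pvm_pkint(&n, 1, 1);
            pvm_send(ptid, 2);
        }

        pvm_exit();                           /* leave the virtual machine */
        return 0;
    }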
TORQUE
TORQUE (Tera-scale Open-source Resource and QUEue manager) is a
resource manager providing control over batch jobs and distributed
compute nodes. TORQUE is based on OpenPBS version 2.3.12 and
incorporates scalability, fault tolerance, and feature extension
patches provided by NCSA, OSC, the U.S. Dept of Energy, Sandia, PNNL,
U of Buffalo, TeraGrid, and many other leading edge HPC
organizations. This version may be freely modified and redistributed
subject to the constraints of the included license.
TORQUE provides enhancements over standard OpenPBS in the following
areas:
* Fault Tolerance
o Additional failure conditions checked/handled
o Many, many bugs fixed
o Node health check script support
* Scheduling Interface
o Extended query interface providing the scheduler with
additional and more accurate information
o Extended control interface allowing the scheduler
increased control over job behavior and attributes
o Allows the collection of statistics for completed jobs
* Scalability
o Significantly improved server to MOM communication model
o Ability to handle larger clusters (over 15 TF/2,500
processors)
o Ability to handle larger jobs (over 2000 processors)
o Ability to support larger server messages
* Usability
o Extensive logging additions
o More human readable logging (ie no more 'error 15038 on
command 42')
OpenPBS
The Portable Batch System (PBS) is a flexible workload management
system. It operates on networked, multi-platform UNIX environments,
including heterogeneous clusters of workstations, supercomputers, and
massively parallel systems. It has been modified slightly to be
usable under the OSCAR cluster software system.
This base package includes a few core directories necessary for the
OpenPBS system. To make this package useful, you will need to install
one or more of the following related packages: openpbs-oscar-client
(the client commands and documentation), openpbs-oscar-gui (the GUI
interface to the client commands), openpbs-oscar-mom (the software
used to run jobs on computation nodes), and openpbs-oscar-server (the
software used to schedule jobs on the server node).
We highly recommend that existing OpenPBS users migrate to TORQUE.
SIS 3.3.2
System Installation Suite is a collection of open source software
projects designed to work together to automate the installation and
configuration of networked workstations. These software projects fit
around a modular design framework designed to be cross-platform,
operating system independent, and scalable from a single workstation
to a thousand-node collection of workstations. For more information
please read the SIS Application Stack page.
MAUI
Maui is an advanced job scheduler for use on clusters and
supercomputers. It is a highly optimized and configurable tool
capable of supporting a large array of scheduling policies, dynamic
priorities, extensive reservations, and fairshare and is acknowledged
by many as 'the most advanced scheduler in the world'. It is
currently in use at hundreds of leading government, academic, and
commercial sites throughout the world. It improves the manageability
and efficiency of machines ranging from clusters of a few processors
to multi-teraflop supercomputers.
Ganglia
Ganglia is a scalable distributed monitoring system for
high-performance computing systems such as clusters and Grids. It is
based on a hierarchical design targeted at federations of clusters. It
relies on a multicast-based listen/announce protocol to monitor state
within clusters and uses a tree of point-to-point connections amongst
representative cluster nodes to federate clusters and aggregate their
state. It leverages widely used technologies such as XML for data
representation, XDR for compact, portable data transport, and RRDtool
for data storage and visualization. It uses carefully engineered data
structures and algorithms to achieve very low per-node overheads and
high concurrency. The implementation is robust, has been ported to an
extensive set of operating systems and processor architectures, and is
currently in use on over 500 clusters around the world. It has been
used to link clusters across university campuses and around the world
and can scale to handle clusters with 2000 nodes.
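As a small illustration of the XML data representation, the sketch
below (our own example) connects to a gmond daemon, assumed to be
listening on its default TCP port 8649 on the local node, and dumps
the cluster state it publishes:

    /* Minimal sketch: dump the XML state published by a local gmond. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    int main(void)
    {
        struct sockaddr_in addr;
        char buf[4096];
        ssize_t n;
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port   = htons(8649);                 /* default gmond port */
        addr.sin_addr.s_addr = inet_addr("127.0.0.1"); /* query local node   */

        if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");
            return 1;
        }
        /* gmond sends the cluster state as XML, then closes the socket. */
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            fwrite(buf, 1, (size_t)n, stdout);
        close(fd);
        return 0;
    }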
C3
The Cluster Command and Control (C3) tool suite offers a command line
interface for system and user administration tasks on a cluster. C3
is a command line interface that may also be called within
programs. C2G provides a Python/TK based GUI, that among other
features, may invoke the C3 tools. This suite implements a number of
command line based tools that have been shown to increase system
manager scalability by reducing time and effort to operate and manage
the cluster.
The current release, v4.x, is multithreaded and designed to improve the
scalability of the C3 suite to larger clusters. C3 v4.x also has the
ability to administer multiple clusters from the command line. The
combination of the C2G interface and the C3 tools is a start toward an
end-to-end solution for the system administration and user
administration tasks on a cluster. V4.x also has a scalable mode that
greatly decreases execution time on large clusters.
Environment Switcher
The env-switcher package provides a convenient method for users to
switch between "similar" packages. System- and user-level defaults
are maintained in data files and are examined at shell invocation time
to determine how the user's environment should be set up.
The canonical example of where this is helpful is using multiple
implementations of the Message Passing Interface (MPI). This
typically requires that the user's "dot" files are set appropriately
on each machine that is used since rsh/ssh are typically used to
invoke commands on remote nodes.
The env-switcher package alleviates the need for users to manually
edit their dot files, and instead gives the user command-line control
to switch between multiple implementations of MPI.
While this package was specifically motivated by the use of multiple
MPI implementations on OSCAR clusters, there is nothing specific to
either OSCAR or MPI in env-switcher -- switching between multiple MPI
implementations is only used in this description as an example. As
such, it can be used in any environment for any "switching" kind of
purpose.
Opium
The OSCAR User Synchronization System keeps the users and groups
synchronized from the central OSCAR server out to the OSCAR
compute nodes. The users and groups on the OSCAR server are checked
and synchronized (if necessary) periodically (configurable).
HDF5
HDF5 is a general purpose library and file format for storing
scientific data. HDF5 can store two primary objects: datasets and
groups. A dataset is essentially a multidimensional array of data
elements, and a group is a structure for organizing objects in an HDF5
file. Using these two basic objects, one can create and store almost
any kind of scientific data structure, such as images, arrays of
vectors, and structured and unstructured grids. You can also mix and
match them in HDF5 files according to your needs.
HDF5 was created to address the data management needs of scientists
and engineers working in high performance, data intensive computing
environments. As a result, the HDF5 library and format emphasize
storage and I/O efficiency. For instance, the HDF5 format can
accommodate data in a variety of ways, such as compressed or
chunked. And the library is tuned and adapted to read and write data
efficiently on parallel computing systems.
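A minimal sketch of those two objects, written against the HDF5
1.6-era C API (the file name, group name, and dataset shape are purely
illustrative): create a file, a group to organize objects, and a small
two-dimensional integer dataset inside it.

    #include "hdf5.h"

    int main(void)
    {
        hsize_t dims[2] = {4, 6};            /* a 4 x 6 array of integers */
        int     data[4][6];
        int     i, j;
        hid_t   file, group, space, dset;

        for (i = 0; i < 4; i++)
            for (j = 0; j < 6; j++)
                data[i][j] = i * 6 + j;

        /* Create the file, a group, and a dataset inside the group. */
        file  = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        group = H5Gcreate(file, "/results", 0);
        space = H5Screate_simple(2, dims, NULL);
        dset  = H5Dcreate(file, "/results/grid", H5T_NATIVE_INT, space,
                          H5P_DEFAULT);

        /* Write the whole array in one call. */
        H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

        H5Dclose(dset);
        H5Sclose(space);
        H5Gclose(group);
        H5Fclose(file);
        return 0;
    }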
Kernel Picker
kernel_picker allows you to substitute a given kernel into your OSCAR
(SIS) image prior to building your nodes. If executed with no command
line options, you will be prompted for all required information. You
can also specify command line options for (mostly) non-interactive
execution. Any necessary information that you do not give via an
option will cause the program to prompt you for that information.
ntpconfig
OSCAR-specific NTP configurator. Sets up the OSCAR server to receive
NTP from well-known NTP servers, and then act as a relay server to all
the cluster hosts.
autoupdate
This package contains the script cluster_update, used on an OSCAR
cluster to update the server and clients without having to worry about
autoupdate syntax. Optionally, we can set up a Debian APT-like package
update system.
loghost
loghost configures syslog settings across the cluster.
nameserver
This package contains everything needed to set up your distribution's
caching nameserver on the OSCAR server.
service management
OSCAR scripts to disable unwanted services on client nodes.
California Digital's expertise with clusters includes MPI software,
mathematical libraries, compilers, schedulers, configuration,
performance optimization, benchmarking, networking drivers, and other
related items.
FreeIPMI
FreeIPMI is a free/open-source system
management software suite developed to provide cutting-edge system
management tools to some of California Digital's OEM customers and
refined for the deployment of Lawrence Livermore's Thunder cluster.
FreeIPMI runs on all of California Digital's GNU/Linux clusters,
providing a host of important cluster management capabilities.
BIOSConfig
BIOSConfig is a free/open-source BIOS
configuration utility. It provides an in-band, console-based interface
to edit, update, save/restore, and replicate settings across the
cluster nodes. As IPMI does not cover the BIOS configuration
interface, the BIOSConfig utility bridges that gap by providing a
generic token map interface to support many different motherboards.
Custom
The sections above describe the standard configurations for California
Digital clusters. We are happy to consider special requests to meet
specific customer needs. In general, our ability to
accommodate such requests scales with the number of nodes.
Cluster QA
California Digital's rigorous quality control procedures are designed
to weed out every weak component in a cluster. Individual nodes are
passed through the Cerberus Test Control System (CTCS) and an extreme
LINPACK burn-in for a minimum of 72 hours before they are integrated
into a cluster. Most hardware defects and kernel bugs are eliminated
during this testing. Additionally, a test script based on STREAM,
LINPACK, HPL, and tiobench detects slow and defective memory modules,
processors, hard drives, and network controllers. These tests also
trigger platform management events, which are monitored using the IPMI
system event log, sensor, PEF, and MCA interfaces. Finally, the entire
cluster is put under maximum CPU and memory load for 48 continuous
hours, and the engineering team validates the benchmark results.