DESIGN PRINCIPLES OF COMPUTER CLUSTERS
Clusters should be designed
for scalability and availability. In this section, we will cover the design
principles of SSI, HA, fault tolerance, and rollback recovery in
general-purpose computers and clus-ters of cooperative computers.
1. Single-System Image Features
SSI does not mean a single
copy of an operating system image residing in memory, as in an SMP or a
workstation. Rather, it means the illusion of a single system, single control, symmetry,
and transparency as characterized in the following list:
• Single system The entire
cluster is viewed by users as one system that has multiple processors. The user
could say, “Execute my application using
five processors.” This is different from a
distributed system.
• Single control Logically, an
end user or system user utilizes services from one place with a single
interface. For instance, a user submits batch jobs to one set of queues; a
system administrator configures all the hardware and software components of the
cluster from one control point.
• Symmetry A user can use a
cluster service from any node. In other words, all cluster services and
functionalities are symmetric to all nodes and all users, except those
protected by access rights.
• Location-transparent The user
is not aware of the where abouts of the physical device that eventually
provides a service. For instance, the user can use a tape drive attached to any
cluster node as though it were physically attached to the local node.
The main
motivation to have SSI is that it allows a cluster to be used, controlled, and
main-tained as a familiar workstation is. The word “single” in “single-system image” is sometimes synon-ymous with “global” or “central.” For instance, a global file system means a
single file hierarchy, which a user can access from any node. A single point of
control allows an operator to monitor and configure the cluster system.
Although there is an illusion of a single system, a cluster service or
functionality is often realized in a distributed manner through the cooperation
of multiple compo-nents. A main requirement (and advantage) of SSI techniques
is that they provide both the perfor-mance benefits of distributed
implementation and the usability benefits of a single image.
From the
viewpoint of a process P, cluster nodes can be
classified into three types. The home node of a process P is the node where P resided when it was created. The local node of a process P is the node where P currently resides. All other nodes are remote nodes to P. Cluster nodes can be configured to suit different needs. A host node serves user logins through Telnet, rlogin, or
even FTP and HTTP. A compute
node is one
that performs computational jobs. An I/O node is one that serves file I/O requests. If a
cluster has large shared disks and tape units, they are normally physi-cally
attached to I/O nodes.
There is
one home node for each process, which is fixed throughout the life of the
process. At any time, there is only one local node, which may or may not be the
host node. The local node and remote nodes of a process may change when the
process migrates. A node can be configured to provide multiple functionalities.
For instance, a node can be designated as a host, an I/O node, and a compute
node at the same time. The illusion of an SSI can be obtained at several
layers, three of which are discussed in the following list. Note that these
layers may overlap with one another.
• Application software layer
Two examples are parallel web servers and various parallel databases. The user
sees an SSI through the application and is not even aware that he is using a
cluster. This approach demands the modification of workstation or SMP
applications for clusters.
• Hardware or kernel layer
Ideally, SSI should be provided by the operating system or by the hardware.
Unfortunately, this is not a reality yet. Furthermore, it is extremely
difficult to provide an SSI over heterogeneous clusters. With most hardware
architectures and operating systems being proprietary, only the manufacturer
can use this approach.
• Middleware layer The most
viable approach is to construct an SSI layer just above the OS kernel. This
approach is promising because it is platform-independent and does not require
application modification. Many cluster job management systems have already
adopted this approach.
Each computer in a cluster has its own
operating system image. Thus, a cluster may display multiple system images due
to the stand-alone operations of all participating node computers. Deter-mining
how to merge the multiple system images in a cluster is as difficult as
regulating many indi-vidual personalities in a community to a single
personality. With different degrees of resource sharing, multiple systems could
be integrated to achieve SSI at various operational levels.
1.1
Single Entry Point
Single-system image (SSI) is
a very rich concept, consisting of single entry point, single file hierarchy,
single I/O space, single networking scheme, single control point, single job manage-ment
system, single memory space, and single process space. The single entry point
enables users to log in (e.g., through Telnet, rlogin, or HTTP) to a cluster as
one virtual host, although the clus-ter may have multiple physical host nodes
to serve the login sessions. The system transparently distributes the user’s login and connection requests to different
physical hosts to balance the load. Clusters could substitute for mainframes
and supercomputers. Also, in an Internet cluster server, thousands of HTTP or
FTP requests may come simultaneously. Establishing a single entry point with
multiple hosts is not a trivial matter. Many issues must be resolved. The
following is just a partial list:
• Home directory Where do you
put the user’s home directory?
• Authentication How do you
authenticate user logins?
• Multiple connections What if
the same user opens several sessions to the same user account?
• Host failure How do you deal
with the failure of one or more hosts?
Example 2.5 Realizing a Single Entry Point in a
Cluster of Computers
Figure
2.13 illustrates how to realize a single entry point. Four nodes of a cluster
are used as host nodes to receive users’ login requests. Although only one user
is shown, thousands of users can connect to the cluster in the same fashion.
When a user logs into the cluster, he issues a standard UNIX command such as
telnet cluster.cs.hku.hk, using the symbolic name of the cluster system.
The DNS translates the symbolic name and
returns the IP address 159.226.41.150 of the least-loaded node, which happens
to be node Host1. The user then logs in using this IP address. The DNS
periodically receives load information from the host nodes to make
load-balancing translation decisions. In the ideal case, if 200 users
simultaneously log in, the login sessions are evenly distributed among our
hosts with 50 users each. This allows a single host to be four times more
powerful.
1.2 Single File Hierarchy
We use
the term “single file hierarchy” in this book to mean the illusion of a single,
huge file sys-tem image that transparently integrates local and global disks
and other file devices (e.g., tapes). In other words, all files a user needs
are stored in some subdirectories of the root directory /, and they can be accessed
through ordinary UNIX calls such as open, read, and so on. This should not be confused with
the fact that multiple file systems can exist in a workstation as
subdirectories of the root directory.
The functionalities of a single file hierarchy
have already been partially provided by existing distrib-uted file systems such
as Network
File System (NFS) and Andrew File System (AFS). From
the view-point of any process, files can reside on three types of locations in
a cluster, as shown in Figure 2.14.
Local
storage is the disk on the local node of a process. The disks on remote
nodes are remote
storage. A stable
storage requires two aspects: It is persistent, which means data, once written to the stable storage, will stay there for a
sufficiently long time (e.g., a week), even after the cluster shuts down; and
it is fault-tolerant to some degree, by using redundancy and periodic backup to
tapes.
Figure 2.14 uses stable storage. Files in stable storage are called
global files, those in local sto-rage local files, and those in remote storage remote files. Stable storage could be implemented as one
centralized, large RAID disk. But it could also be distributed using local
disks of cluster nodes. The first approach uses a large disk, which is a single
point of failure and a potential performance bottleneck. The latter approach is
more difficult to implement, but it is potentially more economical, more
efficient, and more available. On many cluster systems, it is customary for the
system to make visible to the user processes the following directories in a
single file hierarchy: the usual system directories as in a traditional UNIX workstation, such as
/usr and /usr/local; and the user’s home directory ~/ that has a small disk quota
(1–20 MB). The user stores his
code files and other files here. But large data files
must be stored elsewhere.
• A global directory is shared by all users and all processes. This
directory has a large disk space of multiple gigabytes. Users can store their
large data files here.
• On a cluster system, a
process can access a special directory on the local disk. This directory has
medium capacity and is faster to access than the global directory.
1.3
Visibility of Files
The term “visibility” here means a process can use traditional UNIX
system or library calls such as fopen, fread, and fwrite to access files. Note that there are multiple
local scratch directories in a cluster. The local scratch
directories in remote nodes are not in the single file hierarchy, and are not
directly visible to the process. A user process can still access them with
commands such as rcp or some special library
functions, by specifying both the node name and the filename.
The name
“scratch” indicates that the storage is meant to act as
a scratch pad for temporary information storage. Information in the local
scratch space could be lost once the user logs out. Files in the global scratch
space will normally persist even after the user logs out, but will be deleted
by the system if not accessed in a predetermined time period. This is to free
disk space for other users. The length of the period can be set by the system
administrator, and usually ranges from one day to several weeks. Some systems
back up the global scratch space to tapes periodically or before deleting any
files.
1.4
Support of Single-File Hierarchy
It is desired that a single
file hierarchy have the SSI properties discussed, which are reiterated for file
systems as follows:
• Single system There is just
one file hierarchy from the user’s viewpoint.
• Symmetry A user can access
the global storage (e.g., /scratch) using a cluster service
from any node. In other words, all file services and functionalities are
symmetric to all nodes and all users, except those protected by access rights.
• Location-transparent The user
is not aware of the whereabouts of the physical device that eventually provides
a service. For instance, the user can use a RAID attached to any cluster node
as though it were physically attached to the local node. There may be some
performance differences, though.
A
cluster file system should maintain UNIX semantics: Every file operation (fopen, fread, fwrite, fclose, etc.) is a transaction. When an fread accesses a file after an fwrite modifies the same file, the fread should get the updated value. However,
existing distributed file systems do not completely follow UNIX semantics. Some
of them update a file only at close or flush. A number of alternatives have
been suggested to organize the global storage in a cluster. One extreme is to
use a single file server that hosts a big RAID. This solution is simple and can
be easily implemented with current software (e.g., NFS). But the file server
becomes both a performance bottleneck and a single point of failure. Another
extreme is to utilize the local disks in all nodes to form global storage. This
could solve the performance and availability problems of a single file server.
1.5
Single I/O, Networking, and Memory Space
To achieve SSI, we desire a
single control point, a single address space, a single job management system, a
single user interface, and a single process control, as depicted in Figure
2.15. In this example, each node has exactly one network connection. Two of the
four nodes each have two I/O devices attached.
Single
Networking: A properly designed cluster should behave as one system (the shaded
area). In other words, it is like a big workstation with four network
connections and four I/O devices attached. Any process on any node can use any
network and I/O device as though it were attached to the local node. Single
networking means any node can access any network connection.
Single
Point of Control: The system administrator should be able to configure,
monitor, test, and control the entire cluster and each individual node from a
single point. Many clusters help with this through a system console that is
connected to all nodes of the cluster. The system console is normally connected
to an external LAN (not shown in Figure 2.15) so that the administrator can log
in remotely to the system console from anywhere in the LAN to perform
administration work.
Note that single point of control does not mean all system
administration work should be carried out solely by the system console. In
reality, many administrative functions are distributed across the cluster. It
means that controlling a cluster should be no more difficult than administering
an SMP or a mainframe. It implies that administration-related system
information (such as various configuration files) should be kept in one logical
place. The administrator monitors the cluster with one graphics tool, which
shows the entire picture of the cluster, and the administrator can zoom in and
out at will.
Single point of control (or single point of management) is one of the most
challenging issues in constructing a cluster system. Techniques from
distributed and networked system management can be transferred to clusters.
Several de facto standards have already been developed for network management.
An example is Simple
Network Management Protocol (SNMP). It demands an efficient
cluster management package that integrates with the availability support system, the file system, and the job
management system.
Single Memory Space: Single memory space gives users the illusion of a big, centralized
main memory, which in reality may be a set of distributed local memory spaces.
PVPs, SMPs, and DSMs have an edge over MPPs and clusters in this respect,
because they allow a program to utilize all global or local memory space. A
good way to test if a cluster has a single memory space is to run a sequential program that needs a memory space larger than any single node can
provide.
Suppose each node in Figure 2.15 has 2 GB of
memory available to users. An ideal single memory image would allow the cluster
to execute a sequential program that needs 8 GB of memory. This would enable a
cluster to operate like an SMP system. Several approaches have
been
attempted to achieve a single memory space on clusters. Another approach is to
let the compiler distribute the data structures of an application across
multiple nodes. It is still a challenging task to develop a single memory
scheme that is efficient, platform-independent, and able to support sequential
binary codes.
Single I/O Address Space: Assume the cluster is
used as a web server. The web information database is distributed between the
two RAIDs. An HTTP daemon is started on each node to handle web requests, which
come from all four network connections. A single I/O space implies that any
node can access the two RAIDs. Suppose most requests come from the ATM network.
It would be beneficial if the functions of the HTTP on node 3 could be
distributed to all four nodes. The following example shows a distributed RAID-x
architecture for I/O-centric cluster computing.
Example 2.6 Single I/O Space over Distributed
RAID for I/O-Centric Clusters
A
distributed disk array architecture was proposed by Hwang, et al. [9] for
establishing a single I/O space in I/O-centric cluster applications. Figure
2.16 shows the architecture for a four-node Linux PC cluster, in which three
disks are attached to the SCSI bus of each host node. All 12 disks form an
integrated RAID-x with a single address space. In other words, all PCs can access
both local and remote disks. The addres-sing scheme for all disk blocks is
interleaved horizontally. Orthogonal stripping and mirroring make it possi-ble
to have a RAID-1 equivalent capability in the system.
The shaded blocks are images of the blank blocks.
A disk block and its image will be mapped on dif-ferent physical disks in an
orthogonal manner. For example, the block B0 is located on disk D0. The image block Mo of block B0 is located on disk D3. The four disks D0, D1,
D2, and D3 are attached to four servers, and thus can be accessed in parallel.
Any single disk failure will not lose the data block, because its image is
available in recovery. All disk blocks are labeled to show image mapping.
Benchmark experi-ments show that this RAID-x is scalable and can restore data
after any single disk failure. The distributed RAID-x has improved aggregate
I/O bandwidth in both parallel read and write operations over all physical
disks in the cluster.
1.6
Other Desired SSI Features
The ultimate goal of SSI is
for a cluster to be as easy to use as a desktop computer. Here are addi-tional
types of SSI, which are present in SMP servers:
• Single job management system
All cluster jobs can be submitted from any node to a single job management
system.
• Single user interface The
users use the cluster through a single graphical interface. Such an interface
is available for workstations and PCs. A good direction to take in developing a
cluster GUI is to utilize web technology.
• Single process space All user
processes created on various nodes form a single process space and share a
uniform process identification scheme. A process on any node can create (e.g.,
through a UNIX fork) or communicate with (e.g., through signals, pipes, etc.)
processes on remote nodes.
• Middleware support for SSI
clustering As shown in Figure 2.17, various SSI features are supported by
middleware developed at three cluster application levels:
• Management level This level
handles user applications and provides a job management system such as GLUnix,
MOSIX, Load
Sharing Facility (LSF), or Codine.
• Programming level This level
provides single file hierarchy (NFS, xFS, AFS, Proxy) and distributed shared
memory (TreadMark, Wind Tunnel).
• Implementation level This
level supports a single process space, checkpointing, process migration, and a
single I/O space. These features must interface with the cluster hardware and
OS platform. The distributed disk array, RAID-x, in Example 2.6 implements a
single I/O space.
2. High Availability through
Redundancy
When designing robust, highly
available systems three terms are often used together: reliability, availability, and serviceability (RAS). Availability is the most
interesting measure since it combines the concepts of reliability and serviceability
as defined here:
• Reliability measures how long a system can operate without
a breakdown.
• Availability indicates the percentage of time that a system
is available to the user, that is, the percentage of system uptime.
• Serviceability refers to how easy it is to service the system,
including hardware and software maintenance, repair,
upgrades, and so on.
The
demand for RAS is driven by practical market needs. A recent Find/SVP survey
found the following figures among Fortune 1000 companies: An average computer
is down nine times per year with an average downtime of four hours. The average
loss of revenue per hour of downtime is $82,500. With such a hefty penalty for
downtime, many companies are striving for systems that offer 24/365 availability, meaning the system is
available 24 hours per day, 365 days per year.
2.1 Availability and Failure Rate
As Figure 2.18 shows, a computer system
operates normally for a period of time before it fails. The failed system is then
repaired, and the system returns to normal operation. This operate-repair cycle
then repeats. A system’s reliability is measured by
the mean time to failure (MTTF), which is the average
time of normal operation before the system (or a component of the system)
fails. The metric for serviceability is the mean time to repair (MTTR), which is the average
time it takes to repair the system and restore it to working condition after it
fails. The availability of a system is defined by:
2.2
Planned versus Unplanned Failure
When studying RAS, we call
any event that prevents the system from normal operation a failure. This includes:
• Unplanned failures The system
breaks, due to an operating system crash, a hardware failure, a network disconnection,
human operation errors, a power outage, and so on. All these are simply called
failures. The system must be repaired to correct the failure.
• Planned shutdowns The system
is not broken, but is periodically taken off normal operation for upgrades,
reconfiguration, and maintenance. A system may also be shut down for weekends
or holidays. The MTTR in Figure 2.18 for this type of failure is the planned
downtime.
Table
2.5 shows the availability values of several representative systems. For
instance, a conven-tional workstation has an availability of 99 percent,
meaning it is up and running 99 percent of the time or it has a downtime of 3.6
days per year. An optimistic definition of availability does not consider
planned downtime, which may be significant. For instance, many supercomputer
installations have a planned downtime of several hours per week, while a
telephone system cannot tolerate a downtime of a few minutes per year.
2.3
Transient versus Permanent Failures
A lot of failures are transient in that they occur temporarily and then
disappear. They can be dealt with without replacing any components. A standard
approach is to roll back the system to a known
state and start over. For
instance, we all have rebooted our PC to take care of transient failures such
as a frozen keyboard or window. Permanent failures cannot be corrected by
rebooting. Some hard-ware or software component must be repaired or replaced.
For instance, rebooting will not work if the system hard disk is broken.
2.4
Partial versus Total Failures
A failure that renders the entire system
unusable is called a total
failure. A failure that only affects part of the system is called a partial failure if the system is still usable, even at a
reduced capacity. A key approach to enhancing availability is to make as many
failures as possible partial failures, by systematically removing single points of failure, which are hardware or
software components whose failure will bring down the entire system.
Example 2.7 Single Points of Failure in an SMP
and in Clusters of Computers
In an SMP (Figure 2.19(a)), the shared memory,
the OS image, and the memory bus are all single points of failure. On the other
hand, the processors are not forming a single point of failure. In a cluster of
work-stations (Figure 2.19(b)), interconnected by Ethernet, there are multiple
OS images, each residing in a workstation. This avoids the single point of
failure caused by the OS as in the SMP case. However, the Ethernet network now
becomes a single point of failure, which is eliminated in Figure 2.17(c), where
a high-speed network is added to provide two paths for communication.
When a node fails in the clusters in Figure
2.19(b) and Figure 2.19(c), not only will the node applications all fail, but
also all node data cannot be used until the node is repaired. The shared disk
cluster in Figure 2.19(d) provides a remedy. The system stores persistent data
on the shared disk, and periodically checkpoints to save intermediate results. When one WS node
fails, the data will not be lost in this shared-disk cluster.
2.5
Redundancy Techniques
Consider the cluster in
Figure 2.19(d). Assume only the nodes can fail. The rest of the system (e.g.,
interconnect and the shared RAID disk) is 100 percent available. Also assume
that when a node fails, its workload is switched over to the other node in zero
time. We ask, what is the availability of the cluster if planned downtime is
ignored? What is the availability if the cluster needs one hour/week for
maintenance? What is the availability if it is shut down one hour/week, one
node at a time?
According
to Table 2.4, a workstation is available 99 percent of the time. The time both
nodes are down is only 0.01 percent. Thus, the availability is 99.99 percent.
It is now a fault-resilient sys-tem, with only one hour of downtime per year.
The planned downtime is 52 hours per year, that is, 52 / (365 × 24) = 0.0059. The total downtime is now 0.59
percent + 0.01 percent = 0.6 percent. The availability of the cluster becomes
99.4 percent. Suppose we ignore the unlikely situation in which the other node fails while one node
is maintained. Then the availability is 99.99 percent.
There are basically two ways to increase the
availability of a system: increasing MTTF or redu-cing MTTR. Increasing MTTF
amounts to increasing the reliability of the system. The computer industry has
strived to make reliable systems, and today’s workstations have MTTFs in the range of
hundreds to thousands of hours. However, to further improve MTTF is very
difficult and costly. Clus-ters offer an HA solution based on reducing the MTTR
of the system. A multinode cluster has a lower MTTF (thus lower reliability)
than a workstation. However, the failures are taken care of quickly to deliver
higher availability. We consider several redundancy techniques used in cluster
design.
2.6
Isolated Redundancy
A key technique to improve
availability in any system is to use redundant components. When a component
(the primary component) fails, the
service it provided is taken over by another compo-nent (the backup component). Furthermore, the primary and the
backup components should be iso-lated
from
each other, meaning they should not be subject to the same cause of failure.
Clusters provide HA with redundancy in
power supplies, fans, processors, memories, disks, I/O devices, net-works,
operating system images, and so on. In a carefully designed cluster, redundancy
is also isolated. Isolated redundancy provides several benefits:
• First, a component designed
with isolated redundancy is not a single point of failure, and the failure of
that component will not cause a total system failure.
• Second, the failed component
can be repaired while the rest of the system is still working.
• Third, the primary and the
backup components can mutually test and debug each other.
The IBM
SP2 communication subsystem is a good example of isolated-redundancy design.
All nodes are connected by two networks: an Ethernet network and a
high-performance switch. Each node uses two separate
interface cards to connect to these networks. There are two communication
protocols: a standard IP and a user-space (US) protocol; each can run
on either network. If either network or protocol fails, the other network or
protocol can take over.
2.7
N-Version Programming to Enhance Software Reliability
A common isolated-redundancy approach to
constructing a mission-critical software system is called N-version programming. The software is implemented by N isolated teams who may not even know the others exist. Different teams are asked to
implement the software using different algorithms, programming languages,
environment tools, and even platforms. In a fault-tolerant system, the N versions all run simultaneously and their results are constantly
compared. If the results differ, the system is notified that a fault has occurred.
But because of isolated redundancy, it is extremely unli-kely that the fault
will cause a majority of the N versions to fail at the same time. So the
system continues working, with the correct result generated by majority voting.
In a highly available but less mission-critical system, only one version needs
to run at a time. Each version has a built-in self-test capability. When one
version fails, another version can take over.
3. Fault-Tolerant Cluster
Configurations
The cluster solution was
targeted to provide availability support for two server nodes with three
ascending levels of availability: hot standby, active takeover, and fault-tolerant. In this section, we will consider the recovery time, failback feature, and node activeness. The level of availability increases from
standby to active and fault-tolerant cluster configurations. The shorter is the
recovery time, the higher is the cluster availability. Failback refers to the ability of a recovered node
return-ing to normal operation after repair or maintenance. Activeness refers to whether the node is used in active
work during normal operation.
• Hot standby server clusters
In a hot standby cluster, only the primary
node is actively doing all the useful work normally. The standby node is
powered on (hot) and running some monitoring programs to communicate heartbeat
signals to check the status of the primary node, but is not actively running
other useful workloads. The primary node must mirror any data to shared disk
storage, which is accessible by the standby node. The standby node requires a
second copy of data.
• Active-takeover clusters In
this case, the architecture is symmetric among multiple server nodes. Both
servers are primary, doing useful work normally. Both failover and failback are
often supported on both server nodes. When a node fails, the user applications
fail over to the available node in the cluster. Depending on the time required
to implement the failover, users may experience some delays or may lose some
data that was not saved in the last checkpoint.
•
Failover cluster This is probably the most important feature
demanded in current clusters for commercial applications. When a component
fails, this technique allows the remaining system to take over the services
originally provided by the failed component. A failover mechanism must provide
several functions, such as failure
diagnosis, failure
notification, and failure
recovery. Failure diagnosis refers to the detection of a failure and the
location of the failed component that caused the failure. A commonly used
technique is heartbeat, whereby the cluster nodes
send out a stream of heartbeat messages to one another. If the system does not
receive the stream of heartbeat messages from a node, it can conclude that
either the node or the network connection has failed.
Example 2.8 Failure Diagnosis and Recovery in a
Dual-Network Cluster
A cluster uses two networks to connect its
nodes. One node is designated as the master node. Each node has a heartbeat
daemon that periodically (every 10 seconds) sends a heartbeat message to the
master node
through both networks. The master node will detect a failure if it does not
receive messages for a beat (10 seconds) from a node and will make the
following diagnoses:
• A node’s
connection to one of the two networks failed if the master receives a heartbeat
from the node through one network but not the other.
• The node
failed if the master does not receive a heartbeat through either network. It is
assumed that the chance of both networks failing at the same time is
negligible.
The failure diagnosis in this
example is simple, but it has several pitfalls. What if the master node fails?
Is the 10-second heartbeat period too long or too short? What if the heartbeat
messages are dropped by the net-work (e.g., due to network congestion)? Can
this scheme accommodate hundreds of nodes? Practical HA sys-tems must address
these issues. A popular trick is to use the heartbeat messages to carry load
information so that when the master receives the heartbeat from a node, it
knows not only that the node is alive, but also the resource utilization status
of the node. Such load information is useful for load balancing and job
management.
Once a failure is diagnosed, the system
notifies the components that need to know the failure event. Failure
notification is needed because the master node is not the only one that needs
to have this informa-tion. For instance, in case of the failure of a node, the
DNS needs to be told so that it will not connect more users to that node. The
resource manager needs to reassign the workload and to take over the remaining
workload on that node. The system administrator needs to be alerted so that she
can initiate proper actions to repair the node.
3.1
Recovery Schemes
Failure recovery refers to
the actions needed to take over the workload of a failed component. There are
two types of recovery techniques. In backward recovery, the processes running on a cluster
per-iodically save a consistent state (called a checkpoint) to a stable storage. After a failure, the
system is reconfigured to isolate the failed component, restores the previous
checkpoint, and resumes nor-mal operation. This is called rollback.
Backward recovery is relatively easy to
implement in an application-independent, portable fashion, and has been widely
used. However, rollback implies wasted execution. If execution time is crucial,
such as in real-time systems where the rollback time cannot be tolerated, a forward
recovery scheme should be used. With such a scheme, the system is not
rolled back to the previous checkpoint upon a failure. Instead, the system
utilizes the failure diagnosis information to reconstruct a valid system state
and continues execution. Forward recovery is application-dependent and may need
extra hardware.
Example 2.9 MTTF, MTTR, and Failure Cost
Analysis
Consider
a cluster that has little availability support. Upon a node failure, the
following sequence of events takes place:
1.
The entire system is shut down and powered off.
2.
The faulty node is replaced if the failure is
in hardware.
3.
The system is powered on and rebooted.
4.
The user application is reloaded and rerun from
the start.
Assume
one of the cluster nodes fails every 100 hours. Other parts of the cluster
never fail. Steps 1 through 3 take two hours. On average, the mean time for
step 4 is two hours. What is the availability of the cluster? What is the
yearly failure cost if each one-hour downtime costs $82,500?
Solution: The cluster’s MTTF is 100 hours; the
MTTR is 2 + 2 = 4 hours. According to Table 2.5, the availability is 100/104 =
96.15 percent. This corresponds to 337 hours of downtime in a year, and the
failure cost is $82500 × 337, that is, more than $27 million.
Example 2.10 Availability and Cost Analysis of
a Cluster of Computers
Repeat
Example 2.9, but assume that the cluster now has much increased availability
support. Upon a node failure, its workload automatically fails over to other
nodes. The failover time is only six minutes. Meanwhile, the cluster has hot
swap capability: The faulty node is taken off the cluster, repaired, replugged,
and rebooted, and it rejoins the cluster, all without impacting the rest of the
cluster. What is the availability of this ideal cluster, and what is the yearly
failure cost?
Solution: The cluster’s MTTF is still 100
hours, but the MTTR is reduced to 0.1 hours, as the cluster is available while
the failed node is being repaired. From to Table 2.5, the availability is
100/100.5 = 99.9 percent. This corresponds to 8.75 hours of downtime per year,
and the failure cost is $82,500, a 27M/722K = 38 times reduction in failure
cost from the design in Example 3.8.
4. Checkpointing and Recovery
Techniques
Checkpointing and recovery
are two techniques that must be developed hand in hand to enhance the
availability of a cluster system. We will start with the basic concept of
checkpointing. This is the process of periodically saving the state of an
executing program to stable storage, from which the system can recover after a
failure. Each program state saved is called a checkpoint. The disk file that contains the saved state
is called the checkpoint
file.
Although all current checkpointing soft-ware saves program states in a disk,
research is underway to use node memories in place of stable storage in order
to improve performance.
Checkpointing
techniques are useful not only for availability, but also for program
debugging, process migration, and load balancing. Many job management systems
and some operating systems support checkpointing to a certain degree. The Web
Resource contains pointers to numerous check-point-related web sites, including
some public domain software such as Condor and Libckpt. Here we will present
the important issues for the designer and the user of checkpoint software. We
will first consider the issues that are common to both sequential and parallel
programs, and then we will discuss the issues pertaining to parallel programs.
4.1
Kernel, Library, and Application Levels
Checkpointing can be realized
by the operating system at the kernel
level, where
the OS transpar-ently checkpoints and restarts processes. This is ideal for
users. However, checkpointing is not sup-ported in most operating systems,
especially for parallel programs. A less transparent approach links the user
code with a checkpointing library in the user space. Checkpointing and
restarting are handled by this runtime support. This approach is used widely
because it has the advantage that user applications do not have to be modified.
A main
problem is that most current checkpointing libraries are static, meaning the
application source code (or at least the object code) must be available. It
does not work if the application is in the form of executable code. A third
approach requires the user (or the compiler) to insert check-pointing functions
in the application; thus, the application has to be modified, and the
transparency is lost. However, it has the advantage that the user can specify
where to checkpoint. This is helpful to reduce checkpointing overhead.
Checkpointing incurs both time and storage overheads.
4.2
Checkpoint Overheads
During a program’s execution, its states may be saved many
times. This is denoted by the time consumed to save one checkpoint. The
storage overhead is the extra memory and disk space required for checkpointing.
Both time and storage overheads depend on the size of the checkpoint file. The
overheads can be substantial, especially for applications that require a large
memory space. A number of techniques have been suggested to reduce these
overheads.
4.3
Choosing an Optimal Checkpoint Interval
The time period between two checkpoints is
called the checkpoint
interval. Making the interval larger can reduce checkpoint time overhead.
However, this implies a longer computation time after a failure. Wong and
Franklin [28] derived an expression for optimal checkpoint interval as
illu-strated in Figure 2.20.
Here, MTTF
is the system’s mean time to failure. This MTTF accounts the time
consumed to save one checkpoint, and h is the average percentage of normal
computation performed in a check-point interval before the system fails. The
parameter h is always in the range.
After a system is restored, it needs to spend h × (checkpoint interval) time to recompute.
4.4
Incremental Checkpoint
Instead of saving the full state at each
checkpoint, an incremental
checkpoint scheme saves only the portion of the state that is changed from
the previous checkpoint. However, care must be taken regarding old checkpoint files. In full-state checkpointing,
only one checkpoint file needs to be kept on disk. Subsequent checkpoints
simply overwrite this file. With incremental checkpointing, old files needed to
be kept, because a state may span many files. Thus, the total storage
requirement is larger.
4.5
Forked Checkpointing
Most checkpoint schemes are
blocking in that the normal computation is stopped while checkpoint-ing is in
progress. With enough memory, checkpoint overhead can be reduced by making a
copy of the program state in memory and invoking another asynchronous thread to
perform the checkpointing concurrently. A simple way to overlap checkpointing
with computation is to use the UNIX fork( ) system call. The forked child process
duplicates the parent process’s address space and checkpoints it. Meanwhile, the parent process
continues execution. Overlapping is achieved since checkpointing is disk-I/O intensive.
A further optimization is to use the copy-on-write mechanism.
4.6
User-Directed Checkpointing
The checkpoint overheads can
sometimes be substantially reduced if the user inserts code (e.g., library or
system calls) to tell the system when to save, what to save, and what not to
save. What should be the exact contents of a checkpoint? It should contain just
enough information to allow a system to recover. The state of a process
includes its data state and control state. For a UNIX process, these states are
stored in its address space, including the text (code), the data, the stack
segments, and the process descriptor. Saving and restoring the full state is
expensive and sometimes impossible.
For instance, the process ID and the parent
process ID are not restorable, nor do they need to be saved in many
applications. Most checkpointing systems save a partial state. For instance,
the code seg-ment is usually not saved, as it does not change in most
applications. What kinds of applications can be checkpointed? Current
checkpoint schemes require programs to be well
behaved, the exact meaning of which differs in
different schemes. At a minimum, a well-behaved program should not need the
exact contents of state information that is not restorable, such as the numeric
value of a process ID.
4.7
Checkpointing Parallel Programs
We now turn to checkpointing parallel programs.
The state of a parallel program is usually much larger than that of a
sequential program, as it consists of the set of the states of individual
processes, plus the state of the communication network. Parallelism also
introduces various timing and consistency problems.
Example 2.11 Checkpointing a Parallel Program
Figure 2.21 illustrates checkpointing of a
three-process parallel program. The arrows labeled x, y, and z represent
point-to-point communication among the processes. The three thick lines labeled
a, b, and c represent three global snapshots (or simply snapshots), where a
global snapshot is a set of checkpoints
(represented by dots), one from every process.
In addition, some communication states may need to be saved. The intersection
of a snapshot line with a process’s time line indicates where the process
should take a (local) checkpoint. Thus, the program’s snapshot c consists of
three local checkpoints: s, t, u for processes P, Q, and R, respectively, plus
saving the communication y.
4.8
Consistent Snapshot
A global snapshot is called consistent if there is no message that is received by the
checkpoint of one process, but not yet sent by another process. Graphically,
this corresponds to the case that no arrow crosses a snapshot line from right
to left. Thus, snapshot a is consistent, because arrow
x is from left to right. But
snapshot c is inconsistent, as y goes from right to left. To be consistent,
there should not be any zigzag
path between
two checkpoints [20]. For instance, checkpoints u and s can-not belong to a consistent global
snapshot. A stronger consistency requires that no arrows cross the snapshot.
Thus, only snapshot b is consistent in Figure
2.23.
4.9
Coordinated versus Independent Checkpointing
Checkpointing schemes for parallel programs can
be classified into two types. In coordinated
check-pointing (also
called consistent checkpointing), the parallel program is frozen, and all
processes are checkpointed at the same time. In independent
checkpointing, the processes are
checkpointed inde-pendent of one another. These two types can be combined in
various ways. Coordinated check-pointing is difficult to implement and tends to
incur a large overhead. Independent checkpointing has a small overhead and can
utilize existing checkpointing schemes for sequential programs.
Related Topics
Privacy Policy, Terms and Conditions, DMCA Policy and Compliant
Copyright © 2018-2024 BrainKart.com; All Rights Reserved. Developed by Therithal info, Chennai.