The key to successful
recovery is adequate preparation. Seldom does a crisis destroy irreplaceable
equipment; most computing systems, from personal computers to mainframes, are standard,
off-the-shelf systems that can be easily replaced. Data and locally developed
programs are more vulnerable because they cannot be quickly substituted from
another source. Let us look more closely at what to do after a crisis occurs.
In many computing systems,
some data items change frequently, whereas others seldom change. For example, a
database of bank account balances changes daily, but a file of depositors'
names and addresses changes much less often. The number of changes in a
given period of time also differs between these two files. These variations in
the number and extent of changes determine how much data is needed to
reconstruct the files in the event of a loss.
A backup is a copy of all or a part of a file to assist in
reestablishing a lost file. In professional computing systems, periodic backups
are usually performed automatically, often at night when system usage is low.
Everything on the system is copied, including system files, user files, scratch
files, and directories, so that the system can be regenerated after a crisis.
This type of backup is called a complete
backup. Complete backups are done at regular intervals, usually weekly or
daily, depending on the criticality of the information or service provided by
the system.
Major installations may
perform revolving backups, in which
the last several backups are kept. Each time a backup is done, the oldest
backup is replaced with the newest one. There are two reasons to perform
revolving backups: to avoid problems with corrupted media (so that all is not
lost if one of the disks is bad) and to allow users or developers to retrieve
old versions of a file.
Another form of backup is a selective backup, in which only files that have
been changed (or created) since the last backup are saved. In this case, fewer
files must be saved, so the backup can be done more
quickly. A selective backup combined with an earlier complete backup gives the
effect of a complete backup in the time needed for only a selective backup. The
selective backup is subject to the configuration management techniques
described in Chapter 3.
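The revolving and selective schemes described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: files are modeled as dictionary entries rather than real disk files, and the names `revolving_store`, `select_changed`, and `restore`, along with the sample data, are hypothetical and do not come from any particular backup product.

```python
from collections import deque


def revolving_store(max_kept):
    """Hold only the newest max_kept backups; the oldest is dropped automatically."""
    return deque(maxlen=max_kept)


def select_changed(mtimes, last_backup_time):
    """Names of files created or modified after the last backup (selective backup)."""
    return sorted(name for name, t in mtimes.items() if t > last_backup_time)


def restore(complete, selective):
    """Overlay a selective backup on an earlier complete backup."""
    restored = dict(complete)   # start from the complete backup
    restored.update(selective)  # newer copies replace older ones
    return restored


# Hypothetical data: file name -> contents, and file name -> modification time.
complete = {"accounts.db": "v1", "names.txt": "v1"}
mtimes = {"accounts.db": 205, "names.txt": 90, "new.txt": 210}
current = {"accounts.db": "v2", "names.txt": "v1", "new.txt": "v1"}

# Selective backup: save only what changed since the complete backup at time 200.
changed = select_changed(mtimes, last_backup_time=200)
selective = {name: current[name] for name in changed}

# Complete backup plus selective backup reproduces the current state.
assert restore(complete, selective) == current

# Revolving backups: keep the last three; a fourth backup displaces the oldest.
store = revolving_store(max_kept=3)
for label in ["mon", "tue", "wed", "thu"]:
    store.append(label)
print(list(store))
```

The sketch ignores files deleted since the complete backup; a real scheme would also have to record deletions so that a restore does not bring removed files back.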
For each type of backup, we
need the means to move from the backup forward to the point of failure. That
is, we need a way to restore the system in the event of failure. In critical
transaction systems, we address this need by keeping a complete record of
changes since the last backup. Sometimes, the system state is captured by a
combination of computer- and paper-based recording media. For example, if a
system handles bank teller operations, the individual tellers duplicate their
processing on paper records (the deposit and withdrawal slips that accompany your
bank transactions). If the system fails, the staff restores the latest backup version
and reapplies all changes from the collected paper copies. Or the banking
system creates a paper journal, which is a log of transactions printed just as
each transaction completes.
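The recovery step in the bank example, restoring the latest backup and then reapplying every recorded change, can be sketched as follows. The function name `recover`, the tuple layout of a journal entry, and the account data are assumptions made for illustration; an actual banking system would record far more detail for each transaction.

```python
def recover(backup_balances, journal):
    """Rebuild current balances from the last backup plus every logged change since."""
    balances = dict(backup_balances)  # work on a copy; never modify the backup itself
    for account, kind, amount in journal:  # entries must be in transaction order
        if kind == "deposit":
            balances[account] = balances.get(account, 0) + amount
        elif kind == "withdrawal":
            balances[account] = balances.get(account, 0) - amount
        else:
            raise ValueError(f"unknown transaction type: {kind}")
    return balances


# Hypothetical state at the last backup and the journal kept since then.
backup = {"alice": 100, "bob": 50}
journal = [
    ("alice", "deposit", 25),
    ("bob", "withdrawal", 20),
    ("carol", "deposit", 10),  # account opened after the backup was taken
]

recovered = recover(backup, journal)
print(recovered)
```

The approach works only if the journal is complete: any transaction missing from the log is silently lost, which is why the text stresses keeping a complete record of changes since the last backup.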
Personal computer users often
do not appreciate the need for regular backups. Even minor crises, such as a
failed piece of hardware, can seriously affect personal computer users. With a
backup, users can simply change to a similar machine and continue work.
A backup copy is useless if
it is destroyed in the crisis, too. Many major computing installations rent
warehouse space some distance from the computing system, far enough away that a
crisis is not likely to affect the offsite location at the same time. As a
backup is completed, it is transported to the backup site. Keeping a backup
version separate from the actual system reduces the risk of its loss.
Similarly, the paper trail is also stored somewhere other than at the main
computing facility.
Personal computer users
concerned with integrity can take home a copy of important disks as protection
or send a copy to a friend in another city. If both secrecy and integrity are
important, a bank vault or even a secure storage place in another part of the
same building can be used. The worst place to store a backup copy is where it
usually is stored: right next to the machine.
With today's extensive use of
networking, using the network to implement backups is a good idea. Storage
providers sell space in which you can store data; think of these services as
big network-attached disk drives. You rent space just as you would consume
electricity: You pay for what you use. The storage provider needs to provide
only enough total space to cover everyone's needs, and it is easy to monitor
usage patterns and increase capacity as combined needs rise.
Networked storage is perfect
for backups of critical data because you can choose a storage provider whose
physical storage is not close to your processing. In this way, physical harm to
your system will not affect your backup. You do not need to manage tapes or
other media and physically transport them offsite.
Depending on the nature of
the computation, it may be important to be able to recover from a crisis and
resume computation quickly. A bank, for example, might be able to tolerate a
four-hour loss of computing facilities during a fire, but it could not tolerate
a ten-month period to rebuild a destroyed facility, acquire new equipment, and
resume operation.
Most computer manufacturers
have several spare machines of most models that can be delivered to any
location within 24 hours in the event of a real crisis. Sometimes the machine
will come straight from assembly; other times the system will have been in use
at a local office. Machinery is seldom the hard part of the problem. Rather,
the hard part is deciding where to put the equipment in order to begin a
temporary operation.
A cold site or shell is a facility with power and cooling available,
in which a computing system can be installed to begin immediate operation. Some
companies maintain their own cold sites, and other cold sites can be leased
from disaster recovery companies. These sites usually come with cabling, fire
prevention equipment, separate office space, telephone access, and other
features. Typically, a computing center can have equipment installed and resume
operation from a cold site within a week of a disaster.
If the application is more
critical or if the equipment needs are more specialized, a hot site may be more appropriate. A hot site is a computer facility
with an installed and ready-to-run computing system. The system has
peripherals, telecommunications lines, power supply, and even personnel ready
to operate on short notice. Some companies maintain their own; other companies
subscribe to a service that has available one or more locations with installed
and running computers. To activate a hot site, it is necessary only to load
software and data from offsite backup copies.
Numerous services offer hot
sites equipped with every popular brand and model of system. They provide
diagnostic and system technicians, connected communications lines, and an
operations staff. The hot site staff also assists with relocation by arranging
transportation and housing, obtaining needed blank forms, and acquiring office
space.
Because these hot sites serve
as backups for many customers, most of whom will not need the service, the
annual cost to any one customer is fairly low. The cost structure is like
insurance: The likelihood of an auto accident is low, so the premium is
reasonable, even for a policy that covers the complete replacement cost of an
expensive car. Notice, however, that the first step in being able to use a
service of this type is a complete and timely backup.