The on-call IT tech is jolted awake from a terrible dream, his heart
pounding. Lightning crashes overhead as he glances at the clock--2:59
a.m. The server isn't down; it was just a dream.
3:00 a.m.: While the tech is still awake, the IT on-call pager goes off.
This could mean any number of things: a fire, a break-in, a failed air
conditioner in the server room, or even a main business server crash.
3:25 a.m.: The on-call IT tech arrives at the site and evaluates the
situation. There is no fire, no evidence of a break-in, and the server
room temperature reads a cool 18 degrees Celsius. A quick check of the
servers shows that most of them are at a login screen. After checking
two or three machines, it is obvious that the room lost power at some
point. The UPS units verify a failure; all three massive battery units
are showing failures and heavy load percentages.
3:40 a.m.: The on-call IT tech calls the lead technician and department
manager and informs them of the situation; both are on their way to the
site. They leave instructions to check the main business application
servers; one of them holds the company's customer database, payroll, and
accounting system, and the other is the company's messaging server.
3:55 a.m.: The on-call IT tech discovers that the RAID array for the
business database server is not coming back online. The messaging server
has rebooted but the messaging application is returning errors when it
starts up. The tech realizes that the messaging server was performing
incremental backups during the time of the outage.
4:00 a.m.: The lead tech and manager arrive. Assessments of the other
servers are made. The lead tech begins working with the messaging
server. The on-call tech works with the failed RAID array. The firmware
shows the array has failed; the controller only recognizes three of the
10 drives. After a complete power down and restart of the server and
drive enclosure, the firmware shows the drives are back online; however,
the array is shown as "Failed."
4:30 a.m.: The on-call technician calls the RAID array manufacturer's
technical support. The choices in the firmware menu are vague and the IT
tech wants to know if forcing the drives online will get their array
back. The manufacturer's technical support says that the array will come
back; however, there is a slight possibility that the data on the volume
may be corrupted. The manufacturer's technical support asks how recent
their latest backup is. The IT tech responds that the backup is one week
old, which is unacceptable; they cannot lose a week of transactions.
The IT tech hesitates in deciding what to do next...
Disaster waiting to happen
Business system disasters like this happen every day. Despite the
redundancy in backup systems or storage array systems, failures occur.
Some failures can be hardware-related, others can be due to software,
and still others are the result of human error or natural disaster.
As more and more businesses rely on their corporate server structure and
document storage volumes, it is critical to have a comprehensive
disaster plan in case the unexpected occurs.
The scenario above is only one of many ways a server can lose data. A
few examples of the different causes give a good idea of the challenges
IT departments constantly face.
Partition/volume/file system corruption disasters: When a company tried
to resize its partition/volume settings, a utility program caused severe
damage to the partition, making a great deal of data inaccessible. The
company tried to recover the missing documents using third-party
recovery software, but was unsuccessful. As a last resort, it
reinstalled the operating system, but the new installation couldn't find
the second partition/volume and the entire system failed.
Specific file error disasters: On a company's Windows 2000 server, the
volume repair tool damaged the file system, rendering the target
directories unavailable. Complete access to the original files was
critical, so restoring the one-month-old backups was not a viable
option.
Hardware-related disasters: On a NetWare volume server, a failing hard
drive made the volume inaccessible. Although the errors on the drive
were not in the data area and the drive was still functional, NetWare
would not mount the volume.
Software-related disasters: A company was doing a partial drive copy
overwrite using third-party tools. The overwrite started with no
problems, but then crashed 1 percent into the process. This caused file
system corruption and made the data inaccessible.
User error disasters: A user's machine had the operating system
reinstalled from a restore CD. Unfortunately, this overwrote the file
system completely, and the user couldn't find the PST file that held
many important messages and attachments needed for the business.
Thankfully, data recovery can assist in every one of the situations
described above. From legacy systems and post-mainframe storage devices
to the latest high-end SANs, data recovery can be the solution a company
needs to get back to business as quickly as possible. Traditionally,
disassembling the server and sending in the drives for repair was the
only recovery option available. This method can get back the most recent
data, but might not be quick enough if data is needed immediately due to
the time needed for shipping the drives. New technology, however, is
making it possible for data recovery to happen faster than ever.
Data recovery performed remotely over a modem or Internet connection is
available 24/7 from anywhere in the world, and recovers data in as
little as one hour. If the hardware is functioning properly, engineers
can perform lab-quality recovery service through a secured connection
using a proprietary communication protocol, encrypted packets and safe
facilities. The recovered data can be restored to the system or copied
to a new destination and is accessible upon completion. Remote data
recovery can even work on RAID systems where one drive has physically
failed. Not every data recovery provider can offer this method of
recovery, so it's important to inquire specifically about the service.
Recovery tactics
Data disasters will happen; accepting that reality is the first step in
preparing a comprehensive disaster plan. Time is always against an IT
team when a disaster strikes, so the details of a disaster plan are
critical for success. Establishing a relationship with a data
recovery company is the most important factor toward maintaining
business continuity--but following a few simple steps can make server
recoveries much easier.
-- Use a volume defragmenter regularly--a defragmenter moves the pieces
of each file or folder to one location on the volume, so that each
occupies a single, contiguous space on the disk drive. This helps
improve the quality of recovery, making files and folders easier for
data recovery specialists to locate. Do not run defragmenter utilities
on suspected bad drives--if drives are bad, this could have damaging
effects.
-- Perform a valid backup before making hardware or software changes.
-- If a drive is making unusual mechanical noises, turn it off
immediately and get assistance from your data recovery company.
-- Before removing drives, label the drives with their original position
in a RAID array.
-- Never restore data to the server that has lost the data--always
restore to a separate server or alternate location.
-- In Microsoft Exchange or SQL failures, never try to repair the
original Information Store or database files--make a copy and perform
recovery operations on the copy.
-- When replacing drives on RAID systems, never replace a failed drive
with a drive that was part of a previous RAID system--always zero out
the replacement drive before using.
-- In a power loss situation with a RAID array, if the file system looks
suspicious, is unmountable or the data is inaccessible after power is
restored, do not run volume repair utilities. The same applies to any
drive suspected of being bad.
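As an illustration of the "work on a copy" advice above, the short
Python sketch below copies a database file to a separate working
location and confirms the copy matches the original before anyone
touches it. It is only a sketch under stated assumptions: the function
names, the working directory and any file name passed in are
hypothetical, and it assumes the source file can still be read and that
the working directory sits on a different volume or server from the one
that failed. It does not replace vendor tools or a data recovery
provider.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_working_copy(original: Path, work_dir: Path) -> Path:
    """Copy `original` into `work_dir` and check the copy by comparing digests.

    The original file is only ever opened for reading. `work_dir` is a
    hypothetical location that should live on a separate volume or server.
    """
    work_dir.mkdir(parents=True, exist_ok=True)
    copy_path = work_dir / original.name
    if copy_path.exists():
        # Never overwrite an earlier copy (or the original itself).
        raise FileExistsError(f"refusing to overwrite {copy_path}")
    shutil.copy2(original, copy_path)
    if sha256_of(original) != sha256_of(copy_path):
        copy_path.unlink()
        raise IOError("copy does not match original; do not proceed")
    return copy_path
```

Repair or recovery tools would then be pointed at the returned path,
leaving the original file untouched in case a second attempt, or a
professional recovery, is needed.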
The fictional, true-to-life IT scenario at the beginning of this article
illustrates the types of situations and decisions that IT staff must
make. Businesses without access to their data run the risk of losing
millions in revenue every day. The fact is, today's systems are relied
on more than ever for consistent and available data.
The speed and quality of recovery are extremely important--especially on
large servers. The best data recovery companies offer unique services
that can provide the fastest method for solving server recovery
nightmares.
Jim Reinert serves as director of software and services for Minneapolis-based Kroll Ontrack.