Data Loss Prevention  


Leading Causes of Data Loss:

Natural Disasters 3%
Viruses 7%
Human Errors 32%
Software Malfunction 14%
Hardware & System Malfunction 44%

Computers are relied upon more now than ever, or, more to the point, the data they contain is. In nearly every instance the system itself can be easily repaired or replaced, but the data, once lost, may not be recreatable. That's why the Data Recovery Clinic stresses the importance of regular system back ups and the implementation of some preventative measures.

The chart above lists the most common reasons that data recovery is needed. In all cases there are steps that you, the user, can take to minimize your risk of data loss.

1. Natural disasters

While the least likely cause of data loss, a natural disaster can have a devastating effect on the physical drive. However, Data Recovery Clinic has rescued data from fires, floods, lightning strikes and the subsequent power surges.

In instances of severe housing damage, such as scored platters from fire, water emulsion due to flood, or broken or crushed platters, the drive may become unrecoverable.

The best way to prevent data loss from a natural disaster is an off site back up. Since it is nearly impossible to predict the arrival of such an event, there should be more than one copy of the system back up kept, one onsite and one off. The type of media you back up to will depend on your system, software, and how frequently you need to back up. Could you carry on after losing a day's data? A week's? A month's? Also be sure to check your back ups to be certain that they have properly backed up. There's nothing worse than attempting to restore data from a blank medium.

2. Viruses

Viral infection increases at a rate of nearly 200-300 new trojans, exploits and viruses every month. There are approximately 56,712 "wild" or risk posing viruses and about 105,000 total known viruses, some of which are considered non-threatening. With those numbers growing every day, you are at an ever-increasing risk of becoming infected with a virus.

There are several ways to protect yourself against a viral threat:

a. Install a firewall on your system to prevent hackers from accessing your data.

b. Install an anti-virus program on your system and use it regularly to scan for infection. Many viruses will lie dormant or perform many minor alterations that can cumulatively disrupt how your system works. Be sure to check for updates on a regular basis.

c. Back up and be sure to test your back ups for infection as well. There is no use in removing the virus only to restore it again from your back up.

d. Be wary of any email containing an attachment. If you don't know where it came from or what it is, then don't open it.

e. If you have contracted a "wild" virus that there is no known cure for, quarantine it to that system and contact the Data Recovery Clinic for further information and assistance.

3. Human Errors

Even in today's era of highly trained, certified, and computer literate staffing there is always room for the timelessness of accidents. Sometimes referred to as the U.S.E.R virus, human mistakes are made daily all over the world. There is not much we can do as users to prevent the intervention of Murphy's Law, except to be cautious. Here are a few things you might want to try:

a. Be aware. It sounds simple enough to say, but it is not so easy to do. When transferring data, be sure it is going to the destination you had in mind. If asked "Would you like to replace the existing file?" make sure you do before clicking "yes".

b. If you are even a little bit uncertain about a task you are about to carry out, make sure there is a copy of the data to restore from.

c. Take extra care when using any software that may manipulate your drive's data storage, such as partition mergers, format changes, or even disk checkers.

d. Before upgrading to a new Operating System, back up your most important files or directories in case there is a problem during the install. Keep in mind that if you have a slaved data drive, it may become formatted as well.

e. Never shut the system down while programs are running. The open files will more than likely become truncated and nonfunctional.

4. Software Malfunction

Software malfunction is a necessary evil when using a computer. Even the world's top programmers cannot anticipate every error that may occur in any given program. There are still a few things you can do to lessen the risk:

a. Be sure you are using the software ONLY for its intended purpose. Misusing a program may cause it to malfunction.

b. Using pirated copies of a program may cause the software to malfunction, resulting in a corruption of your data files.

c. Be sure that you have the proper amount of memory installed if you plan to run multiple programs simultaneously. If a program shuts down or freezes up you may lose or corrupt what you were working on.

d. Back up, back up, back up. A tedious task, but you will be glad you did if the software corrupts your customer database.

5. Hardware Malfunction

The most common cause of data loss, hardware malfunction or hard drive failure, is another necessary evil inherent to computing. There is usually little to no warning that your drive will fail, but some steps can be taken to minimize the need for data recovery from a hard drive failure:

a. Do not stack drives on top of each other; leave space for ventilation. An overheated drive is likely to fail. Be sure to keep the computer away from heat sources and make sure it is well ventilated.

b. Purchase a UPS (Uninterruptible Power Supply) to lessen malfunction caused by power surges.

c. NEVER open the casing on a hard drive. Even the smallest grain of dust settling on the platters in the interior of the drive can cause it to fail.

If you need hard drive recovery do one of the following:

Fill out an online data recovery quote form - a representative will get back to you within an hour of submittal.

Call 727-642-5521 (our toll-free number is at the top of every page) to speak with a representative and receive your quote over the phone. We answer our phones 24 hours a day, 7 days a week.

Fill out a data recovery request form and ship us your drive. Please follow any instructions on how to package and ship a hard drive.


Exchange Server Data Recovery

Introduction
The capacity planner's role is critical for efficient backup and recovery for any Datacenter. This white paper is intended to provide a capacity planner with detailed information and guidelines for performing effective capacity planning. This paper includes two main sections:

Overview of Backup Technology
Capacity Planning

For those who are not highly familiar with current backup technology, the first section, Overview of Backup Technology, provides a useful foundation for the Capacity Planning section.

Overview of Backup Technology
The need to reliably back up and retrieve data has reached a new level of importance as companies recognize the value of saving and accessing large volumes of data. Today's corporate databases and on-line applications routinely manipulate hundreds of gigabytes (GB) of data, and databases one terabyte (TB) and larger are becoming increasingly common. The amount of corporate data collected electronically is growing dramatically each year, and companies are finding value in saving un-sifted data, for example, to glean information about market trends that can make or break their future success.

This reliance on full-time availability of data means the time available to back up data is shrinking, while the demands for 100% availability of important data and for frequent backups are growing. These trends are placing enormous pressure on Information Technology organizations to increase the speed of backups while reducing the degree to which they intrude on day-to-day operations. Equally important is the need to recover files quickly and efficiently. Thus scheduled backups and rapid recoveries are activities that must be predictable, stable, reliable, and fast.

Basics of Backup and Recovery Technology
Current backup technology allows most of the backup process to be automated, with the exception of initial configuration and subsequent adjustment as storage requirements expand.

Physical and Logical Backups
There are two basic backup and recovery processes: physical and logical backups. Physical backups copy a byte-for-byte image of all of the database disk storage to a backup device. Logical backups copy all of the logical entities in the database to a backup device. Each process presents a different configuration problem. Physical backups are usually much faster than logical backups, because the source is read sequentially and the data can be retrieved at full device speed. The drawback is that the entire volume must be backed up as a single entity. Thus raw device backups are most useful when the entire device must be backed up. In contrast, a logical backup program reads the superblock to obtain the names of all the directories in the file system, and then reads logical entities such as directory entries one by one, almost always not in device order. While slower, a benefit of logical backups is their ability to inspect the last-modified date of each file and decide whether or not the file has been updated since the most recent backup.
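The last-modified check that gives logical backups their advantage can be sketched in a few lines. This is only an illustration, not how any particular backup product works: the function name files_changed_since and its arguments are hypothetical, and it assumes that file modification times are trustworthy.

```python
import os


def files_changed_since(root, last_backup_time):
    """Return paths under root modified after last_backup_time (epoch seconds).

    Hypothetical helper: walks the tree file by file, as a logical backup
    would, and keeps only files whose last-modified time is newer than the
    most recent backup.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.stat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime > last_backup_time:
                changed.append(path)
    return sorted(changed)
```

A physical backup has no equivalent of this per-file test, which is why it must copy the entire volume every time.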

Fully-Consistent Dumps
Two backup strategies can be implemented when fully-consistent dumps are required. One way to make the file system being dumped inaccessible to modifications is simply to unmount it before dumping it. The file system can then be remounted read-only if such access is required during the backup. Another option is to lock the file system against updates while the backup is being performed. Because these strategies prevent the file system from being modified during backup, they are nearly always used off-hours. This is usually not a problem unless user batch jobs are run overnight, as such jobs can be substantially degraded during the backup.

Full-Time Availability
Datacenters that require full-time availability of data can use software or hardware mirroring to replicate crucial data onto two or more separate disks. By itself, mirroring does not solve the real backup problem (nor do other protected storage mechanisms, such as RAID-5), because mirrored data is also susceptible to application bugs and operator or user error, and mirrored disks must also be backed up. When full-time availability is required, a number of options are available, for example, hot database backups, and the use of snapshot images--read-only copies of data for backups.

Database Backup Technology
There are three basic types of full database backups: on-line, off-line and raw device backups. On-line backups are logical backups of a database that can be simultaneously handling transactions. Off-line backups are logical backups of a database that is quiet and is not available for transactions. Raw device backups are physical backups of the raw disk devices.

On-line Backups
On-line backups are the least-intrusive strategy, and they are a popular solution for databases that must be available 24 hours per day. On-line database backups are facilitated with software such as Oracle Enterprise Backup Utility (EBU), which can provide a consistent snapshot of all database table spaces to backup utilities such as Sun StorEdge Enterprise NetBackup software. With several parallel streams of data provided by the database, Sun StorEdge Enterprise NetBackup software utilizes the backup drives to their maximum capacity, multiplexing multiple streams onto single devices where feasible.

Because transactions must be logged during the backup process, database performance may be degraded while on-line backups are performed. One way to back up a database that must sustain high transaction rates is to mirror the database and perform a physical backup of the mirror. This requires first altering the database to begin backup, which establishes a quiescent database image. Then the mirror is detached so that a static image of the database is maintained on the detached mirror. The database is then altered to end backup, which allows logged transactions to be rolled forward into the tablespaces while a raw device backup of the mirror is done. When the backup is complete, the mirror is re-attached and the mirroring mechanism synchronizes the two disk images once again.

Off-line Backups
For very large databases that can be taken out of use for short periods of time, off-line backups are often the choice. This approach uses a utility such as Oracle EBU to make the database unavailable for normal transactions. It synchronizes the state of all its tables and provides a consistent view of the database to Sun StorEdge Enterprise NetBackup software. Off-line backups typically outperform on-line backups because of the lack of contention for system resources, and the fact that they have no impact on transaction rates once the database is back in use again. Today, with high-performance backup solutions such as Sun StorEdge Enterprise NetBackup software, off-line backups are once again being considered viable solutions.

Raw Device Backups
Raw device backups are the simplest way to backup a database, as they directly copy the raw disk devices to tape. This requires the database to be in a quiescent state, and uses a utility such as Sun StorEdge Enterprise NetBackup software to manage the high-speed transfer of disk data onto tape. Raw device backups are fast because the database itself is not involved in the process, eliminating all but the essential overhead. They are also fast because the disk devices are read sequentially, providing data to Sun StorEdge Enterprise NetBackup software at high speeds.

Advances in Backup Technology
In the past, IT organizations have turned to mainframes as the solution to large database and high-speed backup needs. While UNIX® systems have typically delivered a 50-70 gigabyte per hour backup throughput, mainframes and their high-speed tape drives have managed throughput nearly six times faster. Several recent developments have turned the tables on this equation and have enabled sustained backup rates of more than one terabyte per hour on Sun servers--while at the same time decreasing the intrusiveness of backup operations.

Faster Throughput Rates
Tape drive technology has seen dramatic improvements in throughput rates. The familiar Sun 7 GB 8 mm tape drive provides native (uncompressed) throughput of 1 MB/second. The Sun StorEdge DLT tape 4000 tape drive improves on this rate by half, managing 1.5 MB/second, and the familiar IBM 3490E manages three times the throughput of the 7 GB drive with a rate of 3 MB/second. Newer drives that significantly change the character of database backup capabilities include the Sun 20 GB 8 mm AME tape drive with a transfer rate of 3 MB/second, the Sun StorEdge DLT tape 7000 tape drive that transfers 5 MB/second, and the Storage Technology RedWood SD-3 tape drive that, with 12 MB/second throughput, outperforms the IBM 3490E by a factor of four.
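To put these throughput figures in perspective, the following sketch converts them into elapsed backup time. It is a back-of-the-envelope aid only: the function names are hypothetical, the rates are the native figures quoted above, and it assumes binary units (1 GB = 1024 MB), no compression, and drives that stream continuously at their native rate.

```python
import math

# Native (uncompressed) throughput in MB/second, as quoted in the text.
NATIVE_MB_PER_SECOND = {
    "Sun 7 GB 8 mm": 1.0,
    "DLT tape 4000": 1.5,
    "IBM 3490E": 3.0,
    "Sun 20 GB 8 mm AME": 3.0,
    "DLT tape 7000": 5.0,
    "StorageTek RedWood SD-3": 12.0,
}

MB_PER_GB = 1024  # assumption: binary units


def hours_to_back_up(dataset_gb, drive, drives=1):
    """Ideal hours to write dataset_gb to `drives` identical drives in parallel."""
    rate = NATIVE_MB_PER_SECOND[drive] * drives
    return dataset_gb * MB_PER_GB / rate / 3600


def drives_for_rate(target_gb_per_hour, drive):
    """Number of drives needed to sustain a target aggregate rate."""
    target_mb_per_second = target_gb_per_hour * MB_PER_GB / 3600
    return math.ceil(target_mb_per_second / NATIVE_MB_PER_SECOND[drive])


if __name__ == "__main__":
    for drive in NATIVE_MB_PER_SECOND:
        print(f"{drive:26s} {hours_to_back_up(100, drive):6.1f} h for 100 GB")
```

Under these assumptions, sustaining one terabyte per hour at native speed would take on the order of two dozen of the fastest drives listed, which is why parallel streams and multiplexing matter as much as raw drive speed.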

Greater Capacities
Along with these improvements in speed have come improvements in capacity. Sun's StorEdge 20 GB 8 mm AME tape drive stores almost three times the native capacity of the 7 GB previous generation. The Sun StorEdge DLT tape 4000 and DLT tape 7000 tape drives have native capacities of 20 GB and 35 GB, respectively. With a capacity of up to 50 GB on a single data cartridge, the StorageTek SD-3 drive can hold up to 250 times more data than a standard 18-track cartridge, and 125 times more data than a 36-track cartridge. The result is smooth, high-performance backups because less tape handling is required.

Automated Backup and Recovery Management Procedures
Another important development that is changing the character of backups is the advent of management software that automates backup policies and optimally feeds data to tape devices--ensuring integrity and speeding the backup process. After all, raw tape speed and high-capacity drives are meaningless without the ability to effectively manage the transfer of data.

New Approaches to On-line Backups Using Database Technology
Recognizing the need for high-speed backups that require no down time, database vendors have developed approaches to on-line backups that enable specialized backup software such as Sun StorEdge Enterprise NetBackup software to transfer data from the database management system (DBMS) to backup devices using parallel streams of data. One example is Oracle's Enterprise Backup Utility (EBU). This utility is responsible for managing the creation of a consistent database snapshot and feeding parallel data streams to the Sun StorEdge Enterprise NetBackup software server for multiplexing onto tape devices. Whereas once this process required dumping database tables to separate ASCII files and then backing up the files, EBU now provides convenient interfaces that can be effectively utilized by third-party utilities.

All of these developments in database backup technology require processing power and I/O bandwidth in order to work in concert to speed the backup process. Sun's Ultra Enterprise servers provide scalable, symmetric multi-processing, scaling from one to 64 high-performance UltraSPARC processors, up to 64 GB of memory, and supporting up to 20 TB of disk storage. The advent of scalable I/O platforms such as these allows DBMSs to be configured with the optimal balance of processing power and I/O bandwidth--enabling on-line backups to proceed without impacting database performance.

Capacity Planning
Capacity planning is critical to the success of efficient backup and recovery for any Datacenter. Bad performance is usually the result of unrealistic expectations and poor planning. Realistic expectations and good planning must consider current and future needs. The plan must allow for the time and skill needed to configure the Datacenter, and for training personnel to operate it and fix problems as they arise.

Capacity planning is part science and part art. The capacity planner must account for numerous variables and virtually unlimited configuration permutations. Systems are often underconfigured and the wrong products are often selected for the job. Because installation and configuration are complex, there is much room for error. Furthermore, because there are always interrelated bottlenecks, a major aspect of capacity planning is choosing the preferred bottleneck.

The main role of the capacity planner is to choose hardware and software for efficient backup and recovery in the Datacenter. To do this, the planner must first determine the following:

Volume of data the Datacenter will be managing
Availability of that data
How the data will be spread out across the network
Policies for backing up the data

The capacity planner can use this information to derive the following types of requirements:

Backup servers
Network
Storage
Backup device

Finally, the planner can determine the configuration requirements.

Understanding the Enterprise
Perhaps the single most important factor the capacity planner needs to assess is the environment to be backed up. This section presents the information the planner needs to assess the environment.

Dataset Size
The planner's first step is to determine how much data there is to back up or archive on a regular basis. Two main factors the planner needs to determine are the total size of the data and the size of the dataset that changes.

Total Dataset Size
The total size of existing data is an indication of the following:

Minimum amount of storage capacity required
Amount of data to be backed up during a full backup
Predictor of total required capacity

Total data size is often one of the easiest pieces of information to obtain, and tends to be specified as part of the requirements. In addition to obtaining the total data size, the planner must know or estimate the following factors:

Number of separate files. The total volume of data may be composed of a few large files or millions of small files. Certain types of data (e.g., databases) may not reside in files at all, but be built on top of raw volumes. In filesystem backups, there is often a small fixed overhead per file: the file record needs to be added to the backup database, the directory information must be read, and the disk needs to perform a seek to the beginning of the file.

Knowing the number of files also helps the planner determine the size of the backup index database retained by the backup software. On average, Sun StorEdge Enterprise NetBackup software suggests planning for 150 bytes in the database per file revision retained on media. That works out to over seven million file records per gigabyte of index database.
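The index sizing rule above is simple arithmetic, sketched here for convenience. The 150-byte figure comes from the text; the function names and the example file counts are hypothetical, and a gigabyte is taken as 2^30 bytes, which is the reading under which the "over seven million" figure holds.

```python
BYTES_PER_RECORD = 150   # planning figure quoted in the text
BYTES_PER_GB = 2 ** 30   # assumption: binary gigabyte


def index_size_bytes(num_files, revisions_retained):
    """Estimated index database size: one record per file revision on media."""
    return num_files * revisions_retained * BYTES_PER_RECORD


def records_per_gb():
    """Whole file records that fit in one gigabyte of index database."""
    return BYTES_PER_GB // BYTES_PER_RECORD


if __name__ == "__main__":
    # Hypothetical enterprise: 2 million files, 10 revisions kept of each.
    size = index_size_bytes(2_000_000, 10)
    print(f"index size: {size / BYTES_PER_GB:.2f} GB")
    print(f"records per GB: {records_per_gb():,}")
```

Because the estimate scales with the number of retained revisions as well as the number of files, retention policy affects index size as strongly as file count does.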

Average file size. By knowing the above two pieces of information, the capacity planner can calculate the average file size in the enterprise. If there is a large skew in file size distribution (e.g., many small files and a couple of very large files that throw off the average), the average may not be a good predictor of behavior. Therefore the planner must plan for slightly different performance when backing up small files versus large files.

Average directory depth. The directory structure into which the files are organized may also have an effect on the performance of the backup system. This is partly because long directory paths result in multiple seeks to the disk. Longer paths also result in larger records, because each filepath backed up is recorded in the database as a variable size entry. Therefore, longer paths tend to make the backup index databases grow faster.

Size of the Dataset that Changes
The size of the dataset that changes determines the volume of data that needs to be saved during incremental backups. As the number of changed files or blocks increases, the volume of data that needs to be written to tape grows. The capacity planner must know or estimate the following factors:

The frequency of the dataset change. The frequency of the dataset change determines the frequency for performing backups. How often datasets change can vary widely. For example, some directories never change, some change only when something is upgraded, some change only at the end of the month, and some, like user mailboxes, typically change on a minute-by-minute basis. In addition, the frequency of dataset change, in part, determines the volume of data written during incremental backups, because incremental backups only save the files that have changed.

Amount of data to be backed up. The planner needs to decide whether to back up all the data or only the changed portions. While it is usually faster to save only the changed portions, it is also usually faster to restore whole directories and filesystems from full backups than from incremental ones. This is because of the restore process: restores from incremental backups need to first restore from the full backup, and then from all the incremental backups, until the latest versions of all files have been restored. This multi-step process often results in numerous tape mount requests and multiple retrieves of the same piece of data. The choice of performing full backups or incremental ones tends to be a matter of which case is most important: a regularly scheduled backup or an emergency after data has been lost on disk. While the former is done much more frequently, the latter tends to be a more time-critical situation.
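The multi-step restore described above can be made concrete with a small sketch. It is illustrative only: restore_chain and the (day, kind) tuples are hypothetical, and it assumes each incremental saves the changes made since the previous backup of any kind, so every incremental after the last full backup must be replayed in order.

```python
def restore_chain(backups, target_day):
    """Return the backups to restore, in order, to recover state as of target_day.

    backups is a list of (day, kind) tuples, where kind is "full" or
    "incremental". The chain is the most recent full backup at or before
    target_day, followed by every incremental taken after it.
    """
    candidates = sorted(b for b in backups if b[0] <= target_day)
    full_positions = [i for i, (_, kind) in enumerate(candidates) if kind == "full"]
    if not full_positions:
        raise ValueError("no full backup available at or before the target day")
    return candidates[full_positions[-1]:]


if __name__ == "__main__":
    history = [(1, "full"), (2, "incremental"), (3, "incremental"),
               (8, "full"), (9, "incremental")]
    for step in restore_chain(history, 3):
        print(step)
```

The length of the returned chain is a rough proxy for the number of tape mounts an emergency restore will need, which is the cost to weigh against the faster nightly incrementals.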

Data Type
The type of data to be backed up relates mostly to the level of compression that could be expected from the backup hardware or software. There is no guarantee that the types of data to be compressed will exhibit similar properties, so it is safest to assume the data will not be previously compressed, and to compress all the data to be backed up.

Database or higher-level application data plays a special role in effective capacity planning. Unless the enterprise has relatively simple availability requirements for their data, backup will require special modules to save the data in a consistent state for restore. These modules are available for many popular database and application environments for both Sun StorEdge Enterprise NetBackup and Solstice Backup software packages.

The following are types of data the planner needs to consider. The various data types mentioned below include an example compression ratio for the DLT tape 7000 tape drive.

Text or natural language. Text or natural language tends to have a lot of redundancy, and can therefore be well compressed by both software and hardware. For example, in tests using sample English texts, the DLT tape 7000 hardware compressed the data at a ratio of approximately 1.4:1.

Databases and high-level applications. Many popular database packages and application environments have corresponding backup modules for Sun StorEdge Enterprise NetBackup and Solstice Backup software packages. For example, backup modules exist for Oracle, Informix, and Sybase database packages as well as for application environments like SAP. These modules enable backing up and restoring data in a consistent state, without taking the database off-line, making it unavailable to users.

Additionally, while databases and high-level applications tend to have widely varying contents and structure, they often contain text or numeric data with a lot of redundancy. This makes them more compressible. For example, in tests with sample databases from a TPC-C benchmark, the DLT tape 7000 hardware compressed the data at a ratio of approximately 1.6:1.

Graphics. Many applications require manipulating numerous large graphical objects. The fact that graphic files tend to be larger than text files does not imply the filesystem will consist of a few large files. This is because applications create composite objects from a myriad of smaller isolated objects.

In general, graphic objects tend to be previously compressed, making further compression in hardware or software unlikely. Indeed, the nature of hardware compression algorithms often inflates files that are already optimally compressed. For example, in tests with Motion JPG data, the DLT tape 7000 hardware compression showed a compression ratio of approximately 0.93:1.

Combined file types. Data residing on network file servers and internet servers, the most common server types, is usually a mix of text, graphics, and binary files. Because these datasets often consist of many small files, the capacity planner must also evaluate system performance. These mixed file types compress well. For example, in tests with files from network file servers and internet servers, the DLT tape 7000 hardware had a compression ratio of approximately 1.6:1.
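The effect of these example ratios on media requirements can be estimated as follows. This is a rough sketch: the ratios are the DLT tape 7000 test figures quoted above, the 35 GB native capacity is the figure given earlier for that drive, and the function name and the 500 GB dataset are hypothetical. Real data will compress differently, so the results are planning estimates at best.

```python
import math

DLT7000_NATIVE_GB = 35  # native capacity quoted earlier in this paper

# Example hardware compression ratios from the tests cited above.
COMPRESSION_RATIO = {
    "text": 1.4,
    "database": 1.6,
    "motion_jpeg": 0.93,  # already compressed data inflates slightly
    "mixed": 1.6,
}


def cartridges_needed(dataset_gb, data_type, native_gb=DLT7000_NATIVE_GB):
    """Cartridges required for one full copy of the dataset."""
    effective_gb = native_gb * COMPRESSION_RATIO[data_type]
    return math.ceil(dataset_gb / effective_gb)


if __name__ == "__main__":
    for data_type in COMPRESSION_RATIO:
        print(f"{data_type:12s} {cartridges_needed(500, data_type)} cartridges for 500 GB")
```

Note that a ratio below 1:1 means the data takes more tape than its native size, so a dataset dominated by precompressed graphics needs more cartridges than its size alone suggests.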

File Structure
Another factor the planner must consider is the structure of the files: will they be backed up using a filesystem or dumped from a raw device?

As mentioned previously, raw dumps copy all the bits from the storage volume to the backup media. This captures the bits for any filesystem or database metadata, as well as the actual application data written on that volume. However, the metadata may be out of synch with the data in the volume. This is because the metadata on the volume is not interpreted, and the volume cannot differentiate the backup from another access. To prevent this problem, the volume is typically taken off-line to prevent updates to both data and metadata. Another solution is to mark all entities on that volume read-only for the duration of the backup.

The level of this problem varies depending on the types of filesystems and databases to be backed up. On-line filesystems maintain consistency, and do not require periods of unavailability. However, some higher-level applications may keep their data and metadata in the filesystem, and may need to be taken off-line or otherwise prevented from updating their files during the backup. Prevention from file updates during backups is required so that all the application data can be simultaneously saved and restored in a consistent state.

Another consideration between raw volume and filesystem backup is the atomicity of the data. The raw volume is treated as one large entity, while filesystems are divided into many small logical pieces. To recover even one portion of data (e.g., a file or database row) from a raw volume dump in a consistent state, the entire dataset needs to be restored. Restoring the entire dataset not only takes longer, but it also overwrites any changes to all the other data that had been made since the dump. In addition, incremental backups are currently impossible with raw volumes, because an update to any part of the volume compromises the integrity of the whole. In this case, the whole volume needs to be dumped again. With filesystems, only those files that changed since the last backup need be saved again.

The main advantage of raw volume dumps is the sheer efficiency of dumping raw bits without further interpretation by the system. The disk accesses tend to be large and sequential, minimizing the overhead of system calls and eliminating seeks by the disk drive arms (which are orders of magnitude slower than data transfers).

In contrast, filesystems add additional overhead. The data from file accesses is, by default, buffered in the virtual memory system, and this incurs copies in the kernel. In addition, files are read from disk in directory order and may be scattered in various areas of the disk, causing seeks to pass from one file to the next. This process may reduce the data rate from the disk volume. To perform closer to the level of raw dumps, the filesystem inefficiencies can be minimized through careful configuration and tuning. Nevertheless, there are certain situations where raw dumps are superior, if only for their sheer simplicity.

Filesystems can also offer a number of features that benefit effective backup configuration and planning. Chief among these is the ability to turn on Direct I/O. Direct I/O is a method of accessing files in the filesystem as though they were raw devices. This mainly bypasses the virtual memory buffering, and it may result in a large saving in CPU time, memory usage, and overall wall time. (Despite the benefits of Direct I/O, seeking to various positions on the disk to reach the beginning of each file cannot be avoided.) A recent study showed that Direct I/O saved an average of approximately 20% CPU cycles, and kept the system from thrashing during extraordinarily heavy loads.

Direct I/O is available in both VxFS and UFS (starting with the Solaris 2.6 Operating Environment software). VxFS provides various mechanisms for engaging Direct I/O, including a per-I/O option. The most common method, however, is to use a mount-time option to enable this feature for the entire filesystem. UFS also allows Direct I/O to be turned on for the entire filesystem. One additional benefit of VxFS is that a filesystem can be remounted with different options without first unmounting the filesystem. This allows users to remain on-line and active, even when Direct I/O is toggled. This can be a benefit in enterprises where continuous operation is necessary.

Lastly, the VxFS filesystem provides a quick snapshot capability that can mount an additional filesystem as a read-only snapshot of the original. This is done while the original is still active and available. This feature is implemented via a copy-on-write mechanism that makes sure any blocks from the original filesystem are copied out to a special area before the block is changed on disk. The filesystem snapshot requires much less additional disk space than a snapshot taken through the logical volume manager, because only blocks that change during the life of the snapshot need to be duplicated.
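The copy-on-write idea can be illustrated with a toy in-memory model. This is not how VxFS is implemented; the Volume and Snapshot classes are hypothetical and model a filesystem as a simple list of blocks, purely to show why only changed blocks consume extra space.

```python
class Volume:
    """Toy block device that supports one copy-on-write snapshot."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshot = None

    def take_snapshot(self):
        self.snapshot = Snapshot(self)
        return self.snapshot

    def write(self, index, data):
        snap = self.snapshot
        # Copy the old contents out before the first overwrite of a block.
        if snap is not None and index not in snap.saved:
            snap.saved[index] = self.blocks[index]
        self.blocks[index] = data


class Snapshot:
    """Read-only view of the volume as it was when the snapshot was taken."""

    def __init__(self, volume):
        self.volume = volume
        self.saved = {}  # block index -> original contents

    def read(self, index):
        if index in self.saved:
            return self.saved[index]
        return self.volume.blocks[index]  # unchanged since the snapshot


if __name__ == "__main__":
    vol = Volume(["a", "b", "c"])
    snap = vol.take_snapshot()
    vol.write(1, "B")
    print([snap.read(i) for i in range(3)], vol.blocks, len(snap.saved))
```

In this model the snapshot's extra storage is the size of the saved dictionary, which grows only with the number of distinct blocks modified while the snapshot exists, no matter how many times each block is rewritten.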

Data Origin
Knowing where the data is coming from will help the planner to plan an appropriate configuration. The configuration needed for a local backup at high speed is very different from that needed to back up hundreds of small PCs over a metropolitan area network. The considerations below explore this issue in more depth.

Is the Server Where the Data Resides the One Doing the Backups?
When the server where the data resides does the backups, the complication of configuring networks is eliminated, and the planning focus is narrowed to the disk and tape subsystems and server processing capabilities. The server needs to have sufficient tape bandwidth to meet the backup window requirements--the available time period for backing up a specified quantity of data. To ensure capacity for multiple backups of the data (e.g., daily differential, weekly cumulative, and monthly fulls), tape capacity should be configured for at least three to five times the dataset size.
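As a rough illustration, the three-to-five-times rule can be written as a small Python helper. The function name and the 200 GB dataset are hypothetical; only the factors 3 and 5 come from the text.

```python
def tape_capacity_range_gb(dataset_gb, low_factor=3, high_factor=5):
    """Return the (minimum, generous) tape capacity for a dataset, in GB."""
    return dataset_gb * low_factor, dataset_gb * high_factor


low, high = tape_capacity_range_gb(200)  # hypothetical 200 GB dataset
print(f"Configure between {low} GB and {high} GB of tape capacity")
```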

Disk bandwidth should be configured to meet the backup window requirements and keep the tapes streaming. (To keep from back-hitching, the DLT tape 7000 tape drive needs to receive data at a rate no less than 3.5 MB/second.) This may be difficult to ensure, because the server and disk subsystem are often already in place and tuned to perform a specific set of tasks. In this case, to determine if the desired backup window is feasible before planning for a specific set of tape devices, it often helps to measure the sequential rate of the disk subsystem. If the backup window is feasible, but backup performance still suffers due to slow disks, the planner needs to consider reconfiguring or upgrading the disk subsystem as part of the system upgrade path.
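The feasibility check described above can be sketched in Python. This is only an estimate under stated assumptions: the 20 MB/second measured rate and the 100 GB dataset are hypothetical, 1 GB is taken as 1024 MB, and the 3.5 MB/second minimum is the DLT figure quoted in the text.

```python
GB_IN_MB = 1024  # assumption: 1 GB = 1024 MB


def full_backup_hours(dataset_gb, disk_mb_per_s):
    """Hours needed to read the dataset sequentially at the measured disk rate."""
    return dataset_gb * GB_IN_MB / disk_mb_per_s / 3600


def max_streaming_drives(disk_mb_per_s, min_drive_mb_per_s=3.5):
    """Drives the disk subsystem can feed at or above the minimum streaming rate."""
    return int(disk_mb_per_s // min_drive_mb_per_s)


measured_rate = 20.0  # hypothetical measured sequential rate, MB/second
hours = full_backup_hours(100, measured_rate)
drives = max_streaming_drives(measured_rate)
print(f"{hours:.2f} hours to read 100 GB; up to {drives} drives kept streaming")
```

If the estimated hours exceed the backup window, or the drive count is zero, the disk subsystem rather than the tape devices is the limiting factor.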

Lastly, the planner needs to consider the CPU resources necessary for local backup. Fortunately, these tend to be minimal, especially if Direct I/O is used to access the filesystems. For example, with Direct I/O, a single 250 MHz CPU should be sufficient to back up at 50 MB/second from local disk to tape. If the backups will be concurrent with regular operation and the system is already fully loaded, the additional CPU resources needed for backup may need to be added.

There are some additional factors the planner must consider. If the system has spare processing capacity, the planner must determine how much head-room exists and whether it will be sufficient to meet demands. Secondly, if the backups will be performed at off-peak hours, the planner must determine if there are any other scheduled processes to be run concurrently with the backup, and how much CPU is available for both. The planner also needs to consider sizing and tuning memory, especially if Direct I/O is not used. The main consideration in that case is the shared memory buffers used to coordinate between various backup processes, although memory is needed for essentially all system activities.

Is the Data on Remote Clients?
The planner must consider the requirements for backup of remote clients. This involves planning for the networking requirements to meet the backup window and other considerations. There is no recipe solution because of the virtually boundless varieties and configuration possibilities of enterprise networks. The planner must carefully plan for a successful network backup infrastructure, and have a good knowledge of network performance.

Even with the latest networking technologies, network bandwidth tends to lag behind the bandwidth of storage subsystems. Gigabit Ethernet is theoretically 100 times faster than the original 10 Mbit/second Ethernet, but at the same time, Fibre Channel Arbitrated Loop (FC-AL) offers twice the available bandwidth of Gigabit Ethernet. This discrepancy in bandwidth is unlikely to change anytime soon, because the tolerances in network connectivity tend to be much tighter than for storage. Network bandwidth issues are further complicated by the relatively high cost of upgrading the network infrastructure. While new storage devices can just be plugged in, adding network capacity may mean re-wiring parts of the enterprise. Such infrastructure tends to be very expensive and needs to be planned years in advance. Therefore, even if the upgrade is committed, there is often a period of time during which the backup solution needs to work around inadequate network bandwidth.

Because of these network bandwidth issues, a frequent challenge when planning backup solutions is to find ways to satisfy backup requirements within the confines of a given network bottleneck. To understand the overall situation and to obtain a satisfactory solution, the planner needs to find the answers to the following five key questions:

How many clients are there? Knowing the number of clients helps the planner understand the overall scale of the enterprise. It also helps the planner determine aspects of backup planning such as level of multiplexing. Knowing the number of clients is also important because it ties in with the clients' location in the network in relation to the backup server.

What types of clients are there? To understand the client processing capabilities, the planner needs to know the types (i.e., the architecture and operating system) of clients that need to be backed up. For example, if a client has powerful processing capabilities but little network bandwidth to the server, software compression may be a good choice in backing up that client. Both Sun StorEdge Enterprise NetBackup and Solstice Backup software packages offer client-side modules for most platforms.

Do the clients have their own backup devices? If the clients have their own backup devices, the best configuration may be a hierarchical master-slave configuration. In this configuration, the master server initiates and tracks backups, but data goes to the local device. This configuration saves network bandwidth, and can often be significantly faster. The master-slave configuration is recommended on large clients connected to the backup server by a slow network. The backup server is often less powerful than the client it controls, and the main backup devices are attached to the slave clients.

How are the clients distributed? Knowing where in the enterprise network various clients reside helps the planner determine the available network bandwidth between the clients and server. This is necessary information for predicting backup times and data rates available from the clients to disk. Because the network bandwidth is often inadequate, a hybrid solution is most appropriate, in which both network backup of some clients and master-slave configurations are used.

How autonomous are the client systems? Sometimes the client systems are located in remote offices connected to the backup server via WAN (wide-area network) links. These systems often do not have dedicated technical support, and hence need to be managed remotely. By centralizing management, Sun StorEdge Enterprise NetBackup software helps make that task easier. However, certain tasks are necessarily manual, and involve personnel at the remote site. These people will need to be trained to carry out specific tasks associated with backup (e.g., changing tapes in stand-alone drives).

What Does the Disk Subsystem Look Like?
It is critical to obtain the optimal disk subsystem for good backup performance with modern tape technologies. This is because the disk becomes the next most likely bottleneck, assuming the network bandwidth is sufficient or the backups are being performed locally. The performance of the disk subsystem depends on numerous factors. To plan backup solutions, the planner can use the questions below as guidelines for addressing some of the more important disk-related performance issues:

How are the data on the disks laid out? The data layout on the disk affects throughput rate, because it determines whether access to the disk is mostly sequential or random. If the access pattern requires frequent seeks between portions of the disk, the overall throughput rate of data from the disk will dramatically decrease.

There are three reasons that the access pattern may require frequent seeks. The most common one is that the data on the disk was created over a long period of time. In this case, deleted files leave gaps in scattered parts of the disk, and these gaps are subsequently filled by newer files. A seek may then occur to get the next file, because the disk is backed up in directory order. (In this case, one way to obtain mostly sequential access to the existing files--albeit not an ideal process--is to back up all the files once, recreate the filesystem on the device, and then restore all the files from tape.)

Another common cause for this access pattern is that multiple processes are accessing different regions of the disk simultaneously. This results in seeks between the various regions. This can occur, for example, if two different filesystems on the same disk are being backed up simultaneously. In this case, it may be possible to serialize the access by scheduling the backups differently.

A third reason for this pattern is that outer regions of the disk (lower numbered cylinders) tend to be faster than inner regions. Data that needs to be accessed more quickly may be laid out on the outer cylinders.

How are the disks arranged into logical volumes? The logical volume configuration significantly affects performance. To add levels of performance or reliability to the disk subsystem, most enterprise server environments will involve some level of logical volume management, using software or hardware RAID.

RAID-0 (striped) volumes tend to increase overall performance, but significantly reduce overall volume reliability. Various combinations of RAID-1 (mirroring) and RAID-0 increase performance while also increasing reliability. RAID-5 also tends to increase both performance and reliability. However, RAID-5 has performance characteristics which slightly complicate backup planning. Approximately two to three times more time should be planned for restoring data to a RAID-5 volume than it took to back it up, because RAID-5 writes (especially small random writes) take significantly longer than reads. The expected reliability of the logical volumes plays a role in determining backup frequency. The RAID volume should probably be backed up more frequently if the following are all the case: the volume has poor reliability (e.g., RAID-0), it is updated often, and it contains valuable data.
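The RAID-5 restore guideline translates directly into a planning range. The Python sketch below is only an illustration; the function name and the four-hour backup are hypothetical, and the factors 2 and 3 are the ones given in the text.

```python
def raid5_restore_hours(backup_hours, low_factor=2.0, high_factor=3.0):
    """Return the (optimistic, pessimistic) restore time for a RAID-5 volume."""
    return backup_hours * low_factor, backup_hours * high_factor


best, worst = raid5_restore_hours(4.0)  # hypothetical 4-hour backup
print(f"Plan for a restore of {best} to {worst} hours")
```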

How are the disks managed? Another important consideration is the mechanism by which the individual disks are managed or configured into logical volumes. Two possible mechanisms are host-based and hardware RAID. Host-based RAID imposes slightly more overhead on the server system than hardware RAID, but tends to be more flexible. Various volume managers offer different RAID configuration options (e.g., RAID 1+0 vs. RAID 0+1). Some volume managers also offer additional features (e.g., snapshot) that are attractive for backup solutions. A large number of server clients and most workstation/PC clients do not implement logical volume management at all, and are limited to the performance and reliability characteristics of the individual component disks (i.e., JBOD).

What are the disk capabilities? The capabilities of the individual disks also affect disk subsystem performance and reliability. Newer disks tend to be faster and more reliable than older disks. This is not only because of age, but also because of rapid advances in disk technologies. Each disk tends to be capable of a certain data rate when doing sequential I/O, and of a certain seek rate when doing random I/O. When the disks are managed as RAID volumes, these capabilities place limitations on the overall logical volume performance. Additionally, different disks have different MTBF (mean-time-between-failures).

Data Destination
Several key questions below provide guidelines for the planner to plan for factors related to the tape subsystem, the data's target location.

What Does the Tape Subsystem Look Like?
The tape subsystem is another critical consideration, but tends to be slightly less complex than the disk subsystem. Overall, tape devices tend to be relatively predictable and generally behave as advertised. The most difficult task associated with a high-performance tape subsystem tends to be installation and configuration rather than planning. Planning tape subsystem capabilities is often a matter of using the device specifications to amass the required storage capacity and throughput. The planner can use the following questions to consider related issues:

Where do the tape devices reside? The planner needs to determine whether the tape devices are stand-alone desktop or rack-mounted units that need to be loaded by hand, or if they are mounted in a robotic library. If they are the former, the planner needs to consider planning for the human interaction required to implement an effective backup solution.

The robotic library is a superior choice for enterprise-level backup solutions. There are many variations of tape libraries, but most commonly they offer multiple tape drives and internal storage capacities in the hundreds of gigabytes.

By knowing the required data capacities, the planner can plan for a sufficient number of libraries to house all the data and to have room to grow. It may be more reliable to purchase a number of smaller libraries than a single very large library, because most tape libraries have only a single robot mechanism.

How many tape drives are there? The planner needs to determine the number of tape drives needed to meet the throughput requirements, and to configure at least that many as part of the libraries. The planner must also remember the SCSI or FC-AL slots on the server needed to connect the tape robotic devices. If there is an existing tape subsystem, they must determine its capabilities and supplement them with new equipment, if necessary. They must also be aware of any forward or backward compatibility issues with the media, because tape formats change almost as frequently as the underlying hardware.
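A minimal Python sketch of the drive-count calculation follows. The required rate of 32 MB/second and the per-drive rate of 5 MB/second are hypothetical inputs; a real plan would take the per-drive figure from the device specification.

```python
import math


def drives_needed(required_mb_per_s, drive_mb_per_s):
    """Smallest number of drives whose combined rate meets the requirement."""
    return math.ceil(required_mb_per_s / drive_mb_per_s)


count = drives_needed(32.0, 5.0)  # hypothetical requirement and drive rate
print(f"Configure at least {count} drives")
```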

What are the drive capabilities? Each individual type of tape drive has its own characteristics and capabilities. These include native-mode throughput, tape capacity, effectiveness of compression, compatibility of tape formats, and recording inertia. While throughput and capacity are relatively simple, the others also need to be carefully considered.

The actual compression ratio achieved depends mostly on the type of data, but it also depends on the compression algorithm implemented by the drive hardware. For example, the DLT tape 7000 algorithm prefers to trade throughput for compaction, while the EXB-8900 Mammoth 8 mm drive prefers the opposite. Not all tape drives are capable of using older media, even if the form-factor is identical. Most can read tapes written with older formats but cannot write in the older format.

If the backup images are to be archived for a number of years, the upgrade path is also important. The drive technology will chiefly determine the recording inertia. For example, linear recording technologies like the DLT tape 7000 and STK Redwood drives tend to have a stationary read/write head and quickly moving tape. To perform well, these drives need to be fed data above a specific rate. Helical-scan technologies like 8 mm and 4 mm tapes have lower recording inertia and are thus less sensitive to data input rates, but have overall lower throughput capabilities. It is difficult to balance all these factors, but as long as some minimal requirements are met, a suboptimal choice usually has little real effect on the overall performance.

How Are the Tape Devices Distributed?
It is also important to optimally position tape devices throughout the enterprise. This mainly depends on where it is advantageous to make the extra effort and attach backup devices directly to servers where the data resides. The following questions can help the planner examine the relationship between the tape devices and data, and may help them to focus on the relevant considerations:

Are all tapes on the master server? If all tape devices reside on the master server and the bulk of the data is elsewhere, the network needs to support the transfer rates necessary to move data from the remote clients to the centralized backup server. This configuration often simplifies day-to-day management at the cost of a complex networking infrastructure. As noted previously, networks are traditional bottlenecks for backup applications, and need to be configured for optimal performance.

Are tape libraries attached to important servers? An effective backup architecture is to add tape devices to servers where large quantities of data reside, and task them with being backup slave servers, centrally managed from the master server. With this architecture, the only information that is communicated over the network between master and slaves is the file record information, about 200 bytes per file backed up. Both Sun StorEdge Enterprise NetBackup and Solstice Backup software packages support this option.
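To get a feel for how little traffic this architecture places on the network, the file record volume can be estimated in Python. The 200 bytes per file comes from the text; the function name and the one million files are hypothetical.

```python
def file_record_traffic_mb(file_count, bytes_per_file=200):
    """Approximate master/slave network traffic in MB (1 MB = 1,000,000 bytes)."""
    return file_count * bytes_per_file / 1_000_000


traffic = file_record_traffic_mb(1_000_000)  # hypothetical one million files
print(f"About {traffic:.0f} MB of file records cross the network")
```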

How close are the tape drives to the data? The proximity of the tape drives to the data is usually an issue of network bandwidth. This is because shorter network distances tend to be covered by higher speed network links. If the tape devices and data are separated by hundreds of kilometers, the link bandwidth is likely to be low. In contrast, if they are located in the same data center, it may be simple to configure a point-to-point link, dedicated for backups, between the two. This is mainly important when deciding where to locate the master server in a widely distributed enterprise, because the network architecture and data locations tend to be fixed. A general guideline is to locate the master server as close as possible to the bulk of the data, and hopefully close to a central location in the network topology.

Tape Environment
The operating environment influences the reliability and longevity of the tape subsystem. The planner can use the three questions below to address the main factors:

What are the temperature and humidity like? Tapes perform best in moderate temperatures and relatively low humidity. The operating temperature affects things like tape tension and strength, drive part tolerances, and temperature of internal electronic components in the drive. Humidity may affect the longevity of the magnetic coating on the tape. This is because high humidity causes the surface of the tape to become gummy. The ideal operating conditions tend to be listed as part of the media packaging. For example, the DLT CompacTape IV lists operating conditions as 10-40 degrees C, storage as 16-32 degrees C, and humidity between 20% and 80%. Long-term archive storage (20+ years) requires even more stringent conditions.

How often are the drive heads cleaned? Drive heads need to be cleaned periodically because they pick up deposits with continual use. This is usually accomplished by inserting a cleaning tape. Drives operating in dirty conditions (e.g., near printers) need to be cleaned more frequently, as do drives that operate outside of environmental specifications. Brand new tapes tend to have some manufacturing debris on the surface, and drives that frequently use brand new tapes should also be cleaned frequently. Both backup software and tape library hardware are capable of automatically inserting cleaning tapes after a certain number of uses.

How old are the drives and tapes? As they get older, tape drives tend to eventually wear out and encounter errors more frequently. Each tape technology has an associated MTBF (mean-time-between-failures), and media has a certain rated number of passes before it is expected to wear out. These statistics, available from the manufacturers, tend to be optimistic.

The Data's Path
One of the last considerations in the overall system is the path the data takes from the disks where it originates to the tape cartridges where it is destined. The planner can explore this factor through the following questions:

Are Data and Tape Local to Backup Servers?
If data and tape are local to backup servers, the planner should focus configuration and tuning on moving data quickly through the system between the devices. They should also focus on supporting the potentially large number of processes involved in managing the backup streams. These tend to fall into two areas: using memory effectively and providing local host/RPC capacity.

Is the filesystem buffer cache used? Backups are more efficient when avoiding the filesystem buffer cache. The buffer cache can be bypassed by either using Direct I/O to access individual files, or backing up the raw volume rather than the filesystem.

How much system memory exists? Backup relies on system memory in two capacities. Primarily, it is used for shared memory regions used to implement interprocess communication between various backup/restore processes. Memory is also used when buffering filesystem data in the virtual memory cache. If data is cached in virtual memory faster than old pages can be purged, the system may begin to thrash. More memory temporarily forestalls this condition. However, if the system is in a condition where data is cached faster than it is purged, it will likely thrash at some point during the course of a long backup.

The most elegant solution is to avoid the buffer cache in the first place, but if that is impossible, the planner needs to tune the memory reclaim rates to be more aggressive. In addition, to improve I/O to the swap device, they also need to stripe the swap space across multiple spindles. This may eliminate thrashing, or at least reduce its impact.

What software is being used? The software used determines the overall efficiency with which data is moved from disk to tape. Both Sun StorEdge Enterprise NetBackup and Solstice Backup Power Edition software packages move data very efficiently, but Solstice Backup Network Edition software is a little less efficient. The Solaris Operating Environment software utilities such as tar and ufsdump are not particularly efficient and should not be used to implement enterprise backup solutions.

How much shared memory is available? The amount of shared memory the system can allocate is controlled in the /etc/system file. Settings in this file determine the memory available for interprocess communication (IPC) between the reader and writer processes in the system. For efficient backup and restore, a certain amount of shared memory should be configured per device and data stream.

What are the TCP tunings like? Tuning various kernel TCP parameters helps determine the buffer sizes used by the system, and the speed at which closed connections in various TCP wait states are flushed from the system.

Are Data and Backup Server(s) Distributed on the Network?
If the data is connected to its eventual destination on tape by a network, the planner needs to place emphasis on making sure the connectivity is uninterrupted and of sufficient bandwidth to meet requirements. To do this, the planner should consider the following questions.

What kind of network is it? Not all networks behave similarly, although all networks tend to be described in terms of their bandwidth. Different networking technologies have different properties. Ethernet variants tend to be inexpensive and common, but their range tends to be limited to local area networks. Within local area networks, there are various topologies that have different performance characteristics (e.g., switched to the desktop vs. hub vs. shared segment).

In addition, the nature of Ethernet causes overall bandwidth to degrade as more nodes are active on the network simultaneously. ATM (asynchronous transfer mode) and FDDI (fiber distributed data interface) networks have longer ranges and degrade more gracefully under heavier loads. However, they use fiberoptic connections, which make them less common and more expensive to install. Gigabit Ethernet and Sun Quad FastEthernet are growing in popularity due to their familiarity and ease of management, but are still not common in existing enterprises.

What is the available network bandwidth? A typical enterprise network consists of multiple segments and various network technologies. The available network bandwidth from one client to another may be vastly different. The planner must estimate the available bandwidth for each key path between backup server and client. This often entails constructing a detailed map of the enterprise network, which may not be available or up to date. Obtaining this information can require several days of planning.

How many simultaneous clients are sharing it? If all clients are active at once, the network is more likely to become overloaded, and the risk grows as more clients share each network segment. However, when there are more clients, the level of multiplexing to the tape drives can be increased. This allows the drives to stream when a single client is too slow to feed data to the tape at a sufficient rate.
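A rough way to choose a multiplexing level is to divide the rate a drive needs in order to stream by the rate a single client can deliver. The Python sketch below assumes the 3.5 MB/second DLT minimum quoted earlier; the client rates are hypothetical, and real backup software may impose its own limits on multiplexing.

```python
import math


def multiplex_level(client_mb_per_s, drive_min_mb_per_s=3.5):
    """Smallest number of client streams to interleave so one drive keeps streaming."""
    return max(1, math.ceil(drive_min_mb_per_s / client_mb_per_s))


slow_clients = multiplex_level(0.5)  # hypothetical clients on a slow link
fast_client = multiplex_level(5.0)   # hypothetical client faster than the drive minimum
print(slow_clients, fast_client)
```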

Enterprise Backup Requirements
Backup requirements tend to fall into a few discrete camps. These are primarily concerned with the following:

Backing up the data in a certain period of time
Restoring the data as needed
Limiting the impact of the process on day-to-day operations
The planner should thoroughly investigate these requirements before suggesting any particular solution. The following questions give the planner a useful start to that investigation:

What Is the Backup Window?
The planner's first step is to determine the backup window. However, a backup window is not a given, and there may not be an ideal period of time in which to perform the backup. Some applications and services need to be available 24 hours a day, seven days a week. In those cases, other methods of obtaining consistent backups need to be employed. In less extreme situations, the amount of time necessary to perform the backup may exceed the natural period of inactivity. These situations require compromises, in terms of one of the following:

What is backed up
Backup frequency
Data availability
Performance impact
When are people least likely to need access to the data? Times where demand for data is light tend to create natural backup windows. This is usually when few users are on the system, typically at night or on weekends. There may be other predictable periods of time when system activity is low (e.g., lunch time, after quarterly processing is completed, and holidays), and these are also good opportunities to schedule backups. If these natural periods of inactivity are insufficient, the planner needs to consider how or when they could be extended. The planner's goal is to ease the burden in terms of required throughput necessary to back everything up during the backup window.

How much data needs to be backed up (full and incremental)? The other part of the equation is the amount of data that needs to be backed up. For consistency and recovery purposes, the ideal backup saves the full set of data. The down side is that the full dataset is usually very large, consuming a lot of time and tape capacity. Most installations choose to perform full backups occasionally, and supplement those with more frequent incremental backups that record only the data that changed.

Sun StorEdge Enterprise NetBackup software offers a number of incremental backup options. Differential backups record files that changed since the last backup (either full or incremental). Cumulative backups record all files that changed since the last full backup. A drawback of cumulative backups is that they usually record more data than differential backups. An advantage is that restoring requires retrieving only from the last full and the last cumulative, rather than fetching the last full and potentially many differential images.
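The trade-off between the two incremental schemes can be illustrated with a simplified Python model. It assumes one full backup followed by one incremental per day, and that different files change each day, so the cumulative sizes are an upper bound. The daily change volumes and function names are hypothetical.

```python
from itertools import accumulate


def differential_sizes(daily_changes_gb):
    """Each differential records only the changes since the previous backup."""
    return list(daily_changes_gb)


def cumulative_sizes(daily_changes_gb):
    """Each cumulative records all changes since the last full backup."""
    return list(accumulate(daily_changes_gb))


def images_to_restore(days_since_full, scheme):
    """Number of backup images that must be read to restore the latest state."""
    if days_since_full == 0:
        return 1  # the full backup alone
    if scheme == "differential":
        return 1 + days_since_full  # the full plus every differential since
    return 2  # the full plus the latest cumulative


daily = [2, 3, 1, 4]  # hypothetical GB changed on each day after the full
print(differential_sizes(daily), cumulative_sizes(daily))
print(images_to_restore(4, "differential"), images_to_restore(4, "cumulative"))
```

In this model the cumulative scheme writes more data to tape each day but keeps the number of images needed for a restore constant.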

Solstice Backup software offers similar mechanisms, including multiple levels of cumulative backups similar to the levels used by ufsdump(1M). By knowing the potential backup targets of the software and data usage patterns at the site, the planner can estimate approximately how much data will be saved during each type of backup. A target data rate that the backup system should plan to achieve can be obtained by dividing that amount by the estimated time available. Various margins of error can be built into the calculations for added control.
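The target data rate calculation can be sketched in Python as follows. The 450 GB volume, the 8-hour window, and the 25% margin are hypothetical, and 1 GB is taken as 1024 MB.

```python
def target_rate_mb_per_s(data_gb, window_hours, margin=0.0):
    """Data rate the backup system must sustain, with an optional safety margin."""
    data_mb = data_gb * 1024       # assumption: 1 GB = 1024 MB
    seconds = window_hours * 3600
    return data_mb / seconds * (1 + margin)


base = target_rate_mb_per_s(450, 8)                 # no margin
padded = target_rate_mb_per_s(450, 8, margin=0.25)  # 25% margin of error
print(f"{base:.1f} MB/second, or {padded:.1f} MB/second with margin")
```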

What Is the Acceptable Impact of Performing the Backup?
If the window of inactivity is not sufficient to save the required volume of data without resorting to extravagant hardware, the planner needs to estimate the impact of performing backup concurrent with regular system use. To minimize the impact, there are a number of options available, with no hard rules for planning what to do when. The planner must evaluate all possible choices and select the most appropriate one.

Is data unavailability acceptable? The planner's central consideration is whether data can be kept from the users for some period of time. If it can, the planner needs to determine how that dedicated time might be best used to perform the backup. If data can be unavailable for some length of time, it is usually possible to back it up faster than if it were kept on-line. This may be in the form of shutting down any databases and backing up the raw volume, or unmounting any filesystems and backing up the underlying devices.

Is degraded performance acceptable? If data needs to be continually available but the overall performance of the system may be somewhat degraded, one choice is to continue backups concurrent with user activity. There are a number of mechanisms for on-line backup, and each has a different degree of impact on performance. The planner needs to assess the trade-offs and choose the best possible compromise.

How long is degraded performance acceptable? If data unavailability or degraded performance is acceptable, the planner needs to determine the period of time that must not be exceeded. This period is usually shorter for data unavailability than for degraded performance, but lower performance may lead to overall lower productivity, and thus should be minimized.

If databases are used, are appropriate modules available? Not all commercial databases have corresponding backup modules for the Sun StorEdge Enterprise NetBackup and Solstice Backup software packages. If hot backups of a database or some other high-level system are needed, the planner must verify that an appropriate module is available.

What Availability Concerns Should the Solution Address?
Each solution should address the real concerns and objectives of the customer. It is vital to understand the availability concerns that the backup attempts to address. For example, a good solution for retrieving accidentally deleted files is probably not a good solution for disaster recovery in which the whole site may be destroyed. To start thinking about the relevant issues, the planner can use the following three questions:

Is it critical to minimize impact of user or operator error? If the major concern being addressed is loss of individual files, the solution should be designed to retrieve the file quickly and with minimal effort on the part of the administrator. Minor issues include tape storage, duplicate media, and offsite import/export. An important issue is backup frequency, because the copy on tape should be as close as possible to the file's final state. The level of multiplexing can be high, because the overall throughput is not an issue when retrieving a small set of files, unless the files are very large.

To address such issues, planners may choose to use disk-based rather than tape solutions. Such solutions include, for example, keeping a third mirror of the volume off-line and readable in case something needs to be retrieved, or backing up important files to a disk directory rather than tape.

Is it critical to minimize impact from loss of equipment? If the goal is to minimize the impact of failed hardware (e.g., disk-head crash), backups can be structured to keep data from the same equipment arranged on the same set of media, and perhaps to duplicate the media. This would minimize tape fetch time for data that spans several tapes. To reduce the chance of losing data to failed hardware, RAID software or devices can be used.

Impacts from hardware failure also relates to highly available and cluster configurations. Configuring backup for these environments is potentially difficult and requires some experience. In situations where the entire system needs to be highly available, the best solution may be specialty contractors like Comdisco.

Is it critical to minimize impact in case of disaster? Disaster recovery and preparations need to encompass all aspects of the operation. These aspects range from frequent training for data center personnel, to using customized scripts for the backup software. The more common steps, however, are to keep multiple copies of media, one local and another archived at a remote site. Another option is to have another site where the data is imported by the backup software and ready for a restore. Some companies choose to have a "hot site" available to go on-line within a few minutes of a disaster, where the configuration has the same capabilities as the original site.

Expectations
A critical part of effective capacity planning is maintaining realistic expectations. This is important because a few areas of confusion can cause a disproportionate number of problems.

Compression
Compression can be problematic for a number of reasons. The main reason is that the benefit of compression varies with the data being compressed as well as with the compression mechanism being used. With the same compression algorithm, different types of data compress to different degrees. The level of compression depends on how much redundancy can be identified and remapped in the time available. Some types of data (e.g., video) have little or no redundancy to eliminate. Therefore, these will not compress well regardless of the compression scheme that is used. Hardware compression in the tape drive typically relies on a small buffer in which to temporarily hold the data as it is compressed. The size of this buffer limits how much of the data may be examined for redundant patterns. Lastly, the amount of time necessary to locate all redundant patterns may be longer than available to the compression mechanism. This is because the compression needs to happen in real-time, as the data streams into the tape device and onto the tape.

People often expect either the 2:1 compression ratio frequently quoted in the tape hardware literature, or ratios similar to those they see with compression utilities like compress(1) or GNU zip. In the past, the 2:1 number was sometimes touted as "typical", but in truth it was typical only of the special test patterns manufacturers use to test their algorithms. When compressing diverse types of data in the field, the compression ratios were often lower. If capacity planning was done expecting 2:1 compression, the system was often inadequate to the task.

Another typical compression mistake is to compress the target data on the system using compression programs, and use the observed compression ratios to estimate hardware compression. This mistake stems from the different natures of hardware and software compression. Compression utilities can use all of the system memory to perform the compression, and are under no time constraints. Hardware compression is limited to the hardware buffer size, and must operate in real time. The compression ratio observed with software utilities will usually be much better than the drive hardware can deliver. Capacity planners who use those numbers risk specifying inadequate systems.
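The dependence of compression on the data itself can be illustrated with a small Python sketch. It uses the zlib software compressor, so, as the text warns, the ratios it prints say nothing about what a tape drive will achieve; the sample data and the function name compression_ratio are illustrative assumptions only.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Return original size divided by zlib-compressed size."""
    return len(data) / len(zlib.compress(data))


# Highly redundant sample: the same short phrase repeated many times.
redundant = b"backup window, backup window, " * 4000

# Sample with essentially no redundancy. Pseudo-random bytes stand in
# for already-compressed data such as MPEG video (an assumption).
rng = random.Random(0)
incompressible = bytes(rng.getrandbits(8) for _ in range(len(redundant)))

print(f"redundant text: {compression_ratio(redundant):.2f}:1")
print(f"random bytes:   {compression_ratio(incompressible):.2f}:1")
```

The redundant sample compresses many times over, while the random sample does not shrink at all, which mirrors the Motion JPG row falling below 1:1 in the measurements.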

Compression ratios for various types of data (as observed in simple tests) are shown in Table 1. For hardware compression, the more "typical" compression ratio to expect is closer to 1.4:1, although some data types appear to do better. If attempting to save data with little to no redundancy (e.g., compressed video like MPEG or MJPG), it is better for compression in the drive to be turned off. In addition, the compression mechanism has two effects. The first is to speed up the rate at which data is processed by the device, and the second is to compact the data written to tape so that tape can hold more information. 1


Table 1. Compression Ratios

Mode          Speedup Ratio   Compaction Ratio
None          1:1             1:1
Text          1.46:1          1.44:1
Motion JPG    0.93:1          0.92:1
Database      1.60:1          1.57:1
Fileserver    1.60:1          1.63:1
Webserver     1.57:1          1.82:1
Aggregate     1.32:1          1.39:1

Overhead
The planner should plan for a certain amount of metadata overhead, on top of the data that is being saved and restored. The backup software keeps a database of files residing on tape, with a record for each instance of the file. An estimated 150-200 bytes are needed per file record; Solstice Backup software typically requires slightly more bytes than Sun StorEdge Enterprise NetBackup software. This means that a database containing a million file records is typically between 143 MB and 191 MB. The planner should plan for reliable, fast disk space to accommodate the file database, and they should configure a regular schedule to back up the database itself.
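The file database estimate above is simple arithmetic, sketched here in Python. The function name index_size_mb is hypothetical, and a megabyte is taken as 1024 x 1024 bytes, which reproduces the 143 MB and 191 MB figures.

```python
MB = 1024 * 1024  # storage megabyte in bytes


def index_size_mb(file_records: int, bytes_per_record: int) -> float:
    """Estimate the size of the backup file database in MB."""
    return file_records * bytes_per_record / MB


# One million file records at the low and high per-record estimates.
low = index_size_mb(1_000_000, 150)
high = index_size_mb(1_000_000, 200)
print(f"{low:.0f} MB to {high:.0f} MB")
```

The same function can be applied to the number of file instances expected to be retained on-line, which is usually several times the number of live files.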

The software also writes a certain amount of metadata to tape in order to keep track of what is being written where. This metadata tends to be minor in relation to the dataset size. Simple tests indicate that the metadata written to tape by Sun StorEdge Enterprise NetBackup and Solstice Backup software packages are typically below 1%. Other software (e.g., ufsdump) may write more metadata to the tape, depending on the format used.

Recovery Performance
Another common planning error is to assume that restore performance will be identical to backup performance. Initial rules of thumb suggested expecting the recovery to take approximately three times longer than the backup. While this is probably a safe metric to use, recent measurements indicate that it may be too conservative for Sun's latest systems and software. With proper tuning and adequate hardware support, it is possible to have restores perform within 10% of backups. When no other information is available, it may be safer to use some compromise like 50% or 75% longer. This is because this performance is predicated on a number of assumptions.
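As a rough illustration of these rules of thumb, the following Python sketch turns a measured backup time into a range of restore time estimates. The function name and the four-hour backup are assumptions, and "three times longer" is read here as three times the backup duration.

```python
def restore_hours(backup_hours: float, extra_fraction: float) -> float:
    """Restore time when restores take extra_fraction longer than backups."""
    return backup_hours * (1.0 + extra_fraction)


backup_hours = 4.0  # hypothetical measured backup time

# Extra time relative to the backup, per the rules of thumb in the text.
scenarios = {
    "well tuned (10% longer)": 0.10,
    "compromise (50% longer)": 0.50,
    "compromise (75% longer)": 0.75,
    "older rule (3x backup time)": 2.00,
}

for name, extra in scenarios.items():
    print(f"{name}: {restore_hours(backup_hours, extra):.1f} hours")
```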

The main reason for this performance discrepancy appears to be the nature of writes versus reads. For various reasons, writes to stable storage often take longer than reads. There is also more frequent demand for writes to be performed synchronously (in order to guarantee consistency). For example, creating files requires several synchronous writes to update the metadata keeping track of the file information. Those updates need to be performed in order to preserve file integrity.

Another component of the longer restore times is the browse delay introduced at the start of the request. When a restore request is initially issued, the software needs to browse the file record database and locate all records that need to be retrieved. This may take some time for large databases containing millions of records.

The situation is even more complicated for multiplexed restores. This is because the software usually waits to make sure all requests are received before initiating the actual restore. Alternatively, it may go back to retrieve files that were requested after the restore had already begun. This occurs in order to resynchronize the retrieval of file data intermingled on the same length of media. Otherwise, the restore operation needs to be serialized, constantly rewinding the tape to get each additional backup stream.

Ease of Use and Training Requirements
Modern storage management software offers powerful features behind easy-to-use graphical user interfaces (GUI). Modern library hardware has also been streamlined for ease of use and reliability (e.g., the GUI touch-screen controls on the Sun StorEdge L3500 tape library). However, the entire area of backup and data protection is very complex. It will inevitably be up to the planners, installers, and operators to make complex decisions which will affect the long-term success of the installation. This requires training on the products involved, hands-on experience, and at least a rudimentary understanding of the issues.

It is naive to expect to take the software out of the shrink-wrap, uncrate the hardware, and put together a well-tuned backup solution. On top of careful planning, even moderately complex backup installations call for trained and experienced personnel to install, configure, and tune the various components. This usually takes several days of dedicated effort. For the most complex installations, it may take multiple weeks to have everything optimally running.

The most successful approach is to bring in experienced consultants (e.g., Sun, Veritas, or Legato professional services) to install and configure the system for current needs, and to teach on-site personnel the basics of maintaining and operating the configured system. The on-site personnel then need to develop in-depth knowledge to be able to modify the configuration to meet increasing demands; this can be achieved through further training or other means. Meeting on-going demands is certainly also possible through long-term contracts with the consulting services that initially configured the systems.

Measurements and Calculations
A number of simple measurement techniques and calculations are useful in reaching correct capacity planning decisions. The following sections should provide the necessary background and tools for the planner to make simple bandwidth estimates to match capacities. This information can also provide an example methodology that can be adapted for more complex decisions. These sections serve as a reference as well as a learning aid.

Network Sizing
Accurately sizing networks can be tricky. There are many different networking technologies in place today, and more are being added over time. Each technology has its own characteristics, and these are complexly interrelated. There are often multiple paths between any two points in the network, and different paths offer different bandwidths. All these factors combine to challenge planners trying to understand the layout of the corporate intranet.

It is usually easiest for the planner to start from scratch and plan for additional new networks dedicated for backup. Unfortunately, adding these may involve pulling additional wiring between distant corners of the enterprise, which is far more expensive than the purchase of a few switches and adapters. To meet the new backup demands, most planners need to understand how to efficiently use the existing network infrastructure.

Principles
To effectively perform network capacity planning, there are a few simple but powerful techniques that the planner can use. When working without an existing network configuration for backup, the goal is to configure sufficient bandwidth between the data location and the tape device. This can be done by allocating multiple links between source and destination until the aggregate bandwidth is adequate. Table 2 may help the planner choose technologies best suited to the task.


Table 2. Estimated Rates for Various Network Technologies

Technology                    Theoretical Speed   Realistic Speed
Modem                         28.8 KBaud          2 KB/sec
ISDN                          128 Kb/sec          10 KB/sec
Frame Relay 256               256 Kb/sec          20 KB/sec
Frame Relay 512               512 Kb/sec          39 KB/sec
T-1                           1.54 Mb/sec         115 KB/sec
T-3                           44.7 Mb/sec         3.4 MB/sec
Ethernet (10BaseT)            10 Mb/sec           0.75 MB/sec
FastEthernet (100BaseT)       100 Mb/sec          7.5 MB/sec
GigabitEthernet (1000BaseT)   1000 Mb/sec         50 MB/sec
FDDI                          100 Mb/sec          8 MB/sec
CDDI                          100 Mb/sec          8 MB/sec
ATM 155                       155 Mb/sec          11.6 MB/sec
ATM 622                       622 Mb/sec          50 MB/sec
HIPPI-s                       800 Mb/sec          60 MB/sec

Most realistic environments will already contain significant investment in network infrastructure that can be leveraged for backup and recovery. With the high cost of installing additional wiring, it is usually preferable to strategically place backup servers to use these existing networks.

The planner's first step is to sketch a map of the existing network. Many enterprises will already have such a map, or know who to turn to for this information. The goal is to produce a map showing all relevant network links in relation to one another. These links are then labeled with their expected available bandwidth during the projected backup window. The full bandwidth of the link may not be available for backup, because the networks are usually shared with other users. Network administrators often keep usage statistics that may point to a time when the networks are nearly idle, an ideal time to perform backups. The planner also needs to note how much backup data is located on each segment, and on which machines it is located. Systems that have a large concentration of data may turn into hotspots; therefore, they need to be adjusted.

Once a map of the existing network infrastructure is available, the planner needs to locate the most central point in the network. This is the point that has access to the most bandwidth and minimizes the overall number of hops to the data. This central point is the ideal place to locate the master server from the standpoint of the network. (Administrative issues may be a different story.) Once the master server is placed, the planner needs to estimate the available bandwidth from the various key data sources to the master server.

If the above estimation process shows that the network would be a bottleneck, the planner needs to consider adding slave servers to various network segments. One situation the planner needs to consider is when there are a few systems that hold the bulk of the data on that segment, and one system in particular is either the largest or least busy. In this case, the planner must consider converting that machine into a slave server by adding a tape subsystem that is sufficient to service the local backup needs. If no such machine is available and all existing machines are fully utilized, the planner should consider adding an additional system on that segment to be the slave backup server. The slave servers will direct all backup data to themselves, limiting the network transfer to the master server to just the file record information.

Estimating Available Bandwidth
The planner needs to use the network map to estimate available bandwidth without accessing the actual network. For each link, the planner needs to locate the route from data source to the nearest backup server. The bandwidth of that route is the bandwidth of the slowest link in that route. To estimate the rates of individual links, the planner should use the realistic rates listed in Table 2. If multiple streams will traverse a link simultaneously, it is best for the planner to assume that all streams will equally share the link. As the number of active hosts increases, all network technologies degrade somewhat, but Ethernet variants, in particular, degrade rather quickly. For links using variants of the Ethernet protocol, the planner must try to keep the number of simultaneous streams below twenty.
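The two rules above (a route is as fast as its slowest link, and simultaneous streams share a link equally) can be combined in a short Python sketch. The function name and the example route are hypothetical; the link rates are the realistic figures from Table 2.

```python
def stream_bandwidth(route):
    """Estimate the bandwidth one backup stream sees along a route.

    route is a list of (realistic_rate_mb_per_sec, simultaneous_streams)
    pairs, one per link. Each link is assumed to be shared equally by
    its streams, and the route runs at the pace of its slowest share.
    """
    return min(rate / streams for rate, streams in route)


# Hypothetical route: a FastEthernet segment shared by three streams,
# then a T-3 link and a second FastEthernet segment used by one stream.
route = [(7.5, 3), (3.4, 1), (7.5, 1)]
print(f"{stream_bandwidth(route):.1f} MB/sec per stream")
```

In this example the shared FastEthernet segment, not the nominally slower T-3 link, is the limiting factor.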

Another approach the planner can use is to measure the available bandwidth across key points of the network. There is no fail-proof way of doing this, because most bandwidth measurement tools are invasive, and may or may not mimic the type of load applied by the application in question. Perhaps the easiest way for the planner to measure the available bandwidth is to use the ftp utility. To do this, the planner can use the following procedure:


1. Create a large file on the client system. A tar file containing some relevant files is ideal, although you can just concatenate a number of smaller files.
2. From the client, connect to the server system and turn on binary transfer mode.
3. Put the large test file onto the server as /dev/null. This will transfer the bits for the file over the network, but not store them on the other side.
4. Use the transfer rate estimated by ftp as the bandwidth of that particular route.

The above method is simple and uses commonly available tools. However, the interactive component makes it difficult to script for testing portions of a large network, and the high overhead of the ftp protocol means the result is likely to underestimate the bandwidth available to the backup software. To estimate the bandwidth, a more accurate and flexible method is to use network benchmarks like NetPerf 2 . The NetPerf tools tend to be relatively simple to use, and come with directions and sample scripts, but there is a slight learning curve involved. If this is going to be a once-only exercise, the ftp method may be preferable.
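Where the interactive ftp client is inconvenient, the same procedure can be approximated with Python's standard ftplib module. This is only a sketch: the host, login, and file path are placeholders, the function names are hypothetical, and it assumes the server permits storing to /dev/null.

```python
import ftplib
import os
import time


def throughput_mb_per_sec(num_bytes: int, seconds: float) -> float:
    """Convert a byte count and elapsed time to MB/sec (1 MB = 1024 x 1024 bytes)."""
    return num_bytes / (1024 * 1024) / seconds


def measure_ftp_put(host: str, user: str, password: str, local_path: str) -> float:
    """Send local_path to /dev/null on the server and return MB/sec."""
    size = os.path.getsize(local_path)
    with ftplib.FTP(host) as ftp:
        ftp.login(user, password)
        with open(local_path, "rb") as fh:
            start = time.monotonic()
            # storbinary switches to binary mode before transferring.
            ftp.storbinary("STOR /dev/null", fh, blocksize=64 * 1024)
            elapsed = time.monotonic() - start
    return throughput_mb_per_sec(size, elapsed)


# Example call with placeholder values:
# rate = measure_ftp_put("backupserver", "user", "secret", "/tmp/test.tar")
```

Like the manual procedure, this includes ftp protocol overhead, so the figure is best treated as a lower bound.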

Once the bandwidth across key routes is estimated, the planner needs to compute the time necessary to transfer the data from the source to the destination:

Transfer time = Amount of data / Available bandwidth

One thing the planner needs to remember is that the units used to describe network and storage bandwidth tend to be different. Network bandwidth is usually listed as Mb/sec or Mbps, and refers to 1000 x 1000 bits per second. In contrast, storage bandwidth is usually listed as MB/sec or MB/s, and refers to 1024 x 1024 bytes per second. For example, storage bandwidth of 1 MB/s is equivalent to network bandwidth of 8.39 Mbps.
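Because the two unit systems are easy to confuse, a small Python sketch of the conversion may help. It also computes a transfer time as the amount of data divided by the bandwidth. The function names and the 100 GB example are assumptions.

```python
BITS_PER_MEGABIT = 1000 * 1000      # network megabit
BYTES_PER_MEGABYTE = 1024 * 1024    # storage megabyte


def mb_per_sec_to_mbps(mb_per_sec: float) -> float:
    """Convert storage bandwidth (MB/sec) to network bandwidth (Mbps)."""
    return mb_per_sec * BYTES_PER_MEGABYTE * 8 / BITS_PER_MEGABIT


def mbps_to_mb_per_sec(mbps: float) -> float:
    """Convert network bandwidth (Mbps) to storage bandwidth (MB/sec)."""
    return mbps * BITS_PER_MEGABIT / 8 / BYTES_PER_MEGABYTE


def transfer_hours(data_gb: float, bandwidth_mb_per_sec: float) -> float:
    """Hours needed to move data_gb (1 GB = 1024 MB) at the given rate."""
    return data_gb * 1024 / bandwidth_mb_per_sec / 3600


print(f"1 MB/sec = {mb_per_sec_to_mbps(1.0):.2f} Mbps")
# Hypothetical case: 100 GB over a route estimated at 7.5 MB/sec.
print(f"{transfer_hours(100, 7.5):.1f} hours")
```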

Disk Sizing
Simpler bandwidth estimation for disks makes capacity planning generally less troublesome than for networks, but modern tape subsystems can be sensitive to slow disks. In general, the planner does not need to be concerned with simple configurations like single spindles and striped JBOD arrays. Disks should be configured and tuned for their primary purpose first, and adjusted for backup second. If this planning exercise is an opportunity to put together a new system from scratch, it might make sense for the planner to plan for backup as well as for some other primary activity.

Principles
When planning disks for backups, a few simple principles apply. First, reads tend to be faster than writes. This has to do with factors like data integrity and prefetching. This is mostly a challenge during large-scale restores, because the majority of disk accesses will then be writes.

When the disks are combined into logical volumes, RAID performance principles also come into play. Stripes and mirrors tend to aggregate the performance of their component disks. It is often sufficient to simply check that an adequate number of spindles are configured. RAID-5 volumes are more complex, but tend to have good read performance. If large volumes of data need to be restored quickly, RAID-5 volumes need to be carefully configured so that the RAID segment size matches the restore I/O size for optimal performance. To make sure that the configuration is satisfactory, the planner needs to test the restore performance for the RAID-5 volumes. Otherwise, when trying to restore in an emergency, they could be unpleasantly surprised. If long-term performance of the backup appears to be problematically slowed by the disk subsystem, the planner can consider upgrading or reconfiguring the storage.

Estimating Available Bandwidth
A number of methods are available to estimate the bandwidth of the disk subsystem. The first is to analyze the existing or projected configuration, using established values for each component. For each expected backup stream, the planner must calculate the number of spindles (individual disks) available to service that stream. If the spindles are striped together, they need to consider the available bandwidth to be 70% of the aggregate disk bandwidths. If they are striped and mirrored, they need to use 70% for reads and 35% for writes. If the disks are configured together as a RAID-5 volume, the planner can estimate the read bandwidth as:

where:

N is the number of data disks,
M the number of parity disks,
N plus M adds up to the total number of spindles in the volume.

For write performance, the planner should use half that value as the estimate. This gives a rough estimate of the raw performance available directly from the disks, assuming no bus or channel bandwidth limitations have been reached. The planner can use Table 3 to estimate the spindle rates and the overall abilities of the storage arrays in question.
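The 70% and 35% rules for striped and for striped and mirrored volumes can be sketched in Python as follows. RAID-5 is deliberately left out. The function names are hypothetical, and the sketch assumes a single nominal rate per spindle.

```python
def striped_bandwidth(spindle_rates):
    """Striped volume: 70% of the aggregate spindle bandwidth."""
    return 0.70 * sum(spindle_rates)


def striped_mirrored_bandwidth(spindle_rates):
    """Striped and mirrored volume: 70% for reads, 35% for writes."""
    total = sum(spindle_rates)
    return {"read": 0.70 * total, "write": 0.35 * total}


# Hypothetical volume of four spindles rated at 8.7 MB/sec each.
spindles = [8.7] * 4
print(f"striped: {striped_bandwidth(spindles):.1f} MB/sec")
mirrored = striped_mirrored_bandwidth(spindles)
print(f"striped and mirrored: read {mirrored['read']:.1f}, "
      f"write {mirrored['write']:.1f} MB/sec")
```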


Table 3. Estimated Rates for Various Disk Technologies

Disk Technology      Peak Read Throughput   Peak Write Throughput
4GB 5400rpm disk     5.6 MB/sec             2.8 MB/sec
4GB 7200rpm disk     9.3 MB/sec             4.2 MB/sec
9GB 7200rpm disk     8.7 MB/sec             4.1 MB/sec
9GB 10000rpm disk    11-16 MB/sec
18GB 7200rpm disk    14-21 MB/sec
SSA                  18 MB/sec              16 MB/sec
A1000                30 MB/sec              14 MB/sec
A3000                35 MB/sec              20 MB/sec
A5000                168 MB/sec             76 MB/sec
DASD (3390)          3.5-4.2 MB/sec
PC Clients           2-8 MB/sec

For any logical volume configuration, if filesystems are used with Direct I/O on top of the logical volumes, the planner needs to reduce the value by an additional 10% for reads and 15% for writes. If for some reason Direct I/O cannot be used, the planner can divide the calculated raw value by 2 for reads, and 3 for writes.

Once the available disk bandwidth is calculated for all logical volumes, the planner should consider how the volumes are laid out on top of the multiple busses and I/O channels (e.g., SCSI, FC-AL, SBus). If the aggregate volume bandwidth exceeds the bus bandwidth, they can assume all logical volumes can share the bus equally, and divide the available bus bandwidth among the competing volumes.
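The filesystem and bus adjustments can be expressed as a short Python sketch. The function names and the example figures are hypothetical. It assumes that "reduce by 10%" means multiplying by 0.90, and it caps each volume at its own estimated rate when sharing a bus.

```python
def filesystem_bandwidth(raw_mb_per_sec, operation, direct_io=True):
    """Adjust a raw volume estimate for filesystem overhead.

    operation is "read" or "write".
    """
    if direct_io:
        factor = 0.90 if operation == "read" else 0.85
        return raw_mb_per_sec * factor
    divisor = 2 if operation == "read" else 3
    return raw_mb_per_sec / divisor


def share_bus(volume_rates, bus_rate):
    """Divide bus bandwidth equally if the volumes together exceed it."""
    if sum(volume_rates) <= bus_rate:
        return list(volume_rates)
    share = bus_rate / len(volume_rates)
    # A volume cannot use more than it can deliver on its own.
    return [min(rate, share) for rate in volume_rates]


# Hypothetical: three volumes of 30 MB/sec raw on a 40 MB/sec bus.
print(share_bus([30.0, 30.0, 30.0], 40.0))
print(filesystem_bandwidth(20.0, "read"))
print(filesystem_bandwidth(20.0, "write", direct_io=False))
```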

Another method the planner can use for estimating available bandwidth is to measure it using simple tools. The easiest tool to come by is dd(1), which can easily generate a sequential stream of accesses to either a file or a raw device. To test potential backup performance, the planner can create a large file on the source disk subsystem. On the host, they can time a dd process reading from the large file and writing to /dev/null using block sizes similar to those of the backup software (a 64 KB block size is a good guess). The planner can divide the file size by the time it took to read all the contents, obtaining an approximation for the disk bandwidth. If the raw device bandwidth is required, they can read a certain number of blocks from the raw device, and compute the rate based on the number of blocks transferred rather than the file size.

If restore performance is the goal, they can write to a file from /dev/zero for filesystem performance. Writing to a raw device only works if there is no valid data on that device. (The planner must take caution when writing to a raw device using dd, because this will likely destroy any data on that device.) These estimates are likely to be higher than the performance actually seen during backup, so they can use perhaps 80% of the measured value in planning.
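For planners who prefer a script to timing dd by hand, the read test can be approximated in Python. This is a stand-in for dd rather than a replacement: the function names are hypothetical, the test file path is a placeholder, and a file that is already in the operating system's cache will read unrealistically fast.

```python
import time


def read_throughput_mb_per_sec(path, block_size=64 * 1024):
    """Read a file sequentially in fixed-size blocks and return MB/sec."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as fh:
        while True:
            block = fh.read(block_size)
            if not block:
                break
            total += len(block)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        raise ValueError("file too small to time reliably")
    return total / (1024 * 1024) / elapsed


def planning_value(measured_mb_per_sec, derate=0.80):
    """Apply the 80% planning allowance suggested in the text."""
    return measured_mb_per_sec * derate


# Example call with a placeholder path:
# rate = read_throughput_mb_per_sec("/data/testfile")
# print(planning_value(rate))
```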

A more accurate method would be to use the actual programs used by the backup software and direct their output to /dev/null. This measures the exact data access load on the disk in isolation from other potential bottlenecks, such as networks and tape drives. The exact invocation varies from package to package and filesystem versus raw device. The software CLI documentation should provide the necessary details to conduct this test, although this method is most useful when troubleshooting or tuning an existing installation. This is because it requires the software and data to already be in place.

Lastly, it is generally not a good method to measure backup performance using standard Solaris utilities like tar or ufsdump. These programs are not especially tuned for high performance, and may bottleneck somewhere other than the disk subsystem.

Tape Sizing
Tape sizing is equally concerned with adequate on-line capacity and available bandwidth. Fortunately, both calculations tend to be simple. The only complication tends to be potential back-hitching for linear tape devices. This can be addressed by a combination of using fewer devices and higher levels of multiplexing to the tapes.

Principles
Without other specific requirements, the planner can configure the on-line capacity as approximately three to five times existing data and expected near term growth. This allows for multiple copies of the data to reside on-line, as is necessary when using full and multiple incremental backup schedules. Tape bandwidth should be configured to match or be slightly below the bandwidth of networks and disk. This tends to be easy to accommodate, and reduces the chance of back-hitching.

When trying to back up an existing enterprise with multiple slow clients, networks, or logical volumes, the planner should configure multiplexing in the backup schedules. This allows each tape device to be fully utilized. Each multiplexed stream uses a finite amount of resources (e.g., TCP ports, buffers, CPU) on the server, so the total number of backup streams handled by a server simultaneously should be kept below approximately 120 3 .

Estimating Available Bandwidth
The planner can use the advertised native rates to estimate bandwidth available to tape devices. Table 4 lists capacities and rates for common tape devices. Most devices include some level of hardware compression. If compression is going to be used, the planner should take this into account. As a guideline to plan for compression, the planner should use a 1.4:1 compression ratio, because the 2:1 advertised ratio tends to be overly optimistic.
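Applying the 1.4:1 planning ratio to a device's native figures is straightforward, as the Python sketch below shows for a drive with 35 GB native capacity and a 5 MB/sec native rate (the DLT 7000 figures in Table 4). The function names are hypothetical, and 1 GB is taken as 1024 MB.

```python
PLANNING_RATIO = 1.4  # planning compression ratio suggested in the text


def effective_capacity_gb(native_gb, ratio=PLANNING_RATIO):
    """Capacity per tape once compression is taken into account."""
    return native_gb * ratio


def effective_rate_mb_per_sec(native_mb_per_sec, ratio=PLANNING_RATIO):
    """Throughput once compression is taken into account."""
    return native_mb_per_sec * ratio


def gb_per_hour(mb_per_sec):
    """Convert MB/sec to GB/hr (1 GB = 1024 MB)."""
    return mb_per_sec * 3600 / 1024


native_gb, native_rate = 35, 5
print(f"capacity: {effective_capacity_gb(native_gb):.0f} GB")
print(f"native:     {gb_per_hour(native_rate):.1f} GB/hr")
print(f"compressed: {gb_per_hour(effective_rate_mb_per_sec(native_rate)):.1f} GB/hr")
```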

It is also important for the planner to consider the SCSI bus bandwidth. Generally, they should plan on a maximum of two or three tape devices per SCSI bus, and should not mix tape and disk devices on the same bus. Empirical tests of tape bandwidth can be easily accomplished using the dd(1) and mt(1) commands, although access to library robotics from the system requires additional software to drive the robot.


Table 4. Estimated Rates and Capacities for Various Tape Technologies

                                 Capacity (GB)      Throughput
Device             No. Drives    1:1      1.4:1     1:1 MB/sec   1:1 GB/hr   1.4:1 MB/sec   1.4:1 GB/hr
DDS-3              1             12       16.8      1            3.5         1.4            4.9
DDS-3 Autoloader   1             72       100.8     1            3.5         1.4            4.9
EXB-8900           1             20       28        3            10.5        4.2            14.8
DLT tape 7000      1             35       49        5            17.6        7              24.6
L280               1             280      392       5            17.6        7              24.6
L400               2             400      560       6            21.1        8.4            29.5
L1000              4             1000     1400      20           70.3        28             98.4
L1800              4             1800     2520      20           70.3        28             98.4
L3500              7             3500     4900      35           123         49             172.3
L11000             16            11000    15400     80           281.3       112            393.8
IBM 3490           1             0.2      N/A       3            10.55       N/A            N/A
IBM 3590E          1             0.4      0.56      6            21.1        8.4            29.5
STK Redwood        1             50       71.5      10.5         36.9        14.7           51.7

Memory Sizing
General server memory sizing guidelines should be adequate except in cases of large-scale local backup of many filesystems in parallel. In those cases, all attempts should be made to use Direct I/O to eliminate buffering backup data in the virtual memory cache. If Direct I/O cannot be used, the kernel memory reclamation rates should be adjusted to be more aggressive and free up buffers faster than they are used by the backup processes. When adjusting other system parameters, such as the number of filenames and inodes cached in system memory, the planner should also take memory size into account. Another consideration is the amount of shared memory configured for inter-process communication between backup processes.

System Sizing
Backup performance may be impacted by various aspects of system sizing and configuration. Sun's Ultra Enterprise architecture can perform very well for backup and restore, as demonstrated in the terabyte-per-hour benchmark, and a number of other studies. When planning to configure an existing server for local backup, most choices are already made. The remaining decisions consist of adding any additional I/O boards to accommodate the tape hardware, memory for larger buffers, and CPUs if the system is already close to full utilization. The number of additional I/O boards needed depends on the number of devices that need to be configured. This follows directly from the capacity of the boards and host bus adapters.

To simplify the configuration of CPU capacity, the planner can estimate how many CPU cycles are needed to move data at a certain rate. Simple experiments have shown that a useful, conservative estimate is 5 MHz of UltraSPARC CPU capacity per 1 MB/second of data that needs to be moved. This means that for every MB/second of data moved (whether over the network, from disk, or to tape), the system should have 5 MHz of processing power available for the transfer. For example, a system that needed to back up a number of clients over the network to local tape at a rate of 10 MB/second would need 100 MHz of available CPU power. This includes 50 MHz to move data from the network to the server, and another 50 MHz to move data from the server to the tapes. This would keep a 300 MHz UltraSPARC processor at 33% utilization. As another example, a system that needed to back up a database residing on local disks to local tape device at a rate of 35 MB/second would need 350 MHz of available CPU power. The actual software overhead is small, and is included in the 5 MHz per MB/second number.
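The CPU rule of thumb reduces to one multiplication, sketched here in Python with the two examples from the text. The function names are hypothetical. Each backup stream is counted as two data moves: one into the server (from network or disk) and one out to tape.

```python
MHZ_PER_MB_PER_SEC = 5  # UltraSPARC rule of thumb from the text


def cpu_mhz_needed(rate_mb_per_sec, moves=2):
    """CPU capacity needed to move data at the given rate.

    moves counts how many times the data is moved, for example
    network to server plus server to tape.
    """
    return MHZ_PER_MB_PER_SEC * rate_mb_per_sec * moves


def utilization(mhz_needed, cpu_mhz):
    """Fraction of a CPU of the given clock rate that would be busy."""
    return mhz_needed / cpu_mhz


network_to_tape = cpu_mhz_needed(10)   # clients over the network to tape
disk_to_tape = cpu_mhz_needed(35)      # local database to local tape
print(network_to_tape, disk_to_tape)
print(f"{utilization(network_to_tape, 300):.0%} of a 300 MHz processor")
```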

Conclusion
Backup and recovery are essential processes because of the large volumes of data retained in today's Datacenters. Thus the role of the capacity planner is critical for designing the optimal backup architecture for the Datacenter and system requirements.

Capacity planning is not a straightforward procedure; it requires knowing how to efficiently use the network infrastructure, and understanding network performance and bandwidth issues. The planner needs to configure the network for optimal backup and recovery performance. And because networks are traditional bottlenecks for backup applications, the capacity planner often needs to choose the preferred bottleneck.

Although the process is complex, the planner can follow a series of guidelines and use a number of available tools and methods to obtain information necessary for making good decisions. The planner first needs to assess the environment to be backed up. This includes obtaining the following information: (1) the data type, (2) the file structure, (3) the data origin, (4) the data destination, and (5) the data's path. The planner also needs to know whether data and backup servers are distributed on the network, and if so, how they are distributed.

Knowing the backup requirements of the enterprise is also essential. This includes determining the time period available for backups, the needs for restoring the data, and ways to limit the impact of the process on day-to-day operations.

Finally, the planner needs to maintain realistic expectations. This means accounting for data overhead caused by additional metadata, understanding the ease of use in the backup process and assessing training requirements, and understanding the recovery performance.

Glossary
ATM
Asynchronous transfer mode. A standard for switching and routing all types of digital information, including video, voice, and data. With ATM, digital information is broken up into standard-sized packets, each with the "address" of its final destination.

atomicity
Refers to an operation that is never interrupted or left in an incomplete state under any circumstance.

backup
A copy on a diskette, tape, or disk of some or all of the files from a hard disk. There are two types of backups: a full backup and an incremental backup. Synonymous with "dump."

bus
(1) A circuit over which data or power is transmitted, one that often acts as a common connection among a number of locations. (2) A set of parallel communication lines that connect the major components of a computer system, including CPU, memory, and device controllers.

cache
A buffer of high-speed memory filled at medium speed from main memory, often with instructions. A cache increases effective memory transfer rates and processor speed.

data base management system (DBMS)
A software system facilitating the creation and maintenance of a data base and the execution of programs using the data base.

Ethernet
A type of local area network that enables real-time communication between machines connected directly together through cables. Ethernet was developed by Xerox in 1976, originally for linking minicomputers at the Palo Alto Research Center. A widely implemented network from which the IEEE 802.3 standard for contention networks was developed, Ethernet uses a bus topology (configuration) and relies on the form of access known as CSMA/CD to regulate traffic on the main communication line. Network nodes are connected by coaxial cable (in either of two varieties) or by twisted-pair wiring. See also 10BASE2, 10BASE5, 10BASE-T, and 100BASE-T.

FDDI
Fiber distributed data interface. An emerging high-speed networking standard. The underlying medium is fiber optics, and the topology is a dual-attached, counter-rotating token ring. FDDI networks can often be spotted by the orange fiber "cable."

full dump
A copy of the contents of a file system backed up for archival purposes. Contrast with incremental dump.

gigabyte (Gbyte)
One billion bytes. In reference to computers, bytes are often expressed in multiples of powers of two. Therefore, a gigabyte can also be 1024 megabytes, where a megabyte is considered to be 2^20 (or 1,048,576) bytes.
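
The gap between the two meanings of "gigabyte" can be worked out directly. The following is a small sketch; the constant and function names are illustrative only.

```python
# Decimal and binary interpretations of "gigabyte" (names are illustrative).
MEGABYTE_BINARY = 2 ** 20                  # 1,048,576 bytes
GIGABYTE_DECIMAL = 10 ** 9                 # one billion bytes
GIGABYTE_BINARY = 1024 * MEGABYTE_BINARY   # 2**30 bytes


def difference_percent():
    """How much larger the binary gigabyte is than the decimal one, in percent."""
    return (GIGABYTE_BINARY - GIGABYTE_DECIMAL) / GIGABYTE_DECIMAL * 100


print(GIGABYTE_BINARY)                 # 1073741824
print(round(difference_percent(), 2))  # 7.37
```

The difference of roughly 7 percent matters when sizing backup media against reported disk capacity.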

incremental dump
A duplicate copy of the files that have changed since a certain date. An incremental dump is used for archival purposes. Contrast with full dump.
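
The selection step of an incremental dump can be sketched as follows. This is a minimal illustration, not the behavior of any particular dump utility: the function name files_changed_since is hypothetical, and it assumes that a file's modification time is a reliable indicator of change.

```python
import os
from pathlib import Path


def files_changed_since(root, cutoff_timestamp):
    """Return files under `root` modified after `cutoff_timestamp`.

    `cutoff_timestamp` is seconds since the epoch, for example the time
    of the previous dump.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            # Compare the file's last-modification time to the cutoff.
            if path.stat().st_mtime > cutoff_timestamp:
                changed.append(path)
    return sorted(changed)
```

A full dump, by contrast, would copy every file under the root regardless of its modification time.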

internet
A collection of networks interconnected by a set of routers that enable them to function as a single, large virtual network.

I/O
Input/output. Refers to equipment used to communicate with a computer, the data involved in that communication, the media carrying the data, and the process of communicating that information.

interprocess communication (IPC)
The process of sharing data between processes and, when necessary, coordinating access to the shared data.

kernel
The core of the operating system software. The kernel manages the hardware (for example, processor cycles and memory) and supplies fundamental services, such as filing, that the hardware does not provide.

linear recording technology

local area network (LAN)
A group of computer systems in close proximity that can communicate with one another via some connecting hardware and software.

local host
The CPU or computer on which a software application is running; the workstation.

master
An SBus device capable of initiating an SBus transaction. The term CPU master is used when a host CPU must be distinguished from a more generic SBus master. The term DVMA master is used when explicitly excluding CPU masters. Any SBus master may communicate with any other slave on the same bus, regardless of system configuration.

MPEG
Moving Picture Experts Group. This group has developed standards for compressing moving pictures and audio data and for synchronizing video and audio datastreams. The MPEG standard is similar to CCITT H.261 encoding, with compressed data rates in the range of 1 to 1.5 Mbits per second. MPEG images are 352 by 240 pixels.

MTBF
Mean time between failures. For a stated period in the life of a functional unit, the mean value of the lengths of time between consecutive failures under stated conditions.
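
The definition translates directly into arithmetic. Below is a minimal sketch, assuming failures are recorded as timestamps in a single unit such as hours; the function name mean_time_between_failures is illustrative.

```python
def mean_time_between_failures(failure_times):
    """Mean of the gaps between consecutive failure timestamps."""
    if len(failure_times) < 2:
        raise ValueError("at least two failures are needed to measure a gap")
    ordered = sorted(failure_times)
    # Length of time between each failure and the next one.
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


# Failures observed at hours 100, 400, and 1000: gaps of 300 and 600 hours.
print(mean_time_between_failures([100, 400, 1000]))  # 450.0
```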

point-to-point protocol (PPP)
The successor to SLIP, PPP provides router-to-router and host-to-network connections over both synchronous and asynchronous circuits.

RAID
Redundant array of inexpensive disks. A subsystem for expanding disk storage. Used in the SPARCstorage Array Subsystem for Disk Expansion.

RPC
Remote procedure call. An easy and popular paradigm for implementing the client-server model of distributed computing. A request is sent to a remote system to execute a designated procedure, using arguments supplied, and the result is returned to the caller. There are many variations and subtleties, resulting in a variety of different RPC protocols.

SCSI
Small computer system interface. An industry-standard bus used to connect disk and tape devices to a workstation.

slave
An SBus device that responds with an acknowledgment to a slave select and address strobe signal. Any SBus master may communicate with any other slave on the same bus, regardless of system configuration.

superblock
A block on the disk that contains information about a file system, such as its name, size in blocks, and so on. Each file system has its own superblock.

swap
To write the active pages of a job to external storage (swap space) and to read pages of another job from external page storage into real storage.

TCP
Transmission control protocol. The major transport protocol in the Internet suite of protocols, providing reliable, connection-oriented, full-duplex streams. Uses IP for delivery.

WAN
Wide-area network. A network consisting of many systems that provide file transfer services. This network may cover a large physical area, sometimes spanning the globe.