Disaster Tolerance Construction Plan
Chapter 1 Methods of the Construction of Disaster Tolerance Centers
Disaster-tolerant construction projects and business continuity projects are closely tied to users' business requirements and application status, and they involve a wide range of technologies, products, and suppliers. They are therefore complex to build and carry high risk. To reduce project risk and ensure that a disaster tolerance center is built successfully, it is very important to select experienced partners and to follow a mature, practical methodology for the disaster-tolerant construction of information centers.
For enterprises' disaster tolerance center construction and business continuity construction, EMC provides the BCSI (Business Continuity Solution Integration) methodology, which unfolds as follows:
As shown above, EMC divides an enterprise's business continuity or disaster-tolerant system construction into three phases (not including startup and other preliminary work):
• Plan: scientific planning is the prerequisite for project success. In this phase we evaluate and analyze the current status of the enterprise's IT system and clearly define the requirements that the enterprise's business development imposes. With specific requirements in hand, we can select appropriate technologies, design the technical architecture, settle on a suitable technical scheme, and purchase the corresponding products.
• Build: in this phase, the main tasks are constructing and testing the technology platform (including integration, data migration, etc.) and building a complete "Disaster Recovery Plan (DRP)" or "Business Continuity Plan (BCP)". Provided the planning was scientific and reasonable, the build phase will be comparatively orderly.
• Manage: for disaster-tolerant construction projects and business continuity projects, building the disaster-tolerant technology platform and establishing the related personnel and process requirements is just the beginning, not the end. They must be regularly updated and maintained to meet changing business requirements.
What connects the three phases of "plan, build and manage" is the "project management and service integration capability". Disaster tolerance and business continuity construction involve a wide range of technologies and products, so different business applications may adopt different technical schemes provided by different manufacturers. Because the businesses are interconnected, these technical schemes are also closely related to, or even dependent on, each other. Meanwhile, many suppliers provide service support during disaster-tolerant construction. The capability to coordinate multiple parties, to control the implementation process and quality of the project in a unified way, and to schedule the multi-party services in an integrated manner is therefore essential to "project management and service integration", and it helps guarantee that the project is of high quality and completed on schedule.
EMC's BCSI methodology defines all the tasks and steps to be completed in each phase (ten steps in total) and defines each step in further detail. The following chapters provide detailed flow diagrams for the fields related to the project. To meet the needs of different clients, EMC will, where necessary, tailor the steps to the client's actual situation in accordance with the methodology above, so as to build disaster-tolerant systems and business continuity schemes for users in a planned and systematic manner.
Chapter 2 The General Technical Framework for Disaster Tolerance
2.1 The protection levels of enterprise information systems
The protection and recovery of the IT platforms in a modern enterprise's data center (including the host platform, network platform, storage platform, and so on) can use technical means at different levels. As enterprises build out business continuity, the protection and recovery level of their information and data will need to be upgraded continually.
Different levels of protection for data centers
As shown above, the protection of the IT systems and business data in the enterprise's centralized data center can adopt many different levels of protection schemes, which can be mainly classified as local protection and remote protection.
The operation-oriented protection and recovery for enterprises' data centers includes the following three levels:
1. Platform protection: primarily the high availability of the platform. For instance, host cluster systems and highly available storage platforms (including a highly available SAN environment and storage system) protect IT platforms from single points of failure and keep businesses and applications highly available.
2. Data backup: a regular local backup of business data. Data backup provides reliable data protection in the event of a physical or logical failure of the IT system.
3. Data recovery: fast and predictable recovery in the event of data errors or losses, so as to shorten the downtime of the IT system and reduce its impact on business operation.
After building sound local protection and recovery, enterprises need to plan and build disaster-oriented protection and recovery for "remote" data and businesses. It includes the following three levels:
1. Remote information protection: all the important data of the enterprise is stored securely at remote sites to protect it from catastrophic events.
2. Remote automatic processing: in addition to providing remote protection for production data, system switching and failback, data recovery, and other work are carried out automatically so that business operation resumes quickly when a catastrophic event happens.
3. Multiple data center protection: building multiple data centers and using data protection and recovery technology across them guards against a wider range of catastrophic events.
2.2 The model of disaster-tolerant technology
The construction of disaster-tolerant technology platforms is an important foundation of an enterprise's business continuity construction. EMC divides the IT platform of an enterprise into three parts: the access platform, the application platform, and the data platform. EMC suggests that enterprises focus on protecting these three areas when building disaster-tolerant technology platforms.
The diagram of the model of disaster-tolerant technology
2.2.1 The protection of business platforms: the redundancy of business processing capability
When a disaster-tolerant technical scheme is built, protecting the enterprise's business platform mainly means providing redundancy and reuse of business processing capability. This involves:
• The server, operating system, and other system software that support the operation of the application system
• The storage that supports the operation of the application system, and the connection between the storage and the server (storage network, etc.)
• The IP network system that connects servers
• The middleware or database that supports the implementation of the application system
• The application software system that implements the business logic
Clients should configure application servers, middleware, and databases in the disaster tolerance center with the same manufacturer, version, and configuration as those in the production center being protected. This ensures that the software operating environment is the same in the primary data center and the disaster tolerance center.
EMC Consulting Services will be able to conduct an investigation and evaluation on all the above aspects for clients, analyze the current status and specific technical requirements of the business platform in the client's current production center, and put forward specific requirements for the building of the disaster-tolerant schemes.
2.2.2 The protection of data platforms: the replication of business status data
In the disaster-tolerant system, the protection of data platforms mainly refers to the protection, backup, recovery, and replication of business status data. The business status data that needs to be protected includes:
• Business transaction status (held in files, databases, etc.)
• System status: the initial data and parameter settings of application software, as well as the configuration data and parameter settings of system software, etc.
• Intermediate data (or temporary data)
During the construction of a disaster-tolerant system, the protection of data platforms is the core of the enterprise's disaster recovery, and data security must always come first. Disaster-affected business applications can be recovered at the disaster tolerance center only if the data that supports the enterprise's business operation has been replicated there completely and on time.
Based on the results of requirement analysis, EMC will use different data replication methods for an enterprise's applications or business units according to their level of importance. For different types of applications, EMC will also choose the data replication method in view of their access characteristics and other factors.
2.2.3 The redundancy and switching of the access platform
The access platform needs to achieve the redundancy and switching of the external interface in the disaster-tolerant backup system. That involves:
• The switching of applications' data interfaces: file transfer, message mechanisms, etc.
• The switching of applications' connection interfaces: HTTP connections, database connections, remote procedure calls, object calls, etc.
• The redundancy and switching of network connections: metropolitan area network connections, dial-up connections, etc.
The key to the enterprise's "redundancy and switching of the access platform" is to configure network devices with the same access function in the disaster tolerance center, and to design the network configuration so that network access can be switched quickly and conveniently from the primary production center to the backup production center.
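The switching of a connection interface described above can be pictured as choosing the first reachable endpoint from an ordered list. The following is only a minimal sketch: the host names and the `is_reachable` probe are hypothetical, and a real deployment would more likely perform the switch at the network or name-resolution layer rather than in client code.

```python
from typing import Callable, Optional, Sequence

# Hypothetical endpoints, listed in order of preference:
# the primary production center first, then the disaster tolerance center.
ENDPOINTS = ("app.primary.example", "app.dr.example")


def select_endpoint(
    endpoints: Sequence[str],
    is_reachable: Callable[[str], bool],
) -> Optional[str]:
    """Return the first reachable endpoint, or None if all are down."""
    for endpoint in endpoints:
        if is_reachable(endpoint):
            return endpoint
    return None


# Simulated outage of the primary center: only the backup endpoint answers.
down = {"app.primary.example"}
active = select_endpoint(ENDPOINTS, lambda host: host not in down)
print(active)
```

Keeping the preference order in one list means failback is the same operation as failover: once the primary endpoint is reachable again, it is selected first.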
2.3 The disaster-tolerant mode
Based on the "current status assessment and business requirement analysis" in the early stage of the project, we can analyze the requirements comprehensively from multiple perspectives, such as disaster-tolerant level, disaster-tolerant scope, operation mode, and disaster-tolerant scale. From this we can derive the disaster-tolerant mode and operation mode that meet users' requirements for disaster tolerance.
2.3.1 The disaster-tolerant levels
Disaster-tolerant construction can be divided into different levels according to how long business recovery takes:
1. Disaster protection of data alone: this level ensures only the integrity of data. We only need to configure a storage platform in the disaster tolerance center and replicate the data to it remotely. Investment is reduced, but business recovery time is very long (usually more than 3 days). This mode simply replicates the data of the production center in its entirety to the disaster tolerance center. It is the lowest level of remote disaster tolerance and the most basic mode, and it lays the foundation for higher-level modes.
When a disaster happens, the disaster protection of data alone cannot guarantee the continuity of business; it can only ensure that the data is available, and it guarantees the integrity of business data only if the technical strategy is appropriate. This mode has the following characteristics:
• Business recovery is slow, and RTO is usually more than 72 hours
• Business recovery is very difficult and new devices are needed
• The implementation of technology is not difficult
• The operation and maintenance cost is relatively low
• Less investment is needed
2. Disaster protection of data and applications: in addition to providing disaster protection for data, this level achieves high availability of applications and ensures the quick recovery of business. The applications of the disaster-tolerant system do not change the original business processing logic; the disaster-tolerant system is essentially a replica of the production center system. This mode has the following characteristics:
• Business recovery is fast, and RTO is usually less than 24 hours or sometimes just a few hours
• The business recovery process is relatively simple
• The implementation of technology is difficult
• The operation and maintenance cost is relatively high. For example, more software version management, software deployment, and maintenance personnel are needed.
• More investment is needed
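The trade-off between the two levels can be sketched as a simple lookup from the required recovery time objective (RTO) to the lowest-cost level that could plausibly meet it. The thresholds come from the rough figures quoted above; the labels `data-level` and `application-level` and the function name are shorthand introduced here for illustration only.

```python
# Rough RTO figures taken from the two levels described above.
DATA_LEVEL_RTO_HOURS = 72         # data alone: usually more than 72 hours
APPLICATION_LEVEL_RTO_HOURS = 24  # data and applications: usually under 24 hours


def minimum_level(required_rto_hours: float) -> str:
    """Return the lowest-cost level that can plausibly meet the required RTO."""
    if required_rto_hours > DATA_LEVEL_RTO_HOURS:
        # The business can tolerate more than three days of downtime.
        return "data-level"
    # Anything tighter needs standby application platforms as well.
    return "application-level"


for rto in (96, 24, 4):
    print(rto, minimum_level(rto))
```

For requirements well below `APPLICATION_LEVEL_RTO_HOURS`, application-level protection is necessary but may not be sufficient on its own; whether a few hours can actually be reached depends on the specific design and has to be verified in testing.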
2.3.2 The disaster-tolerant scope
Based on the results of business impact analysis, the businesses covered by the disaster-tolerant backup and storage platform project are classified into two major categories: critical business and non-critical business. The business types that need disaster-tolerant protection can then be selected to meet different requirements. We can build disaster tolerance for critical business first, and extend it to full business in the future.
• The disaster tolerance for critical business: while defining business requirements, determine which businesses are critical through business impact analysis.
• The disaster tolerance for full business.
2.3.3 Same-level or degraded disaster tolerance
Depending on the processing capacity configured at the disaster tolerance center, the design can be classified as same-level disaster tolerance or degraded disaster tolerance. In the future disaster tolerance center, every business system that needs disaster-tolerant protection will be equipped with a business processing platform. If the platform has the same processing capacity and high availability as the production center (mainly host performance, high-availability clustering, etc.), we call it same-level disaster tolerance; if it has lower processing capacity than the production center or reduced high availability (for example, no cluster), we call it degraded disaster tolerance. Whether to adopt same-level or degraded disaster tolerance depends on business requirements and the investment budget, because degraded disaster tolerance saves investment (mainly in hosts).
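The distinction above can be written down as a small classification rule. This is a sketch under simplifying assumptions: processing capacity is reduced to a single number in an arbitrary unit, and high availability to a single "clustered" flag; both fields and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    capacity: float  # processing capacity in any consistent unit (hypothetical)
    clustered: bool  # whether a high-availability cluster is configured


def tolerance_mode(production: Platform, backup: Platform) -> str:
    """Classify the backup platform relative to the production platform."""
    same_capacity = backup.capacity >= production.capacity
    # Availability is preserved if the backup is clustered, or if the
    # production platform was never clustered in the first place.
    same_availability = backup.clustered or not production.clustered
    if same_capacity and same_availability:
        return "same-level"
    return "degraded"


production = Platform(capacity=100, clustered=True)
print(tolerance_mode(production, Platform(capacity=100, clustered=True)))
print(tolerance_mode(production, Platform(capacity=50, clustered=True)))
print(tolerance_mode(production, Platform(capacity=100, clustered=False)))
```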
Chapter 3 Introduction to Different Disaster-Tolerant Technologies
3.1 An overview of different disaster-tolerant technical schemes
Enterprises differ in business requirements and application characteristics, and so in their disaster-tolerant technical requirements, which means a variety of disaster-tolerant technologies can be used to build a disaster-tolerant system. EMC's professional Consulting Services will provide different technical schemes to satisfy clients' actual needs. Whatever disaster-tolerant technology platform a client builds, the technical core of the scheme is the protection of data: the capability to replicate data remotely and, when a disaster happens, to use the replicated data to support the enterprise's business operation from the remote site. Data replication technology is therefore the core of the disaster-tolerant technology platform. Different data replication technologies are classified as follows:
As shown above, it is more feasible to adopt the continuous data replication technology for a disaster-tolerant project.
Because the remote data replication technologies used in different disaster-tolerant schemes operate at different levels of the enterprise's IT architecture, the schemes can be classified into the following three types:
• Storage-based disaster-tolerant scheme: use the remote data replication function of the storage system to build a disaster-tolerant system. It includes:
  ◦ Data replication between homogeneous storage platforms;
  ◦ Data replication between heterogeneous storage platforms by means of virtualized storage technology.
• Host-based disaster-tolerant scheme: use functional software provided by the host manufacturer, or third-party host software, to achieve remote data replication and then build the disaster-tolerant system.
• Application-based disaster-tolerant scheme: use the remote data replication technology built into the application software (such as an Oracle database) to build the disaster-tolerant system.
This chapter analyzes the three disaster-tolerant schemes above: the storage-based scheme, the host-based scheme, and the application-based scheme (taking Oracle Data Guard as an example).
When making its assessment, EMC will consider each client's actual needs and technical conditions so as to find the most appropriate disaster-tolerant technology scheme.
3.2 Use storage-based data replication technology to build a disaster-tolerant system
The technical core of the storage-based disaster-tolerant scheme is to achieve the remote copy of production data by using the disk array-to-disk array data block replication technology of the storage array itself so that the disaster protection of production data is achieved. If there is a disaster in the primary data center, the data of the disaster backup center can be used to build an operation support environment in the disaster backup center, which can provide IT support for the continuous operation of business. Meanwhile, the data of the disaster backup center can also be used to recover the business system of the primary data center, so that the business operation of the enterprise can quickly revert to its normal state before the disaster.
The diagram of the storage-based disaster-tolerant scheme is as follows:
The diagram of the disaster-tolerant scheme which relies on the data replication technology of storage
The scheme that relies on storage-based data replication is widely adopted by financial enterprises, telecommunication companies, and government bodies, and there are many application cases, so it is one of the technical options for disaster-tolerant construction.
The storage-based replication can be the "one-to-one" mode as the above diagram shows, or "one-to-many or many-to-one" mode in which the data of one storage is replicated to multiple remote storages or the data of multiple storages is replicated to the same remote storage. Moreover, the replication can be bidirectional.
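The replication layouts named above can be told apart by looking at the set of (source, target) links between arrays. The sketch below is purely illustrative; the array names and the `replication_mode` function are hypothetical and do not correspond to any product's configuration interface.

```python
from typing import Iterable, Tuple


def replication_mode(pairs: Iterable[Tuple[str, str]]) -> str:
    """Classify a set of (source, target) replication links."""
    links = set(pairs)
    if not links:
        return "none"
    # A link that also exists in the reverse direction is bidirectional.
    if any((target, source) in links for source, target in links):
        return "bidirectional"
    sources = {source for source, _ in links}
    targets = {target for _, target in links}
    if len(sources) == 1 and len(targets) == 1:
        return "one-to-one"
    if len(sources) == 1:
        return "one-to-many"
    if len(targets) == 1:
        return "many-to-one"
    return "mixed"


print(replication_mode([("A", "B")]))
print(replication_mode([("A", "B"), ("A", "C")]))
print(replication_mode([("A", "C"), ("B", "C")]))
print(replication_mode([("A", "B"), ("B", "A")]))
```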
The storage-based disaster-tolerant scheme has two modes: synchronous and asynchronous. They are illustrated as follows:
In the synchronous mode, the disk arrays of the primary and backup centers update data synchronously. After an I/O from the application system is written to the primary disk array (to its cache), the primary disk array uses its own mechanism (such as EMC's SRDF/S) to send the write I/O to the backup disk array, and waits for the backup array's confirmation before reporting to the application system that the write operation is complete.
In the asynchronous mode, after an I/O from the application system is written to the primary disk array (to its cache), the primary disk array immediately reports "write operation is completed" to the host application system, so the host application can continue its read and write I/O. In the background, the primary center's disk array uses its own mechanism (such as EMC's SRDF/A) to send the write I/O to the backup disk array to achieve data protection.
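The difference between the two write paths can be made concrete with a toy model. This is a sketch only: it does not model SRDF/S, SRDF/A, or any real array, and the class and method names are invented for illustration. It shows just the ordering of "acknowledge" and "replicate" in each mode, and what that ordering means for writes that are acknowledged but not yet protected.

```python
from collections import deque


class ReplicatedArray:
    """Toy primary array that replicates writes to a backup array."""

    def __init__(self, synchronous: bool) -> None:
        self.synchronous = synchronous
        self.primary = {}       # block address -> data on the primary array
        self.backup = {}        # block address -> data on the backup array
        self.pending = deque()  # writes acknowledged but not yet replicated

    def write(self, address: int, data: str) -> str:
        """Handle one application write and return the acknowledgement."""
        self.primary[address] = data
        if self.synchronous:
            # Synchronous: the backup must hold the data before the ack.
            self.backup[address] = data
        else:
            # Asynchronous: acknowledge immediately, replicate later.
            self.pending.append((address, data))
        return "write completed"

    def replicate_pending(self) -> None:
        """Ship queued writes to the backup array (asynchronous mode only)."""
        while self.pending:
            address, data = self.pending.popleft()
            self.backup[address] = data

    def unprotected_writes(self) -> int:
        """Writes that would be lost if the primary failed right now."""
        return len(self.pending)


sync_array = ReplicatedArray(synchronous=True)
async_array = ReplicatedArray(synchronous=False)
for array in (sync_array, async_array):
    array.write(0, "order-1")
    array.write(1, "order-2")

print(sync_array.unprotected_writes())   # 0
print(async_array.unprotected_writes())  # 2
async_array.replicate_pending()
print(async_array.backup == async_array.primary)  # True
```

In the synchronous case the backup copy never lags the primary, at the price of waiting for the remote array on every write. In the asynchronous case the application is never held up, but whatever sits in `pending` at the moment of a disaster is lost.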
The synchronous mode keeps the data in the backup disk array synchronized with the data in the production system, so no data is lost when a disaster strikes the production data center. To limit its impact on the performance of the production system, the synchronous mode is usually used over short distances (an FC connection is usually within 200 km, and actual user deployments are usually around 35 km).
In the asynchronous mode, applications do not have to wait for the remote update to complete, so remote data backup usually has only a slight effect on the performance of the production system, and the distance between the backup disk array and the production disk array is theoretically unrestricted (asynchronous replication can run over an IP connection).
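The reason synchronous replication is distance-sensitive can be estimated with simple arithmetic: every synchronous write must wait at least one round trip to the backup array. The sketch below assumes a propagation speed in optical fiber of roughly 200,000 km/s (about 5 microseconds per kilometer), which is a general physical approximation and not a figure from this document; it ignores switch, protocol, and array processing delays, so real penalties are higher.

```python
# Assumption: light in optical fiber covers roughly 200,000 km/s,
# i.e. about 5 microseconds per kilometer in one direction.
FIBER_DELAY_US_PER_KM = 5.0


def sync_write_penalty_ms(distance_km: float) -> float:
    """Minimum extra latency per synchronous write from propagation alone."""
    round_trip_us = 2 * distance_km * FIBER_DELAY_US_PER_KM
    return round_trip_us / 1000.0


for distance_km in (35, 200, 1000):
    print(f"{distance_km} km: {sync_write_penalty_ms(distance_km):.2f} ms")
```

Under these assumptions, a 35 km link adds about 0.35 ms to each write and a 200 km link about 2 ms, while 1000 km would add about 10 ms per write. Asynchronous replication avoids this penalty because the application does not wait for the round trip.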
The prerequisites and limitations of a disaster-tolerant scheme that relies on storage-based data replication are:
• Under normal conditions, the primary and backup sites must use storage platforms from the same manufacturer and the same product series, which limits users' choice of storage platform to some extent.
• The synchronous mode may affect the performance of the production system and places high demands on communication links. Because of the distance limit, it usually works only at close range (city-wide or campus disaster-tolerant schemes).
• Like other asynchronous disaster-tolerant schemes, the asynchronous mode carries a risk of data loss. It is usually used when the bandwidth of remote communication links is limited.
Despite the above limitations, the storage-based scheme remains the most preferred disaster-tolerant technology platform at present, and disaster-tolerant schemes built on EMC's storage systems are especially widely used. The main reason is that storage-based schemes have the following advantages:
• Storage-based data replication is independent of the host platform and applications, so it can serve a variety of applications and consumes none of the host's processing resources;
• Storage-based data replication works at the most basic level, and its operation is the least affected by applications, the host environment, and other related technologies, so it is well suited to complex environments with many hosts and business systems. Using this mode effectively lowers the difficulty of implementation and management;
• The synchronous mode can ensure that no data is lost. In city-wide or campus disaster-tolerant schemes, as long as the bandwidth of communication links permits, we can therefore adopt the synchronous scheme without a significant impact on the performance of the primary data center's production system. EMC's storage-based synchronous replication technology has been applied in many disaster-tolerant cases, including Jiangsu Mobile, China Everbright Bank, Liaoning Mobile, and Heilongjiang Mobile, which provides rich successful experience. The technology can also meet the requirements of synchronous data replication under large-scale I/O throughput. City-wide disaster-tolerant environments already meet the above conditions, so the synchronous mode of replication is convenient to deploy;
• The asynchronous mode carries some risk of data loss, but it has no distance limit, so remote protection can be achieved. The remote data center will use the asynchronous replication mode between two centers (the remote data center itself and the center in Beijing) for data protection;
• The data of the disaster backup center can be used effectively.
In all three disaster-tolerant schemes (application-based, host-based, and storage-based), the data of the disaster backup center is normally not accessible; it only provides disaster protection and recovery for the data in the production system. In a storage-based scheme, however, some very flexible technologies let us make full use of the data in the disaster backup center, improving the efficiency of the enterprise's business operation and bringing more return on investment. The process is illustrated in the following diagram:
Efficient utilization of disaster backup data in the storage-based disaster-tolerant scheme
As shown in the above diagram, the "source data-R1" of the production center is replicated to the disaster backup center, where it becomes the "target data-R2", through the data replication mechanism of the storage itself. The "target data-R2" is not accessible under normal production conditions. Only when the services of the primary center stop in the event of a disaster can the backup host of the disaster backup center access the target data and take over the services of the primary center (in this respect it resembles the data of the disaster backup center in host-based and application-based schemes). In a storage-based scheme, however, we can create a BCV volume, snapshot, or clone of the target data so that it can be used by other servers.
With this mechanism, users can do a lot of work in the disaster tolerance center:
• Development and testing personnel of the user can employ R2-BCV or R2 snapshots to get real data for the development and testing of new applications, so as to ensure the quality of new applications and shorten the time-to-market of new products. This mode is difficult to realize in the host-based and application-based schemes, or it will take a very long time and consume a lot of resources to get real data for development and testing.
• Other applications of the user can also employ R2-BCV or R2 snapshots to meet other business needs. For example, a data warehouse application usually needs to extract data from the production system, and a large-scale extraction can bring the production system to a standstill. The R2-BCV volume can be used for the extraction instead, avoiding the heavy impact on the performance of the production system. The data source of the enterprise's decision analysis system can also be obtained through R2-BCV.
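The idea behind using a copy of the R2 data can be sketched with a toy volume model. This is not how a BCV volume or array snapshot is implemented; the `Volume` class and its method are hypothetical and only illustrate the key property: the copy is frozen at one point in time and is independent of the replication target, so test or data warehouse servers can read and even modify it while R2 continues to follow production.

```python
import copy


class Volume:
    """Toy volume: a mapping from block address to data."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks = {}

    def point_in_time_copy(self, name: str) -> "Volume":
        """Return an independent copy frozen at the current moment."""
        clone = Volume(name)
        clone.blocks = copy.deepcopy(self.blocks)
        return clone


r2 = Volume("R2")  # replication target, not accessed by hosts directly
r2.blocks = {0: "order-1", 1: "order-2"}

test_copy = r2.point_in_time_copy("R2-copy")  # handed to a test or DW server

r2.blocks[2] = "order-3"        # replication from R1 keeps updating R2
test_copy.blocks[0] = "masked"  # the test server may change its own copy

print(sorted(r2.blocks))         # [0, 1, 2]
print(sorted(test_copy.blocks))  # [0, 1]
print(r2.blocks[0])              # order-1
```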
Due to these advantages, the storage-based disaster protection scheme is the most widely used disaster protection scheme at present.
3.3 Use virtualized storage technology to build a disaster-tolerant system
Storage virtualization maps the heterogeneous storage devices in a system into a single storage pool that is completely transparent to users, shielding the heterogeneity of storage devices and hosts. With virtualization, users can unify the existing heterogeneous storage resources within a SAN into a single-view storage pool, and technologies such as striping, LUN masking, and zoning let them partition and allocate the pool to meet their own needs. In this way, users' existing investments are protected and the total cost of ownership (TCO) is reduced. In addition, the storage visible to a server can grow and shrink dynamically and transparently to meet business needs.
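The single-pool idea can be illustrated with a small allocation sketch. It is a deliberately simplified model with hypothetical names (`StoragePool`, `vendor_a_array`, and so on): capacity from several back-end arrays is presented as one total, and a virtual volume is carved out of whichever arrays have free space, without the consumer needing to know which array holds which extent. Real virtualization layers also handle striping, masking, and data placement policies that are not modeled here.

```python
from typing import Dict, List, Tuple


class StoragePool:
    """Toy pool that hides several back-end arrays behind one capacity view."""

    def __init__(self, free_gb_by_array: Dict[str, int]) -> None:
        self.free_gb = dict(free_gb_by_array)
        self.volumes: Dict[str, List[Tuple[str, int]]] = {}

    def total_free_gb(self) -> int:
        """Free capacity of the whole pool, regardless of back-end array."""
        return sum(self.free_gb.values())

    def allocate(self, volume: str, size_gb: int) -> List[Tuple[str, int]]:
        """Carve a virtual volume out of whichever arrays have free space."""
        if size_gb <= 0:
            raise ValueError("size must be positive")
        if size_gb > self.total_free_gb():
            raise ValueError("not enough free capacity in the pool")
        extents = []
        remaining = size_gb
        for array, free in self.free_gb.items():
            if remaining == 0:
                break
            take = min(free, remaining)
            if take:
                extents.append((array, take))
                self.free_gb[array] = free - take
                remaining -= take
        self.volumes[volume] = extents
        return extents


pool = StoragePool({"vendor_a_array": 500, "vendor_b_array": 300})
extents = pool.allocate("db_vol", 600)
print(extents)               # [('vendor_a_array', 500), ('vendor_b_array', 100)]
print(pool.total_free_gb())  # 200
```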
Through the storage virtualization technology, the remote replication of data can be achieved to ensure that the data of the disaster tolerance center is synchronized with that of the primary site. Then the disaster tolerance of data is realized.
Storage virtualization can be implemented at different levels, such as in intelligent switches, in the storage system, or in third-party devices. Data replication through virtualized storage technology also comes in synchronous and asynchronous schemes, so we need to select the right products for the specific needs.
Using the virtualized storage technology to build a disaster-tolerant scheme has the following advantages:
• The storage arrays of the primary production center and the disaster tolerance center can come from different manufacturers, so the choice of storage platform is not limited by the vendor of the existing storage platform (although products currently on the market cannot achieve this yet);
• It can provide a unified management interface for the storage arrays of different manufacturers.
In a virtual storage environment, whatever devices the back-end physical storage uses, the server and its applications always see a logical image of a storage device they are familiar with. Even if the physical storage changes, this logical image stays the same. The system administrator therefore no longer needs to care about the back-end physical storage and can focus on managing the storage space. Storage management operations such as system upgrades, the setup and allocation of virtual disks, changes of RAID level, and expansion of storage space all become easier than with previous products, so storage management becomes simple.
The following issues need to be considered when using the virtualized storage technology to build a disaster-tolerant scheme:
• The virtualized storage technology is relatively new. Although it is designed for heterogeneous environments, ensuring compatibility and data integrity in such environments still carries high risk.
• When adopting the virtualized storage technology, especially by adding third-party hardware, we need to assess its impact on the high availability and performance of the system as a whole.
• We need to verify the maturity of the selected products and technologies and their compatibility with existing and future devices, and especially whether in practice they can satisfy the requirements of complex environments and large-scale disaster tolerance.
• The virtualized storage technology is not yet mature and is still under development. At present, there are no cases of deploying a disaster-tolerant scheme based on virtualized storage technology in a heterogeneous storage environment.
3.4 Use host-based data replication technology to build a disaster-tolerant system
The diagram of the host-based disaster-tolerant scheme is as follows:
The diagram of the host-based disaster-tolerant scheme
The core of the host-based disaster-tolerant mode is to establish a data transmission channel between the host systems of the primary and backup centers over an IP network, and to replicate data remotely through data management software on the hosts. When the data in the primary data center is destroyed, applications or data can be recovered from the backup center at any time, giving the enterprise disaster-tolerant capability for its application systems.
Many products can serve as the data management software for remote data replication. Host vendors and some third-party software companies (such as Veritas) provide host-based data replication schemes. For example, Sun's Availability Suite, Veritas Volume Replicator (VVR), and other software can achieve host-based remote data replication and so build a host-based disaster-tolerant system.
Using host-based data replication technology to build a disaster-tolerant scheme has the following advantages:
• The most significant advantage of the host-based scheme is that it only depends on server platforms and the host software, and it is completely independent of the underlying storage platform, so the production data center and the backup data center can adopt different storage platforms;
• There are not only disaster-tolerant protection schemes for database but also disaster-tolerant protection schemes for file system;
• There are many different host-based schemes that can meet users' different requirements for data protection and provide a variety of modes for data protection;
• Such schemes rely on an IP network and have no distance limitation.
Meanwhile, using the host-based data replication technology to build a disaster-tolerant scheme has the following limitations:
• The host-based scheme requires the same kind of host platform;
• In the host-based data replication scheme, the production host has to handle both production requests and remote data replication, so replication consumes the host's own computing resources. Upgrading host memory and CPU to compensate is very costly, so the performance of the production host may be significantly, or even severely, affected;
• Since the data of the disaster backup center is usually unavailable, it will be very difficult for users to use production data for development, testing and DW/BI application in a remote data center;
• The host-based data replication scheme is relatively complex. When it is combined with database applications, a very complex mechanism or a combination of several software products is required, which will greatly affect the stability, reliability and performance of the production system;
• If there are multiple systems and applications that require disaster protection, the host-based scheme cannot be implemented as a single, unified technical scheme;
• The management is complex and requires a large number of human interventions, so it is easy to generate errors.
At present, relatively few enterprises have adopted host-based data replication technology to build a disaster-tolerant scheme, because such schemes are usually suitable only for a single application or system with a small I/O load. Where the I/O load of applications is large, the applications and application types that require disaster protection are numerous, and the host environment is complex, the host-based scheme is not applicable.
3.5 Use application-based data replication technology to build a disaster-tolerant system
There are also many types of application-based data replication technologies. In the following part, we use the common Oracle Data Guard technology built into Oracle 9i/10g as the example for analysis (the database mirroring technology of Microsoft SQL Server adopts a similar method).
Oracle Data Guard is a disaster backup and recovery technology specific to the Oracle database system, and it draws on the database's own log backup and recovery mechanism. The basic principle of Data Guard is to build a backup database system on hardware and operating system platforms identical to those of the primary system, and to back up the primary database's logs, control files and other critical files.
While the primary system is working, Data Guard continuously transfers the archive log files generated in the primary system to the backup database system, and uses these log files to perform continuous recovery operations in the backup database system, so as to keep the backup system consistent with the running primary system. When the primary system fails, the log files held at the backup site are used in the backup database to recover the data of the primary database.
Fig. 5.18. The disaster-tolerant scheme that adopts Oracle Data Guard
Oracle9i/10G Data Guard provides the following three modes:
• Maximum protection mode
• Maximum availability mode
• Maximum performance mode
The maximum protection mode of Oracle Data Guard provides the highest level of data availability for the primary database, so it is a disaster-tolerant scheme that guarantees no data loss. In the maximum protection mode, redo records are sent synchronously from the primary database to the backup database, and a transaction in the primary database cannot be committed until at least one backup database confirms that the transaction data has been received. In this mode, at least two backup databases should be configured to provide dual fault-tolerant protection. If no backup database is available, the primary database automatically suspends processing.
The maximum availability mode provides the second highest level of data availability for the primary database which can guarantee no data loss, and provide protection for the failure of a single component. Just like in the maximum protection mode, in this mode, the redo data is sent synchronously from the primary database to the backup database, and the transactions in the primary database cannot be committed until the backup database confirms that the transaction data was received. However, when the backup database becomes unavailable because of network connectivity and other problems, the processing operation of the primary database will continue. As a result, the backup database is temporarily inconsistent with the primary database. But once the backup database becomes available again, the database will synchronize itself automatically and no data will be lost.
The maximum performance mode is a default protection mode. Compared with the maximum availability mode, it provides a slightly weaker protection for the primary database, but it has a higher performance. In this mode, while the primary database is handling transactions, the log data are transferred asynchronously to the backup database. In the primary database, the commit operation does not need to wait for the confirmation of receipt from the backup database before completing the write operation. At any time, if the backup database is not available, the processing operation of the primary database will continue so that the performance will not be affected.
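The behaviour of the three modes described above can be summarised in a small model. This is only an illustrative sketch of the rules stated in the text, not Oracle code; the names `Mode` and `commit_behavior` and the three result fields are hypothetical and chosen for this example.

```python
from enum import Enum


class Mode(Enum):
    MAX_PROTECTION = "maximum protection"
    MAX_AVAILABILITY = "maximum availability"
    MAX_PERFORMANCE = "maximum performance"


def commit_behavior(mode: Mode, standby_reachable: bool) -> dict:
    """Simplified model of a primary-side commit in each protection mode.

    Result fields (hypothetical names for this sketch):
      waits_for_standby - commit blocks until a backup database acknowledges
      primary_continues - the primary keeps processing transactions
      sync_guaranteed   - the backup is guaranteed to hold every committed
                          transaction at this moment
    """
    if mode is Mode.MAX_PERFORMANCE:
        # Asynchronous shipping: the commit never waits, so the backup
        # may lag behind the primary at any time.
        return {"waits_for_standby": False, "primary_continues": True,
                "sync_guaranteed": False}
    if standby_reachable:
        # Both synchronous modes wait for the acknowledgement.
        return {"waits_for_standby": True, "primary_continues": True,
                "sync_guaranteed": True}
    if mode is Mode.MAX_PROTECTION:
        # No backup reachable: the primary suspends processing rather
        # than commit unprotected data.
        return {"waits_for_standby": True, "primary_continues": False,
                "sync_guaranteed": True}
    # Maximum availability with no backup reachable: processing continues
    # and the backup is temporarily inconsistent until it resynchronizes.
    return {"waits_for_standby": False, "primary_continues": True,
            "sync_guaranteed": False}


if __name__ == "__main__":
    for mode in Mode:
        for reachable in (True, False):
            print(mode.value, reachable, commit_behavior(mode, reachable))
```

The model makes the trade-off visible: only the maximum protection mode gives up primary availability to keep the guarantee, while the maximum availability mode gives up the guarantee temporarily to keep the primary running.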
Using Oracle 9i/10G Data Guard technology to perform disaster backup needs to meet the following prerequisites:
• The backup system is consistent with the primary system in terms of hardware platform, operating system and the version of operating system;
• The backup system has the same user permissions as the Oracle in the primary system;
• The backup system has the same database version as the Oracle in the primary system;
• The backup system has the same database configuration files as the Oracle in the primary system.
Using Oracle Data Guard to build a disaster-tolerant scheme has the following advantages:
• It totally depends on the Oracle database mechanism and is independent of other software and the underlying storage platform;
• It can meet the user's different requirements for performance and data protection, and provide a variety of data protection modes;
• It can achieve the one-to-many data replication to provide multiple protection;
• The backup database can be upgraded to its production status within a very short time (because the database has been working);
• It relies on the IP network and has no limitation of distance.
Meanwhile, using Oracle Data Guard to build a disaster-tolerant scheme has the following limitations:
• The three modes of Oracle Data Guard will affect the performance of the production database system, thus requiring more processing resources;
• Since the backup database is unavailable, it will be very difficult for users to use the production data for development, testing and DW/BI application in a remote data center;
• It can only provide protection for Oracle database data, but cannot provide disaster protection for other application data, such as file application;
• The management is complex and requires extensive manual intervention and proficiency in database recovery technology, so errors are easy to make;
• Initial synchronization of source and target databases with large data volumes is difficult, and there is no corresponding solution.
The advantages and limitations of other application-based disaster-tolerant schemes in the industry are almost the same as those of Oracle Data Guard. GoldenGate and Quest SharePlex are good examples and are briefly introduced below.
Their principle is similar to that of Oracle Data Guard: they replicate data incrementally from the database log and use queue technology to ensure reliable transmission. The advantages of this kind of scheme are:
• The same advantages as the Oracle Data Guard (see above);
• This scheme is more flexible and does not rely on the host system platform, so it is superior if the primary production host is different from the standby node host.
The disadvantages of this scheme are:
• The same disadvantages as Oracle Data Guard (see above);
• It can only work in asynchronous mode (because it relies on log and queue technology), so it cannot meet the requirements of city-wide disaster tolerance or of demanding disaster tolerance, such as no data loss;
• Oracle has not announced technical support or problem handling for this technical scheme, which increases the risk of this disaster-tolerant scheme.
3.6 The contents related to disaster-tolerant schemes
The results of current status assessment, requirement analysis and technology selection require that the design of disaster-tolerant technology schemes should include the following contents:
• The design of overall disaster-tolerant architecture
• The design of storage-level disaster-tolerant data replication scheme
• The design of application-level (or other methods) data replication scheme
• The planning and design of SAN Network
• The planning and design of IP Network
• The deployment scheme for the host and applications
• The system tuning (as required)
• The data migration scheme
• The deployment and planning of storage
• The design of backup system (as required)
• The design of computer room or the requirements for computer room environment
• And so on
The application-based, host-based and storage-based (including virtualized storage technology) disaster-tolerant schemes have their own scope of application and they are suitable for different disaster protection. The users should choose an appropriate disaster-tolerant protection scheme in view of specific actual demands.
Different users, different business systems and different applications have different requirements for disaster tolerance and require different levels of disaster-tolerant services. In the future, EMC will follow scientific processes and methods, and leverage its expertise and experience in the field of information storage and management to assess IT environment and to analyze business impact for users. Then EMC can identify the requirements for disaster-tolerant technology in view of clients' business requirements and recommend the most appropriate disaster-tolerant scheme to clients.
When choosing a disaster-tolerant scheme, an enterprise should not only consider how to choose an appropriate technical scheme, but also examine the products used for the scheme in terms of technical maturity and reliability, performance and flexibility. It is also necessary to examine whether the provider of the scheme has rich experience and proven skills to ensure the feasibility and success of the scheme.
EMC's technologies in the field of disaster tolerance are advanced and have been proven in the practical applications of numerous users. Moreover, the feasibility of its schemes and the maturity, stability, reliability and flexibility of its products have been tested in many practical applications. As EMC's technical service team has demonstrated strong technical strength during the successful implementation of many disaster-tolerant projects, it can guarantee the successful implementation of disaster-tolerant schemes for users.
Chapter 4 The Design of Disaster-tolerant Communication Links
The design of disaster-tolerant communication links is very important to the construction of a disaster-tolerant system, and it is also one of the difficult and key points in the design of a disaster-tolerant scheme, so it is covered in a separate chapter.
4.1 The overview of the design of disaster-tolerant communication links
For reference, there is an introduction to relevant technologies of link design:
If we use host-based or application-based disaster-tolerant technology to build a disaster-tolerant system, we need a standard IP network connection, so the communication links can be ATM, E1/E3 and IP. If we use storage-based technology or virtualized storage technology to build a disaster-tolerant scheme, we can use Fibre Channel, ESCON, DWDM, SONET and other communication links, or use ATM, E1/E3 and IP through an FCIP device.
Different communication links have different requirements, such as distance limitation and bandwidth capability; different disaster-tolerant technologies and applications have different requirements for communication links; and the requirements for communication links in synchronous data replication mode and asynchronous data replication mode are also different.
In a disaster-tolerant scheme, no matter which replication technology is used, the following issues need to be addressed.
According to the selected distance from the disaster-tolerant center at present:
• What kind of link is needed? How many links are needed? How much will they cost?
• What effect will such a distance have on applications? If the synchronous mode is adopted, will the response time become too long, and can the I/O throughput still meet requirements?
• If the asynchronous mode is adopted, what is the RPO and how much cache does it need?
• Does the link that is to be designed necessarily meet the intended goal?
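The question about synchronous response time can be approached with simple arithmetic. The sketch below is a rough estimate, assuming a propagation delay of about 5 microseconds per kilometre of optical fibre (light travels at roughly 200,000 km/s in glass); the number of protocol round trips per write and any equipment delay are assumptions that depend on the replication product and link devices, and the function names are hypothetical.

```python
FIBER_DELAY_US_PER_KM = 5.0  # assumed one-way propagation delay in fibre


def sync_write_penalty_ms(distance_km: float,
                          round_trips: int = 1,
                          equipment_delay_ms: float = 0.0) -> float:
    """Extra latency added to each synchronous write by the remote link.

    round_trips        - protocol round trips needed per replicated write
                         (assumption; product dependent)
    equipment_delay_ms - extra delay from switches, DWDM or FC/IP devices
                         (assumption; measure on the real link)
    """
    one_way_ms = distance_km * FIBER_DELAY_US_PER_KM / 1000.0
    return 2 * one_way_ms * round_trips + equipment_delay_ms


def max_serial_writes_per_second(local_write_ms: float,
                                 distance_km: float,
                                 round_trips: int = 1) -> float:
    """Upper bound on writes per second for one strictly serial write stream."""
    total_ms = local_write_ms + sync_write_penalty_ms(distance_km, round_trips)
    return 1000.0 / total_ms


if __name__ == "__main__":
    for km in (10, 50, 100, 500):
        penalty = sync_write_penalty_ms(km)
        rate = max_serial_writes_per_second(1.0, km)
        print(f"{km:>4} km: +{penalty:.2f} ms per write, "
              f"about {rate:.0f} serial writes/s")
```

Under these assumptions, a 100 km link adds about 1 ms to every synchronous write, which is why synchronous replication is usually restricted to shorter distances and asynchronous replication is used beyond that.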
Designing communication links that can meet different requirements of users in a scientific way is one of the important steps to ensure the successful construction of a disaster-tolerant system for users at a reasonable communication cost.
4.2 The comparison of different disaster-tolerant communication links
At present, the design of communication links for disaster-tolerant schemes usually adopts "the direct connection to switches through bare optical fibers, the connection to bare optical fibers through DWDM devices and the connection through an IP network". Each of the above modes has its own advantages and disadvantages. There is a comparison of different communication link modes below.
1. The direct connection to switches through bare optical fibers, which adopts the FC protocol
Communication links which adopt the FC protocol are only suitable for disaster-tolerant schemes that rely on storage replication or virtual storage replication. In such schemes, the optical fiber switches in the production center and the backup center are directly connected through bare optical fibers. The connection is shown in the following diagram:
The communication link mode of direct connection to switches through bare optical fibers.
The disaster-tolerant ports of the two centers' storage systems are connected by optical fiber switches and bare optical fibers, which guarantees the performance of synchronous or asynchronous data replication. In order to guarantee high availability, a redundant connection is usually adopted for link design. The bare optical fibers of the disaster-tolerant links either share a SAN switch with the production host, or use a standalone SAN switch (which also requires redundancy) or SAN router. In order to avoid mutual interference between disaster-tolerant communication links and host access to storage, it is common to use a standalone SAN for the connection between disaster-tolerant communication links.
Different disaster-tolerant schemes need different numbers of communication links. To get the specific number of links (i.e. bandwidth requirement), we need specific analysis and calculation.
2. The direct connection to bare optical fibers through CWDM/DWDM devices
By using wavelength division multiplexing, this mode can carry multiple protocols, such as the FC protocol and the IP protocol, over the same fibers. The connection is shown in the following diagram:
The communication link mode that uses CWDM/DWDM devices
As shown above, by using CWDM/DWDM technology, the IP network connection and the FC connection between the primary data center and the disaster-tolerant data center can be multiplexed over shared bare optical fibers. In this way, both the utilization of bare optical fibers and the multiplexing of multiple protocols are addressed. To avoid a single point of failure, we can also adopt redundant connections. Moreover, this mode allows more topological schemes, so the topology needs to be determined by analysis during the specific design.
3. The connection through an IP network, which adopts ATM or E1/E3 lines
The host-based and application-based disaster-tolerant schemes can use an IP network directly, which will not be further explained here. The disaster-tolerant technology that relies on storage or virtual storage requires the conversion of the FC protocol to the IP protocol, so that FC traffic can be carried over the IP network. This scheme adopts widely accepted IP network protocols and links. It encapsulates the FC protocol in IP packets through FC/IP conversion devices (such as Nishan) and transmits it over IP links. In theory, this scheme has no distance limitation, so it is suitable for remote asynchronous data replication, and it is cost-effective. The diagram of the connection is as follows:
The communication link mode that uses FC/IP conversion devices
4. Bandwidth of various communication links (just for reference)
Type of line
Actual bandwidth (after removing overhead) (Mbps)
Time needed for 1TB replication
• T1 - 1.544 megabits per second
• T3 - 43.232 megabits per second (28 T1s)
• OC3 - 155 megabits per second (84 T1s)
• OC12 - 622 megabits per second (4 OC3s)
• OC48 - 2.5 gigabits per second (4 OC12s)
• OC192 - 9.6 gigabits per second (4 OC48s)
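The "time needed for 1TB replication" column can be approximated from the line rates listed above. The following is a rough sketch only: it uses the rates as listed (not measured bandwidth after overhead), assumes 1 TB = 10^12 bytes, and introduces a hypothetical `efficiency` parameter to stand in for protocol overhead, which must be set from real measurements.

```python
# Line rates in Mbps, taken from the list above.
LINE_RATES_MBPS = {
    "T1": 1.544,
    "T3": 43.232,
    "OC3": 155.0,
    "OC12": 622.0,
    "OC48": 2500.0,
    "OC192": 9600.0,
}

TERABYTE_BITS = 8 * 10**12  # assumes 1 TB = 10^12 bytes


def hours_to_replicate(rate_mbps: float,
                       data_bits: float = TERABYTE_BITS,
                       efficiency: float = 1.0) -> float:
    """Hours needed to move data_bits over a link of rate_mbps.

    efficiency is the usable fraction of the line rate (assumption;
    1.0 means no overhead, which real links never achieve).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = data_bits / (rate_mbps * 1e6 * efficiency)
    return seconds / 3600.0


if __name__ == "__main__":
    for name, rate in LINE_RATES_MBPS.items():
        print(f"{name:>5}: {hours_to_replicate(rate):10.2f} hours "
              f"at full line rate")
```

The spread is large: at full line rate an OC48 moves 1 TB in under an hour, while a T1 needs on the order of two months, which matters most for the initial synchronization of a large data set.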
4.3 The estimation of bandwidth of disaster-tolerant communication links
To determine the performance configuration requirements of the storage system and the bandwidth requirements of communication links, we need to analyze and calculate on the basis of the actual situation in the user's data center. To estimate the bandwidth requirements of users' disaster-tolerant communication links accurately, we need to collect data about the I/O load of the applications that require disaster-tolerant protection in every center; that is, the I/O characteristics and load size of all these applications, especially the data about write I/O. We then combine the collected write I/O data with the disaster-tolerant data replication technology and mode (synchronous or asynchronous) that we use, and with the RTO/RPO requirements for application recovery, to calculate the bandwidth requirements of disaster-tolerant communication links.
EMC Company can provide standard methods and tools for clients to design the communication links of disaster-tolerant data replication. It usually adopts the following steps to estimate the bandwidth requirements of communication links in a disaster-tolerant scheme:
1. Collect the I/O performance data in the current production center
The collection is mainly about the I/O performance data of applications and host storages that require disaster-tolerant protection. The data is obtained from the following two aspects.
• Obtain I/O performance data from the host (such as, use IOSTAT and SAR on the UNIX platform to obtain I/O performance data, and use the Perfmon tool in the Windows server to obtain the I/O performance data of the Windows server);
• Obtain I/O performance data from the storage platform. By using the performance collecting tool of the storage platform, we can access the I/O distribution and I/O characteristics on each LUN of storage (EMC can provide complete tools for collecting the I/O performance information of storage platforms).
2. Use EMC's design software to filter I/O performance data and get the data about I/O write
The design of disaster-tolerant communication links is driven by the performance requirements of write I/O. Only write I/O is replicated to a remote disaster tolerance center, so the characteristics and load of write I/O determine the requirements for links. During this step, irrelevant data (such as the I/O of non-critical applications that do not require disaster tolerance) is filtered out, and we obtain the number of write I/Os per second, the average I/O block size for different application types, and whether there is a need for tuning. The following diagram shows a reference sample of write I/O performance data obtained through EMC's tools.
A reference sample for write I/O performance data (collected by EMC's tools)
3. Estimate the total peak bandwidth and average bandwidth of client applications with the collected I/O write performance data
4. Estimate the "latency" of disaster-tolerant communication in view of the type of disaster-tolerant links and the connection scheme.
The additional overhead of different communication protocols and the "latency" brought by physical links should be considered.
5. Estimate future performance growth requirements and peak space that needs to be reserved
The design of communication links (including all capacity planning) needs to consider the growth of future business and reserve the space for growth.
6. Determine whether to use synchronous replication mode or asynchronous replication mode. If we choose the asynchronous replication mode, we need to identify the RPO requirements (the maximum amount of data that may be lost). We can either derive the link requirements from the RPO requirements and the business I/O volume, or start from the existing links and the business I/O volume to analyze the RPO that can be achieved and the extra cache that asynchronous replication needs at the source end.
7. Use EMC's specialized tools to design
According to different replication modes, we can input the collected I/O performance and other parameters into EMC's tools and take into account the requirements for link redundancy so as to calculate the bandwidth requirements needed by clients.
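The arithmetic behind the steps above can be illustrated with a simplified calculation. This is a minimal sketch and not EMC's design tool: the names `WriteSample`, `throughput_mbps` and `required_link_mbps`, the default overhead and growth percentages, and the rule of sizing synchronous links for peak load and asynchronous links for average load are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WriteSample:
    """One sampling interval of filtered write I/O (hypothetical structure)."""
    writes_per_sec: float  # write I/Os per second
    avg_write_kb: float    # average write block size in KB (1 KB = 1024 bytes)


def throughput_mbps(sample: WriteSample) -> float:
    """Write throughput of one sample in megabits per second."""
    return sample.writes_per_sec * sample.avg_write_kb * 1024 * 8 / 1e6


def required_link_mbps(samples: List[WriteSample],
                       synchronous: bool,
                       protocol_overhead: float = 0.2,
                       growth: float = 0.3) -> float:
    """Estimate the link bandwidth needed for replication.

    synchronous       - size for the peak write rate if True; for the
                        average rate if False (assumes the source-side
                        cache absorbs the peaks in asynchronous mode)
    protocol_overhead - assumed fractional overhead of the link protocol
    growth            - assumed fractional headroom for business growth
    """
    if not samples:
        raise ValueError("at least one sample is required")
    rates = [throughput_mbps(s) for s in samples]
    base = max(rates) if synchronous else sum(rates) / len(rates)
    return base * (1 + protocol_overhead) * (1 + growth)


if __name__ == "__main__":
    samples = [WriteSample(1000, 8), WriteSample(3000, 8)]
    print(f"synchronous : {required_link_mbps(samples, True):.1f} Mbps")
    print(f"asynchronous: {required_link_mbps(samples, False):.1f} Mbps")
```

The result is a starting point for the link count; redundancy requirements and the latency of the chosen link type still have to be considered separately.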
EMC will use these methods to design disaster-tolerant links for users in the future, since this approach has been successfully applied to a number of disaster-tolerant schemes provided by EMC. With its scientific methods of link design and its unique design tools, EMC is able to provide users with reasonable link planning schemes that lay a foundation for the successful implementation of disaster-tolerant schemes.
4.4 A brief introduction to EMC's tools for designing disaster-tolerant replication schemes
With its rich experience in providing disaster-tolerant schemes for a large number of high-end users, EMC has developed specialized tools, the ET tools, for the design of disaster-tolerant data replication schemes. The tool uses the user's current business I/O situation and service level requirements to analyze the key requirements in designing a replication scheme: the bandwidth of communication links and the processing capability of replication platforms (such as hosts or storage). It can also be used to assess the RPO that users can achieve under restricted communication conditions. The tool, which will also be used to assess the service level of the user's disaster-tolerant technology platform, can regularly collect I/O performance statistics and evaluate whether the disaster-tolerant data replication platform meets the changing requirements of business development.
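To give a feel for what assessing the achievable RPO under a restricted link involves, the following sketch replays a series of write rates against a fixed link capacity and tracks the unsent backlog. It is an illustration only and does not represent how the ET tools work; the function name `async_backlog` is hypothetical, and the time needed to drain the peak backlog is used as a rough proxy for the replication lag.

```python
from typing import Iterable, Tuple


def async_backlog(write_mbps_per_interval: Iterable[float],
                  link_mbps: float,
                  interval_s: float = 1.0) -> Tuple[float, float]:
    """Replay write rates against a link and report the worst backlog.

    Returns (peak_buffer_mb, worst_lag_s):
      peak_buffer_mb - largest amount of unsent data, in megabytes, which
                       the source-side cache would need to hold
      worst_lag_s    - time the link needs to drain that backlog, used
                       here as an approximation of the replication lag
    """
    if link_mbps <= 0:
        raise ValueError("link_mbps must be positive")
    backlog_mbit = 0.0
    peak_mbit = 0.0
    for rate in write_mbps_per_interval:
        # Backlog grows when writes exceed the link rate and shrinks
        # (never below zero) when the link has spare capacity.
        backlog_mbit = max(0.0, backlog_mbit + (rate - link_mbps) * interval_s)
        peak_mbit = max(peak_mbit, backlog_mbit)
    return peak_mbit / 8.0, peak_mbit / link_mbps


if __name__ == "__main__":
    rates = [50, 150, 150, 50, 50]  # Mbps per one-second interval (made up)
    buffer_mb, lag_s = async_backlog(rates, link_mbps=100)
    print(f"peak buffer: {buffer_mb:.1f} MB, worst lag: {lag_s:.1f} s")
```

If the worst lag exceeds the required RPO, either the link bandwidth or the RPO target has to change; if it is well within the target, the link may be over-provisioned.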