BCP/DR Primer – An introduction to planning your IT recovery strategy – Part 1

You see the acronyms all over the place, and it can be very overwhelming. There are so many articles about cloud, SPOF, RPO, RTO and five-nines uptime, and in the end it is all meaningless unless you understand your own environment.

A common misconception is that by putting your data “in the cloud” you have a Disaster Recovery (more often called BCP, or Business Continuity Planning) strategy. Wrong! There is much, much more to protecting your data, applications and business than simply copying your data to somewhere that boasts (and I stress the word boasts) a “99.999%” uptime.

This series will cover the fundamentals of IT BCP strategy with a split focus on technology as well as the business requirements to define your overall BCP program. Because BCP is just that, a Business Continuity Planning strategy, you have to begin by understanding the nature of your business and what is required to meet the needs of your staff and your customers.

BCP Speak

Much like what I call “project speak”, there is “BCP speak”. These are the basic, much-used acronyms and phrases you will meet when defining your BCP strategy. You’ve most likely seen them before, but it is important to understand these key phrases and features of a BCP program.

RPO

Recovery Point Objective. This is the point in time that your data will be recovered to in the event of a disaster. The most common RPO is 24 hours because we assume that you have a backup from within a 24 hour period which you can restore from. This does not mean that it takes 24 hours to get your system online; it means that whenever your system is recovered, the data will be up to 24 hours old as of the disaster event.

RTO

Recovery Time Objective. This is the length of time it takes to recover your system(s) after a disaster event. While you may have an RPO of 24 hours because you have a 24 hour old backup tape, it may take 3 days to recover that system or data, resulting in a 72 hour RTO with a 24 hour RPO.
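To make the difference between the two numbers concrete, here is a minimal sketch that measures one incident against an RPO and an RTO. The function names and the timestamps are made up for illustration; they are not from any particular tool.

```python
from datetime import datetime, timedelta

def recovery_metrics(last_backup, disaster, service_restored):
    """Return (data_loss, downtime) as timedeltas for a single incident."""
    data_loss = disaster - last_backup       # this is what the RPO limits
    downtime = service_restored - disaster   # this is what the RTO limits
    return data_loss, downtime

def meets_objectives(data_loss, downtime, rpo, rto):
    """True when the incident stayed within both objectives."""
    return data_loss <= rpo and downtime <= rto

# Hypothetical incident: backup at 01:00, failure at 23:00 the same day,
# service restored exactly three days after the failure.
last_backup = datetime(2024, 3, 1, 1, 0)
disaster = datetime(2024, 3, 1, 23, 0)
restored = datetime(2024, 3, 4, 23, 0)

data_loss, downtime = recovery_metrics(last_backup, disaster, restored)
print(data_loss)  # 22:00:00
print(downtime)   # 3 days, 0:00:00
print(meets_objectives(data_loss, downtime,
                       rpo=timedelta(hours=24), rto=timedelta(hours=72)))  # True
```

The data is 22 hours old, which is inside the 24 hour RPO, and the 72 hours of downtime just meets the 72 hour RTO. The two values are measured independently, which is why they need to be defined independently.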

SBD

Significant Business Disruption. This term is used because we are not always faced with a “disaster”. Perhaps we have a situation where your primary data center loses power for 8 hours and you only have a 2 hour battery, resulting in a 6 hour outage. This is not a “disaster” but a “disruption”. It may be as simple as the transit being shut down for some reason, which prevents your staff from getting to their place of work for a day or more.

BIA

Business Impact Analysis. This is a planning and documenting process where IT and the business coordinate to document and define the needs for a business process. This will include people, processes and technology requirements and from the BIA we will be able to define the RPO and RTO for the technology components.

Alternate Site

This term simply refers to a secondary site which is usually more than 100 km (60 miles) from the primary site. This is usually referring to the data center where servers and infrastructure are held.

Synchronous versus Asynchronous

Data synchronization is done in one of two ways: synchronously or asynchronously. Synchronous replication means that the data is current in both locations, and when data is written to the primary location, it is simultaneously written to the secondary location. In database terms this is referred to as “dual commit” where the transaction is not considered to be completed until the second write is confirmed.

Asynchronous replication means that the data is written to primary location and then sent to the secondary location to be written as soon as is technically possible. The advantage to asynchronous is that while there may be a slight delay in the writing of the data to the secondary location, there is less latency because the transaction is completed as soon as the primary data is written.
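The trade-off between the two modes can be sketched with a toy in-memory model. Everything here is hypothetical (the `ReplicatedStore` class, the millisecond figures, the single queue); it only illustrates when the caller is told the write is complete, not how any real storage or database product replicates.

```python
from collections import deque

class ReplicatedStore:
    """Toy model of a primary site with one secondary copy."""

    def __init__(self, synchronous, link_delay_ms, local_write_ms=1.0):
        self.synchronous = synchronous
        self.link_delay_ms = link_delay_ms    # assumed one-way delay between sites
        self.local_write_ms = local_write_ms  # assumed cost of one disk write
        self.primary = {}
        self.secondary = {}
        self.pending = deque()                # writes not yet applied remotely

    def write(self, key, value):
        """Apply a write and return the latency the caller observes, in ms."""
        self.primary[key] = value
        if self.synchronous:
            # Not complete until the secondary has written and acknowledged:
            # local write + trip out + remote write + trip back.
            self.secondary[key] = value
            return 2 * self.local_write_ms + 2 * self.link_delay_ms
        # Asynchronous: complete as soon as the primary write is done.
        self.pending.append((key, value))
        return self.local_write_ms

    def drain(self):
        """Ship queued writes to the secondary (the 'as soon as possible' step)."""
        while self.pending:
            key, value = self.pending.popleft()
            self.secondary[key] = value

sync_store = ReplicatedStore(synchronous=True, link_delay_ms=5.0)
async_store = ReplicatedStore(synchronous=False, link_delay_ms=5.0)
print(sync_store.write("order-1", "paid"))   # 12.0
print(async_store.write("order-1", "paid"))  # 1.0
print(async_store.secondary)                 # {}
async_store.drain()
print(async_store.secondary)                 # {'order-1': 'paid'}
```

Until `drain()` runs, the asynchronous secondary is missing the write. That window is exactly the data you stand to lose if the primary site fails at that moment.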

When your primary and secondary sites are separated by more than 100 km you will find there are technical limitations to providing synchronous replication. This is a matter of physics and, unfortunately, is not up for debate at this time.
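The physics can be roughed out in a few lines. This sketch assumes light in optical fiber covers about 200 km per millisecond (roughly two thirds of its speed in a vacuum) and that the cable runs in a straight line between sites. Real links add routing, equipment and protocol overhead, so treat the result as a floor rather than an estimate.

```python
# Assumption: light in optical fiber travels roughly 200 km per millisecond.
FIBER_KM_PER_MS = 200.0

def min_round_trip_ms(distance_km):
    """Lower bound on the delay added to every synchronous write, in ms."""
    return 2 * distance_km / FIBER_KM_PER_MS

for km in (10, 100, 1000):
    print(f"{km} km: at least {min_round_trip_ms(km):.1f} ms per write")
```

At 100 km the floor is about 1 ms on every committed write, and it grows linearly with distance. Whether that is acceptable depends on how many writes your application makes and how sensitive it is to commit latency.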

Near-Zero

With asynchronous replication we may still see speeds that are “near” to synchronous. Because that speed is not guaranteed, and because we know for a fact that the transaction is not synchronous, we refer to this as Near-Zero: the secondary copy may be under a minute, or under 5 minutes, behind the primary. If your system has limited data change you may be comfortable with under 15 minutes as the threshold for data synchronization.

Real-Time

Much like Near-Zero, the term Real-Time is used to describe the currency of data in the secondary site. Data transfer which is referred to as Real-Time is usually synchronous data replication.
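As a small sketch of how these two labels relate, the function below names the currency of a secondary copy from its replication mode and measured lag. The 5 minute default comes from the range mentioned above; the function name and the "delayed" label for anything beyond the threshold are my own placeholders.

```python
from datetime import timedelta

def replication_currency(synchronous, lag, near_zero_threshold=timedelta(minutes=5)):
    """Label how current the secondary copy is."""
    if synchronous:
        return "real-time"   # the secondary is written as part of the transaction
    if lag <= near_zero_threshold:
        return "near-zero"   # asynchronous, but within the agreed threshold
    return "delayed"         # hypothetical label for lag beyond the threshold

print(replication_currency(True, timedelta(0)))             # real-time
print(replication_currency(False, timedelta(seconds=40)))   # near-zero
print(replication_currency(False, timedelta(minutes=10)))   # delayed
print(replication_currency(False, timedelta(minutes=10),
                           near_zero_threshold=timedelta(minutes=15)))  # near-zero
```

The last call shows why the threshold has to be agreed per system: the same 10 minute lag is acceptable for a system with limited data change and unacceptable for a busy one.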

Hot, Warm and Cold Standby

The terms hot, warm and cold when referring to standby systems are used to refer to the currency of the data and availability of the system in the alternate site.

Hot standby is defined as an online recovery system with synchronous data, or possibly asynchronous, but near-zero replication. A hot standby system is immediately available in the event of an outage in the primary site.

Warm standby is defined as an online recovery system with asynchronous data replication. While the data transfer may be as close as near-zero, there is some recovery required to bring the system online to recover that service. This may be automatic or manual.

Cold standby is defined by hardware or virtual systems available in the alternate site which can be used for the recovery of a system. There is manual intervention in restoring data and bringing the system online, but it can be done without the purchase and installation of hardware or software.
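The three definitions above can be reduced to a few yes/no questions about the system in the alternate site. This is a minimal sketch of that decision; the `StandbySystem` fields are illustrative names I have chosen, not an industry standard.

```python
from dataclasses import dataclass

@dataclass
class StandbySystem:
    online: bool                # is the recovery system already running?
    replicated: bool            # is data being replicated to it (sync or async)?
    needs_recovery_steps: bool  # is any work, automatic or manual, needed before use?

def classify_standby(system):
    """Return 'hot', 'warm' or 'cold' following the definitions above."""
    if system.online and system.replicated and not system.needs_recovery_steps:
        return "hot"
    if system.online and system.replicated:
        return "warm"
    # Hardware or virtual capacity exists, but data must be restored by hand.
    return "cold"

print(classify_standby(StandbySystem(True, True, False)))   # hot
print(classify_standby(StandbySystem(True, True, True)))    # warm
print(classify_standby(StandbySystem(False, False, True)))  # cold
```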

SPOF

Single Point of Failure. This is where a system has one or more components which, if removed, render the system unusable. Even when we build redundancy into systems, we often still have a single point of failure. A simple example is a web application which has a database connection, yet there is only a single database server. If the database system were to go offline then the web application would also become unavailable.
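A first pass at finding SPOFs can be as simple as listing every role a service depends on and counting the instances behind it. The sketch below does that for the web application example; the host names are invented, and a real review would also need to look at shared dependencies such as power, network and storage.

```python
def single_points_of_failure(components):
    """Return the roles that are backed by exactly one instance."""
    return sorted(role for role, instances in components.items()
                  if len(instances) == 1)

# Hypothetical inventory for the web application described above.
web_app = {
    "load_balancer": ["lb01", "lb02"],
    "web": ["web01", "web02"],
    "database": ["db01"],
}

print(single_points_of_failure(web_app))  # ['database']
```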

BCP Tiers

The tiers in BCP are defined by RPO and RTO ranges. Using these tiers, we map the business requirements against the BCP tier, and this defines the technology and people factors involved in maintaining a recovery strategy within a specific timeframe.
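To show the mapping idea, here is a sketch that picks a tier from a required RPO and RTO. The tier names and ranges below are entirely hypothetical placeholders; the real ranges come out of your own BIA and are the subject of the next post.

```python
from datetime import timedelta

# Hypothetical tiers, strictest first: (name, RPO it delivers, RTO it delivers).
TIERS = [
    ("Tier 1", timedelta(minutes=15), timedelta(hours=1)),
    ("Tier 2", timedelta(hours=4), timedelta(hours=24)),
    ("Tier 3", timedelta(hours=24), timedelta(hours=72)),
]

def assign_tier(required_rpo, required_rto):
    """Return the least strict tier that still meets both requirements."""
    for name, tier_rpo, tier_rto in reversed(TIERS):
        if tier_rpo <= required_rpo and tier_rto <= required_rto:
            return name
    return None  # no defined tier is strict enough

print(assign_tier(timedelta(hours=24), timedelta(hours=72)))  # Tier 3
print(assign_tier(timedelta(hours=4), timedelta(hours=48)))   # Tier 2
print(assign_tier(timedelta(hours=1), timedelta(hours=8)))    # Tier 1
print(assign_tier(timedelta(minutes=5), timedelta(hours=1)))  # None
```

Searching from the least strict tier upward reflects the usual cost pressure: stricter tiers need more technology and people, so a system should land in the cheapest tier that still satisfies the business.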

High Recoverability versus High Availability

A system may be fully redundant by having multiple paths to a database, or multiple servers in a site servicing the delivery to the customer, but redundancy is often done locally with low latency technology. This is what we refer to as High Availability because the system is able to remain available despite a number of local disruptions or events.

High Recoverability is the capability to recover the business or technical function in an alternate location. It is entirely possible to have a system which is highly available but with limited recoverability, and the reverse is also possible. An example of High Recoverability could be a simple standalone application server with no local redundancy, but with a warm standby in an alternate site.

Oxygen Services

The term Oxygen Services refers to core IT systems which are required, such as network, name resolution and directory services. This name was chosen because without these services we cannot do anything else. These are the essential systems and services required before we are able to recover our business systems.

What’s Next?

Our next post in the BCP/DR Primer series will take the key terms we’ve covered here and look at how we define the BCP Tiers for our environment, with our initial focus on Oxygen Services.