BCP/DR Primer – Part 2 – BCP Tiers and recovery requirements

In Part1 of our BCP/DR Primer series we defined some key terminology that we will use as we build up our plans and document our systems as they map against our ability to recovery them.

The BCP Tiers are common among many organizations although you may seem some deviation in the specific time-frames that they use to define them. I have used the same definitions with multiple organizations and these are what I will be showing you here.

Each BCP Tier is defined by RPO and RTO. These are both required for the system(s) to be included in the tier. The reason that we have to use both RPO and RTO is that we cannot consider a system as being tier 1 by RTO, yet the RPO is 5 days. What we mean by this is that there is no value in being able to bring a system online very quickly, yet the data is far out of date. When we are defining the BCP tier we use the lowest common denominator of the system recovery requirements. This will make sense with as we walk through the BCP tier definitions.

An important note is that these are IT recovery definitions. Many businesses may refer to having “tier 1” systems that have a requirement to be recovered within 48 hours. Quite often these terms and definitions come from regulatory requirements where a company must have certain capabilities within 48 hours of a Significant Business Disruption. It is important to differentiate between business definitions and IT definitions.

Tier 1

  • RPO: <15 minutes
  • RTO: < 1 hour

This is your highest availability and highest recoverability category. The key definition for a system to fall into this environment is that it requires near-zero or real-time replication of data and automated failover to the alternate site. If the data is replicated, but the recovery requires manual intervention you are immediately pushed into Tier 2. The reason for automation being a requirement is that we cannot assume availability of people resources when a disaster event takes place so a manual recovery may take longer than 1 hour. A majority of your Oxygen Services should be in this category.

Tier 2

  • RPO: < 4 hours
  • RTO: <24 hours

Tier 2 is the sweet spot for many systems. The operational costs to maintain a true Tier 1 environment is often beyond the reach for many application environments. Requirements for Tier 2 include replicated data and hot or warm standby environments in the alternate site. Any environments that will require recovery from tape/archive data are automatically moved into Tier 3 or even into Tier 4. This Tier will include the remainder of your Oxygen Services which ensures that the other IT application environments and your business application environments have the required infrastructure to be brought online at the alternate site in Tier 2 and below.

Tier 3

  • RPO: <24 hours
  • RTO: <48 hours

Tier 3 is going to be a heavily populated recovery Tier also. Requirements for Tier 3 recovery include a minimum of cold standby systems (may also include warm) and data should be in place, or recovery from archive is possible provided that archive content is available at the alternate site within the RTO window.

Tier 4

  • RPO: <24 hours
  • RTO: <5 days

The final category is Tier 4 which will encompass all systems remaining, including those with manual recovery processes, data recovery from archive for medium to large data sets and anything which is deemed as “less important” than systems which are categorized as Tiers 1-3. Remember that importance is defined by the business sponsor, and cost of recovery may impact how systems become Tier 4.

Where is Tier 5?

There are few environments which have a Tier 5 simply because any system which would fall under this category is a very manual process or system and may even be just paper data which is stored off site in archival storage. While we may have some systems that could live in the Tier 5 space, usually through the BIA process you will find that the business will want these systems classed as Tier 4.

You may find that Tier 5 exists for your organization so it should not be assumed to always be unnecessary. I have just found that in practice that most organizations require recover in one way or another within the 5 day window. If you define Tier 5 it is typically an RPO of 24 hours and RTO of 5-14 days. The Tier 5 category is what we would call “everything else”.

Why the Lowest Common Denominator?

LCD isn’t just for mathematics. When we define the BCP Tier for an application or system, we have to use the lowest common denominator for the RPO and RTO to decide where it belongs in our recovery plan. What we mean by this is that you should not have a system that has near-zero replication of data, yet the core application requires 48 hours to be recovered. While this may happen, we have to define this application as Tier 3 because regardless of the currency of the data (<15 minutes) we still require the 48 hours to recover the whole application environment.

What is zero hour? The Declaration of Disaster

This is a very good question that is hot topic among the BCP community. When does the clock start? There are 2 events which will occur in a Significant Business Disruption or Disaster Event. First we have the actual event itself, followed by the “declaration” by the business sponsors on the BCP team who will officially initiate BCP recovery to an alternate site.

As an example, let’s use a localized power loss to your data center. Assume for our example that you have 6 hours of UPS available to withstand an outage. We must also assume that our network provider(s) have similar protection to provide you with connectivity during the time when you are operating on UPS.

While we have an immediate issue with the loss of power, we have been able to maintain our Tier 1 systems. During the course of the first hours (less than 6 in this case) we must prepare to fail-over environments if we continue to suffer the loss of power. We may also find out that we will get our power back after 8 hours. In this case the BCP team may declare the start time as of 4 hours into the loss of power once we come to a agreement that we will initiate some fail-over procedures.

So for this example, the “zero hour” is actually 4 hours beyond the actual event. This is just an example to illustrate that the event may not always be considered the start point because we consider the declaration of disaster to be the true starting point where we enact the recovery plans.

Other Restrictions which affect the BCP Tier

There are a number of things which can negatively affect the recoverability of a system. Most importantly, we have to be aware of many issues that will arise that can, and will affect the BCP Tier definition of an application or system.

If we look at the distribution of applications among the BCP Tiers we will see that many will land in Tier 2. The challenge that we have with that is that if there are too many in Tier 2, the manual intervention and time required to complete recovery of many of those systems may exceed the RTO of Tier 2. In other words we may have packed 2 days worth of work into a 24 hour recovery window.

For systems that require data restore from archive, you have to carefully measure the recovery time. While some tape recovery can still qualify as Tier 3, if the data size is too great this will also push the application into Tier 4.

It is important that we evaluate the overall recovery plan along with the individual itemized plans to be sure that we correctly categorize application recovery. If we find that there are too many in the Tier 2 range then we must either modify the recovery process to move it to Tier 1 or we must work with the business to prioritize it accordingly and move some applications into Tier 3.

Mapping the Application to a BCP Tier

I will use 2 examples to show how to map the application into the appropriate BCP Tier. It will become simpler to do this process as you spend more time with it, so the key is choosing applications with which you have familiarity.

System 1: Microsoft Active Directory, DNS and WINS

It’s really 3 systems, but because I co-locate them on a single server (two or more in each network location). This is an easy one for BCP Tier definition because by design it is a geographically dispersed cluster with data transmitted every 5 minutes and the longest wait duration is 15 minutes across the WAN.

DESTINATION: Tier 1

System 2: PHP Web server on Microsoft Windows with MySQL

The web application is protected by nightly Robocopy tasks and the database is backed up to disk using mysqldump and the content is Robocopied to the alternate server along with the web application code. The destination server automatically imports the nightly database dump. All that is required for recovery is to modify the CNAME record to point to the new target server.

DESTINATION: Tier 2

It’s just that easy 🙂 Of course you will have to try this out with your systems and through evaluating each one you will get much more familiar with the recovery processes and the mapping of applications to BCP Tiers should become easier as you go.

What’s Next?

In our next post we will build the BCP Recoverability Matrix