Blameless Culture: The Path to Agility in IT

You’ve inevitably heard the word blameless used in the context of technology and tech culture.  Blameless post-mortems are a foundation of DevOps, agile, and many other methodologies: retrospective meetings and recaps that allow for trust and honesty when discussing what went wrong and what went well.

Here’s the issue:  being blameless is really difficult.  Like, really, really difficult.

Battling Human Nature

Blameless meetings in many organizations are more like “don’t blame me” meetings, which erodes the very foundation of what blameless culture is all about.  A societal challenge is getting past the “Not it!!” mentality and embracing the issues without seeking someone other than ourselves to pin them on.  People often mistake not taking responsibility for being blameless.  Declining to assign blame to others is one thing; constantly un-assigning yourself from it is not a blameless approach.

It’s not easy to admit we made mistakes.  Trust me…

A Personal Story:  “Wait, that was production?!”

It was 9:30 AM on a Tuesday.  I had two Active Directory administration windows open: one for production, and one for the test domain with the identical OU structure (to look just like production).  I selected the option to upgrade the Active Directory version to Windows 2000 native mode, proudly clicked the Change Mode button (clearly stating “this operation cannot be reversed”), and confirmed the following acknowledgement that it was irreversible.  Then I sat back in the chair and smiled…for a moment.

Then I checked the other window, which I thought was production…and saw the domain version was mixed mode (legacy).  But then I also noticed the domain name was the test domain.  I had updated production in the middle of business hours during potentially the busiest login period of the day.

I stood up and took a lap around the cubicle. Once I stepped back in and confirmed it had happened, I called my systems architect who worked with me.  “James, I made a mistake.  I just updated production instead of test.”

James calmly said “Ok.  Let’s get some folks looking at it and figure out if anything happened and what our options are if something broke”.  James understood blameless culture.  We survived the change (without issue, luckily) and I learned that rallying the team around an issue was better than seeking blame.  We would work together at other times on issues that did have far-reaching effects, and the blameless culture and teamwork got us through those events.

Hire the Person that Made a Big Mistake – They Won’t Want to Make Another One

Making mistakes and learning to change process and recovery procedures meant building better IT.  In my time as a systems architect and operations team member, I saw our culture grow stronger under blameless acceptance of issues.  I also witnessed the very negative results of not using this tactic.  A manager at one time told me that even though we had a large, preventable situation occur from human error, we don’t remove the person.  We teach them to not have it occur again.  If you make the same mistakes again…and again, well, there isn’t a blameless team around that will not take notice and perhaps have to make some changes in your role.

Test-driven development and test-driven infrastructure are wrapped around the foundations of finding the error first and then working back from there.  Seek a problem and then find resolution.  The same goes for teamwork and production deployments.  To move faster, you have to be confident that you are testing and also able to work together as a team without fear of retribution when issues occur.  Some issues are avoidable, and some are not.  Be blameless before you find the solution and then when you find the cause, accept that it happens…but should happen less.  The real leaders then ask “how can we do it better next time?”

Leadership Defined:  We Succeed; I Fail

Acknowledging failure or just challenges is healthy.  Celebrating success as a team is also extremely positive.  My way of encapsulating what it means to be a leader of people in a team goal is “We succeed; I fail”.  In other words, celebrate successes as a team, because we all did this together.  When looking for where we may not have succeeded, look into yourself for what you feel could have been done better.  There may be others involved in the tactical things that went wrong.  The best outcome is a learning experience for everyone throughout the team.

Even when things go well, we should always ask how it could have gone better, or what we would do differently.  This is the beginning of embracing a culture of experimentation.  You have to know that things can fail and we can recover, without blame.


We’re Not Building a Piano: Full-Stack Infrastructure Understanding – Part 2

In our previous post (Design Patterns for Resilient Application Infrastructure – Part 1) we explored the basics of the N+1 concept for node loss in a cluster and discussed what our series is going to cover. Let’s start by mapping out the full-stack view as the IT architect will need to.

An IT architect needs to understand the physical layer (servers, storage, network), the data layers (relational databases, NoSQL databases), the application layers (application logic, code logic and code deployment), and the access layers (front-end load balancing and caching).

There are many stacks we have been exposed to over our IT careers. Today’s “stack” is one that spans physical, virtual, cloud, and application infrastructure. We will expand this model as we go in the series, but let’s start with this as our initial set of building blocks to work from:

  • Access
  • Application
  • Data

Inside each block will be many services that support the needs at that logical layer.

Access Layer

The access layer is often known as the presentation layer. This layer includes the many facets of network access used to reach our application front-end. The access layer may include just raw Layer-3 networking directly to an HTTP server, or it may also use proxies, distributed load balancers, firewalls bridging a DMZ and more. Understanding the needs of your application presentation will drive a lot of the underlying architectural decisions for the access layer. Some of your designs may also leverage existing technologies in place.

Application Layer

The access layer draws content from the actual application itself. Is your application a single node which hosts all code in a monolithic virtual machine instance? Or is it a variety of services that are distributed for resiliency and decoupled from each other for better portability? How are your applications stored, deployed and updated? These are considerations for the application layer.

Data Layer

Applications are typically backed by data repositories. It could be just for read access or to interactively create/update/manage data in the back-end. Data may be distributed close to the application itself, and may be in a variety of forms: relational databases, flat files, key-value stores, and many more options. Data can also be aggregated by other API-enabled services, which makes understanding data and application architectures challenging.

Understanding Requirements for Application Resiliency

Put aside cost and complexity for a moment, and think about raw requirements. If you need resilient application infrastructure, you need to think about each of the three layers. This is our next challenge as we explore some options for distributed architectures that protect and deliver resiliency, using some more direct product exploration and a closer look at the physical layer.

Using the CLP (Conceptual, Logical, Physical) layers for our applications means we can properly assess and architect each of the layers in the IT stack to meet the application resiliency requirements.

NEXT POST: Understanding the Access Layer

We’re Not Building a Piano: Design Patterns for Resilient Application Infrastructure – Part 1

When I was doing work as a general contractor and building a house, one of the teams I worked with was a father-son pair who were helping me and the owner of the house with the build. The son of the team was nicknamed “Lumpy” and he was learning the trade. When Lumpy was hammering in a nail that went crooked and folded over, he began pulling the nail out. This happened a couple of times, and his father responded with a phrase that I will never forget (apologies for the salty language):

For Christ’s sake, Lumpy, we’re not building a f%&*ing piano

The choice of wording aside, the issue is that too much time was being spent trying to make each part of the building process ideal, which was likened to a finely-tuned piano. Lumpy was spending 3-4 times as long trying to remove the bent nail as he would have spent just hammering it in crooked and then putting another nail beside it.

This series takes the same concept and puts it into practice for infrastructure design. Only once we understand the design patterns can we apply them to where it really matters, which is in service of application availability.

Infrastructure Resiliency – Stop Building Pianos

Crooked nails will be all over your virtual and physical infrastructure. They should be. What we need to focus on as infrastructure operations teams is moving further up the proverbial stack with our understanding of resiliency. We should be designing infrastructure using the same patterns that the “at-scale” folks use where possible. This is the concept behind what Alex Polvi (founder of CoreOS) calls GIFEE (Google Infrastructure for Everyone Else).

You don’t need to be Google to think like them. The term SRE (Site Reliability Engineer) may seem like a buzzword title, but the concepts are sound and the fundamentals can be adopted more easily than we realize. It takes a little rethinking of what the goal of resiliency in infrastructure really is.

Bottom Up – Understanding N+1 and N+2

We use the notation N+n to illustrate systems component resiliency. When we talk about things like server availability for virtualization clustering, N+1 and N+2 are often confused in their meaning. N+n is a measure of how many single components in a system can fail before critically affecting that system. If you have 12 hosts in a cluster, N+1 would indicate you have enough resources to survive a single node (virtualization host) failure and to continue servicing the remaining workloads. N+2 becomes a 2-node loss, and so on.
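As a rough illustration of the N+n idea, here is a minimal sketch in Python. It assumes every host has the same capacity and that workloads can be freely rebalanced onto the survivors, which real clusters only approximate. The function name, the capacity units, and the demand figure are all hypothetical and chosen only for the example.

```python
def tolerates_node_loss(host_count: int, host_capacity: float,
                        demand: float, nodes_lost: int) -> bool:
    """Return True if the surviving hosts can still carry the full demand.

    Assumes identical hosts and perfectly movable workloads (a simplification).
    """
    surviving = host_count - nodes_lost
    if surviving <= 0:
        # No hosts left, so nothing can be serviced.
        return False
    return surviving * host_capacity >= demand


# Hypothetical 12-host cluster, 100 capacity units per host, 1050 units of demand.
print(tolerates_node_loss(12, 100, 1050, 1))  # N+1: 11 * 100 = 1100 >= 1050
print(tolerates_node_loss(12, 100, 1050, 2))  # N+2: 10 * 100 = 1000 < 1050
```

With these made-up numbers the cluster is N+1 but not N+2: it can lose one host and keep servicing the workloads, but not two.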

For host clustering and N+n illustration, we measure the percentage of total resources left over on the surviving nodes. A single-node system obviously cannot sustain any loss at all. A 2-node cluster can sustain a single node failure and survive, but will be left with 50% of the total compute resources.

This is just a sample showing N+1, N+2, and N+3 node loss effects in clusters of up to 7 nodes. The calculations can be made quite easily using a simple formula:

Remaining Resource Percentage = ( ( Nodes Available - Nodes Lost ) / Nodes Available ) * 100
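To make the numbers concrete, the short Python sketch below computes the remaining resource percentage as surviving nodes divided by total nodes, for clusters of 1 to 7 nodes losing 1, 2, or 3 nodes. It assumes all nodes contribute equal resources; the function names are hypothetical and exist only for this example.

```python
def remaining_resource_percentage(nodes_available: int, nodes_lost: int) -> float:
    """Percentage of total cluster resources left after losing nodes.

    Assumes every node contributes an equal share of resources.
    """
    if nodes_available <= 0:
        raise ValueError("nodes_available must be positive")
    if not 0 <= nodes_lost <= nodes_available:
        raise ValueError("nodes_lost must be between 0 and nodes_available")
    return (nodes_available - nodes_lost) / nodes_available * 100


def loss_table(max_nodes: int = 7, max_lost: int = 3):
    """Build (cluster size, [remaining % after 1..max_lost losses]) rows.

    None marks a loss count larger than the cluster itself.
    """
    rows = []
    for nodes in range(1, max_nodes + 1):
        row = []
        for lost in range(1, max_lost + 1):
            if lost <= nodes:
                row.append(round(remaining_resource_percentage(nodes, lost), 1))
            else:
                row.append(None)
        rows.append((nodes, row))
    return rows


print("nodes    N+1   N+2   N+3")
for nodes, row in loss_table():
    cells = ["  n/a" if value is None else f"{value:5.1f}" for value in row]
    print(f"{nodes:5d}  " + " ".join(cells))
```

A 2-node cluster losing one node is left with 50% of its resources, while a 7-node cluster losing one node keeps roughly 85.7%, which is why larger clusters absorb a single failure with less spare capacity per host.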

The interesting thing with cluster sizing is that we spend a surprising amount of time designing the clusters and then forget to keep track of the dynamic workloads. That’s another blog all unto itself. Our goal in this series is to uncover the upper layers in which we can understand resiliency. Even if you do not directly affect these layers yourself as an operations admin or IT architect, it’s my belief that we have a responsibility to know more to truly design and build resilient application infrastructure.

Understanding the True Full-Stack Infrastructure Resilience Approach

When somebody is described as a full-stack application designer, it usually means they are competent in both front-end (visual) and back-end (application logic and data) design. For full-stack infrastructure architects, there are a lot more layers. An IT architect needs to understand the physical layer (servers, storage, network), the data layers (relational databases, NoSQL databases), the application layers (application logic, code logic and code deployment), and the access layers (front-end load balancing and caching). All of these need to also be understood on traditional virtualization and on private or public cloud infrastructure. Yikes!

Have no fear, we are going to take these topics on in some simple and meaningful examples, and you will have a crash course in resilient application infrastructure. Using these fundamentals will give us the foundation to then apply these patterns to specific infrastructure deployments like AWS, Microsoft Azure and private cloud products.

Strap in and enjoy the ride, and I hope that you find this series to be helpful!

NEXT POST: Full-Stack Infrastructure Understanding

Visualizing your Solutions: Mind Maps and Wireframe Diagrams

Let me start this post out with a huge thanks to Rene (aka @vcdx133) and Melissa (aka @vmiss33) who have been very helpful with me getting from idea to diagram/document using these tips. Having a simple template to start things off with becomes the best way to get started. Visualization helps your ideas become more clear because it forces you to see the relationships between things, and to do the physical process of drawing them out on paper and/or using a digital platform. Before you think you need to be an AutoCAD, or even a Visio expert, you have to learn to quickly get ideas drafted out.

There have been many days where I stared at a blank diagram software screen and fought with how to get it to work in a nice way using the product when what I should have done is to start with just sketching it out in rough format first. This goes to the classic phrase “don’t let perfection get in the way of good enough”. When you need to take the thought process from ideation to visualization, there are many tools and techniques that can help you. The most popular ones I use nearly every day are:

  • Paper sketches
  • Mind Maps
  • Diagram Tools: Visio, OmniGraffle, PowerPoint

Each has a distinct purpose in the process.

Paper Sketches

This is one that Melissa (aka @vmiss33) has taught me to leverage more and more. When you want to get started on an idea, just break out a pencil or pen, and some paper. Scratch diagrams and sketches take your idea and put it into a visual form. This helps you think about how to visualize it before you go diving into OmniGraffle or Visio and find yourself searching shape catalogs for hours and getting frustrated. Scratch pads and notebooks are excellent for both words and diagrams. As you write out and sketch out things, your mind is forced to connect the physical motor act with the thought process. This helps to enhance learning and to get you closer to a result with your ideas. I’ve also gotten some really nice notebooks which I enjoy using. Rhodia is one brand that has very nice paper and lots of different styles. My favourite to use is engineering paper or graph paper style.

Mind Maps

Whether it’s a site map you want to work out, some ideas and related content/thoughts, or just general brainstorming, mind maps are also a great tool for taking verbal and thought processes and putting them to paper easily. Start with your core idea/thought and then branch out from there using simple mind map diagrams. There are lots of resources online to help you as you learn to use this technique to expand on your ideas. MindNode is a product I use for the Mac, but there are many different products which you can find online. The goal is really just to adopt the practice first and then you can use this for both self-ideation as well as for collaboration. A project manager who I worked with for years taught me the value of quickly scribing down discussion ideas for project planning using a mind map, which has served me well over the years.

Diagram Tools

Before you think you need to be creating perfect diagrams with visually-stunning graphics, start with the basics. Wireframe diagrams can be easily drafted out as a digital version of your earlier sketches. You can choose the level you want your graphic quality to be, but the best diagrams I’ve used and created are ones that I modelled after a template that I got from Rene Van Den Bedem (aka @vcdx133). Using a seemingly simple diagram format means you concentrate on the content. Once the content is completed and your idea is committed to a diagram, you can then tune the graphic style all you want. The first step is moving from concept in your head to the concept in a diagram. Products I’ve used include OmniGraffle and Microsoft Visio, and even Microsoft PowerPoint can be quite handy for doing such diagrams. Hopefully these tips are as helpful for you as they were for me.