Oracle Ravello Blogger Day 2 #RBD2 – Oracle Cloud Infrastructure

Oracle is driving hard towards a customer-focused cloud. The amount of listening being done by the product teams shows in the product roadmap that we saw at the Ravello Blogger Day 2 (#RBD2) event in Redwood City, CA this week. The event brought together a broad array of industry folks from vendors, ISVs, VARs, and other roles.

Oracle presented the Oracle Cloud Infrastructure and Oracle Ravello platforms over the course of the 8-hour event. Q&A became the dominant and most exciting part of the day as the content unfolded for us. Some may quip about the “leading from behind” approach, but there can be little doubt that the boldness of Oracle is proving to be more than all talk as the customer examples and successful execution of the OCI story continue to roll in.

Cloud Truly Ready for the Enterprise

SLA is king in the Oracle cloud. Moving mission-critical apps to the cloud rather than rearchitecting them to map to cloud patterns is a big driver for Oracle Cloud adoption. While the cloud pundits may shun this approach, the reality is that customers will adopt cloud to get the new features while remaining comfortable running their “lift and shift” application stacks alongside. The marketing of this option has rebranded “lift and shift” as the newly dubbed “move and improve”.

Providing versatility and SLA on the OCI service stack is the icing on the typical IaaS cake. Oracle clearly pins their value on the performance side of things, which is top of mind as folks have tested the cloud waters and seen the inconsistencies in performance at many layers of some environments.

I get the feeling that a distinct advantage for Oracle is the fact that their existing customer base already skews enterprise. Not many mom-and-pop shops are running Oracle RAC. Being able to sell to a strong base will be extremely helpful in early expansion.

Driving SLA and quality of service as the key performance indicators will draw the eye of any good CIO. There is also a surprising humility among the folks I’ve spoken with here at the event, as they acknowledge the perception problem they need to shake before customers will lean into Oracle as a primary provider.

Build, Buy, Integrate

Starting 12 years behind the alternative public cloud providers means being aggressive. The story is playing out on a few fronts. As most of you know, when you can’t build it, you have to buy it. More importantly, you then have to integrate it.

Dyn, Ravello, and more are being quickly integrated to widen the portfolio of Oracle Cloud services. Adding Wercker into the portfolio also brought over some development credibility as a nice way to drive the CI/CD adoption on the Oracle Cloud stack.

Filling out the full stack is also particularly important. The evolution and goals are very interesting to watch, and the need to present more than core IaaS is apparent. I’m especially excited about the Terraform integrations (you had me at Terraform) and the Terraform services (state management, registry, and more) that were discussed. I’ll share more on that once I dive further into it.

Further blogs will dive into some of the key areas of Kubernetes and the PaaS/CaaS options. I felt those needed a bit more of a deep dive.

Did I Mention Performance?

As someone who’s in the business of workload performance through Turbonomic, this resonates deeply with me. The Oracle team is clearly laser-focused on performance and looking to hit those performance targets while remaining cost-competitive with alternative offerings.

Oracle’s business is full-stack. The goal (clearly stated and fully transparent) is to go directly against the competitors (read: AWS and Azure) on performance and price. Having customers already running the Oracle stack on-premises makes the transition to Oracle Cloud rather attractive; this is the same approach that Azure is taking. Credits and discounts will come into play to make the first steps enticing, but what will keep customers in-platform is performance, cost, and service breadth.

It will be very interesting to see the results of diving into the latest iteration of OCI. Look for much more in the coming weeks as I share my experiences.




We’re Not Building a Piano: Full-Stack Infrastructure Understanding – Part 2

In our previous post (Design Patterns for Resilient Application Infrastructure – Part 1) we explored the basics of the N+1 concept for node loss in a cluster and discussed what our series is going to cover. Let’s start by mapping out the full-stack view the way an IT architect will need to.

An IT architect needs to understand the physical layer (servers, storage, network), the data layers (relational databases, NoSQL databases), the application layers (application logic, code logic and code deployment), and the access layers (front-end load balancing and caching).

There are many stacks we have been exposed to over our IT careers. Today’s “stack” is one that spans physical, virtual, cloud, and application infrastructure. We will expand this model as we go in the series, but let’s start with this as our initial set of building blocks to work from:

  • Access
  • Application
  • Data

Inside each block will be many services that support the needs at that logical layer.

Access Layer

The access layer is often known as the presentation layer. This layer includes the many facets of network access used to reach our application front-end. The access layer may include just raw Layer-3 networking directly to an HTTP server, or it may also use proxies, distributed load balancers, firewalls bridging a DMZ, and more. Understanding the needs of your application presentation will drive a lot of the underlying architectural decisions for the access layer. Some of your designs may also leverage existing technologies already in place.
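A quick way to see what your access layer looks like from the outside is to probe the front-end and inspect the response headers, which often carry fingerprints of the proxies, caches, and load balancers in the path. This is a minimal sketch only; app.example.com is a placeholder for your own front-end:

# Fetch only the response headers and look for access-layer fingerprints:
# Via suggests a proxy, X-Cache and Age suggest a cache, and Server
# identifies the front-end software answering the request.
curl -sI https://app.example.com/ | grep -iE '^(server|via|x-cache|age)'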

Application Layer

The access layer draws content from the actual application itself. Is your application a single node that hosts all of its code in one monolithic virtual machine instance? It may instead be a variety of services that are distributed for resiliency and decoupled from each other for better portability. How are your applications stored, deployed, and updated? These are the considerations for the application layer.

Data Layer

Applications are typically backed by data repositories. It could be just for read access, or to interactively create, update, and manage data in the back-end. Data may be distributed close to the application itself, and it may come in a variety of forms: relational databases, flat files, key-value stores, and many more. Data can also be aggregated from other API-enabled services, which makes understanding data and application architectures challenging.

Understanding Requirements for Application Resiliency

Put aside cost and complexity for a moment, and think about raw requirements. If you need resilient application infrastructure, you need to think about each of the three layers. That is our next challenge as we explore some options for distributed architectures to protect and deliver resiliency, using some more direct product exploration and moving down toward the more physical layer.

Using the CLP (Conceptual, Logical, Physical) layers for our applications means we can properly assess and architect each of the layers in the IT stack to meet the application resiliency requirements.

NEXT POST: Understanding the Access Layer




We’re Not Building a Piano: Design Patterns for Resilient Application Infrastructure – Part 1

When I was doing work as a general contractor and building a house, one of the teams I worked with was a father-and-son pair who were helping to build the house along with me and the owner of the house. The son was nicknamed “Lumpy”, and he was learning the trade. Whenever Lumpy hammered in a nail that went crooked and folded over, he would begin pulling the nail out. This happened a couple of times before his father noted it with a phrase that I will never forget (apologies for the salty language):

For Christ’s sake, Lumpy, we’re not building a f%&*ing piano

The choice of wording aside, the issue is that too much time was being spent trying to make each part of the building process ideal, which was likened to a finely-tuned piano. Lumpy was spending 3-4 times as long trying to remove the bent nail as he would have spent just hammering it in crooked and then putting another nail beside it.

This series takes the same concept and puts it into practice for infrastructure design. Only once we understand the design patterns can we apply them to where it really matters, which is in service of application availability.

Infrastructure Resiliency – Stop Building Pianos

Crooked nails will be all over your virtual and physical infrastructure. They should be. What we need to focus on as infrastructure operations teams is moving further up the proverbial stack with our understanding of resiliency. We should be designing infrastructure using the same patterns the “at-scale” folks use wherever possible. This is the concept behind what Alex Polvi (founder of CoreOS) calls GIFEE (Google Infrastructure for Everyone Else).

You don’t need to be Google to think like them. The term SRE (Site Reliability Engineer) may seem like a buzzword title, but the concepts are sound and the fundamentals can be adopted more easily than we realize. It takes a little rethinking of what the goal of resiliency in infrastructure really is.

Bottom Up – Understanding N+1 and N+2

We use the phrase N+n to illustrate system component resiliency. When we talk about things like server availability for virtualization clustering, N+1 and N+2 are often confused in their meaning. N+n is a measure of how many individual components in a system can fail before critically affecting that system. If you have 12 hosts in a cluster, N+1 indicates you have enough resources to survive a single node (virtualization host) failure and continue servicing the remaining workloads. N+2 becomes a 2-node loss, and so on.

For host clustering and N+n illustration, we measure the percentage of resources left over for the surviving nodes. A single-node system obviously cannot sustain any loss at all. A 2-node cluster can sustain a single node failure and survive, but will be left with only 50% of its total compute resources.

As a sample, consider the effects of N+1, N+2, and N+3 node loss in clusters of up to 7 nodes. The calculations can be made quite easily using a simple formula:

Remaining Resource Percentage = ( ( Total Nodes – Nodes Lost ) / Total Nodes ) * 100
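To sanity-check the formula, here is a minimal shell sketch (the awk call just does the floating-point arithmetic) that prints the remaining capacity for N+1 through N+3 loss in clusters of 2 to 7 nodes:

# Remaining capacity (%) after losing "lost" of "nodes" hosts.
# Cases where the loss would take out the whole cluster are skipped.
for nodes in 2 3 4 5 6 7; do
  for lost in 1 2 3; do
    if [ "$lost" -lt "$nodes" ]; then
      awk -v n="$nodes" -v l="$lost" \
        'BEGIN { printf "%d nodes, N+%d: %.0f%% remaining\n", n, l, (n - l) / n * 100 }'
    fi
  done
done

Running it confirms the 2-node example above: a single node loss leaves 50% of capacity, while a 7-node cluster losing one node still retains roughly 86%.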

The interesting thing with cluster sizing is that we spend a surprising amount of time designing the clusters and then forget to keep track of the dynamic workloads. That’s another blog all unto itself. Our goal in this series is to uncover the upper layers at which we can understand resiliency. Even if you do not directly affect these layers yourself as an operations admin or IT architect, it’s my belief that we have a responsibility to know more in order to truly design and build resilient application infrastructure.

Understanding the True Full-Stack Infrastructure Resilience Approach

When somebody is described as a full-stack application designer, it usually means they are competent in both front-end (visual) and back-end (application logic and data) design. For full-stack infrastructure architects, there are a lot more layers. An IT architect needs to understand the physical layer (servers, storage, network), the data layers (relational databases, NoSQL databases), the application layers (application logic, code logic and code deployment), and the access layers (front-end load balancing and caching). All of these also need to be understood on traditional virtualization and on private or public cloud infrastructure. Yikes!

Have no fear: we are going to take these topics on with some simple and meaningful examples, and you will get a crash course in resilient application infrastructure. These fundamentals will give us the foundation to then apply the patterns to specific infrastructure deployments like AWS, Microsoft Azure, and private cloud products.

Strap in and enjoy the ride, and I hope that you find this series to be helpful!

NEXT POST: Full-Stack Infrastructure Understanding




Resetting vSphere 6.x ESXi Account Lockouts via SSH

VMware vSphere has had a good security feature since ESXi 6.0: an account lockout for safety. After a number of failed login attempts, the server triggers a lockout. This is a good safety measure when you have public-facing servers, and it is just as important for internally exposed servers on your corporate network. We can’t assume that external bad actors are the only ones attempting to breach your devices.

The vSphere Web Client shows us the settings that define the lockout count and duration. The parameters under the Advanced settings are as follows:

Security.AccountLockFailures
Security.AccountUnlockTime
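These can also be read and changed from the command line with esxcli. The following is a sketch assuming console or SSH access to the host; the values shown (10 failures, a 120-second unlock time) are illustrative examples, not recommendations:

# View the current lockout settings
esxcli system settings advanced list -o /Security/AccountLockFailures
esxcli system settings advanced list -o /Security/AccountUnlockTime

# Example: lock after 10 failed attempts, auto-unlock after 120 seconds
esxcli system settings advanced set -o /Security/AccountLockFailures -i 10
esxcli system settings advanced set -o /Security/AccountUnlockTime -i 120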

Resetting your Failed Login Attempts with pam_tally2

There is a rather simple but effective tool to help you do this. It’s called pam_tally2, and it is baked into your ESXi installation. The command line to clear the lockout status and reset the count to zero for an account is shown here, with the root account as an example:

pam_tally2 --user root --reset
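If you want to check where an account stands before clearing it, running pam_tally2 without the reset flag prints the current failure tally and the most recent failure:

# Show the current failed-login tally for root (no reset)
pam_tally2 --user root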

In order to gain access to do this, you will need SSH or console access to your server. Console access could be at a physical or virtual console. For SSH access, you should use SSH keys to make sure that you won’t fall victim to the lockouts for administrative users; in fact, this should be a standard practice. Setting up the SSH keys is relatively simple and is nicely documented in the Knowledge Base article Allowing SSH access to ESXi/ESX hosts with public/private key authentication (1002866):

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1002866

Uploading a key can be done with the vifs command as shown here:

https://docs.vmware.com/en/VMware-vSphere/6.0/com.vmware.vsphere.security.doc/GUID-392ADDE9-FD3B-49A2-BF64-4ACBB60EB149.html
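For reference, the upload looks something like this. The hostname and key filename below are placeholders, and the destination path is the one the documentation uses for the root account’s authorized keys:

# Upload a public key for the root account (hostname and key file are placeholders)
vifs --server esxi01.example.com --username root --put id_rsa.pub /host/ssh_root_authorized_keys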

The real question is why you have the interface exposed publicly in the first place. This is a deeper question that we have to make sure to ask ourselves at all times; as you can imagine, it’s generally not recommended. Ensuring you always use complex passwords and 2-factor authentication is another layer, which we will explore in the future. Hopefully this quick tip to safely reset your account logins is a good first step.