Data Resiliency

Before we get into what the keyword (Resiliency) means, let's understand what led to this concept.


Purneswar Prasad

2 years ago | 4 min read


Consider the example of a fiber optic connection between two buildings. As a backup, there is always a second, replica connection in case of a failure. This is called Redundancy.

But there's a problem with this arrangement. External factors like construction or pests can expose both wires at the very same time, resulting in a complete loss of connectivity if they're destroyed.
Thus, the term Resiliency was born.

Resiliency is the capacity of a system to recover quickly from difficulties, heal itself and converge back to its original state.

The exact same connection can instead be laid out so that the backup wire takes a different route, connecting different locations of the two buildings. So, if one of them is exposed, the other stays completely safe!

Let's look at its significance in the cloud and cloud-native space:

When talking about systems and resiliency in the cloud, certain terms come into play: availability, reliability, durability and, of course, resiliency. They are used to measure how well a user's request is handled by the server. This is a nice blog post explaining these terms in detail.

The cloud-native space has opened up a new domain for engineers assigned to this role, called Site Reliability Engineers (SREs).

Every company, big or small, prioritizes its customers and should be available for every request, however small. So SREs enter into agreements with clients to maintain a certain percentage of uptime (the time their servers stay active and running).

They achieve this by targeting a high availability percentage (in the range of 99.9%). This is measured using metrics like SLAs, SLOs and SLIs. Here is a blog explaining the basics of cloud native and SREs in detail.
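To get a feel for what those "nines" actually mean, here is a small sketch that converts an availability target into the downtime it permits per year. The arithmetic is just (1 - target) × minutes in a year; the function name is my own.

```python
# Downtime allowed per year for common availability targets ("the nines").
# allowed downtime = (1 - target) * period

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def allowed_downtime_minutes(availability: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return (1 - availability) * MINUTES_PER_YEAR

for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} uptime -> {allowed_downtime_minutes(target):.1f} min/year")
```

A 99.9% target leaves only about 525 minutes (under 9 hours) of downtime for an entire year, which is why SREs track these budgets so carefully.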

Resiliency in the cloud looks something like this:

Here, when a user sends a request, a load balancer distributes it across several back-end servers, which try to fulfill it. If one of the servers goes down, the others can complete the request, making the system resilient.
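That failover behavior can be sketched in a few lines. This is a toy model, not a real load balancer: the server names and the `healthy` flag are illustrative assumptions.

```python
import random

class Server:
    """A back-end server that may be up or down."""
    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request!r}"

class LoadBalancer:
    """Distributes requests and fails over when a back-end is down."""
    def __init__(self, servers):
        self.servers = servers

    def route(self, request: str) -> str:
        # Try back-ends in random order; one dead server is not an outage.
        for server in random.sample(self.servers, len(self.servers)):
            try:
                return server.handle(request)
            except ConnectionError:
                continue  # fail over to the next back-end
        raise RuntimeError("all back-ends are down")

lb = LoadBalancer([Server("app-1"), Server("app-2", healthy=False), Server("app-3")])
print(lb.route("GET /home"))  # succeeds even though app-2 is down
```

The key idea is that the failure of a single component is absorbed, invisibly to the user, instead of propagating outward.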

But construction and rats aren't the problem in cloud computing, so why do we need Resiliency?

Because downtime (the opposite of uptime) is the biggest evil in production. There have been multiple instances in the past where big MNCs lost millions in revenue due to poorly managed systems.

With the advent of cloud-native technologies and microservice architecture, automation and ease of deployment have reached new heights, but the cost of failure keeps growing with them. On October 4, 2021, Facebook suffered a global outage lasting around six hours that reportedly cost it billions of dollars. Here is an article by the engineering team explaining it.

Because of this danger, Netflix, after running into a similar problem, developed a practice called Chaos Engineering in 2010 to make its systems more fault-tolerant.
Chaos Engineering is a disciplined approach to identifying failures before they become outages.

By proactively testing how a system responds under stress, you can identify and fix failures before they end up in the news.

This was later adopted by the Cloud Native Computing Foundation (CNCF), which incubated projects like LitmusChaos and Chaos Mesh. Simply speaking, it is the practice of "breaking things on purpose" just to make systems more resilient to failures.
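Here is a toy illustration of "breaking things on purpose": kill a random replica of a service, check that the service stays available, and let a supervisor restore the replica. The replica names and the supervisor loop are hypothetical; real chaos tools like LitmusChaos inject failures into live Kubernetes clusters.

```python
import random

# Three replicas of a hypothetical "web" service.
replicas = {"web-1": "up", "web-2": "up", "web-3": "up"}

def inject_failure() -> str:
    """The 'chaos' step: take down one replica at random."""
    victim = random.choice(list(replicas))
    replicas[victim] = "down"
    return victim

def service_available() -> bool:
    """The service is up as long as any replica is up."""
    return any(state == "up" for state in replicas.values())

def heal() -> None:
    """Supervisor loop: restart anything that is down."""
    for name, state in replicas.items():
        if state == "down":
            replicas[name] = "up"

victim = inject_failure()
assert service_available(), "one dead replica must not take the service down"
heal()
print(f"killed {victim}; service stayed up and the replica was restarted")
```

The experiment only passes if redundancy and self-healing both work, which is exactly what chaos engineering sets out to verify before a real outage does.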

How does Kubernetes achieve it?

To give a small intro, Kubernetes (k8s) is a tool that helps you deploy your applications, then scale and manage them as you wish. To learn how k8s is resilient in itself, we need to dive deep into the various components constituting it.

Here is how Kubernetes looks from the inside:

I know this is a lot to have thrown at your face suddenly, so let's try to understand it from a layman's perspective.

Imagine a lizard. When a lizard's tail gets cut off, it doesn't die; the tail grows back. But take its heart out and it dies instantly (as would every being on this planet).
K8s has a similar structure. There is a control plane and multiple worker nodes. The control plane is the most important part because it contains the heart: the API server (along with a few other components).

Any application you deploy runs on the worker nodes. The command to run it is relayed to the different components by the API server alone, so all the components watch the API server for commands.

If, due to some error, one of the worker nodes goes down, the components watching the API server see that an application which should be running isn't, and through internal mechanisms the application is rescheduled onto a different worker node. Meanwhile, a new worker node can be created to replace the destroyed one (the lizard analogy!). Thus, downtime stays at an all-time low!
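The self-healing described above boils down to a control loop: compare the desired state with the observed state and act on the difference. This is a stripped-down sketch of that idea; real Kubernetes controllers watch the API server, and the pod names here are made up.

```python
desired_replicas = 3
running_pods = ["app-a", "app-b", "app-c"]

def reconcile(desired: int, running: list) -> list:
    """One pass of a control loop: replace any missing pods."""
    while len(running) < desired:
        new_pod = f"app-replacement-{len(running)}"  # hypothetical naming
        running.append(new_pod)
        print(f"rescheduled {new_pod} onto a healthy node")
    return running

# A worker node dies, taking its pod with it.
running_pods.remove("app-b")

# The control loop notices the gap and closes it.
running_pods = reconcile(desired_replicas, running_pods)
```

Because the loop runs continuously, the cluster keeps converging back to the declared state no matter how it was perturbed, which is resiliency by design.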

These features make Kubernetes one of the biggest technical innovations of recent times and an industry revolutionizer, with more and more companies adopting cloud-native technologies and scaling up their production with such container orchestration tools.

As a wise man once said, "Learning Kubernetes isn't a technical challenge, it's a people challenge".

If you've reached the end of this blog, I hope I've kept you interested enough to learn more about cloud native, containers and Kubernetes.

