Achieving High Availability in VOS Cloud - Part 1 (Availability Zones)

  • Updated 2 years ago
For an application processing linear video services, High Availability (HA) is a key requirement. However, achieving it is no small accomplishment. In the design of VOS Cloud, we spent a lot of time thinking about the things that can go wrong and addressing them in every part of our application. This article details how we tackled the various potential problems and aims to educate our customers on what options are available to them to optimize failure recovery to meet their business needs.

Perhaps the best place to start this discussion, surprisingly enough for a cloud native software application, is by talking about hardware and hardware topology of the cloud.

What is an Availability Zone?

An Availability Zone (AZ) is an isolated location inside a region (e.g., datacenters, racks, power strips, etc.). Each AZ belongs to a single region and has its own fault domain, so that a failure in one zone does not impact another, as illustrated in the diagram below. At the same time, the zones are connected through low-latency links, so that running across zones has no performance impact.

Although rare, failures can occur that can affect the availability of instances that are in the same location. If you host all of your instances in a single location that is affected by such a failure, none of your instances would be available. If you distribute your instances across multiple AZs and one instance fails, you can design your application so that an instance in another AZ can handle requests as an emergency back-up.
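
The reasoning above can be checked with a small sketch. The instance and AZ names below are hypothetical and are not taken from VOS Cloud; the code only illustrates why a single-AZ placement cannot survive the loss of that AZ, while a multi-AZ placement can.

```python
def surviving_instances(placement, failed_az):
    """Return the instances still available after one AZ fails.

    `placement` maps an instance name to the AZ it runs in.
    """
    return [name for name, az in placement.items() if az != failed_az]


def tolerates_any_single_az_failure(placement):
    """True if at least one instance remains whichever single AZ fails."""
    zones = set(placement.values())
    return bool(placement) and all(
        surviving_instances(placement, az) for az in zones
    )


# Hypothetical placements (names are illustrative only).
single_az = {"encoder-a": "az-1", "encoder-b": "az-1"}
multi_az = {"encoder-a": "az-1", "encoder-b": "az-2"}

print(tolerates_any_single_az_failure(single_az))
print(tolerates_any_single_az_failure(multi_az))
```

With both instances in one AZ, losing that AZ leaves nothing to take over. With one instance per AZ, either AZ can fail and the other instance remains.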

When building a cloud, you usually define two or more AZs inside the datacenter. Examples of AZs are servers on:

  1. Different power lines feeding the datacenter
  2. Different blade enclosures
  3. Different rooms inside the datacenter
  4. Different buildings
  5. Etc.

We have built our application to take advantage of the AZ construct as built into AWS and OpenStack. An AZ is an attribute that can be attached to compute nodes.

Using Mesos as an "Availability Zone Aware" Hardware Scheduler

Mesos, a hardware scheduler, is aware of the different AZs. The AZ of each node is queried through the metadata Application Programming Interface (API), a standard service provided by AWS and OpenStack. When we launch Mesos on a node, we pass it a parameter telling it which AZ that node is running in, so each processing node in the system is tagged with its AZ. Within VOS, we tag certain tasks to be run in separate AZs if more than one is present.
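
As a rough illustration of this flow, the sketch below looks up a node's AZ from the instance metadata service and turns it into a Mesos agent argument. It is not the VOS implementation. The attribute name `az`, the helper names, and the use of the generic Mesos agent `--attributes` flag are assumptions, since the article does not say which parameter VOS passes. The metadata URLs are the commonly documented link-local endpoints, and some AWS configurations require a session token that this sketch does not request.

```python
import json
import urllib.request

# Commonly documented link-local metadata endpoints (assumed reachable).
EC2_AZ_URL = "http://169.254.169.254/latest/meta-data/placement/availability-zone"
OPENSTACK_META_URL = "http://169.254.169.254/openstack/latest/meta_data.json"


def http_get(url, timeout=2.0):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def discover_az(platform, fetch=http_get):
    """Return the AZ of the current node for the given platform.

    `fetch` is injectable so the lookup can be exercised without a network.
    """
    if platform == "aws":
        return fetch(EC2_AZ_URL).strip()
    if platform == "openstack":
        return json.loads(fetch(OPENSTACK_META_URL))["availability_zone"]
    raise ValueError(f"unsupported platform: {platform!r}")


def mesos_agent_args(az, attribute_name="az"):
    """Build an agent argument that tags the node with its AZ.

    The attribute name is hypothetical; only the flag itself is standard.
    """
    return [f"--attributes={attribute_name}:{az}"]
```

On a node, `mesos_agent_args(discover_az("aws"))` would produce an argument such as `--attributes=az:<zone name>`, which could then be appended to the agent's launch command.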

How Many Availability Zones Can We Support in AWS and OpenStack?

On AWS, we can launch EC2 instances into at least 2 AZs and can support up to 4. On OpenStack, the number of AZs we support depends on how your OpenStack deployment is configured. We use AZ information to provide HA by distributing redundant tasks across different AZs.

Which Services in VOS are “Availability Zone Sensitive”?

All services and tasks that have redundancy can utilize AZ information.

Within VOS, we tag certain tasks to be run in separate AZs if more than one is present. The means by which this is achieved is similar on AWS and OpenStack: each processing node in the system is tagged with its AZ, and when tasks are spun up, rules determine whether certain tasks may run in the same AZ. Examples of these are:

  • Redundant transcoding tasks, which run in different AZs.
  • VOS HA Process: the primary access to the VOS Cloud system is the API. Perhaps you only use the User Interface, but this is built on top of the API. In fact, both of them are served by the same Web Server. Therefore, loss of access or failure of this component would leave you unable to configure the system or see if it is healthy, but would not stop any of your running services or affect the HA of the rest of VOS Cloud.
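
To make the idea of running redundant tasks in separate AZs "if they are present" more concrete, here is a minimal placement sketch. It is not the VOS or Mesos scheduling logic. The function, node names, and task names are hypothetical, and a simple round-robin over AZs stands in for whatever rules the real scheduler applies.

```python
from collections import defaultdict
from itertools import cycle


def place_redundant_tasks(task_names, node_az):
    """Assign each redundant task to a node, rotating through the AZs.

    `node_az` maps a node name to its AZ tag. With two or more AZs,
    consecutive tasks land in different AZs; with only one AZ, all
    tasks fall back to that AZ.
    """
    nodes_by_az = defaultdict(list)
    for node, az in sorted(node_az.items()):
        nodes_by_az[az].append(node)
    if not nodes_by_az:
        raise ValueError("no nodes available")

    placement = {}
    placed_per_az = defaultdict(int)      # tasks already placed in each AZ
    az_cycle = cycle(sorted(nodes_by_az))
    for task in task_names:
        az = next(az_cycle)
        nodes = nodes_by_az[az]
        placement[task] = nodes[placed_per_az[az] % len(nodes)]
        placed_per_az[az] += 1
    return placement


# Hypothetical cluster: two nodes in az-1, one node in az-2.
cluster = {"node-1": "az-1", "node-2": "az-1", "node-3": "az-2"}
tasks = ["transcoder-primary", "transcoder-backup"]
print(place_redundant_tasks(tasks, cluster))
```

In this example the primary and backup transcoders end up on nodes in different AZs, so the loss of either AZ leaves one of them running.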

Yaniv Ben-Soussan, Product Manager

Posted 2 years ago
