March 10, 2022

Terraform Config for Multi-Cloud: Solution

Shadow-Soft Team · 9 minute read

Introduction

Terraform is the industry standard tool for infrastructure provisioning. It provides a unified language for interacting with any supported API, which lets developers work with a variety of platforms through knowledge of Terraform and some measure of knowledge of the platform itself. The tool and language remove the need to learn multiple APIs and bindings, and provide a single, easy-to-use interface through a standardized language. However, the code usage is not unified across the platforms. In this article, we explore how to unify the code usage in a Terraform config for three major cloud platforms.

This article assumes beginner level experience with Terraform, basic coding skills, and familiarity with virtual instances.

Simple Solution for Single Module

Conceptual Solution

When last we left our heroes, we had an example resource:

resource "aws_instance" "this" {
  ami           = "ami-abcdefg1234567890"
  instance_type = "t3.medium"
}

We could not apply this resource to GCP and Azure because of differences in schema parameters and values. This nomenclature is generally mapped to the API and bindings through the provider for convenience, efficiency, and readability.

However, we did discuss that a provider could be developed to address all three cloud platforms generally, but also that this provider would be a major undertaking. What would a general cloud platform Terraform config utilizing this theoretical provider look like, though?

resource "virtual_instance" "this" {
  ami           = "ami-abcdefg1234567890"
  instance_type = "t3.medium"
}

Sure, that solves the problem of specifically referencing a platform provider and refers to a general “instance” resource, but we still have the problem of a specific AWS image ID and instance type for the values. We need to standardize the schema for the argument values.

resource "virtual_instance" "this" {
  ami           = "debian-11"
  instance_type = "general-medium"
}

That does look better already. We can definitely apply those values to all three cloud platforms in a logical and sensible manner. However, we still need to do something about those argument names for the sake of interfacing.

resource "virtual_instance" "this" {
  image = "debian-11"
  type  = "general-medium"
}

Now we are looking very good indeed. This is a resource that intuitively maps to AWS, GCP, and Azure bindings and APIs with a little brokering and wrapping. However, the provider backing this resource does not exist, so how are we supposed to work with this non-existent resource? What exists in Terraform that would allow us to code something like this declaration?

module "instance" {
  source = "./instance"

  platforms = ["aws", "gcp", "azure"]
  image     = "debian-11"
  type      = "general-medium"
}

Bingo!

Instance Module

Now we need to actually develop the module backing these multi-cloud instances. It is assumed there exists a versions.tf specifying Terraform ~> 1.0, AWS ~> 3.0, Google ~> 4.0, and AzureRM ~> 2.0. We now need to declare the variables we assumed above.

# variables.tf
variable "platforms" {
  type        = list(string)
  description = "The desired cloud platforms supporting the declared instances."
  default     = []

  validation {
    condition     = alltrue([for platform in var.platforms : contains(["aws", "gcp", "azure"], platform)])
    error_message = "One of the specified cloud platforms is invalid. Only 'aws', 'gcp', and 'azure' are allowed."
  }
}

variable "image" {
  type        = string
  description = "The desired image for the instance."

  validation {
    condition     = contains(["debian-10", "debian-11", "ubuntu-22", "rhel-8", "fedora-36"], var.image)
    error_message = "The specified instance image is invalid."
  }
}

variable "type" {
  type        = string
  description = "The desired type for the instance."

  validation {
    condition     = contains(["general-small", "general-medium", "general-large", "memory-large", "cpu-small"], var.type)
    error_message = "The specified instance type is invalid."
  }
}
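
For reference, the versions.tf assumed at the start of this section could look roughly like the following sketch. The version constraints are the ones stated above; the provider source addresses are the standard registry namespaces and are an assumption here, since the article does not list them.

```hcl
# versions.tf
terraform {
  required_version = "~> 1.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws" # assumed registry source
      version = "~> 3.0"
    }
    google = {
      source  = "hashicorp/google" # assumed registry source
      version = "~> 4.0"
    }
    azurerm = {
      source  = "hashicorp/azurerm" # assumed registry source
      version = "~> 2.0"
    }
  }
}
```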

Now we need to write the config for the specific instances. For the sake of brevity, we will let the cat out of the bag early: a simple solution for brokering the module inputs to the wrapped resources is static data defined inside a locals block. This block will be explained in the next section.

# aws_instance.tf
data "aws_ami" "this" {
  # set still only type without a constructor
  for_each = contains(var.platforms, "aws") ? toset(["this"]) : []

  most_recent = true
  owners      = [local.image["aws"][var.image].owner]

  filter {
    name   = "name"
    values = ["${local.image["aws"][var.image].name}-*"]
  }
}

resource "aws_instance" "this" {
  # even tuples have a constructor
  for_each = contains(var.platforms, "aws") ? toset(["this"]) : []

  ami           = data.aws_ami.this["this"].id
  instance_type = local.type["aws"][var.type]
}

# gcp_instance.tf
data "google_compute_image" "this" {
  # set type constructor when?
  for_each = contains(var.platforms, "gcp") ? toset(["this"]) : []

  family  = local.image["gcp"][var.image].family
  project = local.image["gcp"][var.image].project
}

resource "google_compute_instance" "this" {
  for_each = contains(var.platforms, "gcp") ? toset(["this"]) : []

  machine_type = local.type["gcp"][var.type]

  boot_disk {
    initialize_params {
      image = data.google_compute_image.this["this"].self_link
    }
  }

  # dummy up other required args
  name = each.value
  network_interface { network = "default" }
}

# azure_instance.tf
resource "azurerm_linux_virtual_machine" "this" {
  for_each = contains(var.platforms, "azure") ? toset(["this"]) : []

  size = local.type["azure"][var.type]

  source_image_reference {
    publisher = local.image["azure"][var.image].publisher
    offer     = local.image["azure"][var.image].offer
    sku       = local.image["azure"][var.image].sku
    version   = "latest"
  }

  # dummy up other required args
  name                  = each.value
  resource_group_name   = "my-resource-group"
  location              = "West Europe"
  admin_username        = "user"
  network_interface_ids = ["/subscriptions/subid/resourceGroups/my-resource-group/providers/Microsoft.Network/networkInterfaces/test-nic"]

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Standard_LRS"
  }
}

As we can observe above, a single input variable is mapped to the resource schema for each provider resource for an instance. We have now achieved a single Terraform root module config for multi-cloud management that is completely valid and robust.

Static Data Brokering

To achieve this single code interface for multiple cloud instances, we need to simulate static general data in Terraform. We can do this with locals in lieu of an intrinsic solution provided by the tool itself (other DIY solutions also exist). An example locals block for brokering the general interface to the specific resource schemas is as follows:

locals {
  type = {
    "aws" = {
      "general-medium" = "t3.large"
    }
    "gcp" = {
      "general-medium" = "e2-standard-2"
    }
    "azure" = {
      "general-medium" = "Standard_D2as_v5"
    }
  }

  image = {
    "aws" = {
      "debian-11" = {
        owner = "136693071363"
        name  = "debian-11-amd64"
      }
    }
    "gcp" = {
      "debian-11" = {
        family  = "debian-11"
        project = "debian-cloud"
      }
    }
    "azure" = {
      "debian-11" = {
        publisher = "Debian"
        offer     = "Debian"
        sku       = "11"
      }
    }
  }
}

With the above static data placed in locals, we can easily write a single code declaration for multiple cloud platforms. Note that the structure above is <variable>.<provider>.<value> instead of <provider>.<variable>.<value>. The two are essentially equivalent, except that adding or removing functionality would, in general, happen per variable and not per provider. Therefore, it is easier and safer to adhere to the structure outlined above to reduce the probability of human error.
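
To illustrate why the per-variable structure is convenient, here is a rough sketch of adding one more brokered input. The "disk" variable, the "general-ssd" value, and the local.disk map are hypothetical names invented for this illustration; the mapped values are existing disk classes on each platform. The whole addition is one new variable plus one new top-level key, with no edits inside the existing type and image maps.

```hcl
# variables.tf (hypothetical addition)
variable "disk" {
  type        = string
  description = "The desired boot disk class for the instance."
  default     = "general-ssd"

  validation {
    condition     = contains(["general-ssd"], var.disk)
    error_message = "The specified disk class is invalid."
  }
}

# locals (hypothetical addition alongside type and image)
locals {
  disk = {
    "aws" = {
      "general-ssd" = "gp3"
    }
    "gcp" = {
      "general-ssd" = "pd-balanced"
    }
    "azure" = {
      "general-ssd" = "StandardSSD_LRS"
    }
  }
}

# The lookup then follows the same pattern as type and image, e.g.
#   aws:   root_block_device { volume_type = local.disk["aws"][var.disk] }
#   gcp:   initialize_params { type = local.disk["gcp"][var.disk] }
#   azure: os_disk { storage_account_type = local.disk["azure"][var.disk] }
```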

Extending and Expanding Functionality

Additional Variables

It is rather intuitive to add new module variables and local data to enable additional arguments and values for the instance resources, but we can also explore how to extend this module from one to ‘n’ numbered instances. In reality, we would preferably use auto-scaling for this purpose and therefore this exercise is likely frivolous for a production environment, but it is an illustrative and didactic example nonetheless.

Let us declare an additional module variable for the number of instances.

# variables.tf
variable "instances" {
  type        = number
  description = "The number of instances per cloud platform."
  default     = 1
}

Now we have a variable corresponding to the number of instances we want in each cloud platform. We can modify the for_each meta-argument in the instance resources correspondingly.

resource "aws_instance" "this" {
  for_each = contains(var.platforms, "aws") ? toset([for instance in range(var.instances) : tostring(instance)]) : []
  ...
}

resource "google_compute_instance" "this" {
  for_each = contains(var.platforms, "gcp") ? toset([for instance in range(var.instances) : tostring(instance)]) : []
  ...
}

resource "azurerm_linux_virtual_machine" "this" {
  for_each = contains(var.platforms, "azure") ? toset([for instance in range(var.instances) : tostring(instance)]) : []
  ...
}

We can also more easily use the classic count meta-argument here. To keep the per-platform toggle, the count would need to be conditional, for example count = contains(var.platforms, "aws") ? var.instances : 0, and references to an individual instance would then use a numeric index instead of a string key.

The variable for the number of instances can be converted into a map(object) type which contains the image and type variables within the object. This would then enable sets of instances of varying number, type, and image. This would also require structure transformations with for expressions, but these would be straightforward. In this situation, the meta-argument for the resource would then absolutely need to be the modern for_each and not the classic count.
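
The map(object) idea can be sketched as follows for AWS only. This is a minimal sketch under assumptions, not a definitive implementation: the reshaped "instances" variable, the local.instances name, and the "name-index" key format are hypothetical, while local.image and local.type are the locals from the static data brokering section.

```hcl
# variables.tf
# Hypothetical replacement for the separate "instances", "image", and "type" variables.
variable "instances" {
  type = map(object({
    number = number
    image  = string
    type   = string
  }))
  description = "Sets of instances keyed by name, each with its own number, image, and type."
  default     = {}
}

locals {
  # Flatten the sets into one map entry per instance, keyed like "application-0".
  instances = {
    for instance in flatten([
      for name, group in var.instances : [
        for index in range(group.number) : {
          key   = "${name}-${index}"
          image = group.image
          type  = group.type
        }
      ]
    ]) : instance.key => instance
  }
}

# aws_instance.tf
data "aws_ami" "this" {
  # one lookup per distinct image, and none when AWS is not requested
  for_each = toset([for instance in local.instances : instance.image if contains(var.platforms, "aws")])

  most_recent = true
  owners      = [local.image["aws"][each.value].owner]

  filter {
    name   = "name"
    values = ["${local.image["aws"][each.value].name}-*"]
  }
}

resource "aws_instance" "this" {
  # filtering inside the for expression yields an empty map when AWS is not requested
  for_each = { for key, instance in local.instances : key => instance if contains(var.platforms, "aws") }

  ami           = data.aws_ami.this[each.value.image].id
  instance_type = local.type["aws"][each.value.type]
}
```

Placing the platform check inside the for expression, rather than in a conditional with an empty literal, keeps both outcomes the same collection type. The string keys also mean that removing one set does not shift the addresses of the remaining instances, which is the reason count is not suitable here.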

Additional Resources

Thus far this article has explored how to manage an instance with a single code interface for multi-cloud. It is of course possible to extend this design pattern to other resources that exist in multiple cloud providers, such as subnets or VPCs. The two potential architectures for the general interface are to create additional modules mapped to each resource, or to maintain a single module interface with nested modules for each cloud platform.

The design pattern for multiple modules isomorphic to cloud services follows intuitively from the above design pattern. In the VPCs and subnets example, we would simply create an additional module for VPCs, and either combine subnets into that module, or create a separate module for the subnets. The advantages of this model would include separate release management per service group, mitigation of coupled side effects between services, and easy development of new supported services. The disadvantages would include extensive module input/output mappings, repeated code throughout the codebase in each module, and poor scaling for large cloud infrastructure environments. It follows therefore that grouping similar services together (e.g. VPC and subnet) would be ideal in this scenario and architecture for mitigating disadvantages.

Nested modules from a single module interface would require refactoring the approach thus far. We would have something akin to:

module "cloud" {
  source = "./cloud" # hypothetical path, mirroring the ./instance example

  platforms = ["aws", "gcp", "azure"]

  instance = {
    "application" = {
      number = 3
      image  = "debian-11"
      type   = "general-medium"
    }
  }

  vpc = {
    "my-vpc" = {
      cidr    = "10.0.0.0/16"
      subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
    }
  }
}

and then nested module declarations within this cloud module for each of the supported cloud providers. The advantages of this approach include a single cloud platform conditional in the nested module declaration for each platform instead of on individual resources, easy input/output mapping between resources, and simple data brokering. The brokering could use a locals block within each nested module containing only that cloud’s schema, or it could maintain the current locals and broker to modules specific to each cloud provider (this would be very ideal in general). The disadvantages include difficulty implementing new services in multiple codebases that maintain cross-compatibility, lack of separation between service resources causing a “domino effect” during issues, and extra caution for impact to current services when developing support for new services.
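
The nested module declarations described above might look something like the following sketch. The nested module paths and the idea of passing the "instance" and "vpc" inputs straight through are assumptions made for illustration; the article does not define the nested modules' interfaces.

```hcl
# main.tf inside the "cloud" module
# Hypothetical layout: one nested module per platform, each wrapping only
# that platform's resources. The platform conditional appears once per module.
module "aws" {
  source = "./aws" # hypothetical nested module path
  count  = contains(var.platforms, "aws") ? 1 : 0

  instance = var.instance
  vpc      = var.vpc
}

module "gcp" {
  source = "./gcp" # hypothetical nested module path
  count  = contains(var.platforms, "gcp") ? 1 : 0

  instance = var.instance
  vpc      = var.vpc
}

module "azure" {
  source = "./azure" # hypothetical nested module path
  count  = contains(var.platforms, "azure") ? 1 : 0

  instance = var.instance
  vpc      = var.vpc
}
```

One design consequence worth noting: a module called with count cannot contain its own provider configuration blocks, so in this layout the provider configurations would live in the root config.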

It would also be helpful here if there was a reserved keyword in Terraform for the declared module name, but that does not exist yet, although it exists in other declarative DSLs. There is only a small demand for it at the moment, so its eventual implementation is doubtful.

Robust Solution for Single Module

Up to this point, we have explored static data brokering for the wrapped interfaces to the cloud platforms. It would be better if the data could be dynamic. For that, we would need to utilize the external data source. Assume that versions.tf now contains external as a required provider at ~> 2.0.

For the sake of brevity, we will demonstrate this for AWS only. Although an external data source can be developed in a variety of languages, we will use Python 3.9 as an example because many people will likely be familiar with it. We need to declare the external data sources for the AWS data.

data "external" "aws_image" {
  program = ["python3", "${path.module}/data/data.py"]
  query   = { var = "image", platform = "aws", value = var.image }
}

data "external" "aws_type" {
  program = ["python3", "${path.module}/data/data.py"]
  query   = { var = "type", platform = "aws", value = var.type }
}

We also need a source for the dynamic data. For the sake of simplicity, we will use a YAML file containing the data. In reality, it is possible to directly interface with a YAML file in a Terraform config to read in data, and this example is also technically still static data. However, it is useful as an example of capability, and one could easily adapt the external data to leverage cloud bindings (e.g. boto3) instead of a file for information to fully realize the capabilities here.

---
type:
  aws:
    'general-medium':
      type: t3.large
  gcp:
    'general-medium':
      type: e2-standard-2
  azure:
    'general-medium':
      type: Standard_D2as_v5

image:
  aws:
    debian-11:
      owner: '136693071363'
      name: debian-11-amd64
  gcp:
    debian-11:
      family: debian-11
      project: debian-cloud
  azure:
    debian-11:
      publisher: Debian
      offer: Debian
      sku: '11'
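
As noted above, Terraform can also read a YAML file like this one directly, without an external program. A minimal sketch, assuming the file is saved as data/data.yaml next to the data.py program referenced in the external data sources, and using hypothetical local names:

```hcl
locals {
  # decode the whole YAML document into a nested object
  data = yamldecode(file("${path.module}/data/data.yaml"))

  # hypothetical convenience locals for the AWS lookups
  aws_image = local.data["image"]["aws"][var.image]
  aws_type  = local.data["type"]["aws"][var.type]["type"]
}

# The values are then consumed like the static locals, e.g.
#   owners        = [local.aws_image["owner"]]
#   instance_type = local.aws_type
```

This keeps the data in a separate file without adding a runtime dependency, but it remains static data, which is why the article continues with the external data source.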

We then need a simple executable for retrieving and interfacing the data to Terraform. If this code’s structure and design are unfamiliar, then it is recommended to consult a tutorial and documentation on external data sources in Terraform.

"""dynamic data brokering for terraform"""
import json
import pathlib
import sys
import yaml

# unmarshal stdin json to dictionary and assign values
input_vars: dict = json.loads(sys.stdin.read())
var: str = input_vars['var']
platform: str = input_vars['platform']
value: str = input_vars['value']

# load in data from yaml (read_text opens and closes the file) and access subset for variable
data_file: pathlib.Path = pathlib.Path(__file__).parent.resolve().joinpath('data.yaml')
data: dict = yaml.safe_load(data_file.read_text(encoding='utf8'))[var][platform][value]

# output json of data for terraform consumption
sys.stdout.write(json.dumps(data, indent=2))

Now we must update our Terraform config to leverage our new dynamic data.

data "aws_ami" "this" {
  for_each = contains(var.platforms, "aws") ? toset(["this"]) : []

  most_recent = true
  owners      = [data.external.aws_image.result["owner"]]

  filter {
    name   = "name"
    values = ["${data.external.aws_image.result["name"]}-*"]
  }
}

resource "aws_instance" "this" {
  for_each = contains(var.platforms, "aws") ? toset([for instance in range(var.instances) : tostring(instance)]) : []

  ami           = data.aws_ami.this["this"].id
  instance_type = data.external.aws_type.result["type"]
}

Now the wrapped interface utilizes dynamic data brokering. We can now update values for the individual cloud platform providers more easily, safely, and in a future-proof manner.

However, note that external data does introduce an additional dependency. In this situation, your pipeline agents executing infrastructure provisioning will now require Python and the libraries the external program depends on (PyYAML in this example).

It is also interesting to revisit structuring the data by platform instead of by variable. The external data source requires the JSON passed to Terraform to be a single-level object of key-value pairs of string types. We can perform a data transformation and enable the verisimilitude of a second level with the for_each meta-argument:

data "external" "aws" {
  for_each = { "type" = var.type, "image" = var.image }

  program = ["python3", "${path.module}/data/data.py"]
  query   = { var = each.key, platform = "aws", value = each.value }
}

# data.external.aws["image"].result["owner"]

This does simplify the situation somewhat by enabling a singular data declaration per cloud platform. It also works better than the previous example for the situation where we have sets of instances with individual attributes for their types and images.

Conclusion

We have now explored and discussed how to simultaneously manage multiple cloud platforms with a single Terraform config. The examples demonstrate how to broker the code to the wrapped interface with static and dynamic data and how to extend to additional cloud services and service customization. The ideal solution would still be a generalized cloud provider schema for shared service resources and data, but until that dream becomes a reality, this solution works quite well. It also is completely under the developer’s control and is not subject to the roadmaps of the vendors and their development teams. This is because the Terraform provider SDK is standardized.

If your organization is interested in vastly improving infrastructure and platform management for your systems, applications, or other software you develop or otherwise utilize, contact Shadow-Soft.