011 - Infrastructure Code Disaggregation #181
cpcundill
started this conversation in
Open design proposal
Open Design Proposal - 011 - Infrastructure Code Disaggregation
Author(s)
Introduction
Disaggregation of Infrastructure code is proposed in order to realise the following improvements:
Status
Detail
Problem
Big ball of state
The Digital Land Infrastructure codebase, implemented with Terraform, has grown organically over time. Originally only responsible for provisioning the Platform system, it has now grown to provision AWS resources for the whole Digital Land Service which spans the Platform, Data Collection Pipeline, Providers and Data Design systems.
Terraform uses a state file to map infrastructure resource definitions (code) to real-world infrastructure resources (e.g. in AWS).
State must be locked when applying changes. At the time of writing (Dec '24), the Terraform codebase manages over 1,300 AWS resources. The large size of the state has begun to cause problems for a few reasons:
Even a minor change currently requires reading state and AWS resources across the whole programme infrastructure, and blocks any other change from being applied concurrently. An innocuous change such as introducing a new CloudWatch Log metric filter should not block the rollout of an ECS cluster change, and vice versa.
Borrowing terminology from the architecture abstractions encapsulated in the C4 modelling approach, the current infrastructure codebase is responsible for the whole system context of the programme. This is unsustainable.
Environment Branches
The current behaviour of the Terraform provisioning code expects all resources to be deployed into all environments. Granted, there is some customisation of parameters between environments, but not much in the way of control over which resources are included. Unfortunately, by necessity, this has led to maintaining an environment branch (dev) for the development environment. The main branch is used to represent the Terraform code for application to staging and production environments. Often, engineers are required to develop features against the dev branch and then reimplement against the main branch for promotion to higher environments.
Solution
Summary
Disaggregate the Terraform codebase into multiple units based around systems:
Indicative code structure
The provisioning code for each system can exist within a sub-folder of the infrastructure code repository and have its own dedicated state; this will allow for the application of changes to different systems concurrently within the same environment. Existing modules can be shared by the systems and can undergo some refactoring and simplification as necessary.
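As a minimal sketch of what "its own dedicated state" could look like, each system's root module might declare its own backend with a distinct state key. The proposal does not name a backend, so the S3 backend, the bucket name, the region and the key layout below are all assumptions for illustration.

```hcl
# systems/platform/backend.tf (hypothetical file name)
# Assumes an S3 backend; the bucket, key and region are illustrative only.
terraform {
  backend "s3" {
    bucket  = "example-terraform-state"            # hypothetical bucket
    key     = "systems/platform/terraform.tfstate" # one key per system
    region  = "eu-west-2"                          # assumed region
    encrypt = true
    # State locking settings are omitted here; configure them to match
    # whatever locking mechanism the existing backend already uses.
  }
}
```

Because each system reads and locks only its own state, an apply against one system would not block an apply against another in the same environment.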
Each system must be independent and use data lookups for any dependencies rather than code-level references.
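A data lookup for a cross-system dependency might look like the following sketch. It assumes, purely for illustration, that one system owns a VPC tagged with a known name and that another system needs to place a security group in it; the tag value and resource names are hypothetical.

```hcl
# Look up a VPC owned by another system by tag instead of referencing
# that system's resource or module output directly.
data "aws_vpc" "shared" {
  tags = {
    Name = "example-platform-vpc" # hypothetical tag value
  }
}

# Find the subnets in that VPC, again by lookup.
data "aws_subnets" "shared" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.shared.id]
  }
}

# A resource in this system depends only on the looked-up ID.
resource "aws_security_group" "service" {
  name   = "example-service" # hypothetical name
  vpc_id = data.aws_vpc.shared.id
}
```

With this approach the only coupling between the two systems is the lookup key (here, a tag), so each can be planned and applied from its own state.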
Indicative repository folder structure
The top-level structure within the code repository will include two main folders:
The modules folder will continue to be the container for shared modules, while the systems folder will contain isolated units (each with its own independent state).
A more expanded view of the folder structure would reveal:
Each system directory will contain an environments directory which holds Terraform values files for each environment into which the system is to be deployed. Care must be taken to ensure that the Terraform provisioning code within the system can handle no-op scenarios for a given environment. That is to say, it must be possible to deploy a system, and indeed individual aspects of a system, to selected environments only. This principle is critical to solving the aforementioned environment branch problem, so that a single branch (main) serves all environments. Any differences required between environments should be defined within environment-specific tfvars files.
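One common way to support such no-op scenarios is a boolean variable that drives `count`, so a resource is created only in environments whose values file switches it on. This is a sketch under assumptions: the variable names, the metric filter and the values file names are hypothetical, not taken from the existing codebase.

```hcl
variable "enable_error_metric_filter" {
  description = "Whether to create the error metric filter in this environment."
  type        = bool
  default     = false # no-op unless an environment opts in
}

variable "log_group_name" {
  description = "Name of an existing CloudWatch log group."
  type        = string
}

resource "aws_cloudwatch_log_metric_filter" "errors" {
  # Zero instances means Terraform plans nothing for this resource.
  count = var.enable_error_metric_filter ? 1 : 0

  name           = "application-errors"
  pattern        = "ERROR"
  log_group_name = var.log_group_name

  metric_transformation {
    name      = "ErrorCount"
    namespace = "Example/Application" # hypothetical namespace
    value     = "1"
  }
}

# environments/development.tfvars (hypothetical file) would then contain:
#   enable_error_metric_filter = true
#   log_group_name             = "example-app-logs"
```

The environment is selected at apply time with `terraform apply -var-file=environments/development.tfvars`, so the same code on main can serve every environment.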
Scalability
Should the Terraform state for an individual system become too large in future, that system could be further disaggregated into isolated abstractions. Borrowing again from C4 modelling terminology, the infrastructure resources for a system could be further organised into containers.
Implementation considerations
Migration of state
State in Terraform must be managed carefully since there is a mapping between code artifacts and AWS resources. Almost all kinds of refactoring changes (file moves, renames) will require corresponding changes to Terraform state. See https://developer.hashicorp.com/terraform/cli/state/move for details.
This can be achieved in a number of ways:
Direct editing of state files is rarely recommended and should be avoided where at all possible.
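As one illustration of moving a resource between states without editing state files, newer Terraform versions offer declarative `removed` blocks (1.7+) and `import` blocks (1.5+). This is a sketch and not necessarily an approach the proposal has in mind; it assumes those Terraform versions are available, and the resource address and cluster name are hypothetical.

```hcl
# --- In the configuration that currently owns the resource ---
# Delete the resource block, then tell Terraform to forget the resource
# without destroying the real infrastructure.
removed {
  from = aws_ecs_cluster.example

  lifecycle {
    destroy = false
  }
}

# --- In the new system's configuration (separate state) ---
# Adopt the existing cluster into this state on the next apply.
import {
  to = aws_ecs_cluster.example
  id = "example-cluster" # ECS clusters are imported by cluster name
}

resource "aws_ecs_cluster" "example" {
  name = "example-cluster" # hypothetical cluster name
}
```

Applying the old configuration first and the new one second should leave the resource untouched in AWS. Running `terraform plan` in both places beforehand is a cheap way to confirm that nothing is scheduled for destruction or recreation.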
Further reading
Migration Stages
The following sequential stages of migration are recommended:
Environment migration strategy
For each of the outlined stages, the Terraform code and state would be applied in all environments so as to ensure that no delayed integration issues are encountered.
Design Comments/Questions
Leave comments and questions in this discussion.