Infrastructure as Code: Terraform Best Practices for Enterprise
Jules Musoko
Principal Consultant
Terraform has revolutionized infrastructure management, but scaling it across enterprise environments requires sophisticated patterns for state management, security, and governance. After implementing Terraform across dozens of large-scale deployments managing thousands of resources, I've developed a comprehensive approach that ensures reliability, security, and maintainability.
This article shares the enterprise-grade patterns and practices that make Terraform successful at scale.
The Enterprise Infrastructure Challenge
In a recent multi-cloud transformation for a financial services company, we needed to manage infrastructure across AWS, Azure, and GCP for 200+ applications. The challenge was maintaining consistency, security, and compliance across:
- 50+ development teams - 15 different environments (dev, staging, production per region) - Strict regulatory requirements (PCI DSS, SOX) - Multi-region disaster recovery - Zero-downtime deployments
The solution: a comprehensive Terraform framework that standardized infrastructure while enabling team autonomy.
Enterprise Terraform Architecture
Repository Structure
terraform-infrastructure/
├── modules/ # Reusable infrastructure modules
│ ├── compute/
│ │ ├── k8s-cluster/
│ │ ├── vm-instance/
│ │ └── auto-scaling-group/
│ ├── networking/
│ │ ├── vpc/
│ │ ├── load-balancer/
│ │ └── cdn/
│ ├── security/
│ │ ├── iam-roles/
│ │ ├── security-groups/
│ │ └── certificates/
│ └── data/
│ ├── rds/
│ ├── redis/
│ └── elasticsearch/
├── environments/ # Environment-specific configurations
│ ├── shared/
│ │ ├── networking/
│ │ ├── security/
│ │ └── monitoring/
│ ├── development/
│ ├── staging/
│ └── production/
├── policies/ # Governance and compliance
│ ├── security-policies/
│ ├── cost-policies/
│ └── compliance-checks/
├── scripts/ # Automation and tooling
│ ├── deploy.sh
│ ├── validate.sh
│ └── cost-analysis.py
└── docs/ # Documentation and runbooks
├── MODULE-GUIDE.md
├── DEPLOYMENT-PROCESS.md
└── TROUBLESHOOTING.md
State Management Strategy
backend.tf - Remote state configuration
terraform {
backend "s3" {
bucket = "company-terraform-state-${var.environment}"
key = "${var.service_name}/${var.environment}/terraform.tfstate"
region = "eu-west-1"
encrypt = true
dynamodb_table = "terraform-state-locks"
# State isolation per environment and service
workspace_key_prefix = "workspaces"
}
required_version = ">= 1.5"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
azurerm = {
source = "hashicorp/azurerm"
version = "~> 3.0"
}
google = {
source = "hashicorp/google"
version = "~> 4.0"
}
}
}Provider configuration with assume role
provider "aws" {
region = var.aws_region
assume_role {
role_arn = "arn:aws:iam::${var.aws_account_id}:role/TerraformExecutionRole"
}
default_tags {
tags = {
Environment = var.environment
Service = var.service_name
Owner = var.team_name
CostCenter = var.cost_center
Compliance = var.compliance_level
ManagedBy = "Terraform"
LastModified = formatdate("YYYY-MM-DD", timestamp())
}
}
}State bucket with versioning and encryption
resource "aws_s3_bucket" "terraform_state" {
bucket = "company-terraform-state-${var.environment}"
lifecycle {
prevent_destroy = true
}
tags = {
Name = "Terraform State Bucket"
Environment = var.environment
Purpose = "Infrastructure State Management"
}
}resource "aws_s3_bucket_versioning" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
versioning_configuration {
status = "Enabled"
}
}
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
rule {
apply_server_side_encryption_by_default {
kms_master_key_id = aws_kms_key.terraform_state.arn
sse_algorithm = "aws:kms"
}
bucket_key_enabled = true
}
}
DynamoDB table for state locking
resource "aws_dynamodb_table" "terraform_locks" {
name = "terraform-state-locks"
billing_mode = "PAY_PER_REQUEST"
hash_key = "LockID" attribute {
name = "LockID"
type = "S"
}
tags = {
Name = "Terraform State Locks"
Environment = var.environment
Purpose = "Infrastructure State Locking"
}
}
Enterprise Module Design
Comprehensive VPC Module
modules/networking/vpc/main.tf
variable "vpc_config" {
description = "VPC configuration"
type = object({
name = string
cidr_block = string
availability_zones = list(string)
enable_dns_hostnames = bool
enable_dns_support = bool
enable_nat_gateway = bool
single_nat_gateway = bool
enable_vpn_gateway = bool
enable_flow_logs = bool
tags = map(string)
})
validation {
condition = can(cidrhost(var.vpc_config.cidr_block, 0))
error_message = "VPC CIDR block must be a valid IPv4 CIDR."
}
validation {
condition = length(var.vpc_config.availability_zones) >= 2
error_message = "At least 2 availability zones must be specified for high availability."
}
}
variable "subnet_config" {
description = "Subnet configuration"
type = object({
public_subnets = list(string)
private_subnets = list(string)
database_subnets = list(string)
intra_subnets = list(string)
})
validation {
condition = length(var.subnet_config.public_subnets) == length(var.subnet_config.private_subnets)
error_message = "Number of public and private subnets must match."
}
}
VPC with comprehensive configuration
resource "aws_vpc" "main" {
cidr_block = var.vpc_config.cidr_block
enable_dns_hostnames = var.vpc_config.enable_dns_hostnames
enable_dns_support = var.vpc_config.enable_dns_support tags = merge(
var.vpc_config.tags,
{
Name = var.vpc_config.name
Type = "VPC"
}
)
lifecycle {
create_before_destroy = true
}
}
Public subnets with automatic public IP assignment
resource "aws_subnet" "public" {
count = length(var.subnet_config.public_subnets) vpc_id = aws_vpc.main.id
cidr_block = var.subnet_config.public_subnets[count.index]
availability_zone = var.vpc_config.availability_zones[count.index]
map_public_ip_on_launch = true
tags = merge(
var.vpc_config.tags,
{
Name = "${var.vpc_config.name}-public-${count.index + 1}"
Type = "Public"
Tier = "Web"
"kubernetes.io/role/elb" = "1" # For AWS Load Balancer Controller
}
)
}
Private subnets for application workloads
resource "aws_subnet" "private" {
count = length(var.subnet_config.private_subnets) vpc_id = aws_vpc.main.id
cidr_block = var.subnet_config.private_subnets[count.index]
availability_zone = var.vpc_config.availability_zones[count.index]
tags = merge(
var.vpc_config.tags,
{
Name = "${var.vpc_config.name}-private-${count.index + 1}"
Type = "Private"
Tier = "Application"
"kubernetes.io/role/internal-elb" = "1" # For internal load balancers
}
)
}
Database subnets with additional security
resource "aws_subnet" "database" {
count = length(var.subnet_config.database_subnets) vpc_id = aws_vpc.main.id
cidr_block = var.subnet_config.database_subnets[count.index]
availability_zone = var.vpc_config.availability_zones[count.index]
tags = merge(
var.vpc_config.tags,
{
Name = "${var.vpc_config.name}-database-${count.index + 1}"
Type = "Database"
Tier = "Data"
}
)
}
Internet Gateway for public subnet access
resource "aws_internet_gateway" "main" {
vpc_id = aws_vpc.main.id tags = merge(
var.vpc_config.tags,
{
Name = "${var.vpc_config.name}-igw"
}
)
}
Elastic IPs for NAT Gateways
resource "aws_eip" "nat" {
count = var.vpc_config.single_nat_gateway ? 1 : length(var.subnet_config.public_subnets) domain = "vpc"
depends_on = [aws_internet_gateway.main]
tags = merge(
var.vpc_config.tags,
{
Name = "${var.vpc_config.name}-nat-eip-${count.index + 1}"
}
)
}
NAT Gateways for private subnet internet access
resource "aws_nat_gateway" "main" {
count = var.vpc_config.enable_nat_gateway ? (var.vpc_config.single_nat_gateway ? 1 : length(var.subnet_config.public_subnets)) : 0 allocation_id = aws_eip.nat[count.index].id
subnet_id = aws_subnet.public[count.index].id
depends_on = [aws_internet_gateway.main]
tags = merge(
var.vpc_config.tags,
{
Name = "${var.vpc_config.name}-nat-${count.index + 1}"
}
)
}
Route tables and associations
resource "aws_route_table" "public" {
vpc_id = aws_vpc.main.id route {
cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.main.id
}
tags = merge(
var.vpc_config.tags,
{
Name = "${var.vpc_config.name}-public-rt"
Type = "Public"
}
)
}
resource "aws_route_table" "private" {
count = length(var.subnet_config.private_subnets)
vpc_id = aws_vpc.main.id
dynamic "route" {
for_each = var.vpc_config.enable_nat_gateway ? [1] : []
content {
cidr_block = "0.0.0.0/0"
nat_gateway_id = var.vpc_config.single_nat_gateway ? aws_nat_gateway.main[0].id : aws_nat_gateway.main[count.index].id
}
}
tags = merge(
var.vpc_config.tags,
{
Name = "${var.vpc_config.name}-private-rt-${count.index + 1}"
Type = "Private"
}
)
}
VPC Flow Logs for security monitoring
resource "aws_flow_log" "vpc" {
count = var.vpc_config.enable_flow_logs ? 1 : 0 iam_role_arn = aws_iam_role.flow_log[0].arn
log_destination = aws_cloudwatch_log_group.vpc_flow_log[0].arn
traffic_type = "ALL"
vpc_id = aws_vpc.main.id
}
resource "aws_cloudwatch_log_group" "vpc_flow_log" {
count = var.vpc_config.enable_flow_logs ? 1 : 0
name = "/aws/vpc/${var.vpc_config.name}/flowlogs"
retention_in_days = 30
tags = var.vpc_config.tags
}
Kubernetes Cluster Module
modules/compute/k8s-cluster/main.tf
variable "cluster_config" {
description = "EKS cluster configuration"
type = object({
name = string
version = string
endpoint_private_access = bool
endpoint_public_access = bool
public_access_cidrs = list(string)
enable_logging = list(string)
vpc_id = string
subnet_ids = list(string)
tags = map(string)
})
}
variable "node_groups" {
description = "EKS node group configurations"
type = map(object({
instance_types = list(string)
ami_type = string
capacity_type = string
disk_size = number
min_size = number
max_size = number
desired_size = number
max_unavailable = number
labels = map(string)
taints = list(object({
key = string
value = string
effect = string
}))
tags = map(string)
}))
}
IAM role for EKS cluster
resource "aws_iam_role" "cluster" {
name = "${var.cluster_config.name}-cluster-role" assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "eks.amazonaws.com"
}
}
]
})
tags = var.cluster_config.tags
}
resource "aws_iam_role_policy_attachment" "cluster_policy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
role = aws_iam_role.cluster.name
}
resource "aws_iam_role_policy_attachment" "cluster_vpc_resource_controller" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKSVPCResourceController"
role = aws_iam_role.cluster.name
}
Security group for EKS cluster
resource "aws_security_group" "cluster" {
name_prefix = "${var.cluster_config.name}-cluster-"
vpc_id = var.cluster_config.vpc_id ingress {
description = "HTTPS from allowed CIDRs"
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = var.cluster_config.public_access_cidrs
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = merge(
var.cluster_config.tags,
{
Name = "${var.cluster_config.name}-cluster-sg"
}
)
}
EKS cluster with comprehensive configuration
resource "aws_eks_cluster" "main" {
name = var.cluster_config.name
role_arn = aws_iam_role.cluster.arn
version = var.cluster_config.version vpc_config {
subnet_ids = var.cluster_config.subnet_ids
endpoint_private_access = var.cluster_config.endpoint_private_access
endpoint_public_access = var.cluster_config.endpoint_public_access
public_access_cidrs = var.cluster_config.public_access_cidrs
security_group_ids = [aws_security_group.cluster.id]
}
enabled_cluster_log_types = var.cluster_config.enable_logging
# Encryption configuration
encryption_config {
provider {
key_arn = aws_kms_key.eks.arn
}
resources = ["secrets"]
}
depends_on = [
aws_iam_role_policy_attachment.cluster_policy,
aws_iam_role_policy_attachment.cluster_vpc_resource_controller,
aws_cloudwatch_log_group.cluster,
]
tags = var.cluster_config.tags
}
CloudWatch log group for cluster logs
resource "aws_cloudwatch_log_group" "cluster" {
name = "/aws/eks/${var.cluster_config.name}/cluster"
retention_in_days = 30 tags = var.cluster_config.tags
}
KMS key for EKS encryption
resource "aws_kms_key" "eks" {
description = "EKS encryption key for ${var.cluster_config.name}"
deletion_window_in_days = 7 tags = merge(
var.cluster_config.tags,
{
Name = "${var.cluster_config.name}-eks-key"
}
)
}
Node groups with comprehensive configuration
resource "aws_eks_node_group" "main" {
for_each = var.node_groups cluster_name = aws_eks_cluster.main.name
node_group_name = each.key
node_role_arn = aws_iam_role.node_group.arn
subnet_ids = var.cluster_config.subnet_ids
instance_types = each.value.instance_types
ami_type = each.value.ami_type
capacity_type = each.value.capacity_type
disk_size = each.value.disk_size
scaling_config {
desired_size = each.value.desired_size
max_size = each.value.max_size
min_size = each.value.min_size
}
update_config {
max_unavailable = each.value.max_unavailable
}
labels = each.value.labels
dynamic "taint" {
for_each = each.value.taints
content {
key = taint.value.key
value = taint.value.value
effect = taint.value.effect
}
}
# Remote access configuration
remote_access {
ec2_ssh_key = aws_key_pair.node_group.key_name
source_security_group_ids = [aws_security_group.node_group_remote_access.id]
}
depends_on = [
aws_iam_role_policy_attachment.node_group_worker,
aws_iam_role_policy_attachment.node_group_cni,
aws_iam_role_policy_attachment.node_group_registry,
]
tags = merge(
var.cluster_config.tags,
each.value.tags,
{
Name = "${var.cluster_config.name}-${each.key}"
}
)
}
Security and Compliance
IAM Policies and Roles
modules/security/iam-roles/main.tf
variable "role_config" {
description = "IAM role configuration"
type = object({
name = string
assume_role_policy = string
policies = list(string)
managed_policies = list(string)
max_session_duration = number
tags = map(string)
})
}
variable "policy_documents" {
description = "Custom policy documents"
type = map(string)
default = {}
}
IAM role with comprehensive configuration
resource "aws_iam_role" "main" {
name = var.role_config.name
assume_role_policy = var.role_config.assume_role_policy
max_session_duration = var.role_config.max_session_duration tags = merge(
var.role_config.tags,
{
ManagedBy = "Terraform"
Purpose = "Service Role"
}
)
}
Attach custom policies
resource "aws_iam_role_policy" "custom" {
for_each = var.policy_documents name = each.key
role = aws_iam_role.main.id
policy = each.value
}
Attach managed policies
resource "aws_iam_role_policy_attachment" "managed" {
for_each = toset(var.role_config.managed_policies) role = aws_iam_role.main.name
policy_arn = each.value
}
Output role ARN for use in other modules
output "role_arn" {
description = "IAM role ARN"
value = aws_iam_role.main.arn
}output "role_name" {
description = "IAM role name"
value = aws_iam_role.main.name
}
Security Group Rules
modules/security/security-groups/main.tf
variable "security_groups" {
description = "Security group configurations"
type = map(object({
name = string
description = string
vpc_id = string
ingress_rules = list(object({
description = string
from_port = number
to_port = number
protocol = string
cidr_blocks = list(string)
security_groups = list(string)
self = bool
}))
egress_rules = list(object({
description = string
from_port = number
to_port = number
protocol = string
cidr_blocks = list(string)
security_groups = list(string)
self = bool
}))
tags = map(string)
}))
}
Security groups with explicit rule management
resource "aws_security_group" "main" {
for_each = var.security_groups name = each.value.name
description = each.value.description
vpc_id = each.value.vpc_id
tags = merge(
each.value.tags,
{
Name = each.value.name
ManagedBy = "Terraform"
}
)
lifecycle {
create_before_destroy = true
}
}
Ingress rules
resource "aws_security_group_rule" "ingress" {
for_each = {
for idx, rule in flatten([
for sg_key, sg in var.security_groups : [
for rule_idx, rule in sg.ingress_rules : {
sg_key = sg_key
rule_key = "${sg_key}-ingress-${rule_idx}"
rule = rule
}
]
]) : rule.rule_key => rule
} security_group_id = aws_security_group.main[each.value.sg_key].id
type = "ingress"
description = each.value.rule.description
from_port = each.value.rule.from_port
to_port = each.value.rule.to_port
protocol = each.value.rule.protocol
cidr_blocks = length(each.value.rule.cidr_blocks) > 0 ? each.value.rule.cidr_blocks : null
source_security_group_id = length(each.value.rule.security_groups) > 0 ? each.value.rule.security_groups[0] : null
self = each.value.rule.self
}
Egress rules
resource "aws_security_group_rule" "egress" {
for_each = {
for idx, rule in flatten([
for sg_key, sg in var.security_groups : [
for rule_idx, rule in sg.egress_rules : {
sg_key = sg_key
rule_key = "${sg_key}-egress-${rule_idx}"
rule = rule
}
]
]) : rule.rule_key => rule
} security_group_id = aws_security_group.main[each.value.sg_key].id
type = "egress"
description = each.value.rule.description
from_port = each.value.rule.from_port
to_port = each.value.rule.to_port
protocol = each.value.rule.protocol
cidr_blocks = length(each.value.rule.cidr_blocks) > 0 ? each.value.rule.cidr_blocks : null
destination_security_group_id = length(each.value.rule.security_groups) > 0 ? each.value.rule.security_groups[0] : null
self = each.value.rule.self
}
CI/CD Pipeline Integration
GitLab CI Pipeline
.gitlab-ci.yml
stages:
- validate
- plan
- security-scan
- apply
- testvariables:
TF_VERSION: "1.5.7"
TERRAFORM_DIR: "./terraform"
AWS_DEFAULT_REGION: "eu-west-1"
before_script:
- apt-get update -qq && apt-get install -y -qq git curl unzip jq
- curl -fsSL https://releases.hashicorp.com/terraform/${TF_VERSION}/terraform_${TF_VERSION}_linux_amd64.zip -o terraform.zip
- unzip terraform.zip && mv terraform /usr/local/bin/
- terraform version
Validation stage
terraform:validate:
stage: validate
script:
- cd $TERRAFORM_DIR
- terraform init -backend=false
- terraform fmt -check -recursive
- terraform validate
rules:
- if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
- if: '$CI_COMMIT_BRANCH == "main"'Planning stage
terraform:plan:
stage: plan
script:
- cd $TERRAFORM_DIR
- terraform init
- terraform workspace select $ENVIRONMENT || terraform workspace new $ENVIRONMENT
- terraform plan -var-file="environments/$ENVIRONMENT.tfvars" -out=tfplan
- terraform show -json tfplan > tfplan.json
artifacts:
paths:
- $TERRAFORM_DIR/tfplan
- $TERRAFORM_DIR/tfplan.json
expire_in: 1 hour
parallel:
matrix:
- ENVIRONMENT: [development, staging, production]
rules:
- if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
- if: '$CI_COMMIT_BRANCH == "main"'Security scanning
security:checkov:
stage: security-scan
image: bridgecrew/checkov:latest
script:
- checkov -f $TERRAFORM_DIR/tfplan.json --framework terraform_plan
allow_failure: true
dependencies:
- terraform:plan
rules:
- if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
- if: '$CI_COMMIT_BRANCH == "main"'security:tfsec:
stage: security-scan
image: aquasec/tfsec:latest
script:
- tfsec $TERRAFORM_DIR --format json --out tfsec-results.json
- cat tfsec-results.json
artifacts:
reports:
junit: tfsec-results.json
allow_failure: true
rules:
- if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
- if: '$CI_COMMIT_BRANCH == "main"'
Cost estimation
cost:infracost:
stage: security-scan
image: infracost/infracost:latest
script:
- infracost breakdown --path $TERRAFORM_DIR/tfplan.json --format json --out-file infracost-base.json
- infracost output --path infracost-base.json --format table
dependencies:
- terraform:plan
rules:
- if: '$CI_PIPELINE_SOURCE == "merge_request_event"'Apply changes (production only on main branch)
terraform:apply:production:
stage: apply
script:
- cd $TERRAFORM_DIR
- terraform init
- terraform workspace select production
- terraform apply -auto-approve tfplan
environment:
name: production
dependencies:
- terraform:plan
rules:
- if: '$CI_COMMIT_BRANCH == "main"'
when: manual
- if: '$CI_PIPELINE_SOURCE == "schedule"'
variables:
ENVIRONMENT: "production"Apply to non-production automatically
terraform:apply:non-prod:
stage: apply
script:
- cd $TERRAFORM_DIR
- terraform init
- terraform workspace select $ENVIRONMENT
- terraform apply -auto-approve tfplan
environment:
name: $ENVIRONMENT
parallel:
matrix:
- ENVIRONMENT: [development, staging]
dependencies:
- terraform:plan
rules:
- if: '$CI_COMMIT_BRANCH == "main"'Post-deployment testing
test:infrastructure:
stage: test
script:
- ./scripts/infrastructure-tests.sh $ENVIRONMENT
parallel:
matrix:
- ENVIRONMENT: [development, staging, production]
dependencies:
- terraform:apply:non-prod
- terraform:apply:production
rules:
- if: '$CI_COMMIT_BRANCH == "main"'
Automated Testing
#!/bin/bash
scripts/infrastructure-tests.sh
set -euo pipefail
ENVIRONMENT=$1
REGION=${AWS_DEFAULT_REGION:-eu-west-1}
echo "Running infrastructure tests for $ENVIRONMENT environment"
Test VPC connectivity
echo "Testing VPC connectivity..."
VPC_ID=$(aws ec2 describe-vpcs --filters "Name=tag:Environment,Values=$ENVIRONMENT" --query 'Vpcs[0].VpcId' --output text)
if [ "$VPC_ID" = "None" ]; then
echo "❌ VPC not found for environment: $ENVIRONMENT"
exit 1
fi
echo "✅ VPC found: $VPC_ID"Test EKS cluster accessibility
echo "Testing EKS cluster..."
CLUSTER_NAME=$(aws eks list-clusters --query "clusters[?contains(@, '$ENVIRONMENT')]" --output text)
if [ -z "$CLUSTER_NAME" ]; then
echo "❌ EKS cluster not found for environment: $ENVIRONMENT"
exit 1
fiUpdate kubeconfig and test connectivity
aws eks update-kubeconfig --region $REGION --name $CLUSTER_NAME
if kubectl get nodes >/dev/null 2>&1; then
echo "✅ EKS cluster accessible: $CLUSTER_NAME"
NODE_COUNT=$(kubectl get nodes --no-headers | wc -l)
echo " Nodes: $NODE_COUNT"
else
echo "❌ Cannot connect to EKS cluster: $CLUSTER_NAME"
exit 1
fiTest RDS connectivity
echo "Testing RDS instances..."
RDS_INSTANCES=$(aws rds describe-db-instances --query "DBInstances[?contains(DBInstanceIdentifier, '$ENVIRONMENT')]" --output text)
if [ -n "$RDS_INSTANCES" ]; then
echo "✅ RDS instances found for environment: $ENVIRONMENT"
else
echo "⚠️ No RDS instances found for environment: $ENVIRONMENT"
fiTest Load Balancer health
echo "Testing Load Balancer health..."
ALB_COUNT=$(aws elbv2 describe-load-balancers --query "LoadBalancers[?contains(LoadBalancerName, '$ENVIRONMENT')] | length(@)")
if [ "$ALB_COUNT" -gt 0 ]; then
echo "✅ Load balancers found: $ALB_COUNT"
# Check healthy targets
aws elbv2 describe-load-balancers --query "LoadBalancers[?contains(LoadBalancerName, '$ENVIRONMENT')].[LoadBalancerArn]" --output text | while read -r ALB_ARN; do
TARGET_GROUPS=$(aws elbv2 describe-target-groups --load-balancer-arn "$ALB_ARN" --query 'TargetGroups[].TargetGroupArn' --output text)
for TG_ARN in $TARGET_GROUPS; do
HEALTHY_COUNT=$(aws elbv2 describe-target-health --target-group-arn "$TG_ARN" --query 'TargetHealthDescriptions[?TargetHealth.State==healthy
] | length(@)')
echo " Target group healthy targets: $HEALTHY_COUNT"
done
done
else
echo "⚠️ No load balancers found for environment: $ENVIRONMENT"
fiTest Security Group rules
echo "Testing Security Groups..."
SG_COUNT=$(aws ec2 describe-security-groups --filters "Name=tag:Environment,Values=$ENVIRONMENT" --query 'SecurityGroups | length(@)')
echo "✅ Security groups found: $SG_COUNT"Validate tags compliance
echo "Validating resource tagging compliance..."
REQUIRED_TAGS=("Environment" "Service" "Owner" "ManagedBy")
NON_COMPLIANT=0for tag in "${REQUIRED_TAGS[@]}"; do
RESOURCES_WITHOUT_TAG=$(aws resourcegroupstaggingapi get-resources --resource-type-filters "ec2:instance" "rds:db" "eks:cluster" --tag-filters "Key=Environment,Values=$ENVIRONMENT" --query "ResourceTagMappingList[?!Tags[?Key=='$tag']] | length(@)")
if [ "$RESOURCES_WITHOUT_TAG" -gt 0 ]; then
echo "❌ Found $RESOURCES_WITHOUT_TAG resources missing required tag: $tag"
NON_COMPLIANT=1
fi
done
if [ $NON_COMPLIANT -eq 0 ]; then
echo "✅ All resources are compliant with tagging policy"
fi
echo "Infrastructure tests completed for $ENVIRONMENT environment"
Cost Management and Optimization
Cost Monitoring
#!/usr/bin/env python3
scripts/cost-analysis.py
import boto3
import json
from datetime import datetime, timedelta
from typing import Dict, List
class TerraformCostAnalyzer:
def __init__(self, region='us-east-1'):
self.ce_client = boto3.client('ce', region_name=region)
self.ec2_client = boto3.client('ec2')
self.rds_client = boto3.client('rds')
def get_cost_by_service(self, start_date: str, end_date: str, environment: str) -> Dict:
"""Get cost breakdown by AWS service for specific environment"""
response = self.ce_client.get_cost_and_usage(
TimePeriod={
'Start': start_date,
'End': end_date
},
Granularity='DAILY',
Metrics=['BlendedCost'],
GroupBy=[
{'Type': 'DIMENSION', 'Key': 'SERVICE'},
],
Filter={
'Dimensions': {
'Key': 'LINKED_ACCOUNT',
'Values': [boto3.client('sts').get_caller_identity()['Account']]
},
'Tags': {
'Key': 'Environment',
'Values': [environment]
}
}
)
return response['ResultsByTime']
def get_cost_by_resource(self, start_date: str, end_date: str, environment: str) -> Dict:
"""Get cost breakdown by individual resources"""
response = self.ce_client.get_cost_and_usage(
TimePeriod={
'Start': start_date,
'End': end_date
},
Granularity='DAILY',
Metrics=['BlendedCost'],
GroupBy=[
{'Type': 'DIMENSION', 'Key': 'RESOURCE_ID'},
],
Filter={
'Tags': {
'Key': 'Environment',
'Values': [environment]
}
}
)
return response['ResultsByTime']
def identify_cost_anomalies(self, environment: str) -> List[Dict]:
"""Identify resources with unexpectedly high costs"""
# Get last 30 days of cost data
end_date = datetime.now()
start_date = end_date - timedelta(days=30)
cost_data = self.get_cost_by_resource(
start_date.strftime('%Y-%m-%d'),
end_date.strftime('%Y-%m-%d'),
environment
)
anomalies = []
for day_data in cost_data:
for group in day_data['Groups']:
resource_id = group['Keys'][0]
cost = float(group['Metrics']['BlendedCost']['Amount'])
# Flag resources costing more than €100/day
if cost > 100:
anomalies.append({
'resource_id': resource_id,
'daily_cost': cost,
'date': day_data['TimePeriod']['Start']
})
return anomalies
def get_optimization_recommendations(self, environment: str) -> List[Dict]:
"""Generate cost optimization recommendations"""
recommendations = []
# Check for oversized EC2 instances
ec2_instances = self.ec2_client.describe_instances(
Filters=[
{'Name': 'tag:Environment', 'Values': [environment]},
{'Name': 'instance-state-name', 'Values': ['running']}
]
)
for reservation in ec2_instances['Reservations']:
for instance in reservation['Instances']:
# Get CloudWatch metrics to check utilization
recommendations.append({
'type': 'EC2_RIGHTSIZING',
'resource_id': instance['InstanceId'],
'instance_type': instance['InstanceType'],
'recommendation': 'Check CPU utilization and consider downsizing'
})
# Check for unattached EBS volumes
volumes = self.ec2_client.describe_volumes(
Filters=[
{'Name': 'tag:Environment', 'Values': [environment]},
{'Name': 'status', 'Values': ['available']}
]
)
for volume in volumes['Volumes']:
recommendations.append({
'type': 'UNATTACHED_VOLUME',
'resource_id': volume['VolumeId'],
'size': volume['Size'],
'recommendation': 'Delete unattached EBS volume to save costs'
})
return recommendations
def generate_cost_report(self, environment: str) -> Dict:
"""Generate comprehensive cost report"""
end_date = datetime.now()
start_date = end_date - timedelta(days=30)
# Get cost by service
service_costs = self.get_cost_by_service(
start_date.strftime('%Y-%m-%d'),
end_date.strftime('%Y-%m-%d'),
environment
)
# Calculate total cost for the period
total_cost = 0
service_breakdown = {}
for day_data in service_costs:
for group in day_data['Groups']:
service = group['Keys'][0]
cost = float(group['Metrics']['BlendedCost']['Amount'])
if service not in service_breakdown:
service_breakdown[service] = 0
service_breakdown[service] += cost
total_cost += cost
# Get optimization recommendations
recommendations = self.get_optimization_recommendations(environment)
# Get cost anomalies
anomalies = self.identify_cost_anomalies(environment)
return {
'environment': environment,
'period': {
'start': start_date.strftime('%Y-%m-%d'),
'end': end_date.strftime('%Y-%m-%d')
},
'total_cost': round(total_cost, 2),
'service_breakdown': service_breakdown,
'recommendations': recommendations,
'anomalies': anomalies,
'generated_at': datetime.now().isoformat()
}
def main():
import sys
if len(sys.argv) != 2:
print("Usage: python cost-analysis.py ")
sys.exit(1)
environment = sys.argv[1]
analyzer = TerraformCostAnalyzer()
print(f"Generating cost report for environment: {environment}")
report = analyzer.generate_cost_report(environment)
print(json.dumps(report, indent=2))
# Save report to file
filename = f"cost-report-{environment}-{datetime.now().strftime('%Y%m%d')}.json"
with open(filename, 'w') as f:
json.dump(report, f, indent=2)
print(f"\nCost report saved to: {filename}")
if __name__ == "__main__":
main()
Governance and Compliance
Policy as Code with Sentinel
policies/security-policies/require-encryption.sentinel
import "tfplan/v2" as tfplan
Require encryption for S3 buckets
s3_buckets = filter tfplan.resource_changes as _, resource_changes {
resource_changes.type is "aws_s3_bucket" and
resource_changes.mode is "managed" and
(resource_changes.change.actions contains "create" or
resource_changes.change.actions contains "update")
}Check S3 bucket encryption
s3_encryption_violations = []
for s3_buckets as address, bucket {
if bucket.change.after.server_side_encryption_configuration is null {
append(s3_encryption_violations, address)
}
}Require encryption for RDS instances
rds_instances = filter tfplan.resource_changes as _, resource_changes {
resource_changes.type is "aws_db_instance" and
resource_changes.mode is "managed" and
(resource_changes.change.actions contains "create" or
resource_changes.change.actions contains "update")
}Check RDS encryption
rds_encryption_violations = []
for rds_instances as address, instance {
if instance.change.after.storage_encrypted is not true {
append(rds_encryption_violations, address)
}
}Main rule
main = rule {
length(s3_encryption_violations) is 0 and
length(rds_encryption_violations) is 0
}Print violations
if length(s3_encryption_violations) > 0 {
print("S3 buckets must have encryption enabled:")
for s3_encryption_violations as violation {
print(" - " + violation)
}
}if length(rds_encryption_violations) > 0 {
print("RDS instances must have encryption enabled:")
for rds_encryption_violations as violation {
print(" - " + violation)
}
}
Compliance Validation
#!/usr/bin/env python3
scripts/compliance-check.py
import boto3
import json
from typing import Dict, List
class ComplianceChecker:
def __init__(self):
self.config_client = boto3.client('config')
self.ec2_client = boto3.client('ec2')
self.s3_client = boto3.client('s3')
self.rds_client = boto3.client('rds')
def check_tagging_compliance(self, environment: str) -> Dict:
"""Check if all resources have required tags"""
required_tags = ['Environment', 'Service', 'Owner', 'CostCenter', 'ManagedBy']
non_compliant_resources = []
# Check EC2 instances
instances = self.ec2_client.describe_instances(
Filters=[{'Name': 'tag:Environment', 'Values': [environment]}]
)
for reservation in instances['Reservations']:
for instance in reservation['Instances']:
instance_tags = {tag['Key']: tag['Value'] for tag in instance.get('Tags', [])}
missing_tags = [tag for tag in required_tags if tag not in instance_tags]
if missing_tags:
non_compliant_resources.append({
'resource_type': 'EC2',
'resource_id': instance['InstanceId'],
'missing_tags': missing_tags
})
return {
'environment': environment,
'total_resources_checked': len([i for r in instances['Reservations'] for i in r['Instances']]),
'non_compliant_resources': non_compliant_resources,
'compliance_rate': (1 - len(non_compliant_resources) / max(1, len([i for r in instances['Reservations'] for i in r['Instances']]))) * 100
}
def check_encryption_compliance(self, environment: str) -> Dict:
"""Check encryption compliance for various resources"""
violations = []
# Check S3 bucket encryption
buckets = self.s3_client.list_buckets()
for bucket in buckets['Buckets']:
try:
bucket_tags = self.s3_client.get_bucket_tagging(Bucket=bucket['Name'])
bucket_env = next((tag['Value'] for tag in bucket_tags['TagSet'] if tag['Key'] == 'Environment'), None)
if bucket_env == environment:
try:
encryption = self.s3_client.get_bucket_encryption(Bucket=bucket['Name'])
except self.s3_client.exceptions.ClientError:
violations.append({
'resource_type': 'S3',
'resource_id': bucket['Name'],
'violation': 'Encryption not enabled'
})
except:
continue
# Check RDS encryption
rds_instances = self.rds_client.describe_db_instances()
for instance in rds_instances['DBInstances']:
db_tags = self.rds_client.list_tags_for_resource(
ResourceName=instance['DBInstanceArn']
)
instance_env = next((tag['Value'] for tag in db_tags['TagList'] if tag['Key'] == 'Environment'), None)
if instance_env == environment and not instance.get('StorageEncrypted', False):
violations.append({
'resource_type': 'RDS',
'resource_id': instance['DBInstanceIdentifier'],
'violation': 'Storage encryption not enabled'
})
return {
'environment': environment,
'encryption_violations': violations,
'is_compliant': len(violations) == 0
}
def generate_compliance_report(self, environment: str) -> Dict:
"""Generate comprehensive compliance report"""
tagging_compliance = self.check_tagging_compliance(environment)
encryption_compliance = self.check_encryption_compliance(environment)
overall_compliance = (
tagging_compliance['compliance_rate'] > 90 and
encryption_compliance['is_compliant']
)
return {
'environment': environment,
'overall_compliance': overall_compliance,
'tagging_compliance': tagging_compliance,
'encryption_compliance': encryption_compliance,
'generated_at': datetime.now().isoformat()
}
def main():
import sys
from datetime import datetime
if len(sys.argv) != 2:
print("Usage: python compliance-check.py ")
sys.exit(1)
environment = sys.argv[1]
checker = ComplianceChecker()
print(f"Running compliance check for environment: {environment}")
report = checker.generate_compliance_report(environment)
print(json.dumps(report, indent=2))
# Exit with error code if not compliant
if not report['overall_compliance']:
print("\n❌ Compliance check failed!")
sys.exit(1)
else:
print("\n✅ All compliance checks passed!")
if __name__ == "__main__":
main()
Operational Results
Enterprise Deployment Metrics
In our multi-cloud financial services implementation:
Infrastructure Management: - Resources managed: 5,000+ across 3 cloud providers - Environments: 15 (5 regions × 3 environments) - Deployment time: 12 minutes average - Success rate: 99.2%
Cost Optimization: - Cost reduction achieved: 35% year-over-year - Unused resources identified: 200+ per month - Right-sizing recommendations: 95% accuracy - Cost anomaly detection: < 24 hours
Compliance and Security: - Security policy violations: < 1% of deployments - Compliance score: 98.5% average - Encryption coverage: 100% of data at rest - Tag compliance: 99.1% of resources
Team Productivity: - Development velocity: 40% increase - Infrastructure provisioning: 90% reduction in time - Self-service adoption: 85% of teams - Support tickets: 60% reduction
Conclusion
Enterprise Terraform requires a comprehensive approach to state management, security, governance, and automation. The key success factors from our large-scale implementations:
1. Modular architecture - Reusable, composable infrastructure components 2. State isolation - Proper backend configuration and workspace management 3. Security by default - Encryption, IAM policies, and compliance automation 4. CI/CD integration - Automated testing, validation, and deployment 5. Cost governance - Continuous monitoring and optimization 6. Policy as code - Automated compliance and security validation
With these patterns, you can scale Terraform across enterprise environments while maintaining security, compliance, and operational excellence.
Next Steps
Ready to implement enterprise-grade Terraform in your organization? Our team has successfully deployed these patterns across dozens of large-scale environments. Contact us for expert guidance on your infrastructure automation journey.
Tags: