
Terraform State Locking: How DynamoDB Saves You from Corruption

Updated By Zak Kann

Key takeaways

  • State locking prevents concurrent Terraform operations from corrupting state files through race conditions and conflicting writes
  • DynamoDB provides distributed locking with S3 backends using a single table with LockID as the partition key and conditional writes for atomic operations
  • Stale locks require manual intervention with terraform force-unlock, which must be used carefully to avoid breaking an operation that is still running
  • Lock timeouts, stale locks from crashed processes, and CI/CD pipeline deadlocks are common issues with specific remediation strategies
  • Proper monitoring (DynamoDB metrics, lock age alerts, CloudWatch logs) and automation (lock cleanup Lambda, Terraform Cloud for managed locking) prevent operational issues

The State Corruption Disaster

It's 3 PM on Friday. Two engineers are deploying infrastructure changes to production.

Engineer A (in San Francisco):

cd production
terraform apply
# Plan: 5 to add, 2 to change, 0 to destroy
# Type 'yes' to proceed

Engineer B (in New York, simultaneously):

cd production
terraform apply
# Plan: 3 to add, 1 to change, 0 to destroy
# Type 'yes' to proceed

Both type "yes" at the same time.

What happens:

  1. Both read the same state file from S3
  2. Both make their changes to AWS
  3. Both write their updated state back to S3
  4. Engineer B's state overwrites Engineer A's changes
  5. Terraform's state now thinks Engineer A's 5 new resources don't exist
  6. Next terraform plan wants to recreate them
  7. Your infrastructure is now in an unknown state

Cost of this incident:

  • 4 hours debugging state corruption
  • 2 hours manually importing resources
  • 1 production outage (resources deleted and recreated)
  • $50K revenue impact

This is a common consequence of running Terraform against shared state without locking, the same lack of process discipline that shows up in ClickOps culture.

Prevention cost:

  • 10 minutes to enable DynamoDB state locking
  • $0.20/month DynamoDB costs

How State Locking Works

The Problem: Race Condition

Time    Engineer A                  Engineer B                  State File
────────────────────────────────────────────────────────────────────────────
T0      terraform apply
T1      Read state (version 42)
T2                                  terraform apply
T3                                  Read state (version 42)      ← Both read v42
T4      Modify AWS resources
T5                                  Modify AWS resources
T6      Write state (version 43)                                 State = v43
T7                                  Write state (version 43)     State = v43 (overwrites!)
                                                                 ← A's changes lost
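
The timeline above can be reproduced in a few lines. This is a minimal in-memory sketch, not Terraform's code: the dictionary stands in for the state object in S3, and apply_changes and the resource names are made up for illustration.

```python
import copy

def apply_changes(snapshot, new_resources):
    """Build a new state from a previously read snapshot."""
    updated = copy.deepcopy(snapshot)
    updated["serial"] += 1
    updated["resources"].extend(new_resources)
    return updated

# Shared "remote" state, standing in for the object in S3
remote = {"serial": 42, "resources": ["vpc"]}

# T1 / T3: both engineers read the same version before either writes
read_a = copy.deepcopy(remote)
read_b = copy.deepcopy(remote)

# T6: Engineer A writes first
remote = apply_changes(read_a, ["queue", "bucket"])

# T7: Engineer B writes a state derived from the stale read
remote = apply_changes(read_b, ["cache"])

print(remote)  # {'serial': 43, 'resources': ['vpc', 'cache']}
```

Engineer A's queue and bucket still exist in AWS, but the state no longer records them.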

The Solution: Distributed Lock

Time    Engineer A                  Engineer B                  DynamoDB Lock
────────────────────────────────────────────────────────────────────────────
T0      terraform apply
T1      Acquire lock                                            Lock = A
T2      Read state (version 42)
T3                                  terraform apply
T4                                  Try to acquire lock
T5                                  ← BLOCKED (lock held by A)
T6      Modify AWS resources
T7      Write state (version 43)
T8      Release lock                                            Lock = (none)
T9                                  Acquire lock                Lock = B
T10                                 Read state (version 43)     ← Sees A's changes
T11                                 Modify AWS resources
T12                                 Write state (version 44)
T13                                 Release lock                Lock = (none)
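
The same scenario with a lock can be sketched as follows. LockTable is a hypothetical in-memory stand-in for the DynamoDB table, and write_state and apply_with_lock are illustrative helpers rather than anything Terraform exposes.

```python
import copy

class LockTable:
    """In-memory stand-in for the DynamoDB lock table."""

    def __init__(self):
        self.items = {}

    def acquire(self, lock_id, owner):
        # Mirrors a conditional write: succeed only if no item exists yet
        if lock_id in self.items:
            return False
        self.items[lock_id] = owner
        return True

    def release(self, lock_id, owner):
        # Only the current holder removes the item
        if self.items.get(lock_id) == owner:
            del self.items[lock_id]

LOCK_ID = "my-terraform-state/production/terraform.tfstate"
locks = LockTable()
remote = {"serial": 42, "resources": ["vpc"]}

def write_state(new_resources):
    """Read the latest state, add resources, and write it back."""
    global remote
    state = copy.deepcopy(remote)
    state["serial"] += 1
    state["resources"].extend(new_resources)
    remote = state

def apply_with_lock(owner, new_resources):
    """Run a whole apply, but only if the lock can be taken first."""
    if not locks.acquire(LOCK_ID, owner):
        return False
    try:
        write_state(new_resources)
    finally:
        locks.release(LOCK_ID, owner)
    return True

locks.acquire(LOCK_ID, "A")                     # T1: A acquires the lock
b_first_try = apply_with_lock("B", ["cache"])   # T4-T5: B is rejected
write_state(["queue", "bucket"])                # T6-T7: A writes version 43
locks.release(LOCK_ID, "A")                     # T8: A releases
b_second_try = apply_with_lock("B", ["cache"])  # T9-T13: B writes version 44

print(b_first_try, b_second_try, remote)
```

Because B only reads state after it holds the lock, its write builds on A's version instead of replacing it.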

DynamoDB State Locking Setup

Step 1: Create DynamoDB Table

Terraform configuration:

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-state-locks"
  billing_mode = "PAY_PER_REQUEST"  # On-demand pricing
  hash_key     = "LockID"
 
  attribute {
    name = "LockID"
    type = "S"
  }
 
  # Enable point-in-time recovery
  point_in_time_recovery {
    enabled = true
  }
 
  # Enable server-side encryption
  server_side_encryption {
    enabled = true
  }
 
  tags = {
    Name        = "Terraform State Locks"
    Environment = "shared"
    Purpose     = "terraform-locking"
  }
}

Or via AWS CLI:

aws dynamodb create-table \
  --table-name terraform-state-locks \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region us-east-1

Table structure:

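LockID (String, Partition Key)                       |  Info (String)                                        |  Digest (String)
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
my-terraform-state/production/terraform.tfstate      |  {"ID":"...", "Operation":"OperationTypeApply", ...}  |  (none)
my-terraform-state/production/terraform.tfstate-md5  |  (none)                                               |  "abc123..."
my-terraform-state/staging/terraform.tfstate-md5     |  (none)                                               |  "def456..."

A lock item (with Info) exists only while an operation holds the lock. The -md5 items are permanent: they store a checksum of the state file, and their LockID is the bucket name plus the state key plus "-md5".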
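
As a small sketch of how these keys are formed, assuming the S3 backend's convention of joining the bucket name and the state key (the same value that appears as Path in the lock error output). The function names are illustrative only.

```python
def lock_item_id(bucket, key):
    """LockID of the lock item: bucket name and state key joined by a slash."""
    return f"{bucket}/{key}"

def digest_item_id(bucket, key):
    """LockID of the companion item that holds the state checksum."""
    return f"{lock_item_id(bucket, key)}-md5"

bucket = "my-terraform-state"
keys = ["production/terraform.tfstate", "staging/terraform.tfstate"]

lock_ids = [lock_item_id(bucket, k) for k in keys]
digest_ids = [digest_item_id(bucket, k) for k in keys]

for item_id in lock_ids + digest_ids:
    print(item_id)
```

Every state file gets its own pair of IDs, which is why one table can serve many environments and buckets.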

Step 2: Configure S3 Backend with DynamoDB

# backend.tf
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    kms_key_id     = "arn:aws:kms:us-east-1:123456789012:key/abcd-1234"
 
    # Enable state locking
    dynamodb_table = "terraform-state-locks"
 
    # State history comes from S3 bucket versioning, which is enabled on the
    # bucket itself (aws_s3_bucket_versioning), not in the backend block
  }
}

Initialize backend:

terraform init
 
Initializing the backend...
 
Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.

Step 3: Verify Locking Works

Terminal 1:

terraform apply
# Terraform acquires lock
Acquiring state lock. This may take a few moments...

Terminal 2 (simultaneously):

terraform apply
# Lock is held by Terminal 1; without -lock-timeout this attempt fails instead of waiting
 
Acquiring state lock. This may take a few moments...
 
Error: Error acquiring the state lock
 
Lock Info:
  ID:        a1b2c3d4-5e6f-7890-abcd-ef1234567890
  Path:      my-terraform-state/production/terraform.tfstate
  Operation: OperationTypeApply
  Who:       engineer1@my-laptop
  Version:   1.6.0
  Created:   2025-01-15 15:30:45.123 UTC
 
Terraform acquires a state lock to protect the state from being written
by multiple users at the same time. Please resolve the issue above and try
again. For most commands, you can disable locking with the "-lock=false"
flag, but this is not recommended.
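
By default the second command gives up on its first failed attempt. Terraform's -lock-timeout flag (for example, terraform apply -lock-timeout=60s) makes it keep retrying for a bounded time instead. The idea can be sketched as a retry loop. This is not Terraform's implementation: acquire_with_timeout and its parameters are hypothetical, and try_acquire is any callable that returns True once the lock is obtained.

```python
import time

def acquire_with_timeout(try_acquire, timeout_s, retry_delay_s=1.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Call try_acquire() until it returns True or timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if try_acquire():
            return True
        if clock() >= deadline:
            return False  # give up; the caller reports the lock error
        sleep(retry_delay_s)

# The lock becomes free on the third attempt
attempts = iter([False, False, True])
got_lock = acquire_with_timeout(lambda: next(attempts),
                                timeout_s=5, retry_delay_s=0.01)

# A timeout of zero means a single attempt, like the default behaviour
gave_up = acquire_with_timeout(lambda: False, timeout_s=0, retry_delay_s=0)

print(got_lock, gave_up)  # True False
```

A timeout only changes how long a client is willing to wait. It does nothing about a lock that will never be released.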

How DynamoDB Locking Works Internally

Atomic Lock Acquisition

Terraform uses DynamoDB's conditional PutItem operation:

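# Simplified Python equivalent
import boto3

dynamodb = boto3.client('dynamodb')

def acquire_lock(lock_id, lock_info):
    """lock_info is the JSON string stored in the Info attribute."""
    try:
        dynamodb.put_item(
            TableName='terraform-state-locks',
            Item={
                # The low-level client needs typed attribute values
                'LockID': {'S': lock_id},
                'Info': {'S': lock_info}
                # The state digest is kept in a separate "-md5" item
            },
            ConditionExpression='attribute_not_exists(LockID)'  # Only succeed if lock doesn't exist
        )
        return True  # Lock acquired
    except dynamodb.exceptions.ConditionalCheckFailedException:
        return False  # Lock already held by someone else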

Why this is atomic:

  • DynamoDB guarantees that only one PutItem with the condition will succeed
  • If two clients try simultaneously, exactly one succeeds
  • The failed client receives ConditionalCheckFailedException
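
The guarantee above can be illustrated without AWS. In this sketch, ConditionalStore is a toy store whose put_if_absent is assumed to behave like PutItem with attribute_not_exists(LockID); the mutex stands in for the serialization DynamoDB applies to writes on a single item.

```python
import threading

class ConditionalStore:
    """Toy key-value store with an atomic put-if-absent."""

    def __init__(self):
        self._items = {}
        self._mutex = threading.Lock()

    def put_if_absent(self, key, value):
        # Check and write happen as one step, so no two callers can both pass
        with self._mutex:
            if key in self._items:
                return False
            self._items[key] = value
            return True

LOCK_ID = "my-terraform-state/production/terraform.tfstate"
store = ConditionalStore()
results = []
results_mutex = threading.Lock()
start = threading.Barrier(8)

def contender(name):
    start.wait()  # line all threads up, then race for the lock
    won = store.put_if_absent(LOCK_ID, name)
    with results_mutex:
        results.append((name, won))

threads = [threading.Thread(target=contender, args=(f"client-{i}",))
           for i in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

winners = [name for name, won in results if won]
print(len(winners))  # 1
```

Which client wins varies from run to run, but there is always exactly one.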

Lock Release

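import json

def release_lock(lock_id, expected_id):
    # dynamodb is the boto3 client used by acquire_lock above
    # Read the current lock and check that it is the one we were asked to release
    key = {'LockID': {'S': lock_id}}
    item = dynamodb.get_item(
        TableName='terraform-state-locks', Key=key, ConsistentRead=True
    ).get('Item')
    held_id = json.loads(item['Info']['S'])['ID'] if item else None
    if held_id != expected_id:
        raise RuntimeError(f"lock id {expected_id!r} does not match existing lock")

    dynamodb.delete_item(TableName='terraform-state-locks', Key=key)

Safety mechanism:

  • An unlock succeeds only when the caller supplies the ID of the lock currently held
  • Prevents accidental deletion of another client's lock
  • terraform force-unlock goes through the same check, which is why it needs the lock ID from the error message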
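
The ID check can be exercised against a plain dictionary. This is an illustrative model only: the table dict, LockError, and unlock are hypothetical names, and the Info value is reduced to the one field the check needs.

```python
import json

class LockError(Exception):
    """Raised when an unlock request does not match the held lock."""

def unlock(table, lock_id, caller_lock_id):
    """Delete the lock item only if caller_lock_id matches the stored ID."""
    item = table.get(lock_id)
    if item is None:
        raise LockError("no lock is currently held")
    held_id = json.loads(item["Info"])["ID"]
    if held_id != caller_lock_id:
        raise LockError(f"lock id {caller_lock_id!r} does not match existing lock")
    del table[lock_id]

LOCK_ID = "my-terraform-state/production/terraform.tfstate"
table = {LOCK_ID: {"Info": json.dumps({"ID": "a1b2c3d4", "Who": "engineer1@my-laptop"})}}

# A client holding a different lock ID is refused and the lock stays in place
try:
    unlock(table, LOCK_ID, "ffff0000")
    refused = False
except LockError:
    refused = True

still_locked = LOCK_ID in table

# The matching ID removes the lock
unlock(table, LOCK_ID, "a1b2c3d4")
print(refused, still_locked, LOCK_ID in table)  # True True False
```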

Lock Information Structure

When Terraform acquires a lock, it stores:

{
  "ID": "a1b2c3d4-5e6f-7890-abcd-ef1234567890",
  "Operation": "OperationTypeApply",
  "Info": "",
  "Who": "engineer1@my-laptop",
  "Version": "1.6.0",
  "Created": "2025-01-15T15:30:45.123Z",
  "Path": "my-terraform-state/production/terraform.tfstate"
}

This metadata helps debug stuck locks.
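
For example, the Created field is enough to work out how long a lock has been held. The sketch below assumes Created is a top-level RFC 3339 timestamp in the Info JSON; parse_created and lock_age_seconds are illustrative helpers. The fractional seconds are padded or trimmed to six digits so that timestamps with nanosecond precision also parse on older Python versions.

```python
import json
import re
from datetime import datetime, timezone

def parse_created(value):
    """Parse an RFC 3339 timestamp with any number of fractional digits."""
    match = re.fullmatch(r"(.*?)(?:\.(\d+))?(Z|[+-]\d{2}:\d{2})", value)
    if match is None:
        raise ValueError(f"unrecognised timestamp: {value!r}")
    base, digits, offset = match.groups()
    micros = ((digits or "") + "000000")[:6]  # pad or trim to microseconds
    if offset == "Z":
        offset = "+00:00"
    return datetime.fromisoformat(f"{base}.{micros}{offset}")

def lock_age_seconds(info_json, now=None):
    """Seconds between the lock's Created time and now (UTC)."""
    info = json.loads(info_json)
    now = now or datetime.now(timezone.utc)
    return (now - parse_created(info["Created"])).total_seconds()

sample_info = json.dumps({
    "ID": "a1b2c3d4-5e6f-7890-abcd-ef1234567890",
    "Operation": "OperationTypeApply",
    "Who": "engineer1@my-laptop",
    "Created": "2025-01-15T15:30:45.123Z",
})

# Thirty minutes after the lock was created
check_time = datetime(2025, 1, 15, 16, 0, 45, 123000, tzinfo=timezone.utc)
age = lock_age_seconds(sample_info, now=check_time)
print(age)  # 1800.0
```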

Handling Lock Failures

Scenario 1: Stuck Lock from Crashed Process

Problem:

terraform apply
# Process dies before it can release the lock (laptop dies, network failure, kill -9)
# Lock remains in DynamoDB

Next developer:

terraform apply
 
Error: Error acquiring the state lock
 
Lock Info:
  ID:        a1b2c3d4-5e6f-7890-abcd-ef1234567890
  Who:       engineer1@my-laptop
  Created:   2025-01-15 10:00:00 UTC  ← 6 hours ago!

Solution: Force Unlock

# 1. Verify lock is actually stale
aws dynamodb get-item \
  --table-name terraform-state-locks \
  --key '{"LockID":{"S":"my-terraform-state/production/terraform.tfstate"}}'
 
# 2. Check with lock owner (Slack/email)
# "Hey engineer1, are you still running terraform apply in production?"
 
# 3. If confirmed stale, force unlock
terraform force-unlock a1b2c3d4-5e6f-7890-abcd-ef1234567890
 
# Output:
Do you really want to force-unlock?
  Terraform will remove the lock on the remote state.
  This will allow local Terraform commands to modify this state, even though it
  may still be in use. Only 'yes' will be accepted to confirm.
 
  Enter a value: yes
 
Terraform state has been successfully unlocked!

⚠️ WARNING: Never force-unlock without verifying the lock is stale. You could corrupt state if the original operation is still running.
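
When the unlock is scripted (for example in a runbook helper), the lock ID has to come from the error text. A minimal sketch, assuming the "Lock Info:" layout shown above; parse_lock_info is a hypothetical helper, and the command is only assembled, never executed.

```python
import re

FIELD = re.compile(r"^\s+(\w+):\s*(.*?)\s*$")

def parse_lock_info(error_text):
    """Extract the indented 'Lock Info:' fields from a lock error message."""
    fields = {}
    in_block = False
    for line in error_text.splitlines():
        if line.strip() == "Lock Info:":
            in_block = True
            continue
        if not in_block:
            continue
        match = FIELD.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
        elif fields:
            break  # first non-field line after the block ends it
    return fields

sample_error = """\
Error: Error acquiring the state lock

Lock Info:
  ID:        a1b2c3d4-5e6f-7890-abcd-ef1234567890
  Path:      my-terraform-state/production/terraform.tfstate
  Operation: OperationTypeApply
  Who:       engineer1@my-laptop
  Version:   1.6.0
  Created:   2025-01-15 15:30:45.123 UTC

Terraform acquires a state lock to protect the state from being written
by multiple users at the same time.
"""

info = parse_lock_info(sample_error)
command = ["terraform", "force-unlock", info["ID"]]
print(info["Who"], command)
```

Printing Who alongside the command keeps the "check with the lock owner" step in front of whoever runs it.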

Scenario 2: Lock Timeout

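Some backends support automatic lock timeouts, but S3 + DynamoDB does not: a lock stays in the table until something deletes it. Terraform's -lock-timeout flag only controls how long a client waits to acquire a lock; it does not expire a lock that is already held.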

Workaround: Implement custom lock expiration check

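import json
from datetime import datetime, timedelta, timezone

import boto3

def cleanup_stale_locks(table_name, max_age_hours=2):
    """Remove locks older than max_age_hours"""

    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table(table_name)

    response = table.scan()

    for item in response.get('Items', []):
        if 'Info' not in item:
            continue  # "-md5" digest items are not locks

        lock_info = json.loads(item['Info'])
        # Python 3.11+ accepts timestamps with more than six fractional digits
        created = datetime.fromisoformat(lock_info['Created'].replace('Z', '+00:00'))
        age = datetime.now(timezone.utc) - created

        if age > timedelta(hours=max_age_hours):
            print(f"Removing stale lock: {item['LockID']} (age: {age})")
            try:
                # Delete only if this is still the lock that was scanned
                table.delete_item(
                    Key={'LockID': item['LockID']},
                    ConditionExpression='Info = :info',
                    ExpressionAttributeValues={':info': item['Info']}
                )
            except table.meta.client.exceptions.ConditionalCheckFailedException:
                print(f"Lock changed since scan, skipping: {item['LockID']}")

# Run as cron job or Lambda on EventBridge (CloudWatch Events)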

Lambda function for automated cleanup:

import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, ScanCommand, DeleteCommand } from '@aws-sdk/lib-dynamodb';
import { Handler } from 'aws-lambda';
 
const client = new DynamoDBClient({});
const dynamodb = DynamoDBDocumentClient.from(client);
const TABLE_NAME = process.env.DYNAMODB_TABLE!;
const MAX_LOCK_AGE_HOURS = 2;
 
export const handler: Handler = async (event) => {
  const { Items } = await dynamodb.send(new ScanCommand({
    TableName: TABLE_NAME
  }));
 
  if (!Items || Items.length === 0) {
    return { statusCode: 200, body: 'No locks found' };
  }
 
  const now = Date.now();
  const staleLocks = Items.filter((item) => {
    // Skip "-md5" digest items, which have no Info attribute
    if (typeof item.Info !== 'string') return false;
    const lockInfo = JSON.parse(item.Info);
    const created = new Date(lockInfo.Created).getTime();
    const ageHours = (now - created) / (1000 * 60 * 60);
    return ageHours > MAX_LOCK_AGE_HOURS;
  });
 
  console.log(`Found ${staleLocks.length} stale locks`);
 
  for (const lock of staleLocks) {
    console.log(`Removing stale lock: ${lock.LockID}`);
 
    await dynamodb.send(new DeleteCommand({
      TableName: TABLE_NAME,
      Key: { LockID: lock.LockID }
    }));
 
    // Send alert to Slack
    await fetch(process.env.SLACK_WEBHOOK!, {
      method: 'POST',
      body: JSON.stringify({
        text: `⚠️ Removed stale Terraform lock: ${lock.LockID} (created ${Math.round((now - new Date(JSON.parse(lock.Info as string).Created).getTime()) / (1000 * 60))} minutes ago)`
      })
    });
  }
 
  return {
    statusCode: 200,
    body: JSON.stringify({ removed: staleLocks.length })
  };
};

CloudWatch Events trigger:

resource "aws_cloudwatch_event_rule" "cleanup_stale_locks" {
  name                = "cleanup-terraform-stale-locks"
  schedule_expression = "rate(15 minutes)"
}
 
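resource "aws_cloudwatch_event_target" "cleanup_lambda" {
  rule      = aws_cloudwatch_event_rule.cleanup_stale_locks.name
  target_id = "cleanup-lambda"
  arn       = aws_lambda_function.cleanup_stale_locks.arn
}
 
# Allow the rule to invoke the function; without this the schedule fires but the Lambda never runs
resource "aws_lambda_permission" "allow_cleanup_schedule" {
  statement_id  = "AllowExecutionFromEventBridge"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.cleanup_stale_locks.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.cleanup_stale_locks.arn
}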

Scenario 3: CI/CD Pipeline Deadlock

Problem:

# .github/workflows/terraform.yml
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - name: Terraform Plan
        run: |
          terraform plan -out=tfplan
          # Lock is held while plan runs; a cancelled or timed-out job
          # can leave it behind and block every later run
 
      - name: Long-running test
        run: sleep 600  # 10 minutes
        # No lock is held here, but state can change and make tfplan stale
 
  apply:
    needs: plan
    runs-on: ubuntu-latest
    steps:
      - name: Terraform Apply
        run: terraform apply tfplan
        # ← Fails: tfplan exists only on the plan job's runner

Solution: Minimize lock duration

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - name: Terraform Plan
        run: |
          terraform plan -out=tfplan
          # Lock released immediately after plan
 
      - name: Upload plan
        uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: tfplan
 
      # Lock is now released, other jobs can proceed
 
      - name: Long-running test
        run: sleep 600
 
  apply:
    needs: plan
    runs-on: ubuntu-latest
    steps:
      - name: Download plan
        uses: actions/download-artifact@v4
        with:
          name: tfplan
 
      - name: Terraform Apply
        run: terraform apply tfplan
        # No deadlock - plan already released lock

Monitoring State Locks

CloudWatch Dashboard

resource "aws_cloudwatch_dashboard" "terraform_locks" {
  dashboard_name = "terraform-state-locks"
 
  dashboard_body = jsonencode({
    widgets = [
      {
        type = "metric"
        properties = {
          metrics = [
            ["AWS/DynamoDB", "ConsumedReadCapacityUnits", "TableName", aws_dynamodb_table.terraform_locks.name, { stat = "Sum", label = "Read Capacity" }],
            [".", "ConsumedWriteCapacityUnits", ".", ".", { stat = "Sum", label = "Write Capacity" }]
          ]
          period = 300
          stat   = "Sum"
          region = "us-east-1"
          title  = "DynamoDB Lock Table Activity"
          yAxis = {
            left = { min = 0 }
          }
        }
      },
      {
        type = "metric"
        properties = {
          metrics = [
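            ["AWS/DynamoDB", "ConditionalCheckFailedRequests", "TableName", aws_dynamodb_table.terraform_locks.name, { stat = "Sum", label = "Rejected lock attempts" }],
            ["AWS/DynamoDB", "UserErrors", { stat = "Sum" }],
            ["AWS/DynamoDB", "SystemErrors", "TableName", aws_dynamodb_table.terraform_locks.name, "Operation", "PutItem", { stat = "Sum" }]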
          ]
          period = 300
          stat   = "Sum"
          region = "us-east-1"
          title  = "Lock Acquisition Errors"
        }
      },
      {
        type = "log"
        properties = {
          query = <<-EOQ
            SOURCE '/aws/lambda/terraform-lock-cleanup'
            | fields @timestamp, @message
            | filter @message like /Removing stale lock/
            | sort @timestamp desc
          EOQ
          region = "us-east-1"
          title  = "Recent Stale Lock Removals"
        }
      }
    ]
  })
}

Alert on Long-Held Locks

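resource "aws_cloudwatch_metric_alarm" "long_held_lock" {
  alarm_name          = "terraform-lock-held-too-long"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  threshold           = 1800  # seconds (30 minutes)
  alarm_description   = "Terraform lock held for over 30 minutes"
  alarm_actions       = [aws_sns_topic.alerts.arn]
  treat_missing_data  = "notBreaching"  # no datapoints means no lock is held
 
  # DynamoDB does not publish a table item count to CloudWatch, so this alarm
  # reads the custom Terraform/LockAge metric from the next section. It assumes
  # that metric is published with LockID as its only dimension.
  metric_query {
    id          = "m1"
    return_data = true
 
    metric {
      metric_name = "LockAge"
      namespace   = "Terraform"
      period      = 300
      stat        = "Maximum"
 
      dimensions = {
        LockID = "my-terraform-state/production/terraform.tfstate"
      }
    }
  }
}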

Custom Lock Age Metric

import boto3
from datetime import datetime, timezone
import json
 
cloudwatch = boto3.client('cloudwatch')
dynamodb = boto3.resource('dynamodb')
 
def publish_lock_age_metrics():
    table = dynamodb.Table('terraform-state-locks')
    response = table.scan()
 
    for item in response.get('Items', []):
        if 'Info' not in item:
            continue  # "-md5" digest items are not locks
 
        lock_info = json.loads(item['Info'])
        created = datetime.fromisoformat(lock_info['Created'].replace('Z', '+00:00'))
        age_seconds = (datetime.now(timezone.utc) - created).total_seconds()
 
        lock_dimension = {'Name': 'LockID', 'Value': item['LockID']}
        who_dimension = {'Name': 'Who', 'Value': lock_info['Who']}
 
        # One series per lock (stable dimensions for alarms) and
        # one per lock and owner (for debugging)
        cloudwatch.put_metric_data(
            Namespace='Terraform',
            MetricData=[
                {
                    'MetricName': 'LockAge',
                    'Value': age_seconds,
                    'Unit': 'Seconds',
                    'Dimensions': dimensions
                }
                for dimensions in ([lock_dimension], [lock_dimension, who_dimension])
            ]
        )
 
# Run every 5 minutes via EventBridge

Advanced: Multi-Region State Locking

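Problem: The lock table lives in a single region. Terraform clients anywhere can reach it, but if that region is unavailable nobody can acquire or release a lock, so teams that run Terraform from multiple regions sometimes want a globally available lock.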

Solution 1: DynamoDB Global Tables

resource "aws_dynamodb_table" "terraform_locks" {
  name             = "terraform-state-locks"
  billing_mode     = "PAY_PER_REQUEST"
  hash_key         = "LockID"
  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"
 
  attribute {
    name = "LockID"
    type = "S"
  }
 
  # Enable global tables
  replica {
    region_name = "eu-west-1"
  }
 
  replica {
    region_name = "ap-southeast-1"
  }
}

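Caveat: Global Tables replicate asynchronously by default, and a conditional write is checked only against the replica that receives it. Two clients using different regional endpoints can therefore both acquire the same lock before replication catches up. For multi-region architecture patterns, see our guide on multi-region active-active design.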
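
The duplicate-lock window can be modelled in a few lines. This is a deliberately simplified sketch: Replica and replicate are hypothetical, replication is an explicit function call rather than a background stream, and conflict resolution is reduced to "the copied value overwrites".

```python
class Replica:
    """One regional copy of the lock table."""

    def __init__(self, region):
        self.region = region
        self.items = {}

    def put_if_absent(self, key, value):
        # The condition is evaluated against this replica's local data only
        if key in self.items:
            return False
        self.items[key] = value
        return True

def replicate(source, target):
    """Copy items across regions, some time after the original writes."""
    target.items.update(source.items)

LOCK_ID = "my-terraform-state/production/terraform.tfstate"
us = Replica("us-east-1")
eu = Replica("eu-west-1")

# Both clients write before either region has seen the other's item
a_acquired = us.put_if_absent(LOCK_ID, "engineer-a")
b_acquired = eu.put_if_absent(LOCK_ID, "engineer-b")

# Replication catches up only afterwards
replicate(us, eu)
late_attempt = eu.put_if_absent(LOCK_ID, "engineer-c")

print(a_acquired, b_acquired, late_attempt)  # True True False
```

Both engineers believe they hold the lock, which is exactly the situation locking is meant to rule out. Pointing every client at the same regional endpoint avoids it.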

Solution 2: Terraform Cloud (Managed Locking)

terraform {
  cloud {
    organization = "my-company"
 
    workspaces {
      name = "production"
    }
  }
}

Terraform Cloud provides:

  • Centralized state storage
  • Built-in locking (no DynamoDB needed)
  • Run history and audit logs
  • Role-based access control

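Cost: $20/user/month (paid tier) at the time of writing. Terraform Cloud has since been renamed HCP Terraform and its pricing model has changed, so check current pricing.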

Best Practices

1. Always Enable Locking

# ❌ Bad: No locking
terraform {
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "production/terraform.tfstate"
    region = "us-east-1"
  }
}
 
# ✅ Good: Locking enabled
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-locks"
  }
}

2. Use Separate Lock Tables per Environment (Optional)

# Option A: Shared lock table (simpler, recommended)
dynamodb_table = "terraform-state-locks"
# Lock keys: my-terraform-state/production/terraform.tfstate, my-terraform-state/staging/terraform.tfstate
 
# Option B: Per-environment lock tables (more isolation)
dynamodb_table = "terraform-state-locks-production"
# Lock keys: <bucket>/terraform.tfstate, <bucket>/vpc.tfstate, <bucket>/ecs.tfstate

Recommendation: Use a single shared table. Lock keys already include the bucket name and the state key, so environments never collide.

3. Document Force-Unlock Procedure

Create a runbook:

# Terraform State Lock Recovery
 
## When to Force Unlock
 
1. Verify lock is stale (created over 30 minutes ago)
2. Confirm with lock owner that operation is not running
3. Check CI/CD pipelines are not running
 
## How to Force Unlock
 
```bash
# 1. Get lock ID from error message
terraform apply
# Error: Lock ID: a1b2c3d4-5e6f-7890-abcd-ef1234567890
 
# 2. Force unlock
terraform force-unlock a1b2c3d4-5e6f-7890-abcd-ef1234567890
 
# 3. Notify team in #infrastructure Slack channel
```

## Escalation

If force-unlock doesn't work:

1. Manually delete from DynamoDB
2. Run terraform init -reconfigure
3. Contact platform team

4. Audit Lock Operations

# Lambda function to log all lock operations
# Assumes a DynamoDB stream on the lock table that includes new images,
# and an existing log group and log stream
import json
import time

import boto3

audit_log = boto3.client('logs')

def log_lock_operation(event, context):
    # Stream events arrive in batches and use INSERT / MODIFY / REMOVE
    for record in event['Records']:
        lock_id = record['dynamodb']['Keys']['LockID']['S']
        if lock_id.endswith('-md5'):
            continue  # State digest item, not a lock

        if record['eventName'] == 'INSERT':
            info = json.loads(record['dynamodb']['NewImage']['Info']['S'])

            print(f"LOCK ACQUIRED: {lock_id} by {info['Who']} for {info['Operation']}")

            # Send to audit log
            audit_log.put_log_events(
                logGroupName='/terraform/locks',
                logStreamName='operations',
                logEvents=[{
                    'timestamp': int(time.time() * 1000),
                    'message': json.dumps({
                        'action': 'LOCK_ACQUIRED',
                        'lockID': lock_id,
                        'who': info['Who'],
                        'operation': info['Operation']
                    })
                }]
            )

        elif record['eventName'] == 'REMOVE':
            print(f"LOCK RELEASED: {lock_id}")

Conclusion: Locking is Non-Negotiable

State locking is not optional. The cost of not implementing it:

  • State corruption requiring manual recovery
  • Lost infrastructure changes
  • Production outages from duplicate resource creation
  • Hours of debugging and state file surgery

The cost of implementing it:

  • 10 minutes setup time
  • $0.10-$0.50/month DynamoDB costs

The math is simple: Always enable state locking.

Action Items

  1. Check if locking is enabled: look for dynamodb_table in the backend block, then confirm a second concurrent run is rejected
  2. Create DynamoDB table: Use Terraform or AWS CLI
  3. Update backend configuration: Add dynamodb_table parameter
  4. Test lock behavior: Run terraform apply in two terminals simultaneously
  5. Implement monitoring: CloudWatch dashboard + stale lock cleanup Lambda
  6. Document force-unlock procedure: Add to team runbook

If you need help setting up Terraform state management for your organization, schedule a consultation. We'll design a state backend strategy, implement locking and encryption, and provide monitoring and disaster recovery procedures.
