Terraform State Locking: How DynamoDB Saves You from Corruption
Key takeaways
- State locking prevents concurrent Terraform operations from corrupting state files through race conditions and conflicting writes
- DynamoDB provides distributed locking for S3 backends using a single table with `LockID` as the partition key and conditional writes for atomic operations
- Stuck locks require manual intervention with `terraform force-unlock`, which must be used carefully to avoid breaking active operations
- Lock timeouts, stale locks from crashed processes, and CI/CD pipeline deadlocks are common issues with specific remediation strategies
- Proper monitoring (DynamoDB metrics, lock age alerts, CloudWatch logs) and automation (lock cleanup Lambda, Terraform Cloud for managed locking) prevent operational issues
The State Corruption Disaster
It's 3 PM on Friday. Two engineers are deploying infrastructure changes to production.
Engineer A (in San Francisco):
```bash
cd production
terraform apply
# Plan: 5 to add, 2 to change, 0 to destroy
# Type 'yes' to proceed
```
Engineer B (in New York, simultaneously):
```bash
cd production
terraform apply
# Plan: 3 to add, 1 to change, 0 to destroy
# Type 'yes' to proceed
```
Both type "yes" at the same time.
What happens:
- Both read the same state file from S3
- Both make their changes to AWS
- Both write their updated state back to S3
- Engineer B's state overwrites Engineer A's changes
- Terraform's state now thinks Engineer A's 5 new resources don't exist
- Next `terraform plan` wants to recreate them
- Your infrastructure is now in an unknown state
Cost of this incident:
- 4 hours debugging state corruption
- 2 hours manually importing resources
- 1 production outage (resources deleted and recreated)
- $50K revenue impact
Incidents like this are the infrastructure-as-code equivalent of ClickOps culture: the tooling is in place, but the operational discipline isn't.
Prevention cost:
- 10 minutes to enable DynamoDB state locking
- $0.20/month DynamoDB costs
How State Locking Works
The Problem: Race Condition
```
Time  Engineer A                 Engineer B                 State File
──────────────────────────────────────────────────────────────────────────
T0    terraform apply
T1    Read state (version 42)
T2                               terraform apply
T3                               Read state (version 42)    ← Both read v42
T4    Modify AWS resources
T5                               Modify AWS resources
T6    Write state (version 43)                              State = v43
T7                               Write state (version 43)   State = v43 (overwrites!)
                                                            ← A's changes lost
```
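This lost update is easy to reproduce outside Terraform. Here is a minimal, self-contained sketch (plain Python, no AWS involved) of two "engineers" doing an unguarded read-modify-write on a shared state object:

```python
import threading
import time

# Stands in for the state file in S3
state = {"version": 42, "resources": ["vpc"]}

def apply_changes(engineer, new_resources):
    snapshot = dict(state)   # read state (both threads read version 42)
    time.sleep(0.1)          # time spent modifying AWS resources
    snapshot["version"] += 1
    snapshot["resources"] = snapshot["resources"] + new_resources
    state.clear()            # write state back: last writer wins
    state.update(snapshot)
    print(f"{engineer} wrote version {snapshot['version']}")

a = threading.Thread(target=apply_changes, args=("Engineer A", ["ec2-1", "ec2-2"]))
b = threading.Thread(target=apply_changes, args=("Engineer B", ["rds-1"]))
a.start(); b.start(); a.join(); b.join()

# Only one engineer's resources survive, and the version is 43, not 44
print(state)
```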
The Solution: Distributed Lock
```
Time  Engineer A                 Engineer B                 DynamoDB Lock
──────────────────────────────────────────────────────────────────────────
T0    terraform apply
T1    Acquire lock                                          Lock = A
T2    Read state (version 42)
T3                               terraform apply
T4                               Try to acquire lock
T5                               ← BLOCKED (lock held by A)
T6    Modify AWS resources
T7    Write state (version 43)
T8    Release lock                                          Lock = (none)
T9                               Acquire lock               Lock = B
T10                              Read state (version 43)    ← Sees A's changes
T11                              Modify AWS resources
T12                              Write state (version 44)
T13                              Release lock               Lock = (none)
```
DynamoDB State Locking Setup
Step 1: Create DynamoDB Table
Terraform configuration:
resource "aws_dynamodb_table" "terraform_locks" {
name = "terraform-state-locks"
billing_mode = "PAY_PER_REQUEST" # On-demand pricing
hash_key = "LockID"
attribute {
name = "LockID"
type = "S"
}
# Enable point-in-time recovery
point_in_time_recovery {
enabled = true
}
# Enable server-side encryption
server_side_encryption {
enabled = true
}
tags = {
Name = "Terraform State Locks"
Environment = "shared"
Purpose = "terraform-locking"
}
}Or via AWS CLI:
```bash
aws dynamodb create-table \
  --table-name terraform-state-locks \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region us-east-1
```
Table structure:
```
LockID (String, Partition Key) | Info (String)                                       | Digest (String)
──────────────────────────────────────────────────────────────────────────────────────────────────────
production/terraform.tfstate   | {"ID":"...", "Operation":"OperationTypeApply", ...} | "abc123..."
staging/terraform.tfstate      | {"ID":"...", "Operation":"OperationTypePlan", ...}  | "def456..."
```
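To see which state files are currently locked, you can read the table directly. A small boto3 sketch (assuming the table name above and AWS credentials in your environment):

```python
import json
from datetime import datetime, timezone

import boto3

table = boto3.resource('dynamodb').Table('terraform-state-locks')

# The lock table holds at most one item per locked state file, so a scan is cheap
for item in table.scan().get('Items', []):
    if 'Info' not in item:
        continue  # skip rows that aren't lock entries
    info = json.loads(item['Info'])
    created = datetime.fromisoformat(info['Created'].replace('Z', '+00:00'))
    age = datetime.now(timezone.utc) - created
    print(f"{item['LockID']}: held by {info['Who']} for {age}")
```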
Step 2: Configure S3 Backend with DynamoDB
```hcl
# backend.tf
terraform {
  backend "s3" {
    bucket     = "my-terraform-state"
    key        = "production/terraform.tfstate"
    region     = "us-east-1"
    encrypt    = true
    kms_key_id = "arn:aws:kms:us-east-1:123456789012:key/abcd-1234"

    # Enable state locking
    dynamodb_table = "terraform-state-locks"
  }
}
```
Note: the S3 backend has no `versioning` argument. Enable versioning on the state bucket itself (for example via `aws_s3_bucket_versioning`) so you keep a history of state files.

Initialize backend:
```bash
terraform init

Initializing the backend...

Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.
```
Step 3: Verify Locking Works
Terminal 1:
```bash
terraform apply
# Terraform acquires lock
Acquiring state lock. This may take a few moments...
```
Terminal 2 (simultaneously):
```bash
terraform apply
# Terraform waits for lock
Acquiring state lock. This may take a few moments...

Error: Error acquiring the state lock

Lock Info:
  ID:        a1b2c3d4-5e6f-7890-abcd-ef1234567890
  Path:      my-terraform-state/production/terraform.tfstate
  Operation: OperationTypeApply
  Who:       engineer1@my-laptop
  Version:   1.6.0
  Created:   2025-01-15 15:30:45.123 UTC

Terraform acquires a state lock to protect the state from being written
by multiple users at the same time. Please resolve the issue above and try
again. For most commands, you can disable locking with the "-lock=false"
flag, but this is not recommended.
```
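If you would rather script this check than race two terminals by hand, here is a rough sketch (assuming `terraform` is on PATH and the backend is initialized in the current directory) that starts two plans at once and counts lock errors:

```python
import subprocess

# Start two plans simultaneously; with locking enabled, exactly one
# should fail fast with "Error acquiring the state lock"
procs = [
    subprocess.Popen(
        ["terraform", "plan", "-input=false", "-lock-timeout=0s"],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True,
    )
    for _ in range(2)
]
outputs = [p.communicate()[0] for p in procs]

lock_errors = sum("Error acquiring the state lock" in out for out in outputs)
print(f"lock errors: {lock_errors} (expect 1; timing-dependent)")
```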
How DynamoDB Locking Works Internally
Atomic Lock Acquisition
Terraform uses DynamoDB's conditional `PutItem` operation:
```python
# Simplified Python equivalent
import hashlib

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client('dynamodb')

def calculate_digest(lock_info: str) -> str:
    # Stand-in for Terraform's digest of the lock metadata
    return hashlib.md5(lock_info.encode()).hexdigest()

def acquire_lock(lock_id: str, lock_info: str) -> bool:
    try:
        dynamodb.put_item(
            TableName='terraform-state-locks',
            Item={
                'LockID': {'S': lock_id},
                'Info': {'S': lock_info},
                'Digest': {'S': calculate_digest(lock_info)}
            },
            # Only succeed if the lock doesn't already exist
            ConditionExpression='attribute_not_exists(LockID)'
        )
        return True  # Lock acquired
    except ClientError as err:
        if err.response['Error']['Code'] == 'ConditionalCheckFailedException':
            return False  # Lock already held by someone else
        raise
```
Why this is atomic:
- DynamoDB guarantees that only one `PutItem` with the condition will succeed
- If two clients try simultaneously, exactly one succeeds
- The failed client receives `ConditionalCheckFailedException`
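Terraform itself will retry rather than fail immediately if you pass `-lock-timeout` (for example `-lock-timeout=5m`). A sketch of roughly that behavior on top of the `acquire_lock` helper above:

```python
import time

def acquire_lock_with_timeout(lock_id, lock_info, timeout_seconds=300, poll_seconds=2):
    """Roughly what -lock-timeout does: retry until acquired or time runs out."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if acquire_lock(lock_id, lock_info):
            return True
        time.sleep(poll_seconds)  # lock held by someone else; wait and retry
    return False  # caller surfaces the lock error to the user
```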
Lock Release
```python
def release_lock(lock_id: str, lock_info: str) -> None:
    dynamodb.delete_item(
        TableName='terraform-state-locks',
        Key={'LockID': {'S': lock_id}},
        # Only delete if we own the lock
        ConditionExpression='Info = :expected_info',
        ExpressionAttributeValues={':expected_info': {'S': lock_info}}
    )
```
Safety mechanism:
- Only the client that created the lock can delete it
- Prevents accidental deletion of another client's lock
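This pairing also explains why crashed processes leave stale locks: acquire and release are two separate calls, and nothing runs the second one if the process dies in between. A context-manager sketch built on the two helpers above makes the pairing explicit:

```python
import json
import uuid
from contextlib import contextmanager
from datetime import datetime, timezone

@contextmanager
def state_lock(lock_id, who):
    lock_info = json.dumps({
        "ID": str(uuid.uuid4()),
        "Who": who,
        "Created": datetime.now(timezone.utc).isoformat(),
    })
    if not acquire_lock(lock_id, lock_info):
        raise RuntimeError(f"state is locked: {lock_id}")
    try:
        yield
    finally:
        # Runs on success or exception -- but never if the process is killed,
        # which is exactly how stale locks happen
        release_lock(lock_id, lock_info)

# Usage:
# with state_lock("production/terraform.tfstate", "engineer1@my-laptop"):
#     ...read, modify, and write state...
```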
Lock Information Structure
When Terraform acquires a lock, it stores:
```json
{
  "ID": "a1b2c3d4-5e6f-7890-abcd-ef1234567890",
  "Operation": "OperationTypeApply",
  "Info": {
    "Who": "engineer1@my-laptop",
    "Version": "1.6.0",
    "Path": "my-terraform-state/production/terraform.tfstate",
    "Created": "2025-01-15T15:30:45.123Z"
  }
}
```
This metadata helps debug stuck locks.
Handling Lock Failures
Scenario 1: Stuck Lock from Crashed Process
Problem:
```bash
terraform apply
# Process crashes (laptop dies, network failure, Ctrl+C)
# Lock remains in DynamoDB
```
Next developer:
```bash
terraform apply

Error: Error acquiring the state lock

Lock Info:
  ID:      old-lock-id
  Who:     engineer1@my-laptop
  Created: 2025-01-15 10:00:00 UTC  ← 6 hours ago!
```
Solution: Force Unlock
```bash
# 1. Verify lock is actually stale
aws dynamodb get-item \
  --table-name terraform-state-locks \
  --key '{"LockID":{"S":"production/terraform.tfstate"}}'

# 2. Check with lock owner (Slack/email)
# "Hey engineer1, are you still running terraform apply in production?"

# 3. If confirmed stale, force unlock
terraform force-unlock a1b2c3d4-5e6f-7890-abcd-ef1234567890

# Output:
Do you really want to force-unlock?
  Terraform will remove the lock on the remote state.
  This will allow local Terraform commands to modify this state, even though it
  may still be in use. Only 'yes' will be accepted to confirm.

  Enter a value: yes

Terraform state has been successfully unlocked!
```
⚠️ WARNING: Never force-unlock without verifying the lock is stale. You could corrupt state if the original operation is still running.
Scenario 2: Lock Timeout
Some backends support automatic lock timeouts, but S3 + DynamoDB does not.
Workaround: Implement custom lock expiration check
```python
import json
from datetime import datetime, timedelta, timezone

import boto3

def cleanup_stale_locks(table_name, max_age_hours=2):
    """Remove locks older than max_age_hours."""
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table(table_name)

    response = table.scan()
    for item in response.get('Items', []):
        lock_info = json.loads(item['Info'])
        created = datetime.fromisoformat(lock_info['Created'].replace('Z', '+00:00'))
        age = datetime.now(timezone.utc) - created

        if age > timedelta(hours=max_age_hours):
            print(f"Removing stale lock: {item['LockID']} (age: {age})")
            table.delete_item(Key={'LockID': item['LockID']})

# Run as a cron job or a Lambda on a CloudWatch Events schedule
```
Lambda function for automated cleanup:
```typescript
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, ScanCommand, DeleteCommand } from '@aws-sdk/lib-dynamodb';
import { Handler } from 'aws-lambda';

const client = new DynamoDBClient({});
const dynamodb = DynamoDBDocumentClient.from(client);

const TABLE_NAME = process.env.DYNAMODB_TABLE!;
const MAX_LOCK_AGE_HOURS = 2;

export const handler: Handler = async (event) => {
  const { Items } = await dynamodb.send(new ScanCommand({
    TableName: TABLE_NAME
  }));

  if (!Items || Items.length === 0) {
    return { statusCode: 200, body: 'No locks found' };
  }

  const now = Date.now();
  const staleLocks = Items.filter((item) => {
    const lockInfo = JSON.parse(item.Info as string);
    const created = new Date(lockInfo.Created).getTime();
    const ageHours = (now - created) / (1000 * 60 * 60);
    return ageHours > MAX_LOCK_AGE_HOURS;
  });

  console.log(`Found ${staleLocks.length} stale locks`);

  for (const lock of staleLocks) {
    console.log(`Removing stale lock: ${lock.LockID}`);
    await dynamodb.send(new DeleteCommand({
      TableName: TABLE_NAME,
      Key: { LockID: lock.LockID }
    }));

    // Send alert to Slack
    await fetch(process.env.SLACK_WEBHOOK!, {
      method: 'POST',
      body: JSON.stringify({
        text: `⚠️ Removed stale Terraform lock: ${lock.LockID} (created ${Math.round((now - new Date(JSON.parse(lock.Info as string).Created).getTime()) / (1000 * 60))} minutes ago)`
      })
    });
  }

  return {
    statusCode: 200,
    body: JSON.stringify({ removed: staleLocks.length })
  };
};
```
CloudWatch Events trigger:
resource "aws_cloudwatch_event_rule" "cleanup_stale_locks" {
name = "cleanup-terraform-stale-locks"
schedule_expression = "rate(15 minutes)"
}
resource "aws_cloudwatch_event_target" "cleanup_lambda" {
rule = aws_cloudwatch_event_rule.cleanup_stale_locks.name
target_id = "cleanup-lambda"
arn = aws_lambda_function.cleanup_stale_locks.arn
}Scenario 3: CI/CD Pipeline Deadlock
Problem: Terraform holds the state lock only while a terraform command is actually running, but CI adds two hazards: a runner that is cancelled or crashes mid-command never releases the lock, and a plan file produced in one job isn't available to another job unless you ship it.
```yaml
# .github/workflows/terraform.yml
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - name: Terraform Plan
        run: terraform plan -out=tfplan
        # If the runner is cancelled or crashes mid-plan,
        # the lock in DynamoDB is never released
      - name: Long-running test
        run: sleep 600 # 10 minutes
  apply:
    needs: plan
    runs-on: ubuntu-latest
    steps:
      - name: Terraform Apply
        run: terraform apply tfplan
        # ← Blocked by any stale lock left behind, and tfplan
        #   doesn't even exist on this fresh runner
```
Solution: Minimize lock duration and pass the plan between jobs as an artifact. (You can also add `-lock-timeout=5m` to terraform commands so a run waits briefly for a busy lock instead of failing immediately.)
```yaml
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - name: Terraform Plan
        run: terraform plan -out=tfplan
        # Lock released as soon as the plan command exits
      - name: Upload plan
        uses: actions/upload-artifact@v3
        with:
          name: tfplan
          path: tfplan
      # Lock is now released, other jobs can proceed
      - name: Long-running test
        run: sleep 600
  apply:
    needs: plan
    runs-on: ubuntu-latest
    steps:
      - name: Download plan
        uses: actions/download-artifact@v3
        with:
          name: tfplan
      - name: Terraform Apply
        run: terraform apply tfplan
        # No deadlock - plan already released the lock,
        # and tfplan arrived via the artifact
```
Monitoring State Locks
CloudWatch Dashboard
resource "aws_cloudwatch_dashboard" "terraform_locks" {
dashboard_name = "terraform-state-locks"
dashboard_body = jsonencode({
widgets = [
{
type = "metric"
properties = {
metrics = [
["AWS/DynamoDB", "ConsumedReadCapacityUnits", { stat = "Sum", label = "Read Capacity" }],
[".", "ConsumedWriteCapacityUnits", { stat = "Sum", label = "Write Capacity" }]
]
period = 300
stat = "Sum"
region = "us-east-1"
title = "DynamoDB Lock Table Activity"
yAxis = {
left = { min = 0 }
}
}
},
{
type = "metric"
properties = {
metrics = [
["AWS/DynamoDB", "UserErrors", { stat = "Sum" }],
[".", "SystemErrors", { stat = "Sum" }]
]
period = 300
stat = "Sum"
region = "us-east-1"
title = "Lock Acquisition Errors"
}
},
{
type = "log"
properties = {
query = <<-EOQ
SOURCE '/aws/lambda/terraform-lock-cleanup'
| fields @timestamp, @message
| filter @message like /Removing stale lock/
| sort @timestamp desc
EOQ
region = "us-east-1"
title = "Recent Stale Lock Removals"
}
}
]
})
}Alert on Long-Held Locks
resource "aws_cloudwatch_metric_alarm" "long_held_lock" {
alarm_name = "terraform-lock-held-too-long"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 1
threshold = 1
alarm_description = "Terraform lock held for over 30 minutes"
alarm_actions = [aws_sns_topic.alerts.arn]
metric_query {
id = "m1"
return_data = true
metric {
metric_name = "ItemCount"
namespace = "AWS/DynamoDB"
period = 1800 # 30 minutes
stat = "Average"
dimensions = {
TableName = aws_dynamodb_table.terraform_locks.name
}
}
}
}Custom Lock Age Metric
```python
import json
from datetime import datetime, timezone

import boto3

cloudwatch = boto3.client('cloudwatch')
dynamodb = boto3.resource('dynamodb')

def publish_lock_age_metrics():
    table = dynamodb.Table('terraform-state-locks')
    response = table.scan()

    for item in response.get('Items', []):
        lock_info = json.loads(item['Info'])
        created = datetime.fromisoformat(lock_info['Created'].replace('Z', '+00:00'))
        age_seconds = (datetime.now(timezone.utc) - created).total_seconds()

        cloudwatch.put_metric_data(
            Namespace='Terraform',
            MetricData=[
                {
                    'MetricName': 'LockAge',
                    'Value': age_seconds,
                    'Unit': 'Seconds',
                    'Dimensions': [
                        {'Name': 'LockID', 'Value': item['LockID']},
                        {'Name': 'Who', 'Value': lock_info['Who']}
                    ]
                }
            ]
        )

# Run every 5 minutes via EventBridge
```
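With this metric flowing, you can alarm on lock age directly instead of the `ItemCount` proxy above. One caveat: CloudWatch alarms match dimensions exactly, so the sketch below assumes you also publish one aggregate `LockAge` datapoint without dimensions (e.g., the max across all locks); the alarm name and SNS topic ARN are placeholders:

```python
import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_alarm(
    AlarmName='terraform-lock-age-over-30m',
    Namespace='Terraform',
    MetricName='LockAge',             # the undimensioned aggregate datapoint
    Statistic='Maximum',
    Period=300,                       # matches the 5-minute publish cadence
    EvaluationPeriods=1,
    Threshold=1800,                   # 30 minutes, in seconds
    ComparisonOperator='GreaterThanThreshold',
    TreatMissingData='notBreaching',  # no locks published -> no alarm
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:alerts'],  # placeholder
)
```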
Advanced: Multi-Region State Locking
Problem: S3 + DynamoDB locking is regional. If your Terraform runs in multiple regions, you need global locking.
Solution 1: DynamoDB Global Tables
resource "aws_dynamodb_table" "terraform_locks" {
name = "terraform-state-locks"
billing_mode = "PAY_PER_REQUEST"
hash_key = "LockID"
stream_enabled = true
stream_view_type = "NEW_AND_OLD_IMAGES"
attribute {
name = "LockID"
type = "S"
}
# Enable global tables
replica {
region_name = "eu-west-1"
}
replica {
region_name = "ap-southeast-1"
}
}Caveat: DynamoDB Global Tables have eventual consistency across regions, which could allow duplicate locks in rare cases. For multi-region architecture patterns, see our guide on multi-region active-active design.
Solution 2: Terraform Cloud (Managed Locking)
```hcl
terraform {
  cloud {
    organization = "my-company"

    workspaces {
      name = "production"
    }
  }
}
```
Terraform Cloud provides:
- Centralized state storage
- Built-in locking (no DynamoDB needed)
- Run history and audit logs
- Role-based access control
Cost: $20/user/month (paid tier)
Best Practices
1. Always Enable Locking
```hcl
# ❌ Bad: No locking
terraform {
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "production/terraform.tfstate"
    region = "us-east-1"
  }
}

# ✅ Good: Locking enabled
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-locks"
  }
}
```
2. Use Separate Lock Tables per Environment (Optional)
```hcl
# Option A: Shared lock table (simpler, recommended)
dynamodb_table = "terraform-state-locks"
# Lock keys: production/terraform.tfstate, staging/terraform.tfstate

# Option B: Per-environment lock tables (more isolation)
dynamodb_table = "terraform-state-locks-production"
# Lock keys: terraform.tfstate, vpc.tfstate, ecs.tfstate
```
Recommendation: Use a single shared table. Lock keys already include the environment path.
3. Document Force-Unlock Procedure
Create a runbook:
# Terraform State Lock Recovery
## When to Force Unlock
1. Verify lock is stale (created over 30 minutes ago)
2. Confirm with lock owner that operation is not running
3. Check CI/CD pipelines are not running
## How to Force Unlock
```bash
# 1. Get lock ID from error message
terraform apply
# Error: Lock ID: a1b2c3d4-5e6f-7890-abcd-ef1234567890
# 2. Force unlock
terraform force-unlock a1b2c3d4-5e6f-7890-abcd-ef1234567890
# 3. Notify team in #infrastructure Slack channel
```

## Escalation
If force-unlock doesn't work:
- Manually delete the lock item from DynamoDB (see the sketch below)
- Run `terraform init -reconfigure`
- Contact platform team
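For that escalation path, deleting the lock item by hand is a single call. A boto3 sketch (same table and key naming used throughout; triple-check the `LockID` first):

```python
import boto3

table = boto3.resource('dynamodb').Table('terraform-state-locks')

# Removes the lock row for one state file -- the nuclear version of force-unlock
table.delete_item(Key={'LockID': 'production/terraform.tfstate'})
```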
4. Audit Lock Operations
```python
# Lambda function (DynamoDB Streams trigger) to log all lock operations
import json
import time

import boto3

audit_log = boto3.client('logs')

def handler(event, context):
    # A Streams event batches records; process each one
    for record in event['Records']:
        log_lock_operation(record)

def log_lock_operation(record):
    # DynamoDB Streams uses INSERT/REMOVE, not PutItem/DeleteItem
    if record['eventName'] == 'INSERT':  # lock acquired
        lock_id = record['dynamodb']['Keys']['LockID']['S']
        info = json.loads(record['dynamodb']['NewImage']['Info']['S'])
        print(f"LOCK ACQUIRED: {lock_id} by {info['Who']} for {info['Operation']}")

        # Send to audit log
        audit_log.put_log_events(
            logGroupName='/terraform/locks',
            logStreamName='operations',
            logEvents=[{
                'timestamp': int(time.time() * 1000),
                'message': json.dumps({
                    'action': 'LOCK_ACQUIRED',
                    'lockID': lock_id,
                    'who': info['Who'],
                    'operation': info['Operation']
                })
            }]
        )
    elif record['eventName'] == 'REMOVE':  # lock released
        lock_id = record['dynamodb']['Keys']['LockID']['S']
        print(f"LOCK RELEASED: {lock_id}")
```
Conclusion: Locking is Non-Negotiable
State locking is not optional. The cost of not implementing it:
- State corruption requiring manual recovery
- Lost infrastructure changes
- Production outages from duplicate resource creation
- Hours of debugging and state file surgery
The cost of implementing it:
- 10 minutes setup time
- $0.10-$0.50/month DynamoDB costs
The math is simple: Always enable state locking.
Action Items
- Check if locking is enabled: run `terraform init` and look for the DynamoDB table
- Create the DynamoDB table: use Terraform or the AWS CLI
- Update backend configuration: add the `dynamodb_table` parameter
- Test lock behavior: run `terraform apply` in two terminals simultaneously
- Implement monitoring: CloudWatch dashboard + stale lock cleanup Lambda
- Document the force-unlock procedure: add it to the team runbook
If you need help setting up Terraform state management for your organization, schedule a consultation. We'll design a state backend strategy, implement locking and encryption, and provide monitoring and disaster recovery procedures.