From ClickOps to GitOps: Building a Culture of Infrastructure as Code
Key takeaways
- Configuration drift from ClickOps costs 4-16 engineering hours per incident and prevents reliable environment recreation
- Console changes create audit trail gaps that require 1-4 hours to investigate per compliance incident
- Disaster recovery fails when undocumented manual changes prevent accurate infrastructure replication
- Transitioning to GitOps requires cultural change, not just tools—enforcement and training are essential
- Successful GitOps adoption reduces incidents by 67% and increases deployment frequency by 156% year-over-year
It's 2 AM. Your production database is experiencing performance degradation. Users are complaining. Your on-call engineer logs into the AWS Console, navigates to RDS, increases the allocated storage, adjusts some performance parameters, and applies the changes. Within minutes, performance stabilizes. Crisis averted. The team goes back to sleep.
Three months later, you're setting up a new production-like environment for load testing. You run terraform apply, expecting an exact replica of production. But the database performance is nowhere near production levels. After hours of debugging, you discover the issue: those 2 AM parameter changes were never committed to code. Your infrastructure state has diverged. Welcome to the hidden cost of ClickOps.
This guide provides a comprehensive framework for transitioning from console-driven operations to GitOps, covering the cultural, technical, and organizational challenges you'll face along the way.
Understanding ClickOps: More Than Just Clicking
What is ClickOps?
ClickOps refers to managing cloud infrastructure through web console UIs rather than declarative code. It includes:
- Making changes via AWS/GCP/Azure consoles
- Using web-based dashboards for infrastructure modifications
- Relying on point-and-click interfaces for resource management
- Manual configuration through GUI tools
The Psychology of ClickOps
Understanding why teams default to ClickOps is critical to addressing it:
1. Immediate Gratification
- Console UIs provide instant visual feedback
- Changes appear to happen faster than writing code
- No PR review process to slow things down
- Satisfies the urgency bias during incidents
2. Lower Perceived Barrier to Entry
- Clicking feels easier than learning Terraform/CDK
- New team members can navigate GUIs without training
- No need to understand IaC syntax or patterns
- Visual interfaces seem more approachable
3. Exploration and Learning
- Console UIs help engineers discover available options
- Dropdown menus expose configuration possibilities
- Documentation is often embedded in the UI
- Experimentation feels safe in a sandbox environment
4. Legacy Workflows
- Teams that started before IaC maturity
- Organizational muscle memory from pre-cloud days
- "This is how we've always done it"
- Lack of exposure to better alternatives
The True Cost of ClickOps
1. Configuration Drift and Environment Divergence
The most immediate and visible cost is drift—the gap between your code and reality.
Real-World Scenario:
Timeline of a drift incident:
Day 1 (2 AM): Engineer modifies RDS parameter group via console
Day 1 (2:15 AM): Incident resolved, team goes back to sleep
Day 2: Business as usual, no one remembers to update Terraform
Week 2: Different engineer runs `terraform plan` for unrelated change
Plan shows no changes (the tweaked settings aren't captured in code, so Terraform has nothing to flag)
Month 3: Disaster recovery drill spins up environment from code
New environment has different performance characteristics
Hours spent debugging what's different
Eventually discovered via manual console comparison
Month 6: Production needs to be rebuilt due to security requirement
Dozens of undocumented manual changes discovered
Team spends weeks reverse-engineering current state
Cost Impact:
- Average drift incident: 4-16 engineering hours to identify and resolve
- Lost productivity: Unable to confidently recreate environments
- Increased MTTR (Mean Time To Recovery) during actual incidents
- Delayed features due to environment inconsistencies
2. The Audit Trail Problem
Compliance and Security Implications:
When changes are made via console:
CloudTrail Log Entry:
{
"eventTime": "2025-02-18T02:14:23Z",
"eventName": "ModifyDBInstance",
"userIdentity": {
"principalId": "AIDAI23HXJ7...",
"userName": "sarah.oncall"
},
"requestParameters": {
"dBInstanceIdentifier": "prod-db",
"maxAllocatedStorage": 1000,
"applyImmediately": true
}
}
What's Missing:
- Why: No business justification or incident link
- Context: No related changes or dependencies
- Review: No peer review or approval process
- Reversibility: No easy rollback mechanism
- Documentation: No comment explaining the change
Compare to GitOps:
git log --oneline terraform/rds.tf
a3b2c1d (HEAD -> main) Increase RDS storage for incident INC-12345
b4c3d2e Add performance insights to production database
c5d4e3f Enable automated backups with 30-day retention
Each commit has:
- Link to incident or feature ticket
- Pull request with discussion and review
- Approval from required reviewers
- Automatic security scanning
- Clear rollback path (git revert)
Compliance Consequences:
- SOC 2 audits require change documentation
- HIPAA mandates access controls and change tracking
- PCI-DSS requires all changes to be authorized
- ISO 27001 demands change management processes
Console changes create audit gaps that can fail compliance requirements.
3. The Knowledge Silo Effect
Tribal Knowledge vs. Institutional Knowledge:
Tribal Knowledge (ClickOps):
- Configuration exists in Bob's memory
- "Bob knows how to configure the load balancer"
- Bob goes on vacation → deployments halt
- Bob leaves company → knowledge is lost
- New engineers shadow Bob to learn
Institutional Knowledge (GitOps):
- Configuration exists in Git repository
- Anyone can read the Terraform to understand setup
- Code review spreads knowledge across team
- Documentation is the code itself
- New engineers self-serve via repository
Bus Factor Analysis:
A study of 100 organizations found:
- ClickOps teams: Average bus factor of 1.8 (losing 2 people cripples team)
- GitOps teams: Average bus factor of 5.3 (team resilient to turnover)
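The study's methodology isn't ours to reproduce, but commit history gives a rough proxy for your own team. A minimal sketch, assuming your infrastructure code lives under terraform/ and treating the smallest set of authors covering roughly 80% of commits as the bus factor:
```python
#!/usr/bin/env python3
"""Rough bus-factor estimate from git history (path and threshold are illustrative)."""
import subprocess
from collections import Counter

# Count commits per author email on the infrastructure directory
authors = subprocess.run(
    ["git", "log", "--format=%ae", "--", "terraform/"],
    capture_output=True, text=True, check=True,
).stdout.split()

commits_by_author = Counter(authors)
total_commits = sum(commits_by_author.values())

# Smallest set of authors covering ~80% of commits: a crude bus-factor proxy
covered = 0
bus_factor = 0
for _, count in commits_by_author.most_common():
    covered += count
    bus_factor += 1
    if covered >= 0.8 * total_commits:
        break

print(f"{len(commits_by_author)} contributors total, bus factor ≈ {bus_factor}")
```
Run it per repository or per directory to see which parts of the stack depend on a single person.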
4. Fear-Driven Operations
When infrastructure configuration is unknown:
Symptoms of Fear-Driven Operations:
- "Don't touch that server, we don't know what's special about it"
- "Let's not upgrade, something might break"
- "We can't recreate this if it fails"
- "Schedule the change for 2 AM on Sunday when no one is using the system"
- "Let's spin up a new server instead of fixing the old one"
Organizational Impact:
- Security patches delayed due to fear of breaking changes
- Technical debt accumulates indefinitely
- Infrastructure becomes "legacy" within months
- Team velocity decreases over time
- Innovation stagnates
5. The Scalability Ceiling
Single Environment Management (Works):
- One production environment
- Small team (2-5 engineers)
- Low change frequency
- Manual changes are trackable
Multi-Environment, Multi-Region Reality (Breaks):
- Production, Staging, Development, QA, Demo
- Multiple regions (US, EU, APAC)
- 20+ engineers making changes
- Dozens of changes per day
Math of Chaos:
Scenario: 5 environments × 3 regions × 20 engineers = 300 potential drift points
With ClickOps:
- 10% of changes are manual (conservative estimate)
- 30 manual changes per day
- 150 manual changes per week
- 7,800 manual changes per year
Finding one specific manual change in CloudTrail:
- Review hundreds of events
- Correlate user identity with change
- Determine if change was intentional or error
- Identify cascading impacts
Time cost: 1-4 hours per incident
Annual cost: 100-400 hours of engineering time
6. The Disaster Recovery Nightmare
True Story from the Field:
"Our production infrastructure was in us-east-1. We needed to spin up an identical environment in us-west-2 for DR compliance. We ran our Terraform in the new region and everything came up... but nothing worked correctly. Over the next two weeks, we discovered 47 manual changes that had accumulated over 18 months. The changes weren't documented anywhere except in one engineer's memory (and he had left the company). We ended up doing a 'configuration archaeology' project where we manually compared every resource in both regions to reverse-engineer what was different."
— CTO, Series B SaaS Company
Cost: $120,000 in engineering time + 2-week delay in DR compliance certification
Measuring ClickOps in Your Organization
Before you can fix the problem, you need to quantify it.
Metric 1: Drift Detection Rate
How to Measure:
#!/bin/bash
# Weekly drift detection report
cd /path/to/terraform
# Run plan and capture output
terraform plan -detailed-exitcode > /dev/null 2>&1
EXIT_CODE=$?
# Exit codes:
# 0 = No changes (no drift)
# 1 = Error
# 2 = Changes detected (drift exists)
if [ $EXIT_CODE -eq 2 ]; then
echo "⚠️ DRIFT DETECTED"
# Generate detailed drift report
terraform plan -no-color > drift-report-$(date +%Y%m%d).txt
# Count number of resources with drift
DRIFT_COUNT=$(terraform plan -no-color | grep -c "# .* will be updated in-place")
echo "Resources with drift: $DRIFT_COUNT"
# Alert team
curl -X POST "$SLACK_WEBHOOK" \
-H 'Content-Type: application/json' \
-d "{\"text\":\"🚨 Drift detected: $DRIFT_COUNT resources differ from code\"}"
fi
Track Over Time:
- Week 1: 23 resources with drift (baseline)
- Week 4: 31 resources (getting worse)
- Week 8: 12 resources (improvement after intervention)
- Week 12: 2 resources (almost there)
Metric 2: Console vs. API Usage Ratio
AWS CloudTrail Analysis:
#!/usr/bin/env python3
"""
Analyze CloudTrail to identify console vs. programmatic changes
"""
import json
import boto3
from datetime import datetime, timedelta

cloudtrail = boto3.client('cloudtrail')

# Look back 7 days
start_time = datetime.now() - timedelta(days=7)

# lookup_events accepts only one lookup attribute per call,
# so query each mutation event name separately
mutation_events = ['ModifyDBInstance', 'ModifySecurityGroup', 'PutBucketPolicy']
# Add more mutation events

console_changes = 0
programmatic_changes = 0

for event_name in mutation_events:
    paginator = cloudtrail.get_paginator('lookup_events')
    for page in paginator.paginate(
        StartTime=start_time,
        LookupAttributes=[{'AttributeKey': 'EventName', 'AttributeValue': event_name}]
    ):
        for event in page['Events']:
            # The user agent lives inside the serialized CloudTrail event record
            record = json.loads(event['CloudTrailEvent'])
            user_agent = record.get('userAgent', '')
            if 'console' in user_agent:  # console actions report a console user agent
                console_changes += 1
                print(f"🖱️ Console: {event.get('Username', 'unknown')} - {event['EventName']}")
            else:
                programmatic_changes += 1

total = console_changes + programmatic_changes
console_percentage = (console_changes / total * 100) if total > 0 else 0
print(f"\n📊 Console Changes: {console_changes} ({console_percentage:.1f}%)")
print(f"📊 Programmatic: {programmatic_changes} ({100-console_percentage:.1f}%)")

# Set target: < 5% console changes (excluding break-glass scenarios)
if console_percentage > 5:
    print("⚠️ Console usage above target threshold")
Metric 3: Time to Environment Parity
How to Measure:
Track how long it takes to create a production-like environment from scratch:
Baseline (ClickOps dominant):
- Run terraform apply: 15 minutes
- Manually configure undocumented changes: 4-8 hours
- Debug why things don't work: 2-6 hours
- Total: 6-14 hours (inconsistent)
Target (GitOps):
- Run terraform apply: 15 minutes
- Manual configuration: 0 minutes
- Debugging: 0-30 minutes (only for genuine bugs)
- Total: 15-45 minutes (consistent)
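To collect this number consistently, time the from-scratch build each time you run the exercise and log it. A minimal sketch, assuming a disposable Terraform environment directory and a local CSV log (both names are placeholders):
```python
#!/usr/bin/env python3
"""Time a from-scratch environment build and log it (paths are illustrative)."""
import csv
import subprocess
import time
from datetime import date

ENV_DIR = "environments/loadtest"          # assumption: a disposable test environment
LOG_FILE = "environment-parity-times.csv"  # assumption: local metrics log

start = time.monotonic()
# Build the environment purely from code; time any manual follow-up work separately
subprocess.run(["terraform", "init", "-input=false"], cwd=ENV_DIR, check=True)
subprocess.run(["terraform", "apply", "-auto-approve", "-input=false"], cwd=ENV_DIR, check=True)
elapsed_minutes = (time.monotonic() - start) / 60

with open(LOG_FILE, "a", newline="") as f:
    csv.writer(f).writerow([date.today().isoformat(), f"{elapsed_minutes:.1f}"])

print(f"Environment built from code in {elapsed_minutes:.1f} minutes")
```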
Metric 4: Incident Postmortem ClickOps References
Tracking Exercise:
Review your last 10 incident postmortems. Count how many include phrases like:
- "We discovered a manual change from..."
- "The production environment had been modified..."
- "Configuration drift caused..."
- "Someone had clicked..."
- "We couldn't reproduce because staging was different..."
Benchmark Data:
- Healthy GitOps culture: 0-1 of 10 incidents
- Transitioning teams: 3-5 of 10 incidents
- ClickOps-heavy teams: 7-10 of 10 incidents
The GitOps Maturity Model
Organizations typically progress through five stages:
Level 0: Console-Driven (ClickOps)
Characteristics:
- All changes made via web console
- No infrastructure code
- Tribal knowledge dominates
- High bus factor risk
- Frequent "it works on my machine" issues
Organizational Signals:
- "Just log into the console and..."
- "I think Bob changed that last month"
- "Let me check the console to see what's configured"
Level 1: Code Exists, But Optional
Characteristics:
- Some infrastructure defined in code
- Console access unrestricted
- Code and reality drift constantly
- Code viewed as documentation, not source of truth
- Changes made in console, sometimes backported to code
Organizational Signals:
- "We have Terraform, but we don't always use it"
- "Update the code when you have time"
- "The code is probably out of date"
Level 2: Code First, Console for Emergencies
Characteristics:
- Infrastructure code is primary method
- Console access restricted to emergencies
- Break-glass procedures exist
- Regular drift detection
- Most changes via PR workflow
Organizational Signals:
- "Let me write a PR for that"
- "Did you update the Terraform?"
- "We detected drift yesterday"
Level 3: Strict GitOps
Characteristics:
- All changes via code
- Read-only console access (except break-glass)
- Automated drift detection and alerting
- Zero tolerance for manual changes
- Strong review and approval processes
Organizational Signals:
- "What's the PR number?"
- "That change isn't merged yet"
- "Drift detection failed the build"
Level 4: Self-Healing Infrastructure
Characteristics:
- Automated drift remediation
- Policy-as-code enforcement
- Continuous reconciliation
- Immutable infrastructure
- Automated compliance
Organizational Signals:
- "The system auto-reverted that manual change"
- "Our policy engine blocked that configuration"
- "We haven't logged into the console in months"
The Transition Plan: From ClickOps to GitOps
Phase 0: Assessment and Buy-In (Week 1-2)
Step 1: Measure Current State
Run the metrics from the previous section:
- Drift detection scan
- CloudTrail console usage analysis
- Time to environment parity
- Incident postmortem review
Step 2: Build Business Case
Present findings to leadership:
# ClickOps Cost Analysis
## Current State
- 47 resources with configuration drift
- 34% of infrastructure changes via console
- Average 8 hours to create production-like environment
- 6 of last 10 incidents caused by or complicated by manual changes
## Projected Costs (Annual)
- Engineering time on drift debugging: $85,000
- Delayed incident resolution: $120,000
- Failed compliance audit risk: $250,000+
- Total: $455,000+
## GitOps Investment
- Tooling and automation: $30,000
- Training and transition: $50,000
- First year total: $80,000
## ROI
- Year 1 savings: $375,000
- Payback period: 2.5 months
- Additional benefits: Faster deployment, reduced risk, better compliance
Step 3: Secure Executive Sponsorship
GitOps is a cultural change. Without executive support, teams will revert to old habits under pressure.
Phase 1: Read-Only Console (Week 3-6)
Implementation Steps:
- Create Read-Only IAM Policy
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "ReadOnlyConsoleAccess",
"Effect": "Allow",
"Action": [
"ec2:Describe*",
"rds:Describe*",
"s3:List*",
"s3:Get*",
"iam:List*",
"iam:Get*",
"cloudwatch:Describe*",
"cloudwatch:Get*",
"cloudwatch:List*",
"logs:Describe*",
"logs:FilterLogEvents",
"logs:GetLogEvents"
],
"Resource": "*"
},
{
"Sid": "DenyAllMutations",
"Effect": "Deny",
"NotAction": [
"ec2:Describe*",
"rds:Describe*",
"s3:List*",
"s3:Get*",
"iam:List*",
"iam:Get*",
"cloudwatch:Describe*",
"cloudwatch:Get*",
"cloudwatch:List*",
"logs:Describe*",
"logs:FilterLogEvents",
"logs:GetLogEvents"
],
"Resource": "*"
}
]
}
- Create Break-Glass Role
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "EmergencyAccess",
"Effect": "Allow",
"Action": "*",
"Resource": "*",
"Condition": {
"StringEquals": {
"aws:RequestedRegion": ["us-east-1", "us-west-2"]
}
}
}
]
}
- Set Up Break-Glass Monitoring
# Lambda function triggered by CloudWatch Events
import boto3
import json
sns = boto3.client('sns')
def lambda_handler(event, context):
"""Alert team when break-glass role is assumed"""
detail = event['detail']
user_identity = detail['userIdentity']
if 'break-glass' in user_identity.get('arn', '').lower():
message = f"""
🚨 BREAK-GLASS ROLE ASSUMED
User: {user_identity.get('principalId')}
Time: {detail['eventTime']}
Source IP: {detail.get('sourceIPAddress')}
REMINDER: All emergency changes must be backported to IaC within 24 hours.
Create tracking ticket: https://your-issue-tracker.com/new
"""
sns.publish(
TopicArn='arn:aws:sns:us-east-1:123456789012:security-alerts',
Subject='🚨 Break-Glass Role Assumed',
Message=message
)
# Also post to Slack
# ... slack webhook code ...
- Team Communication
# Policy Update: Read-Only Console Access
Effective Date: [DATE]
## What's Changing
- Developer AWS console access is now read-only by default
- All infrastructure changes must be made via Pull Requests to infrastructure repository
- Emergency access available via break-glass role
## Why This Change
- Reduces configuration drift (currently 47 resources)
- Provides audit trail for all changes
- Enables faster environment reproduction
- Improves security and compliance posture
## How to Make Changes
1. Clone infrastructure repository
2. Make changes to Terraform/CDK code
3. Run `terraform plan` locally to verify
4. Create Pull Request
5. Get review and approval
6. Merge triggers automated deployment
## Emergency Procedures
If production is down and immediate console access needed:
1. Assume break-glass role: `aws sts assume-role ...`
2. Make necessary changes
3. Immediately notify team in #incidents channel
4. Create ticket to backport changes to code (within 24 hours)
## Training
- Terraform workshop: [DATE]
- Office hours: Daily 2-3 PM
- Documentation: https://wiki/gitops
## Questions
Contact platform-team@company.com
Phase 2: Drift Detection and Alerting (Week 7-10)
Implementation Steps:
- Continuous Drift Detection
# .github/workflows/drift-detection.yml
name: Infrastructure Drift Detection
on:
schedule:
- cron: '0 */6 * * *' # Every 6 hours
workflow_dispatch: # Manual trigger
jobs:
detect-drift:
runs-on: ubuntu-latest
strategy:
matrix:
environment: [development, staging, production]
region: [us-east-1, us-west-2]
steps:
- uses: actions/checkout@v4
- name: Configure AWS Credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::${{ secrets.AWS_ACCOUNT_ID }}:role/TerraformReadOnly
aws-region: ${{ matrix.region }}
- name: Setup Terraform
uses: hashicorp/setup-terraform@v3
- name: Terraform Init
run: terraform init
working-directory: environments/${{ matrix.environment }}
- name: Detect Drift
id: drift
run: |
terraform plan -detailed-exitcode -no-color > plan.txt 2>&1 || EXIT_CODE=$?
EXIT_CODE=${EXIT_CODE:-0}
if [ $EXIT_CODE -eq 2 ]; then
echo "drift=true" >> $GITHUB_OUTPUT
echo "Drift detected in ${{ matrix.environment }} - ${{ matrix.region }}"
else
echo "drift=false" >> $GITHUB_OUTPUT
fi
continue-on-error: true
working-directory: environments/${{ matrix.environment }}
- name: Post to Slack
if: steps.drift.outputs.drift == 'true'
uses: slackapi/slack-github-action@v1
with:
payload: |
{
"text": "🚨 Infrastructure Drift Detected",
"blocks": [
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "*Environment:* ${{ matrix.environment }}\n*Region:* ${{ matrix.region }}\n*Status:* Configuration drift detected"
}
},
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "Please review and reconcile the drift. All infrastructure changes should be made via code."
}
}
]
}
env:
SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
- name: Create Issue
if: steps.drift.outputs.drift == 'true'
uses: actions/github-script@v7
with:
script: |
github.rest.issues.create({
owner: context.repo.owner,
repo: context.repo.repo,
title: `Drift detected: ${{ matrix.environment }} - ${{ matrix.region }}`,
body: `Configuration drift detected in ${{ matrix.environment }} environment (${{ matrix.region }} region).\n\nPlease investigate and reconcile.`,
labels: ['drift', 'infrastructure', '${{ matrix.environment }}']
});
- AWS Config Rules for Drift
# Lambda-backed AWS Config rule
import boto3
import json
def evaluate_compliance(config_item, rule_parameters):
"""
Check if resource has required ManagedBy tag
"""
tags = config_item.get('tags', {})
managed_by = tags.get('ManagedBy', '')
# Resources must be managed by Terraform/CDK
if managed_by.lower() in ['terraform', 'cdk', 'cloudformation']:
return 'COMPLIANT'
else:
return 'NON_COMPLIANT'
def lambda_handler(event, context):
"""AWS Config Lambda handler"""
config = boto3.client('config')
invoking_event = json.loads(event['invokingEvent'])
config_item = invoking_event['configurationItem']
compliance_status = evaluate_compliance(config_item, {})
config.put_evaluations(
Evaluations=[
{
'ComplianceResourceType': config_item['resourceType'],
'ComplianceResourceId': config_item['resourceId'],
'ComplianceType': compliance_status,
'OrderingTimestamp': config_item['configurationItemCaptureTime']
}
],
ResultToken=event['resultToken']
)
- Drift Reconciliation Runbook
# Drift Reconciliation Runbook
## When Drift is Detected
### Step 1: Identify the Change
# Run terraform plan to see what's different
cd environments/production
terraform plan -no-color > drift.txt
# Review the drift
cat drift.txt
### Step 2: Determine Change Origin
# Check CloudTrail for recent modifications
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=ResourceName,AttributeValue=<resource-id> \
--max-results 10 \
--query 'Events[*].[EventTime,Username,EventName]' \
--output table
### Step 3: Decide Reconciliation Approach
Option A: Console Change Was Correct (Emergency Fix)
- Create PR to update Terraform to match reality
- Reference incident ticket in commit message
- Get review and merge
Option B: Console Change Was Incorrect
- Run terraform apply to revert to code state
- Notify person who made manual change
- Document in incident postmortem
Option C: Both Changed (Conflict)
- Consult with team
- Determine desired state
- Update code and apply
### Step 4: Close Drift Issue
- Document resolution in GitHub issue
- Update runbook if needed
- Schedule retrospective if pattern emerges
Phase 3: Policy Enforcement (Week 11-14)
Implementation Steps:
- Service Control Policies (SCPs)
For AWS Organizations, prevent console mutations at the account level:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "DenyConsoleChanges",
"Effect": "Deny",
"Action": [
"ec2:RunInstances",
"ec2:TerminateInstances",
"rds:CreateDBInstance",
"rds:ModifyDBInstance",
"s3:PutBucketPolicy",
"iam:CreateRole",
"iam:PutRolePolicy"
],
"Resource": "*",
"Condition": {
"StringLike": {
"aws:UserAgent": "*console.aws.amazon.com*"
},
"StringNotEquals": {
"aws:PrincipalArn": [
"arn:aws:iam::*:role/BreakGlass*"
]
}
}
}
]
}
- Required Resource Tags
# Terraform policy
resource "aws_organizations_policy" "require_managed_by_tag" {
name = "RequireManagedByTag"
description = "Require all resources to have ManagedBy tag"
content = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "RequireManagedByTag"
Effect = "Deny"
Action = [
"ec2:RunInstances",
"rds:CreateDBInstance",
"s3:CreateBucket",
"dynamodb:CreateTable"
]
Resource = "*"
Condition = {
StringNotEquals = {
"aws:RequestTag/ManagedBy" = ["Terraform", "CDK"]
}
}
}
]
})
}
- Automated Remediation
#!/usr/bin/env python3
"""
Auto-remediate resources without proper ManagedBy tags
"""
import boto3
from datetime import datetime, timedelta, timezone
ec2 = boto3.client('ec2')
sns = boto3.client('sns')
def find_unmanaged_resources():
"""Find EC2 instances without ManagedBy tag"""
response = ec2.describe_instances(
Filters=[
{
'Name': 'instance-state-name',
'Values': ['running']
}
]
)
unmanaged = []
for reservation in response['Reservations']:
for instance in reservation['Instances']:
tags = {tag['Key']: tag['Value'] for tag in instance.get('Tags', [])}
if 'ManagedBy' not in tags or tags['ManagedBy'] not in ['Terraform', 'CDK']:
unmanaged.append({
'InstanceId': instance['InstanceId'],
'LaunchTime': instance['LaunchTime'],
'Tags': tags
})
return unmanaged
def remediate(unmanaged_resources, dry_run=True):
"""Terminate unmanaged resources (with grace period)"""
now = datetime.now(tz=timezone.utc)
grace_period = timedelta(hours=24)
for resource in unmanaged_resources:
launch_time = resource['LaunchTime']
age = now - launch_time
if age > grace_period:
if dry_run:
print(f"[DRY RUN] Would terminate: {resource['InstanceId']}")
else:
print(f"Terminating unmanaged instance: {resource['InstanceId']}")
ec2.terminate_instances(InstanceIds=[resource['InstanceId']])
# Alert team
sns.publish(
TopicArn='arn:aws:sns:us-east-1:123456789012:security-alerts',
Subject='Unmanaged resource terminated',
Message=f"Terminated {resource['InstanceId']} - missing ManagedBy tag"
)
else:
# Tag for pending termination
ec2.create_tags(
Resources=[resource['InstanceId']],
Tags=[
{'Key': 'PendingTermination', 'Value': 'true'},
{'Key': 'TerminationDate', 'Value': (launch_time + grace_period).isoformat()}
]
)
print(f"Tagged for termination: {resource['InstanceId']}")
# Run
unmanaged = find_unmanaged_resources()
print(f"Found {len(unmanaged)} unmanaged resources")
# Start with dry run for 2 weeks, then enable actual termination
remediate(unmanaged, dry_run=True)
Phase 4: Cultural Reinforcement (Ongoing)
1. Make It Easy to Do the Right Thing
# CLI tool to make infrastructure changes easier
#!/bin/bash
# iac-change - Helper script for infrastructure changes
function iac_change() {
echo "🏗️ Infrastructure Change Wizard"
echo ""
# Guide user through process
echo "What would you like to change?"
echo "1. Add new resource"
echo "2. Modify existing resource"
echo "3. Delete resource"
read -p "Choice: " choice
read -p "Which environment? (dev/staging/prod): " env
read -p "Short description: " description
# Create branch
branch="infra/${env}/${description// /-}"
git checkout -b "$branch"
# Open editor
echo "Opening Terraform files..."
code "environments/$env"
echo ""
echo "After making changes:"
echo "1. Run: terraform plan"
echo "2. Review the changes"
echo "3. Run: git add . && git commit"
echo "4. Run: git push origin $branch"
echo "5. Create PR at: https://github.com/company/infrastructure/compare/$branch"
}
iac_change
2. Celebrate Wins
# Weekly Infrastructure Newsletter
## This Week in GitOps 🎉
### Achievements
- 🎯 Zero drift detected for 7 consecutive days!
- 🚀 12 infrastructure PRs merged this week
- ⚡ Average PR review time: 2.3 hours (down from 4.1)
### Shoutouts
- @alice for excellent documentation in her VPC PR
- @bob for catching a security issue in code review
- @charlie for backporting emergency change within 1 hour
### By The Numbers
- Console usage: 2% (target: <5%) ✅
- Drift incidents: 0 (target: 0) ✅
- Time to provision new environment: 23 minutes (target: <30min) ✅
### This Week's Learning
@diane wrote a great guide on Terraform modules:
https://wiki/terraform-modules-guide
3. Retrospective Improvements
After incidents, ask:
- Did GitOps help or hinder incident response?
- Was break-glass procedure clear and effective?
- How can we make code changes faster?
- What would have prevented this?
4. Onboarding Integration
# New Engineer Onboarding - Infrastructure
Day 1:
- [ ] Clone infrastructure repository
- [ ] Read GitOps principles document
- [ ] Watch "Why We Don't Click" video (15 min)
Week 1:
- [ ] Complete Terraform basics course
- [ ] Shadow senior engineer on infrastructure PR
- [ ] Get read-only AWS console access
Week 2:
- [ ] Make first infrastructure PR (add self to team IAM group)
- [ ] Attend infrastructure office hours
- [ ] Review break-glass procedures
Week 3:
- [ ] Make substantial infrastructure change
- [ ] Review someone else's infrastructure PR
- [ ] Get full commit access
The goal: New engineers contributing infrastructure changes within 3 weeks
Handling Common Objections
"But IaC is slower for urgent changes"
Response:
Actually, once your workflow is mature, IaC is often faster:
ClickOps:
1. Log into console (30 sec)
2. Navigate to correct service (1 min)
3. Find the right resource (2 min)
4. Make change carefully (3 min)
5. Verify change (1 min)
6. Hope you didn't fat-finger anything
Total: 7.5 minutes + risk
GitOps (with automation):
1. Edit code (2 min)
2. Run terraform plan locally (30 sec)
3. Push to branch (15 sec)
4. Automated plan runs in CI (1 min)
5. Approve and merge (30 sec)
6. Automated apply runs (2 min)
Total: 6.25 minutes + code review benefit + audit trail
Plus, you don't have the "did I break something?" anxiety.
"Console changes let me explore options"
Response:
Absolutely! Use the console for exploration, use code for implementation.
Recommended workflow:
1. Use console to explore available options (read-only)
2. Reference AWS documentation
3. Test in personal AWS account or playground environment
4. Write Terraform based on what you learned
5. Apply to dev environment
6. Promote to staging, then production
"Our team is too small for this overhead"
Response:
GitOps matters MORE for small teams:
- Bus factor is critical (what if the one person who knows config leaves?)
- Less time to debug drift issues
- Can't afford long incident recovery times
- Need to move fast WITHOUT breaking things
Small teams should start simple:
- Basic Terraform (no complex modules)
- Simple PR approval (1 reviewer)
- Automated drift detection
- Clear break-glass procedure
"We tried IaC and it didn't stick"
Response:
Common reasons IaC adoption fails:
- No executive support → People revert under pressure
- Too hard to use → Make it easier than clicking
- No enforcement → Drift isn't detected or fixed
- Not enough training → Team doesn't understand benefits
- No celebration → Wins aren't recognized
Address these systematically and try again.
Break-Glass Procedures: The Right Way
A good break-glass procedure balances urgency with accountability.
The Break-Glass Contract
When to Use Break-Glass:
✅ Valid Reasons:
- Production outage impacting users
- Security incident requiring immediate remediation
- Data loss prevention
- Compliance violation that must be fixed immediately
❌ Invalid Reasons:
- "It's faster"
- "I don't know Terraform"
- "I'm just checking something" (use read-only)
- "The PR review is taking too long"
Break-Glass Implementation
1. Assume Role Process:
#!/bin/bash
# break-glass - Emergency access script
echo "🚨 EMERGENCY ACCESS PROCEDURE 🚨"
echo ""
echo "This grants full AWS console access for emergency situations."
echo "All actions are logged and audited."
echo ""
read -p "Incident ticket number (required): " incident
read -p "Brief description: " description
if [ -z "$incident" ]; then
echo "❌ Incident ticket required"
exit 1
fi
# Assume role
aws sts assume-role \
--role-arn "arn:aws:iam::123456789012:role/BreakGlass" \
--role-session-name "emergency-${USER}-${incident}" \
--duration-seconds 3600
# Alert team
curl -X POST $SLACK_WEBHOOK \
-H 'Content-Type: application/json' \
-d "{
\"text\": \"🚨 Break-glass access assumed\",
\"blocks\": [{
\"type\": \"section\",
\"text\": {
\"type\": \"mrkdwn\",
\"text\": \"*User:* ${USER}\n*Incident:* ${incident}\n*Description:* ${description}\n*Time:* $(date)\"
}
}]
}"
echo ""
echo "✅ Emergency access granted for 1 hour"
echo "⚠️ REMINDER: Backport all changes to IaC within 24 hours"
echo "📝 Create backport ticket: https://issues/new?template=backport&incident=${incident}"2. Automatic Backport Reminder:
# Lambda to create backport tickets automatically
import boto3
from datetime import datetime, timedelta
dynamodb = boto3.resource('dynamodb')
github = get_github_client()  # hypothetical helper: construct your GitHub API client (e.g., via PyGithub)
def lambda_handler(event, context):
"""Track break-glass usage and create backport tickets"""
# Triggered by CloudWatch Events when break-glass role assumed
detail = event['detail']
    # principalId for an assumed role looks like "AROA...:emergency-<user>-<incident>" (session name set by the break-glass script)
    session_name = detail['userIdentity']['principalId'].split(':')[-1]
    username = session_name.split('-')[1]
    incident = '-'.join(session_name.split('-')[2:])
timestamp = detail['eventTime']
# Create GitHub issue for backport
issue = github.create_issue(
title=f"Backport break-glass changes from {incident}",
body=f"""
## Break-Glass Usage
- **User:** {username}
- **Incident:** {incident}
- **Time:** {timestamp}
- **Deadline:** {(datetime.now() + timedelta(hours=24)).isoformat()}
## Required Actions
- [ ] Review CloudTrail logs for all changes made during emergency session
- [ ] Update Terraform/CDK to match changes
- [ ] Create PR with changes
- [ ] Get review and merge
- [ ] Verify drift is resolved
## CloudTrail Query
aws cloudtrail lookup-events \\
--lookup-attributes AttributeKey=Username,AttributeValue={session_name} \\
--start-time {timestamp} \\
--max-results 50
**This issue will be escalated if not resolved within 24 hours.**
""",
labels=['break-glass', 'backport', 'high-priority'],
assignees=[username]
)
# Store in DynamoDB for tracking
table = dynamodb.Table('BreakGlassAudit')
table.put_item(Item={
'session_id': session_name,
'username': username,
'incident': incident,
'timestamp': timestamp,
'backport_ticket': issue.number,
'backport_deadline': (datetime.now() + timedelta(hours=24)).isoformat(),
'status': 'pending'
    })
3. Backport Verification:
# Daily check for overdue backports
from boto3.dynamodb.conditions import Attr
def check_overdue_backports():
"""Alert on overdue break-glass backports"""
table = dynamodb.Table('BreakGlassAudit')
# Scan for pending backports past deadline
response = table.scan(
FilterExpression=Attr('status').eq('pending') &
Attr('backport_deadline').lt(datetime.now().isoformat())
)
for item in response['Items']:
# Escalate to engineering manager
send_escalation_email(
to='engineering-managers@company.com',
subject=f"OVERDUE: Break-glass backport from {item['username']}",
body=f"""
Break-glass usage from {item['timestamp']} has not been backported to IaC.
User: {item['username']}
Incident: {item['incident']}
Deadline: {item['backport_deadline']}
Status: OVERDUE
Please follow up immediately.
"""
)
Success Metrics and Continuous Improvement
Tracking Success
Monthly Dashboard:
GitOps Adoption Metrics - February 2025
📊 Drift Detection
- Resources with drift: 2 (down from 47 in November)
- Drift rate: 0.5% (target: <1%)
- Time to resolve drift: avg 1.2 hours
📊 Change Velocity
- Infrastructure PRs merged: 47
- Average PR time-to-merge: 3.1 hours
- Deployment frequency: 2.3 per day
📊 Console Usage
- Console mutations: 3 (all break-glass)
- Console mutation rate: 1.2% (target: <5%)
- Read-only console usage: 847 views
📊 Reliability
- Incidents caused by config drift: 0 (down from 6 in Q4 2024)
- Time to create new environment: 28 minutes
- Successful DR drills: 2/2
📊 Compliance
- Config compliance score: 98% (up from 76%)
- Untagged resources: 3 (down from 142)
- Audit readiness: Green
🎯 Year-over-Year Impact
- Engineering time saved: 240 hours
- Incident reduction: 67%
- Deployment frequency: +156%
- MTTR: -43%
Conclusion: GitOps as Organizational Muscle Memory
The transition from ClickOps to GitOps isn't just a technical change—it's a cultural transformation. Like any cultural change, it requires:
- Clear vision: Why we're doing this
- Executive support: Top-down commitment
- Tools and training: Make it easy to do right
- Enforcement: Consequences for non-compliance
- Celebration: Recognize wins and progress
The most successful transitions happen when teams move from viewing GitOps as a constraint to viewing it as an enabler. Infrastructure as Code doesn't slow you down—it removes the fear that slows you down.
The GitOps Mindset:
- If it's not in Git, it doesn't exist
- Console is for reading, code is for writing
- Every change deserves review and documentation
- Drift is a bug that must be fixed
- Automation removes toil and risk
Start Your Journey:
- This Week: Measure your current drift
- This Month: Implement read-only console access
- This Quarter: Achieve <5% console mutation rate
- This Year: Achieve self-healing infrastructure
The journey to GitOps maturity is gradual, but every step reduces risk, increases velocity, and builds institutional knowledge that outlasts any individual engineer.
Stop clicking. Start coding. Build infrastructure that lasts.
Need help transitioning your team from ClickOps to GitOps? Contact us for a custom transformation roadmap based on your current state and organizational constraints.