
Advanced Cloud Cost Optimization: A Technical Deep Dive

Zak Kann
AWS · FinOps · Architecture · Database · DevOps

Key takeaways

  • Data transfer and storage operations often exceed compute costs but receive less optimization attention
  • VPC Endpoints eliminate NAT Gateway fees for AWS-to-AWS communication, saving $450+ per 10TB monthly transfer
  • Migrating to Graviton processors delivers 20-40% cost reduction with equal or better performance
  • Architectural optimizations (VPC Endpoints, CloudFront caching, cross-AZ reduction) deliver 25-40% additional savings beyond basic right-sizing
  • Advanced FinOps practices including observability sampling and Spot instances can reduce costs by 40-60% total

Most organizations approach cloud cost optimization with a predictable playbook: purchase reserved instances, shut down idle resources, and right-size over-provisioned compute. These tactics deliver quick wins—typically 15-25% savings—but they're fundamentally reactive strategies that treat symptoms rather than root causes.

The next frontier of cloud cost optimization requires rethinking architectural decisions. This guide examines the structural cost drivers in modern cloud infrastructure: network topology, data gravity, storage intelligence, compute efficiency, and observability overhead. These strategies have helped engineering teams reduce cloud spend by 40-60% while improving performance and reliability.

The Hidden Economics of Cloud Infrastructure

Before diving into tactics, it's essential to understand where money actually goes in cloud infrastructure. Based on analysis of hundreds of AWS bills, typical cost distribution for mature applications looks like this:

  • Compute (EC2/ECS/Lambda): 30-40%
  • Data Services (RDS/DynamoDB/ElastiCache): 20-30%
  • Data Transfer: 15-25%
  • Storage (EBS/S3/Backup): 10-15%
  • Observability (CloudWatch/Third-party): 5-15%

The surprising insight: data movement and storage operations often exceed raw compute costs, yet they receive far less optimization attention. This is where architectural changes deliver outsized returns.

1. Network Architecture: The Biggest Hidden Cost Driver

Data transfer costs are deceptively complex. AWS charges differently for data moving between AZs ($0.01/GB), between regions ($0.02/GB), and out to the internet ($0.09/GB for the first 10TB). These "small" fees compound rapidly at scale.
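A quick back-of-the-envelope estimator makes the compounding concrete. This is a sketch using the list prices above; the traffic volumes are hypothetical:

# Back-of-the-envelope data transfer estimate using the rates above (us-east-1 list prices)
CROSS_AZ = 0.01      # $/GB between AZs
CROSS_REGION = 0.02  # $/GB between regions
INTERNET = 0.09      # $/GB internet egress, first 10TB tier

def monthly_transfer_cost(cross_az_gb, cross_region_gb, internet_gb):
    return (cross_az_gb * CROSS_AZ
            + cross_region_gb * CROSS_REGION
            + internet_gb * INTERNET)

# Hypothetical mid-size workload: 20TB chatty east-west traffic, 5TB replication, 8TB egress
print(monthly_transfer_cost(20_000, 5_000, 8_000))  # -> 1020.0, i.e. ~$1,020/month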

Real-World Example: The $40,000/Month API Gateway Bill

A SaaS company discovered their API Gateway was costing $40,000/month, despite modest traffic. The culprit: their frontend made 12 API calls per page load, each fetching small JSON responses. The actual compute cost was negligible—they were paying for millions of tiny HTTPS requests plus egress.

The Solution: They implemented GraphQL with batching, reducing to 1-2 requests per page, and moved static assets behind CloudFront. Monthly API Gateway costs dropped to $8,000, an 80% reduction.

Strategy 1: Optimize Data Locality with VPC Endpoints

When EC2 instances communicate with AWS services like S3 or DynamoDB through the internet gateway, you pay for NAT Gateway processing ($0.045/GB) plus data transfer costs. This creates a double charge for internal AWS-to-AWS communication.

VPC Endpoint Types:

  1. Gateway Endpoints (Free): S3 and DynamoDB only. Routes traffic through the AWS backbone with zero additional cost.

  2. Interface Endpoints (PrivateLink): $0.01/GB + $0.01/hour per AZ. Worth it for high-traffic services like Secrets Manager, ECR, or SageMaker.

Implementation:

# Create Gateway Endpoint for S3 (Free)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-xxxxx \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-xxxxx
 
# Create Interface Endpoint for Secrets Manager
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-xxxxx \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.secretsmanager \
  --subnet-ids subnet-xxxxx subnet-yyyyy

ROI Calculation: If you transfer 10TB/month from EC2 to S3 via NAT Gateway, you pay $450 in NAT processing fees. Using a Gateway Endpoint reduces this to $0. Break-even time: immediate.

Strategy 2: CloudFront as a Cost Shield

CloudFront data transfer rates start at $0.085/GB (vs. $0.09/GB for EC2 egress) and drop to $0.02/GB at higher tiers. More importantly, CloudFront's edge cache reduces origin load.

Advanced Pattern: API Response Caching

Beyond static assets, CloudFront can cache API responses with appropriate Cache-Control headers:

// API Response Example
export async function GET(request: Request) {
  const data = await fetchProducts();
 
  return new Response(JSON.stringify(data), {
    headers: {
      'Content-Type': 'application/json',
      'Cache-Control': 'public, max-age=300, s-maxage=3600',
      // Browser caches for 5 min, CloudFront for 1 hour
    },
  });
}

Case Study: An e-commerce platform cached product catalog API responses at the edge. Origin requests dropped by 85%, and they shifted from expensive EC2 egress ($0.09/GB) to cheaper CloudFront egress ($0.085/GB → $0.02/GB at volume). Combined savings: $18,000/month.

Strategy 3: Minimize Cross-AZ Traffic

AWS charges $0.01/GB for data transfer between Availability Zones. This seems trivial until you have microservices making thousands of calls per second.

Kubernetes Topology-Aware Routing:

apiVersion: v1
kind: Service
metadata:
  name: api-service
  annotations:
    # Topology Aware Routing hints (replaces the deprecated topologyKeys field)
    service.kubernetes.io/topology-aware-hints: auto
spec:
  selector:
    app: api
  ports:
    - port: 80
      targetPort: 8080

With topology-aware hints enabled, kube-proxy prefers endpoints in the caller's own zone and falls back to other zones only when a zone has too few endpoints to serve its share of traffic.

For RDS Multi-AZ: Use read replicas in the same AZ as your application tier for read-heavy workloads. Cross-AZ reads to the primary add unnecessary data transfer costs.
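A minimal boto3 sketch of pinning a read replica to the application tier's AZ (the instance identifiers and the AZ are hypothetical):

import boto3

rds = boto3.client('rds')

# Place the replica in the same AZ as the application tier so reads stay zone-local
rds.create_db_instance_read_replica(
    DBInstanceIdentifier='orders-db-replica-1a',      # hypothetical replica name
    SourceDBInstanceIdentifier='orders-db-primary',   # hypothetical source instance
    DBInstanceClass='db.r6g.large',
    AvailabilityZone='us-east-1a',                    # match your app subnets' AZ
)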

2. The ARM Revolution: Graviton Economics

AWS Graviton3 processors deliver up to 25% better performance than comparable x86 instances at roughly 20% lower cost. This is a rare "better and cheaper" scenario, yet adoption remains surprisingly low.

Migration Priority Matrix

Tier 1: Zero-Code Changes

  • AWS Lambda: Switch the function architecture from x86_64 to arm64 in the console (the python3.11 runtime itself is unchanged)
  • RDS/Aurora: Modify instance class from db.r6i to db.r6g (see the sketch after this list)
  • ElastiCache: Switch from cache.r6i to cache.r6g
  • OpenSearch: Change to r6g or m6g instances
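For the RDS/Aurora change, a minimal boto3 sketch (the instance identifier is hypothetical; the class change waits for the next maintenance window unless you apply it immediately):

import boto3

rds = boto3.client('rds')

# Move an existing instance from Intel (db.r6i) to Graviton (db.r6g)
rds.modify_db_instance(
    DBInstanceIdentifier='orders-db-primary',  # hypothetical identifier
    DBInstanceClass='db.r6g.xlarge',
    ApplyImmediately=False,  # take the change at the next maintenance window
)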

Tier 2: Container Rebuilds

  • ECS/EKS: Multi-arch Docker images using BuildKit
# Multi-architecture Dockerfile
FROM --platform=$BUILDPLATFORM node:20-alpine AS builder
ARG TARGETPLATFORM
ARG BUILDPLATFORM
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
 
FROM node:20-alpine
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
CMD ["node", "dist/index.js"]
# Build for both architectures
docker buildx build --platform linux/amd64,linux/arm64 \
  -t myapp:latest --push .

Tier 3: Compiled Languages

For Go, Rust, or C++, add ARM64 to your build matrix:

# GitHub Actions Example
jobs:
  build:
    strategy:
      matrix:
        arch: [amd64, arm64]
    steps:
      - name: Build
        run: GOARCH=${{ matrix.arch }} go build -o app-${{ matrix.arch }}

Real-World Results

A fintech company migrated 400 microservices from c5.xlarge ($0.17/hour) to c6g.xlarge ($0.136/hour). Monthly compute costs dropped from $48,960 to $39,168—a $9,792/month reduction with identical capacity and better performance.

Critical Success Factors:

  • Test in staging first (some native libraries lack ARM builds)
  • Check dependencies (old npm packages may need updates)
  • Monitor performance metrics (usually improves, but verify)

3. Storage Intelligence: Beyond Basic Lifecycle Policies

Storage costs compound over time. A 100TB S3 bucket at $0.023/GB costs $2,300/month. If 80% of that data is rarely accessed, you're wasting $1,840/month.

S3 Intelligent-Tiering: Set and Forget

S3 Intelligent-Tiering automatically moves objects between access tiers based on usage patterns:

  • Frequent Access: $0.023/GB (same as S3 Standard)
  • Infrequent Access: $0.0125/GB (after 30 days no access)
  • Archive Instant Access: $0.004/GB (after 90 days)
  • Archive Access: $0.0036/GB (optional, activates after a configurable period starting at 90 days)
  • Deep Archive Access: $0.00099/GB (optional, activates after a configurable period starting at 180 days)

Monitoring cost: $0.0025 per 1,000 objects (negligible for most workloads)

# Enable Intelligent-Tiering on existing bucket
aws s3api put-bucket-intelligent-tiering-configuration \
  --bucket my-bucket \
  --id EntireBucket \
  --intelligent-tiering-configuration '{
    "Id": "EntireBucket",
    "Status": "Enabled",
    "Tierings": [
      {
        "Days": 90,
        "AccessTier": "ARCHIVE_ACCESS"
      },
      {
        "Days": 180,
        "AccessTier": "DEEP_ARCHIVE_ACCESS"
      }
    ]
  }'

ROI Example: 100TB bucket with 70% of data untouched for 90+ days:

  • Before: 100TB × $23/TB = $2,300/month
  • After: (30TB × $23/TB) + (70TB × $4/TB) = $690 + $280 = $970/month
  • Savings: $1,330/month ($15,960/year)

EBS Modernization: gp3 Over gp2

The gp2 volume type ties IOPS to storage size (3 IOPS per GB). This forces over-provisioning: if you need 10,000 IOPS, you must buy a 3,334GB volume even if you only need 500GB of space.

gp3 Economics:

  • Baseline: 3,000 IOPS and 125MB/s throughput, regardless of size
  • Cost: ~20% cheaper per GB than gp2
  • Flexibility: Purchase additional IOPS ($0.005 per provisioned IOPS-month above the 3,000 baseline) and throughput ($0.04 per MB/s-month above 125MB/s) independently

Migration (Zero Downtime):

aws ec2 modify-volume \
  --volume-id vol-xxxxx \
  --volume-type gp3 \
  --iops 4000 \
  --throughput 250

Case Study: A database on a 1TB gp2 volume (3,000 IOPS, $100/month) was right-sized to a 500GB gp3 volume with 4,000 IOPS ($40 storage + $5 IOPS = $45/month). Savings: $55/month per volume, improved performance.

EBS Snapshots: The Forgotten Cost Center

EBS snapshots are incremental, but orphaned snapshots (from deleted volumes) accumulate. A common pattern: automated daily snapshots for 500 volumes over 2 years = 365,000 snapshots.

Snapshot Lifecycle Management:

# Create lifecycle policy
aws dlm create-lifecycle-policy \
  --description "Daily snapshots with 30-day retention" \
  --state ENABLED \
  --execution-role-arn arn:aws:iam::123456789012:role/AWSDataLifecycleManagerDefaultRole \
  --policy-details '{
    "ResourceTypes": ["VOLUME"],
    "TargetTags": [{"Key": "Backup", "Value": "true"}],
    "Schedules": [{
      "Name": "DailySnapshots",
      "CreateRule": {"Interval": 24, "IntervalUnit": "HOURS", "Times": ["03:00"]},
      "RetainRule": {"Count": 30}
    }]
  }'

Audit Script:

import boto3
from datetime import datetime, timedelta
 
ec2 = boto3.client('ec2')
snapshots = ec2.describe_snapshots(OwnerIds=['self'])['Snapshots']
volumes = {v['VolumeId'] for v in ec2.describe_volumes()['Volumes']}
 
orphaned = [s for s in snapshots if s['VolumeId'] not in volumes]
old = [s for s in orphaned if s['StartTime'] < datetime.now(s['StartTime'].tzinfo) - timedelta(days=90)]
 
print(f"Orphaned snapshots: {len(orphaned)}")
print(f"Orphaned snapshots >90 days old: {len(old)}")
print(f"Estimated monthly cost: ${len(orphaned) * 0.05 * 100:.2f}")  # Assume 100GB avg

4. Database Cost Optimization: The Biggest Line Item

Databases frequently represent 30-40% of total cloud spend. The challenge: databases are stateful, mission-critical, and risky to modify.

Strategy 1: Non-Production Environment Scheduling

Development and staging databases often run 24/7/365 but are only used about 40 hours/week (roughly 24% utilization). Stopping them during off-hours yields 75%+ savings.

Lambda Auto-Scheduler:

import boto3
import os
 
rds = boto3.client('rds')
 
def lambda_handler(event, context):
    action = os.environ['ACTION']  # 'stop' or 'start'
    tag_key = os.environ['TAG_KEY']  # e.g., 'Environment'
    tag_value = os.environ['TAG_VALUE']  # e.g., 'dev'
 
    # Find tagged DB instances
    instances = rds.describe_db_instances()['DBInstances']
 
    for instance in instances:
        arn = instance['DBInstanceArn']
        tags = rds.list_tags_for_resource(ResourceName=arn)['TagList']
 
        if any(t['Key'] == tag_key and t['Value'] == tag_value for t in tags):
            db_id = instance['DBInstanceIdentifier']
 
            if action == 'stop' and instance['DBInstanceStatus'] == 'available':
                print(f"Stopping {db_id}")
                rds.stop_db_instance(DBInstanceIdentifier=db_id)
            elif action == 'start' and instance['DBInstanceStatus'] == 'stopped':
                print(f"Starting {db_id}")
                rds.start_db_instance(DBInstanceIdentifier=db_id)

EventBridge Schedule:

# Stop at 7 PM weekdays (EventBridge cron expressions are evaluated in UTC)
aws events put-rule \
  --name stop-dev-databases \
  --schedule-expression "cron(0 19 ? * MON-FRI *)"
 
# Start at 8 AM weekdays
aws events put-rule \
  --name start-dev-databases \
  --schedule-expression "cron(0 8 ? * MON-FRI *)"
 
# Attach the scheduler Lambda to each rule with `aws events put-targets`

ROI: A db.r5.2xlarge instance ($1.084/hour) running 24/7 costs $782/month. Scheduled for business hours only: $782 × 0.238 = $186/month. Savings: $596/month per instance.

Strategy 2: Aurora Serverless v2 for Variable Workloads

Aurora Serverless v2 scales from 0.5 ACUs to 128 ACUs (Aurora Capacity Units) in seconds. Unlike v1, it supports read replicas, Multi-AZ, and all Aurora features.

Cost Model:

  • ACU pricing: $0.12 per ACU-hour (MySQL- and PostgreSQL-compatible editions)
  • Scaling: adjusts in 0.5 ACU increments within seconds

Ideal Use Cases:

  • Development/test environments (scale to 0.5 ACUs during idle)
  • Intermittent batch processing
  • Applications with unpredictable traffic spikes

Example: A staging database using db.r5.large (2 vCPU, 16GB RAM, $0.29/hour = $209/month) can be replaced with Aurora Serverless v2 configured for 0.5-4 ACUs. Actual usage averages 1.5 ACUs during business hours, 0.5 ACUs at night:

  • Business hours (9 AM–7 PM, 10 hrs): 1.5 ACUs × $0.12 × 10 = $1.80/day
  • Off hours (14 hrs): 0.5 ACUs × $0.12 × 14 = $0.84/day
  • Monthly cost: ($1.80 + $0.84) × 30 = $79.20

Savings: $130/month (62% reduction)
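To pin the 0.5-4 ACU range from this example on an existing cluster, a minimal boto3 sketch (the cluster identifier is hypothetical):

import boto3

rds = boto3.client('rds')

# Constrain Aurora Serverless v2 scaling to 0.5-4 ACUs on an existing cluster
rds.modify_db_cluster(
    DBClusterIdentifier='staging-cluster',  # hypothetical identifier
    ServerlessV2ScalingConfiguration={
        'MinCapacity': 0.5,
        'MaxCapacity': 4.0,
    },
    ApplyImmediately=True,
)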

Strategy 3: DynamoDB Capacity Mode Optimization

DynamoDB offers two capacity modes with radically different cost profiles:

On-Demand Mode:

  • Write: $1.25 per million write request units
  • Read: $0.25 per million read request units
  • Best for: Unpredictable workloads, new applications

Provisioned Mode:

  • Write: $0.00065/hour per WCU (Write Capacity Unit)
  • Read: $0.00013/hour per RCU (Read Capacity Unit)
  • Best for: Predictable, consistent traffic

Break-Even Analysis:

For 100 WCU and 100 RCU:

  • Provisioned: (100 × $0.00065 + 100 × $0.00013) × 730 hrs = $56.94/month
  • On-Demand equivalent: that same spend buys roughly 38M write request units ($47.45 ÷ $1.25 per million) and 38M read request units ($9.49 ÷ $0.25 per million)

If your actual traffic is well below roughly 38M writes and 38M reads per month, On-Demand is cheaper. Above that, switch to Provisioned with Auto Scaling.
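The same comparison expressed as a small calculator (a sketch using the list prices above; adjust the request volumes to your own traffic):

# Break-even between Provisioned and On-Demand DynamoDB capacity (us-east-1 list prices)
WCU_HOUR, RCU_HOUR = 0.00065, 0.00013        # provisioned $/capacity-unit-hour
OD_WRITE, OD_READ = 1.25 / 1e6, 0.25 / 1e6   # on-demand $/request unit
HOURS = 730

def provisioned_cost(wcu, rcu):
    return (wcu * WCU_HOUR + rcu * RCU_HOUR) * HOURS

def on_demand_cost(writes, reads):
    return writes * OD_WRITE + reads * OD_READ

print(provisioned_cost(100, 100))    # ~$56.94/month
print(on_demand_cost(38e6, 38e6))    # ~$57.00/month -> roughly the break-even point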

Auto Scaling Configuration:

aws application-autoscaling register-scalable-target \
  --service-namespace dynamodb \
  --resource-id "table/my-table" \
  --scalable-dimension "dynamodb:table:WriteCapacityUnits" \
  --min-capacity 5 \
  --max-capacity 500
 
aws application-autoscaling put-scaling-policy \
  --service-namespace dynamodb \
  --resource-id "table/my-table" \
  --scalable-dimension "dynamodb:table:WriteCapacityUnits" \
  --policy-name "my-scaling-policy" \
  --policy-type "TargetTrackingScaling" \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "DynamoDBWriteCapacityUtilization"
    }
  }'

Strategy 4: RDS Storage Autoscaling

RDS allows storage autoscaling up to a maximum threshold, preventing both over-provisioning and manual resize operations.

aws rds modify-db-instance \
  --db-instance-identifier my-database \
  --max-allocated-storage 1000 \
  --apply-immediately

This allocates storage dynamically as needed (up to 1TB), avoiding the pattern of provisioning 800GB "just in case" when you currently need 300GB.

5. The Observability Cost Crisis

Modern observability platforms (Datadog, New Relic, Splunk, CloudWatch) charge by ingestion volume, metric cardinality, and retention. As systems grow more distributed, observability costs often spiral out of control—sometimes exceeding the infrastructure costs they're monitoring.

The Problem: Unbounded Cardinality

High-cardinality data (unique values like user IDs, request IDs, session tokens) creates exponential metric growth:

// ❌ BAD: Creates millions of unique metrics
metrics.increment('api.request', {
  user_id: userId,
  request_id: requestId,
  endpoint: '/api/users',
  status_code: 200
});
 
// ✅ GOOD: Bounded cardinality
metrics.increment('api.request', {
  endpoint: '/api/users',
  status_code: 200,
  is_authenticated: true
});

Case Study: A SaaS company tracked API latency with user_id as a tag. With 500K active users and 10 endpoints, this created 5 million unique metric series. At Datadog's pricing ($0.05/custom metric), this alone cost $250,000/month. After removing user_id and aggregating to endpoint-level metrics, costs dropped to $500/month.

Strategy 1: Structured Log Sampling

Not all logs deserve permanent storage. Implement dynamic sampling based on log level and environment:

import winston from 'winston';
import WinstonCloudWatch from 'winston-cloudwatch';
 
const logger = winston.createLogger({
  level: process.env.NODE_ENV === 'production' ? 'info' : 'debug',
  format: winston.format.json(),
  transports: [
    new winston.transports.Console(),
    // CloudWatch transport provided by the winston-cloudwatch package
    new WinstonCloudWatch({
      logGroupName: '/app/production',
      logStreamName: process.env.INSTANCE_ID,
      // Only send WARN and above to CloudWatch in production
      level: 'warn'
    })
  ]
});
 
// For debugging specific users, enable verbose logging selectively
if (debugUserIds.includes(userId)) {
  logger.level = 'debug';
}

Advanced Pattern: Sampling by Response Time

Always log slow requests, sample fast ones:

const shouldLog = (duration: number) => {
  if (duration > 1000) return true;  // Always log >1s requests
  if (duration > 500) return Math.random() < 0.5;  // 50% of 500-1000ms
  return Math.random() < 0.01;  // 1% of fast requests
};
 
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = Date.now() - start;
    if (shouldLog(duration)) {
      logger.info('request_completed', {
        path: req.path,
        duration,
        status: res.statusCode
      });
    }
  });
  next();
});

Strategy 2: Distributed Tracing Sampling

100% trace collection is rarely necessary. OpenTelemetry supports sophisticated sampling strategies:

import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
  AlwaysOnSampler
} from '@opentelemetry/sdk-trace-base';
 
const sampler = new ParentBasedSampler({
  root: new TraceIdRatioBasedSampler(0.05), // Sample 5% of traces
  remoteParentSampled: new AlwaysOnSampler(), // But always sample if parent was sampled
  remoteParentNotSampled: new TraceIdRatioBasedSampler(0.01),
});
 
const provider = new NodeTracerProvider({ sampler });

Advanced: Error-Priority Sampling

Always sample traces with errors, sample successful traces at a lower rate:

import {
  Sampler,
  SamplingDecision,
  SamplingResult
} from '@opentelemetry/sdk-trace-base';
 
class ErrorPrioritySampler implements Sampler {
  shouldSample(context, traceId, spanName, spanKind, attributes): SamplingResult {
    // Always sample if the span reports an error status code
    if (Number(attributes['http.status_code']) >= 400) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }
 
    // Sample 2% of successful requests
    return Math.random() < 0.02
      ? { decision: SamplingDecision.RECORD_AND_SAMPLED }
      : { decision: SamplingDecision.NOT_RECORD };
  }
 
  toString(): string {
    return 'ErrorPrioritySampler';
  }
}

Strategy 3: CloudWatch Logs Retention Policies

Default CloudWatch Logs retention is "Never Expire." Over years, this accumulates massive storage costs.

# Set 30-day retention on all log groups
for log_group in $(aws logs describe-log-groups --query 'logGroups[*].logGroupName' --output text); do
  aws logs put-retention-policy \
    --log-group-name "$log_group" \
    --retention-in-days 30
done

Retention Strategy by Log Type (applied in the sketch after this list):

  • Application logs: 30 days (export critical data to S3 for long-term analysis)
  • Access logs: 90 days
  • Audit logs: 365 days (compliance requirement)
  • Debug logs: 7 days
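A minimal boto3 sketch that applies those tiers by log group name prefix (the prefixes and the mapping are hypothetical; adapt them to your naming convention):

import boto3

logs = boto3.client('logs')

# Hypothetical prefix -> retention mapping following the tiers above
RETENTION_BY_PREFIX = {
    '/app/': 30,      # application logs
    '/access/': 90,   # access logs
    '/audit/': 365,   # audit logs (compliance)
    '/debug/': 7,     # debug logs
}

paginator = logs.get_paginator('describe_log_groups')
for prefix, days in RETENTION_BY_PREFIX.items():
    for page in paginator.paginate(logGroupNamePrefix=prefix):
        for group in page['logGroups']:
            logs.put_retention_policy(logGroupName=group['logGroupName'], retentionInDays=days)
            print(f"{group['logGroupName']}: retention set to {days} days")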

Cost Impact: CloudWatch Logs costs $0.50/GB ingested + $0.03/GB/month storage. For 1TB/month of logs (ignoring compression):

  • Ingestion: $500/month (unavoidable)
  • Storage with indefinite retention: $0.03/GB × 1,000GB × (1+2+3+...+12 months) = $2,340 in year 1, and it keeps growing every year
  • Storage with 30-day retention: $0.03/GB × 1,000GB × 1 month = $30/month ($360/year)

Savings: ~$1,980 in year 1, increasing each subsequent year

Strategy 4: Metric Aggregation at Source

Instead of sending millions of individual events to your observability platform, aggregate them locally:

// ❌ BAD: Sends 1M events/hour to Datadog (expensive)
app.post('/api/orders', async (req, res) => {
  await processOrder(req.body);
  datadogClient.increment('orders.created', { product: req.body.productId });
  res.json({ success: true });
});
 
// ✅ GOOD: Aggregate locally, flush every 60s
const orderMetrics = new Map<string, number>();
 
app.post('/api/orders', async (req, res) => {
  await processOrder(req.body);
  const key = req.body.productId;
  orderMetrics.set(key, (orderMetrics.get(key) || 0) + 1);
  res.json({ success: true });
});
 
setInterval(() => {
  for (const [productId, count] of orderMetrics.entries()) {
    datadogClient.gauge('orders.created', count, { product: productId });
  }
  orderMetrics.clear();
}, 60000);

6. Compute Optimization: Beyond Right-Sizing

Right-sizing instances is table stakes. Advanced compute optimization focuses on matching compute patterns to pricing models.

Spot Instances for Fault-Tolerant Workloads

Spot instances offer 70-90% discounts compared to On-Demand, but AWS can reclaim them with 2 minutes notice.

Ideal Use Cases:

  • CI/CD build agents
  • Batch processing jobs
  • Stateless web servers (behind load balancer)
  • Machine learning training

Kubernetes Spot Integration:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
  weight: 100
  limits:
    resources:
      cpu: 1000
  providerRef:
    name: default
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector:
    karpenter.sh/discovery: my-cluster
  securityGroupSelector:
    karpenter.sh/discovery: my-cluster
  instanceProfile: KarpenterNodeInstanceProfile

Graceful Shutdown Handler:

import express from 'express';
import http from 'http';
 
const app = express();
const server = http.createServer(app);
 
let shuttingDown = false;
 
// Health check endpoint; returns 503 once shutdown starts so the ALB drains this target
app.get('/health', (req, res) =>
  shuttingDown
    ? res.status(503).json({ status: 'shutting_down' })
    : res.json({ status: 'ok' })
);
 
// Spot termination warning listener (assumes IMDSv1 is allowed; IMDSv2-only requires a session token)
const checkSpotTermination = async () => {
  try {
    const response = await fetch(
      'http://169.254.169.254/latest/meta-data/spot/termination-time',
      { signal: AbortSignal.timeout(1000) }
    );
 
    // 200 means a termination notice is present; 404 means business as usual
    if (response.status === 200 && !shuttingDown) {
      console.log('Spot termination notice received, shutting down gracefully');
      shuttingDown = true;
 
      // Allow 90 seconds for existing requests to complete
      setTimeout(() => {
        server.close(() => {
          console.log('Server shut down gracefully');
          process.exit(0);
        });
      }, 90000);
    }
  } catch (err) {
    // Network error or timeout reaching the metadata endpoint; continue normally
  }
};
 
// Check every 5 seconds
setInterval(checkSpotTermination, 5000);
 
server.listen(3000);

Lambda Cost Optimization

Lambda pricing is based on GB-seconds: memory allocation × execution time. Counterintuitively, increasing memory can reduce costs by decreasing execution time.

Example: A Lambda function with 512MB memory takes 10 seconds to process an image (CPU-bound task).

  • Cost: 0.5GB × 10s = 5 GB-seconds × $0.0000166667 = $0.0000833 per invocation
  • With 2,048MB: the same task completes in 2.5s (4× memory brings ~4× CPU)
  • New cost: 2GB × 2.5s = 5 GB-seconds × $0.0000166667 = $0.0000833

Same cost, but 4× faster execution! At 3,072MB, execution drops to 1.8s:

  • Cost: 3GB × 1.8s = 5.4 GB-seconds × $0.0000166667 = $0.0000900
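The arithmetic above falls out of a simple cost function. The sketch below ignores the free tier and the flat $0.20 per million requests charge, which does not depend on memory:

GB_SECOND_PRICE = 0.0000166667  # x86 price; arm64 is slightly cheaper

def lambda_compute_cost(memory_mb, duration_s):
    """Per-invocation compute cost in dollars (excludes the per-request charge and free tier)."""
    return (memory_mb / 1024) * duration_s * GB_SECOND_PRICE

print(lambda_compute_cost(512, 10))    # ~$0.0000833
print(lambda_compute_cost(2048, 2.5))  # ~$0.0000833 -- same GB-seconds, 4x faster
print(lambda_compute_cost(3072, 1.8))  # ~$0.0000900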

Lambda Power Tuning Tool:

AWS provides an open-source tool to automatically test multiple memory configurations:

# Deploy the tool
git clone https://github.com/alexcasalboni/aws-lambda-power-tuning
cd aws-lambda-power-tuning
sam deploy --guided
 
# Run optimization
aws stepfunctions start-execution \
  --state-machine-arn "arn:aws:states:us-east-1:123456789012:stateMachine:powerTuningStateMachine" \
  --input '{
    "lambdaARN": "arn:aws:lambda:us-east-1:123456789012:function:my-function",
    "powerValues": [512, 1024, 1536, 2048, 3008],
    "num": 50
  }'

Savings Plans vs. Reserved Instances

Compute Savings Plans: Up to 66% discount, flexible across instance types, regions, and even Lambda/Fargate.

EC2 Instance Savings Plans: Up to 72% discount, flexible within instance family (e.g., m5.large ↔ m5.xlarge).

Reserved Instances: Up to 72% discount, locked to specific instance type and region.

Recommendation Strategy:

  1. Analyze baseline usage over 30 days
  2. Commit Compute Savings Plans for 70% of baseline (maximum flexibility); Cost Explorer can size this commitment for you, as sketched after this list
  3. Use Spot instances for burst capacity
  4. Reserve On-Demand for the remaining 30% (unpredictable workloads)
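For step 2, Cost Explorer can return AWS's own commitment recommendation. A minimal boto3 sketch (the summary fields are read defensively because their presence depends on account history):

import boto3

ce = boto3.client('ce', region_name='us-east-1')

# Ask Cost Explorer for a 1-year, no-upfront Compute Savings Plan recommendation
# based on the last 30 days of usage
response = ce.get_savings_plans_purchase_recommendation(
    SavingsPlansType='COMPUTE_SP',
    TermInYears='ONE_YEAR',
    PaymentOption='NO_UPFRONT',
    LookbackPeriodInDays='THIRTY_DAYS',
)

summary = response['SavingsPlansPurchaseRecommendation'].get(
    'SavingsPlansPurchaseRecommendationSummary', {})
print("Recommended hourly commitment:", summary.get('HourlyCommitmentToPurchase'))
print("Estimated monthly savings:", summary.get('EstimatedMonthlySavingsAmount'))
print("Estimated savings percentage:", summary.get('EstimatedSavingsPercentage'))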

ROI Example:

  • Baseline: 50× c5.large instances ($0.085/hour) running 24/7 = $3,060/month
  • With Compute Savings Plan (1-year, no upfront): $1,468/month (52% savings)
  • Annual savings: $19,104

7. FinOps: Unit Economics Over Absolute Costs

The mental shift from "our AWS bill is $100K/month" to "we spend $0.12 per 1,000 API requests" transforms cost optimization from a reactive cost-cutting exercise to a strategic discipline.

Key Unit Metrics by Application Type

SaaS Applications:

  • Cost per Monthly Active User (MAU)
  • Cost per 1,000 API requests
  • Cost per GB of data processed

E-commerce:

  • Cost per transaction
  • Cost per customer session
  • Infrastructure cost as % of GMV

Media/Content:

  • Cost per hour of video streamed
  • Cost per 1,000 page views
  • CDN cost per TB delivered

Data Analytics:

  • Cost per query
  • Cost per GB scanned
  • Cost per transformation pipeline run

Building a Unit Economics Dashboard

import boto3
from datetime import datetime, timedelta
 
# Get AWS costs
ce = boto3.client('ce', region_name='us-east-1')
 
response = ce.get_cost_and_usage(
    TimePeriod={
        'Start': (datetime.now() - timedelta(days=30)).strftime('%Y-%m-%d'),
        'End': datetime.now().strftime('%Y-%m-%d')
    },
    Granularity='MONTHLY',
    Metrics=['UnblendedCost'],
    GroupBy=[{'Type': 'SERVICE', 'Key': 'SERVICE'}]
)
 
total_cost = sum(
    float(group['Metrics']['UnblendedCost']['Amount'])
    for result in response['ResultsByTime']
    for group in result['Groups']  # sum across every service, not just the first group
)
 
# Get business metrics from your database
api_requests = get_total_api_requests_last_30_days()  # e.g., 50M
mau = get_monthly_active_users()  # e.g., 10,000
 
# Calculate unit economics
cost_per_1k_requests = (total_cost / api_requests) * 1000
cost_per_mau = total_cost / mau
 
print(f"Total infrastructure cost: ${total_cost:,.2f}")
print(f"Cost per 1,000 API requests: ${cost_per_1k_requests:.4f}")
print(f"Cost per MAU: ${cost_per_mau:.2f}")

The "Good Growth" Test

Your AWS bill increased 20%. Is this bad?

  • If revenue increased 30%: You're becoming more efficient (good growth)
  • If users increased 15%: You're becoming less efficient (investigate)
  • If neither changed: You have a cost problem (immediate action required)

Track these metrics in your monitoring dashboard alongside infrastructure metrics:

// Report unit economics to CloudWatch
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();
 
async function reportUnitEconomics() {
  const totalCost = await getMonthlyAWSCost();
  const apiRequests = await getAPIRequestCount();
  const mau = await getMonthlyActiveUsers();
 
  await cloudwatch.putMetricData({
    Namespace: 'FinOps',
    MetricData: [
      {
        MetricName: 'CostPerThousandRequests',
        Value: (totalCost / apiRequests) * 1000,
        Unit: 'None',
        Timestamp: new Date()
      },
      {
        MetricName: 'CostPerMAU',
        Value: totalCost / mau,
        Unit: 'None',
        Timestamp: new Date()
      }
    ]
  }).promise();
}

8. Automation and Continuous Optimization

Cost optimization isn't a one-time project—it's an ongoing discipline requiring automation and continuous monitoring.

AWS Cost Anomaly Detection

Enable AWS Cost Anomaly Detection to receive alerts when spending patterns deviate:

aws ce create-anomaly-monitor \
  --anomaly-monitor '{
    "MonitorName": "EntireAWSAccount",
    "MonitorType": "DIMENSIONAL",
    "MonitorDimension": "SERVICE"
  }'
 
aws ce create-anomaly-subscription \
  --anomaly-subscription '{
    "SubscriptionName": "DailyAnomalyAlerts",
    "Threshold": 100,
    "Frequency": "DAILY",
    "MonitorArnList": ["arn:aws:ce::123456789012:anomaly-monitor/xxxxx"],
    "Subscribers": [
      {
        "Type": "EMAIL",
        "Address": "devops@company.com"
      },
      {
        "Type": "SNS",
        "Address": "arn:aws:sns:us-east-1:123456789012:cost-alerts"
      }
    ]
  }'

Tagging Strategy for Cost Allocation

Without proper tagging, you can't answer "which team/project/environment is driving costs?"

Recommended Tag Schema:

# Apply to all resources
Environment: production | staging | dev
Team: platform | data | frontend | mobile
Project: customer-portal | analytics-platform | mobile-app
CostCenter: engineering | marketing | sales
Owner: email@company.com

Enforce with an AWS Organizations Service Control Policy. Note that stacking every tag into one negated condition only denies requests where all of the tags are missing, so use one Deny statement per required tag (the Environment statement is shown; repeat the pattern with a Null condition for Team and Owner):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyCreateWithoutEnvironmentTag",
      "Effect": "Deny",
      "Action": [
        "ec2:RunInstances",
        "rds:CreateDBInstance",
        "dynamodb:CreateTable"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestTag/Environment": ["production", "staging", "dev"]
        }
      }
    }
  ]
}
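The SCP prevents new untagged resources; for existing ones, a minimal sketch using the Resource Groups Tagging API flags resources missing the required tags. Note the API only surfaces resources that are or were tagged, so it complements rather than replaces the policy:

import boto3

tagging = boto3.client('resourcegroupstaggingapi')

REQUIRED_TAGS = {'Environment', 'Team', 'Owner'}  # from the schema above

paginator = tagging.get_paginator('get_resources')
untagged = []
for page in paginator.paginate(ResourcesPerPage=100):
    for resource in page['ResourceTagMappingList']:
        tag_keys = {t['Key'] for t in resource.get('Tags', [])}
        missing = REQUIRED_TAGS - tag_keys
        if missing:
            untagged.append((resource['ResourceARN'], sorted(missing)))

print(f"Resources missing required tags: {len(untagged)}")
for arn, missing in untagged[:10]:  # show the first 10
    print(f"  {arn}: missing {', '.join(missing)}")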

AWS Trusted Advisor Automation

AWS Trusted Advisor provides cost optimization recommendations. Automate weekly reports:

import boto3
import json
 
support = boto3.client('support', region_name='us-east-1')  # Trusted Advisor via the Support API requires a Business or Enterprise support plan
 
# Get Trusted Advisor checks
checks = support.describe_trusted_advisor_checks(language='en')
 
cost_optimization_checks = [
    check for check in checks['checks']
    if check['category'] == 'cost_optimizing'
]
 
for check in cost_optimization_checks:
    result = support.describe_trusted_advisor_check_result(
        checkId=check['id'],
        language='en'
    )
 
    if result['result']['flaggedResources']:
        # Estimated savings live in the check result's cost-optimizing summary
        summary = result['result'].get('categorySpecificSummary', {}).get('costOptimizing', {})
        print(f"\n{check['name']}:")
        print(f"Estimated monthly savings: ${summary.get('estimatedMonthlySavings', 0):,.2f}")
 
        for resource in result['result']['flaggedResources'][:5]:  # Top 5
            print(f"  - {resource['metadata']}")

Summary: The Cost Optimization Maturity Model

Level 1: Basic Hygiene (15-25% savings)

  • Right-size over-provisioned instances
  • Delete unused resources
  • Purchase Savings Plans/Reserved Instances
  • Set up basic CloudWatch billing alarms

Level 2: Architectural Optimization (25-40% additional savings)

  • Implement VPC Endpoints
  • Migrate to Graviton
  • Upgrade to gp3 EBS volumes
  • Enable S3 Intelligent-Tiering
  • Schedule non-production resources

Level 3: Advanced FinOps (10-20% additional savings)

  • Implement observability sampling
  • Use Spot instances for fault-tolerant workloads
  • Optimize Lambda memory configurations
  • Implement unit economics tracking
  • Deploy cost anomaly detection

Level 4: Continuous Optimization (ongoing 5-10%/year)

  • Automated cost governance policies
  • Engineering KPIs tied to unit economics
  • Infrastructure-as-Code cost awareness
  • Real-time cost allocation and chargeback

Action Checklist

Immediate Actions (Week 1)

  • Enable S3 Intelligent-Tiering on buckets >10TB
  • Convert all EBS gp2 volumes to gp3
  • Enable AWS Cost Anomaly Detection
  • Audit and delete orphaned EBS snapshots
  • Set CloudWatch Logs retention policies (30-90 days)

Short-term Actions (Month 1)

  • Create VPC Gateway Endpoints for S3/DynamoDB
  • Migrate RDS instances to Graviton (db.m6g/r6g)
  • Schedule dev/staging databases to stop at night
  • Implement log sampling in production applications
  • Configure DynamoDB Auto Scaling for provisioned tables

Medium-term Actions (Quarter 1)

  • Migrate Lambda functions to ARM64
  • Build multi-arch Docker images for ECS/EKS
  • Implement distributed tracing sampling (5-10%)
  • Deploy Spot instances for CI/CD and batch jobs
  • Create unit economics dashboard

Strategic Actions (Ongoing)

  • Establish tagging policy and enforce with SCPs
  • Build cost allocation reporting by team/project
  • Integrate cost metrics into engineering dashboards
  • Conduct quarterly architectural cost reviews
  • Implement FinOps training for engineering teams

Need expert guidance on cloud cost optimization? Schedule a consultation for a comprehensive infrastructure audit and custom optimization roadmap.
