The System Design Problem Disguised As A Billing Problem

There's a server running right now. 16 CPU cores. Uses 1.3 of them.

Not sometimes. All day. Every day. For three years.

This isn't a sizing problem. It's a systems design problem. And it shows up in hundreds of cloud accounts because engineers are designing based on fear instead of data.

The Core Design Flaw

Here's the broken design pattern:

Engineers architect for Black Friday. The traffic spike. Everything going viral at once. The worst day that might never come.

So the system gets designed around an m5.4xlarge. Sixteen CPUs. 64GB of memory. Always-on, sustained-performance architecture.

Then the actual workload shows up. Normal traffic. Bursty patterns. Light load most of the time.

The server sits there, 14.7 cores idle, because the architecture was designed for a workload that doesn't exist.

Even with autoscaling turned on. Scaling from 10 oversized instances to 50 oversized instances just multiplies the design mistake.

This is designing a fire station into the living room because the kitchen might catch fire someday.

The design is fundamentally wrong.

Architecture Mismatch In Action

One team architected their system around fifty m5.2xlarge instances—eight CPUs and 32GB each.

The design decision was based on capacity planning: "Need headroom for spikes. Need room to grow. Need to stay safe."

Standard architecture thinking. Also completely wrong.

The actual workload ran at 8% CPU usage. 15% memory usage. For months. That's 400 provisioned CPUs doing the work of roughly 32.

When someone finally profiled the system, the design flaw became obvious:

The traffic pattern was bursty—quick spikes every few hours, then quiet.

The architecture was built for sustained, constant load.

Sustained-performance instances (m5 family) for a burst-pattern workload.

Wrong architecture pattern from day one.

The redesign moved to t3.large instances. Burst-optimized architecture. Two CPUs instead of eight.

This wasn't downsizing. This was fixing the architecture.

Nothing broke. Performance improved. Because the architecture finally matched the actual system behavior.

The first design was built on assumptions. The second design was built on observed workload patterns.

That's the difference between bad design and good design.
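
One way to sanity-check a move like that, sketched with the AWS CLI and a placeholder instance ID: after switching to a burst instance, watch the credit balance. An hourly minimum of CPUCreditBalance that keeps hitting zero means the workload isn't actually bursty, and the t3 will throttle.

aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=i-your-instance \
  --start-time $(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Minimum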

Why This Design Pattern Keeps Repeating

The incentives punish good design and reward bad design.

Design too tight and the system goes down. Customers complain. Managers ask questions. Engineers start job hunting.

Over-design? Nothing happens. System runs fine. No one questions the architecture decisions.

So systems get designed big. Just in case.

But "just in case" isn't architecture. It's guessing. And it means building real infrastructure for imaginary workloads.

What Good Design Looks Like

Good design starts with: "What workload am I actually running?"

Not "what might I need?" but "what does my system actually do?"

The monitoring data already has the answer. CloudWatch, Datadog, whatever's running. The actual system behavior has been recorded for months.

It just doesn't get used in the design process.

# Daily CPU averages for the last 30 days. The -d flag is GNU date;
# on macOS, swap in: date -u -v-30d +%Y-%m-%dT%H:%M:%S
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-your-instance \
  --start-time $(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 86400 \
  --statistics Average

Numbers under 20% CPU usage mean the architecture is mismatched. Under 10%? Even against a conservative 40% utilization target, the design is 4x off from reality.

This isn't a metric. It's feedback on the architecture design.
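
CPU is only half the picture. EC2 doesn't report memory to CloudWatch by default; that takes the CloudWatch agent. A sketch of the matching memory query, assuming the agent is installed and publishing under its default CWAgent namespace with an InstanceId dimension:

aws cloudwatch get-metric-statistics \
  --namespace CWAgent \
  --metric-name mem_used_percent \
  --dimensions Name=InstanceId,Value=i-your-instance \
  --start-time $(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 86400 \
  --statistics Average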

The Design Anti-Patterns

These show up everywhere:

Memory-optimized architecture (r5 instances) running CPU-bound work.

Compute-optimized architecture (c5 instances) running balanced workloads.

Production-grade architecture running development environments.

Sustained-performance architecture running burst-pattern workloads.

These aren't resource problems. These are architecture mistakes.

Using r5 instances (designed for memory-heavy databases) to run stateless web servers isn't a sizing issue—it's picking the wrong tool for the job.

Instance families aren't about price points. They're architecture patterns. Each one is designed for specific workload characteristics.

When the data shows the workload isn't memory-heavy, isn't CPU-heavy, just lightly loaded most of the time—that's the system revealing the architecture doesn't match reality.
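
One command makes the pattern visible. A quick sketch that counts running instances by type, so the dominant families in an account stand out:

aws ec2 describe-instances \
  --filters Name=instance-state-name,Values=running \
  --query 'Reservations[].Instances[].InstanceType' \
  --output text | tr '\t' '\n' | sort | uniq -c | sort -rn

If the top of that list is all one family, that family is the de facto architecture decision, whether anyone made it deliberately or not.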

Fixing The Design

Pick one system. The biggest one.

Pull 30 days of actual behavior data. CPU, memory, network, disk.

Profile the workload pattern:

  • Sustained load or bursty?

  • CPU-bound, memory-bound, or neither?

  • What actually constrains the system?

Then design for that.
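
A rough sketch of that profiling step for CPU, again with a placeholder instance ID: pull hourly averages and hourly peaks for 30 days, then compare. A peak several times the average reads as bursty; a peak close to the average reads as sustained. (The exact cutoff is a judgment call, not a rule.)

aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-your-instance \
  --start-time $(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Average Maximum \
  --query 'Datapoints[].[Average,Maximum]' \
  --output text |
awk '{ a += $1; if ($2 > m) m = $2; n++ }
  END { printf "avg %.1f%%  peak %.1f%%  peak/avg %.1fx\n", a/n, m, m/(a/n) }'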

Sustained-performance instances (m5, c5, r5) handling bursty workloads? That's wrong architecture. Switch to burst-optimized (t3, t4g).

Memory-optimized instances (r5) with low memory usage? Wrong architecture. Switch to balanced (m5) or compute-optimized (c5).

Anything running under 20% CPU constantly? The architecture is over-designed by at least 2x.

Test it. Pick something non-critical. Redesign based on actual workload profile. Monitor for a week.

Nothing breaks? The architecture just got fixed.

Something breaks? That's real data about actual system needs—use it to design correctly.
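
The mechanical part of that test, sketched for a single EBS-backed instance with a placeholder ID (the instance has to be stopped to change its type):

aws ec2 stop-instances --instance-ids i-your-instance
aws ec2 wait instance-stopped --instance-ids i-your-instance
aws ec2 modify-instance-attribute \
  --instance-id i-your-instance \
  --instance-type Value=t3.large
aws ec2 start-instances --instance-ids i-your-instance

Then the same CloudWatch queries, run for a week, tell you whether the redesign held.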

The Design Philosophy Shift

Stop designing for disasters that never come.

Start designing for workloads that actually exist.

Design based on observed behavior, not imagined scenarios.

Profile the system. Understand the workload. Match the architecture to reality.

That m5.4xlarge running at 8% CPU usage isn't underutilized. It's incorrectly designed. Infrastructure architected for a system that doesn't exist.

Fix the architecture. Everything else follows.

Three Design Questions

When was the last time actual workload data informed an architecture decision?

Are systems designed based on profiled behavior patterns or "just in case" thinking?

How many systems are running architectures from two years ago, never profiled, never validated against actual behavior?

If these questions don't have answers, that's not systems design. That's architecture guessing.

And guessing wrong doesn't just cost money. It costs good engineering.

Formula Used To Save Millions

If CPU < 20% for 7 days → Downsize 2 levels
If CPU < 40% for 7 days → Downsize 1 level
If Memory < 40%         → Wrong instance family
If Dev/Test             → t3.small + scheduling
If Batch                → Use Spot
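
A minimal executable version of the sizing rules, assuming the 7-day averages come from the CloudWatch queries above (the numbers here are placeholders):

cpu=12   # placeholder: 7-day CPU average, percent
mem=35   # placeholder: 7-day memory average, percent
awk -v cpu="$cpu" -v mem="$mem" 'BEGIN {
  if      (cpu < 20) print "Downsize 2 levels"
  else if (cpu < 40) print "Downsize 1 level"
  if (mem < 40)      print "Wrong instance family"
}'

The last two rules don't need code: dev/test scheduling is a cron'd aws ec2 stop-instances and start-instances around working hours, and batch work moves to Spot.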
