There is a particular category of AWS incident that I have started calling the "everything looked fine in testing" failure. It goes like this. You design a serverless API. You configure a Lambda function with sensible defaults, wire it through API Gateway, point it at DynamoDB, and test it in your dev environment with the handful of engineers pinging it throughout the day. Everything looks healthy. Latency is acceptable. Costs are tracking to plan.
Then you run a campaign. Or you land on the front page. Or your sales team does their job too well and signs a new customer who brings three thousand of their users on day one. Your traffic goes from three hundred requests per second to three thousand in the space of a minute. And your Lambda function, which has never had to spin up more than a dozen concurrent instances at once, is now being asked to handle a hundred.
Customers leave. The Slack channel lights up. You are spending a Saturday explaining to your CTO why the architecture that "passed all our tests" just fell over under a load it should have anticipated. I have been in this situation. Not once, but twice. The second time is when I stopped treating load testing as a post-deployment activity.
This post is about how I now model Lambda cold-start behaviour under spike traffic before a single resource is provisioned — and specifically how I use pinpole to make that modelling rigorous, reproducible, and tied directly to the deployment that follows.
The Cold Start Problem, Precisely Stated
Before going into tooling and workflow, it is worth being precise about what we are actually modelling. Lambda's execution model does not maintain persistent servers. When an invocation arrives and no warm execution environment exists for that function, Lambda must provision one. That provisioning sequence involves selecting a host, initialising the execution environment, loading the runtime, and executing your initialisation code — the logic in your function's module-level scope that runs once per environment, not once per invocation. The total elapsed time for this sequence is the cold start duration.
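The once-per-environment distinction is easy to demonstrate in code. A minimal sketch (in Python for brevity, even though the baseline runtime later in this post is Node.js; the counter and names are purely illustrative, standing in for SDK imports and client construction):

```python
# Module-level scope: this runs once per execution environment, during
# the cold start "init" phase -- not once per invocation.
INIT_COUNT = 0

def _expensive_init():
    # Stand-in for importing SDKs and constructing clients. In a real
    # function this is where you would create your DynamoDB client.
    global INIT_COUNT
    INIT_COUNT += 1
    return {"client": "ready"}

CLIENT = _expensive_init()  # cost paid once per cold start

def handler(event, context=None):
    # Handler scope: runs on every invocation, warm or cold. Because
    # CLIENT was built at module level, warm invocations skip the
    # expensive setup entirely.
    return {"init_count": INIT_COUNT, "id": event.get("id")}
```

However many times the warm environment invokes `handler`, the init counter stays at one; only a new environment (a cold start) pays the setup cost again.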
Cold start duration is not a constant. It varies along several dimensions:
- Runtime. Node.js on the V8 engine has the shortest median cold start, typically under one hundred milliseconds for lightweight functions. Python is slightly slower but comparable. Java, with its JVM startup, is substantially slower — three hundred milliseconds to well over a second for functions with significant class-loading requirements.
- Memory allocation. Lambda allocates CPU proportionally to memory. A function allocated 1,024 MB gets significantly more CPU than a function at 128 MB, affecting initialisation speed directly. Counterintuitively, right-sizing upward can reduce both cold start latency and total execution cost.
- Package size and initialisation code. Every module your function imports adds to the cold start. A Lambda function with minimal imports will cold-start faster than one that imports the entire AWS SDK, three ORM libraries, and a logging framework.
Cold starts are not primarily a problem at steady state. The cold start problem is a spike problem. When traffic increases rapidly, Lambda must provision new environments in parallel. Under a genuine traffic spike, you can find yourself with dozens or hundreds of concurrent cold starts happening simultaneously — all spiking p99 latency at the moment when your users' experience is most consequential. Steady-state load testing does not expose this.
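A useful back-of-envelope check for how many parallel cold starts a spike will demand is Little's law: concurrent executions roughly equal arrival rate times mean duration. A sketch, using an assumed 40 ms mean duration that happens to make the opening example's numbers line up:

```python
def estimated_concurrency(rps: float, mean_duration_ms: float) -> float:
    """Little's law: in-flight executions ~= arrival rate x mean duration."""
    return rps * mean_duration_ms / 1000

# The opening example: 300 RPS at a ~40 ms mean duration needs about a
# dozen concurrent environments...
print(estimated_concurrency(300, 40))   # 12.0

# ...while the same function at 3,000 RPS needs around 120 -- and nearly
# all of the difference must be cold-started at once when the spike hits.
print(estimated_concurrency(3000, 40))  # 120.0
```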
Why "Deploy to Discover" Is Not a Testing Strategy
For most of my career, the accepted practice for understanding how a Lambda architecture behaved under spike traffic was to deploy it and generate spike traffic against the live environment. This workflow has real problems, and I do not think the engineering community has been honest enough about them.
- The cost problem. Running a realistic load test environment — and paying for Lambda invocations, API Gateway requests, DynamoDB capacity, and data transfer — is not free. Running it multiple times across iterative configuration changes multiplies that cost. For a growth-stage company with a $30,000 monthly AWS bill, burning $2,000 to $5,000 on ephemeral load test environments is a material expense.
- The iteration speed problem. The cycle of change, deploy, stabilise, test, collect, analyse, and repeat is slow. Optimising a Lambda architecture through five or six significant configuration changes means a full day of elapsed time, even for an experienced engineer.
- The late feedback problem. By the time you are load testing a deployed environment, your infrastructure decisions are partially locked in. If your load test reveals a four-hundred-millisecond cold start that makes your p99 budget unachievable, and you decide to switch runtimes, you are refactoring a deployed function in a live environment.
- The fundamental constraint. Load testing tools like k6 and Gatling require deployed infrastructure to exist. There is no AWS-native mechanism for simulating traffic through an architecture design before provisioning. The industry has historically told you: build it first, then find out.
pinpole changes that constraint directly. The core value proposition — the reason I now use it as the primary validation tool for any Lambda-heavy architecture — is that it runs traffic simulation against architecture designs before any infrastructure is provisioned.
Building the Simulation Model: Canvas and Configuration
Before running any simulation, I spend time on the canvas getting the architecture right. This is not busywork — the fidelity of the simulation depends directly on the fidelity of the model.
For a typical Lambda API, my starting canvas is: Route 53 → CloudFront → API Gateway → Lambda → DynamoDB. pinpole enforces compatibility and directionality rules in real time as I wire services together — if I attempt an invalid connection, the platform blocks it before it is created.
The Lambda node configuration panel is where most of the cold-start-relevant decisions live:
| Config Parameter | Baseline Value | Cold Start Relevance | Notes |
|---|---|---|---|
| Runtime | Node.js 20.x | High | Directly factors into cold start latency model |
| Memory Allocation | 512 MB | High | More CPU → faster init; non-linear cost relationship |
| Reserved Concurrency | Explicit (not default) | Critical | Defines throttle ceiling; reduces pool for other functions |
| Provisioned Concurrency | 0 (baseline run) | Intentional | Set to zero first to observe the cold start problem honestly |
If you change Lambda concurrency settings while a simulation is paused, you must stop the simulation fully and restart it. Concurrency values are applied at simulation initialisation. Resuming a paused run after changing concurrency will not pick up the new values — you will be looking at results that reflect the previous configuration.
The Spike Pattern Simulation: What the Metrics Tell You
With the canvas wired and Lambda configured at baseline, I set up the first spike simulation. pinpole provides four traffic patterns: Constant, Ramp, Spike, and Wave. For cold start modelling, Spike is the right choice and Constant will actively mislead you. Under a Constant pattern at your expected steady-state RPS, Lambda has time to maintain a pool of warm environments and cold starts are infrequent. The metrics look healthy, and you might conclude the architecture is production-ready.
For spike testing a Lambda API, my three-scenario approach is:
Scenario 1 — Baseline
Constant traffic at expected daily load. Confirms steady-state health. Cold starts are infrequent here — this is your sanity check, not your stress test.
Scenario 2 — Peak
Spike at 3–5× baseline. Your expected high-traffic period: busy Monday morning, a sales campaign, a feature launch. Watch the recommendation alert counter during this run.
Scenario 3 — Stress
Spike at 10× baseline. The scenario where the architecture either holds or it does not. This is where cold starts become production incidents.
What I am watching during the Spike simulation is Lambda's latency metric in the node panel. Under Constant load at 1,000 RPS, Lambda latency might sit at 150–200ms. Under a Spike at 10,000 RPS, I typically observe latency spike sharply in the first several seconds as Lambda provisions new execution environments in parallel, then stabilise as warm instances fill the concurrency pool. The shape and magnitude of that initial spike is the data I am after.
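That spike-then-stabilise shape is easy to reproduce with a toy model. This is not pinpole's simulation engine — just an illustrative sketch in which every request beyond the warm pool pays an assumed 800 ms cold start, and every environment cold-started in one step is warm for the next:

```python
def simulate_spike(steps: int, demand: int, warm0: int,
                   cold_ms: int = 800, warm_ms: int = 40) -> list:
    """Toy model of mean latency per step under a sudden spike.

    Each step, `demand` concurrent requests arrive; those beyond the
    warm pool pay a cold start, and cold-started environments join the
    warm pool. Real Lambda ramps environments under burst limits, so
    the real decay is more gradual than this.
    """
    warm = warm0
    avg_latency = []
    for _ in range(steps):
        cold = max(0, demand - warm)
        hot = demand - cold
        avg_latency.append((hot * warm_ms + cold * cold_ms) / demand)
        warm = max(warm, demand)  # cold-started environments stay warm
    return avg_latency

# A 10x spike (120 concurrent) against a 12-environment warm pool:
# the first step is dominated by cold starts, then latency collapses
# back to the warm-path figure.
print(simulate_spike(steps=4, demand=120, warm0=12))  # [724.0, 40.0, 40.0, 40.0]
```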
The AI Recommendation Cycle: Closing the Cold Start Gap
After the baseline spike simulation, I request AI recommendations. The pinpole recommendation engine analyses the current architecture and simulation results and returns prioritised, categorised findings. For a Lambda API with no provisioned concurrency running under spike traffic, the recommendations follow a predictable priority order that I have found is also the correct order to address them:
Add CloudFront (WARNING)
At high RPS, CloudFront absorbs cacheable requests before they reach API Gateway and Lambda — reducing the effective invocation rate and smoothing the spike. A burst that causes 10,000 Lambda invocations per second at origin may translate to only 2,000–3,000 after cache hits absorb the rest. This does not eliminate cold starts, but it reduces their frequency and the peak concurrency demand that drives them.
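The arithmetic is simple enough to sketch; the 75% cache hit ratio below is an assumption chosen to land in the range above, not a number any architecture is guaranteed to achieve:

```python
def origin_rps(edge_rps: float, cache_hit_ratio: float) -> float:
    """Requests that miss the CloudFront cache continue to the origin
    (API Gateway and Lambda); cache hits are absorbed at the edge."""
    return edge_rps * (1.0 - cache_hit_ratio)

# A 10,000 RPS burst at the edge with a 75% cache hit ratio leaves
# 2,500 RPS of Lambda invocations at origin -- the 2,000-3,000 range
# cited above.
print(origin_rps(10_000, 0.75))  # 2500.0
```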
Enable Provisioned Concurrency (INFO)
The direct cold start mitigation. Pre-initialises a specified number of execution environments, keeping them warm with no cold start delay. My heuristic for the initial value is 20–30% of expected peak concurrency — covering the rapid burst at the start of a spike, while on-demand scaling fills in remaining capacity as the spike sustains. Stop the simulation fully, apply the change, restart, and re-run Spike.
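The heuristic can be made concrete with Little's law again. A sketch, where the 40 ms mean duration and the 25% coverage factor are assumed values, not universal recommendations:

```python
import math

def provisioned_concurrency(peak_rps: float, mean_duration_ms: float,
                            coverage: float = 0.25) -> int:
    """Heuristic: pre-provision 20-30% of estimated peak concurrency
    (Little's law), letting on-demand scaling fill in the remainder
    as the spike sustains."""
    peak_concurrency = peak_rps * mean_duration_ms / 1000
    return math.ceil(peak_concurrency * coverage)

# 10,000 RPS peak at a 40 ms mean duration -> ~400 concurrent
# executions at peak; 25% coverage suggests provisioning 100 warm
# environments for the initial burst.
print(provisioned_concurrency(10_000, 40))  # 100
```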
Introduce Circuit Breaker Pattern (WARNING)
Once Lambda's own cold start behaviour is addressed, the simulation often surfaces downstream risk. Without a circuit breaker, Lambda keeps invoking degraded downstream services, with each invocation waiting out its full timeout and exhausting the concurrency pool in the process. Verify explicitly that the circuit breaker thresholds match your downstream SLAs.
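For reference, the pattern itself is small. A minimal, illustrative circuit breaker sketch (the threshold and reset window are placeholders to be tuned against your downstream SLAs, not recommendations):

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures, failing fast
    instead of tying up Lambda concurrency waiting on a degraded
    downstream; after `reset_s` it lets one trial call through."""

    def __init__(self, threshold: int = 5, reset_s: float = 30.0):
        self.threshold = threshold
        self.reset_s = reset_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit
        return result
```

Once the circuit is open, the function returns in microseconds instead of holding an execution environment for the full downstream timeout.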
Implement Asynchronous Processing via SQS (INFO)
For write-path Lambda functions, introducing SQS between API Gateway and Lambda converts the invocation model from push to pull. Lambda controls the consumption rate, naturally smoothing traffic spikes — the burst fills the queue, and Lambda works through messages at a controlled rate. Note: SQS visibility timeout must exceed your Lambda timeout with margin.
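On the margin point: AWS's published guidance for SQS event sources is a visibility timeout of at least six times the function timeout, so an in-flight batch is not redelivered while Lambda is still processing it. A one-line sketch of that rule:

```python
def min_visibility_timeout(function_timeout_s: int, multiplier: int = 6) -> int:
    """AWS guidance for SQS event sources: set the queue's visibility
    timeout to at least six times the function timeout, so retries do
    not fire while a batch is still being processed."""
    return function_timeout_s * multiplier

# A 30-second Lambda timeout implies a visibility timeout of at least
# 180 seconds on the source queue.
print(min_visibility_timeout(30))  # 180
```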
Configure Lambda Auto Scaling (INFO)
Ensure Lambda's concurrency limits can grow with sustained load. The auto-scaling configuration sets targets for concurrency scaling — typically keeping utilisation within 60–70% of reserved concurrency at expected peak RPS. This provides headroom for unexpected additional spikes without hitting the hard throttle ceiling.
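The 60–70% target translates directly into a sizing rule. A sketch, with the RPS and duration figures as assumed inputs rather than measured values:

```python
import math

def reserved_concurrency_for(peak_rps: float, mean_duration_ms: float,
                             target_util: float = 0.65) -> int:
    """Size reserved concurrency so the expected peak sits at ~60-70%
    utilisation, leaving headroom before the hard throttle ceiling."""
    expected = peak_rps * mean_duration_ms / 1000  # Little's law
    return math.ceil(expected / target_util)

# 10,000 RPS at a 40 ms mean duration -> ~400 expected concurrent
# executions; a 65% utilisation target suggests reserving ~616.
print(reserved_concurrency_for(10_000, 40))  # 616
```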
Execution History: The Version Record of Your Optimisation Journey
Every simulation run in pinpole is saved automatically to the Execution History log. Each entry records the run number and status, timestamp, duration, peak RPS, and the estimated monthly cost of the simulated architecture at that load level.
The Version Workflow Viewer stores the exact architecture snapshot associated with each run. I can select any historical run and inspect the exact canvas state at that point — the services present, the connections wired, and every configuration value on every node. For cold start modelling work, this creates a precise, version-controlled record of the optimisation journey.
When I hand an architecture to a team for implementation — or when a new engineer joins and needs to understand why the architecture is configured the way it is — the simulation history is the evidence. It is a design artefact that carries its own rationale, which is a capability that no draw.io diagram or Lucidchart export has ever provided.
Deploying the Validated Architecture
Once the architecture passes the spike simulation with all WARNING-level recommendations addressed and p99 latency within the target budget, pinpole's deploy-to-cloud workflow takes over.
The deployment uses a secure STS cross-account IAM workflow. pinpole does not store credentials — the integration is established through a one-time IAM role configuration in the target AWS account, and each deployment uses short-lived STS tokens. This is the right security model; I would not adopt a deployment tool that stored long-lived AWS credentials.
The recommended promotion sequence is Canvas → ST (System Test) → UAT → PR (Production). I do not deploy directly from canvas to production. The ST and UAT stages confirm that the architecture behaves correctly in a real AWS account — with real Lambda cold starts, real DynamoDB latency, real API Gateway throttle enforcement — before production traffic is at risk.
A Note on What Competitors Do and Do Not Offer
The absence of pre-deployment traffic simulation is not an oversight in competing tools — it reflects a genuinely hard engineering problem. Simulating how a Lambda function behaves under spike traffic without deploying it requires a model that accounts for runtime, memory, initialisation code, invocation model, concurrency pool dynamics, and the interaction between provisioned and on-demand concurrency.
| Tool | Visual Design | Cost Est. | Spike Simulation | Pre-Deploy | Verdict |
|---|---|---|---|---|---|
| Cloudcraft (Datadog) | ✓ | ✓ | ✗ | ✗ | Excellent diagrams, no traffic modelling |
| Brainboard | ✓ | Hints only | ✗ | ✓ | IaC generation, zero simulation capability |
| System Initiative | ✓ | ✗ | ✗ | ✓ | Wiring correctness, not throughput or latency |
| AWS Infrastructure Composer | ✓ | ✗ | ✗ | ✗ | 1,134+ resource types, no performance modelling |
| k6 / Gatling / JMeter | ✗ | ✗ | Post-deploy only | ✗ | Excellent post-deploy tools, not pre-deploy |
| pinpole | ✓ | ✓ (live) | ✓ Pre-deploy | ✓ | The only tool that simulates spike traffic pre-deploy |
The Broader Discipline: Simulation as Engineering Standard
Cold start modelling is the use case that motivated me to adopt pinpole, but it is not the only simulation I run. The workflow generalises across every architectural question that involves load-dependent behaviour. DynamoDB hot partition risk is not visible under steady-state load — it appears under spike traffic when a campaign drives thousands of writes per second to a poorly chosen partition key. API Gateway throttle limits are not hit at daily average load — they are hit at the peak of a content launch.
Every architectural decision that has load-dependent consequences should be validated under simulated spike traffic before deployment, not after. The cost of discovering a cold start problem in simulation is zero. The cost of discovering it in production — incident response time, engineer weekend hours, customer experience degradation — is substantial.
Summary: The Spike Simulation Checklist
For engineers who want to apply this workflow to their own Lambda architectures, the sequence I follow on every new architecture:
- Configure Lambda correctly first. Set the actual runtime, memory allocation, and explicit reserved concurrency before running any simulation. Leave provisioned concurrency at zero for the baseline run.
- Run Constant first, then Spike. Constant at expected daily load as a sanity check. Spike at 10× daily load as the meaningful test. Watch Lambda latency during the spike phase, not just at steady state.
- Request AI recommendations after the spike run. Address WARNING items first, in order: CloudFront positioning, circuit breaker pattern, API Gateway throttle configuration. Then address INFO items: provisioned concurrency, SQS async decoupling, Lambda auto-scaling.
- Stop and restart between significant changes. After each recommendation that changes concurrency configuration, stop the simulation fully and restart. Do not resume a paused run.
- Use the Cloud Terminal to verify. Query Lambda service state mid-simulation when the node metrics alone do not fully explain what you are observing.
- Use Execution History to document. The Version Workflow Viewer is the record you will want when explaining why the architecture is configured the way it is.
- Deploy through ST and UAT before production. The simulation is a high-confidence pre-deployment filter, not a substitute for real infrastructure validation.
The Saturday incident that opened this post was the last time I discovered a cold start problem in production.
The workflow described here is why. Run a Spike simulation on an architecture you think is ready to deploy — the first time you watch Lambda latency spike to two seconds, you will understand why this belongs at the beginning of infrastructure delivery, not the end.
Senior AWS Solutions Architect at a growth-stage technology company. AWS Solutions Architect — Professional. Focuses on serverless architecture design, infrastructure cost optimisation, and pre-deployment simulation as a standard engineering practice.
Tags: AWS · Lambda · Cold Starts · Serverless · Spike Traffic · Shift-Left · pinpole · FinOps
This post reflects the author's independent experience using pinpole in production architecture work. A 14-day free trial with full feature access — including Spike, Ramp, and Wave traffic patterns, AI recommendations, execution history, and deploy-to-cloud — requires no credit card to start.