Problem
When your server needs to handle 10,000 concurrent connections, you must choose a concurrency model. The wrong choice wastes CPU, memory, and latency budget. The right choice depends on your workload type (CPU-bound vs I/O-bound) and language runtime.
Common question: "Should I use threads, async/await, or an event loop?"
Answer: "It depends on whether you're CPU-bound or I/O-bound, and what your language's concurrency primitives actually do."
The three models are not competing alternatives; they are complementary tools with different cost structures.
---
Why It Matters (Latency, Throughput, Cost)
The C10K problem (1999, still relevant):
In 1999, Dan Kegel identified that serving 10,000 concurrent clients with thread-per-connection required 10,000 OS threads. At 1–8MB per thread stack: 10GB–80GB of RAM just for stacks. Context switching among 10,000 threads burns thousands of microseconds of CPU per scheduling round.
Modern event-loop and async solutions serve 100,000+ concurrent connections on a single core with megabytes of RAM.
Cost comparison at 10,000 connections:
Thread-per-connection (10,000 threads):
Stack memory: 10,000 × 1MB = 10GB
Context switches: ~50,000/second × 10μs = 500ms CPU/second (50% of a core!)
Async/coroutine (10,000 tasks):
Stack equivalent: 10,000 × 2KB = 20MB (500× less)
Context switches: ~50,000/second × 0.1μs = 5ms CPU/second (0.5% of a core)
Savings: 500× less memory, 100× less CPU overhead for I/O-bound workloads.

---
Mental Model
Threading (OS kernel manages):
┌─────────────────────────────────────┐
│             OS Scheduler            │
│  ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐ │
│  │Thr1 │  │Thr2 │  │Thr3 │  │Thr4 │ │
│  │(run)│  │(wait│  │(wait│  │(run)│ │
│  │     │  │ I/O)│  │lock)│  │     │ │
│  └─────┘  └─────┘  └─────┘  └─────┘ │
│   CPU1                      CPU2    │
└─────────────────────────────────────┘
Preemptive: the OS can interrupt any thread at any time
Async/Event Loop (application manages):
┌─────────────────────────────────────┐
│     Event Loop (single thread)      │
│                                     │
│  event_queue: [sock1_readable,      │
│                sock3_writable,      │
│                timer_expired]       │
│                                     │
│  while True:                        │
│      event = poll_io()  # epoll()   │
│      callback = handlers[event]     │
│      callback()         # runs fast │
└─────────────────────────────────────┘
Cooperative: code explicitly yields control

---
Underlying Theory (OS / CN / DSA / Math Linkage)
OS Threads: Full Stack Machine
An OS thread is a complete CPU execution context:
- Stack: 1–8MB per thread (default on Linux: 8MB; reducible with `ulimit -s` or `pthread_attr_setstacksize`)
- Registers: RSP (stack pointer), RIP (instruction pointer), 16 general-purpose registers
- TLB entries: the thread's code references virtual addresses; the TLB caches virtual→physical mappings
- Kernel resources: Thread Control Block (TCB) in kernel space
Context switch cost:
Context switch steps:
1. Save current thread's registers to TCB (~50 ns)
2. Load next thread's registers from TCB (~50 ns)
3. TLB flush (if switching processes, and sometimes even within one) (~200 ns)
4. Cache warmup for the new thread's working set (~500 ns – 2μs)
─────────────────────────────────────────────────────────────────
Total: 1–10μs per context switch
At 10,000 threads each switching every 1ms:
Switches/second = 10,000 × 1,000 = 10,000,000
CPU time = 10,000,000 × 10μs = 100 seconds of CPU per wall-clock second
(impossible; this is why thread-per-connection doesn't scale)

Green Threads / Goroutines: User-Space Scheduler
Go multiplexes M goroutines onto N OS threads (M:N scheduling):
Go runtime scheduler:
┌──────────────────────────────────────────────┐
│  M1 (OS thread)        M2 (OS thread)        │
│  ┌─────────────┐       ┌─────────────┐       │
│  │ P1 (proc)   │       │ P2 (proc)   │       │
│  │ run queue:  │       │ run queue:  │       │
│  │ [G3,G5,G7]  │       │ [G2,G4,G6]  │       │
│  │ currently:  │       │ currently:  │       │
│  │ G1          │       │ G8          │       │
│  └─────────────┘       └─────────────┘       │
│                                              │
│  Global run queue: [G9, G10, G11, ...]       │
│                                              │
│  Work stealing: P1 steals from P2 when idle  │
└──────────────────────────────────────────────┘
- Goroutine stack: starts at 2KB, grows on demand up to 1GB (contiguous stacks, grown by copying since Go 1.3)
- Goroutine switch cost: ~100–200ns (no TLB flush, no kernel involvement)
- GOMAXPROCS: number of OS threads running Go code; defaults to the number of CPU cores
Async/Await: Cooperative Coroutines
Async functions are compiled into state machines; `await` marks a yield point:
# This Python function:
async def fetch_user(user_id):
    conn = await db.acquire()        # yield point 1
    row = await conn.fetchrow(...)   # yield point 2
    await conn.release()             # yield point 3
    return row

# Is approximately equivalent to this state machine:
class FetchUserStateMachine:
    def __init__(self, user_id):
        self.user_id = user_id
        self.state = 0

    def send(self, value):
        if self.state == 0:
            self.conn_future = db.acquire()
            self.state = 1
            return self.conn_future  # suspend, return future to event loop
        elif self.state == 1:
            self.conn = value        # resumed with connection
            self.row_future = self.conn.fetchrow(...)
            self.state = 2
            return self.row_future   # suspend again
        elif self.state == 2:
            self.row = value
            # ... continue

No new OS thread is created. The coroutine is resumed by the event loop when the awaited I/O completes.
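The suspend/resume protocol can be exercised directly with a generator-based coroutine, which is how Python's async machinery worked before native coroutines. This is a minimal sketch; `FakeFuture`, `drive`, and the canned I/O results are illustrative, not from any library:

```python
class FakeFuture:
    """Stand-in for an awaitable I/O result."""
    def __init__(self, label):
        self.label = label

def fetch_user(user_id):
    """Generator version: each yield hands a 'future' to the driver."""
    conn = yield FakeFuture("acquire")            # suspend until resumed
    row = yield FakeFuture(f"fetchrow:{user_id}")  # suspend again
    return (conn, row)

def drive(coro, io_results):
    """Minimal 'event loop': resume the coroutine with each completed I/O."""
    value = None
    try:
        while True:
            future = coro.send(value)          # run to the next yield point
            value = io_results[future.label]   # pretend the I/O completed
    except StopIteration as done:
        return done.value                      # the coroutine's return value

result = drive(fetch_user(7), {"acquire": "conn-1", "fetchrow:7": {"id": 7}})
print(result)   # ('conn-1', {'id': 7})
```

`drive` plays the event loop's role: it never spawns a thread, it just calls `send` each time the "I/O" it was handed completes.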
Event Loop: epoll/kqueue Under the Hood
The event loop uses OS-level I/O multiplexing to monitor thousands of sockets with a single syscall:
┌──────────────────────────────────────────────────────────────────┐
│  epoll (Linux) / kqueue (macOS/BSD)                              │
│                                                                  │
│  epoll_create(): create epoll interest list                      │
│  epoll_ctl(fd, EPOLL_CTL_ADD, events): register socket           │
│  epoll_wait(timeout): block until any registered fd is ready     │
│    → returns the list of ready file descriptors in O(ready_fds), │
│      NOT O(total_fds); this is the key advantage over select()   │
└──────────────────────────────────────────────────────────────────┘
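Python's `selectors` module wraps exactly this interface (epoll on Linux, kqueue on macOS/BSD). A self-contained register/poll/dispatch round trip over a socketpair might look like this sketch; `run_echo_round` and the callback are illustrative names, and a real server would register sockets from `accept()` instead:

```python
import selectors
import socket

def run_echo_round(payload: bytes) -> bytes:
    """One event-loop iteration: register a socket, poll, dispatch callback."""
    sel = selectors.DefaultSelector()   # epoll on Linux, kqueue on macOS/BSD
    server_side, client_side = socket.socketpair()

    def echo(conn):                     # callback for EVENT_READ
        conn.sendall(conn.recv(1024))

    sel.register(server_side, selectors.EVENT_READ, echo)
    client_side.sendall(payload)

    # Block until a registered fd is ready, then run its callback.
    # select() returns only the ready fds: O(ready), not O(registered).
    for key, _events in sel.select(timeout=1):
        key.data(key.fileobj)

    reply = client_side.recv(1024)
    sel.close(); server_side.close(); client_side.close()
    return reply

print(run_echo_round(b"ping"))   # b'ping'
```

The `key.data` slot carries the callback, so the loop body is exactly the `handlers[event]` dispatch from the mental-model diagram above.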
Node.js libuv event loop phases:
┌──────────────────────────────────────────┐
│  ┌───────────────────────────────┐       │
│  │ timers                        │ ←──── setTimeout, setInterval callbacks
│  └───────────────┬───────────────┘       │
│                  ▼                       │
│  ┌───────────────────────────────┐       │
│  │ pending callbacks             │ ←──── I/O callbacks deferred to next loop
│  └───────────────┬───────────────┘       │
│                  ▼                       │
│  ┌───────────────────────────────┐       │
│  │ poll (epoll_wait)             │ ←──── wait for I/O events, run callbacks
│  └───────────────┬───────────────┘       │
│                  ▼                       │
│  ┌───────────────────────────────┐       │
│  │ check                         │ ←──── setImmediate callbacks
│  └───────────────┬───────────────┘       │
│                  ▼                       │
│  ┌───────────────────────────────┐       │
│  │ close callbacks               │ ←──── socket.on('close', ...)
│  └───────────────────────────────┘       │
└──────────────────────────────────────────┘
Between each phase: process ALL microtasks (Promise.then, queueMicrotask)

The GIL (CPython Global Interpreter Lock)
CPython's GIL is a mutex that allows only one thread to execute Python bytecode at a time:
Thread 1: ──[acquire GIL]──[execute]──[release GIL]──[wait]────────────────
Thread 2: ──[wait]─────────────────────────────────[acquire GIL]──[execute]
Thread 3: ──[wait]───────────────────────────────────────────────[wait]────
Result: Python threads do NOT parallelize CPU-bound code.
Two Python threads on a quad-core CPU = 1 core effectively used.
Exception: I/O-bound work DOES release the GIL:
Thread calls recv() → releases GIL → OS performs I/O → thread reacquires GIL
Multiple threads CAN overlap on I/O, just not on CPU.
Escape hatches:
- `multiprocessing` module: separate processes, each with its own GIL
- C extensions (NumPy, etc.) can release the GIL during computation
- Free-threaded CPython (PEP 703): an experimental no-GIL build, shipping as of Python 3.13
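The GIL's effect and the `multiprocessing` escape hatch are easy to measure. A sketch, with assumed names (`busy`, `timed`) and a workload size you would tune to your machine:

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def busy(n: int) -> int:
    """Pure-Python CPU work: holds the GIL for the whole loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(executor_cls, workers: int, n: int):
    """Run `workers` copies of busy(n) in the given pool; return (seconds, results)."""
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as ex:
        results = list(ex.map(busy, [n] * workers))
    return time.perf_counter() - start, results

if __name__ == "__main__":
    N = 2_000_000
    t_thr, _ = timed(ThreadPoolExecutor, 4, N)    # serialized by the GIL
    t_proc, _ = timed(ProcessPoolExecutor, 4, N)  # one GIL per process
    print(f"4 threads:   {t_thr:.2f}s")
    print(f"4 processes: {t_proc:.2f}s (expect a clear speedup on a multi-core box)")
```

On a stock CPython build the threaded run shows no speedup over sequential execution; the process pool scales with cores because each worker has its own interpreter and GIL.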
---
When Threads Win
- CPU-bound work: image processing, compression, cryptography. Use `multiprocessing` in Python (bypasses the GIL), goroutines in Go, worker threads in Node.js.
- Blocking syscalls you can't avoid: some legacy libraries only offer blocking APIs. Wrap them in a thread pool.
- Simple synchronization requirements: a fixed pool of 4 workers processing a queue is simpler with threads than with async.
- Shared mutable state: locks are simpler to reason about than async coordination (though both are hard to get right).
When Async Wins
- I/O-bound, high concurrency: HTTP servers, microservices, API gateways.
- Many idle connections: WebSocket servers, long-polling, chat applications.
- Low memory budget: embedded systems, resource-constrained environments.
- Low tail latency: no context switch overhead means more predictable p99.
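A quick way to feel the "many idle connections" case: thousands of concurrent waits on one thread. A sketch (the task count and sleep are illustrative):

```python
import asyncio
import time

async def fake_request(i: int) -> int:
    await asyncio.sleep(0.1)  # simulated 100ms of network I/O
    return i

async def run_all(n: int) -> float:
    """Launch n concurrent 'requests'; return total elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_request(i) for i in range(n)))
    return time.perf_counter() - start

# Sequentially, 10,000 × 100ms would take ~1,000 seconds. Concurrently the
# waits overlap, so the total is a small multiple of one 0.1s round trip.
elapsed = asyncio.run(run_all(10_000))
print(f"{elapsed:.2f}s for 10,000 concurrent 100ms waits")
```

All 10,000 tasks are parked in the event loop while they wait; no threads are created and per-task memory stays in the kilobyte range.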
---
Naive Approach – Thread per Connection
import threading
import socket
import time

def handle_client(conn, addr):
    """Each client gets its own OS thread."""
    data = conn.recv(1024)
    time.sleep(0.01)  # simulate I/O work
    conn.send(b"HTTP/1.1 200 OK\r\n\r\nHello")
    conn.close()

def run_threaded_server(port=8080):
    sock = socket.socket()
    sock.bind(('', port))
    sock.listen(128)
    while True:
        conn, addr = sock.accept()
        # NEW THREAD PER CONNECTION -- does not scale
        t = threading.Thread(target=handle_client, args=(conn, addr))
        t.daemon = True
        t.start()

At 10,000 concurrent clients: 10,000 threads × 8MB stack = 80GB of (virtual) stack space. Crash.
---
Optimized Approach
Python – asyncio (event loop, single thread)

import asyncio

async def handle_client(reader: asyncio.StreamReader, writer: asyncio.StreamWriter):
    """Single thread, async I/O -- handles thousands of connections."""
    data = await reader.read(1024)  # yields to event loop (non-blocking)
    await asyncio.sleep(0.01)       # yields to event loop (not thread sleep!)
    writer.write(b"HTTP/1.1 200 OK\r\n\r\nHello")
    await writer.drain()
    writer.close()

async def main():
    server = await asyncio.start_server(handle_client, '0.0.0.0', 8080)
    async with server:
        await server.serve_forever()

# 10,000 concurrent connections:
# Memory: 10,000 × ~2KB coroutine state = 20MB
# CPU: near zero (blocked in epoll_wait 99% of the time)
asyncio.run(main())

asyncio with a thread pool for CPU work:
import asyncio
from concurrent.futures import ProcessPoolExecutor

executor = ProcessPoolExecutor(max_workers=4)  # one per CPU core

def cpu_intensive(data: bytes) -> bytes:
    """CPU-bound work: runs in a separate process, not blocking the event loop."""
    import hashlib
    return hashlib.sha256(data).digest()

async def handle_request(data: bytes) -> bytes:
    loop = asyncio.get_event_loop()
    # run_in_executor: submits to the process pool, awaits the result without blocking the loop
    result = await loop.run_in_executor(executor, cpu_intensive, data)
    return result

Go – Goroutines + Channels
package main

import (
    "net"
    "sync"
    "time"
)

func handleConn(conn net.Conn, wg *sync.WaitGroup) {
    defer wg.Done()
    defer conn.Close()
    // Goroutine: 2KB stack, cheap to create
    buf := make([]byte, 1024)
    conn.Read(buf)
    time.Sleep(10 * time.Millisecond) // simulate I/O
    conn.Write([]byte("HTTP/1.1 200 OK\r\n\r\nHello"))
}

func main() {
    ln, _ := net.Listen("tcp", ":8080")
    var wg sync.WaitGroup
    // 10,000 connections: 10,000 × 2KB = 20MB
    // Go's runtime schedules them on GOMAXPROCS OS threads
    for {
        conn, _ := ln.Accept()
        wg.Add(1)
        go handleConn(conn, &wg) // one goroutine per connection -- cheap!
    }
}

Go is unique: goroutines give you the "one goroutine per connection" simplicity of thread-per-connection with async-level memory efficiency. The runtime does the multiplexing.
Worker pool with goroutines:
func main() {
    jobs := make(chan Job, 1000)
    results := make(chan Result, 1000)

    // Start a fixed worker pool (bounded goroutines)
    for w := 0; w < 10; w++ {
        go func() {
            for job := range jobs {
                results <- process(job)
            }
        }()
    }

    // Submit jobs
    for _, job := range allJobs {
        jobs <- job
    }
    close(jobs)
}

Node.js – async/await + EventEmitter
const net = require('net');

// Single-threaded event loop handles all connections
const server = net.createServer((socket) => {
  socket.on('data', async (data) => {
    // Non-blocking: returns to the event loop while waiting
    await new Promise(resolve => setTimeout(resolve, 10)); // simulate I/O
    socket.write('HTTP/1.1 200 OK\r\n\r\nHello');
    socket.end();
  });
});
server.listen(8080);

// Worker threads for CPU-bound work (Node.js 10.5+)
const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');

if (isMainThread) {
  // Main event loop thread
  async function runCpuWork(data) {
    return new Promise((resolve, reject) => {
      const worker = new Worker(__filename, { workerData: data });
      worker.on('message', resolve);
      worker.on('error', reject);
    });
  }
} else {
  // Worker thread -- can do blocking CPU work safely
  const result = heavyComputation(workerData);
  parentPort.postMessage(result);
}

Never block the Node.js event loop:
// BAD: synchronous computation blocks all connections
app.get('/hash', (req, res) => {
  const hash = require('crypto')
    .createHash('sha256')
    .update(req.body.data.repeat(100000)) // blocks for 500ms!
    .digest('hex');
  res.json({ hash });
});

// GOOD: offload to a worker thread
const { Worker } = require('worker_threads');
app.get('/hash', async (req, res) => {
  const hash = await runInWorker(req.body.data); // event loop stays free
  res.json({ hash });
});

---
Complexity Analysis
| Model | Memory per connection | Context switch | Max concurrent (practical) |
|---|---|---|---|
| OS Thread | 1–8MB stack | 1–10μs + TLB flush | ~10,000 |
| Goroutine | 2KB, grows on demand | ~100–200ns | millions |
| async/await | ~2KB coroutine state | ~100ns | millions |
| Event loop (callbacks) | ~1KB | ~50ns | millions |
Time complexity for N concurrent I/O tasks:
- Thread-per-connection: O(N) OS threads, O(N × switch_cost) scheduling overhead
- Goroutine: O(N/GOMAXPROCS) switches per second per thread; scales with CPU count
- Async: O(1) OS threads (approximately), O(N) task state, O(ready_events) per loop iteration
---
Benchmark (p50, p99, CPU, Memory)
Setup: 10,000 concurrent clients, each holding an HTTP connection with 10ms simulated I/O latency.
┌─────────────────────────┬────────┬────────┬──────────┬───────────┐
│ Model                   │ p50    │ p99    │ Memory   │ CPU (1req)│
├─────────────────────────┼────────┼────────┼──────────┼───────────┤
│ Python threads (GIL)    │ 14ms   │ 95ms   │ 10GB+    │ 2ms       │
│ Python asyncio          │ 11ms   │ 18ms   │ 25MB     │ 0.3ms     │
│ Go goroutines           │ 11ms   │ 15ms   │ 22MB     │ 0.2ms     │
│ Node.js async           │ 11ms   │ 17ms   │ 80MB     │ 0.5ms     │
└─────────────────────────┴────────┴────────┴──────────┴───────────┘

For CPU-bound work (hash computation) on 10,000 requests:

┌─────────────────────────┬────────┬────────┬──────────┬───────────┐
│ Model                   │ p50    │ p99    │ Memory   │ CPU cores │
├─────────────────────────┼────────┼────────┼──────────┼───────────┤
│ Python asyncio (single) │ 120ms  │ 200ms  │ 25MB     │ 1 (GIL)   │
│ Python multiprocessing  │ 35ms   │ 55ms   │ 400MB    │ 4         │
│ Go goroutines           │ 28ms   │ 45ms   │ 22MB     │ 4         │
│ Node.js worker_threads  │ 32ms   │ 50ms   │ 200MB    │ 4         │
└─────────────────────────┴────────┴────────┴──────────┴───────────┘

---
Observability
Thread pool metrics
from concurrent.futures import ThreadPoolExecutor
from prometheus_client import Gauge

thread_pool_active = Gauge('thread_pool_active_threads', 'Active threads')
thread_pool_queue = Gauge('thread_pool_queued_tasks', 'Queued tasks')

class InstrumentedExecutor(ThreadPoolExecutor):
    def submit(self, fn, *args, **kwargs):
        thread_pool_queue.inc()
        future = super().submit(fn, *args, **kwargs)
        future.add_done_callback(lambda _: thread_pool_queue.dec())
        return future

Event loop lag (Node.js)
// Event loop lag: time between when a callback was scheduled and when it ran
let lastTick = Date.now();
setInterval(() => {
  const now = Date.now();
  const lag = now - lastTick - 100; // expected 100ms interval
  prometheus.gauge('event_loop_lag_ms').set(lag);
  lastTick = now;
}, 100);
// ALERT: event loop lag > 50ms means the loop is being blocked

Goroutine leak detection (Go)
import (
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/goroutine on the default mux
    "runtime"
)

// Serve the pprof endpoints (run as `go servePprof()` in main)
func servePprof() {
    http.ListenAndServe("localhost:6060", nil)
}

// Expose goroutine count as a metric
func goroutineCount() int {
    return runtime.NumGoroutine()
}

// Alert: goroutine count growing without bound means a leak
// Normal: stable count proportional to active connections

---
Failure Modes
1. Blocking the event loop (Node.js / Python asyncio):
// BAD: this blocks Node.js for 5 seconds -- ALL other requests stall
app.get('/slow', (req, res) => {
  const start = Date.now();
  while (Date.now() - start < 5000) {} // CPU spin -- NEVER do this
  res.send('done');
});
// How to detect: event loop lag metric spikes
// How to fix: worker threads, chunking the work with setImmediate, or restructuring the algorithm

2. Goroutine leak (Go):
// BAD: goroutine blocked forever on a channel nobody will send to
func leaky() {
    ch := make(chan int)
    go func() {
        val := <-ch // blocks forever if nobody sends
        fmt.Println(val)
    }()
    // Function returns, ch goes out of scope, goroutine is stuck
}

// GOOD: use context for cancellation
func safe(ctx context.Context) {
    ch := make(chan int)
    go func() {
        select {
        case val := <-ch:
            fmt.Println(val)
        case <-ctx.Done():
            return // goroutine exits cleanly
        }
    }()
}

3. Thread starvation:
When all threads in a pool are waiting for I/O, and no threads are available for new requests:
# Starvation scenario:
executor = ThreadPoolExecutor(max_workers=10)

async def handler():
    loop = asyncio.get_running_loop()
    # Submits a synchronous DB call to the thread pool
    result = await loop.run_in_executor(executor, blocking_db_call)

# If blocking_db_call takes 30s and there are 10 concurrent requests:
# all 10 executor threads are occupied, so the 11th request waits indefinitely

Fix: separate thread pools for I/O-bound and CPU-bound work, and set timeouts.
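That fix might look like the following sketch. `blocking_db_call`, the pool sizes, and the timeout are illustrative; note that `wait_for` frees the caller but cannot stop the underlying thread, which still runs to completion:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Separate pools: slow blocking I/O can exhaust io_pool without
# starving CPU-ish tasks queued on cpu_pool, and vice versa.
io_pool = ThreadPoolExecutor(max_workers=20, thread_name_prefix="io")
cpu_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="cpu")

async def call_blocking(pool, fn, *args, timeout=5.0):
    """Run a blocking callable in `pool`, failing fast instead of waiting forever."""
    loop = asyncio.get_running_loop()
    return await asyncio.wait_for(loop.run_in_executor(pool, fn, *args), timeout)

def blocking_db_call(query: str) -> str:  # stand-in for a sync DB driver
    return f"rows for {query}"

async def handler() -> str:
    return await call_blocking(io_pool, blocking_db_call, "SELECT 1", timeout=2.0)

print(asyncio.run(handler()))   # rows for SELECT 1
```

A timed-out call raises `TimeoutError` in the handler, so the request fails fast instead of queueing behind a stuck pool.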
4. Priority inversion:
A high-priority task blocks on a mutex held by a low-priority task, which in turn cannot run because medium-priority tasks monopolize the CPU. Go's goroutines have no user-visible priorities, but a related hazard existed: before Go 1.14, a goroutine in a tight loop could not be preempted and could starve the scheduler.
Go 1.14+ adds asynchronous preemption, which fixes the tight-loop starvation case.
5. Python GIL + threads CPU-bound misuse:
# TRAP: looks parallel, isn't
import threading

def compute(n):
    for i in range(n):
        i ** 2  # pure Python -- the GIL prevents true parallelism

threads = [threading.Thread(target=compute, args=(10_000_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()

# No faster than running the four calls back to back -- and GIL contention
# often makes it measurably slower. Use multiprocessing.Pool instead.

---
When NOT to Use Async
- CPU-bound Python code: async doesn't help (still GIL-bound). Use `multiprocessing`.
- Simple scripts with 1–5 operations: async adds complexity without benefit.
- Blocking-only libraries: some Python libraries (e.g., certain DB drivers, boto3) are synchronous only. Running them in asyncio without `run_in_executor` will block the event loop.
- Debugging: async stack traces are harder to read. Callbacks and coroutines fragment the call stack.
- Teams unfamiliar with async: async bugs (a forgotten `await`, accidental blocking) are subtle and unsafe in production without team experience.
---
Decision Matrix
             I/O-bound           CPU-bound
          ┌─────────────────┬──────────────────────┐
Python    │ asyncio ✓       │ multiprocessing ✓    │
          │ (threads OK     │ (NOT threads:        │
          │  for legacy)    │  GIL kills you)      │
          ├─────────────────┼──────────────────────┤
Go        │ goroutines ✓    │ goroutines ✓         │
          │                 │ (GOMAXPROCS=ncpu)    │
          ├─────────────────┼──────────────────────┤
Node.js   │ async/await ✓   │ worker_threads ✓     │
          │                 │ (NOT main thread)    │
          └─────────────────┴──────────────────────┘

---
Lab
See ../../benchmarks/01-thread-vs-async-vs-event-loop/README.md for a complete benchmark comparing:
- Python threading vs asyncio for 1,000 concurrent I/O tasks
- Go goroutines for the same workload
- Node.js async for the same workload
The benchmark measures p50, p99, memory usage, and CPU utilization.
---
Key Takeaways
- OS threads: 1–8MB each, 1–10μs context switch. Good for CPU work and legacy blocking code.
- Goroutines: 2KB each, ~100ns switch, M:N scheduling. Go's sweet spot: simple code, async performance.
- async/await: ~2KB state machine, no kernel context switch for scheduling. Best for I/O-bound, high-concurrency Python and Node.js.
- Python GIL: threads cannot parallelize CPU work in CPython. Use `multiprocessing` for CPU. Use `asyncio` for I/O.
- Go's goroutines give you both: one goroutine per connection (thread-per-connection simplicity) at async memory costs.
- Never block the event loop: any synchronous work > 1ms in Node.js or Python asyncio stalls all other connections.
- Goroutine leaks grow linearly; detect them via a `runtime.NumGoroutine()` metric. Always pass a `context.Context` for cancellation.
- The decision tree: I/O-bound → async or goroutines. CPU-bound → goroutines (Go), worker threads (Node), processes (Python).