Threading vs Async vs Event Loop

Problem

When your server needs to handle 10,000 concurrent connections, you must choose a concurrency model. The wrong choice wastes CPU, memory, and latency budget. The right choice depends on your workload type (CPU-bound vs I/O-bound) and language runtime.

Common question: "Should I use threads, async/await, or an event loop?"
Answer:          "It depends on whether you're CPU-bound or I/O-bound,
                  and what your language's concurrency primitives actually do."

The three models are not competing alternatives; they are complementary tools with different cost structures.

---

Why It Matters (Latency, Throughput, Cost)

The C10K problem (1999, still relevant):

In 1999, Dan Kegel identified that serving 10,000 concurrent clients with thread-per-connection required 10,000 OS threads. At 1–8MB per thread stack, that is 10GB–80GB of RAM for stacks alone, and context switching among 10,000 threads burns a large share of every CPU second before any request work gets done.

Modern event-loop and async solutions serve 100,000+ concurrent connections on a single core with megabytes of RAM.

Cost comparison at 10,000 connections:

Thread-per-connection (10,000 threads):
  Stack memory: 10,000 × 1MB = 10GB
  Context switches: ~50,000/second × 10μs = 500ms CPU/second (50% of a core!)

Async/coroutine (10,000 tasks):
  Stack equivalent: 10,000 × 2KB = 20MB (500× less)
  Context switches: ~50,000/second × 0.1μs = 5ms CPU/second (0.5% of a core)

Savings: 500× less memory, 100× less CPU overhead for I/O-bound workloads.

---

Mental Model

Threading (OS kernel manages):
  ┌────────────────────────────────────┐
  │  OS Scheduler                      │
  │  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐   │
  │  │Thr1 │ │Thr2 │ │Thr3 │ │Thr4 │   │
  │  │(run)│ │(wait│ │(wait│ │(run)│   │
  │  │     │ │ I/O)│ │lock)│ │     │   │
  │  └─────┘ └─────┘ └─────┘ └─────┘   │
  │     CPU1           CPU2            │
  └────────────────────────────────────┘
  Preemptive: OS can interrupt any thread at any time

Async/Event Loop (application manages):
  ┌─────────────────────────────────────┐
  │  Event Loop (single thread)         │
  │                                     │
  │  event_queue: [sock1_readable,      │
  │                sock3_writable,      │
  │                timer_expired]       │
  │                                     │
  │  while True:                        │
  │    event = poll_io()    # epoll()   │
  │    callback = handlers[event]       │
  │    callback()           # runs fast │
  └─────────────────────────────────────┘
  Cooperative: code explicitly yields control

---

Underlying Theory (OS / CN / DSA / Math Linkage)

OS Threads: Full Stack Machine

An OS thread is a complete CPU execution context:

  • Stack: 1–8MB per thread (default: Linux 8MB, can be reduced with ulimit -s or pthread_attr_setstacksize)
  • Registers: RSP (stack pointer), RIP (instruction pointer), 16 general-purpose registers
  • TLB entries: Thread's code references virtual addresses; TLB caches virtualβ†’physical mappings
  • Kernel resources: Thread Control Block (TCB) in kernel space

Context switch cost:

Context switch steps:
  1. Save current thread's registers to TCB              (~50 ns)
  2. Load next thread's registers from TCB               (~50 ns)
  3. TLB flush (if different process, or sometimes same) (~200 ns)
  4. Cache warmup for new thread's working set           (~500 ns - 2μs)
  ─────────────────────────────────────────────────────────────────
  Total: 1-10μs per context switch

At 10,000 threads switching at 1ms intervals:
  Switches/second = 10,000 × 1,000 = 10,000,000
  CPU time = 10,000,000 × 10μs = 100 seconds of CPU/second
  (impossible; this is why thread-per-connection doesn't scale)
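The per-switch cost is easy to approximate on your own machine. The sketch below ping-pongs control between two Python threads using Event objects; each round trip forces roughly two OS-level switches. Treat the result as an upper bound, since Event overhead and GIL handoff are included in the measurement:

```python
import threading
import time

def measure_pingpong(rounds=10_000):
    """Bounce control between two threads; each round forces ~2 context switches."""
    ping, pong = threading.Event(), threading.Event()
    count = 0

    def worker():
        nonlocal count
        for _ in range(rounds):
            ping.wait()       # sleep until the main thread signals us
            ping.clear()
            count += 1
            pong.set()        # hand control back

    t = threading.Thread(target=worker)
    t.start()
    start = time.perf_counter()
    for _ in range(rounds):
        ping.set()
        pong.wait()
        pong.clear()
    elapsed = time.perf_counter() - start
    t.join()
    # Two switches per round, so this approximates one switch (plus Event overhead)
    return count, elapsed / (2 * rounds)

count, per_switch = measure_pingpong()
print(f"~{per_switch * 1e6:.1f}us per switch")
```

On a typical Linux box this lands in the single-digit microsecond range, consistent with the 1-10μs figure above.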

Green Threads / Goroutines: User-Space Scheduler

Go multiplexes M goroutines onto N OS threads (M:N scheduling):

Go runtime scheduler:
  ┌─────────────────────────────────────────────┐
  │  M1 (OS thread)    M2 (OS thread)           │
  │  ┌─────────────┐  ┌─────────────┐           │
  │  │  P1 (proc)  │  │  P2 (proc)  │           │
  │  │  run queue: │  │  run queue: │           │
  │  │  [G3,G5,G7] │  │  [G2,G4,G6] │           │
  │  │  currently: │  │  currently: │           │
  │  │     G1      │  │     G8      │           │
  │  └─────────────┘  └─────────────┘           │
  │                                             │
  │  Global run queue: [G9, G10, G11, ...]      │
  │                                             │
  │  Work stealing: P1 steals from P2 when idle │
  └─────────────────────────────────────────────┘
  • Goroutine stack: starts at 2KB, grows on demand up to 1GB (stack copying/segmented stacks)
  • Goroutine switch cost: ~100–200ns (no TLB flush, no kernel involvement)
  • GOMAXPROCS: number of OS threads = number of CPU cores by default

Async/Await: Cooperative Coroutines

Async functions are compiled into state machines. await marks a yield point where the coroutine can suspend and later resume:

# This Python function:
async def fetch_user(user_id):
    conn = await db.acquire()        # yield point 1
    row = await conn.fetchrow(...)   # yield point 2
    await conn.release()             # yield point 3
    return row

# Is approximately equivalent to this state machine:
class FetchUserStateMachine:
    def __init__(self, user_id):
        self.user_id = user_id
        self.state = 0

    def send(self, value):
        if self.state == 0:
            self.conn_future = db.acquire()
            self.state = 1
            return self.conn_future   # suspend, return future to event loop
        elif self.state == 1:
            self.conn = value          # resumed with connection
            self.row_future = self.conn.fetchrow(...)
            self.state = 2
            return self.row_future    # suspend again
        elif self.state == 2:
            self.row = value
            # ... continue

No new OS thread is created. The coroutine is resumed by the event loop when the awaited I/O completes.
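You can watch this suspend/resume protocol directly by driving a coroutine by hand, with no event loop at all. FakeFuture below is a hypothetical stand-in for a real future, used only to expose the yield points:

```python
class FakeFuture:
    """Minimal awaitable: yields itself to whoever is driving the coroutine."""
    def __init__(self, label):
        self.label = label
    def __await__(self):
        value = yield self          # suspend here; the driver resumes us with send()
        return value

async def fetch_user(user_id):
    conn = await FakeFuture("acquire")   # yield point 1
    row = await FakeFuture("fetch")      # yield point 2
    return (conn, row, user_id)

# Play the role of the event loop and drive the state machine manually:
coro = fetch_user(42)
fut1 = coro.send(None)            # run until yield point 1
assert fut1.label == "acquire"
fut2 = coro.send("conn-object")   # resume with the "I/O result", run to yield point 2
assert fut2.label == "fetch"
try:
    coro.send({"id": 42})         # resume again; the coroutine runs to completion
except StopIteration as done:
    result = done.value
print(result)   # ('conn-object', {'id': 42}, 42)
```

Each send() is exactly the resume step in the hand-written state machine above; a real event loop just calls send() when the awaited I/O completes.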

Event Loop: epoll/kqueue Under the Hood

The event loop uses OS-level I/O multiplexing to monitor thousands of sockets with a single syscall:

┌─────────────────────────────────────────────────────────────────┐
│  epoll (Linux) / kqueue (macOS/BSD)                             │
│                                                                 │
│  epoll_create():  create epoll interest list                    │
│  epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event):  register socket   │
│  epoll_wait(timeout):  block until any registered fd is ready   │
│      → returns list of ready file descriptors in O(ready_fds)   │
│        NOT O(total_fds); this is the key advantage over select  │
└─────────────────────────────────────────────────────────────────┘
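Python exposes this machinery through the standard-library selectors module, which picks epoll on Linux and kqueue on macOS/BSD automatically. A minimal readiness check over a local socket pair (a sketch):

```python
import selectors
import socket

sel = selectors.DefaultSelector()   # EpollSelector on Linux, KqueueSelector on BSD/macOS
left, right = socket.socketpair()
left.setblocking(False)
right.setblocking(False)

sel.register(left, selectors.EVENT_READ, data="left readable")

# Nothing written yet: a zero-timeout poll reports no ready fds
assert sel.select(timeout=0) == []

right.send(b"hello")                # now `left` has data pending

ready = sel.select(timeout=1)       # returns only READY fds: O(ready), not O(total)
key, events = ready[0]
payload = key.fileobj.recv(1024)
print(payload)                      # b'hello'

sel.unregister(left)
left.close(); right.close(); sel.close()
```

Registering 10,000 sockets costs one epoll_ctl each up front; every loop iteration afterwards pays only for the sockets that are actually ready.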

Node.js libuv event loop phases:
  ┌─────────────────────────────────────────┐
  │    ┌─────────────────────────────────┐  │
  │    │         timers                  │  │  setTimeout, setInterval callbacks
  │    └─────────────────┬───────────────┘  │
  │                      │                  │
  │    ┌─────────────────▼───────────────┐  │
  │    │     pending callbacks           │  │  I/O callbacks deferred to next loop
  │    └─────────────────┬───────────────┘  │
  │                      │                  │
  │    ┌─────────────────▼───────────────┐  │
  │    │      poll (epoll_wait)          │  │  Wait for I/O events, execute callbacks
  │    └─────────────────┬───────────────┘  │
  │                      │                  │
  │    ┌─────────────────▼───────────────┐  │
  │    │           check                 │  │  setImmediate callbacks
  │    └─────────────────┬───────────────┘  │
  │                      │                  │
  │    ┌─────────────────▼───────────────┐  │
  │    │    close callbacks              │  │  socket.on('close', ...)
  │    └─────────────────────────────────┘  │
  └─────────────────────────────────────────┘

  Between each phase: process ALL microtasks (Promise.then, queueMicrotask)

The GIL (CPython Global Interpreter Lock)

CPython's GIL is a mutex that allows only one thread to execute Python bytecode at a time:

Thread 1: ──[acquire GIL]──[execute]──[release GIL]──[wait]──────────────
Thread 2: ──[wait]─────────────────────────────────[acquire GIL][execute]─
Thread 3: ──[wait]──────────────────────────────────────────────[wait]────

Result: Python threads do NOT parallelize CPU-bound code.
        Two Python threads on a quad-core CPU = 1 core effectively used.

Exception: I/O-bound work DOES release the GIL:
  Thread calls recv() → releases GIL → OS performs I/O → thread reacquires GIL
  Multiple threads CAN overlap on I/O, just not CPU.
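A quick way to confirm the I/O exception (a sketch, using time.sleep as a stand-in for a blocking recv(), since both release the GIL while waiting): four threads each blocking for 0.2s finish in about 0.2s total, not 0.8s:

```python
import threading
import time

def io_wait():
    time.sleep(0.2)   # releases the GIL while blocked, like a real recv() would

threads = [threading.Thread(target=io_wait) for _ in range(4)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

print(f"elapsed: {elapsed:.2f}s")   # ~0.2s: the four waits overlapped
```

Replace the sleep with a pure-Python CPU loop and the same four threads run serially, which is exactly the trap shown in the Failure Modes section below.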

Escape hatches:

  • multiprocessing module: separate processes, each with their own GIL
  • C extensions (NumPy, etc.) can release GIL during computation
  • PyPy, GraalPy: alternative runtimes without GIL (PEP 703 for no-GIL CPython in progress)

---

When Threads Win

  1. CPU-bound work: image processing, compression, cryptography. Use multiprocessing in Python (bypasses the GIL), plain goroutines in Go (the runtime spreads them across cores), and worker_threads in Node.js.
  2. Blocking syscalls you can't avoid: some legacy libraries only offer blocking APIs. Wrap in a thread pool.
  3. Simple synchronization requirements: a fixed pool of 4 workers processing a queue is simpler with threads than async.
  4. Shared mutable state: locks are simpler to reason about than async coordination (though both are hard to get right).
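Point 3 above can be sketched in a few lines: a fixed pool of worker threads draining a shared queue, with sentinel values for shutdown (the doubling "work" is purely illustrative):

```python
import queue
import threading

def run_worker_pool(items, num_workers=4):
    """Fixed pool of worker threads draining a shared queue."""
    q = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            item = q.get()
            if item is None:               # sentinel: shut this worker down
                return
            with lock:                     # protect shared results list
                results.append(item * 2)   # stand-in for real work

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()
    for item in items:
        q.put(item)
    for _ in workers:
        q.put(None)                        # one sentinel per worker
    for w in workers:
        w.join()
    return results

print(sorted(run_worker_pool([1, 2, 3, 4, 5])))   # [2, 4, 6, 8, 10]
```

The bounded pool plus blocking queue gives backpressure for free, which is why this pattern stays simpler than its async equivalent.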

When Async Wins

  1. I/O-bound, high concurrency: HTTP servers, microservices, API gateways.
  2. Many idle connections: WebSocket servers, long-polling, chat applications.
  3. Low memory budget: embedded systems, resource-constrained environments.
  4. Low tail latency: no context switch overhead means more predictable p99.

---

Naive Approach β€” Thread per Connection

import threading
import socket
import time

def handle_client(conn, addr):
    """Each client gets its own OS thread."""
    data = conn.recv(1024)
    time.sleep(0.01)           # simulate I/O work
    conn.send(b"HTTP/1.1 200 OK\r\n\r\nHello")
    conn.close()

def run_threaded_server(port=8080):
    sock = socket.socket()
    sock.bind(('', port))
    sock.listen(128)
    while True:
        conn, addr = sock.accept()
        # NEW THREAD PER CONNECTION β€” does not scale
        t = threading.Thread(target=handle_client, args=(conn, addr))
        t.daemon = True
        t.start()

At 10,000 concurrent clients: 10,000 threads × 8MB default stack = 80GB of stack address space (demand paging keeps resident memory lower, but per-thread kernel resources and scheduling overhead kill the server first). Crash.

---

Optimized Approach

Python β€” asyncio (event loop, single thread)

import asyncio
import time

async def handle_client(reader: asyncio.StreamReader, writer: asyncio.StreamWriter):
    """Single thread, async I/O β€” handles thousands of connections."""
    data = await reader.read(1024)       # yields to event loop (non-blocking)
    await asyncio.sleep(0.01)            # yields to event loop (not thread sleep!)
    writer.write(b"HTTP/1.1 200 OK\r\n\r\nHello")
    await writer.drain()
    writer.close()

async def main():
    server = await asyncio.start_server(handle_client, '0.0.0.0', 8080)
    async with server:
        await server.serve_forever()

# 10,000 concurrent connections:
#   Memory: 10,000 × ~2KB coroutine state = 20MB
#   CPU: near zero (blocked in epoll_wait 99% of the time)
asyncio.run(main())

asyncio with a process pool for CPU work:

import asyncio
from concurrent.futures import ProcessPoolExecutor

executor = ProcessPoolExecutor(max_workers=4)  # one per CPU core

def cpu_intensive(data: bytes) -> bytes:
    """CPU-bound work: runs in a process, not blocking the event loop."""
    import hashlib
    return hashlib.sha256(data).digest()

async def handle_request(data: bytes) -> bytes:
    loop = asyncio.get_running_loop()
    # run_in_executor: submits to process pool, awaits result without blocking loop
    result = await loop.run_in_executor(executor, cpu_intensive, data)
    return result

Go β€” Goroutines + Channels

package main

import (
    "net"
    "sync"
    "time"
)

func handleConn(conn net.Conn, wg *sync.WaitGroup) {
    defer wg.Done()
    defer conn.Close()
    // Goroutine: 2KB stack, cheap to create
    buf := make([]byte, 1024)
    conn.Read(buf)
    time.Sleep(10 * time.Millisecond) // simulate I/O
    conn.Write([]byte("HTTP/1.1 200 OK\r\n\r\nHello"))
}

func main() {
    ln, _ := net.Listen("tcp", ":8080")
    var wg sync.WaitGroup

    for {
        conn, _ := ln.Accept()
        wg.Add(1)
        go handleConn(conn, &wg) // one goroutine per connection: cheap!
    }
    // 10,000 connections: 10,000 × 2KB = 20MB
    // Go's runtime schedules them on GOMAXPROCS OS threads
}

Go is unique: goroutines give you the "one goroutine per connection" simplicity of thread-per-connection, with async-level memory efficiency. The runtime does the multiplexing.

Worker pool with goroutines:

func main() {
    jobs := make(chan Job, 1000)
    results := make(chan Result, 1000)

    // Start fixed worker pool (bounded goroutines)
    for w := 0; w < 10; w++ {
        go func() {
            for job := range jobs {
                results <- process(job)
            }
        }()
    }

    // Submit jobs
    for _, job := range allJobs {
        jobs <- job
    }
    close(jobs)
}

Node.js β€” async/await + EventEmitter

const net = require('net');

// Single-threaded event loop handles all connections
const server = net.createServer((socket) => {
    socket.on('data', async (data) => {
        // Non-blocking: returns to event loop while waiting
        await new Promise(resolve => setTimeout(resolve, 10)); // simulate I/O
        socket.write('HTTP/1.1 200 OK\r\n\r\nHello');
        socket.end();
    });
});

server.listen(8080);

// Worker threads for CPU-bound work (Node.js 10.5+)
const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');

if (isMainThread) {
    // Main event loop thread
    async function runCpuWork(data) {
        return new Promise((resolve, reject) => {
            const worker = new Worker(__filename, { workerData: data });
            worker.on('message', resolve);
            worker.on('error', reject);
        });
    }
} else {
    // Worker thread β€” can do blocking CPU work safely
    const result = heavyComputation(workerData);
    parentPort.postMessage(result);
}

Never block the Node.js event loop:

// BAD: synchronous computation blocks all connections
app.get('/hash', (req, res) => {
    const hash = require('crypto')
        .createHash('sha256')
        .update(req.body.data.repeat(100000)) // blocks for 500ms!
        .digest('hex');
    res.json({ hash });
});

// GOOD: offload to worker thread
const { Worker } = require('worker_threads');
app.get('/hash', async (req, res) => {
    const hash = await runInWorker(req.body.data); // event loop free
    res.json({ hash });
});

---

Complexity Analysis

┌─────────────────────────┬────────────────────┬─────────────────────┬────────────────────────────┐
│ Model                   │ Memory/connection  │ Context switch      │ Max concurrent (practical) │
├─────────────────────────┼────────────────────┼─────────────────────┼────────────────────────────┤
│ OS Thread               │ 1–8MB stack        │ 1–10μs + TLB flush  │ ~10,000                    │
│ Goroutine               │ 2KB, grows         │ ~100–200ns          │ 100,000+ per core          │
│ async/await             │ ~2KB coroutine     │ ~100ns              │ 100,000+ per core          │
│ Event loop (callbacks)  │ ~1KB               │ ~50ns               │ 100,000+ per core          │
└─────────────────────────┴────────────────────┴─────────────────────┴────────────────────────────┘

Time complexity for N concurrent I/O tasks:

  • Thread-per-connection: O(N) OS threads, O(N Γ— switch_cost) scheduling overhead
  • Goroutine: O(N/GOMAXPROCS) switches per second β€” scales with CPU count
  • Async: O(1) OS threads (approximately), O(N) task state, O(ready_events) per loop iteration

---

Benchmark (p50, p99, CPU, Memory)

Setup: 10,000 concurrent clients, each holding an HTTP connection with 10ms simulated I/O latency.

┌─────────────────────────┬────────┬────────┬──────────┬───────────┐
│ Model                   │  p50   │  p99   │ Memory   │ CPU (1req)│
├─────────────────────────┼────────┼────────┼──────────┼───────────┤
│ Python threads (GIL)    │  14ms  │  95ms  │ 10GB+    │ 2ms       │
│ Python asyncio          │  11ms  │  18ms  │ 25MB     │ 0.3ms     │
│ Go goroutines           │  11ms  │  15ms  │ 22MB     │ 0.2ms     │
│ Node.js async           │  11ms  │  17ms  │ 80MB     │ 0.5ms     │
└─────────────────────────┴────────┴────────┴──────────┴───────────┘

For CPU-bound work (hash computation) on 10,000 requests:
┌─────────────────────────┬────────┬────────┬──────────┬───────────┐
│ Model                   │  p50   │  p99   │ Memory   │ CPU cores │
├─────────────────────────┼────────┼────────┼──────────┼───────────┤
│ Python asyncio (single) │ 120ms  │ 200ms  │ 25MB     │ 1 (GIL)   │
│ Python multiprocessing  │  35ms  │  55ms  │ 400MB    │ 4         │
│ Go goroutines           │  28ms  │  45ms  │ 22MB     │ 4         │
│ Node.js worker_threads  │  32ms  │  50ms  │ 200MB    │ 4         │
└─────────────────────────┴────────┴────────┴──────────┴───────────┘

---

Observability

Thread pool metrics

from concurrent.futures import ThreadPoolExecutor
from prometheus_client import Gauge, Counter

thread_pool_active = Gauge('thread_pool_active_threads', 'Active threads')
thread_pool_queue = Gauge('thread_pool_queued_tasks', 'Queued tasks')

class InstrumentedExecutor(ThreadPoolExecutor):
    def submit(self, fn, *args, **kwargs):
        thread_pool_queue.inc()
        future = super().submit(fn, *args, **kwargs)
        future.add_done_callback(lambda _: thread_pool_queue.dec())
        return future

Event loop lag (Node.js)

// Event loop lag: time between when a callback was scheduled and when it ran
let lastTick = Date.now();
setInterval(() => {
    const now = Date.now();
    const lag = now - lastTick - 100; // expected 100ms interval
    prometheus.gauge('event_loop_lag_ms').set(lag);
    lastTick = now;
}, 100);

// ALERT: event loop lag > 50ms means the loop is being blocked
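The same lag-sampling idea carries over to Python asyncio (a sketch; the metrics wiring shown above is omitted). A periodic asyncio.sleep should wake on time; any extra delay is event-loop lag, and a task that blocks the loop synchronously inflates it:

```python
import asyncio
import time

async def measure_loop_lag(interval=0.05, samples=5):
    """Each sleep should wake after `interval`; any extra delay is loop lag."""
    lags = []
    for _ in range(samples):
        start = time.perf_counter()
        await asyncio.sleep(interval)
        lags.append(time.perf_counter() - start - interval)
    return lags

async def blocker():
    """Synchronous sleeps block the whole loop, delaying every timer."""
    for _ in range(5):
        time.sleep(0.05)        # blocking: nothing else runs during this
        await asyncio.sleep(0)  # yield so the sampler gets a turn

async def main():
    lags, _ = await asyncio.gather(measure_loop_lag(), blocker())
    return lags

lags = asyncio.run(main())
print(f"max lag: {max(lags) * 1000:.1f}ms")
# Alert on sustained lag (e.g. > 50ms): the loop is being blocked
```

Without the blocker task the measured lag stays near zero; with it, samples whose deadlines land inside a synchronous sleep come back tens of milliseconds late.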

Goroutine leak detection (Go)

import (
    "net/http"
    _ "net/http/pprof"  // registers /debug/pprof/* handlers, including /goroutine
    "runtime"
)

func init() {
    // The blank import only registers handlers; something must serve them
    go http.ListenAndServe("localhost:6060", nil)
}

// Expose goroutine count as a metric
func goroutineCount() int {
    return runtime.NumGoroutine()
}

// Alert: goroutine count grows without bound → leak
// Normal: stable count proportional to active connections

---

Failure Modes

1. Blocking the event loop (Node.js / Python asyncio):

// BAD: This blocks Node.js for 5 seconds β€” ALL other requests stall
app.get('/slow', (req, res) => {
    const start = Date.now();
    while (Date.now() - start < 5000) {}  // CPU spin β€” NEVER do this
    res.send('done');
});

// How to detect: event loop lag metric spikes
// How to fix: Worker threads, process.nextTick decomposition, or restructure algorithm

2. Goroutine leak (Go):

// BAD: goroutine blocked forever on channel nobody will send to
func leaky() {
    ch := make(chan int)
    go func() {
        val := <-ch  // blocks forever if nobody sends
        fmt.Println(val)
    }()
    // Function returns, ch goes out of scope, goroutine is stuck
}

// GOOD: use context for cancellation
func safe(ctx context.Context) {
    ch := make(chan int)
    go func() {
        select {
        case val := <-ch:
            fmt.Println(val)
        case <-ctx.Done():
            return  // goroutine exits cleanly
        }
    }()
}

3. Thread starvation:

When all threads in a pool are waiting for I/O, and no threads are available for new requests:

# Starvation scenario:
executor = ThreadPoolExecutor(max_workers=10)

async def handler():
    # Submits synchronous DB call to thread pool
    result = await loop.run_in_executor(executor, blocking_db_call)
    # If blocking_db_call takes 30s and there are 10 concurrent requests,
    # all 10 executor threads are occupied → the 11th request queues for up
    # to 30s, and under sustained load the backlog grows without bound

Fix: Separate thread pools for I/O-bound and CPU-bound work. Set timeouts.
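That fix can be sketched with dedicated executors and a timeout (names like io_pool and the 2-second budget are illustrative). Note that asyncio.wait_for frees the awaiting coroutine on timeout, but the blocked thread keeps running, so the underlying driver should carry its own timeout too:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

io_pool = ThreadPoolExecutor(max_workers=32)   # sized for many blocking I/O waits
cpu_pool = ThreadPoolExecutor(max_workers=4)   # sized for CPU cores; kept separate

def blocking_db_call(query):
    time.sleep(0.05)                 # stand-in for a slow synchronous driver
    return f"rows for {query}"

async def handler(query):
    loop = asyncio.get_running_loop()
    # The timeout bounds how long the coroutine waits, so one stuck call
    # cannot pin a request forever while it occupies an executor slot
    return await asyncio.wait_for(
        loop.run_in_executor(io_pool, blocking_db_call, query),
        timeout=2.0,
    )

result = asyncio.run(handler("SELECT 1"))
print(result)   # rows for SELECT 1
```

Keeping I/O and CPU pools separate means a burst of slow queries can exhaust io_pool without starving CPU-bound tasks, and vice versa.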

4. Priority inversion:

Classic priority inversion: a high-priority task blocks on a mutex held by a low-priority task, and the low-priority task cannot run to release it because medium-priority tasks keep preempting it. Go has no user-visible goroutine priorities, but a related starvation bug existed before Go 1.14: a goroutine spinning in a tight loop with no function calls could never be preempted, stalling the scheduler and garbage collector.

Go 1.14 added asynchronous (signal-based) preemption, which lets the runtime interrupt such loops.

5. Python GIL + threads CPU-bound misuse:

# TRAP: looks parallel, isn't
import threading

def compute(n):
    for i in range(n):
        i ** 2  # pure Python; the GIL prevents true parallelism

threads = [threading.Thread(target=compute, args=(10_000_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()

# Total work equals four sequential calls, but the GIL serializes execution,
# and lock contention often makes this even slower than a single thread.
# Use multiprocessing.Pool instead.

---

When NOT to Use Async

  1. CPU-bound Python code: async doesn't help (still GIL-bound). Use multiprocessing.
  2. Simple scripts with 1–5 operations: async adds complexity without benefit.
  3. Blocking-only libraries: some Python libraries (e.g., certain DB drivers, boto3) are synchronous only. Running them in asyncio without run_in_executor will block the event loop.
  4. Debugging: async stack traces are harder to read. Callbacks and coroutines fragment the call stack.
  5. Teams unfamiliar with async: async bugs (forgotten await, accidental blocking) are subtle and production-unsafe without team experience.

---

Decision Matrix

                    I/O-bound         CPU-bound
                 ┌──────────────┬───────────────────┐
Python           │ asyncio ✓    │ multiprocessing ✓ │
                 │ (threads OK  │ (NOT threads;     │
                 │  for legacy) │  GIL kills you)   │
                 ├──────────────┼───────────────────┤
Go               │ goroutines ✓ │ goroutines ✓      │
                 │              │ (GOMAXPROCS=ncpu) │
                 ├──────────────┼───────────────────┤
Node.js          │ async/await ✓│ worker_threads ✓  │
                 │              │ (NOT main thread) │
                 └──────────────┴───────────────────┘

---

Lab

See ../../benchmarks/01-thread-vs-async-vs-event-loop/README.md for a complete benchmark comparing:

  • Python threading vs asyncio for 1,000 concurrent I/O tasks
  • Go goroutines for the same workload
  • Node.js async for the same workload

The benchmark measures p50, p99, memory usage, and CPU utilization.

---

Key Takeaways

  1. OS threads: 1–8MB each, 1–10μs context switch. Good for CPU work and legacy blocking code.
  2. Goroutines: 2KB each, ~100ns switch, M:N scheduling. Go's sweet spot: simple code, async performance.
  3. async/await: ~2KB state machine, no kernel context switch for scheduling. Best for I/O-bound, high-concurrency Python and Node.js.
  4. Python GIL: threads cannot parallelize CPU work in CPython. Use multiprocessing for CPU. Use asyncio for I/O.
  5. Go's goroutines give you both: one goroutine per connection (thread-per-connection simplicity) at async memory costs.
  6. Never block the event loop: any synchronous work over ~1ms in Node.js or Python asyncio stalls all other connections.
  7. Goroutine leaks grow linearly; detect via a runtime.NumGoroutine() metric. Always pass context.Context for cancellation.
  8. The decision tree: I/O-bound → async or goroutines. CPU-bound → OS threads (Go/Node) or processes (Python).

📚 Related Topics