Problem
TCP's reliability guarantees come at a cost: connection setup (1.5 RTT), connection teardown (2 RTT), head-of-line blocking, and retransmission delays. Understanding exactly what TCP does explains why every connection-management optimization (pooling, keepalive, HTTP/2) works.
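These costs are easy to observe directly. The sketch below (a throwaway local echo server, illustrative only) times a request that pays the handshake versus one on an already-ESTABLISHED connection; on loopback the absolute numbers are microscopic, but over a real network each step pays the RTTs described in this module:

```python
import socket
import threading
import time

def echo_server(srv: socket.socket) -> None:
    """Accept one connection and echo everything back until it closes."""
    conn, _ = srv.accept()
    while data := conn.recv(1024):
        conn.sendall(data)
    conn.close()

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))  # OS picks a free port
srv.listen(1)
threading.Thread(target=echo_server, args=(srv,), daemon=True).start()

# Cold: TCP handshake + first request/response
t0 = time.perf_counter()
sock = socket.create_connection(srv.getsockname())
sock.sendall(b"ping")
sock.recv(4)
cold = time.perf_counter() - t0

# Warm: same ESTABLISHED connection, no handshake
t0 = time.perf_counter()
sock.sendall(b"ping")
sock.recv(4)
warm = time.perf_counter() - t0

print(f"cold: {cold * 1e6:.0f}us  warm: {warm * 1e6:.0f}us")
sock.close()
```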
Why It Matters (Latency, Throughput, Cost)
TCP connection lifecycle costs (assuming 10ms RTT, e.g. cross-AZ):
- Handshake (SYN/SYN-ACK/ACK): 1.5 RTT = 15ms
- TLS 1.3 on top: 1 additional RTT = 10ms
- First data exchange: 1 RTT = 10ms
- Total time to first byte (TTFB): 3.5 RTT = 35ms

With connection pooling (reuse):
- Connection setup: 0 (connection already ESTABLISHED)
- First data exchange: 1 RTT = 10ms

TCP State Machine
CLOSED
   │ SYN sent (client)
   ▼
SYN_SENT        (server side: LISTEN → SYN_RECEIVED
   │             after SYN received + SYN-ACK sent)
   │ SYN-ACK received, ACK sent
   ▼
ESTABLISHED
   (data flows)
   │ FIN sent (active close)
   ▼
FIN_WAIT_1
   │ ACK of FIN received
   ▼
FIN_WAIT_2
   │ FIN received from remote, ACK sent
   ▼
TIME_WAIT   ← stays here for 2×MSL (60–120 seconds)
   │ 2×MSL timer expires
   ▼
CLOSED

TIME_WAIT: The connection stays in TIME_WAIT for 2 × MSL (Maximum Segment Lifetime; the wait is typically 60s total) to ensure delayed packets from the old connection don't corrupt a new connection on the same port pair. This is why rapidly cycling connections (no pool) can exhaust ephemeral ports (EADDRINUSE).
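The exhaustion math is worth doing once. A sketch, assuming common Linux defaults (ephemeral port range 32768–60999 from net.ipv4.ip_local_port_range, and a fixed 60s TIME_WAIT); these values are illustrative, not read from your machine:

```python
# How fast can an unpooled client open-and-close connections to one
# (dst_ip, dst_port) before TIME_WAIT exhausts the ephemeral port range?
# Assumed Linux defaults below -- check your own sysctls.

EPHEMERAL_PORTS = 60999 - 32768 + 1   # default net.ipv4.ip_local_port_range
TIME_WAIT_SECONDS = 60                # TIME_WAIT hold time on Linux

# Each closed connection holds its ephemeral port for TIME_WAIT_SECONDS,
# so the sustainable steady-state rate is ports / hold-time.
max_new_conns_per_sec = EPHEMERAL_PORTS / TIME_WAIT_SECONDS
print(f"{max_new_conns_per_sec:.0f} new connections/sec before exhaustion")
# Roughly 470/sec to a single destination -- an unpooled service doing
# thousands of requests/sec hits EADDRINUSE; a pooled one never does.
```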
TCP Flow Control and Congestion Control
Flow control (receive window): Receiver advertises how much buffer space it has. Sender cannot send more than the window allows.
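The receive buffer that backs the advertised window is easy to inspect. A minimal sketch using standard socket options (reported defaults vary by OS; on Linux the kernel returns double the requested value and auto-tuning can grow the buffer later):

```python
import socket

# Inspect the kernel's default receive/send buffer sizes for a new TCP
# socket. The receive buffer caps the window this host can advertise.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
rcvbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
sndbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
sock.close()

print(f"default receive buffer: {rcvbuf} bytes, send buffer: {sndbuf} bytes")
```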
Bandwidth-delay product (BDP):

BDP = bandwidth × RTT

For a 1Gbps link with 10ms RTT:

BDP = 1,000,000,000 bits/s × 0.010 s = 10,000,000 bits = 1.25MB

TCP's window must be at least 1.25MB (the BDP) to fully utilize this link. At 100ms RTT (cross-region), the BDP grows to 12.5MB, so a default 4MB receive buffer lets the sender keep only ~32% of the pipe full.
Tune: net.core.rmem_max and net.ipv4.tcp_rmem.

Slow start: New TCP connections start with a small congestion window (cwnd ≈ 10 MSS ≈ 14KB). cwnd doubles each RTT until packet loss is detected. This is why a fresh connection to a CDN is slower than a warm one.
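To make the BDP and slow-start arithmetic concrete, a small sketch (illustrative numbers; assumes an initial window of 10 segments with a 1460-byte MSS and loss-free doubling):

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the pipe."""
    return bandwidth_bps * rtt_s / 8  # bits -> bytes

def rtts_to_window(target_bytes: float, initial_cwnd: float = 10 * 1460) -> int:
    """RTTs of loss-free slow start (cwnd doubling each RTT) to reach target."""
    rtts, cwnd = 0, initial_cwnd
    while cwnd < target_bytes:
        cwnd *= 2
        rtts += 1
    return rtts

bdp = bdp_bytes(1e9, 0.010)        # 1 Gbps link, 10ms RTT
print(f"BDP: {bdp / 1e6:.2f} MB")  # BDP: 1.25 MB
print(f"slow-start RTTs to fill the pipe: {rtts_to_window(bdp)}")  # 7
```

Seven extra round trips before a fresh connection can saturate even a modest link is the quantitative case for keeping connections warm.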
Nagle's Algorithm
Nagle's algorithm coalesces small writes to reduce packet count:
- Buffer small writes until: ACK received for outstanding data OR buffer ≥ MSS
Impact on backend: Nagle interacts badly with delayed ACKs (up to 40ms on Linux). A database client that sends a query in two small writes and then calls recv() can stall up to 40ms: the second write waits on an ACK the server is delaying. Fix: the TCP_NODELAY socket option.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # disable Nagle

Most database drivers set TCP_NODELAY by default. Verify yours does.
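One way to verify: read the option back with getsockopt on the driver's underlying socket. A minimal sketch (here on a socket we configured ourselves, for illustration):

```python
import socket

# Check whether Nagle is disabled on a socket, e.g. one extracted from
# your database driver. getsockopt returns nonzero when TCP_NODELAY is set.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print("Nagle disabled" if nodelay else "Nagle enabled")  # Nagle disabled
sock.close()
```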
Key Takeaways
- TCP connection = 1.5 RTT setup + 2 RTT teardown. Pool connections to avoid this.
- TIME_WAIT (2×MSL) is intentional: it prevents packet reuse bugs. Don't try to eliminate it; increase your ephemeral port range instead.
- Slow start means fresh connections have low throughput initially. Warm connections perform better β another argument for connection pooling.
- BDP = bandwidth × RTT. Your TCP window must be ≥ BDP to saturate a link.
- Nagle's algorithm delays small writes up to 40ms. Set TCP_NODELAY for database and RPC connections.
Related Modules
./05-congestion-control.md: CUBIC, BBR algorithms
../../07-core-backend-engineering/02-connection-pooling.md: TCP cost eliminated by pooling