Backend Audit Checklist


Use this checklist to assess the health of a backend system. Each item has a measurable threshold. Mark an item ✓ when it meets its threshold. Items marked ✗ are immediate action items. Items marked ⚠ are improvement opportunities.

Scoring: 1 point per ✓. 0 points per ✗ or ⚠.

  • 90–100%: Production-ready
  • 70–89%: Some gaps, addressable within one sprint
  • 50–69%: Significant technical debt, prioritize this quarter
  • <50%: High risk, stop and fix before further feature development
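
The scoring rule and bands above can be expressed as a small helper. This is a minimal sketch, not part of the checklist itself: the function name `audit_band` is hypothetical, and it assumes fractional percentages between two bands (for example 89.5%) fall into the lower band.

```python
from __future__ import annotations


def audit_band(passed: int, total: int) -> tuple[float, str]:
    """Return (percentage, band) for `passed` ✓ items out of `total` items."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("need total > 0 and 0 <= passed <= total")
    pct = 100.0 * passed / total
    # Assumption: a band starts at its lower bound, so 89.5% is still "Some gaps".
    if pct >= 90:
        band = "Production-ready"
    elif pct >= 70:
        band = "Some gaps, addressable within one sprint"
    elif pct >= 50:
        band = "Significant technical debt, prioritize this quarter"
    else:
        band = "High risk, stop and fix before further feature development"
    return pct, band
```

For example, `audit_band(64, 71)` reports roughly 90.1% and the "Production-ready" band.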

1. API Performance

# | Check | Threshold | Method
1.1 | Read endpoints p99 latency | < 200ms | APM or load test with k6/wrk
1.2 | Write endpoints p99 latency | < 500ms | APM or load test
1.3 | No endpoint p99 exceeds 1s under normal load | 0 violations | APM alerting
1.4 | p99/p50 ratio (tail latency amplification) | < 4× | Histogram analysis
1.5 | Timeouts configured on all outbound HTTP calls | 100% coverage | Code audit
1.6 | Request size limits enforced | Max body size configured | Framework config

How to measure:

# k6 load test — generates p50/p99 breakdown
k6 run --vus 50 --duration 60s script.js

# wrk quick benchmark
wrk -t 4 -c 100 -d 30s --latency http://localhost:8080/api/users
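
If raw latency samples are available (for example exported from a load test), checks 1.1 and 1.4 can be evaluated offline. The sketch below uses the nearest-rank percentile definition, which is an assumption; APM tools and k6 may interpolate differently. The function names and the sample data are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def tail_ratio(samples_ms):
    """p99/p50 ratio used by check 1.4."""
    return percentile(samples_ms, 99) / percentile(samples_ms, 50)


# Hypothetical samples in milliseconds: mostly fast, with a small slow tail.
latencies_ms = [20] * 98 + [60, 300]
p99 = percentile(latencies_ms, 99)
ratio = tail_ratio(latencies_ms)
print(f"p99={p99}ms ratio={ratio:.1f}x pass={p99 < 200 and ratio < 4}")
```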

2. Database

# | Check | Threshold | Method
2.1 | No query p99 exceeds 100ms under normal load | 0 violations | pg_stat_statements
2.2 | Foreign key columns have indexes | 100% of FKs | \d tablename or schema audit
2.3 | No sequential scans on tables > 10,000 rows in hot paths | 0 violations | EXPLAIN ANALYZE + pg_stat_user_tables
2.4 | N+1 query patterns absent | 0 hot endpoints with queries/req > 10 | APM query count tracing
2.5 | Query timeouts configured | statement_timeout set | SHOW statement_timeout
2.6 | Slow query log enabled | log_min_duration_statement ≤ 100ms | SHOW log_min_duration_statement
2.7 | EXPLAIN ANALYZE reviewed for all queries used > 1000×/day | Documented | Query plan audit
2.8 | Index bloat | < 20% | pgstattuple extension: SELECT * FROM pgstatindex('indexname') for B-tree indexes; SELECT * FROM pgstattuple('tablename') for table bloat

Quick FK index audit (PostgreSQL):

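-- Heuristic: index_count matches the column name anywhere in the index
-- definition, so similarly named columns can hide a missing index.
SELECT *
FROM (
    SELECT DISTINCT
        tc.table_name,
        kcu.column_name,
        ccu.table_name AS foreign_table,
        (SELECT COUNT(*) FROM pg_indexes
         WHERE schemaname = tc.table_schema
           AND tablename = tc.table_name
           AND indexdef LIKE '%' || kcu.column_name || '%') AS index_count
    FROM information_schema.table_constraints tc
    JOIN information_schema.key_column_usage kcu
        ON tc.constraint_name = kcu.constraint_name
       AND tc.constraint_schema = kcu.constraint_schema
    JOIN information_schema.referential_constraints rc
        ON tc.constraint_name = rc.constraint_name
       AND tc.constraint_schema = rc.constraint_schema
    JOIN information_schema.constraint_column_usage ccu
        ON ccu.constraint_name = rc.unique_constraint_name
       AND ccu.constraint_schema = rc.unique_constraint_schema
    WHERE tc.constraint_type = 'FOREIGN KEY'
) fk
-- A column alias cannot be used in HAVING or in the same query level's WHERE,
-- so the filter is applied on the outer query.
WHERE fk.index_count = 0;
-- Results: FK columns with no index → add these indexes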

3. Caching

# | Check | Threshold | Method
3.1 | Cache hit rate for hot read paths | > 80% | redis-cli INFO stats
3.2 | TTL configured on all cache keys | 100% have explicit TTL | Code audit
3.3 | Eviction policy configured | allkeys-lru or volatile-lru set | redis-cli CONFIG GET maxmemory-policy
3.4 | maxmemory configured | Not unlimited | redis-cli CONFIG GET maxmemory
3.5 | Cache key namespace collision check | No two entities share prefix | Naming convention audit
3.6 | Thundering herd protection on popular keys | Probabilistic early expiration (PER) or mutex in place | Code audit of cache-miss paths
3.7 | Negative caching for non-existent lookups | Bloom filter or null results cached with a short TTL | Code audit
3.8 | Cache eviction rate acceptable | < 5% of keys evicted per hour | redis-cli INFO stats (evicted_keys)

Redis health check:

redis-cli INFO stats | grep -E "keyspace_hits|keyspace_misses|evicted_keys"
redis-cli INFO memory | grep -E "used_memory_human|maxmemory_human"
# hit_rate = keyspace_hits / (keyspace_hits + keyspace_misses)
# Target: > 0.80
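
The hit-rate formula in the comments above can be applied to captured `redis-cli INFO stats` output. This is a minimal sketch that only parses text; the sample values are made up, and the helper names `parse_info` and `hit_rate` are hypothetical.

```python
def parse_info(text):
    """Parse `redis-cli INFO` output (key:value lines) into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Stats".
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key] = value
    return fields


def hit_rate(stats):
    """hit_rate = keyspace_hits / (keyspace_hits + keyspace_misses)."""
    hits = int(stats["keyspace_hits"])
    misses = int(stats["keyspace_misses"])
    total = hits + misses
    return hits / total if total else 0.0


# Hypothetical capture of `redis-cli INFO stats`.
sample = "# Stats\r\nkeyspace_hits:8500\r\nkeyspace_misses:1500\r\nevicted_keys:12\r\n"
rate = hit_rate(parse_info(sample))
print(f"hit rate {rate:.2f}, target met: {rate > 0.80}")
```

Note that these counters are cumulative since the last restart or `CONFIG RESETSTAT`, so comparing two captures taken some time apart gives a more current picture than a single reading.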

4. Connection Management

# | Check | Threshold | Method
4.1 | Connection pool configured (not new-connection-per-request) | Pool exists | Code audit
4.2 | Pool size follows sizing formula | pool_size ≈ DB_CPU_cores × 2 | Config review
4.3 | Pool acquire timeout configured | ≤ 30 seconds | Config review
4.4 | max_conn_lifetime configured | < 30 minutes | Config review
4.5 | max_conn_idle_time configured | < 10 minutes | Config review
4.6 | Connection leak detection enabled | Leak detection implemented | Code audit
4.7 | Pool metrics exposed (active, idle, waiting) | Metrics endpoint exists | /metrics or APM
4.8 | Connections released in finally blocks | 100% coverage | Code audit
4.9 | No connections held across external HTTP calls | 0 violations | Code audit
4.10 | For serverless: connection proxy (RDS Proxy/PgBouncer) in use | Proxy configured | Architecture review

Pool sizing calculator:

target_pool_size = ceil(peak_rps × avg_query_duration_seconds × 1.3)

Example:
  Peak load: 500 req/s
  Avg query: 15ms = 0.015s
  Safety factor: 1.3
  pool_size = ceil(500 × 0.015 × 1.3) = ceil(9.75) = 10
  Cross-check: DB has 4 cores → 4 × 2 = 8 → take max(10, 8) = 10
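
The calculator above translates directly into code. The sketch below is illustrative: the function name is hypothetical, and it assumes roughly one query per request, as the worked example does.

```python
import math


def target_pool_size(peak_rps, avg_query_seconds, safety_factor=1.3, db_cores=None):
    """Load-based pool size, optionally cross-checked against DB_CPU_cores × 2."""
    # round() guards against float noise pushing an exact product just over an integer.
    load_based = math.ceil(round(peak_rps * avg_query_seconds * safety_factor, 9))
    if db_cores is None:
        return load_based
    return max(load_based, db_cores * 2)


print(target_pool_size(500, 0.015, db_cores=4))  # the worked example above
```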

5. Observability

# | Check | Threshold | Method
5.1 | Distributed tracing implemented | 100% of service boundaries | Jaeger/Datadog APM
5.2 | RED metrics exported (Rate, Errors, Duration) | Per endpoint | Prometheus/Datadog
5.3 | Structured logging (JSON, not freeform text) | 100% of log lines | Log aggregator check
5.4 | Request IDs propagated across service calls | 100% of requests | Header audit
5.5 | Error rates alerted | Alert at > 1% error rate | Alert config
5.6 | Latency p99 alerted | Alert at > 500ms p99 | Alert config
5.7 | DB query count per request tracked | Metric exists | APM or middleware
5.8 | Cache hit rate tracked | Metric exists | Redis metrics

Minimum viable metric set (Prometheus):

# Must have these metrics for every HTTP endpoint:
http_requests_total{method, path, status_code}
http_request_duration_seconds{method, path}  # histogram; derive p50/p99 with histogram_quantile()

# Database:
db_queries_total{endpoint}
db_query_duration_seconds  # histogram

# Cache:
cache_hits_total
cache_misses_total
cache_evictions_total

# Connection pool:
db_pool_connections_total{state}  # state: idle, active
db_pool_wait_duration_seconds     # histogram
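
Checks 5.3 and 5.4 (structured logs carrying a request ID) can be met with the standard library alone. The following is a minimal sketch, not a recommended production setup: the logger name, the JSON field names, and the idea of reusing an incoming `X-Request-ID` header are assumptions for illustration.

```python
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set via `extra={"request_id": ...}` on the logging call.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


def build_logger(stream):
    logger = logging.getLogger("audit_example")  # hypothetical logger name
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


stream = io.StringIO()
log = build_logger(stream)
# In a real service, reuse the caller's request ID header (for example
# X-Request-ID) when present, and forward the same value on outbound calls.
request_id = str(uuid.uuid4())
log.info("user fetched", extra={"request_id": request_id})
print(stream.getvalue(), end="")
```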

6. Security

# | Check | Threshold | Method
6.1 | Authentication on all non-public endpoints | 100% coverage | Endpoint audit
6.2 | Authorization checked at data layer (not just route layer) | 100% coverage | Code audit
6.3 | Rate limiting on all public endpoints | Limits configured | Rate limiter config
6.4 | Rate limiting on auth endpoints | Stricter limits (e.g., 10 req/min) | Rate limiter config
6.5 | SQL injection prevention (parameterized queries) | 0 string-concatenated queries | Code audit + SAST
6.6 | Input validation on all user-provided data | Validation library in use | Code audit
6.7 | Secrets not in source code | 0 secrets in git history | git log -S "password" + vault
6.8 | CORS configured (not * on API) | Domain allowlist | Config audit
6.9 | Security headers set (HSTS, X-Frame-Options, etc.) | Headers present | curl -I https://...
6.10 | Dependencies scanned for CVEs | Scan in CI pipeline | npm audit, safety, govulncheck
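
To illustrate what check 6.5 looks for during a code audit, here is a small sketch of a parameterized query using Python's built-in `sqlite3` module. The table, data, and function name are hypothetical; other drivers use different placeholder styles (for example `%s` or `$1`), but the principle is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO users (email) VALUES (?)", ("a@example.com",))


def find_user(conn, email):
    # The driver binds `email` as data; it is never spliced into the SQL text.
    return conn.execute(
        "SELECT id, email FROM users WHERE email = ?", (email,)
    ).fetchall()


# A classic injection payload is treated as a literal string and matches nothing.
malicious = "x' OR '1'='1"
print(find_user(conn, "a@example.com"))
print(find_user(conn, malicious))
```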

7. Reliability

# | Check | Threshold | Method
7.1 | Circuit breakers on all external service calls | 100% of external calls | Code audit
7.2 | Retry with exponential backoff implemented | Retries capped, jittered | Code audit
7.3 | Timeouts on ALL outbound calls (DB, HTTP, cache) | 100% have explicit timeout | Code audit
7.4 | Graceful shutdown implemented | SIGTERM handled, in-flight requests complete | Code audit
7.5 | Health check endpoints exist (liveness + readiness) | /health/live and /health/ready | Endpoint test
7.6 | Deployment is zero-downtime | Rolling update or blue/green | Deploy config
7.7 | Database migrations are backward-compatible | Non-breaking schema changes | Migration review
7.8 | Feature flags available | Flag system in use | Code audit
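
Check 7.2 asks for retries that are both capped and jittered. One common way to do this is "full jitter" exponential backoff, sketched below. The function name, defaults, and the choice of full jitter are assumptions; the injectable `sleep` and `rng` parameters exist only to make the sketch testable.

```python
import random
import time


def retry_with_backoff(call, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call `call()` until it succeeds or `max_attempts` is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            # Real code should catch only errors known to be transient,
            # and retry only idempotent operations.
            if attempt == max_attempts:
                raise
            # Exponential growth, capped at max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a uniform random time in [0, cap).
            sleep(rng() * cap)
```

Jitter spreads retries from many clients over time, so a recovering dependency is not hit by synchronized retry waves.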

Circuit breaker threshold guidance:

Open circuit when:
  error_rate > 50% in last 10 seconds AND at least 20 requests
Half-open: allow 1 request through every 30 seconds to test recovery
Close: if 3 consecutive successes in half-open state
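
The thresholds above can be turned into a small state machine. The sketch below is one possible reading of that guidance, not a reference implementation: class and parameter names are hypothetical, it is not thread-safe, and a failed probe in the half-open state is assumed to reopen the circuit and reset the success count.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without invoking it."""


class CircuitBreaker:
    def __init__(self, window=10.0, min_requests=20, error_threshold=0.5,
                 probe_interval=30.0, successes_to_close=3, clock=time.monotonic):
        self.window = window
        self.min_requests = min_requests
        self.error_threshold = error_threshold
        self.probe_interval = probe_interval
        self.successes_to_close = successes_to_close
        self.clock = clock
        self.state = "closed"
        self.results = deque()     # (timestamp, succeeded) pairs inside the window
        self.last_probe_at = 0.0   # when the circuit opened or last let a probe through
        self.probe_successes = 0

    def call(self, fn):
        now = self.clock()
        if self.state != "closed":
            # Open or half-open: allow one probe per probe_interval.
            if now - self.last_probe_at < self.probe_interval:
                raise CircuitOpenError("circuit is open")
            self.state = "half_open"
            self.last_probe_at = now
        try:
            result = fn()
        except Exception:
            self._on_failure(now)
            raise
        self._on_success(now)
        return result

    def _on_success(self, now):
        if self.state == "half_open":
            self.probe_successes += 1
            if self.probe_successes >= self.successes_to_close:
                self.state = "closed"
                self.results.clear()
            return
        self._record(now, True)

    def _on_failure(self, now):
        if self.state == "half_open":
            self._open(now)
            return
        self._record(now, False)
        failures = sum(1 for _, ok in self.results if not ok)
        enough = len(self.results) >= self.min_requests
        if enough and failures / len(self.results) > self.error_threshold:
            self._open(now)

    def _open(self, now):
        self.state = "open"
        self.last_probe_at = now
        self.probe_successes = 0

    def _record(self, now, ok):
        self.results.append((now, ok))
        # Drop results that have aged out of the sliding window.
        while self.results and now - self.results[0][0] > self.window:
            self.results.popleft()
```

A caller would wrap each external request, for example `breaker.call(lambda: client.get(url))`, and treat `CircuitOpenError` as a fast failure (here `client` and `url` stand in for whatever HTTP client the service uses).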

8. Scalability

# | Check | Threshold | Method
8.1 | Horizontal scaling tested | Service runs with N>1 instances without issues | Load test
8.2 | Stateless service (no in-process session state) | State in Redis/DB only | Architecture review
8.3 | No shared mutable in-memory state across requests | Confirmed stateless | Code audit
8.4 | Database read replicas used for read-heavy queries | Read/write split configured | DB config
8.5 | Long-running jobs in async queue (not synchronous HTTP) | Job queue in use | Architecture review
8.6 | File uploads/downloads proxied (not through app server) | Presigned URLs or CDN | Code audit
8.7 | Pagination on all list endpoints | Cursor or offset pagination | Endpoint audit
8.8 | Max response size bounded | No unbounded result sets | Code audit
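
Checks 8.7 and 8.8 are often satisfied together by cursor (keyset) pagination with a capped page size. The sketch below uses `sqlite3` for a self-contained demonstration; the table, the `MAX_PAGE_SIZE` value, and the function name are hypothetical.

```python
import sqlite3

MAX_PAGE_SIZE = 100  # hypothetical upper bound on rows per response


def list_users(conn, after_id=0, limit=20):
    """Return (rows, next_cursor); next_cursor is None on the last page."""
    limit = max(1, min(limit, MAX_PAGE_SIZE))
    rows = conn.execute(
        "SELECT id, name FROM users WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, limit + 1),  # fetch one extra row to detect a further page
    ).fetchall()
    has_more = len(rows) > limit
    rows = rows[:limit]
    next_cursor = rows[-1][0] if has_more else None
    return rows, next_cursor


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users (name) VALUES (?)", [(f"user{i}",) for i in range(1, 6)]
)

cursor = 0
while cursor is not None:
    page, cursor = list_users(conn, after_id=cursor, limit=2)
    print(page, cursor)
```

Unlike offset pagination, the `WHERE id > ?` form lets the database seek directly through the primary key index, so deep pages cost about the same as the first page.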

9. Development Practices

# | Check | Threshold | Method
9.1 | EXPLAIN ANALYZE for every new query | PR review requirement | Code review
9.2 | Load tests for every new endpoint > 100 req/day | Load test in CI | CI config
9.3 | Query count asserted in tests | Test fails if N+1 introduced | Test suite
9.4 | Database migrations reviewed for lock risk | Advisory lock audit | Migration review
9.5 | Performance budget defined and enforced | p99 SLA per endpoint documented | ADR or runbook
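
Check 9.3 can be implemented by counting the statements a code path issues and asserting on the count. The sketch below uses `sqlite3`'s `set_trace_callback` so that it runs without a database server; with PostgreSQL, the same idea would rely on the ORM's or driver's own query hooks. The schema and function names are hypothetical.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def count_queries(conn):
    """Collect every SQL statement sqlite3 runs inside the `with` block."""
    statements = []
    conn.set_trace_callback(statements.append)
    try:
        yield statements
    finally:
        conn.set_trace_callback(None)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors (name) VALUES ('a'), ('b'), ('c');
    INSERT INTO posts (author_id, title) VALUES (1, 'p1'), (2, 'p2'), (3, 'p3');
""")


def posts_n_plus_one(conn):
    # One query for the posts, then one more per post for its author.
    posts = conn.execute("SELECT author_id, title FROM posts ORDER BY id").fetchall()
    return [
        (title, conn.execute("SELECT name FROM authors WHERE id = ?", (aid,)).fetchone()[0])
        for aid, title in posts
    ]


def posts_joined(conn):
    # A single query that fetches posts and authors together.
    return conn.execute(
        "SELECT p.title, a.name FROM posts p JOIN authors a ON a.id = p.author_id "
        "ORDER BY p.id"
    ).fetchall()


with count_queries(conn) as stmts:
    posts_joined(conn)
assert len(stmts) <= 1, "query budget exceeded: possible N+1"
```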

Score Summary Template

Team: _______________   Date: _______________   Reviewer: _______________

Section                    Score    Max    %
─────────────────────────────────────────────
1. API Performance         ___      6      ___
2. Database                ___      8      ___
3. Caching                 ___      8      ___
4. Connection Management   ___      10     ___
5. Observability           ___      8      ___
6. Security                ___      10     ___
7. Reliability             ___      8      ___
8. Scalability             ___      8      ___
9. Development Practices   ___      5      ___
─────────────────────────────────────────────
TOTAL                      ___      71     ___  %

Top 3 action items:
1. _______________________________________________
2. _______________________________________________
3. _______________________________________________