Life of a Packet in Amazon EKS
If you already know Kubernetes architecture, skip to section 3.
1. Kubernetes architecture
Kubernetes has two planes. The control plane runs the API server, etcd, scheduler, controller manager, and cloud controller manager. The data plane is worker nodes where pods run. Each node has kubelet (starts pods, checks health) and kube-proxy (configures network rules for service traffic).
+----------------------- Kubernetes Cluster ---------------------+
| |
| +--------------- Control Plane ----------------+ |
| | | |
| | +-----------+ +------+ +-----------+ | |
| | |API Server | | etcd | | Scheduler | | |
| | +-----------+ +------+ +-----------+ | |
| | +------------------+ +----------------+ | |
| | |Controller Manager| |Cloud Controller| | |
| | +------------------+ +----------------+ | |
| +----------------------------------------------+ |
| | |
| kubectl / API |
| | |
| +--------------- Data Plane -------------------------------+ |
| | | |
| | +--- Node 1 ---+ +--- Node 2 ---+ +--- Node N ---+ | |
| | | kubelet | | kubelet | | kubelet | | |
| | | kube-proxy | | kube-proxy | | kube-proxy | | |
| | | +----++----+ | | +----++----+ | | +----++----+ | | |
| | | |Pod ||Pod | | | |Pod ||Pod | | | |Pod ||Pod | | | |
| | | +----++----+ | | +----++----+ | | +----++----+ | | |
| | +--------------+ +--------------+ +--------------+ | |
| +----------------------------------------------------------+ |
+----------------------------------------------------------------+
In EKS, the control plane lives in an AWS-managed VPC (not yours). It runs at least two API server instances and three etcd instances spread across three AZs, managed by EC2 Auto Scaling Groups. The Kubernetes API sits behind an NLB.
Your worker nodes live in your VPC. They reach the control plane through cross-account ENIs (X-ENIs) in at least two AZs.
+---------- AWS-Managed VPC (EKS Service) ----------+
| |
| +--- AZ-a ----+ +--- AZ-b -----+ +--- AZ-c ----+ |
| | API Server | | API Server | | | |
| | + Scheduler | | + Scheduler | | | |
| | + Ctrl Mgr | | + Ctrl Mgr | | | |
| | | | | | | |
| | etcd | | etcd | | etcd | |
| +------+-------+ +------+-------+ +------+------+ |
| | | | |
+---------+----------------+----------------+--------+
| | |
+----+----------------+----------------+----+
| NLB (Kubernetes API) |
+----+----------------+----------------+----+
| | |
========+================+================+======== Cross-Account ENIs (X-ENIs)
| | |
+---------+----------------+----------------+--------+
| +------+------+ +------+------+ +------+-------+ |
| | Worker | | Worker | | Worker | |
| | Node(s) | | Node(s) | | Node(s) | |
| | AZ-a | | AZ-b | | AZ-c | |
| +-------------+ +-------------+ +--------------+ |
| |
| Your VPC |
+----------------------------------------------------+
2. The Kubernetes network model
Kubernetes requires that every pod gets its own IP, pods can talk to each other without NAT, and agents on a node can reach all pods on that node.
A pod can have multiple containers. They share a network namespace: they talk over localhost and share a single eth0 for everything external.
+------------- Pod ----------------+
| |
| +-----------+ +-----------+ |
| |Container A| |Container B| |
| +-----+-----+ +-----+-----+ |
| | localhost | |
| +-------+-------+ |
| | |
| +-----+------+ |
| | lo | |
| | 127.0.0.1 | |
| +------------+ |
| +------------+ |
| | eth0 | |
| | 10.0.3.42 | |
| +-----+------+ |
| | |
+----------------+-----------------+
|
to the network
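The shared namespace is easy to demonstrate with a two-container pod. A sketch (pod name is illustrative; the images are the same ones used in the appendix manifests): one container serves on port 8080, the other can reach it over localhost because both share the pod's network namespace.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-netns-demo            # illustrative name
spec:
  containers:
  - name: web
    image: registry.k8s.io/echoserver:1.10     # listens on 8080
  - name: sidecar
    image: ghcr.io/nicolaka/netshoot:latest
    command: ["sleep", "infinity"]
```

`kubectl exec shared-netns-demo -c sidecar -- curl -s localhost:8080` reaches the web container: same lo, same eth0, no port mapping involved.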
3. How pods connect to the network
Each pod gets its own Linux network namespace, connected to the host’s root namespace through a veth pair, a virtual Ethernet cable with one end in the pod and one end on the host.
How those veth interfaces plug into the host’s IP stack varies by implementation. Kubernetes doesn’t own this part. A spec called CNI (Container Network Interface) defines plugins that handle the wiring: creating the pod’s interface, assigning an IP, setting up the veth pair.
Built-in plugins include loopback, bridge, and ipvlan. Third-party ones include Calico, Cilium, and Amazon VPC CNI.
+---- Pod Netns (Pod A) -------+ +---- Root Network Namespace (Node) ----------+
| | | |
| +----------+ | | +----------+ |
| | eth0 |<-- veth pair --+------+--->| veth1 | |
| |10.0.3.42 | | | +----------+ |
| +----------+ | | |
| | | +----------+ +----------+ |
+------------------------------+ | | veth2 | | ENI-0 |--> VPC |
| +----------+ | (primary)| |
+---- Pod Netns (Pod B) -------+ | ^ +----------+ |
| | | | +----------+ |
| +----------+ | | | | ENI-1 |--> VPC |
| | eth0 |<-- veth pair --+------+---------+ |(secondary| |
| |10.0.3.55 | | | +----------+ |
| +----------+ | | |
| | | Linux IP Stack / Routing Tables |
+------------------------------+ +---------------------------------------------+
What VPC CNI does
VPC CNI assigns pod IPs from the VPC CIDR using secondary IPs or prefix delegation on the node’s EC2 ENIs. Pod IPs are real, routable VPC IPs, not overlay addresses.
As pods come and go, VPC CNI adds or removes ENIs on the node to keep enough IPs available. It also configures routing entries on the host and routing + ARP entries inside each pod’s namespace.
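ENI and IP limits are also what cap pod density. The standard EKS max-pods formula, as a quick sketch: each ENI's primary IP belongs to the node rather than to pods, and two slots are added for pods that use the host network directly (aws-node, kube-proxy).

```python
def max_pods(enis: int, ips_per_eni: int) -> int:
    # Each ENI's primary IP is reserved for the node itself,
    # +2 covers host-network pods (aws-node, kube-proxy).
    return enis * (ips_per_eni - 1) + 2

# e.g. an m5.large supports 3 ENIs with 10 IPv4 addresses each:
print(max_pods(3, 10))  # 29
```

This is why prefix delegation matters: it raises the IPs-per-ENI term without needing more ENIs.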
4. Inside the pod and node
Inside the pod
$ ip addr show $ ip route show
lo: 127.0.0.1/8 default via 169.254.1.1 dev eth0
eth0: 10.0.3.42/32 <-- /32 mask! 169.254.1.1 dev eth0 scope link
$ arp -a
? (169.254.1.1) at ee:35:a3:c4:21:b7 [ether] PERM  <-- PERM = permanent, manually installed entry
^
|
+-- This MAC belongs to the veth on the HOST side
Three things to notice: the pod’s eth0 has a /32 subnet mask, the default gateway is 169.254.1.1 (a link-local address), and the ARP entry for that gateway is a permanent manual entry pointing to the host-side veth’s MAC.
These three pieces form a system. Each one exists for a specific reason.
Why /32?
An interface with a /24 (say 10.0.3.42/24) tells the kernel “there are 254 other hosts on this subnet, ARP for their MAC and send directly.” The pod would broadcast ARP on the veth, bypassing the host’s routing tables.
A /32 means the subnet contains exactly one IP, the pod itself. No other IP is “on-link.” The kernel routes every packet through the default gateway, which delivers it to the host-side veth. The node’s policy-based routing tables take it from there.
Without the /32, pods could ARP for each other directly over the veth, bypassing host routing. That would break VPC CNI’s control over traffic paths.
Why a link-local address?
A link-local address (169.254.0.0/16 for IPv4, fe80::/10 for IPv6) is only valid on a single network link. Routers will never forward it. You’ve seen this range before: Windows/Mac APIPA fallback when DHCP fails, and the AWS metadata service at 169.254.169.254.
VPC CNI uses 169.254.1.1 because it can't collide with a VPC address (VPC CIDRs normally come from the RFC 1918 ranges: 10/8, 172.16/12, 192.168/16), it can't leak beyond the veth pair, and it needs no coordination: every pod on every node uses the same address.
Nobody actually “owns” 169.254.1.1. The host-side veth has no IP at all. The trick works entirely through the PERM ARP entry: the kernel looks up 169.254.1.1 in its cache, finds the veth MAC, and sends the frame there. No ARP exchange happens on the wire. Running arping -I eth0 169.254.1.1 from inside the pod returns zero responses. Normal traffic works fine because the kernel uses the cache, not the wire.
On the node
The node has multiple routing tables. Policy-based routing uses the source IP of the traffic to pick which table to consult.
+--- Node Routing Architecture ----------------------------------------+
| |
| Policy-Based Routing Table (ip rule) |
| +---------------------------------------------------------------+ |
| | from 10.0.3.42 lookup main <-- Pod 51 (sec IP on ENI-0) | |
| | from 10.0.3.55 lookup 2 <-- Pod 61 (sec IP on ENI-2) | |
| | from all lookup main | |
| +----------+----------------------------+-----------------------+ |
| | | |
| v v |
| +--- Main Routing Table ---+ +--- Routing Table 2 -------+ |
| | | | | |
| | 10.0.3.42 dev veth1 | | default via 10.0.0.1 | |
| | 10.0.3.55 dev veth2 | | dev eni2 | |
| | 10.0.0.0/24 dev eni0 | | (single entry!) | |
| | default via 10.0.0.1 | | | |
| | dev eni0 | +----------------------------+ |
| +---------------------------+ |
| |
| The main table has routes to local pods AND the subnet. |
| Table 2 only has a default gateway. Traffic from pods on |
| ENI-2 always hits the VPC router, even for same-subnet |
| destinations. |
+----------------------------------------------------------------------+
5. Packet walks
5a. Ingress to a pod
Traffic arrives at the node for Pod 61 (a secondary IP on ENI 2):
+--------------- Node -------------------+
| |
Incoming traffic | Policy Routing Main Routing Table |
dst: Pod 61 IP ->-| -------------> ------------------> |
| "lookup main" "Pod 61 -> veth2" |
| | |
| v |
| +--------------+ |
| | Pod 61 | |
| | (via veth2) | |
| +--------------+ |
+----------------------------------------+
Policy routing says “look up main table.” Main table has a /32 route for Pod 61 pointing at its veth.
5b. Pod egress to the VPC
Pod 61 sends traffic somewhere in the VPC:
+---------------- Node ------------------------------------------+
| |
| +--------+ Policy Routing Routing Table 2 |
| | Pod 61 |-->---------------> ------------------> |
| | | "from Pod61 IP "default gw via ENI-2" |
| +--------+ lookup table 2" | |
| v |
| +----------+ |
| | ENI-2 |----> VPC Router
| +----------+ |
| |
+----------------------------------------------------------------+
Source IP is Pod 61’s, so policy routing sends it to table 2. Table 2 only has a default gateway through ENI-2. Traffic always goes to the VPC router, even if the destination is on the same subnet. There’s no other route in that table.
5c. Pod-to-pod, same node
Pod 51 talks to Pod 61, both on the same node:
+------------------------- Node --------------------------------+
| |
| +--------+ +--------+ |
| | Pod 51 | | Pod 61 | |
| +---+----+ +---^----+ |
| | src MAC: Pod51 MAC | |
| | dst MAC: veth1 MAC | |
| v | |
| +--------+ Policy Main Table +--------+ | |
| | veth1 |-->Routing-->Pod61->veth2->| veth2 |----+ |
| +--------+ "lookup +--------+ |
| main" src MAC: veth2 MAC |
| dst MAC: Pod61 MAC |
| |
| No ENIs involved -- traffic stays within the node |
+---------------------------------------------------------------+
Main table, /32 route for Pod 61, forwarded through its veth. Never touches an ENI.
5d. Pod-to-pod, across nodes
Pod 51 on Node A to Pod 81 on Node B:
+----------- Node A ----------------+ +----------- Node B -----------------+
| | | |
| +--------+ | | +--------+ |
| | Pod 51 | | | | Pod 81 | |
| +---+----+ | | +---^----+ |
| | | | | |
| v | | | |
| Policy -> Main Table | | Policy -> Main Table | |
| "Pod81 IP on same subnet | | "Pod81 -> veth" | |
| as ENI-0 -> forward via ENI-0" | | | |
| | | | +--------+ | |
| v | | | veth |-----------+ |
| +----------+ | | +----^---+ |
| | ENI-0 |-------------------+---(VPC)--+----------+ |
| +----------+ | | +--------+ |
| | | | ENI-0 | |
| src MAC: NodeA ENI-0 MAC | | +--------+ |
| dst MAC: NodeB ENI-0 MAC | | |
+------------------------------------+ | src MAC: veth MAC |
| dst MAC: Pod81 MAC |
+-----------------------------------+
Node A’s main table sees Pod 81’s IP on the same subnet as ENI-0, forwards via ENI-0 across the VPC. Node B receives it, main table finds the /32 route for Pod 81, delivers through the veth.
5e. The return path (secondary ENI asymmetry)
Pod 81 responds.
+----------- Node B ----------------+ +----------- Node A -----------------+
| | | |
| +--------+ | | +--------+ |
| | Pod 81 | | | | Pod 51 | |
| +---+----+ | | +---^----+ |
| | | | | |
| v | | Policy -> Main Table | |
| Policy -> Table 2 | | "Pod51 -> veth" | |
| "default gw via ENI-2" <--! | | | |
| | | | | |
| v | | | |
| +----------+ | | +----------+ | |
| | ENI-2 |-----+ | | | ENI-0 |----------+ |
| +----------+ | | | +-----^----+ |
| | | | | |
+----------------------+-------------+ +-----------+-----------------------+
| |
v |
+--------------+ |
| VPC Router |-----------------------------+
| (default gw) |
+--------------+
Even though Pod 81 and Pod 51 are on the SAME SUBNET,
traffic goes through the VPC router because routing
table 2 only has a default gateway entry!
Pod 81 is on a secondary ENI, so its traffic uses table 2. Table 2 only knows the default gateway. Even though Pod 51 is on the same subnet, the response goes through the VPC router. Extra hop, but transparent.
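The table lookups driving walks 5a-5e condense into two steps: a rule picks a table by source IP, then the table does longest-prefix match on the destination. A toy model (addresses from the examples above; interface names illustrative; real kernels use fib tries, not dict scans):

```python
import ipaddress

RULES = [                       # (source prefix or None, table) in priority order
    ("10.0.3.42/32", "main"),   # pod with a secondary IP on the primary ENI
    ("10.0.3.55/32", "2"),      # pod with a secondary IP on a secondary ENI
    (None, "main"),             # from all lookup main
]
TABLES = {
    "main": {
        "10.0.3.42/32": "veth1",
        "10.0.3.55/32": "veth2",
        "10.0.0.0/24": "eni0",
        "0.0.0.0/0": "eni0 via 10.0.0.1",
    },
    "2": {"0.0.0.0/0": "eni2 via 10.0.0.1"},  # single entry!
}

def route(src: str, dst: str) -> str:
    src_ip, dst_ip = ipaddress.ip_address(src), ipaddress.ip_address(dst)
    # policy routing: first rule whose source prefix matches picks the table
    table = next(t for m, t in RULES
                 if m is None or src_ip in ipaddress.ip_network(m))
    # longest-prefix match on the destination within that table
    best = max((p for p in TABLES[table]
                if dst_ip in ipaddress.ip_network(p)),
               key=lambda p: ipaddress.ip_network(p).prefixlen)
    return TABLES[table][best]

print(route("10.0.3.42", "10.0.3.55"))  # veth2: stays on the node (walk 5c)
print(route("10.0.3.55", "10.0.3.42"))  # eni2 via 10.0.0.1: table 2 only has a
                                        # default route (the 5b/5e asymmetry)
```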
6. Kubernetes Services
Pods are ephemeral. They die, get recreated, and come back with new IPs; in a large cluster this can happen thousands of times a day. Clients can't be expected to track individual pod IPs.
A Kubernetes Service groups pods by label selectors and gives them a stable virtual IP (the “ClusterIP”). An endpoints controller keeps the backing pod list current.
+----------------------+
| Kubernetes Service |
| name: app1-service |
| VIP: 172.20.0.100 |
| selector: name=app1 |
+----------+-----------+
|
+-----------+-----------+
| | |
+----v---+ +----v---+ +----v---+
| Pod | | Pod | | Pod |
| app1 | | app1 | | app1 |
|10.0.1.5| |10.0.2.8| |10.0.3.2|
|(ep 1) | |(ep 2) | |(ep 3) |
+--------+ +--------+ +--------+
Node A Node B Node C
Three service types, each building on the previous:
| Type | What it does |
|---|---|
| ClusterIP | Virtual IP reachable only inside the cluster |
| NodePort | Opens a port on every node, forwards to the service. Built on ClusterIP. |
| LoadBalancer | Provisions a cloud LB in front of NodePort. Built on NodePort. |
7. ClusterIP
When you create a ClusterIP service, kube-proxy watches the API server and programs iptables rules for the service’s VIP on every node. Kubernetes DNS assigns a name like app1-service.default.svc.cluster.local.
Node 51 Node 71
+-----------------------------------------+ +----------------------+
| | | |
| +-------+ +-------------------------+ | | +-------+ |
| |Pod 51 |>| iptables | | | |Pod 71 | (part of |
| |(app2) | | | | | |(app1) | app1 svc) |
| +-------+ | 1. Load balance: | | | +---^---+ |
| dst: | pick Pod 71 IP | | | | |
| SVC VIP | (round-robin, may | | | | |
| | pick ANY pod, even | | | +---+---------------+|
| | local ones are not | | | | Node 71 forwards ||
| | preferred!) | | | | to local Pod 71 ||
| | | | | +-------------------+|
| | 2. DNAT: | | | |
| | dst: VIP -> Pod71 IP | | +---+------------------+
| | | | ^
| | 3. Mark flow: | | |
| | (stateful tracking | | +---+
| | for return traffic) | | |
| +----------+--------------+ | |
| | | |
| v | |
| Forward to Node 71 ------+---+
| dst IP: Pod 71 IP |
| |
+----------------------------------------+
iptables picks a backend pod (round-robin, no preference for local pods), DNATs the destination from the VIP to the pod IP, and marks the flow for stateful tracking.
Pod 51 doesn't know the VIP ahead of time. It resolves the service's DNS name first, and that DNS query itself goes through a ClusterIP service (kube-dns):
Pod 51 --DNS query--> kube-dns Service VIP --iptables DNAT--> CoreDNS Pod
|
Pod 51 <--DNS response (Service VIP: 172.20.0.100)----------------+
On the return path, iptables on Node 51 matches the response as return traffic (stateful match) and SNATs the source IP from Pod 71 back to the service VIP. Pod 51 never sees Pod 71’s IP.
Node 71 Node 51
+------------------+ +---------------------------------------------+
| | | |
| +-------+ | | +-----------------------------+ +-------+|
| |Pod 71 |-------+----+--->| iptables |>|Pod 51 ||
| | | | | | | | ||
| +-------+ | | | 1. Identify return traffic | +-------+|
| src: Pod71 IP | | | (stateful match) | |
| dst: Pod51 IP | | | | |
| | | | 2. SNAT: | |
+------------------+ | | src IP: Pod71 -> SVC VIP | |
| | (Pod 51 thinks it's | |
| | talking to the VIP, | |
| | not Pod 71 directly) | |
| +-----------------------------+ |
+---------------------------------------------+
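The DNAT-out, un-DNAT-back dance is just a lookup table keyed on the flow. A toy model (VIP and backend IPs are the example values; real state lives in netfilter's conntrack, not a Python dict):

```python
import random

VIP = "172.20.0.100"
BACKENDS = ["10.0.2.127", "10.0.3.193"]
conntrack = {}   # (client_ip, client_port) -> backend chosen for that flow

def outbound(src, sport, dst, dport):
    """Client -> VIP: pick a backend once per flow and DNAT the destination."""
    if dst != VIP:
        return src, sport, dst, dport
    backend = conntrack.setdefault((src, sport), random.choice(BACKENDS))
    return src, sport, backend, dport           # dst rewritten to the pod IP

def inbound(src, sport, dst, dport):
    """Backend -> client reply: stateful match, rewrite src back to the VIP."""
    if conntrack.get((dst, dport)) == src:
        return VIP, sport, dst, dport           # client only ever sees the VIP
    return src, sport, dst, dport

pkt = outbound("10.0.1.51", 59432, VIP, 80)     # request leaves the client pod
reply = inbound(pkt[2], 80, "10.0.1.51", 59432) # backend answers
print(reply[0])  # 172.20.0.100 -- the VIP, not the backend's real IP
```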
kube-proxy can also run in IPVS mode for better load-balancing behavior at scale, or be replaced entirely by eBPF-based dataplanes such as Cilium.
8. NodePort
NodePort exposes applications externally, mostly for testing. It builds on ClusterIP: kube-proxy additionally programs iptables rules that match traffic arriving on a dedicated port (allocated from 30000-32767) on every node and forward it to the service.
Node 51 (receives traffic) Node 71
+--------------------------------------------+ +------------------+
| | | |
| External +--------------------------------+| | +-------+ |
| Client ->| iptables (4 tasks) || | |Pod 71 | |
| | || | |(app1) | |
| dst: | 1. Load balance -> Pod 71 IP || | +---^---+ |
| Node51 | (may pick remote pod even || | | |
| :31234 | if local pod exists!) || | | |
| | || +-----+------------+
| | 2. DNAT: dst -> Pod 71 IP || |
| | || |
| | 3. SNAT: src -> Node 51 IP |+--------+
| | (flow symmetry! without ||
| | this, Pod71 would respond ||
| | directly to client from ||
| | a different IP, breaking ||
| | the connection) ||
| | ||
| | 4. Mark flow (stateful) ||
| +--------------------------------+|
+--------------------------------------------+
The difference from ClusterIP: iptables now does four things instead of three. The extra one is SNAT, rewriting the source IP to the node’s IP. Without it, Pod 71 would respond directly to the client from a different IP, and the client would drop the response:
WITHOUT SNAT (broken): WITH SNAT (works):
Client --> Node51:31234 Client --> Node51:31234
| dst NAT to Pod71 | dst NAT to Pod71
v | src NAT to Node51 IP
Pod71 v
| Pod71
| responds to Client IP |
v src: Pod71 IP X BROKEN! | responds to Node51 IP
Client sees response from v
unknown IP -- drops it! Node51
| reverse NAT
v
Client sees response from
Node51:31234 OK works!
The client IP is always SNATed away, even when the destination pod is on the same node. The application never sees the real client IP.
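The flow-symmetry argument reduces to one check: a client's connection state only accepts replies from the exact address it sent to. A toy version (the node/pod addresses are illustrative):

```python
def client_accepts(sent_to: str, reply_from: str) -> bool:
    # Rough model of the client's connection tracking: a reply is only
    # matched to the connection if it comes from the address we targeted.
    return sent_to == reply_from

NODE51 = "10.0.0.75:31234"   # illustrative node IP + NodePort
POD71 = "10.0.3.193:8080"    # illustrative pod address

# Without SNAT, Pod 71 replies straight to the client from its own
# address -- a flow the client never opened:
print(client_accepts(NODE51, POD71))    # False: the client drops it
# With SNAT, the reply is forced back through Node 51, which reverses
# both NATs, so the reply appears to come from Node51:31234:
print(client_accepts(NODE51, NODE51))   # True
```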
The operational problem: you need to track node IPs (which change as nodes fail and get replaced) and distribute traffic across them. You need a load balancer.
9. LoadBalancer
LoadBalancer builds on NodePort. A service controller provisions a cloud load balancer that forwards to the NodePort on each node.
AWS has two controllers:
| Controller | Source | Provisions |
|---|---|---|
| Service controller | Built into K8s | CLB (legacy) or NLB |
| AWS Load Balancer Controller | K8s SIG project on GitHub | NLB + ALB, Target Type IP, Ingress |
Default behavior (target type = instance)
+------------ PROBLEM ---------------+
| Node 51 has NO pods for this |
| service, but NLB thinks it's |
| healthy because the NodePort |
| health check passes on ALL nodes |
+------------------------------------+
Client NLB Node 51 Node 71
| | | |
|--- request ----->| | |
| dst: NLB IP | | |
| port: 80 | | |
| |-- forward --------->| |
| | dst: Node51 IP | |
| | port: 31234 | |
| | (NodePort) | |
| | | iptables: |
| | | 1. LB -> Pod71 IP |
| | | 2. DNAT dst -> Pod71 |
| | | 3. SNAT src -> Node51 |
| | | 4. Mark flow |
| | | |
| | |---- forward ---------> |
| | | | -> Pod 71
| | | |
| | |<--- response --------- |
| |<-- response --------| |
| | (after iptables | |
|<-- response -----| reverse NAT) | |
| | | |
NLB health-checks the NodePort, and all nodes pass, even ones with no pods for that service. Traffic can land on Node 51 (no local pod) and bounce to Node 71 (has the pod). This is traffic tromboning: extra hops, latency, cross-AZ data transfer charges, and the client IP gets SNATed.
Fix 1: externalTrafficPolicy: Local
spec:
externalTrafficPolicy: Local
Client NLB Node 51 Node 71
| | | |
|--- request ----->| | |
| | | |
| | health check | |
| | (NodePort) -------->| PASS |
| | | |
| | health check | |
| | (EXTRA port) ------>| FAIL |
| | "any local pods?" | (no local pods) |
| | | |
| | health check | |
| | (EXTRA port) -------+------------------>| PASS
| | | | (has Pod71)
| | | |
| |-- forward directly --+------------------>|
| | (skips Node 51!) | |
| | | | iptables:
| | | | DNAT only
| | | | (no SNAT!)
| | | | -> Pod 71
| | | |
|<-----------------+----------------------+--- response ----- |
| | | client IP visible|
NLB runs an additional health check on a different port. Nodes without local pods fail it. iptables only forwards to local pods and skips SNAT, so the application sees the real client IP.
Trade-off: uneven pod distribution means uneven traffic distribution.
Fix 2: Target Type IP (requires AWS Load Balancer Controller)
NLB targets pods directly instead of nodes.
Client NLB Node 71
| | |
|--- request ----->| |
| dst: NLB IP | |
| | |
| | Target Group: |
| | +----------------------+ |
| | | Pod71 IP healthy | |
| | | Pod81 IP healthy | |
| | | Pod91 IP healthy | |
| | | (pods, NOT nodes!) | |
| | +----------------------+ |
| | |
| |------ forward ---------------->|
| | dst IP: Pod71 IP directly |
| | |
| | iptables does NOTHING |
| | (dst IP already is |
| | Pod71 IP) |
| | | -> Pod 71
| | |
|<-----------------+-------- response ------------- |
| | |
No NodePort, no iptables. The destination IP is already the pod IP, so the node just forwards through the veth. Source IP is the NLB’s by default; enable preserve client IP via annotation if needed.
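A sketch of a Service requesting IP targets (annotations per the AWS Load Balancer Controller documentation; the service name and ports are illustrative, reusing the echoserver selector from the appendix):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: echo-nlb-ip          # illustrative
  annotations:
    # hand the Service to the AWS Load Balancer Controller
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    # register pod IPs, not instances
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
    # optional: keep the client's source IP at L4
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: preserve_client_ip.enabled=true
spec:
  type: LoadBalancer
  selector:
    app: echoserver
  ports:
  - port: 80
    targetPort: 8080
```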
Watch ELB service quotas. Thousands of pods means thousands of targets in a single target group.
10. Ingress (Layer 7)
Everything above is Layer 4. Kubernetes Services don’t understand HTTP paths or hostnames.
Ingress handles Layer 7 routing. You define rules like /order goes to order-service, /rating goes to rating-service:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress          # illustrative name
spec:
  rules:
  - http:
      paths:
      - path: /order
        pathType: Prefix
        backend:
          service:
            name: order-service
            port:
              number: 80     # backend port assumed for illustration
      - path: /rating
        pathType: Prefix
        backend:
          service:
            name: rating-service
            port:
              number: 80
Kubernetes expects an Ingress controller to provision a load balancer that implements these rules. The AWS Load Balancer Controller provisions an ALB.
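The controller keys off the ingress class and annotations. A minimal sketch (annotation names per the AWS Load Balancer Controller documentation; the Ingress name is illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress          # illustrative
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip   # ALB -> pod IPs directly
spec:
  ingressClassName: alb
  # path rules as above
```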
With Target Type IP, the ALB sends traffic directly to pods:
Client ALB Node 71
| | |
|-- GET /order --->| |
| | |
| | L7 routing decision: |
| | /order -> order-service |
| | pick Pod 71 from targets |
| | |
| |---- forward ---------------->|
| | dst: Pod71 IP |
| | src: ALB IP (always SNAT!) |
| | |
| | iptables: does NOTHING |
| | | -> Pod 71
| | |
|<-- response -----|<---- response -------------- |
| src: ALB IP | |
ALB always SNATs. You can’t preserve the client IP at L4, but X-Forwarded-For headers carry it at L7.
Full end-to-end: DNS to pod
Client Route 53 Internet GW ALB Pod
| | | | |
| DNS: portal. | | | |
| example.com ---->| | | |
| | | | |
|<- alias record --| | | |
| returns ALB | | | |
| public IPs: | | | |
| [IP1,IP2,IP3] | | | |
| | | | |
| picks IP3 | | |
|-------- request (dst: IP3) ----->| | |
| | | |
| | DNAT: | |
| | dst: Public IP3| |
| | -> ALB Private| |
| | IP (AZ-c) | |
| | | |
| |--- forward --->| |
| | | |
| | | L7 routing |
| | | decision |
| | | |
| | | Cross-zone LB ON |
| | | may pick pod in |
| | | ANY AZ |
| | | |
| | |--- forward ----->|
| | | dst: Pod IP |
| | | src: ALB IP |
| | | |
|<------------------------- response ----------------------------------|
Route 53 returns ALB public IPs. The Internet Gateway DNATs the public IP to the ALB’s private IP. Cross-zone load balancing is on by default, so the ALB may forward to any AZ. Disable it for AZ-local routing.
11. Pod egress
By default (AWS_VPC_K8S_CNI_EXTERNALSNAT=false), VPC CNI applies SNAT on the node itself for all pod traffic leaving the VPC CIDR. The pod IP gets rewritten to the node’s primary ENI IP before the packet even leaves the node:
+------------ Node ----------------------+
| |
| +---------------+ |
| | Pod | |
| | 192.168.1.51 | |
| +-------+-------+ |
| | dst: 8.8.8.8 (internet) | Internet Gateway
| | src: 192.168.1.51 | |
| v | |
| +---------------+ | |
| | VPC CNI | | |
| | SNAT #1 | | |
| | src: .51 -> | | |
| | .50 (node | | |
| | primary ENI | | |
| | primary IP) | | |
| +-------+-------+ | |
| | src: 192.168.1.50 | |
| v | |
| +----------+ | |
| | ENI-0 |-----------------------+------------>|
| | (primary)| | | SNAT #2
| +----------+ | | src: .50 -> Public IP
| | | (associated with .50)
+----------------------------------------+ |
+-------> Internet
| src: Public IP
| dst: 8.8.8.8
Two-stage SNAT: VPC CNI rewrites the pod IP to the node’s primary ENI IP, then the IGW/NAT Gateway rewrites it to the public IP.
AWS_VPC_K8S_CNI_EXTERNALSNAT
The default (false) means VPC CNI does SNAT on the node. All pods on a node share that node’s IP for outbound traffic. External services see one IP per node, not per pod.
Setting AWS_VPC_K8S_CNI_EXTERNALSNAT=true disables the node-level SNAT. The pod’s real IP survives all the way to the NAT Gateway, which does the only SNAT (pod IP directly to public IP). The node doesn’t rewrite anything.
EXTERNALSNAT=false (default): EXTERNALSNAT=true:
Pod (.51) --> Node SNAT (.50) --> Pod (.51) --> Node (no SNAT) -->
NAT GW SNAT (public IP) --> NAT GW SNAT (public IP) -->
Internet Internet
External sees: node IP External sees: pod IP
(all pods on this node (until NAT GW, where it
look the same) becomes the public IP)
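The node's stage-1 decision can be sketched in a few lines (a simplified model: the real rules are netfilter chains, the VPC CIDR and IPs here are the example values, and exclusions are configurable):

```python
import ipaddress

VPC_CIDR = ipaddress.ip_network("192.168.0.0/16")
NODE_PRIMARY_IP = "192.168.1.50"

def egress_src(pod_ip: str, dst: str, external_snat: bool = False) -> str:
    """Source IP of a pod's packet as it leaves the node."""
    dst_in_vpc = ipaddress.ip_address(dst) in VPC_CIDR
    if external_snat or dst_in_vpc:
        return pod_ip            # pod IP survives to the VPC / NAT Gateway
    return NODE_PRIMARY_IP       # stage-1 SNAT: rewritten on the node

print(egress_src("192.168.1.51", "8.8.8.8"))                      # 192.168.1.50 (node IP)
print(egress_src("192.168.1.51", "8.8.8.8", external_snat=True))  # 192.168.1.51 (pod IP)
print(egress_src("192.168.1.51", "192.168.2.10"))                 # 192.168.1.51 (in-VPC, never SNATed)
```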
When EXTERNALSNAT=true:
- Pod IPs are visible in VPC flow logs and security group tracking, making it easier to trace which pod is talking to what
- External APIs that rate-limit by source IP won’t treat all pods on a node as one client
- Service meshes (Istio, Linkerd) work correctly since they expect pod IPs in the traffic
- You need your VPC routing and NAT Gateway set up to handle pod CIDR traffic (the pod IPs need a route to the NAT Gateway)
# check current setting
kubectl get daemonset aws-node -n kube-system -o json | \
jq '.spec.template.spec.containers[0].env[] | select(.name=="AWS_VPC_K8S_CNI_EXTERNALSNAT")'
# enable it
kubectl set env daemonset -n kube-system aws-node AWS_VPC_K8S_CNI_EXTERNALSNAT=true
The experiments in the appendix were run with EXTERNALSNAT=true. That’s why the checkip.amazonaws.com result shows the NAT Gateway’s public IP directly, not the node IP — VPC CNI isn’t doing node-level SNAT.
Summary
Inbound external traffic, from least to most efficient:
LEAST EFFICIENT MOST EFFICIENT
<--------------------------------------------------------------------------->
LoadBalancer + externalTraffic + Target Type IP Ingress + ALB
(default) Policy: Local (NLB) + Target IP
+----------+ +----------+ +----------+ +----------+
| NLB | | NLB | | NLB | | ALB |
| | | | | | | | | | | |
| Node | | Node | | Pod | | Pod |
| (any!) | | (w/ pod) | | directly | | directly |
| | | | | | | | | |
| iptables | | iptables | | no | | no |
| DNAT+SNAT| | DNAT only| | iptables | | iptables |
| | | | | | | | | |
| Node | | Pod | +----------+ +----------+
| (maybe) | | (local) |
| | | | | Client IP: Client IP:
| Pod | +----------+ via annotation via X-Forwarded-For
+----------+
Client IP:
Client IP: visible
SNATed away
| Approach | Hops | SNAT? | Client IP visible? | iptables? |
|---|---|---|---|---|
| LoadBalancer (default) | NLB -> Node -> maybe another Node -> Pod | DNAT+SNAT | No | Yes |
| + externalTrafficPolicy: Local | NLB -> Node (with pod) -> Pod | DNAT only | Yes | Yes |
| + Target Type IP (NLB) | NLB -> Pod | NLB IP | Via annotation | No |
| Ingress + ALB + Target IP | ALB -> Pod | ALB IP | Via X-Forwarded-For | No |
Appendix: Experiments on a live cluster
Everything above was verified on a live EKS cluster using two netshoot toolbox pods and an echoserver deployment. If you want to follow along, apply these manifests:
tools.yaml — namespace + netshoot toolbox Deployment for debugging (the experiments below use two copies, toolbox-a and toolbox-b, scheduled on different nodes):
---
apiVersion: v1
kind: Namespace
metadata:
name: packetlab
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: toolbox
namespace: packetlab
spec:
replicas: 1
selector:
matchLabels:
app: toolbox
template:
metadata:
labels:
app: toolbox
spec:
containers:
- name: netshoot
image: ghcr.io/nicolaka/netshoot:latest
imagePullPolicy: IfNotPresent
command: ["sleep", "infinity"]
securityContext:
allowPrivilegeEscalation: true
capabilities:
add: ["NET_ADMIN", "NET_RAW"]
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "500m"
memory: "512Mi"
echoserver.yaml — echo server with ClusterIP and NodePort services:
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: echoserver
namespace: packetlab
spec:
replicas: 2
selector:
matchLabels:
app: echoserver
template:
metadata:
labels:
app: echoserver
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: echoserver
containers:
- name: echoserver
image: registry.k8s.io/echoserver:1.10
ports:
- containerPort: 8080
resources:
requests:
cpu: "50m"
memory: "64Mi"
limits:
cpu: "200m"
memory: "128Mi"
---
apiVersion: v1
kind: Service
metadata:
name: echo-clusterip
namespace: packetlab
spec:
type: ClusterIP
selector:
app: echoserver
ports:
- port: 80
targetPort: 8080
protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
name: echo-nodeport
namespace: packetlab
spec:
type: NodePort
selector:
app: echoserver
ports:
- port: 80
targetPort: 8080
protocol: TCP
$ kubectl apply -f tools.yaml
$ kubectl apply -f echoserver.yaml
$ kubectl -n packetlab get pods -o wide
NAME READY STATUS IP NODE
toolbox-a-6fb5b9b786-lszr4 1/1 Running 10.0.1.51 ip-10-0-0-75.ec2.internal
toolbox-b-754f9d4d47-qbnzl 1/1 Running 10.0.2.29 ip-10-0-0-49.ec2.internal
echoserver-545755bc68-v5kdv 1/1 Running 10.0.3.193 ip-10-0-1-228.ec2.internal
- Toolbox A: 10.0.1.51 on node 10.0.0.75
- Toolbox B: 10.0.2.29 on node 10.0.0.49
- Echoserver: 10.0.3.193 on node 10.0.1.228
Experiment 1: Pod interfaces, routes, and ARP
$ kubectl -n packetlab exec toolbox-a-6fb5b9b786-lszr4 -- ip addr show
3: eth0@if336: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default
link/ether 0a:12:c5:d0:e4:e1 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.0.1.51/32 scope global eth0 <-- /32!
$ kubectl -n packetlab exec toolbox-a-6fb5b9b786-lszr4 -- ip route show
default via 169.254.1.1 dev eth0
169.254.1.1 dev eth0 scope link
$ kubectl -n packetlab exec toolbox-a-6fb5b9b786-lszr4 -- arp -a
? (169.254.1.1) at 5a:f3:56:0b:86:d4 [ether] PERM on eth0
$ kubectl -n packetlab exec toolbox-a-6fb5b9b786-lszr4 -- cat /etc/resolv.conf
search packetlab.svc.cluster.local svc.cluster.local cluster.local ec2.internal
nameserver 172.20.0.10
options ndots:5
/32 on eth0, 169.254.1.1 fake gateway, PERM ARP entry with MAC 5a:f3:56:0b:86:d4, kube-dns at 172.20.0.10.
Experiment 2: Node-side veth, ENIs, and routing tables
SSH to the node:
$ ip link show eniaee705c1f69
336: eniaee705c1f69@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP
link/ether 5a:f3:56:0b:86:d4 brd ff:ff:ff:ff:ff:ff link-netns cni-01fc068f-c24b-470b-7551-ac4d8bc24a67
MAC 5a:f3:56:0b:86:d4 matches the pod’s ARP entry. The pod sends frames here thinking it’s the gateway.
$ ip addr show ens5 # Primary ENI: 10.0.0.75 (main table)
$ ip addr show ens6 # Secondary ENI: 10.0.0.85 (table 2)
$ ip addr show ens7 # Third ENI: 10.0.0.84 (table 3)
| ENI | Interface | IP | Routing Table |
|---|---|---|---|
| Primary | ens5 | 10.0.0.75 | main |
| Secondary | ens6 | 10.0.0.85 | 2 |
| Third | ens7 | 10.0.0.84 | 3 |
$ ip rule show
512: from all to 10.0.1.51 lookup main <-- ingress TO our pod
1536: from 10.0.1.51 lookup 2 <-- egress FROM our pod
$ ip route show table main
default via 10.0.0.1 dev ens5 proto dhcp src 10.0.0.75 metric 512
10.0.0.0/24 dev ens5 proto kernel scope link src 10.0.0.75 metric 512
10.0.1.51 dev eniaee705c1f69 scope link <-- our pod's veth
$ ip route show table 2
default via 10.0.0.1 dev ens6
10.0.0.1 dev ens6 scope link
Table 2 has just two entries: the default gateway via ens6 and a link route to the gateway itself. Pods on secondary ENIs always go through the VPC router.
Experiment 3: Pod-to-pod connectivity
$ kubectl -n packetlab exec toolbox-a-6fb5b9b786-lszr4 -- ping -c 3 10.0.0.75
64 bytes from 10.0.0.75: icmp_seq=1 ttl=127 time=0.062 ms <-- same node, ~0.06ms
$ kubectl -n packetlab exec toolbox-a-6fb5b9b786-lszr4 -- ping -c 3 10.0.2.29
64 bytes from 10.0.2.29: icmp_seq=1 ttl=125 time=1.67 ms <-- cross-node, ~0.4ms
64 bytes from 10.0.2.29: icmp_seq=2 ttl=125 time=0.424 ms
TTL 125 = three hops (pod to host, host to VPC router, VPC router to remote host).
Experiment 4: DNS through ClusterIP
$ kubectl get svc -n kube-system kube-dns -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) SELECTOR
kube-dns ClusterIP 172.20.0.10 <none> 53/UDP,53/TCP,9153/TCP k8s-app=kube-dns
$ kubectl -n packetlab exec toolbox-a-6fb5b9b786-lszr4 -- dig +short kubernetes.default.svc.cluster.local
172.20.0.1
iptables on the node:
$ sudo iptables -t nat -L KUBE-SVC-TCOU7JCQXEZGVUNU -n
Chain KUBE-SVC-TCOU7JCQXEZGVUNU (1 references)
target prot opt source destination
KUBE-SEP-43APF7RZ6NYHDOWA all -- 0.0.0.0/0 0.0.0.0/0 /* -> 10.0.1.181:53 */ statistic mode random probability 0.50000000000
KUBE-SEP-5PFK5H2IGDZ73DVD all -- 0.0.0.0/0 0.0.0.0/0 /* -> 10.0.3.231:53 */
50/50 random split across two CoreDNS pods.
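The 0.5 on the first rule and the bare catch-all on the second are not arbitrary: kube-proxy gives rule i (0-based, N endpoints) probability 1/(N-i), which works out to a uniform split. A sketch of that sizing:

```python
# How kube-proxy sizes `statistic mode random probability` values so N
# endpoint rules split traffic uniformly: rule i fires with 1/(N-i),
# and the final rule (probability 1.0) is an unconditional catch-all.
def rule_probabilities(n_endpoints: int) -> list[float]:
    return [1 / (n_endpoints - i) for i in range(n_endpoints)]

print(rule_probabilities(2))  # [0.5, 1.0] -- matches the two CoreDNS rules
print(rule_probabilities(3))  # first rule 1/3, then 1/2, then catch-all
```

Each endpoint ends up with probability 1/N: e.g. for N=3, the second rule is reached with probability 2/3 and fires with probability 1/2, i.e. 1/3 overall.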
Experiment 5: ClusterIP service
$ kubectl -n packetlab get svc echo-clusterip -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) SELECTOR
echo-clusterip ClusterIP 172.20.122.189 <none> 80/TCP app=echoserver
$ kubectl -n packetlab exec toolbox-a-6fb5b9b786-lszr4 -- curl -s http://echo-clusterip.packetlab.svc.cluster.local
Request Information:
client_address=10.0.1.51 <-- real pod IP, no SNAT
iptables DNAT chain:
$ sudo iptables -t nat -L KUBE-SVC-B23I5IPCVQPNADR7 -n
KUBE-SEP-MNWX5K5OIII64PTW all -- 0.0.0.0/0 0.0.0.0/0 /* -> 10.0.2.127:8080 */ statistic mode random probability 0.50000000000
KUBE-SEP-XAUWYAMA2TYGF5RO all -- 0.0.0.0/0 0.0.0.0/0 /* -> 10.0.3.193:8080 */
Conntrack after a request:
$ sudo conntrack -L | grep 172.20.122.189
tcp 6 60 TIME_WAIT
src=10.0.1.51 dst=172.20.122.189 sport=59432 dport=80
src=10.0.3.193 dst=10.0.1.51 sport=8080 dport=59432
[ASSURED] mark=0
Original: toolbox -> service VIP. Reply: echoserver -> toolbox. iptables reverse-NATs so the toolbox thinks it talked to the VIP.
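That reverse-NAT step is a lookup against the conntrack entry's two tuples. A toy model of the entry above (field names are illustrative, not conntrack's actual internals):

```python
# Toy conntrack entry for the DNATed flow: "orig" is what the client
# sent (to the service VIP); "reply" is what the backend sends back.
ENTRY = {
    "orig":  {"src": "10.0.1.51",  "dst": "172.20.122.189",
              "sport": 59432, "dport": 80},
    "reply": {"src": "10.0.3.193", "dst": "10.0.1.51",
              "sport": 8080,  "dport": 59432},
}

def unnat_reply(pkt: dict) -> dict:
    """Rewrite a reply so the client sees the service VIP, not the pod."""
    if (pkt["src"], pkt["sport"]) == (ENTRY["reply"]["src"],
                                      ENTRY["reply"]["sport"]):
        return {"src": ENTRY["orig"]["dst"], "sport": ENTRY["orig"]["dport"],
                "dst": pkt["dst"], "dport": pkt["dport"]}
    return pkt  # not part of this flow: leave untouched

out = unnat_reply({"src": "10.0.3.193", "dst": "10.0.1.51",
                   "sport": 8080, "dport": 59432})
print(out["src"], out["sport"])  # 172.20.122.189 80
```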
Experiment 6: NodePort service
$ kubectl -n packetlab get svc echo-nodeport -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) SELECTOR
echo-nodeport NodePort 172.20.152.84 <none> 80:32385/TCP app=echoserver
In iptables, KUBE-MARK-MASQ fires on all NodePort traffic (not just hairpin traffic, as with ClusterIP), marking packets so that KUBE-POSTROUTING SNATs them:
$ sudo iptables -t nat -L KUBE-POSTROUTING -n
RETURN all -- 0.0.0.0/0 0.0.0.0/0 mark match ! 0x4000/0x4000
MARK all -- 0.0.0.0/0 0.0.0.0/0 MARK xor 0x4000
MASQUERADE all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service traffic requiring SNAT */ random-fully
From the same node:
$ curl -s http://localhost:32385
client_address=10.0.0.75 <-- node IP, not us
From a different node:
$ ssh ip-10-0-0-49.ec2.internal -- curl -s http://10.0.0.75:32385
client_address=10.0.0.75 <-- still the receiving node
Both times echoserver sees the receiving node’s IP: the client’s source address is SNATed so replies flow back through the same node and the connection stays symmetric.
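The KUBE-POSTROUTING chain above is just bit arithmetic on the 0x4000 mark that KUBE-MARK-MASQ set earlier. A sketch of its three rules:

```python
# The 0x4000 mark dance in KUBE-POSTROUTING:
#   1. RETURN if the mark bit is absent (not Kubernetes SNAT traffic)
#   2. MARK xor 0x4000 -- clear the bit before the packet leaves
#   3. MASQUERADE to the node's IP
SNAT_MARK = 0x4000

def postrouting(mark: int) -> tuple[int, bool]:
    """Return (new fwmark, whether the packet gets masqueraded)."""
    if not (mark & SNAT_MARK):
        return mark, False          # RETURN: leave the packet alone
    mark ^= SNAT_MARK               # clear only the SNAT bit
    return mark, True               # MASQUERADE

print(postrouting(0x4000))  # (0, True)  -- marked: SNAT to node IP
print(postrouting(0x0))     # (0, False) -- unmarked: untouched
```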
Experiment 7: Pod egress
$ kubectl -n packetlab exec toolbox-a-6fb5b9b786-lszr4 -- curl -s https://checkip.amazonaws.com
44.218.31.211
The address is neither the pod IP nor the node’s private IP. Egress is two-stage SNAT: the VPC CNI rewrites the pod IP to the node’s primary ENI IP, then the internet gateway or NAT gateway rewrites that private IP to a public one.
$ sudo iptables -t mangle -L PREROUTING -n
CONNMARK all -- 0.0.0.0/0 0.0.0.0/0 /* AWS, primary ENI */ ADDRTYPE match dst-type LOCAL limit-in CONNMARK or 0x80
CONNMARK all -- 0.0.0.0/0 0.0.0.0/0 /* AWS, primary ENI */ CONNMARK restore mask 0x80
The VPC CNI marks these connections with 0x80; an ip rule matching fwmark 0x80/0x80 lookup main then routes return traffic through the main table.
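In bit terms, the two mangle rules do this (a sketch; 0x80 is the CNI's connection mark from the output above):

```python
# "CONNMARK or 0x80": set the bit on the connection when traffic
# arrives for a local (primary-ENI) address.
# "CONNMARK restore mask 0x80": copy only that bit back onto each
# packet's fwmark, so the `fwmark 0x80/0x80 lookup main` rule matches.
ENI_MARK = 0x80

def connmark_or(connmark: int) -> int:
    return connmark | ENI_MARK

def restore_to_packet(connmark: int, pkt_mark: int = 0) -> int:
    # Only the masked bit is copied; other fwmark bits are preserved.
    return (pkt_mark & ~ENI_MARK) | (connmark & ENI_MARK)

cm = connmark_or(0)
print(hex(restore_to_packet(cm)))  # 0x80 -> return traffic uses main table
```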
Experiment 8: Full node interface view
$ ip link show
2: ens5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 <-- Primary ENI
165: ens6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 <-- Secondary ENI
264: ens7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 <-- Third ENI
4: eni36f7c958f87@if3: ... <-- veth for pod
336: eniaee705c1f69@if3: ... <-- OUR TOOLBOX POD
Three ENIs (ensN), a veth pair per pod (eniXXX@if3). VPC CNI added secondary ENIs as pod count grew.
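That ENI growth is bounded by the instance type, which is where the VPC CNI's documented max-pods formula in default (secondary-IP) mode comes from: each ENI's first IP belongs to the node, and 2 is added for the host-network pods (aws-node, kube-proxy). A sketch, using t3.medium's documented limits (3 ENIs, 6 IPv4 addresses per ENI) as the example:

```python
# Max pods per node for the VPC CNI in secondary-IP mode:
#   ENIs * (IPv4 addresses per ENI - 1) + 2
# The "-1" excludes each ENI's primary IP (owned by the node); the
# "+2" accounts for pods running in the host network namespace.
def max_pods(enis: int, ips_per_eni: int) -> int:
    return enis * (ips_per_eni - 1) + 2

print(max_pods(3, 6))   # t3.medium: the well-known limit of 17
print(max_pods(3, 10))  # m5.large: 29
```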
What we verified
| What | Experiment | Result |
|---|---|---|
| /32 mask on pod eth0 | 1 | 10.0.1.51/32 |
| Fake gateway 169.254.1.1 | 1 | default route via 169.254.1.1 |
| Static ARP entry | 1 | 5a:f3:56:0b:86:d4 PERM |
| ARP MAC = host-side veth | 2 | both show 5a:f3:56:0b:86:d4 |
| Policy routing (ip rule) | 2 | to pod -> main, from pod -> table 2 |
| Secondary table only has default gw | 2 | table 2: just default via ... dev ens6 |
| Same-node: sub-ms | 3 | 0.06ms |
| Cross-node: ~0.4ms | 3 | 0.4-1.7ms via VPC fabric |
| DNS through ClusterIP | 4 | 172.20.0.10 -> DNAT to CoreDNS pods |
| ClusterIP: round-robin | 4 | statistic mode random probability 0.5 |
| ClusterIP: no SNAT | 5 | echoserver sees real pod IP |
| Conntrack tracks NAT | 5 | bidirectional mapping visible |
| NodePort: SNAT masks client | 6 | echoserver sees node IP |
| NodePort: SNAT even on localhost | 6 | localhost curl still shows node IP |
| NodePort: cross-node SNAT | 6 | remote curl shows receiving node IP |
| Pod egress works | 7 | checkip returns public IP |
| VPC CNI mangle marks | 7 | CONNMARK 0x80 for primary ENI |
| Multiple ENIs per node | 8 | ens5, ens6, ens7 |
Cleanup
kubectl delete namespace packetlab