21/08/2023 6 Minutes read Tech 

Addressing private IPv4 shortage: 5 Strategies for Amazon EKS

Photo by Taylor Vick on Unsplash

Everyone who works with Amazon EKS has already faced the challenge of IPv4 scarcity. We will explore 5 solutions to mitigate or to end this problem.

Context

In Kubernetes, you have to use a Container Network Interface plugin (CNI) to allow pods to communicate with each other with their respective IP addresses.

When using Amazon EKS, by default you get the VPC Container Network Interface (CNI). This plugin is installed as a daemonset in the cluster.

EKS Networking with VPC CNI​ uses AWS VPC resources such as Elastic Network Interfaces (ENI) and IP addresses that exist in the VPC and subnets. The CNI plugin allows Kubernetes Pods to have the same IP address as they do on the VPC network.

It provisions one secondary IP address (an ENI has 1 primary and multiple secondary ones) for each pod in the cluster, which can take up a lot of IP space in your subnets.

IPv4 allocation by the VPC CNI
IPv4 allocation by the VPC CNI

Why not use another CNI?

You can disable native AWS IPs provisioning by using another CNI. IPv4 management will work in the following way:

  • IP addresses of pods are scoped to the internal network​
  • Pods are not directly reachable from outside the cluster​
  • Every pod uses the primary node ENI

There are many benefits you lose by dropping the VPC CNI:

We will focus in this article on keeping the usage of VPC CNI.

An easy change: VPC CNI tweaking

There are several environment variables used by the VPC CNI to configure its behavior. We will address WARM_ENI_TARGET and WARM_IP_TARGET.

WARM_ENI_TARGET: Specifies the number of free elastic network interfaces and all of their available IP addresses. For WARM_ENI_TARGET = 1, A full ENI of unused IP addresses will be provisioned.
The maximum number of IPs by ENI is tied to the ec2 instance type. For example, there are 15 IPs on an ENI for an m6a.xlarge. You can find the reference in the AWS Documentation.

Unless you have specific requirements, I would advise you to set WARM_ENI_TARGET to 0.

WARM_IP_TARGET: Specifies the number of free IP addresses that should be kept at all times on all interfaces.

When a pod is scheduled on a node, if there are no IP addresses available on all ENIs, there can be an added delay before the pod starts because the VPC CNI plugin is provisioning resources (by calling AssignPrivateIpAddress).

Due to this risk, it is a good practice to set WARM_IP_TARGET to at least 1.

When using prefix delegation: use subnet CIDR reservation

With the VPC CNI, you have the ability to configure IPv4 allocation to be either

  • Individual secondary IPv4 addresses (the default mode)
  • Or whole /28 prefixes: Prefix delegation (used when you need larger pod density per node)

For a prefix to be allocated, a whole /28 range should be available. If at least one IPv4 has been taken beforehand, the prefix cannot be allocated. If there is no contiguous free /28 prefix in your subnet, the allocation fails and your pod will not start (even if there are hundreds of IPs available)

The diagram below outlines this situation:

CIDR prefix that fails to allocate due to fragmented subnet
CIDR prefix that fails to allocate due to fragmented subnet

One solution to overcome this problem is using VPC subnet CIDR reservation. This feature allows to pre-allocate many CIDR ranges, that can only be used by whole /28 prefixes and not individual IP addresses

You can learn more about this feature in the AWS documentation

If you still have IP scarcity issues even after these 2 first improvements, the following ones have more impact but require more work to achieve.

VPC CNI Custom Networking using Internal SNAT

The following setup involves 2 features of VPC CNI

  • Custom Networking: allows to have a separate ENI configuration between pods and their node. Practically, we will configure all pods to ENIs in another subnet than the one used by the node
  • Internal SNAT: routes the outbound traffic from pods to anywhere outside their subnet through the primary IP of the node.

The diagram below shows an overview of the solution:

Custom networking with internal SNAT
Custom networking with internal SNAT
  • We create dedicated subnets for the pods. These ones should not be routable from outside the VPC (we can call these subnets internal)
  • New subnets can have any CIDR block, but a good choice is to use CG-NAT space because companies don’t usually use these ranges for private networks.
  • Use custom networking use the new subnets for pods ENIs (more information on EKS best practices guide)
  • Internal SNAT is enabled by default when using custom networking (see AWS documentation)

With this configuration:

  • Network flow from pod to pod will not be NAT-ed. You will see pods IPs in the VPC flow logs
  • Network flow from a load balancer in the same VPC to any pod is possible without any NAT
  • Network flow from a pod to an address IP not associated with a pod is NAT-ed through the primary node ENI.
  • Ingress traffic from outside the VPC to the pods should pass through a load balancer (a.k.a layer 4 proxy) located in non-internal subnets
  • You can create as many subnets with overlapping CIDR in different VPCs as you want.

NATted data-plane with private NAT gateway

A more recent solution to address the issue is using private NAT gateways, described in this blog post.

Architecture overview of pods from overlapping CIDRs communicating through a private NAT gateway
Architecture overview of pods from overlapping CIDRs communicating through a private NAT gateway

Since mid-2021, you can create private NAT gateways, allowing you to NAT the entire EKS data plane:

  • You create nonroutable subnets (preferably with CIDR ranges from CG-NAT space), but this time for hosting all the data place IPs (pods and nodes)
  • This removes the need for Kubernetes-specific configuration to separate node and pods ENIs

There is a caveat to keep in mind: you get charged per private NAT gateway and for data processing costs for VPC-to-VPC communication. This has to be put in perspective with your whole bill for the EKS infrastructure.

IPv6

The most future-proof solution to solve IPv4 exhaustion — but the most disruptive — is setting up IPv6.

Here are some advantages of using IPv6:

  • The network infrastructure is simpler: network flow does not need to go through a NAT gateway. Your pods are assigned a public IPv6 address, this simplifies troubleshooting.
  • Cost reduction: IPv6 to public IPv6 traffic does not charge you for NAT gateway data processing.
  • You can still keep compatibility with existing IPv4-only infrastructure using NAT64/DNS64

There are 2 ways to run IPv6 currently on EKS.

The first one is using dual-stack subnets for the data plane (discussed in the EKS best practices website):

  • Every pod has only an IPv6 address
  • The node has an IPv4 and an IPv6 address
  • For egress traffic from a pod to an IPv4 address, the behavior is similar to using custom networking, the traffic goes through the primary node IP address.
Custom logic for IPv6 to IPv4 translation by the VPC CNI
Custom logic for IPv6 to IPv4 translation by the VPC CNI

The second one uses IPv6-only subnets:

Conclusion

Here is a matrix summarising the main characteristics of each solution:


Addressing private IPv4 shortage : 5 Strategies for Amazon EKS was originally published in ekino-france on Medium, where people are continuing the conversation by highlighting and responding to this story.