This post continues from the previous one, “A Web application must use AWS if it wants to scale elastically? Notes on deploying a typical Web App on Amazon Elastic Kubernetes Service (EKS)”. There is already an EKS cluster and a number of applications have been migrated into it and are running fine, but for historical reasons some applications still run on the original EC2 instances via docker-compose and need to talk to applications inside EKS.
During this integration, however, we found that Pods in EKS could not connect to one particular EC2 instance: every connection timed out and even ping failed, while other EC2 instances in the same VPC were perfectly reachable. So we started digging into the problem.
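To make the symptom concrete, a quick way to reproduce it from inside the cluster looks roughly like the following (busybox is just a convenient test image, and <ec2-private-ip> is a placeholder for the unreachable instance):

```bash
# Spin up a throwaway Pod in EKS and try to reach the EC2 instance's private IP.
# For the problematic instance this hangs and eventually times out.
kubectl run net-debug --rm -it --image=busybox --restart=Never -- \
  ping -c 3 <ec2-private-ip>
```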
Network
At the network level, the EKS cluster and the EC2 instances are in different VPCs, connected via VPC Peering and static routes:
* EC2 VPC CIDR: 172.31.0.0/16
* EKS VPC CIDR: 192.168.0.0/16
None of the EC2 machines have UFW enabled and there are no extra iptables rules. The Security Groups already allow traffic from the EKS VPC, which is also why services inside EKS can normally reach EC2, RDS and the like directly.
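If you want to double-check the peering side of this setup, the routes can be listed from the AWS CLI; a rough sketch, with vpc-aaaa as a placeholder VPC ID:

```bash
# Confirm the EC2-side VPC has a route for 192.168.0.0/16 (the EKS VPC)
# whose target is the VPC peering connection (pcx-...). vpc-aaaa is a placeholder.
aws ec2 describe-route-tables \
  --filters Name=vpc-id,Values=vpc-aaaa \
  --query 'RouteTables[].Routes[]'
```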
Service
All services on the EC2 instances are deployed with a docker-compose.yml, written roughly like this:
version : "3"
services :
yyyy :
image : ghcr.io/xxx/yyyy : latest
restart : always
ports :
- '8080:8080'
environment :
DB_HOST : 'db'
DB_PORT : '3306'
DB_DATABASE : 'yyyy'
DB_USERNAME : 'root'
DB_PASSWORD : 'password'
APP_DEBUG : 'true'
volumes :
- ./.env : /app/.env
yyyy-service :
image : ghcr.io/xxx/yyyy-service : latest
restart : always
environment :
- DSN=root : password@tcp(db : 3306 )/yyyy ? charset=utf8mb4 &parseTime=True&loc=Local
- REDIS=redis : 6379
- ENV=mini
ports :
- '8090:8080'
Docker itself is installed in exactly the same way on every machine, yet this one EC2 instance cannot be reached from EKS. Before reading on, smart readers are invited to take a little time to think about what strange problem might be hiding here~
Debug
My first reaction during the investigation was that UFW or the Security Group was misconfigured, but UFW turned out not to be enabled at all, the SG allowed the traffic, and the other EC2 instances were reachable, so my attention gradually drifted back to the network itself.
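For reference, ruling those out amounts to nothing more than a couple of commands on the instance (illustrative; nothing here is specific to this setup):

```bash
# UFW turned out to be inactive on every machine.
sudo ufw status verbose
# Dump the current iptables rules and look for anything unexpected.
sudo iptables -S
```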
I really don't know what I did to deserve running into this kind of bizarre problem even with EKS
I looked at iptables and found no strange rules there either. Just as I was about to give up and start Googling, the login banner of the EC2 instance gave me a hint:
Welcome to Ubuntu 22.04 LTS (GNU/Linux 5.15.0-1011-aws x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

  System information as of Wed Dec 7 03:32:08 UTC 2022

  System load:                      2.35498046875
  Usage of /:                       55.7% of 193.66GB
  Memory usage:                     79%
  Swap usage:                       0%
  Processes:                        721
  Users logged in:                  1
  IPv4 address for br-65abb51ee0f8: 172.24.0.1
  IPv4 address for br-69e9038f7cbe: 192.168.176.1
  IPv4 address for br-72d131ef9461: 172.19.0.1
  IPv4 address for br-777aba48864a: 192.168.96.1
  IPv4 address for br-a9e07fa9b259: 172.22.0.1
  IPv4 address for br-f06719849d74: 192.168.192.1
  IPv4 address for docker0:         172.17.0.1
  IPv4 address for ens5:            172.31.5.22

  => There are 3 zombie processes.
Only afterwards did I realize how important a hint this was, so once again I will pause for a moment and invite smart readers to think about it~
The key line is: IPv4 address for br-777aba48864a: 192.168.96.1. Why does this range overlap with the EKS CIDR?
First of all, when I see a network interface whose name starts with br-, my first thought is a network created by Docker itself. With docker network ls we can see which networks exist on this machine:
NETWORK ID     NAME                                                       DRIVER    SCOPE
72066261ca9e   bridge                                                     bridge    local
a9e07fa9b259   clickhouse_default                                         bridge    local
65abb51ee0f8   dddd_default                                               bridge    local
cabd6fc84675   xxxx-mxxxxxx_default                                       bridge    local
45ecdfc7a691   host                                                       host      local
777aba48864a   monitoring_default                                         bridge    local
4c95c382c5a8   none                                                       null      local
503d7d864c98   xxx-xxx-up_default                                         bridge    local
c5946111d925   redis-insight_default                                      bridge    local
ad1686916ed0   runner_default                                             bridge    local
72d131ef9461   novaext-data-availability-runnergro-rustct_default        bridge    local
69e9038f7cbe   novaext-data-availability-runnergro-mgolang-new_default   bridge    local
f06719849d74   novaext-data-availability-webassets-mgolang-new_default   bridge    local
We can see that, for example, a9e07fa9b259 (clickhouse_default) uses 172.22.0.1, which looks like a reasonable address, but f06719849d74 (novaext-data-availability-webassets-mgolang-new_default) for some reason uses 192.168.192.1, which happens to overlap with the EKS range and leads to the problem above.
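To confirm which network owns which range, docker network inspect can print the subnet of every network; a small loop along these lines does the job (the format string just pulls the IPAM subnet out of each network):

```bash
# Print "<name>: <subnet>" for every Docker network on the host.
# Networks without IPAM config (host, none) simply print an empty subnet.
for net in $(docker network ls -q); do
  docker network inspect -f '{{.Name}}: {{range .IPAM.Config}}{{.Subnet}} {{end}}' "$net"
done
```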
From Networking in Compose, combined with hands-on experience, we know:
By default Compose sets up a single network for your app. Each container for a service joins the default network and is both reachable by other containers on that network, and discoverable by them at a hostname identical to the container name.
Looking at docker network create, we can infer that under the hood docker-compose still calls Docker's native API to create networks, and the subnet is chosen by the Docker Engine:
When you create a network, Engine creates a non-overlapping subnetwork for the network by default
When starting the docker daemon, the daemon will check local network (RFC1918) ranges, and tries to detect if the range is not in use. The first range that is not in use will be used as the default “pool” to create networks.
https://github.com/moby/moby/issues/37823
As for where exactly this is implemented in the Engine's source code, I searched for quite a while without finding it; I'll leave a gap here and fill it in once I track it down.
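Regardless of where the exact allocation code lives, the behavior itself is easy to observe: each new bridge network simply gets the next local range that looks free from the host's point of view, with no awareness of what a peered VPC might be using. A minimal sketch (demo-net is a made-up name):

```bash
# Create a throwaway network and see which subnet the Engine assigned to it;
# on a host like this one it eventually hands out ranges inside 192.168.0.0/16.
docker network create demo-net
docker network inspect -f '{{range .IPAM.Config}}{{.Subnet}}{{end}}' demo-net
docker network rm demo-net
```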
So the problem is clear: when Docker picks a subnet for a bridge network, it has no idea that the peered EKS VPC is already using 192.168.0.0/16, so it considers parts of that range free, and some bridge networks end up overlapping with the EKS CIDR. Once that happens, the instance routes traffic for the overlapping addresses into its local bridge instead of back through the VPC peering, which is why Pods in EKS cannot reach this EC2 instance. As for why the other machines had no problem, I logged in and checked: their bridge networks simply happened not to land inside the EKS range.
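A quick way to spot this collision on any of these machines is the kernel routing table: if a Docker bridge owns a route inside 192.168.0.0/16, replies to EKS Pods in that range are sent to the local bridge instead of back over the VPC peering. For example:

```bash
# Any route here attached to a br-* or docker0 interface overlaps the EKS VPC
# (192.168.0.0/16) and will shadow traffic to and from Pods in that range.
ip route | grep '^192\.168\.'
```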
Solution
There are two ways to solve this. The first is to modify Docker's daemon.json and add something like the following so that bridge networks are only created within specified ranges (the values below are illustrative; in this scenario you would of course pick ranges that do not collide with the EKS VPC):
{ "bip" : "192.168.1.5/24" , "fixed-cidr" : "192.168.1.5/25" , "default-address-pools" :[ { "base" : "192.168.2.5/24" , "size" : 28 } ] }
This approach requires restarting the Docker daemon after the change, which takes the services offline for a while (especially when there are many containers).
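If you go this route, the rollout is roughly: restart the daemon, then recreate the affected networks, since existing bridge networks keep the subnets they were originally allocated:

```bash
# Restart Docker so the new address pools take effect (brief downtime).
sudo systemctl restart docker
# Existing networks keep their old subnets; recreate the affected compose
# projects so their networks are re-allocated from the new pools.
docker-compose down && docker-compose up -d
```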
The other approach is to specify the subnet of the default bridge network at the top level of each docker-compose.yml, written roughly like this:
networks:
  default:
    driver: bridge
    ipam:
      config:
        - subnet: 10.7.0.0/16
          gateway: 10.7.0.1
After the change, run docker-compose down && docker-compose up -d to recreate the corresponding services.
After fixing the overlap, communication between EKS and the EC2 instance immediately went back to normal.
Reference
- Networking in Compose
- How to change the default docker subnet IP range
- Configuring Docker to not use the 172.17.0.0 range
- Docker (compose?) suddenly creates bridge using 192.168.16.0/20 range #37823
Originally published at https://nova.moe/debug-eks-ec2-connection-problem/