This article walks through the Cilium VXLAN cross-node communication process.
Test prerequisite: the cilium monitor output in version 1.10.6 is missing the HOST NS (host network namespace) encapsulation step, so we use cilium 1.11.4 for the packet capture analysis.
For the detailed deployment steps, refer to the earlier article "Cilium v1.10.6 Installation and Deployment"; simply choose version 1.11.4 when running helm pull, the rest of the changes are the same.
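For reference, a minimal sketch of fetching that chart version (assuming the standard Cilium Helm repository; adjust values to match the deployment article above):
# add the upstream Cilium chart repo and pull the 1.11.4 chart
helm repo add cilium https://helm.cilium.io/
helm repo update
helm pull cilium/cilium --version 1.11.4 --untar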
cilium 1.10.6 packet format
cilium 1.11.4 packet format
Characteristics of cross-node communication: for pod-to-pod traffic across nodes, the CNIs we are used to (Calico, Flannel) can be analyzed cleanly with ordinary networking knowledge such as routing tables, FDB tables, and ARP tables. With Cilium this approach "breaks down": Cilium's CNI implementation uses eBPF to do "jump-style" forwarding in the datapath, redirecting packets between devices instead of following the kernel's normal forwarding path.
We therefore need the tools that Cilium provides to assist the analysis.
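The main tools used in the rest of this article, all run inside the cilium-agent pod (a quick reference, not an exhaustive list):
cilium monitor -vv           # datapath debug events (encapsulation, CT lookups, delivery)
cilium bpf endpoint list     # local endpoint -> ifindex / mac / nodemac mapping
cilium bpf tunnel list       # remote PodCIDR -> peer node IP (VXLAN tunnel map)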
As shown in the packet transmission path diagram below,
we need to capture on pod1's lxc interface, on cilium_vxlan, and on ens33 on node-1,
and on pod2's lxc interface and cilium_vxlan on node-2.
tcpdump
Determine pod placement
node-1: 10.0.0.222 (referred to as pod1)
node-2: 10.0.1.208 (referred to as pod2)
root@master:~# kubectl get pod -o wide
NAME                        READY   STATUS    RESTARTS   AGE   IP           NODE               NOMINATED NODE   READINESS GATES
cni-test-76d79dfb85-28bpq   1/1     Running   0          19m   10.0.0.222   node-1.whale.com   <none>           <none>
cni-test-76d79dfb85-tjhdp   1/1     Running   0          19m   10.0.1.208   node-2.whale.com   <none>           <none>
pod1's host-side interface is lxc91ffd83cbb3e
root@master:~# kubectl exec -it cni-test-76d79dfb85-28bpq -- ethtool -S eth0
NIC statistics:
     peer_ifindex: 30
# on node-1
root@node-1:~# ip link show | grep ^30
30: lxc91ffd83cbb3e@if29: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
pod2's host-side interface is lxc978830fe1a23
root@master:~# kubectl exec -it cni-test-76d79dfb85-tjhdp -- ethtool -S eth0
NIC statistics:
     peer_ifindex: 22
# on node-2
root@node-2:~# ip link show | grep ^22
22: lxc978830fe1a23@if21: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
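The same lookup can be scripted; a minimal sketch (variable names are illustrative, and the ip command must be run on the node that hosts the pod):
POD=cni-test-76d79dfb85-28bpq                                  # pod to inspect
IDX=$(kubectl exec "$POD" -- ethtool -S eth0 | awk '/peer_ifindex/{print $2}')
ip link show | grep "^${IDX}:"                                 # host-side lxc peer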
Capture on node-1
tcpdump -pne -i lxc91ffd83cbb3e -w lxc_pod1.cap
tcpdump -pne -i cilium_vxlan -w cilium_vxlan_pod1.cap
tcpdump -pne -i ens33 -w ens33_pod1.cap
Capture on node-2
tcpdump -pne -i lxc978830fe1a23 -w lxc_pod2.cap
tcpdump -pne -i cilium_vxlan -w cilium_vxlan_pod2.cap
ping test
kubectl exec -it cni-test-76d79dfb85-28bpq -- ping -c 1 10.0.1.208
lxc_pod1.cap
cilium_vxlan_pod1.cap
ens33_pod1.cap
lxc_pod2.cap
cilium_vxlan_pod2.cap
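The screenshots of these captures are not reproduced here; a hedged tshark sketch that pulls the same information out of the saved files (assuming Cilium's default VXLAN port 8472):
tshark -r ens33_pod1.cap -d udp.port==8472,vxlan   # decode the UDP/8472 payload as VXLAN
tshark -r lxc_pod1.cap -Y icmp                     # inner ICMP as seen on pod1's lxc
tshark -r cilium_vxlan_pod1.cap -Y icmp            # inner ICMP on the vxlan device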
These captures verify the scenario:
for cross-node communication, the packet is first built inside the pod and leaves through its lxc interface, is then encapsulated by the VXLAN device in the HOST NS (host network namespace), and is sent to the peer host through the physical NIC. After VXLAN decapsulation on the peer host, the packet is redirected straight into the destination pod without passing through that pod's lxc interface, which confirms the communication flow diagram below.
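A hedged way to check the "no lxc on the receive path" part of that claim from the node-2 capture files (icmp.type 8 is the echo request):
# if the redirect claim holds, the echo request shows up in the vxlan capture
# on node-2 but not in the capture taken on pod2's lxc interface
tshark -r cilium_vxlan_pod2.cap -Y 'icmp.type == 8' | wc -l
tshark -r lxc_pod2.cap -Y 'icmp.type == 8' | wc -l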
cilium monitor
Determine pod placement
node-1: 10.0.0.222 (referred to as pod1)
node-2: 10.0.1.208 (referred to as pod2)
root@master:~# kubectl get pod -o wide
NAME                        READY   STATUS    RESTARTS   AGE   IP           NODE               NOMINATED NODE   READINESS GATES
cni-test-76d79dfb85-28bpq   1/1     Running   0          19m   10.0.0.222   node-1.whale.com   <none>           <none>
cni-test-76d79dfb85-tjhdp   1/1     Running   0          19m   10.0.1.208   node-2.whale.com   <none>           <none>
In the cilium pod running on each node, look at the endpoint information for these pods.
The highlighted entries are the pod's eth0 interface (mac) and its host-side lxc peer (ifindex and nodemac).
# node-1
root@master:~# kubectl -n kube-system exec -it cilium-xnmfw -- cilium bpf endpoint list
Defaulted container "cilium-agent" out of: cilium-agent, mount-cgroup (init), clean-cilium-state (init)
IP ADDRESS       LOCAL ENDPOINT INFO
10.0.0.112:0     id=415   flags=0x0000 ifindex=24 mac=EA:CF:FE:E8:E7:26 nodemac=BE:12:EB:4E:E9:30
192.168.0.120:0  (localhost)
10.0.0.215:0     (localhost)
10.0.0.222:0     id=2164  flags=0x0000 ifindex=30 mac=32:30:9C:CA:09:8E nodemac=2E:3C:E3:61:26:45
# node-2
root@master:~# kubectl -n kube-system exec -it cilium-jztvj -- cilium bpf endpoint list
Defaulted container "cilium-agent" out of: cilium-agent, mount-cgroup (init), clean-cilium-state (init)
IP ADDRESS       LOCAL ENDPOINT INFO
10.0.1.208:0     id=969   flags=0x0000 ifindex=22 mac=DA:97:53:7E:9A:CA nodemac=62:57:5C:C9:D6:0C
192.168.0.130:0  (localhost)
10.0.1.249:0     id=2940  flags=0x0000 ifindex=16 mac=02:55:31:EC:28:60 nodemac=32:FD:46:2F:CB:8A
10.0.1.10:0      (localhost)
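A hedged cross-check of these entries against the interfaces found earlier with ethtool (run the ip command on the node hosting the pod; the test image is assumed to ship iproute2):
ip -br link show lxc91ffd83cbb3e                                 # nodemac (run on node-1)
kubectl exec cni-test-76d79dfb85-28bpq -- ip -br link show eth0  # mac (pod1's eth0)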
Capture on node-1
Use pod1 to ping pod2 while running cilium monitor in the cilium pod on pod1's node.
root@master:~# kubectl exec -it cni-test-76d79dfb85-28bpq -- ping -c 1 10.0.1.208
PING 10.0.1.208 (10.0.1.208): 56 data bytes
64 bytes from 10.0.1.208: seq=0 ttl=63 time=0.790 ms
--- 10.0.1.208 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.790/0.790/0.790 ms
cilium monitor capture analysis
root@master:~# kubectl -n kube-system exec -it cilium-xnmfw -- cilium monitor -vv > monitor.yaml
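The output can also be narrowed to the endpoint we care about; a sketch using the monitor's --related-to filter (2164 is pod1's endpoint id from the list above):
kubectl -n kube-system exec -it cilium-xnmfw -- cilium monitor -vv --related-to 2164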
The key part:
node 3232235650 (0xc0a80082)
The 0x prefix means hexadecimal: c0-a8-00-82 --> 192.168.0.130.
This is an IP address written in hex, i.e. the peer node's IP address.
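A quick shell sketch of the conversion:
# each hex byte of 0xc0a80082 is one octet of the peer node's IP
printf '%d.%d.%d.%d\n' 0xc0 0xa8 0x00 0x82    # prints 192.168.0.130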
------------------------------------------------------------------------------
CPU 02: MARK 0x0 FROM 2164 DEBUG: Conntrack lookup 1/2: src=10.0.0.222:4864 dst=10.0.1.208:0
CPU 02: MARK 0x0 FROM 2164 DEBUG: Conntrack lookup 2/2: nexthdr=1 flags=1
CPU 02: MARK 0x0 FROM 2164 DEBUG: CT verdict: New, revnat=0
CPU 02: MARK 0x0 FROM 2164 DEBUG: Successfully mapped addr=10.0.1.208 to identity=3352
CPU 02: MARK 0x0 FROM 2164 DEBUG: Conntrack create: proxy-port=0 revnat=0 src-identity=3352 lb=0.0.0.0
CPU 02: MARK 0x0 FROM 2164 DEBUG: Encapsulating to node 3232235650 (0xc0a80082) from seclabel 3352
------------------------------------------------------------------------------
CPU 02: MARK 0x0 FROM 3 DEBUG: Conntrack lookup 1/2: src=192.168.0.120:56435 dst=192.168.0.130:8472
CPU 02: MARK 0x0 FROM 3 DEBUG: Conntrack lookup 2/2: nexthdr=17 flags=1
CPU 02: MARK 0x0 FROM 3 DEBUG: CT entry found lifetime=16823678, revnat=0
CPU 02: MARK 0x0 FROM 3 DEBUG: CT verdict: Established, revnat=0
As the figure below shows, the datapath has redirect capability in the NodePort / Remote Endpoint case.
CPU 03: MARK 0x0 FROM 0 DEBUG: Tunnel decap: id=3352 flowlabel=0
CPU 03: MARK 0x0 FROM 0 DEBUG: Attempting local delivery for container id 2164 from seclabel 3352
CPU 03: MARK 0x0 FROM 2164 DEBUG: Conntrack lookup 1/2: src=10.0.1.208:0 dst=10.0.0.222:4864
CPU 03: MARK 0x0 FROM 2164 DEBUG: Conntrack lookup 2/2: nexthdr=1 flags=0
CPU 03: MARK 0x0 FROM 2164 DEBUG: CT entry found lifetime=16823776, revnat=0
CPU 03: MARK 0x0 FROM 2164 DEBUG: CT verdict: Reply, revnat=0
Capture on node-2
Exec into node-2's cilium pod and capture with cilium monitor.
root@master:~# kubectl -n kube-system exec -it cilium-jztvj -- cilium monitor -vv > monitor2.yaml
The key part of the output:
CPU 03: MARK 0x0 FROM 0 DEBUG: Tunnel decap: id=3352 flowlabel=0
CPU 03: MARK 0x0 FROM 0 DEBUG: Attempting local delivery for container id 969 from seclabel 3352
CPU 03: MARK 0x0 FROM 969 DEBUG: Conntrack lookup 1/2: src=10.0.0.222:7936 dst=10.0.1.208:0
CPU 03: MARK 0x0 FROM 969 DEBUG: Conntrack lookup 2/2: nexthdr=1 flags=0
CPU 03: MARK 0x0 FROM 969 DEBUG: CT verdict: New, revnat=0
CPU 03: MARK 0x0 FROM 969 DEBUG: Conntrack create: proxy-port=0 revnat=0 src-identity=3352 lb=0.0.0.0
CPU 03: MARK 0x0 FROM 969 from-endpoint: 98 bytes (98 captured), state new, , identity 3352->unknown, orig-ip 0.0.0.0
CPU 03: MARK 0x0 FROM 969 DEBUG: Conntrack lookup 1/2: src=10.0.1.208:0 dst=10.0.0.222:7936
CPU 03: MARK 0x0 FROM 969 DEBUG: Conntrack lookup 2/2: nexthdr=1 flags=1
CPU 03: MARK 0x0 FROM 969 DEBUG: CT entry found lifetime=16826421, revnat=0
CPU 03: MARK 0x0 FROM 969 DEBUG: CT verdict: Reply, revnat=0
CPU 03: MARK 0x0 FROM 969 DEBUG: Successfully mapped addr=10.