Troubleshooting kubeadm
As with any program, you might run into an error installing or running kubeadm. This page lists some common failure scenarios and provides steps that can help you understand and fix the problem.
[TOC]
ebtables or some similar executable not found during installation
If you see the following warnings while running kubeadm init
[preflight] WARNING: ebtables not found in system path
[preflight] WARNING: ethtool not found in system path
Then you may be missing ebtables, ethtool or a similar executable on your node. You can install them with the following commands:
- For Ubuntu/Debian users, run apt install ebtables ethtool.
- For CentOS/Fedora users, run yum install ebtables ethtool.
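To confirm the binaries are now on the node's PATH before re-running kubeadm init, a quick check is:
# Each command prints the path of the binary if it is installed:
command -v ebtables
command -v ethtool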
kubeadm blocks waiting for control plane during installation
If you notice that kubeadm init hangs after printing out the following line:
[apiclient] Created API client, waiting for the control plane to become ready
This may be caused by a number of problems. The most common are:
- network connection problems. Check that your machine has full network connectivity before continuing.
- the default cgroup driver configuration for the kubelet differs from that used by Docker. Check the system log file (e.g. /var/log/message) or examine the output from journalctl -u kubelet. If you see something like the following:
error: failed to run Kubelet: failed to create kubelet:
misconfiguration: kubelet cgroup driver: "systemd" is different from docker cgroup driver: "cgroupfs"
There are two common ways to fix the cgroup driver problem:
- Install Docker again following instructions here.
- Change the kubelet config to match the Docker cgroup driver manually; you can refer to Configure cgroup driver used by kubelet on Master Node. (A sketch follows after this list.)
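As an illustration of the second option, a minimal sketch of checking both drivers and pointing the kubelet at Docker's cgroupfs driver. The /etc/default/kubelet location is an assumption (RPM-based systems typically use /etc/sysconfig/kubelet), and note that the echo below overwrites any existing KUBELET_EXTRA_ARGS:
# Check which cgroup driver Docker reports:
docker info | grep -i cgroup
# Point the kubelet at the same driver (assumed file location; overwrites existing KUBELET_EXTRA_ARGS):
echo 'KUBELET_EXTRA_ARGS=--cgroup-driver=cgroupfs' | sudo tee /etc/default/kubelet
sudo systemctl daemon-reload
sudo systemctl restart kubelet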
kubeadm blocks when removing managed containers
The following could happen if Docker halts and does not remove any Kubernetes-managed containers:
sudo kubeadm reset
[preflight] Running pre-flight checks
[reset] Stopping the kubelet service
[reset] Unmounting mounted directories in "/var/lib/kubelet"
[reset] Removing kubernetes-managed containers
(block)
A possible solution is to restart the Docker service and then re-run kubeadm reset:
sudo systemctl restart docker.service
sudo kubeadm reset
Inspecting the logs for docker may also be useful:
journalctl -u docker
Pods in RunContainerError, CrashLoopBackOff or Error state
Right after kubeadm init there should not be any pods in these states.
If there are pods in one of these states right after kubeadm init, please open an issue in the kubeadm repo. coredns (or kube-dns) should be in the Pending state until you have deployed the network solution.
If you see Pods in the RunContainerError, CrashLoopBackOff or Error state after deploying the network solution and nothing happens to coredns (or kube-dns), it is very likely that the Pod Network solution that you installed is somehow broken. You might have to grant it more RBAC privileges or use a newer version. Please file an issue in the Pod Network provider's issue tracker and get the issue triaged there.
If you install a version of Docker older than 1.12.1, remove the MountFlags=slave option when booting dockerd with systemd and restart docker. You can see the MountFlags in /usr/lib/systemd/system/docker.service. MountFlags can interfere with volumes mounted by Kubernetes, and put the Pods in CrashLoopBackOff state. The error happens when Kubernetes does not find /var/run/secrets/kubernetes.io/serviceaccount files.
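A sketch of checking for and overriding MountFlags with a systemd drop-in (the drop-in file name is an assumption; shared propagation is the systemd default when MountFlags is not set):
# See whether the Docker unit sets MountFlags=slave:
grep MountFlags /usr/lib/systemd/system/docker.service
# Override it via a drop-in rather than editing the unit file in place:
sudo mkdir -p /etc/systemd/system/docker.service.d
printf '[Service]\nMountFlags=shared\n' | sudo tee /etc/systemd/system/docker.service.d/10-mountflags.conf
sudo systemctl daemon-reload
sudo systemctl restart docker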
coredns (or kube-dns) is stuck in the Pending state
This is expected and part of the design. kubeadm is network provider-agnostic, so the admin should install the pod network solution of choice.
You have to install a Pod Network before CoreDNS may be deployed fully. Hence the Pending state before the network is set up.
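To confirm this is what you are seeing, check the DNS Pods in kube-system (the k8s-app=kube-dns label is the one the kubeadm-deployed coredns Deployment commonly carries):
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
# The coredns (or kube-dns) Pods stay Pending until a Pod network add-on is installed.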
HostPort services do not work
The HostPort and HostIP functionality is available depending on your Pod Network provider. Please contact the author of the Pod Network solution to find out whether HostPort and HostIP functionality are available.
Calico, Canal, and Flannel CNI providers are verified to support HostPort.
For more information, see the CNI portmap documentation.
If your network provider does not support the portmap CNI plugin, you may need to use the NodePort feature of services or use HostNetwork=true.
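As an illustration of the NodePort alternative (the deployment name my-app and the port are hypothetical):
# Expose a hypothetical deployment through a NodePort service instead of relying on HostPort:
kubectl expose deployment my-app --port=80 --type=NodePort
kubectl get svc my-app   # note the allocated node port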
Pods are not accessible via their Service IP
Many network add-ons do not yet enable hairpin mode which allows pods to access themselves via their Service IP. This is an issue related to CNI. Please contact the network add-on provider to get the latest status of their support for hairpin mode.
If you are using VirtualBox (directly or via Vagrant), you will need to ensure that hostname -i returns a routable IP address. By default the first interface is connected to a non-routable host-only network. A workaround is to modify /etc/hosts; see this Vagrantfile for an example.
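A quick way to check this on the VM; the address in the example /etc/hosts entry is a placeholder for your routable IP:
hostname -i      # should print a routable address, not the non-routable default
ip addr show     # compare against the interfaces that are actually configured
# Example workaround: map the hostname to the routable address in /etc/hosts
echo "192.168.56.10 $(hostname)" | sudo tee -a /etc/hosts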
TLS certificate errors
The following error indicates a possible certificate mismatch.
# kubectl get pods
Unable to connect to the server: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")
- Verify that the $HOME/.kube/config file contains a valid certificate, and regenerate a certificate if necessary. The certificates in a kubeconfig file are base64 encoded. The base64 -d command can be used to decode the certificate and openssl x509 -text -noout can be used for viewing the certificate information (see the sketch after this list).
- Unset the KUBECONFIG environment variable using:
unset KUBECONFIG
Or set it to the default KUBECONFIG location:
export KUBECONFIG=/etc/kubernetes/admin.conf
- Another workaround is to overwrite the existing kubeconfig for the "admin" user:
cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
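For example, a sketch of extracting, decoding, and inspecting the client certificate embedded in the kubeconfig, assuming the certificate is stored inline as client-certificate-data rather than referenced as a file path:
grep 'client-certificate-data' "$HOME/.kube/config" | awk '{print $2}' | base64 -d | openssl x509 -text -noout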
Default NIC When using flannel as the pod network in Vagrant
The following error might indicate that something was wrong in the pod network:
Error from server (NotFound): the server could not find the requested resource
If you're using flannel as the pod network inside Vagrant, then you will have to specify the default interface name for flannel.
Vagrant typically assigns two interfaces to all VMs. The first, for which all hosts are assigned the IP address 10.0.2.15, is for external traffic that gets NATed.
This may lead to problems with flannel, which defaults to the first interface on a host. This leads to all hosts thinking they have the same public IP address. To prevent this, pass the --iface eth1 flag to flannel so that the second interface is chosen.
Non-public IP used for containers
In some situations kubectl logs and kubectl run commands may return with the following errors in an otherwise functional cluster:
Error from server: Get https://10.19.0.41:10250/containerLogs/default/mysql-ddc65b868-glc5m/mysql: dial tcp 10.19.0.41:10250: getsockopt: no route to host
- This may be due to Kubernetes using an IP that can not communicate with other IPs on the seemingly same subnet, possibly by policy of the machine provider.
- Digital Ocean assigns a public IP to eth0 as well as a private one to be used internally as anchor for their floating IP feature, yet kubelet will pick the latter as the node's InternalIP instead of the public one.
Use ip addr show to check for this scenario instead of ifconfig because ifconfig will not display the offending alias IP address. Alternatively an API endpoint specific to Digital Ocean allows to query for the anchor IP from the droplet:
curl http://169.254.169.254/metadata/v1/interfaces/public/0/anchor_ipv4/address
The workaround is to tell kubelet which IP to use using --node-ip. When using Digital Ocean, it can be the public one (assigned to eth0) or the private one (assigned to eth1) should you want to use the optional private network. The KubeletExtraArgs section of the kubeadm NodeRegistrationOptions structure can be used for this.
Then restart kubelet:
systemctl daemon-reload
systemctl restart kubelet
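A minimal sketch of passing --node-ip through KubeletExtraArgs in the kubeadm configuration file; the v1beta1 field names are assumed, the address is a placeholder for the IP you chose, and JoinConfiguration accepts the same nodeRegistration section for worker nodes:
cat <<EOF > kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta1
kind: InitConfiguration
nodeRegistration:
  kubeletExtraArgs:
    node-ip: "203.0.113.10"   # placeholder: the public or private IP you want kubelet to advertise
EOF
kubeadm init --config kubeadm-config.yaml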
coredns pods have CrashLoopBackOff or Error state
If you have nodes that are running SELinux with an older version of Docker you might experience a scenario where the coredns pods are not starting. To solve that you can try one of the following options:
- Upgrade to a newer version of Docker.
- Disable SELinux.
- Modify the coredns deployment to set allowPrivilegeEscalation to true:
kubectl -n kube-system get deployment coredns -o yaml |
sed 's/allowPrivilegeEscalation: false/allowPrivilegeEscalation: true/g' |
kubectl apply -f -
Another cause for CoreDNS to have CrashLoopBackOff is when a CoreDNS Pod deployed in Kubernetes detects a loop. A number of workarounds are available to avoid Kubernetes trying to restart the CoreDNS Pod every time CoreDNS detects the loop and exits.
Warning: Disabling SELinux or setting allowPrivilegeEscalation to true can compromise the security of your cluster.
etcd pods restart continually
If you encounter the following error:
rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:110: decoding init error from pipe caused "read parent: connection reset by peer""
This issue appears if you run CentOS 7 with Docker 1.13.1.84. This version of Docker can prevent the kubelet from executing into the etcd container.
To work around the issue, choose one of these options:
- Roll back to an earlier version of Docker, such as 1.13.1-75:
yum downgrade docker-1.13.1-75.git8633870.el7.centos.x86_64 docker-client-1.13.1-75.git8633870.el7.centos.x86_64 docker-common-1.13.1-75.git8633870.el7.centos.x86_64
- Install one of the more recent recommended versions, such as 18.06:
sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
yum install docker-ce-18.06.1.ce-3.el7.x86_64
Not possible to pass a comma separated list of values to arguments inside a --component-extra-args flag
kubeadm init flags such as --component-extra-args allow you to pass custom arguments to a control-plane component like the kube-apiserver. However, this mechanism is limited due to the underlying type used for parsing the values (mapStringString).
If you decide to pass an argument that supports multiple, comma-separated values such as --apiserver-extra-args "enable-admission-plugins=LimitRanger,NamespaceExists" this flag will fail with flag: malformed pair, expect string=string. This happens because the list of arguments for --apiserver-extra-args expects key=value pairs and in this case NamespaceExists is considered as a key that is missing a value.
Alternatively, you can try separating the key=value pairs like so: --apiserver-extra-args "enable-admission-plugins=LimitRanger,enable-admission-plugins=NamespaceExists" but this will result in the key enable-admission-plugins only having the value of NamespaceExists.
A known workaround is to use the kubeadm configuration file.
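A sketch of the configuration-file workaround, assuming the v1beta1 ClusterConfiguration API, where the comma-separated value is passed as a single map entry:
cat <<EOF > kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
apiServer:
  extraArgs:
    enable-admission-plugins: LimitRanger,NamespaceExists
EOF
kubeadm init --config kubeadm-config.yaml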
kube-proxy scheduled before node is initialized by cloud-controller-manager
In cloud provider scenarios, kube-proxy can end up being scheduled on new worker nodes before the cloud-controller-manager has initialized the node addresses. This causes kube-proxy to fail to pick up the node’s IP address properly and has knock-on effects to the proxy function managing load balancers.
The following error can be seen in kube-proxy Pods:
server.go:610] Failed to retrieve node IP: host IP unknown; known addresses: []
proxier.go:340] invalid nodeIP, initializing kube-proxy with 127.0.0.1 as nodeIP
A known solution is to patch the kube-proxy DaemonSet to allow scheduling it on control-plane nodes regardless of their conditions, keeping it off of other nodes until their initial guarding conditions abate:
kubectl -n kube-system patch ds kube-proxy -p='{ "spec": { "template": { "spec": { "tolerations": [ { "key": "CriticalAddonsOnly", "operator": "Exists" }, { "effect": "NoSchedule", "key": "node-role.kubernetes.io/master" } ] } } } }'
The tracking issue for this problem is here.
The NodeRegistration.Taints field is omitted when marshalling kubeadm configuration
Note: This issue only applies to tools that marshal kubeadm types (e.g. to a YAML configuration file). It will be fixed in kubeadm API v1beta2.
By default, kubeadm applies the node-role.kubernetes.io/master:NoSchedule taint to control-plane nodes. If you prefer kubeadm to not taint the control-plane node, and set InitConfiguration.NodeRegistration.Taints to an empty slice, the field will be omitted when marshalling. When the field is omitted, kubeadm applies the default taint.
There are at least two workarounds:
- Use the node-role.kubernetes.io/master:PreferNoSchedule taint instead of an empty slice. Pods will get scheduled on masters, unless other nodes have capacity. (A configuration sketch follows after this list.)
- Remove the taint after kubeadm init exits:
kubectl taint nodes NODE_NAME node-role.kubernetes.io/master:NoSchedule-
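For the first workaround, a sketch of setting the taint explicitly in the kubeadm configuration (v1beta1 field names assumed):
cat <<EOF > kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta1
kind: InitConfiguration
nodeRegistration:
  taints:
  - key: "node-role.kubernetes.io/master"
    effect: "PreferNoSchedule"
EOF
kubeadm init --config kubeadm-config.yaml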