node_exporter host disk monitoring: source code walkthrough and troubleshooting

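node_exporter is deployed as a Pod and collects host metrics such as CPU, memory, and disk.
A Pod runs in an isolated environment, which gets in the way of monitoring the host, so it should share the host's namespaces as far as possible. The usual configuration is:

    hostNetwork: true
    hostPID: true

This article focuses on how the usage of the host's disk partitions is monitored.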
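The user node_exporter runs as
In a Dockerfile, the USER instruction sets the user the container runs as; if it is omitted, the user is root.
The Dockerfile below shows that node_exporter's default user is nobody, whose UID is 65534:

    ......
    COPY ./node_exporter /bin/node_exporter
    EXPOSE 9100
    USER nobody
    ENTRYPOINT [ "/bin/node_exporter" ]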

In node_exporter's daemonset.yaml, the securityContext is configured as:

    ......
    hostNetwork: true
    hostPID: true
    securityContext:
      runAsNonRoot: true
      runAsUser: 65534
    ......

How runAsUser and runAsNonRoot interact:
  • If runAsUser is not set, the container runs as the image's default user;
  • If runAsUser is set, the container process runs as that UID;
  • runAsNonRoot does not select a user. When it is true, the kubelet only verifies that the container will not run as UID 0 and refuses to start it otherwise.

    # kubectl explain daemonset.spec.template.spec.securityContext.runAsNonRoot
    KIND:     DaemonSet
    VERSION:  apps/v1

    FIELD:    runAsNonRoot

    DESCRIPTION:
         Indicates that the container must run as a non-root user. If true, the
         Kubelet will validate the image at runtime to ensure that it does not run
         as UID 0 (root) and fail to start the container if it does. If unset or
         false, no such validation will be performed. May also be set in
         SecurityContext. If set in both SecurityContext and PodSecurityContext,
         the value specified in SecurityContext takes precedence.

So node_exporter is explicitly configured to run as the non-root user nobody.
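The rule above can be modelled in a few lines of Go. This is only a sketch of the documented behaviour, not the kubelet's actual code; the function names effectiveUID and checkRunAsNonRoot are made up for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// effectiveUID models which UID the container process ends up with:
// runAsUser, when set, overrides the USER baked into the image.
func effectiveUID(runAsUser *int64, imageUID int64) int64 {
	if runAsUser != nil {
		return *runAsUser
	}
	return imageUID
}

// checkRunAsNonRoot models the documented validation: with
// runAsNonRoot=true, a container that would run as UID 0 is refused.
func checkRunAsNonRoot(runAsNonRoot bool, uid int64) error {
	if runAsNonRoot && uid == 0 {
		return errors.New("container would run as root, but runAsNonRoot is true")
	}
	return nil
}

func main() {
	const nobody int64 = 65534 // UID of "nobody", from the image's USER line
	root := int64(0)

	// No runAsUser: the image default (nobody) is used and passes the check.
	uid := effectiveUID(nil, nobody)
	fmt.Println(uid, checkRunAsNonRoot(true, uid))

	// runAsUser: 0 together with runAsNonRoot: true is rejected.
	uid = effectiveUID(&root, nobody)
	fmt.Println(uid, checkRunAsNonRoot(true, uid))
}
```

The second case matters later: switching to runAsUser: 0 while leaving runAsNonRoot: true in place would keep the container from starting.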
How node_exporter monitors the host's disk partitions
1) Mount the host's /proc and / directories
The host's /proc directory is mounted at /host/proc inside the container;
the host's / directory is mounted at /host/root inside the container:

    spec:
      template:
        spec:
          containers:
          - name: node-exporter
            volumeMounts:
            - mountPath: /host/proc
              name: proc
            - mountPath: /host/root
              mountPropagation: HostToContainer
              name: root
          volumes:
          - hostPath:
              path: /proc
            name: proc
          - hostPath:
              path: /
            name: root
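To make the path mapping concrete, here is a minimal Go sketch of how a host path turns into a container path. It assumes node_exporter is started with --path.procfs=/host/proc and --path.rootfs=/host/root (the container args are not shown in the manifest above). The helpers hostProcPath and hostRootfsPath are illustrative stand-ins for the procFilePath and rootfsFilePath functions that appear in the source excerpts, not copies of them.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// Assumed values, matching the volumeMounts in the DaemonSet.
const (
	procPath   = "/host/proc"
	rootfsPath = "/host/root"
)

// hostProcPath maps a name under the host's /proc to its path in the container.
func hostProcPath(name string) string {
	return filepath.Join(procPath, name)
}

// hostRootfsPath maps an absolute host path to its path in the container.
func hostRootfsPath(name string) string {
	return filepath.Join(rootfsPath, name)
}

func main() {
	fmt.Println(hostProcPath("1/mounts"))          // /host/proc/1/mounts
	fmt.Println(hostRootfsPath("/root/workspace")) // /host/root/root/workspace
}
```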

2) node_exporter reads the mounted partitions
It reads /host/proc/1/mounts inside the container, which is the mount table of PID 1 (a host process):

    // node_exporter/collector/filesystem_linux.go
    func mountPointDetails() ([]filesystemLabels, error) {
        file, err := os.Open(procFilePath("1/mounts"))
        if os.IsNotExist(err) {
            // Fallback to `/proc/mounts` if `/proc/1/mounts` is missing due hidepid.
            log.Debugf("Got %q reading root mounts, falling back to system mounts", err)
            file, err = os.Open(procFilePath("mounts"))
        }
        if err != nil {
            return nil, err
        }
        defer file.Close()

        return parseFilesystemLabels(file)
    }

File contents:

    cat /host/proc/1/mounts
    /dev/sda1 /root/workspace xfs rw,relatime,attr2,inode64,noquota 0 0
    /dev/nvme0n1p2 /boot ext3 rw,relatime 0 0
    /dev/nvme0n1p1 /boot/efi vfat rw,relatime,fmask=0077,dmask=0077,codepage=936,iocharset=cp936,shortname=winnt,errors=remount-ro 0 0
    /dev/nvme0n1p3 / xfs rw,relatime,attr2,inode64,noquota 0 0

Parsing the file contents:

    func parseFilesystemLabels(r io.Reader) ([]filesystemLabels, error) {
        var filesystems []filesystemLabels

        scanner := bufio.NewScanner(r)
        for scanner.Scan() {
            parts := strings.Fields(scanner.Text())

            if len(parts) < 4 {
                return nil, fmt.Errorf("malformed mount point information: %q", scanner.Text())
            }

            // Ensure we handle the translation of \040 and \011
            // as per fstab(5).
            parts[1] = strings.Replace(parts[1], "\\040", " ", -1)
            parts[1] = strings.Replace(parts[1], "\\011", "\t", -1)

            filesystems = append(filesystems, filesystemLabels{
                device:     parts[0],
                mountPoint: parts[1],
                fsType:     parts[2],
                options:    parts[3],
            })
        }

        return filesystems, scanner.Err()
    }
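The parsing step can be tried in isolation. The following is a simplified, self-contained re-implementation, not the upstream code; mountEntry and parseMounts are illustrative names, and the /dev/sdb1 line is a made-up sample added to show the \040 escape being decoded.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// mountEntry holds the first four fields of one line of a mounts file.
type mountEntry struct {
	device, mountPoint, fsType, options string
}

// parseMounts splits each line on whitespace and decodes the octal
// escapes for space (\040) and tab (\011) in the mount point.
func parseMounts(text string) ([]mountEntry, error) {
	var entries []mountEntry
	scanner := bufio.NewScanner(strings.NewReader(text))
	for scanner.Scan() {
		parts := strings.Fields(scanner.Text())
		if len(parts) < 4 {
			return nil, fmt.Errorf("malformed mount line: %q", scanner.Text())
		}
		mp := strings.ReplaceAll(parts[1], `\040`, " ")
		mp = strings.ReplaceAll(mp, `\011`, "\t")
		entries = append(entries, mountEntry{parts[0], mp, parts[2], parts[3]})
	}
	return entries, scanner.Err()
}

func main() {
	sample := "/dev/sda1 /root/workspace xfs rw,relatime,attr2,inode64,noquota 0 0\n" +
		`/dev/sdb1 /mnt/my\040data ext4 rw,relatime 0 0` + "\n" // hypothetical entry

	entries, err := parseMounts(sample)
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		fmt.Printf("%s on %q type %s (%s)\n", e.device, e.mountPoint, e.fsType, e.options)
	}
}
```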

3) Query the size and usage of each partition
  • First, read the list of mounted partitions;
  • Then, for each mount point, call the statfs system call to get its size and usage;
  • If the statfs call fails, record deviceError=1 for that partition;

    // node_exporter/collector/filesystem_linux.go
    // GetStats returns filesystem stats.
    func (c *filesystemCollector) GetStats() ([]filesystemStats, error) {
        mps, err := mountPointDetails()
        if err != nil {
            return nil, err
        }
        stats := []filesystemStats{}
        for _, labels := range mps {
            ......

            // The success channel is used do tell the "watcher" that the stat
            // finished successfully. The channel is closed on success.
            success := make(chan struct{})
            go stuckMountWatcher(labels.mountPoint, success)

            // Call statfs on the mount point; the result is stored in buf.
            buf := new(syscall.Statfs_t)
            err = syscall.Statfs(rootfsFilePath(labels.mountPoint), buf)

            close(success)

            if err != nil {
                stats = append(stats, filesystemStats{
                    labels:      labels,
                    deviceError: 1,
                })
                log.Debugf("Error on statfs() system call for %q: %s", rootfsFilePath(labels.mountPoint), err)
                continue
            }

            var ro float64
            for _, option := range strings.Split(labels.options, ",") {
                if option == "ro" {
                    ro = 1
                    break
                }
            }

            stats = append(stats, filesystemStats{
                labels:    labels,
                size:      float64(buf.Blocks) * float64(buf.Bsize),
                free:      float64(buf.Bfree) * float64(buf.Bsize),
                avail:     float64(buf.Bavail) * float64(buf.Bsize),
                files:     float64(buf.Files),
                filesFree: float64(buf.Ffree),
                ro:        ro,
            })
        }
        return stats, nil
    }
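The statfs call at the heart of GetStats can be reproduced in a standalone Linux program. This is a minimal sketch: the path "/" is an arbitrary choice, and usedPercent is an illustrative helper that computes roughly the ratio df displays, used / (used + avail); it is not part of node_exporter.

```go
package main

import (
	"fmt"
	"syscall"
)

// usedPercent computes used / (used + avail) as a percentage, where
// used = size - free. It returns 0 when the denominator is not positive.
func usedPercent(size, free, avail float64) float64 {
	used := size - free
	if used+avail <= 0 {
		return 0
	}
	return used / (used + avail) * 100
}

func main() {
	path := "/" // inside the container this would be e.g. /host/root/<mountpoint>

	var buf syscall.Statfs_t
	if err := syscall.Statfs(path, &buf); err != nil {
		// This is the branch in which node_exporter sets deviceError=1.
		fmt.Printf("statfs %s failed: %v\n", path, err)
		return
	}

	// Same arithmetic as GetStats: block counts multiplied by block size.
	bsize := float64(buf.Bsize)
	size := float64(buf.Blocks) * bsize
	free := float64(buf.Bfree) * bsize
	avail := float64(buf.Bavail) * bsize

	fmt.Printf("%s: size=%.0f free=%.0f avail=%.0f used=%.1f%%\n",
		path, size, free, avail, usedPercent(size, free, avail))
}
```

free counts all unallocated blocks, while avail counts only those usable by unprivileged users, which is why the two can differ on filesystems with reserved blocks.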

Problem: node_exporter monitoring the host partition /root/workspace
On the host, /dev/sda1 is mounted at /root/workspace, but querying node_filesystem_size_bytes returns nothing for this partition.
Querying node_filesystem_device_error, however, does return an entry for it:

    node_filesystem_device_error{device="/dev/sda1",endpoint="https",fstype="xfs",instance="master1",job="node-exporter",mountpoint="/root/workspace",namespace="monitoring",pod="node-exporter-69hpl",service="node-exporter"} 1

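From the code above, the partition was evidently read from the mount table, but the statfs call on it failed.
Checking inside the container:
  • The host's / is mounted at /host/root in the container;
  • So the host's /root/workspace should be visible at /host/root/root/workspace

    /host/root $ ls root
    ls: can't open 'root': Permission denied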

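The cause is that the container process runs as nobody, which has no permission to access the host's /root directory (normally accessible only to root). statfs has to traverse every directory in the path, so the call on /host/root/root/workspace fails with a permission error.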
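The same diagnosis can be made programmatically. The sketch below, assuming a Linux system, calls statfs on a path and classifies the error; diagnoseStatfs is an illustrative name, and /host/root/root/workspace only exists inside the node_exporter container described here.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"syscall"
)

// diagnoseStatfs calls statfs on path and turns the result into a short label.
// syscall.Statfs returns a syscall.Errno, which errors.Is can match against
// the portable fs.ErrPermission and fs.ErrNotExist sentinels.
func diagnoseStatfs(path string) string {
	var buf syscall.Statfs_t
	err := syscall.Statfs(path, &buf)
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, fs.ErrPermission):
		return "permission denied"
	case errors.Is(err, fs.ErrNotExist):
		return "not found"
	default:
		return "error: " + err.Error()
	}
}

func main() {
	for _, p := range []string{"/host/root/root/workspace", "/"} {
		fmt.Printf("%-28s %s\n", p, diagnoseStatfs(p))
	}
}
```

Run as nobody inside the container, the first path would be expected to report "permission denied", matching the ls output above; run anywhere else it will most likely report "not found".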
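Solution: node_exporter monitoring the host partition /root/workspace
Modify node-exporter-daemonset.yaml so that the pod runs the container as root:

    ......
    securityContext:
      runAsUser: 0
    ......

If runAsNonRoot: true is still present, it must also be removed or set to false; otherwise the kubelet refuses to start a container that runs as UID 0.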

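This solves the problem, but running node_exporter as root may introduce security risks.
Note that if the host's partitions are not mounted under /root/..., node_exporter does not need to run as root and the configuration above can stay unchanged, because statfs can then read their size and usage.
Partitions are rarely mounted under /root/... in practice, so this problem is seldom encountered.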
References:
1. https://www.cnblogs.com/YaoDD...
2. https://kubernetes.io/docs/co...
