Linux Performance Optimization Study Series: uptime

Definition
Every run of uptime prints a single line like the following:

root@user:~# uptime
 20:54:13 up 428 days,  4:28,  7 users,  load average: 0.00, 0.10, 0.55

The meaning of each field is documented in man uptime:

man uptime

UPTIME(1)                     User Commands                     UPTIME(1)

NAME
       uptime - Tell how long the system has been running.

SYNOPSIS
       uptime [options]

DESCRIPTION
       uptime gives a one line display of the following information. The current time, how long the system has been running, how many users are currently logged on, and the system load averages for the past 1, 5, and 15 minutes.

       This is the same information contained in the header line displayed by w(1).

       System load averages is the average number of processes that are either in a runnable or uninterruptable state. A process in a runnable state is either using the CPU or waiting to use the CPU. A process in uninterruptable state is waiting for some I/O access, eg waiting for disk. The averages are taken over the three time intervals. Load averages are not normalized for the number of CPUs in a system, so a load average of 1 means a single CPU system is loaded all the time while on a 4 CPU system it means it was idle 75% of the time.

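The main purpose of this command is to show how long the system has been running. The first field is the current time, the second is how long the system has been up, the third is the number of users currently logged in, and the fourth is the average system load over the past 1, 5, and 15 minutes.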

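20:54:13                          // current time
up 428 days, 4:28,                // how long the system has been running
7 users,                          // number of users currently logged in
load average: 0.00, 0.10, 0.55    // average system load over the past 1, 5, and 15 minutes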
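
When scripting around uptime, it helps to pull the three load averages out of that line. The sketch below is one minimal way to do it; parse_load is a hypothetical helper name, and it assumes the English "load average:" label and a locale that uses a dot as the decimal separator.

```shell
# parse_load LINE -> prints the 1, 5, and 15 minute load averages.
# parse_load is a hypothetical helper, not part of uptime itself.
parse_load() {
    # Drop everything up to "load average:" and then remove the commas.
    printf '%s\n' "$1" | sed -e 's/.*load average: *//' -e 's/,//g'
}

# Sample line copied from the uptime output shown earlier.
line=' 20:54:13 up 428 days,  4:28,  7 users,  load average: 0.00, 0.10, 0.55'
parse_load "$line"   # prints: 0.00 0.10 0.55
```

On Linux the same three numbers are also the first three fields of /proc/loadavg, which is usually easier to read from a script than the uptime line.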

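The first three fields are easy to understand. But what is the system load average in the fourth field? First of all, it is not CPU usage. According to the man page, it is the average number of processes that are in a **runnable or uninterruptible (uninterruptable)** state per unit of time.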

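Runnable: the process is either using the CPU or waiting for its turn on the CPU.
Uninterruptible (uninterruptable): the process is inside a kernel code path that must not be interrupted, most commonly waiting for an I/O operation such as a disk access to complete.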

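So how should the load average be interpreted?
If the load average is 2, that means:

On a system with 2 CPUs, all CPUs are exactly fully occupied.
On a system with 4 CPUs, 50% of the CPU capacity is idle.
On a system with 1 CPU, half of the processes cannot get CPU time.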
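
The three cases above are simply the load average divided by the number of CPUs. A minimal sketch of that arithmetic follows; load_per_cpu is a hypothetical helper name introduced only for illustration.

```shell
# load_per_cpu LOAD NCPU -> load average divided by CPU count, two decimals.
# load_per_cpu is a hypothetical helper used only for this illustration.
load_per_cpu() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN { printf "%.2f\n", load / ncpu }'
}

load_per_cpu 2 2   # prints 1.00: every CPU is fully occupied
load_per_cpu 2 4   # prints 0.50: half of the CPU capacity is idle
load_per_cpu 2 1   # prints 2.00: two processes compete for one CPU
```

A value above 1.00 means there are more runnable or uninterruptible processes than CPUs.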

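Is system load the same as CPU usage?
CPU usage that stays close to 100% is clearly a sign of trouble. But how do we judge whether a load average is reasonable? By the explanation above, the ideal case is exactly one process running on each CPU, so we first need to know how many CPUs the system has: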

grep 'model name' /proc/cpuinfo | wc -l

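Once the CPU count is known, a load average above the CPU count means the system is overloaded. In a real production environment, it is worth investigating as soon as the load average exceeds 70% of the CPU count, because an excessive load makes the system respond slowly and affects the services running on it.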
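
The rule of thumb above can be written as a small check. This is only a sketch: load_status is a hypothetical helper, and the 0.7 factor is the 70% guideline from the text, not a value defined by any tool.

```shell
# load_status LOAD NCPU -> prints ok, investigate, or overloaded.
# load_status is a hypothetical helper; 0.7 is the 70% rule of thumb.
load_status() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN {
        if (load > ncpu)            print "overloaded"
        else if (load > 0.7 * ncpu) print "investigate"
        else                        print "ok"
    }'
}

load_status 0.55 1   # prints ok
load_status 3.00 4   # prints investigate (3.00 is above 0.7 * 4 = 2.8)
load_status 7.91 1   # prints overloaded
```

On a live system the two arguments could come from the first field of /proc/loadavg and from the CPU count obtained above (or from nproc).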
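But does a high load average necessarily mean high CPU usage?
Look back at the definition of load average: it counts not only processes that are using the CPU, but also processes waiting for the CPU and processes waiting for I/O. CPU usage, in contrast, measures how busy the CPU was per unit of time. The two are therefore not equivalent.
That leads to three possible situations: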

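CPU-intensive processes: heavy CPU use drives the load average up, and CPU usage is high as well.
I/O-intensive processes: waiting for I/O also drives the load average up, but CPU usage is not necessarily high.
A large number of processes waiting to be scheduled onto the CPU: the load average is high, and CPU usage is high too.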
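
The three situations can be told apart roughly by comparing the load average with CPU busy time and I/O wait time. The sketch below is a deliberately crude heuristic: classify_load is a hypothetical helper, the 0.7 factor and the comparisons are illustrative assumptions, and the sample numbers are illustrative values for a 1-CPU machine.

```shell
# classify_load LOAD NCPU BUSY_PCT IOWAIT_PCT
# BUSY_PCT is %usr + %sys and IOWAIT_PCT is %iowait, as reported by mpstat.
# Hypothetical helper; the thresholds are illustrative assumptions only.
classify_load() {
    awk -v load="$1" -v ncpu="$2" -v busy="$3" -v iowait="$4" 'BEGIN {
        if (load <= 0.7 * ncpu)  print "normal"
        else if (iowait > busy)  print "io-bound"
        else if (load > ncpu)    print "cpu-contention"
        else                     print "cpu-bound"
    }'
}

classify_load 1.00 1 100.00 0.00   # prints cpu-bound
classify_load 3.00 1 8.25 91.76    # prints io-bound
classify_load 7.91 1 100.00 0.00   # prints cpu-contention
```

A real diagnosis still needs per-process data, which is what pidstat provides.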

To verify these three situations, first install a few auxiliary performance-testing tools:

apt-get install -y stress
apt-get install -y stress-ng
apt-get install -y sysstat

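stress is a Linux stress-testing tool, used here to simulate a high load average caused by high CPU usage. It is not well suited to simulating heavy I/O, because its I/O workers call sync(), which flushes buffered data from memory to disk. On a freshly installed virtual machine the buffers may be small, so there is little to flush and little real I/O pressure. Most of the cost then comes from the system calls themselves, so the I/O-heavy case is not reproduced.
stress-ng is the next generation of stress with a richer set of options, used here to simulate heavy I/O.
sysstat is a toolkit for monitoring and analyzing system performance. The steps below mainly use mpstat and pidstat from it. Install version 11.5.5 or later, otherwise the %wait column is not shown.
- mpstat: a multi-core CPU analysis tool that shows performance metrics for each CPU as well as the average across all CPUs
- pidstat: a per-process analysis tool that shows CPU usage, memory usage, I/O, context switches, and more in real time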
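
Since the %wait column depends on the sysstat version, a quick version comparison can save some confusion. This is a minimal sketch: version_ge is a hypothetical helper, and it assumes a sort that supports -V (version sort), as GNU coreutils does.

```shell
# version_ge A B -> succeeds when version A is greater than or equal to B.
# Hypothetical helper; relies on "sort -V" from GNU coreutils.
version_ge() {
    # After a version sort, the smaller of the two versions comes first.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# The installed version is typed in by hand here; pidstat -V should print it.
installed=11.5.5
if version_ge "$installed" 11.5.5; then
    echo "sysstat $installed should show the %wait column"
else
    echo "sysstat $installed is too old for the %wait column"
fi
```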

CPU-intensive
First, check the current load average:

root@user:~# uptime
 21:53:02 up 428 days,  5:27,  7 users,  load average: 0.01, 0.01, 0.00

In the first terminal, run stress to simulate 100% CPU usage:

stress --cpu 1 --timeout 600

In the second terminal, watch the load average in real time:

root@user:~# watch -d uptime    // -d highlights the parts that change
Every 2.0s: uptime                                Sat Jul 13 21:56:17 2019

 21:56:17 up 428 days,  5:30,  7 users,  load average: 0.12, 0.04, 0.01

In the third terminal, use mpstat to watch how CPU usage changes:

root@user:~/soft/systat/sysstat-11.5.5# mpstat -P ALL 5
Linux 4.4.0-105-generic (user)  07/13/2019  _x86_64_  (1 CPU)

10:06:23 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
10:06:28 PM  all   99.58    0.00    0.42    0.00    0.00    0.00    0.00    0.00    0.00    0.00
10:06:28 PM    0   99.58    0.00    0.42    0.00    0.00    0.00    0.00    0.00    0.00    0.00

10:06:28 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
10:06:33 PM  all   99.79    0.00    0.21    0.00    0.00    0.00    0.00    0.00    0.00    0.00
10:06:33 PM    0   99.79    0.00    0.21    0.00    0.00    0.00    0.00    0.00    0.00    0.00

10:06:33 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
10:06:38 PM  all   99.78    0.00    0.22    0.00    0.00    0.00    0.00    0.00    0.00    0.00
10:06:38 PM    0   99.78    0.00    0.22    0.00    0.00    0.00    0.00    0.00    0.00    0.00

In the fourth terminal, use pidstat to find out which process is using the CPU:

root@user:~# pidstat -u 5 1
Linux 4.4.0-105-generic (user)  07/13/2019  _x86_64_  (1 CPU)

10:18:43 PM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
10:18:48 PM     0      9165    0.21    0.00    0.00    0.00    0.21     0  AliYunDun
10:18:48 PM     0     17652  100.00    0.00    0.00    0.64  100.00     0  stress

Average:      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
Average:        0      9165    0.21    0.00    0.00    0.00    0.21     -  AliYunDun
Average:        0     17652  100.00    0.00    0.00    0.64  100.00     -  stress

root@user:~# ps -ef | grep 17652
root     17652 17651 98 22:18 pts/5    00:00:53 stress --cpu 1 --timeout 600
root     17769 17695  0 22:19 pts/8    00:00:00 grep --color=auto 17652

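Terminal 2 shows the load average gradually approaching 1.
Terminal 3 shows the usage of CPU 0 close to 100%.
Terminal 4 shows that the stress process is the one with very high CPU usage.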
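
Picking the busiest process out of pidstat output can also be scripted. The sketch below assumes the 12-hour column layout shown in the output above (time, AM/PM, UID, PID, %usr, %system, %guest, %wait, %CPU, CPU, Command); with a 24-hour locale the field numbers shift by one. top_cpu is a hypothetical helper name.

```shell
# top_cpu: read "pidstat -u" output on stdin and print the PID, command,
# and %CPU of the busiest process. Hypothetical helper; field positions
# assume the 12-hour layout with a separate AM/PM column.
top_cpu() {
    awk '$2 ~ /^(AM|PM)$/ && $4 ~ /^[0-9]+$/ && ($9 + 0) > max {
             max = $9 + 0; pid = $4; pct = $9; cmd = $11
         }
         END { if (pid != "") print pid, cmd, pct }'
}

# Sample rows copied from the pidstat output above.
top_cpu <<'EOF'
10:18:43 PM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
10:18:48 PM     0      9165    0.21    0.00    0.00    0.00    0.21     0  AliYunDun
10:18:48 PM     0     17652  100.00    0.00    0.00    0.64  100.00     0  stress
EOF
# prints: 17652 stress 100.00
```

The header row and the Average: rows are skipped because their second or fourth field does not match the expected pattern.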
I/O-intensive
In terminal 1, run stress-ng to simulate I/O pressure:

root@user:~# stress-ng -i 1 --hdd 1 --timeout 600
stress-ng: info:  [17352] dispatching hogs: 1 hdd, 1 iosync
stress-ng: info:  [17352] cache allocate: default cache size: 33792K

In terminal 2, watch the load average:

root@user:~# watch -d uptime    // -d highlights the parts that change

The load average climbs above 3.
In terminal 3, check CPU usage and I/O wait:

root@user:~/soft/systat/sysstat-11.5.5# mpstat -P ALL 5
Linux 4.4.0-105-generic (user)  07/13/2019  _x86_64_  (1 CPU)

10:15:38 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
10:15:43 PM  all    0.45    0.00    7.80   91.76    0.00    0.00    0.00    0.00    0.00    0.00
10:15:43 PM    0    0.45    0.00    7.80   91.76    0.00    0.00    0.00    0.00    0.00    0.00

10:15:43 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
10:15:48 PM  all    0.43    0.00   13.36   86.21    0.00    0.00    0.00    0.00    0.00    0.00
10:15:48 PM    0    0.43    0.00   13.36   86.21    0.00    0.00    0.00    0.00    0.00    0.00

10:15:48 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
10:15:53 PM  all    0.65    0.00    9.29   90.06    0.00    0.00    0.00    0.00    0.00    0.00
10:15:53 PM    0    0.65    0.00    9.29   90.06    0.00    0.00    0.00    0.00    0.00    0.00

10:15:53 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
10:15:58 PM  all    0.67    0.00    7.78   91.56    0.00    0.00    0.00    0.00    0.00    0.00
10:15:58 PM    0    0.67    0.00    7.78   91.56    0.00    0.00    0.00    0.00    0.00    0.00

10:15:58 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
10:16:03 PM  all    0.65    0.00   14.07   85.28    0.00    0.00    0.00    0.00    0.00    0.00
10:16:03 PM    0    0.65    0.00   14.07   85.28    0.00    0.00    0.00    0.00    0.00    0.00

CPU usage (%usr plus %sys) peaks at only about 14%, while %iowait stays between roughly 85% and 92%.
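
To put numbers on that observation, the five samples can be averaged. This is a small sketch using the %iowait and %sys values copied from the mpstat output above; avg is a hypothetical helper name.

```shell
# avg N1 N2 ... -> arithmetic mean of the arguments, two decimals.
# avg is a hypothetical helper used only for this illustration.
avg() {
    printf '%s\n' "$@" | awk '{ s += $1 } END { if (NR) printf "%.2f\n", s / NR }'
}

avg 91.76 86.21 90.06 91.56 85.28   # %iowait samples, prints 88.97
avg 7.80 13.36 9.29 7.78 14.07      # %sys samples, prints 10.46
```

On average the CPU spends close to 90% of its time idle with I/O outstanding and only about 10% in kernel code, even though the load average is above 3.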
In terminal 4, find out which process is responsible for the I/O:

root@user:~# pidstat -u 5 1
Linux 4.4.0-105-generic (user)  07/13/2019  _x86_64_  (1 CPU)

10:21:25 PM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
10:21:30 PM     0       156    0.00    0.90    0.00    0.68    0.90     0  jbd2/vda1-8
10:21:30 PM     0      9165    0.45    0.00    0.00    0.00    0.45     0  AliYunDun
10:21:30 PM     0     15654    0.23    0.00    0.00    0.00    0.23     0  watch
10:21:30 PM     0     17918    0.00   15.32    0.00    0.90   15.32     0  stress-ng-hdd
10:21:30 PM     0     20490    0.00    2.03    0.00    0.45    2.03     0  kworker/u2:2

Average:      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
Average:        0       156    0.00    0.90    0.00    0.68    0.90     -  jbd2/vda1-8
Average:        0      9165    0.45    0.00    0.00    0.00    0.45     -  AliYunDun
Average:        0     15654    0.23    0.00    0.00    0.00    0.23     -  watch
Average:        0     17918    0.00   15.32    0.00    0.90   15.32     -  stress-ng-hdd
Average:        0     20490    0.00    2.03    0.00    0.45    2.03     -  kworker/u2:2

root@user:~# ps -ef | grep 17918
root     17918 17917 14 22:20 pts/5    00:00:15 stress-ng -i 1 --hdd 1 --timeout 600
root     18083 17695  0 22:22 pts/8    00:00:00 grep --color=auto 17918

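The stress-ng-hdd process stands out: its CPU time is almost entirely %system, which points to it as the source of the I/O. (pidstat -d reports per-process I/O directly.)
Many competing processes
When more processes want to run than the CPUs can handle, processes start to spend time waiting for the CPU.
In terminal 1, use stress to simulate 8 processes: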

stress -c 8 --timeout 600

In terminal 2, check the system load:

Every 2.0s: uptime                                Sat Jul 13 22:30:47 2019

 22:30:57 up 428 days,  6:05,  4 users,  load average: 7.91, 5.70, 3.19

The 1-minute load average is close to the number of running stress processes.
In terminal 3, check CPU usage:

root@user:~/soft/systat/sysstat-11.5.5# mpstat -P ALL 5
Linux 4.4.0-105-generic (user)  07/13/2019  _x86_64_  (1 CPU)

10:27:21 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
10:27:26 PM  all   99.56    0.00    0.44    0.00    0.00    0.00    0.00    0.00    0.00    0.00
10:27:26 PM    0   99.56    0.00    0.44    0.00    0.00    0.00    0.00    0.00    0.00    0.00

10:27:26 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
10:27:31 PM  all   99.80    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00    0.00
10:27:31 PM    0   99.80    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00    0.00

10:27:31 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
10:27:36 PM  all   99.57    0.00    0.43    0.00    0.00    0.00    0.00    0.00    0.00    0.00
10:27:36 PM    0   99.57    0.00    0.43    0.00    0.00    0.00    0.00    0.00    0.00    0.00

10:27:36 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
10:27:41 PM  all   99.79    0.00    0.21    0.00    0.00    0.00    0.00    0.00    0.00    0.00
10:27:41 PM    0   99.79    0.00    0.21    0.00    0.00    0.00    0.00    0.00    0.00    0.00

10:27:41 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
10:27:46 PM  all   99.79    0.00    0.21    0.00    0.00    0.00    0.00    0.00    0.00    0.00
10:27:46 PM    0   99.79    0.00    0.21    0.00    0.00    0.00    0.00    0.00    0.00    0.00

10:27:46 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
10:27:51 PM  all   99.79    0.00    0.21    0.00    0.00    0.00    0.00    0.00    0.00    0.00
10:27:51 PM    0   99.79    0.00    0.21    0.00    0.00    0.00    0.00    0.00    0.00    0.00

CPU usage has reached 100%.
In terminal 4, find out which processes are driving the CPU usage:

root@user:~# pidstat -u 5 1
Linux 4.4.0-105-generic (user)  07/13/2019  _x86_64_  (1 CPU)

10:29:30 PM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
10:29:35 PM     0      9165    0.21    0.21    0.00    0.00    0.43     0  AliYunDun
10:29:35 PM     0     18501   13.52    0.00    0.00   93.99   13.52     0  stress
10:29:35 PM     0     18502   13.52    0.00    0.00   94.21   13.52     0  stress
10:29:35 PM     0     18503   11.16    0.00    0.00  100.00   11.16     0  stress
10:29:35 PM     0     18504   14.38    0.00    0.00   83.48   14.38     0  stress
10:29:35 PM     0     18505   13.52    0.00    0.00   94.21   13.52     0  stress
10:29:35 PM     0     18506   13.52    0.00    0.00   94.21   13.52     0  stress
10:29:35 PM     0     18507   13.52    0.00    0.00   94.42   13.52     0  stress
10:29:35 PM     0     18508   13.30    0.00    0.00   93.78   13.30     0  stress

root@user:~# ps -ef | grep 18503
root     18503 18500 12 22:27 pts/5    00:00:36 stress -c 8 --timeout 600
root     18948 17695  0 22:32 pts/8    00:00:00 grep --color=auto 18503

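The stress processes are what drive the CPU to 100%: the eight of them share a single CPU, so each gets only about 11% to 14% of it and spends most of its time waiting to be scheduled (%wait is mostly above 90%).
By now the relationship between the system load average and CPU usage should be reasonably clear, along with how to tell which situation applies.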
Reference: https://time.geekbang.org/column/article/69618