贵有恒,何必三更起、五更眠、最无益,只怕一日曝、十日寒。这篇文章主要讲述linux安装prometheus+grafana+alermanager相关的知识,希望能为你提供帮助。
Liunx装prometheus+grafana+alermanager
一 软件与系统版本信息
软件版本 |
├── grafana-enterprise-8.0.5-1.x86_64.rpm └── prometheus-2.32.1.linux-amd64.tar.gz └──node_exporter-1.3.1.linux-amd64.tar.gz └── alertmanager-0.23.0.linux-amd64.tar.gz |
系统内核版本 | 系统版本 |
2.6.32-642.el6.x86_64 | CentOS release 6.8(Final) |
3.10.0-1160.el7.x86_64 | CentOS Linuxrelease 7.9.2009 (Core) |
二 安装prometheus
[root@master ~]# tar
zxvf prometheus-2.32.1.linux-amd64.tar.gz -C /home
[root@master ~]# mv
/home/prometheus-2.32.1.linux-amd64/ /home/prometheus-2.32.1
[root@master ~]# cd
/home/prometheus-2.32.1/ & & nohup ./prometheus &
[root@master ~]#
netstat -ntpl
Active Internetconnections (only servers) Proto Recv-QSend-Q Local Address Foreign Address State PID/Programname tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 1360/sshd tcp 0 0 :::9090 :::* LISTEN 1528/./prometheus #服务启动成功 tcp 0 0 :::22 :::* LISTEN 1360/sshd |
[root@master ~]# cat > /usr/lib/systemd/system/prometheus.service < < EOF [Unit] Description=https://prometheus.io [Service] Restart=on-failure ExecStart=/home/prometheus-2.32.1/prometheus--config.file=/home/prometheus-2.32.1/prometheus.yml [Install] WantedBy=multi-user.target EOF |
三 node_exporter 安装(所有被控端的linux主机都需要安装)
将 node_exporter-1.3.1.linux-amd64.tar.gz上传至客户端
[root@master ~]# tar
zxvf node_exporter-1.3.1.linux-amd64.tar.gz -C /home
[root@master ~]# mv
/home/node_exporter-1.3.1.linux-amd64/ /home/node_exporter-1.3.1
[root@master ~]# cd
/home/node_exporter-1.3.1/ & & nohup ./node_exporter &
[root@master ~]#
netstat -ntpl
Active Internetconnections (only servers) Proto Recv-QSend-Q Local Address ForeignAddress State PID/Program name tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 961/sshd tcp 0 0 127.0.0.1:25 0.0.0.0:* LISTEN 1150/master tcp6 0 0 :::9100 :::* LISTEN 2153/./node_exporte tcp6 0 0 :::22 :::* LISTEN 961/sshd tcp6 0 0 ::1:25 :::* LISTEN 1150/master tcp6 0 0 :::9090 :::* LISTEN 2130/./prometheus |
[root@master~]# cat > /etc/systemd/system/node_exporter.service < < EOF [Unit] Description=node_exporter Documentation=https://prometheus.io/docs/introduction/overview After=network-online.target remote-fs.targetnss-lookup.target Wants=network-online.target [Service] Type=simple PIDFile==/var/run/node_exporter.pid ExecStart=/home/node_exporter-1.3.1/node_exporter ExecReload=/bin/kill -s HUP $MAINPID ExecStop=/bin/kill -s TERM $MAINPID [Install] WantedBy=multi-user.target EOF |
[root@master ~]# vim /home/prometheus-2.32.1/prometheus.yml
添加以下信息,要注意格式 - job_name: "linux" static_configs: - targets:["192.168.1.60:9100","192.168.1.61:9100","192.168.1.62:9100"] |
设置开机启动
[root@master ~]#
systemctl daemon-reload
[root@master ~]#
systemctl enable node_exporter
[root@master ~]#
systemctl start node_exporter
[root@master ~]#
systemctl enable prometheus
[root@master ~]#
systemctl start prometheus
查看状态
五 安装grafana
[root@master ~]# yum -y install urw-fonts[root@master ~]# rpm -ivh grafana-enterprise-8.0.5-1.x86_64.rpm
warning:grafana-enterprise-8.0.5-1.x86_64.rpm: Header V4 RSA/SHA256 Signature, key ID24098cb6: NOKEY Preparing... ################################# [100%] Updating /installing... 1:grafana-enterprise-8.0.5-1 #################################[100%] ### NOT startingon installation, please execute the following statements to configure grafanato start automatically using systemd sudo /bin/systemctl daemon-reload sudo /bin/systemctl enablegrafana-server.service ### You can startgrafana-server by executing sudo /bin/systemctl startgrafana-server.service POSTTRANS: Runningscript |
[root@master ~]# systemctl enable grafana-server
[root@master~]# systemctl start grafana-server
[root@master ~]# netstat -ntpl
5.1.安装pie插件(饼图显示报错再安装)[root@master ~]#
grafana-cli plugins install grafana-piechart-panel
2.重启grafana-server
[root@master ~]#
systemctl restart grafana-server
打开grafans网页
prometheus+grafana安装完成
六 alermanager安装
[root@master~ ]# tar zxvf alertmanager-0.23.0.linux-amd64.tar.gz -C /home/
[root@master~ ]# mv /home/alertmanager-0.23.0.linux-amd64 /home/alertmanager-0.23.0
七 配置报警监控模板
[root@master~ ]# cd /home/alertmanager-0.23.0
[root@master alertmanager-0.23.0]# cp alertmanager.yml alertmanager.yml-back
[root@master alertmanager-0.23.0]# vim alertmanager.yml
global: resolve_timeout: 5m smtp_smarthost: smtp.163.com:25 smtp_from: 18600000000@163.com smtp_auth_username: 18600000000@163.com smtp_auth_password: XXXXXXXXXX smtp_require_tls: false route: group_by: [alertname] group_wait: 10s group_interval: 1m repeat_interval: 3m receiver: mail receivers: - name: mail email_configs: - to: 58888888@qq.com headers:Subject: "[WARN] 报警邮件" send_resolved: true inhibit_rules: - source_match: severity: critical target_match: severity: warning equal: [alertname, dev, instance] |
如图:
[root@master ~ ]# mkdir /home/alertmanager-0.23.0/rules
[root@master ~ ]# vim /home/alertmanager-0.23.0/rules/node.yml
groups: - name: 主机状态-监控告警 rules: - alert: 主机状态 expr: up == 0 for: 5m labels: status: 非常严重 annotations: summary:"$labels.instance:服务器宕机" description:"$labels.instance:服务器延时超过5分钟" - alert: CPU使用情况 expr:100-(avg(irate(node_cpu_seconds_totalmode="idle"[5m]))by(instance)* 100) > 80 for: 1m labels: status: 一般告警 annotations: summary: "$labels.instanceCPU使用率过高!" description:"$labels.instanceCPU使用大于80%(目前使用:$value%)" - alert: 内存使用 expr: (1 -(node_memory_MemAvailable_bytes / (node_memory_MemTotal_bytes))) * 100 > 80 for: 1m labels: status: 严重告警 annotations: summary: "$labels.instance内存使用率过高!" description:"$labels.instance内存使用大于80%(目前使用:$value%)" - alert: IO性能 expr:(avg(irate(node_disk_io_time_seconds_total[1m])) by(instance)* 100) > 80 for: 1m labels: status: 严重告警 annotations: summary: "$labels.instance流入磁盘IO使用率过高!" description:"$labels.instance流入磁盘IO大于80%(目前使用:$value)" - alert: 网络 expr: ((sum(rate(node_network_receive_bytes_totaldevice!~tap.*|veth.*|br.*|docker.*|virbr*|lo*[5m]))by (instance)) / 100) > 102400 for: 1m labels: status: 严重告警 annotations: summary: "$labels.instance流入网络带宽过高!" description:"$labels.instance 流入网络带宽持续2分钟高于100M. RX带宽使用率$value" - alert: 网络 expr: ((sum(rate(node_network_transmit_bytes_totaldevice!~tap.*|veth.*|br.*|docker.*|virbr*|lo*[5m]))by (instance)) / 100) > 102400 for: 1m labels: status: 严重告警 annotations: summary: "$labels.instance流出网络带宽过高!" description:"$labels.instance 流出网络带宽持续2分钟高于100M. RX带宽使用率$value" - alert: TCP会话 expr: node_netstat_Tcp_CurrEstab > 1000 for: 1m labels: status: 严重告警 annotations: summary: "$labels.instanceTCP_ESTABLISHED过高!" description:"$labels.instanceTCP_ESTABLISHED大于1000(目前使用:$value%)" - alert: 磁盘容量 expr:100-(node_filesystem_free_bytesfstype=~"ext4|xfs"/node_filesystem_size_bytesfstype=~"ext4|xfs"*100) > 80 for: 1m labels: status: 严重告警 annotations: summary: "$labels.instance磁盘分区使用率过高!" description:"$labels.instance磁盘分区使用大于80%(目前使用:$value%)" |
修改prometheus.yml配置文件,配置报警规则,打开alerting
和 rule_files 文件指定(增加红色字体部分)
[root@master ~]# vim /home/prometheus-2.32.1/prometheus.yml
# my global config global: scrape_interval: 15s # Set the scrapeinterval to every 15 seconds. Default is every 1 minute. evaluation_interval: 15s # Evaluate rulesevery 15 seconds. The default is every 1 minute. # scrape_timeout is set to the globaldefault (10s). # Alertmanagerconfiguration alerting: alertmanagers: - static_configs: -targets: ["192.168.1.60:9093"] # - alertmanager:9093 # Load rules onceand periodically evaluate them according to the global evaluation_interval. rule_files: -"/home/alertmanager-0.23.0/rules/*.yml" # - "first_rules.yml" # - "second_rules.yml" # A scrapeconfiguration containing exactly one endpoint to scrape: # Here itsPrometheus itself. scrape_configs: # The job name is added as a label`job=< job_name> ` to any timeseries scraped from this config. - job_name: "prometheus" #抓取间隔时间 scrape_interval:5s # metrics_path defaults to /metrics # scheme defaults to http. static_configs: - targets:["localhost:9090","192.168.2.231:9100"] - job_name: "linux-host" scrape_interval: 5s # metrics_path defaults to /metrics # scheme defaults to http. static_configs: - targets:["192.168.2.230:9100"] |
重启服务
[root@master~]# cat > /home/alertmanager-0.23.0/start_alertmanager.sh < < EOF #!/bin/bash nohup/home/alertmanager-0.23.0/alertmanager--config.file="/home/alertmanager-0.23.0/alertmanager.yml" > /home/alertmanager-0.23.0/alertmanager.log 2> & 1 & EOF |
[root@master ~]# sh
/home/alertmanager-0.23.0/start_alertmanager.sh
[root@master ~]#
netstat -ntpl
重启 prometheus 服务
[root@master ~ ]# kill -9 `ps -ef | grep prometheus | grep -v grep |awk print $2`
[root@master ~ ]# systemctl start prometheus
查看监控项配置信息
八 测试邮件报警
[root@node1 ~ ] # kill -9 `ps -ef | grep node_exporter | grep -v grep |awk print $2`
进入邮箱查看告警邮件:
开启node1主机的 node_exporter 服务
收到恢复邮件
九 配置邮件告警模板
[root@master ~]# cat /home/alertmanager-0.23.0/template/wechat.tmpl
(邮件告警模板)
define"wechat.default.message" - if gt (len.Alerts.Firing) 0 - - range $index,$alert := .Alerts - - if eq $index 0- **********告警通知**********< br> 告警类型:$alert.Labels.alertname < br> 告警级别:$alert.Labels.severity < br> - end =====================< br> 告警主题:$alert.Annotations.summary < br> 告警详情:$alert.Annotations.description < br> 故障时间:$alert.StartsAt.Local.Format"2006-01-02 15:04:05" < br> if gt (len $alert.Labels.instance) 0 -故障实例:$alert.Labels.instance - end - - end - end - if gt (len.Alerts.Resolved) 0 - - range $index,$alert := .Alerts - - if eq $index 0- **********恢复通知**********< br> 告警类型:$alert.Labels.alertname < br> 告警级别:$alert.Labels.severity < br> - end =====================< br> 告警主题:$alert.Annotations.summary < br> 告警详情:$alert.Annotations.description < br> 故障时间:$alert.StartsAt.Local.Format"2006-01-02 15:04:05" < br> 恢复时间:$alert.EndsAt.Local.Format "2006-01-0215:04:05" < br> if gt (len $alert.Labels.instance) 0 -故障实例:$alert.Labels.instance - end-< br> - end - end - end |
[root@master ~]# cat /home/alertmanager-0.23.0/alertmanager.yml
global: resolve_timeout: 5m smtp_smarthost: smtp.163.com:25 smtp_from: 1866@163.com smtp_auth_username: 1866@163.com smtp_auth_password: RKONLSDFSDMFZM smtp_require_tls: false templates: -/home/alertmanager-0.23.0/template/wechat.tmpl route: group_by: [alertname] group_wait: 10s group_interval: 20s repeat_interval: 3m receiver: mail receivers: - name: mail email_configs: - to: 534234548@qq.com headers:Subject: "[WARN] 报警邮件" send_resolved: true html:template"wechat.default.message" . inhibit_rules: - source_match: severity: critical target_match: severity: warning equal: [alertname, dev, instance] |
查看邮件
十 配置(微信告警模板)
[root@master ~]# cat /home/alertmanager-0.23.0/template/wechat.tmpl (微信告警模板)
define"wechat.default.message" - if gt (len.Alerts.Firing) 0 - - range $index,$alert := .Alerts - - if eq $index 0- **********告警通知********** 告警类型:$alert.Labels.alertname 告警级别:$alert.Labels.severity - end ===================== 告警主题:$alert.Annotations.summary 告警详情:$alert.Annotations.description 故障时间:$alert.StartsAt.Local.Format"2006-01-02 15:04:05" if gt (len $alert.Labels.instance) 0 -故障实例:$alert.Labels.instance - end - - end - end - if gt (len.Alerts.Resolved) 0 - - range $index,$alert := .Alerts - - if eq $index 0- **********恢复通知********** 告警类型:$alert.Labels.alertname 告警级别:$alert.Labels.severity - end ===================== 告警主题:$alert.Annotations.summary 告警详情:$alert.Annotations.description 故障时间:$alert.StartsAt.Local.Format"2006-01-02 15:04:05" 恢复时间:$alert.EndsAt.Local.Format "2006-01-0215:04:05" if gt (len $alert.Labels.instance) 0 -故障实例:$alert.Labels.instance - end - - end - end - end |
[root@master ~]# cat /home/alertmanager-0.23.0/alertmanager.yml
global: resolve_timeout: 5m wechat_api_url: ??https://qyapi.weixin.qq.com/cgi-bin/?? templates: -/home/alertmanager-0.23.0/template/wechat.tmpl route: group_by: [alertname] group_wait: 10s group_interval: 10s repeat_interval: 3m receiver: wechat receivers: - name: wechat wechat_configs: - corp_id: weterterterterhc6 to_party: 1 agent_id: 1000002 api_secret: Ha_wefsdfertgretgerguRCVPnzvK1fY send_resolved: true inhibit_rules: - equal:[alertname, cluster, service] source_match: severity: high target_match: severity: warning |
【linux安装prometheus+grafana+alermanager】
推荐阅读
- 使用ultraiso制作启动盘
- LINUX下载编译ccrtp(未成功)
- redis 数据量大,异常停机导致全量同步问题
- 豌豆荚看不到我的图片的处理方案
- 豌豆荚:我的照片保存位置搜索
- 靠谱助手玩游戏卡机的处理妙招
- 靠谱助手清除游戏数据的技巧
- 删除爱思助手的图文详细教程
- 豌豆荚:银行卡充值交易记录的提取