Link Monitoring
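Previous approaches
Full check: logically, compare the result of select * from tcbuyer.order with the result of select * from tc.order.
Pseudo-incremental check: compare the data from the previous hour.
Single-stream incremental check: an event-based comparison. When the buyer database creates an order, MySQL produces a corresponding binlog entry. The single-stream incremental check system uses that binlog entry as its trigger, parses its content, and queries the seller database in real time to see whether the matching record exists.
AMG's check graph model: Check Graph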
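Suppose the trade link has four business systems that need reconciliation: trade, inventory, funds, and payment. The events involved are, respectively, the order-placement event, the stock-deduction event, the red-envelope funds usage event, and the payment event. The reconciliation requirements are: check trade events against inventory events; check trade events against funds events; check trade events against payment events; check funds events against payment events. If any one of these four checks fails, the business systems are considered abnormal and the problem must be reported promptly. This is clearly a graph model: if event A is checked against event B, there is an edge connecting nodes A and B. Taking events as nodes (Node) and the check methods between events as edges (Edge) yields a graph (Graph) model. For the scenario above, the graph looks like this: trade <-----check---->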
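The node-and-edge model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `CheckGraph`, the `same_order` check, and the `order_id` field are hypothetical and do not come from AMG itself.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A check takes the payloads of two events and returns True when they agree.
Check = Callable[[dict, dict], bool]


@dataclass
class CheckGraph:
    """Events are nodes; each edge carries the check between two events."""
    edges: List[Tuple[str, str, Check]] = field(default_factory=list)

    def add_edge(self, a: str, b: str, check: Check) -> None:
        self.edges.append((a, b, check))

    def run(self, events: Dict[str, dict]) -> List[Tuple[str, str]]:
        """Return the edges whose check failed or whose event is missing."""
        failed = []
        for a, b, check in self.edges:
            ea, eb = events.get(a), events.get(b)
            if ea is None or eb is None or not check(ea, eb):
                failed.append((a, b))
        return failed


def same_order(x: dict, y: dict) -> bool:
    # Hypothetical check: both events must refer to the same order.
    return x["order_id"] == y["order_id"]


# The four reconciliation requirements from the text, one edge each.
graph = CheckGraph()
graph.add_edge("trade", "inventory", same_order)
graph.add_edge("trade", "funds", same_order)
graph.add_edge("trade", "payment", same_order)
graph.add_edge("funds", "payment", same_order)
```

Calling `graph.run(events)` with one payload per node returns the failing edges, so an empty list means all four checks passed.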
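To decouple business logic and to protect the performance and availability of the main link, the group's systems depend on each other through many synchronous and asynchronous calls, with both strong and weak dependencies. Network jitter, a bug in a business system, or a failure in some subsystem can leave business data inconsistent. Take the two most critical systems, trade and inventory: if a user places an order and the stock is not deducted, overselling is likely; if a user closes an order and the stock is not restored, the result is underselling. Both are data inconsistencies between the trade and inventory systems.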
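The two trade/inventory mismatches just described can be expressed as a small check over event lists. The sketch below is a simplification, not the production logic: the event shape (`order_id`, `action`) and the action names `create`, `close`, `deduct`, and `restore` are assumptions made for illustration.

```python
def find_unmatched_orders(order_events, stock_events):
    """Return (oversell_risk, undersell_risk) lists of order ids.

    oversell_risk:  orders that were created but have no stock deduction.
    undersell_risk: orders that were closed but have no stock restoration.
    """
    deducted = {e["order_id"] for e in stock_events if e["action"] == "deduct"}
    restored = {e["order_id"] for e in stock_events if e["action"] == "restore"}

    oversell, undersell = [], []
    for e in order_events:
        if e["action"] == "create" and e["order_id"] not in deducted:
            oversell.append(e["order_id"])
        elif e["action"] == "close" and e["order_id"] not in restored:
            undersell.append(e["order_id"])
    return oversell, undersell
```

In an event-triggered setup the same comparison would run per incoming event rather than over a batch; the batch form is used here only to keep the example self-contained.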
from influxdb import InfluxDBClient

# One point in the dict format accepted by write_points().
json_body = [
    {
        "measurement": "cpu_load_short",
        "tags": {
            "host": "server01",
            "region": "us-west"
        },
        "time": "2009-11-10T23:00:00Z",
        "fields": {
            "value": 0.64
        }
    }
]

client = InfluxDBClient('localhost', 8086, 'root', 'root', 'example')
client.create_database('example')
client.write_points(json_body)
result = client.query('select value from cpu_load_short;')
print("Result: {0}".format(result))

insert prism_trace_log,serverApp='camel',serviceName='index.api', rt=50 '2017-09-08 13:00:01'
// the TRACE type does not output rpcId by default
==========================BaseModel
traceId
rpcId
timestamp
rpcType
hostIp
==========================RpcModel
clientApp
clientIp
clientSpan
serverApp
serverIp
serverSpan
opName // operation name, usually determined by the RPC, e.g. LOCAL, SYNC, CALLBACK, FUTURE; for databases, e.g. QUERY, UPDATE, INSERT, DELETE
opType // operation type, usually determined by the RPC, e.g. the serialization method or a read/write flag; for databases, R or W for read or write
serviceName // interface name
methodName // method name
error //0
result // 1,2,3,3,4,5
==========================
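To connect the model above with the InfluxDB write example, here is a hedged sketch that turns one record into the point dict shape used by `json_body`. Only a subset of the listed fields is modelled, and the split between tags and fields (`TAG_KEYS`) is an assumption; the notes do not say which columns are tags.

```python
from dataclasses import dataclass, asdict


@dataclass
class RpcRecord:
    # A subset of the BaseModel / RpcModel fields listed above.
    traceId: str
    rpcId: str
    timestamp: str      # ISO-8601 string, as in the json_body example
    rpcType: int
    serverApp: str
    serviceName: str
    serverSpan: int
    error: int = 0


# Assumption: columns used in where / group by clauses are stored as tags.
TAG_KEYS = ("rpcType", "serverApp", "serviceName")


def to_point(rec: RpcRecord, measurement: str = "prism_trace") -> dict:
    """Convert a record into the dict format accepted by write_points()."""
    data = asdict(rec)
    ts = data.pop("timestamp")
    tags = {k: str(data.pop(k)) for k in TAG_KEYS}  # tag values are strings
    return {"measurement": measurement, "tags": tags, "time": ts, "fields": data}
```

A list of such dicts could then be passed to `client.write_points(...)` in the same way as `json_body`.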
HTTP total
select count(*),sum(error),avg(serverSpan) from prism_trace where rpcType=0 and serverApp = ?
HTTP by page
select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=0 and serverApp = ? group by serviceName
RPC total
select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=1 and serverApp = ?
RPC by service
select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=1 and serverApp = ? group by serviceName
RPC service sources
select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=1 and serverApp = ? and serviceName=? group by clientApp
RPC service destinations
select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=1 and clientApp = ? and clientService=? group by serverApp
RPC application sources
select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=1 and serverApp = ? group by clientApp
RPC application destinations
select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=1 and clientApp = ? group by serverApp
DB total
select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=3 and serverApp = ?
DB by table
select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=3 and serverApp = ? group by serviceName
DB table sources
select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=3 and serverApp = ? and serviceName=? group by clientApp
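To see what these aggregate queries return, the "RPC by service" statistic can be tried against an in-memory SQLite table. This is a sketch only: the table layout and the sample rows are invented for the example, and the real `prism_trace` store may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table prism_trace ("
    "rpcType integer, clientApp text, serverApp text, "
    "serviceName text, serverSpan integer, error integer)"
)
# Invented sample rows: (rpcType, clientApp, serverApp, serviceName, serverSpan, error)
rows = [
    (1, "camel", "whale", "MemberQueryService", 100, 0),
    (1, "camel", "whale", "MemberQueryService", 200, 1),
    (1, "tlive", "whale", "CommentService", 90, 0),
    (0, "", "whale", "/login.do", 500, 0),
]
conn.executemany("insert into prism_trace values (?,?,?,?,?,?)", rows)


def rpc_by_service(conn, server_app):
    """Call count, average RT, max RT and error count per service."""
    sql = (
        "select serviceName, count(*), avg(serverSpan), max(serverSpan), sum(error) "
        "from prism_trace where rpcType=1 and serverApp = ? "
        "group by serviceName order by serviceName"
    )
    return conn.execute(sql, (server_app,)).fetchall()


for row in rpc_by_service(conn, "whale"):
    print(row)
```

The `?` placeholder is bound through the driver, which is how the `serverApp = ?` parameter in the queries above would normally be supplied.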
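Error types:
/**
 * Unknown
 */
UNKNOWN,
/**
 * Success
 */
OK,
/**
 * Business error
 */
BIZ_ERROR,
/**
 * RPC error
 */
RPC_ERROR,
/**
 * Timeout
 */
TIMEOUT,
/**
 * Soft error, generally used for cases such as resource not found, cache miss,
 * failure to acquire a lock, or an update skipped because of a version mismatch;
 * how it is determined depends on the middleware
 */
SOFT_ERROR,
/**
 * Rate-limiting error
 */
LIMIT_ERROR,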
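The queries in these notes aggregate with `sum(error)`, so each record needs a 0/1 error flag. The sketch below shows one way the result types above might be mapped to that flag. The mapping is an assumption: the notes do not say whether UNKNOWN or SOFT_ERROR count as errors, so the soft-error handling is left as a parameter.

```python
from enum import Enum


class ResultType(Enum):
    # Names mirror the error types listed above; the values are just labels.
    UNKNOWN = "UNKNOWN"
    OK = "OK"
    BIZ_ERROR = "BIZ_ERROR"
    RPC_ERROR = "RPC_ERROR"
    TIMEOUT = "TIMEOUT"
    SOFT_ERROR = "SOFT_ERROR"
    LIMIT_ERROR = "LIMIT_ERROR"


def error_flag(result: ResultType, count_soft_errors: bool = False) -> int:
    """Return 1 when the result should be counted by sum(error), else 0.

    Assumption: OK is never an error, SOFT_ERROR is counted only on request,
    and every other type (including UNKNOWN) is counted.
    """
    if result is ResultType.OK:
        return 0
    if result is ResultType.SOFT_ERROR and not count_soft_errors:
        return 0
    return 1
```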
The model fields are as follows:
OpName: DB operation name, e.g. QUERY, UPDATE, and (added in TDDL v5 and later) INSERT, DELETE
OpType: DB operation type, R or W for read or write
ServiceDim1: physical database name
ServiceDim2: tableName, e.g. JOIN:TABLE_A,TABLE_C,TABLE_B
ServiceDim3: logical SQL code
ServerName: (db@dbName), e.g. andor_mysql_group
ClientName: clientAppId
ServerDimKey: TDDL_opName@dbName:tableName
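The `ServerDimKey` pattern above is a plain string template, which a small helper makes concrete. The function names are hypothetical, and treating QUERY as the only read operation is an assumption; the notes only say that OpType is R or W.

```python
def server_dim_key(op_name: str, db_name: str, table_name: str) -> str:
    """Build a key following the TDDL_opName@dbName:tableName pattern."""
    return f"TDDL_{op_name}@{db_name}:{table_name}"


def op_type(op_name: str) -> str:
    """Map an operation name to R or W (assumption: only QUERY is a read)."""
    return "R" if op_name.upper() == "QUERY" else "W"


print(server_dim_key("QUERY", "andor_mysql_group", "TABLE_A"))
```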
tlive,,mtop/get.do(),500
1. tlive,fun,CommentService,save,100
2. tlive,fun,CommentService,save, 90
fun,db,"table1",100
tlive,fun,MemberService,save,200
fun,db,"table2",100
/*
 * Numeric codes for RPC types
 */
// @formatter:off
public static final int RPC_TYPE_UNKNOWN =255;
public static final int RPC_TYPE_TRACE =0;
public static final int RPC_TYPE_HSF =1;
public static final int RPC_TYPE_HSF_SERVER =2;
public static final int RPC_TYPE_NOTIFY =3;
public static final int RPC_TYPE_TDDL =4;
public static final int RPC_TYPE_TAIR =5;
public static final int RPC_TYPE_SEARCH =6;
public static final int RPC_TYPE_MASTER =11;
public static final int RPC_TYPE_SLAVE =12;
public static final int RPC_TYPE_METAQ =13;
public static final int RPC_TYPE_DRDS =14;
public static final int RPC_TYPE_TFS =15;
public static final int RPC_TYPE_ALIPAY =16;
public static final int RPC_TYPE_HTTP_B =20;
public static final int RPC_TYPE_HTTP =25;
public static final int RPC_TYPE_SENTINEL =26;
public static final int RPC_TYPE_LOCAL =30;
public static final int RPC_TYPE_JINGWEI =32;
public static final int RPC_TYPE_ISEARCH =36;
public static final int RPC_TYPE_LOCAL_NG =40;
public static final int RPC_TYPE_CSB_SERVER =52;
public static final int RPC_TYPE_HTTP_SERVER =251;
public static final int RPC_TYPE_METAQ_RCV =252;
public static final int RPC_TYPE_ACCESS =253;
public static final int RPC_TYPE_NOTIFY_RCV =254;
// custom RPC types
public static final int RPC_TYPE_CUSTOM_TRACE =90;
public static final int RPC_TYPE_CUSTOM_RPC_CLIENT =91;
public static final int RPC_TYPE_CUSTOM_RPC_SERVER =92;
public static final int RPC_TYPE_CUSTOM_MESSAGE_PUB =93;
public static final int RPC_TYPE_CUSTOM_MESSAGE_SUB =96;
public static final int RPC_TYPE_CUSTOM_DB =94;
public static final int RPC_TYPE_CUSTOM_CACHE =95;
public static final int RPC_TYPE_CUSTOM_PROTOCOL_CLIENT =97;
public static final int RPC_TYPE_CUSTOM_PROTOCOL_SERVER =98;
// @formatter:on
->A->B->C
client, server, type
-,a,0
a,b,1
a,b,2
b,c,1
b,c,2
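The table above follows a simple rule: the entry call is logged once with type 0, and every hop is logged twice, once by the client (type 1) and once by the server (type 2). A short sketch that generates those rows from a call chain; the function name is hypothetical.

```python
def chain_to_rows(chain):
    """Expand a call chain such as ["a", "b", "c"] into (client, server, type) rows."""
    if not chain:
        return []
    rows = [("-", chain[0], 0)]             # entry point, no upstream client
    for client, server in zip(chain, chain[1:]):
        rows.append((client, server, 1))    # logged on the client side
        rows.append((client, server, 2))    # logged on the server side
    return rows


for row in chain_to_rows(["a", "b", "c"]):
    print(",".join(str(v) for v in row))
```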
LOCAL_IP_ADDRESS= getLocalInetAddress();
IP_16 = getIP_16(LOCAL_IP_ADDRESS);
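The body of `getIP_16` is not shown in these notes. One plausible reading, assumed here, is that it renders an IPv4 address as eight hexadecimal characters, two per octet. The Python sketch below illustrates that reading and is not the original implementation.

```python
def ip_to_hex(ip: str) -> str:
    """Render a dotted IPv4 address as 8 hex characters, two per octet."""
    parts = ip.split(".")
    if len(parts) != 4:
        raise ValueError(f"not an IPv4 address: {ip!r}")
    octets = [int(p) for p in parts]
    if any(o < 0 or o > 255 for o in octets):
        raise ValueError(f"octet out of range in {ip!r}")
    return "".join(f"{o:02x}" for o in octets)


print(ip_to_hex("10.0.0.1"))
```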
1. Application overview, 2. Service details, 3. Application destinations, 4. Application sources
- Overview
Numbers
select
count(srSpan) as hitCount,mean(ssSpan) as rtAvg,sum(error) as errCount
from prism_trace
where (rpcType='0' or rpcType='2' or rpcType='3') and serverApp='camel' and time>now()-1d and time<=now() group by serverIp,time(1d)
select
count(srSpan) as hitCount,mean(ssSpan) as rtAvg,sum(error) as errCount
from prism_trace
where (rpcType='0' or rpcType='2' or rpcType='3') and serverApp='camel' and time>now()-1d and time<=now() group by rpcType,serviceName,time(1d)
- Service details
select
count(srSpan) as hitCount,mean(ssSpan) as rtAvg,sum(error) as errCount
from prism_trace
where serverApp='camel' and serviceName='?' and time>now()-1d and time<=now() group by time(1d)
Destinations
select
count(srSpan) as hitCount,mean(ssSpan) as rtAvg,sum(error) as errCount
from prism_trace
where clientApp='camel' and rpcType='1' and clientService='/login.do' and time>now()-1d and time<=now() group by serverApp,serviceName,time(1d)
Sources
select
count(srSpan) as hitCount,mean(ssSpan) as rtAvg,sum(error) as errCount
from prism_trace
where serverApp='whale' and rpcType='1' and serviceName='MemberQueryService' and time>now()-1d and time<=now() group by clientApp,clientService,time(1d)
- Application destinations
select
count(srSpan) as hitCount,mean(ssSpan) as rtAvg,sum(error) as errCount
from prism_trace
where clientApp='camel' and rpcType='1' and time>now()-1d and time<=now() group by serverApp,serviceName,time(1d)
- Application sources
select
count(srSpan) as hitCount,mean(ssSpan) as rtAvg,sum(error) as errCount
from prism_trace
where serverApp='whale' and rpcType='1' and time>now()-1d and time<=now() group by clientApp,clientService,time(1d)
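The queries in this section differ only in the application name and the time window, so they can be generated from a template. The sketch below builds the application-destination query. The helper name and the validation rules are assumptions added for illustration, because pasting unchecked input into a query string would be unsafe.

```python
import re

_NAME = re.compile(r"[A-Za-z0-9_.\-/]+")   # conservative whitelist for app names
_WINDOW = re.compile(r"\d+[smhdw]")        # InfluxQL-style durations such as 1d or 30m


def app_destination_query(client_app: str, window: str = "1d") -> str:
    """Build the application-destination query for one client application."""
    if not _NAME.fullmatch(client_app):
        raise ValueError(f"unexpected application name: {client_app!r}")
    if not _WINDOW.fullmatch(window):
        raise ValueError(f"unexpected time window: {window!r}")
    return (
        "select count(srSpan) as hitCount,mean(ssSpan) as rtAvg,sum(error) as errCount "
        "from prism_trace "
        f"where clientApp='{client_app}' and rpcType='1' "
        f"and time>now()-{window} and time<=now() "
        f"group by serverApp,serviceName,time({window})"
    )


print(app_destination_query("camel"))
```

The resulting string could be handed to `client.query(...)` from the earlier InfluxDB example.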