少年意气强不羁,虎胁插翼白日飞。这篇文章主要讲述SQL计算KSAUCIVPSI等风控模型指标相关的知识,希望能为你提供帮助。
引言KS、AUC、IV、PSI等指标是信贷风控业务中最为常见的模型指标。在模型开发和监控中都尤为重要。常规计算方案是将数据从数据库下载到本地,再用python代码来计算,这是一件非常费力的事情。为此,本文介绍了一种Spark-SQL来计算KS、AUC、IV、PSI的方法,供参考。
目录Part 1. KS的计算方法在《求是汪在路上:风控模型—区分度评估指标(KS)深入理解应用》这篇中,我们介绍过KS相关知识。另外可作为延伸阅读的一篇论文是《A comparison study of computational methods of Kolmogorov Smirnov statistic in credit scoring》,其摘要如下:
Part 1. KS的计算方法
Part 2. IV的计算方法
Part 3. AUC的计算方法
Part 4. PSI的计算方法
Part 5. 总结
致谢
版权声明
图 1 - 论文摘要
该文章提出了KS的三种计算方法,分别是不分箱法、等频分箱法(equal-size binning)和等距分箱法(equal-width binning),同时从计算值、排序性、几何含义三个维度对比,得出:1)不分箱法能到最大且唯一的值;2)只有等频分箱法能用于评估分数的排序性。
- ?不分箱法。?
- 对连续变量(或特指模型分数)排序后,计算累积好人数和累积坏人数;
- 得到累积好人捕捉率和累积坏人捕捉率
- 两者相减得到差值,并取绝对值(可用abs方法)。
- 计算所有绝对值的最大值。
其操作方法是在不分箱法的基础上多了一步分箱,也就是将分数离散化为有限个区间。通常分为10个或20个箱子。
在平时业务分析中,我们习惯于等频分箱法来看,原因是可以看到模型的排序性,以及头部坏人捕捉率等指标,所能传达的信息更为丰富。
等频分箱法的计算方法为:
drop table if exists db.bin_table;
create table db.bin_table as
select bucket
,round(min(score), 6) as min_score
,round(max(score), 6) as max_score
,count(y)as tot
,sum(y)as bad
,count(y) - sum(y)as good
from (
select distinct a.order_id, a.score, a.y
,case when score is nullthen B00
when score < percent[0]then B01
when score > = percent[0]and score < percent[1]then B02
when score > = percent[1]and score < percent[2]then B03
when score > = percent[2]and score < percent[3]then B04
when score > = percent[3]and score < percent[4]then B05
when score > = percent[4]and score < percent[5]then B06
when score > = percent[5]and score < percent[6]then B07
when score > = percent[6]and score < percent[7]then B08
when score > = percent[7]and score < percent[8]then B09
when score > = percent[8]and score < percent[9]then B10
when score > = percent[9]and score < percent[10] then B11
when score > = percent[10] and score < percent[11] then B12
when score > = percent[11] and score < percent[12] then B13
when score > = percent[12] and score < percent[13] then B14
when score > = percent[13] and score < percent[14] then B15
when score > = percent[14] and score < percent[15] then B16
when score > = percent[15] and score < percent[16] then B17
when score > = percent[16] and score < percent[17] then B18
when score > = percent[17] and score < percent[18] then B19
when score > = percent[18]then B20
end as bucket
from db.score_y a
left join (
select percentile_approx(score,
array(0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50
,0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95), 9999) as percent
from db.score_y
where y in (0,1)
) b on 1 = 1
) t
group by bucket
order by bucket;
【SQL计算KSAUCIVPSI等风控模型指标】其输出结果如下:
bucketmin_score max_score tot
B01 6.70E-040.0057842503
B02 0.0057840.0078732499
B03 0.0078740.0097122493
B04 0.0097120.0114432502
B05 0.0114440.0130652504
B06 0.0130650.0146632496
B07 0.0146640.0161212499
B08 0.0161220.0174302499
B09 0.0174310.0187442502
B10 0.0187440.0200092501
B11 0.0200090.0215952500
B12 0.0215960.0234482499
B13 0.0234490.0256432500
B14 0.0256440.0282242499
B15 0.0282260.0315892499
B16 0.0315900.0360232499
B17 0.0360250.0423352500
B18 0.0423370.0522322500
B19 0.0522340.0723082500
B20 0.0723090.8385302506
上述代码在分数分布集中时可能无法做到完全等频,为此可采用以下代码?(也更推荐)?:
drop table if exists db.bin_table;
create table db.bin_table as
select bucket
,round(min(score), 6) as min_score
,round(max(score), 6) as max_score
,count(y)as tot
,sum(y)as bad
,count(y) - sum(y)as good
from (
select a.order_id, a.score, a.y
,case when rn_percent > = 0.00 and rn_percent < 0.05then B01
when rn_percent > = 0.05 and rn_percent < 0.10then B02
when rn_percent > = 0.10 and rn_percent < 0.15then B03
when rn_percent > = 0.15 and rn_percent < 0.20then B04
when rn_percent > = 0.20 and rn_percent < 0.25then B05
when rn_percent > = 0.25 and rn_percent < 0.30then B06
when rn_percent > = 0.30 and rn_percent < 0.35then B07
when rn_percent > = 0.35 and rn_percent < 0.40then B08
when rn_percent > = 0.40 and rn_percent < 0.45then B09
when rn_percent > = 0.45 and rn_percent < 0.50then B10
when rn_percent > = 0.50 and rn_percent < 0.55then B11
when rn_percent > = 0.55 and rn_percent < 0.60then B12
when rn_percent > = 0.60 and rn_percent < 0.65then B13
when rn_percent > = 0.65 and rn_percent < 0.70then B14
when rn_percent > = 0.70 and rn_percent < 0.75then B15
when rn_percent > = 0.75 and rn_percent < 0.80then B16
when rn_percent > = 0.80 and rn_percent < 0.85then B17
when rn_percent > = 0.85 and rn_percent < 0.90then B18
when rn_percent > = 0.90 and rn_percent < 0.95then B19
when rn_percent > = 0.95 and rn_percent < 1.01then B20
end as bucket
from (
select *, percent_rank() over (order by rn asc ) as rn_percent
from (select *, row_number() over (order by score desc) rn from db.score_y) a
) a
) t
group by bucket
order by bucket;
其输出结果如下:
bucketmin_score max_score tot
B01 0.0723420.8385302500
B02 0.0522840.0723412500
B03 0.0423610.0522822500
B04 0.0360330.0423562500
B05 0.0315970.0360322500
B06 0.0282340.0315952500
B07 0.0256470.0282322500
B08 0.0234520.0256442500
B09 0.0215970.0234522500
B10 0.0200100.0215962500
B11 0.0187460.0200102500
B12 0.0174340.0187452500
B13 0.0161250.0174332500
B14 0.0146660.0161242500
B15 0.0130650.0146662500
B16 0.0114450.0130652500
B17 0.0097180.0114452500
B18 0.0078720.0097172500
B19 0.0057800.0078712500
B20 6.70E-040.0057802500
如前文所述逻辑,等频分箱法计算KS的代码如下:
select max(abs(cum_bad / tot_bad - cum_good / tot_good)) as KS
from (
select bucket
,sum(bad) over (order by bucket asc) as cum_bad
,sum(bad) over () as tot_bad
,sum(good) over (order by bucket asc) as cum_good
,sum(good) over () as tot_good
from db.bin_table
order by bucket
) t ;
KS
0.35775373467994226
如前文所述逻辑,不分箱法计算KS的代码如下:
select max(abs(cum_bad / tot_bad - cum_good / tot_good)) as KS
from (
select order_id, score
,count(if(y = 1, order_id, null)) over(order by score asc) as cum_bad
,count(if(y = 1, order_id, null)) over() as tot_bad
,count(if(y = 0, order_id, null)) over(order by score asc) as cum_good
,count(if(y = 0, order_id, null)) over() as tot_good
from db.score_y
where y in (0,1)
) c;
KS
0.36173529422394535
由于最终算KS的时候加上了绝对值函数abs(),所以窗口函数over()里面无论是asc或者desc,效果都一样。对比等频分箱法和不分箱法,我们发现结果略有差异。不分箱法是最精确的,因为其不受分箱方法的影响,毕竟精度的差异又会影响分箱边界,进而影响分箱内的样本量。
如果需要同时计算多个模型分数的KS,可以对表进行一些调整,如下所示。再把name放在窗口函数over()中。
create table db.score_y_new as
select * from (
select score1 as name, score1 as score from db.score_y
union all
select score2 as name, score2 as score from db.score_y
) t
;
?Part 2. IV的计算方法?
在《求是汪在路上:风控模型—WOE与IV指标的深入理解应用》中可参考IV的含义,其衡量自变量对于因变量的信息量。IV越高,说明该自变量对于预测因变量越有效。
在分箱表的基础上,计算IV的代码如下:
select sum((bad / tot_bad) * ln(bad / tot_bad) - (good / tot_good) * ln(good / tot_good)) as IV
from (
select bucket, bad, good
,sum(bad) over() as tot_bad
,sum(good) over() as tot_good
from db.bin_table
) a;
IV
0.3983099697476078
Part 3. AUC的计算方法
ROC曲线下方的面积即为AUROC(Area Under ROC),简称AUC。理论上,对于连续分数分布,我们可将计算公式定义为:
其中,
为坏人累积分布,
为好人累积分布。
在实际问题中,我们拥有的样本量总是有限的,因此上述连续积分问题就转变为累加问题。如何计算这个面积呢?这就需要用到微积分的知识,相邻两个坐标点可以组成一个很小的梯形,计算这个小梯形的面积后,我们将整个区间加起来即可得到。
图 2 - ROC与AUC
计算AUC的代码如下:
select sum((tpr + pre_tpr) / 2 * (fpr - pre_fpr)) as AUC
from (
select order_id, score
,cum_bad / tot_bad as tpr
,cum_good / tot_good as fpr
,lead(cum_bad / tot_bad) over(order by cum_bad / tot_bad desc) as pre_tpr
,lead(cum_good / tot_good) over(order by cum_good / tot_good desc) as pre_fpr
from (
select order_id, score
,count(if(y = 1, order_id, null)) over(order by score desc) as cum_bad
,count(if(y = 1, order_id, null)) over() as tot_bad
,count(if(y = 0, order_id, null)) over(order by score desc) as cum_good
,count(if(y = 0, order_id, null)) over() as tot_good
from db.score_y
where y in (0,1)
) b
) t;
AUC
0.7459597336169244
?Part 4. PSI的计算方法?
在《求是汪在路上:风控模型—群体稳定性指标(PSI)深入理解应用》中可参考PSI相关知识。首先,我们同样对分数进行分箱。
PSI的计算需要指定基准组,用以作为比较的参照。通常在模型监控中,我们会按月份来与基准组比较,因此我们引入了一个分组字段gp。
drop table if exists db.psi_temp;
create table db.psi_temp as
select gp, bucket, count(1) as cnt
from (
select a.order_id, a.score, a.gp
,case when score is nullthen B00
when score < percent[0]then B01
when score > = percent[0]and score < percent[1]then B02
when score > = percent[1]and score < percent[2]then B03
when score > = percent[2]and score < percent[3]then B04
when score > = percent[3]and score < percent[4]then B05
when score > = percent[4]and score < percent[5]then B06
when score > = percent[5]and score < percent[6]then B07
when score > = percent[6]and score < percent[7]then B08
when score > = percent[7]and score < percent[8]then B09
when score > = percent[8]and score < percent[9]then B10
when score > = percent[9]and score < percent[10] then B11
when score > = percent[10] and score < percent[11] then B12
when score > = percent[11] and score < percent[12] then B13
when score > = percent[12] and score < percent[13] then B14
when score > = percent[13] and score < percent[14] then B15
when score > = percent[14] and score < percent[15] then B16
when score > = percent[15] and score < percent[16] then B17
when score > = percent[16] and score < percent[17] then B18
when score > = percent[17] and score < percent[18] then B19
when score > = percent[18]then B20
end as bucket
from db.score a
left join (
select percentile_approx(score,
array(0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50
,0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95), 9999) as percent
from db.score
where gp = 基准组
) b on 1 = 1
) t
group by gp, bucket
order by gp, bucket;
此时,我们将会得到中间数据如下:
gpbucketcnt
基准组 B01 1000
基准组 B02 1000
基准组 B03 1004
基准组 B04 1000
基准组 B05 1001
基准组 B06 1002
基准组 B07 999
基准组 B08 1002
基准组 B09 1002
基准组 B10 999
基准组 B11 1002
基准组 B12 1002
基准组 B13 1000
基准组 B14 1002
基准组 B15 1000
基准组 B16 1002
基准组 B17 1001
基准组 B18 1001
基准组 B19 1001
基准组 B20 1004
对比组 B01 1557
对比组 B02 1563
对比组 B03 1404
对比组 B04 1545
对比组 B05 1464
对比组 B06 1489
对比组 B07 1533
对比组 B08 1407
对比组 B09 1590
对比组 B10 1460
对比组 B11 1530
对比组 B12 1490
对比组 B13 1417
对比组 B14 1524
对比组 B15 1455
对比组 B16 1468
对比组 B17 1506
对比组 B18 1521
对比组 B19 1540
对比组 B20 1513
接下来,计算得到最终的PSI结果:
select gp, sum((a.rate - b.rate) * ln(a.rate / b.rate)) as PSI
from (
select gp, bucket, cnt / sum(cnt) over (partition by gp) as rate
from db.psi_temp
) a
left join (
select gp, bucket, cnt / sum(cnt) over (partition by gp) as rate
from db.psi_temp
where gp = 基准组
) b on a.bucket = b.bucket
group by gp
order by gp;
gp PSI
基准组 0.0
对比组 0.0012083365466523768
?Part 5. 总结?
得益于分布式计算,以及简洁的语法,SQL可有效帮助数据从业者节省大量时间。由此,本文介绍了常见风控模型指标的SQL计算方法。
原文链接:https://zhuanlan.zhihu.com/p/402824981/
参考www.tandfonline.com/doi/full/10.1080/03610918.2016.1249883
感谢liaoge所提供的等频分箱的代码
欢迎各位同学了解??《python金融风控评分卡模型和数据分析(加强版)》??学习更多相关知识
?版权声明:文章来自公众号(python风控模型),未经许可,不得抄袭。遵循CC 4.0 BY-SA版权协议,转载请附上原文出处链接及本声明。?
推荐阅读
- 实践GoF的23种设计模式(SOLID原则(上))
- GFS分布式文件系统
- #yyds干货盘点#关于 arm 和 x86 架构的思考
- SpringBoot+MybatisPlus+Mysql+Sharding-JDBC分库分表实践
- 「Vue系列」之面试官问NextTick是想考察什么()
- #yyds干货盘点# Kubernetes 如何高效调度 Pod((23))
- 使用DHCP+WDS+ADK+MDT批量部署windows10系统
- 融云 IM 在 Electron 平台上的设计实践
- #yyds干货盘点# C#中的File类