Hive开窗函数的应用场景

开窗函数分为两类:聚合开窗函数和排序开窗函数,有时候一组数组只返回一组值是不能满足我们的数据分析需求的,我们需要知道某个学科的前几名详细信息、最近一次下单的用户订单详情、下过n次单的用户等等,这个时候开窗函数就可以派上用场了。如果没有开窗函数的话,则需要通过复杂的关联查询才能获取对应的详细信息。
聚合函数和聚合开窗函数的区别如下:
  • SQL 标准允许将所有聚合函数用作开窗函数,用OVER 关键字区分开窗函数和聚合函数,例如sum()over(partition by field )
  • 聚合函数每组只返回一个值,开窗函数每组可返回多个值,这里的值指的是每个分组的行信息
    聚合开窗函数实际应用举例
  • 【Hive开窗函数的应用场景】下单累计金额达到3000元的用户历史订单详情,并且按照用户的订单时间降序,以便查看用户的购买情况
    select * from (select sale_order_id, order_code, customer_id, order_create_time, actual_sales_amount, sum(actual_sales_amount)over(partition by customer_id) as grand_total from cdp_order_model where channel='cdp_order_sale_01' ) a where a.grand_total > 3000 order by a.customer_id,a.order_create_time desc

  • 单笔最高金额达到某个标准的用户历史订单详情
    select * from (select sale_order_id, order_code, customer_id, order_create_time, actual_sales_amount, max(actual_sales_amount)over(partition by customer_id) as max_order_amount from cdp_order_model where channel='cdp_order_sale_01' ) a where a.max_order_amount > 10000 order by a.customer_id,a.order_create_time desc

    最常见的排序开窗函数有三种:
  • row_number() over(partition by filed2 order by filed2),对相同的值不进行区分,排序从1到n名次连续
  • rank() over(partition by filed2 order by filed2),相等的值排名相同,但若有相等的值,则序号从1到n不连续,如果有两个人都排在第3名,则没有第4名
  • dense_rank() over(partition by filed2 order by filed2),对相等的值排名相同,但序号从1到n连续,如果有两个人都排在第一名,则排在第2名(假设仅有1个第二名)的人是第3个人
    常见的排序开窗函数应用场景
  • 查询用户最近一次订单的详情
    select distinct superid,houses_condition as tag_value from( select b.superid, a.houses_condition, row_number() over(partition by b.superid order by a.order_create_time desc) as num from cdp_order_model a join superid_all b on a.customer_id = b.id and b.idtype ='jiami_mobile' where a.channel='cdp_order_sale_01' and trim(a.order_channel)='EOS')b where b.num = 1 limit 100000000

  • 查询小程序中用户连续登陆的最长天数,表名:login_data,表字段:user_id,login_date
    可用登陆日期减去序号(user_id+login_date去重,根据user_id分组,login_date升序排序),若得到的日期相同,则为连续登陆,根据计算后的日期计数即可得到最长连续登陆天数
    selectuser_id,max(cnt) from (select user_id,after_date,count(0) as cnt from (select user_id,login_date, date_depart('day',login_date,-1*row_number(partition by user_id order by login_date asc)) as after_date from(select user_id,login_date from login_data group by user_id,login_date) a) b group by user_id,after_date)c group by user_id

  • 查询用户第二次复购的订单详情,一天有多笔订单算作一单,这个时候就要用到连续排序的dense_rank()排序函数了,而rank()因为无法估计同一天订单的数量而无法估计对应的排名进而进行筛选
    select distinct superid,houses_condition as tag_value from( select b.superid, a.houses_condition, dense_rank() over(partition by b.superid order by substring(cast(a.order_create_time as varchar),1,10) desc) as num from cdp_order_model a join superid_all b on a.customer_id = b.id and b.idtype ='jiami_mobile' where a.channel='cdp_order_sale_01' and trim(a.order_channel)='EOS')b where b.num = 2 limit 100000000

    推荐阅读