hive|Hive统计最近七天内连续登陆3天的用户数量

需求:统计最近七天内连续登陆3天的用户数量
欢迎关注今日头条号、微信公众号、知乎号:仰望夜空一万次
基础数据:
uid为1的用户在‘2020-09-18’号没有登陆。
val df = Seq( ("2020-09-21",1), ("2020-09-20",1), ("2020-09-19",1), ("2020-09-17",1), ("2020-09-16",1), ("2020-09-15",1), ("2020-09-20",2), ("2020-09-19",2), ("2020-09-20",3), ("2020-09-19",3), ("2020-09-18",3) ).toDF("dt", "uid") df.createTempView("tmp")

第一步,使用rank over函数,根据用户分组,并且按时间排序
val sql2= s""" | selectuid, dt,rank() over(partition by uid order by dt) rank | from tmp | where dt>=date_add('2020-09-21',-6) and dt<='2020-09-21' | """.stripMargin println(sql2) ss.sql(sql2).show(false)结果: +---+----------+----+ |uid|dt|rank| +---+----------+----+ |1|2020-09-15|1| |1|2020-09-16|2| |1|2020-09-17|3| |1|2020-09-19|4| |1|2020-09-20|5| |1|2020-09-21|6| |3|2020-09-18|1| |3|2020-09-19|2| |3|2020-09-20|3| |2|2020-09-19|1| |2|2020-09-20|2| +---+----------+----+

第二步,使用date_sub函数,获得date_dif的值,此值相同表示发生连续登陆行为。
s""" | select uid,date_sub(dt,rank) date_dif |from |(selectuid, dt,rank() over(partition by uid order by dt) rank |from tmp |where dt>=date_add('2020-09-21',-6) and dt<='2020-09-21' |) t1 """.stripMargin 结果: +---+----------+ |uid|date_dif| +---+----------+ |1|2020-09-14| |1|2020-09-14| |1|2020-09-14| |1|2020-09-15| |1|2020-09-15| |1|2020-09-15| |3|2020-09-17| |3|2020-09-17| |3|2020-09-17| |2|2020-09-18| |2|2020-09-18| +---+----------+

发现uid为1的用户,在最近七天内有两段连续3天登陆的行为。

第三步,获取连续3天登陆的uid。

val sql2= s""" | select uid |from |( |select uid,date_sub(dt,rank) date_dif |from |(selectuid, dt,rank() over(partition by uid order by dt) rank |from tmp |where dt>=date_add('2020-09-21',-6) and dt<='2020-09-21' |) t1 |) t2 |group by uid,date_dif |having count(*)>=3 """.stripMargin结果: |uid| +---+ |1| |1| |3| +---

第四步,对uid进行去重,此处使用group by,而非count(distinct uid),为并行分组uid提供保障。
in Hive1.1, these two queries' explains have the same result. Both of them have ONLY ONE STAGE.
https://stackoverflow.com/questions/19311193/why-is-countdistinct-slower-than-group-by-in-hive
需要查看执行计划,看看版本是否支持。
|select uid |from |( |select uid |from |( |select uid,date_sub(dt,rank) date_dif |from |(selectuid, dt,rank() over(partition by uid order by dt) rank |from tmp |where dt>=date_add('2020-09-21',-6) and dt<='2020-09-21' |) t1 |) t2 |group by uid,date_dif |having count(*)>=3 |) t3 |group by uid+---+ |uid| +---+ |1| |3| +---+


第五步,获取最终数据结果。
compute_date表示统计日期,compute_period表示统计周期,continue_3_days_number表示统计周期内连续登陆3天的总用户数量。
val sql2= s""" |select |'2020-09-21' as compute_date, |concat(date_add('2020-09-21',-6),'_','2020-09-21') as compute_period, |count(*) as continue_3_days_number |from |( |select uid |from |( |select uid |from |( |select uid,date_sub(dt,rank) date_dif |from |(selectuid, dt,rank() over(partition by uid order by dt) rank |from tmp |where dt>=date_add('2020-09-21',-6) and dt<='2020-09-21' |) t1 |) t2 |group by uid,date_dif |having count(*)>=3 |) t3 |group by uid |) t4 """.stripMargin+------------+---------------------+----------------------+ |compute_date|compute_period|continue_3_days_number| +------------+---------------------+----------------------+ |2020-09-21|2020-09-15_2020-09-21|2| +------------+---------------------+----------------------+


第五步的另外一种写法,使用count(distinct)的最终写法
val sql2= s""" |select |'2020-09-21' as compute_date, |concat(date_add('2020-09-21',-6),'_','2020-09-21') as compute_period, |count(distinct uid) as continue_3_days_number |from |( |select uid |from |( |select uid,date_sub(dt,rank) date_dif |from |(selectuid, dt,rank() over(partition by uid order by dt) rank |from tmp |where dt>=date_add('2020-09-21',-6) and dt<='2020-09-21' |) t1 |) t2 |group by uid,date_dif |having count(*)>=3 |) t3 """.stripMargin println(sql2) ss.sql(sql2).show(false)

【hive|Hive统计最近七天内连续登陆3天的用户数量】

    推荐阅读