最强最全面的大数据SQL面试系列

本套SQL题的答案是由许多小伙伴共同贡献的,1+1的力量是远远大于2的,有不少题目都采用了非常巧妙的解法,也有不少题目有多种解法。本套大数据SQL题不仅题目丰富多样,答案更是精彩绝伦!

注:以下参考答案都经过简单数据场景进行测试通过,但并未测试其他复杂情况。本文档的SQL主要使用 Hive SQL。
因内容较多,带目录的PDF查看是比较方便的:
最强最全面的大数据SQL经典面试题完整PDF版
一、行列转换 描述:表中记录了各年份各部门的平均绩效考核成绩。
\
表名:t1
\
表结构:
a -- 年份 b -- 部门 c -- 绩效得分

表内容:
abc 2014B9 2015A8 2014A10 2015B7

问题一:多行转多列 问题描述:将上述表内容转为如下输出结果所示:
acol_A col_B 2014109 201587

参考答案:
select a, max(case when b="A" then c end) col_A, max(case when b="B" then c end) col_B from t1 group by a;

问题二:如何将结果转成源表?(多列转多行) 问题描述:将问题一的结果转成源表,问题一结果表名为t1_2
参考答案:
select a, b, c from ( select a,"A" as b,col_a as c from t1_2 union all select a,"B" as b,col_b as c from t1_2 )tmp;

问题三:同一部门会有多个绩效,求多行转多列结果 问题描述:2014年公司组织架构调整,导致部门出现多个绩效,业务及人员不同,无法合并算绩效,源表内容如下:
2014B9 2015A8 2014A10 2015B7 2014B6

输出结果如下所示:
acol_Acol_B 2014106,9 201587

参考答案:
select a, max(case when b="A" then c end) col_A, max(case when b="B" then c end) col_B from ( select a, b, concat_ws(",",collect_set(cast(c as string))) as c from t1 group by a,b )tmp group by a;

二、排名中取他值 表名:t2
\
表字段及内容:
abc 2014A3 2014B1 2014C2 2015A4 2015D3

问题一:按a分组取b字段最小时对应的c字段 输出结果如下所示:
amin_c 20143 20154

参考答案:
select a, c as min_c from ( select a, b, c, row_number() over(partition by a order by b) as rn from t2 )a where rn = 1;

问题二:按a分组取b字段排第二时对应的c字段 输出结果如下所示:
asecond_c 20141 20153

参考答案:
select a, c as second_c from ( select a, b, c, row_number() over(partition by a order by b) as rn from t2 )a where rn = 2;

问题三:按a分组取b字段最小和最大时对应的c字段 【最强最全面的大数据SQL面试系列】输出结果如下所示:
amin_cmax_c 201432 201543

参考答案:
select a, min(if(asc_rn = 1, c, null)) as min_c, max(if(desc_rn = 1, c, null)) as max_c from ( select a, b, c, row_number() over(partition by a order by b) as asc_rn, row_number() over(partition by a order by b desc) as desc_rn from t2 )a where asc_rn = 1 or desc_rn = 1 group by a;

问题四:按a分组取b字段第二小和第二大时对应的c字段 输出结果如下所示:
amin_cmax_c 201411 201534

参考答案:
select ret.a ,max(case when ret.rn_min = 2 then ret.c else null end) as min_c ,max(case when ret.rn_max = 2 then ret.c else null end) as max_c from ( select * ,row_number() over(partition by t2.a order by t2.b) as rn_min ,row_number() over(partition by t2.a order by t2.b desc) as rn_max from t2 ) as ret where ret.rn_min = 2 or ret.rn_max = 2 group by ret.a;

问题五:按a分组取b字段前两小和前两大时对应的c字段 注意:需保持b字段最小、最大排首位
输出结果如下所示:
amin_cmax_c 20143,12,1 20154,33,4

参考答案:
select tmp1.a as a, min_c, max_c from ( select a, concat_ws(',', collect_list(c)) as min_c from ( select a, b, c, row_number() over(partition by a order by b) as asc_rn from t2 )a where asc_rn <= 2 group by a )tmp1 join ( select a, concat_ws(',', collect_list(c)) as max_c from ( select a, b, c, row_number() over(partition by a order by b desc) as desc_rn from t2 )a where desc_rn <= 2 group by a )tmp2 on tmp1.a = tmp2.a;

三、累计求值 表名:t3
\
表字段及内容:
abc 2014A3 2014B1 2014C2 2015A4 2015D3

问题一:按a分组按b字段排序,对c累计求和 输出结果如下所示:
absum_c 2014A3 2014B4 2014C6 2015A4 2015D7

参考答案:
select a, b, c, sum(c) over(partition by a order by b) as sum_c from t3;

问题二:按a分组按b字段排序,对c取累计平均值 输出结果如下所示:
abavg_c 2014A3 2014B2 2014C2 2015A4 2015D3.5

参考答案:
select a, b, c, avg(c) over(partition by a order by b) as avg_c from t3;

问题三:按a分组按b字段排序,对b取累计排名比例 输出结果如下所示:
abratio_c 2014A0.33 2014B0.67 2014C1.00 2015A0.50 2015D1.00

参考答案:
select a, b, c, round(row_number() over(partition by a order by b) / (count(c) over(partition by a)),2) as ratio_c from t3 order by a,b;

问题四:按a分组按b字段排序,对b取累计求和比例 输出结果如下所示:
abratio_c 2014A0.50 2014B0.67 2014C1.00 2015A0.57 2015D1.00

参考答案:
select a, b, c, round(sum(c) over(partition by a order by b) / (sum(c) over(partition by a)),2) as ratio_c from t3 order by a,b;

四、窗口大小控制 表名:t4
\
表字段及内容:
abc 2014A3 2014B1 2014C2 2015A4 2015D3

问题一:按a分组按b字段排序,对c取前后各一行的和 输出结果如下所示:
absum_c 2014A1 2014B5 2014C1 2015A3 2015D4

参考答案:
select a, b, lag(c,1,0) over(partition by a order by b)+lead(c,1,0) over(partition by a order by b) as sum_c from t4;

问题二:按a分组按b字段排序,对c取平均值 问题描述:前一行与当前行的均值!
输出结果如下所示:
abavg_c 2014A3 2014B2 2014C1.5 2015A4 2015D3.5

参考答案:
\
此处给出两种解法,其一:
select a, b, avg(c) over(partition by a order by b rows between 1 preceding and current row ) from t4;

其二:
select a, b, case when lag_c is null then c else (c+lag_c)/2 end as avg_c from ( select a, b, c, lag(c,1) over(partition by a order by b) as lag_c from t4 )temp;

五、产生连续数值 输出结果如下所示:
1 2 3 4 5 ... 100

参考答案:
\
不借助其他任何外表,实现产生连续数值
\
此处给出两种解法,其一:
select id_start+pos as id from( select 1 as id_start, 1000000 as id_end ) mlateral view posexplode(split(space(id_end-id_start), '')) t as pos, val

其二:
select row_number() over() as id from (select split(space(99), ' ') as x) t lateral view explode(x) ex;

那如何产生1至1000000连续数值?
参考答案:
select row_number() over() as id from (select split(space(999999), ' ') as x) t lateral view explode(x) ex;

六、数据扩充与收缩 表名:t6
\
表字段及内容:
a 3 2 4

问题一:数据扩充 输出结果如下所示:
ab 33、2、1 22、1 44、3、2、1

参考答案:
select t.a, concat_ws('、',collect_set(cast(t.rn as string))) as b from ( select t6.a, b.rn from t6 left join ( select row_number() over() as rn from (select split(space(5), ' ') as x) t -- space(5)可根据t6表的最大值灵活调整 lateral view explode(x) pe ) b on 1 = 1 where t6.a >= b.rn order by t6.a, b.rn desc ) t group byt.a;

问题二:数据扩充,排除偶数 输出结果如下所示:
ab 33、1 21 43、1

参考答案:
select t.a, concat_ws('、',collect_set(cast(t.rn as string))) as b from ( select t6.a, b.rn from t6 left join ( select row_number() over() as rn from (select split(space(5), ' ') as x) t lateral view explode(x) pe ) b on 1 = 1 where t6.a >= b.rn and b.rn % 2 = 1 order by t6.a, b.rn desc ) t group byt.a;

问题三:如何处理字符串累计拼接 问题描述:将小于等于a字段的值聚合拼接起来
输出结果如下所示:
ab 32、3 22 42、3、4

参考答案:
select t.a, concat_ws('、',collect_set(cast(t.a1 as string))) as b from ( select t6.a, b.a1 from t6 left join ( selecta as a1 from t6 ) b on 1 = 1 where t6.a >= b.a1 order by t6.a, b.a1 ) t group byt.a;

问题四:如果a字段有重复,如何实现字符串累计拼接 输出结果如下所示:
ab 22 32、3 32、3、3 42、3、3、4

参考答案:
select a, b from ( select t.a, t.rn, concat_ws('、',collect_list(cast(t.a1 as string))) as b from ( select a.a, a.rn, b.a1 from ( select a, row_number() over(order by a ) as rn from t6 ) a left join ( selecta as a1, row_number() over(order by a ) as rn from t6 ) b on 1 = 1 where a.a >= b.a1 and a.rn >= b.rn order by a.a, b.a1 ) t group byt.a,t.rn order by t.a,t.rn ) tt;

问题五:数据展开 问题描述:如何将字符串"1-5,16,11-13,9"扩展成"1,2,3,4,5,16,11,12,13,9"?注意顺序不变。
参考答案:
select concat_ws(',',collect_list(cast(rn as string))) from ( select a.rn, b.num, b.pos from ( select row_number() over() as rn from (select split(space(20), ' ') as x) t -- space(20)可灵活调整 lateral view explode(x) pe ) a lateral view outer posexplode(split('1-5,16,11-13,9', ',')) b as pos, num where a.rn between cast(split(num, '-')[0] as int) and cast(split(num, '-')[1] as int) or a.rn = num order by pos, rn ) t;

七、合并与拆分 表名:t7
\
表字段及内容:
ab 2014A 2014B 2015B 2015D

问题一:合并 输出结果如下所示:
2014A、B 2015B、D

参考答案:
select a, concat_ws('、', collect_set(t.b)) b from t7 group by a;

问题二:拆分 问题描述:将分组合并的结果拆分出来
参考答案:
select t.a, d from ( select a, concat_ws('、', collect_set(t7.b)) b from t7 group by a )t lateral view explode(split(t.b, '、')) table_tmp as d;

八、模拟循环操作 表名:t8
\
表字段及内容:
a 1011 0101

问题一:如何将字符'1'的位置提取出来 输出结果如下所示:
1,3,4 2,4

参考答案:
select a, concat_ws(",",collect_list(cast(index as string))) as res from ( select a, index+1 as index, chr from ( select a, concat_ws(",",substr(a,1,1),substr(a,2,1),substr(a,3,1),substr(a,-1)) str from t8 ) tmp1 lateral view posexplode(split(str,",")) t as index,chr where chr = "1" ) tmp2 group by a;

九、不使用distinct或group by去重 表名:t9
\
表字段及内容:
abcd 201420162014A 201420152015B

问题一:不使用distinct或group by去重 输出结果如下所示:
2014A 2016A 2014B 2015B

参考答案:
select t2.year ,t2.num from ( select * ,row_number() over (partition by t1.year,t1.num) as rank_1 from ( select a as year, d as num from t9 union all select b as year, d as num from t9 union all select c as year, d as num from t9 )t1 )t2 where rank_1=1 order by num;

十、容器--反转内容 表名:t10
\
表字段及内容:
a AB,CA,BAD BD,EA

问题一:反转逗号分隔的数据:改变顺序,内容不变 输出结果如下所示:
BAD,CA,AB EA,BD

参考答案:
select a, concat_ws(",",collect_list(reverse(str))) from ( select a, str from t10 lateral view explode(split(reverse(a),",")) t as str ) tmp1 group by a;

问题二:反转逗号分隔的数据:改变内容,顺序不变 输出结果如下所示:
BA,AC,DAB DB,AE

参考答案:
select a, concat_ws(",",collect_list(reverse(str))) from ( select a, str from t10 lateral view explode(split(a,",")) t as str ) tmp1 group by a;

十一、多容器--成对提取数据 表名:t11
\
表字段及内容:
ab A/B1/3 B/C/D4/5/2

问题一:成对提取数据,字段一一对应 输出结果如下所示:
ab A1 B3 B4 C5 D2

参考答案:
select a_inx, b_inx from ( select a, b, a_id, a_inx, b_id, b_inx from t11 lateral view posexplode(split(a,'/')) t as a_id,a_inx lateral view posexplode(split(b,'/')) t as b_id,b_inx ) tmp where a_id=b_id;

十二、多容器--转多行 表名:t12
\
表字段及内容:
abc 001A/B1/3/5 002B/C/D4/5

问题一:转多行 输出结果如下所示:
ade 001type_bA 001type_bB 001type_c1 001type_c3 001type_c5 002type_bB 002type_bC 002type_bD 002type_c4 002type_c5

参考答案:
select a, d, e from ( select a, "type_b" as d, str as e from t12 lateral view explode(split(b,"/")) t as str union all select a, "type_c" as d, str as e from t12 lateral view explode(split(c,"/")) t as str ) tmp order by a,d;

十三、抽象分组--断点排序 表名:t13
\
表字段及内容:
ab 20141 20151 20161 20170 20180 2019-1 2020-1 2021-1 20221 20231

问题一:断点排序 输出结果如下所示:
abc 201411 201512 201613 201701 201802 2019-11 2020-12 2021-13 202211 202312

参考答案:
select a, b, row_number() over( partition by b,repair_a order by a asc) as c--按照b列和[b的组首]分组,排序 from ( select a, b, a-b_rn as repair_a--根据b列值出现的次序,修复a列值为b首次出现的a列值,称为b的[组首] from ( select a, b, row_number() over( partition by b order byaasc ) as b_rn--按b列分组,按a列排序,得到b列各值出现的次序 from t13 )tmp1 )tmp2--注意,如果不同的b列值,可能出现同样的组首值,但组首值需要和a列值 一并参与分组,故并不影响排序。 order by a asc;

十四、业务逻辑的分类与抽象--时效 日期表:d_date
\
表字段及内容:
date_idis_work 2017-04-131 2017-04-141 2017-04-150 2017-04-160 2017-04-171

工作日:周一至周五09:30-18:30
客户申请表:t14
\
表字段及内容:
abc 1申请2017-04-14 18:03:00 1通过2017-04-17 09:43:00 2申请2017-04-13 17:02:00 2通过2017-04-15 09:42:00

问题一:计算上表中从申请到通过占用的工作时长 输出结果如下所示:
ad 10.67h 210.67h

参考答案:
select a, round(sum(diff)/3600,2) as d from ( select a, apply_time, pass_time, dates, rn, ct, is_work, case when is_work=1 and rn=1 then unix_timestamp(concat(dates,' 18:30:00'),'yyyy-MM-dd HH:mm:ss')-unix_timestamp(apply_time,'yyyy-MM-dd HH:mm:ss') when is_work=0 then 0 when is_work=1 and rn=ct then unix_timestamp(pass_time,'yyyy-MM-dd HH:mm:ss')-unix_timestamp(concat(dates,' 09:30:00'),'yyyy-MM-dd HH:mm:ss') when is_work=1 and rn!=ct then 9*3600 end diff from ( select a, apply_time, pass_time, time_diff, day_diff, rn, ct, date_add(start,rn-1) dates from ( select a, apply_time, pass_time, time_diff, day_diff, strs, start, row_number() over(partition by a) as rn, count(*) over(partition by a) as ct from ( select a, apply_time, pass_time, time_diff, day_diff, substr(repeat(concat(substr(apply_time,1,10),','),day_diff+1),1,11*(day_diff+1)-1) strs from ( select a, apply_time, pass_time, unix_timestamp(pass_time,'yyyy-MM-dd HH:mm:ss')-unix_timestamp(apply_time,'yyyy-MM-dd HH:mm:ss') time_diff, datediff(substr(pass_time,1,10),substr(apply_time,1,10)) day_diff from ( select a, max(case when b='申请' then c end) apply_time, max(case when b='通过' then c end) pass_time from t14 group by a ) tmp1 ) tmp2 ) tmp3 lateral view explode(split(strs,",")) t as start ) tmp4 ) tmp5 join d_date on tmp5.dates = d_date.date_id ) tmp6 group by a;

十五、时间序列--进度及剩余 表名:t15
\
表字段及内容:
date_idis_work 2017-07-300 2017-07-311 2017-08-011 2017-08-021 2017-08-031 2017-08-041 2017-08-050 2017-08-060 2017-08-071

问题一:求每天的累计周工作日,剩余周工作日 输出结果如下所示:
date_idweek_to_workweek_left_work 2017-07-3114 2017-08-0123 2017-08-0232 2017-08-0341 2017-08-0450 2017-08-0550 2017-08-0650

参考答案:
\
此处给出两种解法,其一:
select date_id ,case date_format(date_id,'u') when 1 then 1 when 2 then 2 when 3 then 3 when 4 then 4 when 5 then 5 when 6 then 5 when 7 then 5 end as week_to_work ,case date_format(date_id,'u') when 1 then 4 when 2 then 3 when 3 then 2 when 4 then 1 when 5 then 0 when 6 then 0 when 7 then 0 end as week_to_work from t15

其二:
select date_id, week_to_work, week_sum_work-week_to_work as week_left_work from( select date_id, sum(is_work) over(partition by year,week order by date_id) as week_to_work, sum(is_work) over(partition by year,week) as week_sum_work from( select date_id, is_work, year(date_id) as year, weekofyear(date_id) as week from t15 ) ta ) tb order by date_id;

十六、时间序列--构造日期 问题一:直接使用SQL实现一张日期维度表,包含以下字段:
datestring日期 d_weekstring年内第几周 weeksint周几 w_startstring周开始日 w_endstring周结束日 d_monthint第几月 m_startstring月开始日 m_endstring月结束日 d_quarterint第几季 q_startstring季开始日 q_endstring季结束日 d_yearint年份 y_startstring年开始日 y_endstring年结束日

参考答案:
drop table if exists dim_date; create table if not exists dim_date( `date` string comment '日期', d_week string comment '年内第几周', weeks string comment '周几', w_start string comment '周开始日', w_end string comment '周结束日', d_month string comment '第几月', m_start string comment '月开始日', m_end string comment '月结束日', d_quarter int comment '第几季', q_start string comment '季开始日', q_end string comment '季结束日', d_year int comment '年份', y_start string comment '年开始日', y_end string comment '年结束日' ); --自然月: 指每月的1号到那个月的月底,它是按照阳历来计算的。就是从每月1号到月底,不管这个月有30天,31天,29天或者28天,都算是一个自然月。insert overwrite table dim_date select `date` , d_week --年内第几周 , case weekid when 0 then '周日' when 1 then '周一' when 2 then '周二' when 3 then '周三' when 4 then '周四' when 5 then '周五' when 6 then '周六' endas weeks -- 周 , date_add(next_day(`date`,'MO'),-7) as w_start --周一 , date_add(next_day(`date`,'MO'),-1) as w_end-- 周日_end -- 月份日期 , concat('第', monthid, '月')as d_month , m_start , m_end-- 季节 , quarterid as d_quart , concat(d_year, '-', substr(concat('0', (quarterid - 1) * 3 + 1), -2), '-01') as q_start --季开始日 , date_sub(concat(d_year, '-', substr(concat('0', (quarterid) * 3 + 1), -2), '-01'), 1) as q_end--季结束日 -- 年 , d_year , y_start , y_endfrom ( select `date` , pmod(datediff(`date`, '2012-01-01'), 7)as weekid--获取周几 , cast(substr(`date`, 6, 2) as int)as monthid--获取月份 , case when cast(substr(`date`, 6, 2) as int) <= 3 then 1 when cast(substr(`date`, 6, 2) as int) <= 6 then 2 when cast(substr(`date`, 6, 2) as int) <= 9 then 3 when cast(substr(`date`, 6, 2) as int) <= 12 then 4 endas quarterid --获取季节 可以直接使用 quarter(`date`) , substr(`date`, 1, 4)as d_year-- 获取年份 , trunc(`date`, 'YYYY')as y_start--年开始日 , date_sub(trunc(add_months(`date`, 12), 'YYYY'), 1) as y_end--年结束日 , date_sub(`date`, dayofmonth(`date`) - 1)as m_start--当月第一天 , last_day(date_sub(`date`, dayofmonth(`date`) - 1))m_end--当月最后一天 , weekofyear(`date`)as d_week--年内第几周 from ( -- '2021-04-01'是开始日期, '2022-03-31'是截止日期 select date_add('2021-04-01', t0.pos) as `date` from ( select posexplode( split( repeat('o', datediff( from_unixtime(unix_timestamp('2022-03-31', 'yyyy-mm-dd'), 'yyyy-mm-dd'), '2021-04-01')), 'o' ) ) ) t0 ) t1 ) t2;

十七、时间序列--构造累积日期 表名:t17
\
表字段及内容:
date_id 2017-08-01 2017-08-02 2017-08-03

问题一:每一日期,都扩展成月初至当天 输出结果如下所示:
date_iddate_to_day 2017-08-012017-08-01 2017-08-022017-08-01 2017-08-022017-08-02 2017-08-032017-08-01 2017-08-032017-08-02 2017-08-032017-08-03

这种累积相关的表,常做桥接表。
参考答案:
select date_id, date_add(date_start_id,pos) as date_to_day from ( select date_id, date_sub(date_id,dayofmonth(date_id)-1) as date_start_id from t17 ) mlateral view posexplode(split(space(datediff(from_unixtime(unix_timestamp(date_id,'yyyy-MM-dd')),from_unixtime(unix_timestamp(date_start_id,'yyyy-MM-dd')))), '')) t as pos, val;

十八、时间序列--构造连续日期 表名:t18
\
表字段及内容:
abc 1012018-01-0110 1012018-01-0320 1012018-01-0640 1022018-01-0220 1022018-01-0430 1022018-01-0760

问题一:构造连续日期 问题描述:将表中数据的b字段扩充至范围[2018-01-01, 2018-01-07],并累积对c求和。
\
b字段的值是较稀疏的。
输出结果如下所示:
abcd 1012018-01-011010 1012018-01-02010 1012018-01-032030 1012018-01-04030 1012018-01-05030 1012018-01-064070 1012018-01-07070 1022018-01-0100 1022018-01-022020 1022018-01-03020 1022018-01-043050 1022018-01-05050 1022018-01-06050 1022018-01-0760110

参考答案:
select a, b, c, sum(c) over(partition by a order by b) as d from ( select t1.a, t1.b, case when t18.b is not null then t18.c else 0 end as c from ( select a, date_add(s,pos) as b from ( select a, '2018-01-01' as s, '2018-01-07' as r from (select a from t18 group by a) ta ) mlateral view posexplode(split(space(datediff(from_unixtime(unix_timestamp(r,'yyyy-MM-dd')),from_unixtime(unix_timestamp(s,'yyyy-MM-dd')))), '')) t as pos, val ) t1 left join t18 ont1.a = t18.a and t1.b = t18.b ) ts;

十九、时间序列--取多个字段最新的值 表名:t19
\
表字段及内容:
date_idabc 2014AB12bc 201523 2016d 2017BC

问题一:如何一并取出最新日期 输出结果如下所示:
date_aadate_bbdate_cc 2017BC2015232016d

参考答案:
\
此处给出三种解法,其一:
SELECTmax(CASE WHEN rn_a = 1 THEN date_id else 0 END) AS date_a ,max(CASE WHEN rn_a = 1 THEN a else null END) AS a ,max(CASE WHEN rn_b = 1 THEN date_id else 0 END) AS date_b ,max(CASE WHEN rn_b = 1 THEN b else NULLEND) AS b ,max(CASE WHEN rn_c = 1 THEN date_idelse 0 END) AS date_c ,max(CASE WHEN rn_c = 1 THEN c else null END) AS c FROM( SELECTdate_id ,a ,b ,c --对每列上不为null的值的 日期 进行排序 ,row_number()OVER( PARTITION BY 1 ORDER BY CASE WHEN a IS NULL THEN 0 ELSE date_id END DESC) AS rn_a ,row_number()OVER(PARTITION BY 1 ORDER BY CASE WHEN b IS NULL THEN 0 ELSE date_id END DESC) AS rn_b ,row_number()OVER(PARTITION BY 1 ORDER BY CASE WHEN c IS NULL THEN 0 ELSE date_id END DESC) AS rn_c FROMt19 ) t WHEREt.rn_a = 1 ORt.rn_b = 1 ORt.rn_c = 1;

其二:
SELECT a.date_id ,a.a ,b.date_id ,b.b ,c.date_id ,c.c FROM ( SELECT t.date_id, t.a FROM ( SELECT t.date_id ,t.a ,t.b ,t.c FROM t19 t INNER JOINt19 t1 ON t.date_id = t1.date_id AND t.a IS NOT NULL ) t ORDER BY t.date_id DESC LIMIT 1 ) a LEFT JOIN ( SELECT t.date_id ,t.b FROM ( SELECT t.date_id ,t.b FROM t19 t INNER JOIN t19 t1 ON t.date_id = t1.date_id AND t.b IS NOT NULL ) t ORDER BY t.date_id DESC LIMIT 1 ) b ON 1 = 1 LEFT JOIN ( SELECT t.date_id ,t.c FROM ( SELECT t.date_id ,t.c FROM t19 t INNER JOIN t19 t1 ON t.date_id = t1.date_id AND t.c IS NOT NULL ) t ORDER BY t.date_id DESC LIMIT1 ) c ON 1 = 1;

其三:
select * from ( select t1.date_id as date_a,t1.a from (select t1.date_id,t1.afrom t19 t1 where t1.a is not null) t1 inner join (select max(t1.date_id) as date_idfrom t19 t1 where t1.a is not null) t2 on t1.date_id=t2.date_id ) t1 cross join ( select t1.date_b,t1.b from (select t1.date_id as date_b,t1.bfrom t19 t1 where t1.b is not null) t1 inner join (select max(t1.date_id) as date_idfrom t19 t1 where t1.b is not null)t2 on t1.date_b=t2.date_id ) t2 cross join ( select t1.date_c,t1.c from (select t1.date_id as date_c,t1.cfrom t19 t1 where t1.c is not null) t1 inner join (select max(t1.date_id) as date_idfrom t19 t1 where t1.c is not null)t2 on t1.date_c=t2.date_id ) t3;

二十、时间序列--补全数据 表名:t20
\
表字段及内容:
date_idabc 2014AB12bc 201523 2016d 2017BC

问题一:如何使用最新数据补全表格 输出结果如下所示:
date_idabc 2014AB12bc 2015AB23bc 2016AB23d 2017BC23d

参考答案:
select date_id, first_value(a) over(partition by aa order by date_id) as a, first_value(b) over(partition by bb order by date_id) as b, first_value(c) over(partition by cc order by date_id) as c from ( select date_id, a, b, c, count(a) over(order by date_id) as aa, count(b) over(order by date_id) as bb, count(c) over(order by date_id) as cc from t20 )tmp1;

二十一、时间序列--取最新完成状态的前一个状态 表名:t21
\
表字段及内容:
date_idab 20141A 20151B 20161A 20171B 20132A 20142B 20152A 20143A 20153A 20163B 20173A

上表中B为完成状态。
问题一:取最新完成状态的前一个状态 输出结果如下所示:
date_idab 20161A 20132A 20153A

参考答案:
\
此处给出两种解法,其一:
select t21.date_id, t21.a, t21.b from ( select max(date_id) date_id, a from t21 where b = 'B' group by a ) t1 inner join t21 on t1.date_id -1 = t21.date_id and t1.a = t21.a;

其二:
select next_date_id as date_id ,a ,next_b as b from( select *,min(nk) over(partition by a,b) as minb from( select *,row_number() over(partition by a order by date_id desc) nk ,lead(date_id) over(partition by a order by date_id desc) next_date_id ,lead(b) over(partition by a order by date_id desc) next_b from( select * from t21 ) t ) t ) t where minb = nk and b = 'B';

问题二:如何将完成状态的过程合并 输出结果如下所示:
ab_merge 1A、B、A、B 2A、B 3A、A、B

参考答案:
select a ,collect_list(b) as b from( select * ,min(if(b = 'B',nk,null)) over(partition by a) as minb from( select *,row_number() over(partition by a order by date_id desc) nk from( select * from t21 ) t ) t ) t where nk >= minb group by a;

二十二、非等值连接--范围匹配 表f是事实表,表d是匹配表,在hive中如何将匹配表中的值关联到事实表中?
表d相当于拉链过的变化维,但日期范围可能是不全的。
表f:
date_idp_id 2017C 2018B 2019A 2013C

表d:
d_startd_endp_idp_value 20162018A1 20162018B2 20082009C4 20102015C3

问题一:范围匹配 输出结果如下所示:
date_idp_idp_value 2017Cnull 2018B2 2019Anull 2013C3

*参考答案*:
\
此处给出两种解法,其一:
select f.date_id, f.p_id, A.p_value from f left join ( select date_id, p_id, p_value from ( select f.date_id, f.p_id, d.p_value from f left join d on f.p_id = d.p_id where f.date_id >= d.d_start and f.date_id <= d.d_end )A )A ON f.date_id = A.date_id;

其二:
select date_id, p_id, flag as p_value from ( select f.date_id, f.p_id, d.d_start, d.d_end, d.p_value, if(f.date_id between d.d_start and d.d_end,d.p_value,null) flag, max(d.d_end) over(partition by date_id) max_end from f left join d on f.p_id = d.p_id ) tmp where d_end = max_end;

二十三、非等值连接--最近匹配
表t23_1和表t23_2通过a和b关联时,有相等的取相等的值匹配,不相等时每一个a的值在b中找差值最小的来匹配。
t23_1和t23_2为两个班的成绩单,t23_1班的每个学生成绩在t23_2班中找出成绩最接近的成绩。
表t23_1:a中无重复值
a 1 2 4 5 8 10

表t23_2:b中无重复值
b 2 3 7 11 13

问题一:单向最近匹配 输出结果如下所示:
\
注意:b的值可能会被丢弃
ab 12 22 43 53 57 87 1011

参考答案:
select * from ( select ttt1.a, ttt1.b from ( select tt1.a, t23_2.b, dense_rank() over(partition by tt1.a order by abs(tt1.a-t23_2.b)) as dr from ( select t23_1.a from t23_1 left join t23_2 on t23_1.a=t23_2.b where t23_2.b is null ) tt1 cross join t23_2 ) ttt1 where ttt1.dr=1 union all select t23_1.a, t23_2.b from t23_1 inner join t23_2 on t23_1.a=t23_2.b ) result_t order by result_t.a;

二十四、N指标--累计去重 假设表A为事件流水表,客户当天有一条记录则视为当天活跃。
表A:
time_iduser_id 2018-01-01 10:00:00001 2018-01-01 11:03:00002 2018-01-01 13:18:00001 2018-01-02 08:34:00004 2018-01-02 10:08:00002 2018-01-02 10:40:00003 2018-01-02 14:21:00002 2018-01-02 15:39:00004 2018-01-03 08:34:00005 2018-01-03 10:08:00003 2018-01-03 10:40:00001 2018-01-03 14:21:00005

假设客户活跃非常,一天产生的事件记录平均达千条。
问题一:累计去重 输出结果如下所示:
日期当日活跃人数月累计活跃人数_截至当日 date_iduser_cnt_actuser_cnt_act_month 2018-01-0122 2018-01-0234 2018-01-0335

参考答案:
SELECTtt1.date_id ,tt2.user_cnt_act ,tt1.user_cnt_act_month FROM (-- ④ 按照t.date_id分组求出user_cnt_act_month,得到tt1 SELECTt.date_id ,COUNT(user_id) AS user_cnt_act_month FROM (-- ③ 表a和表b进行笛卡尔积,按照a.date_id,b.user_id分组,保证截止到当日的用户唯一,得出表t。 SELECTa.date_id ,b.user_id FROM (-- ① 按照日期分组,取出date_id字段当主表的维度字段 得出表a SELECTfrom_unixtime(unix_timestamp(time_id),'yyyy-MM-dd') AS date_id FROM test.temp_tanhaidi_20211213_1 GROUP BYfrom_unixtime(unix_timestamp(time_id),'yyyy-MM-dd') ) a INNER JOIN (-- ② 按照date_id、user_id分组,保证每天每个用户只有一条记录,得出表b SELECTfrom_unixtime(unix_timestamp(time_id),'yyyy-MM-dd') AS date_id ,user_id FROM test.temp_tanhaidi_20211213_1 GROUP BYfrom_unixtime(unix_timestamp(time_id),'yyyy-MM-dd') ,user_id ) b ON 1 = 1 WHERE a.date_id >= b.date_id GROUP BYa.date_id ,b.user_id ) t GROUP BYt.date_id ) tt1 LEFT JOIN (-- ⑥ 按照date_id分组求出user_cnt_act,得到tt2 SELECTdate_id ,COUNT(user_id) AS user_cnt_act FROM (-- ⑤ 按照日期分组,取出date_id字段当主表的维度字段 得出表a SELECTfrom_unixtime(unix_timestamp(time_id),'yyyy-MM-dd') AS date_id ,user_id FROM test.temp_tanhaidi_20211213_1 GROUP BYfrom_unixtime(unix_timestamp(time_id),'yyyy-MM-dd') ,user_id ) a GROUP BY date_id ) tt2 ON tt2.date_id = tt1.date_id

参考:
最强最全面的大数据SQL经典面试题完整PDF版

    推荐阅读