SQL|牛客刷题——题型总结

【SQL|牛客刷题——题型总结】
文章目录

  • (一)表连接
    • 1、多表连接
      • (1)join-on和多个where等价
      • (2)通过多表连接解决A成立B不成立的问题
      • (3)自连接
      • (4)from相同表但是where不同
    • 2、表连接函数
      • (1)union、union all
      • (2)left join、right join、join的区别
    • 3、行对应性表连接
    • 4、嵌套查询
  • (二)筛选:where、having、in
    • 1、where和having的区别
    • 2、where in与join等价的情况
  • (三)聚合信息:groupby与窗口函数
    • 1、要显示所有信息因此不能直接使用group by后的结果
    • 2、窗口函数+groupby
    • 3、根据两个变量分组
    • 4、聚合函数不一定要和groupby一起用
    • 5、groupby最易错点:select 分组变量/聚合函数
    • 6、根据不同字段group by
  • (四)排序
    • 1、第n多
    • 2、前n多:窗口函数
  • (五)执行顺序
  • (六)统计不同——去重
  • (七)行列转换
  • (八)时间函数
  • (九)字符串函数
    • 1、内置函数
    • 2、正则表达式
  • (十)类型转换
  • (十一)随机抽样
  • (十二)空值null
    • null和空值得区别

(一)表连接 1、多表连接 (1)join-on和多个where等价
涉及多个表,要么join用on来筛选,要么多表查询限制很多个where条件
-- 1、查询"01"课程比"02"课程成绩高的学生的信息及课程分数01-多表join,用on筛选 (1)a连b连c,不要a连(b连c),这样会把过程写复杂 (2)中间的筛选放在on里写,where只能在最后(查询前)写,不能在join的过程中 select a.* ,b.s_score as 01_score,c.s_score as 02_score from student a join score b on a.s_id=b.s_id and b.c_id='01'#01一定要有成绩,所以用了join left join score c on b.s_id=c.s_id and c.c_id='02' #02成绩可有可无,所以用left join where b.s_score>c.s_score; 02-不连接,直接一个where筛选出所有的结果,要哪些信息就直接选择 select a.*,b.s_score as 01_score,c.s_score as 02_score from student a,score b,score c where a.s_id=b.s_id and a.s_id=c.s_id and b.c_id='01' and c.c_id='02' and b.s_score>c.s_score-- 9、查询学过编号为"01"并且也学过编号为"02"的课程的同学的信息 (1)join 方法 select student.* from student join score a on a.s_id=student.s_id and a.c_id='01' join score b on b.s_id=student.s_id and b.c_id='02'(2)where方法,注意要from所有表,筛选所有条件都具备的情况 select student.* from student,score a ,score b where student.s_id=a.s_id and student.s_id=b.s_id and a.c_id='01' and b.c_id='02'

【例】SQL19、查找所有员工的last_name和first_name以及对应的dept_name,也包括暂时没有分配部门的员工
  • 思路:因为要包括暂时没有分配部门的员工,所以要把employees放在最左边,且用两次left join
# 两次LEFT JOIN连接 SELECT last_name, first_name, dept_name FROM employees LEFT JOIN dept_emp ON employees.emp_no=dept_emp.emp_no LEFT JOIN departments ON dept_emp.dept_no=departments.dept_no

【例】SQL22 统计各个部门的工资记录数
SELECT d.dept_no, dept_name,count(*) as sum FROM salaries s JOIN dept_emp de ON de.emp_no = s.emp_no JOIN departments d ON d.dept_no = de.dept_no GROUP BY dept_no-- 从groupby可以开始用select中的别名 ORDER BY dept_no

(2)通过多表连接解决A成立B不成立的问题
【例】SQL25 满足条件的用户的试卷完成数和题目练习数
请你找到高难度SQL试卷得分平均值大于80并且是7级的红名大佬,统计他们的2021年试卷完成数和题目练习数,只保留2021年有试卷完成记录的用户。结果按试卷完成数升序,按题目练习数降序。
  • 题目的意思是说,试卷一定要有完成记录,但是题目不一定要有,这种情况下应该把试卷完成情况作为左表,题目完成情况作为右表,其他情况再做筛选。
select uid, exam_cnt, (case when question_cnt is null then 0 else question_cnt end) #if(question_cnt is null, 0, question_cnt) from (select uid,count(score) as exam_cnt from exam_record where YEAR(submit_time) = 2021 group by uid) t-- 试卷有成绩 left join (select uid,count(submit_time) as question_cnt from practice_record where YEAR(submit_time) = 2021 group by uid) t2 using(uid)-- 题目不一定做了 where uid in ( select uid from exam_record join examination_info using(exam_id) join user_info using(uid) where tag = 'SQL' and difficulty = 'hard' and `level` = 7 group by uid having avg(score) >= 80 ) order by exam_cnt asc, question_cnt desc

(3)自连接
【例】SQL70 牛客每个人最近的登录日期(五)
  • 法一:sum(case when 1 else 0 end)分组计算
  • 法二:自连接:join on user_id相等并且datediff(nextday,today)=1
  • 法三:lead窗口函数
要在group by之后还能得到所有日期的结果,可以把原表和现表左边界,
select distinct date from login或者select date from login group by date(原表的date也要唯一)
# 法一 select date, ifnull( round( sum(case when (user_id,date) in (select user_id,date_add(date,interval -1 day) from login) and (user_id,date) in (select user_id,min(date) from login group by user_id) then 1 else 0 end)/ sum(case when (user_id,date) in (select user_id,min(date) from login group by user_id) then 1 else 0 end),3),0) as p from login group by date order by date# 法二 # a最早登录的日期左连接第二天的日期:on用户和时间差 # b所有的日期连接a:所有天的最早和第二天情况 # 对b计数select t0.date, ifnull(round(count(t2.user_id)/count(t1.user_id),3),0) from (select distinct date from login)t0 left join (select user_id,min(date)as min_date from login group by user_id)t1 on t0.date=min_date left join login t2 on t1.user_id=t2.user_id and datediff(t2.date,min_date)=1 group by t0.date# 法三 select date, ifnull(round(sum(case when date=min_date and datediff(next_date,date)=1 then 1 else 0 end) / sum(case when date=min_date then 1 else 0 end),3),0) as p from( select user_id, date, min(date) over (partition by user_id) as min_date, lead(date,1) over(partition by user_id order by date) as next_date from login ) a group by date order by date;

【例】SQL46 大小写混乱时的筛选统计
#自连接得到符合大小写要求的exam_id #on的妙用select a.tag,b.answer_cnt from (select tag,count(start_time) as answer_cnt from examination_info join exam_record using(exam_id) group by tag)ajoin(select tag,count(start_time) as answer_cnt from examination_info join exam_record using(exam_id) group by tag)b on a.tag!=b.tag and upper(a.tag)=b.tag group by tag order by answer_cnt desc

(4)from相同表但是where不同
with temp as ( select x from table where x1 in ('a','b','c') ) select x from temp wehre x1='a'

2、表连接函数 (1)union、union all
行合并,要求列是同数量且有相似的数据类型,每条 SELECT 语句中的列的顺序必须相同。union会去重并降低效率,union all允许重复的值。UNION 结果集中的列名总是等于 UNION 中第一个 SELECT 语句中的列名。
# union SELECT column_name(s) FROM table_name1 UNION SELECT column_name(s) FROM table_name2 # union all SELECT column_name(s) FROM table_name1 UNION ALL SELECT column_name(s) FROM table_name2

orderby只能在最后使用一次,所以这边只能放进子表中才能使用两次
【例】SQL23 每个题目和每份试卷被作答的人数和次数
select * from (select exam_id as tid, count(distinct uid) as uv, count(*) as pv from exam_record group by tid order by uv desc,pv desc)a-- orderby只能在最后使用一次,所以这边只能放进子表中才能使用两次union all select * from (select question_id as tid, count(distinct uid) as uv, count(*) as pv from practice_record group by tid order by uv desc,pv desc)b

【例】SQL24 分别满足两个活动的人
输出2021年里,所有每次试卷得分都能到85分的人以及至少有一次用了一半时间就完成高难度试卷且分数大于80的人的id和活动号,按用户ID排序输出。输出形式:
SQL|牛客刷题——题型总结
文章图片

所有每次试卷得分都能到85分的人:
(1)思路1:找到存在分数小于85分的用户,筛选的时候用not in排除
(2)思路2:根据用户分组,最小分数>=85的用户,就是符合条件的用户
  • 难点:分别’activity1’ as activity和’activity2’ as activity之后union all
# 思路1 with a as (select uid from exam_record where score<85 and year(submit_time) = 2021)-- 存在分数<85的用户,不符合activity1 select distinct uid, (case when uid not in (select * from a) then 'activity1' else null end) as activity from exam_record where (case when uid not in (select * from a) then 'activity1' else null end) is not null union all select distinct uid,'activity2' as activity from exam_record e_r left join examination_info e_i using(exam_id) where year(submit_time) = 2021 and difficulty = 'hard' and score > 80 and timestampdiff(minute, start_time, submit_time) * 2 < e_i.duration order by uid# 思路2 select uid,'activity1' as activity from exam_record where year(submit_time) = 2021 group by uid having min(score) >= 85 union all select distinct uid,'activity2' as activity from exam_record e_r left join examination_info e_i using(exam_id) where year(submit_time) = 2021 and difficulty = 'hard' and score > 80 and timestampdiff(minute, start_time, submit_time) * 2 < e_i.duration order by uid

(2)left join、right join、join的区别
SQL|牛客刷题——题型总结
文章图片

  • left join:查出来的结果显示左边的所有数据,然后右边显示的是和左边有交集部分的数据。
  • right join:查出表2所有数据,以及表1和表2有交集的数据。
  • join(inner join):查出两个表有交集的部分,其余没有关联就不额外显示出来。
3、行对应性表连接 SQL86 实习广场投递简历分析(三)
  • 代码注意点:(1)select一定要写清哪些,这里写*会报错;(2)RIGHT(s,n) 返回字符串 s 的后 n 个字符;(3)用right(first_year_mon,2)=right(second_year_mon,2)控制每行上时间的对应性;(4)顺序最后一定要调整
select t1.job,first_year_mon,first_year_cnt,second_year_mon,second_year_cnt from (select job,DATE_FORMAT(date,'%Y-%m') as first_year_mon,sum(num)as first_year_cnt from resume_info where date like '2025%' group by job,first_year_mon)t1 JOIN (select job,DATE_FORMAT(date,'%Y-%m') as second_year_mon,sum(num)as second_year_cnt from resume_info where date like '2026%' group by job,second_year_mon)t2 on t1.job=t2.job AND right(first_year_mon,2)=right(second_year_mon,2) order by first_year_mon desc,job desc

希望用到两张表的信息——表连接+条件筛选
SQL76 考试分数(五)
  • 代码注意点:所有语句中,如果变量名是唯一的,就不需要写表名,写表名是在易混淆的情况下才这么做。
-- 查询各个岗位分数的中位数位置上的所有grade信息,并且按id升序排序 select id,t1.job,score,t_rank from (select id,job,score, row_number() over (partition by job order by score desc) as t_rank from grade)t1 JOIN (select job, (case when count(id)%2=0 then count(id)/2 else ceiling(count(id)/2) end) as start, (case when count(id)%2=0 then count(id)/2+1 else ceiling(count(id)/2) end) as end from grade group by job )t2 on t1.job=t2.job where t_rank=start or t_rank=end order by id

4、嵌套查询 当下一层计算结果是基于上一层时,需要用到层层嵌套的方法
【例】SQL28 第二快/慢用时之差大于试卷时长一半的试卷
-- 先用窗口函数找出每门考试的正数和倒数的排名 -- 然后根据每门课分组,计算正数和倒数对应时间的时间差 -- 最后筛选出时间差符合要求的情况 -- 涉及的知识点:并列计数窗口、分组条件计算、嵌套查询select distinct exam_id, duration, release_time from (select exam_id, duration, release_time, #sum(case when rank1 = 2 then costtime when rank2 = 2 then -costtime else 0 end) as sub max(case when rank1=2 then costtime else null end)-max(case when rank2=2 then costtime else null end) as subfrom ( select e_i.exam_id, duration, release_time, timestampdiff(minute, start_time, submit_time) as costtime, row_number() over(partition by e_r.exam_id order by timestampdiff(minute, start_time, submit_time) desc) rank1, row_number() over(partition by e_r.exam_id order by timestampdiff(minute, start_time, submit_time) asc) rank2 from exam_record e_r join examination_info e_i on e_r.exam_id = e_i.exam_id ) table1 group by exam_id ) table2 where sub*2 >= duration order by exam_id desc

【例】SQL29 连续两次作答试卷的最大时间窗
时间函数:
datediff(end_time,start_time)
date(start_time)
-- 细节点:(1)作答过只需要有start_time就可以了;(2)根据题意算时间差都需要在公式的基础上+1 -- 需要的数据:每个人的前后期开始作答时间(窗口), -- groupby:每个人的最大窗口时间,对窗口时间筛选,每个人的最先时间,最后时间,作答次数 -- 在上面的基础上计算count,max,min -- 一层基于一层来计算,用层层嵌套来做select uid,days_window, round(counts/sub_day*days_window,2) as avg_exam_cnt from (select uid, max(datediff(next_time,start_time))+1 as days_window, datediff(max(date(start_time)),min(date(start_time)))+1 as sub_day, count(start_time) as counts from (select uid,start_time, lead(start_time,1)over(partition by uid order by start_time) as next_time from exam_record where year(start_time)=2021)a group by uid having count(distinct date(start_time))>=2 )b order by days_window desc,avg_exam_cnt desc;

【例】SQL30 近三个月未完成试卷数为0的用户完成情况
# 每个人的试卷作答(start)记录的月份:窗口函数,序号 # 做筛选:序号前三,没有未完成 # 筛选出用户,得到该用户的试卷完成数(近三个月) # 按试卷完成数和用户ID降序排名select uid,count(submit_time) as exam_complete_cnt from (select uid,date_format(start_time,'%Y%m') as ans_month,start_time,submit_time, dense_rank()over(partition by uid order by date_format(start_time,'%Y%m') desc) as recent_months from exam_record)a where recent_months between 1 and 3-- <=3 group by uid having count(start_time)=count(submit_time) order by exam_complete_cnt desc,uid desc

【例】SQL31 未完成率较高的50%用户近三个月答卷情况
  • 代码注意点:(1)三表嵌套join很复杂,用where in代替反而简化问题;(2)count()在只有一类的情况下可以不和groupby连用,但是只能显示一行结果。count()over()可以在每一行都显示结果;(3)判断前50%(中位数及之后):rank<=ceiling(总数/2),则是前50%,否则不是
    【法一】count(distinct uid) over ()把总人数连接到表上
    【法二】只join不on,可以把总人数连接到表上
    【法三】(select count(distinct uid) from)表示总人数
  • 思路:先把步骤和对应的方法按照先后顺序写出来,再写代码
# 数所有行数用count(1)或者count任意一个非空变量都可以 with a as ( select uid from (select *,row_number()over(order by incomplete_rate desc) incomplete_order,count(1)over() as numbers from (select uid,(count(1)-count(submit_time))/count(1) as incomplete_rate from exam_record where exam_id in (select exam_id from examination_info where tag='SQL') group by uid)t1)t2 join (select count(distinct uid) as total_user from exam_record join examination_info using(exam_id) where tag='SQL') t_u # where incomplete_order<=ceiling(numbers*0.5)-- 法一 # where incomplete_order<=ceiling(total_user*0.5)-- 法二 where incomplete_order<=ceiling((select count(distinct uid) as total_user from exam_record join examination_info using(exam_id) where tag='SQL')*0.5)-- 法三 and uid in (select uid from user_info where level in (6,7)) )select uid,start_month,count(start_time) as total_cnt,count(submit_time) as complete_cnt from (select uid,date_format(start_time,'%Y%m') as start_month,start_time,submit_time, dense_rank()over(partition by uid order by date_format(start_time,'%Y%m') desc) as recent_months from exam_record)recent_table where recent_months<=3 and uid in (select uid from a) group by uid,start_month-- 每个人每个月的登录情况 order by uid,start_month

(二)筛选:where、having、in 1、where和having的区别 (1)作用位置:都是筛选功能,where指定分组之前数据行的条件,having子句用来指定分组之后条件
(2)使用限制:where是对聚合前的信息进行筛选,having是对聚合后的信息进行筛选
(3)联系:where-groupby-having的使用顺序,where和having的区别在于筛选对象是分组前还是分组后
【易错点】涉及groupby的时候注意select的要么是聚合函数,要么是groupby的对象
-- 15、查询两门及其以上不及格课程的同学的学号,姓名及其平均成绩 select a.s_id,a.s_name,avg(b.s_score) as avg_score from student a join score b on a.s_id=b.s_id where s_score<60 group by a.s_id,a.s_name having count(c_id)>=2

SQL78 牛客的课程订单分析(二)
select user_id from order_info where status='completed' and product_name in ('C++','Java','Python') and date>'2025-10-15' group by user_id having count(id)>=2 order by user_id

SQL88 最差是第几名(二)
  • 涉及变量比较一定是在同一行上的数据
  • where的逻辑在from之后,select之前,所以这里的where筛选可以用到from表中有但是select中没有的变量
-- 中位数:正序和逆序的累积和都大于总和的一半,就是中位数 select grade FROM (select grade,(select sum(number) from class_grade) as total, sum(number) over (order by grade) as up, sum(number) over (order by grade desc) as down from class_grade )a where up>=total/2-- 涉及变量比较一定是在同一行上的,where在from之后,select之前 and down>=total/2 order by grade

2、where in与join等价的情况
当一次groupby,需要筛选条件时,where in和join时等价的
当多次groupby,需要筛选条件时,用where in () (注意不是where in ()a,不用标记表名)
等价:【例】SQL22 作答试卷得分大于过80的人的用户等级分布
# where in select level,count(uid) as level_cnt from user_info where (uid,level) in# 字段数需要统一 (select ui.uid,level from exam_record er left join user_info ui using(uid) left join examination_info ei using(exam_id) where tag='SQL' and score>80) group by level order by level_cnt DESC# join select level,count(distinct u_i.uid) as level_cnt from exam_record e_r left join examination_info e_i on e_r.exam_id = e_i.exam_id left join user_info u_i on e_r.uid = u_i.uid where tag = 'SQL' and score > 80 group by level order by level_cnt desc, level desc

不等价【例】月均完成试卷数不小于3的用户爱作答的类别
SELECT tag,count(tag) as tag_cnt from exam_record join examination_info using(exam_id) where uid in ( select uid from exam_record where submit_time is not null group by uid having count(submit_time)/count(distinct date_format(submit_time,'%Y%m'))>=3 ) group by tag order by tag_cnt desc

【例】SQL70 牛客每个人最近的登录日期(五)
查询每个日期新用户的次日留存率,结果保留小数点后面3位数(3位之后的四舍五入),并且查询结果按照日期升序排序
  • 代码注意点:(1)iffull(value,0)表示如果是null就输出0;(2)(a,b) in (select A,B from…)列数一定要对等
  • 明确问题:12号的新用户次留是指在12号是第一次登录,并且在13号也登录了。分母:当前日期新用户的特征是 当前日期=该用户所有登录日期的最小值。分子:当前日期作为前一天有该用户的登录记录,并且是第一次登录。(12号作为前一天登陆了并且是第一次登录,13号要登录了)
-- 通过in来筛选 -- 分子:今天在,昨天也在,且昨天是第一天登录 -- 分母:每天的新用户数 -- 易错点:分母为0,ifnullselect date, ifnull(round(sum(case when (user_id,date) in (select user_id,date_add(date,interval -1 day) from login) and (user_id,date) in (select user_id,min(date) from login group by user_id) then 1 else 0 end)/ sum(case when (user_id,date) in (select user_id,min(date) from login group by user_id) then 1 else 0 end),3) ,0) as p from login group by date order by date

(三)聚合信息:groupby与窗口函数 1、要显示所有信息因此不能直接使用group by后的结果 SQL79 牛客的课程订单分析(三)
【法一】内表找出user_id,外表找出该user_id符合的记录
-- 要显示所有信息因此不能直接使用group by后的结果 -- 先找到符合条件的人 with temp1 as (select user_id from order_info where date>'2025-10-15' and status='completed' and product_name in ('C++','Java','Python') group by user_id having count(id)>=2)-- 再找到符合条件的所有信息 select * from order_info where user_id in (select * from temp1)-- 注意不能直接写成temp1 and date>'2025-10-15' and status='completed' and product_name in ('C++','Java','Python') order by id

同理也可以不用临时表来写
select * from order_info where user_id in (select user_id from order_info where date>'2025-10-15' and status='completed' and product_name in ('C++','Java','Python') group by user_id having count(id)>=2)-- 注意不能直接写成temp1 and date>'2025-10-15' and status='completed' and product_name in ('C++','Java','Python') order by id

注:
临时表的写法:
with a as (),
b as (),
c as ()
【法二】窗口函数
groupby会把结果聚合成一行,所以如果需要所有信息,就要先内表后外表。但是窗口函数生成结果的行数不变,因此可以直接基于窗口函数做筛选,但是如果where筛选涉及窗口,还是要作为内表的,因为where的逻辑再select之前,但是可以少写很多筛选条件。
select id,user_id,product_name,status,client_id,date from (select *, count(id) over (partition by user_id) as counts from order_info where date>'2025-10-15' and status ="completed" and product_name in ("C++","Java","Python") ) a where counts>=2 order by id

2、窗口函数+groupby SQL80 牛客的课程订单分析(四)
  • 思路:最后需要最小日期所以肯定做聚合,做聚合就需要全部信息,所以前一步的计数肯定用到窗口函数
-- 首先在有次数的内表上做筛选,然后基于筛选结果做聚合函数求最小日期 -- 窗口函数+groupby select user_id,min(date) as first_buy_date,cnt from (select *, count(id) over (partition by user_id) as cnt from order_info where date>'2025-10-15' and status ="completed" and product_name in ("C++","Java","Python") ) a where cnt>=2 group by user_id order by user_id

SQL27 每类试卷得分前3名
select * from (select tag,er.uid, row_number() over (partition by tag order by max(score) desc,min(score) desc,uid desc)as ranking from examination_info ei join exam_record er using(exam_id) group by uid,tag ) a where ranking<=3

3、根据两个变量分组 SQL85 实习广场投递简历分析(二)
select job,date_format(date,'%Y-%m') as mon,sum(num) as cnt from resume_info where date like '2025%'-- 符合最左前缀匹配原则,也走索引 group by job,mon order by mon desc,cnt desc;

4、聚合函数不一定要和groupby一起用
当只有一类的情况下,聚合函数不一定要和groupby一起用
【例】SQL14 SQL类别高难度试卷得分的截断平均值
-- 法一:嵌套子函数,因为只有一类,所以不需要groupby直接就可以算出min和max select tag,difficulty,round(avg(score),1) as clip_avg_score from exam_record er join examination_info ei on er.exam_id=ei.exam_id where tag='SQL' and difficulty='hard' and score != (select max(score) from exam_record where tag='SQL' and difficulty='hard') and score != (select min(score) from exam_record where tag='SQL' and difficulty='hard')-- 法二:窗口函数,正序和倒序两次row_number来找到最大和最小 select tag,difficulty,round(avg(score),1) as clip_avg_score from ( select tag,difficulty,score, row_number() over (partition by tag order by score) as rank1, row_number() over (partition by tag order by score desc) as rank2 from exam_record er join examination_info ei on er.exam_id=ei.exam_id where tag='SQL' and difficulty='hard' and score is not null ) a where rank1!=1 and rank2!=1

【例】SQL31 未完成率较高的50%用户近三个月答卷情况
select count(distinct uid) as total_user from exam_record

5、groupby最易错点:select 分组变量/聚合函数 SQL18 月总刷题数和日均刷题数
这里也可以用ifnull,ifnull和coalesce的区别:
ifnull只有两个参数,coalesce有多个参数,返回第一个非空的值
group by with rollup具有汇总加和的功能,但是列名那里自动为null,如果希望有列名,则需要辅助ifnull/coalesce函数
Hive中with rollup和with cude都可以用于group by的汇总,但是当分组依据是三组的情况下,二者呈现出的汇总效果不一样。cube是3222211111,而rollup是321321321。
这里最易错的点在于每月天数的计算
(1)计算每个月的天数可以用函数:day(last_day(time))
(2)也可以自己写:case when month(time) in (1,3,5,7,8,10,12) then 31 else 30 end
(3)最易错的点在于这里用到了groupby month,需要用max(day_of_month)或者min、first、last汇总出唯一结果,这样才不会报错
select coalesce(date_format(submit_time,'%Y%m'),'2021汇总') as submit_month, count(score) as month_q_cnt, round(count(score)/max(case when month(submit_time) in (1,3,5,7,8,10,12) then 31 else 30 end),3) as avg_day_q_cnt FROM practice_record where year(submit_time)=2021 group by DATE_FORMAT(submit_time, "%Y%m") with rollup

6、根据不同字段group by
问题:
(1)涉及两种不同的groupby:每个人购买每个商品的次数至少两次的人数(筛选)+每种商品的购买人数
(2)因为涉及对其中一个groupby的筛选,因此如果直接在两个groupby的基础上直接再groupby会导致范围不对
-- 法一:在两个字段groupby的表上套一个字段groupby,用if来筛选 SELECT product_id, ROUND(SUM(repurchase) / COUNT(1), 3) as repurchase_rate FROM ( SELECT uid, product_id, IF(COUNT(1)>1, 1, 0) as repurchase FROM tb_order_detail JOIN tb_order_overall USING(order_id) JOIN tb_product_info USING(product_id) WHERE tag="零食" AND event_time >= ( SELECT DATE_SUB(MAX(event_time), INTERVAL 89 DAY) FROM tb_order_overall)GROUP BY uid, product_id ) as t_uid_product_info GROUP BY product_id ORDER BY repurchase_rate DESC, product_id LIMIT 3; -- 法二:直接表连接计算字段 select a.product_id, ifnull(round(cnt_2/cnt_total,3),0.000) repurchase_rate from tb_product_info a left join (select product_id, count(distinct uid) cnt_total-- 该产品被几个人购买过 from tb_order_detail t1 left join tb_order_overall t2 on t1.order_id = t2.order_id GROUP BY product_id) b on a.product_id = b.product_id left join (select product_id,count(distinct uid) as cnt_2-- count(1)也可以 from (select product_id,uid,count(1) as cnt from tb_order_detail join tb_order_overall using(order_id) where (DATEDIFF((select max(event_time) from tb_order_overall),date(event_time)) < 90) group by product_id,uid having cnt>=2)c group by product_id)d on a.product_id = d.product_id where tag='零食' order by repurchase_rate desc,product_id limit 3

(四)排序 1、第n多 用orderby之后limit个数
limit y --读取 y 条数据 limit x, y --跳过 x 条数据,读取 y 条数据 limit y offset x --跳过 x 条数据,读取 y 条数据

2、前n多:窗口函数 【例】leetcode–185. 部门工资前三高的所有员工
select Department,Employee,Salary from (select b.name as Department, a.Name as Employee, a.Salary, dense_rank() over (partition by a.DepartmentId order by a.Salary desc) as salary_rank from Employee a join Department b on a.DepartmentId=b.Id) c where salary_rank<=3

(五)执行顺序 where在join on后
SQL73 考试分数(二)
查询用户分数大于其所在工作(job)分数的平均分的所有grade的属性
  • 代码注意点:表连接中如果是唯一字段可以不加表名
select id,grade.job,score from grade left join (select job,avg(score) as avg_score from grade group by job) a on grade.job=a.job where score>avg_score order by id

(六)统计不同——去重 distinct 会对结果集去重,对全部选择字段进行去重,并不能针对其中部分字段进行去重。使用count distinct进行去重统计会将reducer数量强制限定为1,而影响效率,因此适合改写为子查询。
-- 统计不同的id的个数 select count(distinct id) fromtable_1-- 优化版本的count distinct select count(*) from (select distinct id from table_1) tb

SQL15 统计作答次数
数行数用count(*)
count(var) 如果var有空值会自动忽略
count(distinct var)在计数时去重
如果是限制var2不要有空值,数var1有多少种,要用到case when
select count(*) as total_pv, count(score) as complete_pv, -- 聚合函数计算时会忽略空值 count(distinct case when score is not null then exam_id else null end) as complete_exam_cnt from exam_record ;

(七)行列转换 tmp_column
select A,B,C from table lateral view explode(split(column_C,',')) tmp_table as C -- A,B,column_C 都是原表的列(字段),tmp_table:explode形成的新虚拟表,可以不写;

SQL|牛客刷题——题型总结
文章图片

select * from table LATERAL VIEW EXPLODE(SPLIT(ab_version, ',')) vidtb AS vid_explode where vid_explode in ("1262091")

(八)时间函数
# 提取时间 DATE_FORMAT(NOW(),'%Y') DATE_FORMAT(NOW(),'%m%d') year()/month()/day()/hour()/minute()/second()/date() # 转换类型 convert(log_time,date) # 时间差day datediff(string enddate, string startdate) -- datediff函数只能处理'yyyy-MM-dd'这种格式的日期,如果日期形式是'yyyyMMdd'的话,需要进行格式转换TIMESTAMPDIFF(interval, time_start, time_end) -- 可计算time_start-time_end的时间差,单位以指定的interval为准:second,minute,hour,day,month,year # 时间加 date_add(string startdate, int days) A.T_DATE = B.T_DATE+ interval 1 hour '2021-09-01 22:11:12'+interval 50 minute # 时间减 date_sub (string startdate, int days) A.T_DATE = B.T_DATE+ interval -1 hour

-- 日期(2020-03-21 17:13:39)和unix时间戳(1584782175)之间相互转换 ## 日期转化为时间戳 ## select unix_timestamp('2020-03-21 17:13:39'):得到 1584782019 select unix_timestamp('20200321 13:01:03','yyyyMMdd HH:mm:ss') 得到 1584766863 select unix_timestamp('20200321','yyyyMMdd') 得到 1584720000## 时间戳转化为日期 ## select from_unixtime (1584782175) 得到 2020-03-21 17:16:15 select from_unixtime (1584782175,'yyyyMMdd') 得到 20200321 select from_unixtime (1584782175,'yyyy-MM-dd')得到 2020-03-21## 日期和日期之间,也可以通过时间戳来进行转换 ## select from_unixtime(unix_timestamp('20200321','yyyymmdd'),'yyyy-mm-dd') 得到 2020-03-21 select from_unixtime(unix_timestamp('2020-03-21','yyyy-mm-dd'),'yyyymmdd')得到 20200321

(九)字符串函数 1、内置函数
-- 1、拼接 -- (1)concat( A, B...)返回将A和B按顺序连接在一起的字符串 select concat('abc', 'def','gh') 得到abcdefgh concat(round(num,1),'%')# 得到百分数-- (2)concat_ws(string X, stringA, string B) 返回字符串A和B由X拼接的结果 select concat_ws(',', 'abc', 'def', 'gh') 得到 abc,def,gh-- (3)根据分组情况连接字段 group_concat([DISTINCT] 要连接的字段 [Order BY ASC/DESC 排序字段] [Separator '分隔符']) # DISTINCT用来给字段去重 # 默认用逗号分隔 # 等同于先用窗口函数排序再用collect_set去重组合-- rk根据窗口函数得到 -- 把rankd的name组合在一起方便排序:['001张','002李'] SORT_ARRAY(COLLECT_SET(CONCAT(LPAD(CAST(rk AS STRING), 3, '0'),feature_name))) -- 把['001张','002李']中的数字去掉:先把列表组合成字符串,然后替换,再根据逗号拆分成数组 split(REGEXP_REPLACE(CONCAT_WS(',', rank), '[0-9]\{3\}', ''),',')-- 2、分割 substring_index(str,delim,count) str=www.wikidm.cn substring_index(str,'.',1)结果是:www(从左向右数) substring_index(str,'.',2)结果是:www.wikidm substring_index(str,'.',-2)结果为:wikidm.cn(从右向左数) substring_index(substring_index(str,'.',-2),'.',1)结果是:wikidm(中间的数)-- 3、切片 -- substr/substring(str,start,len) 截取字符串从0位开始的长度为len个字符。如果不加len,默认从start到end。 select substr('abcde',3,2) from iteblog; -- 得到cd-- 4、其他 select char_length('abcedfg') # 字符长度为7 ## 使用trim(string A) 去除字符串两边的空格 select trim(' abc ') 得到 'abc' ## 使用lower(string A)/ lcase(string A)返回字符串的小写形式,常用于不确定原始字段是否统一为大小写 select lower('abSEd') 得到 absed ## 使用upper(string A)/ ucase(string A)返回字符串的大写形式,常用于不确定原始字段是否统一为大小写 select upper('abSEd') 得到 ABSED

【例】用户行为分析
用户行为表tracking_log
SQL|牛客刷题——题型总结
文章图片

统计用户行为序列为A-B-D的用户数,其中:A-B之间可以有任何其他浏览记录(如C,E等),B-D之间除了C记录可以有任何其他浏览记录(如A,E等)
select count(*) from( select user_id,group_concat(opr_id) ubp from tracking_log group by user_id ) a where ubp like '%A%B%D%' and ubp not like '%A%B%C%D%' # 先提取子表后where筛选

【例】SQL19 未完成试卷数大于1的有效用户
SQL|牛客刷题——题型总结
文章图片

  • 拼接思路:首先字段拼接成新字段,然后是分组后的新字段拼接,要求分组时拼接的字段是不重复的,因此用distinct去重
select uid , sum(case when submit_time is null then 1 else 0 end) as incomplete_cnt , sum(case when submit_time is not null then 1 else 0 end) as complete_cnt , group_concat(distinct CONCAT(DATE_FORMAT(start_time, '%Y-%m-%d'),':',tag) separator '; ') as detail from exam_record er join examination_info ei using(exam_id) where YEAR(start_time) = 2021 group by uid having incomplete_cnt>1 and incomplete_cnt<5 and complete_cnt >= 1 order by incomplete_cnt desc

2、正则表达式
regexp_extract 提取
regexp_replace 替换
##regexp_extract(string subject, string pattern, int index) ## 将字符串subject按照pattern正则表达式的规则拆分,返回index指定的字符 select regexp_extract('foothebar', 'foo(.*?)(bar)', 1) 得到 the## regexp_replace(string A, string B, string C) ## 将字符串A中的符合java正则表达式B的部分替换为C select regexp_replace('foobar', 'oo|ar', '') 得到 fb

get_json_object(string json_string, string path)
第一个参数填写json对象变量,第二个参数使用$表示json变量标识,然后用 . 或 [] 读取对象或数组;
json对象相当于sql中的字典
data = https://www.it610.com/article/{"store": { "fruit":[{"weight":8,"type":"apple"}, {"weight":9,"type":"pear"}], "bicycle":{"price":19.95,"color":"red"} }, "email":"amy@only_for_json_udf_test.net", "owner":"amy" } hive> selectget_json_object(data, '$.owner') from test; 结果:amy hive> selectget_json_object(data, '$.store.bicycle.price') from test; 结果:19.95 hive> selectget_json_object(data, '$.store.fruit[0]') from test; 结果:{"weight":8,"type":"apple"}

【例】SQL39 筛选昵称规则和试卷规则的作答记录
select ui.uid,ei.exam_id,round(avg(score)) as avg_score from exam_record join user_info ui using(uid) join examination_info ei using(exam_id) where (nick_name rlike '^牛客[0-9]+号$'-- ^开头,[0-9]任意一个字符,+一个或多个匹配 or nick_name rlike '^[0-9]+$') -- $结尾 and tag rlike '(c|C).*'-- c或C,.任意字符,*0或多个匹配 and score is not null-- 这一行要加,因为如果哪一行只有一个结果就是空值,就没办法通过avg的计算把null抵消掉 group by uid,exam_id order by uid,avg_score

(十)类型转换
CAST (expression AS data_type)

可以转换的数据类型:
二进制,同带binary前缀的效果 : BINARY
字符型,可带参数 : CHAR()
日期 : DATE
时间: TIME
日期时间型 : DATETIME
浮点数 : DECIMAL
整数 : SIGNED
无符号整数 : UNSIGNED
SELECT CAST('9.0' AS decimal)结果:9 SELECT CAST('9.5' AS decimal(10,2))结果:9.50 SELECTCAST(NOW() AS DATE) 结果:2017-11-27 cast(exam_cnt_rank_21 as signed) -- 字符串转化为数字

【例】收入区间分组
select id, (case when CAST(salary as float)<50000 Then '0-5万' when CAST(salary as float)>=50000 and CAST(salary as float)<100000 then '5-10万' when CAST(salary as float) >=100000 and CAST(salary as float)<200000 then '10-20万' when CAST(salary as float)>200000 then '20万以上' else NULL end from table_1;

(十一)随机抽样
rand(),rand(int seed)## 从数据表中随机取两条数据,设定了rand(1)之后,每次跑出来的都是一样的两条数据 select * from dm_growth_da.xdl_test_20200328 order by rand(1) limit 2

(十二)空值null 什么时候需要标记is not null?
(1)当设计窗口函数排序row_number,需要where is not null
(2)count,avg,sum会自动排除null
(3)限制其他变量非空,对该变量计数,则需要写成类似count(distinct case when score is not null then exam_id else null end) as complete_exam_cnt 的形式
【例】SQL15 统计作答次数
  • 注意点:count中的casewhen在一般情况下也可以用where来替代,书写上会更加好读,(见下一个例子)但是因为这边要count(*)所以不能用where做统一筛选。
select count(*) as total_pv, count(score) as complete_pv, -- 聚合函数计算时会忽略空值 count(distinct case when score is not null then exam_id else null end) as complete_exam_cnt from exam_record ;

【例】SQL17 平均活跃天数和月活人数
  • 思路:因为平均活跃天数的分子是所有用户的活跃天数之和,需要用count来做,所以是根据用户和天来去重(每个用户一天如果登录多次,就记录一次)
select date_format(submit_time, '%Y%m') as month, round((count(distinct uid, date_format(submit_time, '%y%m%d'))) / count(distinct uid), 2) as avg_active_days,-- 每个人的登录天数count,要对人和天去重distinct count(distinct uid) as mau from exam_record where submit_time is not null# 很关键的非空 and year(submit_time) = 2021 group by month

【例】SQL19 未完成试卷数大于1的有效用户
  • count(非空字段)可以等价于sum(case is not null then end)
# 等价于sum(case when submit_time is null then 1 else null end) select uid , sum(if(submit_time is null,1,null)) as incomplete_cnt , sum(if(submit_time is not null,1,null)) as complete_cnt , group_concat(distinct CONCAT(DATE_FORMAT(start_time, '%Y-%m-%d'),':',tag) separator '; ') as detail from exam_record er join examination_info ei using(exam_id) where YEAR(start_time) = 2021 group by uid having incomplete_cnt>1 and incomplete_cnt<5 and complete_cnt >= 1 order by incomplete_cnt desc# count SELECT uid, (count(*)-count(submit_time)) as incomplete_cnt, count(submit_time) as complete_cnt, group_concat(distinct concat_ws(':', date(start_time), tag) SEPARATOR '; ') as detail from exam_record left join examination_info using(exam_id) where year(start_time)=2021 group by uid having complete_cnt >= 1 and incomplete_cnt BETWEEN 2 and 4 order by incomplete_cnt DESC

null和空值得区别 1、空值不占空间,NULL值占空间。当字段不为NULL时,也可以插入空值。
2、当使用 IS NOT NULL 或者 IS NULL 时,只能查出字段中没有不为NULL的或者为 NULL 的,不能查出空值。
3、判断NULL 用IS NULL 或者 is not null,SQL 语句函数中可以使用IFNULL()函数来进行处理,判断空字符用 =’‘或者<>’'来进行处理。
4、在进行count()统计某列的记录数的时候,如果采用的NULL值,会别系统自动忽略掉,但是空值是会进行统计到其中的。
5、MySql中如果某一列中含有NULL,那么包含该列的索引就无效了。这一句不是很准确。
6、实际到底是使用NULL值还是空值(’’),根据实际业务来进行区分。个人建议在实际开发中如果没有特殊的业务场景,可以直接使用空值。
题目参考:https://blog.csdn.net/fashion2014/article/details/78826299?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522163132590216780269843900%2522%252C%2522scm%2522%253A%252220140713.130102334…%2522%257D&request_id=163132590216780269843900&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2alltop_positive~default-1-78826299.pc_search_result_hbase_insert&utm_term=sql&spm=1018.2226.3001.4187

    推荐阅读