spark | Spark SQL: making concat_ws output ordered

concat_ws in Spark (Hive) SQL has two common pitfalls:

  • its output is unordered
  • it silently skips NULL values

Making concat_ws ordered: the examples below build each user's event sequence, ordered by event time.
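For concreteness, here is a minimal PySpark sketch that registers sample data under the placeholder name table used by the queries below. The column names (user_id, event, event_time) follow the article; the session name and the sample rows are made up for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("concat_ws_ordered_demo").getOrCreate()

# Hypothetical sample rows: (user_id, event, event_time)
rows = [
    ("u1", "search", "2019-11-21 22:05:01.120"),
    ("u1", "click",  "2019-11-21 22:06:55.564"),
    ("u1", "pay",    "2019-11-21 22:10:33.007"),
    ("u2", "click",  "2019-11-21 21:59:40.332"),
]
df = spark.createDataFrame(rows, ["user_id", "event", "event_time"])

# Register under the name the SQL below refers to (quote it with backticks
# if your Spark version treats `table` as a reserved word).
df.createOrReplaceTempView("table")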
Method 1: window partition + row_number
SELECT *
FROM (
  SELECT user_id,
         concat_ws(' ', collect_list(event) OVER (PARTITION BY user_id ORDER BY event_time)) AS event_list,
         row_number() OVER (PARTITION BY user_id ORDER BY event_time DESC) AS rn
  FROM table
) b
WHERE rn = 1
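For reference, a hedged DataFrame-API equivalent of method 1, assuming the df registered in the setup sketch above:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Running, time-ordered event list per user (the default frame reaches the current row),
# plus a reverse row_number so the latest row per user carries the complete list.
w_asc = Window.partitionBy("user_id").orderBy("event_time")
w_desc = Window.partitionBy("user_id").orderBy(F.col("event_time").desc())

result = (
    df.withColumn("event_list", F.concat_ws(" ", F.collect_list("event").over(w_asc)))
      .withColumn("rn", F.row_number().over(w_desc))
      .where(F.col("rn") == 1)
      .select("user_id", "event_list")
)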


Method 2: group by + sort_array
CREATE TABLE event_table AS
SELECT user_id,
       sort_array(collect_list(concat_ws('_', event_time, event))) AS time_event_list
FROM table
GROUP BY user_id;


A UDF then strips the event_time prefix and keeps only the event names:
# Length of the timestamp prefix added above, e.g. '2019-11-21 22:06:55.564_'
time_length = len('2019-11-21 22:06:55.564_')

def event_list(time_event_text):
    # Split the '~'-joined string back into its sorted elements,
    # then drop the timestamp prefix from each element.
    time_event_list = time_event_text.split('~')
    event_list = [x[time_length:] for x in time_event_list]
    return ' '.join(event_list)

spark.udf.register("event_list", lambda x: event_list(x))

spark.sql("select user_id, event_list(concat_ws('~', time_event_list)) event_list from event_table")
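As an aside, on Spark 2.4+ the prefix-stripping UDF can be avoided by collecting (event_time, event) structs, sorting them, and projecting out the event field with the transform higher-order function. A sketch of that variant, against the same assumed source table:

# sort_array orders the structs by event_time (the first struct field),
# then transform keeps only the event names before they are joined.
ordered = spark.sql("""
    SELECT user_id,
           concat_ws(' ',
                     transform(sort_array(collect_list(struct(event_time, event))),
                               x -> x.event)) AS event_list
    FROM table
    GROUP BY user_id
""")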


Method 1 is expensive: the concatenated event_list field is materialized for every row, and a single row per user is then kept via rn = 1.
Method 2 collapses many rows into one per user and runs faster.

concat_ws ignores NULL
SELECT CONCAT_WS(',', 'First name', NULL, 'Last Name');  -- returns First name,Last Name

Code defensively with nvl:
SELECT CONCAT_WS(',', 'First name', NVL(NULL, ''), 'Last Name');  -- returns First name,,Last Name
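The same defense in the DataFrame API, wrapping every possibly-NULL column with coalesce; the frame and column names here are made up for illustration:

from pyspark.sql import functions as F

# Hypothetical frame with a NULL middle name.
people = spark.createDataFrame(
    [("First name", None, "Last Name")],
    "first string, middle string, last string",
)

cols = ["first", "middle", "last"]
# coalesce(col, '') keeps the empty slot visible instead of letting concat_ws drop it.
people.select(
    F.concat_ws(",", *[F.coalesce(F.col(c), F.lit("")) for c in cols]).alias("full_name")
).show(truncate=False)  # First name,,Last Name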
