注意,有的公司并不会明显区分DW层位DWD和DWS,由公司直接将这2者视为一个层级,并不做明显区分。毕竟适合公司的才是最好的,而不是刻板地按照理论照本宣科。
注意,sql中,可以对数据库,对表,对字段都加注释,为了维护方便,最好加注释。 来自大数据开发的吐血建议!!!!
因为数据字段较多,想办法将这些字段的获取分开提取,然后再join起来按照guid,sessionid进行分组聚合,然后提取pageid,国家,省,市,区,设备类型等几个字段 SELECT guid , sessionid as session_id , first_value(properties['pageid']) over(partition by guid,sessionid order by ts asc) as first_page_id , first_value(properties['pageid']) over(partition by guid,sessionid order by ts desc) as last_page_id , first_value(country) over(partition by guid,sessionid order by ts asc) as country , first_value(province) over(partition by guid,sessionid order by ts asc) as province , first_value(city) over(partition by guid,sessionid order by ts asc) as city , first_value(region) over(partition by guid,sessionid order by ts asc) as region , first_value(devicetype) over(partition by guid,sessionid order by ts asc) as device_type FROM dwd17.app_action_detail WHERE dt='2020-10-20' AND eventid='pageView' 根据上述结果,再提取出省市区,因为用户可能处于移动中,这时候,同一个用户一天中可能会跨国家,跨省,跨市,所以采取最早时间规则。也就是第一条数据对应的国家,省,市,区来决定结果。(注意,这需要跟需求方确认,有时候现实数据会超过需求方的认知和预计,这时候需要确定统计口径) SELECT guid as guid , session_id as session_id , min(first_page_id) as first_page_id , min(last_page_id) as last_page_id , min(country) as country , min(province) as province , min(city) as city , min(region) as region , min(device_type) as device_type FROM ( SELECT guid , sessionid as session_id , first_value(properties['pageid']) over(partition by guid,sessionid order by ts asc) as first_page_id , first_value(properties['pageid']) over(partition by guid,sessionid order by ts desc) as last_page_id , first_value(country) over(partition by guid,sessionid order by ts asc) as country , first_value(province) over(partition by guid,sessionid order by ts asc) as province , first_value(city) over(partition by guid,sessionid order by ts asc) as city , first_value(region) over(partition by guid,sessionid order by ts asc) as region , first_value(devicetype) over(partition by guid,sessionid order by ts asc) as device_type FROM dwd17.app_action_detail WHERE dt='2020-10-20' AND eventid='pageView' ) o GROUP BY guid,session_id因为内层结果已经按照时间做了排序,并且拿到第一条数据。不过注意内层数据是按照时间和eventid做限制的。 外层就使用min函数,这时候就取到了所需要的数据。注意外层是按照guid和session id做限制的。 2层sql也可以做合并,不过拆分开来,可读性会好一些。
再另外查询,按照guid,sessionid分组聚合,拿到一次会话中最小时间,最大时间,小时时段,是否新用户,页面浏览次数 SELECT guid comment '全局唯一id' , sessionid as session_id comment '会话id' , min(ts) as start_ts comment '开始时间' , hour(from_unixtime(cast(min(ts)/1000 as bigint),'yyyy-MM-dd HH:mm:ss')) as hour_itv comment '小时时间段', max(ts) as end_ts comment '结束时间' , min(isnew) as isnew comment '是否新访客' , count(if(eventid='pageView',1,null)) as pv_cnt comment '页面浏览事件个数,注意count中如果为null则不会计数,所以使用if做判断' FROM dwd17.app_action_detail WHERE dt='2020-10-20' GROUP BY guid,sessionid 将上述2个大步骤的结果做聚合 WITH tmp1 AS ( SELECT guid comment '全局唯一id' , sessionid as session_id comment '会话id' , min(ts) as start_ts comment '开始时间' , hour(from_unixtime(cast(min(ts)/1000 as bigint),'yyyy-MM-dd HH:mm:ss')) as hour_itv comment '小时时间段', max(ts) as end_ts comment '结束时间' , min(isnew) as isnew comment '是否新访客' , count(if(eventid='pageView',1,null)) as pv_cnt comment '页面浏览事件个数,注意count中如果为null则不会计数,所以使用if做判断' FROM dwd.app_action_detail WHERE dt='2020-10-20' GROUP BY guid,sessionid ) ,tmp2 AS ( SELECT guid as guid , session_id as session_id , min(first_page_id) as first_page_id , min(last_page_id) as last_page_id , min(country) as country , min(province) as province , min(city) as city , min(region) as region , min(device_type) as device_type FROM ( SELECT guid , sessionid as session_id , first_value(properties['pageid']) over(partition by guid,sessionid order by ts asc) as first_page_id , first_value(properties['pageid']) over(partition by guid,sessionid order by ts desc) as last_page_id , first_value(country) over(partition by guid,sessionid order by ts asc) as country , first_value(province) over(partition by guid,sessionid order by ts asc) as province , first_value(city) over(partition by guid,sessionid order by ts asc) as city , first_value(region) over(partition by guid,sessionid order by ts asc) as region , first_value(devicetype) over(partition by guid,sessionid order by ts asc) as device_type FROM dwd.app_action_detail WHERE dt='2020-10-20' AND eventid='pageView' ) o GROUP BY guid,session_id ) INSERT INTO TABLE dws.app_action_agr_session PARTITION(dt='2020-10-20') SELECT tmp1.guid , tmp1.session_id , tmp1.start_ts , tmp1.end_ts , tmp2.first_page_id , tmp2.last_page_id , tmp1.pv_cnt , tmp1.isnew , tmp1.hour_itv , tmp2.country , tmp2.province , tmp2.city , tmp2.region , tmp2.device_type FROM tmp1 join tmp2 on tmp1.guid=tmp2.guid and tmp1.session_id=tmp2.session_id ;注意,每个sql语句结束之后,加分号 ; 注意,使用insert into 之后,跟随查询语句,可以将结果按照字段先后顺序依次插入这个表中。 注意,这里使用了事件分区。在大数据处理中,如果数据每天都有更新,并且数据较多。一般都会使用分区表或者分区文件进行存储。特殊一些的场景,有些公司会基于年,月,日,甚至小时做分区存储。 注意,当使用group by的时候,只有group by的单个或者多个字段可以被查询,其它字段就只能使用聚合函数。 注意,窗口函数非常强大,需要熟记常见的几种窗口函数。
从上述会话聚合表可以看出,如果需要分析PV,UV只需要在此基础上,做一次数据统计即可。 如:
访问次数访问深度 访问时长 访问入口页面 访问跳出页面回头访客分析 不同时段流量对比分析 至于基础的PV,UV直接从这个表格以guid对pageview做统计,以及以uid对用户做统计即可。