Hive练习面试题详细步骤及总结

it2023-10-18  81

1.现有这么一批数据,现要求出:每个用户截止到每月为止的最大单月访问次数和累计到该月的总访问次数

三个字段的意思:

用户名uid,月份month,访问次数count A,2015-01,5 A,2015-01,15 B,2015-01,5 A,2015-01,8 B,2015-01,25 A,2015-01,5 A,2015-02,4 A,2015-02,6 B,2015-02,10 B,2015-02,5 A,2015-03,16 A,2015-03,22 B,2015-03,23 B,2015-03,10 B,2015-03,11

最后结果展示

建表与导入表语句

create table if not exists interview01(userId string,month string,count int) row format delimited fields terminated by ",";

load data local inpath "/opt/module/hive/datas/interview01" into table interview01;

#第一步:查询出用户每个月的访问次数,根据用户和月份分组 为表1 select userid,month,sum(count) sumCount from interview01 group by userid,month #第二步:分阶段任务,所以求的是当前月份之前最大以及和 月份<=当前月份,所以#表1与表1连查 取出双份数据 称为表1 表2(与表一一样) select t1.userid userid1,t1.month month1,t1.sumCount sumCount1, t2.userid userid2,t2.month month2,t2.sumCount sumCount2 from ( select userid,month,sum(count) sumCount from interview01 group by userid,month )t1 join ( select userid,month,sum(count) sumCount from interview01 group by userid,month )t2 on t1.userid = t2.userid #第三步:表一的月份大于等于表二的月份,求表二的最大与和,因为求的是至今所以 #不求表一,取根据表一用户,表一月份,表一当月和分组 select userid1,month1,max(sumCount2) max_,sum(sumCount2),sumCount1 sum_ from( select t1.userid userid1,t1.month month1,t1.sumCount sumCount1, t2.userid userid2,t2.month month2,t2.sumCount sumCount2 from ( select userid,month,sum(count) sumCount from interview01 group by userid,month )t1 join ( select userid,month,sum(count) sumCount from interview01 group by userid,month )t2 on t1.userid = t2.userid )a where a.month1>=a.month2 group by a.month1,a.userid1,a.sumCount1 order by a.userid1,a.month1;

第二种方法 distribute by、sort by(也可以使用 cluster by相当于 distribute by 和sort by 的结合,默认只能是升序)

#使用窗口函数求截止到当月的最大访问次数,直接从上表里面查 #distribute by 常与 sort by连用,distribute by控制map输出结果的分发,相同字段的map输出会发到一个reduce节点去处理 select userid,month, max(counts) over(distribute by userid sort by month)maxcount, sum(counts) over(distribute by userid sort by month)sumcount, counts from ( select userid,month,sum(count)counts from interview01 group by userid,month )tmp;

2.求:所有数学课程成绩 大于 语文课程成绩的学生的学号

创表语句以及增加数据

CREATE TABLE course (  id int(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,  sid int(11) DEFAULT NULL,  course varchar(255) DEFAULT NULL,  score int(11) DEFAULT NULL )

INSERT INTO course VALUES (1, 1, yuwen, 43);INSERT INTO course VALUES (2, 1,  shuxue , 55);INSERT INTO course VALUES (3, 2,  yuwen , 77);INSERT INTO course VALUES (4, 2,  shuxue , 88);INSERT INTO course VALUES (5, 3,  yuwen , 98);INSERT INTO course VALUES (6, 3,  shuxue , 65);

数据及效果图

解题思路

select * from ( #求出数学的成绩 select sid,course,score from course where course = 'shuxue' ) a1 join ( #求出语文的成绩 select sid,course,score from course where course = 'yuwen' )a2 #两表join,同一个学生且表1的成绩大于表2的成绩 where a1.sid = a2.sid and a1.score > a2.score

第二种方法,为空缺字段补值

#第一步 为空缺的字段补值 值为0 selectsid, case course when "yuwen" then score else 0 end yuwen, case course when "shuxue" then score else 0 end shuxue from course #第二步,求出根据sid分组时 语文数学成绩的和或者最大 selectsid,sum(yuwen) yuwen,max(shuxue) shuxue from( selectsid, case course when "yuwen" then score else 0 end yuwen, case course when "shuxue" then score else 0 end shuxue from course ) a group by sid #最终数学大于语文的 selectsid,shuxue,yuwen from( selectsid,sum(yuwen) yuwen,max(shuxue) shuxue from( selectsid, case course when "yuwen" then score else 0 end yuwen, case course when "shuxue" then score else 0 end shuxue from course ) a group by sid ) b where b.shuxue>b.yuwen;

3.比如:2010012325表示在2010年01月23日的气温为25度。现在要求使用hive, 计算每一年出现过的最大气温的日期+温度。

数据

2014010114 2014010216 2014010317 2014010410 2014010506 2012010609 2012010732 2012010812 2012010919 2012011023 2001010116 2001010212 2001010310 2001010411 2001010529 2013010619 2013010722 2013010812 2013010929 2013011023 2008010105 2008010216 2008010337 2008010414 2008010516 2007010619 2007010712 2007010812 2007010999 2007011023 2010010114 2010010216 2010010317 2010010410 2010010506 2015010649 2015010722 2015010812 2015010999 2015011023

创表以及导入数据语句

create table if not exists interview03(data string);

load data local inpath "/opt/module/hive/datas/interview03" into table interview03;

效果图

分析 substr() cast()每一年出现过的最大气温的日期+温度。根据年份分组,求出最大气温 

每一年出现过的最大气温的日期+温度。根据年份分组,求出最大气温 #substr(str,start,lengh)函数 切割哪个字段,从哪个位置开始,取几个 #substr(data,1,4)年份 substr(data,9,2)温度 #cast(xx as yy) 类型转换,转为int类型 select substr(data,1,4) year,max(cast(substr(data,9,2) as int)) max_temp from interview03 group by substr(data,1,4); #加入日期连查 年份相同 温度与最大相同 取出 select year,cast(substr(data,9,2) as int) temp,data day from interview03 i1 join ( select substr(data,1,4) year,max(cast(substr(data,9,2) as int)) max_temp from interview03 group by substr(data,1,4) ) i2 on(cast(substr(data,9,2) as int) = i2.max_temp and substr(data,1,4) = i2.year);

4.现有一份以下格式的数据:表示有id为1,2,3的学生选修了课程a,b,c,d,e,f中其中几门

数据

id course  1,a  1,b  1,c  1,e  2,a  2,c  2,d  2,f  3,a  3,b  3,c  3,e

效果图

创表与导入数据

create table if not exists interview04(id int,course string) row format delimited fields terminated by "," ;  load data local inpath "/opt/module/hive-1.1.0-cdh5.14.0/datas/interview04" into table interview04;

步骤 collect_set(trim())  与sort_array()与 array_contains(str,str) 与 case when then else end

collect_set(trim()) 列转行函数,而且进行去重 类似于mysql中的group_caoncat() sort_array() 进行排序 因为collect_set是乱序数据 array_contains(str1,str2) 返回值为布尔类型,如果str1中有str2返回true case when then else end 判断 when后条件满足 走then 不满足走else end结束 #第一步查询所有的课程 select collect_set(trim(course)) id_course from interview04 #第二步求出每个学生选修的课程 select id,collect_set(trim(course)) id_course from interview04 group by id; #第三步 连查求出学生选修的课程 select id, case when array_contains(id_course,courses[0]) then 1 else 0 end a, case when array_contains(id_course,courses[1]) then 1 else 0 end b, case when array_contains(id_course,courses[2]) then 1 else 0 end c, case when array_contains(id_course,courses[3]) then 1 else 0 end d, case when array_contains(id_course,courses[4]) then 1 else 0 end e, case when array_contains(id_course,courses[5]) then 1 else 0 end f from( select b.id id,b.id_course id_course,a.courses courses from( select sort_array(collect_set(trim(course))) courses from interview04 ) a join( select id, sort_array(collect_set(trim(course))) id_course from interview04 group by id ) b ) c

5.现有如下格式的一份数据,需求:编写Hive的HQL语句求出每个店铺的当月销售额和累计到当月的总销售额

数据

a,01,150 a,01,200 b,01,1000 b,01,800 c,01,250 c,01,220 b,01,6000 a,02,2000 a,02,3000 b,02,1000 b,02,1500 c,02,350 c,02,280 a,03,350 a,03,250

创建表以及导数据

create table if not exists interview05(uid string,month string,sale int) row format delimited fields terminated by ","; load data local inpath "/opt/module/hive-1.1.0-cdh5.14.0/datas/interview05" into table interview05;

思路 sum() over(distribute by  sort by )

与第一题类似,甚至还要简单些,所以我们直接使用函数处理

#第一步 每个月份的当月销售额 select uid,month,sum(sale) sum_sale from interview05 group by uid,month; #第二步使用函数求出累积和 根据uid分区,月份排序 select uid, month, sum(sum_sale) over(distribute by uid sort by month) sum_month_sale, sum_sale from( select uid,month,sum(sale) sum_sale from interview05 group by uid,month )t;

 

最新回复(0)