这是一份来自于某在线考试系统的学员答题批改日志,日志中记录了日志生成时间,题目难度系数,题目所属的知识点ID,做题的学生ID,题目ID以及作答批改结果。日志的结构如下:
链接:answer_question.log 提取码:47k1
请在HDFS 中创建目录/app/data/exam,并将answer_question.log 传到该目录
①提取日志中的知识点ID,学生ID,题目ID,作答结果4 个字段的值
结果如下: 注: 要去除 题目字符串后缀中的 r ,除了用 substring 算子修剪掉,也可以用 replace(" r " ,"") ,将 r 替换为空字符
②将提取后的 知识点ID ,学生ID,题目ID,作答结果 ,字段的值以文件的形式保存到HDFS的 /app/data/result 目录下。一行保留一条数据,字段间以“\t”分割,文件格式如下所示。(提示:元组可使用tuple.productIterator.mkString("\t")组合字符串)
344344818195023659599101803443442581950236595997385134434457819502365959673461344344988195023659597667203443444981950236595944809134434489819502365959679980.53443449281950236595959406034434485819502365959787101在HBase中创建命名空间(namespace)exam,在该命名空间下创建analysis表,使用学生ID作为RowKey,该表下有2 个列族accuracy、question
请在Hive中创建数据库 exam ,在该数据库中创建外部表 ex_exam_record 指向 /app/data/result 下Spark处理后的日志数据;创建外部表 ex_exam_anlysis 映射至HBase中的 analysis 表的 accuracy 列族;创建外部表 ex_exam_question 映射至HBase中的 analysis 表的 question 列族
ex_exam_record 表结构如下:
字段名称字段类型字段含义topic_idstring知识点IDstudent_idstring学生IDquestion_idstring题目IDscorefloat作答结果ex_exam_anlysis表结构如下:
字段名称字段类型字段含义student_idstring学生IDtotal_scorefloat总分question_countint答题的试题数accuracyfloat正确率ex_exam_question 表结构如下:
字段名称字段类型字段含义student_idstring学生IDrightstring所有作对的题目的ID 列表halfstring所有半对的题目的ID 列表errorstring所有做错的题目的ID 列表①题目id 以逗号分割,并保存到ex_exam_question 表中
步骤:
1.查询一个学生做 对 的所有题目: select student_id,score,concat_ws(",",collect_set(question_id)) right from ex_exam_record where score = 1 group by student_id,score;部分截图如下:
2.查询一个学生做 半对 的所有题目: select student_id,score,concat_ws(",",collect_set(question_id)) half from ex_exam_record where score = 0.5 group by student_id,score; 3.查询一个学生做 错 的所有题目: select student_id,score,concat_ws(",",collect_set(question_id)) error from ex_exam_record where score = 0 group by student_id,score; 4.联表查询 select t1.student_id,t1.right,t2.half,t3.error from (select student_id,score,concat_ws(",",collect_set(question_id)) right from ex_exam_record where score = 1 group by student_id,score) t1 join (select student_id,score,concat_ws(",",collect_set(question_id)) half from ex_exam_record where score = 0.5 group by student_id,score) t2 on t1.student_id = t2.student_id join (select student_id,score,concat_ws(",",collect_set(question_id)) error from ex_exam_record where score = 0 group by student_id,score) t3 on t1.student_id = t3.student_id;部分截图如下:
5.将查询结果插入到表 ex_exam_question 中 insert into ex_exam_question select t1.student_id,t1.right,t2.half,t3.error from (select student_id,score,concat_ws(",",collect_set(question_id)) right from ex_exam_record where score = 1 group by student_id,score) t1 join (select student_id,score,concat_ws(",",collect_set(question_id)) half from ex_exam_record where score = 0.5 group by student_id,score) t2 on t1.student_id = t2.student_id join (select student_id,score,concat_ws(",",collect_set(question_id)) error from ex_exam_record where score = 0 group by student_id,score) t3 on t1.student_id = t3.student_id;注: 上述代码也可以简写成这样:
with t as (select student_id, score, concat_ws(",",collect_set(question_id)) as question_id from ex_exam_record group by student_id,score), a1 as (select student_id, question_id from t where score=1), a2 as (select student_id, question_id from t where score=0.5), a3 as (select student_id, question_id from t where score=0) insert into ex_exam_question select a1.student_id, a1.question_id right, a2.question_id half, a3.question_id error from a1 join a2 on a1.student_id = a2.student_id join a3 on a1.student_id = a3.student_id ;也可以换一种思路,用 case when 简写成这样:
insert into ex_exam_question select student_id, concat_ws('',collect_list(t.right)) right, concat_ws('',collect_list(t.half)) half, concat_ws('',collect_list(t.error)) error from ( select student_id, case when score=1.0 then concat_ws(",",collect_list(question_id)) else null end right, case when score=0.5 then concat_ws(",",collect_list(question_id)) else null end half, case when score=0.0 then concat_ws(",",collect_list(question_id)) else null end error from ex_exam_record group by student_id,score) t group by student_id;②完成统计后,在HBase Shell 中遍历 exam:analysis 表并只显示 question 列族中的数据,如下图所示: