Advanced Hive Applications (Part 3)


Contents

15. Data Backup in Hive
16. Common Hive Queries
17. Table Joins in Hive
18. Hive's "By" Clauses Explained: (1) order by (2) sort by (3) distribute by (4) cluster by
19. Hive UDF Programming
20. Connecting to Hive via HiveServer2/Beeline
21. Integrating Hive with JDBC

15. Data Backup in Hive

1. Export data with EXPORT:

EXPORT TABLE db_hive.order TO '/user/caizhengjie/datas/export/order';

2. Import data with IMPORT:

import table order_imp from '/user/caizhengjie/datas/export/order';

-- LOCATION specifies the directory for the imported table's data
IMPORT TABLE order_imp_1 FROM '/user/caizhengjie/datas/export/order'
LOCATION '/user/kfk/datas/imp/order';
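EXPORT and IMPORT also work at partition granularity. A minimal sketch, assuming a hypothetical partitioned table db_hive.order_part (partitioned by order_date; not part of the examples above):

-- export a single partition of a (hypothetical) partitioned table
EXPORT TABLE db_hive.order_part PARTITION (order_date='2018-10-09')
TO '/user/caizhengjie/datas/export/order_part';

-- re-import it under a new table name
IMPORT TABLE order_part_imp FROM '/user/caizhengjie/datas/export/order_part';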

16. Common Hive Queries

Prepare the data:

/opt/datas/hive/order.txt

0001,cenry,2018-10-09,product-1,350,guangzhou
0002,beny,2018-09-09,product-2,180,beijing
0003,ben,2018-09-09,product-3,580,beijing
0004,cherry,2018-10-09,product-4,450,shenzheng
0005,jone,2018-10-09,product-5,30,nanjing
0006,lili,2018-09-09,product-6,50,hangzhou
0007,chen,2018-10-09,product-7,90,wuhan
0008,wiwi,2018-09-09,product-8,150,chengdu

Create the table:

create table order(
    userid string,
    username string,
    order_date string,
    product_name string,
    price int,
    city string
)
row format delimited fields terminated by ','
stored as textfile;
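The load step for this table is not shown in the original; following the same pattern used later for the order1 and customer tables, it would presumably be:

load data local inpath '/opt/datas/hive/order.txt' into table order;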

1. Selecting columns

select userid,username from order;

userid  username
0001    cenry
0002    beny
0003    ben
0004    cherry
0005    jone
0006    lili
0007    chen
0008    wiwi

2. WHERE queries

select * from order where price > 200;

order.userid  order.username  order.order_date  order.product_name  order.price  order.city
0001  cenry   2018-10-09  product-1  350  guangzhou
0003  ben     2018-09-09  product-3  580  beijing
0004  cherry  2018-10-09  product-4  450  shenzheng

3. LIMIT queries

select * from order limit 3;

order.userid  order.username  order.order_date  order.product_name  order.price  order.city
0001  cenry  2018-10-09  product-1  350  guangzhou
0002  beny   2018-09-09  product-2  180  beijing
0003  ben    2018-09-09  product-3  580  beijing

4. DISTINCT (deduplication)

select distinct city from order;

city
beijing
chengdu
guangzhou
hangzhou
nanjing
shenzheng
wuhan

5. Aggregates: max / min / count / sum

select max(price) from order;

_c0
580
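min, count, and sum work the same way; a quick sketch, with the expected values computed by hand from the sample rows above:

-- expected result for the sample data: 30, 8, 1880
select min(price), count(*), sum(price) from order;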

6. GROUP BY / HAVING

select sum(price) price,city from order group by city having price > 200;

price  city
760    beijing
350    guangzhou
450    shenzheng
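For contrast, WHERE filters individual rows before aggregation, while HAVING filters the aggregated groups. A sketch of the row-level variant:

-- rows with price <= 100 are dropped before the per-city sums are computed
select sum(price) price, city from order where price > 100 group by city;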

17. Table Joins in Hive

Prepare the data. order1.txt:

1,0001,2018-09-09,product-1,180
2,0001,2018-09-09,product-2,240
3,0001,2018-09-09,product-3,350
4,0002,2018-09-09,product-4,110
5,0002,2018-09-09,product-5,230
6,0003,2018-09-09,product-6,245
7,0004,2018-09-09,product-7,965
8,0004,2018-09-09,product-8,741
9,0008,2018-09-09,product-8,690
10,0009,2018-09-09,product-9,120
11,0009,2018-09-09,product-9,120

customer.txt:

0001,jack,17425845325,guangzhou
0002,beny,15451225965,shenzhen
0003,jone,13754859326,nanjing
0004,cherry,13785415255,suzhou
0005,alex,15745869325,beijing
0006,aili,13245632581,xiamen

Create the order1 table:

create table if not exists db_hive.order1(
    orderid string,
    userid string,
    order_date string,
    product_name string,
    price string
)
row format delimited fields terminated by ','
stored as textfile;

Create the customer table:

create table if not exists db_hive.customer(
    userid string,
    username string,
    telephone string,
    city string
)
row format delimited fields terminated by ','
stored as textfile;

Load the data:

load data local inpath '/opt/datas/hive/order1.txt' into table order1;
load data local inpath '/opt/datas/hive/customer.txt' into table customer;

1. Equi-join (inner join)

select * from customer t1,order1 t2 where t1.userid = t2.userid;

t1.userid  t1.username  t1.telephone  t1.city    t2.orderid  t2.userid  t2.order_date  t2.product_name  t2.price
0001  jack    17425845325  guangzhou  1  0001  2018-09-09  product-1  180
0001  jack    17425845325  guangzhou  2  0001  2018-09-09  product-2  240
0001  jack    17425845325  guangzhou  3  0001  2018-09-09  product-3  350
0002  beny    15451225965  shenzhen   4  0002  2018-09-09  product-4  110
0002  beny    15451225965  shenzhen   5  0002  2018-09-09  product-5  230
0003  jone    13754859326  nanjing    6  0003  2018-09-09  product-6  245
0004  cherry  13785415255  suzhou     7  0004  2018-09-09  product-7  965
0004  cherry  13785415255  suzhou     8  0004  2018-09-09  product-8  741

2. Left join

select t1.username,t2.product_name from customer t1 left join order1 t2 on t1.userid = t2.userid;

t1.username  t2.product_name
jack    product-1
jack    product-2
jack    product-3
beny    product-4
beny    product-5
jone    product-6
cherry  product-7
cherry  product-8
alex    NULL
aili    NULL

select t2.username,t1.product_name from order1 t1 left join customer t2 on t1.userid = t2.userid;

t2.username  t1.product_name
jack    product-1
jack    product-2
jack    product-3
beny    product-4
beny    product-5
jone    product-6
cherry  product-7
cherry  product-8
NULL    product-8
NULL    product-9
NULL    product-9

3. Right join

select t1.username,t2.product_name from customer t1 right join order1 t2 on t1.userid = t2.userid;

t1.username  t2.product_name
jack    product-1
jack    product-2
jack    product-3
beny    product-4
beny    product-5
jone    product-6
cherry  product-7
cherry  product-8
NULL    product-8
NULL    product-9
NULL    product-9

select t2.username,t1.product_name from order1 t1 right join customer t2 on t1.userid = t2.userid;

t2.username  t1.product_name
jack    product-1
jack    product-2
jack    product-3
beny    product-4
beny    product-5
jone    product-6
cherry  product-7
cherry  product-8
alex    NULL
aili    NULL

4. Full (outer) join

select t2.username,t1.product_name from order1 t1 full join customer t2 on t1.userid = t2.userid;

t2.username  t1.product_name
jack    product-3
jack    product-2
jack    product-1
beny    product-5
beny    product-4
jone    product-6
cherry  product-8
cherry  product-7
alex    NULL
aili    NULL
NULL    product-8
NULL    product-9
NULL    product-9
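A variant the examples above don't cover is LEFT SEMI JOIN, Hive's idiom for IN/EXISTS-style filtering; a sketch against the same tables:

-- returns each matching customer at most once, keeping only those with at least one order
select t1.username from customer t1 left semi join order1 t2 on (t1.userid = t2.userid);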

18. Hive's "By" Clauses Explained

order by / sort by / distribute by / cluster by. Prepare the data:

order.userid  order.username  order.order_date  order.product_name  order.price  order.city
0001  cenry   2018-10-09  product-1  350  guangzhou
0002  beny    2018-09-09  product-2  180  beijing
0003  ben     2018-09-09  product-3  580  beijing
0004  cherry  2018-10-09  product-4  450  shenzheng
0005  jone    2018-10-09  product-5  30   nanjing
0006  lili    2018-09-09  product-6  50   hangzhou
0007  chen    2018-10-09  product-7  90   wuhan
0008  wiwi    2018-09-09  product-8  150  chengdu

(1) order by

order by (runs with only one reducer; produces a globally ordered result)

select * from order order by price desc;
select * from order order by price asc;

The descending query returns:

order.userid  order.username  order.order_date  order.product_name  order.price  order.city
0003  ben     2018-09-09  product-3  580  beijing
0004  cherry  2018-10-09  product-4  450  shenzheng
0001  cenry   2018-10-09  product-1  350  guangzhou
0002  beny    2018-09-09  product-2  180  beijing
0008  wiwi    2018-09-09  product-8  150  chengdu
0007  chen    2018-10-09  product-7  90   wuhan
0006  lili    2018-09-09  product-6  50   hangzhou
0005  jone    2018-10-09  product-5  30   nanjing

Set the number of reducers, here to 3:

set mapreduce.job.reduces=3;

However, this has no effect on order by: the job still runs with a single reducer, since a global sort requires all rows to pass through one reducer.

(2) sort by

sort by sorts within each reducer, so the output is only locally ordered, and multiple reducers can be used. To see the effect, it helps to write the results to the local filesystem and raise the reducer count (3 in this example).

sort by (sorts the data within each reducer; the final combined result set is not globally ordered). Set the number of reducers, here to 3:

set mapreduce.job.reduces=3;

The following writes the query results to a local directory:

insert overwrite local directory '/opt/datas/hive/sort'
row format delimited fields terminated by ','
select * from order sort by price desc;

[caizhengjie@bigdata-pro-m01 sort]$ ls
000000_0  000001_0  000002_0
[caizhengjie@bigdata-pro-m01 sort]$ cat 000000_0
0008,wiwi,2018-09-09,product-8,150,chengdu
0007,chen,2018-10-09,product-7,90,wuhan
0006,lili,2018-09-09,product-6,50,hangzhou
0005,jone,2018-10-09,product-5,30,nanjing
[caizhengjie@bigdata-pro-m01 sort]$ cat 000001_0
0003,ben,2018-09-09,product-3,580,beijing
0004,cherry,2018-10-09,product-4,450,shenzheng
0002,beny,2018-09-09,product-2,180,beijing
[caizhengjie@bigdata-pro-m01 sort]$ cat 000002_0
0001,cenry,2018-10-09,product-1,350,guangzhou

(3) distribute by

distribute by specifies how rows are partitioned across reducers (by default, a hash of the distribute-by column determines the reducer). It is usually used together with sort by, and it must appear before sort by in the query. Think of it as: partition by column X, then sort by column Y within each partition.

distribute by (behaves like a partitioner; generally used together with sort by). Set the number of reducers, here to 3:

set mapreduce.job.reduces=3;

insert overwrite local directory '/opt/datas/hive/sort'
row format delimited fields terminated by ','
select * from order distribute by city sort by price desc;

[caizhengjie@bigdata-pro-m01 sort]$ ls
000000_0  000001_0  000002_0
[caizhengjie@bigdata-pro-m01 sort]$ cat 000000_0
0004,cherry,2018-10-09,product-4,450,shenzheng
0001,cenry,2018-10-09,product-1,350,guangzhou
[caizhengjie@bigdata-pro-m01 sort]$ cat 000001_0
0003,ben,2018-09-09,product-3,580,beijing
0002,beny,2018-09-09,product-2,180,beijing
0007,chen,2018-10-09,product-7,90,wuhan
0006,lili,2018-09-09,product-6,50,hangzhou
0005,jone,2018-10-09,product-5,30,nanjing
[caizhengjie@bigdata-pro-m01 sort]$ cat 000002_0
0008,wiwi,2018-09-09,product-8,150,chengdu

(4) cluster by

When distribute by and sort by reference the same column, cluster by can be used instead. Note: cluster by sorts in ascending order only; asc and desc cannot be specified.

Set the number of reducers, here to 3:

set mapreduce.job.reduces=3;

cluster by (replaces distribute by + sort by when both use the same column):

insert overwrite local directory '/opt/datas/hive/sort'
row format delimited fields terminated by ','
select * from order distribute by username sort by username;

[caizhengjie@bigdata-pro-m01 sort]$ cat 000000_0
0003,ben,2018-09-09,product-3,580,beijing
0007,chen,2018-10-09,product-7,90,wuhan
0004,cherry,2018-10-09,product-4,450,shenzheng
0006,lili,2018-09-09,product-6,50,hangzhou

This is equivalent to:

insert overwrite local directory '/opt/datas/hive/sort'
row format delimited fields terminated by ','
select * from order cluster by username;

[caizhengjie@bigdata-pro-m01 sort]$ cat 000000_0
0003,ben,2018-09-09,product-3,580,beijing
0007,chen,2018-10-09,product-7,90,wuhan
0004,cherry,2018-10-09,product-4,450,shenzheng
0006,lili,2018-09-09,product-6,50,hangzhou

19. Hive UDF Programming

1. Write a custom UDF. Add the following to pom.xml:

<properties>
    <hive.version>0.13.1</hive.version>
</properties>

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>${hive.version}</version>
</dependency>

Write the UDF code:

package com.kfk.hive;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/**
 * @author : 蔡政洁
 * @email : caizhengjie888@icloud.com
 * @date : 2020/10/23
 * @time : 8:31 PM
 */
public class KfkUDF extends UDF {

    // converts the input string to upper case; returns null for null input
    public Text evaluate(final Text s) {
        if (s == null) {
            return null;
        }
        return new Text(s.toString().toUpperCase());
    }

    public static void main(String[] args) {
        System.out.println(new KfkUDF().evaluate(new Text("spark")));
    }
}

2. Build the jar
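With the Maven setup above, a standard package build produces the jar (the /opt/jars/bigdata_study_udf.jar path used in the next step is where it was copied; the exact artifact name depends on the project configuration):

mvn clean package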

3. Add the jar in Hive

hive (db_hive)> add jar /opt/jars/bigdata_study_udf.jar;
hive (db_hive)> list jar;

4. Create the function

create temporary function kfk_udf as 'com.kfk.hive.KfkUDF';
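A temporary function disappears when the session ends. Since Hive 0.13 a permanent function can be created instead; a sketch, assuming the jar has been uploaded to HDFS (the path below is hypothetical):

create function kfk_udf as 'com.kfk.hive.KfkUDF'
using jar 'hdfs:///user/caizhengjie/jars/bigdata_study_udf.jar';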

5. Use the custom function

select kfk_udf(city) city from order;

city
GUANGZHOU
BEIJING
BEIJING
SHENZHENG
NANJING
HANGZHOU
WUHAN
CHENGDU
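When testing is done, the temporary function can be dropped without restarting the session:

drop temporary function if exists kfk_udf;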

For more on Hive UDFs, see the official documentation: https://cwiki.apache.org/confluence/display/Hive/HivePlugins

20. Connecting to Hive via HiveServer2/Beeline

1. Start the metastore:

bin/hive --service metastore

2. Start HiveServer2:

bin/hiveserver2

3. Start the Beeline client:

bin/beeline

4. Connect to Hive:

% bin/beeline
beeline> !connect jdbc:hive2://<host>:<port>/<db>;auth=noSasl hiveuser pass

The actual connection used here:

beeline> !connect jdbc:hive2://bigdata-pro-m01:10000 caizhengjie 199911
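The same connection can also be made non-interactively by passing the URL and credentials as command-line flags:

bin/beeline -u jdbc:hive2://bigdata-pro-m01:10000 -n caizhengjie -p 199911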

5. Query the data

1: jdbc:hive2://bigdata-pro-m01:10000> select * from order;
+---------------+-----------------+-------------------+---------------------+--------------+-------------+--+
| order.userid  | order.username  | order.order_date  | order.product_name  | order.price  | order.city  |
+---------------+-----------------+-------------------+---------------------+--------------+-------------+--+
| 0001          | cenry           | 2018-10-09        | product-1           | 350          | guangzhou   |
| 0002          | beny            | 2018-09-09        | product-2           | 180          | beijing     |
| 0003          | ben             | 2018-09-09        | product-3           | 580          | beijing     |
| 0004          | cherry          | 2018-10-09        | product-4           | 450          | shenzheng   |
| 0005          | jone            | 2018-10-09        | product-5           | 30           | nanjing     |
| 0006          | lili            | 2018-09-09        | product-6           | 50           | hangzhou    |
| 0007          | chen            | 2018-10-09        | product-7           | 90           | wuhan       |
| 0008          | wiwi            | 2018-09-09        | product-8           | 150          | chengdu     |
+---------------+-----------------+-------------------+---------------------+--------------+-------------+--+

As the output shows, Beeline renders results in a more readable tabular format than the native Hive CLI does. For more on Beeline, see the official documentation: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=82903124#HiveServer2Clients-table

21. Integrating Hive with JDBC

1. Add the pom.xml configuration:

<properties>
    <hive.version>0.13.1</hive.version>
</properties>

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>${hive.version}</version>
</dependency>

2. Write the JDBC code:

package com.kfk.hive;

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

/**
 * @author : 蔡政洁
 * @email : caizhengjie888@icloud.com
 * @date : 2020/10/23
 * @time : 9:53 PM
 */
public class HiveJDBC {

    private static String driverName = "org.apache.hive.jdbc.HiveDriver";

    /**
     * @param args
     * @throws SQLException
     */
    public static void main(String[] args) throws SQLException {
        try {
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            System.exit(1);
        }

        // replace the user name here with the user the queries should run as
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://bigdata-pro-m01:10000/db_hive", "caizhengjie", "199911");
        Statement stmt = con.createStatement();
        String tableName = "order";

        // select * query
        String sql = "select * from " + tableName;
        System.out.println("Running: " + sql);
        ResultSet res = stmt.executeQuery(sql);
        while (res.next()) {
            // print the first two columns (userid, username)
            System.out.println(res.getString(1) + "\t" + res.getString(2));
        }

        // release JDBC resources
        res.close();
        stmt.close();
        con.close();
    }
}

Output:

Running: select * from order
0001	cenry
0002	beny
0003	ben
0004	cherry
0005	jone
0006	lili
0007	chen
0008	wiwi

For more on integrating Hive with JDBC, see the official documentation: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=82903124#HiveServer2Clients-UsingJDBC


