SparkSQL 字符串函数
1.样例类的定义
case class Student
(name
:String,subject
:String,score
:String)
2.测试数据展示
val spark
: SparkSession
= SparkSession
.builder
().appName
("test").master
("local[*]").getOrCreate
()
import spark
.implicits
._
val df
= spark
.sparkContext
.textFile
("file:///E:\\MyEclipse\\java\\myfactory\\data\\data.txt").map
(line
=>{
val arr
: Array
[String] = line
.split
(",")
Student
(arr
(0),arr
(1),arr
(2))
}).toDF
()
df
.show
(false)
spark
.close
()
+--------+-------+-----+
|name |subject|score|
+--------+-------+-----+
|zhangsan|语文 |90 |
|zhangsan|数学 |85 |
|zhangsan|英语 |85 |
|zhangsan|物理 |95 |
|lisi |语文 |90 |
|lisi |化学 |100 |
|wangwu |物理 |95 |
|wangwu |化学 |95 |
+--------+-------+-----+
3.collect_set和collect_list
将同组的的字段合并成集合,collect_list(不去重),collect_set(去重)
val df1
: DataFrame
= df
.groupBy
("name").agg
(collect_list
("subject").alias
("subjects"),collect_set
($
"score").alias
("scores"))
.select
("name", "subjects","scores")
df1
.show
(false)
+--------+----------------+------------+
|name |subjects |scores |
+--------+----------------+------------+
|wangwu |[物理, 化学] |[95] |
|zhangsan|[语文, 数学, 英语, 物理]|[95, 85, 90]|
|lisi |[语文, 化学] |[100, 90] |
+--------+----------------+------------+
4.concat_ws
作用1:将字段值为集合的值,按照指定符号进行拼接(集合转字符串)
val df2
: DataFrame
= df1
.withColumn
("all_subjects", concat_ws
("-", $
"subjects"))
.withColumn
("all_scores",concat_ws
("-",$
"scores"))
df2
.show
(false)
+--------+----------------+------------+------------+----------+
|name |subjects |scores |all_subjects|all_scores|
+--------+----------------+------------+------------+----------+
|wangwu |[物理, 化学] |[95] |物理-化学 |95 |
|zhangsan|[语文, 数学, 英语, 物理]|[95, 85, 90]|语文-数学-英语-物理 |95-85-90 |
|lisi |[语文, 化学] |[100, 90] |语文-化学 |100-90 |
+--------+----------------+------------+------------+----------+
作用2:将字段按照指定的字符进行拼接(字符串拼接)
val test
: DataFrame
= df
.withColumn
("test", concat_ws
("-", $
"subject", $
"score"))
test
.show
(false)
+--------+-------+-----+------+
| name|subject|score| test|
+--------+-------+-----+------+
|zhangsan| 语文| 90| 语文-90|
|zhangsan| 数学| 85| 数学-85|
|zhangsan| 英语| 85| 英语-85|
|zhangsan| 物理| 95| 物理-95|
| lisi| 语文| 90| 语文-90|
| lisi| 化学| 100|化学-100|
| wangwu| 物理| 95| 物理-95|
| wangwu| 化学| 95| 化学-95|
+--------+-------+-----+------+
5.concat
将多个字段进行拼接
val df3
: DataFrame
= df
.withColumn
("info", concat
($
"name",$
"subject", $
"score"))
df3
.show
(false)
+--------+-------+-----+------------+
|name |subject|score|info |
+--------+-------+-----+------------+
|zhangsan|语文 |90 |zhangsan语文90|
|zhangsan|数学 |85 |zhangsan数学85|
|zhangsan|英语 |85 |zhangsan英语85|
|zhangsan|物理 |95 |zhangsan物理95|
|lisi |语文 |90 |lisi语文90 |
|lisi |化学 |100 |lisi化学100 |
|wangwu |物理 |95 |wangwu物理95 |
|wangwu |化学 |95 |wangwu化学95 |
+--------+-------+-----+------------+