SparkSQL 字符串函数
 
1.样例类的定义
 
case class Student
(name
:String,subject
:String,score
:String)
 
2.测试数据展示
 
val spark
: SparkSession 
= SparkSession
.builder
().appName
("test").master
("local[*]").getOrCreate
()
import spark
.implicits
._
val df
= spark
.sparkContext
.textFile
("file:///E:\\MyEclipse\\java\\myfactory\\data\\data.txt").map
(line 
=>{
   val arr
: Array
[String] = line
.split
(",")
   Student
(arr
(0),arr
(1),arr
(2))
   }).toDF
()
    
 df
.show
(false)
 spark
.close
()
 
+--------+-------+-----+
|name    |subject|score|
+--------+-------+-----+
|zhangsan|语文     |90   |
|zhangsan|数学     |85   |
|zhangsan|英语     |85   |
|zhangsan|物理     |95   |
|lisi    |语文     |90   |
|lisi    |化学     |100  |
|wangwu  |物理     |95   |
|wangwu  |化学     |95   |
+--------+-------+-----+
 
3.collect_set和collect_list
 
将同组的的字段合并成集合,collect_list(不去重),collect_set(去重)
 
val df1
: DataFrame 
= df
.groupBy
("name").agg
(collect_list
("subject").alias
("subjects"),collect_set
($
"score").alias
("scores"))
  .select
("name", "subjects","scores")
    
df1
.show
(false)
 
+--------+----------------+------------+
|name    |subjects        |scores      |
+--------+----------------+------------+
|wangwu  |[物理, 化学]        |[95]        |
|zhangsan|[语文, 数学, 英语, 物理]|[95, 85, 90]|
|lisi    |[语文, 化学]        |[100, 90]   |
+--------+----------------+------------+
 
4.concat_ws
 
作用1:将字段值为集合的值,按照指定符号进行拼接(集合转字符串)
 
val df2
: DataFrame 
= df1
.withColumn
("all_subjects", concat_ws
("-", $
"subjects"))
  .withColumn
("all_scores",concat_ws
("-",$
"scores"))
df2
.show
(false)
 
+--------+----------------+------------+------------+----------+
|name    |subjects        |scores      |all_subjects|all_scores|
+--------+----------------+------------+------------+----------+
|wangwu  |[物理, 化学]        |[95]        |物理-化学       |95        |
|zhangsan|[语文, 数学, 英语, 物理]|[95, 85, 90]|语文-数学-英语-物理 |95-85-90  |
|lisi    |[语文, 化学]        |[100, 90]   |语文-化学       |100-90    |
+--------+----------------+------------+------------+----------+
 
作用2:将字段按照指定的字符进行拼接(字符串拼接)
 
val test
: DataFrame 
= df
.withColumn
("test", concat_ws
("-", $
"subject", $
"score"))
test
.show
(false)
 
+--------+-------+-----+------+
|    name|subject|score|  test|
+--------+-------+-----+------+
|zhangsan|     语文|   90| 语文-90|
|zhangsan|     数学|   85| 数学-85|
|zhangsan|     英语|   85| 英语-85|
|zhangsan|     物理|   95| 物理-95|
|    lisi|     语文|   90| 语文-90|
|    lisi|     化学|  100|化学-100|
|  wangwu|     物理|   95| 物理-95|
|  wangwu|     化学|   95| 化学-95|
+--------+-------+-----+------+
 
5.concat
 
将多个字段进行拼接
 
val df3
: DataFrame 
= df
.withColumn
("info", concat
($
"name",$
"subject", $
"score"))
df3
.show
(false)
 
+--------+-------+-----+------------+
|name    |subject|score|info        |
+--------+-------+-----+------------+
|zhangsan|语文     |90   |zhangsan语文90|
|zhangsan|数学     |85   |zhangsan数学85|
|zhangsan|英语     |85   |zhangsan英语85|
|zhangsan|物理     |95   |zhangsan物理95|
|lisi    |语文     |90   |lisi语文90    |
|lisi    |化学     |100  |lisi化学100   |
|wangwu  |物理     |95   |wangwu物理95  |
|wangwu  |化学     |95   |wangwu化学95  |
+--------+-------+-----+------------+