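Pay attention to variable types: categorical variables (and whether they have been converted to factors) and numeric variables.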
> data(birthwt, package = "MASS")
> str(birthwt)
'data.frame':	189 obs. of  10 variables:
 $ low  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
 $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
 $ race : int  2 3 1 1 1 3 1 3 1 1 ...
 $ smoke: int  0 0 1 1 1 0 0 0 1 1 ...
 $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ht   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ui   : int  1 0 0 1 1 0 0 0 0 0 ...
 $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
 $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
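Several categorical variables in this data set (low, race, smoke, ht, ui) are stored as integers. Use factor() to convert them to factors, and use mutate() to write the change back to the data set. mutate() comes from the dplyr package, so load it first with library(dplyr).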
> birthwt <- mutate(birthwt, low = factor(low, labels = c("no", "yes")))
> birthwt <- mutate(birthwt, race = factor(race, labels = c("white", "black", "other")))
> birthwt <- mutate(birthwt, smoke = factor(smoke, labels = c("no", "yes")))
> birthwt <- mutate(birthwt, ht = factor(ht, labels = c("no", "yes")))
> birthwt <- mutate(birthwt, ui = factor(ui, labels = c("no", "yes")))

Note that the labels passed to factor() are assigned to the levels in ascending order of the underlying values: the smallest value gets the first label, and so on.

Use str() to check that the conversion succeeded.
> str(birthwt)
'data.frame':	189 obs. of  10 variables:
 $ low  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
 $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
 $ race : Factor w/ 3 levels "white","black",..: 2 3 1 1 1 3 1 3 1 1 ...
 $ smoke: Factor w/ 2 levels "no","yes": 1 1 2 2 2 1 1 1 2 2 ...
 $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ht   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ ui   : Factor w/ 2 levels "no","yes": 2 1 1 2 2 1 1 1 1 1 ...
 $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
 $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...

As the output shows, low, race, smoke, ht and ui are now factors.
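To see how factor() pairs labels with values, here is a minimal sketch on a toy vector. The vector `codes` and the names `race` and `race_explicit` are made up for illustration and only mimic how race is coded in birthwt. It also shows the option of passing levels explicitly, which is an added suggestion rather than something the steps above rely on.

```r
# Toy integer codes, mimicking how race is stored in birthwt (1, 2, 3).
codes <- c(2L, 3L, 1L, 1L)

# Without levels, factor() sorts the unique values (1, 2, 3)
# and pairs them with the labels in that order.
race <- factor(codes, labels = c("white", "black", "other"))
print(race)

# Passing levels explicitly makes the mapping visible in the code
# and still works if some value never occurs in the data.
race_explicit <- factor(codes, levels = 1:3,
                        labels = c("white", "black", "other"))
print(identical(race, race_explicit))
```

If a code were missing from the data, the version without levels would either fail or attach labels to the wrong values, so the explicit form is the safer habit.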
For ordered categorical variables, set ordered = TRUE in the call to factor(). More notes on factor() are available here.
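A minimal sketch of an ordered factor follows. birthwt has no obviously ordinal text variable, so the `severity` vector below is hypothetical and exists only to show the effect of ordered = TRUE.

```r
# Hypothetical severity ratings; not part of birthwt.
severity <- c("mild", "severe", "moderate", "mild")

# List the levels from lowest to highest and mark the factor as ordered.
severity_f <- factor(severity,
                     levels = c("mild", "moderate", "severe"),
                     ordered = TRUE)

print(levels(severity_f))
print(severity_f < "severe")  # comparisons follow the level order
print(max(severity_f))        # max() is defined for ordered factors
```

Without an explicit levels argument the levels would be sorted alphabetically, which only matches the intended order by coincidence.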
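summary() reports basic statistics for numeric variables, counts for categorical variables, and the number of missing values, if any.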
> summary(birthwt)
  low           age             lwt       
 no :130   Min.   :14.00   Min.   : 80.0  
 yes: 59   1st Qu.:19.00   1st Qu.:110.0  
           Median :23.00   Median :121.0  
           Mean   :23.24   Mean   :129.8  
           3rd Qu.:26.00   3rd Qu.:140.0  
           Max.   :45.00   Max.   :250.0  
    race     smoke          ptl           ht     
 white:96   no :115   Min.   :0.0000   no :177  
 black:26   yes: 74   1st Qu.:0.0000   yes: 12  
 other:67             Median :0.0000            
                      Mean   :0.1958            
                      3rd Qu.:0.0000            
                      Max.   :3.0000            
   ui           ftv              bwt      
 no :161   Min.   :0.0000   Min.   : 709  
 yes: 28   1st Qu.:0.0000   1st Qu.:2414  
           Median :0.0000   Median :2977  
           Mean   :0.7937   Mean   :2945  
           3rd Qu.:1.0000   3rd Qu.:3487  
           Max.   :6.0000   Max.   :4990  

des() (available in the epiDisplay package) is better suited to a quick look at variable types.
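> des(birthwt)
 No. of observations = 189
   Variable   Class     Description
1  low        factor
2  age        integer
3  lwt        integer
4  race       factor
5  smoke      factor
6  ptl        integer
7  ht         factor
8  ui         factor
9  ftv        integer
10 bwt        integer

summ() (also available in the epiDisplay package) reports the minimum and maximum together, which makes it easy to see the range of each variable at a glance. Note that this function treats every variable as numeric, even categorical ones: a factor is summarized by its integer codes, so low (1 = no, 2 = yes) shows a mean of 1.312.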
> summ(birthwt)
 No. of observations = 189

   Var. name  obs.  mean     median  s.d.    min.  max.
1  low        189   1.312    1       0.465   1     2
2  age        189   23.24    23      5.3     14    45
3  lwt        189   129.81   121     30.58   80    250
4  race       189   1.847    1       0.918   1     3
5  smoke      189   1.392    1       0.489   1     2
6  ptl        189   0.2      0       0.49    0     3
7  ht         189   1.063    1       0.244   1     2
8  ui         189   1.148    1       0.356   1     2
9  ftv        189   0.79     0       1.06    0     6
10 bwt        189   2944.59  2977    729.21  709   4990
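The figures reported for the factor columns can be reproduced in base R. The sketch below rebuilds a toy version of low from the counts shown by summary() (130 "no", 59 "yes") and summarizes its integer codes. It only illustrates what treating a factor as numeric means and makes no claim about how summ() is implemented.

```r
# Toy factor rebuilt from the counts of low: 130 "no" and 59 "yes".
low <- factor(rep(c(0L, 1L), times = c(130L, 59L)),
              labels = c("no", "yes"))

# A factor stores integer codes starting at 1 (here 1 = no, 2 = yes).
codes <- as.integer(low)

print(round(mean(codes), 3))  # mean of the codes
print(range(codes))           # smallest and largest code
```

For a two-level factor, the mean of the codes minus 1 is the proportion of observations in the second level, so the mean column is still interpretable once the coding is kept in mind.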