1. Introduction
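Many blog posts already cover word-frequency statistics for PDF files in R, but they leave the following problems unsolved:
(1) When counting word frequencies in English text, words such as "a", "an" and "it" carry no real meaning, and numbers also distort the counts.
(2) The relevant code has not been wrapped into a custom function, which makes it inconvenient to use.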
2. Code
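# Return TRUE if the string contains at least one digit (0-9), FALSE otherwise.
hasdigit <- function(str) {
  if (!is.character(str)) {
    stop("'str' should be character.")
  }
  # A regular expression replaces the character-by-character loop;
  # it also handles the empty string, where 1:nchar(str) would count 1, 0.
  grepl("[0-9]", str)
}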
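As a quick sanity check, the digit test can be tried on a handful of sample tokens. The sketch below is self-contained: it uses a hypothetical helper named has_digit_vec (not part of the original post) and an invented token list, and it assumes that "contains a digit" means any of the ASCII characters 0 to 9.

```r
# Hypothetical stand-alone variant of the digit check (illustration only).
has_digit_vec <- function(x) {
  stopifnot(is.character(x))
  grepl("[0-9]", x)   # TRUE for every element that contains an ASCII digit
}

# Invented sample tokens, including an empty string.
tokens <- c("children", "2019", "part2", "", "ielts")
flags <- has_digit_vec(tokens)

# Show each token next to its flag.
print(data.frame(token = tokens, has_digit = flags))
```

Because grepl is vectorized, one call classifies the whole token vector, which is convenient when filtering the names of a frequency table.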
# Count word frequencies in a PDF file.
#   file     : path to the PDF
#   lo       : keep only words that occur more than 'lo' times
#   simplify : if TRUE, drop common function words ("a", "an", "it", ...)
#   del_num  : if TRUE, drop tokens that contain digits
wordstat.pdf <- function(file, lo = 3, simplify = TRUE, del_num = TRUE) {
  # library() does nothing for packages that are already attached
  library(pdftools)
  library(jiebaRD)
  library(jiebaR)
  library(wordcloud2)

  text <- pdf_text(file)              # one string per page
  # Segment into words and lower-case them; unlist() is a safeguard in case
  # the segmenter returns one vector per page.
  seg <- tolower(unlist(qseg[text]))
  seg <- table(seg)                   # frequency of each word
  if (simplify) {
    ex <- c("is","are","be","was","were","become","becomes","do","did","does","a","an","the",
            "can","will","would","could","should","may","might","have","has",
            "and","or","not","but","although","though","no","also","if","against","any",
            "for","on","off","from","to","of","in","by","like","as","at","about","up","down",
            "below","between","above","with",
            "many","more","much","most","better","worse","worst","best","good","bad",
            "it","them","its","their","we","you","our","this","that","these","those",
            "what","when","where","how","which","whose","why",
            "get","some","other","others")
    # Drop the listed words directly instead of assigning -1 to names
    # that may not exist in the table.
    seg <- seg[!(names(seg) %in% ex)]
  }
  if (del_num) {
    has_num <- vapply(names(seg), hasdigit, logical(1))
    seg <- seg[!has_num]
  }
  seg <- seg[seg > lo]
  # Sort after counting: table() reorders by name, so sorting beforehand has no effect.
  sort(seg, decreasing = TRUE)
}
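The filtering logic can be tried without a PDF or any extra packages. The following sketch uses base R only: the sample text, the short stop-word list and the threshold of 1 are all invented for illustration, and strsplit stands in for the jiebaR segmenter, so it is a rough model of the steps above rather than the function itself.

```r
# Invented sample text standing in for the pages returned by pdf_text().
text <- c("The children play. Children like the park in 2019.",
          "A child and the children walk to the park.")

# Split on anything that is not a letter or digit (stand-in for the segmenter).
words <- tolower(unlist(strsplit(text, "[^A-Za-z0-9]+")))
words <- words[nzchar(words)]          # drop empty strings

freq <- table(words)                   # count each word

# Shortened, hypothetical stop-word list.
stop_words <- c("a", "an", "the", "and", "in", "to", "like")
freq <- freq[!(names(freq) %in% stop_words)]

# Drop tokens that contain digits.
freq <- freq[!grepl("[0-9]", names(freq))]

# Keep words occurring more than once and sort by frequency.
freq <- sort(freq[freq > 1], decreasing = TRUE)
print(freq)
```

Removing entries by name keeps the table free of placeholder values, so the final threshold only has to deal with real counts.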
3. Usage example
Required packages: pdftools, jiebaRD, jiebaR, wordcloud2
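One possible way to install only the packages that are missing is sketched below. The helper name install_missing is hypothetical, and the sketch assumes the four packages are available from the CRAN mirror configured in your R session.

```r
# Hypothetical helper: install only the packages that are not yet available.
install_missing <- function(pkgs) {
  # requireNamespace() returns FALSE for packages that cannot be loaded.
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing_pkgs <- pkgs[!available]
  if (length(missing_pkgs) > 0) {
    install.packages(missing_pkgs)
  }
  invisible(missing_pkgs)   # return what had to be installed
}

# Usage (downloads from CRAN, so run it interactively):
# install_missing(c("pdftools", "jiebaRD", "jiebaR", "wordcloud2"))
```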
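Take Ideas for IELTS topics, written by IELTS examiner Simon, as an example (the PDF is available from ielts-simon.com). First run the code in Section 2 (Code) to load both functions into the R environment, then run the following: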
file <- "D:\\Ideas_for_IELTS_topics.pdf"
res <- wordstat.pdf(file)
wordcloud2(res, size = 1, shape = 'circle', color = 'random-light')
The resulting word cloud shows that the word "children" appears with high frequency in the IELTS Part 2 material provided by examiner Simon.
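Besides the word cloud, the frequency table itself can be inspected as a data frame. The sketch below uses an invented table named res_demo as a stand-in for the object returned by wordstat.pdf; the words and counts are made up and are not results from the PDF.

```r
# Made-up counts standing in for the table returned by wordstat.pdf().
res_demo <- as.table(c(apple = 12, banana = 30, cherry = 25, date = 42))

# Convert to a data frame with explicit column names.
top <- as.data.frame(res_demo, stringsAsFactors = FALSE)
names(top) <- c("word", "freq")

# Order by descending frequency and show the three most frequent words.
top <- top[order(-top$freq), ]
print(head(top, 3))
```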
Reference: https://blog.csdn.net/BEYONDMA/article/details/85465403