An R Function for Word-Frequency Statistics on PDF Files

2026-03-03

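1. Introduction

There are already many blog posts on computing word frequencies from PDF files in R, but two problems remain unsolved:

First, when counting English words, function words such as "a", "an" and "it" carry no real meaning, and numbers also distort the word-frequency results. Second, the relevant code has not been wrapped into a custom function, which makes it inconvenient to use.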

2. Code

# Return TRUE for each element of 'str' that contains at least one digit.
hasdigit <- function(str) {
  if (!is.character(str)) {
    stop("'str' should be character.")
  }
  grepl("[0-9]", str)
}

wordstat.pdf <- function(file, lo = 3, simplify = TRUE, del_num = TRUE) {
  # file:     path to the PDF file
  # lo:       frequency threshold; only words occurring more than 'lo' times are kept
  # simplify: whether to delete simple words like "a", "an", "we", etc.
  # del_num:  whether to delete tokens that contain digits
  library(pdftools)
  library(jiebaRD)
  library(jiebaR)
  library(wordcloud2)

  # extract the text, segment it into lower-case words, and count them
  text <- pdf_text(file)
  seg <- tolower(qseg[text])
  seg <- sort(table(seg), decreasing = TRUE)

  # remove simple words
  if (simplify) {
    ex <- c("is","are","be","was","were","become","becomes","do","did","does","a","an","the",
            "can","will","would","could","should","may","might","have","has",
            "and","or","not","but","although","though","no","also","if","against","any",
            "for","on","off","from","to","of","in","by","like","as","at","about","up","down",
            "below","between","above","with",
            "many","more","much","most","better","worse","worst","best","good","bad",
            "it","them","its","their","we","you","our","this","that","these","those",
            "what","when","where","how","which","whose","why",
            "get","some","other","others")
    seg <- seg[!(names(seg) %in% ex)]
  }

  # remove tokens that contain digits
  if (del_num) {
    seg <- seg[!hasdigit(names(seg))]
  }

  # keep only words that occur more than 'lo' times
  seg[seg > lo]
}
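The filtering steps can be checked without a PDF or the segmentation packages. The following is a minimal sketch in base R, not the function itself: it assumes a simple `strsplit` tokenizer as a stand-in for `qseg`, and the names `toy_text`, `stop_words` and `min_count` are hypothetical, introduced only for illustration.

```r
# Toy input standing in for text extracted from a PDF (hypothetical).
toy_text <- "The children read 3 books. The children like the 2nd book, and children play."

# Stand-in tokenizer: split on anything that is not a letter or digit.
# (The function above relies on jiebaR's qseg instead.)
tokens <- tolower(unlist(strsplit(toy_text, "[^A-Za-z0-9]+")))
tokens <- tokens[nzchar(tokens)]

# Count occurrences and sort from most to least frequent.
freq <- sort(table(tokens), decreasing = TRUE)

# Drop a few function words (a small, hypothetical subset of the full list).
stop_words <- c("the", "and", "a", "an", "it")
freq <- freq[!(names(freq) %in% stop_words)]

# Drop tokens that contain a digit, such as "3" or "2nd".
freq <- freq[!grepl("[0-9]", names(freq))]

# Keep only words that occur more than min_count times.
min_count <- 1
freq <- freq[freq > min_count]

print(freq)
```

On this toy input only "children" (3 occurrences) survives: "the" and "and" are removed as function words, "3" and "2nd" are removed because they contain digits, and the remaining words occur only once.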

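3. Usage example

Required packages: pdftools, jiebaRD, jiebaR, wordcloud2

Take Ideas for IELTS topics, written by the IELTS examiner Simon, as an example (the PDF is available from ielts-simon.com). First run the code above to define the two functions in your R session, then run the following: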

file <- "D:\\Ideas_for_IELTS_topics.pdf"
res <- wordstat.pdf(file)
wordcloud2(res, size = 1, shape = 'circle', color = 'random-light')
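Before plotting, it can help to inspect the most frequent words as a plain table. A minimal sketch, assuming `res` is a one-dimensional frequency table like the one `wordstat.pdf` returns. A small hand-made table (`res_demo`, with hypothetical words and counts) stands in for it here so that the snippet runs without a PDF:

```r
# Hypothetical stand-in for the table returned by wordstat.pdf().
res_demo <- table(c(rep("children", 5), rep("education", 3), rep("society", 4)))

# Convert the table to a data frame with one row per word.
df <- data.frame(word = names(res_demo),
                 freq = as.vector(res_demo),
                 stringsAsFactors = FALSE)

# Order by descending frequency and show the top entries.
df <- df[order(df$freq, decreasing = TRUE), ]
rownames(df) <- NULL
print(head(df, 10))
```

With a real result, replacing `res_demo` by `res` should give the same kind of listing, which makes it easy to spot leftover function words that could be added to the exclusion list.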

In the resulting word cloud, "children" stands out as a high-frequency word in the IELTS Part 2 material provided by Simon.

Reference: https://blog.csdn.net/BEYONDMA/article/details/85465403
