Stemming and Lemmatization

it2025-10-30 23

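Stemming: a rule-based and relatively crude operation. A handful of basic rules is enough to trim any token down to its stem. For example, eat has several variants, e.g. eating, eaten, eats. Most of the time there is no point in distinguishing between these variants, so stemming is used to reduce each word to its root.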
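To make "a handful of basic rules" concrete, here is a minimal sketch of a suffix-stripping stemmer. It is only an illustration and not the Porter algorithm: the rule list `SUFFIXES` and the function `toy_stem` are hypothetical names made up for this example.

```python
# Toy suffix-stripping stemmer: an illustration only, NOT the Porter algorithm.
# SUFFIXES is a hypothetical rule list, checked in order (longest first).
SUFFIXES = ("ing", "en", "ed", "s")


def toy_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = token.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ("eating", "eaten", "eats", "ate"):
    print(w, "->", toy_stem(w))
# eating -> eat
# eaten -> eat
# eats -> eat
# ate -> ate   (irregular forms are out of reach for pure suffix rules)
```

The regular variants collapse to the same stem, while the irregular form "ate" is left untouched, which is the typical limitation of a purely rule-based approach.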

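For more complex NLP tasks, lemmatization should be used in place of stemming. Lemmatization is more robust: it takes grammatical variants into account when deriving the root of a word.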

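Lemmatization: a more principled approach that applies different normalization rules depending on each word's part of speech, yielding the root unit (the lemma).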
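The idea of "different rules per part of speech" can be sketched with a small lookup-based lemmatizer. This is a toy under stated assumptions: the `IRREGULAR` table, the two suffix rules, and the function `toy_lemmatize` are all hypothetical and cover only the words used here. A real lemmatizer relies on a full lexicon such as WordNet.

```python
# Toy POS-aware lemmatizer: a tiny lookup table stands in for a real lexicon.
# IRREGULAR and the suffix rules are hypothetical and deliberately minimal.
IRREGULAR = {
    ("ate", "v"): "eat",
    ("eaten", "v"): "eat",
    ("better", "a"): "good",
}


def toy_lemmatize(token: str, pos: str) -> str:
    """Return a lemma for `token`, given a one-letter part of speech."""
    word = token.lower()
    # 1. Irregular forms are resolved by (word, pos) lookup.
    if (word, pos) in IRREGULAR:
        return IRREGULAR[(word, pos)]
    # 2. Verb-only rule: drop the "-ing" ending.
    if pos == "v" and word.endswith("ing") and len(word) > 5:
        return word[:-3]
    # 3. Noun/verb rule: drop a final "-s" (but not "-ss").
    if pos in ("n", "v") and word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


print(toy_lemmatize("ate", "v"))     # eat
print(toy_lemmatize("eating", "v"))  # eat
print(toy_lemmatize("eating", "n"))  # eating
print(toy_lemmatize("better", "a"))  # good
```

The same surface form "eating" maps to different lemmas depending on the part of speech it is given, which is exactly what a stemmer cannot express.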

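Differences:

Stemming extracts the stem or root form of a word (the result does not necessarily carry a complete meaning).

Lemmatization reduces a word in any inflected form to its base form (the result does carry a complete meaning).

Lemmatization and stemming are two kinds of word-form normalization.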

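Note: stemming and lemmatization modify the tokens, so when some NLP taggers are run on the modified tokens they produce different results. In that situation, stemming and lemmatization should be avoided.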
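The effect described in the note can be shown with a deliberately simplified tagger. This is a sketch, not a real tagger: `LEXICON` and `toy_tag` are hypothetical, and the fallback to "NN" for unknown tokens is an assumption made for the example.

```python
# Hypothetical mini-lexicon standing in for a trained tagger's vocabulary.
LEXICON = {"dancing": "VBG", "danced": "VBD", "students": "NNS", "is": "VBZ"}


def toy_tag(tokens):
    """Look each token up; tokens missing from the lexicon fall back to 'NN'."""
    return [(t, LEXICON.get(t.lower(), "NN")) for t in tokens]


raw = ["Students", "danced"]
stemmed = ["student", "danc"]  # what a stemmer would hand to the tagger

print(toy_tag(raw))      # [('Students', 'NNS'), ('danced', 'VBD')]
print(toy_tag(stemmed))  # [('student', 'NN'), ('danc', 'NN')]
```

Once the tokens have been stemmed, the information the tagger needed (plural ending, past-tense ending) is gone, so tagging is best done on the original tokens.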

Stemming/Lemmatization in NLTK

WORD

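# Stemming
from nltk.stem import PorterStemmer
pst = PorterStemmer()
print(pst.stem("eating"))
print(pst.stem("ate"))

# Lemmatization (needs the WordNet data: nltk.download('wordnet'))
from nltk.stem import WordNetLemmatizer
wlem = WordNetLemmatizer()
# lemmatize() treats the word as a noun by default (pos='n'),
# so verb forms such as "eating" and "ate" come back unchanged
print(wlem.lemmatize("eating"))
print(wlem.lemmatize("ate"))

Output:

PS C:\Users\HUST> & python "d:/NLTK Spacy学习/NLTK_learning.py"
eat
ate
eating
ate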

SENTENCE

import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')    # tokenizer models used by word_tokenize
nltk.download('wordnet')  # lexicon used by WordNetLemmatizer

text = "Dancing is an art. Students should be taught dance as a subject in schools . I danced in many of my school function. Some people are always hesitating to dance."

# Stem every token
stemmed_token = []
pts = PorterStemmer()
for i in nltk.word_tokenize(text):
    stemmed_token.append(pts.stem(i))
print(' '.join(stemmed_token))

print()

# Lemmatize every token (default part of speech: noun)
lemma_token = []
wnl = WordNetLemmatizer()
for i in nltk.word_tokenize(text):
    lemma_token.append(wnl.lemmatize(i))
print(' '.join(lemma_token))

Output:

stemming: danc is an art . student should be taught danc as a subject in school . I danc in mani of my school function . some peopl are alway hesit to danc .

lemmatization: Dancing is an art . Students should be taught dance a a subject in school . I danced in many of my school function . Some people are always hesitating to dance .
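In the lemmatization output above, "danced" and "hesitating" are unchanged because `lemmatize()` treats every token as a noun unless told otherwise. One possible way to apply part-of-speech-specific rules is to tag the tokens first and pass a WordNet POS letter to `lemmatize()`. The helpers `penn_to_wordnet` and `lemmatize_with_pos` below are hypothetical names for this sketch. It also assumes the tokenizer, tagger, and WordNet resources have already been fetched with `nltk.download`; the exact resource names vary between NLTK versions.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter (noun by default)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback


def lemmatize_with_pos(text: str) -> str:
    """Tokenize, POS-tag, then lemmatize each token with its own POS."""
    # Imported here so the mapping above stays usable without NLTK installed.
    import nltk
    from nltk.stem import WordNetLemmatizer

    wnl = WordNetLemmatizer()
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return " ".join(
        wnl.lemmatize(token, pos=penn_to_wordnet(tag)) for token, tag in tagged
    )
```

Calling `lemmatize_with_pos(text)` on the sample text is one way to try it. The tagging happens on the original tokens, before any normalization, and the quality of the lemmas then depends on the tagger's output.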
