Stemming and Lemmatization

2025-10-30

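Stemming: a rule-based, relatively crude operation. Using a handful of basic rules, it trims a token down to its stem. For example, eat has several variants, e.g. eating, eaten, eats. Most of the time there is no point in distinguishing between these variants, so stemming is used to reduce each word to its root.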
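To make the rule-based idea concrete, here is a minimal sketch of a suffix-stripping stemmer. The rule list and the `toy_stem` name are hypothetical and chosen only for illustration; real stemmers such as the Porter stemmer use a much more elaborate rule set.

```python
# Hypothetical suffix rules, tried in order. This is NOT the Porter algorithm.
SUFFIX_RULES = ["ing", "en", "s"]


def toy_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIX_RULES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


for word in ["eat", "eating", "eaten", "eats", "dancing"]:
    print(word, "->", toy_stem(word))
```

All four forms of eat collapse to "eat", while "dancing" becomes "danc", which shows that a stem produced this way is not necessarily a real word.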

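For some more complex NLP tasks, lemmatization should be used instead of stemming. Lemmatization is more robust: it takes grammatical variants into account when deriving the root of a word.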

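Lemmatization: a more methodical approach that applies different normalization rules depending on each word's part of speech, yielding the root unit (the lemma).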
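The dependence on part of speech can be sketched with a tiny lookup table. The `TOY_LEMMAS` table and the `toy_lemmatize` function are hypothetical; a real lemmatizer relies on a full dictionary such as WordNet together with morphological rules.

```python
# Hypothetical mini-dictionary keyed by (word form, part of speech).
TOY_LEMMAS = {
    ("ate", "verb"): "eat",
    ("eaten", "verb"): "eat",
    ("meeting", "verb"): "meet",     # "we are meeting today"
    ("meeting", "noun"): "meeting",  # "the meeting was long"
    ("meetings", "noun"): "meeting",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return TOY_LEMMAS.get((word.lower(), pos), word)


print(toy_lemmatize("meeting", "verb"))
print(toy_lemmatize("meeting", "noun"))
print(toy_lemmatize("ate", "verb"))
```

The same surface form "meeting" maps to different lemmas depending on whether it is tagged as a verb or a noun, which is the distinction a purely suffix-based stemmer cannot make.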

 

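Differences:

Stemming extracts the stem or root form of a word (the result does not necessarily carry complete meaning).

Lemmatization reduces a word in any inflected form to its general form (the result carries complete meaning).

Lemmatization and stemming are two kinds of word-form normalization.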

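Note: stemming and lemmatization modify the tokens, so some NLP taggers will produce different results on the modified text. When using such taggers, stemming and lemmatization should be avoided.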

 

Stemming/Lemmatization in NLTK

WORD

    # Stemming
    from nltk.stem import PorterStemmer
    pst = PorterStemmer()
    print(pst.stem("eating"))
    print(pst.stem("ate"))

    # Lemmatization
    # WordNetLemmatizer needs the WordNet corpus; run nltk.download('wordnet') once if it is missing
    from nltk.stem import WordNetLemmatizer
    wlem = WordNetLemmatizer()
    print(wlem.lemmatize("eating"))
    print(wlem.lemmatize("ate"))

output:

    PS C:\Users\HUST> & python "d:/NLTK Spacy学习/NLTK_learning.py"
    eat
    ate
    eating
    ate

SENTENCE

    import nltk
    from nltk.stem import PorterStemmer
    from nltk.stem import WordNetLemmatizer

    nltk.download('punkt')    # tokenizer models for word_tokenize (newer NLTK releases may ask for 'punkt_tab')
    nltk.download('wordnet')  # corpus used by WordNetLemmatizer

    text = "Dancing is an art. Students should be taught dance as a subject in schools . I danced in many of my school function. Some people are always hesitating to dance."

    # Stem every token in the text
    stemmed_token = []
    pts = PorterStemmer()
    for i in nltk.word_tokenize(text):
        stemmed_token.append(pts.stem(i))
    print(' '.join(stemmed_token))
    print()

    # Lemmatize every token (default part of speech: noun)
    lemma_token = []
    wnl = WordNetLemmatizer()
    for i in nltk.word_tokenize(text):
        lemma_token.append(wnl.lemmatize(i))
    print(' '.join(lemma_token))

output:

    stemming:
    danc is an art . student should be taught danc as a subject in school . I danc in mani of my school function . some peopl are alway hesit to danc .

    lemmatization:
    Dancing is an art . Students should be taught dance a a subject in school . I danced in many of my school function . Some people are always hesitating to dance .

 
