Movie Review Classification
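Contents:
I. Loading the IMDB dataset
II. Preparing the data
III. Building the network (1. Model definition; 2. Compiling the model; 2.1 Configuring the optimizer; 2.2 Using custom loss functions and metrics)
IV. Training and validating the model (1. Setting aside a validation set; 2. Training the model; 3. Plotting training and validation loss with matplotlib; 4. Checking model performance on the test set; 5. Using the trained network to generate predictions on new data)
V. Complete code listing
VI. Results: overfitting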
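Binary classification is one of the most widely applied kinds of machine-learning problem. In this example we learn to classify movie reviews as positive or negative based on the text of the review.
I. Loading the IMDB dataset
The code below downloads the IMDB dataset (roughly 80 MB the first time it runs).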
from keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
train_data[0]  # a list of word indices
>> [1, ..., 973, 1622]
train_labels[0]  # labels are 1 or 0: 1 means positive, 0 means negative
>> 1
max([max(sequence) for sequence in train_data])
>> 9999
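num_words=10000 means that only the 10,000 most frequent words in the training data are kept and rarer words are discarded. This keeps the resulting vector data to a manageable size.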
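To make the effect of capping the vocabulary concrete, here is a toy sketch in plain Python. It is not how Keras implements `load_data`; the helper names `build_word_index` and `encode`, the three-sentence corpus, and the choice of 0 as the out-of-vocabulary index are all assumptions made for illustration.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency: the most frequent word gets rank 1."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(text, word_index, num_words, oov_index=0):
    """Map each word to its rank; ranks >= num_words become oov_index."""
    return [
        word_index[w] if word_index.get(w, num_words) < num_words else oov_index
        for w in text.split()
    ]

# Hypothetical miniature corpus
texts = ["the film was great", "the film was dull", "the plot was thin"]
word_index = build_word_index(texts)
encoded = encode("the film was great", word_index, num_words=4)
print(encoded)  # [1, 3, 2, 0]
```

With `num_words=4` every surviving index is below 4, in the same way that the largest index in the real data is 9999 when `num_words=10000`.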
A review can be decoded back into English words. The code below decodes the first review.
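word_index = imdb.get_word_index()  # maps each word to an integer index
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
# Indices are offset by 3 because 0, 1 and 2 are reserved
# (padding, start of sequence, unknown word)
decoded_review = ' '.join(
    [reverse_word_index.get(i - 3, '?') for i in train_data[0]])
print(decoded_review)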
>> ? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all
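The `?` marks in the decoded text come from the index offset used when decoding. The following minimal sketch reproduces the effect with a hypothetical four-word vocabulary standing in for `imdb.get_word_index()`; the encoded list is invented for illustration.

```python
# Hypothetical miniature vocabulary (word -> rank)
word_index = {"this": 1, "film": 2, "was": 3, "brilliant": 4}
reverse_word_index = {rank: word for word, rank in word_index.items()}

# In the encoded review, 1 marks the start of the sequence, 2 marks an
# unknown word, and every real word is stored as its rank + 3.
encoded_review = [1, 4, 5, 6, 2, 7]

# Subtracting 3 recovers the rank; reserved indices fall outside the
# vocabulary and are rendered as '?'.
decoded = " ".join(reverse_word_index.get(i - 3, "?") for i in encoded_review)
print(decoded)  # ? this film was ? brilliant
```

The leading `?` corresponds to the start-of-sequence marker, and the others to words that were dropped from the vocabulary.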
II. Preparing the data
Encode the integer sequences as a binary matrix.
import numpy as np

def vectorize_sequence(sequences, dimension=10000):
    # Start from an all-zero matrix of shape (number of sequences, dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Set the columns named by this sequence's word indices to 1
        results[i, sequence] = 1.
    return results

x_train = vectorize_sequence(train_data)
x_test = vectorize_sequence(test_data)
# Labels are converted to float32 arrays
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')
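The encoding above can be checked by hand on a tiny input. This sketch mirrors the same idea in plain Python lists rather than NumPy; the function name `multi_hot` and the sample sequences are illustrative assumptions.

```python
def multi_hot(sequences, dimension):
    """Return one row per sequence with 1.0 at every index that occurs in it."""
    results = [[0.0] * dimension for _ in sequences]
    for i, sequence in enumerate(sequences):
        for j in sequence:
            results[i][j] = 1.0  # a repeated index just sets the same cell again
    return results

rows = multi_hot([[0, 2, 2], [3]], dimension=5)
print(rows)  # [[1.0, 0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0, 0.0]]
```

Note that this representation records only which words occur in a review. Word order and word counts are lost.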
III. Building the network
1. Model definition
from keras import models
from keras import layers
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
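As a rough illustration of what each `Dense` layer computes (an activation applied to a weighted sum of the inputs plus a bias), here is a plain-Python sketch. The weights, biases, and layer sizes are invented, and the weight layout (one row per output unit) is a simplification rather than the way Keras stores its kernel.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """weights[k] holds the input weights of output unit k."""
    return [
        activation(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]

x = [1.0, 0.0, 1.0]  # a tiny stand-in for a 10000-dimensional input
hidden = dense(x, weights=[[0.5, -1.0, 0.5], [-1.0, 2.0, -1.0]],
               biases=[0.0, 0.5], activation=relu)
output = dense(hidden, weights=[[1.0, 1.0]], biases=[-1.0], activation=sigmoid)
print(hidden, output)  # [1.0, 0.0] [0.5]
```

The relu units zero out negative sums, and the final sigmoid squashes its input into the range 0 to 1 so the output can be read as a probability.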
2. Compiling the model
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
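Key points about model.compile(): optimizer selects the optimizer (rmsprop here); loss is the loss function (here the commonly used binary cross-entropy); metrics lists the quantities to monitor (here we only care about accuracy). The code above passes the optimizer, loss, and metric as strings, which works because rmsprop, binary_crossentropy, and accuracy are built into Keras. You can also configure the optimizer's parameters, or supply a custom loss function and custom metrics.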
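For intuition about the loss, binary cross-entropy can be computed by hand. The sketch below is a plain-Python version of the standard formula, averaged over samples; the clipping constant `eps` and the sample predictions are assumptions for illustration, not values taken from Keras.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
unsure = binary_crossentropy([1, 0], [0.5, 0.5])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right, unsure, confident_wrong)
```

The loss is small for confident correct predictions, equals log 2 when the model outputs 0.5 everywhere, and grows quickly for confident wrong predictions.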
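2.1 Configuring the optimizer
To set an optimizer's parameters, import the optimizer class and pass a configured instance to compile(). Here we use the RMSprop optimizer that ships with Keras.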
from keras import optimizers
# Recent Keras releases name this argument learning_rate; older releases used lr
model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])
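To show what the learning rate controls, here is a sketch of a single RMSprop-style update in its common textbook form. It is not a copy of the Keras implementation, whose details may differ; the function name `rmsprop_step`, the defaults for `rho` and `eps`, and the toy loss `w**2` are assumptions.

```python
import math

def rmsprop_step(w, grad, avg_sq, lr=0.001, rho=0.9, eps=1e-7):
    """One update: scale the gradient by a running RMS of past gradients."""
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2
    w = w - lr * grad / (math.sqrt(avg_sq) + eps)
    return w, avg_sq

# Minimise the toy loss w**2, whose gradient is 2*w
w, avg_sq = 1.0, 0.0
for _ in range(3):
    grad = 2.0 * w
    w, avg_sq = rmsprop_step(w, grad, avg_sq)
print(w)  # slightly below 1.0 after three small steps
```

Because the gradient is divided by its own running magnitude, the step size is governed mainly by the learning rate rather than by the raw scale of the gradient.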
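2.2 Using custom loss functions and metrics
As above, but the loss and the metric are passed as function objects instead of strings.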
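from keras import optimizers
from keras import losses
from keras import metrics
# Recent Keras releases name this argument learning_rate; older releases used lr
model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001),
              loss=losses.binary_crossentropy,
              metrics=[metrics.binary_accuracy])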
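As a hand-checkable illustration of what a binary accuracy metric measures, the sketch below thresholds each predicted probability and counts matches with the labels. It is plain Python, not the Keras function; the 0.5 threshold and the sample values are assumptions.

```python
def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum(1 for y, p in zip(y_true, y_pred) if (p > threshold) == bool(y))
    return hits / len(y_true)

acc = binary_accuracy([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6])
print(acc)  # 0.5
```

Here the first two predictions land on the correct side of the threshold and the last two do not, giving an accuracy of 0.5.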
IV. Training and validating the model
1. Setting aside a validation set
x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]
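The same hold-out split can be written as a small helper. This is a minimal sketch on toy lists, assuming the samples are already in random order; the name `hold_out` and the toy data are hypothetical.

```python
def hold_out(samples, labels, n_val):
    """Reserve the first n_val items for validation, the rest for training."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    train = (samples[n_val:], labels[n_val:])
    val = (samples[:n_val], labels[:n_val])
    return train, val

samples = list(range(10))
labels = [s % 2 for s in samples]
(train_x, train_y), (val_x, val_y) = hold_out(samples, labels, n_val=4)
print(len(train_x), len(val_x))  # 6 4
```

Keeping the two parts disjoint is what makes the validation loss a measure of performance on data the model has not trained on.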
2. Training the model
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))
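Note: model.fit() returns a History object. This object has a history attribute, a dictionary holding the loss and metric values recorded during training. You can use this dictionary to plot the training curves.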
history_dict = history.history
history_dict.keys()
>> dict_keys(['val_loss', 'val_acc', 'loss', 'acc'])
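Besides plotting, the dictionary can be queried directly, for example to find the epoch with the lowest validation loss. The numbers below are made up purely for illustration and are not results from the training run.

```python
# Made-up values for illustration only; not results from the run above
history_dict = {
    "loss":     [0.50, 0.30, 0.22, 0.17, 0.13, 0.10],
    "val_loss": [0.40, 0.31, 0.28, 0.27, 0.29, 0.32],
}

val_loss = history_dict["val_loss"]
# Index of the smallest validation loss, converted to a 1-based epoch number
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(best_epoch, val_loss[best_epoch - 1])  # 4 0.27
```

Each list holds one value per epoch, so list position plus one gives the epoch number.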
3. Plotting training and validation loss with matplotlib
import matplotlib.pyplot as plt
history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']
epochs = range(1, len(loss_values) + 1)
plt.plot(epochs, loss_values, 'bo', label='Training loss')        # 'bo' = blue dots
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')   # 'b' = solid blue line
plt.title('Training and Validation loss')
plt.xlabel('Epoch')  # call the function; assigning to plt.xlabel would overwrite it
plt.ylabel('Loss')
plt.legend()
plt.show()
4. Checking model performance on the test set
results = model.evaluate(x_test, y_test)
print(results)
5. Using the trained network to generate predictions on new data
model.predict(x_test)
>> array([[4.2676926e-05],
          [1.0000000e+00],
          [9.9798399e-01],
          ...,
          [4.2319298e-06],
          [6.9633126e-04],
          [9.3176681e-01]], dtype=float32)
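As you can see, the network is very confident about some samples, with outputs as high as 0.99 or 1, and less certain about others, with outputs around 0.4 or 0.6.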
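To turn these probabilities into class labels, a common approach is to threshold them. The sketch below applies this to the six values printed above; the 0.5 threshold and the definition of "confidence" as the distance from the undecided point are assumptions for illustration.

```python
# The six probabilities shown in the output above
probabilities = [4.2676926e-05, 1.0, 9.9798399e-01,
                 4.2319298e-06, 6.9633126e-04, 9.3176681e-01]

# 1 = positive review, 0 = negative review
labels = [int(p > 0.5) for p in probabilities]
# How far each prediction is from being undecided (0.5 = coin flip, 1.0 = certain)
confidence = [max(p, 1.0 - p) for p in probabilities]
print(labels)  # [0, 1, 1, 0, 0, 1]
```

An output near 0.5 would yield a confidence near 0.5, which flags samples the network is unsure about.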
V. Complete code listing
from keras.datasets import imdb
import numpy as np
from keras import models
from keras import layers
import matplotlib.pyplot as plt

# Load the data, keeping only the 10,000 most frequent words
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

# Encode each list of word indices as a 10000-dimensional 0/1 vector
def vectorize_sequence(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

x_train = vectorize_sequence(train_data)
x_test = vectorize_sequence(test_data)
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

# Set aside the first 10,000 training samples for validation
x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]

# Define and compile the model
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Train while monitoring the validation set
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))

# Plot training and validation loss
history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']
epochs = range(1, len(loss_values) + 1)
plt.plot(epochs, loss_values, 'bo', label='Training loss')
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')
plt.title('Training and Validation loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()

# Evaluate on the test set and generate predictions
results = model.evaluate(x_test, y_test)
print(results)
print(model.predict(x_test))
VI. Results: overfitting
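The plot shows that the training loss keeps decreasing with every epoch, whereas the validation loss reaches its best value around the fourth epoch and then rises steadily. In plain terms, the model keeps getting better on the training data but does not necessarily get better on data it has never seen. In deep-learning terminology, the model is overfitting: after about the fourth epoch it is over-optimized on the training data, and what it learns is specific to the training set and does not generalize to data outside it. To prevent this, you can stop training early, after roughly three to four epochs, or use other techniques that reduce overfitting.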
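Stopping once the validation loss stops improving can be automated. Keras provides an `EarlyStopping` callback for this purpose; the plain-Python sketch below only illustrates the underlying logic. The function name `early_stopping_epoch`, the `patience` value, and the loss values are invented for illustration and are not results from the run above.

```python
def early_stopping_epoch(val_losses, patience=1):
    """Return (epoch at which training stops, epoch with the best loss)."""
    best = float("inf")
    best_epoch = 0
    waited = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Made-up validation losses that bottom out at epoch 4
stopped_at, best_epoch = early_stopping_epoch(
    [0.40, 0.31, 0.28, 0.27, 0.29, 0.32], patience=2)
print(stopped_at, best_epoch)  # 6 4
```

A larger `patience` tolerates more noisy epochs before stopping, at the cost of training longer past the best point.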