| Updated: 2020.10.25 | fjy2035@foxmail.com
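Preface: This post covers the basics of single-GPU and multi-GPU use, as well as saving and loading trained models. More complex functionality can be built by adapting the code here. To check GPU usage and see which users are occupying the GPUs, use (watch -n [time] nvidia-smi) or (gpustat -cpu): https://github.com/wookayin/gpustat https://pypi.org/project/gpustat/ To stop a process that is occupying a GPU on the server: kill -9 PID
Note: during training and testing, the inputs, the labels, and the model being trained must all be on the GPU. For small models, multi-GPU parallelism can actually be slower. The advantage of multiple GPUs only shows for large models (wider or deeper hidden layers) with a batch_size much larger than the number of GPUs. If a larger batch_size lowers prediction accuracy, increasing the number of epochs can compensate.
PyTorch initializes on GPU 0 by default and occupies some memory there, so careless use can lead to out-of-memory errors. The discussion below covers three things for both single-GPU and multi-GPU setups: specifying the GPU before training, saving and loading trained models, and loading models across GPU and CPU.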
1. Using a specified GPU in PyTorch - single GPU
Calling model.cuda() directly places the model on a single GPU, which by default is GPU 0:
model = Model()
if torch.cuda.is_available():
    model = model.cuda()
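There are two ways to specify a single GPU directly:
In the terminal shell: CUDA_VISIBLE_DEVICES=1 python main.py. This makes only physical GPU 1 visible and hides the others. Inside the process that GPU is renumbered as device 0, so using cuda:1 raises "invalid device ordinal". The Python code below has the same effect.
In Python code (choose one of two):
import os
import torch
# Must be set before the first CUDA call in the process
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
model = Model()
if torch.cuda.is_available():
    model = model.cuda()
    images = images.cuda()
    labels = labels.cuda()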
or
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
net = self.model.to(device)
images = self.images.to(device)
labels = self.labels.to(device)
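Note: "cuda:0" selects device 0, and "cuda" without an index selects the current CUDA device, which is device 0 by default. To start from a different device, use an explicit index such as "cuda:1", or change the current device with one of the following:
torch.cuda.device(id)  # context manager, used as: with torch.cuda.device(id): ...
or
torch.cuda.set_device(id)  # sets the current device for the process
or
torch.device('cuda')  # refers to the current CUDA device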
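The renumbering caused by CUDA_VISIBLE_DEVICES can be sketched in plain Python. The helper name logical_to_physical is hypothetical and not part of PyTorch. It assumes the variable holds a comma-separated list of integer GPU ids; UUID entries, which CUDA also accepts, are not handled.

```python
def logical_to_physical(visible_devices, logical_index):
    """Map a logical index (the n in "cuda:n") to the physical GPU id.

    Hypothetical helper for illustration only; assumes integer ids.
    """
    physical_ids = [int(tok) for tok in visible_devices.split(",") if tok.strip()]
    if not 0 <= logical_index < len(physical_ids):
        # Mirrors the "invalid device ordinal" error raised for hidden GPUs
        raise IndexError("invalid device ordinal: cuda:%d" % logical_index)
    return physical_ids[logical_index]


# With CUDA_VISIBLE_DEVICES=1, cuda:0 is physical GPU 1
single = logical_to_physical("1", 0)
# With CUDA_VISIBLE_DEVICES=0,1,3, cuda:2 is physical GPU 3
third = logical_to_physical("0,1,3", 2)
print(single, third)
```

Asking for logical index 1 when only one GPU is visible raises IndexError here, which corresponds to the "invalid device ordinal" case described above.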
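Saving a trained model on a single GPU (choose one of two)
# Save the weights together with the epoch counter
state = {'model': self.model.state_dict(), 'epoch': ite}
torch.save(state, self.model.name())
or
# Save the weights only
torch.save(self.model.state_dict(), 'Mymodel.pth')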
Testing: loading a model trained on a single GPU into a single GPU or the CPU (choose one of three). For details see Part 3: [Loading model parameters across GPU and CPU] (3. Using a specified GPU for training in PyTorch - other issues in detail (including CPU))
checkpoint = torch.load(self.model.name())
self.model.load_state_dict(checkpoint['model'])
or
self.model.load_state_dict(torch.load('Mymodel.pth'))
or
if torch.cuda.is_available():
    self.model.load_state_dict(torch.load('Mymodel.pth'))
else:
    # No GPU available: keep every tensor on the CPU while loading
    checkpoint = torch.load(self.model.name(), map_location=lambda storage, loc: storage)
    self.model.load_state_dict(checkpoint['model'])
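As a minimal sketch of the save and load round trip above, the following runs on the CPU with a tiny nn.Linear as a stand-in for a real network. The file name checkpoint.pth, the epoch value 5, and the variable names are assumptions made for illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for a trained network

# Save the weights together with the epoch counter
state = {"model": model.state_dict(), "epoch": 5}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint.pth")  # hypothetical file name
    torch.save(state, path)
    # map_location="cpu" makes the load work even on a machine without a GPU
    checkpoint = torch.load(path, map_location="cpu")

restored = nn.Linear(4, 2)
restored.load_state_dict(checkpoint["model"])
start_epoch = checkpoint["epoch"] + 1  # resume from the next epoch
print(start_epoch)
```

Storing the epoch next to the weights is what allows training to resume at the right place. The weights-only variant is enough when the file is used for inference only.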
2. Using specified GPUs in PyTorch - multiple GPUs (DataParallel)
There are again two ways to specify multiple GPUs directly:
In the terminal shell: CUDA_VISIBLE_DEVICES=0,1,3 python main.py
In Python code:
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,3"
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    self.model = torch.nn.DataParallel(self.model)
net = self.model.to(device)
images = self.images.to(device)
labels = self.labels.to(device)
Note: for multi-GPU training, using model = torch.nn.DataParallel(model) alone makes use of all visible GPUs by default.
Saving a trained model with multiple GPUs (choose one of three)
# Unwrap the DataParallel container so the saved keys have no 'module.' prefix
if isinstance(self.model, torch.nn.DataParallel):
    self.model = self.model.module
state = {'model': self.model.state_dict(), 'epoch': ite}
torch.save(state, self.model.name())
or
if isinstance(self.model, torch.nn.DataParallel):
    torch.save(self.model.module.state_dict(), 'Mymodel')
else:
    torch.save(self.model.state_dict(), 'Mymodel')
or
# If self.model is wrapped in DataParallel, the saved keys keep the 'module.' prefix
torch.save(self.model.state_dict(), 'Mymodel.pth')
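The difference between these options shows up in the parameter names. The following sketch uses nn.Linear as a stand-in network and compares the keys of the wrapped model with those of the underlying module. Constructing DataParallel does not require a GPU, so it also runs on a CPU-only machine.

```python
import torch.nn as nn

model = nn.Linear(4, 2)           # stand-in for a real network
wrapped = nn.DataParallel(model)  # adds one 'module' level on top

# Keys as they would be saved from the wrapper itself
wrapped_keys = list(wrapped.state_dict().keys())
# Keys as they would be saved from the underlying network
plain_keys = list(wrapped.module.state_dict().keys())

print(wrapped_keys)  # ['module.weight', 'module.bias']
print(plain_keys)    # ['weight', 'bias']
```

Saving from wrapped.module keeps the file loadable into an unwrapped model without any key renaming.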
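Testing: loading a model trained on multiple GPUs into a single GPU, multiple GPUs, or the CPU (choose one of three). For details see Part 3: [Loading model parameters across GPU and CPU] (3. Using a specified GPU for training in PyTorch - other issues in detail (including CPU))
# Wrap the network so its keys carry the same 'module.' prefix as the file
net = torch.nn.DataParallel(net)
net.load_state_dict(torch.load("model/cnn_train.pth"))
or
# Remove 'module.' from the saved keys and load into the unwrapped network
net.load_state_dict({k.replace('module.', ''): v for k, v in torch.load("model/cnn_train.pth").items()})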
or
from collections import OrderedDict
state_dict = torch.load("model/cnn_train.pth")
new_state_dict = OrderedDict()
for k, v in state_dict.items():
    name = k[7:]  # drop the leading 'module.' (7 characters)
    new_state_dict[name] = v
net.load_state_dict(new_state_dict)
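Both renaming variants above assume that every key starts with 'module.'. Slicing with k[7:] corrupts keys that lack the prefix, and replace() also removes 'module.' from the middle of a name. A slightly more defensive sketch is shown below; strip_module_prefix is a hypothetical helper name, and the integer values are placeholders standing in for tensors.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from each key.

    Keys without the prefix are kept unchanged. Hypothetical helper.
    """
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Placeholder values stand in for the tensors of a real state dict
saved = OrderedDict([
    ("module.conv1.weight", 1),
    ("module.fc.module.bias", 2),  # 'module.' also appears mid-name
    ("fc3.bias", 3),               # no prefix at all
])
cleaned = strip_module_prefix(saved)
print(list(cleaned.keys()))
```

With a real checkpoint, the result would be passed to net.load_state_dict(...) exactly as in the loops above.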
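3. Using a specified GPU for training in PyTorch - other issues in detail (including CPU)
DataParallel: torch.nn.DataParallel(module, device_ids=None, output_device=None, dim=0)
(1) DataParallel implements data parallelism at the module level and returns a new model. A copy of the model is placed on each GPU.
(2) DataParallel automatically splits the input tensor and distributes the parts to the model copies on the GPUs, so each GPU computes on one part of the batch. The input batch_size should therefore be larger than the number of GPUs.
(3) After each copy finishes its computation, DataParallel gathers and merges the results, which can then be returned to a single GPU for further processing.
Note: multi-GPU training wraps the network in DataParallel, which adds a module layer on top of the original network structure.
module: the model to run in parallel on multiple GPUs
device_ids: GPU indices (default: all GPUs)
output_device: where the output is placed (default: device_ids[0], which is cuda:0 when all GPUs are used)
dim: the dimension along which tensors are scattered (default: 0)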
gpu_ids = [3, 4, 6, 7]
# DataParallel requires the model's parameters to be on device_ids[0]
device = torch.device("cuda:%d" % gpu_ids[0] if torch.cuda.is_available() else "cpu")
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    self.model = torch.nn.DataParallel(self.model, device_ids=gpu_ids)
net = self.model.to(device)
images = self.images.to(device)
labels = self.labels.to(device)
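Point (2) above, that the batch is split across GPUs, explains why batch_size should be at least as large as the number of GPUs. The sketch below estimates the per-GPU chunk sizes, assuming the split along dim 0 follows torch.chunk semantics (equal chunks of ceil(batch_size / num_gpus), with a smaller last chunk). The function name split_sizes is hypothetical.

```python
import math


def split_sizes(batch_size, num_gpus):
    """Estimate how many samples each GPU receives for one batch.

    Assumes torch.chunk-style splitting; hypothetical helper.
    """
    chunk = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


even = split_sizes(64, 4)    # every GPU gets the same share
uneven = split_sizes(10, 3)  # the last GPU gets a smaller share
tiny = split_sizes(2, 4)     # fewer chunks than GPUs: two GPUs stay idle
print(even, uneven, tiny)
```

Under this assumption, a batch of 2 on 4 GPUs produces only two chunks, so half of the devices do no work for that batch.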
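During training, to call a sub-module of the model:
# Without DataParallel: access the sub-module directly
model = Net()
out = model.fc(input)
# With DataParallel: access the sub-module through .module
model = Net()
model = torch.nn.DataParallel(model)
out = model.module.fc(input)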
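To avoid writing two code paths for the wrapped and unwrapped cases, one option is a small unwrap helper. The sketch below is one possible way to do it: unwrap and call_fc are hypothetical names, and the tiny Net class is an assumed stand-in with a single fc layer.

```python
import torch
import torch.nn as nn


class Net(nn.Module):
    """Assumed stand-in network with one sub-module named fc."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)


def unwrap(model):
    """Return the underlying network whether or not it is wrapped."""
    return model.module if isinstance(model, nn.DataParallel) else model


def call_fc(model, x):
    """Call the fc sub-module on x, on whatever device the weights live."""
    net = unwrap(model)
    device = next(net.parameters()).device
    return net.fc(x.to(device))


x = torch.randn(3, 4)
out_plain = call_fc(Net(), x)
out_wrapped = call_fc(nn.DataParallel(Net()), x)
print(tuple(out_plain.shape), tuple(out_wrapped.shape))
```

The same unwrap helper can be used before saving, so that the state dict keys never carry the 'module.' prefix.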
During testing, loading model parameters across GPU and CPU: see the blog post [Loading model parameters across GPU and CPU] (https://blog.csdn.net/bc521bc/article/details/85623515)
# CPU -> CPU or GPU -> GPU
checkpoint = torch.load('modelparameters.pth')
model.load_state_dict(checkpoint)
# CPU -> GPU 1
torch.load('modelparameters.pth', map_location=lambda storage, loc: storage.cuda(1))
# GPU 1 -> GPU 0
torch.load('modelparameters.pth', map_location={'cuda:1': 'cuda:0'})
# GPU -> CPU
torch.load('modelparameters.pth', map_location=lambda storage, loc: storage)
# GPU -> CPU, shorter form
torch.load(opt.model, map_location='cpu')
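The cases above can be folded into one decision: pick a target device string and pass it as map_location. The sketch below shows only that decision in plain Python; choose_map_location is a hypothetical helper, and its arguments would normally come from torch.cuda.is_available() and torch.cuda.device_count().

```python
def choose_map_location(cuda_available, device_count=0, preferred_index=0):
    """Pick a map_location string for torch.load.

    Falls back to the CPU when no GPU is usable, and to cuda:0 when the
    preferred index does not exist. Hypothetical helper.
    """
    if not cuda_available or device_count == 0:
        return "cpu"
    index = preferred_index if 0 <= preferred_index < device_count else 0
    return "cuda:%d" % index


on_cpu = choose_map_location(False)
on_second_gpu = choose_map_location(True, device_count=2, preferred_index=1)
fallback = choose_map_location(True, device_count=1, preferred_index=3)
print(on_cpu, on_second_gpu, fallback)

# Intended use (not executed here):
# torch.load('modelparameters.pth', map_location=choose_map_location(
#     torch.cuda.is_available(), torch.cuda.device_count()))
```

Passing a plain string such as 'cpu' or 'cuda:1' is usually easier to read than the lambda form and covers the same cases.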
4. Complete code example
import os
from collections import OrderedDict

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

# Must be set before the first CUDA call; physical GPUs 0, 1, 3 become cuda:0, cuda:1, cuda:2
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,3'
# Data: CIFAR-10, normalized per channel to the range [-1, 1]
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
)
train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=10, shuffle=True, num_workers=0)
test_set = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=10, shuffle=False, num_workers=0)
classes = ['plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        h1 = self.pool(F.relu(self.conv1(x)))
        h2 = self.pool(F.relu(self.conv2(h1)))
        h2 = h2.view(-1, 16 * 5 * 5)  # flatten to (batch, 400)
        h3 = self.fc1(h2)
        h4 = self.fc2(h3)
        h5 = self.fc3(h4)
        return h5
net = CNN()
# DataParallel expects the parameters on device_ids[0], which is cuda:0 by default
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print("GPU or CPU is available: ", device)
if torch.cuda.device_count() > 1:
    print('Lets use', torch.cuda.device_count(), 'GPUs!')
    net = nn.DataParallel(net)
net.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
net.train()
for epoch in range(1):
    running_loss = 0.
    for i, data in enumerate(train_loader, 0):
        images, labels = data
        images = images.to(device)
        labels = labels.to(device)
        optimizer.zero_grad()
        outs = net(images)
        loss = criterion(outs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if i % 2000 == 1999:  # report the mean loss every 2000 iterations
            print('[epoch %d, iter %d] loss : %.3f' % (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.
print('Finish Training!')
os.makedirs('model', exist_ok=True)  # torch.save does not create the directory
torch.save(net.state_dict(), 'model/cnn_train.pth')
print('Finish save the model!')
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print('Test is running:', device)
if torch.cuda.is_available():
    state_dict = torch.load("model/cnn_train.pth")
else:
    # No GPU available: keep every tensor on the CPU while loading
    state_dict = torch.load("model/cnn_train.pth", map_location=lambda storage, loc: storage)
new_state_dict = OrderedDict()
if isinstance(net, torch.nn.DataParallel):
    print('\nThe model in test is wrapped in DataParallel')
    if list(state_dict.keys())[0][:6] == 'module':
        print("The loaded state dict already contains 'module'")
        net.load_state_dict(state_dict)
    else:
        # Keys have no 'module.' prefix, so load into the wrapped network
        print("The loaded state dict has no 'module'; loading into net.module")
        net.module.load_state_dict(state_dict)
    print("Finish loading 'model.pth'\n")
else:
    print('\nThe model in test is not wrapped in DataParallel')
    if list(state_dict.keys())[0][:6] == 'module':
        print("Removing 'module' from the loaded state dict")
        for k, v in state_dict.items():
            name = k[7:]  # drop the leading 'module.'
            new_state_dict[name] = v
        net.load_state_dict(new_state_dict)
    else:
        print("The loaded state dict does not contain 'module'")
        net.load_state_dict(state_dict)
    print("Finish loading 'model.pth'\n")
net.to(device)
net.eval()
correct_test = 0
total_test = 0
for epoch in range(1):
    for data in test_loader:
        images_test, labels_test = data
        images_test = images_test.to(device)
        labels_test = labels_test.to(device)
        with torch.no_grad():
            outs_test = net(images_test)
            _, predict = torch.max(outs_test.data, 1)
        total_test += labels_test.size(0)
        correct_test += (predict == labels_test).sum().item()
print('Accuracy of the network on the 10000 test images: %d %%' % (
    100 * correct_test / total_test))
print('Finish Testing!')
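5. Further reading: other blog posts
[1] Loading a GPU-trained model on the CPU and a CPU-trained model on the GPU: https://www.ptorch.com/news/74.html
[2] Single-machine multi-GPU parallel training, multi-machine multi-GPU training, and using DistributedDataParallel to fix unbalanced GPU memory usage: https://blog.csdn.net/weixin_47196664/article/details/106542016?utm_medium=distribute.wap_relevant.none-task-blog-title-2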