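Many Python beginners debug with print or log statements, but that is only convenient for small programs. A better approach is to inspect variables and methods while the program is running.
If you are interested, look at PyCharm's debug mode, which is powerful and covers most needs; we will not go into it here. Instead, this section introduces pdb, a flexible interactive debugger that works well with PyTorch.
pdb is an interactive debugger built into the Python standard library. It lets you set breakpoints anywhere in your Python code, inspect any variable, single-step through the code, and even modify variable values, all without restarting the program.
ipdb is an enhanced version of pdb. It adds tab completion in debug mode, better syntax highlighting and tracebacks, and better introspection. Most importantly, its interface is fully compatible with pdb. Install it with pip install ipdb.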
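As a minimal sketch of how a program hands control to the debugger, the snippet below imports ipdb when it is installed and falls back to the standard-library pdb otherwise, relying on the compatible interface described above. The function `average` and its `debug` argument are hypothetical names used only for illustration.

```python
# Prefer ipdb when available; pdb exposes the same set_trace() entry point.
try:
    import ipdb as debugger  # third-party: pip install ipdb
except ImportError:
    import pdb as debugger   # standard library fallback


def average(values, debug=False):
    """Return the mean of values, optionally pausing in the debugger first."""
    total = sum(values)
    if debug:
        # Execution pauses here. At the prompt, `p total` prints a variable,
        # `n` runs the next line, and `c` continues normally.
        debugger.set_trace()
    return total / len(values)


print(average([1, 2, 3]))
```

Calling `average([1, 2, 3], debug=True)` from a terminal opens the interactive prompt; with the default `debug=False` the function runs straight through.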
99it [00:17, 6.07it/s]loss: 0.22854854568839075
119it [00:21, 5.79it/s]loss: 0.21267264398435753
139it [00:24, 5.99it/s]loss: 0.19839374726372108
> e:/Users/mac/Desktop/jupyter/mdFile/deeplearning/main.py(80)train()
79 loss_meter.reset()
---> 80 confusion_matrix.reset()
81 for ii, (data, label) in tqdm(enumerate(train_dataloader)):
ipdb> break 88 # set a breakpoint at line 88; the program enters debug mode when it reaches that line
Breakpoint 1 at e:/Users/mac/Desktop/jupyter/mdFile/deeplearning/main.py:88
ipdb> # print the standard deviation of every parameter and of its gradient
for (name,p) in model.named_parameters(): \
print(name,p.data.std(),p.grad.data.std())
model.features.0.weight tensor(0.2615, device='cuda:0') tensor(0.3769, device='cuda:0')
model.features.0.bias tensor(0.4862, device='cuda:0') tensor(0.3368, device='cuda:0')
model.features.3.squeeze.weight tensor(0.2738, device='cuda:0') tensor(0.3023, device='cuda:0')
model.features.3.squeeze.bias tensor(0.5867, device='cuda:0') tensor(0.3753, device='cuda:0')
model.features.3.expand1x1.weight tensor(0.2168, device='cuda:0') tensor(0.2883, device='cuda:0')
model.features.3.expand1x1.bias tensor(0.2256, device='cuda:0') tensor(0.1147, device='cuda:0')
model.features.3.expand3x3.weight tensor(0.0935, device='cuda:0') tensor(0.1605, device='cuda:0')
model.features.3.expand3x3.bias tensor(0.1421, device='cuda:0') tensor(0.0583, device='cuda:0')
model.features.4.squeeze.weight tensor(0.1976, device='cuda:0') tensor(0.2137, device='cuda:0')
model.features.4.squeeze.bias tensor(0.4058, device='cuda:0') tensor(0.1798, device='cuda:0')
model.features.4.expand1x1.weight tensor(0.2144, device='cuda:0') tensor(0.4214, device='cuda:0')
model.features.4.expand1x1.bias tensor(0.4994, device='cuda:0') tensor(0.0958, device='cuda:0')
model.features.4.expand3x3.weight tensor(0.1063, device='cuda:0') tensor(0.2963, device='cuda:0')
model.features.4.expand3x3.bias tensor(0.0489, device='cuda:0') tensor(0.0719, device='cuda:0')
model.features.6.squeeze.weight tensor(0.1736, device='cuda:0') tensor(0.3544, device='cuda:0')
model.features.6.squeeze.bias tensor(0.2420, device='cuda:0') tensor(0.0896, device='cuda:0')
model.features.6.expand1x1.weight tensor(0.1211, device='cuda:0') tensor(0.2428, device='cuda:0')
model.features.6.expand1x1.bias tensor(0.0670, device='cuda:0') tensor(0.0162, device='cuda:0')
model.features.6.expand3x3.weight tensor(0.0593, device='cuda:0') tensor(0.1917, device='cuda:0')
model.features.6.expand3x3.bias tensor(0.0227, device='cuda:0') tensor(0.0160, device='cuda:0')
model.features.7.squeeze.weight tensor(0.1207, device='cuda:0') tensor(0.2179, device='cuda:0')
model.features.7.squeeze.bias tensor(0.1484, device='cuda:0') tensor(0.0381, device='cuda:0')
model.features.7.expand1x1.weight tensor(0.1235, device='cuda:0') tensor(0.2279, device='cuda:0')
model.features.7.expand1x1.bias tensor(0.0450, device='cuda:0') tensor(0.0100, device='cuda:0')
model.features.7.expand3x3.weight tensor(0.0609, device='cuda:0') tensor(0.1628, device='cuda:0')
model.features.7.expand3x3.bias tensor(0.0132, device='cuda:0') tensor(0.0079, device='cuda:0')
model.features.9.squeeze.weight tensor(0.1093, device='cuda:0') tensor(0.2459, device='cuda:0')
model.features.9.squeeze.bias tensor(0.0646, device='cuda:0') tensor(0.0135, device='cuda:0')
model.features.9.expand1x1.weight tensor(0.0840, device='cuda:0') tensor(0.1860, device='cuda:0')
model.features.9.expand1x1.bias tensor(0.0177, device='cuda:0') tensor(0.0033, device='cuda:0')
model.features.9.expand3x3.weight tensor(0.0476, device='cuda:0') tensor(0.1393, device='cuda:0')
model.features.9.expand3x3.bias tensor(0.0058, device='cuda:0') tensor(0.0030, device='cuda:0')
model.features.10.squeeze.weight tensor(0.0872, device='cuda:0') tensor(0.1676, device='cuda:0')
model.features.10.squeeze.bias tensor(0.0484, device='cuda:0') tensor(0.0088, device='cuda:0')
model.features.10.expand1x1.weight tensor(0.0859, device='cuda:0') tensor(0.2145, device='cuda:0')
model.features.10.expand1x1.bias tensor(0.0160, device='cuda:0') tensor(0.0025, device='cuda:0')
model.features.10.expand3x3.weight tensor(0.0456, device='cuda:0') tensor(0.1429, device='cuda:0')
model.features.10.expand3x3.bias tensor(0.0070, device='cuda:0') tensor(0.0021, device='cuda:0')
model.features.11.squeeze.weight tensor(0.0786, device='cuda:0') tensor(0.2003, device='cuda:0')
model.features.11.squeeze.bias tensor(0.0422, device='cuda:0') tensor(0.0069, device='cuda:0')
model.features.11.expand1x1.weight tensor(0.0690, device='cuda:0') tensor(0.1400, device='cuda:0')
model.features.11.expand1x1.bias tensor(0.0138, device='cuda:0') tensor(0.0022, device='cuda:0')
model.features.11.expand3x3.weight tensor(0.0366, device='cuda:0') tensor(0.1517, device='cuda:0')
model.features.11.expand3x3.bias tensor(0.0109, device='cuda:0') tensor(0.0023, device='cuda:0')
model.features.12.squeeze.weight tensor(0.0729, device='cuda:0') tensor(0.1736, device='cuda:0')
model.features.12.squeeze.bias tensor(0.0814, device='cuda:0') tensor(0.0084, device='cuda:0')
model.features.12.expand1x1.weight tensor(0.0977, device='cuda:0') tensor(0.1385, device='cuda:0')
model.features.12.expand1x1.bias tensor(0.0102, device='cuda:0') tensor(0.0032, device='cuda:0')
model.features.12.expand3x3.weight tensor(0.0365, device='cuda:0') tensor(0.1312, device='cuda:0')
model.features.12.expand3x3.bias tensor(0.0038, device='cuda:0') tensor(0.0026, device='cuda:0')
model.classifier.1.weight tensor(0.0285, device='cuda:0') tensor(0.0865, device='cuda:0')
model.classifier.1.bias tensor(0.0362, device='cuda:0') tensor(0.0192, device='cuda:0')
ipdb> opt.lr # check the learning rate
0.001
ipdb> opt.lr = 0.002 # change the learning rate
ipdb> for p in optimizer.param_groups: \
p['lr'] = opt.lr
ipdb> model.save() # save the model
'checkpoints/squeezenet_20191004212249.pth'
ipdb> c # continue running until execution pauses at line 88
222it [16:38, 35.62s/it]> e:/Users/mac/Desktop/jupyter/mdFile/deeplearning/main.py(88)train()
87 optimizer.zero_grad()
1--> 88 score = model(input)
89 loss = criterion(score, target)
ipdb> s # step into model(input), i.e. model.__call__(input)
--Call--
> c:\programdata\anaconda3\lib\site-packages\torch\nn\modules\module.py(537)__call__()
536
--> 537 def __call__(self, *input, **kwargs):
538 for hook in self._forward_pre_hooks.values():
ipdb> n # next line
> c:\programdata\anaconda3\lib\site-packages\torch\nn\modules\module.py(538)__call__()
537 def __call__(self, *input, **kwargs):
--> 538 for hook in self._forward_pre_hooks.values():
539 result = hook(self, input)
ipdb> n # next line
> c:\programdata\anaconda3\lib\site-packages\torch\nn\modules\module.py(544)__call__()
543 input = result
--> 544 if torch._C._get_tracing_state():
545 result = self._slow_forward(*input, **kwargs)
ipdb> n # next line
> c:\programdata\anaconda3\lib\site-packages\torch\nn\modules\module.py(547)__call__()
546 else:
--> 547 result = self.forward(*input, **kwargs)
548 for hook in self._forward_hooks.values():
ipdb> s # step into the forward function
--Call--
> c:\programdata\anaconda3\lib\site-packages\torch\nn\modules\loss.py(914)forward()
913
--> 914 def forward(self, input, target):
915 return F.cross_entropy(input, target, weight=self.weight,
ipdb> input # inspect the value of input
tensor([[4.5005, 2.0725],
[3.5933, 7.8643],
[2.9086, 3.4209],
[2.7740, 4.4332],
[6.0164, 2.3033],
[5.2261, 3.2189],
[2.6529, 2.0749],
[6.3259, 2.2383],
[3.0629, 3.4832],
[2.7008, 8.2818],
[5.5684, 2.1567],
[3.0689, 6.1022],
[3.4848, 5.3831],
[1.7920, 5.7709],
[6.5032, 2.8080],
[2.3071, 5.2417],
[3.7474, 5.0263],
[4.3682, 3.6707],
[2.2196, 6.9298],
[5.2201, 2.3034],
[6.4315, 1.4970],
[3.4684, 4.0371],
[3.9620, 1.7629],
[1.7069, 7.8898],
[3.0462, 1.6505],
[2.4081, 6.4456],
[2.1932, 7.4614],
[2.3405, 2.7603],
[1.9478, 8.4156],
[2.7935, 7.8331],
[1.8898, 3.8836],
[3.3008, 1.6832]], device='cuda:0', grad_fn=<AsStridedBackward>)
ipdb> input.data.mean() # check the mean and standard deviation of input
tensor(3.9630, device='cuda:0')
ipdb> input.data.std()
tensor(1.9513, device='cuda:0')
ipdb> u # move up one frame
> c:\programdata\anaconda3\lib\site-packages\torch\nn\modules\module.py(547)__call__()
546 else:
--> 547 result = self.forward(*input, **kwargs)
548 for hook in self._forward_hooks.values():
ipdb> u # move up one frame
> e:/Users/mac/Desktop/jupyter/mdFile/deeplearning/main.py(88)train()
87 optimizer.zero_grad()
1--> 88 score = model(input)
89 loss = criterion(score, target)
ipdb> clear # clear all breakpoints
Clear all breaks? y
Deleted breakpoint 1 at e:/Users/mac/Desktop/jupyter/mdFile/deeplearning/main.py:88
ipdb> c # continue running; remember to delete "debug/debug.txt" first, otherwise the program will soon drop back into debug mode
59it [06:21, 5.75it/s]loss: 0.24856307208538073
76it [06:24, 5.91it/s]
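When we want to enter debug mode, whether to change some parameter values or to analyze the program, we can create the debug flag file, and the program will then enter debug mode. Once debugging is finished, delete the file and type c at the ipdb prompt to continue running. The same approach works for exiting the program: create the debug flag file, then type quit to leave the debugger and the program together. Compared with pressing Ctrl + C, this is a safer way to exit, because it ensures that the multi-process data-loading workers also exit correctly and release memory, GPU memory, and other resources.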
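The flag-file mechanism can be sketched roughly as follows. This is an illustration under stated assumptions rather than the chapter's actual training code: the helper `maybe_debug` and its `enter_debugger` parameter are hypothetical, and the flag path is supplied by the caller. The demo passes a stand-in callable instead of a real debugger so that it runs without pausing.

```python
import os
import pdb
import tempfile


def maybe_debug(flag_path, enter_debugger=pdb.set_trace):
    """Enter the debugger only when the flag file exists.

    Returns True if the debugger was entered. Because set_trace() is called
    inside this helper, the prompt opens in this frame; type `u` to move up
    into the caller (or inline the check in the training loop instead).
    """
    if os.path.exists(flag_path):
        enter_debugger()
        return True
    return False


# Demo with a stand-in for the debugger, so nothing blocks on input.
calls = []
with tempfile.TemporaryDirectory() as tmp:
    flag = os.path.join(tmp, "debug.txt")  # hypothetical flag file name
    before = maybe_debug(flag, enter_debugger=lambda: calls.append("stop"))
    open(flag, "w").close()                # "touch" the flag file
    after = maybe_debug(flag, enter_debugger=lambda: calls.append("stop"))

print(before, after, calls)
```

In a training loop, the check would typically run once per iteration, so creating the file from another terminal pauses training at the next batch and deleting it lets training run uninterrupted again.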
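When a cuDNN call made by PyTorch fails, the error message is something like CUDNN_STATUS_BAD_PARAM, which offers very little useful information. In that case it is best to run the code on the CPU first, which usually produces a friendlier error message. For example, run model.cpu()(input.cpu()) in ipdb; PyTorch's underlying TH library will then report the problem in more detail.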
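A small sketch of that CPU re-run is shown below. The helper `run_on_cpu` is a hypothetical name, and the shape mismatch is a deliberately simple stand-in for whatever actually triggers the GPU error; the exact wording of the message depends on the PyTorch version. Note that `model.cpu()` moves the model in place, so it has to be moved back to the GPU before training resumes.

```python
import torch
from torch import nn


def run_on_cpu(model, batch):
    """Re-run a forward pass on the CPU and return the error message, if any."""
    try:
        model.cpu()(batch.cpu())
    except RuntimeError as err:
        return str(err)
    return None


layer = nn.Linear(4, 2)          # expects 4 input features
bad_batch = torch.randn(3, 5)    # but the batch has 5 features
message = run_on_cpu(layer, bad_batch)
print(message)
```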
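You may also frequently find that the program runs without any error but the model does not converge. For a binary classification problem, for example, the cross-entropy loss may hover around 0.69 (ln 2 ≈ 0.693, the loss of a classifier that predicts 0.5 for every sample), or values may overflow. In that case, enter debug mode and single-step through the model, checking the mean and variance of each layer's output to see at which layer the values start to look abnormal. Also check the mean and variance of each parameter's gradient to see whether gradients are vanishing or exploding. In general, adding BatchNorm layers before the activation functions, initializing parameters sensibly, and using the Adam optimizer with a learning rate of 0.001 is usually enough for the model to converge to some degree.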
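Instead of single-stepping through every layer, the same statistics can be collected in one pass with forward hooks. The sketch below is one possible way to do it, not the chapter's own code: `attach_stat_hooks` and the toy model are hypothetical, and it reports the standard deviation (as the ipdb session above does) rather than the variance.

```python
import torch
from torch import nn


def attach_stat_hooks(model, stats):
    """Register forward hooks that store (mean, std) of each leaf module's output."""
    handles = []
    for name, module in model.named_modules():
        if len(list(module.children())) > 0:
            continue  # skip containers; only leaf layers are recorded

        def hook(mod, inputs, output, name=name):
            out = output.detach()
            stats[name] = (out.mean().item(), out.std().item())

        handles.append(module.register_forward_hook(hook))
    return handles


torch.manual_seed(0)
# Toy binary classifier with BatchNorm placed before the activation.
model = nn.Sequential(
    nn.Linear(8, 16), nn.BatchNorm1d(16), nn.ReLU(), nn.Linear(16, 2)
)
criterion = nn.CrossEntropyLoss()

stats = {}
handles = attach_stat_hooks(model, stats)

x = torch.randn(32, 8)
target = torch.randint(0, 2, (32,))
loss = criterion(model(x), target)
loss.backward()

# Per-layer output statistics: look for the first layer with abnormal values.
for name, (mean, std) in stats.items():
    print(f"output {name}: mean={mean:.4f} std={std:.4f}")

# Per-parameter gradient statistics: look for vanishing or exploding gradients.
grad_stats = {
    name: (p.grad.mean().item(), p.grad.std().item())
    for name, p in model.named_parameters()
    if p.grad is not None
}
for name, (mean, std) in grad_stats.items():
    print(f"grad {name}: mean={mean:.6f} std={std:.6f}")

for handle in handles:
    handle.remove()  # detach the hooks once the inspection is done
```

The same loops can be typed at the ipdb prompt once a breakpoint after `loss.backward()` has been hit.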
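This chapter implemented a classic Kaggle competition from scratch, focusing on how to organize a program sensibly, and introduced some techniques for debugging in PyTorch. The next chapter begins the hands-on programming proper; some details will not be explained as thoroughly there, so be prepared.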