We call quantization via setDynamicRange or a calibration table *implicit* int8 precision, and quantization via the Q/DQ nodes generated by pytorch-quantization *explicit* int8 precision.
The difference is that for implicit quantization, TRT tries to optimize for the best performance, while for explicit quantization, TRT must guarantee the same accuracy as in the original framework while optimizing performance. So for explicit quantization there are rules for how the Q/DQ nodes are propagated and fused with other nodes. Putting Q/DQ everywhere would slow down performance; NVIDIA has a doc on how Q/DQ placement impacts performance.
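Either way, int8 quantization boils down to choosing a per-tensor dynamic range. A minimal numpy sketch (values and names are illustrative, not from the source) of the symmetric int8 fake-quantization that a Q/DQ node pair, or an equivalent setDynamicRange call, implies:

```python
import numpy as np

def fake_quant_int8(x, amax):
    """Symmetric int8 quantize -> dequantize, as a Q/DQ node pair does."""
    scale = amax / 127.0                          # symmetric int8 range [-127, 127]
    q = np.clip(np.round(x / scale), -127, 127)   # Q: snap fp32 onto the int8 grid
    return q * scale                              # DQ: map back to fp32

x = np.array([0.05, -1.2, 3.7, -0.004], dtype=np.float32)
amax = np.abs(x).max()          # what setDynamicRange(-amax, amax) would register
xq = fake_quant_int8(x, amax)
print(np.abs(x - xq).max())     # within range, error is bounded by scale / 2
```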
-
PTQ using TensorRT's closed-source (built-in) calibration
https://github.com/lix19937/trt-samples-for-hackathon-cn/tree/master/cookbook/03-BuildEngineByTensorRTAPI/MNISTExample-pyTorch/C%2B%2B
https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_int8_calibrator.html
| Type | Description |
|------|-------------|
| IInt8EntropyCalibrator | Entropy calibrator. This is the legacy entropy calibrator. It is less complicated than the legacy calibrator and produces better results. |
| IInt8EntropyCalibrator2 | Entropy calibrator 2. This is the preferred calibrator. It is the required calibrator for DLA, as it supports per-activation-tensor scaling. |
| IInt8LegacyCalibrator | Legacy calibrator left for backward compatibility with TensorRT 2.0. This calibrator requires user parameterization, and is provided as a fallback option if the other calibrators yield poor results. |
| IInt8MinMaxCalibrator | MinMax calibrator. It supports per-activation-tensor scaling. |
NVIDIA/TensorRT#3205 (comment)
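A library-free toy sketch of the idea behind the entropy calibrators (the function names, bin counts, and level counts here are my own simplifications, not TensorRT's implementation): scan candidate clipping thresholds and keep the one whose coarsely re-quantized histogram has the lowest KL divergence from the fp32 histogram.

```python
import numpy as np

def kl_divergence(p, q):
    p, q = p / p.sum(), q / q.sum()
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def entropy_amax(x, num_bins=2048, num_levels=128):
    """Toy KL-based threshold search over the histogram of |x|."""
    hist, edges = np.histogram(np.abs(x), bins=num_bins)
    best_kl, best_amax = np.inf, float(edges[-1])
    for i in range(num_levels, num_bins + 1):
        ref = hist[:i].astype(np.float64).copy()
        ref[-1] += hist[i:].sum()           # clip outliers into the last kept bin
        # re-quantize the i fine bins down to num_levels coarse buckets ...
        bucket = np.floor(np.arange(i) * num_levels / i).astype(int)
        coarse = np.bincount(bucket, weights=ref, minlength=num_levels)
        # ... then spread each bucket's mass back over its nonzero fine bins
        cand = np.zeros(i)
        for b in range(num_levels):
            sel = (bucket == b) & (ref > 0)
            if sel.sum() > 0:
                cand[sel] = coarse[b] / sel.sum()
        kl = kl_divergence(ref, cand)
        if kl < best_kl:
            best_kl, best_amax = kl, float(edges[i])
    return best_amax

rng = np.random.default_rng(0)
acts = rng.standard_normal(8192).astype(np.float32)
amax = entropy_amax(acts, num_bins=512, num_levels=32)
print(amax, float(np.abs(acts).max()))
```

For a heavy-tailed distribution this typically picks a threshold below the raw max, trading a little clipping error for much finer resolution of the bulk of the values.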
-
PTQ using the open-source method: set up Q/DQ with pytorch-quantization
https://github.com/lix19937/pytorch-quantization/tree/main/pytorch_quantization/calib
- max
- hist (histogram-based), with amax computed via:
  - entropy (cross-entropy)
  - mse
  - percentile (statistical quantile)
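The criteria above can be illustrated without the library. A numpy sketch (variable names are my own) of how max, percentile, and mse each pick an amax for the same activations:

```python
import numpy as np

def fake_quant(x, amax):
    """Symmetric int8 quantize -> dequantize at clipping threshold amax."""
    scale = amax / 127.0
    return np.clip(np.round(x / scale), -127, 127) * scale

rng = np.random.default_rng(0)
acts = rng.standard_normal(100_000).astype(np.float32)

amax_max = float(np.abs(acts).max())            # max: no clipping at all
amax_pct = float(np.percentile(np.abs(acts), 99.9))  # percentile: clip top 0.1%

# mse: scan candidate thresholds, keep the one minimizing quantization MSE
cands = np.linspace(0.1, 1.0, 50) * amax_max
mses = [np.mean((acts - fake_quant(acts, a)) ** 2) for a in cands]
amax_mse = float(cands[int(np.argmin(mses))])

print(amax_max, amax_pct, amax_mse)
```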
See https://github.com/lix19937/pytorch-quantization/blob/main/readme_lix.md for details.
- During PTQ calibration, fuse_bn can be applied to fold BN into the preceding conv, avoiding calibration of BN layers and reducing both calibration time and calibration error.
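A minimal sketch of the BN folding that makes this possible (treating the conv as a plain linear layer for brevity; variable names are mine): BN(Wx + b) equals a single linear layer with W' = γ/√(σ²+ε)·W and b' = γ(b−μ)/√(σ²+ε) + β.

```python
import numpy as np

rng = np.random.default_rng(0)
W, b = rng.standard_normal((4, 8)), rng.standard_normal(4)        # "conv" weights
gamma, beta = rng.standard_normal(4), rng.standard_normal(4)      # BN affine params
mu, var, eps = rng.standard_normal(4), rng.random(4) + 0.1, 1e-5  # BN running stats

x = rng.standard_normal((16, 8))
y_ref = (x @ W.T + b - mu) * gamma / np.sqrt(var + eps) + beta    # conv then BN

s = gamma / np.sqrt(var + eps)
W_fused = W * s[:, None]                # fold BN scale into the weights ...
b_fused = (b - mu) * s + beta           # ... and BN shift into the bias
y_fused = x @ W_fused.T + b_fused       # one layer, same output, nothing to calibrate for BN

print(np.allclose(y_ref, y_fused))
```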
- Find all quant layers.
- Enable only one quant layer at a time, run the metric eval, and record the results in a dict: {"layer_name": eval_value}.
- If calibrating with pytorch-quantization, the sensitive layers are searched for in PyTorch.
- If calibrating with TensorRT, the search is done against the fp32 baseline, using TensorRT or onnxruntime.
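The per-layer sweep above can be sketched framework-agnostically. The quant-layer class and eval metric here are hypothetical stand-ins (in practice they would be pytorch-quantization TensorQuantizer modules and your real accuracy eval), but the enable-one-at-a-time loop is the same:

```python
# Hypothetical stand-ins for quant layers and the eval metric.
class QuantLayer:
    def __init__(self, name):
        self.name, self.enabled = name, False

def evaluate(model_layers):
    # stand-in metric: pretend each enabled quant layer costs some accuracy
    cost = {"conv1": 0.001, "conv2": 0.030, "fc": 0.005}
    return 0.95 - sum(cost[l.name] for l in model_layers if l.enabled)

layers = [QuantLayer(n) for n in ("conv1", "conv2", "fc")]

sensitivity = {}
for probe in layers:                     # enable exactly one quant layer per run
    for l in layers:
        l.enabled = (l is probe)
    sensitivity[probe.name] = evaluate(layers)

# most sensitive layer = largest accuracy drop when quantized alone;
# such layers are candidates to keep in higher precision
ranked = sorted(sensitivity, key=sensitivity.get)
print(ranked[0])
```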