PaddlePaddle 1.5.0
XiaoguangHu01 released this on 01 Jul 09:49 · 64 commits to release/1.5 since this release
Release Notes
重要更新
- 训练性能在数据读取、执行调度优化、OP计算逻辑及底层cudnn、CUDAKernel、MKLDNN等方面进行了大量优化,训练性能大幅提升;进一步优化显存占用,整体具备领先优势
- 新增基于Padding方式实现的LSTM、GRU,更方便用户学习和使用;并基于对应API新增语言模型、seq2seq翻译模型的示例模型;增强部分OP功能,更好地支持NLP中Tensor多个维度可变的任务
- 正式发布动态图Preview版并提供相关的API文档,并提供 7个模型动态图版本官方实现
- 官方模型库方面正式发布PaddleDetection物体检测统一框架,覆盖主流目标检测算法,易扩展和模块化组合使用;发布图像生成库,覆盖主流的GAN算法,可一键式运行;发布PaddleNLP-Research,包含百度在 NLP 领域最新研究工作
- 模型压缩框架PaddleSlim新增基于模拟退火的自动剪切策略和轻量级模型结构自动搜索功能(Light-NAS)
- 分布式训练发布HighLevel API Fleet,单机转分布式训练成本显著降低;GPU多机多卡性能显著提升,在ResNet50、BERT、ERNIE等模型中4x8 v100配置下相比此前发布的Benchmark提速超过50%
- PaddleHub新增29个预训练模型,总计覆盖文本、图像、视频三大领域40个模型,并全面提升易用性,发布PaddleHub官网
- 发布图学习框架PGL(Paddle Graph Learning) Preview版,提供基于游走以及消息传递两种计算范式去搭建最前沿的图学习算法
训练框架
- 安装&环境:
- 增加Linux下对CUDA 10的支持,增加Windows下对CUDA 9的支持,cuDnn版本统一为7.3+
- 安装包不按照CPU处理器是否支持AVX指令集做区分,支持自动判断并选择使用AVX指令集或不使用AVX指令集
- 针对Python2、Python3下可能版本不兼容的依赖包限制了版本范围,以支持Python相应环境下正确安装
- 提供可全离线安装PaddlePaddle的Docker镜像
- 增加安装后的GPU多卡运行检测
- 解除GPU单卡训练时对NCCL的依赖
- 动态图Preview版:
- 发布动态图相关的API和文档
- 基础功能完善,显存和速度优化,支持GPU单机多卡训练
- 增加transformer、ocr recognition、resnet、language model等7个模型效果对齐的动态图版本实现
- 性能优化:
- 数据读取优化
- 使用多进程优化数据读取、预处理部分,DeepLab V3+单GPU训练获得63%的性能提升。
- Op计算逻辑优化
- 优化concat/split op输入/输出个数<=4的实现,避免1次CPU->GPU的数据传输。
- 优化recurrent op中执行器的调用方法,修改成在迭代前调用一次executor.Prepare,迭代中executor.RunPreparedContext执行计算,从而避免每次迭代反复创建op。该优化对PaddingRNN padding small和large模型分别带来23%和15%的性能提升。
- 融合优化器Momentum op的计算,对Resnet50单GPU、4 GPU训练分别可带来1.6%、10.6%的性能提升。
- cuDnn使用策略优化
- 使用cuDnn v7中新增的算法选择API cudnnGetConvolutionForwardAlgorithm_v7优化conv_cudnn op算法选择策略,Mask-RCNN和YoloV3单GPU训练分别取得32%和11%的加速。
- 一些op的cuDnn实现慢于cuda实现,比如conv2d_transpose、pool2d(global_pooling=True)时,设置use_cudnn=False后,Cycle GAN、SE-ResNeXt单GPU训练分别获得33%、34%的性能提升。
- Op CUDAKernel优化
- 使用精心优化的CUDA kernel优化sum op,对多个LoDTensor求和这种情况优化效果特别明显,GPU执行获得3.3x的加速。
- 使用2D线程Block配置优化elementwise_mul grad op,加速其CUDA Kernel中的Broadcast操作。
- Intel CPU底层计算优化
- 增加新的OP融合Pass(conv+relu6,conv_transpose+elementwise_add)
- 增加新的FP32 MKLDNN kernel (FC),INT8 MKLDNN kernel (Concat)
- 优化若干OP,包括sequence_reverse(前向), sequence_padding(前向), sequence_unpad(反向),bilinear interpolate(前向)
- 优化MKLDNN集成(如对reorder原语进行重用以减少每次创建原语的时间)
- 显存优化:
- Op层显存优化(在Transformer、Mask-RCNN等模型上显存节省1G以上)
- 提高了inplace策略的覆盖面,支持sum、softmax、softmax_with_cross_entropy等op的inplace计算
- 修复了dropout、conv_transpose、activation op的反向注册,降低op的显存占用
- 显存分配与显存复用策略重构
- 重构Allocator底层架构,为后续扩展Allocator策略提供基础
- 重构Inplace策略,使其代码便于维护,并排除之前策略中变量可能存在误inplace、graph存在环等bug
- 配置优化
- 用户可通过环境变量FLAGS_conv_workspace_size_limit设置conv层的最大workspace size,单位为MB
- 执行优化:
- 更新CPU_NUM的默认配置为1,之前为设备的逻辑总核数。
- 对Operator中OpKernel进行cache,避免每次run都重复的选择kernel。
- ParallelExecutor执行模式(CompiledProgram.with_data_parallel())下的优化:减少同步操作;优化在num_thread=1时的速度,对于小模型的速度提升较为明显。(对于PaddingRNN small model 速度提升16%)
- 框架基础功能增强
- build_strategy新增mkldnn_enabled_op_types选项,用户可以灵活地控制哪些op需要使用mkldnn kernel以获得加速
- 新增ParallelExecutor下的drop_local_exe_scopes接口,可以控制什么时候清理local scope中的数据,num_iteration_per_drop_scope的设置依然有效
- 新增自动混合精度训练接口fluid.contrib.mixed_precision.decorate(),支持图像分类、BERT等模型的训练
- 新增fluid.gradients接口,11个操作支持做二次反向,使用于图像生成的梯度惩罚功能
- Intel nGraph图编译引擎支持加强,增加了Bert模型所需的op支持,可以通过Intel nGraph图编译引擎进行BERT模型训练,收敛效果对齐。
- OP完善
- 增强fused_elewise_activation op的功能,添加对x+sigmoid(y)、x+tanh(y)计算模式的支持
- 新增指数滑动平均(Exponential Moving Average), 使模型训练更加平滑稳定
- 新增sigmoid_focal_loss损失函数
- 新增deformable RoI pooling操作
- 新增deformable convolution v2操作
- 提供unfold操作(即im2col操作)
预测部署
- 服务端部署库
- 优化显存优化功能。DAM模型显存占用从4G下降至940M; MobileNet 模型显存占用从1G下降至500M。
- 将Paddle-TRT的优化过程迁移到模型初始化期间,解决Paddle-TRT初次预测时间过长的问题。例如使MobileNet初次预测时间从秒级别下降至毫秒级。
- 解决使用AnalysisPredictor从内存载入模型时,模型参数多次内存分配的问题。
- 增强Python预测API,并在官网文档预测部署下增加Python预测API的使用说明。
- Intel INT8 量化预测持续加强
- 持续优化INT8量化框架(训练后量化),新增五个模型( GoogleNet, MobileNetV2, VGG16, VGG19, ResNet101);与FP32模型相比,精度损失均在1%以内,性能提升2~3.7倍
- 支持QAT(训练中量化)训练出来的模型运行在INT8 kernel上,通过Pass对QAT模型进行修改,使其能运行在INT8 kernel上(目前支持 量化/去量化/卷积),在7个模型上(GoogleNet, MobileNetV1, MobileNetV2, VGG16, VGG19, ResNet50, ResNet101),和在FP32 kernel上模拟运行相比,精度变化在0.1%以内
- Paddle Serving
- 支持GPU设备;支持多卡并行预测
- 提供SE_ResNeXt50_32x4d模型作为标准示例,给出图像分类任务上单卡多并发、多卡多并发等场景benchmark
- 支持大规模稀疏参数任务:用于CTR预估等场景下超大规模embedding的存储和在线访问。一期发布单机版本,支持亿级别embedding访问
- 易于使用的API接口,API demo示例
- PaddleSlim
- 集成INT8量化框架
- 新增自动剪切策略,基于模拟退火算法搜索最优剪切率:对比MobileNet V1在ImageNet 1000类分类任务上FLOPS减少50%; Top1-Accuracy=69.7%
- 新增轻量级模型结构自动搜索功能(Light-NAS):对比MobileNet V1在ImageNet 1000类分类任务上精度无损情况下FLOPS 减少17%
分布式训练
- 分布式High-Level API Fleet
- 分布式训练统一API,支持参数服务器(Parameter Server)和Collective模式训练,大幅度降低用户从单机切换到多机训练的新增代码量
- 用户可以通过配置分布式策略调用不同的并行训练方法,对于不同的分布式环境支持多种内建RoleMaker,方便用户调用
- 参数服务器(Parameter Server)训练新增Communicator设计
- 独立通信逻辑到Communicator,简化异步训练逻辑
- 提供可控制通信开关,可针对不同模型针对性调优
- GPU多机多卡增加多个提升扩展性Feature,NLP/CV经典模型下多机多卡训练提速50%
- 新增Fused All Reduce:通过对gradient tensor进行自动合并,降低参数同步次数
- 新增Hierarchical All Reduce:层次化all reduce操作
- 新增All Reduce通信并发能力:增加多机训练下,训练对网络波动的容忍能力
- 新增反向与优化算法之间的依赖分析:提升通信与计算overlap并发的能力
- 以上新增能力融合可实现在Bert Large(batch 16 x 128)和Resnet50(batch 32)上多机(v100 8*4 卡)训练速度比PaddlePaddle1.4.1提速50%+。
- GPU多机多卡Benchmark更新
- ResNet50、VGG16、Transformer和Bert上的速度对比,并提供可复现的benchmarks脚本。
- CPU-GPU异构设备流水线并行能力支持
- 新增流水线并行能力,可支持用户自定义在异构硬件分配计算OP,通过流水线交换数据,从而实现异构计算设备的搭配和计算资源的自由配比,提升训练速度。
- 在IO量大、计算量较小的场景例如CTR预估,Graph Neural Network下相比纯GPU训练有明显速度优势。
模型建设
- 图像分类
- 发布9个ImageNet预训练模型,包含ResNet50_vc, ResNet50_vd, ResNet101_vd, ResNet152_vd, ResNet 200_vd, ResNeXt101_64x4d, ResNeXt101_vd_64x4d, SENet154_vd, InceptionV4
- ResNet50_vd相比已发布的ResNet50效果提升2.62%,可以达到ResNet101精度。ResNet101_vd相比已发布ResNet101效果提升1.88%
- PaddleDetection
- 发布PaddleDetection物体检测统一框架,包含Faster-RCNN (支持FPN), Mask-RCNN (支持FPN), Cascade-RCNN, RetinaNet, Yolo v3, SSD算法,其中FPN, CascadeRCNN, RetinaNet是本次新增算法。
- 发布一系列预训练模型,其中RCNN系列模型支持ResNet, ResNet_vd, ResNeXt, ResNeXt_vd, SEResNeXt主干网络。Yolo v3持续增加更加轻量的ResNet34, MobileNet主干网络,并发布预训练模型
- PaddleGAN
- 发布PaddleGAN图像生成库,包含CGAN、DCGAN、CycleGAN、Pix2Pix、StarGAN、AttGAN、STGAN,支持多种数据集,支持经典的GAN网络结构。其中STGAN是百度视觉技术部自研的任意图像属性编辑模型。
- PaddleVideo
- 优化已经发布的分类模型,NeXtVLAD训练速度提升60%, TSM速度领先竞品39%
- 增加已发布的模型骨干网络,Nonlocal模型增加ResNet101和I3d网络结构
- 增加动作定位模型C-TCN,百度2018年ActivityNet比赛夺冠方案
- PaddleNLP
- ERNIE / BERT支持动态混合精度训练;支持以多进程的方式进行多卡任务的训练,提高了多卡加速比;优化多机多卡训练的加速比,在 V100 GPU集群上将 6 机相对于单机的 FP32 训练加速效率提高至76%
- 发布PaddleNLP-Research,开源MRQA2019阅读理解竞赛Paddle Fluid基线、 DuConv (ACL2019)、ARNOR(ACL2019)、MMPMS(IJCAI2019)、MPM(NAACL2019) 等近期百度在 NLP 学术领域的工作
工具组件
- PaddleHub
- 全新发布PaddleHub官网,易用性全面提升
- 新增网站http://www.paddlepaddle.org.cn/hub, 包含PaddlePaddle生态的预训练模型使用介绍
- 迁移学习Demo接入AI Studio与AI Book,无需安装即可快速体验
- 新增PaddleHub后端服务,支持模型检索、下载、私有化部署等功能
- 新增29个预训练模型,覆盖文本、图像、视频三大领域;目前官方提供40个预训练模型
- CV预训练模型
- 新增图像分类预训练模型11个:SE_ResNeXt, GoogleNet, ShuffleNet等
- 新增目标检测模型Faster-RCNN和YOLOv3
- 新增图像生成模型CycleGAN
- 新增人脸检测模型Pyramidbox
- 新增视频分类模型4个: TSN, TSM, StNet, Non-Local
- NLP预训练模型
- 新增语义模型ELMo
- 新增情感分析模型3个: Senta-BOW, Senta-CNN, Senta-GRNN
- 新增中文情绪识别模型EmoTect
- 新增中文语义相似度分析模型Simnet
- 升级LAC词法分析模型,新增词典干预功能,支持用户自定义分词
- Fine-tune API升级,灵活性与性能全面提升
- 支持多卡并行、PyReader多线程IO,ERNIE文本分类Fine-tune速度提升60%
- 简化finetune、evaluate、predict等使用逻辑,提升易用性
- 增加事件回调功能,方便用户快速实现自定义迁移学习任务
- 新增多标签分类Fine-tune任务
- 图学习框架PGL (Paddle Graph Learning)
- 发布基于PaddlePaddle的图学习框架PGL Preview版,提供基于游走 (Walk Based) 以及消息传递(Message Passing)两种计算范式去搭建最前沿的图学习算法,如图表征学习、图神经网络等。PGL充分利用Paddle LoD Tensor特性大幅提升Message-Passing范式中信息聚合效率,兼顾了灵活性和高效性
- 新增基于PGL实现的GCN、GAT,在多个数据集达到SOTA水平
- 新增基于大规模子图采样模型Graphsage模型,单机可支持5千万节点、20亿条边的巨图
- 新增node2vec,deepwalk等图表征学习方法,达到SOTA水平
- 新增PGL文档、API、Tutorial等材料
BUG修复
- 修复softmax_with_cross_entropy操作CPU版本中ignore_label不支持在0到类别数之外label的问题
- 修复import paddle之后logging.basicConfig设置失效问题
- 修复python/paddle/fluid/layers/ops.py在python3下报错的问题
- 修复sequence unpad op在训练过程中不稳定的问题
- 修复Concat Op属性axis为负数时挂掉的问题
- 修复了enable_inplace和memory_optimize的潜在bug,保证某些op的输出变量不会被错误地复用
- 修复了Eager Deletion策略可能会提前误删变量存储空间的bug,提高Eager Deletion策略的稳定性
- 修复了模型图分析中拓扑排序存在bug导致的在相同模型的输入情况下有不同的模型图的生成情况
- 修复了预测结束后其他服务线程OMP线程冲突的问题。修复为在CPU模式下,预测引擎会在预测结束后将全局的OMP线程数设回为1
Release Notes
Table of Contents
- Highlights
- Fundamental framework updates
- Installation
- Dynamic Graph Preview Version
- Performance Optimization
- Optimization of Memory
- Execution optimization
- Framework basic functions enhancements
- OP improvements
- Inference engine
- Server-side Deployment Library
- Paddle Serving
- PaddleSlim
- Distributed training
- Model construction
- Image classification
- PaddleDetection
- PaddleGAN
- PaddleVideo
- PaddleNLP
- Tools and Components
- Bug fixes
Highlights
- Training performance is substantially improved through optimizations in data reading, execution scheduling, Op computing logic, and the underlying cuDNN calls, CUDA kernels and MKLDNN. GPU memory usage is further optimized, giving an overall leading advantage.
- Add padding-based implementations of LSTM and GRU, which are easier to learn and use; add example language-model and seq2seq translation models built on the corresponding APIs; enhance some OPs to better support NLP tasks in which multiple Tensor dimensions are variable.
- Officially release the Preview version of dynamic graph together with the related API documentation, and provide official dynamic graph implementations of 7 models.
- In the official model library, release PaddleDetection, a unified object detection framework that covers mainstream detection algorithms and is easy to extend and to combine modularly; release an image generation library that covers mainstream GAN algorithms and can be run with one command; release PaddleNLP-Research, which contains Baidu's latest research work in the NLP field.
- The model compression framework PaddleSlim adds an automatic pruning strategy based on simulated annealing and a lightweight model architecture search feature (Light-NAS).
- Distributed training releases the High-Level API Fleet, which significantly reduces the cost of moving from single-machine to distributed training. GPU multi-machine multi-card performance is significantly improved: with a 4x8 V100 configuration, ResNet50, BERT and ERNIE train more than 50% faster than in the previously released benchmark.
- PaddleHub adds 29 pre-trained models, for a total of 40 models covering text, image and video; ease of use is improved across the board and the PaddleHub official website is released.
- Release the Preview version of the graph learning framework PGL (Paddle Graph Learning), which provides two computational paradigms, walk-based and message-passing, for building state-of-the-art graph learning algorithms.
Fundamental Framework Updates
- Installation
- Add support for CUDA 10 on Linux; add support for CUDA 9 on Windows; unify the cuDNN requirement to 7.3+.
- Installation packages are no longer split by whether the CPU supports the AVX instruction set; the package detects this automatically and chooses whether to use AVX.
- Limit the version ranges of dependencies that may be incompatible under Python2 or Python3, so that installation succeeds in the corresponding Python environment.
- Provide a Docker image that supports fully offline installation of PaddlePaddle.
- Add a post-installation check for GPU multi-card execution.
- Remove the dependency on NCCL for single-card GPU training.
- Dynamic Graph Preview Version
- Release the APIs and documentation related to dynamic graph.
- Improve the basic functions; optimize memory usage and speed; support single-machine multi-card GPU training.
- Add dynamic graph implementations of 7 models, including transformer, ocr recognition, resnet and language model, with aligned model results.
- Performance Optimization
- Data reading optimization
- Use multiple processes to optimize data reading and pre-processing; DeepLab V3+ single-GPU training achieves a 63% performance improvement.
- Op computing logic optimization
- Optimize the implementation of concat/split op when the number of inputs/outputs is <= 4, avoiding one CPU -> GPU data transfer.
- Optimize how the executor is called in recurrent op: executor.Prepare is now called once before the iterations, and executor.RunPreparedContext performs the computation inside the iterations, which avoids re-creating ops in every iteration. This optimization brings 23% and 15% performance improvements to the PaddingRNN padding small and large models, respectively.
- Fuse the computation of the optimizer Momentum op, bringing 1.6% and 10.6% performance improvements to Resnet50 single-GPU and 4-GPU training, respectively.
- cuDNN usage strategy optimization
- Use cudnnGetConvolutionForwardAlgorithm_v7, the new algorithm selection API in cuDNN v7, to optimize the algorithm selection strategy of conv_cudnn op, bringing 32% and 11% acceleration to Mask-RCNN and YoloV3 single-GPU training, respectively.
- Some ops have cuDNN implementations that are slower than their CUDA implementations, such as conv2d_transpose and pool2d (with global_pooling=True). Setting use_cudnn=False for them improves Cycle GAN and SE-ResNeXt single-GPU training performance by 33% and 34%, respectively.
- Op CUDA kernel optimization
- Optimize the sum op with a carefully tuned CUDA kernel, achieving a 3.3x speedup in GPU execution; the effect is particularly pronounced when summing multiple LoDTensors.
- Optimize elementwise_mul grad op with a 2D thread block configuration to speed up the Broadcast operation in its CUDA kernel.
- Low-level computing optimization for Intel CPU
- Add new OP fusion passes (conv+relu6, conv_transpose+elementwise_add).
- Add a new FP32 MKLDNN kernel (FC) and a new INT8 MKLDNN kernel (Concat).
- Optimize several OPs, including sequence_reverse (forward), sequence_padding (forward), sequence_unpad (backward), and bilinear interpolate (forward).
- Optimize MKLDNN integration (for example, reuse reorder primitives to avoid the cost of creating a primitive every time).
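The executor change in recurrent op listed above follows a "prepare once, run many times" pattern. The sketch below illustrates only that idea in plain Python; `ToyExecutor`, `prepare` and `run_prepared_context` are hypothetical stand-ins and are not Paddle's `executor.Prepare` / `executor.RunPreparedContext`.

```python
class ToyExecutor:
    """Hypothetical stand-in for an executor; counts how often ops are created."""

    def __init__(self):
        self.ops_created = 0

    def prepare(self, op_names):
        # Creating op objects is the expensive step we want to do only once.
        self.ops_created += len(op_names)
        return [(name, lambda x: x + 1) for name in op_names]

    def run_prepared_context(self, ctx, value):
        # Reuse the already created ops; nothing is constructed here.
        for _name, fn in ctx:
            value = fn(value)
        return value


def run_naive(executor, op_names, steps):
    value = 0
    for _ in range(steps):
        ctx = executor.prepare(op_names)  # ops re-created in every iteration
        value = executor.run_prepared_context(ctx, value)
    return value


def run_prepared(executor, op_names, steps):
    ctx = executor.prepare(op_names)  # ops created once, before the loop
    value = 0
    for _ in range(steps):
        value = executor.run_prepared_context(ctx, value)
    return value


naive_exe, prepared_exe = ToyExecutor(), ToyExecutor()
naive_result = run_naive(naive_exe, ["op_a", "op_b"], steps=5)
prepared_result = run_prepared(prepared_exe, ["op_a", "op_b"], steps=5)
```

Both loops compute the same result, but the prepared variant constructs its ops once instead of once per iteration, which is where the saving would come from.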
- Optimization of Memory
- Op-level memory optimization (saves more than 1G of GPU memory on Transformer, Mask-RCNN and other models)
- Extend the coverage of the inplace strategy to support inplace computation for ops such as sum, softmax and softmax_with_cross_entropy.
- Fix the backward registration of the dropout, conv_transpose and activation ops to reduce their memory usage.
- Refactoring of the memory allocation and memory reuse strategies
- Refactor the underlying architecture of Allocator to lay the foundation for extending Allocator strategies later.
- Refactor the Inplace strategy to make its code easier to maintain, and eliminate bugs in the previous strategy such as variables being wrongly inplaced and cycles appearing in the graph.
- Configuration optimization
- Users can set the maximum workspace size of the conv layer, in MB, through the environment variable FLAGS_conv_workspace_size_limit.
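As a rough illustration of the configuration item above, the environment variable can be set from Python before the framework is imported. This is a minimal sketch: the value 512 is arbitrary, and the assumption that the flag must be set before the import is ours, not a statement from these notes.

```python
import os

# Assumption: the flag is read when PaddlePaddle is imported, so it is set
# first. The unit is MB; 512 is an arbitrary illustrative value.
os.environ["FLAGS_conv_workspace_size_limit"] = "512"

# import paddle.fluid as fluid  # import only after the flag has been set

workspace_limit_mb = int(os.environ["FLAGS_conv_workspace_size_limit"])
```

Setting the variable in the shell before launching the training script (for example with `export`) should have the same effect.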
- Execution optimization
- Change the default value of CPU_NUM to 1; it was previously the total number of logical cores of the device.
- Cache the OpKernel in Operator to avoid selecting the kernel again on every run.
- Optimizations for the ParallelExecutor execution mode (CompiledProgram.with_data_parallel()): reduce synchronization operations; optimize the speed when num_thread=1, which gives a noticeable speedup for small models (16% for the PaddingRNN small model).
- Framework basic functions enhancements
- Add mkldnn_enabled_op_types option to build_strategy, giving users the flexibility to control which ops need to use the mkldnn kernel for acceleration.
- Add the drop_local_exe_scopes interface to ParallelExecutor, which lets users control when the data in the local scopes is cleaned up; the num_iteration_per_drop_scope setting remains valid.
- Add the automatic mixed precision training interface fluid.contrib.mixed_precision.decorate(), which supports training of image classification, BERT and other models.
- Add the fluid.gradients() interface; 11 operations support second-order backward computation, which is used for the gradient penalty in image generation.
- Enhance support for the Intel nGraph graph compilation engine: add the ops required by the BERT model, so that BERT can be trained through Intel nGraph with aligned convergence.
- OP improvements
- Enhance the fused_elewise_activation op: add support for the x+sigmoid(y) and x+tanh(y) computation patterns.
- Add Exponential Moving Average (EMA), which makes model training smoother and more stable.
- Add sigmoid_focal_loss loss function
- Add deformable RoI pooling operation
- Add deformable convolution v2 operation
- Provide the unfold operation (i.e. im2col)
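For readers unfamiliar with im2col, the idea behind the unfold operation can be sketched in plain Python: every sliding window of the input is flattened into one column, so a convolution becomes a dot product between the flattened kernel and each column. This is an illustrative sketch for a single-channel 2-D input without padding; it is not Paddle's implementation, and the function name and output layout here are assumptions.

```python
def im2col(image, kernel_h, kernel_w, stride=1):
    """Unfold a 2-D list `image` into columns, one column per sliding window.

    Returns kernel_h * kernel_w rows; row r holds element r of every window,
    so column j is the j-th window flattened in row-major order.
    """
    height, width = len(image), len(image[0])
    out_h = (height - kernel_h) // stride + 1
    out_w = (width - kernel_w) // stride + 1
    columns = [[] for _ in range(kernel_h * kernel_w)]
    for top in range(0, out_h * stride, stride):
        for left in range(0, out_w * stride, stride):
            for dy in range(kernel_h):
                for dx in range(kernel_w):
                    columns[dy * kernel_w + dx].append(image[top + dy][left + dx])
    return columns


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
cols = im2col(image, kernel_h=2, kernel_w=2)

# A 2x2 convolution with kernel [[1, 0], [0, 1]] is then one dot product
# between the flattened kernel and each column.
kernel = [1, 0, 0, 1]
conv = [sum(k * cols[r][j] for r, k in enumerate(kernel)) for j in range(len(cols[0]))]
```

The 3x3 input with a 2x2 window yields four columns, one per window position.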
Inference Engine
- Server-side Deployment Library
- Improve the memory optimization feature: GPU memory usage of the DAM model drops from 4G to 940M, and that of the MobileNet model drops from 1G to 500M.
- Move the Paddle-TRT optimization process to model initialization, solving the problem of an overly long first prediction with Paddle-TRT. For example, the first prediction time of MobileNet drops from seconds to milliseconds.
- Fix the issue that model parameters were allocated in memory multiple times when AnalysisPredictor loads a model from memory.
- Enhance the Python inference API, and add its usage instructions under the "Deploy Inference Model" section of the official documentation.
- Intel INT8 quantized inference improvements
- Keep optimizing the INT8 quantization framework (post-training quantization); add five models (GoogleNet, MobileNetV2, VGG16, VGG19, ResNet101); compared with the FP32 models, accuracy loss is within 1% and performance improves by 2 to 3.7 times.
- Support running models trained with QAT (quantization-aware training) on INT8 kernels: a Pass modifies the QAT model so that it can run on INT8 kernels (quantization, dequantization and convolution are currently supported). On 7 models (GoogleNet, MobileNetV1, MobileNetV2, VGG16, VGG19, ResNet50, ResNet101), the accuracy change is within 0.1% compared with simulated execution on FP32 kernels.
- Paddle Serving
- Support GPU devices; support multi-card parallel inference.
- Provide the SE_ResNeXt50_32x4d model as a standard example, with benchmarks on an image classification task for scenarios such as single-card multi-concurrency and multi-card multi-concurrency.
- Support large-scale sparse parameter tasks: storage and online access of very large embeddings in scenarios such as CTR prediction. The first phase releases a single-machine version that supports access to embeddings at the scale of hundreds of millions.
- Provide easy to use API interface and API demo examples.
- PaddleSlim
- Integrate the INT8 quantization framework
- Add an automatic pruning strategy that searches for the optimal pruning ratios with simulated annealing: on the ImageNet 1000-class classification task, FLOPS is reduced by 50% compared with MobileNet V1, with Top1-Accuracy = 69.7%
- Add lightweight model architecture search (Light-NAS): on the ImageNet 1000-class classification task, FLOPS is reduced by 17% compared with MobileNet V1 with no loss of accuracy
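The pruning strategy above is described as a simulated annealing search over pruning ratios. The following is a generic sketch of simulated annealing on a toy problem, not PaddleSlim's code: `toy_score`, the ratio bounds and all hyperparameters are hypothetical and chosen only to make the example run.

```python
import math
import random


def toy_score(ratios):
    # Hypothetical objective (not PaddleSlim's): reward pruning overall, but
    # penalize pruning the first layer, as a stand-in for an accuracy drop.
    flops_saved = sum(ratios) / len(ratios)
    accuracy_penalty = 2.0 * ratios[0] ** 2
    return flops_saved - accuracy_penalty


def anneal(num_layers, steps=2000, start_temp=1.0, decay=0.995, seed=0):
    rng = random.Random(seed)
    current = [0.5] * num_layers
    current_score = toy_score(current)
    best, best_score = list(current), current_score
    temp = start_temp
    for _ in range(steps):
        # Propose a neighbour: nudge one layer's ratio, clipped to [0, 0.9].
        candidate = list(current)
        i = rng.randrange(num_layers)
        candidate[i] = min(0.9, max(0.0, candidate[i] + rng.uniform(-0.1, 0.1)))
        candidate_score = toy_score(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse moves with prob exp(delta / T).
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, candidate_score
        if current_score > best_score:
            best, best_score = list(current), current_score
        temp *= decay  # cool down so that worse moves become rarer over time
    return best, best_score


best_ratios, best_score = anneal(num_layers=4)
```

In a real pruning search the score would come from evaluating the pruned model under a FLOPS constraint; here it is a closed-form placeholder.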
Distributed training
- Distributed High-Level API Fleet
- A unified API for distributed training that supports both Parameter Server and Collective mode training, greatly reducing the amount of new code needed to switch from single-machine to multi-machine training
- Users can invoke different parallel training methods by configuring distributed strategies; multiple built-in RoleMakers are provided for different distributed environments
- Add the Communicator design for Parameter Server training
- Move the communication logic into Communicator, simplifying the asynchronous training logic
- Provide controllable communication switches so that communication can be tuned for different models
- Add several features that improve the scalability of GPU multi-machine multi-card training; multi-machine multi-card training of classic NLP/CV models is 50% faster
- Add Fused All Reduce: automatically merge gradient tensors to reduce the number of parameter synchronizations
- Add Hierarchical All Reduce: a hierarchical all reduce operation
- Add concurrent All Reduce communication: improves tolerance to network fluctuation in multi-machine training
- Add dependency analysis between the backward pass and the optimization algorithm: improves the ability to overlap communication with computation
- Combined, these features make multi-machine training (V100, 8*4 cards) of Bert Large (batch 16 x 128) and Resnet50 (batch 32) more than 50% faster than PaddlePaddle 1.4.1.
- GPU multi-machine multi-card benchmark update
- Provide speed comparisons on ResNet50, VGG16, Transformer and Bert, along with reproducible benchmark scripts.
- Pipeline parallelism support for CPU-GPU heterogeneous devices
- Add pipeline parallelism, which lets users assign computing OPs to heterogeneous hardware and exchange data through pipelines, so that heterogeneous computing devices can be combined and computing resources proportioned freely to improve training speed.
- In scenarios with heavy IO and light computation, such as CTR prediction and Graph Neural Network, this has a clear speed advantage over pure GPU training.
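The Fused All Reduce item above merges gradient tensors so that fewer synchronizations are needed. A toy, single-process sketch of that idea follows; the "workers" are plain Python lists, and `all_reduce_sum` is a hypothetical stand-in for a real collective call. Real implementations may group tensors into several size-limited buckets rather than one buffer, which is not shown here.

```python
def all_reduce_sum(buffers, stats):
    """Toy all-reduce: element-wise sum of one flat buffer per worker."""
    stats["calls"] += 1
    return [sum(values) for values in zip(*buffers)]


def per_tensor_all_reduce(worker_grads, stats):
    # One communication call per gradient tensor.
    num_tensors = len(worker_grads[0])
    return [
        all_reduce_sum([grads[t] for grads in worker_grads], stats)
        for t in range(num_tensors)
    ]


def fused_all_reduce(worker_grads, stats):
    # Flatten every worker's gradients into one buffer, reduce once, split back.
    sizes = [len(tensor) for tensor in worker_grads[0]]
    fused = [[x for tensor in grads for x in tensor] for grads in worker_grads]
    reduced = all_reduce_sum(fused, stats)
    result, offset = [], 0
    for size in sizes:
        result.append(reduced[offset:offset + size])
        offset += size
    return result


# Two workers, three gradient tensors each (values are illustrative).
worker_grads = [
    [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]],
    [[10.0, 20.0], [30.0], [40.0, 50.0, 60.0]],
]
plain_stats, fused_stats = {"calls": 0}, {"calls": 0}
plain = per_tensor_all_reduce(worker_grads, plain_stats)
fused = fused_all_reduce(worker_grads, fused_stats)
```

Both variants produce the same summed gradients; the fused one issues a single communication call instead of one per tensor.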
Model Construction
- Image classification
- Release 9 ImageNet pre-trained models: ResNet50_vc, ResNet50_vd, ResNet101_vd, ResNet152_vd, ResNet200_vd, ResNeXt101_64x4d, ResNeXt101_vd_64x4d, SENet154_vd, InceptionV4
- ResNet50_vd improves on the previously released ResNet50 by 2.62% and reaches the accuracy of ResNet101. ResNet101_vd improves on the previously released ResNet101 by 1.88%
- PaddleDetection
- Release PaddleDetection, a unified object detection framework that includes the Faster-RCNN (with FPN support), Mask-RCNN (with FPN support), Cascade-RCNN, RetinaNet, Yolo v3 and SSD algorithms; FPN, Cascade-RCNN and RetinaNet are newly added in this release.
- Release a series of pre-trained models. The RCNN family supports the ResNet, ResNet_vd, ResNeXt, ResNeXt_vd and SEResNeXt backbones. Yolo v3 adds the lighter ResNet34 and MobileNet backbones, with pre-trained models released
- PaddleGAN
- Release the PaddleGAN image generation library, which includes CGAN, DCGAN, CycleGAN, Pix2Pix, StarGAN, AttGAN and STGAN, and supports multiple datasets and classic GAN network structures. STGAN is an arbitrary image attribute editing model developed in-house by Baidu's Visual Technology Department.
- PaddleVideo
- Optimize the released classification models: NeXtVLAD trains 60% faster, and TSM is 39% faster than competing implementations
- Add backbone networks to released models: the Nonlocal model adds the ResNet101 and I3d network structures
- Add the action localization model C-TCN, Baidu's winning solution in the 2018 ActivityNet challenge
- PaddleNLP
- ERNIE / BERT support dynamic mixed precision training; support multi-card training in a multi-process manner, which improves the multi-card speedup ratio; optimize the speedup ratio of multi-machine multi-card training, raising the FP32 training speedup efficiency of 6 machines relative to a single machine to 76% on a V100 GPU cluster.
- Release PaddleNLP-Research, open-sourcing recent Baidu work in NLP research, including the Paddle Fluid baseline for the MRQA2019 reading comprehension competition, DuConv (ACL2019), ARNOR (ACL2019), MMPMS (IJCAI2019) and MPM (NAACL2019)
Tools and Components
- PaddleHub
- Release the new PaddleHub official website, with ease of use improved across the board
- Add the website https://www.paddlepaddle.org.cn/hub, which introduces how to use the pre-trained models of the PaddlePaddle ecosystem
- Transfer learning demos are available in AI Studio and AI Book for a quick trial without installation
- Add PaddleHub back-end services that support model search, download and private deployment
- Add 29 pre-trained models covering text, image and video; 40 official pre-trained models are now available
- CV pre-trained models
- Add 11 image classification pre-trained models: SE_ResNeXt, GoogleNet, ShuffleNet, etc.
- Add the object detection models Faster-RCNN and YOLOv3
- Add the image generation model CycleGAN
- Add the face detection model Pyramidbox
- Add 4 video classification models: TSN, TSM, StNet, Non-Local
- NLP pre-trained models
- Add the semantic model ELMo
- Add 3 sentiment analysis models: Senta-BOW, Senta-CNN, Senta-GRNN
- Add the Chinese emotion detection model EmoTect
- Add the Chinese semantic similarity model Simnet
- Upgrade the LAC lexical analysis model: add dictionary intervention to support user-defined word segmentation
- Upgrade the Fine-tune API, with flexibility and performance improved across the board
- Support multi-card parallelism and PyReader multi-threaded IO; ERNIE text classification fine-tuning is 60% faster
- Simplify the usage logic of finetune, evaluate, predict, etc. to improve ease of use
- Add event callbacks so that users can quickly implement custom transfer learning tasks
- Add the multi-label classification Fine-tune task
- Graph learning framework PGL (Paddle Graph Learning)
- Release the Preview version of PGL, a graph learning framework based on PaddlePaddle, which provides two computational paradigms, walk-based and message-passing, for building state-of-the-art graph learning algorithms such as graph representation learning and graph neural networks. PGL takes full advantage of Paddle LoD Tensor to greatly improve the efficiency of information aggregation in the message-passing paradigm, balancing flexibility and efficiency.
- Add PGL-based implementations of GCN and GAT, which reach SOTA level on multiple datasets
- Add the Graphsage model, based on large-scale subgraph sampling; a single machine can support huge graphs with 50 million nodes and 2 billion edges
- Add graph representation learning methods such as node2vec and deepwalk, which reach SOTA level
- Add PGL documentation, APIs, tutorials and other materials
Bug fixes
- Fix the issue that, in the CPU version of the softmax_with_cross_entropy operation, ignore_label did not support labels outside the range from 0 to the number of classes
- Fix the issue that logging.basicConfig settings had no effect after import paddle
- Fix errors raised by python/paddle/fluid/layers/ops.py under python3
- Fix the instability of the sequence unpad op during training
- Fix the crash when the axis attribute of the Concat op is negative
- Fix potential bugs in enable_inplace and memory_optimize to ensure that the output variables of some ops are not reused incorrectly
- Fix the bug that the Eager Deletion strategy could delete variable storage too early, improving the stability of the Eager Deletion strategy.
- Fix a bug in topological sorting during model graph analysis that caused different model graphs to be generated for the same model input
- Fix OMP thread conflicts with other service threads after prediction ends. With the fix, in CPU mode the inference engine resets the global number of OMP threads to 1 after prediction ends.