-
Notifications
You must be signed in to change notification settings - Fork 5.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
【Operator Mechanism】Pr68945 _C_ops.c_concat in dynamic graph bug fix #70870
base: develop
Are you sure you want to change the base?
Conversation
305f5af
to
cfe9a24
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM for pir interpreter
@@ -517,6 +517,13 @@ def source_include(header_file_path, fw_header_file_path): | |||
#include "paddle/phi/api/profiler/event_tracing.h" | |||
#include "paddle/phi/api/profiler/supplement_tracing.h" | |||
#include "paddle/phi/core/platform/device_context.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
为什么需要引用paddle/phi/core/platform/device_context.h
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
为什么需要引用
paddle/phi/core/platform/device_context.h
?
当时在dist_api_gen.py里用了device_context就下意识引进来了,刚刚看了看代码确实可以不引入,之后会把这里去掉重提
@@ -517,6 +517,13 @@ def source_include(header_file_path, fw_header_file_path): | |||
#include "paddle/phi/api/profiler/event_tracing.h" | |||
#include "paddle/phi/api/profiler/supplement_tracing.h" | |||
#include "paddle/phi/core/platform/device_context.h" | |||
#include "paddle/phi/core/distributed/store/store_utils.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
store_utils.h
引用加到PADDLE_WITH_DISTRIBUTE宏下面
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
store_utils.h
引用加到PADDLE_WITH_DISTRIBUTE宏下面
done
if self.kernel['func'][0] == 'c_concat': | ||
if_condition_code = ( | ||
if_condition_code | ||
+ '\n' | ||
+ self.generate_op_comm_init_code() | ||
+ '\n' | ||
+ self.generate_nccl_commcontext_init_code() | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
为什么c_concat
需要特判?是否可以做到不特判
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
为什么
c_concat
需要特判?
之前提过一个大范围的不加特判的pr,那个提上去以后说推理组那边会hang住,和ld讨论以后这次先提个范围小点的,推理那边没问题后再提大范围的
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这种特判需要尽可能避免,过多的特判逻辑会导致代码生成没法维护
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这种特判需要尽可能避免,过多的特判逻辑会导致代码生成没法维护
只有这一个,因为这个需求比较急,推理那边赶着用,那边用下来没问题的话会把这个特判删掉改成大范围的
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
去掉特判后遇到的问题要怎么处理?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
去掉特判后遇到的问题要怎么处理?
像pr70352这样#70352 ,只有ring_id属性的算子才加入初始化
@@ -860,6 +896,15 @@ def process_data_type_args(args_item): | |||
input_args=input_args, mesh=mesh, kernel_code=kernel_select_code | |||
) | |||
|
|||
if self.kernel['func'][0] == 'c_concat': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里删掉 if 判断,测试有问题吗?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里删掉 if 判断,测试有问题吗?
test_margin_cross_entropy_op有问题,不过是环境变量未设置,正在修复
c2ab818
cfe9a24
to
c2ab818
Compare
nullptr, | ||
common::errors::Unavailable( | ||
"NCCLCommContext is nullptr, collective op should " | ||
"has ring_id attr.")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
print ring_id
True, | ||
) | ||
elif in_pir_mode(): | ||
if in_dynamic_mode() or in_pir_mode(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-> in_dynamic_or_pir_mode()
@@ -81,6 +81,45 @@ | |||
}} | |||
""" | |||
|
|||
OP_COMM_INIT = """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OP_COMM_INIT -> NCCL_COMMCONTEXT_INIT
@@ -1310,6 +1364,12 @@ def generate_kernel_selection_code(self) -> str: | |||
self.api, self.kernel['func'][0], self.kernel['func'][0] | |||
) | |||
|
|||
def generate_op_comm_init_code(self) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename func name
c2ab818
to
253a352
Compare
PR Category
Operator Mechanism
PR Types
Bug fixes
Description
修复了Pr68945新动态图模式下调用_C_ops.c_concat的通信初始化与精度错误
PCard-85979