Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

【Operator Mechanism】Pr68945 _C_ops.c_concat in dynamic graph bug fix #70870

Open
wants to merge 11 commits into
base: develop
Choose a base branch
from

Conversation

Function-Samuel
Copy link
Contributor

PR Category

Operator Mechanism

PR Types

Bug fixes

Description

修复了Pr68945新动态图模式下调用_C_ops.c_concat的通信初始化与精度错误
PCard-85979

@Function-Samuel Function-Samuel force-pushed the pr_69951 branch 2 times, most recently from 305f5af to cfe9a24 Compare January 19, 2025 06:35
zhangbo9674
zhangbo9674 previously approved these changes Jan 20, 2025
Copy link
Contributor

@zhangbo9674 zhangbo9674 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for pir interpreter

@@ -517,6 +517,13 @@ def source_include(header_file_path, fw_header_file_path):
#include "paddle/phi/api/profiler/event_tracing.h"
#include "paddle/phi/api/profiler/supplement_tracing.h"
#include "paddle/phi/core/platform/device_context.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

为什么需要引用paddle/phi/core/platform/device_context.h?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

为什么需要引用paddle/phi/core/platform/device_context.h?

当时在dist_api_gen.py里用了device_context就下意识引进来了,刚刚看了看代码确实可以不引入,之后会把这里去掉重提

@@ -517,6 +517,13 @@ def source_include(header_file_path, fw_header_file_path):
#include "paddle/phi/api/profiler/event_tracing.h"
#include "paddle/phi/api/profiler/supplement_tracing.h"
#include "paddle/phi/core/platform/device_context.h"
#include "paddle/phi/core/distributed/store/store_utils.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

store_utils.h引用加到PADDLE_WITH_DISTRIBUTE宏下面

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

store_utils.h引用加到PADDLE_WITH_DISTRIBUTE宏下面

done

Comment on lines 899 to 915
if self.kernel['func'][0] == 'c_concat':
if_condition_code = (
if_condition_code
+ '\n'
+ self.generate_op_comm_init_code()
+ '\n'
+ self.generate_nccl_commcontext_init_code()
)
Copy link
Contributor

@zyfncg zyfncg Jan 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

为什么c_concat需要特判?是否可以做到不特判

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

为什么c_concat需要特判?

之前提过一个大范围的不加特判的pr,那个提上去以后说推理组那边会hang住,和ld讨论以后这次先提个范围小点的,推理那边没问题后再提大范围的

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这种特判需要尽可能避免,过多的特判逻辑会导致代码生成没法维护

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这种特判需要尽可能避免,过多的特判逻辑会导致代码生成没法维护

只有这一个,因为这个需求比较急,推理那边赶着用,那边用下来没问题的话会把这个特判删掉改成大范围的

Copy link
Contributor

@zyfncg zyfncg Jan 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

去掉特判后遇到的问题要怎么处理?

Copy link
Contributor Author

@Function-Samuel Function-Samuel Jan 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

去掉特判后遇到的问题要怎么处理?

像pr70352这样#70352 ,只有ring_id属性的算子才加入初始化

zyfncg
zyfncg previously approved these changes Jan 20, 2025
@@ -860,6 +896,15 @@ def process_data_type_args(args_item):
input_args=input_args, mesh=mesh, kernel_code=kernel_select_code
)

if self.kernel['func'][0] == 'c_concat':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里删掉 if 判断,测试有问题吗?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里删掉 if 判断,测试有问题吗?

test_margin_cross_entropy_op有问题,不过是环境变量未设置,正在修复

nullptr,
common::errors::Unavailable(
"NCCLCommContext is nullptr, collective op should "
"has ring_id attr."));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

print ring_id

True,
)
elif in_pir_mode():
if in_dynamic_mode() or in_pir_mode():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> in_dynamic_or_pir_mode()

@@ -81,6 +81,45 @@
}}
"""

OP_COMM_INIT = """
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OP_COMM_INIT -> NCCL_COMMCONTEXT_INIT

@@ -1310,6 +1364,12 @@ def generate_kernel_selection_code(self) -> str:
self.api, self.kernel['func'][0], self.kernel['func'][0]
)

def generate_op_comm_init_code(self) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename func name

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants