[GLUTEN-6067][CH] [Part 3-2] Basic support for Native Write in Spark 3.5 #6586

Merged
merged 5 commits into from
Jul 26, 2024

Conversation

baibaichen

@baibaichen baibaichen commented Jul 25, 2024

What changes were proposed in this pull request?

(Fixes: #6067)

This PR implements basic support for native write in Spark 3.5. I first refactored the code so that we can add a sink transform after parsing the Substrait plan.

The main idea is to use PartitionedSink. The core of this PR is how the partition value is computed; see the following code:

    /// visible for UTs
    static ASTPtr make_partition_expression(const DB::Names & partition_columns)
    {
        /// Parse the following expression into ASTs
        /// concat('/col_name=', 'toString(col_name)')
        bool add_slash = false;
        ASTs arguments;
        for (const auto & column : partition_columns)
        {
            // partition_column=
            std::string key = add_slash ? fmt::format("/{}=", column) : fmt::format("{}=", column);
            add_slash = true;
            arguments.emplace_back(std::make_shared<DB::ASTLiteral>(key));

            // ifNull(toString(partition_column), DEFAULT_PARTITION_NAME)
            // FIXME if toString(partition_column) is empty
            auto column_ast = std::make_shared<DB::ASTIdentifier>(column);
            ASTs if_null_args{makeASTFunction("toString", ASTs{column_ast}), std::make_shared<DB::ASTLiteral>(DEFAULT_PARTITION_NAME)};
            arguments.emplace_back(makeASTFunction("ifNull", std::move(if_null_args)));
        }
        return DB::makeASTFunction("concat", std::move(arguments));
    }
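To make the shape of the resulting partition value concrete, here is a minimal standalone sketch of what the `concat(...)` expression above would evaluate to for one row. It does not use the ClickHouse AST; `partitionPath`, `PartitionValue`, and `kDefaultPartitionName` are hypothetical names introduced for illustration, and the value `__HIVE_DEFAULT_PARTITION__` is an assumption about what `DEFAULT_PARTITION_NAME` holds (it is Spark's default partition name). A NULL column value is modeled as `std::nullopt`.

```cpp
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Assumption: stands in for DEFAULT_PARTITION_NAME from the PR; the value
// mirrors Spark's default partition name and is not confirmed by the PR text.
static const std::string kDefaultPartitionName = "__HIVE_DEFAULT_PARTITION__";

// Hypothetical type: (column name, stringified value or NULL).
using PartitionValue = std::pair<std::string, std::optional<std::string>>;

// Builds "col1=v1/col2=v2/...", the same layout the concat() expression
// produces: a "name=" literal followed by ifNull(toString(value), default),
// with a '/' separator before every column except the first.
std::string partitionPath(const std::vector<PartitionValue> & values)
{
    std::string path;
    bool add_slash = false;
    for (const auto & [column, value] : values)
    {
        if (add_slash)
            path += '/';
        add_slash = true;
        path += column;
        path += '=';
        // NULL falls back to the default partition name. Like the FIXME in
        // the PR, an empty (non-NULL) string is not handled specially here.
        path += value.value_or(kDefaultPartitionName);
    }
    return path;
}
```

With columns `year` and `month`, where `month` is NULL, this sketch yields `year=2024/month=__HIVE_DEFAULT_PARTITION__`.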

How was this patch tested?

Using existing UTs

#6067

Run Gluten Clickhouse CI

…tiveBlock::toColumnarBatch() to return ColumnarBatch

2. Extract a new function, SerializedPlanParser::buildPipeline, which is used in the follow-up PRs
3. Refactor File Wrapper; extract create_output_format_file for later use
4. Add GLUTEN_SOURCE_DIR so that gtest can read Java resources
5. Add SubstraitParserUtils.h so that we can remove parseJson
6. Many small refactorings
@baibaichen baibaichen force-pushed the feature/native_write2 branch from 084807c to f78ffb3 Compare July 25, 2024 12:35
@baibaichen
Contributor Author

We ran org.apache.gluten.execution.GlutenClickHouseNativeWriteTableSuite on Spark 3.5 here. After this PR is merged, we need to update the pipeline.

[2024-07-25T13:14:02.512Z] Run completed in 1 minute, 50 seconds.
[2024-07-25T13:14:02.521Z] Total number of tests run: 23
[2024-07-25T13:14:02.521Z] Suites: completed 2, aborted 0
[2024-07-25T13:14:02.521Z] Tests: succeeded 23, failed 0, canceled 0, ignored 2, pending 0
[2024-07-25T13:14:02.521Z] All tests passed.

@baibaichen baibaichen changed the title [GLUTEN-6067][CH] [Part 3-2] [WIP] Basic support for Native Write in Spark 3.5 [GLUTEN-6067][CH] [Part 3-2] Basic support for Native Write in Spark 3.5 Jul 25, 2024
@baibaichen baibaichen merged commit d90a7f4 into apache:main Jul 26, 2024
9 checks passed
@baibaichen baibaichen deleted the feature/native_write2 branch July 26, 2024 01:29
@@ -496,8 +496,8 @@ std::map<std::string, std::string> BackendInitializerUtil::getBackendConfMap(con
/// Parse backend configs from plan extensions
do
{
auto plan_ptr = std::make_unique<substrait::Plan>();
auto success = plan_ptr->ParseFromString(plan);
substrait::Plan sPlan;
Use the ClickHouse naming convention.

@@ -151,7 +153,7 @@ class BackendInitializerUtil
/// Initialize two kinds of resources
/// 1. global level resources like global_context/shared_context, notice that they can only be initialized once in process lifetime
/// 2. session level resources like settings/configs, they can be initialized multiple times following the lifetime of executor/driver
-static void init(const std::string & plan);
+static void init(const std::string_view plan);
why ?

Successfully merging this pull request may close these issues.

[CH] Support CH backend with Spark 3.5.x
3 participants