
[Feature] Supports reading and writing data files bypassing BE Server #52682

Closed
6 of 8 tasks
plotor opened this issue Nov 6, 2024 · 0 comments · Fixed by #52700 · May be fixed by #54470

plotor commented Nov 6, 2024

Feature request

Is your feature request related to a problem? Please describe.

StarRocks is a next-generation, high-performance analytical data warehouse. However, the limitations of its MPP architecture still create problems in large-scale ETL scenarios. For example:

  • Low resource utilization: StarRocks clusters mostly adopt a resource-reservation mode. Large-scale ETL jobs demand a large amount of resources within a relatively short period of time, so planning redundant resources for them in advance can reduce the overall resource utilization of the cluster.
  • Poor resource isolation: StarRocks isolates resources at the resource-group level rather than at the query level. In large-scale ETL scenarios, a query with heavy resource overhead risks exhausting the available resources and starving smaller queries.
  • Lack of failure tolerance: Because there is no task-level failure-tolerance mechanism, a failed ETL job will usually fail again even when it is rerun manually.

Describe the solution you'd like

To solve the above problems, we propose bypassing the BE server to directly read and write StarRocks data files in the shared-data (storage-compute separation) mode. Taking Apache Spark as an example (the design is not bound to Spark; it can support more computing engines), the overall architecture is as follows:

[Figure: starrocks-bypass-load — overall architecture of the bypass read/write design]

For complex query scenarios, users can submit jobs through the spark-sql client. Spark interacts with StarRocks to obtain table metadata, manages the data-write transaction, and performs read/write operations directly on shared storage media (e.g. AWS S3) through the native-format SDK. As a result, a single copy of data in StarRocks's internal format can be read and written by multiple analysis engines.
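
To illustrate what the catalog integration could look like from the user's side, here is a minimal Scala sketch of a Spark job that registers StarRocks as a Spark SQL catalog and queries a table through it. The catalog class name and option keys below are illustrative assumptions, not the connector's confirmed API:

```scala
import org.apache.spark.sql.SparkSession

object StarRocksBypassRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("starrocks-bypass-read")
      // Register StarRocks as a Spark SQL catalog so that table metadata is
      // fetched from FE directly, with no manual schema declaration.
      // Class name and option keys are assumptions for illustration.
      .config("spark.sql.catalog.starrocks",
              "com.starrocks.connector.spark.catalog.StarRocksCatalog")
      .config("spark.sql.catalog.starrocks.fe.http.url", "fe_host:8030")
      .config("spark.sql.catalog.starrocks.fe.jdbc.url",
              "jdbc:mysql://fe_host:9030")
      .getOrCreate()

    // The scan reads StarRocks data files from shared storage (e.g. S3)
    // through the native-format SDK, bypassing BE nodes entirely.
    spark
      .sql("SELECT dt, COUNT(*) FROM starrocks.demo_db.orders GROUP BY dt")
      .show()

    spark.stop()
  }
}
```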

In terms of implementation, we enhanced the existing StarRocks Spark Connector. Compared with the existing Spark Load and Spark Connector, the enhanced version has the following advantages:

  • Spark reads and writes StarRocks data files directly through the native-format SDK, which avoids consuming resources on BE nodes (see the write sketch after this list).
  • It fully reuses the strengths of the Spark engine in large-scale ETL scenarios, making up for the shortcomings StarRocks currently faces.
  • Through the enhanced connector, the Spark Catalog can connect to StarRocks and fetch table metadata directly, which greatly simplifies usage.
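
For the write path, a DataFrame-level sketch might look like the following. The `starrocks` data source short name matches the existing connector, but the `starrocks.write.mode` option and the exact option keys are assumptions used here only to illustrate selecting the file-level write path over the BE-based one:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object StarRocksBypassWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("starrocks-bypass-write").getOrCreate()

    // Hypothetical staging data to be loaded into StarRocks.
    val df = spark.read.parquet("s3://my-bucket/staging/orders/")

    df.write
      .format("starrocks") // data source short name of the Spark connector
      // Option keys below are illustrative assumptions, not confirmed API:
      .option("starrocks.fe.http.url", "fe_host:8030")        // FE address for metadata and transactions
      .option("starrocks.table.identifier", "demo_db.orders") // target table
      .option("starrocks.write.mode", "bypass")               // write native files to shared storage, not via BE
      .mode(SaveMode.Append)
      .save()

    spark.stop()
  }
}
```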

Related PRs
