
[Feature] Supports reading and writing data files bypassing BE Server #52682

Closed
6 of 8 tasks
plotor opened this issue Nov 6, 2024 · 0 comments · Fixed by #52700 · May be fixed by #54470

plotor commented Nov 6, 2024

Feature request

Is your feature request related to a problem? Please describe.

StarRocks is a next-generation, high-performance analytical data warehouse. However, the limitations of its MPP architecture still create problems in large-scale ETL scenarios. For example:

  • Low resource utilization: StarRocks clusters mostly adopt a resource-reservation mode. Large-scale ETL jobs demand a large amount of resources within a relatively short period of time, so planning redundant resources for them in advance can reduce the overall resource utilization of the cluster.
  • Poor resource isolation: StarRocks isolates resources at the resource-group level rather than at the query level. In large-scale ETL scenarios, a query with heavy resource overhead risks exhausting the available resources and starving smaller queries.
  • Lack of failure tolerance: Because there is no task-level failure-tolerance mechanism, a failed ETL job will usually fail again even when it is rerun manually.

Describe the solution you'd like

To solve the above problems, we propose bypassing the BE server to directly read and write StarRocks data files in the shared-data (storage-compute separation) mode. Taking Apache Spark as an example (the design is not bound to Spark; it can support more computing engines), the overall architecture is as follows:

[Figure: starrocks-bypass-load — overall architecture of the bypass read/write design]

For complex query scenarios, users can submit jobs through the spark-sql client. Spark interacts with StarRocks to obtain table metadata, manages the data-write transaction, and performs read/write operations directly on shared storage media (e.g. AWS S3) through the native-format SDK. As a result, a single copy of data in StarRocks's internal format can be read and written by multiple analysis engines.
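
To illustrate what the catalog integration could look like from the user's side, here is a minimal Scala sketch of a Spark job that registers StarRocks as a Spark SQL catalog and queries a table through it. The catalog class name and option keys below are illustrative assumptions, not the connector's confirmed API:

```scala
import org.apache.spark.sql.SparkSession

object StarRocksBypassRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("starrocks-bypass-read")
      // Register StarRocks as a Spark SQL catalog so that table metadata is
      // fetched from FE directly, with no manual schema declaration.
      // Class name and option keys are assumptions for illustration.
      .config("spark.sql.catalog.starrocks",
              "com.starrocks.connector.spark.catalog.StarRocksCatalog")
      .config("spark.sql.catalog.starrocks.fe.http.url", "fe_host:8030")
      .config("spark.sql.catalog.starrocks.fe.jdbc.url",
              "jdbc:mysql://fe_host:9030")
      .getOrCreate()

    // The scan reads StarRocks data files from shared storage (e.g. S3)
    // through the native-format SDK, bypassing BE nodes entirely.
    spark
      .sql("SELECT dt, COUNT(*) FROM starrocks.demo_db.orders GROUP BY dt")
      .show()

    spark.stop()
  }
}
```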

In terms of implementation, we enhanced the existing StarRocks Spark Connector. Compared with the existing Spark Load and Spark Connector, the enhanced version has the following advantages:

  • Spark reads and writes StarRocks data files directly through the native-format SDK, which avoids consuming resources on BE nodes (see the write sketch after this list).
  • It fully reuses the strengths of the Spark engine in large-scale ETL scenarios, making up for the shortcomings StarRocks currently faces.
  • Through the enhanced connector, the Spark Catalog can connect to StarRocks and fetch table metadata directly, which greatly simplifies usage.
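
For the write path, a DataFrame-level sketch might look like the following. The `starrocks` data source short name matches the existing connector, but the `starrocks.write.mode` option and the exact option keys are assumptions used here only to illustrate selecting the file-level write path over the BE-based one:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object StarRocksBypassWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("starrocks-bypass-write").getOrCreate()

    // Hypothetical staging data to be loaded into StarRocks.
    val df = spark.read.parquet("s3://my-bucket/staging/orders/")

    df.write
      .format("starrocks") // data source short name of the Spark connector
      // Option keys below are illustrative assumptions, not confirmed API:
      .option("starrocks.fe.http.url", "fe_host:8030")        // FE address for metadata and transactions
      .option("starrocks.table.identifier", "demo_db.orders") // target table
      .option("starrocks.write.mode", "bypass")               // write native files to shared storage, not via BE
      .mode(SaveMode.Append)
      .save()

    spark.stop()
  }
}
```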

Related PRs
