Is your feature request related to a problem? Please describe.
StarRocks is a next-gen, high-performance analytical data warehouse. However, the limitations of the MPP architecture still cause some problems in large-data ETL scenarios. For example:
Low resource utilization: StarRocks clusters mostly adopt a resource-reservation model. Large-data ETL scenarios demand a large amount of resources within a relatively short period, so redundant resources must be planned in advance, which may reduce the overall resource utilization of the cluster.
Poor resource isolation: StarRocks isolates resources at the Group level rather than the Query level. In large-data ETL scenarios, there is a risk that a query with a large resource overhead will exhaust resources and starve small queries.
Lack of fault tolerance: Because there is no task-level fault-tolerance mechanism, a failed ETL job will usually fail again even when it is rerun manually.
Describe the solution you'd like
In order to solve the above problems, we propose bypassing the BE server to read and write StarRocks data files directly in the storage-compute separation mode. Taking Apache Spark as an example (in fact, the design is not bound to Spark, and it can support more computing engines), the overall architecture design is as follows:
For complex query scenarios, users can submit jobs through the spark-sql client. Spark interacts with StarRocks to obtain table metadata, manages data write transactions, and reads/writes shared storage media (e.g. AWS S3) directly through the native format SDK. As a result, a single copy of data in StarRocks's internal format can be read and written by multiple analysis engines.
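For illustration, here is a minimal sketch of what this flow could look like from the Spark side, using Spark 3's catalog-plugin mechanism. The plugin class name, the catalog option keys, and the table names are all assumptions made for this sketch, not the connector's actual API:

```scala
import org.apache.spark.sql.SparkSession

object StarRocksBypassSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("starrocks-bypass-etl")
      // Register a catalog named "starrocks"; the plugin class is hypothetical.
      .config("spark.sql.catalog.starrocks",
              "com.starrocks.connector.spark.catalog.StarRocksCatalog")
      // FE endpoint used to fetch table metadata and manage write transactions
      // (option names are illustrative).
      .config("spark.sql.catalog.starrocks.fe.http.url", "http://fe-host:8030")
      // Shared storage (e.g. AWS S3) that Spark reads and writes directly
      // through the native format SDK, bypassing BE nodes.
      .config("spark.sql.catalog.starrocks.s3.endpoint",
              "https://s3.us-east-1.amazonaws.com")
      .getOrCreate()

    // Read path: metadata comes from FE, data files are scanned from S3.
    spark.sql(
      "SELECT dt, SUM(amount) AS amount FROM starrocks.demo.orders GROUP BY dt"
    ).show()

    // Write path: the connector opens a write transaction with FE, writes
    // native-format files to S3, then commits the transaction.
    spark.sql(
      "INSERT INTO starrocks.demo.daily_amount " +
      "SELECT dt, SUM(amount) FROM starrocks.demo.orders GROUP BY dt"
    )

    spark.stop()
  }
}
```

The same statements could equally be typed into the spark-sql shell once the catalog settings are placed in spark-defaults.conf; the point is that data never flows through BE nodes.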
In terms of implementation, we enhanced the existing StarRocks Spark Connector. Compared with the existing Spark Load and Spark Connector, the enhanced version has the following advantages:
Spark reads and writes StarRocks data files directly through the native format SDK, which avoids consuming BE node resources.
It fully reuses the strengths of the Spark engine in large-data ETL scenarios to make up for the shortcomings StarRocks currently faces.
Through the enhanced connector, the Spark Catalog can connect to StarRocks and fetch table metadata directly, which greatly simplifies usage (see the sketch after this list).
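As a concrete illustration of that last point, the sketch below shows catalog-based table access with no per-table connection options; the catalog name and table identifiers continue the assumptions from the earlier example:

```scala
import org.apache.spark.sql.SparkSession

// Assuming the "starrocks" catalog has been registered as in the previous
// sketch, tables can be referenced by name with no per-table options.
val spark = SparkSession.builder().getOrCreate()

// Metadata is fetched from FE on demand; no table-level configuration needed.
val orders = spark.table("starrocks.demo.orders")

// DataFrameWriterV2: aggregate and append back through the same catalog,
// with the write transaction handled by the connector.
orders.groupBy("dt").count()
  .writeTo("starrocks.demo.orders_daily")
  .append()
```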
Related PRs