
# StarRocks Shared-Data + Spark + Hive Metastore + MinIO: Building a Data Lake End to End

Goal: build a complete hot/cold tiered data lake. Hot data stays in StarRocks. Cold data is moved to MinIO by Spark, with its metadata managed by Hive Metastore, and StarRocks queries it directly through an External Catalog.

Overall architecture:

```
┌──────────────────────────────────────────────────────┐
│                      StarRocks                       │
│  ┌──────────────────┐      ┌──────────────────┐      │
│  │ Internal Catalog │      │ External Catalog │      │
│  │ (hot data)       │      │ (Hive Metastore) │      │
│  │ read/write       │      │ read-only        │      │
│  └────────┬─────────┘      └────────┬─────────┘      │
└───────────┼─────────────────────────┼────────────────┘
            │ Spark ETL               │ reads metadata
            ▼                         ▼
     ┌─────────────┐ saveAsTable ┌────────────────┐
     │ Spark       │ ──────────▶ │ Hive Metastore │
     │ (reads SR)  │             │ (metadata)     │
     └──────┬──────┘             └────────────────┘
            │ writes Parquet
            ▼
     ┌──────────────┐
     │ MinIO        │ ◀── read directly as Parquet by the External Catalog
     │ (S3 storage) │
     └──────────────┘
```

## 1. Managing all components with Docker Compose

All components live under `spark-docker/` in three independent `docker-compose.yml` files that share one Docker network, `spark-docker_spark-net`:

```
spark-docker/
├── docker-compose.yml          ← Spark cluster
├── starrocks-cluster/
│   └── docker-compose.yml      ← StarRocks FE/BE
└── hive-metastore/
    └── docker-compose.yml      ← Hive Metastore
```

MinIO is a container that already existed; it only needs to be attached to this network.

## 2. MinIO container (already deployed)

```shell
# Check the MinIO port mappings to confirm the S3 API port
docker port minio
# Output: 9000/tcp -> 0.0.0.0:9000   (S3 API)
# Output: 9090/tcp -> 0.0.0.0:9090   (web console)

# Attach MinIO to the Spark network
docker network connect spark-docker_spark-net minio
```

## 3. Containerized Spark cluster

### 3.1 docker-compose.yml

```yaml
services:
  spark-master:
    image: apache/spark:3.5.4
    container_name: spark-master
    hostname: spark-master
    restart: unless-stopped
    environment:
      # spark-daemon.sh only checks that this variable is set,
      # so any value (including "false") keeps the process in the foreground
      - SPARK_NO_DAEMONIZE=false
    ports:
      - "8180:8080"
      - "7077:7077"
    command: >
      /opt/spark/sbin/start-master.sh
      --host 0.0.0.0
      --port 7077
      --webui-port 8080
    networks:
      - spark-net

  spark-worker-1:
    image: apache/spark:3.5.4
    container_name: spark-worker-1
    hostname: spark-worker-1
    restart: unless-stopped
    depends_on:
      - spark-master
    environment:
      - SPARK_NO_DAEMONIZE=false
    ports:
      - "8181:8081"
    command: >
      /opt/spark/sbin/start-worker.sh spark://spark-master:7077
      --cores 4
      --memory 4g
      --webui-port 8081
    networks:
      - spark-net

  spark-worker-2:
    image: apache/spark:3.5.4
    container_name: spark-worker-2
    hostname: spark-worker-2
    restart: unless-stopped
    depends_on:
      - spark-master
    environment:
      - SPARK_NO_DAEMONIZE=false
    ports:
      - "8182:8081"
    command: >
      /opt/spark/sbin/start-worker.sh spark://spark-master:7077
      --cores 4
      --memory 4g
      --webui-port 8081
    networks:
      - spark-net

networks:
  spark-net:
    external: true
    name: spark-docker_spark-net
```

Key points:

- The `apache/spark:3.5.4` image ships with Hadoop and a Java 11 JVM.
- `--host 0.0.0.0` makes the Master listen on all network interfaces.
- All containers share one external network, `spark-docker_spark-net`. Because it is declared `external: true`, Compose will not create it; it has to exist before the first `up` (for example via `docker network create spark-docker_spark-net`).

### 3.2 Startup

```shell
cd spark-docker
docker-compose up -d
docker ps --filter name=spark
```

### 3.3 Preparing JARs

Spark needs extra JARs to connect to StarRocks and MinIO:

```shell
# StarRocks Connector
curl -o starrocks-spark-connector-3.5_2.12-1.1.3.jar \
  https://repo1.maven.org/maven2/com/starrocks/starrocks-spark-connector-3.5_2.12/1.1.3/starrocks-spark-connector-3.5_2.12-1.1.3.jar

# MySQL JDBC Driver (taken from the local Maven repository)
cp ~/.m2/repository/mysql/mysql-connector-java/8.0.28/mysql-connector-java-8.0.28.jar .

# Copy both into the Spark Master
docker cp starrocks-spark-connector-3.5_2.12-1.1.3.jar spark-master:/opt/spark/jars/
docker cp mysql-connector-java-8.0.28.jar spark-master:/opt/spark/jars/
```

## 4. StarRocks shared-data architecture

### 4.1 Why split FE and BE

The official StarRocks `allin1-ubuntu` image packs FE and BE into a single container. Its drawback is that the BE's `priority_networks` is hard-coded to `127.0.0.1/32`, so the Spark Connector running in another container cannot reach the BE's Thrift port. With FE and BE split into two separate containers, network binding can be configured for each, and the Connector reaches the BE directly over the Docker network.

### 4.2 docker-compose.yml

```yaml
services:
  sr-fe:
    image: starrocks/fe-ubuntu:3.5.0
    container_name: sr-fe
    hostname: sr-fe
    restart: unless-stopped
    environment:
      - HOST_TYPE=FQDN
    ports:
      - "8030:8030"
      - "9030:9030"
    volumes:
      - sr-fe-meta:/opt/starrocks/fe/meta
      - sr-fe-log:/opt/starrocks/fe/log
    command: /opt/starrocks/fe/bin/start_fe.sh --logconsole
    networks:
      - spark-net

  sr-be:
    image: starrocks/be-ubuntu:3.5.0
    container_name: sr-be
    hostname: sr-be
    restart: unless-stopped
    environment:
      - HOST_TYPE=FQDN
    ports:
      - "8040:8040"
    volumes:
      - sr-be-data:/opt/starrocks/be/storage
      - sr-be-log:/opt/starrocks/be/log
    command: /opt/starrocks/be/bin/start_be.sh --logconsole
    networks:
      - spark-net

networks:
  spark-net:
    external: true
    name: spark-docker_spark-net

volumes:
  sr-fe-meta:
  sr-fe-log:
  sr-be-data:
  sr-be-log:
```

Key points:

- `--logconsole` keeps FE/BE running in the foreground, so the containers do not exit.
- No fixed IPs are assigned; Docker allocates them automatically.
- After the BE starts, it has to be registered with the FE manually.

### 4.3 Registering the BE

```shell
docker exec sr-fe mysql -h127.0.0.1 -P9030 -uroot \
  -e 'ALTER SYSTEM ADD BACKEND "sr-be:9050";'
```

### 4.4 Shared-data storage for the Internal Catalog

StarRocks 3.5 supports the `storage_volume` property, which stores internal table data directly in MinIO. Storage volumes are a shared-data feature; the StarRocks documentation describes enabling that mode with `run_mode = shared_data` in `fe.conf` before the cluster is first started, a setting the compose file above does not show.

```sql
-- Step 1: create a storage volume that points at MinIO
CREATE STORAGE VOLUME minio_volume
TYPE = S3
LOCATIONS = ("s3://spark-output")
PROPERTIES (
    "enabled" = "true",
    "aws.s3.endpoint" = "http://minio:9000",
    "aws.s3.access_key" = "MINIO_ACCESS_KEY",
    "aws.s3.secret_key" = "MINIO_SECRET_KEY",
    "aws.s3.enable_path_style_access" = "true"
);

-- Step 2: make it the default storage volume
SET minio_volume AS DEFAULT STORAGE VOLUME;

-- Step 3: reference it when creating a table
CREATE TABLE db1.dim_product (
    product_id INT,
    product_name VARCHAR(200),
    ...
)
ENGINE = OLAP
DUPLICATE KEY(product_id)
DISTRIBUTED BY HASH(product_id) BUCKETS 10
PROPERTIES (
    "storage_volume" = "minio_volume",
    "replication_num" = "1"
);
```

With this in place, the data files of StarRocks internal tables are stored in MinIO instead of on the BE's local disk. DML is unaffected: INSERT/UPDATE/DELETE behave as they do with local storage.

## 5. Containerized Hive Metastore

### 5.1 docker-compose.yml

```yaml
services:
  hive-metastore:
    image: apache/hive:3.1.3
    container_name: hive-metastore
    restart: unless-stopped
    environment:
      SERVICE_NAME: metastore
      DB_DRIVER: mysql
      # Replace <HOST_IP> with the IP address of the Docker host running MySQL
      SERVICE_OPTS: >-
        -Djavax.jdo.option.ConnectionDriverName=com.mysql.cj.jdbc.Driver
        -Djavax.jdo.option.ConnectionURL=jdbc:mysql://<HOST_IP>:3306/hive_metastore
        -Djavax.jdo.option.ConnectionUserName=hive
        -Djavax.jdo.option.ConnectionPassword=MYSQL_PASSWORD
        -Dhive.metastore.warehouse.dir=s3a://data-lake/warehouse
        -Dfs.s3a.endpoint=http://minio:9000
        -Dfs.s3a.access.key=MINIO_ACCESS_KEY
        -Dfs.s3a.secret.key=MINIO_SECRET_KEY
        -Dfs.s3a.path.style.access=true
    ports:
      - "9083:9083"
    networks:
      - spark-net

networks:
  spark-net:
    external: true
    name: spark-docker_spark-net
```

Key points:

- Version choice: `apache/hive:3.1.3` is compatible with the Hive client embedded in Spark 3.5.x.
- The MySQL metadata database must be created in advance (MySQL runs on the Docker host).
- `warehouse.dir` points to an S3 path on MinIO.

### 5.2 Preparing the MySQL metadata database

```sql
CREATE DATABASE hive_metastore DEFAULT CHARACTER SET latin1;
CREATE USER 'hive'@'%' IDENTIFIED BY 'MYSQL_PASSWORD';
GRANT ALL ON hive_metastore.* TO 'hive'@'%';
FLUSH PRIVILEGES;
```

### 5.3 S3A JAR dependencies

Hive Metastore does not include the S3A driver by default, so `hadoop-aws` and `aws-java-sdk-bundle` have to be added manually:

```shell
curl -o hadoop-aws-3.3.4.jar https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar
curl -o aws-java-sdk-bundle-1.12.367.jar https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.367/aws-java-sdk-bundle-1.12.367.jar

docker cp hadoop-aws-3.3.4.jar hive-metastore:/opt/hive/lib/
docker cp aws-java-sdk-bundle-1.12.367.jar hive-metastore:/opt/hive/lib/
docker restart hive-metastore
```

### 5.4 Creating the Hive Catalog in StarRocks

```sql
-- Replace <HOST_IP> with the IP address of the Docker host
CREATE EXTERNAL CATALOG minio_catalog
PROPERTIES (
    "type" = "hive",
    "hive.metastore.uris" = "thrift://<HOST_IP>:9083",
    "aws.s3.endpoint" = "http://minio:9000",
    "aws.s3.access_key" = "MINIO_ACCESS_KEY",
    "aws.s3.secret_key" = "MINIO_SECRET_KEY",
    "aws.s3.enable_path_style_access" = "true"
);
```

## 6. Reading from StarRocks with Spark and writing to MinIO + Hive Metastore

### 6.1 Maven project configuration

Key dependencies in `pom.xml`:

```xml
<properties>
  <java.version>11</java.version>
  <maven.compiler.source>11</maven.compiler.source>
  <maven.compiler.target>11</maven.compiler.target>
</properties>

<dependencies>
  <!-- Spark SQL (provided: the cluster already ships it) -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.5.4</version>
    <scope>provided</scope>
  </dependency>
  <!-- StarRocks Connector (bundled into the JAR) -->
  <dependency>
    <groupId>com.starrocks</groupId>
    <artifactId>starrocks-spark-connector-3.5_2.12</artifactId>
    <version>1.1.3</version>
  </dependency>
  <!-- MySQL JDBC Driver (bundled into the JAR) -->
  <dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>8.0.28</version>
  </dependency>
  <!-- S3A -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>3.3.4</version>
  </dependency>
  <dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk-bundle</artifactId>
    <version>1.12.367</version>
  </dependency>
</dependencies>

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>3.5.1</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

The compile target is set to 11 because the JVM in the Spark containers is Java 11.

### 6.2 Java code

```java
package com.example;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class SparkStarRocksDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SR-to-MinIO-Hive")
                // MinIO S3A configuration
                .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
                .config("spark.hadoop.fs.s3a.access.key", "MINIO_ACCESS_KEY")
                .config("spark.hadoop.fs.s3a.secret.key", "MINIO_SECRET_KEY")
                .config("spark.hadoop.fs.s3a.path.style.access", "true")
                // Hive Metastore configuration
                .config("hive.metastore.uris", "thrift://hive-metastore:9083")
                .config("spark.sql.warehouse.dir", "s3a://data-lake/warehouse")
                .enableHiveSupport()
                .getOrCreate();

        // Create the database (recorded in the Metastore)
        spark.sql("CREATE DATABASE IF NOT EXISTS db1");

        // Register the StarRocks table as a temporary view
        spark.sql(
                "CREATE OR REPLACE TEMPORARY VIEW dim_product "
                        + "USING starrocks "
                        + "OPTIONS ("
                        + "  \"starrocks.fe.http.url\" = \"sr-fe:8030\","
                        + "  \"starrocks.fe.jdbc.url\" = \"jdbc:mysql://sr-fe:9030\","
                        + "  \"starrocks.table.identifier\" = \"db1.dim_product\","
                        + "  \"starrocks.user\" = \"root\","
                        + "  \"starrocks.password\" = \"\""
                        + ")");

        // Read the data from StarRocks
        Dataset<Row> df = spark.sql("SELECT * FROM dim_product");

        // saveAsTable does two things:
        // 1. writes Parquet to MinIO at s3a://data-lake/warehouse/db1.db/dim_product/
        // 2. registers the table schema in Hive Metastore
        df.write()
                .mode(SaveMode.Overwrite)
                .saveAsTable("db1.dim_product");

        System.out.println("Done.");
        spark.stop();
    }
}
```

### 6.3 Packaging and submitting

```shell
# 1. Package
mvn clean package -DskipTests

# 2. Copy the JAR into the Spark container
docker cp target/spark-starrocks-demo-1.0.0.jar spark-master:/tmp/

# 3. Submit the job
docker exec spark-master /opt/spark/bin/spark-submit \
  --class com.example.SparkStarRocksDemo \
  --master spark://spark-master:7077 \
  --deploy-mode client \
  /tmp/spark-starrocks-demo-1.0.0.jar
```

### 6.4 Verification

Once the Spark job has finished, verify the result from the StarRocks side:

```sql
-- Switch to the Hive Catalog
SET CATALOG minio_catalog;

-- List databases
SHOW DATABASES;

-- Query the data (reads the Parquet files on MinIO directly)
USE db1;
SELECT COUNT(*) FROM dim_product;
SELECT * FROM dim_product LIMIT 10;

-- Cross-catalog JOIN (internal table + external table)
SELECT a.*, b.extra_info
FROM default_catalog.db1.hot_table a
JOIN minio_catalog.db1.dim_product b
  ON a.product_id = b.product_id;
```

## 7. Pitfalls

| Problem | Cause | Fix |
| --- | --- | --- |
| With all-in-one StarRocks, the Connector cannot reach the BE | The BE is registered in metadata as `127.0.0.1`, which is unreachable from other containers | Split FE and BE into two containers |
| JVM version mismatch in the Spark containers | Spark runs on Java 11, but the code was compiled with Java 17 | Set the compile target to 11 in `pom.xml` |
| `saveAsTable` fails with `Invalid method name: get_table` | The Hive 2.3.x client embedded in Spark is incompatible with the Metastore 4.0.0 API | Downgrade the Metastore to 3.1.3 |
| `S3AFileSystem not found` | The Metastore container lacks the S3A JARs | Copy in `hadoop-aws` and the AWS SDK bundle manually |
| Stream Load redirects to the BE's internal IP | Docker-internal IPs are unreachable from the host | Run `curl` from inside the FE container |
| CSV import filters out every row | CSV files generated on Windows are not UTF-8 by default | Pass `encoding='utf-8'` to Python's `open()` |

## 8. Summary

Components in the finished setup:

| Component | Container name | Ports | Notes |
| --- | --- | --- | --- |
| Spark Master | spark-master | 8180 | Web UI |
| Spark Worker x2 | spark-worker-1/2 | 8181/8182 | 4 cores, 4 GB each |
| StarRocks FE | sr-fe | 8030/9030 | HTTP/JDBC |
| StarRocks BE | sr-be | 8040/9060 | HTTP/Thrift |
| Hive Metastore | hive-metastore | 9083 | Thrift |
| MinIO | minio | 9000/9090 | S3 API / Console |

Data flow:

- Write path: the Spark Connector reads from StarRocks → `saveAsTable()` → Parquet goes into MinIO and metadata goes into Hive Metastore.
- Read path: the StarRocks Hive Catalog fetches the schema from the Metastore, then reads Parquet directly from MinIO.
- Shared-data internal tables: the StarRocks Internal Catalog stores table data directly in MinIO through `storage_volume`.
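
When checking the write path on the MinIO side, it helps to know which object prefix to look under. The following is a minimal sketch of the default layout Hive uses for managed tables, `<warehouse>/<db>.db/<table>/`, which matches the `s3a://data-lake/warehouse/db1.db/dim_product/` path in the job's comments. The class `WarehousePaths` and its method are hypothetical helpers written for illustration, not part of Spark or Hive, and the result only holds when neither the database nor the table was created with an explicit location.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not a Spark or Hive API.
public class WarehousePaths {

    /**
     * Builds the default location of a managed table:
     * <warehouse>/<db>.db/<table>/ for named databases, and
     * <warehouse>/<table>/ for the "default" database.
     * Assumes no explicit LOCATION was set on the database or table.
     */
    public static String tableLocation(String warehouseDir, String database, String table) {
        // Drop a single trailing slash so the pieces join cleanly.
        String base = warehouseDir.endsWith("/")
                ? warehouseDir.substring(0, warehouseDir.length() - 1)
                : warehouseDir;
        // The metastore stores identifiers in lower case.
        String db = database.toLowerCase(Locale.ROOT);
        String tbl = table.toLowerCase(Locale.ROOT);
        String dbDir = "default".equals(db) ? "" : "/" + db + ".db";
        return base + dbDir + "/" + tbl + "/";
    }

    public static void main(String[] args) {
        // Same warehouse, database and table as the Spark job in section 6.2.
        System.out.println(tableLocation("s3a://data-lake/warehouse", "db1", "dim_product"));
    }
}
```

Running `main` prints `s3a://data-lake/warehouse/db1.db/dim_product/`. In MinIO that corresponds to the `warehouse/db1.db/dim_product/` prefix inside the `data-lake` bucket, which is where the Parquet files should appear after the job completes.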