
# Kubernetes Big Data Processing Architecture and Spark on K8s in Practice

## Introduction

As demand for big data processing grows, deploying big data frameworks on Kubernetes has become a common approach. This article looks at deployment strategies and best practices for Spark on Kubernetes.

## 1. Big Data Processing Architecture

### 1.1 Architecture Design

```
┌─────────────────────────────────────────────────────────────────────┐
│                  Big Data Processing Architecture                   │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────┐      ┌──────────────┐      ┌──────────────┐           │
│  │   Data   │─────▶│    Spark     │─────▶│   Storage    │           │
│  │  Source  │      │  Processing  │      │  (HDFS/S3)   │           │
│  └──────────┘      └──────────────┘      └──────────────┘           │
│       │                   │                     │                   │
│       ▼                   ▼                     ▼                   │
│  ┌──────────┐      ┌──────────────┐      ┌──────────────┐           │
│  │  Kafka   │      │  Kubernetes  │      │    Presto    │           │
│  │  Flink   │      │   Operator   │      │    Trino     │           │
│  └──────────┘      └──────────────┘      └──────────────┘           │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

### 1.2 Component Comparison

| Component | Function | Typical use case |
|-----------|----------|------------------|
| Spark | Batch and stream processing | Large-scale data processing |
| Flink | Real-time stream processing | Low-latency real-time computation |
| Presto | SQL query engine | Interactive analytics |
| Trino | Distributed query engine | Queries across multiple data sources |

## 2. Deploying Spark on Kubernetes

### 2.1 Spark Configuration

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: spark-config
  namespace: spark
data:
  spark-defaults.conf: |
    spark.master                                               k8s://https://kubernetes.default.svc
    spark.kubernetes.namespace                                 spark
    spark.kubernetes.container.image                           spark:3.5.0
    spark.kubernetes.authenticate.driver.serviceAccountName    spark-driver
    spark.kubernetes.authenticate.executor.serviceAccountName  spark-executor
    spark.executor.instances                                   3
    spark.executor.cores                                       2
    spark.executor.memory                                      4g
    spark.driver.cores                                         1
    spark.driver.memory                                        2g
```

### 2.2 Submitting a Spark Application

```bash
spark-submit \
  --master k8s://https://kubernetes.default.svc \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.container.image=spark:3.5.0 \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.executor.instances=3 \
  --conf spark.executor.cores=2 \
  --conf spark.executor.memory=4g \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar
```

## 3. Spark Operator Configuration

### 3.1 Installing the Spark Operator

The operator project has since moved from `GoogleCloudPlatform/spark-on-k8s-operator` to `kubeflow/spark-operator`, so check that these URLs still resolve before relying on them.

```bash
kubectl apply -f https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/releases/download/v1.1.1/spark-operator.yaml
kubectl create namespace spark
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/spark-on-k8s-operator/master/charts/spark-operator/crds/sparkoperator.k8s.io_sparkapplications.yaml
```

### 3.2 SparkApplication Configuration

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: spark:3.5.0
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar
  sparkVersion: "3.5.0"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: "3.5.0"
    serviceAccount: spark-driver
  executor:
    cores: 1
    instances: 3
    memory: "1g"
    labels:
      version: "3.5.0"
    serviceAccount: spark-executor
```

## 4. Resource Management and Scheduling

### 4.1 Resource Quota Configuration

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: spark-quota
  namespace: spark
spec:
  hard:
    pods: "100"
    requests.cpu: "50"
    requests.memory: 100Gi
    limits.cpu: "100"
    limits.memory: 200Gi
```

### 4.2 Node Selectors

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-data-processing
spec:
  executor:
    nodeSelector:
      nodeType: worker
      zone: us-west-2a
    tolerations:
      - key: dedicated
        operator: Equal
        value: spark
        effect: NoSchedule
```

## 5. Storage Configuration

### 5.1 HDFS Integration

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hdfs-config
  namespace: spark
data:
  hdfs-site.xml: |
    <configuration>
      <property>
        <name>dfs.nameservices</name>
        <value>mycluster</value>
      </property>
      <property>
        <name>dfs.ha.namenodes.mycluster</name>
        <value>nn1,nn2</value>
      </property>
    </configuration>
```

### 5.2 S3 Integration

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: spark-s3-secret
  namespace: spark
type: Opaque
data:
  AWS_ACCESS_KEY_ID: base64-encoded-key
  AWS_SECRET_ACCESS_KEY: base64-encoded-secret
```

## 6. Monitoring and Logging

### 6.1 Prometheus Monitoring

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: spark-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      sparkoperator.k8s.io/app-name: spark-pi
  endpoints:
    - port: http-prometheus
      path: /metrics
      interval: 15s
```

### 6.2 Log Collection

Spark 3.3 and later use Log4j 2, which reads `log4j2.properties` in a different syntax; the Log4j 1.x configuration below applies as written only to older Spark images.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: spark-logging
  namespace: spark
data:
  log4j.properties: |
    log4j.rootCategory=INFO, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```

## 7. Job Scheduling

### 7.1 ScheduledSparkApplication

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: ScheduledSparkApplication
metadata:
  name: daily-report
  namespace: spark
spec:
  schedule: "0 2 * * *"   # every day at 02:00
  concurrencyPolicy: Forbid
  template:
    type: Python
    mode: cluster
    image: spark-py:3.5.0
    mainApplicationFile: gs://my-bucket/scripts/daily_report.py
    sparkVersion: "3.5.0"
    executor:
      cores: 2
      instances: 5
      memory: "4g"
```

### 7.2 Dependency Management

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-job-with-deps
spec:
  deps:
    jars:
      - gs://my-bucket/jars/mysql-connector-java.jar
      - gs://my-bucket/jars/postgresql.jar
    files:
      - gs://my-bucket/config/config.properties
    pyFiles:
      - gs://my-bucket/python/utils.py
```

## 8. Performance Optimization

### 8.1 Configuration Tuning

| Parameter | Description | Suggested value |
|-----------|-------------|-----------------|
| `spark.executor.instances` | Number of executors | Adjust to data volume |
| `spark.executor.cores` | Cores per executor | 4-8 |
| `spark.executor.memory` | Executor memory | 8-16g |
| `spark.sql.shuffle.partitions` | Number of shuffle partitions | 200-1000 |
| `spark.default.parallelism` | Default parallelism | 2-3 times the number of cores |

### 8.2 Memory Management

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: memory-optimized-job
spec:
  driver:
    memory: "4g"
    javaOptions: "-XX:+UseG1GC -XX:MaxGCPauseMillis=200"
  executor:
    memory: "16g"
    memoryOverhead: "4g"
    javaOptions: "-XX:+UseG1GC -XX:MaxGCPauseMillis=200"
```

## 9. High Availability Configuration

### 9.1 Driver Failure Recovery

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: ha-spark-job
spec:
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10   # seconds between retries
  driver:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: node-role.kubernetes.io/control-plane
                  operator: DoesNotExist
```

### 9.2 Data Fault Tolerance

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: fault-tolerant-job
spec:
  sparkConf:
    spark.task.maxFailures: "4"
    spark.stage.maxConsecutiveAttempts: "4"
    spark.shuffle.service.enabled: "true"
```

## 10. Common Problems and Solutions

### 10.1 Executors Fail to Start

Possible causes:

- Insufficient resources
- Image pull failure
- Permission problems

Solution:

```bash
# Check pod status
kubectl get pods -n spark

# View the executor logs
kubectl logs <executor-pod-name> -n spark

# Check the resource quota
kubectl describe resourcequota spark-quota -n spark
```

### 10.2 Data Skew

Possible causes:

- Unevenly sized partitions
- Uneven key distribution

Solution: increase the number of shuffle partitions at submit time.

```bash
# Increase the number of shuffle partitions
--conf spark.sql.shuffle.partitions=1000
```

Alternatively, randomize hot keys with a salt column:

```python
from pyspark.sql.functions import rand

# Add an integer salt in [0, 10) so a hot key can be spread over several partitions
df = df.withColumn("salt", (rand() * 10).cast("int"))
```

### 10.3 Out-of-Memory Errors

Possible causes:

- Insufficient executor memory
- Too much cached data

Solution:

```bash
# Increase executor memory
--conf spark.executor.memory=16g

# Adjust the fraction of heap used for execution and storage
--conf spark.memory.fraction=0.8
```

## Conclusion

Spark on Kubernetes provides an elastic, scalable runtime for big data processing. With well-chosen resource, storage, and scheduling settings, it can process large datasets efficiently. Combined with monitoring and fault-tolerance mechanisms, it also keeps jobs reliable and observable.
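
As a supplement to the salting tip in section 10.2, the following is a minimal pure-Python sketch (deliberately not PySpark) of why a salt helps with a skewed key. The partitioner `partition_of`, the partition count, and the `hot_user` data are all hypothetical stand-ins chosen for illustration; Spark's real hash partitioner works differently in detail.

```python
import random
from collections import Counter

NUM_PARTITIONS = 8   # hypothetical partition count for this toy example
NUM_SALTS = 10       # same salt range as the rand() * 10 tip in section 10.2


def partition_of(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Toy deterministic partitioner: byte sum of the key modulo partition count."""
    return sum(key.encode("utf-8")) % num_partitions


def salted_key(key: str, rng: random.Random, num_salts: int = NUM_SALTS) -> str:
    """Append a random integer salt in [0, num_salts) to the key."""
    return f"{key}_{rng.randrange(num_salts)}"


def partition_sizes(keys) -> Counter:
    """Count how many records land in each partition."""
    return Counter(partition_of(k) for k in keys)


rng = random.Random(42)  # fixed seed so the run is repeatable

# Skewed input: one hot key with 10,000 records plus 100 ordinary keys.
records = ["hot_user"] * 10_000 + [f"user_{i}" for i in range(100)]

plain = partition_sizes(records)
salted = partition_sizes(salted_key(k, rng) for k in records)

print("largest partition without salt:", max(plain.values()))
print("largest partition with salt:   ", max(salted.values()))
```

Without the salt, every `hot_user` record lands in the same partition; with it, the records are spread over several. In a real job, an aggregation on the salted key then needs a second step that strips the salt and merges the partial results.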
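
The memory settings in section 8.2 and the quota in section 4.1 interact, because each Spark pod requests heap memory plus overhead. The sketch below estimates how many executors of the section 8.2 size fit under the `requests.memory: 100Gi` quota. It assumes Spark's default overhead rule for JVM jobs (10% of the heap, with a 384 MiB minimum) whenever `memoryOverhead` is not set explicitly; the helper names are made up for this example, and the estimate ignores any other pods in the namespace.

```python
MIB_PER_GIB = 1024
MIN_OVERHEAD_MIB = 384          # minimum overhead Spark applies
DEFAULT_OVERHEAD_FACTOR = 0.10  # default factor for JVM jobs (assumed here)


def pod_memory_mib(heap_mib, overhead_mib=None, factor=DEFAULT_OVERHEAD_FACTOR):
    """Memory request of one Spark pod: heap plus explicit or default overhead."""
    if overhead_mib is None:
        overhead_mib = max(int(heap_mib * factor), MIN_OVERHEAD_MIB)
    return heap_mib + overhead_mib


def max_executors(quota_mib, driver_pod_mib, executor_pod_mib):
    """Executors that fit in the quota once the driver pod is accounted for."""
    return max((quota_mib - driver_pod_mib) // executor_pod_mib, 0)


# Values from section 8.2: 16g heap + 4g explicit overhead per executor,
# 4g driver heap with default overhead.
executor_pod = pod_memory_mib(16 * MIB_PER_GIB, overhead_mib=4 * MIB_PER_GIB)
driver_pod = pod_memory_mib(4 * MIB_PER_GIB)

# Value from section 4.1: requests.memory: 100Gi
quota = 100 * MIB_PER_GIB

fit = max_executors(quota, driver_pod, executor_pod)
print(f"executor pod: {executor_pod} MiB, driver pod: {driver_pod} MiB")
print(f"executors that fit under the quota: {fit}")
```

Under these assumptions only four such executors fit, even though 100Gi divided by 16g suggests six, which is one reason executors can stay pending when the quota is sized from heap memory alone.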