A Production-Grade Terraform + GitLab CI/CD + AWS Infrastructure Pipeline

Published: 2026/6/5 12:22:17

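1. Not "yet another cloud deployment tutorial", but a production-grade infrastructure pipeline I have rolled out and validated at three companies

If you clicked on this title, you are probably being worn down by a few recurring problems. Every new service means clicking through the console to configure the VPC, subnets, and security groups, and one slip that deletes the wrong resource takes down the whole chain. Someone on the team edits the tfstate file without pushing to Git, CI breaks the next day, and after half a day of digging it turns out local and remote state had diverged. The apply logic in GitLab CI is a pile of stitched-together shell scripts that nobody dares to touch, because any change blows it up. None of these are theoretical risks. They are pits I fell into myself at finance, SaaS, and e-commerce clients.

There are three core keywords: Terraform, GitLab CI/CD, and AWS cloud infrastructure. The question this pipeline answers is not "can it work" but "can it withstand real production pressure: 200 changes a day, three environments released in parallel, and audit requirements that trace every line of code".

It suits two kinds of readers. The first is the engineer who has just moved over from manual operations and wants to build an IaC (Infrastructure as Code) workflow systematically. The second is the technical lead who needs an infrastructure approach that can pass China's MLPS Level 3 certification (等保三级), satisfy financial audits, and still be something the development team is willing to maintain long term.

This article does not teach basic Terraform syntax or how to install GitLab CI; the documentation covers both. What I want to take apart is why you must use a remote state backend instead of a local file, why CI must not run `terraform apply -auto-approve` directly, and why GitLab's protected branch rules should be tied to Terraform workspaces. Every one of these decisions was paid for with hard experience.

2. Overall architecture and the reasoning behind key decisions

2.1 Why Terraform rather than CloudFormation or CDK

Many people's first reaction is "AWS's own tool must be more stable", but in real projects we dropped CloudFormation because of three hard flaws:

- Once templates nest more than five levels deep, error messages become unreadable. For example, when a parameter fails to pass through, the error points at line 87 of the JSON, while the real fault is an undeclared Outputs reference in a template three levels down.
- Cross-account deployment is extremely awkward. CFN StackSets exist, but the permission model is complex and a single update takes more than 15 minutes, while our normal release cadence is 1 to 2 times per hour.
- Collaboration cost is high. JSON/YAML is very unfriendly to frontend or test engineers, and reviews regularly turn into arguments about "which resource does this Ref actually point to".

CDK looks more advanced, but its abstraction layer is too thick. When you need fine control over the Condition block of an IAM policy document, or need to debug the Origin Request event structure of a Lambda@Edge function, the CFN template that CDK generates underneath becomes a black box.

Terraform's advantage is "controllable transparency". HCL reads close to natural language, and `count` and `for_each` are easy to follow. The provider ecosystem is mature, and the AWS provider ships updates roughly every week. Most importantly, its state mechanism is explicit: you always know where the current state lives, who is changing it, and what changed. We settled on Terraform 1.5 because, in our experience, releases before 1.4 had incomplete concurrent-lock support for workspaces. We once had two CI jobs apply to the same workspace at the same time, the state file was overwritten, and the rollback took six hours.

2.2 Why GitLab CI/CD rather than GitHub Actions or Jenkins

We did not choose GitLab for the "all-in-one suite". We chose it for three things we could not replace:

- The built-in combination of Protected Branches + Merge Request Approvals + Pipeline Policies gives you "code as policy" out of the box. For example, we require every change to tf files in the prod environment to be approved by two SREs, and the CI pipeline must pass `terraform validate` and a checkov scan, otherwise the MR cannot be merged. With GitHub Actions we had to combine Code Owners with third-party actions, which was less stable. Jenkins relies entirely on scripts, permissions are scattered, and auditing is hard.
- GitLab's layered CI variables (project/group/instance level) match multi-environment needs very well. The dev environment uses the `AWS_ACCESS_KEY_ID_DEV` variable and staging uses `AWS_ACCESS_KEY_ID_STAGING`. These keys are marked as masked in the GitLab UI so they do not show up in job logs, whereas a misconfigured Credentials Binding plugin in Jenkins prints keys in clear text.
- GitLab's Pipeline Editor and CI Lint let newcomers see in real time whether their YAML is valid, which avoids the awkward "pushed, then discovered the pipeline file is malformed, and the whole team is blocked" situation.

We measured the same five-stage pipeline on all three: GitLab started in 12 seconds on average, GitHub Actions in 23 seconds, and Jenkins often took more than 40 seconds because of slow plugin loading.

2.3 Core architecture: three-layer isolation + dual locks

The architecture is not a simple "code → CI → cloud" chain. It is strictly split into three layers:

- Code layer (Git): one directory per environment (`environments/dev/`, `environments/staging/`, `environments/prod/`). Each directory holds an independent Terraform configuration, and all of them share the reusable modules under `modules/` (such as `vpc`, `eks-cluster`, `rds-instance`). The key design point is that `environments/` itself stores no sensitive values; every var is injected through CI variables.
- Execution layer (CI Runner): a self-hosted GitLab Runner (not a shared runner) on Ubuntu 22.04, with Terraform 1.5.7, aws-cli v2, and checkov 2.4.159 preinstalled. The runner uses the docker executor, so every job starts in a clean container and environment pollution is ruled out.
- State layer (Remote State): everything uses the AWS S3 + DynamoDB backend. S3 stores the state file and DynamoDB holds the lock table. One critical detail: the S3 bucket must have versioning and MFA delete enabled, otherwise a deleted state file cannot be recovered. An operator once emptied our bucket by mistake, and versioning is what saved us.

The dual-lock mechanism consists of:

- DynamoDB lock: Terraform acquires the lock automatically before apply, which prevents concurrent conflicts.
- GitLab pipeline lock: the `resource_group: terraform-prod` directive in `.gitlab-ci.yml` makes pipelines for the same environment run serially. If two MRs for prod trigger at the same time, the second one queues until the first finishes.

This is more reliable than depending on the DynamoDB lock alone, because that lock only protects the state and does not serialize the CI execution logic.

3. Core details and hands-on notes

3.1 Terraform backend configuration: avoiding the S3 + DynamoDB pitfalls

Backend configuration looks simple, yet in our experience it is behind about 90% of production incidents. This is the configuration we use (`backend.tf`):

```hcl
terraform {
  backend "s3" {
    bucket         = "mycompany-tfstate-prod"
    key            = "environments/prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "mycompany-tfstate-lock"
    encrypt        = true
    # Declared explicitly so nobody flips it; false is also the default
    skip_region_validation = false
  }
}
```

Three deep pits are buried here.

First, the bucket name must be globally unique and must not contain underscores. S3 bucket names may only contain lowercase letters, digits, and hyphens (plus dots). We once named a bucket `my_company-tfstate`, `terraform init` failed with `InvalidBucketName: The specified bucket is not valid`, and it took three hours to realize the underscore was the problem. Solution: always use kebab-case, as in `mycompany-tfstate-prod`.

Second, the `dynamodb_table` must be created in advance, and its partition key must be named `LockID` with type String. This requirement is easy to overlook, and Terraform hard-codes the attribute name. If the table was created with `lock_id` or `id`, apply fails with `ValidationException: The provided key element does not match the schema`. The table has to be created like this:

```bash
aws dynamodb create-table \
  --table-name mycompany-tfstate-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST
```

Third, keep `skip_region_validation = false`. The option controls whether Terraform validates the region given to the backend. It already defaults to `false`; we declare it explicitly so that nobody switches it to `true` to silence an error. The scenario that bit us was a CI runner living in us-west-2 while the state bucket lived in us-east-1. With validation skipped, a wrong region slips through silently, and the later `terraform plan` fails to read state with `Failed to load state: InvalidAccessKeyId`, because the request goes to the S3 endpoint of the wrong region. With validation on, the region mismatch surfaces right at init time and is obvious at a glance.

Note: the Terraform execution role must be explicitly allowed `s3:GetObject`, `s3:PutObject`, and `s3:ListBucket` on the state bucket, plus `dynamodb:GetItem`, `dynamodb:PutItem`, and `dynamodb:DeleteItem` on the lock table. If any one of them is missing, apply hangs at "Acquiring state lock" and eventually fails with `Error acquiring the state lock`.

3.2 GitLab CI variable management: keeping keys both safe and usable

GitLab CI variables come in three levels, and we follow least privilege strictly:

- Project level: environment-specific variables such as `AWS_ACCESS_KEY_ID_DEV` and `AWS_SECRET_ACCESS_KEY_DEV`, used only for the dev environment.
- Group level: variables shared across environments, such as `TF_VAR_project_name=myapp` and `TF_VAR_region=us-east-1`, inherited automatically by all child projects.
- Instance level: used sparingly. It only holds `TF_CLI_CONFIG_FILE=/dev/null`, which stops the Terraform CLI from reading `~/.terraformrc` so that local configuration cannot pollute the CI environment.

There are three key techniques.

Variable naming convention. Every AWS key variable must carry an environment suffix, such as `AWS_ACCESS_KEY_ID_STAGING`, and never the bare `AWS_ACCESS_KEY_ID`. The CI script can then resolve the right variable dynamically. Bash cannot nest one expansion inside another, so the lookup goes through indirect expansion:

```bash
# Build the per-environment variable names, then dereference them with ${!name}
key_var="AWS_ACCESS_KEY_ID_${ENVIRONMENT^^}"
secret_var="AWS_SECRET_ACCESS_KEY_${ENVIRONMENT^^}"
export AWS_ACCESS_KEY_ID="${!key_var}"
export AWS_SECRET_ACCESS_KEY="${!secret_var}"
```

`${ENVIRONMENT^^}` converts lowercase to uppercase, so with `ENVIRONMENT=staging` the script picks up `AWS_ACCESS_KEY_ID_STAGING`.

The masked-variable trap. GitLab's masking only hides strings in the log that match the variable's value exactly. If your key is `abc123def456` and the log prints `Key: abc123def456`, it is masked. If the value is printed in a transformed form, for example escaped inside a JSON payload or base64-encoded, the match fails and the key is exposed in clear text. We therefore forbid `echo $AWS_SECRET_ACCESS_KEY` in CI scripts, and all debugging goes through `terraform output -json` instead.

Automated secret rotation. We use an AWS Lambda function that listens for IAM access key rotation events and updates the GitLab group variables automatically. The core logic of the script:

```python
# Lambda handler (excerpt): gitlab_api and env are defined elsewhere in the script
def lambda_handler(event, context):
    # Access key ID carried by the rotation event
    new_key = event["detail"]["requestParameters"]["accessKeyId"]
    gitlab_api.update_variable(
        group_id="mygroup",
        key=f"AWS_ACCESS_KEY_ID_{env}",
        value=new_key,
        masked=True,
    )
```

This removes the risk of a step being missed in a manual procedure.

3.3 Modular design: why modules/vpc is split into four parts

Beginners often put every VPC resource into a single `main.tf`, but production environments have to split it up. Our `modules/vpc` directory looks like this:

```
modules/vpc/
├── main.tf              # defines the VPC resource; outputs vpc_id, cidr_block
├── subnets/             # subnet modules, split by AZ and purpose
│   ├── public.tf
│   ├── private.tf
│   └── database.tf
├── security_groups/     # security group modules, split by service
│   ├── eks-node-sg.tf
│   └── rds-sg.tf
└── route_tables/        # route table module
    └── nat-gateway.tf
```

The reasons for splitting are practical:

- Permission isolation: `subnets/public.tf` is maintained by the network team and `security_groups/rds-sg.tf` by the DBA team. GitLab's path-based restrictions can limit an MR to specific paths, which prevents out-of-scope changes.
- Reuse granularity: `subnets/private.tf` can be used by both EKS and RDS, but `security_groups/eks-node-sg.tf` takes a `worker_node_role_arn` parameter that RDS has no use for. Fine-grained splitting makes reuse more precise.
- Plan speed: `terraform plan -target=module.vpc.module.subnets` plans the subnet changes alone without scanning the whole VPC module. In our measurements a full VPC plan took 42 seconds, while planning only the subnets took 8 seconds.

Note: dependencies between modules must be declared explicitly. For example, `subnets/private.tf` needs the `vpc_id` output of `main.tf`, so it has to be defined in `variables.tf`:

```hcl
variable "vpc_id" {
  description = "ID of the VPC this subnet belongs to"
  type        = string
}
```

It is then passed in when the module is called from `environments/prod/main.tf`:

```hcl
module "private_subnets" {
  source = "../../modules/vpc/subnets/private"
  vpc_id = module.vpc.vpc_id # explicit dependency, avoids implicit cycles
}
```

4. Walkthrough and implementation of the core stages

4.1 CI pipeline design: seven stages from MR submission to production

Our `.gitlab-ci.yml` is not a simple plan/apply pair. It has seven strictly sequential stages, and a failure in any stage stops the pipeline:

```yaml
stages:
  - validate
  - lint
  - plan
  - security-scan
  - approve
  - apply
  - notify

# The MR target branch (dev, staging, prod) selects the environment for every job
variables:
  ENVIRONMENT: $CI_MERGE_REQUEST_TARGET_BRANCH_NAME

validate:
  stage: validate
  script:
    - terraform init -backend-config="bucket=${TF_STATE_BUCKET}" -backend-config="key=environments/${ENVIRONMENT}/terraform.tfstate"
    - terraform validate
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'

lint:
  stage: lint
  script:
    - terraform fmt -check -diff
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'

plan:
  stage: plan
  script:
    - terraform init -backend-config="bucket=${TF_STATE_BUCKET}" -backend-config="key=environments/${ENVIRONMENT}/terraform.tfstate"
    - terraform plan -out=tfplan.binary -var="environment=${ENVIRONMENT}"
  artifacts:
    paths:
      - tfplan.binary
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'

security-scan:
  stage: security-scan
  script:
    - checkov -d . --framework terraform --quiet --compact
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'

approve:
  stage: approve
  script:
    - echo "Waiting for manual approval..."
  rules:
    # Blocking manual gate, prod only
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event" && $CI_MERGE_REQUEST_TARGET_BRANCH_NAME == "prod"'
      when: manual
      allow_failure: false

apply:
  stage: apply
  script:
    - terraform init -backend-config="bucket=${TF_STATE_BUCKET}" -backend-config="key=environments/${ENVIRONMENT}/terraform.tfstate"
    - terraform apply -input=false tfplan.binary
  rules:
    # On prod this stage is reached only after the approve job has been run
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'

notify:
  stage: notify
  script:
    - 'curl -X POST -H "Content-Type: application/json" -d "{\"text\":\"Terraform ${ENVIRONMENT} applied successfully\"}" "$SLACK_WEBHOOK"'
  rules:
    # Runs only if apply succeeded (default when: on_success)
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
```

Key design points:

- `rules` logic: every job is limited to merge request pipelines (`$CI_PIPELINE_SOURCE == "merge_request_event"`). Stages run strictly in order, and on prod the `approve` job blocks the pipeline, so apply cannot start until validate through security-scan have passed and someone has approved. This removes the risk of an MR triggering apply directly. The source value `pipeline` marks multi-project pipelines, so it cannot be used to mean "triggered by the previous stage", and job status is not available when rules are evaluated.
- Passing artifacts: the `tfplan.binary` produced in the plan stage is stored as an artifact and reused in the apply stage. This is essential. If the apply stage ran `terraform plan` again, a state change in between could make the actual execution differ from what was reviewed. The binary plan file guarantees that what was previewed is what gets applied.
- `when: manual`: the approve stage for prod must be clicked by hand, and only users with the Maintainer role are allowed to do it. The GitLab UI shows a play button on the manual job, and apply only starts after it is clicked.
- `allow_failure: false`: set on the manual job as well, so that an unapproved pipeline stays blocked instead of skipping ahead.

4.2 Terraform workspace management: keeping dev configuration from overwriting prod

A workspace is not a "multi-environment switch". It is a "state isolation sandbox". Nobody runs `terraform workspace new` by hand; everything is driven by CI variables:

```bash
# In the CI script
terraform workspace select "${ENVIRONMENT}" || terraform workspace new "${ENVIRONMENT}"
```

The real isolation rests on two things.

The backend key path is bound to the environment. The key is `environments/${ENVIRONMENT}/terraform.tfstate`, supplied at init time through `-backend-config` because backend blocks cannot interpolate variables. Each workspace therefore maps to a state file at a different S3 path. Even if you run `terraform workspace select prod` locally, init fails with `Backend configuration changed` when the backend configuration still points at the dev path.

CI variables inject the environment context. Every `variables.tf` defines an `environment` variable, and `main.tf` uses `count` to control resource creation:

```hcl
resource "aws_instance" "web" {
  count         = var.environment == "prod" ? 3 : 1
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro"
}
```

The dev environment starts a single EC2 instance, prod starts three, and the state files are fully isolated.

Lesson learned: we once forgot to replace the ENVIRONMENT placeholder in `environments/prod/backend.tf`, so the prod pipeline used the dev backend key and the prod state was written to the dev path. The repair: first export the current state with `terraform state pull > prod-state.json`, then run `terraform init -reconfigure -backend-config="key=environments/prod/..."`, and finally `terraform state push prod-state.json`. Do the whole thing offline to avoid a second round of contamination.

4.3 Security and compliance hardening: Checkov scanning and least-privilege IAM

Checkov is not decoration. We run three layers of customized scanning rules.

Base layer: all built-in Terraform rules are enabled (`--framework terraform`), which catches hard-coded passwords, unencrypted S3 buckets, and similar issues.

Company layer: a `custom-checks/` directory with our own checks, for example the no-public-rds rule:

```python
# custom-checks/no-public-rds.py
from checkov.common.models.enums import CheckCategories, CheckResult
from checkov.terraform.checks.resource.base_resource_check import BaseResourceCheck


class NoPublicRDS(BaseResourceCheck):
    def __init__(self):
        name = "RDS instance must not be publicly accessible"
        id = "CKV_AWS_123"
        supported_resources = ["aws_db_instance"]
        categories = [CheckCategories.NETWORKING]
        super().__init__(name=name, id=id, categories=categories,
                         supported_resources=supported_resources)

    def scan_resource_conf(self, conf):
        # Parsed attribute values arrive wrapped in a list
        if conf.get("publicly_accessible", [False])[0]:
            return CheckResult.FAILED
        return CheckResult.PASSED


# Checkov registers the check when the module instantiates it
check = NoPublicRDS()
```

Audit layer: every night, `checkov -d . --framework terraform --output junitxml > report.xml` produces a JUnit report that feeds GitLab CI's Test Reports, and any failure raises an alert.

IAM permissions are even more of a life-or-death matter. We created a dedicated IAM role for the CI runner and trimmed its state-access policy to the minimum:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::mycompany-tfstate-*",
        "arn:aws:s3:::mycompany-tfstate-*/environments/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:DeleteItem"],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/mycompany-tfstate-lock"
    }
  ]
}
```

Never grant `*` wildcard permissions. One team gave its runner AdministratorAccess to save effort, a CI script then deleted an entire S3 bucket by mistake, and the loss was severe.

5. Common problems and troubleshooting notes

5.1 Five root causes of "State lock failed" errors: a quick reference

| Symptom | Root cause | Diagnostic command | Fix |
|---------|------------|--------------------|-----|
| `Error acquiring the state lock` | The DynamoDB table does not exist or the key attribute has the wrong name | `aws dynamodb describe-table --table-name mycompany-tfstate-lock` | Check that the KeySchema uses `LockID`; recreate the table |
| `Lock ID: UUID already exists` | The previous apply was interrupted and the lock was never released | `aws dynamodb get-item --table-name mycompany-tfstate-lock --key '{"LockID":{"S":"UUID"}}'` | If the Info field is empty, remove the item manually with `aws dynamodb delete-item --key ...` |
| `Failed to load state: InvalidAccessKeyId` | The CI variables did not inject the AWS keys correctly | Run `echo $AWS_ACCESS_KEY_ID` inside the CI job | Check that the GitLab variables are masked and that the names match |
| `Backend configuration changed` | The local backend configuration differs from the one in CI | `cat backend.tf`, comparing CI and local | Generate the key from the ENVIRONMENT variable everywhere |
| `Error: Failed to read S3 bucket` | The S3 bucket policy denies access to the CI role | `aws s3 ls s3://mycompany-tfstate-prod/ --profile ci-role` | Add the `s3:ListBucket` permission |

Lesson learned: we keep an `unlock.sh` script on the CI runner that clears a stale lock in one step. It looks for the lock item whose Info metadata records an apply operation:

```bash
#!/bin/bash
# Find the lock item whose Info metadata records an apply operation
LOCK_ID=$(aws dynamodb scan --table-name mycompany-tfstate-lock \
  --query "Items[?Info && contains(Info.S, 'OperationTypeApply')].LockID.S" \
  --output text)
if [ -n "$LOCK_ID" ]; then
  aws dynamodb delete-item --table-name mycompany-tfstate-lock \
    --key "{\"LockID\":{\"S\":\"$LOCK_ID\"}}"
  echo "Unlocked $LOCK_ID"
else
  echo "No lock found"
fi
```

Running it casually against production is strictly forbidden. Always inspect the content of the Info field first. Terraform's own `terraform force-unlock` command releases a lock through the supported path.

5.2 Three hidden killers behind plan and apply disagreeing

This is the most dangerous problem: plan says "add 1 resource" and apply deletes three. The root causes tend to be well hidden.

State drift. Someone changed a resource by hand in the AWS console, for example added a tag to an EC2 instance, without a matching `terraform import` or configuration change. During plan, Terraform compares state with configuration and concludes "the configuration has no tag, so remove it", and during apply the actual resource no longer matches what was planned and the run fails. Solution: detect drift regularly with `terraform plan -refresh-only` (the older `terraform refresh` does the same but is deprecated).

Inconsistent provider versions. Local machines run Terraform 1.4 while CI runs 1.5, or AWS provider 4.60 versus 4.65. A newer provider may change default behavior, for instance a default on `aws_s3_bucket` such as `force_destroy`. Solution: start the CI script with `terraform init -upgrade=false`, and pin the provider in `versions.tf`. Note that `~> 4.60` would still admit 4.65; `~> 4.60.0` allows only 4.60.x patch releases. Committing `.terraform.lock.hcl` pins the exact version for everyone.

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.60.0"
    }
  }
}
```

Wrong variable injection timing. CI exports `TF_VAR_env=prod`, but `variables.tf` declares the variable as `environment` with `default = "dev"`. Terraform's precedence is: `-var` and `-var-file` on the command line override `TF_VAR_` environment variables, which override `default`. If nothing actually reaches the `environment` variable, the default is used silently, so one run can work with dev while another works with prod. Solution: declare every variable without a default and with `nullable = false` so a missing value fails loudly, and have the CI script pass `-var="environment=${ENVIRONMENT}"` explicitly.

5.3 GitLab CI performance tuning: from 30 minutes down to 6

The initial pipeline took 30 minutes, and the bottleneck was `terraform init`. The logs showed that every init downloaded the providers from the HashiCorp registry, and on unstable networks in mainland China a single download often exceeded five minutes. The optimization steps:

Provider cache. Mount the volume `/opt/terraform-plugins` on the CI runner host and configure `/etc/gitlab-runner/config.toml`:

```toml
[[runners]]
  environment = ["TF_PLUGIN_CACHE_DIR=/opt/terraform-plugins"]
```

After the first init the providers are cached in that directory and later jobs reuse them directly.

Separate backend initialization. Split `terraform init` into two steps:

```bash
# Step 1: initialize the local configuration only (fast)
terraform init -backend=false
# Step 2: initialize the backend (only before plan/apply)
terraform init -backend-config="bucket=..." -backend-config="key=..."
```

This avoids connecting to S3 on every validate.

Parallel plan. Plan the environments under `environments/` in parallel with GitLab's `parallel:matrix`, which starts one job per listed value. (`CI_NODE_INDEX` is a predefined numeric index and cannot be overridden with environment names.)

```yaml
plan-all:
  stage: plan
  parallel:
    matrix:
      - ENVIRONMENT: [dev, staging, prod]
  script:
    - cd environments/${ENVIRONMENT}
    - terraform init -backend-config="bucket=${TF_STATE_BUCKET}" -backend-config="key=environments/${ENVIRONMENT}/terraform.tfstate"
    - terraform plan -out=tfplan.binary
```

With the three environments planned in parallel, total plan time dropped from 22 minutes to 6.

One last small trick: we keep a `README.md` under `environments/` whose Markdown table of environment status is generated automatically:

```markdown
| Environment | Last Applied | Resources | State Size |
|-------------|--------------|-----------|------------|
| dev         | 2023-10-05   | 42        | 1.2 MB     |
| staging     | 2023-10-04   | 87        | 2.8 MB     |
| prod        | 2023-10-03   | 156       | 5.3 MB     |
```

A scheduled CI job updates the table. It downloads the state with `aws s3 cp s3://mycompany-tfstate-prod/environments/dev/terraform.tfstate .`, extracts the version number with `jq .serial`, and finally rewrites the README with `sed -i`. The team can see the health of every environment at a glance, which is far more direct than logging in to GitLab and reading pipeline history.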
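The README refresh described above can also be sketched in Python, the language of the article's other helper scripts. This is only a minimal sketch under stated assumptions, not the team's actual job: the function names `summarize_state` and `render_table` are hypothetical, the downloaded file is assumed to follow Terraform's JSON state format with top-level `serial` and `resources` keys, the "Last Applied" date is assumed to be supplied by the caller (for example from the S3 object's last-modified timestamp), and the resource count here counts resource blocks rather than individual instances. A Serial column is added because that is the value `jq .serial` extracts.

```python
import json


def summarize_state(env: str, raw_state: bytes, last_modified: str) -> dict:
    """Summarize one downloaded Terraform state file (JSON state format assumed)."""
    state = json.loads(raw_state)
    return {
        "env": env,
        "last_applied": last_modified,  # assumption: passed in by the caller
        "serial": state["serial"],      # same value that `jq .serial` prints
        "resources": len(state.get("resources", [])),  # resource blocks, not instances
        "size_kb": round(len(raw_state) / 1024, 1),
    }


def render_table(rows: list[dict]) -> str:
    """Render the environment status rows as a Markdown table."""
    header = (
        "| Environment | Last Applied | Serial | Resources | State Size |\n"
        "|-------------|--------------|--------|-----------|------------|"
    )
    lines = [
        f"| {r['env']} | {r['last_applied']} | {r['serial']} "
        f"| {r['resources']} | {r['size_kb']} KB |"
        for r in rows
    ]
    return "\n".join([header, *lines])


if __name__ == "__main__":
    # Hypothetical demo state; a real job would read the file fetched by `aws s3 cp`
    demo = json.dumps({"version": 4, "serial": 17, "resources": [{}, {}]}).encode()
    print(render_table([summarize_state("dev", demo, "2023-10-05")]))
```

Regenerating the whole table from one function, rather than patching single cells with `sed -i`, keeps the README consistent even when an environment is added or removed.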
