Productionizing Jupyter Models: Seven Steps from Notebook to a Highly Available ML Service

Published: 2026/6/6 6:04:03

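1. Project overview: when the Jupyter notebook leaves the lab and carries real business load

"From Notebook to Production: Running ML in the Real World (Part 4)": the title alone makes front-line engineers instinctively rub the back of their neck. It is not about hyperparameter tuning or model architecture. It goes straight at the problem that countless AI projects keep sidestepping and eventually have to face head-on: how does the model you spent three months getting to work in Jupyter reliably absorb 2,000 real-time prediction requests per second during a 3 a.m. order surge? Of the 17 production projects I have led, 12 stalled at Part 4. The model was not the problem; it simply was not ready to "go to work". The "Real World" here is not an abstraction on a slide. It is a Pod that suddenly gets OOM-killed in a Kubernetes cluster, NaN values drifting through production logs, and a business owner asking "recommendation click-through dropped 3% yesterday afternoon, is your model broken?", after which you dig through the monitoring and find that a field in the upstream data pipeline quietly changed from int64 to float64.

The essence of Part 4 is turning code that "runs" into a service you "dare to sign an SLA for". It does not care how high the AUC is, only whether P99 latency stays under 80 ms. It does not discuss F1-score; it watches whether the PSI (Population Stability Index) of the model's input feature distributions has stayed above 0.15 over the past 7 days. It requires you to understand PyTorch tensor operations, Docker image layering, and Prometheus instrumentation conventions at the same time, and to explain to the CTO in three sentences why this release needs a 15-minute service stop for a blue-green switch. If you are still running pickle.dump(model, open("model.pkl", "wb")), then copying the file to a server by hand with scp and starting python serve.py, Part 4 is not a fourth installment for you; it is a fourth mountain. This article dissects the strata of that mountain, finds the most viable route up, and points out where the black ice is and where you can get a foothold. Every detail comes from ground our team actually covered in three settings: e-commerce search, financial risk control, and industrial equipment prediction.

2. Core design rationale: why notebook code cannot simply be thrown into production

2.1 From interactive exploration to deterministic service: a fundamental break in the underlying logic

Jupyter Notebook is designed for exploration and narrative. Cells can run in any order, variables share one global scope, and output renders immediately. That flexibility is a blessing for debugging and teaching and a disaster for production. The most typical incident I have seen: a recommendation model scored AUC = 0.82 in the notebook, and CTR collapsed on the first day after launch. After three days of investigation, the cause turned out to be a cell in which someone had manually run df["user_id"] = df["user_id"].astype(str). That line never made it into the final training script, and in production pandas read the user_id column from the raw data as int64 by default, so the bucket indices produced by feature hashing were completely scrambled. The notebook's dependence on hidden state inherently violates the iron rule of production: reproducible, auditable, and rollback-capable.

A real production service must guarantee that the same input yields exactly the same output, whenever and wherever it runs. That means cutting off the notebook "magic" entirely: all preprocessing logic must be frozen into standalone, side-effect-free Python functions; all random seeds must be set once at the entry point and passed through; all external dependencies (config file paths, S3 bucket names) must be injected via environment variables or a config center rather than hard-coded in a cell. Our team enforces a "read-only notebook" rule: analysis notebooks may not contain write operations such as model.fit() or joblib.dump(), and their only deliverables are version-controlled .py modules plus an auto-generated requirements.txt. This looks like extra work, but it turns the fuzzy question "why did it run fine last time and not now?" into the traceable question "which commit introduced the change?".

2.2 The model as an API: serving is a rebuild, not a wrapper

Many teams read "model deployment" as "wrap the notebook's predict function in Flask". That is the most dangerous illusion. Real serving rebuilds the whole deliverable around three pillars: interface contract, resource boundary, and fault isolation. Take the anti-fraud model we built for a bank. In the notebook it was one line, y_pred = model.predict_proba(X_test)[:, 1]; in production it has to be broken down into:

- Input contract: an explicit JSON Schema stating that transaction_amount must be a positive float and that device_fingerprint must be 32 to 64 characters long; anything else gets 400 Bad Request immediately.
- Resource contract: hard container limits, for example docker run --memory=2g --cpus=2.0 (these are runtime flags, not Dockerfile instructions), so that one abnormal request cannot exhaust the node's memory.
- Fault contract: when the model raises a ValueError internally (for example, a missing feature), the service does not return a 500. It catches the error and returns the standardized error code ERR_MODEL_INPUT_INVALID with a suggested fix, so that upstream callers can degrade gracefully.

After this rebuild, predict.py is no longer a derivative of the notebook. It is a standalone microservice that follows the OpenAPI 3.0 specification, ships a /healthz health-check endpoint, handles SIGTERM for graceful restarts, and has built-in retry and circuit-breaking logic (for example, the tenacity library for retries). When we load-tested the un-rebuilt service with locust, latency shot above 2 s just past 300 QPS; after the rebuild, the same hardware sustained 1,200 QPS with P95 < 50 ms. The difference was not the algorithm; it was the rigor of the service contract.

2.3 Dual lifecycle management of data and model: why monitoring matters more than deployment

The decisive battlefield of Part 4 is never the moment of deployment; it is hour 37 after deployment. Model decay is not a theoretical risk but a daily reality. We monitored an e-commerce price-prediction model: it performed perfectly in its first week, the prediction error (MAE) crept upward from the second week, and the alert fired in the third. Root-cause analysis showed that after an upstream supply-chain system upgrade, the inventory_status field had gained a new enum value, PRE_ORDER, while the model had only seen IN_STOCK and OUT_OF_STOCK during training. The feature space had expanded quietly and the model had no idea.

The design core of Part 4 must therefore be a closed loop of joint data-and-model monitoring:

- At the data pipeline exit (where the ETL completes), compute a statistical summary of every feature: mean, variance, null rate, and distribution histogram.
- At the model service entry (the pre-predict stage), collect the same summary and quantify the difference between old and new distributions with PSI or a KS test. When PSI > 0.1, raise an alert automatically and freeze traffic to the model.
- The model itself must emit a confidence score (for example, the second column of LightGBM's predict_proba). When the share of low-confidence samples in a batch exceeds a threshold (for example, 15%), a manual review is triggered as well.

This mechanism is not a nice-to-have; it is a survival requirement. It moved the team from reactive firefighting to proactive immunity and cut the mean time to respond to a model failure (MTTR) from 47 hours to 22 minutes.

3. Key implementation: the seven-step alchemy from code to service

3.1 Step 1: atomically extracting modules from the notebook (with working code)

Extraction is not copy-and-paste; it is a surgical refactor. Take a typical time-series forecasting notebook whose original structure is:

    # Cell 1: data loading
    df = pd.read_parquet("s3://data-lake/raw/sales.parquet")
    df["date"] = pd.to_datetime(df["date"])

    # Cell 2: feature engineering
    df["day_of_week"] = df["date"].dt.dayofweek
    df["rolling_mean_7d"] = df["sales"].rolling(7).mean()

    # Cell 3: model training
    from sklearn.ensemble import RandomForestRegressor
    model = RandomForestRegressor(n_estimators=100)
    X_train, y_train = prepare_features(df[df["date"] < "2023-01-01"])
    model.fit(X_train, y_train)

    # Cell 4: save the model
    import joblib
    joblib.dump(model, "models/rf_v1.joblib")

The correct way to extract it takes four steps.

First, create a features/ directory and wrap the Cell 2 logic in pure functions: a DataFrame in, an enriched DataFrame out, no external dependencies.

    # features/engineer.py
    from typing import List, Optional

    import pandas as pd


    def add_temporal_features(df: pd.DataFrame, date_col: str = "date") -> pd.DataFrame:
        """Add calendar features; date_col is converted to datetime first."""
        df = df.copy()
        df[date_col] = pd.to_datetime(df[date_col])
        df["day_of_week"] = df[date_col].dt.dayofweek
        df["is_weekend"] = (df[date_col].dt.dayofweek >= 5).astype(int)
        return df


    def add_rolling_features(
        df: pd.DataFrame,
        target_col: str = "sales",
        windows: Optional[List[int]] = None,
    ) -> pd.DataFrame:
        """Add rolling statistics, including the NaN fill logic."""
        df = df.copy()
        # Avoid a mutable default argument; fall back to the 7- and 30-day windows
        for window in (windows or [7, 30]):
            col_name = f"{target_col}_rolling_mean_{window}d"
            df[col_name] = df[target_col].rolling(window=window).mean()
            # Key point: forward-fill, then zero-fill, so NaN never reaches the model
            df[col_name] = df[col_name].ffill().fillna(0)
        return df

Second, create a models/ directory and turn the Cell 3 logic into a configurable class that separates training from inference.

    # models/regressor.py
    from typing import Any, Dict, List, Optional

    import joblib
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor


    class SalesPredictor:
        def __init__(self, config: Dict[str, Any]):
            self.config = config
            self.model: Optional[RandomForestRegressor] = None
            self.feature_names: Optional[List[str]] = None

        def train(self, X: pd.DataFrame, y: pd.Series) -> None:
            """Training entry point; X must already contain every required feature."""
            self.feature_names = X.columns.tolist()
            self.model = RandomForestRegressor(
                n_estimators=self.config.get("n_estimators", 100),
                max_depth=self.config.get("max_depth", 10),
                random_state=42,  # fixed seed
            )
            self.model.fit(X, y)

        def predict(self, X: pd.DataFrame) -> pd.Series:
            """Inference entry point; enforces the feature columns."""
            if set(self.feature_names) != set(X.columns):
                missing = set(self.feature_names) - set(X.columns)
                extra = set(X.columns) - set(self.feature_names)
                raise ValueError(f"Feature mismatch! Missing: {missing}, Extra: {extra}")
            # Reorder columns to the training order before predicting
            return pd.Series(self.model.predict(X[self.feature_names]), index=X.index)

        def save(self, path: str) -> None:
            joblib.dump(
                {
                    "model": self.model,
                    "feature_names": self.feature_names,
                    "config": self.config,
                },
                path,
            )

        @classmethod
        def load(cls, path: str) -> "SalesPredictor":
            data = joblib.load(path)
            instance = cls(data["config"])
            instance.model = data["model"]
            instance.feature_names = data["feature_names"]
            return instance

Third, create train.py as the main training script. It replaces Cells 1, 3, and 4 and is driven by configuration.

    # train.py
    import argparse

    import pandas as pd
    import yaml

    from features.engineer import add_rolling_features, add_temporal_features
    from models.regressor import SalesPredictor


    def main() -> None:
        parser = argparse.ArgumentParser()
        parser.add_argument("--data-path", type=str, required=True)
        parser.add_argument("--model-output", type=str, required=True)
        parser.add_argument("--config", type=str, default="configs/train_config.yaml")
        args = parser.parse_args()

        # Load the YAML config
        with open(args.config) as f:
            config = yaml.safe_load(f)

        # Load and process the data (stands in for the S3 read)
        df = pd.read_parquet(args.data_path)
        df = add_temporal_features(df)
        df = add_rolling_features(df)

        # Prepare the training data
        feature_cols = [
            "day_of_week",
            "is_weekend",
            "sales_rolling_mean_7d",
            "sales_rolling_mean_30d",
        ]
        train_mask = df["date"] < "2023-01-01"
        X_train = df.loc[train_mask, feature_cols]
        y_train = df.loc[train_mask, "sales"]

        # Train and save
        predictor = SalesPredictor(config["model"])
        predictor.train(X_train, y_train)
        predictor.save(args.model_output)
        print(f"Model saved to {args.model_output}")


    if __name__ == "__main__":
        main()

Fourth, create configs/train_config.yaml to move magic numbers out of the code, which also paves the way for A/B tests.

    model:
      n_estimators: 200
      max_depth: 15
      min_samples_split: 10
    data:
      train_start: "2022-01-01"
      train_end: "2022-12-31"
      test_start: "2023-01-01"

Tip: after extraction, the original notebook should keep only data exploration, visualization, and result analysis code, with this comment at the top: # THIS NOTEBOOK IS FOR ANALYSIS ONLY. PRODUCTION CODE IS IN ./src/. Our team uses Git hooks to verify that notebooks contain neither joblib.dump nor model.fit; offending commits are rejected outright.

3.2 Step 2: building a reproducible Docker image (Dockerfile walkthrough)

A production image is not "whatever runs"; it should be minimal, stable, and fully known. Rather than a single-stage FROM python:3.9-slim image, we use a multi-stage build. The runtime stage stays on the same Debian-based slim image as the builder, because packages compiled against glibc in the builder cannot be loaded on a musl-based Alpine image, and a bare Alpine image ships no Python interpreter.

    # Dockerfile
    # Build stage: compile dependencies; nothing else from here reaches the final image
    FROM python:3.9-slim AS builder

    # Build tools (needed when lightgbm/xgboost compile from source)
    RUN apt-get update && apt-get install -y \
            build-essential \
            libglib2.0-0 \
        && rm -rf /var/lib/apt/lists/*

    # Copy the dependency manifest and install production dependencies only (no dev packages)
    WORKDIR /app
    COPY requirements.txt .
    RUN pip install --no-cache-dir --upgrade pip
    RUN pip install --no-cache-dir --user -r requirements.txt

    # Runtime stage: slim image that only receives the installed packages
    FROM python:3.9-slim

    # Create a non-root user (a security must-have)
    RUN groupadd -g 1001 mlgroup \
        && useradd -u 1001 -g mlgroup --create-home mluser

    # Copy the installed Python packages into the non-root user's home so they stay readable
    COPY --from=builder --chown=mluser:mlgroup /root/.local /home/mluser/.local

    # Environment variables
    ENV PATH=/home/mluser/.local/bin:$PATH
    ENV PYTHONUNBUFFERED=1
    ENV TZ=Asia/Shanghai

    # Working directory, then switch user
    WORKDIR /app
    COPY --chown=mluser:mlgroup . .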
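    USER mluser

    # Expose the port and declare the health check (uses Python, so no extra tools are needed)
    EXPOSE 8000
    HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
        CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/healthz', timeout=2)" || exit 1

    # Start command: gunicorn manages the processes; FastAPI is an ASGI app, so it needs
    # uvicorn workers (uvicorn must be listed in requirements.txt)
    CMD ["gunicorn", "--bind", "0.0.0.0:8000", "--workers", "4", "--worker-class", "uvicorn.workers.UvicornWorker", "--timeout", "120", "--keep-alive", "5", "app:app"]

Key parameters:

- --workers 4: the common rule of thumb is 2 × CPU cores + 1, which gives 9 on a 4-core machine. We deliberately cap it at 4 because inference is CPU-bound and extra workers would only compete for the same cores.
- --timeout 120: inference may involve heavy feature computation, so we allow 2 minutes instead of the default 30 seconds.
- --keep-alive 5: keep HTTP connections open for 5 seconds to reduce TCP handshake overhead.
- HEALTHCHECK: used by Docker itself, which marks the container unhealthy after 3 consecutive failures. Kubernetes ignores this instruction and relies on the liveness and readiness probes defined in the Deployment.

Note: requirements.txt must pin every dependency version (pip freeze > requirements.txt); range specifiers such as >= are banned. We once had pandas>=1.4.0 resolve to 1.5.3, where a change in pd.concat behavior misaligned production data and cost us 2 hours of orders.

3.3 Step 3: writing a robust FastAPI service (full code)

We chose FastAPI over Flask for three reasons: auto-generated OpenAPI documentation, built-in Pydantic validation, and async support. The core of app.py:

    # app.py
    import logging
    import time
    from typing import Any, Dict, List, Optional

    import numpy as np
    import pandas as pd
    from fastapi import BackgroundTasks, Depends, FastAPI, HTTPException
    from pydantic import BaseModel, Field, validator

    from features.engineer import add_rolling_features, add_temporal_features
    from models.regressor import SalesPredictor

    # Structured JSON log lines, easy for ELK to ingest
    logging.basicConfig(
        level=logging.INFO,
        format='{"time": "%(asctime)s", "level": "%(levelname)s", "service": "ml-api", "message": "%(message)s"}',
    )
    logger = logging.getLogger(__name__)

    app = FastAPI(
        title="Sales Prediction API",
        description="Real-time sales forecasting service",
        version="1.0.0",
        docs_url="/docs",  # Swagger UI
        redoc_url=None,
    )

    # Global model instance (singleton) so the model is not reloaded on every request
    _predictor: Optional[SalesPredictor] = None


    def get_predictor() -> SalesPredictor:
        global _predictor
        if _predictor is None:
            try:
                _predictor = SalesPredictor.load("/app/models/sales_rf_v2.joblib")
                logger.info("Model loaded successfully")
            except Exception as e:
                logger.error(f"Failed to load model: {e}")
                raise HTTPException(status_code=500, detail="Model loading failed")
        return _predictor


    # Request body model (strict constraints)
    class PredictionRequest(BaseModel):
        transactions: List[Dict[str, Any]] = Field(
            ...,
            example=[
                {"date": "2023-05-01", "sales": 1200.5},
                {"date": "2023-05-02", "sales": 1350.2},
            ],
        )

        @validator("transactions")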
        def validate_transactions(cls, v):
            if len(v) == 0:
                raise ValueError("At least one transaction required")
            if len(v) > 100:  # guard against DDoS-style oversized requests
                raise ValueError("Max 100 transactions per request")
            return v


    class PredictionResponse(BaseModel):
        predictions: List[float] = Field(..., example=[1250.3, 1380.7])
        latency_ms: float = Field(..., example=42.5)
        model_version: str = Field(..., example="v2.1.0")


    @app.on_event("startup")
    async def startup_event():
        """Warm up the model at application startup (optional)."""
        predictor = get_predictor()
        # Run one prediction on a dummy row so the model is loaded into memory
        dummy_df = pd.DataFrame([{"date": "2023-01-01", "sales": 1000}])
        dummy_df = add_temporal_features(dummy_df)
        dummy_df = add_rolling_features(dummy_df)
        _ = predictor.predict(dummy_df[predictor.feature_names])
        logger.info("Model pre-warmed")


    @app.get("/healthz")
    def health_check():
        """Kubernetes health-check endpoint."""
        return {"status": "ok", "timestamp": int(time.time())}


    @app.post("/predict", response_model=PredictionResponse)
    def predict(
        request: PredictionRequest,
        background_tasks: BackgroundTasks,
        predictor: SalesPredictor = Depends(get_predictor),
    ):
        start_time = time.time()
        try:
            # 1. Convert List[dict] to a DataFrame
            df = pd.DataFrame(request.transactions)

            # 2. Feature engineering (reuses the functions extracted from the notebook)
            df = add_temporal_features(df)
            df = add_rolling_features(df)

            # 3. Feature column check (critical)
            missing_features = set(predictor.feature_names) - set(df.columns)
            if missing_features:
                raise HTTPException(
                    status_code=400,
                    detail=f"Missing required features: {list(missing_features)}",
                )

            # 4. Model prediction
            X = df[predictor.feature_names]
            predictions = predictor.predict(X).tolist()

            # 5. Record the latency (for Prometheus to scrape)
            latency_ms = (time.time() - start_time) * 1000
            logger.info(f"Prediction completed: {len(predictions)} samples, latency={latency_ms:.2f}ms")

            # 6. Record monitoring metrics asynchronously so the main flow is not blocked
            background_tasks.add_task(log_metrics, latency_ms, len(predictions))

            return PredictionResponse(
                predictions=predictions,
                latency_ms=latency_ms,
                model_version="v2.1.0",
            )
        except HTTPException:
            # Re-raise deliberate HTTP errors (such as the 400 above) instead of masking them as 500
            raise
        except ValueError as e:
            logger.warning(f"Input validation error: {e}")
            raise HTTPException(status_code=400, detail=str(e))
        except Exception as e:
            logger.error(f"Prediction failed: {e}")
            raise HTTPException(status_code=500, detail="Internal server error")


    def log_metrics(latency_ms: float, sample_count: int) -> None:
        """Append metrics to a local file; Telegraf collects them later."""
        with open("/tmp/ml_metrics.log", "a") as f:
            f.write(f"{int(time.time())},{latency_ms},{sample_count}\n")

Key safeguards:

- The @validator hook rejects illegal requests during Pydantic parsing, so invalid data never reaches the model.
- background_tasks makes metric logging asynchronous, so file I/O does not block the request path.
- The get_predictor() dependency keeps a single model instance, so the model is not reloaded on every request (loading takes 2 to 3 seconds).
- The startup_event warm-up removes the cold-start latency of the first request.

3.4 Step 4: Kubernetes deployment and configuration (YAML manifest walkthrough)

A production K8s deployment is not a bare kubectl run; it is fine-grained resource orchestration. The core deployment.yaml:

    # deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: sales-predictor
      labels:
        app: sales-predictor
    spec:
      replicas: 3  # at least 3 replicas for high availability
      selector:
        matchLabels:
          app: sales-predictor
      template:
        metadata:
          labels:
            app: sales-predictor
          annotations:
            prometheus.io/scrape: "true"  # enable Prometheus scraping
            prometheus.io/port: "8000"
        spec:
          serviceAccountName: ml-service-account  # bound to RBAC permissions
          containers:
            - name: api
              image: registry.example.com/ml/sales-predictor:v2.1.0
              imagePullPolicy: IfNotPresent
              ports:
                - containerPort: 8000
                  name: http
              env:
                - name: MODEL_PATH
                  value: /app/models/sales_rf_v2.joblib
                - name: LOG_LEVEL
                  value: INFO
              resources:
                requests:
                  memory: 1Gi   # guaranteed minimum memory
                  cpu: 500m     # guaranteed minimum CPU (0.5 core)
                limits:
                  memory: 2Gi   # hard memory cap; the container is OOM-killed above it
                  cpu: 1500m    # hard CPU cap; the container is throttled above it
              livenessProbe:
                httpGet:
                  path: /healthz
                  port: 8000
                initialDelaySeconds: 60  # start probing 60 s after startup
                periodSeconds: 30        # probe every 30 s
                timeoutSeconds: 5        # each probe times out after 5 s
                failureThreshold: 3      # restart only after 3 consecutive failures
              readinessProbe:
                httpGet:
                  path: /healthz
                  port: 8000
                initialDelaySeconds: 30  # readiness probing is more aggressive
                periodSeconds: 10
                timeoutSeconds: 3
                failureThreshold: 1
              volumeMounts:
                - name: model-storage
                  mountPath: /app/models
          volumes:
            - name: model-storage
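              persistentVolumeClaim:
                claimName: ml-model-pvc  # points to a pre-provisioned PVC
    ---
    # Service
    apiVersion: v1
    kind: Service
    metadata:
      name: sales-predictor-svc
    spec:
      selector:
        app: sales-predictor
      ports:
        - port: 80
          targetPort: 8000
      type: ClusterIP  # internal service; external access goes through an Ingress
    ---
    # HorizontalPodAutoscaler (automatic scaling)
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: sales-predictor-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: sales-predictor
      minReplicas: 3
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70  # scale out when CPU utilization exceeds 70%
        - type: Pods
          pods:
            metric:
              name: http_requests_total  # custom metric; requires Prometheus configuration
            target:
              type: AverageValue
              averageValue: 1000  # scale out above 1000 requests per second

Key configuration notes:

- Separate livenessProbe and readinessProbe: the liveness probe tolerates short stalls (such as GC pauses), while the readiness probe is more sensitive so that traffic only reaches healthy instances.
- resources.limits.memory=2Gi is a hard limit: when the container exceeds 2 GiB, Kubernetes OOM-kills it rather than letting it affect other services on the same node.
- The HorizontalPodAutoscaler uses two metrics: CPU utilization covers baseline load, and the custom HTTP request metric handles traffic bursts. The custom metric requires Prometheus together with an adapter that serves it through the custom metrics API, such as prometheus-adapter; kube-state-metrics alone does not provide that API.
- persistentVolumeClaim: model files live on a separate PVC, decoupled from the container lifecycle, which makes hot model updates easier.

3.5 Step 5: end-to-end monitoring and alerting (Prometheus + Grafana configuration)

A production service without monitoring is running naked. Our monitoring stack covers three layers: data, model, and service.

1. Data-layer monitoring (feature drift)

Inject statistics instrumentation into the feature-engineering functions:

    # features/engineer.py (enhanced)
    def add_rolling_features(...):
        # ... existing logic
        # New: record summary statistics of the rolling mean
        if hasattr(add_rolling_features, "metrics"):
            add_rolling_features.metrics["rolling_mean_7d_mean"] = df[col_name].mean()
            add_rolling_features.metrics["rolling_mean_7d_std"] = df[col_name].std()
        return df

    # Initialize the metrics dict
    add_rolling_features.metrics = {}

Expose the values in Prometheus format on the /metrics endpoint:

    # HELP rolling_mean_7d_mean Feature rolling_mean_7d mean value
    # TYPE rolling_mean_7d_mean gauge
    rolling_mean_7d_mean 1250.3

2. Model-layer monitoring (performance decay)

Record predictions against ground-truth labels in the predict function (the upstream caller must send the label):

    # app.py (enhanced)
    @app.post("/predict_with_label")
    def predict_with_label(
        request: PredictionRequestWithLabel,  # adds a labels field
        predictor: SalesPredictor = Depends(get_predictor),
    ):
        # ... prediction logic
        # Compute MAE and report it
        mae = np.mean(np.abs(np.array(predictions) - np.array(request.labels)))
        # Report to Prometheus; PREDICTION_MAE must be a Histogram or Summary,
        # because a Counter has no observe() method
        PREDICTION_MAE.observe(mae)

3. Service-layer monitoring (SLO guarantees)

Use starlette_exporter to expose FastAPI metrics automatically:

    # app.py
    from starlette_exporter import PrometheusMiddleware, handle_metrics

    # prefix="http" yields the http_* metric names used below
    # (the library's default prefix is "starlette")
    app.add_middleware(PrometheusMiddleware, prefix="http")
    app.add_route("/metrics", handle_metrics)

This produces standard metrics such as:

- http_request_duration_seconds_bucket{le="0.1"}: cumulative count of requests completed within 0.1 s, the input for latency quantiles
- http_requests_total{status_code="200"}: number of successful requests
- process_cpu_seconds_total: CPU usage

Key Grafana panels:

| Panel | Monitored expression | Alert threshold | Purpose |
| --- | --- | --- | --- |
| Data Drift Alert | PSI(rolling_mean_7d) | > 0.15, triggers a Slack alert | Detect data-source changes early |
| Model Accuracy Drop | rate(ml_prediction_mae[1h]) | > 0.05, emails the algorithm team | Detect inflection points in model performance |
| Service Latency P95 | histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[1h])) by (le)) | > 100 ms, on-page pop-up alert | Protect the user experience |

Field note: we initially monitored only the service layer. Then a feature drift broke the model while the dashboards showed "all normal", because every request successfully returned a wrong answer. We now require every model service to expose at least 3 data-layer metrics and 2 model-layer metrics; only then did we have an observability system we could trust.

3.6 Step 6: CI/CD pipeline (GitHub Actions example)

Automation is the bedrock of Part 4. Our CI/CD pipeline has three stages: Build, Test, and Deploy.

    # .github/workflows/ml-deploy.yml
    name: ML Model Deployment
    on:
      push:
        branches: [main]
        paths:
          - "src/**"
          - "configs/**"
          - requirements.txt
    jobs:
      build-and-test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v3
          - name: Set up Python
            uses: actions/setup-python@v4
            with:
              python-version: "3.9"
          - name: Install dependencies
            run: |
              python -m pip install --upgrade pip
              pip install -r requirements.txt
              pip install pytest pytest-cov
          - name: Run unit tests
            run: pytest tests/ --cov=src --cov-report=xml
          - name: Build Docker image
            uses: docker/build-push-action@v4
            with:
              context: .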
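              push: false  # build only; a real pipeline must push the image before the deploy jobs can pull it
              tags: ${{ github.sha }}
          - name: Upload artifact
            uses: actions/upload-artifact@v3
            with:
              name: model-image
              path: ./Dockerfile

      deploy-to-staging:
        needs: build-and-test
        runs-on: ubuntu-latest
        if: github.ref == 'refs/heads/main'  # main branch only
        steps:
          - uses: actions/checkout@v3
          - name: Deploy to Staging K8s
            uses: appleboy/kubectl-action@v2.4.0
            with:
              kubeconfig: ${{ secrets.KUBECONFIG_STAGING }}
              namespace: ml-staging
              command: |
                kubectl set image deployment/sales-predictor api=registry.example.com/ml/sales-predictor:${{ github.sha }}
                kubectl rollout status deployment/sales-predictor --timeout=120s
          - name: Smoke Test
            run: |
              # Call the staging API as a smoke test
              curl -s https://staging-api.example.com/predict \
                -H "Content-Type: application/json" \
                -d '{"transactions":[{"date":"2023-05-01","sales":1000}]}' \
                | jq -e '.predictions[0] > 0' > /dev/null

      deploy-to-prod:
        needs: deploy-to-staging
        runs-on: ubuntu-latest
        if: github.ref == 'refs/heads/main'
        # Assumed environment name; required reviewers on it are what actually gate this job
        environment: production
        steps:
          - name: Manual approval required
            uses: actions/github-script@v6
            with:
              script: |
                console.log("Waiting for manual approval...")
                // Manual approval step; an Owner must confirm
          - name: Deploy to Prod K8s
            uses: appleboy/kubectl-action@v2.4.0
            with:
              kubeconfig: ${{ secrets.KUBECONFIG_PROD }}
              namespace: ml-prod
              command: |
                # Blue-green deployment: roll out the new version first
                kubectl set image deployment/sales-predictor-green api=registry.example.com/ml/sales-predictor:${{ github.sha }}
                kubectl rollout status deployment/sales-predictor-green --timeout=120s
                # Switch traffic by patching the Service selector
                kubectl patch service sales-predictor-svc -p '{"spec":{"selector":{"app":"sales-predictor-green"}}}'

Pipeline design philosophy:

- Staging is always tested: every deployment to staging is followed by an automatic smoke test that verifies API connectivity and basic functionality.
- Production is tightly controlled: it requires manual approval (GitHub Environments approval) and uses blue-green deployment, so all Pods of the new version must be Ready before traffic is switched.
- Fail fast: if any stage fails, the pipeline stops, which keeps a bad version from spreading.

Note: we disable auto-merge. Every PR must be reviewed jointly by two ML engineers and one SRE, with particular attention to the idempotency of the features/ functions, the exception handling in the models/ classes, and the security configuration of the Dockerfile.

3.7 Step 7: model hot updates and rollback (zero-downtime practice)

The business cannot wait for you to run kubectl delete pod. The core of our hot-update scheme is decoupling model files from the service process.

Model storage architecture: model files live on a dedicated PVC (ml-model-pvc) mounted at /app/models/ in every Pod. Each model version is stored as a separate sub...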
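
The decoupling idea above (model files on a shared volume, independent of the serving process) can be illustrated with a small in-process reloader. This is only a sketch under assumptions that the article does not state: the names HotModelHolder, publish_model, and pickle_loader are hypothetical, change detection by file modification time is one possible choice, and pickle stands in for whatever serializer the service actually uses.

```python
import os
import pickle
import threading
from typing import Any, Callable, Optional


def publish_model(obj: Any, path: str) -> None:
    """Write to a temp file next to the target, then rename atomically,
    so a reader never observes a half-written model file."""
    tmp = f"{path}.tmp"
    with open(tmp, "wb") as f:
        pickle.dump(obj, f)
    os.replace(tmp, path)


def pickle_loader(path: str) -> Any:
    """Hypothetical loader; swap in joblib.load or similar as needed."""
    with open(path, "rb") as f:
        return pickle.load(f)


class HotModelHolder:
    """Keeps the current model in memory and swaps it when the file changes."""

    def __init__(self, path: str, loader: Callable[[str], Any]):
        self._path = path
        self._loader = loader
        self._lock = threading.Lock()
        self._model: Optional[Any] = None
        self._mtime_ns: Optional[int] = None
        self.refresh()  # fail fast if the initial model cannot be loaded

    def refresh(self) -> bool:
        """Reload if the file's mtime changed; return True when a swap happened."""
        mtime_ns = os.stat(self._path).st_mtime_ns
        if mtime_ns == self._mtime_ns:
            return False
        # Load outside the lock so in-flight requests keep using the old model.
        # If loading raises, the previous model stays in place.
        new_model = self._loader(self._path)
        with self._lock:
            self._model = new_model
            self._mtime_ns = mtime_ns
        return True

    def get(self) -> Any:
        with self._lock:
            return self._model
```

A service could call refresh() from a periodic background task and have request handlers call get(), so in-flight requests finish on the old model while new ones pick up the swapped instance. Rolling back then amounts to publishing the previous version's file again.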
