Qwen3.6-27B-AWQ 16 路统一 Docker vLLM 集群部署报告

发布时间:2026/7/3 9:37:47

Qwen3.6-27B-AWQ 16 路统一 Docker vLLM 集群部署报告 4 × RTX 4090 (8卡) 服务器 | Docker Nginx 负载均衡 | 32 GPU 并发推理一、项目概述本报告记录了将 4 台配备 8× NVIDIA GeForce RTX 4090 GPU 的服务器通过 Docker 容器化技术部署 vLLM 推理服务并通过 Nginx 构建 16 路负载均衡集群的完整过程。集群目标对外提供统一的 OpenAI-compatible API 服务入口支持 Qwen3.6-27B-AWQ 模型的高并发推理32 张 GPU 并行实现自动故障恢复与开机自启动单节点故障不影响整体服务可用性二、硬件环境服务器内网 IPGPU 型号GPU 数量部署方式GPU-0010.255.xxx.00RTX 4090 (24GB)8Docker systemdGPU-0110.255.xxx.01RTX 4090 (24GB)8Docker systemdGPU-0210.255.xxx.02RTX 4090 (24GB)8Docker systemdGPU-0310.255.xxx.03RTX 4090 (24GB)8Docker systemd总计32 张 RTX 4090 768 GB GPU 显存三、软件环境操作系统Ubuntu 24.04 LTS (Noble)DockerDocker 24.x docker-compose 1.29.2NVIDIA 驱动580.159.03CUDA 版本13.0 / CUDA Toolkit 12.8Python3.10.12 (deadsnakes PPA)vLLM0.20.2 (自定义构建)PyTorch2.11.0cu128Transformers5.8.0四、部署架构4.1 逻辑架构互联网用户↓XXX.XXX.XXX.XXX:0101 (公网映射)↓Nginx (00 服务器)├─ 00:8000 ── vLLM Docker (GPU 0,1)├─ 00:8001 ── vLLM Docker (GPU 2,3)├─ 00:8002 ── vLLM Docker (GPU 4,5)├─ 00:8003 ── vLLM Docker (GPU 6,7)├─ 0101:8000 ── vLLM Docker (GPU 0,1)├─ 0101:8001 ── vLLM Docker (GPU 2,3)├─ 0101:8002 ── vLLM Docker (GPU 4,5)├─ 0101:8003 ── vLLM Docker (GPU 6,7)├─ 02:8000 ── vLLM Docker (GPU 0,1)├─ 02:8001 ── vLLM Docker (GPU 2,3)├─ 02:8002 ── vLLM Docker (GPU 4,5)├─ 02:8003 ── vLLM Docker (GPU 6,7)├─ 03:8000 ── vLLM Docker (GPU 0,1)├─ 03:8001 ── vLLM Docker (GPU 2,3)├─ 03:8002 ── vLLM Docker (GPU 4,5)└─ 03:8003 ── vLLM Docker (GPU 6,7)4.2 负载均衡策略Nginx 采用least_conn算法将请求分发到当前连接数最少的后端实例。所有后端配置统一Tensor Parallel 2每实例占用 2 张 GPUmax-model-len 262,100256K 上下文kv-cache-dtype fp8节省显存max-num-seqs 8每实例最大 8 条序列五、部署步骤详解5.1 前置准备每台服务器1. 安装 Docker 与 NVIDIA Container Toolkitsudo apt-get updatesudo apt-get install -y docker.io docker-compose# NVIDIA Container Toolkitcurl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpgcurl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed s#deb https://#deb [signed-by/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.listsudo apt-get updatesudo apt-get install -y nvidia-container-toolkitsudo nvidia-ctk runtime configure --runtimedockersudo systemctl restart docker2. 创建 Docker 基础镜像FROM ubuntu:22.04RUN apt-get update apt-get install -y python3.10 python3.10-venv python3.10-dev python3-pip curl wget git rm -rf /var/lib/apt/lists/*RUN ln -sf /usr/bin/python3.10 /usr/bin/python3RUN ln -sf /usr/bin/python3.10 /usr/bin/python3. 复制 Python 虚拟环境从 00 服务器 /opt/venv/vllm 复制到目标服务器修复 python 软链接sudo rm -f /opt/venv/vllm/bin/python3 /opt/venv/vllm/bin/pythonsudo ln -s /usr/bin/python3.10 /opt/venv/vllm/bin/python3sudo ln -s /usr/bin/python3.10 /opt/venv/vllm/bin/python5.2 Docker Compose 配置每台服务器部署 4 个 vLLM 容器实例每个实例使用 2 张 GPUservices:vllm-8000:image: ubuntu-python310:22.04runtime: nvidiarestart: alwaysports: [10.255.XXX.XX:8000:8000]deploy:resources:reservations:devices:- driver: nvidiadevice_ids: [0, 1]capabilities: [gpu]environment:- LD_LIBRARY_PATH/usr/local/cuda/lib64:$LD_LIBRARY_PATH- PATH/opt/venv/vllm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin- NCCL_CUMEM_ENABLE0- NCCL_P2P_DISABLE1volumes:- /data/models:/data/models:ro- /dev/shm:/dev/shm- /opt/venv/vllm:/opt/venv/vllm:ro- /usr/local/cuda:/usr/local/cuda:roshm_size: 16gbcommand: /opt/venv/vllm/bin/python3 -m vllm.entrypoints.cli.main serve/data/models/Qwen3.6-27B-AWQ--served-model-name Qwen3.6-27B-AWQ --trust-remote-code--tensor-parallel-size 2 --gpu-memory-utilization 0.8--max-model-len 262100 --kv-cache-dtype fp8--disable-custom-all-reduce --enforce-eager--distributed-executor-backend mp--reasoning-parser qwen3 --enable-auto-tool-choice--tool-call-parser qwen3_coder --block-size 16--enable-prefix-caching --max-num-seqs 8--port 8000 --host 0.0.0.0--api-key sk-XXX-XXX-XXX-XXX5.3 启动顺序关键由于 NCCL 在容器内并发初始化会竞争 cuMem 资源必须按顺序启动每个实例间隔 60 秒# 必须先启动 8000等待就绪后再启动 8001docker-compose up -d vllm-8000sleep 60 # 等待模型加载完成docker-compose up -d vllm-8001sleep 60docker-compose up -d vllm-8002sleep 60docker-compose up -d vllm-8003sleep 605.4 systemd 开机启动编写 /usr/local/bin/vllm-cluster-start.sh 脚本由 systemd 调用#!/bin/bashCOMPOSE_DIR/home/hmtsgpu/vllm-deploycd $COMPOSE_DIR || exit 1rm -f /dev/shm/nccl-* 2/dev/null || truefor svc in vllm-8000 vllm-8001 vllm-8002 vllm-8003; dodocker compose up -d $svcport${svc##*-}for i in {1..90}; dosleep 2status$(curl -s -o /dev/null -w %{http_code} --max-time 2 http://10.255.XXX.XX:${port}/health 2/dev/null || echo 000)[ $status 200 ] breakdone[ $svc ! vllm-8003 ] sleep 5done六、Nginx 负载均衡配置在 00 服务器上部署 Nginx作为统一 API 入口upstream all_vllm {least_conn;server 10.255.XXX.00:8000;server 10.255.XXX.00:8001;server 10.255.XXX.00:8002;server 10.255.XXX.00:8003;server 10.255.XXX.0101:8000;server 10.255.XXX.0101:8001;server 10.255.XXX.0101:8002;server 10.255.XXX.0101:8003;server 10.255.XXX.02:8000;server 10.255.XXX.02:8001;server 10.255.XXX.02:8002;server 10.255.XXX.02:8003;server 10.255.XXX.03:8000;server 10.255.XXX.03:8001;server 10.255.XXX.03:8002;server 10.255.XXX.03:8003;}七、关键问题与解决方案问题现象根因分析解决方案8002/8003 NCCL cuMemCreate SegfaultGPU 4,5 NUMA 节点未注册NCCL 无法分配 pinned memory重启服务器清理 cuMem 状态或 BIOS 修复 NUMA 设置Docker 无外网拉取镜像服务器无外网访问从 0101 导出 docker save 镜像scp 传输后 docker load 导入venv 复制后 Python 失效软链接指向不存在的python3.10重新 ln -s /usr/bin/python3.10 修复docker-compose Permission denied用户不在 docker 组sudo usermod -aG docker $USER; newgrp dockerDocker 启动失败: mkdir /var/lib/docker/var/lib/docker 软链接指向不存在的 /data/dockermkdir -p /data/docker 创建目标目录Nginx 公网 000 / 本地 200公网 IP 映射在外部防火墙联系网管确认 NAT 映射规则八、验证结果8.1 16 路后端健康检查服务器8000800180028003002002002002000120020020020002200200200200032002002002008.2 公网 API 测试测试命令curl -s http://XXX.XXX.XXX.XXX:0101/v1/chat/completions -H Content-Type: application/json -H Authorization: Bearer sk-XXX-XXX-XXX-XXX -d {model:Qwen3.6-27B-AWQ,messages:[{role:user,content:你好}],max_tokens:32}测试结果✅ 返回完整 JSONHTTP 200模型正常响应九、运维管理手册9.1 常用命令# 查看所有后端状态for ip in 10.255.XXX.00 10.255.XXX.0101 10.255.XXX.02 10.255.XXX.03; doecho $ip for port in 8000 8001 8002 8003; docurl -s -o /dev/null -w $ip:$port → %{http_code} --max-time 2 http://$ip:$port/healthdonedone# 查看 Nginx 访问日志sudo tail -f /var/log/nginx/vllm_access.log# 重启整个集群sudo systemctl restart vllm-cluster.service# 查看单容器日志docker compose -f ~/vllm-deploy/docker-compose.yml logs -f vllm-80029.2 自动恢复机制三层防护Docker 级restart: always — 容器崩溃/退出后自动重启systemd 级vllm-cluster.service — 系统开机按顺序启动全部实例Nginx 级least_conn — 自动跳过不可用的后端请求分发到健康节点9.3 监控指标建议监控项各后端 health 状态每 5 分钟探测Nginx access_log 中的请求延迟rt, uct, uht, urtGPU 显存使用率nvidia-smiDocker 容器重启次数docker inspect .RestartCount十、总结本次部署成功构建了 4 台服务器、16 路 vLLM 实例、32 张 RTX 4090 GPU 的推理集群。通过 Docker 容器化实现了环境统一和快速恢复通过 Nginx 负载均衡实现了高可用和高并发。所有实例参数一致运维管理标准化为后续扩展和维护奠定了坚实基础。

相关新闻