MacOS 上使用 Metal GPU 加速编译 llama.cpp 完整指南-尧图网站设计

本文详细记录了在 MacOS 上使用 Metal GPU 加速编译 llama.cpp 的全过程涵盖 cmake 安装、仓库克隆、编译配置、模型下载、GPU 验证、多模型运行测试以及性能监控工具的使用适合需要在 Apple Silicon Mac 上本地运行大语言模型的开发者参考。1.安装cmakebrewinstallgitcmake2.克隆仓库gitclone https://github.com/ggml-org/llama.cpp.gitcdllama.cpp3. 编译开启 Metal GPU 加速cmake-Bbuild-DGGML_METALON-DGGML_METAL_EMBED_LIBRARYON cmake--buildbuild--configRelease -j$(sysctl-nhw.ncpu)4.模型下载# 安装模型下载客户端pipinstallmodelscope# 下载模型modelscope download\--modelTencent-Hunyuan/HY-MT1.5-7B-GGUF HY-MT1.5-7B-Q4_K_M.gguf License.txt README.md configuration.json\--local_dir./HY-MT1.5-7B-Q4_K_M-GGUF5. 验证 GPU 是否启用./llama-cli--help模型启动参数说明./llama-server\-m/Users/zephyrmuse/Projects/softwares/Qwen3.5-2B-GGUF/Qwen3.5-2B-Q4_K_M.gguf\-c8192\-ngl99\--host0.0.0.0\--port8080\--cache-type-k q8_0\--cache-type-v q8_0\--temp0.7\--jinja\--chat-template-file /Users/zephyrmuse/Projects/softwares/qwen3_nonthinking.jinja-m指定模型权重文件 model-c模型上下文 context length-ngl加载模型的多少层到GPU number of GPU layers--host指定访问的IP地址--port指定访问的端口号访问的模型是通过端口号指定的不需要传递模型名--temp温度temperature--jinjaQwen模型启动的时候不添加 qwen3_nonthinking.jinja 这个文件默认启动思考模式添加这个文件可以关闭思考模型cd/Users/zephyrmuse/Git_projects/llama.cpp/build/bin# Qwen3.5-2B-Q4_K_M 多模态./llama-cli\-m/Users/zephyrmuse/Projects/softwares/Qwen3.5-2B-GGUF/Qwen3.5-2B-Q4_K_M.gguf\-c8192\-n512\-p你好请用中文自我介绍\--temp0.7\--jinja\--chat-template-file /Users/zephyrmuse/Projects/softwares/qwen3_nonthinking.jinja ./llama-server\-m/Users/zephyrmuse/Projects/softwares/Qwen3.5-2B-GGUF/Qwen3.5-2B-Q4_K_M.gguf\-c8192\-ngl99\--host0.0.0.0\--port8080\--cache-type-k q8_0\--cache-type-v q8_0\--temp0.7\--jinja\--chat-template-file /Users/zephyrmuse/Projects/softwares/qwen3_nonthinking.jinja# Qwen3.5-4B-Q4_K_M 多模态./llama-cli\-m/Users/zephyrmuse/Projects/softwares/Qwen3.5-4B-Q4_K_M-GGUF/qwen3-5-4B-Q4_K_M.gguf\-c8192\-n512\-p你好请用中文自我介绍\--temp0.7\--jinja\--chat-template-file /Users/zephyrmuse/Projects/softwares/qwen3_nonthinking.jinja# Qwen3.5-4B-Q4_K_M 多模态./llama-server\-m/Users/zephyrmuse/Projects/softwares/Qwen3.5-4B-Q4_K_M-GGUF/qwen3-5-4B-Q4_K_M.gguf\-c8192\-ngl99\--host0.0.0.0\--port8080\--cache-type-k q8_0\--cache-type-v q8_0\--temp0.7\--jinja\--chat-template-file /Users/zephyrmuse/Projects/softwares/qwen3_nonthinking.jinja# HY-MT1.5-7B-Q4_K_M-GGUF (翻译模型)./llama-server\-m/Users/zephyrmuse/Projects/softwares/HY-MT1.5-7B-Q4_K_M-GGUF/HY-MT1.5-7B-Q4_K_M.gguf\-c8192\-ngl99\--host0.0.0.0\--port8080\--cache-type-k q8_0\--cache-type-v q8_0\--temp0.7# Qwen3.6-35B-A3B-UD-Q3_K_XL.gguf./llama-server\-m/Users/zephyrmuse/Projects/softwares/Qwen3.6-35B-A3B-GGUF/Qwen3.6-35B-A3B-UD-Q3_K_XL.gguf\-c8192\-ngl99\--host0.0.0.0\--port8080\--cache-type-k q8_0\--cache-type-v q8_0\--temp0.7\--jinja\--chat-template-file /Users/zephyrmuse/Projects/softwares/qwen3_nonthinking.jinja# Qwen3.6-35B-A3B-UD-Q5_K_S.gguf./llama-server\-m/Users/zephyrmuse/Projects/softwares/Qwen3.6-35B-A3B-GGUF/Qwen3.6-35B-A3B-UD-Q5_K_S.gguf\-c8192\-ngl99\--host0.0.0.0\--port8080\--cache-type-k q8_0\--cache-type-v q8_0\--temp0.7\--jinja\--chat-template-file /Users/zephyrmuse/Projects/softwares/qwen3_nonthinking.jinja# Qwen3.6-35B-A3B-UD-Q5_K_M.gguf./llama-server\-m/Users/zephyrmuse/Projects/softwares/Qwen3.6-35B-A3B-GGUF/Qwen3.6-35B-A3B-UD-Q5_K_M.gguf\-c2048\-ngl99\--host0.0.0.0\--port8080\--cache-type-k q8_0\--cache-type-v q8_0\--temp0.7\--jinja\--chat-template-file /Users/zephyrmuse/Projects/softwares/qwen3_nonthinking.jinja# Qwen3.6-35B-A3B-UD-Q4_K_S.gguf./llama-server\-m/Users/zephyrmuse/Projects/softwares/Qwen3.6-35B-A3B-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_S.gguf\-c2048\-ngl99\--host0.0.0.0\--port8080\--cache-type-k q8_0\--cache-type-v q8_0\--temp0.7\--jinja\--chat-template-file /Users/zephyrmuse/Projects/softwares/qwen3_nonthinking.jinja# 支持多模态./llama-server\-m/Users/zephyrmuse/Projects/softwares/Qwen3.6-35B-A3B-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_S.gguf\--mmproj/Users/zephyrmuse/Projects/softwares/Qwen3.6-35B-A3B-GGUF/mmproj-BF16.gguf\-c2048\-ngl99\--host0.0.0.0\--port8080\--cache-type-k q8_0\--cache-type-v q8_0\--temp0.7\--jinja\--chat-template-file /Users/zephyrmuse/Projects/softwares/qwen3_nonthinking.jinja# Qwen3.5-27B.Q6_K.gguf 3 tokens/s./llama-server\-m/Users/zephyrmuse/Projects/softwares/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF/Qwen3.5-27B.Q6_K.gguf\-c8192\-ngl99\--host0.0.0.0\--port8080\--cache-type-k q8_0\--cache-type-v q8_0\--temp0.7\--jinja\--chat-template-file /Users/zephyrmuse/Projects/softwares/qwen3_nonthinking.jinja6.测试脚步importosimporttimefromopenaiimportOpenAIfromget_system_promptimportget_system_prompt# 初始化 OpenAI 客户端指向本地 llama-serverclientOpenAI(base_urlhttp://localhost:8080/v1,api_keysk-no-key-required# 本地服务不需要真实 API Key)# 获取系统提示词# system_prompt get_system_prompt()system_prompt你是一个乐于助人的智能助手# 创建对话历史存储messages[{role:system,content:system_prompt}]# Token 估算函数简化版中文约 1.5 字符/token英文约 4 字符/tokendefestimate_tokens(text):ifnottext:return0# 简单估算假设平均每个 token 约 2 个字符中英文混合returnint(len(text)/2)print(*50)print( AI 助手已启动输入退出结束对话)print(*50)# 对话循环turn_count0# 对话轮数计数器whileTrue:# 获取用户输入user_inputinput(\n 用户: ).strip()# 检查退出条件ifuser_input.lower()in[退出,exit,quit,q]:print(\n 感谢使用再见)breakifnotuser_input:continue# 增加轮数turn_count1# 添加用户消息到历史messages.append({role:user,content:user_input})print(\n AI: ,end,flushTrue)try:# 记录请求开始时间start_timetime.perf_counter()first_token_timeNone# 创建流式输出请求streamclient.chat.completions.create(modelqwen3-8b,messagesmessages,streamTrue,temperature0.7)# 流式接收并打印响应full_responseforchunkinstream:ifchunk.choicesandchunk.choices[0].delta.content:contentchunk.choices[0].delta.contentprint(content,end,flushTrue)full_responsecontent# 记录首 token 时间iffirst_token_timeisNone:first_token_timetime.perf_counter()# 记录结束时间end_timetime.perf_counter()print()# 换行# 计算统计指标ttft(first_token_time-start_time)*1000iffirst_token_timeelse0total_time(end_time-start_time)*1000generation_time(end_time-first_token_time)*1000iffirst_token_timeelse0# 估算 token 数current_turn_tokensestimate_tokens(full_response)system_prompt_tokensestimate_tokens(system_prompt)total_context_tokensestimate_tokens(\n.join([m[content]forminmessages]))tokens_per_second(current_turn_tokens/generation_time*1000)ifgeneration_time0else0# 打印统计信息print(\n-*50)print(f 性能统计 (第{turn_count}轮对话):)print(f ⏱️ TTFT (首字节延迟):{ttft:.2f}ms)print(f ⏱️ 总耗时:{total_time:.2f}ms)print(f ⏱️ 生成耗时:{generation_time:.2f}ms)print(f 本轮生成 Tokens:{current_turn_tokens})print(f 生成速度:{tokens_per_second:.2f}tokens/s)print(f 系统提示词 Tokens:{system_prompt_tokens})print(f 上下文总 Tokens:{total_context_tokens})print(-*50)# 添加 AI 回复到历史messages.append({role:assistant,content:full_response})# 控制历史长度保留 system 最近 10 轮对话iflen(messages)21:messages[messages[0]]messages[-20:]turn_count10# 重置轮数显示因为只保留最近10轮exceptExceptionase:print(f\n❌ 发生错误:{e})importtraceback traceback.print_exc()continue备注llama.cpp 启动的模型不需要指定模型名模型的访问是通过端口绑定的7.GPU/CPU 使用查看brewinstallmactopsudomactop 或 pipinstallasitopsudoasitop感谢阅读如果你在 MacOS 上成功跑通了 llama.cpp欢迎在评论区分享你的推理速度。有任何问题也可以留言交流标签llama.cppMacOSMetalGPU本文为原创内容版权归作者所有转载需注明出处。

MacOS 上使用 Metal GPU 加速编译 llama.cpp 完整指南

相关新闻

word删除空白页

SaaS多租户数据隔离：三种核心模式详解与实战选型指南

从PID控制到多传感器融合：智能寻迹机器人设计与实践

从屏幕取词到智能翻译：CuteTranslation如何重塑Linux用户的跨语言工作流

DDrawCompat开源项目：让Windows经典游戏在现代系统重生

GoldHEN金手指管理器：PS4游戏修改的终极完整指南

告别环境混乱！用Miniconda为PyCharm创建专属Jupyter内核（保姆级避坑指南）

3.5字节广告部门一面面经

QRazyBox完整指南：免费开源二维码修复工具的强大功能与应用实践

新闻编辑部正在悄悄部署NotebookLM，你还在用传统剪报法？

XUnity Auto Translator：Unity游戏多语言本地化的终极解决方案

Go语言轻量级分布式任务调度框架Roll：从架构到生产部署实战

2026年十大最佳地区搜索排名优化工具：权威榜单赋能企业高效增长

DDR3内存Row Hammer问题解析与防护方案

为ItsyBitsy ESP32设计3D打印外壳：从原型到产品的完整实践

别再手动点关了！用PowerShell永久关闭Windows Defender的保姆级教程（含Server 2016/2019）

别再只换芯片了！BP2832A替换CL1502，你的电感参数算对了吗？

全平台智能资源下载工具：res-downloader 完整使用教程