BigDL-LLM: A Low-Bit LLM Acceleration Library on Intel XPU - Chen Yina
Chen Yina - AI Frameworks Engineer, Intel Asia-Pacific R&D Ltd.
About this talk:
This talk introduces bigdl-llm, an open-source low-bit LLM acceleration library for running and fine-tuning LLMs (large language models) with very low latency on Intel XPU (both CPU and GPU) using low-bit optimizations (FP4/INT4/NF4/FP8/INT8). It builds on a variety of techniques (such as llama.cpp, GPTQ, bitsandbytes, and QLoRA) and is optimized for Intel CPUs and GPUs. With bigdl-llm, users can build and run LLM applications on Intel platforms (laptops, workstations, servers, and GPUs) using standard PyTorch APIs (such as Hugging Face Transformers and LangChain); a large number of models (such as LLaMA/LLaMA2, ChatGLM2/ChatGLM3, Mistral, Falcon, MPT, Baichuan/Baichuan2, QWen, MOSS, Dolly/Dolly-v2, RedPajama, StarCoder, and Whisper) have been verified and optimized on bigdl-llm.
1. BigDL-LLM: A Low-Bit LLM Acceleration Library on Intel XPU
Chen Yina - Intel Asia-Pacific R&D Ltd., AI Frameworks Engineer
https://github.com/intel-analytics/BigDL/
2. Background
• What is an LLM? A Large Language Model (LLM) is a type of artificial intelligence (AI) model trained on a large dataset of text to generate language outputs that are coherent and natural-sounding. These models have become increasingly popular in recent years because the text they generate is often indistinguishable from human-written text.
• What can LLMs do? LLMs have strong capabilities for understanding natural language and solving complex tasks:
  • Answer questions
  • Generate text
  • Translate languages
  • Summarize text
  • Write different kinds of creative content
  • ...
(Gif from https://jalammar.github.io/how-gpt3-works-visualizations-animations/)
3. Background
• LLM applications
(Gif from https://blog.research.google/2022/04/pathways-language-model-palm-scaling-to.html)
4. Background
• LLMs are "Large"
  Model        FP16 Size   FP32 Size
  Llama2 7B    14G         28G
  Llama2 13B   26G         52G
  Llama2 70B   140G        280G
• Compute resource hungry
• Costly hardware
• Locally deployable LLMs:
  • Data privacy
  • Cost
  • Customization
5. Background
• Low-bit optimization
  Model        FP16 Size   INT4 Size
  Llama2 7B    14G         ~4G
  Llama2 13B   26G         ~7G
  Llama2 70B   140G        ~37G
• Larger models on smaller devices
• Lower latency (memory-bandwidth bound)
• INT4, INT8, NF4, FP4, FP8, GPTQ, AWQ, ...
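The sizes in these tables follow directly from bytes per parameter. Here is a quick sketch of the arithmetic, counting pure weight storage only; the gap between the computed 35G and the quoted ~37G for Llama2 70B at INT4 comes from per-group quantization scales, which this sketch ignores:

    GB = 1e9  # the tables use decimal gigabytes

    def weight_size_gb(num_params: float, bits_per_param: int) -> float:
        """Approximate weight-storage size for a given precision."""
        return num_params * bits_per_param / 8 / GB

    for name, params in [("Llama2 7B", 7e9), ("Llama2 13B", 13e9),
                         ("Llama2 70B", 70e9)]:
        print(f"{name}: FP32 {weight_size_gb(params, 32):.0f}G, "
              f"FP16 {weight_size_gb(params, 16):.0f}G, "
              f"INT4 ~{weight_size_gb(params, 4):.1f}G")
    # e.g. Llama2 7B: FP32 28G, FP16 14G, INT4 ~3.5G (~4G with scale overhead)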
6. Background
• Existing software for efficient LLMs (llama.cpp, bitsandbytes, mlc, exllama, etc.) differs widely in:
  • Model coverage
  • Precision coverage
  • Developer friendliness
  • Performance
  • Hardware coverage
7. BigDL-LLM
• Low-bit LLM library on Intel XPU (Xeon/Core/Flex/Arc/PVC)
• Goal: broadest model support, lowest latency, and smallest memory footprint on Intel XPU
• Support:
  • Compressions: INC/NNCF, llama.cpp/GGML/GPTQ/bitsandbytes/QLoRA/...
  • Low-bit optimizations: INT4/FP4/NF4/INT5/INT8/FP8
  • Intel XPU: Xeon/Core/Flex/Arc/PVC, etc.
  • Any standard PyTorch model and API
  • Low-bit finetuning: QLoRA
https://github.com/intel-analytics/BigDL/
8. Demo: ChatGLM2-6B @ Intel Core CPU (chatglm2-6B, 4 cores @ 12th Gen Intel Core CPU; chatglm2-6B, 12th Gen Intel Core CPU)
9. Demo: LLaMA2-13B & StarCoder-15.5B @ Intel Core CPU (llama2-13B-chat and starcoder-15.5B, 12th Gen Intel Core CPU)
10. Demo: Vicuna-33B @ Intel Xeon CPU (Vicuna-33B, 1 socket @ 4th Gen Intel Xeon Scalable Processor)
11. Demo: LLaMA2-70B @ Intel Xeon CPU (LLaMA2-70B, 1 socket @ 4th Gen Intel Xeon Scalable Processor)
12. Demo: ChatGLM2-6B @ Intel Arc GPU (Intel Arc A770)
13. Demo: LLaMA2-13B @ Intel Arc GPU (Intel Arc A770)
14. Demo: StarCoder-15.5B @ Intel Arc GPU (Intel Arc A770)
15. Demo: LLaMA2-13B & ChatGLM2-6B @ Intel Flex GPU (Intel Data Center GPU Flex 170)
16. Demo: LLaMA2-7B QLoRA finetuning @ Intel PVC GPU (Intel Data Center GPU Max 1550)
17. Support Any PyTorch LLM
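A minimal sketch of this code path, following the optimize_model API described in the BigDL-LLM README; the model path and generation settings are illustrative, and exact signatures may differ between releases:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from bigdl.llm import optimize_model

    model_path = "meta-llama/Llama-2-7b-chat-hf"  # any supported checkpoint

    # Load the model with the standard PyTorch / Transformers API first...
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 torch_dtype=torch.float16)

    # ...then one call converts the weights to a low-bit format (INT4 by
    # default; the low_bit argument selects others such as "nf4" or "sym_int8").
    model = optimize_model(model)

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    inputs = tokenizer("What is AI?", return_tensors="pt")
    with torch.inference_mode():
        output = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))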
18. Support Hugging Face Transformers AutoModel
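A minimal sketch of the drop-in Transformers-style API, as documented in the BigDL-LLM README: change the import to bigdl.llm.transformers and add load_in_4bit=True, leaving the rest of the Hugging Face code untouched. The model choice and prompt below are illustrative:

    from bigdl.llm.transformers import AutoModelForCausalLM
    from transformers import AutoTokenizer

    model_path = "THUDM/chatglm2-6b"  # any validated model works the same way

    # Weights are quantized to INT4 on the fly while loading;
    # load_in_low_bit="nf4" (etc.) selects other precisions instead.
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                              trust_remote_code=True)

    inputs = tokenizer("What is a Large Language Model?", return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(output[0], skip_special_tokens=True))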
19. Finetuning LLMs using QLoRA
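A condensed sketch of the QLoRA path, following the bigdl.llm.transformers.qlora API from the project documentation; the LoRA hyperparameters and target modules below are illustrative, and a standard Hugging Face Trainer would drive the actual fine-tuning loop from here:

    import torch
    from bigdl.llm.transformers import AutoModelForCausalLM
    from bigdl.llm.transformers.qlora import (get_peft_model,
                                              prepare_model_for_kbit_training)
    from peft import LoraConfig

    # Load the frozen base model in 4-bit NF4, as in the QLoRA recipe.
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                                 load_in_low_bit="nf4",
                                                 optimize_model=False,
                                                 torch_dtype=torch.float16)
    model = model.to("xpu")  # Intel GPU, e.g. the Max 1550 in the demo above
    model = prepare_model_for_kbit_training(model)

    # Attach trainable low-rank adapters; the NF4 base weights stay frozen.
    config = LoraConfig(r=8, lora_alpha=32,
                        target_modules=["q_proj", "k_proj", "v_proj"],
                        lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
    model = get_peft_model(model, config)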
20. Support Standard LangChain API
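A minimal sketch of the LangChain integration via bigdl.llm.langchain, per the project documentation; the prompt template is illustrative, and LangChain import paths vary across versions:

    from bigdl.llm.langchain.llms import TransformersLLM
    from langchain.chains import LLMChain
    from langchain.prompts import PromptTemplate

    # The LLM wrapper loads and low-bit-optimizes the model internally.
    llm = TransformersLLM.from_model_id(
        model_id="meta-llama/Llama-2-7b-chat-hf",
        model_kwargs={"trust_remote_code": True},
    )

    template = "Answer concisely.\nQuestion: {question}\nAnswer:"
    prompt = PromptTemplate(template=template, input_variables=["question"])
    chain = LLMChain(prompt=prompt, llm=llm)
    print(chain.run("What is BigDL-LLM?"))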
21. BigDL-LLM Accuracy
(evaluated with https://github.com/EleutherAI/lm-evaluation-harness)

  Model                              Precision  BoolQ  PIQA   HellaSwag  Winogrande  Arc-e  Arc-c  OBQA  Average
  LLaMA gpt4all-7B                   BF16       79.66  78.56  76.34      69.93       67.68  43     44.4  65.65
                                     INT8       79.54  77.97  75.04      65.67       67.38  41.98  45    64.65
                                     INT4       79.57  78.29  74.18      69.53       65.49  41.72  43.2  64.57
  LLaMA Vicuna1.1-13B                BF16       83.39  77.58  76.63      70.8        71.38  45.31  45    67.16
                                     INT8       82.17  76.88  76.28      69.14       70.54  45.65  44.2  66.41
                                     INT4       83.12  76.28  74.93      68.51       71.3   45.73  45    66.41
  LLaMA Baize-30B                    BF16       85.17  82.54  83.28      74.98       79.08  52.9   48    72.28
                                     INT4       85.05  81.61  82.15      75.77       81.27  56.23  47.2  72.75
  LLaMA 65B                          BF16       84.95  82.37  84.9       77.43       79.38  53.92  47    72.85
                                     INT4       84.07  81.77  83.94      76.72       80.01  54.95  47.2  72.67
  GPT-NeoX RedPajama-INCITE-Chat-7B  BF16       71.13  76.71  70.11      64.4        69.49  38.74  38.6  61.31
                                     INT4       70.46  77.26  69.6       64.01       68.52  39.33  40    61.31
  Bloom phoenix-inst-chat-7B         BF16       86.61  74.86  63.42      63.85       69.15  41.55  40.6  62.86
                                     INT4       85.47  75.84  60.6       64.25       68.64  41.04  39    62.12
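These numbers come from EleutherAI's lm-evaluation-harness (linked above). A hedged sketch of running the same task suite with the harness's v0.4-style Python API follows; the model path is an illustrative placeholder, and task and metric names differ between harness versions:

    import lm_eval

    # Evaluate a Hugging Face causal LM on the seven tasks in the table.
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=meta-llama/Llama-2-7b-hf,dtype=bfloat16",
        tasks=["boolq", "piqa", "hellaswag", "winogrande",
               "arc_easy", "arc_challenge", "openbookqa"],
    )
    for task, metrics in results["results"].items():
        print(task, metrics)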
22. Key Features Take-Away
• Broadest model support
• Lowest latency
• Smallest memory footprint
• Developer-friendly
• Abundant examples
• CPU/GPU
• Apache 2.0 license
23. Call to Action
• Try BigDL-LLM and give us feedback: https://github.com/intel-analytics/bigdl
• Use BigDL-LLM on Intel platforms to develop LLM applications
  • Tutorial: https://github.com/intel-analytics/bigdl-llm-tutorial
  • Examples: https://github.com/intel-analytics/BigDL/tree/main/python/llm/example
25. Notices & Disclaimers
Performance varies by use, configuration and other factors. Learn more on the Performance Index site. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.