Hands-On Tutorial: AWQ Quantization of Qwen Models and Inference

With the rapid progress of deep learning, large language models (LLMs) have drawn wide attention for their strong natural-language understanding and generation abilities. These models, however, usually carry an enormous number of parameters, so real deployments run into high compute cost and long inference latency. Model quantization was developed to address this: it lowers the numerical precision used to store the model weights, cutting storage and compute cost while trying to keep model quality as close to the original as possible.

Why quantize a model?

  1. Higher efficiency: quantization sharply reduces storage and compute, which speeds up inference; this matters most on resource-constrained devices.
  2. Lower cost: with less need for high-end hardware, model deployment becomes cheaper.
  3. Wider reach: large models become runnable on edge devices such as mobile phones and IoT hardware, opening up more application scenarios.

Why AWQ matters

AWQ (Activation-aware Weight Quantization) is a low-bit weight quantization method designed specifically for large language models. It accounts not only for the distribution of the weights themselves but also for the influence of the activations, which helps the quantized model stay close to the original model's quality. Compared with plain FP16 inference, the AutoAWQ toolkit, which implements AWQ, offers the following advantages (a toy numerical sketch of the low-bit idea follows the list):

  • Faster inference: roughly 3x faster model execution, which greatly improves throughput.
  • Lower memory footprint: memory requirements drop to about one third of the original, so larger models fit on a wider range of hardware.
  • Hardware friendly: the quantization scheme is designed with hardware execution in mind, so the quantized model runs efficiently across different devices.
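
To make the quant_config used later ({"zero_point": True, "q_group_size": 128, "w_bit": 4}) concrete, here is a toy sketch of plain group-wise asymmetric 4-bit quantization. It is not AutoAWQ's implementation, and it deliberately leaves out the step that gives AWQ its name: searching activation-aware per-channel scales before quantizing.

import torch

def quantize_group(w, n_bits=4):
    # One weight group -> unsigned n-bit integers plus one scale and one zero point.
    qmax = 2 ** n_bits - 1                                      # 15 for 4-bit
    scale = torch.clamp((w.max() - w.min()) / qmax, min=1e-8)   # one scale per group
    zero = torch.round(-w.min() / scale)                        # one zero point per group
    q = torch.clamp(torch.round(w / scale + zero), 0, qmax).to(torch.uint8)
    return q, scale, zero

def dequantize_group(q, scale, zero):
    return (q.float() - zero) * scale

w = torch.randn(128)                                            # one group of 128 weights
q, scale, zero = quantize_group(w)
print("max reconstruction error:", (w - dequantize_group(q, scale, zero)).abs().max().item())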

1. Installing vLLM (GPU environment)

  • Official vLLM installation guide: https://docs.vllm.ai/en/latest/getting_started/installation.html
# Create a new virtual environment
conda create -n myenv python=3.11 -y
conda activate myenv
# Install vllm (this build targets CUDA 12.1)
pip install vllm

If the installation above fails, you can download the whl files manually and install them. An example and the points to check are listed below:

  • Check which CUDA version your GPU supports; if it supports 11.8, download the whl built for cu118.
  • Check your current Python version; for Python 3.11, download the cp311 whl.
  • On Linux download the Linux whl, on Windows the Windows one.

For example, my CUDA version is 11.8 and my Python is 3.11 on Linux, so the install commands are:

pip install /home/xxx/cuda_whl/vllm-0.4.2+cu118-cp311-cp311-manylinux1_x86_64.whl
pip install /home/xxx/cuda_whl/xformers-0.0.26.post1+cu118-cp311-cp311-manylinux2014_x86_64.whl
pip install /home/xxx/cuda_whl/torch-2.3.0+cu118-cp311-cp311-linux_x86_64.whl

If any of these steps fail, use the error message to track down and fix the cause.
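
As a quick sanity check (an illustrative snippet, not part of the original setup), you can confirm that PyTorch sees the GPU and that vllm imports cleanly; recent vllm releases expose a __version__ attribute:

import torch
import vllm

# Both numbers should match the whl files installed above.
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("vllm:", vllm.__version__)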

2. Installing AutoAWQ

pip install autoawq

3. Downloading the pretrained base model

  • Download Qwen2.5-7B-Instruct as the model to be quantized: https://www.modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct

Install modelscope

pip install modelscope

Download the model

# Download the model
from modelscope import snapshot_download
model_dir = snapshot_download('Qwen/Qwen2.5-7B-Instruct')
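
If you prefer the weights in a specific directory rather than the default ModelScope cache, snapshot_download also accepts a cache_dir argument (the path below is a placeholder); print the returned path and reuse it as model_path in the next step:

# Illustrative variant: download into a custom directory (placeholder path)
from modelscope import snapshot_download
model_dir = snapshot_download('Qwen/Qwen2.5-7B-Instruct', cache_dir='/home/xxx/models')
print(model_dir)  # local path to pass as model_path below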

4. Running AWQ quantization

  • AutoAWQ: https://github.com/casper-hansen/AutoAWQ
  • Qwen AWQ quantization guide: https://qwen.readthedocs.io/zh-cn/latest/quantization/awq.html

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '7'

# Path of the base model downloaded earlier
model_path = '/home/xxx/models/Qwen2.5-7B-Instruct'
# Where to save the quantized model (created automatically if it does not exist)
quant_path = '/home/xxx/models/Qwen2.5-7B-Instruct-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load the model
model = AutoAWQForCausalLM.from_pretrained(
    model_path, device_map="auto", safetensors=True
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Calibration data; you can replace this with your own dataset
dataset = [
    [
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are an expert in travel planning and programming."},
        {"role": "user", "content": "Can you recommend a scenic spot for a weekend trip near Beijing?"},
        {"role": "assistant", "content": "Certainly! The Great Wall at Mutianyu would be a perfect choice for a weekend getaway. It's less than two hours' drive from Beijing and offers stunning views and a less crowded experience compared to Badaling."}
    ],
    [
        {"role": "user", "content": "How about a good place for a family vacation in China?"},
        {"role": "assistant", "content": "For a family vacation, I would suggest Shanghai Disneyland. It offers a magical experience with various attractions and parades suitable for all ages. Plus, it's a great way to combine fun with a visit to one of China's most vibrant cities."}
    ],
    [
        {"role": "user", "content": "Can you help me with a simple Python code to sort a list of numbers?"},
        {"role": "assistant", "content": "Of course! Here's a simple example of how to sort a list of numbers in Python:\n\npython\nnumbers = [3, 1, 4, 1, 5, 9]\nnumbers.sort()\nprint(numbers)\n"}
    ],
    [
        {"role": "user", "content": "What are some must-see destinations in Europe for a first-time traveler?"},
        {"role": "assistant", "content": "Europe has many must-see destinations, but for a first-time traveler, I would recommend visiting Paris for its iconic Eiffel Tower and Louvre Museum, Rome for its ancient ruins like the Colosseum, and Amsterdam for its charming canals and museums."}
    ],
    [
        {"role": "user", "content": "Can you provide a tip for optimizing JavaScript code performance?"},
        {"role": "assistant", "content": "One tip for optimizing JavaScript performance is to minimize DOM manipulations. Accessing the DOM is slow, so try to batch updates or use Document Fragments when appending multiple elements to the DOM."}
    ]
]
# Render each conversation with the chat template
data = []
for msg in dataset:
    text = tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=False)
    data.append(text.strip())

# Run quantization
model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data=data,        # custom calibration data; if omitted, a default set is downloaded
    max_calib_seq_len=256   # use a smaller value when the calibration data is small
)

# Save the quantized model and tokenizer
model.save_quantized(quant_path, safetensors=True, shard_size="4GB")
tokenizer.save_pretrained(quant_path)
print(f'Model is quantized and saved at "{quant_path}"')
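
A rough way to confirm the quantization did what was expected (an illustrative check, assuming the two paths above) is to compare the on-disk size of the checkpoints; the 4-bit weights should come out to roughly a quarter of the FP16 ones, plus a small overhead for the per-group scales and zero points:

import os

def dir_size_gb(path):
    # Sum the size of every file under the checkpoint directory.
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1024 ** 3

print("original :", round(dir_size_gb('/home/xxx/models/Qwen2.5-7B-Instruct'), 2), "GB")
print("quantized:", round(dir_size_gb('/home/xxx/models/Qwen2.5-7B-Instruct-awq'), 2), "GB")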

4.1. Quantization errors and fixes

  • ConnectionError: Couldn’t reach ‘mit-han-lab/pile-val-backup’ on the Hub (ConnectionError)
    Raised when running: model.quantize(tokenizer, quant_config=quant_config)
    Cause and fix: without external network access, the call fails to download the default calibration data used to calibrate the quantization scaling factors. Pass your own calibration dataset instead, as in the script above (a mirror-based workaround is also sketched after this list).

  • RuntimeError: torch.cat(): expected a non-empty list of Tensors

Cause: the calibration data is too small, so n_split = cat_samples.shape[1] // max_seq_len in the AutoAWQ source comes out as 0, which triggers the error.

Fix: lower max_calib_seq_len, for example from the default 512 to 256:

model.quantize(tokenizer, quant_config=quant_config, calib_data=data,
               max_calib_seq_len=256)  # reduce max_calib_seq_len from 512 to 256
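
If you do want the default calibration set but cannot reach huggingface.co directly, one possible workaround (an assumption about your environment: it relies on your huggingface_hub/datasets versions honoring the HF_ENDPOINT variable) is to point downloads at a mirror before any Hugging Face import:

import os
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'  # example mirror; set before other imports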

5. Inference with vLLM

Load the quantized model and run inference.

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
import torch, os

os.environ['CUDA_VISIBLE_DEVICES'] = '7'
# os.environ['CUDA_VISIBLE_DEVICES'] = '6,7'  # multiple GPUs
model_path = "/home/xxx/models/Qwen2.5-7B-Instruct-awq"
prompt = "介绍一下大模型技术!"

tokenizer = AutoTokenizer.from_pretrained(model_path)  # add trust_remote_code=True if needed
# Input the model name or path. Can be GPTQ or AWQ models.
llm = LLM(model=model_path,
          max_model_len=10000,          # maximum context length
          tensor_parallel_size=1,       # number of GPUs
          gpu_memory_utilization=0.95,
          trust_remote_code=True)

messages = [{"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05,
                                 max_tokens=512)  # maximum output length
outputs = llm.generate([text], sampling_params)
# print(outputs)

# Print the outputs.
for output in outputs:
    generated_text = output.outputs[0].text
    print(f"Generated text:\n {generated_text}")
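
vLLM's main benefit is batched scheduling, so one natural usage note (continuing from the llm, tokenizer, and sampling_params defined above; the prompts are just examples) is to pass several chat-formatted prompts in a single call:

# Batch several prompts in one generate() call (illustrative example).
prompts = ["介绍一下大模型技术!", "写一首关于秋天的诗。"]
texts = [tokenizer.apply_chat_template([{"role": "user", "content": p}],
                                       tokenize=False, add_generation_prompt=True)
         for p in prompts]
for output in llm.generate(texts, sampling_params):
    print(output.outputs[0].text[:100], "...")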

6. Inference with transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, os

os.environ['CUDA_VISIBLE_DEVICES'] = '7'
# os.environ['CUDA_VISIBLE_DEVICES'] = '6,7'  # multiple GPUs
model_name = "/home/xxx/models/Qwen2.5-7B-Instruct-awq"
prompt = "介绍一下大模型技术!"

model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=512)

# Keep only the newly generated tokens, then decode them
generated_ids = [output_ids[len(input_ids):]
                 for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
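
If you would rather watch the answer appear token by token than wait for the full decode, transformers ships a TextStreamer that plugs straight into generate (an optional addition, reusing model, tokenizer, and model_inputs from the block above):

from transformers import TextStreamer

# Streams decoded tokens to stdout as they are produced.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**model_inputs, max_new_tokens=512, streamer=streamer)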
