
InternVL Multimodal Model Deployment and Fine-Tuning in Practice | InternLM (书生) Large Models

news 2026/1/1 1:32:06

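Contents

- Introduction to Multimodal Large Models
  - Overview
  - Examples
  - Common Design Patterns
    - BLIP-2
      - Q-Former Module Details
      - Application Example: MiniGPT-4
      - Drawbacks of Q-Former
    - LLaVA
      - LLaVA-1.5-HD
      - LLaVA-NeXT
- Introduction to InternVL2
  - Architecture
    - InternViT
    - Pixel Shuffle
    - Dynamic High-Resolution
    - Multitask Output
  - Training Method
- Environment Setup
  - Basic Requirements
  - Training Environment
  - Inference Environment
- Deployment with LMDeploy
  - Basic LMDeploy Usage
  - Trying the Web Demo
  - Port Forwarding
  - Testing
  - Fixing the Multi-Turn Conversation Bug
- Fine-Tuning a Multimodal Model with XTuner
  - Preparing the Base Config File
  - Config Parameters Explained
  - Dataset Preparation
  - Fine-Tuning the Model
  - Merging the Model
- Testing the Model
- References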

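Introduction to Multimodal Large Models

Overview

A multimodal large language model (MLLM) is a large model that can process and fuse several types of data, such as text, images, audio, and video.

Common MLLMs:

InternVL
GPT-4o
Qwen-VL
LLaVA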

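Examples

[Figure omitted]

Common Design Patterns

The core problem for a multimodal large model is aligning the feature spaces of different modalities.

Data from different modalities is usually encoded by different modules, so the resulting feature vectors live in different representation spaces.
Data with the same meaning but from different modalities may therefore be represented differently in feature space, and some design is needed to close this gap (that is, the feature spaces of the modalities have to be aligned).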

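BLIP-2

[Figure omitted]

Q-Former Module Details

Learned Queries: through cross-attention, they extract the key information from the image.
The self-attention module shares its parameters between the image and text modalities, which is what fuses the two modalities.
The FFN layers do not share parameters, so each one handles the information specific to its own modality.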
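To make the role of the learned queries concrete, here is a minimal NumPy sketch of single-head cross-attention. It is not BLIP-2's actual code: the feature width of 64, the random weights, and the helper name cross_attend are assumptions made for illustration. The only detail taken from BLIP-2 is the use of 32 queries. The point it shows is that the output always has one row per query, no matter how many image patches come in.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, image_feats, w_q, w_k, w_v):
    """Single-head cross-attention: the queries read from the image features."""
    q = queries @ w_q                          # (num_queries, d)
    k = image_feats @ w_k                      # (num_patches, d)
    v = image_feats @ w_v                      # (num_patches, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (num_queries, num_patches)
    weights = softmax(scores, axis=-1)         # each query distributes attention over patches
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 64                 # illustrative feature width (assumption)
num_queries = 32       # BLIP-2 uses 32 learned queries
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
queries = rng.normal(size=(num_queries, d))    # stands in for the learned query embeddings

for num_patches in (196, 1024):
    image_feats = rng.normal(size=(num_patches, d))
    out, weights = cross_attend(queries, image_feats, w_q, w_k, w_v)
    print(num_patches, "patches ->", out.shape)
```

Because the output length is fixed by the number of queries, the language model downstream receives a constant number of visual tokens per image.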

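The Q-Former is trained with three losses:

ITM loss: image-text matching loss. The image side sees the full text and the text side sees the full image (nothing is masked).
LM loss: image-grounded text generation loss.
ITC loss: image-text contrastive loss.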
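As an illustration of the contrastive (ITC) objective, the sketch below implements a generic symmetric InfoNCE loss in NumPy. It is a simplification and an assumption on my part: it uses one embedding per image, whereas BLIP-2 compares several query outputs against the text embedding and keeps the highest similarity. The temperature of 0.07 and the function name itc_loss are also illustrative.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def itc_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss: row i of each input is a matching pair."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature       # (batch, batch) similarities
    idx = np.arange(len(logits))
    i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text direction
    t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image direction
    return (i2t + t2i) / 2

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 16))
matched = text + 0.01 * rng.normal(size=(4, 16))   # image embeddings close to their texts
shuffled = matched[[1, 2, 3, 0]]                    # every image paired with the wrong text
print(itc_loss(matched, text), itc_loss(shuffled, text))
```

The loss is small when matching pairs sit on the diagonal of the similarity matrix and large when they do not, which is what pulls the two feature spaces together.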

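The three losses correspond to different tasks, so each one needs its own attention mask:
[Figure omitted]

ITM loss: the image side needs to see the full text and the text side needs to see the full image (nothing is masked).
LM loss: because the task is text generation, the image side must not see the text, while the text side sees the image queries and its text-to-text block is a lower-triangular (causal) matrix.
ITC loss: image and text are encoded separately, so only the image-to-image and text-to-text blocks are visible.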
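The three masks described above can be sketched as boolean matrices. This is a minimal illustration, assuming the sequence is laid out as the query tokens followed by the text tokens and that entry [i, j] is True when position i may attend to position j. The function name qformer_masks and the tiny sizes are hypothetical.

```python
import numpy as np

def qformer_masks(num_queries, num_text):
    """Build the ITM, ITC and LM attention masks for [queries | text]."""
    n = num_queries + num_text
    q = slice(0, num_queries)      # positions of the image queries
    t = slice(num_queries, n)      # positions of the text tokens

    # ITM: everything sees everything.
    itm = np.ones((n, n), dtype=bool)

    # ITC: queries see only queries, text sees only text.
    itc = np.zeros((n, n), dtype=bool)
    itc[q, q] = True
    itc[t, t] = True

    # LM: queries see only queries; text sees all queries plus earlier text.
    lm = np.zeros((n, n), dtype=bool)
    lm[q, q] = True
    lm[t, q] = True
    lm[t, t] = np.tril(np.ones((num_text, num_text), dtype=bool))
    return {"ITM": itm, "ITC": itc, "LM": lm}

masks = qformer_masks(num_queries=2, num_text=3)
for name, m in masks.items():
    print(name)
    print(m.astype(int))
```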

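Application Example: MiniGPT-4

[Figure omitted]

Drawbacks of Q-Former

Slow convergence: the Q-Former has a relatively large number of parameters. By comparison, models that use an MLP as the connector (such as LLaVA-1.5) converge faster under the same settings and reach better performance.
Limited gains: the Q-Former does not bring a significant performance improvement and fails to beat the simple MLP approach.
The model structure of the LLaVA family is also simpler.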

LLaVA

[Figure omitted]

Core idea: a simple linear layer projects the image features (extracted by the image encoder) into the text embedding space.
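A minimal NumPy sketch of that projection follows. All sizes are small illustrative assumptions rather than LLaVA's real dimensions, and the weights are random stand-ins for a trained layer. It shows that after one matrix multiplication the image features have the same width as text embeddings, so the two can be concatenated into one input sequence for the language model.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, d_vision, d_text = 16, 64, 128    # illustrative sizes (assumption)

# Features produced by the image encoder, one row per patch.
image_feats = rng.normal(size=(num_patches, d_vision))

# The connector: a single linear layer (random here, learned in practice).
W = rng.normal(size=(d_vision, d_text)) * 0.02
b = np.zeros(d_text)
image_tokens = image_feats @ W + b             # (num_patches, d_text)

# Embedded text tokens already live in the d_text space.
text_tokens = rng.normal(size=(12, d_text))

# The language model consumes image tokens and text tokens as one sequence.
llm_input = np.concatenate([image_tokens, text_tokens], axis=0)
print(llm_input.shape)
```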

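LLaVA-1.5-HD

Problem addressed: a trained image encoder usually works at one fixed resolution. Images of other resolutions are typically resized to fit, which loses fine-grained detail.

[Figure omitted]

The image is sliced into tiles at the resolution the image encoder can handle, which preserves fine-grained detail.
The resized full image is also fed to the image encoder to supply global context.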
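The slicing idea can be sketched in a few lines of NumPy. This is an illustration under assumptions, not LLaVA's preprocessing code: the tile size of 336, the nearest-neighbour resize, and the requirement that the image size is an exact multiple of the tile size are all simplifications, and the helper names are hypothetical.

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize, enough to produce a low-resolution global view."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows][:, cols]

def slice_with_global_view(img, tile):
    """Return the full-resolution tiles plus one downsized copy of the whole image."""
    h, w = img.shape[:2]
    assert h % tile == 0 and w % tile == 0, "sketch assumes exact multiples"
    tiles = [img[r:r + tile, c:c + tile]
             for r in range(0, h, tile)
             for c in range(0, w, tile)]
    return tiles, resize_nearest(img, tile, tile)

# A synthetic 672x672 RGB image stands in for a real photo.
img = np.arange(672 * 672 * 3, dtype=np.uint32).reshape(672, 672, 3) % 256
tiles, global_view = slice_with_global_view(img, tile=336)
print(len(tiles), tiles[0].shape, global_view.shape)
```

Each tile and the global view have the encoder's input size, so all of them can be encoded by the same fixed-resolution model.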

LLaVA-NeXT

An upgrade of LLaVA-1.5-HD:

Dynamic resolution
Stronger training data

Introduction to InternVL2

Architecture

InternVL2 adopts a LLaVA-style architecture:
[Figure omitted]

Base language model: InternLM2-Chat-20B
Vision encoder: InternViT-6B
Alignment module: MLP projector

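InternViT

[Figure omitted]

The vision encoder has 6B parameters.
During training, the vision encoder is aligned directly with the LLM: the text encoder used in conventional image-text contrastive learning (which is discarded afterwards) is replaced with the LLM's encoder.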

Pixel Shuffle

Neighboring spatial features are folded into the channel dimension, which compresses the number of visual tokens:

[Figure omitted]
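A minimal NumPy sketch of this token compression is shown below. It implements a plain space-to-depth rearrangement with a factor of 2, which cuts the token count to a quarter while making each token four times wider. The feature map size and channel count are illustrative assumptions, and InternVL's own implementation may order the folded elements differently.

```python
import numpy as np

def pixel_unshuffle(x, factor=2):
    """Fold each factor x factor block of spatial positions into the channels."""
    h, w, c = x.shape
    assert h % factor == 0 and w % factor == 0
    x = x.reshape(h // factor, factor, w // factor, factor, c)
    x = x.transpose(0, 2, 1, 3, 4)     # (h/f, w/f, f, f, c)
    return x.reshape(h // factor, w // factor, factor * factor * c)

# A 32x32 grid of visual tokens with 8 channels each (illustrative sizes).
feats = np.arange(32 * 32 * 8, dtype=np.float32).reshape(32, 32, 8)
out = pixel_unshuffle(feats)
print(feats.shape[0] * feats.shape[1], "tokens ->",
      out.shape[0] * out.shape[1], "tokens, width", out.shape[2])
```

No values are discarded: the same numbers are only regrouped, so the language model sees fewer but wider visual tokens.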

Dynamic High-Resolution

A dynamic resolution scheme lets the model handle images of varying resolutions and aspect ratios:

[Figure omitted]
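One way such a scheme can work is to pick the grid of fixed-size tiles whose aspect ratio is closest to the image's. The sketch below illustrates that idea only. The tile size of 448, the limit of 12 tiles, the extra thumbnail for multi-tile images, and the tie-breaking rule (prefer fewer tiles) are assumptions for this example, and real implementations may break ties differently, for instance by favouring more tiles for large images. All function names are hypothetical.

```python
def candidate_grids(min_tiles=1, max_tiles=12):
    """All (cols, rows) grids whose tile count lies within the allowed range."""
    grids = {(cols, rows)
             for n in range(min_tiles, max_tiles + 1)
             for cols in range(1, n + 1)
             for rows in range(1, n + 1)
             if min_tiles <= cols * rows <= max_tiles}
    # Sort by tile count so that smaller grids come first.
    return sorted(grids, key=lambda g: (g[0] * g[1], g))

def pick_grid(width, height, min_tiles=1, max_tiles=12):
    """Choose the grid whose aspect ratio is closest to the image's."""
    aspect = width / height
    # min() keeps the first best candidate, so ties go to the grid with fewer tiles.
    return min(candidate_grids(min_tiles, max_tiles),
               key=lambda g: abs(aspect - g[0] / g[1]))

def plan_tiles(width, height, tile=448, max_tiles=12):
    cols, rows = pick_grid(width, height, max_tiles=max_tiles)
    n = cols * rows
    return {"grid": (cols, rows),
            "resize_to": (cols * tile, rows * tile),
            "num_tiles": n,
            # Assumption: a global thumbnail is added only when there are several tiles.
            "with_thumbnail": n + 1 if n > 1 else n}

print(plan_tiles(896, 448))
print(plan_tiles(1000, 1000))
print(plan_tiles(448, 1344))
```

The image is then resized to the chosen grid size and cut into tiles, in the same spirit as the LLaVA-1.5-HD slicing described earlier.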

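Multitask Output

[Figure omitted]

Training Method

[Figure omitted]

Environment Setup

Basic Requirements

At least 40 GB of GPU memory is needed to complete the deployment and fine-tuning steps below.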

Training Environment

# Create a new virtual environment
conda create --name xtuner-env python=3.10 -y
conda activate xtuner-env
# Install xtuner with DeepSpeed integration, plus related packages
pip install -U 'xtuner[deepspeed]' timm==1.0.9
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.39.0

Inference Environment

conda create -n lmdeploy python=3.10 -y
conda activate lmdeploy
pip install lmdeploy gradio==4.44.1 timm==1.0.9

Deployment with LMDeploy

Basic LMDeploy Usage

The pipeline.chat interface is the main way to build a multi-turn conversation pipeline. The core code is:

## 1. Import dependencies
from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
from lmdeploy.vl import load_image

## 2. Initialize the inference pipeline with your model
model_path = "your_model_path"
pipe = pipeline(model_path, backend_config=TurbomindEngineConfig(session_len=8192))

## 3. Load an image (reading it with PIL also works)
image = load_image('your_image_path')

## 4. Configure the generation parameters
gen_config = GenerationConfig(top_p=0.8, temperature=0.8)

## 5. Chat through pipeline.chat, passing in the generation parameters
sess = pipe.chat(('describe this image', image), gen_config=gen_config)
print(sess.response.text)

## 6. Later turns must pass in the previous session so the model has the conversation history
sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
print(sess.response.text)

Trying the Web Demo

git clone https://github.com/Control-derek/InternVL2-Tutorial.git
cd InternVL2-Tutorial

In demo.py, set MODEL_PATH to the path of InternVL2-2B. On an InternStudio dev machine no change is needed; otherwise point it at your own model path.

[Figure omitted]

Start the demo:

conda activate lmdeploy
python demo.py

Port Forwarding

Remember to add StrictHostKeyChecking=no; otherwise, for HTTPS-related reasons, some page elements may fail to load completely.

ssh -p 36100 root@ssh.intern-ai.org.cn -CNg -L 1096:127.0.0.1:1096 -o StrictHostKeyChecking=no
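Before opening the browser, it can help to confirm that the forwarded port is actually accepting connections. The snippet below is a small sketch using only the Python standard library. The helper name port_is_open is hypothetical, and port 1096 is simply the port used in the ssh command above.

```python
import socket

def port_is_open(host="127.0.0.1", port=1096, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    print("demo reachable:", port_is_open())
```

If this reports False while the ssh tunnel is running, the demo on the dev machine is probably not started yet or is listening on a different port.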

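Testing

[Figure omitted]

Click Start Chat first so that the agent is initialized; only then can you upload an image and text and have the agent answer.
Without fine-tuning, this answer is clearly not quite right.

[Figure omitted]

For image description questions, though, the results are reasonably good.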

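Fixing the Multi-Turn Conversation Bug

If an error is raised when you send multiple images or hold a multi-turn conversation:
[Figure omitted]
see GitHub issue InternLM/lmdeploy#2101:

[Figure omitted]
Comment out lines 126 and 127 of the engine.py file that raises the error, then add self._create_event_loop_task(); this resolves the error.

[Figure omitted]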

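Fine-Tuning a Multimodal Model with XTuner

Preparing the Base Config File

On an InternStudio dev machine, the preinstalled xtuner lives at /root/xtuner. Enter the working directory and activate the training environment:

cd /root/xtuner
conda activate xtuner-env  # or whatever you named your training environment

The original InternVL fine-tuning configs are under ./xtuner/configs/internvl/v2. Assuming the repository cloned above is at /root/InternVL2-Tutorial, copy the config file into the target directory: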

cp /root/InternVL2-Tutorial/xtuner_config/internvl_v2_internlm2_2b_lora_finetune_food.py /root/xtuner/xtuner/configs/internvl/v2/internvl_v2_internlm2_2b_lora_finetune_food.py

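Config Parameters Explained

[Figure omitted]

path: path of the model to fine-tune; no change needed in the InternStudio environment.
data_root: directory containing the dataset.
data_path: path of the training data file.
image_folder: root directory of the training images.
prompt_template: the chat template, system prompt, and so on used during training. Use the one that matches the model; no change needed here.
max_length: maximum number of tokens per training sample.
batch_size: training batch size; adjust it to fit your GPU memory.
accumulative_counts: number of gradient accumulation steps, used to simulate a larger batch_size and improve training stability when GPU memory is limited.
dataloader_num_workers: number of worker subprocesses used for data loading.
max_epochs: number of training epochs.
optim_type: optimizer type.
lr: learning rate.
betas: beta1 and beta2 of the Adam optimizer.
weight_decay: weight decay, used to reduce overfitting.
max_norm: maximum gradient norm for gradient clipping.
warmup_ratio: warm-up ratio, the fraction of training at the start during which the learning rate is gradually increased.
save_steps: save a checkpoint every this many steps.
save_total_limit: maximum number of checkpoints to keep; -1 means no limit.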
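To see how several of these parameters interact, the sketch below does the arithmetic for a made-up run. Every number in the example call is hypothetical, and the function assumes that one training step is one dataloader batch, which is my reading of how the step-based settings are counted rather than something stated in the config.

```python
import math

def training_schedule(num_samples, batch_size, accumulative_counts,
                      max_epochs, warmup_ratio, save_steps, num_gpus=1):
    """Derive effective batch size and step counts from the config values."""
    # Samples contributing to one optimizer update.
    effective_batch = batch_size * accumulative_counts * num_gpus
    # Assumption: one step (iteration) corresponds to one dataloader batch.
    iters_per_epoch = math.ceil(num_samples / (batch_size * num_gpus))
    total_iters = iters_per_epoch * max_epochs
    return {
        "effective_batch": effective_batch,
        "total_iters": total_iters,
        "optimizer_updates": math.ceil(total_iters / accumulative_counts),
        "warmup_iters": round(warmup_ratio * total_iters),
        "checkpoints": total_iters // save_steps,
    }

# Hypothetical run: 1000 samples, batch size 4, 2 accumulation steps, 10 epochs.
s = training_schedule(num_samples=1000, batch_size=4, accumulative_counts=2,
                      max_epochs=10, warmup_ratio=0.03, save_steps=64)
print(s)
```

With save_total_limit = -1, all of the checkpoints counted here would be kept on disk, so it is worth checking the number before a long run.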

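[Figure omitted]

r: rank of the low-rank matrices; it determines their dimensions.
lora_alpha: scaling factor that adjusts the weight of the low-rank update.
lora_dropout: dropout probability, used to reduce overfitting.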
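The roles of r and lora_alpha are easier to see in a small numerical sketch of a LoRA layer. The layer sizes and values below are illustrative assumptions, not the values in the tutorial's config, and dropout is left out to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, lora_alpha = 512, 512, 16, 32   # illustrative values (assumption)

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable, small random init
B = np.zeros((d_out, r))                  # trainable, zero init

def lora_forward(x, W, A, B, r, lora_alpha):
    """Base output plus the low-rank update, scaled by lora_alpha / r."""
    return x @ W.T + (lora_alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(4, d_in))
y = lora_forward(x, W, A, B, r, lora_alpha)

full_params = W.size              # what full fine-tuning would update
lora_params = A.size + B.size     # what LoRA actually trains
print(y.shape, full_params, lora_params)
```

Because B starts at zero, the adapted layer initially behaves exactly like the pretrained one, and only the two small matrices are trained.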

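To resume training from a checkpoint, set the parameters at the very bottom of the config:

[Figure omitted]
Set load_from to the checkpoint you want to load and set resume = True to continue training from that point.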
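As a sketch, the two lines at the bottom of the config might look like this. The checkpoint path is a hypothetical example; use a checkpoint file that actually exists in your own work directory.

```python
# Hypothetical path: replace with one of your own saved checkpoints.
load_from = './work_dirs/food_finetune/iter_64.pth'

# Continue training from the loaded checkpoint instead of starting over.
resume = True
```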

Summary: core parameters

Model path: path = '/root/share/new_models/OpenGVLab/InternVL2-2B'
Data path: data_root = '/root/share/datasets/FoodieQA/'
Checkpoint interval in steps: save_steps = 64
Maximum number of checkpoints to keep (-1 means no limit): save_total_limit = -1

Dataset Preparation

On an InternStudio dev machine, the data files are located under /root/share/datasets/FoodieQA.
[Figure omitted]

Fine-Tuning the Model

xtuner train /root/xtuner/xtuner/configs/internvl/v2/internvl_v2_internlm2_2b_lora_finetune_food.py --deepspeed deepspeed_zero2 --work-dir ./work_dirs/food_finetune

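Resource usage during fine-tuning:
[Figure omitted]
Fine-tuning on 200 MB of image-text data with 50% of an A100 took about an hour and a half.

Merging the Model

This step uses the script /root/xtuner/xtuner/configs/internvl/v1_5/convert_to_official.py. The full code is: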

import argparse
import os.path as osp

import torch
from mmengine.config import Config
from transformers import AutoTokenizer

from xtuner.model.utils import LoadWoInit
from xtuner.registry import BUILDER


def convert_to_official(config, trained_path, save_path):
    cfg = Config.fromfile(config)
    cfg.model.pretrained_pth = trained_path
    cfg.model.quantization_vit = False
    cfg.model.quantization_llm = False

    with LoadWoInit():
        model = BUILDER.build(cfg.model)
    model.to(torch.bfloat16)

    if model.use_visual_encoder_lora:
        vision_model = model.model.vision_model.merge_and_unload()
        model.model.vision_model = vision_model

    if model.use_llm_lora:
        language_model = model.model.language_model.merge_and_unload()
        model.model.language_model = language_model

    model.model.save_pretrained(save_path)

    tokenizer = AutoTokenizer.from_pretrained(
        cfg.model.model_path, trust_remote_code=True)
    tokenizer.save_pretrained(save_path)

    print(model)


def main():
    parser = argparse.ArgumentParser(
        description='Convert the pth model to HuggingFace model')
    parser.add_argument('config', help='config file name or path.')
    parser.add_argument('trained_model_pth', help='The trained model path.')
    parser.add_argument(
        'save_path', help='The path to save the converted model.')
    args = parser.parse_args()

    if osp.realpath(args.trained_model_pth) == osp.realpath(args.save_path):
        raise ValueError(
            'The trained path and save path should not be the same.')

    convert_to_official(args.config, args.trained_model_pth, args.save_path)


if __name__ == '__main__':
    main()

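The script takes three arguments:

config: path of the config file
trained_model_pth: path of the trained model
save_path: path where the converted model is saved

What the script does:

Loads the model configuration from the given config path
Disables quantization of the ViT and the LLM (quantization is normally used to shrink the model and speed up inference)
Converts the model to torch.bfloat16 to save memory
Merges the LoRA weights into the base model weights and unloads the LoRA modules
Saves the model and tokenizer to save_path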
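The merge step can be illustrated numerically. The sketch below is not what merge_and_unload does internally line by line; it is a NumPy illustration, with made-up sizes and random matrices, of why folding the scaled low-rank product into the base weight gives the same outputs as running the base layer and the adapter side by side.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, lora_alpha = 64, 8, 16          # illustrative sizes (assumption)
W = rng.normal(size=(d, d))           # base weight
A = rng.normal(size=(r, d))           # trained LoRA matrices (random stand-ins)
B = rng.normal(size=(d, r))
scale = lora_alpha / r

def merge_lora(W, A, B, scale):
    """Fold the low-rank update into the base weight."""
    return W + scale * (B @ A)

x = rng.normal(size=(5, d))

# Output with the adapter kept as a separate branch.
with_adapter = x @ W.T + scale * (x @ A.T @ B.T)

# Output with a single merged weight and no adapter.
W_merged = merge_lora(W, A, B, scale)
merged_out = x @ W_merged.T

print(np.allclose(with_adapter, merged_out))
```

After merging, the adapter matrices are no longer needed at inference time, which is why the converted model can be loaded like an ordinary checkpoint.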

The full command to merge the model:

conda activate xtuner-env
cd /root/xtuner
python xtuner/configs/internvl/v1_5/convert_to_official.py xtuner/configs/internvl/v2/internvl_v2_internlm2_2b_lora_finetune_food.py /root/InternVL2-Tutorial/work_dirs/food_finetune/iter_640.pth /root/InternVL2-Tutorial/work_dirs/food_finetune/merge_model/lr35_ep10/

The merged folder:
[Figure omitted]

Testing the Model

Change MODEL_PATH to the path where the converted model was just saved:

[Figure omitted]

cd /root/InternVL2-Tutorial
conda activate lmdeploy
python demo.py

Set up port forwarding between your personal computer and the dev machine:

ssh -p 36100 root@ssh.intern-ai.org.cn -CNg -L 1096:127.0.0.1:1096 -o StrictHostKeyChecking=no

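Remember to add StrictHostKeyChecking=no; otherwise, for HTTPS-related reasons, some page elements may fail to load completely.

Test it. On food questions, the answers are much more reliable than before fine-tuning:

[Figure omitted]
Now try a different, non-food scene:

[Figure omitted]

Genshin Impact, time to pay up?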

References

https://www.bilibili.com/video/BV1nESCYWEnN/?vd_source=92ae20b037ffc8aceaab1e118f74a5cc
https://github.com/InternLM/Tutorial/tree/camp4/docs/L2/InternVL
