1. Overview
1.1. Background on Vision-Language Navigation
Vision-Language Navigation (VLN) is an interdisciplinary research field spanning natural language processing, computer vision, multimodal information fusion, and mobile-robot navigation. In VLN, a mobile robot must follow a natural-language instruction and navigate autonomously to a target location in a complex environment; the target location is determined by the VLN task. For example, in an indoor home scene, if the instruction is "Move forward and leave the living room, go through the hallway, and reach the bed in the bedroom", the navigation target is the bedside in the bedroom.
Mainstream VLN models are built around end-to-end vision-language models (VLMs). A VLM is a multimodal large model that accepts both image and text inputs and can understand image content and process cross-modal information; examples include the NaVILA framework proposed by NVIDIA and collaborators in 2025 and the StreamVLN framework proposed by the Shanghai AI Laboratory and collaborators in 2025. These models take a natural-language navigation instruction and a sequence of images as input and output actions such as move forward, turn left, and turn right.
1.2. Objective
Using the Qwen2.5-VL-7B model as an example, this document describes how to deploy the model for inference on the Ascend 310P. Note:
(1) Before deploying the model with this document: you should train or fine-tune the model on a VLN simulation dataset or on data collected with a real robot, so as to obtain a set of weights suited to the VLN task. Specifically, for training or fine-tuning Qwen2.5-VL-7B for VLN, refer to the InternVLA-N1 algorithm proposed by the Shanghai AI Laboratory team in 2025.
(2) After deploying the model with this document: you can convert the actions generated by the model into velocity commands that drive the locomotion mechanism of your specific robot.
1.3. System Overview
Taking the Qianxun robot as an example, the hardware system architecture is shown in the figure below.
For an introduction to Ascend edge computing power, see: Development Kit Introduction and Networking.
Accordingly, the computing platform used in this document is an Orange Pi 310P expansion dock plus a MiniPC. The software system architecture is shown in the figure below.
The pretrained Qwen2.5-VL-7B model is deployed on the NPU of the Orange Pi 310P expansion dock, while the rest of the VLN system runs on the x86 MiniPC. The images fed to the model are captured by the RGB-D camera mounted on the mobile robot, and the pretrained Qwen2.5-VL-7B model outputs natural-language actions such as move forward 10 m, turn left 10°, or stop. Finally, the velocity commands produced by the VLN system drive the robot's wheeled base. We recommend subscribing to the image topic and publishing velocity commands through the robot vendor's SDK or ROS 2, for example as sketched below.
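For reference, the following is a minimal Python sketch of such a bridge node on the MiniPC, written with rclpy; the topic names /camera/color/image_raw and /cmd_vel and the run_vln_step helper are assumptions that depend on the specific robot, camera driver, and VLN pipeline:
# Minimal rclpy sketch (assumptions: topic names and run_vln_step() are
# placeholders that depend on the robot SDK and the VLN pipeline).
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from geometry_msgs.msg import Twist

def run_vln_step(image_msg):
    # Hypothetical helper: forward the image to the Qwen2.5-VL inference
    # service and convert the returned natural-language action to (v, w).
    return 0.0, 0.0

class VLNBridge(Node):
    def __init__(self):
        super().__init__('vln_bridge')
        # Subscribe to the RGB stream of the robot's RGB-D camera.
        self.sub = self.create_subscription(Image, '/camera/color/image_raw', self.on_image, 10)
        # Publish velocity commands to the wheeled base.
        self.pub = self.create_publisher(Twist, '/cmd_vel', 10)

    def on_image(self, msg):
        v, w = run_vln_step(msg)
        cmd = Twist()
        cmd.linear.x = v     # linear velocity (m/s)
        cmd.angular.z = w    # angular velocity (rad/s)
        self.pub.publish(cmd)

def main():
    rclpy.init()
    rclpy.spin(VLNBridge())
    rclpy.shutdown()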
2. Environment Setup
2.1. Process Overview
To deploy the pretrained Qwen2.5-VL-7B model on the NPU of the Orange Pi 310P expansion dock, set up the environment following the flow shown in the figure below, in preparation for running model inference.
2.2. Operating System Installation
Following Chapter 3 of the user manual available at the link, flash the Ubuntu 22.04 image provided by Orange Pi.
2.3. Driver Installation
Following Section 2.6 of the user manual available at the link, install the driver.
2.4. Firmware Installation
Following Section 2.7 of the user manual available at the link, install the firmware.
2.5. CANN Installation
Following Section 6.1 of the user manual available at the link, install CANN.
2.6. Python3 Installation
If you use the Ubuntu 22.04 image provided by Orange Pi, Python3 comes preinstalled.
2.7. PyTorch Installation
Following Chapter 8 of the user manual available at the link, install PyTorch and the torch_npu plugin.
2.8. MindIE Docker Image Installation
Docker is a software platform that packages an application together with all of its dependencies into standardized units called containers, so that the application can be deployed and run quickly and reliably in any environment. Following Section 5.1 of the user manual available at the link, install Docker:
apt update
apt install -y docker.io
After installation, check that it succeeded; the version number should be printed:
docker -v
Open the link and click the ATB-Models resource option:
Click the Download Now option:
Enter the command corresponding to your processor architecture:
After the download completes, list the Docker images present on the local host:
docker images
Record the ID of the downloaded Docker image: <IMAGE_ID>.
2.9. Using the MindIE Docker Image
2.9.1. Starting the MindIE Docker Image for the First Time
The Docker image currently published by the Ascend community still has a few issues that are fixed by replacing some files. Follow Section 5.1.2.1 of the user manual available at the link to apply these fixes.
Create and start the container for the first time (replace <IMAGE_ID>):
docker run -it -d --net=host --shm-size=500g \
--name qwen7B-env \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device=/dev/devmm_svm \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro \
-v /usr/local/sbin:/usr/local/sbin:ro \
-v /home/huawei/fix_openeuler_docker:/fix_openeuler_docker \
<IMAGE_ID> bash
Inside the container, copy the fix patch files to the current directory:
cd /usr/local/Ascend/ascend-toolkit/8.2.RC1/lib64/
ls /fix_openeuler_docker/fixhccl/8.2hccl/
cp /fix_openeuler_docker/fixhccl/8.2hccl/* ./
Re-enable the CANN environment variables:
source /usr/local/Ascend/ascend-toolkit/set_env.sh
Upgrade the Ascend-cann-nnal package:
chmod +x /fix_openeuler_docker/Ascend-cann-nnal/Ascend-cann-nnal_8.3.RC1_linux-x86_64.run
cd /fix_openeuler_docker/Ascend-cann-nnal
./Ascend-cann-nnal_8.3.RC1_linux-x86_64.run --install --quiet
After the upgrade completes, confirm the version information:
cat /usr/local/Ascend/nnal/atb/latest/version.info
The output should be as follows:
Ascend-cann-atb : 8.3.RC1
Ascend-cann-atb Version : 8.3.RC1.B106
Platform : x86_64
branch : 8.3.rc1-0702
commit id : 16004f23040e0dcdd3cf0c64ecf36622487038ba
Update the dnf package source lists:
dnf update
2.9.2. Restarting the MindIE Docker Image
If the machine has been rebooted or the running container has been stopped and you want to re-enter the Docker environment, run the following commands in order. List the containers that have been created:
docker ps -a
Start the previously created container (replace <YOUR_DOCKER_NAME>):
docker start <YOUR_DOCKER_NAME>
docker exec -it <YOUR_DOCKER_NAME> bash
If an error occurred during environment setup and you want to delete the container (replace <YOUR_DOCKER_NAME>):
docker rm -f <YOUR_DOCKER_NAME>
2.10. Preparing the Qwen Model Weights
Create a models directory under the container's root directory to store the Qwen model weights:
mkdir /models
If you have already prepared model weights suited to the VLN task, copy them into this directory.
If you have not yet prepared the weights, you can download the Qwen2.5-VL-7B-Instruct weights from ModelScope. Install modelscope:
pip install modelscope
Download the Qwen2.5-VL-7B-Instruct model weights:
cd /models
modelscope download --model Qwen/Qwen2.5-VL-7B-Instruct --local_dir ./Qwen2.5-VL-7B-Instruct
In Qwen2.5-VL-7B-Instruct/config.json, set torch_dtype to float16:
vim /models/Qwen2.5-VL-7B-Instruct/config.json
"torch_dtype": "float163. 模型推理
若读者已准备好适用于VLN任务的Qwen2.5-VL-7B模型权重,则输入为RGB图像与自然语言导航指令,期望输出为自然语言动作,例如:前进25厘米、左转15°、停止等。
本文档以modelscope下载Qwen2.5-VL-7B-Instruct模型权重为例,实现基于昇腾310P的模型推理。输入为RGB图像与自然语言prompt,期望输出由自然语言prompt指定。
docker内提供了如下推理代码,其位于/usr/local/Ascend/atb-models/examples/models/qwen2_vl:
# Copyright Huawei Technologies Co., Ltd. 2023-2024. All rights reserved.
import json
import math
import os
import time
import torch
import torch_npu
import numpy as np
from PIL import Image
from transformers import AutoImageProcessor
from atb_llm.models.base.model_utils import safe_from_pretrained
from atb_llm.models.qwen2_vl.router_qwen2_vl import process_shared_memory
from atb_llm.runner.tokenizer_wrapper import TokenizerWrapper
from atb_llm.utils import argument_utils
from atb_llm.utils.cpu_binding import NpuHbmInfo
from atb_llm.utils.env import ENV
from atb_llm.utils.file_utils import safe_open, is_path_exists, safe_listdir, standardize_path, check_file_safety
from atb_llm.utils.log import logger, print_log
from atb_llm.utils.shm_utils import decode_shape_from_int64, release_shared_memory, get_data_from_shm
from examples.multimodal_runner import MultimodalPARunner, parser
from examples.multimodal_runner import path_validator
from examples.server.cache import CacheManager, CacheConfig
from examples.server.generate import decode_token, generate_req
from examples.server.request import MultiModalRequest
VISION_START_TOKEN_ID = 151652
IMAGE_TOKEN_ID = 151655
VISION_END_TOKEN_ID = 151653
IMAGE_FEATURE_LENS = 64
IMAGE_THW_TOKEN_OFFSET = 3
SECOND_PER_GRID_T_SHM_OFFSET = 4
SECOND_PER_GRID_T_SHAPE_OFFSET = 5
SUPPORTED_IMAGE_MODE = "RGB"
PYTORCH_TENSOR = "pt"
def request_from_token_qwen2_vl(config, input_ids, max_out_length, block_size, req_idx=0):
if not isinstance(input_ids, torch.Tensor):
input_ids = torch.tensor(input_ids, dtype=torch.int64)
position_ids = torch.arange(len(input_ids), dtype=torch.int64)
if torch.any(torch.eq(input_ids, VISION_START_TOKEN_ID)):
bos_pos = torch.where(torch.eq(input_ids, VISION_START_TOKEN_ID))[0]
eos_pos = torch.where(torch.eq(input_ids, VISION_END_TOKEN_ID))[0]
vision_num = bos_pos.shape[0]
deltas = 0
for i in range(vision_num):
thw_shape_value = input_ids[bos_pos[i] + IMAGE_THW_TOKEN_OFFSET]
thw_shape = decode_shape_from_int64(thw_shape_value)
vision_feature_len = eos_pos[i] - bos_pos[i] - 1
t_shape = thw_shape[0]
max_hw = max(thw_shape[1:])
if config.model_type == "qwen2_5_vl":
tokens_per_second = config.vision_config.tokens_per_second
second_per_grid_t_shm_value = input_ids[bos_pos[i] + SECOND_PER_GRID_T_SHM_OFFSET]
second_per_grid_t_shape_value = input_ids[bos_pos[i] + SECOND_PER_GRID_T_SHAPE_OFFSET]
if second_per_grid_t_shm_value < 0:
second_per_grid_t_value = get_data_from_shm(
second_per_grid_t_shm_value,
second_per_grid_t_shape_value,
np.float32
)
max_tokens_t = int(second_per_grid_t_value[0][0] * tokens_per_second * (thw_shape[0] - 1))
t_shape = max_tokens_t
if t_shape > (max_hw // 2):
deltas += vision_feature_len - t_shape
else:
deltas += vision_feature_len - max_hw // 2
position_ids[-1] = position_ids[-1] - deltas
request = MultiModalRequest(
max_out_length,
block_size,
req_idx,
input_ids,
adapter_id=None,
position_ids=position_ids
)
return request
class PARunner(MultimodalPARunner):
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.shm_name_save_path = kwargs.get("shm_name_save_path", None)
self.tokenizer_wrapper = TokenizerWrapper(self.model_path)
self.tokenizer = self.tokenizer_wrapper.tokenizer
def init_processor(self):
try:
self.processor = safe_from_pretrained(AutoImageProcessor, self.model_path)
except AssertionError:
self.processor = self.model.tokenizer
def warm_up(self):
all_input_length = self.max_input_length
input_ids_list = (
[VISION_START_TOKEN_ID]
+ [IMAGE_TOKEN_ID] * IMAGE_FEATURE_LENS
+ [VISION_END_TOKEN_ID]
+ [1] * (all_input_length - IMAGE_FEATURE_LENS - 2)
)
image = Image.new(SUPPORTED_IMAGE_MODE, (224, 224), (255, 255, 255))
warmup_image_processor = safe_from_pretrained(AutoImageProcessor, self.model_path)
images_inputs = warmup_image_processor(images=image,
videos=None,
return_tensors=PYTORCH_TENSOR)
image.close()
shared_memory_result = process_shared_memory(
images_inputs.pixel_values,
self.shm_name_save_path,
images_inputs.image_grid_thw
)
input_ids_list[1] = shared_memory_result['pixel_values_shm_name']
input_ids_list[2] = shared_memory_result['pixel_values_shape_value']
input_ids_list[3] = shared_memory_result['thw_value']
input_ids = torch.tensor(input_ids_list, dtype=torch.int64).to(self.device)
print_log(self.rank, logger.info, "---------------Begin warm_up---------------")
try:
self.warm_up_num_blocks = math.ceil((self.max_input_length + self.max_output_length) /
self.block_size) * self.max_batch_size
except ZeroDivisionError as e:
raise ZeroDivisionError from e
cache_config = CacheConfig(self.warm_up_num_blocks, self.block_size)
self.cache_manager = CacheManager(cache_config, self.model_config)
max_output_length = 2
self.model.postprocessor.max_new_tokens = max_output_length
if self.max_prefill_tokens == -1:
self.max_prefill_tokens = self.max_batch_size * (self.max_input_length + self.max_output_length)
single_req = request_from_token_qwen2_vl(
self.model.config,
input_ids,
max_output_length,
self.block_size,
req_idx=0,
)
generate_req([single_req], self.model, self.max_batch_size, self.max_prefill_tokens, self.cache_manager)
self.warm_up_memory = int(
self.max_memory
* NpuHbmInfo.get_hbm_usage(
self.local_rank, self.world_size, self.model.soc_info.need_nz
)
)
print_log(
self.rank,
logger.info,
f"warmup_memory(GB): {self.warm_up_memory / (1024 ** 3): .2f}",
)
print_log(self.rank, logger.info, "---------------End warm_up---------------")
def infer(self, mm_inputs, max_output_length, shm_name_save_path, **kwargs):
self.make_cache_manager()
self.model.postprocessor.max_new_tokens = max_output_length
req_list = []
if not ENV.profiling_enable:
torch.npu.synchronize()
e2e_start = time.time()
for i, mm_input in enumerate(mm_inputs):
input_ids = self.tokenizer_wrapper.tokenize(mm_input, shm_name_save_path=shm_name_save_path)
single_req = request_from_token_qwen2_vl(
self.model.config,
input_ids,
max_output_length,
self.block_size,
req_idx=i,
)
req_list.append(single_req)
generate_req(
req_list,
self.model,
self.max_batch_size,
self.max_prefill_tokens,
self.cache_manager,
)
generate_text_list, token_num_list = decode_token(req_list, self.tokenizer, skip_special_tokens=True)
torch.npu.synchronize()
e2e_end = time.time()
e2e_time = e2e_end - e2e_start
else:
profiling_path = ENV.profiling_filepath
if not os.path.exists(profiling_path):
os.makedirs(profiling_path, exist_ok=True)
torch.npu.synchronize()
e2e_start = time.time()
experimental_config = torch_npu.profiler._ExperimentalConfig(
aic_metrics=torch_npu.profiler.AiCMetrics.PipeUtilization,
profiler_level=torch_npu.profiler.ProfilerLevel.Level1,
l2_cache=False,
data_simplification=False,
)
with torch_npu.profiler.profile(
activities=[
torch_npu.profiler.ProfilerActivity.CPU,
torch_npu.profiler.ProfilerActivity.NPU,
],
on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(
profiling_path
),
record_shapes=True,
profile_memory=True,
with_stack=False,
with_flops=False,
with_modules=False,
experimental_config=experimental_config,
) as _:
for i, mm_input in enumerate(mm_inputs):
input_ids = self.tokenizer_wrapper.tokenize(mm_input, shm_name_save_path=shm_name_save_path)
single_req = request_from_token_qwen2_vl(
self.model.config,
input_ids,
max_output_length,
self.block_size,
req_idx=i,
)
req_list.append(single_req)
generate_req(
req_list,
self.model,
self.max_batch_size,
self.max_prefill_tokens,
self.cache_manager,
)
torch.npu.synchronize()
e2e_end = time.time()
e2e_time = e2e_end - e2e_start
generate_text_list, token_num_list = decode_token(req_list, self.tokenizer, skip_special_tokens=True)
if self.rank == 0 and is_path_exists(shm_name_save_path):
try:
release_shared_memory(shm_name_save_path)
except Exception as e:
print_log(self.rank, logger.error, f"Release shared memory failed: {e}")
try:
os.remove(shm_name_save_path)
except Exception as e:
print_log(self.rank, logger.error, f"Remove shared memory file failed: {e}")
return generate_text_list, token_num_list, e2e_time
def parse_arguments():
string_validator = argument_utils.StringArgumentValidator(min_length=0, max_length=1000)
parser_qwen2_vl = parser
parser_qwen2_vl.add_argument(
"--input_text",
default="Describe the image.",
validator=string_validator
)
parser_qwen2_vl.add_argument(
"--input_image",
default="",
validator=path_validator
)
parser_qwen2_vl.add_argument(
"--dataset_path",
help="precision test dataset path",
default="",
validator=path_validator
)
parser_qwen2_vl.add_argument(
"--shm_name_save_path",
type=str,
help='This path is used to temporarily store the shared '
'memory addresses that occur during the inference process.',
default='./shm_name.txt',
validator=path_validator
)
parser_qwen2_vl.add_argument(
"--results_save_path",
help="precision test result path",
default="./results.json",
validator=path_validator,
)
return parser_qwen2_vl.parse_args()
def is_image(file_image_name):
ext = os.path.splitext(file_image_name)[1]
ext = ext.lower()
if ext in [".jpg", ".png", ".jpeg", ".bmp"]:
return True
return False
def is_video(file_video_name):
video_ext = os.path.splitext(file_video_name)[1]
video_ext = video_ext.lower()
if video_ext in [".mp4", ".wmv", ".avi"]:
return True
return False
def deal_dataset(dataset_path, text):
input_images = []
dataset_path = standardize_path(dataset_path)
check_file_safety(dataset_path)
images_list = safe_listdir(dataset_path)
for img_name in images_list:
image_path = os.path.join(dataset_path, img_name)
input_images.append(image_path)
input_texts = [text] * len(
input_images
)
return input_images, input_texts
def replace_crlf(mm_input):
result = []
for single_input in mm_input:
res = {}
for k, v in single_input.items():
input_text_filter = v
input_text_filter = input_text_filter.replace('\n', ' ').replace('\r', ' ').replace('\f', ' ')
input_text_filter = input_text_filter.replace('\t', ' ').replace('\v', ' ').replace('\b', ' ')
input_text_filter = input_text_filter.replace('\u000A', ' ').replace('\u000D', ' ').replace('\u000C', ' ')
input_text_filter = input_text_filter.replace('\u000B', ' ').replace('\u0008', ' ').replace('\u007F', ' ')
input_text_filter = input_text_filter.replace('\u0009', ' ').replace(' ', ' ')
res[k] = input_text_filter.replace("\n", "_").replace("\r", "_")
result.append(res)
return result
if __name__ == '__main__':
args = parse_arguments()
rank = ENV.rank
local_rank = ENV.local_rank
world_size = ENV.world_size
input_dict = {
'rank': rank,
'world_size': world_size,
'local_rank': local_rank,
**vars(args)
}
pa_runner = PARunner(**input_dict)
print_log(rank, logger.info, f"pa_runner: {pa_runner}")
pa_runner.warm_up()
npu_results_dict = {}
if args.dataset_path:
dataset_images, dataset_texts = deal_dataset(args.dataset_path, args.input_text)
mm_inputs = []
for dataset_image, dataset_text in zip(dataset_images, dataset_texts):
if is_video(dataset_image):
key = "video"
elif is_image(dataset_image):
key = "image"
else:
raise TypeError("The multimodal input field currently only supports 'image' and 'video'")
single_inputs = [{key: dataset_image}, {"text": dataset_text}]
mm_inputs.append(single_inputs)
else:
if args.input_image is None:
raise ValueError("The input image or video path is empty.")
elif is_video(args.input_image):
key = "video"
elif is_image(args.input_image):
key = "image"
else:
raise TypeError("The multimodal input field currently only supports 'image' and 'video'.")
mm_inputs = [
[
{key: args.input_image},
{"text": args.input_text},
]
] * args.max_batch_size
generate_texts, token_nums, latency = pa_runner.infer(
mm_inputs,
args.max_output_length,
args.shm_name_save_path,
)
token_nums_prev = 0
for i, generate_text in enumerate(generate_texts):
inputs = mm_inputs
if args.dataset_path:
rst_key = dataset_images[i].split("/")[-1]
npu_results_dict[rst_key] = generate_text
question = replace_crlf(inputs[i])
print_log(rank, logger.info, f"Question[{i}]: {question}")
print_log(rank, logger.info, f"Answer[{i}]: {generate_text}")
print_log(rank, logger.info, f"Generate[{i}] token num: {token_nums[i][1] - token_nums_prev}")
token_nums_prev = token_nums[i][1]
print_log(rank, logger.info, f"Latency(s): {latency}")
print_log(rank, logger.info, f"Throughput(tokens/s): {token_nums[-1][1] / latency}")
if args.dataset_path:
sorted_dict = dict(sorted(npu_results_dict.items()))
with safe_open(
args.results_save_path,
"w",
override_flags=os.O_WRONLY | os.O_CREAT | os.O_EXCL,
) as f:
json.dump(sorted_dict, f)
The container also provides a launch script for the inference code above, located at /usr/local/Ascend/atb-models/examples/models/qwen2_vl:
#!/bin/bash
# Qwen2-VL model launch script
# For detailed parameter configuration and usage instructions, refer to README.md in the same directory
# Environment configuration
export BIND_CPU=1
export RESERVED_MEMORY_GB=0
export MASTER_PORT=20031
export ATB_LLM_BENCHMARK_ENABLE=1
export ATB_PROFILING_ENABLE=0
export ASCEND_RT_VISIBLE_DEVICES=0,1
# Calculate TP_WORLD_SIZE (number of NPUs used)
export TP_WORLD_SIZE=$(($(echo "${ASCEND_RT_VISIBLE_DEVICES}" | grep -o , | wc -l) + 1))
# Default parameters
MAX_BATCH_SIZE=1
MAX_INPUT_LENGTH=4096
MAX_OUTPUT_LENGTH=256
INPUT_TEXT="Explain the contents of the picture with more than 500 words and do not Answer the question using a single word or phrase."
SHM_NAME_SAVE_PATH="./shm_name.txt"
MODEL_PATH=""
INPUT_IMAGE=""
DATASET_PATH=""
MODE="single" # Added mode selection: single or path
# Parameter parsing
while [[ $# -gt 0 ]]; do
case "$1" in
--model_path)
if [[ -n "$2" && "$2" != --* ]]; then
MODEL_PATH="$2"
shift 2
else
echo "Error: --model_path requires a valid non-empty value"
exit 1
fi
;;
--input_image)
if [[ -n "$2" && "$2" != --* ]]; then
INPUT_IMAGE="$2"
MODE="single"
shift 2
else
echo "Error: --input_image requires a valid image path"
exit 1
fi
;;
--dataset_path)
if [[ -n "$2" && "$2" != --* ]]; then
DATASET_PATH="$2"
MODE="path"
shift 2
else
echo "Error: --dataset_path requires a valid dataset path"
exit 1
fi
;;
--max_batch_size)
if [[ -n "$2" && "$2" =~ ^[0-9]+$ ]]; then
MAX_BATCH_SIZE="$2"
shift 2
else
echo "Error: --max_batch_size must be a positive integer"
exit 1
fi
;;
--max_input_length)
if [[ -n "$2" && "$2" =~ ^[0-9]+$ ]]; then
MAX_INPUT_LENGTH="$2"
shift 2
else
echo "Error: --max_input_length must be a positive integer"
exit 1
fi
;;
--max_output_length)
if [[ -n "$2" && "$2" =~ ^[0-9]+$ ]]; then
MAX_OUTPUT_LENGTH="$2"
shift 2
else
echo "Error: --max_output_length must be a positive integer"
exit 1
fi
;;
--input_text)
if [[ -n "$2" && "$2" != --* ]]; then
INPUT_TEXT="$2"
shift 2
else
echo "Error: --input_text requires valid text"
exit 1
fi
;;
--shm_name_save_path)
if [[ -n "$2" && "$2" != --* ]]; then
SHM_NAME_SAVE_PATH="$2"
shift 2
else
echo "Error: --shm_name_save_path requires a valid file path"
exit 1
fi
;;
*)
echo "Unknown option: $1"
exit 1
;;
esac
done
# Parameter validation
if [[ -z "$MODEL_PATH" ]]; then
echo "Error: --model_path parameter is required"
exit 1
fi
if [[ "$MODE" == "single" && -z "$INPUT_IMAGE" ]]; then
echo "Error: Single image mode requires --input_image parameter"
exit 1
elif [[ "$MODE" == "path" && -z "$DATASET_PATH" ]]; then
echo "Error: Path mode requires --dataset_path parameter"
exit 1
fi
# Build base command
base_cmd="torchrun --nproc_per_node $TP_WORLD_SIZE --master_port $MASTER_PORT \
-m examples.models.qwen2_vl.run_pa \
--model_path \"$MODEL_PATH\" \
--shm_name_save_path \"$SHM_NAME_SAVE_PATH\" \
--max_input_length $MAX_INPUT_LENGTH \
--max_output_length $MAX_OUTPUT_LENGTH \
--max_batch_size $MAX_BATCH_SIZE \
--input_text \"${INPUT_TEXT}\""
# Add mode-specific parameters
if [[ "$MODE" == "single" ]]; then
base_cmd+=" --input_image \"$INPUT_IMAGE\""
else
base_cmd+=" --dataset_path \"$DATASET_PATH\""
fi
# Execute command
eval "$base_cmd"修改推理使用的逻辑NPU核心为0,1:
cd /usr/local/Ascend/atb-models
vim examples/models/qwen2_vl/run_pa.sh
export ASCEND_RT_VISIBLE_DEVICES=0,1
Use chown to change the owner and group of the /models/Qwen2.5-VL-7B-Instruct directory and all of its files to the root user and root group:
chown root:root -R /models/Qwen2.5-VL-7B-Instruct
Copy the test image <TEST_IMAGE_NAME> to <TEST_IMAGE_DIR>.
Launch Qwen model inference (replace <TEST_IMAGE_DIR/TEST_IMAGE_NAME>):
cd /usr/local/Ascend/atb-models
export MINDIE_LOG_TO_STDOUT=1
bash examples/models/qwen2_vl/run_pa.sh \
--model_path /models/Qwen2.5-VL-7B-Instruct \
--input_image <TEST_IMAGE_DIR/TEST_IMAGE_NAME> \
--input_text "你擅长判断图像中是否存在指定物体。输出图像中存在床的概率,仅返回大于等于0小于等于1的数值,0表示不存在,1表示存在。"使用的测试图像如下,其分辨率为320×240:
可观察到如下输出:
实验结果表明:使用昇腾310P的NPU进行Qwen2.5-VL-7B模型推理时延约为300 ms。
建议读者进一步修改上述推理代码,基于socket进行进程间通信,实现与ROS 2节点的数据传输。
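As a starting point, here is a minimal sketch of such a socket bridge running inside the container; run_inference(image_path, prompt) is a hypothetical wrapper around the PARunner.infer call shown above that returns the generated action text, and the port 9500 is an arbitrary choice:
# Minimal TCP bridge sketch (assumption: run_inference() wraps PARunner.infer()
# and returns the generated action text for one image/prompt pair).
import json
import socket

HOST, PORT = '0.0.0.0', 9500  # assumed port; the ROS 2 side connects here

def serve(run_inference):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((HOST, PORT))
        srv.listen(1)
        while True:
            conn, _ = srv.accept()
            with conn:
                # One request per connection: a JSON object with an image path
                # and a prompt (a single recv is a simplification).
                request = json.loads(conn.recv(65536).decode('utf-8'))
                action_text = run_inference(request['image'], request['text'])
                conn.sendall(json.dumps({'action': action_text}).encode('utf-8'))
The ROS 2 node on the MiniPC can then connect to this port, send the latest image path and instruction as JSON, and map the returned action text to one of the action indices described in Section 4.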
4. Converting Natural-Language Actions into Velocity Commands
Refer to the implementation in the StreamVLN framework open-sourced by the Shanghai AI Laboratory and collaborators. Assume the model's inference output contains only the following four natural-language actions (a sketch that maps the model's output text to these action indices follows the table):
Action index | Natural-language action
0 | Stop
1 | Move forward 25 cm
2 | Turn left 15°
3 | Turn right 15°
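Because the model outputs a natural-language action string, it must first be mapped to one of the indices above. The following is a minimal sketch; the keyword list is an assumption and should match the action vocabulary used when fine-tuning the model:
# Minimal sketch: map the model's natural-language action text to an action index (0-3).
def action_text_to_index(text):
    t = text.strip().lower()
    if '停止' in t or 'stop' in t:
        return 0
    if '前进' in t or 'forward' in t:
        return 1
    if '左转' in t or 'left' in t:
        return 2
    if '右转' in t or 'right' in t:
        return 3
    raise ValueError(f'Unrecognized action text: {text}')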
The StreamVLN framework provides code for computing the robot's navigation goal homo_goal from the natural-language action:
if each_action == 0:
pass
elif each_action == 1:
yaw = math.atan2(homo_goal[1, 0], homo_goal[0, 0])
homo_goal[0, 3] += 0.25 * np.cos(yaw)
homo_goal[1, 3] += 0.25 * np.sin(yaw)
elif each_action == 2:
angle = math.radians(15)
rotation_matrix = np.array([
[math.cos(angle), -math.sin(angle), 0],
[math.sin(angle), math.cos(angle), 0],
[0, 0, 1]
])
homo_goal[:3, :3] = np.dot(rotation_matrix, homo_goal[:3, :3])
elif each_action == 3:
angle = -math.radians(15.0)
rotation_matrix = np.array([
[math.cos(angle), -math.sin(angle), 0],
[math.sin(angle), math.cos(angle), 0],
[0, 0, 1]
])
homo_goal[:3, :3] = np.dot(rotation_matrix, homo_goal[:3, :3])
In addition, StreamVLN provides an implementation of a PID controller. Passing the navigation goal homo_goal together with the robot's current pose into the solve function yields the linear and angular velocities needed to drive the robot:
class PID_controller:
def __init__(self, Kp_trans=1.0, Kd_trans=0.1, Kp_yaw=1.0, Kd_yaw=1.0, max_v=1.0, max_w=1.2):
self.Kp_trans = Kp_trans
self.Kd_trans = Kd_trans
self.Kp_yaw = Kp_yaw
self.Kd_yaw = Kd_yaw
self.max_v = max_v
self.max_w = max_w
def solve(self, odom, target, vel=np.zeros(2)):
translation_error, yaw_error = self.calculate_errors(odom, target)
v, w = self.pd_step(translation_error, yaw_error, vel[0], vel[1])
return v, w, translation_error, yaw_error
def pd_step(self, translation_error, yaw_error, linear_vel, angular_vel):
translation_error = max(-1.0, min(1.0, translation_error))
yaw_error = max(-1.0, min(1.0, yaw_error))
linear_velocity = self.Kp_trans * translation_error - self.Kd_trans * linear_vel
angular_velocity = self.Kp_yaw * yaw_error - self.Kd_yaw * angular_vel
linear_velocity = max(-self.max_v, min(self.max_v, linear_velocity))
angular_velocity = max(-self.max_w, min(self.max_w, angular_velocity))
return linear_velocity, angular_velocity
def calculate_errors(self, odom, target):
dx = target[0, 3] - odom[0, 3]
dy = target[1, 3] - odom[1, 3]
odom_yaw = math.atan2(odom[1, 0], odom[0, 0])
target_yaw = math.atan2(target[1, 0], target[0, 0])
translation_error = dx * np.cos(odom_yaw) + dy * np.sin(odom_yaw)
yaw_error = target_yaw - odom_yaw
yaw_error = (yaw_error + math.pi) % (2 * math.pi) - math.pi
return translation_error, yaw_error
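A minimal usage sketch tying the two snippets together is given below; it assumes odom and homo_goal are 4x4 homogeneous pose matrices expressed in the same world frame:
# Minimal usage sketch (assumption: 4x4 homogeneous poses in a common world frame).
import math
import numpy as np

controller = PID_controller()
odom = np.eye(4)          # current robot pose from odometry
homo_goal = odom.copy()   # navigation goal, updated per action as shown above

each_action = 1           # e.g. "move forward 25 cm"
if each_action == 1:
    yaw = math.atan2(homo_goal[1, 0], homo_goal[0, 0])
    homo_goal[0, 3] += 0.25 * np.cos(yaw)
    homo_goal[1, 3] += 0.25 * np.sin(yaw)

# vel holds the robot's current (linear, angular) velocity, here assumed zero.
v, w, translation_error, yaw_error = controller.solve(odom, homo_goal, vel=np.zeros(2))
print(f'linear velocity: {v:.3f} m/s, angular velocity: {w:.3f} rad/s')
The resulting v and w can then be published as the linear.x and angular.z fields of the velocity command sent to the wheeled base.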