

1. Overview

1.1. Background on Vision-Language Navigation

Vision-Language Navigation (VLN) is a multidisciplinary research area spanning natural language processing, computer vision, multimodal information fusion, and mobile-robot navigation. In VLN, a mobile robot must autonomously navigate to a target location in a complex environment by following a natural-language instruction, where the target location is defined by the VLN task. For example, in an indoor home scenario, given the instruction "walk forward and leave the living room, go through the hallway, and stop by the bed in the bedroom," the navigation target is the bedside in the bedroom.

The core of today's mainstream VLN models is an end-to-end Vision-Language Model (VLM). A VLM is a multimodal large model that accepts both image and text input and can understand image content and process cross-modal information. Examples include the NaVILA framework proposed by NVIDIA and collaborators in 2025 and the StreamVLN framework proposed by the Shanghai AI Laboratory and collaborators in 2025. Both take a natural-language navigation instruction and a sequence of images as input and output actions such as move forward, turn left, and turn right.

1.2. Goals

Using the Qwen2.5-VL-7B model as an example, this document describes how to deploy the model for inference on the Ascend 310P. Note:

(1) Before deploying the model with this document: readers should train or fine-tune the model on VLN simulation datasets or on datasets collected with real robots to obtain model weights suited to the VLN task. For training or fine-tuning Qwen2.5-VL-7B for VLN, refer to the InternVLA-N1 algorithm proposed by the Shanghai AI Laboratory team in 2025.

(2) After deploying the model with this document: readers can convert the actions generated by the model into velocity commands that drive the locomotion mechanism of their specific robot platform.

1.3. System Overview

Taking the Qianxun robot as an example, the hardware architecture is shown in the figure below.

For an introduction to Ascend edge computing power, see: Development Kit Introduction and Networking.

The computing platform used in this document is therefore an Orange Pi 310P expansion dock plus a MiniPC. The software architecture is shown in the figure below.

The pretrained Qwen2.5-VL-7B model is deployed on the NPU of the Orange Pi 310P expansion dock, while the rest of the VLN system runs on the x86 MiniPC. The images fed to the model are captured by the RGB-D camera mounted on the mobile robot, and the pretrained Qwen2.5-VL-7B model outputs natural-language actions such as move forward 10 m, turn left 10°, or stop. Finally, the velocity commands produced by the VLN system drive the robot's wheeled base. Readers are advised to use the robot vendor's SDK or ROS 2 to subscribe to the image topic and publish velocity commands, as sketched below.
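
For reference, below is a minimal rclpy sketch of such a bridge node. The topic names /camera/color/image_raw and /cmd_vel are assumptions and depend on the robot vendor's drivers; this is not the vendor's SDK code.

python
# Minimal ROS 2 bridge sketch (topic names are assumptions; adapt to your robot's drivers).
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from geometry_msgs.msg import Twist


class VLNBridge(Node):
    def __init__(self):
        super().__init__('vln_bridge')
        # Subscribe to the RGB stream of the RGB-D camera (assumed topic name).
        self.image_sub = self.create_subscription(
            Image, '/camera/color/image_raw', self.on_image, 10)
        # Publish velocity commands to the wheeled base (assumed topic name).
        self.cmd_pub = self.create_publisher(Twist, '/cmd_vel', 10)
        self.latest_image = None

    def on_image(self, msg: Image):
        # Cache the newest frame; it would be forwarded to the Qwen2.5-VL-7B inference process.
        self.latest_image = msg

    def publish_cmd(self, linear_x: float, angular_z: float):
        # Convert a decoded action (e.g. "move forward", "turn left") into a Twist message.
        cmd = Twist()
        cmd.linear.x = linear_x
        cmd.angular.z = angular_z
        self.cmd_pub.publish(cmd)


def main():
    rclpy.init()
    node = VLNBridge()
    rclpy.spin(node)
    rclpy.shutdown()


if __name__ == '__main__':
    main()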


2. Environment Setup

2.1. Workflow Overview

To deploy the pretrained Qwen2.5-VL-7B model on the NPU of the Orange Pi 310P expansion dock, set up the environment following the workflow shown in the figure below, in preparation for running model inference at the end.

2.2. Operating System Installation

Following Chapter 3 of the user manual at the linked page, flash the Ubuntu 22.04 image provided by Orange Pi.

2.3. Driver Installation

Following Section 2.6 of the user manual at the linked page, install the driver.

2.4. Firmware Installation

Following Section 2.7 of the user manual at the linked page, install the firmware.

2.5. CANN Installation

Following Section 6.1 of the user manual at the linked page, install CANN.

2.6. Python3 Installation

If you use the Ubuntu 22.04 image provided by Orange Pi, Python3 is preinstalled.
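
For example, you can confirm the interpreter version:

plaintext
python3 --version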

2.7. PyTorch Installation

Following Chapter 8 of the user manual at the linked page, install PyTorch and the torch_npu plugin.
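
After installation, a quick sanity check is to confirm that the NPU is visible to PyTorch (a one-line sketch; it assumes torch and torch_npu were installed for the system python3):

plaintext
python3 -c "import torch, torch_npu; print(torch.npu.is_available())"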

2.8. MindIE Docker Image Installation

Docker is a software platform that packages an application together with all of its dependencies into standard units called containers, so that the application can be deployed and run quickly and reliably in any environment. Following Section 5.1 of the user manual at the linked page, install Docker:

plaintext
apt update
apt install -y docker.io

After installation, verify that it succeeded; the version number should be printed:

plaintext
docker -v

Open the link and click the ATB-Models resource option:

Click the Download Now option:

Enter the command corresponding to your processor architecture:

After the download completes, list the Docker images already on the local host:

plaintext
docker images

Record the ID of the downloaded Docker image: <IMAGE_ID>.
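
If preferred, the image ID can also be printed directly with Docker's standard --format option:

plaintext
docker images --format "{{.Repository}}:{{.Tag}} {{.ID}}"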

2.9. Using the MindIE Docker Image

2.9.1. Starting the MindIE Docker Image for the First Time

The Docker image currently published by the Ascend community still has a few issues, which are fixed by replacing some files. Follow Section 5.1.2.1 of the user manual at the linked page to apply the fixes.

Create and start the container for the first time (replace <IMAGE_ID>):

plaintext
docker run -it -d --net=host --shm-size=500g \
    --name qwen7B-env \
    --device=/dev/davinci_manager \
    --device=/dev/hisi_hdc \
    --device=/dev/devmm_svm \
    --device=/dev/davinci0 \
    --device=/dev/davinci1 \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro \
    -v /usr/local/sbin:/usr/local/sbin:ro \
    -v /home/huawei/fix_openeuler_docker:/fix_openeuler_docker \
    <IMAGE_ID> bash

Inside the container, copy the patch files to the current directory:

plaintext
cd /usr/local/Ascend/ascend-toolkit/8.2.RC1/lib64/
ls /fix_openeuler_docker/fixhccl/8.2hccl/
cp /fix_openeuler_docker/fixhccl/8.2hccl/* ./

Re-source the CANN environment variables:

plaintext
source /usr/local/Ascend/ascend-toolkit/set_env.sh

Upgrade the Ascend-cann-nnal package:

plaintext
chmod +x /fix_openeuler_docker/Ascend-cann-nnal/Ascend-cann-nnal_8.3.RC1_linux-x86_64.run
cd /fix_openeuler_docker/Ascend-cann-nnal
./Ascend-cann-nnal_8.3.RC1_linux-x86_64.run --install --quiet

After the upgrade, confirm the version information:

plaintext
cat /usr/local/Ascend/nnal/atb/latest/version.info

The output should be as follows:

plaintext
Ascend-cann-atb : 8.3.RC1
Ascend-cann-atb Version : 8.3.RC1.B106
Platform : x86_64
branch : 8.3.rc1-0702
commit id : 16004f23040e0dcdd3cf0c64ecf36622487038ba

Refresh the dnf metadata and update the installed packages:

plaintext
dnf update

2.9.2. Restarting the MindIE Docker Image

If you reboot the machine or stop the running container and later want to re-enter the Docker environment, run the following commands in order. List the containers that have been created:

plaintext
docker ps -a

Start the created container (replace <YOUR_DOCKER_NAME>):

plaintext
docker start <YOUR_DOCKER_NAME>
docker exec -it <YOUR_DOCKER_NAME> bash

If an error occurs during environment setup and you want to delete the created container (replace <YOUR_DOCKER_NAME>):

plaintext
docker rm -f <YOUR_DOCKER_NAME>

2.10. Preparing the Qwen Model Weights

In the container's root directory, create a models directory to hold the Qwen model weights:

plaintext
mkdir /models

If you have already prepared model weights suited to the VLN task, copy them into this directory.

If you have not prepared weights yet, you can download the Qwen2.5-VL-7B-Instruct weights from modelscope. Install modelscope:

plaintext
pip install modelscope

Download the Qwen2.5-VL-7B-Instruct model weights:

plaintext
cd /models
modelscope download --model Qwen/Qwen2.5-VL-7B-Instruct --local_dir ./Qwen2.5-VL-7B-Instruct

In Qwen2.5-VL-7B-Instruct/config.json, set torch_dtype to float16:

plaintext
vim /models/Qwen2.5-VL-7B-Instruct/config.json
"torch_dtype": "float16

3. Model Inference

If you have prepared Qwen2.5-VL-7B weights suited to the VLN task, the input is an RGB image plus a natural-language navigation instruction, and the expected output is a natural-language action such as move forward 25 cm, turn left 15°, or stop.

This document uses the Qwen2.5-VL-7B-Instruct weights downloaded from modelscope as an example and runs inference on the Ascend 310P. The input is an RGB image and a natural-language prompt, and the expected output is specified by the prompt.

The container ships with the following inference code, located at /usr/local/Ascend/atb-models/examples/models/qwen2_vl:

python
# Copyright Huawei Technologies Co., Ltd. 2023-2024. All rights reserved.
import json
import math
import os
import time

import torch
import torch_npu
import numpy as np
from PIL import Image
from transformers import AutoImageProcessor
from atb_llm.models.base.model_utils import safe_from_pretrained
from atb_llm.models.qwen2_vl.router_qwen2_vl import process_shared_memory
from atb_llm.runner.tokenizer_wrapper import TokenizerWrapper
from atb_llm.utils import argument_utils
from atb_llm.utils.cpu_binding import NpuHbmInfo
from atb_llm.utils.env import ENV
from atb_llm.utils.file_utils import safe_open, is_path_exists, safe_listdir, standardize_path, check_file_safety
from atb_llm.utils.log import logger, print_log
from atb_llm.utils.shm_utils import decode_shape_from_int64, release_shared_memory, get_data_from_shm
from examples.multimodal_runner import MultimodalPARunner, parser
from examples.multimodal_runner import path_validator
from examples.server.cache import CacheManager, CacheConfig
from examples.server.generate import decode_token, generate_req
from examples.server.request import MultiModalRequest

VISION_START_TOKEN_ID = 151652
IMAGE_TOKEN_ID = 151655
VISION_END_TOKEN_ID = 151653
IMAGE_FEATURE_LENS = 64
IMAGE_THW_TOKEN_OFFSET = 3
SECOND_PER_GRID_T_SHM_OFFSET = 4
SECOND_PER_GRID_T_SHAPE_OFFSET = 5
SUPPORTED_IMAGE_MODE = "RGB"
PYTORCH_TENSOR = "pt"


def request_from_token_qwen2_vl(config, input_ids, max_out_length, block_size, req_idx=0):
    if not isinstance(input_ids, torch.Tensor):
        input_ids = torch.tensor(input_ids, dtype=torch.int64)
    position_ids = torch.arange(len(input_ids), dtype=torch.int64)
    if torch.any(torch.eq(input_ids, VISION_START_TOKEN_ID)):
        bos_pos = torch.where(torch.eq(input_ids, VISION_START_TOKEN_ID))[0]
        eos_pos = torch.where(torch.eq(input_ids, VISION_END_TOKEN_ID))[0]
        vision_num = bos_pos.shape[0]
        deltas = 0
        for i in range(vision_num):
            thw_shape_value = input_ids[bos_pos[i] + IMAGE_THW_TOKEN_OFFSET]
            thw_shape = decode_shape_from_int64(thw_shape_value)

            vision_feature_len = eos_pos[i] - bos_pos[i] - 1
            t_shape = thw_shape[0]
            max_hw = max(thw_shape[1:])

            if config.model_type == "qwen2_5_vl":
                tokens_per_second = config.vision_config.tokens_per_second
                second_per_grid_t_shm_value = input_ids[bos_pos[i] + SECOND_PER_GRID_T_SHM_OFFSET]
                second_per_grid_t_shape_value = input_ids[bos_pos[i] + SECOND_PER_GRID_T_SHAPE_OFFSET]
                if second_per_grid_t_shm_value < 0:
                    second_per_grid_t_value = get_data_from_shm(
                        second_per_grid_t_shm_value,
                        second_per_grid_t_shape_value,
                        np.float32
                    )
                    max_tokens_t = int(second_per_grid_t_value[0][0] * tokens_per_second * (thw_shape[0] - 1))
                    t_shape = max_tokens_t
            if t_shape > (max_hw // 2):
                deltas += vision_feature_len - t_shape
            else:
                deltas += vision_feature_len - max_hw // 2
        position_ids[-1] = position_ids[-1] - deltas

    request = MultiModalRequest(
        max_out_length,
        block_size,
        req_idx,
        input_ids,
        adapter_id=None,
        position_ids=position_ids
    )
    return request


class PARunner(MultimodalPARunner):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.shm_name_save_path = kwargs.get("shm_name_save_path", None)
        self.tokenizer_wrapper = TokenizerWrapper(self.model_path)
        self.tokenizer = self.tokenizer_wrapper.tokenizer

    def init_processor(self):
        try:
            self.processor = safe_from_pretrained(AutoImageProcessor, self.model_path)
        except AssertionError:
            self.processor = self.model.tokenizer

    def warm_up(self):
        all_input_length = self.max_input_length
        input_ids_list = (
                [VISION_START_TOKEN_ID]
                + [IMAGE_TOKEN_ID] * IMAGE_FEATURE_LENS
                + [VISION_END_TOKEN_ID]
                + [1] * (all_input_length - IMAGE_FEATURE_LENS - 2)
        )
        image = Image.new(SUPPORTED_IMAGE_MODE, (224, 224), (255, 255, 255))
        warmup_image_processor = safe_from_pretrained(AutoImageProcessor, self.model_path)
        images_inputs = warmup_image_processor(images=image,
                                               videos=None,
                                               return_tensors=PYTORCH_TENSOR)
        image.close()
        shared_memory_result = process_shared_memory(
            images_inputs.pixel_values,
            self.shm_name_save_path,
            images_inputs.image_grid_thw
        )
        input_ids_list[1] = shared_memory_result['pixel_values_shm_name']
        input_ids_list[2] = shared_memory_result['pixel_values_shape_value']
        input_ids_list[3] = shared_memory_result['thw_value']

        input_ids = torch.tensor(input_ids_list, dtype=torch.int64).to(self.device)
        print_log(self.rank, logger.info, "---------------Begin warm_up---------------")
        try:
            self.warm_up_num_blocks = math.ceil((self.max_input_length + self.max_output_length) /
                                                self.block_size) * self.max_batch_size
        except ZeroDivisionError as e:
            raise ZeroDivisionError from e
        cache_config = CacheConfig(self.warm_up_num_blocks, self.block_size)
        self.cache_manager = CacheManager(cache_config, self.model_config)
        max_output_length = 2
        self.model.postprocessor.max_new_tokens = max_output_length
        if self.max_prefill_tokens == -1:
            self.max_prefill_tokens = self.max_batch_size * (self.max_input_length + self.max_output_length)
        single_req = request_from_token_qwen2_vl(
            self.model.config,
            input_ids,
            max_output_length,
            self.block_size,
            req_idx=0,
        )
        generate_req([single_req], self.model, self.max_batch_size, self.max_prefill_tokens, self.cache_manager)
        self.warm_up_memory = int(
            self.max_memory
            * NpuHbmInfo.get_hbm_usage(
                self.local_rank, self.world_size, self.model.soc_info.need_nz
            )
        )
        print_log(
            self.rank,
            logger.info,
            f"warmup_memory(GB): {self.warm_up_memory / (1024 ** 3): .2f}",
        )
        print_log(self.rank, logger.info, "---------------End warm_up---------------")

    def infer(self, mm_inputs, max_output_length, shm_name_save_path, **kwargs):
        self.make_cache_manager()

        self.model.postprocessor.max_new_tokens = max_output_length

        req_list = []
        if not ENV.profiling_enable:
            torch.npu.synchronize()
            e2e_start = time.time()
            for i, mm_input in enumerate(mm_inputs):
                input_ids = self.tokenizer_wrapper.tokenize(mm_input, shm_name_save_path=shm_name_save_path)
                single_req = request_from_token_qwen2_vl(
                    self.model.config,
                    input_ids,
                    max_output_length,
                    self.block_size,
                    req_idx=i,
                )
                req_list.append(single_req)
            generate_req(
                req_list,
                self.model,
                self.max_batch_size,
                self.max_prefill_tokens,
                self.cache_manager,
            )
            generate_text_list, token_num_list = decode_token(req_list, self.tokenizer, skip_special_tokens=True)
            torch.npu.synchronize()
            e2e_end = time.time()
            e2e_time = e2e_end - e2e_start
        else:
            profiling_path = ENV.profiling_filepath
            if not os.path.exists(profiling_path):
                os.makedirs(profiling_path, exist_ok=True)
            torch.npu.synchronize()
            e2e_start = time.time()
            experimental_config = torch_npu.profiler._ExperimentalConfig(
                aic_metrics=torch_npu.profiler.AiCMetrics.PipeUtilization,
                profiler_level=torch_npu.profiler.ProfilerLevel.Level1,
                l2_cache=False,
                data_simplification=False,
            )
            with torch_npu.profiler.profile(
                    activities=[
                        torch_npu.profiler.ProfilerActivity.CPU,
                        torch_npu.profiler.ProfilerActivity.NPU,
                    ],
                    on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(
                        profiling_path
                    ),
                    record_shapes=True,
                    profile_memory=True,
                    with_stack=False,
                    with_flops=False,
                    with_modules=False,
                    experimental_config=experimental_config,
            ) as _:
                for i, mm_input in enumerate(mm_inputs):
                    input_ids = self.tokenizer_wrapper.tokenize(mm_input, shm_name_save_path=shm_name_save_path)
                    single_req = request_from_token_qwen2_vl(
                        self.model.config,
                        input_ids,
                        max_output_length,
                        self.block_size,
                        req_idx=i,
                    )
                    req_list.append(single_req)
                generate_req(
                    req_list,
                    self.model,
                    self.max_batch_size,
                    self.max_prefill_tokens,
                    self.cache_manager,
                )
            torch.npu.synchronize()
            e2e_end = time.time()
            e2e_time = e2e_end - e2e_start
            generate_text_list, token_num_list = decode_token(req_list, self.tokenizer, skip_special_tokens=True)
        if self.rank == 0 and is_path_exists(shm_name_save_path):
            try:
                release_shared_memory(shm_name_save_path)
            except Exception as e:
                print_log(self.rank, logger.error, f"Release shared memory failed: {e}")
            try:
                os.remove(shm_name_save_path)
            except Exception as e:
                print_log(self.rank, logger.error, f"Remove shared memory file failed: {e}")
        return generate_text_list, token_num_list, e2e_time


def parse_arguments():
    string_validator = argument_utils.StringArgumentValidator(min_length=0, max_length=1000)
    parser_qwen2_vl = parser

    parser_qwen2_vl.add_argument(
        "--input_text",
        default="Describe the image.",
        validator=string_validator
    )
    parser_qwen2_vl.add_argument(
        "--input_image",
        default="",
        validator=path_validator
    )
    parser_qwen2_vl.add_argument(
        "--dataset_path",
        help="precision test dataset path",
        default="",
        validator=path_validator
    )
    parser_qwen2_vl.add_argument(
        "--shm_name_save_path",
        type=str,
        help='This path is used to temporarily store the shared '
             'memory addresses that occur during the inference process.',
        default='./shm_name.txt',
        validator=path_validator
    )
    parser_qwen2_vl.add_argument(
        "--results_save_path",
        help="precision test result path",
        default="./results.json",
        validator=path_validator,
    )

    return parser_qwen2_vl.parse_args()


def is_image(file_image_name):
    ext = os.path.splitext(file_image_name)[1]
    ext = ext.lower()
    if ext in [".jpg", ".png", ".jpeg", ".bmp"]:
        return True
    return False


def is_video(file_video_name):
    video_ext = os.path.splitext(file_video_name)[1]
    video_ext = video_ext.lower()
    if video_ext in [".mp4", ".wmv", ".avi"]:
        return True
    return False


def deal_dataset(dataset_path, text):
    input_images = []
    dataset_path = standardize_path(dataset_path)
    check_file_safety(dataset_path)
    images_list = safe_listdir(dataset_path)
    for img_name in images_list:
        image_path = os.path.join(dataset_path, img_name)
        input_images.append(image_path)
    input_texts = [text] * len(
        input_images
    )
    return input_images, input_texts


def replace_crlf(mm_input):
    result = []
    for single_input in mm_input:
        res = {}
        for k, v in single_input.items():
            input_text_filter = v
            input_text_filter = input_text_filter.replace('\n', ' ').replace('\r', ' ').replace('\f', ' ')
            input_text_filter = input_text_filter.replace('\t', ' ').replace('\v', ' ').replace('\b', ' ')
            input_text_filter = input_text_filter.replace('\u000A', ' ').replace('\u000D', ' ').replace('\u000C', ' ')
            input_text_filter = input_text_filter.replace('\u000B', ' ').replace('\u0008', ' ').replace('\u007F', ' ')
            input_text_filter = input_text_filter.replace('\u0009', ' ').replace('    ', ' ')
            res[k] = input_text_filter.replace("\n", "_").replace("\r", "_")
        result.append(res)
    return result

if __name__ == '__main__':
    args = parse_arguments()
    rank = ENV.rank
    local_rank = ENV.local_rank
    world_size = ENV.world_size
    input_dict = {
        'rank': rank,
        'world_size': world_size,
        'local_rank': local_rank,
        **vars(args)
    }

    pa_runner = PARunner(**input_dict)
    print_log(rank, logger.info, f"pa_runner: {pa_runner}")
    pa_runner.warm_up()
    npu_results_dict = {}
    if args.dataset_path:
        dataset_images, dataset_texts = deal_dataset(args.dataset_path, args.input_text)
        mm_inputs = []
        for dataset_image, dataset_text in zip(dataset_images, dataset_texts):
            if is_video(dataset_image):
                key = "video"
            elif is_image(dataset_image):
                key = "image"
            else:
                raise TypeError("The multimodal input field currently only supports 'image' and 'video'")
            single_inputs = [{key: dataset_image}, {"text": dataset_text}]
            mm_inputs.append(single_inputs)
    else:
        if args.input_image is None:
            raise ValueError("The input image or video path is empty.")
        elif is_video(args.input_image):
            key = "video"
        elif is_image(args.input_image):
            key = "image"
        else:
            raise TypeError("The multimodal input field currently only supports 'image' and 'video'.")
        mm_inputs = [
                        [
                            {key: args.input_image},
                            {"text": args.input_text},
                        ]
                    ] * args.max_batch_size

    generate_texts, token_nums, latency = pa_runner.infer(
        mm_inputs,
        args.max_output_length,
        args.shm_name_save_path,
    )
    token_nums_prev = 0
    for i, generate_text in enumerate(generate_texts):
        inputs = mm_inputs
        if args.dataset_path:
            rst_key = dataset_images[i].split("/")[-1]
            npu_results_dict[rst_key] = generate_text
        question = replace_crlf(inputs[i])
        print_log(rank, logger.info, f"Question[{i}]: {question}")
        print_log(rank, logger.info, f"Answer[{i}]: {generate_text}")
        print_log(rank, logger.info, f"Generate[{i}] token num: {token_nums[i][1] - token_nums_prev}")
        token_nums_prev = token_nums[i][1]
    print_log(rank, logger.info, f"Latency(s): {latency}")
    print_log(rank, logger.info, f"Throughput(tokens/s): {token_nums[-1][1] / latency}")

    if args.dataset_path:
        sorted_dict = dict(sorted(npu_results_dict.items()))
        with safe_open(
                args.results_save_path,
                "w",
                override_flags=os.O_WRONLY | os.O_CREAT | os.O_EXCL,
        ) as f:
            json.dump(sorted_dict, f)

The container also provides a launch script for the inference code above, located at /usr/local/Ascend/atb-models/examples/models/qwen2_vl:

shell
#!/bin/bash
# Qwen2-VL model launch script
# For detailed parameter configuration and usage instructions, refer to README.md in the same directory

# Environment configuration
export BIND_CPU=1
export RESERVED_MEMORY_GB=0
export MASTER_PORT=20031
export ATB_LLM_BENCHMARK_ENABLE=1
export ATB_PROFILING_ENABLE=0
export ASCEND_RT_VISIBLE_DEVICES=0,1

# Calculate TP_WORLD_SIZE (number of NPUs used)
export TP_WORLD_SIZE=$(($(echo "${ASCEND_RT_VISIBLE_DEVICES}" | grep -o , | wc -l) + 1))

# Default parameters
MAX_BATCH_SIZE=1
MAX_INPUT_LENGTH=4096
MAX_OUTPUT_LENGTH=256
INPUT_TEXT="Explain the contents of the picture with more than 500 words and do not Answer the question using a single word or phrase."
SHM_NAME_SAVE_PATH="./shm_name.txt"
MODEL_PATH=""
INPUT_IMAGE=""
DATASET_PATH=""
MODE="single"  # Added mode selection: single or path

# Parameter parsing
while [[ $# -gt 0 ]]; do
  case "$1" in
    --model_path)
      if [[ -n "$2" && "$2" != --* ]]; then
        MODEL_PATH="$2"
        shift 2
      else
        echo "Error: --model_path requires a valid non-empty value"
        exit 1
      fi
      ;;
    --input_image)
      if [[ -n "$2" && "$2" != --* ]]; then
        INPUT_IMAGE="$2"
        MODE="single"
        shift 2
      else
        echo "Error: --input_image requires a valid image path"
        exit 1
      fi
      ;;
    --dataset_path)
      if [[ -n "$2" && "$2" != --* ]]; then
        DATASET_PATH="$2"
        MODE="path"
        shift 2
      else
        echo "Error: --dataset_path requires a valid dataset path"
        exit 1
      fi
      ;;
    --max_batch_size)
      if [[ -n "$2" && "$2" =~ ^[0-9]+$ ]]; then
        MAX_BATCH_SIZE="$2"
        shift 2
      else
        echo "Error: --max_batch_size must be a positive integer"
        exit 1 
      fi
      ;;
    --max_input_length)
      if [[ -n "$2" && "$2" =~ ^[0-9]+$ ]]; then
        MAX_INPUT_LENGTH="$2"
        shift 2
      else
        echo "Error: --max_input_length must be a positive integer"
        exit 1
      fi
      ;;
    --max_output_length)
      if [[ -n "$2" && "$2" =~ ^[0-9]+$ ]]; then
        MAX_OUTPUT_LENGTH="$2"
        shift 2
      else
        echo "Error: --max_output_length must be a positive integer"
        exit 1
      fi
      ;;
    --input_text)
      if [[ -n "$2" && "$2" != --* ]]; then
        INPUT_TEXT="$2"
        shift 2
      else
        echo "Error: --input_text requires valid text"
        exit 1
      fi
      ;;
    --shm_name_save_path)
      if [[ -n "$2" && "$2" != --* ]]; then
        SHM_NAME_SAVE_PATH="$2"
        shift 2
      else
        echo "Error: --shm_name_save_path requires a valid file path"
        exit 1
      fi
      ;;
    *)
      echo "Unknown option: $1"
      exit 1
      ;;
  esac
done

# Parameter validation
if [[ -z "$MODEL_PATH" ]]; then
  echo "Error: --model_path parameter is required"
  exit 1
fi

if [[ "$MODE" == "single" && -z "$INPUT_IMAGE" ]]; then
  echo "Error: Single image mode requires --input_image parameter"
  exit 1
elif [[ "$MODE" == "path" && -z "$DATASET_PATH" ]]; then
  echo "Error: Path mode requires --dataset_path parameter"
  exit 1
fi

# Build base command
base_cmd="torchrun --nproc_per_node $TP_WORLD_SIZE --master_port $MASTER_PORT \
    -m examples.models.qwen2_vl.run_pa \
    --model_path \"$MODEL_PATH\" \
    --shm_name_save_path \"$SHM_NAME_SAVE_PATH\" \
    --max_input_length $MAX_INPUT_LENGTH \
    --max_output_length $MAX_OUTPUT_LENGTH \
    --max_batch_size $MAX_BATCH_SIZE \
    --input_text \"${INPUT_TEXT}\""

# Add mode-specific parameters
if [[ "$MODE" == "single" ]]; then
  base_cmd+=" --input_image \"$INPUT_IMAGE\""
else
  base_cmd+=" --dataset_path \"$DATASET_PATH\""
fi


# Execute command
eval "$base_cmd"

Set the logical NPU cores used for inference to 0 and 1:

plaintext
cd /usr/local/Ascend/atb-models
vim examples/models/qwen2_vl/run_pa.sh
export ASCEND_RT_VISIBLE_DEVICES=0,1

Use chown to change the owner and group of the /models/Qwen2.5-VL-7B-Instruct directory and all files under it to the root user and root group:

plaintext
chown root:root -R /models/Qwen2.5-VL-7B-Instruct

Copy the test image <TEST_IMAGE_NAME> into <TEST_IMAGE_DIR>, for example with docker cp as shown below.
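
If the test image currently sits on the host, one way (standard Docker CLI) is to copy it into the container created earlier (container name qwen7B-env; replace the placeholders):

plaintext
docker cp <TEST_IMAGE_NAME> qwen7B-env:<TEST_IMAGE_DIR>/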

Launch Qwen model inference (replace <TEST_IMAGE_DIR/TEST_IMAGE_NAME>; the Chinese prompt below asks the model to output the probability, a value between 0 and 1, that a bed is present in the image):

plaintext
cd /usr/local/Ascend/atb-models
export MINDIE_LOG_TO_STDOUT=1
bash examples/models/qwen2_vl/run_pa.sh \
	--model_path /models/Qwen2.5-VL-7B-Instruct \
	--input_image <TEST_IMAGE_DIR/TEST_IMAGE_NAME> \
	--input_text "你擅长判断图像中是否存在指定物体。输出图像中存在床的概率,仅返回大于等于0小于等于1的数值,0表示不存在,1表示存在。"

The test image used is shown below; its resolution is 320×240:

The following output can be observed:

The experiment shows that inference of the Qwen2.5-VL-7B model on the Ascend 310P NPU has a latency of roughly 300 ms.

Readers are encouraged to further modify the inference code above to communicate with a ROS 2 node over sockets for inter-process data exchange, as sketched below.
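
As a starting point, below is a minimal sketch of such a socket interface. It assumes the inference process listens on TCP port 9999 (an arbitrary choice) and that prompts and actions are exchanged as UTF-8 strings; the ROS 2 node on the MiniPC acts as the client. The handle_request callback is a hypothetical hook that would wrap pa_runner.infer.

python
# Minimal TCP socket sketch for exchanging prompts/actions with the inference process.
# Port 9999 and the plain-text protocol are assumptions, not part of the original code.
import socket

HOST, PORT = "0.0.0.0", 9999


def serve(handle_request):
    """Run inside the container: receive a prompt, return the model's action text."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((HOST, PORT))
        srv.listen(1)
        while True:
            conn, _ = srv.accept()
            with conn:
                prompt = conn.recv(65536).decode("utf-8")
                # handle_request would call the model inference and return the generated text.
                action = handle_request(prompt)
                conn.sendall(action.encode("utf-8"))


def query(prompt, server_ip="127.0.0.1"):
    """Run on the ROS 2 side: send a prompt and read back the action string."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
        cli.connect((server_ip, PORT))
        cli.sendall(prompt.encode("utf-8"))
        return cli.recv(65536).decode("utf-8")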

4. Converting Natural-Language Actions into Velocity Commands

Refer to the implementation in the StreamVLN framework open-sourced by the Shanghai AI Laboratory and collaborators. Assume the model's inference output contains only the following four natural-language actions:

Action index    Natural-language action
0               Stop
1               Move forward 25 cm
2               Turn left 15°
3               Turn right 15°

StreamVLN provides code for computing the robot's navigation goal homo_goal from the natural-language action:

python
if each_action == 0:
    pass
elif each_action == 1:
    yaw = math.atan2(homo_goal[1, 0], homo_goal[0, 0])
    homo_goal[0, 3] += 0.25 * np.cos(yaw)
    homo_goal[1, 3] += 0.25 * np.sin(yaw)
elif each_action == 2:
    angle = math.radians(15)
    rotation_matrix = np.array([
        [math.cos(angle), -math.sin(angle), 0],
        [math.sin(angle),  math.cos(angle), 0],
        [0,                0,               1]
    ])
    homo_goal[:3, :3] = np.dot(rotation_matrix, homo_goal[:3, :3])
elif each_action == 3:
    angle = -math.radians(15.0)
    rotation_matrix = np.array([
        [math.cos(angle), -math.sin(angle), 0],
        [math.sin(angle),  math.cos(angle), 0],
        [0,                0,               1]
    ])
    homo_goal[:3, :3] = np.dot(rotation_matrix, homo_goal[:3, :3])

StreamVLN also provides an implementation of a PID controller; passing the navigation goal homo_goal together with the robot's current pose into the solve function yields the linear and angular velocities needed to drive the robot:

python
class PID_controller:
    def __init__(self, Kp_trans=1.0, Kd_trans=0.1, Kp_yaw=1.0, Kd_yaw=1.0, max_v=1.0, max_w=1.2):
        self.Kp_trans = Kp_trans
        self.Kd_trans = Kd_trans
        self.Kp_yaw = Kp_yaw
        self.Kd_yaw = Kd_yaw
        self.max_v = max_v
        self.max_w = max_w
    
    def solve(self, odom, target, vel=np.zeros(2)):
        translation_error, yaw_error = self.calculate_errors(odom, target)
        v, w = self.pd_step(translation_error, yaw_error, vel[0], vel[1])
        return v, w, translation_error, yaw_error
    
    def pd_step(self, translation_error, yaw_error, linear_vel, angular_vel):
        translation_error = max(-1.0, min(1.0, translation_error))
        yaw_error = max(-1.0, min(1.0, yaw_error))

        linear_velocity = self.Kp_trans * translation_error - self.Kd_trans * linear_vel
        angular_velocity = self.Kp_yaw * yaw_error - self.Kd_yaw * angular_vel

        linear_velocity = max(-self.max_v, min(self.max_v, linear_velocity))
        angular_velocity = max(-self.max_w, min(self.max_w, angular_velocity))
                
        return linear_velocity, angular_velocity
    
    def calculate_errors(self, odom, target):
        
        dx =  target[0, 3] - odom[0, 3]
        dy =  target[1, 3] - odom[1, 3]

        odom_yaw = math.atan2(odom[1, 0], odom[0, 0])
        target_yaw = math.atan2(target[1, 0], target[0, 0])
        
        translation_error = dx * np.cos(odom_yaw) + dy * np.sin(odom_yaw)    

        yaw_error = target_yaw - odom_yaw
        yaw_error = (yaw_error + math.pi) % (2 * math.pi) - math.pi
        
        return translation_error, yaw_error