

1. Overview

1.1. Background on Vision-Language Navigation

Vision-Language Navigation (VLN) is a multidisciplinary research area spanning natural language processing, computer vision, multimodal information fusion, and mobile-robot navigation. In VLN, a mobile robot must autonomously navigate to a target location in a complex environment by following a natural-language instruction, where the target location is defined by the VLN task. For example, in an indoor home scenario, given the instruction "walk forward and leave the living room, go through the hallway, and stop by the bed in the bedroom," the navigation target is the bedside in the bedroom.

The core of today's mainstream VLN models is an end-to-end Vision-Language Model (VLM). A VLM is a multimodal large model that accepts both image and text input and can understand image content and process cross-modal information. Examples include the NaVILA framework proposed by NVIDIA and collaborators in 2025 and the StreamVLN framework proposed by the Shanghai AI Laboratory and collaborators in 2025. Both take a natural-language navigation instruction and a sequence of images as input and output actions such as move forward, turn left, and turn right.

1.2. Goals

Using the Qwen2.5-VL-7B model as an example, this document describes how to deploy the model for inference on the Ascend 310P. Note:

(1) Before deploying the model with this document: readers should train or fine-tune the model on VLN simulation datasets or on datasets collected with real robots to obtain model weights suited to the VLN task. For training or fine-tuning Qwen2.5-VL-7B for VLN, refer to the InternVLA-N1 algorithm proposed by the Shanghai AI Laboratory team in 2025.

(2) After deploying the model with this document: readers can convert the actions generated by the model into velocity commands that drive the locomotion mechanism of their specific robot platform.

1.3. System Overview

Taking the Qianxun robot as an example, the hardware architecture is shown in the figure below.

For an introduction to Ascend edge computing power, see: Development Kit Introduction and Networking.

The computing platform used in this document is therefore an Orange Pi 310P expansion dock plus a MiniPC. The software architecture is shown in the figure below.

The pretrained Qwen2.5-VL-7B model is deployed on the NPU of the Orange Pi 310P expansion dock, while the rest of the VLN system runs on the x86 MiniPC. The images fed to the model are captured by the RGB-D camera mounted on the mobile robot, and the pretrained Qwen2.5-VL-7B model outputs natural-language actions such as move forward 10 m, turn left 10°, or stop. Finally, the velocity commands produced by the VLN system drive the robot's wheeled base. Readers are advised to use the robot vendor's SDK or ROS 2 to subscribe to the image topic and publish velocity commands, as sketched below.
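
For reference, below is a minimal rclpy sketch of such a bridge node. The topic names /camera/color/image_raw and /cmd_vel are assumptions and depend on the robot vendor's drivers; this is not the vendor's SDK code.

python
# Minimal ROS 2 bridge sketch (topic names are assumptions; adapt to your robot's drivers).
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from geometry_msgs.msg import Twist


class VLNBridge(Node):
    def __init__(self):
        super().__init__('vln_bridge')
        # Subscribe to the RGB stream of the RGB-D camera (assumed topic name).
        self.image_sub = self.create_subscription(
            Image, '/camera/color/image_raw', self.on_image, 10)
        # Publish velocity commands to the wheeled base (assumed topic name).
        self.cmd_pub = self.create_publisher(Twist, '/cmd_vel', 10)
        self.latest_image = None

    def on_image(self, msg: Image):
        # Cache the newest frame; it would be forwarded to the Qwen2.5-VL-7B inference process.
        self.latest_image = msg

    def publish_cmd(self, linear_x: float, angular_z: float):
        # Convert a decoded action (e.g. "move forward", "turn left") into a Twist message.
        cmd = Twist()
        cmd.linear.x = linear_x
        cmd.angular.z = angular_z
        self.cmd_pub.publish(cmd)


def main():
    rclpy.init()
    node = VLNBridge()
    rclpy.spin(node)
    rclpy.shutdown()


if __name__ == '__main__':
    main()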


2. Environment Setup

2.1. Workflow Overview

To deploy the pretrained Qwen2.5-VL-7B model on the NPU of the Orange Pi 310P expansion dock, set up the environment following the workflow shown in the figure below, in preparation for running model inference at the end.

2.2. Operating System Installation

Following Chapter 3 of the user manual at the linked page, flash the Ubuntu 22.04 image provided by Orange Pi.

2.3. Driver Installation

Following Section 2.6 of the user manual at the linked page, install the driver.

2.4. Firmware Installation

Following Section 2.7 of the user manual at the linked page, install the firmware.

2.5. CANN Installation

Following Section 6.1 of the user manual at the linked page, install CANN.

2.6. Python3 Installation

If you use the Ubuntu 22.04 image provided by Orange Pi, Python3 is preinstalled.
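
For example, you can confirm the interpreter version:

plaintext
python3 --version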

2.7. PyTorch Installation

Following Chapter 8 of the user manual at the linked page, install PyTorch and the torch_npu plugin.
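
After installation, a quick sanity check is to confirm that the NPU is visible to PyTorch (a one-line sketch; it assumes torch and torch_npu were installed for the system python3):

plaintext
python3 -c "import torch, torch_npu; print(torch.npu.is_available())"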

2.8. MindIE Docker Image Installation

Docker is a software platform that packages an application together with all of its dependencies into standard units called containers, so that the application can be deployed and run quickly and reliably in any environment. Following Section 5.1 of the user manual at the linked page, install Docker:

plaintext
apt update
apt install -y docker.io

After installation, verify that it succeeded; the version number should be printed:

plaintext
docker -v

Open the link and click the ATB-Models resource option:

Click the Download Now option:

Enter the command corresponding to your processor architecture:

After the download completes, list the Docker images already on the local host:

plaintext
docker images

Record the ID of the downloaded Docker image: <IMAGE_ID>.
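
If preferred, the image ID can also be printed directly with Docker's standard --format option:

plaintext
docker images --format "{{.Repository}}:{{.Tag}} {{.ID}}"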

2.9. Using the MindIE Docker Image

2.9.1. Starting the MindIE Docker Image for the First Time

The Docker image currently published by the Ascend community still has a few issues, which are fixed by replacing some files. Follow Section 5.1.2.1 of the user manual at the linked page to apply the fixes.

Create and start the container for the first time (replace <IMAGE_ID>):

plaintext
docker run -it -d --net=host --shm-size=500g \
    --name qwen7B-env \
    --device=/dev/davinci_manager \
    --device=/dev/hisi_hdc \
    --device=/dev/devmm_svm \
    --device=/dev/davinci0 \
    --device=/dev/davinci1 \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro \
    -v /usr/local/sbin:/usr/local/sbin:ro \
    -v /home/huawei/fix_openeuler_docker:/fix_openeuler_docker \
    <IMAGE_ID> bash

Inside the container, copy the patch files to the current directory:

plaintext
cd /usr/local/Ascend/ascend-toolkit/8.2.RC1/lib64/
ls /fix_openeuler_docker/fixhccl/8.2hccl/
cp /fix_openeuler_docker/fixhccl/8.2hccl/* ./

Re-source the CANN environment variables:

plaintext
source /usr/local/Ascend/ascend-toolkit/set_env.sh

Upgrade the Ascend-cann-nnal package:

plaintext
chmod +x /fix_openeuler_docker/Ascend-cann-nnal/Ascend-cann-nnal_8.3.RC1_linux-x86_64.run
cd /fix_openeuler_docker/Ascend-cann-nnal
./Ascend-cann-nnal_8.3.RC1_linux-x86_64.run --install --quiet

After the upgrade, confirm the version information:

plaintext
cat /usr/local/Ascend/nnal/atb/latest/version.info

The output should be as follows:

plaintext
Ascend-cann-atb : 8.3.RC1
Ascend-cann-atb Version : 8.3.RC1.B106
Platform : x86_64
branch : 8.3.rc1-0702
commit id : 16004f23040e0dcdd3cf0c64ecf36622487038ba

Refresh the dnf metadata and update the installed packages:

plaintext
dnf update

2.9.2. Restarting the MindIE Docker Image

If you reboot the machine or stop the running container and later want to re-enter the Docker environment, run the following commands in order. List the containers that have been created:

plaintext
docker ps -a

Start the created container (replace <YOUR_DOCKER_NAME>):

plaintext
docker start <YOUR_DOCKER_NAME>
docker exec -it <YOUR_DOCKER_NAME> bash

If an error occurs during environment setup and you want to delete the created container (replace <YOUR_DOCKER_NAME>):

plaintext
docker rm -f <YOUR_DOCKER_NAME>

2.10. Preparing the Qwen Model Weights

In the container's root directory, create a models directory to hold the Qwen model weights:

plaintext
mkdir /models

If you have already prepared model weights suited to the VLN task, copy them into this directory.

If you have not prepared weights yet, you can download the Qwen2.5-VL-7B-Instruct weights from modelscope. Install modelscope:

plaintext
pip install modelscope

Download the Qwen2.5-VL-7B-Instruct model weights:

plaintext
cd /models
modelscope download --model Qwen/Qwen2.5-VL-7B-Instruct --local_dir ./Qwen2.5-VL-7B-Instruct

In Qwen2.5-VL-7B-Instruct/config.json, set torch_dtype to float16:

plaintext
vim /models/Qwen2.5-VL-7B-Instruct/config.json
"torch_dtype": "float16

3. Model Inference

If you have prepared Qwen2.5-VL-7B weights suited to the VLN task, the input is an RGB image plus a natural-language navigation instruction, and the expected output is a natural-language action such as move forward 25 cm, turn left 15°, or stop.

This document uses the Qwen2.5-VL-7B-Instruct weights downloaded from modelscope as an example and runs inference on the Ascend 310P. The input is an RGB image and a natural-language prompt, and the expected output is specified by the prompt.

The container ships with the following inference code, located at /usr/local/Ascend/atb-models/examples/models/qwen2_vl:

python
# Copyright Huawei Technologies Co., Ltd. 2023-2024. All rights reserved.
import json
import math
import os
import time

import torch
import torch_npu
import numpy as np
from PIL import Image
from transformers import AutoImageProcessor
from atb_llm.models.base.model_utils import safe_from_pretrained
from atb_llm.models.qwen2_vl.router_qwen2_vl import process_shared_memory
from atb_llm.runner.tokenizer_wrapper import TokenizerWrapper
from atb_llm.utils import argument_utils
from atb_llm.utils.cpu_binding import NpuHbmInfo
from atb_llm.utils.env import ENV
from atb_llm.utils.file_utils import safe_open, is_path_exists, safe_listdir, standardize_path, check_file_safety
from atb_llm.utils.log import logger, print_log
from atb_llm.utils.shm_utils import decode_shape_from_int64, release_shared_memory, get_data_from_shm
from examples.multimodal_runner import MultimodalPARunner, parser
from examples.multimodal_runner import path_validator
from examples.server.cache import CacheManager, CacheConfig
from examples.server.generate import decode_token, generate_req
from examples.server.request import MultiModalRequest

VISION_START_TOKEN_ID = 151652
IMAGE_TOKEN_ID = 151655
VISION_END_TOKEN_ID = 151653
IMAGE_FEATURE_LENS = 64
IMAGE_THW_TOKEN_OFFSET = 3
SECOND_PER_GRID_T_SHM_OFFSET = 4
SECOND_PER_GRID_T_SHAPE_OFFSET = 5
SUPPORTED_IMAGE_MODE = "RGB"
PYTORCH_TENSOR = "pt"


def request_from_token_qwen2_vl(config, input_ids, max_out_length, block_size, req_idx=0):
    if not isinstance(input_ids, torch.Tensor):
        input_ids = torch.tensor(input_ids, dtype=torch.int64)
    position_ids = torch.arange(len(input_ids), dtype=torch.int64)
    if torch.any(torch.eq(input_ids, VISION_START_TOKEN_ID)):
        bos_pos = torch.where(torch.eq(input_ids, VISION_START_TOKEN_ID))[0]
        eos_pos = torch.where(torch.eq(input_ids, VISION_END_TOKEN_ID))[0]
        vision_num = bos_pos.shape[0]
        deltas = 0
        for i in range(vision_num):
            thw_shape_value = input_ids[bos_pos[i] + IMAGE_THW_TOKEN_OFFSET]
            thw_shape = decode_shape_from_int64(thw_shape_value)

            vision_feature_len = eos_pos[i] - bos_pos[i] - 1
            t_shape = thw_shape[0]
            max_hw = max(thw_shape[1:])

            if config.model_type == "qwen2_5_vl":
                tokens_per_second = config.vision_config.tokens_per_second
                second_per_grid_t_shm_value = input_ids[bos_pos[i] + SECOND_PER_GRID_T_SHM_OFFSET]
                second_per_grid_t_shape_value = input_ids[bos_pos[i] + SECOND_PER_GRID_T_SHAPE_OFFSET]
                if second_per_grid_t_shm_value < 0:
                    second_per_grid_t_value = get_data_from_shm(
                        second_per_grid_t_shm_value,
                        second_per_grid_t_shape_value,
                        np.float32
                    )
                    max_tokens_t = int(second_per_grid_t_value[0][0] * tokens_per_second * (thw_shape[0] - 1))
                    t_shape = max_tokens_t
            if t_shape > (max_hw // 2):
                deltas += vision_feature_len - t_shape
            else:
                deltas += vision_feature_len - max_hw // 2
        position_ids[-1] = position_ids[-1] - deltas

    request = MultiModalRequest(
        max_out_length,
        block_size,
        req_idx,
        input_ids,
        adapter_id=None,
        position_ids=position_ids
    )
    return request


class PARunner(MultimodalPARunner):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.shm_name_save_path = kwargs.get("shm_name_save_path", None)
        self.tokenizer_wrapper = TokenizerWrapper(self.model_path)
        self.tokenizer = self.tokenizer_wrapper.tokenizer

    def init_processor(self):
        try:
            self.processor = safe_from_pretrained(AutoImageProcessor, self.model_path)
        except AssertionError:
            self.processor = self.model.tokenizer

    def warm_up(self):
        all_input_length = self.max_input_length
        input_ids_list = (
                [VISION_START_TOKEN_ID]
                + [IMAGE_TOKEN_ID] * IMAGE_FEATURE_LENS
                + [VISION_END_TOKEN_ID]
                + [1] * (all_input_length - IMAGE_FEATURE_LENS - 2)
        )
        image = Image.new(SUPPORTED_IMAGE_MODE, (224, 224), (255, 255, 255))
        warmup_image_processor = safe_from_pretrained(AutoImageProcessor, self.model_path)
        images_inputs = warmup_image_processor(images=image,
                                               videos=None,
                                               return_tensors=PYTORCH_TENSOR)
        image.close()
        shared_memory_result = process_shared_memory(
            images_inputs.pixel_values,
            self.shm_name_save_path,
            images_inputs.image_grid_thw
        )
        input_ids_list[1] = shared_memory_result['pixel_values_shm_name']
        input_ids_list[2] = shared_memory_result['pixel_values_shape_value']
        input_ids_list[3] = shared_memory_result['thw_value']

        input_ids = torch.tensor(input_ids_list, dtype=torch.int64).to(self.device)
        print_log(self.rank, logger.info, "---------------Begin warm_up---------------")
        try:
            self.warm_up_num_blocks = math.ceil((self.max_input_length + self.max_output_length) /
                                                self.block_size) * self.max_batch_size
        except ZeroDivisionError as e:
            raise ZeroDivisionError from e
        cache_config = CacheConfig(self.warm_up_num_blocks, self.block_size)
        self.cache_manager = CacheManager(cache_config, self.model_config)
        max_output_length = 2
        self.model.postprocessor.max_new_tokens = max_output_length
        if self.max_prefill_tokens == -1:
            self.max_prefill_tokens = self.max_batch_size * (self.max_input_length + self.max_output_length)
        single_req = request_from_token_qwen2_vl(
            self.model.config,
            input_ids,
            max_output_length,
            self.block_size,
            req_idx=0,
        )
        generate_req([single_req], self.model, self.max_batch_size, self.max_prefill_tokens, self.cache_manager)
        self.warm_up_memory = int(
            self.max_memory
            * NpuHbmInfo.get_hbm_usage(
                self.local_rank, self.world_size, self.model.soc_info.need_nz
            )
        )
        print_log(
            self.rank,
            logger.info,
            f"warmup_memory(GB): {self.warm_up_memory / (1024 ** 3): .2f}",
        )
        print_log(self.rank, logger.info, "---------------End warm_up---------------")

    def infer(self, mm_inputs, max_output_length, shm_name_save_path, **kwargs):
        self.make_cache_manager()

        self.model.postprocessor.max_new_tokens = max_output_length

        req_list = []
        if not ENV.profiling_enable:
            torch.npu.synchronize()
            e2e_start = time.time()
            for i, mm_input in enumerate(mm_inputs):
                input_ids = self.tokenizer_wrapper.tokenize(mm_input, shm_name_save_path=shm_name_save_path)
                single_req = request_from_token_qwen2_vl(
                    self.model.config,
                    input_ids,
                    max_output_length,
                    self.block_size,
                    req_idx=i,
                )
                req_list.append(single_req)
            generate_req(
                req_list,
                self.model,
                self.max_batch_size,
                self.max_prefill_tokens,
                self.cache_manager,
            )
            generate_text_list, token_num_list = decode_token(req_list, self.tokenizer, skip_special_tokens=True)
            torch.npu.synchronize()
            e2e_end = time.time()
            e2e_time = e2e_end - e2e_start
        else:
            profiling_path = ENV.profiling_filepath
            if not os.path.exists(profiling_path):
                os.makedirs(profiling_path, exist_ok=True)
            torch.npu.synchronize()
            e2e_start = time.time()
            experimental_config = torch_npu.profiler._ExperimentalConfig(
                aic_metrics=torch_npu.profiler.AiCMetrics.PipeUtilization,
                profiler_level=torch_npu.profiler.ProfilerLevel.Level1,
                l2_cache=False,
                data_simplification=False,
            )
            with torch_npu.profiler.profile(
                    activities=[
                        torch_npu.profiler.ProfilerActivity.CPU,
                        torch_npu.profiler.ProfilerActivity.NPU,
                    ],
                    on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(
                        profiling_path
                    ),
                    record_shapes=True,
                    profile_memory=True,
                    with_stack=False,
                    with_flops=False,
                    with_modules=False,
                    experimental_config=experimental_config,
            ) as _:
                for i, mm_input in enumerate(mm_inputs):
                    input_ids = self.tokenizer_wrapper.tokenize(mm_input, shm_name_save_path=shm_name_save_path)
                    single_req = request_from_token_qwen2_vl(
                        self.model.config,
                        input_ids,
                        max_output_length,
                        self.block_size,
                        req_idx=i,
                    )
                    req_list.append(single_req)
                generate_req(
                    req_list,
                    self.model,
                    self.max_batch_size,
                    self.max_prefill_tokens,
                    self.cache_manager,
                )
            torch.npu.synchronize()
            e2e_end = time.time()
            e2e_time = e2e_end - e2e_start
            generate_text_list, token_num_list = decode_token(req_list, self.tokenizer, skip_special_tokens=True)
        if self.rank == 0 and is_path_exists(shm_name_save_path):
            try:
                release_shared_memory(shm_name_save_path)
            except Exception as e:
                print_log(self.rank, logger.error, f"Release shared memory failed: {e}")
            try:
                os.remove(shm_name_save_path)
            except Exception as e:
                print_log(self.rank, logger.error, f"Remove shared memory file failed: {e}")
        return generate_text_list, token_num_list, e2e_time


def parse_arguments():
    string_validator = argument_utils.StringArgumentValidator(min_length=0, max_length=1000)
    parser_qwen2_vl = parser

    parser_qwen2_vl.add_argument(
        "--input_text",
        default="Describe the image.",
        validator=string_validator
    )
    parser_qwen2_vl.add_argument(
        "--input_image",
        default="",
        validator=path_validator
    )
    parser_qwen2_vl.add_argument(
        "--dataset_path",
        help="precision test dataset path",
        default="",
        validator=path_validator
    )
    parser_qwen2_vl.add_argument(
        "--shm_name_save_path",
        type=str,
        help='This path is used to temporarily store the shared '
             'memory addresses that occur during the inference process.',
        default='./shm_name.txt',
        validator=path_validator
    )
    parser_qwen2_vl.add_argument(
        "--results_save_path",
        help="precision test result path",
        default="./results.json",
        validator=path_validator,
    )

    return parser_qwen2_vl.parse_args()


def is_image(file_image_name):
    ext = os.path.splitext(file_image_name)[1]
    ext = ext.lower()
    if ext in [".jpg", ".png", ".jpeg", ".bmp"]:
        return True
    return False


def is_video(file_video_name):
    video_ext = os.path.splitext(file_video_name)[1]
    video_ext = video_ext.lower()
    if video_ext in [".mp4", ".wmv", ".avi"]:
        return True
    return False


def deal_dataset(dataset_path, text):
    input_images = []
    dataset_path = standardize_path(dataset_path)
    check_file_safety(dataset_path)
    images_list = safe_listdir(dataset_path)
    for img_name in images_list:
        image_path = os.path.join(dataset_path, img_name)
        input_images.append(image_path)
    input_texts = [text] * len(
        input_images
    )
    return input_images, input_texts


def replace_crlf(mm_input):
    result = []
    for single_input in mm_input:
        res = {}
        for k, v in single_input.items():
            input_text_filter = v
            input_text_filter = input_text_filter.replace('\n', ' ').replace('\r', ' ').replace('\f', ' ')
            input_text_filter = input_text_filter.replace('\t', ' ').replace('\v', ' ').replace('\b', ' ')
            input_text_filter = input_text_filter.replace('\u000A', ' ').replace('\u000D', ' ').replace('\u000C', ' ')
            input_text_filter = input_text_filter.replace('\u000B', ' ').replace('\u0008', ' ').replace('\u007F', ' ')
            input_text_filter = input_text_filter.replace('\u0009', ' ').replace('    ', ' ')
            res[k] = input_text_filter.replace("\n", "_").replace("\r", "_")
        result.append(res)
    return result

if __name__ == '__main__':
    args = parse_arguments()
    rank = ENV.rank
    local_rank = ENV.local_rank
    world_size = ENV.world_size
    input_dict = {
        'rank': rank,
        'world_size': world_size,
        'local_rank': local_rank,
        **vars(args)
    }

    pa_runner = PARunner(**input_dict)
    print_log(rank, logger.info, f"pa_runner: {pa_runner}")
    pa_runner.warm_up()
    npu_results_dict = {}
    if args.dataset_path:
        dataset_images, dataset_texts = deal_dataset(args.dataset_path, args.input_text)
        mm_inputs = []
        for dataset_image, dataset_text in zip(dataset_images, dataset_texts):
            if is_video(dataset_image):
                key = "video"
            elif is_image(dataset_image):
                key = "image"
            else:
                raise TypeError("The multimodal input field currently only supports 'image' and 'video'")
            single_inputs = [{key: dataset_image}, {"text": dataset_text}]
            mm_inputs.append(single_inputs)
    else:
        if args.input_image is None:
            raise ValueError("The input image or video path is empty.")
        elif is_video(args.input_image):
            key = "video"
        elif is_image(args.input_image):
            key = "image"
        else:
            raise TypeError("The multimodal input field currently only supports 'image' and 'video'.")
        mm_inputs = [
                        [
                            {key: args.input_image},
                            {"text": args.input_text},
                        ]
                    ] * args.max_batch_size

    generate_texts, token_nums, latency = pa_runner.infer(
        mm_inputs,
        args.max_output_length,
        args.shm_name_save_path,
    )
    token_nums_prev = 0
    for i, generate_text in enumerate(generate_texts):
        inputs = mm_inputs
        if args.dataset_path:
            rst_key = dataset_images[i].split("/")[-1]
            npu_results_dict[rst_key] = generate_text
        question = replace_crlf(inputs[i])
        print_log(rank, logger.info, f"Question[{i}]: {question}")
        print_log(rank, logger.info, f"Answer[{i}]: {generate_text}")
        print_log(rank, logger.info, f"Generate[{i}] token num: {token_nums[i][1] - token_nums_prev}")
        token_nums_prev = token_nums[i][1]
    print_log(rank, logger.info, f"Latency(s): {latency}")
    print_log(rank, logger.info, f"Throughput(tokens/s): {token_nums[-1][1] / latency}")

    if args.dataset_path:
        sorted_dict = dict(sorted(npu_results_dict.items()))
        with safe_open(
                args.results_save_path,
                "w",
                override_flags=os.O_WRONLY | os.O_CREAT | os.O_EXCL,
        ) as f:
            json.dump(sorted_dict, f)

The container also provides a launch script for the inference code above, located at /usr/local/Ascend/atb-models/examples/models/qwen2_vl:

shell
#!/bin/bash
# Qwen2-VL model launch script
# For detailed parameter configuration and usage instructions, refer to README.md in the same directory

# Environment configuration
export BIND_CPU=1
export RESERVED_MEMORY_GB=0
export MASTER_PORT=20031
export ATB_LLM_BENCHMARK_ENABLE=1
export ATB_PROFILING_ENABLE=0
export ASCEND_RT_VISIBLE_DEVICES=0,1

# Calculate TP_WORLD_SIZE (number of NPUs used)
export TP_WORLD_SIZE=$(($(echo "${ASCEND_RT_VISIBLE_DEVICES}" | grep -o , | wc -l) + 1))

# Default parameters
MAX_BATCH_SIZE=1
MAX_INPUT_LENGTH=4096
MAX_OUTPUT_LENGTH=256
INPUT_TEXT="Explain the contents of the picture with more than 500 words and do not Answer the question using a single word or phrase."
SHM_NAME_SAVE_PATH="./shm_name.txt"
MODEL_PATH=""
INPUT_IMAGE=""
DATASET_PATH=""
MODE="single"  # Added mode selection: single or path

# Parameter parsing
while [[ $# -gt 0 ]]; do
  case "$1" in
    --model_path)
      if [[ -n "$2" && "$2" != --* ]]; then
        MODEL_PATH="$2"
        shift 2
      else
        echo "Error: --model_path requires a valid non-empty value"
        exit 1
      fi
      ;;
    --input_image)
      if [[ -n "$2" && "$2" != --* ]]; then
        INPUT_IMAGE="$2"
        MODE="single"
        shift 2
      else
        echo "Error: --input_image requires a valid image path"
        exit 1
      fi
      ;;
    --dataset_path)
      if [[ -n "$2" && "$2" != --* ]]; then
        DATASET_PATH="$2"
        MODE="path"
        shift 2
      else
        echo "Error: --dataset_path requires a valid dataset path"
        exit 1
      fi
      ;;
    --max_batch_size)
      if [[ -n "$2" && "$2" =~ ^[0-9]+$ ]]; then
        MAX_BATCH_SIZE="$2"
        shift 2
      else
        echo "Error: --max_batch_size must be a positive integer"
        exit 1 
      fi
      ;;
    --max_input_length)
      if [[ -n "$2" && "$2" =~ ^[0-9]+$ ]]; then
        MAX_INPUT_LENGTH="$2"
        shift 2
      else
        echo "Error: --max_input_length must be a positive integer"
        exit 1
      fi
      ;;
    --max_output_length)
      if [[ -n "$2" && "$2" =~ ^[0-9]+$ ]]; then
        MAX_OUTPUT_LENGTH="$2"
        shift 2
      else
        echo "Error: --max_output_length must be a positive integer"
        exit 1
      fi
      ;;
    --input_text)
      if [[ -n "$2" && "$2" != --* ]]; then
        INPUT_TEXT="$2"
        shift 2
      else
        echo "Error: --input_text requires valid text"
        exit 1
      fi
      ;;
    --shm_name_save_path)
      if [[ -n "$2" && "$2" != --* ]]; then
        SHM_NAME_SAVE_PATH="$2"
        shift 2
      else
        echo "Error: --shm_name_save_path requires a valid file path"
        exit 1
      fi
      ;;
    *)
      echo "Unknown option: $1"
      exit 1
      ;;
  esac
done

# Parameter validation
if [[ -z "$MODEL_PATH" ]]; then
  echo "Error: --model_path parameter is required"
  exit 1
fi

if [[ "$MODE" == "single" && -z "$INPUT_IMAGE" ]]; then
  echo "Error: Single image mode requires --input_image parameter"
  exit 1
elif [[ "$MODE" == "path" && -z "$DATASET_PATH" ]]; then
  echo "Error: Path mode requires --dataset_path parameter"
  exit 1
fi

# Build base command
base_cmd="torchrun --nproc_per_node $TP_WORLD_SIZE --master_port $MASTER_PORT \
    -m examples.models.qwen2_vl.run_pa \
    --model_path \"$MODEL_PATH\" \
    --shm_name_save_path \"$SHM_NAME_SAVE_PATH\" \
    --max_input_length $MAX_INPUT_LENGTH \
    --max_output_length $MAX_OUTPUT_LENGTH \
    --max_batch_size $MAX_BATCH_SIZE \
    --input_text \"${INPUT_TEXT}\""

# Add mode-specific parameters
if [[ "$MODE" == "single" ]]; then
  base_cmd+=" --input_image \"$INPUT_IMAGE\""
else
  base_cmd+=" --dataset_path \"$DATASET_PATH\""
fi


# Execute command
eval "$base_cmd"

Set the logical NPU cores used for inference to 0 and 1:

plaintext
cd /usr/local/Ascend/atb-models
vim examples/models/qwen2_vl/run_pa.sh
export ASCEND_RT_VISIBLE_DEVICES=0,1

Use chown to change the owner and group of the /models/Qwen2.5-VL-7B-Instruct directory and all files under it to the root user and root group:

plaintext
chown root:root -R /models/Qwen2.5-VL-7B-Instruct

Copy the test image <TEST_IMAGE_NAME> into <TEST_IMAGE_DIR>, for example with docker cp as shown below.
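
If the test image currently sits on the host, one way (standard Docker CLI) is to copy it into the container created earlier (container name qwen7B-env; replace the placeholders):

plaintext
docker cp <TEST_IMAGE_NAME> qwen7B-env:<TEST_IMAGE_DIR>/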

Launch Qwen model inference (replace <TEST_IMAGE_DIR/TEST_IMAGE_NAME>; the Chinese prompt below asks the model to output the probability, a value between 0 and 1, that a bed is present in the image):

plaintext
cd /usr/local/Ascend/atb-models
export MINDIE_LOG_TO_STDOUT=1
bash examples/models/qwen2_vl/run_pa.sh \
	--model_path /models/Qwen2.5-VL-7B-Instruct \
	--input_image <TEST_IMAGE_DIR/TEST_IMAGE_NAME> \
	--input_text "你擅长判断图像中是否存在指定物体。输出图像中存在床的概率,仅返回大于等于0小于等于1的数值,0表示不存在,1表示存在。"

The test image used is shown below; its resolution is 320×240:

The following output can be observed:

The experiment shows that inference of the Qwen2.5-VL-7B model on the Ascend 310P NPU has a latency of roughly 300 ms.

Readers are encouraged to further modify the inference code above to communicate with a ROS 2 node over sockets for inter-process data exchange, as sketched below.
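
As a starting point, below is a minimal sketch of such a socket interface. It assumes the inference process listens on TCP port 9999 (an arbitrary choice) and that prompts and actions are exchanged as UTF-8 strings; the ROS 2 node on the MiniPC acts as the client. The handle_request callback is a hypothetical hook that would wrap pa_runner.infer.

python
# Minimal TCP socket sketch for exchanging prompts/actions with the inference process.
# Port 9999 and the plain-text protocol are assumptions, not part of the original code.
import socket

HOST, PORT = "0.0.0.0", 9999


def serve(handle_request):
    """Run inside the container: receive a prompt, return the model's action text."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((HOST, PORT))
        srv.listen(1)
        while True:
            conn, _ = srv.accept()
            with conn:
                prompt = conn.recv(65536).decode("utf-8")
                # handle_request would call the model inference and return the generated text.
                action = handle_request(prompt)
                conn.sendall(action.encode("utf-8"))


def query(prompt, server_ip="127.0.0.1"):
    """Run on the ROS 2 side: send a prompt and read back the action string."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
        cli.connect((server_ip, PORT))
        cli.sendall(prompt.encode("utf-8"))
        return cli.recv(65536).decode("utf-8")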

4. Converting Natural-Language Actions into Velocity Commands

Refer to the implementation in the StreamVLN framework open-sourced by the Shanghai AI Laboratory and collaborators. Assume the model's inference output contains only the following four natural-language actions:

Action index    Natural-language action
0               Stop
1               Move forward 25 cm
2               Turn left 15°
3               Turn right 15°

StreamVLN provides code for computing the robot's navigation goal homo_goal from the natural-language action:

python
if each_action == 0:
    pass
elif each_action == 1:
    yaw = math.atan2(homo_goal[1, 0], homo_goal[0, 0])
    homo_goal[0, 3] += 0.25 * np.cos(yaw)
    homo_goal[1, 3] += 0.25 * np.sin(yaw)
elif each_action == 2:
    angle = math.radians(15)
    rotation_matrix = np.array([
        [math.cos(angle), -math.sin(angle), 0],
        [math.sin(angle),  math.cos(angle), 0],
        [0,                0,               1]
    ])
    homo_goal[:3, :3] = np.dot(rotation_matrix, homo_goal[:3, :3])
elif each_action == 3:
    angle = -math.radians(15.0)
    rotation_matrix = np.array([
        [math.cos(angle), -math.sin(angle), 0],
        [math.sin(angle),  math.cos(angle), 0],
        [0,                0,               1]
    ])
    homo_goal[:3, :3] = np.dot(rotation_matrix, homo_goal[:3, :3])

StreamVLN also provides an implementation of a PID controller; passing the navigation goal homo_goal together with the robot's current pose into the solve function yields the linear and angular velocities needed to drive the robot:

python
class PID_controller:
    def __init__(self, Kp_trans=1.0, Kd_trans=0.1, Kp_yaw=1.0, Kd_yaw=1.0, max_v=1.0, max_w=1.2):
        self.Kp_trans = Kp_trans
        self.Kd_trans = Kd_trans
        self.Kp_yaw = Kp_yaw
        self.Kd_yaw = Kd_yaw
        self.max_v = max_v
        self.max_w = max_w
    
    def solve(self, odom, target, vel=np.zeros(2)):
        translation_error, yaw_error = self.calculate_errors(odom, target)
        v, w = self.pd_step(translation_error, yaw_error, vel[0], vel[1])
        return v, w, translation_error, yaw_error
    
    def pd_step(self, translation_error, yaw_error, linear_vel, angular_vel):
        translation_error = max(-1.0, min(1.0, translation_error))
        yaw_error = max(-1.0, min(1.0, yaw_error))

        linear_velocity = self.Kp_trans * translation_error - self.Kd_trans * linear_vel
        angular_velocity = self.Kp_yaw * yaw_error - self.Kd_yaw * angular_vel

        linear_velocity = max(-self.max_v, min(self.max_v, linear_velocity))
        angular_velocity = max(-self.max_w, min(self.max_w, angular_velocity))
                
        return linear_velocity, angular_velocity
    
    def calculate_errors(self, odom, target):
        
        dx =  target[0, 3] - odom[0, 3]
        dy =  target[1, 3] - odom[1, 3]

        odom_yaw = math.atan2(odom[1, 0], odom[0, 0])
        target_yaw = math.atan2(target[1, 0], target[0, 0])
        
        translation_error = dx * np.cos(odom_yaw) + dy * np.sin(odom_yaw)    

        yaw_error = target_yaw - odom_yaw
        yaw_error = (yaw_error + math.pi) % (2 * math.pi) - math.pi
        
        return translation_error, yaw_error