Computation Library Layer

ONNX Runtime loads and runs model inference from either the ONNX graph format or the ORT format (intended for environments with limited memory and disk). Depending on the scenario, there are different ways to specify and access the data a model consumes and produces. InferenceSession is the main class of ONNX Runtime: it loads and runs ONNX models and specifies environment and application configuration options.

An ONNX Runtime inference session consumes and produces data through the OrtValue class. On the CPU (the default), OrtValues map to native Python data structures such as numpy arrays, dictionaries, and lists of numpy arrays. By default, ONNX Runtime places inputs and outputs on the CPU. If the inputs or outputs are consumed or produced on another device, keeping the data on the CPU is usually not optimal, because it incurs data copies between the CPU and that device.

# X is numpy array on cpu
ortvalue = onnxruntime.OrtValue.ortvalue_from_numpy(X)
ortvalue.device_name()  # 'cpu'
ortvalue.shape()        # shape of the numpy array X
ortvalue.data_type()    # 'tensor(float)'
ortvalue.is_tensor()    # 'True'
np.array_equal(ortvalue.numpy(), X)  # 'True'

# ortvalue can be provided as part of the input feed to a model
session = onnxruntime.InferenceSession(
        'model.onnx',
        providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
results = session.run(["Y"], {"X": ortvalue})
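
As mentioned above, ONNX Runtime can also load models saved in the ORT format. Below is a minimal sketch of doing so; the 'model.ort' file name is only an illustrative placeholder, and such a file is typically produced offline, for example with the convert_onnx_models_to_ort tool shipped with onnxruntime.

# An ORT format file is loaded through the same InferenceSession API
session = onnxruntime.InferenceSession(
        'model.ort',
        providers=['CPUExecutionProvider'])
results = session.run(["Y"], {"X": X})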

ONNX Runtime supports custom data structures that are compatible with all ONNX data formats and allow users to place the data backing them on a device, for example a CUDA-capable device. This capability is called IOBinding.

To use IOBinding, replace InferenceSession.run() with InferenceSession.run_with_iobinding(). The graph can then be executed on a device other than the CPU, such as CUDA, and the user can copy the data onto the GPU through IOBinding.

# X is numpy array on cpu
session = onnxruntime.InferenceSession(
        'model.onnx',
        providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
io_binding = session.io_binding()
# OnnxRuntime will copy the data over to the CUDA device if 'input' is consumed by nodes on the CUDA device
io_binding.bind_cpu_input('input', X)
io_binding.bind_output('output')
session.run_with_iobinding(io_binding)
Y = io_binding.copy_outputs_to_cpu()[0]

Here the input data is placed on the device so it can be consumed there directly, while the output data stays on the CPU.

# X is numpy array on cpu
X_ortvalue = onnxruntime.OrtValue.ortvalue_from_numpy(X, 'cuda', 0)
session = onnxruntime.InferenceSession(
        'model.onnx',
        providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
io_binding = session.io_binding()
io_binding.bind_input(name='input', device_type=X_ortvalue.device_name(), device_id=0, element_type=np.float32, shape=X_ortvalue.shape(), buffer_ptr=X_ortvalue.data_ptr())
io_binding.bind_output('output')
session.run_with_iobinding(io_binding)
Y = io_binding.copy_outputs_to_cpu()[0]

Both the input data and the output data reside on the same device; the user consumes the input directly and keeps the output on that device as well.

# X is numpy array on cpu
X_ortvalue = onnxruntime.OrtValue.ortvalue_from_numpy(X, 'cuda', 0)
Y_ortvalue = onnxruntime.OrtValue.ortvalue_from_shape_and_type([3, 2], np.float32, 'cuda', 0)  # Change the shape to the actual shape of the output being bound
session = onnxruntime.InferenceSession(
        'model.onnx',
        providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
io_binding = session.io_binding()
io_binding.bind_input(
        name='input',
        device_type=X_ortvalue.device_name(),
        device_id=0,
        element_type=np.float32,
        shape=X_ortvalue.shape(),
        buffer_ptr=X_ortvalue.data_ptr()
)
io_binding.bind_output(
        name='output',
        device_type=Y_ortvalue.device_name(),
        device_id=0,
        element_type=np.float32,
        shape=Y_ortvalue.shape(),
        buffer_ptr=Y_ortvalue.data_ptr()
)
session.run_with_iobinding(io_binding)

Users can ask ONNX Runtime to allocate the output on the device, which is particularly useful for outputs with dynamic shapes. The OrtValue corresponding to the allocated output can be accessed through the get_outputs() API, so the memory ONNX Runtime allocated for the output can be used directly as an OrtValue.

# X is numpy array on cpu
X_ortvalue = onnxruntime.OrtValue.ortvalue_from_numpy(X, 'cuda', 0)
session = onnxruntime.InferenceSession(
        'model.onnx',
        providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
io_binding = session.io_binding()
io_binding.bind_input(
        name='input',
        device_type=X_ortvalue.device_name(),
        device_id=0,
        element_type=np.float32,
        shape=X_ortvalue.shape(),
        buffer_ptr=X_ortvalue.data_ptr()
)
# Request ONNX Runtime to bind and allocate memory on CUDA for 'output'
io_binding.bind_output('output', 'cuda')
session.run_with_iobinding(io_binding)
# The following call returns an OrtValue which has data allocated by ONNX Runtime on CUDA
ort_output = io_binding.get_outputs()[0]
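
The returned OrtValue resides on the CUDA device. As a brief usage note (a sketch building on the example above), it can be inspected in place, and a CPU copy can still be obtained in the same way as in the earlier examples:

ort_output.device_name()  # 'cuda'
ort_output.shape()        # output shape determined at run time
# Copy the device-allocated output back to a numpy array on the CPU if needed
Y = io_binding.copy_outputs_to_cpu()[0]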

Inputs and outputs can also be bound directly to PyTorch tensors.

# X is a PyTorch tensor on device
session = onnxruntime.InferenceSession('model.onnx', providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
binding = session.io_binding()

X_tensor = X.contiguous()

binding.bind_input(
    name='X',
    device_type='cuda',
    device_id=0,
    element_type=np.float32,
    shape=tuple(X_tensor.shape),
    buffer_ptr=X_tensor.data_ptr(),
)

# Allocate the PyTorch tensor for the model output
Y_shape = ... # You need to specify the output PyTorch tensor shape
Y_tensor = torch.empty(Y_shape, dtype=torch.float32, device='cuda:0').contiguous()
binding.bind_output(
    name='Y',
    device_type='cuda',
    device_id=0,
    element_type=np.float32,
    shape=tuple(Y_tensor.shape),
    buffer_ptr=Y_tensor.data_ptr(),
)

session.run_with_iobinding(binding)
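
After run_with_iobinding returns, the result is available directly in Y_tensor on the GPU, so it can feed further PyTorch computation without an extra copy. As a small usage sketch (the Y_numpy name is only illustrative):

# Y_tensor already holds the model output on 'cuda:0'; move it explicitly
# only when a CPU-side numpy array is actually needed
Y_numpy = Y_tensor.cpu().numpy()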

The following is an implementation of the ONNX backend framework; it defines the basic structures and functionality for processing and executing ONNX models.

  • The DeviceType class defines the supported device types, using NewType to create CPU and CUDA constants for the device type.
  • The Device class represents a device and its ID; its constructor parses a device string (such as "CUDA:1") and sets the corresponding type and ID.
  • The BackendRep class is the handle a backend returns after preparing a model for execution, and exposes a run method that executes the model.
  • The Backend class is the execution unit for ONNX models and provides several class methods covering compatibility checking, preparation, and execution: is_compatible checks whether a model is compatible with the backend; prepare prepares a model for repeated execution and returns a BackendRep instance; run_model prepares and runs a model, returning the results; run_node runs a single operator (node), which is useful for quick testing and validation; supports_device checks whether the backend supports a particular device.

from __future__ import annotations

from collections import namedtuple
from typing import Any, NewType, Sequence

import numpy

import onnx.checker
import onnx.onnx_cpp2py_export.checker as c_checker
from onnx import IR_VERSION, ModelProto, NodeProto


class DeviceType:
    """Describes device type."""

    _Type = NewType("_Type", int)
    CPU: _Type = _Type(0)
    CUDA: _Type = _Type(1)


class Device:
    """Describes device type and device id
    syntax: device_type:device_id(optional)
    example: 'CPU', 'CUDA', 'CUDA:1'
    """

    def __init__(self, device: str) -> None:
        options = device.split(":")
        self.type = getattr(DeviceType, options[0])
        self.device_id = 0
        if len(options) > 1:
            self.device_id = int(options[1])


def namedtupledict(
    typename: str, field_names: Sequence[str], *args: Any, **kwargs: Any
) -> type[tuple[Any, ...]]:
    field_names_map = {n: i for i, n in enumerate(field_names)}
    # Some output names are invalid python identifier, e.g. "0"
    kwargs.setdefault("rename", True)
    data = namedtuple(typename, field_names, *args, **kwargs)  # type: ignore  # noqa: PYI024

    def getitem(self: Any, key: Any) -> Any:
        if isinstance(key, str):
            key = field_names_map[key]
        return super(type(self), self).__getitem__(key)  # type: ignore

    data.__getitem__ = getitem  # type: ignore[assignment]
    return data


class BackendRep:
    """BackendRep is the handle that a Backend returns after preparing to execute
    a model repeatedly. Users will then pass inputs to the run function of
    BackendRep to retrieve the corresponding results.
    """

    def run(self, inputs: Any, **kwargs: Any) -> tuple[Any, ...]:  # noqa: ARG002
        """Abstract function."""
        return (None,)


class Backend:
    """Backend is the entity that will take an ONNX model with inputs,
    perform a computation, and then return the output.

    For one-off execution, users can use run_node and run_model to obtain results quickly.

    For repeated execution, users should use prepare, in which the Backend
    does all of the preparation work for executing the model repeatedly
    (e.g., loading initializers), and returns a BackendRep handle.
    """

    @classmethod
    def is_compatible(
        cls, model: ModelProto, device: str = "CPU", **kwargs: Any  # noqa: ARG003
    ) -> bool:
        # Return whether the model is compatible with the backend.
        return True

    @classmethod
    def prepare(
        cls, model: ModelProto, device: str = "CPU", **kwargs: Any  # noqa: ARG003
    ) -> BackendRep | None:
        # TODO Remove Optional from return type
        onnx.checker.check_model(model)
        return None

    @classmethod
    def run_model(
        cls, model: ModelProto, inputs: Any, device: str = "CPU", **kwargs: Any
    ) -> tuple[Any, ...]:
        backend = cls.prepare(model, device, **kwargs)
        assert backend is not None
        return backend.run(inputs)

    @classmethod
    def run_node(
        cls,
        node: NodeProto,
        inputs: Any,  # noqa: ARG003
        device: str = "CPU",  # noqa: ARG003
        outputs_info: (  # noqa: ARG003
            Sequence[tuple[numpy.dtype, tuple[int, ...]]] | None
        ) = None,
        **kwargs: dict[str, Any],
    ) -> tuple[Any, ...] | None:
        """Simple run one operator and return the results.

        Args:
            node: The node proto.
            inputs: Inputs to the node.
            device: The device to run on.
            outputs_info: a list of tuples, which contains the element type and
                shape of each output. First element of the tuple is the dtype, and
                the second element is the shape. More use case can be found in
                https://github.com/onnx/onnx/blob/main/onnx/backend/test/runner/__init__.py
            kwargs: Other keyword arguments.
        """
        # TODO Remove Optional from return type
        if "opset_version" in kwargs:
            special_context = c_checker.CheckerContext()
            special_context.ir_version = IR_VERSION
            special_context.opset_imports = {"": kwargs["opset_version"]}  # type: ignore
            onnx.checker.check_node(node, special_context)
        else:
            onnx.checker.check_node(node)

        return None

    @classmethod
    def supports_device(cls, device: str) -> bool:  # noqa: ARG003
        """Checks whether the backend is compiled with particular device support.
        In particular it's used in the testing suite.
        """
        return True
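
To make the calling convention of these base classes concrete, below is a small, hypothetical sketch of a backend built on top of them. The EchoBackend and EchoRep names, and the trivial "execution" that simply echoes the inputs, are invented for illustration; a real backend (such as the one provided by onnxruntime) would load the model into its runtime in prepare() and actually execute the graph in run().

class EchoRep(BackendRep):
    """Hypothetical handle returned by EchoBackend.prepare()."""

    def __init__(self, model: ModelProto) -> None:
        self._model = model

    def run(self, inputs: Any, **kwargs: Any) -> tuple[Any, ...]:
        # A real backend would execute the graph here; this sketch only
        # returns the inputs unchanged to show the BackendRep calling convention.
        return tuple(numpy.asarray(x) for x in inputs)


class EchoBackend(Backend):
    """Hypothetical backend wiring prepare()/run_model() to EchoRep."""

    @classmethod
    def prepare(
        cls, model: ModelProto, device: str = "CPU", **kwargs: Any
    ) -> BackendRep | None:
        # Reuse the base class to validate the model with onnx.checker
        super().prepare(model, device, **kwargs)
        if not cls.supports_device(device):
            raise RuntimeError(f"Backend does not support device {device!r}")
        return EchoRep(model)


# Typical call pattern (one-off execution goes through run_model, repeated
# execution through prepare + run):
#   rep = EchoBackend.prepare(onnx.load("model.onnx"), device="CPU")
#   outputs = rep.run([numpy.random.rand(3, 2).astype(numpy.float32)])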