框架模型层

使用了 PyTorch XLA 来在 XLA（如 TPU 等加速设备）上运行张量操作。

引入 torch、torch_xla 和 torch_xla.core.xla_model，用于在 XLA 设备上执行 PyTorch 操作。
使用 torch.randn(2, 2, device=xm.xla_device()) 创建一个 2x2 的随机张量 t，并将其分配到 XLA 设备。
创建两个 2x2 的随机张量 t0 和 t1，并进行逐元素加法和矩阵乘法，打印结果。
创建一个大小为 10 的随机输入向量 l_in，并将其分配到 XLA 设备。
定义一个输入特征为 10、输出特征为 20 的线性层 linear，并迁移到 XLA 设备。
将输入 l_in 传入线性层，得到输出 l_out，并打印输出结果。

import torch
import torch_xla
import torch_xla.core.xla_model as xm

t = torch.randn(2, 2, device=xm.xla_device())
print(t.device)
print(t)

t0 = torch.randn(2, 2, device=xm.xla_device())
t1 = torch.randn(2, 2, device=xm.xla_device())
print(t0 + t1)
print(t0.mm(t1))

#神经网络
l_in = torch.randn(10, device=xm.xla_device())
linear = torch.nn.Linear(10, 20).to(xm.xla_device())
l_out = linear(l_in)
print(l_out)

结果

xla:0
tensor([[ 0.1028, -1.4783],
        [-0.4271,  1.3415]], device='xla:0')
tensor([[ 1.7679,  0.2210],
        [ 0.5831, -1.5733]], device='xla:0')
tensor([[ 0.6698, -0.5113],
        [ 0.9527,  0.2601]], device='xla:0')
tensor([-0.8333,  0.4356,  0.4277, -0.3944,  0.8075,  0.3516,  0.0455,  0.0778,
        -0.0822,  0.4418, -0.7217,  0.3582, -0.7285,  0.1117, -0.0466, -0.7045,
        -0.1443,  0.3461, -0.3151, -0.6094], device='xla:0',
       grad_fn=<AddBackward0>)

实现了一个使用 PyTorch XLA 再 TPU 训练和评估 MNIST 手写数字分类模型的完整流程，包括数据加载、模型构建、训练、保存和推理。

引入所需的 PyTorch 和 Torch XLA 库，以及 MNIST 数据集和数据处理工具。设置设备为 TPU，使用 xm.xla_device()。
使用 transforms.Compose 创建数据转换，将 MNIST 数据集中的图像转换为张量。下载 MNIST 训练集并创建数据加载器 train_loader，设置批量大小为 64，并随机打乱数据。
定义一个简单的神经网络模型，包括：扁平化层，将 28x28 的图像展平成一维。128 单元的全连接层，使用 ReLU 激活函数。10 单元的全连接层，使用 LogSoftmax 激活函数。将模型迁移到 TPU 设备。
使用负对数似然损失函数 NLLLoss。使用随机梯度下降优化器 SGD，学习率为 0.01，动量为 0.9。
对训练数据进行迭代：清零优化器的梯度。将数据和目标迁移到 TPU 设备。通过模型进行前向传播，计算损失。进行反向传播以计算梯度。更新模型参数。调用 xm.mark_step() 同步 TPU。
使用 torch.save() 保存训练好的模型到 mnist_model_full.pth 文件中。
加载保存的模型，并将其迁移到 TPU 设备，切换到评估模式。
在不计算梯度的上下文中：遍历测试数据，迁移到 TPU 设备。进行前向传播，计算输出。使用 torch.max() 获取预测结果的最大值索引。打印预测结果，且仅处理一个批次作为示例。

import torch
import torch.nn as nn
import torch.optim as optim
import torch_xla.core.xla_model as xm
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision import transforms

# 设备设定（TPU）
device = xm.xla_device()

# 数据集与数据加载器设定
transform = transforms.Compose([transforms.ToTensor()])
train_dataset = MNIST(root='data', train=True, transform=transform, download=True)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

# 模型设定
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28*28, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
    nn.LogSoftmax(dim=1)
).to(device)

# 损失函数和优化器设定
loss_fn = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# 训练循环
for data, target in train_loader:
    optimizer.zero_grad()
    data = data.to(device)
    target = target.to(device)
    output = model(data)
    loss = loss_fn(output, target)
    loss.backward()
    
    optimizer.step()
    xm.mark_step()  # TPU同步

# 保存整个模型
torch.save(model, 'mnist_model_full.pth')

# 模型推理
import torch

# 加载整个模型
model = torch.load('mnist_model_full.pth').to(device)
model.eval()  # 切换到评估模式

# 加载测试数据
test_dataset = MNIST(root='data', train=False, transform=transform, download=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# 使用模型进行推理
with torch.no_grad():  # 禁用梯度计算以加快推理
    for data, _ in test_loader:
        data = data.to(device)
        output = model(data)
        xm.mark_step()  # TPU同步
        
        # 获取预测结果
        _, predicted = torch.max(output, 1)
        print(predicted)
        break  # 仅处理一个批次的示例

结果

tensor([7, 2, 1, 0, 4, 1, 4, 9, 6, 9, 0, 6, 9, 0, 1, 5, 9, 7, 3, 4, 9, 6, 6, 5,
        4, 0, 7, 4, 0, 1, 3, 1, 3, 6, 7, 2, 7, 1, 2, 1, 1, 7, 4, 2, 3, 5, 1, 2,
        4, 4, 6, 3, 5, 5, 6, 0, 4, 1, 9, 5, 7, 8, 4, 3], device='xla:0')

将一个 PyTorch 模型导出并转换为一种适合跨平台应用的格式（ StableHLO ），以便进行优化、部署和进一步分析。

模型加载：加载了预训练的 ResNet-18 模型，使用 torchvision 提供的默认权重。
样本输入生成：创建了一个形状为 (4, 3, 224, 224) 的随机张量，模拟输入的图像数据。
模型导出：使用 export 函数将 ResNet-18 模型导出为中间表示，以便后续处理。
转换为 StableHLO：将导出的模型转换为 StableHLO 格式，适用于跨平台优化和部署。
输出 StableHLO 文本：打印模型前向计算图的 StableHLO 文本表示的前 400 个字符，以供检查和分析。

import torch
import torchvision
from torch.export import export

resnet18 = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)
sample_input = (torch.randn(4, 3, 224, 224), )
exported = export(resnet18, sample_input)

from torch_xla.stablehlo import exported_program_to_stablehlo

stablehlo_program = exported_program_to_stablehlo(exported)
print(stablehlo_program.get_stablehlo_text('forward')[0:400],"\n...")

结果

module @IrToHlo.484 attributes {mhlo.cross_program_prefetches = [], mhlo.is_dynamic = false, mhlo.use_auto_spmd_partitioning = false} {
  func.func @main(%arg0: tensor<1000xf32>, %arg1: tensor<1000x512xf32>, %arg2: tensor<512xf32>, %arg3: tensor<512xf32>, %arg4: tensor<512xf32>, %arg5: tensor<512xf32>, %arg6: tensor<512x256x1x1xf32>, %arg7: tensor<256xf32>, %arg8: tensor<256xf32>, %arg9: tensor<25 
...

定义一个简单的加法模型，并创建输入数据。
将模型导出为中间表示，并转换为 StableHLO 格式，便于跨平台应用和优化。
最后，输出转换后的模型信息，便于分析和调试。

import torch
import torch.nn as nn
from torch.export import export
from torch_xla.stablehlo import exported_program_to_stablehlo

# 定义一个简单的加法模型
class AddModel(nn.Module):
    def __init__(self):
        super(AddModel, self).__init__()
    
    def forward(self, x, y):
        return x + y

# 创建模型实例
add_model = AddModel()

# 创建示例输入
x_input = torch.randn(4, 3, 224, 224)  # 第一个输入
y_input = torch.randn(4, 3, 224, 224)  # 第二个输入

# 使用 export 函数导出模型
exported = export(add_model, (x_input, y_input))

# 将导出的模型转换为 StableHLO 格式
stablehlo_program = exported_program_to_stablehlo(exported)

# 打印 StableHLO 程序文本的一部分
print(stablehlo_program.get_stablehlo_text('forward')[0:400], "\n...")

结果

module @IrToHlo.8 attributes {mhlo.cross_program_prefetches = [], mhlo.is_dynamic = false, mhlo.use_auto_spmd_partitioning = false} {
  func.func @main(%arg0: tensor<4x3x224x224xf32>, %arg1: tensor<4x3x224x224xf32>) -> tensor<4x3x224x224xf32> {
    %0 = stablehlo.add %arg1, %arg0 : tensor<4x3x224x224xf32>
    return %0 : tensor<4x3x224x224xf32>
  }
}

实现了使用 TensorFlow 定义一个简单的神经网络模型，生成随机输入，并使用 XLA（加速线性代数）优化进行前向传播。

使用 tf.config.list_physical_devices('GPU') 检查可用的 GPU 数量。输出可用 GPU 的数量。
使用 tf.keras.Sequential 创建一个顺序模型。第一层是一个全连接层（Dense），有 10 个单元，输入维度为 10，激活函数为 ReLU。第二层是另一个全连接层，包含 5 个单元，激活函数为 softmax。
定义批量大小（batch_size）为 16，输入向量维度（input_vector_dim）为 10。使用 tf.random.normal 生成形状为 (16, 10) 的随机输入。
使用 @tf.function(jit_compile=True) 装饰器定义前向传播函数，以启用 XLA 优化。函数接受输入并返回模型的输出。
调用前向传播函数 forward_pass，传入随机输入进行计算。

import tensorflow as tf

print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

# Define the model
model = tf.keras.Sequential(
    [tf.keras.layers.Dense(10, input_shape=(10,), activation="relu"),
     tf.keras.layers.Dense(5, activation="softmax")]
)

# Generate random inputs for the model
batch_size = 16
input_vector_dim = 10
random_inputs = tf.random.normal((batch_size, input_vector_dim))

# Run a forward pass
_ = model(random_inputs)

# Compile the model function with XLA optimization
@tf.function(jit_compile=True)
def forward_pass(inputs):
    return model(inputs)

# Run the forward pass with XLA
_ = forward_pass(random_inputs)

结果

I0000 00:00:1727407770.382644 1007512 service.cc:146] XLA service 0x8ec22c0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1727407770.382662 1007512 service.cc:154]   StreamExecutor device (0): NVIDIA GeForce RTX 4080 SUPER, Compute Capability 8.9
2024-09-27 11:29:30.387574: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-09-27 11:29:31.040309: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:531] Loaded cuDNN version 8907
I0000 00:00:1727407771.151882 1007512 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.

AI技术栈解析及应用- 作者：张真瑜 | 山东大学智能创新研究院

框架模型层