ShopTRAINING/server/trainers/mlstm_trainer.py
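xz2000 28bae35783
# Flattened Model Data Handling Specification (Final)
**Version**: 4.0 (final)
**Core idea**: The logical path is converted into part of the file name, so that file storage is completely flat.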

---

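## 1. File Saving Rules

### 1.1. Core Principle

All metadata is encoded in the file name. A logical hierarchical path (for example `product/P001_all/mlstm/v2`) should be converted into a file name prefix joined with underscores (`product_P001_all_mlstm_v2`).

### 1.2. File Storage Locations

-   **Final artifacts**: All final models, metadata files, loss plots, and so on are stored directly in the `saved_models/` root directory.
-   **In-progress files**: All checkpoint files written during training are stored in the `saved_models/checkpoints/` directory.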

### 1.3. File Name Generation Rules

1.  **Build the logical path**: Determine the logical path from the training parameters (mode, scope, type, version).
    -   *Example*: `product/P001_all/mlstm/v2`

2.  **Generate the file name prefix**: Replace every `/` in the logical path with `_`.
    -   *Example*: `product_P001_all_mlstm_v2`

3.  **Append the file suffix**: Add a suffix describing the file type after the prefix.
    -   `_model.pth`
    -   `_metadata.json`
    -   `_loss_curve.png`
    -   `_checkpoint_best.pth`
    -   `_checkpoint_epoch_{N}.pth`

#### **Complete example:**

-   **Final model**: `saved_models/product_P001_all_mlstm_v2_model.pth`
-   **Metadata**: `saved_models/product_P001_all_mlstm_v2_metadata.json`
-   **Best checkpoint**: `saved_models/checkpoints/product_P001_all_mlstm_v2_checkpoint_best.pth`
-   **Epoch 50 checkpoint**: `saved_models/checkpoints/product_P001_all_mlstm_v2_checkpoint_epoch_50.pth`
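
The naming rules above can be sketched as a small helper. This is only an illustration, not the project's `ModelPathManager`: the function name `build_path_info` is hypothetical, and the dictionary keys are an assumption chosen to match the keys the trainer in this file reads from `path_info` (`metadata_path` and `filename_prefix` are extra, illustrative keys).

```python
from pathlib import PurePosixPath


def build_path_info(logical_path: str, root: str = "saved_models") -> dict:
    """Map a logical path such as 'product/P001_all/mlstm/v2' to flat file paths.

    Hypothetical helper; key names are assumed, not taken from ModelPathManager.
    """
    # Rule 2: replace every '/' with '_' to obtain the file name prefix.
    prefix = logical_path.strip("/").replace("/", "_")
    root_dir = PurePosixPath(root)
    ckpt_dir = root_dir / "checkpoints"
    return {
        "filename_prefix": prefix,
        "base_dir": str(root_dir),
        # Final artifacts live directly under the root directory.
        "model_path": str(root_dir / f"{prefix}_model.pth"),
        "metadata_path": str(root_dir / f"{prefix}_metadata.json"),
        "loss_curve_path": str(root_dir / f"{prefix}_loss_curve.png"),
        # Checkpoints live under the checkpoints/ subdirectory.
        "best_checkpoint_path": str(ckpt_dir / f"{prefix}_checkpoint_best.pth"),
        # Kept as a template; fill it in with .format(N=epoch).
        "epoch_checkpoint_template": str(ckpt_dir / (prefix + "_checkpoint_epoch_{N}.pth")),
    }


info = build_path_info("product/P001_all/mlstm/v2")
print(info["model_path"])
print(info["epoch_checkpoint_template"].format(N=50))
```

Keeping the epoch checkpoint as a `{N}` template means the caller decides the epoch number at save time, which is how the suffix `_checkpoint_epoch_{N}.pth` is written in the rules.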

---

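## 2. File Reading Rules

1.  **Determine the model metadata**: Decide the training mode, scope, type, and version of the model to load.
2.  **Build the file name prefix**: Using the same logic as when saving, join the metadata into a file name prefix (for example `product_P001_all_mlstm_v2`).
3.  **Locate the file**:
    -   To load the final model, look for: `saved_models/{prefix}_model.pth`.
    -   To load the best checkpoint, look for: `saved_models/checkpoints/{prefix}_checkpoint_best.pth`.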
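
As a rough sketch of step 3, the lookup could be written as follows. The function name `locate_model_file` and its `kind` parameter are hypothetical; the sketch only resolves and checks the path and leaves the actual loading (for example with `torch.load`) to the caller.

```python
from pathlib import Path


def locate_model_file(prefix: str, kind: str = "model", root: str = "saved_models") -> Path:
    """Return the path of the final model or the best checkpoint for a prefix.

    Hypothetical helper: `kind` is 'model' or 'best'.
    """
    if kind == "model":
        path = Path(root) / f"{prefix}_model.pth"
    elif kind == "best":
        path = Path(root) / "checkpoints" / f"{prefix}_checkpoint_best.pth"
    else:
        raise ValueError(f"unknown kind: {kind!r}")
    # Fail early with a clear error instead of letting the loader fail later.
    if not path.is_file():
        raise FileNotFoundError(path)
    return path
```

A caller would then pass the returned path to whatever loader matches how the file was saved.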

---

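## 3. Database Storage Rules

The database serves as an index and should store enough key metadata to reconstruct the file name prefix.

#### **Suggested `models` table schema:**

| Field | Type | Description | Example |
| :--- | :--- | :--- | :--- |
| `id` | INTEGER | Primary key | 1 |
| `filename_prefix` | TEXT | **Full file name prefix; can serve as a unique identifier** | `product_P001_all_mlstm_v2` |
| `model_identifier` | TEXT | Identifier used for version control (without the version) | `product_P001_all_mlstm` |
| `version` | INTEGER | Version number | `2` |
| `status` | TEXT | Model status | `completed`, `training`, `failed` |
| `created_at` | TEXT | Creation time | `2025-07-21 02:29:00` |
| `metrics_summary` | TEXT | JSON string of key performance metrics | `{"rmse": 10.5, "r2": 0.89}` |

#### **Saving logic:**
-   After training completes, insert one record into the table. The `filename_prefix` field is the key for finding all files related to that training run.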
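
A minimal sketch of the suggested table and the insert step, assuming SQLite (the column types above suggest it, but the source does not name a database engine). The function name `record_model` and the `NOT NULL` / `UNIQUE` constraints are assumptions added for illustration.

```python
import json
import sqlite3

# Schema following the suggested table; constraints are an assumption.
SCHEMA = """
CREATE TABLE IF NOT EXISTS models (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    filename_prefix TEXT NOT NULL UNIQUE,
    model_identifier TEXT NOT NULL,
    version INTEGER NOT NULL,
    status TEXT NOT NULL,
    created_at TEXT NOT NULL,
    metrics_summary TEXT
)
"""


def record_model(conn, model_identifier, version, status, metrics):
    """Insert one row for a finished training run and return its prefix."""
    prefix = f"{model_identifier}_v{version}"
    conn.execute(SCHEMA)
    conn.execute(
        "INSERT INTO models (filename_prefix, model_identifier, version, "
        "status, created_at, metrics_summary) "
        "VALUES (?, ?, ?, ?, datetime('now'), ?)",  # datetime('now') is UTC
        (prefix, model_identifier, version, status, json.dumps(metrics)),
    )
    conn.commit()
    return prefix


conn = sqlite3.connect(":memory:")
record_model(conn, "product_P001_all_mlstm", 2, "completed", {"rmse": 10.5, "r2": 0.89})
row = conn.execute("SELECT filename_prefix, version FROM models").fetchone()
print(row)
```

Storing `model_identifier` and `version` separately as well as the full prefix keeps "latest version of this model" queries simple while the prefix stays a direct key to the files.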

---

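## 4. Version Recording Rules

Version management relies on a `versions.json` file in the root directory to provide atomic, thread-safe version number increments.

-   **File name**: `versions.json`
-   **Location**: `saved_models/versions.json`
-   **Structure**: A JSON object in which each `key` is an identifier without a version number and each `value` is the latest version number (an integer) under that identifier.
    -   **Key**: `{prefix_core}_{model_type}` (for example: `product_P001_all_mlstm`)
    -   **Value**: `Integer`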

#### **`versions.json` example:**
```json
{
  "product_P001_all_mlstm": 2,
  "store_S001_P002_transformer": 1
}
```

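#### **Version management workflow:**

1.  **Get the new version**: Before training starts, build the `key`. Read `versions.json` and look up the `value` for that `key`. The new version number is `value + 1` (or `1` if the key does not exist).
2.  **Update the version**: After training succeeds, write the new version number back to `versions.json`. This step **must use a file lock** to prevent concurrent conflicts.

Finished debugging drug prediction and store prediction
2025-07-21 16:39:52 +08:00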
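
The two steps above might look roughly like this. It is a sketch under stated assumptions: the helper names (`file_lock`, `get_next_version`, `commit_version`) are hypothetical, and the lock is a simple exclusive-create lock file built from the standard library rather than whatever locking mechanism the project actually uses. A crashed process could leave a stale `.lock` file behind, which a real implementation would need to handle.

```python
import contextlib
import json
import os
import tempfile
import time


@contextlib.contextmanager
def file_lock(lock_path, timeout=10.0, poll=0.05):
    """Hold an exclusive lock by creating lock_path with O_EXCL."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


def _load_versions(versions_path):
    try:
        with open(versions_path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def get_next_version(versions_path, key):
    """Step 1: the new version is value + 1, or 1 if the key is absent."""
    return _load_versions(versions_path).get(key, 0) + 1


def commit_version(versions_path, key, version):
    """Step 2: write the version back under a lock, replacing the file atomically."""
    with file_lock(versions_path + ".lock"):
        versions = _load_versions(versions_path)
        versions[key] = version
        directory = os.path.dirname(os.path.abspath(versions_path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(versions, f, indent=2)
        # os.replace swaps the file in one step, so readers never see a partial write.
        os.replace(tmp_path, versions_path)
```

Note that because step 1 reads without reserving the number, two runs that start at the same time can compute the same new version; the lock only protects the write in step 2.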


"""
Pharmacy sales forecasting system - mLSTM model training functions
"""
import os
import time
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
from tqdm import tqdm
from models.mlstm_model import MLSTMTransformer as MatrixLSTM
from utils.data_utils import create_dataset, PharmacyDataset
from utils.multi_store_data_utils import get_store_product_sales_data, aggregate_multi_store_data
from utils.visualization import plot_loss_curve
from analysis.metrics import evaluate_model
from core.config import (
    DEVICE, DEFAULT_MODEL_DIR, LOOK_BACK, FORECAST_HORIZON
)
from utils.training_progress import progress_manager


def save_checkpoint(checkpoint_data: dict, epoch_or_label, path_info: dict):
    """
    Save a training checkpoint (adapted to the flattened path specification).

    Args:
        checkpoint_data: Checkpoint data.
        epoch_or_label: Epoch number or label (the string 'best' or an integer).
        path_info (dict): Dictionary containing all path information.
    """
    if epoch_or_label == 'best':
        # Use the best-checkpoint path provided directly by ModelPathManager
        checkpoint_path = path_info['best_checkpoint_path']
    else:
        # Build the path from the epoch checkpoint template
        template = path_info.get('epoch_checkpoint_template')
        if not template:
            raise ValueError("'path_info' is missing 'epoch_checkpoint_template'")
        checkpoint_path = template.format(N=epoch_or_label)

    # Save the checkpoint
    torch.save(checkpoint_data, checkpoint_path)
    print(f"[mLSTM] Checkpoint saved: {checkpoint_path}", flush=True)

    return checkpoint_path
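

def train_product_model_with_mlstm(
    product_id,
    product_df,
    store_id=None,
    training_mode='product',
    aggregation_method='sum',
    epochs=50,
    model_dir=DEFAULT_MODEL_DIR,  # to be replaced by path_info
    version=None,  # to be replaced by path_info
    socketio=None,
    task_id=None,
    continue_training=False,
    progress_callback=None,
    path_info=None,  # new parameter
    patience=10,
    learning_rate=0.001,
    clip_norm=1.0
):
    """
    Train a product sales forecasting model with mLSTM.

    Args:
        product_id: Product ID.
        product_df: Training data, supplied by the caller.
        store_id: Store ID; when None, global data is used.
        training_mode: Training mode ('product', 'store', 'global').
        aggregation_method: Aggregation method ('sum', 'mean', 'weighted').
        epochs: Number of training epochs.
        model_dir: Model save directory.
        version: Model version; generated automatically if None.
        socketio: Socket.IO instance used for real-time progress updates.
        task_id: Task ID.
        continue_training: Whether to continue training.
        progress_callback: Progress callback, used for multi-process training.
        path_info: Dictionary of path information (required).
        patience: Early-stopping patience.
        learning_rate: Learning rate for the Adam optimizer.
        clip_norm: Maximum gradient norm for clipping; falsy disables clipping.
    """
    # Verify that path_info was provided
    if not path_info:
        raise ValueError("train_product_model_with_mlstm requires the 'path_info' argument.")

    version = path_info['version']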

    # Create the WebSocket progress feedback function (supports multi-process use)
    def emit_progress(message, progress=None, metrics=None):
        """Send training progress to the front end."""
        progress_data = {
            'task_id': task_id,
            'message': message,
            'timestamp': time.time()
        }
        if progress is not None:
            progress_data['progress'] = progress
        if metrics is not None:
            progress_data['metrics'] = metrics

        # In a multi-process environment, use progress_callback
        if progress_callback:
            try:
                progress_callback(progress_data)
            except Exception as e:
                print(f"[mLSTM] Progress callback failed: {e}")

        # In a single-process environment, use socketio
        if socketio and task_id:
            try:
                socketio.emit('training_progress', progress_data, namespace='/training')
            except Exception as e:
                print(f"[mLSTM] WebSocket send failed: {e}")

        print(f"[mLSTM] {message}", flush=True)
        # Force-flush the output buffers
        import sys
        sys.stdout.flush()
        sys.stderr.flush()

    emit_progress(f"Starting training of mLSTM model version v{version}")

    # Initialize the training progress manager (if not already initialized)
    if socketio and task_id:
        print(f"[mLSTM] Task {task_id}: starting mLSTM trainer", flush=True)
        try:
            # Initialize the progress manager
            if not hasattr(progress_manager, 'training_id') or progress_manager.training_id != task_id:
                progress_manager.start_training(
                    training_id=task_id,
                    product_id=product_id,
                    model_type='mlstm',
                    training_mode=training_mode,
                    total_epochs=epochs,
                    total_batches=0,  # set later
                    batch_size=32,  # default value
                    total_samples=0  # set later
                )
                print(f"[mLSTM] Task {task_id}: progress manager initialized", flush=True)
            else:
                print(f"[mLSTM] Task {task_id}: using existing progress manager", flush=True)
        except Exception as e:
            print(f"[mLSTM] Task {task_id}: progress manager initialization failed: {e}", flush=True)

    # Data is now passed in by the caller and is no longer loaded here
    if training_mode == 'store' and store_id:
        training_scope = f"store {store_id}"
    elif training_mode == 'global':
        training_scope = f"global aggregation ({aggregation_method})"
    else:
        training_scope = "all stores"

    # Check the amount of data
    min_required_samples = LOOK_BACK + FORECAST_HORIZON
    if len(product_df) < min_required_samples:
        error_msg = (
            f"❌ Insufficient training data\n"
            f"Current configuration requires: {min_required_samples} days of data (LOOK_BACK={LOOK_BACK} + FORECAST_HORIZON={FORECAST_HORIZON})\n"
            f"Actual data available: {len(product_df)} days\n"
            f"Product ID: {product_id}, training mode: {training_mode}\n"
            f"Suggested fixes:\n"
            f"1. Generate more data: uv run generate_multi_store_data.py\n"
            f"2. Adjust the configuration: reduce LOOK_BACK or FORECAST_HORIZON\n"
            f"3. Use global training mode to aggregate more data"
        )
        print(error_msg)
        emit_progress(f"Training failed: insufficient data ({len(product_df)}/{min_required_samples} days)")
        raise ValueError(error_msg)

    product_name = product_df['product_name'].iloc[0]

    print(f"[mLSTM] Training a sales forecasting model for product '{product_name}' (ID: {product_id}) with mLSTM", flush=True)
    print(f"[mLSTM] Training scope: {training_scope}", flush=True)
    print(f"[mLSTM] Version: v{version}", flush=True)
    print(f"[mLSTM] Device: {DEVICE}", flush=True)
    print(f"[mLSTM] Model will be saved to: {path_info['base_dir']}", flush=True)
    print(f"[mLSTM] Data size: {len(product_df)} records", flush=True)

    emit_progress(f"Training product: {product_name} (ID: {product_id}) - {training_scope}")

    # Define the features and the target variable
    features = ['sales', 'weekday', 'month', 'is_holiday', 'is_weekend', 'is_promotion', 'temperature']

    print(f"[mLSTM] Starting data preprocessing, features: {features}", flush=True)

    # Preprocess the data
    X = product_df[features].values
    y = product_df[['sales']].values  # keep as a 2-D array

    print(f"[mLSTM] Feature matrix shape: {X.shape}, target matrix shape: {y.shape}", flush=True)
    emit_progress("Preprocessing data...")

    # Normalize the data
    scaler_X = MinMaxScaler(feature_range=(0, 1))
    scaler_y = MinMaxScaler(feature_range=(0, 1))

    X_scaled = scaler_X.fit_transform(X)
    y_scaled = scaler_y.fit_transform(y)

    print(f"[mLSTM] Data normalization complete", flush=True)

    # Split into training and test sets (80% training, 20% test)
    train_size = int(len(X_scaled) * 0.8)
    X_train, X_test = X_scaled[:train_size], X_scaled[train_size:]
    y_train, y_test = y_scaled[:train_size], y_scaled[train_size:]

    # Build the time-series windows
    trainX, trainY = create_dataset(X_train, y_train, LOOK_BACK, FORECAST_HORIZON)
    testX, testY = create_dataset(X_test, y_test, LOOK_BACK, FORECAST_HORIZON)

    # Convert to PyTorch tensors
    trainX_tensor = torch.Tensor(trainX)
    trainY_tensor = torch.Tensor(trainY)
    testX_tensor = torch.Tensor(testX)
    testY_tensor = torch.Tensor(testY)

    # Create the data loaders
    train_dataset = PharmacyDataset(trainX_tensor, trainY_tensor)
    test_dataset = PharmacyDataset(testX_tensor, testY_tensor)

    batch_size = 32
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

    # Batch information for the progress manager
    total_batches = len(train_loader)
    total_samples = len(train_dataset)

    print(f"[mLSTM] Data loaders created - batches: {total_batches}, samples: {total_samples}", flush=True)
    emit_progress(f"Data loaders ready - batches: {total_batches}, samples: {total_samples}")

    # Initialize the combined mLSTM + Transformer model
    input_dim = X_train.shape[1]
    output_dim = FORECAST_HORIZON
    hidden_size = 128
    num_heads = 4
    dropout_rate = 0.1
    num_blocks = 3
    embed_dim = 32
    dense_dim = 32

    print(f"[mLSTM] Initializing model - input dim: {input_dim}, output dim: {output_dim}", flush=True)
    print(f"[mLSTM] Model parameters - hidden size: {hidden_size}, attention heads: {num_heads}", flush=True)
    emit_progress(f"Initializing mLSTM model - input dim: {input_dim}, hidden size: {hidden_size}")

    model = MatrixLSTM(
        num_features=input_dim,
        hidden_size=hidden_size,
        mlstm_layers=2,
        embed_dim=embed_dim,
        dense_dim=dense_dim,
        num_heads=num_heads,
        dropout_rate=dropout_rate,
        num_blocks=num_blocks,
        output_sequence_length=output_dim
    )

    print(f"[mLSTM] Model created", flush=True)
    emit_progress("mLSTM model initialized")

    # If continuing training, load the existing model
    if continue_training and version != 'v1':
        # TODO: the continue-training logic needs to be adapted to the new path structure,
        # for example by loading the best checkpoint of the previous version
        emit_progress("Continue-training is not yet adapted to the new path structure; starting as a new training run for now.")

    # Move the model to the device
    model = model.to(DEVICE)

    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min', patience=patience // 2, factor=0.5)

    emit_progress("Data preprocessing complete, starting model training...", progress=10)

    # Train the model
    train_losses = []
    test_losses = []
    start_time = time.time()

    # Configure checkpoint saving
    checkpoint_interval = max(1, epochs // 10)  # save every 10% of progress, at least every epoch
    best_loss = float('inf')
    epochs_no_improve = 0

    emit_progress(f"Starting training - total epochs: {epochs}, checkpoint interval: {checkpoint_interval}, patience: {patience}")

    for epoch in range(epochs):
        emit_progress(f"Starting Epoch {epoch+1}/{epochs}")

        model.train()
        epoch_loss = 0

        for batch_idx, (X_batch, y_batch) in enumerate(train_loader):
            X_batch, y_batch = X_batch.to(DEVICE), y_batch.to(DEVICE)

            # Forward pass
            outputs = model(X_batch)
            loss = criterion(outputs, y_batch)

            # Backward pass and optimization
            optimizer.zero_grad()
            loss.backward()
            if clip_norm:
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip_norm)
            optimizer.step()

            epoch_loss += loss.item()

        # Compute the training loss
        train_loss = epoch_loss / len(train_loader)
        train_losses.append(train_loss)

        # Evaluate on the test set
        model.eval()
        test_loss = 0
        with torch.no_grad():
            for batch_idx, (X_batch, y_batch) in enumerate(test_loader):
                X_batch, y_batch = X_batch.to(DEVICE), y_batch.to(DEVICE)

                outputs = model(X_batch)
                loss = criterion(outputs, y_batch)
                test_loss += loss.item()

        test_loss = test_loss / len(test_loader)
        test_losses.append(test_loss)

        # Update the learning rate
        scheduler.step(test_loss)

        # Compute the overall training progress
        epoch_progress = ((epoch + 1) / epochs) * 90 + 10  # 10-100% range

        # Send the training progress
        current_metrics = {
            'train_loss': train_loss,
            'test_loss': test_loss,
            'epoch': epoch + 1,
            'total_epochs': epochs,
            'learning_rate': optimizer.param_groups[0]['lr']
        }

        emit_progress(f"Epoch {epoch+1}/{epochs} complete - Train Loss: {train_loss:.4f}, Test Loss: {test_loss:.4f}",
                      progress=epoch_progress, metrics=current_metrics)

        # Save a checkpoint periodically
        if (epoch + 1) % checkpoint_interval == 0 or epoch == epochs - 1:
            checkpoint_data = {
                'epoch': epoch + 1,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'train_loss': train_loss,
                'test_loss': test_loss,
                'train_losses': train_losses,
                'test_losses': test_losses,
                'scaler_X': scaler_X,
                'scaler_y': scaler_y,
                'config': {
                    'input_dim': input_dim,
                    'output_dim': output_dim,
                    'hidden_size': hidden_size,
                    'num_heads': num_heads,
                    'dropout': dropout_rate,
                    'num_blocks': num_blocks,
                    'embed_dim': embed_dim,
                    'dense_dim': dense_dim,
                    'sequence_length': LOOK_BACK,
                    'forecast_horizon': FORECAST_HORIZON,
                    'model_type': 'mlstm'
                },
                'training_info': {
                    'product_id': product_id,
                    'product_name': product_name,
                    'training_mode': training_mode,
                    'store_id': store_id,
                    'aggregation_method': aggregation_method,
                    'training_scope': training_scope,
                    'timestamp': time.time()
                }
            }

            # Save the checkpoint
            save_checkpoint(checkpoint_data, epoch + 1, path_info)

            # If this is the best model so far, save an extra copy.
            # Note: this check runs only on checkpoint epochs, so `patience`
            # is counted in checkpoint intervals rather than single epochs.
            if test_loss < best_loss:
                best_loss = test_loss
                save_checkpoint(checkpoint_data, 'best', path_info)
                emit_progress(f"💾 Saved best model checkpoint (epoch {epoch+1}, test_loss: {test_loss:.4f})")
                epochs_no_improve = 0
            else:
                epochs_no_improve += 1

            emit_progress(f"💾 Saved training checkpoint epoch_{epoch+1}")

        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}/{epochs}, Train Loss: {train_loss:.4f}, Test Loss: {test_loss:.4f}", flush=True)

        # Early-stopping logic
        if epochs_no_improve >= patience:
            emit_progress(f"Test loss has not improved for {patience} consecutive checks; stopping training early.")
            break

    # Compute the training time
    training_time = time.time() - start_time

    emit_progress("Generating loss curve...", progress=95)

    # Get the loss curve save path from path_info
    loss_curve_path = path_info['loss_curve_path']

    # Plot the loss curve and save it to the model directory
    plt.figure(figsize=(10, 6))
    plt.plot(train_losses, label='Training Loss')
    plt.plot(test_losses, label='Test Loss')
    title_suffix = f" - {training_scope}" if store_id else " - global model"
    plt.title(f'mLSTM training loss curve - {product_name} (v{version}){title_suffix}')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()
    plt.grid(True)
    plt.savefig(loss_curve_path, dpi=300, bbox_inches='tight')
    plt.close()

    print(f"Loss curve saved to: {loss_curve_path}")

    emit_progress("Evaluating model...", progress=98)

    # Evaluate the model
    model.eval()
    with torch.no_grad():
        test_pred = model(testX_tensor.to(DEVICE)).cpu().numpy()
        test_true = testY

    # Inverse-transform the predictions and the true values
    test_pred_inv = scaler_y.inverse_transform(test_pred)
    test_true_inv = scaler_y.inverse_transform(test_true)

    # Compute the evaluation metrics
    metrics = evaluate_model(test_true_inv, test_pred_inv)
    metrics['training_time'] = training_time
    metrics['version'] = version

    # Print the evaluation metrics
    print("\nModel evaluation metrics:")
    print(f"MSE: {metrics['mse']:.4f}")
    print(f"RMSE: {metrics['rmse']:.4f}")
    print(f"MAE: {metrics['mae']:.4f}")
    print(f"R²: {metrics['r2']:.4f}")
    print(f"MAPE: {metrics['mape']:.2f}%")
    print(f"Training time: {training_time:.2f} s")

    emit_progress("Saving final model...", progress=99)

    # Save the final trained model (based on the final epoch)
    final_model_data = {
        # Configured epoch count; may exceed the epochs actually run
        # if early stopping was triggered.
        'epoch': epochs,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'train_loss': train_losses[-1],
        'test_loss': test_losses[-1],
        'train_losses': train_losses,
        'test_losses': test_losses,
        'scaler_X': scaler_X,
        'scaler_y': scaler_y,
        'config': {
            'input_dim': input_dim,
            'output_dim': output_dim,
            'hidden_size': hidden_size,
            'num_heads': num_heads,
            'dropout': dropout_rate,
            'num_blocks': num_blocks,
            'embed_dim': embed_dim,
            'dense_dim': dense_dim,
            'sequence_length': LOOK_BACK,
            'forecast_horizon': FORECAST_HORIZON,
            'model_type': 'mlstm'
        },
        'metrics': metrics,
        'loss_curve_path': loss_curve_path,
        'training_info': {
            'product_id': product_id,
            'product_name': product_name,
            'training_mode': training_mode,
            'store_id': store_id,
            'aggregation_method': aggregation_method,
            'training_scope': training_scope,
            'timestamp': time.time(),
            'training_completed': True
        }
    }

    # Check whether model performance meets the bar.
    # The R2 check has been removed; the model is always saved.
    if metrics:
        # Save the final model to model.pth
        final_model_path = path_info['model_path']
        torch.save(final_model_data, final_model_path)
        print(f"[mLSTM] Final model saved: {final_model_path}", flush=True)
    else:
        final_model_path = None
        print(f"[mLSTM] No evaluation metrics were produced during training; the final model is not saved.", flush=True)

    # Send the training-complete message
    final_metrics = {
        'mse': metrics['mse'],
        'rmse': metrics['rmse'],
        'mae': metrics['mae'],
        'r2': metrics['r2'],
        'mape': metrics['mape'],
        'training_time': training_time,
        'final_epoch': epochs,
        'model_path': final_model_path
    }

    if final_model_path:
        emit_progress(f"✅ mLSTM model training complete; final epoch: {epochs}, saved", progress=100, metrics=final_metrics)
    else:
        emit_progress(f"❌ mLSTM model training failed: performance below the bar", progress=100, metrics={'error': 'poor model performance'})

    return model, metrics, epochs, final_model_path