I. End-to-End Smart-Home Dialogue Intent Recognition with PaddleNLP

0. Abstract

We organize the smart-device dialogue data and then apply an end-to-end binary/multi-class text classification solution built on fine-tuning a pretrained model. Using ernie-3.0-tiny-medium-v2-zh, we train, evaluate, export the model, run prediction, and submit the results.

1. Competition Overview

Intent recognition means analyzing the user's core need and returning the information most relevant to the query. For example, a search may be looking for a movie, tracking a package, or asking about municipal services; these needs call for very different retrieval strategies underneath, and a wrong recognition almost guarantees that nothing satisfying the user's need will be found, producing a very poor user experience. Accurately understanding what the other party means during a conversation is a highly challenging task.

The accuracy of intent recognition largely determines the accuracy of search and the intelligence of a dialogue system. In this competition, contestants must perform intent recognition on Chinese dialogues.

2. Dataset

  • Training data: about 12,000 Chinese dialogues
  • Test data: 3,000 unlabeled dialogues

3. Submission Format

Submissions are scored by accuracy; the higher, the better.

  • External datasets are not allowed; publicly available pretrained models are allowed.
  • Solutions are scored on the designated platform and must be submitted in CSV format.

Submission example:

ID,Target
1,TVProgram-Play
2,HomeAppliance-Control
3,Audio-Play
4,Alarm-Update
5,HomeAppliance-Control
6,FilmTele-Play
7,FilmTele-Play
8,Music-Play
9,Calendar-Query
10,Video-Play
11,Alarm-Update
12,Music-Play
13,Travel-Query
14,TVProgram-Play
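Before uploading, it can be worth verifying that the submission file has exactly the header and consecutive IDs shown above. A minimal sketch of such a check (illustrative only, not part of the competition tooling):

```python
import csv
import io

# Small in-memory stand-in for the submission CSV shown above.
sample = "ID,Target\n1,TVProgram-Play\n2,HomeAppliance-Control\n3,Audio-Play\n"

rows = list(csv.reader(io.StringIO(sample)))
assert rows[0] == ["ID", "Target"], "header must be exactly ID,Target"
# IDs should be consecutive integers starting from 1.
assert [int(r[0]) for r in rows[1:]] == list(range(1, len(rows)))
print("submission format OK")
```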

4. Approach

End-to-end smart-home dialogue intent recognition with PaddleNLP.

II. Environment Setup

%cd ~
!git clone https://gitee.com/paddlepaddle/PaddleNLP/
/home/aistudio
Cloning into 'PaddleNLP'...
remote: Enumerating objects: 47494, done.
remote: Counting objects: 100% (34730/34730), done.
remote: Compressing objects: 100% (17072/17072), done.
remote: Total 47494 (delta 23983), reused 27016 (delta 16711), pack-reused 12764
Receiving objects: 100% (47494/47494), 87.84 MiB | 4.86 MiB/s, done.
Resolving deltas: 100% (32328/32328), done.
Checking connectivity... done.
!pip install -U paddlenlp

III. Data Processing

import pandas as pd
df = pd.read_csv('data/data208091/train.csv', sep='\t', header=None)
df.head(10)
0 1
0 还有双鸭山到淮阴的汽车票吗13号的 Travel-Query
1 从这里怎样回家 Travel-Query
2 随意播映一首专辑阁楼里的佛里的歌 Music-Play
3 给看一下墓王之王嘛 FilmTele-Play
4 我想看挑战两把s686打骤变团竞的游戏视频 Video-Play
5 我想看平和精英上战神必备技巧的游戏视频 Video-Play
6 2019年古装爱情电视剧小女花不弃的花絮播映一下 Video-Play
7 找一个2004年的推理剧给我看一会呢 FilmTele-Play
8 自驾游去深圳都经过那些当地啊 Travel-Query
9 给我转播今日的女子双打乒乓球竞赛现场 Video-Play

1. Generate the label file

labels = df[1].unique()
# Open the file and write one label per line
with open('label.txt', 'w') as f:
    for item in labels:
        f.write(str(item) + '\n')
!cat label.txt
Travel-Query
Music-Play
FilmTele-Play
Video-Play
Radio-Listen
HomeAppliance-Control
Weather-Query
Alarm-Update
Calendar-Query
TVProgram-Play
Audio-Play
Other
%cd ~/PaddleNLP/applications/text_classification/multi_class
!mkdir data
/home/aistudio/PaddleNLP/applications/text_classification/multi_class

2. Split the dataset

  • train_test_split splits the data directly into training and dev sets at an 8:2 ratio
import os
from sklearn.model_selection import train_test_split
# Split into training (80%) and dev (20%) sets
train_data, dev_data = train_test_split(df, test_size=0.2)
root='data'
train_filename = os.path.join(root, 'train.txt')
dev_filename = os.path.join(root, 'dev.txt')
train_data.to_csv(train_filename, index=False, sep="\t", header=None)
dev_data.to_csv(dev_filename, index=False, sep="\t", header=None)
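One optional refinement (not used above): passing stratify to train_test_split keeps the per-label ratios identical in the training and dev splits, which can matter for rarer classes such as Other. A small sketch on toy data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame mimicking df: column 0 is the text, column 1 the label.
toy = pd.DataFrame({0: [f"utterance {i}" for i in range(10)],
                    1: ["Music-Play"] * 5 + ["Other"] * 5})

# stratify=toy[1] preserves the 50/50 label ratio in both splits.
tr, dv = train_test_split(toy, test_size=0.2, stratify=toy[1], random_state=42)
print(sorted(dv[1].tolist()))  # ['Music-Play', 'Other']
```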

3. Data preparation

Training requires a local dataset in the specified format. If you do not yet have a labeled dataset, you can refer to the doccano annotation guide for text classification to label your data. The directory structure of a dataset in the specified format is:

3.1 Directory structure

data/
├── train.txt # training set file
├── dev.txt # dev set file
└── label.txt # label file

3.2 Dataset format

In the training, dev, and test set files, the text and the label are separated by a tab character '\t'; avoid tab characters inside the text itself.

train.txt/dev.txt/test.txt file format:

<text>'\t'<label>
<text>'\t'<label>
...
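A stray tab inside the text would silently shift the label column, so a quick line-format check can be worthwhile. A minimal sketch (check_line is a hypothetical helper, not part of PaddleNLP):

```python
def check_line(line):
    """Return True if a dataset line is exactly <text>\t<label>."""
    parts = line.rstrip("\n").split("\t")
    return len(parts) == 2 and all(parts)

assert check_line("从这里怎样回家\tTravel-Query\n")
assert not check_line("a line with no tab\n")      # missing label
assert not check_line("text\twith\textra-tab\n")   # tab inside the text
print("format checks passed")
```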

3.3 Label file format

label.txt (the label file) records the set of all labels in the dataset, one label name per line.

  • label.txt file format:
<label>
<label>
!cp ~/data/data208091/test.csv data/test.txt
!cp ~/label.txt data/label.txt
!tree data
data
├── bad_case.txt
├── dev.txt
├── label.txt
├── test.txt
└── train.txt
0 directories, 5 files

IV. Model Training

Fine-tune the model with the Trainer API

Given just a model and datasets, the Trainer API lets you run pretraining, fine-tuning, and model compression quickly and efficiently. It supports one-line launch of multi-GPU training, mixed-precision training, gradient accumulation, checkpoint resume, and logging, and it also wraps common training configuration such as the optimizer and learning-rate scheduling.

1. Training arguments

The main configuration arguments are:

  • do_train: whether to run training.
  • do_eval: whether to run evaluation.
  • debug: used together with do_eval; whether to enable debug mode, which evaluates every class separately.
  • do_export: whether to export the static graph after training.
  • do_compress: whether to prune the model after training.
  • model_name_or_path: built-in model name, or path to a local model directory. Defaults to ernie-3.0-tiny-medium-v2-zh.
  • output_dir: directory for saving model parameters, training logs, and the exported static graph.
  • device: device to run on; defaults to gpu.
  • num_train_epochs: number of training epochs; with early stopping enabled this can be set as high as 100.
  • early_stopping: whether to use early stopping, i.e. stop training once the evaluation metric has not improved for a number of evaluation calls.
  • early_stopping_patience: number of evaluation calls with no improvement on the dev set before training stops; defaults to 4.
  • learning_rate: base learning rate for the pretrained language model; it is multiplied by the value produced by the learning-rate scheduler to obtain the current learning rate.
  • max_length: maximum sequence length; longer texts are truncated and shorter ones padded. Prompt text is never truncated.
  • per_device_train_batch_size: training samples per card per step. Adjust up or down according to available GPU memory.
  • per_device_eval_batch_size: evaluation samples per card per step. Adjust up or down according to available GPU memory.
  • train_path: training set path; defaults to "./data/train.txt".
  • dev_path: dev set path; defaults to "./data/dev.txt".
  • test_path: test set path; defaults to "./data/dev.txt".
  • label_path: label file path; defaults to "./data/label.txt".
  • bad_case_path: path for saving mispredicted samples; defaults to "./data/bad_case.txt".
  • width_mult_list: list of retained width (multi-head) ratios for pruning, i.e. the fraction of the q/k/v and FFN weight widths kept in self-attention; each ratio times the number of heads should be an integer. Defaults to None.

The training script supports all TrainingArguments parameters; see the TrainingArguments documentation for more details.

2. Start training

!python train.py \
    --do_train \
    --do_eval \
    --do_export \
    --model_name_or_path ernie-3.0-tiny-medium-v2-zh \
    --output_dir checkpoint \
    --device gpu \
    --num_train_epochs 100 \
    --early_stopping True \
    --early_stopping_patience 5 \
    --learning_rate 3e-5 \
    --max_length 128 \
    --per_device_eval_batch_size 32 \
    --per_device_train_batch_size 32 \
    --metric_for_best_model accuracy \
    --load_best_model_at_end \
    --logging_steps 5 \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --save_total_limit 1    

3. Training log

[2023-04-11 17:30:31,229] [    INFO] -   Num examples = 2420
[2023-04-11 17:30:31,229] [    INFO] -   Total prediction steps = 76
[2023-04-11 17:30:31,229] [    INFO] -   Pre device batch size = 32
[2023-04-11 17:30:31,229] [    INFO] -   Total Batch size = 32
  0%|                                                    | 0/76 [00:00<?, ?it/s]
  9%|████                                        | 7/76 [00:00<00:01, 58.78it/s]
 17%|███████▎                                   | 13/76 [00:00<00:01, 53.60it/s]
 25%|██████████▊                                | 19/76 [00:00<00:01, 53.72it/s]
 33%|██████████████▏                            | 25/76 [00:00<00:00, 54.40it/s]
 41%|█████████████████▌                         | 31/76 [00:00<00:00, 53.83it/s]
 49%|████████████████████▉                      | 37/76 [00:00<00:00, 53.93it/s]
 57%|████████████████████████▎                  | 43/76 [00:00<00:00, 54.07it/s]
 64%|███████████████████████████▋               | 49/76 [00:00<00:00, 53.70it/s]
 72%|███████████████████████████████            | 55/76 [00:01<00:00, 54.41it/s]
 80%|██████████████████████████████████▌        | 61/76 [00:01<00:00, 53.62it/s]
 88%|█████████████████████████████████████▉     | 67/76 [00:01<00:00, 53.80it/s]
eval_loss: 0.38935500383377075, eval_accuracy: 0.9347107438016529, eval_micro_precision: 0.9347107438016529, eval_micro_recall: 0.9347107438016529, eval_micro_f1: 0.9347107438016529, eval_macro_precision: 0.8868087630817776, eval_macro_recall: 0.883506204765109, eval_macro_f1: 0.8840559834605317, eval_runtime: 1.4272, eval_samples_per_second: 1695.617, eval_steps_per_second: 53.251, epoch: 12.0
 12%|████▌                                 | 3636/30300 [02:54<15:32, 28.61it/s]
100%|███████████████████████████████████████████| 76/76 [00:01<00:00, 58.29it/s]
                                                                                [2023-04-11 17:30:32,658] [    INFO] - Saving model checkpoint to checkpoint/checkpoint-3636
[2023-04-11 17:30:32,659] [    INFO] - Configuration saved in checkpoint/checkpoint-3636/config.json
[2023-04-11 17:30:33,291] [    INFO] - tokenizer config file saved in checkpoint/checkpoint-3636/tokenizer_config.json
[2023-04-11 17:30:33,291] [    INFO] - Special tokens file saved in checkpoint/checkpoint-3636/special_tokens_map.json
[2023-04-11 17:30:34,681] [    INFO] - Deleting older checkpoint [checkpoint/checkpoint-3333] due to args.save_total_limit
[2023-04-11 17:30:34,814] [    INFO] - 
Training completed. 
[2023-04-11 17:30:34,814] [    INFO] - Loading best model from checkpoint/checkpoint-2121 (score: 0.9400826446280992).
train_runtime: 177.1813, train_samples_per_second: 5463.331, train_steps_per_second: 171.011, train_loss: 0.3273027794348789, epoch: 12.0
 12%|████▌                                 | 3636/30300 [02:57<21:39, 20.52it/s]
[2023-04-11 17:30:35,236] [    INFO] - Saving model checkpoint to checkpoint
[2023-04-11 17:30:35,238] [    INFO] - Configuration saved in checkpoint/config.json
[2023-04-11 17:30:35,875] [    INFO] - tokenizer config file saved in checkpoint/tokenizer_config.json
[2023-04-11 17:30:35,875] [    INFO] - Special tokens file saved in checkpoint/special_tokens_map.json
[2023-04-11 17:30:35,876] [    INFO] - ***** train metrics *****
[2023-04-11 17:30:35,876] [    INFO] -   epoch                    =       12.0
[2023-04-11 17:30:35,876] [    INFO] -   train_loss               =     0.3273
[2023-04-11 17:30:35,876] [    INFO] -   train_runtime            = 0:02:57.18
[2023-04-11 17:30:35,876] [    INFO] -   train_samples_per_second =   5463.331
[2023-04-11 17:30:35,876] [    INFO] -   train_steps_per_second   =    171.011
[2023-04-11 17:30:36,113] [    INFO] - ***** Running Evaluation *****
[2023-04-11 17:30:36,113] [    INFO] -   Num examples = 2420
[2023-04-11 17:30:36,113] [    INFO] -   Total prediction steps = 76
[2023-04-11 17:30:36,113] [    INFO] -   Pre device batch size = 32
[2023-04-11 17:30:36,113] [    INFO] -   Total Batch size = 32
100%|███████████████████████████████████████████| 76/76 [00:01<00:00, 55.75it/s]
[2023-04-11 17:30:37,541] [    INFO] - ***** eval metrics *****
[2023-04-11 17:30:37,541] [    INFO] -   epoch                   =       12.0
[2023-04-11 17:30:37,541] [    INFO] -   eval_accuracy           =     0.9401
[2023-04-11 17:30:37,541] [    INFO] -   eval_loss               =     0.2693
[2023-04-11 17:30:37,541] [    INFO] -   eval_macro_f1           =     0.8951
[2023-04-11 17:30:37,541] [    INFO] -   eval_macro_precision    =     0.8971
[2023-04-11 17:30:37,541] [    INFO] -   eval_macro_recall       =     0.8938
[2023-04-11 17:30:37,541] [    INFO] -   eval_micro_f1           =     0.9401
[2023-04-11 17:30:37,541] [    INFO] -   eval_micro_precision    =     0.9401
[2023-04-11 17:30:37,541] [    INFO] -   eval_micro_recall       =     0.9401
[2023-04-11 17:30:37,541] [    INFO] -   eval_runtime            = 0:00:01.42
[2023-04-11 17:30:37,541] [    INFO] -   eval_samples_per_second =   1695.007
[2023-04-11 17:30:37,541] [    INFO] -   eval_steps_per_second   =     53.232
[2023-04-11 17:30:37,543] [    INFO] - Exporting inference model to checkpoint/export/model
[2023-04-11 17:30:43,826] [    INFO] - Inference model exported.
[2023-04-11 17:30:43,827] [    INFO] - tokenizer config file saved in checkpoint/export/tokenizer_config.json
[2023-04-11 17:30:43,827] [    INFO] - Special tokens file saved in checkpoint/export/special_tokens_map.json
[2023-04-11 17:30:43,827] [    INFO] - id2label file saved in checkpoint/export/id2label.json

4. Training results and alternative models

The script trains and evaluates automatically, and during training it saves the best model on the dev set to the specified output_dir. The saved files are structured as follows:

checkpoint/
├── export # static-graph model
├── config.json # model configuration file
├── model_state.pdparams # model parameters
├── tokenizer_config.json # tokenizer configuration file
├── vocab.txt
└── special_tokens_map.json
  • For Chinese training tasks (text may contain some English), the recommended models are "ernie-1.0-large-zh-cw", "ernie-3.0-tiny-base-v2-zh", "ernie-3.0-tiny-medium-v2-zh", "ernie-3.0-tiny-micro-v2-zh", "ernie-3.0-tiny-mini-v2-zh", "ernie-3.0-tiny-nano-v2-zh", "ernie-3.0-tiny-pico-v2-zh".
  • For English training tasks, the recommended models are "ernie-3.0-tiny-mini-v2-en", "ernie-2.0-base-en", "ernie-2.0-large-en".
  • For text classification in languages other than Chinese and English, the multilingual models "ernie-m-base" and "ernie-m-large", pretrained on 96 languages (including French, Japanese, Korean, German, Spanish, and almost all other common languages), are recommended; see the ERNIE-M paper for details.

V. Model Evaluation

After training, we can enable debug mode to evaluate each class separately; mispredicted samples are printed and saved to bad_case.txt. GPU is used by default; in a CPU environment change the argument to --device "cpu":

1. Start evaluation

!python train.py \
    --do_eval \
    --debug True \
    --device gpu \
    --model_name_or_path checkpoint \
    --output_dir checkpoint \
    --per_device_eval_batch_size 32 \
    --max_length 128 \
    --test_path './data/dev.txt'

2. Output log

[2023-04-11 17:38:48,156] [    INFO] - ----------------------------
[2023-04-11 17:38:48,156] [    INFO] - Class name: Calendar-Query
[2023-04-11 17:38:48,156] [    INFO] - Evaluation examples in dev dataset: 241(10.0%) | precision: 99.17 | recall: 99.17 | F1 score 99.17
[2023-04-11 17:38:48,156] [    INFO] - ----------------------------
[2023-04-11 17:38:48,156] [    INFO] - Class name: TVProgram-Play
[2023-04-11 17:38:48,156] [    INFO] - Evaluation examples in dev dataset: 47(1.9%) | precision: 71.43 | recall: 63.83 | F1 score 67.42
[2023-04-11 17:38:48,156] [    INFO] - ----------------------------
[2023-04-11 17:38:48,156] [    INFO] - Class name: Audio-Play
[2023-04-11 17:38:48,156] [    INFO] - Evaluation examples in dev dataset: 49(2.0%) | precision: 78.43 | recall: 81.63 | F1 score 80.00
[2023-04-11 17:38:48,156] [    INFO] - ----------------------------
[2023-04-11 17:38:48,156] [    INFO] - Class name: Other
[2023-04-11 17:38:48,156] [    INFO] - Evaluation examples in dev dataset: 40(1.7%) | precision: 65.85 | recall: 67.50 | F1 score 66.67
[2023-04-11 17:38:48,156] [    INFO] - ----------------------------
[2023-04-11 17:38:48,158] [    INFO] - Bad case in dev dataset saved in ./data/bad_case.txt
100%|███████████████████████████████████████████| 76/76 [00:01<00:00, 55.79it/s]

3. Error analysis

Mispredicted samples are flagged and collected as bad cases.

During text-classification prediction, questions such as "why did the model predict the wrong result" and "how can I improve the model" come up often. The Analysis module provides interpretability analysis, data optimization, and related features to help developers analyze prediction results and improve model performance.

See bad_case.txt for details.

!head -n10 data/bad_case.txt
Text	Label	Prediction
一禅小和尚第4集往后接着播映我要看呢	Video-Play	FilmTele-Play
济南日子的交通进行时还在直播中吗我想看下	TVProgram-Play	Video-Play
能否回放一下早上七点二十分的时事关提案吗我想看下	Video-Play	TVProgram-Play
播映一下那个启航	FilmTele-Play	Music-Play
电视只有声响而没有图画该打什么号码的电话	HomeAppliance-Control	Other
最近有什么新电影,调到小楚和野营的节目爱电影了解一下	Radio-Listen	TVProgram-Play
那射手座呢牧羊座呢牧羊座是白羊座吗	Other	Calendar-Query
飞轮海挂彩排舞谢歌迷为坚持最佳状态进补图	Other	Video-Play
吴彦祖还表明一旦老婆有了他就会停工一年当专业奶爸	Music-Play	FilmTele-Play

VI. Source Code Walkthrough

1. train.py


import functools
import json
import os
import shutil
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional
import numpy as np
import paddle
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    precision_recall_fscore_support,
)
from utils import log_metrics_debug, preprocess_function, read_local_dataset
from paddlenlp.data import DataCollatorWithPadding
from paddlenlp.datasets import load_dataset
from paddlenlp.trainer import (
    CompressionArguments,
    EarlyStoppingCallback,
    PdArgumentParser,
    Trainer,
)
from paddlenlp.transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    export_model,
)
from paddlenlp.utils.log import logger
# List of supported models
SUPPORTED_MODELS = [
    "ernie-1.0-large-zh-cw",
    "ernie-1.0-base-zh-cw",
    "ernie-3.0-xbase-zh",
    "ernie-3.0-base-zh",
    "ernie-3.0-medium-zh",
    "ernie-3.0-micro-zh",
    "ernie-3.0-mini-zh",
    "ernie-3.0-nano-zh",
    "ernie-3.0-tiny-base-v2-zh",
    "ernie-3.0-tiny-medium-v2-zh",
    "ernie-3.0-tiny-micro-v2-zh",
    "ernie-3.0-tiny-mini-v2-zh",
    "ernie-3.0-tiny-nano-v2-zh ",
    "ernie-3.0-tiny-pico-v2-zh",
    "ernie-2.0-large-en",
    "ernie-2.0-base-en",
    "ernie-3.0-tiny-mini-v2-en",
    "ernie-m-base",
    "ernie-m-large",
]
# Default arguments
# yapf: disable
@dataclass
class DataArguments:
    max_length: int = field(default=128, metadata={"help": "Maximum number of tokens for the model."})
    early_stopping: bool = field(default=False, metadata={"help": "Whether apply early stopping strategy."})
    early_stopping_patience: int = field(default=4, metadata={"help": "Stop training when the specified metric worsens for early_stopping_patience evaluation calls"})
    debug: bool = field(default=False, metadata={"help": "Whether choose debug mode."})
    train_path: str = field(default='./data/train.txt', metadata={"help": "Train dataset file path."})
    dev_path: str = field(default='./data/dev.txt', metadata={"help": "Dev dataset file path."})
    test_path: str = field(default='./data/dev.txt', metadata={"help": "Test dataset file path."})
    label_path: str = field(default='./data/label.txt', metadata={"help": "Label file path."})
    bad_case_path: str = field(default='./data/bad_case.txt', metadata={"help": "Bad case file path."})
@dataclass
class ModelArguments:
    model_name_or_path: str = field(default="ernie-3.0-tiny-medium-v2-zh", metadata={"help": "Build-in pretrained model name or the path to local model."})
    export_model_dir: Optional[str] = field(default=None, metadata={"help": "Path to directory to store the exported inference model."})
# yapf: enable
def main():
    """
    Training a binary or multi classification model
    """
    parser = PdArgumentParser((ModelArguments, DataArguments, CompressionArguments))
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
    if training_args.do_compress:
        training_args.strategy = "dynabert"
    if training_args.do_train or training_args.do_compress:
        training_args.print_config(model_args, "Model")
        training_args.print_config(data_args, "Data")
    paddle.set_device(training_args.device)
    # Define id2label
    id2label = {}
    label2id = {}
    with open(data_args.label_path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            l = line.strip()
            id2label[i] = l
            label2id[l] = i
    # Define model & tokenizer
    if os.path.isdir(model_args.model_name_or_path):
        model = AutoModelForSequenceClassification.from_pretrained(
            model_args.model_name_or_path, label2id=label2id, id2label=id2label
        )
    elif model_args.model_name_or_path in SUPPORTED_MODELS:
        model = AutoModelForSequenceClassification.from_pretrained(
            model_args.model_name_or_path, num_classes=len(label2id), label2id=label2id, id2label=id2label
        )
    else:
        raise ValueError(
            f"{model_args.model_name_or_path} is not a supported model type. Either use a local model path or select a model from {SUPPORTED_MODELS}"
        )
    tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)
    # load and preprocess dataset
    train_ds = load_dataset(read_local_dataset, path=data_args.train_path, label2id=label2id, lazy=False)
    dev_ds = load_dataset(read_local_dataset, path=data_args.dev_path, label2id=label2id, lazy=False)
    trans_func = functools.partial(preprocess_function, tokenizer=tokenizer, max_length=data_args.max_length)
    train_ds = train_ds.map(trans_func)
    dev_ds = dev_ds.map(trans_func)
    if data_args.debug:
        test_ds = load_dataset(read_local_dataset, path=data_args.test_path, label2id=label2id, lazy=False)
        test_ds = test_ds.map(trans_func)
    # Define the metric function.
    def compute_metrics(eval_preds):
        pred_ids = np.argmax(eval_preds.predictions, axis=-1)
        metrics = {}
        metrics["accuracy"] = accuracy_score(y_true=eval_preds.label_ids, y_pred=pred_ids)
        for average in ["micro", "macro"]:
            precision, recall, f1, _ = precision_recall_fscore_support(
                y_true=eval_preds.label_ids, y_pred=pred_ids, average=average
            )
            metrics[f"{average}_precision"] = precision
            metrics[f"{average}_recall"] = recall
            metrics[f"{average}_f1"] = f1
        return metrics
    def compute_metrics_debug(eval_preds):
        pred_ids = np.argmax(eval_preds.predictions, axis=-1)
        metrics = classification_report(eval_preds.label_ids, pred_ids, output_dict=True)
        return metrics
    # Define the early-stopping callback.
    if data_args.early_stopping:
        callbacks = [EarlyStoppingCallback(early_stopping_patience=data_args.early_stopping_patience)]
    else:
        callbacks = None
    # Define the Trainer
    trainer = Trainer(
        model=model,
        tokenizer=tokenizer,
        args=training_args,
        criterion=paddle.nn.loss.CrossEntropyLoss(),
        train_dataset=train_ds,
        eval_dataset=dev_ds,
        callbacks=callbacks,
        data_collator=DataCollatorWithPadding(tokenizer),
        compute_metrics=compute_metrics_debug if data_args.debug else compute_metrics,
    )
    # Training
    if training_args.do_train:
        train_result = trainer.train()
        metrics = train_result.metrics
        trainer.save_model()
        trainer.log_metrics("train", metrics)
        for checkpoint_path in Path(training_args.output_dir).glob("checkpoint-*"):
            shutil.rmtree(checkpoint_path)
    # Evaluation / prediction
    if training_args.do_eval:
        if data_args.debug:
            output = trainer.predict(test_ds)
            log_metrics_debug(output, id2label, test_ds, data_args.bad_case_path)
        else:
            eval_metrics = trainer.evaluate()
            trainer.log_metrics("eval", eval_metrics)
    # Model export
    if training_args.do_export:
        if model.init_config["init_class"] in ["ErnieMForSequenceClassification"]:
            input_spec = [paddle.static.InputSpec(shape=[None, None], dtype="int64", name="input_ids")]
        else:
            input_spec = [
                paddle.static.InputSpec(shape=[None, None], dtype="int64", name="input_ids"),
                paddle.static.InputSpec(shape=[None, None], dtype="int64", name="token_type_ids"),
            ]
        if model_args.export_model_dir is None:
            model_args.export_model_dir = os.path.join(training_args.output_dir, "export")
        export_model(model=trainer.model, input_spec=input_spec, path=model_args.export_model_dir)
        tokenizer.save_pretrained(model_args.export_model_dir)
        id2label_file = os.path.join(model_args.export_model_dir, "id2label.json")
        with open(id2label_file, "w", encoding="utf-8") as f:
            json.dump(id2label, f, ensure_ascii=False)
            logger.info(f"id2label file saved in {id2label_file}")
    # Model compression
    if training_args.do_compress:
        trainer.compress()
        for width_mult in training_args.width_mult_list:
            pruned_infer_model_dir = os.path.join(training_args.output_dir, "width_mult_" + str(round(width_mult, 2)))
            tokenizer.save_pretrained(pruned_infer_model_dir)
            id2label_file = os.path.join(pruned_infer_model_dir, "id2label.json")
            with open(id2label_file, "w", encoding="utf-8") as f:
                json.dump(id2label, f, ensure_ascii=False)
                logger.info(f"id2label file saved in {id2label_file}")
    for path in Path(training_args.output_dir).glob("runs"):
        shutil.rmtree(path)
if __name__ == "__main__":
    main()
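The compute_metrics function above only needs an object with predictions and label_ids attributes, so its behavior can be checked offline with mock logits. A minimal sketch (SimpleNamespace stands in for the EvalPrediction object the Trainer normally passes):

```python
from types import SimpleNamespace

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_preds):
    # Same logic as in train.py: argmax over logits, then accuracy plus
    # micro/macro precision, recall, and F1.
    pred_ids = np.argmax(eval_preds.predictions, axis=-1)
    metrics = {"accuracy": accuracy_score(y_true=eval_preds.label_ids, y_pred=pred_ids)}
    for average in ["micro", "macro"]:
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_true=eval_preds.label_ids, y_pred=pred_ids, average=average
        )
        metrics[f"{average}_precision"] = precision
        metrics[f"{average}_recall"] = recall
        metrics[f"{average}_f1"] = f1
    return metrics

# Mock: 4 samples, 3 classes; 3 of the 4 argmax predictions are correct.
logits = np.array([[2.0, 0.1, 0.1],
                   [0.1, 2.0, 0.1],
                   [0.1, 0.1, 2.0],
                   [2.0, 0.1, 0.1]])
labels = np.array([0, 1, 2, 1])
m = compute_metrics(SimpleNamespace(predictions=logits, label_ids=labels))
print(m["accuracy"])  # 0.75
```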

2. utils.py


import numpy as np
from paddlenlp.utils.log import logger
# Preprocessing
def preprocess_function(examples, tokenizer, max_length, is_test=False):
    """
    Builds model inputs from a sequence for sequence classification tasks
    by concatenating and adding special tokens.
    """
    result = tokenizer(examples["text"], max_length=max_length, truncation=True)
    if not is_test:
        result["labels"] = np.array([examples["label"]], dtype="int64")
    return result
# Read the dataset
def read_local_dataset(path, label2id=None, is_test=False):
    """
    Read dataset.
    """
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if is_test:
                sentence = line.strip()
                yield {"text": sentence}
            else:
                items = line.strip().split("\t")
                yield {"text": items[0], "label": label2id[items[1]]}
# Logging helper
def log_metrics_debug(output, id2label, dev_ds, bad_case_path):
    """
    Log metrics in debug mode.
    """
    predictions, label_ids, metrics = output
    pred_ids = np.argmax(predictions, axis=-1)
    logger.info("-----Evaluate model-------")
    logger.info("Dev dataset size: {}".format(len(dev_ds)))
    logger.info("Accuracy in dev dataset: {:.2f}%".format(metrics["test_accuracy"] * 100))
    logger.info(
        "Macro average | precision: {:.2f} | recall: {:.2f} | F1 score {:.2f}".format(
            metrics["test_macro avg"]["precision"] * 100,
            metrics["test_macro avg"]["recall"] * 100,
            metrics["test_macro avg"]["f1-score"] * 100,
        )
    )
    for i in id2label:
        l = id2label[i]
        logger.info("Class name: {}".format(l))
        i = "test_" + str(i)
        if i in metrics:
            logger.info(
                "Evaluation examples in dev dataset: {}({:.1f}%) | precision: {:.2f} | recall: {:.2f} | F1 score {:.2f}".format(
                    metrics[i]["support"],
                    100 * metrics[i]["support"] / len(dev_ds),
                    metrics[i]["precision"] * 100,
                    metrics[i]["recall"] * 100,
                    metrics[i]["f1-score"] * 100,
                )
            )
        else:
            logger.info("Evaluation examples in dev dataset: 0 (0%)")
        logger.info("----------------------------")
    with open(bad_case_path, "w", encoding="utf-8") as f:
        f.write("Text\tLabel\tPrediction\n")
        for i, (p, l) in enumerate(zip(pred_ids, label_ids)):
            p, l = int(p), int(l)
            if p != l:
                f.write(dev_ds.data[i]["text"] + "\t" + id2label[l] + "\t" + id2label[p] + "\n")
    logger.info("Bad case in dev dataset saved in {}".format(bad_case_path))

VII. Model Prediction

Use Taskflow for model prediction:

  • Load the model
  • Load the data
  • Run prediction

1. Load the model and predict a single sample

from paddlenlp import Taskflow
# Model prediction
cls = Taskflow("text_classification", task_path='checkpoint/export', is_static_model=True)
cls(["回放CCTV2的消费建议"])
[2023-04-11 17:42:26,315] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'checkpoint/export'.
W0411 17:42:26.472223   349 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0411 17:42:26.475904   349 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-04-11 17:42:29,395] [    INFO] - Load id2label from checkpoint/export/id2label.json.
[{'predictions': [{'label': 'TVProgram-Play', 'score': 0.9521104350237317}],
  'text': '回放CCTV2的消费建议'}]

2. Read the data to predict

Read the texts to be predicted into a list.

with open('data/test.txt', 'r') as file:
    mytests = file.readlines()
print(mytests[:3])
['回放CCTV2的消费建议\n', '给我翻开玩具房的灯\n', '循环播映赵本山的小品相亲来听\n']
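Note that readlines() keeps each trailing '\n', which then shows up in the 'text' field of the Taskflow output. Stripping whitespace first keeps the model input clean; a minimal sketch using an in-memory file in place of data/test.txt:

```python
import io

# Stand-in for open('data/test.txt'); each line ends with '\n'.
fake_file = io.StringIO("回放CCTV2的消费建议\n给我翻开玩具房的灯\n")

# strip() drops the trailing newline so it never reaches the classifier.
mytests = [line.strip() for line in fake_file if line.strip()]
print(mytests)  # ['回放CCTV2的消费建议', '给我翻开玩具房的灯']
```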

3. Batch prediction

result = cls(mytests)
print(result[:3])
[{'predictions': [{'label': 'TVProgram-Play', 'score': 0.9521104350237317}], 'text': '回放CCTV2的消费建议\n'}, {'predictions': [{'label': 'HomeAppliance-Control', 'score': 0.9970951493859599}], 'text': '给我翻开玩具房的灯\n'}, {'predictions': [{'label': 'Audio-Play', 'score': 0.9710607817649783}], 'text': '循环播映赵本山的小品相亲来听\n'}]

4. Save in the required format

# Write the submission file: header row, then one "ID,label" row per prediction
with open('/home/aistudio/result.txt', 'w') as f:
    f.write("ID,Target\n")
    for i in range(len(result)):
        f.write(f"{i+1},{result[i]['predictions'][0]['label']}\n")
!head -n10 /home/aistudio/result.txt
ID,Target
1,TVProgram-Play
2,HomeAppliance-Control
3,Audio-Play
4,Alarm-Update
5,HomeAppliance-Control
6,FilmTele-Play
7,FilmTele-Play
8,Music-Play
9,Calendar-Query

VIII. Submit the Results

End-to-end smart-home dialogue intent recognition with PaddleNLP.

  • Project: End-to-End Smart-Home Dialogue Intent Recognition with PaddleNLP – PaddlePaddle AI Studio
  • GitHub: livingbody/Conversational_intention_recognition: dialogue intent recognition with PaddleNLP


This article is taking part in the "Jinshi Plan".