
This article was first published on CSDN.

诸神缄默不语 - index of my CSDN blog posts

This post is part of my series of study notes covering the entire huggingface.transformers documentation. Full series link: huggingface transformers documentation study notes (continuously updated…)

This part of the docs: huggingface.co/docs/transf… It uses a text classification task as the example to show how to fine-tune a pretrained model with transformers. Since I mainly work with PyTorch, this post only covers fine-tuning with transformers.Trainer (docs: huggingface.co/docs/transf…) and with native PyTorch. The code in the tutorial is scattered, so at the end of each of those two parts I give the complete script. In addition, because ① I need to use my own datasets, and ② my server cannot easily go through a proxy, which makes datasets inconvenient, this post also spends some space on how to get the same functionality without datasets (while still covering the datasets usage that this part of the docs introduces). (For the problem of datasets and metrics failing to load from mainland China, see my earlier post: huggingface.datasets无法加载数据集的解决方案_诸神缄默不语的博客-CSDN博客.) Also note: some of the code was run in a jupyter notebook, some as scripts, and the environment changed in between, so the outputs may not be entirely consistent.

A Python environment in which the code in this post runs: Python 3.8, PyTorch 1.8.1, cudatoolkit 10.2, transformers 4.18.0, datasets 2, scikit-learn 1.0.2 (from what I have seen, other versions should work too; it does not matter much).

Training a pretrained model on a dataset for a specific task is called fine-tuning.

@[toc]

1. The datasets package

Official GitHub repo of the datasets package: huggingface/datasets: The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

The datasets package can load many public datasets and preprocess them. Its design was inspired by the TFDS project: tensorflow/datasets: TFDS is a collection of datasets ready to use with TensorFlow, Jax, …

1.1 Installing datasets

If you use anaconda to manage your environment and installed transformers with pip, you can install datasets with pip as well:

pip install datasets

Other installation options are covered in the datasets docs: huggingface.co/docs/datase…

1.2 A quick tour of datasets

The functions in this part follow the README of the official GitHub repo.

  1. All available datasets: all_available_datasets=datasets.list_datasets() List the first 5: all_available_datasets[:5] Output: ['assin', 'ar_res_reviews', 'ambig_qa', 'bianet', 'ag_news']
  2. Load a dataset: datasets.load_dataset(dataset_name, **kwargs) Using the "yelp_review_full" dataset from this tutorial as an example: dataset=datasets.load_dataset("yelp_review_full")
  3. All available metrics: datasets.list_metrics()
  4. Load a metric: datasets.load_metric(metric_name, **kwargs)
  5. Inspect a dataset (screenshot omitted; see the sketch after this list)
  6. Get a sample from the dataset: dataset['train'][123456] (screenshot omitted)
  7. Inspect the dataset schema: dataset['train'].features (screenshot omitted)
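To make these calls concrete, here is a minimal runnable sketch that strings them together (it assumes network access to the Hugging Face Hub; the sample index 123456 is arbitrary):

import datasets
# list the names of all dataset scripts available on the Hub
all_available_datasets = datasets.list_datasets()
print(all_available_datasets[:5])
# download (and cache) a dataset by name; returns a DatasetDict of splits
dataset = datasets.load_dataset("yelp_review_full")
# inspect the splits, one sample, and the column schema
print(dataset)
print(dataset['train'][123456])
print(dataset['train'].features)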

Other functions that we need will be explained alongside the examples later in the post.

1.3 Loading and preprocessing the Yelp Reviews dataset

The dataset's official page on huggingface: yelp_review_full Datasets at Hugging Face

This is a dataset for English short-text classification (sentiment classification): reviews from Yelp (a US review site) as text and the corresponding star ratings (1-5 stars) as label. It was extracted from the Yelp Dataset Challenge 2015 data and introduced in: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015)

Load and inspect the dataset:

from datasets import load_dataset
dataset = load_dataset("yelp_review_full")
dataset["train"][100]

Output:

{'label': 0,
 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. I\'ve worked at more than one location. I expect bad days, bad moods, and the occasional mistake. But I have yet to have a decent experience at this store. It will remain a place I avoid unless someone in my party needs to avoid illness from low blood sugar. Perhaps I should go back to the racially biased service of Steak n Shake instead!'}

Use datasets' map function (docs: huggingface.co/docs/datase…) to tokenize the raw text column and convert it into a format the model can read (the tokenizer output):

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mypath/bert-base-cased")
def tokenize_function(examples):
    return tokenizer(examples["text"],padding="max_length",truncation=True,max_length=512)
tokenized_datasets = dataset.map(tokenize_function, batched=True)

Output:

(a screenshot of the map progress bars)

The converted dataset:

(a screenshot: in addition to label and text, it now contains the tokenizer output columns such as input_ids)

Note that in the original tutorial the tokenizer call has no max_length argument, but then this step prints these warnings:

Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.

and training later fails with:

Traceback (most recent call last):
  File "mypath/huggingfacedatasets1.py", line 47, in <module>
    trainer.train()
  File "myenv/lib/python3.8/site-packages/transformers/trainer.py", line 1396, in train
    for step, inputs in enumerate(epoch_iterator):
  File "myenv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "myenv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 557, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "myenv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "myenv/lib/python3.8/site-packages/transformers/data/data_collator.py", line 66, in default_data_collator
    return torch_default_data_collator(features)
  File "myenv/lib/python3.8/site-packages/transformers/data/data_collator.py", line 130, in torch_default_data_collator
    batch[k] = torch.tensor([f[k] for f in features])
ValueError: expected sequence of length 72 at dim 1 (got 118)

This bug occurs because sequences in the same batch have different lengths, so the max_length argument has to be added. Strangely, when I ran the same thing on colab it just worked… but I could not find where the model's own max_length is defined. I use 512 because on colab len(tokenized_datasets['train'][0]['input_ids']) is 512, which means the default max_length is 512 anyway; why my local setup still needs it set by hand, who knows. My first guess was that it is because my pretrained_model_name_or_path argument is a local path rather than one of the known checkpoint names (screenshot of that name list omitted). Testing suggests that is not it; I compared (screenshots omitted) the case with max_length passed by hand, the case without it, and the case where the model's max_model_input_sizes is modified and max_length is still not passed. Which left me wondering what the hell that attribute is even for… Using dir(tokenizer) I found another attribute that also looks like a match, model_max_length, and testing shows that this is the one that actually matters (screenshot omitted).
I did not redo the whole experiment with it, because I think it is unnecessary: the two setups are equivalent (see section 2.2 of my earlier post huggingface.transformers术语表_诸神缄默不语的博客-CSDN博客: tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=512) pads/truncates every sequence to length 512, while tokenizer(batch_sentences, padding='max_length', truncation=True) pads/truncates every sequence to the model's max_length; when the model's max_length is 512 the two are the same). In fact I think passing max_length explicitly is probably the better habit anyway, since it keeps the code under your control.
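For reference, a minimal sketch of inspecting these two attributes directly (the local path is a placeholder; the comments describe what the discussion above suggests, not a guarantee):

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mypath/bert-base-cased")
# the length that padding/truncation fall back to when max_length is omitted;
# for a recognized checkpoint name this is 512, for a local path it may be a huge sentinel value
print(tokenizer.model_max_length)
# a per-checkpoint lookup table; modifying it after loading did not change truncation behaviour in my tests
print(tokenizer.max_model_input_sizes)
# pinning model_max_length (or always passing max_length=512 explicitly) avoids the warnings
tokenizer.model_max_length = 512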


To speed up training in the example, we sample a smaller dataset:

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

1.4 Converting a custom dataset into the datasets Dataset format

For more details see the datasets docs: huggingface.co/docs/datase…

(I wrote this subsection mainly because converting a custom dataset to a datasets.Dataset makes things easier when using Trainer later: Trainer's *_dataset arguments accept either a datasets.Dataset or a torch dataset. With a datasets.Dataset it appears to feed the model the arguments it needs automatically by column name, and as this tutorial's example shows, plain lists are fine as column values; Trainer automatically removes the other columns (the training output later shows this). With a torch dataset, how am I supposed to know what it wants to be fed? That is a hassle, so converting straight to datasets.Dataset avoids the trouble.)

This post uses the yelp_review_full data as an in-memory dict object as the example (other in-memory formats are similar; I have not yet considered data too large to load into memory at once, and will deal with that if I ever run into it) (Dataset.from_dict() docs: huggingface.co/docs/datase…):

① Convert the small_train_dataset obtained above into a dict (keys are column names, values are the column values as lists) to use as example data:

example_dict={'label':small_train_dataset['label'],'text':small_train_dataset['text']}

② Convert the dict into a Dataset:

example_dataset=datasets.Dataset.from_dict(example_dict)
example_dataset
Dataset({
    features: ['label', 'text'],
    num_rows: 1000
})

I have not found how to combine Datasets into a DatasetDict, but it does not look necessary: a DatasetDict is essentially a dict of Datasets, and an operation on a DatasetDict should amount to applying the same in-place operation to each of its values. There is no need to use this class specifically; if you need DatasetDict-like behaviour, just apply the operation to every Dataset.
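If you ever do need one, building a DatasetDict directly from a plain dict of Dataset objects appears to work; a minimal sketch (the toy columns here are made up):

import datasets
train_ds = datasets.Dataset.from_dict({'label': [0, 1], 'text': ['bad', 'good']})
test_ds = datasets.Dataset.from_dict({'label': [1], 'text': ['great']})
# DatasetDict behaves like a dict of Datasets; map/filter/etc. are applied to every split
dataset_dict = datasets.DatasetDict({'train': train_ds, 'test': test_ds})
print(dataset_dict)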

2. Fine-tuning with Trainer (with PyTorch as the backend framework)


2.1 Defining the classification model

This dataset has 5 label classes.

from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("mypath/bert-base-cased", num_labels=5)

Output:

Some weights of the model checkpoint at mypath/bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at mypath/bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

(For an explanation of this output, see my earlier post: Some weights of the model checkpoint at mypath/bert-base-chinese were not used when initializing Ber_诸神缄默不语的博客-CSDN博客)

The same thing can also be written as:

from transformers import AutoConfig,AutoModelForSequenceClassification
model_path="mypath/bert-base-cased"
config=AutoConfig.from_pretrained(model_path,num_labels=5)
model=AutoModelForSequenceClassification.from_pretrained(model_path,config=config)

2.2 Training hyperparameters

The TrainingArguments class (docs: huggingface.co/docs/transf…) holds all the tunable hyperparameters and training settings. This tutorial uses the default hyperparameters.

Define where checkpoints are stored:

from transformers import TrainingArguments
training_args = TrainingArguments(output_dir="test_trainer")
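This tutorial sticks to the defaults, but for reference, here is a sketch spelling out a few commonly tuned arguments (the values shown are the defaults or purely illustrative, not recommendations):

from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="test_trainer",          # where checkpoints and logs are written
    num_train_epochs=3,                 # default 3
    per_device_train_batch_size=8,      # default 8 (per GPU)
    learning_rate=5e-5,                 # default 5e-5
    weight_decay=0.0,
    logging_steps=50,                   # how often the training loss is logged
    save_strategy="epoch",              # write a checkpoint at the end of each epoch
)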

2.3 Metrics

Trainer does not evaluate the model automatically, so you need to pass it a function that computes and reports metrics. For more on metrics see: huggingface.co/docs/datase…

The huggingface page for the accuracy metric: Hugging Face – The AI community building the future.

Load the accuracy metric:

import numpy as np
import datasets
metric=datasets.load_metric("accuracy")

Calling compute() on the metric computes the accuracy of the predictions (derived from the logits in the model's return value):

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
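Since the environment listed above already includes scikit-learn, compute_metrics can just as well return several metrics at once without going through datasets; a sketch (the macro-F1 here is my own addition, not part of the original tutorial):

import numpy as np
from sklearn.metrics import accuracy_score, f1_score
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # Trainer prefixes these keys with "eval_" when reporting
    return {"accuracy": accuracy_score(labels, predictions),
            "f1_macro": f1_score(labels, predictions, average="macro")}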

If you want to monitor how the metrics change during fine-tuning, set the evaluation_strategy hyperparameter in TrainingArguments so that metrics on the evaluation set are reported at the end of every epoch:

training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

2.4 Trainer

Define the Trainer object:

from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

Start training:

trainer.train()

Output when run as a script (you can already see here that the text column is not passed to the model):

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
myenv/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
***** Running training *****
  Num examples = 1000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 96
  0%|                                                                                  | 0/96 [00:00<?, ?it/s]myenv/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
 33%|████████████████████████▎                                                | 32/96 [00:19<00:23,  2.73it/s]The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
{'eval_loss': 1.219325304031372, 'eval_accuracy': 0.487, 'eval_runtime': 5.219, 'eval_samples_per_second': 191.609, 'eval_steps_per_second': 6.131, 'epoch': 1.0}                                                           
 33%|████████████████████████▎                                                | 32/96 [00:24<00:23,  2.73it/smyenv/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
 67%|████████████████████████████████████████████████▋                        | 64/96 [00:37<00:11,  2.87it/s]The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
{'eval_loss': 1.0443027019500732, 'eval_accuracy': 0.57, 'eval_runtime': 5.1937, 'eval_samples_per_second': 192.539, 'eval_steps_per_second': 6.161, 'epoch': 2.0}                                                          
 67%|████████████████████████████████████████████████▋                        | 64/96 [00:42<00:11,  2.87it/smyenv/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
100%|█████████████████████████████████████████████████████████████████████████| 96/96 [00:55<00:00,  2.87it/s]The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
{'eval_loss': 0.9776290655136108, 'eval_accuracy': 0.598, 'eval_runtime': 5.2137, 'eval_samples_per_second': 191.803, 'eval_steps_per_second': 6.138, 'epoch': 3.0}                                                         
100%|█████████████████████████████████████████████████████████████████████████| 96/96 [01:00<00:00,  2.87it/s]
Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 60.8009, 'train_samples_per_second': 49.341, 'train_steps_per_second': 1.579, 'train_loss': 1.0931960741678874, 'epoch': 3.0}
100%|█████████████████████████████████████████████████████████████████████████| 96/96 [01:00<00:00,  1.58it/s]

The jupyter notebook output, which looks a bit clearer than the script output:

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
myenv/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
***** Running training *****
  Num examples = 1000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 96
myenv/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '

(screenshot of the notebook training progress display omitted)

The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
myenv/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
myenv/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
Training completed. Do not forget to share your model on huggingface.co/models =)
TrainOutput(global_step=96, training_loss=1.1009167830149333, metrics={'train_runtime': 60.9212, 'train_samples_per_second': 49.244, 'train_steps_per_second': 1.576, 'total_flos': 789354427392000.0, 'train_loss': 1.1009167830149333, 'epoch': 3.0})

Since I also ran it once on colab for debugging, here is the colab output as well (I used a GPU there too, yet it was much slower than locally, no idea why. I do have 4 cards locally, but that clearly does not account for a slowdown this large):

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
/usr/local/lib/python3.7/dist-packages/transformers/optimization.py:309: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  FutureWarning,
***** Running training *****
  Num examples = 1000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 375

(screenshot of the colab training progress display omitted)

The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
Training completed. Do not forget to share your model on huggingface.co/models =)
TrainOutput(global_step=375, training_loss=1.2140440266927084, metrics={'train_runtime': 780.671, 'train_samples_per_second': 3.843, 'train_steps_per_second': 0.48, 'total_flos': 789354427392000.0, 'train_loss': 1.2140440266927084, 'epoch': 3.0})

(Note also the torch.nn.parallel warning: it did not appear on colab. I suspect it is either because colab has only one GPU, or because of the torch version (locally I use PyTorch 1.8.1, colab has PyTorch 1.10). That is hard to verify, so it stays a guess.)
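Once trainer.train() returns, the same Trainer object can also be used to run a final evaluation and to save the fine-tuned weights; a minimal sketch (the save path is a placeholder):

# one more pass over eval_dataset; returns a dict such as {'eval_loss': ..., 'eval_accuracy': ...}
print(trainer.evaluate())
# save the model weights and config so they can be reloaded with from_pretrained()
trainer.save_model("test_trainer/final_model")
# the tokenizer is saved separately
tokenizer.save_pretrained("test_trainer/final_model")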

2.5 The complete script

import datasets
import numpy as np
from transformers import AutoTokenizer,AutoModelForSequenceClassification,TrainingArguments,Trainer
dataset=datasets.load_from_disk("datasets/yelp_full_review_disk")
tokenizer = AutoTokenizer.from_pretrained("pretrained_models/bert-base-cased")
def tokenize_function(examples):
    return tokenizer(examples["text"],padding="max_length",truncation=True,max_length=512)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
model = AutoModelForSequenceClassification.from_pretrained("pretrained_models/bert-base-cased",
                                                            num_labels=5)
training_args = TrainingArguments(output_dir="pt_save_pretrained",evaluation_strategy="epoch")
metric=datasets.load_metric('datasets/accuracy.py')
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()

3. Fine-tuning with native PyTorch

Trainer is nice, but it does far too much behind the scenes and is a pain to debug; sometimes it is easier to just write the loop in native PyTorch.

For background on this part, see my earlier post 60分钟闪击速成PyTorch(Deep Learning with PyTorch: A 60 Minute Blitz)学习笔记_诸神缄默不语的博客-CSDN博客

One training loop: feed the training data to the model and get predictions → compute the loss → compute the gradients → update the parameters → feed the training data to the model again and get new predictions


If you are continuing in the same notebook after the code above, it is a good idea to delete the previous model, Trainer, etc. and clear the CUDA cache first to free memory, or simply restart the notebook:

del model
del trainer
torch.cuda.empty_cache()

3.1 Dataset

Preprocess the dataset (how to build the required dataset from native Python objects is covered further below):

from torch.utils.data import DataLoader
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
#drop the text column, which the model does not use
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
#rename the label column to labels, the argument name that AutoModelForSequenceClassification's forward() expects
#(oddly, the Trainer path worked with a column named label; presumably its default data collator handles that rename)
tokenized_datasets.set_format("torch")  #convert the values to torch.Tensor objects
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
#sample a subset of the data to get through the tutorial quickly
train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)
#wrap the datasets in DataLoaders: each batch is a dict of key-value pairs, and the data is later passed to the model via **batch
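To see what the DataLoader actually yields (and therefore what **batch unpacks into the model later), a quick sketch:

batch = next(iter(train_dataloader))
print(batch.keys())   # labels, input_ids, token_type_ids, attention_mask
print({k: v.shape for k, v in batch.items()})   # labels: [8]; the other tensors: [8, 512]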

Using your own dataset:

The example data is built like this:

example_dict={'labels':dataset['train']['label'],'text':dataset['train']['text']}

Inspect the data:

print(type(example_dict['labels']))
print(example_dict['labels'][12345])
print(type(example_dict['text']))
print(example_dict['text'][12345])

Output:

<class 'list'>
2
<class 'list'>
I went here in search of a crepe with Nutella and I got a really good crepe. I wouldn't exactly say this place is authentic French because you've got Americans cooking the food,  but my crepe was still good. \n\nIt doesn't taste like the ones I had in France, Carmon's puts a twist on (or maybe it was just overcooked) theirs by making the crepe more firm. \n\nThe whipped cream was also made fresh and delightful. The prices were horrid though.\n\nCrepes don't cost that much to make, so they're clearly overpricing here. Price is the only reason I won't come back so often.

① Use torch's Dataset and DataLoader classes (which ends up being essentially the same as what we got from datasets.Dataset above):

import torch
from torch.utils.data import Dataset,DataLoader
#define the Dataset
class YelpDataset(Dataset):
    def __init__(self,dict_data) -> None:
        """
        dict_data: data as a dict; key 'labels' maps to the list of labels (numeric), key 'text' maps to the list of texts
        """
        super(YelpDataset,self).__init__()
        self.data=dict_data
    def __getitem__(self, index):
        return [self.data['text'][index],self.data['labels'][index]]
        #returns a list: the first element is the text, the second is the label
    def __len__(self):
        return len(self.data['text'])
#define the collate function
def collate_fn(batch):
    pt_batch=tokenizer([b[0] for b in batch],padding=True,truncation=True,max_length=512,
                        return_tensors='pt')
    labels=torch.tensor([b[1] for b in batch])
    return {'labels':labels,'input_ids':pt_batch['input_ids'],'token_type_ids':pt_batch['token_type_ids'],
            'attention_mask':pt_batch['attention_mask']}
train_data=YelpDataset(example_dict)
train_dataloader=DataLoader(train_data,batch_size=8,shuffle=True,collate_fn=collate_fn)

② Hand-roll the batching: in each training loop, iterate like this (most of the variable names should be self-explanatory, so I will not go through them in detail):

#training part
#(the validation part is similar)
batch_size=8  #batch size; not defined in the original snippet, set here so the loop runs
train_data_length=len(example_dict['labels'])
if train_data_length%batch_size==0:
    batch_num=int(train_data_length/batch_size)
else:
    batch_num=int(train_data_length/batch_size)+1
for b in range(batch_num):
    index_begin=b*batch_size
    index_end=min(train_data_length,index_begin+batch_size)
    this_batch_text=example_dict['text'][index_begin:index_end]
    this_batch_labels=example_dict['labels'][index_begin:index_end]
    pt_batch=tokenizer(this_batch_text,padding=True,truncation=True,max_length=512,return_tensors='pt')
    #I won't bother splitting pt_batch up by key here; the training code that follows is like the DataLoader version and should be self-evident, so it is omitted

3.2 The neural network model

Define the classification model:

from transformers import AutoModelForSequenceClassification
model=AutoModelForSequenceClassification.from_pretrained("mypath/bert-base-cased",
                                                        num_labels=5)

3.3 Optimizer and learning rate scheduler

As seen earlier, transformers' Trainer calls transformers' own AdamW optimizer by default, which triggers this warning: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning

So stop using the old AdamW and use PyTorch's official AdamW optimizer instead:

from torch.optim import AdamW
optimizer = AdamW(model.parameters(), lr=5e-5)

Create the default learning rate scheduler that Trainer uses:

from transformers import get_scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

3.4 Device

Pick the device (single-GPU case) and move the model onto it:

import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

3.5 Training Loop

The tqdm package's website: tqdm.github.io/

from tqdm.auto import tqdm
progress_bar = tqdm(range(num_training_steps))
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

Output:

(a tqdm progress bar; screenshot omitted)
In real code you would also add things like early stopping and saving the checkpoint with the best validation metric; a rough sketch of that follows.
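The sketch below merges the training loop above with the per-epoch evaluation described in the next subsection, plus a simple patience counter; it reuses the objects defined above (model, optimizer, lr_scheduler, dataloaders, device), and the patience value and save path are arbitrary choices of mine, not from the tutorial:

import torch
import datasets
metric = datasets.load_metric("accuracy")
best_accuracy, patience, bad_epochs = 0.0, 2, 0
for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
    # evaluate on the validation data at the end of each epoch
    model.eval()
    for batch in eval_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            logits = model(**batch).logits
        metric.add_batch(predictions=torch.argmax(logits, dim=-1), references=batch["labels"])
    accuracy = metric.compute()["accuracy"]   # compute() also resets the accumulated batches
    if accuracy > best_accuracy:
        best_accuracy, bad_epochs = accuracy, 0
        model.save_pretrained("best_model")   # keep the best checkpoint so far
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"early stopping after epoch {epoch}")
            break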

3.6 Metrics

As with Trainer, use a Metric from the datasets package to compute metrics. Here evaluation happens after training finishes, accumulating all batches with the Metric's add_batch() function (docs: huggingface.co/docs/datase…).

from datasets import load_metric
metric = load_metric("accuracy")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])
metric.compute()

Output: {'accuracy': 0.588}

3.7 The complete script

from tqdm.auto import tqdm
import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW
import datasets
from transformers import AutoTokenizer,AutoModelForSequenceClassification,get_scheduler
dataset=datasets.load_from_disk("datasets/yelp_full_review_disk")
tokenizer = AutoTokenizer.from_pretrained("pretrained_models/bert-base-cased")
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length",truncation=True,max_length=512)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
#Postprocess dataset
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
#drop the text column, which the model does not use
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
#rename the label column to labels, the argument name that the model's forward() expects
tokenized_datasets.set_format("torch")  #convert the values to torch.Tensor objects
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)
model=AutoModelForSequenceClassification.from_pretrained\
                        ("pretrained_models/bert-base-cased",num_labels=5)
optimizer = AdamW(model.parameters(), lr=5e-5)
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
progress_bar = tqdm(range(num_training_steps))
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
metric=datasets.load_metric('datasets/accuracy.py')
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])
print(metric.compute())

4. Other learning resources given in the tutorial

  1. Transformers Examples: I plan to write study-note posts on these.
  2. Transformers Notebooks: I may write study notes on these as well.