
Overview

In an earlier post, NLP实战高手课学习笔记(14):文本分类实践1–预处理与模型定义, I walked through a simple text-classification example from the NLP course. That example used torchtext to process the dataset and a plain LSTM as the model, neither of which reflects today's mainstream practice. This post shows how to use the Huggingface Transformers library to quickly implement the modern paradigm for text classification.

Task Overview

Sentiment analysis is the process of automatically labeling data by its sentiment, such as positive, negative, or neutral. It lets companies analyze data at scale, surface insights, and automate workflows. A well-known sentiment analysis benchmark, the IMDB movie-review dataset, contains a large number of user reviews, each carrying strongly positive or negative sentiment. Here is a sample:

It hurt to watch this movie, it really did... I wanted to like it, even going in.
Shot obviously for very little cash, I looked past and told myself to appreciate the inspiration. 
Unfortunately, although I did appreciate the film on that level, the acting and editing was terrible, and the last 25-30 minutes were severe thumb-twiddling territory. 
A 95 minute film should not drag. The ratings for this one are good so far, but I fear that the friends and family might have had a say in that one. What was with those transitions? 
Dear Mr. Editor, did you just purchase your first copy of Adobe Premiere and make it your main goal to use all the goofy transitions that come with that silly program? 
Anyway... some better actors, a little more passion, and some more appealing editing and this makes a decent movie.

The passage above is labeled as negative sentiment; our model needs to perform binary classification on long texts like this one.
In the previous post we looked at Huggingface's Pipeline module, so let's try it on this task first.

Using the Pipeline

We can use the pipeline API from the Transformers library for rapid development:

from transformers import pipeline
# With no model specified, the task name selects a default sentiment checkpoint.
sentiment_pipeline = pipeline("sentiment-analysis")
data = ["I love you", "I hate you"]
sentiment_pipeline(data)

Here we pass two test samples in as a list, and the output is:

[{'label': 'POSITIVE', 'score': 0.9998}, {'label': 'NEGATIVE', 'score': 0.9991}]

As expected, the pipeline classifies both samples correctly, although these two examples are admittedly quite easy.
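When no model name is given, the pipeline falls back to a default checkpoint, which can change between library versions. A minimal sketch of pinning the model explicitly (the checkpoint below is the documented default for this task at the time of writing):

from transformers import pipeline

# Pinning the checkpoint keeps results reproducible across library upgrades.
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(sentiment_pipeline(["The plot was thin but the acting saved it."]))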

Training a Model with the Trainer

Above we tried out sentiment analysis with the Pipeline; now let's get to the main topic: training our own model with the Trainer API.

Loading the Dataset

First, we load the IMDB review data. Since the full dataset is large, we only take 3000 examples for training and 300 for testing.

from datasets import load_dataset
imdb = load_dataset("imdb")
# Shuffle with a fixed seed, then take small slices for a quick experiment.
small_train_dataset = imdb["train"].shuffle(seed=42).select(range(3000))
small_test_dataset = imdb["test"].shuffle(seed=42).select(range(300))
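As a quick sanity check, we can inspect one example; this assumes the standard IMDB schema, where each record has a text field and an integer label (0 for negative, 1 for positive):

# Peek at one training example to confirm the fields we will preprocess.
sample = small_train_dataset[0]
print(sample["label"])       # 0 = negative, 1 = positive
print(sample["text"][:200])  # first 200 characters of the review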

Data Preprocessing

With the data in hand, we tokenize and encode it:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess_function(examples):
    # Truncate reviews that exceed the model's maximum input length.
    return tokenizer(examples["text"], truncation=True)

tokenized_train = small_train_dataset.map(preprocess_function, batched=True)
tokenized_test = small_test_dataset.map(preprocess_function, batched=True)

Here, every input text is run through the tokenizer and converted into a sequence of input token ids.
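To make this concrete, here is a small sketch of what a single call produces: the token ids the model consumes plus an attention mask marking real tokens (the exact ids depend on the vocabulary):

# A single encoded sentence: ids for [CLS], the word pieces, and [SEP].
encoded = tokenizer("I love this movie", truncation=True)
print(encoded["input_ids"])       # e.g. [101, 1045, 2293, 2023, 3185, 102]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 1]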

To speed up training, we use a data collator that converts the training samples into PyTorch tensors and dynamically pads each batch to the length of its longest sequence:

from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
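A minimal sketch of what the collator does, using toy token ids: given feature dicts of unequal length, it pads them to the longest sequence in the batch and builds the matching attention mask:

# Pad a toy batch of two sequences with different lengths.
batch = data_collator([
    {"input_ids": [101, 2023, 102]},              # 3 tokens
    {"input_ids": [101, 2023, 3185, 2003, 102]},  # 5 tokens
])
print(batch["input_ids"].shape)  # torch.Size([2, 5]) -- padded to the longest
print(batch["attention_mask"])   # zeros mark the padded positions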

Loading the Model

Next, we initialize the model, here taking DistilBERT as an example:

from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Setting num_labels=2 makes it explicit that this is a binary classification task.
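Optionally, you can also attach human-readable label names at load time, so the saved model reports NEGATIVE/POSITIVE instead of the generic LABEL_0/LABEL_1. A sketch (the label names here are our own choice):

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    # Map class indices to readable names; stored in the model config.
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1},
)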

Defining the Evaluation Function

Next, we need an evaluation function to measure the model's performance. It is defined as follows:

import numpy as np
from datasets import load_metric

def compute_metrics(eval_pred):
    load_accuracy = load_metric("accuracy")
    load_f1 = load_metric("f1")
    logits, labels = eval_pred
    # Convert raw logits to predicted class ids.
    predictions = np.argmax(logits, axis=-1)
    accuracy = load_accuracy.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = load_f1.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}

Here we compute the prediction accuracy and the F1 score for the binary classification task.
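Note that load_metric is deprecated and has been removed from recent versions of datasets. If you hit that, the standalone evaluate library provides the same metrics; a roughly equivalent sketch:

import numpy as np
import evaluate  # pip install evaluate

# Load the metrics once instead of on every evaluation call.
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}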

Initializing the Trainer

Next, we initialize a Trainer to prepare for training. First, we define the training arguments:

from transformers import TrainingArguments, Trainer
repo_name = "./finetuning-sentiment-model-1000-samples"
training_args = TrainingArguments(
    output_dir=repo_name,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    save_strategy="epoch",
    push_to_hub=False,
)

Here, we set the checkpoint output path to a local directory, the learning rate to 2e-5, both the training and evaluation batch sizes to 16, the number of training epochs to 2, and the weight decay to 0.01.
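With these arguments, the metrics from compute_metrics are only produced when trainer.evaluate() is called explicitly. If you would like them reported after every epoch, TrainingArguments also accepts an evaluation strategy; a minimal sketch (note that this keyword was renamed to eval_strategy in recent transformers releases):

training_args = TrainingArguments(
    output_dir=repo_name,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # run compute_metrics at the end of each epoch
    save_strategy="epoch",
    push_to_hub=False,
)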

With the arguments in place, we instantiate a Trainer:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

Training

Finally, training takes a single line of code:

trainer.train()

When training finishes, the output looks like this:

TrainOutput(global_step=126, training_loss=0.39784825037396143, metrics={'train_runtime': 7407.6124, 'train_samples_per_second': 0.27, 'train_steps_per_second': 0.017, 'total_flos': 263009880425280.0, 'train_loss': 0.39784825037396143, 'epoch': 2.0})
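The TrainOutput above only reports the training loss. To get the accuracy and F1 defined in compute_metrics on the held-out test split, call evaluate explicitly:

# Run the model over eval_dataset and apply compute_metrics.
metrics = trainer.evaluate()
print(metrics)  # includes keys such as eval_loss, eval_accuracy, eval_f1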
