This article is shared from the Huawei Cloud Community post "Multimodal Contrastive Language-Image Pre-training (CLIP): Breaking the Boundary Between Language and Vision", by 汀丶.

CLIP is a neural network trained with multimodal (image, text) contrastive learning. Given an image, it can use natural language to predict the most relevant text snippet, without being optimized for any specific task. Like GPT-2 and GPT-3, CLIP shows excellent zero-shot capability and can be applied to a wide range of multimodal tasks.

  • Multimodal Contrastive Language-Image Pre-training (CLIP) is a neural network model that learns the association between images and text through multimodal contrastive training. Unlike traditional single-modal pre-trained models, CLIP processes images and text together and therefore captures the semantic relationship between them better.
  • CLIP resembles GPT-2 and GPT-3 in its zero-shot ability, but it is not an autoregressive language model: it learns the mapping between images and text through contrastive learning (a simplified sketch of this objective follows this list). During training, CLIP receives an image together with a related text snippet and learns how to associate the information from the two modalities. In this way, CLIP learns to match an image with the corresponding text, so that, given an image, it can use natural language to predict the most relevant text snippet.
  • Because CLIP is trained with a contrastive objective, it can perform well on a variety of multimodal tasks without being optimized for any specific task. This makes CLIP a general-purpose multimodal pre-trained model that can be widely applied to areas such as image captioning, visual question answering, and image generation.
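To make the contrastive objective above concrete, here is a minimal sketch of a symmetric image-text contrastive loss in PyTorch, in the spirit of the pseudocode in the CLIP paper; the batch size, embedding dimension, and temperature below are illustrative placeholders, not CLIP's actual training configuration.

import torch
import torch.nn.functional as F

def contrastive_loss(image_features, text_features, temperature=0.07):
    # Row i of each tensor is assumed to come from the same (image, text) pair.
    image_features = F.normalize(image_features, dim=-1)  # L2-normalize so dot products
    text_features = F.normalize(text_features, dim=-1)    # become cosine similarities
    logits = image_features @ text_features.t() / temperature
    # The matching pair sits on the diagonal: image i should select text i, and vice versa.
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text-to-image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings standing in for encoder outputs.
print(contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)).item())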

An introduction to multimodal contrastive language-image pre-training (CLIP)

CLIP (Contrastive Language-Image Pre-training) is a neural network trained on a variety of (image, text) pairs. Given an image, it can be instructed in natural language to predict the most relevant text snippet, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and GPT-3. CLIP matches the performance of the original ResNet-50 on ImageNet "zero-shot", without using any of the original 1.28 million labeled examples, overcoming several major challenges in computer vision.


1. Installation

ftfy
regex
tqdm
torch
torchvision

$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0

$ pip install ftfy regex tqdm

$ pip install git+https://github.com/openai/CLIP.git

Replace cudatoolkit=11.0 above with the appropriate CUDA version on your machine or cpuonly when installing on a machine without a GPU.
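For example, a CPU-only conda install of the same PyTorch version could look like this:

$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cpuonly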

import torch
import clip
from PIL import Image
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()
print("Label probs:", probs) # prints: [[0.9927937 0.00421068 0.00299572]]
  • API

The CLIP module clip provides the following methods:

  • clip.available_models()

Returns the names of the available CLIP models.
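For example (the exact list depends on the installed version of the clip package):

import clip

print(clip.available_models())
# e.g. ['RN50', 'RN101', 'RN50x4', ..., 'ViT-B/32', 'ViT-B/16']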

  • clip.load(name, device=…, jit=False)

Returns the model and the TorchVision transform needed by the model, specified by the model name returned by clip.available_models(). It will download the model as necessary. The name argument can also be a path to a local checkpoint.

The device to run the model on can be optionally specified; the default is to use the first CUDA device if there is any, otherwise the CPU. When jit is False, a non-JIT version of the model will be loaded.
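For instance, to load a non-JIT model explicitly on the CPU (the model name is one of those returned by clip.available_models()):

import clip

# Load the non-JIT "ViT-B/32" model and its TorchVision preprocessing pipeline on the CPU.
model, preprocess = clip.load("ViT-B/32", device="cpu", jit=False)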

  • clip.tokenize(text: Union[str, List[str]], context_length=77)

Returns a LongTensor containing the tokenized sequences of the given text input(s). This can be used as the input to the model.
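A quick check of the tokenizer output shape, with the default context_length of 77:

import clip

tokens = clip.tokenize(["a diagram", "a photo of a cat"])
print(tokens.shape)  # torch.Size([2, 77]): one row per input string, padded to context_length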

The model returned by clip.load() supports the following methods:

  • model.encode_image(image: Tensor)

Given a batch of images, returns the image features encoded by the vision portion of the CLIP model.

  • model.encode_text(text: Tensor)

Given a batch of text tokens, returns the text features encoded by the language portion of the CLIP model.

  • model(image: Tensor, text: Tensor)

Given a batch of images and a batch of text tokens, returns two tensors containing the logit scores corresponding to each image and text input. The values are the cosine similarities between the corresponding image and text features, times 100.
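The sketch below illustrates this relationship by recomputing the logits from normalized features, reusing the imports and variables (model, image, text) from the quick-start snippet above:

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    logits_per_image, logits_per_text = model(image, text)

    # Cosine similarity between L2-normalized image and text features, times 100,
    # approximately reproduces the logits returned by the forward pass.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    print(100.0 * image_features @ text_features.T)
    print(logits_per_image)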

2. Examples

2.1 Zero-shot capability

The code below performs zero-shot prediction using CLIP, as shown in Appendix B of the CLIP paper. This example takes an image from the CIFAR-100 dataset and predicts the most likely label among the dataset's 100 text labels.

import os
import clip
import torch
from torchvision.datasets import CIFAR100
#Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)
#Download the dataset
cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False)
#Prepare the inputs
image, class_id = cifar100[3637]
image_input = preprocess(image).unsqueeze(0).to(device)
text_inputs = torch.cat([clip.tokenize(f"a photo of a {c}") for c in cifar100.classes]).to(device)
#Calculate features
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)
#Pick the top 5 most similar labels for the image
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
values, indices = similarity[0].topk(5)
#Print the result
print("nTop predictions:n")
for value, index in zip(values, indices):
print(f"{cifar100.classes[index]:>16s}: {100 * value.item():.2f}%")

The output will look like the following (the exact numbers may vary slightly depending on the compute device):

Top predictions:

           snake: 65.31%
          turtle: 12.29%
    sweet_pepper: 3.83%
          lizard: 1.88%
       crocodile: 1.75%

Note that this example uses the encode_image() and encode_text() methods that return the encoded features of given inputs.

2.2 Linear-probe evaluation

The example below uses scikit-learn to perform logistic regression on image features.

import os
import clip
import torch
import numpy as np
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100
from tqdm import tqdm
#Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)
#Load the dataset
root = os.path.expanduser("~/.cache")
train = CIFAR100(root, download=True, train=True, transform=preprocess)
test = CIFAR100(root, download=True, train=False, transform=preprocess)
def get_features(dataset):
    all_features = []
    all_labels = []

    with torch.no_grad():
        for images, labels in tqdm(DataLoader(dataset, batch_size=100)):
            features = model.encode_image(images.to(device))

            all_features.append(features)
            all_labels.append(labels)

    return torch.cat(all_features).cpu().numpy(), torch.cat(all_labels).cpu().numpy()
#Calculate the image features
train_features, train_labels = get_features(train)
test_features, test_labels = get_features(test)
#Perform logistic regression
classifier = LogisticRegression(random_state=0, C=0.316, max_iter=1000, verbose=1)
classifier.fit(train_features, train_labels)
#Evaluate using the logistic regression classifier
predictions = classifier.predict(test_features)
accuracy = np.mean((test_labels == predictions).astype(float)) * 100.
print(f"Accuracy = {accuracy:.3f}")

Note that the C value should be determined via a hyperparameter sweep using a validation split.
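A minimal sketch of such a sweep, assuming the train_features and train_labels computed above are available; the candidate C values are illustrative:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hold out part of the training features as a validation split.
sub_features, val_features, sub_labels, val_labels = train_test_split(
    train_features, train_labels, test_size=0.2, random_state=0)

best_c, best_acc = None, 0.0
for c in [0.01, 0.1, 0.316, 1.0, 3.16, 10.0]:  # illustrative grid of C values
    clf = LogisticRegression(random_state=0, C=c, max_iter=1000)
    clf.fit(sub_features, sub_labels)
    acc = np.mean(clf.predict(val_features) == val_labels)
    if acc > best_acc:
        best_c, best_acc = c, acc

print(f"Best C on the validation split: {best_c} (accuracy {100 * best_acc:.2f}%)")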

3. More references

  • OpenCLIP: includes larger and independently trained CLIP models up to ViT-G/14
  • Hugging Face implementation of CLIP: for easier integration with the HF ecosystem
