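This article was first published as a signed article in the Xitu (稀土) tech community. Reposting is prohibited within 30 days of publication; after 30 days, reposting without authorization is prohibited. Infringement will be pursued.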

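1. Introduction

The previous article covered the implementation of image-to-image search: /post/725585… There we used a convolutional neural network to extract features, compared feature similarity, and used a vector database to speed up the search. This article covers searching for images by text.

First we need to pin down what searching images by text actually means. There are two levels. The first is the text contained in the image itself, which can be extracted with OCR. The second is text that describes the image at a deeper level, such as "a red dog", "a running pig", or "a person riding a pig". These describe the content of the image, and this second kind is much harder to handle.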

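2. OCR + text search

OCR stands for optical character recognition, which is what we usually call text recognition. OCR can be implemented in many ways, for example with Tesseract or with various neural networks. OCR is not the focus of this article, so only its usage is shown briefly here. For details see: /post/696437…

The principle of OCR + text search is very simple: first recognize the text in each image, then find the relevant images with a fuzzy (substring) query on that text. A database is used to make querying convenient.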
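To make the idea concrete before bringing in OCR and SQLAlchemy, here is a minimal sketch that uses only the standard-library sqlite3 module. The file names and recognized texts are made up for illustration, and the table layout is an assumption that simply mirrors the idea of storing a path next to its text.

```python
import sqlite3

# Hypothetical OCR results: (file path, recognized text). Not real data.
records = [
    ("a.jpg", "你好呀"),
    ("b.jpg", "今天不开心"),
    ("c.jpg", "你好，世界"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE image_information (filepath TEXT, content TEXT)")
conn.executemany("INSERT INTO image_information VALUES (?, ?)", records)


def search(keyword, limit=8):
    """Return the paths of images whose recognized text contains the keyword."""
    rows = conn.execute(
        "SELECT filepath FROM image_information "
        "WHERE content LIKE ? ORDER BY filepath LIMIT ?",
        (f"%{keyword}%", limit),
    )
    return [row[0] for row in rows]


print(search("你好"))
print(search("开心"))
```

The query only matches when the keyword appears literally in the stored text, which is why this approach works for short keywords and tends to fail for long or paraphrased queries.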

2.1 Text recognition

The pytesseract module makes OCR easy. The code is as follows:

import os
import cv2
import numpy as np
import pytesseract
from tqdm import tqdm
from PIL import Image
from sqlalchemy import create_engine, String, select
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, Session

# Raw string so the backslashes in the Windows path are not treated as escapes
base_path = r"G:\datasets\emoji"
files = [os.path.join(base_path, file) for file in os.listdir(base_path) if file.endswith(".jpg")]
for file in files:
    try:
        image = Image.open(file)
        # chi_sim selects the Simplified Chinese language data
        string = pytesseract.image_to_string(image, lang='chi_sim')
        print(file, ":", string.strip())
    except Exception as e:
        # Some images in the dataset are corrupted; skip them
        pass

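Here string is the recognized text. pytesseract also provides a batch recognition interface, but because this dataset contains some corrupted images, the batch interface is not used here.

2.2 Storing in the database

To make querying convenient, the image path and the text contained in the image can be stored in a database. Here we use sqlalchemy + sqlite. The code is as follows: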

from sqlalchemy import create_engine, String
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class ImageInformation(Base):
    """One row per image: its file path and the text recognized in it."""
    __tablename__ = "image_information"
    id: Mapped[int] = mapped_column(primary_key=True, autoincrement=True)
    filepath: Mapped[str] = mapped_column(String(255))
    content: Mapped[str] = mapped_column(String(255))
    def __repr__(self) -> str:
        return f"ImageInformation(id={self.id!r}, filepath={self.filepath!r}, content={self.content!r})"

# SQLite database stored in a local file
engine = create_engine("sqlite:///image_search.db", echo=False)
# Create the table if it does not exist yet
Base.metadata.create_all(engine)

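Here the ImageInformation class corresponds to the database table we need to create. Once the table exists, recognize the text in each image and store it in the database: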

base_path = r"G:\datasets\emoji"
files = [os.path.join(base_path, file) for file in os.listdir(base_path) if file.endswith(".jpg")]
bar = tqdm(total=len(files))
for file in files:
    try:
        # Recognize the text
        image = Image.open(file)
        string = pytesseract.image_to_string(image, lang='chi_sim').strip()
        # Truncate to the declared column length of 255 characters
        file = file[:255]
        string = string[:255]
        # Store in the database
        with Session(engine) as session:
            info = ImageInformation(filepath=file, content=string)
            session.add(info)
            session.commit()
    except Exception as e:
        # Skip images that cannot be opened or recognized
        pass
    bar.update(1)

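This process takes a while.

2.3 Searching images by text

After the storage step above, we can start searching for images by text. A simple database query is all that is needed. The code is below; we first use 你好 ("hello") as the input text: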

keyword = '你好'
w, h = 224, 224
with Session(engine) as session:
    # Substring match on the recognized text, at most 8 results
    stmt = select(ImageInformation).where(ImageInformation.content.contains(keyword)).limit(8)
    # Resize every hit to the same size so the images can be stacked side by side
    images = [cv2.resize(cv2.imread(ii.filepath), (w, h)) for ii in session.scalars(stmt)]
    if len(images) > 0:
        result = np.hstack(images)
        cv2.imwrite("result.jpg", result)
    else:
        print("No results found")

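Below are the images returned by the query:

If the keyword is changed to 喜欢 ("like"), the results are as follows:

Testing shows that this approach works fairly well for short text queries, but long text queries often return nothing. One improvement is to store not the text itself but an embedding of it, produced by a model such as BERT. In that case sqlite is no longer suitable, and a vector database is needed instead.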

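3. Transformer-based improvement

In the previous example, the search results depend heavily on string matching. For example, a search for 鸡 ("chicken") only finds images whose text contains the character 鸡, while images related to 坤 (a character that Chinese internet memes associate with 鸡) are not found. To address this we improve the approach with a Transformer. The main idea is to recognize the text first, pass it to a text encoder to turn it into an embedding, and at search time compare the similarity between the embedding of the input text and the stored embeddings. This mitigates the problem.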
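The following toy sketch shows what "comparing similarity between embeddings" means in practice. The 3-dimensional vectors and file names are hypothetical stand-ins for real sentence embeddings, and L2 distance is used because it is a common choice; none of the values come from a real model.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical embeddings of the text recognized in three images
stored = {
    "meme_a.jpg": [0.9, 0.1, 0.0],
    "meme_b.jpg": [0.8, 0.2, 0.1],
    "meme_c.jpg": [0.0, 0.1, 0.9],
}


def search(query_embedding, top_k=2):
    """Return the top_k file paths whose embeddings are closest to the query."""
    ranked = sorted(stored, key=lambda path: l2_distance(stored[path], query_embedding))
    return ranked[:top_k]


# The query shares no characters with any stored text; only the vectors are compared
print(search([0.9, 0.1, 0.05]))
```

Two texts with related meanings end up with nearby vectors even when they share no characters, so the ranking no longer depends on literal string matches.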

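3.1 Creating the database

Here we again use a vector database. There are many options; we use Milvus. For usage details see: milvus.io/docs/instal…

First connect to the database and create the collection: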

from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility

# Connect to the Milvus server
connections.connect(host='127.0.0.1', port='19530')

def create_milvus_collection(collection_name, dim):
    # Drop any existing collection with the same name and start fresh
    if utility.has_collection(collection_name):
        utility.drop_collection(collection_name)
    fields = [
        # Auto-generated integer primary key
        FieldSchema(name='id', dtype=DataType.INT64, description='ids', is_primary=True, auto_id=True),
        FieldSchema(name='filepath', dtype=DataType.VARCHAR, description='filepath', max_length=512),
        FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, description='embedding vectors', dim=dim),
    ]
    schema = CollectionSchema(fields=fields, description='reverse image search')
    collection = Collection(name=collection_name, schema=schema)
    # create IVF_FLAT index for collection.
    index_params = {
        'metric_type': 'L2',
        'index_type': "IVF_FLAT",
        'params': {"nlist": 2048}
    }
    collection.create_index(field_name="embedding", index_params=index_params)
    return collection

collection = create_milvus_collection('image_information', 768)

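The collection mainly has two fields, filepath and embedding. The embedding has 768 dimensions, which is determined by the Transformer: the model chosen here outputs 768-dimensional vectors, so dim is set to 768.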
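For intuition about the IVF_FLAT index and its nlist parameter, here is a toy inverted-file index in plain Python. It is only a sketch of the general idea, not how Milvus is implemented: a real index learns its centroids by clustering the data, while the ToyIVF class below takes two hand-picked centroids as an assumption.

```python
import math


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class ToyIVF:
    """Toy inverted-file index: each vector is stored in the bucket of its nearest centroid."""

    def __init__(self, centroids):
        self.centroids = centroids              # len(centroids) plays the role of nlist
        self.buckets = [[] for _ in centroids]  # one bucket per centroid

    def _nearest_centroids(self, vector, count):
        order = sorted(range(len(self.centroids)),
                       key=lambda i: l2(self.centroids[i], vector))
        return order[:count]

    def insert(self, key, vector):
        bucket = self._nearest_centroids(vector, 1)[0]
        self.buckets[bucket].append((key, vector))

    def search(self, query, limit, nprobe):
        # Only the nprobe closest buckets are scanned, not the whole collection
        candidates = []
        for bucket in self._nearest_centroids(query, nprobe):
            candidates.extend(self.buckets[bucket])
        candidates.sort(key=lambda item: l2(item[1], query))
        return [key for key, _ in candidates[:limit]]


index = ToyIVF(centroids=[[0.0, 0.0], [10.0, 10.0]])
index.insert("a", [0.5, 0.5])
index.insert("b", [1.0, 0.0])
index.insert("c", [9.0, 9.5])
print(index.search([0.4, 0.4], limit=3, nprobe=1))
print(index.search([0.4, 0.4], limit=3, nprobe=2))
```

Scanning fewer buckets is faster but can miss vectors stored in buckets that were not probed, which is the speed versus recall trade-off behind the nprobe value passed at search time.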

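3.2 text2vec

Once the collection is created, the remaining steps are reading images, recognizing text, encoding the text, and inserting into the database. Text encoding can be done with the Transformers module or with text2vec. We use text2vec here, as follows: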

from text2vec import SentenceModel
model = SentenceModel('shibing624/text2vec-base-chinese')
embeddings = model.encode(['不要温柔地走进那个良夜'])
print(embeddings.shape)

Pass the model name when creating the SentenceModel, then call model.encode. The output is:

(1, 768)

The other steps are not explained in detail. The full code is as follows:

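import os
import pytesseract
from tqdm import tqdm
from PIL import Image
from text2vec import SentenceModel

model = SentenceModel('shibing624/text2vec-base-chinese', device="cuda")
base_path = r"G:\datasets\emoji"
files = [os.path.join(base_path, file) for file in os.listdir(base_path) if file.endswith(".jpg")]
bar = tqdm(total=len(files))
for file in files:
    try:
        # OCR the image, then encode the recognized text as a 768-dimensional vector
        image = Image.open(file)
        string = pytesseract.image_to_string(image, lang='chi_sim').strip()
        embedding = model.encode([string])[0]
        # Column order follows the schema: filepath, embedding (id is auto-generated)
        collection.insert([
            [file],
            [embedding]
        ])
    except Exception as e:
        # Skip images that fail OCR, encoding, or insertion
        pass
    bar.update(1)
# Persist the inserted data
collection.flush()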

3.3 Searching images by text

After the data is inserted, the database's search operation does the rest. The code is as follows:

import cv2
import numpy as np
from text2vec import SentenceModel
from pymilvus import connections, Collection

# Load the model
model = SentenceModel('shibing624/text2vec-base-chinese', device="cuda")
# Connect to the database and load the collection
connections.connect(host='127.0.0.1', port='19530')
collection = Collection(name='image_information')
# metric_type must match the index; nprobe is the number of index clusters to probe
search_params = {"metric_type": "L2", "params": {"nprobe": 10}}
collection.load()
# Query text
keyword = "今日不开心"
embedding = model.encode([keyword])
print(embedding.shape)
# Search the database
results = collection.search(
    data=[embedding[0]],
    anns_field='embedding',
    param=search_params,
    output_fields=['filepath'],
    limit=10,
    consistency_level="Strong"
)
collection.release()
# Show the query results
w, h = 224, 224
images = []
for result in results[0]:
    entity = result.entity
    filepath = entity.get('filepath')
    image = cv2.resize(cv2.imread(filepath), (w, h))
    images.append(image)
if images:
    # Stack the hits side by side into one image
    cv2.imwrite("result.jpg", np.hstack(images))
else:
    print("No results found")

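When queried, a vector database returns results ranked by vector similarity. Since we stored sentence vectors earlier, we can convert the query text into a sentence vector and use the database's search to find similar entries. In the code above we query "今日不开心" ("unhappy today"). This is no longer a string-level query but a query at the level of sentence meaning, so it can return images that do not contain these characters. Below are the results:

After changing the keyword to "我想吃饭" ("I want to eat"), we get the following results:

Overall the results are quite good.

However, these results rely on text being recognizable in the image. For casually taken photos, the approach above cannot be used to search images by text.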

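4. Text search based on image meaning

The multimodal field has many combined models; what we need is an Image-to-Text model. Manually writing a description for every image would be very tedious, so we use an Image-to-Text model to generate the descriptions automatically.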

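4.1 How it works

Text search based on image meaning is implemented much like the OCR-based version, except that OCR is replaced with an Image Captioning network. The earlier pipeline was:

1. Read the image
2. Run OCR
3. Convert the recognized text into a vector
4. Store it in the database

Now only the second step changes: use Image Captioning to generate a description of the image. Everything after that is exactly the same.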
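The pipeline can be sketched as a single function where the text-producing step is passed in as a parameter. The fake_ocr, fake_caption and fake_encode functions below are hypothetical stubs that stand in for pytesseract, the captioning model and the sentence encoder, so the sketch runs without any models installed.

```python
def build_records(paths, describe, encode):
    """Run the pipeline; only `describe` differs between the OCR and captioning variants."""
    records = []
    for path in paths:
        text = describe(path)           # steps 1-2: read the image and turn it into text
        vector = encode(text)           # step 3: convert the text into a vector
        records.append((path, vector))  # step 4: what would be inserted into the database
    return records


# Hypothetical stubs, for illustration only
def fake_ocr(path):
    return "text in " + path


def fake_caption(path):
    return "a photo named " + path


def fake_encode(text):
    return [float(len(text))]


paths = ["x.jpg", "y.jpg"]
ocr_records = build_records(paths, fake_ocr, fake_encode)
caption_records = build_records(paths, fake_caption, fake_encode)
print(ocr_records)
print(caption_records)
```

Swapping fake_ocr for fake_caption changes the stored vectors but nothing else in the pipeline, which is why the database and search code can be reused as they are.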

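4.2 Image Captioning

A task that takes an image as input and outputs a description of it is called Image Captioning, and many models exist for it, including CNN+LSTM and ViT-based models. Both are Encoder-Decoder architectures: a CNN or ViT serves as the image Encoder and converts the image into feature maps or feature vectors. The Encoder output is then fed to the Decoder together with a start token, and the Decoder generates the description one token at a time.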
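The generation loop of such an Encoder-Decoder model can be illustrated with a toy example. Everything below is an assumption made for illustration: the "encoder" just returns a label, and the "decoder" is a lookup table keyed on the previous token only, whereas a real decoder is a neural network conditioned on the image features and on all previously generated tokens.

```python
START, END = "<start>", "<end>"

# Hypothetical "decoder": maps (image feature, previous token) to the next token
NEXT_TOKEN = {
    ("dog", START): "a",
    ("dog", "a"): "red",
    ("dog", "red"): "dog",
    ("dog", "dog"): END,
}


def encode_image(image_name):
    """Stand-in for the CNN/ViT encoder: returns a 'feature' for the image."""
    return "dog" if "dog" in image_name else "unknown"


def generate_caption(image_name, max_length=10):
    """Greedy decoding: start from START and append tokens until END or max_length."""
    feature = encode_image(image_name)
    tokens = [START]
    while len(tokens) < max_length:
        next_token = NEXT_TOKEN.get((feature, tokens[-1]), END)
        if next_token == END:
            break
        tokens.append(next_token)
    return " ".join(tokens[1:])


print(generate_caption("red_dog.jpg"))  # a red dog
```

The max_length argument plays the same role as in a real generate call: it caps the caption length in case the end token is never produced.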

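Taking ViT as an example, its structure is shown in the figure:

ViT is essentially a Transformer architecture designed for images, with some details adapted to image input.

4.3 Searching images by image meaning

First, the database can be created the same way as before, so that is not repeated here; we reuse the image_information collection from above. Then the insertion code needs to change. We create one function that loads the models and one that converts an image into a text vector: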

import os
import torch
from tqdm import tqdm
from PIL import Image
from text2vec import SentenceModel
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
from pymilvus import Collection
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
def load_model():
    """
    Load the models that are needed
    """
    # Use the same device as the captioning model instead of hard-coding "cuda"
    sentence_model = SentenceModel('shibing624/text2vec-base-chinese', device=str(device))
    model = VisionEncoderDecoderModel.from_pretrained("bipin/image-caption-generator")
    image_processor = ViTImageProcessor.from_pretrained("bipin/image-caption-generator")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model.to(device)
    return sentence_model, model, image_processor, tokenizer

def get_embedding(filepath):
    """
    Take an image path and convert the image into the vector of its description
    """
    # Convert to RGB so grayscale or RGBA images are also accepted
    image = Image.open(filepath).convert("RGB")
    pixel_values = image_processor(images=[image], return_tensors="pt").pixel_values.to(device)
    # Generate the caption token ids with beam search
    output_ids = model.generate(pixel_values, num_beams=4, max_length=128)
    pred = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    # Encode the caption text as a sentence vector
    return sentence_model.encode(pred)

After that, calling get_embedding is enough to convert an image into a vector. Next, modify the insertion code as follows:

from pymilvus import connections

connections.connect(host='127.0.0.1', port='19530')
collection = Collection("image_information")
collection.load()
sentence_model, model, image_processor, tokenizer = load_model()
base_path = r"G:\datasets\people"
files = [os.path.join(base_path, file) for file in os.listdir(base_path)]
bar = tqdm(total=len(files))
for file in files:
    try:
        # Caption the image and encode the caption as a vector
        embedding = get_embedding(file)
        collection.insert([
            [file],
            [embedding]
        ])
    except Exception as e:
        # Skip files that cannot be read or captioned
        pass
    bar.update(1)

Finally comes the search itself, which is exactly the same as before:

import cv2
import numpy as np

# metric_type must match the index; nprobe is the number of index clusters to probe
search_params = {"metric_type": "L2", "params": {"nprobe": 10}}
# Query text
keyword = "girl"
embedding = sentence_model.encode([keyword])
# Search the database
results = collection.search(
    data=[embedding[0]],
    anns_field='embedding',
    param=search_params,
    output_fields=['filepath'],
    limit=10,
    consistency_level="Strong"
)
collection.release()
# Show the query results
w, h = 224, 224
images = []
for result in results[0]:
    entity = result.entity
    filepath = entity.get('filepath')
    image = cv2.resize(cv2.imread(filepath), (w, h))
    images.append(image)
if images:
    # Stack the hits side by side into one image
    cv2.imwrite("result.jpg", np.hstack(images))
else:
    print("No results found")

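Because the Image Captioning model chosen here outputs English, the keyword is given in English. Here the keyword is "girl"; below are the search results:

Because the database still contains the earlier meme images, memes related to "girl" are also returned, such as ones containing "娘们" or "女性".

If the keyword is changed to "smile girl", the search results are as follows:

With enough images, the search results can be quite good.

The results above can still be improved. In the Image Captioning step we generate only one description, and in many cases this description is not necessarily accurate. Take the image below:

It could be described as "a girl holding a microphone", "a girl smiling", or "a girl holding a microphone and smiling". So we can generate several descriptions and store them all in the database, which makes the search results more accurate. Different descriptions can be generated by adjusting the temperature parameter: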

# temperature only has an effect when sampling is enabled with do_sample=True
output_ids = model.generate(pixel_values, do_sample=True, num_beams=4, max_length=128, temperature=0.8)

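The temperature parameter only takes effect when sampling is enabled (do_sample=True), in which case the generated results are random. The lower the temperature, the closer generation gets to always choosing the most likely token; the higher the temperature, the more random the results.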
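To see how temperature shapes the sampling distribution, here is a small standard-library sketch of softmax with temperature. The three logits are made-up scores for three candidate tokens, not output from the captioning model.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens
logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature almost all of the probability sits on the highest-scoring token, so sampled captions vary little. At a high temperature the distribution flattens and sampled captions differ more from run to run.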
