Novel Character Analysis and Character Image Generation


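1. Introduction

As AI technology has developed, it has moved from understanding content to generating it. AI-generated content (AIGC) now covers painting, image-and-text pieces, video, and other kinds of creative work. Following UGC and PGC, AIGC is a new mode of production in which AI generates content automatically. Drawing on the cross-modal capabilities of large models, it can spark ideas, increase the diversity of content, and lower production costs, and it is expected to see large-scale use. This project explores AIGC in practice: it analyses how a novel describes a character and generates an image of that character from the analysis.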

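2. Extracting Character Keywords

A novel usually describes a character's appearance in a long passage, whereas a text-to-image model wants a few clearly descriptive prompt words (for example "young man") or one clearly descriptive sentence. A novel's descriptions also lean towards the subjective (for example "sickly"), and a model has trouble interpreting that kind of prompt. We therefore use some natural language processing techniques to produce keyword descriptions first, and then generate the character image we need from them.

Cross-modal generation still has open problems, and the first is ease of use. The user has to supply a text description, and writing one is harder than it sounds. One example needs a dense block of text to produce a single image. Another, from 文心一格 (Wenxin Yige), likewise needs a long string of text: the subject, the content, and the style all have to be described before the result is good enough. Ease of use therefore still needs to improve.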

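2.1 Generating a Text Summary

We implement an extractive summarization algorithm based on heuristic rules. If a few sentences are to stand in for a whole article, the natural choices are the sentences that contain the most keywords related to the article's content. In addition, following the idea of a topic sentence taught in school language classes, the first sentence of the article and the first sentence of each paragraph are usually topic sentences as well. Going one step further, a sentence whose meaning is close to that of most other sentences in the article also makes a good summary sentence.

This post therefore builds a sentence-weight function from three features: keyword information, sentence position, and sentence similarity. After computing the weight of every sentence, we sort the sentences in descending order of weight, take a fixed proportion from the top, and then re-sort the selected sentences by their order in the original text. The result is the summary. The main steps are:

1. Split and represent the text (split it into sentences, build the TF-IDF matrix)

2. Compute sentence weights (position weight, similarity weight)

3. Extract the highest-weighted sentences as the summary

Reference: "Implementing an automatic text summarization tool in Python"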
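To make the weighting and selection steps above concrete, here is a minimal, dependency-free sketch. The function names `min_max` and `pick_summary` and all the scores are made up for illustration; they are not part of the script that follows. Unlike that script, this sketch always keeps at least one sentence, which is an assumption added here.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def pick_summary(sentences, keyword, position, similarity,
                 weights=(1.0, 1.0, 1.0), ratio=0.3):
    """Combine three per-sentence scores and return the top sentences in document order."""
    # Position scores are assumed to lie in (0, 1] already, so only the other two are rescaled.
    features = [min_max(keyword), position, min_max(similarity)]
    total = [sum(w * f[i] for w, f in zip(weights, features))
             for i in range(len(sentences))]
    top_k = max(1, int(len(sentences) * ratio))  # assumption: never return an empty summary
    ranked = sorted(range(len(sentences)), key=lambda i: total[i], reverse=True)
    # Keep the best top_k indices, then restore their original order.
    return [sentences[i] for i in sorted(ranked[:top_k])]


sentences = ["A.", "B.", "C.", "D."]
keyword = [0.9, 0.2, 0.8, 0.1]      # hypothetical keyword-information scores
position = [1.0, 0.75, 0.5, 0.25]   # earlier sentences score higher
similarity = [0.6, 0.3, 0.9, 0.2]   # hypothetical similarity-to-document scores
summary = pick_summary(sentences, keyword, position, similarity, ratio=0.5)
print(summary)
```

With these toy scores the first and third sentences rank highest, and they are returned in their original order rather than in order of weight.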

import jieba
import numpy as np
import collections
from sklearn import feature_extraction  
from sklearn.feature_extraction.text import TfidfTransformer  
from sklearn.feature_extraction.text import CountVectorizer  
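def split_sentence(text, punctuation_list='!?。！？'):
    """
    Split the text into sentences at the punctuation marks in punctuation_list
    and collect all sentences in a list.
    """
    sentence_set = []
    inx_position = 0         # index just past the last sentence-ending punctuation mark
    char_position = 0        # moving character pointer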
    for char in text:
        char_position += 1
        if char in punctuation_list:
            next_char = list(text[inx_position:char_position+1]).pop()
            if next_char not in punctuation_list:
                sentence_set.append(text[inx_position:char_position])
                inx_position = char_position
    if inx_position < len(text):
        sentence_set.append(text[inx_position:])
    sentence_with_index = {i:sent for i,sent in enumerate(sentence_set)} #dict(zip(sentence_set, range(len(sentences))))
    return sentence_set,sentence_with_index
def get_tfidf_matrix(sentence_set,stop_word):
    corpus = []
    for sent in sentence_set:
        sent_cut = jieba.cut(sent)
        sent_list = [word for word in sent_cut if word not in stop_word]
        sent_str = ' '.join(sent_list)
        corpus.append(sent_str)
    vectorizer=CountVectorizer()
    transformer=TfidfTransformer()
    tfidf=transformer.fit_transform(vectorizer.fit_transform(corpus))
    # word=vectorizer.get_feature_names()
    tfidf_matrix=tfidf.toarray()
    return np.array(tfidf_matrix)
def get_sentence_with_words_weight(tfidf_matrix):
    sentence_with_words_weight = {}
    for i in range(len(tfidf_matrix)):
        sentence_with_words_weight[i] = np.sum(tfidf_matrix[i])
    max_weight = max(sentence_with_words_weight.values()) # min-max normalization
    min_weight = min(sentence_with_words_weight.values())
    span = (max_weight - min_weight) or 1.0  # avoid 0/0 when all weights are equal
    for key in sentence_with_words_weight.keys():
        x = sentence_with_words_weight[key]
        sentence_with_words_weight[key] = (x-min_weight)/span
    return sentence_with_words_weight
def get_sentence_with_position_weight(sentence_set):
    sentence_with_position_weight = {}
    total_sent = len(sentence_set)
    for i in range(total_sent):
        sentence_with_position_weight[i] = (total_sent - i) / total_sent
    return sentence_with_position_weight
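def similarity(sent1,sent2):
    """
    Compute the cosine similarity of two TF-IDF vectors.
    """
    # The 1e-6 term must sit inside the denominator; it guards against division by zero.
    return np.sum(sent1 * sent2) / (1e-6 + np.sqrt(np.sum(sent1 * sent1)) *
                                    np.sqrt(np.sum(sent2 * sent2)))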
def get_similarity_weight(tfidf_matrix):
    sentence_score = collections.defaultdict(lambda :0.)
    for i in range(len(tfidf_matrix)):
        score_i = 0.
        for j in range(len(tfidf_matrix)):
            score_i += similarity(tfidf_matrix[i],tfidf_matrix[j])
        sentence_score[i] = score_i
    max_score = max(sentence_score.values()) # min-max normalization
    min_score = min(sentence_score.values())
    span = (max_score - min_score) or 1.0  # avoid 0/0 when all scores are equal
    for key in sentence_score.keys():
        x = sentence_score[key]
        sentence_score[key] = (x-min_score)/span
    return sentence_score
def ranking_base_on_weigth(sentence_with_words_weight,
                            sentence_with_position_weight,
                            sentence_score, feature_weight = [1,1,1]):
    sentence_weight = collections.defaultdict(lambda :0.)
    for sent in sentence_score.keys():
        sentence_weight[sent] = feature_weight[0]*sentence_with_words_weight[sent] +\
                                feature_weight[1]*sentence_with_position_weight[sent] +\
                                feature_weight[2]*sentence_score[sent]
    sort_sent_weight = sorted(sentence_weight.items(),key=lambda d: d[1], reverse=True)
    return sort_sent_weight
def get_summarization(sentence_with_index,sort_sent_weight,topK_ratio =0.3):
    topK = int(len(sort_sent_weight)*topK_ratio)
    summarization_sent = sorted([sent[0] for sent in sort_sent_weight[:topK]])
    summarization = []
    for i in summarization_sent:
        summarization.append(sentence_with_index[i])
    summary = ''.join(summarization)
    return summary
if __name__ == '__main__':
    test_text = 'rose.txt'
    # with open(test_text,'r', encoding="gb18030") as f:
    with open(test_text,'r') as f:
        text = f.read()
    stop_word = []
    with open('StopWords.txt','r') as f:
        for line in f.readlines():
            stop_word.append(line.strip())
    sentence_set,sentence_with_index = split_sentence(text, punctuation_list='!?。！？')
    tfidf_matrix = get_tfidf_matrix(sentence_set,stop_word)
    sentence_with_words_weight = get_sentence_with_words_weight(tfidf_matrix)
    sentence_with_position_weight = get_sentence_with_position_weight(sentence_set)
    sentence_score = get_similarity_weight(tfidf_matrix)
    sort_sent_weight = ranking_base_on_weigth(sentence_with_words_weight,
                                                sentence_with_position_weight,
                                                sentence_score, feature_weight = [1,1,1])
    summarization = get_summarization(sentence_with_index,sort_sent_weight,topK_ratio=0.8)
    # print(type(summarization))
    # test_text_out = 'rose_out.txt'
    # with open(test_text_out,'w') as f:
    #     f.write(summarization)
    print('summarization:\n',summarization)

summarization:
 一个很漂亮的女孩子——这是郝仁的第一印象。对方一身挺清凉的打扮，上身穿着件贴身的白色短袖衫，衣领上缀着一片略有些孩子气的塑料小狗装饰，下身则是深色的短裤+休闲鞋，看起来好像一个悄悄翘课出来逛街的女大学生。这个自来熟的女孩子留着一头披肩短发，可能是很喜欢运动吧，皮肤带着些微的小麦色，健康又充满阳光，她的容貌秀丽可人，最让人留意的是那一双灵动的大眼睛，比郝仁见过的任何一双眼睛都充满活力，仿佛整个人的精气神都要从这双眼睛中透出来相同。
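The similarity feature in the script rests on cosine similarity between the TF-IDF rows of two sentences. As a rough illustration, the same quantity can be written without NumPy. The function name `cosine_similarity` and the `eps` parameter are choices made for this sketch only; the small constant goes inside the denominator so that an all-zero vector (a sentence made up entirely of stop words) gives 0 instead of an error.

```python
import math


def cosine_similarity(u, v, eps=1e-6):
    """Cosine similarity of two equal-length vectors; eps keeps the denominator non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (eps + norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
all_zero = cosine_similarity([0.0, 0.0], [0.0, 0.0])
print(same_direction, orthogonal, all_zero)
```

Vectors pointing the same way score just under 1 because of `eps`, while unrelated or empty vectors score 0.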

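2.2 Keyword Extraction

The jieba module offers two ways to extract keywords:

TextRank: jieba.analyse.textrank(text, topK=5) segments the text and ranks words by TextRank, optionally restricted to certain parts of speech through allowPOS.
TF-IDF: jieba.analyse.extract_tags(text, topK=10) segments the text and returns the 10 words with the highest TF-IDF weight. Computing TF-IDF yourself would mean building a term-frequency matrix first and then deriving the TF-IDF values from it.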

import jieba
import jieba.analyse
text = summarization
promt_texyrank =list()
# Keyword extraction with TextRank
keywords = jieba.analyse.textrank(text, topK=5, withWeight=True, allowPOS=('ns', 'n', 'vn', 'v'))
for item in keywords:
    print(item[0], item[1])
    promt_texyrank.append(item[0])

白色 1.0
翘课 0.8915410498073457
出来 0.8867536986126129
着件 0.8553706593637174
短袖衫 0.8453456531844306

import jieba.analyse as analyse
tfidf = analyse.extract_tags
test_text = summarization
# with open(test_text,'r',) as f:
#     text = f.read()
promt_tfidf =list()
# Keyword extraction with TF-IDF
keywords = tfidf(test_text, topK=5, withWeight=True, allowPOS=())
for item in keywords:
    print(item[0], item[1])
    promt_tfidf.append(item[0])

女孩子 0.22130913022742857
眼睛 0.220014001314
一双 0.194186673114
休闲鞋 0.18867900673428573
郝仁 0.17078239289857142

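2.3 Section Summary

After this section we have the prompt we need as input, but the extracted prompt is clearly far from ideal. There are roughly two reasons. The first is the summary step: the way sentence weights are computed over the whole document is a weak point, and the short descriptive passage does not help. The second is keyword extraction, which does not reliably pick up the adjectives. Two ideas for improvement: first, use PaddleNLP's summarization and either pass the generated summary in directly as the prompt or extract keywords from it before passing it in; second, use the summarization feature of the Wenxin (文心) large model, although the input must then be kept within 1,000 characters. Neither is hard, so both are worth trying.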
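One small step towards the adjective problem is to filter words by part-of-speech tag. The sketch below is an illustration only and is not part of the original pipeline. The helper `descriptive_words` is hypothetical, the sample pairs are tagged by hand and are not real jieba output, and it assumes the adjective-like flags are `a`, `an` and `ad`.

```python
def descriptive_words(tagged_words, keep_flags=("a", "an", "ad"), limit=5):
    """Return up to `limit` distinct words whose POS flag is in keep_flags, in first-seen order."""
    kept = []
    for word, flag in tagged_words:
        if flag in keep_flags and word not in kept:
            kept.append(word)
        if len(kept) == limit:
            break
    return kept


# Hand-tagged (word, flag) pairs, for illustration only.
tagged = [("漂亮", "a"), ("女孩子", "n"), ("健康", "a"),
          ("阳光", "n"), ("灵动", "a"), ("漂亮", "a")]
adjectives = descriptive_words(tagged)
print(adjectives)
```

In practice the pairs could come from `jieba.posseg.cut(text)`, which yields word/flag pairs. Note also that the TextRank call above restricts allowPOS to nouns and verbs, which by itself excludes adjectives.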

3. Generating the Character Image

# -*- coding: utf-8 -*
! pip install wenxin-api
import wenxin_api # can be installed with "pip install wenxin-api"
from wenxin_api.tasks.text_to_image import TextToImage
wenxin_api.ak = ""
wenxin_api.sk = ""
input_dict = {
    "text":  promt_tfidf + promt_texyrank + ["超高清，动漫，超细节，唯美，插画，壁纸"],
    "style": "二次元", # optional once more styles are unlocked
    "resolution":"1024*1024" , # can also be set to 1024*1536 or 1536*1024
    "num": "2",    # once unlocked, allowed values are [1,2,3,4,5,6]
}
rst = TextToImage.create(**input_dict)
print(rst)

2023-01-28 11:20:01,006 - model is painting now!, taskId: 13230768, waiting: 2m
{'imgUrls': ['https://wenxin.baidu.com/younger/file/ERNIE-ViLG/0dbd02ced2ee0feb6726c127347e7fceex', 'https://wenxin.baidu.com/younger/file/ERNIE-ViLG/0dbd02ced2ee0feb6726c127347e7fcei4']}
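The call above passes the two keyword lists concatenated with a list of style terms. This post does not state whether the API expects a list or a single string. If a single string turns out to be needed, one possible way to assemble it is sketched below. The helper `build_prompt` and the choice of separator are assumptions made for illustration; they are not part of wenxin_api.

```python
def build_prompt(*keyword_lists, style_terms=(), separator="，"):
    """Merge keyword lists into one prompt string, dropping duplicates but keeping order."""
    merged = []
    for words in keyword_lists:
        for word in words:
            if word not in merged:
                merged.append(word)
    # Style terms go last so the subject keywords lead the prompt.
    merged.extend(term for term in style_terms if term not in merged)
    return separator.join(merged)


tfidf_words = ["女孩子", "眼睛", "一双", "休闲鞋", "郝仁"]       # TF-IDF keywords from section 2.2
textrank_words = ["白色", "翘课", "出来", "着件", "短袖衫"]      # TextRank keywords from section 2.2
prompt = build_prompt(tfidf_words, textrank_words, style_terms=["动漫", "插画"])
print(prompt)
```

Deduplicating before joining keeps a word that both extractors return from appearing twice in the prompt.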

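4. Conclusion

AIGC means using artificial intelligence to produce content. Its defining trait is very strong productive capacity: it greatly raises the quality and efficiency of content production and will greatly enrich our digital lives. Cross-modal content generation is, at its core, about generating visual content from a text description: one sentence produces an image, or an article we write is turned into a video automatically.

The first step is text understanding, often called prompt learning: the text has to be understood and then expanded using knowledge. The core, of course, is still text-to-image generation. Once the text is fixed and fed into the system, the result has to be good enough. To this end Baidu proposed ERNIE-ViLG 2.0, a knowledge-enhanced mixture-of-denoising-experts model.

As for image-text relevance, cross-modal generation has to get the correspondence between language and vision right so that what the user asks for is what gets generated. Technically, this is done mainly through knowledge enhancement of language, vision, and their cross-modal combination, which gives a better mapping of knowledge across modalities and so improves image-text relevance.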

