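Stefanie Sun is every bit Stefanie Sun, and every bit the top graduate of Nanyang Technological University. She recently published a long English-language post on her official blog, formally responding to the much-discussed "AI Stefanie Sun" phenomenon. The pop diva shows an exceptional intellect: the writing is graceful and lingering, restrained toward AIGC art yet suitably tolerant, full of a classical beauty of language, and it shows the breadth of mind captured by the saying "let it bear down like Mount Tai; I take it as a breeze across my face."

This time we use the edge-tts and SadTalker libraries to have AI Stefanie Sun recite her original's blog post, so that the pop diva reads it to you.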

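Configuring SadTalker

Previously we used Wav2lip, a submodule of PaddleGAN, Baidu's open-source vision model library, to synchronize a character's lip movements with input sung vocals. The problem with Wav2lip is that the animation is confined to the area around the lips. In fact, the coupling between audio and different facial motions varies: lip motion has the strongest connection to the audio, but different head poses and eye blinks can also be used to respond to it.

Compared with Wav2lip, SadTalker is a library for stylized audio-driven talking-head video generation modulated by implicit 3D coefficients. It generates realistic motion coefficients (for example head pose, lip motion, and eye blinks) from audio and learns each motion separately to reduce uncertainty. For expression, it designs a new audio-to-expression coefficient network by distilling coefficients from lip-only motion coefficients and from perceptual losses (lip-reading loss, facial landmark loss) on the reconstructed, rendered 3D face.

For stylized head pose, a conditional VAE models diverse and realistic head motion by learning the residual of a given pose. Once realistic 3DMM coefficients have been generated, a novel 3D-aware face renderer drives the source image. Unsupervised 3D keypoints of the source and the driving signal are used to generate a warping field, which warps the reference image to produce the final video.

SadTalker can be set up on its own or as an extension of Stable-Diffusion-Webui. The extension route is recommended here, because Stable-Diffusion and SadTalker then share one WebUI, which makes it easier to animate images generated by Stable-Diffusion.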

Change into the Stable-Diffusion project directory:

cd stable-diffusion-webui

Start the service:

python3.10 webui.py

The program prints:

Python 3.10.11 (tags/v3.10.11:7d4cc5a, Apr  5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)]
Version: v1.3.0  
Commit hash: 20ae71faa8ef035c31aa3a410b707d792c8203a3  
Installing requirements  
Launching Web UI with arguments: --xformers --opt-sdp-attention --api --lowvram  
Loading weights [b4d453442a] from D:\work\stable-diffusion-webui\models\Stable-diffusion\protogenV22Anime_protogenV22.safetensors  
load Sadtalker Checkpoints from D:\work\stable-diffusion-webui\extensions\SadTalker\checkpoints  
Creating model from config: D:\work\stable-diffusion-webui\configs\v1-inference.yaml  
LatentDiffusion: Running in eps-prediction mode  
DiffusionWrapper has 859.52 M params.  
Running on local URL:  http://127.0.0.1:7860

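This means the service started successfully. Then open http://localhost:7860 in a browser.

Select the Extensions tab.

Click "Install from URL" and enter the extension address: github.com/Winfredy/SadTalker

After the installation succeeds, restart the WebUI.

Next, download the required model files manually: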

https://pan.baidu.com/s/1nXuVNd0exUl37ISwWqbFGA?pwd=sadt

Then place the model files in the project's stable-diffusion-webui/extensions/SadTalker/checkpoints/ directory.

Next, set the environment variable that points to the model directory:

set SADTALKER_CHECKPOINTS=D:/stable-diffusion-webui/extensions/SadTalker/checkpoints/
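Before restarting the WebUI, it can help to confirm that the variable points at a directory that actually contains the downloaded files. The following is a minimal sketch, not part of SadTalker itself: the helper names and the fallback path are hypothetical, and only the variable name SADTALKER_CHECKPOINTS comes from the step above.

```python
import os
from pathlib import Path


def resolve_checkpoint_dir(default: str = "extensions/SadTalker/checkpoints") -> Path:
    """Return the directory named by SADTALKER_CHECKPOINTS, or a fallback.

    The fallback path is an assumption for illustration only.
    """
    return Path(os.environ.get("SADTALKER_CHECKPOINTS", default))


def list_checkpoints(directory: Path) -> list[str]:
    """Return the sorted file names in the directory, or [] if it is missing."""
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.iterdir() if p.is_file())


ckpt_dir = resolve_checkpoint_dir()
print(f"checkpoint dir: {ckpt_dir}")
print(f"files found: {list_checkpoints(ckpt_dir)}")
```

An empty list usually means the variable is unset in the current shell or the model files were unpacked somewhere else.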

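At this point SadTalker is configured.

Text-to-speech with edge-tts

The earlier song covers used the So-vits library to replace and predict the timbre of an original song, which means the original recording was needed as base data. The current scenario is clearly different from song replacement: we first need to convert text to speech, and only then can the timbre be replaced.

Here we use the edge-tts library for the text-to-speech step: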

import asyncio

import edge_tts

# Full text of the blog post to be read aloud
TEXT = '''  
As my AI voice takes on a life of its own while I despair over my overhanging stomach and my children's every damn thing, I can't help but want to write something about it.  
My fans have officially switched sides and accepted that I am indeed 冷门歌手 while my AI persona is the current hot property. I mean really, how do you fight with someone who is putting out new albums in the time span of minutes.  
Whether it is ChatGPT or AI or whatever name you want to call it, this "thing" is now capable of mimicking and/or conjuring,  unique and complicated content by processing a gazillion chunks of information while piecing and putting together in a most coherent manner the task being asked at hand. Wait a minute, isn't that what humans do? The very task that we have always convinced ourselves; that the formation of thought or opinion is not replicable by robots, the very idea that this is beyond their league, is now the looming thing that will threaten thousands of human conjured jobs. Legal, medical, accountancy, and currently, singing a song.   
You will protest, well I can tell the difference, there is no emotion or variance in tone/breath or whatever technical jargon you can come up with. Sorry to say, I suspect that this would be a very short term response.  
Ironically, in no time at all, no human will be able to rise above that. No human will be able to have access to this amount of information AND make the right calls OR make the right mistakes (ok mayyyybe I'm jumping ahead). This new technology will be able to churn out what exactly EVERYTHING EVERYONE  needs. As indie or as warped or as psychotic as you can get, there's probably a unique content that could be created just for you. You are not special you are already predictable and also unfortunately malleable.  
At this point, I feel like a popcorn eater with the best seat in the theatre. (Sidenote: Quite possibly in this case no tech is able to predict what it's like to be me, except when this is published then ok it's free for all). It's like watching that movie that changed alot of our lives Everything Everywhere All At Once, except in this case, I don't think it will be the idea of love that will save the day.   
In this boundless sea of existence, where anything is possible, where nothing matters, I think it will be purity of thought, that being exactly who you are will be enough.   
With this I fare thee well.  
'''  
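# Voice name and output path for the synthesized speech
VOICE = "en-HK-YanNeural"
OUTPUT_FILE = "./test_en1.mp3"


async def _main() -> None:
    # Send the text to the Edge TTS service and write the audio to disk
    communicate = edge_tts.Communicate(TEXT, VOICE)
    await communicate.save(OUTPUT_FILE)


if __name__ == "__main__":
    asyncio.run(_main())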

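The audio uses an English female voice, en-HK-YanNeural. For more on edge-tts, see the earlier article "Voice-over powerhouse: hands-on speech synthesis with edge-tts, a free open-source library for Microsoft's Edge-based TTS (text-to-speech) service (Python3.10)"; the details are not repeated here.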
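If a different voice is wanted, the candidates can be narrowed by locale and gender. The sketch below assumes the voice list is a list of dictionaries with "ShortName", "Locale" and "Gender" keys, which is the shape edge_tts.list_voices() returns; the helper names pick_voices and fetch_english_female_voices are hypothetical.

```python
import asyncio


def pick_voices(voices: list[dict], locale_prefix: str, gender: str) -> list[str]:
    """Return sorted ShortName values matching a locale prefix and a gender."""
    return sorted(
        v["ShortName"]
        for v in voices
        if v.get("Locale", "").startswith(locale_prefix) and v.get("Gender") == gender
    )


async def fetch_english_female_voices() -> list[str]:
    # Imported lazily so pick_voices can be used without edge-tts installed.
    import edge_tts

    voices = await edge_tts.list_voices()  # needs network access
    return pick_voices(voices, "en-", "Female")


# Usage (requires edge-tts and a network connection):
# print(asyncio.run(fetch_english_female_voices()))
```

Any name returned this way can be assigned to VOICE in the script above.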

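Then replace the timbre of the audio file with that of AI Stefanie Sun, as described in the earlier article "AI diva, singing live online: applying the AI Stefanie Sun model to recreate 《悠远的歌》, original singer 晴子 (Python3.10)".

Local inference and running out of VRAM

Once the generated image and the audio file are ready, inference can be run locally. Visit localhost:7860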

(Image: Stefanie Sun herself responds to AI Stefanie Sun, based on SadTalker/Python3.10)

Here, select full for the input parameter so that the whole image area is kept; otherwise only the head region is kept.

Result:

SadTalker generates the matching lip shapes and expressions from the audio file.

Note that the audio file must be MP3 or WAV.
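A quick check before uploading can catch an unsupported file early. This is a minimal sketch that only looks at the file extension; the helper name is hypothetical, and it does not inspect the actual audio encoding.

```python
from pathlib import Path

# The two formats named above
SUPPORTED_AUDIO_SUFFIXES = {".mp3", ".wav"}


def is_supported_audio(path: str) -> bool:
    """Return True if the file extension is .mp3 or .wav (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_AUDIO_SUFFIXES


print(is_supported_audio("test_en1.mp3"))  # True
print(is_supported_audio("test_en.flac"))  # False
```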

In addition, PyTorch may raise this error during inference:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 6.00 GiB total capacity; 5.38 GiB already allocated; 0 bytes free; 5.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

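This is the so-called "out of VRAM" problem.

Usually it happens because the GPU does not have enough free memory at that moment. As the error message suggests, one option is to lower max_split_size_mb, which caps the size of the memory blocks the PyTorch caching allocator is allowed to split and can reduce fragmentation: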

set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:60
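When inference is launched from a standalone Python script rather than from a shell, the same setting can be applied inside the process. The sketch below assumes the variable is set before torch is first imported, which is the safe ordering; the helper name is hypothetical, and the value 60 is simply the one used above.

```python
import os


def set_max_split_size(mb: int) -> str:
    """Set PYTORCH_CUDA_ALLOC_CONF for this process and return the value written.

    Note: this overwrites any other allocator options already in the variable.
    """
    if mb <= 0:
        raise ValueError("mb must be positive")
    value = f"max_split_size_mb:{mb}"
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = value
    return value


# Call this before the first `import torch` in the script.
print(set_max_split_size(60))  # max_split_size_mb:60
```

This does not add memory; it only changes how the allocator carves up what is available, so a model that truly does not fit will still fail.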

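If the audio file really is too long, it can also be sliced with ffmpeg and the inference run in several passes. Timestamps use the HH:MM:SS format, so 00:30:00 in the command below is the 30-minute mark; pick an end time that suits the length of the audio: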

ffmpeg -ss 00:00:00 -i test_en.wav -to 00:30:00 -c copy test_en_01.wav
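For more than one slice, the commands can be generated rather than typed by hand. This is a sketch under stated assumptions: the total duration is supplied by the caller, the 70-second duration and 30-second chunk length are made-up example values, and it uses ffmpeg's -t (duration) option instead of the -to (end time) option shown above.

```python
import math
from pathlib import Path


def build_slice_commands(src: str, total_seconds: int, chunk_seconds: int) -> list[list[str]]:
    """Build one ffmpeg argument list per chunk, numbered _01, _02, ..."""
    if total_seconds <= 0 or chunk_seconds <= 0:
        raise ValueError("durations must be positive")
    source = Path(src)
    commands = []
    for index in range(math.ceil(total_seconds / chunk_seconds)):
        start = index * chunk_seconds
        target = source.with_name(f"{source.stem}_{index + 1:02d}{source.suffix}")
        commands.append([
            "ffmpeg",
            "-ss", str(start),        # seek to the start of this chunk
            "-i", src,
            "-t", str(chunk_seconds),  # copy at most this many seconds
            "-c", "copy",             # no re-encoding
            str(target),
        ])
    return commands


# Example values only: a 70-second file cut into 30-second chunks.
for command in build_slice_commands("test_en.wav", 70, 30):
    print(" ".join(command))
```

Each argument list can be passed to subprocess.run(command, check=True) once ffmpeg is on the PATH, and the resulting files fed to SadTalker one at a time.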

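This resolves the out-of-VRAM problem during inference.

Conclusion

Compared with Wav2Lip, SadTalker (Stylized Audio-Driven Talking-head) provides much finer facial motion detail, such as eye blinks. The price is more models to download and higher inference cost and time, but these are clearly worth it.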