Deep Synthesis: Real-time Audio-Visual Synthesis Using Deep Learning

This is the final project for the course ECE-GY 7123: Deep Learning by Professor Chinmay Hedge.

Team members

Kaustubh Agarwal (ka3210)
Srujana Kanchisamudram (sk11115)
Aashir Saroya (as17888)

This repository contains the code for Deep Synthesis, a system that generates real-time audio-visual speech synthesis. It works by taking three inputs:

Text: The text that you want the person in the image or video to say.
Voice sample: A short audio clip of the target speaker's voice. This is used to capture the speaker's unique vocal characteristics, such as pitch, timbre, and accent.
Image or video: An image or video of the person you want to synthesize speech for. Deep Synthesis will use this image or video to create a new video where the person appears to be speaking the text with synchronized lip movements.

The system works by first using a text-to-speech (TTS) model to generate a speech waveform based on the input text and the voice sample. Then, a lip synchronization model is used to generate realistic lip movements for the person in the image or video that are synchronized with the speech waveform. Finally, the generated speech waveform and lip movements are combined to create a new video.

The complete flow diagram of the project is as follows

Generating Samples

To generate samples run the interactive Python notebook Sample_Generator.ipynb or just run this Colab Notebook.

It is a relatively straightforward inferencing file. Run the first few sections to get the models and necessary libraries.

Following that you can give the 3 inputs described above. Due to limited time and scope, we were not able to develop a webpage to interactively upload these inputs (something which we would've really liked to do) but we were able to give very clear directions to do so.

You will be prompted to provide the text input which you can give by typing.

The other two inputs- image and voice sample should be added to the temporary folder sample_data created on Google Colab with the names mentioned below

The audio file should be audio_file.mp3
The image should be img_sample.jpg

Following this you'll have to run the next 2 segments of code to generate the audio of the voice sample speaking the text provided followed by lip syncing the words in the audio sample overlayed on the image provided.

Finally, you'll be able to see the generated video on the Colab Notebook. It can also be found in the temporary files on Colab with the name 'result.mp4' which can be downloaded and used as you'd like!

You can find more details about each of the modules, their working and some plots in thier respective READMEs.

We have generated some samples to show the performance of our model. These samples can be found under the 'samples' folder. Hope you have fun watching them!

The detailed report of this project can be found here or here (with hyperrefs).

Acknowledgement

We express our deepest gratitude to Professor Chinmay Hedge and the TAs along with all contributors who offered their expertise and insights throughout the "Deep Synthesis" project. We acknowledge the assistance of OpenAI's language model, ChatGPT 4.0, for its role in generating parts of this report. Additionally, we are grateful for the extensive resources available through online platforms such as Stack Overflow and various GitHub repositories. Our project has also benefited greatly from the official documentation of PyTorch, NumPy, and seaborn, which guided our development decisions. Moreover, we thank the NYU High Performance Computing (HPC) facilities for the computational resources that were essential for training our models. These tools and supports were instrumental in the successful completion of our research.

Name		Name	Last commit message	Last commit date
Latest commit History 154 Commits
LipSync		LipSync
VoiceCloning		VoiceCloning
samples		samples
.DS_Store		.DS_Store
.gitattributes		.gitattributes
Complete_Flow_Diagram.png		Complete_Flow_Diagram.png
Deep_Synthesis_Report.pdf		Deep_Synthesis_Report.pdf
README.md		README.md
Sample_Generator.ipynb		Sample_Generator.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Deep Synthesis: Real-time Audio-Visual Synthesis Using Deep Learning

Team members

Generating Samples

Acknowledgement

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Deep Synthesis: Real-time Audio-Visual Synthesis Using Deep Learning

Team members

Generating Samples

Acknowledgement

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages