ActorVoice: Zero-Shot Cross-Lingual Video Dubbing through Personalized Voice Preservation and Language Style Transfer

Anonymous
2024

*Indicates Equal Contribution

Abstract

Cross-lingual video dubbing presents significant challenges in maintaining speaker identity, emotional expressiveness, and natural speech flow while translating content across languages. Traditional approaches often result in synthetic-sounding voice, inconsistent prosody, and loss of original speaker characteristics, limiting the widespread adoption of automated dubbing solutions. This paper introduces ActorVoice, a novel zero-shot cross-lingual dubbing framework that achieves professional-quality voice translation through several key innovations. First, we propose an advanced F5-based text-to-speech synthesis system enhanced with dynamic utterance merging, which effectively preserves natural speech patterns and emotional nuances across language boundaries. Second, we develop a high-precision voice-background separation module that achieves superior audio track isolation while maintaining environmental acoustics. Third, our intelligent audio fusion system implements adaptive prosody matching and coherent style transfer, effectively addressing common issues such as robotic artifacts, tonal inconsistencies, and unnatural transitions. Our framework incorporates three key technical contributions: (1) a context-aware utterance analysis mechanism that optimizes speech segment boundaries for natural flow, (2) a neural prosody transfer network that accurately captures and reproduces speaker-specific characteristics, and (3) an adaptive audio mixing pipeline that ensures seamless integration of synthesized speech with background elements. Extensive experiments demonstrate that ActorVoice achieves comparable quality to professional human dubbing services and outperforms existing solutions like Heygen across multiple evaluation metrics. Quantitative evaluations on our multi-language dubbing benchmark show that ActorVoice achieves a Mean Opinion Score (MOS) of 4.6 for naturalness and 82% for speaker similarity, representing a significant improvement over current state-of-the-art systems. User studies indicate that 96% of participants could not distinguish our synthesized dubbing from professional human recordings. These results demonstrate the potential of ActorVoice to revolutionize cross-lingual video content adaptation while maintaining high production quality.

Video source from Tencent Video

ActorVoice can preserve high-fedility and naturalness of voices in the original video

ActorVoice handles high-frequency speech and complex scenes well

ActorVoice Pipeline

BibTeX

        
@article{Fagang,
  title={ActorVoice: Zero-Shot Cross-Lingual Video Dubbing through Personalized Voice Preservation and Language Style Transfer},
  author={Fagang Jin, Shen Wei, Tong Ye},
  year={2024},
  primaryClass={cs.CV}
}