What is PodAgent?
PodAgent is a podcast generation framework jointly developed by the Chinese University of Hong Kong, Microsoft, and Xiaohongshu. It simulates a real talk show with a multi-agent system in which hosts, guests, and a screenwriter collaborate to automatically generate rich, structured dialogue. A diverse voice library lets PodAgent match each role to a suitable voice, which keeps the audio natural and immersive. Speech synthesis guided by a large language model (LLM) produces expressive, emotional speech that makes the podcast more engaging. PodAgent also proposes a comprehensive set of evaluation metrics for measuring the quality of generated podcasts, covering the professionalism and diversity of the content.
Main functions of PodAgent
Generate high-quality dialogue content: Automatically generate rich and diverse dialogue scripts covering a variety of topics.
Voice role matching: Dynamically match the most suitable voice according to the character's personality and content background.
Speech synthesis and expressiveness enhancement: Adjust the tone, rhythm, and emotion of the voice according to the emotion and context of the dialogue to make the podcast more vivid.
Generate a complete podcast structure: Add appropriate sound effects and background music to produce a complete podcast. Support multi-language generation to meet the needs of different scenarios and listeners.
Evaluation and optimization: Provide comprehensive evaluation metrics to measure the quality of the generated podcast, including the richness of the dialogue content, the accuracy of voice matching, and the expressiveness of the speech.
Technical principles of PodAgent
Multi-agent collaboration system:
Host: Formulates the dialogue outline and guides the topic discussion.
Guests: Provide professional insights and opinions based on their role settings.
Screenwriter: Integrates the dialogue content and improves the coherence and diversity of the script.
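The division of labor among the three roles can be sketched as a small pipeline. This is a minimal illustration, not PodAgent's actual implementation: the `llm` callable, the prompt wording, and the names `Turn` and `run_session` are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical LLM interface: takes a prompt string, returns generated text.
LLM = Callable[[str], str]


@dataclass
class Turn:
    speaker: str
    text: str


def run_session(topic: str, guests: List[str], llm: LLM) -> List[Turn]:
    """Host outlines, guests answer, screenwriter polishes (illustrative only)."""
    # Host: draft an outline, assumed here to be one question per line.
    outline = llm(f"As host, list questions for a podcast about {topic}.")
    questions = [q.strip() for q in outline.splitlines() if q.strip()]

    # Guests: each guest answers every question in their own role.
    draft: List[Turn] = []
    for question in questions:
        draft.append(Turn("Host", question))
        for guest in guests:
            draft.append(Turn(guest, llm(f"As {guest}, answer: {question}")))

    # Screenwriter: rewrite each turn for coherence, keeping the speaker.
    return [
        Turn(t.speaker, llm(f"As screenwriter, polish: {t.text}"))
        for t in draft
    ]
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM client, so it can be exercised with a stub function.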
Voice feature analysis and matching: Build a voice library, analyze the characteristics of each voice (such as timbre, intonation, and emotion), and match the most suitable voice to each character. Voice samples are extracted from open-source datasets (such as LibriTTS and AISHELL-3), then deduplicated and filtered to produce a diverse voice library.
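One simple way to picture the matching step is to describe both characters and voices with attribute tags and pick the voice with the greatest overlap. The sketch below uses Jaccard similarity and a greedy assignment; this scoring rule, the tag vocabulary, and the function names are assumptions for illustration and are not taken from PodAgent.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_voices(
    roles: Dict[str, Set[str]], voices: Dict[str, Set[str]]
) -> Dict[str, str]:
    """Greedily give each role the still-unused voice with the most similar tags."""
    free = dict(voices)  # voices not yet assigned
    assigned: Dict[str, str] = {}
    for role, wanted in roles.items():
        if not free:
            raise ValueError("not enough voices for all roles")
        best = max(free, key=lambda voice_id: jaccard(wanted, free[voice_id]))
        assigned[role] = best
        del free[best]  # each voice is used by at most one role
    return assigned


# Hypothetical tags for two roles and three library voices.
roles = {
    "Host": {"female", "warm", "energetic"},
    "Guest": {"male", "calm", "deep"},
}
voices = {
    "v1": {"male", "deep", "calm"},
    "v2": {"female", "warm", "bright"},
    "v3": {"male", "energetic"},
}
casting = match_voices(roles, voices)
```

Removing a voice from the pool once it is assigned keeps two characters from sounding identical, which matters more for listeners than a marginally better tag score.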
LLM-guided speech synthesis: Use speech synthesis technology based on a large language model (LLM) to convert the script into natural, expressive speech. The speaking style predicted by the LLM for each line is passed as an instruction to the speech synthesis model (such as CosyVoice), so that the generated speech matches the sentiment of the content.
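The flow from script line to style instruction to synthesis request might look like the following sketch. It deliberately stops short of calling a real TTS engine: the request fields (`voice`, `instruction`, `text`), the prompt wording, and the `predict_style` callable are hypothetical and do not reflect the CosyVoice API.

```python
from typing import Callable, Dict, List, Tuple


def build_tts_requests(
    script: List[Tuple[str, str]],
    voice_of: Dict[str, str],
    predict_style: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Attach an LLM-predicted speaking style to every (speaker, text) line."""
    requests: List[Dict[str, str]] = []
    for speaker, text in script:
        # Ask the (hypothetical) LLM how this line should be delivered.
        prompt = f"Describe in one short phrase how {speaker} should say: {text}"
        style = predict_style(prompt).strip()
        requests.append(
            {
                "voice": voice_of[speaker],         # voice chosen during matching
                "instruction": style or "neutral",  # fall back if the LLM returns nothing
                "text": text,
            }
        )
    return requests
```

Each resulting dictionary would then be handed to whichever instruction-following synthesis model is in use.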
Comprehensive evaluation metrics: Introduce a set of metrics to measure the quality of generated podcasts. The metrics cover the vocabulary diversity, semantic richness, and information density of the dialogue content, as well as the accuracy of voice matching and the expressiveness of the speech. An LLM is used as the evaluator to compare and score the generated content.
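As an illustration of what a vocabulary-diversity score can look like, the sketch below computes distinct-n, the fraction of n-grams in a text that are unique. Distinct-n is a common lexical-diversity measure and is used here as an assumption; the tokenizer and function names are also illustrative.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def distinct_n(tokens: List[str], n: int) -> float:
    """Unique n-grams divided by total n-grams; 0.0 if the text is too short."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


tokens = tokenize("The cat sat on the mat. The cat!")
unigram_diversity = distinct_n(tokens, 1)  # 5 unique words out of 8
bigram_diversity = distinct_n(tokens, 2)   # 6 unique bigrams out of 7
```

Higher values mean less repetition. Scores of this kind only capture surface variety, so they complement rather than replace the LLM-based comparison described above.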
PodAgent project address
GitHub repository: https://github.com/yujxx/PodAgent
arXiv technical paper: https://arxiv.org/pdf/2503.00455
Application scenarios of PodAgent
Media and content creation: Quickly generate high-quality podcast programs covering topics such as news, culture, and technology, saving creation time and cost.
Education and learning: Generate educational podcasts, such as language lessons and academic lectures, to provide a vivid and interesting learning experience.
Corporate promotion: Create brand promotion podcasts, share product stories or industry insights, and enhance brand influence.
Self-media and personal branding: Help creators quickly generate podcast content, break through creative bottlenecks, and enhance content appeal.
Entertainment and creativity: Generate entertainment podcasts such as fictional stories and comedy talk shows to provide an immersive listening experience.