Xinsheng Wang 王新升

Research Scientist & Tech Lead

Multimodal Interaction Team · Soul AI Lab

I build speech-centric AI across the full loop of interaction, understanding, and generation — from real-time full-duplex dialogue to expressive voice and singing synthesis.


About

I lead the Multimodal Interaction Team at Soul AI Lab, where we explore the next generation of human–AI interaction through speech, audio, and multimodal technologies.

My work focuses on building end-to-end systems that enable AI to listen, speak, sing, and interact naturally. I am particularly interested in bridging foundation models and real-world user experiences, transforming research advances into practical products.

As a first or corresponding author, I have led several open-source projects, including OpenCpop, Spark-TTS, SoulX-Podcast, SoulX-Singer, SoulX-Duplug, and SoulX-Transcriber.

Work Experience

News

Projects

SoulX-Transcriber

Multi-Speaker Speech Transcription

Corresponding Author · Project Leader

An end-to-end multi-speaker transcription system for long-form conversational audio. Jointly solves speaker attribution and speech recognition — who spoke, when, and what — with robust performance under rapid speaker switching and complex dialogue.

SoulX-Duplug

Realtime Full-Duplex Speech Conversation

Corresponding Author · Project Leader

A plug-and-play streaming semantic VAD model for real-time full-duplex speech conversation. Text-guided streaming state prediction enables low-latency, semantic-aware dialogue management in production systems.

SoulX-Singer

Zero-Shot Singing Voice Synthesis

Corresponding Author · Project Leader

A high-fidelity zero-shot singing voice synthesis model for unseen singers. Supports melody-conditioned (F0 contour) and score-conditioned (MIDI notes) control for precise pitch, rhythm, and expression.
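To illustrate how the two control signals relate, here is a minimal Python sketch that expands a note-level score (MIDI pitches with durations) into a frame-level F0 contour using the standard equal-temperament mapping (A4 = MIDI 69 = 440 Hz). This is not SoulX-Singer's code or API: the function names, the `(pitch, duration)` score format, and the 100 frames-per-second rate are assumptions made only for this example.

```python
A4_HZ = 440.0   # reference tuning for A4
A4_MIDI = 69    # MIDI note number of A4


def midi_to_hz(note: float) -> float:
    """Convert a MIDI note number to frequency in Hz (12-tone equal temperament)."""
    return A4_HZ * 2.0 ** ((note - A4_MIDI) / 12.0)


def score_to_f0(notes, frame_rate=100):
    """Expand a note-level score into a flat, frame-level F0 contour.

    notes: list of (midi_pitch or None, duration_seconds); None marks a rest.
    frame_rate: frames per second (hypothetical value chosen for this sketch).
    Rests are written as 0.0 Hz, a common convention for unvoiced frames.
    """
    contour = []
    for pitch, duration in notes:
        n_frames = round(duration * frame_rate)
        value = 0.0 if pitch is None else midi_to_hz(pitch)
        contour.extend([value] * n_frames)
    return contour


# A4 for 0.5 s, a 0.25 s rest, then C5 for 0.5 s.
score = [(69, 0.5), (None, 0.25), (72, 0.5)]
f0 = score_to_f0(score)
```

A score yields a piecewise-constant contour like this one, while a melody extracted from a reference recording also carries vibrato and pitch glides, which is one reason a system might offer both kinds of conditioning.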

SoulX-Podcast

Long-form Podcast Generation

Corresponding Author · Project Leader

Podcast-style multi-turn, multi-speaker dialogic speech generation with paralinguistic controls. Supports Mandarin, English, and Chinese dialects including Sichuanese, Henanese, and Cantonese.

Spark-TTS

LLM-Based Text-to-Speech

First Author · Project Leader

Built on BiCodec, a single-stream speech codec that decomposes speech into semantic and global tokens. Combines BiCodec with the Qwen2.5 LLM and chain-of-thought generation for coarse- and fine-grained voice control.