Research Experiences
Diffusion-based Speech-Text Language Model
NUS - Tencent
Sep 2025 – Present
Developed DiffuSpeech, the first diffusion-based speech-text language model to support both understanding and generation, introducing a "Silent Thought, Spoken Answer" paradigm in which internal text reasoning informs spoken responses. Unified discrete text and tokenized speech under a single masked diffusion framework with modality-specific masking schedules. Constructed ThinkingTalk, the first speech QA dataset with paired text reasoning traces (26K samples, 319 hours). Achieved state-of-the-art speech-to-speech QA accuracy (+9 points over the best baseline) and the best TTS quality among generative models (6.2% WER).
Diffusion LLM
Speech-Text
Multimodal Reasoning
Speech QA
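The modality-specific masking idea above can be illustrated with a minimal sketch of the forward (noising) step of masked diffusion over a mixed text/speech token sequence. All names, schedules, and token IDs here are illustrative assumptions, not the actual DiffuSpeech implementation:

```python
import random

# Illustrative per-modality masking schedules: at the same diffusion
# time t in [0, 1], text and speech tokens are masked with different
# probabilities. MASK_ID is a placeholder mask token.
MASK_ID = -1

def mask_prob(t, modality):
    # Hypothetical schedules: linear for text, sqrt for speech
    # (masks speech more aggressively at early t).
    return t if modality == "text" else t ** 0.5

def forward_mask(tokens, modalities, t, rng):
    """Apply one forward (noising) step of masked diffusion."""
    return [
        MASK_ID if rng.random() < mask_prob(t, m) else tok
        for tok, m in zip(tokens, modalities)
    ]

seq = [101, 102, 7, 8, 9]
mods = ["text", "text", "speech", "speech", "speech"]
noised = forward_mask(seq, mods, t=0.5, rng=random.Random(0))
```

At t=0 nothing is masked and at t=1 everything is, matching the usual masked-diffusion boundary conditions; the per-modality schedule only changes the path between those endpoints.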
Efficient Foundation Models with Mixture of Experts
NUS - Apple
Sep 2024 – May 2025
Developed MoST, a novel speech-text foundation model built on a Modality-Aware Mixture of Experts (MAMOE) architecture; achieved competitive performance across multiple speech-text benchmarks using exclusively open-source data. Developed MoRS, a four-stage distillation method that compresses large language models (70B parameters) into efficient MoE architectures (12B total, 3B activated) while preserving reasoning capabilities, improving benchmark scores by up to +14.5%. Created the first framework to distill dense LMs into MoE models without relying on pre-existing small models.
Mixture of Experts
Foundation Models
Knowledge Distillation
Speech-Text
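The modality-aware routing idea can be sketched in a few lines: each token is restricted to an expert subset determined by its modality before standard top-1 gating. Expert partitions and gate scores below are toy assumptions, not the actual MAMOE design:

```python
# Hypothetical modality-aware routing: text and speech tokens may only
# be dispatched to their own expert subsets; within the allowed subset,
# the token goes to the expert with the highest gate score.
EXPERTS_BY_MODALITY = {
    "text": [0, 1, 2, 3],    # experts eligible for text tokens
    "speech": [4, 5, 6, 7],  # experts eligible for speech tokens
}

def route(gate_scores, modality):
    """Top-1 gating restricted to the modality's allowed experts."""
    allowed = EXPERTS_BY_MODALITY[modality]
    return max(allowed, key=lambda e: gate_scores[e])

scores = [0.1, 0.9, 0.2, 0.0, 0.3, 0.8, 0.05, 0.4]
text_expert = route(scores, "text")      # expert 1
speech_expert = route(scores, "speech")  # expert 5
```

Restricting the candidate set per modality lets experts specialize without the router having to learn the text/speech split from scratch.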
Multimodal LLM Agent with Retrieval Augmented Planning
NUS - Panasonic
Oct 2023 – May 2024
Developed RAP, a multimodal planning agent that leverages past successful experiences to enhance decision-making. Developed EnvBridge, a multimodal embodied agent that transfers knowledge across diverse environments. Achieved state-of-the-art results on text-only environments (ALFWorld, WebShop) and significant gains on multimodal robotics benchmarks (Franka Kitchen, Meta-World, RLBench).
LLM Agents
Retrieval-Augmented Planning
Embodied AI
Cross-Environment Transfer
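The core retrieval step of a retrieval-augmented planner can be sketched as scoring past successful trajectories against the current task and injecting the best matches into the agent's context. The similarity measure and memory format below are simplified assumptions, not the actual RAP implementation:

```python
# Hypothetical experience memory: each entry pairs a past task
# description with the plan that succeeded on it.
def jaccard(a, b):
    """Word-overlap similarity between two task descriptions."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

def retrieve(task, memory, k=1):
    """Return the k past experiences most similar to the current task."""
    return sorted(memory, key=lambda exp: jaccard(task, exp["task"]),
                  reverse=True)[:k]

memory = [
    {"task": "put a clean mug on the shelf",
     "plan": ["find mug", "wash mug", "place on shelf"]},
    {"task": "heat an apple in the microwave",
     "plan": ["find apple", "open microwave", "heat"]},
]
hits = retrieve("put a clean plate on the shelf", memory)
```

In practice the similarity would come from an embedding model rather than word overlap, but the retrieve-then-plan loop is the same.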
Vision Model Scaling with Mixture of Experts
HPC-AI Lab
Mar 2021 – Jan 2022
Developed large-scale Mixture-of-Experts vision models, Sparse-MLP and WideNet. Proposed an all-MLP architecture with conditional computation in two directions, extending MoE to the spatial dimension of image representations.
Mixture of Experts
Vision Models
Conditional Computation
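The spatial-MoE direction can be illustrated by routing each image patch, rather than each channel, to one expert via top-1 gating. The experts, patches, and gate scores below are toy assumptions, not the actual Sparse-MLP layer:

```python
# Hypothetical spatial MoE: each patch (a small feature vector) is
# dispatched to the expert with the highest gate score for that patch.
def top1(gates):
    """Index of the highest-scoring expert."""
    return max(range(len(gates)), key=gates.__getitem__)

def spatial_moe(patches, patch_gates, experts):
    """Route each patch to one expert and apply it."""
    return [experts[top1(g)](p) for p, g in zip(patches, patch_gates)]

# Two toy "experts": distinct elementwise maps standing in for MLPs.
experts = [lambda p: [2 * x for x in p], lambda p: [x + 1 for x in p]]
patches = [[1.0, 2.0], [3.0, 4.0]]
gates = [[0.9, 0.1], [0.2, 0.8]]  # per-patch score for each expert
out = spatial_moe(patches, gates, experts)  # [[2.0, 4.0], [4.0, 5.0]]
```

Applying the same gating along the channel dimension as well gives conditional computation in both directions.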