Yuxuan Lou

Ph.D. Student in Computer Science · National University of Singapore
yuxuanlou@u.nus.edu · +65 8260 0153

I am a Ph.D. student in Computer Science at the National University of Singapore, advised by Prof. Yang You. I received my B.Sc. in Applied Mathematics from Fudan University and my M.Sc. in Statistics from NUS, and was a research intern at Harvard University. My research interests lie in diffusion large language models, efficient LLM scaling with Mixture of Experts, and adapting multimodal foundation models from large language models. I am currently working on speech-text diffusion models, scalable MoE architectures, and large-scale pretrained models.


News

Jan 2026 Our paper DiffuSpeech is released — the first diffusion-based speech-text language model supporting both understanding and generation.
Sep 2025 Started research internship at Tencent, working on diffusion-based speech-text models.
Jan 2025 Our paper MoST is released with code — a modality-aware MoE for speech-text foundation models.
Oct 2024 Our paper EnvBridge is released — cross-environment knowledge transfer for embodied AI.
Feb 2024 Our paper RAP is released — retrieval-augmented planning for multimodal LLM agents.

Research Interests

  • Diffusion Large Language Models
  • Efficient Large Language Model Scaling with Mixture of Experts
  • Multimodal Foundation Model Adaptation from Large Language Models

Education

National University of Singapore

Ph.D. in Computer Science
School of Computing, HPC-AI Lab
Advised by Prof. Yang You
2023 – Present

National University of Singapore

M.Sc. in Statistics
Department of Statistics and Applied Probability
2020 – 2022

Harvard University

Research Intern
Computer Science Department, DAS Lab
2019 – 2020

Fudan University

B.Sc. in Applied Mathematics
School of Mathematical Sciences
2016 – 2020

Selected Publications

DiffuSpeech: Silent Thought, Spoken Answer via Unified Speech-Text Diffusion

Yuxuan Lou*, Ziming Wu*, Yaochen Wang, Yong Liu, Yingxuan Ren, Fuming Lai, Shaobing Lian, Jie Tang, Yang You

2026

MoST: Modality-Aware Mixture of Experts for Efficient Speech-Text Foundation Model

Yuxuan Lou, Kai Yang, Yang You

2025

MoRS: Distill Large Language Model into Mixture of Reasoning Students (In Submission)

Yuxuan Lou, Yang You

2025

EnvBridge: Bridging Diverse Environments with Cross-Environment Knowledge Transfer for Embodied AI

Tomoyuki Kagaya*, Yuxuan Lou*, Thong Jing Yuan*, Subramanian Lakshmi*, Jayashree Karlekar, Sugiri Pranata, Natsuki Murakami, Akira Kinose, Koki Oguri, Felix Wick, Yang You

2024

RAP: Retrieval-Augmented Planning with Contextual Memory for Multimodal LLM Agents

Tomoyuki Kagaya*, Yuxuan Lou*, Thong Jing Yuan*, Subramanian Lakshmi*, Jayashree Karlekar, Sugiri Pranata, Natsuki Murakami, Akira Kinose, Koki Oguri, Felix Wick, Yang You

2024

Cross-token Modeling with Conditional Computation

Yuxuan Lou, Fuzhao Xue, Zangwei Zheng, Yang You

2022

Research Experiences

Diffusion-based Speech-Text Language Model

NUS - Tencent
Sep 2025 – Present

Developed DiffuSpeech, the first diffusion-based speech-text language model supporting both understanding and generation, introducing a "Silent Thought, Spoken Answer" paradigm in which internal text reasoning informs spoken responses. Unified discrete text and tokenized speech under a single masked diffusion framework with modality-specific masking schedules. Constructed ThinkingTalk, the first speech QA dataset with paired text reasoning traces (26K samples, 319 hours). DiffuSpeech achieves state-of-the-art speech-to-speech QA accuracy (+9 points over the best baseline) and the best TTS quality among generative models (6.2% WER).

Diffusion LLM Speech-Text Multimodal Reasoning Speech QA

Efficient Foundation Models with Mixture of Experts

NUS - Apple
Sep 2024 – May 2025

Developed MoST, a speech-text foundation model featuring a Modality-Aware Mixture of Experts (MAMOE) architecture; achieved competitive performance across multiple speech-text benchmarks using exclusively open-source data. Developed MoRS, a four-stage distillation method that compresses large language models (70B) into efficient MoE architectures (12B total, 3B activated parameters) while preserving reasoning capabilities, with improvements of up to +14.5% on benchmarks. Created the first framework to distill dense LMs into MoE models without relying on pre-existing small models.

Mixture of Experts Foundation Models Knowledge Distillation Speech-Text

Multimodal LLM Agent with Retrieval Augmented Planning

NUS - Panasonic
Oct 2023 – May 2024

Developed RAP, a multimodal planning agent that leverages past successful experiences to enhance decision-making. Developed EnvBridge, a multimodal embodied agent that transfers knowledge across diverse environments. Achieved state-of-the-art results on text-only environments (ALFWorld, WebShop) and significant improvements on multimodal robotics benchmarks (Franka Kitchen, Meta-World, RLBench).

LLM Agents Retrieval-Augmented Planning Embodied AI Cross-Environment Transfer

Vision Model Scaling with Mixture of Experts

HPC-AI Lab
Mar 2021 – Jan 2022

Developed large-scale vision models based on Mixture of Experts: Sparse-MLP and WideNet. Proposed a fully-MLP architecture with conditional computation along two dimensions, extending MoE to the spatial dimension of image representations.

Mixture of Experts Vision Models Conditional Computation

Open Source Projects

Colossal-AI

Making large AI models cheaper, faster, and more accessible. A collection of parallel components for distributed training. Managed and contributed to Colossal-AI examples.

A curated collection of Mixture of Experts papers and projects.

MoST

Official implementation of MoST, a speech-text foundation model with Modality-Aware Mixture of Experts (MAMOE) architecture for enhanced cross-modal understanding.

RAP

Official implementation of RAP, a multimodal planning agent leveraging past successful experiences to enhance decision-making.


Professional Experience

Tencent

Research Intern
Sep 2025 – Feb 2026

Developed DiffuSpeech, the first diffusion-based speech-text language model supporting both understanding and generation. Collaborated on building the ThinkingTalk dataset and speech-text diffusion framework.

Diffusion LLM Speech-Text Multimodal Reasoning

HPC-AI Tech

Deep Learning Engineer
Feb 2022 – Dec 2022

Co-developed Colossal-AI and Colossal-AI Examples. Built an open-domain dialog system with internet knowledge augmentation.

Colossal-AI Distributed Training Dialog Systems

Interactive Entertainment Group, Tencent

Machine Learning Engineer Intern
Mar 2020 – Jun 2020

Mined and cleaned user comment data. Built machine learning models for emotion classification and deep-learning-based abstractive text summarization.

Data Mining NLP Text Summarization

Skills & Technologies

GPU Training
PyTorch, DeepSpeed, Megatron-LM, Colossal-AI, HuggingFace Transformers/Accelerate, vLLM, FlashAttention (NVIDIA GPU clusters)
TPU Training
TensorFlow, JAX/Flax, Keras (Google Cloud TPU pods)
Parallel Training & Optimization
Model parallel, tensor parallel, pipeline parallel, sequence parallel, data parallel, mixture-of-experts parallel training

Activities & Interests

During my undergraduate years, I was a member of the School of Mathematical Sciences debate team. I enjoyed presenting evidence, defending and questioning arguments, and developing persuasion techniques. We won the 2017 Fudan Debating Championship.

I also served as a volunteer mathematics teacher for kindergarten children in Yangpu District, introducing basic mathematical concepts.

In my free time, I enjoy reading science fiction. The Three-Body Problem and the Foundation series are my favourites.

Last updated in February 2026