I am a PhD student at the University of Illinois Urbana-Champaign, starting in Fall 2024, advised by Prof. Hao Peng. Previously, I was fortunate to work with Prof. Zhiyuan Liu at THUNLP and Prof. Heng Ji at UIUC.

My research pursues scalable AI systems that can ultimately push the frontier of human knowledge, namely by (1) automating AI research through self-evolution or scalable oversight, and (2) advancing science. To this end, I work on:

  • Scalable data synthesis that makes it possible to keep scaling compute to improve LLMs [UltraFeedback, UltraInteract/Eurus].
  • Scalable evaluation that unlocks and amplifies LLMs’ ability to provide feedback for both training and inference [Implicit PRM].
  • Scalable training algorithms that incorporate such feedback to enhance LLMs and, in turn, help improve feedback quality [PRIME/Eurus-2, RL Compositionality].

🔥 News

  • 2025.10: We release CritPt, a benchmark curated by 50+ active physics researchers to evaluate LLMs’ reasoning ability in frontier physics research. No existing model except GPT-5 can consistently solve even a single problem! There is still a long way to go before LLMs can drive real-world scientific discovery.
  • 2025.09: We show that LLMs learn compositional skills through RL, and these skills are transferable to unseen tasks. The prerequisite is that atomic skills already exist in the base model and that RL training tasks incentivize composition properly.
  • 2025.05: We identify that entropy collapse impedes the scaling of RL, with the performance ceiling being surprisingly predictable absent intervention.
  • 2025.05: We find that simply minimizing entropy to squeeze out LLMs’ existing capabilities works surprisingly well. We therefore call attention to the importance of base models and urge a second thought on the recent fever over zero/few-shot RL.
  • 2025.05: We reveal that RL intrinsically leads to sparse parameter updates, whereas SFT updates densely. Check out our paper!
  • 2025.01: Eurus has been accepted to ICLR 2025.
  • 2025.01: We introduce PRIME, a scalable RL solution for advanced reasoning through implicit process rewards! We also release Eurus-2, which is trained from Qwen2.5-Math-Base to surpass Qwen2.5-Math-Instruct using only 1/10 of the data.
  • 2024.12: We release Implicit PRM: get free process rewards for your model without process labels! Alongside, we also release SOTA Llama-3.1-8B-based PRMs!

📝 Publications

Selected

* denotes equal contribution


  • From f(x) and g(x) to f(g(x)): LLMs Learn New Skills in RL by Composing Old Ones
    Lifan Yuan*, Weize Chen*, Yuchen Zhang, Ganqu Cui, Hanbin Wang, Ziming You, Ning Ding, Zhiyuan Liu, Maosong Sun, Hao Peng.
    Preprint
  • Process Reinforcement through Implicit Rewards
    Ganqu Cui*, Lifan Yuan*, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding.
    Preprint
  • Free Process Rewards without Process Labels
    Lifan Yuan*, Wendi Li*, Huayu Chen, Ganqu Cui, Ning Ding, Kaiyan Zhang, Bowen Zhou, Zhiyuan Liu, Hao Peng.
    ICML 2025
  • Advancing LLM Reasoning Generalists with Preference Trees
    Lifan Yuan*, Ganqu Cui*, Hanbin Wang*, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, Bowen Zhou, Hao Peng, Zhiyuan Liu, Maosong Sun.
    ICLR 2025; ICML 2024 Workshop On AI4Math
  • Executable Code Actions Elicit Better LLM Agents
    Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji.
    ICML 2024
  • UltraFeedback: Boosting Language Models with High-quality Feedback
    Ganqu Cui*, Lifan Yuan*, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, Maosong Sun.
    ICML 2024
  • Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations
    Lifan Yuan, Yangyi Chen, Ganqu Cui, Hongcheng Gao, Fangyuan Zou, Xingyi Cheng, Heng Ji, Zhiyuan Liu, Maosong Sun.
    NeurIPS 2023 (Datasets and Benchmarks Track)


All

Preprints


  • From f(x) and g(x) to f(g(x)): LLMs Learn New Skills in RL by Composing Old Ones
    Lifan Yuan*, Weize Chen*, Yuchen Zhang, Ganqu Cui, Hanbin Wang, Ziming You, Ning Ding, Zhiyuan Liu, Maosong Sun, Hao Peng.
  • Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark
    Minhui Zhu*, Minyang Tian*, Xiaocheng Yang, Tianci Zhou, Penghao Zhu, Eli Chertkov, Shengyan Liu, Yufeng Du, Lifan Yuan, Ziming Ji, Indranil Das, Junyi Cao, Jinchen He, Yifan Su, Jiabin Yu, Yikun Jiang, Yujie Zhang, Chang Liu, Ze-Min Huang, Weizhen Jia, Xinan Chen, Peixue Wu, Yunkai Wang, Juntai Zhou, Yong Zhao, Farshid Jafarpour, Jessie Shelton, Aaron Young, John Bartolotta, Wenchao Xu, Yue Sun, Anjun Chu, Victor Colussi, Chris Akers, Nathan Brooks, Wenbo Fu, Christopher Wilson, Jinchao Zhao, Marvin Qi, Anqi Mu, Yubo Yang, Allen Zang, Yang Lyu, Peizhi Mai, Xuefei Guo, Luyu Gao, Ze Yang, Chi Xue, Dmytro Bandak, Yaïr Hein, Yonatan Kahn, Kevin Zhou, John Drew Wilson, Jarrod T. Reilly, Di Luo, Daniel Inafuku, Hao Tong, Liang Yang, Ruixing Zhang, Xueying Wang, Ofir Press, Nicolas Chia, Eliu Huerta, Hao Peng.
  • Process Reinforcement through Implicit Rewards
    Ganqu Cui*, Lifan Yuan*, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding.
  • The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
    Ganqu Cui*, Yuchen Zhang*, Jiacheng Chen*, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, Ning Ding.
  • RLPR: Extrapolating RLVR to General Domains without Verifiers
    Tianyu Yu*, Bo Ji*, Shouli Wang*, Shu Yao*, Zefan Wang*, Ganqu Cui, Lifan Yuan, Ning Ding, Yuan Yao, Zhiyuan Liu, Maosong Sun, Tat-Seng Chua.

2025


  • Reinforcement Learning Finetunes Small Subnetworks in Large Language Models
    Sagnik Mukherjee, Lifan Yuan, Dilek Hakkani-Tur, Hao Peng.
    NeurIPS
  • The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning
    Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, Hao Peng.
    NeurIPS
  • TTRL: Test-Time Reinforcement Learning
    Yuxin Zuo*, Kaiyan Zhang*, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, Biqing Qi, Youbang Sun, Zhiyuan Ma, Lifan Yuan, Ning Ding, Bowen Zhou.
    NeurIPS
  • Free Process Rewards without Process Labels
    Lifan Yuan*, Wendi Li*, Huayu Chen, Ganqu Cui, Ning Ding, Kaiyan Zhang, Bowen Zhou, Zhiyuan Liu, Hao Peng.
    ICML
  • Advancing LLM Reasoning Generalists with Preference Trees
    Lifan Yuan*, Ganqu Cui*, Hanbin Wang*, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, Bowen Zhou, Hao Peng, Zhiyuan Liu, Maosong Sun.
    ICLR
  • The Right Time Matters: Data Arrangement Affects Zero-Shot Generalization in Instruction Tuning
    Bingxiang He*, Ning Ding*, Cheng Qian*, Jia Deng, Ganqu Cui, Lifan Yuan, Haiwen Hong, Huan-ang Gao, Longtao Huang, Hui Xue, Huimin Chen, Zhiyuan Liu, Maosong Sun.
    Findings of ACL

2024


  • Noise Contrastive Alignment of Language Models with Explicit Rewards
    Huayu Chen, Guande He, Lifan Yuan, Ganqu Cui, Hang Su, Jun Zhu.
    NeurIPS
  • Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment
    Yiju Guo*, Ganqu Cui*, Lifan Yuan, Ning Ding, Jiexin Wang, Huimin Chen, Bowen Sun, Ruobing Xie, Jie Zhou, Yankai Lin, Zhiyuan Liu, Maosong Sun.
    EMNLP
  • UltraFeedback: Boosting Language Models with High-quality Feedback
    Ganqu Cui*, Lifan Yuan*, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, Maosong Sun.
    ICML
  • Executable Code Actions Elicit Better LLM Agents
    Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji.
    ICML
  • CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets
    Lifan Yuan*, Yangyi Chen*, Xingyao Wang, Yi R. Fung, Hao Peng, Heng Ji.
    ICLR
  • MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback
    Xingyao Wang*, Zihan Wang*, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, Heng Ji.
    ICLR

2023


  • Beat LLMs at Their Own Game: Zero-Shot LLM-Generated Text Detection via Querying ChatGPT
    Biru Zhu, Lifan Yuan, Ganqu Cui, Yangyi Chen, Chong Fu, Bingxiang He, Yangdong Deng, Zhiyuan Liu, Maosong Sun, Ming Gu.
    EMNLP
  • Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations
    Lifan Yuan, Yangyi Chen, Ganqu Cui, Hongcheng Gao, Fangyuan Zou, Xingyi Cheng, Heng Ji, Zhiyuan Liu, Maosong Sun.
    NeurIPS (Datasets and Benchmarks Track)
  • Removing Backdoors in Pre-trained Models by Regularized Continual Pre-training
    Biru Zhu*, Ganqu Cui*, Yangyi Chen, Yujia Qin, Lifan Yuan, Chong Fu, Yangdong Deng, Zhiyuan Liu, Maosong Sun, Ming Gu.
    TACL
  • A Close Look into the Calibration of Pre-trained Language Models
    Yangyi Chen*, Lifan Yuan*, Ganqu Cui, Zhiyuan Liu, Heng Ji.
    ACL
  • Bridge the Gap Between CV and NLP! A Gradient-based Textual Adversarial Attack Framework
    Lifan Yuan*, Yichi Zhang*, Yangyi Chen, Wei Wei.
    ACL (Findings)
  • From Adversarial Arms Race to Model-centric Evaluation: Motivating a Unified Automatic Robustness Evaluation Framework
    Yangyi Chen*, Hongcheng Gao*, Ganqu Cui*, Lifan Yuan, Dehan Kong, Hanlu Wu, Ning Shi, Bo Yuan, Longtao Huang, Hui Xue, Zhiyuan Liu, Maosong Sun, Heng Ji.
    ACL (Findings)

2022


  • A Unified Evaluation of Textual Backdoor Learning: Frameworks and Benchmarks
    Ganqu Cui*, Lifan Yuan*, Bingxiang He, Yangyi Chen, Zhiyuan Liu, Maosong Sun.
    NeurIPS (Datasets and Benchmarks Track) (Spotlight)
  • FactMix: Using a Few Labeled In-domain Examples to Generalize to Cross-domain Named Entity Recognition
    Lifan Yuan*, Linyi Yang*, Leyang Cui, Wenyang Gao, Yue Zhang.
    COLING (Oral)
  • Deep Clustering and Visualization for End-to-End High-Dimensional Data Analysis
    Lirong Wu*, Lifan Yuan*, Guojiang Zhao, Haitao Lin, Stan Z. Li.
    IEEE Transactions on Neural Networks and Learning Systems (TNNLS)

🔧 Projects

  • TinyZero: A Minimal Reproduction of Reasoning Models [GitHub Repo] [Tweet]

    12K+ Stars; Featured in CNBC, The Independent, Tom's Hardware, Daily Cal, Xinhua News, etc.

    Jiayi Pan, Junjie Zhang, Xingyao Wang, Lifan Yuan, Hao Peng, Alane Suhr.

💬 Invited Talks

  • 2025.10, From f(x) and g(x) to f(g(x)): LLMs Learn New Skills in RL by Composing Old Ones [slides], ASAP Seminar.
  • 2025.02, Process Reinforcement through Implicit Rewards [slides], Google DeepMind.

💻 Internships

  • 2025.06 - 2025.10, Student Researcher at Google DeepMind, Mountain View.

📄 Academic Services

Reviewer:

NeurIPS (2022-2025), ICLR (2024-2025), ICML (2024-2025), COLM (2025), ACL (2023), EMNLP (2022-2023), ARR (2022-2024)

Workshop Organizer:

  • The 1st Workshop on Test-Time Scaling and Reasoning Models (SCALR), COLM 2025.