I am a PhD student at the University of Illinois Urbana-Champaign, starting in Fall 2024, advised by Prof. Hao Peng.
Previously, I was fortunate to work with Prof. Zhiyuan Liu at THUNLP and Prof. Heng Ji at UIUC.
My research pursues scalable AI systems that can ultimately push the frontier of human knowledge, by (1) automating AI research through self-evolution or scalable oversight, and (2) advancing science. To this end, I work on:
- Scalable data synthesis that enables us to keep scaling compute to improve LLMs [UltraFeedback, UltraInteract/Eurus].
- Scalable evaluation that unlocks and amplifies LLMs’ ability to provide feedback for both training and inference [Implicit PRM].
- Scalable training algorithms that incorporate such feedback to enhance LLMs and, in return, help improve feedback quality [PRIME/Eurus-2, RL Compositionality].
🔥 News
- 2025.10: We release CritPt, a benchmark curated by 50+ active physics researchers to evaluate LLMs’ reasoning ability in frontier physics research. No existing model except GPT-5 can consistently solve even a single problem! Still a looong way to go for LLMs to drive real-world scientific discovery.
- 2025.09: We show that LLMs learn compositional skills through RL, and these skills are transferable to unseen tasks. The prerequisite is that atomic skills already exist in the base model and that RL training tasks incentivize composition properly.
- 2025.05: We identify that entropy collapse impedes the scaling of RL, and that, without intervention, the performance ceiling is surprisingly predictable.
- 2025.05: We find that simply minimizing entropy to squeeze out LLMs’ capabilities works surprisingly well. We therefore call attention to the importance of base models and urge a second thought on the recent fever over zero/few-shot RL.
- 2025.05: We reveal that RL intrinsically leads to sparse parameter updates, whereas SFT updates parameters densely. Check out our paper here.
- 2025.01: Eurus has been accepted to ICLR 2025.
- 2025.01: We introduce PRIME, a scalable RL solution for advanced reasoning through implicit process rewards! We also release Eurus-2, which is trained from Qwen2.5-Math-Base to surpass Qwen2.5-Math-Instruct using only 1/10 of the data.
- 2024.12: We release Implicit PRM: get free process rewards for your model without process labels! Alongside, we also release the SOTA Llama-3.1-8B-based PRMs!
📝 Publications
* denotes equal contribution
Preprints
- Probing the Critical Point (CritPt) of AI Reasoning: A Frontier Physics Research Benchmark
- RLPR: Extrapolating RLVR to General Domains without Verifiers

2025
- NeurIPS 2025
- NeurIPS 2025
- TTRL: Test-Time Reinforcement Learning. NeurIPS 2025
- ICML 2025
- ICLR 2025
- The Right Time Matters: Data Arrangement Affects Zero-Shot Generalization in Instruction Tuning. Findings of ACL 2025

2024
- Noise Contrastive Alignment of Language Models with Explicit Rewards. NeurIPS 2024
- Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment. EMNLP 2024
- ICML 2024
- ICML 2024
- ICLR 2024
- ICLR 2024

2023
- Beat LLMs at Their Own Game: Zero-Shot LLM-Generated Text Detection via Querying ChatGPT. EMNLP 2023
- Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations. NeurIPS 2023 (Datasets and Benchmarks Track)
- Removing Backdoors in Pre-trained Models by Regularized Continual Pre-training. TACL 2023
- A Close Look into the Calibration of Pre-trained Language Models. ACL 2023
- Bridge the Gap Between CV and NLP! A Gradient-based Textual Adversarial Attack Framework. Findings of ACL 2023
- From Adversarial Arms Race to Model-centric Evaluation: Motivating a Unified Automatic Robustness Evaluation Framework. Findings of ACL 2023

2022
- A Unified Evaluation of Textual Backdoor Learning: Frameworks and Benchmarks. NeurIPS 2022 (Datasets and Benchmarks Track, Spotlight)
- FactMix: Using a Few Labeled In-domain Examples to Generalize to Cross-domain Named Entity Recognition. COLING 2022 (Oral)
- Deep Clustering and Visualization for End-to-End High-Dimensional Data Analysis. IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2022
🔧 Projects
- TinyZero: A Minimal Reproduction of Reasoning Models [GitHub Repo] [Tweet]
  12K+ stars; featured in CNBC, The Independent, Tom's Hardware, Daily Cal, Xinhua News, etc.
💬 Invited Talks
- 2025.10, From f(x) and g(x) to f(g(x)): LLMs Learn New Skills in RL by Composing Old Ones [slides], ASAP Seminar.
- 2025.02, Process Reinforcement through Implicit Rewards [slides], Google DeepMind.
💻 Internships
- 2025.06 - 2025.10, Student Researcher at Google DeepMind, Mountain View.
📄 Academic Services
Reviewer:
- NeurIPS (2022-2025), ICLR (2024-2025), ICML (2024-2025), COLM (2025), ACL (2023), EMNLP (2022-2023), ARR (2022-2024)
Workshop Organizer:
- The 1st Workshop on Test-Time Scaling and Reasoning Models (SCALR), COLM 2025.