Phi-4-Reasoning: Microsoft's 14B Model That Punches Above Its Weight
A compact model outperforming giants in reasoning, coding, and problem-solving tasks
Microsoft has unveiled Phi-4-Reasoning, a 14-billion parameter model designed for complex reasoning tasks—and it’s making waves in the AI community. Despite its relatively modest size, this model rivals and often outperforms much larger models on benchmarks involving math, science, coding, planning, and symbolic problem-solving. Microsoft didn’t stop there: they also introduced Phi-4-Reasoning-Plus, a reinforced version of the model that goes even further by incorporating reinforcement learning (RL) to enhance reasoning capabilities. Together, these models show how thoughtful training strategies and data curation can produce small models that deliver performance traditionally expected from massive models.
Phi-4-Reasoning Performance Across Benchmarks
Phi-4-Reasoning sets a new standard by excelling across multiple reasoning domains, including mathematics, scientific question answering, and code generation. It not only surpasses its own base model, Phi-4, but also beats much larger competitors such as DeepSeek-R1-Distill-Llama-70B and OpenAI's o1-mini on many benchmarks. With Phi-4-Reasoning-Plus, Microsoft pushes performance even closer to frontier models like o3-mini and the 671B-parameter DeepSeek-R1, with only a fraction of the parameters. These results underscore how smaller models can close the performance gap through careful data curation and training.

AIME 2025 & Test-Time Compute
In high-stakes benchmarks like AIME 2025 (the American Invitational Mathematics Examination, a top-tier math competition), performance can improve significantly with more inference-time computation. By sampling multiple candidate answers and aggregating them, for instance by majority vote, Phi-4-Reasoning-Plus demonstrates that “thinking harder” during inference pays off. When run with 64 generations, it actually exceeds the performance of its teacher model, o3-mini, showcasing the strength of inference-time scaling. This result is particularly impressive because AIME 2025 was released after the model's training data was collected, so it had zero exposure to these problems during training.
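To make the idea concrete, here is a minimal Python sketch of this kind of parallel test-time scaling via majority voting. The `generate` callable is a hypothetical stand-in for whatever sampling endpoint serves the model; it is not part of any Phi-4 API.

```python
from collections import Counter
from typing import Callable

def majority_vote(prompt: str, generate: Callable[[str], str], n: int = 64) -> str:
    """Sample n independent completions and return the most common answer.

    `generate` is a hypothetical function that draws one completion at
    temperature > 0 and extracts its final answer string.
    """
    answers = [generate(prompt) for _ in range(n)]
    # The answer that appears most often across the n samples wins the vote.
    return Counter(answers).most_common(1)[0][0]
```

The intuition is simple: independent samples make independent mistakes, but tend to agree on correct answers, so the modal answer is usually better than any single draw.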

SFT Training Curves:
The supervised fine-tuning (SFT) phase for Phi-4-Reasoning shows a steady climb in accuracy, especially on reasoning-heavy datasets like AIME and GPQA. As training progresses, the model’s responses become shorter yet more effective, indicating that it is learning to reason more efficiently. The model also begins using structured reasoning tags like <think> and </think> early in training, but the quality of its reasoning improves over time. These observations show that SFT doesn’t just teach format—it teaches the model how to think.
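As a rough illustration (the template below is an assumption based on the tags the article mentions, not a verbatim Phi-4 prompt), a response in this format looks something like:

```python
# Hypothetical example of the structured output format: the chain of
# thought goes inside <think> tags, and the final answer comes after.
response = (
    "<think>\n"
    "The triangle has legs 3 and 4, so the hypotenuse is sqrt(9 + 16) = 5.\n"
    "</think>\n"
    "The hypotenuse is 5."
)
```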

Training Experiments:
Microsoft conducted extensive training experiments to perfect the SFT recipe for reasoning. They tested various combinations of math, coding, and alignment data, finding that using synthetic prompts and short, verifiable solutions had a large positive impact. A reasoning-specific system message helped standardise and improve output consistency. Crucially, they tuned data mixtures separately for each domain and then combined them—a strategy that preserved gains from both math and coding while maintaining robustness in safety and alignment.
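A toy sketch of that additive strategy is shown below; the source names and weights are hypothetical placeholders for illustration, not Microsoft's actual recipe.

```python
# Hypothetical per-domain mixtures, each tuned in isolation.
math_mix   = {"math_synthetic": 0.7, "math_curated": 0.3}
code_mix   = {"code_synthetic": 0.6, "code_curated": 0.4}
safety_mix = {"alignment_data": 1.0}

def combine(mixes, proportions):
    """Merge separately tuned domain mixtures into one SFT mixture,
    scaling each domain's sources by that domain's overall share."""
    combined = {}
    for mix, share in zip(mixes, proportions):
        for source, weight in mix.items():
            combined[source] = combined.get(source, 0.0) + share * weight
    return combined

# Illustrative domain shares: half math, 40% code, 10% safety/alignment.
sft_mixture = combine([math_mix, code_mix, safety_mix], [0.5, 0.4, 0.1])
```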

Reward Design for RL:
To guide reinforcement learning, Microsoft created a handcrafted reward function focused on accuracy, response length, and output structure. Correct answers were rewarded more if they were concise, while incorrect answers received higher rewards for longer, more thoughtful reasoning—encouraging the model to “think more” when it’s unsure. Repetitive or badly formatted outputs were penalised. This custom reward function avoided issues often seen with neural reward models, like reward hacking, and provided a clear training signal aligned with human expectations.
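A simplified sketch of such a rule-based reward is below. The exact weights and thresholds are assumptions for illustration, since the precise formula is not reproduced here; only the qualitative behaviour matches the description above.

```python
def reward(correct: bool, length: int, well_formed: bool,
           repetitive: bool, target_len: int = 4096) -> float:
    """Hand-crafted reward sketch (weights are hypothetical).

    Correct answers earn more when concise; incorrect answers earn more
    when the model reasons at length, nudging it to "think more" when
    unsure. Repetition and broken formatting are penalised.
    """
    effort = min(length / target_len, 1.0)
    if correct:
        base = 1.0 - 0.5 * effort    # concise correct answers score highest
    else:
        base = -1.0 + 0.5 * effort   # longer reasoning softens the penalty
    if repetitive:
        base -= 0.5                  # discourage degenerate loops
    if not well_formed:              # e.g. missing <think> tags
        base -= 0.25
    return base
```

Because the rule is fixed and transparent, there is no learned reward model for the policy to exploit, which is why this design sidesteps reward hacking.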

RL Training Dynamics:
Remarkably, just 90 steps of reinforcement learning significantly improved Phi-4-Reasoning's performance: on AIME 2024 and 2025, accuracy jumped by more than 10 points, validating the power of even a small amount of RL. The average length of correct responses increased gradually, suggesting the model was learning to think more carefully. Incorrect answers saw even faster growth in response length, as the reward function encouraged more effort in exactly those cases, showing the model was adapting its reasoning process dynamically.

Head-to-Head Accuracy:
Phi-4-Reasoning-Plus holds its own even against large-scale models. It scores 78% on AIME 2025, outperforming the 70B-parameter DeepSeek-R1-Distill-Llama-70B as well as OpenAI's o1-mini. On GPQA (graduate-level science questions) and LiveCodeBench (code generation), it delivers similarly strong results. These numbers show that this small model doesn't just do well; it competes at the very top of the reasoning leaderboard.

Benchmark Suite Overview:
Microsoft evaluated the Phi-4 models across a wide variety of reasoning tasks—from symbolic logic (3SAT) and planning (TSP, Calendar) to spatial understanding (Maze, Spatial Map). The models delivered strong results across the board, often surpassing much larger models. The addition of reinforcement learning in Phi-4-Reasoning-Plus further boosted generalisation to domains that weren’t explicitly covered during training. This breadth of capability suggests Phi-4 is not just a specialist—it’s a well-rounded problem solver.

Accuracy Variance on AIME 2025:
One key insight from Microsoft's evaluation is that single-run accuracy scores can be misleading. On AIME 2025, where the small number of questions amplifies noise, model performance varied widely across runs even with identical prompts and temperature settings. By running 50 independent evaluations, Microsoft showed that Phi-4-Reasoning-Plus has a consistent, narrow score distribution that overlaps strongly with that of the powerful o3-mini. This analysis highlights the importance of statistical rigour when evaluating reasoning models.
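In code, reporting a distribution instead of a single score is straightforward. This sketch assumes a hypothetical `evaluate_once` callable that runs the full benchmark once and returns that run's accuracy.

```python
import statistics
from typing import Callable

def accuracy_distribution(evaluate_once: Callable[[], float],
                          runs: int = 50) -> tuple[float, float]:
    """Run the benchmark `runs` times with identical settings and
    summarise the spread of scores instead of trusting one draw."""
    scores = [evaluate_once() for _ in range(runs)]
    return statistics.mean(scores), statistics.stdev(scores)

# Reporting mean and standard deviation over 50 runs is far more
# informative than a single-run accuracy on a small exam like AIME.
```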

AIME Accuracy by Year:
Performance on AIME questions varies significantly by year, and the Phi-4 models show strong consistency across decades. Even difficult years like 1994 and 2025, where most models struggle, see respectable performance from Phi-4-Reasoning-Plus. The trend shows steady improvement over the base Phi-4 model. This suggests the model has learned robust and transferable reasoning skills, not just patterns memorised from training data.

Final Thoughts
Phi-4-Reasoning and Phi-4-Reasoning-Plus represent a breakthrough in efficient AI reasoning. With just 14B parameters, they match or exceed much larger models across critical reasoning, planning, and code generation benchmarks. Microsoft’s careful design—centred on curated data, supervised fine-tuning, and light reinforcement learning—shows how thoughtful methodology can outperform brute scale. These models are a glimpse into the future of lean, capable AI systems that are more accessible and compute-efficient, without compromising performance.