2025 has been a strong and eventful year of progress in LLMs. The following is a list of personally notable and mildly surprising “paradigm changes” – things that altered the landscape and stood out to me conceptually.
1. Reinforcement Learning from Verifiable Rewards (RLVR)
At the start of 2025, the LLM production stack in all labs looked something like this:
- Pretraining (GPT-2/3, ~2020)
- Supervised Finetuning (InstructGPT, ~2022)
- Reinforcement Learning from Human Feedback (RLHF, ~2022)
This was the stable, proven recipe for training a production-grade LLM for a while. In 2025, Reinforcement Learning from Verifiable Rewards (RLVR) emerged as the de facto new major stage to add to this mix. By training LLMs against automatically verifiable rewards across a number of environments (think math and code puzzles), the models spontaneously develop strategies that look like "reasoning" to humans: they learn to break problem solving down into intermediate calculations, and they learn a number of strategies for going back and forth to figure things out (see the DeepSeek R1 paper for examples). These strategies would have been very difficult to achieve in the previous paradigms because it's not clear what the optimal reasoning traces and recoveries look like for the LLM; it has to discover what works for it through optimization against the rewards.
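To make the core loop concrete, here is a minimal, self-contained sketch of the RLVR idea: sample answers from a policy, score them with an automatic verifier, and reinforce whatever passed the check. Everything in it (the toy addition environment, the tabular "policy", the plain REINFORCE update) is a deliberately simplified stand-in of my own, not any lab's actual recipe; in practice the policy is an LLM emitting a full reasoning trace and the update is a PPO/GRPO-style algorithm.

```python
import math
import random

random.seed(0)

# Toy "environment": single-digit addition questions whose answers can be
# checked programmatically. The key RLVR ingredient is the verifier: the
# reward comes from an automatic check, not from a human preference label.
def make_task():
    a, b = random.randint(0, 9), random.randint(0, 9)
    return f"{a}+{b}", a + b

def verify(answer, target):
    return 1.0 if answer == target else 0.0

# Stand-in "policy": a table of logits over candidate answers 0..18 for each
# question. A real setup would be an LLM producing a reasoning trace that
# ends in an answer; the training loop below is the same in spirit.
logits = {}

def dist(question):
    ls = logits.setdefault(question, [0.0] * 19)
    m = max(ls)
    exps = [math.exp(l - m) for l in ls]
    z = sum(exps)
    return [e / z for e in exps]

LR = 1.0
for step in range(20_000):
    question, target = make_task()
    probs = dist(question)
    answer = random.choices(range(19), weights=probs)[0]
    reward = verify(answer, target)  # automatic, verifiable reward
    # REINFORCE: nudge logits toward sampled answers in proportion to reward.
    # d/d logit_a of log p(answer) = 1[a == answer] - p(a)
    ls = logits[question]
    for a in range(19):
        grad = (1.0 if a == answer else 0.0) - probs[a]
        ls[a] += LR * reward * grad

# After training, the policy should put most of its probability on correct sums.
q, t = "3+4", 7
print(q, "->", max(range(19), key=lambda a: dist(q)[a]), "target:", t)
```

Even in this stripped-down form, the structural point survives: nothing labels which intermediate behavior is "good". The only supervision signal is whether the final answer verifies, and the policy is free to discover whatever strategy maximizes that signal, which is exactly the opening that lets a real LLM develop its own reasoning traces.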