DeepSeek's approach with R1 wasn't pure RL - they used RL only to develop R0 fro...

HarHarVeryFunny on Feb 7, 2025 | on: Understanding Reasoning LLMs

DeepSeek's approach with R1 wasn't pure RL - they used RL alone only to develop R1-Zero ("R0") from their V3 base model, but then went through two iterations of using the current model to generate synthetic reasoning data, doing SFT on that, then RL fine-tuning, and repeat.
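To make the ordering concrete, here is a rough sketch of the loop as described in the comment. It is not DeepSeek's code: every function name is hypothetical, the stages are placeholders that only tag strings so the sequence is visible, and restarting SFT from the base model each round is an assumption the comment doesn't spell out.

```python
from typing import List, Tuple

# Hypothetical placeholder stages: no training happens, each one just
# returns a label so the order of operations can be inspected.
def rl_finetune(model: str) -> str:
    return f"RL({model})"

def generate_reasoning_data(model: str, n: int = 3) -> List[str]:
    # Stand-in for sampling synthetic reasoning traces from the current model.
    return [f"trace {i} from {model}" for i in range(n)]

def sft(base: str, data: List[str]) -> str:
    # Assumption: SFT restarts from the base model on the new traces.
    return f"SFT({base}, {len(data)} traces)"

def pipeline(base: str, rounds: int = 2) -> Tuple[str, List[str]]:
    log = [f"RL only on {base}"]      # the RL-only model ("R0" in the comment)
    model = rl_finetune(base)
    for r in range(1, rounds + 1):
        data = generate_reasoning_data(model)   # current model writes the data
        log.append(f"round {r}: generate data")
        model = sft(base, data)
        log.append(f"round {r}: SFT")
        model = rl_finetune(model)
        log.append(f"round {r}: RL")
    return model, log

final_model, stages = pipeline("V3-base")
for stage in stages:
    print(stage)
```

The point is just that RL appears three times (once up front, once per round) with data generation and SFT sandwiched in between, rather than one pure-RL run.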