DeepSeek R1 and the Trillion Dollar Question
On 20 Jan 2025, the DeepSeek R1 research paper (and the chatbot) was released: github.com/deepseek-ai/…
This caused a market frenzy, wiping out almost A$1 trillion of market value overnight:
So, what’s the deal here? (why all the frenzy?)
Let’s step back to see what happened here:
DeepSeek released a reasoning model (R1) similar to OpenAI's o1, making it the first company after OpenAI to publicly release one
Other companies, such as Google (Gemini 2.0 Flash Thinking, Deep Research) and Anthropic, are working on or have prototypes of similar reasoning models, but DeepSeek's model is open-sourced and much cheaper to access (roughly 12 times cheaper for API access than OpenAI's)
DeepSeek's release of R1, particularly coming from a Chinese company, surprised the industry. Prior to this, the general consensus was that China was 6 to 12 months behind in AI, but R1's release suggests China may be only 3 to 6 months behind
This changes the perception of how far along China is in AI development, compressing the timeline for how competitive they are
DeepSeek's decision to open source their model and release the white paper on how it was developed provides transparency, allowing others to stress test and replicate it at a much lower cost
This open-source approach enables further innovation at a lower entry barrier, particularly in comparison to the closed models from other companies like OpenAI
Turning to the evolution of AI models, and in particular the difference between base LLMs (Large Language Models) and reasoning models, here's a breakdown:
1. Base LLMs (e.g., ChatGPT, DeepSeek's V3)
These models are designed to provide quick answers to questions, similar to how a highly knowledgeable person might give you an answer immediately without further breakdown.
They are considered "smart PhDs," offering straightforward responses.
2. Reasoning Models (e.g., OpenAI's o1, DeepSeek's R1)
These models use reinforcement learning as a process separate from pre-training.
Instead of giving immediate answers, they break down complicated questions into smaller tasks and solve them step by step. This is referred to as Chain of Thought (CoT); a minimal illustration follows just after this list.
This type of model allows for more complex problem-solving than just providing quick answers.
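To make the distinction concrete, here is a minimal sketch of the prompting difference. It is plain Python, not tied to any particular model or API, and the question and wording are illustrative only:

```python
# Minimal illustration of direct-answer vs Chain-of-Thought prompting.
# Not tied to any specific API or model.

question = "A train travels 120 km in 1.5 hours. What is its average speed?"

# Direct-answer style: the model is expected to reply immediately, e.g. "80 km/h".
direct_prompt = f"Question: {question}\nAnswer:"

# Chain-of-Thought style: the model is asked to decompose the problem first.
cot_prompt = (
    f"Question: {question}\n"
    "Think step by step: restate what is given, break the problem into smaller "
    "sub-steps, solve each one, then state the final answer on its own line."
)

print(direct_prompt)
print("---")
print(cot_prompt)
```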
The groundbreaking method introduced by DeepSeek for developing reasoning capabilities in large language models (LLMs) primarily focuses on reinforcement learning (RL). Here’s a breakdown of the novel aspects:
Pure Reinforcement Learning (RL) without Supervised Fine-Tuning:
Most models are traditionally trained using supervised learning, where they learn from labeled data. In contrast, DeepSeek's R1 models (R1-Zero in particular) leverage pure RL, which allows the model to develop reasoning capabilities autonomously by learning from feedback on its own outputs, rather than relying on predefined correct answers or examples.
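As a rough intuition for learning from feedback alone, here is a deliberately simplified, bandit-style toy in Python. It is not DeepSeek's training code; the strategies, success rates, and update rule are all invented for illustration, and real RL for LLMs updates model weights over token probabilities:

```python
import math
import random

# Toy sketch (heavily simplified, not DeepSeek's actual training code):
# a "policy" chooses between answer strategies and is updated only from a
# reward signal, never from labelled answers.

strategies = ["answer_immediately", "decompose_then_answer"]
preferences = {s: 0.0 for s in strategies}  # the policy's learnable scores

def reward(strategy: str) -> float:
    """Stand-in verifier: decomposing the problem happens to succeed more often."""
    success_rate = {"answer_immediately": 0.3, "decompose_then_answer": 0.8}
    return 1.0 if random.random() < success_rate[strategy] else 0.0

def sample(prefs: dict) -> str:
    """Sample a strategy with probability proportional to exp(preference)."""
    weights = [math.exp(prefs[s]) for s in strategies]
    return random.choices(strategies, weights=weights)[0]

learning_rate = 0.1
for _ in range(2000):
    s = sample(preferences)
    r = reward(s)
    preferences[s] += learning_rate * (r - 0.5)  # reinforce rewarded behaviour

print(preferences)  # "decompose_then_answer" ends up with the higher score
```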
Group Relative Policy Optimization (GRPO):
This RL technique, introduced by DeepSeek, builds on the Proximal Policy Optimization (PPO) framework but is specifically designed to improve mathematical reasoning while reducing memory usage: instead of training PPO's separate critic (value) model, it estimates the baseline from a group of sampled answers to the same prompt. This lets the model handle complex reasoning tasks with fewer computational resources.
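A minimal sketch of the group-relative advantage idea, assuming the description in DeepSeek's papers (a group of answers to the same prompt is scored, and each answer's reward is normalized against the group's mean and standard deviation in place of a learned critic); the reward values below are made up:

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalise each reward against its own group's mean and std deviation."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 1.0
    sigma = sigma or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]

# Example: 6 sampled answers to one math prompt, scored 1.0 if correct else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
print(group_relative_advantages(rewards))
# Correct answers get positive advantages (their tokens are reinforced),
# wrong ones get negative advantages, with no separate critic model needed.
```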
Reward Mechanisms:
Rather than training a separate learned reward model, the RL stage relies mainly on simple rule-based rewards: an accuracy reward that checks whether the final answer is verifiably correct (for example, a math result or passing test cases for code) and a format reward that encourages the model to keep its reasoning inside designated thinking tags.
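A hedged sketch of what such rule-based rewards could look like; the tag names, matching rules, and scores below are illustrative assumptions, not DeepSeek's exact implementation:

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the response wraps its reasoning in <think>...</think> tags."""
    return 1.0 if re.search(r"<think>.*?</think>", response, re.DOTALL) else 0.0

def accuracy_reward(response: str, reference_answer: str) -> float:
    """1.0 if the text after the reasoning block matches the reference answer."""
    final = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    return 1.0 if final == reference_answer.strip() else 0.0

response = "<think>120 km over 1.5 h means 120 / 1.5 = 80.</think>80 km/h"
print(format_reward(response), accuracy_reward(response, "80 km/h"))  # 1.0 1.0
```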
Dynamic Thinking Process:
One of the key breakthroughs in DeepSeek's approach is how the model learns to dynamically allocate thinking time. Rather than being explicitly taught how to solve specific problems, the model autonomously develops strategies such as re-evaluating its initial line of attack, discovering new, more efficient problem-solving techniques purely from the incentives (rewards) given during training.
Cold-Start Data and Reasoning-Oriented Reinforcement Learning:
DeepSeek-R1’s training starts with a cold-start phase, where only a small amount of Chain-of-Thought (CoT) data is collected through methods like few-shot prompting and detailed reflective prompts. This minimizes the reliance on large labeled datasets and places emphasis on reasoning-oriented learning through RL.
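A sketch of how such cold-start CoT data might be collected via few-shot prompting; the generate function is a hypothetical placeholder for any LLM client, and the exemplar is invented for illustration:

```python
# Hedged sketch of assembling cold-start Chain-of-Thought data via few-shot
# prompting. `generate` is a hypothetical text-generation call; the exemplar
# below is illustrative, not taken from DeepSeek's data.

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client of your choice here")

FEW_SHOT_EXEMPLAR = (
    "Question: A train travels 120 km in 1.5 hours. What is its average speed?\n"
    "<think>Speed is distance divided by time: 120 / 1.5 = 80.</think>\n"
    "Answer: 80 km/h\n\n"
)

def collect_cold_start_example(question: str) -> dict:
    """Prompt with a long-CoT exemplar and keep the (question, reasoning) pair."""
    prompt = FEW_SHOT_EXEMPLAR + f"Question: {question}\n"
    completion = generate(prompt)
    return {"question": question, "cot_completion": completion}

# A small set of such pairs, filtered for readability, seeds the model before
# the main reasoning-oriented RL stage begins.
```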
Some further notes
DeepSeek's widely cited training cost is roughly $6 million (a figure originally reported for training its V3 base model).
This figure is hard to validate and likely reflects only the final training run, not the total cost.
A fair comparison would set the full cost of DeepSeek's R&D and model development against the holistic cost of OpenAI's or Anthropic's models, which includes hardware, years of research, and other expenses.
The $1 billion costs quoted for American companies (like OpenAI) likely include all development costs, whereas the $6 million figure refers only to the final training phase, making direct comparisons misleading.
The upshot: through optimizations at the software layer, the model has leapfrogged its competitors even though China lags a few generations behind in chip access. Consequently, expensive hardware for training advanced models, especially reasoning models, becomes less crucial. The focus is not on the "cheaper model" claim (which is not substantiated) but on what these innovations enable without the need for expensive hardware. Additionally, the model is open-sourced, promoting widespread use.
In essence, the novel approach DeepSeek introduces is an advanced RL-driven method that develops autonomous reasoning and problem-solving by rewarding the model for logical, structured thinking. It is a powerful leap in the development of reasoning abilities in AI systems, and that is what has spooked the markets.