Here is everything I’ve read on DeepSeek and the current state of AI (from models to chips) that was enlightening for me. I’ll include some quotes inline (all emphasis is mine), so if you don’t have time to click through the links, you can quickly skim here.
Link dump / Table of contents
Steven Sinofsky: DeepSeek Has Been Inevitable and Here’s Why (History Tells Us)
Ben Thompson: DeepSeek FAQ
Jamin Ball: on Twitter
Morgan Brown: on Twitter
Jeffrey Emanuel: The Short Case for Nvidia Stock
Andrej Karpathy: on Twitter
Ethan Mollick: on Twitter
Yishan Wong: on Twitter
[ADDED] Dario Amodei: On DeepSeek and Export Controls
[ADDED] Will Byrk: on Twitter
1. Steven Sinofsky: DeepSeek Has Been Inevitable and Here’s Why (History Tells Us)
Steven was there in the center of the arena as the tech industry shaped itself into what we recognize today, and his historical points of view are always extremely grounding. He recounts the many times a story like DeepSeek’s has played out before.
(In the early days of building out internet infrastructure, AT&T was) convinced that the right way to build the internet was to take their phone network and scale it up. Add more hardware and more protocols and a lot more wires and equipment to deliver on reliability, QoS, and so on. They weren’t alone. Europe was busy building out internet connectivity with ISDN over their telco networks. AT&T loved this because it took huge capital and relied on their existing infrastructure.
They were completely wrong. Cisco came along and delivered all those things on the IP-based network using toy software like DNS. Other toys like HTTP and HTML layered on top. Then came Apache, Linux, and a lot of browsers. Not only did the initial infrastructure prove to be the least interesting part, but it was also drawn into a scale out approach by a completely different player that had previously mostly served weird university computing infrastructure. Cisco did not have tens of billions of dollars nor did Netscape nor did CERN. They used what they could to deliver the information superhighway. The rest is history.
China faced an AI situation not unlike Cisco. Many (including “The Short Case”) are looking at the Nvidia embargo as a driver. The details don’t really matter. It is just that they had different constraints. They had many more engineers to attack the problem than they had data centers to train. They were inevitably going to create a different kind of solution.
2. Ben Thompson: DeepSeek FAQ
As expected, Ben gets into the details and covers far more ground than I can simply quote here. Highly recommend reading the entire article.
DeepSeek pulled this off despite the chip ban. Again, though, while there are big loopholes in the chip ban, it seems likely to me that DeepSeek accomplished this with legal chips.
DeepSeek is clear that these costs are only for the final training run, and exclude all other expenses. […] So no, you can’t replicate DeepSeek the company for $5.576 million.
Here’s the thing: a huge number of the innovations I explained above are about overcoming the lack of memory bandwidth implied in using H800s instead of H100s. Moreover, if you actually did the math on the previous question, you would realize that DeepSeek actually had an excess of computing; that’s because DeepSeek actually programmed 20 of the 132 processing units on each H800 specifically to manage cross-chip communications. This is actually impossible to do in CUDA. DeepSeek engineers had to drop down to PTX, a low-level instruction set for Nvidia GPUs that is basically like assembly language. This is an insane level of optimization that only makes sense if you are using H800s.
Distillation is a means of extracting understanding from another model; you can send inputs to the teacher model and record the outputs, and use that to train the student model. Distillation seems terrible for leading edge models. […] On the negative side, they (OpenAI, Anthropic, Google) are effectively bearing the entire cost of training the leading edge, while everyone else is free-riding on their investment.
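To make the distillation mechanism concrete, here is a minimal sketch (my own illustration, not any lab’s actual pipeline; it assumes white-box access to both models and Hugging Face-style modules that return an object with `.logits`): record the teacher’s output distribution on a batch of inputs and train the student to match it.

```python
# Minimal distillation sketch (illustrative only, not any lab's actual pipeline).
# Assumes `teacher` and `student` are language models sharing a tokenizer and
# returning logits of shape (batch, seq, vocab).
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, optimizer, input_ids, temperature=2.0):
    # 1. Record the teacher's output distribution for this batch (no gradients needed).
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits

    # 2. Run the student on the same inputs.
    student_logits = student(input_ids).logits

    # 3. Train the student to match the teacher's softened next-token distribution.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

When only the teacher’s generated text is available (as with a hosted API), the same idea collapses to ordinary supervised fine-tuning on prompt/response pairs sampled from the teacher, which is the “send inputs, record outputs” setup Ben describes.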
3. Jamin Ball: on Twitter
Jamin focuses on distillation and its implied second-order effects, a peek into the changes that are sure to follow.
Model distillation might be the most important shift happening in AI right now—and it’s reshaping the entire tech industry. It means anyone can take that super complex, SOTA model (that someone else spent billions on), spend a FRACTION of the cost and time distilling it, and end up with their own model that's nearly just as good.
If anyone can distill a high-performing model, the differentiator shifts to proprietary data for fine-tuning.
Will the large providers stop spending tons of money to develop SOTA models?? I doubt it... but it does question the ultimate business model of selling essentially API calls. I think the business models will have to slowly move more towards packaged products around a model vs selling infra
4. Morgan Brown: on Twitter
Morgan explains what DeepSeek did in layperson’s terms.
3/ How? They rethought everything from the ground up. Traditional AI is like writing every number with 32 decimal places. DeepSeek was like "what if we just used 8? It's still accurate enough!" Boom - 75% less memory needed.
4/ Then there's their "multi-token" system. Normal AI reads like a first-grader: "The... cat... sat..." DeepSeek reads in whole phrases at once. 2x faster, 90% as accurate. When you're processing billions of words, this MATTERS.
5/ But here's the really clever bit: They built an "expert system." Instead of one massive AI trying to know everything (like having one person be a doctor, lawyer, AND engineer), they have specialized experts that only wake up when needed. Traditional models? All 1.8 trillion parameters active ALL THE TIME. DeepSeek? 671B total but only 37B active at once. It's like having a huge team but only calling in the experts you actually need for each task.
7/ The results are mind-blowing:
- Training cost: $100M → $5M
- GPUs needed: 100,000 → 2,000
- API costs: 95% cheaper
- Can run on gaming GPUs instead of data center hardware
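To make the expert-routing idea in point 5/ a bit more concrete, here is a toy PyTorch sketch (my own illustration with made-up sizes, not DeepSeek-V3’s actual architecture): a router scores all experts for each token and only the top-k of them run, which is why only a small slice of the total parameters (37B of 671B in DeepSeek’s case) is exercised per token. The arithmetic behind point 3/ is simpler still: an 8-bit number occupies a quarter of the memory of a 32-bit one, hence roughly 75% less memory for whatever is stored at the lower precision.

```python
# Toy top-k mixture-of-experts layer (illustrative only; sizes and routing are
# simplified, not DeepSeek-V3's actual design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)   # scores every expert for each token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                              # x: (tokens, d_model)
        scores = self.router(x)                        # (tokens, n_experts)
        weights, chosen = scores.topk(self.k, dim=-1)  # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e            # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out                                     # only k of n_experts ran for each token

x = torch.randn(10, 64)
print(TinyMoE()(x).shape)   # torch.Size([10, 64])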
5. Jeffrey Emanuel: The Short Case for Nvidia Stock
Prescient take, given today’s reaction in the markets. As Jeffrey describes it, Nvidia’s moat has four components: high-quality Linux drivers, CUDA as an industry standard, the fast GPU interconnect technology from their 2019 Mellanox acquisition, and the flywheel effect where they can invest their enormous profits (75-90% margins in some cases!) into more R&D. Each of these is under threat. Jeffrey also makes some of the technical details of DeepSeek v3 and r1 very accessible, as well as covering ground on chip innovations. Highly recommend reading it in its entirety.
So we've always had a looming "data wall" when it comes to the original scaling law; although we know we can keep shoveling more and more capex into GPUs and building more and more data centers, it's a lot harder to mass produce useful new human knowledge which is correct and incremental to what is already out there. Now, one intriguing response to this has been the rise of "synthetic data," which is text that is itself the output of an LLM. And while this seems almost nonsensical that it would work to "get high on your own supply" as a way of improving model quality, it actually seems to work very well in practice, at least in the domain of math, logic, and computer programming. The reason, of course, is that these are areas where we can mechanically check and prove the correctness of things.
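A rough sketch of why this works specifically in math, logic, and programming (my own toy illustration, not from Jeffrey’s article; `generate_candidates` is a stand-in for sampling from whatever model you use): because correctness can be checked mechanically, you keep only the generations that pass the check and feed those back in as training data.

```python
# Sketch of synthetic-data generation with mechanical verification (illustrative only).
# `generate_candidates` stands in for sampling from an LLM; here it is stubbed out.
import random

def generate_candidates(problem, n=8):
    """Placeholder for sampling n candidate solutions from a model."""
    a, b = problem
    return [a + b + random.choice([0, 0, 0, 1, -1]) for _ in range(n)]  # mostly-right guesses

def check(problem, answer):
    """Mechanical verifier: for arithmetic we can simply recompute the ground truth."""
    a, b = problem
    return answer == a + b

def make_synthetic_dataset(problems):
    dataset = []
    for problem in problems:
        for answer in generate_candidates(problem):
            if check(problem, answer):              # keep only verifiably correct outputs
                dataset.append((problem, answer))
    return dataset

problems = [(random.randint(1, 99), random.randint(1, 99)) for _ in range(100)]
data = make_synthetic_dataset(problems)
print(f"kept {len(data)} verified examples out of {100 * 8} generations")
```

For code, the checker would be a test suite or an interpreter run; for formal math, a proof checker. That mechanical filter is what keeps “getting high on your own supply” from degrading quality.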
Basically, the way Transformers work in terms of predicting the next token at each step is that, if they start out on a bad "path" in their initial response, they become almost like a prevaricating child who tries to spin a yarn about why they are actually correct, even if they should have realized mid-stream using common sense that what they are saying couldn't possibly be correct. Because the models are always seeking to be internally consistent and to have each successive generated token flow naturally from the preceding tokens and context, it's very hard for them to course-correct and backtrack. By breaking the inference process into what is effectively many intermediate stages, they can try lots of different things and see what's working and keep trying to course-correct and try other approaches until they can reach a fairly high threshold of confidence that they aren't talking nonsense.
With R1, DeepSeek essentially cracked one of the holy grails of AI: getting models to reason step-by-step without relying on massive supervised datasets. Their DeepSeek-R1-Zero experiment showed something remarkable: using pure reinforcement learning with carefully crafted reward functions, they managed to get models to develop sophisticated reasoning capabilities completely autonomously. This wasn't just about solving problems— the model organically learned to generate long chains of thought, self-verify its work, and allocate more computation time to harder problems.
The technical breakthrough here was their novel approach to reward modeling. Rather than using complex neural reward models that can lead to "reward hacking" (where the model finds bogus ways to boost their rewards that don't actually lead to better real-world model performance), they developed a clever rule-based system that combines accuracy rewards (verifying final answers) with format rewards (encouraging structured thinking). This simpler approach turned out to be more robust and scalable than the process-based reward models that others have tried.
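To make the reward design concrete, here is a toy version of that kind of rule-based reward (my own simplification; the tag format and weights are illustrative, not the paper’s implementation): an accuracy reward from checking the final answer against a known ground truth, plus a format reward for keeping the reasoning and the answer inside the expected tags.

```python
# Toy rule-based reward in the spirit of DeepSeek-R1-Zero (illustrative; the exact
# rules and weights are my simplification, not the paper's implementation).
import re

THINK_ANSWER = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def reward(completion: str, ground_truth: str) -> float:
    score = 0.0

    # Format reward: encourage the structure <think>…reasoning…</think><answer>…</answer>.
    match = THINK_ANSWER.search(completion)
    if match:
        score += 0.5

    # Accuracy reward: compare the extracted final answer against the ground truth
    # (possible because math/code answers can be checked mechanically).
    if match and match.group(1).strip() == ground_truth.strip():
        score += 1.0

    return score

print(reward("<think>2+2 is 4</think><answer>4</answer>", "4"))  # 1.5
print(reward("The answer is 4", "4"))                            # 0.0 (right idea, wrong format)
```

Because there is no learned reward model to fool, the only way for the policy to score well is to actually produce the right answer in the right structure.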
6. Andrej Karpathy: on Twitter
Karpathy also extols the virtues of reinforcement learning, one of the big technical breakthroughs Jeffrey discusses above in the context of DeepSeek-R1-Zero.
I will say that Deep Learning has a legendary ravenous appetite for compute, like no other algorithm that has ever been developed in AI.
Data has historically been seen as a separate category from compute, but even data is downstream of compute to a large extent - you can spend compute to create data. You've heard this called synthetic data generation, but less obviously, there is a very deep connection (equivalence even) between "synthetic data generation" and "reinforcement learning".
There are two major types of learning, in both children and in deep learning: There is 1) imitation learning (watch and repeat, i.e. pretraining, supervised finetuning), and 2) trial-and-error learning (reinforcement learning). […] 2 is the "aha moment" when the DeepSeek (or o1 etc.) discovers that it works well to re-evaluate your assumptions, backtrack, try something else, etc. It's the solving strategies you see this model use in its chain of thought. It's how it goes back and forth thinking to itself. These thoughts are emergent (!!!) and this is actually seriously incredible, impressive and new (as in publicly available and documented etc.). The model could never learn this with 1 (by imitation), because the cognition of the model and the cognition of the human labeler is different. The human would never know to correctly annotate these kinds of solving strategies and what they should even look like.
7. Ethan Mollick: on Twitter
A reality check on how hard open-source models can be to deploy effectively, as he looks at the performance and utility of models running locally.
Running DeepSeek r1 32B locally is kind of depressing. You get the reasoning, but the model is not *smart* enough to actually use it so it keeps getting tripped up in its own ideas & is full of doubt. 3 minutes on a limerick that a small non-reasoning model could do instantly.
8. Yishan Wong: on Twitter
His point is that we’re all freaking out because DeepSeek came out of China, but what they’ve done is very, very good for all of us.
The tech world (and apparently Wall Street) is massively over-rotated on this because it came out of CHINA. I get it. After everyone has been sensitized over the H1BLM uproar, we are conditioned to think of OMG Immigrants China as some kind of Alien Other. As though the Alien-Other Chinese Researchers are doing something special that's out of reach and now China The Empire is somehow uniquely in possession of Super Efficient AI Power and the US companies can't compete. […] It is not actually some sort of tectonic geopolitical shift, it is just Some Nerds Over There saying "Hey we figured out some cool shit, here's how we did it, maybe you would like to check it out?"
I think the Deepseek moment is not really the Sputnik moment, but more like the Google moment. […] Sputnik showed that the Soviets could do something the US couldn't ("a new fearsome power"). They didn't subsequently publish all the technical details and half the blueprints. They only showed that it could be done. With Deepseek, if I recall correctly, a lab in Berkeley read their paper and duplicated the claimed results on a small scale within a day.
That's why I say it's like the Google moment in 2004. Google filed its S-1 in 2004, and revealed to the world that they had built the largest supercomputer cluster by using distributed algorithms to network together commodity computers at the best performance-per-dollar point on the cost curve. In their S-1, they described how they were able to leapfrog the scalability limits of mainframes and had been (for years!) running a far more massive networked supercomputer comprised of thousands of commodity machines. Some time later, Google published their MapReduce and BigTable papers, describing the algorithms they'd used to manage and control this massively more cost-effective and powerful supercomputer. Deepseek is MUCH more like the Google moment, because Google essentially described what it did and told everyone else how they could do it too.
9. Dario Amodei: On DeepSeek and Export Controls
A thoughtful rebuttal to the hype around DeepSeek, and a great overview of how to think about scaling and improvements to models. His point that the cost of a training run isn’t indicative of the total cost of setting up everything else around a foundation model lab echoes what Ben Thompson said above.
DeepSeek does not "do for $6M what cost US AI companies billions". I can only speak for Anthropic, but Claude 3.5 Sonnet is a mid-sized model that cost a few $10M's to train (I won't give an exact number). Also, 3.5 Sonnet was not trained in any way that involved a larger or more expensive model (contrary to some rumors). Sonnet's training was conducted 9-12 months ago, and DeepSeek's model was trained in November/December, while Sonnet remains notably ahead in many internal and external evals. Thus, I think a fair statement is "DeepSeek produced a model close to the performance of US models 7-10 months older, for a good deal less cost (but not anywhere near the ratios people have suggested)".
Both DeepSeek and US AI companies have much more money and many more chips than they used to train their headline models. The extra chips are used for R&D to develop the ideas behind the model, and sometimes to train larger models that are not yet ready (or that needed more than one try to get right). It's been reported — we can't be certain it is true — that DeepSeek actually had 50,000 Hopper generation chips, which I'd guess is within a factor ~2-3x of what the major US AI companies have (for example, it's 2-3x less than the xAI "Colossus" cluster). Those 50,000 Hopper chips cost on the order of ~$1B. Thus, DeepSeek's total spend as a company (as distinct from spend to train an individual model) is not vastly different from US AI labs.
10. Will Byrk: on Twitter
The founder of a hot search+AI startup called Exa AI chimes in on how the biggest deal in the DeepSeek news is applying reinforcement learning to LLMs (similar to Andrej Karpathy and Jeffrey Emanuel above).
The biggest news… was that Reinforcement Learning for LLMs "just works".
We see RL for LLMs just working for Deepseek, given the speed they were able to replicate o1, and given the ease that other orgs had using the same RL algorithm on different training data. And we see RL for LLMs just working for OpenAI, with the speed they were able to get Deep Research working, only SOME WEEKS after o3 was trained.
Something new has been discovered about reality, a statistical law of the universe. It's hard for us to grasp the power of billions of weights melding toward some reward signal. We're touching up against fundamental properties of information systems. If we ever meet superintelligent aliens out there, they'd probably tell us that they too discovered something akin to RL for LLMs long ago.
I’ll keep this page updated with any other articles I come across that are educational.