WEBVTT

00:00:00.000 --> 00:00:03.120
Just when you think the AI landscape is set in

00:00:03.120 --> 00:00:06.000
stone, dominated by these huge corporate players,

00:00:06.280 --> 00:00:08.880
an open source model just drops. It's called

00:00:08.880 --> 00:00:13.880
DeepSeek v3.2. And the claims are pretty staggering.

00:00:14.060 --> 00:00:16.780
It's saying it can outperform GPT-5 and Gemini

00:00:16.780 --> 00:00:19.559
3 in really critical areas like reasoning and

00:00:19.559 --> 00:00:21.980
math. This isn't just an update. No, this feels

00:00:21.980 --> 00:00:24.920
like a real shift. Welcome to the deep dive.

00:00:25.460 --> 00:00:29.039
The pace of AI right now can feel, well, overwhelming.

00:00:29.320 --> 00:00:31.359
Yeah, it's a lot to keep up with. Our job is

00:00:31.359 --> 00:00:33.380
to give you the shortcut. We've spent our time

00:00:33.380 --> 00:00:35.960
digging through the sources on DeepSeek v3.2.

00:00:36.100 --> 00:00:38.039
And our mission today is pretty straightforward.

00:00:38.719 --> 00:00:41.439
We're going to demystify the core tech here.

00:00:41.780 --> 00:00:44.479
We're talking mixture of experts and sparse attention.

00:00:44.640 --> 00:00:46.640
We need to unpack the three different versions

00:00:46.640 --> 00:00:49.039
of this model. And most importantly, figure out

00:00:49.039 --> 00:00:51.200
if this thing really delivers on that promise.

00:00:51.200 --> 00:00:54.460
Yeah. Saving you money without, you know, sacrificing

00:00:54.460 --> 00:00:56.320
actual intelligence. OK, let's start with that.

00:00:56.380 --> 00:00:58.829
The core architecture. Why is this such a game

00:00:58.829 --> 00:01:01.450
changer? For years, the models we've used, the

00:01:01.450 --> 00:01:05.189
standard GPTs, they were all dense models. And

00:01:05.189 --> 00:01:07.969
that's the key problem right there. Dense just

00:01:07.969 --> 00:01:10.969
means inefficient. Oh, so... A dense model is like

00:01:10.969 --> 00:01:14.349
a single massive brain. Every single neuron,

00:01:14.730 --> 00:01:17.829
every part of that network has to fire up for

00:01:17.829 --> 00:01:20.290
every single query. Even for a simple question.

00:01:20.329 --> 00:01:22.069
Yeah, you're using the whole brain just to answer

00:01:22.069 --> 00:01:24.390
what's the weather like. It's just incredibly

00:01:24.390 --> 00:01:26.829
wasteful. So if that's the old, inefficient way,

00:01:26.969 --> 00:01:30.469
what's DeepSeek's big fix? The fix is MoE. Mixture

00:01:30.469 --> 00:01:33.049
of experts. Okay. Instead of one giant brain,

00:01:33.150 --> 00:01:36.030
think of it like a team of specialized AI experts.

00:01:36.829 --> 00:01:39.250
MoE just activates the specific network parts

00:01:39.250 --> 00:01:41.719
it needs for efficiency. So it's like having

00:01:41.719 --> 00:01:44.599
a team, not one overworked assistant. Exactly.

00:01:44.760 --> 00:01:46.900
Imagine a librarian who has to read every single

00:01:46.900 --> 00:01:48.799
book in the library for every question. That's

00:01:48.799 --> 00:01:51.140
the old way. Right. MoE is like having a team

00:01:51.140 --> 00:01:54.599
of experts, a math genius, a coding wizard, a

00:01:54.599 --> 00:01:57.340
history buff. And when a question comes in, only

00:01:57.340 --> 00:01:59.700
the relevant expert wakes up to answer it. And

00:01:59.700 --> 00:02:02.439
the scale of this is pretty massive, right? It's

00:02:02.439 --> 00:02:06.060
huge. The total brain power is 685 billion parameters.

00:02:06.640 --> 00:02:08.539
But for any given word it generates, it only

00:02:08.539 --> 00:02:11.240
uses a tiny fraction of that. And that's the

00:02:11.240 --> 00:02:13.479
key. That solves the two biggest problems in

00:02:13.479 --> 00:02:16.460
AI right now. Speed and cost. You get answers

00:02:16.460 --> 00:02:19.539
faster and you need less computing power to do

00:02:19.539 --> 00:02:23.340
it. Why is MoE a fundamentally better architecture

00:02:23.340 --> 00:02:26.240
right now? Well, it's faster and cheaper because

00:02:26.240 --> 00:02:28.960
it uses less computing power for every query.
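The routing idea described here can be sketched in a few lines of Python. Everything in this toy is invented for illustration (the expert count, top-2 routing, and scalar "experts" are nothing like the model's real dimensions); it only shows the principle that most experts never run for a given token.

```python
import math
import random

random.seed(0)

N_EXPERTS = 8   # toy scale; the real model has many more experts
TOP_K = 2       # experts that actually run per token; the rest stay idle

# Each "expert" here is just a scalar gain, enough to show the routing.
experts = [random.uniform(0.5, 1.5) for _ in range(N_EXPERTS)]

def route(scores, k=TOP_K):
    """Pick the k highest-scoring experts and softmax their scores."""
    top = sorted(range(len(scores)), key=lambda i: scores[i])[-k:]
    exps = [math.exp(scores[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, router_scores):
    """Only the routed experts do any work for this token."""
    picked = route(router_scores)
    output = sum(weight * experts[i] * x for i, weight in picked)
    return output, [i for i, _ in picked]

router_scores = [random.gauss(0, 1) for _ in range(N_EXPERTS)]
y, used = moe_forward(1.0, router_scores)
# len(used) == TOP_K: two of the eight experts ran; six never woke up.
```

The dense model is the degenerate case where `route` returns every expert for every token, which is exactly the wasted effort described above.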

00:02:29.240 --> 00:02:31.800
That's the core of it. OK, so it's easy to get

00:02:31.800 --> 00:02:34.620
a bit overwhelmed when a new model family drops.

00:02:35.159 --> 00:02:38.120
And DeepSeek isn't just one bot. No, it's a family

00:02:38.120 --> 00:02:40.919
of three. And it's really crucial to know which

00:02:40.919 --> 00:02:42.919
one to use for what task. So let's break them

00:02:42.919 --> 00:02:45.419
down. First up, you've got the main one, DeepSeek

00:02:45.419 --> 00:02:49.080
v3.2. This is your daily driver. Generalist.

00:02:49.199 --> 00:02:51.319
Yeah, your go-to. It's balanced. It's fast.

00:02:51.379 --> 00:02:53.699
It's smart. It handles all the standard stuff,

00:02:53.699 --> 00:02:55.979
coding, writing. And it can connect to the internet,

00:02:56.159 --> 00:02:58.419
use external tools. Yep. It could check stock

00:02:58.419 --> 00:03:01.740
prices, pull weather data, browse the web. I

00:03:01.740 --> 00:03:03.939
was using it to plan a trip and debug some simple

00:03:03.939 --> 00:03:07.280
Python, and it just felt snappier. OK, so that's

00:03:07.280 --> 00:03:08.699
the daily driver. What's the second one, the

00:03:08.699 --> 00:03:10.419
one all the researchers are talking about? That's

00:03:10.419 --> 00:03:15.379
DeepSeek v3.2 Speciale. The genius. The genius.

00:03:15.759 --> 00:03:19.949
This one, it locks the door to focus. It does

00:03:19.949 --> 00:03:21.490
not look at the Internet. It doesn't use any

00:03:21.490 --> 00:03:24.090
tools. So it's just relying on its own internal

00:03:24.090 --> 00:03:26.229
knowledge. How does that give it an edge? It

00:03:26.229 --> 00:03:28.310
generates something they call thinking tokens.

00:03:28.890 --> 00:03:31.849
It's like it talks silently to itself, checking

00:03:31.849 --> 00:03:34.069
its own logic step by step before it gives you

00:03:34.069 --> 00:03:36.250
an answer. So it's best for things that require

00:03:36.250 --> 00:03:39.430
pure structured reasoning. Exactly. Hard math

00:03:39.430 --> 00:03:43.030
problems, complex logic riddles, physics. That's

00:03:43.030 --> 00:03:45.770
its domain. And the third one, the Exp. DeepSeek

00:03:45.770 --> 00:03:48.729
v3.2-Exp. That one is really just for researchers

00:03:48.729 --> 00:03:51.289
at the moment. It's testing the raw tech. You

00:03:51.289 --> 00:03:52.770
should probably stick to the first two. Which

00:03:52.770 --> 00:03:55.530
of these versions should our listener use 90

00:03:55.530 --> 00:03:59.069
% of the time? The balanced main v3.2 is the

00:03:59.069 --> 00:04:01.430
general-purpose daily driver for everyday tasks.

00:04:01.710 --> 00:04:04.210
Got it. OK, this brings us to the benchmarks.

00:04:04.469 --> 00:04:06.370
And this is where it gets kind of shocking. Can

00:04:06.370 --> 00:04:10.969
an open, free model really beat the giants? The

00:04:10.969 --> 00:04:13.629
numbers. I mean, they speak for themselves. In

00:04:13.629 --> 00:04:16.649
math, on the AIME 2025 test, which is incredibly

00:04:16.649 --> 00:04:20.769
difficult, DeepSeek scored a 93.1% pass@1.

00:04:20.889 --> 00:04:23.449
Let's define that. Pass@1 means what, exactly?

00:04:23.689 --> 00:04:26.009
It means it gets the right answer on the very

00:04:26.009 --> 00:04:29.470
first try. So no guessing, just immediate high

00:04:29.470 --> 00:04:32.689
confidence accuracy. Exactly. And that 93.1

00:04:32.689 --> 00:04:36.769
% beats GPT-5 High, which was around 91%.

00:04:36.769 --> 00:04:39.730
Wow. Then in reasoning, think math, Olympiad

00:04:39.730 --> 00:04:42.569
level, it's hitting gold medal performance. It's

00:04:42.569 --> 00:04:45.550
matching Gemini 3.0 Pro. Matching the best in

00:04:45.550 --> 00:04:48.470
the world. And it's open source. And for coding,

00:04:48.930 --> 00:04:52.089
on SWE-bench, it solved over 2,500 issues,

00:04:52.569 --> 00:04:55.209
which beats Claude 4.5 Sonnet. So it's not just

00:04:55.209 --> 00:04:57.870
a one -trick pony. And that amazing performance

00:04:57.870 --> 00:05:00.350
on logic, puzzles, and math, that really comes

00:05:00.350 --> 00:05:02.529
back to the special version's thinking advantage,

00:05:02.550 --> 00:05:05.230
right? It really does. Most bots just rush to

00:05:05.230 --> 00:05:08.350
an answer. Speciale pauses to verify. We see that

00:05:08.350 --> 00:05:10.810
with that classic riddle, right? If one wet shirt

00:05:10.810 --> 00:05:13.050
takes an hour to dry. Yeah, how long for three

00:05:13.050 --> 00:05:14.930
shirts? A normal bot might just multiply and

00:05:14.930 --> 00:05:17.209
say three hours. But Speciale... Speciale reasons

00:05:17.209 --> 00:05:19.209
it through, understands they dry in parallel,

00:05:19.430 --> 00:05:23.029
and correctly says one hour. Whoa. Imagine applying

00:05:23.029 --> 00:05:25.509
that verification ability to enterprise-scale

00:05:25.509 --> 00:05:27.569
coding projects. That's the thing. It's

00:05:27.569 --> 00:05:29.610
not just about riddles. It's about catching bugs

00:05:29.610 --> 00:05:32.709
before they happen. How exactly does the special

00:05:32.709 --> 00:05:35.329
version achieve such reliable results on hard

00:05:35.329 --> 00:05:39.139
math? It pauses to verify its own logic, checking

00:05:39.139 --> 00:05:41.899
each step quietly before giving the final answer.
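The wet-shirt riddle from a moment ago makes a nice minimal example of that rush-versus-verify distinction. Purely illustrative, of course; this is a sketch of the two reasoning outcomes, not of how the model computes anything.

```python
def rushed_dry_time(n_shirts, hours_per_shirt=1):
    # The pattern-matching failure mode: more shirts, more time.
    return n_shirts * hours_per_shirt

def verified_dry_time(n_shirts, hours_per_shirt=1):
    # The verification step notices the shirts dry in parallel,
    # so (given room to hang them all) the count doesn't matter.
    return hours_per_shirt

rushed_dry_time(3)    # 3 hours: the wrong, multiplied answer
verified_dry_time(3)  # 1 hour: the correct, parallel-drying answer
```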

00:05:42.220 --> 00:05:44.660
OK, so that efficiency brings us right to the

00:05:44.660 --> 00:05:46.759
money part. The part everyone's waiting for.

00:05:47.060 --> 00:05:50.600
For anyone paying API costs, this is huge. Let's

00:05:50.600 --> 00:05:53.680
talk about DSA, or DeepSeek Sparse Attention.

00:05:54.220 --> 00:05:56.759
DSA is really the secret sauce that makes all

00:05:56.759 --> 00:05:59.199
this cost saving possible. It's all about how

00:05:59.199 --> 00:06:01.060
the model reads the information you give it.

00:06:01.339 --> 00:06:03.740
So with the old way, if I gave a model, say,

00:06:04.040 --> 00:06:06.660
a 1,000-page document... It had to read all

00:06:06.660 --> 00:06:09.800
1,000 pages every single time. That's dense

00:06:09.800 --> 00:06:13.279
attention. Exactly. Just massive wasted effort.
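That dense-versus-sparse contrast can be sketched numerically. This toy scores one query against 1,000 positions; the real DSA selection machinery is far more sophisticated, and all the names here are invented for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dense_attention(similarities):
    """Dense: every position gets a weight, however long the context."""
    return softmax(similarities)

def sparse_attention(similarities, k=4):
    """Sparse: a cheap pass keeps only the k most relevant positions,
    and just those get the expensive full-attention treatment."""
    keep = sorted(range(len(similarities)), key=similarities.__getitem__)[-k:]
    weights = softmax([similarities[i] for i in keep])
    return dict(zip(keep, weights))

# 1,000 "pages" of precomputed relevance scores for one question.
similarities = [((i * 37) % 100) / 100 for i in range(1000)]
dense = dense_attention(similarities)    # 1,000 weights computed
sparse = sparse_attention(similarities)  # only 4 positions attended to
```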

00:06:13.879 --> 00:06:16.120
DSA is like giving that 1,000-page book a really

00:06:16.120 --> 00:06:17.920
smart index. So instead of reading the whole

00:06:17.920 --> 00:06:20.120
book, it just jumps to the relevant pages. It

00:06:20.120 --> 00:06:22.540
only reads what it needs. And the cost difference

00:06:22.540 --> 00:06:25.240
is... It's just staggering. Let's get into the

00:06:25.240 --> 00:06:27.360
specifics. OK. So the cache miss, that's the first

00:06:27.360 --> 00:06:29.779
time you read a document, is about $0.28 per

00:06:29.779 --> 00:06:31.920
million tokens. Which is already pretty cheap.
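A back-of-envelope sketch of what that pricing means for an app that re-reads the same document. The miss rate is the $0.28 just quoted; the hit rate is assumed to be about a tenth of it, per the roughly 90% cache discount the conversation turns to next.

```python
# Rates in dollars per million input tokens (hit rate is an assumption).
CACHE_MISS_PER_M = 0.28
CACHE_HIT_PER_M = CACHE_MISS_PER_M * 0.10  # the ~90% cache discount

def document_bill(doc_tokens, n_queries):
    """First read misses the cache; every repeat query hits it."""
    millions = doc_tokens / 1_000_000
    first_read = millions * CACHE_MISS_PER_M
    repeats = (n_queries - 1) * millions * CACHE_HIT_PER_M
    return first_read + repeats

# A ~500k-token document queried 100 times:
with_cache = document_bill(500_000, 100)                        # about $1.53
without_cache = 100 * (500_000 / 1_000_000) * CACHE_MISS_PER_M  # $14.00
```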

00:06:32.100 --> 00:06:35.019
It is. But the cache hit: when you ask a question

00:06:35.019 --> 00:06:37.939
about that same document again, the price drops

00:06:37.939 --> 00:06:41.220
to about $0.028 per million tokens. That's a

00:06:41.220 --> 00:06:44.459
90% drop. In a lot of cases, it's 10 to 30 times

00:06:44.459 --> 00:06:46.439
cheaper than models that don't have this. So

00:06:46.439 --> 00:06:49.199
if you're building an app that queries the same

00:06:49.199 --> 00:06:51.860
legal documents or internal wiki over and over,

00:06:52.459 --> 00:06:55.040
your bill just plummets. It changes the entire

00:06:55.040 --> 00:06:58.139
economic model for those kinds of apps. I still

00:06:58.139 --> 00:07:00.839
wrestle with how quickly API costs can spiral

00:07:00.839 --> 00:07:04.240
when you're querying large knowledge bases. This

00:07:04.240 --> 00:07:07.060
cache-hit model is the cost break we needed. For

00:07:07.060 --> 00:07:09.180
a developer reading the same long documents,

00:07:09.360 --> 00:07:11.579
how significant is the cache-hit price difference?

00:07:11.779 --> 00:07:14.120
The cache-hit feature can effectively drop the

00:07:14.120 --> 00:07:17.339
API bill by 90%, making large-scale processing

00:07:17.339 --> 00:07:19.540
affordable. It's cheap, it's fast, it's smart,

00:07:19.600 --> 00:07:21.819
so what's the catch? Let's talk reality. Can

00:07:21.819 --> 00:07:24.079
you actually run this thing yourself? The MIT

00:07:24.079 --> 00:07:26.259
license is great for privacy, your data stays

00:07:26.259 --> 00:07:29.199
in -house, but there's a serious hardware reality

00:07:29.199 --> 00:07:32.139
check. The full model weights are nearly 700

00:07:32.139 --> 00:07:34.180
gigabytes. So you can't run this on a MacBook

00:07:34.180 --> 00:07:37.899
Pro. No, not even close. You need a dedicated

00:07:37.899 --> 00:07:40.959
server with massive VRAM. We're talking something

00:07:40.959 --> 00:07:45.800
like 8 NVIDIA H100 or A100 GPUs. Which is tens

00:07:45.800 --> 00:07:48.620
of thousands of dollars. At least. Just for the

00:07:48.620 --> 00:07:51.399
hardware. So for an individual developer or small

00:07:51.399 --> 00:07:53.879
startup, what's the realistic path to actually

00:07:53.879 --> 00:07:56.560
using this? You've got a couple options. The

00:07:56.560 --> 00:07:59.480
easiest by far is just to use their API. It's

00:07:59.480 --> 00:08:01.959
cheap and they handle all the hardware complexity.
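For the "just use the API" path, here's a minimal sketch of the request shape. The payload follows the common OpenAI-style chat format; the endpoint URL and the model id are assumptions to verify against the provider's current docs.

```python
import json

API_URL = "https://api.deepseek.com/chat/completions"  # assumed endpoint

def build_chat_request(prompt, model="deepseek-chat"):
    """Assemble the JSON body for a single-turn chat completion."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

body = json.dumps(build_chat_request("Debug this Python snippet: ..."))
# POST `body` to API_URL with your API key in the Authorization header.
```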

00:08:02.100 --> 00:08:04.310
Okay, what's the second option? You wait. The

00:08:04.310 --> 00:08:06.610
open source community will almost certainly create

00:08:06.610 --> 00:08:09.449
smaller, distilled versions. You know, like a

00:08:09.449 --> 00:08:12.069
7 billion parameter version that captures most

00:08:12.069 --> 00:08:14.329
of the intelligence but can actually run on high

00:08:14.329 --> 00:08:16.889
-end consumer hardware. And for the bigger companies

00:08:16.889 --> 00:08:19.110
that do have that kind of server farm? They need

00:08:19.110 --> 00:08:21.910
to use specialized software. Tools like vLLM

00:08:21.910 --> 00:08:24.949
or SGLang are built specifically to run these

00:08:24.949 --> 00:08:27.899
sparse MoE models efficiently. What's the most

00:08:27.899 --> 00:08:30.339
realistic path for an individual to use this

00:08:30.339 --> 00:08:33.259
powerful model today? The easiest way is using

00:08:33.259 --> 00:08:36.039
their cheap API or waiting for smaller, distilled

00:08:36.039 --> 00:08:39.679
versions to release. Makes sense. Okay, let's

00:08:39.679 --> 00:08:42.320
pivot to using it. If we have these two distinct

00:08:42.320 --> 00:08:44.940
versions, the daily driver and the logic engine,

00:08:46.139 --> 00:08:48.220
we can't talk to them the same way, can we? No,

00:08:48.360 --> 00:08:50.480
absolutely not. The way you prompt them has to

00:08:50.480 --> 00:08:53.279
be different. So for the main v3.2, the daily

00:08:53.279 --> 00:08:56.580
driver, what's the strategy? You use it for quick,

00:08:56.779 --> 00:08:59.759
standard tasks where you need clean output fast.

00:09:00.440 --> 00:09:02.980
Ask it for a React component for a pricing table;

00:09:03.500 --> 00:09:06.960
specify the styling: Tailwind CSS, three columns.

00:09:07.039 --> 00:09:09.379
And it just spits out the code? Clean code immediately.

00:09:09.519 --> 00:09:11.240
It's built for that. And for the special version,

00:09:11.379 --> 00:09:14.220
the reasoning expert? There, you have to be precise.

00:09:14.600 --> 00:09:16.340
Define the rules clearly. It's more like giving

00:09:16.340 --> 00:09:19.340
it a formal problem statement. Use it for a complex

00:09:19.340 --> 00:09:21.559
geometry proof where accuracy is everything.

00:09:21.799 --> 00:09:24.279
So the switch is... Quick email summary? Use

00:09:24.279 --> 00:09:26.879
the main. Debugging a really stubborn logic error

00:09:26.879 --> 00:09:29.960
in your code? Use Speciale. You match the

00:09:29.960 --> 00:09:32.000
tool to the task. For the special version, do

00:09:32.000 --> 00:09:35.779
I still need to write "think step by step"? No,

00:09:36.000 --> 00:09:37.960
Speciale does that automatically, but you must

00:09:37.960 --> 00:09:40.379
define the rules and constraints clearly. So

00:09:40.379 --> 00:09:42.759
let's wrap this all up. What is the single biggest

00:09:42.759 --> 00:09:45.580
idea you should walk away with? I think the big

00:09:45.580 --> 00:09:48.500
idea is that open source AI is officially no

00:09:48.500 --> 00:09:52.070
longer second best. DeepSeek proves that with

00:09:52.070 --> 00:09:55.149
MoE and sparse attention, open models can match

00:09:55.149 --> 00:09:57.669
or even beat the performance of the biggest closed

00:09:57.669 --> 00:10:00.529
platforms. And the final takeaway for me is twofold.

00:10:00.870 --> 00:10:03.350
The special version gives a reasoning power that's

00:10:03.350 --> 00:10:07.210
on par with GPT-5 or Gemini 3, while the main

00:10:07.210 --> 00:10:09.490
version gives developers a cost structure that

00:10:09.490 --> 00:10:11.789
could honestly revolutionize how we build AI

00:10:11.789 --> 00:10:14.850
apps. The gap between open and closed has basically

00:10:14.850 --> 00:10:17.370
vanished. The performance ceiling got raised,

00:10:17.710 --> 00:10:19.769
but the cost floor just dropped through the basement.

00:10:19.899 --> 00:10:22.139
That's the tectonic shift. We really recommend

00:10:22.139 --> 00:10:24.539
you go and test that thinking mode on the special

00:10:24.539 --> 00:10:26.620
model yourself. Give it a tough logic puzzle.

00:10:26.879 --> 00:10:28.960
See how it works. And here's a thought to leave

00:10:28.960 --> 00:10:32.080
you with. Yeah. If an open source model under

00:10:32.080 --> 00:10:35.840
a free MIT license can do all this today, what

00:10:35.840 --> 00:10:38.340
happens to the economics of those expensive closed

00:10:38.340 --> 00:10:41.000
source AI platforms next? What does the future

00:10:41.000 --> 00:10:42.919
of competition look like now? A great question

00:10:42.919 --> 00:10:45.419
to think on. It is. Thank you for sharing your

00:10:45.419 --> 00:10:47.299
sources with us and letting us take this deep

00:10:47.299 --> 00:10:48.639
dive. We'll catch you next time.
