WEBVTT

00:00:00.000 --> 00:00:03.500
So here's the dilemma, right? Paying, say, $20

00:00:03.500 --> 00:00:05.860
a month for what's supposedly the absolute best

00:00:05.860 --> 00:00:09.539
AI or getting something that performs almost

00:00:09.539 --> 00:00:11.480
identically, maybe even better in some ways,

00:00:11.640 --> 00:00:15.580
but it's free and it's open source. Yeah, that's

00:00:15.580 --> 00:00:17.940
really the core tension in AI right now, isn't

00:00:17.940 --> 00:00:20.780
it? And this isn't just a small saving. It feels

00:00:20.780 --> 00:00:22.820
like a fundamental shift is happening. A new

00:00:22.820 --> 00:00:24.739
model has just shot up the rankings. Welcome

00:00:24.739 --> 00:00:26.600
to the Deep Dive. Today we're looking closely

00:00:26.600 --> 00:00:30.579
at Kimi K2 Thinking. This model has kind of quietly

00:00:30.579 --> 00:00:33.820
blown past almost every big-name, closed-source

00:00:33.820 --> 00:00:36.140
competitor in head-to-head tests. Absolutely.

00:00:36.280 --> 00:00:38.619
So our mission today is to really understand

00:00:38.619 --> 00:00:41.170
how. How did this open source model climb to

00:00:41.170 --> 00:00:43.530
number two globally? It's apparently just one

00:00:43.530 --> 00:00:46.270
point behind GPT-5, which is kind of wild. We're

00:00:46.270 --> 00:00:48.369
going to break down the tech, look at what it

00:00:48.369 --> 00:00:50.289
means for businesses, especially its reliability,

00:00:50.369 --> 00:00:52.170
because it's doing amazing things with financial

00:00:52.170 --> 00:00:55.210
analysis, expert coding, really complex stuff.

00:00:55.409 --> 00:00:57.710
Okay, let's unpack that shift, because it really

00:00:57.710 --> 00:00:59.990
does feel like another DeepSeek moment, like

00:00:59.990 --> 00:01:02.270
the sources call it, where suddenly open source

00:01:02.270 --> 00:01:04.370
isn't just catching up, it's setting the standard.

00:01:04.629 --> 00:01:08.870
The ranking itself is genuinely staggering. Artificial

00:01:08.870 --> 00:01:12.030
Analysis has Kimi K2 Thinking ranked number two

00:01:12.030 --> 00:01:15.489
in the world. And we're not talking about it

00:01:15.489 --> 00:01:17.849
beating some niche models here. It's leaping

00:01:17.849 --> 00:01:20.370
over the giants. Like who? Let's name them. Okay,

00:01:20.430 --> 00:01:22.909
so it's outperforming Grok 4. That's xAI's big

00:01:22.909 --> 00:01:26.530
one. Claude 4.5 Sonnet from Anthropic. And Gemini

00:01:26.530 --> 00:01:29.680
2.5 Pro from Google. Just a few months back,

00:01:29.840 --> 00:01:32.739
the feeling was, you know, the absolute top tier

00:01:32.739 --> 00:01:35.319
would always be proprietary, always locked behind

00:01:35.319 --> 00:01:39.000
huge R&D budgets. Now you've got a zero-cost

00:01:39.000 --> 00:01:41.140
option delivering intelligence that's knocking

00:01:41.140 --> 00:01:44.060
on GPT-5's door. That changes the whole equation

00:01:44.060 --> 00:01:45.640
if you're building things, right? It really does.

00:01:45.799 --> 00:01:48.280
And when you see that tiny gap, just one point

00:01:48.280 --> 00:01:51.659
behind GPT-5, the open-source part becomes the

00:01:51.659 --> 00:01:53.760
killer feature. You're getting, what, like 99%

00:01:53.760 --> 00:01:56.420
of the capability, but without the vendor risk.

00:01:56.599 --> 00:01:58.180
Exactly. And think about the practical side.

00:01:58.280 --> 00:02:00.120
If you're running a dev shop, or maybe handling

00:02:00.120 --> 00:02:02.239
really sensitive data. Now you can potentially

00:02:02.239 --> 00:02:04.180
run this model on your own servers, your private

00:02:04.180 --> 00:02:06.420
cloud. You keep complete control over your data.

00:02:06.480 --> 00:02:07.859
That's something you just don't get with most

00:02:07.859 --> 00:02:10.939
of the big cloud APIs. And the cost predictability

00:02:10.939 --> 00:02:13.759
must be huge. API fees can jump around, scale

00:02:13.759 --> 00:02:15.740
in ways you don't expect. Getting rid of that

00:02:15.740 --> 00:02:18.199
line item, that's got to feel good. Oh, totally.

00:02:18.280 --> 00:02:21.009
Imagine the budgeting relief. You know your server

00:02:21.009 --> 00:02:24.090
costs, roughly, but you ditch those massive,

00:02:24.189 --> 00:02:28.030
sometimes unpredictable API usage fees. It just

00:02:28.030 --> 00:02:30.729
shifts the power away from, you know, the handful

00:02:30.729 --> 00:02:32.750
of big tech companies controlling the best models.

00:02:32.870 --> 00:02:36.090
So zooming out, what would you say is the single

00:02:36.090 --> 00:02:39.969
biggest practical win for a business when a model

00:02:39.969 --> 00:02:42.349
this smart goes open source? I'd say it's freedom.

00:02:42.490 --> 00:02:44.449
Freedom from being locked into one vendor and

00:02:44.449 --> 00:02:47.169
those high kind of unpredictable API costs. Okay,

00:02:47.210 --> 00:02:50.310
let's pivot from the rankings to how it actually

00:02:50.310 --> 00:02:52.229
performs. Coding seems like a great place to

00:02:52.229 --> 00:02:53.810
start because that's often where these models

00:02:53.810 --> 00:02:55.930
show their limits. The source material kicks

00:02:55.930 --> 00:02:58.689
off with a wild challenge. Build a drag and drop

00:02:58.689 --> 00:03:01.129
website builder, kind of like Wix, but from a

00:03:01.129 --> 00:03:04.280
single prompt. That sounds ambitious. Yeah, it's

00:03:04.280 --> 00:03:06.819
a serious test of like structural reasoning.

00:03:07.020 --> 00:03:08.599
It's not just spitting out some static HTML.

00:03:09.000 --> 00:03:12.120
It needs to understand interaction, dynamic elements.
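To make "interaction" concrete: the heart of any drag-and-drop editor is snapping a dragged position onto a layout grid. Here's a minimal illustrative sketch in Python. The actual demo was a single HTML/JavaScript file, and the grid size and snap tolerance below are invented, not from the source:

```python
# Illustrative only: the core snap-to-grid math a drag-and-drop editor
# needs on every drag-move event. Grid spacing and tolerance are
# assumed values, not taken from the demo.
GRID = 20       # grid spacing in pixels (assumed)
TOLERANCE = 6   # how close a drag must get before snapping (assumed)

def snap(value: float, grid: int = GRID, tol: int = TOLERANCE) -> float:
    """Snap a dragged coordinate to the nearest grid line if close enough."""
    nearest = round(value / grid) * grid
    return nearest if abs(value - nearest) <= tol else value

def on_drag(x: float, y: float) -> tuple[float, float]:
    """Called on each drag-move event: returns the element's new position."""
    return snap(x), snap(y)
```

The "little red lines" the hosts mention would simply be drawn whenever `snap` actually fires, i.e. when the returned coordinate differs from the raw one.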

00:03:12.340 --> 00:03:15.860
And Kimi K2 delivered a fully functional editor,

00:03:16.060 --> 00:03:19.360
just one HTML file. It really seemed to grasp

00:03:19.360 --> 00:03:22.080
the underlying logic needed. So it had to plan

00:03:22.080 --> 00:03:23.939
out the JavaScript, right, the dragging, the

00:03:23.939 --> 00:03:26.620
dropping, handling the drag-over events, plus the

00:03:26.620 --> 00:03:29.340
CSS for styling, and make it all work together

00:03:29.340 --> 00:03:32.439
smoothly. Yes. And the little details, too. It

00:03:32.439 --> 00:03:34.479
had working side panels, elements you could actually

00:03:34.479 --> 00:03:36.580
drag around, and importantly, a snap-to-grid

00:03:36.580 --> 00:03:38.840
system, you know, with the little red lines showing

00:03:38.840 --> 00:03:41.419
alignment, getting all that right in one shot

00:03:41.419 --> 00:03:44.960
from one prompt. That's really, really rare for

00:03:44.960 --> 00:03:46.919
this kind of complex application. Okay, the next

00:03:46.919 --> 00:03:49.560
test sounds even harder. The fluid dynamics simulation,

00:03:49.939 --> 00:03:51.659
I mean, that's straight up expert coding territory.

00:03:51.680 --> 00:03:55.219
You need physics, math, animation, all interactive.

00:03:55.560 --> 00:03:57.819
It's a real synthesis task. The model has to

00:03:57.819 --> 00:04:00.379
plan the physics simulation itself, managing

00:04:00.379 --> 00:04:02.939
particles, pressure, velocity, all that, and

00:04:02.939 --> 00:04:05.960
then translate that complex math into fast JavaScript

00:04:05.960 --> 00:04:08.960
that renders in real time on an HTML canvas.
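For a rough idea of what "translating that math" involves, here is one tiny ingredient of a grid-based fluid solver, a single diffusion relaxation step, sketched in Python. The demo itself was JavaScript on a canvas; the grid size and constants here are invented:

```python
# Illustrative only: one Jacobi-style diffusion step from a classic
# grid-based fluid solver. A real solver iterates this and adds
# advection and pressure projection on top. Constants are invented.
def diffuse_step(density, diff=0.1, dt=0.016):
    """Relax each interior cell toward the average of its 4 neighbours."""
    n = len(density)
    a = dt * diff * n * n  # diffusion rate scaled to grid resolution
    out = [row[:] for row in density]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            neighbours = (density[i - 1][j] + density[i + 1][j] +
                          density[i][j - 1] + density[i][j + 1])
            out[i][j] = (density[i][j] + a * neighbours) / (1 + 4 * a)
    return out
```

Sliders like "viscosity" and "diffusion" in the demo would map onto coefficients like `diff` here, which is what makes the simulation feel tunable in real time.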

00:04:09.180 --> 00:04:11.120
And the result was interactive. You could tweak

00:04:11.120 --> 00:04:15.080
sliders for viscosity, diffusion, and the fluid

00:04:15.080 --> 00:04:16.860
actually behaved like you'd expect. It looked

00:04:16.860 --> 00:04:19.220
realistic. Now, here's where it gets really telling,

00:04:19.279 --> 00:04:21.279
I think, about this open source parity idea.

00:04:21.519 --> 00:04:24.920
The sources point out that Grok 4, Claude 4.5,

00:04:25.240 --> 00:04:33.870
Gemini 2.5 Pro... just didn't run at all. Right. And Kimi

00:04:33.870 --> 00:04:36.750
K2 and GPT-5 were the only two models tested

00:04:36.750 --> 00:04:38.730
that actually nailed it, that could build this

00:04:38.730 --> 00:04:41.649
complex physics-based interactive thing successfully.

00:04:41.889 --> 00:04:43.810
That's the takeaway you really need to absorb.

00:04:44.069 --> 00:04:46.069
Just imagine the complexity behind that. A system

00:04:46.069 --> 00:04:48.170
that gets the deep math of fluid physics and

00:04:48.170 --> 00:04:50.189
knows how to implement that efficiently in JavaScript,

00:04:50.410 --> 00:04:52.550
rendering it smoothly, all from a text prompt.

00:04:52.629 --> 00:04:54.410
That's something else. Yeah. It means we basically

00:04:54.410 --> 00:04:57.629
have two models now operating at that peak level

00:04:57.629 --> 00:05:00.170
for expert code generation. And one of them is

00:05:00.170 --> 00:05:03.060
free to use. So how does Kimi K2 passing that

00:05:03.060 --> 00:05:05.399
fluid dynamics test really change the game for

00:05:05.399 --> 00:05:08.319
evaluating model complexity? It proves Kimi K2

00:05:08.319 --> 00:05:11.100
isn't just good, it truly competes at the highest

00:05:11.100 --> 00:05:13.740
level of expert coding, even across different

00:05:13.740 --> 00:05:16.410
disciplines like physics and web tech. Before

00:05:16.410 --> 00:05:18.490
we get into maybe where it stumbles, there was

00:05:18.490 --> 00:05:21.230
another big win mentioned, right? The 3D geospatial

00:05:21.230 --> 00:05:24.230
visualization of Tokyo that also shows off its

00:05:24.230 --> 00:05:25.889
knowledge base. Yeah, and they did something

00:05:25.889 --> 00:05:27.490
important there. They turned off web search.

00:05:27.750 --> 00:05:30.629
So Kimi K2 had to rely solely on its internal

00:05:30.629 --> 00:05:33.430
baked-in knowledge. And it correctly placed neighborhoods

00:05:33.430 --> 00:05:37.129
like Shibuya and Asakusa on a 3D map. It added

00:05:37.129 --> 00:05:40.149
building extrusions, used Mapbox GL JS correctly,

00:05:40.389 --> 00:05:43.529
even added a day-night toggle. All from memory,

00:05:43.670 --> 00:05:45.449
essentially. That shows it's not just applying

00:05:45.449 --> 00:05:47.709
code patterns. It has actual world knowledge

00:05:47.709 --> 00:05:49.930
integrated pretty deeply. That's knowledge plus

00:05:49.930 --> 00:05:52.889
application skill. Absolutely. But, okay, to

00:05:52.889 --> 00:05:54.870
be balanced, we need to look at the edges. Where

00:05:54.870 --> 00:05:56.949
does GPT-5 still have that slight advantage?

00:05:57.329 --> 00:05:59.639
That brings us to the beehive simulation. Right,

00:05:59.699 --> 00:06:02.459
another super complex test. This needed specific

00:06:02.459 --> 00:06:05.439
biological knowledge about how bees build hives

00:06:05.439 --> 00:06:08.279
combined with tricky geometry, those hexagonal

00:06:08.279 --> 00:06:10.560
cells, foraging patterns, interactive controls.

00:06:10.980 --> 00:06:14.139
And Kimi K2 did build a simulation, which honestly

00:06:14.139 --> 00:06:16.220
is still impressive. You could see cells forming,

00:06:16.300 --> 00:06:19.790
bees moving around, but... The hexagonal alignment

00:06:19.790 --> 00:06:22.009
was off. Critically flawed, actually. The pattern

00:06:22.009 --> 00:06:24.689
wasn't regular like a real honeycomb. The hive

00:06:24.689 --> 00:06:27.350
grew kind of chaotically, not in that structured,

00:06:27.490 --> 00:06:30.290
layered way you see in nature. But GPT-5 got

00:06:30.290 --> 00:06:32.649
the geometry perfect. Stable, mathematically

00:06:32.649 --> 00:06:36.470
correct hexagons. Apparently, yes. GPT-5 nailed

00:06:36.470 --> 00:06:39.730
that part. Hmm. So if Kimi K2 can handle

00:06:39.730 --> 00:06:42.810
the complex math of fluid dynamics, why would

00:06:42.810 --> 00:06:45.209
this specific geometric pattern trip it up? That

00:06:45.209 --> 00:06:47.370
seems counterintuitive. It's subtle, isn't it?

00:06:47.720 --> 00:06:49.579
I think it gets into the nuance of these models.

00:06:49.779 --> 00:06:52.500
Fluid dynamics, while complex, is heavily based

00:06:52.500 --> 00:06:55.439
on known equations. You apply the formulas. Achieving

00:06:55.439 --> 00:06:57.560
perfect geometric precision in a complex simulation

00:06:57.560 --> 00:07:00.139
like the beehive, that seems to need a different

00:07:00.139 --> 00:07:02.540
kind of extreme attention to detail. The coordinate

00:07:02.540 --> 00:07:05.300
systems, object relationships, maybe it's just

00:07:05.300 --> 00:07:07.730
harder to specify perfectly in a prompt. You

00:07:07.730 --> 00:07:09.629
know, I still wrestle with this myself sometimes.

00:07:09.810 --> 00:07:13.009
I find myself expecting absolute, almost scientific

00:07:13.009 --> 00:07:15.329
perfection from these single prompt outputs,

00:07:15.610 --> 00:07:18.470
even when they're incredibly complex. It's easy

00:07:18.470 --> 00:07:20.569
to forget they're working from learned patterns,

00:07:20.689 --> 00:07:23.209
not some fundamental understanding of mathematical

00:07:23.209 --> 00:07:25.750
truth, you know. That's the vulnerable admission,

00:07:26.009 --> 00:07:27.689
right? We all kind of do that. And these small

00:07:27.689 --> 00:07:30.569
failures, like the beehive alignment, they're

00:07:30.569 --> 00:07:33.350
useful. They show us precisely where the current

00:07:33.350 --> 00:07:36.350
limits are, and where careful prompting and maybe

00:07:36.350 --> 00:07:39.050
multi -step generation are still key. So why

00:07:39.050 --> 00:07:41.069
is understanding these little stumbles, like

00:07:41.069 --> 00:07:43.850
the beehive example, just as important as celebrating

00:07:43.850 --> 00:07:46.910
the big wins? It highlights where GPT-5 still

00:07:46.910 --> 00:07:49.310
holds an edge, particularly in tasks needing

00:07:49.310 --> 00:07:52.269
extreme geometric precision and intricate detail.
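For a sense of why the honeycomb test is so unforgiving: the geometry GPT-5 got right is exact closed-form math, where every neighboring cell center in a regular pointy-top hex grid must sit the same distance apart. A small illustrative sketch (the unit cell size and pointy-top layout are arbitrary assumptions, not details from the test):

```python
import math

# Illustrative only: exact center spacing of a regular pointy-top
# hexagonal grid, the kind of layout the beehive test demanded.
# A cell size of 1.0 is an arbitrary assumption.
def hex_center(col: int, row: int, size: float = 1.0) -> tuple[float, float]:
    """Offset (col, row) coordinates -> cartesian center of a hex cell."""
    width = math.sqrt(3) * size          # horizontal spacing between columns
    x = width * (col + 0.5 * (row & 1))  # odd rows shift half a cell over
    y = 1.5 * size * row                 # rows overlap: spacing is 3/4 height
    return x, y
```

Any drift from these formulas compounds as the hive grows, which is a plausible reading of why Kimi K2's pattern came out chaotic rather than regular.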

00:07:52.550 --> 00:07:54.810
Okay, really interesting. Let's take a quick

00:07:54.810 --> 00:07:57.009
pause here. When we come back, we'll dig into

00:07:57.009 --> 00:07:59.470
what might be the ultimate test for professional

00:07:59.470 --> 00:08:02.459
use. Can you actually trust it? We're talking

00:08:02.459 --> 00:08:05.800
reliability, zero hallucination, especially in

00:08:05.800 --> 00:08:08.319
high stakes areas like finance and scientific

00:08:08.319 --> 00:08:10.629
research. Welcome back to the Deep Dive. We're

00:08:10.629 --> 00:08:13.350
talking about the open source model Kimi K2 Thinking.

00:08:13.629 --> 00:08:16.269
So for any business, anyone listening,

00:08:16.550 --> 00:08:19.310
reliability is paramount, right? A cool demo

00:08:19.310 --> 00:08:21.529
is one thing, but if the output isn't accurate,

00:08:21.750 --> 00:08:24.529
it's useless, maybe even dangerous. Let's get

00:08:24.529 --> 00:08:26.410
into that financial analysis use case mentioned

00:08:26.410 --> 00:08:28.350
in the sources. Yeah, this sounds like a killer

00:08:28.350 --> 00:08:31.149
app scenario because it tests really deep reasoning

00:08:31.149 --> 00:08:34.110
across multiple dense documents. So they fed

00:08:34.110 --> 00:08:37.269
it Q4 financial reports, think thick PDFs, from

00:08:37.269 --> 00:08:40.980
Google, NVIDIA, Amazon. And the task was compare

00:08:40.980 --> 00:08:43.620
them, create charts, pull out key insights. That's

00:08:43.620 --> 00:08:45.899
tough. It's not just summarizing one doc. It

00:08:45.899 --> 00:08:47.879
has to find the same metrics across different

00:08:47.879 --> 00:08:49.799
report structures, different accounting styles,

00:08:49.919 --> 00:08:51.899
and pull exact numbers correctly from all of

00:08:51.899 --> 00:08:53.820
them. And the results? According to the source

00:08:53.820 --> 00:08:57.820
material, the accuracy was shocking. It nailed

00:08:57.820 --> 00:09:01.559
YouTube ads revenue: $10.5 billion. Correct.

00:09:01.840 --> 00:09:04.259
It correctly pulled out NVIDIA's absolutely insane

00:09:04.259 --> 00:09:08.769
12,264% year-over-year growth. Correct. The

00:09:08.769 --> 00:09:11.070
claim is the numbers were 100% right across

00:09:11.070 --> 00:09:13.929
all three of these super dense reports. I mean,

00:09:13.929 --> 00:09:16.330
that level of precision, synthesizing hundreds

00:09:16.330 --> 00:09:19.190
of pages, that could save a financial analyst

00:09:19.190 --> 00:09:21.889
days of manual grunt work. OK, so if it builds

00:09:21.889 --> 00:09:23.809
trust in finance, what about really specialized

00:09:23.809 --> 00:09:26.389
science? It was tested on researching Alexander

00:09:26.389 --> 00:09:28.970
disease, a rare neurological disorder. Right.

00:09:29.049 --> 00:09:30.909
And here it apparently used its thinking and

00:09:30.909 --> 00:09:33.330
search modes, which sound agentic. Agentic basically

00:09:33.330 --> 00:09:35.190
means the model doesn't just respond. It can

00:09:35.190 --> 00:09:37.289
plan and execute steps like a human researcher.

00:09:37.320 --> 00:09:39.220
Okay, I need to search for papers, read them,

00:09:39.360 --> 00:09:41.860
synthesize findings, structure a report. And

00:09:41.860 --> 00:09:43.820
the quality of that final report after a process,

00:09:43.940 --> 00:09:46.440
what, 48 different research results? The claim

00:09:46.440 --> 00:09:49.220
is publication quality. It apparently generated

00:09:49.220 --> 00:09:51.460
detailed flowcharts mapping out the molecular

00:09:51.460 --> 00:09:54.279
pathophysiology, a clear diagnostic pathway.

00:09:54.840 --> 00:09:57.779
And crucially, it included a timely update about

00:09:57.779 --> 00:10:00.340
an expected FDA filing for a potential treatment

00:10:00.340 --> 00:10:03.519
in Q1 2026. That's not just summarizing old info.

00:10:03.600 --> 00:10:05.379
That's pulling cutting-edge, relevant details.
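That search-read-synthesize loop can be caricatured in a few lines. To be clear, this is a hypothetical sketch, not Kimi K2's actual implementation; `search_papers` and `summarize` are stand-in stubs for real tool calls:

```python
# A caricature of the agentic plan -> search -> read -> synthesize loop
# described above. Nothing here is Kimi K2's real implementation;
# search_papers() and summarize() are hypothetical stubs for tool calls.
def search_papers(query: str) -> list[str]:
    """Stub tool call: would hit a literature search API."""
    return [f"paper about {query} #{i}" for i in range(3)]

def summarize(text: str) -> str:
    """Stub tool call: would ask the model to condense one source."""
    return f"summary of ({text})"

def research(topic: str, max_results: int = 48) -> str:
    """Plan sub-queries, gather sources, then emit one structured report."""
    queries = [f"{topic} pathophysiology",
               f"{topic} diagnosis",
               f"{topic} treatment"]
    findings = []
    for q in queries:
        for paper in search_papers(q)[:max_results]:
            findings.append(summarize(paper))
    return "REPORT\n" + "\n".join(f"- {f}" for f in findings)
```

The point of the sketch is the control flow: the model decides the sub-queries and keeps looping over tool results, rather than answering in one shot.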

00:10:05.659 --> 00:10:09.120
Super sophisticated synthesis. Wow. Okay, and

00:10:09.120 --> 00:10:12.080
then the ultimate acid test for trust, the hallucination

00:10:12.080 --> 00:10:14.659
trap. They asked about Stable Diffusion 5, which

00:10:14.659 --> 00:10:17.600
doesn't exist. Exactly. This is a classic failure

00:10:17.600 --> 00:10:20.240
point for LLMs. They often just confidently invent

00:10:20.240 --> 00:10:22.240
plausible -sounding details about things that

00:10:22.240 --> 00:10:26.120
aren't real. Here, it handled it perfectly. It didn't invent anything

00:10:26.120 --> 00:10:29.240
about SD5. Instead, it correctly stated it doesn't

00:10:29.240 --> 00:10:31.379
exist and provided accurate info on the actual

00:10:31.379 --> 00:10:34.480
current version, SD 3.5. That kind of reliability,

00:10:34.840 --> 00:10:37.259
refusing to just make stuff up, that's absolutely

00:10:37.259 --> 00:10:38.860
critical if you're going to use this in a professional

00:10:38.860 --> 00:10:41.279
setting. And the sources also mentioned quick

00:10:41.279 --> 00:10:43.259
hits like successfully creating an interactive

00:10:43.259 --> 00:10:46.779
gut bacteria taxonomy tree, quite niche, and

00:10:46.779 --> 00:10:48.960
an interactive physics course that perfectly

00:10:48.960 --> 00:10:51.840
modeled kinematics. So the picture emerging is

00:10:51.840 --> 00:10:54.179
one of reliable, sophisticated, and importantly

00:10:54.179 --> 00:10:57.460
trustworthy performance in complex domains. So

00:10:57.460 --> 00:10:59.940
given that stellar performance in finance, science,

00:10:59.960 --> 00:11:02.259
and the hallucination test, what's the biggest

00:11:02.259 --> 00:11:04.629
hurdle left for companies wanting to... adopt

00:11:04.629 --> 00:11:07.750
Kimi K2? Probably scaling its deployment, right?

00:11:07.830 --> 00:11:09.789
And integrating it smoothly into their existing

00:11:09.789 --> 00:11:12.490
workflows and tech infrastructure. So let's try

00:11:12.490 --> 00:11:14.990
to wrap this up. What Kimi K2 seems to represent,

00:11:15.250 --> 00:11:17.750
it feels like a really significant, maybe permanent

00:11:17.750 --> 00:11:20.549
shift in the AI power balance. It's clearly a

00:11:20.549 --> 00:11:22.710
powerhouse. It offers capability that's right

00:11:22.710 --> 00:11:24.970
up there near GPT-5, but it's free, it can be

00:11:24.970 --> 00:11:27.309
run privately, and it's proven capable of generating

00:11:27.309 --> 00:11:30.190
complex working apps and highly accurate analysis

00:11:30.190 --> 00:11:33.090
in demanding fields. Yeah, the strategic takeaway

00:11:33.090 --> 00:11:34.950
for anyone listening, developers, business

00:11:34.950 --> 00:11:37.590
leaders is pretty clear, I think. Open source

00:11:37.590 --> 00:11:40.389
has definitively closed the quality gap with

00:11:40.389 --> 00:11:42.750
the top proprietary models. You might no longer

00:11:42.750 --> 00:11:44.730
have to choose between the absolute best performance

00:11:44.730 --> 00:11:48.029
and having control over your data, your cost,

00:11:48.070 --> 00:11:50.649
your infrastructure. That choice is changing.

00:11:50.830 --> 00:11:53.490
And the source material hints that part two is

00:11:53.490 --> 00:11:55.429
going to dive into the tech specs specifically.

00:11:55.549 --> 00:11:59.149
its one-trillion-parameter mixture-of-experts,

00:11:59.149 --> 00:12:03.960
or MoE, architecture. That MoE approach is probably

00:12:03.960 --> 00:12:06.679
key to how it achieves this performance while

00:12:06.679 --> 00:12:09.639
staying, well, manageable enough to be open sourced.
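As a preview of what mixture-of-experts means in practice: a small router scores all the experts but only runs the top few for each token. A toy sketch of that routing (the expert scores and the top-k value are invented, not Kimi K2's real configuration):

```python
import math

# Toy mixture-of-experts routing: a gate scores every expert, but only
# the top-k actually run for a given token. Expert count and k below
# are invented numbers, not Kimi K2's real configuration.
def route(gate_scores: list[float], k: int = 2):
    """Return (expert_index, weight) pairs for the k best-scoring experts."""
    top = sorted(range(len(gate_scores)),
                 key=lambda i: gate_scores[i], reverse=True)[:k]
    # softmax over just the selected experts' scores
    exps = [math.exp(gate_scores[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]
```

Because only k experts activate per token, a trillion-parameter model touches just a fraction of its weights on each step, which is presumably how it stays manageable to serve.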

00:12:09.720 --> 00:12:12.279
Right. So maybe here's a final thought for you

00:12:12.279 --> 00:12:14.899
to chew on after this deep dive. If a free open

00:12:14.899 --> 00:12:17.159
source model can already do this, what happens

00:12:17.159 --> 00:12:19.080
next? What happens when the cost of the hardware

00:12:19.080 --> 00:12:21.259
needed to run a model like this drops low enough

00:12:21.259 --> 00:12:23.460
that basically every small team, every consultant,

00:12:23.519 --> 00:12:26.460
every startup can have their own private, powerful,

00:12:26.779 --> 00:12:30.120
maybe even custom-tuned AI? That feels like

00:12:30.120 --> 00:12:32.279
the next wave of disruption coming. Definitely

00:12:32.279 --> 00:12:33.840
something to think about. Keep digging.
