WEBVTT

00:00:00.000 --> 00:00:03.879
Okay, so imagine this AI, right? It just aces

00:00:03.879 --> 00:00:08.480
complex math, writes code like a total pro, but

00:00:08.480 --> 00:00:11.939
then ask it to do some, you know, creative design

00:00:11.939 --> 00:00:15.599
and it just completely bombs. That's kind

00:00:15.599 --> 00:00:18.219
of the weird paradox with Grok 4. It really is.

00:00:18.300 --> 00:00:20.539
A fascinating one. Welcome to the deep dive.

00:00:21.339 --> 00:00:24.339
Today, we're really digging into Grok 4, xAI's

00:00:24.339 --> 00:00:27.079
new thing. Just came out, what, July 9th, 2025.

00:00:27.679 --> 00:00:29.739
Exactly. And yeah, Elon Musk's company, they're

00:00:29.739 --> 00:00:32.579
saying, you know, world's best AI model. Big

00:00:32.579 --> 00:00:35.060
claims. Right. Big claims. You hear that a lot.

00:00:35.179 --> 00:00:36.240
Cool. We're not just going to take their word

00:00:36.240 --> 00:00:37.880
for it, obviously. We actually got our hands

00:00:37.880 --> 00:00:41.280
dirty, ran some real tests, tried to see what's

00:00:41.280 --> 00:00:44.100
really under the hood. Yeah, we did. So our mission

00:00:44.100 --> 00:00:46.560
here for you listening is to cut through all

00:00:46.560 --> 00:00:49.359
that hype. Get to the core of it. Figure out

00:00:49.359 --> 00:00:52.210
what you really need to know about Grok 4. We'll

00:00:52.210 --> 00:00:54.530
look at the scale, the benchmarks, some surprises

00:00:54.530 --> 00:00:57.350
there. Definitely some surprises. The cost, the

00:00:57.350 --> 00:01:00.329
economics, and then how you actually use it in

00:01:00.329 --> 00:01:03.289
a real project. Right, the practical side. That's

00:01:03.289 --> 00:01:06.810
crucial. Plus, we've got these three key real

00:01:06.810 --> 00:01:09.450
-world tests. They really show where it works

00:01:09.450 --> 00:01:11.450
and, well, where it kind of falls flat. Okay,

00:01:11.769 --> 00:01:13.609
so let's start unpacking this. What is Grok 4

00:01:13.609 --> 00:01:16.849
exactly? The headline number is just huge. Yeah.

00:01:16.969 --> 00:01:20.269
An estimated 1.7 trillion parameters. Yeah,

00:01:20.269 --> 00:01:23.129
1.7 trillion. It's hard to even wrap your head

00:01:23.129 --> 00:01:25.609
around that number. Think of parameters as like

00:01:25.609 --> 00:01:30.189
the AI's brain cells or maybe connection points

00:01:30.189 --> 00:01:32.930
in its network. Okay. More parameters generally

00:01:32.930 --> 00:01:35.750
mean more power to learn really complex, intricate

00:01:35.750 --> 00:01:37.950
patterns in the data it's trained on. Right.

00:01:38.010 --> 00:01:40.319
So how does that compare? Well, for perspective,

00:01:40.620 --> 00:01:43.540
GPT-4 is estimated at around 1.8 trillion, so

00:01:43.540 --> 00:01:46.219
kind of similar ballpark there. Google's Gemini

00:01:46.219 --> 00:01:49.060
Ultra is about 1 trillion, and Anthropic's Claude

00:01:49.060 --> 00:01:52.939
4 is maybe around 500 billion. So Grok 4 is

00:01:52.939 --> 00:01:54.799
definitely up there with the biggest models. So

00:01:54.799 --> 00:01:56.120
definitely not something you're running locally.

00:01:56.400 --> 00:01:58.239
Oh, no way. Yeah, you're definitely not running

00:01:58.239 --> 00:02:00.560
this beast on your gaming PC. It's entirely cloud

00:02:00.560 --> 00:02:02.700
-based, needs massive infrastructure. Makes sense.

00:02:02.840 --> 00:02:06.290
And xAI's claims, I mean, they're bold. Better

00:02:06.290 --> 00:02:09.069
than a PhD level in every subject, smarter than

00:02:09.069 --> 00:02:11.289
almost all graduate students in all disciplines

00:02:11.289 --> 00:02:13.990
simultaneously. That's straight from Elon Musk.

00:02:13.990 --> 00:02:17.250
Wow, that's quite the statement. It is. It sets

00:02:17.250 --> 00:02:20.569
a high bar. So, boiling it down, what's Grok 4's

00:02:20.569 --> 00:02:22.969
biggest raw strength, just based on that scale?

00:02:22.969 --> 00:02:25.849
Its immense scale allows it to learn incredibly

00:02:25.849 --> 00:02:29.509
complex, detailed patterns, stuff smaller models

00:02:29.509 --> 00:02:32.629
might just miss. Okay, scale is one thing, but performance

00:02:32.629 --> 00:02:36.349
is another. This is where it gets really interesting

00:02:36.349 --> 00:02:38.810
for me. The actual benchmarks. How did it do?

00:02:39.050 --> 00:02:41.930
Right, the numbers. On something called Humanity's

00:02:41.930 --> 00:02:45.250
Last Exam, or HLE, it's designed to test broad

00:02:45.250 --> 00:02:47.250
knowledge and reasoning across different fields.

00:02:47.550 --> 00:02:51.169
Grok 4 scored 25% just on its own. But, and

00:02:51.169 --> 00:02:53.389
this is key, when it could use tools like a search

00:02:53.389 --> 00:02:57.030
engine, it jumped to 44.4%. Ah, so it knows

00:02:57.030 --> 00:02:58.849
how to use tools effectively. That's different

00:02:58.849 --> 00:03:01.169
from just raw knowledge. Exactly. It shows it

00:03:01.169 --> 00:03:03.430
can leverage external resources intelligently,

00:03:03.469 --> 00:03:05.930
which is, you know, way more like how humans

00:03:05.930 --> 00:03:08.009
solve problems in the real world. That makes

00:03:08.009 --> 00:03:09.909
sense. What about more specialized areas? Yeah,

00:03:10.009 --> 00:03:12.050
for grad level physics and astronomy, it hit

00:03:12.050 --> 00:03:15.389
87, 88 percent. That puts it ahead of Google

00:03:15.389 --> 00:03:18.330
Gemini and Anthropic's Claude in those tests. So

00:03:18.330 --> 00:03:21.370
a really strong grasp of complex scientific concepts.

00:03:21.689 --> 00:03:24.909
87, 88 percent on grad level physics. That's

00:03:24.909 --> 00:03:27.110
impressive. It really is. But honestly, the part

00:03:27.110 --> 00:03:29.550
that truly blew me away was the AIME score, the

00:03:29.550 --> 00:03:31.870
American Invitational Mathematics Examination.

00:03:31.969 --> 00:03:34.250
Oh, yeah, that's notoriously difficult. Extremely.

00:03:34.370 --> 00:03:39.009
Grok 4 scored 95 out of 100. Whoa, wait, 95 percent?

00:03:39.479 --> 00:03:42.860
On the AIME? That's incredible. That's not just

00:03:42.860 --> 00:03:45.020
calculation, that's deep mathematical reasoning,

00:03:45.020 --> 00:03:47.919
step-by-step problem solving. Precisely. It really

00:03:47.919 --> 00:03:49.919
points to that powerful step-by-step thinking

00:03:49.919 --> 00:03:52.860
capability, especially for complex math. It's a

00:03:52.860 --> 00:03:56.219
huge deal. Imagine an AI solving those multi-step

00:03:56.219 --> 00:03:59.060
math problems with 95% accuracy. That feels like

00:03:59.060 --> 00:04:01.340
a massive leap. And what about for developers,

00:04:01.340 --> 00:04:05.000
coding benchmarks? Ah, yes, the Software Engineering

00:04:05.000 --> 00:04:08.340
Benchmark, SWE Bench, crucial one. This test

00:04:08.340 --> 00:04:10.599
involves fixing real bugs, adding features to existing

00:04:10.599 --> 00:04:12.840
code bases. Right, the messy stuff. Exactly.

00:04:13.120 --> 00:04:17.379
Grok 4 scored between 72% and 75%. That places

00:04:17.379 --> 00:04:20.240
it right at the absolute top for tackling these

00:04:20.240 --> 00:04:22.079
real -world coding challenges. Okay, so putting

00:04:22.079 --> 00:04:24.259
these numbers together, how do they translate

00:04:24.259 --> 00:04:26.939
to real -world impact? What's the takeaway? It

00:04:26.939 --> 00:04:29.720
means elite problem solving ability, especially

00:04:29.720 --> 00:04:32.879
for tough scientific and coding tasks. OK, those

00:04:32.879 --> 00:04:35.339
numbers are seriously impressive. Elite problem

00:04:35.339 --> 00:04:38.300
solving power. But, you know, power usually comes

00:04:38.300 --> 00:04:40.699
with a price tag. Let's talk economics. This

00:04:40.699 --> 00:04:43.379
isn't free, right? Not at all. Grok 4 is a premium

00:04:43.379 --> 00:04:46.379
commercial product. You access it via an API

00:04:46.379 --> 00:04:49.399
and you pay for that access. There's no free

00:04:49.399 --> 00:04:51.980
lunch here. So the value proposition isn't about

00:04:51.980 --> 00:04:54.000
being the cheapest option out there. Definitely

00:04:54.000 --> 00:04:56.639
not. It's about being the best for specific high

00:04:56.639 --> 00:04:59.480
-value tasks. Think about it this way. Maybe

00:04:59.480 --> 00:05:02.199
it costs, say, 12 cents for a complex query.

00:05:02.439 --> 00:05:04.980
Okay. But if that 12-cent query automates a

00:05:04.980 --> 00:05:07.360
task that would take a skilled developer, I don't

00:05:07.360 --> 00:05:09.120
know, hours to figure out, like tracking down

00:05:09.120 --> 00:05:10.939
a really tricky bug. Right. The return on investment

00:05:10.939 --> 00:05:14.279
could be massive. Exactly. The cost of the developer's

00:05:14.279 --> 00:05:17.360
time, the project delays. Suddenly, 12 cents

00:05:17.360 --> 00:05:19.639
looks incredibly cheap. You're paying for that

00:05:19.639 --> 00:05:21.920
elite level performance, that acceleration. So

00:05:21.920 --> 00:05:24.120
bottom line, is it really worth the cost then?

00:05:24.519 --> 00:05:28.980
For high value, complex problems, its superior

00:05:28.980 --> 00:05:31.560
performance absolutely justifies the cost. All

00:05:31.560 --> 00:05:33.839
right. Makes sense. So let's get practical. How

00:05:33.839 --> 00:05:38.000
do you actually start using Grok 4 in, say, an

00:05:38.000 --> 00:05:41.129
automation workflow? What are the paths? There

00:05:41.129 --> 00:05:43.350
are basically two main ways people are doing

00:05:43.350 --> 00:05:45.689
it right now. Path one is the direct connection

00:05:45.689 --> 00:05:48.250
to XAI. Okay, how does that work? It's pretty

00:05:48.250 --> 00:05:50.050
straightforward on the surface. You typically

00:05:50.050 --> 00:05:53.029
use an AI agent node in whatever automation tool

00:05:53.029 --> 00:05:57.110
you prefer. You grab an API key from the xAI

00:05:57.110 --> 00:05:58.850
developer console, plug in your credentials.

00:05:58.850 --> 00:06:01.500
Standard API setup. Right. Then you select the

00:06:01.500 --> 00:06:04.560
model, which would be grok-4-0709 or whatever

00:06:04.560 --> 00:06:06.620
the latest version is. You can do a quick test,

00:06:06.759 --> 00:06:09.100
like sending hello, Grok, just to make sure the

00:06:09.100 --> 00:06:10.639
connection's live. And then you can give it tools,

00:06:10.819 --> 00:06:14.139
like web search. Yep. You can add nodes for tools,

00:06:14.279 --> 00:06:16.759
maybe a Perplexity node for research, or Tavily,

00:06:16.759 --> 00:06:19.980
or others. Grok 4 is designed to figure out when

00:06:19.980 --> 00:06:22.319
it needs to use those tools to answer your prompt.
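For anyone building this outside a visual workflow tool, the direct path boils down to one HTTP call. Here's a minimal sketch, assuming an OpenAI-style chat-completions API; the endpoint URL is an assumption to verify against xAI's current docs, and the model id is the one cited in this episode:

```python
# Minimal sketch of the direct-to-xAI path, assuming an OpenAI-style
# chat-completions API. The endpoint URL is an assumption; the model id
# is the one named in the episode.
import json

XAI_ENDPOINT = "https://api.x.ai/v1/chat/completions"  # assumed endpoint

def build_chat_request(model: str, prompt: str) -> dict:
    """Build the JSON body for a chat-completions call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

# The "hello, Grok" connectivity test from the episode: POST this body
# to XAI_ENDPOINT with your API key as a Bearer token.
req = build_chat_request("grok-4-0709", "hello, Grok")
print(json.dumps(req))
```

A workflow tool's AI agent node is doing essentially this under the hood, plus the tool-calling loop on top.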

00:06:22.579 --> 00:06:26.300
Sounds good in theory, but does it always work

00:06:26.300 --> 00:06:31.220
smoothly? Well... No, not always. We actually

00:06:31.220 --> 00:06:33.199
tried building something we called an ultimate

00:06:33.199 --> 00:06:35.939
assistant. The goal was research a topic using

00:06:35.939 --> 00:06:38.120
two different tools, find a relevant contact

00:06:38.120 --> 00:06:41.100
person, and then draft an email to them. Pretty

00:06:41.100 --> 00:06:43.699
complex task. Multi-step. Yeah, definitely.

00:06:44.000 --> 00:06:46.800
And we immediately hit a snag. We got this error.

00:06:47.379 --> 00:06:50.319
Failed to parse tool arguments from chat model.

00:06:51.129 --> 00:06:53.709
Meaning Grok 4 wasn't sending back its instructions

00:06:53.709 --> 00:06:55.750
for using the tools in the strict format, the

00:06:55.750 --> 00:06:57.889
JSON format that the workflow needed. So the

00:06:57.889 --> 00:06:59.790
workflow just broke. It didn't know what Grok

00:06:59.790 --> 00:07:02.230
was trying to tell it to do next. Okay. That

00:07:02.230 --> 00:07:04.709
sounds frustrating. Could you fix it? We tried.

00:07:04.829 --> 00:07:07.610
We even tried forcing it into a JSON-only output

00:07:07.610 --> 00:07:10.189
mode, but that didn't help either. In fact, it

00:07:10.189 --> 00:07:11.850
just stopped trying to use the tools altogether

00:07:11.850 --> 00:07:15.569
then. So the direct connection can be a bit finicky.

00:07:15.829 --> 00:07:17.769
Yeah. I still wrestle with prompt drift myself

00:07:17.769 --> 00:07:19.970
sometimes, you know, where... The AI's output

00:07:19.970 --> 00:07:22.529
just changes over time, even with the same prompt.

00:07:22.649 --> 00:07:25.889
A consistent API is like a godsend when you're

00:07:25.889 --> 00:07:28.629
building something real. Totally agree. Which

00:07:28.629 --> 00:07:32.129
leads us nicely to path number two, using OpenRouter.

00:07:32.290 --> 00:07:34.189
OpenRouter. I've heard of that. It's like a middleman.

00:07:34.529 --> 00:07:37.550
Exactly. It acts as an intermediary, a routing

00:07:37.550 --> 00:07:40.350
layer. And honestly, it's often a smarter and

00:07:40.350 --> 00:07:43.230
much more reliable way to go. Why is that? What

00:07:43.230 --> 00:07:46.529
are the benefits? Several big ones. First, single

00:07:46.529 --> 00:07:49.449
billing. You get one bill even if you use dozens

00:07:49.449 --> 00:07:52.089
of different models from OpenAI, Anthropic, Google,

00:07:52.089 --> 00:07:54.850
xAI, whoever. That simplifies things a lot. Okay,

00:07:54.850 --> 00:07:57.449
that's convenient. Second, a unified API format.

00:07:57.449 --> 00:07:59.629
OpenRouter makes all these different models

00:07:59.629 --> 00:08:01.350
talk in basically the same way through their

00:08:01.350 --> 00:08:03.769
API, so your code or workflow setup is much more

00:08:03.769 --> 00:08:07.389
consistent even if you swap models. So less chance

00:08:07.389 --> 00:08:09.029
of those parsing errors we just talked about.

00:08:09.129 --> 00:08:12.110
Yeah, precisely. And third, often the connections

00:08:12.110 --> 00:08:15.350
just seem more stable and reliable through OpenRouter.

00:08:15.470 --> 00:08:17.689
In your workflow tool, you'd use a generic chat

00:08:17.689 --> 00:08:20.230
model node, connect it to OpenRouter, and then

00:08:20.230 --> 00:08:22.490
just select Grok 4 from the list of models

00:08:22.490 --> 00:08:24.430
they offer. And did that work for the ultimate

00:08:24.430 --> 00:08:26.529
assistant? Like a charm. It worked perfectly.
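In code terms, the switch they describe is tiny. As a sketch: the OpenRouter base URL is real, but treat the "x-ai/grok-4" identifier as an assumption to check against OpenRouter's model list:

```python
# The unified-format point in practice: the request body is identical
# for both paths; only the base URL and the model identifier change.
# "x-ai/grok-4" is an assumed OpenRouter id -- verify before using.
DIRECT_XAI = {"base_url": "https://api.x.ai/v1", "model": "grok-4-0709"}
OPENROUTER = {"base_url": "https://openrouter.ai/api/v1", "model": "x-ai/grok-4"}

def chat_call(route: dict, prompt: str):
    """Return (url, body) for an OpenAI-style chat call via either route."""
    url = route["base_url"] + "/chat/completions"
    body = {
        "model": route["model"],
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, body

url, body = chat_call(OPENROUTER, "Research the topic, then draft an email.")
print(url)
```

Swapping providers later means editing one dictionary, which is exactly the consistency benefit being described here.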

00:08:27.129 --> 00:08:29.930
Grok 4, through OpenRouter, laid out its plan.

00:08:30.509 --> 00:08:33.149
Research with Perplexity, then research with

00:08:33.149 --> 00:08:35.649
Tavily, then it synthesized the info from both

00:08:35.649 --> 00:08:38.250
sources intelligently, looked up the contact,

00:08:38.570 --> 00:08:40.909
and composed this really well-crafted email,

00:08:41.009 --> 00:08:43.990
even citing its sources. Wow. So it managed all

00:08:43.990 --> 00:08:46.210
four tools seamlessly. Seamlessly. It was actually

00:08:46.210 --> 00:08:47.950
quite impressive to watch it orchestrate the

00:08:47.950 --> 00:08:50.529
whole thing. So if a developer is starting out,

00:08:50.669 --> 00:08:53.350
which setup method should they probably prioritize?

00:08:54.340 --> 00:08:56.460
Generally, OpenRouter provides more stability,

00:08:56.799 --> 00:08:59.299
reliability, and honestly, just ease of use.

00:08:59.379 --> 00:09:01.980
Okay, we saw OpenRouter

00:09:01.980 --> 00:09:05.360
smooth things out. But let's talk brass tacks,

00:09:05.460 --> 00:09:08.919
speed, and cost for that complex ultimate assistant

00:09:08.919 --> 00:09:11.620
workflow. How did it actually perform? Right,

00:09:11.679 --> 00:09:13.299
the performance varied. The first time we ran

00:09:13.299 --> 00:09:15.159
it, it took about one minute and 40 seconds,

00:09:15.419 --> 00:09:18.120
which is, you know, pretty reasonable for that

00:09:18.120 --> 00:09:20.720
complexity. Yeah, not bad. But the second time,

00:09:20.740 --> 00:09:23.779
it took over three minutes. Hmm. That's quite

00:09:23.779 --> 00:09:26.759
a difference. Nearly double the time. Why the

00:09:26.759 --> 00:09:29.980
variability? Server load, most likely. Grok 4

00:09:29.980 --> 00:09:32.840
is new. It's popular. It's powerful. Lots of

00:09:32.840 --> 00:09:35.620
people are hitting the API. So performance isn't

00:09:35.620 --> 00:09:37.539
always consistent. That's a really important

00:09:37.539 --> 00:09:39.460
point if you're building something that needs

00:09:39.460 --> 00:09:42.559
predictable response times. Absolutely critical.

00:09:42.779 --> 00:09:44.360
You have to build your applications assuming

00:09:44.360 --> 00:09:46.519
that variability might happen. You need error

00:09:46.519 --> 00:09:49.940
handling, maybe longer timeouts, or ways to manage

00:09:49.940 --> 00:09:52.799
user expectations if a task takes longer sometimes.
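One way to account for that variability, sketched with a toy stand-in for the API call (the timeout and retry numbers are illustrative, not recommendations from the episode):

```python
import time

def call_with_retry(task, attempts=3, timeout_s=240, backoff_s=5.0):
    """Tolerate the latency swings described above: a generous timeout,
    a few retries, and exponential backoff between attempts."""
    for attempt in range(attempts):
        try:
            return task(timeout=timeout_s)
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(backoff_s * (2 ** attempt))

# Toy stand-in for a model call that times out once, then succeeds.
state = {"calls": 0}
def flaky_model_call(timeout):
    state["calls"] += 1
    if state["calls"] == 1:
        raise TimeoutError("server under load")
    return "ok"

print(call_with_retry(flaky_model_call, backoff_s=0.01))  # ok
```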

00:09:53.179 --> 00:09:55.360
It could definitely destabilize things if you

00:09:55.360 --> 00:09:58.169
don't account for it. Okay. Good warning. And

00:09:58.169 --> 00:10:00.230
the cost for that run. You mentioned $0.12 earlier.

00:10:00.509 --> 00:10:03.029
Yeah, through OpenRouter, that specific workflow

00:10:03.029 --> 00:10:05.230
costs about $0.12 each time it ran successfully.

00:10:05.570 --> 00:10:07.809
$0.12 doesn't sound like much on its own. It

00:10:07.809 --> 00:10:09.730
doesn't. But if you're running that kind of complex

00:10:09.730 --> 00:10:12.149
workflow hundreds or thousands of times a day,

00:10:12.269 --> 00:10:15.730
it adds up fast. The cost is driven by the amount

00:10:15.730 --> 00:10:18.590
of text processed, the input prompt, the tool

00:10:18.590 --> 00:10:20.629
usage, the back and forth, the final output.

00:10:20.870 --> 00:10:24.210
They call these tokens. More tokens, more cost.
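The arithmetic here is worth sketching. Per-run price times run volume (the $0.12 per run is the figure from this workflow; the run counts are made-up scales):

```python
# Back-of-envelope cost model: per-run price times run volume.
def monthly_cost(cost_per_run: float, runs_per_day: int, days: int = 30) -> float:
    return cost_per_run * runs_per_day * days

print(f"${monthly_cost(0.12, 10):,.2f}")    # a few test runs a day
print(f"${monthly_cost(0.12, 2000):,.2f}")  # production-scale volume
```

At ten runs a day the bill is pocket change; at thousands a day it becomes a real line item, which is what motivates the hybrid strategy discussed next.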

00:10:24.840 --> 00:10:26.799
So how do you manage that? You can't just not

00:10:26.799 --> 00:10:29.039
use it if you need its power, but you don't want

00:10:29.039 --> 00:10:31.860
costs spiraling. The smart strategy is often

00:10:31.860 --> 00:10:34.820
a hybrid approach. Use cheaper, faster models

00:10:34.820 --> 00:10:37.259
for the initial legwork. Legwork. Maybe use a

00:10:37.259 --> 00:10:39.980
smaller Claude model like Haiku or one of the smaller

00:10:39.980 --> 00:10:43.100
Gemini models to do initial data gathering, maybe

00:10:43.100 --> 00:10:45.679
summarize some long documents or filter information.
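That division of labor can be sketched as a tiny routing function. Everything here is a placeholder, the summarizer especially, which stands in for whatever cheap model does the legwork:

```python
# Hybrid-pipeline sketch: cheap pre-processing first, the expensive
# model only for the final high-value step. Model names are
# illustrative placeholders, not real API ids.
def summarize_cheaply(text: str, max_words: int = 100) -> str:
    """Stand-in for a small, cheap model condensing a long document."""
    return " ".join(text.split()[:max_words])

def hybrid_answer(document: str, question: str, call_model) -> str:
    condensed = summarize_cheaply(document)      # cheap legwork
    prompt = f"{condensed}\n\nTask: {question}"  # only the summary goes on
    return call_model("grok-4", prompt)          # expensive reasoning step

# Toy call_model that records how much text the expensive model saw.
seen = []
def fake_call(model, prompt):
    seen.append((model, len(prompt.split())))
    return "answer"

hybrid_answer("lorem " * 10_000, "find the tricky bug", fake_call)
print(seen)  # the 10,000-word document reaches the pricey model as ~105 words
```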

00:10:45.919 --> 00:10:48.580
They're much cheaper and faster per token. OK,

00:10:48.659 --> 00:10:51.100
so pre-process the information. Exactly. Do

00:10:51.100 --> 00:10:53.360
the grunt work with the cheaper models. Then

00:10:53.360 --> 00:10:56.299
take that condensed summary, that key information,

00:10:56.480 --> 00:10:59.659
and feed only that to Grok for the really high

00:10:59.659 --> 00:11:02.799
-level analysis, the complex reasoning, the final

00:11:02.799 --> 00:11:05.750
synthesis, or the difficult coding task. Ah,

00:11:05.909 --> 00:11:08.090
I see. So you're using Grok 4 strategically,

00:11:08.370 --> 00:11:10.909
only where its unique power adds the most value,

00:11:11.049 --> 00:11:13.250
leveraging its strengths without paying for it

00:11:13.250 --> 00:11:16.429
on simpler tasks. Precisely. That way, you get

00:11:16.429 --> 00:11:19.529
the best of both worlds. The power of Grok 4

00:11:19.529 --> 00:11:21.789
where you need it, but better cost efficiency

00:11:21.789 --> 00:11:25.149
overall. So how can developers manage these performance

00:11:25.149 --> 00:11:27.970
and cost variations effectively? Strategically,

00:11:27.990 --> 00:11:31.090
use cheaper models for initial steps, reserving

00:11:31.090 --> 00:11:33.409
Grok 4 for the high-value reasoning parts. All

00:11:33.409 --> 00:11:35.919
right, we've talked benchmarks, setup, cost,

00:11:36.019 --> 00:11:40.159
performance. Now for the really fun part. Putting

00:11:40.159 --> 00:11:42.500
Grok 4 to the test in some real -world scenarios,

00:11:42.779 --> 00:11:45.899
we set up three distinct challenges. Yep. Wanted

00:11:45.899 --> 00:11:47.720
to see how it handled different kinds of tasks

00:11:47.720 --> 00:11:49.759
you might actually throw at it. Test number one,

00:11:49.799 --> 00:11:52.820
a simple bug fix. Okay. What was the bug? It

00:11:52.820 --> 00:11:55.159
was a common front-end issue. A scrolling problem

00:11:55.159 --> 00:11:58.500
in a React component. We gave Grok 4 the buggy

00:11:58.500 --> 00:12:01.220
code and asked it to fix it. It nailed it. In

00:12:01.220 --> 00:12:03.600
under two minutes, it identified the issue in

00:12:03.600 --> 00:12:05.980
the CSS, proposed a clean, professional solution,

00:12:06.179 --> 00:12:09.360
explained why it worked. Pass. Solid pass. Nice.

00:12:09.600 --> 00:12:12.360
So for straightforward, well-defined, logical

00:12:12.360 --> 00:12:14.799
problems like fixing a specific bug, it's very

00:12:14.799 --> 00:12:17.200
effective. Incredibly effective. Really shines

00:12:17.200 --> 00:12:20.179
there. Okay. Simple bug fix: check. What about

00:12:20.179 --> 00:12:23.299
something more complex? Test number two. Complex

00:12:23.299 --> 00:12:26.820
feature development. This was ambitious. We asked

00:12:26.820 --> 00:12:29.179
it to add a memory feature to an existing chat

00:12:29.179 --> 00:12:32.080
application. Whoa, okay. That's not trivial.

00:12:32.139 --> 00:12:34.120
What did that involve? It required changes across

00:12:34.120 --> 00:12:36.720
the board, database schema updates, creating

00:12:36.720 --> 00:12:39.500
new API endpoints, building new user interface

00:12:39.500 --> 00:12:42.779
components, modifying the core chat logic to

00:12:42.779 --> 00:12:45.799
actually use the memory. A lot of moving parts.

00:12:45.879 --> 00:12:47.480
Yeah, that sounds like a multi-day task for

00:12:47.480 --> 00:12:50.759
a human developer. Easily. How did Grok 4 do?

00:12:50.919 --> 00:12:53.799
Honestly, my mind was truly blown. Really? The

00:12:53.799 --> 00:12:56.720
entire feature was built, designed, coded, integrated

00:12:56.720 --> 00:12:59.360
in under five minutes. Under five minutes. Seriously.

00:12:59.620 --> 00:13:01.580
I didn't write a single line of code myself.

00:13:01.659 --> 00:13:03.899
I just reviewed what it produced, ran it, and

00:13:03.899 --> 00:13:06.820
everything worked perfectly on the first try.

00:13:07.039 --> 00:13:09.299
Wow. Five minutes. I was trying to process

00:13:09.299 --> 00:13:11.679
that. Thinking about the usual back and forth,

00:13:11.720 --> 00:13:14.100
the debugging, the testing for a feature like

00:13:14.100 --> 00:13:16.519
that. What was the most surprising part for you

00:13:16.519 --> 00:13:19.429
just watching that unfold? It felt almost like

00:13:19.429 --> 00:13:22.429
magic. But you could see the logic. It understood

00:13:22.429 --> 00:13:24.529
the existing codebase structure, which was

00:13:24.529 --> 00:13:26.549
well-organized. That's important. And it just

00:13:26.549 --> 00:13:28.789
methodically generated all the necessary pieces

00:13:28.789 --> 00:13:30.830
and connected them correctly. It wasn't just

00:13:30.830 --> 00:13:32.549
code generation. It was architectural understanding.

00:13:32.929 --> 00:13:35.789
It's nothing short of revolutionary for adding

00:13:35.789 --> 00:13:38.789
complex features to existing well-structured

00:13:38.789 --> 00:13:40.750
code bases. Well, revolutionary. That's a strong

00:13:40.750 --> 00:13:43.649
word. But based on that, yeah. Okay. Mind officially

00:13:43.649 --> 00:13:47.610
blown, too. So test one. Simple bug fix. Pass.

00:13:48.090 --> 00:13:50.789
Test two, complex feature, absolutely revolutionary

00:13:50.789 --> 00:13:53.330
pass. What was test three? Test three, new project

00:13:53.330 --> 00:13:56.230
creation. After that stunning success with the

00:13:56.230 --> 00:13:58.149
feature add, we thought, okay, let's see if it

00:13:58.149 --> 00:14:00.289
can build something from scratch. We asked it

00:14:00.289 --> 00:14:02.889
to create a beautiful landing page for a fictional

00:14:02.889 --> 00:14:05.190
product. Just from a prompt, make me a beautiful

00:14:05.190 --> 00:14:07.450
landing page. Pretty much. We gave it some basic

00:14:07.450 --> 00:14:09.250
info about the product, but the key instruction

00:14:09.250 --> 00:14:11.710
was make it beautiful. And the result? Yeah.

00:14:11.769 --> 00:14:15.960
Was it beautiful? No, not really. The website

00:14:15.960 --> 00:14:18.059
it generated was functional. The HTML structure

00:14:18.059 --> 00:14:20.679
was okay. The basic elements were there. But

00:14:20.679 --> 00:14:23.120
visually, it was really disappointing. How so?

00:14:23.279 --> 00:14:27.480
The styling was just... Very basic. Kind of bland,

00:14:27.539 --> 00:14:30.019
even outdated looking. Nothing like what you'd

00:14:30.019 --> 00:14:32.740
consider a modern, polished, beautiful landing

00:14:32.740 --> 00:14:35.639
page. It completely missed the mark on the aesthetics.

00:14:36.100 --> 00:14:39.360
Ah, back to that paradox we started with. Great

00:14:39.360 --> 00:14:42.179
at logic and code structure, but struggles with

00:14:42.179 --> 00:14:44.519
the subjective, creative side. Exactly that.

00:14:45.019 --> 00:14:47.620
It failed here, we think, because it lacked specific

00:14:47.620 --> 00:14:50.740
design context or examples in the prompt. And

00:14:50.740 --> 00:14:53.379
beautiful is just too subjective for it. It doesn't

00:14:53.379 --> 00:14:55.820
have an inherent sense of visual taste or modern

00:14:55.820 --> 00:14:58.200
design trends. Right. It needs clear objective

00:14:58.200 --> 00:15:00.940
parameters or examples for creative tasks. It

00:15:00.940 --> 00:15:03.440
can't just intuit good design. Precisely. It

00:15:03.440 --> 00:15:05.620
struggles immensely with tasks requiring that

00:15:05.620 --> 00:15:07.799
strong sense of visual design or, you know, open

00:15:07.799 --> 00:15:09.940
-ended creativity, unless you guide it very,

00:15:10.019 --> 00:15:12.440
very explicitly. So looking back at these three

00:15:12.440 --> 00:15:15.259
tests, what's the biggest takeaway? Grok 4 excels

00:15:15.259 --> 00:15:18.000
at structured, logical tasks, even very complex

00:15:18.000 --> 00:15:20.340
ones, but struggles with open-ended creativity.

00:15:20.899 --> 00:15:22.860
Okay, we've run the tests, looked at the numbers,

00:15:22.940 --> 00:15:25.139
the setup, the cost. Let's try to bring it all

00:15:25.139 --> 00:15:27.820
together. Is Grok 4 worth it? Let's recap the

00:15:27.820 --> 00:15:30.460
good and the bad. The good. Exceptional reasoning

00:15:30.460 --> 00:15:33.480
ability, definitely. Powerful math capabilities,

00:15:33.679 --> 00:15:36.409
as we saw with that AIME score. Excellent tool

00:15:36.409 --> 00:15:38.870
use, especially through something like OpenRouter.

00:15:39.009 --> 00:15:41.309
And it can produce really high quality research

00:15:41.309 --> 00:15:44.230
and analysis when guided properly. Okay. Sounds

00:15:44.230 --> 00:15:46.750
powerful. Right. What about the downsides? The

00:15:46.750 --> 00:15:49.269
not so good? Speed inconsistency is a big one

00:15:49.269 --> 00:15:51.370
due to that server load issue. You have to plan

00:15:51.370 --> 00:15:53.669
for it. Right. Higher cost, especially if you're

00:15:53.669 --> 00:15:55.929
using it for high volume tasks without optimization.

00:15:56.429 --> 00:16:00.149
That direct xAI integration can be finicky, as

00:16:00.149 --> 00:16:02.370
we found. Yeah. The JSON parsing issue. And that

00:16:02.370 --> 00:16:05.850
clear struggle with creative, subjective tasks

00:16:05.850 --> 00:16:08.470
like visual design. It's just not its strong suit.

00:16:08.470 --> 00:16:11.149
So, the bottom line, what's the final verdict? Look,

00:16:11.149 --> 00:16:14.549
Grok 4 is undeniably impressive. It's incredibly

00:16:14.549 --> 00:16:17.289
powerful. It excels at complex reasoning, math,

00:16:17.289 --> 00:16:19.789
coding within existing structures, deep analysis.

00:16:19.789 --> 00:16:22.789
It's not always the fastest. It's not the cheapest.

00:16:22.789 --> 00:16:25.549
But when you need that serious intellectual horsepower

00:16:25.549 --> 00:16:28.389
for your AI automations, especially if you're

00:16:28.389 --> 00:16:30.429
a developer working on well-structured code

00:16:30.429 --> 00:16:33.860
or you need deep analysis, it's an absolute game

00:16:33.860 --> 00:16:36.519
changer. It can do things other models just can't

00:16:36.519 --> 00:16:39.539
or can't do as well. So what defines Grok 4's

00:16:39.539 --> 00:16:42.840
ideal use case then? It's for serious intellectual

00:16:42.840 --> 00:16:46.179
horsepower on complex logical problems. So we've

00:16:46.179 --> 00:16:48.580
really dug deep here, uncovered Grok 4's incredible

00:16:48.580 --> 00:16:51.779
strengths that logic, the math, the tool use,

00:16:51.960 --> 00:16:55.759
but also seen its limits clearly. The speed issues,

00:16:55.919 --> 00:16:58.879
the cost factor, that struggle with pure creativity.

00:16:59.139 --> 00:17:02.139
It's a powerful, almost paradoxical beast, isn't

00:17:02.139 --> 00:17:04.339
it? It really is. And this whole deep dive looking

00:17:04.339 --> 00:17:06.859
at Grok 4, it points to something bigger, I think.

00:17:06.900 --> 00:17:09.220
A profound shift in how we might approach work,

00:17:09.319 --> 00:17:11.799
especially complex knowledge work. How so? The

00:17:11.799 --> 00:17:14.900
future you see feels increasingly agentic. It's

00:17:14.900 --> 00:17:17.380
becoming less about just how fast you as an individual

00:17:17.380 --> 00:17:19.839
can code or research or write. And more about.

00:17:20.160 --> 00:17:23.180
More about how effectively you can direct and

00:17:23.180 --> 00:17:26.339
orchestrate these powerful AI assistants, maybe

00:17:26.339 --> 00:17:29.660
even a team of them, to achieve complex strategic

00:17:29.660 --> 00:17:32.940
goals. It's like being a conductor rather than

00:17:32.940 --> 00:17:35.480
just playing one instrument. That's a great analogy.

00:17:35.700 --> 00:17:37.759
And you're saying this is becoming more accessible?

00:17:38.160 --> 00:17:40.440
Yeah. The tools are getting easier to use, as

00:17:40.440 --> 00:17:42.200
we saw with Open Router smoothing things out.

00:17:42.259 --> 00:17:44.539
The productivity gains, like that five -minute

00:17:44.539 --> 00:17:47.400
feature build, they're potentially massive and

00:17:47.400 --> 00:17:50.349
very real. And the price, if used strategically,

00:17:50.589 --> 00:17:54.069
can absolutely be justified by the value. So

00:17:54.069 --> 00:17:56.509
wrapping this up, what does this all mean for

00:17:56.509 --> 00:17:59.289
you, the listener, the learner, the developer,

00:17:59.410 --> 00:18:02.509
maybe just the curious mind tuning in? It feels

00:18:02.509 --> 00:18:04.130
like the landscape is shifting under our feet.

00:18:04.210 --> 00:18:06.619
It really does. It suggests that... Understanding

00:18:06.619 --> 00:18:08.920
these tools, learning how to integrate them strategically.

00:18:09.259 --> 00:18:11.380
It's not just about adding a new skill. It could

00:18:11.380 --> 00:18:13.160
fundamentally change your capabilities, maybe

00:18:13.160 --> 00:18:16.079
10x them, like people often say. The choice seems

00:18:16.079 --> 00:18:18.700
to be leaning towards embracing these tools,

00:18:18.859 --> 00:18:21.539
figuring out how to leverage them, or risk getting

00:18:21.539 --> 00:18:24.099
left behind as others do. That might be the stark

00:18:24.099 --> 00:18:27.140
reality, yeah. Which leads to a final, maybe

00:18:27.140 --> 00:18:29.220
provocative thought to leave people with. Go

00:18:29.220 --> 00:18:32.740
on. How might you listening right now? How might

00:18:32.740 --> 00:18:35.339
you start to rethink your own role, your own

00:18:35.339 --> 00:18:38.579
workflow, in a world where AI agents like Grok 4

00:18:38.579 --> 00:18:41.279
can increasingly accomplish very complex tasks

00:18:41.279 --> 00:18:44.240
autonomously? What does that mean for your unique

00:18:44.240 --> 00:18:47.599
human contribution? That's a deep question. Something

00:18:47.599 --> 00:18:49.640
to definitely mull over. We hope this deep dive

00:18:49.640 --> 00:18:51.819
into Grok 4 has given you a solid, well-informed

00:18:51.819 --> 00:18:53.880
starting point for thinking about that. Thanks

00:18:53.880 --> 00:18:55.319
for joining us. [Outro music]
