WEBVTT

00:00:00.000 --> 00:00:01.700
Ever felt like you're just sort of guessing with

00:00:01.700 --> 00:00:04.700
your AI, you know, tinkering, adjusting things,

00:00:04.820 --> 00:00:06.620
holding your breath, just hoping it's actually

00:00:06.620 --> 00:00:09.759
getting better, hoping it matches what you envisioned?

00:00:10.099 --> 00:00:12.199
What if we could move past that intuition, past

00:00:12.199 --> 00:00:14.140
I think this is better? What if you could say,

00:00:14.199 --> 00:00:16.699
I know this is better, and here's the data, the

00:00:16.699 --> 00:00:19.640
actual proof to back it up? Yeah, ditching the

00:00:19.640 --> 00:00:23.239
gut feelings for cold, hard facts. That's really

00:00:23.239 --> 00:00:25.300
the superpower we're diving into today. It's

00:00:25.300 --> 00:00:27.600
a fundamental shift, actually. Welcome to the

00:00:27.600 --> 00:00:30.500
Deep Dive. Today we're unpacking a really crucial

00:00:30.500 --> 00:00:34.479
topic, how to evaluate your AI workflows, specifically

00:00:34.479 --> 00:00:37.159
within environments like n8n, and moving away

00:00:37.159 --> 00:00:39.700
from those subjective feelings towards objective

00:00:39.700 --> 00:00:42.560
data-driven decisions. This isn't just about

00:00:42.560 --> 00:00:44.780
small tweaks, you know. It's about building systems

00:00:44.780 --> 00:00:47.320
with real precision, a much more reliable approach.

00:00:48.159 --> 00:00:50.340
We're going to explore why traditional testing

00:00:50.340 --> 00:00:52.340
methods often fall short when you're dealing

00:00:52.340 --> 00:00:55.520
with AI's unique complexities. We'll definitely

00:00:55.520 --> 00:00:57.280
get into the critical importance of what we call

00:00:57.280 --> 00:01:00.119
a gold standard data set and walk through some

00:01:00.119 --> 00:01:03.219
really illuminating real-world examples so you

00:01:03.219 --> 00:01:05.519
can see this in action. And then, yeah, give

00:01:05.519 --> 00:01:07.579
you a step-by-step guide to setting up your

00:01:07.579 --> 00:01:09.859
very own evaluation system. Think of this deep

00:01:09.859 --> 00:01:12.799
dive as your personal guide. Your path to becoming

00:01:12.799 --> 00:01:15.540
a workflow wizard. Someone who crafts solutions

00:01:15.540 --> 00:01:18.439
with proven precision. Not just someone who,

00:01:18.560 --> 00:01:21.439
well, tinkers and hopes for the best. It's a

00:01:21.439 --> 00:01:23.299
game changer for how you approach automation.

00:01:23.760 --> 00:01:27.140
And what you can achieve, really. So let's start

00:01:27.140 --> 00:01:29.120
with maybe a stark image, one we're calling the

00:01:29.120 --> 00:01:33.000
medieval doctor problem. It's about why, for

00:01:33.000 --> 00:01:35.340
too long, many of us have been kind of flying

00:01:35.340 --> 00:01:37.719
blind when it comes to optimizing AI. You know

00:01:37.719 --> 00:01:39.420
the drill. You build an AI workflow. It's not

00:01:39.420 --> 00:01:41.200
quite right. You think, OK, if I just change

00:01:41.200 --> 00:01:42.980
this one thing, maybe we'll get there. So you change

00:01:42.980 --> 00:01:44.900
it, you run it, and then you just sort of feel

00:01:44.900 --> 00:01:47.099
if it's better or worse. You repeat this loop

00:01:47.099 --> 00:01:49.219
until you're either totally frustrated or you

00:01:49.219 --> 00:01:51.379
just settle for good enough. It really is like

00:01:51.379 --> 00:01:54.200
that medieval doctor prescribing leeches. Lots

00:01:54.200 --> 00:01:56.980
of confidence, maybe, but absolutely zero data

00:01:56.980 --> 00:01:59.280
backing up the decisions. Exactly. And that's

00:01:59.280 --> 00:02:01.599
such a perfect analogy because, you know, in

00:02:01.599 --> 00:02:04.000
the probabilistic world of AI, your feelings

00:02:04.000 --> 00:02:06.560
can genuinely deceive you. And that guess and check

00:02:06.560 --> 00:02:09.460
approach just leads to wasted time and, honestly,

00:02:09.460 --> 00:02:14.219
unreliable AI outcomes. Decisions

00:02:14.219 --> 00:02:16.240
aren't based on facts. They're based on subjective

00:02:16.240 --> 00:02:18.699
feelings, which, well, we know are notoriously

00:02:18.699 --> 00:02:22.139
unreliable with AI output. Workflow evaluation,

00:02:22.419 --> 00:02:24.460
though, that offers objective proof. It tells

00:02:24.460 --> 00:02:26.300
you exactly what works and what doesn't. It turns

00:02:26.300 --> 00:02:28.479
those educated guesses into properly informed

00:02:28.479 --> 00:02:31.800
decisions. So why does this guess and check approach

00:02:31.800 --> 00:02:36.039
really hinder progress in AI? Well, simply put,

00:02:36.099 --> 00:02:39.259
it leads to wasted time and unreliable AI outcomes

00:02:39.259 --> 00:02:41.699
because decisions are based on subjective feelings.

00:02:42.039 --> 00:02:44.520
Okay, let's unpack this a bit more. The difference

00:02:44.520 --> 00:02:47.099
between testing traditional code and AI is absolutely

00:02:47.099 --> 00:02:50.000
fundamental. Think of it like a glass box versus

00:02:50.000 --> 00:02:53.210
a black box. Traditional code is the glass box.

00:02:53.409 --> 00:02:56.310
It's deterministic. You put 100 identical inputs

00:02:56.310 --> 00:02:58.949
in, you'll get 100 identical outputs every time.

00:02:58.969 --> 00:03:01.430
You can see everything, know exactly how it works.

00:03:01.509 --> 00:03:04.590
It's transparent, predictable. Right, but AI

00:03:04.590 --> 00:03:08.020
models... They're the black box. They're probabilistic.

00:03:08.219 --> 00:03:10.400
Give them 100 identical inputs and you might

00:03:10.400 --> 00:03:12.219
get 100 slightly different, nuanced variations.

00:03:12.800 --> 00:03:15.039
It's kind of like asking 100 different human

00:03:15.039 --> 00:03:18.000
experts the same question, you know. This complexity

00:03:18.000 --> 00:03:20.819
comes from evolving models, that inherent probabilistic

00:03:20.819 --> 00:03:23.520
nature, and the fact that you're often optimizing

00:03:23.520 --> 00:03:26.180
for multiple goals at once, like accuracy, cost,

00:03:26.400 --> 00:03:29.039
speed. Because of all this, you really need a

00:03:29.039 --> 00:03:32.280
full dashboard of key dials. Think of it as performance

00:03:32.280 --> 00:03:34.800
for accuracy, reliability for consistency, efficiency

00:03:34.800 --> 00:03:37.479
for speed and cost, and then quality for that

00:03:37.479 --> 00:03:41.120
sort of subjective goodness. So if AI is this

00:03:41.120 --> 00:03:44.740
opaque black box, why is isolating variables

00:03:44.740 --> 00:03:47.580
so crucial for testing? How do we even figure

00:03:47.580 --> 00:03:51.180
out what's working? The golden rule. It's truly

00:03:51.180 --> 00:03:54.580
the core of scientific testing here. Isolate

00:03:54.580 --> 00:03:58.030
your variables. Think about it. If you tweak

00:03:58.030 --> 00:03:59.990
the prompt and change the model and adjust the

00:03:59.990 --> 00:04:01.509
temperature all at once, then maybe your accuracy

00:04:01.509 --> 00:04:03.610
goes up. Great, right? But you have no idea what

00:04:03.610 --> 00:04:05.110
actually caused that improvement. Was it the

00:04:05.110 --> 00:04:07.610
prompt, the model, some weird combination? Right,

00:04:07.669 --> 00:04:11.050
you're lost. Exactly. Isolating variables precisely

00:04:11.050 --> 00:04:14.090
identifies which specific changes improve or

00:04:14.090 --> 00:04:16.810
harm your workflow's performance. Otherwise,

00:04:16.850 --> 00:04:19.529
you're just guessing again. What's really fascinating

00:04:19.529 --> 00:04:21.550
here, and maybe the absolute foundation of this

00:04:21.550 --> 00:04:23.790
whole process, is your gold standard data set.

00:04:23.889 --> 00:04:26.290
This is your source of truth. And honestly, your

00:04:26.290 --> 00:04:28.209
evaluation is only as good as the data you feed

00:04:28.209 --> 00:04:31.209
it. A brilliant test system with, well, garbage

00:04:31.209 --> 00:04:33.490
data. It just produces garbage results. Simple

00:04:33.490 --> 00:04:35.110
as that. I remember struggling with that, trying

00:04:35.110 --> 00:04:37.589
to convince stakeholders why building out a really

00:04:37.589 --> 00:04:40.610
robust test data set was worth the upfront effort.

00:04:40.709 --> 00:04:43.649
It felt like extra work then. But it truly is

00:04:43.649 --> 00:04:46.509
the bedrock, isn't it? Your evaluation data is

00:04:46.509 --> 00:04:49.029
like your perfect measuring stick. So what makes

00:04:49.029 --> 00:04:52.269
a good gold standard data set? Well, it has to

00:04:52.269 --> 00:04:54.009
be accurate, obviously, with undeniably correct

00:04:54.009 --> 00:04:56.569
answers. Consistent, so no contradictions in

00:04:56.569 --> 00:04:58.949
there. Comprehensive, covering every scenario

00:04:58.949 --> 00:05:01.930
you expect. Representative of real -world usage,

00:05:02.129 --> 00:05:05.509
that's key. And crucially, full of... Edge cases,

00:05:05.709 --> 00:05:08.290
those weird, tricky examples most likely to break

00:05:08.290 --> 00:05:10.209
your system. Oh, those edge cases are critical. I've

00:05:10.209 --> 00:05:12.430
seen workflows handle 99% of inputs perfectly,

00:05:12.569 --> 00:05:14.610
only to completely fall apart on one unusual

00:05:14.610 --> 00:05:17.370
phrasing. Those edge cases? They're gold for

00:05:17.370 --> 00:05:20.430
finding the real weak spots. Exactly. And where

00:05:20.430 --> 00:05:23.569
do you find this data goldmine? Often it's right

00:05:23.569 --> 00:05:25.870
there in your own company's history. Think about

00:05:25.870 --> 00:05:28.310
high-quality support tickets with perfect resolutions

00:05:28.310 --> 00:05:31.850
or expert responses to common questions. Even

00:05:31.850 --> 00:05:34.050
top-performing marketing content could be useful.

00:05:34.170 --> 00:05:37.430
But honestly, the absolute best source for validating

00:05:37.430 --> 00:05:40.550
both your test data and the AI's output quality,

00:05:40.670 --> 00:05:44.540
your subject matter experts. SMEs, the people

00:05:44.540 --> 00:05:47.040
who did the job manually before the AI, they

00:05:47.040 --> 00:05:50.120
know the nuances. That SME connection is truly

00:05:50.120 --> 00:05:52.259
invaluable. But then the question always comes

00:05:52.259 --> 00:05:55.009
up, OK, how much data is enough? For early testing,

00:05:55.089 --> 00:05:57.329
maybe 50 to 100 examples are a decent start.

00:05:57.449 --> 00:05:59.670
Get your feet wet. For production readiness,

00:05:59.850 --> 00:06:03.129
though, you'll probably want 250 to 750 examples

00:06:03.129 --> 00:06:05.709
for statistically significant results. And for

00:06:05.709 --> 00:06:07.750
those really mission critical systems where accuracy

00:06:07.750 --> 00:06:09.889
is just non-negotiable, you're looking at a

00:06:09.889 --> 00:06:12.490
thousand examples, maybe more. Big numbers sometimes.

00:06:12.769 --> 00:06:15.149
Pro tip here. Start collecting this data months

00:06:15.149 --> 00:06:16.930
before you actually think you'll need it. You

00:06:16.930 --> 00:06:19.110
will thank yourself later. Definitely. So what's

00:06:19.110 --> 00:06:21.269
the biggest risk if your gold standard data

00:06:21.949 --> 00:06:24.370
isn't truly gold? Well, flawed data doesn't just

00:06:24.370 --> 00:06:26.790
give you bad results. It gives you confidently

00:06:26.790 --> 00:06:28.990
wrong results. You end up building your whole

00:06:28.990 --> 00:06:31.230
strategy on quicksand, thinking you're making

00:06:31.230 --> 00:06:33.430
progress when you might be optimizing for the

00:06:33.430 --> 00:06:36.029
wrong thing, or maybe making things worse. It leads

00:06:36.029 --> 00:06:38.550
to misleading evaluation results and, ultimately,

00:06:38.550 --> 00:06:42.029
just bad decisions. All right, enough theory for

00:06:42.029 --> 00:06:44.170
a minute. Let's get our hands dirty. The best way

00:06:44.170 --> 00:06:45.930
to understand this power is really to see it

00:06:45.930 --> 00:06:48.589
in action. So let's walk through a couple of real-world

00:06:48.589 --> 00:06:51.290
n8n evaluation scenarios. First up,

00:06:51.370 --> 00:06:54.129
imagine an email tagging agent. Goal is simple.

00:06:54.250 --> 00:06:56.629
Read incoming emails, tag them with a specific

00:06:56.629 --> 00:06:59.810
category and a priority level. Okay, so the setup

00:06:59.810 --> 00:07:01.589
for this experiment involves a test dataset,

00:07:01.910 --> 00:07:04.149
let's say just six emails, each with a known

00:07:04.149 --> 00:07:07.269
correct category and priority. Then we use n8n's

00:07:07.269 --> 00:07:09.290
evaluation nodes, which are like these pre-built

00:07:09.290 --> 00:07:11.569
components designed to compare AI output against

00:07:11.569 --> 00:07:13.790
a correct answer, to run the test and measure

00:07:13.790 --> 00:07:15.970
the results against those known answers. Pretty

00:07:15.970 --> 00:07:18.949
neat. And the first run reality check was, well,

00:07:19.050 --> 00:07:21.220
a bit of a disaster, honestly. Priority accuracy

00:07:21.220 --> 00:07:24.560
was mediocre, maybe 57%. But category accuracy,

00:07:24.819 --> 00:07:29.819
a big fat 0%. Zero. Wow. Now, the diagnosis was

00:07:29.819 --> 00:07:33.160
immediately clear. The AI had no system prompt

00:07:33.160 --> 00:07:35.439
to guide it at all. It was just making up its

00:07:35.439 --> 00:07:38.139
own category names, like billing issue, instead

00:07:38.139 --> 00:07:40.939
of the required category billing. So every single

00:07:40.939 --> 00:07:44.370
category test failed because of this, well, fundamental

00:07:44.370 --> 00:07:47.670
oversight. Right. Makes sense. So a simple fix

00:07:47.670 --> 00:07:50.230
then probably made a huge difference. Monumental.

00:07:50.410 --> 00:07:53.149
A clear system prompt was added to the AI node,

00:07:53.310 --> 00:07:56.009
giving it a constrained list of the exact categories

00:07:56.009 --> 00:07:58.310
it was allowed to choose from. The final results.

00:07:58.629 --> 00:08:01.529
Category accuracy jumped from 0% to a perfect

00:08:01.529 --> 00:08:05.589
100%. That's amazing. 0 to 100. It really highlights

00:08:05.589 --> 00:08:08.269
how the simplest, most systematic fixes discovered

00:08:08.269 --> 00:08:10.329
through evaluation can have the biggest impact.

00:08:10.759 --> 00:08:13.040
It's often not about super complex engineering,

00:08:13.379 --> 00:08:16.029
it's just about clarity and constraints. Totally.
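
For illustration, a constrained tagging prompt along those lines might look like the sketch below; the category and priority names here are invented, not the exact ones from the episode's example:

```js
// Illustrative system prompt: constrain the model to an exact label set
// so its output can be compared against the gold standard verbatim.
const systemPrompt = `You are an email tagging assistant.
Pick exactly ONE category from this list and output it verbatim:
billing, technical, account, sales, other.
Pick exactly ONE priority: low, medium, high.
Respond only with JSON: {"category": "...", "priority": "..."}`;
```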

00:08:16.189 --> 00:08:18.089
So, okay, second example, a bit more complex,

00:08:18.269 --> 00:08:20.709
the FAQ response agent. The goal here is read

00:08:20.709 --> 00:08:22.910
a customer email, find the relevant info in an

00:08:22.910 --> 00:08:25.170
FAQ database, and then craft a helpful natural

00:08:25.170 --> 00:08:27.850
language response. The big challenge here is

00:08:27.850 --> 00:08:30.029
that the output is subjective, right? You can't

00:08:30.029 --> 00:08:32.450
just check if two long paragraphs of text match

00:08:32.450 --> 00:08:35.070
exactly. How do you measure things like helpfulness

00:08:35.070 --> 00:08:37.690
or tone? This is where it gets really interesting.

00:08:38.049 --> 00:08:41.429
How does using a second AI as a judge help with

00:08:41.429 --> 00:08:43.990
those really subjective tasks? Well, the solution

00:08:43.990 --> 00:08:46.440
is using a second AI to act as an impartial

00:08:46.440 --> 00:08:49.860
judge. You feed this evaluator AI the original

00:08:49.860 --> 00:08:52.379
email, the known gold standard answer from your

00:08:52.379 --> 00:08:54.879
test set, and your agent's generated response.
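
As a rough sketch of that judge setup, assuming an OpenAI-style chat API via the official Node.js SDK (the model choice, rubric wording, and score parsing are illustrative, not n8n's built-in implementation):

```js
// Sketch: use a second model as an impartial 1-5 judge.
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function judgeResponse(customerEmail, goldAnswer, agentAnswer) {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini", // keep the judge model fixed across all test runs
    temperature: 0,       // make the judge as deterministic as possible
    messages: [
      {
        role: "system",
        content:
          "You are an impartial evaluator. Compare the agent's response to " +
          "the gold-standard answer. Reply with ONLY an integer from 1 " +
          "(wrong or unhelpful) to 5 (fully correct and helpful).",
      },
      {
        role: "user",
        content:
          `Customer email:\n${customerEmail}\n\n` +
          `Gold-standard answer:\n${goldAnswer}\n\n` +
          `Agent response:\n${agentAnswer}`,
      },
    ],
  });
  return Number(completion.choices[0].message.content.trim());
}
```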

00:08:55.240 --> 00:08:58.200
The evaluator's only job is to provide an objective

00:08:58.200 --> 00:09:01.539
quantitative score, say, on a one to five scale

00:09:01.539 --> 00:09:04.279
for the quality of the agent's response. It provides

00:09:04.279 --> 00:09:07.179
an objective quantitative score for quality where

00:09:07.179 --> 00:09:09.580
exact text matches just aren't possible. It's

00:09:09.580 --> 00:09:11.980
a super useful technique. And this led to a model

00:09:11.980 --> 00:09:14.500
showdown, with some surprising results, didn't

00:09:14.500 --> 00:09:16.720
it? The initial hypothesis was something like

00:09:16.720 --> 00:09:19.200
Google's Flash model will be faster, but the

00:09:19.200 --> 00:09:21.799
more expensive GPT-4o Mini will be more accurate.

00:09:22.179 --> 00:09:24.139
Makes sense on the surface. Yeah, that was the

00:09:24.139 --> 00:09:27.000
guess. But the data delivered a twist. GPT-4o

00:09:27.000 --> 00:09:29.679
Mini scored a pretty mediocre 3.5 out of 5.

00:09:30.000 --> 00:09:32.259
And yeah, it was slower and more expensive. Google

00:09:32.259 --> 00:09:34.419
Flash, on the other hand, delivered a much stronger

00:09:34.419 --> 00:09:37.399
4.3 out of 5. It's like a 23% improvement in

00:09:37.399 --> 00:09:40.340
quality. Plus, it was twice as fast and significantly

00:09:40.340 --> 00:09:42.759
cheaper. So the verdict was clear then. Crystal

00:09:42.759 --> 00:09:46.039
clear. The cold, hard data proved that the cheaper

00:09:46.039 --> 00:09:48.340
and faster alternative was actually the superior

00:09:48.340 --> 00:09:50.980
choice in terms of quality. Without the systematic

00:09:50.980 --> 00:09:53.559
data-driven evaluation, most people would have

00:09:53.559 --> 00:09:55.340
probably just stuck with the more famous, more

00:09:55.340 --> 00:09:57.399
expensive model, assuming it was good enough.

00:09:58.000 --> 00:10:01.039
This shows a real, tangible competitive advantage

00:10:01.039 --> 00:10:03.379
you can get from evaluation.

00:10:03.379 --> 00:10:05.970
So you're probably

00:10:05.970 --> 00:10:07.549
sitting there wondering, okay, how do I actually

00:10:07.549 --> 00:10:09.870
set this up for myself? We've designed what we

00:10:09.870 --> 00:10:12.509
call a final exam system. It's a practical four-step

00:10:12.509 --> 00:10:14.970
guide to building your own AI evaluation

00:10:14.970 --> 00:10:18.110
setup. Step one, design your exam paper. This

00:10:18.110 --> 00:10:19.970
is your test data set. You'll create a simple

00:10:19.970 --> 00:10:22.350
Google Sheet, maybe, with clear columns for your

00:10:22.350 --> 00:10:24.529
input data, like the email body you want to test,

00:10:24.690 --> 00:10:26.950
and the expected correct answer, say, the specific

00:10:26.950 --> 00:10:29.509
category you want the AI to output. Keep it simple.
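
A few illustrative rows of such a test set, shown here as data rather than a sheet (the emails, categories, and priorities are made up):

```js
// Gold-standard test cases: input data plus the expected correct answer.
const testCases = [
  { input: "Hi, I was charged twice this month for my subscription.", expectedCategory: "billing",   expectedPriority: "high" },
  { input: "How do I reset my password?",                             expectedCategory: "account",   expectedPriority: "low"  },
  { input: "The app crashes every time I upload a file.",             expectedCategory: "technical", expectedPriority: "high" },
];
```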

00:10:29.840 --> 00:10:31.960
Right. Then step two is building your testing

00:10:31.960 --> 00:10:35.440
room using n8n's evaluation nodes. The workflow

00:10:35.440 --> 00:10:37.899
is pretty straightforward. An evaluation trigger

00:10:37.899 --> 00:10:40.179
node loads all those test cases from your Google

00:10:40.179 --> 00:10:43.000
Sheet. That data then flows through your AI workflow

00:10:43.000 --> 00:10:45.960
to get processed. Finally, a set of evaluation

00:10:45.960 --> 00:10:49.279
nodes records the AI's actual answer and directly

00:10:49.279 --> 00:10:51.720
compares it to the expected correct answer from

00:10:51.720 --> 00:10:53.519
your sheet. It's basically setting up the proctors

00:10:53.519 --> 00:10:56.179
for your AI student's exam. And for step three,

00:10:56.200 --> 00:10:59.360
you run the exam and analyze the grades. When you execute

00:10:59.360 --> 00:11:02.059
this evaluation workflow, n8n provides a really

00:11:02.059 --> 00:11:04.200
detailed report card. This includes the overall

00:11:04.200 --> 00:11:06.360
accuracy percentage, a breakdown of performance

00:11:06.360 --> 00:11:08.759
on each individual test case, and other key metrics

00:11:08.759 --> 00:11:10.980
like execution time and API costs. It's all right

00:11:10.980 --> 00:11:13.019
there, laid out for you. Actionable information.

00:11:13.320 --> 00:11:15.379
And this brings us to maybe the most critical

00:11:15.379 --> 00:11:18.720
step for long-term improvement. Step four, keeping

00:11:18.720 --> 00:11:21.980
a lab notebook. You know, a good scientist always

00:11:21.980 --> 00:11:24.039
keeps a detailed log. You should create your

00:11:24.039 --> 00:11:26.240
own testing log. Maybe a Google Sheet or Notion

00:11:26.240 --> 00:11:28.200
page works great. For every single test run,

00:11:28.399 --> 00:11:31.200
document what you changed. The prompt, the model,

00:11:31.279 --> 00:11:33.440
the workflow step, whatever. Why you changed

00:11:33.440 --> 00:11:35.139
it, what was your hypothesis, and then the final

00:11:35.139 --> 00:11:39.080
results. The new accuracy, speed, cost. Honestly,

00:11:39.220 --> 00:11:41.679
this documentation is pure gold. I have to admit,

00:11:41.860 --> 00:11:44.259
I still wrestle with prompt drift myself sometimes.

00:11:44.480 --> 00:11:47.080
Or I accidentally change some tiny variable I

00:11:47.080 --> 00:11:49.080
didn't mean to, and then my results are just...

00:11:49.559 --> 00:11:51.879
Chaos. But the lab notebook always brings me

00:11:51.879 --> 00:11:55.019
back. Ah, prompt drift. That's when even a really

00:11:55.019 --> 00:11:57.120
subtle change in your instructions can completely

00:11:57.120 --> 00:12:00.320
alter the AI's behavior in unexpected ways. It

00:12:00.320 --> 00:12:02.379
really highlights just how sensitive these systems

00:12:02.379 --> 00:12:05.039
can be, doesn't it? So beyond just getting the

00:12:05.039 --> 00:12:07.559
results, why is documenting every single change

00:12:07.559 --> 00:12:10.379
so vital? Because it builds institutional knowledge,

00:12:10.580 --> 00:12:12.480
it prevents you from repeating the same errors,

00:12:12.620 --> 00:12:15.179
and it ensures systematic long-term improvement.
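
A sketch of what one lab-notebook entry might capture per run; the fields and values here are just a suggestion:

```js
// One illustrative log entry per test run: what changed, why, and the result.
const testRun = {
  date: "2024-06-12",
  change: "Added constrained category list to the system prompt",
  hypothesis: "Free-form category names are causing exact-match failures",
  model: "gemini-flash", // held constant for this run
  categoryAccuracy: "100%",
  priorityAccuracy: "83%",
  avgLatencyMs: 850,
  costPerRunUsd: 0.002,
};
```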

00:12:15.460 --> 00:12:17.379
Simple as that. Okay, now that we know how to

00:12:17.379 --> 00:12:19.559
set up the system, let's talk about the different

00:12:19.559 --> 00:12:22.039
measurement tools in your toolkit. Think of this

00:12:22.039 --> 00:12:24.580
like a doctor's diagnostic kit again. You wouldn't

00:12:24.580 --> 00:12:26.940
use a stethoscope to analyze a blood sample,

00:12:27.000 --> 00:12:29.899
right? You need the right metric for the right

00:12:29.899 --> 00:12:32.519
job, a specific tool for a specific problem.

00:12:32.659 --> 00:12:35.049
That's an excellent analogy. First up, we have

00:12:35.049 --> 00:12:37.850
categorization metrics. Think of this as your

00:12:37.850 --> 00:12:41.350
stethoscope. This is perfect for tasks that involve

00:12:41.350 --> 00:12:44.029
putting items into predefined buckets like email

00:12:44.029 --> 00:12:46.870
tagging, content classification, maybe sentiment

00:12:46.870 --> 00:12:50.289
analysis. It works as a simple exact match comparison.

00:12:50.809 --> 00:12:53.330
Did the AI pick the right category? Yes or no?
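
The whole metric can be sketched in a few lines, assuming result rows shaped like the earlier sample test set with the AI's answer attached:

```js
// Exact-match categorization accuracy: right bucket or not, nothing fuzzy.
function categorizationAccuracy(results) {
  const correct = results.filter(
    (r) => r.actualCategory === r.expectedCategory
  ).length;
  return (correct / results.length) * 100; // percentage of exact matches
}
```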

00:12:53.840 --> 00:12:56.320
Pretty straightforward. Then there are correctness

00:12:56.320 --> 00:12:58.240
metrics. These are maybe more like a blood test.

00:12:58.379 --> 00:13:00.820
These are perfect for subjective generative tasks,

00:13:00.980 --> 00:13:03.460
like evaluating the quality or factual accuracy

00:13:03.460 --> 00:13:05.440
of a written response. This is where you use

00:13:05.440 --> 00:13:07.860
that AI evaluating AI technique we just talked

00:13:07.860 --> 00:13:11.000
about. The evaluator AI provides an objective

00:13:11.000 --> 00:13:14.220
one to five score, perhaps, for the response's

00:13:14.220 --> 00:13:16.659
correctness and helpfulness. It adds that layer

00:13:16.659 --> 00:13:19.200
of objective judgment to inherently fuzzy tasks.

00:13:20.159 --> 00:13:23.600
Next up. Similarity metrics. Let's call this

00:13:23.600 --> 00:13:27.490
your MRI scan. This provides a deep, nuanced

00:13:27.490 --> 00:13:29.990
comparison. It's perfect for tasks where the

00:13:29.990 --> 00:13:32.769
goal is to match a specific style or tone or

00:13:32.769 --> 00:13:36.250
format. This metric measures how close the AI's

00:13:36.250 --> 00:13:39.169
output is to a known gold standard example from

00:13:39.169 --> 00:13:41.809
your data set. It's all about stylistic adherence.
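
One common way to compute that closeness, sketched below, is cosine similarity over embedding vectors; how you obtain the embeddings (an embeddings API, a local model) is left open here:

```js
// Cosine similarity between two embedding vectors: a value near 1.0 means
// the AI's output sits very close to the gold-standard example.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```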

00:13:42.009 --> 00:13:44.490
And finally, there are custom metrics. The specialist

00:13:44.490 --> 00:13:47.149
test, maybe. These are for any unique business-specific

00:13:47.149 --> 00:13:48.889
requirements that the standard metrics

00:13:48.889 --> 00:13:51.350
just don't quite cover. You define the criteria

00:13:51.350 --> 00:13:54.090
and the scoring system yourself, often by building

00:13:54.090 --> 00:13:56.259
out... specific logic using something like an

00:13:56.259 --> 00:13:59.340
n8n Code node, which is basically a small block

00:13:59.340 --> 00:14:01.720
of custom JavaScript right inside your workflow.
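
For example, a Code node implementing a custom pass/fail metric might look like this sketch; the checks and the field names (response, customScore) are invented for illustration, not a fixed n8n schema:

```js
// n8n Code node sketch ("Run Once for All Items"): score each test case
// against business-specific rules the standard metrics don't cover.
return $input.all().map((item) => {
  const response = item.json.response ?? "";
  const hasDisclaimer = response.includes("This is not legal advice");
  const underLengthCap = response.length <= 1200;
  return {
    json: {
      ...item.json,
      customScore: hasDisclaimer && underLengthCap ? 1 : 0, // pass = 1, fail = 0
    },
  };
});
```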

00:14:02.299 --> 00:14:03.960
So why do we need so many different metrics,

00:14:04.059 --> 00:14:06.779
like a stethoscope versus an MRI scan? It's simply

00:14:06.779 --> 00:14:08.919
because different AI tasks demand different diagnostic

00:14:08.919 --> 00:14:11.600
tools. Categorization is just a binary check,

00:14:11.720 --> 00:14:14.200
yes or no. Correctness is about subjective quality,

00:14:14.419 --> 00:14:17.179
but scored objectively. Similarity looks at stylistic

00:14:17.179 --> 00:14:19.539
nuance. Each one helps you understand a different

00:14:19.539 --> 00:14:21.759
dimension of your AI's performance. It allows

00:14:21.759 --> 00:14:23.919
you to accurately measure and diagnose its strengths

00:14:23.919 --> 00:14:26.480
and weaknesses. Okay, cool. So you now have the

00:14:26.480 --> 00:14:28.379
core framework. Let's move into the professional

00:14:28.379 --> 00:14:31.379
playbook. Some advanced tips, troubleshooting,

00:14:31.440 --> 00:14:33.940
and a clear action plan that will really elevate

00:14:33.940 --> 00:14:36.740
your evaluation game. First, the pro level playbook

00:14:36.740 --> 00:14:39.460
has three golden rules. The consistency principle

00:14:39.460 --> 00:14:42.120
is totally non-negotiable. Keep your evaluation

00:14:42.120 --> 00:14:45.019
model consistent across all tests. If you have

00:14:45.019 --> 00:14:47.919
an AI judging another AI, don't change the judge

00:14:47.919 --> 00:14:50.799
AI midway through your testing. If you do, all

00:14:50.799 --> 00:14:53.779
your comparisons become invalid. Absolutely crucial.

00:14:54.100 --> 00:14:55.980
Then there's the documentation imperative we

00:14:55.980 --> 00:14:58.440
talked about, that lab notebook. n8n shows you

00:14:58.440 --> 00:15:00.240
the results, sure, but not what you changed to

00:15:00.240 --> 00:15:02.559
get them. A simple Google Sheet tracking your

00:15:02.559 --> 00:15:05.320
changes, hypotheses, and results is key to systematic

00:15:05.320 --> 00:15:07.639
improvement. It becomes your institutional knowledge

00:15:07.639 --> 00:15:11.019
base. And finally, the iteration strategy. Start

00:15:11.019 --> 00:15:13.399
small. Don't try to boil the ocean. Maybe 10

00:15:13.399 --> 00:15:15.480
to 20 examples just to validate your setup is

00:15:15.480 --> 00:15:17.960
working. Then scale up to 50 to 100 for more serious

00:15:17.960 --> 00:15:20.600
testing, and maybe 250 or more for production-ready

00:15:20.600 --> 00:15:23.080
systems. It's a gradual, deliberate climb.

00:15:23.340 --> 00:15:25.620
Right. And of course, out in the field, you'll

00:15:25.620 --> 00:15:28.139
run into issues. It happens. So here's a quick

00:15:28.139 --> 00:15:30.559
field guide for troubleshooting some common things.

00:15:30.559 --> 00:15:33.759
If the built-in set metrics node is giving you

00:15:33.759 --> 00:15:36.580
errors or not doing what you need, try creating

00:15:36.580 --> 00:15:39.320
your own custom evaluation agent instead. You

00:15:39.320 --> 00:15:41.899
have that flexibility. If your evaluation results

00:15:41.899 --> 00:15:44.559
seem inconsistent run to run, you're almost certainly

00:15:44.559 --> 00:15:46.879
changing multiple variables at once. Remember

00:15:46.879 --> 00:15:50.440
the rule: focus on one change at a time. And if

00:15:50.440 --> 00:15:52.840
your test data doesn't seem to reflect real-world

00:15:52.840 --> 00:15:55.440
results, your data set probably isn't representative

00:15:55.440 --> 00:15:57.879
enough. You need to collect data over a longer

00:15:57.879 --> 00:16:00.559
period, maybe include more of those tricky edge

00:16:00.559 --> 00:16:02.580
cases. You know, this isn't just about tweaking

00:16:02.580 --> 00:16:04.120
your workflows to be a little bit better. This is fundamentally

00:16:04.120 --> 00:16:06.480
about transforming your entire approach to AI

00:16:06.480 --> 00:16:09.879
automation. The before state is guesswork, frustration,

00:16:10.139 --> 00:16:13.059
settling for good enough. The after state, it's

00:16:13.059 --> 00:16:14.960
data-driven decisions, continuous improvement,

00:16:15.159 --> 00:16:17.860
and a real, durable competitive advantage: faster

00:16:17.860 --> 00:16:20.659
optimization, better cost control, proven quality

00:16:20.659 --> 00:16:23.200
assurance, you actually know it works. Whoa.

00:16:23.600 --> 00:16:26.320
Imagine scaling this precise evaluation method

00:16:26.320 --> 00:16:29.360
to hundreds of complex workflows across an organization,

00:16:29.679 --> 00:16:32.480
ensuring every single interaction is top tier,

00:16:32.620 --> 00:16:35.080
perfectly aligned with business goals. That's

00:16:35.080 --> 00:16:37.039
a massive competitive advantage right there.

00:16:37.159 --> 00:16:39.340
So let's give you your mission briefing, a seven-step

00:16:39.340 --> 00:16:41.960
action plan to get you started today. One,

00:16:42.139 --> 00:16:45.299
choose your first evaluation target. Pick a workflow

00:16:45.299 --> 00:16:47.639
that's maybe almost good enough, but could be

00:16:47.639 --> 00:16:51.649
better. Two, create a small test data set; 20 to

00:16:51.649 --> 00:16:55.070
50 examples is fine to start. Three, set up the basic

00:16:55.070 --> 00:16:58.169
n8n evaluation nodes to get your baseline measurement.

00:16:58.529 --> 00:17:01.809
Four, run your first test and document that initial

00:17:01.809 --> 00:17:05.150
performance in your lab notebook. Five, make one

00:17:05.150 --> 00:17:08.470
single change. Just one. Six, run the test again

00:17:08.470 --> 00:17:10.910
and compare the results. See what happened. Seven,

00:17:11.410 --> 00:17:14.000
iterate and improve based on that data. And remember,

00:17:14.140 --> 00:17:16.099
you can often find complete workflow templates

00:17:16.099 --> 00:17:19.019
and even test datasets in online automation communities

00:17:19.019 --> 00:17:21.359
to help jumpstart your journey. Don't reinvent

00:17:21.359 --> 00:17:23.440
the wheel. So thinking about all that, what's

00:17:23.440 --> 00:17:25.519
the single most impactful takeaway for someone

00:17:25.519 --> 00:17:27.779
looking to get started right now? Start small

00:17:27.779 --> 00:17:30.700
with one workflow, build a small dataset, and

00:17:30.700 --> 00:17:33.119
really commit to documenting every single change

00:17:33.119 --> 00:17:35.960
you make. That's the foundation. Today, we've

00:17:35.960 --> 00:17:38.480
walked through how to transform your AI workflow

00:17:38.480 --> 00:17:41.359
optimization process, moving it from an art of

00:17:41.359 --> 00:17:43.680
guesswork, really, to a science of data-driven

00:17:43.680 --> 00:17:46.319
decisions. It's a profound shift in mindset.

00:17:46.579 --> 00:17:49.240
Yeah, from recognizing that medieval doctor problem

00:17:49.240 --> 00:17:53.609
to understanding the black box challenge, crafting

00:17:53.609 --> 00:17:56.150
those gold standard data sets and applying precise

00:17:56.150 --> 00:17:59.089
evaluation metrics, you now truly have the tools

00:17:59.089 --> 00:18:01.509
you need. It's all about moving from, I think

00:18:01.509 --> 00:18:03.650
this works better, to I know this works better,

00:18:03.730 --> 00:18:06.849
and here's the data to prove it. That evidence-based

00:18:06.849 --> 00:18:09.109
approach, that conviction based on facts,

00:18:09.289 --> 00:18:11.650
that's the real key. So don't just hope your

00:18:11.650 --> 00:18:14.109
AI works. Start knowing it works. Pick a workflow,

00:18:14.349 --> 00:18:16.390
gather some data, and begin your first evaluation.

00:18:16.809 --> 00:18:18.930
The power is really in your hands now. The difference

00:18:18.930 --> 00:18:21.390
between an amateur and a professional AI automator.

00:18:21.819 --> 00:18:24.500
It often lies squarely in this shift. This is

00:18:24.500 --> 00:18:26.579
how you build AI systems that truly deliver on

00:18:26.579 --> 00:18:29.359
their promises. Systems you can justify. Systems

00:18:29.359 --> 00:18:32.200
that give you a real tangible edge. Yeah, it's

00:18:32.200 --> 00:18:33.960
time to stop reading about it and actually start

00:18:33.960 --> 00:18:37.119
evaluating. Make that leap. We really hope this

00:18:37.119 --> 00:18:39.279
deep dive has given you clarity and conviction.

00:18:39.579 --> 00:18:41.920
Until next time, keep digging, keep learning,

00:18:41.980 --> 00:18:44.220
and keep building smarter. [Outro music]
