WEBVTT

00:00:00.000 --> 00:00:04.200
Have you ever poured hours into building an automation

00:00:04.200 --> 00:00:06.799
workflow? You know, seen it work just perfectly

00:00:06.799 --> 00:00:09.740
in your test environment. Yeah, flawless. Only

00:00:09.740 --> 00:00:12.000
for it to just fall apart the moment it hits

00:00:12.000 --> 00:00:13.679
the real world. Happens all the time, like that

00:00:13.679 --> 00:00:16.690
sandcastle analogy you used. Looks great until

00:00:16.690 --> 00:00:19.870
the first wave. Exactly. So today we're diving

00:00:19.870 --> 00:00:23.070
into how to build systems that don't just work,

00:00:23.089 --> 00:00:25.469
but well, systems that survive. Yeah. We're talking

00:00:25.469 --> 00:00:28.250
about taking your automations from looks good

00:00:28.250 --> 00:00:32.289
on my laptop to bulletproof in the wild. We'll

00:00:32.289 --> 00:00:35.009
unpack five core techniques. Five techniques.

00:00:35.210 --> 00:00:37.469
Yeah. To make your workflows resilient, visible,

00:00:37.570 --> 00:00:39.710
and, you know, truly professional. It's kind

00:00:39.710 --> 00:00:41.689
of like giving your automation a superhero cape.

00:00:42.240 --> 00:00:44.140
Or maybe a suit of armor. A suit of armor. I

00:00:44.140 --> 00:00:46.399
like that. We'll cover everything from centralizing

00:00:46.399 --> 00:00:49.539
error reports to smart retries, even backup AI

00:00:49.539 --> 00:00:53.560
models. Think of it as a playbook for peace of

00:00:53.560 --> 00:00:56.719
mind, transforming those workflows from ticking

00:00:56.719 --> 00:01:00.320
time bombs into robust, production-ready powerhouses.

00:01:00.560 --> 00:01:03.399
That's the goal. No more midnight alerts, hopefully.

00:01:03.719 --> 00:01:07.900
So the core challenge, I feel almost philosophical,

00:01:08.079 --> 00:01:09.780
right? Yeah. Building something in a sterile

00:01:09.780 --> 00:01:12.969
test lab is, well, easy compared to the real

00:01:12.969 --> 00:01:15.209
world. It is. Yeah. Because the real world is

00:01:15.209 --> 00:01:18.950
messy. Users do weird things. APIs go down. Servers

00:01:18.950 --> 00:01:22.329
just have bad days. So what's the biggest hurdle

00:01:22.329 --> 00:01:26.700
when moving from that lab to production? It's

00:01:26.700 --> 00:01:29.280
realizing that production ready doesn't just

00:01:29.280 --> 00:01:32.040
mean it worked when I tested it. Not even close.

00:01:32.200 --> 00:01:34.540
Right. It means building an anti-fragile system.

00:01:34.819 --> 00:01:36.680
Anti-fragile. Okay. What does that mean exactly?

00:01:36.859 --> 00:01:39.540
Different from just robust. Yeah. Subtly different,

00:01:39.659 --> 00:01:42.659
but important. Robust resists stress, stays the

00:01:42.659 --> 00:01:45.280
same. Think a strong wall. Anti-fragile actually

00:01:45.280 --> 00:01:47.780
gets better from stress, from errors, like our

00:01:47.780 --> 00:01:50.840
immune system. It learns, adapts. Okay. So it

00:01:50.840 --> 00:01:53.219
involves what? Handling failures gracefully.

00:01:53.280 --> 00:01:55.799
Gracefully, yeah. Without total collapse, you

00:01:55.799 --> 00:01:57.879
need instant notification when something important

00:01:57.879 --> 00:02:00.459
breaks. Okay. You need intelligent logging, so

00:02:00.459 --> 00:02:04.019
debugging isn't guesswork. And crucially, a built-

00:02:04.019 --> 00:02:07.700
in plan B, retry logic, fallback logic. Plan

00:02:07.700 --> 00:02:10.900
B. Always need a plan B. Absolutely. And it needs

00:02:10.900 --> 00:02:14.960
to fail safely. No bad email sent, no critical

00:02:14.960 --> 00:02:18.240
data deleted by accident because, look, failures

00:02:18.240 --> 00:02:20.419
are inevitable. Right. You can't stop every single

00:02:20.419 --> 00:02:23.419
one. Exactly. The job isn't to prevent all failures.

00:02:23.719 --> 00:02:26.460
It's to build systems that fail intelligently.

00:02:26.599 --> 00:02:29.280
That's the shift. So the core shift in thinking

00:02:29.280 --> 00:02:32.500
when aiming for production ready. Yeah. What

00:02:32.500 --> 00:02:35.750
is it? It's about expecting failures and building

00:02:35.750 --> 00:02:38.490
systems that adapt rather than break. Expecting

00:02:38.490 --> 00:02:41.050
failures, building systems that adapt. Got it.

00:02:41.310 --> 00:02:43.990
And to do that, we're using this onion or suit

00:02:43.990 --> 00:02:46.569
of armor idea, these five techniques, as layers.

00:02:46.729 --> 00:02:48.990
Yep, layers of protection. We've got error workflows,

00:02:49.349 --> 00:02:51.590
retry on failure, the fallback LLM, continue

00:02:51.590 --> 00:02:54.090
on error, and polling. Okay, let's peel back

00:02:54.090 --> 00:02:57.050
that first layer then, or maybe buckle on the

00:02:57.050 --> 00:02:59.250
first piece of armor. Error workflows. You said

00:02:59.250 --> 00:03:02.120
this is fundamental, non-negotiable. Absolutely

00:03:02.120 --> 00:03:04.139
foundational because the big problem here is

00:03:04.139 --> 00:03:06.340
what we call the silent killer. The silent killer

00:03:06.340 --> 00:03:09.500
sounds ominous. It is. A standard workflow often

00:03:09.500 --> 00:03:13.360
fails quietly. Imagine an automation processing

00:03:13.360 --> 00:03:16.680
new leads every night. If an API changes or a

00:03:16.680 --> 00:03:21.080
credential expires, poof. It could silently drop

00:03:21.080 --> 00:03:23.580
leads for days, weeks even. You don't know until

00:03:23.580 --> 00:03:26.300
sales complains. Yeah. That's bad. Very bad.

00:03:26.479 --> 00:03:28.719
So the solution? A centralized mission control

00:03:28.719 --> 00:03:31.560
for errors. Think of it like a single security

00:03:31.560 --> 00:03:34.599
desk for your whole n8n operation or whatever

00:03:34.599 --> 00:03:37.900
tool you use. A central hub. Exactly. All error

00:03:37.900 --> 00:03:40.639
signals pipe back to this one place. How do you

00:03:40.639 --> 00:03:42.560
build that mission control? Is it complicated?

00:03:42.960 --> 00:03:45.780
Surprisingly simple, really. Two steps. Step

00:03:45.780 --> 00:03:48.780
one, create the emergency response team workflow.

00:03:49.060 --> 00:03:51.479
Just use an Error Trigger node. Its only job

00:03:51.479 --> 00:03:53.919
is to listen for errors. Okay. Listening for

00:03:53.919 --> 00:03:56.840
trouble. Step two, connect the red phone. In

00:03:56.840 --> 00:03:58.840
every single one of your active workflows, you

00:03:58.840 --> 00:04:01.099
go into settings and point its error output to

00:04:01.099 --> 00:04:03.180
that new error workflow. Like installing an emergency

00:04:03.180 --> 00:04:05.759
line everywhere. Precisely. Step three, design

00:04:05.759 --> 00:04:08.139
the alert and log protocol. The error workflow

00:04:08.139 --> 00:04:10.639
grabs crucial data: workflow name, error message,

00:04:10.819 --> 00:04:13.219
which step failed, maybe even the input data.

00:04:13.360 --> 00:04:15.240
All the context. Right. Logs it somewhere central,

00:04:15.539 --> 00:04:18.959
Google Sheet, Airtable, a database, and sends

00:04:18.959 --> 00:04:22.310
smart notifications. Slack, email, whatever works

00:04:22.310 --> 00:04:24.850
for your team. Okay, I can see that. We had an

00:04:24.850 --> 00:04:27.350
issue once. An Airtable credential failed silently.

00:04:27.589 --> 00:04:29.990
Took ages to notice from weird reports. How would

00:04:29.990 --> 00:04:31.990
this have helped? Instantly. Instead of silent

00:04:31.990 --> 00:04:34.110
failure, you'd get a Slack message like: an

00:04:34.110 --> 00:04:38.069
n8n workflow error. Workflow: Telegram AI

00:04:38.069 --> 00:04:42.050
Assistant. Failing node: AI Agent. Error: node

00:04:42.050 --> 00:04:44.810
operation error. With a link straight to the

00:04:44.810 --> 00:04:47.870
failed run. Wow. Okay. Actionable. Immediate.

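NOTE
[Editor's note] A minimal sketch of the alert-formatting step, written the way it might look inside an n8n Code node (TypeScript for readability; the Code node itself runs JavaScript). The payload field names are assumptions based on the shape the Error Trigger commonly emits; check your own trigger's output and swap the Slack delivery for whatever notification node you use.
// Provided by n8n inside a Code node; declared here so the sketch stands alone.
declare const $input: { first(): { json: any } };
const data = $input.first().json;
const workflowName = data.workflow?.name ?? "unknown workflow";
const failedNode = data.execution?.lastNodeExecuted ?? "unknown node";
const message = data.execution?.error?.message ?? "no error message";
const runUrl = data.execution?.url ?? ""; // link straight to the failed run
// One readable alert string for the Slack/email node that follows.
const alertText =
  `n8n workflow error\nWorkflow: ${workflowName}\n` +
  `Failing node: ${failedNode}\nError: ${message}\n${runUrl}`;
// The Code node would then hand this off: return [{ json: { alertText } }];
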
00:04:48.240 --> 00:04:50.100
Totally changes the game from detective work

00:04:50.100 --> 00:04:52.839
weeks later to a fix in minutes. And you mentioned

00:04:52.839 --> 00:04:55.480
a pro upgrade, tiered alerting. Yeah, because

00:04:55.480 --> 00:04:57.579
not all errors are DEFCON 1, right? True. A

00:04:57.579 --> 00:05:00.079
payment failure? Big deal, @channel alert

00:05:00.079 --> 00:05:02.860
now. A minor summary task failing. Maybe just

00:05:02.860 --> 00:05:05.259
log it to the sheet. You use a switch node in

00:05:05.259 --> 00:05:07.300
the error workflow to route critical errors to

00:05:07.300 --> 00:05:09.459
high-priority alerts and non-critical ones to

00:05:09.459 --> 00:05:11.759
just logging. Keeps the noise down. Smart. So

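NOTE
[Editor's note] A sketch of the tiered-alerting decision. In n8n you would usually express this with a Switch node; the function below just shows the classification logic, and the keyword list is a made-up example you would tune to your own workflows.
type Severity = "critical" | "log-only";
function classifyError(workflowName: string, errorMessage: string): Severity {
  // Anything touching money, customers, or credentials pages someone now.
  const criticalHints = ["payment", "invoice", "crm", "credential"];
  const haystack = `${workflowName} ${errorMessage}`.toLowerCase();
  return criticalHints.some((hint) => haystack.includes(hint))
    ? "critical" // route to the high-priority alert channel
    : "log-only"; // just append a row to the error sheet
}
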
00:05:11.759 --> 00:05:13.800
what's the biggest danger if you skip this whole

00:05:13.800 --> 00:05:16.459
error workflow step? Losing data silently and

00:05:16.459 --> 00:05:18.160
not knowing your automations are fundamentally

00:05:18.160 --> 00:05:22.480
broken. Losing data silently. Yeah, you definitely

00:05:22.480 --> 00:05:25.610
want to avoid that. Okay, foundational layer

00:05:25.610 --> 00:05:28.910
sorted. Now, technique number two. This one sounds

00:05:28.910 --> 00:05:32.889
almost too simple. The turn it off and on again

00:05:32.889 --> 00:05:37.250
button. Retry on failure. Yeah, it sounds basic,

00:05:37.290 --> 00:05:39.790
but honestly, a huge percentage of failures are

00:05:39.790 --> 00:05:42.930
just temporary. Blips, network glitches, server

00:05:42.930 --> 00:05:44.769
overload for a second. The hiccups. Exactly.

00:05:45.230 --> 00:05:47.149
Amateur workflows just give up. Professional

00:05:47.149 --> 00:05:50.110
ones, they retry. It's often the first and, frankly,

00:05:50.230 --> 00:05:52.129
most effective line of defense against those

00:05:52.129 --> 00:05:54.089
transient things. And setting it up is easy.

00:05:54.250 --> 00:05:57.230
Super easy. Most automation tools, like in AN,

00:05:57.550 --> 00:06:00.350
have it built into almost every node. Just find

00:06:00.350 --> 00:06:02.850
the settings for that node, toggle retry on fail

00:06:02.850 --> 00:06:05.689
to on, and... Okay. Then you set max tries, maybe

00:06:05.689 --> 00:06:07.689
three to five is a good starting point, and...

00:06:07.870 --> 00:06:10.250
a wait time between tries, like five seconds.

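NOTE
[Editor's note] Retry on Fail is a checkbox plus two numbers in n8n, not code, but this is roughly the behavior the setting gives you, sketched as a generic TypeScript helper: a capped number of attempts with a fixed pause between them.
async function retryOnFail<T>(
  task: () => Promise<T>,
  maxTries = 3, // "max tries" in the node settings
  waitMs = 5_000, // "wait between tries"
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxTries; attempt++) {
    try {
      return await task(); // success: pass the result straight through
    } catch (err) {
      lastError = err; // probably a transient blip: pause, then try again
      if (attempt < maxTries) await new Promise((r) => setTimeout(r, waitMs));
    }
  }
  throw lastError; // out of tries: surface the error so the error workflow sees it
}
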
00:06:10.329 --> 00:06:12.350
That's usually it. Two clicks, two numbers. Handles

00:06:12.350 --> 00:06:14.430
most of those temporary issues. The vast majority,

00:06:14.670 --> 00:06:16.350
yeah. But there's a bit of an art to it, depending

00:06:16.350 --> 00:06:19.050
on the task. Ah, strategy. Okay, tell me more.

00:06:19.189 --> 00:06:21.290
Well, for calling external APIs, maybe three

00:06:21.290 --> 00:06:23.189
to five retries with a five-second delay is

00:06:23.189 --> 00:06:25.310
good. Gives their server time to recover. Makes

00:06:25.310 --> 00:06:28.459
sense. For AI models, maybe... two, three retries,

00:06:28.660 --> 00:06:31.319
five-second delay. If it fails three times, it's

00:06:31.319 --> 00:06:33.300
probably a bigger issue than just a blip. Right.

00:06:33.379 --> 00:06:36.379
File operations. Those often fail because the

00:06:36.379 --> 00:06:38.639
file is temporarily locked. So maybe five plus

00:06:38.639 --> 00:06:40.740
retries, but with a very short delay, like one

00:06:40.740 --> 00:06:42.839
or two seconds. Okay. Tailored to the type of

00:06:42.839 --> 00:06:45.259
potential failure. You mentioned an OpenAI hiccup

00:06:45.259 --> 00:06:47.379
example. How does retry actually play out there?

00:06:47.560 --> 00:06:50.139
Right. So imagine OpenAI's server gets slammed

00:06:50.139 --> 00:06:53.560
for like 30 seconds. Your workflow makes a call,

00:06:53.680 --> 00:06:55.680
gets an error. And the amateur workflow just

00:06:55.680 --> 00:06:59.980
stops. Dead. Yep. But yours? With retry on fail

00:06:59.980 --> 00:07:03.319
set to three tries, five second delay, it fails,

00:07:03.360 --> 00:07:05.399
waits five seconds, tries again. Maybe fails

00:07:05.399 --> 00:07:07.500
again, waits five seconds, tries a third time.

00:07:07.620 --> 00:07:09.860
By now, the spike is over, the call goes through.

00:07:10.040 --> 00:07:12.279
And the workflow just continues like nothing

00:07:12.279 --> 00:07:15.019
happened. Exactly. The end user or the overall

00:07:15.019 --> 00:07:17.620
process is completely unaware there was ever

00:07:17.620 --> 00:07:20.100
a problem. It just smoothed itself out. That's

00:07:20.100 --> 00:07:22.000
pretty powerful for such a simple setting. And

00:07:22.000 --> 00:07:24.420
there's an even more advanced version. Exponential

00:07:24.420 --> 00:07:26.899
backoff. Yeah, this is what the big players like

00:07:26.899 --> 00:07:28.959
Google and Amazon use. Instead of waiting the

00:07:28.959 --> 00:07:32.199
same 5 seconds each time, you wait longer. Exponentially

00:07:32.199 --> 00:07:35.139
longer. First retry waits 5 seconds, second waits

00:07:35.139 --> 00:07:38.259
10, third waits 20. Ah, giving the server more

00:07:38.259 --> 00:07:40.620
and more breathing room. Precisely. Especially

00:07:40.620 --> 00:07:43.220
crucial for mission-critical APIs that might

00:07:43.220 --> 00:07:46.879
be under heavy, sustained load. Whoa. I mean,

00:07:46.899 --> 00:07:49.019
imagine the resilience that gives you when you're

00:07:49.019 --> 00:07:52.100
handling millions of requests. It's smart scaling.

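NOTE
[Editor's note] The exponential-backoff variant, as a sketch of the pattern rather than a built-in n8n toggle: the same retry loop, but the wait doubles after each failed attempt (5s, 10s, 20s, ...), giving a struggling server progressively more breathing room.
async function retryWithBackoff<T>(
  task: () => Promise<T>,
  maxTries = 4,
  baseWaitMs = 5_000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxTries; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      const waitMs = baseWaitMs * 2 ** attempt; // 5s -> 10s -> 20s -> ...
      if (attempt < maxTries - 1) await new Promise((r) => setTimeout(r, waitMs));
    }
  }
  throw lastError;
}
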
00:07:52.720 --> 00:07:55.180
So for most external services, how many retries

00:07:55.180 --> 00:07:57.439
are usually enough? Three to five attempts with

00:07:57.439 --> 00:08:00.040
a short delay often solves most transient issues.

00:08:00.379 --> 00:08:02.839
Three to five, short delay, good rule of thumb.

00:08:02.959 --> 00:08:05.920
Yeah. So retries handle the blips. But what if

00:08:05.920 --> 00:08:08.740
it's not a blip? What if, say, OpenAI is just...

00:08:09.470 --> 00:08:12.209
Like really down for an hour. Retries won't help

00:08:12.209 --> 00:08:14.250
that. Correct. That's when retries run out and

00:08:14.250 --> 00:08:16.149
you need the next layer. You need a plan B, a

00:08:16.149 --> 00:08:19.709
real backup. The fallback LLM. Exactly. The AI's

00:08:19.709 --> 00:08:24.089
backup singer analogy. Your main AI, maybe GPT

00:08:24.089 --> 00:08:27.269
-4o mini, is your star. Yeah. But if they suddenly

00:08:27.269 --> 00:08:29.850
lose their voice. The show must go on. The show

00:08:29.850 --> 00:08:33.210
must go on. So you have another capable AI, maybe

00:08:33.210 --> 00:08:36.230
Claude 4 or Google Gemini, waiting in the wings,

00:08:36.429 --> 00:08:39.899
ready to take over automatically. Okay. And setting

00:08:39.899 --> 00:08:43.600
this up, is it in the node settings too? Often,

00:08:43.720 --> 00:08:46.720
yes. In tools like n8n's AI Agent node, you'd

00:08:46.720 --> 00:08:49.659
first make sure retry on fail is on. Then there's

00:08:49.659 --> 00:08:52.620
usually a checkbox like add fallback model. Right.

00:08:52.720 --> 00:08:55.559
You check that and it lets you connect a second

00:08:55.559 --> 00:08:58.700
different AI model. Different is key here. Absolutely

00:08:58.700 --> 00:09:01.820
critical. The golden rule is... diversify providers.

00:09:02.179 --> 00:09:05.039
Why? Because if OpenAI is having a major outage,

00:09:05.059 --> 00:09:07.960
switching to another OpenAI model probably won't

00:09:07.960 --> 00:09:10.480
help. They might share the same underlying problem.

00:09:10.659 --> 00:09:12.980
Ah, same infrastructure. Exactly. It's like having

00:09:12.980 --> 00:09:15.360
a backup generator that runs on the same potentially

00:09:15.360 --> 00:09:17.779
disrupted power grid. Useless. You want your

00:09:17.779 --> 00:09:19.519
backup on a completely different fuel source.

00:09:19.820 --> 00:09:22.559
So primary OpenAI, fallback Anthropic or Google?

00:09:22.970 --> 00:09:24.870
Makes sense. Yeah. Or if you use an aggregator.

00:09:24.929 --> 00:09:26.830
Like OpenRouter. Yeah. Maybe your primary is

00:09:26.830 --> 00:09:29.350
via OpenRouter and your fallback is a direct

00:09:29.350 --> 00:09:32.289
connection to Google Gemini. Bypasses the middleman

00:09:32.289 --> 00:09:35.129
entirely. You mentioned testing this by deliberately

00:09:35.129 --> 00:09:38.090
breaking the primary key. We did. Gave the primary

00:09:38.090 --> 00:09:41.549
AI a bad API key. It failed. Retry kicked in.

00:09:41.649 --> 00:09:44.230
Failed again. Expected. Then, automatically,

00:09:44.509 --> 00:09:46.950
the system switched to the configured fallback

00:09:46.950 --> 00:09:50.509
model, Google Gemini in our test. And Gemini processed

00:09:50.509 --> 00:09:53.070
the request successfully. The workflow completed.

00:09:53.350 --> 00:09:56.029
The end user just saw a slightly longer pause.

00:09:56.110 --> 00:09:59.669
No error message. Seamless failover. That's impressive.

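NOTE
[Editor's note] The failover pattern itself, sketched generically. In n8n this is the fallback-model option on the AI Agent node; here the two callers are placeholder functions standing in for your primary and fallback providers (say, GPT-4o mini and Gemini), not real SDK calls.
type LlmCall = (prompt: string) => Promise<string>;
async function generateWithFallback(
  prompt: string,
  primary: LlmCall, // retries on the primary happen first, as in the earlier sketch
  fallback: LlmCall, // a *different provider*, per the diversify-providers rule
): Promise<string> {
  try {
    return await primary(prompt);
  } catch {
    // Primary is down for real, not just a blip: the backup singer steps in.
    // Remember prompt drift: the fallback may need its own tuned prompt.
    return await fallback(prompt);
  }
}
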
00:09:59.889 --> 00:10:02.649
Yeah. But you also mentioned a challenge. Prompt

00:10:02.649 --> 00:10:05.549
drift. Ah, yeah. It's something I still wrestle

00:10:05.549 --> 00:10:07.730
with, honestly, even with fallbacks. When you

00:10:07.730 --> 00:10:10.350
switch from, say, GPT to Claude, even with the

00:10:10.350 --> 00:10:12.309
exact same prompt. They might interpret it slightly

00:10:12.309 --> 00:10:14.710
differently. Exactly. Different architectures,

00:10:14.710 --> 00:10:17.529
different training data. You can get subtle shifts

00:10:17.529 --> 00:10:20.350
in tone, style, maybe even how it emphasizes

00:10:20.350 --> 00:10:23.289
certain points. So the failover might be seamless

00:10:23.289 --> 00:10:25.519
technically, but you still need to test and

00:10:25.519 --> 00:10:27.840
maybe tweak the prompts for your fallback model

00:10:27.840 --> 00:10:30.240
to ensure the output quality and brand voice

00:10:30.240 --> 00:10:32.759
stay consistent. It's an ongoing tuning process.

00:10:33.139 --> 00:10:35.639
Good point. Consistency matters. So why is it

00:10:35.639 --> 00:10:37.779
so important to use a different AI provider for

00:10:37.779 --> 00:10:40.379
the fallback? To ensure your backup isn't reliant

00:10:40.379 --> 00:10:42.340
on the same potentially failing infrastructure

00:10:42.340 --> 00:10:44.899
as the primary. Different provider, different

00:10:44.899 --> 00:10:49.879
infrastructure. Got it. Okay, next layer. Technique

00:10:49.879 --> 00:10:53.039
number four. Continue on error. You said this

00:10:53.039 --> 00:10:55.559
is a personal favorite for pros? Oh, yeah. Especially

00:10:55.559 --> 00:10:58.320
for batch processing. It's a lifesaver. Think

00:10:58.320 --> 00:11:00.419
of it as another vital layer of armor. Okay,

00:11:00.500 --> 00:11:02.340
so what's the problem it solves? You called it

00:11:02.340 --> 00:11:05.320
the assembly line shutdown. Right. Imagine that

00:11:05.320 --> 00:11:08.059
content factory again, pulling 1,000 leads,

00:11:08.279 --> 00:11:11.519
researching each, adding to CRM. What if lead

00:11:11.519 --> 00:11:14.379
number three has some weird character in its

00:11:14.379 --> 00:11:16.980
data that breaks the research step in a normal

00:11:16.980 --> 00:11:18.879
workflow? The whole thing stops dead at number

00:11:18.879 --> 00:11:21.759
three. Exactly. One bad apple spoils the whole

00:11:21.759 --> 00:11:25.750
batch. 997 perfectly good leads never get processed

00:11:25.750 --> 00:11:28.850
because of one tiny error. Huge waste. Okay,

00:11:28.929 --> 00:11:31.129
that's inefficient. So continue on error prevents

00:11:31.129 --> 00:11:33.690
that. It builds a smart assembly line. Instead

00:11:33.690 --> 00:11:35.850
of shutting down, it intelligently pulls that

00:11:35.850 --> 00:11:38.129
one defective item off the line for inspection,

00:11:38.309 --> 00:11:42.029
while the other 999 keep moving smoothly. How

00:11:42.029 --> 00:11:43.730
does that work in the tool? Are there different

00:11:43.730 --> 00:11:46.529
modes? Typically, yes. In most node settings,

00:11:46.730 --> 00:11:48.830
you'll see error handling options. One might

00:11:48.830 --> 00:11:51.769
be just... Continue, which basically ignores

00:11:51.769 --> 00:11:54.389
the error and moves on. For low stakes stuff,

00:11:54.549 --> 00:11:57.919
maybe. But not ideal. No. The professional option

00:11:57.919 --> 00:12:00.240
is usually called something like continue using

00:12:00.240 --> 00:12:03.259
error output or similar. This is the game changer.

00:12:03.379 --> 00:12:05.639
Oh, so? It doesn't just ignore the error. It

00:12:05.639 --> 00:12:07.620
creates two separate paths out of that node.

00:12:07.740 --> 00:12:10.519
A success path for items that worked. The green

00:12:10.519 --> 00:12:13.139
lane. And an error path for the specific item

00:12:13.139 --> 00:12:16.059
that failed. The red lane. Exactly. It isolates

00:12:16.059 --> 00:12:17.759
the problem child without stopping everything

00:12:17.759 --> 00:12:20.919
else. You tested this with Google, Meta, and

00:12:20.919 --> 00:12:24.409
a deliberately broken NVIDIA entry. Yep. Put

00:12:24.409 --> 00:12:26.470
quotes around NVIDIA to make it invalid JSON.

00:12:26.710 --> 00:12:29.110
Without continue on error, it would process Google,

00:12:29.110 --> 00:12:31.809
Meta, then crash on NVIDIA. Stop. Right. With

00:12:31.809 --> 00:12:35.129
continue using error output enabled. Google and

00:12:35.129 --> 00:12:37.269
Meta processed fine, went down the success path.

00:12:37.389 --> 00:12:39.710
NVIDIA hit the error. And instead of stopping.

00:12:39.850 --> 00:12:42.070
It got routed cleanly down the error path. The

00:12:42.070 --> 00:12:44.049
workflow itself kept running for any subsequent

00:12:44.049 --> 00:12:48.350
items. 99.9% success, 0.1% isolated for

00:12:48.350 --> 00:12:51.169
review. That's incredibly useful for large data

00:12:51.169 --> 00:12:55.029
sets. And the pro upgrade here. Self-correction.

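NOTE
[Editor's note] The two-lane routing from the NVIDIA test, sketched generically. n8n's "continue using error output" does this at the node level; processBatch below shows the equivalent logic, with processItem standing in for whatever the node does per item.
interface Routed<T, R> {
  succeeded: R[];
  failed: { item: T; error: string }[];
}
async function processBatch<T, R>(
  items: T[],
  processItem: (item: T) => Promise<R>,
): Promise<Routed<T, R>> {
  const succeeded: R[] = [];
  const failed: { item: T; error: string }[] = [];
  for (const item of items) {
    try {
      succeeded.push(await processItem(item)); // green lane: keep moving
    } catch (err) {
      failed.push({ item, error: String(err) }); // red lane: isolate for review
    }
  }
  return { succeeded, failed }; // one bad apple no longer stops the line
}
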
00:12:55.230 --> 00:12:57.490
Yeah, this is where it gets really cool. That

00:12:57.490 --> 00:12:59.870
error path doesn't just have to go to a log or

00:12:59.870 --> 00:13:01.850
a notification. It can trigger more automation.

00:13:02.210 --> 00:13:04.330
Exactly. It can lead to its own mini workflow.

00:13:04.649 --> 00:13:06.950
Maybe it tries sending the failed NVIDIA item

00:13:06.950 --> 00:13:09.009
to a different AI model with a simpler prompt

00:13:09.009 --> 00:13:11.750
or uses a different lookup tool. Trying to fix

00:13:11.750 --> 00:13:13.669
the problem automatically. Right. And if that

00:13:13.669 --> 00:13:16.610
fix works, the result can then be merged back

00:13:16.610 --> 00:13:19.509
into the main success path downstream. That's

00:13:19.509 --> 00:13:23.190
peak self-healing. Wow. Okay. So how does...

00:13:23.450 --> 00:13:26.769
continue on error help most with large data sets?

00:13:26.970 --> 00:13:29.529
It processes good data efficiently while isolating

00:13:29.529 --> 00:13:31.970
problematic items for separate handling. Isolates

00:13:31.970 --> 00:13:34.149
the problems. Very smart. Okay. Okay, final technique.

00:13:34.309 --> 00:13:39.330
Layer five. Polling. Sounds patient. Oh, it is.

00:13:39.490 --> 00:13:42.009
This one's key for asynchronous tasks. Stuff

00:13:42.009 --> 00:13:43.490
where you ask for something and the answer is

00:13:43.490 --> 00:13:45.690
an instant, like generating a complex report

00:13:45.690 --> 00:13:48.669
or a big AI image. Right, things that take time.

00:13:48.789 --> 00:13:50.629
What's the problem without polling? The agony

00:13:50.629 --> 00:13:52.850
of guess and wait. You kick off a job, then what?

00:13:52.850 --> 00:13:55.429
Do you add a wait node for five minutes, ten

00:13:55.429 --> 00:13:57.210
minutes? You're just guessing how long it'll

00:13:57.210 --> 00:13:59.649
take. Exactly. Guess too short. Your workflow

00:13:59.649 --> 00:14:02.350
tries to grab the result before it's ready. Failure.

00:14:02.350 --> 00:14:03.850
Guess too long. You're just sitting there wasting

00:14:03.850 --> 00:14:06.750
time and resources. It's fragile. So polling

00:14:06.750 --> 00:14:09.750
is the fix. The pizza tracker analogy. Perfect

00:14:09.750 --> 00:14:12.129
analogy. You order a pizza. You don't just stare

00:14:12.129 --> 00:14:14.289
at the door guessing when it arrives. You check

00:14:14.289 --> 00:14:18.090
the tracker app. Making, baking, out for delivery.

00:14:18.769 --> 00:14:20.610
Polling is that tracker for your automation.

00:14:20.950 --> 00:14:23.549
It asks the service, are you done yet? Are you

00:14:23.549 --> 00:14:26.149
done yet? And only moves on when the answer is

00:14:26.149 --> 00:14:29.149
yes. So how does that look in practice, say for

00:14:29.149 --> 00:14:32.789
AI image generation? Typically a few steps. Step

00:14:32.789 --> 00:14:37.570
one, initial request. You send the POST request

00:14:37.570 --> 00:14:41.070
to start the image job. The service replies with

00:14:41.070 --> 00:14:44.230
like a task ID and a status: queued. Okay. Order

00:14:44.230 --> 00:14:47.509
placed. Step two, initial wait. Don't poll immediately.

00:14:47.710 --> 00:14:49.889
Give it a reasonable time to start. Maybe a wait

00:14:49.889 --> 00:14:52.649
node for 40 seconds. Let the chefs start working.

00:14:52.889 --> 00:14:55.809
Step three, the status check loop. This is the

00:14:55.809 --> 00:14:58.870
core. It's usually a loop containing an HTTP

00:14:58.870 --> 00:15:01.490
request node to get the status using the task

00:15:01.490 --> 00:15:04.269
ID, an IF node to check if the status is still

00:15:04.269 --> 00:15:07.470
processing or if it's completed, and another

00:15:07.470 --> 00:15:10.330
wait node, maybe 20 seconds before checking again.

00:15:10.490 --> 00:15:12.759
So it keeps checking every 20 seconds. Yep. Get

00:15:12.759 --> 00:15:15.320
status. Is it completed? No. Wait 20 seconds, get

00:15:15.320 --> 00:15:17.820
status again. It repeats until the IF node

00:15:17.820 --> 00:15:20.039
sees completed, then the loop breaks, and the

00:15:20.039 --> 00:15:22.059
workflow continues with the finished image data.

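NOTE
[Editor's note] The whole polling loop in one sketch, folding in the golden rules that come next (initial wait, a sensible interval, a hard attempt cap). checkStatus is a placeholder for your "get status by task ID" HTTP call, and the status strings are assumptions; every API has its own vocabulary.
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));
async function pollUntilDone(
  checkStatus: (taskId: string) => Promise<{ status: string; result?: unknown }>,
  taskId: string,
): Promise<unknown> {
  await sleep(40_000); // initial wait: give the job time to actually start
  for (let attempt = 1; attempt <= 10; attempt++) { // escape hatch: max 10 checks
    const { status, result } = await checkStatus(taskId);
    if (status === "completed") return result; // done: break out, carry on
    if (status === "failed") throw new Error(`task ${taskId} reported failure`);
    await sleep(20_000); // polite 20-second gap before asking again
  }
  throw new Error(`task ${taskId} never completed; send this down the error path`);
}
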
00:15:22.299 --> 00:15:25.080
Clever. Are there best practices for this? Golden

00:15:25.080 --> 00:15:27.620
rules. Four main ones. One, set a reasonable

00:15:27.620 --> 00:15:29.559
initial wait. Don't hammer the API immediately.

00:15:29.980 --> 00:15:32.620
Okay. Two, use sensible check intervals. 15,

00:15:32.679 --> 00:15:35.259
30 seconds is often good. Don't check every second

00:15:35.259 --> 00:15:38.120
unless the API docs say to. Be polite to the

00:15:38.120 --> 00:15:41.779
server. Exactly. Three, always, always have a

00:15:41.779 --> 00:15:44.379
maximum retry limit on your loop. An escape hatch.

00:15:44.519 --> 00:15:46.179
What if the service breaks and never reports

00:15:46.179 --> 00:15:48.679
completed? You need the loop to stop eventually,

00:15:48.940 --> 00:15:51.440
maybe after 10 tries, and go down an error path.

00:15:51.740 --> 00:15:53.960
Prevents infinite loops. Crucial escape hatch.

00:15:54.000 --> 00:15:57.429
Got it. And four, understand the API's status

00:15:57.429 --> 00:16:00.889
vocabulary. Read the docs. Does it say processing,

00:16:00.889 --> 00:16:03.789
running, pending? Does it say completed, succeeded,

00:16:03.789 --> 00:16:06.649
done? You need to know the exact words to check

00:16:06.649 --> 00:16:09.490
for. Read the manual. Okay. And there's an alternative

00:16:09.490 --> 00:16:13.350
to polling: webhooks. Yeah, the more modern, often

00:16:13.350 --> 00:16:16.629
more efficient way. Webhook callbacks. How's that

00:16:16.629 --> 00:16:19.350
different? With polling, your workflow keeps asking,

00:16:19.350 --> 00:16:23.220
are you done? With a webhook callback, when you

00:16:23.220 --> 00:16:25.740
make the initial request, you give the service

00:16:25.740 --> 00:16:29.500
a unique URL, your n8n webhook URL. You basically

00:16:29.500 --> 00:16:31.559
say, call me back at this address when you're

00:16:31.559 --> 00:16:33.879
finished. So the service calls you. Yep. Your

00:16:33.879 --> 00:16:36.240
workflow then just sits at a wait for webhook

00:16:36.240 --> 00:16:38.519
node doing nothing until the external service

00:16:38.519 --> 00:16:40.580
sends the completed result back to that URL.

00:16:40.799 --> 00:16:43.480
No loop, no constant checking, much cleaner if

00:16:43.480 --> 00:16:45.899
the service supports it. More efficient. So what's

00:16:45.899 --> 00:16:48.519
the main benefit of polling over just... guessing

00:16:48.519 --> 00:16:51.340
wait times? It ensures you proceed only when data

00:16:51.340 --> 00:16:54.440
is truly ready, avoiding premature failures. Ready

00:16:54.440 --> 00:16:57.139
and waiting. Makes sense. [inaudible] Okay, so we've

00:16:57.139 --> 00:17:00.080
covered these

00:17:00.080 --> 00:17:02.500
five specific techniques, these layers of armor,

00:17:02.500 --> 00:17:04.759
but you mentioned there's a broader approach

00:17:04.759 --> 00:17:07.990
to it, the guardrail mindset. Yeah, it's kind of the

00:17:07.990 --> 00:17:09.849
philosophy that ties it all together. Because

00:17:09.849 --> 00:17:12.470
fundamentally, you don't know what you don't

00:17:12.470 --> 00:17:14.950
know. Meaning? Production environments are chaotic.

00:17:15.009 --> 00:17:18.109
You'll encounter weird data, unexpected API responses,

00:17:18.670 --> 00:17:20.730
edge cases you never dreamed of during testing.

00:17:20.890 --> 00:17:22.829
You can't predict everything. So the mindset

00:17:22.829 --> 00:17:25.430
is about? Being proactive and learning from the

00:17:25.430 --> 00:17:28.009
chaos. It's a three-step process, really. First,

00:17:28.130 --> 00:17:31.490
log everything. Use that error workflow. Maybe

00:17:31.490 --> 00:17:34.109
add more logging. Capture every error, every

00:17:34.109 --> 00:17:37.609
weird pattern. Second, identify patterns. Don't

00:17:37.609 --> 00:17:39.670
just let logs pile up. Review them regularly,

00:17:39.970 --> 00:17:43.490
maybe weekly. Look for common issues. Is a certain

00:17:43.490 --> 00:17:45.609
type of input always causing trouble? Is one

00:17:45.609 --> 00:17:48.390
specific third-party API flaky? Find the recurring

00:17:48.390 --> 00:17:51.690
problems. And third, build targeted guardrails

00:17:51.690 --> 00:17:54.170
based on those patterns, create specific fixes.

00:17:54.269 --> 00:17:56.789
Like we saw lots of failures because an AI was

00:17:56.789 --> 00:18:00.289
outputting slightly malformed JSON. So we built

00:18:00.289 --> 00:18:02.960
a guardrail: a little code node right after the

00:18:02.960 --> 00:18:06.160
AI call that specifically sanitizes the JSON

00:18:06.160 --> 00:18:08.839
output, fixing common issues before it gets sent

00:18:08.839 --> 00:18:11.819
to the next API. Ah, a custom fix for a known

00:18:11.819 --> 00:18:14.420
problem pattern. Exactly. That one little guardrail

00:18:14.420 --> 00:18:17.319
turned a workflow that failed maybe 10% of the

00:18:17.319 --> 00:18:20.599
time into one that succeeded 100%. That's the

00:18:20.599 --> 00:18:23.059
guardrail mindset. Learn from failures, build

00:18:23.059 --> 00:18:24.920
smarter defenses. So let's put it all together.

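NOTE
[Editor's note] A sketch of that JSON guardrail as a small code step. The two clean-ups shown (stripping markdown fences, trimming trailing commas) are examples of common LLM output quirks, not the speakers' exact fix; your own error logs tell you which repairs you actually need.
function sanitizeLlmJson(raw: string): unknown {
  let text = raw.trim();
  // LLMs often wrap JSON in ```json ... ``` fences; strip them.
  text = text.replace(/^```(?:json)?\s*/i, "").replace(/```\s*$/, "").trim();
  // Drop trailing commas before } or ], a frequent near-miss.
  text = text.replace(/,\s*([}\]])/g, "$1");
  try {
    return JSON.parse(text); // clean parse: continue down the success path
  } catch (err) {
    // Still malformed: fail loudly so the error workflow catches it.
    throw new Error(`LLM output still not valid JSON: ${err}`);
  }
}
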
00:18:24.980 --> 00:18:27.519
An example, a content research and generation

00:18:27.519 --> 00:18:30.759
system. How would all five layers work here?

00:18:30.880 --> 00:18:33.960
Okay, imagine that system. Layer 1, error workflows.

00:18:34.259 --> 00:18:37.220
All errors go to a central log. Critical failures

00:18:37.220 --> 00:18:40.000
ping Slack immediately. Visibility, check. Layer

00:18:40.000 --> 00:18:43.119
2, retry on failure. Every external API call,

00:18:43.119 --> 00:18:45.660
research tools, AI models, is set to retry three

00:18:45.660 --> 00:18:48.099
times with a 15-second delay. Handles blips,

00:18:48.160 --> 00:18:51.500
check. Layer 3, fallback LLM. The primary AI,

00:18:51.779 --> 00:18:55.059
say GPT-4o mini, has Google Gemini Pro configured

00:18:55.059 --> 00:18:57.160
as its automatic fallback if it fails consistently.

00:18:57.460 --> 00:18:59.880
Plan B for AI, check. Layer four, continue on

00:18:59.880 --> 00:19:02.200
error. If the research step fails for one specific

00:19:02.200 --> 00:19:04.460
topic out of 20. It doesn't stop the other 19.

00:19:04.819 --> 00:19:08.039
Right. That one failed topic gets routed to a

00:19:08.039 --> 00:19:10.700
separate manual review list. The rest continue.

00:19:11.019 --> 00:19:13.980
Isolates problems. Check. Layer five, polling.

00:19:15.259 --> 00:19:18.039
After the AI generates the content for each successful

00:19:18.039 --> 00:19:21.119
topic, which might take a minute, a polling loop

00:19:21.119 --> 00:19:23.740
patiently waits for the completed status before

00:19:23.740 --> 00:19:26.220
saving the content and moving to the next topic.

00:19:26.400 --> 00:19:29.569
Waits intelligently. Check. All five layers working

00:19:29.569 --> 00:19:32.349
together. Yeah. The result isn't just a script

00:19:32.349 --> 00:19:35.470
that runs. It's an anti-fragile system. It handles

00:19:35.470 --> 00:19:39.230
partial failures, gives you visibility, maximizes

00:19:39.230 --> 00:19:40.849
the work that actually gets done successfully.

00:19:41.130 --> 00:19:43.910
That makes a lot of sense. So wrapping up, the

00:19:43.910 --> 00:19:47.609
core big idea really seems to be failures aren't

00:19:47.609 --> 00:19:49.089
just possible. They're going to happen. They

00:19:49.089 --> 00:19:51.559
are inevitable. Yeah. And the professional edge

00:19:51.559 --> 00:19:53.480
isn't building systems that never fail because

00:19:53.480 --> 00:19:55.960
that's impossible. It's building workflows that

00:19:55.960 --> 00:19:58.000
fail intelligently. That's it exactly. Achieving

00:19:58.000 --> 00:20:01.420
resilience, visibility, and what we call graceful

00:20:01.420 --> 00:20:04.079
degradation. Failing partially, but effectively.

00:20:04.460 --> 00:20:06.380
Amateur scripts might look good in tests, but

00:20:06.380 --> 00:20:08.680
break easily under pressure. While professional

00:20:08.680 --> 00:20:10.819
workflows assume failure and handle it gracefully.

00:20:11.019 --> 00:20:12.960
That's the difference. And that leads to actual

00:20:12.960 --> 00:20:15.680
peace of mind. You trust your systems. And the

00:20:15.680 --> 00:20:18.829
implementation. People don't have to do all five

00:20:18.829 --> 00:20:21.990
layers at once, right? No, definitely not. Start

00:20:21.990 --> 00:20:24.529
with the error workflow. That visibility is key.

00:20:24.789 --> 00:20:27.869
Then add retry on failure to most nodes. That's

00:20:27.869 --> 00:20:30.670
usually easy. Okay. Then maybe the fallback LLM

00:20:30.670 --> 00:20:33.470
for your critical AI steps. Then look at continue

00:20:33.470 --> 00:20:35.470
on error and polling where they make sense for

00:20:35.470 --> 00:20:38.470
your specific flows. And always keep that guardrail

00:20:38.470 --> 00:20:41.769
mindset: review logs, build targeted fixes. Start

00:20:41.769 --> 00:20:45.220
simple, layer it up, analyze and adapt. We really

00:20:45.220 --> 00:20:47.460
hope this deep dive gave everyone a clear roadmap

00:20:47.460 --> 00:20:50.579
for making their automations truly robust. What

00:20:50.579 --> 00:20:52.920
stands out to you most about building these resilient

00:20:52.920 --> 00:20:55.779
systems? For me, it's the shift from hoping things

00:20:55.779 --> 00:20:58.059
don't break to knowing you have systems in place

00:20:58.059 --> 00:21:00.339
to handle it when they inevitably do. We encourage

00:21:00.339 --> 00:21:02.839
everyone listening, just pick one technique,

00:21:03.039 --> 00:21:05.960
start there, add that first layer of armor. Even

00:21:05.960 --> 00:21:08.279
that small change can make a huge difference.

00:21:08.880 --> 00:21:11.299
Thank you for joining us on the deep dive. Yeah,

00:21:11.400 --> 00:21:13.259
thanks for listening, and remember, the real

00:21:13.259 --> 00:21:15.599
Pro Edge, it's deploying automations knowing

00:21:15.599 --> 00:21:17.799
they can handle the messy real world, letting

00:21:17.799 --> 00:21:20.460
you kind of forget about them, and most importantly,

00:21:20.680 --> 00:21:23.299
sleep soundly at night. [Outro music]
