WEBVTT

00:00:00.000 --> 00:00:03.359
So you've probably been there. You build a retrieval

00:00:03.359 --> 00:00:06.120
augmented generation system, a RAG system.

00:00:06.379 --> 00:00:08.460
You feed it your documents, hook it up to an

00:00:08.460 --> 00:00:11.859
AI, and you watch it answer questions. It feels,

00:00:11.880 --> 00:00:13.740
well, it feels pretty empowering at first, doesn't

00:00:13.740 --> 00:00:15.320
it? You probably have that moment of pride, maybe

00:00:15.320 --> 00:00:17.940
for a day or two. But then that little worry

00:00:17.940 --> 00:00:20.640
starts to creep in. Can you actually, you know,

00:00:20.640 --> 00:00:23.140
trust what the AI is telling you? Can you prove

00:00:23.140 --> 00:00:24.980
where that information came from? Can you see

00:00:24.980 --> 00:00:28.620
the exact source, a specific document, page number,

00:00:28.719 --> 00:00:31.420
maybe a timestamp? If the answer is no, then, well,

00:00:31.519 --> 00:00:33.259
what you've built isn't really an intelligent system.

00:00:33.420 --> 00:00:36.340
It's more like a fancy magic eight ball. It might

00:00:36.340 --> 00:00:38.719
give you the right answer or it might be confidently

00:00:38.719 --> 00:00:41.179
hallucinating. You just have no way of knowing.

00:00:41.259 --> 00:00:44.240
And in business, flying blind like that, that's

00:00:44.240 --> 00:00:46.799
how you crash. And that is exactly what we're

00:00:46.799 --> 00:00:49.259
here to fix today. Welcome to the deep dive.

00:00:49.659 --> 00:00:52.600
We unpack the core problems holding back AI systems

00:00:52.600 --> 00:00:55.509
that are, you know, genuinely useful. Today,

00:00:55.630 --> 00:00:57.890
we're going deep on a concept that sounds simple,

00:00:57.969 --> 00:01:01.450
but is incredibly powerful. Metadata. It's really

00:01:01.450 --> 00:01:03.969
the key ingredient that turns your RAG system

00:01:03.969 --> 00:01:07.290
from this kind of untrustworthy black box into

00:01:07.290 --> 00:01:10.829
something transparent, auditable, and, well,

00:01:10.969 --> 00:01:14.129
actually useful. Our mission today is pretty

00:01:14.129 --> 00:01:16.989
clear then. We want to move beyond that magic

00:01:16.989 --> 00:01:19.189
eight ball stage. We'll show you how to build

00:01:19.189 --> 00:01:21.370
an AI system that isn't just smart, but one you

00:01:21.370 --> 00:01:24.519
can rely on, one you can trust. Absolutely. We're

00:01:24.519 --> 00:01:26.140
going to start by really digging into why these

00:01:26.140 --> 00:01:28.239
RAG systems sometimes struggle to earn your trust.

00:01:28.359 --> 00:01:30.319
It's a common problem. Then we'll peel back the

00:01:30.319 --> 00:01:33.340
layers on what metadata really is and why it's

00:01:33.340 --> 00:01:35.780
so often overlooked or misunderstood. Next up,

00:01:35.819 --> 00:01:37.939
a practical blueprint. How do you actually build

00:01:37.939 --> 00:01:40.260
a metadata-rich system from the ground up? We'll

00:01:40.260 --> 00:01:41.920
explore some advanced uses too, might surprise

00:01:41.920 --> 00:01:44.480
you. And finally, highlight some common pitfalls.

00:01:44.840 --> 00:01:47.280
Easy traps to fall into. Let's get into it. When

00:01:47.280 --> 00:01:49.019
RAG first appeared, there was this definite

00:01:49.019 --> 00:01:52.480
sense of... possibility, wasn't there? You give

00:01:52.480 --> 00:01:55.120
it your data, ask a question, and poof, an answer

00:01:55.120 --> 00:01:58.900
appears. Magic. But that initial wonder, it pretty

00:01:58.900 --> 00:02:01.159
quickly gives way to a kind of trust crisis.

00:02:01.459 --> 00:02:03.760
Everyone knows AI can write things, make images,

00:02:03.859 --> 00:02:07.560
even code now. But the real hurdle, trust. There's

00:02:07.560 --> 00:02:09.680
just so much AI-generated stuff out there, it's

00:02:09.680 --> 00:02:11.020
getting harder and harder to know what's real

00:02:11.020 --> 00:02:13.650
and what's, well, what's just made up. This really

00:02:13.650 --> 00:02:16.490
is the new frontier. The next generation of valuable

00:02:16.490 --> 00:02:19.389
AI systems, they won't be defined just by creativity

00:02:19.389 --> 00:02:21.590
or raw intelligence. It's going to be trustworthiness.

00:02:21.770 --> 00:02:24.210
Yeah. That brings us right to the aha moment.

00:02:25.439 --> 00:02:27.699
Imagine an advanced RAG agent. Let's say it's

00:02:27.699 --> 00:02:30.099
trained specifically on dozens of YouTube video

00:02:30.099 --> 00:02:32.020
transcripts. You ask it something complex like,

00:02:32.039 --> 00:02:33.639
what's the real difference between a relational

00:02:33.639 --> 00:02:35.699
database and a vector database? It doesn't just

00:02:35.699 --> 00:02:38.159
spit out an answer. It gives you evidence. That's

00:02:38.159 --> 00:02:39.900
the crucial difference, isn't it? Instead of

00:02:39.900 --> 00:02:42.300
just a summary, you get an answer. But with sources,

00:02:42.439 --> 00:02:44.060
you can actually check. It's like your smart

00:02:44.060 --> 00:02:46.159
assistant isn't just telling you something. It's

00:02:46.159 --> 00:02:47.699
handing you the book and saying, look, here's

00:02:47.699 --> 00:02:49.979
the page you need. Exactly. The final output

00:02:49.979 --> 00:02:52.000
from a system like this looks more like this.

00:02:52.460 --> 00:02:55.460
In summary, relational databases store structured

00:02:55.460 --> 00:02:58.639
data in tables. Fixed schemas, complex queries.

00:02:58.860 --> 00:03:01.120
Vector databases store high-dimensional vector

00:03:01.120 --> 00:03:04.419
data. Focus on similarity search. You get the

00:03:04.419 --> 00:03:07.139
idea. But then it adds, I found this information

00:03:07.139 --> 00:03:11.199
in the video. What are vector databases? Pros

00:03:11.199 --> 00:03:14.520
and cons versus relational databases at timestamp

00:03:14.520 --> 00:03:18.460
0:00:37. You can watch the full explanation here

00:03:18.460 --> 00:03:21.110
and then a clickable YouTube link. And that's

00:03:21.110 --> 00:03:24.250
why metadata is so powerful. It genuinely transforms

00:03:24.250 --> 00:03:27.930
your AI from maybe a fun toy into a proper tool.

00:03:28.129 --> 00:03:30.030
Without it, you get an answer. With it, you get

00:03:30.030 --> 00:03:32.069
an answer you can absolutely trust. So what's

00:03:32.069 --> 00:03:34.330
the core issue if we can't verify an AI source?

00:03:34.729 --> 00:03:37.210
Well, without provenance, AI is just a black

00:03:37.210 --> 00:03:41.430
box. Trust completely collapses. OK, let's talk

00:03:41.430 --> 00:03:44.389
metadata. What is it really? Simply put, it's

00:03:44.389 --> 00:03:47.090
just data about data. It's extra info that tells

00:03:47.090 --> 00:03:48.849
you where something came from, what it's about.

00:03:48.990 --> 00:03:51.289
It doesn't change the actual content, but it

00:03:51.289 --> 00:03:53.169
adds that crucial context and understanding.

00:03:53.990 --> 00:03:56.210
And here's a really common mistake people make

00:03:56.210 --> 00:03:59.530
building RAG systems. They focus only on the main

00:03:59.530 --> 00:04:01.849
text chunks, you know, the actual words, and

00:04:01.849 --> 00:04:03.949
they completely forget these critical extra details.

00:04:04.090 --> 00:04:06.210
It's like trying to navigate a huge library.

00:04:06.550 --> 00:04:09.069
But there are no labels on the shelves, no labels

00:04:09.069 --> 00:04:10.729
on the books. You just can't find anything reliably,

00:04:11.050 --> 00:04:13.530
it makes your whole knowledge base way less useful

00:04:13.530 --> 00:04:15.909
than it could be. That's so true. Good metadata,

00:04:16.069 --> 00:04:18.009
though. It looks really specific depending on

00:04:18.009 --> 00:04:19.949
the content. For YouTube videos, you want the

00:04:19.949 --> 00:04:22.550
video title, channel name, upload date, those

00:04:22.550 --> 00:04:25.269
precise timestamp ranges, the video URL, definitely.

00:04:25.430 --> 00:04:28.370
For business docs. Think title, author, department,

00:04:28.629 --> 00:04:30.610
creation date, file type, maybe even version

00:04:30.610 --> 00:04:32.970
number, customer support tickets, ticket ID,

00:04:33.129 --> 00:04:35.269
customer name, issue category, resolution, status,

00:04:35.490 --> 00:04:37.889
the agent involved. It doesn't change how the

00:04:37.889 --> 00:04:40.769
AI reads the text itself. It changes how you

00:04:40.769 --> 00:04:43.370
can use it and crucially trust it. Right. And

00:04:43.370 --> 00:04:45.750
taking this metadata first approach gives your

00:04:45.750 --> 00:04:48.670
RAG system three immediate superpowers right

00:04:48.670 --> 00:04:50.490
off the bat. Oh, yeah. These are the good ones.

00:04:50.529 --> 00:04:54.180
First up, provenance and trust. We call this

00:04:54.180 --> 00:04:56.399
the show your work power. This isn't just about

00:04:56.399 --> 00:04:58.899
checking sources. It's about changing AI from

00:04:58.899 --> 00:05:01.600
some mysterious black box into a transparent

00:05:01.600 --> 00:05:03.379
partner you could actually hold accountable.

00:05:03.819 --> 00:05:06.000
That's non-negotiable for business use, for

00:05:06.000 --> 00:05:08.620
compliance. When your AI gives an answer, you

00:05:08.620 --> 00:05:11.120
can instantly check its source. No more wondering,

00:05:11.180 --> 00:05:13.899
did it just make that up? Provide direct links.

00:05:14.040 --> 00:05:16.060
That builds a level of trust a black box system

00:05:16.060 --> 00:05:19.220
just can't match. Second superpower, organization

00:05:19.220 --> 00:05:22.379
and segmentation. The chunk power. Instead of

00:05:22.379 --> 00:05:24.699
this giant messy data swamp with like a million

00:05:24.699 --> 00:05:27.899
text chunks all jumbled together, metadata brings

00:05:27.899 --> 00:05:30.060
order. It turns that swamp into an organized

00:05:30.060 --> 00:05:32.360
library you can actually navigate. Clear sections

00:05:32.360 --> 00:05:34.839
for departments, document types, time periods.

00:05:35.060 --> 00:05:37.519
And third, precision filtering, our sniper rifle

00:05:37.519 --> 00:05:39.439
power. This is where it gets really cool. Sometimes

00:05:39.439 --> 00:05:41.240
you don't want to search everything, right? Metadata

00:05:41.240 --> 00:05:43.240
filtering lets you be surgical. You can tell your agent:

00:05:43.339 --> 00:05:45.199
Search only through marketing documents created

00:05:45.199 --> 00:05:47.899
last quarter and give me the key takeaways. That

00:05:47.899 --> 00:05:50.500
level of precision. It turns a simple search

00:05:50.500 --> 00:05:53.279
tool into a seriously powerful analytical instrument.

00:05:54.060 --> 00:05:56.620
So why is metadata crucial beyond just finding

00:05:56.620 --> 00:05:59.319
information? It's fundamental. It builds user

00:05:59.319 --> 00:06:02.100
trust and enables that precise organization you

00:06:02.100 --> 00:06:04.860
need. Okay, let's get practical now. Let's unpack

00:06:04.860 --> 00:06:07.720
this. This isn't just theory. It's really a blueprint

00:06:07.720 --> 00:06:10.779
for building a metadata-rich RAG pipeline.

00:06:11.560 --> 00:06:14.240
And it all starts with the basic truth. Good

00:06:14.240 --> 00:06:17.180
answers only come from good, well-organized

00:06:17.180 --> 00:06:20.259
data. Precisely. Step one is what we call smart

00:06:20.259 --> 00:06:22.579
ingestion. It's all about preserving your data's

00:06:22.579 --> 00:06:25.779
DNA, its context. For our YouTube example, the

00:06:25.779 --> 00:06:27.699
process kicks off when you provide the video's

00:06:27.699 --> 00:06:30.379
title and URL. That's the trigger. From there,

00:06:30.420 --> 00:06:32.819
an n8n workflow (n8n is a powerful low-code automation

00:06:32.819 --> 00:06:35.319
platform) just takes over. Right. So phase one

00:06:35.319 --> 00:06:38.060
is gathering the raw evidence, scraping the transcript.

00:06:38.160 --> 00:06:40.740
The workflow uses a tool, maybe Apify, it's

00:06:40.740 --> 00:06:42.699
a web scraping platform, to go to that YouTube

00:06:42.699 --> 00:06:45.079
URL and grab the full transcript. But here's

00:06:45.079 --> 00:06:47.759
the catch. The data you get back. It's not simple

00:06:47.759 --> 00:06:49.360
or clean. It usually comes in hundreds of little

00:06:49.360 --> 00:06:51.519
pieces. Each piece might only have a few words,

00:06:51.560 --> 00:06:54.439
a start time, a duration. It's messy. Yeah. And

00:06:54.439 --> 00:06:55.759
here's where most people make their first big

00:06:55.759 --> 00:06:57.959
mistake. You see this fragmented mess. Yeah.

00:06:58.040 --> 00:07:00.240
And their first instinct is, OK, let's clean

00:07:00.240 --> 00:07:02.439
this up. Combine all these tiny text snippets

00:07:02.439 --> 00:07:05.480
into one long transcript. You can do that easily

00:07:05.480 --> 00:07:08.670
with a bit of code. But the problem. You lose

00:07:08.670 --> 00:07:10.670
all the timestamp information when you do that.

00:07:10.769 --> 00:07:12.529
It's like throwing all your crime scene clues

00:07:12.529 --> 00:07:14.990
into one box without labels. You don't know which

00:07:14.990 --> 00:07:16.790
clue relates to what anymore. Your data just

00:07:16.790 --> 00:07:20.389
became effectively dumb. A professional approach

00:07:20.389 --> 00:07:23.189
builds differently. Instead of destroying that

00:07:23.189 --> 00:07:26.750
context, we preserve it. The goal is to create

00:07:26.750 --> 00:07:29.709
meaningful chunks of text, but keep that timestamp

00:07:29.709 --> 00:07:32.769
data perfectly attached to each one. And this

00:07:32.769 --> 00:07:35.170
is often done with a, well, a clever code node

00:07:35.170 --> 00:07:37.009
that acts kind of like a meticulous forensic

00:07:37.009 --> 00:07:39.569
investigator. Exactly. It groups the evidence.

00:07:39.589 --> 00:07:42.519
It loops through maybe. hundreds of these tiny

00:07:42.519 --> 00:07:44.540
transcript objects, groups them into logical

00:07:44.540 --> 00:07:47.319
chunks, let's say, 20 objects at a time, which

00:07:47.319 --> 00:07:50.259
might be about 40 seconds of video. Then it builds

00:07:50.259 --> 00:07:53.060
the text by combining the words from those 20

00:07:53.060 --> 00:07:55.779
objects into a coherent paragraph. And crucially,

00:07:55.860 --> 00:07:58.819
it bags and tags the evidence. For each new paragraph,

00:07:58.959 --> 00:08:01.480
it tags it with metadata. It grabs the start

00:08:01.480 --> 00:08:03.360
timestamp from the very first object in that

00:08:03.360 --> 00:08:05.459
group, calculates the end timestamp from the

00:08:05.459 --> 00:08:08.360
last one. It packages that text paragraph and

00:08:08.360 --> 00:08:10.480
its start and end timestamps together into a

00:08:10.480 --> 00:08:14.939
single object. So every single piece of information

00:08:14.939 --> 00:08:16.819
you're about to file away in your database is

00:08:16.819 --> 00:08:19.180
now perfectly labeled with its origin. The evidence

00:08:19.180 --> 00:08:21.480
is bagged, tagged, ready for the library. Taking

00:08:21.480 --> 00:08:23.800
that care now really helps your AI give trustworthy

00:08:23.800 --> 00:08:26.199
answers later. So the key is keeping that source

00:08:26.199 --> 00:08:28.500
data connected to the chunks, right? Absolutely.
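
NOTE
The grouping step described above can be sketched roughly like this. It is a minimal illustration under assumptions, not the actual code node from the workflow: the field names text, start, and duration are guesses at what the scraped transcript objects look like, and the group size of 20 is just the figure mentioned in the episode.
```python
from dataclasses import dataclass
@dataclass
class Chunk:
    text: str
    start: float  # seconds from the beginning of the video
    end: float    # seconds from the beginning of the video
def chunk_segments(segments, group_size=20):
    """Group small transcript segments into chunks, keeping start/end times."""
    chunks = []
    for i in range(0, len(segments), group_size):
        group = segments[i:i + group_size]
        # Build the paragraph from the words of every segment in the group.
        text = " ".join(s["text"].strip() for s in group)
        # Start comes from the first segment, end from the last one.
        start = group[0]["start"]
        end = group[-1]["start"] + group[-1]["duration"]
        chunks.append(Chunk(text=text, start=start, end=end))
    return chunks
# Hypothetical input: 45 two-second segments.
segments = [{"text": f"word{i}", "start": i * 2.0, "duration": 2.0} for i in range(45)]
chunks = chunk_segments(segments)
print(len(chunks), chunks[0].start, chunks[0].end, chunks[-1].end)
```
With two-second segments, each full chunk covers about 40 seconds, which matches the rough figure given above. The last chunk is simply shorter.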

00:08:28.779 --> 00:08:31.100
Preserving that context is what prevents creating

00:08:31.100 --> 00:08:33.799
dumb data. Right. Okay, once you've ingested

00:08:33.799 --> 00:08:36.320
the data smartly, the hard part is kind of done.

00:08:36.820 --> 00:08:40.139
Step two is metadata enrichment. Think of it

00:08:40.139 --> 00:08:41.960
as creating the digital card catalog for your

00:08:41.960 --> 00:08:44.399
library, as we store each chunk in our vector

00:08:44.399 --> 00:08:47.059
database. Supabase is great for this, built on

00:08:47.059 --> 00:08:49.340
PostgreSQL. We don't just save the text and its

00:08:49.340 --> 00:08:51.960
vector embedding. We attach that rich set of

00:08:51.960 --> 00:08:54.120
metadata fields we just preserved. It's a fairly

00:08:54.120 --> 00:08:56.039
simple step, just mapping the preserved data

00:08:56.039 --> 00:08:57.940
to the metadata column in the database. Right,

00:08:58.019 --> 00:09:00.379
so the final data object for each chunk looks

00:09:00.379 --> 00:09:03.279
something like, uh, open curly brace, video title.

00:09:15.279 --> 00:09:18.799
And here's a really crucial insight. Metadata

00:09:18.799 --> 00:09:22.279
is the salt, not the steak. This is probably

00:09:22.279 --> 00:09:24.139
one of the most important and commonly misunderstood

00:09:24.139 --> 00:09:27.820
things in building RAG systems. This metadata,

00:09:28.080 --> 00:09:30.559
it has zero effect on the vector calculation

00:09:30.559 --> 00:09:33.460
itself. Think about it like this. The content

00:09:33.460 --> 00:09:36.379
of your chunk. That's the steak. That's the core

00:09:36.379 --> 00:09:39.200
substance. The AI analyzes the steak itself,

00:09:39.419 --> 00:09:41.960
its texture, its quality to decide where it fits

00:09:41.960 --> 00:09:44.779
in the semantic universe. The metadata. That's

00:09:44.779 --> 00:09:46.559
the salt you sprinkle on top after it's cooked.

00:09:47.139 --> 00:09:48.980
Salt doesn't change the steak itself, but it

00:09:48.980 --> 00:09:51.159
enhances the flavor, tells you maybe where it

00:09:51.159 --> 00:09:53.299
came from, who cooked it. When a user searches,

00:09:53.480 --> 00:09:55.500
the AI first finds the most semantically relevant

00:09:55.500 --> 00:09:59.080
text chunk, the best steak. Then it looks to

00:09:59.080 --> 00:10:00.820
the salt, the metadata, to tell you where that steak

00:10:00.820 --> 00:10:03.200
came from. Which brings us neatly to step three,

00:10:03.379 --> 00:10:06.330
smart retrieval and display. This is the show

00:10:06.330 --> 00:10:08.169
your work payoff. This is really the moment of

00:10:08.169 --> 00:10:10.309
truth. The library is built. The books are on

00:10:10.309 --> 00:10:12.809
the shelves. The card catalog is complete. Now

00:10:12.809 --> 00:10:15.169
a user walks up to the front desk and asks our

00:10:15.169 --> 00:10:18.129
AI scholar a tough question. Right. The process

00:10:18.129 --> 00:10:20.370
looks kind of like this. User asks a question,

00:10:20.429 --> 00:10:22.629
maybe, what are the key takeaways about that

00:10:22.629 --> 00:10:26.009
new AI tool? That question gets translated

00:10:26.009 --> 00:10:28.750
into a vector for comparison. The system searches

00:10:28.750 --> 00:10:30.830
the vector database for chunks most similar to

00:10:30.830 --> 00:10:33.450
the question vector. Then often there's an are

00:10:33.450 --> 00:10:36.220
you sure check: re-ranking. It might find, say,

00:10:36.299 --> 00:10:38.259
10 possible answers and then sort them again

00:10:38.259 --> 00:10:40.919
to find the best two or three results. Finally,

00:10:41.019 --> 00:10:43.360
the AI does synthesis and citation. It uses those

00:10:43.360 --> 00:10:44.980
best two, three chunks to write a fresh answer,

00:10:45.080 --> 00:10:47.360
always adding the metadata, like the source information,

00:10:47.639 --> 00:10:51.370
to that final output. And the result is a genuinely

00:10:51.370 --> 00:10:53.870
trustworthy answer. Instead of just a simple

00:10:53.870 --> 00:10:55.710
paragraph you can't verify, the user gets something

00:10:55.710 --> 00:10:58.029
more like this. The key takeaways about the new

00:10:58.029 --> 00:11:01.490
AI tool from OpenAI are the AI features an incredibly

00:11:01.490 --> 00:11:04.789
realistic voice. Sounds exactly like a real human.

00:11:05.090 --> 00:11:07.509
Breathing, whispering, even singing. It can perform

00:11:07.509 --> 00:11:09.970
tasks like singing happy birthday in a very human-like

00:11:09.970 --> 00:11:12.970
way. The realism raises questions about

00:11:12.970 --> 00:11:14.929
how human-like AI should be, getting a little

00:11:14.929 --> 00:11:17.789
bit too human-like almost. And then the citation,

00:11:18.090 --> 00:11:20.809
AI is taking over and it's getting real scary

00:11:20.809 --> 00:11:25.570
this time. Ziver 190 .53, 2 .39, 3 .23. Watch

00:11:25.570 --> 00:11:28.769
your clickable YouTube link. It clearly shows

00:11:28.769 --> 00:11:31.090
the significant progress, but also flags the

00:11:31.090 --> 00:11:33.389
source. This is the end game. This is what it

00:11:33.389 --> 00:11:35.250
looks like when an AI doesn't just give you an

00:11:35.250 --> 00:11:37.429
answer, but gives you the evidence. Your users

00:11:37.429 --> 00:11:39.809
can trust your AI because it always shows its

00:11:39.809 --> 00:11:42.149
proof. It's a system that basically says, hey,

00:11:42.230 --> 00:11:43.610
don't just take my word for it. Here are the

00:11:43.610 --> 00:11:45.919
receipts. So what's the main benefit of actually

00:11:45.919 --> 00:11:48.080
showing the source like that? It builds that

00:11:48.080 --> 00:11:50.860
verifiable trust and really empowers users to

00:11:50.860 --> 00:11:53.440
check for themselves. Mid -roll sponsor read.

00:11:55.039 --> 00:11:57.460
Okay, now let's talk about what might be the

00:11:57.460 --> 00:12:00.179
ultimate power move here. Metadata filtering.

00:12:01.059 --> 00:12:03.600
This technique, I think, truly separates the

00:12:03.600 --> 00:12:05.639
professional grade systems from more amateur

00:12:05.639 --> 00:12:08.139
projects. It's where the real precision, the

00:12:08.139 --> 00:12:10.019
real targeting comes into play. Here's where

00:12:10.019 --> 00:12:12.460
it gets really interesting. Right. So far, we've

00:12:12.460 --> 00:12:14.669
mostly talked about searching. Well, everything.

00:12:14.830 --> 00:12:16.809
It's like going to that huge library and asking

00:12:16.809 --> 00:12:19.450
the librarian for just a book on ancient Rome.

00:12:19.549 --> 00:12:21.029
Yeah. They might bring you back 100 different

00:12:21.029 --> 00:12:22.750
books. You'd have to sift through them all. But

00:12:22.750 --> 00:12:24.590
what if you could be more specific? What if you

00:12:24.590 --> 00:12:26.690
could say, go to the ancient Rome section, but

00:12:26.690 --> 00:12:28.809
only bring me books written by Mary Beard and

00:12:28.809 --> 00:12:31.740
only those published after 2010? That, my friend,

00:12:31.779 --> 00:12:34.179
is metadata filtering. It lets you search just

00:12:34.179 --> 00:12:36.519
the specific slice of your library you actually

00:12:36.519 --> 00:12:39.519
need. And how it works, technically. You build

00:12:39.519 --> 00:12:41.700
a user interface, maybe a simple form in your

00:12:41.700 --> 00:12:44.299
app, that lets the user specify not just their

00:12:44.299 --> 00:12:47.279
question, but also the context. For our YouTube

00:12:47.279 --> 00:12:49.259
example, the interface might have two fields.

00:12:49.419 --> 00:12:51.779
Maybe a drop-down menu to select a specific

00:12:51.779 --> 00:12:54.779
YouTube video title, and a text box for

00:12:54.779 --> 00:12:57.159
their question. When the user hits submit, the

00:12:57.159 --> 00:13:00.929
n8n workflow gets both pieces of info. The command

00:13:00.929 --> 00:13:02.889
it sends to the Supabase database isn't just

00:13:02.889 --> 00:13:05.309
find text similar to this question anymore, it's

00:13:05.309 --> 00:13:07.450
now find text similar to this question where

00:13:07.450 --> 00:13:10.149
the video title in the metadata equals AI is

00:13:10.149 --> 00:13:12.250
taking over and it's getting real scary this

00:13:12.250 --> 00:13:15.059
time. That simple where clause, it completely

00:13:15.059 --> 00:13:17.320
changes the game. Now the agent searches only

00:13:17.320 --> 00:13:19.980
one specific document or video, makes the answer

00:13:19.980 --> 00:13:23.179
much, much more focused. Exactly. So for a user's

00:13:23.179 --> 00:13:25.559
filtered query like, okay, I only want to look

00:13:25.559 --> 00:13:27.620
through the AI is taking over video. Give me

00:13:27.620 --> 00:13:29.980
the three key takeaways from that video specifically.

00:13:30.779 --> 00:13:34.620
The system, using that metadata filter, totally

00:13:34.620 --> 00:13:36.440
ignores all the other videos in the database.
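
NOTE
To make the where-clause idea concrete, here is a toy in-memory version of a filtered similarity search. It is a sketch under assumptions, not how Supabase runs the query: the row layout (text, embedding, metadata) and the two-dimensional vectors are made up for illustration, and a real setup would push the filter into the database query itself.
```python
import math
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
def search(rows, query_vec, metadata_filter=None, top_k=3):
    """Keep rows whose metadata matches every filter pair, then rank by similarity."""
    metadata_filter = metadata_filter or {}
    candidates = [
        r for r in rows
        if all(r["metadata"].get(k) == v for k, v in metadata_filter.items())
    ]
    ranked = sorted(candidates, key=lambda r: cosine(r["embedding"], query_vec), reverse=True)
    return ranked[:top_k]
# Hypothetical rows standing in for the vector table.
rows = [
    {"text": "voice demo", "embedding": [1.0, 0.0], "metadata": {"video_title": "Video A"}},
    {"text": "pricing", "embedding": [0.9, 0.1], "metadata": {"video_title": "Video B"}},
    {"text": "singing", "embedding": [0.6, 0.8], "metadata": {"video_title": "Video A"}},
]
hits = search(rows, [1.0, 0.0], {"video_title": "Video A"})
print([h["text"] for h in hits])
```
Note that the Video B chunk is a close semantic match for the query but never reaches the ranking step, which is exactly the behavior described above.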

00:13:36.759 --> 00:13:39.950
Zero confusion. It responds with insights pulled

00:13:39.950 --> 00:13:42.409
exclusively from that one source, complete with

00:13:42.409 --> 00:13:45.029
specific timestamps for each point. Because it

00:13:45.029 --> 00:13:47.190
knows with 100% certainty the information came

00:13:47.190 --> 00:13:49.809
from that exact video. And then you get the pro-level

00:13:49.809 --> 00:13:52.190
upgrade, multi-filter power searches.

00:13:52.450 --> 00:13:55.409
With metadata filters, your RAG system can search

00:13:55.409 --> 00:13:57.669
with incredible accuracy, almost like a seasoned

00:13:57.669 --> 00:14:00.210
data analyst. Imagine a knowledge base for a

00:14:00.210 --> 00:14:02.690
large company. Metadata for department, document

00:14:02.690 --> 00:14:05.169
type, creation date. A manager could ask

00:14:05.169 --> 00:14:07.240
something like, What was our stated marketing

00:14:07.240 --> 00:14:10.139
budget for the last quarter? Search only in documents

00:14:10.139 --> 00:14:12.659
from the finance department that are of type quarterly

00:14:12.659 --> 00:14:14.799
report and were created in the last 12 months.

00:14:15.100 --> 00:14:18.220
Whoa, imagine scaling that. A billion queries

00:14:18.220 --> 00:14:20.700
like that, that's a level of precision. It turns

00:14:20.700 --> 00:14:23.340
a simple search tool into a really powerful analytical

00:14:23.340 --> 00:14:25.980
instrument. So how does filtering fundamentally

00:14:25.980 --> 00:14:28.940
transform a RAG agent? It lets it perform precise,

00:14:29.549 --> 00:14:32.370
targeted research, almost like a dedicated data

00:14:32.370 --> 00:14:34.570
analyst. Now, while the rest of the world is

00:14:34.570 --> 00:14:36.929
often focused on the flashy generation part of

00:14:36.929 --> 00:14:39.669
AI, the true professionals are thinking about

00:14:39.669 --> 00:14:42.970
the, let's be honest, unglamorous but absolutely

00:14:42.970 --> 00:14:45.929
essential work of maintenance. You have to have

00:14:45.929 --> 00:14:48.309
a system for removing outdated or irrelevant

00:14:48.309 --> 00:14:50.850
content from your vector database. Otherwise,

00:14:50.970 --> 00:14:54.639
you risk what some call AI brain rot. Carolyn.

00:14:54.779 --> 00:14:57.679
Your agent starts giving customers old pricing,

00:14:57.820 --> 00:14:59.480
citing policies you don't even have anymore,

00:14:59.659 --> 00:15:02.139
referencing discontinued features. This can seriously

00:15:02.139 --> 00:15:04.799
hurt your business. It just erodes trust. Completely.

00:15:04.940 --> 00:15:06.779
Totally. And the answer is to build an automated

00:15:06.779 --> 00:15:09.299
system that cleans out that old or bad data.

00:15:09.500 --> 00:15:11.740
And you can do this with something as, honestly,

00:15:11.860 --> 00:15:13.960
simple and brilliant as a Google Sheet acting

00:15:13.960 --> 00:15:16.080
as your control panel. This creates a really

00:15:16.080 --> 00:15:18.399
simple, non-technical interface that anyone

00:15:18.399 --> 00:15:20.320
on your team can use. They don't need to know

00:15:20.320 --> 00:15:23.080
ME or Supabase. The setup's pretty straightforward.

00:15:23.299 --> 00:15:25.480
Every time your RAG pipeline processes a

00:15:25.480 --> 00:15:28.019
new video or document, it adds a new row to this

00:15:28.019 --> 00:15:30.740
Google Sheet, tracking key info like video title,

00:15:30.840 --> 00:15:32.860
video URL, and a status column initially set

00:15:32.860 --> 00:15:36.379
to active. Okay. Now, you build a separate, dedicated

00:15:36.379 --> 00:15:39.379
n8n workflow just for this housekeeping. The

00:15:39.379 --> 00:15:41.600
trigger. It starts with a Google Sheet trigger

00:15:41.600 --> 00:15:43.799
node, set up to watch that status column and

00:15:43.799 --> 00:15:45.659
run instantly whenever a row there gets updated.

00:15:45.860 --> 00:15:48.659
The filter. To start a deletion, a team member

00:15:48.659 --> 00:15:50.440
just changes a video status in the sheet from

00:15:50.440 --> 00:15:53.500
active to remove. That change triggers the workflow.

00:15:53.779 --> 00:15:56.220
A filter node then checks. Is the new status

00:15:56.220 --> 00:15:59.470
actually remove? The deletion. If yes, the workflow

00:15:59.470 --> 00:16:01.870
moves to a Supabase node configured for a delete

00:16:01.870 --> 00:16:04.730
operation. It uses the video URL from that Google

00:16:04.730 --> 00:16:07.590
Sheet row to find and delete all the vector chunks

00:16:07.590 --> 00:16:09.429
in your database that have a matching video URL

00:16:09.429 --> 00:16:12.929
in their metadata. Gone. The confirmation. Once

00:16:12.929 --> 00:16:14.769
that's successful, the final step updates the

00:16:14.769 --> 00:16:16.370
status back in the Google Sheet from remove

00:16:16.370 --> 00:16:19.590
to deleted. Maybe adds a timestamp. Clean loop.
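
NOTE
The trigger, filter, deletion, and confirmation loop described above can be mimicked in plain Python. This is a stand-in under assumptions, not the n8n workflow itself: lists of dictionaries play the role of the Google Sheet and the vector table, and the keys video_url and status are assumed names taken from the description.
```python
def apply_removals(sheet_rows, vector_rows):
    """Drop chunks for every sheet row marked 'remove', then mark those rows 'deleted'."""
    # Filter: which sheet rows were flagged for removal?
    to_remove = {row["video_url"] for row in sheet_rows if row["status"] == "remove"}
    # Deletion: keep only chunks whose metadata URL is not flagged.
    kept = [c for c in vector_rows if c["metadata"].get("video_url") not in to_remove]
    # Confirmation: write the new status back to the control panel.
    for row in sheet_rows:
        if row["status"] == "remove":
            row["status"] = "deleted"
    return kept
# Hypothetical data standing in for the sheet and the vector table.
sheet = [
    {"video_title": "Video A", "video_url": "https://example.com/a", "status": "remove"},
    {"video_title": "Video B", "video_url": "https://example.com/b", "status": "active"},
]
vectors = [
    {"text": "chunk 1", "metadata": {"video_url": "https://example.com/a"}},
    {"text": "chunk 2", "metadata": {"video_url": "https://example.com/a"}},
    {"text": "chunk 3", "metadata": {"video_url": "https://example.com/b"}},
]
vectors = apply_removals(sheet, vectors)
print([c["text"] for c in vectors], [r["status"] for r in sheet])
```
The deletion only works because every chunk carries the source URL in its metadata, which is one more payoff of tagging chunks at ingestion time.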

00:16:19.830 --> 00:16:22.769
This simple automated loop keeps your AI's brain

00:16:22.769 --> 00:16:25.870
clean. It ensures outdated info gets purged from

00:16:25.870 --> 00:16:27.940
its memory. Preventing it from ever giving a

00:16:27.940 --> 00:16:30.740
user a wrong answer based on old data. You know,

00:16:30.740 --> 00:16:32.799
I still wrestle with prompt drift myself sometimes,

00:16:32.879 --> 00:16:34.919
and honestly, the thought of this cleanup can

00:16:34.919 --> 00:16:37.899
feel daunting. But it's just so crucial for maintaining

00:16:37.899 --> 00:16:41.399
accuracy. And for a pro-level upgrade, you could

00:16:41.399 --> 00:16:43.779
consider an automatic expiration date system.

00:16:44.299 --> 00:16:47.200
When you first ingest documents, maybe add an

00:16:47.200 --> 00:16:49.960
extra piece of metadata. A review date, say,

00:16:50.120 --> 00:16:52.779
six months out from the creation date. Then you

00:16:52.779 --> 00:16:55.019
create another separate N8N workflow that just

00:16:55.019 --> 00:16:57.179
runs on a schedule maybe every Monday morning.

00:16:57.399 --> 00:17:00.539
Yeah. And this workflow's only job is to scan

00:17:00.539 --> 00:17:03.120
your Supabase database and find any documents

00:17:03.120 --> 00:17:06.160
where that review date is now in the past. Expired

00:17:06.160 --> 00:17:09.059
content. For every piece it finds, it automatically

00:17:09.059 --> 00:17:11.039
changes its status in your Google Sheet control

00:17:11.039 --> 00:17:13.599
panel to something like needs review. And maybe

00:17:13.599 --> 00:17:15.819
even sends a notification (Slack, email) to the

00:17:15.819 --> 00:17:17.420
content owner. Hey, time to look at this again.
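
NOTE
The expiration check described above might be sketched as follows. The assumptions are loud ones: six months is approximated as 182 days, the review_date and status field names are invented for illustration, and the scheduled workflow and notification step are left out.
```python
from datetime import date, timedelta
REVIEW_AFTER = timedelta(days=182)  # roughly six months; an assumed policy
def review_date(created):
    """Compute the review date to store as metadata at ingestion time."""
    return created + REVIEW_AFTER
def find_expired(docs, today):
    """Return documents whose review date is already in the past."""
    return [d for d in docs if d["review_date"] < today]
# Hypothetical documents with their stored review dates.
docs = [
    {"title": "Old pricing", "status": "active", "review_date": review_date(date(2024, 1, 1))},
    {"title": "New policy", "status": "active", "review_date": review_date(date(2024, 9, 1))},
]
expired = find_expired(docs, today=date(2024, 12, 1))
for d in expired:
    d["status"] = "needs review"  # flag for a human instead of deleting outright
print([(d["title"], d["status"]) for d in docs])
```
Flagging rather than deleting keeps a person in the loop, which fits the needs review status suggested above.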

00:17:18.250 --> 00:17:20.910
This system proactively manages your AI knowledge,

00:17:21.190 --> 00:17:25.069
keeping it fresh, reliable, super smart. So why

00:17:25.069 --> 00:17:27.450
is automated cleanup so vital for maintaining

00:17:27.450 --> 00:17:30.450
AI trust? It prevents that brain rot, ensures

00:17:30.450 --> 00:17:33.309
ongoing accuracy, and maintains overall system

00:17:33.309 --> 00:17:35.799
reliability. Okay, so the YouTube example we've

00:17:35.799 --> 00:17:37.460
used, it's just a simple demonstration, really.

00:17:37.539 --> 00:17:39.980
The real power of this metadata -first approach,

00:17:40.180 --> 00:17:41.940
it gets unlocked when you apply it to complex

00:17:41.940 --> 00:17:44.279
business data out in the real world. Oh, absolutely.

00:17:44.559 --> 00:17:47.279
Imagine a customer support knowledge base trained

00:17:47.279 --> 00:17:49.740
on thousands of past support tickets. Metadata

00:17:49.740 --> 00:17:51.720
for each chunk could include product category,

00:17:51.960 --> 00:17:54.220
issue severity, resolution date, support agent,

00:17:54.259 --> 00:17:58.039
maybe customer tier. Now, when a premium tier

00:17:58.039 --> 00:18:00.799
customer asks about a billing issue, the system

00:18:00.799 --> 00:18:03.160
can use metadata filtering to prioritize solutions

00:18:03.160 --> 00:18:06.000
that are recent, maybe came from your top agents,

00:18:06.099 --> 00:18:07.980
are definitely relevant to premium customers.

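NOTE
A rough Python sketch of the metadata filtering described for the support
example. The field names (customer_tier, product_category, resolution_date)
and the sample chunks are assumptions for illustration only. In practice the
filter would likely run inside the vector database query, next to the
similarity search, rather than over a list in memory.
```python
from datetime import date
def filter_chunks(chunks: list[dict], customer_tier: str,
                  product_category: str, resolved_after: date) -> list[dict]:
    """Keep chunks matching the tier and category, newest resolution first."""
    matches = [
        c for c in chunks
        if c["customer_tier"] == customer_tier
        and c["product_category"] == product_category
        and date.fromisoformat(c["resolution_date"]) >= resolved_after
    ]
    # ISO date strings sort chronologically, so a plain string sort is enough.
    return sorted(matches, key=lambda c: c["resolution_date"], reverse=True)
# Hypothetical ticket chunks with their metadata.
chunks = [
    {"id": 1, "customer_tier": "premium", "product_category": "billing", "resolution_date": "2024-05-10"},
    {"id": 2, "customer_tier": "free", "product_category": "billing", "resolution_date": "2024-06-01"},
    {"id": 3, "customer_tier": "premium", "product_category": "billing", "resolution_date": "2022-01-15"},
    {"id": 4, "customer_tier": "premium", "product_category": "billing", "resolution_date": "2024-07-20"},
]
recent_premium = filter_chunks(chunks, "premium", "billing", date(2024, 1, 1))
```
Only the recent premium billing chunks survive the filter, so the model never
sees the stale or off-tier tickets.
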
00:18:08.180 --> 00:18:10.740
Big difference. Or think about a law firm. Metadata

00:18:10.740 --> 00:18:13.119
could be case type, jurisdiction, date filed,

00:18:13.220 --> 00:18:15.599
court level, outcome. A lawyer could then do an

00:18:15.599 --> 00:18:18.059
incredibly powerful search, like "find precedents

00:18:18.059 --> 00:18:19.900
related to intellectual property disputes in

00:18:19.900 --> 00:18:21.980
California at the appellate court level where

00:18:21.980 --> 00:18:24.440
the outcome was summary judgment." That kind of

00:18:24.440 --> 00:18:26.079
research normally takes a paralegal hours.

00:18:26.220 --> 00:18:29.769
Now. Yeah, or even for just an internal company

00:18:29.769 --> 00:18:33.289
wiki. Metadata could include department, document

00:18:33.289 --> 00:18:36.289
type (policy, tutorial, meeting notes), last

00:18:36.289 --> 00:18:39.049
updated date, author. An employee in marketing

00:18:39.049 --> 00:18:41.250
could ask, "What's our policy on social media

00:18:41.250 --> 00:18:44.329
engagement?" The system could filter to show only

00:18:44.329 --> 00:18:46.789
official documents for marketing or maybe HR

00:18:46.789 --> 00:18:49.769
updated in the last year, ensuring they get the

00:18:49.769 --> 00:18:53.539
correct current info. Not some old draft. And

00:18:53.539 --> 00:18:55.259
if you're ready to actually build a system like

00:18:55.259 --> 00:18:57.539
this, the tech stack is surprisingly accessible

00:18:57.539 --> 00:19:00.380
these days. For the vector database, Supabase

00:19:00.380 --> 00:19:02.440
is a fantastic choice. Like we said, built on

00:19:02.440 --> 00:19:05.579
PostgreSQL. Solid. n8n is kind of the engine

00:19:05.579 --> 00:19:07.539
running the whole pipeline from ingestion to

00:19:07.539 --> 00:19:09.619
the interactive agent. For web content like those

00:19:09.619 --> 00:19:11.859
YouTube transcripts, Apify is a really reliable

00:19:11.859 --> 00:19:14.420
tool for scraping. And honestly, for many uses,

00:19:14.599 --> 00:19:17.000
Supabase's built-in similarity search functions

00:19:17.000 --> 00:19:18.940
are often powerful enough to act as your basic

00:19:18.940 --> 00:19:21.119
re-ranker. You might not need more. Mm-hmm.

00:19:21.869 --> 00:19:23.769
When you're getting ready, here's a quick

00:19:23.769 --> 00:19:26.950
pre-flight checklist. Key configuration points to

00:19:26.950 --> 00:19:30.450
think about. First, chunk size. How big are your

00:19:30.450 --> 00:19:32.410
text chunks? It's critical. You need to experiment.

00:19:32.809 --> 00:19:35.109
Start with maybe around 40 seconds of video transcript

00:19:35.109 --> 00:19:38.109
or a few solid paragraphs of text and test what

00:19:38.109 --> 00:19:40.450
gives you the best results for your data. Second,

00:19:40.589 --> 00:19:43.309
your metadata schema. Design this before you

00:19:43.309 --> 00:19:45.890
start building anything. Seriously. A consistent

00:19:45.890 --> 00:19:48.049
schema across all your different content types

00:19:48.049 --> 00:19:50.470
is essential for that filtering to work effectively.

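NOTE
One possible way to keep a schema consistent across content types, sketched in
Python. The required fields listed here (source_url, title, content_type,
created_at) are hypothetical examples and not a schema given in the episode.
The idea is to reject or flag records at ingestion time if they do not fit.
```python
# Hypothetical schema: field name mapped to the expected Python type.
REQUIRED_FIELDS = {"source_url": str, "title": str, "content_type": str, "created_at": str}
def validate_metadata(metadata: dict) -> list[str]:
    """Return a list of problems; an empty list means the record fits the schema."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in metadata:
            problems.append(f"missing field: {field}")
        elif not isinstance(metadata[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems
ok = validate_metadata({"source_url": "https://example.com/v1", "title": "Intro",
                        "content_type": "video", "created_at": "2024-01-01"})
bad = validate_metadata({"title": "Intro", "content_type": 7, "created_at": "2024-01-01"})
```
Running a check like this before anything is embedded keeps every chunk
filterable on the same fields, whatever its source.
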
00:19:50.750 --> 00:19:54.250
Third, filtering syntax. Take a little time.

00:19:54.349 --> 00:19:56.470
Learn your chosen vector database's specific

00:19:56.470 --> 00:19:59.589
metadata filtering syntax. It varies. This is

00:19:59.589 --> 00:20:02.009
the key to unlocking those power searches. And

00:20:02.009 --> 00:20:04.490
fourth, cleanup automation. Don't treat this

00:20:04.490 --> 00:20:06.990
as an add-on later. Build your automated housekeeping

00:20:06.990 --> 00:20:09.549
workflow from day one. A clean knowledge base

00:20:09.549 --> 00:20:11.769
is a trustworthy one. Okay, let's quickly touch

00:20:11.769 --> 00:20:13.990
on some common mistakes. The things that often

00:20:13.990 --> 00:20:16.650
cause RAG projects to stumble or fail. Call

00:20:16.650 --> 00:20:18.609
them the four horsemen of RAG failure. Huh,

00:20:18.789 --> 00:20:22.259
yeah. First, treating metadata as an afterthought.

00:20:22.619 --> 00:20:25.200
So common. People get excited, build the ingestion

00:20:25.200 --> 00:20:27.859
pipeline first, then try to bolt on some metadata

00:20:27.859 --> 00:20:30.220
later. It's backwards. Design your metadata schema

00:20:30.220 --> 00:20:32.460
first. It dictates how your data needs to be

00:20:32.460 --> 00:20:35.299
structured and processed. Foundational. Second,

00:20:35.440 --> 00:20:38.279
over-indexing on chunk content. Yes, the content

00:20:38.279 --> 00:20:40.819
matters, obviously, but a perfectly chunked document

00:20:40.819 --> 00:20:43.680
with zero context is way less useful than

00:20:43.680 --> 00:20:46.140
slightly imperfect chunks that have rich, filterable

00:20:46.140 --> 00:20:50.160
metadata. Context is king. Third, ignoring data

00:20:50.160 --> 00:20:53.000
lineage. Every single piece of info in your vector

00:20:53.000 --> 00:20:55.319
database must be traceable back to its original

00:20:55.319 --> 00:20:58.339
source, period. If your AI gives an answer and

00:20:58.339 --> 00:21:00.440
you can't verify where it came from, you don't

00:21:00.440 --> 00:21:02.359
have an intelligent system. You've got a rumor

00:21:02.359 --> 00:21:05.400
mill. And fourth, using a static metadata schema.

00:21:05.539 --> 00:21:07.880
The info you need to track might change. Your

00:21:07.880 --> 00:21:10.059
business changes. Your data changes. Build your

00:21:10.059 --> 00:21:12.160
system to be flexible so you can update or add

00:21:12.160 --> 00:21:14.000
to your metadata schema fairly easily down the

00:21:14.000 --> 00:21:16.039
road. So what's the biggest takeaway for anyone

00:21:16.039 --> 00:21:18.720
building these systems? Plan your metadata first.

00:21:18.980 --> 00:21:21.059
It's truly the foundation for building trust.

00:21:21.480 --> 00:21:23.980
Right. So let's bring it all back home. The bottom

00:21:23.980 --> 00:21:26.579
line here is metadata is the price of trust.

00:21:26.859 --> 00:21:29.259
The real difference between just a basic RAG

00:21:29.259 --> 00:21:31.680
system and one that's truly effective. It really

00:21:31.680 --> 00:21:34.509
is that simple. Trust. A system that just gives

00:21:34.509 --> 00:21:36.809
answers is, well, it's a black box. You can't

00:21:36.809 --> 00:21:38.769
see inside. You can't verify it. A system that

00:21:38.769 --> 00:21:40.809
gives answers and proves where they came from,

00:21:40.930 --> 00:21:43.630
that's transparent. It's auditable. It's genuinely

00:21:43.630 --> 00:21:46.369
useful. Metadata is the underlying technology

00:21:46.369 --> 00:21:48.690
that makes this transparency possible. It's really

00:21:48.690 --> 00:21:51.470
the infrastructure of trust. So if you're building

00:21:51.470 --> 00:21:54.289
a RAG system, your process should be really clear

00:21:54.289 --> 00:21:56.990
now. One, design your metadata schema first.

00:21:57.170 --> 00:21:59.109
Think hard about the context that will make your

00:21:59.109 --> 00:22:01.230
answers trustworthy and useful for your users.

00:22:01.569 --> 00:22:04.109
Two, build your pipeline around that schema.

00:22:04.430 --> 00:22:06.869
Don't treat metadata like an afterthought. Make

00:22:06.869 --> 00:22:09.910
it core to the system. And three, always, always

00:22:09.910 --> 00:22:12.269
present the evidence. Make sure your final output

00:22:12.269 --> 00:22:14.950
includes the source. Allow users to verify the

00:22:14.950 --> 00:22:17.890
info for themselves. Empower them. Stop building

00:22:17.890 --> 00:22:20.710
black boxes. Start building systems that people

00:22:20.710 --> 00:22:23.789
can actually rely on. Because in this new age

00:22:23.789 --> 00:22:25.730
of AI, the best system isn't going to be the

00:22:25.730 --> 00:22:28.170
one with the most data. It's the one that earns

00:22:28.170 --> 00:22:31.390
the most trust. Think about it. How do you verify

00:22:31.390 --> 00:22:34.269
information in your own daily life? Your AI should

00:22:34.269 --> 00:22:36.880
really be held to that same standard. Thank you

00:22:36.880 --> 00:22:38.980
so much for joining us on this deep dive into

00:22:38.980 --> 00:22:42.839
RAG systems and the real power of metadata. We

00:22:42.839 --> 00:22:44.680
really encourage you to explore these concepts

00:22:44.680 --> 00:22:47.880
further and start building AI systems that truly

00:22:47.880 --> 00:22:50.319
earn trust. Until next time.
