WEBVTT

00:00:00.000 --> 00:00:03.000
Three months ago, I almost shipped a quote-unquote

00:00:03.000 --> 00:00:06.480
clever feature. It was a drag-and-drop for a

00:00:06.480 --> 00:00:11.220
CSV file, where we could do auto-EDA, and boom,

00:00:11.339 --> 00:00:15.359
done. And two weeks ago, I made this app better.

00:00:15.419 --> 00:00:19.920
I included additional features. Here's what I

00:00:19.920 --> 00:00:24.140
didn't take into account: the real world, where

00:00:24.140 --> 00:00:26.620
we have hospital claim files, student records,

00:00:27.819 --> 00:00:29.899
purchase histories with emails and addresses.

00:00:30.300 --> 00:00:32.579
And then I imagine someone dragging a million

00:00:32.579 --> 00:00:35.719
rows into my app from a coffee shop Wi-Fi and

00:00:35.719 --> 00:00:38.960
my stomach just dropped. So I pulled the plug,

00:00:38.979 --> 00:00:42.219
not on everything, just on the uploads, just

00:00:42.219 --> 00:00:44.979
on file uploads. And I rebuilt the whole thing

00:00:44.979 --> 00:00:49.039
around one single question. What if your AI analyst

00:00:49.039 --> 00:00:53.719
is not meant to see your data, but can still help

00:00:53.719 --> 00:00:57.820
you analyze it like a pro? And today I'm going

00:00:57.820 --> 00:01:07.420
to show you exactly how I did it. Hey, I'm Mukundan.

00:01:07.519 --> 00:01:10.260
This show is about solving real problems with

00:01:10.260 --> 00:01:13.060
small, useful AI. Things that you can actually

00:01:13.060 --> 00:01:17.640
use today. Each week, we pick one problem, build

00:01:17.640 --> 00:01:19.700
a simple workflow or tool, and talk through the

00:01:19.700 --> 00:01:22.680
decisions, what to automate, how to check quality,

00:01:22.939 --> 00:01:26.849
and how to make it reliable. If you're a builder,

00:01:27.030 --> 00:01:30.969
creator, or just AI curious, you'll leave with

00:01:30.969 --> 00:01:34.590
the steps that you can copy tonight. Welcome

00:01:34.590 --> 00:01:36.450
to Data and AI with Mukundan, where you learn

00:01:36.450 --> 00:01:40.829
AI by building. I'm Mukundan. Each week, we turn

00:01:40.829 --> 00:01:43.469
everyday problems into small, useful AI workflows.

00:01:44.170 --> 00:01:49.569
It's not some hand-wavy someday. It is something

00:01:49.569 --> 00:01:55.170
we ship today. In this episode, we have

00:01:55.170 --> 00:01:57.870
AI that thinks like an analyst. I would call it

00:01:57.870 --> 00:02:01.950
a Version 4 secure mode, where we have no

00:02:01.950 --> 00:02:07.010
file uploads. It's copy-paste only,

00:02:07.010 --> 00:02:11.129
with policy-driven privacy, where you

00:02:11.129 --> 00:02:14.490
hash, mask, or redact on your machine before any analysis.

00:02:14.490 --> 00:02:18.250
You get a notebook-ready transformer that

00:02:18.250 --> 00:02:20.990
you can copy and paste into Jupyter notebooks or

00:02:20.990 --> 00:02:24.289
Google Colab. You also get a Streamlit

00:02:24.569 --> 00:02:27.129
wrapper for teams, with a data handling report

00:02:27.129 --> 00:02:30.150
that you can hand to your ops team, legal, or your

00:02:30.150 --> 00:02:32.770
future self. And if you have data that you can't

00:02:32.770 --> 00:02:37.310
upload, this is especially for you. So let's look

00:02:37.310 --> 00:02:40.710
at the story and the design

00:02:40.710 --> 00:02:42.550
constraints that we need to take into account.

00:02:42.550 --> 00:02:46.689
Here's the reality. Your stakeholders want

00:02:46.689 --> 00:02:49.889
insights. They don't want any risk associated

00:02:49.889 --> 00:02:54.189
with those insights. But most AI tools want

00:02:54.189 --> 00:02:58.069
your files. And that's a non-starter in healthcare,

00:02:58.210 --> 00:03:02.870
finance, education. Here's the reality. Your

00:03:02.870 --> 00:03:05.250
stakeholders want insights,

00:03:05.590 --> 00:03:10.009
not risk associated with those insights. And

00:03:10.009 --> 00:03:14.449
most AI tools want your files. That's a

00:03:14.449 --> 00:03:16.349
non-starter, especially if you're working in

00:03:16.349 --> 00:03:21.259
healthcare, finance, or education. And

00:03:21.520 --> 00:03:23.680
most likely, it's all industries, really.

00:03:23.860 --> 00:03:27.240
For small companies, big companies, they're trying

00:03:27.240 --> 00:03:29.879
not to leak their customers' emails into the

00:03:29.879 --> 00:03:36.419
void. So this time, I just picked four hard constraints.

00:03:36.719 --> 00:03:40.659
One, no file uploads, period. Second, client-side

00:03:40.659 --> 00:03:43.979
only, or local runtime preferred. Basically,

00:03:43.979 --> 00:03:46.840
running it on your own laptop. Third one was

00:03:46.840 --> 00:03:50.280
policy presets, like a one-click default that's

00:03:50.280 --> 00:03:55.199
opinionated but still editable. And fourth, a human-readable

00:03:55.199 --> 00:03:58.340
audit of what happened to each column.

00:03:59.900 --> 00:04:02.979
Here's the design philosophy I had in mind for

00:04:02.979 --> 00:04:07.259
this. Use a rules-first approach, where

00:04:07.259 --> 00:04:09.599
you're focusing on the schema, the semantics,

00:04:09.599 --> 00:04:13.780
and the policy to make safe decisions by default.

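NOTE
A minimal sketch of what a rules-first pass could look like, assuming column kinds are guessed from column names alone. The rule list, the kind labels, and the classify_column helper are hypothetical illustrations, not the code of the app discussed in the episode.
```python
import re
# Hypothetical name-based rules; the first matching rule wins.
COLUMN_RULES = [
    ("email", re.compile(r"e[-_]?mail", re.I)),
    ("phone", re.compile(r"phone|mobile", re.I)),
    ("ssn", re.compile(r"ssn|social[-_ ]?security", re.I)),
    ("id", re.compile(r"(^|_)id$", re.I)),
    ("name", re.compile(r"name", re.I)),
    ("address", re.compile(r"address|street|zip|postal", re.I)),
    ("date", re.compile(r"date", re.I)),
]
def classify_column(column_name):
    """Guess a column kind from its name only; no row values are inspected."""
    for kind, pattern in COLUMN_RULES:
        if pattern.search(column_name):
            return kind
    return "other"
# Header used as the running example in this episode.
header = ["user_id", "email", "signup_date", "country", "churned", "phone"]
kinds = {column: classify_column(column) for column in header}
```
Because only the header is inspected, this step works even when no sample rows are pasted. Name-based rules can misfire on unusual column names, so the guesses should stay editable.
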
00:04:14.580 --> 00:04:18.870
Then let AI suggest analysis steps from structured

00:04:18.870 --> 00:04:22.170
descriptions and not from raw rows. And keep

00:04:22.170 --> 00:04:25.230
the user in the loop. A safe default is great.

00:04:25.230 --> 00:04:29.189
An understandable default is even better. Think

00:04:29.189 --> 00:04:31.730
of three big privacy moves. That's what I wanted

00:04:31.730 --> 00:04:36.269
to do here. One is hash, where you turn the value into

00:04:36.269 --> 00:04:38.490
a code, like giving the value a nickname with

00:04:38.490 --> 00:04:43.290
your secret key. For example, it could be alice

00:04:43.290 --> 00:04:49.240
at 123.com, and this becomes F3A dot dot dot

00:04:49.240 --> 00:04:53.759
9B2C or whatever. It's the same input, same secret,

00:04:53.839 --> 00:04:56.660
and it gives the same code. It's great for joins,

00:04:56.860 --> 00:04:59.160
but it's hard to reverse. Hashing is essentially

00:04:59.160 --> 00:05:02.019
turning your emails or any kind of sensitive

00:05:02.019 --> 00:05:08.139
data into a key, a key which is hard to

00:05:08.139 --> 00:05:14.199
directly recognize. The second privacy move is mask,

00:05:14.240 --> 00:05:17.319
where you blur the details. It's like a black

00:05:17.319 --> 00:05:20.879
marker, where alice at example.com becomes a

00:05:20.879 --> 00:05:24.279
dot dot dot dot at example.com, right? So you're

00:05:24.279 --> 00:05:26.399
blurring the details here, so you see the shape

00:05:26.399 --> 00:05:29.120
but not the secrets. And the third is where you're

00:05:29.120 --> 00:05:32.160
redacting or removing it. So if you don't need

00:05:32.160 --> 00:05:34.600
it, delete it. SSN is a good example. That's it:

00:05:34.600 --> 00:05:37.720
just code it, blur it, or drop it. I think I've

00:05:37.720 --> 00:05:39.879
seen, when I've worked with applications that

00:05:39.879 --> 00:05:45.519
ask you for SSN, they blur the details. So it looks

00:05:45.519 --> 00:05:48.040
more like a case of masking in those kinds of applications.

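NOTE
The three moves described above (hash, mask, redact) can be sketched as three small functions. This assumes HMAC-SHA256 from the Python standard library as the keyed hash. The function names, the demo secret, and the exact mask pattern (first character plus asterisks) are illustrative choices, not the episode's implementation.
```python
import hashlib
import hmac
def hash_value(value, secret):
    """Keyed hash: the same input with the same secret always gives the same code."""
    return hmac.new(secret.encode("utf-8"), value.encode("utf-8"), hashlib.sha256).hexdigest()
def mask_email(value):
    """Keep the first character and the domain; blur the rest of the local part."""
    local, separator, domain = value.partition("@")
    if not separator:
        # Not shaped like an email, so hide every character.
        return "*" * len(value)
    return local[:1] + "***@" + domain
def redact(value):
    """Drop the value entirely."""
    return None
# "demo-secret" is a placeholder; a real secret should be strong and kept private.
code = hash_value("alice@example.com", "demo-secret")
masked = mask_email("alice@example.com")
```
Hashing keeps values joinable, masking keeps the shape readable for a person, and redaction removes the value when it is not needed at all.
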
00:05:48.519 --> 00:05:50.819
I don't particularly remember which ones,

00:05:50.939 --> 00:05:54.279
but I believe websites which ask you

00:05:54.279 --> 00:05:55.899
for that, and it could be government websites

00:05:55.899 --> 00:05:59.439
only, will still be blurring the details,

00:05:59.439 --> 00:06:02.360
at least as far as I know. Now let's move on to some

00:06:02.360 --> 00:06:05.759
easy presets. So we ship with presets so you

00:06:05.759 --> 00:06:09.019
don't have to overthink. So the first is

00:06:09.019 --> 00:06:12.259
analytics-safe: hash emails, phones, and

00:06:12.259 --> 00:06:17.240
IDs; mask names and addresses; keep dates; drop

00:06:17.240 --> 00:06:20.759
SSN. So for analytics, you just need

00:06:20.759 --> 00:06:25.839
a hashed value of emails, phones, or IDs. You

00:06:25.839 --> 00:06:27.819
don't need the full names. You can just mask them,

00:06:27.819 --> 00:06:31.339
like how I talked about earlier, like a dot dot

00:06:31.339 --> 00:06:35.139
dot at example.com, right? And addresses could

00:06:35.139 --> 00:06:39.279
be masked as well. The dates can be kept. However,

00:06:39.279 --> 00:06:42.060
just drop the SSN. That's analytics-safe.

00:06:42.759 --> 00:06:45.160
So if the AI models are using it, just make sure

00:06:45.160 --> 00:06:49.000
those transformations have been applied already.

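NOTE
One way the analytics-safe preset could be written down is as a mapping from column kind to action. This is a sketch under assumptions: the kind labels, the ANALYTICS_SAFE name, and the action_for helper are hypothetical, and treating unlisted kinds as "keep" is an assumption rather than something stated in the episode.
```python
# Hypothetical encoding of the analytics-safe preset: column kind to action.
ANALYTICS_SAFE = {
    "email": "hash",
    "phone": "hash",
    "id": "hash",
    "name": "mask",
    "address": "mask",
    "date": "keep",
    "ssn": "redact",
}
def action_for(kind, preset, overrides=None):
    """Return the action for a column kind; a user override wins over the preset."""
    if overrides and kind in overrides:
        return overrides[kind]
    # Assumption: kinds the preset does not mention are kept as-is.
    return preset.get(kind, "keep")
```
Keeping the preset as plain data is what makes the default opinionated but still editable: an override changes one decision without touching the rest.
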
00:06:49.680 --> 00:06:52.319
Next is marketing-safe, which is stricter on addresses

00:06:52.319 --> 00:06:55.160
and phones. You don't want marketers to reach

00:06:55.160 --> 00:06:58.180
out to you because your address is out there.

00:06:58.319 --> 00:07:01.060
You don't want them to be sending stuff to your

00:07:01.060 --> 00:07:04.360
mailbox, or it could be just your email address

00:07:04.360 --> 00:07:06.259
or whatever, or your phone. You don't want them

00:07:06.259 --> 00:07:08.740
to call you either. So you want that to be marketing-safe.

00:07:08.740 --> 00:07:12.769
And then HIPAA-like, right? That's more like an example

00:07:12.769 --> 00:07:15.829
only, not legal advice. It's very strict, where

00:07:15.829 --> 00:07:19.569
a lot of redaction is required. So if you

00:07:19.569 --> 00:07:24.750
don't like a decision, flip it. For a column: keep, hash,

00:07:24.750 --> 00:07:28.810
mask, or redact. Those are the basic rules to

00:07:28.810 --> 00:07:31.410
keep in mind here. Now let's do it step by step.

00:07:31.410 --> 00:07:35.009
How would this work? So step one: what you

00:07:35.009 --> 00:07:37.569
do is paste a tiny sample. So you would

00:07:37.569 --> 00:07:39.949
paste five to ten rows, or just the header

00:07:39.949 --> 00:07:44.279
row. For example, if there is more sensitive information

00:07:44.279 --> 00:07:48.240
involved, you just copy the column

00:07:48.240 --> 00:07:50.519
names, which are user ID, email, signup date,

00:07:50.660 --> 00:07:54.660
country, and a column for churned, which means

00:07:54.660 --> 00:07:58.720
the customer churned. And the last

00:07:58.720 --> 00:08:03.899
field could be phone. Step two, you choose a

00:08:03.899 --> 00:08:07.579
preset. You pick analytics-safe. Step three, you

00:08:07.579 --> 00:08:11.100
set your secret. If anything will be hashed,

00:08:11.100 --> 00:08:14.420
you type your secret, and it could be a password

00:08:14.420 --> 00:08:18.339
that you pick. Keep it safe. Step four: transform.

00:08:18.720 --> 00:08:22.079
In this step, we transform. The

00:08:22.079 --> 00:08:24.500
tool looks at each column and it does the right

00:08:24.500 --> 00:08:27.259
move. So when it looks at email, it will

00:08:27.259 --> 00:08:31.399
be hashed. Phone: hash. User ID: hash. SSN: redact

00:08:31.399 --> 00:08:36.519
or remove. Name, address: mask. Dates: usually

00:08:36.519 --> 00:08:40.919
keep or bucket by month. Step five: you get two things

00:08:40.919 --> 00:08:45.539
back: a safe table that you can analyze and a short

00:08:45.539 --> 00:08:49.480
report that says what

00:08:49.480 --> 00:08:53.899
we did and why. Here's what the report says in simple

00:08:53.899 --> 00:08:59.659
words. Column: email. Type: looks like email. Action:

00:08:59.659 --> 00:09:04.019
hashed. Reason: column name contains email. Result:

00:09:04.019 --> 00:09:08.009
column stays, but now it's

00:09:08.009 --> 00:09:11.509
codes. This is your paper trail. What the AI

00:09:11.509 --> 00:09:14.809
still does without your data. You might ask,

00:09:15.090 --> 00:09:18.909
if the AI doesn't see my full data, is it useful?

00:09:20.450 --> 00:09:24.509
Yes, because thinking like an analyst starts

00:09:24.509 --> 00:09:27.950
with good questions, not with seeing every row.

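NOTE
The transform step and its report, described a little earlier, can be sketched end to end with the Python standard library. Everything named here (POLICY, keyed_code, transform, the report fields, the demo secret, and the made-up sample row) is a hypothetical illustration under the assumption that the policy is keyed by column name and unlisted columns are kept.
```python
import csv
import hashlib
import hmac
import io
# Hypothetical per-column policy for the sample below; unlisted columns are kept.
POLICY = {"user_id": "hash", "email": "hash", "phone": "hash"}
def keyed_code(value, secret):
    """HMAC-SHA256 of the value, as a hex string."""
    return hmac.new(secret.encode("utf-8"), value.encode("utf-8"), hashlib.sha256).hexdigest()
def transform(pasted_csv, policy, secret):
    """Return (safe_rows, report) for a pasted CSV sample."""
    reader = csv.DictReader(io.StringIO(pasted_csv))
    columns = reader.fieldnames or []
    safe_rows = []
    for row in reader:
        safe = {}
        for column in columns:
            action = policy.get(column, "keep")
            value = row[column] or ""
            if action == "keep":
                safe[column] = value
            elif action == "hash":
                safe[column] = keyed_code(value, secret)
            elif action == "mask":
                safe[column] = value[:1] + "***"
            elif action == "redact":
                # Redacted columns are left out of the safe table.
                continue
            else:
                raise ValueError(f"unknown action {action!r} for column {column!r}")
        safe_rows.append(safe)
    # The report records one decision per column and never contains cell values.
    report = [
        {
            "column": column,
            "action": policy.get(column, "keep"),
            "reason": "listed in policy" if column in policy else "not listed, kept by default",
        }
        for column in columns
    ]
    return safe_rows, report
# Made-up sample; the values are placeholders, not real data.
SAMPLE = "user_id,email,signup_date,country,churned,phone\n1,alice@example.com,2025-01-05,US,0,555-0100\n"
safe_rows, report = transform(SAMPLE, POLICY, "demo-secret")
```
The report lists columns and decisions only, so it can be shared or committed alongside the analysis without exposing any row.
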
00:09:29.230 --> 00:09:32.730
You give it column names, a few sample rows,

00:09:32.909 --> 00:09:37.730
or none, and your business question: why did

00:09:37.730 --> 00:09:44.269
churn spike in May? It gives you: split churn

00:09:44.269 --> 00:09:48.909
by plan type and country. Look at cohorts by

00:09:48.909 --> 00:09:54.529
signup month. Check time to first value. Compare

00:09:54.529 --> 00:09:58.429
activated versus not activated users. Look for

00:09:58.429 --> 00:10:01.230
payment failures and support tickets around June.

00:10:01.409 --> 00:10:03.659
See, that's your analyst thinking. Your rows

00:10:03.659 --> 00:10:06.200
never left your laptop. Here are some real world

00:10:06.200 --> 00:10:10.399
examples for your reference. So let's say you're

00:10:10.399 --> 00:10:14.480
in healthcare, you hash patient IDs. You can

00:10:14.480 --> 00:10:18.139
still join the tables, but you never see raw

00:10:18.139 --> 00:10:22.860
names. If you're in education, you keep the cohorts,

00:10:22.960 --> 00:10:26.659
you drop the names, but you measure the learning

00:10:26.659 --> 00:10:30.879
without revealing the students. If you're in

00:10:30.879 --> 00:10:36.110
marketing, hash emails for matching, mask the

00:10:36.110 --> 00:10:40.769
phone, redact the SSN. You shouldn't have SSN

00:10:40.769 --> 00:10:45.070
anyway. And when you're looking at internal product

00:10:45.070 --> 00:10:48.669
analytics, you keep the date buckets, but you

00:10:48.669 --> 00:10:51.470
drop the exact addresses. Let's look at some

00:10:51.470 --> 00:10:53.610
common questions. So one of the questions that

00:10:53.610 --> 00:10:57.529
could come up is why a secret for hashing? So

00:10:57.529 --> 00:11:00.690
it makes the code hard to guess. Without the

00:11:00.690 --> 00:11:03.490
secret, people can try common values, right?

00:11:03.750 --> 00:11:08.789
And can I still join tables? Yes, you can. If

00:11:08.789 --> 00:11:11.629
the same input and the same secret are kept, you

00:11:11.629 --> 00:11:14.769
get the same code. Are the dates safe? It depends.

00:11:15.149 --> 00:11:19.490
Often you keep them or bucket by month. Birthdays, well,

00:11:19.590 --> 00:11:21.970
usually mask or remove. Is this legally compliant?

00:11:22.490 --> 00:11:25.389
These are sensible defaults. So obviously

00:11:25.389 --> 00:11:27.629
talk to your legal or privacy team for stricter

00:11:27.629 --> 00:11:31.500
rules, HIPAA- or GDPR-related. But it definitely

00:11:31.500 --> 00:11:34.740
gets you closer than uploading your

00:11:34.740 --> 00:11:37.399
file to a tool where AI will be analyzing that

00:11:37.399 --> 00:11:40.799
data. So this is definitely getting you in that

00:11:40.799 --> 00:11:43.519
direction. So obviously this will be depending

00:11:43.519 --> 00:11:46.059
on the legal teams that you get to work with

00:11:46.059 --> 00:11:48.940
as part of a company, as part of your client

00:11:48.940 --> 00:11:51.559
or whatever. Just make sure you're talking to

00:11:51.559 --> 00:11:54.879
them and checking with them if this is something

00:11:54.879 --> 00:11:58.019
that will work. But this, I would think, gets

00:11:58.019 --> 00:11:59.879
you in that direction. And what about running

00:11:59.879 --> 00:12:02.840
on a server? Pasting it into a cloud app

00:12:02.840 --> 00:12:06.360
still sends the text over the network. So for

00:12:06.360 --> 00:12:10.080
maximum privacy, run 100% locally. So when I

00:12:10.080 --> 00:12:13.879
tried a demo, what I did was I

00:12:13.879 --> 00:12:16.899
gave a sample. I gave a sample of user ID, email,

00:12:17.139 --> 00:12:21.200
signup date, country, churned, and phone. And

00:12:21.200 --> 00:12:24.000
I gave it three rows. User IDs were one, two, three.

00:12:24.000 --> 00:12:28.299
Emails were alice at example.com, bob at

00:12:28.299 --> 00:12:31.440
example.com, charlie at example.com. And signup

00:12:31.440 --> 00:12:33.720
dates were somewhere in the start of this year.

00:12:35.159 --> 00:12:37.840
For churned, one of them was churned.

00:12:38.259 --> 00:12:41.360
Two were not. And there were three

00:12:41.360 --> 00:12:43.879
random phone numbers. And I picked analytics-safe.

00:12:43.879 --> 00:12:48.759
I set my secret. The result was: email, coded;

00:12:48.899 --> 00:12:53.179
phone, coded; user ID, coded; signup date was kept;

00:12:53.179 --> 00:12:59.279
country, kept; churned, kept. And you also get a report

00:12:59.279 --> 00:13:02.220
that explains those choices in one page. That's

00:13:02.220 --> 00:13:05.039
it. It's like a safe copy mode, where then you

00:13:05.039 --> 00:13:07.299
can move on to your analysis. How can you start

00:13:07.299 --> 00:13:12.080
today? Option A: just copy

00:13:12.080 --> 00:13:14.659
one cell from the show notes, paste your sample,

00:13:14.659 --> 00:13:19.899
pick a preset, set your secret, and then run it.

00:13:22.000 --> 00:13:25.360
You'll get a safe table and a report. Option

00:13:25.360 --> 00:13:28.919
B: use the Streamlit app on your local system. Run

00:13:28.919 --> 00:13:31.259
the app, paste sample data, click transform,

00:13:31.519 --> 00:13:35.179
download the safe CSV plus report. Here's a pro tip.

00:13:36.360 --> 00:13:42.820
Make privacy the default. Use the preset first.

00:13:43.639 --> 00:13:46.500
Relax only if you truly need to. Here's a tiny

00:13:46.500 --> 00:13:49.240
checklist you can read while working. Paste headers

00:13:49.240 --> 00:13:53.179
plus 5 to 10 rows. Pick analytics safe. Set a

00:13:53.179 --> 00:13:58.039
strong secret, but don't share it. Transform

00:13:58.039 --> 00:14:01.639
into a safe table plus a report. Do analysis

00:14:01.639 --> 00:14:04.080
on the safe table. Commit the report with your

00:14:04.080 --> 00:14:06.779
work. So to close, today you learned the no-upload

00:14:06.779 --> 00:14:10.639
way. Hash, mask, and redact. Simple words. Simple

00:14:10.639 --> 00:14:13.580
steps. It's solid on privacy. Your data stays

00:14:13.580 --> 00:14:16.779
with you and you still get answers. If this helped,

00:14:16.860 --> 00:14:18.879
share it with a teammate who handles sensitive

00:14:18.879 --> 00:14:21.759
data. It might save them a headache and a compliance

00:14:21.759 --> 00:14:24.559
email. A quick question for you. What privacy

00:14:24.559 --> 00:14:27.759
move do you use most? Do you prefer hashing,

00:14:27.879 --> 00:14:31.500
masking, redacting or not sure yet? Another question.

00:14:31.539 --> 00:14:34.240
What's your toughest no upload challenge? I'll

00:14:34.240 --> 00:14:36.539
reply with a simple pattern. I've also added

00:14:36.539 --> 00:14:39.519
a little quiz for you that we can just do now.

00:14:39.820 --> 00:14:42.899
Hey, are you excited to play a quiz? If you're

00:14:42.899 --> 00:14:47.200
listening, why don't you just... If you're listening,

00:14:47.320 --> 00:14:52.470
is it option A, keep; option B, mask; option C,

00:14:52.470 --> 00:14:57.509
redact or remove; or option D, hash? So if you paid

00:14:57.509 --> 00:15:01.370
attention earlier in this episode, you probably

00:15:01.370 --> 00:15:04.850
know the answer, and the answer is option C, redact or

00:15:04.850 --> 00:15:07.570
remove. You almost never need SSN for analysis,

00:15:07.570 --> 00:15:10.909
so it's safest to drop it. Question two: why use

00:15:10.909 --> 00:15:15.110
HMAC instead of plain hash? Is it A, because

00:15:15.110 --> 00:15:19.389
it's faster; B, because it's harder to guess; C,

00:15:19.389 --> 00:15:21.870
because it's prettier; or D, because it's random

00:15:21.870 --> 00:15:25.830
each time? HMAC is still hashing, but

00:15:25.830 --> 00:15:28.470
I guess a more advanced hash, one that uses a secret key. So why use HMAC

00:15:28.470 --> 00:15:30.409
instead of plain hash? The options again are A,

00:15:30.409 --> 00:15:36.129
faster; B, harder to guess; C, prettier; or D, random

00:15:36.129 --> 00:15:38.470
each time. So if you selected harder to guess,

00:15:38.570 --> 00:15:42.750
you are right. The reason being the secret key

00:15:42.750 --> 00:15:45.590
makes reversal or dictionary attacks much harder.

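NOTE
A small illustration of why a keyed hash is harder to guess than a plain one, using hashlib and hmac from the Python standard library. The guess list, the secrets, and the variable names are made up for the demonstration.
```python
import hashlib
import hmac
def plain_hash(value):
    """Unkeyed SHA-256: anyone can recompute it for a guessed value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()
def keyed_hash(value, secret):
    """HMAC-SHA256: recomputing it requires the secret as well as the value."""
    return hmac.new(secret.encode("utf-8"), value.encode("utf-8"), hashlib.sha256).hexdigest()
# An attacker who sees a plain hash can hash common values and compare.
guesses = ["bob@example.com", "alice@example.com"]
leaked_plain = plain_hash("alice@example.com")
recovered_plain = [g for g in guesses if plain_hash(g) == leaked_plain]
# With HMAC, the same guesses fail unless the attacker also knows the secret.
leaked_keyed = keyed_hash("alice@example.com", "team-a-secret")
recovered_keyed = [g for g in guesses if keyed_hash(g, "wrong-guess") == leaked_keyed]
# Two teams with different secrets produce codes that do not match each other.
team_a = keyed_hash("alice@example.com", "team-a-secret")
team_b = keyed_hash("alice@example.com", "team-b-secret")
```
The same property explains why codes only join across tables when both sides used the same secret.
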
00:15:45.769 --> 00:15:47.210
All right, so let's move on to question three.

00:15:47.330 --> 00:15:50.289
You need to join two tables by email without

00:15:50.289 --> 00:15:54.190
actually exposing their emails. What is the best

00:15:54.190 --> 00:15:59.990
move? Is it option A, mask emails; B, hash or

00:15:59.990 --> 00:16:04.509
HMAC; C, redact emails; or D, keep the emails?

00:16:05.149 --> 00:16:07.990
Again, so you need to join two tables by email

00:16:07.990 --> 00:16:11.139
without exposing their emails. What is your best

00:16:11.139 --> 00:16:15.879
move? Is it A, masking emails; B, hashing or

00:16:15.879 --> 00:16:18.940
HMAC; C, redacting emails; or D, keeping emails?

00:16:19.220 --> 00:16:22.440
So again, if you paid

00:16:22.440 --> 00:16:24.679
attention earlier, you probably know that

00:16:24.679 --> 00:16:27.559
hashing is the answer, option B. Why? Because

00:16:27.559 --> 00:16:30.799
hashing keeps joinability without any raw values.

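NOTE
A sketch of joining two tables on a hashed email instead of the raw address. The two tiny tables, the email_key helper, and the placeholder secret are invented for illustration. Trimming and lowercasing before hashing is an added assumption so that differently typed copies of the same address still match.
```python
import hashlib
import hmac
SECRET = "demo-secret"  # placeholder; both tables must be hashed with the same secret
def email_key(email):
    """HMAC-SHA256 of the normalized email, used as the join key."""
    normalized = email.strip().lower()
    return hmac.new(SECRET.encode("utf-8"), normalized.encode("utf-8"), hashlib.sha256).hexdigest()
# Made-up source tables that still contain raw emails.
signups = [{"email": "alice@example.com", "plan": "pro"}, {"email": "bob@example.com", "plan": "free"}]
tickets = [{"email": "Alice@Example.com ", "tickets": 3}]
# Safe copies: the raw email is replaced by its code before any analysis.
safe_signups = [{"email_key": email_key(r["email"]), "plan": r["plan"]} for r in signups]
safe_tickets = {email_key(r["email"]): r["tickets"] for r in tickets}
# Join on the code; no raw address appears in the result.
joined = [{**row, "tickets": safe_tickets.get(row["email_key"], 0)} for row in safe_signups]
```
Masking would not work here, because masked values lose the detail needed for an exact match.
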
00:16:31.100 --> 00:16:35.039
Question 4. You must show phone numbers in the

00:16:35.039 --> 00:16:38.899
UI but hide most digits. Is it option A, keep;

00:16:39.289 --> 00:16:43.070
B, mask (keep shape); C, redact; or D, hash?

00:16:44.190 --> 00:16:50.490
So this is question four. You must show phone

00:16:50.490 --> 00:16:53.730
numbers in UI but hide most digits. What does

00:16:53.730 --> 00:16:56.809
this represent? That's the question. Is it option

00:16:56.809 --> 00:17:00.470
A, keep; B, mask

00:17:00.470 --> 00:17:05.430
(keep shape); C, redact; or D, hash? If you

00:17:05.430 --> 00:17:08.769
selected option B, mask, you are right. Why? Because

00:17:08.769 --> 00:17:12.230
masking preserves human-readable shape. For example,

00:17:12.230 --> 00:17:17.029
dash, dash, one two three four, right? Question

00:17:17.029 --> 00:17:19.289
five: you're pasting five rows to test transforms.

00:17:19.289 --> 00:17:23.029
Is that okay? So question five is: you're pasting

00:17:23.029 --> 00:17:28.210
five rows to test transforms. Is that okay? Options:

00:17:28.210 --> 00:17:32.289
option A, yes, because the sample is minimal; B,

00:17:32.289 --> 00:17:35.869
no, because you need full data; option C, only

00:17:35.869 --> 00:17:39.990
headers; D, upload the file. So this is the whole

00:17:39.990 --> 00:17:42.230
purpose of this podcast, right? This particular

00:17:42.230 --> 00:17:48.309
episode is trying to use a more secure way of

00:17:48.309 --> 00:17:52.470
doing things. So a sample would be fine. Only

00:17:52.470 --> 00:17:56.029
headers, yes, sure, it helps. I mean, what would

00:17:56.029 --> 00:17:59.049
help more is a little sample. And if you can't,

00:17:59.049 --> 00:18:01.069
at least take the sample and maybe put in some

00:18:01.069 --> 00:18:03.930
junk values in it. It doesn't have to be the same

00:18:03.930 --> 00:18:06.170
values either. So the junk values would act

00:18:06.170 --> 00:18:09.549
as a proxy, especially for fields like emails,

00:18:09.549 --> 00:18:12.289
SSN, or whatever, and the other fields we spoke

00:18:12.289 --> 00:18:15.329
about. So yeah, a tiny sample is enough here to verify

00:18:15.329 --> 00:18:17.210
the rules safely. Right, moving on to question

00:18:17.210 --> 00:18:21.589
six: which is not PII by itself, at least in most

00:18:21.589 --> 00:18:28.430
cases? So, A, email; B, phone; D, SSN. Here are

00:18:28.430 --> 00:18:31.769
the options again. Which is not PII by itself, at

00:18:31.769 --> 00:18:35.099
least in most cases? A, email; B, phone; C, month

00:18:35.099 --> 00:18:38.019
bucket; or D, SSN? If you selected option C, month

00:18:38.019 --> 00:18:41.119
bucket, you are right. Why? Because month-level

00:18:41.119 --> 00:18:44.180
dates are usually non-identifying. Question

00:18:44.180 --> 00:18:50.960
7. If two teams use different HMAC secrets, their

00:18:50.960 --> 00:18:55.740
hashes: A, won't match; B, will match; C, become

00:18:55.740 --> 00:18:59.079
random; or D, decrypt each other? If two teams

00:18:59.079 --> 00:19:02.559
use different HMAC secrets, their hashes: A, won't

00:19:02.559 --> 00:19:06.640
match; B, will match; C, become random; or

00:19:06.640 --> 00:19:10.680
D, decrypt each other? If you selected won't

00:19:10.680 --> 00:19:14.019
match, that is right, option A. Why? Different

00:19:14.019 --> 00:19:18.740
secrets mean different digests. So,

00:19:18.859 --> 00:19:21.079
question 8. There's more to

00:19:21.079 --> 00:19:25.240
question 8. Your policy says mask names. A rare

00:19:25.819 --> 00:19:29.759
two-letter... Your policy says mask names. A rare

00:19:29.759 --> 00:19:33.819
two-letter name appears. What's the risk? A, no

00:19:33.819 --> 00:19:39.519
risk; B, re-ID risk remains; C, legal issue only;

00:19:39.519 --> 00:19:42.180
or D, just aesthetics? So given the other options

00:19:42.180 --> 00:19:43.799
sound a bit vague, I would have gone with option

00:19:43.799 --> 00:19:48.559
B, where re-ID risk remains. Why? Because very

00:19:48.559 --> 00:19:52.579
short strings can still be unique. Consider stricter

00:19:52.579 --> 00:19:54.680
handling. Question 9: what is the best place

00:19:54.680 --> 00:19:58.900
to run the paste-only app for max privacy? A,

00:19:58.980 --> 00:20:04.019
public cloud; B, shared kiosk; C, your local

00:20:04.019 --> 00:20:07.160
machine; or D, your friend's laptop? So this

00:20:07.160 --> 00:20:08.819
is something, again, that would have been easy

00:20:08.819 --> 00:20:12.400
if you listened to the podcast. The option is

00:20:12.400 --> 00:20:14.799
your local machine. So that's the best place

00:20:14.799 --> 00:20:17.119
to run your paste-only app for max privacy.

00:20:17.299 --> 00:20:21.799
Question 10. What should your logs contain? Should

00:20:21.799 --> 00:20:26.000
they contain A, raw rows; B, PII excerpts; C, nothing

00:20:26.000 --> 00:20:30.319
sensitive; or D, secrets for debugging? So I'll

00:20:30.319 --> 00:20:32.779
go over this question one more time. What should

00:20:32.779 --> 00:20:37.279
your logs contain? A, raw rows, B, PII excerpts,

00:20:37.400 --> 00:20:41.339
C, nothing sensitive, or D, secrets for debugging?

00:20:42.740 --> 00:20:45.700
If you selected nothing sensitive, you are right.

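NOTE
A minimal sketch of a log line that contains nothing sensitive, assuming it is enough to record column names, the action taken on each, and a row count. The logger name and the log_transform helper are hypothetical.
```python
import logging
logger = logging.getLogger("safe_transform")
def log_transform(column_actions, row_count):
    """Log column names, actions, and counts only; never cell values or the secret."""
    summary = ", ".join(f"{column}={action}" for column, action in sorted(column_actions.items()))
    message = f"transformed {row_count} rows: {summary}"
    logger.info(message)
    return message
line = log_transform({"email": "hash", "country": "keep", "ssn": "redact"}, 3)
```
A line like this can be pasted into a ticket or chat while debugging without exposing any row.
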
00:20:45.920 --> 00:20:47.759
Obviously, nothing sensitive should be there

00:20:47.759 --> 00:20:49.759
in your logs. People will be copy-pasting them

00:20:49.759 --> 00:20:52.910
to understand what went wrong or what went right,

00:20:52.910 --> 00:20:56.450
right? So just make sure that is taken care of.

00:20:56.450 --> 00:20:58.349
So let's do another lightning round of true or

00:20:58.349 --> 00:21:01.490
false. Hashing with HMAC is deterministic.

00:21:01.490 --> 00:21:06.970
Is it true or false? Answer: true. Masking preserves

00:21:06.970 --> 00:21:10.569
exact join keys. Answer: false. Redacting reduces

00:21:10.569 --> 00:21:13.609
breach blast radius. Dates are always safe to

00:21:13.609 --> 00:21:16.970
keep: false. A data handling report helps audits:

00:21:16.970 --> 00:21:20.200
true. What's one column... Okay, here are some questions

00:21:20.200 --> 00:21:22.279
for you to think about. Which column do you always

00:21:22.279 --> 00:21:25.140
hash and why? When is masking more useful than

00:21:25.140 --> 00:21:28.559
hashing? What's your paste-the-minimum rule? And

00:21:28.559 --> 00:21:32.640
some deeper prompts for you. Share

00:21:32.640 --> 00:21:36.000
a time when privacy defaults saved you from a

00:21:36.000 --> 00:21:39.319
mess. How do you balance need to know versus

00:21:39.319 --> 00:21:41.319
nice to have columns? What would your team's

00:21:41.319 --> 00:21:43.680
default policy be and why? I have some hot takes

00:21:43.680 --> 00:21:47.339
for you. Date should always be bucketed. Agree

00:21:47.339 --> 00:21:49.940
or disagree? hashing without h mac is reckless

00:21:49.940 --> 00:21:55.019
redaction is underused in bi so i need to tell

00:21:55.019 --> 00:21:57.680
you why i even came up with this whole episode

00:21:57.680 --> 00:22:01.460
so when i was doing my research i kept on thinking

00:22:01.460 --> 00:22:05.420
like yes i built this perfect tool you can upload

00:22:05.420 --> 00:22:08.440
csv into this tool and it will give you like

00:22:08.440 --> 00:22:12.099
analysis and questions to ask which is all great

00:22:12.589 --> 00:22:14.789
But then I looked at some research online. I

00:22:14.789 --> 00:22:17.650
looked at blogs. I looked at comments in the

00:22:17.650 --> 00:22:19.589
blogs where people are saying like, oh, have

00:22:19.589 --> 00:22:21.309
you even worked in the real world? And then I realized,

00:22:21.470 --> 00:22:26.690
you know what? They're right. Yes, I've worked

00:22:26.690 --> 00:22:29.210
in the real world. It's not that part. It's just

00:22:29.210 --> 00:22:34.670
that because people are scared about using AI

00:22:34.670 --> 00:22:39.769
tools for uploading their data because sensitive

00:22:39.769 --> 00:22:43.049
data can be leaked. It's a very valid fear to

00:22:43.049 --> 00:22:44.890
have, especially when you're working with AI.

00:22:45.029 --> 00:22:48.430
You don't know what AI is tracking. So as someone

00:22:48.430 --> 00:22:54.450
who uses AI to analyze data, I do make sure I'm

00:22:54.450 --> 00:22:56.250
not using any sensitive data. I've used only

00:22:56.250 --> 00:22:59.089
for non -sensitive data in the past. But like

00:22:59.089 --> 00:23:02.509
something I knew in the back of my mind, I've

00:23:02.509 --> 00:23:05.650
not been using it for sensitive data. But it's

00:23:05.650 --> 00:23:07.309
something that I just wanted to call out,

00:23:07.410 --> 00:23:10.440
especially because people have that fear, like,

00:23:10.440 --> 00:23:12.460
how can we use it for sensitive data? And that's

00:23:12.460 --> 00:23:16.240
why I suggested these hashing techniques, masking

00:23:16.240 --> 00:23:19.779
techniques, keeping, or redacting, right, like

00:23:19.779 --> 00:23:23.700
SSN fields. So the security aspect does become

00:23:23.700 --> 00:23:26.380
very important. Security does become a

00:23:26.380 --> 00:23:28.140
very important aspect when you're dealing with

00:23:29.069 --> 00:23:31.829
data, even before AI tools. And now with AI tools,

00:23:31.910 --> 00:23:34.190
especially, there's so much to be concerned about.

00:23:34.390 --> 00:23:36.529
So that's why this episode is very important.

00:23:36.769 --> 00:23:38.869
I want to challenge you on something. In the

00:23:38.869 --> 00:23:42.009
comments section, post your default policy. Give

00:23:42.009 --> 00:23:43.869
me a max of five lines. You know, that's helpful.

00:23:44.950 --> 00:23:48.430
Give me three columns you always hash. And one

00:23:48.430 --> 00:23:50.990
you will always redact. Share one logging fix

00:23:50.990 --> 00:23:53.789
that you will ship this week. So in the show

00:23:53.789 --> 00:23:55.630
notes, you'll also see some, you know, the prompt

00:23:55.630 --> 00:23:58.009
challenges, the quiz, and a discussion question.

00:23:58.640 --> 00:24:01.440
And a link to the blog which will have the app

00:24:01.440 --> 00:24:05.079
as well. So that will help you try to do it by

00:24:05.079 --> 00:24:13.519
yourself. Before you go, I just wanted to let

00:24:13.519 --> 00:24:17.180
you know that I record with Riverside. I'm an

00:24:17.180 --> 00:24:19.240
affiliate partner of Riverside. They helped me

00:24:19.240 --> 00:24:22.220
record this episode and they have amazing audio

00:24:22.220 --> 00:24:24.900
quality. So that's something I just wanted to

00:24:24.900 --> 00:24:27.180
recommend if you're ever deciding to do your

00:24:27.180 --> 00:24:30.289
own podcast. This is a great platform to record

00:24:30.289 --> 00:24:33.769
because their editing features are amazing. Their

00:24:33.769 --> 00:24:37.589
AI editing is, I think, next level. Helps you

00:24:37.589 --> 00:24:40.750
remove all awkward pauses and gives your sound

00:24:40.750 --> 00:24:44.670
a strong voice. It gives you a strong voice.

00:24:45.450 --> 00:24:49.049
So I would highly recommend Riverside. The

00:24:49.049 --> 00:24:50.910
reason why I chose to be an affiliate partner is

00:24:50.910 --> 00:24:53.069
because I actually use the product and I love

00:24:53.069 --> 00:24:58.359
it. So the link to join their platform is in the

00:24:58.359 --> 00:25:00.940
show notes. And again, I'm an affiliate partner.

00:25:01.059 --> 00:25:04.700
So if you make any purchase, I may make a small

00:25:04.700 --> 00:25:08.680
commission from it. I also host on rss.com.

00:25:09.319 --> 00:25:12.000
Also, I'm an affiliate partner of RSS because

00:25:12.000 --> 00:25:17.839
they have an amazing distribution channel. They

00:25:17.839 --> 00:25:22.779
are able to publish your episodes from RSS into

00:25:22.779 --> 00:25:25.980
Spotify, Apple Podcasts, Amazon Music, Audible,

00:25:26.240 --> 00:25:32.109
Deezer. Pandora, iHeartRadio, and even lets you

00:25:32.109 --> 00:25:34.829
publish your episode with one button into YouTube

00:25:34.829 --> 00:25:38.690
as well. Did I forget to mention that you get,

00:25:38.750 --> 00:25:42.269
you can start making money by hosting on rss.com,

00:25:42.269 --> 00:25:45.069
even if you have just 10 downloads a month.

00:25:45.650 --> 00:25:48.069
So that's all you need to make money, 10 downloads

00:25:48.069 --> 00:25:50.029
a month. It's a new feature they started and

00:25:50.029 --> 00:25:53.630
it's amazing. So again, link to joinrss .com

00:25:53.630 --> 00:25:55.809
will be in the show notes. I'm also using cider

00:25:55.809 --> 00:26:01.170
.ai to help with my research. So if you are interested

00:26:01.170 --> 00:26:06.109
in any research for your content or for any general

00:26:06.109 --> 00:26:08.609
you want to learn about anything, want to read

00:26:08.609 --> 00:26:12.170
research papers or any kind of thing to improve

00:26:12.170 --> 00:26:15.109
your knowledge, cider.ai is your platform to

00:26:15.109 --> 00:26:17.829
use. So again, I'm an affiliate partner and I'd

00:26:17.829 --> 00:26:22.420
love if you're able to support the show. by clicking

00:26:22.420 --> 00:26:49.039
on my affiliate link and joining Cider. Tell

00:26:49.039 --> 00:26:51.440
me one takeaway. I read every single one and

00:26:51.440 --> 00:26:54.099
it helps more people find the show. That's all

00:26:54.099 --> 00:26:55.799
for this week. Thank you for joining and I will

00:26:55.799 --> 00:26:57.680
see you in the next episode of Data and AI with

00:26:57.680 --> 00:26:59.319
Mukundan, where you learn AI by building.
