WEBVTT

00:00:00.000 --> 00:00:02.620
Two months ago I shipped a tiny AI tool that

00:00:02.620 --> 00:00:05.339
looks at a CSV file and asks better questions

00:00:05.339 --> 00:00:10.359
than I can on day two at a new job. It went way

00:00:10.359 --> 00:00:12.779
further than I had expected and it showed me

00:00:12.779 --> 00:00:17.440
what I had missed. So I rebuilt it. Today I'll

00:00:17.440 --> 00:00:19.920
share what worked, what didn't and I'll give

00:00:19.920 --> 00:00:23.940
you a short quiz, a discussion prompt and a checklist

00:00:23.940 --> 00:00:33.509
that you can use. Hey, I'm Mukundan. This show

00:00:33.509 --> 00:00:35.810
is about solving real problems with small, useful

00:00:35.810 --> 00:00:39.810
AI. Things that you can actually use today. Each

00:00:39.810 --> 00:00:43.490
week, we pick one problem, build a simple workflow

00:00:43.490 --> 00:00:46.130
or tool, and talk through the decisions about

00:00:46.130 --> 00:00:49.250
what to automate, how to check quality, and how

00:00:49.250 --> 00:00:52.710
to make it reliable. If you're a builder, creator,

00:00:52.929 --> 00:00:56.329
or just AI curious, you'll leave with the steps

00:00:56.329 --> 00:01:00.070
that you can copy tonight. Hit follow and start

00:01:00.070 --> 00:01:02.460
with the Startup Pack episodes in the show notes.

00:01:02.579 --> 00:01:04.299
Hey everyone, welcome back to the Data and AI

00:01:04.299 --> 00:01:08.840
with Mukundan show. Today, we'll talk about how

00:01:08.840 --> 00:01:14.500
I built an AI tool that helped look at CSV files

00:01:14.500 --> 00:01:18.420
and Excel files and generate questions from them

00:01:18.420 --> 00:01:21.819
and how I took it to the next level. This is

00:01:21.819 --> 00:01:25.680
about that story. Here's

00:01:25.680 --> 00:01:29.620
why this exists. I call it the day two problem.

00:01:30.890 --> 00:01:35.209
You join a new team in a company, or let's

00:01:35.209 --> 00:01:38.609
just say you even join a new company, and

00:01:38.609 --> 00:01:41.650
someone sends you a data set without any

00:01:41.650 --> 00:01:45.810
context and no goals about what you want to achieve

00:01:45.810 --> 00:01:49.170
with this data. All you get is

00:01:49.170 --> 00:01:50.790
pressure, right? Just pressure. That's all it

00:01:50.790 --> 00:01:57.370
is. And version one of my tool, when I first

00:01:57.370 --> 00:02:01.200
did this project, helped

00:02:01.200 --> 00:02:05.799
me by generating 10 smart exploratory data analysis

00:02:05.799 --> 00:02:09.180
questions, the EDA questions, each with a

00:02:09.180 --> 00:02:12.840
short "why this matters." And it gave me a

00:02:12.840 --> 00:02:17.740
plan instead of me panicking about it, right?

00:02:18.620 --> 00:02:21.639
So I built this tool before. I called this version

00:02:21.639 --> 00:02:28.099
one, right? And what happened was I created

00:02:28.099 --> 00:02:31.060
this exploratory data analysis tool.

00:02:31.060 --> 00:02:34.800
Basically, it would take your CSV file

00:02:34.800 --> 00:02:39.099
and then it would generate questions from it,

00:02:39.099 --> 00:02:43.620
give you 10 smart questions to ask of this

00:02:43.620 --> 00:02:47.300
data set, right, that really help you understand

00:02:47.300 --> 00:02:50.300
the data. And it would also give you a "why

00:02:50.300 --> 00:02:54.120
this matters" statement, why this question matters,

00:02:54.120 --> 00:02:57.960
as well. So it gives you a plan, right? So

00:02:57.960 --> 00:03:00.219
to speak, it tells you what to do.

00:03:00.219 --> 00:03:02.180
It gives you a plan instead of you

00:03:02.180 --> 00:03:08.460
panicking about it. So when

00:03:08.460 --> 00:03:10.840
I did the first version of this, it went viral.

00:03:10.840 --> 00:03:14.580
A lot of people were able to resonate with it.

00:03:14.580 --> 00:03:18.879
Some even reacted to the blog

00:03:18.879 --> 00:03:21.479
post I'd written. It was a blog post with code,

00:03:21.479 --> 00:03:25.479
and I'll link to the blog post as well. So I'll

00:03:25.479 --> 00:03:29.060
tell you what worked, why it went viral and what

00:03:29.060 --> 00:03:33.599
was missing in this version one. So people liked

00:03:33.599 --> 00:03:37.479
three things. They liked that there was a fast

00:03:37.479 --> 00:03:42.719
grip, right? I mean, you would be able to get

00:03:42.719 --> 00:03:46.800
really useful questions from the data in under

00:03:46.800 --> 00:03:51.740
a minute. And that's really quick, right? Like,

00:03:51.740 --> 00:03:54.060
I mean, otherwise we're just spending hours and

00:03:54.060 --> 00:03:56.360
hours, maybe days and days, just looking at the

00:03:56.360 --> 00:03:59.699
data set without knowing what to answer. And

00:03:59.699 --> 00:04:02.080
especially when you're looking at it fresh, you

00:04:02.080 --> 00:04:04.300
don't know what's going on and you don't know

00:04:04.300 --> 00:04:07.500
what question to ask. You don't know how to answer

00:04:07.500 --> 00:04:09.219
that because A, you don't know what question

00:04:09.219 --> 00:04:11.259
to ask really, right? You don't want to be doing

00:04:11.259 --> 00:04:14.819
count distinct, right? It's a SQL aggregate, if you're

00:04:14.819 --> 00:04:17.220
not familiar with it. Count distinct will basically

00:04:19.660 --> 00:04:22.759
just count the unique values

00:04:22.759 --> 00:04:27.720
of whatever column, in whatever

00:04:27.720 --> 00:04:29.660
table, you're looking at. So let's say you

00:04:29.660 --> 00:04:35.579
want to look at the number of, you know, let's

00:04:35.579 --> 00:04:37.319
say you want to look at the number of cars, and

00:04:37.319 --> 00:04:40.720
you would do that by, let's say, car ID, right? You

00:04:40.720 --> 00:04:43.019
have something called a car ID field, so you

00:04:43.019 --> 00:04:46.500
would be using count distinct to see how many

00:04:48.000 --> 00:04:52.279
unique cars there are in this showroom,

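NOTE
A minimal sketch of the count-distinct idea described above, in Python with pandas. The showroom table, its values, and the car_id column name are made up for illustration; the episode only describes the SQL concept.
```python
import pandas as pd
# Hypothetical showroom table: one row per listing, so a car_id can repeat.
showroom = pd.DataFrame({
    "car_id": [101, 102, 102, 103, 101],
    "price": [20000, 18500, 18500, 31000, 20000],
})
# Same idea as SQL's SELECT COUNT(DISTINCT car_id) FROM showroom.
unique_cars = showroom["car_id"].nunique()
print(unique_cars)  # 3 distinct ids: 101, 102, 103
```
This is a useful sanity check, but on its own it says nothing about the business question, which is the point being made here.
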
00:04:52.279 --> 00:04:57.199
for example, right? So that is something that

00:04:57.199 --> 00:05:01.060
will not really help. We want to be asking

00:05:01.060 --> 00:05:05.399
more useful questions, and I'll talk about

00:05:05.399 --> 00:05:07.819
what those questions are as we

00:05:07.819 --> 00:05:10.579
go on, but I just wanted to give you some background

00:05:10.579 --> 00:05:15.199
about this. So the other thing people liked was

00:05:15.199 --> 00:05:20.670
that it acted like a partner. It gave you the vibe

00:05:20.670 --> 00:05:26.009
of a partner, where it nudges but

00:05:26.009 --> 00:05:28.290
doesn't give you a dashboard dump, right?

00:05:28.290 --> 00:05:31.470
It's not dumping everything on you. It's

00:05:31.470 --> 00:05:34.189
just being your partner, right? You support

00:05:34.189 --> 00:05:37.209
your partner. And so it's basically giving

00:05:37.209 --> 00:05:39.930
you those questions, and then once you have those

00:05:39.930 --> 00:05:43.470
questions, all you have to do is

00:05:43.470 --> 00:05:46.279
use your analytical skills to answer those questions.

00:05:46.279 --> 00:05:50.800
And there was also a respect for time. That's

00:05:50.800 --> 00:05:54.019
the third thing people liked, because it was short,

00:05:54.019 --> 00:05:59.459
clear, and human. So to recap, people liked

00:05:59.459 --> 00:06:03.800
three things: it was, A, fast; B, acted like

00:06:03.800 --> 00:06:07.550
your partner; and C, respected your time. So

00:06:07.550 --> 00:06:10.009
I feel like I still missed three key pieces.

00:06:10.209 --> 00:06:12.990
That's what people said. I mean, that's what

00:06:12.990 --> 00:06:19.709
I believed as well. I hadn't mentioned what

00:06:19.709 --> 00:06:23.089
the business objective was up front. Why am

00:06:23.089 --> 00:06:26.189
I doing this exercise? Why am I doing this little

00:06:26.189 --> 00:06:28.029
project, this AI tool? Why did I build that?

00:06:28.649 --> 00:06:32.110
I didn't really mention any objective up front.

00:06:33.350 --> 00:06:40.040
And I also did not focus on any columns, right?

00:06:40.040 --> 00:06:42.699
I treated basically every feature the same, and

00:06:42.699 --> 00:06:47.360
everyone knows that if you treat every column

00:06:47.360 --> 00:06:52.480
in your table the same, how is that going to

00:06:52.480 --> 00:06:59.060
really help you, right? And I also

00:06:59.060 --> 00:07:02.620
wanted to give an option to export the data set,

00:07:02.620 --> 00:07:04.839
or export the questions and everything,

00:07:04.839 --> 00:07:10.050
so that was missing as well. So there was no export

00:07:10.050 --> 00:07:12.490
feature available, no export feature that

00:07:12.490 --> 00:07:16.589
let you drop it into Slack or even into

00:07:16.589 --> 00:07:19.949
Notion or your Jira board, right? So that was

00:07:19.949 --> 00:07:22.610
something I felt was missing. So I wanted to take

00:07:22.610 --> 00:07:25.350
that to the next level. Here were some lessons

00:07:25.350 --> 00:07:30.550
that helped me rebuild this. So I wanted to

00:07:30.550 --> 00:07:35.550
put the decision in the room first. Spend attention

00:07:35.550 --> 00:07:39.129
like money. Attention in this case would be time,

00:07:39.129 --> 00:07:42.230
and time is money. So focus on what I

00:07:42.230 --> 00:07:45.750
wanted to do, what problem I wanted to solve,

00:07:45.750 --> 00:07:49.910
and a little one-page brief so that the

00:07:49.910 --> 00:07:52.029
work actually moves. Here's how the rebuild actually

00:07:52.029 --> 00:07:55.149
took place, with three simple features. Now, before

00:07:55.149 --> 00:07:58.769
anything else, the tool asks: what outcome are

00:07:58.769 --> 00:08:00.949
you trying to move? So this is the objective input,

00:08:00.949 --> 00:08:05.910
the "why" box, right? So let's talk about that. Just

00:08:05.910 --> 00:08:07.949
ask yourself: what outcome are you trying

00:08:07.949 --> 00:08:13.209
to move? So an example is:

00:08:13.209 --> 00:08:15.230
you want to reduce 30-day churn for monthly

00:08:15.230 --> 00:08:18.610
plan users by 15%. An optional mode would be risk,

00:08:18.610 --> 00:08:21.629
opportunity, anomaly, or quick wins, to shape the

00:08:21.629 --> 00:08:25.569
tone. Result: questions became specific and useful.

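NOTE
One way the objective input and the optional mode might be represented, as a hedged Python sketch. The Objective class, the Mode enum, and the prompt wording are assumptions for illustration, not the tool's actual code.
```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional
class Mode(Enum):
    # The four optional modes named in the episode.
    RISK = "risk"
    OPPORTUNITY = "opportunity"
    ANOMALY = "anomaly"
    QUICK_WINS = "quick wins"
@dataclass
class Objective:
    outcome: str                  # the outcome you are trying to move
    mode: Optional[Mode] = None   # optional; shapes the tone of the questions
def prompt_header(objective: Objective) -> str:
    """Build the opening lines of a hypothetical question-generation prompt."""
    if not objective.outcome.strip():
        raise ValueError("State the outcome you are trying to move first.")
    lines = [f"Business objective: {objective.outcome.strip()}"]
    if objective.mode is not None:
        lines.append(f"Question mode: {objective.mode.value}")
    return "\n".join(lines)
header = prompt_header(Objective("Reduce 30-day churn for monthly plan users by 15%", Mode.RISK))
print(header)
```
Refusing an empty outcome is a design choice in this sketch: it forces the decision into the room before any questions are generated.
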
00:08:25.569 --> 00:08:28.670
And then I also wanted to do a column focus, the

00:08:28.670 --> 00:08:31.889
second simple feature. One was the objective input,

00:08:31.889 --> 00:08:35.690
so what outcome I'm trying to move, right? And

00:08:35.690 --> 00:08:39.690
the second one is more about the columns, and this is

00:08:39.690 --> 00:08:43.370
specifically how many columns I want. Five to

00:08:43.370 --> 00:08:45.690
eight columns only is what I chose: pick up to

00:08:45.690 --> 00:08:49.909
eight columns or accept smart hints, right? Ask

00:08:49.909 --> 00:08:55.169
for that. So you want to be able to look

00:08:55.169 --> 00:08:57.450
at the data quickly. If you look at 30 columns

00:08:57.450 --> 00:09:00.950
together, like you want to pick out every feature,

00:09:01.480 --> 00:09:05.039
how is that going to help? So that is not a time

00:09:05.039 --> 00:09:09.799
saving tool, or maybe it saves a little bit, but

00:09:09.799 --> 00:09:12.639
not as much as you would want, right? So the column

00:09:12.639 --> 00:09:16.259
focus is where we pick up to eight columns or

00:09:16.259 --> 00:09:19.179
accept any smart hints, and the tool concentrates

00:09:19.179 --> 00:09:23.919
its thinking budget on what matters and warns

00:09:23.919 --> 00:09:27.809
about any leakage or messy labels. So that was

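NOTE
A rough sketch of the column focus idea in Python with pandas: cap the selection at eight columns and flag messy labels. The function names, the sample data, and the "trim and lowercase" test for messy labels are assumptions for illustration; the episode does not say how the tool detects them.
```python
import pandas as pd
MAX_FOCUS_COLUMNS = 8  # the cap mentioned in the episode
def focus_columns(df: pd.DataFrame, chosen: list) -> pd.DataFrame:
    """Keep only the chosen columns, refusing more than the cap."""
    if len(chosen) > MAX_FOCUS_COLUMNS:
        raise ValueError(f"Pick at most {MAX_FOCUS_COLUMNS} columns, got {len(chosen)}.")
    missing = [c for c in chosen if c not in df.columns]
    if missing:
        raise KeyError(f"Columns not in the data: {missing}")
    return df[chosen]
def messy_label_columns(df: pd.DataFrame, columns: list) -> list:
    """Return the columns whose labels collapse together once trimmed and lowercased."""
    flagged = []
    for col in columns:
        raw = df[col].dropna().astype(str)
        if raw.str.strip().str.lower().nunique() < raw.nunique():
            flagged.append(col)
    return flagged
# Made-up rows: "US" and "us " are the same region written two ways.
users = pd.DataFrame({
    "region": ["US", "us ", "EU", "EU"],
    "plan_type": ["basic", "pro", "basic", "pro"],
    "device": ["ios", "android", "ios", "web"],
})
focused = focus_columns(users, ["region", "plan_type"])
print(messy_label_columns(focused, ["region", "plan_type"]))
```
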
00:09:27.809 --> 00:09:30.409
number two. And then number three was markdown

00:09:30.409 --> 00:09:33.669
export. So ship the plan. So one click and you

00:09:33.669 --> 00:09:36.230
get a clean brief. So the objective in your own

00:09:36.230 --> 00:09:39.009
words, the top 10 questions, and why

00:09:39.009 --> 00:09:41.330
this matters. Those were already there.

00:09:42.669 --> 00:09:46.230
And data health checks, quick analyses to run,

00:09:46.230 --> 00:09:52.750
risks, and next decisions. Then paste it into

00:09:52.750 --> 00:09:56.840
Slack, Notion, or Jira and go. Right, so before

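NOTE
A hedged sketch of what the markdown export could look like in Python. The section names follow the brief described above; the render_brief function, its parameters, and the sample list items are hypothetical, not the tool's real code or output.
```python
def render_brief(objective, questions, health_checks, analyses, risks, next_decisions):
    """Render the brief as markdown. `questions` is a list of (question, why) pairs."""
    lines = ["# EDA brief", "", "## Objective", objective, "", "## Top questions"]
    for number, (question, why) in enumerate(questions, start=1):
        lines.append(f"{number}. {question}")
        lines.append(f"   - Why this matters: {why}")
    sections = [
        ("Data health checks", health_checks),
        ("Quick analyses to run", analyses),
        ("Risks", risks),
        ("Next decisions", next_decisions),
    ]
    for title, items in sections:
        lines += ["", f"## {title}"]
        lines += [f"- {item}" for item in items]
    return "\n".join(lines) + "\n"
brief = render_brief(
    objective="Reduce 30-day churn for monthly plan users by 15%",
    questions=[("Are churn rates higher for paid social versus organic search?", "CAC payback risk")],
    health_checks=["Check missing values in the focus columns"],
    analyses=["Churn rate by signup channel"],
    risks=["Possible target leakage"],
    next_decisions=["Decide whether to rebalance paid social spend"],
)
print(brief)
```
Plain markdown is used here because it pastes into Slack, Notion, or Jira without any conversion step.
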
00:09:56.840 --> 00:09:58.759
anything else, like I mentioned, these were

00:09:58.759 --> 00:10:02.299
the three rebuild features, simple features that

00:10:02.299 --> 00:10:05.399
I've worked on: the objective input, where

00:10:05.399 --> 00:10:10.100
you're trying to figure out

00:10:10.100 --> 00:10:12.460
what outcome you're trying to move; the column

00:10:12.460 --> 00:10:14.940
focus, where you're picking up to eight columns

00:10:14.940 --> 00:10:18.720
or accepting any smart hints; and the markdown export,

00:10:18.720 --> 00:10:23.700
where you're able to export

00:10:23.700 --> 00:10:26.500
the brief, really: the objective,

00:10:26.500 --> 00:10:28.980
the 10 questions, why this matters, the data health

00:10:28.980 --> 00:10:31.840
checks, quick analyses to run, risks, and any

00:10:31.840 --> 00:10:34.799
next decisions. All right, so here's a fun

00:10:34.799 --> 00:10:37.899
session for you, a lightning quiz.

00:10:37.899 --> 00:10:40.100
Here's a quick quiz that I just prepared,

00:10:40.100 --> 00:10:42.860
and I'll read the answers right after. I'll give

00:10:42.860 --> 00:10:46.220
you some time to think if you're listening,

00:10:46.220 --> 00:10:50.620
because I don't want to directly

00:10:50.620 --> 00:10:52.559
prompt the answers, and maybe just give you a

00:10:52.559 --> 00:10:54.259
few seconds to think. First, I'll just give you

00:10:54.259 --> 00:10:56.399
the questions and then you can think about it

00:10:56.399 --> 00:10:59.259
and then, you know, obviously give me the answers.

00:11:00.539 --> 00:11:03.019
Well, I mean, the answers I will be giving you,

00:11:03.080 --> 00:11:07.580
but you can also come up with your own. You can

00:11:07.580 --> 00:11:11.500
also send me on chat. Just so that I know you're

00:11:11.500 --> 00:11:14.840
listening and you're paying attention. So yeah,

00:11:14.899 --> 00:11:16.740
just quick quiz. I'll read the answers right

00:11:16.740 --> 00:11:20.120
after. So pause if you need more time. Question

00:11:20.120 --> 00:11:25.799
one, true or false? Start exploratory data analysis

00:11:25.799 --> 00:11:31.360
or EDA by uploading the data and exploring the

00:11:31.360 --> 00:11:33.720
charts, right? So I'll just repeat this again.

00:11:33.879 --> 00:11:40.379
True or false? Start EDA by uploading data and

00:11:40.379 --> 00:11:43.799
exploring charts. So the claim is that you

00:11:43.799 --> 00:11:47.429
upload the data, explore the

00:11:47.429 --> 00:11:50.629
charts, and that's how you start EDA. Question 2. You

00:11:50.629 --> 00:11:54.370
have 50 columns. What do you focus on? 5 to 8,

00:11:54.529 --> 00:11:58.710
that's option A. Option B, 20 to 25. Option C,

00:11:58.769 --> 00:12:02.110
all 50. Question 3. What comes first? Writing a business

00:12:02.110 --> 00:12:07.049
objective, option A. Cleaning every column, option

00:12:07.049 --> 00:12:10.950
B. And option C, running 20 plots. Question 4.

00:12:11.129 --> 00:12:14.629
If you were asked to predict

00:12:14.629 --> 00:12:18.269
the 30-day churn, and you're asked what was the

00:12:18.269 --> 00:12:21.169
leakage here, would it be tenure days on day

00:12:21.169 --> 00:12:25.190
one, would it be churned in 30 days, or would it

00:12:25.190 --> 00:12:28.509
be region? So just to reiterate, leakage means

00:12:28.509 --> 00:12:31.750
really the target variable slipping in as a feature, right? If you're asked

00:12:31.750 --> 00:12:34.590
to predict the 30-day churn, what would

00:12:34.590 --> 00:12:36.529
be the target variable? Would it be tenure

00:12:36.529 --> 00:12:40.490
days on day one, would it be churned in 30 days,

00:12:40.490 --> 00:12:45.320
or would it be region? So there are three options, right?

00:12:45.659 --> 00:12:47.720
I'll just repeat the options again just so that

00:12:47.720 --> 00:12:51.960
you guys are clear. Option A, tenure days on

00:12:51.960 --> 00:12:55.019
day one. Option B, churned in 30 days. And option

00:12:55.019 --> 00:12:57.879
C, region. And finally, question five. So just

00:12:57.879 --> 00:13:02.240
to reiterate, what is the best artifact after

00:13:02.240 --> 00:13:05.700
an exploratory data analysis? Is it option A,

00:13:05.840 --> 00:13:10.100
a dashboard gallery? Is it option B, a one-page

00:13:10.100 --> 00:13:14.159
markdown brief? Or is it option C, a raw notebook?

00:13:14.580 --> 00:13:19.799
Question 1, true or false? Start EDA. Would you

00:13:19.799 --> 00:13:22.840
start EDA by uploading the data and exploring

00:13:22.840 --> 00:13:27.620
charts? Question 2, when you have 50 columns,

00:13:27.899 --> 00:13:31.700
what would you focus on? 5 to 8 columns, option

00:13:31.700 --> 00:13:35.120
A. Option B, 20 to 25 columns. Or option C, all

00:13:35.120 --> 00:13:40.860
50 columns. Question 3, what comes first? A,

00:13:40.879 --> 00:13:45.220
writing a business objective. B, cleaning every

00:13:45.220 --> 00:13:49.759
column. Or C, running 20 plots. And question

00:13:49.759 --> 00:13:53.080
five, what would be the best artifact to give

00:13:53.080 --> 00:13:57.500
after an EDA? Would it be a dashboard gallery?

00:13:57.740 --> 00:14:01.379
Would it be a one-page markdown brief? So option

00:14:01.379 --> 00:14:04.039
B is the one-page markdown brief, and option A is the dashboard

00:14:04.039 --> 00:14:07.840
gallery. And option C, would it be a raw notebook?

00:14:08.570 --> 00:14:11.889
Here we go with the answers. So answer one,

00:14:11.889 --> 00:14:16.950
true or false: it's false. So you would first

00:14:16.950 --> 00:14:20.850
start with the decision, right? You wouldn't directly

00:14:20.850 --> 00:14:23.070
just do EDA. That's what it means, really: upload

00:14:23.070 --> 00:14:25.909
data and do EDA, that's not how it works.

00:14:25.909 --> 00:14:28.850
You need to understand the business question

00:14:28.850 --> 00:14:31.950
and everything, so don't directly jump into exploring

00:14:31.950 --> 00:14:36.970
charts. Answer two is five to eight columns.

00:14:38.570 --> 00:14:41.289
So I asked how many

00:14:41.289 --> 00:14:43.450
columns you would choose. If you have 50

00:14:43.450 --> 00:14:46.330
columns, which would you focus on? The answer is

00:14:46.330 --> 00:14:49.490
five to eight columns. You don't want to be overburdening

00:14:49.490 --> 00:14:51.929
yourself, right? So that's something you would

00:14:51.929 --> 00:14:56.090
tell the AI to do as well. Answer three: what comes

00:14:56.090 --> 00:14:59.210
first between writing a business objective, cleaning

00:14:59.210 --> 00:15:02.669
every column, or running 20 plots? The objective

00:15:02.669 --> 00:15:06.389
comes first. So the options were writing a business

00:15:06.389 --> 00:15:09.139
objective, cleaning every column, or running 20

00:15:09.139 --> 00:15:12.840
plots. So objective, writing a business objective

00:15:12.840 --> 00:15:15.480
comes first. Question four was predicting 30-day

00:15:15.480 --> 00:15:18.200
churn. What was the leakage, the output variable? Was

00:15:18.200 --> 00:15:20.960
it tenure days on day one? Was it churned in

00:15:20.960 --> 00:15:24.340
30 days or was it region? The answer is churned

00:15:24.340 --> 00:15:25.960
in 30 days because that is the target variable.

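NOTE
A small, hedged Python sketch of the leakage check behind question four: a feature is suspicious if it is the target itself or an exact copy of it. The leakage_suspects helper, the sample rows, and the cancelled_flag column are hypothetical. Real leakage checks also need to ask whether a column would be known at prediction time, which this sketch does not cover.
```python
import pandas as pd
def leakage_suspects(df: pd.DataFrame, target: str, features: list) -> list:
    """Flag features that are the target itself or an exact copy of its values."""
    suspects = []
    for col in features:
        if col == target or df[col].equals(df[target]):
            suspects.append(col)
    return suspects
users = pd.DataFrame({
    "tenure_days_on_day_1": [1, 1, 1, 1],
    "churned_in_30_days": [1, 0, 0, 1],
    "region": ["US", "EU", "US", "EU"],
    "cancelled_flag": [1, 0, 0, 1],  # hypothetical copy of the target under another name
})
features = ["tenure_days_on_day_1", "churned_in_30_days", "region", "cancelled_flag"]
suspects = leakage_suspects(users, "churned_in_30_days", features)
print(suspects)
```
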
00:15:26.279 --> 00:15:29.919
Question five, what was the best artifact after

00:15:29.919 --> 00:15:34.720
an exploratory data analysis? Is it a dashboard

00:15:34.720 --> 00:15:37.370
gallery, a one-page markdown brief, or a raw notebook?

00:15:37.370 --> 00:15:40.990
The answer is a one-page brief. So those were the answers.

00:15:40.990 --> 00:15:45.870
So again, just to reiterate, answer one was false,

00:15:45.870 --> 00:15:48.450
so you don't directly start EDA. You first start

00:15:48.450 --> 00:15:51.730
with the business question, really, what you

00:15:51.730 --> 00:15:55.529
need to decide. So that is one. Answer two was: if

00:15:55.529 --> 00:15:57.730
you have 50 columns, you focus on five to eight

00:15:57.730 --> 00:16:02.230
columns first. Question three, what comes first

00:16:03.889 --> 00:16:06.289
between writing a business objective, cleaning

00:16:06.289 --> 00:16:08.769
every column and running 20 plots, the objective

00:16:08.769 --> 00:16:12.269
comes first. And answer four, when you're trying

00:16:12.269 --> 00:16:14.710
to predict the 30-day churn, what is the output

00:16:14.710 --> 00:16:19.009
variable? Is it tenure days on day one, churned

00:16:19.009 --> 00:16:21.210
in 30 days or region? The answer is churned in

00:16:21.210 --> 00:16:24.889
30 days. And answer five, best artifact after

00:16:24.889 --> 00:16:28.700
EDA? A dashboard gallery, one-page markdown

00:16:28.700 --> 00:16:30.980
brief, or raw notebook? The answer: a one-page

00:16:30.980 --> 00:16:34.840
brief is the best artifact after an EDA. So that

00:16:34.840 --> 00:16:38.419
was the quiz. If you scored a 5 on 5, you're

00:16:38.419 --> 00:16:41.559
ready. You're ready to be the expert data analyst

00:16:41.559 --> 00:16:46.419
yourself. If you scored three or

00:16:46.419 --> 00:16:51.519
four out of five, just use the checklist next. And if you

00:16:51.519 --> 00:16:55.980
scored between zero and two out of five, so, again, three to four,

00:16:55.980 --> 00:16:57.620
just use the checklist, and

00:16:57.620 --> 00:17:03.580
zero to two, don't worry, this

00:17:03.580 --> 00:17:05.980
checklist will help a lot. All right, so discussion

00:17:05.980 --> 00:17:09.759
time. I want to do a little discussion

00:17:09.759 --> 00:17:12.420
with you, and this is a discussion that

00:17:12.420 --> 00:17:15.099
will be held on my Substack. So basically, Substack,

00:17:15.099 --> 00:17:17.339
if you're not familiar with it,

00:17:17.339 --> 00:17:21.619
is one of the great content creation websites.

00:17:21.619 --> 00:17:27.269
It gives content creators a platform to be

00:17:27.269 --> 00:17:33.109
heard, primarily, I would say, through their

00:17:33.109 --> 00:17:36.210
written content. Yeah, sure, you get a lot

00:17:36.210 --> 00:17:40.430
of audio and video, podcast hosts or whatever,

00:17:40.430 --> 00:17:44.349
but I think primarily, from what I've seen, what works

00:17:44.349 --> 00:17:47.650
a lot on Substack is written text. So that's a

00:17:47.650 --> 00:17:50.849
good platform for that. A lot of creators are

00:17:50.849 --> 00:17:53.190
getting their voices heard on Substack. But anyway,

00:17:54.539 --> 00:17:58.500
I wanted to also invite a discussion on Substack.

00:17:59.079 --> 00:18:02.279
So here's what I want to discuss. I want

00:18:02.279 --> 00:18:07.660
to hear your story and your process. Let's say

00:18:07.660 --> 00:18:11.279
you joined that new company, your new team on

00:18:11.279 --> 00:18:14.200
day two or day three, maybe week one for that

00:18:14.200 --> 00:18:18.180
matter. You were asked to look at a data set. And,

00:18:18.200 --> 00:18:21.720
you know, they just asked you, like,

00:18:21.740 --> 00:18:25.609
tell me, this is messy data, what will you come

00:18:25.609 --> 00:18:30.410
up with? That is a lot of stress, because you're

00:18:30.410 --> 00:18:32.849
trying to make an impression

00:18:32.849 --> 00:18:36.589
on this team. You want to be on top,

00:18:36.589 --> 00:18:38.470
right? You want to make a great impression.

00:18:38.470 --> 00:18:42.450
You want them to feel that they made a good choice

00:18:42.450 --> 00:18:45.890
in hiring you for their team. So that's why

00:18:45.890 --> 00:18:49.869
I say "day two data set horror story." So yeah,

00:18:50.509 --> 00:18:53.009
I want to hear your story in this. How do you

00:18:53.009 --> 00:18:57.170
handle a situation like this? The other thing

00:18:57.170 --> 00:19:00.369
I wanted to discuss is: what

00:19:00.369 --> 00:19:02.930
is one metric that you would move in 30 days,

00:19:03.089 --> 00:19:09.269
in your first 30 days? And why? Which metric

00:19:09.269 --> 00:19:13.029
would you move, and why? And the third thing:

00:19:13.170 --> 00:19:15.289
the five to eight columns that you would start

00:19:15.289 --> 00:19:17.450
with right now. So let's just say you're looking

00:19:17.450 --> 00:19:22.869
at maybe a churn data set. What would be the five

00:19:22.869 --> 00:19:24.710
to eight columns that you'd want to look at right

00:19:24.710 --> 00:19:28.109
now? So again: what is your day two data set horror

00:19:28.109 --> 00:19:31.190
story? What is one metric that you would move

00:19:31.190 --> 00:19:35.630
in your first 30 days, or in 30 days in

00:19:35.630 --> 00:19:39.170
general, and why would you move that? And third,

00:19:39.170 --> 00:19:42.130
what are the five to eight columns that you would

00:19:42.130 --> 00:19:46.650
start with right now? And an example here could

00:19:46.650 --> 00:19:48.650
be anything that you would want to talk about from

00:19:48.650 --> 00:19:51.150
your organization, or anything that you, you know,

00:19:51.150 --> 00:19:52.970
are passionate about, with these five to eight

00:19:52.970 --> 00:19:56.890
columns. I would, you know, definitely appreciate

00:19:56.890 --> 00:20:00.190
it if you would also add a comment in this

00:20:00.190 --> 00:20:04.589
discussion with an okay to read, like a comment

00:20:04.589 --> 00:20:08.549
in quotes saying "okay to read," if I can feature

00:20:08.549 --> 00:20:11.809
your reply in the next episode, or even anonymously

00:20:11.809 --> 00:20:16.599
if you prefer, right? I read every comment. So

00:20:16.599 --> 00:20:18.779
join the discussion and I'll link the discussion

00:20:18.779 --> 00:20:21.799
in the show notes. So that was a discussion that

00:20:21.799 --> 00:20:24.539
I wanted to have. Now I wanted to give a quick

00:20:24.539 --> 00:20:27.059
walkthrough of how an app like this would work.

00:20:27.440 --> 00:20:29.559
So let's just say that you're talking about a

00:20:29.559 --> 00:20:33.319
subscription app. That's the scenario. The objective

00:20:33.319 --> 00:20:37.400
is to reduce the 30-day churn for monthly

00:20:37.400 --> 00:20:43.059
plan users by 15%. The focus columns, the

00:20:43.059 --> 00:20:44.740
columns you want to focus on, would be plan

00:20:44.740 --> 00:20:49.130
type, signup channel, tenure days, region,

00:20:49.450 --> 00:20:55.910
device, discount applied, and support tickets,

00:20:56.049 --> 00:20:58.289
which is number of support tickets and first

00:20:58.289 --> 00:21:02.609
week activity rate. So here are some of the questions

00:21:02.609 --> 00:21:05.509
the tool generates, with the why. But let me just

00:21:05.509 --> 00:21:08.369
go over the columns again, and maybe the

00:21:08.369 --> 00:21:10.109
scenario. So the scenario is let's say you want

00:21:10.109 --> 00:21:12.529
to build a subscription app, right? And this

00:21:12.529 --> 00:21:16.579
could be, let's say, you're working for a subscription-based

00:21:16.579 --> 00:21:21.220
company and their primary revenue is coming

00:21:21.220 --> 00:21:23.900
from subscriptions. For example, you can take

00:21:23.900 --> 00:21:25.519
Netflix or whatever, they're subscription-based,

00:21:25.519 --> 00:21:30.619
and your objective for them is you want

00:21:30.619 --> 00:21:34.119
to reduce 30-day churn for monthly

00:21:34.119 --> 00:21:38.559
plan users by 15%, right? So that would be an

00:21:38.559 --> 00:21:42.960
objective. The columns you want to focus on are

00:21:42.960 --> 00:21:45.740
plan type and signup channel: how did they sign up,

00:21:45.740 --> 00:21:47.640
and what is the plan type, is it a

00:21:47.640 --> 00:21:50.779
basic plan or whatever; tenure days, how many days

00:21:50.779 --> 00:21:54.579
they've been a subscriber; region, where

00:21:54.579 --> 00:21:57.099
in the world are they from; device, which device did they

00:21:57.099 --> 00:22:00.119
sign up on or which device are they using

00:22:00.119 --> 00:22:06.400
to watch the shows, in this case; and if

00:22:06.400 --> 00:22:10.039
they had any discount applied. Where I am, in the

00:22:10.039 --> 00:22:12.559
United States at least, I don't think Netflix

00:22:12.559 --> 00:22:15.700
has any discount anymore. But that being said,

00:22:15.700 --> 00:22:17.380
it could be there anywhere else

00:22:17.380 --> 00:22:20.519
in the world. So discount applied is another

00:22:20.519 --> 00:22:23.720
column here. And support tickets is the number of

00:22:23.720 --> 00:22:29.480
support tickets that this user

00:22:29.480 --> 00:22:33.500
has created, this user has requested, right? And

00:22:33.500 --> 00:22:36.349
then the first week activity rate: they started

00:22:36.349 --> 00:22:41.450
using it, and how much time they've been watching

00:22:41.450 --> 00:22:46.970
it in the first week. So this is, I'm just giving

00:22:46.970 --> 00:22:49.710
Netflix as an example, but whatever, right? Could

00:22:49.710 --> 00:22:51.990
be any kind of subscription app. The objective

00:22:51.990 --> 00:22:55.109
really is to reduce the 30-day churn for monthly

00:22:55.109 --> 00:22:59.950
plan users by about 15%. So now let's look at

00:22:59.950 --> 00:23:02.049
some of the questions that the tool has generated

00:23:02.049 --> 00:23:05.069
with the why. The first question it came up with was:

00:23:05.960 --> 00:23:09.559
Are churn rates higher for paid social versus

00:23:09.559 --> 00:23:12.940
organic search? And the why is because it's a

00:23:12.940 --> 00:23:17.359
customer acquisition cost or a CAC payback risk,

00:23:17.440 --> 00:23:21.140
right? Because with organic search, you're not really

00:23:21.140 --> 00:23:23.519
losing any money. With paid social, you're paying

00:23:23.519 --> 00:23:27.799
for that visibility. So if a customer who

00:23:27.799 --> 00:23:30.059
joined through organic search churns, you're losing

00:23:30.059 --> 00:23:33.819
that customer, sure, versus you're losing a

00:23:33.819 --> 00:23:36.880
customer who joined through paid social. So

00:23:36.880 --> 00:23:39.599
that's a problem. I mean, if you're losing

00:23:39.599 --> 00:23:41.519
more on paid social versus on organic search,

00:23:41.519 --> 00:23:45.819
then your paid social strategy is clearly

00:23:45.819 --> 00:23:48.079
not working. So that's going to be a big

00:23:48.079 --> 00:23:51.240
problem. Your customer acquisition cost will

00:23:51.240 --> 00:23:55.119
just rise higher, right? Well, I mean, customer

00:23:55.119 --> 00:23:56.900
acquisition cost is one thing, but I think what

00:23:56.900 --> 00:24:00.140
you're going to face, especially here in this

00:24:00.140 --> 00:24:03.240
case, now that I'm thinking about it, is: well, you spent

00:24:03.240 --> 00:24:07.200
this money, now you want to make money from this

00:24:07.200 --> 00:24:11.759
customer, a payback, so to speak. So that

00:24:11.759 --> 00:24:13.920
would affect the payback that you get from the

00:24:13.920 --> 00:24:18.660
customer, because their return is low. That

00:24:18.660 --> 00:24:22.059
is one question the second question is do sign

00:24:22.059 --> 00:24:25.359
up discounts lead to lower first week activity

00:24:26.509 --> 00:24:29.009
and a higher churn the why is a habit formation

00:24:29.009 --> 00:24:31.670
so let's think about this question right so let's

00:24:31.670 --> 00:24:35.430
just say you joined this streaming uh company

00:24:35.430 --> 00:24:41.630
does the sign up discount that you got lead to

00:24:41.630 --> 00:24:44.150
a lower first week activity and higher churn

00:24:44.150 --> 00:24:46.329
so if they received a discount while joining

00:24:46.329 --> 00:24:53.539
did they get um did they churn more So if they

00:24:53.539 --> 00:24:56.460
got a 40% discount versus maybe a 0% discount,

00:24:56.519 --> 00:24:59.599
did the 40% discount person churn more than

00:24:59.599 --> 00:25:02.900
the 0% discount? Probably because the person

00:25:02.900 --> 00:25:05.900
who's paid more wants to stay and try to make

00:25:05.900 --> 00:25:09.059
his money's worth, right? Or yeah, and obviously

00:25:09.059 --> 00:25:11.559
they probably engage with the app less as well.

00:25:12.079 --> 00:25:14.359
I think usually that kind of behavior I engage

00:25:14.359 --> 00:25:17.619
with, right? If I pay more, I'm inclined to stay

00:25:17.619 --> 00:25:19.660
more because, well, I want to get my money's

00:25:19.660 --> 00:25:23.480
worth. um so that is more of a habit formation

00:25:23.480 --> 00:25:26.660
thing that's what the ai comes up with third

00:25:26.660 --> 00:25:31.059
question do week two escalations predict churn

00:25:31.059 --> 00:25:35.099
so this customer has been there and in week two

00:25:35.099 --> 00:25:38.099
they try to do a lot of support ticket escalations

00:25:38.099 --> 00:25:42.259
does that help predict churn so they create a

00:25:42.259 --> 00:25:44.619
lot of support tickets and if those tickets have

00:25:44.619 --> 00:25:47.000
been escalated does that mean that they are going

00:25:47.000 --> 00:25:50.549
to churn? Why? Because for early customers, that

00:25:50.549 --> 00:25:55.150
is early experience pain. So that's what you want

00:25:55.150 --> 00:25:58.130
to try to avoid. So that was the third question.

00:25:58.710 --> 00:26:02.549
Now the fourth question is any region and device

00:26:02.549 --> 00:26:06.049
combos with short tenure? So you want to see

00:26:06.049 --> 00:26:11.950
if maybe people from Northeast United States,

00:26:11.970 --> 00:26:20.029
for example, are they using mobile more but they

00:26:20.029 --> 00:26:24.529
also stay on the platform less they use the streaming

00:26:24.529 --> 00:26:27.589
device very less right i mean the streaming subscription

00:26:27.589 --> 00:26:29.789
very less so that's something you'd want to see

00:26:29.789 --> 00:26:36.309
so the why behind this is user interface

00:26:36.309 --> 00:26:40.609
parity or localization right is it getting skewed

00:26:40.609 --> 00:26:44.130
in one area or not is is uh the geographic area

00:26:44.130 --> 00:26:46.809
playing a part that's what you want to see here

00:26:46.809 --> 00:26:50.049
and uh so another question it came up with was

00:26:50.049 --> 00:26:54.250
is the churn nonlinear versus the first week

00:26:54.250 --> 00:26:57.430
activity so let's just say you know you're trying

00:26:57.430 --> 00:27:00.450
to churn right is it you know at all proportional

00:27:00.450 --> 00:27:03.650
with how much you're interacting in the first

00:27:03.650 --> 00:27:05.730
week so is the first week activity depending

00:27:05.730 --> 00:27:08.549
on i mean is the first week activity affecting

00:27:08.549 --> 00:27:12.609
churn like the more you interact meaning the

00:27:12.609 --> 00:27:14.779
less you churn or something like that right or

00:27:14.779 --> 00:27:19.319
is it not related as much not related alone with

00:27:19.319 --> 00:27:21.880
the first week activity so maybe it's related

00:27:21.880 --> 00:27:25.680
but like not in a linear way you know this will

00:27:25.680 --> 00:27:30.059
help you uh the why here is like an activation

00:27:30.059 --> 00:27:34.339
threshold so you can obviously simplify this

00:27:34.339 --> 00:27:38.279
language and everything but that's how uh the

00:27:38.279 --> 00:27:40.619
kind of questions the tool will help you generate

00:27:40.619 --> 00:27:43.240
and it'll give you some data health notes also

00:27:44.039 --> 00:27:47.680
give you negative tenure days, inconsistent

00:27:47.680 --> 00:27:51.440
region labels, and 12% missing data, missing

00:27:51.440 --> 00:27:54.039
discount applied. So this is more of a data health

00:27:54.039 --> 00:27:58.059
check for, maybe you can think of it like data

00:27:58.059 --> 00:28:01.819
consistency or, you know, you want to have some

00:28:01.819 --> 00:28:04.279
consistent data. So if you see some tenure

00:28:04.279 --> 00:28:06.539
days with a negative, it means something is up,

00:28:06.700 --> 00:28:09.200
right? You wouldn't expect these to be negative.

00:28:09.920 --> 00:28:14.819
Or the region labels. Where the person is from,

00:28:15.000 --> 00:28:18.599
you would not want to see different spellings,

00:28:18.599 --> 00:28:22.279
for example, different labels. So that is affecting

00:28:22.279 --> 00:28:25.420
something. And the other thing is discount applied.
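
NOTE
Editor's sketch of the three data health notes mentioned here (negative tenure days, inconsistent region labels, missing discount values), using pandas.
This is a minimal illustration, not the tool's actual implementation: the column names tenure_days, region and discount_applied and the sample rows are assumptions.
```python
import pandas as pd
def health_notes(df: pd.DataFrame) -> list[str]:
    """Return friendly data health notes for a small churn-style table."""
    notes = []
    # Negative tenure is impossible, so count those rows.
    negative = int((df["tenure_days"] < 0).sum())
    if negative:
        notes.append(f"{negative} row(s) have negative tenure_days")
    # Labels that collapse to the same key after normalizing are likely duplicates.
    raw = df["region"].dropna().astype(str)
    key = raw.str.lower().str.replace(r"[^a-z0-9]", "", regex=True)
    variants = raw.groupby(key).nunique()
    if (variants > 1).any():
        notes.append("region has inconsistent labels")
    # Share of missing values in discount_applied, as a percentage.
    missing = df["discount_applied"].isna().mean() * 100
    if missing > 0:
        notes.append(f"{missing:.0f}% of discount_applied is missing")
    return notes
# Hypothetical sample rows, only to exercise the checks.
sample = pd.DataFrame({
    "tenure_days": [30, -4, 12, 90],
    "region": ["Northeast", "north-east", "West", "West"],
    "discount_applied": [0.4, None, 0.0, 0.0],
})
notes = health_notes(sample)
```
Each note is a plain sentence, so it can be pasted straight into the one-page brief.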

00:28:26.599 --> 00:28:29.299
If there's any missing data or something, that

00:28:29.299 --> 00:28:31.500
is what it is telling you. And then finally,

00:28:31.539 --> 00:28:34.140
an export of a one-page brief to paste into Slack

00:28:34.140 --> 00:28:37.960
and align the team. So these are the three steps

00:28:37.960 --> 00:28:41.809
for that export. So here are some choices. the

00:28:41.809 --> 00:28:44.750
product or engineering can make put the objective

00:28:44.750 --> 00:28:48.009
first focus second and then the questions come

00:28:48.009 --> 00:28:51.009
in so business objective goes first what do you

00:28:51.009 --> 00:28:54.049
want to focus on columns wise and stuff and then

00:28:54.049 --> 00:28:58.049
what questions you want to ask. Then pre-compute

00:28:58.049 --> 00:29:01.430
small statistics so that the model reasons more

00:29:01.430 --> 00:29:06.269
and, you know, rummages less. What does rummage

00:29:06.269 --> 00:29:10.400
mean really it means that it's not it's not going

00:29:10.400 --> 00:29:14.680
systematically, right? So you want the model

00:29:14.680 --> 00:29:18.440
to go systematically. So you want it to reason

00:29:18.440 --> 00:29:22.319
more. So small statistic needs to be computed,

00:29:22.460 --> 00:29:25.799
but also monitor these guardrails. Guardrails

00:29:25.799 --> 00:29:27.680
are like, you know, one thing's going up, you

00:29:27.680 --> 00:29:28.980
don't want other things to go down and stuff.

00:29:29.119 --> 00:29:31.759
So guardrail is one thing's going up, you don't

00:29:31.759 --> 00:29:33.559
want other things going down. So guardrail is

00:29:33.559 --> 00:29:38.359
like, you know, metrics like churn, bounce rate. So

00:29:38.359 --> 00:29:42.210
any kind of, uh, flag if any kind of leakage is there

00:29:42.210 --> 00:29:45.950
or uh data leakage or messy labels are there

00:29:45.950 --> 00:29:50.269
those should be you know addressed keep the copy

00:29:50.269 --> 00:29:54.650
simple and reduce any blank box anxiety with

00:29:54.650 --> 00:29:56.930
examples so what does this mean people freeze

00:29:56.930 --> 00:29:59.490
basically when they uh let's look at what does

00:29:59.490 --> 00:30:01.710
keep copy simple mean so when you're writing

00:30:01.710 --> 00:30:05.170
the text the users see, you know, which

00:30:05.170 --> 00:30:07.809
includes labels, instructions, or button

00:30:07.809 --> 00:30:11.859
text, don't get clever or verbose.

00:30:11.859 --> 00:30:15.400
Instead of "articulate your analytical objective

00:30:15.400 --> 00:30:18.859
here," just say "What's your goal? What do you want

00:30:18.859 --> 00:30:22.859
to find out?" Right? Just simple words, less thinking,

00:30:22.859 --> 00:30:26.220
and it's a smoother flow so that's simple uh

00:30:26.220 --> 00:30:30.019
copy and blank box anxiety means like people

00:30:30.019 --> 00:30:33.220
freeze when they see a blank input field with

00:30:33.220 --> 00:30:37.240
no guidance so example here would be if the tool

00:30:37.240 --> 00:30:40.259
asks what's your objective and leaves a giant

00:30:40.259 --> 00:30:43.160
empty box a user might not know what to type

00:30:43.160 --> 00:30:47.279
this is called blank box anxiety the stress basically

00:30:47.279 --> 00:30:49.779
you're starting from nothing and with examples

00:30:49.779 --> 00:30:54.140
here means so you remove that anxiety by showing

00:30:54.140 --> 00:30:59.319
an example or two right in the box or next to

00:30:59.319 --> 00:31:03.660
it yeah some example placeholders example it

00:31:03.660 --> 00:31:10.319
would be explain the dip in the week of 7/06

00:31:10.319 --> 00:31:15.819
to 7/12, right? That was a week. Or explain like

00:31:15.819 --> 00:31:19.119
a march sales dip or something that or it could

00:31:19.119 --> 00:31:22.160
be find churn drivers for monthly plan. Now basically

00:31:22.160 --> 00:31:25.559
the user sees how how to phrase their input and

00:31:25.559 --> 00:31:29.240
they can just copy the style to summarize it

00:31:29.240 --> 00:31:32.359
basically means use short and plain words in

00:31:32.359 --> 00:31:35.940
your ui copy And whenever you ask users to type

00:31:35.940 --> 00:31:39.940
something into a box, show them an example so

00:31:39.940 --> 00:31:43.000
that they don't freeze up. Always prompt them

00:31:43.000 --> 00:31:46.480
so they're not left guessing. And to now go over

00:31:46.480 --> 00:31:49.319
guardrails, what it means is like when you're

00:31:49.319 --> 00:31:53.339
building or using an AI EDA helper, like an exploratory

00:31:53.339 --> 00:31:56.279
data analysis helper for AI, you don't want it

00:31:56.279 --> 00:31:59.059
to silently accept everything. You want it to

00:31:59.059 --> 00:32:01.079
flag potential issues that it sees in a data

00:32:01.079 --> 00:32:03.859
set, but in a friendly guiding way. but not in

00:32:03.859 --> 00:32:07.059
a scary error message. Here are two common

00:32:07.059 --> 00:32:10.160
issues, right? One is on the leakage side. That

00:32:10.160 --> 00:32:12.920
could mean that the data that cheats by including

00:32:12.920 --> 00:32:17.000
information, you wouldn't realistically know

00:32:17.000 --> 00:32:22.519
at the prediction time. So example here is, let's

00:32:22.519 --> 00:32:24.660
say you're trying to predict churn at a signup,

00:32:24.740 --> 00:32:27.579
right? But your dataset includes a column called

00:32:27.579 --> 00:32:31.079
churn in 30 days. That's leakage. It gives away

00:32:31.079 --> 00:32:34.480
the answer. The guardrail is that

00:32:34.480 --> 00:32:39.220
the tool can warn, um, with the warning: "Heads up,

00:32:39.220 --> 00:32:42.599
column churned in 30 days looks like an outcome

00:32:42.599 --> 00:32:46.339
variable. Using it might cause leakage." So that's

00:32:46.339 --> 00:32:53.039
what is a leakage right you don't want um your

00:32:53.039 --> 00:32:55.299
output variable to be essentially included in

00:32:55.299 --> 00:32:58.680
the input which is called leakage which would

00:32:58.680 --> 00:33:01.440
mean your model is overtrained or overfitted

00:33:01.440 --> 00:33:05.420
in machine learning terms. Now, messy labels

00:33:05.420 --> 00:33:08.380
is the other thing which happens. Definition

00:33:08.380 --> 00:33:13.000
here would be, let's just say your categorical

00:33:13.000 --> 00:33:17.500
values that are inconsistent, redundant, or even

00:33:17.500 --> 00:33:22.660
unclear. So example, let's take a region column

00:33:22.660 --> 00:33:28.660
with values of US, US. let's say you have a region

00:33:28.660 --> 00:33:32.339
column with values US and then another one with

00:33:32.339 --> 00:33:37.000
U.S. and another one with USA and another one

00:33:37.000 --> 00:33:39.720
with North America, and all are mixed together,

00:33:39.720 --> 00:33:43.759
right? The tool can say, as a guardrail: "Notice,

00:33:43.759 --> 00:33:49.980
region has overlapping categories: US, U.S., and

00:33:49.980 --> 00:33:53.240
USA. You may want to clean this." And it will give

00:33:53.240 --> 00:33:55.180
you like friendly tips right like that so if

00:33:55.180 --> 00:33:59.359
the tool says "Error: invalid data," that's a very

00:33:59.359 --> 00:34:01.460
vague message and users can freeze up on it.

00:34:01.819 --> 00:34:04.059
They'll be like, what does that invalid data mean,

00:34:04.180 --> 00:34:06.859
right? You can just give helpful tips, be helpful,

00:34:07.099 --> 00:34:10.679
right? Or if it says quick tip, you might want

00:34:10.679 --> 00:34:14.300
to check X. Users might feel supported instead

00:34:14.300 --> 00:34:17.599
of judged. You might want to check column X or

00:34:17.599 --> 00:34:20.940
value X in the column or whatever. Quick tip.
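
NOTE
Editor's sketch of the two soft guardrails discussed here, leakage and messy labels, written with pandas.
It is a rough illustration under stated assumptions, not the tool's real logic: OUTCOME_HINTS, max_labels, the message wording and the sample columns are all hypothetical.
The label check only catches case and punctuation variants such as US and U.S.; aliases such as USA or North America would need an explicit mapping.
```python
import re
import pandas as pd
# Hypothetical name fragments that suggest a column holds the outcome.
OUTCOME_HINTS = ("churn", "cancel", "outcome", "label", "target")
def guardrail_tips(df: pd.DataFrame, max_labels: int = 50) -> list[str]:
    """Return friendly tips instead of raising errors."""
    tips = []
    for col in df.columns:
        name = str(col).lower()
        # Leakage hint: the column name reads like an outcome variable.
        if any(hint in name for hint in OUTCOME_HINTS):
            tips.append(f"Heads up: column '{col}' looks like an outcome variable. Using it as an input might cause leakage.")
        # Messy labels: only look at non-numeric columns with few distinct values.
        if not pd.api.types.is_numeric_dtype(df[col]) and df[col].nunique() <= max_labels:
            groups = {}
            for value in df[col].dropna().astype(str).unique():
                # Values that match after dropping case and punctuation share a key.
                key = re.sub(r"[^a-z0-9]", "", value.lower())
                groups.setdefault(key, set()).add(value)
            for variants in groups.values():
                if len(variants) > 1:
                    tips.append(f"Quick tip: '{col}' has overlapping categories {sorted(variants)}. You may want to clean this.")
    return tips
# Hypothetical sample rows, only to exercise the checks.
sample = pd.DataFrame({
    "region": ["US", "U.S.", "USA", "North America"],
    "plan_type": ["monthly", "annual", "monthly", "monthly"],
    "churned_in_30_days": [1, 0, 0, 1],
})
tips = guardrail_tips(sample)
```
Returning tips as a list, rather than raising, keeps the user's flow intact: the app can show them next to the upload instead of blocking it.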

00:34:21.800 --> 00:34:25.880
That's a very nice way to say it. basically the

00:34:25.880 --> 00:34:29.219
tone matters small emojis or short wording let

00:34:29.219 --> 00:34:31.619
your ai be your friend so it should be no jargon

00:34:31.619 --> 00:34:34.519
really so what this essentially means is like

00:34:34.519 --> 00:34:36.480
you know you're building soft guardrails into

00:34:36.480 --> 00:34:39.579
the tool that gently warns users about common

00:34:39.579 --> 00:34:42.039
pitfalls like leakage or messy category labels

00:34:42.039 --> 00:34:45.000
without breaking their flow now i just wanted

00:34:45.000 --> 00:34:47.179
to again reiterate about this other point about

00:34:47.179 --> 00:34:50.260
pre-computing small stats so the model reasons

00:34:50.260 --> 00:34:53.489
more and rummages less. So this one's about helping

00:34:53.489 --> 00:34:56.289
the AI spend its brain power wisely. So let's

00:34:56.289 --> 00:34:59.210
break it down. So the problem is when you don't

00:34:59.210 --> 00:35:02.230
pre-compute, if you just throw the data set,

00:35:02.329 --> 00:35:06.309
raw data set into a model and say, ask 10 smart

00:35:06.309 --> 00:35:10.530
EDA questions, the model has to parse every column

00:35:10.530 --> 00:35:13.510
from scratch. It may miss obvious things like

00:35:13.510 --> 00:35:15.610
columns which are numeric versus categorical

00:35:15.610 --> 00:35:19.869
and it wastes tokens and time. ai tokens and

00:35:19.869 --> 00:35:22.710
obviously time to rummage the data so to speak

00:35:22.710 --> 00:35:25.969
through the raw csv it sometimes gives generic

00:35:25.969 --> 00:35:28.230
or irrelevant questions and rummage just means

00:35:28.230 --> 00:35:31.150
like it goes about in a messy way very generic

00:35:31.150 --> 00:35:32.789
or irrelevant questions so you don't want that

00:35:32.789 --> 00:35:35.230
to happen it's like asking a human analyst to

00:35:35.230 --> 00:35:38.530
read a 500 page log file line by line instead

00:35:38.530 --> 00:35:41.769
of giving them a one page summary first so give

00:35:41.769 --> 00:35:44.389
some context context is very important essentially

00:35:44.389 --> 00:35:47.250
right one page summary is essentially a context

00:35:47.840 --> 00:35:50.300
And I'll talk about context engineering in another

00:35:50.300 --> 00:35:53.920
episode, but brief thing is just like to give

00:35:53.920 --> 00:35:57.059
context, right? If you want somebody to do the

00:35:57.059 --> 00:36:00.380
job correctly, help them understand what they're

00:36:00.380 --> 00:36:04.480
trying to do. Here's a simple fix: pre-compute

00:36:04.480 --> 00:36:08.099
small stats. Before asking your model to generate

00:36:08.099 --> 00:36:11.139
questions, you or your app should run a quick

00:36:11.139 --> 00:36:14.440
Pandas summary. For each column, compute things

00:36:14.440 --> 00:36:20.840
like type (numeric, categorical, text, date), count

00:36:20.840 --> 00:36:24.559
of unique values, percent missing values. So

00:36:24.559 --> 00:36:27.159
get a count of unique values, see how much percent

00:36:27.159 --> 00:36:28.840
of missing values are there, what are the data

00:36:28.840 --> 00:36:31.800
types are there, what is the mean, median, and

00:36:31.800 --> 00:36:35.619
standard deviation for the numeric columns. You

00:36:35.619 --> 00:36:37.199
can use a lot of Pandas functions available.

00:36:37.539 --> 00:36:40.159
So I think describe or something does it. And

00:36:40.159 --> 00:36:44.460
info will give you the type. Or a Pandas profiling

00:36:44.460 --> 00:36:47.070
report also, which is another library that will give

00:36:47.070 --> 00:36:50.150
you all these answers. And there's

00:36:50.150 --> 00:36:52.409
a couple of other ones like top three most common

00:36:52.409 --> 00:36:55.610
values for categoricals or a min max date for

00:36:55.610 --> 00:36:58.789
time columns so instead of dumping the full csv

00:36:58.789 --> 00:37:02.070
you give the model this compact cheat sheet and

00:37:02.070 --> 00:37:04.190
i'll link uh and i'll give you that in the show

00:37:04.190 --> 00:37:06.469
notes as well why it helps is because basically

00:37:06.469 --> 00:37:09.969
the model can reason about the data. For plan

00:37:09.969 --> 00:37:14.239
type, it would be like, hmm, plan type only has

00:37:14.239 --> 00:37:16.519
three categories maybe i can compare churn across

00:37:16.519 --> 00:37:22.340
them so this kind of context will enable the

00:37:22.340 --> 00:37:24.300
model to think smarter right it'll say like oh

00:37:24.300 --> 00:37:26.239
i only have three categories in the plan type

00:37:26.239 --> 00:37:31.099
maybe i can compare churn so it doesn't really

00:37:31.099 --> 00:37:34.599
waste tokens or hallucinate right ai models are

00:37:34.599 --> 00:37:36.599
known to hallucinate a lot So you don't want

00:37:36.599 --> 00:37:38.940
to waste your AI tokens there. Because AI tokens

00:37:38.940 --> 00:37:41.340
are obviously very expensive. So just use them wisely.

00:37:42.840 --> 00:37:46.039
Let it come up with smarter questions. Let it

00:37:46.039 --> 00:37:48.960
say, maybe this is a text column. Basically what

00:37:48.960 --> 00:37:51.159
happens in this case is output gets sharper,

00:37:51.380 --> 00:37:55.679
faster and cheaper. So essentially this whole

00:37:55.679 --> 00:37:57.400
thing means like you'd want to do a lightweight

00:37:57.400 --> 00:38:00.260
summary of the data set first. Then pass that

00:38:00.260 --> 00:38:03.500
summary to AI. That way the AI spends more energy

00:38:03.500 --> 00:38:06.630
on reasoning and less on digging through raw rows.
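
NOTE
Editor's sketch of the "pre-compute small stats" step: build a compact per-column cheat sheet with pandas and pass that to the model instead of the raw CSV.
The function name column_cheat_sheet, the output keys and the sample columns are assumptions for illustration. The built-in df.describe() and df.info() cover much of the same ground.
```python
import json
import pandas as pd
def column_cheat_sheet(df: pd.DataFrame, top_n: int = 3) -> dict:
    """Summarize each column: type, unique count, percent missing, plus type-specific stats."""
    sheet = {}
    for col in df.columns:
        s = df[col]
        info = {
            "unique": int(s.nunique()),
            "pct_missing": round(float(s.isna().mean()) * 100, 1),
        }
        if pd.api.types.is_numeric_dtype(s):
            # Mean, median and standard deviation for numeric columns.
            info["type"] = "numeric"
            info["mean"] = round(float(s.mean()), 2)
            info["median"] = round(float(s.median()), 2)
            info["std"] = round(float(s.std()), 2)
        elif pd.api.types.is_datetime64_any_dtype(s):
            # Min and max date for time columns.
            info["type"] = "date"
            info["min"] = str(s.min().date())
            info["max"] = str(s.max().date())
        else:
            # Most common values for categorical or text columns.
            info["type"] = "categorical"
            info["top_values"] = [str(v) for v in s.value_counts().head(top_n).index]
        sheet[str(col)] = info
    return sheet
# Hypothetical sample rows, only to show the shape of the summary.
sample = pd.DataFrame({
    "plan_type": ["monthly", "monthly", "annual", "family"],
    "first_week_sessions": [3, 0, 7, 2],
    "signup_date": pd.to_datetime(["2024-01-05", "2024-01-09", "2024-02-01", "2024-02-03"]),
})
sheet = column_cheat_sheet(sample)
# This JSON string, not the raw rows, is what would go into the prompt.
prompt_context = json.dumps(sheet, indent=2)
```
With a summary like this the model can see that plan_type has only three categories and suggest comparing churn across them, without reading a single raw row.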

00:38:06.769 --> 00:38:09.929
So this is essentially like a message for product

00:38:09.929 --> 00:38:12.469
and engineering. And finally, one last message

00:38:12.469 --> 00:38:15.769
for product and engineering is nothing leaves

00:38:15.769 --> 00:38:18.469
your session unless you opt in. So what does

00:38:18.469 --> 00:38:21.690
that mean? Here's the concern, right? It's a

00:38:21.690 --> 00:38:24.289
trust signal line. Here's what it means in plain

00:38:24.289 --> 00:38:27.949
words. The concern is basically when people upload

00:38:27.949 --> 00:38:32.130
a dataset into your AI EDA tool, they immediately

00:38:32.130 --> 00:38:35.960
wonder, Is this being stored somewhere? Is my

00:38:35.960 --> 00:38:39.739
client's data safe? Will this end up training

00:38:39.739 --> 00:38:42.300
someone's model? That's a very valid concern.

00:38:42.480 --> 00:38:45.420
So you need to make sure that you are not opted

00:38:45.420 --> 00:38:50.099
in if you don't want to share your data. So there's

00:38:50.099 --> 00:38:55.019
a way to check that in different tools. In ChatGPT,

00:38:55.059 --> 00:39:00.989
it is pretty easy to do that. People have very

00:39:00.989 --> 00:39:03.610
valid reasons to have this kind of privacy concerns,

00:39:03.849 --> 00:39:07.170
what we call as a privacy anxiety. You don't

00:39:07.170 --> 00:39:11.150
want the data to be used by other people because

00:39:11.150 --> 00:39:13.510
maybe it's confidential. You don't want to end

00:39:13.510 --> 00:39:18.570
up training someone's model, obviously. Is this

00:39:18.570 --> 00:39:21.170
being stored somewhere? So those are very valid

00:39:21.170 --> 00:39:23.329
concerns, right? So just make sure you're not

00:39:23.329 --> 00:39:27.179
opted in. Here's something you can watch out

00:39:27.179 --> 00:39:29.639
for is you can calm that by setting like a clear

00:39:29.639 --> 00:39:32.920
privacy boundary. By default, all your processing

00:39:32.920 --> 00:39:34.760
happens locally or within the active session.

00:39:35.099 --> 00:39:37.559
But when the user closes the tab, the data is

00:39:37.559 --> 00:39:41.840
gone. Nothing is logged, saved or even reused

00:39:41.840 --> 00:39:45.800
unless the user chooses to export, save or share

00:39:45.800 --> 00:39:50.480
that. Here's how I would frame the copy. So in

00:39:50.480 --> 00:39:53.320
app or in notes, this is how it would go. So

00:39:53.320 --> 00:39:55.780
your file is processed only inside your session.

00:39:56.199 --> 00:39:58.480
Nothing leaves or nothing is stored. This is

00:39:58.480 --> 00:40:00.400
a more privacy-first approach, where if you

00:40:00.400 --> 00:40:03.119
want to save or share your results, that's only

00:40:03.119 --> 00:40:06.380
opting in. So to summarize, here's what my message

00:40:06.380 --> 00:40:08.619
to product and engineering is. I'll start with

00:40:08.619 --> 00:40:11.320
the last one first. Your tool won't quietly send

00:40:11.320 --> 00:40:14.980
data to your servers, store logs or retrain models

00:40:14.980 --> 00:40:17.780
with their data. Users will stay in control.

00:40:18.269 --> 00:40:20.170
Because sharing is a choice and it's not the

00:40:20.170 --> 00:40:22.869
default. Building soft guardrails into the tool

00:40:22.869 --> 00:40:25.590
that gently warn users about common pitfalls

00:40:25.590 --> 00:40:29.949
like leakage or messy category labels without

00:40:29.949 --> 00:40:33.349
breaking their flow. So build those guardrails

00:40:33.349 --> 00:40:35.989
into the tool that warns users about pitfalls

00:40:35.989 --> 00:40:40.889
like leakage or messy category labels. Be supportive

00:40:40.889 --> 00:40:47.400
about it as well. Next point. Use short and plain

00:40:47.400 --> 00:40:50.559
words in your user interface copy. And whenever

00:40:50.559 --> 00:40:53.380
you ask users to type something into a box, show

00:40:53.380 --> 00:40:56.539
them an example of how to do it so that they

00:40:56.539 --> 00:40:59.539
don't freeze up. We want to make everything easy.

00:40:59.739 --> 00:41:04.000
And finally, put your objective first, focus

00:41:04.000 --> 00:41:05.860
second, and then question. That's really the

00:41:05.860 --> 00:41:08.280
first point here. Objective first, focus second,

00:41:08.340 --> 00:41:11.119
and questions last. Here's the numbers I track

00:41:11.119 --> 00:41:16.349
which are, you know, more useful than cool: percent

00:41:16.349 --> 00:41:20.250
of sessions with an objective, like 80 percent,

00:41:20.250 --> 00:41:23.170
greater than 80 percent is the target,

00:41:23.170 --> 00:41:26.110
right? Percent of useful questions. Time to brief,

00:41:26.110 --> 00:41:31.289
like ideally under three minutes. Seven-day return,

00:41:31.289 --> 00:41:33.570
so more on the habit side like the user activity

00:41:33.570 --> 00:41:36.289
side and the real world use is this is how i

00:41:36.289 --> 00:41:40.590
usually try to work with data set i prefer using

00:41:40.590 --> 00:41:44.130
churn also as like a Churn rate over a period

00:41:44.130 --> 00:41:46.329
of 30 days. That's my useful metric. That's why

00:41:46.329 --> 00:41:52.889
I kept talking about it. Retention over a three-month

00:41:52.889 --> 00:41:54.969
window, a six-month window, a yearly

00:41:54.969 --> 00:41:56.929
window. They're very common metrics to look at,

00:41:57.010 --> 00:42:00.170
but these are very useful also. Better than your

00:42:00.170 --> 00:42:04.309
vanity metrics like followers or subscribers.

00:42:04.949 --> 00:42:08.449
That's very vanity metrics. Number of users,

00:42:08.610 --> 00:42:11.659
for example. Those are not going to help. user

00:42:11.659 --> 00:42:14.579
growth is what really helps. And finally, let's

00:42:14.579 --> 00:42:16.719
come to the listener checklist. I'll be also

00:42:16.719 --> 00:42:18.760
adding this in the show notes, but here's what

00:42:18.760 --> 00:42:22.920
I just want to read out first. This is a follow-up

00:42:22.920 --> 00:42:25.340
to the quiz you listened to earlier. So let's

00:42:25.340 --> 00:42:27.579
go with this checklist here. This is the third

00:42:27.579 --> 00:42:30.780
key item I wanted to discuss. First, you write

00:42:30.780 --> 00:42:34.920
the objective in one line. Like example would

00:42:34.920 --> 00:42:39.489
be: reduce X by Y in Z time frame for whatever

00:42:39.489 --> 00:42:42.710
the segment is. Two, pick five to eight focus columns

00:42:42.710 --> 00:42:46.849
or accept the tool's hints. Three, generate 10 questions

00:42:46.849 --> 00:42:50.409
plus one line of why it matters. So this is like

00:42:50.409 --> 00:42:53.250
a prompt you are using, essentially. Four, run quick

00:42:53.250 --> 00:42:58.250
data health checks: nulls, odd values, naming. Export

00:42:58.250 --> 00:43:02.889
a one-page Markdown brief, number five. Number

00:43:02.889 --> 00:43:06.309
six, share it in Slack, Notion, or Jira and tag owners.

00:43:06.309 --> 00:43:10.369
Number seven, run two to three quick analyses

00:43:10.369 --> 00:43:14.590
today, not 10. Eight, log what you learned, plus the

00:43:14.590 --> 00:43:17.989
next decision. Nine, repeat tomorrow with tighter

00:43:17.989 --> 00:43:22.710
focus. And tell me which step saved you the most

00:43:22.710 --> 00:43:26.070
time in the Substack comments. So I'll list

00:43:26.070 --> 00:43:28.889
this checklist in the show notes so you can have

00:43:28.889 --> 00:43:30.670
it. All you need to do is just don't overthink

00:43:30.670 --> 00:43:32.550
it. Just run through these nine steps and you'll

00:43:32.550 --> 00:43:35.570
move faster than 90% of people who get dropped into

00:43:35.570 --> 00:43:38.400
a data set. right i'd love to hear from you which

00:43:38.400 --> 00:43:41.860
step actually saved you the most time so again

00:43:41.860 --> 00:43:43.820
just drop your answer in the comment section

00:43:43.820 --> 00:43:47.559
i'll link the thread for you in substack and

00:43:47.559 --> 00:43:50.539
i read them all like i said i'll share a few

00:43:50.539 --> 00:43:53.239
next week if you'd want so we can all learn from

00:43:53.239 --> 00:43:56.119
each other so what to steal from this show and

00:43:56.119 --> 00:43:58.599
maybe think of it like an action plan for you

00:43:58.599 --> 00:44:03.000
points: start every analysis with the decision,

00:44:03.000 --> 00:44:06.400
limit yourself to five to eight columns. Always

00:44:06.400 --> 00:44:09.820
ship a one-page brief. Track usefulness and

00:44:09.820 --> 00:44:13.820
not vanity metrics. And finally, make the first

00:44:13.820 --> 00:44:16.960
90 seconds feel like progress. All right? Before

00:44:16.960 --> 00:44:20.199
you go, I just wanted to let you know that I

00:44:20.199 --> 00:44:23.860
record with Riverside. I'm an affiliate partner

00:44:23.860 --> 00:44:27.139
of Riverside. They helped me record this episode.

00:44:27.739 --> 00:44:31.900
And they have amazing audio quality. So that's

00:44:31.900 --> 00:44:33.699
something I just wanted to recommend if you're

00:44:33.699 --> 00:44:36.380
ever, you know, deciding to do your own podcast.

00:44:37.219 --> 00:44:40.760
This is a great platform to record because their

00:44:40.760 --> 00:44:43.780
editing features are amazing. Their AI editing

00:44:43.780 --> 00:44:46.699
is, I think, next level. Helps you remove all

00:44:46.699 --> 00:44:50.539
awkward pauses and gives your sound a strong

00:44:50.539 --> 00:44:55.119
voice. It gives you a strong voice. So I would

00:44:55.119 --> 00:44:58.119
highly recommend Riverside. The reason why

00:44:58.119 --> 00:45:00.480
I chose to be an affiliate partner is because

00:45:00.480 --> 00:45:03.940
I actually use the product and I love it. So

00:45:03.940 --> 00:45:08.460
link to join their platform is in the show notes.

00:45:09.119 --> 00:45:12.139
And again, I'm an affiliate partner. So if you

00:45:12.139 --> 00:45:15.739
make any purchase, I may make a small commission

00:45:15.739 --> 00:45:20.760
from it. I also host on RSS.com. Also, I'm an

00:45:20.760 --> 00:45:25.239
affiliate partner of RSS because they have an

00:45:25.239 --> 00:45:29.119
amazing distribution channel. They are able to

00:45:29.119 --> 00:45:33.840
publish your episodes from RSS into Spotify,

00:45:34.159 --> 00:45:38.840
Apple Podcasts, Amazon Music, Audible, Deezer,

00:45:39.139 --> 00:45:44.579
Pandora, iHeartRadio, and even lets you publish

00:45:44.579 --> 00:45:47.260
your episode with one button into YouTube as

00:45:47.260 --> 00:45:50.659
well. Oh, did I forget to mention that you can

00:45:50.659 --> 00:45:55.559
start making money by hosting on RSS.com even

00:45:55.559 --> 00:45:58.559
if you have just 10 downloads a month. So that's

00:45:58.559 --> 00:46:00.559
all you need to make money. 10 downloads a month.

00:46:01.900 --> 00:46:04.079
It's a new feature they started and it's amazing.

00:46:04.639 --> 00:46:07.639
So again, link to join RSS.com will be in the

00:46:07.639 --> 00:46:12.300
show notes. I'm also using Cider.ai to help

00:46:12.300 --> 00:46:17.079
with my research. So if you are interested in

00:46:17.079 --> 00:46:21.780
any research for your content or for in general,

00:46:21.920 --> 00:46:24.280
you want to learn about anything, want to read

00:46:24.280 --> 00:46:29.880
research papers or any kind of thing to improve

00:46:29.880 --> 00:46:32.820
your knowledge, Cider.ai is your platform to

00:46:32.820 --> 00:46:35.539
use. So again, I'm an affiliate partner and I'd

00:46:35.539 --> 00:46:40.139
love if you're able to support the show by clicking

00:46:40.139 --> 00:46:43.599
on my affiliate link and joining Cider.ai. So

00:46:43.599 --> 00:46:47.599
that's all I have, but take the quiz, drop your

00:46:47.599 --> 00:46:49.579
story in the Substack comments, grab the checklist,

00:46:49.900 --> 00:46:55.989
share this with your friend. With that one friend

00:46:55.989 --> 00:46:59.829
who's drowning in CSVs or Excels. Show notes

00:46:59.829 --> 00:47:02.349
have everything. Substack discussion link is

00:47:02.349 --> 00:47:06.050
also included. And a thread is also there for

00:47:06.050 --> 00:47:09.130
you to join on Substack. So see you next week.

00:47:10.070 --> 00:47:13.949
Thanks again. And this is Mukundan from the Data

00:47:13.949 --> 00:47:15.869
and AI with Mukundan show. Hey, it's Mukundan.

00:47:16.010 --> 00:47:18.570
If this episode helped you, two tiny favors that

00:47:18.570 --> 00:47:21.250
make a huge difference. Rate the show five stars.

00:47:21.900 --> 00:47:24.659
On Spotify, you can just open the show page and

00:47:24.659 --> 00:47:27.960
tap the star button. On Apple Podcasts, scroll

00:47:27.960 --> 00:47:30.340
to the bottom of the show page and tap five stars.

00:47:31.800 --> 00:47:34.820
Also, leave a one-to-two-line review on Apple Podcasts.

00:47:35.719 --> 00:47:38.280
Tell me one takeaway or which checklist step

00:47:38.280 --> 00:47:41.079
saved you time. I read every single one and it

00:47:41.079 --> 00:47:43.880
helps more people find the show. Thank you. Your

00:47:43.880 --> 00:47:46.599
rating and review genuinely help this reach the

00:47:46.599 --> 00:47:49.440
next analyst who's staring at a messy CSV. See

00:47:49.440 --> 00:47:49.900
you next week.
