WEBVTT

00:00:07.919 --> 00:00:10.580
Hey everyone, welcome back to another episode

00:00:10.580 --> 00:00:13.380
of Data and AI with Mukundan, the show where

00:00:13.380 --> 00:00:17.760
we dive deep into tools, stories and breakthroughs

00:00:17.760 --> 00:00:21.280
in the world of data science and artificial intelligence.

00:00:22.339 --> 00:00:25.320
So whether you're on your morning walk, prepping

00:00:25.320 --> 00:00:29.399
for your interview in your car or building something

00:00:29.399 --> 00:00:32.280
amazing from scratch, I hope this podcast helps

00:00:32.280 --> 00:00:35.359
you think a little sharper and ask better questions.

00:00:37.350 --> 00:00:42.070
And build smarter. Now, let me ask you something.

00:00:42.869 --> 00:00:46.310
Have you ever opened a data set and thought to

00:00:46.310 --> 00:00:51.950
yourself, I have no idea where to begin. You're

00:00:51.950 --> 00:00:55.909
basically at work. And you're asked to look at

00:00:55.909 --> 00:00:58.689
some data. And someone tells you like, hey, this

00:00:58.689 --> 00:01:04.890
is, I mean, you started a new job here. They're

00:01:04.890 --> 00:01:08.709
asking you to look at the data set that you're

00:01:08.709 --> 00:01:11.709
going to be working with all the time. But all

00:01:11.709 --> 00:01:15.909
they say is get familiar with this data. But

00:01:15.909 --> 00:01:18.469
you don't know how to really get familiar because

00:01:18.469 --> 00:01:20.689
I guess you don't know what questions to ask.

00:01:21.450 --> 00:01:27.950
Where to even begin. And you're not alone in

00:01:27.950 --> 00:01:30.730
this because I'm sure every person who is starting

00:01:30.730 --> 00:01:35.859
a new role goes through this. in fact this exact

00:01:35.859 --> 00:01:39.299
moment actually inspired me to write a blog post

00:01:39.299 --> 00:01:42.120
first on this which gained a lot of traction

00:01:42.120 --> 00:01:46.299
and a lot of people i felt could relate uh because

00:01:46.299 --> 00:01:49.439
i mean in in the seven days or something that

00:01:49.439 --> 00:01:53.060
i've that i've written the blog post it gained

00:01:53.060 --> 00:01:57.939
like about i think 500 plus you know read time

00:01:57.939 --> 00:02:01.000
on it and a read time is 500 plus reads so that

00:02:01.000 --> 00:02:03.319
would mean that roughly people are spending 30

00:02:03.319 --> 00:02:09.500
plus seconds per read on average right so it

00:02:09.500 --> 00:02:11.460
was getting a lot of traction and it has been

00:02:11.460 --> 00:02:15.900
my most successful blog happy to say that so

00:02:15.900 --> 00:02:17.419
that's why I wanted to create this episode which

00:02:17.419 --> 00:02:21.360
I felt like this is something people face a lot

00:02:21.360 --> 00:02:25.580
at you know whatever part of the journey they

00:02:25.580 --> 00:02:28.460
are especially if they're in the data field right

00:02:28.460 --> 00:02:30.479
your data scientist data analyst data engineer

00:02:30.479 --> 00:02:34.580
whatever you just uh you're faced with this challenge

00:02:34.580 --> 00:02:37.819
that you're asked to look at data but really

00:02:37.819 --> 00:02:42.879
don't know what questions to ask um so i i created

00:02:42.879 --> 00:02:46.620
a tool that i uh you know i felt like could be

00:02:46.620 --> 00:02:51.740
a real game changer especially for me and i think

00:02:51.740 --> 00:02:57.009
it could be for you as well So buckle up because

00:02:57.009 --> 00:02:59.449
today I'm going to be walking you through how

00:02:59.449 --> 00:03:03.849
I built an AI tool that thinks like a data analyst

00:03:03.849 --> 00:03:08.050
or maybe a data scientist. And not only that,

00:03:08.169 --> 00:03:12.310
but I also want to unpack the problem that it

00:03:12.310 --> 00:03:16.449
solves, show you how the app works behind the

00:03:16.449 --> 00:03:20.110
scenes and share how you can use it right now

00:03:20.110 --> 00:03:27.539
to level up your own analysis. So what's the

00:03:27.539 --> 00:03:31.879
real struggle behind an exploratory data analysis?

00:03:32.300 --> 00:03:36.120
So let's be real, right? Exploratory data analysis

00:03:36.120 --> 00:03:39.879
or EDA isn't as straightforward as textbooks

00:03:39.879 --> 00:03:44.539
make it sound. Textbooks may say that, you know,

00:03:44.539 --> 00:03:49.319
load the data, check for nulls, data types, you

00:03:49.319 --> 00:03:51.620
know, check for the data types, like what the

00:03:51.620 --> 00:03:54.889
different data types present. and it could tell

00:03:54.889 --> 00:03:57.250
you like to do some histograms to see what to

00:03:57.250 --> 00:04:00.229
the count frequency and everything and by textbook

00:04:00.229 --> 00:04:03.370
i mean like let's say you did actually do maybe

00:04:03.370 --> 00:04:07.669
a course in data analysis data scientist or or

00:04:07.669 --> 00:04:11.229
you went to school for it for your undergrad

00:04:11.229 --> 00:04:14.069
or your master's or your phd program or whatever

00:04:14.069 --> 00:04:17.029
right i'm sure not at the phd level but definitely

00:04:17.029 --> 00:04:20.069
at the master's level you are doing more eda

00:04:21.579 --> 00:04:27.259
then you would be doing at the PhD level for

00:04:27.259 --> 00:04:31.420
sure. The drill is basically that you load the

00:04:31.420 --> 00:04:35.240
data, check for missing values or outliers really,

00:04:35.399 --> 00:04:39.920
data types and the EDA, right? The actual data

00:04:39.920 --> 00:04:42.060
visualization where you look for the counts.

00:04:42.439 --> 00:04:45.839
But real -life data work, when you're in the

00:04:45.839 --> 00:04:52.740
actual workplace, it is way messier. So you're

00:04:52.740 --> 00:04:58.160
given a CSV with no documentation, no context,

00:04:58.300 --> 00:05:04.379
and vague instructions. Could be like, find something

00:05:04.379 --> 00:05:07.680
interesting in this data set, right? And this

00:05:07.680 --> 00:05:10.439
could be something like in interviews, you're

00:05:10.439 --> 00:05:13.500
doing a take -home exam, for example. You're

00:05:13.500 --> 00:05:16.920
working with a new client and maybe you're in

00:05:16.920 --> 00:05:19.060
a consulting company and you're working some

00:05:19.060 --> 00:05:22.250
new client project. and they'd ask you to find

00:05:22.250 --> 00:05:24.529
something interesting and maybe that's a very

00:05:24.529 --> 00:05:28.550
quick turnaround time as well so how do you how

00:05:28.550 --> 00:05:34.790
do you be super fast in this right um or maybe

00:05:34.790 --> 00:05:38.029
just uh your even in your internal stakeholder

00:05:38.029 --> 00:05:40.709
meetings they'd be asking you to do this and

00:05:40.709 --> 00:05:43.149
and the other thing which i spoke about was also

00:05:43.149 --> 00:05:45.189
onboarding right when you're starting a new company

00:05:45.189 --> 00:05:49.269
role they might ask you to look at data set not

00:05:49.269 --> 00:05:52.269
everybody has too much time to help you understand

00:05:52.269 --> 00:05:55.269
the data set uh maybe you're doing in a startup

00:05:55.269 --> 00:05:57.750
role they they still just give you the data and

00:05:57.750 --> 00:06:02.730
to uh look at it right and that uh and the turnaround

00:06:02.730 --> 00:06:05.949
time that you would have to you know produce

00:06:05.949 --> 00:06:10.970
results from it would be very quick so so you

00:06:10.970 --> 00:06:12.990
may be just struggling with the orientation aspect

00:06:12.990 --> 00:06:18.050
and it's not that you're struggling with the

00:06:18.050 --> 00:06:20.790
analysis it's just the way you've been told to

00:06:20.790 --> 00:06:26.129
do it um but here's the thing you don't you don't

00:06:26.129 --> 00:06:29.589
need to clean the data every time you just need

00:06:29.589 --> 00:06:34.810
to ask better questions and so which brings me

00:06:34.810 --> 00:06:38.430
to the next section here the spark what what

00:06:38.430 --> 00:06:42.649
really inspired me right um so i was doing a

00:06:42.649 --> 00:06:46.600
take -home assignment For a company that I really

00:06:46.600 --> 00:06:51.459
admired. And they gave me a data set of car listings.

00:06:52.459 --> 00:06:55.480
And they asked me to explore it. So there was

00:06:55.480 --> 00:07:00.899
like no specific KPI, no business goal. Just

00:07:00.899 --> 00:07:04.779
tell us something useful from the data set. And

00:07:04.779 --> 00:07:08.220
even though I did all the standard stuff, right?

00:07:08.920 --> 00:07:11.360
Looking at some price distributions, missing

00:07:11.360 --> 00:07:14.300
values, correlation plots, nothing popped up.

00:07:16.079 --> 00:07:21.860
Then it hit me. Maybe it's not an analysis problem.

00:07:23.500 --> 00:07:27.759
It's just a question problem. And let me elaborate

00:07:27.759 --> 00:07:33.839
what that means. Like, I needed prompts that

00:07:33.839 --> 00:07:40.279
could nudge me. Like, you know, asking the right

00:07:40.279 --> 00:07:42.600
questions from the data, right? That's what I

00:07:42.600 --> 00:07:45.819
meant as a question problem. So in this case,

00:07:45.939 --> 00:07:49.639
it was like, do accident -prone cars sell for

00:07:49.639 --> 00:07:54.240
less? Does city influence price? Are new cars

00:07:54.240 --> 00:08:00.279
listed for shorter durations? So what I wanted

00:08:00.279 --> 00:08:05.319
was something that could think like an analyst

00:08:05.319 --> 00:08:12.600
with me. And that's when I started building this

00:08:12.600 --> 00:08:16.360
tool. yeah obviously it could have helped me

00:08:16.360 --> 00:08:20.139
during the interview but I think the main goal

00:08:20.139 --> 00:08:25.319
of this was to maybe set me up for success for

00:08:25.319 --> 00:08:27.699
future interviews and if it's a great tool share

00:08:27.699 --> 00:08:32.379
it with others to help them along with their

00:08:32.379 --> 00:08:37.240
process as well right so this this is what the

00:08:37.240 --> 00:08:40.580
app does really like you upload a CSV that's

00:08:40.580 --> 00:08:48.409
step one step two you mean the the AI tool summarizes

00:08:48.409 --> 00:08:52.669
each column with the data type unique values

00:08:52.669 --> 00:08:58.750
that are present in the data set and ranges examples

00:08:58.750 --> 00:09:04.929
and it builds a custom prompt for GPT -4 and

00:09:04.929 --> 00:09:10.970
finally GPT -4 then returns ten thoughtful exploratory

00:09:10.970 --> 00:09:13.720
questions so actually the summarization of the

00:09:13.720 --> 00:09:16.679
columns that is a code logic that's being written

00:09:16.679 --> 00:09:21.059
but gpt4 where the ai comes in is it returns

00:09:21.059 --> 00:09:24.899
these exploratory questions that you can ask

00:09:24.899 --> 00:09:30.679
from the dsa so they are not shallow summaries

00:09:30.679 --> 00:09:36.340
they are very nuanced very context aware and

00:09:36.340 --> 00:09:41.629
they're very useful in real workflows so like

00:09:41.629 --> 00:09:43.950
i said right before with these examples so imagine

00:09:43.950 --> 00:09:46.690
you just joined a team and you get handed raw

00:09:46.690 --> 00:09:50.309
data to look at so instead of freezing up you

00:09:50.309 --> 00:09:54.129
get like a head start here because this ai tool

00:09:54.129 --> 00:09:56.509
is doing that analysis for you i mean it's asking

00:09:56.509 --> 00:09:58.190
those questions like what questions you should

00:09:58.190 --> 00:10:02.830
be asking so that you are then um maybe using

00:10:02.830 --> 00:10:05.230
a sql code a python code or whatever other right

00:10:05.230 --> 00:10:09.679
uh way to explore the data set But now you know

00:10:09.679 --> 00:10:11.779
what questions to ask because this tool is already

00:10:11.779 --> 00:10:14.820
giving you that head start. It's giving you a

00:10:14.820 --> 00:10:18.419
relevant way to frame the questions. And it's

00:10:18.419 --> 00:10:23.639
giving you an instant analytical momentum. And

00:10:23.639 --> 00:10:28.960
that's where the magic happens. So let's look

00:10:28.960 --> 00:10:32.700
at behind the code, which I'll be sharing in

00:10:32.700 --> 00:10:37.120
the blog post which accompanies. you know, this

00:10:37.120 --> 00:10:40.200
podcast. So it'll be in the show notes for anybody

00:10:40.200 --> 00:10:44.759
wondering. So for my fellow builders and, you

00:10:44.759 --> 00:10:48.200
know, code tinkerers, I want to call it, here's

00:10:48.200 --> 00:10:52.840
a quick walkthrough of the backend code. So the

00:10:52.840 --> 00:10:57.100
app itself, it's built with Streamlit. So it's

00:10:57.100 --> 00:10:59.480
a Python -based library and it's for the UI.

00:11:00.000 --> 00:11:05.919
It's a very easy library to use. OpenAI is GPT

00:11:05.919 --> 00:11:12.779
-4 API and Pandas for data profiling. So once

00:11:12.779 --> 00:11:17.100
a user uploads a CSV, the app parses the columns.

00:11:18.259 --> 00:11:24.399
It categorizes them by data type and generates

00:11:24.399 --> 00:11:31.740
human readable summaries. So it's easy to understand

00:11:31.740 --> 00:11:35.740
those summaries right then it feeds those summaries

00:11:35.740 --> 00:11:40.519
into a gpt4 prompt like this so the prompt goes

00:11:40.519 --> 00:11:44.019
like this right you are a data analyst here's

00:11:44.019 --> 00:11:47.179
a data set summary what are 10 smart questions

00:11:47.179 --> 00:11:54.940
we should ask to understand this better and gpt4

00:11:54.940 --> 00:11:58.840
then does its magic we show the output in the

00:11:58.840 --> 00:12:05.289
app you copy paste the questions or you can use

00:12:05.289 --> 00:12:08.909
them to guide the dashboards that you build from

00:12:08.909 --> 00:12:12.690
it the analysis that you want to do from this

00:12:12.690 --> 00:12:17.590
in your take -home exams in your site projects

00:12:17.590 --> 00:12:23.450
client side projects your interviews anything

00:12:23.450 --> 00:12:25.889
really right like you're onboarding with a new

00:12:25.889 --> 00:12:30.470
company so a lot of use cases but yeah you know

00:12:30.470 --> 00:12:34.149
what questions to ask now because well uh the

00:12:34.149 --> 00:12:36.690
tool is doing that for you so you're just focusing

00:12:36.690 --> 00:12:40.710
on the analysis right so this is your guide now

00:12:40.710 --> 00:12:48.649
um who is this for so this is for you if you're

00:12:48.649 --> 00:12:51.309
a new hire who's ramping up on mexi data sets

00:12:51.309 --> 00:12:55.909
um like i mentioned so even if you're even if

00:12:55.909 --> 00:12:58.009
you're a student who's practicing eda for your

00:12:58.009 --> 00:13:01.039
school projects This could be something that

00:13:01.039 --> 00:13:05.919
guides you there. And let's say you're a data

00:13:05.919 --> 00:13:09.460
analyst who's building portfolio projects for

00:13:09.460 --> 00:13:12.639
your interviews so that you can get interviews

00:13:12.639 --> 00:13:16.019
in the future, right? So this is very handy.

00:13:17.220 --> 00:13:22.600
Or now, let's say you got the interview and it's

00:13:22.600 --> 00:13:26.100
like a take -home exam and they ask you to still

00:13:26.100 --> 00:13:28.820
explore the data set. This is where you can use

00:13:28.820 --> 00:13:34.139
it. you're a freelancer scoping up raw data for

00:13:34.139 --> 00:13:39.059
clients. So let's say you're just on a client

00:13:39.059 --> 00:13:42.600
contract and they ask you to look at raw data

00:13:42.600 --> 00:13:45.500
to generate meaningful insights. This is where

00:13:45.500 --> 00:13:49.320
the tool comes in. So even senior analysts can

00:13:49.320 --> 00:13:52.299
use it to spark inspiration first. Even if you're

00:13:52.299 --> 00:13:54.139
like maybe an established person in a company,

00:13:54.200 --> 00:13:57.720
I think even if you're looking at something new,

00:13:58.169 --> 00:14:00.149
for the first time and again don't know what

00:14:00.149 --> 00:14:04.090
questions to ask this can give you that jump

00:14:04.090 --> 00:14:07.830
start right it's not a replacement for deep thinking

00:14:07.830 --> 00:14:13.690
it's a partner for better thinking so you still

00:14:13.690 --> 00:14:21.289
want to be able to think deep but also have a

00:14:21.289 --> 00:14:27.710
better thinking partner right So one of my favorite

00:14:27.710 --> 00:14:34.610
use cases for this so far has been one newsletter

00:14:34.610 --> 00:14:40.389
subscriber reached out to me last week and they

00:14:40.389 --> 00:14:45.029
were prepping for an interview and they were

00:14:45.029 --> 00:14:49.070
asked to paste in a dataset. I mean basically

00:14:49.070 --> 00:14:51.190
they were asked to look at dataset and generate

00:14:51.190 --> 00:14:53.649
questions. So what they did was they used this

00:14:53.649 --> 00:14:57.720
tool which I built in. they were able to generate

00:14:57.720 --> 00:15:00.899
questions because the app generated for it for

00:15:00.899 --> 00:15:05.120
them and that this isn't something that they

00:15:05.120 --> 00:15:11.980
even considered so they picked like three or

00:15:11.980 --> 00:15:14.039
four of those 10 questions which were generated

00:15:14.039 --> 00:15:19.620
they ran an analysis in the ada basically and

00:15:19.620 --> 00:15:25.840
built a slide deck which said that helped really

00:15:25.840 --> 00:15:29.600
gain her gain their confidence back right like

00:15:29.600 --> 00:15:32.759
i mean i i felt like they were a bit stuck here

00:15:32.759 --> 00:15:40.159
and this this kind of tool really helped them

00:15:40.159 --> 00:15:44.840
get unstuck so my purpose of sharing this is

00:15:44.840 --> 00:15:49.820
just showing what's possible when we go from

00:15:49.820 --> 00:15:53.470
you know getting blank screen in our minds to

00:15:53.470 --> 00:15:57.330
more structured curiosity. So I wasn't sharing

00:15:57.330 --> 00:15:59.149
this to hype the tool, but just to show what's

00:15:59.149 --> 00:16:02.909
possible to do, right? And that's what this tool

00:16:02.909 --> 00:16:06.730
really helps you to do. Here's where I'm taking

00:16:06.730 --> 00:16:10.970
it next. So as I mentioned, I will be sharing

00:16:10.970 --> 00:16:13.769
the show notes, the blog post that accompanies

00:16:13.769 --> 00:16:17.850
this podcast. And it will be in the show notes,

00:16:17.889 --> 00:16:20.049
which will have the full code of how to do it.

00:16:20.509 --> 00:16:23.629
And I mean, I explained the process. if you want

00:16:23.629 --> 00:16:25.409
to be able to do that yourself you can do that

00:16:25.409 --> 00:16:30.129
based on the process as well and what's next

00:16:30.129 --> 00:16:34.730
is like i will be doing a csv upload with column

00:16:34.730 --> 00:16:41.769
profiling and using a custom tone a stakeholder

00:16:41.769 --> 00:16:46.070
aware gen question generation as well custom

00:16:46.070 --> 00:16:49.450
tones as in like the way you want the data set

00:16:49.450 --> 00:16:52.879
to sound So what that means is like, if you want

00:16:52.879 --> 00:16:56.399
to ask more emotional questions from the data,

00:16:56.480 --> 00:16:59.299
so something like that would be an ideal next

00:16:59.299 --> 00:17:04.579
step for this. So like make it, make it smarter

00:17:04.579 --> 00:17:11.019
essentially. Right. And I want to do it like

00:17:11.019 --> 00:17:15.099
a Slack bot or a chat GPT plugin. So I think

00:17:15.099 --> 00:17:17.920
Slack allows you to, or maybe at least chat GPT

00:17:17.920 --> 00:17:21.509
allows you to do it. directly but the idea is

00:17:21.509 --> 00:17:27.029
to do it like not just as a data uh not just

00:17:27.029 --> 00:17:31.269
as a data upload you know in a in chat jpt this

00:17:31.269 --> 00:17:36.109
allows you to use an app and i think apps are

00:17:36.109 --> 00:17:41.369
cool so um that's why i wanted to do this and

00:17:41.369 --> 00:17:45.269
i want to add some visualization support to this

00:17:45.269 --> 00:17:49.579
some more data visualization support more charts

00:17:49.579 --> 00:17:56.980
to look at smarter charts to look at and so that's

00:17:56.980 --> 00:17:59.920
like some of the you know use cases I'm thinking

00:17:59.920 --> 00:18:06.680
this is just the beginning if this resonated

00:18:06.680 --> 00:18:11.920
with you try it out paste in a data set that

00:18:11.920 --> 00:18:17.069
you have struggled with and let the AI give you

00:18:17.069 --> 00:18:21.690
that head start. And if it gives you even one

00:18:21.690 --> 00:18:26.569
useful question, let me know. Tag me, message

00:18:26.569 --> 00:18:31.990
me, share the output. Because better questions

00:18:31.990 --> 00:18:38.869
build better insights. And better insights build

00:18:38.869 --> 00:18:43.930
better careers. So thanks for listening to another

00:18:43.930 --> 00:18:46.950
episode of Data and AI with Mukundan. And until

00:18:46.950 --> 00:18:51.369
next time, keep thinking, keep questioning, and

00:18:51.369 --> 00:18:52.569
keep building.
