1
00:00:00,000 --> 00:00:08,160
How did the best machine learning practitioners get involved in the field?

2
00:00:08,160 --> 00:00:10,600
What challenges have they faced?

3
00:00:10,600 --> 00:00:12,840
What has helped them flourish?

4
00:00:12,840 --> 00:00:14,720
Let's ask them.

5
00:00:14,720 --> 00:00:16,840
Welcome to Learning from Machine Learning.

6
00:00:16,840 --> 00:00:21,380
I'm your host, Seth Levine.

7
00:00:21,380 --> 00:00:24,360
Welcome to Learning from Machine Learning.

8
00:00:24,360 --> 00:00:31,960
In this episode, it's incredible to have Sebastian Raschka here, Lead AI Educator at Lightning,

9
00:00:31,960 --> 00:00:37,320
former statistics professor at University of Wisconsin, the author of Python Machine

10
00:00:37,320 --> 00:00:41,480
Learning book and Machine Learning with PyTorch and Scikit-Learn.

11
00:00:41,480 --> 00:00:47,360
And overall, just an amazing force making AI and deep learning more accessible and teaching

12
00:00:47,360 --> 00:00:50,960
people how to use AI and deep learning at scale.

13
00:00:50,960 --> 00:00:51,960
Welcome.

14
00:00:51,960 --> 00:00:55,720
Yeah, thank you for the kind invitation to be here.

15
00:00:55,720 --> 00:00:58,440
Very exciting, Seth, to have me here on your podcast.

16
00:00:58,440 --> 00:01:00,720
I think it's a relatively new podcast.

17
00:01:00,720 --> 00:01:05,580
I'm especially honored to be one of the first people on this podcast.

18
00:01:05,580 --> 00:01:08,040
So I hope we will have a lot of fun.

19
00:01:08,040 --> 00:01:12,000
Hopefully a lot of stuff to talk about because we both work in machine learning and have

20
00:01:12,000 --> 00:01:15,200
a lot of overlapping interests.

21
00:01:15,200 --> 00:01:19,680
It's awesome to have you here to get things kicked off.

22
00:01:19,680 --> 00:01:23,640
Do you want to give us a little bit of a career background, your journey?

23
00:01:23,640 --> 00:01:26,720
How did you get to where you are today in the machine learning field?

24
00:01:26,720 --> 00:01:29,160
Yeah, that's...

25
00:01:29,160 --> 00:01:32,080
How far do you want me to go back?

26
00:01:32,080 --> 00:01:33,080
So maybe starting with...

27
00:01:33,080 --> 00:01:34,080
As far back as you want.

28
00:01:34,080 --> 00:01:35,080
Yeah.

29
00:01:35,080 --> 00:01:44,840
So I think how I basically started was during my undergrad, I got into statistics, R, Python

30
00:01:44,840 --> 00:01:46,400
programming eventually.

31
00:01:46,400 --> 00:01:49,320
And I've always been a tinkerer.

32
00:01:49,320 --> 00:01:53,360
I must say I always liked the coding more than the math.

33
00:01:53,360 --> 00:01:56,480
Nonetheless, I always was somewhere in between the two.

34
00:01:56,480 --> 00:02:01,320
So I was never really, let's say, a software engineer, but I was also never really a mathematician,

35
00:02:01,320 --> 00:02:02,480
if that makes sense.

36
00:02:02,480 --> 00:02:07,040
So I was more like an applied researcher or scientist.

37
00:02:07,040 --> 00:02:12,440
And yeah, my background is essentially during my PhD, I worked on computational biology

38
00:02:12,440 --> 00:02:17,120
problems where it was usually centered around some prediction task.

39
00:02:17,120 --> 00:02:22,320
Let's say virtual screening, where we were interested in finding small molecules that

40
00:02:22,320 --> 00:02:28,040
inhibit some biological response, let's say related to diseases or other types of biological,

41
00:02:28,040 --> 00:02:30,160
let's say, systems.

42
00:02:30,160 --> 00:02:33,600
In the same way, we were also modeling protein structures and these types of things.

43
00:02:33,600 --> 00:02:39,640
And yeah, we had to do a lot of coding, coming up with rules to classify things.

44
00:02:39,640 --> 00:02:42,680
And there was this class when I was in grad school that is...

45
00:02:42,680 --> 00:02:45,240
It is more than 10 years ago now, I think.

46
00:02:45,240 --> 00:02:47,760
So it was called Statistical Pattern Recognition.

47
00:02:47,760 --> 00:02:53,840
And I was, as my advisor back then, she recommended me taking that course because, well, it was

48
00:02:53,840 --> 00:02:59,120
something where you can maybe automate this prediction type of problem that we had instead

49
00:02:59,120 --> 00:03:01,780
of hand coding things, going through things.

50
00:03:01,780 --> 00:03:07,440
And I must say I wrote a lot of brute force for loops in Python to optimize things using

51
00:03:07,440 --> 00:03:10,240
very simple optimization libraries.

52
00:03:10,240 --> 00:03:12,280
And that was kind of like eye opening.

53
00:03:12,280 --> 00:03:18,660
So that course was mostly focused on Bayesian methods, let's say Bayes optimal classifiers

54
00:03:18,660 --> 00:03:22,080
and then Naive Bayes to make that more feasible and these types of things.

55
00:03:22,080 --> 00:03:26,320
But that kind of introduced me, I would say, to machine learning, like the concept of...

56
00:03:26,320 --> 00:03:30,840
I mean, it was more statistical learning, but the concept of learning from data, essentially.

57
00:03:30,840 --> 00:03:33,620
And then I took another class, data mining.

58
00:03:33,620 --> 00:03:38,360
And that was also the time when Andrew Ng's class was launched on Coursera, the machine

59
00:03:38,360 --> 00:03:39,360
learning class.

60
00:03:39,360 --> 00:03:40,600
And I got totally hooked.

61
00:03:40,600 --> 00:03:43,680
It was, I mean, in two ways, revolutionary.

62
00:03:43,680 --> 00:03:49,600
At first, working with data, letting computers learn from data automatically, that was super

63
00:03:49,600 --> 00:03:50,600
fascinating.

64
00:03:50,600 --> 00:03:55,040
At the same time, Coursera as an online learning platform was also super cool as a student.

65
00:03:55,040 --> 00:03:57,200
Like, wow, I can do this at home.

66
00:03:57,200 --> 00:04:02,200
I mean, I like going to classes in person, but this was just also very revolutionary

67
00:04:02,200 --> 00:04:03,340
where you had everything at home.

68
00:04:03,340 --> 00:04:05,120
You could take the class whenever you wanted.

69
00:04:05,120 --> 00:04:07,960
And it was just addicting to take that class.

70
00:04:07,960 --> 00:04:09,840
Andrew Ng was such a good teacher.

71
00:04:09,840 --> 00:04:11,080
I got really hooked.

72
00:04:11,080 --> 00:04:16,040
And yeah, from there, eventually, I joined the statistics department at UW-Madison in

73
00:04:16,040 --> 00:04:20,400
2018, where I focused on machine learning and deep learning research.

74
00:04:20,400 --> 00:04:24,040
And then in 2022, I joined Lightning AI.

75
00:04:24,040 --> 00:04:30,880
I liked my time as an assistant professor, but things change in machine learning where

76
00:04:30,880 --> 00:04:33,240
the problems become more challenging and bigger.

77
00:04:33,240 --> 00:04:37,680
If you are, let's say, a small team, it's more challenging to keep up with, let's say,

78
00:04:37,680 --> 00:04:39,560
technology and resources.

79
00:04:39,560 --> 00:04:43,440
And like I mentioned before, I'm not, let's say, the typical mathematician type of person.

80
00:04:43,440 --> 00:04:44,600
So I like computing.

81
00:04:44,600 --> 00:04:49,640
So I was looking for an opportunity where, let's say, I have a team of people and infrastructure

82
00:04:49,640 --> 00:04:52,240
to work on different types of problems.

83
00:04:52,240 --> 00:04:58,800
And also, like how to extend my educational, let's say, passion from just in classroom

84
00:04:58,800 --> 00:05:02,720
teaching to also maybe developing an online course, which is what I'm, for example, among

85
00:05:02,720 --> 00:05:03,920
other things doing right now.

86
00:05:03,920 --> 00:05:07,520
So yeah, I joined Lightning AI, long story short.

87
00:05:07,520 --> 00:05:13,240
And yeah, since then, I've been really happily working there.

88
00:05:13,240 --> 00:05:15,480
I liked my time at UW-Madison as well.

89
00:05:15,480 --> 00:05:18,840
But yeah, you can't do everything all at once, I guess.

90
00:05:18,840 --> 00:05:19,840
So yeah.

91
00:05:19,840 --> 00:05:24,080
Yeah, no, that's a great journey.

92
00:05:24,080 --> 00:05:29,960
I too was captivated by Andrew Ng's Coursera class, as I think a lot of people in machine

93
00:05:29,960 --> 00:05:31,800
learning.

94
00:05:31,800 --> 00:05:37,840
So having the experience being a professor and now working in industry, how would you

95
00:05:37,840 --> 00:05:41,600
compare working in academia compared to industry?

96
00:05:41,600 --> 00:05:43,040
Yeah.

97
00:05:43,040 --> 00:05:49,400
So I wouldn't say one is necessarily better than the other and just very different.

98
00:05:49,400 --> 00:05:57,760
I think in academia, what I especially liked, I mean, was this academic thing in the

99
00:05:57,760 --> 00:06:02,280
air where you have freedom to do whatever you want, and it's very exciting to be in

100
00:06:02,280 --> 00:06:03,880
academia in that sense.

101
00:06:03,880 --> 00:06:06,400
You get to design your own research projects.

102
00:06:06,400 --> 00:06:08,840
But with that, there are also a lot of responsibilities.

103
00:06:08,840 --> 00:06:13,520
So you have to write grants, you have to make sure you are becoming basically a manager,

104
00:06:13,520 --> 00:06:17,520
you are managing your small lab, you have research students, you have to make sure that

105
00:06:17,520 --> 00:06:21,400
your research students get paid, you have to then reapply for grants and these types

106
00:06:21,400 --> 00:06:22,400
of things.

107
00:06:22,400 --> 00:06:29,400
Which is, if things go well, it's very satisfying, but I must say as a person who likes doing

108
00:06:29,400 --> 00:06:34,680
things, I would like to focus more on the research and let's say also the teaching,

109
00:06:34,680 --> 00:06:38,200
rather than let's say writing grants and these types of things.

110
00:06:38,200 --> 00:06:40,600
So in that sense, it's very different.

111
00:06:40,600 --> 00:06:43,240
You have these responsibilities where you have to do a little bit here and a little

112
00:06:43,240 --> 00:06:44,240
bit there.

113
00:06:44,240 --> 00:06:48,040
So you're getting drawn into different directions, which I would say is not a bad thing.

114
00:06:48,040 --> 00:06:53,080
It's just depending on your personality, whether you prefer that or just to focus on one thing

115
00:06:53,080 --> 00:06:54,680
and doing one thing well.

116
00:06:54,680 --> 00:07:02,080
I must say, I really like doing research, but one thing I didn't like was a bit, let's

117
00:07:02,080 --> 00:07:03,720
say the reviewing system.

118
00:07:03,720 --> 00:07:06,520
I think that's something everyone complains about, peer reviewing.

119
00:07:06,520 --> 00:07:10,600
There's a lot of work to do if you are a peer reviewer.

120
00:07:10,600 --> 00:07:13,580
You get a lot of papers for conferences to review.

121
00:07:13,580 --> 00:07:18,720
But then also as an author, it can sometimes be a little bit demotivating because reviewers

122
00:07:18,720 --> 00:07:23,160
are, I would say, sometimes very critical and not always in a constructive way.

123
00:07:23,160 --> 00:07:27,000
So sometimes we get these almost mean or hostile comments.

124
00:07:27,000 --> 00:07:31,760
And this was something where I was like, I don't know if I want to do that for the rest

125
00:07:31,760 --> 00:07:32,760
of my life.

126
00:07:32,760 --> 00:07:38,440
Same with grant reviews, where sometimes you get these very, I don't know, for no apparent reason,

127
00:07:38,440 --> 00:07:42,600
because sometimes even someone misunderstood your report, you get very mean responses.

128
00:07:42,600 --> 00:07:48,480
And I was like, maybe let me focus more on the good things, building things, teaching

129
00:07:48,480 --> 00:07:50,760
and less on these types of things.

130
00:07:50,760 --> 00:07:54,360
In industry, I mean, there are, of course, other trade-offs.

131
00:07:54,360 --> 00:08:01,600
But I would say what changed for me is that I basically get to focus more on certain things

132
00:08:01,600 --> 00:08:05,560
without having to worry about other things.

133
00:08:05,560 --> 00:08:07,220
I'm not like a manager, basically.

134
00:08:07,220 --> 00:08:11,220
So I like to build things and I like also to teach people.

135
00:08:11,220 --> 00:08:18,200
So I'm glad that I found something where I can focus more on that.

136
00:08:18,200 --> 00:08:19,360
That's great.

137
00:08:19,360 --> 00:08:24,560
Speaking of building things and tinkering, do you remember one of your first projects

138
00:08:24,560 --> 00:08:28,120
in machine learning and what attracted you initially?

139
00:08:28,120 --> 00:08:31,640
Yeah, my first project in machine learning.

140
00:08:31,640 --> 00:08:36,240
I think besides that, I think one was maybe a fun one.

141
00:08:36,240 --> 00:08:42,520
That was back then when I took this data mining class that I mentioned that was a side project

142
00:08:42,520 --> 00:08:46,120
because we had to come up with a class project for that class.

143
00:08:46,120 --> 00:08:50,720
And by the way, that is also something I took inspiration from, from that class.

144
00:08:50,720 --> 00:08:55,560
I also always emphasized in my courses to include little class projects.

145
00:08:55,560 --> 00:08:59,140
It's always something that students found very exciting.

146
00:08:59,140 --> 00:09:02,320
And back then, so there were two things I was working on.

147
00:09:02,320 --> 00:09:06,320
As a student, I was working on fantasy sports predictions.

148
00:09:06,320 --> 00:09:08,840
Back then, I was a big soccer fan.

149
00:09:08,840 --> 00:09:14,960
And there was a website where it was called, I forgot the website, but it was a daily fantasy

150
00:09:14,960 --> 00:09:22,560
sports where you basically assembled a team of players and they got scores based on how

151
00:09:22,560 --> 00:09:25,840
well they performed in the Premier League games on the weekend.

152
00:09:25,840 --> 00:09:31,420
And so it was basically a constrained optimization problem where you had a certain budget and you

153
00:09:31,420 --> 00:09:37,120
wanted to basically maximize, you wanted to predict how many score or what the best players

154
00:09:37,120 --> 00:09:39,240
are based on the budget, basically.

155
00:09:39,240 --> 00:09:42,680
So you couldn't, and there were also other constraints like the formation.

156
00:09:42,680 --> 00:09:44,040
You couldn't have 10 strikers.

157
00:09:44,040 --> 00:09:45,960
You could only have, I think, maximum three strikers.

158
00:09:45,960 --> 00:09:49,360
So it was very interesting.

159
00:09:49,360 --> 00:09:53,680
And based on that, I built machine learning classifiers with scikit-learn, very simple

160
00:09:53,680 --> 00:09:57,480
ones to basically predict what the promising players were.

161
00:09:57,480 --> 00:10:02,600
And that was very interesting as an exercise because that's how I taught myself Pandas,

162
00:10:02,600 --> 00:10:08,520
the data array library or data frame library.

163
00:10:08,520 --> 00:10:10,480
And I tried to automate as much as possible.

164
00:10:10,480 --> 00:10:15,560
So I was also trying to do some simple NLP, going through news articles, basically predicting

165
00:10:15,560 --> 00:10:20,040
the sentiment and extracting names from players who are injured and these types of things.

166
00:10:20,040 --> 00:10:25,880
It was very challenging, but it was a very good exercise to learn data processing and

167
00:10:25,880 --> 00:10:27,440
implementing simple things.

168
00:10:27,440 --> 00:10:31,080
So that was maybe one of my first projects, not related to my PhD at all.

169
00:10:31,080 --> 00:10:33,000
It was more like a side project.

170
00:10:33,000 --> 00:10:38,600
And also I built something called, I think it was called Music Mood.

171
00:10:38,600 --> 00:10:44,840
I called it Music Mood, which was for this class project where it was about predicting

172
00:10:44,840 --> 00:10:49,280
the mood of music in terms of, is this a positive, negative song?

173
00:10:49,280 --> 00:10:54,080
And originally it was the Happy Rock Song project where we had also the genre.

174
00:10:54,080 --> 00:10:56,160
So it was genre and the mood.

175
00:10:56,160 --> 00:10:58,080
And yeah, I turned this into an open source project.

176
00:10:58,080 --> 00:11:05,560
I think I shared, I built a simple website with Flask where people could enter the movie,

177
00:11:05,560 --> 00:11:11,480
sorry, the music lyrics and then get a predicted label, whether it's positive or negative.

178
00:11:11,480 --> 00:11:15,800
And yeah, that was a nice little project because it was also almost like an end-to-end project

179
00:11:15,800 --> 00:11:18,280
where we had to collect our own data.

180
00:11:18,280 --> 00:11:20,360
So it was with two other classmates.

181
00:11:20,360 --> 00:11:24,040
We collected our own data, cleaned the data, built the classifiers and then built that

182
00:11:24,040 --> 00:11:25,880
website also on top of that.

183
00:11:25,880 --> 00:11:30,040
So it was kind of like a pretty comprehensive project.

184
00:11:30,040 --> 00:11:32,360
The machine learning was pretty simple with scikit-learn.

185
00:11:32,360 --> 00:11:36,080
I think we used the random forest classifier, but yeah, a lot of fun, a good exercise, I

186
00:11:36,080 --> 00:11:37,080
think.

187
00:11:37,080 --> 00:11:39,400
Yeah, that's awesome.

188
00:11:39,400 --> 00:11:45,880
I think the best way to get involved is just to find something that you're interested in,

189
00:11:45,880 --> 00:11:48,480
create a project, find some data.

190
00:11:48,480 --> 00:11:52,440
You learn a lot of the skills doing it that way, solving problems that you're interested

191
00:11:52,440 --> 00:11:53,440
in.

192
00:11:53,440 --> 00:11:57,040
If I may ask you before we go to...

193
00:11:57,040 --> 00:11:58,040
Sorry.

194
00:11:58,040 --> 00:12:04,160
If I may ask, what was your first machine learning project, if you can remember, like

195
00:12:04,160 --> 00:12:07,440
on the spot?

196
00:12:07,440 --> 00:12:09,280
It's a really good question.

197
00:12:09,280 --> 00:12:14,000
Well, one of the first ones that I worked on was a...

198
00:12:14,000 --> 00:12:19,680
Basically it was a computer vision project where we wanted to use face recognition or

199
00:12:19,680 --> 00:12:24,600
face detection actually to control a media player.

200
00:12:24,600 --> 00:12:28,200
So if you looked away, the media player would stop.

201
00:12:28,200 --> 00:12:31,200
If you looked at it, then the media player would play.

202
00:12:31,200 --> 00:12:36,200
And then we started to get into different hand recognition.

203
00:12:36,200 --> 00:12:39,320
So if you put your hand up like this, then it would stop.

204
00:12:39,320 --> 00:12:42,020
Like doing it like that, it would raise the volume.

205
00:12:42,020 --> 00:12:43,200
So it was really interesting.

206
00:12:43,200 --> 00:12:52,240
I got to learn about all of the different algorithms that are used to do face detection.

207
00:12:52,240 --> 00:12:55,200
And I learned so much about computer vision.

208
00:12:55,200 --> 00:12:58,400
For me, the amazing part of that was just...

209
00:12:58,400 --> 00:13:02,320
I've always had a really strong background in math.

210
00:13:02,320 --> 00:13:10,120
So being able to take images and converting them into numbers was just kind of mind boggling.

211
00:13:10,120 --> 00:13:11,640
And then you can do a lot of things with them.

212
00:13:11,640 --> 00:13:16,800
But now that you mentioned that, where I think this type of system still lives is if you

213
00:13:16,800 --> 00:13:19,120
use, for example, an iPhone.

214
00:13:19,120 --> 00:13:23,520
And I think they encode or they hide the text messages until you look at them for privacy

215
00:13:23,520 --> 00:13:24,520
reasons.

216
00:13:24,520 --> 00:13:26,680
So I think they're only visible when you look at them.

217
00:13:26,680 --> 00:13:31,880
It kind of reminded me of your system, basically, where it's basically all the time detecting

218
00:13:31,880 --> 00:13:35,200
whether your face is pointing towards the camera if you're looking.

219
00:13:35,200 --> 00:13:39,800
And I think the next level is if it's you who's looking into the camera versus someone else,

220
00:13:39,800 --> 00:13:40,800
basically.

221
00:13:40,800 --> 00:13:42,880
Very interesting, yeah.

222
00:13:42,880 --> 00:13:43,880
Right.

223
00:13:43,880 --> 00:13:45,400
Yeah, it was cool.

224
00:13:45,400 --> 00:13:51,400
It was also really interesting to see when it worked and when it didn't work.

225
00:13:51,400 --> 00:13:54,080
We trained it on perfect conditions, right?

226
00:13:54,080 --> 00:13:57,920
The lighting was perfect and things like that.

227
00:13:57,920 --> 00:14:03,360
And then as soon as things got dimmer, it was much harder to detect faces, obviously.

228
00:14:03,360 --> 00:14:07,760
It were different types of people.

229
00:14:07,760 --> 00:14:14,720
So yeah, we ended up creating our own training data set and it ended up being a lot of fun.

230
00:14:14,720 --> 00:14:17,760
I think that's the best way to get involved.

231
00:14:17,760 --> 00:14:20,880
Yeah, just to find something that you're really interested in.

232
00:14:20,880 --> 00:14:30,840
We didn't need to do the recognition of our fingers and hands for that project, but we

233
00:14:30,840 --> 00:14:36,800
were just so interested in it that we decided to take it one step further.

234
00:14:36,800 --> 00:14:42,360
I find that to be the most rewarding when you're doing it, not just for a class or for

235
00:14:42,360 --> 00:14:43,360
a grade.

236
00:14:43,360 --> 00:14:47,240
You actually are very interested in the project that you're working on.

237
00:14:47,240 --> 00:14:48,240
Super cool.

238
00:14:48,240 --> 00:14:49,240
Yeah.

239
00:14:49,240 --> 00:14:50,240
Yeah.

240
00:14:50,240 --> 00:14:53,560
You mentioned sports and fantasy sports.

241
00:14:53,560 --> 00:14:56,160
That's something that I've been very interested in in the past.

242
00:14:56,160 --> 00:14:58,160
And then music also is one of my interests too.

243
00:14:58,160 --> 00:15:05,840
So it's awesome to hear that you worked on projects in those areas.

244
00:15:05,840 --> 00:15:10,520
Speaking of those sorts of projects, are there any other open source projects that you've

245
00:15:10,520 --> 00:15:13,040
been a contributor for?

246
00:15:13,040 --> 00:15:17,960
I would say back then I was using a lot of scikit-learn and I also contributed a lot

247
00:15:17,960 --> 00:15:19,240
to scikit-learn.

248
00:15:19,240 --> 00:15:23,680
In the recent years, maybe not as much because I got busier with other things.

249
00:15:23,680 --> 00:15:31,300
But yeah, back then we had the ensemble voting classifier, the feature selection, the sequential

250
00:15:31,300 --> 00:15:35,120
feature selection, and some other things where I got to contribute.

251
00:15:35,120 --> 00:15:36,680
And that was a lot of fun.

252
00:15:36,680 --> 00:15:43,720
Besides that, I built my own little hobby library called MLxtend, which is I think

253
00:15:43,720 --> 00:15:49,000
used by a lot of people now because it has this frequent pattern mining submodule

254
00:15:49,000 --> 00:15:50,720
that a lot of people at companies use.

255
00:15:50,720 --> 00:15:54,720
I always see on the discussion board a lot of companies, they have some proprietary data

256
00:15:54,720 --> 00:16:00,360
set about some customer item sets data stuff where they have some questions.

257
00:16:00,360 --> 00:16:04,760
And I think it's very widely used not for machine learning, although it has machine

258
00:16:04,760 --> 00:16:10,080
learning capabilities, mostly for the frequent pattern mining.

259
00:16:10,080 --> 00:16:13,960
But yeah, this was a library essentially because I built a lot of stuff that I needed for my

260
00:16:13,960 --> 00:16:19,420
work like little, let's say functions here and there for normalizing things and also

261
00:16:19,420 --> 00:16:23,280
some other classifiers and so forth, where I just thought, okay, instead of just hiding

262
00:16:23,280 --> 00:16:26,680
them on my computer, I can make them a little bit more general and then I can share them

263
00:16:26,680 --> 00:16:29,920
with the world and then others might find them useful basically.

264
00:16:29,920 --> 00:16:34,640
And yeah, I just grew that library over the years just adding and adding to it.

265
00:16:34,640 --> 00:16:40,360
And the other major one I would say was BioPandas where in computational biology, we work with

266
00:16:40,360 --> 00:16:44,760
these protein structure files and also small molecule structure files.

267
00:16:44,760 --> 00:16:49,920
And we were building back then a virtual screening library where we were making predictions on

268
00:16:49,920 --> 00:16:51,400
millions of molecules.

269
00:16:51,400 --> 00:16:57,400
And for that, you had to parse these molecules in a way that you could process them.

270
00:16:57,400 --> 00:17:02,800
And there were a lot of libraries out there that did something like that.

271
00:17:02,800 --> 00:17:09,720
They basically had some API where they read in these molecule files and then you access

272
00:17:09,720 --> 00:17:14,680
the objects in Python, let's say with a custom API and so forth, which is fine.

273
00:17:14,680 --> 00:17:16,240
But it is like, yeah, you have to learn that.

274
00:17:16,240 --> 00:17:20,920
It's like a specific library and you have to learn how do you get the number of carbon

275
00:17:20,920 --> 00:17:21,920
atoms?

276
00:17:21,920 --> 00:17:25,160
How do you get the position, the coordinates of that atom?

277
00:17:25,160 --> 00:17:30,800
And it is, I think, yeah, it is a bit steep in terms of the learning curve.

278
00:17:30,800 --> 00:17:34,180
And I thought, okay, why make that so complicated?

279
00:17:34,180 --> 00:17:38,960
If we just had a way we can load that protein structure file into a Pandas data frame, I

280
00:17:38,960 --> 00:17:43,040
can just use everything that's already there in Pandas.

281
00:17:43,040 --> 00:17:48,480
I don't have to reinvent the function to compute, let's say, the center of mass using

282
00:17:48,480 --> 00:17:49,480
the coordinates.

283
00:17:49,480 --> 00:17:54,160
I can use all the functions, standard deviations, mean, everything that is in Pandas, to make

284
00:17:54,160 --> 00:17:55,160
that more convenient.

285
00:17:55,160 --> 00:17:59,000
So it's essentially a library where you can convert protein structure files into a Pandas

286
00:17:59,000 --> 00:18:03,800
data frame and then you can do machine learning, you can do statistics, everything on top of

287
00:18:03,800 --> 00:18:07,400
that without having to relearn, let's say, a custom API.

288
00:18:07,400 --> 00:18:11,240
It's basically all in a Pandas data frame.

289
00:18:11,240 --> 00:18:15,080
And other than that, I would say, yeah, these were my main libraries where I contributed

290
00:18:15,080 --> 00:18:18,400
to or that I built basically from scratch back then.

291
00:18:18,400 --> 00:18:24,460
But then over the years, I did a lot of open source stuff, but not necessarily libraries.

292
00:18:24,460 --> 00:18:30,880
What I did more was education, I would say, like writing blog posts, explaining things,

293
00:18:30,880 --> 00:18:36,360
PyTorch and scikit-learn related tutorials or things like, hey, let's implement a principal

294
00:18:36,360 --> 00:18:41,680
component analysis from scratch or let's implement a self-attention mechanism from scratch and

295
00:18:41,680 --> 00:18:46,000
like writing the code, but not necessarily as a library because I think there are already

296
00:18:46,000 --> 00:18:48,320
a lot of efficient implementations out there.

297
00:18:48,320 --> 00:18:53,160
So it doesn't really make sense to reinvent the wheel, but it's more about like, let's

298
00:18:53,160 --> 00:18:58,480
peel back a few layers, make a very simple implementation of that so that people can

299
00:18:58,480 --> 00:19:01,920
read them because that's one thing.

300
00:19:01,920 --> 00:19:06,500
Deep learning libraries are becoming more powerful if we look at PyTorch, for example,

301
00:19:06,500 --> 00:19:09,160
but they are also becoming much, much, much harder to read.

302
00:19:09,160 --> 00:19:13,800
And so if I would ask you to take a look at the convolution operation in PyTorch, I wouldn't

303
00:19:13,800 --> 00:19:17,440
even know where to look in PyTorch to start with.

304
00:19:17,440 --> 00:19:22,080
It's like, I mean, for good reason because they implemented it very efficiently and then

305
00:19:22,080 --> 00:19:24,680
there's CUDA on top of that and stuff like that.

306
00:19:24,680 --> 00:19:29,040
But as a user, if I want to customize or even understand things, it's very hard to look

307
00:19:29,040 --> 00:19:30,040
at the code.

308
00:19:30,040 --> 00:19:35,040
So in that case, I think there's value in peeling back the layers, making a simple implementation

309
00:19:35,040 --> 00:19:38,500
for educational purposes to understand how things work.

310
00:19:38,500 --> 00:19:43,760
So that's something I have also liked doing in recent years, which is why I maybe didn't

311
00:19:43,760 --> 00:19:46,360
contribute so much to the core libraries.

312
00:19:46,360 --> 00:19:52,320
I was more like focusing on the coding for education, essentially.

313
00:19:52,320 --> 00:19:53,320
Right.

314
00:19:53,320 --> 00:19:56,600
Yeah, no, that makes a lot of sense.

315
00:19:56,600 --> 00:19:59,800
I appreciate a lot of the writing that you've done.

316
00:19:59,800 --> 00:20:03,160
I really enjoy your blog.

317
00:20:03,160 --> 00:20:05,960
I think you have a newsletter that I'm following now, too.

318
00:20:05,960 --> 00:20:09,920
I'm looking forward to your new book that's coming out.

319
00:20:09,920 --> 00:20:10,920
Q&AI?

320
00:20:10,920 --> 00:20:14,880
What's the title?

321
00:20:14,880 --> 00:20:18,760
Q&AI, so I can maybe say a few words about that.

322
00:20:18,760 --> 00:20:25,280
So it is essentially, it started because what I do is when I read or learn things I have

323
00:20:25,280 --> 00:20:28,960
for myself, I have flashcards.

324
00:20:28,960 --> 00:20:33,080
Basically I write down questions and answers for myself.

325
00:20:33,080 --> 00:20:39,040
So just, I mean, usually when you write them down, that process helps you learn these things.

326
00:20:39,040 --> 00:20:44,480
And maybe you rarely have to go back to your flashcards because it's not about the memorization

327
00:20:44,480 --> 00:20:45,480
necessarily.

328
00:20:45,480 --> 00:20:46,640
It's more about making the question.

329
00:20:46,640 --> 00:20:51,160
But then also it kind of feels good when you feel like you have read a paper or a book

330
00:20:51,160 --> 00:20:55,360
and then you made these questions for future use so you know you have them written down

331
00:20:55,360 --> 00:20:56,360
somewhere.

332
00:20:56,360 --> 00:20:59,520
And just in case you forget, they are there as flashcards in my software so I can look

333
00:20:59,520 --> 00:21:00,800
them up.

334
00:21:00,800 --> 00:21:06,320
And people on the internet, they ask me sometimes to share these flashcards.

335
00:21:06,320 --> 00:21:08,640
And what I did is I thought, okay, why not?

336
00:21:08,640 --> 00:21:12,120
But let me polish them a little bit up because when I write things for myself, they're usually

337
00:21:12,120 --> 00:21:13,120
not that nice.

338
00:21:13,120 --> 00:21:16,320
They are also, I mean, containing grammar errors or typos.

339
00:21:16,320 --> 00:21:20,240
And I was like, hmm, let me polish them, make them a little bit more clear so that someone

340
00:21:20,240 --> 00:21:22,440
else can read them.

341
00:21:22,440 --> 00:21:25,760
And in that process, these notes became longer and longer.

342
00:21:25,760 --> 00:21:28,540
So they became like fully-fledged answers.

343
00:21:28,540 --> 00:21:32,400
Some of them like, I don't know, I just was in the mood of writing.

344
00:21:32,400 --> 00:21:35,800
And then some of them were like four or five pages long.

345
00:21:35,800 --> 00:21:41,200
And yeah, so one question would be, for example, what's the difference between an embedding,

346
00:21:41,200 --> 00:21:48,440
a latent space, and things like that, essentially, or when are fully connected layers and convolution

347
00:21:48,440 --> 00:21:49,840
layers equivalent?

348
00:21:49,840 --> 00:21:51,560
And all types of questions.

349
00:21:51,560 --> 00:21:55,480
What is the difference between self-attention and the traditional attention mechanism in

350
00:21:55,480 --> 00:21:56,940
RNNs?

351
00:21:56,940 --> 00:22:02,280
What are the multiple GPU training paradigms, like tensor parallelism, data parallelism,

352
00:22:02,280 --> 00:22:03,640
and so forth?

353
00:22:03,640 --> 00:22:06,480
And the answers, they tended to become longer and longer and longer.

354
00:22:06,480 --> 00:22:10,440
And I was like, okay, instead of just, I mean, these are not flashcards anymore.

355
00:22:10,440 --> 00:22:12,480
These are basically book chapters.

356
00:22:12,480 --> 00:22:16,480
So I thought, okay, I could just basically turn that into a book.

357
00:22:16,480 --> 00:22:22,520
And yeah, it's basically machine learning Q&A because it's like a Q&A, it's a question

358
00:22:22,520 --> 00:22:23,720
and an answer.

359
00:22:23,720 --> 00:22:29,920
But then also it was interesting that it's ChatGPT now, so an AI doing the answers.

360
00:22:29,920 --> 00:22:34,880
And as a little gimmick, I thought, because it just came out, why don't I include also

361
00:22:34,880 --> 00:22:37,880
the answers by ChatGPT?

362
00:22:37,880 --> 00:22:41,920
So I have my own answer followed by the ChatGPT answer and a short discussion.

363
00:22:41,920 --> 00:22:49,520
And then readers can tell or can let's say, judge for themselves which answer is appropriate

364
00:22:49,520 --> 00:22:51,660
or inappropriate.

365
00:22:51,660 --> 00:22:56,480
So one thing, of course, ChatGPT cannot create figures and these types of things.

366
00:22:56,480 --> 00:23:00,200
So it's kind of a little bit unfair, but I must say for my comparison, what was very

367
00:23:00,200 --> 00:23:04,720
interesting is that when I wrote the answer, I had at least a very long answer.

368
00:23:04,720 --> 00:23:07,600
ChatGPT was way, way shorter.

369
00:23:07,600 --> 00:23:15,080
Sometimes, or yeah, I would say if you have 10 items, I would say three items are wrong.

370
00:23:15,080 --> 00:23:19,760
ChatGPT answers sometimes contain factually incorrect things.

371
00:23:19,760 --> 00:23:22,760
It's easy for a domain expert to weed them out.

372
00:23:22,760 --> 00:23:26,840
However, what was nice about ChatGPT is it sometimes came up with things I didn't think

373
00:23:26,840 --> 00:23:32,880
about when I, for example, asked about what are some ways we can deal or can improve or

374
00:23:32,880 --> 00:23:34,200
reduce, let's say, overfitting?

375
00:23:34,200 --> 00:23:36,320
What are some techniques for reducing overfitting?

376
00:23:36,320 --> 00:23:41,160
I had quite a long list, explained everything, asked ChatGPT, and it had some, let's say,

377
00:23:41,160 --> 00:23:44,920
wrong answers, but some of them I didn't even think about.

378
00:23:44,920 --> 00:23:45,920
And so that was nice.

379
00:23:45,920 --> 00:23:51,720
It's essentially creating false positives, but it's also having these true positives,

380
00:23:51,720 --> 00:23:53,120
let's say, that you missed.

381
00:23:53,120 --> 00:23:56,440
So it's in a sense, actually pretty good for brainstorming, I would say.

382
00:23:56,440 --> 00:23:58,740
It's actually a pretty good writing companion.

383
00:23:58,740 --> 00:24:03,960
You still have to know a bit about the field because these errors, if I wouldn't know about,

384
00:24:03,960 --> 00:24:07,480
let's say, machine learning, it could be dangerous because it would give me wrong information.

385
00:24:07,480 --> 00:24:15,120
But if you look for inspiration, I do think it's a valuable tool, essentially.

386
00:24:15,120 --> 00:24:17,960
Yeah, definitely.

387
00:24:17,960 --> 00:24:25,760
I was about to say that I use ChatGPT as a brainstorm assistant.

388
00:24:25,760 --> 00:24:27,800
It can help you with drafts.

389
00:24:27,800 --> 00:24:31,960
It can help you write outlines and things like that.

390
00:24:31,960 --> 00:24:35,120
But yeah, there is that danger.

391
00:24:35,120 --> 00:24:41,400
You're a machine learning expert reading about it and you're able to quickly pick out, say,

392
00:24:41,400 --> 00:24:47,040
whatever, 20%, 30% of this information might not be factually correct.

393
00:24:47,040 --> 00:24:51,560
And it does become dangerous when there's someone looking at it and looking at it as

394
00:24:51,560 --> 00:24:58,880
an authority, seeing the output and thinking that it's probably going to be correct.

395
00:24:58,880 --> 00:25:05,920
Yeah, so talking to someone in NLP and machine learning, we brought up ChatGPT.

396
00:25:05,920 --> 00:25:09,320
It took us a little bit, but I guess we could dive into it now.

397
00:25:09,320 --> 00:25:12,520
Yeah, there's no way to avoid it nowadays.

398
00:25:12,520 --> 00:25:15,800
No, can't avoid it.

399
00:25:15,800 --> 00:25:22,600
I know you've been in the field and you've seen the progression.

400
00:25:22,600 --> 00:25:30,360
It seems as if it's like this overnight success, going to a million users in a couple of days.

401
00:25:30,360 --> 00:25:34,800
But obviously, this has been years in the making.

402
00:25:34,800 --> 00:25:42,360
Where I want to start off with is how do you view the gap between the hype of something

403
00:25:42,360 --> 00:25:48,360
like ChatGPT and the generative models now and the reality of AI?

404
00:25:48,360 --> 00:25:51,040
Yeah, so it's interesting.

405
00:25:51,040 --> 00:25:57,280
I would say ChatGPT did a good job in terms of closing the gap, because honestly, I must

406
00:25:57,280 --> 00:25:59,800
say it works pretty well.

407
00:25:59,800 --> 00:26:02,720
And it is impressive.

408
00:26:02,720 --> 00:26:08,000
I don't know how far it scales in terms of would we...

409
00:26:08,000 --> 00:26:14,480
I mean, we can always improve things, but I don't know what, let's say, the rate is

410
00:26:14,480 --> 00:26:18,240
of how we can make it better, I guess, related to the hype.

411
00:26:18,240 --> 00:26:23,280
I think there's a lot of...it's like the same with self-driving cars, I guess, where five

412
00:26:23,280 --> 00:26:26,840
years ago they already had pretty impressive demos.

413
00:26:26,840 --> 00:26:29,240
I haven't seen, to be honest...

414
00:26:29,240 --> 00:26:32,840
I mean, the thing that they don't show you is what they have right now that is not released

415
00:26:32,840 --> 00:26:33,840
yet.

416
00:26:33,840 --> 00:26:39,160
But I do think it's usually the last few percent that are crucial.

417
00:26:39,160 --> 00:26:43,920
I think with self-driving cars, we have been...it's just a number, I don't know for sure, but

418
00:26:43,920 --> 00:26:47,360
I would say we have been there for like 95% now.

419
00:26:47,360 --> 00:26:53,120
Like five years ago, it was almost, let's say, 95% there, almost, let's say, ready.

420
00:26:53,120 --> 00:26:58,760
Now five years later, we are maybe there at 97% or 98%.

421
00:26:58,760 --> 00:27:04,000
But can we get the two last remaining percent points to really nail it, basically to have

422
00:27:04,000 --> 00:27:06,620
them on the roads reliably and so forth?

423
00:27:06,620 --> 00:27:10,320
And that is hard to say with the large language models as well.

424
00:27:10,320 --> 00:27:14,280
I think we can reduce the factually incorrect information, make them more useful and so

425
00:27:14,280 --> 00:27:15,280
forth.

426
00:27:15,280 --> 00:27:23,640
I just don't know how much work it takes to get just a few more percent better performance.

427
00:27:23,640 --> 00:27:29,080
We will see with the next generation, let's say the GPT-4 models and so forth, if they

428
00:27:29,080 --> 00:27:33,240
apply then also the reinforcement learning with human feedback in the loop on top of

429
00:27:33,240 --> 00:27:40,440
it, if it's substantially better, like the same like from GPT-2 to GPT-3.

430
00:27:40,440 --> 00:27:44,120
Maybe it's the same from 3 to 4 where we get, again, mind blown.

431
00:27:44,120 --> 00:27:46,080
But yeah, that is one thing.

432
00:27:46,080 --> 00:27:50,440
The other thing is I think people are chasing, like hype-wise, they see ChatGPT and they

433
00:27:50,440 --> 00:27:54,920
are chasing AGI, like artificial general intelligence.

434
00:27:54,920 --> 00:27:57,560
Yeah, that is an interesting question.

435
00:27:57,560 --> 00:28:03,720
I think no one knows how far we are from AGI.

436
00:28:03,720 --> 00:28:08,040
With ChatGPT, I think there's a lot more hype around AGI.

437
00:28:08,040 --> 00:28:11,320
It appears closer than before, of course, because we have these models.

438
00:28:11,320 --> 00:28:15,320
There are people though who say, okay, this is the totally wrong approach.

439
00:28:15,320 --> 00:28:19,040
We need something completely different if we want to get AGI.

440
00:28:19,040 --> 00:28:20,760
No one knows what that approach looks like.

441
00:28:20,760 --> 00:28:22,720
So it's really hard to say.

442
00:28:22,720 --> 00:28:23,720
That's the thing.

443
00:28:23,720 --> 00:28:28,120
If something hasn't been there before or it doesn't even exist, it's hard to predict

444
00:28:28,120 --> 00:28:29,120
when it will exist.

445
00:28:29,120 --> 00:28:36,120
It's really hard basically to make any reliable or any statement about that, I would say.

446
00:28:36,120 --> 00:28:41,400
The thing though, what I always find interesting is do we need AGI?

447
00:28:41,400 --> 00:28:43,800
More like a philosophical question.

448
00:28:43,800 --> 00:28:46,880
I think AGI is useful as a motivation.

449
00:28:46,880 --> 00:28:52,560
I think it motivates a lot of people to work on AI, to make that progress.

450
00:28:52,560 --> 00:28:59,640
I think without AGI, we wouldn't have maybe things like, I don't know, like what was it

451
00:28:59,640 --> 00:29:06,680
called, the AlphaGo where they basically beat the best player at Go.

452
00:29:06,680 --> 00:29:09,800
Maybe chess, even back then chess.

453
00:29:09,800 --> 00:29:10,800
How is that useful?

454
00:29:10,800 --> 00:29:14,920
I would say maybe AlphaGo and chess engines are not useful, but I think it ultimately

455
00:29:14,920 --> 00:29:20,520
led to AlphaFold, the first version for protein structure prediction, and then AlphaFold2,

456
00:29:20,520 --> 00:29:24,800
which is now based on large language models, or uses large language models.

457
00:29:24,800 --> 00:29:30,040
In that case, I think without large language models and without the desire maybe to develop

458
00:29:30,040 --> 00:29:37,000
AGI, we wouldn't have all the, let's say, very useful things in the natural sciences.

459
00:29:37,000 --> 00:29:44,920
My question is do we need AGI or do we really just need good models for special purposes?

460
00:29:44,920 --> 00:29:50,160
For example, if I want to, I mean there was a paper the other day, accurate weather prediction

461
00:29:50,160 --> 00:29:54,880
with deep learning, like more accurate than the best physics-based simulations that run

462
00:29:54,880 --> 00:29:59,760
on supercomputers with a smaller, let's say more, not smaller, but with a more energy

463
00:29:59,760 --> 00:30:03,320
efficient neural network and more accurate.

464
00:30:03,320 --> 00:30:04,320
Maybe that is sufficient.

465
00:30:04,320 --> 00:30:07,520
Maybe we don't need an AGI that can also predict the weather.

466
00:30:07,520 --> 00:30:12,160
Maybe it's better to just focus on improving that weather prediction engine and separately

467
00:30:12,160 --> 00:30:15,760
improving the protein structure prediction model AlphaFold.

468
00:30:15,760 --> 00:30:20,280
Maybe we don't need to chase something that can do all the things at once.

469
00:30:20,280 --> 00:30:24,680
However, I do think AGI is useful as a motivator to find better algorithms.

470
00:30:24,680 --> 00:30:33,000
So in terms of hype, I think I'm personally, I don't see the purpose of AGI.

471
00:30:33,000 --> 00:30:38,360
Maybe I'm too short-sighted here.

472
00:30:38,360 --> 00:30:44,520
I would say what would we do with AGI besides what people say about replacing humans?

473
00:30:44,520 --> 00:30:51,920
I don't know how that really benefits compared to special purpose applications of machine

474
00:30:51,920 --> 00:30:52,920
learning.

475
00:30:52,920 --> 00:30:53,920
Yeah.

476
00:30:53,920 --> 00:30:54,920
Right.

477
00:30:54,920 --> 00:30:58,160
I mean, you brought up so many interesting points.

478
00:30:58,160 --> 00:31:01,440
I don't even know where to go next.

479
00:31:01,440 --> 00:31:08,120
Let's talk about the use cases for generative models.

480
00:31:08,120 --> 00:31:13,000
So you were mentioning basically, which I love this point, where we're able to get these

481
00:31:13,000 --> 00:31:15,960
models up to a certain level of performance.

482
00:31:15,960 --> 00:31:23,840
Say you can get a model to 90% or 95%, but it's that last 5% that's so hard.

483
00:31:23,840 --> 00:31:29,000
The closer you're getting to that 100%, it's even harder.

484
00:31:29,000 --> 00:31:34,480
It makes me think about when you're training a machine learning model, any model, say even

485
00:31:34,480 --> 00:31:42,320
a text classifier and you have your F1 score at, say, 0.85, how much work can you really

486
00:31:42,320 --> 00:31:45,240
do to get it that much higher?

487
00:31:45,240 --> 00:31:53,000
But I wanted to take a step back and I wanted to talk about basically generative models.

488
00:31:53,000 --> 00:31:54,840
I think there's a lower threshold.

489
00:31:54,840 --> 00:31:59,300
So error can be okay depending on your use case.

490
00:31:59,300 --> 00:32:04,800
So if you're using it for something like just to make a draft, it doesn't need to be 100%

491
00:32:04,800 --> 00:32:11,680
correct because if you're making marketing content, let's say, that could be the product

492
00:32:11,680 --> 00:32:18,280
I'm seeing now Wix is offering a complete generative, using generative models to create

493
00:32:18,280 --> 00:32:19,920
your whole website.

494
00:32:19,920 --> 00:32:21,280
That's amazing.

495
00:32:21,280 --> 00:32:23,960
That solves the cold start problem.

496
00:32:23,960 --> 00:32:28,040
It gives you so many options you can build off of it.

497
00:32:28,040 --> 00:32:30,320
But then there's the other part.

498
00:32:30,320 --> 00:32:36,660
There's predictive models where you're, say, you're categorizing something and you need

499
00:32:36,660 --> 00:32:41,600
it to be very close to 100% correct.

500
00:32:41,600 --> 00:32:45,680
Depending on your use case.

501
00:32:45,680 --> 00:32:52,540
And then you bring up AGI, artificial general intelligence.

502
00:32:52,540 --> 00:32:55,760
I think everybody thinks about it a little bit differently.

503
00:32:55,760 --> 00:32:58,360
Everybody has a different sense of it.

504
00:32:58,360 --> 00:33:00,480
Everyone has a different definition.

505
00:33:00,480 --> 00:33:02,640
Are we trying to replicate humans?

506
00:33:02,640 --> 00:33:07,440
Are we trying to replicate human intelligence?

507
00:33:07,440 --> 00:33:12,360
If that's the case, then I personally don't think that large language models are the way

508
00:33:12,360 --> 00:33:13,360
to go.

509
00:33:13,360 --> 00:33:18,280
There are certain things that I think about like from GPT-2 to GPT-3, one thing that's

510
00:33:18,280 --> 00:33:24,920
very interesting is when you, by orders of magnitude, add all these parameters.

511
00:33:24,920 --> 00:33:29,400
There are these emergent capabilities, which is really interesting.

512
00:33:29,400 --> 00:33:34,080
I think in one of them, you're reading so much of the English language, so you're going

513
00:33:34,080 --> 00:33:39,400
to learn how to make grammatically correct sentences, and then you're going to learn

514
00:33:39,400 --> 00:33:41,880
different relationships between things.

515
00:33:41,880 --> 00:33:48,000
All of that stuff is amazing, but there's more to it, I think, than that.

516
00:33:48,000 --> 00:33:51,640
Just being able to predict the next word.

517
00:33:51,640 --> 00:33:57,120
The reinforcement and human in the loop piece of it is definitely going to, as you were

518
00:33:57,120 --> 00:34:06,120
saying, minimize the amount of factually incorrect responses.

519
00:34:06,120 --> 00:34:07,120
What do you think?

520
00:34:07,120 --> 00:34:14,120
Do you think that our goal should be to try to replicate human intelligence, or do you

521
00:34:14,120 --> 00:34:19,600
think we should be specializing in certain systems or certain use cases?

522
00:34:19,600 --> 00:34:28,040
I personally think, for the sake of developing more efficient learning algorithms or alternative

523
00:34:28,040 --> 00:34:34,840
learning algorithms, I do think it makes sense to get inspired by, let's say, replicating

524
00:34:34,840 --> 00:34:36,760
human intelligence.

525
00:34:36,760 --> 00:34:39,840
But I would say if it doesn't work, that's fine too.

526
00:34:39,840 --> 00:34:46,040
The classic example is really airplanes or submarines, where airplanes are inspired by

527
00:34:46,040 --> 00:34:47,040
birds.

528
00:34:47,040 --> 00:34:51,160
It's like, hey, birds can fly, they have wings, can we build something similar?

529
00:34:51,160 --> 00:34:54,800
Turns out the airplane is very different, it doesn't flap the wings, but it gets the

530
00:34:54,800 --> 00:34:55,800
job done.

531
00:34:55,800 --> 00:34:59,360
In that case, we don't need to mimic how birds fly.

532
00:34:59,360 --> 00:35:04,760
In the same sense, we probably don't have to mimic how, let's say, humans learn and

533
00:35:04,760 --> 00:35:10,840
think, although I do think it would help understanding that because there might be more inspiration

534
00:35:10,840 --> 00:35:12,940
that we can use for these models.

535
00:35:12,940 --> 00:35:18,380
One thing is also related to that, ensemble methods are, so building an ensemble of different

536
00:35:18,380 --> 00:35:23,440
methods is usually something to improve, how you can, let's say, make more robust and accurate

537
00:35:23,440 --> 00:35:26,000
predictions.

538
00:35:26,000 --> 00:35:30,600
Ensemble methods usually work best if you have an ensemble of different methods, if

539
00:35:30,600 --> 00:35:36,200
there's no correlation in terms of how they work, so they are not redundant basically.

540
00:35:36,200 --> 00:35:41,320
That is also one argument why it makes sense to maybe approach the problem from different

541
00:35:41,320 --> 00:35:45,320
angles to produce totally different systems that we can then combine.
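The point about uncorrelated ensemble members can be illustrated with a small calculation. This is a minimal sketch under idealized assumptions that are not from the conversation: a binary task, five models that are each correct 70% of the time, and errors that are either fully independent or fully shared. The function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a majority of n independent models is correct.

    Assumes a binary task, an odd number of models, and that each model
    is correct with probability p independently of the others.
    """
    needed = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


single = 0.7
# Five models whose errors are uncorrelated: the vote beats any single model.
independent = majority_vote_accuracy(5, single)
# Five perfectly correlated models always agree, so the vote adds nothing.
redundant = single

print(f"independent ensemble: {independent:.3f}, redundant ensemble: {redundant:.3f}")
```

Real models are never fully independent, so the gain in practice sits somewhere between these two extremes.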

542
00:35:45,320 --> 00:35:52,040
I think that's also interesting from the perspective of how people try to implement large language

543
00:35:52,040 --> 00:35:56,040
models as part of a search engine because I feel like, yeah, we don't, so it's kind

544
00:35:56,040 --> 00:36:00,280
of related to artificial general intelligence, where maybe we don't need one

545
00:36:00,280 --> 00:36:05,520
system that solves it all because, for example, with ChatGPT, it can do math, it's some of

546
00:36:05,520 --> 00:36:11,840
the emergent capabilities that you mentioned, but it's not useful for simple math, like

547
00:36:11,840 --> 00:36:20,240
if you say multiply 13 by 117 or something like that, it's maybe not useful to use

548
00:36:20,240 --> 00:36:21,420
ChatGPT for that.

549
00:36:21,420 --> 00:36:24,760
We have a calculator that can do that accurately, that doesn't need to be trained, there are

550
00:36:24,760 --> 00:36:25,760
simple rules.

551
00:36:25,760 --> 00:36:31,280
Yeah, so in that case, what we need is more like identification of what we need to get

552
00:36:31,280 --> 00:36:32,280
the job done.

553
00:36:32,280 --> 00:36:38,240
Maybe having, like Siri, what Siri is doing is it's parsing the language.

554
00:36:38,240 --> 00:36:42,800
I mean, besides the fact that it doesn't work well, but let's say it would work better in

555
00:36:42,800 --> 00:36:44,960
parsing your input.

556
00:36:44,960 --> 00:36:48,560
What it does, it's rerouting your input to the appropriate application on your phone.

557
00:36:48,560 --> 00:36:55,040
I think if you set a timer, it will use the timer app on your phone or if you do a calculation,

558
00:36:55,040 --> 00:36:56,880
it will use the calculator app.

559
00:36:56,880 --> 00:37:01,160
So it's not trying to do everything itself, it's trying to delegate.

560
00:37:01,160 --> 00:37:03,860
And I think with AI, I think that's the same thing.

561
00:37:03,860 --> 00:37:09,400
If we ask it to maybe compose text, the AI itself might be the best way to do that.

562
00:37:09,400 --> 00:37:16,080
If we want factual information, maybe sometimes just extracting information from an existing

563
00:37:16,080 --> 00:37:22,200
Wikipedia page might be more efficient than having it answer that itself.

564
00:37:22,200 --> 00:37:28,000
So I'm not saying it's not necessary to use an LLM, but the LLM here would be more efficient

565
00:37:28,000 --> 00:37:33,920
at going to that website and summarizing the text rather than rewriting the text, basically,

566
00:37:33,920 --> 00:37:36,240
if you are looking for an answer.

567
00:37:36,240 --> 00:37:42,320
I think that is one thing we could focus on, on how to basically delegate more efficiently

568
00:37:42,320 --> 00:37:48,880
and building an AI that, let's say, delegates rather than tries to solve everything, in

569
00:37:48,880 --> 00:37:49,880
my opinion.
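The delegation idea described here can be sketched as a tiny router. Everything in it is hypothetical and only for illustration: `route`, `calculate`, and `text_model` are invented names, the "language model" is a stub that returns a placeholder string, and the pattern matching is far cruder than what a real assistant would use.

```python
import ast
import operator
import re

# Exact arithmetic rules; no trained model is involved in this path.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculate(expression: str):
    """Evaluate +, -, *, / on plain numbers by walking the parsed expression."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        raise ValueError(f"unsupported expression: {expression!r}")

    return _eval(ast.parse(expression, mode="eval"))


def text_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to a language model."""
    return f"[draft text for: {prompt}]"


_MULTIPLY = re.compile(r"multiply\s+(\d+)\s+by\s+(\d+)", re.IGNORECASE)
_ARITHMETIC = re.compile(r"^[\d\s.+\-*/()]+$")


def route(prompt: str) -> str:
    """Send arithmetic to the calculator and everything else to the text model."""
    match = _MULTIPLY.search(prompt)
    if match:
        return str(calculate(f"{match.group(1)} * {match.group(2)}"))
    if _ARITHMETIC.match(prompt.strip()):
        return str(calculate(prompt))
    return text_model(prompt)


print(route("multiply 13 by 117"))       # handled by the calculator
print(route("Write a product tagline"))  # handled by the text model stub
```

The design choice is that the router only decides which tool gets the request; the exact computation stays with the tool that has definite rules.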

570
00:37:49,880 --> 00:37:55,160
And also to your point, the AI doesn't even have to be correct all the time when creating

571
00:37:55,160 --> 00:38:02,440
text as long as we use it as a template, basically, not as the end product.

572
00:38:02,440 --> 00:38:08,520
So I think ChatGPT, the main use for me, how I use it is to help me write texts, but

573
00:38:08,520 --> 00:38:09,720
I'm filling in the blanks.

574
00:38:09,720 --> 00:38:15,080
I'm not like, if I want a text about something, I usually write the text myself first, then

575
00:38:15,080 --> 00:38:19,080
I say, hey, ChatGPT, rewrite this, and I see if I like it more or less.

576
00:38:19,080 --> 00:38:21,960
I take certain sentences, and then I even tweak them afterwards.

577
00:38:21,960 --> 00:38:27,520
I'm not really literally copy and pasting anything or in the same way with information.

578
00:38:27,520 --> 00:38:34,680
So there was another LLM, I think it was called Galaxy something, or Galactica, I think, Galactica,

579
00:38:34,680 --> 00:38:39,080
where they had an AI or LLM that was writing research papers.

580
00:38:39,080 --> 00:38:44,120
I think there was this misconception that you let it write the whole research paper.

581
00:38:44,120 --> 00:38:47,440
I see it more as something that writes the template for a research paper.

582
00:38:47,440 --> 00:38:51,080
It's more like, I would say, a sophisticated template builder.

583
00:38:51,080 --> 00:38:55,840
I think it would have been better if it wouldn't fill in numbers or any factual information.

584
00:38:55,840 --> 00:39:00,720
It would leave blanks so that it's more clear to a human, like, hey, you have to fill in

585
00:39:00,720 --> 00:39:05,120
the numbers and the details, and they're not provided by the machine learning AI system,

586
00:39:05,120 --> 00:39:06,120
basically.

587
00:39:06,120 --> 00:39:13,440
So I think having these models, it's essentially about using them responsibly, essentially.

588
00:39:13,440 --> 00:39:14,920
Yeah.

589
00:39:14,920 --> 00:39:17,640
You bring up so many interesting points.

590
00:39:17,640 --> 00:39:24,600
To talk about the different tasks that you want to complete, I see a future where, yeah,

591
00:39:24,600 --> 00:39:31,720
depending on what prompt, basically, you are asking, you could use something that's rule

592
00:39:31,720 --> 00:39:36,000
based or it could pull up the correct tool.

593
00:39:36,000 --> 00:39:45,720
The sibling or predecessor of ChatGPT, InstructGPT, sort of was going into that, how you can take

594
00:39:45,720 --> 00:39:49,080
an initial prompt and then have some follow ups.

595
00:39:49,080 --> 00:39:53,680
That's what's really nice about ChatGPT as well, that you can sort of take the output

596
00:39:53,680 --> 00:39:58,280
and you can say, make it longer, make it shorter.

597
00:39:58,280 --> 00:40:06,680
I saw another recent paper, Toolformer, basically showing some examples of how to use tools,

598
00:40:06,680 --> 00:40:12,960
you can basically combine the power of large language models and using third party tools.

599
00:40:12,960 --> 00:40:18,240
I think it's this ability to sort of find that hybrid approach.

600
00:40:18,240 --> 00:40:23,800
When a rule is the right approach and when should you be using more advanced systems,

601
00:40:23,800 --> 00:40:27,260
which is kind of like always a question.

602
00:40:27,260 --> 00:40:31,080
Can you make it simpler?

603
00:40:31,080 --> 00:40:35,000
It's like this saying, if you have a hammer, everything looks like a nail.

604
00:40:35,000 --> 00:40:40,720
I think this is right now a little bit true with ChatGPT because we just have fun with

605
00:40:40,720 --> 00:40:41,720
it.

606
00:40:41,720 --> 00:40:45,480
Let me see if it can do this and that, but it doesn't mean we should be using it for

607
00:40:45,480 --> 00:40:48,000
everything.

608
00:40:48,000 --> 00:40:52,720
The question, basically the next level, would be when to use AI and when

609
00:40:52,720 --> 00:40:55,000
not to use AI basically.

610
00:40:55,000 --> 00:40:58,200
Right now we are using AI for a lot of things because it's exciting and we want to see how

611
00:40:58,200 --> 00:41:02,920
far we can push it until it breaks or doesn't work.

612
00:41:02,920 --> 00:41:08,520
Sometimes we have nonsensical applications of AI because of that, like training a calculator,

613
00:41:08,520 --> 00:41:12,280
a neural network that can do calculation, that doesn't really make sense.

614
00:41:12,280 --> 00:41:18,240
But there are examples where I think reinforcement learning found a more efficient matrix multiplication

615
00:41:18,240 --> 00:41:21,900
algorithm, more like the algorithm itself finding that.

616
00:41:21,900 --> 00:41:26,440
That makes sense, something you as human wouldn't think about, but we wouldn't let it do the

617
00:41:26,440 --> 00:41:30,920
matrix multiplication itself because it's not deterministic in a sense.

618
00:41:30,920 --> 00:41:34,640
You don't know if it's going to be correct or not depending on your inputs.

619
00:41:34,640 --> 00:41:39,640
There are definite rules that we can use, so why make it, let's say, approximate when

620
00:41:39,640 --> 00:41:42,200
we can have it accurate?

621
00:41:42,200 --> 00:41:44,200
Yeah.

622
00:41:44,200 --> 00:41:50,380
I think that that's something in the machine learning field that's really such an interesting

623
00:41:50,380 --> 00:41:55,640
area that deserves more research.

624
00:41:55,640 --> 00:41:59,120
Machine learning models are going to make predictions.

625
00:41:59,120 --> 00:42:00,840
There are systems where it might not.

626
00:42:00,840 --> 00:42:04,360
It doesn't have high enough confidence to make a prediction, but when it makes a prediction,

627
00:42:04,360 --> 00:42:06,480
usually it's like it's usually binary.

628
00:42:06,480 --> 00:42:08,760
It's usually like it made a prediction.

629
00:42:08,760 --> 00:42:14,240
This is what it thinks the answer is, but it doesn't give you that confidence level.

630
00:42:14,240 --> 00:42:18,300
You know how when you're talking with a human, you can kind of tell how confident someone

631
00:42:18,300 --> 00:42:21,520
is when they're saying something, when they're saying it, or they might validate it.

632
00:42:21,520 --> 00:42:26,380
They might say, oh, I think I heard about this.

633
00:42:26,380 --> 00:42:31,520
That's lost when you're talking about ChatGPT.

634
00:42:31,520 --> 00:42:36,360
On top of that, one thing is also there's a whole branch of research showing that neural

635
00:42:36,360 --> 00:42:41,120
networks are typically overconfident on out of distribution data.

636
00:42:41,120 --> 00:42:46,240
What happens is if you have data that is slightly different from your training data or let's

637
00:42:46,240 --> 00:42:51,600
say out of the distribution, the network will, if you program it to give a confidence score

638
00:42:51,600 --> 00:42:57,960
as part of the output, this score for the data where it's especially wrong is usually

639
00:42:57,960 --> 00:42:58,960
overconfident.

640
00:42:58,960 --> 00:43:05,200
It's overestimating its confidence, which makes it even more dangerous.

641
00:43:05,200 --> 00:43:10,760
Even the confidence score, let's say, it's misleading if it's a tricky problem, which

642
00:43:10,760 --> 00:43:14,680
is kind of ironic or paradoxical.
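One way to see why this can happen is a toy one-dimensional logistic classifier. This is only a sketch: the weight and bias are made-up values rather than a trained model, and the training data is assumed to lie roughly in the range -1 to 1. Because the score grows linearly with the input, the sigmoid saturates for inputs far from anything seen in training.

```python
import math


def confidence(x: float, weight: float = 2.0, bias: float = 0.0) -> float:
    """Confidence in the predicted class for a 1-D logistic classifier.

    The weight and bias are hypothetical values chosen for illustration.
    """
    p_positive = 1.0 / (1.0 + math.exp(-(weight * x + bias)))
    return max(p_positive, 1.0 - p_positive)


# An input close to the assumed training range gets a moderate confidence.
in_distribution = confidence(0.3)
# An input far outside that range gets a confidence that is essentially 1,
# even though the model has no evidence about this region at all.
far_away = confidence(50.0)

print(f"in distribution: {in_distribution:.3f}, far away: {far_away:.3f}")
```

Deep networks are more complicated than this, but the sketch shows how a confidence score can rise exactly where the model knows the least.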

643
00:43:14,680 --> 00:43:16,280
It's kind of an interesting research problem.

644
00:43:16,280 --> 00:43:20,440
There are methods that try to address that, but it's not out of the box.

645
00:43:20,440 --> 00:43:22,760
It's a lot of extra effort.

646
00:43:22,760 --> 00:43:24,440
It's an ongoing research field.

647
00:43:24,440 --> 00:43:31,360
Like you said, even if we had the confidence scores, it would be hard to use them or trust

648
00:43:31,360 --> 00:43:32,360
them.

649
00:43:32,360 --> 00:43:34,320
But also you bring up a good point.

650
00:43:34,320 --> 00:43:38,960
So ChatGPT doesn't give us any confidence about anything, but then there's also, I mean,

651
00:43:38,960 --> 00:43:43,960
an even better example, I think where it's more clear is this classifier they developed

652
00:43:43,960 --> 00:43:53,520
to classify whether text is written by an AI or a human where they have different labels

653
00:43:53,520 --> 00:43:58,440
like likely or not likely generated by an AI or something like that.

654
00:43:58,440 --> 00:43:59,720
And yeah, it's just a label.

655
00:43:59,720 --> 00:44:01,800
So you trust it or not.

656
00:44:01,800 --> 00:44:07,040
And for example, when I used Shakespeare's Macbeth text in there, it predicted it was likely

657
00:44:07,040 --> 00:44:08,200
generated by AI.

658
00:44:08,200 --> 00:44:09,520
It's just a label.

659
00:44:09,520 --> 00:44:11,720
And well, what do you do with that?

660
00:44:11,720 --> 00:44:20,400
It's like totally wrong, because Shakespeare was around way before AI was a thing.

661
00:44:20,400 --> 00:44:26,000
So there's another approach called GPTZero where the researcher who developed that

662
00:44:26,000 --> 00:44:27,160
just gives you a score.

663
00:44:27,160 --> 00:44:29,720
It's only the perplexity score.

664
00:44:29,720 --> 00:44:32,960
And then you as a human, you have to compare it and think about it, which is maybe a better

665
00:44:32,960 --> 00:44:34,520
approach than just giving a label.
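For readers unfamiliar with the term, perplexity has a standard definition: the exponential of the average negative log-probability a language model assigns to each token. The sketch below uses that general definition with made-up token probabilities; it is not taken from any particular detection tool, and reading low perplexity as "more predictable to the model" is the usual interpretation, stated here as an assumption.

```python
import math


def perplexity(token_probs: list[float]) -> float:
    """exp of the average negative log-probability over the tokens."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))


# Hypothetical per-token probabilities from some language model.
predictable = perplexity([0.9, 0.8, 0.95, 0.9])   # the model expected this text
surprising = perplexity([0.1, 0.05, 0.2, 0.1])    # the model did not expect this text

print(f"predictable: {predictable:.2f}, surprising: {surprising:.2f}")
```

A raw score like this leaves the judgment to the reader, which is the contrast with a fixed label that the speaker draws.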

666
00:44:34,520 --> 00:44:37,440
But yeah, you bring up a good point.

667
00:44:37,440 --> 00:44:42,520
We just take it for granted or we just take a score and yeah, we use it.

668
00:44:42,520 --> 00:44:47,240
And it's maybe out of convenience because that's the simplest user interface.

669
00:44:47,240 --> 00:44:53,640
But with things like machine learning, yeah, it is, depending on the application, tricky.

670
00:44:53,640 --> 00:44:55,120
Yeah.

671
00:44:55,120 --> 00:44:59,760
And I think that's definitely a problem that machine learning practitioners should try

672
00:44:59,760 --> 00:45:03,240
to address, but it's extremely difficult, right?

673
00:45:03,240 --> 00:45:08,680
Especially as humans, we're trying to interpret these very complex machine learning, deep

674
00:45:08,680 --> 00:45:15,160
learning models and something that's out of distribution and it's trying to make a prediction

675
00:45:15,160 --> 00:45:19,880
on it and you get a prediction and the prediction is high confidence.

676
00:45:19,880 --> 00:45:24,040
And it's like, that doesn't even, it's like, why?

677
00:45:24,040 --> 00:45:26,080
It doesn't even make sense.

678
00:45:26,080 --> 00:45:33,600
It's a little scary because sometimes like, so take an active learning system where you're

679
00:45:33,600 --> 00:45:36,840
going to label samples that have low confidence.

680
00:45:36,840 --> 00:45:40,760
And then like those high confidence ones are just going to slip through.

681
00:45:40,760 --> 00:45:45,880
Yeah, in that case, it would be achieving totally the opposite of what you want because

682
00:45:45,880 --> 00:45:50,040
yeah, it will give you the high confidence for the ones that you actually need to label

683
00:45:50,040 --> 00:45:51,680
because they are so different.
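
The failure mode described here can be sketched with least-confidence sampling, one common active learning strategy. The probabilities below are made up for illustration, and `least_confident` is a hypothetical helper, not part of any library.

```python
def least_confident(probabilities, budget):
    """Return indices of the `budget` samples whose top class probability is lowest."""
    confidence = [max(p) for p in probabilities]
    ranked = sorted(range(len(probabilities)), key=lambda i: confidence[i])
    return ranked[:budget]


# Hypothetical softmax outputs for four unlabeled samples (made-up numbers).
# Sample 3 stands in for an out-of-distribution input that the model gets
# wrong while still being very confident.
probabilities = [
    [0.55, 0.45],  # sample 0: uncertain
    [0.90, 0.10],  # sample 1: fairly confident
    [0.52, 0.48],  # sample 2: uncertain
    [0.99, 0.01],  # sample 3: overconfident on an unusual input
]

to_label = least_confident(probabilities, budget=2)
print(to_label)  # sample 3 is never selected for labeling
```

Because selection relies only on the model's own confidence, the overconfident sample never reaches the labeling queue.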

684
00:45:51,680 --> 00:45:57,160
Yeah, it's essentially antagonistic or adversarial.

685
00:45:57,160 --> 00:45:58,480
Yeah.

686
00:45:58,480 --> 00:45:59,480
Yeah.

687
00:45:59,480 --> 00:46:08,120
I mean, it makes you think about just how many moving parts there are with machine learning

688
00:46:08,120 --> 00:46:14,340
and just trying to understand, it's so important to understand every aspect of it.

689
00:46:14,340 --> 00:46:15,720
It's not just the algorithm.

690
00:46:15,720 --> 00:46:20,320
It's not just the newest language model.

691
00:46:20,320 --> 00:46:25,540
Sometimes it's like common sense things, understanding the data, understanding the output, why are

692
00:46:25,540 --> 00:46:27,220
you making this?

693
00:46:27,220 --> 00:46:29,960
How is it going to be used?

694
00:46:29,960 --> 00:46:30,960
Those sorts of things.

695
00:46:30,960 --> 00:46:33,400
And I want to say we are complaining here.

696
00:46:33,400 --> 00:46:34,400
Oh yes.

697
00:46:34,400 --> 00:46:39,680
I'm sorry, I just wanted to say we are complaining about this here, that machine learning

698
00:46:39,680 --> 00:46:46,640
systems make these mistakes and we don't get the scores and we don't interpret them.

699
00:46:46,640 --> 00:46:49,840
It's something to think about, I wanted to say, but it is challenging.

700
00:46:49,840 --> 00:46:55,440
It is not that I would say people who are working on this, they are trying their best.

701
00:46:55,440 --> 00:47:00,780
They put a lot of effort into improving that and getting the best out of it as possible.

702
00:47:00,780 --> 00:47:06,360
It is just such a hard problem that I think it needs more time and work.

703
00:47:06,360 --> 00:47:10,240
We are trying to do the best we can or most researchers are doing the best they can when

704
00:47:10,240 --> 00:47:11,240
they release the products.

705
00:47:11,240 --> 00:47:13,040
It's just such a hard problem.

706
00:47:13,040 --> 00:47:16,040
So I would say there's no one to blame about that.

707
00:47:16,040 --> 00:47:19,000
It's just how hard this problem is.

708
00:47:19,000 --> 00:47:21,880
Yeah, of course.

709
00:47:21,880 --> 00:47:25,760
I didn't mean to say it in that sense.

710
00:47:25,760 --> 00:47:30,840
There's an interesting trend that I've found actually with machine learning practitioners

711
00:47:30,840 --> 00:47:36,440
after they work in the field for a certain amount of time, many then shift their focus

712
00:47:36,440 --> 00:47:44,760
into AI ethics, which is exactly trying to address these types of problems, which I find

713
00:47:44,760 --> 00:47:47,600
that to be very, very interesting.

714
00:47:47,600 --> 00:47:53,360
The more I work in this field, you have to think about those things.

715
00:47:53,360 --> 00:47:59,240
Yeah, that's actually a good point because I think it makes a lot of sense to start with

716
00:47:59,240 --> 00:48:04,600
machine learning and then go into AI ethics because then you basically get exposed to

717
00:48:04,600 --> 00:48:06,840
all the problems that exist.

718
00:48:06,840 --> 00:48:11,680
But you also notice that it's maybe not so trivial because I think it's easy to say,

719
00:48:11,680 --> 00:48:14,880
well, this is not good and this is a problem.

720
00:48:14,880 --> 00:48:17,640
Fixing it is the more difficult problem really.

721
00:48:17,640 --> 00:48:22,160
And I think experiencing the maybe frustration around machine learning, that's a good way

722
00:48:22,160 --> 00:48:28,080
to also be prepared for what's possible and what's not and what we could do.

723
00:48:28,080 --> 00:48:33,040
And I think it is frustrating sometimes to work with machine learning systems because

724
00:48:33,040 --> 00:48:39,040
we train these classifiers and then we see exactly, okay, this gets this input wrong,

725
00:48:39,040 --> 00:48:40,880
but we don't know why this particular input.

726
00:48:40,880 --> 00:48:44,580
We can maybe include more training examples of this particular input.

727
00:48:44,580 --> 00:48:45,580
We improve the system.

728
00:48:45,580 --> 00:48:49,520
It doesn't get this one wrong anymore, but then it gets something else wrong instead.

729
00:48:49,520 --> 00:48:53,440
And it's really like you're trying to fix one thing, the other thing breaks.

730
00:48:53,440 --> 00:48:56,320
And it is very, very challenging.

731
00:48:56,320 --> 00:48:57,800
Yeah.

732
00:48:57,800 --> 00:48:59,200
Yeah.

733
00:48:59,200 --> 00:49:04,040
Yeah, these are very, very tricky problems.

734
00:49:04,040 --> 00:49:10,400
And it's nice to have the chance to discuss this with somebody that's dealt with these

735
00:49:10,400 --> 00:49:11,400
problems.

736
00:49:11,400 --> 00:49:15,920
And yeah, it makes sense after you are applying machine learning and understanding maybe some

737
00:49:15,920 --> 00:49:25,200
of the pitfalls to then transition into some more of like the AI ethics sorts of questions.

738
00:49:25,200 --> 00:49:32,000
To change things up, not really though, but in the spirit of learning from machine learning,

739
00:49:32,000 --> 00:49:36,300
let's zoom back to someone who's just starting out in the field.

740
00:49:36,300 --> 00:49:41,520
What advice would you give to someone that's just starting out in machine learning?

741
00:49:41,520 --> 00:49:45,440
I would say, yeah, that's tricky.

742
00:49:45,440 --> 00:49:51,680
I don't want to give anyone wrong advice, but I would say machine learning is a big

743
00:49:51,680 --> 00:49:52,680
field.

744
00:49:52,680 --> 00:49:58,120
I think even like what we just covered, there are so many moving parts that are involved.

745
00:49:58,120 --> 00:50:02,480
And I mean, even zooming back, we have predictions, we have generative models, we have computer

746
00:50:02,480 --> 00:50:06,320
vision, we have natural language processing and all kinds of different fields.

747
00:50:06,320 --> 00:50:09,840
And then for each field, we have different approaches for generative modeling.

748
00:50:09,840 --> 00:50:16,160
We have, let's say just for images, we have autoencoders, diffusion models, generative

749
00:50:16,160 --> 00:50:18,400
adversarial networks and so forth.

750
00:50:18,400 --> 00:50:23,640
And they're all kind of like almost fundamentally different in terms of how they work.

751
00:50:23,640 --> 00:50:29,580
And it can be very, very, very overwhelming, I think, when you start out.

752
00:50:29,580 --> 00:50:35,080
So I would say, honestly, I would start with the book or a course and just work through

753
00:50:35,080 --> 00:50:41,520
that with, I would say, almost with blinders on, not getting distracted by other, let's

754
00:50:41,520 --> 00:50:45,760
say, resources at that point, just working through that.

755
00:50:45,760 --> 00:50:50,640
Because I think that happens to me all the time, I get distracted by something else,

756
00:50:50,640 --> 00:50:52,680
I look it up and then it's like a rabbit hole.

757
00:50:52,680 --> 00:50:55,200
And then you feel like, wow, there's so much to learn.

758
00:50:55,200 --> 00:50:58,160
And then you get frustrated and overwhelmed because it's like, oh, the day only has

759
00:50:58,160 --> 00:51:01,800
24 hours, I can't possibly ever learn it all.

760
00:51:01,800 --> 00:51:06,640
So I think really doing one thing at a time, like step by step, it's a marathon, not a

761
00:51:06,640 --> 00:51:09,720
sprint, I would say.

762
00:51:09,720 --> 00:51:17,280
So I think, I would say, yeah, take it slowly, enjoy it, make sure you have fun, try not

763
00:51:17,280 --> 00:51:19,600
to do it all at once.

764
00:51:19,600 --> 00:51:25,360
And maybe also finding a balance between trying things out or maybe implementing some ideas

765
00:51:25,360 --> 00:51:28,080
in a project after reading about them.

766
00:51:28,080 --> 00:51:32,280
And then going back to reading about more things, trying them out.

767
00:51:32,280 --> 00:51:39,240
So having a balance between soaking up knowledge also and trying out things you learned about.

768
00:51:39,240 --> 00:51:44,160
Yeah, I think that's really good advice.

769
00:51:44,160 --> 00:51:49,920
It's interesting, when someone asks me, oh, how can I learn about machine learning, there's

770
00:51:49,920 --> 00:51:53,960
no shortage of resources out there, right?

771
00:51:53,960 --> 00:52:00,240
There's no shortage of new material coming out, but it's sort of like hacking through

772
00:52:00,240 --> 00:52:06,160
the weeds and staying on a path to get yourself to a point where you can understand a certain

773
00:52:06,160 --> 00:52:07,880
level of the basics.

774
00:52:07,880 --> 00:52:13,320
You don't need to know every paper that's coming out daily, right?

775
00:52:13,320 --> 00:52:14,360
It's not necessary.

776
00:52:14,360 --> 00:52:16,840
It's much more important to understand the basics.

777
00:52:16,840 --> 00:52:22,560
So you're setting yourself up for a future of success basically.

778
00:52:22,560 --> 00:52:29,800
In a similar vein, if you have anything, what's one piece of advice that you've received that

779
00:52:29,800 --> 00:52:34,600
has helped you along your machine learning journey?

780
00:52:34,600 --> 00:52:37,360
That's a good question.

781
00:52:37,360 --> 00:52:44,560
Off the top of my head, I wouldn't have a good, let's say, piece of advice that someone gave particularly

782
00:52:44,560 --> 00:52:45,560
to me.

783
00:52:45,560 --> 00:52:51,480
But I would say going back to the Andrew Ng class that we talked about in the beginning,

784
00:52:51,480 --> 00:52:55,840
I think something Andrew Ng always said in his classes was, if you don't understand this

785
00:52:55,840 --> 00:52:57,720
part, don't worry about it.

786
00:52:57,720 --> 00:53:00,840
And I think that's a good saying.

787
00:53:00,840 --> 00:53:07,400
Maybe if we don't understand a certain thing, maybe let's not worry about it just yet.

788
00:53:07,400 --> 00:53:09,760
Some things are more important than others.

789
00:53:09,760 --> 00:53:15,240
Also when we specialize, I think letting go of some things to make room for other things.

790
00:53:15,240 --> 00:53:23,000
For me, I worked on some more mathematical papers where we proved theorems and so forth,

791
00:53:23,000 --> 00:53:26,760
like the ordinal regression papers we worked on, which was fun.

792
00:53:26,760 --> 00:53:32,360
But I, for example, I know that I'm not that good at proving theorems because I'm more

793
00:53:32,360 --> 00:53:34,840
like a person who enjoys coding.

794
00:53:34,840 --> 00:53:40,160
And for proving theorems, you have to sometimes sit there for days or weeks and stare at it

795
00:53:40,160 --> 00:53:43,200
until you get some inspiration.

796
00:53:43,200 --> 00:53:44,640
And this is not for me.

797
00:53:44,640 --> 00:53:46,400
I think that's okay.

798
00:53:46,400 --> 00:53:53,480
I would say not getting frustrated, I guess, saying, okay, this is not for me, recognizing

799
00:53:53,480 --> 00:53:56,360
that, focusing on my other strengths.

800
00:53:56,360 --> 00:53:59,720
And that would be something like, don't worry about it.

801
00:53:59,720 --> 00:54:02,560
Oh, sorry, I almost knocked off this thing here.

802
00:54:02,560 --> 00:54:04,520
Let's say, what Andrew Ng said, don't worry about it.

803
00:54:04,520 --> 00:54:09,000
That is like something I think that kind of relieved me.

804
00:54:09,000 --> 00:54:18,440
It's a small thing that's really nice because when Andrew Ng was going through, say, a proof

805
00:54:18,440 --> 00:54:25,320
for something or showing all the mathematics behind gradient descent or changing the weights

806
00:54:25,320 --> 00:54:31,320
or backpropagation and things like that, you don't need to know every single detail

807
00:54:31,320 --> 00:54:32,600
right then and there.

808
00:54:32,600 --> 00:54:38,200
You might not ever really need to know every detail, but understanding the...

809
00:54:38,200 --> 00:54:39,320
Getting an intuition.

810
00:54:39,320 --> 00:54:43,240
And that's what Andrew Ng always used to say, gaining that intuition and getting that gut

811
00:54:43,240 --> 00:54:44,640
feeling and things like that.

812
00:54:44,640 --> 00:54:48,040
That's what's going to help you along the way.

813
00:54:48,040 --> 00:54:50,000
Yeah, that is a good point.

814
00:54:50,000 --> 00:54:51,000
Other than Andrew Ng...

815
00:54:51,000 --> 00:54:52,000
Oh, yes.

816
00:54:52,000 --> 00:54:54,000
So I wanted to say exactly what you said.

817
00:54:54,000 --> 00:54:59,000
I wanted to say you brought up a very good point, which is, yeah, you should, of course,

818
00:54:59,000 --> 00:55:04,020
make sure you understand the bigger picture and intuition in a certain way.

819
00:55:04,020 --> 00:55:08,540
But the details are sometimes implementation details, I would say.

820
00:55:08,540 --> 00:55:14,600
But like you said, yeah, recognizing when it's time to focus on the big picture and

821
00:55:14,600 --> 00:55:20,480
when it's time to dive in and really making sure you don't have to dive into everything

822
00:55:20,480 --> 00:55:21,480
basically.

823
00:55:21,480 --> 00:55:27,240
Also, a very good exercise is to implement things from scratch, like reading about, let's say,

824
00:55:27,240 --> 00:55:30,600
decision trees and then implementing decision trees from scratch.

825
00:55:30,600 --> 00:55:35,440
For example, that's one homework I usually give where students have to code a CART decision

826
00:55:35,440 --> 00:55:40,040
tree or a C4.5 tree from scratch, which is a good learning exercise.
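
To give a flavor of what such an exercise involves: the core step of a CART-style tree is choosing the threshold that minimizes the weighted Gini impurity. Below is a minimal sketch for a single numeric feature. The function names and toy data are illustrative assumptions, not the actual homework.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Try midpoints between sorted unique values; return (threshold, weighted Gini)."""
    best_threshold, best_impurity = None, float("inf")
    unique = sorted(set(values))
    for low, high in zip(unique, unique[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity


# Toy data (made up): small values are class 0, large values are class 1.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = [0, 0, 0, 1, 1, 1]
print(best_split(values, labels))
```

A full tree would apply this search recursively to each resulting subset, across all features, until a stopping criterion is met.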

827
00:55:40,040 --> 00:55:44,440
But I wouldn't say do that for every algorithm, because if you do that, yeah, you would get

828
00:55:44,440 --> 00:55:45,440
stuck.

829
00:55:45,440 --> 00:55:48,960
You would never really move forward because it takes a lot of time.

830
00:55:48,960 --> 00:55:50,480
It takes weeks to do that.

831
00:55:50,480 --> 00:55:57,640
And life is also, in a way, short if you spend your whole time re-implementing old algorithms.

832
00:55:57,640 --> 00:56:01,440
Yeah, that's also not a good way of spending time, I think.

833
00:56:01,440 --> 00:56:05,520
It's like being selective, I think, also focusing on the big picture, sometimes diving in, but

834
00:56:05,520 --> 00:56:08,640
not diving into the details of everything.

835
00:56:08,640 --> 00:56:09,880
Right.

836
00:56:09,880 --> 00:56:18,680
Yeah, one of my professors, during my masters, he had us by hand, step by step going through

837
00:56:18,680 --> 00:56:21,360
backpropagation for neural networks.

838
00:56:21,360 --> 00:56:22,520
That sounds fun.

839
00:56:22,520 --> 00:56:28,040
You're beating your head against the wall, and it's very frustrating.

840
00:56:28,040 --> 00:56:31,240
It's not like you ever need to do that.

841
00:56:31,240 --> 00:56:37,960
But there's something about even just doing it once that you do just kind of gain a better

842
00:56:37,960 --> 00:56:39,440
sense of it.

843
00:56:39,440 --> 00:56:44,280
Yeah, at first, the details aren't that important.

844
00:56:44,280 --> 00:56:51,840
In the future, when you're in industry and you're trying to get a model into production, I mean,

845
00:56:51,840 --> 00:56:56,040
sometimes things are so abstracted that you don't necessarily need to, which could be

846
00:56:56,040 --> 00:56:57,880
a good thing or a bad thing, right?

847
00:56:57,880 --> 00:57:02,360
Because it's fine if there's no problems, but it quickly becomes a bad thing when you

848
00:57:02,360 --> 00:57:06,880
start to run into some issues and you don't really understand what's going on with your

849
00:57:06,880 --> 00:57:07,880
model.

850
00:57:07,880 --> 00:57:14,920
But yeah, I mean, at first, it's much more important to get the, in broad strokes, just

851
00:57:14,920 --> 00:57:24,040
sort of get a handle of what's going on, building up that foundation so you can understand everything.

852
00:57:24,040 --> 00:57:29,720
You can't learn recurrent neural networks without understanding what a decision tree

853
00:57:29,720 --> 00:57:31,040
is, right?

854
00:57:31,040 --> 00:57:36,520
It's just like, it's not, you just, you can't, there's certain things, it just wouldn't,

855
00:57:36,520 --> 00:57:37,520
it wouldn't make sense.

856
00:57:37,520 --> 00:57:40,520
Like, you should start with logistic regression.

857
00:57:40,520 --> 00:57:42,840
You know, just do it, right?

858
00:57:42,840 --> 00:57:43,840
Good advice.

859
00:57:43,840 --> 00:57:49,400
I would say always start with, even if you know more sophisticated techniques, if we

860
00:57:49,400 --> 00:57:54,080
go back to what we talked about with large language models, even if it makes more sense,

861
00:57:54,080 --> 00:57:58,600
even for a classification problem, to fine-tune a large language model for that, I would start,

862
00:57:58,600 --> 00:58:03,360
like you said, with a simple logistic regression classifier, maybe a bag-of-words model, to just

863
00:58:03,360 --> 00:58:06,920
get a baseline, like something where you are confident it's very simple and it works.

864
00:58:06,920 --> 00:58:10,520
Let's say using scikit-learn before trying the more complicated things.

865
00:58:10,520 --> 00:58:14,000
It's not only because we don't want to use the complicated things because the simple

866
00:58:14,000 --> 00:58:15,360
ones are efficient.

867
00:58:15,360 --> 00:58:17,680
It's more about also even checking our solutions.

868
00:58:17,680 --> 00:58:23,120
Like if our fine-tuned model, our, let's say, BERT LLM, performs worse than the logistic

869
00:58:23,120 --> 00:58:26,360
regression classifier, maybe we have a bug in our code.

870
00:58:26,360 --> 00:58:29,600
Maybe we didn't process the input correctly, tokenize it correctly.

871
00:58:29,600 --> 00:58:35,240
It's usually always a good idea to, I think, to really start simple and then increasingly

872
00:58:35,240 --> 00:58:41,440
get complicated or improve, let's say, improve by adding things instead of starting complicated

873
00:58:41,440 --> 00:58:46,840
and then trying to debug the complicated solution to find out where the error is essentially.
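
The baseline idea can be made concrete with a small sketch. In a real project one would reach for scikit-learn as mentioned above. The version below is a standard-library-only illustration, so every function name in it is hypothetical and the tiny dataset is made up.

```python
import math
from collections import Counter


def bag_of_words(text, vocabulary):
    """Turn a text into a vector of word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, learning_rate=0.1, epochs=200):
    """Plain batch gradient descent on the logistic loss (no regularization)."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict(text, vocabulary, weights, bias):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    x = bag_of_words(text, vocabulary)
    return int(sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5)


# Tiny made-up sentiment dataset (1 = positive, 0 = negative).
texts = ["great fun movie", "really great acting", "boring slow movie", "really boring plot"]
labels = [1, 1, 0, 0]

vocabulary = sorted({word for text in texts for word in text.split()})
features = [bag_of_words(text, vocabulary) for text in texts]
weights, bias = train_logistic_regression(features, labels)

# Accuracy on the training texts; a real baseline would be scored on held-out data.
baseline_accuracy = sum(
    predict(text, vocabulary, weights, bias) == label for text, label in zip(texts, labels)
) / len(texts)
print(baseline_accuracy)
```

If a fine-tuned model later scores below a baseline like this on the same evaluation data, that is a hint to check the preprocessing and tokenization before anything else.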

874
00:58:46,840 --> 00:58:48,440
Right.

875
00:58:48,440 --> 00:58:53,120
Even if worst case scenario, if you use a very simple model, you just got it.

876
00:58:53,120 --> 00:58:55,080
You just have a baseline, right?

877
00:58:55,080 --> 00:59:00,920
Just like a sanity baseline to work off of.

878
00:59:00,920 --> 00:59:07,520
So other than Andrew Ng, who we both obviously admire, who are some other people in the machine

879
00:59:07,520 --> 00:59:13,640
learning field that you gain inspiration from or that have influenced you?

880
00:59:13,640 --> 00:59:15,280
Good question.

881
00:59:15,280 --> 00:59:20,320
I would say, because I also recently enjoyed some of the educational material by Andrej

882
00:59:20,320 --> 00:59:28,200
Karpathy, what he reminds me always is that it's fun to code things and it's very contagious.

883
00:59:28,200 --> 00:59:30,840
If you see someone having fun coding things up.

884
00:59:30,840 --> 00:59:35,600
So that's something I did very early on in my blog where I implemented a principal component

885
00:59:35,600 --> 00:59:40,680
analysis from scratch or linear discriminant analysis, other things.

886
00:59:40,680 --> 00:59:45,120
I always used to implement things from scratch, but over the years I have become more, I would

887
00:59:45,120 --> 00:59:49,560
say, conceptual because things got more complicated.

888
00:59:49,560 --> 00:59:54,160
I was focusing more, let's say, on implementing an end-to-end system and then not, let's say,

889
00:59:54,160 --> 00:59:56,080
doing the step-by-step coding.

890
00:59:56,080 --> 01:00:01,280
And his recent stuff reminded me of how much fun it actually is to do things from scratch.

891
01:00:01,280 --> 01:00:04,840
So that's one inspiration, I would say.

892
01:00:04,840 --> 01:00:11,160
Or other people, I would say maybe Paige Bailey because she always has so much fun on, let's

893
01:00:11,160 --> 01:00:12,160
say, social media.

894
01:00:12,160 --> 01:00:17,880
It's also to remind you, I don't know, whatever you do, have fun.

895
01:00:17,880 --> 01:00:21,240
Enjoy, share the joy.

896
01:00:21,240 --> 01:00:26,200
That is, I think, also important to keep in mind that things are sometimes complicated

897
01:00:26,200 --> 01:00:27,880
and I don't know, work can be intense.

898
01:00:27,880 --> 01:00:32,640
We want to get things done, but don't forget also maybe just to stop and enjoy sometimes,

899
01:00:32,640 --> 01:00:38,480
like to share the successes, spread some fun stuff.

900
01:00:38,480 --> 01:00:42,520
Yeah, definitely.

901
01:00:42,520 --> 01:00:47,040
Speaking of starting things from scratch, well, I think of it, I was able to read your

902
01:00:47,040 --> 01:00:54,000
recent blog, Understanding and Coding Self-Attention Mechanisms of Large Language Models from Scratch.

903
01:00:54,000 --> 01:00:59,080
And yeah, just, I mean, yeah, we were talking about some of that going into it and understanding

904
01:00:59,080 --> 01:01:05,280
the similarities between cross-attention and self-attention.

905
01:01:05,280 --> 01:01:11,760
It's really interesting to go down to the more basic principles and to see things from

906
01:01:11,760 --> 01:01:12,760
the code.

907
01:01:12,760 --> 01:01:15,880
And I, how do I say it?

908
01:01:15,880 --> 01:01:21,320
It's like in production and like when you're deploying models, you don't want to reinvent

909
01:01:21,320 --> 01:01:22,320
the wheel, right?

910
01:01:22,320 --> 01:01:23,320
Yeah, exactly.

911
01:01:23,320 --> 01:01:24,320
Right.

912
01:01:24,320 --> 01:01:26,640
You want battle-tested, you want battle-tested things.

913
01:01:26,640 --> 01:01:33,000
But when you're trying to understand something conceptually, it's really nice to understand

914
01:01:33,000 --> 01:01:34,000
it from scratch.

915
01:01:34,000 --> 01:01:35,000
Excellent point.

916
01:01:35,000 --> 01:01:37,000
And then to your second point.

917
01:01:37,000 --> 01:01:38,000
Yeah.

918
01:01:38,000 --> 01:01:42,280
Yeah, we really want to emphasize that, like, I think for real-world applications, don't

919
01:01:42,280 --> 01:01:43,800
try to reinvent the wheel.

920
01:01:43,800 --> 01:01:48,960
I think, yeah, that is a lot of work and also risky.

921
01:01:48,960 --> 01:01:52,040
But it is, like you said, it is good for learning.

922
01:01:52,040 --> 01:01:53,960
It's especially good for learning.

923
01:01:53,960 --> 01:01:59,240
Actually, one thing I like is also, so I build sometimes things both ways.

924
01:01:59,240 --> 01:02:03,920
So when I want to implement something, I do the most naive implementation ever, like where

925
01:02:03,920 --> 01:02:09,240
I just use a very plain, simple Python code, write some unit tests to know because I want

926
01:02:09,240 --> 01:02:10,520
this and this output.

927
01:02:10,520 --> 01:02:12,400
And then I try to make it more efficient.

928
01:02:12,400 --> 01:02:16,100
So like adding more efficiency to that to see if I can improve things.

929
01:02:16,100 --> 01:02:18,240
That's what I do usually for things that don't exist yet.

930
01:02:18,240 --> 01:02:21,660
But for things that exist, you can actually use what is already out there and then kind

931
01:02:21,660 --> 01:02:26,200
of like use that as a unit test almost and then try to make your implementation similar

932
01:02:26,200 --> 01:02:27,200
to that.
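
That workflow can be sketched as follows: a deliberately naive implementation checked against an existing, well-tested one. Using the standard deviation and `statistics.pstdev` as the reference is only an illustrative assumption, and `naive_std` is a hypothetical name.

```python
import math
import statistics


def naive_std(values):
    """Deliberately plain population standard deviation, written step by step."""
    mean = sum(values) / len(values)
    squared_deviations = [(v - mean) ** 2 for v in values]
    return math.sqrt(sum(squared_deviations) / len(values))


# Use the standard library's battle-tested implementation as the reference.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
assert math.isclose(naive_std(data), statistics.pstdev(data))
print(naive_std(data))
```

Once the naive version agrees with the reference, the same check can guard any later, more efficient rewrite.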

933
01:02:27,200 --> 01:02:31,520
But yeah, like you said, don't maybe use from scratch implementations if there's an existing

934
01:02:31,520 --> 01:02:32,520
solution.

935
01:02:32,520 --> 01:02:36,680
It's only for learning purposes, essentially.

936
01:02:36,680 --> 01:02:37,680
Yeah.

937
01:02:37,680 --> 01:02:39,360
Definitely.

938
01:02:39,360 --> 01:02:46,480
And then, yeah, to your second point from before, it's important to have fun, right?

939
01:02:46,480 --> 01:02:53,240
To realize that learning can be, it should be enjoyable and expanding your knowledge

940
01:02:53,240 --> 01:02:56,840
is so important.

941
01:02:56,840 --> 01:03:04,960
So to conclude, learning from machine learning, the last real meaty question, what has a career

942
01:03:04,960 --> 01:03:10,560
in machine learning taught you about life?

943
01:03:10,560 --> 01:03:17,640
I would say it's like, yeah, being patient because there's so much out there.

944
01:03:17,640 --> 01:03:20,560
So it's like, can't learn it all at once.

945
01:03:20,560 --> 01:03:21,640
Take it one step at a time.

946
01:03:21,640 --> 01:03:26,840
But like what we just talked about, making sure we enjoy what we're doing.

947
01:03:26,840 --> 01:03:30,600
But then also, what I think machine learning taught me, especially in the last couple of

948
01:03:30,600 --> 01:03:33,800
years is things are changing quickly.

949
01:03:33,800 --> 01:03:38,400
So in that sense, it's kind of like counter to what we just said, like taking things slowly.

950
01:03:38,400 --> 01:03:44,560
But it's also: be open to change, be open to new experiences.

951
01:03:44,560 --> 01:03:51,320
It could be anything, like from job related things to location wise, where we live, what

952
01:03:51,320 --> 01:03:53,040
our hobbies are.

953
01:03:53,040 --> 01:03:57,240
And that is something related to machine learning in the sense that there are so many new methods

954
01:03:57,240 --> 01:03:59,360
coming out there.

955
01:03:59,360 --> 01:04:00,360
Things changed completely.

956
01:04:00,360 --> 01:04:03,920
We were using GANs two years ago, now we're using diffusion models.

957
01:04:03,920 --> 01:04:06,240
It's like being open to things and open to change.

958
01:04:06,240 --> 01:04:10,360
And yeah, I don't know, like trying it out, making sure maybe we don't like it, we don't

959
01:04:10,360 --> 01:04:11,480
have to use it.

960
01:04:11,480 --> 01:04:16,760
It's the same with life, like trying new experiences, I think.

961
01:04:16,760 --> 01:04:17,760
That's great.

962
01:04:17,760 --> 01:04:23,320
Yeah, I think being patient when you need to be patient, but also just sort of accepting

963
01:04:23,320 --> 01:04:28,920
that we are living in a very fast moving world where things are changing.

964
01:04:28,920 --> 01:04:30,280
So being open to change.

965
01:04:30,280 --> 01:04:34,320
And like machine learning, everything gets better with time, with more training epochs,

966
01:04:34,320 --> 01:04:35,320
essentially.

967
01:04:35,320 --> 01:04:41,200
So maybe hopefully when we like with life experiences and stuff like that, things get better usually,

968
01:04:41,200 --> 01:04:42,200
I hope.

969
01:04:42,200 --> 01:04:43,200
So yeah.

970
01:04:43,200 --> 01:04:44,200
Yeah.

971
01:04:44,200 --> 01:04:49,040
Sebastian, it's been such a pleasure talking to you.

972
01:04:49,040 --> 01:04:54,040
If there are some listeners out there who want to learn more about your work, where

973
01:04:54,040 --> 01:04:58,080
could they go to reach out or to find out more about you?

974
01:04:58,080 --> 01:05:05,200
I think my website would be the best place because there I have links to everything else.

975
01:05:05,200 --> 01:05:10,080
So yeah, my website is essentially my first name lastname.com, sebastianraschka.com.

976
01:05:10,080 --> 01:05:17,320
It's maybe a little bit difficult to spell in the sense of it's easier if you maybe see

977
01:05:17,320 --> 01:05:18,320
a link.

978
01:05:18,320 --> 01:05:19,840
So it's my first name lastname.com.

979
01:05:19,840 --> 01:05:21,840
I'll have it in the show notes.

980
01:05:21,840 --> 01:05:22,840
Yeah, exactly.

981
01:05:22,840 --> 01:05:23,840
Yeah.

982
01:05:23,840 --> 01:05:26,160
So and because it's a very long name.

983
01:05:26,160 --> 01:05:30,600
Otherwise, I'm very active on social media.

984
01:05:30,600 --> 01:05:34,320
Most of them basically like Twitter, Mastodon, LinkedIn.

985
01:05:34,320 --> 01:05:38,240
So on most platforms, I'm r-a-s-b-t.

986
01:05:38,240 --> 01:05:43,360
So that is actually back, it's weird because it's back then on Twitter, there was a character

987
01:05:43,360 --> 01:05:47,880
limit where the Twitter handle was cutting into that character limit.

988
01:05:47,880 --> 01:05:50,280
So I tried to keep it as short as possible.

989
01:05:50,280 --> 01:05:54,280
Five letters, it's basically the first two letters of my last name, r-a.

990
01:05:54,280 --> 01:05:58,360
And then s-b-t as in Sebastian, so r-a-s-b-t.

991
01:05:58,360 --> 01:06:02,880
So I'm that on GitHub, Twitter, and some other platforms.

992
01:06:02,880 --> 01:06:08,960
So yeah, if you want to reach out on social media, I'm pretty much everywhere, maybe too

993
01:06:08,960 --> 01:06:09,960
much.

994
01:06:09,960 --> 01:06:13,360
But I must say that is also one thing over the years.

995
01:06:13,360 --> 01:06:16,000
I've been on social media over like maybe 10 years.

996
01:06:16,000 --> 01:06:19,880
And if you use it responsibly, you can learn a lot of things.

997
01:06:19,880 --> 01:06:24,520
We are always having good discussions where we discuss recent papers.

998
01:06:24,520 --> 01:06:27,200
There's always someone who knows more than you do.

999
01:06:27,200 --> 01:06:30,880
So it's always nice to have these comments where someone points something out or follow

1000
01:06:30,880 --> 01:06:34,080
up material or, hey, have you thought about this and that?

1001
01:06:34,080 --> 01:06:38,720
And yeah, I think it's basically if you use it responsibly, it can be a very effective

1002
01:06:38,720 --> 01:06:40,760
way for learning too.

1003
01:06:40,760 --> 01:06:43,920
Yeah, for sure.

1004
01:06:43,920 --> 01:06:50,360
So, Sebastian, yeah, it has been so nice chatting with you, you're like a fountain of knowledge.

1005
01:06:50,360 --> 01:06:53,960
I feel like there's so much that we could chat about more.

1006
01:06:53,960 --> 01:06:58,840
We could do a whole other episode, maybe sometime in the future.

1007
01:06:58,840 --> 01:07:00,600
But thank you so much for your time.

1008
01:07:00,600 --> 01:07:02,840
I really appreciate it.

1009
01:07:02,840 --> 01:07:04,520
Yeah, that was fun.

1010
01:07:04,520 --> 01:07:05,680
Thanks for having me on your podcast.

1011
01:07:05,680 --> 01:07:08,480
It was like a really fun hour to spend today.

1012
01:07:08,480 --> 01:07:09,960
So yeah, thanks for inviting me.

1013
01:07:09,960 --> 01:07:10,960
I had a lot of fun.

1014
01:07:10,960 --> 01:07:23,560
And yeah, anytime again.

1015
01:07:23,560 --> 01:07:26,960
Thank you for tuning in to this episode of Learning from Machine Learning.

1016
01:07:26,960 --> 01:07:32,000
I hope you enjoyed the insights and knowledge shared by Sebastian Raschke, a renowned author

1017
01:07:32,000 --> 01:07:34,200
and machine learning expert.

1018
01:07:34,200 --> 01:07:38,680
Don't forget to check out the show notes for links to Sebastian's work and resources discussed

1019
01:07:38,680 --> 01:07:40,620
in this episode.

1020
01:07:40,620 --> 01:07:45,080
If you enjoyed this episode, please leave a review and share with your friends and colleagues.

1021
01:07:45,080 --> 01:08:12,120
Until next time, keep on learning.

