1
00:00:00,000 --> 00:00:04,800
You can speak to GPT-4 like you would speak to another human.

2
00:00:04,800 --> 00:00:06,440
We know how to communicate.

3
00:00:06,440 --> 00:00:07,760
I don't speak protein.

4
00:00:07,760 --> 00:00:10,080
I don't know anyone else that does.

5
00:00:10,080 --> 00:00:11,840
Do I spell it out in amino acids?

6
00:00:15,600 --> 00:00:17,640
How did the best machine learning practitioners

7
00:00:17,640 --> 00:00:19,600
get involved in the field?

8
00:00:19,600 --> 00:00:22,080
What challenges have they faced?

9
00:00:22,080 --> 00:00:24,240
What has helped them flourish?

10
00:00:24,240 --> 00:00:26,020
Let's ask them.

11
00:00:26,020 --> 00:00:28,460
Welcome to Learning from Machine Learning.

12
00:00:28,460 --> 00:00:31,240
I'm your host, Seth Levine.

13
00:00:31,240 --> 00:00:34,120
Hello, and welcome to Learning from Machine Learning.

14
00:00:34,120 --> 00:00:37,520
On this episode, we have a very special guest, Dr. Michelle

15
00:00:37,520 --> 00:00:38,760
Gill.

16
00:00:38,760 --> 00:00:42,120
She's the tech lead and applied research manager at NVIDIA.

17
00:00:42,120 --> 00:00:45,880
She works on projects like BioNeMo and bio foundation

18
00:00:45,880 --> 00:00:48,800
models, frameworks, and inference service

19
00:00:48,800 --> 00:00:51,360
for AI-assisted drug discovery.

20
00:00:51,360 --> 00:00:55,920
Recently, she gave the keynote at the most recent PyData NYC.

21
00:00:55,920 --> 00:00:57,800
It was one of my favorite talks, and I'm

22
00:00:57,800 --> 00:01:00,920
thrilled to have you on the podcast.

23
00:01:00,920 --> 00:01:02,380
Thank you for the invitation, Seth.

24
00:01:02,380 --> 00:01:03,560
It is awesome to be here.

25
00:01:03,560 --> 00:01:05,120
I'm really excited.

26
00:01:05,120 --> 00:01:06,280
Me too.

27
00:01:06,280 --> 00:01:09,240
So you are a little unique for the podcast

28
00:01:09,240 --> 00:01:11,320
with my background in NLP.

29
00:01:11,320 --> 00:01:13,640
So yeah, why don't you kick us off?

30
00:01:13,640 --> 00:01:16,000
What's your background, and what initially attracted you

31
00:01:16,000 --> 00:01:17,240
to machine learning?

32
00:01:17,240 --> 00:01:17,800
Yep.

33
00:01:17,800 --> 00:01:21,920
So proud that I don't fit many molds,

34
00:01:21,920 --> 00:01:23,800
wear it as a badge of honor.

35
00:01:23,800 --> 00:01:27,440
I did my PhD in structural biology and biophysics

36
00:01:27,440 --> 00:01:28,480
at Yale.

37
00:01:28,480 --> 00:01:31,240
And then for postdoc, I studied enzyme dynamics

38
00:01:31,240 --> 00:01:33,440
with a technique called nuclear magnetic resonance

39
00:01:33,440 --> 00:01:37,400
spectroscopy, NMR for short.

40
00:01:37,400 --> 00:01:40,260
I was particularly unique in that I didn't really

41
00:01:40,260 --> 00:01:41,560
do computational work.

42
00:01:41,560 --> 00:01:44,100
For my scientific career, I was what

43
00:01:44,100 --> 00:01:45,920
we would call a wet lab scientist.

44
00:01:45,920 --> 00:01:49,040
I actually did experiments, expressed proteins,

45
00:01:49,040 --> 00:01:51,200
and studied them.

46
00:01:51,200 --> 00:01:56,520
When I became a scientist at the NIH, it was around 2014.

47
00:01:56,520 --> 00:02:00,880
That's when AlexNet won in the ImageNet competition.

48
00:02:00,880 --> 00:02:03,560
And it became very clear to me that there was probably

49
00:02:03,560 --> 00:02:07,240
going to be something to this machine learning thing

50
00:02:07,240 --> 00:02:10,920
that many people were very excited about at the time,

51
00:02:10,920 --> 00:02:13,800
and that it was going to, while it had already

52
00:02:13,800 --> 00:02:15,360
had some impact in various fields,

53
00:02:15,360 --> 00:02:19,440
it was going to have tremendous impact across many fields.

54
00:02:19,440 --> 00:02:23,840
And certainly, I wanted to stay very close to science,

55
00:02:23,840 --> 00:02:25,440
but I wanted to learn more about this

56
00:02:25,440 --> 00:02:29,080
and basically hopefully do this for my career.

57
00:02:29,080 --> 00:02:31,200
Like I said at that time, machine learning and science

58
00:02:31,200 --> 00:02:36,760
wasn't exactly a thing yet, certainly not to the level

59
00:02:36,760 --> 00:02:38,040
that it is now.

60
00:02:38,040 --> 00:02:40,560
But I managed to stay pretty close,

61
00:02:40,560 --> 00:02:43,520
and now it's actually pretty easy to work

62
00:02:43,520 --> 00:02:46,200
at those intersections.

63
00:02:46,200 --> 00:02:46,960
Very nice.

64
00:02:46,960 --> 00:02:50,120
Yeah, I think AlexNet was a moment for a lot of people,

65
00:02:50,120 --> 00:02:52,240
both in and out of the field, being

66
00:02:52,240 --> 00:02:55,280
able to see something that could come in and just

67
00:02:55,280 --> 00:02:58,200
like halve error rates on such a hard problem.

68
00:02:58,200 --> 00:03:00,640
And it started to become more tangible

69
00:03:00,640 --> 00:03:03,160
that this sort of stuff was going to be usable

70
00:03:03,160 --> 00:03:07,720
and was going to be able to be applied to other fields.

71
00:03:07,720 --> 00:03:10,840
OK, yeah, so back in 2014, during that time,

72
00:03:10,840 --> 00:03:12,400
you were working in the lab.

73
00:03:12,400 --> 00:03:14,520
But now fast forwarding, now you are

74
00:03:14,520 --> 00:03:16,400
doing some computational stuff, right?

75
00:03:16,400 --> 00:03:20,400
So being able to see that difference between the hands

76
00:03:20,400 --> 00:03:22,840
on work and how fast things are moving now,

77
00:03:22,840 --> 00:03:24,840
can you speak to that?

78
00:03:24,840 --> 00:03:27,760
So I think it impacts the way, so certainly,

79
00:03:27,760 --> 00:03:30,160
having worked in the lab, it impacts the way

80
00:03:30,160 --> 00:03:34,920
I think about data and perceptions of,

81
00:03:34,920 --> 00:03:36,800
it helps you deeply understand the data where

82
00:03:36,800 --> 00:03:38,760
there might be errors.

83
00:03:38,760 --> 00:03:42,680
But certainly, the field now of machine learning and science

84
00:03:42,680 --> 00:03:43,960
is moving so fast.

85
00:03:43,960 --> 00:03:47,320
When I gave my PyData talk, I had a slide that was,

86
00:03:47,320 --> 00:03:48,800
it was a joke, but not really, which

87
00:03:48,800 --> 00:03:51,120
is like, by the time you finish reading this,

88
00:03:51,120 --> 00:03:52,960
there will be a new protein design paper

89
00:03:52,960 --> 00:03:55,520
out completely revolutionizing the field.

90
00:03:55,520 --> 00:03:57,680
And it's kind of true.

91
00:03:57,680 --> 00:03:59,360
Actually, one of my side projects

92
00:03:59,360 --> 00:04:03,120
is using some retrieval augmented generation models

93
00:04:03,120 --> 00:04:06,800
to identify and recommend new literature

94
00:04:06,800 --> 00:04:10,080
as it comes out for me, because it's a nontrivial thing

95
00:04:10,080 --> 00:04:13,720
to follow the literature these days.

96
00:04:13,720 --> 00:04:16,840
So yeah, it's changing incredibly fast.

97
00:04:16,840 --> 00:04:21,520
I think there's so many potential areas where it's going to,

98
00:04:21,520 --> 00:04:23,440
it's starting to and will change the field.

99
00:04:23,440 --> 00:04:27,320
I think we don't always know exactly the right ways it's

100
00:04:27,320 --> 00:04:30,160
going to fit in, but there are many, many people

101
00:04:30,160 --> 00:04:32,000
trying to figure that out.

102
00:04:32,000 --> 00:04:33,200
Yeah.

103
00:04:33,200 --> 00:04:38,000
Machine learning, AI, NLP, the field's moving so fast.

104
00:04:38,000 --> 00:04:40,920
The joke is that by the time it gets peer reviewed,

105
00:04:40,920 --> 00:04:43,360
it will be outdated.

106
00:04:43,360 --> 00:04:46,280
So that's why there's nice things like arXiv,

107
00:04:46,280 --> 00:04:51,040
where people can just kind of get it directly to their readers.

108
00:04:51,040 --> 00:04:53,240
There are some pros and cons to it, though.

109
00:04:53,240 --> 00:04:54,360
Of course.

110
00:04:54,360 --> 00:04:55,640
There certainly are.

111
00:04:55,640 --> 00:04:58,360
You always have to evaluate what's there

112
00:04:58,360 --> 00:05:01,320
and understand where you have to learn

113
00:05:01,320 --> 00:05:03,920
to read between the lines what the strengths

114
00:05:03,920 --> 00:05:06,680
and actual weaknesses of a particular piece of research

115
00:05:06,680 --> 00:05:07,180
are.

116
00:05:07,180 --> 00:05:08,960
You don't always know.

117
00:05:08,960 --> 00:05:11,560
But I also think that is a model.

118
00:05:11,560 --> 00:05:16,040
The open access is a model that has allowed machine learning

119
00:05:16,040 --> 00:05:18,720
to gain a lot of traction.

120
00:05:18,720 --> 00:05:20,200
Certainly, the field of science is

121
00:05:20,200 --> 00:05:21,640
undergoing some transitions.

122
00:05:21,640 --> 00:05:23,720
And not all of scientific literature

123
00:05:23,720 --> 00:05:27,200
has been open access, but that's a lot changing.

124
00:05:27,200 --> 00:05:30,080
I think that's been very helpful to the growth and acceptance

125
00:05:30,080 --> 00:05:31,320
of machine learning.

126
00:05:31,320 --> 00:05:33,080
Yeah, absolutely.

127
00:05:33,080 --> 00:05:35,360
So in this field, the limited amount of stuff

128
00:05:35,360 --> 00:05:37,560
that I know about it, I'm very interested in it.

129
00:05:37,560 --> 00:05:43,080
I wish I knew more, but I can't go by without talking

130
00:05:43,080 --> 00:05:45,280
a little bit about AlphaFold.

131
00:05:45,280 --> 00:05:48,440
What were your initial reactions to it?

132
00:05:48,440 --> 00:05:50,720
Could you tell us some more about it?

133
00:05:50,720 --> 00:05:51,220
Sure.

134
00:05:51,220 --> 00:05:54,080
So AlphaFold is a model.

135
00:05:54,080 --> 00:05:58,920
It's a family of models called Equivariant Models.

136
00:05:58,920 --> 00:06:02,320
And the purpose is to predict the three-dimensional structure

137
00:06:02,320 --> 00:06:04,360
of a protein.

138
00:06:04,360 --> 00:06:08,240
So biology, we can talk maybe a bit about this in a moment.

139
00:06:08,240 --> 00:06:12,040
But we often can be represented with text,

140
00:06:12,040 --> 00:06:14,000
which is not a coincidence because that's

141
00:06:14,000 --> 00:06:15,680
how we communicate with each other.

142
00:06:15,680 --> 00:06:19,040
So it's not surprising that we have developed ways

143
00:06:19,040 --> 00:06:23,120
to represent various biological moieties

144
00:06:23,120 --> 00:06:25,320
in different fashions with text.

145
00:06:25,320 --> 00:06:27,720
And certainly, it's a valid representation.

146
00:06:27,720 --> 00:06:31,440
Those models have power and use and advantages.

147
00:06:31,440 --> 00:06:33,880
But to describe a three-dimensional protein

148
00:06:33,880 --> 00:06:39,040
structure, as you might guess, often need three dimensions.

149
00:06:39,040 --> 00:06:41,880
There's some other ways to do it.

150
00:06:41,880 --> 00:06:44,360
So AlphaFold takes in a sequence.

151
00:06:44,360 --> 00:06:46,720
It's called amino acids.

152
00:06:46,720 --> 00:06:49,640
That's the building block of a protein.

153
00:06:49,640 --> 00:06:51,680
It takes that amino acid sequence.

154
00:06:51,680 --> 00:06:53,880
There's one-letter abbreviations for each

155
00:06:53,880 --> 00:06:56,640
of the 20 naturally occurring amino acids.

156
00:06:56,640 --> 00:06:59,440
And then it predicts the three-dimensional coordinates

157
00:06:59,440 --> 00:07:02,600
of that protein.

158
00:07:02,600 --> 00:07:06,600
And so in 2018, AlphaFold won CASP,

159
00:07:06,600 --> 00:07:09,200
which is the Critical Assessment of Protein Structure.

160
00:07:09,200 --> 00:07:12,160
It's a protein structure prediction competition.

161
00:07:12,160 --> 00:07:15,640
And there's a great paper written by a colleague of mine,

162
00:07:15,640 --> 00:07:17,320
Jeff Hoke.

163
00:07:17,320 --> 00:07:20,400
And he went in and examined all the metrics

164
00:07:20,400 --> 00:07:23,440
and all the different competition types of metrics

165
00:07:23,440 --> 00:07:25,080
that were measured from CASP.

166
00:07:25,080 --> 00:07:27,320
And AlphaFold didn't just win.

167
00:07:27,320 --> 00:07:28,600
It dominated.

168
00:07:28,600 --> 00:07:30,560
It really, really changed the way

169
00:07:30,560 --> 00:07:33,080
we think about machine learning.

170
00:07:33,080 --> 00:07:37,680
And that was the AlexNet moment for science, in my opinion.

171
00:07:37,680 --> 00:07:40,520
So I started doing machine learning in 2014,

172
00:07:40,520 --> 00:07:44,080
but not too far down the road.

173
00:07:44,080 --> 00:07:47,320
We had our own moment of intense realization

174
00:07:47,320 --> 00:07:49,800
that this was going to change everything.

175
00:07:49,800 --> 00:07:51,800
And there are, AlphaFold...

176
00:07:51,800 --> 00:07:53,800
There are certainly challenges with AlphaFold.

177
00:07:53,800 --> 00:07:56,680
I'm not saying it's a panacea.

178
00:07:56,680 --> 00:08:00,840
Drug discovery and science just in general are challenging.

179
00:08:00,840 --> 00:08:02,000
And it has limitations.

180
00:08:02,000 --> 00:08:04,080
But of course, we iterate on those.

181
00:08:04,080 --> 00:08:06,160
That's how the field progresses.

182
00:08:06,160 --> 00:08:06,960
Right.

183
00:08:06,960 --> 00:08:10,280
So some of the work that you're doing at NVIDIA.

184
00:08:10,280 --> 00:08:13,080
So can you tell us about BioNeMo?

185
00:08:13,080 --> 00:08:13,600
Sure.

186
00:08:13,600 --> 00:08:15,800
So BioNeMo, as you briefly alluded to,

187
00:08:15,800 --> 00:08:19,040
it's an inference service and a framework.

188
00:08:19,040 --> 00:08:23,000
The inference service is models from the community,

189
00:08:23,000 --> 00:08:26,760
including AlphaFold, OpenFold, all these derivatives

190
00:08:26,760 --> 00:08:29,120
of the folders.

191
00:08:29,120 --> 00:08:33,720
There's models for docking small molecules in proteins

192
00:08:33,720 --> 00:08:37,560
and finding the binding pocket of it.

193
00:08:37,560 --> 00:08:42,080
There are models for protein representation learning.

194
00:08:42,080 --> 00:08:44,320
There's a model called ESM, which

195
00:08:44,320 --> 00:08:47,800
was developed by a group at Meta at FAIR.

196
00:08:47,800 --> 00:08:51,600
They're now their own group.

197
00:08:51,600 --> 00:08:55,280
And there are other models as well for small molecules.

198
00:08:55,280 --> 00:08:58,280
And so we basically accelerated those checkpoints

199
00:08:58,280 --> 00:08:59,640
as much as possible.

200
00:08:59,640 --> 00:09:03,160
We did our NVIDIA thing with them, put them behind an API.

201
00:09:03,160 --> 00:09:05,040
And so they are accessible to users.

202
00:09:05,040 --> 00:09:07,920
There's a Python client.

203
00:09:07,920 --> 00:09:09,880
Individuals, companies developing software

204
00:09:09,880 --> 00:09:12,000
could put those API calls into their software

205
00:09:12,000 --> 00:09:13,400
if they don't want to host their models.

206
00:09:13,400 --> 00:09:14,680
There's load balancing.

207
00:09:14,680 --> 00:09:19,240
There's all these nice features at scale,

208
00:09:19,240 --> 00:09:21,000
so you don't have to deal with that.

209
00:09:21,000 --> 00:09:25,040
There's also a framework that's more for ML researchers

210
00:09:25,040 --> 00:09:27,920
to train and develop their own models.

211
00:09:27,920 --> 00:09:28,420
Very cool.

212
00:09:28,420 --> 00:09:32,040
So the users are both, say, research institutes

213
00:09:32,040 --> 00:09:33,720
and companies as well?

214
00:09:33,720 --> 00:09:38,520
Yeah, I would say the users range from data scientists

215
00:09:38,520 --> 00:09:42,560
to even bench scientists who want to use an API

216
00:09:42,560 --> 00:09:45,800
to researchers who want to build their own stuff.

217
00:09:45,800 --> 00:09:46,800
Right.

218
00:09:46,800 --> 00:09:50,560
So yeah, being at NVIDIA, I wouldn't

219
00:09:50,560 --> 00:09:54,720
say the first thing that I would think of is AI drug discovery.

220
00:09:54,720 --> 00:09:55,920
Well, I guess the AI part.

221
00:09:55,920 --> 00:09:58,720
But I wouldn't put NVIDIA and drug discovery together.

222
00:09:58,720 --> 00:10:01,000
But yeah, so can you tell us why is NVIDIA

223
00:10:01,000 --> 00:10:03,640
interested in doing this type of work?

224
00:10:03,640 --> 00:10:04,680
Yeah, great question.

225
00:10:04,680 --> 00:10:06,120
And I get that a lot.

226
00:10:06,120 --> 00:10:08,600
NVIDIA's objective is not to become a drug discovery

227
00:10:08,600 --> 00:10:09,120
company.

228
00:10:09,120 --> 00:10:13,240
We want to enable those doing drug discovery or developing

229
00:10:13,240 --> 00:10:17,120
software for drug discovery with the best of what

230
00:10:17,120 --> 00:10:19,760
we can offer on GPUs.

231
00:10:19,760 --> 00:10:21,500
NVIDIA is in kind of a unique position

232
00:10:21,500 --> 00:10:25,920
because we do make the hardware and a lot of the software

233
00:10:25,920 --> 00:10:29,400
along the stack building up to BioNeMo we touch.

234
00:10:29,400 --> 00:10:33,320
So we can surface optimizations and the best

235
00:10:33,320 --> 00:10:36,640
of what the hardware has to offer in BioNeMo.

236
00:10:36,640 --> 00:10:41,520
And in turn, what users need and ask for and pain points

237
00:10:41,520 --> 00:10:42,880
can get filtered back to us.

238
00:10:42,880 --> 00:10:45,640
And that can influence hardware, software,

239
00:10:45,640 --> 00:10:47,000
you never know down the road.

240
00:10:47,000 --> 00:10:49,480
So that's really our objective.

241
00:10:49,480 --> 00:10:52,120
Right, that makes a lot of sense.

242
00:10:52,120 --> 00:10:54,240
So there's probably like some, you

243
00:10:54,240 --> 00:10:55,680
talked about it in your PyData talk,

244
00:10:55,680 --> 00:10:59,880
like there's this cyclic nature of supply and demand, right,

245
00:10:59,880 --> 00:11:02,720
between the hardware and developing

246
00:11:02,720 --> 00:11:03,960
these sorts of solutions.

247
00:11:03,960 --> 00:11:06,640
So you have this architecture that you

248
00:11:06,640 --> 00:11:07,920
can run things in parallel.

249
00:11:07,920 --> 00:11:09,720
And you can do all of these amazing things.

250
00:11:09,720 --> 00:11:11,640
You can test out all of this stuff.

251
00:11:11,640 --> 00:11:14,480
And you have these sorts of machine learning models

252
00:11:14,480 --> 00:11:18,200
that can create tons of different whatever

253
00:11:18,200 --> 00:11:20,800
tasks you're trying to do.

254
00:11:20,800 --> 00:11:25,480
So yeah, it's a very interesting relationship.

255
00:11:25,480 --> 00:11:29,880
So just talking about, well, let's back up for a second,

256
00:11:29,880 --> 00:11:31,040
AI drug discovery.

257
00:11:31,040 --> 00:11:33,480
Why is that so important?

258
00:11:33,480 --> 00:11:38,800
Why is it something that people should care about?

259
00:11:38,800 --> 00:11:40,120
Why should we put it in?

260
00:11:40,120 --> 00:11:42,120
Why should people be thinking about it?

261
00:11:42,120 --> 00:11:47,800
Yeah, the drug discovery process is long and expensive.

262
00:11:47,800 --> 00:11:51,320
For a particular class of drug, small molecules,

263
00:11:51,320 --> 00:11:56,640
it's something like close to $3 billion in 10-ish years.

264
00:11:56,640 --> 00:11:58,960
And there's tons of failures along the way.

265
00:11:58,960 --> 00:12:02,120
It's a very manual process.

266
00:12:02,120 --> 00:12:05,000
And there's certainly a lot of repetitiveness.

267
00:12:05,000 --> 00:12:07,480
So that's a place where you can start to think that machine

268
00:12:07,480 --> 00:12:09,360
learning might come in.

269
00:12:09,360 --> 00:12:12,000
For example, small molecule development,

270
00:12:12,000 --> 00:12:18,600
once we have a candidate, a lead that binds to a protein,

271
00:12:18,600 --> 00:12:20,160
then it's optimized.

272
00:12:20,160 --> 00:12:21,520
And that's a very, you know, there's

273
00:12:21,520 --> 00:12:24,440
the molecule synthesized by a chemist.

274
00:12:24,440 --> 00:12:26,080
It's tested in an assay.

275
00:12:26,080 --> 00:12:27,360
It's evaluated.

276
00:12:27,360 --> 00:12:31,320
You know, this design-make-test cycle goes over and over.

277
00:12:31,320 --> 00:12:35,280
If we can start to help superpower those chemists

278
00:12:35,280 --> 00:12:38,560
by enabling some of that to be done in silico,

279
00:12:38,560 --> 00:12:42,080
then our objective is to make this process faster,

280
00:12:42,080 --> 00:12:43,960
make their lives better.

281
00:12:43,960 --> 00:12:45,560
We can give scientists the power that they

282
00:12:45,560 --> 00:12:50,080
need to work more efficiently.

283
00:12:50,080 --> 00:12:51,520
It must be pretty tricky.

284
00:12:51,520 --> 00:12:55,080
I mean, evaluation is always a tough thing

285
00:12:55,080 --> 00:12:58,320
for any machine learning, really, I guess, any problem.

286
00:12:58,320 --> 00:13:02,720
But when do you know, OK, like, you know, in silico,

287
00:13:02,720 --> 00:13:06,200
like, this molecule, this protein

288
00:13:06,200 --> 00:13:10,000
has reached the point where it's time.

289
00:13:10,000 --> 00:13:12,480
Like, let's actually synthesize this thing

290
00:13:12,480 --> 00:13:16,560
and let's start testing it in a real lab, not a, you know,

291
00:13:16,560 --> 00:13:18,320
in a physical lab.

292
00:13:18,320 --> 00:13:21,160
Yeah, I can't answer that because it's certainly

293
00:13:21,160 --> 00:13:28,480
different for every application, every chemist, every model.

294
00:13:28,480 --> 00:13:31,520
So, yeah, as for my objective, I think

295
00:13:31,520 --> 00:13:33,120
to use these effectively, you would

296
00:13:33,120 --> 00:13:35,960
have to get a sense of how well does the model perform

297
00:13:35,960 --> 00:13:37,160
in this assay.

298
00:13:37,160 --> 00:13:38,880
And certainly, that's something.

299
00:13:38,880 --> 00:13:42,960
Developing generalized models, like foundation models,

300
00:13:42,960 --> 00:13:48,320
for example, that can then be specialized by users,

301
00:13:48,320 --> 00:13:50,600
that's why it's important for NVIDIA.

302
00:13:50,600 --> 00:13:52,200
We can train these powerful models,

303
00:13:52,200 --> 00:13:55,800
but users will specialize them to their particular drug

304
00:13:55,800 --> 00:13:58,640
program, their particular target.

305
00:13:58,640 --> 00:14:01,600
You know, we understand that certainly there's

306
00:14:01,600 --> 00:14:04,400
a lot of specifics that have to go into this,

307
00:14:04,400 --> 00:14:06,000
even after the model's created.

308
00:14:06,000 --> 00:14:07,480
Right, that makes a lot of sense.

309
00:14:07,480 --> 00:14:10,160
So you're more focusing on the tools

310
00:14:10,160 --> 00:14:12,560
and enabling people to do this type of work.

311
00:14:12,560 --> 00:14:14,320
Exactly.

312
00:14:14,320 --> 00:14:17,240
So talking about, and you alluded to it

313
00:14:17,240 --> 00:14:21,080
before, some of the overlap or how you can use NLP to help you

314
00:14:21,080 --> 00:14:22,000
with biology.

315
00:14:22,000 --> 00:14:23,800
Where do you see that there's overlap

316
00:14:23,800 --> 00:14:27,040
and where are there some differences?

317
00:14:27,040 --> 00:14:30,880
Yeah, so there are, yeah, like I said,

318
00:14:30,880 --> 00:14:35,440
natural language is an excellent way to represent biology.

319
00:14:35,440 --> 00:14:40,120
The different moieties in biology, DNA, RNA, proteins,

320
00:14:40,120 --> 00:14:43,640
small molecules, they all have their own language

321
00:14:43,640 --> 00:14:44,840
that can be used.

322
00:14:44,840 --> 00:14:47,960
And that enables us to beg, borrow, and steal

323
00:14:47,960 --> 00:14:52,200
from all of these NLP breakthroughs that have happened.

324
00:14:52,200 --> 00:14:56,400
And that has really helped jumpstart that field.

325
00:14:56,400 --> 00:14:59,000
It's also, there's considerations

326
00:14:59,000 --> 00:15:01,040
from a biological standpoint, which

327
00:15:01,040 --> 00:15:06,360
is that there's a lot of NLP data, of text data available.

328
00:15:06,360 --> 00:15:09,040
For example, with proteins, we have many more protein

329
00:15:09,040 --> 00:15:11,440
sequences than we do protein structures,

330
00:15:11,440 --> 00:15:12,720
than three-dimensional structures,

331
00:15:12,720 --> 00:15:14,320
like we discussed earlier.

332
00:15:14,320 --> 00:15:18,040
So in some contexts, you have a lot more data.

333
00:15:18,040 --> 00:15:20,760
So that's very powerful.

334
00:15:20,760 --> 00:15:27,120
I think differences in NLP are certainly vocabulary.

335
00:15:27,120 --> 00:15:29,920
Many times, biological vocabularies

336
00:15:29,920 --> 00:15:34,760
are like tokenized three letters at a time,

337
00:15:34,760 --> 00:15:39,120
or every amino acid, in which case there's 20 amino acids.

338
00:15:39,120 --> 00:15:44,160
So the vocabulary size is certainly less than 50.

339
00:15:44,160 --> 00:15:47,120
And the distribution of the tokens is very different.

340
00:15:47,120 --> 00:15:51,760
So those are ways that biology doesn't necessarily always

341
00:15:51,760 --> 00:15:54,480
follow the same trends as NLP.

342
00:15:54,480 --> 00:15:56,760
We still try to do experiments empirically.

343
00:15:56,760 --> 00:16:01,160
Sure, x, y, and z features improve the model for NLP,

344
00:16:01,160 --> 00:16:04,840
but we need to do the test for biology as well.

345
00:16:04,840 --> 00:16:06,560
Right.

346
00:16:06,560 --> 00:16:08,880
Yeah, it's so interesting to hear

347
00:16:08,880 --> 00:16:11,960
you mention tokenization, because that's

348
00:16:11,960 --> 00:16:14,400
something in natural language processing

349
00:16:14,400 --> 00:16:16,120
that you're always thinking about.

350
00:16:16,120 --> 00:16:18,480
And it's like, what level of analysis

351
00:16:18,480 --> 00:16:20,360
do you want to be thinking of your problem?

352
00:16:20,360 --> 00:16:22,000
How do you want to break down your problem?

353
00:16:22,000 --> 00:16:24,400
And then for any sort of problem,

354
00:16:24,400 --> 00:16:27,600
that's how you're going to have to figure out

355
00:16:27,600 --> 00:16:30,960
what should be the proper way of tokenizing these things.

356
00:16:30,960 --> 00:16:33,840
And that's an open question in biology, I would say.

357
00:16:33,840 --> 00:16:36,360
Yeah, there certainly have been explorations

358
00:16:36,360 --> 00:16:39,000
to do more other types of tokenization.

359
00:16:39,000 --> 00:16:45,280
But I don't think anything has had sufficient sticking power

360
00:16:45,280 --> 00:16:46,440
yet.

361
00:16:46,440 --> 00:16:47,600
That's interesting.

362
00:16:47,600 --> 00:16:53,120
Yeah, there's always new things coming along in NLP.

363
00:16:55,720 --> 00:17:00,800
For a while, there was subword tokenization.

364
00:17:00,800 --> 00:17:03,080
And then when you're doing sentence tokenization,

365
00:17:03,080 --> 00:17:06,440
well, how should you do the sentence tokenization now

366
00:17:06,440 --> 00:17:08,320
with retrieval augmented generation?

367
00:17:08,320 --> 00:17:11,240
It's like, how should you do the chunking properly?

368
00:17:11,240 --> 00:17:13,480
Because it's always about capturing

369
00:17:13,480 --> 00:17:16,640
the amount of meaning and getting the right context

370
00:17:16,640 --> 00:17:18,520
that you have.

371
00:17:18,520 --> 00:17:21,240
And it's interesting, because you think with language,

372
00:17:21,240 --> 00:17:22,480
oh, we understand language.

373
00:17:22,480 --> 00:17:23,720
We should be able to do this.

374
00:17:23,720 --> 00:17:24,800
It's hard there.

375
00:17:24,800 --> 00:17:30,120
So I can't imagine in biochem, it

376
00:17:30,120 --> 00:17:32,280
must be extremely difficult to understand

377
00:17:32,280 --> 00:17:40,000
what's the smallest level of meaning for these systems.

378
00:17:40,000 --> 00:17:42,360
Right, and you don't know, too.

379
00:17:42,360 --> 00:17:44,880
It may be different for different questions.

380
00:17:44,880 --> 00:17:46,320
This is a fundamental challenge.

381
00:17:46,320 --> 00:17:50,200
If you want to do things with in-context learning,

382
00:17:50,200 --> 00:17:55,000
if you can speak to GPT-4 like you would speak to another

383
00:17:55,000 --> 00:17:57,120
human, we know how to communicate.

384
00:17:57,120 --> 00:17:58,400
I don't speak protein.

385
00:17:58,400 --> 00:18:00,720
I don't know anyone else that does.

386
00:18:00,720 --> 00:18:02,400
Do I spell it out in amino acids?

387
00:18:02,400 --> 00:18:03,400
I don't know.

388
00:18:03,400 --> 00:18:04,440
How do I tell it?

389
00:18:04,440 --> 00:18:08,320
So that has led to a lot of ideas

390
00:18:08,320 --> 00:18:12,080
that the multimodal models might be a way to go if you actually

391
00:18:12,080 --> 00:18:15,760
want to speak to a model that will design a protein for you

392
00:18:15,760 --> 00:18:17,520
the way you would speak about how you're

393
00:18:17,520 --> 00:18:19,760
going to do an experiment.

394
00:18:19,760 --> 00:18:21,560
So certainly, those are active areas.

395
00:18:21,560 --> 00:18:26,720
And that's a difference as well, I guess, with biology versus NLP.

396
00:18:26,720 --> 00:18:29,840
But yeah, back to the discussion about tokenization,

397
00:18:29,840 --> 00:18:30,840
we don't always know.

398
00:18:30,840 --> 00:18:34,080
Proteins have a whole hierarchy of structure.

399
00:18:34,080 --> 00:18:35,640
There's amino acid sequence.

400
00:18:35,640 --> 00:18:38,080
There's something called secondary structure.

401
00:18:38,080 --> 00:18:40,040
And then there's the tertiary structure, which

402
00:18:40,040 --> 00:18:42,200
is sort of the 3D coordinates.

403
00:18:42,200 --> 00:18:46,280
Maybe you need to break it along secondary structure entities.

404
00:18:46,280 --> 00:18:49,960
And many sort of SentencePiece-type tokenizers

405
00:18:49,960 --> 00:18:51,400
will actually do that a little bit.

406
00:18:51,400 --> 00:18:54,120
I've done some experimentation.

407
00:18:54,120 --> 00:18:55,440
But not always.

408
00:18:55,440 --> 00:18:57,400
And so yeah, what's the right way?

409
00:18:57,400 --> 00:19:00,400
Or do you need to break it along three-dimensional domains

410
00:19:00,400 --> 00:19:01,480
for it to be useful?

411
00:19:01,480 --> 00:19:02,000
Maybe.

412
00:19:02,000 --> 00:19:03,120
I don't know.

413
00:19:03,120 --> 00:19:05,920
But then it's not always the same amino acid.

414
00:19:05,920 --> 00:19:08,160
And some of the amino acids are kind of interchangeable.

415
00:19:08,160 --> 00:19:12,320
So then you need to be able to account for that too.

416
00:19:12,320 --> 00:19:12,880
Right.

417
00:19:12,880 --> 00:19:15,840
So I can't even imagine.

418
00:19:15,840 --> 00:19:17,600
It's so tricky.

419
00:19:17,600 --> 00:19:19,280
I mean, I'm just going to keep coming back

420
00:19:19,280 --> 00:19:21,020
to natural language processing, thinking

421
00:19:21,020 --> 00:19:22,600
about different domains.

422
00:19:22,600 --> 00:19:26,560
And you think about a task, say, like sentiment analysis.

423
00:19:26,560 --> 00:19:30,720
And then you think about it in product reviews.

424
00:19:30,720 --> 00:19:32,960
Or you think about it in news articles.

425
00:19:32,960 --> 00:19:34,520
Or you think about it in dialogue.

426
00:19:34,520 --> 00:19:37,600
And those three, yes, they're all sentiment analysis.

427
00:19:37,600 --> 00:19:41,240
But those three are all extremely different tasks.

428
00:19:41,240 --> 00:19:43,840
And you have to do specific things for all of those things.

429
00:19:43,840 --> 00:19:47,160
So I'm sure there's parallels for the work that you're doing,

430
00:19:47,160 --> 00:19:51,620
trying to understand what is the specific context

431
00:19:51,620 --> 00:19:53,460
and what's the specific methodology that

432
00:19:53,460 --> 00:19:56,720
would work for this type of problem.

433
00:19:56,720 --> 00:19:58,380
So it's probably a lot of experimentation

434
00:19:58,380 --> 00:19:59,880
that needs to get done there.

435
00:19:59,880 --> 00:20:02,560
And another thing that I would say that it's very hard,

436
00:20:02,560 --> 00:20:05,840
we as speakers of the English language

437
00:20:05,840 --> 00:20:10,400
have some gauge whether or not the prediction of that sentiment

438
00:20:10,400 --> 00:20:12,920
is close, way off.

439
00:20:12,920 --> 00:20:14,880
We can kind of sniff test that.

440
00:20:14,880 --> 00:20:16,760
It's very hard with proteins.

441
00:20:16,760 --> 00:20:18,720
Same thing, I don't speak protein.

442
00:20:18,720 --> 00:20:21,160
So we don't know.

443
00:20:21,160 --> 00:20:23,080
We have to compare it to existing assays.

444
00:20:23,080 --> 00:20:26,680
We have to be in a situation where one can run assays,

445
00:20:26,680 --> 00:20:30,000
or these lab-in-the-loop ideas, which is a very smart way

446
00:20:30,000 --> 00:20:32,240
of doing things.

447
00:20:32,240 --> 00:20:34,920
But yeah, that's another challenging part with biology,

448
00:20:34,920 --> 00:20:38,160
that we don't have a great sense of how correct a prediction is.

449
00:20:38,160 --> 00:20:42,960
And there are many predictive tasks, as you alluded to,

450
00:20:42,960 --> 00:20:47,780
that are useful in the drug design process.

451
00:20:47,780 --> 00:20:49,960
Certainly, the value of a foundation model

452
00:20:49,960 --> 00:20:52,800
would be the very rich predictive embeddings

453
00:20:52,800 --> 00:20:55,040
that it produces.

454
00:20:55,040 --> 00:20:55,920
Very cool.

455
00:20:55,920 --> 00:20:58,040
So let's go into it.

456
00:20:58,040 --> 00:21:02,400
What does it take to create, or try to create, or begin

457
00:21:02,400 --> 00:21:06,400
to even think about bio foundation models?

458
00:21:06,400 --> 00:21:08,720
Yeah, well, certainly first and foremost, I

459
00:21:08,720 --> 00:21:10,960
think a data strategy.

460
00:21:10,960 --> 00:21:14,280
The hard lessons learned, I think,

461
00:21:14,280 --> 00:21:17,000
by many data scientists and researchers

462
00:21:17,000 --> 00:21:19,760
is that data is the most important thing.

463
00:21:19,760 --> 00:21:21,360
And I certainly think that's starting

464
00:21:21,360 --> 00:21:26,640
to be true because model architectures, obviously,

465
00:21:26,640 --> 00:21:27,960
they will progress and change.

466
00:21:27,960 --> 00:21:29,280
And there are improvements.

467
00:21:29,280 --> 00:21:31,480
But there are a lot of model architectures

468
00:21:31,480 --> 00:21:33,160
that are starting to be very good.

469
00:21:33,160 --> 00:21:34,600
They're easy to use.

470
00:21:34,600 --> 00:21:38,080
So data can be a huge source of advantage.

471
00:21:38,080 --> 00:21:40,320
So I'm part of a team right now that's

472
00:21:40,320 --> 00:21:44,520
starting to think about building these bio foundation models.

473
00:21:44,520 --> 00:21:49,880
And it's a matter of picking a problem where we think NVIDIA

474
00:21:49,880 --> 00:21:53,440
can uniquely succeed.

475
00:21:53,440 --> 00:21:55,440
NVIDIA is not a drug discovery company,

476
00:21:55,440 --> 00:21:57,920
so we don't generate our own data.

477
00:21:57,920 --> 00:22:00,320
That doesn't mean we can't get it from places.

478
00:22:00,320 --> 00:22:03,400
But we have to think very carefully about that.

479
00:22:03,400 --> 00:22:05,800
And we're also trying to think very carefully about,

480
00:22:05,800 --> 00:22:08,280
like I said, we do have a lot of compute.

481
00:22:08,280 --> 00:22:12,920
So that is a very significant advantage.

482
00:22:12,920 --> 00:22:14,880
But what is the data we want to use?

483
00:22:14,880 --> 00:22:16,240
What is the problem?

484
00:22:16,240 --> 00:22:19,120
We also want to really, we're thinking

485
00:22:19,120 --> 00:22:22,800
about what are grand challenge problems in biology

486
00:22:22,800 --> 00:22:26,200
and trying to make sure we position ourselves

487
00:22:26,200 --> 00:22:29,760
to work towards something that is a really fundamentally

488
00:22:29,760 --> 00:22:31,560
difficult problem.

489
00:22:31,560 --> 00:22:33,920
Yeah, that makes sense.

490
00:22:33,920 --> 00:22:36,800
What are some of the potential grand challenge problems?

491
00:22:36,800 --> 00:22:39,720
I mean, I don't know that I should say too much.

492
00:22:39,720 --> 00:22:42,200
But I think these are probably fairly obvious things,

493
00:22:42,200 --> 00:22:45,960
like how do you simulate protein-protein interactions?

494
00:22:45,960 --> 00:22:48,880
How do you simulate parts of a cell, functionality of a cell?

495
00:22:48,880 --> 00:22:52,840
The way cells work, there's not an absolute cell model.

496
00:22:52,840 --> 00:22:54,800
There are temporal parts to that.

497
00:22:54,800 --> 00:22:57,600
There are tissue-specific aspects.

498
00:22:57,600 --> 00:23:00,440
All these things are very hard.

499
00:23:00,440 --> 00:23:04,240
Yeah, that makes a lot of sense.

500
00:23:04,240 --> 00:23:04,860
OK, cool.

501
00:23:04,860 --> 00:23:07,720
So trying to create these sorts of things,

502
00:23:07,720 --> 00:23:12,120
you probably need a pretty multidisciplinary team.

503
00:23:12,120 --> 00:23:14,600
I know in your PyData talk, you spoke

504
00:23:14,600 --> 00:23:17,760
about building a team in your previous project.

505
00:23:17,760 --> 00:23:23,680
You went from two to about 30 or 40 people.

506
00:23:23,680 --> 00:23:28,040
Yeah, the broader product team is about 40,

507
00:23:28,040 --> 00:23:29,920
close to 40 people right now.

508
00:23:29,920 --> 00:23:31,400
Yeah.

509
00:23:31,400 --> 00:23:32,240
How was that?

510
00:23:32,240 --> 00:23:35,720
What was that like building a team?

511
00:23:35,720 --> 00:23:38,240
I mean, there's so much that goes into it.

512
00:23:38,240 --> 00:23:40,800
There's all these machine learning problems that we face,

513
00:23:40,800 --> 00:23:45,000
but dealing with people is a whole other level.

514
00:23:45,000 --> 00:23:50,000
And people are always the most challenging problem.

515
00:23:50,000 --> 00:23:52,920
I think first and foremost, you've

516
00:23:52,920 --> 00:23:55,000
got to get the culture right.

517
00:23:55,000 --> 00:23:58,200
Team dynamics matter so, so much.

518
00:23:58,200 --> 00:24:03,320
And you have to cultivate an environment where people

519
00:24:03,320 --> 00:24:04,840
feel valued for their expertise.

520
00:24:04,840 --> 00:24:06,400
They feel like they're working on things

521
00:24:06,400 --> 00:24:10,600
that are important, that are of some level of enjoyment

522
00:24:10,600 --> 00:24:13,240
to them, but that also align.

523
00:24:13,240 --> 00:24:15,400
It's always like getting all the three buckets

524
00:24:15,400 --> 00:24:19,760
to align of things, your interest, product, impact.

525
00:24:19,760 --> 00:24:21,960
Getting those things to align is very hard,

526
00:24:21,960 --> 00:24:24,440
but try to strike a balance.

527
00:24:24,440 --> 00:24:28,320
So hiring, certainly, culture is really important.

528
00:24:28,320 --> 00:24:31,000
We certainly, not every role, but some roles,

529
00:24:31,000 --> 00:24:32,600
we do look for some domain experience

530
00:24:32,600 --> 00:24:36,400
because there are a lot of details and nuances

531
00:24:36,400 --> 00:24:37,400
to scientific data.

532
00:24:37,400 --> 00:24:41,760
And even genomics data is different from protein data,

533
00:24:41,760 --> 00:24:45,440
is different from cheminformatics data,

534
00:24:45,440 --> 00:24:50,240
and then usually some deep learning experience.

535
00:24:50,240 --> 00:24:53,760
One thing that is very true is that the field

536
00:24:53,760 --> 00:24:55,160
is changing so fast.

537
00:24:55,160 --> 00:24:57,360
The things that we, as a team, have had to do

538
00:24:57,360 --> 00:24:59,600
have changed and evolved pretty quickly.

539
00:24:59,600 --> 00:25:01,160
So it's also figuring out how you

540
00:25:01,160 --> 00:25:05,520
can hire for people that can adapt and learn new things.

541
00:25:05,520 --> 00:25:09,120
And I mean, layer on top of that the speed at which the field is

542
00:25:09,120 --> 00:25:11,320
changing, which we've discussed, but also

543
00:25:11,320 --> 00:25:14,160
NVIDIA comes out with new GPUs pretty regularly.

544
00:25:14,160 --> 00:25:17,360
In fact, I think it's a very regular cycle,

545
00:25:17,360 --> 00:25:20,600
and maybe even increasing in frequency from what it has been.

546
00:25:20,600 --> 00:25:24,680
So we want to surface the best of those GPUs.

547
00:25:24,680 --> 00:25:27,880
So that means constantly updating what you're doing

548
00:25:27,880 --> 00:25:30,560
or the way you have things implemented

549
00:25:30,560 --> 00:25:31,760
in your software stack.

550
00:25:31,760 --> 00:25:34,120
So benchmarking, lots of benchmarking,

551
00:25:34,120 --> 00:25:38,040
and not just predictive power, but inference and training

552
00:25:38,040 --> 00:25:39,000
time.

553
00:25:39,000 --> 00:25:41,160
So we do a lot of that.

554
00:25:41,160 --> 00:25:43,960
So those are very different, I think,

555
00:25:43,960 --> 00:25:51,000
from what data scientist roles might be like at other companies.

556
00:25:51,000 --> 00:25:52,760
Yeah.

557
00:25:52,760 --> 00:25:55,400
You really need to have a strong background.

558
00:25:55,400 --> 00:25:57,960
I think it's always so helpful.

559
00:25:57,960 --> 00:26:00,960
Well, like in your case, you worked in the lab

560
00:26:00,960 --> 00:26:03,520
before now working on these sorts of problems.

561
00:26:03,520 --> 00:26:05,400
So even though I know you were saying

562
00:26:05,400 --> 00:26:06,960
it's hard to know if it makes sense,

563
00:26:06,960 --> 00:26:10,240
but you do have a general sense.

564
00:26:10,240 --> 00:26:13,800
Is that number even remotely right?

565
00:26:13,800 --> 00:26:14,480
Right.

566
00:26:14,480 --> 00:26:17,480
Is it even in the right ballpark?

567
00:26:17,480 --> 00:26:18,600
Yeah.

568
00:26:18,600 --> 00:26:21,160
Which is so important for all of these problems.

569
00:26:24,200 --> 00:26:25,360
Yeah, there's a lot of things.

570
00:26:25,360 --> 00:26:27,040
I guess one thing I'm curious about,

571
00:26:27,040 --> 00:26:29,920
so yeah, you've played different roles, right?

572
00:26:29,920 --> 00:26:35,720
So you've been an AI researcher, and that was more hands on.

573
00:26:35,720 --> 00:26:37,160
And now you've kind of transitioned

574
00:26:37,160 --> 00:26:38,800
into a manager role.

575
00:26:38,800 --> 00:26:41,400
Is there anything that I'm missing in there?

576
00:26:41,400 --> 00:26:41,880
No.

577
00:26:41,880 --> 00:26:43,000
Are those the two?

578
00:26:43,000 --> 00:26:44,120
Oh, those are the two?

579
00:26:44,120 --> 00:26:44,640
Yeah.

580
00:26:44,640 --> 00:26:45,840
Those are the two?

581
00:26:45,840 --> 00:26:46,480
I was a teacher.

582
00:26:46,480 --> 00:26:51,480
I taught a data science boot camp after I left the NIH.

583
00:26:51,480 --> 00:26:54,080
I was basically learning more machine learning

584
00:26:54,080 --> 00:26:55,640
as I was teaching it.

585
00:26:55,640 --> 00:26:58,000
OK.

586
00:26:58,000 --> 00:26:59,480
That's the best way to learn.

587
00:26:59,480 --> 00:27:00,360
Yeah, it is.

588
00:27:00,360 --> 00:27:03,040
By teaching it, because then you're

589
00:27:03,040 --> 00:27:04,720
forced to really know all of it.

590
00:27:04,720 --> 00:27:06,200
And then someone will ask you some question,

591
00:27:06,200 --> 00:27:09,360
and you'll be like, I never really thought of it like that.

592
00:27:09,360 --> 00:27:11,200
Or when you're forced to explain it to someone,

593
00:27:11,200 --> 00:27:14,960
you're like, oh, I see this gap in my knowledge.

594
00:27:14,960 --> 00:27:16,280
Yeah.

595
00:27:16,280 --> 00:27:17,120
Humility.

596
00:27:17,120 --> 00:27:18,960
It's good to have a little bit of humility.

597
00:27:18,960 --> 00:27:22,760
That's also very important on a multidisciplinary team,

598
00:27:22,760 --> 00:27:25,320
is to be open to learning from each other.

599
00:27:25,320 --> 00:27:26,080
Yeah.

600
00:27:26,080 --> 00:27:30,200
Any other traits that you're looking for in a team?

601
00:27:30,200 --> 00:27:36,160
I think, obviously, technical excellence and experience

602
00:27:36,160 --> 00:27:37,760
are important.

603
00:27:37,760 --> 00:27:42,000
But I think culture and attitude are the really big ones,

604
00:27:42,000 --> 00:27:46,080
because I think they can make or break.

605
00:27:46,080 --> 00:27:47,680
Yeah.

606
00:27:47,680 --> 00:27:50,680
A coworker of mine just shared this piece.

607
00:27:50,680 --> 00:27:54,720
It was about looking for people who are hungry, humble,

608
00:27:54,720 --> 00:27:55,840
and smart.

609
00:27:55,840 --> 00:27:57,760
But not smart in the traditional way.

610
00:27:57,760 --> 00:28:01,720
Smart in like, you know when you're

611
00:28:01,720 --> 00:28:03,440
in a meeting what you're saying, how

612
00:28:03,440 --> 00:28:06,040
it's going to influence the people in the room kind of thing.

613
00:28:06,040 --> 00:28:06,920
Emotional.

614
00:28:06,920 --> 00:28:07,760
Yeah.

615
00:28:07,760 --> 00:28:08,360
EQ.

616
00:28:08,360 --> 00:28:09,000
High EQ.

617
00:28:09,000 --> 00:28:09,960
Yeah.

618
00:28:09,960 --> 00:28:12,120
Some high EQ.

619
00:28:12,120 --> 00:28:16,920
Very hard to find people that check all of those boxes.

620
00:28:16,920 --> 00:28:17,960
It is.

621
00:28:17,960 --> 00:28:21,960
But that's the fun part of everything.

622
00:28:21,960 --> 00:28:22,480
Yeah.

623
00:28:22,480 --> 00:28:25,760
I think I was just going to say, yeah, becoming a manager,

624
00:28:25,760 --> 00:28:31,040
it's really fun to see these extremely talented people grow

625
00:28:31,040 --> 00:28:34,520
and develop in their role and start to lead things.

626
00:28:34,520 --> 00:28:37,640
And that part is really fun about a manager.

627
00:28:37,640 --> 00:28:41,720
It's really fun seeing them succeed and cheering them on.

628
00:28:41,720 --> 00:28:44,720
Yeah, absolutely.

629
00:28:44,720 --> 00:28:48,080
I've spoken to people who have done a similar transition

630
00:28:48,080 --> 00:28:52,440
from sort of like an IC.

631
00:28:52,440 --> 00:28:54,920
No one's really an IC, right?

632
00:28:54,920 --> 00:28:56,400
You're always working in a team.

633
00:28:56,400 --> 00:28:58,320
A team, yeah.

634
00:28:58,320 --> 00:29:01,840
But sort of that IC role to the manager role

635
00:29:01,840 --> 00:29:06,280
and this idea of, oh, I'm a 5x, 10x engineer.

636
00:29:06,280 --> 00:29:09,400
But when you go into managing, you

637
00:29:09,400 --> 00:29:11,520
can actually unblock so many people

638
00:29:11,520 --> 00:29:15,800
that you can have a bigger impact on your company,

639
00:29:15,800 --> 00:29:16,280
actually.

640
00:29:16,280 --> 00:29:20,520
So yeah, it's such an interesting thing

641
00:29:20,520 --> 00:29:25,440
because people want to be very, sometimes you

642
00:29:25,440 --> 00:29:27,400
want to be very focused on what you're doing.

643
00:29:27,400 --> 00:29:29,120
And then you don't want to be distracted

644
00:29:29,120 --> 00:29:30,360
by all of these other things.

645
00:29:30,360 --> 00:29:31,920
But I guess it's just each person

646
00:29:31,920 --> 00:29:37,840
needs to sort of find their own balance between that.

647
00:29:37,840 --> 00:29:41,200
So like, NVIDIA is a very ground up company,

648
00:29:41,200 --> 00:29:44,240
which it's one of the things I love about it.

649
00:29:44,240 --> 00:29:48,960
But then it's helping your mentees

650
00:29:48,960 --> 00:29:52,840
to understand how to prioritize things.

651
00:29:52,840 --> 00:29:54,720
Because it gets to be kind of nuanced

652
00:29:54,720 --> 00:29:57,120
or how to help with the situation

653
00:29:57,120 --> 00:30:00,200
but not get completely sucked into something that probably

654
00:30:00,200 --> 00:30:03,160
isn't a top priority for you.

655
00:30:03,160 --> 00:30:06,200
It's like the nuances.

656
00:30:06,200 --> 00:30:07,680
And I think there's some of it too,

657
00:30:07,680 --> 00:30:10,440
just acknowledging that the work that we do is hard

658
00:30:10,440 --> 00:30:12,320
and helping them understand that it's expected.

659
00:30:12,320 --> 00:30:14,800
I expect that this is going to take a while to learn

660
00:30:14,800 --> 00:30:15,680
how to do it.

661
00:30:15,680 --> 00:30:18,880
So I don't love, I kind of made a face at the 5x, 10x engineer.

662
00:30:18,880 --> 00:30:20,280
I don't love that stuff.

663
00:30:20,280 --> 00:30:22,160
I don't know why.

664
00:30:22,160 --> 00:30:23,240
Because what does it mean?

665
00:30:23,240 --> 00:30:24,520
What does it really mean?

666
00:30:24,520 --> 00:30:24,880
Right.

667
00:30:24,880 --> 00:30:25,560
Yeah.

668
00:30:25,560 --> 00:30:26,160
I know.

669
00:30:26,160 --> 00:30:27,440
But people say it.

670
00:30:27,440 --> 00:30:28,840
Yeah.

671
00:30:28,840 --> 00:30:30,080
But I like them.

672
00:30:30,080 --> 00:30:32,280
But yeah, just acknowledging that what we do is hard

673
00:30:32,280 --> 00:30:33,960
and that that's OK.

674
00:30:33,960 --> 00:30:35,600
You're going to have to work at this.

675
00:30:35,600 --> 00:30:37,320
Yeah, for sure.

676
00:30:37,320 --> 00:30:42,720
I think working with other people where I'm at,

677
00:30:42,720 --> 00:30:46,000
we have so many initiatives and projects that are taking place.

678
00:30:46,000 --> 00:30:49,680
And you can't get so in the details on everything.

679
00:30:49,680 --> 00:30:53,040
But having a team that can be a sounding board

680
00:30:53,040 --> 00:30:57,520
and having someone that you can say, OK, this is what I'm up to.

681
00:30:57,520 --> 00:30:59,400
These are the next things I was thinking of.

682
00:30:59,400 --> 00:31:01,200
And then someone can kind of say, well, I

683
00:31:01,200 --> 00:31:02,960
did a project that was similar to this.

684
00:31:02,960 --> 00:31:06,400
And ABC, you'll probably go down this rabbit hole.

685
00:31:06,400 --> 00:31:08,040
So maybe do DEF.

686
00:31:08,040 --> 00:31:08,600
You know?

687
00:31:08,600 --> 00:31:09,600
Yeah.

688
00:31:09,600 --> 00:31:14,760
And it's really nice to, yeah, working in a team,

689
00:31:14,760 --> 00:31:16,280
it's incredible.

690
00:31:16,280 --> 00:31:19,920
I think at one point in my life, I always thought,

691
00:31:19,920 --> 00:31:22,680
oh, I can go very far myself.

692
00:31:22,680 --> 00:31:25,800
But being a part of a startup is where I realize, wow,

693
00:31:25,800 --> 00:31:28,080
you can do unbelievable things.

694
00:31:28,080 --> 00:31:28,760
As a team.

695
00:31:28,760 --> 00:31:29,520
Yeah.

696
00:31:29,520 --> 00:31:31,600
Yeah.

697
00:31:31,600 --> 00:31:32,440
For sure.

698
00:31:32,440 --> 00:31:32,960
Yeah.

699
00:31:32,960 --> 00:31:37,960
So not really transitioning, but just the next question,

700
00:31:37,960 --> 00:31:39,440
just talking about machine learning,

701
00:31:39,440 --> 00:31:41,880
and we can still talk about the same stuff.

702
00:31:41,880 --> 00:31:44,360
What's an important question that you believe

703
00:31:44,360 --> 00:31:47,720
remains unanswered in machine learning?

704
00:31:47,720 --> 00:31:48,200
Yeah.

705
00:31:48,200 --> 00:31:50,360
So I alluded to this a little bit when

706
00:31:50,360 --> 00:31:52,720
I was talking about my postdoc.

707
00:31:52,720 --> 00:31:55,440
So biological and chemical modalities.

708
00:31:55,440 --> 00:31:58,160
So number one, they aren't text, even though we

709
00:31:58,160 --> 00:31:59,240
represent them that way.

710
00:31:59,240 --> 00:32:02,320
They're three-dimensional, but they're also dynamic.

711
00:32:02,320 --> 00:32:08,320
And that motion is very fundamental to the roles

712
00:32:08,320 --> 00:32:09,680
that they play in biology.

713
00:32:09,680 --> 00:32:11,880
To protein-protein interactions,

714
00:32:11,880 --> 00:32:18,160
like proteins bind together, to the binding of a compound

715
00:32:18,160 --> 00:32:20,160
or a drug, a ligand.

716
00:32:20,160 --> 00:32:22,360
A ligand or a drug, usually a drug

717
00:32:22,360 --> 00:32:25,120
is just a ligand that binds in a specific way

718
00:32:25,120 --> 00:32:28,680
and causes an enzyme.

719
00:32:28,680 --> 00:32:30,600
Those are really fundamental to biology,

720
00:32:30,600 --> 00:32:33,760
and we don't have good representations for them yet.

721
00:32:33,760 --> 00:32:39,160
Even AlphaFold is trained on the structural data,

722
00:32:39,160 --> 00:32:41,840
or X-ray crystallographic structures

723
00:32:41,840 --> 00:32:45,920
from a repository called the Protein Data Bank, which

724
00:32:45,920 --> 00:32:49,280
is if you publish a structure in a scientific journal,

725
00:32:49,280 --> 00:32:52,280
you have to deposit the coordinates there.

726
00:32:52,280 --> 00:32:53,840
But those are static.

727
00:32:53,840 --> 00:32:57,520
And so they're static, and they're also,

728
00:32:57,520 --> 00:32:59,720
like to crystallize a protein, it

729
00:32:59,720 --> 00:33:01,520
has to pack into this ordered lattice.

730
00:33:01,520 --> 00:33:03,680
So it's whatever conformation of a protein

731
00:33:03,680 --> 00:33:05,480
packed into this lattice.

732
00:33:05,480 --> 00:33:10,560
Many people don't think about that, but they're not moving.

733
00:33:10,560 --> 00:33:14,840
So there's very, and sometimes protein motions are small,

734
00:33:14,840 --> 00:33:16,280
but sometimes they're really big.

735
00:33:16,280 --> 00:33:20,160
They're really very, very major conformational changes.

736
00:33:20,160 --> 00:33:23,560
So we don't have a good way of describing that

737
00:33:23,560 --> 00:33:25,200
in machine learning models yet.

738
00:33:25,200 --> 00:33:26,600
We're starting to think about it.

739
00:33:26,600 --> 00:33:29,200
There's ways you can collect data.

740
00:33:29,200 --> 00:33:32,840
For example, molecular dynamics simulation data

741
00:33:32,840 --> 00:33:35,240
can be used for that to some degree,

742
00:33:35,240 --> 00:33:37,360
but we just don't have good models yet for it.

743
00:33:37,360 --> 00:33:40,280
Yeah, and so dynamic here, just to make sure,

744
00:33:40,280 --> 00:33:43,600
you're talking like changing through time, right?

745
00:33:43,600 --> 00:33:46,080
Basically, it's not stationary.

746
00:33:46,080 --> 00:33:48,200
It's actually changing.

747
00:33:48,200 --> 00:33:51,040
Yeah, OK, yeah.

748
00:33:51,040 --> 00:33:54,000
So as much as I'm interested in machine learning

749
00:33:54,000 --> 00:33:55,640
and artificial intelligence, I'm very

750
00:33:55,640 --> 00:33:58,400
interested in natural intelligence also

751
00:33:58,400 --> 00:34:00,600
and how the brain works and everything.

752
00:34:00,600 --> 00:34:05,760
And when I first really took a deep dive into that,

753
00:34:05,760 --> 00:34:07,680
I found that to be the case also,

754
00:34:07,680 --> 00:34:10,880
is that people looked at things a lot in a static way

755
00:34:10,880 --> 00:34:13,240
and that understanding that it's like things are,

756
00:34:13,240 --> 00:34:17,840
it's a dynamic, complex, adaptive, changing system,

757
00:34:17,840 --> 00:34:20,760
and that it's not in isolation also, which

758
00:34:20,760 --> 00:34:23,880
is what you're mentioning too, how it interacts

759
00:34:23,880 --> 00:34:25,920
with the other.

760
00:34:25,920 --> 00:34:28,680
Yes, cells are changing.

761
00:34:28,680 --> 00:34:30,360
They're different in different tissues.

762
00:34:30,360 --> 00:34:33,880
So there's physical differences.

763
00:34:33,880 --> 00:34:36,440
There are temporal differences with cells,

764
00:34:36,440 --> 00:34:39,000
like in disease states, the cells

765
00:34:39,000 --> 00:34:41,240
interact with each other.

766
00:34:41,240 --> 00:34:46,720
So it's a really, really complex coupled system

767
00:34:46,720 --> 00:34:48,640
that is associated with biology.

768
00:34:48,640 --> 00:34:52,000
Yeah, so all of this amazing progress

769
00:34:52,000 --> 00:34:56,120
that's happening in your field with AI drug discovery

770
00:34:56,120 --> 00:34:58,320
and natural language processing, really just

771
00:34:58,320 --> 00:34:59,680
like across the board, right?

772
00:34:59,680 --> 00:35:03,360
There's like a new multimodal model that comes out every day.

773
00:35:03,360 --> 00:35:07,880
There's a new technique that comes out every day.

774
00:35:07,880 --> 00:35:13,640
How do you view the gap between the hype of this frenzied

775
00:35:13,640 --> 00:35:20,320
state that the field is in and the reality of AI?

776
00:35:20,320 --> 00:35:21,800
I think so.

777
00:35:21,800 --> 00:35:24,720
Certainly, you have to accept that it's a thing.

778
00:35:24,720 --> 00:35:29,280
And you have to evaluate papers very carefully.

779
00:35:29,280 --> 00:35:32,480
I think you have to speak to researchers in the field,

780
00:35:32,480 --> 00:35:34,280
going to conferences.

781
00:35:34,280 --> 00:35:36,480
I'm actually, after this, I'm going to go listen to this.

782
00:35:36,480 --> 00:35:40,640
There's a bunch of NeurIPS workshops.

783
00:35:40,640 --> 00:35:42,440
And actually, I was laughing because when

784
00:35:42,440 --> 00:35:46,320
I went to NeurIPS in 2018, there was one biology workshop.

785
00:35:46,320 --> 00:35:48,640
And I think there's like five each day or something

786
00:35:48,640 --> 00:35:49,720
on Friday and Saturday.

787
00:35:49,720 --> 00:35:51,600
It's crazy.

788
00:35:51,600 --> 00:35:53,160
We're talking to researchers, talking

789
00:35:53,160 --> 00:35:55,080
to those in pharma companies that

790
00:35:55,080 --> 00:35:58,880
are in the field that can help you really understand

791
00:35:58,880 --> 00:36:01,120
the flaws in what you have developed, which

792
00:36:01,120 --> 00:36:03,280
is not always fun.

793
00:36:03,280 --> 00:36:05,120
But yeah, I think we have to be very careful.

794
00:36:05,120 --> 00:36:08,080
We have to make sure that we're doing good baselines.

795
00:36:08,080 --> 00:36:10,200
There was a paper published recently

796
00:36:10,200 --> 00:36:11,680
in the single cell genomics field

797
00:36:11,680 --> 00:36:15,520
where they evaluated two transformer models

798
00:36:15,520 --> 00:36:17,960
and found that linear regression did better

799
00:36:17,960 --> 00:36:19,160
than both of those models.

800
00:36:19,160 --> 00:36:19,880
That's taboo.

801
00:36:19,880 --> 00:36:23,400
You're not allowed to say that.

802
00:36:23,400 --> 00:36:28,000
So I think really making sure that we do the right baselines.

803
00:36:28,000 --> 00:36:29,240
100%.

804
00:36:29,240 --> 00:36:31,000
Man, the amount of times.

805
00:36:31,000 --> 00:36:33,680
I think that every single person that I've spoken to

806
00:36:33,680 --> 00:36:36,760
for this podcast has spoken about the importance

807
00:36:36,760 --> 00:36:39,400
of getting baselines.

808
00:36:39,400 --> 00:36:43,760
And in the work that I'm doing, it's so true too.

809
00:36:43,760 --> 00:36:46,040
I know I was briefly telling you about this before.

810
00:36:46,040 --> 00:36:49,520
But yeah, like natural language processing,

811
00:36:49,520 --> 00:36:51,440
for some of the simpler tasks, people

812
00:36:51,440 --> 00:36:57,480
are trying to throw the heaviest monster of a model

813
00:36:57,480 --> 00:37:01,320
at problems that can be solved using simple embeddings

814
00:37:01,320 --> 00:37:04,280
and either logistic regression or support vector machine,

815
00:37:04,280 --> 00:37:06,960
or at least get a baseline with it, right?

816
00:37:06,960 --> 00:37:09,640
To at least just see, because maybe you'll find a problem

817
00:37:09,640 --> 00:37:12,240
in your other pipeline just based off

818
00:37:12,240 --> 00:37:16,320
of what you did with the other pipeline

819
00:37:16,320 --> 00:37:18,320
that you created to get a baseline.

820
00:37:18,320 --> 00:37:20,240
If nothing else, the baseline is something

821
00:37:20,240 --> 00:37:23,360
that you can put into an interface or your product

822
00:37:23,360 --> 00:37:27,160
to keep building so that you're not blocked by this model.

823
00:37:27,160 --> 00:37:29,960
Because my joke is that, especially deep learning,

824
00:37:29,960 --> 00:37:31,360
it's like an ideal gas.

825
00:37:31,360 --> 00:37:37,680
It will expand to fill whatever space and time you give it.

826
00:37:37,680 --> 00:37:39,080
But it's true.

827
00:37:39,080 --> 00:37:42,440
Even in cheminformatics, cheminformaticians

828
00:37:42,440 --> 00:37:44,880
will tell you that they've used machine learning

829
00:37:44,880 --> 00:37:46,080
for a very long time.

830
00:37:46,080 --> 00:37:50,920
And the random forest models and some simple fingerprint

831
00:37:50,920 --> 00:37:54,840
features of small molecules are pretty hard to beat.

832
00:37:54,840 --> 00:37:57,960
And they're not wrong.

833
00:37:57,960 --> 00:37:59,360
Yeah, it's true.

834
00:37:59,360 --> 00:38:03,800
Yeah, if you ask some Kagglers, they'll

835
00:38:03,800 --> 00:38:06,400
say XGBoost is all you need.

836
00:38:06,400 --> 00:38:08,960
That's the joke there.

837
00:38:08,960 --> 00:38:12,320
But no, ensemble models, you can really

838
00:38:12,320 --> 00:38:15,360
do a lot of amazing things.

839
00:38:15,360 --> 00:38:18,440
And in a little bit of a different perspective,

840
00:38:18,440 --> 00:38:21,760
I'm thinking about your PyData talk.

841
00:38:21,760 --> 00:38:26,480
You spoke about creating a product where

842
00:38:26,480 --> 00:38:28,880
you were visualizing things that wasn't really

843
00:38:28,880 --> 00:38:33,360
heavy on machine learning and the importance of that.

844
00:38:33,360 --> 00:38:35,560
Do you want to speak to it a little bit?

845
00:38:35,560 --> 00:38:36,720
Sure.

846
00:38:36,720 --> 00:38:39,040
I guess, are you asking sort of generally,

847
00:38:39,040 --> 00:38:40,800
or are you asking about that particular part?

848
00:38:40,800 --> 00:38:43,680
Well, no, just the importance of visualizing things

849
00:38:43,680 --> 00:38:46,200
before you even jump into the machine learning parts.

850
00:38:46,200 --> 00:38:46,920
Yeah.

851
00:38:46,920 --> 00:38:48,160
Oh, yeah, I understood that.

852
00:38:48,160 --> 00:38:50,760
Yeah, understanding your data, like I said,

853
00:38:50,760 --> 00:38:52,280
data is the most important thing.

854
00:38:52,280 --> 00:38:55,520
And getting to know it on a very personal level

855
00:38:55,520 --> 00:38:57,840
is really useful.

856
00:38:57,840 --> 00:39:01,120
And it can give you ideas for different scenarios

857
00:39:01,120 --> 00:39:03,480
that you want to test with the data.

858
00:39:03,480 --> 00:39:04,840
How do you split your data?

859
00:39:04,840 --> 00:39:09,600
So for example, with protein structure prediction,

860
00:39:09,600 --> 00:39:13,000
the data are often split temporally

861
00:39:13,000 --> 00:39:16,200
based on the date on which they were entered

862
00:39:16,200 --> 00:39:18,320
into the Protein Data Bank.

863
00:39:18,320 --> 00:39:22,280
So one hypothesis is that that enables you

864
00:39:22,280 --> 00:39:24,360
to predict future structures.

865
00:39:24,360 --> 00:39:26,440
And that's not a bad hypothesis.

866
00:39:26,440 --> 00:39:29,480
However, having been a structural biologist,

867
00:39:29,480 --> 00:39:33,560
I can tell you that there is almost certainly bound

868
00:39:33,560 --> 00:39:36,720
to be a lot of redundancy in proteins

869
00:39:36,720 --> 00:39:37,920
that are deposited later.

870
00:39:37,920 --> 00:39:41,160
Because if you solve a crystal structure,

871
00:39:41,160 --> 00:39:43,240
even if it's something that's been solved before,

872
00:39:43,240 --> 00:39:44,800
but you do it as part of your paper,

873
00:39:44,800 --> 00:39:49,400
as like a check of something to ensure that you didn't disrupt,

874
00:39:49,400 --> 00:39:51,920
I don't know, the structure or something, which, by the way,

875
00:39:51,920 --> 00:39:53,120
that's the thing that is done.

876
00:39:53,120 --> 00:39:55,280
Then it gets deposited in the data bank.

877
00:39:55,280 --> 00:39:57,760
So if it ends up on the other side of that temporal split,

878
00:39:57,760 --> 00:40:01,640
there's actually a lot of redundancy in the data.

879
00:40:01,640 --> 00:40:03,360
So then you start to think, OK, what

880
00:40:03,360 --> 00:40:04,600
are better ways to do this?

881
00:40:04,600 --> 00:40:06,720
Maybe you cluster them.

882
00:40:06,720 --> 00:40:07,960
But how do you cluster them?

883
00:40:07,960 --> 00:40:10,200
Do you cluster them by amino acid sequence?

884
00:40:10,200 --> 00:40:13,840
Well, protein constructs can be crystallized.

885
00:40:13,840 --> 00:40:16,440
They can be changed when they're crystallized.

886
00:40:16,440 --> 00:40:18,880
Do you cluster it by three-dimensional similarity?

887
00:40:18,880 --> 00:40:19,800
And then how?

888
00:40:19,800 --> 00:40:21,600
And how do you do the alignment?

889
00:40:21,600 --> 00:40:24,440
I guess a very nuanced problem.
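
[Editor's sketch] The splitting concern described above can be illustrated with a small, hedged example. It assumes made-up entries and uses Python's `difflib.SequenceMatcher` as a crude stand-in for a real sequence alignment tool; the names `temporal_split`, `leaked`, `greedy_clusters`, and `cluster_split` are hypothetical and are not from any tool mentioned in the conversation.

```python
from datetime import date
from difflib import SequenceMatcher

# Toy records: (entry id, amino acid sequence, deposition date). All values are made up.
ENTRIES = [
    ("A1", "MKTAYIAKQRQISFVKSHFSRQ", date(2015, 3, 1)),
    ("B1", "GSHMLEDPVNKLQELAARGGSW", date(2016, 7, 9)),
    ("A2", "MKTAYIAKQRQISFVKSHFSRQ", date(2021, 5, 20)),  # re-deposited copy of A1
    ("C1", "PPGFSPFRWWCDEHNNTYYVLL", date(2022, 1, 15)),
]


def similarity(a, b):
    """Crude string similarity in [0, 1]; a stand-in for a real alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def temporal_split(entries, cutoff):
    """Entries deposited before the cutoff go to train, the rest to test."""
    train = [e for e in entries if e[2] < cutoff]
    test = [e for e in entries if e[2] >= cutoff]
    return train, test


def leaked(train, test, threshold=0.9):
    """Ids of test entries that closely match at least one training entry."""
    return [
        t[0]
        for t in test
        if any(similarity(t[1], tr[1]) >= threshold for tr in train)
    ]


def greedy_clusters(entries, threshold=0.9):
    """Put each entry in the first cluster whose first member it matches."""
    clusters = []
    for entry in entries:
        for cluster in clusters:
            if similarity(entry[1], cluster[0][1]) >= threshold:
                cluster.append(entry)
                break
        else:
            clusters.append([entry])
    return clusters


def cluster_split(entries, cutoff, threshold=0.9):
    """Keep clusters whole; a cluster is test-only if every member is new."""
    train, test = [], []
    for cluster in greedy_clusters(entries, threshold):
        target = test if all(e[2] >= cutoff for e in cluster) else train
        target.extend(cluster)
    return train, test


cutoff = date(2020, 1, 1)
t_train, t_test = temporal_split(ENTRIES, cutoff)
c_train, c_test = cluster_split(ENTRIES, cutoff)
print("temporal split, redundant test entries:", leaked(t_train, t_test))
print("cluster split, redundant test entries:", leaked(c_train, c_test))
```

Clustering on sequence alone is only one option; as noted in the conversation, constructs can be changed for crystallization, so a structure-based similarity could be swapped in for `similarity` without changing the rest of the sketch.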

890
00:40:24,440 --> 00:40:26,720
So getting to know the data and visualizing it,

891
00:40:26,720 --> 00:40:29,720
where does the model do really well and why?

892
00:40:29,720 --> 00:40:32,640
Where does the model really fail and why?

893
00:40:32,640 --> 00:40:34,960
And sometimes it's useful to look at an aggregate plot.

894
00:40:34,960 --> 00:40:36,680
But there were situations, I was working

895
00:40:36,680 --> 00:40:42,360
with someone on the team, where there's like angular rotamers.

896
00:40:42,360 --> 00:40:44,400
It's hard to explain, but there's different angles

897
00:40:44,400 --> 00:40:45,600
that are predicted.

898
00:40:45,600 --> 00:40:47,720
And so we started going through the things,

899
00:40:47,720 --> 00:40:49,360
the ones that were really off, because we

900
00:40:49,360 --> 00:40:51,080
saw this periodicity.

901
00:40:51,080 --> 00:40:53,320
And so we started to understand that one of the,

902
00:40:53,320 --> 00:40:55,360
there were many issues, but I think one of the issues

903
00:40:55,360 --> 00:40:58,680
was probably the software that was measuring the angles

904
00:40:58,680 --> 00:41:02,520
didn't understand some aspects of chemistry.

905
00:41:02,520 --> 00:41:04,800
But when you start to see, the aggregate is useful,

906
00:41:04,800 --> 00:41:07,040
but so are the examples.
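
[Editor's sketch] A minimal illustration of looking at both the aggregate and the individual examples for angle predictions. The conversation does not say what the chemistry issue was; this sketch assumes, purely for illustration, that the error metric ignored angle wrap-around and a 180-degree symmetry. The data and the helper names `wrapped_diff` and `mean_abs` are made up.

```python
def wrapped_diff(pred, true, period=360.0):
    """Smallest signed difference between two angles for a given period (degrees)."""
    d = (pred - true) % period
    return d - period if d > period / 2 else d


def mean_abs(values):
    """Mean absolute value, used here as the aggregate error."""
    return sum(abs(v) for v in values) / len(values)


# Made-up (name, predicted angle, true angle) triples in degrees.
EXAMPLES = [
    ("res_1", 62.0, 60.0),
    ("res_2", -178.0, 179.0),  # sits at the +/-180 wrap-around
    ("res_3", 95.0, -88.0),    # roughly a 180-degree flip
    ("res_4", -65.0, -60.0),
]

naive = [p - t for _, p, t in EXAMPLES]
wrapped = [wrapped_diff(p, t) for _, p, t in EXAMPLES]
# Assumption: the angle is equivalent under a 180-degree flip (a symmetric group).
symmetric = [wrapped_diff(p, t, period=180.0) for _, p, t in EXAMPLES]

# Aggregate view: one number per metric.
print("mean |error|, naive:    ", mean_abs(naive))
print("mean |error|, wrapped:  ", mean_abs(wrapped))
print("mean |error|, symmetric:", mean_abs(symmetric))

# Example view: list the worst cases under the naive metric and see why they are off.
worst = sorted(zip(EXAMPLES, naive), key=lambda pair: abs(pair[1]), reverse=True)
for (name, pred, true), err in worst[:2]:
    print(name, "pred:", pred, "true:", true, "naive error:", err)
```

The aggregate number alone only says the naive metric is large; the individual worst cases are what suggest a measurement problem rather than a modeling problem.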

907
00:41:07,040 --> 00:41:08,000
Right.

908
00:41:08,000 --> 00:41:09,040
Yeah.

909
00:41:09,040 --> 00:41:12,320
I always, as much as in the work that I'm doing,

910
00:41:12,320 --> 00:41:14,800
we want to automate a lot of it, obviously.

911
00:41:14,800 --> 00:41:16,760
We want to streamline things.

912
00:41:16,760 --> 00:41:19,080
There's always this step in the beginning

913
00:41:19,080 --> 00:41:22,200
when you have a new data set, where you just

914
00:41:22,200 --> 00:41:23,920
visualize it in some way.

915
00:41:23,920 --> 00:41:26,800
Just think about how could you even possibly visualize it.

916
00:41:26,800 --> 00:41:29,280
The way that you're thinking about it is like all

917
00:41:29,280 --> 00:41:31,840
the different, either the states or the different features

918
00:41:31,840 --> 00:41:33,080
that you want to be looking at.

919
00:41:33,080 --> 00:41:35,920
But just even going through that exercise

920
00:41:35,920 --> 00:41:39,480
gives you this like inherent, not inherent,

921
00:41:39,480 --> 00:41:42,280
but like a little bit of a more intuitive understanding

922
00:41:42,280 --> 00:41:44,000
of what's taking place.

923
00:41:44,000 --> 00:41:46,320
And then it'll give you some nice hunches

924
00:41:46,320 --> 00:41:52,120
as you try to figure out even just the problem space itself.

925
00:41:52,120 --> 00:41:55,320
Yeah, that was really cool.

926
00:41:55,320 --> 00:41:58,880
I just have a real love of visualizations.

927
00:41:58,880 --> 00:42:03,320
So a lot of the eye candy in your presentation

928
00:42:03,320 --> 00:42:04,760
really attracted me also.

929
00:42:04,760 --> 00:42:06,800
Yeah.

930
00:42:06,800 --> 00:42:09,080
Probably there's some cool videos of protein structures

931
00:42:09,080 --> 00:42:09,800
that are in there.

932
00:42:09,800 --> 00:42:12,160
And that was really just for fun eye candy,

933
00:42:12,160 --> 00:42:14,560
because they're really cool to see.

934
00:42:14,560 --> 00:42:15,720
They are beautiful.

935
00:42:15,720 --> 00:42:17,400
Very beautiful.

936
00:42:17,400 --> 00:42:19,840
When I saw those when I was learning biochemistry

937
00:42:19,840 --> 00:42:21,920
and taking biochemistry in undergrad,

938
00:42:21,920 --> 00:42:23,520
that was the thing that made me really

939
00:42:23,520 --> 00:42:25,040
fall in love with structural biology.

940
00:42:25,040 --> 00:42:27,120
It was like, wow, this is really cool.

941
00:42:27,120 --> 00:42:27,640
Right.

942
00:42:27,640 --> 00:42:28,480
I want to do this.

943
00:42:28,480 --> 00:42:32,240
Yeah, a picture is there's you can't even

944
00:42:32,240 --> 00:42:33,480
put it into words, right?

945
00:42:33,480 --> 00:42:34,120
Yeah.

946
00:42:34,120 --> 00:42:35,040
It really.

947
00:42:35,040 --> 00:42:36,000
Powerful.

948
00:42:36,000 --> 00:42:38,800
Yeah, it can be very powerful.

949
00:42:38,800 --> 00:42:45,640
So now you've been working on drug discovery using AI

950
00:42:45,640 --> 00:42:49,120
and developing these sorts of tools.

951
00:42:49,120 --> 00:42:53,080
So how have you seen the field change

952
00:42:53,080 --> 00:42:57,040
since you started working in the industry?

953
00:42:57,040 --> 00:43:01,000
I think in general, when I started

954
00:43:01,000 --> 00:43:03,560
working in the industry, data science was still

955
00:43:03,560 --> 00:43:06,120
kind of a young new thing.

956
00:43:06,120 --> 00:43:10,600
And so there were a lot of generalists.

957
00:43:10,600 --> 00:43:14,520
And at least maybe this is biased by having been in my field

958
00:43:14,520 --> 00:43:15,020
too.

959
00:43:15,020 --> 00:43:19,320
I start to see a lot more individuals in the field who

960
00:43:19,320 --> 00:43:22,800
are coming to the field, but they have been trained

961
00:43:22,800 --> 00:43:23,520
in a domain.

962
00:43:23,520 --> 00:43:25,840
And then they picked up the data science

963
00:43:25,840 --> 00:43:28,200
as part of their domain-specific education.

964
00:43:28,200 --> 00:43:30,520
And I think some of that just reflects

965
00:43:30,520 --> 00:43:32,080
the way universities are starting

966
00:43:32,080 --> 00:43:36,640
to integrate these computational skills

967
00:43:36,640 --> 00:43:41,080
into their curricula, which I think is very important.

968
00:43:41,080 --> 00:43:48,760
Teaching a compute literate, machine learning literate

969
00:43:48,760 --> 00:43:50,360
student is really important.

970
00:43:50,360 --> 00:43:51,600
So I start to see that.

971
00:43:51,600 --> 00:43:54,800
I think that is very useful.

972
00:43:54,800 --> 00:43:59,760
And demonstrating some level of deep understanding of data

973
00:43:59,760 --> 00:44:03,280
will benefit you, even if you change domains, in my opinion.

974
00:44:03,280 --> 00:44:05,360
Because then at least you have a frame of reference

975
00:44:05,360 --> 00:44:08,440
for the kind of ways that things really went

976
00:44:08,440 --> 00:44:10,200
wrong in the other domain.

977
00:44:10,200 --> 00:44:13,360
So I think that's a big one that I see changing.

978
00:44:16,040 --> 00:44:17,840
I think some of the machine learning tasks

979
00:44:17,840 --> 00:44:19,400
are getting to be a lot easier.

980
00:44:19,400 --> 00:44:20,240
We have libraries.

981
00:44:20,240 --> 00:44:21,320
We have AutoML.

982
00:44:21,320 --> 00:44:25,080
We have things that start to even hyperparameter tuning

983
00:44:25,080 --> 00:44:26,160
for deep learning models.

984
00:44:26,160 --> 00:44:27,800
It's getting a little easier.

985
00:44:27,800 --> 00:44:30,180
It's always hard if you're at the cutting edge of what's

986
00:44:30,180 --> 00:44:30,720
developed.

987
00:44:30,720 --> 00:44:35,160
But I think that stuff's all changing.

988
00:44:35,160 --> 00:44:36,960
Right.

989
00:44:36,960 --> 00:44:41,120
I think, yeah, at one point, there

990
00:44:41,120 --> 00:44:43,640
weren't that many models.

991
00:44:43,640 --> 00:44:46,240
There probably weren't that many resources.

992
00:44:46,240 --> 00:44:49,360
Now there's too many.

993
00:44:49,360 --> 00:44:52,240
There's so much out there.

994
00:44:52,240 --> 00:44:53,600
There's a lot of libraries.

995
00:44:53,600 --> 00:44:54,960
A lot of them are overlapping.

996
00:44:54,960 --> 00:44:57,400
There's a lot of vendors out there that

997
00:44:57,400 --> 00:45:00,800
are trying to do the same things.

998
00:45:00,800 --> 00:45:03,840
Yeah, there's this whole AutoML movement.

999
00:45:03,840 --> 00:45:09,000
But yeah, that's interesting the way

1000
00:45:09,000 --> 00:45:13,040
that people can learn about a particular topic.

1001
00:45:13,040 --> 00:45:15,960
You can start to apply these things more easily.

1002
00:45:15,960 --> 00:45:17,920
So maybe they can get a better understanding

1003
00:45:17,920 --> 00:45:20,520
of the pros and cons of it so they can understand

1004
00:45:20,520 --> 00:45:23,720
what they're learning there at a deeper level.

1005
00:45:23,720 --> 00:45:25,480
But then you can also take that and apply it

1006
00:45:25,480 --> 00:45:26,640
to other fields as well.

1007
00:45:26,640 --> 00:45:31,960
So yeah, that's going to be a very interesting trend.

1008
00:45:31,960 --> 00:45:33,880
Maybe a loaded question, but how do you

1009
00:45:33,880 --> 00:45:38,760
think machine learning will change in the next 10 years?

1010
00:45:38,760 --> 00:45:42,280
I guess for you, what impact do you

1011
00:45:42,280 --> 00:45:45,640
think it will have on scientific discovery?

1012
00:45:45,640 --> 00:45:51,440
I think it will profoundly benefit

1013
00:45:51,440 --> 00:45:54,560
the set of scientists who understand

1014
00:45:54,560 --> 00:45:57,120
how to incorporate it in their work

1015
00:45:57,120 --> 00:45:59,840
and how to critically evaluate it.

1016
00:45:59,840 --> 00:46:02,760
It's a question I get a lot from my former colleagues,

1017
00:46:02,760 --> 00:46:04,360
for example, in NMR spectroscopy

1018
00:46:04,360 --> 00:46:07,120
and structural biology.

1019
00:46:07,120 --> 00:46:10,160
I think it will continue to become more

1020
00:46:10,160 --> 00:46:12,800
of a part of the field.

1021
00:46:12,800 --> 00:46:17,760
So thinking about structures from AlphaFold,

1022
00:46:17,760 --> 00:46:19,880
can and how should they be deposited

1023
00:46:19,880 --> 00:46:21,920
into the Protein Data Bank?

1024
00:46:21,920 --> 00:46:24,960
That's a thing to think about.

1025
00:46:24,960 --> 00:46:26,720
It's not my job to solve, but certainly it's

1026
00:46:26,720 --> 00:46:28,680
an interesting thing to think about.

1027
00:46:28,680 --> 00:46:29,640
What does that mean?

1028
00:46:29,640 --> 00:46:30,280
Any different?

1029
00:46:30,280 --> 00:46:34,480
I mean, structures are computed from other data.

1030
00:46:34,480 --> 00:46:35,600
I'm not sure it's different.

1031
00:46:35,600 --> 00:46:39,320
Maybe it is, but how do you represent that?

1032
00:46:39,320 --> 00:46:42,600
I think the field in general, like I sort of alluded to,

1033
00:46:42,600 --> 00:46:47,160
needs to find better ways to represent biology.

1034
00:46:47,160 --> 00:46:50,600
And right now, there is certainly no one way

1035
00:46:50,600 --> 00:46:51,360
to represent it.

1036
00:46:51,360 --> 00:46:53,480
I think that's a very open question.

1037
00:46:53,480 --> 00:46:56,800
With small molecules, there is a language called SMILES.

1038
00:46:56,800 --> 00:46:57,440
There's graphs.

1039
00:46:57,440 --> 00:46:59,920
Graph models do really well, by the way, with small molecules.

1040
00:46:59,920 --> 00:47:03,040
There's three-dimensional representations as well.

1041
00:47:03,040 --> 00:47:03,800
Same with protein.

1042
00:47:03,800 --> 00:47:04,920
There's a variety of ways.

1043
00:47:04,920 --> 00:47:09,720
So I think that will crystallize a bit more.
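
[Editor's sketch] To make the point about multiple representations concrete, here is a toy example that takes the same small molecule as a SMILES string and as a graph. The parser `linear_smiles_to_graph` is hypothetical and handles only unbranched chains of one-letter elements with implicit single bonds; real SMILES has rings, branches, charges, and more, and a cheminformatics library would normally do this work.

```python
def linear_smiles_to_graph(smiles):
    """Turn an unbranched SMILES string into (atoms, bonds).

    Toy parser: accepts only uppercase one-letter elements joined by implicit
    single bonds (no rings, branches, charges, or two-letter elements).
    """
    if not smiles or not all(ch in "BCNOPSFI" for ch in smiles):
        raise ValueError("unsupported SMILES for this toy parser")
    atoms = list(smiles)
    bonds = [(i, i + 1) for i in range(len(atoms) - 1)]
    return atoms, bonds


# String representation: ethanol written as SMILES.
smiles = "CCO"

# Graph representation: atoms are nodes, bonds are edges.
atoms, bonds = linear_smiles_to_graph(smiles)
adjacency = {i: [] for i in range(len(atoms))}
for a, b in bonds:
    adjacency[a].append(b)
    adjacency[b].append(a)

print("SMILES:", smiles)
print("atoms:", atoms)
print("bonds:", bonds)
print("adjacency:", adjacency)
```

A three-dimensional representation would add coordinates for each atom on top of this graph; those are omitted here because they cannot be derived from the string alone.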

1044
00:47:09,720 --> 00:47:12,400
Trying to think of what else.

1045
00:47:12,400 --> 00:47:15,360
Yeah, I think just really big picture.

1046
00:47:15,360 --> 00:47:17,560
I think those who learn how to use these tools

1047
00:47:17,560 --> 00:47:21,000
and how to interrogate them will figure out useful ways.

1048
00:47:21,000 --> 00:47:25,600
I figure out ways to use GPT-4 in

1049
00:47:25,600 --> 00:47:28,000
my everyday life all the time.

1050
00:47:28,000 --> 00:47:31,240
We have internal NLP models at NVIDIA.

1051
00:47:31,240 --> 00:47:33,760
We don't enter proprietary data, of course, in public models.

1052
00:47:33,760 --> 00:47:36,800
But using those, playing around with those from time to time

1053
00:47:36,800 --> 00:47:38,960
to try to do things.

1054
00:47:38,960 --> 00:47:41,000
But you have to kind of experiment.

1055
00:47:41,000 --> 00:47:42,840
Like, it's not straightforward.

1056
00:47:42,840 --> 00:47:44,960
So you actually have to devote time

1057
00:47:44,960 --> 00:47:48,800
to figuring out how to do that.

1058
00:47:48,800 --> 00:47:50,440
I think, at least.

1059
00:47:50,440 --> 00:47:51,520
No, definitely.

1060
00:47:51,520 --> 00:47:55,600
And for technologists or for innovators,

1061
00:47:55,600 --> 00:47:59,360
having these tools then allows you to do so much more.

1062
00:47:59,360 --> 00:48:02,960
I mean, that's why it's so incredible.

1063
00:48:02,960 --> 00:48:04,520
I don't know.

1064
00:48:04,520 --> 00:48:07,820
The music industry is benefiting from all of this stuff.

1065
00:48:07,820 --> 00:48:10,320
I mean, The Beatles released a new song, right?

1066
00:48:10,320 --> 00:48:11,200
Like, that's the crazy.

1067
00:48:11,200 --> 00:48:12,880
That's nuts.

1068
00:48:12,880 --> 00:48:16,960
Because they were able to create a way of segmenting

1069
00:48:16,960 --> 00:48:20,880
John Lennon's voice from the TV in this old track

1070
00:48:20,880 --> 00:48:21,480
that they had.

1071
00:48:21,480 --> 00:48:26,080
And honestly, I listen to that track too often.

1072
00:48:26,080 --> 00:48:29,640
It's haunting for some reason.

1073
00:48:29,640 --> 00:48:32,520
Yeah, I don't know what the analogous thing is there

1074
00:48:32,520 --> 00:48:34,560
for science.

1075
00:48:34,560 --> 00:48:36,120
We'll have to see.

1076
00:48:36,120 --> 00:48:40,160
I mean, there are models now where

1077
00:48:40,160 --> 00:48:42,760
dynamics can be predicted.

1078
00:48:42,760 --> 00:48:45,440
So it's literally a model that does this,

1079
00:48:45,440 --> 00:48:49,520
that predicts the next frames of the simulation.

1080
00:48:49,520 --> 00:48:51,400
They've got a ways to go yet before they're

1081
00:48:51,400 --> 00:48:53,740
simulating proteins and simulating proteins

1082
00:48:53,740 --> 00:48:55,120
on the full time scale.

1083
00:48:55,120 --> 00:48:58,440
But maybe that's the analogy of what you're thinking of.

1084
00:48:58,440 --> 00:49:00,000
I'm not really sure.

1085
00:49:00,000 --> 00:49:01,000
Yeah.

1086
00:49:01,000 --> 00:49:03,000
I'm just saying how cool it is.

1087
00:49:03,000 --> 00:49:04,680
Yeah, yeah, yeah.

1088
00:49:04,680 --> 00:49:07,200
And just what becomes possible.

1089
00:49:07,200 --> 00:49:13,280
And then I just think about this common story

1090
00:49:13,280 --> 00:49:16,800
that they say, it's in a book, Prediction Machines.

1091
00:49:16,800 --> 00:49:18,520
Like when new technology comes out,

1092
00:49:18,520 --> 00:49:21,320
they talk about accountants.

1093
00:49:21,320 --> 00:49:25,240
And when Excel came out, there was this new spreadsheet tool

1094
00:49:25,240 --> 00:49:25,800
basically.

1095
00:49:25,800 --> 00:49:28,560
And then people were saying, oh, all accountants

1096
00:49:28,560 --> 00:49:29,400
are going to lose.

1097
00:49:29,400 --> 00:49:31,000
There's not going to be any accountants.

1098
00:49:31,000 --> 00:49:34,880
But all it did was it actually created so much more,

1099
00:49:34,880 --> 00:49:36,360
like a richer.

1100
00:49:36,360 --> 00:49:40,240
People could then apply more human thinking to it.

1101
00:49:40,240 --> 00:49:43,200
And they were able to do more art of finance

1102
00:49:43,200 --> 00:49:46,520
and create more sophisticated models.

1103
00:49:46,520 --> 00:49:50,640
And yeah, in some sense, it allows

1104
00:49:50,640 --> 00:49:52,280
you to do a higher level work.

1105
00:49:52,280 --> 00:49:55,080
It does sometimes make things more complex

1106
00:49:55,080 --> 00:49:56,560
and conflate things as well.

1107
00:49:56,560 --> 00:50:01,080
So it'll be interesting to see over, say, the next decade

1108
00:50:01,080 --> 00:50:04,760
what's just kind of hype and what are actually

1109
00:50:04,760 --> 00:50:08,960
tools that are enabling us to do these incredible things when

1110
00:50:08,960 --> 00:50:15,840
it comes to drug discovery, being able to try and create

1111
00:50:15,840 --> 00:50:18,680
all of these different things that can help us

1112
00:50:18,680 --> 00:50:23,440
fight different diseases, help us deal with,

1113
00:50:23,440 --> 00:50:26,480
create new medicines.

1114
00:50:26,480 --> 00:50:29,720
I mean, it's so exciting from the outside looking in.

1115
00:50:29,720 --> 00:50:31,440
It's such exciting work.

1116
00:50:31,440 --> 00:50:36,120
And knowing what thinking about what

1117
00:50:36,120 --> 00:50:40,880
the process is of trying to get a drug from idea to market

1118
00:50:40,880 --> 00:50:43,480
and how long that is, and if there's

1119
00:50:43,480 --> 00:50:49,640
anything that can be done to expedite that in a safe way,

1120
00:50:49,640 --> 00:50:54,520
is really pretty awesome.

1121
00:50:54,520 --> 00:50:58,440
It's probably the coolest thing to be working on.

1122
00:50:58,440 --> 00:50:59,840
Thanks.

1123
00:50:59,840 --> 00:51:01,960
And I mean, it's tremendously important.

1124
00:51:01,960 --> 00:51:06,960
We certainly don't wish another pandemic on the world.

1125
00:51:06,960 --> 00:51:09,720
But I think it's pretty likely it's

1126
00:51:09,720 --> 00:51:11,880
going to happen again, unfortunately.

1127
00:51:11,880 --> 00:51:14,200
So how do we be ready and think about the next one?

1128
00:51:14,200 --> 00:51:17,840
And how do we have the tools in place

1129
00:51:17,840 --> 00:51:23,560
so that we don't have to scale up vaccine manufacturing as much

1130
00:51:23,560 --> 00:51:25,480
as we did this time, right?

1131
00:51:25,480 --> 00:51:28,680
So yeah.

1132
00:51:28,680 --> 00:51:29,920
Yeah.

1133
00:51:29,920 --> 00:51:32,360
Yeah, obviously, we don't want anything like that to happen.

1134
00:51:32,360 --> 00:51:34,280
But it's sort of inevitable that there will

1135
00:51:34,280 --> 00:51:35,880
be something along that level.

1136
00:51:35,880 --> 00:51:43,120
But knowing that there are teams and companies and research

1137
00:51:43,120 --> 00:51:47,800
institutes that have the tools that can enable them to quickly

1138
00:51:47,800 --> 00:51:52,560
combat that sort of stuff is a little reassuring.

1139
00:51:52,560 --> 00:51:54,840
There's still the human element.

1140
00:51:54,840 --> 00:51:57,480
People have to actually do.

1141
00:51:57,480 --> 00:51:58,560
I know.

1142
00:51:58,560 --> 00:52:01,640
But at least you can just give people the tools

1143
00:52:01,640 --> 00:52:07,240
to do the best they can.

1144
00:52:07,240 --> 00:52:10,920
So switching gears into the learning from machine learning

1145
00:52:10,920 --> 00:52:14,960
aspect, we'll get into some advice questions,

1146
00:52:14,960 --> 00:52:18,040
everyone's favorite type of question.

1147
00:52:18,040 --> 00:52:22,040
For people who are just starting out in the field thinking,

1148
00:52:22,040 --> 00:52:23,840
hey, I want to be a data scientist,

1149
00:52:23,840 --> 00:52:27,040
or that they're doing some biochem stuff,

1150
00:52:27,040 --> 00:52:31,320
what's advice that you would give to people that are just

1151
00:52:31,320 --> 00:52:33,280
starting out?

1152
00:52:33,280 --> 00:52:36,400
I think this one's probably always data, data, data.

1153
00:52:36,400 --> 00:52:39,920
It's always about the data.

1154
00:52:39,920 --> 00:52:42,600
I think the cool thing with machine learning,

1155
00:52:42,600 --> 00:52:43,760
it's a very open community.

1156
00:52:43,760 --> 00:52:47,800
There are folks who participate in Kaggle competitions

1157
00:52:47,800 --> 00:52:50,760
or something if you want to get to know a particular domain

1158
00:52:50,760 --> 00:52:51,280
better.

1159
00:52:51,280 --> 00:52:54,840
There are active discussions there on those competitions.

1160
00:52:54,840 --> 00:52:58,720
You can learn more about the field that way.

1161
00:52:58,720 --> 00:53:03,160
That's something that's relatively accessible to all people.

1162
00:53:03,160 --> 00:53:05,920
Yeah, I think that's the biggest thing is data.

1163
00:53:05,920 --> 00:53:08,880
I think in this field, you need to figure out

1164
00:53:08,880 --> 00:53:12,320
what works for you as a way to continue to learn.

1165
00:53:12,320 --> 00:53:15,120
And that isn't just reading papers,

1166
00:53:15,120 --> 00:53:19,640
or maybe it's testing a few new tools as they come out,

1167
00:53:19,640 --> 00:53:22,560
these new visualization and interpretation tools

1168
00:53:22,560 --> 00:53:24,800
as you were alluding to.

1169
00:53:24,800 --> 00:53:25,920
Maybe it's that.

1170
00:53:25,920 --> 00:53:28,440
Maybe it's continuing to refine your software development

1171
00:53:28,440 --> 00:53:29,800
skills.

1172
00:53:29,800 --> 00:53:31,000
Maybe it's reading papers.

1173
00:53:31,000 --> 00:53:32,720
It depends what you're doing.

1174
00:53:32,720 --> 00:53:36,360
But figuring out a way to do that is important.

1175
00:53:36,360 --> 00:53:38,040
I think it's harder as you get older.

1176
00:53:38,040 --> 00:53:40,560
I mean, I was completely self-taught with machine

1177
00:53:40,560 --> 00:53:42,520
learning.

1178
00:53:42,520 --> 00:53:45,240
But I used to get up at 5 AM on Saturday

1179
00:53:45,240 --> 00:53:48,800
and do all my machine learning courses and my homework

1180
00:53:48,800 --> 00:53:49,360
for the week.

1181
00:53:49,360 --> 00:53:51,280
And man, I can't even.

1182
00:53:51,280 --> 00:53:52,440
It was hard.

1183
00:53:52,440 --> 00:53:53,320
Yeah.

1184
00:53:53,320 --> 00:53:55,640
Yeah.

1185
00:53:55,640 --> 00:53:57,360
And Andrew Ng's class.

1186
00:53:57,360 --> 00:53:57,880
Yeah.

1187
00:53:57,880 --> 00:53:58,400
Oh, yeah.

1188
00:53:58,400 --> 00:53:59,720
That stuff, Geoff Hinton's class.

1189
00:53:59,720 --> 00:54:00,800
Yeah.

1190
00:54:00,800 --> 00:54:04,200
That was how I did it.

1191
00:54:04,200 --> 00:54:06,880
And then, yeah, certainly I've kept learning since then.

1192
00:54:06,880 --> 00:54:09,600
The field has changed profoundly since then even.

1193
00:54:09,600 --> 00:54:10,320
Yeah.

1194
00:54:10,320 --> 00:54:12,960
That's a trait that I find, I guess,

1195
00:54:12,960 --> 00:54:16,680
in data scientists and software engineers,

1196
00:54:16,680 --> 00:54:20,640
the really good ones, it's just they're capable.

1197
00:54:20,640 --> 00:54:23,720
But then it's just this idea of just continuous learning.

1198
00:54:23,720 --> 00:54:28,080
I'm going to continue to learn the newest things out there.

1199
00:54:28,080 --> 00:54:29,200
But then they have-

1200
00:54:29,200 --> 00:54:32,440
Tools, VS Code, whatever it is.

1201
00:54:32,440 --> 00:54:34,440
Master the tools.

1202
00:54:34,440 --> 00:54:36,200
But it's not mastering all your tools.

1203
00:54:36,200 --> 00:54:38,040
It's mastering the tools that matter.

1204
00:54:38,040 --> 00:54:39,880
And that's the hard part.

1205
00:54:39,880 --> 00:54:43,640
And it's the ability to master a tool.

1206
00:54:43,640 --> 00:54:47,400
And I think that's like school.

1207
00:54:47,400 --> 00:54:49,280
It's not like everything that you learn in school

1208
00:54:49,280 --> 00:54:50,880
you're actually applying in your job.

1209
00:54:50,880 --> 00:54:54,600
But for someone in a position like you,

1210
00:54:54,600 --> 00:55:00,480
you can consume research papers probably better than 99.9%

1211
00:55:00,480 --> 00:55:02,560
of the world at this point.

1212
00:55:02,560 --> 00:55:05,640
And being able to take the pieces of it

1213
00:55:05,640 --> 00:55:08,480
that are applicable for you.

1214
00:55:08,480 --> 00:55:14,400
But yeah, in this field, it's important to understand,

1215
00:55:14,400 --> 00:55:17,560
yeah, which tools are important and how quickly can I

1216
00:55:17,560 --> 00:55:19,720
get onboarded onto that tool.

1217
00:55:19,720 --> 00:55:21,000
Right, right.

1218
00:55:21,000 --> 00:55:24,120
Yeah, a little variation of the last one.

1219
00:55:24,120 --> 00:55:27,160
What advice would you give yourself, I guess,

1220
00:55:27,160 --> 00:55:29,040
earlier in your career?

1221
00:55:29,040 --> 00:55:32,440
Yeah, I was certainly when I was doing this transition,

1222
00:55:32,440 --> 00:55:34,600
I was very intimidated.

1223
00:55:34,600 --> 00:55:40,440
It's a big thing to sort of make a switch from something

1224
00:55:40,440 --> 00:55:44,360
I'd spent at that time, I don't know, 11, 12 years of my life

1225
00:55:44,360 --> 00:55:45,240
studying this.

1226
00:55:45,240 --> 00:55:47,600
And certainly, I still get to work

1227
00:55:47,600 --> 00:55:48,840
at the intersection of science.

1228
00:55:48,840 --> 00:55:51,480
But it's a very different type of job.

1229
00:55:51,480 --> 00:55:52,480
I love it.

1230
00:55:52,480 --> 00:55:57,280
But it was very intimidating and stressful at the time

1231
00:55:57,280 --> 00:55:58,960
to make that jump.

1232
00:55:58,960 --> 00:56:01,520
Right, right.

1233
00:56:01,520 --> 00:56:03,280
So what would it be?

1234
00:56:03,280 --> 00:56:06,440
Just you're going to make it through?

1235
00:56:06,440 --> 00:56:07,440
It's going to be OK?

1236
00:56:07,440 --> 00:56:08,920
Or don't be intimidated?

1237
00:56:08,920 --> 00:56:09,920
Don't be intimidated.

1238
00:56:09,920 --> 00:56:14,640
And I think keep your ear to the ground about things like this.

1239
00:56:14,640 --> 00:56:19,200
Don't be so focused on your, it's

1240
00:56:19,200 --> 00:56:21,440
important to focus on your domain, your world

1241
00:56:21,440 --> 00:56:26,120
as a structural biologist, as an NMR spectroscopist as I was.

1242
00:56:26,120 --> 00:56:29,560
But it's also important to be aware of other trends

1243
00:56:29,560 --> 00:56:31,800
and think about it.

1244
00:56:31,800 --> 00:56:35,280
Because I think the people that maybe are earlier on

1245
00:56:35,280 --> 00:56:38,320
in the field figured out that, oh, yeah, this is a thing.

1246
00:56:38,320 --> 00:56:40,000
I should go learn about this.

1247
00:56:40,000 --> 00:56:42,200
There's this whole revolution going on outside

1248
00:56:42,200 --> 00:56:44,440
of my purview of my day to day.

1249
00:56:44,440 --> 00:56:46,720
Maybe I should at least try to understand it.

1250
00:56:46,720 --> 00:56:48,400
Right.

1251
00:56:48,400 --> 00:56:53,120
Yeah, keep a pulse on the progress that's

1252
00:56:53,120 --> 00:56:55,760
taking place in other fields.

1253
00:56:55,760 --> 00:56:56,440
Yeah.

1254
00:56:56,440 --> 00:56:58,280
That's something that I learn.

1255
00:56:58,280 --> 00:57:00,520
Well, I mean, I guess I've always sort of tried to do it.

1256
00:57:00,520 --> 00:57:02,200
Just my interests have been so varied

1257
00:57:02,200 --> 00:57:03,440
that I've been able to do it.

1258
00:57:03,440 --> 00:57:09,800
But recently, I listened to Jeremy Howard from fast.ai,

1259
00:57:09,800 --> 00:57:14,400
just really hammering that point home.

1260
00:57:14,400 --> 00:57:16,520
If you're interested in natural language processing,

1261
00:57:16,520 --> 00:57:19,280
it's OK to learn something about computer vision.

1262
00:57:19,280 --> 00:57:20,760
If you're interested in computer vision,

1263
00:57:20,760 --> 00:57:22,440
you can do activity recognition.

1264
00:57:22,440 --> 00:57:25,320
Whatever it is, there'll be some thing

1265
00:57:25,320 --> 00:57:28,040
that you learn in signal processing that will help you.

1266
00:57:28,040 --> 00:57:30,440
Because when it comes down to it,

1267
00:57:30,440 --> 00:57:33,640
it's representing things numerically.

1268
00:57:33,640 --> 00:57:35,760
And it's doing manipulations to it.

1269
00:57:35,760 --> 00:57:37,400
It's pattern matching.

1270
00:57:37,400 --> 00:57:38,280
Right.

1271
00:57:38,280 --> 00:57:38,800
Yep.

1272
00:57:38,800 --> 00:57:40,840
So there's always a way.

1273
00:57:40,840 --> 00:57:41,600
Oh, yeah, sorry.

1274
00:57:41,600 --> 00:57:44,360
That's why I said that I think there

1275
00:57:44,360 --> 00:57:48,360
is a lot of value in having domain expertise,

1276
00:57:48,360 --> 00:57:50,960
and knowing and understanding something deeply what works

1277
00:57:50,960 --> 00:57:52,280
and what doesn't work.

1278
00:57:52,280 --> 00:57:54,920
And I don't actually believe that it prevents you

1279
00:57:54,920 --> 00:57:57,720
from switching domains later on if you want.

1280
00:57:57,720 --> 00:58:00,480
I think it is an asset because you

1281
00:58:00,480 --> 00:58:03,600
have dealt with some very fundamental problems

1282
00:58:03,600 --> 00:58:05,640
with data and modeling.

1283
00:58:05,640 --> 00:58:10,280
And you will see them apply, many of them in different ways,

1284
00:58:10,280 --> 00:58:12,040
perhaps, in that other domain.

1285
00:58:12,040 --> 00:58:15,440
But I think it teaches you what it takes to really interrogate

1286
00:58:15,440 --> 00:58:18,400
the data and build a good model.

1287
00:58:18,400 --> 00:58:20,960
Absolutely.

1288
00:58:20,960 --> 00:58:24,240
Something that's interesting, I know

1289
00:58:24,240 --> 00:58:30,120
that you work in the field with so many things going on.

1290
00:58:30,120 --> 00:58:33,960
But who are some people in the field that influence

1291
00:58:33,960 --> 00:58:37,320
you and your work?

1292
00:58:37,320 --> 00:58:40,280
I would say certainly my colleagues at NVIDIA.

1293
00:58:40,280 --> 00:58:47,160
They are tremendously talented and make me rethink and think

1294
00:58:47,160 --> 00:58:49,800
hard about problems every day and sometimes

1295
00:58:49,800 --> 00:58:50,760
help me when I'm stuck.

1296
00:58:50,760 --> 00:58:52,480
Sometimes I help them when they're stuck.

1297
00:58:52,480 --> 00:58:55,800
I feel very, very fortunate to work

1298
00:58:55,800 --> 00:58:58,280
at such an incredible place.

1299
00:58:58,280 --> 00:59:00,880
I would say my postdoctoral advisor, Art Palmer,

1300
00:59:00,880 --> 00:59:05,280
and my grad school advisors, Patrick Loria and Scott Strobel,

1301
00:59:05,280 --> 00:59:08,520
they taught me to think very carefully and critically

1302
00:59:08,520 --> 00:59:12,840
about what I do and how to think about problems where they're

1303
00:59:12,840 --> 00:59:16,240
empirically, the problems that are very empirical.

1304
00:59:16,240 --> 00:59:18,120
And it actually has a lot of applications

1305
00:59:18,120 --> 00:59:20,400
in machine learning because it's a very empirical field

1306
00:59:20,400 --> 00:59:22,360
in many ways.

1307
00:59:22,360 --> 00:59:25,000
Certainly in the domain that I'm specifically in,

1308
00:59:25,000 --> 00:59:27,320
I think Alex Rives from EvolutionaryScale,

1309
00:59:27,320 --> 00:59:32,560
they've built some of the best, very, very well-known protein

1310
00:59:32,560 --> 00:59:34,920
representation models out there.

1311
00:59:34,920 --> 00:59:36,600
Debora Marks, who's a professor at Harvard,

1312
00:59:36,600 --> 00:59:40,080
has done amazing things in machine learning

1313
00:59:40,080 --> 00:59:43,840
and development of some very fundamental techniques

1314
00:59:43,840 --> 00:59:49,040
with protein sequences, and she also does experiments

1315
00:59:49,040 --> 00:59:50,080
to back up the work.

1316
00:59:50,080 --> 00:59:54,400
So I think anyone who does both wet lab experiments

1317
00:59:54,400 --> 00:59:56,960
and machine learning, I think, gets a special place

1318
00:59:56,960 --> 00:59:59,280
of recognition from me, at least,

1319
00:59:59,280 --> 01:00:01,080
because I believe that's very hard to do.

1320
01:00:01,080 --> 01:00:04,240
And it really, really informs the work you do.

1321
01:00:04,240 --> 01:00:06,160
Yeah, absolutely.

1322
01:00:06,160 --> 01:00:08,640
Yeah, that makes a lot of sense.

1323
01:00:08,640 --> 01:00:11,840
And being on Learning from Machine Learning,

1324
01:00:11,840 --> 01:00:14,360
I get to ask this question.

1325
01:00:14,360 --> 01:00:19,080
What has a career in machine learning taught you about life?

1326
01:00:19,080 --> 01:00:21,760
Yeah, I would say, so the importance of continuing,

1327
01:00:21,760 --> 01:00:25,720
of continually learning, whatever it means in your life,

1328
01:00:25,720 --> 01:00:30,160
whether it's, I learned how to snow ski two years ago.

1329
01:00:30,160 --> 01:00:34,760
And I wish I'd learned when I was young and made of rubber.

1330
01:00:34,760 --> 01:00:36,760
I'm not afraid to bite it, but I'm

1331
01:00:36,760 --> 01:00:40,040
a lot more afraid of tearing an ACL as an adult.

1332
01:00:40,040 --> 01:00:40,880
But it's fun.

1333
01:00:40,880 --> 01:00:44,080
And so this Christmas, we're going skiing.

1334
01:00:44,080 --> 01:00:46,600
And I'm excited to eventually get to the point

1335
01:00:46,600 --> 01:00:51,160
where I'm able to ski on blues pretty regularly.

1336
01:00:51,160 --> 01:00:52,880
My husband is an excellent skier.

1337
01:00:52,880 --> 01:00:55,360
So at some point, I want to get there.

1338
01:00:55,360 --> 01:00:56,760
That's my goal.

1339
01:00:56,760 --> 01:00:58,200
Nice.

1340
01:00:58,200 --> 01:01:01,960
I think how to debug complex empirical problems

1341
01:01:01,960 --> 01:01:04,440
and how to think rationally about, like, how do I try this?

1342
01:01:04,440 --> 01:01:06,920
And how do I debug this quickly?

1343
01:01:06,920 --> 01:01:11,800
Because it's not always as easy as you might think to solve it.

1344
01:01:11,800 --> 01:01:15,160
And then I think, I never really thought about classification

1345
01:01:15,160 --> 01:01:17,320
problems that much as a scientist.

1346
01:01:17,320 --> 01:01:20,280
So it really makes you think about,

1347
01:01:20,280 --> 01:01:23,320
there's many ways of being wrong.

1348
01:01:23,320 --> 01:01:25,120
And sometimes, some of those ways

1349
01:01:25,120 --> 01:01:27,880
matter a lot more than others.

1350
01:01:27,880 --> 01:01:30,960
So really understand what being wrong means

1351
01:01:30,960 --> 01:01:32,880
and what you need to optimize for.

1352
01:01:32,880 --> 01:01:35,120
It's not always accuracy, for example.

1353
01:01:35,120 --> 01:01:37,800
Sometimes false positives and false negatives,

1354
01:01:37,800 --> 01:01:40,600
they don't always carry the same weight.

1355
01:01:40,600 --> 01:01:42,080
Yeah, you beat me to it.

1356
01:01:42,080 --> 01:01:43,440
I was going to say, everyone wants

1357
01:01:43,440 --> 01:01:46,480
to know the accuracy of the model.

1358
01:01:46,480 --> 01:01:51,040
And I try to say, all errors are not equal.

1359
01:01:51,040 --> 01:01:52,760
They still want to just know the accuracy.

1360
01:01:52,760 --> 01:01:55,160
They don't care.

1361
01:01:55,160 --> 01:01:56,920
I try to explain it to them.

1362
01:01:56,920 --> 01:01:58,760
But then if you bring out a confusion matrix,

1363
01:01:58,760 --> 01:02:00,520
it gets a little too confusing.

1364
01:02:00,520 --> 01:02:03,920
But no, the important takeaway is just that

1365
01:02:03,920 --> 01:02:05,600
all errors aren't the same.

1366
01:02:05,600 --> 01:02:10,240
And some matter much more than others.

1367
01:02:10,240 --> 01:02:13,680
That's a really good takeaway.

1368
01:02:13,680 --> 01:02:16,320
And then just about the debugging complex empirical

1369
01:02:16,320 --> 01:02:16,680
problems.

1370
01:02:16,680 --> 01:02:17,480
So what is it?

1371
01:02:17,480 --> 01:02:18,400
What makes it so hard?

1372
01:02:18,400 --> 01:02:20,120
Just all the moving parts?

1373
01:02:20,120 --> 01:02:21,640
Sometimes it could take a long time.

1374
01:02:21,640 --> 01:02:25,120
If you have a problem, for example, with,

1375
01:02:25,120 --> 01:02:27,160
I'm just going to pick something, training

1376
01:02:27,160 --> 01:02:28,640
where it starts to throw NaNs.

1377
01:02:28,640 --> 01:02:31,160
But it doesn't happen until pretty far into it.

1378
01:02:31,160 --> 01:02:33,680
How can you figure out what the problem is

1379
01:02:33,680 --> 01:02:39,200
while minimizing the time it takes to solve the problem?

1380
01:02:39,200 --> 01:02:40,680
Are there things you can scale down?

1381
01:02:40,680 --> 01:02:44,320
Can you figure out if a particular data point is the cause?

1382
01:02:44,320 --> 01:02:46,720
Can you figure out if, like just learning

1383
01:02:46,720 --> 01:02:51,680
how to debug that in an efficient way is,

1384
01:02:51,680 --> 01:02:52,880
it can be really tricky.

1385
01:02:52,880 --> 01:02:53,880
I don't know.

1386
01:02:53,880 --> 01:02:55,640
I certainly think it's hard.

1387
01:02:55,640 --> 01:02:56,720
No, absolutely.

1388
01:02:56,720 --> 01:02:57,240
Absolutely.

1389
01:02:57,240 --> 01:03:00,880
And being able to approach those problems

1390
01:03:00,880 --> 01:03:07,040
in a cool, calm, and collected way, you will get, well,

1391
01:03:07,040 --> 01:03:09,800
most of the time, you'll be able to figure it out.

1392
01:03:09,800 --> 01:03:12,560
Yeah.

1393
01:03:12,560 --> 01:03:14,360
Sometimes I get CUDA init errors,

1394
01:03:14,360 --> 01:03:15,440
and then I don't like those.

1395
01:03:15,440 --> 01:03:15,920
Oh, no.

1396
01:03:15,920 --> 01:03:16,680
Because I'm like, oh, boy.

1397
01:03:16,680 --> 01:03:17,640
Oh, boy.

1398
01:03:17,640 --> 01:03:20,920
Yeah, well, then you have to go ask someone.

1399
01:03:20,920 --> 01:03:23,960
This is going to be a while.

1400
01:03:23,960 --> 01:03:24,440
Yeah.

1401
01:03:24,440 --> 01:03:25,480
CUDA alignment errors.

1402
01:03:25,480 --> 01:03:26,560
Those are not my friends.

1403
01:03:26,560 --> 01:03:28,160
No.

1404
01:03:28,160 --> 01:03:29,880
Or those out of memory errors.

1405
01:03:29,880 --> 01:03:30,880
And it's like, now?

1406
01:03:30,880 --> 01:03:32,800
This is when you're going to throw me that one?

1407
01:03:32,800 --> 01:03:33,320
OK, fine.

1408
01:03:33,320 --> 01:03:35,000
Yeah.

1409
01:03:35,000 --> 01:03:36,480
Yes.

1410
01:03:36,480 --> 01:03:39,480
Wow, Michelle, it's been such a pleasure.

1411
01:03:39,480 --> 01:03:43,520
This was so cool to dive into this area.

1412
01:03:43,520 --> 01:03:43,920
Thanks.

1413
01:03:43,920 --> 01:03:45,040
It was my pleasure, too.

1414
01:03:45,040 --> 01:03:46,200
Yeah.

1415
01:03:46,200 --> 01:03:49,040
Understanding how people even begin

1416
01:03:49,040 --> 01:03:55,240
to approach how to do drug discovery, leveraging AI,

1417
01:03:55,240 --> 01:03:57,920
it's such a fascinating field.

1418
01:03:57,920 --> 01:03:59,960
I think I'm going to use this as motivation

1419
01:03:59,960 --> 01:04:03,280
to continue to learn more about this.

1420
01:04:03,280 --> 01:04:06,960
For people that want to learn more about you and your work,

1421
01:04:06,960 --> 01:04:08,840
where would be a good place to go?

1422
01:04:08,840 --> 01:04:11,320
Probably my website.

1423
01:04:11,320 --> 01:04:13,720
My papers, talks end up there.

1424
01:04:13,720 --> 01:04:15,160
I do need to update it.

1425
01:04:15,160 --> 01:04:16,520
That's a holiday project.

1426
01:04:16,520 --> 01:04:18,200
This is to go add a few more things,

1427
01:04:18,200 --> 01:04:20,880
including my PyData talk.

1428
01:04:20,880 --> 01:04:23,320
But that's probably the best place.

1429
01:04:23,320 --> 01:04:25,920
I have Twitter and Mastodon.

1430
01:04:25,920 --> 01:04:28,560
Well, I guess X these days.

1431
01:04:28,560 --> 01:04:30,680
I don't tweet as much these days.

1432
01:04:30,680 --> 01:04:31,720
Maybe that will change.

1433
01:04:31,720 --> 01:04:33,560
I'm @modernscientist on Twitter.

1434
01:04:33,560 --> 01:04:36,200
But whatever accounts I'm using at the moment

1435
01:04:36,200 --> 01:04:38,400
will be linked to from my web page.

1436
01:04:38,400 --> 01:04:40,600
So it's michellelynngill.com.

1437
01:04:40,600 --> 01:04:41,480
Cool.

1438
01:04:41,480 --> 01:04:44,440
And I can add some of those to the show notes as well.

1439
01:04:44,440 --> 01:04:47,720
And I encourage listeners to definitely check out

1440
01:04:47,720 --> 01:04:50,600
your most recent keynote at PyData NYC.

1441
01:04:50,600 --> 01:04:53,040
I'll add that as well.

1442
01:04:53,040 --> 01:04:55,000
What an amazing talk.

1443
01:04:55,000 --> 01:04:56,440
Michelle, thank you so much.

1444
01:04:56,440 --> 01:04:58,320
I really appreciate you giving me the time

1445
01:04:58,320 --> 01:05:00,200
and letting me pick your brain for a bit.

1446
01:05:00,200 --> 01:05:01,480
Yeah, it was all my pleasure.

1447
01:05:01,480 --> 01:05:02,200
Thank you.

1448
01:05:02,200 --> 01:05:03,400
Thanks for the invitation.

1449
01:05:03,400 --> 01:05:03,880
Thank you.

1450
01:05:03,880 --> 01:05:04,360
Thank you.

1451
01:05:04,360 --> 01:05:13,560
On this episode of Learning from Machine Learning,

1452
01:05:13,560 --> 01:05:16,440
Dr. Michelle Gill shared her incredible journey

1453
01:05:16,440 --> 01:05:19,520
from wet lab biochemist to driving cutting-edge

1454
01:05:19,520 --> 01:05:21,400
AI at NVIDIA.

1455
01:05:21,400 --> 01:05:23,840
Her work helps address one of health care's biggest

1456
01:05:23,840 --> 01:05:26,400
challenges by enabling researchers

1457
01:05:26,400 --> 01:05:30,320
to do drug discovery, both faster and better.

1458
01:05:30,320 --> 01:05:32,920
Michelle discussed the critical need for better machine

1459
01:05:32,920 --> 01:05:36,360
learning representations for biological structures

1460
01:05:36,360 --> 01:05:40,520
and her insights on leading a multidisciplinary team.

1461
01:05:40,520 --> 01:05:41,840
Thank you for listening.

1462
01:05:41,840 --> 01:05:44,280
And remember to subscribe and share

1463
01:05:44,280 --> 01:05:46,160
with your friends and colleagues.

1464
01:05:46,160 --> 01:05:51,680
Until next time, keep on learning.

