1
00:00:00,000 --> 00:00:09,160
How did the best machine learning practitioners get involved in the field?

2
00:00:09,160 --> 00:00:11,600
What challenges have they faced?

3
00:00:11,600 --> 00:00:13,840
What has helped them flourish?

4
00:00:13,840 --> 00:00:15,720
Let's ask them.

5
00:00:15,720 --> 00:00:17,840
Welcome to Learning from Machine Learning.

6
00:00:17,840 --> 00:00:20,160
I'm your host, Seth Levine.

7
00:00:20,160 --> 00:00:21,380
Welcome.

8
00:00:21,380 --> 00:00:27,080
I have the honor of having Nils Reimers on the show today, the director of machine learning

9
00:00:27,080 --> 00:00:34,120
at Cohere, former researcher at Hugging Face, the creator of Sentence Transformers,

10
00:00:34,120 --> 00:00:37,920
and researcher on dozens of papers in NLP.

11
00:00:37,920 --> 00:00:38,920
Welcome to the show.

12
00:00:38,920 --> 00:00:39,920
Great.

13
00:00:39,920 --> 00:00:42,760
Yeah, really happy to be here.

14
00:00:42,760 --> 00:00:48,080
Why don't we start off, just give us some background on your career journey.

15
00:00:48,080 --> 00:00:50,080
How'd you get to where you are today?

16
00:00:50,080 --> 00:00:55,440
Yeah, so it actually started, I would say, a long time ago.

17
00:00:55,440 --> 00:01:00,040
So I trained my first neural network in the early 2000s.

18
00:01:00,040 --> 00:01:05,780
So at that time, I was playing with this Lego, I don't know, these Lego robots you can

19
00:01:05,780 --> 00:01:06,780
build.

20
00:01:06,780 --> 00:01:12,440
And so I thought, okay, maybe it's cool to add some artificial intelligence to the robot

21
00:01:12,440 --> 00:01:15,920
and control the robot with a neural network.

22
00:01:15,920 --> 00:01:22,160
And yeah, this was kind of a toy example in the early 2000s before, I don't know, AI was

23
00:01:22,160 --> 00:01:24,680
in the media or in the hype.

24
00:01:24,680 --> 00:01:31,880
And another key point was like in 2009, I had a great lecture at Berkeley, Introduction

25
00:01:31,880 --> 00:01:40,200
to AI, with a super awesome professor who was introducing the concepts with, like, Pac-Man.

26
00:01:40,200 --> 00:01:45,760
So every concept he presented, he started with, like, some A-star algorithm to find the

27
00:01:45,760 --> 00:01:50,720
shortest distance between two points, which is used, like, for a route planner.

28
00:01:50,720 --> 00:01:54,440
And then he showed how you can do it with, like, Pac-Man, or reinforcement learning,

29
00:01:54,440 --> 00:02:00,000
how can you do it, how can your Pac-Man become smarter, how can your ghosts become smarter.

30
00:02:00,000 --> 00:02:04,680
And it had a lot of challenges, in a playful way, it was like super amazing to do this.

31
00:02:04,680 --> 00:02:10,320
But then yeah, sadly, I did not go directly into machine learning and AI.

32
00:02:10,320 --> 00:02:16,960
I first did like a detour in information security, so did a master's degree in information security.

33
00:02:16,960 --> 00:02:21,600
But then after the master's degree, I said, okay, I want to go back to artificial intelligence,

34
00:02:21,600 --> 00:02:26,200
machine learning, and started to do my PhD in that field.

35
00:02:26,200 --> 00:02:28,160
Very cool.

36
00:02:28,160 --> 00:02:34,180
So what was it that you initially figured out like that machine learning is something

37
00:02:34,180 --> 00:02:36,240
different and it's something powerful?

38
00:02:36,240 --> 00:02:40,440
I think it was super fun in the beginning.

39
00:02:40,440 --> 00:02:45,840
So it was like super fun to watch your Pac-Man and try to build it smarter.

40
00:02:45,840 --> 00:02:47,920
And it was like really stupid in the beginning.

41
00:02:47,920 --> 00:02:53,360
So you see, okay, your Pac-Man is hunted by the ghost, and then it takes the wrong turn.

42
00:02:53,360 --> 00:02:57,480
So it goes to the left, like into a dead end, into the trap on the left, instead

43
00:02:57,480 --> 00:03:00,960
of going to victory and taking the right turn.

44
00:03:00,960 --> 00:03:04,040
You yelled at the machine and said, why did you take a left turn?

45
00:03:04,040 --> 00:03:05,960
Why didn't you go to the right?

46
00:03:05,960 --> 00:03:06,960
Why are you so stupid?

47
00:03:06,960 --> 00:03:10,400
And then why is it so hard to make you smart?

48
00:03:10,400 --> 00:03:16,280
And so it was like extremely playful and triggered your like ambitions.

49
00:03:16,280 --> 00:03:19,040
So you want to say, okay, I want to make it better and better and better.

50
00:03:19,040 --> 00:03:25,200
With every iteration, it got better and it was like fun to create.

51
00:03:25,200 --> 00:03:31,280
And then after the master's degree, I said, okay, these are super powerful things you can build.

52
00:03:31,280 --> 00:03:37,080
And at that time I said, okay, language has a lot of potential because so

53
00:03:37,080 --> 00:03:39,680
much information is stored in language.

54
00:03:39,680 --> 00:03:45,000
So whether it's spoken language or written language like text, if a machine can understand

55
00:03:45,000 --> 00:03:51,680
text, can understand what's written in Wikipedia and all the books, all the news articles everywhere

56
00:03:51,680 --> 00:03:56,400
and everything, if a machine can really understand this, this will give you like an

57
00:03:56,400 --> 00:03:58,040
extremely powerful machine.

58
00:03:58,040 --> 00:04:02,760
So Pac-Man was nice and fun, but at the end, a smart Pac-Man can just win the game.

59
00:04:02,760 --> 00:04:08,600
But if you have a nice and fun Pac-Man that can read text and can produce text, that's

60
00:04:08,600 --> 00:04:10,280
super amazing and super powerful.

61
00:04:10,280 --> 00:04:13,360
And that's what got me interested in 2013.

62
00:04:13,360 --> 00:04:14,360
Awesome.

63
00:04:14,360 --> 00:04:19,080
So what was one of your first major projects in natural language processing?

64
00:04:19,080 --> 00:04:20,080
Yes.

65
00:04:20,080 --> 00:04:24,960
So at that time, NLP was a completely different field.

66
00:04:24,960 --> 00:04:32,960
So my professor, Iryna Gurevych, where I started my PhD, she heard a lot of buzz from the computer

67
00:04:32,960 --> 00:04:35,520
vision domain about neural networks.

68
00:04:35,520 --> 00:04:43,280
So neural networks had a lot of hype really a long time ago, like in the 50s and 60s and

69
00:04:43,280 --> 00:04:44,280
70s.

70
00:04:44,280 --> 00:04:47,960
And half a century ago, people were extremely excited about neural networks, but it didn't

71
00:04:47,960 --> 00:04:48,960
work out.

72
00:04:48,960 --> 00:04:52,760
And people said, no, neural networks, that's not working.

73
00:04:52,760 --> 00:04:55,880
That's like bad technology and you shouldn't use it.

74
00:04:55,880 --> 00:05:01,840
And I don't know, I know some colleagues who talked to Hinton in the early 2000s and he

75
00:05:01,840 --> 00:05:05,520
started to talk about neural networks again.

76
00:05:05,520 --> 00:05:08,440
And they were like, oh, please stop about this boring topic.

77
00:05:08,440 --> 00:05:10,680
I don't want to learn anything about neural networks.

78
00:05:10,680 --> 00:05:11,680
It's bad technology.

79
00:05:11,680 --> 00:05:12,680
Why are you still doing that?

80
00:05:12,680 --> 00:05:20,040
And then, in 2012, there was like the ImageNet moment, where a neural network from Hinton was

81
00:05:20,040 --> 00:05:28,120
so much better on ImageNet at recognizing images than any system seen before

82
00:05:28,120 --> 00:05:33,080
that computer vision was extremely excited about neural networks again.

83
00:05:33,080 --> 00:05:38,280
And then my professor asked me in 2013, hey, are neural networks something that might be

84
00:05:38,280 --> 00:05:39,920
relevant for NLP?

85
00:05:39,920 --> 00:05:45,120
So in NLP at that time, we were still using Naive Bayes, support vector machines, bag

86
00:05:45,120 --> 00:05:50,000
of words, TF-IDF, these types of features and stuff.

87
00:05:50,000 --> 00:05:57,120
And yeah, my first task, or the whole task for my PhD, was to figure out: will neural networks

88
00:05:57,120 --> 00:06:02,160
have an impact on NLP and what type of impact will they have?

89
00:06:02,160 --> 00:06:05,160
And the first project was named entity recognition.

90
00:06:05,160 --> 00:06:12,280
In the text, can I identify named entities, like company names, person names

91
00:06:12,280 --> 00:06:13,280
and this?

92
00:06:13,280 --> 00:06:18,840
And yeah, but more generally, it was like neural networks, how will they change NLP

93
00:06:18,840 --> 00:06:22,400
and how we do machine learning?

94
00:06:22,400 --> 00:06:24,560
Awesome.

95
00:06:24,560 --> 00:06:31,800
Having the creator of Sentence Transformers here, I would love to dive into it with you.

96
00:06:31,800 --> 00:06:34,980
Can you explain what is Sentence Transformers?

97
00:06:34,980 --> 00:06:36,880
What was its original goal?

98
00:06:36,880 --> 00:06:38,880
And why is it so powerful?

99
00:06:38,880 --> 00:06:49,520
Yes, so Sentence Transformers is an open source library which maps text into a vector space.

100
00:06:49,520 --> 00:06:50,960
Sounds a bit abstract.

101
00:06:50,960 --> 00:06:55,680
So for us humans, when we read text, we can make sense of the text.

102
00:06:55,680 --> 00:06:59,720
But for computers, it's like really, really hard, if they just look at the text, just

103
00:06:59,720 --> 00:07:01,440
read the text, to really make sense of it.

104
00:07:01,440 --> 00:07:04,760
So they don't understand the words, how the words are connected.

105
00:07:04,760 --> 00:07:12,000
So for example, the words hotel and motel, it's like these things are really similar.

106
00:07:12,000 --> 00:07:16,640
But from a computer perspective, these are two different tokens, two different words.

107
00:07:16,640 --> 00:07:19,920
And the computer doesn't know like how are these two words connected?

108
00:07:19,920 --> 00:07:20,920
What are concepts?

109
00:07:20,920 --> 00:07:24,220
Are they similar, dissimilar and so on.

110
00:07:24,220 --> 00:07:31,800
And so what we do is like, take the text written by humans and transform it to a representation

111
00:07:31,800 --> 00:07:36,960
that's understandable to a computer, and here vector spaces are extremely powerful.

112
00:07:36,960 --> 00:07:44,620
And then in these vector spaces, we can encode relationships between words.

113
00:07:44,620 --> 00:07:50,280
So like hotel and motel, yeah, both are, it's a place, a physical place, which has rooms

114
00:07:50,280 --> 00:07:55,520
where you can go and stay overnight, which has a reception where you pay money to stay

115
00:07:55,520 --> 00:07:56,520
overnight.

116
00:07:56,520 --> 00:08:02,360
And then we can encode all this information we have into the vector space so that the computer

117
00:08:02,360 --> 00:08:07,000
can start to reason about the text, like really understand the text.

118
00:08:07,000 --> 00:08:08,000
And yes.

119
00:08:08,000 --> 00:08:09,000
Yeah.

120
00:08:09,000 --> 00:08:15,400
So, so the text embeddings are basically taking the text and converting it into a numerical

121
00:08:15,400 --> 00:08:17,400
representation.

122
00:08:17,400 --> 00:08:20,760
And then you can sort of do operations with the text.

123
00:08:20,760 --> 00:08:21,760
Can you speak to that?

124
00:08:21,760 --> 00:08:22,760
Correct.

125
00:08:22,760 --> 00:08:31,040
So it encodes all the information we have in text, all the information, the semantics

126
00:08:31,040 --> 00:08:34,120
of a text, and makes it accessible to the computer.

127
00:08:34,120 --> 00:08:37,600
And then you have certain dimensions, certain directions.

128
00:08:37,600 --> 00:08:40,280
So for example, you have singular and plural words.

129
00:08:40,280 --> 00:08:46,040
You have like, for example, I don't know, gender, you know, king is connected to male

130
00:08:46,040 --> 00:08:48,860
and queen is connected to female.

131
00:08:48,860 --> 00:08:53,720
You know, the relationship between London and England, that London is the capital of

132
00:08:53,720 --> 00:08:54,720
England.

133
00:08:54,720 --> 00:08:56,920
And then, you know, okay, what's the capital of Germany?

134
00:08:56,920 --> 00:09:02,760
And then we can take the same relationship, the same direction in the vector space, to infer,

135
00:09:02,760 --> 00:09:06,120
okay, the capital of Germany is Berlin.

136
00:09:06,120 --> 00:09:11,520
And then the computer can learn from all these encoded relationships

137
00:09:11,520 --> 00:09:17,840
of words and sentences and phrases and paragraphs and infer meaning like, okay, in which direction

138
00:09:17,840 --> 00:09:20,120
and how do I talk about that?

139
00:09:20,120 --> 00:09:26,920
Singular, plural, synonyms, relationships like capitals, politicians, companies, and

140
00:09:26,920 --> 00:09:28,560
relationship to founders.

141
00:09:28,560 --> 00:09:34,080
Basically, you encode all these relationships you kind of have about the world into the

142
00:09:34,080 --> 00:09:41,680
vector space to enable efficient access for the computer to the information

143
00:09:41,680 --> 00:09:43,880
in the text.

144
00:09:43,880 --> 00:09:49,560
What do you view as the biggest jumps in text embeddings, from, you know, bag of words to

145
00:09:49,560 --> 00:09:52,320
word2vec to where we are today?

146
00:09:52,320 --> 00:09:54,240
Um, yes.

147
00:09:54,240 --> 00:10:00,000
So the first big splash, which caused a lot of interest, was like word2vec.

148
00:10:00,000 --> 00:10:12,240
So before that, you represented text as, yes, as like, unique tokens, like, one-hot

149
00:10:12,240 --> 00:10:13,240
encoding.

150
00:10:13,240 --> 00:10:14,240
There was like no relationships.

151
00:10:14,240 --> 00:10:20,520
So the distance between hotel and motel, and the distance between hotel and dog was the

152
00:10:20,520 --> 00:10:21,520
same.

153
00:10:21,520 --> 00:10:28,160
So for a computer it was like impossible to know that motel is probably more similar

154
00:10:28,160 --> 00:10:31,960
to hotel than dog to hotel.

155
00:10:31,960 --> 00:10:35,040
And then word2vec, they enabled this on a word level.

156
00:10:35,040 --> 00:10:44,120
So they had like a really cool paper showcasing that you can encode this on a word level.

157
00:10:44,120 --> 00:10:54,600
And then the second big splash was in 2017, 2018, like ELMo, which was a contextualized word2vec.

158
00:10:54,600 --> 00:10:57,760
So the issue with word2vec is that it's on a word level.

159
00:10:57,760 --> 00:11:04,240
So it knows, okay, hotel and motel are close by, and apple and banana and strawberry have certain

160
00:11:04,240 --> 00:11:05,240
relationships.

161
00:11:05,240 --> 00:11:11,600
But we use words in a big setting, like the word apple, I can refer to the fruit, or I

162
00:11:11,600 --> 00:11:18,040
can refer to the company, or I can refer probably to some movie or song or some podcast series

163
00:11:18,040 --> 00:11:19,040
or some website.

164
00:11:19,040 --> 00:11:24,040
So how do you know what apple stands for in this context?

165
00:11:24,040 --> 00:11:28,440
So when I say Apple is a great company, I probably talk about the company.

166
00:11:28,440 --> 00:11:32,560
If I say apple is my favorite food, I talk about the food.

167
00:11:32,560 --> 00:11:38,080
And yeah, ELMo was like the first way to show that you can compute contextualized word embeddings,

168
00:11:38,080 --> 00:11:42,560
meaning the model learns, do you talk about the company Apple or about the food apple

169
00:11:42,560 --> 00:11:44,960
or some other meaning of apple.

170
00:11:44,960 --> 00:11:50,640
And this enables like, yeah, more complex, better understanding of how the words are

171
00:11:50,640 --> 00:11:51,640
used.

172
00:11:51,640 --> 00:11:56,920
And then on top of that, we started to build like understanding of sentences and understanding

173
00:11:56,920 --> 00:11:58,960
of paragraphs.

174
00:11:58,960 --> 00:12:00,480
Right.

175
00:12:00,480 --> 00:12:06,120
So yeah, so the other ones that I view as like really big stepping stones are like

176
00:12:06,120 --> 00:12:13,800
doc2vec, you know, things like that, just being able to represent either words, sentences,

177
00:12:13,800 --> 00:12:19,640
you know, full documents, numerically, and then being able to sort of do these operations.

178
00:12:19,640 --> 00:12:25,240
So creating powerful text embeddings is obviously something interesting for people in the NLP

179
00:12:25,240 --> 00:12:26,240
field.

180
00:12:26,240 --> 00:12:28,060
But you know, so what?

181
00:12:28,060 --> 00:12:30,260
You know, why should businesses care?

182
00:12:30,260 --> 00:12:33,360
Why should people outside of NLP care?

183
00:12:33,360 --> 00:12:36,160
So yeah, my favorite application here is search.

184
00:12:36,160 --> 00:12:38,600
So, so far, search,

185
00:12:38,600 --> 00:12:39,600
it's horrible,

186
00:12:39,600 --> 00:12:44,600
like in a lot of applications, if you exclude like Google and Bing and so on.

187
00:12:44,600 --> 00:12:49,160
So for example, if you go on Wikipedia and hit that search bar and ask the question, what's

188
00:12:49,160 --> 00:12:51,720
the capital of the United States?

189
00:12:51,720 --> 00:12:56,920
The first entry, the first search result, is about capital punishment in the US.

190
00:12:56,920 --> 00:13:02,320
The article about Washington DC is not even in the top 20 search results.

191
00:13:02,320 --> 00:13:04,720
And yeah, that's like a complete failure.

192
00:13:04,720 --> 00:13:10,720
So even though the search query is simple, it doesn't retrieve any relevant results.

193
00:13:10,720 --> 00:13:17,680
Because Wikipedia right now, and like most other search systems in the world, uses lexical

194
00:13:17,680 --> 00:13:22,800
search where the system has no understanding of the text.

195
00:13:22,800 --> 00:13:28,240
So it has no understanding that capital of the United States is connected to Washington

196
00:13:28,240 --> 00:13:29,280
DC.

197
00:13:29,280 --> 00:13:34,080
So it has like no idea that there's a relationship to this and Washington DC.

198
00:13:34,080 --> 00:13:38,920
Because if you look at the surface level, just at the characters, it's different, like

199
00:13:38,920 --> 00:13:41,840
capital of the United States and Washington DC.

200
00:13:41,840 --> 00:13:42,840
That's different.

201
00:13:42,840 --> 00:13:46,800
And with embeddings, we have these relationships built in.

202
00:13:46,800 --> 00:13:52,520
So the vector space knows that Washington DC and capital of the United States are connected.

203
00:13:52,520 --> 00:13:57,320
And also that United States and US and USA are connected.

204
00:13:57,320 --> 00:14:01,360
So it can retrieve extremely good search results.

205
00:14:01,360 --> 00:14:04,460
Also these embeddings, they make classification much nicer.

206
00:14:04,460 --> 00:14:10,480
So I built a system to filter spam and unwanted content from my inbox.

207
00:14:10,480 --> 00:14:17,640
I got like a lot of cold emails from people trying to sell me stuff, or headhunters trying

208
00:14:17,640 --> 00:14:23,600
to send me CVs where I say, no, please move it away from my inbox.

209
00:14:23,600 --> 00:14:28,240
And yeah, with these embedding approaches, I can say, here are like five examples of unwanted

210
00:14:28,240 --> 00:14:33,960
emails, and now it works extremely well and filters out all the unwanted emails just by

211
00:14:33,960 --> 00:14:39,400
providing these five examples, because it learned what I don't want to see.

212
00:14:39,400 --> 00:14:45,500
I don't want to have like any cold calls, cold emails, people trying to sell me stuff.

213
00:14:45,500 --> 00:14:46,920
So please filter it out.

214
00:14:46,920 --> 00:14:52,360
Here are five examples of what such a cold-call email looks like, and then it learns it and

215
00:14:52,360 --> 00:14:56,440
then knows and can generalize to other content in the field.

216
00:14:56,440 --> 00:14:57,440
Right.

217
00:14:57,440 --> 00:15:00,480
So embeddings are sort of the first step.

218
00:15:00,480 --> 00:15:03,520
You're going to take a text and then you represent it.

219
00:15:03,520 --> 00:15:10,280
And then you can start to do similarity tasks, which allow you to do search better, information

220
00:15:10,280 --> 00:15:14,800
retrieval, text classification.

221
00:15:14,800 --> 00:15:19,920
And as you were referring to, there's different techniques to do search.

222
00:15:19,920 --> 00:15:25,720
So lexical search, that's more where you're just looking for exact matches.

223
00:15:25,720 --> 00:15:32,600
And then neural search is where you get to use sort of the power of text embeddings.

224
00:15:32,600 --> 00:15:38,640
Are there any applications where neural search, there's some limits, where there are limitations

225
00:15:38,640 --> 00:15:39,640
for neural search?

226
00:15:39,640 --> 00:15:40,640
Yeah, of course.

227
00:15:40,640 --> 00:15:45,320
I mean, there's always like pros and cons.

228
00:15:45,320 --> 00:15:51,320
Obviously like neural search, I mean, neural search is like a really broad topic.

229
00:15:51,320 --> 00:15:52,320
It's not only embeddings.

230
00:15:52,320 --> 00:15:55,200
It's like, I don't know, probably 20 different techniques.

231
00:15:55,200 --> 00:15:59,400
How you can use these technologies to improve search results.

232
00:15:59,400 --> 00:16:04,440
These embedding approaches themselves, they have challenges if you want to do like lexical

233
00:16:04,440 --> 00:16:05,440
search.

234
00:16:05,440 --> 00:16:10,560
So I don't know if you search for like a phone number, you want to find that entry with

235
00:16:10,560 --> 00:16:12,560
this specific phone number.

236
00:16:12,560 --> 00:16:15,640
And there's not like a lot of semantic meaning in a phone number.

237
00:16:15,640 --> 00:16:19,000
So you cannot infer, hey, at position five, there's a seven.

238
00:16:19,000 --> 00:16:22,320
You cannot say, okay, there's like a lot of meaning.

239
00:16:22,320 --> 00:16:27,360
Or if these two positions are off by one number, that's a completely different phone number.

240
00:16:27,360 --> 00:16:29,920
So this is like a limitation.

241
00:16:29,920 --> 00:16:35,120
These approaches, obviously they have challenges understanding new words and learning these

242
00:16:35,120 --> 00:16:37,000
concepts of new words.

243
00:16:37,000 --> 00:16:43,960
So we constantly innovate new companies, new products, new movies are released, new people

244
00:16:43,960 --> 00:16:44,960
become known.

245
00:16:44,960 --> 00:16:52,880
So a big question in the field is like, how can these models learn these new concepts

246
00:16:52,880 --> 00:16:56,400
and the relationships of these new concepts?

247
00:16:56,400 --> 00:16:57,400
Yeah.

248
00:16:57,400 --> 00:17:03,080
So just digging into Sentence Transformers, you know, seeing that it has like 9,000 plus

249
00:17:03,080 --> 00:17:09,280
stars on GitHub and 25 million plus installations.

250
00:17:09,280 --> 00:17:12,240
How has the package changed over time?

251
00:17:12,240 --> 00:17:14,000
What's been your experience with open source?

252
00:17:14,000 --> 00:17:15,960
Can you speak to that?

253
00:17:15,960 --> 00:17:18,240
Yeah, sure.

254
00:17:18,240 --> 00:17:23,800
So yeah, originally what a lot of researchers do in the field, they do some research and

255
00:17:23,800 --> 00:17:28,280
then they mainly publish the code to reproduce their research.

256
00:17:28,280 --> 00:17:34,320
Say, hey, here's my paper and here's some code you can run to get like same numbers.

257
00:17:34,320 --> 00:17:39,600
That's like the common understanding in the machine learning research community, which

258
00:17:39,600 --> 00:17:45,360
in my opinion limits the usefulness of this software.

259
00:17:45,360 --> 00:17:53,040
I mean, you build amazing models, amazing tools, but other people do not really want

260
00:17:53,040 --> 00:17:54,880
to use your tool to get the same numbers.

261
00:17:54,880 --> 00:18:02,120
I mean, it's kind of boring as an output to say, yeah, at the end it prints out 82.5 and

262
00:18:02,120 --> 00:18:06,400
that's the same number I put in my paper, but they want to take your software and build

263
00:18:06,400 --> 00:18:07,400
a cool tool.

264
00:18:07,400 --> 00:18:11,440
They want to do a semantic search on Wikipedia or do a semantic search on their emails or do

265
00:18:11,440 --> 00:18:16,920
a semantic search on notes or do a semantic search on podcast transcripts.

266
00:18:16,920 --> 00:18:19,240
So this changed a lot.

267
00:18:19,240 --> 00:18:24,280
So in the beginning I was similar, say, okay, mainly publishing code just to reproduce the

268
00:18:24,280 --> 00:18:29,040
experiments and results from the paper to more, no, no, let's make a product out of

269
00:18:29,040 --> 00:18:30,040
it.

270
00:18:30,040 --> 00:18:35,640
So what's a cool tool coming from research which allows other people to build cool products

271
00:18:35,640 --> 00:18:38,160
and cool use cases out of this?

272
00:18:38,160 --> 00:18:42,440
And that's then the main thing I did in the past years in all the research.

273
00:18:42,440 --> 00:18:44,320
Say, okay, do some research.

274
00:18:44,320 --> 00:18:46,240
How can we enable X?

275
00:18:46,240 --> 00:18:50,800
And then if we found a way for this, build a product, an open source product, and give it to

276
00:18:50,800 --> 00:18:55,680
people to use it to build other cool stuff.

277
00:18:55,680 --> 00:19:02,320
Over the three plus years having that library, what are some of the biggest challenges that

278
00:19:02,320 --> 00:19:03,320
you faced?

279
00:19:03,320 --> 00:19:11,160
I mean, as a researcher, you are mainly judged, your main criterion is the amount of

280
00:19:11,160 --> 00:19:17,400
research you put out, like how many papers you publish, what are your scientific contributions.

281
00:19:17,400 --> 00:19:24,400
If you publish and maintain like an open source library, there's like, you have to do work

282
00:19:24,400 --> 00:19:30,440
in terms of maintenance, update it, update it to the most recent version of Python or

283
00:19:30,440 --> 00:19:32,720
dependencies.

284
00:19:32,720 --> 00:19:39,040
Do some bug fixing because someone wants to use it on a MacBook.

285
00:19:39,040 --> 00:19:42,920
And this takes time away from you doing research.

286
00:19:42,920 --> 00:19:49,440
And so it can happen that you do all this work, which is amazing for the community,

287
00:19:49,440 --> 00:19:55,320
but you don't have any more time to do research and contribute and improve human knowledge.

288
00:19:55,320 --> 00:19:58,280
And so you have to find the right balance.

289
00:19:58,280 --> 00:19:59,280
Right.

290
00:19:59,280 --> 00:20:00,280
Yeah.

291
00:20:00,280 --> 00:20:06,520
So finding that balance between having a library that's useful for as many people as possible

292
00:20:06,520 --> 00:20:10,640
where they can use it as like a huge building block for their work.

293
00:20:10,640 --> 00:20:15,760
And then also you want to be continuing to push your limits and continuing to expand

294
00:20:15,760 --> 00:20:18,880
the work that you're doing.

295
00:20:18,880 --> 00:20:25,480
One of the most exciting, for me at least, use cases of Sentence Transformers is SetFit.

296
00:20:25,480 --> 00:20:31,280
I see that you're an author on that paper with some people from Intel and Hugging Face.

297
00:20:31,280 --> 00:20:36,760
Can you talk about the experience of creating SetFit or helping?

298
00:20:36,760 --> 00:20:38,560
Yeah, sure.

299
00:20:38,560 --> 00:20:49,600
So when OpenAI published GPT-3, they had like a paper showing that these large generative

300
00:20:49,600 --> 00:20:54,600
models are extremely good at few-shot classification, and that started a lot of hype.

301
00:20:54,600 --> 00:21:00,480
So if you write the right prompt, you can classify if a news article is about sports

302
00:21:00,480 --> 00:21:03,440
or business or technology.

303
00:21:03,440 --> 00:21:06,920
They show, okay, you only need like really few examples.

304
00:21:06,920 --> 00:21:12,360
So you show the model like three examples about sports news articles, three examples about

305
00:21:12,360 --> 00:21:15,560
business and three examples about technology.

306
00:21:15,560 --> 00:21:20,720
And then for a new article, the model can infer the category.

307
00:21:20,720 --> 00:21:25,160
But if you really use this, it becomes like cumbersome to use.

308
00:21:25,160 --> 00:21:31,160
So people invented the term prompt engineering, because it makes a difference

309
00:21:31,160 --> 00:21:37,600
if you add like a colon, semicolon or an exclamation mark at the end of the prompt. Sometimes it's

310
00:21:37,600 --> 00:21:45,760
helpful if you ask like, please classify this article instead of classify this article.

311
00:21:45,760 --> 00:21:51,560
So it becomes like really, really cumbersome, really hard to use from a user perspective.

312
00:21:51,560 --> 00:21:57,280
And then last year Moshe from Intel AI said, okay, I think with embeddings, this is much

313
00:21:57,280 --> 00:21:58,280
easier.

314
00:21:58,280 --> 00:22:04,120
So let's just take the examples, take your three examples of tech news, business news

315
00:22:04,120 --> 00:22:08,960
and sports news, embed them in a vector space, and train a classifier on the vector space

316
00:22:08,960 --> 00:22:17,320
to see where in the vector space the tech articles, business articles and sports articles are.

317
00:22:17,320 --> 00:22:22,000
And assuming the model has or the vector space has learned all these relationships about

318
00:22:22,000 --> 00:22:28,240
sports and players in sports and players in technology and players in politics and people

319
00:22:28,240 --> 00:22:32,640
in politics, use this for few-shot classification.

320
00:22:32,640 --> 00:22:41,200
And yeah, what we showed, or Moshe showed, is that it can be much better than GPT-3 in a

321
00:22:41,200 --> 00:22:47,840
few-shot setting while using like a much smaller model, like a model you can run on your phone,

322
00:22:47,840 --> 00:22:53,200
a model that's like 50,000 times more efficient and faster.

323
00:22:53,200 --> 00:22:57,400
And then he approached me and said, hey Nils, I found this super cool tool.

324
00:22:57,400 --> 00:22:58,400
It's super amazing.

325
00:22:58,400 --> 00:23:00,600
Do you want to do research on it?

326
00:23:00,600 --> 00:23:04,240
And when we tested it, we were totally amazed because you don't have to do any of this

327
00:23:04,240 --> 00:23:09,320
prompt engineering where it makes a difference if you end the prompt with a period or an

328
00:23:09,320 --> 00:23:12,440
exclamation mark or a colon.

329
00:23:12,440 --> 00:23:13,440
It works really nice.

330
00:23:13,440 --> 00:23:16,840
It can scale to any size of training data.

331
00:23:16,840 --> 00:23:19,000
It's super efficient.

332
00:23:19,000 --> 00:23:21,720
It runs on your phone.

333
00:23:21,720 --> 00:23:25,600
It's better than GPT-3 on your phone for text classification.

334
00:23:25,600 --> 00:23:27,720
It works in a multilingual setting extremely well.

335
00:23:27,720 --> 00:23:32,880
So we tested these in-context learning examples for different languages.

336
00:23:32,880 --> 00:23:37,680
We did not find any method, for example, that worked in Japanese.

337
00:23:37,680 --> 00:23:41,680
So we tried really hard and connected also with native speakers.

338
00:23:41,680 --> 00:23:45,660
So the girlfriend of one of the co-authors is Japanese.

339
00:23:45,660 --> 00:23:48,680
So really make sure that we got the right Japanese prompt.

340
00:23:48,680 --> 00:23:53,440
And if she has some ideas how we can modify the prompts to do classification in Japanese.

341
00:23:53,440 --> 00:23:58,960
But yeah, with these embedding approaches, because they are language agnostic or can

342
00:23:58,960 --> 00:24:03,440
be language agnostic, it doesn't matter if your examples are in English or German or

343
00:24:03,440 --> 00:24:05,280
Japanese or Arabic.

344
00:24:05,280 --> 00:24:11,480
So you take like three Japanese news articles, you say, okay, these three Japanese news articles are tech,

345
00:24:11,480 --> 00:24:13,760
sports, business.

346
00:24:13,760 --> 00:24:17,400
And then you train the classifier and then you have like an extremely powerful few-shot

347
00:24:17,400 --> 00:24:19,720
classifier system in Japanese.

348
00:24:19,720 --> 00:24:24,560
And yeah, we were totally amazed by the ease of use because it's super easy to use and super fast

349
00:24:24,560 --> 00:24:25,560
to run.

350
00:24:25,560 --> 00:24:29,640
Yeah, I have been using it and I absolutely love it.

351
00:24:29,640 --> 00:24:32,400
The results that you get are incredible.

352
00:24:32,400 --> 00:24:38,400
Yeah, ease of use is, it's like, it's a pleasure.

353
00:24:38,400 --> 00:24:39,400
It's really nice to use.

354
00:24:39,400 --> 00:24:42,640
You can run it on your laptop.

355
00:24:42,640 --> 00:24:48,760
The beauty of it for me, I think is like the contrastive learning application there

356
00:24:48,760 --> 00:24:55,440
and that how you can take an embedding and leverage the information that you have, right?

357
00:24:55,440 --> 00:25:00,280
That these two samples are similar and these two samples are not similar.

358
00:25:00,280 --> 00:25:05,100
And then you can use that to create an even better embedding that can help you with whatever

359
00:25:05,100 --> 00:25:13,640
your downstream task is, say, text classification.

360
00:25:13,640 --> 00:25:22,400
In the NLP field, I feel like it's very hard to know when you're making meaningful progress.

361
00:25:22,400 --> 00:25:28,160
I know that you have spent time exploring different sorts of benchmarks.

362
00:25:28,160 --> 00:25:32,600
I'd love to just get your take on it.

363
00:25:32,600 --> 00:25:38,720
Can you speak to the big NLP benchmarks, the GLUEs and SuperGLUEs and all of that?

364
00:25:38,720 --> 00:25:39,720
Yeah, sure.

365
00:25:39,720 --> 00:25:44,280
So yeah, machine learning field, NLP field loves benchmarks.

366
00:25:44,280 --> 00:25:47,560
That's totally what people do.

367
00:25:47,560 --> 00:25:50,840
You take a benchmark, you see what's the latest number.

368
00:25:50,840 --> 00:25:53,720
Let's say it's 84.0.

369
00:25:53,720 --> 00:25:59,400
You try a lot of tricks and then you get a better number, 84.2.

370
00:25:59,400 --> 00:26:05,920
And then you think, yeah, 84.2 is higher, larger than 84.0.

371
00:26:05,920 --> 00:26:09,200
So you write a paper and say, this is better.

372
00:26:09,200 --> 00:26:11,600
We seldom really ask, is it really better?

373
00:26:11,600 --> 00:26:18,640
Is this 0.2 improvement, 0.1, or 1.5, whatever the delta is, really, really better?

374
00:26:18,640 --> 00:26:22,760
And I experienced this with sentence transformers itself.

375
00:26:22,760 --> 00:26:27,640
The original models from the paper in the benchmarks that were used at that time and

376
00:26:27,640 --> 00:26:33,840
which were the common standard in the field, the sentence transformer, the first version

377
00:26:33,840 --> 00:26:37,920
of the sentence transformer models looked better in all the benchmarks than, for example,

378
00:26:37,920 --> 00:26:42,200
the universal sentence encoder, which was like one, two years older.

379
00:26:42,200 --> 00:26:49,560
But if you really used it for your own application and really tried to build stuff with it, you often

380
00:26:49,560 --> 00:26:54,800
saw, OK, no, universal sentence encoder was at that time much, much better than sentence

381
00:26:54,800 --> 00:26:55,800
transformers.

382
00:26:55,800 --> 00:26:57,840
And so how can there be such a gap?

383
00:26:57,840 --> 00:27:03,200
Like all the benchmarks say, yeah, Sentence Transformers, Sentence-BERT, version one is

384
00:27:03,200 --> 00:27:05,600
better than universal sentence encoder.

385
00:27:05,600 --> 00:27:09,640
But this got me interested, like, OK, how good are our benchmarks?

386
00:27:09,640 --> 00:27:12,720
Are we really benchmarking what we want?

387
00:27:12,720 --> 00:27:15,160
And how can we create better benchmarks?

388
00:27:15,160 --> 00:27:23,160
And sadly, a lot of benchmarks in the field become, or are becoming meaningless now.

389
00:27:23,160 --> 00:27:27,960
So in the beginning, when you publish a benchmark, it's useful, it's meaningful.

390
00:27:27,960 --> 00:27:32,080
But then over time, the value of the benchmark decreases.

391
00:27:32,080 --> 00:27:35,720
And at some point, it hits like zero value.

392
00:27:35,720 --> 00:27:42,480
So it's nowadays completely useless that you get a new state-of-the-art result on GLUE

393
00:27:42,480 --> 00:27:45,480
because the benchmark is over-saturated, overfitted.

394
00:27:45,480 --> 00:27:47,680
It doesn't really tell.

395
00:27:47,680 --> 00:27:52,760
So even if the model is better on the GLUE benchmark, it doesn't tell anything if the

396
00:27:52,760 --> 00:27:56,520
model is really better, if people will use it.

397
00:27:56,520 --> 00:27:57,520
Right.

398
00:27:57,520 --> 00:28:04,160
And I love the comparison that you made where once a, in a recent talk, you were talking

399
00:28:04,160 --> 00:28:11,000
basically about how having these benchmarks out and then having all these hundreds of

400
00:28:11,000 --> 00:28:14,480
people trying to work on it to make it better, it's like you're going to be over-fitting

401
00:28:14,480 --> 00:28:16,040
on that particular data set.

402
00:28:16,040 --> 00:28:18,880
You don't know how it's necessarily going to generalize.

403
00:28:18,880 --> 00:28:23,540
And you talked about how it was similar basically to like, you know, p-hacking, right?

404
00:28:23,540 --> 00:28:29,560
Like if you run a hundred different experiments, yeah, you're going to see that something is

405
00:28:29,560 --> 00:28:34,920
correlated with something else, but it doesn't mean it necessarily is a meaningful relationship

406
00:28:34,920 --> 00:28:35,920
between the two.

407
00:28:35,920 --> 00:28:36,920
Yeah.

408
00:28:36,920 --> 00:28:44,240
I mean, if we take, for example, GLUE, GLUE just has like short text classification or

409
00:28:44,240 --> 00:28:49,200
text understanding tasks in it, which are like mostly at the sentence level.

410
00:28:49,200 --> 00:28:56,240
So it's like, take this sentence and tell me if it has like positive or negative sentiment.

411
00:28:56,240 --> 00:29:02,640
And yeah, what people did working on this, they found ways how to train models that are

412
00:29:02,640 --> 00:29:03,920
better on the GLUE benchmark.

413
00:29:03,920 --> 00:29:10,080
So now a lot of papers and model reports say, yeah, we just train our pre-trained models

414
00:29:10,080 --> 00:29:11,080
on short sequences.

415
00:29:11,080 --> 00:29:16,520
So we just pre-train it on like sequences up to, I don't know, 128 word pieces.

416
00:29:16,520 --> 00:29:23,000
Because to work well on the GLUE benchmark, we don't need long text understanding.

417
00:29:23,000 --> 00:29:26,600
It's sufficient if the model just understands the sentence.

418
00:29:26,600 --> 00:29:28,280
And yeah, it's sufficient for GLUE.

419
00:29:28,280 --> 00:29:33,160
And then people put it out and you think, okay, this is a great model, it performs well

420
00:29:33,160 --> 00:29:35,920
on GLUE and you use it for your own application.

421
00:29:35,920 --> 00:29:39,920
But sadly your application is, I don't know, email classification, which is longer than

422
00:29:39,920 --> 00:29:41,920
a sentence.

423
00:29:41,920 --> 00:29:47,360
And there it works really badly because it was never trained to work on like longer text,

424
00:29:47,360 --> 00:29:50,360
never tested on longer text.

425
00:29:50,360 --> 00:29:55,720
And maybe the older model, BERT, which was not overfitted heavily on GLUE, is much,

426
00:29:55,720 --> 00:30:01,840
much better because it works well for your emails, which are longer than a sentence.

427
00:30:01,840 --> 00:30:07,360
Yeah, it reminds me, maybe it's like even the same thing.

428
00:30:07,360 --> 00:30:12,360
How similar are the datasets that you work on when you're training and evaluating your

429
00:30:12,360 --> 00:30:17,200
model offline and then when you put it into production and you see the real data that

430
00:30:17,200 --> 00:30:21,160
it's dealing with and then you realize like, oh, it's not performing the way that I thought

431
00:30:21,160 --> 00:30:22,280
it was.

432
00:30:22,280 --> 00:30:27,880
But if you look at why, sometimes it can be fairly clear if you trained it all on short

433
00:30:27,880 --> 00:30:30,600
text, so then it's not gonna work well on long text.

434
00:30:30,600 --> 00:30:40,920
One thing that I find to be really incredible and yeah, the datasets that you used to train

435
00:30:40,920 --> 00:30:49,120
sentence transformers, the variety of them, can you talk about that a little bit?

436
00:30:49,120 --> 00:30:51,360
Yeah, sure.

437
00:30:51,360 --> 00:30:56,400
So yeah, the original sentence transformers model was just trained on like small dataset

438
00:30:56,400 --> 00:30:57,400
from NLI.

439
00:30:57,400 --> 00:31:02,920
So there's like the Stanford NLI dataset, where everything is like short

440
00:31:02,920 --> 00:31:05,960
text, really clean, nicely written.

441
00:31:05,960 --> 00:31:09,720
And as mentioned, the original model was not as good as Universal Sentence Encoder.

442
00:31:09,720 --> 00:31:14,920
That was like really bugging me because Universal Sentence Encoder was trained on a lot of Google

443
00:31:14,920 --> 00:31:23,160
internal data, like millions, billions of data points from all sources.

444
00:31:23,160 --> 00:31:27,800
But sadly, it was not like publicly available, so it was like Google internal data.

445
00:31:27,800 --> 00:31:31,800
So yeah, I spent a lot of time, a lot on the sentence transformers to get more and more

446
00:31:31,800 --> 00:31:36,840
data and then use more and more iterations, make every dataset available to build up like

447
00:31:36,840 --> 00:31:39,720
some public open source collection.

448
00:31:39,720 --> 00:31:46,720
And so now publicly available, there's like over a billion training pairs available from

449
00:31:46,720 --> 00:31:54,600
all places like Stack Overflow, Stack Exchange, Reddit, news articles, duplicate questions,

450
00:31:54,600 --> 00:31:57,160
hands-on image captions.

451
00:31:57,160 --> 00:31:59,800
And yeah, this allows the model to learn more.

452
00:31:59,800 --> 00:32:05,960
So not only to learn what the relationship is between two nicely, cleanly written sentences.

453
00:32:05,960 --> 00:32:10,960
So that's what the first version of sentence transformers was trained on and sadly evaluated

454
00:32:10,960 --> 00:32:11,960
on.

455
00:32:11,960 --> 00:32:18,080
So now it was trained on like a whole bunch of texts, like really ugly, noisy, social

456
00:32:18,080 --> 00:32:21,240
media text full of hashtags and emojis.

457
00:32:21,240 --> 00:32:26,880
And now the model understands like emojis and hashtags and knows, okay, what's the similarity

458
00:32:26,880 --> 00:32:30,680
between hashtags and relationship in hashtags?

459
00:32:30,680 --> 00:32:32,920
And this gives you a much, much better model.

460
00:32:32,920 --> 00:32:39,040
Sadly, sometimes if people use it on the old benchmarks on the nicely cleanly written text,

461
00:32:39,040 --> 00:32:43,080
it doesn't perform as well as models overfitted on these settings.

462
00:32:43,080 --> 00:32:49,080
And they say, oh no, how is it like a state of the art if it's like two points weaker on

463
00:32:49,080 --> 00:32:54,200
this absolutely irrelevant benchmark because we have a big trust in benchmarks and think,

464
00:32:54,200 --> 00:32:57,640
okay, this benchmark tells what is the best model.

465
00:32:57,640 --> 00:32:58,640
Right.

466
00:32:58,640 --> 00:33:04,800
It's pretty incredible, you know, how text that's in a book compared to text that's,

467
00:33:04,800 --> 00:33:10,200
you know, in a social media post compared to text that's in a dialogue, you know, how

468
00:33:10,200 --> 00:33:12,600
different they really are.

469
00:33:12,600 --> 00:33:19,440
You know, yes, it's all people trying to express themselves, you know, using natural language,

470
00:33:19,440 --> 00:33:22,520
but there's just such, such a variety.

471
00:33:22,520 --> 00:33:28,960
I think that's what makes it hard to get this one, you know, encoder that's gonna work for

472
00:33:28,960 --> 00:33:29,960
everything.

473
00:33:29,960 --> 00:33:32,520
And I think that's why you sort of need to figure out what's the best one, what's the

474
00:33:32,520 --> 00:33:39,880
best embedding that's gonna be applicable to your use case.

475
00:33:39,880 --> 00:33:46,120
It was so nice to talk to you about, you know, Sentence Transformers and SetFit. To zoom

476
00:33:46,120 --> 00:33:52,120
out a little bit just to talk about machine learning in general.

477
00:33:52,120 --> 00:33:57,600
I'm just wondering in the field, what's an important question that you believe remains

478
00:33:57,600 --> 00:34:01,240
unanswered?

479
00:34:01,240 --> 00:34:08,480
So what amazes me about humans is the ability, how quickly we can learn and update information.

480
00:34:08,480 --> 00:34:14,800
So still a big challenge of a lot of models, like if you take the BERT model, BERT still

481
00:34:14,800 --> 00:34:21,720
thinks that Barack Obama is the US president has like no idea about Donald Trump, no idea

482
00:34:21,720 --> 00:34:23,680
about Joe Biden.

483
00:34:23,680 --> 00:34:28,120
And for us as a human to update the knowledge that there's like a new US president, it's

484
00:34:28,120 --> 00:34:32,280
like one sentence like person X is the new US president.

485
00:34:32,280 --> 00:34:34,800
I don't know, Joe Biden is the new US president.

486
00:34:34,800 --> 00:34:40,440
And we have this information in our head, but we also update the relationships.

487
00:34:40,440 --> 00:34:45,560
So we know, okay, there's a new US president, we know there's a new first lady, we know

488
00:34:45,560 --> 00:34:47,040
maybe the party changes.

489
00:34:47,040 --> 00:34:52,680
So before was Donald Trump Republican, now it's Joe Biden Democrat.

490
00:34:52,680 --> 00:34:57,400
It changes the number of presidents who attended different schools, the number of presidents

491
00:34:57,400 --> 00:34:59,840
who have been former vice presidents and so on.

492
00:34:59,840 --> 00:35:05,640
Like a lot of knowledge is updated in our head from this short, I don't know, five words

493
00:35:05,640 --> 00:35:10,760
examples like Joe Biden is the new US president.

494
00:35:10,760 --> 00:35:16,440
If we do this for a model like BERT, like some transformer model, it's like super, super

495
00:35:16,440 --> 00:35:17,820
hard to update it.

496
00:35:17,820 --> 00:35:24,080
So often we need like a lot of text, like millions of examples mentioning that Joe Biden

497
00:35:24,080 --> 00:35:29,600
is the new US president and now the Democrats are back in the White House and that there's

498
00:35:29,600 --> 00:35:32,600
a new first lady and so on.

499
00:35:32,600 --> 00:35:34,860
So it's like extremely super inefficient.

500
00:35:34,860 --> 00:35:40,000
And this makes it like really, really hard to have like models up to date, models that

501
00:35:40,000 --> 00:35:44,520
you can learn like niche topics, because it's not data efficient.

502
00:35:44,520 --> 00:35:51,560
And I think going forward, that's an extremely interesting research topic, how can we make

503
00:35:51,560 --> 00:35:57,280
models update as efficiently as humans can acquire new knowledge?

504
00:35:57,280 --> 00:35:58,280
Right.

505
00:35:58,280 --> 00:36:03,120
So that ability to adapt to change.

506
00:36:03,120 --> 00:36:10,040
And that sort of touches on my next question, which is how do you think machine learning

507
00:36:10,040 --> 00:36:16,360
will change or evolve say in the next 10 years?

508
00:36:16,360 --> 00:36:24,520
And what do you think the impact will be on society?

509
00:36:24,520 --> 00:36:25,520
Big question.

510
00:36:25,520 --> 00:36:26,520
Big question.

511
00:36:26,520 --> 00:36:37,240
So currently the trend we see is we go to more exciting applications, which are harder

512
00:36:37,240 --> 00:36:40,940
and harder to quantify.

513
00:36:40,940 --> 00:36:44,040
So far machine learning research is a lot about quantification.

514
00:36:44,040 --> 00:36:50,080
So you take a benchmark like GLUE, where you have like 1000 examples, I don't know, movie

515
00:36:50,080 --> 00:36:53,840
reviews, which you annotate as positive or negative sentiment.

516
00:36:53,840 --> 00:36:59,320
And that's like really easy to benchmark your numbers and say, okay, this model is better

517
00:36:59,320 --> 00:37:00,320
than previous models.

518
00:37:00,320 --> 00:37:01,320
It got,

519
00:37:01,320 --> 00:37:07,240
Like, I don't know, 95% correct and previous model got 94% correct.

520
00:37:07,240 --> 00:37:12,600
With these more complex use cases like generative things like, I don't know, here's an email,

521
00:37:12,600 --> 00:37:16,400
write like a nice response to that email.

522
00:37:16,400 --> 00:37:18,200
Like how do we evaluate this?

523
00:37:18,200 --> 00:37:24,880
Or I mean, we saw these applications, ChatGPT, write like a poem about how bubble sort is

524
00:37:24,880 --> 00:37:25,880
working.

525
00:37:25,880 --> 00:37:29,720
How do you evaluate if the generated poem is correct?

526
00:37:29,720 --> 00:37:33,080
So you have to evaluate the content, does it make sense?

527
00:37:33,080 --> 00:37:34,080
How amusing it is.

528
00:37:34,080 --> 00:37:35,800
How amusing is it?

529
00:37:35,800 --> 00:37:37,760
Does it rhyme?

530
00:37:37,760 --> 00:37:44,480
And this puts like a lot of stress in science, like, okay, how can we know these two systems,

531
00:37:44,480 --> 00:37:48,680
this system creates like a nice poem about how bubble sort is working?

532
00:37:48,680 --> 00:37:51,520
Or is this system working nicer?

533
00:37:51,520 --> 00:37:56,240
And I think we're more and more tapping on these use cases, we see now it's possible.

534
00:37:56,240 --> 00:38:02,580
It's interesting, how can we create like a science out of this, which requires experiments?

535
00:38:02,580 --> 00:38:06,400
And how can we continuously improve on this?

536
00:38:06,400 --> 00:38:12,200
And yeah, the research field went away in machine learning a lot from human experiments,

537
00:38:12,200 --> 00:38:17,600
like asking humans what they think to like take quantitative numbers, computing accuracy,

538
00:38:17,600 --> 00:38:18,600
F1 scores.

539
00:38:18,600 --> 00:38:23,240
I think now it's going back again to human experiments.

540
00:38:23,240 --> 00:38:30,600
So we showed 100 humans these two generated poems, which poem is nicer is better, and

541
00:38:30,600 --> 00:38:32,720
then try to find these numbers.

542
00:38:32,720 --> 00:38:37,920
But yeah, it will be really hard for science how to compare it, how to scale it.

543
00:38:37,920 --> 00:38:42,560
So it will be a change how we do scientific stuff.

544
00:38:42,560 --> 00:38:50,760
Speaking of ChatGPT, and, you know, the generative models, you know, I see that it's creating

545
00:38:50,760 --> 00:38:55,780
a big hype, right, even larger hype for AI.

546
00:38:55,780 --> 00:39:01,880
How do you view the gap between the hype and the reality right now in machine learning?

547
00:39:01,880 --> 00:39:08,280
Yeah, that's a good one.

548
00:39:08,280 --> 00:39:16,560
So what I mean, it shows that there are a lot of applications possible, which we have

549
00:39:16,560 --> 00:39:19,040
not thought before could be possible.

550
00:39:19,040 --> 00:39:24,880
I mean, so so I just read like, I don't know, really nice cool idea.

551
00:39:24,880 --> 00:39:31,840
As as manager or director or whatever, we get like a lot of emails, so you have to respond

552
00:39:31,840 --> 00:39:32,840
to a lot of emails.

553
00:39:32,840 --> 00:39:38,280
And then yeah, use these generative models to create like a draft for every email you

554
00:39:38,280 --> 00:39:43,440
get based on previous responses you sent, you do some post-editing and then you can send

555
00:39:43,440 --> 00:39:49,240
around and send it out and this will save you as a manager like a lot of time when your

556
00:39:49,240 --> 00:39:54,560
main job is like communication and you don't have to write these drafts by hand each time

557
00:39:54,560 --> 00:40:00,600
but just from previous what you send around, just take the same text style and generate

558
00:40:00,600 --> 00:40:01,600
this.

559
00:40:01,600 --> 00:40:05,760
There's still a lot of uncertainty in the field, like what's the right business model,

560
00:40:05,760 --> 00:40:11,760
and that's what people currently try to figure out, what's the business model. It was similar

561
00:40:11,760 --> 00:40:16,120
when the mobile internet was launched and smartphones were launched.

562
00:40:16,120 --> 00:40:19,040
Like how do you create like a business out of an app?

563
00:40:19,040 --> 00:40:20,640
So so what's an app business?

564
00:40:20,640 --> 00:40:26,480
That's kind of the business model over the past 15 years, we learned how you can create

565
00:40:26,480 --> 00:40:28,400
like a business model out of apps.

566
00:40:28,400 --> 00:40:31,120
And I think it's similar with AI now.

567
00:40:31,120 --> 00:40:38,720
So what is a potential business model for AI, for generative things? The majority of things will

568
00:40:38,720 --> 00:40:39,720
not work out.

569
00:40:39,720 --> 00:40:42,800
But there are some things that work out.

570
00:40:42,800 --> 00:40:44,880
And then it can be copied over and over again.

571
00:40:44,880 --> 00:40:49,560
Okay, that's that's a new way how you can use it.

572
00:40:49,560 --> 00:40:56,480
Right with with the, you know, advent of generative models and sort of how they're permeating

573
00:40:56,480 --> 00:41:03,000
through society, it's it's really crossing over that boundary of it's not like a natural

574
00:41:03,000 --> 00:41:04,200
language processing thing.

575
00:41:04,200 --> 00:41:05,320
It's not a machine learning thing.

576
00:41:05,320 --> 00:41:09,600
It's it's affecting all different types of work.

577
00:41:09,600 --> 00:41:15,000
I think it's really how humans interact with those generative models.

578
00:41:15,000 --> 00:41:20,560
Like for example, you know, you're talking about like drafting up emails or drafting

579
00:41:20,560 --> 00:41:24,640
up, you know, outlines for essays.

580
00:41:24,640 --> 00:41:29,880
And then, you know, the human can then take that and work it and make it you know, make

581
00:41:29,880 --> 00:41:30,920
it better.

582
00:41:30,920 --> 00:41:38,420
It's going to be really interesting to see sort of how humans and machines, how that

583
00:41:38,420 --> 00:41:44,200
interaction evolves over the next couple of years, when we're in this time, where it's

584
00:41:44,200 --> 00:41:48,680
like now everyone is sort of seeing the power.

585
00:41:48,680 --> 00:41:50,240
It's going to be interesting to see.

586
00:41:50,240 --> 00:41:54,840
Yeah, like all the different applications of it.

587
00:41:54,840 --> 00:42:03,680
So switching into our learning from machine learning part of the show, who are some people

588
00:42:03,680 --> 00:42:06,800
in in the field that influence you?

589
00:42:06,800 --> 00:42:13,720
Yeah, a big, big impact was Andrew Ng.

590
00:42:13,720 --> 00:42:20,920
So he gave like, like, I don't know, like, not too long, quite recently, maybe like one,

591
00:42:20,920 --> 00:42:27,080
two years, really cool, thought-provoking talk about data-centric AI.

592
00:42:27,080 --> 00:42:34,720
So which I totally love like in machine learning, people focus a lot on modeling, they say,

593
00:42:34,720 --> 00:42:38,200
okay, this is my data set, the MNIST data set.

594
00:42:38,200 --> 00:42:39,880
It's given, it's God-given.

595
00:42:39,880 --> 00:42:41,120
That's the training set.

596
00:42:41,120 --> 00:42:42,120
That's the dev set.

597
00:42:42,120 --> 00:42:47,280
That's the test set, I have to get the highest accuracy on the test set, given the train

598
00:42:47,280 --> 00:42:48,280
and dev set.

599
00:42:48,280 --> 00:42:54,440
And I try to hyperparameter-tune my model as much as possible.

600
00:42:54,440 --> 00:42:59,920
But yeah, so he argues that working on data is a lot more fruitful.

601
00:42:59,920 --> 00:43:06,760
So in many settings, in real cases, it's not like, okay, here's like one test set, or one

602
00:43:06,760 --> 00:43:10,520
data set where you can train on one test set you're going to evaluate on.

603
00:43:10,520 --> 00:43:16,120
And you can modify how the training works, you can change the training, add features,

604
00:43:16,120 --> 00:43:21,840
add, clean it, get more data, annotate more data.

605
00:43:21,840 --> 00:43:24,400
And this often improves your model a lot.

606
00:43:24,400 --> 00:43:26,360
And yeah, that's also what I learned.

607
00:43:26,360 --> 00:43:29,200
Okay, it's often not relevant to improve the model.

608
00:43:29,200 --> 00:43:34,920
So a lot of work I did in the past years is not trying to make the model better and add

609
00:43:34,920 --> 00:43:41,160
another, I don't know, skip connection or the latest Adam, whatever variation there's

610
00:43:41,160 --> 00:43:44,600
out there, but find ways how to make the data better.

611
00:43:44,600 --> 00:43:47,760
And this paid off a lot for search.

612
00:43:47,760 --> 00:43:53,440
So we have these initial search models, and then we found ways how we can make the training

613
00:43:53,440 --> 00:44:01,400
data nicer and better and work really well to train these vector spaces and remove bad

614
00:44:01,400 --> 00:44:03,280
examples from training examples.

615
00:44:03,280 --> 00:44:06,360
And that had like massive impact.

616
00:44:06,360 --> 00:44:12,240
So if the model is really trained on nice, clean data, it works much, much better.

617
00:44:12,240 --> 00:44:17,960
And then often you can just copy paste the same approaches and just use it with good,

618
00:44:17,960 --> 00:44:20,000
nice, clean data.

619
00:44:20,000 --> 00:44:21,000
Right.

620
00:44:21,000 --> 00:44:25,800
Yeah, it's interesting because when you're learning about data science, I feel like the

621
00:44:25,800 --> 00:44:30,520
attractive thing are the algorithms and you want to be modeling things.

622
00:44:30,520 --> 00:44:35,280
Then you might go to a resource like Kaggle, which is incredible, but you're given, you're

623
00:44:35,280 --> 00:44:37,800
handed a data set, right?

624
00:44:37,800 --> 00:44:40,600
And it's pretty clean usually.

625
00:44:40,600 --> 00:44:43,480
That is never the case in industry.

626
00:44:43,480 --> 00:44:49,600
So I think that that's another part of the data-centric approach to things, just sort

627
00:44:49,600 --> 00:44:54,440
of always making sure that, are you getting the best data that you can?

628
00:44:54,440 --> 00:44:58,360
Are you processing it in the right way?

629
00:44:58,360 --> 00:45:01,280
And yeah, I'm definitely a big proponent of that.

630
00:45:01,280 --> 00:45:02,280
Yeah, totally.

631
00:45:02,280 --> 00:45:08,200
I mean, in the talk Andrew showcases some cases from computer vision where they take

632
00:45:08,200 --> 00:45:12,400
a picture to try to detect like defects in manufacturing.

633
00:45:12,400 --> 00:45:15,160
So some parts which have like errors.

634
00:45:15,160 --> 00:45:19,280
And what they do is like, I don't know, use a different camera, set it at a different

635
00:45:19,280 --> 00:45:24,680
angle with different light and model accuracy jumps like 20 points.

636
00:45:24,680 --> 00:45:30,520
So that's completely outside of data pre-processing, feature engineering, models, hyperparameter

637
00:45:30,520 --> 00:45:31,520
tuning.

638
00:45:31,520 --> 00:45:34,040
But yeah, how you acquire the data had a big impact.

639
00:45:34,040 --> 00:45:38,640
Like, okay, use different angle of the camera, different lighting, and now everything is

640
00:45:38,640 --> 00:45:42,360
super easy to spot these mistakes in the production.

641
00:45:42,360 --> 00:45:47,400
That's nice thinking where you don't think, you get the data and then you try to, I don't

642
00:45:47,400 --> 00:45:50,240
know, clean the data or tune the model.

643
00:45:50,240 --> 00:45:54,600
But yeah, even a step before, like how do you acquire the data?

644
00:45:54,600 --> 00:45:55,600
Right.

645
00:45:55,600 --> 00:46:03,440
Just, yeah, figuring out different ways of, you know, what are the inputs going to be,

646
00:46:03,440 --> 00:46:05,560
you know, for the model.

647
00:46:05,560 --> 00:46:11,240
That's a very interesting way to augment data for computer vision.

648
00:46:11,240 --> 00:46:17,560
With so many things going on in machine learning and it being such a rapidly evolving field,

649
00:46:17,560 --> 00:46:22,160
how do you stay, you know, up to date with the latest developments and techniques in

650
00:46:22,160 --> 00:46:23,160
the field?

651
00:46:23,160 --> 00:46:27,000
Yeah, that's a challenging one.

652
00:46:27,000 --> 00:46:35,040
I try, I do not stress too much about like trying to stay up to date with latest techniques

653
00:46:35,040 --> 00:46:41,600
because a lot of things or majority of things that are published are kind of irrelevant.

654
00:46:41,600 --> 00:46:47,680
Every month you have like thousands of papers on arXiv, but only if you do like a retrospect

655
00:46:47,680 --> 00:46:53,400
what has been relevant papers last year, you can break it down to 20, maybe 50 papers or

656
00:46:53,400 --> 00:46:54,400
so.

657
00:46:54,400 --> 00:47:01,400
So you don't have to read all the thousands of papers that are uploaded to arXiv and

658
00:47:01,400 --> 00:47:08,880
what's relevant will be resurfaced in some form because there's maybe some follow-on paper

659
00:47:08,880 --> 00:47:12,160
using this old technique because this old technique works well.

660
00:47:12,160 --> 00:47:13,720
And then you say, okay, cool.

661
00:47:13,720 --> 00:47:15,720
There's like some follow-on work.

662
00:47:15,720 --> 00:47:19,720
And in general, what works for me is Twitter.

663
00:47:19,720 --> 00:47:25,440
So see what are people talking about, what are people tweeting.

664
00:47:25,440 --> 00:47:31,160
Open papers, just read the, I usually just read the title and look at figure one or table

665
00:47:31,160 --> 00:47:38,560
one, maybe the abstract and then just get like a gist of like, okay, that could be interesting

666
00:47:38,560 --> 00:47:44,440
or feeling and then over time, maybe at some point we read some paper and say, oh yeah,

667
00:47:44,440 --> 00:47:48,960
there was some old paper that had this cool figure.

668
00:47:48,960 --> 00:47:54,840
And then yeah, sadly back to a search problem, trying to really find those papers still really,

669
00:47:54,840 --> 00:47:55,840
really challenging.

670
00:47:55,840 --> 00:47:56,840
And it's not a-

671
00:47:56,840 --> 00:47:57,840
You could use a normal search.

672
00:47:57,840 --> 00:47:58,840
Yeah.

673
00:47:58,840 --> 00:48:03,760
I mean, Google search doesn't work that well where you say, yeah, I read some time half

674
00:48:03,760 --> 00:48:11,560
a year ago an arXiv paper, which in figure one was showing how you can modify kind of, I

675
00:48:11,560 --> 00:48:16,320
don't know, momentum in the optimizer, can you please show me this paper again?

676
00:48:16,320 --> 00:48:19,400
It goes back to prompt engineering, right?

677
00:48:19,400 --> 00:48:20,400
Yeah.

678
00:48:20,400 --> 00:48:21,400
Yeah.

679
00:48:21,400 --> 00:48:22,400
I mean, yeah.

680
00:48:22,400 --> 00:48:28,720
I mean, many, many people compare search query formulation with prompt engineering.

681
00:48:28,720 --> 00:48:33,000
So we, I don't know, when Google launched, people didn't know how to use it in the beginning

682
00:48:33,000 --> 00:48:36,440
and they had to learn what's the right query to write.

683
00:48:36,440 --> 00:48:41,040
And yeah, that's similar with prompt engineering of generative models.

684
00:48:41,040 --> 00:48:42,040
Yeah.

685
00:48:42,040 --> 00:48:43,040
But-

686
00:48:43,040 --> 00:48:48,560
Prompting Google in the right way can get you pretty far in this world right now.

687
00:48:48,560 --> 00:48:50,560
Yeah.

688
00:48:50,560 --> 00:48:56,720
In your machine learning journey, what's one piece of advice that you received that has

689
00:48:56,720 --> 00:49:02,440
helped you or stuck with you?

690
00:49:02,440 --> 00:49:10,320
One piece of advice that stuck with me or helped me.

691
00:49:10,320 --> 00:49:11,760
The next question is going to be harder.

692
00:49:11,760 --> 00:49:12,760
So-

693
00:49:12,760 --> 00:49:14,760
The next question is going to be harder.

694
00:49:14,760 --> 00:49:15,760
Yeah.

695
00:49:15,760 --> 00:49:20,680
I didn't reflect on that question.

696
00:49:20,680 --> 00:49:28,240
So yeah, I cannot really narrow it down to like a single piece of advice.

697
00:49:28,240 --> 00:49:34,200
So what I'm more, what I did a lot of in the past years is more be closely connected to

698
00:49:34,200 --> 00:49:42,560
the community, see what are common questions they have where there's no answer for this.

699
00:49:42,560 --> 00:49:48,080
So for example, a common question after launching Sentence Transformers, which only worked in

700
00:49:48,080 --> 00:49:52,000
English, was like, hey, that looks promising.

701
00:49:52,000 --> 00:49:55,120
I want to use it for another language.

702
00:49:55,120 --> 00:49:56,800
And there was like no answer for this.

703
00:49:56,800 --> 00:49:59,680
And I got this question over and over and over again.

704
00:49:59,680 --> 00:50:03,120
I said, oh yeah, if there's no answer, that's a great research question.

705
00:50:03,120 --> 00:50:04,440
Let's do research on this.

706
00:50:04,440 --> 00:50:06,360
And so we did cool research.

707
00:50:06,360 --> 00:50:12,000
And then another common question was like a lot of people, yeah, I want to use this for,

708
00:50:12,000 --> 00:50:17,960
I don't know, searching in CVs, but I don't have any training data.

709
00:50:17,960 --> 00:50:21,040
So how can I tune this model without training data?

710
00:50:21,040 --> 00:50:24,040
The answer was then, oh, you need training data.

711
00:50:24,040 --> 00:50:26,120
It's not possible without training data.

712
00:50:26,120 --> 00:50:31,560
So I started to do a lot of research last year, how to train models without labeled

713
00:50:31,560 --> 00:50:32,560
training data.

714
00:50:32,560 --> 00:50:40,080
So it's more like be connected to the field, see what are the challenges and find.

715
00:50:40,080 --> 00:50:43,640
Yeah, that's good advice.

716
00:50:43,640 --> 00:50:50,040
In a similar vein, for someone who is just starting out in data science or machine learning or

717
00:50:50,040 --> 00:50:55,840
say is making the transition from another field, what advice would you give them?

718
00:50:55,840 --> 00:51:03,680
Depends a bit on the roles, if they more go into like PhD role, research role, or if they

719
00:51:03,680 --> 00:51:06,360
more go into like product role, want to ship a product.

720
00:51:06,360 --> 00:51:09,120
Say industry, say they want to go into industry.

721
00:51:09,120 --> 00:51:11,720
Let's say go into industry.

722
00:51:11,720 --> 00:51:17,800
First I would say, do not trust everything that you read in papers for both roles.

723
00:51:17,800 --> 00:51:24,120
So often the best approaches are not the newest approaches, like don't try to get

724
00:51:24,120 --> 00:51:29,600
the state of the art approach for whatever problem you try to solve.

725
00:51:29,600 --> 00:51:36,320
But the early on approaches, there's often like the first iteration on something like,

726
00:51:36,320 --> 00:51:38,680
I don't know, how to generate images.

727
00:51:38,680 --> 00:51:41,720
And then there's like a second iteration and a third iteration.

728
00:51:41,720 --> 00:51:45,080
And this is often the best kind of like model.

729
00:51:45,080 --> 00:51:50,280
And then at some point people start to overfit on the benchmarks and create like systems

730
00:51:50,280 --> 00:51:57,200
that are complex and overfitted and unstable and not robust and not efficient just to beat

731
00:51:57,200 --> 00:51:58,200
these benchmarks.

732
00:51:58,200 --> 00:52:03,800
So I try to find out the sweet spot where we really make progress and then find a solution

733
00:52:03,800 --> 00:52:05,680
that's easy.

734
00:52:05,680 --> 00:52:12,760
And in general, yeah, a lot of testing, find ways, quick ways to test your hypothesis.

735
00:52:12,760 --> 00:52:16,400
Don't try to be too clever; go with simple approaches.

736
00:52:16,400 --> 00:52:21,960
Often simple ways, simple approaches bring you a lot further than super complex methods

737
00:52:21,960 --> 00:52:22,960
and ways.

738
00:52:22,960 --> 00:52:25,680
Yeah, I think that that's really good advice.

739
00:52:25,680 --> 00:52:28,280
Start, start with the foundation.

740
00:52:28,280 --> 00:52:31,440
Here's the tricky question.

741
00:52:31,440 --> 00:52:41,280
What advice would you give yourself when you were just starting out in your career?

742
00:52:41,280 --> 00:52:43,680
That's also a good question.

743
00:52:43,680 --> 00:52:44,680
Thank you.

744
00:52:44,680 --> 00:52:52,440
Yeah, so I would say, yeah, go early on with like the user centric research.

745
00:52:52,440 --> 00:53:00,480
I'm a big fan of user centric research, which means do research that actually helps people

746
00:53:00,480 --> 00:53:05,160
and make really sure to release something that's helpful for others.

747
00:53:05,160 --> 00:53:12,160
So a lot of researchers, including myself in the beginning, were just saying, hey, we

748
00:53:12,160 --> 00:53:16,600
just published this paper and the work is done if the paper is accepted at some conference.

749
00:53:16,600 --> 00:53:21,440
I say, no, that's like, I don't know, that's not really the purpose.

750
00:53:21,440 --> 00:53:24,360
We want to find a problem that's big.

751
00:53:24,360 --> 00:53:28,760
And that's really challenging to find.

752
00:53:28,760 --> 00:53:33,320
So partner up with someone experienced to see these big problems and then create like

753
00:53:33,320 --> 00:53:40,600
a really nice solution for this and do the research, but also make the results accessible

754
00:53:40,600 --> 00:53:44,200
in a really simple and easy way.

755
00:53:44,200 --> 00:53:45,200
I like that.

756
00:53:45,200 --> 00:53:52,680
I think the way that I view research is basically there's a big puzzle in front of us and you're

757
00:53:52,680 --> 00:53:56,720
working on a single puzzle piece sometimes.

758
00:53:56,720 --> 00:54:01,720
And I think you can sometimes lose sight of the so what.

759
00:54:01,720 --> 00:54:06,840
So why should someone care outside of this field?

760
00:54:06,840 --> 00:54:11,880
And that can kind of make your work more relatable and allow other people in.

761
00:54:11,880 --> 00:54:18,680
And that's always a way to sort of enhance the work that you're doing.

762
00:54:18,680 --> 00:54:22,840
I mean, accessibility is a big issue in machine learning.

763
00:54:22,840 --> 00:54:26,760
So we have so many papers on, for example, optimizers.

764
00:54:26,760 --> 00:54:34,000
Like, I don't know, every month someone publishes an optimizer that's much better than Adam.

765
00:54:34,000 --> 00:54:38,440
But I don't know, everyone is still using Adam optimizer, which is already, I don't

766
00:54:38,440 --> 00:54:41,080
know, five years old or so or older.

767
00:54:41,080 --> 00:54:45,800
Yeah, there's some funny memes about that, like Adam asking why me or something like

768
00:54:45,800 --> 00:54:46,800
that.

769
00:54:46,800 --> 00:54:48,560
I don't know, how old is Adam?

770
00:54:48,560 --> 00:54:49,560
It's from 2014.

771
00:54:49,560 --> 00:54:53,720
So it's like nine years old.

772
00:54:53,720 --> 00:54:58,960
And yeah, in these nine years, there have probably been hundreds of papers

773
00:54:58,960 --> 00:55:04,080
on better optimizers than Adam, but no one is using them.

774
00:55:04,080 --> 00:55:07,120
And I think one big issue is accessibility.

775
00:55:07,120 --> 00:55:13,840
So maybe they have found like a better optimizer than Adam, but they did not make it accessible.

776
00:55:13,840 --> 00:55:19,000
And really making it accessible means having it like nicely, efficiently

777
00:55:19,000 --> 00:55:26,240
implemented and available in like common frameworks like TensorFlow and PyTorch and JAX and integrated

778
00:55:26,240 --> 00:55:29,480
in libraries like, I don't know, Hugging Face Transformers.

779
00:55:29,480 --> 00:55:35,260
So make it extremely simple for people to use it, test it out.

780
00:55:35,260 --> 00:55:41,680
And then often you see it, if you do it, yeah, it works on these few example use cases, you

781
00:55:41,680 --> 00:55:42,680
tested it.

782
00:55:42,680 --> 00:55:48,200
But if you take all the users of, I don't know, TensorFlow, you will see, yeah, maybe

783
00:55:48,200 --> 00:55:55,320
it doesn't work for 98% of the users of TensorFlow, it just works for like a tiny slice of users.

784
00:55:55,320 --> 00:55:59,840
So then you can do research, okay, how do you make it more broadly suitable?

785
00:55:59,840 --> 00:56:01,120
So how can you increase it?

786
00:56:01,120 --> 00:56:06,120
Or how can you better predict for which 2% of users is it actually like a benefit to

787
00:56:06,120 --> 00:56:09,160
use this new optimizer?

788
00:56:09,160 --> 00:56:14,480
For listeners that are just getting involved in machine learning, you know, what is an

789
00:56:14,480 --> 00:56:15,480
optimizer?

790
00:56:15,480 --> 00:56:16,480
How would you explain that to someone?

791
00:56:16,480 --> 00:56:17,480
Sure.

792
00:56:17,480 --> 00:56:27,240
So the optimizer is the fundamental way we train the model. So we take a model, give

793
00:56:27,240 --> 00:56:32,360
an input like an email, and then ask it, hey, do you think this is like a spam email or is

794
00:56:32,360 --> 00:56:33,360
it ham email?

795
00:56:33,360 --> 00:56:37,200
And then the model says, no, I think, yeah, that looks legit.

796
00:56:37,200 --> 00:56:40,880
You want to buy some medication over the internet?

797
00:56:40,880 --> 00:56:41,880
That sounds good.

798
00:56:41,880 --> 00:56:44,280
And then you have a label saying, no, no, no, this is spam.

799
00:56:44,280 --> 00:56:46,800
I don't want to see this in my inbox.

800
00:56:46,800 --> 00:56:49,520
And then you say, okay, there's like a mismatch.

801
00:56:49,520 --> 00:56:54,360
There's a difference between what the model predicted and what's the correct answer.

802
00:56:54,360 --> 00:56:59,640
And then the optimizer can bring it back to the input, say, okay, which words did you

803
00:56:59,640 --> 00:57:03,800
think made it look legit?

804
00:57:03,800 --> 00:57:09,920
And how can I modify the weights so that the next time you see this example, the email

805
00:57:09,920 --> 00:57:13,520
will be correctly classified as a spam email.

806
00:57:13,520 --> 00:57:14,520
That's great.

807
00:57:14,520 --> 00:57:15,520
Yeah, thank you for that.

808
00:57:15,520 --> 00:57:19,760
That's a good explanation for optimizers.
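
The optimizer loop described above (predict, compare with the label, adjust the weights) can be sketched roughly as follows. This is a minimal illustration, not anything from the episode: the vocabulary, the two example emails, the learning rate, and all function names are made up, and it uses plain gradient descent on a logistic regression model rather than Adam.

```python
import math

# Hypothetical toy data: 1 = spam, 0 = ham. Values are invented for illustration.
VOCAB = ["buy", "medication", "meeting", "tomorrow"]
EMAILS = [
    ("buy medication", 1),
    ("meeting tomorrow", 0),
]

def featurize(text):
    """Bag-of-words vector: 1.0 if the vocabulary word occurs, else 0.0."""
    words = text.split()
    return [1.0 if w in words else 0.0 for w in VOCAB]

def predict(weights, bias, x):
    """Probability that the email is spam (logistic regression)."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(weights, bias, x, label, lr=0.5):
    """One plain gradient-descent update on the log loss for one example."""
    # Mismatch between what the model predicted and the correct answer.
    error = predict(weights, bias, x) - label
    # Only words present in the email (xi = 1.0) have their weights changed.
    new_weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    new_bias = bias - lr * error
    return new_weights, new_bias

def train(epochs=50):
    """Repeat the update over all emails, starting from zero weights."""
    weights, bias = [0.0] * len(VOCAB), 0.0
    for _ in range(epochs):
        for text, label in EMAILS:
            weights, bias = sgd_step(weights, bias, featurize(text), label)
    return weights, bias

weights, bias = train()
spam_prob = predict(weights, bias, featurize("buy medication"))
ham_prob = predict(weights, bias, featurize("meeting tomorrow"))
```

After training, spam_prob should sit above 0.5 and ham_prob below it. Optimizers such as Adam follow the same predict, compare, update cycle but scale each weight's step using running statistics of past gradients.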

809
00:57:19,760 --> 00:57:30,200
So getting to the conclusion, what has a career in machine learning taught you about life?

810
00:57:30,200 --> 00:57:37,400
Career in machine learning taught me about life.

811
00:57:37,400 --> 00:57:38,400
That's what we're all here for.

812
00:57:38,400 --> 00:57:39,400
Yeah.

813
00:57:39,400 --> 00:57:43,120
I mean, it's more what has life taught me about machine learning.

814
00:57:43,120 --> 00:57:44,120
That's the easier one.

815
00:57:44,120 --> 00:57:47,400
Well, if you want to start with that, you can.

816
00:57:47,400 --> 00:57:48,400
Yeah.

817
00:57:48,400 --> 00:57:50,160
Oh yeah.

818
00:57:50,160 --> 00:57:51,960
What has machine learning taught me about life?

819
00:57:51,960 --> 00:57:59,160
So I'm a parent of two kids, like right now, so like two and four years old.

820
00:57:59,160 --> 00:58:06,440
And yeah, you kind of see it like as your model, learning, improving, making mistakes

821
00:58:06,440 --> 00:58:09,560
at the beginning and then improving, and you provide feedback.

822
00:58:09,560 --> 00:58:14,400
So you're kind of like the gradient update and optimizer to your kids.

823
00:58:14,400 --> 00:58:23,200
And it's kind of interesting for me, at least to see the things which are like what's similar

824
00:58:23,200 --> 00:58:29,200
in machine learning, trying to improve the model and how you as a parent try to raise

825
00:58:29,200 --> 00:58:31,160
your kids.

826
00:58:31,160 --> 00:58:35,400
And yeah, I'm still amazed by the learning capabilities of humans.

827
00:58:35,400 --> 00:58:41,120
I mean, even at a young age, I don't know, when they are like two or three years old,

828
00:58:41,120 --> 00:58:42,920
you can invent names.

829
00:58:42,920 --> 00:58:48,520
So you take like some toy or some stuff or some concept and you create like a fake name

830
00:58:48,520 --> 00:58:53,320
for it and they can talk, use this name and start to reason about the name.

831
00:58:53,320 --> 00:59:00,480
So if this is called like this invented name, then this must be some other invented name.

832
00:59:00,480 --> 00:59:07,320
It's like extremely interesting how quickly kids pick up language and are able to draw

833
00:59:07,320 --> 00:59:11,040
these conclusions and reason about this.

834
00:59:11,040 --> 00:59:18,960
And yeah, if you try this, even with the smartest GPT, ChatGPT model and say, hey, please call

835
00:59:18,960 --> 00:59:26,040
my car, whatever, call my car, John, it's not really able to learn this and not able

836
00:59:26,040 --> 00:59:28,400
to reason about this.

837
00:59:28,400 --> 00:59:30,200
Right.

838
00:59:30,200 --> 00:59:31,200
Super interesting.

839
00:59:31,200 --> 00:59:38,360
So yeah, thinking of children and how reinforcement learning is influencing their behavior and

840
00:59:38,360 --> 00:59:44,720
how they're representing the world around them, just like the models that we're trying

841
00:59:44,720 --> 00:59:48,080
to train.

842
00:59:48,080 --> 00:59:55,600
I'm going to butcher his name, but François Chollet, I love some of his tweets.

843
00:59:55,600 --> 01:00:04,280
He talks about raising children and how it's like training a machine learning model.

844
01:00:04,280 --> 01:00:05,280
That's great.

845
01:00:05,280 --> 01:00:08,860
That's such a great take on it.

846
01:00:08,860 --> 01:00:15,260
So just to wrap things up, if there are listeners that want to learn more about you, where could

847
01:00:15,260 --> 01:00:17,360
they go to learn more about you?

848
01:00:17,360 --> 01:00:18,360
Yeah.

849
01:00:18,360 --> 01:00:25,040
So when you Google my name, you can find my personal website, nils-reimers.de.

850
01:00:25,040 --> 01:00:29,040
You can also find my Google Scholar profile about research.

851
01:00:29,040 --> 01:00:35,320
And yeah, you can also watch cohere.ai about work we're doing on like semantic search

852
01:00:35,320 --> 01:00:38,680
and text understanding and bring it to production.

853
01:00:38,680 --> 01:00:41,520
So really now more focused.

854
01:00:41,520 --> 01:00:45,640
I wanted to move away a bit from the research side more to the production side.

855
01:00:45,640 --> 01:00:53,200
So how can we deploy these systems and face the challenges from nicely clean research

856
01:00:53,200 --> 01:00:58,320
benchmarks to, okay, I actually put it into production and see all the challenges you

857
01:00:58,320 --> 01:01:04,400
have with ugly, noisy data in a production setting and how can you ensure that your system

858
01:01:04,400 --> 01:01:07,760
still works well and nicely in these settings.

859
01:01:07,760 --> 01:01:08,760
Right.

860
01:01:08,760 --> 01:01:13,680
Niels, it was such a pleasure to have you on Learning from Machine Learning.

861
01:01:13,680 --> 01:01:15,680
Thank you so much for being here.

862
01:01:15,680 --> 01:01:16,680
Likewise.

863
01:01:16,680 --> 01:01:26,840
It was great chatting with you.

864
01:01:26,840 --> 01:01:29,760
Thank you for listening to Learning from Machine Learning.

865
01:01:29,760 --> 01:01:35,200
This episode featured an expert in natural language processing, Niels Reimers, the creator

866
01:01:35,200 --> 01:01:40,240
of Sentence Transformers and currently the director of machine learning at Cohere.

867
01:01:40,240 --> 01:01:44,600
Be sure to check out the show notes to learn more about this podcast and some of the topics

868
01:01:44,600 --> 01:01:45,600
discussed.

869
01:01:45,600 --> 01:02:00,680
Talk soon and keep on learning.