1
00:00:00,000 --> 00:00:02,600
Welcome to the AI Papers Podcast Daily.

2
00:00:02,600 --> 00:00:06,680
Today we're diving deep into this paper

3
00:00:06,680 --> 00:00:09,280
that's making waves, "Thinking LLMs:

4
00:00:09,280 --> 00:00:12,160
General Instruction Following with Thought Generation."

5
00:00:12,160 --> 00:00:15,120
It's tackling a really intriguing question.

6
00:00:15,120 --> 00:00:18,040
Can we teach AI to think like we do?

7
00:00:18,040 --> 00:00:21,880
So this paper focuses on large language models,

8
00:00:21,880 --> 00:00:24,800
or LLMs, which are essentially powerful AIs

9
00:00:24,800 --> 00:00:27,880
trained on massive amounts of text data.

10
00:00:27,880 --> 00:00:29,680
What's fascinating is that these LLMs

11
00:00:29,680 --> 00:00:34,400
usually have a fixed budget for processing information,

12
00:00:34,400 --> 00:00:36,800
regardless of the task's complexity.

13
00:00:36,800 --> 00:00:38,880
So whether it's a simple question like,

14
00:00:38,880 --> 00:00:40,080
what's the capital of France?

15
00:00:40,080 --> 00:00:40,680
Exactly.

16
00:00:40,680 --> 00:00:43,000
Or a complex one requiring in-depth analysis.

17
00:00:43,000 --> 00:00:43,600
Yep.

18
00:00:43,600 --> 00:00:46,120
The LLM uses the same amount of computational power.

19
00:00:46,120 --> 00:00:47,520
That doesn't seem very efficient, does it?

20
00:00:47,520 --> 00:00:48,440
Exactly.

21
00:00:48,440 --> 00:00:49,120
Think about it.

22
00:00:49,120 --> 00:00:50,720
When you face a challenging problem,

23
00:00:50,720 --> 00:00:53,120
you don't just blurt out the first answer that comes to mind.

24
00:00:53,120 --> 00:00:53,680
Right.

25
00:00:53,680 --> 00:00:55,960
You take time to think, maybe jot down some notes,

26
00:00:55,960 --> 00:00:59,880
weigh different options before arriving at a well-reasoned solution.

27
00:00:59,880 --> 00:01:04,160
Traditional LLMs lack that crucial internal thinking stage.

28
00:01:04,160 --> 00:01:06,520
And that's where this paper comes in.

29
00:01:06,520 --> 00:01:11,000
It introduces the concept of thinking LLMs,

30
00:01:11,000 --> 00:01:15,040
models that are trained to generate internal thoughts

31
00:01:15,040 --> 00:01:18,880
in plain language before giving a response.

32
00:01:18,880 --> 00:01:22,800
It's like the AI is having a silent conversation with itself,

33
00:01:22,800 --> 00:01:26,080
brainstorming and refining its ideas before presenting

34
00:01:26,080 --> 00:01:27,040
the final output.

35
00:01:27,040 --> 00:01:27,800
Precisely.

36
00:01:27,800 --> 00:01:29,880
And this thinking happens behind the scenes,

37
00:01:29,880 --> 00:01:32,520
so the user only sees the polished final answer.

38
00:01:32,520 --> 00:01:33,040
Gotcha.

39
00:01:33,040 --> 00:01:34,920
But researchers can peek into this process

40
00:01:34,920 --> 00:01:38,000
to gain insights into how the AI is reasoning and problem

41
00:01:38,000 --> 00:01:38,880
solving.

42
00:01:38,880 --> 00:01:39,680
That's really cool.

43
00:01:39,680 --> 00:01:43,240
But how do you actually teach an AI to think in the first place?

44
00:01:43,240 --> 00:01:45,720
It's not like we can just tell it to start thinking

45
00:01:45,720 --> 00:01:47,520
and expect it to magically work, right?

46
00:01:47,520 --> 00:01:48,000
Right.

47
00:01:48,000 --> 00:01:49,840
This is where the paper gets really interesting.

48
00:01:49,840 --> 00:01:50,320
OK.

49
00:01:50,320 --> 00:01:52,520
They've developed a novel training method

50
00:01:52,520 --> 00:01:56,040
called Thought Preference Optimization, or TPO.

51
00:01:56,040 --> 00:01:56,760
TPO.

52
00:01:56,760 --> 00:01:59,920
Essentially, they train the LLM to generate multiple thought-response

53
00:01:59,920 --> 00:02:01,720
pairs for a given instruction.

54
00:02:01,720 --> 00:02:03,760
So it's like the AI is coming up with different approaches

55
00:02:03,760 --> 00:02:04,640
to solve a problem.

56
00:02:04,640 --> 00:02:05,120
Exactly.

57
00:02:05,120 --> 00:02:08,160
Then a separate judge model comes into play.

58
00:02:08,160 --> 00:02:09,240
OK.

59
00:02:09,240 --> 00:02:11,440
This judge is trained to evaluate

60
00:02:11,440 --> 00:02:13,280
the quality of the responses.

61
00:02:13,280 --> 00:02:14,280
But here's the key.

62
00:02:14,280 --> 00:02:17,120
It completely ignores the internal thoughts.

63
00:02:17,120 --> 00:02:17,640
Interesting.

64
00:02:17,640 --> 00:02:20,440
It only focuses on the final output.

65
00:02:20,440 --> 00:02:23,200
So the AI's thought process is a black box.

66
00:02:23,200 --> 00:02:25,320
And the judge only cares about the end result.

67
00:02:25,320 --> 00:02:27,240
That's a clever approach.

68
00:02:27,240 --> 00:02:30,520
But how does that feedback loop actually teach the AI

69
00:02:30,520 --> 00:02:31,480
to think better?

70
00:02:31,480 --> 00:02:35,560
The magic lies in a technique called preference optimization.

71
00:02:35,560 --> 00:02:37,880
Based on the judge's feedback, the LLM

72
00:02:37,880 --> 00:02:41,160
learns to adjust its internal thinking process

73
00:02:41,160 --> 00:02:43,760
to produce better and better responses.

74
00:02:43,760 --> 00:02:44,360
So like.

75
00:02:44,360 --> 00:02:46,520
It's like a chef trying different recipes in the kitchen.

76
00:02:46,520 --> 00:02:46,840
OK.

77
00:02:46,840 --> 00:02:49,480
But only the food critics get to taste the final dish.

78
00:02:49,480 --> 00:02:50,640
That's a great analogy.

79
00:02:50,640 --> 00:02:52,680
The chef doesn't reveal their process,

80
00:02:52,680 --> 00:02:55,520
but they learn what works based on the critics' feedback.

81
00:02:55,520 --> 00:02:56,000
Yeah.

82
00:02:56,000 --> 00:02:57,880
So in this case, the LLM is the chef.

83
00:02:57,880 --> 00:02:59,520
The internal thoughts are the recipes.

84
00:02:59,520 --> 00:03:01,840
And the judge is like the food critic.

85
00:03:01,840 --> 00:03:02,280
Right.

86
00:03:02,280 --> 00:03:02,880
That's it.

87
00:03:02,880 --> 00:03:05,680
And what's remarkable about TPO is that it doesn't need

88
00:03:05,680 --> 00:03:07,040
any labeled thought data.

89
00:03:07,040 --> 00:03:07,400
Right.

90
00:03:07,400 --> 00:03:10,720
Which is incredibly difficult and expensive to collect.

91
00:03:10,720 --> 00:03:13,840
The AI is free to explore different thought patterns

92
00:03:13,840 --> 00:03:16,840
and figure out what leads to the best results on its own.

93
00:03:16,840 --> 00:03:18,880
This is fascinating stuff.

94
00:03:18,880 --> 00:03:22,320
But did they actually put this TPO method to the test?

95
00:03:22,320 --> 00:03:26,000
Did they see any real improvements in how the AI performs?

96
00:03:26,000 --> 00:03:27,040
They did.

97
00:03:27,040 --> 00:03:31,200
They ran experiments using a Llama 3 8B Instruct model.

98
00:03:31,200 --> 00:03:31,600
OK.

99
00:03:31,600 --> 00:03:35,640
And tested it on benchmarks like AlpacaEval and Arena-Hard,

100
00:03:35,640 --> 00:03:39,240
which measure how well an AI can follow general instructions.

101
00:03:39,240 --> 00:03:42,800
So they basically gave the thinking LLM a series of tasks.

102
00:03:42,800 --> 00:03:43,200
Yeah.

103
00:03:43,200 --> 00:03:45,680
And compared its performance to traditional LLMs.

104
00:03:45,680 --> 00:03:46,080
Yes.

105
00:03:46,080 --> 00:03:47,640
And the results were quite impressive.

106
00:03:47,640 --> 00:03:50,040
At first, the thinking LLMs actually performed worse

107
00:03:50,040 --> 00:03:51,400
than the standard LLMs.

108
00:03:51,400 --> 00:03:51,920
Really?

109
00:03:51,920 --> 00:03:53,240
Which is to be expected.

110
00:03:53,240 --> 00:03:55,720
It's like trying to solve a complex math problem in your head

111
00:03:55,720 --> 00:03:57,480
when you're used to writing down all the steps.

112
00:03:57,480 --> 00:03:57,680
Yeah.

113
00:03:57,680 --> 00:04:00,560
It takes time to adjust to that new way of thinking.

114
00:04:00,560 --> 00:04:03,480
But after several rounds of training with TPO,

115
00:04:03,480 --> 00:04:07,080
these thinking LLMs started outperforming the baseline models.

116
00:04:07,080 --> 00:04:10,920
So it's like with any new skill: practice makes perfect.

117
00:04:10,920 --> 00:04:11,200
Right.

118
00:04:11,200 --> 00:04:13,440
And the AI is learning to think in a way that actually

119
00:04:13,440 --> 00:04:14,880
boosts its effectiveness.

120
00:04:14,880 --> 00:04:15,160
Yeah.

121
00:04:15,160 --> 00:04:15,800
It's amazing.

122
00:04:15,800 --> 00:04:17,640
What kind of improvements did they see?

123
00:04:17,640 --> 00:04:22,520
The TPO-trained model achieved a win rate of 52.5%

124
00:04:22,520 --> 00:04:26,480
on AlpacaEval, putting it in third place on the leaderboard,

125
00:04:26,480 --> 00:04:29,400
just behind much larger models like GPT-4.

126
00:04:29,400 --> 00:04:29,760
Really?

127
00:04:29,760 --> 00:04:35,240
And on Arena-Hard, it hit a 37.3% win rate,

128
00:04:35,240 --> 00:04:36,720
exceeding expectations.

129
00:04:36,720 --> 00:04:37,040
Wow.

130
00:04:37,040 --> 00:04:38,600
Those are some compelling results,

131
00:04:38,600 --> 00:04:40,960
especially considering it was up against much larger

132
00:04:40,960 --> 00:04:43,520
and more resource-intensive models.

133
00:04:43,520 --> 00:04:46,160
It seems like this TPO method could really be a game changer

134
00:04:46,160 --> 00:04:47,800
in the field.

135
00:04:47,800 --> 00:04:49,240
But earlier, you mentioned something

136
00:04:49,240 --> 00:04:51,160
about the AI using natural language

137
00:04:51,160 --> 00:04:52,760
for its internal thoughts.

138
00:04:52,760 --> 00:04:53,920
Can you elaborate on that?

139
00:04:53,920 --> 00:04:55,160
Why is that so important?

140
00:04:55,160 --> 00:04:56,640
Wouldn't it be more efficient for the AI

141
00:04:56,640 --> 00:04:59,440
to think in some kind of code?

142
00:04:59,440 --> 00:05:00,400
That's a great question.

143
00:05:00,400 --> 00:05:02,720
Using natural language for thought generation

144
00:05:02,720 --> 00:05:03,920
offers several advantages.

145
00:05:03,920 --> 00:05:04,240
Right.

146
00:05:04,240 --> 00:05:06,760
Remember, these LLMs are trained on vast amounts

147
00:05:06,760 --> 00:05:10,880
of human-written text, like books, articles, and code.

148
00:05:10,880 --> 00:05:13,080
By thinking in natural language,

149
00:05:13,080 --> 00:05:16,040
they can tap into the patterns, structures, and knowledge

150
00:05:16,040 --> 00:05:17,200
embedded in that data.

151
00:05:17,200 --> 00:05:20,280
So it's like the AI is using the same language we use

152
00:05:20,280 --> 00:05:23,040
to communicate and understand the world, which

153
00:05:23,040 --> 00:05:26,000
gives it a richer foundation for its thinking process.

154
00:05:26,000 --> 00:05:27,160
Exactly.

155
00:05:27,160 --> 00:05:29,880
It's not just manipulating symbols in a vacuum.

156
00:05:29,880 --> 00:05:32,080
It's drawing on the same linguistic tools

157
00:05:32,080 --> 00:05:34,680
we use to reason, plan, and create.

158
00:05:34,680 --> 00:05:37,240
This also makes the AI's thinking process

159
00:05:37,240 --> 00:05:40,000
more transparent and interpretable.

160
00:05:40,000 --> 00:05:43,600
Researchers can actually see how the AI connects ideas,

161
00:05:43,600 --> 00:05:46,960
explores different options, and arrives at its conclusions.

162
00:05:46,960 --> 00:05:49,160
That's crucial for building trust in AI systems.

163
00:05:49,160 --> 00:05:49,800
Right.

164
00:05:49,800 --> 00:05:52,240
If we can understand how an AI is making decisions,

165
00:05:52,240 --> 00:05:55,680
we're more likely to accept and rely on its output.

166
00:05:55,680 --> 00:05:57,880
But you mentioned earlier that these internal thoughts

167
00:05:57,880 --> 00:05:59,080
are hidden from the user.

168
00:05:59,080 --> 00:05:59,760
Right.

169
00:05:59,760 --> 00:06:01,160
So while researchers might benefit

170
00:06:01,160 --> 00:06:02,680
from this interpretability, wouldn't it

171
00:06:02,680 --> 00:06:05,320
be helpful for users to see the AI's thought process as well?

172
00:06:05,320 --> 00:06:06,080
That's a great point.

173
00:06:06,080 --> 00:06:08,400
There's a lot of debate about the level of transparency

174
00:06:08,400 --> 00:06:10,800
that's appropriate for different AI applications.

175
00:06:10,800 --> 00:06:11,240
Yeah.

176
00:06:11,240 --> 00:06:13,600
In some cases, like medical diagnoses,

177
00:06:13,600 --> 00:06:18,040
it might be crucial for users to see the AI's reasoning steps

178
00:06:18,040 --> 00:06:20,560
to ensure fairness and accountability.

179
00:06:20,560 --> 00:06:22,560
So in those situations, it's not just

180
00:06:22,560 --> 00:06:23,840
about getting the right answer.

181
00:06:23,840 --> 00:06:26,000
It's about understanding how the AI got there.

182
00:06:26,000 --> 00:06:27,440
Exactly.

183
00:06:27,440 --> 00:06:30,920
But for tasks like writing a poem,

184
00:06:30,920 --> 00:06:33,920
the user might prefer a more streamlined experience focusing

185
00:06:33,920 --> 00:06:37,080
on the final output without being bogged down

186
00:06:37,080 --> 00:06:39,280
by the AI's internal deliberations.

187
00:06:39,280 --> 00:06:42,000
It seems like it's about striking a balance between transparency

188
00:06:42,000 --> 00:06:44,160
and user experience.

189
00:06:44,160 --> 00:06:48,000
And the right approach might vary depending on the task.

190
00:06:48,000 --> 00:06:50,000
Now, one thing that really stood out to me in the paper

191
00:06:50,000 --> 00:06:53,080
was the finding that thinking LLMs initially performed

192
00:06:53,080 --> 00:06:54,840
worse than standard LLMs.

193
00:06:54,840 --> 00:06:55,360
Right.

194
00:06:55,360 --> 00:06:58,320
Especially on tasks involving logic and reasoning,

195
00:06:58,320 --> 00:06:59,400
like math problems.

196
00:06:59,400 --> 00:07:00,320
Why is that?

197
00:07:00,320 --> 00:07:03,480
It's interesting because even though TPO showed promise

198
00:07:03,480 --> 00:07:05,920
in general instruction following,

199
00:07:05,920 --> 00:07:07,760
it did struggle with math.

200
00:07:07,760 --> 00:07:10,920
The researchers believe that their experimental setup wasn't

201
00:07:10,920 --> 00:07:13,360
optimized for math-heavy tasks.

202
00:07:13,360 --> 00:07:16,360
The models were mainly trained on diverse instructions

203
00:07:16,360 --> 00:07:19,520
with only a small portion focused on math.

204
00:07:19,520 --> 00:07:21,920
So it's like trying to teach someone to play the piano

205
00:07:21,920 --> 00:07:24,000
by mainly giving them guitar lessons.

206
00:07:24,000 --> 00:07:27,320
They might grasp basic musical concepts,

207
00:07:27,320 --> 00:07:29,120
but their piano skills won't be great.

208
00:07:29,120 --> 00:07:30,640
Precisely.

209
00:07:30,640 --> 00:07:33,960
The researchers suggest that including more math-specific

210
00:07:33,960 --> 00:07:35,880
instructions during training could

211
00:07:35,880 --> 00:07:37,560
potentially bridge this gap.

212
00:07:37,560 --> 00:07:38,880
That makes sense.

213
00:07:38,880 --> 00:07:41,120
But the fact that this TPO method works

214
00:07:41,120 --> 00:07:43,400
across such a diverse range of tasks,

215
00:07:43,400 --> 00:07:46,640
even with that limitation, is pretty impressive.

216
00:07:46,640 --> 00:07:50,520
It challenges our traditional ideas of what AI is capable of.

217
00:07:50,520 --> 00:07:52,480
Are there any other limitations or challenges

218
00:07:52,480 --> 00:07:54,600
that the researchers highlighted in the paper?

219
00:07:54,600 --> 00:07:55,040
Yes.

220
00:07:55,040 --> 00:07:57,720
They did point out a few areas for improvement.

221
00:07:57,720 --> 00:08:01,640
One key challenge is controlling the length of the AI's thoughts.

222
00:08:01,640 --> 00:08:05,000
If they get too long, it can become computationally expensive

223
00:08:05,000 --> 00:08:06,400
and slow down the response time.

224
00:08:06,400 --> 00:08:06,600
Right.

225
00:08:06,600 --> 00:08:08,000
You don't want the AI getting stuck

226
00:08:08,000 --> 00:08:09,760
in an endless internal monologue.

227
00:08:09,760 --> 00:08:10,600
Exactly.

228
00:08:10,600 --> 00:08:13,800
They acknowledge the need for better ways

229
00:08:13,800 --> 00:08:15,400
to manage thought length,

230
00:08:15,400 --> 00:08:18,120
ensuring the thinking process is comprehensive,

231
00:08:18,120 --> 00:08:19,640
but also efficient.

232
00:08:19,640 --> 00:08:21,840
Another area for exploration is the use

233
00:08:21,840 --> 00:08:23,880
of different thought prompts,

234
00:08:23,880 --> 00:08:27,800
the prompts used to initiate the AI's thinking

235
00:08:27,800 --> 00:08:30,480
and influence the types of thoughts it generates.

236
00:08:30,480 --> 00:08:32,960
It's like providing the AI with different starting points

237
00:08:32,960 --> 00:08:34,840
for its brainstorming session.

238
00:08:34,840 --> 00:08:37,520
A prompt like "think step by step"

239
00:08:37,520 --> 00:08:40,240
might lead to a different thought process compared to a prompt

240
00:08:40,240 --> 00:08:42,160
like "consider multiple perspectives."

241
00:08:42,160 --> 00:08:43,160
Precisely.

242
00:08:43,160 --> 00:08:44,720
And lastly, remember, this research

243
00:08:44,720 --> 00:08:47,880
was conducted on relatively small language models.

244
00:08:47,880 --> 00:08:51,280
How TPO will perform on the massive LLMs being developed

245
00:08:51,280 --> 00:08:53,000
today is an open question.

246
00:08:53,000 --> 00:08:53,720
OK.

247
00:08:53,720 --> 00:08:57,880
It seems like there's still so much to explore in this area,

248
00:08:57,880 --> 00:09:02,000
in the world of thinking LLMs, but even with these limitations,

249
00:09:02,000 --> 00:09:05,320
the research we've discussed today is a significant step

250
00:09:05,320 --> 00:09:05,880
forward.

251
00:09:05,880 --> 00:09:09,400
It really challenges our understanding of AI's potential.

252
00:09:09,400 --> 00:09:10,560
I completely agree.

253
00:09:10,560 --> 00:09:14,400
This ability to teach AI to think internally, to plan, draft,

254
00:09:14,400 --> 00:09:18,280
and evaluate its own work could be transformative for the field.

255
00:09:18,280 --> 00:09:19,240
Absolutely.

256
00:09:19,240 --> 00:09:21,880
So as we wrap up this episode of the AI Papers Podcast

257
00:09:21,880 --> 00:09:24,360
Daily, I'll leave our listeners with a question to ponder.

258
00:09:24,360 --> 00:09:28,840
What do you find most intriguing about the idea of thinking LLMs?

259
00:09:28,840 --> 00:09:31,680
And what potential applications or future developments

260
00:09:31,680 --> 00:09:33,640
in this area excite you the most?

261
00:09:33,640 --> 00:09:35,320
Thanks for joining us on this deep dive

262
00:09:35,320 --> 00:09:37,040
into the world of thinking LLMs.

263
00:09:37,040 --> 00:09:38,920
We hope you found it insightful.

264
00:09:38,920 --> 00:10:03,760
Until next time, keep exploring the fascinating world of AI.