1
00:00:00,000 --> 00:00:02,640
Hey everyone and welcome back to the deep dive.

2
00:00:02,640 --> 00:00:04,640
Today we've got a really interesting one,

3
00:00:04,640 --> 00:00:06,720
especially if you're into AI like me.

4
00:00:06,720 --> 00:00:07,560
Definitely.

5
00:00:07,560 --> 00:00:08,400
Yeah.

6
00:00:08,400 --> 00:00:11,800
So we're diving into a paper all about AI and conversation

7
00:00:11,800 --> 00:00:14,920
and how tricky it can be.

8
00:00:14,920 --> 00:00:16,840
Yeah, the paper's called Large Language Models

9
00:00:16,840 --> 00:00:19,800
Know What to Say but Not When to Speak.

10
00:00:19,800 --> 00:00:21,120
Catchy title.

11
00:00:21,120 --> 00:00:22,280
I know, right?

12
00:00:22,280 --> 00:00:23,880
And it kind of sums it up perfectly.

13
00:00:23,880 --> 00:00:26,120
Like these LLMs, the big language models

14
00:00:26,120 --> 00:00:27,520
we always hear about,

15
00:00:27,520 --> 00:00:29,720
they're amazing at generating text, right?

16
00:00:29,720 --> 00:00:31,160
Yeah, absolutely.

17
00:00:31,160 --> 00:00:33,360
But this research is all about how hard it is

18
00:00:33,360 --> 00:00:36,640
to teach them when to speak up in a conversation,

19
00:00:36,640 --> 00:00:38,640
not just what to say.

20
00:00:38,640 --> 00:00:39,960
It's funny because we humans,

21
00:00:39,960 --> 00:00:41,880
we just do this without even thinking about it.

22
00:00:41,880 --> 00:00:43,240
Right, like it's just natural.

23
00:00:43,240 --> 00:00:45,680
We instinctively know when to jump in,

24
00:00:45,680 --> 00:00:46,520
when to hold back,

25
00:00:46,520 --> 00:00:47,920
when someone's finished their thought.

26
00:00:47,920 --> 00:00:48,760
Yeah, totally.

27
00:00:48,760 --> 00:00:50,320
And this paper looks at something called

28
00:00:50,320 --> 00:00:53,400
Transition Relevance Places or TRPs.

29
00:00:53,400 --> 00:00:54,240
TRPs.

30
00:00:54,240 --> 00:00:56,080
Yeah, basically they're those points in a conversation

31
00:00:56,080 --> 00:00:58,480
where it feels natural for someone to respond.

32
00:00:58,480 --> 00:01:00,520
Like those unspoken cues we pick up on.

33
00:01:00,520 --> 00:01:01,360
Exactly.

34
00:01:01,360 --> 00:01:03,280
Those moments where you just feel like it's your turn.

35
00:01:03,280 --> 00:01:05,120
Exactly, and the researchers wanted to see

36
00:01:05,120 --> 00:01:08,480
if they could teach AI to recognize those same cues.

37
00:01:08,480 --> 00:01:10,720
Which I'm guessing is way harder than it sounds.

38
00:01:10,720 --> 00:01:11,560
Oh, absolutely.

39
00:01:11,560 --> 00:01:14,680
I mean, how do you even begin to explain

40
00:01:14,680 --> 00:01:16,640
that feeling to a computer?

41
00:01:16,640 --> 00:01:19,400
Yeah, it's like so abstract.

42
00:01:19,400 --> 00:01:20,400
Well, these researchers,

43
00:01:20,400 --> 00:01:22,200
they came up with a pretty clever method.

44
00:01:22,200 --> 00:01:25,080
They had real people listen to conversations.

45
00:01:25,080 --> 00:01:25,920
Okay.

46
00:01:25,920 --> 00:01:28,440
And then mark those instinctive TRP moments

47
00:01:28,440 --> 00:01:31,200
by giving little audio cues, like saying, hmm.

48
00:01:31,200 --> 00:01:32,040
Interesting.

49
00:01:32,040 --> 00:01:35,120
So they're like actually recording those subtle moments?

50
00:01:35,120 --> 00:01:35,960
Yeah, exactly.

51
00:01:35,960 --> 00:01:38,000
And those are important because they show

52
00:01:38,000 --> 00:01:40,800
where someone could have spoken, but chose not to.

53
00:01:40,800 --> 00:01:42,320
Right, those silent TRPs.

54
00:01:42,320 --> 00:01:43,160
Exactly.

55
00:01:43,160 --> 00:01:44,560
And those are usually missed in like

56
00:01:44,560 --> 00:01:46,560
most AI research on conversation.

57
00:01:46,560 --> 00:01:48,680
It adds like a whole extra layer of nuance.

58
00:01:48,680 --> 00:01:49,520
It does.

59
00:01:49,520 --> 00:01:50,880
So they built this whole data set

60
00:01:50,880 --> 00:01:53,240
of human-identified TRPs, right?

61
00:01:53,240 --> 00:01:56,000
And then they tested a bunch of different LLMs.

62
00:01:56,000 --> 00:01:58,040
They basically fed them the transcripts and said,

63
00:01:58,040 --> 00:02:01,600
okay, AI, predict where those "hmm" moments should be.

64
00:02:01,600 --> 00:02:02,440
Clever.

65
00:02:02,440 --> 00:02:03,280
Right.

66
00:02:03,280 --> 00:02:04,440
And they did it in two ways,

67
00:02:04,440 --> 00:02:05,560
which I thought was interesting.

68
00:02:05,560 --> 00:02:10,000
Yeah, so first they used what they call expert prompts.

69
00:02:10,000 --> 00:02:10,840
Okay.

70
00:02:10,840 --> 00:02:14,280
And these gave the AI background info on TRPs,

71
00:02:14,280 --> 00:02:16,640
almost like a crash course on, you know,

72
00:02:16,640 --> 00:02:17,760
conversation theory.

73
00:02:17,760 --> 00:02:20,160
So they were like teaching the AI about TRPs

74
00:02:20,160 --> 00:02:21,080
before testing it.

75
00:02:21,080 --> 00:02:21,920
Yeah, exactly.

76
00:02:21,920 --> 00:02:24,200
And then they also used participant prompts,

77
00:02:24,200 --> 00:02:26,040
which just asked the AI to identify

78
00:02:26,040 --> 00:02:28,480
where it felt natural to respond,

79
00:02:28,480 --> 00:02:30,360
just like they did with the human listeners.

80
00:02:30,360 --> 00:02:31,200
Interesting.

81
00:02:31,200 --> 00:02:32,200
So like one was more theoretical,

82
00:02:32,200 --> 00:02:33,200
the other more practical.

83
00:02:33,200 --> 00:02:34,040
Exactly.

84
00:02:34,040 --> 00:02:36,600
So they were testing whether the AI could learn by example,

85
00:02:36,600 --> 00:02:39,200
or if it needed that, you know,

86
00:02:39,200 --> 00:02:41,640
that theoretical foundation to really get it.

87
00:02:41,640 --> 00:02:42,480
Fascinating.

88
00:02:42,480 --> 00:02:43,240
I know, right?

89
00:02:43,240 --> 00:02:46,440
So the big question is, did the AI ace the test?

90
00:02:46,440 --> 00:02:47,280
Yeah.

91
00:02:47,280 --> 00:02:48,120
What happened?

92
00:02:48,120 --> 00:02:48,960
Drumroll, please.

93
00:02:48,960 --> 00:02:50,280
Not exactly.

94
00:02:50,280 --> 00:02:51,120
Oh.

95
00:02:51,120 --> 00:02:53,600
Yeah, even the most advanced LLM they tested,

96
00:02:53,600 --> 00:02:55,200
which was GPT-4 Omni,

97
00:02:55,200 --> 00:02:57,000
had a tough time predicting those subtle

98
00:02:57,000 --> 00:02:58,440
within-turn TRPs.

99
00:02:58,440 --> 00:02:59,280
Hmm.

100
00:02:59,280 --> 00:03:02,040
So even with all that data and fancy algorithms,

101
00:03:02,040 --> 00:03:05,880
AI still can't fully grasp the nuance of human conversation.

102
00:03:05,880 --> 00:03:06,960
It seems that way, right?

103
00:03:06,960 --> 00:03:07,920
It really does highlight

104
00:03:07,920 --> 00:03:10,120
how complex human communication is.

105
00:03:10,120 --> 00:03:11,840
I mean, all those tiny pauses,

106
00:03:11,840 --> 00:03:13,440
those slight shifts in tone,

107
00:03:13,440 --> 00:03:15,240
they convey so much meaning.

108
00:03:15,240 --> 00:03:16,200
Oh, absolutely.

109
00:03:16,200 --> 00:03:18,560
And it's proving really difficult to teach AI

110
00:03:18,560 --> 00:03:20,400
to pick up on those cues.

111
00:03:20,400 --> 00:03:23,520
So I guess, you know, it sounds like we're still a ways off

112
00:03:23,520 --> 00:03:27,160
from having those seamless, natural-sounding AI assistants

113
00:03:27,160 --> 00:03:30,040
that everyone's hoping for.

114
00:03:30,040 --> 00:03:31,160
True, but I think the fact

115
00:03:31,160 --> 00:03:33,280
that we're even doing this research is huge.

116
00:03:33,280 --> 00:03:34,120
Oh, yeah, for sure.

117
00:03:34,120 --> 00:03:35,640
Like, we're beginning to understand

118
00:03:35,640 --> 00:03:38,160
the challenges involved, and that's crucial.

119
00:03:38,160 --> 00:03:39,760
Absolutely, absolutely.

120
00:03:39,760 --> 00:03:42,720
Now, you mentioned those two different types of prompts.

121
00:03:42,720 --> 00:03:45,840
Did that, like, actually make a difference?

122
00:03:45,840 --> 00:03:50,280
Did the way they explain the task to the AI change things?

123
00:03:50,280 --> 00:03:52,680
You know, that's where things get really interesting.

124
00:03:52,680 --> 00:03:54,560
The LLMs actually performed better

125
00:03:54,560 --> 00:03:56,720
when they were given those expert prompts,

126
00:03:56,720 --> 00:03:59,080
you know, the ones with the background info on TRPs.

127
00:03:59,080 --> 00:04:01,440
So giving them that theoretical framework,

128
00:04:01,440 --> 00:04:04,240
even if it was just a quick overview, actually helped.

129
00:04:04,240 --> 00:04:05,240
It seems that way.

130
00:04:05,240 --> 00:04:08,760
Like, having that foundation gave the AI a bit of an edge.

131
00:04:08,760 --> 00:04:10,000
That's wild.

132
00:04:10,000 --> 00:04:12,680
So it's not just about data, it's about understanding, too.

133
00:04:12,680 --> 00:04:14,320
Now, for our listeners out there,

134
00:04:14,320 --> 00:04:15,480
you might be thinking,

135
00:04:15,480 --> 00:04:18,120
why should I care about any of this?

136
00:04:18,120 --> 00:04:21,640
Well, just imagine a world where AI assistants

137
00:04:21,640 --> 00:04:25,320
can join your meetings without awkward interruptions.

138
00:04:25,320 --> 00:04:26,160
Right.

139
00:04:26,160 --> 00:04:28,280
Or chatbots that feel like you're talking to a real person.

140
00:04:28,280 --> 00:04:29,240
Exactly.

141
00:04:29,240 --> 00:04:31,160
Or even, like, characters in video games

142
00:04:31,160 --> 00:04:32,440
that react to you naturally.

143
00:04:32,440 --> 00:04:34,240
All of that could be possible

144
00:04:34,240 --> 00:04:36,840
if we can figure out TRP prediction.

145
00:04:36,840 --> 00:04:41,680
It's about making AI more human-like, more intuitive.

146
00:04:41,680 --> 00:04:43,440
More able to understand and respond to us

147
00:04:43,440 --> 00:04:44,960
in a way that feels natural.

148
00:04:44,960 --> 00:04:46,480
Yeah, that's the dream.

149
00:04:46,480 --> 00:04:47,320
For sure.

150
00:04:47,320 --> 00:04:48,400
But we also have to be realistic

151
00:04:48,400 --> 00:04:49,560
about the challenges, right?

152
00:04:49,560 --> 00:04:50,400
Oh, yeah.

153
00:04:50,400 --> 00:04:53,000
This research shows just how far we still have to go.

154
00:04:53,000 --> 00:04:56,200
So I'm curious, what's next for this research?

155
00:04:56,200 --> 00:04:57,640
Where do we go from here?

156
00:04:57,640 --> 00:04:59,040
Well, one thing the researchers pointed out

157
00:04:59,040 --> 00:05:01,800
is that they only looked at text-based conversation.

158
00:05:01,800 --> 00:05:02,640
Right.

159
00:05:02,640 --> 00:05:04,640
But what about all those other cues we use when we talk?

160
00:05:04,640 --> 00:05:08,720
Like tone of voice, pauses, facial expressions.

161
00:05:08,720 --> 00:05:09,560
You're right.

162
00:05:09,560 --> 00:05:10,840
We don't just communicate with words.

163
00:05:10,840 --> 00:05:11,680
Exactly.

164
00:05:11,680 --> 00:05:16,680
So they're asking, what if AI could analyze how we speak?

165
00:05:18,000 --> 00:05:19,640
Not just what we say.

166
00:05:19,640 --> 00:05:23,320
Could that unlock even more natural conversation?

167
00:05:23,320 --> 00:05:26,240
Now that is a question worth pondering.

168
00:05:26,240 --> 00:05:29,120
And that's where we'll pick up in part two of this deep dive.

169
00:05:29,120 --> 00:05:31,160
We'll look closer at the methods and metrics

170
00:05:31,160 --> 00:05:32,520
they used in this research

171
00:05:32,520 --> 00:05:34,360
and explore some of the implications

172
00:05:34,360 --> 00:05:36,000
for the future of AI.

173
00:05:36,000 --> 00:05:36,880
Stay tuned.

174
00:05:36,880 --> 00:05:38,160
Welcome back to the deep dive.

175
00:05:38,160 --> 00:05:40,600
It's really incredible how much we just like

176
00:05:40,600 --> 00:05:42,720
take for granted in human conversation.

177
00:05:42,720 --> 00:05:44,040
It really is when you start thinking about it.

178
00:05:44,040 --> 00:05:45,760
All those subtle cues, you know?

179
00:05:45,760 --> 00:05:48,680
The signals, we just navigate them without even thinking.

180
00:05:48,680 --> 00:05:49,520
Right.

181
00:05:49,520 --> 00:05:50,520
It's so effortless for us.

182
00:05:50,520 --> 00:05:51,920
But it's actually incredibly complex.

183
00:05:51,920 --> 00:05:53,240
And like we were saying before,

184
00:05:53,240 --> 00:05:55,240
teaching AI to recognize those cues,

185
00:05:55,240 --> 00:05:58,560
especially those silent TRPs, it's a huge challenge.

186
00:05:58,560 --> 00:05:59,400
Absolutely.

187
00:05:59,400 --> 00:06:01,440
So how do they even like capture those TRPs

188
00:06:01,440 --> 00:06:02,920
in the first place for this research?

189
00:06:02,920 --> 00:06:04,800
Well, they actually recorded these conversations

190
00:06:04,800 --> 00:06:06,800
like natural unscripted ones.

191
00:06:06,800 --> 00:06:07,640
Okay.

192
00:06:07,640 --> 00:06:09,480
And they made sure the audio was, you know, super clear.

193
00:06:09,480 --> 00:06:10,680
No background noise.

194
00:06:10,680 --> 00:06:11,520
Yeah, exactly.

195
00:06:11,520 --> 00:06:13,920
No background noise, no interruptions.

196
00:06:13,920 --> 00:06:17,040
They wanted those hmm moments to be crystal clear.

197
00:06:17,040 --> 00:06:19,160
So they were very particular about their data.

198
00:06:19,160 --> 00:06:20,280
Absolutely meticulous.

199
00:06:20,280 --> 00:06:22,360
And then they used special software

200
00:06:22,360 --> 00:06:25,640
to like pinpoint the exact timing.

201
00:06:25,640 --> 00:06:28,040
Yeah, in relation to the spoken words.

202
00:06:28,040 --> 00:06:30,720
So they basically created this precise dataset

203
00:06:30,720 --> 00:06:32,240
of TRP moments.

204
00:06:32,240 --> 00:06:34,280
So they're like mapping the conversation,

205
00:06:34,280 --> 00:06:36,280
marking those spots where a response

206
00:06:36,280 --> 00:06:37,360
could naturally happen.

207
00:06:37,360 --> 00:06:38,200
Exactly.

208
00:06:38,200 --> 00:06:40,040
And remember how they tested those LLMs

209
00:06:40,040 --> 00:06:43,240
using both expert and participant prompts?

210
00:06:43,240 --> 00:06:45,560
Yeah, giving the AI different levels of information

211
00:06:45,560 --> 00:06:46,680
about those TRPs.

212
00:06:46,680 --> 00:06:47,520
Right.

213
00:06:47,520 --> 00:06:48,560
So with those expert prompts,

214
00:06:48,560 --> 00:06:51,040
it was like giving the AI a crash course

215
00:06:51,040 --> 00:06:52,440
in conversation theory.

216
00:06:52,440 --> 00:06:53,280
Okay.

217
00:06:53,280 --> 00:06:55,120
Like a mini lecture explaining TRPs

218
00:06:55,120 --> 00:06:57,000
in this very academic way.

219
00:06:57,000 --> 00:06:58,880
And then with those participant prompts,

220
00:06:58,880 --> 00:07:00,880
they just asked the AI to identify

221
00:07:00,880 --> 00:07:02,320
those natural response points

222
00:07:02,320 --> 00:07:03,400
like they did with the humans.

223
00:07:03,400 --> 00:07:04,520
Right, exactly.

224
00:07:04,520 --> 00:07:06,960
They wanted to see if the AI could learn by example

225
00:07:06,960 --> 00:07:09,440
or if it needed that, you know, theoretical background

226
00:07:09,440 --> 00:07:10,760
to really understand.

227
00:07:10,760 --> 00:07:11,800
And you mentioned earlier

228
00:07:11,800 --> 00:07:13,800
that even the most advanced LLMs

229
00:07:13,800 --> 00:07:17,120
had trouble predicting those within-turn TRPs.

230
00:07:17,120 --> 00:07:18,880
Did the type of prompt make a difference?

231
00:07:18,880 --> 00:07:20,000
It did actually.

232
00:07:20,000 --> 00:07:23,240
The LLMs performed better with the expert prompts.

233
00:07:23,240 --> 00:07:24,080
Wow.

234
00:07:24,080 --> 00:07:25,640
So even that quick explanation of the theory

235
00:07:25,640 --> 00:07:29,240
seemed to help the AI grasp the idea of TRPs.

236
00:07:29,240 --> 00:07:30,160
It did.

237
00:07:30,160 --> 00:07:31,880
It suggests that even though AI

238
00:07:31,880 --> 00:07:34,520
doesn't totally get human conversation yet,

239
00:07:34,520 --> 00:07:36,320
giving it that deeper understanding,

240
00:07:36,320 --> 00:07:39,400
those underlying principles could be really helpful.

241
00:07:39,400 --> 00:07:40,840
It's not just about feeding them data.

242
00:07:40,840 --> 00:07:43,480
It's about teaching them the why behind it all.

243
00:07:43,480 --> 00:07:44,320
Exactly.

244
00:07:44,320 --> 00:07:46,920
It's like, you know, teaching someone to play an instrument.

245
00:07:46,920 --> 00:07:48,720
You can show them where to put their fingers,

246
00:07:48,720 --> 00:07:50,600
but they also need to understand the music theory

247
00:07:50,600 --> 00:07:51,960
to really master it.

248
00:07:51,960 --> 00:07:52,960
I like that analogy.

249
00:07:52,960 --> 00:07:53,800
Yeah.

250
00:07:53,800 --> 00:07:55,120
But even with the expert prompts,

251
00:07:55,120 --> 00:07:56,880
the AI still wasn't perfect, right?

252
00:07:56,880 --> 00:07:58,000
Right, it wasn't perfect.

253
00:07:58,000 --> 00:08:01,160
And that just highlights how much of a gap there still is.

254
00:08:01,160 --> 00:08:02,000
In between.

255
00:08:02,000 --> 00:08:04,760
Between AI's theoretical understanding of conversation

256
00:08:04,760 --> 00:08:07,440
and its ability to actually like use that

257
00:08:07,440 --> 00:08:09,480
in those subtle dynamic interactions.

258
00:08:09,480 --> 00:08:10,720
Like knowing grammar rules

259
00:08:10,720 --> 00:08:12,680
versus actually speaking fluently.

260
00:08:12,680 --> 00:08:13,520
Yeah, exactly.

261
00:08:13,520 --> 00:08:14,720
It's a whole different ball game.

262
00:08:14,720 --> 00:08:15,560
Totally.

263
00:08:15,560 --> 00:08:17,520
And this research really emphasizes

264
00:08:17,520 --> 00:08:20,640
how complex those real-world conversations are.

265
00:08:20,640 --> 00:08:21,480
Oh, for sure.

266
00:08:21,480 --> 00:08:22,800
We're always reading each other's cues,

267
00:08:22,800 --> 00:08:24,440
adjusting our responses,

268
00:08:24,440 --> 00:08:26,360
anticipating what the other person's gonna say.

269
00:08:26,360 --> 00:08:27,200
Right.

270
00:08:27,200 --> 00:08:29,440
And that's still really hard for AI to do.

271
00:08:29,440 --> 00:08:32,360
Okay, so let's talk about how they measured performance.

272
00:08:32,360 --> 00:08:33,720
'Cause I know you mentioned some metrics

273
00:08:33,720 --> 00:08:35,200
that were a little bit over my head.

274
00:08:35,200 --> 00:08:36,160
Oh yeah, those metrics.

275
00:08:36,160 --> 00:08:39,160
Like precision, recall,

276
00:08:40,200 --> 00:08:41,840
kappa statistics.

277
00:08:43,520 --> 00:08:44,840
Can we break those down a little?

278
00:08:44,840 --> 00:08:45,680
Sure, yeah.

279
00:08:45,680 --> 00:08:47,200
Those are all about understanding

280
00:08:47,200 --> 00:08:49,840
how well the AI is actually doing.

281
00:08:49,840 --> 00:08:51,160
At predicting TRPs.

282
00:08:51,160 --> 00:08:52,000
Exactly.

283
00:08:52,000 --> 00:08:55,640
So like precision is about how accurate the AI is

284
00:08:55,640 --> 00:08:57,360
when it does predict a TRP.

285
00:08:57,360 --> 00:08:58,200
Okay.

286
00:08:58,200 --> 00:08:59,120
So it's like asking,

287
00:08:59,120 --> 00:09:00,680
is it actually hitting the mark

288
00:09:00,680 --> 00:09:02,720
or just making a bunch of random guesses?

289
00:09:02,720 --> 00:09:05,400
We don't want the AI just interrupting randomly.

290
00:09:05,400 --> 00:09:07,920
We want those predictions to be, you know, meaningful.

291
00:09:07,920 --> 00:09:09,440
Okay, what about recall then?

292
00:09:09,440 --> 00:09:12,520
Recall is about how many actual TRPs

293
00:09:12,520 --> 00:09:14,040
the AI is able to catch.

294
00:09:14,040 --> 00:09:15,040
Okay, so it's like,

295
00:09:15,040 --> 00:09:17,600
is it missing any opportunities to jump in?

296
00:09:17,600 --> 00:09:18,440
Exactly.

297
00:09:18,440 --> 00:09:19,440
So it's like a two-part test.

298
00:09:19,440 --> 00:09:22,560
The AI needs to be both accurate and comprehensive.

299
00:09:22,560 --> 00:09:23,400
Got it.

300
00:09:23,400 --> 00:09:24,680
And what was that F1 score you mentioned?

301
00:09:24,680 --> 00:09:27,400
Oh yeah, the F1 score kind of combines those two,

302
00:09:27,400 --> 00:09:28,800
giving you a balanced view

303
00:09:28,800 --> 00:09:30,600
of the AI's overall performance.

304
00:09:30,600 --> 00:09:33,720
So a high F1 score means it's doing well on both fronts.

305
00:09:33,720 --> 00:09:34,800
Exactly.
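
The precision, recall, and F1 ideas described here can be sketched in a few lines. This is only an illustration, not the paper's evaluation code: it assumes the conversation has been cut into fixed intervals, that each interval is flagged 1 if a TRP falls in it and 0 otherwise, and the names `precision_recall_f1`, `human`, and `model` plus the toy data are made up for the example.

```python
def precision_recall_f1(predicted, actual):
    """Compare two equal-length lists of 0/1 flags, one flag per interval.

    1 means "a TRP was marked in this interval". This interval encoding is
    an assumption made for the sketch, not something stated in the episode.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)  # model and humans both mark a TRP
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)  # model marks one, humans did not
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)  # humans marked one, model missed it

    # Precision: of the TRPs the model predicted, how many were real?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the real TRPs, how many did the model catch?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of the two, high only when both are high.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy data: eight intervals of one conversation.
human = [0, 1, 0, 0, 1, 0, 1, 0]
model = [0, 1, 1, 0, 0, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(model, human)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case the model makes one false alarm and misses one real TRP, so precision and recall both come out at two thirds.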

306
00:09:34,800 --> 00:09:38,160
Now, those kappa statistics might sound a bit complicated.

307
00:09:38,160 --> 00:09:39,040
They do, a little.

308
00:09:39,040 --> 00:09:41,640
But they're really just a way to measure agreement.

309
00:09:41,640 --> 00:09:42,480
Okay.

310
00:09:42,480 --> 00:09:43,720
But not just any agreement,

311
00:09:43,720 --> 00:09:46,760
but agreement beyond what you expect by chance.

312
00:09:46,760 --> 00:09:48,560
So we don't want the AI just getting lucky.

313
00:09:48,560 --> 00:09:51,120
We want it to be making like smart choices.

314
00:09:51,120 --> 00:09:52,400
Exactly.
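
"Agreement beyond chance" can be shown with a small sketch. As an assumption, it uses the standard two-rater Cohen's kappa, which may differ from the specific kappa variants used in the paper; the function name `cohens_kappa` and the toy labels are hypothetical.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters giving 0/1 labels to the same intervals.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected if both raters labeled at random
    with their own base rates.
    """
    n = len(rater_a)
    # Observed agreement: fraction of intervals where the labels match.
    p_o = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n

    # Chance agreement from each rater's rate of saying "1".
    rate_a = sum(rater_a) / n
    rate_b = sum(rater_b) / n
    p_e = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)

    if p_e == 1.0:
        # Both raters used a single identical label throughout; kappa is
        # undefined, and this sketch treats that as perfect agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data: human consensus vs. model, eight intervals.
human_labels = [0, 1, 0, 0, 1, 0, 1, 0]
model_labels = [0, 1, 1, 0, 0, 0, 1, 0]

kappa = cohens_kappa(human_labels, model_labels)
print(f"kappa={kappa:.3f}")
```

Here the two label lists agree on 75% of intervals, but about 53% agreement would be expected by chance alone, so kappa lands well below 0.75.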

315
00:09:52,400 --> 00:09:53,800
One of them, call-free,

316
00:09:53,800 --> 00:09:56,840
looks at the overall agreement between the AI's predictions

317
00:09:56,840 --> 00:09:59,280
and those human-identified TRPs

318
00:09:59,280 --> 00:10:01,960
across all those intervals in the conversation.

319
00:10:01,960 --> 00:10:03,240
It's like a big picture view

320
00:10:03,240 --> 00:10:06,560
of how well the AI matches human judgment.

321
00:10:06,560 --> 00:10:07,840
Yeah, you could say that.

322
00:10:07,840 --> 00:10:10,360
But they also looked at another one, true-free,

323
00:10:10,360 --> 00:10:12,480
which focused specifically on those intervals

324
00:10:12,480 --> 00:10:13,520
where the humans agreed there

325
00:10:13,520 --> 00:10:14,680
was a TRP.

326
00:10:14,680 --> 00:10:16,360
So like zeroing in on those moments

327
00:10:16,360 --> 00:10:18,360
that are hardest for the AI,

328
00:10:18,360 --> 00:10:20,880
those subtle within-turn TRPs.

329
00:10:20,880 --> 00:10:21,920
Precisely.

330
00:10:21,920 --> 00:10:24,960
And the results showed that even the best LLM

331
00:10:24,960 --> 00:10:27,800
had a pretty low true-free score.

332
00:10:27,800 --> 00:10:29,040
Meaning it was still struggling

333
00:10:29,040 --> 00:10:31,720
to consistently match human intuition

334
00:10:31,720 --> 00:10:33,640
on those trickier TRPs.

335
00:10:33,640 --> 00:10:34,480
Exactly.

336
00:10:34,480 --> 00:10:37,000
And then they used something called temporal distance metrics,

337
00:10:37,000 --> 00:10:38,120
which might sound fancy.

338
00:10:38,120 --> 00:10:39,080
A little bit, yeah.

339
00:10:39,080 --> 00:10:40,760
But basically they just measure

340
00:10:40,760 --> 00:10:43,480
how far off the AI's predictions were

341
00:10:43,480 --> 00:10:46,840
from the actual human-identified TRPs.

342
00:10:46,840 --> 00:10:48,360
So like how big the gap was

343
00:10:48,360 --> 00:10:51,760
between where the AI thought a TRP should be

344
00:10:51,760 --> 00:10:53,000
and where it actually was.

345
00:10:53,000 --> 00:10:53,840
Yes.
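
A temporal distance measure of the kind described here can be sketched as "for each predicted TRP, how far away is the nearest human-marked one?" The exact definition in the paper may differ; the function name `nearest_distances` and the timestamps below are hypothetical.

```python
def nearest_distances(predicted_times, human_times):
    """For each predicted TRP time (in seconds), return the absolute gap
    to the closest human-identified TRP time."""
    if not human_times:
        raise ValueError("need at least one human-identified TRP")
    return [min(abs(p - h) for h in human_times) for p in predicted_times]


# Hypothetical timestamps, in seconds from the start of a recording.
human_times = [1.2, 3.5, 6.0]
model_times = [1.0, 4.5, 6.1]

distances = nearest_distances(model_times, human_times)
mean_distance = sum(distances) / len(distances)
print([round(d, 2) for d in distances], round(mean_distance, 2))
```

A prediction that is a full second off, like the middle one here, pulls the average up even when the other predictions land close to a human-marked TRP.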

346
00:10:53,840 --> 00:10:55,920
And it turned out that the AI was often off

347
00:10:55,920 --> 00:10:56,760
by quite a bit,

348
00:10:56,760 --> 00:10:59,320
especially for those within-turn TRPs.

349
00:10:59,320 --> 00:11:00,880
So it's not just about being right or wrong,

350
00:11:00,880 --> 00:11:02,280
it's about being close.

351
00:11:02,280 --> 00:11:03,120
Right.

352
00:11:03,120 --> 00:11:04,200
Because even if it wasn't interrupting

353
00:11:04,200 --> 00:11:05,440
at the exact wrong time,

354
00:11:05,440 --> 00:11:07,320
it was still missing those subtle cues

355
00:11:07,320 --> 00:11:09,520
that make conversations flow smoothly.

356
00:11:09,520 --> 00:11:11,480
And you mentioned that the AI did a little better

357
00:11:11,480 --> 00:11:12,680
with those expert prompts.

358
00:11:12,680 --> 00:11:13,520
Yeah.

359
00:11:13,520 --> 00:11:15,000
But did those temporal distance metrics

360
00:11:15,000 --> 00:11:16,200
show any improvement there?

361
00:11:16,200 --> 00:11:18,320
You know, even with that theoretical boost,

362
00:11:18,320 --> 00:11:20,880
it was still making significant timing errors.

363
00:11:20,880 --> 00:11:22,920
So even understanding the theory of conversation

364
00:11:22,920 --> 00:11:24,200
only takes you so far.

365
00:11:24,200 --> 00:11:26,080
Right, it seems like the AI still needs

366
00:11:26,080 --> 00:11:29,040
to develop that real-world conversational instinct.

367
00:11:29,040 --> 00:11:29,880
That makes sense.

368
00:11:29,880 --> 00:11:32,360
It's like you can study all the books about swimming,

369
00:11:32,360 --> 00:11:34,880
but you still gotta get in the water to really learn.

370
00:11:34,880 --> 00:11:35,880
Exactly.

371
00:11:35,880 --> 00:11:38,800
And this is a key takeaway from this whole research.

372
00:11:38,800 --> 00:11:39,480
What is it?

373
00:11:39,480 --> 00:11:41,640
Even though AI has come a long way,

374
00:11:41,640 --> 00:11:44,080
we still have a lot of work to do to create AI

375
00:11:44,080 --> 00:11:46,240
that can truly understand and participate

376
00:11:46,240 --> 00:11:47,600
in human conversation.

377
00:11:47,600 --> 00:11:48,400
It's true.

378
00:11:48,400 --> 00:11:51,200
This is still early days for conversational AI.

379
00:11:51,200 --> 00:11:52,080
Absolutely.

380
00:11:52,080 --> 00:11:54,640
But research like this is paving the way.

381
00:11:54,640 --> 00:11:56,800
You know, it's helping us understand

382
00:11:56,800 --> 00:11:58,440
what it'll take to get there.

383
00:11:58,440 --> 00:11:59,520
It's pretty exciting to think about.

384
00:11:59,520 --> 00:12:00,000
It is.

385
00:12:00,000 --> 00:12:01,680
And this paper is a perfect example

386
00:12:01,680 --> 00:12:03,600
of that kind of groundbreaking research.

387
00:12:03,600 --> 00:12:05,200
Well, folks, that wraps up part two

388
00:12:05,200 --> 00:12:07,560
of our deep dive into this fascinating research

389
00:12:07,560 --> 00:12:09,440
on AI conversation.

390
00:12:09,440 --> 00:12:11,400
We've looked at the methods, the metrics,

391
00:12:11,400 --> 00:12:13,560
and what they tell us about the progress and challenges

392
00:12:13,560 --> 00:12:15,960
in teaching AI to converse like us.

393
00:12:15,960 --> 00:12:19,160
We've seen that AI is really good at generating text,

394
00:12:19,160 --> 00:12:22,440
but it still has trouble with those subtle nuances of timing

395
00:12:22,440 --> 00:12:23,640
and turn-taking.

396
00:12:23,640 --> 00:12:25,360
Yeah, those things that are so important

397
00:12:25,360 --> 00:12:26,840
for natural conversation.

398
00:12:26,840 --> 00:12:29,120
But there's hope, because the research suggests

399
00:12:29,120 --> 00:12:31,600
that giving AI a deeper understanding

400
00:12:31,600 --> 00:12:33,600
of those conversational principles

401
00:12:33,600 --> 00:12:35,960
could lead to some big improvements.

402
00:12:35,960 --> 00:12:37,920
And as this research keeps going,

403
00:12:37,920 --> 00:12:40,440
we can expect to see even more breakthroughs

404
00:12:40,440 --> 00:12:43,440
in the quest for truly conversational AI.

405
00:12:43,440 --> 00:12:44,080
Absolutely.

406
00:12:44,080 --> 00:12:46,080
And we'll continue exploring that in part three,

407
00:12:46,080 --> 00:12:48,360
where we'll dive into the potential implications

408
00:12:48,360 --> 00:12:51,280
of this research and what it means for the future of how

409
00:12:51,280 --> 00:12:53,080
we interact with AI.

410
00:12:53,080 --> 00:12:55,040
Welcome back to the deep dive.

411
00:12:55,040 --> 00:12:57,400
We've been talking all about teaching AI

412
00:12:57,400 --> 00:13:01,840
to converse like humans, picking up on those cues that

413
00:13:01,840 --> 00:13:04,080
tell us when it's our turn to speak.

414
00:13:04,080 --> 00:13:05,240
It's fascinating, isn't it?

415
00:13:05,240 --> 00:13:05,640
Yeah.

416
00:13:05,640 --> 00:13:07,600
How much is going on beneath the surface

417
00:13:07,600 --> 00:13:08,360
when we talk?

418
00:13:08,360 --> 00:13:08,680
Right.

419
00:13:08,680 --> 00:13:10,840
So much more than just the words themselves.

420
00:13:10,840 --> 00:13:13,280
All those pauses, those changes in tone,

421
00:13:13,280 --> 00:13:15,840
even when someone doesn't speak, it all adds to the meaning.

422
00:13:15,840 --> 00:13:18,760
And this research showed how tough it is for AI to get that,

423
00:13:18,760 --> 00:13:21,840
even with those super advanced language models we have now.

424
00:13:21,840 --> 00:13:24,320
Yeah, but this isn't just about pointing out

425
00:13:24,320 --> 00:13:25,720
where AI falls short.

426
00:13:25,720 --> 00:13:26,840
Right, it's about moving forward.

427
00:13:26,840 --> 00:13:27,360
Exactly.

428
00:13:27,360 --> 00:13:29,840
It's about understanding where we need to go next.

429
00:13:29,840 --> 00:13:31,000
So where do we go from here?

430
00:13:31,000 --> 00:13:33,000
What's next for this research?

431
00:13:33,000 --> 00:13:35,000
Well, one thing the researchers really emphasize

432
00:13:35,000 --> 00:13:37,160
is going beyond just text.

433
00:13:37,160 --> 00:13:39,880
Right, like they only looked at transcripts for this study.

434
00:13:39,880 --> 00:13:44,080
Yeah, they focused on those cues to find those TRPs,

435
00:13:44,080 --> 00:13:46,480
but that's just a tiny part of a real conversation.

436
00:13:46,480 --> 00:13:47,280
Oh, absolutely.

437
00:13:47,280 --> 00:13:50,200
Think about all those nonverbal cues we use, tone of voice,

438
00:13:50,200 --> 00:13:52,600
facial expressions, body language, all that.

439
00:13:52,600 --> 00:13:55,280
They can change the meaning of what we're saying completely.

440
00:13:55,280 --> 00:13:56,160
Exactly.

441
00:13:56,160 --> 00:13:59,400
I mean, take a simple phrase like, that's great.

442
00:13:59,400 --> 00:14:01,840
You could mean you're excited or being

443
00:14:01,840 --> 00:14:04,080
sarcastic or even disappointed.

444
00:14:04,080 --> 00:14:05,200
All depending on how you say it.

445
00:14:05,200 --> 00:14:06,400
Exactly.

446
00:14:06,400 --> 00:14:09,200
And AI needs to understand those subtle variations

447
00:14:09,200 --> 00:14:12,000
if it's going to truly get human communication.

448
00:14:12,000 --> 00:14:14,920
So are you saying the next step is teaching AI

449
00:14:14,920 --> 00:14:17,120
to read those nonverbal cues?

450
00:14:17,120 --> 00:14:19,000
That's definitely a promising avenue.

451
00:14:19,000 --> 00:14:22,080
Imagine an AI that could process our words

452
00:14:22,080 --> 00:14:25,480
and analyze our tone, our expressions, even our body

453
00:14:25,480 --> 00:14:25,840
language.

454
00:14:25,840 --> 00:14:28,800
That's the goal, right, to give AI a much deeper

455
00:14:28,800 --> 00:14:31,080
understanding of what we're really trying to say.

456
00:14:31,080 --> 00:14:32,400
It would change everything.

457
00:14:32,400 --> 00:14:35,000
AI interactions would feel so much more natural.

458
00:14:35,000 --> 00:14:37,800
It would. And think about the applications.

459
00:14:37,800 --> 00:14:38,560
Like what?

460
00:14:38,560 --> 00:14:40,720
Customer service chatbots that can

461
00:14:40,720 --> 00:14:43,640
sense if you're getting frustrated just by your voice.

462
00:14:43,640 --> 00:14:44,200
Oh, wow.

463
00:14:44,200 --> 00:14:46,120
Or virtual assistants that pick up on,

464
00:14:46,120 --> 00:14:47,840
like if you're bored or not interested.

465
00:14:47,840 --> 00:14:48,520
That's incredible.

466
00:14:48,520 --> 00:14:50,280
It really shows how important this research is.

467
00:14:50,280 --> 00:14:50,960
Absolutely.

468
00:14:50,960 --> 00:14:52,360
And this is just the beginning.

469
00:14:52,360 --> 00:14:53,800
I'm excited to see what comes next.

470
00:14:53,800 --> 00:14:55,720
As AI keeps evolving, we're going

471
00:14:55,720 --> 00:14:57,760
to see even more creative ways to teach it

472
00:14:57,760 --> 00:14:59,520
the art of conversation.

473
00:14:59,520 --> 00:15:03,600
So for anyone listening who is as fascinated by AI as we are,

474
00:15:03,600 --> 00:15:07,040
this research gives us a peek into what's possible

475
00:15:07,040 --> 00:15:08,720
and also the challenges ahead.

476
00:15:08,720 --> 00:15:11,200
It reminds us that human conversation is incredibly

477
00:15:11,200 --> 00:15:12,480
complex.

478
00:15:12,480 --> 00:15:14,280
We're still figuring it out ourselves,

479
00:15:14,280 --> 00:15:16,160
let alone teaching it to machines.

480
00:15:16,160 --> 00:15:19,040
But with every step forward, we get closer to a future

481
00:15:19,040 --> 00:15:23,120
where AI can truly talk with us naturally and intuitively.

482
00:15:23,120 --> 00:15:25,080
And maybe, just maybe, we'll even

483
00:15:25,080 --> 00:15:28,400
have AI friends who can crack a good joke,

484
00:15:28,400 --> 00:15:31,800
understand a pun, or just offer a kind word at the right time.

485
00:15:31,800 --> 00:15:33,840
Now that's a future I'm looking forward to.

486
00:15:33,840 --> 00:15:36,040
Thanks for joining us on this deep dive, everyone.

487
00:15:36,040 --> 00:16:03,000
We'll see you next time.