1
00:00:00,000 --> 00:00:03,280
Welcome back to the AI Papers podcast daily for another deep dive.

2
00:00:03,720 --> 00:00:07,840
And today we're looking at a paper that's asking a pretty timely question.

3
00:00:08,400 --> 00:00:09,080
Yeah.

4
00:00:09,360 --> 00:00:14,440
Can large language models plan paths in the real world like your GPS?

5
00:00:14,520 --> 00:00:18,520
It's a really fascinating question, um, especially, you know, as we see

6
00:00:18,520 --> 00:00:22,280
companies like Volkswagen and Mercedes start to incorporate LLMs into their

7
00:00:22,280 --> 00:00:23,080
car systems.

8
00:00:23,080 --> 00:00:23,400
Yeah.

9
00:00:23,520 --> 00:00:26,720
Uh, you know, we rely on our navigation apps every day, but can these LLMs

10
00:00:26,720 --> 00:00:30,120
really handle the complexities of, you know, actually getting us from point A to

11
00:00:30,120 --> 00:00:30,640
point B?

12
00:00:30,640 --> 00:00:30,920
Right.

13
00:00:30,960 --> 00:00:34,040
It seems like a big leap to go from, you know, writing a poem or something to

14
00:00:34,520 --> 00:00:35,760
navigating a busy city.

15
00:00:35,800 --> 00:00:36,120
Yeah.

16
00:00:36,520 --> 00:00:40,640
Um, so how did these researchers actually test these LLMs?

17
00:00:40,960 --> 00:00:46,240
So, uh, they took three different LLMs, GPT-4, Gemini and Mistral, and tested them

18
00:00:46,240 --> 00:00:49,960
on both driving routes and something called visual landmark navigation.

19
00:00:50,040 --> 00:00:50,360
Okay.

20
00:00:50,400 --> 00:00:54,000
So think of that, like walking around a city and using landmarks like statues

21
00:00:54,000 --> 00:00:55,400
or buildings to find your way.

22
00:00:55,400 --> 00:00:55,720
Okay.

23
00:00:55,720 --> 00:00:57,800
So they weren't just working with maps on a computer.

24
00:00:57,800 --> 00:01:01,120
They were actually trying to get these LLMs to navigate in the real world.

25
00:01:01,440 --> 00:01:02,040
Exactly.

26
00:01:02,040 --> 00:01:04,680
And they didn't just stick to one type of environment either.

27
00:01:05,000 --> 00:01:09,520
They designed driving routes in urban, suburban and rural settings to see how

28
00:01:09,520 --> 00:01:13,200
the LLMs would handle different road networks and levels of complexity.

29
00:01:13,440 --> 00:01:14,400
That's really interesting.

30
00:01:14,400 --> 00:01:18,200
So it's one thing to navigate, like, a grid-like city, right, but quite another

31
00:01:18,200 --> 00:01:19,880
to deal with winding country roads.

32
00:01:19,880 --> 00:01:20,360
Yeah.

33
00:01:20,480 --> 00:01:23,560
Um, what about the visual landmark navigation?

34
00:01:23,560 --> 00:01:24,320
How did they test that?

35
00:01:24,320 --> 00:01:28,800
So for that, they used a university campus with easy, medium and hard routes.

36
00:01:29,080 --> 00:01:33,200
So the LLM had to figure out how to get from one point to another using only

37
00:01:33,200 --> 00:01:35,200
descriptions of landmarks along the way.

38
00:01:35,520 --> 00:01:36,840
That sounds pretty challenging.

39
00:01:37,400 --> 00:01:39,960
So did these LLMs pass with flying colors?

40
00:01:40,400 --> 00:01:41,760
Can I ditch my Waze app yet?

41
00:01:42,120 --> 00:01:43,520
Well, uh, not quite.

42
00:01:43,640 --> 00:01:45,520
The results were actually pretty surprising.

43
00:01:45,600 --> 00:01:49,480
All three of the LLMs made a lot of errors and some of them were actually pretty serious.

44
00:01:49,600 --> 00:01:49,840
Okay.

45
00:01:49,840 --> 00:01:51,320
Now I'm really curious.

46
00:01:51,920 --> 00:01:53,440
What kind of errors are we talking about?

47
00:01:53,440 --> 00:01:57,480
Like, did they just miss a turn here and there or were things more dramatic?

48
00:01:57,560 --> 00:02:01,200
Well, some of the errors were kind of what you might expect, like missing an exit or

49
00:02:01,200 --> 00:02:02,320
taking a wrong turn.

50
00:02:02,360 --> 00:02:02,680
Yeah.

51
00:02:02,760 --> 00:02:08,080
But they also made some really concerning mistakes, like directing the user to drive

52
00:02:08,080 --> 00:02:08,760
off the road.

53
00:02:08,840 --> 00:02:11,640
Wait, seriously, like telling someone to drive off a cliff?

54
00:02:11,760 --> 00:02:13,480
How is that even possible?

55
00:02:13,760 --> 00:02:16,880
The researchers called these discontinuity errors.

56
00:02:16,960 --> 00:02:17,320
Okay.

57
00:02:17,640 --> 00:02:21,320
It's as if the LLMs were stringing together words and directions without really

58
00:02:21,320 --> 00:02:23,640
understanding the spatial relationships involved.

59
00:02:24,000 --> 00:02:27,760
They would assume a road continued even when it clearly ended on the map.

60
00:02:27,960 --> 00:02:31,920
So it's like they're missing that basic common sense understanding of how the

61
00:02:31,920 --> 00:02:33,040
physical world works.

62
00:02:33,720 --> 00:02:37,320
Like humans, we can look at a map and see that a road ends abruptly.

63
00:02:37,480 --> 00:02:37,800
Yeah.

64
00:02:37,840 --> 00:02:39,760
And we know that's not a place we should be driving.

65
00:02:40,000 --> 00:02:40,200
Yeah.

66
00:02:40,200 --> 00:02:41,920
But the LLMs just seem to ignore that.

67
00:02:42,000 --> 00:02:42,680
Exactly.

68
00:02:43,000 --> 00:02:44,760
And it wasn't just a one time thing.

69
00:02:44,880 --> 00:02:50,240
All three LLMs made these discontinuity errors in both the driving and walking tasks.

70
00:02:50,240 --> 00:02:51,480
Wow. That's kind of alarming.

71
00:02:51,800 --> 00:02:53,200
It makes you wonder what else they got wrong.

72
00:02:53,600 --> 00:02:56,840
Do the researchers give any specific examples of these fails like something we

73
00:02:56,840 --> 00:02:57,560
could really picture?

74
00:02:57,560 --> 00:02:57,920
They do.

75
00:02:57,920 --> 00:03:00,440
There's a great example using a suburban driving route.

76
00:03:00,480 --> 00:03:00,880
Okay.

77
00:03:01,120 --> 00:03:05,720
They compared the LLM-generated route to a route generated by Waze, you know, a

78
00:03:05,720 --> 00:03:07,000
popular navigation app.

79
00:03:07,200 --> 00:03:07,520
Right.

80
00:03:07,760 --> 00:03:10,760
Waze took 18 turns over 30 miles.

81
00:03:11,360 --> 00:03:16,480
GPT-4 on the other hand added an extra four miles of discontinuous road to its

82
00:03:16,480 --> 00:03:16,760
route.

83
00:03:16,760 --> 00:03:20,520
So if you were following GPT-4's directions, you'd literally be driving

84
00:03:20,520 --> 00:03:22,760
off road for a good chunk of that trip.

85
00:03:23,000 --> 00:03:24,520
I think I'll stick with my Waze for now.

86
00:03:24,760 --> 00:03:25,720
Probably a good idea.

87
00:03:25,880 --> 00:03:28,640
And this highlights one of the key findings of the paper.

88
00:03:29,240 --> 00:03:33,160
LLMs may be great at language, but they don't seem to truly understand the

89
00:03:33,160 --> 00:03:34,680
real world the way we do.

90
00:03:34,920 --> 00:03:35,200
Right.

91
00:03:35,320 --> 00:03:39,080
They're missing that spatial reasoning ability that allows us to navigate

92
00:03:39,080 --> 00:03:40,360
complex environments.

93
00:03:40,440 --> 00:03:41,560
That's a really important point.

94
00:03:41,560 --> 00:03:44,640
It's like they're treating navigation as a word puzzle rather than a spatial

95
00:03:44,640 --> 00:03:45,080
problem.

96
00:03:45,080 --> 00:03:45,560
Yeah.

97
00:03:45,560 --> 00:03:46,560
Yeah.

98
00:03:46,560 --> 00:03:47,960
But what about the time aspect?

99
00:03:48,200 --> 00:03:53,120
Like if I need to be somewhere at a certain time, can these LLMs factor that

100
00:03:53,120 --> 00:03:54,080
into their route planning?

101
00:03:54,160 --> 00:03:55,880
That's another area where they struggled.

102
00:03:55,960 --> 00:03:58,920
Only GPT-4 even attempted to meet time constraints.

103
00:03:59,360 --> 00:04:03,120
The researchers gave it prompts like, arrive two hours before a game, and it

104
00:04:03,120 --> 00:04:04,880
tried to take travel time into account.

105
00:04:04,880 --> 00:04:05,840
So at least it was trying.

106
00:04:06,320 --> 00:04:07,480
Did it manage to get the timing right?

107
00:04:07,720 --> 00:04:08,400
Not always.

108
00:04:09,040 --> 00:04:12,480
And the other two LLMs, Gemini and Mistral, couldn't incorporate time into

109
00:04:12,480 --> 00:04:13,440
their plans at all.

110
00:04:13,440 --> 00:04:17,640
They just focused on getting from point A to point B, regardless of how long it

111
00:04:17,640 --> 00:04:18,240
might take.

112
00:04:19,000 --> 00:04:22,880
So even though GPT-4 showed a little bit more awareness of time, it sounds like

113
00:04:22,880 --> 00:04:26,120
none of the LLMs are quite ready to replace our human-designed navigation

114
00:04:26,120 --> 00:04:26,560
systems.

115
00:04:26,600 --> 00:04:27,120
Right.

116
00:04:27,200 --> 00:04:32,240
But I'm curious, did any of the LLMs stand out as being better than the others?

117
00:04:32,760 --> 00:04:36,120
Was there a clear winner or did they all pretty much fail equally?

118
00:04:36,320 --> 00:04:39,600
It's interesting because they all failed in the sense that none of them could

119
00:04:39,600 --> 00:04:44,160
consistently plan accurate and safe paths, but there were some subtle differences

120
00:04:44,160 --> 00:04:45,400
in how they messed up.

121
00:04:45,480 --> 00:04:45,840
Okay.

122
00:04:45,960 --> 00:04:46,640
I'm all ears.

123
00:04:47,280 --> 00:04:49,480
What kind of nuances did the researchers pick up on?

124
00:04:49,520 --> 00:04:51,960
Did they like break down the types of errors in any way?

125
00:04:52,160 --> 00:04:53,080
Yeah, they did.

126
00:04:53,080 --> 00:04:56,640
They made a distinction between what they called major errors and minor errors,

127
00:04:56,640 --> 00:04:58,840
which I think is a helpful way to think about it.

128
00:04:58,880 --> 00:04:59,120
Okay.

129
00:04:59,120 --> 00:05:02,840
So major errors sound pretty self-explanatory.

130
00:05:03,080 --> 00:05:05,600
Things that could seriously mislead someone, maybe even put them in a

131
00:05:05,600 --> 00:05:06,640
dangerous situation.

132
00:05:06,840 --> 00:05:07,760
Exactly.

133
00:05:07,760 --> 00:05:12,360
Imagine being told to drive the wrong way down a one-way street or being

134
00:05:12,360 --> 00:05:14,720
directed to an exit that doesn't exist.

135
00:05:15,360 --> 00:05:18,600
Those are the kinds of things that fall into the major error category.

136
00:05:18,600 --> 00:05:19,240
Yes.

137
00:05:19,880 --> 00:05:21,320
I can see how those would be a problem.

138
00:05:21,880 --> 00:05:23,040
What about minor errors?

139
00:05:23,080 --> 00:05:26,520
Are those more like slight inconveniences or things you could easily

140
00:05:26,520 --> 00:05:27,200
recover from?

141
00:05:27,360 --> 00:05:28,080
Exactly.

142
00:05:28,080 --> 00:05:31,520
Think of it like a slight misdirection or an inaccuracy in the

143
00:05:31,520 --> 00:05:34,200
instructions that you could probably figure out pretty easily.

144
00:05:34,560 --> 00:05:37,480
Maybe the LLM tells you to turn left at the second traffic light.

145
00:05:37,480 --> 00:05:37,960
Right.

146
00:05:38,120 --> 00:05:39,440
But it's actually the third one.

147
00:05:39,480 --> 00:05:39,760
Okay.

148
00:05:39,760 --> 00:05:40,520
That makes sense.

149
00:05:40,840 --> 00:05:43,480
So we've already talked about those discontinuity errors, which definitely

150
00:05:43,480 --> 00:05:45,840
sound like they belong in the major category.

151
00:05:45,880 --> 00:05:46,280
Yeah.

152
00:05:46,400 --> 00:05:48,520
What other kinds of major errors did they find?

153
00:05:49,000 --> 00:05:52,880
Well, one that stood out, especially for the driving tasks, was directing the

154
00:05:52,880 --> 00:05:55,840
user to merge onto highways in the wrong direction.

155
00:05:56,640 --> 00:05:58,960
Imagine thinking you're taking the on-ramp to go north.

156
00:05:58,960 --> 00:05:59,720
Oh, wow.

157
00:05:59,880 --> 00:06:02,280
But you accidentally end up going south on a busy highway.

158
00:06:02,280 --> 00:06:03,400
Oh, that would be terrifying.

159
00:06:04,760 --> 00:06:07,400
Merging into oncoming traffic is definitely not something I want

160
00:06:07,400 --> 00:06:07,760
to try.

161
00:06:08,200 --> 00:06:08,640
Right.

162
00:06:08,640 --> 00:06:13,320
And then there were the classic wrong exit or mis-exit errors, which we've probably

163
00:06:13,320 --> 00:06:17,320
all experienced at some point, even with our tried-and-true GPS apps.

164
00:06:17,640 --> 00:06:20,920
It's frustrating enough when your GPS misses an exit and has to reroute you.

165
00:06:21,640 --> 00:06:25,160
But I imagine it would be even more unnerving if you knew it was an LLM

166
00:06:25,160 --> 00:06:28,000
making those mistakes, since they're still such a new technology.

167
00:06:28,040 --> 00:06:28,480
Right.

168
00:06:28,760 --> 00:06:33,800
It really highlights that these LLMs, as sophisticated as they are, still haven't

169
00:06:33,800 --> 00:06:37,480
mastered the art of real-world navigation.

170
00:06:37,480 --> 00:06:40,400
So it wasn't just that they were bad at understanding roads and maps.

171
00:06:40,920 --> 00:06:45,440
They also seemed to struggle with recognizing and describing those real-world

172
00:06:45,440 --> 00:06:47,280
landmarks for the walking tasks.

173
00:06:47,400 --> 00:06:48,160
Exactly.

174
00:06:48,320 --> 00:06:52,480
For the visual landmark navigation, they made major errors, like missing the

175
00:06:52,480 --> 00:06:56,200
destination completely or describing landmarks that didn't even exist.

176
00:06:56,480 --> 00:06:58,760
So it wasn't just a matter of giving bad directions.

177
00:06:58,760 --> 00:07:01,400
They were actually hallucinating landmarks that weren't there.

178
00:07:01,440 --> 00:07:02,400
It seems that way.

179
00:07:02,400 --> 00:07:03,960
And this points to a deeper issue.

180
00:07:04,440 --> 00:07:07,720
LLMs might be missing a fundamental understanding of how things are

181
00:07:07,720 --> 00:07:09,720
spatially related in the physical world.

182
00:07:09,960 --> 00:07:10,320
Right.

183
00:07:10,680 --> 00:07:14,640
It makes you wonder if they're trying to navigate based on word associations

184
00:07:14,880 --> 00:07:17,000
rather than actual spatial awareness.

185
00:07:17,600 --> 00:07:21,680
Like if they've seen the words statue and fountain together in their training data,

186
00:07:22,000 --> 00:07:25,440
maybe they assume those things are always located next to each other in the real world.

187
00:07:25,480 --> 00:07:27,000
That's a really interesting point.

188
00:07:27,040 --> 00:07:30,320
And it kind of gets to the heart of the difference between generating

189
00:07:30,320 --> 00:07:35,200
language that sounds coherent versus truly understanding the underlying

190
00:07:35,200 --> 00:07:38,000
concepts and relationships behind those words.

191
00:07:38,000 --> 00:07:39,520
It's like they're playing a language game.

192
00:07:39,680 --> 00:07:40,080
Right.

193
00:07:40,160 --> 00:07:44,400
But they haven't quite grasped the rules of the real world game board,

194
00:07:45,240 --> 00:07:48,280
which brings us back to that issue of transparency we talked about earlier.

195
00:07:48,800 --> 00:07:51,600
If these LLMs don't even realize they're making these kinds of errors,

196
00:07:51,800 --> 00:07:53,800
how can we trust them to guide us safely?

197
00:07:53,960 --> 00:07:55,120
It's a valid concern.

198
00:07:55,120 --> 00:07:58,640
And the researchers actually tried to get a handle on this by figuring out how

199
00:07:58,640 --> 00:08:02,840
much driver knowledge would be needed to successfully navigate the routes

200
00:08:02,840 --> 00:08:04,280
generated by the LLMs.

201
00:08:04,400 --> 00:08:04,680
OK.

202
00:08:04,680 --> 00:08:08,560
So they were basically trying to assess whether you would need to be an experienced

203
00:08:08,560 --> 00:08:11,480
driver to avoid getting hopelessly lost.

204
00:08:11,720 --> 00:08:12,480
Exactly.

205
00:08:12,520 --> 00:08:17,040
They categorized the routes as requiring either beginner, intermediate, or expert

206
00:08:17,040 --> 00:08:21,560
driving knowledge based on how severe the errors were and where they occurred.

207
00:08:21,840 --> 00:08:25,960
So if a route was riddled with major errors, you'd probably need to be a pretty

208
00:08:25,960 --> 00:08:28,840
savvy driver to figure out how to get back on track.

209
00:08:28,840 --> 00:08:29,360
Precisely.

210
00:08:29,360 --> 00:08:32,800
And for some of the LLMs, even an expert driver would have had a tough time

211
00:08:32,800 --> 00:08:34,120
making sense of the directions.

212
00:08:34,320 --> 00:08:38,880
It really underscores how much we humans take for granted when it comes to

213
00:08:38,880 --> 00:08:39,920
navigating the world.

214
00:08:40,680 --> 00:08:44,320
We can look at a map, see that a road doesn't connect, and instinctively

215
00:08:44,320 --> 00:08:45,440
know that something's wrong.

216
00:08:45,480 --> 00:08:45,880
Right.

217
00:08:46,280 --> 00:08:49,840
But for these LLMs, that common-sense awareness just isn't there yet.

218
00:08:49,960 --> 00:08:53,560
It definitely highlights the gap between artificial intelligence and human

219
00:08:53,560 --> 00:08:57,040
intelligence, especially when it comes to understanding and interacting with the

220
00:08:57,040 --> 00:08:57,800
physical world.

221
00:08:57,880 --> 00:09:00,840
So we've talked a lot about the types of errors these LLMs made.

222
00:09:00,880 --> 00:09:01,240
Right.

223
00:09:01,360 --> 00:09:03,440
But I'm curious about their overall performance.

224
00:09:03,520 --> 00:09:03,720
Yeah.

225
00:09:03,800 --> 00:09:06,680
Was there one that consistently did better than the others, or did they all pretty

226
00:09:06,680 --> 00:09:07,640
much bomb the test?

227
00:09:07,880 --> 00:09:10,680
Well, like I said earlier, they all failed in the sense that none of them were

228
00:09:10,680 --> 00:09:13,240
reliable enough for real world navigation.

229
00:09:13,560 --> 00:09:13,840
Right.

230
00:09:13,880 --> 00:09:16,560
But there were some subtle differences in their performance.

231
00:09:16,960 --> 00:09:19,520
Some glimmers of potential, you might say.

232
00:09:19,800 --> 00:09:21,320
Ooh, intriguing.

233
00:09:21,320 --> 00:09:23,240
Tell me more about these glimmers of potential.

234
00:09:23,880 --> 00:09:27,680
Did any of the LLMs show particular strengths in certain areas?

235
00:09:28,040 --> 00:09:32,480
For the driving tasks, GPT-4 seemed to have a slight edge, especially in those

236
00:09:32,480 --> 00:09:36,840
suburban and rural settings with longer distances and more complex road networks.

237
00:09:36,880 --> 00:09:41,080
So maybe it was a bit better at handling those trickier routes, perhaps it

238
00:09:41,080 --> 00:09:44,920
gleaned more knowledge about road systems from its vast training data?

239
00:09:45,160 --> 00:09:46,160
That's certainly possible.

240
00:09:46,200 --> 00:09:50,000
But it's crucial to remember that GPT-4 still made plenty of errors, including

241
00:09:50,000 --> 00:09:51,440
those discontinuity errors.

242
00:09:51,640 --> 00:09:54,080
It wasn't perfect by any stretch of the imagination.

243
00:09:54,120 --> 00:09:54,360
Right.

244
00:09:54,360 --> 00:09:56,720
So not quite ready to replace my Google Maps just yet.

245
00:09:56,920 --> 00:09:58,120
What about the other LLMs?

246
00:09:58,120 --> 00:10:00,640
Did Gemini or Mistral show any strengths?

247
00:10:00,880 --> 00:10:05,720
Interestingly, Gemini actually performed the best overall on the visual landmark tasks.

248
00:10:06,000 --> 00:10:06,320
Really?

249
00:10:07,080 --> 00:10:07,680
That's surprising.

250
00:10:08,000 --> 00:10:11,400
I would have thought that navigating by landmarks would be even more challenging

251
00:10:11,400 --> 00:10:13,000
than following road directions.

252
00:10:13,240 --> 00:10:15,240
What made Gemini better at that?

253
00:10:15,280 --> 00:10:18,960
Well, remember that Gemini was specifically designed with a

254
00:10:18,960 --> 00:10:23,200
focus on understanding context and following instructions.

255
00:10:23,200 --> 00:10:23,480
Right.

256
00:10:23,720 --> 00:10:27,080
So perhaps those capabilities gave it a bit of an advantage when it came to

257
00:10:27,080 --> 00:10:29,400
navigating using landmark descriptions.

258
00:10:29,440 --> 00:10:30,120
That makes sense.

259
00:10:30,680 --> 00:10:35,320
Being able to accurately interpret those landmark descriptions is essential for that task.

260
00:10:35,320 --> 00:10:35,640
Yeah.

261
00:10:36,240 --> 00:10:39,720
But even though Gemini did better with landmarks, it still wasn't perfect, right?

262
00:10:39,840 --> 00:10:40,480
Exactly.

263
00:10:40,480 --> 00:10:44,760
It still made significant errors, especially on those more challenging routes where

264
00:10:44,800 --> 00:10:48,600
the landmarks were more spread out or the descriptions were more ambiguous.

265
00:10:48,600 --> 00:10:52,920
So the bottom line is that none of these LLMs were consistently reliable enough to

266
00:10:52,920 --> 00:10:54,880
trust with real world navigation.

267
00:10:55,080 --> 00:10:57,400
That's the key takeaway from this research.

268
00:10:57,480 --> 00:11:01,400
And while it might seem like a bit of a setback for the LLM hype train, it's actually

269
00:11:01,400 --> 00:11:02,920
a really valuable finding.

270
00:11:03,360 --> 00:11:07,920
By identifying these limitations, it sets the stage for future research to focus on

271
00:11:07,920 --> 00:11:10,720
improving LLM capabilities in this crucial area.

272
00:11:10,880 --> 00:11:12,160
It's like any new technology.

273
00:11:12,600 --> 00:11:15,280
There are going to be growing pains and unexpected challenges.

274
00:11:15,640 --> 00:11:15,960
Right.

275
00:11:15,960 --> 00:11:20,280
But by understanding where these LLMs are falling short, we can start to develop

276
00:11:20,280 --> 00:11:22,720
solutions and make them more robust and reliable.

277
00:11:22,760 --> 00:11:23,360
Exactly.

278
00:11:23,400 --> 00:11:26,280
It's all about progress, not perfection.

279
00:11:26,760 --> 00:11:29,320
And that's what makes this kind of research so important.

280
00:11:29,680 --> 00:11:34,640
It's not just about showcasing what LLMs can do, but also about honestly assessing

281
00:11:34,640 --> 00:11:38,960
their limitations so that we can move forward in a responsible and informed way.

282
00:11:39,480 --> 00:11:42,680
We've covered a lot of ground here, and I'm sure our listeners are eager to hear

283
00:11:42,680 --> 00:11:46,360
what the researchers propose for the future of LLMs and navigation.

284
00:11:47,400 --> 00:11:50,240
What are some of the key takeaways and recommendations they offer?

285
00:11:50,680 --> 00:11:54,920
They actually had some really thought-provoking ideas about how to make LLMs

286
00:11:54,920 --> 00:11:58,840
better navigators, but we'll dive into those right after a quick message.

287
00:11:58,960 --> 00:11:59,560
Stay tuned.

288
00:11:59,800 --> 00:12:02,040
Welcome back to the AI Papers podcast daily.

289
00:12:02,560 --> 00:12:05,640
We've been talking about these large language models and how they're not quite

290
00:12:05,640 --> 00:12:07,680
ready to be our personal chauffeurs just yet.

291
00:12:08,240 --> 00:12:11,320
But you know, the paper we're discussing didn't just point out the problems.

292
00:12:11,320 --> 00:12:15,400
They also offered some really interesting ideas about how to improve LLMs for navigation.

293
00:12:15,560 --> 00:12:16,480
Yeah, that's right.

294
00:12:16,520 --> 00:12:20,320
It's not enough to just say, hey, these LLMs aren't very good at getting around.

295
00:12:20,720 --> 00:12:24,600
You know, the researchers really dug into what's causing these errors and came up

296
00:12:24,600 --> 00:12:28,920
with some concrete suggestions for how to make LLMs better navigators.

297
00:12:28,960 --> 00:12:30,720
Okay, I'm ready for some solutions.

298
00:12:31,160 --> 00:12:33,000
What's the first big idea they propose?

299
00:12:33,600 --> 00:12:37,960
One that really stood out to me was the idea of building in what they call reality checks.

300
00:12:38,200 --> 00:12:38,560
Okay.

301
00:12:38,560 --> 00:12:42,400
Right now, LLMs tend to operate in this kind of closed loop.

302
00:12:42,720 --> 00:12:46,520
They take in text data, they process it and they generate more text, but they don't

303
00:12:46,520 --> 00:12:49,520
always cross-reference their output with the real world.

304
00:12:49,560 --> 00:12:52,840
So if they tell you to turn left onto a road that doesn't exist, there's no part of

305
00:12:52,840 --> 00:12:54,960
the LLM that's like, wait a minute, that can't be right.

306
00:12:54,960 --> 00:12:55,680
Exactly.

307
00:12:55,680 --> 00:12:59,560
They're missing that layer of common sense that humans have where we can look at a map

308
00:12:59,600 --> 00:13:02,920
or think about our surroundings and say, this doesn't make sense.

309
00:13:03,440 --> 00:13:07,560
So the researchers suggest giving LLMs the ability to check their output against

310
00:13:07,560 --> 00:13:09,120
external data sources.

311
00:13:09,360 --> 00:13:14,440
So like having the LLM double check its directions against a real time map or maybe

312
00:13:14,440 --> 00:13:18,560
even traffic data, that could definitely help prevent those discontinuity errors where

313
00:13:18,560 --> 00:13:20,040
they send you driving off a cliff.

314
00:13:20,080 --> 00:13:20,880
Exactly.

315
00:13:20,880 --> 00:13:22,720
And it goes beyond just maps.

316
00:13:23,080 --> 00:13:26,640
They could also incorporate things like weather reports, business hours, or even

317
00:13:26,640 --> 00:13:31,880
reviews to see if a particular route is known for being dangerous or difficult to navigate.

318
00:13:32,240 --> 00:13:36,600
It's about giving the LLM a way to ground its output in the real world and make sure

319
00:13:36,600 --> 00:13:37,880
it's actually feasible.

320
00:13:38,040 --> 00:13:39,000
That makes a lot of sense.

321
00:13:39,360 --> 00:13:43,680
It's like giving them a dose of that human skepticism that we often take for granted.

322
00:13:43,800 --> 00:13:44,840
I like that analogy.

323
00:13:45,000 --> 00:13:49,600
And speaking of skepticism, the second big idea they propose is all about increasing

324
00:13:49,600 --> 00:13:50,480
transparency.

325
00:13:50,560 --> 00:13:50,920
Okay.

326
00:13:51,000 --> 00:13:55,560
Right now LLMs can be a bit overconfident even when they're tackling tasks they haven't

327
00:13:55,560 --> 00:13:56,600
quite mastered yet.

328
00:13:56,800 --> 00:13:59,760
It's like they haven't learned that sometimes it's okay to say, I don't know.

329
00:13:59,760 --> 00:13:59,960
Right.

330
00:13:59,960 --> 00:14:05,400
So the researchers are saying that future LLMs need to be more upfront about their limitations.

331
00:14:05,400 --> 00:14:08,640
Instead of just spitting out directions, they should be able to say something like, Hey,

332
00:14:08,640 --> 00:14:10,240
I'm not very familiar with this area.

333
00:14:10,240 --> 00:14:12,360
So you might want to double check these directions.

334
00:14:12,520 --> 00:14:13,080
I love that.

335
00:14:13,320 --> 00:14:19,320
It's about empowering the user to make informed decisions about how much they trust the LLM.

336
00:14:19,960 --> 00:14:23,880
It's that honesty and transparency that will ultimately make these systems more reliable.

337
00:14:23,920 --> 00:14:24,600
Absolutely.

338
00:14:25,040 --> 00:14:29,000
And the final idea they put forward is something we've touched on already, the potential of

339
00:14:29,000 --> 00:14:31,120
smaller, more specialized models.

340
00:14:31,120 --> 00:14:36,280
Maybe trying to create these giant, all-knowing LLMs isn't the best approach, especially when

341
00:14:36,280 --> 00:14:40,640
it comes to tasks that require very specific skills like navigation.

342
00:14:40,760 --> 00:14:43,200
It's like the old saying, jack of all trades, master of none.

343
00:14:43,240 --> 00:14:43,600
Yeah.

344
00:14:44,120 --> 00:14:49,400
Maybe we need to focus on creating LLMs that are experts in specific domains like navigation.

345
00:14:49,440 --> 00:14:50,160
Exactly.

346
00:14:50,320 --> 00:14:55,240
So the researchers suggest exploring the development of smaller LLMs that are trained specifically

347
00:14:55,240 --> 00:14:56,280
for navigation.

348
00:14:56,320 --> 00:14:56,720
Yeah.

349
00:14:56,720 --> 00:15:01,160
Using data sets and algorithms that are tailored to spatial reasoning and route planning.

350
00:15:01,200 --> 00:15:02,080
That makes a lot of sense.

351
00:15:02,080 --> 00:15:04,600
If you're building a house, you want a general contractor.

352
00:15:04,800 --> 00:15:04,960
Yeah.

353
00:15:04,960 --> 00:15:07,040
But if you need brain surgery, you want a neurosurgeon.

354
00:15:07,360 --> 00:15:07,680
Yeah.

355
00:15:07,960 --> 00:15:09,680
It's about finding the right tool for the job.

356
00:15:09,720 --> 00:15:11,080
I couldn't have said it better myself.

357
00:15:11,600 --> 00:15:16,160
So while this research might seem like a bit of a reality check for LLMs, it's actually

358
00:15:16,160 --> 00:15:17,360
incredibly valuable.

359
00:15:17,400 --> 00:15:17,720
Right.

360
00:15:17,760 --> 00:15:21,560
It's highlighting the areas where we need to focus our efforts to make these systems

361
00:15:21,560 --> 00:15:25,520
truly useful and reliable for real-world navigation.

362
00:15:25,560 --> 00:15:26,560
Yeah, it's been a fun one.

363
00:15:26,560 --> 00:15:27,920
It's been a fascinating deep dive.

364
00:15:27,960 --> 00:15:28,240
Yeah.

365
00:15:28,400 --> 00:15:33,000
It's clear that LLMs have a lot of potential, but they're not quite ready to replace our

366
00:15:33,000 --> 00:15:34,880
trusty GPS apps just yet.

367
00:15:34,920 --> 00:15:35,240
Right.

368
00:15:35,320 --> 00:15:39,440
But by highlighting these challenges and proposing solutions, this research is paving the way

369
00:15:39,440 --> 00:15:43,760
for a future where LLMs can truly enhance our ability to navigate the world.

370
00:15:44,080 --> 00:15:45,240
I completely agree.

371
00:15:45,600 --> 00:15:49,200
And I think it's a reminder that AI development is an ongoing process.

372
00:15:49,640 --> 00:15:54,440
It's not about creating perfect systems overnight, but about constantly iterating, learning from

373
00:15:54,440 --> 00:15:58,680
our mistakes and striving to make these technologies more beneficial and reliable for everyone.

374
00:15:58,920 --> 00:16:01,040
And that's what makes this field so exciting.

375
00:16:01,400 --> 00:16:04,600
There's always something new to discover and new challenges to overcome.

376
00:16:04,920 --> 00:16:09,920
So keep those questions coming and stay tuned for our next deep dive into the world of AI research.

377
00:16:09,960 --> 00:16:14,920
Until then, keep exploring, stay curious, and maybe double check those LLM-generated

378
00:16:14,920 --> 00:16:26,760
directions before you hit the road.