1
00:00:00,000 --> 00:00:06,160
All right, everyone, buckle up because today we're going deep into the world of robot foundation

2
00:00:06,160 --> 00:00:08,640
models. It's going to be a wild ride.

3
00:00:08,640 --> 00:00:09,480
That's fun.

4
00:00:09,480 --> 00:00:14,600
So basically we're talking about robots that can learn like a whole bunch of different tasks, kind of like humans can.

5
00:00:15,360 --> 00:00:22,120
Imagine robots that can do your laundry, you know, put together that crazy IKEA furniture, maybe even cook you dinner.

6
00:00:22,400 --> 00:00:24,840
It's like something straight out of science fiction.

7
00:00:24,880 --> 00:00:25,400
Yeah.

8
00:00:25,520 --> 00:00:29,640
But it's becoming reality thanks to some pretty amazing research.

9
00:00:29,640 --> 00:00:31,480
And that's exactly what we're diving into today.

10
00:00:31,800 --> 00:00:36,280
We've got this paper from a company called Physical Intelligence.

11
00:00:36,680 --> 00:00:41,000
They're like the pioneers of robot learning and their work is blowing my mind.

12
00:00:41,000 --> 00:00:42,880
Yeah, they're doing some really groundbreaking stuff.

13
00:00:43,040 --> 00:00:47,760
The paper is called "A Vision-Language-Action Flow Model for General Robot Control."

14
00:00:47,960 --> 00:00:48,400
Okay.

15
00:00:48,400 --> 00:00:49,160
So fair warning.

16
00:00:49,400 --> 00:00:50,120
It's a little dense.

17
00:00:50,120 --> 00:00:50,880
It's a bit of a mouthful.

18
00:00:50,960 --> 00:00:52,560
But don't worry, that's what we're here for.

19
00:00:52,800 --> 00:00:53,840
We're going to break it all down.

20
00:00:54,120 --> 00:00:54,320
Yeah.

21
00:00:54,320 --> 00:00:56,640
Make sure you come away feeling like a robot expert.

22
00:00:56,640 --> 00:00:57,480
Exactly.

23
00:00:57,640 --> 00:00:58,720
We'll guide you through it.

24
00:00:58,720 --> 00:01:00,280
So let's start with the big picture.

25
00:01:00,520 --> 00:01:01,840
What's the ultimate goal here?

26
00:01:02,040 --> 00:01:06,360
The goal is to create robots that are as versatile and adaptable as humans.

27
00:01:06,560 --> 00:01:06,840
Right.

28
00:01:07,000 --> 00:01:12,880
Like robots that can actually think on their feet and handle different situations, not just those pre-programmed robots we're used to.

29
00:01:12,880 --> 00:01:13,400
Exactly.

30
00:01:13,400 --> 00:01:17,760
We want robots that can handle a wide range of tasks in all sorts of different environments.

31
00:01:17,760 --> 00:01:22,080
So robots that aren't just specialists stuck doing one thing over and over again.

32
00:01:22,160 --> 00:01:22,440
Yeah.

33
00:01:22,480 --> 00:01:26,120
Like robots that aren't just amazing at chess, but can't make you a sandwich.

34
00:01:26,200 --> 00:01:27,200
That's a great analogy.

35
00:01:27,200 --> 00:01:31,360
It's like having a supercomputer that can be a grandmaster, but can't even boil water.

36
00:01:31,400 --> 00:01:32,280
Exactly.

37
00:01:32,400 --> 00:01:36,920
So how do we get from these limited robots to the kind of adaptable robots we're talking about?

38
00:01:36,960 --> 00:01:39,480
Well, that's where these robot foundation models come in.

39
00:01:39,760 --> 00:01:43,200
They're designed to be like the AI equivalent of a generalist.

40
00:01:43,240 --> 00:01:43,640
OK.

41
00:01:44,120 --> 00:01:44,840
I'm intrigued.

42
00:01:45,280 --> 00:01:47,400
Tell me more about these robot foundation models.

43
00:01:47,400 --> 00:01:53,440
So basically they're trained on a massive amount of data to learn a broad set of skills and knowledge.

44
00:01:53,440 --> 00:01:59,080
So like instead of teaching a robot how to do one specific task, they're teaching it a whole bunch of different things.

45
00:01:59,320 --> 00:01:59,760
Right.

46
00:01:59,920 --> 00:02:07,360
And the idea is that this broad training allows them to adapt to new situations and learn new tasks much more easily.

47
00:02:07,400 --> 00:02:07,600
OK.

48
00:02:07,600 --> 00:02:08,320
That makes sense.

49
00:02:08,600 --> 00:02:12,280
So it's like giving a robot a well-rounded education in robotting.

50
00:02:12,800 --> 00:02:13,520
You could say that.

51
00:02:13,640 --> 00:02:16,800
Now the specific model we're focusing on today is called Pi Zero.

52
00:02:16,920 --> 00:02:19,120
Or π0, for all you math fans out there.

53
00:02:19,200 --> 00:02:19,760
Catchy name.

54
00:02:19,760 --> 00:02:24,320
But the big question is, how do they make a robot that can actually learn like this?

55
00:02:24,400 --> 00:02:24,720
Yeah.

56
00:02:24,720 --> 00:02:25,720
What's the secret sauce?

57
00:02:25,760 --> 00:02:26,720
What's going on under the hood?

58
00:02:26,920 --> 00:02:30,040
Well, there are a few key ingredients that make Pi Zero so special.

59
00:02:30,240 --> 00:02:30,600
OK.

60
00:02:30,880 --> 00:02:31,440
Lay it on me.

61
00:02:31,640 --> 00:02:35,840
First up, we've got vision-language models, or VLMs for short.

62
00:02:36,080 --> 00:02:36,960
VLMs.

63
00:02:37,000 --> 00:02:37,280
Yeah.

64
00:02:37,280 --> 00:02:37,920
I've heard of those.

65
00:02:38,400 --> 00:02:41,920
Are those like the AI systems that can create images from text descriptions?

66
00:02:41,960 --> 00:02:42,680
Yeah, exactly.

67
00:02:42,680 --> 00:02:44,800
Or those chatbots that can hold conversations.

68
00:02:44,880 --> 00:02:45,120
Yeah.

69
00:02:45,120 --> 00:02:46,920
I've seen those making the rounds online.

70
00:02:46,960 --> 00:02:48,040
They're pretty mind blowing.

71
00:02:48,080 --> 00:02:48,600
They are.

72
00:02:48,600 --> 00:02:51,720
And Pi Zero actually uses a VLM as its brain.

73
00:02:52,240 --> 00:02:56,160
So it can process both images and language, which is crucial for understanding complex

74
00:02:56,280 --> 00:02:57,520
tasks and instructions.

75
00:02:57,600 --> 00:03:00,400
So it's like it can see the world and understand what we're telling it to do.

76
00:03:00,760 --> 00:03:01,080
OK.

77
00:03:01,080 --> 00:03:03,280
So it's got the seeing and understanding part down.

78
00:03:04,200 --> 00:03:05,560
What about the doing part?

79
00:03:05,640 --> 00:03:09,040
How does it actually translate all that information into movement?

80
00:03:09,600 --> 00:03:13,280
That's where the next two ingredients come in: flow matching and action chunking.

81
00:03:13,320 --> 00:03:14,960
Flow matching and action chunking.

82
00:03:15,640 --> 00:03:16,040
All right.

83
00:03:16,040 --> 00:03:17,000
Those sound interesting.

84
00:03:17,000 --> 00:03:18,080
Break it down for me.

85
00:03:18,120 --> 00:03:21,960
So flow matching is a bit complex, but imagine trying to predict how a drop of

86
00:03:21,960 --> 00:03:23,600
water will flow down a surface.

87
00:03:23,640 --> 00:03:23,880
OK.

88
00:03:23,880 --> 00:03:24,880
I can visualize that.

89
00:03:24,920 --> 00:03:26,360
It's kind of like that for actions.

90
00:03:26,360 --> 00:03:30,560
Pi Zero uses flow matching to predict a smooth sequence of movements rather than

91
00:03:30,560 --> 00:03:32,360
just jerky individual steps.

92
00:03:32,360 --> 00:03:35,960
So it's like it's planning out its movements in a more fluid and natural way.

93
00:03:36,000 --> 00:03:36,240
Right.

94
00:03:36,240 --> 00:03:38,960
And to do that efficiently, it uses action chunking.

95
00:03:39,000 --> 00:03:39,960
Action chunking.

96
00:03:40,240 --> 00:03:41,080
What's that all about?

97
00:03:41,400 --> 00:03:45,520
So instead of planning every tiny little movement separately, it breaks down tasks

98
00:03:45,520 --> 00:03:47,320
into larger chunks of actions.

99
00:03:47,840 --> 00:03:48,000
Hmm.

100
00:03:48,400 --> 00:03:48,960
Interesting.

101
00:03:49,320 --> 00:03:52,240
So it's like learning a dance routine instead of each individual step.

102
00:03:52,280 --> 00:03:52,960
Exactly.

103
00:03:52,960 --> 00:03:56,400
It's a much more efficient and natural way for the robot to learn and

104
00:03:56,400 --> 00:03:57,840
execute complex movements.

105
00:03:58,000 --> 00:04:00,440
So this Pi Zero model is starting to sound pretty impressive.

106
00:04:00,440 --> 00:04:03,200
It can see, understand, and move in a sophisticated way.

107
00:04:03,640 --> 00:04:05,320
It's a pretty remarkable system.

108
00:04:05,360 --> 00:04:07,480
But even the most talented dancers need training.

109
00:04:07,800 --> 00:04:08,480
Right.

110
00:04:08,840 --> 00:04:11,000
What's robot boot camp like for Pi Zero?

111
00:04:11,040 --> 00:04:12,960
Well, it involves two main phases.

112
00:04:12,960 --> 00:04:14,320
First, there's pre-training.

113
00:04:14,320 --> 00:04:15,600
Pre-training.

114
00:04:15,640 --> 00:04:15,880
Okay.

115
00:04:15,880 --> 00:04:17,200
What goes on in robot preschool?

116
00:04:17,200 --> 00:04:18,120
Well, imagine this.

117
00:04:18,320 --> 00:04:21,760
They fed this model over 10,000 hours of data.

118
00:04:22,000 --> 00:04:23,440
10,000 hours.

119
00:04:23,480 --> 00:04:24,120
Whoa.

120
00:04:24,160 --> 00:04:26,560
That's like a whole lot of robot binge watching.

121
00:04:27,000 --> 00:04:28,680
What kind of data are we talking about here?

122
00:04:28,920 --> 00:04:33,400
It's a mix of their own custom data sets, which they gathered from various robots

123
00:04:33,400 --> 00:04:37,400
doing all sorts of tasks and public data sets, like one called OXE.

124
00:04:37,680 --> 00:04:38,400
OXE.

125
00:04:38,600 --> 00:04:39,040
Interesting.

126
00:04:39,040 --> 00:04:43,920
So they're really exposing Pi Zero to a wide range of robots, environments and tasks.

127
00:04:43,920 --> 00:04:44,760
Exactly.

128
00:04:44,760 --> 00:04:47,800
And the tasks have varying levels of quality.

129
00:04:47,800 --> 00:04:52,200
So it's like teaching it the basic alphabet of movement before it starts writing novels.

130
00:04:52,240 --> 00:04:56,320
So it's building up a foundation of knowledge about how to move and interact with the world.

131
00:04:56,520 --> 00:04:56,920
Right.

132
00:04:56,920 --> 00:04:57,200
Okay.

133
00:04:57,200 --> 00:04:59,400
So it goes through this massive pre-training phase.

134
00:04:59,440 --> 00:04:59,760
Yeah.

135
00:04:59,760 --> 00:05:00,280
Then what?

136
00:05:00,280 --> 00:05:01,640
Then comes post-training.

137
00:05:02,200 --> 00:05:03,680
Now it's time to specialize.

138
00:05:04,200 --> 00:05:09,920
They use smaller, more focused data sets to fine-tune Pi Zero for specific tasks.

139
00:05:09,920 --> 00:05:14,200
Ah, so it's like a pre-trained athlete now training for a specific event.

140
00:05:14,240 --> 00:05:15,040
Exactly.

141
00:05:15,040 --> 00:05:15,360
Okay.

142
00:05:15,360 --> 00:05:16,080
That makes sense.

143
00:05:16,640 --> 00:05:20,400
So we've got this super trained robot ready to show off its skills.

144
00:05:21,040 --> 00:05:22,520
I'm ready for the robot talent show.

145
00:05:23,640 --> 00:05:25,320
What kind of tasks did they put it through?

146
00:05:25,320 --> 00:05:27,640
So what kind of robot Olympics are we talking about here?

147
00:05:27,680 --> 00:05:33,320
Well, they had Pi Zero doing all sorts of things like stacking bowls, folding towels,

148
00:05:33,320 --> 00:05:35,120
even putting Tupperware in the microwave.

149
00:05:35,120 --> 00:05:35,640
Wait a minute.

150
00:05:35,640 --> 00:05:37,280
Putting Tupperware in the microwave?

151
00:05:37,280 --> 00:05:38,120
You heard that right.

152
00:05:38,120 --> 00:05:41,320
So it can handle those pesky microwave buttons and stuff?

153
00:05:41,360 --> 00:05:41,840
Yep.

154
00:05:41,840 --> 00:05:44,800
They tested it on all sorts of real-world scenarios.

155
00:05:44,840 --> 00:05:45,760
That's pretty wild.

156
00:05:46,040 --> 00:05:46,360
Okay.

157
00:05:46,360 --> 00:05:47,480
So how did Pi Zero do?

158
00:05:47,480 --> 00:05:48,840
Was it a robot superstar?

159
00:05:49,000 --> 00:05:49,920
I'd say so.

160
00:05:50,200 --> 00:05:54,520
They were particularly impressed with how it handled tasks that required understanding

161
00:05:54,520 --> 00:05:55,640
language instructions.

162
00:05:55,640 --> 00:05:58,040
Like actually understanding what we want it to do.

163
00:05:58,040 --> 00:05:58,560
Exactly.

164
00:05:58,560 --> 00:06:02,760
For example, they had it busing a table, which involved figuring out what was trash,

165
00:06:03,000 --> 00:06:05,560
what were dishes, and where everything needed to go.

166
00:06:05,560 --> 00:06:06,080
Wow.

167
00:06:06,080 --> 00:06:07,240
That's actually really impressive.

168
00:06:07,240 --> 00:06:11,800
It is, and it shows how important those VLMs are for making sense of human language.

169
00:06:11,800 --> 00:06:12,280
Right.

170
00:06:12,280 --> 00:06:13,520
So it's not just seeing and moving.

171
00:06:13,520 --> 00:06:15,760
It's actually understanding what we're asking it to do.

172
00:06:15,760 --> 00:06:16,680
Exactly.

173
00:06:16,680 --> 00:06:17,000
Okay.

174
00:06:17,000 --> 00:06:21,280
So it sounds like Pi Zero is pretty good at following instructions, but even superstars

175
00:06:21,280 --> 00:06:23,000
have their weaknesses, right?

176
00:06:23,360 --> 00:06:25,240
Were there any tasks where it struggled?

177
00:06:25,800 --> 00:06:28,760
There were definitely some challenges, and that's where it gets really interesting,

178
00:06:28,760 --> 00:06:33,400
because the tasks that Pi Zero struggled with actually tell us a lot about the current

179
00:06:33,400 --> 00:06:35,840
limitations of robot foundation models.

180
00:06:35,840 --> 00:06:36,280
Okay.

181
00:06:36,280 --> 00:06:37,440
I'm all ears.

182
00:06:37,440 --> 00:06:39,320
Tell me about these robot struggles.

183
00:06:39,320 --> 00:06:44,200
Well, one area where it had some trouble was tasks that require really deep understanding

184
00:06:44,200 --> 00:06:45,160
of physics.

185
00:06:45,160 --> 00:06:45,840
Physics.

186
00:06:46,360 --> 00:06:46,720
Okay.

187
00:06:46,720 --> 00:06:47,840
Now that sounds tricky.

188
00:06:47,840 --> 00:06:49,080
Can you give me an example?

189
00:06:49,080 --> 00:06:49,440
Sure.

190
00:06:49,440 --> 00:06:54,560
Imagine a task where the robot has to pour liquid from one container to another.

191
00:06:54,560 --> 00:06:56,560
It's not just about moving the container.

192
00:06:56,560 --> 00:06:56,880
Yeah.

193
00:06:56,880 --> 00:07:01,760
You know, it's about understanding how the liquid will flow, how much force to apply,

194
00:07:01,760 --> 00:07:03,000
how to avoid spills.

195
00:07:03,320 --> 00:07:03,920
I see.

196
00:07:03,920 --> 00:07:05,680
So it's not just about learning the steps.

197
00:07:05,680 --> 00:07:08,800
It's about understanding the underlying physics of the situation.

198
00:07:08,800 --> 00:07:09,640
Exactly.

199
00:07:09,640 --> 00:07:13,600
And those kinds of nuances are still difficult for robots to grasp.

200
00:07:13,960 --> 00:07:14,800
That makes sense.

201
00:07:15,120 --> 00:07:19,160
So it sounds like we're still a ways off from robots that can replace our baristas.

202
00:07:19,160 --> 00:07:19,760
Yeah.

203
00:07:19,760 --> 00:07:21,760
For now, I think our baristas are safe.

204
00:07:21,760 --> 00:07:23,680
But it's still a huge step forward, right?

205
00:07:23,680 --> 00:07:26,680
It is, especially when you compare it to other robot models out there.

206
00:07:27,160 --> 00:07:29,760
So how does Pi Zero stack up against the competition?

207
00:07:29,760 --> 00:07:33,760
Well, one of the things that really stood out was its zero shot performance.

208
00:07:33,760 --> 00:07:34,760
Zero shot.

209
00:07:34,760 --> 00:07:35,760
What does that mean?

210
00:07:35,760 --> 00:07:39,760
It means it can perform tasks that it's never been specifically trained for.

211
00:07:39,760 --> 00:07:40,760
No way.

212
00:07:40,760 --> 00:07:42,760
So it can just figure things out on the fly.

213
00:07:42,760 --> 00:07:43,760
Pretty much.

214
00:07:43,760 --> 00:07:46,760
And it actually did this remarkably well, much better than other models.

215
00:07:46,760 --> 00:07:47,760
Wow.

216
00:07:47,760 --> 00:07:49,760
So it's not just a one-trick pony.

217
00:07:49,760 --> 00:07:53,760
It can actually generalize its knowledge and skills to new situations.

218
00:07:53,760 --> 00:07:54,760
Exactly.

219
00:07:54,760 --> 00:07:56,760
And that's what makes this research so exciting.

220
00:07:56,760 --> 00:08:02,760
It suggests that we're moving closer to robots that can truly learn and adapt in a similar way to humans.

221
00:08:02,760 --> 00:08:05,760
Okay. Now I'm really starting to see the potential here.

222
00:08:05,760 --> 00:08:07,760
But let's back up for a second.

223
00:08:07,760 --> 00:08:09,760
We've talked a lot about how Pi Zero works.

224
00:08:09,760 --> 00:08:12,760
But how does it actually compare to other approaches out there?

225
00:08:12,760 --> 00:08:14,760
What makes it so special?

226
00:08:14,760 --> 00:08:17,760
Well, one key difference is how it represents actions.

227
00:08:17,760 --> 00:08:23,760
Remember how we talked about action chunking, where the model plans movements in larger chunks?

228
00:08:23,760 --> 00:08:24,760
Yeah, I remember that.

229
00:08:24,760 --> 00:08:28,760
Well, a lot of other models, particularly those vision-language-action models,

230
00:08:28,760 --> 00:08:31,760
use a method called autoregressive discretization.

231
00:08:31,760 --> 00:08:33,760
Autoregressive discretization.

232
00:08:33,760 --> 00:08:36,760
Okay, that's a mouthful. Remind me what that's all about.

233
00:08:36,760 --> 00:08:39,760
It basically means treating actions like words in a sentence.

234
00:08:39,760 --> 00:08:42,760
The model predicts one action at a time in a sequence.

235
00:08:42,760 --> 00:08:45,760
So like typing out a message one letter at a time.

236
00:08:45,760 --> 00:08:46,760
Exactly.

237
00:08:46,760 --> 00:08:47,760
Got it. But why is that a problem?

238
00:08:47,760 --> 00:08:53,760
Well, for really complex tasks, those that require a lot of dexterity, you need a more nuanced approach.

239
00:08:53,760 --> 00:08:56,760
Ah, so that's where flow matching comes in.

240
00:08:56,760 --> 00:09:04,760
Precisely. Flow matching allows Pi Zero to represent those complex continuous movements in a much more accurate way.

241
00:09:04,760 --> 00:09:07,760
So it's like the difference between writing a sentence and choreographing a ballet.

242
00:09:07,760 --> 00:09:08,760
Got it.

243
00:09:08,760 --> 00:09:11,760
So Pi Zero is more of a dancer than a typist.

244
00:09:11,760 --> 00:09:15,760
Now, you mentioned earlier that they trained this model on a massive amount of data.

245
00:09:15,760 --> 00:09:18,760
I'm curious, where did all this data come from?

246
00:09:18,760 --> 00:09:20,760
That's a great question.

247
00:09:20,760 --> 00:09:22,760
The data is really the foundation of all this.

248
00:09:22,760 --> 00:09:26,760
Remember, we talked about those custom data sets and that public one called OXE?

249
00:09:26,760 --> 00:09:27,760
Yeah.

250
00:09:27,760 --> 00:09:28,760
Well, let's dig a little deeper into that.

251
00:09:28,760 --> 00:09:30,760
Let's do it. I love getting into the nitty gritty.

252
00:09:30,760 --> 00:09:35,760
So for their custom data sets, they actually used seven different robot setups.

253
00:09:35,760 --> 00:09:39,760
Seven different robots. Wow, that's a lot of robots.

254
00:09:39,760 --> 00:09:43,760
Were they all like those humanoid robots or were there other kinds?

255
00:09:43,760 --> 00:09:47,760
They had a variety of robots. Some had single arms, some had two arms.

256
00:09:47,760 --> 00:09:50,760
They even had some mobile robots on wheels.

257
00:09:50,760 --> 00:09:51,760
That's pretty wild.

258
00:09:51,760 --> 00:09:55,760
So all these different robots doing different tasks, what kind of tasks are we talking about?

259
00:09:55,760 --> 00:09:59,760
So instead of focusing on narrow tasks, like just picking up a specific object,

260
00:09:59,760 --> 00:10:03,760
they designed broader tasks that involved a variety of subtasks.

261
00:10:03,760 --> 00:10:06,760
Okay, so instead of just picking up a cup, it'd be more like clearing the table.

262
00:10:06,760 --> 00:10:11,760
Exactly. It's about teaching the robot how to reason about different objects and situations.

263
00:10:11,760 --> 00:10:16,760
So it's learning more general skills, not just memorizing specific moves.

264
00:10:16,760 --> 00:10:18,760
Right. And with all these different robot setups,

265
00:10:18,760 --> 00:10:23,760
they were able to record over 10,000 hours of data.

266
00:10:23,760 --> 00:10:28,760
10,000 hours? That's mind-boggling. How did they even collect all that data?

267
00:10:28,760 --> 00:10:33,760
It was a combination of techniques. For some tasks, they used human demonstrations,

268
00:10:33,760 --> 00:10:35,760
you know, to show the robot what to do.

269
00:10:35,760 --> 00:10:37,760
So like a robot training montage.

270
00:10:37,760 --> 00:10:41,760
You could say that. For other tasks, they used a technique called teleoperation.

271
00:10:41,760 --> 00:10:43,760
Teleoperation? What's that?

272
00:10:43,760 --> 00:10:46,760
It's basically controlling the robot remotely, kind of like a drone.

273
00:10:46,760 --> 00:10:50,760
Oh, I see. So they had skilled operators guiding the robot through various tasks.

274
00:10:50,760 --> 00:10:51,760
Exactly.

275
00:10:51,760 --> 00:10:55,760
That's pretty cool. But all this data, it can't just be random robot movements, right?

276
00:10:55,760 --> 00:10:57,760
How did they make sure it was actually useful for training?

277
00:10:57,760 --> 00:11:00,760
You're right. It's not just about quantity. It's about quality and diversity.

278
00:11:00,760 --> 00:11:05,760
They really focused on making sure the data represented a wide range of scenarios and challenges.

279
00:11:05,760 --> 00:11:08,760
So it's like giving the robot a well-rounded education in robotting.

280
00:11:08,760 --> 00:11:13,760
Exactly. They wanted to expose it to different environments, different objects, different lighting conditions.

281
00:11:13,760 --> 00:11:16,760
Anything that might help it generalize its learning.

282
00:11:16,760 --> 00:11:23,760
That makes sense. So we've got a diverse crew of robots doing all sorts of tasks, generating tons of data.

283
00:11:23,760 --> 00:11:25,760
It's like the ultimate robot reality show.

284
00:11:25,760 --> 00:11:31,760
But how do they go from all that raw data to an actual functioning robot model?

285
00:11:31,760 --> 00:11:33,760
That's where the training process comes in.

286
00:11:33,760 --> 00:11:36,760
Remember, they split it up into two phases, pre-training and post-training.

287
00:11:36,760 --> 00:11:39,760
Right, right. Robot elementary school and robot high school.

288
00:11:39,760 --> 00:11:42,760
But can you refresh my memory on what exactly happens in each phase?

289
00:11:42,760 --> 00:11:51,760
Of course. So in pre-training, they use that massive, diverse data set we've been talking about to teach the model a broad set of skills and knowledge.

290
00:11:51,760 --> 00:11:52,760
So it's learning the basics.

291
00:11:52,760 --> 00:12:00,760
Right. It's learning about how objects move, how to interact with its environment, how to understand language instructions, all those fundamental things.

292
00:12:00,760 --> 00:12:03,760
Okay. And then in post-training, it's time to specialize.

293
00:12:03,760 --> 00:12:08,760
Exactly. They use smaller, more targeted data sets to fine-tune the model for specific tasks.

294
00:12:08,760 --> 00:12:12,760
So for example, if they wanted to train Pi Zero to be a laundry folding champion.

295
00:12:12,760 --> 00:12:22,760
They'd use a data set specifically of robots folding laundry, you know, with different types of clothing, different folding techniques, all the nuances of that particular task.

296
00:12:22,760 --> 00:12:23,760
That makes a lot of sense.

297
00:12:23,760 --> 00:12:24,760
Yeah.

298
00:12:24,760 --> 00:12:28,760
So it's like building a solid foundation of general knowledge and then adding specialized skills on top of that.

299
00:12:28,760 --> 00:12:29,760
Precisely.

300
00:12:29,760 --> 00:12:34,760
Okay. So we've got this incredibly well-trained robot ready to take on the world.

301
00:12:34,760 --> 00:12:38,760
Now let's get to the good stuff, the robot talent show.

302
00:12:38,760 --> 00:12:41,760
What kind of tasks did they put it through?

303
00:12:41,760 --> 00:12:44,760
Okay, lay it on me. What did they have this robot doing?

304
00:12:44,760 --> 00:12:50,760
Well, they had it doing a bunch of different things, you know, like stacking bowls, folding towels.

305
00:12:50,760 --> 00:12:52,760
They even had it putting Tupperware in the microwave.

306
00:12:52,760 --> 00:12:54,760
Tupperware in the microwave.

307
00:12:54,760 --> 00:12:55,760
Now that's a real-world challenge.

308
00:12:55,760 --> 00:12:58,760
Right. It's those everyday tasks that can really trip robots up.

309
00:12:58,760 --> 00:13:01,760
So did it succeed? Was it able to conquer the microwave?

310
00:13:01,760 --> 00:13:04,760
It did. It was actually pretty impressive to watch.

311
00:13:04,760 --> 00:13:09,760
It learned how to open the door, put the Tupperware in, you know, even press the right buttons.

312
00:13:09,760 --> 00:13:11,760
No way. So it was like a microwave master.

313
00:13:11,760 --> 00:13:17,760
Pretty much. And what's really cool is that it was able to do all of this after just a short amount of fine-tuning.

314
00:13:17,760 --> 00:13:20,760
You know, they didn't have to program it with every specific detail.

315
00:13:20,760 --> 00:13:23,760
So it's really learning and adapting. That's amazing.

316
00:13:23,760 --> 00:13:27,760
It is. And it shows the power of this whole foundation model approach.

317
00:13:27,760 --> 00:13:30,760
Okay. So it can handle those everyday tasks.

318
00:13:30,760 --> 00:13:33,760
But what about stuff that's a little more, you know, challenging?

319
00:13:33,760 --> 00:13:35,760
Things that robots typically struggle with.

320
00:13:35,760 --> 00:13:37,760
Well, they definitely tested its limits.

321
00:13:37,760 --> 00:13:41,760
They gave it some tasks that were, you know, specifically designed to be difficult.

322
00:13:41,760 --> 00:13:44,760
Oh, I like this. Give me the robot trial by fire.

323
00:13:44,760 --> 00:13:46,760
Well, one of the toughest tasks was laundry folding.

324
00:13:46,760 --> 00:13:49,760
And not just, you know, neatly folded clothes.

325
00:13:49,760 --> 00:13:52,760
We're talking about laundry that starts as a crumpled mess in a bin.

326
00:13:52,760 --> 00:13:55,760
Whoa. That's next level laundry folding.

327
00:13:55,760 --> 00:13:57,760
I'm not even sure I could do that.

328
00:13:57,760 --> 00:14:00,760
It's a surprisingly complex task for a robot.

329
00:14:00,760 --> 00:14:03,760
You know, it has to figure out how to pick up the clothes, unfold them,

330
00:14:03,760 --> 00:14:05,760
smooth them out, then fold them correctly.

331
00:14:05,760 --> 00:14:08,760
Yeah, there's a lot going on there. So how did Pi Zero do?

332
00:14:08,760 --> 00:14:11,760
Well, it actually did a pretty good job. You know, it wasn't perfect,

333
00:14:11,760 --> 00:14:16,760
but it was able to fold a variety of different clothes, you know, like shirts, pants, towels.

334
00:14:16,760 --> 00:14:19,760
So it wasn't just memorizing a specific folding routine.

335
00:14:19,760 --> 00:14:22,760
It was actually adapting to different types of clothes.

336
00:14:22,760 --> 00:14:25,760
Exactly. And that's one of the things that makes this model so special.

337
00:14:25,760 --> 00:14:28,760
It's able to generalize its learning to new situations.

338
00:14:28,760 --> 00:14:30,760
Okay. Laundry folding: check.

339
00:14:30,760 --> 00:14:34,760
What else did they throw at this robot prodigy?

340
00:14:34,760 --> 00:14:38,760
Another tough one was assembling a cardboard box.

341
00:14:38,760 --> 00:14:41,760
You know those flat pack boxes that always seem to fight back

342
00:14:41,760 --> 00:14:43,760
when you're trying to put them together?

343
00:14:43,760 --> 00:14:45,760
Oh yeah, I know exactly what you're talking about. They're like the bane of my existence.

344
00:14:45,760 --> 00:14:49,760
Right. Well, Pi Zero had to figure out how to unfold the box, you know,

345
00:14:49,760 --> 00:14:52,760
bend it into the right shape and then secure the flaps.

346
00:14:52,760 --> 00:14:56,760
I'm honestly having trouble picturing a robot doing that successfully.

347
00:14:56,760 --> 00:14:58,760
Did it actually manage to defeat the cardboard?

348
00:14:58,760 --> 00:15:01,760
It did. It took a few tries, but it eventually got the hang of it.

349
00:15:01,760 --> 00:15:06,760
And what was really cool was watching it use its arms and even the table, you know,

350
00:15:06,760 --> 00:15:08,760
to kind of brace and manipulate the cardboard.

351
00:15:08,760 --> 00:15:11,760
So it was actually problem solving using its environment.

352
00:15:11,760 --> 00:15:14,760
Exactly. It was almost like watching a human figure it out.

353
00:15:14,760 --> 00:15:18,760
Okay. So it sounds like Pi Zero is pretty much ready to take over the world, right?

354
00:15:18,760 --> 00:15:23,760
Well, not quite. Even the most advanced robots have their limits.

355
00:15:23,760 --> 00:15:25,760
Right. Fair enough. So where did it fall short?

356
00:15:25,760 --> 00:15:32,760
Well, as I mentioned earlier, one of the challenges was tasks that require a deep understanding of physics.

357
00:15:32,760 --> 00:15:36,760
You know, things like pouring liquids or handling delicate objects.

358
00:15:36,760 --> 00:15:38,760
Yeah, it makes sense that those would be tricky.

359
00:15:38,760 --> 00:15:40,760
Those are things that even humans can struggle with sometimes.

360
00:15:40,760 --> 00:15:45,760
Right. It requires a level of precision and understanding that we're still working on in robotics.

361
00:15:45,760 --> 00:15:48,760
Okay. So it's not quite a robot butler yet.

362
00:15:48,760 --> 00:15:50,760
Yeah. But it's still an incredible achievement.

363
00:15:50,760 --> 00:15:54,760
It is. And it represents a huge step forward in the field of robotics.

364
00:15:54,760 --> 00:15:57,760
Okay. So we've talked a lot about Pi Zero's capabilities and limitations,

365
00:15:57,760 --> 00:16:00,760
but I think it's worth repeating what makes it so special.

366
00:16:00,760 --> 00:16:03,760
What sets it apart from all the other robots out there?

367
00:16:03,760 --> 00:16:05,760
Well, it's really a combination of factors.

368
00:16:05,760 --> 00:16:08,760
You know, it's not just that it's bigger or faster.

369
00:16:08,760 --> 00:16:10,760
It's the way it combines different technologies.

370
00:16:10,760 --> 00:16:13,760
Right. It's like a perfect storm of robot innovation.

371
00:16:13,760 --> 00:16:17,760
Exactly. It's leveraging those pre-trained VLMs, you know,

372
00:16:17,760 --> 00:16:21,760
that have already learned so much about the world from all that internet data.

373
00:16:21,760 --> 00:16:26,760
It's like giving the robot a head start in understanding what it sees and the language we use.

374
00:16:26,760 --> 00:16:29,760
So it's coming into the game with a whole lot of knowledge already.

375
00:16:29,760 --> 00:16:32,760
Exactly. And then they've added those cutting-edge techniques, you know,

376
00:16:32,760 --> 00:16:39,760
like flow matching and action chunking to really capture the complexities of movement in the real world.

377
00:16:39,760 --> 00:16:42,760
So it's got brains and it's got moves.

378
00:16:42,760 --> 00:16:44,760
Exactly. It's the whole package.

379
00:16:44,760 --> 00:16:46,760
And the results speak for themselves.

380
00:16:46,760 --> 00:16:50,760
I mean, we've seen Pi Zero do things that used to be considered impossible for robots.

381
00:16:50,760 --> 00:16:55,760
You know, folding laundry, assembling boxes, busing tables. It's pretty mind-blowing.

382
00:16:55,760 --> 00:16:57,760
It is. And this is just the beginning.

383
00:16:57,760 --> 00:17:00,760
You know, this research opens up so many possibilities for the future.

384
00:17:00,760 --> 00:17:04,760
Well, this has been an amazing deep dive into the world of robot foundation models.

385
00:17:04,760 --> 00:17:08,760
You know, we've learned so much about Pi Zero, its capabilities, its limitations,

386
00:17:08,760 --> 00:17:10,760
and its potential impact on the world.

387
00:17:10,760 --> 00:17:12,760
It's been a great discussion.

388
00:17:12,760 --> 00:17:16,760
And to our listeners out there, you know, if you're as fascinated by this stuff as we are,

389
00:17:16,760 --> 00:17:20,760
I encourage you to keep learning about robot foundation models.

390
00:17:20,760 --> 00:17:23,760
This field is moving so quickly, there's always something new to discover.

391
00:17:23,760 --> 00:17:26,760
Absolutely. It's an exciting time to be following robotics.

392
00:17:26,760 --> 00:17:29,760
Stay curious. Keep exploring.

393
00:17:29,760 --> 00:17:33,760
And, you know, maybe you'll be the one to create the next groundbreaking robot.

394
00:17:33,760 --> 00:17:37,760
That's a great thought. The future of robotics is in all of our hands.

395
00:17:37,760 --> 00:17:40,760
And on that note, we'll wrap up this episode.

396
00:17:40,760 --> 00:18:09,760
Thanks for joining us, and we'll see you next time for another deep dive into the world of science and technology.

