WEBVTT

00:00:03.557 --> 00:00:10.497
This is the Convergent Science Network podcast. Leading researchers in the domain

00:00:10.497 --> 00:00:16.777
of neuroscience, brain theory and technology are interviewed by Paul Verschure and Tony Prescott.

00:00:26.057 --> 00:00:29.677
This is Paul Verschure with the Convergent Science Network podcast.

00:00:30.217 --> 00:00:34.497
And today I'm with Marc Toussaint, who's a speaker in our summer school.

00:00:35.317 --> 00:00:42.717
And Marc, as a physicist, has been presenting his important work in the domain of machine learning.

00:00:44.317 --> 00:00:48.037
So Marc, you started your presentation with a little test for your audience,

00:00:48.157 --> 00:00:52.977
where you showed them a few equations and you want to check out who in the audience

00:00:52.977 --> 00:00:54.237
would recognize these equations.

00:00:54.497 --> 00:00:59.477
So why were these equations so important to you? Well, first just to check what

00:00:59.477 --> 00:01:01.097
kind of background people have.

00:01:03.057 --> 00:01:08.157
I'm not so sure if they're so important, but I think they are important to eventually

00:01:08.157 --> 00:01:13.157
understand what the point is of these formulations that I presented and to sort

00:01:13.157 --> 00:01:17.497
of understand that the third equation on that slide is something that we can

00:01:17.497 --> 00:01:19.937
get out of these new formulations. Okay.

00:01:21.237 --> 00:01:25.197
And actually, I think you looked a bit disappointed because it was not a large

00:01:25.197 --> 00:01:31.477
group of people who immediately jumped up like, "Yes, I know this one." So yeah, there is something.

00:01:32.297 --> 00:01:35.257
I mean, reinforcement learning is a very, very basic model of

00:01:35.257 --> 00:01:37.997
how behavior could be organized, and how behavior could be

00:01:37.997 --> 00:01:41.497
organized in a way that is goal-directed. And,

00:01:41.497 --> 00:01:44.517
although there are many, many questions whether this helps you

00:01:44.517 --> 00:01:47.737
understand the brain, of course there are many questions, but still

00:01:47.737 --> 00:01:51.197
this should not, I don't know, free you of knowing these

00:01:51.197 --> 00:01:55.357
things. I think it would be good to teach them. Yeah, absolutely.

00:01:55.357 --> 00:02:00.517
Now, what you initially emphasized were

00:02:00.517 --> 00:02:06.357
your methods for, let's say, understanding planning as a form of inference.

00:02:06.357 --> 00:02:12.677
Yeah. So why do you think that that's a useful way to think about planning?

00:02:12.677 --> 00:02:17.697
First, it's an alternative way to think about planning,

00:02:17.757 --> 00:02:19.877
which is a good thing in itself.

00:02:20.137 --> 00:02:24.397
I think, you know, turning over problems and looking at the same problems from

00:02:24.397 --> 00:02:25.577
different perspectives is important.

00:02:25.677 --> 00:02:29.177
I think looking at planning from the perspective of inference

00:02:30.439 --> 00:02:37.579
offers different options of how to think about a computation

00:02:37.579 --> 00:02:38.879
that actually realizes planning.

00:02:39.059 --> 00:02:45.719
So more concretely about why planning is inference, I think it does allow you

00:02:45.719 --> 00:02:47.119
to be creative to use other kinds

00:02:47.119 --> 00:02:52.759
of computational algorithms that exploit different types of structures.

00:02:52.899 --> 00:02:56.619
To be more concrete, if you think about how to do planning on factored representations,

00:02:57.759 --> 00:03:02.119
one way of thinking, the more classical one, would be to have structured representations

00:03:02.119 --> 00:03:06.339
of the value function, but it will always be the value function, right?

00:03:06.459 --> 00:03:11.759
Whereas if you roll out that process and sort of formalize it as the whole model

00:03:11.759 --> 00:03:15.579
as a graphical model and show that inference in a graphical model can solve

00:03:15.579 --> 00:03:19.899
planning problems, it really leads to different types of algorithms and different types of,

00:03:20.879 --> 00:03:22.479
approximations that you could use to solve planning.

00:03:22.879 --> 00:03:27.659
But would that mean you really have removed the value function from the story,

00:03:27.879 --> 00:03:29.399
or you have redefined it?

00:03:31.519 --> 00:03:35.779
That's also very interesting. That's a good question.

00:03:35.879 --> 00:03:40.319
So I have removed it in the computational algorithm itself so that the algorithm

00:03:40.319 --> 00:03:43.799
does not have to have explicitly a representation of the value function in its

00:03:43.799 --> 00:03:46.039
memory or the computer doesn't have to.

00:03:47.159 --> 00:03:50.799
But in some cases, in particular on normal Markov decision processes,

00:03:51.079 --> 00:03:55.799
the algorithm will compute things which are one-to-one to the value function.

00:03:56.079 --> 00:04:00.319
For instance, on normal Markov decision processes, the backward messages are

00:04:00.319 --> 00:04:05.299
equivalent to the value function.
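
NOTE
Editor's sketch of the equivalence just described, under illustrative assumptions
that are not from the talk: a 3-state Markov chain with the policy already folded
into the transition matrix P, and a reward r(x) in [0, 1] read as the probability
of a binary reward event at the final step T. All names (P, r, T,
backward_messages, value_by_forward_simulation) are hypothetical.
```python
import numpy as np
# Hypothetical 3-state chain; row x of P is the distribution over next states.
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]])
r = np.array([0.0, 0.1, 1.0])  # assumed P(reward event | final state x)
T = 5                          # number of transitions before the reward event
def backward_messages(P, r, T):
    """beta[t][x] = P(reward event at time T | state x at time t)."""
    beta = [None] * (T + 1)
    beta[T] = r.copy()
    for t in range(T - 1, -1, -1):
        beta[t] = P @ beta[t + 1]  # marginalize over the next state
    return beta
def value_by_forward_simulation(P, r, x0, steps):
    """Expected final reward from x0, by pushing the state distribution forward."""
    dist = np.zeros(len(r))
    dist[x0] = 1.0
    for _ in range(steps):
        dist = dist @ P
    return float(dist @ r)
beta = backward_messages(P, r, T)
values = [value_by_forward_simulation(P, r, x0, T) for x0 in range(3)]
print(np.allclose(beta[0], values))
```
In this fully observable toy case the backward message at time 0 and the expected
final reward are the same numbers, computed in two different ways.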

00:04:05.899 --> 00:04:10.339
But again, all of this is only true on a normal Markov decision process.

00:04:10.679 --> 00:04:14.879
If you go to the actually interesting problems where you cannot do exact inference

00:04:14.879 --> 00:04:19.439
or exact value iteration, like in POMDPs or other kinds of factored problems,

00:04:21.347 --> 00:04:26.207
the messages no longer correspond to value functions. They really correspond

00:04:26.207 --> 00:04:28.307
to probabilistic posteriors.

00:04:29.787 --> 00:04:33.687
And it would be unclear how you could do the same things with value functions

00:04:33.847 --> 00:04:35.747
that you can do with inference.

00:04:36.327 --> 00:04:40.907
Right. So in that sense, also if you go to the Bellman equation view,

00:04:41.127 --> 00:04:46.827
which would be more dominated by a value function, then for you this is also,

00:04:46.867 --> 00:04:48.207
let's say, a contrast, right? Okay.

00:04:48.827 --> 00:04:54.967
That's one way to look at planning. And you want to look at alternative models to solve it.

00:04:55.007 --> 00:04:59.567
Might those be simpler, or are they just different formulations of the Bellman view?

00:05:02.087 --> 00:05:06.567
They are different. I think they're different to the Bellman view.

00:05:06.687 --> 00:05:08.527
So the Bellman view is really a backward one.

00:05:08.707 --> 00:05:10.887
The Bellman equation is a backward way of thinking.

00:05:11.287 --> 00:05:14.667
So if you have already solved a small problem, you can sort of backward iterate

00:05:14.667 --> 00:05:21.327
and sort of move your horizon backward and solve the larger problem and so forth.

00:05:22.727 --> 00:05:26.787
So in the Bellman view, it's not very symmetric. It doesn't think about,

00:05:26.907 --> 00:05:29.207
for instance, forward messages or forward information.

00:05:29.547 --> 00:05:32.587
The inference view does. It computes both forward and backward messages.

00:05:33.027 --> 00:05:36.167
And the result is then a product of both of these being the posterior.
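
NOTE
Editor's sketch of the forward-backward picture described here, on a hypothetical
3-state chain (the matrix P, the start distribution and the goal likelihood are
illustrative assumptions, not from the talk). The forward message carries where
the system can be, the backward message carries how likely the goal event is from
there, and their normalized product is the posterior over the state.
```python
import numpy as np
# Hypothetical 3-state chain with the policy folded into P (rows sum to 1).
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]])
start = np.array([1.0, 0.0, 0.0])  # distribution over the initial state
goal = np.array([0.0, 0.1, 1.0])   # assumed P(goal event | state at final time T)
T = 5
# Forward messages: alpha[t][x] = P(state at time t is x)
alpha = [start]
for _ in range(T):
    alpha.append(alpha[-1] @ P)
# Backward messages: beta[t][x] = P(goal event | state at time t is x)
beta = [goal]
for _ in range(T):
    beta.insert(0, P @ beta[0])
# Posterior over the state at each time, given that the goal event happened
evidence = [float(a @ b) for a, b in zip(alpha, beta)]
posterior = [a * b / z for a, b, z in zip(alpha, beta, evidence)]
print(np.allclose(evidence, evidence[0]))
```
The normalizer is the same at every time step, because each product of forward
and backward message sums to the overall probability of the goal event.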

00:05:37.187 --> 00:05:42.267
So I think it is a different view than the Bellman view. But I could argue that

00:05:42.267 --> 00:05:51.187
I might, let's say, describe your inference-based approach in terms of a Bellman equation.

00:05:51.527 --> 00:05:54.167
Could I show an equivalence, or is there really something missing?

00:05:57.367 --> 00:06:03.187
I don't think that you could describe the computations happening during message

00:06:03.187 --> 00:06:09.507
passing as being a form of Bellman equation.

00:06:09.507 --> 00:06:13.947
So computationally, the things that are being computed and the kind of messages

00:06:13.947 --> 00:06:16.187
that are being computed are different.

00:06:16.767 --> 00:06:19.587
As a result, they will also solve somewhat different problems?

00:06:20.247 --> 00:06:24.207
Well, they lend themselves to different approximations. For instance, as I said, in POMDPs.

00:06:26.627 --> 00:06:29.887
Well, in POMDPs, sometimes you can do exact inference as well.

00:06:30.067 --> 00:06:36.227
But if you do message passing, you can cope differently with large problems

00:06:36.227 --> 00:06:38.147
or with factored problems than you

00:06:38.147 --> 00:06:42.507
could by trying to approximate value functions and iterate over them.

00:06:42.807 --> 00:06:46.927
Right. So now, in your sort of introductory remarks,

00:06:47.107 --> 00:06:52.007
you mentioned that on the one hand, you found problem-solving from the perspective

00:06:52.007 --> 00:06:53.327
of inference theoretically interesting,

00:06:54.247 --> 00:07:00.267
but also you said, look, it can give us a handle on a more large-scale integration

00:07:00.267 --> 00:07:04.507
and possibly on some form of multidisciplinary interaction.

00:07:04.927 --> 00:07:09.047
Right. So what's the perspective there exactly? So the,

00:07:10.167 --> 00:07:14.387
I think generally, actually, probabilistic modeling and also graphical models,

00:07:15.847 --> 00:07:17.167
there's two things about them.

00:07:17.267 --> 00:07:20.647
Some people say they're great tools because they work well, but another thing

00:07:20.647 --> 00:07:25.647
to say about them would be they're nice because they offer one nice language

00:07:25.647 --> 00:07:28.387
that can be applied in many different disciplines.

00:07:28.607 --> 00:07:32.487
So graphical models can be used in linguistics, in robotics,

00:07:32.667 --> 00:07:34.087
for inference, for whatever.

00:07:34.867 --> 00:07:38.447
It's a bit of a similar thing that I wanted to state with planning as inference.

00:07:40.167 --> 00:07:45.987
The first point to make is that it's one language that allows you to talk coherently

00:07:45.987 --> 00:07:53.867
about both decision-making, doing perceptual inference, sensor processing, while also planning.

00:07:53.987 --> 00:07:59.707
So not only immediate decision-making, but also planning in one coherent set of concepts.

00:07:59.967 --> 00:08:03.427
And it's actually a rather small set of concepts, just inference,

00:08:03.587 --> 00:08:06.627
right? I think this is good for communication between disciplines.

00:08:07.207 --> 00:08:14.687
That's really one point. And as I said, one example of that is that having these

00:08:14.687 --> 00:08:19.667
concepts, and being able to show that they explain planning and other things,

00:08:19.787 --> 00:08:21.547
allows us to build bridges between disciplines.

00:08:23.107 --> 00:08:26.267
As I said, there are neuroscientists, and I don't have an opinion of my own,

00:08:26.467 --> 00:08:30.907
but there are neuroscientists who like to argue that neural computation can be

00:08:30.907 --> 00:08:35.067
abstracted as functionally doing something like Bayesian inference.

00:08:35.067 --> 00:08:42.247
Or, to put it even more weakly, that neural systems could at least also realize something like Bayesian inference.

00:08:42.587 --> 00:08:46.407
But that is an interesting statement because if now in other fields we can show

00:08:46.407 --> 00:08:50.287
that Bayesian inference can solve planning problems, we immediately have a hypothesis

00:08:50.287 --> 00:08:53.067
of how neural systems could solve planning problems.

00:08:53.447 --> 00:08:58.787
But this is also a goal in your own work? I wouldn't really say so.

00:08:58.887 --> 00:09:05.827
So particularly in the last years, I became more and more of a roboticist, really. And, um.

00:09:07.421 --> 00:09:11.321
So in a positive sense, yes, I am inspired. And I'm inspired by the questions

00:09:11.321 --> 00:09:18.561
of how actual systems like the brain can solve problems of goal-directed behavior,

00:09:18.821 --> 00:09:19.961
manipulation, all of these things.

00:09:20.321 --> 00:09:26.961
But it would be too much to say that I see myself as a researcher,

00:09:27.021 --> 00:09:27.741
like a brain researcher.

00:09:27.741 --> 00:09:31.921
Right, okay. So then, let's

00:09:31.921 --> 00:09:36.501
say the approach that you have great confidence in at this point in time

00:09:36.501 --> 00:09:41.701
is what you call stochastic optimal control. Which is a problem definition, not

00:09:41.701 --> 00:09:44.481
an approach. Yeah. So why

00:09:44.481 --> 00:09:48.381
do you think this is an important set of problems to deal with, then?

00:09:52.481 --> 00:09:58.421
That's again a good question. So let me say what I guess many people would,

00:09:58.501 --> 00:10:00.281
other people would say, right?

00:10:01.781 --> 00:10:05.281
Stochastic optimal control is a problem definition similar to Markov decision

00:10:05.281 --> 00:10:08.841
processes with reward and it's a very general one.

00:10:09.121 --> 00:10:13.081
People like it when things are formulated generally, and problems being formulated

00:10:13.081 --> 00:10:20.341
generally allows you to describe all kinds of planning problems in there.

00:10:20.721 --> 00:10:24.921
And it's, of course, well accepted within control theory itself.

00:10:25.101 --> 00:10:29.761
And there's quite a lot of work on it within machine learning in terms of Markov decision processes.

00:10:31.581 --> 00:10:36.501
Let me put it the other way. If I had not embedded my theory

00:10:36.501 --> 00:10:41.861
and shown it to be important or useful for solving MDPs and optimal control,

00:10:42.041 --> 00:10:46.461
I could not have convinced the whole control theory community or machine learning

00:10:46.461 --> 00:10:48.401
community that it's worth anything. Okay.

00:10:49.480 --> 00:10:53.620
So how did you then solve this problem? Solve which problem?

00:10:53.960 --> 00:10:56.380
Well, stochastic optimal control.

00:10:56.920 --> 00:11:02.520
You're saying, look, you try to show that your framework is applicable in that domain.

00:11:02.720 --> 00:11:06.960
So in that sense, I guess you want to have an impact that sort of also delineates

00:11:06.960 --> 00:11:09.320
your proposal from others.

00:11:09.820 --> 00:11:13.860
So what are the unique features of what you propose in that context?

00:11:14.140 --> 00:11:18.780
Why did it work better or why was it more appreciated than alternatives? Right.

00:11:19.480 --> 00:11:25.300
I mean, one of the achievements, I think, is a theoretical achievement

00:11:25.300 --> 00:11:27.800
in the sense that we could show that many, many other algorithms,

00:11:27.880 --> 00:11:31.420
which themselves have proven

00:11:31.620 --> 00:11:36.000
empirically that they work well, are special cases of

00:11:36.000 --> 00:11:36.940
the formulation that we have.

00:11:37.260 --> 00:11:40.200
So this is a purely theoretical statement, right?

00:11:41.100 --> 00:11:43.960
Algorithms 1 to 10 are special cases of the framework. Right.

00:11:44.160 --> 00:11:47.440
And algorithms 1 to 10 had, fortunately for us, already made some effort

00:11:47.520 --> 00:11:50.200
to show that they're computationally okay.

00:11:52.040 --> 00:11:57.580
So that's one thing. The other thing is that we could also derive one algorithm,

00:11:57.840 --> 00:12:06.580
which is the model-free reinforcement learning algorithm, that I find conceptually

00:12:06.580 --> 00:12:10.520
quite interesting and that we could also directly compare it to other algorithms.

00:12:10.520 --> 00:12:14.560
If we now first look at the first part of this argument where you say,

00:12:14.560 --> 00:12:21.380
well, we had the most fundamental formulation of a solution to this set of problems.

00:12:21.720 --> 00:12:23.640
What are the key ingredients of that solution?

00:12:25.540 --> 00:12:31.200
I'd say that the key ingredients have already previously been formulated by

00:12:31.200 --> 00:12:33.820
people like Kappen and Opper and so on.

00:12:33.820 --> 00:12:40.920
The key ingredient is to think about optimal control, which formally

00:12:40.920 --> 00:12:47.120
is a minimization problem, as matching two different distributions, the

00:12:47.120 --> 00:12:48.320
one being the controlled one and the

00:12:48.320 --> 00:12:53.240
other one being the uncontrolled but conditioned one. So these key ideas

00:12:54.387 --> 00:12:59.827
have previously actually already been formulated. And I must say it's

00:12:59.827 --> 00:13:04.087
also not, I mean, you could even go back and talk about Kalman duality in general.

00:13:05.827 --> 00:13:10.787
Even Kalman already said that the problems of control and the problems of filtering,

00:13:10.987 --> 00:13:15.167
like state estimation or Bayesian filters, are very similar.

00:13:16.507 --> 00:13:22.447
And what Bert Kappen did before us, and we now also do, just explicates exactly

00:13:22.447 --> 00:13:25.807
that relation and makes it explicit.

00:13:26.767 --> 00:13:31.607
Now, the difference in the formulation is really that we're talking about processes

00:13:31.607 --> 00:13:36.187
over state and control, and not really making any assumptions about the dynamics

00:13:36.187 --> 00:13:41.147
or control costs or noise in the process,

00:13:41.807 --> 00:13:46.347
whereas the previous formulations, they discussed processes on the state and

00:13:46.347 --> 00:13:53.167
for that reason had to make some special assumptions to actually get the theory to work out properly.

00:13:53.167 --> 00:13:56.987
So in some sense, you had a more reduced formulation of the problem.

00:13:57.207 --> 00:14:00.387
You just left out a certain number of aspects that you saw as being irrelevant

00:14:00.387 --> 00:14:02.547
really to the solution. You could put it like this.

00:14:03.307 --> 00:14:09.087
So how should I imagine the solution that you came up with for these kinds of problems?

00:14:09.267 --> 00:14:11.647
How does this problem solver really operate?

00:14:13.307 --> 00:14:19.607
I think most concretely, you could imagine it in terms of that policy that I

00:14:19.607 --> 00:14:24.187
described, which is actually described by a Boltzmann energy.

00:14:24.867 --> 00:14:30.107
So you actually have a policy which is represented by a Boltzmann energy,

00:14:30.267 --> 00:14:34.427
which is not really any assumption about the policy, because any conditional

00:14:34.427 --> 00:14:37.187
distribution can be represented as a Boltzmann distribution.
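
NOTE
Editor's sketch of why a Boltzmann energy places no restriction on the policy.
Assumed convention (consistent with the later remark that the energy of
non-optimal actions goes to minus infinity): pi(a|x) is proportional to
exp(energy[x, a]). The function name boltzmann_policy and the example numbers
are hypothetical.
```python
import numpy as np
def boltzmann_policy(energy):
    """pi(a|x) proportional to exp(energy[x, a]), normalized over actions."""
    shifted = energy - energy.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum(axis=1, keepdims=True)
# Any strictly positive conditional distribution pi(a|x) ...
pi = np.array([[0.7, 0.2, 0.1],
               [0.25, 0.25, 0.5]])
energy = np.log(pi)  # ... is recovered from the energy log pi(a|x)
recovered = boltzmann_policy(energy)
# Energies are only defined up to a constant per state.
shifted_energy = energy + np.array([[3.0], [-1.5]])
same_policy = boltzmann_policy(shifted_energy)
# A zero-probability action corresponds to an energy of minus infinity.
hard = boltzmann_policy(np.array([[0.0, -np.inf]]))
print(np.allclose(recovered, pi), np.allclose(same_policy, pi), hard)
```
Taking the logarithm of a conditional distribution gives one valid energy, so
the representation itself is not an assumption about the policy.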

00:14:40.010 --> 00:14:45.450
Well, and then the iterative solutions that we proposed translate to iterative

00:14:45.450 --> 00:14:47.710
updates of that Boltzmann energy.

00:14:49.310 --> 00:14:55.870
So concretely, I should say that these iterative equations are actually equations

00:14:55.870 --> 00:15:01.610
which perform the Kullback-Leibler minimization that the theory originally proposes should be done.
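
NOTE
Editor's sketch of the Kullback-Leibler view in the simplest possible setting: a
single decision with three actions. The prior, the costs and the candidate q are
illustrative assumptions, and this is not the iterative update from the talk. It
only checks the identity
KL(q || posterior) = E_q[cost] + KL(q || prior) + log Z,
so matching the conditioned uncontrolled distribution trades expected cost
against deviation from the prior.
```python
import numpy as np
def kl(q, p):
    """Kullback-Leibler divergence KL(q || p) for strictly positive q and p."""
    return float(np.sum(q * (np.log(q) - np.log(p))))
prior = np.array([0.5, 0.3, 0.2])  # assumed uncontrolled action distribution
cost = np.array([2.0, 0.5, 1.0])   # assumed costs; P(goal event | a) = exp(-cost[a])
unnorm = prior * np.exp(-cost)
Z = unnorm.sum()
posterior = unnorm / Z             # uncontrolled distribution conditioned on the goal event
q = np.array([0.2, 0.5, 0.3])      # some candidate controlled distribution
lhs = kl(q, posterior)
rhs = float(q @ cost) + kl(q, prior) + np.log(Z)
print(np.isclose(lhs, rhs), kl(posterior, posterior))
```
The divergence is zero exactly when the controlled distribution equals the
conditioned one, which is the matching problem described above.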

00:15:03.450 --> 00:15:07.890
So these updates of the Boltzmann energy are concretely the algorithm that has

00:15:07.890 --> 00:15:08.930
to be implemented eventually.

00:15:10.010 --> 00:15:16.010
Right, but now basically it means I have a set of policies over which I want to optimize, right?

00:15:16.170 --> 00:15:19.770
They all have their own level, their Boltzmann energy attached to them,

00:15:19.870 --> 00:15:23.390
right? And now you're going to update these iteratively.

00:15:24.010 --> 00:15:29.230
So that means that you are sampling from some set of states that might describe

00:15:29.230 --> 00:15:31.650
the task that this policy has to be applied to.

00:15:32.210 --> 00:15:38.450
So to get this iterative function to work reliably, what should be the properties

00:15:38.450 --> 00:15:40.250
of the states I'm sampling over?

00:15:40.410 --> 00:15:43.010
Can I just follow any distribution or should it be very regular?

00:15:45.270 --> 00:15:48.250
So first I want to say that our set of policies is just one policy,

00:15:48.670 --> 00:15:50.870
but one described by a distribution.

00:15:52.490 --> 00:15:59.350
In the exact update where we assume we have a model, we update that Boltzmann

00:15:59.350 --> 00:16:02.270
energy over the whole space of points.

00:16:02.610 --> 00:16:06.730
So the whole function we update for all x, for all states.

00:16:08.690 --> 00:16:14.730
That is the model-based case or the stochastic optimal control case where we assume to have a model.

00:16:15.310 --> 00:16:20.290
I think what you refer to is when you ask, but for which states do I update

00:16:20.290 --> 00:16:22.610
it or which states do I have to sample?

00:16:23.290 --> 00:16:26.970
This is the model-free case, right? Where the system really has to interact with the environment.

00:16:29.530 --> 00:16:37.190
Well, in that case, the system would have to unroll episodes of experience with the environment.

00:16:37.450 --> 00:16:42.530
And these episodes of experience give you samples from the process.

00:16:43.030 --> 00:16:48.070
And you can use these samples of the process, as is standard also with TD and

00:16:48.070 --> 00:16:53.830
Q-learning and so forth, to do the necessary update of the Boltzmann energy in a stochastic way.

00:16:54.390 --> 00:16:59.190
So with a learning rate alpha instead of doing the exact update of the energy. Okay.
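
NOTE
Editor's sketch contrasting an exact, model-based expectation with a sampled
update that uses a learning rate alpha. This is a generic stochastic
approximation step, not the specific energy update from the talk; the outcomes,
probabilities and function names are hypothetical.
```python
import random
def exact_update(outcomes, probs):
    """Model-based case: the expectation is computed from known probabilities."""
    return sum(p * o for o, p in zip(outcomes, probs))
def stochastic_update(estimate, sample, alpha):
    """Model-free case: move the estimate a fraction alpha toward one sample."""
    return estimate + alpha * (sample - estimate)
outcomes, probs = [0.0, 1.0, 4.0], [0.5, 0.3, 0.2]
rng = random.Random(0)  # fixed seed so the run is repeatable
estimate = 0.0
samples = []
for n in range(1, 5001):
    sample = rng.choices(outcomes, weights=probs)[0]
    samples.append(sample)
    estimate = stochastic_update(estimate, sample, alpha=1.0 / n)
print(exact_update(outcomes, probs), estimate)
```
With alpha = 1/n the sampled estimate is exactly the running mean of the samples,
so it approaches the exact expectation as more experience is collected.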

00:17:00.350 --> 00:17:04.490
So the point is, what I'm looking for, what are the boundaries on that solution?

00:17:06.230 --> 00:17:10.030
Not only to formulate the solution, but then to prove that it actually will

00:17:10.030 --> 00:17:14.670
work and converge, you do have to make assumptions about the space in which

00:17:14.670 --> 00:17:15.550
this algorithm operates.

00:17:16.330 --> 00:17:18.570
So what would be these limiting factors?

00:17:19.770 --> 00:17:23.770
The first one is we only considered observable environments.

00:17:24.030 --> 00:17:26.750
There is no partial observability in what we discussed.

00:17:28.190 --> 00:17:33.850
I wouldn't really know yet how it would transfer to partially observable environments.

00:17:35.870 --> 00:17:38.890
The second thing I think in terms of the,

00:17:40.227 --> 00:17:47.427
When we can do these updates exactly, I think we prove convergence without any

00:17:47.427 --> 00:17:49.387
further assumptions, right?

00:17:49.447 --> 00:17:52.227
So these iterates, they converge.

00:17:52.427 --> 00:17:54.167
We prove convergence without further

00:17:54.167 --> 00:18:00.127
assumptions. But this is true if the updates are exact of the energy.

00:18:01.107 --> 00:18:06.887
So under what circumstances can it be exact? It can be exact if the state and

00:18:06.887 --> 00:18:10.747
control space is discrete, because then we can represent the energy just as

00:18:10.747 --> 00:18:13.607
tables over discrete states and actions.

00:18:14.127 --> 00:18:16.807
In that case, it will converge, no assumptions.

00:18:19.347 --> 00:18:27.327
If the state space is continuous or otherwise has to be encoded more compactly,

00:18:27.627 --> 00:18:31.787
the convergence is more difficult.

00:18:33.027 --> 00:18:35.947
If, in the continuous case, the dynamics is actually

00:18:35.947 --> 00:18:41.047
so simple, for instance just linear with quadratic costs, then you can just make

00:18:41.047 --> 00:18:44.927
a quadratic assumption about that energy, and it's almost like the Riccati equations

00:18:44.927 --> 00:18:50.667
and so forth. You can do the updates exactly, and again it will work. If

00:18:50.667 --> 00:18:53.947
the dynamics is nonlinear, well, you would have to use a function approximation

00:18:53.947 --> 00:18:55.267
to represent that energy.

00:18:55.407 --> 00:19:02.467
And at that point, exactly, we lose that strict convergence proof and can only

00:19:02.467 --> 00:19:06.487
hope really, as so often with reinforcement learning and function approximation.

00:19:07.107 --> 00:19:11.847
Right. So then the other thing, a trick, if you want, that you applied in this approach,

00:19:12.067 --> 00:19:17.107
is that you in some sense, let's say, completely fragmented your value function,

00:19:17.107 --> 00:19:24.947
right? Because you now added a variable to your system that was locally linked to every state,

00:19:25.727 --> 00:19:31.267
that was, it's not a reward function that relates to that state, and if I understood

00:19:31.267 --> 00:19:35.247
it correctly, these reward functions do have to satisfy certain regularities

00:19:35.247 --> 00:19:39.647
for this system to operate correctly, or not? What's the price

00:19:39.647 --> 00:19:45.107
you pay by distributing your value function in this way? Hmm.

00:19:47.447 --> 00:19:51.047
I'm not sure. I mean, these reward functions, we didn't really put assumptions

00:19:51.047 --> 00:20:00.607
on them, except perhaps, for convergence, that they are bounded, as also for Q-learning.

00:20:00.727 --> 00:20:04.647
So to prove convergence of Q-learning, you would have to assume that they're bounded,

00:20:04.767 --> 00:20:06.887
but otherwise there aren't really constraints.

00:20:07.007 --> 00:20:11.687
But the point was that in some sense, also contrasting again with this Bellman

00:20:11.687 --> 00:20:15.527
perspective, where you would have an explicitly defined value function.

00:20:15.747 --> 00:20:20.687
In this case, you have a more implicitly defined, it seems, value function.

00:20:21.127 --> 00:20:25.747
Or not. I wouldn't call it value function. We have the rewards and we have the

00:20:25.747 --> 00:20:26.247
Boltzmann distribution.

00:20:26.687 --> 00:20:32.367
Nothing more. But you could argue that in iterating these distributed value

00:20:32.367 --> 00:20:36.487
states, you are approximating something like a value function.

00:20:37.867 --> 00:20:42.787
In iterating the update of the Bellman, not the Bellman,

00:20:42.787 --> 00:20:44.247
sorry, the Boltzmann energy.

00:20:48.307 --> 00:20:54.027
So what I'm after is just to say that you seem to draw a distinction between

00:20:54.027 --> 00:21:00.307
a value-based approach and your approach where you say it's not value-based in some sense.

00:21:00.427 --> 00:21:03.667
What I'm saying is, well, maybe implicitly you are value-based,

00:21:03.707 --> 00:21:08.347
but now you just have sort of hidden it more by putting a distributor in the

00:21:08.347 --> 00:21:10.247
system linked to every single state. Yeah.

00:21:12.307 --> 00:21:18.087
So first, yes, the relations in the end become very close. I'll elaborate in a second.

00:21:19.487 --> 00:21:23.387
But the distinction is not so much between, or I didn't mean to initially distinguish

00:21:23.387 --> 00:21:27.267
between a value-based approach and our approach, but more like an approach which

00:21:27.267 --> 00:21:33.367
computes value functions by iterating back the Bellman equation versus by computing other things.

00:21:34.667 --> 00:21:36.467
In terms of probabilistic inference, right?

00:21:36.507 --> 00:21:42.167
So the Boltzmann energies are eventually computed by minimizing Kullback-Leibler

00:21:42.167 --> 00:21:44.387
divergences. Fine. Okay.

00:21:45.448 --> 00:21:51.248
In the model-free case, I derived that one equation, the model-free reinforcement learning equation.

00:21:51.688 --> 00:21:55.068
And you can analyze the fixed point properties of that equation.

00:21:55.288 --> 00:21:59.808
I should emphasize it's the fixed point properties, right? It's not the transient

00:21:59.808 --> 00:22:01.468
of the learning process itself.

00:22:01.788 --> 00:22:06.128
In that fixed point, this Boltzmann energy has very interesting properties.

00:22:06.628 --> 00:22:09.328
In fact, for non-optimal actions, it goes to minus infinity,

00:22:09.628 --> 00:22:14.388
making these actions very improbable to be chosen in terms of action selection.

00:22:15.668 --> 00:22:19.028
For the other ones, it turns out in the fixed point, for optimal actions,

00:22:19.228 --> 00:22:22.208
it corresponds to the optimal value function, the Boltzmann energy,

00:22:22.448 --> 00:22:29.088
which is surprising, but it's actually a simple outcome of analyzing the fixed

00:22:29.088 --> 00:22:30.008
point equation of the update.

00:22:30.008 --> 00:22:34.968
But now the other thing that was interesting in the approach you described is

00:22:34.968 --> 00:22:42.608
that you very strongly relied on the inference component also with respect to

00:22:42.608 --> 00:22:49.968
sort of looking at the consequences of future states that the system might anticipate onto its priors.

00:22:49.968 --> 00:22:55.068
Right? That you sort of, as you described yourself, just imagine a future

00:22:55.068 --> 00:22:58.448
goal, and then you just pretend you've actually achieved it, and you look

00:22:58.448 --> 00:23:02.048
at the consequence that has on the system.

00:23:03.489 --> 00:23:08.469
So what's that dynamic exactly and why did you approach it in these terms?

00:23:10.609 --> 00:23:13.909
I mean, I don't quite understand what you mean by dynamic.

00:23:14.209 --> 00:23:19.409
I mean, it's a bit like… Well, it's dynamic because I have to now imagine a future state.

00:23:19.529 --> 00:23:23.989
I may now include it in my priors and now I can make decisions on that basis

00:23:23.989 --> 00:23:25.569
so I can propagate myself now

00:23:25.569 --> 00:23:29.229
to future new events, right? And I can go through that same loop again.

00:23:30.989 --> 00:23:35.969
Right. Right, so you imagine future events as happening, right?

00:23:36.009 --> 00:23:37.509
You condition on them and compute a posterior.

00:23:38.749 --> 00:23:42.989
And in the last framework, yeah, you do iterate by using that posterior again

00:23:42.989 --> 00:23:49.489
as a prior and condition that again on the same future event and again compute a posterior.

00:23:51.889 --> 00:23:56.609
Right, I mean... But do you consider that problematic or not? I don't...

00:23:58.449 --> 00:24:02.149
Algorithmic-wise, I don't see why it's problematic. Do you mean problematic

00:24:02.149 --> 00:24:05.869
in terms of interpreting how, I don't know, living systems are doing it?

00:24:06.009 --> 00:24:13.369
Well, to me it seems that this works because you have conditioned it on a very

00:24:13.369 --> 00:24:16.269
specific definition of the problem domain.

00:24:17.009 --> 00:24:25.429
So for instance, that I can only have single goals existing at any one point in time, as an example.

00:24:25.769 --> 00:24:30.429
Oh, no. Why not? No, no. You can arbitrarily condition your future.

00:24:30.429 --> 00:24:36.489
The future can be represented in a factored way, there can be goals in different

00:24:36.489 --> 00:24:38.129
representations at different points in time.

00:24:38.369 --> 00:24:43.169
But wait, if you now propagate that back into your priors, you might have,

00:24:43.289 --> 00:24:49.229
let's say, unexpected conflicts between these goals that you do have to resolve now in some way.

00:24:51.809 --> 00:24:56.249
So with inference in graphical models, if there are pieces of evidence in a graphical model which

00:24:56.249 --> 00:25:01.009
are inconsistent, the likelihood of that is just zero,

00:25:01.189 --> 00:25:02.289
and the messages diverge.

00:25:02.369 --> 00:25:04.409
And that might, in fact, happen.

00:25:04.509 --> 00:25:09.849
And that happens in graphical models if there are almost deterministic dependencies,

00:25:10.089 --> 00:25:11.669
and you really get these inconsistencies.

00:25:13.749 --> 00:25:17.809
So that's the case when you condition on variables which are observations which

00:25:17.809 --> 00:25:18.569
are totally inconsistent.

00:25:19.429 --> 00:25:25.629
If there is a chance, a slight chance of consistency, inference will exactly,

00:25:28.069 --> 00:25:32.449
generate a compromise. That's the point of it, right? Yes.

00:25:32.689 --> 00:25:37.789
So, yeah, if you have deterministic models of the future and you condition on too

00:25:37.789 --> 00:25:39.609
many things, it will diverge in a sense.

00:25:40.009 --> 00:25:43.289
Otherwise, it just behaves as inference in graphical models.
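
NOTE
A minimal Python sketch of the point made above, under assumptions that are not from the interview:
one discrete variable with five hypothetical positions, a uniform prior, and two made-up goal likelihoods.
With soft evidence the posterior peaks at a compromise; with contradictory hard evidence the joint
likelihood is zero and no posterior exists. It does not model message passing itself.
```python
def posterior(prior, *likelihoods):
    """Multiply a discrete prior by each evidence likelihood and normalize."""
    unnorm = list(prior)
    for lik in likelihoods:
        unnorm = [u * l for u, l in zip(unnorm, lik)]
    z = sum(unnorm)  # joint likelihood of all the evidence
    if z == 0.0:
        return None, 0.0  # evidence is jointly impossible: nothing to normalize
    return [u / z for u in unnorm], z
# Five positions on a line with a uniform prior (illustrative values).
prior = [0.2] * 5
# Soft goals: goal A prefers the left end, goal B prefers the right end.
soft_a = [0.9, 0.6, 0.3, 0.1, 0.05]
soft_b = [0.05, 0.1, 0.3, 0.6, 0.9]
post, z = posterior(prior, soft_a, soft_b)
best = max(range(5), key=lambda i: post[i])  # index of the most probable position
# Hard, deterministic versions of the same goals contradict each other.
hard_a = [1.0, 0.0, 0.0, 0.0, 0.0]
hard_b = [0.0, 0.0, 0.0, 0.0, 1.0]
post_hard, z_hard = posterior(prior, hard_a, hard_b)
```
Here best is the middle position, while post_hard is None because z_hard is zero.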

00:25:43.569 --> 00:25:50.269
Right. But then, so in some sense, that means I just get points in this graph

00:25:50.269 --> 00:25:54.689
structure that have lost their validity.

00:25:54.989 --> 00:25:58.689
I do not consider them anymore and I rely on others to make my decisions.

00:26:03.079 --> 00:26:06.539
I don't know. I mean, inference in graphical models wouldn't really do that, right?

00:26:06.539 --> 00:26:08.699
It would find a compromise in a probabilistic sense.

00:26:10.899 --> 00:26:15.779
Another constraint I was worried about is just time. Like, for instance, you also

00:26:15.779 --> 00:26:21.179
mentioned that you have an interest in mapping this to robotics, and if you deal

00:26:21.179 --> 00:26:25.119
with robots, the key thing is real world, real time. So, for instance,

00:26:26.279 --> 00:26:31.399
goal setting might also evolve over different time windows, right?

00:26:31.479 --> 00:26:35.079
Or I might get, let's say, goal interrupts because the world interferes with

00:26:35.079 --> 00:26:36.599
my own ideal planning world.

00:26:37.239 --> 00:26:43.779
So I was wondering how that, how your solution would take these kinds of,

00:26:43.779 --> 00:26:49.239
let's call them exceptions, into account, or how robust it would be in the face of that.

00:26:49.879 --> 00:26:53.019
So we don't, I don't know if we have so much experience.

00:26:53.379 --> 00:26:58.479
Certainly we used that planning as inference method for models that are related

00:26:58.479 --> 00:27:00.919
to relational reinforcement learning. I didn't talk about this yet.

00:27:03.279 --> 00:27:05.219
And these are relational

00:27:07.839 --> 00:27:14.399
models on a symbolic level, and they can describe problems like: should I grasp this

00:27:14.399 --> 00:27:18.339
object or manipulate this object to achieve a goal or to build a tower or things like that.

00:27:18.519 --> 00:27:22.119
Do I first have to open a door before I get an object and all these kind of things.

00:27:23.519 --> 00:27:28.579
So we use these models, and planning as inference in these kinds of models is really fast.

00:27:29.299 --> 00:27:34.819
Really, because it's on a symbolic level. Very fast compared to all the low-level motion stuff.

00:27:35.279 --> 00:27:40.799
So in those cases, what you call interrupts or unexpected events or things like

00:27:40.799 --> 00:27:45.979
that, would just require us to sort of redo the whole inference,

00:27:46.239 --> 00:27:49.819
which, because it's on a symbolic level, is really very quick.

00:27:50.159 --> 00:27:56.679
So, I don't know, currently in practice in robotics, I'd say the inference is

00:27:56.679 --> 00:28:00.839
so fast that you just update online all the time.

00:28:01.159 --> 00:28:05.859
Okay, so you're saying this will just be equalized out by compute speed.

00:28:06.299 --> 00:28:10.479
You could put it like that, yeah. By just having the system update itself

00:28:10.479 --> 00:28:15.679
and recompute the posterior whenever some new evidence comes in.

00:28:16.019 --> 00:28:20.979
Right. So then also at some point in your presentation, you gave a little scheme

00:28:20.979 --> 00:28:26.619
where you distinguished different variations of problem-solving approaches,

00:28:26.939 --> 00:28:30.719
like model-based versus model-free.

00:28:31.139 --> 00:28:35.199
Oh, that one. So what's that structure that you had exactly in mind there?

00:28:35.759 --> 00:28:38.919
How does it organize the different approaches that are around?

00:28:40.759 --> 00:28:46.019
Um, I didn't really think that there is, like, one structure which

00:28:46.019 --> 00:28:51.579
combines all these approaches. I mean, that diagram was just to sort

00:28:51.579 --> 00:28:52.459
of give an overview of

00:28:53.254 --> 00:28:55.994
what kind of approaches people follow in reinforcement learning, right?

00:28:56.074 --> 00:29:00.514
So model-based, the path going over first learning transition models and model-free,

00:29:00.614 --> 00:29:02.214
just learning value functions.

00:29:03.594 --> 00:29:07.114
I mean, you know Dyna, which combines the two in one framework.

00:29:08.694 --> 00:29:14.214
I don't know. So our own work is mostly, actually, the work that we use in robotics,

00:29:14.514 --> 00:29:18.394
which I didn't talk much about today, is really fully model-based.

00:29:18.494 --> 00:29:21.154
So we always actually follow model-based approaches.

00:29:21.614 --> 00:29:26.134
And it's only in what I talked about today, in that more recent work,

00:29:26.254 --> 00:29:28.474
that we came up with a model-free reinforcement learning algorithm,

00:29:28.714 --> 00:29:31.594
which I find interesting for other reasons.

00:29:32.114 --> 00:29:35.394
But in robotics, I must say, I'm more a model-based proponent.

00:29:36.594 --> 00:29:39.834
Why is that? Why do you find that more interesting or relevant?

00:29:40.594 --> 00:29:45.834
In particular, for the types of problems that we did in relational reinforcement

00:29:45.834 --> 00:29:50.414
learning, where the state space is inherently exponential in the number of objects.

00:29:50.634 --> 00:29:54.014
So the state is described by all the relations between objects.

00:29:54.414 --> 00:30:01.174
So if you have 10 objects, there are on the order of 10 squared binary relations between them.

00:30:01.274 --> 00:30:06.994
All of them can have a value, so your state space is, say, 2 to the 10 squared, you know.
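
NOTE
A small Python sketch of the counting argument above. It assumes, for illustration only, a single
Boolean relation defined over every ordered pair of objects (self-pairs included), which matches
the "10 squared" estimate; the function name is hypothetical.
```python
def relational_state_count(num_objects: int) -> int:
    """Number of joint truth assignments to one Boolean relation over all ordered pairs."""
    num_relations = num_objects * num_objects  # one Boolean per ordered pair
    return 2 ** num_relations
# With 10 objects there are 100 Boolean relations, hence 2**100 states (roughly 1.27e30).
n_states = relational_state_count(10)
# Adding one object multiplies the count by 2**21, since 11*11 - 10*10 = 21.
growth = relational_state_count(11) // n_states
```
Several relation types or non-Boolean values would only make the count larger.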

00:30:09.034 --> 00:30:13.514
So the state space is exponential in those cases and how do you do learning

00:30:13.514 --> 00:30:15.734
in these kinds of spaces? Um, and, and.

00:30:17.933 --> 00:30:24.273
There are model-free approaches as well, which sort of also represent the policy

00:30:24.273 --> 00:30:28.293
directly on some relational features, so first-order logic features,

00:30:28.513 --> 00:30:30.853
and then use policy gradients to actually optimize them.

00:30:31.513 --> 00:30:35.713
But we think that the generalization is actually much stronger if you try to

00:30:35.713 --> 00:30:37.513
learn models from the experiences.

00:30:37.833 --> 00:30:41.273
And the models in those cases are represented by stochastic relational rules.

00:30:42.153 --> 00:30:45.173
Almost a bit like STRIPS, but they are stochastic and they're first order.

00:30:46.593 --> 00:30:50.973
And it's not our own work. We use that work to learn these rules.

00:30:51.213 --> 00:30:55.313
And we find it fascinating how strongly they generalize. So from only a couple

00:30:55.313 --> 00:30:59.233
of experiences that something rolls when you push it, the robot generalizes

00:30:59.233 --> 00:31:02.993
quite, I don't know, rationally one could say, I don't know,

00:31:03.013 --> 00:31:05.313
quite interestingly to other objects and so forth.

00:31:05.533 --> 00:31:08.493
So what I find interesting about model-based approaches

00:31:08.493 --> 00:31:12.213
is the ability to generalize experience. But

00:31:12.213 --> 00:31:14.873
the problem of model-based, of course, is: well, now you have that

00:31:14.873 --> 00:31:17.753
nice model, but how do you translate it to actions? And this

00:31:17.753 --> 00:31:20.953
is where then the planning as inference comes in. Right, exactly. But to

00:31:20.953 --> 00:31:28.873
what extent are these models actively acquired? Ah, nice. So the last

00:31:28.873 --> 00:31:34.293
two years or so we've been actually investigating a lot in active exploration

00:31:34.293 --> 00:31:38.733
or active learning, so where a system should, you know, decide on actions that

00:31:38.733 --> 00:31:39.973
maximize information gain.

00:31:41.073 --> 00:31:44.433
And there's the field of active learning in machine learning, right?

00:31:45.233 --> 00:31:50.653
So one of the questions we had is how can we sort of transfer these methods,

00:31:50.753 --> 00:31:54.453
the existing methods, to the relational case, for relational reinforcement learning.

00:31:54.733 --> 00:31:56.893
And that requires that machinery of

00:31:58.866 --> 00:32:02.086
estimating information gain or estimating your uncertainty of predictions,

00:32:02.326 --> 00:32:06.586
as it is done also implicitly in R-Max or the Bayesian Exploration Bonus,

00:32:06.786 --> 00:32:09.126
how to do the same thing with stochastic rules.

00:32:10.186 --> 00:32:16.046
We did that. We call this relational exploration. That means that if the robot

00:32:16.046 --> 00:32:18.086
has observed a ball rolling,

00:32:18.266 --> 00:32:22.926
a green ball rolling and then a blue ball rolling, then it maybe would not

00:32:22.926 --> 00:32:28.246
be interested anymore in another yellow ball, but instead try to roll something

00:32:28.246 --> 00:32:29.546
which looks different to a ball.
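
NOTE
A toy Python sketch of the ball example above. It uses simple visit counts as a stand-in for
information gain and is not the speaker's actual relational exploration method; the objects,
the (color, shape) encoding and the function name are hypothetical.
```python
from collections import Counter
def pick_next(candidates, history, abstraction):
    """Pick the candidate whose abstract context has been experienced least often."""
    counts = Counter(abstraction(obj) for obj in history)
    return min(candidates, key=lambda obj: counts[abstraction(obj)])
# Objects are (color, shape) pairs; the robot has already rolled two balls.
history = [("green", "ball"), ("blue", "ball")]
candidates = [("yellow", "ball"), ("red", "box")]
# Propositional view: every distinct object is its own context, so both candidates look equally new.
by_identity = pick_next(candidates, history, abstraction=lambda obj: obj)
# Relational view: experience is shared across objects of the same shape.
by_shape = pick_next(candidates, history, abstraction=lambda obj: obj[1])
```
Under the shape abstraction the yellow ball counts as familiar, so the box is chosen; under the
identity abstraction the tie is broken by list order and the yellow ball is chosen.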

00:32:29.946 --> 00:32:35.286
So absolutely. So these are exploration strategies that we investigated and

00:32:35.286 --> 00:32:39.906
which are really important in these exponential state spaces to lead to efficient learning.

00:32:40.206 --> 00:32:44.886
But for you, the main difference is that you say, look, if this state space

00:32:44.886 --> 00:32:49.246
gets too large, you just need a better strategy to map it.

00:32:49.386 --> 00:32:54.346
And that's an active learning, or an active exploration, component,

00:32:54.346 --> 00:33:01.146
while intrinsically you don't need to change your whole approach.

00:33:01.146 --> 00:33:05.146
It's just the way you explore that state space. I agree. So,

00:33:05.786 --> 00:33:09.106
intrinsically, it's a funny choice of word because of intrinsic motivation,

00:33:09.106 --> 00:33:14.166
but it's the same principle. So it's still the same principle of maximizing

00:33:14.166 --> 00:33:19.746
information gain, which is approximated as it is in R-Max and the Bayesian exploration bonus.

00:33:20.626 --> 00:33:26.786
What is necessary is to transfer that same intrinsic principle to other types

00:33:26.786 --> 00:33:31.646
of representations, namely those relational representations or these stochastic rules we had.

00:33:31.886 --> 00:33:37.586
Would you think this game would change a lot when I would impose a memory capacity constraint?

00:33:40.846 --> 00:33:44.306
Um, I don't know. You mean

00:33:44.306 --> 00:33:47.366
the exploration strategy would change?

00:33:47.366 --> 00:33:50.406
Yeah, or the model, the whole model construction

00:33:50.406 --> 00:33:54.626
phase. Well, in the model construction there is a regularization penalizing the size

00:33:54.626 --> 00:34:01.046
of the model. So putting a hard limit on the model can sometimes even be shown

00:34:01.046 --> 00:34:06.186
to be sort of dual to regularization. But yeah, I'm not sure if it

00:34:06.186 --> 00:34:09.586
would change. So why this is interesting to me is, look, I try to understand how the brain works.

00:34:09.946 --> 00:34:14.506
And the brain just doesn't have these luxuries of, you know,

00:34:14.586 --> 00:34:20.406
infinite memory or infinitely fast processing speeds and so on.

00:34:20.646 --> 00:34:25.626
And so this, of course, raises this issue that maybe the brain is operating

00:34:25.626 --> 00:34:30.986
in a part of problem-solving state space where we just are not really looking

00:34:30.986 --> 00:34:33.726
with our algorithms because we're not looking at the same constraints.

00:34:34.566 --> 00:34:37.786
So that's why I'm asking: if I take a constraint such as memory capacity,

00:34:38.986 --> 00:34:41.626
would that change the game from your perspective.

00:34:43.846 --> 00:34:47.486
I'd say I don't know if I can

00:34:48.709 --> 00:34:52.809
say anything to that. I have the impression that the kind of models that we

00:34:52.809 --> 00:34:56.749
learn are so small compared to the things that humans really learn.

00:34:57.529 --> 00:35:02.149
Humans would have much more capacity than our models that use these stochastic

00:35:02.149 --> 00:35:03.709
relational rules, for instance.

00:35:05.529 --> 00:35:08.989
Actually, I find it sometimes quite surprising how much humans can memorize,

00:35:09.349 --> 00:35:14.429
children especially, without actually abstracting, just as if they would just

00:35:14.429 --> 00:35:17.929
store it away before they actually abstract later on. Without even being programmed by you. It's amazing.

00:35:19.869 --> 00:35:27.329
Yes. So the other thing here is also, for instance, a lot of these methods that

00:35:27.329 --> 00:35:30.889
are very advanced and you guys have a deep understanding of them,

00:35:30.969 --> 00:35:34.629
but they are very often predicated on fairly strong assumptions.

00:35:34.789 --> 00:35:39.269
For instance, one thing that I find often worrisome is that,

00:35:39.289 --> 00:35:41.609
for instance, people sort of

00:35:41.609 --> 00:35:44.909
very loosely say, well, okay, let's assume I have defined my state space.

00:35:44.909 --> 00:35:50.029
And now on the basis of that I'm going to show you all sorts of properties of

00:35:50.029 --> 00:35:53.289
my algorithm or I'm going to learn a model I'm going to do problems with and so on.

00:35:54.569 --> 00:35:57.509
Isn't that assumption actually too strong? Isn't it?

00:35:57.589 --> 00:36:01.889
Shouldn't we also think more about policy learning together with really the

00:36:01.889 --> 00:36:04.709
state space acquisition from experience?

00:36:05.449 --> 00:36:07.069
Yeah. Totally agree.

00:36:08.809 --> 00:36:12.289
Yes, I totally agree. I mean, the question of where the representations come

00:36:12.289 --> 00:36:13.989
from is absolutely fundamental.

00:36:15.529 --> 00:36:18.989
Maybe there are two ways to go about this.

00:36:19.109 --> 00:36:22.749
The one is really having the ambition that the system should really invent its

00:36:22.749 --> 00:36:27.309
own notions of state, its own representations and everything from scratch,

00:36:27.509 --> 00:36:29.329
which has been the ambition

00:36:30.089 --> 00:36:35.229
of researchers for some while, and ideally even under partial observability and so forth.

00:36:37.564 --> 00:36:41.784
Which, even myself, I was thinking about that. But more and more,

00:36:41.844 --> 00:36:42.964
I think this is very tough.

00:36:43.784 --> 00:36:48.344
Still, we should still continue trying, right? The more I actually work with robotics,

00:36:48.864 --> 00:36:54.344
the more I have the impression that it might be worth before trying to invent

00:36:54.344 --> 00:36:59.184
algorithms that can invent representations for all possible worlds,

00:36:59.744 --> 00:37:05.464
to actually understand that actually our world, our 3D world, is quite special.

00:37:07.464 --> 00:37:10.404
And it's special in the sense that it's 3D, there is

00:37:10.404 --> 00:37:16.824
physics, a lot of things in our world are actually about rigid bodies. And it's

00:37:16.824 --> 00:37:22.844
becoming very much robotics talk now, right? There's a lot of things about really kinematics,

00:37:22.844 --> 00:37:25.984
so there are degrees of freedom in our environment that you can manipulate and

00:37:25.984 --> 00:37:30.244
so forth. So our actual, true world is very, very structured.

00:37:30.804 --> 00:37:36.604
So in that sense, why not as a simple step, because we're limited as researchers,

00:37:36.784 --> 00:37:40.764
as humans, first try to understand those particular structures that are inherent

00:37:40.764 --> 00:37:45.284
in our world and think about what would be representations to actually deal with those.

00:37:45.564 --> 00:37:49.064
So that's also a more robotics-like, pragmatic approach.

00:37:49.264 --> 00:37:52.544
But it's also funny that you're sort of following a little bit your physics

00:37:52.544 --> 00:37:57.824
training or intuitions again by saying, oh no, let's make the spherical cow

00:37:57.824 --> 00:38:00.904
and then we start from there. Frankly, I'm not sure.

00:38:01.664 --> 00:38:05.064
I think it's the opposite in physics training. I think the typical physicist

00:38:05.064 --> 00:38:08.784
would go the first approach because he wants to be always general and work in

00:38:08.784 --> 00:38:11.604
every possible world and derive optimal solutions in every possible world.

00:38:11.904 --> 00:38:18.024
And in a sense, also machine learning and optimal control and MDPs, they do that, right?

00:38:18.084 --> 00:38:24.064
They define problems which are seemingly totally general, because you can describe

00:38:24.064 --> 00:38:26.204
everything as an MDP or a control problem, right?

00:38:26.284 --> 00:38:30.104
And you can embed everything in a vector space, which is sort of true.

00:38:30.944 --> 00:38:33.144
But it might neglect the fact that

00:38:33.990 --> 00:38:36.910
the actual problems we are faced with are very structured.

00:38:38.350 --> 00:38:43.790
And acknowledging that specific structure is maybe less a theoretical physicist

00:38:43.790 --> 00:38:46.210
thing, but more really in engineering and robotics.

00:38:46.910 --> 00:38:52.250
But do you feel that there are already solutions of that kind on the table? No.

00:38:53.450 --> 00:39:01.810
Looking more and more into robotics, I think that there is, of course, there is engineering,

00:39:02.110 --> 00:39:07.930
which means that the people who program the robots, the roboticists, they have these concepts

00:39:07.930 --> 00:39:12.310
and these specific representations and understanding of kinematics of worlds and so forth.

00:39:12.510 --> 00:39:15.810
And from their understanding, they then program the robot to deal with that.

00:39:15.970 --> 00:39:20.610
But I do not have the impression that the machines themselves,

00:39:20.930 --> 00:39:25.170
so let's say the inference mechanism in the models that I talked about today,

00:39:25.490 --> 00:39:30.070
that they would actually do inferences about true physical situations.

00:39:30.410 --> 00:39:34.330
Right. Or that we would have inference machines which could do inferences about

00:39:34.330 --> 00:39:38.910
what is a stable physical situation, like probabilistic inference over that,

00:39:38.990 --> 00:39:41.330
or where might be a degree of freedom in the world.

00:39:43.110 --> 00:39:47.610
So doing inferences in these spaces, they're so structured that I think we do

00:39:47.610 --> 00:39:50.250
not yet know how we could do probabilistic inference in those spaces.

00:39:50.570 --> 00:39:54.250
Right. So what would be a good benchmark for these kinds of models?

00:39:56.030 --> 00:39:56.510
Hmm...

00:39:59.210 --> 00:40:04.050
Okay, so there is the old Köhler, Wolfgang Köhler, right?

00:40:04.170 --> 00:40:11.770
He was a psychologist in the beginning of the 20th century. He did these experiments with monkeys.

00:40:12.130 --> 00:40:16.690
Chimpanzees, yeah. Exactly. He had them on which island? I forgot.

00:40:19.530 --> 00:40:23.930
So he has this book, which is called Intelligence Tests of Apes, something like that.

00:40:23.930 --> 00:40:29.830
In this book he describes actually very nice behaviors which I would call

00:40:29.830 --> 00:40:34.650
really goal-directed behaviors. And one of these behaviors is where there is

00:40:34.650 --> 00:40:37.850
a banana at the ceiling and the robot is trying to get it, but it's too high.

00:40:37.850 --> 00:40:40.930
It's a chimpanzee who tries to get it, not a robot. Oh, sorry.

00:40:41.770 --> 00:40:48.050
You see where I'm getting at, right? So yeah, the ape wants to get it, it's too

00:40:48.050 --> 00:40:51.230
high, he jumps and for five minutes doesn't achieve it,

00:40:51.750 --> 00:40:56.350
gets sort of depressed, at least that's the storytelling that Köhler actually

00:40:56.350 --> 00:40:57.970
does in the book, sits in the corner,

00:40:59.116 --> 00:41:02.416
sort of depressed for a while, and then suddenly

00:41:02.416 --> 00:41:05.376
points his eyes towards the banana and then points his

00:41:05.376 --> 00:41:08.056
eyes at the box in a corner and then again at

00:41:08.056 --> 00:41:12.696
the banana and the box, and then jumps up, grabs the box, goes on top of the box

00:41:12.696 --> 00:41:19.076
and gets it. And I wish, you know, robots could do that. And I think if I

00:41:19.076 --> 00:41:24.136
were a psychologist, I would be very interested to actually model what's

00:41:24.136 --> 00:41:28.036
going on in that head of the monkey.

00:41:29.076 --> 00:41:35.516
Because I think he did inference in a very structured way, inference about our physical world.

00:41:35.696 --> 00:41:38.896
He did inference about, let me pull that box here and get on top of that.

00:41:39.136 --> 00:41:43.056
And it's exploiting physics very, very much.

00:41:43.336 --> 00:41:46.796
Yeah. We had a beautiful lecture on this last week by Alex Kacelnik,

00:41:46.836 --> 00:41:52.516
exactly on this famous experiment by Köhler with also many other experiments

00:41:52.516 --> 00:41:56.356
in that domain. So you would follow, you would stick to such a benchmark,

00:41:56.496 --> 00:41:57.176
you'd be happy with that.

00:41:57.336 --> 00:42:00.336
Absolutely. If the robots would do that, I'd be… Very good.

00:42:00.436 --> 00:42:05.656
But now, I don't know whether you guys ever apply your own methods in an autobiographical way.

00:42:05.736 --> 00:42:09.596
Because you could also say, look, every machine learning person is trying to

00:42:09.596 --> 00:42:11.616
optimize their own policy in this complex world.

00:42:13.116 --> 00:42:17.816
And in some sense, we also would like to make statements about biological systems,

00:42:17.996 --> 00:42:20.756
about physical, psychological systems, and so on using these methods.

00:42:20.756 --> 00:42:25.176
But there might be a possibility that, of course, you guys have optimized yourself

00:42:25.176 --> 00:42:29.536
in some sort of local minimum in this larger space of all possible policies and

00:42:29.536 --> 00:42:30.396
problem-solving solutions,

00:42:30.656 --> 00:42:35.756
so that actually the links with natural systems have gotten lost.

00:42:36.496 --> 00:42:39.836
Okay. So where do you stand on that?

00:42:40.256 --> 00:42:44.996
To what extent also the models we now very superficially discussed now in this

00:42:44.996 --> 00:42:48.876
interview, I mean, where is really the leverage, right?

00:42:48.876 --> 00:42:53.196
Because, for instance, we all know the big hype that has been going around and

00:42:53.196 --> 00:42:57.916
the big enthusiasm and hope for the last 30 years on these kinds of methods.

00:42:58.236 --> 00:43:00.516
And sometimes when we look back, we can also say, well, yeah,

00:43:00.516 --> 00:43:02.796
okay, it kept a lot of people busy. That's all very positive.

00:43:03.116 --> 00:43:07.396
But did we really sort of make a huge step forward in understanding biological

00:43:07.396 --> 00:43:09.216
systems or psychological systems?

00:43:09.496 --> 00:43:15.696
And there, it's not so clear. Yeah, I agree that there are not necessarily always

00:43:15.696 --> 00:43:18.236
links to the biological system

00:43:18.876 --> 00:43:25.116
in the methods. And I would even claim that very often it's not the goal to keep

00:43:25.116 --> 00:43:27.876
these links to the biological paradigm.

00:43:29.356 --> 00:43:37.936
Why not? It can be one goal for some people, but it's also an engineering goal

00:43:37.936 --> 00:43:43.476
really to just design systems which do things well, in a good way, and optimize something or whatever.

00:43:44.116 --> 00:43:45.456
But let me put it the other way.

00:43:45.496 --> 00:43:51.376
I think in order to understand living systems, it is also a good idea

00:43:52.310 --> 00:43:57.930
to understand our world, our environment, and to understand what it means to

00:43:57.930 --> 00:44:00.730
behave, to organize behavior in that environment.

00:44:01.870 --> 00:44:08.610
Totally, just on that level, I tend to say functional also, without caring for the substrate,

00:44:08.790 --> 00:44:13.270
without caring for, I don't know, the constraints, the concrete biological constraints

00:44:13.270 --> 00:44:16.390
that there are, but just to understand what is actually the structure of problems

00:44:16.390 --> 00:44:22.730
that things in the world are being faced with, no matter if these are robots or not or other systems.

00:44:23.030 --> 00:44:28.410
And I think that the engineering methods or robotics can contribute on that side.

00:44:28.610 --> 00:44:35.070
And I also think that, well, it's important to understand actually the problems

00:44:35.070 --> 00:44:39.870
when you then would go back and ask how could biological systems actually face these problems.

00:44:39.950 --> 00:44:42.790
So it's a bit more a normative perspective then.

00:44:43.770 --> 00:44:45.990
Like this is what the system should do.

00:44:47.590 --> 00:44:50.950
Should do is even too much. What the system is confronted with,

00:44:51.050 --> 00:44:53.850
understand the structure of what it is confronted with.

00:44:53.910 --> 00:44:59.590
But now, if we look at nature or psychology, we might see that actually notions

00:44:59.590 --> 00:45:02.490
of optimality might not hold so well, right?

00:45:02.530 --> 00:45:05.710
Like human decision making is in many occasions suboptimal.

00:45:06.030 --> 00:45:10.170
I'm here wasting your time and you're still being polite so that you could also

00:45:10.170 --> 00:45:11.910
consider that suboptimal, right?

00:45:12.030 --> 00:45:17.330
So in the face of let's say psychological reality and behavioral reality,

00:45:18.450 --> 00:45:21.870
how do these notions of optimality actually hold up to that?

00:45:22.690 --> 00:45:28.930
Optimality. So I actually have a very pragmatic relation to optimality. I think

00:45:30.130 --> 00:45:35.430
many empirical things can be described as if they would adhere to optimality principles.

00:45:36.630 --> 00:45:41.630
It's a misunderstanding to actually say that this would have any meaning, like

00:45:41.630 --> 00:45:45.390
as if it wants to be optimal or whatever. Let me take an example.

00:45:45.530 --> 00:45:51.170
I mean the trajectory of a particle in physics, right? You can say it minimizes action.

00:45:52.170 --> 00:45:58.170
Or some kind of field in physics behaves as if it would minimize an energy.

00:45:58.830 --> 00:46:03.550
That energy or that action is only a scientific concept that we scientists have

00:46:03.550 --> 00:46:06.410
devised to describe the actual system there.

00:46:10.577 --> 00:46:13.677
Cost functions or optimality objectives like these things.

00:46:14.957 --> 00:46:20.037
These are just designed by us scientists to describe systems in the world.

00:46:20.117 --> 00:46:24.617
It's just a means to describe systems in the world, to say the system behaves as if it is optimal.

00:46:24.857 --> 00:46:31.077
So I think it's a misunderstanding to really think that we believe machines must be optimal.

00:46:31.257 --> 00:46:35.637
It's just a scientific tool to describe them as being optimal with respect to

00:46:35.637 --> 00:46:36.617
some arbitrary cost function.

00:46:36.677 --> 00:46:40.657
But for instance, humans have many biases in their decision-making,

00:46:40.797 --> 00:46:42.977
right? And there's some beautiful books written about that.

00:46:43.197 --> 00:46:46.937
One might be, for instance, many people in gambling believe they have much more

00:46:46.937 --> 00:46:50.617
control over the outcome of a bet than they really have.

00:46:50.937 --> 00:46:55.417
So, you see, in psychological reality, you might not be able to really optimize

00:46:55.417 --> 00:46:59.777
so cleanly on some goal function, even though that is really at the core of

00:46:59.777 --> 00:47:01.437
the methods that you're developing.

00:47:01.717 --> 00:47:06.297
No, I think, and this is almost a trivial statement, that any behavior can be

00:47:06.297 --> 00:47:09.737
described as being optimal. It's just a question of optimal with respect to what.
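
NOTE
A Python sketch of why the statement above is "almost trivial". Given any observed policy, one can
always construct a cost function under which that policy is optimal. The states, actions and
function names are hypothetical, and the one-step greedy optimum is a deliberate simplification.
```python
def cost_from_behavior(observed_policy):
    """Build a cost that is 0 for the observed action in each state and 1 otherwise."""
    def cost(state, action):
        return 0.0 if observed_policy[state] == action else 1.0
    return cost
def optimal_policy(states, actions, cost):
    """One-step optimum: choose the cheapest action in each state."""
    return {s: min(actions, key=lambda a: cost(s, a)) for s in states}
# A seemingly irrational behavior, made up for illustration.
observed = {"rain": "leave umbrella", "sun": "take umbrella"}
actions = ["take umbrella", "leave umbrella"]
# Optimizing the constructed cost reproduces exactly the observed behavior.
recovered = optimal_policy(observed.keys(), actions, cost_from_behavior(observed))
```
The constructed cost is just a lookup of the behavior itself, so it is no more compact than the
policy it describes.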

00:47:09.937 --> 00:47:16.237
And I mean this, and it's no more statement when I'm saying I'm interested in

00:47:16.237 --> 00:47:19.337
describing algorithms that can do optimal things.

00:47:19.537 --> 00:47:24.697
It's only that I think that this is an abstraction of how to describe behavior.

00:47:24.837 --> 00:47:28.297
I think even what people call sub-rational behavior or limited behavior,

00:47:28.557 --> 00:47:33.097
of course you can describe it as being optimal with respect to something, some other objective.

00:47:33.617 --> 00:47:37.257
Okay, but then, of course, there's another risk that's looming,

00:47:37.257 --> 00:47:40.697
another challenge that I could say, yeah, but then, in that case,

00:47:40.757 --> 00:47:43.337
your method is super powerful because it can never fail.

00:47:43.777 --> 00:47:48.537
So then what do we learn? Well, the thing is that you now sort of shifted your

00:47:48.537 --> 00:47:50.457
level of scientific description.

00:47:50.757 --> 00:47:52.437
You know, you do not describe….

00:47:53.202 --> 00:47:56.462
behavior phenomena directly anymore, in

00:47:56.462 --> 00:47:59.542
a direct representation in terms of describing the policy, but

00:47:59.542 --> 00:48:02.382
you shifted your level of scientific description to the

00:48:02.382 --> 00:48:05.902
level of describing what they optimize. And at

00:48:05.902 --> 00:48:09.522
first sight it's nothing more than just a shift. You didn't gain anything by

00:48:09.522 --> 00:48:14.442
that. But potentially what you could gain is that this other description is more

00:48:14.442 --> 00:48:19.502
compact. And that's really the only scientific gain of that: that you can describe

00:48:19.502 --> 00:48:22.482
things more compactly. And this is how it was in physics always.

00:48:22.722 --> 00:48:26.202
The reason why you would want to describe particles as minimizing action,

00:48:26.362 --> 00:48:31.602
so the trajectory of particles, is because it's a very concise description of what's happening.

00:48:31.942 --> 00:48:35.162
I mean, alternatively, you could just write down for every possible thing what

00:48:35.162 --> 00:48:37.662
happened, but it's more concise to describe it like that.

00:48:37.782 --> 00:48:41.262
Same with behavior like human motion.

00:48:41.562 --> 00:48:44.502
Saying that human motion, as in

00:48:44.502 --> 00:48:47.922
Wolpert's work, behaves as if it would be stochastic optimal control.

00:48:49.702 --> 00:48:54.482
Don't overstate this. It doesn't mean that they want to optimize it or so.

00:48:54.922 --> 00:48:58.742
It's just a statement that you can shift your scientific description of human

00:48:58.742 --> 00:49:05.462
behavior onto an abstraction where you say the behavior, the motion is as if

00:49:05.462 --> 00:49:06.782
it would minimize that objective function.

00:49:07.002 --> 00:49:10.642
Right. Just a shift of description, nothing more. And a description of it, right?

00:49:10.742 --> 00:49:13.042
Right. The hope would be that it's so concise that, for instance,

00:49:13.102 --> 00:49:20.062
with robots, it's easier to specify or be creative in saying what they might

00:49:20.062 --> 00:49:25.002
want to optimize and then derive behavior from that rather than programming behavior directly.

00:49:25.202 --> 00:49:28.602
It's a shift of programming languages almost, if you want to say.

00:49:28.682 --> 00:49:32.722
Now we're programming objective functions instead of policies directly. Exactly.

00:49:32.902 --> 00:49:39.922
So Mark, so you're deeply involved in developing these sort of also new perspectives

00:49:39.922 --> 00:49:44.562
on problem solving and machine learning, which also have quite some impact.

00:49:45.442 --> 00:49:51.262
But so given this experience, what would you see as Mark's law for us to understand reality? Yeah.

00:49:52.320 --> 00:49:53.780
Ah, I have no idea.

00:49:55.820 --> 00:49:59.280
Mark's law. Acknowledge the structure. I don't know.

00:49:59.580 --> 00:50:04.980
That's an important thing, although sometimes I don't feel that my own research

00:50:04.980 --> 00:50:09.000
makes so much progress with that. But actually, I think this would be the most important thing.

00:50:09.120 --> 00:50:11.960
Acknowledge really the structure of the problems.

00:50:14.000 --> 00:50:19.020
Because, yeah, that's important. Right. And last point.

00:50:19.120 --> 00:50:21.420
So five years from now, I'm going to come visit you wherever you are.

00:50:21.420 --> 00:50:25.180
Maybe Stuttgart, and I'm going to confront you with a prediction you're going

00:50:25.180 --> 00:50:29.260
to make today. I'm going to say: look, Mark, you predicted to me today, or five

00:50:29.260 --> 00:50:33.840
years back, that the following would happen. Now, did it really come out like that?

00:50:33.840 --> 00:50:36.380
Was your prediction really supported?

00:50:37.020 --> 00:50:42.360
What's the prediction you would like to make today that we can test in five years? Right.

00:50:43.620 --> 00:50:50.620
Um, well, so our goal actually is, and I usually say to my students, in five

00:50:50.620 --> 00:50:55.360
years as a point of motivation, that we would have that robot that autonomously

00:50:55.360 --> 00:50:56.720
explores its environment,

00:50:57.260 --> 00:50:59.220
and discovers all degrees of freedom.

00:51:01.080 --> 00:51:05.200
That sounds like a simple statement initially, but if you look around the room

00:51:05.200 --> 00:51:08.100
right now and think about all degrees of freedom in that room,

00:51:08.180 --> 00:51:09.800
kinematic ones, there's a lot.

00:51:10.000 --> 00:51:12.940
And in order to discover them, the robot would have to start pushing,

00:51:13.280 --> 00:51:19.000
kicking, letting fall, letting drop, I don't know, everything, grasping all of

00:51:19.000 --> 00:51:23.560
these things and shaking them to see if something moves. And that's something I would want to...

00:51:24.200 --> 00:51:27.200
Okay, I'll come and see if the robot tore down your lab five years from now.

00:51:27.360 --> 00:51:29.780
Yeah. So, Mark Toussaint, thank you very much for this conversation.

00:51:30.420 --> 00:51:31.320
You're welcome. Thank you.

00:51:31.280 --> 00:51:38.000
Music.

00:51:37.960 --> 00:51:43.280
The CSN Podcast was produced by the Convergent Science Network of Biomimetics

00:51:43.280 --> 00:51:50.100
and Biohybrid Systems, a project funded by the European 7th Research Framework Programme.

00:51:51.140 --> 00:51:56.540
For more interviews, recorded lectures or upcoming conferences in the field

00:51:56.540 --> 00:52:02.600
of biomimetics and biohybrid systems, go to csnnetwork.com.

00:52:02.480 --> 00:52:11.600
Music.