Seth (00:01)
Hello and welcome to Learning from Machine Learning. On this episode, we have a very special guest, Maxime Labonne, the head of post-training at Liquid AI, the author of a very popular course, the LLM Course, with 50K stars on GitHub, and author of the best-selling books LLM Engineer's Handbook and Hands-On Graph Neural Networks. Maxime, welcome to the show.

Maxime Labonne (00:26)
Hi, thank you. Thanks a lot, Seth.

Seth (00:28)
It is so great to have you. We will get right into it. Maxime, what initially attracted you to machine learning?

Maxime Labonne (00:36)
 So this is a long answer, but I really started my career in machine learning during my PhD. Before that, I was in cybersecurity and I started a PhD doing machine learning applied to cybersecurity. So the overall topic was about how to detect cyber attacks on computer networks using machine learning techniques.

And yeah, during this PhD, I realized that I was a lot more motivated by everything related to AI and machine learning and a lot less by cybersecurity. And I think a big chunk of that is the fact that there was such an amazing community already there in 2017 in machine learning. And yeah, it's just difficult to compete with the number of articles and cool content that was already online.

Seth (01:31)
It's funny, I was the reverse. I did cybersecurity because I loved machine learning and it was a good application for me. Cool. Okay, so then what was like one of your first projects that like, what made you fall in love initially with machine learning? Was it the cybersecurity stuff or was it the next one?

Maxime Labonne (01:50)
No, it was really the cybersecurity stuff because it was really fun to see how you really can detect attacks. It feels a bit like sci-fi, you know, like we have all these stories with an AI able to detect hackers and stuff. And that was kind of really happening, at a small scale, of course. So that was really, really interesting. It also asks you deep questions about what's an attack in the first place, right?

I remember talking to a patent engineer, and the patent engineer asked me, okay, how do you define a cyber attack? And I told him, well, that's the output of my machine learning model. So he was not very happy with my answer because, yeah, that's kind of a non-answer, but it's also true, right? It's very difficult to define these things, and what you do when you train machine learning models

Seth (02:31)
I'm not going to that.

Maxime Labonne (02:44)
to detect cyber attacks is that you end up with a definition of a cyber attack that is extremely subtle. And yes, in the end, you don't really know what a cyber attack is, but your machine learning model can maybe know better than you.

Seth (03:01)
That's the amazing thing for me about AI in general and machine learning in general, that it's able to pick up patterns that humans aren't capable of even detecting. I think it's a nuance that I think a lot of people don't necessarily always have in mind when they're thinking about the benefits of machine learning. A lot of times they're thinking about...

Oh, how can I automate what humans can do now? But in actuality, machines are processing and looking at things in a much different way and they're able to find patterns. I think about computer vision, the things that they're able to do there. I mean, obviously everything that's going on with LLMs now. But yeah, it's that we shouldn't be expecting AI to just replace and do it in the same way that humans do,

but that they'll do it in their own way and find patterns that humans might not be able to find. Yeah, so, okay, so going from your PhD, if we were to fast forward now to your role as head of post-training at Liquid AI, what were some of the main steps that got you from your PhD to where you are today?

Maxime Labonne (03:56)
Exactly.

Yeah, so after my PhD, I joined an AI lab at Airbus in Paris. So that was my thesis supervisor who co-developed an AI lab with Airbus. And so he told me that, yeah, I should be a good fit there. And I followed his advice. So here I managed to get a bit outside of cybersecurity.

and more generally into computer networks first, and then expanding my scope to other fields, doing a bit of reinforcement learning, for example, and also applications that are really not related to networks. So that was really interesting to me. I learned really a lot about it. I also did a bit of quantum networks on the side, and that was also very exciting. And then I decided that I wanted to focus more on

Transformers, as we called them back then, and that motivated my choice to leave Airbus and join JP Morgan Chase in London. And this was just before the release of ChatGPT. I was really a huge fan of Transformers. Before that, I started during my PhD, because a lot of what you do when you try to detect attacks in computer networks is that you

learn an artificial language, you learn the network protocols that are used to communicate between computers. So this is really an artificial language and you end up using NLP techniques because they're the closest ones to what you're trying to achieve here. So this is also something that I applied at Airbus with some training of a BERT model, I called it Cybert, and also GPT-2 back then,

and I wanted to explore a bit more real NLP tasks and not just ones applied to computer networks. So this is the opportunity that I had at JP Morgan, working really with text and also with another modality that I was not too familiar with, and that was code. And now we know that LLMs are great at that and that's maybe the main application. And it was really exciting to also work on this

Seth (06:15)
Right.

Maxime Labonne (06:20)
in this bank because it has like over 300,000 employees. So when you do something for code internally, it already has like this massive scale.

Seth (06:30)
Right. Yeah, so what were the main things that you were working on there at JP Morgan?

Maxime Labonne (06:39)
So I would say that the main project, and that was also like a bit of my pet peeve, was this copilot, internal copilot model, because you have such a different ecosystem inside of the bank with dedicated APIs, dedicated libraries. You cannot just take GitHub Copilot and apply it there. It just doesn't work because the code is too different.

So what you really want to do if you want to create this auto-completion model is to train it on your own code. And this is already in the realm of post-training. And this is what I really enjoyed in this job: generating the data, pre-processing it, training models, managing experiments, and then evaluating the models and using this as a feedback signal to make better data and better models.

Seth (07:31)
Very nice. Okay, so that's a good transition to your role now as head of post-training. So, I mean, so many things, so many exciting things going on with Liquid. I think the most recent, but you can obviously correct me, the Liquid Foundation models and some of the new research that's coming out.

Why don't we start off with, yeah, could you give me like a little bit of what's the vision and mission of Liquid AI and the problems that you guys are addressing?

Maxime Labonne (08:01)
Yes, so Liquid AI is centered around the idea of efficiency: how to train the models in a more efficient way, how to deploy them more efficiently. And in concrete terms, it translates into memory efficiency, so using less VRAM. It also translates into faster throughput, as you can experience using our API models.

All of this stuff allows us to be very competitive, also in places where it's currently very difficult to deploy models. And something that you mentioned is the research. So some of my colleagues, Armin Thomas, Michael Poli, and Stefano Massaroli, released this Hyena Edge paper where they tried to really design the architecture

of a model from scratch to be as efficient as possible on edge devices.

Seth (08:57)
So for somebody who maybe wouldn't understand the challenges of... I think that everyone is realizing how powerful with OpenAI and Anthropic and all of the other models, how powerful these models are. But can you talk about what makes it difficult to get that sort of...

intelligence or whatever that level of model on any device and also what it means to be working and deploying models on the edge also if you touch upon that.

Maxime Labonne (09:37)
Yeah, absolutely. So it's really a matter of model capacity. Model capacity refers to this idea that if you have a lot of parameters inside of your network, you can learn and store a lot more information. So that's really nice. If you are concerned about hallucination, for example, which is a big problem with LLMs in general, you can retain a lot of facts.

because you have so many parameters, right? So when we talk about frontier models like GPT-4o, o3, o4, it can also be Claude 3.7 Sonnet, it can be Gemini 2.5 Pro. All of these models are really, really big. They're in the order of trillions of parameters. And what we want to do with edge models is try to...

approximate this level of intelligence, try to also be useful, or as useful as possible, but with a budget that is like 1,000 times, or even more than that, smaller than these models. So you're working with a model capacity that is extremely small. And because you don't have this model capacity to remember the facts, you end up with a lot more hallucinations, for example. And so you need to come up with

techniques to be as efficient as possible with the parameters that you have, and also maybe use the models in different ways where hallucination is not such a big issue. So they're not as general purpose as these frontier models, but they can be very, very useful and do some tasks very, very well. For example, if you take a 1B model for translation,

it can be extremely competitive with models that are 50 times, 100 times its size because it's really dedicated to this one task. So this is what you want to do with edge models: you want to train them in a way that really plays to their strengths instead of trying to approximate a ChatGPT. And the second main obstacle with edge models is how to deploy them. When you want to deploy a model on GPU, everything is

Seth (11:42)
Right.

Maxime Labonne (11:52)
pretty standardized now. If you use NVIDIA GPUs and CUDA, everything is pretty much already in a very good state. The inference stack is very mature. Now, if you want to deploy a model on a phone or on a drone, on a satellite, wherever you want, really in the realm of IoT, I would say, this is non-standard. And here you will face a lot of

challenges related to CPU inference, for example, related to really the chips themselves and the stack, the maturity of the stack it supports. So there are these two main obstacles with edge models. Training, how to train them to be as useful as possible, but also deployment, how to deploy them in places where you don't usually see AI models.
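The capacity gap described here can be made concrete with back-of-the-envelope arithmetic. The sketch below estimates only the memory needed to hold a model's weights (parameters times bytes per parameter). The model sizes and precisions are illustrative assumptions, not figures for any specific model, and the estimate ignores activations, caches, and runtime overhead.

```python
# Rough weight-memory estimate: number of parameters x bytes per parameter.
# The sizes and precisions below are illustrative assumptions only.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Memory needed just to store the weights, in gigabytes (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    models = [("1B edge model", 1e9), ("1T frontier-scale model", 1e12)]
    for name, params in models:
        for precision in BYTES_PER_PARAM:
            size = weight_memory_gb(params, precision)
            print(f"{name:>24} @ {precision}: {size:,.1f} GB")
```

Under these assumptions a 1B-parameter model fits in about 2 GB at 16-bit precision, while a model 1,000 times larger needs about 2,000 GB for the weights alone, which is why the edge setting forces both a small parameter budget and aggressive memory efficiency.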

Seth (12:23)
Right.

Right, okay, so just saying it back, the two challenges for Edge and getting things on smaller devices or non-standard devices, it's the training and understanding the objective that it's going to be trained for. So perhaps limiting the scope of what you're going to be doing as opposed to taking something like one of the frontier models, which is just kind of general purpose, good at everything, really understand the problem that you're trying to solve. If it's translation, for example,

You know, you could have a model that could work on a smaller memory footprint. But then that's also the second challenge, which is that you're going to have some memory constraints also, trying to get models... Well, I mean, there's a question. Does the model need to fit on the device, or does it need to be able to communicate with that model somehow? It could communicate still with an API,

I guess, right? Depends.

Maxime Labonne (13:44)
Yeah, but if you do it with an API, then it's not edge deployment, right? What you want to have in most cases is really on device because it gives you a lot of nice features like privacy. Nobody is going to see the instructions and the answers. So this is very important, for example, in a lot of companies where data privacy is a big concern. It also is a lot cheaper

Seth (13:49)
Okay. On device, yes.

Maxime Labonne (14:11)
because you don't pay per token now. You just don't have to pay at all because you're running it on device, right? So there are these really, really nice advantages if you want to run models on device. And I think this is currently really under explored and there's a ton of potential there.

Seth (14:16)
It's just compute.

Very cool. So you mentioned some of the devices. so like I think about mobile phones. I didn't think about satellites. That's cool. Are there other ones that come to mind when you think about edge devices? Which other ones do you think about?

Maxime Labonne (14:46)
It can be things like drones, for example. It can be everything like tablets, laptops, especially like old laptops that do not have any GPU. That's a challenging one. Absolutely.

Seth (14:58)
I guess robotics also, right? Like any kind of robot.

That's cool, man. The future is going to be awesome. Okay. So what are some of the techniques that you're using, or that can be used, to get some of the capabilities of these frontier models, let's say, on the edge, on device?

Maxime Labonne (15:19)
Yeah, so there are two sides of it. During training, for example, the popular technique to do this is distillation. During distillation, what you do is that instead of learning one token, you're going to learn the distribution of tokens. So you have a lot more signal now. And supervised fine-tuning, for example, is just a

It's just one example where you only care about one token, but instead of that, you can look at like five, 10, 20 tokens. And this gives you the extra signal that the model needs to be a lot more efficient during training. So that's one of them. Another one that I'm particularly fond of is model merging. With model merging, you take the weights of multiple checkpoints and you apply some merging technique. It can be like as simple as averaging the weights.

And this tends to give you weights, like a new checkpoint, that is smarter in general, that is more performant and more robust. So there's a lot of creativity, I would say, in this space on how to merge the models and how to make them as efficient as possible.
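The difference between learning one token and learning a distribution of tokens can be sketched with two loss functions. This is a minimal illustration in plain Python over a toy vocabulary, not the training code of any particular lab: `hard_label_loss` is ordinary cross-entropy against a single correct token, and `distillation_loss` is a KL divergence between temperature-softened teacher and student distributions. The function names, the temperature value, and the logits are all assumptions made for the example. In practice the teacher distribution is often truncated to its top few tokens, as mentioned above.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, optionally softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hard_label_loss(student_logits, target_index):
    """Cross-entropy against one correct token (the supervised fine-tuning signal)."""
    probs = softmax(student_logits)
    return -math.log(probs[target_index])


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over the vocabulary, using softened distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)


if __name__ == "__main__":
    # Toy 4-token vocabulary; these logits are made up for illustration.
    teacher_logits = [4.0, 3.5, 0.5, -1.0]  # teacher considers tokens 0 and 1 both plausible
    student_logits = [2.0, 0.0, 0.0, 0.0]
    print("hard-label loss:  ", hard_label_loss(student_logits, target_index=0))
    print("distillation loss:", distillation_loss(student_logits, teacher_logits))
```

The hard-label loss only tells the student that token 0 was correct. The distillation loss also tells it that token 1 was nearly as good and that tokens 2 and 3 were poor choices, which is the extra signal per training position.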

Seth (16:31)
Yeah, yeah, if I'm remembering correctly, you have like libraries and whole sets of models, right? Like you have a merge kit. Is that what it's called?

Maxime Labonne (16:44)
I use MergeKit, made by Charles Goddard and Arcee AI. It's a great library if you want to merge models, I definitely recommend it.
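The simplest merging technique mentioned above, averaging the weights of several checkpoints, can be sketched in a few lines. This is a toy illustration in plain Python and is not MergeKit's API: checkpoints are represented here as dictionaries mapping a parameter name to a flat list of floats, and `average_checkpoints` is a hypothetical helper name. It assumes all checkpoints share the same architecture, parameter names, and shapes.

```python
def average_checkpoints(checkpoints, weights=None):
    """Linearly merge checkpoints that share parameter names and shapes.

    checkpoints: list of dicts mapping parameter name -> flat list of floats.
    weights: optional per-checkpoint coefficients; defaults to a uniform average.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if weights is None:
        weights = [1.0 / len(checkpoints)] * len(checkpoints)
    if len(weights) != len(checkpoints):
        raise ValueError("one weight per checkpoint is required")

    merged = {}
    for name, reference in checkpoints[0].items():
        tensors = [ckpt[name] for ckpt in checkpoints]
        if any(len(t) != len(reference) for t in tensors):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Weighted sum of the same parameter across all checkpoints.
        merged[name] = [
            sum(w * t[i] for w, t in zip(weights, tensors))
            for i in range(len(reference))
        ]
    return merged


if __name__ == "__main__":
    # Two made-up checkpoints with a single tiny parameter each.
    ckpt_a = {"layer.weight": [1.0, 3.0]}
    ckpt_b = {"layer.weight": [3.0, 5.0]}
    print(average_checkpoints([ckpt_a, ckpt_b]))
```

Real merging libraries work on full tensors and offer more elaborate methods than a linear average, but the core operation is the same element-wise combination of matching parameters.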

Seth (16:55)
Okay. And then I know some of your models were getting to the top of some benchmarks. NeuralBeagle, if I remember. Yeah.

Maxime Labonne (17:03)
Yeah, I used it quite a lot with the Open LLM Leaderboard on Hugging Face. Now it is closed, but before that, it was a really fun place to just try, experiment with models, do some crazy techniques. And you had this free evaluation provided by Hugging Face to give you a signal.

Maxime Labonne (17:27)
So this was a bit too hacked in the end, that's why they decided to stop it. But yeah, I had a really fun time doing this and writing about it in articles.

Seth (17:38)
Yeah, thank you for all of the incredible content that you create. I feel like for a while it was like every couple of months there was a new model by you that was amazing, and then I like all your articles on Hugging Face. There are some really good ones there. Do you have any other places where you put out your information? Like do you have your own blog or anything?

Maxime Labonne (18:01)
Yeah, I also have my own blog that has all my articles. It's actually the only place where you can find all the articles I've ever written. Also, I used Medium, but now I don't think that Medium is such a popular place for technical articles. Maybe I should...

Seth (18:09)
all of it.

It was for a while and then now it's kind of fallen off, yeah.

Maxime Labonne (18:22)
Yeah, it feels like now Substack is really the place to be in this space, but yeah, my Substack is not really up to date, unfortunately.

Seth (18:29)
Yeah, I'm still working on getting my Substack up to date. Yeah, I was on Medium. I haven't published on Medium in a bit. Anyway, okay, let's get back into some of this. Let's go into some machine learning stuff, like some general questions. There's a huge amount of hype

Maxime Labonne (18:33)
Hahaha.

Seth (18:48)
around all the capabilities of AI, right? You know, it couldn't be higher, for people in the field and out of the field. I think everyone kind of has their own view on it. I'm curious, from somebody who's really on the cutting edge of the research, of all of this: how do you view the gap between the hype and the reality of AI and machine learning?

Maxime Labonne (19:13)
Yeah, I like how you phrased it, saying that even people inside of AI fall into these hype cycles. And this is something that I see very often. We have a hype cycle where people working in AI really contribute to it, share information that is, let's say, not verified. And most often, like quite...

surprising, I would say. To give a concrete example, we had this with the Reflection 70B model that was released at the end of 2024. And yeah, when a model is released for free and claims to be the state-of-the-art model of the world, maybe you should not trust it at face value. Crazy idea. But because people wanted to believe it, and I understand, I also wanted to believe it,

Seth (19:46)
Man.

Maxime Labonne (20:07)
then you have this hype cycle where this news is shared so much that people assume that it's true, right? Like, everybody said it was true, so it must be. And then several days later you realize that no, nothing was true. Everything was actually fake. So yeah, this is not an isolated example and I don't want to blame these people in particular. It's just that in general

Seth (20:32)
Yeah.

Maxime Labonne (20:33)
these hype cycles are quite hurtful because they silence news that is actually relevant and important. And that's too bad, right? But yeah, in general, I would say that besides these hype cycles, I don't see a particularly big gap, but that is also because I don't really follow mainstream news outside of AI. So I'm not particularly aware

Seth (20:42)
Right.

Right, nah, that's an interesting one. I hadn't thought about that one in a while. That was wild. That was wild. Everyone was

freaking out about that. And then influencers start to post it and then people are reposting it. And then you're like, this is everywhere. Everyone is talking about this model. And then you realize... I always wonder what percentage of people who post about a model have actually even prompted it once.

Maxime Labonne (21:28)
Yeah, I'm pretty sure it's very low. Same thing with people sharing scientific papers. Not sure that they read them.

Seth (21:39)
Yeah, I think so too. And especially now when you can plug a PDF into one of the frontier models and say, convert this into a LinkedIn post for me. That's why I appreciate your posts, because I know that when you post an article, you're thoughtful about it and I can take it

for what it is, a real take on it. So thank you for, you know, keeping the signal strong. Going back to the other thing where people are just kind of posting stuff: what happens is it muddies the water, right? It makes the signal-to-noise ratio a little lower, right? And it's like, all right, there's just going to be all of this noise about something. It's hard to tell what's real and what's not. And the only way to really know is

getting into the details, getting into the implementation sometimes, trying things out on your data set or trying things out for your use case. I think that goes into this whole... I mean, yeah, there have been cases where people are adding, you know, whether it's on purpose or not on purpose, the tests from benchmarks into their training data.

And then it's like, well, how good is this model actually performing? Oh, it's at the top of the leaderboard for this benchmark. But then it's like, well, what does it mean? So yeah, I'm curious also on that: what's your take on the state of benchmarks? Are there certain ones that you pay more attention to? What's your take on

benchmarks being used to sort of measure the value or quality of LLMs?
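The concern about benchmark tests ending up in training data is often checked with a simple n-gram overlap heuristic. The sketch below is a minimal version of that idea, assuming whitespace tokenization and exact matching. The function names, the default n-gram length, and the example strings are all hypothetical, and real contamination checks are typically more careful about normalization and scale.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in a text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items, training_docs, n=8):
    """Fraction of benchmark items sharing at least one n-gram with the training data.

    Returns (rate, flagged_items). An item is flagged on any exact n-gram match,
    so this is a coarse screen, not proof of contamination.
    """
    if not benchmark_items:
        return 0.0, []
    train_ngrams = set()
    for doc in training_docs:
        train_ngrams |= ngrams(doc, n)
    flagged = [item for item in benchmark_items if ngrams(item, n) & train_ngrams]
    return len(flagged) / len(benchmark_items), flagged


if __name__ == "__main__":
    # Made-up data for illustration; n is lowered because the strings are short.
    benchmark = ["what is the capital of france", "solve two plus two"]
    training = ["quiz: what is the capital of france answer paris"]
    rate, flagged = contamination_rate(benchmark, training, n=3)
    print(f"{rate:.0%} of benchmark items overlap with training data: {flagged}")
```

A check like this only catches verbatim leakage. Paraphrased or translated test items pass through it, which is one reason a leaderboard position is hard to interpret on its own.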

Maxime Labonne (23:29)
Yeah, this is a heavy question because the state of benchmarks is not the best at the moment. I think I mentioned the fact that the Open LLM Leaderboard made by Hugging Face and Clémentine Fourrier closed, and that was one of the main sources to evaluate models back then. And right now,

the main source, and it started in 2024, is the Chatbot Arena. I feel like a lot of people take the Chatbot Arena at face value. Even people working with LLMs really trust it a lot, which is very strange to me. And we had this conversation with a lot of people who care about evaluation and are in post-training in 2024. And now I feel that in 2025,

this is becoming mainstream: the Chatbot Arena is not reliable. This is just another signal. It doesn't mean that it doesn't have value. It's just an unreliable signal. So you need to combine it with other signals. You need to combine it with other benchmarks. Ideally, what you want to do from a Bayesian perspective is to combine all these signals, try to weight them in a way, with like a college of experts, for example,

update your assumptions all the time because some of these benchmarks are good now and they're going to be bad later. Some of them might improve, who knows. And this is the only way I see to really effectively evaluate the models: to multiply the signals. And what's good is that people do it by themselves. If you look at LocalLLaMA, if you look at Twitter, if you look at all these communities,

A lot of people are actually interested in making their own benchmarks for specific use cases. They do it for code, they do it for like very niche stuff like is this model good with MCP for example? And this is something that we see providing real value because now we can just combine all of these like hundreds of individual benchmarks and get a real score.
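One simple way to read "combine all these signals and weight them" is a precision-weighted average: each benchmark reports a normalized score, each gets an assumed noise level, and noisier benchmarks count for less. The sketch below is an illustration under strong assumptions (independent Gaussian noise, a flat prior, scores already normalized to the 0 to 1 range), not a method described in the conversation. The benchmark names, scores, and noise levels are invented for the example.

```python
def combine_scores(scores, noise_std):
    """Precision-weighted mean of benchmark scores.

    scores: dict of benchmark name -> normalized score in [0, 1].
    noise_std: dict of benchmark name -> assumed standard deviation of its noise.
    Returns (combined_mean, combined_std). Under independent Gaussian noise and
    a flat prior, this is the posterior mean and standard deviation.
    """
    precisions = {name: 1.0 / (noise_std[name] ** 2) for name in scores}
    total_precision = sum(precisions.values())
    mean = sum(scores[name] * precisions[name] for name in scores) / total_precision
    std = (1.0 / total_precision) ** 0.5
    return mean, std


if __name__ == "__main__":
    # Hypothetical signals for one model; all numbers are made up.
    scores = {"arena_style_votes": 0.80, "private_code_eval": 0.60, "community_mcp_test": 0.65}
    noise = {"arena_style_votes": 0.20, "private_code_eval": 0.05, "community_mcp_test": 0.10}
    mean, std = combine_scores(scores, noise)
    print(f"combined score: {mean:.3f} +/- {std:.3f}")
```

Updating assumptions over time then amounts to revising the noise levels: a benchmark that becomes saturated or gamed gets a larger assumed noise and contributes less to the combined score.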

Seth (25:43)
Yeah, I think you make a bunch of good points. Each result is just one data point. You know, you can't treat it like it's the be-all and end-all. I mean, even when it comes to evaluating models, even if you're on your own data set and everything, it's like, you're looking for this one metric that's going to say, okay, model A is better than model B, right?

But it's never that simple, especially as the tasks are more complex than say like a binary classifier. Even in that case, sometimes it's difficult to just say like, the accuracy is higher. Well, is it the class that you care more about or is it the class that matters more? Is it able to do the job that it wants to do? And yeah, another interesting thing that you're saying basically like, yeah, the value of a particular benchmark

changes over time. Well, I guess in my mind, I'm thinking about

how available is that data? And then how saturated it may have been in some round of the training of that model. So I know, like with math questions, right? When a new math benchmark comes out and, let's say, it's the 2024 benchmark and it just came out, it means much more right then than it does a year from then when the answers are more available.

I know that's a thing also in ARC-AGI and things like that. Like they always need to be evolving what the questions are going to be. Because even if you have this holdout set, it can influence the way that you adapt your model based off of the results that you're getting. Even if you don't see the answers, it can then affect it. And in some ways, there's some amount of like, you know, just...

traditional machine learning like leakage, you know, between training and your validation set. I think that the key is that you have to figure out what use case you have, what use case you're gonna be using that model for.

And then you can also think about it as like a system. I'm always thinking about it as a system of things. I think we're always kind of looking for that one thing that's going to be able to solve all your problems. For a lot of the problems that I'm doing, it's usually a combination, right? It's taking the best of, say, traditional machine learning, say a fine-tuned transformer, sometimes rules and things like that work really well also, sometimes lexical-based things.

And then what really... I mean, yeah, usually a little bit of an LLM getting involved somehow does bring it up a certain amount too. But I find it's usually figuring out what's the best combination of tools that can get your job done, as opposed to a single one. Yeah. So what's your life like? What's your day to day like? I know I'm switching it up pretty quick, but I'm just curious. Like, you're working on such

amazing things, and, you know, bringing in new architectures. I know that one of the things with Liquid, and please correct me, is that it's kind of challenging the idea that it needs to be a transformer, right? I think that's a lot of the thinking around it, and people might realize that transformers will only take us so far, that there's sort of a ceiling to it.

But yeah, curious like your take on that as well. So I'll ask that question first and then maybe we can go back to the other one.

Maxime Labonne (29:22)
Yeah, model architecture is a difficult field. I would preface it like this, because everything that you create is not compared to the vanilla transformer that was released in 2017. It is compared to the modern version of the transformer. The modern version is highly optimized. It has things like crazy optimizations, like FlashAttention for example,

that makes it really, really strong. It's an incredibly strong baseline. So everything that you do has to be compared not to this original transformer, not to like some academic variations over it, but to the actual transformer that is used in real life. And here, what you see is that often you need to make some trade-offs. You have model quality, for example.

And then you have things like, I would say, more like inference. And you might decide, okay, I'm going to change my transformer architecture or I'm going to create a new architecture that will not perform as well as the transformer in terms of quality. But I get something else. I get, for example, super fast throughput. And this you can

then, with training, convert back into model quality. This is something that we see with long chains of thought: you're able to just throw more tokens at the problem and get better results overall. So this is really interesting because now, with all of these techniques, it unlocks some architectures that might not have been that great a year ago.

Maxime Labonne (31:10)
So this is really exciting to also work on, but I have to say that model architecture is not enough. There's no silver bullet. Usually you need a good architecture, but then you need state-of-the-art training. Otherwise, even if you had a transformer architecture, if you don't train it well, it's not going to perform well at all, right? And you also need state-of-the-art inference.

And this is also very difficult because we've optimized the inference of the transformer for years. So even though your theoretical advantage might be huge on paper, in practice, oftentimes, it's really challenging to defeat a transformer. So this is the problem Liquid has been working on since its inception. And now we're really happy that we made huge progress.

in every aspect of the pipeline to really have a custom architecture that can challenge the classic, the traditional transformer architecture that is used in the industry.

Seth (32:16)
Yeah, you're making a really good point. It's that you're not necessarily going up against just a different architecture. It's become much more than that because, okay, yeah, you know, transformers... I mean, the ideas for transformers have been around for a while, but let's take, you know, "Attention Is All You Need." 2017, the transformer coming out. That is not the same transformer that is being used in 2025. We're talking about

eight years of optimization where people had placed this bet on Transformers and then therefore invested in it. And that's the interesting thing about placing bets is that whether it's the right or the wrong one, once you make that bet, you are then invested and then you start to build the ecosystem around it. I mean, think about Hugging Face, for example, and all of the models that are on there. I mean, are there any non-

I mean, they're all related somehow to transformers, I believe. I mean, I'm sure there's a small subsection that aren't. I guess, yes, there's some state space models and things like that. But let's just say the large majority of them are transformers. And then think about what has been built around that. The inference, right? The ability to try out these things. And then, like what you're saying, I think with everything, and this is like a common theme that I'm coming across,

in these conversations that I've been having with people at work, it's always, it's never the first version, right? It's the ability to iterate. It's the ability to go through a feedback cycle and understand, okay, this is the initial getting feedback, however, from users, from evaluations, from anything, taking that feedback, working it back into the system and continuing to make it better and better and better.

And yeah, like this is something that people have been working on for eight years. So in some ways, when you're creating an architecture, trying to challenge this, you almost have to be willing to take a step back to then take multiple steps forward. And it's interesting because you think about, you know, Yann LeCun, right? He has openly spoken about, you know, like he's way beyond

LLMs at this point, right? He's thinking about whatever he calls it, you know, different ways of modeling and thinking about the world and representing everything that's going on, joint embedding spaces and things like that. Yeah, finally, I got the right word. It's very interesting to think about. It's not the architecture alone. It's using it. It's the inference around it. It's the supporting

hardware architecture for it as well, right? There are whole companies that are now, you know, based around it. So that's, yeah, that's a big thing to be trying to challenge. It's an admirable mission and it's an admirable vision, knowing that there's an artificial ceiling that transformer-based models have, but that you have to kind of take a step back in certain cases to then push the field and the

you know, with all of this research forward. But yeah, so how is that? How's it going, you know?

Maxime Labonne (35:42)
I think I want to take the perspective of the field in general and not just Liquid AI. Because for Liquid AI, yes, it's going well. I think we're happy with what we have and we keep iterating over it. As you said, I think this is something very fundamental about research and development in general: you need to iterate. The first version will not be great, and this is a pattern that we see over and over in the industry. Claude 1 was the worst model. Claude 2 was...

even worse maybe, and Claude 3 was a beast, and we don't care about the first two versions, right? So you need to also have this ability to take the step back in the first place. And this is not something that Liquid is the only company to do, right? There are other companies working in the space of, okay, let's try to rethink the model architecture. Some take a small step back and decide to just like...

Seth (36:13)
Yeah.

Maxime Labonne (36:40)
add something to the transformer architecture, like patch it, and others try to make, like, a big step. And this is something that you talked about with SSMs, state space models. This is a complete change in terms of architecture, which is very, very interesting. There are also ways to combine all of this, which is interesting too. And in all this search space, we start seeing that

Seth (36:44)
Hmm

Maxime Labonne (37:10)
everybody is now convinced that the transformer architecture might not be the perfect architecture that we thought it was, maybe a few years ago. And now you can add a bit of recurrence to it, so it's better with long context, or you can maybe see how to handle your KV cache in another way that doesn't consume as much memory. So all of these modifications, big and small, tend to show that there are things that can be done in terms of model architecture. The transformer architecture is not the most optimal one in the world. And that's really exciting, because we see a lot of problems right now with this architecture. And seeing that this is evolving means that we're going to unlock new capabilities that we could not have before. It's not just in terms of, like, quality, not just small steps.

This is the only way to really achieve big steps in terms of improvements. And that's what excites me.
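
To make the KV cache point concrete, here is a rough back-of-the-envelope sketch in Python. It uses the standard size estimate for a transformer's key/value cache (two tensors per layer, growing with sequence length) and compares it with a fixed-size recurrent state. The model dimensions and the state size are hypothetical values picked for illustration; they do not describe any Liquid AI model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Memory for one sequence's key/value cache in a standard transformer."""
    # Factor of 2: one tensor for keys and one for values, in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def recurrent_state_bytes(num_layers, state_size, bytes_per_value=2):
    """Memory for a fixed-size recurrent state; it does not depend on seq_len."""
    return num_layers * state_size * bytes_per_value


# Hypothetical configuration (assumption, for illustration only).
config = dict(num_layers=32, num_kv_heads=32, head_dim=128)

for seq_len in (4_096, 32_768, 262_144):
    gib = kv_cache_bytes(seq_len=seq_len, **config) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:6.1f} GiB of KV cache")

# Hypothetical per-layer state size (assumption, for illustration only).
state_mib = recurrent_state_bytes(num_layers=32, state_size=4096 * 16) / 2**20
print(f"recurrent state -> {state_mib:.1f} MiB at any sequence length")
```

The cache grows linearly with the number of tokens, while the recurrent state stays constant, which is one reason long-context work looks at recurrence and at alternative ways of handling the cache.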

Seth (38:12)
Yeah. Yeah. It's a very exciting field to be in, probably filled with challenges. But that's cool. Do you want to briefly talk about some of the cool architecture stuff from a high level? How could somebody, at a high level, kind of understand, say, state space machines?

Maxime Labonne (38:38)
OK, so state space models come from dynamical systems. And it's basically just two equations, right? And this is, like, the core of the model. So I will not dive too deep into the details of this, because I don't want to get it wrong live. But what it achieves is you have two modes. And you can

Seth (38:41)
State space models, yeah.

Maxime Labonne (39:03)
use it to be very fast during inference. And that's something that is really, really interesting with SSMs: the processing speed, and also the fact that they work not just with text, but with other modalities, with other sequential data, for example, molecules. So this is something that you can combine to make models that can process DNA.

DNA is incredibly long, so you need an enormous context window to process it. And SSMs and all their variants, like S4 and S5, can do that in a very, very precise way. So some of this work is maybe more domain specific. When I talk about DNA, this is a great application. You can also think about it with code, for example. What if I want to embed an entire code base?

Now you can do it with some models like Gemini, but maybe there are more efficient ways of doing it with another architecture.
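
The "two equations" and "two modes" mentioned above can be sketched in a few lines of Python. A discrete linear state space model is commonly written as x_k = A x_{k-1} + B u_k and y_k = C x_k. The same model can be run step by step (recurrent mode, cheap per token at inference) or as a single convolution with kernel K_j = C A^j B (parallel over the whole sequence). This is a toy under stated assumptions: A is diagonal, and the parameter values and input sequence are made up. It is not S4, S5, or any Liquid AI model.

```python
def ssm_recurrent(a, b, c, inputs):
    """Recurrent mode: update a fixed-size state one input at a time."""
    state = [0.0] * len(a)
    outputs = []
    for u in inputs:
        # State equation: x_k = A x_{k-1} + B u_k (A is diagonal here).
        state = [a_i * x_i + b_i * u for a_i, b_i, x_i in zip(a, b, state)]
        # Output equation: y_k = C x_k.
        outputs.append(sum(c_i * x_i for c_i, x_i in zip(c, state)))
    return outputs


def ssm_convolutional(a, b, c, inputs):
    """Convolutional mode: precompute the kernel K_j = C A^j B, then convolve."""
    n = len(inputs)
    kernel = [
        sum(c_i * (a_i ** j) * b_i for a_i, b_i, c_i in zip(a, b, c))
        for j in range(n)
    ]
    # Causal convolution: y_k = sum over j <= k of K_j * u_{k-j}.
    return [sum(kernel[j] * inputs[k - j] for j in range(k + 1)) for k in range(n)]


# Made-up parameters and inputs (assumptions, for illustration only).
a, b, c = [0.9, 0.5], [1.0, 0.5], [0.3, -0.2]
inputs = [1.0, 0.0, 2.0, -1.0]

y_rec = ssm_recurrent(a, b, c, inputs)
y_conv = ssm_convolutional(a, b, c, inputs)
print(all(abs(r - v) < 1e-9 for r, v in zip(y_rec, y_conv)))
```

Both functions compute the same outputs. The recurrent form only ever keeps a state of fixed size, which is what makes very long sequences such as DNA tractable at inference time.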

Seth (40:02)
Nice. Are there any other architecture things that are worth kind of going into? Maybe like mixture of experts or something like that?

Maxime Labonne (40:13)
Yeah, mixture of experts is really, really cool. As we also saw with DeepSeek R1, it can be really, really powerful. It's also a way to be a lot more efficient because, OK, you still need to store a lot of parameters, so you consume a lot of memory, but then you don't use as much during inference because you just select some experts. You route to them for your given query. So this is really, really interesting and something that we think about at Liquid quite a lot. Actually, we already released a 40B model, which is a mixture of experts, in September 2024. And in terms of architecture, there's also this recent work around Hyena Edge that my colleagues released.

And this is also a very interesting one, because on top of the attention from transformers and the recurrence from SSMs, they also play with convolutions, and they do it in an automated way. So they created this genetic algorithm called STAR that is able to refine a population of model architectures over time and optimize them so they get faster, they take less memory, and they maintain the same level of quality. So yeah, really interesting work, if you want to check out Hyena Edge.
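
The expert routing described at the start of this answer can be sketched as follows: a router scores every expert for the current input, only the top-k experts are actually run, and their outputs are mixed with softmax weights. All parameters are stored, but only a few are used per query. The router weights and the toy "experts" below are hypothetical stand-ins, not the routing scheme of DeepSeek R1 or of any Liquid AI model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_weights, experts, top_k=2):
    """Run only the top_k highest-scoring experts and mix their outputs."""
    # Router: one linear score per expert.
    scores = [sum(w * x_i for w, x_i in zip(row, x)) for row in router_weights]
    # Keep the indices of the top_k experts; the others are never evaluated.
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    gates = softmax([scores[i] for i in chosen])
    output = [0.0] * len(x)
    for gate, i in zip(gates, chosen):
        expert_out = experts[i](x)
        output = [o + gate * y for o, y in zip(output, expert_out)]
    return output, chosen


# Toy experts (assumption): each one just scales its input by a constant.
experts = [lambda x, s=s: [s * x_i for x_i in x] for s in (1.0, 2.0, 3.0, 4.0)]
# Hypothetical router weights, one row per expert.
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

output, chosen = moe_forward([2.0, 1.0], router_weights, experts, top_k=2)
print("experts used:", chosen)
print("output:", output)
```

With four experts and `top_k=2`, half of the expert parameters sit idle for this input, which is the memory-versus-compute trade-off mentioned above.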

Seth (41:42)
Very cool. I'll add a link to the notes and I have to take a dive into that. I think it just came out last week, right? Yeah. Okay, cool. Okay, zooming out into some of the Learning from Machine Learning questions: what advice would you give to somebody who's just starting their career, or who's early in their career, or who wants to get involved in machine learning?

Maxime Labonne (42:08)
Yeah, it's a very difficult question. The answer keeps changing. But I would say that what you want first is to cover a lot of breadth, like having a good perspective of what the field is about. I created the LLM course on GitHub that gives you this very broad perspective of, like, the LLM scientist, the LLM engineer, what it means to make models, what it means to put them into production. So knowing the entire ecosystem is very good for you to have this high-level perspective. And then it also allows you to see what you are most curious about, what you like the most. For example, in my case, I am a big fan of everything related to post-training: how to turn the base model into a useful assistant, or it could even be something else.

You can turn the base model into whatever you want, and that's the beauty of post-training. It allows you to specialize. And I would say when you learn this stuff, try to be as hands-on as possible. It's not enough to just read text. You need to code it by yourself. At least that's my technique, and that's what I would recommend: trying to be low-level, maybe not too low-level to start with, but being able to run scripts and then being able to write the PyTorch code is good enough for, like, most things. And this is the approach that I try to have with the content that I produce online: to really give keys to understand the underlying principles behind it, and learning by implementing this stuff, so you really know what's going on inside of it. It's not just high level.

Seth (43:57)
Yeah, I couldn't agree more. It's just so important, both of those things. I think that sometimes people are like, I really want to understand, let's just take LLMs, and then they'll kind of jump to the end, and you can probably do a lot. You know, like they say, you can use a microwave without understanding how a microwave works. But there's something about also kind of taking that breadth first. Like, I mean, yeah, the LLM course, or even maybe a general course. I mean, I started with that Andrew Ng class, you know, like 10 years ago. I think it's a little bit different now.

Maxime Labonne (44:39)
Hardcore.

Seth (44:40)
Yeah, that's true. I was using Octave, the open source MATLAB, which was kind of cool at the time. It's kind of funny to think about. But the thing I was going to bring up was finding a project, and one of the best projects out there is probably the one that you have, the LLM Twin. That's such a cool idea. Can you just briefly talk about that?

Maxime Labonne (45:03)
Thank you for plugging my book. I forgot about it. Yeah, this is also a good resource to start with. So in the LLM Engineer's Handbook, my co-author and I made this end-to-end project that runs throughout the book. And the idea is, hey, you have text online, you have, I don't know, articles, tweets, whatever posts that you made. What if you take all of this data and you train an LLM to sound just like you?

It doesn't necessarily have, like, a very practical application. Maybe you can create a startup around this idea, but this is more like a fun example of how to use this technology. And this is also a great way to see all the different steps, and not just training, you know, but also how to crawl the data in the first place and then how to...

Seth (45:37)
Right.

Maxime Labonne (45:55)
preprocess it so it's in a form that is consumable by the LLM, how to evaluate the model, how to create the right pipeline and deploy it in a real production environment. So with this example, what I want to see is, like, all the different steps of a traditional LLM project, and to provide the best practices at every step, because what is currently really missing, in our opinion, is best practices.

Everything changes very fast. So there are probably 10 ways of implementing the same thing. What's the best way, or what's a good way of doing it? What's something that is reproducible, that is repeatable, that is using best practices from an engineering standpoint? So yeah, this is the example I would recommend if you want to make a complete project.

Seth (46:47)
Yeah, you broke it down. I was going to say that you need to have that end to end. Because I think a lot of the time, like maybe in an academic setting or in a competition or something like that, they give you a data set. And that's half the battle when you're a data scientist or when you're doing a machine learning project. I mean, after you figure out what the actual problem is that you're trying to solve: okay, what data can help me solve that problem, how am I gonna get that data? And then you have to keep asking, is it the right data? And then you keep iterating on that.

But I like that project because you're doing that end to end, where you're getting all the things that you said: you're getting the data, you're processing it into a form where you can actually use it, and then you're producing something. I also make the recommendation to try to get your model in front of someone, somehow deploy it, and get that feedback. You just learn so much in that process.

I'll throw this one. What's an important question that you believe remains unanswered in machine learning?

Maxime Labonne (48:08)
I think that the most fundamental question is something that you just talked about. It's the data quality. What's a good sample? That's it. I think most of my work is about answering this question every day. It's about generating data, training a model. And when we evaluate the model, what you actually do is evaluate the data it's been trained on.

Unless you really messed up your hyperparameters, they're rarely the problem. The issue is always the data, and being able to qualify this data is very, very, very complex, because it's not just one data sample. It's an entire data set. So the samples also interact with each other. The order also matters. A lot of these things matter.

And this is very difficult to define. So to still give a brief answer to this question that I raised, I would say that I try to think about it in terms of accuracy. Is my sample answering the question? Does it answer it well? Is it relevant? Then, is my data diverse enough? Does it cover all the use cases? This is something that you said about people interacting with your models.

Usually they ask stuff that is completely random and that you never thought about. And that's probably not part of your data set. So you also need to cover as many use cases as possible. And the final one is complexity. You don't want to give samples that are too easy to your model. You want to challenge the model by providing samples that are difficult, that are challenging, that will really...

Seth (49:34)
Right.

Maxime Labonne (49:57)
train it, and not just something that it could do without any training. So those are the three key properties. I'm not saying that's all there is to it, otherwise it would be a solved problem, but it's a good framework to think about it and at least get your data set to a good quality level.
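
As a rough illustration of how the three properties (accuracy, diversity, complexity) might turn into a filtering pass over a data set, here is a minimal Python sketch. Everything in it is an assumption made for the example: the sample fields, the `is_accurate` placeholder check, the word-overlap similarity used as a diversity proxy, the integer `difficulty` label, and the thresholds. Real pipelines typically rely on much stronger signals, such as model-based judges or embeddings.

```python
def jaccard(a, b):
    """Word-overlap similarity between two strings (0 = disjoint, 1 = same words)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def filter_samples(samples, is_accurate, min_difficulty=2, max_similarity=0.8):
    """Keep samples that pass all three checks; earlier samples win ties."""
    kept = []
    for sample in samples:
        if not is_accurate(sample):                    # accuracy
            continue
        if sample["difficulty"] < min_difficulty:      # complexity
            continue
        if any(jaccard(sample["prompt"], k["prompt"]) > max_similarity
               for k in kept):                         # diversity (near-duplicates)
            continue
        kept.append(sample)
    return kept


# Hypothetical samples; "difficulty" is an assumed 1-5 label.
samples = [
    {"prompt": "What is 2 + 2?", "answer": "4", "difficulty": 1},
    {"prompt": "Explain how a KV cache speeds up decoding",
     "answer": "It stores past keys and values.", "difficulty": 3},
    {"prompt": "Explain how a KV cache speeds up decoding please",
     "answer": "It stores past keys and values.", "difficulty": 3},
    {"prompt": "Prove that the sum of two even numbers is even",
     "answer": "", "difficulty": 4},
    {"prompt": "Write a function that merges two sorted lists",
     "answer": "def merge(a, b): ...", "difficulty": 3},
]

# Placeholder accuracy check (assumption): only rejects empty answers.
kept = filter_samples(samples, is_accurate=lambda s: bool(s["answer"].strip()))
for sample in kept:
    print(sample["prompt"])
```

Because the samples interact, the diversity check compares each candidate against what has already been kept, so the result depends on the order of the data, which echoes the point about order made above.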

Seth (50:17)
So it was accuracy, complexity, and what was the third? Diversity. Okay, yeah. Yeah, those are three good ones. So, yeah, the unanswered question is around what makes a good data set a good data set, and does it satisfy those three? I like that a lot.

Maxime Labonne (50:20)
Diversity.

Seth (50:34)
Speaking about gaps: when you're developing a model, and then seeing how the model is actually going to be used by people who don't necessarily know what you were intending. Yeah, man, you learn so much there. It really doesn't start until it's in production. Sometimes you think, okay, I pushed the model, and that's it, but that's really just the beginning.

That's when you get the real feedback, and then you can really start that iteration cycle.

Maxime Labonne (51:08)
Yeah, absolutely.

Like, in this case, you have a human in the loop for evaluation. And it means that, yeah, you need to process this feedback, and you can even really prompt humans to give you the most valuable feedback. I think this is something that is often overlooked. You will just, like, push a model and then users will complain about it. But it's a lot better if you have kind of a framework to communicate about the feedback. So you can communicate about how the model is supposed to be used, and they will communicate about failure modes that are super useful to then produce more data, to train the model, and to evaluate it again.

Seth (51:45)
Yeah, one of the challenges is that, right, you can get this feedback, but can you act and do anything about the feedback and then actually bring it back into the model? So you're saying basically you can in some ways adjust the data, hopefully. Are there other ways that you think about where you can get feedback and then somehow you can work that back in to improve the system?

Maxime Labonne (52:10)
Yeah, you talk about system and I think this is the right way of approaching it. The model is just one part of the system. And usually you create an application that is LLM powered. And the fact that you have an entire application means that you can also pull other levers. For example, the user interface, the way that you interact with the model is really defined by the user interface.

Seth (52:31)
Yep.

Maxime Labonne (52:36)
So if you make a model that is just about translation, for example, you should not have a chatbot interface. Otherwise, people will use it like ChatGPT, which makes sense to them. It's not their fault. So you need to also think about the user interface to guide the users into how you want them to consume the model. This is like another lever that you can use outside of just changing the models.

Seth (52:48)
Right.

Maxime Labonne (53:02)
There are a lot of them. There are generation parameters, for example. Maybe you want something that is a bit more reliable, a bit more consistent, and less crazy. There are also other ways, like adding rules, for example, to pre-process the input from the users. And this is where things also start becoming a lot of fun, because you can pretty much use any solution and try it out.
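
One common generation parameter of the kind mentioned here is the sampling temperature. The sketch below shows, in plain Python, how dividing the logits by a temperature before the softmax makes the next-token distribution sharper (more consistent) or flatter (more varied). The logits are made-up numbers for four hypothetical tokens; this is not tied to any particular model or library API.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [logit / temperature for logit in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for four candidate tokens (assumption, for illustration only).
logits = [2.0, 1.0, 0.5, 0.0]

for temperature in (1.5, 1.0, 0.3):
    probs = softmax_with_temperature(logits, temperature)
    formatted = ", ".join(f"{p:.2f}" for p in probs)
    print(f"temperature={temperature}: [{formatted}]")
```

At a low temperature, nearly all the probability mass lands on the highest-scoring token, which is the "more reliable, less crazy" behavior. It is one lever among several, alongside input pre-processing rules and the user interface itself.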

Seth (53:29)
Yeah, I was just having a conversation with someone about UI and UX, because frontier models are getting so good, and it's almost becoming a commodity, you know, in a lot of ways, that it's the UI/UX where you really can make a difference. And it is extremely hard to create an intuitive interface for someone to prompt, you know, to use your model, and for them to get the utility that they're trying to get from it and to maximize that utility. So I'm glad that you brought that up, because that's something that's been very top of mind for me.

Maxime Labonne (54:15)
I think for a lot of people who are used to engineering in general, this is not new, right? This is very, very traditional. Yeah, this is any software in the world ever. It's just that we discover it bit by bit out of necessity. And I think that it's not just that. There are also really interesting ways of providing these UIs that we need to discover.

Seth (54:20)
It's true.

Maxime Labonne (54:40)
The chatbot interface is good for a particular type of application, and that was the breakthrough behind ChatGPT. It was never the model. It was really the interface behind it. And I think this creativity can also be applied to other domains and applications.

Seth (54:49)
Right.

Yeah, it's very exciting to see what those new interfaces are going to look like. Yeah, like, you know, we don't have to limit ourselves to it just being a chat box and just being a conversation. There's so much more. I think that there's going to be more embedded into your application, systems that are changing what screen you're on in an application, but even that's probably limited in how I'm thinking about it. I think that's a very exciting space to be in.

And I think that, you know, you're right in the sense that for traditional software engineering, it's like, yeah, the UI/UX is always, you know, half of it. But there's this idea that the LLM, or the agent, or whatever you want to call it, is some silver bullet and it's going to be able to solve everything. But it can't, not yet. It can't solve, you know, the UI/UX. It can help you design a good one and it can give you some good ideas for it, but the rubber meets the road when you actually have real users interacting with your system. Then, you know, you learn so much about what's really going on.

Maxime Labonne (56:06)
In my experience, a lot of people complain about LLMs. For example, I don't know, you have a website, an administrative website, and now it has a chatbot and you can chat with an AI about your problems. And people say, yeah, but that is not helpful at all. I don't know, we have probably all had this experience of interacting with an LLM through a very clunky interface.

And that's horrible, but that's not necessarily horrible because of the model itself. That's horrible because of the way that it is used in this precise instance. And I really believe in a future where this is going to be very boring and will be automated behind the scenes. And you will never think, this is an LLM because I see a chatbot. No, the LLM will be there in the backend. It's just that it's going to help you process and really focus on what really matters instead of

Seth (56:38)
Right.

Maxime Labonne (57:02)
all this administrative stuff and like a very technical process that is not the best use of our time.

Seth (57:10)
Yeah, yeah, absolutely. Man, I feel like we're just heating up, but I have to get towards the last set of questions. Being on Learning from Machine Learning, I have to ask you: what has a career in machine learning taught you about life?

Maxime Labonne (57:28)
Yes, it taught me, I think, mostly things related to how I think about learning, the way that I approach, for example, learning a new language. This has really biased my perspective, and I'm a bit ashamed of talking about it here, knowing that there are people who are real experts in this field and who will say, like, no, this doesn't work like an LLM at all. But to me at least, and I don't want to spread fake news, it's really useful to think about learning something new in terms of data quality and quantity. So really being exposed to a lot of tokens. And that's particularly true if you're thinking about languages.

A language should not, to me, be learned just through textbooks or Duolingo, but also through a lot of other means, like watching series, playing video games in other languages, really having more exposure to the culture in general. And this is just, like, increasing our number of tokens. And I don't think that what I say is particularly crazy, right? And also in terms of quality, because of course you can have a ton of tokens and be exposed to a ton of stuff, but you also need to be able to focus, you need to be able to select, and you need to be exposed to the right level of complexity for your level.

So I think that this framework is now very instilled in my mind and in everything that I learn. It can be a new library, it can be a new language. This is something that I naturally think about, or not even think about; this is something that has become, like, automatic in my brain. Like, yeah, I need to really be exposed to this a lot. I need to really be able to find good quality resources about it. And this is to me the most efficient way of learning something, and something I also enjoy doing as a process.

Seth (59:27)
Very cool. Yeah, no, I mean, it's a great parallel. I think the point that you're bringing up is that you need lots of different experiences, you know, not just one particular modality even, and being able to see things. Being fully immersed in something also helps. While you might be able to get a lot from, you know, a structured lesson or a textbook, there's also the number of times that you're exposed and you're experiencing something new. So yeah, I definitely can relate to that. And then just, yeah, the last one: is there any way for listeners who want to learn more about you or the work that you're doing? Where would you direct them?

Maxime Labonne (1:00:08)
Yes, so my two main communication channels are probably X, at Maxime Labonne, and also LinkedIn, at Maxime Labonne. So those are the two main websites where I post regularly, and this is also where I link all the content that I create on GitHub, Hugging Face, and so on. If you're interested in knowing more about this, please check it out.

Seth (1:00:38)
Awesome. Yeah, I can say that your content is some of the best. That's why it was such a pleasure to have you here. I think that you create such high quality stuff, so thoughtful, and it's like the real take. I really appreciate the work that you do from the open source stuff, but also just how you view the things that are going on.

Maxime Labonne (1:00:44)
Thanks a lot.

Seth (1:01:05)
in this rapidly evolving field. To have a mind like yours, and you being able to share your thoughts on it, is something very much appreciated by this machine learning scientist. So thank you. Yeah, you got it. And thank you so much for the time. It was a real pleasure. Thank you, Maxime.

Maxime Labonne (1:01:17)
Thanks a lot.

Yeah, thank you for the invitation. It was amazing. Thank you.

Seth (1:01:27)
Thanks.