Seth (00:02.29)
Hello and welcome to Learning from Machine Learning. On this episode, we have a very special guest, Lewis Tunstall. He's currently a machine learning engineer at Hugging Face, the co-author of the best-selling book Natural Language Processing with Transformers, and a co-author of one of my favorite papers in NLP, Efficient Few-Shot Learning with Sentence Transformers, along with the SetFit library.
His current work focuses on developing tools for the NLP community and teaching people to use them effectively. He's currently focusing on reinforcement learning with human feedback, and I'm looking forward to jumping into that. Lewis, welcome to the show.
Lewis (00:40.59)
Thanks so much for having me, Seth. It's a pleasure to be here.
Seth (00:43.482)
It's great to have you. Why don't you give our listeners a little bit of background and what initially attracted you to machine learning.
Lewis (00:52.182)
Yeah, sure. So it's a bit of a long story. I'll try to keep it short. But basically I used to be a physicist, and in particular I used to be a theoretical physicist. So I was working in a very nerdy domain called quantum field theory, where you try to develop mathematical theories to describe subatomic interactions. And at the time when I was doing research in this, we had a big search for a particle called the Higgs boson at the Large Hadron Collider.
And the way you try to detect these particles in these experiments is you try to have a clever way of distinguishing signal from background. So you've got a huge amount of collisions happening and you're trying to work out how do I find the particular signature of say the Higgs boson. And the thing is that physicists are never really happy with the knowledge that we know, we're always trying to find new things. And so the hot topic at the time was to try and find some evidence of like physics beyond the standard model.
And a couple of my colleagues had spent many years developing algorithms to try and detect, you know, evidence of some new physics. And at the end of my postdoc, I was kind of deciding, you know, should I stay in academia or should I try and take a look at what's happening in industry? And one of my friends, he said, hey, Lewis, come check out this code I wrote. And essentially it was like a hundred lines of TensorFlow 1 doing this kind of algorithmic detection of signal versus background.
And it had this peculiar property that it just crushed all the previous work of like physicists for like 20 years or so, um, on various benchmarks, uh, for classification. And for me, this was my first kind of direct encounter with machine learning. And I thought, okay, this seems pretty powerful. Um, I should probably pay attention. And, uh, more or less the next day, I, um, I started looking a bit at like Python. I'd never really programmed before.
Seth (02:38.468)
Yeah.
Lewis (02:45.254)
And so I took a look at Python. I spun up the TensorFlow docs, which in those days is like, it was pretty hard to get started in deep learning. And then it kind of started from there. And the sort of first step I had into doing machine learning was doing a Kaggle competition with a couple of physics friends. And this really got me hooked. I found it totally fascinating that you could, you know, learn from data and, you know, iterate with different types of algorithms.
Seth (02:45.401)
Oh, wow.
Lewis (03:13.718)
And this kind of made me take a step into industry. And yeah, I kind of never looked back after that, but it was very much a chance encounter. I think if I had never seen this hundred lines of TensorFlow, I probably would still be doing physics.
Seth (03:29.232)
Yeah, still with the-
Hadron Collider, that's amazing. What a great story. I think it's fascinating that you can take machine learning, apply it to so many different fields, and see these sorts of breakthroughs, right? Back in like 2012, there was the big AlexNet moment, where people had been trying this problem for so long, trying to do it in a certain way. And then there was this thing that came along and it just like halved error rates, right? Or there's like AlphaGo that came along,
Seth (04:01.304)
in prediction in a way that had never been done before. So it's very cool to hear that it applied also to quantum physics and you were able to kind of get that exposure. Cool, so yeah, working with TensorFlow and then working with Python. So you started with Python a little bit later on in your career, how did that go?
Lewis (04:25.474)
Yeah, yeah, it was, I think I'm a bit like a granddad in that sense, like, you know, everyone at Hugging Face has been coding since they were like seven, kind of thing, and I'm the guy who started when he was like 28. So, the reason for that is kind of funny. So in undergrad, I did a course in computational physics and the language of choice was Fortran, and this traumatized me so hard that I literally didn't pick up any code again for like almost 10 years after that.
Seth (04:46.897)
Oh god.
Lewis (04:55.322)
And I think I had a sort of, let's say, perception that coding was a little bit too mundane and that the only really interesting problems are the ones that you can solve with a pen and paper in your mind. And machine learning kind of changed that perception for me big time. And learning Python at the start, I think was kind of tough because in those days, this is like 2015 or so, 2016.
Seth (05:07.908)
Yeah.
Lewis (05:20.858)
There weren't a ton of like materials to like get started in ML. So you had Andrew Ng's amazing Coursera course, but that was written I think in MATLAB or something. Octave, that's right.
Seth (05:27.271)
Yeah.
Seth (05:30.774)
Octave. Yeah, which is the same as Matlab. Yeah, but I remember it. I remember it well. I'm a Matlab fan. Just, yeah.
Lewis (05:38.174)
Yeah, yeah, yeah. And so I remember, um, almost doing all the inefficient things. Like I was like, hey, I need to learn Python, and being a physicist, you try to do everything like from scratch. So you say, okay, before I do any ML, I need to learn Python. So I remember like looking at like the sort of docs of Python and trying to like learn Python from reading the docs. And that's kind of an inefficient way of doing things. Right. And, um,
What changed for me quite, I would say, big time in accelerating things was stumbling into the fast.ai course by Jeremy Howard. And yeah, exactly. And this course, I think, was quite transformative in my education because he really emphasizes learning by doing, and lesson one is like, okay, here's a conv net, we're gonna do fine-tuning and you're gonna get like a really nice model at the end of this.
Seth (06:12.638)
Oh wow, yeah, he's amazing, right Jeremy Howard?
Lewis (06:31.75)
And all of a sudden it changed the way from like trying to learn things for the sake of learning them to learning them to solve a specific problem. And I think that that, that for me was like a really big step and made everything much easier after that. But yeah, learning from scratch is, is a tough thing, especially when you're not used to it.
Seth (06:50.062)
Yeah, absolutely. I mean, the important thing, I think, when you're trying to solve any sort of complex problem is understanding the problem space, you know, getting like a really deep understanding of it, figuring out what potential solutions are. Coding is a piece of it, right? Like, but it's not, it's not everything. You need to be able to think about it in a certain way. And I'm sure having a PhD, you know, in your background with quantum physics and everything, you were able to think deeply, you know, about these types of problems.
So, I feel like you took sort of an initiative, right, where you wanted to get into Transformers. What was your initial foray into Transformers? How did you first get exposed to it?
Lewis (07:34.794)
Yeah, so this is, it seems again, the story is gonna sound similar to the physics one. So in Switzerland, we have this conference every year called AMLD, so the Applied Machine Learning Days Conference. And this is a fairly big conference where it tries to bridge industry and research. And I had attended a couple of times and it was on this particular year, I think it was around 2018.
Seth (07:38.942)
That's fine.
Lewis (08:01.206)
when Jakob Uszkoreit, who's one of the authors of the Attention Is All You Need paper, came to present essentially the Transformers paper. And I remember this auditorium, it was like one of those side sessions, not the plenary, but it was jam-packed. I mean, there were people sitting on the floor, queuing outside. And I remember at the time, I was kind of ignorant because I wasn't working on NLP. I was doing more kind of classical data science, so things that were more tabular-based
and time series based. And again, I was like, okay, I've heard of this paper, there's a guy talking about it, I should probably go check it out. And he gave this amazing talk about like, how transformers were sort of significantly different from LSTMs and RNNs. And yeah, again, it was like one of those like sort of epiphanies where you're like, okay, this seems important, I should probably pay attention to it. And it just so happened that at the startup I was working at
Seth (08:53.651)
Right.
Lewis (09:01.474)
we had started taking on a project doing question answering. And this was really in like the early days where you only had PyTorch pre-trained BERT, I think was the name of the Hugging Face library. And this was like a super, you know, bare bones, like library, you had a couple of scripts. And I remember doing this like extractive question answering task on this data set of clinical notes. And it really worked. I mean, it was really impressive.
Seth (09:11.45)
Okay, the original, yeah, like that was the original library, yeah.
Lewis (09:31.398)
And it was a cool problem because it was one where I couldn't make a good baseline. So a lot of the things that you do in data science are often about just being pragmatic. So just pick the simplest algorithm. You know, typically naive Bayes will get you very far in NLP, but extractive question answering is quite hard. You know, regex will get you a certain distance, but at some point it gets tricky. So we just ran this through this like PyTorch pre-trained BERT and it really gave great results. And again, I was like, okay, this is super
cool technology, but I have no idea what's going on, right? I just like did, you know, Python run question answering. And so when I started digging into the code, it was really quite alien because there's all these concepts like attention, self-attention, transformer blocks. And so again, I was like, all right, let's go look at the paper. Surely that was gonna be explained. You look at the paper and you've got this like amazing architecture diagram of the transformer.
Seth (10:01.755)
Hehehe
Right, right.
Lewis (10:25.674)
And I had zero idea. Like I was like, I can't even read this image. And so then you're like hunting around for like things to learn from. And again, in 2018, there were basically two main references. There was Jay Alammar's amazing blog, which was a series of blog posts.
Seth (10:41.214)
Amazing. Love it. That was one of my, for me too, that was my intro to Transformers really and I really appreciate the work that he's done. It's amazing. Yes. Continue though.
Lewis (10:52.51)
Exactly, and so that was like at the sort of conceptual level. And then you had at the other extreme, Sasha Rush, he had done The Annotated Transformer, which is like kind of like a line by line implementation of the paper. And I remember reading both of these and going, okay, cool, I've got a bit of an understanding, but I really want to know this question answering business, like what's going on here, right? And so I felt that there was this kind of gap between like the sort of, let's say, academic low-level,
Seth (11:00.141)
yes
Lewis (11:20.662)
you know, implementation to the conceptual. And so one of the things I had learned from being a physicist was that if you try to teach stuff, that's often a very powerful forcing function to learn. And so I kind of joked to my colleague, Leandro, who is the co-author of the book. I said, hey, I'm gonna write a book about transformers because there's like a gap in the literature. Do you wanna do it with me? And he's like, oh, but we don't really know anything about transformers.
Seth (11:33.094)
Definitely.
Lewis (11:49.318)
And I was like, she'll be right. We'll figure it out. And that was kind of the start of this journey towards trying to write a book and using this as a kind of lever to deepen our knowledge.
Seth (12:02.31)
Very cool. So yeah, that was something that I found fascinating looking into you and, you know, some of your past, is that you started writing that book and you weren't working at Hugging Face at the time. Yeah, so
Lewis (12:13.334)
Yeah, that's right. Yeah, there's a story there if you want, I can tell you, which is that my wife was like, okay, Lewis, every day you're talking about Hugging Face and Transformers, because I was doing the book, right? And she said, maybe it's good you chat with someone there to just make sure they're not writing a book, because it would be kind of sad if you do all this work, and then they go, ta-da, here's the Hugging Face officially sanctioned book. And so Leandro and I just cold emailed Tom Wolf,
Seth (12:37.597)
Right.
Lewis (12:42.798)
who's one of the co-founders of the company. And it was one of those like moonshots where you say, no way is this guy gonna reply. He's super famous, super busy. And to our great surprise, he replied a day or two later and said, oh, cool, sounds interesting. Let's have a chat. And we shared a few draft chapters and he was quite happy with them. And so then it became more of an official collaboration. And we then submitted to O'Reilly who then accepted us.
Seth (12:44.027)
Yeah.
Lewis (13:09.994)
And yeah, it was like maybe six months or seven months we worked together on the book. And then when Hugging Face raised the Series B to sort of grow the company, that's when we both joined.
Seth (13:22.462)
That's amazing. What a great story. So I guess that's one way to get a job, write a book about. Yes. And then there's an updated edition of the book, right? So there was a first edition and a second edition.
Lewis (13:30.628)
Longest interview process ever.
Lewis (13:40.626)
Yeah, there's a revised edition, which I think has color. That's the main distinguishing factor. And I think it has a few corrections for some of the errata that people picked up. But this was again one of those things where we were sort of discussing with O'Reilly, could we have our book in color, because so many of these images depend on it. And then they were like, well, only fast.ai gets to do that. And then many, many people in the community were like really asking for color images. And then they...
Seth (13:45.867)
Okay.
Seth (13:51.677)
Right.
Seth (14:03.262)
Hehehe
Seth (14:09.794)
Oh, that's cool. I didn't realize that. And they gave you color on the cover also, which was nice of them. Um, yeah, which is very apt. Yes. With, uh, the stochastic parrots paper and you know, everyone's tendency to think that large language models are like stochastic parrots, which, you know, that's a, that's a large, that's a larger conversation. Uh,
Lewis (14:10.446)
Thank you for doing that.
Lewis (14:15.894)
Yeah, and a parrot, which is also very apt.
Lewis (14:36.907)
I mean, well, to be honest, I think, I don't know if the illustrator knew this, right? So I mean, just for viewers, this is what it looks like. So you've got this beautiful lorikeet. And in the last chapter of the book, this was the one that Tom Wolf wrote. He said, I want to be ambitious. I want to pre-train a language model. And we were trying to find what's an interesting domain
Seth (14:41.733)
Oh really?
Lewis (15:01.866)
that is not just pre-training BERT or something. So that's where we settled on code generation. And again, we were trying to think of a name, right? And I think it was Tom who said, oh, I'm gonna call it CodeParrot. And then, I suppose the illustrator read that chapter and was like, okay, there's a thing about parrots, but it just perfectly happened to line up with the stochastic parrots paper, so.
Seth (15:14.834)
Hehehehe
Seth (15:25.71)
It's the perfect cover. I mean, it's the perfect animal to have on there. Yeah, that's awesome. So let's talk about Hugging Face and Transformers. Obviously, Transformers is a piece of it, but Hugging Face is really like an ecosystem. Do you mind? Can you go into it a little bit?
Lewis (15:48.682)
Yeah, sure. So I'm going to assume maybe most people have heard of Hugging Face, but let's assume that there are some who haven't. So Hugging Face is an open source company. And what that means is we write software, open source software, that anyone can use for free. And this software spans everything from Transformers, which is the sort of flagship library we have, to data sets, to accelerated training or distributed training.
to things like inference with a library we have called Text Generation Inference. And by themselves, these libraries would be already quite useful for people. But the thing that makes it really an ecosystem is we have the Hugging Face Hub, which basically lets the community share data sets, models, and demos with each other. And this gives you this kind of nice feedback loop where people will, for example, upload a new data set, people train models on that data set,
and then people build very cool interactive demos using those models, and then the cycle repeats. And so our kind of like ecosystem spans, originally it was really NLP focused, but with the kind of, I would say dominance of transformers into other domains like computer vision and speech and eventually video, we've kind of grown the kind of coverage of the hub. And so now the hub is really kind of agnostic to the sort of library
or modality and the way that we often try to explain this is it's kind of like the GitHub of ML. So in the same way that GitHub enables software engineers to collaborate, the kind of vision for Hugging Face is to have the hub be the kind of place where machine learning engineers and practitioners can collaborate.
Seth (17:31.398)
Absolutely. And yeah, there are always new contributions that just make Hugging Face even better, which is so cool. You know, the model hub, the different tasks that you can look at, the way things are sorted, the way that you can look at benchmarks for models. Um,
The good models have, you know, detailed model cards that you can look at to understand the limitations of the model and how you should use it. I've also experimented with Hugging Face Spaces, which is really nice. And I know that you can actually have like inference endpoints and things like that. So it's, um.
Machine learning, there's so many moving parts, right? So it's really great to have a resource, a place, a hub, where you can kind of have all of this information in one place. This particular data set, you can connect it with, you know, this, you could fine tune it for this particular model. I've been able to use things like out of the box, and I've been able to use data sets, combine it with other data sets. I mean, what Hugging Face has done, it's created a foundation
for the field, right? It's given people the ability to do things that they could never do if they were on their own. I think it's a really amazing testament to what open source can do. I know that that's like, I'm preaching to the choir, you know, but it's in this environment that we're in, you know, where.
Lewis (19:03.066)
Hahaha.
Seth (19:11.054)
OpenAI is actually not really open and they're very closed. And Google once was very open and now some of their stuff is closed. Facebook, or Meta, tends to be more open. But Hugging Face has really taken the stance like, it's an open source community. And then look at all the amazing things that you can do with it.
And yeah, I mean, sometimes like there'll be like a little bit of a lag, like let's say like ChatGPT, obviously that took the world by storm. I mean, I don't think it took that many people that are like deeply entrenched in the NLP world by storm, because they could do the prompting and, you know, it was like, oh, it was a nice interface that they created. But I don't want to take anything away. It's amazing. Of course, what they did with InstructGPT is incredible. Um, but now Hugging Face has HuggingChat, right? And you can be trying out different models there,
interacting with different chatbots. So I commend the work. I appreciate it. It has, I mean, Hugging Face has allowed me to do so many incredible things. I mean, being able to take, you know, DistilBERT, being able to take any of these models and fine-tune them on the data sets that I'm working with. It's great work and it's helping a lot of people.
Lewis (20:30.168)
Yeah.
Yeah, and I think the thing to emphasize is this is really only possible because we have like a community of amazing individuals. So people who are like the tinkerers, but you've also got companies like Meta and also Google who have, you know, a good sort of emphasis on trying to make much of their research accessible. And so this kind of like dual mode where you have, let's say, Meta releasing the Llama models, which kind of really, I think, took the open source world by storm.
And then you've got all this collective intelligence of, you know, thousands and thousands of people trying to tinker with chatbots on their laptops. You see this amazing loop. And for me, I think the LLM wave that we're currently in has been a really powerful testament to that. So previously we had this kind of mode where all these different labs were like releasing different kinds of transformers, you know, some for multilingual, some for computer vision and stuff. And then once
the sort of massive focus switched to LLMs, you're now seeing a large amount of innovation happening just on the LLM side. And for me, like the very cool one that happened recently was when Llama 1 was released, it had this kind of limited context of, you know, 2048 tokens. And this is kind of limiting if you wanna do, you know, summarization or some other tasks. And like literally random people on Reddit
figured out that there are these like little hacks you can do to the embeddings to sort of dramatically increase the context from, you know, 2,000 tokens to 8,000 to 16,000, and that kind of insight then fed into Llama 2. So it's kind of like this, you know, beautiful feedback loop, and in some sense one of the great things of being at Hugging Face is you're somewhere in the middle, right? You're not really the ones training the models per se, but you're more
Seth (22:12.027)
Right.
Lewis (22:26.478)
the platform that enables the community to build together. And that feels very, very cool and empowering.
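The conversation does not name the context-length trick that came out of the community. One commonly cited candidate, assumed here purely for illustration, is linear position interpolation on rotary position embeddings: positions are scaled down so that a longer sequence maps back into the range the model saw during training. The function name, `dim`, and `base` below are hypothetical choices for this sketch, not details from the episode.

```python
def rotary_angles(position, dim=8, base=10000.0, scale=1.0):
    """Return the rotary angles for one token position.

    scale < 1.0 squeezes positions together (linear position interpolation).
    dim and base are illustrative defaults, not taken from the conversation.
    """
    scaled_position = position * scale
    return [scaled_position / (base ** (2 * i / dim)) for i in range(dim // 2)]


TRAIN_CONTEXT = 2048   # context length the model was trained with
TARGET_CONTEXT = 8192  # longer context we would like to use
scale = TRAIN_CONTEXT / TARGET_CONTEXT  # 0.25

# Without scaling, the last position produces angles far outside the trained range.
unscaled_edge = rotary_angles(TARGET_CONTEXT - 1)
# With scaling, the same position is mapped back inside the trained range.
scaled_edge = rotary_angles(TARGET_CONTEXT - 1, scale=scale)
```

The design idea is that the model only ever sees position values it already knows how to handle, at the cost of packing neighbouring tokens closer together in position space.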
Seth (22:34.894)
Yeah, absolutely. Yeah, I mean, I made it clear, but I'm a big fan. I love the work that you guys are doing.
So having a co-author of one of my favorite libraries, SetFit, here. I'm not sure how well known it is, you know, but I think it should be more well known because it solves a really interesting problem, for me at least. You know, it's nice when you have a data set like on Kaggle, you know, like you have a data set and everything, but in industry, you never have that data set. You know, you have the data, but you never have the labels or anything.
Lewis (23:08.366)
Mm-hmm.
Seth (23:15.36)
And SetFit, which is Efficient Few-Shot Learning with Sentence Transformers, the concept I find to be fascinating, and how well it works. So yeah, having one of the co-authors here, I'd love to have you just, you know, give me some info on SetFit. What do you think makes it work so well? What were some interesting things that happened, you know, as you were creating it?
or helping, helping work on it. Yeah.
Lewis (23:45.142)
Yeah, sure. Yeah, so as you say, right, this is, I think a problem of dealing with limited labels is something that if you're not in industry, you don't quite appreciate. So if you're an academic, you're really used to just working with these kind of conventional data sets that have 50,000 labeled examples, and you just train all your models on this.
And, you know, for people like you, and also me in my previous job as a data scientist, this was like the opposite. It's like, oh damn, I'm so jealous of these academics. I've only got like 16 labeled examples. What am I going to do with this? And so some of the early work I did at Hugging Face was around model evaluation and trying to figure out, you know, how could we enable the community to evaluate models across different domains and different
tasks. And we worked on a project with a company called Ought, who were interested in probing the few-shot capabilities of language models. And the way this is kind of conventionally done was you would provide a bunch of, let's say, prompts to different types of models like GPT-3, and then you would see the ability of the model to kind of complete the prompt with the task. So you might
show some kind of examples to give the model context, and then give it the actual question you want, which is, I don't know, categorize this list of planets or something, and then the model completes it, and if it gets it right, it gets a positive score. And what Ought identified was that many of these benchmarks were essentially gamed. So people, because they had access to all the labels,
they could always, even if the papers never did this, sort of cherry-pick and tune the prompt. So you could always do a bit of gamification to saturate the benchmarks. So what they developed was something kind of like Kaggle, where when you do a Kaggle competition, you submit your predictions during the competition and you will see your performance on like a kind of public leaderboard, but there's a completely separate test set, which is held out until the end of the competition. And only at the very end is this then revealed, and then
Lewis (26:04.714)
the final rankings are made. And so what we developed was a benchmark called RAFT, which essentially has the same setup, where you submit the predictions of your model on a very limited number of samples. So we're talking about maybe 16 to 20 examples. And it's a classification benchmark. So you're trying to basically predict, you know, positive, negative, or multiple choice type questions. And then at the end, we evaluate all this on a hidden test set. And then there's a leaderboard, which then
Seth (26:07.101)
Right.
Lewis (26:34.322)
shows that ranking. And so when we released this benchmark, I think the top model was GPT-3, or I think maybe even one of the instruct models. And then the community got excited about trying to figure out, okay, how can we come up with better methods? And so there was a very impactful paper by Colin Raffel and others where they used basically a T5 transformer to do very efficient few-shot learning.
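The few-shot prompting setup described earlier in this answer (a handful of worked examples, then the real question, then a score for a correct completion) can be sketched roughly as follows. This is a minimal illustration and not the RAFT harness: the `Input:`/`Label:` template, both function names, and the toy planet data are all assumptions made for the example.

```python
def build_few_shot_prompt(demos, query):
    """Concatenate labeled demonstrations, then leave the query's label blank."""
    blocks = [f"Input: {text}\nLabel: {label}\n" for text, label in demos]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n".join(blocks)


def exact_match_score(predictions, references):
    """Fraction of completions that match the reference label, ignoring case and spaces."""
    correct = sum(
        pred.strip().lower() == ref.strip().lower()
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


# Toy demonstrations, invented for illustration only.
demos = [("Mars", "planet"), ("Sirius", "star")]
prompt = build_few_shot_prompt(demos, "Venus")

# In a held-out setup, the references for the final score stay hidden from
# whoever writes the prompt, so the prompt cannot be tuned against them.
score = exact_match_score([" Planet", "star"], ["planet", "moon"])
```

Keeping the scoring labels out of reach of the prompt author is what removes the cherry-picking described above.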
Seth (26:50.835)
Right.
Lewis (27:01.914)
And the kind of drawback of this is that you have to kind of construct the prompts. So there's a lot of prompt engineering involved to match it to the task. And so then at Intel, there was one of my co-authors, Moshe Wasserblat, and he just on LinkedIn one day was like, oh, hey, I came up with this algorithm and it's like at the top of this RAFT benchmark, and it got like a ton of attention. And since he knew...
Seth (27:07.517)
Right.
Lewis (27:29.646)
Niels Reimers from Sentence Transformers. He said, oh, you know, maybe we can explore, you know, going beyond just this, you know, single example. And so that was the start of this collaboration. And to give you a kind of an idea of what this is about, essentially Sentence Transformers are a clever way of adapting pre-trained transformer models to come up with very rich embeddings or representations of text.
Seth (27:31.974)
Yes.
Lewis (27:56.666)
And the way they do this typically is using a method called contrastive learning, where you essentially try to teach the model how to distinguish between positive and negative classes. And positive and negative isn't just sentiment, it's kind of like, you might be talking about categories, you're trying to figure out how to distinguish different categories. And the results of this are now embeddings that typically capture far more semantic structure than just taking the kind of
base embeddings of like BERT or something like this. And so what we did was we took these Sentence Transformer models, which are already very good. And then the idea was that you would essentially do a further round of fine-tuning to adapt the embeddings of these models to learn essentially a representation that matches the very limited number of samples you have. And so you can imagine if I'm doing like sentiment analysis, imagine I've only got, let's say,
10 examples per class or something. What I can do is I can provide those positive examples to the model with the label, and the negative examples, and do essentially contrastive fine-tuning. And what this does, when you look at the sort of embedding space, is it makes the clusters of these embeddings start to kind of separate. And now you've got a very good decision boundary where you can put on a linear classifier and it will then do very well
Seth (29:16.338)
Yep.
Lewis (29:25.446)
compared to the full fine-tuning run. And so in a nutshell, SetFit is just an adaptation of Sentence Transformers plus a linear classifier. And it's a super simple idea, but it works remarkably well. And when we published our paper, I think we were state of the art or close to state of the art against models that were much, much larger. So the T5 model was 11 billion parameters. And I think we were close to matching performance with like a few hundred million. So.
Seth (29:46.334)
Oh yeah.
Seth (29:54.299)
Yeah.
Lewis (29:55.579)
for deployment reasons and you can train it on your laptop, right, which is for me the super cool thing.
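The first step of the recipe just described, turning a handful of labeled examples into contrastive training data, can be sketched in a few lines. This is a simplified illustration and not the SetFit library's implementation: it enumerates every pair instead of sampling a fixed number per class, and the function name and toy sentences are made up for the example. Fine-tuning the embedding model on these triplets and fitting a linear classifier on the resulting embeddings would follow as separate steps.

```python
from itertools import combinations


def make_contrastive_triplets(examples):
    """Turn (text, label) examples into (text_a, text_b, target) triplets.

    target is 1 when both texts share a label (pull together)
    and 0 when the labels differ (push apart).
    """
    triplets = []
    for (text_a, label_a), (text_b, label_b) in combinations(examples, 2):
        triplets.append((text_a, text_b, 1 if label_a == label_b else 0))
    return triplets


# Toy few-shot data, invented for illustration only.
few_shot = [
    ("great film", "positive"),
    ("loved it", "positive"),
    ("terrible plot", "negative"),
    ("waste of time", "negative"),
]
triplets = make_contrastive_triplets(few_shot)
```

Because pairs grow roughly quadratically with the number of examples, even a few labels per class yield a useful amount of training signal for the embedding model.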
Seth (29:59.098)
Yep. That's my favorite part. So, well, I'll go into a couple of things. So, you know, when I came across SetFit, I think the reason it clicked is what you mentioned. Obviously there are so many complex things happening, but when you think about it in plain terms, you're just taking a set of data, you're encoding it, embedding it, putting it into an embedding space, creating these
negative triplets and positive triplets. And the way I imagine it in my brain, it's like they're magnets, right? And the positive triplets are bringing it closer and the negative are pushing it apart, and then you are fine-tuning an embedding. And because you're doing that and you're pushing what's more similar closer and what's different away, I mean, this is just contrastive learning, I guess, but then you can create that decision boundary. And as you were saying, the nice thing is that you can train it
on your laptop, depending on how many data samples you have. And if you're on Colab, like it's a
matter of minutes, right? Like I've had iteration cycles where I will train the model, run it on tens of thousands of samples, and it's in like 5, 10 minutes. And then I can see what classes aren't performing well or whatever, and just inject some new data points, inject a new class, remove a class, and just train it again. And the iteration cycle is unbelievable. It's also helped me get initial labels for things.
If I'm starting off on a problem, it's like a way for me to bootstrap, sort of, and get some initial labels. So it's really nice. And I think that.
Seth (31:46.534)
You know, in the industry right now, it's very interesting, right? There are extremely powerful models, right? Like there's no question. GPT-4 is a very powerful model. It can do unbelievable things, but there's only one way right now that I know of to get access to it, and that's through an API call. And that creates this dependency. It creates latency. Um, and
you know, you don't have full control over your stack in a way. Like OpenAI is down, Azure is down. You know what, you're not going to make predictions? Like that's not going to work. You know, it's not going to cut it. Your business partners aren't going to be okay with that. When you have something like SetFit or a fine-tuned DistilBERT model or RoBERTa or whatever, like you have it. It's yours.
Lewis (32:25.943)
Thank you.
Seth (32:44.596)
can make it as.
Like you know the trade-offs, right? So you know, okay, I need something that's 400 megs, not 40 gigs, right? I need something that can actually run on my laptop or I can run it, you know, in a sense, or let's say I want to create three or four models, right? So by having this ability to iterate so fast, by having more control over this ability to
Seth (33:17.824)
really more quickly get something that will provide value for the people that you're working with. I'd say it's creating meaningful text classifiers for people. That's what a lot of people are looking for. SetFit, either just getting that into production or using it to help me get more labels, has been immensely helpful. So once again, thank you.
Lewis (33:29.56)
Yeah.
Lewis (33:44.482)
Glad to hear it. We have one user. Ha ha.
Seth (33:48.842)
One happy user. So there you go. And yeah, the cool thing is that it's contrastive learning, it's fine-tuning, and then the head is just a linear classifier. It's amazing how well it works.
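The data-preparation step behind that recipe can be sketched in a few lines of Python. This is a minimal illustration, not the SetFit library's code: the function name `make_contrastive_pairs` and the toy data are assumptions, and it builds labeled pairs rather than triplets to keep the example short. Examples sharing a label get a target of 1.0 (pull together), examples with different labels get 0.0 (push apart); an embedding model would then be fine-tuned on these pairs before a linear classifier is fit on the resulting embeddings.

```python
from itertools import combinations


def make_contrastive_pairs(texts, labels):
    """Build (text_a, text_b, target) pairs from a few labeled examples.

    target is 1.0 when both texts share a label (pull together)
    and 0.0 when the labels differ (push apart).
    """
    pairs = []
    for (text_a, label_a), (text_b, label_b) in combinations(zip(texts, labels), 2):
        target = 1.0 if label_a == label_b else 0.0
        pairs.append((text_a, text_b, target))
    return pairs


# Hypothetical toy data: two classes, two examples each.
texts = ["great movie", "loved it", "terrible plot", "boring film"]
labels = ["pos", "pos", "neg", "neg"]

pairs = make_contrastive_pairs(texts, labels)
positives = [p for p in pairs if p[2] == 1.0]
negatives = [p for p in pairs if p[2] == 0.0]
print(len(pairs), len(positives), len(negatives))
```

Even a handful of labeled examples yields many training pairs, which is one plausible reason the approach holds up in the few-shot setting.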
Lewis (33:50.178)
Yeah, one happy user.
Seth (34:11.214)
I have a lot of ideas with SetFit that I know I've spoken to you offline about, but what's one of the challenges that you faced? And then if you had more time to work on SetFit, what would you do?
Lewis (34:25.61)
Yeah, so I think one of the challenges we found was that there are some data sets which are intrinsically more difficult than others to classify. And I think we discussed this offline, but if you take, I think, the AG News data set, you've got these different news categories. You're trying to cluster things by category, and typically, when you've got these kind of overlapping, semantically similar documents,
the classifier struggles to differentiate these. And so your performance, I mean, it will still be better than, you know, your random baseline, but it's hard to get to that kind of, let's say state of the art level that you get from, you know, really training on like hundreds of thousands of examples. So that was, I think one challenge which we never fully were able to, you know, resolve is like, how do you deal with difficult data sets?
So yeah, that's one side. The other one is like when you've got fine grained categories, tends to be quite a challenge. So we had various users who were saying, hey, I'm trying to do a multi-label classification and I've got a hundred categories and I've got very sparse number of examples, you know, maybe one category has two examples.
And this is a hard problem in general, but it's one where again, the SetFit method helps, but doesn't get you to the point maybe where you can actually use it in production. And then the other one is like, if I had more time to work on this, I think a very interesting extension is like, could you make it work for token classification? Because this is again, the classic domain where you don't have many labels.
Um, and labeling, I don't know if you've ever done like entity recognition labeling, but it's very painful. You have like, you know, these UIs. Yeah. And in my previous job, like we, we had worked on this before and it was very hard because you needed domain experts to like annotate segments of documents and they don't like it really much. And you know, training on that was difficult. Um, but we never could quite figure out how to crack the token classification case.
Seth (36:09.89)
Yeah, it's anything where you have to like mark the spans. Oh my god. Yeah.
Lewis (36:34.498)
So I think, you know, if anyone's listening and wants to, you know, extend SetFit and write a nice paper, that would be one way to do it. And I think the other one which is potentially interesting is one of the referees of our paper kind of made the observation, which I think was a fair one, they said, okay, you know, SetFit uses small models, so it's good for deployment, but then I need to have kind of one model per classifier.
Seth (36:41.787)
Yeah.
Lewis (37:02.754)
So if you imagine that I'm, you know, say, on 10 different tasks, I might have 10 different data sets and now I've got 10 models in production and maintaining those, you know, starts to get maybe a bit burdensome. And so what they were sort of hinting at was like, well, maybe you could do something like adapters. So, you know, in the same way that we use adapters for transformers, maybe you can have kind of like a base like a sentence transformer that you've done this contrastive learning process.
And then you have adapters that you can just swap at inference time. So you've only got really one model deployed, but you're swapping all these adapters. And I think that's something that would be kind of cool to look at, but in those days, LoRA didn't exist. So it was more like a conjecture.
Seth (37:50.37)
Right. Very cool. Something that like, if I had more time, I'm interested in.
You know, trying out different types of embeddings. I think that can really change things. Um, that's something that personally, if I had more time, I would look into. And as I was telling you, like maybe, you know, trying different classification heads and things like that. Um, we could talk about SetFit all day, but let's get into the next thing. Um, so large language models, you know, it's unbelievable what's going on.
With chatbots, it's like every day there's a new thing coming out. I'm sure you know there's this new one, um, Mistral, yes, 7 billion. I'm sure there'll be a Mistral 70 billion soon. Um, and it's this really interesting paradigm, right?
Lewis (38:35.26)
true.
Seth (38:49.326)
As humans, we've created so much knowledge and we've created so much written, you know, text and everything. And now we've created these powerful models and we have the compute to actually, you know, read it and do this sort of next, you know, next word, you know, prediction. And that's, it's a really great data set. And then we started adding like another layer.
fine-tuning, and then like we've done work with instruction fine-tuning and of course reinforcement learning with human feedback. And I know that that's something that you're focused on. So what can you tell us about some of the work that you're doing with RLHF?
Lewis (39:32.226)
Yeah, sure. So at Hugging Face, we have kind of, let's say, two small teams looking at this from different angles. One team is essentially developing a library called TRL, or Transformer Reinforcement Learning. And that was actually written by my co-author, Leandro, many years ago. And it was funny because I remember, like, he said to me one day, oh, you know, I'm going to do this side project. I'm going to, like, implement this OpenAI paper.
about fine-tuning language models from human preferences. And in those days, I was like, what? Why would you bother doing reinforcement learning? Reinforcement learning is horrible, it doesn't work.
Seth (40:09.562)
Right. It's like orthogonal to NLP. Who could apply reinforcement learning to NLP? Yeah.
Lewis (40:16.318)
Exactly. And so he did this very nice, uh, like kind of library. Basically the original OpenAI code is in TensorFlow, probably TensorFlow 1. No one likes it. So he did it in PyTorch and had a nice API. And in those days, right? Like, you know, the compute and the kind of idea of what you would do with this was kind of somehow, you know, limited in the open source community. So you had an example of like essentially adapting a model to generate more positive movie reviews. And that was, you know, more or less, uh, where he left it.
And it sort of sat dormant for like two years. And, um, he told me this story that when he joined Hugging Face, like he said to Tom Wolf, I think this reinforcement learning stuff is really important. And, you know, I would like to sort of focus on this part, you know, with his library. And again, Tom was like me, he's like, oh yeah, like reinforcement learning, it's like, it doesn't work. And then of course, ChatGPT came like two years later, and then all of a sudden the whole open source community was like asking, okay,
how do we actually kind of do the same kind of thing? And so the TRL library sort of exploded in popularity and now we've been heavily extending it to integrate these like adapter techniques. And also what I've been working on is how to scale it. So how do you do distributed training of Llama 70B with this library? And so that's like a sort of part of the team doing open source development.
And then the other part of the team, which I also work in, is we're trying to develop essentially stable recipes for doing RLHF. And one of the challenges that we've identified is that the community has been extremely creative and productive at instruction fine-tuning or SFT, supervised fine-tuning, something like that. And there's now like hundreds or tens of thousands of transformer models that have been, you know, instruction tuned on the Hub, but there are very few
RLHF models. So these are models that typically require this fairly extensive optimization with an algorithm called PPO. And we think part of that is because the data that you need to train these models tends to be expensive to acquire. And also the compute that's needed tends to be significantly more than just standard fine-tuning. And that's partly because the PPO algorithm has some kind of, let's say,
Seth (42:24.858)
Right.
Lewis (42:36.946)
instability around the hyperparameters. And so what we've been doing is first of all, acquiring these data sets and we wanna open source as much of them as we can. And then we're doing thousands of experiments to figure out for the most popular architectures like Llama and Falcon, what are the kind of, let's say, good parameters that work when you're trying to do this reinforcement learning to train chatbots. And...
I can't give a precise date when we will release the things, but I'm, you know, hopeful that it will be relatively soon. And we're also exploring these other exciting techniques like DPO, or direct preference optimization, and rejection sampling, which is another very powerful kind of baseline technique. And so the goal is to provide the community with code, data sets and models. The usual thing that we do.
Seth (43:15.984)
Yeah.
Seth (43:27.65)
Yeah. I mean, this is a technical conversation, but I do want it to be accessible too. So, you know, just two concepts. The idea of adapters. Can you explain that? Yeah, how would you explain that?
Lewis (43:39.09)
Mm-hmm. Yeah, sure.
Yeah, so generally when you do fine tuning, and this was the standard practice for many years, essentially what you do is you take your pre-trained transformer, and this has got some kind of, let's say, basic understanding or statistical understanding of language based on this next token prediction, or filling in the masks or the gaps in text. And what you do is you essentially throw away the last layer of this neural network, and you replace it with a new layer which kind of matches the task you're trying to model.
So if you're trying to do something like sentiment classification, you would have a kind of classification head on top of this transformer. And then when you do fine-tuning, you're basically doing the standard backpropagation throughout the whole model. So you're essentially updating all of the parameters of the model. And this has some sort of memory requirements. And so there is a bunch of like math you can do to figure out how much it costs to do the forward pass through the model
to get the loss and how much it costs to do the backward pass. And for small models, you can run this fairly efficiently on a single GPU. But with the advent of large models, especially models that are in the 7 billion parameter plus range, you start having trouble fitting all of these parameters and the optimization states and all that stuff on a single device. And so you need to do distributed training.
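A back-of-the-envelope version of that math can be sketched as follows. The 16 bytes per parameter is an assumption that holds for mixed-precision training with the Adam optimizer (half-precision weights and gradients, plus full-precision master weights and two optimizer moments); other optimizers or precisions change the number, and activation memory is left out entirely. The function name is hypothetical.

```python
def full_finetune_memory_gb(num_params, bytes_per_param=16):
    """Rough memory for weights, gradients, and Adam states, in GB.

    Assumes mixed-precision Adam: 2 bytes (fp16 weights) + 2 bytes
    (fp16 gradients) + 12 bytes (fp32 master weights, momentum, variance).
    Activations and framework overhead are not included.
    """
    return num_params * bytes_per_param / 1e9


print(full_finetune_memory_gb(7e9))    # a 7-billion-parameter model
print(full_finetune_memory_gb(300e6))  # a few-hundred-million-parameter model
```

Under these assumptions a 7-billion-parameter model needs on the order of 112 GB before counting activations, more than a single 80 GB A100 provides, while a model with a few hundred million parameters needs only about 5 GB.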
And this is where there's all these very powerful techniques like DeepSpeed and FSDP. But the problem there is it's expensive. So suddenly, you know, what used to cost 10 bucks or 30 bucks on a Colab is now like hundreds to thousands of dollars on a single node of A100s. And so there was a big breakthrough paper called LoRA. So low-rank adaptation for language models. And what the authors realized is that when you're doing fine-tuning,
Seth (45:30.706)
Right.
Lewis (45:38.23)
You've already got a really, really good base model. So the representations of that model are already very good. So maybe you don't actually need to update all of those parameters when you're doing fine tuning. So the idea instead is you look at every single kind of linear layer in the transformer and you insert these adapters, which are essentially just like matrices of weights that you want to update. But it's a very small number. We're talking about maybe a million to a hundred million parameters in the extreme case.
And so now instead of having to do optimization over seven billion parameters, you're really only doing optimization on a million or something like that. So this suddenly becomes very efficient. So the memory is a lot lower. And it's also quite fast because now I don't have to, you know, train the full model for like, you know, a day. I can do this in maybe 15 minutes and I'll get very comparable performance.
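The parameter savings described here can be made concrete with a small sketch. In LoRA, a frozen weight matrix W is left untouched and a low-rank update B·A is added next to it, scaled by alpha / rank; only A and B are trained. The shapes below (hidden size 4096, 32 layers, four adapted projections per layer, rank 8) are hypothetical values chosen for illustration, and the helper names are assumptions rather than any library's API.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def full_layer_params(d_in, d_out):
    """Parameters in the frozen dense weight matrix W (d_out x d_in)."""
    return d_in * d_out


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x); W stays frozen, A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Hypothetical shapes, loosely sized like a 7B-class transformer.
hidden = 4096
num_layers = 32
adapted_per_layer = 4  # assumption: four attention projections get adapters
rank = 8

per_adapter = lora_trainable_params(hidden, hidden, rank)
total_lora = per_adapter * adapted_per_layer * num_layers
total_frozen = full_layer_params(hidden, hidden) * adapted_per_layer * num_layers
print(per_adapter, total_lora, total_frozen)
```

With these assumed shapes the adapters hold roughly 8 million trainable parameters against about 2 billion frozen ones in the same layers. When B starts at zero, `lora_forward` returns exactly the base model's output, so training begins from the pre-trained behavior.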
Seth (46:35.858)
Very cool. So adapters seem to me to be like, um, sort of like a subfield of how you do transfer learning, right? For natural language processing. Um, yeah, you know, the field moves so fast, you know, and I see all of these papers like LoRA and QLoRA and all, you know, all of them, but
Lewis (46:46.646)
Yep. Exactly.
Seth (47:01.018)
because of the work that I'm doing, like I can't do a deep dive, you know, into everything. So it's great to be able to talk to you about that. And yeah, as you're explaining, I'm like, oh, so this is, you know, it's transfer learning, but because you're dealing with so many parameters, you can't just like, I mean, I'm sure you could just freeze some layers, but you have to figure out these new ways of dealing with these massive, you know, these massive models. So that's great. And thank you
for the explanation. And then I guess the other thing that you're talking about is in terms of like PPO and DPO. So proximal policy optimization, correct? Can you go into that, just like the basic idea of it?
Lewis (47:47.318)
Yeah, sure. So this is an algorithm that comes from reinforcement learning and was originally designed in the context of like, let's say games, you know, trying to teach agents how to play Atari and so on. And the basic idea is in reinforcement learning, you distinguish between different types of optimization algorithms. There are so-called online algorithms or offline algorithms.
And offline algorithms, they typically have what's called like a memory buffer. So you try to give your agent some notion of previous like experience that it has had during a game. So if you imagine you're playing Pong, you give the agent the ability to kind of recall some of the previous moves it made. So then when it makes the next move, it's able to then, you know, do something slightly better. And then there are the online algorithms, which PPO belongs to.
What they do is they just throw away all the memory and then try to figure out what is the optimal move to make based on just the observations that you're provided. And the way this works, and this is going to get technical so I'll see if I can keep it high level, is essentially that in order to predict the next move you would make in a game, you need some estimator.
And this is conventionally a neural network. So essentially what's happening, your observations are gonna be the pixels of your screen. And then you have to tell the agent, should I move left or right in Pong? And the sort of prediction of whether it should be left or right is coming from a neural network, which is essentially taking these pixels and then using that information to make an inference. And what PPO does is it measures
essentially how, first of all, it makes a prediction for what the move should be, but it also measures how far away the distribution of your predictions is from previous steps in your optimization. So the idea is that if you're just doing kind of like a purely random search through what the next thing should be, you'll never kind of figure out what the optimal moves are. And if you are completely unconstrained,
Lewis (50:09.342)
So maybe you end up optimizing too hard in one direction, you may find that just going left all the time is the right thing to do. And so what PPO incorporates is this measure of like kind of difference or distance between the previous state of the model to the current state of the model. And then it uses that as a way of kind of constraining the choices so that you don't depart too significantly.
Seth (50:17.263)
Right.
Lewis (50:36.494)
from your previous experience. And this is the advantage of doing things online where you have no memory. So you can't say, oh, what did I do 10 steps ago? But what you can do is say, how different am I now to how I was in the previous step. And so that's very like high level. We can go deeper, but it's more or less around this idea of developing optimal choices for these games. And the very clever adaptation by OpenAI.
Seth (50:44.635)
Right.
Lewis (51:05.618)
was to figure out that you can apply this to language models. And I think this was a really impressive innovation. I don't think it was obvious to many people that such a thing would work. And the difference there is that in the conventional reinforcement learning context, you have typically an environment, which is your game, you have observations, and then you have actions. And in the context of language models, your environment isn't a game.
Seth (51:09.671)
Right.
Seth (51:16.732)
Right.
Lewis (51:33.042)
It's really a data set of prompts. And what happens is you provide these as your kind of observations to the language model, the language model will then generate a response. And what you do is you then have something called a reward model, which ranks the quality of that response. And then it's the combination of that score for the reward plus the measure of difference from your previous state
that is used to sort of optimize the model in a direction that hopefully maximizes the reward with this kind of additional constraint. And so the basic idea here is that you're encoding the human preferences in your reward model. And then by optimizing your language model, you're kind of pushing it in parameter space to be more aligned to the preferences. And yeah, OpenAI
Seth (52:25.627)
Right.
Lewis (52:29.974)
did a series of very groundbreaking papers, starting with simple tasks like sentiment tuning, so trying to make a model more positive, to summarization, and then ultimately InstructGPT, which was the precursor to ChatGPT.
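The two ingredients described in this answer, a constraint on how far each update can move and a reward-model score combined with a measure of difference, can be sketched for a single sample as follows. This is a simplified illustration under stated assumptions, not the TRL or OpenAI implementation: the function names are hypothetical, `clip_eps=0.2` and `beta=0.1` are illustrative defaults, and the penalty is computed against a frozen reference copy of the model, which is a common setup in the RLHF literature.

```python
import math


def ppo_clipped_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one action (to be maximized).

    The probability ratio between the updated and previous policy is
    clipped to [1 - clip_eps, 1 + clip_eps], so one update cannot move
    the policy too far from where it was.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def kl_shaped_reward(reward_score, logps_policy, logps_reference, beta=0.1):
    """Reward-model score minus a penalty for drifting from a reference model.

    The penalty is a sample-based KL estimate: the summed difference of
    per-token log-probabilities under the policy and the reference model.
    """
    kl_estimate = sum(p - r for p, r in zip(logps_policy, logps_reference))
    return reward_score - beta * kl_estimate


# Hypothetical numbers: the new policy doubles the action's probability.
print(ppo_clipped_term(math.log(2.0), 0.0, advantage=1.0))
print(kl_shaped_reward(1.0, [-1.0, -1.0], [-1.5, -1.5]))
```

In the first call the ratio of 2.0 is clipped to 1.2, so the gain from moving further in that direction is capped. In the second, the response earns a reward of 1.0 but loses 0.1 for having drifted from the reference model.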
Seth (52:45.578)
Right. I'm going to make a shout out to my fellow Great Neck South High School alumnus, John Schulman, you know, the author of PPO. It's so cool. You know, he did work on robotics and those video games like you're talking about. And now, you know, he's a co-founder of OpenAI. And to see...
Lewis (52:54.625)
Indeed.
Seth (53:06.534)
this reinforcement learning, where first off, for so long people didn't believe in reinforcement learning, you know, and then people thought that you couldn't apply reinforcement learning to NLP. And then it was that breakthrough. The ones that you're mentioning. And in my mind, InstructGPT was like, really like the one that, you know,
broke the camel's back, I guess. That was the thing that opened the floodgates for all of this. So yeah, it's so amazing. I'm looking forward to hearing some of the results of your experiments, looking at all of the work that you're doing on RLHF. That's very exciting. And I know that we could definitely go into all of it. I mean, there's so many questions, right, around like, how do you know you're getting human alignment?
The biases around longer responses, which, you know, tend to be accepted better, and things like that. And then something that I'm like really interested in, something that I'm doing in some of the work, is like using other LLMs to validate things, because they have like RLAIF,
like reinforcement learning with an LLM in the feedback loop, which, I don't know, those are the ones that kind of scare me a little bit.
Lewis (54:23.982)
Yeah, I feel like.
Lewis (54:34.006)
Yeah, although I think that all these techniques are trying to come up with creative solutions to two main bottlenecks we have today. So the way we evaluate large language models has traditionally been on a series of fixed benchmarks. So you have things like, let's say, the MMLU benchmark, which is a...
a measure of like kind of let's say college or grade school level science questions or exam questions, and this is meant to measure like the reasoning capabilities of models. And then you've got like TruthfulQA, which is another benchmark that tries to sort of measure the sort of hallucinations in some way, it's like a proxy for hallucinations.
And all these benchmarks, they have the limitation that they're static. And so what often happens is even though people try to be careful about decontaminating the pre-training to not include these, because these models, these benchmarks are everywhere, right? They're like on GitHub, they're on the Hugging Face Hub, they're in like, you know, random, I don't know, Dropbox folders. And so when you scrape the whole internet, right, you've got to do a lot of work to sort of try and decontaminate this.
But the other thing is that they often are just proxies for like, let's say, capabilities that academics have developed over years. And there's a difference between that and what the end user wants. So the end user, in the case of, say, chatbots, they want something that they can, you know, have a conversation with, and they want to be able to, you know, ask about, say, a wide range of different topics. And a really nice example of this is in the QLoRA paper, which was this like quantized approach to LoRA.
They show that if you basically train a language model on the FLAN v2 data set, so this is like a very classic academic benchmark from Google around kind of like this sort of multi-task reasoning, you get something which just crushes all the academic benchmarks. So you get a model that is like really state of the art on like, you know, MMLU and so on. But if you chat with it, it's atrocious. It's like a terrible, terrible model. And then vice versa, if you...
Seth (56:39.581)
Right.
Lewis (56:43.754)
tune your model on more like chat-related data sets, so do instruction tuning on that, they will typically perform worse on these academic benchmarks, but then be much better to chat with. So the two bottlenecks have been, first, how do we evaluate chatbots in a robust way? And the second one is, how do you bypass the bottleneck of reinforcement learning or RLHF where you need to acquire a large amount of human preference data?
Seth (56:55.034)
Right.
Lewis (57:13.718)
And the most interesting one for evaluation, I'll just mention briefly, the folks at LMSYS, they're a collective of researchers from Berkeley and elsewhere. They pioneered the use of GPT-4 as a judge. So the idea is you show GPT-4 the response of your chatbot, and it just says, you know, did it follow the instruction? Here's my score, and then you get some way of measuring the relative capabilities.
Seth (57:13.723)
Yeah.
Lewis (57:41.082)
And for the human preferences, we haven't cracked that yet, but there are many researchers doing this AI feedback to try and see if language models today, especially GPT-4, could be used as a proxy to essentially annotate the data for you.
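The judge idea mentioned above can be sketched as two small helpers: one that formats the instruction and the chatbot's response into a grading prompt, and one that pulls a numeric score out of the judge's reply. Everything here is an assumption made for illustration: the template wording, the 1 to 10 scale, and the function names are hypothetical and are not the prompt used by LMSYS. The call to the judge model itself is deliberately left out.

```python
import re

# Hypothetical judge prompt; the wording and the 1-10 scale are assumptions.
JUDGE_TEMPLATE = (
    "You are grading a chatbot.\n"
    "Instruction: {instruction}\n"
    "Response: {response}\n"
    "Did the response follow the instruction? "
    "Reply with a short justification, then 'Score: N' where N is 1-10."
)


def build_judge_prompt(instruction, response):
    """Fill the template with the instruction and the chatbot's response."""
    return JUDGE_TEMPLATE.format(instruction=instruction, response=response)


def parse_score(judge_output):
    """Extract the integer after 'Score:'; return None if missing or out of range."""
    match = re.search(r"Score:\s*(\d+)", judge_output)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 1 <= score <= 10 else None


prompt = build_judge_prompt("Summarize the article.", "The article says...")
print(prompt)
print(parse_score("The response follows the instruction. Score: 8"))
```

Returning `None` for unparseable or out-of-range replies keeps malformed judge outputs from silently skewing an average score.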
Seth (57:55.578)
Right. It's so fascinating. And this is why it's so important to, um, like just have like.
a strong background in like just machine learning in general, because what you're explaining in the first problem, where like they're seeing, like they're kind of like seeing the benchmarks. In my mind, like it's just like data leakage, right? Like it's just like, it sounds like data leakage. It's like, did you split up like your training validation and test set? Like, I don't know. Like maybe you didn't there because if it's not able to generalize, then you didn't succeed in creating a robust, robust model. It makes you realize like,
Lewis (58:16.77)
Yeah, exactly.
Lewis (58:24.251)
Yeah.
Seth (58:33.178)
You know, benchmarks have their role in things, but they're not the be-all and end-all, right? And that's the interesting thing, I was listening to, um,
one of the Llama chain, uh, LangChain, and how it's like, it's a vibes test. You know, it's like just because you can run every benchmark. You can run every data set. You can try some like summarization data sets. Come on. You can summarize things in so many ways. The context is so important. And then you see these things and all these influencers are posting things on Twitter and LinkedIn. And they're like, oh, try this model. And then like, I plug it in and I use it on Hugging Face. I have a couple of.
Lewis (58:53.069)
Exactly.
Lewis (59:10.006)
Thank you.
Seth (59:15.636)
that I have it do. Still can't spell lollipop backwards. It's because of the tokenization. It's not a big deal, not a big deal.
Lewis (59:23.464)
Okay, that's an interesting one.
Yeah, yeah, yeah. Yeah, true, I guess if you do arithmetic as well, it's often a struggle.
Seth (59:31.446)
Yeah, and then yeah, arithmetic, obviously, but like just get a calculator. I, the whole, people tricking chat, you, uh, you know, uh, chat bots on arithmetic stuff, I'm like, you know, use it for what it's, use it for what it's good for. But yeah, you notice you'll, you'll find this model that everyone's raving about. And then it writes these responses to you where it uses the noun over and over again, right? Like I'll say, um, you know, tell me a story about a mouse going to the moon or something, and it'll say like.
The mouse this, the mouse that, the mouse like, and it's just like, it's not how a human would speak or anything like that. So yeah, I have a certain vibes test. My colleagues make fun of me for it. But the lollipop test is one of them and I've seen very few pass that one. For whatever that's worth. We've discussed, yeah.
Lewis (01:00:26.57)
Yeah, we have three vibes. I can give you the three vibe tests that we give at Hugging Face. So there's a cool prompt from InstructGPT, which is why is it important to eat socks after meditating? And invariably, most models will say eating socks after meditation is an ancient practice, which has tons of nutritional benefits. And it's like, make sure you get the dirt out of your socks and prepare them properly. I mean, it's really funny.
Seth (01:00:51.538)
God.
Lewis (01:00:54.254)
And the other one we often use comes from Jack Clark, who's at Anthropic, which is how many helicopters can a human eat in one sitting? And invariably, they'll be like, oh, the average human can eat this many kilos of food in a meal, and therefore, if we decompose a helicopter into kilos, you get that. And the other one, which Leandro developed when we were doing StackLLaMA, which was like a kind of proof of concept for reinforcement learning on
Stack Exchange, was you say, there's a llama in my lawn, how do I get rid of it? And this is not really a vibes test of capabilities, it's a vibes test of personality. So most models like say ChatGPT will give you a very polite and reasonable answer like, you know, call animal protection, blah, do this, but when we trained StackLLaMA, it was like, get a flamethrower and burn it to a crisp.
Seth (01:01:54.04)
Oh no. Not ready for like not ready for production.
Lewis (01:01:54.254)
And exactly. And what we realized or we thought about is that in the Stack Exchange website, there's like a whole topic of Dungeons and Dragons, which is around people giving advice on how to handle different scenarios. And we suspect that this is probably feeding into the optimization.
Seth (01:02:10.718)
There you go.
Lewis (01:02:15.458)
But it was so funny because you could then say anything like, oh, you know, a kangaroo is in my lawn. How do I get rid of it? And it'll be like, get a boomerang and, you know, throw it at it. I mean, it was quite funny.
Seth (01:02:24.206)
Yeah, that's, yeah, that's why that, you know, this is going to open up another Pandora's box, but just the idea of, um, you know, like what true intelligence is and, you know, people being very scared of these chatbots. It's like, you know, they have a certain, I don't want to even use the term understanding. They have its statistical preference, you know, statistical preferences. And then it's being fine tuned on what humans want to hear. It's not quite.
you know, able to be executing things. It's not given that power just yet. So I think everyone should kind of just use it for what it's good for. Generative models have their place, fine-tuned, you know, LLMs have their place. Um, but I'm not one of those doomsdayers. Um, I have a feeling that maybe you're not either, but you can correct me if I'm wrong.
Lewis (01:03:20.214)
Oh yeah, my p(doom) is really high. No, no, I think for myself, it's something where I kind of oscillate between like center left, center right. If you imagine these like two extremes, and I think it's partly from living in Switzerland, you get used to, you know, sort of taking the middle road. But I suppose I can see like both sides of this. So a good example of this, I met a chemist
Seth (01:03:39.291)
The narrow, yeah, the middle road.
Lewis (01:03:50.158)
at this AMLD conference this year. And he was one of the red teamers for GPT-4. And he was red teaming it from the perspective of like chemical synthesis, you know, how far can it synthesize or help you build nasty things. And he said, you know, of course it's like not great. It's missing a ton of steps. But he said what it showed was like the value of expertise. So...
Imagine you're trying to build something like, you know, some sort of chemical reaction. Of course you can Google a lot of stuff, but what GPT-4 can do is say, oh, based on the results of your experiment, you know, maybe you should increase the temperature of this or do something like that. And this starts to be, I think, quite a different paradigm to the just information lookup where it's more like expert advice. And the models are clearly, you know, not great. But
Yeah, this guy kind of showed me like the dual use when you start thinking about like chemical biological applications. It's a little bit of a sort of uncertain area, right? And a little bit depends on how well the capabilities of the models increase in the natural sciences. And then at the other side, you've got like, let's say the one where, okay, there's no problem. Let's just go full steam ahead. I feel like
It's still not clear to me if that's also, you know, opening you up to unintended consequences that are, again, super hard to predict. And for me personally, if you look at things like social media, I find this kind of argument to be fairly compelling, which is that, you know, we essentially did AI on society with recommendation algorithms. And we got some, you know, unintended side effects around democracy and so on.
Seth (01:05:38.062)
Yeah.
Lewis (01:05:43.514)
And I feel that, you know, LLMs and potentially whatever comes next may play a similar role, where they become so integrated into our society that, you know, it's not something nefarious like some bad actor, it's just the complex mechanics of interacting systems that cause these things. And those are very hard to predict. But at the same time, you know, just being a, what is it, effective accelerationist is, you know, maybe not the... That's too extreme for me.
Seth (01:06:12.406)
100%, and I should clarify that, of course, it's very important to understand the biases and to try to mitigate any of those types of risks. Speaking of one of the things that you're talking about, that's sort of the alignment problem, like from Brian Christian's work, where the
predictions of the model affect the behavior of the human, and then the behavior of the human affects the inputs for the model. And that's what has created, like with social media, you know, the radicalization and all those things. It had a major effect on, you know, politics in the US in previous elections.
You know, Jeremy Howard has done a lot of work on that, which is really fascinating and very important work. So yeah, I don't mean to make light of any of it, of course. It's important to have smart people that are thinking about these things and working on it. And I think it's important that people realize, like,
yes, progress is important, but there are people that care, right? Like, Hugging Face cares. I mean, people at OpenAI, you know, people give them a bad rap. They care also. Like, no one is trying to create something that's going to, you know, bring any negative things into this world. People are
doing their best, and perhaps there's more that can be done. And there should be more research around it. But there's this big movement, I've felt, where people work in machine learning for a while and then tend to move into AI ethics because they realize the power of it. And I think it's a very interesting transition that people make. Just
Seth (01:08:06.69)
In the interest of time, I mean, I feel like there's about two dozen questions that I wanted to ask, but this was such a great conversation. But there are two last questions that I do want to ask you.
What advice would you give to machine learning scientists or data scientists that are just starting out in their career? This is a big change from what we've been talking about, but yeah, we're going to advice.
Lewis (01:08:37.563)
Yeah, so I think as usual, when I get asked a question about advice, I try to caveat this with just like, you know, most advice is kind of worthless and mostly because it's like out of context, right? Like the things that, I don't know, worked for me were like a point in time and a certain background. So I just want to make sure that like, you know, people don't, you know, listen to this and immediately do what I, what I suggest.
Lewis (01:09:03.402)
The thing that I learned in the last couple of years is that you want to really maximize your kind of learning rate, so to speak. So especially when you're starting out, it's a vast field. So you've got everything from computer vision to NLP to biology. I mean, you can think of doing ML in so many domains.
And one of the mistakes I think I made early on was I was just kind of trying to learn all the things at once. So I was hopping from, I don't know, Ian Goodfellow's super technical, theoretical Deep Learning book to some online courses and stuff. And the problem with that strategy is you end up with a kind of shallow knowledge of lots of different random things. And again, this fast.ai
kind of philosophy was around, sort of, pick roughly one domain, one kind of problem, and go deep on that, and then use that as your kind of foundation layer to branch out. And so for me personally, the vertical I looked for was NLP, because I found it to be, you know, quite exciting. And then it was more or less all in on this and ignoring computer vision, ignoring reinforcement learning, you know, obviously to my detriment, but
Seth (01:10:24.342)
Yeah, until you got smacked with it and you had to... But that's fine. It's fine because you could get the foundation and then you were able to kind of move into that. Yeah, yes. I didn't mean to cut you off.
Lewis (01:10:24.605)
and these other things.
Exactly. And so my recommendation, yeah.
Lewis (01:10:39.274)
Yeah. And the other one I was going to say is that if you're starting out, especially if you come from, like, a university, like undergraduate, you may have had a fair amount of, let's say, toy problems. So you may have worked on, you know, nice data sets. You may have worked on soluble problems. And often I think part of our job, whether it's in industry or, you know, in data science, is basically working under difficult, uncertain circumstances.
And so I would recommend trying to move away from, you know, the things that everyone else does, like Kaggle Titanic, and really try to come up with something that's fairly novel. And if I was doing my time again, I think the interesting thing I should have done much earlier was contribute to open source. And for me, this has been the fastest way to sort of accelerate your learning, because when you're looking in the internals of the Transformers library, because you're trying to...
Seth (01:11:17.071)
Right.
Lewis (01:11:39.126)
you know, contribute something, you really have to understand. I mean, you have to understand what the hell this attention stuff is doing. And so I would sort of say that's a natural place to start if you have the time and the resources because you'll both get feedback from the maintainers of the library or libraries, and you'll also learn a ton about the sort of foundational layer.
Seth (01:11:45.927)
Yes.
Seth (01:12:02.558)
Absolutely, I think that both of those are such great pieces of advice.
Yeah, finding a project that you're interested in where the data set doesn't exist, right? Like, do spend however long, you know, some people say when you have a text classification problem, just spend a week labeling data and then go forward. Don't try to do anything too fancy. But find a problem or a project that you're interested in and do it end to end, right? Like, go from the data collection through everything,
Lewis (01:12:23.147)
Yeah.
Seth (01:12:38.024)
you know, model creation, the feedback loops and evaluation and all of it. And then understand, like, what are the trade-offs? And then also try to get it into production. Like, what does it take to get it exposed to other people in any way? I found that to be very helpful, because I think it's important to be able to understand the concepts and to do it in a notebook. But then at some point, at least if you want to be in industry, you have to see what it takes to get your model
to be exposed. And then in terms of open source, I feel the same way. I wish I had started contributing to open source more. I wish I was contributing to open source more right now. I have done little things, like little bugs, little changes that I've made as little pull requests, but I've never had the opportunity to really sink my teeth into something. And I think it's the best way of getting, like, free feedback.
You get these amazing people that care deeply about what they're doing, and they're very generous and they can help you and expose you to what it's like to maintain a codebase and to contribute to a codebase, which is great.
Okay, so transitioning to the last question for Learning from Machine Learning. What has a career in machine learning taught you about life?
Lewis (01:14:12.59)
Maybe not to take life so seriously. Yeah, so I mean, there's a bit of seriousness to that, which is, especially in the current moment, right, we have these very, you know, almost civilizational-level debates about
Seth (01:14:15.35)
Hehehehehehe
Lewis (01:14:29.658)
Should we, you know, what kind of regulation should we impose on language models or AI systems to, you know, what will be the impact of this technology for, you know, even my own career, right? Like, it's highly likely that as engineers, we're probably the first people that, you know, may be susceptible to some sort of automation.
So those are the serious things. And then also the nature of our work, right? Personally, it's a huge amount of debugging, and debugging distributed systems, which are notoriously difficult to debug. And so if you take all that stuff really seriously, then you can easily get sort of sucked into this exhausted mode where there's just too much tugging at your mind.
And so generally speaking, focusing on those kind of hard problems in my day job has allowed me, in my personal life, to be a bit more relaxed and also a bit more appreciative of the sort of non-AI things that I have around me. But yeah, that's sort of, at a high level, what I feel like I've learned.
Seth (01:15:40.542)
That's great. Yeah, I think that that's a really good way of thinking about it. What I've heard from people, it's like, yeah, working with machines so much and AI and thinking about AI and then it makes you appreciate the human and the human connection more. So I think that that's a really great takeaway. And...
a good way to sort of conclude here. For listeners that want to learn more about your work or follow you, what would be the best way to learn more about some of the things that you're doing?
Lewis (01:16:16.922)
Probably GitHub. So my username is lewtun, L-E-W-T-U-N, and you can just see the pull requests I'm opening. That's the closest you'll see to the bare metal. I'm not super active on Twitter and LinkedIn. I found it quite distracting in the current LLM hype, so I try to focus more on coding. But yeah, I would say GitHub is probably the way to go,
or even on the Hugging Face Hub, you can open issues on my repos and I'll respond.
Seth (01:16:51.226)
Awesome. I think that's what I did. Lewis, it has been such a pleasure chatting with you. I just really appreciate your insights on all of this stuff. It's so exciting, the work that you're doing. Thank you so much for taking the time to chat.
Lewis (01:17:08.398)
Thanks for having me. It's been a real pleasure.