Seth (00:01) Hello and welcome to Learning from Machine Learning. On this episode, we have a very special guest, Aman Khan, the head of product at Arize. He recently released a popular, well-received class with Andrew Ng at DeepLearning.AI, was also on one of my favorite podcasts, Lenny's, and released a thorough article on evaluating AI. Aman, welcome to the show.

Aman Khan (00:24) Thanks so much for having me, Seth. It's awesome to be here and I'm super excited to dive in.

Seth (00:28) It's great to have you. Let's get right into it. What initially attracted you to machine learning?

Aman Khan (00:34) Yes, that's a good question. I think my story is maybe a little bit unlike some others. I didn't go through school studying computer science or thinking that I wanted to do data science or machine learning, actually. I ended up graduating as an engineer and working as an engineer for a number of years before actually getting a job at a self-driving car company as a test engineer. And so my job was to actually look at data from how the car was driving on the road or in simulation and understand: is the car doing what it's supposed to be doing when we push changes to it? And it was a pretty manual process at first. It was literally labeling: is this maneuver correct or not correct? And I kind of realized there was more and more opportunity to automate components of that job, maybe make my job easier as well by, you know, defining metrics, looking at simulation data at scale. And being able to really understand that data at scale is what got me more interested in data science, machine learning, statistics, really. And so every day I would go and sit down at lunch with the data scientists and the PhDs and the statisticians, and I would ask the most basic questions. I would ask, okay, how do I know?
You know, how many miles are enough to drive to be more confident in the score? So basically, how do I reduce my error bounds and confidence intervals? So starting from the most basic fundamental statistics and realizing that there was a lot of opportunity to build products on top of the statistical measures that we were using to make decisions. And that was what got me interested in machine learning: realizing there was this intersection of product building and statistics that had a huge area of opportunity to build better products, build better experiences with machine learning baked into them. So that was really what intrigued me about it. It was a little bit of accident and discovery, and then just having amazing people whose brains I could pick and learn from every single day on the topic.

Seth (02:42) Awesome, what a great opportunity to have. So when you were there, what role were you playing?

Aman Khan (02:48) Yeah, so I first started out as an engineer and then realized that there were parts of the flow of evaluating the car that you could kind of... I knew how to code basic Python at the time, and so I kind of knew how to script components of what I wanted to see and visualize. I quickly kind of outgrew that skill set a little bit, because I realized that there were people around me who were far better programmers than I was, and a lot of my value was actually in defining what metrics to measure, how we should visualize these metrics, and then trying to make sure that we were all looking at the same data, basically. And so that job ended up going from an engineer to a product manager of the team on the overall set of perception evaluations.
And then from there, I kind of realized that there was this interesting platform opportunity to build more products that help data scientists understand what their models are actually doing when you make changes. So this sort of looking at the black box and the inputs and the outputs and trying to understand the impact of the changes that you're making has been a problem I've actually spent a lot of my time thinking about and trying to understand better as well.

Seth (04:12) So was that where you kind of, let's say, fell in love with the metrics and the data science? Were you very interested in machine learning there, or was it more about evaluating machine learning?

Aman Khan (04:24) Mm. That's a really, yeah, that's an interesting way to look at it. I think the answer is probably a bit of both, to be honest. I think the realization for me was I had worked on a couple of really small projects before that, but this was my first real exposure to machine learning. And, you know, I basically had to ramp up a ton on the technology itself. And so I spent a good amount of time watching deeplearning.ai courses, trying to understand this technology and really the research side of it too, like the research and how these systems worked. And so I was pretty enamored by that field on its own. And then the second part of your question is, is it the evaluation component? It turns out that that's a machine learning problem too. And so in a way I got to kind of apply the same technology I was learning about

Seth (05:18) Yes.

Aman Khan (05:25) to evaluate machine learning systems themselves, which is a little bit meta. I mean, if you're familiar with the space now, that's what everyone's doing for, you know, LLM-as-a-judge evaluation.
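To make the LLM-as-a-judge idea concrete, here is a minimal Python sketch of the pattern: one model's answer is wrapped in a grading prompt and a second model returns a verdict. Everything named here (`JUDGE_TEMPLATE`, `judge_answer`, `fake_llm`, `pass_rate`) is hypothetical and invented for illustration; `fake_llm` is an offline stand-in for a real model call, and none of this is Arize's or any vendor's API.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical grading prompt; a real one would be tuned to the task.
JUDGE_TEMPLATE = (
    "You are grading an answer.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reply with exactly one word: correct or incorrect."
)


def judge_answer(question: str, answer: str, call_llm: Callable[[str], str]) -> bool:
    """Ask a judge model for a verdict; True means the judge said 'correct'."""
    prompt = JUDGE_TEMPLATE.format(question=question, answer=answer)
    verdict = call_llm(prompt).strip().lower()
    return verdict == "correct"


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline (assumption)."""
    return "correct" if "Answer: 4" in prompt else "incorrect"


def pass_rate(examples: Iterable[Tuple[str, str]], call_llm: Callable[[str], str]) -> float:
    """Fraction of (question, answer) pairs that the judge marks correct."""
    verdicts = [judge_answer(q, a, call_llm) for q, a in examples]
    return sum(verdicts) / len(verdicts)


examples = [("What is 2 + 2?", "4"), ("What is 2 + 2?", "5")]
print(pass_rate(examples, fake_llm))  # 0.5
```

In practice the judge's verdicts are themselves noisy, so they are usually spot-checked against a small set of human labels before being trusted at scale.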
So I would say it's really a combination of both of those things that kept me engaged in the field.

Seth (05:30) Very meta. Very cool. That's such an interesting way. I think that that's different than a lot of other people. I think a lot of other people are drawn to the modeling and the magic of machine learning, but evaluation is such an important part. I think in theory, you know, most people know: you figure out your business requirements, you go through the whole data science and machine learning cycle, and then you're like, yeah, you know about evaluation, right? But think about it from the standpoint where evaluation is the thing that really matters, and then taking what you learn from evaluation, bringing that back into the cycle, and improving what your system is going to be doing. It's a really good way of thinking about it. I wonder if it's from just your innate way of thinking about things, or if it's from your engineering background. Were you always kind of interested in the systems of how things work like that?

Aman Khan (06:36) Yeah, I mean, that's pretty accurate actually. So I studied mechanical engineering, so I wasn't even from a computer science background. A lot of what attracted me to engineering was actually learning about how systems work together. What are the components of the system? Trying to break it down and see, okay, if you change one thing here, what's the impact of the change somewhere else? So that's always been an interest of mine, seeing how these complex systems come together. And even in the context of machine learning applied to a self-driving car, or now let's even say an agent, it's not like you're just designing one model and that's the product.
Usually there are maybe components to the model underneath that. And so seeing how those different components interact with each other, especially when the system is non-deterministic, is actually a really interesting problem on its own. I'll give you an example. When you're designing a product for the physical world, you're usually designing with manufacturing practices in mind. You're thinking about tolerances and how those statistics stack up. And there's a very similar principle in machine learning systems as well. Instead of tolerances in manufacturing, you're designing for non-determinism. And so you get this kind of tolerance stack of multiple models stacking on top of each other. And so the two are very similar in many, many ways. It's process design. It's understanding, okay, what's the impact of one thing here over there? I do think that the complexity of the space is much higher for machine learning models, especially with how we've designed them now, when it comes to generative models, where there are just so many more ways that models can interpret things. And so it's kind of interesting, like,

Seth (08:05) Yes.

Aman Khan (08:24) you're still trying to design this thing to be deterministic to some degree, but if you just let it be non-deterministic, you get really interesting behavior. And then that actually makes the system more complex as a result as well.

Seth (08:33) Yeah. I think there's this conception, and people don't like to hear it, but if you're a real practitioner of machine learning, you have to assume that the model is probably going to be wrong, you know? And it reminds me of something you said about mechanical engineering. It's like fault error, right? When you're creating a physical thing in real space, there's always some percent chance that things are going to fail.

Aman Khan (08:49) Yeah. Mm-hmm.
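The "tolerance stack" analogy can be sketched with a little arithmetic. The Python below assumes, purely for illustration, that each step in a multi-model pipeline succeeds independently with a known probability; real systems rarely have independent failures, and the 95% figure is made up. `end_to_end_success` is a hypothetical helper name.

```python
from math import prod


def end_to_end_success(step_success_rates):
    """Probability that every step succeeds, assuming independent failures."""
    return prod(step_success_rates)


# Five chained steps that are each right 95% of the time (made-up number).
steps = [0.95] * 5
print(round(end_to_end_success(steps), 3))  # 0.774
```

Even with every component at 95%, the chain as a whole lands near 77% under these assumptions, which is why per-step evaluation alone can give a misleading picture of the whole system.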
Seth (09:03) So understanding that and building that into the system, understanding it's not just a single point, it's not just a single model that's doing one thing. There's a whole set of interconnected things where at any point, any step, there could be failure. And by acknowledging that failure and building it into the system, you can get a much more intelligent system and better outcomes, basically.

Aman Khan (09:29) Right. It's designing the system to be robust from the beginning, in many ways, to unexpected failure modes. There's a great comparison here, as we get deeper on this topic. I think what's new for a lot of people who are building with models today, or maybe just getting started with generative models, is that, you know, maybe their background was in software engineering before. For those folks, you're used to a system where you write code and the code executes, and you're used to the system performing in a way that's expected, right? And you design unit tests for edge cases or failure modes that you might anticipate. You don't catch all of them. There are maybe uncaught exceptions. And then you go back and write unit tests based on that. Now, the thing with this world is that these systems are non-deterministic by design. And so the number of failure modes and the complexity of failure modes is much, much greater as a result. And so instead of trying to design unit tests, you have other types of metrics you might look at that give you almost statistical confidence that the system is working as expected, but you're almost never a hundred percent confident. And so that's really the opportunity and the fun in the design around this: how do you design for a system that is by design non-deterministic?

Seth (10:53) Right. Okay.
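One way to picture "statistical confidence instead of unit tests" is to run the same check many times and report an interval around the observed pass rate rather than a single pass or fail. The sketch below uses the Wilson score interval, which is one common choice and not something named in the conversation; `wilson_interval` and the 90-out-of-100 example are assumptions made for illustration.

```python
from math import sqrt


def wilson_interval(passes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a pass rate (z = 1.96 is roughly 95% confidence)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# Same observed 90% pass rate, different amounts of evidence.
low_100, high_100 = wilson_interval(90, 100)
low_1000, high_1000 = wilson_interval(900, 1000)
print(f"100 runs:  {low_100:.3f} to {high_100:.3f}")
print(f"1000 runs: {low_1000:.3f} to {high_1000:.3f}")
```

The interval narrows as runs are added, which is the same trade-off as asking how many miles a car has to drive before its score can be trusted.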
We're definitely going to get more into that, but just, you know, for the sake of background: so you were working at this self-driving car company. Side note, pretty amazing how far that industry has come. I don't know.

Aman Khan (10:56) Hahaha. Yeah. When I first started there, the car could barely drive half a block without a human having to take over. And now you go to San Francisco and you just see Waymo everywhere, right? And so the industry itself has moved really, really quickly along this curve. And I think there are a lot of analogies there from the early self-driving cars to agent-based systems today, which I know we'll probably get into as well. And what are the similarities? In terms of, like, the agent can't really do anything, it's failing. It's like,

Seth (11:22) Unreal.

Aman Khan (11:44) man, give it five more years and we'll really see how good these systems get as you keep improving them.

Seth (11:51) Right, so we could fast forward, or you could give me some of the highlights of going from there to now, where you're currently head of product at Arize.

Aman Khan (12:02) Yeah, sure. So in between, after Cruise, where I was at pretty early on, I did work at Spotify on the machine learning platform for some time too, where the customers I was serving were data scientists, ML practitioners, and ML engineers who were deploying primarily recommender system models. Think ranking models, maybe embedding-based models as well, for recommendation systems for

Seth (12:13) Nice.

Aman Khan (12:31) you know, products like Discover Weekly or playlist generation. And my job there was to help build a platform to use data for training and serving those models at scale. And the interesting part about that role was just the immense scale of the data that Spotify processes.
We're talking billions of records per day, you know, millions per second many times. And taking all of that data and actually being able to turn it into useful features for machine learning was a challenge in its own right. And so that was where a lot of my time and energy was spent there. And that's really where I was looking for a tool to help do measurement and impact analysis in the same way that I had when I was in the self-driving space around evals. And that's really what led me to look for companies that were building products in this space around evaluation for ML systems, and that's when Arize popped up on the radar. Arize had just raised its Series A. It was a pretty small company at the time, less than a million in revenue. The founders were still trying to figure out, what does our product-market fit look like? And then fast forward to today, Spotify is now a customer of Arize, along with companies like Uber, Reddit, Instacart, and many others if you look on our site.

Seth (13:55) Yeah.

Aman Khan (13:55) We're basically helping data science and ML teams and practitioners at these companies ship models and ship applications in production. And we're building tools for evaluation around those systems.

Seth (14:11) Very cool. I think I was telling you, I was an early user of Phoenix when it was using UMAP and HDBSCAN. It was kind of used as a clustering explorer tool. I know that things have changed since then.

Aman Khan (14:30) Yeah. I mean, I think a lot of our product philosophy is to just build useful products and try to give away as much as possible for free to engineers. We really think this is a rising-tide sort of moment for AI, where you have more people trying to build useful applications. And a lot of applications are not going to make it, right?
A lot of products are not inherently going to be great products, but the learnings that you get along the way and the ability to iterate are really what we're trying to enable as much as possible. So yeah, we have tools for unstructured data and exploratory data analysis, and a ton more around agent tracing, visualization, and evals as well, running experiments, monitoring. A lot of that is open source as well.

Seth (15:21) So let's get into it. When a company or a team is thinking about productizing or productionizing ML models, what are some best practices? Because now you've been working with a bunch of teams doing it.

Aman Khan (15:36) Yeah, I mean, I think it might be helpful to actually refine that a little bit more. I'm kind of curious, when you think about ML models these days, what do you think about? What are your thoughts, Seth, on what a production ML model or system looks like in today's world, in 2025?

Seth (15:47) Yes. Well, I guess the important thing, whenever I talk about it, is: don't just implement AI or ML for the sake of implementing AI or ML. What problem are you trying to solve? I'm focused mostly on natural language processing. So I do a lot of information extraction. I do a lot of text classification and summarization. But really it's trying to bring all of those things together

Aman Khan (16:03) Mm-hmm.

Seth (16:20) and trying to create meaningful insights for companies, to help companies really optimize their operations by analyzing and getting information from the conversations that customers are having with them. So in terms of types of ML models, yeah, we should talk about it.
So obviously there's the elephant in the room, right? LLMs are taking over everywhere. There's this ability to use this incredible thing that's pretty good at a lot of stuff, you know, but it's very general. You can kind of use it out of the box to get pretty good results on things. But personally, I like to find the best combination of tools to get the job done. So I'll use very lightweight, fine-tuned text embeddings, things like that. But we can talk about LLMs. Let's do an example. So what do you want to do? What would be a good example of a problem?

Aman Khan (17:13) Mm-hmm. Okay. Yeah, I mean, it might be helpful, you know, there's this concept in self-driving around levels of autonomy. And I think there are a lot of analogies to that in the generative AI world of today, right? And if you look at applications of LLMs right now, there are some that stand out right off the bat, which are summarizing text, maybe extracting information from text, or even other types of data, like image classification. So there are a lot of these analogies to the ML systems you and I are both familiar with when it comes to training these types of models using data: classifiers, you know, decision trees, these types of models. And I think from a levels-of-autonomy perspective, or levels of how sophisticated these applications are, it still sort of feels like we're scraping between level one and two, right? Like these chatbots, you know, if you want to use ChatGPT, that's gotten a lot better, where it now has access to tools. But a year ago, if you used a chat interface to an LLM, you put tokens in, you got tokens out. Now those tokens can be constructed in ways that start to look like being able to take action or understand a user or intent better.
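As a rough picture of what "tokens that look like taking action" can mean, the Python sketch below treats a model's text output as a possible JSON tool call and dispatches it to a local function. The JSON shape, the `TOOLS` registry, and `get_weather` are all assumptions invented for this example; real chat products and APIs each define their own tool-calling formats.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would call a weather or search API."""
    return f"(stub) forecast for {city}"


# Registry of tools the application is willing to run (assumption for the sketch).
TOOLS = {"get_weather": get_weather}


def run_model_output(model_output: str) -> str:
    """Run the named tool if the output is a JSON tool call, else return the text."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return model_output  # plain text answer, nothing to execute
    if not isinstance(call, dict):
        return model_output
    tool_name = call.get("tool")
    if not isinstance(tool_name, str) or tool_name not in TOOLS:
        return model_output  # unknown tool: fall back to showing the raw text
    return TOOLS[tool_name](**call.get("arguments", {}))


print(run_model_output('{"tool": "get_weather", "arguments": {"city": "New York City"}}'))
print(run_model_output("Hello! How can I help?"))
```

The application, not the model, decides which functions exist and whether to execute them, so the registry is also the natural place to put permission checks.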
So again, we're just talking consumer applications for a second. You know, if you type a question into ChatGPT, it might have access to a search API that it can use as a tool or a function and say, oh, this looks like this user is asking me about the weather in New York City today. I don't have that information in my knowledge base, so I need to go out and construct a query to find more information here. And so that's an example of using a tool. This is like level one autonomy, right? It's taking an idea or a thought or a request and turning it into some type of action. I think what's interesting is we're going to keep seeing more levels added onto here. So once you have the basic set of being able to call a tool, being able to take an action, what are the things you can layer on top of that? So that's the type of system I think is interesting to dive deeper on, where you could just take a simple action. Let's take one that probably everyone has tried at this point, and if you haven't, I highly recommend it. If you go into Bolt or Lovable, the sort of app generator, website generator, and type in a prompt and get a website on the other end, what's actually happening in between those steps is code generation, and an agent that's understanding what your intent is, maybe asking follow-up questions, and taking action by, you know, maybe calling sub-agents underneath it to do things like parallelize code generation for creating a website. So I think that's an interesting application, just code gen in general, that is really exploding in terms of usage and utility. And it's a good example of a generative AI system in production that's actually doing pretty well and is successful.

Seth (20:34) Yeah. I think one of the interesting things that I find with generative models, right, is that you're creating something and you don't always know, like with image generation, right?
You can try to get at how people prefer what those outputs are going to look like, but there's not necessarily... I guess there's a wrong answer, but there's not necessarily a right answer for image generation all the time. And that's where I think you start to get into some of the more interesting things about how you evaluate whether a system did it. How useful is the output? How truthful was it to what the input information was? Did it follow the instruction that you gave it? You know, things like that. But yeah, there are so many incredible applications. Well, actually, one thing. You mentioned the levels, which I'm familiar with, but can you break down the levels just so we're on the same page there?

Aman Khan (21:47) Mm-hmm. Yeah, absolutely. And for any listeners as well, if you Google levels of autonomy for self-driving, you'll see this stack, and I'll work backwards from there a little bit because I'm a little fuzzy on some of the exact levels. But yeah, level zero is no autonomy. It's a system where a human is in the car, driving the car.

Seth (22:06) That's okay, just the basics.

Aman Khan (22:17) Level five is a fully autonomous system. It's what people are pursuing and trying to get to, the highest level. Just imagine the highest level is fully autonomous. In between, there are layers to the level of autonomy. For instance, the one right before fully autonomous is being able to have a human nudge the system when it's stuck, basically, and have some level of human in the loop, human intervention. So this actually looks like where a lot of the quote-unquote fully self-driving car systems are today. They're actually not fully autonomous.
There are instances, you know, if you take a Waymo or another product, where the car will, at some level of frequency, get stuck and have to call out to remote assistance to unblock it. And that's an example of level four. Level three and two and some of the systems before that are really good, but not fully able to drive autonomously. It's closer to driver assist, closer to a Tesla, if you take Tesla Autopilot. I hear Tesla has gotten really, really good, but there's no remote-assistance human in the loop. You could argue the person in the car is the human in the loop, but the number of miles that you're driving where you still have to pay attention to the road is really important. And that has a lot to do with not just the capabilities of the models themselves, but also their ability to understand and perceive the world around them, which is a limitation, to some degree, of the sensors or the hardware. So it's kind of interesting, the problem is very analogous to agents today because

Seth (23:44) Yeah.

Aman Khan (24:07) you could argue that these agents are really capable. You know, the LLMs behind the agents are really capable, but it's not just the LLM or the model that's a limiting factor here. It's also the data that the model can actually use and act on top of, and the access to the control system or the tools that it can use. So it's kind of an interesting way to look at the same type of system: are we really trying to build fully L5, you know, knowledge worker agents, or is L4 actually a much more achievable goal, where you get pretty reasonably high economic impact and individual impact as well? So I think that's why that framing is very useful in this world.

Seth (24:51) Yeah, there's a similar type of breakdown for chatbots as well.
You know, I think about things like that and the different types of chatbots that there are, from what it used to be, where you really needed to create the entire decision tree and you were trying to detect the initial intent, and it was very limited in what it could do. And then more autonomy was given to the chatbot, where it could do a little bit more: it could detect the intent, and it could also look up things in the knowledge base. Then there's this ability to detect intent, look up things in the knowledge base, and then also access certain APIs. So you could actually start to affect and change things in the backend system. And you can just kind of see how you're on this spectrum from fully manual to this fully autonomous system. And I think that it's unclear quite where we want to be, and where the hype is for it and where the reality is for it. It's actually a question that I do like to ask, if you don't mind. Can you talk to that? What do you think about how the gap

Aman Khan (26:02) Yeah.

Seth (26:08) is handled between the hype of AI and the reality of AI?

Aman Khan (26:12) Yeah, it's a good question. I mean, immediately as we were talking about that, I feel like there's a really great graphic that kind of came to mind. It doesn't exist yet, but I think there's something here where it really depends on what vertical or problem space you're in. Because there are some areas where I think the hype is underrated. Like, I think that the potential

Seth (26:23) Wow.

Aman Khan (26:40) for AI is actually more than people give it credit for, right? And there's a resistance because it feels uncomfortable at times. But I do think that with tasks like software generation, we're only starting to see the beginning of the impacts of LLMs here.
And the reason for that is that so much of the generation being text and code actually gives you a very interesting flywheel for how good these things can get. If you're limited by data for training, you're definitely not going to be limited by, you know, software code, like the amount of code that you can generate synthetically. And to tie back to an earlier point you made, which is, how do you measure how good something is? There's a subjectivity component to it too, which is, is this a good website or not? I don't know. You know, we were talking about some platform before this, for, you know, is the layout good or not?

Seth (27:26) Right.

Aman Khan (27:36) But you can aggregate information like that. You can use other judges or LLMs to help you get more data on whether this is good code or bad code. And I think that is severely underrated. People are not yet seeing it, in my opinion. I think the impact on how we write code and write software is underrated. I think that, similarly, on the spectrum of tasks that

Seth (27:45) Right.

Aman Khan (28:05) people do, how people spend time, there's a lot of concern. You know, there's this joke, like, I thought AI was gonna come for the tasks I didn't want to do, not for my, you know, Miyazaki, Studio Ghibli type of artistic work, right? And people...

Seth (28:29) Right.

Aman Khan (28:33) I think really everyone saw that Ghibli moment. I think the rate of new subscribers to ChatGPT was literally insane, higher than their servers could even support, the fastest growth they'd ever seen. And it was because of an artistic task. What's interesting is it's questionable to me if that task actually... I think there's a lot of debate around

Seth (28:40) Wild.

Aman Khan (28:59) the IP and is this art?
And it's questionable to me. Is this actually art? Because what people were doing was almost memeifying it a little bit. They were taking a thought and transforming it into something that other people could relate to in a way. Or turning their family into a Ghibli thing. It was great. I think that's wonderful. But I think that's more like idea transformation. Style transformation, exactly.

Seth (29:03) Yeah. Style transfer. Yes. Which is something that AI is very, very good at. Actually, it's been good at it for a long time now.

Aman Khan (29:27) It's style transformation. For a long time, yeah, exactly. It's take this data, map it to a different space, right? That sort of task. So the question I kind of ask on top of that is, is that what art is? And to some degree, parts of it are, but I think that novel artistic ideas and thoughts usually come from multiple transformations there.

Seth (29:40) Yep.

Aman Khan (30:01) And if you really look at the data, it's some novel new idea. It's like a new data point in this embedding space, really, is kind of how I view it, right? Like Basquiat or some of these other artists, because they were so novel in how they approached a form or, you know, a canvas for whatever they wanted to create, that's what was compelling about that idea. And so I think that's an area where

Seth (30:10) Yeah.

Aman Khan (30:29) we have yet to see AI actually do something novel, versus just transform what people created that was novel into something that's a bit more memeified and mass produced.

Seth (30:33) Hmm. Right. Yeah. So art is in the eye of the beholder, right? But art is such an interesting form because it can be many things. It's just creative expression.
And the interesting thing is, you know, copying and pasting someone's prompt with your picture

Aman Khan (30:48) Yeah, yeah. Right.

Seth (31:09) and taking the style that somebody dedicated their life to creating. Yeah, I can understand all of the controversy that's happening with it. What I do think will happen is that it gives us new tools, and there will be amazing art that comes from it, because maybe the initial steps that might've taken a long time will change, and there'll be additional components added that will make it

Aman Khan (31:21) Right.

Seth (31:39) where you're looking at it and it's not, this is taking this picture and transferring it to this, it's, this is making me feel a certain way. For me, with art and music, I care about who made it and the emotion that they make you feel with it. Have you played with Suno or any of the...

Aman Khan (32:01) I have, yeah. Again, really amazing, right?

Seth (32:03) Unbelievable. I mean, it's wild, but then there's a question of, would you go to a concert where you just had someone press play on the thing and then you just sat there? For me, it's, why do I want to go see, you know, my favorite band? I want to see them. I want to see the emotion, right? I want to see the human element of it.

Aman Khan (32:11) Yeah. Well, it's interesting. I mean, on that thought, you could argue that with electronic music, sometimes you just show up and someone hits play, but are you going to the concert to see someone hit play, or is it the energy around it, the other people that you're with, right? And that in itself is some form of art.

Seth (32:44) Yeah, I think digital music in general, I mean, in some sense, AI is an accelerator, obviously. It's also able to generate things that weren't possible before, but it's also the continuation of the digital transformation, right?
So understanding that music is now digital, mean, No, no one could even imagine Chainsmokers or Gryffin or artists like that back in the 60s because you didn't have the tools and you didn't have the components that you could create something like that. But that doesn't mean that there's not talent there, right? Yeah. Aman Khan (33:25) Right. Yeah. Yeah. To pull on this thread a little bit more, I'll borrow a thought from a friend of mine, Linus Lee, ⁓ who has great writings, I think, on this subject and explores these ideas, ⁓ you know, really in a really interesting way. He has this sort of idea that's really stuck with me, which is like, you actually use the right tool, which is like the right word, which is like AI is a tool. in sort of the same way as a paintbrush is a tool, right? Or a pencil or a pen. And what was interesting with like artists who picked up those tools was what was the canvas that they decided to apply those tools towards. And I think if AI is your tool, the question to ask is what's the canvas? And right now we're talking about the digital landscape and the digital canvas, but what are other forms of art or expression or ideas where the canvas might extend beyond that. And in the same way that electronic music, the canvas here is not just the waveforms that you're creating with the tools, it's actually the concert experience. And that's sort of one way to reframe what's the role of AI in art if AI is the tool. Seth (34:43) Yeah, I wasn't expecting to go into AI art and AI music, but it's all within the same thing. And I guess the interesting stuff about the art and the music is that, yeah, it's very difficult to evaluate. Like, did this create the right song? Did it create the right music? Because I don't know what right necessarily means when you're generating those things. But there are systems where you can understand. Aman Khan (34:45) Yeah. Seth (35:11) you know, if it's at least on the right track. 
So going back, let's take code generation. What tools have you found to be helpful for code generation? Aman Khan (35:14) Right. Yeah. So I think if you're starting from scratch, it depends on the type of project you're trying to build. We can show this graphic as well: I have this sort of landscape diagram that depends on the task you want. Is it backend or frontend? Are you trying to build something fully with an agent, or is it an editor that can help you write code? And now I think most of these platforms are trending towards being more agent-based as well. But the short answer is there are a number of tools for building functional prototypes really quickly. I view these as Vercel's v0, Bolt, and Lovable. These tools use well-known paradigms for writing full-stack frontend code: React or Next.js with Tailwind components, which are kind of the standard developer stack now for web development. And what's interesting is the way these apps are structured. To some degree, the application layer, the UI that you're served, and the server layer are sort of intermingled. You could go in and parse through and read that code and try to edit it. But because these systems, if you're going zero to one, are designing the Seth (36:25) Right. Aman Khan (36:51) actual directory structure and the references between files, it's actually somewhat easier to just use these for your zero-to-one projects, where your interface is prompting and you're probably not gonna go in and read too much of the code itself. It's probably easier to just make changes by prompting. I also really like Replit for Python-based projects because you get this sort of dedicated
backend. Just imagine server-based things like long-running jobs, maybe a Twitter bot, something like that. Replit is really great at helping you build those. And then for the catch-all, where you're actually trying to manipulate files and work within and manage an existing directory structure, I would use Cursor, or Windsurf, which I think is getting a lot of attention too. And then if you're more of an open source guy, like I think both of us are, Cline is a really cool open source alternative to both of those. Seth (37:54) Nice, yeah. So I've played around, dabbled with some of those, but the one that I've really dived into is Cursor. At this point, I mean, you have to know its pros and cons, and there'll be those occasions where it gets stuck going in circles, and you're like, come on, let's get through this together, we can do this. Aman Khan (38:15) Exactly. Yeah. Yeah. Seth (38:21) I've been getting some incredible results. I mean, once again, going back to AI as an accelerator, Cursor is an accelerator for me, where I can take an initiative that maybe would have taken me or somebody else on my team, say, a sprint, and I'm talking about getting it done in a day. Aman Khan (38:41) Right. Yeah, exactly. I think the reason I like to bring up code generation for prototypes is: just imagine what the collaboration between PMs and engineers will look like once more people start figuring this stuff out, right? Whenever I'm in front of an audience, I like to ask for a quick show of hands: how many people have heard of these tools that I just mentioned?
Seth (39:04) Thank Aman Khan (39:10) Almost everyone raises their hands. How many people have actually tried to use them? And a lot of times right now, half the audience puts their hands down. And I think that that's really the gap that is an opportunity, which is the more that you have people actually trying to use these tools and integrating them into your workflow day to day, the more leverage you have when it comes to working with your team. And so Seth, if you and I were working together, if I wanted a prototype, Seth (39:19) Right. Aman Khan (39:37) I'd have to write it out in a requirement style a couple of years ago and write it right out a requirement. Maybe I'd draw you like a whiteboard diagram. And I come to you and say, Hey, like, do you mind doing this for me? Like, I really just need to show this to a client, you know, that they've been asking me for this for like weeks and I have a meeting with them. And it's like, you know, next week and you're going to, you're going to tell me, dude, I've got a million other things I need to do. And how important is it? Do you really need it? Okay, fine. Aman I'll get you this one time and you're going to have to end up working later on nights. You might have to pull a weekend. You pull together a prototype. Great. I go show it to the client. ⁓ they actually want these five changes. Do you mind making those changes? Well, guess what? Imagine if the PM didn't have to do that anymore. Imagine if they could just go to these tools, get that prototype, get the feedback and do this loop without burning their engineering counterparts and, and, ⁓ you know, the teams that they work with, because they just have way more leverage now. Seth (40:28) Right. Aman Khan (40:33) to actually just be that person, that first line of defense. And then once you start getting feedback, once the process and the idea refines itself, I could come to you with something and say, hey Seth, here's five customers that want this thing. 
Here's what it looks like. Here's what it feels like. Here's what I want at the end of the day. Go build this thing so it's more scalable and we'll work together on it. And I can actually read the code. can interpret it. I can help you make decisions with it. That to me seems like a much better development process than the world we live in. you know, even just a year ago. So that's really, I think the most immediate opportunity for people that are just getting started with CodeGen. Don't think of it as like, I'm gonna, you know, if you want to, you could go build that next viral app. ⁓ Go for it. There's a ton of videos and TikToks on how to do that. But if you just want to see, okay, well, how do I apply this in my day job? Try to ask these systems to do the thing you were gonna go ask your engineer to do first and see what happens. See where that gets you first. Seth (41:15) Thank you. Yeah, this is the time, right? Like if you're interested in creating, you're interested in designing and maybe the blocker was that you didn't have the ability. I think that there's like, it helps. I don't want to say fill in the gaps because it doesn't totally fill in the gaps, but it can fill in the gaps in a sense to get you that prototype. And then it gets you like, just like what you're saying, you can get it in front of the user. Aman Khan (41:52) Right. Seth (41:58) And that's when you can get that feedback and that's when it really starts going. This idea of the PM has an idea. We got to it slated into the next sprint. It's not going to happen this one, maybe next time. And now you're talking about like three weeks before getting in front of a client, just an initial prototype to get the idea out there. Now you can, you can do it in many different ways. Aman Khan (42:10) right? Yeah. Seth (42:26) at least getting the initial thing. And sometimes it'll go further than you actually need it to go, right? Sometimes you just need to see what it's gonna look like. 
Now, with Cursor, it makes suggestions for what the next steps will be. When you're having a back and forth with Claude or with whatever OpenAI model and you're working in Canvas, it'll just keep going. It'll think of some ideas or it'll develop something. Sometimes it'll be reading your mind. Sometimes it'll be like, hmm. Aman Khan (42:31) Right, exactly. Seth (42:56) I didn't even think of that. That's a good idea. Maybe we should do that. So yeah, this idea of cutting the cycle down, or increasing the number of iteration cycles, is something I think everyone in product and machine learning knows. There's no such thing anymore as just releasing something. There's no such thing as perfect, ever. Aman Khan (43:07) Yeah. Right. Right. Seth (43:25) And there's no such thing as spending all of this time in a silo creating something; you have to be getting that feedback. So it's nice that these tools allow you to get that feedback. Yeah. Aman Khan (43:34) And I'll add one more for your audience to that. Data scientists will be familiar with what we call bootstrapping, right? Like bootstrapping your data set. And that's a challenge in DS and ML today as well, which is: I don't really have data to iterate on. Well, guess what? These foundation models are trained on synthetic data. You can use synthetic data for your applications too. Don't be scared to use it. Again, it's not going to get you a hundred percent perfect real-world simulation of Seth (43:42) Right. Aman Khan (44:05) what your product will look like when you actually launch, but you can bootstrap data sets with synthetic data. And that just helps you cut down on your own cycle time, the development time of getting something out, right? Just to put water through the pipes as well. Seth (44:20) Right.
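A minimal sketch of the synthetic-data bootstrapping Aman describes, under loud assumptions: the personas, intents, and templates are hypothetical, the product is an imagined travel-booking assistant, and simple templates stand in for the LLM that would likely generate more varied inputs in practice. It only illustrates getting a starter data set to push through a pipeline before real user data exists.

```python
import itertools
import random

# Hypothetical personas and intents for an imagined travel-booking assistant.
PERSONAS = ["first-time user", "power user", "frustrated customer"]
INTENTS = ["book a flight", "change a reservation", "ask for a refund"]
TEMPLATES = [
    "As a {persona}, I want to {intent}.",
    "I'm a {persona}. Can you help me {intent}?",
]


def bootstrap_dataset(n, seed=0):
    """Return up to n synthetic test inputs drawn from persona x intent x template."""
    rng = random.Random(seed)  # seeded so the data set is reproducible
    combos = list(itertools.product(PERSONAS, INTENTS, TEMPLATES))
    chosen = rng.sample(combos, k=min(n, len(combos)))
    return [
        {
            "persona": persona,
            "intent": intent,
            "input": template.format(persona=persona, intent=intent),
            "source": "synthetic",  # tag rows so they are never mistaken for real traffic
        }
        for persona, intent, template in chosen
    ]


if __name__ == "__main__":
    for row in bootstrap_dataset(5):
        print(row["input"])
```

Tagging every row as synthetic keeps these examples separable from production data once real interactions start coming in.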
So now getting into this sort of cycle, and understanding that you're going to be developing this initial system and you hope that it's going to improve over time. Something I'm curious about from your perspective is balancing the technical side, the metrics, with the practical business outcomes when you're evaluating systems and products. I'm curious what your take is on that. You know what I'm referring to: metrics for a model as opposed to, will this solve my problem? Yeah. Aman Khan (45:03) Yeah, so it's like accuracy versus revenue. What should you be going for, right? At the end of the day, yeah. Seth (45:08) Yeah, I guess when it comes down to it, you want to be creating something that's going to affect the business, right? And it's hard, I think, from a machine learning engineer or machine learning scientist standpoint, thinking I have to get this model to be, whatever it means, as high performing as it can be. But how does that translate into actual business outcomes, if it does? Just curious about your take on it. Yeah. Aman Khan (45:33) Yeah. Yeah, I think, look, I mean, for a lot of application developers there's a term I love, which is just vibe checks, right? Everyone's using vibe checks right now to understand, okay, is... yeah, Vibe Ops. Yeah. Yeah. Seth (45:49) Vibe Ops, right? Vibe Operations? Have you not heard of that one? No, it's a joke. It was the guy from MLOps, Demetrios, who always jokes about Vibe Ops. I think it's so funny. But yes, continue. Aman Khan (45:59) Yeah.
So I think, think like an iteration on that is like, how do you, I don't know, I'll throw up my cheesy version of that, which is like, instead of vibes, what if you could be thrive coding instead? What if you actually had data to make decisions? Right. And vibe checks to me are like, you dump outputs of your LLM into a CSV and you kind of just go in and you're like, good, bad, good, bad, you know, but what if you actually had a system that did that for you so that you could spend more time. Seth (46:13) Hmm. Aman Khan (46:32) focused on building an application and using that data to create a better end user experience. And I think what's really powerful about what this technology is good at and what you're kind of hinting at, which is like business outcomes, you can actually proxy business outcomes better than you could with this technology, better than you could even just a couple of years ago. Because... You can proxy what the end user would do with the platform or what the end user's interaction is. We talked about bootstrapping from a development standpoint. You can bootstrap your production interactions as well. Simulate different personas interacting with your product and then vibe check those or eval those. Right. And so all of a sudden your feedback cycle to get something to market and predict success actually goes up. Like think of that as your confidence bounds getting smaller versus just a few years ago. Seth (47:12) Right. Aman Khan (47:25) And I think that's the, that's the opportunity is like, can actually get much higher signal. Your signal, signal to noise ratio goes up. And I kind of view this as like where, like maybe to make it concrete. When we talk about what evals are, these are using LLMs as a judge or LLMs in your development pipeline to actually evaluate components. Say similar to the self-driving car analogy we talked about right at the start, which is. 
You want to check each part of your system and make sure it's doing the job you expect it to do when you give it an example to go through each time. And so if you've bootstrapped your data set, you have a starting data set, you want to flow, know, kind of push that through whatever agent you're building, which has multiple components. It could have a router, which decides what tools to use. could have the tool call itself, which is the inputs and the outputs that are created. And then maybe another step downstream from that. Maybe it's another agent. How do you know that when your metrics are off? And you're, you're using an end, let's say the end metric here is like correctness or it's goodness. And that could even be just a human label. That could be your PM or your, your data scientists saying like, this was a good answer. This was a bad answer. That's, that's kind of hard to measure. It's like, again, going back to the self-driving analogy, like what we used to do and look at, you know, a car making a left turn and be like, is that a good left turn or a bad left turn? I don't know. Like, how do you know what's good and bad? But if you break that system up more and more into things like trajectory. Seth (48:34) Right. Right. Aman Khan (48:51) into comfort, speed, other metrics for self-driving, what would that look like for your system or your agent? And how do you break that system up further so that you can measure individual components much more granularly and use LLMs as part of that development lifecycle? So that's kind of how I view it is like, you know, take a look at your whole lifecycle, take a look at your whole development pipeline, understand what's important, what's the outcome that you want. Seth (49:09) Right. Aman Khan (49:17) Get some labels on that, actually write what the LLM should be doing or should do. What would the ideal, what does success look like? What does the success criteria look like for you? 
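A rough sketch of the component-level breakdown described here, assuming a hypothetical agent whose runs are logged as traces with a router decision, a tool call, and a human label on the final answer. The field names, tool names, and checks are all invented for illustration; in a real pipeline some of these checks might be LLM judges rather than exact comparisons.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """One recorded run of a hypothetical agent."""
    question: str
    routed_tool: str      # tool the router picked
    expected_tool: str    # tool a human labeler says it should have picked
    tool_args: dict       # arguments the agent passed to the tool
    required_args: tuple  # argument names the tool needs
    answer_label: str     # human label on the final answer: "good" or "bad"


def router_correct(t):
    return t.routed_tool == t.expected_tool


def tool_call_valid(t):
    return all(name in t.tool_args for name in t.required_args)


def answer_good(t):
    return t.answer_label == "good"


# One check per component, so a drop in the end metric can be traced to a stage.
CHECKS = {"router": router_correct, "tool_call": tool_call_valid, "answer": answer_good}


def component_scores(traces):
    """Return the pass rate of each component check over the traces."""
    if not traces:
        raise ValueError("need at least one trace")
    return {name: sum(check(t) for t in traces) / len(traces)
            for name, check in CHECKS.items()}


# Hypothetical example traces.
SAMPLE_TRACES = [
    Trace("Flights SFO to JFK?", "flight_search", "flight_search",
          {"origin": "SFO", "dest": "JFK"}, ("origin", "dest"), "good"),
    Trace("Flights to Boston?", "weather", "flight_search",
          {"city": "Boston"}, ("city",), "bad"),
    Trace("Flights from SFO?", "flight_search", "flight_search",
          {"origin": "SFO"}, ("origin", "dest"), "bad"),
]

if __name__ == "__main__":
    for name, score in component_scores(SAMPLE_TRACES).items():
        print(f"{name}: {score:.2f}")
```

In the sample, the two bad answers have different causes (one wrong route, one incomplete tool call), which is the kind of detail a single end-to-end "good/bad" label hides.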
And then work backwards to actually break down the system further and build judges for each of those components. And that's really the work that goes into having a production system isn't just a cool prototype, but something you can actually deploy and continuously improve on. Seth (49:44) Right. And I think that that's the key is that it's taking it, being able to deploy it, being able to evaluate it and then actually improve upon it. So just to say one thing back to you, if this is right, so it's understanding the end to end and what you want to get out of your system, then being able to break it down into components and then understand sort of the dimensions that you want to be evaluating it on. and then you try your best implement that. Have you found any trouble or any advice where you actually, let's say you actually are able to evaluate, but then there's trouble actually improving the system? Have you seen instances of that? Aman Khan (50:32) Yeah, I think that that's an interesting question because it kind of comes back down to what is a system comprised of, what are the parts of it? And then when you eventually, depending on the task you have, what are you limited by? Are you limited by data? Are you limited by the model's capabilities and functionality and the ability for it to understand the task you're giving it? And are you limited by the tools or the functions that you have at your disposal? What's interesting about this problem is that each of them do have some solve to them. ⁓ And what I mean by that is, let's take the model layer specifically. You might think, dang, GPT 4.5, it's so good, but it can't solve this problem for me, like book a flight or whatever. It's not able to do this complex task. It doesn't know how to structure. keeps messing things up. Can you break that system down further? 
And so there are ways that you can decompose the system, especially for software-based products. The question really becomes, how do you bound this problem? Are you working on the right thing for AI to actually be useful here? And if there is a way for you to model the behavior that you want, then I think there are similarly ways that you can break down the system further to get you there. The question is just how much you're willing to do that. How much time do you want to spend breaking down the problem further and further? So some of this comes down to problem selection in the first place, and the only solve I really have for you there is intuition: using tools, seeing what their limits are, trying to push those limits, trying to build yourself, and developing that intuition yourself as well. And realizing that that intuition is going to change as more capable models launch. The o3 reasoning models now are like a step function improvement over previous models for reasoning. So I think it's knowing that that intuition is likely to change and shift as time goes on. And then when you get stuck, can you break the problem down further? Do you have the ability to keep decomposing the problem? Which is very similar in machine learning as well. With this architecture, how do I break it down a little bit further into maybe a smaller model, a smaller classifier, something like this? Seth (52:51) Yeah, I mean, there's no answer, but I'm just thinking that sometimes the challenge is, first off, you need to have a measure, right? You need to understand how well it's performing. But then it's, well, how much better can you get it given the constraints that you have on this right now?
Like I'm thinking maybe it's a silly example, but like self-driving cars Maybe if you don't add some additional sensor you'll never get it to the point where it needs to be, right? ⁓ And you can take that and you can use that as a metaphor for like anything, maybe just the data or the inputs or however you're thinking about extracting information, maybe that could be the limiting factor. And maybe your system is performing to the point where it's very close to its ceiling. I guess that's just always Aman Khan (53:26) Right. Seth (53:51) depending on the problem space, always something that's hard to necessarily put your finger on. Yeah. Aman Khan (53:56) I'll add one more there, which is how do know the data that you have is enough to make a decision or do you have enough to know what to improve on? And this is real problem in self-driving. It's a real problem in agents today, which is how do you get that data and enough real world data? And what's interesting is the way that in the self-driving world you have simulation replays. So you can have things like 3D models of the world. You can use real sensor data. Seth (54:06) Right. Aman Khan (54:25) from the road and replay simulations on the models to actually see what would the model do if you made changes to it. But ultimately, you don't really know what's going to happen until you deploy the model on the road and start getting statistical real-world data. And there's models that do the analysis of the correlation between offline, online sort of measurement. But I think a similar problem exists today with web development and with agents, which is there's a lack of these simulation environments. And so sometimes you just don't know what's going to happen until you launch and what, how, how much data do you need from launching to actually go back and know what to improve on? So it's similar to question to you is like, how do I know what the upper bound is? Or if I'm, if it's going to be good enough? 
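One hedged way to put numbers on "do I have enough data to decide": treat each evaluated example as pass or fail and compute a confidence interval on the pass rate. The sketch below uses the Wilson score interval, which is a choice made here for illustration and not something named in the conversation; the 80% pass rate and sample sizes are made up.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Same observed pass rate (80%), increasing amounts of data.
    for n in (10, 100, 1000):
        low, high = wilson_interval(int(0.8 * n), n)
        print(f"n={n}: pass rate between {low:.3f} and {high:.3f}")
```

The interval narrows as the number of evaluated examples grows, which is one way to read "confidence bounds getting smaller" as more data comes in.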
Well, once you get to whatever that ceiling is, how do you take the data that you have and know what to improve on? And so I think both of those are really deep, interesting problems that will determine the success of anyone's ability to go and ship production AI systems. Seth (55:25) Yeah. And I think the interesting thing that's going on with agents, I mean, there are many interesting things going on. LLMs aren't the best at saying, I don't know. You know, you have to, yeah. Aman Khan (55:36) They'll double down. They'll be like, I'm right, you're wrong. And you're like, okay. Seth (55:42) Yeah, like, this is getting awkward now. You'll have an expert in a field, and the LLM will be telling you something, and it won't be based on the most recent data or something like that. But being able to create systems, going back to what we were saying, it's almost like fault detection: being able to detect, do I have enough information to answer this question? Aman Khan (55:45) Yeah. Mm-hmm. Seth (56:07) And I think that's going to remain a problem for a while. There was actually a recent thing that came out about these new reasoning models. There's no eliminating hallucinations, right? There's no eliminating it. And in some cases, by giving them this ability to think things through, they then take their reasoning and follow it down that rabbit hole, and then they substantiate things that might not necessarily be true. I think there are maybe some other mechanisms. I think it's gonna end up being, at least in the short term, some combination of taking some more traditional machine learning things, trying to bring a little bit more Aman Khan (56:42) Right?
Seth (56:56) determinism into that, right? First determine, do I have enough information to answer this, and then move to the generative part. Maybe that's where things could go. Aman Khan (57:06) Yeah. And just to add onto that, reducing hallucinations is a challenge on its own, but there's also hallucination detection, right? Even if there are challenges around reducing the rate of hallucinations in chain-of-thought reasoning and reasoning tokens, can you detect hallucinations there and use those to course correct, either in real time or retroactively? That's actually one of the main reasons people use Phoenix, our open source package: the easy hallucination evaluations we have baked in. They take the output of the model, the input question, and the context that's used by the model to answer that question, and ask: is this context actually relevant to the question that the user asked in the first place? Basically, is it the right piece of context? And did the LLM use the context correctly? So what you have now is sort of a checking system built in for hallucination. Even if you were trying to reduce the rate of hallucinations, the first step towards that is to just detect when it's happening, right? So that's an example of why it's so important to have evals baked into your development process from the get-go. Seth (58:22) Yeah, I didn't get to just about any of my questions, but I want to maybe ask one that probably wasn't there. You're obviously working on very interesting things. Let's say that you had no bounds and you could work on something totally different. What's a free idea? Aman Khan (58:26) Hahaha Seth (58:47) Give out a free idea for someone who's looking for something in this entrepreneurship and machine learning world, if you have any. Aman Khan (58:51) Mmm.
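On the hallucination check Aman describes above (question, retrieved context, and model answer handed to a judge), here is a generic sketch of the "did the LLM use the context correctly" half. It is not Phoenix's API: the prompt wording, `detect_hallucination`, and `toy_judge` are all hypothetical, and the toy judge is a word-overlap stand-in so the example runs offline. In practice the `judge` callable would wrap a real LLM call.

```python
from typing import Callable

# Hypothetical judge prompt; real evaluators use more carefully tuned templates.
JUDGE_PROMPT = """You are checking an answer for hallucinations.
Question: {question}
Context: {context}
Answer: {answer}
Reply with exactly one word: "factual" if the answer is supported by the context,
or "hallucinated" if it is not."""


def detect_hallucination(question: str, context: str, answer: str,
                         judge: Callable[[str], str]) -> bool:
    """Return True if the judge says the answer is not supported by the context."""
    prompt = JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    verdict = judge(prompt).strip().lower()
    if verdict not in {"factual", "hallucinated"}:
        raise ValueError(f"unexpected judge output: {verdict!r}")
    return verdict == "hallucinated"


def toy_judge(prompt: str) -> str:
    """Offline stand-in for an LLM: 'factual' only if every answer word is in the context."""
    fields = {}
    for line in prompt.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields[key] = value
    context_words = set(fields["Context"].lower().split())
    answer_words = set(fields["Answer"].lower().split())
    return "factual" if answer_words <= context_words else "hallucinated"


if __name__ == "__main__":
    context = "the flight departs at 9am from sfo"
    for answer in ("the flight departs at 9am", "the flight departs at noon"):
        flagged = detect_hallucination("when does the flight depart", context, answer, toy_judge)
        print(answer, "->", "hallucinated" if flagged else "factual")
```

Keeping the judge as a plain callable means the same check can run against a stub in tests and against a real model in a pipeline, and rejecting unexpected verdicts avoids silently counting a malformed judge reply as a pass.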
Yeah, for sure. So I think there's a big area of opportunity. Obviously the digital world is really interesting from the standpoint of seeing what people are building, creating. I think that there's a pretty interesting leap still yet to be made towards the physical world. And there's so many ways to build AI systems into the physical world that are sold so early. Seth (59:13) Yeah. Aman Khan (59:22) So anything from perception-based products to even robotics, like you can get these kits now that can do things like actuation. But the major challenge in robotics and the real world perception systems was actually understanding. And there was two major problems. One was actually actuation itself and manipulation, which there's tools and people building towards better manipulation of picking up objects, putting them down. And how do you do that in a way to ⁓ train a model, know, train an AI to understand, okay, pick this up, put it down. And in a similar vein, think that, you know, my urge to people is to like find those things that annoy you in your daily life and try to use AI towards that in some way, and try to bridge the gap to like the physical world, the real world more. ⁓ That's what I've been trying to do more and more. mean, I think for myself, like, you know, I'll offload things like planning for You know, let's say like diet planning, nutrition, ⁓ things like that, where there's, there's data encoded here. You can put your own data about how you're feeling, logging that, and then using that to make decisions or just having a coach to help you in the real Seth (1:00:35) Cool. ⁓ Yeah, I think that we could really talk a lot more about a lot of things, but I'm gonna just say, I'm gonna go towards the end. The big question, ⁓ what has a career in machine learning taught you about life? Aman Khan (1:00:51) I think when I first started on this journey, I viewed life in a somewhat like an engineer, in a little bit more of a deterministic way. 
You do this, that happens. There's cause and effect. And you're very much trained around Newton's laws and the laws of physics and things like that. I think what machine learning really opened me up to is a world where there are more statistical distributions and probabilities of things happening, and how you fit in on this sort of spectrum. And just to tie it back to the first engineering class I had: in the very first lecture, and this really stuck with me, the professor went up to the board and drew a standard distribution, like, you know, a Student's distribution. And he said, this is basically life, and what matters is where you are on this distribution. He just kind of put that up there, and I left that lecture thinking, man, what did I sign up for? How is this a life lesson? And then when I started spending more time in data science and machine learning, I realized I keep coming back to that, because not every distribution is a standard distribution. But I do think that when you look at data at scale, when you understand what that distribution is better for yourself and where you want to be on it, that's really what is exciting about life. And yeah, maybe that's a bit more of a philosophical answer than what you were looking for, but I do think it reframed, yeah. I think for me, it really reframed things I had maybe assumed to be a certain way. It added another layer of what's deterministic and what's not in the decisions that you make. And in that same vein, what's reversible and what's not as well. Seth (1:02:24) No, I'm looking for philosophical. It's funny, I thought you were gonna say he put up a standard deviation, a normal curve, on the board, and I thought he was gonna say, this doesn't exist.
Like this is never the case. ⁓ Aman Khan (1:02:54) Yeah, yeah, yeah, right. would think, no, but like his specific line was like his advice, which I'm still trying to parse through if I believe or not, was like, you want to be on either ends of the distribution. And I'm like, I'm pretty sure you always want to be along the distribute, like along the X axis, but maybe not, maybe not. Right. And like, but, but I thought that was interesting. And the reason for that was like, you get way more signal when you're on the ends of the distribution. ⁓ Seth (1:03:14) Yeah. Aman Khan (1:03:22) So I thought that was an interesting like thought. ⁓ I'll leave your listeners with that thought. Which where in the distribution do you want to be is maybe more of a question, you know. Seth (1:03:27) Yeah. That's great. always a great Twain quote, but something like, when I find myself on the side of the majority. That's when I realize I have to rethink what I'm doing. I totally botched that one. But I'll put the right one somewhere. All right, last one. How can listeners learn more about you and what you're Aman Khan (1:03:48) Yeah. So I mean, most of what I work on is pretty out there in public. I'll post a fair amount on X and LinkedIn, and then I have a substack as well where I'll dive in a little bit deeper on some thoughts that are interesting to me. So please feel free to follow along there and feel free to reach out. My DMs are open if I can be helpful in any way as well. Seth (1:04:11) Very cool, yeah, everybody should be checking out what Phoenix is up to and definitely check out the most recent article and course on AI evaluation. They're both great. I'm about halfway through the course now. what a pleasure. I feel like we've got a lot more to chat about, but that's all the time we have for today. I really appreciate it. Aman Khan (1:04:33) Thank you so much, Seth. Thanks for having me on. And yeah, can't wait for the next one.